Re: SparkR read.df failed to read file from local directory
Thanks for the comment Felix, I tried giving "/home/myuser/test_data/sparkR/flights.csv", but it tried to search the path in HDFS, and gave errors:

15/12/08 12:47:10 ERROR r.RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://hostname:8020/home/myuser/test_data/sparkR/flights.csv
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
        at org.apache.spark.rdd.RDD$$

Thanks,
Boyu

On Tue, Dec 8, 2015 at 12:38 PM, Felix Cheung wrote:
> Have you tried
>
> flightsDF <- read.df(sqlContext,
>                      "/home/myuser/test_data/sparkR/flights.csv",
>                      source = "com.databricks.spark.csv", header = "true")
>
> _
> From: Boyu Zhang
> Sent: Tuesday, December 8, 2015 8:47 AM
> Subject: SparkR read.df failed to read file from local directory
> To:
>
> Hello everyone,
>
> I tried to run the example data-manipulation.R, and can't get it to read
> the flights.csv file that is stored in my local fs. I don't want to store
> big files in my HDFS, so reading from a local fs (Lustre fs) is the desired
> behavior for me.
>
> I tried the following:
>
> flightsDF <- read.df(sqlContext,
>                      "file:///home/myuser/test_data/sparkR/flights.csv",
>                      source = "com.databricks.spark.csv", header = "true")
>
> I got the message and eventually failed:
>
> 15/12/08 11:42:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on hathi-a003.rcac.purdue.edu:33894 (size: 14.4 KB, free: 530.2 MB)
> 15/12/08 11:42:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 9, hathi-a003.rcac.purdue.edu): java.io.FileNotFoundException: File file:/home/myuser/test_data/sparkR/flights.csv does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
>         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
>         at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
>         at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:106)
>         at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
>         at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>         at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>         at org.apache.spark.scheduler.Task.run(Task.scala:88)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> Can someone please provide comments? Any tips are appreciated, thank you!
>
> Boyu Zhang
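[Editor's note: a short SparkR sketch of the path-resolution behavior this message runs into. The paths and calls are the ones from the thread; it assumes a Spark 1.5-era SparkR session with the spark-csv package on the classpath, running against a YARN/HDFS cluster.]

```r
# A bare path is resolved against the cluster's default filesystem.
# On an HDFS deployment that is hdfs://..., which is why the bare-path
# attempt above fails with "Input path does not exist: hdfs://...".
flightsDF <- read.df(sqlContext, "/home/myuser/test_data/sparkR/flights.csv",
                     source = "com.databricks.spark.csv", header = "true")

# An explicit file:// URI forces the local filesystem, but the splits are
# read by the executors, not the driver: the file must exist at this exact
# path on every worker node (a shared mount such as Lustre satisfies this;
# a file present only on the driver host does not).
flightsDF <- read.df(sqlContext, "file:///home/myuser/test_data/sparkR/flights.csv",
                     source = "com.databricks.spark.csv", header = "true")
```

Under those assumptions, the FileNotFoundException in the second trace is what one would expect when the file exists on the driver host but not on the worker that ran the task.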
Re: SparkR read.df failed to read file from local directory
Have you tried

flightsDF <- read.df(sqlContext,
                     "/home/myuser/test_data/sparkR/flights.csv",
                     source = "com.databricks.spark.csv", header = "true")

_
From: Boyu Zhang
Sent: Tuesday, December 8, 2015 8:47 AM
Subject: SparkR read.df failed to read file from local directory
To:
RE: SparkR read.df failed to read file from local directory
Hi, Boyu,

Does the local file “/home/myuser/test_data/sparkR/flights.csv” really exist? I just tried, and had no problem creating a DataFrame from a local CSV file.

From: Boyu Zhang [mailto:boyuzhan...@gmail.com]
Sent: Wednesday, December 9, 2015 1:49 AM
To: Felix Cheung
Cc: user@spark.apache.org
Subject: Re: SparkR read.df failed to read file from local directory
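[Editor's note: the existence check suggested above can be run directly from the SparkR shell. A minimal sketch using the path from the thread; note it runs on the driver node only, so it cannot rule out a path that is missing on the worker nodes.]

```r
# Runs on the driver node only: TRUE here does not guarantee that
# executors on other hosts can see the same path.
file.exists("/home/myuser/test_data/sparkR/flights.csv")

# Resolve symlinks and relative components to confirm the exact path
# that will be handed to read.df().
normalizePath("/home/myuser/test_data/sparkR/flights.csv", mustWork = FALSE)
```

If the file exists on the driver but only there, the FileNotFoundException in the quoted trace is still expected on a multi-node cluster; a shared mount (such as the Lustre filesystem mentioned earlier in the thread) or copying the file to the same path on every worker avoids that.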