Have you tried the path without the "file://" prefix?

    flightsDF <- read.df(sqlContext, "/home/myuser/test_data/sparkR/flights.csv",
                         source = "com.databricks.spark.csv", header = "true")



    _____________________________
From: Boyu Zhang <boyuzhan...@gmail.com>
Sent: Tuesday, December 8, 2015 8:47 AM
Subject: SparkR read.df failed to read file from local directory
To:  <user@spark.apache.org>


Hello everyone,

I tried to run the example data-manipulation.R and can't get it to read the flights.csv file that is stored on my local fs. I don't want to store big files in my hdfs, so reading from a local fs (lustre fs) is the desired behavior for me.

I tried the following:

    flightsDF <- read.df(sqlContext, "file:///home/myuser/test_data/sparkR/flights.csv",
                         source = "com.databricks.spark.csv", header = "true")

I got the following messages and the job eventually failed:
    15/12/08 11:42:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on hathi-a003.rcac.purdue.edu:33894 (size: 14.4 KB, free: 530.2 MB)
    15/12/08 11:42:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 9, hathi-a003.rcac.purdue.edu): java.io.FileNotFoundException: File file:/home/myuser/test_data/sparkR/flights.csv does not exist
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
        at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:106)
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
        at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Can someone please provide comments? Any tips are appreciated, thank you!

Boyu Zhang


  
