Re: SparkR read.df failed to read file from local directory

2015-12-08 Thread Boyu Zhang
Thanks for the comment, Felix. I tried giving
"/home/myuser/test_data/sparkR/flights.csv", but Spark searched for the
path in HDFS and gave these errors:

15/12/08 12:47:10 ERROR r.RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://hostname:8020/home/myuser/test_data/sparkR/flights.csv
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
  at org.apache.spark.rdd.RDD$$
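
For context: a bare path is resolved against the cluster's default filesystem
(fs.defaultFS, hdfs://hostname:8020 here), while an explicit scheme overrides
it. A minimal sketch of the two forms, reusing the same path and assuming the
spark-csv package is on the classpath:

# Bare path: resolved against fs.defaultFS, so Spark looks for it in HDFS.
flightsDF <- read.df(sqlContext, "/home/myuser/test_data/sparkR/flights.csv",
                     source = "com.databricks.spark.csv", header = "true")

# Explicit file:// scheme: forces the local filesystem instead.
flightsDF <- read.df(sqlContext, "file:///home/myuser/test_data/sparkR/flights.csv",
                     source = "com.databricks.spark.csv", header = "true")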

Thanks,
Boyu

On Tue, Dec 8, 2015 at 12:38 PM, Felix Cheung wrote:

> Have you tried
>
> flightsDF <- read.df(sqlContext, "/home/myuser/test_data/sparkR/flights.csv",
>                      source = "com.databricks.spark.csv", header = "true")
>
>
>
> _
> From: Boyu Zhang 
> Sent: Tuesday, December 8, 2015 8:47 AM
> Subject: SparkR read.df failed to read file from local directory
> To: 
>
>
>
> Hello everyone,
>
> I tried to run the example data-manipulation.R, but I can't get it to read
> the flights.csv file stored on my local filesystem. I don't want to store
> big files in HDFS, so reading from a local filesystem (Lustre) is the
> desired behavior for me.
>
> I tried the following:
>
> flightsDF <- read.df(sqlContext, "file:///home/myuser/test_data/sparkR/flights.csv",
>                      source = "com.databricks.spark.csv", header = "true")
>
> I got the following messages, and the job eventually failed:
>
> 15/12/08 11:42:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on hathi-a003.rcac.purdue.edu:33894 (size: 14.4 KB, free: 530.2 MB)
> 15/12/08 11:42:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 9, hathi-a003.rcac.purdue.edu): java.io.FileNotFoundException: File file:/home/myuser/test_data/sparkR/flights.csv does not exist
>   at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
>   at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
>   at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
>   at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
>   at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:106)
>   at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>   at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
>
> Can someone please provide comments? Any tips are appreciated, thank you!
>
> Boyu Zhang
>


Re: SparkR read.df failed to read file from local directory

2015-12-08 Thread Felix Cheung
Have you tried

flightsDF <- read.df(sqlContext, "/home/myuser/test_data/sparkR/flights.csv",
                     source = "com.databricks.spark.csv", header = "true")
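
The "com.databricks.spark.csv" source is an external package, so it must be on
the classpath when the SparkR shell starts. A minimal sketch of one way to do
that (assuming Spark 1.5 with Scala 2.10; the package version is illustrative):

# Launch SparkR with the CSV data source on the classpath:
#   $SPARK_HOME/bin/sparkR --packages com.databricks:spark-csv_2.10:1.3.0
# Then the source name resolves inside the shell:
flightsDF <- read.df(sqlContext, "/home/myuser/test_data/sparkR/flights.csv",
                     source = "com.databricks.spark.csv", header = "true")
printSchema(flightsDF)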




RE: SparkR read.df failed to read file from local directory

2015-12-08 Thread Sun, Rui
Hi, Boyu,

Does the local file “/home/myuser/test_data/sparkR/flights.csv” really exist?

I just tried, and had no problem creating a DataFrame from a local CSV file.
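
Something along these lines (a minimal sketch; the path is the one from your
mail, and it assumes the spark-csv package is on the classpath):

# Confirm the driver can see the file at all. With a file:// URI the same
# path must also be readable on every worker node, or live on a shared mount
# (such as your Lustre filesystem) at the same location everywhere.
file.exists("/home/myuser/test_data/sparkR/flights.csv")

df <- read.df(sqlContext, "file:///home/myuser/test_data/sparkR/flights.csv",
              source = "com.databricks.spark.csv", header = "true")
head(df)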
