Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Harsh J
General note: /root is a protected local directory, meaning that if
your program runs as a non-root user, it will never be able to access the
file.
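To make the permission point concrete, here is a small self-contained sketch (the paths are illustrative, not from the thread): a file can itself be mode 644 and still be unreachable, because a mode-700 parent directory such as /root blocks traversal for everyone but its owner.

```shell
#!/bin/sh
# Sketch: a world-readable file under an owner-only directory is still
# unreachable to other users, which is why a non-root executor cannot
# open /root/2008.csv even when the file itself is mode 644.
dir=$(mktemp -d)
echo "sample" > "$dir/2008.csv"
chmod 644 "$dir/2008.csv"     # the file: readable by anyone
chmod 700 "$dir"              # the directory: owner-only, like /root
stat -c '%a' "$dir"           # prints 700
stat -c '%a' "$dir/2008.csv"  # prints 644
# To let a non-root Spark user read it, relax the directory (e.g.
# chmod 755) or move the file outside /root entirely.
rm -rf "$dir"
```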

On Sat, Dec 12, 2015 at 12:21 AM Zhan Zhang  wrote:

> As Sean mentioned, you cannot refer to a local file on your remote
> machines (executors). One workaround is to copy the file to all machines,
> in the same directory.
>
> Thanks.
>
> Zhan Zhang
>
> On Dec 11, 2015, at 10:26 AM, Lin, Hao  wrote:
>
>  of the master node
>
>
>
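A minimal sketch of that workaround, with hostnames and paths as placeholders (the thread does not name any): stage the file at the same absolute path on every node so a file:/// URI resolves wherever a task runs. The helper only echoes the scp commands it would run, since this is an illustration rather than a deployment script.

```shell
#!/bin/sh
# Workaround sketch: stage the input at the same absolute path on all
# nodes so sc.textFile("file:///data/2008.csv") works on every executor.
# Hostnames and paths are placeholders; echo shows the commands instead
# of executing scp.
copy_to_workers() {
  src="$1"; shift
  for host in "$@"; do
    echo "scp $src $host:$src"
  done
}
copy_to_workers /data/2008.csv worker1 worker2 worker3
# Shared storage avoids the copying entirely, e.g.:
#   hdfs dfs -put /data/2008.csv /user/me/2008.csv
#   sc.textFile("hdfs:///user/me/2008.csv")
```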


Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Vijay Gharge
This issue is due to a file permission issue. You need to execute spark
operations using root command only.



Regards,
Vijay Gharge



On Fri, Dec 11, 2015 at 11:20 PM, Vijay Gharge <vijay.gha...@gmail.com>
wrote:

> One more question. Are you also running spark commands using root user ?
> Meanwhile am trying to simulate this locally.
>
>
> On Friday 11 December 2015, Lin, Hao <hao@finra.org> wrote:
>
>> Here you go, thanks.
>>
>>
>>
>> -rw-r--r-- 1 root root 658M Dec  9  2014 /root/2008.csv
>>
>>
>>
>> *From:* Vijay Gharge [mailto:vijay.gha...@gmail.com]
>> *Sent:* Friday, December 11, 2015 12:31 PM
>> *To:* Lin, Hao
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: how to access local file from Spark
>> sc.textFile("file:///path to/myfile")
>>
>>
>>
>> Can you provide output of "ls -lh /root/2008.csv" ?
>>
>> On Friday 11 December 2015, Lin, Hao <hao@finra.org> wrote:
>>
>> Hi,
>>
>>
>>
>> I have a problem accessing a local file, for example:
>>
>>
>>
>> sc.textFile("file:///root/2008.csv").count()
>>
>>
>>
>> with error: File file:/root/2008.csv does not exist.
>>
>> The file clearly exists, since if I mistype the file name as a
>> non-existing one, it will show:
>>
>>
>>
>> Error: Input path does not exist
>>
>>
>>
>> Please help!
>>
>>
>>
>> The following is the error message:
>>
>>
>>
>> scala> sc.textFile("file:///root/2008.csv").count()
>>
>> 15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID
>> 498, 10.162.167.24): java.io.FileNotFoundException: File
>> file:/root/2008.csv does not exist
>> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>> at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
>> at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
>> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
>> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
>> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
>> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> 15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4
>> times; aborting job
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 9
>> in stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage
>> 8.0 (TID 547, 10.162.167.23): java.io.FileNotFoundException: File
>> file:/root/2008.csv does not exist
>> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>> at org.apache.hadoop.fs.RawLocalF

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Sean Owen
Hm, are you referencing a local file from your remote workers? That
won't work as the file only exists in one machine (I presume).
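A hedged diagnostic in the same spirit (hostnames are placeholders): because each task opens a file:// path on whatever node it runs, the file must be present and readable on every worker, not just on the machine where spark-shell started. The helper below prints the per-node checks rather than executing ssh.

```shell
#!/bin/sh
# Sketch: print the readability check that would be run on each node.
# In a real cluster, replace echo with the actual ssh invocation.
check_nodes() {
  path="$1"; shift
  for host in "$@"; do
    echo "ssh $host test -r $path"
  done
}
check_nodes /root/2008.csv worker1 worker2
```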

On Fri, Dec 11, 2015 at 5:19 PM, Lin, Hao  wrote:
> Hi,
>
> I have a problem accessing a local file:
>
> sc.textFile("file:///root/2008.csv").count()
>
> with error: File file:/root/2008.csv does not exist.

RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
Here you go, thanks.

-rw-r--r-- 1 root root 658M Dec  9  2014 /root/2008.csv

From: Vijay Gharge [mailto:vijay.gha...@gmail.com]
Sent: Friday, December 11, 2015 12:31 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path 
to/myfile")

Can you provide output of "ls -lh /root/2008.csv" ?

On Friday 11 December 2015, Lin, Hao <hao@finra.org> wrote:
Hi,

I have a problem accessing a local file, for example:

sc.textFile("file:///root/2008.csv").count()

with error: File file:/root/2008.csv does not exist.

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Vijay Gharge
Please ignore the typo. I meant root "permissions".

Regards,
Vijay Gharge



On Fri, Dec 11, 2015 at 11:30 PM, Vijay Gharge <vijay.gha...@gmail.com>
wrote:

> This issue is due to file permission issue. You need to execute spark
> operations using root command only.
>
>
>
> Regards,
> Vijay Gharge
>
>
>
> On Fri, Dec 11, 2015 at 11:20 PM, Vijay Gharge <vijay.gha...@gmail.com>
> wrote:
>
>> One more question. Are you also running spark commands using root user ?
>> Meanwhile am trying to simulate this locally.
>>
>>
>> On Friday 11 December 2015, Lin, Hao <hao@finra.org> wrote:
>>
>>> Here you go, thanks.
>>>
>>>
>>>
>>> -rw-r--r-- 1 root root 658M Dec  9  2014 /root/2008.csv

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Vijay Gharge
Can you provide output of "ls -lh /root/2008.csv" ?

On Friday 11 December 2015, Lin, Hao  wrote:

> Hi,
>
> I have a problem accessing a local file:
>
> sc.textFile("file:///root/2008.csv").count()
>
> with error: File file:/root/2008.csv does not exist.

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Vijay Gharge
One more question: are you also running Spark commands as the root user?
Meanwhile I am trying to simulate this locally.

On Friday 11 December 2015, Lin, Hao <hao@finra.org> wrote:

> Here you go, thanks.
>
>
>
> -rw-r--r-- 1 root root 658M Dec  9  2014 /root/2008.csv

RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
Yes to your question. I spun up a cluster, logged into the master as the root
user, ran spark-shell, and referenced a local file on the master machine.

From: Vijay Gharge [mailto:vijay.gha...@gmail.com]
Sent: Friday, December 11, 2015 12:50 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path 
to/myfile")

One more question. Are you also running spark commands using root user ? 
Meanwhile am trying to simulate this locally.

On Friday 11 December 2015, Lin, Hao <hao@finra.org> wrote:
Here you go, thanks.

-rw-r--r-- 1 root root 658M Dec  9  2014 /root/2008.csv


RE: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Lin, Hao
I logged into the master of my cluster and referenced a local file on the
master node machine. And yes, that file resides only on the master node, not
on any of the remote workers.

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Friday, December 11, 2015 1:00 PM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: how to access local file from Spark sc.textFile("file:///path 
to/myfile")

Hm, are you referencing a local file from your remote workers? That won't work,
as the file only exists on one machine (I presume).
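
A quick way to tell the "file is missing on this host" failure apart from the "file exists but is unreadable by the Spark user" failure (the /root permission issue Harsh raises) is a small driver-side check before calling sc.textFile. This is only a sketch: checkLocalFile is a hypothetical helper, not a Spark API, and a passing check on the driver still does not guarantee the file is present on every executor host.

```scala
import java.nio.file.{Files, Paths}

// Driver-side sanity check before sc.textFile("file:///root/2008.csv").
// NOTE: with a file:// URI every executor host must ALSO hold the file at
// the same path, readable by the OS user the executors run as (a file
// under /root usually is not readable by a non-root user).
def checkLocalFile(path: String): Either[String, Long] = {
  val p = Paths.get(path)
  if (!Files.exists(p)) Left(s"$path does not exist on this host")
  else if (!Files.isReadable(p)) Left(s"$path is not readable by this user")
  else Right(Files.size(p)) // size in bytes
}

println(checkLocalFile("/root/2008.csv"))
```

Running the same check as a non-root user on each worker would reveal whether the problem is a missing file or the /root permission wall.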

On Fri, Dec 11, 2015 at 5:19 PM, Lin, Hao <hao@finra.org> wrote:
> Hi,
>
>
>
> I have problem accessing local file, with such example:
>
>
>
> sc.textFile("file:///root/2008.csv").count()
>
>
>
> with error: File file:/root/2008.csv does not exist.
>
> The file clearly exists, since if I mistyped the file name to a
> non-existing one, it would show:
>
>
>
> Error: Input path does not exist
>
>
>
> Please help!
>
>
>
> The following is the error message:
>
>
>
> scala> sc.textFile("file:///root/2008.csv").count()
>
> 15/12/11 17:12:08 WARN TaskSetManager: Lost task 15.0 in stage 8.0 (TID 498,
> 10.162.167.24): java.io.FileNotFoundException: File file:/root/2008.csv does not exist
> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
> at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
> at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> 15/12/11 17:12:08 ERROR TaskSetManager: Task 9 in stage 8.0 failed 4 times; aborting job
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in
> stage 8.0 failed 4 times, most recent failure: Lost task 9.3 in stage 8.0
> (TID 547, 10.162.167.23): java.io.FileNotFoundException: File
> file:/root/2008.csv does not exist
> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
> at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:397)
> at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:764)
> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
> at org.apache.hadoop.mapred.TextInputFormat.getRecordR

Re: how to access local file from Spark sc.textFile("file:///path to/myfile")

2015-12-11 Thread Zhan Zhang
As Sean mentioned, you cannot refer to a local file on your remote machines
(executors). One workaround is to copy the file to all machines under the same
directory.

Thanks.

Zhan Zhang

On Dec 11, 2015, at 10:26 AM, Lin, Hao wrote:

 of the master node
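
Zhan's copy-to-all-machines workaround above can be sketched as follows. The worker host names are placeholders, and replicationCommands is a hypothetical helper that only builds the shell commands; it does not execute them — run them (or an rsync/pdcp equivalent) yourself from the master.

```scala
// Sketch: build one scp command per worker so the data file ends up at the
// same absolute path on every host. Host names below are examples only.
def replicationCommands(file: String, workers: Seq[String]): Seq[String] =
  workers.map(host => s"scp $file $host:$file")

replicationCommands("/root/2008.csv", Seq("worker1", "worker2"))
  .foreach(println)
```

Once the file exists at the same path on every executor host, readable by the user the executors run as, sc.textFile("file:///root/2008.csv") can open it on each node; keeping the data in a shared store such as HDFS or S3 avoids the copy step entirely.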