Yes indeed (see below).  Just to reiterate, I am not running Hadoop.  The
"curly" mentioned in the stack trace is the name of one of the worker
nodes.  I've mounted the same directory "datashare", containing two text
files, on all worker nodes with sshfs.  The Spark documentation suggests
that this should work:

*If using a path on the local filesystem, the file must also be accessible
at the same path on worker nodes. Either copy the file to all workers or
use a network-mounted shared file system.*

I was hoping someone else could try this and see if it works.

Here's what I did to generate the error:

val data = sc.textFile("file:///home/peter/datashare/*.txt")   // glob over the sshfs-mounted directory
data.collect()   // the action that triggers the job; this is where it fails

It's working to some extent: if I put a bogus path in, I get a
different (and correct) error (InvalidInputException: Input Pattern
file:/home/peter/ddatashare/*.txt matches 0 files).
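
The wildcard appears to be expanded on the driver when the job is set up,
which would explain why the bad path is caught early.  Here's a minimal
sketch to confirm what the driver side sees (assuming a spark-shell session,
so sc already exists, and using the Hadoop FileSystem API that ships with
Spark):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Resolve the same glob on the driver and print whatever it matches.
val fs = FileSystem.get(new URI("file:///"), sc.hadoopConfiguration)
fs.globStatus(new Path("/home/peter/datashare/*.txt"))
  .foreach(status => println(status.getPath))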

Here's the stack trace when I use a valid path:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 18.0 failed 4 times, most recent failure: Lost task 1.3 in stage 18.0 (TID 792, curly): java.io.FileNotFoundException: File file:/home/peter/datashare/f1.txt does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:246)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
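
The failing frame is the LineRecordReader open, which runs on the executor,
so it looks like the driver can list the files but the worker ("curly")
cannot open them.  Here's a minimal diagnostic sketch (again assuming a
spark-shell session; f1.txt is just the file named in the trace) that asks
each executor host whether it can see the file:

// Report, per executor host, whether the file from the trace is visible.
// Many small tasks are used so the job is likely to hit every worker.
val path = "/home/peter/datashare/f1.txt"
val visibility = sc.parallelize(1 to 1000, 100)
  .map { _ =>
    val host = java.net.InetAddress.getLocalHost.getHostName
    (host, new java.io.File(path).exists)
  }
  .distinct()
  .collect()

visibility.foreach(println)   // expect (host,true) on every worker

If any host comes back false, that would point at the sshfs mount (or the
user the executor runs as) rather than at Spark itself.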


On Wed, Sep 7, 2016 at 9:50 AM, Yong Zhang <java8...@hotmail.com> wrote:

> What error do you get? FileNotFoundException?
>
>
> Please paste the stacktrace here.
>
>
> Yong
>
>
> ------------------------------
> *From:* Peter Figliozzi <pete.figlio...@gmail.com>
> *Sent:* Wednesday, September 7, 2016 10:18 AM
> *To:* ayan guha
> *Cc:* Lydia Ickler; user.spark
> *Subject:* Re: distribute work (files)
>
> That's failing for me.  Can someone please try this-- is this even
> supposed to work:
>
>    - create a directory somewhere and add two text files to it
>    - mount that directory on the Spark worker machines with sshfs
>    - read the text files into one data structure using a file URL with a
>    wildcard
>
> Thanks,
>
> Pete
>
> On Tue, Sep 6, 2016 at 11:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> To access a local file, try with a file:// URI.
>>
>> On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi <pete.figlio...@gmail.com
>> > wrote:
>>
>>> This is a great question.  Basically you don't have to worry about the
>>> details-- just give a wildcard in your call to textFile.  See the
>>> Programming Guide
>>> <http://spark.apache.org/docs/latest/programming-guide.html> section
>>> entitled "External Datasets".  The Spark framework will distribute your
>>> data across the workers.  Note that:
>>>
>>> *If using a path on the local filesystem, the file must also be
>>>> accessible at the same path on worker nodes. Either copy the file to all
>>>> workers or use a network-mounted shared file system.*
>>>
>>>
>>> In your case this would mean the directory of files.
>>>
>>> Curiously, I cannot get this to work when I mount a directory with sshfs
>>> on all of my worker nodes.  It says "file not found" even though the file
>>> clearly exists in the specified path on all workers.   Anyone care to try
>>> and comment on this?
>>>
>>> Thanks,
>>>
>>> Pete
>>>
>>> On Tue, Sep 6, 2016 at 9:51 AM, Lydia Ickler <ickle...@googlemail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> maybe this is a stupid question:
>>>>
>>>> I have a list of files, and I want to use each file as input to an
>>>> ML algorithm. All of the files are independent of one another.
>>>> My question is: how do I distribute the work so that each worker
>>>> takes a block of files and runs the algorithm on them one by one?
>>>> I hope somebody can point me in the right direction! :)
>>>>
>>>> Best regards,
>>>> Lydia
>>>>
>>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
