All (three) of them. It's kind of cool -- when I re-run collect(), a different executor shows up as the first to encounter the error.
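(A minimal sketch of one way to check this, assuming the /home/peter/datashare sshfs mount point described further down the thread: spread a few throwaway tasks across the cluster and have each executor report which files it can actually see at that path. The helper names here are illustrative, not from the thread.)

import java.io.File
import java.net.InetAddress

// Assumed mount point from the messages below; change to your own path.
val dir = "/home/peter/datashare"

// With enough partitions the tasks should land on every executor,
// though Spark's scheduling gives no hard guarantee of that.
val visibility = sc.parallelize(1 to 100, sc.defaultParallelism)
  .map { _ =>
    val host = InetAddress.getLocalHost.getHostName
    val files = Option(new File(dir).listFiles())
      .map(_.map(_.getName).sorted.mkString(", "))
      .getOrElse("<directory not visible>")
    (host, files)
  }
  .distinct()
  .collect()

visibility.foreach { case (host, files) => println(s"$host: $files") }
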
On Wed, Sep 7, 2016 at 8:20 PM, ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> Is it happening on all executors or one?
>
> On Thu, Sep 8, 2016 at 10:46 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>
>> Yes indeed (see below). Just to reiterate, I am not running Hadoop. The
>> "curly" node name mentioned in the stack trace is the name of one of the
>> worker nodes. I've mounted the same directory "datashare", containing two
>> text files, on all worker nodes with sshfs. The Spark documentation
>> suggests that this should work:
>>
>> *If using a path on the local filesystem, the file must also be
>> accessible at the same path on worker nodes. Either copy the file to all
>> workers or use a network-mounted shared file system.*
>>
>> I was hoping someone else could try this and see if it works.
>>
>> Here's what I did to generate the error:
>>
>> val data = sc.textFile("file:///home/peter/datashare/*.txt")
>> data.collect()
>>
>> It's working to some extent: if I put a bogus path in, I get a different
>> (correct) error (InvalidInputException: Input Pattern
>> file:/home/peter/ddatashare/*.txt matches 0 files).
>>
>> Here's the stack trace when I use a valid path:
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 1 in stage 18.0 failed 4 times, most recent failure: Lost task 1.3 in stage
>> 18.0 (TID 792, curly): java.io.FileNotFoundException: File
>> file:/home/peter/datashare/f1.txt does not exist
>> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
>> at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>> at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
>> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
>> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
>> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:246)
>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>> at org.apache.spark.scheduler.Task.run(Task.scala:85)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>> On Wed, Sep 7, 2016 at 9:50 AM, Yong Zhang <java8...@hotmail.com> wrote:
>>
>>> What error do you get? FileNotFoundException?
>>>
>>> Please paste the stack trace here.
>>>
>>> Yong
>>>
>>> ------------------------------
>>> *From:* Peter Figliozzi <pete.figlio...@gmail.com>
>>> *Sent:* Wednesday, September 7, 2016 10:18 AM
>>> *To:* ayan guha
>>> *Cc:* Lydia Ickler; user.spark
>>> *Subject:* Re: distribute work (files)
>>>
>>> That's failing for me. Can someone please try this -- is this even
>>> supposed to work?
>>>
>>> - create a directory somewhere and add two text files to it
>>> - mount that directory on the Spark worker machines with sshfs
>>> - read the text files into one data structure using a file URL with
>>>   a wildcard
>>>
>>> Thanks,
>>>
>>> Pete
>>>
>>> On Tue, Sep 6, 2016 at 11:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> To access a local file, try a file:// URI.
>>>>
>>>> On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>>>>
>>>>> This is a great question. Basically you don't have to worry about the
>>>>> details -- just give a wildcard in your call to textFile. See the
>>>>> Programming Guide
>>>>> <http://spark.apache.org/docs/latest/programming-guide.html> section
>>>>> entitled "External Datasets". The Spark framework will distribute your
>>>>> data across the workers. Note that:
>>>>>
>>>>>> *If using a path on the local filesystem, the file must also be
>>>>>> accessible at the same path on worker nodes. Either copy the file to all
>>>>>> workers or use a network-mounted shared file system.*
>>>>>
>>>>> In your case this would mean the directory of files.
>>>>>
>>>>> Curiously, I cannot get this to work when I mount a directory with
>>>>> sshfs on all of my worker nodes. It says "file not found" even
>>>>> though the file clearly exists at the specified path on all workers.
>>>>> Anyone care to try and comment on this?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Pete
>>>>>
>>>>> On Tue, Sep 6, 2016 at 9:51 AM, Lydia Ickler <ickle...@googlemail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> maybe this is a stupid question:
>>>>>>
>>>>>> I have a list of files. I want to take each file as an input for an
>>>>>> ML algorithm. All files are independent of one another.
>>>>>> My question is how to distribute the work so that each worker takes
>>>>>> a block of files and runs the algorithm on them one by one.
>>>>>> I hope somebody can point me in the right direction! :)
>>>>>>
>>>>>> Best regards,
>>>>>> Lydia
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha
>>>
>>
>
> --
> Best Regards,
> Ayan Guha
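
(For Lydia's original question at the bottom of the thread, a rough sketch of the pattern Peter describes: let Spark treat each file as a single record with wholeTextFiles and map the per-file algorithm over those records. It assumes the files are readable at the same path on every worker; the path, wildcard, and runAlgorithm placeholder are illustrative, not from the thread.)

// Hypothetical stand-in for the per-file ML step; substitute the real algorithm.
def runAlgorithm(contents: String): Double = contents.length.toDouble

// wholeTextFiles yields one (path, fullContents) pair per file, so Spark
// schedules whole files (not individual lines) across the workers.
val results = sc.wholeTextFiles("file:///home/peter/datashare/*.txt")
  .map { case (path, contents) => (path, runAlgorithm(contents)) }
  .collect()

results.foreach { case (path, score) => println(s"$path -> $score") }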