It works!  Hmm, smells like some kind of Linux permissions issue. Checking
this, the owner and group are the same all around, and there is global read
permission as well.  So I have no clue why it would not work with an
sshfs-mounted volume.

Back to the OP's question... use Spark's CSV data source instead of calling
textFile like I originally suggested.  See this StackOverflow answer:
<http://stackoverflow.com/questions/37639956/how-to-import-multiple-csv-files-in-a-single-load>.
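
For example, roughly along these lines (just a sketch, assuming Spark 2.x
where the CSV reader is built in; the path and options are illustrative,
not from my actual setup):

// Sketch: load every CSV file matching the glob into one DataFrame.
// Assumes a Spark 2.x SparkSession named `spark` (as in spark-shell) and a
// path that is visible at the same location on every worker.
val df = spark.read
  .option("header", "true")        // first line of each file is a header
  .option("inferSchema", "true")   // let Spark guess the column types
  .csv("file:///home/peter/datashare/*.csv")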


Good to know this is an option.  I use Cassandra for my data source and am
not running Hadoop (no reason to thus far).

Can anyone get this to work with an sshfs-mounted share?
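
For reference, this is the minimal sketch of what I'm running (paths are
from my setup; the mount point on your machines will differ):

// ~/datashare is an sshfs mount, present at the same path on every worker.
val data = sc.textFile("file:///home/peter/datashare/*.txt")
data.collect()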

On Wed, Sep 7, 2016 at 8:48 PM, ayan guha <guha.a...@gmail.com> wrote:

> So, can you try to simulate the same without sshfs? ie, create a folder on
> /tmp/datashare and copy your files on all the machines and point
> sc.textFiles to that folder?
>
>
> On Thu, Sep 8, 2016 at 11:26 AM, Peter Figliozzi <pete.figlio...@gmail.com
> > wrote:
>
>> All (three) of them.  It's kind of cool: when I re-run collect(), a
>> different executor shows up as the first to encounter the error.
>>
>> On Wed, Sep 7, 2016 at 8:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Is it happening on all executors or one?
>>>
>>> On Thu, Sep 8, 2016 at 10:46 AM, Peter Figliozzi <
>>> pete.figlio...@gmail.com> wrote:
>>>
>>>>
>>>> Yes indeed (see below).  Just to reiterate, I am not running Hadoop.
>>>> The "curly" node name mentioned in the stacktrace is the name of one of the
>>>> worker nodes.  I've mounted the same directory "datashare" with two text
>>>> files to all worker nodes with sshfs.  The Spark documentation suggests
>>>> that this should work:
>>>>
>>>> *If using a path on the local filesystem, the file must also be
>>>> accessible at the same path on worker nodes. Either copy the file to all
>>>> workers or use a network-mounted shared file system.*
>>>>
>>>> I was hoping someone else could try this and see if it works.
>>>>
>>>> Here's what I did to generate the error:
>>>>
>>>> val data = sc.textFile("file:///home/peter/datashare/*.txt")
>>>> data.collect()
>>>>
>>>> It's working to some extent because if I put a bogus path in, I'll get
>>>> a different (correct) error (InvalidInputException: Input Pattern
>>>> file:/home/peter/ddatashare/*.txt matches 0 files).
>>>>
>>>> Here's the stack trace when I use a valid path:
>>>>
>>>> org.apache.spark.SparkException: Job aborted due to stage failure:
>>>> Task 1 in stage 18.0 failed 4 times, most recent failure: Lost task 1.3 in
>>>> stage 18.0 (TID 792, curly): java.io.FileNotFoundException: File
>>>> file:/home/peter/datashare/f1.txt does not exist
>>>> at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
>>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
>>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
>>>> at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>>>> at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:140)
>>>> at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:341)
>>>> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
>>>> at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
>>>> at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>>>> at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:246)
>>>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
>>>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
>>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>>>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>>>> at org.apache.spark.scheduler.Task.run(Task.scala:85)
>>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>> at java.lang.Thread.run(Thread.java:745)
>>>>
>>>>
>>>> On Wed, Sep 7, 2016 at 9:50 AM, Yong Zhang <java8...@hotmail.com>
>>>> wrote:
>>>>
>>>>> What error do you get? FileNotFoundException?
>>>>>
>>>>>
>>>>> Please paste the stacktrace here.
>>>>>
>>>>>
>>>>> Yong
>>>>>
>>>>>
>>>>> ------------------------------
>>>>> *From:* Peter Figliozzi <pete.figlio...@gmail.com>
>>>>> *Sent:* Wednesday, September 7, 2016 10:18 AM
>>>>> *To:* ayan guha
>>>>> *Cc:* Lydia Ickler; user.spark
>>>>> *Subject:* Re: distribute work (files)
>>>>>
>>>>> That's failing for me.  Can someone please try this (is it even
>>>>> supposed to work?):
>>>>>
>>>>>    - create a directory somewhere and add two text files to it
>>>>>    - mount that directory on the Spark worker machines with sshfs
>>>>>    - read the text files into one data structure using a file URL
>>>>>    with a wildcard
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Pete
>>>>>
>>>>> On Tue, Sep 6, 2016 at 11:20 PM, ayan guha <guha.a...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> To access local file, try with file:// URI.
>>>>>>
>>>>>> On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi <
>>>>>> pete.figlio...@gmail.com> wrote:
>>>>>>
>>>>>>> This is a great question.  Basically you don't have to worry about
>>>>>>> the details-- just give a wildcard in your call to textFile.  See
>>>>>>> the Programming Guide
>>>>>>> <http://spark.apache.org/docs/latest/programming-guide.html> section
>>>>>>> entitled "External Datasets".  The Spark framework will distribute your
>>>>>>> data across the workers.  Note that:
>>>>>>>
>>>>>>> *If using a path on the local filesystem, the file must also be
>>>>>>>> accessible at the same path on worker nodes. Either copy the file to 
>>>>>>>> all
>>>>>>>> workers or use a network-mounted shared file system.*
>>>>>>>
>>>>>>>
>>>>>>> In your case this would mean the directory of files.
>>>>>>>
>>>>>>> Curiously, I cannot get this to work when I mount a directory with
>>>>>>> sshfs on all of my worker nodes.  It says "file not found" even
>>>>>>> though the file clearly exists in the specified path on all workers.
>>>>>>> Anyone care to try and comment on this?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Pete
>>>>>>>
>>>>>>> On Tue, Sep 6, 2016 at 9:51 AM, Lydia Ickler <
>>>>>>> ickle...@googlemail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> maybe this is a stupid question:
>>>>>>>>
>>>>>>>> I have a list of files. Each file I want to use as input for an
>>>>>>>> ML algorithm. All files are independent of one another.
>>>>>>>> My question now is: how do I distribute the work so that each worker
>>>>>>>> takes a block of files and just runs the algorithm on them one by one?
>>>>>>>> I hope somebody can point me in the right direction! :)
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Lydia
>>>>>>>> ------------------------------------------------------------
>>>>>>>> ---------
>>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
