That's failing for me.  Can someone please try this and tell me whether it's
even supposed to work:

   - create a directory somewhere and add two text files to it
   - mount that directory on the Spark worker machines with sshfs
   - read the text files into one data structure using a file URL with a
   wildcard
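For reference, steps 1 and 3 of that recipe can be sketched locally with the
Python stdlib (sshfs left out; in Spark the read itself would be sc.textFile
on a file:// URL with the wildcard):

```python
import glob
import os
import tempfile

# Step 1: a directory with two text files in it.
d = tempfile.mkdtemp()
for name, text in [("a.txt", "alpha\n"), ("b.txt", "beta\n")]:
    with open(os.path.join(d, name), "w") as f:
        f.write(text)

# Step 3, minus the cluster: expand the wildcard and read every match
# into one data structure, which is what sc.textFile(d + "/*.txt")
# does across the workers.
lines = []
for path in sorted(glob.glob(os.path.join(d, "*.txt"))):
    with open(path) as f:
        lines.extend(f.read().splitlines())
print(lines)  # ['alpha', 'beta']
```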

Thanks,

Pete

On Tue, Sep 6, 2016 at 11:20 PM, ayan guha <guha.a...@gmail.com> wrote:

> To access local file, try with file:// URI.
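A minimal sketch of that URI form, using Python's pathlib ("/data/input" is a
made-up path; sc is assumed to be a live SparkContext and appears only in a
comment):

```python
from pathlib import Path

# Turn a local path into the file:// form; append the wildcard after.
uri = Path("/data/input").as_uri() + "/*.txt"
print(uri)  # file:///data/input/*.txt

# With a live SparkContext `sc` (not created here), the read would be:
#   rdd = sc.textFile(uri)
```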
>
> On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi <pete.figlio...@gmail.com>
> wrote:
>
>> This is a great question.  Basically you don't have to worry about the
>> details: just give a wildcard in your call to textFile.  See the Programming
>> Guide <http://spark.apache.org/docs/latest/programming-guide.html> section
>> entitled "External Datasets".  The Spark framework will distribute your
>> data across the workers.  Note that:
>>
>> *If using a path on the local filesystem, the file must also be
>>> accessible at the same path on worker nodes. Either copy the file to all
>>> workers or use a network-mounted shared file system.*
>>
>>
>> In your case this would mean the directory of files.
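As a toy illustration of that "same path on all workers" rule, here is a
hypothetical consistency check (hostnames and listings are invented; in
practice you would gather each worker's listing over ssh):

```python
def visible_everywhere(hosts_listing):
    # hosts_listing maps hostname -> set of entries seen at the shared
    # path on that host. The path is safe to pass to sc.textFile only
    # if every host sees the same files.
    listings = list(hosts_listing.values())
    return all(listing == listings[0] for listing in listings)

# Two workers see the same files:
ok = visible_everywhere({
    "worker1": {"a.txt", "b.txt"},
    "worker2": {"a.txt", "b.txt"},
})
# A third worker has an empty (e.g. unmounted) directory:
bad = visible_everywhere({
    "worker1": {"a.txt", "b.txt"},
    "worker3": set(),
})
print(ok, bad)  # True False
```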
>>
>> Curiously, I cannot get this to work when I mount a directory with sshfs
>> on all of my worker nodes.  It says "file not found" even though the file
>> clearly exists at the specified path on all workers.  Anyone care to try
>> this and comment?
>>
>> Thanks,
>>
>> Pete
>>
>> On Tue, Sep 6, 2016 at 9:51 AM, Lydia Ickler <ickle...@googlemail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> maybe this is a stupid question:
>>>
>>> I have a list of files, and I want to use each file as input to an
>>> ML algorithm. All the files are independent of one another.
>>> My question is: how do I distribute the work so that each worker
>>> takes a block of files and runs the algorithm on them one by one?
>>> I hope somebody can point me in the right direction! :)
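One common pattern for independent files is to parallelize the list of file
names and map the algorithm over it; in PySpark that is
sc.parallelize(files).map(run_algorithm).collect(). A local sketch of the same
shape, with threads standing in for workers and a dummy stand-in algorithm:

```python
from concurrent.futures import ThreadPoolExecutor

def run_algorithm(path):
    # Stand-in for the real ML algorithm: it just returns the path and
    # its length. Each file is independent, so calls need no coordination.
    return path, len(path)

files = ["data/f1.txt", "data/f2.txt", "data/f3.txt"]

# Each "worker" (thread here, executor in Spark) takes files off the
# list and runs the algorithm on them one by one.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_algorithm, files))
print(results)
```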
>>>
>>> Best regards,
>>> Lydia
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
