This is a great question. Basically you don't have to worry about the
details: just pass a wildcard path in your call to textFile. See the
"External Datasets" section of the Programming Guide
<http://spark.apache.org/docs/latest/programming-guide.html>. The Spark
framework will distribute your data across the workers. Note that:

> *If using a path on the local filesystem, the file must also be accessible
> at the same path on worker nodes. Either copy the file to all workers or
> use a network-mounted shared file system.*


In your case, that means the directory of files must be available at the
same path on every worker.
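
For example, here is a minimal sketch in Scala; the path
/shared/data/inputs/*.txt, the object name, and the per-file processing are
just placeholders for illustration, not your actual setup:

import org.apache.spark.{SparkConf, SparkContext}

object PerFileJobs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("per-file-jobs"))

    // Wildcard path: Spark expands it and spreads the matching files
    // across the workers. With a local (non-HDFS) path, the files must
    // be reachable at this same path on every worker node.
    val lines = sc.textFile("/shared/data/inputs/*.txt")
    println(s"total lines: ${lines.count()}")

    // If each file should be handled as a single unit (one ML run per
    // file), wholeTextFiles yields (path, contents) pairs, so a task
    // sees a whole file instead of individual lines.
    val perFile = sc.wholeTextFiles("/shared/data/inputs/*.txt")
      .map { case (path, contents) =>
        // stand-in for the real per-file algorithm
        (path, contents.length)
      }
    perFile.collect().foreach(println)

    sc.stop()
  }
}

Whether textFile with a wildcard or wholeTextFiles fits better depends on
whether your algorithm wants lines or whole files as its unit of work.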

Curiously, I cannot get this to work when I mount a directory with sshfs on
all of my worker nodes. Spark reports "file not found" even though the file
clearly exists at the specified path on every worker. Anyone care to try
this and comment?

Thanks,

Pete

On Tue, Sep 6, 2016 at 9:51 AM, Lydia Ickler <ickle...@googlemail.com>
wrote:

> Hi,
>
> maybe this is a stupid question:
>
> I have a list of files, and I want to take each file as an input for an
> ML algorithm. All the files are independent of one another.
> My question now is: how do I distribute the work so that each worker takes
> a block of files and just runs the algorithm on them one by one?
> I hope somebody can point me in the right direction! :)
>
> Best regards,
> Lydia
