Hi Simon,

No, I don't need to run the tasks on multiple machines for now.

I will therefore stick to a Makefile plus shell or Java programs, as Spark appears
not to be the right tool for the tasks I am trying to accomplish.

Thank you for your input.

Philippe



----- Original Message -----
From: "Simon Hafner" <reactorm...@gmail.com>
To: "Philippe de Rochambeau" <phi...@free.fr>
Sent: Friday, November 21, 2014 09:47:25
Subject: Re: processing files

2014-11-21 1:46 GMT-06:00 Philippe de Rochambeau <phi...@free.fr>:
> - reads XML files in thousands of directories, two levels down, from year x 
> to year y

You could try a glob with wholeTextFiles, which accepts Hadoop-style path patterns:

sc.wholeTextFiles(dirWithXML + "/*/*/*.xml")  // matches two directory levels down

... not guaranteed to work.

> - extracts data from <image> tags in those files and stores them in a SQL or 
> NoSQL database

From what I understand, Spark expects the functions you pass to map() to be
free of side effects: a failed or slow task may be re-executed, so writing to
the database from inside map() is probably not a good idea if you don't want
duplicated records.
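
A minimal sketch of that split, assuming a hypothetical xmlFiles RDD of
(path, content) pairs (e.g. from wholeTextFiles) and a hypothetical
insertIntoDatabase function; the parsing runs in parallel, but the writes
happen once, on the driver:

import scala.xml.XML

// Pure transformation: safe for Spark to re-run if a task fails.
val imageData = xmlFiles.flatMap { case (path, content) =>
  (XML.loadString(content) \\ "image").map(node => (path, node.text))
}

// Side effects only on the driver, so task retries can't duplicate rows.
imageData.collect().foreach { case (path, image) =>
  insertIntoDatabase(path, image)  // hypothetical JDBC/NoSQL insert
}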

> - generates ImageMagick commands based on the extracted data to generate 
> images

That's an easy data transformation: map each record to a command string, then
collect() and save.
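
For example (hypothetical ImageMagick flags and output paths, building on the
imageData RDD sketched above):

// Each extracted record becomes one shell command line.
val commands = imageData.map { case (path, image) =>
  s"convert $image -thumbnail 200x200 $image.png"
}

// collect() on the driver and save the script to run outside Spark.
import java.nio.file.{Files, Paths}
Files.write(Paths.get("imagemagick.sh"), commands.collect().mkString("\n").getBytes)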

> - generates curl commands to index the image files with Solr

Same as with the ImageMagick commands.

> Does Spark provide any tools/features to facilitate and automate ("batchify") 
> the above tasks?

Sure, but I wouldn't run the commands from within Spark itself. Tasks can be
retried or speculatively executed, so the same command might be run twice or
more.

> I can do all of the above with one or several Java programs, but I wondered 
> if using Spark would be of any use in such an endeavour.

Personally, I'd use a Makefile with xmlstarlet for the XML parsing, store the
image paths in plaintext instead of a database, and get parallelization via
make -j X (see the sketch below). You could also run the ImageMagick and curl
commands from there. But that naturally doesn't scale to multiple machines.
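
A minimal sketch of such a Makefile, assuming the XML files live two levels
down under a hypothetical data/ root and that xmlstarlet is installed:

ROOT := data
XML  := $(wildcard $(ROOT)/*/*/*.xml)
TXT  := $(XML:.xml=.images.txt)

all: $(TXT)

# Pull the <image> tag contents out of each file into a plaintext sibling.
%.images.txt: %.xml
	xmlstarlet sel -t -v '//image' -n $< > $@

Running make -j 8 then processes up to eight files in parallel, and make only
redoes files whose XML has changed since the last run.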

Do you have more than one machine available to run this on? Do you need more
than one machine because the job takes too long on just one? That's what Spark
excels at.
