Hi Simon, no, I don't need to run the tasks on multiple machines for now.
I will therefore stick to Makefile + shell or Java programs, as Spark appears not to be the right tool for the tasks I am trying to accomplish. Thank you for your input.

Philippe

----- Original Message -----
From: "Simon Hafner" <reactorm...@gmail.com>
To: "Philippe de Rochambeau" <phi...@free.fr>
Sent: Friday, 21 November 2014 09:47:25
Subject: Re: processing files

2014-11-21 1:46 GMT-06:00 Philippe de Rochambeau <phi...@free.fr>:
> - reads xml files in thousands of directories, two levels down, from year x
> to year y

You could try sc.wholeTextFiles(dirWithXML + "/*/*/*.xml") ... not guaranteed to work.

> - extracts data from <image> tags in those files and stores them in a SQL or
> NoSQL database

From what I understand, Spark expects the functions you pass to map() to be free of side effects. Writing to a database from inside map() is therefore probably not a good idea if you don't want duplicated records.

> - generates ImageMagick commands based on the extracted data to generate
> images

That's a plain data transformation, easy: collect() the results and save them.

> - generates curl commands to index the image files with Solr

Same as the ImageMagick commands.

> Does Spark provide any tools/features to facilitate and automate ("batchify")
> the above tasks?

Sure, but I wouldn't run the commands with Spark itself; they might be run twice or more.

> I can do all of the above with one or several Java programs, but I wondered
> if using Spark would be of any use in such an endeavour.

Personally, I'd use a Makefile, xmlstarlet for the XML parsing, and store the image paths in plaintext files instead of a database, getting parallelization via -j X. You could also run the ImageMagick and curl commands from there. But that naturally doesn't scale to multiple machines. Do you have more than one machine available to run this on? Do you need to run it on more than one machine, because it takes too long on just one? That's what Spark excels at.
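The Makefile approach Simon describes might look something like the sketch below. All names here are illustrative assumptions, not from the thread: it assumes the XML files live under data/<year>/<subdir>/*.xml, that each file contains <image> elements with a path attribute, and that xmlstarlet is installed. Extracted paths go to one plaintext file per XML file, and make -j provides the parallelism.

```make
# Hypothetical sketch of the Makefile pipeline, under the assumptions above.
XML   := $(shell find data -name '*.xml')
LISTS := $(XML:.xml=.paths)

all: $(LISTS)

# Extract the path attribute of every <image> tag into a plaintext file
# next to the source XML file.
%.paths: %.xml
	xmlstarlet sel -t -v '//image/@path' -n $< > $@
```

Run it as `make -j 8` to process eight files at a time; further rules for the ImageMagick and curl steps could depend on the .paths files in the same way.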
--------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org