Unsubscribing
Hello, does this mailing list have an administrator, please? I'm trying to unsubscribe, but to no avail. Many thanks.

- To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Spark equivalent to hdfs groups
Many thanks, Sean.

----- Original Message -----
From: "Sean Owen" <sro...@gmail.com>
To: phi...@free.fr
Cc: "User" <user@spark.apache.org>
Sent: Wednesday, September 7, 2022 17:05:55
Subject: Re: Spark equivalent to hdfs groups

No, because this is a storage concept, and Spark is not a storage system. You would appeal to the tools and interfaces that the storage system provides, such as hdfs. Where or how the hdfs binary is available depends on how you deploy Spark; it would be available on a Hadoop cluster. It's just not a Spark question.
Re: Spark equivalent to hdfs groups
Hi Sean,

I'm talking about HDFS groups. On Linux, you can type "hdfs groups user1" to get the list of the groups user1 belongs to. In Zeppelin/Spark, the hdfs executable is not accessible. As a result, I wondered whether there was a class in Spark (e.g. Security or ACL) that would let you access a particular user's groups.

----- Original Message -----
From: "Sean Owen" <sro...@gmail.com>
To: phi...@free.fr
Cc: "User" <user@spark.apache.org>
Sent: Wednesday, September 7, 2022 16:41:01
Subject: Re: Spark equivalent to hdfs groups

Spark isn't a storage system or a user-management system; no, there is no notion of groups (groups for what?)
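For reference, the "hdfs groups" CLI consults Hadoop's group-mapping service, and the Hadoop client API exposes the same lookup directly, so it can be called from driver code without shelling out. A minimal sketch, assuming the Hadoop client jars are on the Spark classpath and that group mapping (OS, LDAP, ...) is configured on the node where this runs; "user1" is an illustrative username:

```scala
import org.apache.hadoop.security.UserGroupInformation

// Resolve the groups for a given user via Hadoop's own group-mapping
// service -- the same source the `hdfs groups` command queries.
// Note: the result depends on the mapping configured on THIS host,
// which may differ from the NameNode's view in some setups.
val ugi = UserGroupInformation.createRemoteUser("user1")
ugi.getGroupNames.foreach(println)
```

This runs on the driver (e.g. in a Zeppelin paragraph), not inside a transformation; it is a sketch of one possible workaround, not a Spark feature.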
Spark equivalent to hdfs groups
Hello, is there a Spark equivalent to "hdfs groups"? Many thanks.

Philippe
Re: processing files
Hi Simon,

No, I don't need to run the tasks on multiple machines for now. I will therefore stick to a Makefile plus shell or Java programs, as Spark appears not to be the right tool for the tasks I am trying to accomplish. Thank you for your input.

Philippe

----- Original Message -----
From: Simon Hafner <reactorm...@gmail.com>
To: Philippe de Rochambeau <phi...@free.fr>
Sent: Friday, November 21, 2014 09:47:25
Subject: Re: processing files

2014-11-21 1:46 GMT-06:00 Philippe de Rochambeau <phi...@free.fr>:
> - reads xml files in thousands of directories, two levels down, from year x to year y

You could try sc.parallelize(new File(dirWithXML)).flatMap(sc.wholeTextFiles(_)) ... not guaranteed to work.

> - extracts data from image tags in those files and stores them in a SQL or NoSQL database

From what I understand, Spark expects no side effects from the functions you pass to map(). So that's probably not that good of an idea if you don't want duplicated records.

> - generates ImageMagick commands based on the extracted data to generate images

Data transformation, easy: collect() and save.

> - generates curl commands to index the image files with Solr

Same as ImageMagick.

> Does Spark provide any tools/features to facilitate and automate (batchify) the above tasks?

Sure, but I wouldn't run the commands with Spark. They might be run twice or more.

> I can do all of the above with one or several Java programs, but I wondered if using Spark would be of any use in such an endeavour.

Personally, I'd use a Makefile, xmlstarlet for the xml parsing, store the image paths in plain text instead of a database, and get parallelization via -j X. You could also run the imagemagick and curl commands from there. But that naturally doesn't scale to multiple machines. Do you have more than one machine available to run this? Do you need to run it on more than one machine, because it takes too long on just one? That's what Spark excels at.
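A side note on the snippet quoted above: as written it would not compile, since SparkContext methods such as wholeTextFiles cannot be called from inside a transformation, and wholeTextFiles already accepts a glob pattern covering multiple directory levels. A hedged sketch of the directory-scan step (the base path, glob depth, and tag regex are illustrative assumptions, not from the thread):

```scala
// Read every XML file two directory levels down, e.g. /archive/<year>/<batch>/.
// wholeTextFiles yields (path, fileContent) pairs and replaces the manual
// directory walk attempted above.
val xmlFiles = sc.wholeTextFiles("/archive/*/*/*.xml")

// Pull out <image ...> tags with a naive regex; a real XML parser
// (e.g. scala-xml) would be more robust against attributes and whitespace.
val imageTags = xmlFiles.flatMap { case (path, content) =>
  "<image[^>]*>".r.findAllIn(content).map(tag => (path, tag))
}
```

As Simon notes, the extraction itself parallelizes cleanly; it is the side-effecting steps (database writes, ImageMagick, curl) that can run more than once if a task is retried, so they are better driven from collected output.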