[ https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705278#comment-14705278 ]

Silas Davis commented on SPARK-3533:
------------------------------------

I've looked at various solutions, and have summarised what I found in my post 
here: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Writing-to-multiple-outputs-in-Spark-td13298.html.
 The Stack Overflow question linked only addresses multiple Text outputs, and 
only does so for Hadoop 1. My code synthesises the idea of using a wrapping 
OutputFormat with that of another gist that uses MultipleOutputs but modifies 
saveAsNewAPIHadoopFile. My code also makes do with the current Spark API, but 
it took enough effort, and the aim seems common enough, that I'd argue some of 
it should be moved into Spark itself.
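
For context, the kind of workaround the Stack Overflow answer relies on looks 
roughly like this (a minimal Scala sketch against the Hadoop 1 mapred API; the 
class and object names here are illustrative, not taken from my gist):

{code}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.x)

// Routes each record into a subdirectory named after its key.
// Old (mapred) API only, and text output only -- hence the limitation above.
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // Drop the key from the written line, keeping only the value.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()

  // e.g. key "N" and partition file "part-00000" -> "N/part-00000"
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name
}

object MultipleOutputsExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "multiple-outputs-example")
    val a = sc.parallelize(Seq("Nick", "Nancy", "Bob", "Ben", "Frankie"))
      .keyBy(_.substring(0, 1))
    // Writes /path/prefix/B/part-*, /path/prefix/F/part-*, /path/prefix/N/part-*
    a.saveAsHadoopFile("/path/prefix",
      classOf[String], classOf[String], classOf[RDDMultipleTextOutputFormat])
    sc.stop()
  }
}
{code}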

As for showing some code, my implementation is contained in the gist I have 
posted, and I have added it to the links attached to this ticket. I was hoping 
to get some comments on the code before embarking on a full pull request, 
which would require more consideration of where to place files, etc. I'm not 
sure if you're suggesting it would be better to make a pull request now, or 
whether the gist is sufficient; I will open a pull request if you prefer. Is 
there anything else I should be doing to get committer buy-in?

[~nchammas] Have you been able to take a look at the code? 
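
For reference, the rough shape such an API could take, built on the mapred-API 
workaround sketched above (again a hypothetical sketch, not the code from my 
gist; the SaveByKey and saveAsTextFileByKey names just follow this ticket's 
title):

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._  // pair-RDD implicits (Spark 1.x)

// Hypothetical convenience wrapper; it simply delegates to the
// RDDMultipleTextOutputFormat class sketched above.
object SaveByKey {
  implicit class SaveByKeyOps(val rdd: RDD[(String, String)]) extends AnyVal {
    def saveAsTextFileByKey(path: String): Unit =
      rdd.saveAsHadoopFile(path,
        classOf[String], classOf[String], classOf[RDDMultipleTextOutputFormat])
  }
}
{code}

With SaveByKey._ in scope, the ticket's example becomes 
a.saveAsTextFileByKey("/path/prefix"), producing one output directory per 
distinct key.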

> Add saveAsTextFileByKey() method to RDDs
> ----------------------------------------
>
>                 Key: SPARK-3533
>                 URL: https://issues.apache.org/jira/browse/SPARK-3533
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Nicholas Chammas
>
> Users often have a single RDD of key-value pairs that they want to save to 
> multiple locations based on the keys.
> For example, say I have an RDD like this:
> {code}
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
> >>> a.keys().distinct().collect()
> ['B', 'F', 'N']
> {code}
> Now I want to write the RDD out to different paths depending on the keys, so 
> that I have one output directory per distinct key. Each output directory 
> could potentially have multiple {{part-}} files, one per RDD partition.
> So the output would look something like:
> {code}
> /path/prefix/B [/part-1, /part-2, etc]
> /path/prefix/F [/part-1, /part-2, etc]
> /path/prefix/N [/part-1, /part-2, etc]
> {code}
> Though it may be possible to do this with some combination of 
> {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
> {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
> It's not clear if it's even possible at all in PySpark.
> Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
> that makes it easy to save RDDs out to multiple locations at once.


