[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

Silas Davis (JIRA) Thu, 20 Aug 2015 07:03:06 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704939#comment-14704939
 ]


Silas Davis commented on SPARK-3533:
------------------------------------

Another possibility is that we don't use Hadoop's MultipleOuputs stuff at all. 
Cascading doesn't, for example. Whether or not to take that route depends on 
how much MultipleOuputs buys us (I haven't looked much into its implementation 
or what it does), and how much of a code smell setting up dummy 
ReduceContextImpl here: 
https://gist.github.com/silasdavis/d1d1f1f7ab78249af462#file-multipleoutputs-scala-L59
 is.

> Add saveAsTextFileByKey() method to RDDs
> ----------------------------------------
>
>                 Key: SPARK-3533
>                 URL: https://issues.apache.org/jira/browse/SPARK-3533
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Nicholas Chammas
>
> Users often have a single RDD of key-value pairs that they want to save to 
> multiple locations based on the keys.
> For example, say I have an RDD like this:
> {code}
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
> >>> 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
> >>> a.keys().distinct().collect()
> ['B', 'F', 'N']
> {code}
> Now I want to write the RDD out to different paths depending on the keys, so 
> that I have one output directory per distinct key. Each output directory 
> could potentially have multiple {{part-}} files, one per RDD partition.
> So the output would look something like:
> {code}
> /path/prefix/B [/part-1, /part-2, etc]
> /path/prefix/F [/part-1, /part-2, etc]
> /path/prefix/N [/part-1, /part-2, etc]
> {code}
> Though it may be possible to do this with some combination of 
> {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
> {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
> It's not clear if it's even possible at all in PySpark.
> Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
> that makes it easy to save RDDs out to multiple locations at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

Reply via email to