Howdy doody Spark Users,

I’d like to somehow write out a single RDD to multiple paths in one go.
Here’s an example.

I have an RDD of (key, value) pairs like this:

>>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
>>> a.collect()
[('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]

Now I want to write the RDD out to different paths depending on the key, so
that I get one output directory per distinct key. Each output directory could
contain multiple part- files.

So my output would be something like:

/path/prefix/n [/part-1, /part-2, etc]
/path/prefix/b [/part-1, /part-2, etc]
/path/prefix/f [/part-1, /part-2, etc]

How would you do that?
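
The closest I've come on my own is the brute-force route below: collect the
distinct keys on the driver, then save each key's records with its own
saveAsTextFile call. It works (the names and paths just mirror my example
above), but it launches one Spark job per key, so it's N passes over the data
rather than one go:

# Brute-force fallback: one saveAsTextFile call per distinct key.
# This runs N Spark jobs, not one, which is exactly what I'd like to avoid.
keys = a.keys().distinct().collect()  # assumes the set of distinct keys is small
a.cache()                             # avoid recomputing the source RDD on each pass
for k in keys:
    (a.filter(lambda kv, k=k: kv[0] == k)  # bind k now to dodge Python's late-binding closures
      .values()
      .saveAsTextFile('/path/prefix/' + k.lower()))
a.unpersist()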

I suspect I need to use saveAsNewAPIHadoopFile
<http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html#saveAsNewAPIHadoopFile>
or saveAsHadoopFile
<http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html#saveAsHadoopFile>
along with the MultipleTextOutputFormat output format class, but I’m not
sure how.
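
If I'm reading the docs right, MultipleTextOutputFormat lives in the old
mapred API, so it would have to be saveAsHadoopFile rather than the new-API
variant, with a call shaped roughly like the sketch below. The catch is that
the path-per-key logic goes in an overridden generateFileNameForKeyValue,
which means a JVM-side subclass (the com.example.KeyBasedTextOutputFormat
name below is made up), and I don't see how to supply that from PySpark,
which is really the crux of my question:

# Rough shape of the call, assuming a JVM-side subclass of
# MultipleTextOutputFormat (the class name here is hypothetical)
# is already on the classpath:
a.saveAsHadoopFile(
    '/path/prefix',
    'com.example.KeyBasedTextOutputFormat',  # made-up subclass overriding generateFileNameForKeyValue
    keyClass='org.apache.hadoop.io.Text',
    valueClass='org.apache.hadoop.io.Text')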

By the way, there is a very similar question on Stack Overflow
<http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job>.

Nick