Howdy doody Spark Users,

I'd like to somehow write out a single RDD to multiple paths in one go. Here's an example.
I have an RDD of (key, value) pairs like this:

>>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
>>> a.collect()
[('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]

Now I want to write the RDD out to different paths depending on the keys, so that I have one output directory per distinct key. Each output directory could potentially have multiple part-* files. So my output would look something like this:

/path/prefix/n [/part-1, /part-2, etc]
/path/prefix/b [/part-1, /part-2, etc]
/path/prefix/f [/part-1, /part-2, etc]

How would you do that? I suspect I need to use saveAsNewAPIHadoopFile <http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html#saveAsNewAPIHadoopFile> or saveAsHadoopFile <http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html#saveAsHadoopFile> along with the MultipleTextOutputFormat output format class, but I'm not sure how.

By the way, there is a very similar question on Stack Overflow <http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job>.

Nick
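P.S. The only workaround I've come up with so far is the naive one sketched below: collect the distinct keys, then run one filter + saveAsTextFile per key. It works, but it launches a separate Spark job for every distinct key, which is exactly what I'm hoping to avoid with a single saveAsHadoopFile call. (The lowercased paths just mirror the example output layout above.)

# Naive per-key workaround: one filter + save per distinct key,
# i.e. one Spark job per key rather than one job total.
# Assumes `a` is the keyed RDD from the example above.
keys = a.keys().distinct().collect()
for k in keys:
    # Bind k as a default argument so each lambda captures its own key.
    (a.filter(lambda kv, k=k: kv[0] == k)
      .values()
      .saveAsTextFile('/path/prefix/' + k.lower()))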