One workaround would be to make multiple passes over the RDD, with each pass writing its own output?
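A minimal sketch of that multi-pass idea, assuming a local SparkContext — the base path, keys, and sample records here are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MultiPassWrite {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("multi-pass-write").setMaster("local[2]"))

    // Pair RDD keyed by (language, date), as in the example below.
    val records = sc.parallelize(Seq(
      (("en", "2015-07-18"), "user1"),
      (("en", "2015-07-19"), "user2"),
      (("de", "2015-07-18"), "user3")))

    records.cache() // avoid recomputing the whole RDD on every pass

    // One pass per distinct partition key: filter that key's records out
    // and write them to their own directory.
    for ((language, date) <- records.keys.distinct().collect()) {
      val key = (language, date)
      records
        .filter { case (k, _) => k == key }
        .values
        .saveAsTextFile(s"/tmp/output/$language/$date")
    }

    sc.stop()
  }
}
```

This launches one job per distinct key, so it is reasonable for a handful of keys but expensive when the key space is large.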
Or do it in a single pass with foreachPartition (opening multiple files per partition to write out)?

-Abhishek-

On Aug 14, 2015, at 7:56 AM, Silas Davis <si...@silasdavis.net> wrote:

> Would it be right to assume that the silence on this topic implies others
> don't really have this issue/desire?
>
> On Sat, 18 Jul 2015 at 17:24 Silas Davis <si...@silasdavis.net> wrote:
>
> tl;dr hadoop and cascading provide ways of writing tuples to multiple output
> files based on key, but the plain RDD interface doesn't seem to, and it should.
>
> I have been looking into ways to write to multiple outputs in Spark. It seems
> like a feature that is somewhat missing from Spark.
>
> The idea is to partition output and write the elements of an RDD to different
> locations depending on the key. For example, in a pair RDD your key may be
> (language, date, userId) and you would like to write separate files to
> $someBasePath/$language/$date. Then there would be a version of
> saveAsHadoopDataset that would be able to write to multiple locations based
> on the key using the underlying OutputFormat. Perhaps it would take a pair
> RDD with keys ($partitionKey, $realKey), so for example ((language, date), userId).
>
> The prior art I have found on this is the following.
>
> Using SparkSQL:
> The 'partitionBy' method of DataFrameWriter:
> https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
> This only works for parquet at the moment.
>
> Using Spark/Hadoop:
> This pull request (using the hadoop1 API):
> https://github.com/apache/spark/pull/4895/files
> This uses MultipleTextOutputFormat (which in turn uses MultipleOutputFormat),
> which is part of the old hadoop1 API.
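A hedged sketch of what that hadoop1 approach looks like in practice: subclass MultipleTextOutputFormat and override generateFileNameForKeyValue to route records by key. The class name, sample data, and paths below are illustrative, not taken from the pull request:

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Routes each record into a subdirectory derived from its key.
class KeyPartitionedOutput extends MultipleTextOutputFormat[Any, Any] {
  // "name" is the usual leaf file name, e.g. "part-00000".
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString + "/" + name

  // Suppress the key so only the value is written to the file.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

object KeyedSave {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("keyed-save").setMaster("local[2]"))

    val pairs = sc.parallelize(Seq(
      ("en/2015-07-18", "user1"),
      ("de/2015-07-18", "user3")))

    // Single pass: the OutputFormat fans records out by key.
    pairs.saveAsHadoopFile("/tmp/output", classOf[String], classOf[String],
      classOf[KeyPartitionedOutput])

    sc.stop()
  }
}
```

Unlike the multi-pass workaround, this writes everything in one job, but it is tied to the old mapred API and to text output.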
> It only works for text but could be generalised to any underlying
> OutputFormat by using MultipleOutputFormat (but only for hadoop1, which
> doesn't support ParquetAvroOutputFormat, for example).
>
> This gist (with the hadoop2 API):
> https://gist.github.com/mlehman/df9546f6be2e362bbad2
> This uses MultipleOutputs (available for both the old and new hadoop APIs)
> and extends saveAsNewHadoopDataset to support multiple outputs. Should work
> for any underlying OutputFormat. Probably better implemented by extending
> saveAs[NewAPI]HadoopDataset.
>
> In Cascading:
> Cascading provides PartitionTap:
> http://docs.cascading.org/cascading/2.5/javadoc/cascading/tap/local/PartitionTap.html
> to do this.
>
> So my questions are: is there a reason why Spark doesn't provide this? Does
> Spark provide similar functionality through some other mechanism? How would
> it best be implemented?
>
> Since I started composing this message I've had a go at writing a wrapper
> OutputFormat that writes multiple outputs using hadoop MultipleOutputs but
> doesn't require modification of the PairRDDFunctions. The principle is
> similar, however. Again, it feels slightly hacky to use dummy fields for the
> ReduceContextImpl, but some of this may be part of the impedance mismatch
> between Spark and plain Hadoop... Here is my attempt:
> https://gist.github.com/silasdavis/d1d1f1f7ab78249af462
>
> I'd like to see this functionality in Spark somehow but invite suggestions of
> how best to achieve it.
>
> Thanks,
> Silas
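For comparison, the DataFrameWriter.partitionBy route mentioned above reduces to a one-liner once the data is a DataFrame — a sketch against the Spark 1.4 API, with illustrative column names and paths, and (as noted) parquet-only output:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object PartitionByExample {
  // Illustrative schema matching the (language, date, userId) example.
  case class Event(language: String, date: String, userId: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-by").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(
      Event("en", "2015-07-18", "user1"),
      Event("de", "2015-07-18", "user3"))).toDF()

    // Writes to /tmp/output/language=en/date=2015-07-18/... and so on,
    // one directory per distinct (language, date) pair.
    df.write.partitionBy("language", "date").parquet("/tmp/output")

    sc.stop()
  }
}
```

Note the directory layout differs from the plain $language/$date scheme: partitionBy uses Hive-style column=value path segments.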