A workaround would be to make multiple passes over the RDD, with each pass 
writing its own output?

Or do it in a single pass with foreachPartition (opening multiple files per 
partition to write out)?
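
A rough sketch of that second approach (assuming records keyed as 
((language, date), userId) and a base path, both of which are just my examples):

import java.io.PrintWriter
import java.net.URI
import scala.collection.mutable
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext

// assuming `sc` is an existing SparkContext
val someBasePath = "hdfs:///tmp/multi-output"
val rdd = sc.parallelize(Seq(
  (("en", "2015-08-14"), "user-1"),
  (("de", "2015-08-14"), "user-2")))

rdd.foreachPartition { records =>
  val fs = FileSystem.get(URI.create(someBasePath), new Configuration())
  // suffix files with the partition id so concurrent tasks don't collide
  val partitionId = TaskContext.get().partitionId()
  val writers = mutable.Map.empty[String, PrintWriter]
  records.foreach { case ((language, date), userId) =>
    val dir = s"$someBasePath/$language/$date"
    val writer = writers.getOrElseUpdate(dir,
      new PrintWriter(fs.create(new Path(s"$dir/part-$partitionId"))))
    writer.println(userId)
  }
  writers.values.foreach(_.close())
}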

-Abhishek-

On Aug 14, 2015, at 7:56 AM, Silas Davis <si...@silasdavis.net> wrote:

> Would it be right to assume that the silence on this topic implies others 
> don't really have this issue/desire?
> 
> On Sat, 18 Jul 2015 at 17:24 Silas Davis <si...@silasdavis.net> wrote:
> tl;dr: Hadoop and Cascading provide ways of writing tuples to multiple output 
> files based on key, but the plain RDD interface doesn't seem to, and it should.
> 
> I have been looking into ways to write to multiple outputs in Spark. It seems 
> to be a feature that is largely missing from Spark.
> 
> The idea is to partition output and write the elements of an RDD to different 
> locations depending on the key. For example, in a pair RDD your key may be 
> (language, date, userId) and you would like to write separate files to 
> $someBasePath/$language/$date. Then there would be a version of 
> saveAsHadoopDataset that would be able to write to multiple locations based on 
> the key, using the underlying OutputFormat. Perhaps it would take a pair RDD 
> with keys ($partitionKey, $realKey), so for example ((language, date), userId).
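> 
> Just to illustrate the shape I mean (saveAsHadoopDatasetByKey is a made-up 
> name, it doesn't exist):
> 
> // assuming `events` is an RDD[(String, String, String)] of (language, date, userId)
> val byPartitionKey = events.map { case (language, date, userId) => ((language, date), userId) }
> // hypothetically: the first element of the key picks the output directory, the
> // second is the real key passed on to the underlying OutputFormat
> // byPartitionKey.saveAsHadoopDatasetByKey(jobConf) { case (language, date) =>
> //   s"$someBasePath/$language/$date" }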
> 
> The prior art I have found on this is the following.
> 
> Using SparkSQL:
> The 'partitionBy' method of DataFrameWriter: 
> https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
> 
> This only works for Parquet at the moment.
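> 
> For illustration (the column names and path are my own examples), the 1.4 
> usage looks roughly like:
> 
> import org.apache.spark.sql.SQLContext
> 
> val sqlContext = new SQLContext(sc)  // assuming `sc` is an existing SparkContext
> import sqlContext.implicits._
> 
> val someBasePath = "/tmp/multi-output"
> val df = sc.parallelize(Seq(
>   ("en", "2015-07-18", "user-1"),
>   ("de", "2015-07-18", "user-2"))).toDF("language", "date", "userId")
> 
> // writes $someBasePath/events/language=en/date=2015-07-18/part-*.parquet, etc.
> df.write.partitionBy("language", "date").parquet(s"$someBasePath/events")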
> 
> Using Spark/Hadoop:
> This pull request (using the hadoop1 API): 
> https://github.com/apache/spark/pull/4895/files. 
> 
> This uses MultipleTextOutputFormat (which in turn extends MultipleOutputFormat), 
> which is part of the old hadoop1 API. It only works for text, but could be 
> generalised to any underlying OutputFormat by using MultipleOutputFormat 
> directly (though only for hadoop1, which doesn't support 
> ParquetAvroOutputFormat, for example).
> 
> This gist (with the hadoop2 API): 
> https://gist.github.com/mlehman/df9546f6be2e362bbad2 
> 
> This uses MultipleOutputs (available for both the old and new hadoop APIs) and 
> extends saveAsNewAPIHadoopDataset to support multiple outputs. It should work 
> for any underlying OutputFormat. It is probably better implemented by extending 
> saveAs[NewAPI]HadoopDataset.
> 
> In Cascading:
> Cascading provides PartitionTap to do this: 
> http://docs.cascading.org/cascading/2.5/javadoc/cascading/tap/local/PartitionTap.html
> 
> So my questions are: is there a reason why Spark doesn't provide this? Does 
> Spark provide similar functionality through some other mechanism? How would 
> it be best implemented?
> 
> Since I started composing this message I've had a go at writing a wrapper 
> OutputFormat that writes multiple outputs using Hadoop's MultipleOutputs but 
> doesn't require modification of PairRDDFunctions. The principle is similar, 
> however. Again, it feels slightly hacky to use dummy fields for the 
> ReduceContextImpl, but some of this may be part of the impedance mismatch 
> between Spark and plain Hadoop... Here is my attempt: 
> https://gist.github.com/silasdavis/d1d1f1f7ab78249af462 
> 
> I'd like to see this functionality in Spark somehow, but I invite suggestions on 
> how best to achieve it.
> 
> Thanks,
> Silas
