This is already supported with the new partitioned data sources in DataFrame/SQL, right?
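For example, something along these lines should do it with the DataFrameWriter API (an untested sketch; `df`, the column names, and the base path are assumptions on my part):

    // Untested sketch: partition parquet output by year/month/day columns.
    // `df` is assumed to be a DataFrame with year, month and day columns.
    df.write
      .partitionBy("year", "month", "day")
      .parquet("/some/base/path")
    // produces directories like /some/base/path/year=2015/month=8/day=14/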
On Fri, Aug 14, 2015 at 8:04 AM, Alex Angelini <[email protected]> wrote:

> Speaking about Shopify's deployment, this would be a really nice-to-have feature.
>
> We would like to write data to folders with the structure `<year>/<month>/<day>`, but have had to hold off on that because of the lack of support for MultipleOutputs.
>
> On Fri, Aug 14, 2015 at 10:56 AM, Silas Davis <[email protected]> wrote:
>
>> Would it be right to assume that the silence on this topic implies others don't really have this issue/desire?
>>
>> On Sat, 18 Jul 2015 at 17:24 Silas Davis <[email protected]> wrote:
>>
>>> *tl;dr: Hadoop and Cascading provide ways of writing tuples to multiple output files based on key, but the plain RDD interface doesn't seem to, and it should.*
>>>
>>> I have been looking into ways to write to multiple outputs in Spark. It seems like a feature that is somewhat missing from Spark.
>>>
>>> The idea is to partition output and write the elements of an RDD to different locations based on the key. For example, in a pair RDD your key may be (language, date, userId) and you would like to write separate files to $someBasePath/$language/$date. Then there would be a version of saveAsHadoopDataset that would be able to write to multiple locations based on key using the underlying OutputFormat. Perhaps it would take a pair RDD with keys ($partitionKey, $realKey), so for example ((language, date), userId).
>>>
>>> The prior art I have found on this is the following.
>>>
>>> Using Spark SQL:
>>> The 'partitionBy' method of DataFrameWriter:
>>> https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
>>>
>>> This only works for parquet at the moment.
>>>
>>> Using Spark/Hadoop:
>>> This pull request (using the hadoop1 API):
>>> https://github.com/apache/spark/pull/4895/files
>>>
>>> This uses MultipleTextOutputFormat (which in turn uses MultipleOutputFormat), which is part of the old hadoop1 API. It only works for text, but it could be generalised to any underlying OutputFormat by using MultipleOutputFormat (though only for hadoop1, which doesn't support ParquetAvroOutputFormat, for example).
>>>
>>> This gist (with the hadoop2 API):
>>> https://gist.github.com/mlehman/df9546f6be2e362bbad2
>>>
>>> This uses MultipleOutputs (available for both the old and new hadoop APIs) and extends saveAsNewHadoopDataset to support multiple outputs. It should work for any underlying OutputFormat, and is probably better implemented by extending saveAs[NewAPI]HadoopDataset.
>>>
>>> In Cascading:
>>> Cascading provides PartitionTap to do this:
>>> http://docs.cascading.org/cascading/2.5/javadoc/cascading/tap/local/PartitionTap.html
>>>
>>> So my questions are: is there a reason why Spark doesn't provide this? Does Spark provide similar functionality through some other mechanism? How would it be best implemented?
>>>
>>> Since I started composing this message I've had a go at writing a wrapper OutputFormat that writes multiple outputs using Hadoop's MultipleOutputs but doesn't require modification of PairRDDFunctions. The principle is similar, however. It still feels slightly hacky to use dummy fields for the ReduceContextImpl, but some of this may be part of the impedance mismatch between Spark and plain Hadoop... Here is my attempt:
>>> https://gist.github.com/silasdavis/d1d1f1f7ab78249af462
>>>
>>> I'd like to see this functionality in Spark somehow, but I invite suggestions on how best to achieve it.
>>>
>>> Thanks,
>>> Silas
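For anyone following along, here is a minimal untested sketch of the hadoop1 MultipleTextOutputFormat route mentioned in the quoted thread (the class name, paths, and `rdd` below are illustrative assumptions, not code from the linked PR or gist):

    // Sketch only: route each (key, value) pair to a file under a
    // sub-directory derived from the key, via the old (hadoop1) API.
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    class KeyBasedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Put each record in <basePath>/<key>/part-XXXXX, where the key
      // encodes the partition path, e.g. "en/2015-08-14".
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name

      // Don't write the partition key into the output file, only the value.
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
    }

    // Usage sketch: `rdd` is assumed to be an RDD[((String, String), String)]
    // keyed by (language, date), with the record to write as the value.
    rdd.map { case ((language, date), record) => (s"$language/$date", record) }
       .saveAsHadoopFile("/some/base/path",
                         classOf[String], classOf[String],
                         classOf[KeyBasedOutputFormat])

As the thread notes, this only covers text output and the old API, so it doesn't help with formats like ParquetAvroOutputFormat; the DataFrame partitionBy route or a MultipleOutputs-based extension would be needed for those.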
