Re: Writing to multiple outputs in Spark

2015-08-17 Thread Silas Davis
@Reynold Xin: not really: it only works for Parquet (see partitionBy: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter), and it requires you to have a DataFrame in the first place (for my use case the Spark SQL interface to Avro records is more of a …
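For context, the DataFrameWriter API under discussion looks roughly like this. A minimal sketch, assuming a SparkContext `sc`, an illustrative input file, and a column named "event_type" (all hypothetical):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Illustrative input; any DataFrame with an "event_type" column would do.
val df = sqlContext.read.json("/tmp/events.json")

// partitionBy lays the output out as one directory per distinct key value,
// e.g. /tmp/out/event_type=click/part-...; at the time of this thread it
// was only wired up for the Parquet data source.
df.write.partitionBy("event_type").parquet("/tmp/out")
```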

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Abhishek R. Singh
A workaround would be to have multiple passes on the RDD, with each pass writing its own output? Or do it in a single pass with a foreachPartition (opening multiple files per partition to write out)? -Abhishek- On Aug 14, 2015, at 7:56 AM, Silas Davis si...@silasdavis.net wrote: Would it be right …
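A rough sketch of the single-pass foreachPartition idea, assuming a pair RDD[(String, String)] named `rdd` and executor-local output under /tmp/out (both illustrative; a real job would write through Hadoop's FileSystem API so the files land somewhere durable):

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable

rdd.foreachPartition { iter =>
  new File("/tmp/out").mkdirs()
  // One open writer per key seen in this partition; the random suffix
  // keeps concurrent tasks from clobbering each other's files.
  val writers = mutable.Map.empty[String, PrintWriter]
  try {
    iter.foreach { case (key, value) =>
      val writer = writers.getOrElseUpdate(key,
        new PrintWriter(new File(s"/tmp/out/$key-${java.util.UUID.randomUUID}.txt")))
      writer.println(value)
    }
  } finally {
    writers.values.foreach(_.close())
  }
}
```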

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Reynold Xin
This is already supported with the new partitioned data sources in DataFrame/SQL, right? On Fri, Aug 14, 2015 at 8:04 AM, Alex Angelini alex.angel...@shopify.com wrote: Speaking about Shopify's deployment, this would be a really nice feature to have. We would like to write data to folders …

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Nicholas Chammas
See: https://issues.apache.org/jira/browse/SPARK-3533 Feel free to comment there and make a case if you think the issue should be reopened. Nick On Fri, Aug 14, 2015 at 11:11 AM Abhishek R. Singh abhis...@tetrationanalytics.com wrote: A workaround would be to have multiple passes on the RDD …

Writing to multiple outputs in Spark

2015-07-18 Thread Silas Davis
*tl;dr: Hadoop and Cascading provide ways of writing tuples to multiple output files based on key, but the plain RDD interface doesn't seem to, and it should.* I have been looking into ways to write to multiple outputs in Spark. It seems like a feature that is somewhat missing from Spark. The …
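For reference, the usual RDD-level workaround at the time was to subclass Hadoop's MultipleTextOutputFormat and route records by key through saveAsHadoopFile. A minimal sketch, assuming a pair RDD[(String, String)] named `rdd` and an illustrative output path:

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner

// Writes each record into a subdirectory named after its key, and
// drops the key itself from the output lines.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    s"${key.toString}/$name"
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

rdd
  .partitionBy(new HashPartitioner(8)) // group each key's records together
  .saveAsHadoopFile("/tmp/out", classOf[String], classOf[String],
    classOf[KeyBasedOutput])
```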