Hi, thanks for the reply!
On Mon, Dec 5, 2016 at 1:30 PM, Michael Armbrust <mich...@databricks.com> wrote:

> If you repartition($"column") and then do .write.partitionBy("column") you
> should end up with a single file for each value of the partition column.

I have two concerns there:

1. In my case, I'd need to first explode my data by ~12x to assign each
record to multiple 12-month rolling output windows. I'm not sure Spark SQL
would be able to optimize this away and combine it with the output writing
so that it happens incrementally. (There's a rough sketch of what I mean at
the bottom of this mail.)

2. Wouldn't each partition -- a window, in my case -- be shuffled to a
single machine and then written together as one output shard? For a large
amount of data per window, that seems less than ideal.

> On Mon, Dec 5, 2016 at 10:59 AM, Everett Anderson <ever...@nuna.com.invalid> wrote:
>
>> Hi,
>>
>> I have a DataFrame of records with dates, and I'd like to write all
>> 12-month (with overlap) windows to separate outputs.
>>
>> Currently, I have a loop equivalent to:
>>
>> for ((windowStart, windowEnd) <- windows) {
>>   val windowData = allData.filter(
>>     getFilterCriteria(windowStart, windowEnd))
>>   windowData.write.format(...).save(...)
>> }
>>
>> This works fine, but it has the drawback that since Spark doesn't
>> parallelize the writes, there is a fairly high cost that scales with the
>> number of windows.
>>
>> Is there a way around this?
>>
>> In MapReduce, I'd probably multiply the data in a Mapper with a window ID
>> and then maybe use something like MultipleOutputs
>> <https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html>.
>> But I'm a bit worried about trying to do this in Spark because of the
>> data explosion and RAM use. What's the best approach?
>>
>> Thanks!
>>
>> - Everett
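
For concreteness, here's a rough sketch of the explode + repartition +
partitionBy approach I was describing, combining your suggestion with the
windowing. The windowIds helper, the "date" column name, and the output
path are made up for illustration; the real window logic would come from
our schema:

import java.time.LocalDate
import org.apache.spark.sql.functions._

// Hypothetical helper (illustration only): returns an ID for every
// 12-month rolling window containing the given date, keyed by the
// window's start month. A record dated 2016-03-15 would get the windows
// starting 2015-04 through 2016-03, i.e. ~12 IDs per record.
val windowIds = udf { (date: java.sql.Date) =>
  val d = date.toLocalDate
  (0 until 12).map { i =>
    val start = d.minusMonths(i.toLong).withDayOfMonth(1)
    f"${start.getYear}%04d-${start.getMonthValue}%02d"
  }
}

// Explode each record into one row per window (~12x the data), then
// shuffle by window so partitionBy writes one directory per window.
allData
  .withColumn("window_id", explode(windowIds(col("date"))))
  .repartition(col("window_id"))
  .write
  .partitionBy("window_id")
  .format("parquet")          // or whatever format we're using
  .save("/path/to/output")    // placeholder path

Though per concern 2, I believe this still funnels each window through a
single partition. Something like repartition(n, col("window_id"),
col("someOtherKey")) might spread each window across tasks, at the cost of
multiple output files per window directory.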