> 1. In my case, I'd need to first explode my data by ~12x to assign each
> record to multiple 12-month rolling output windows. I'm not sure Spark SQL
> would be able to optimize this away, combining it with the output writing
> to do it incrementally.
You are right, but I wouldn't worry about the RAM use. If implemented properly (or if you just use the built-in window <https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#window(org.apache.spark.sql.Column,%20java.lang.String,%20java.lang.String)> function), it should all be pipelined.

> 2. Wouldn't each partition -- window in my case -- be shuffled to a single
> machine and then written together as one output shard? For a large amount
> of data per window, that seems less than ideal.

Oh, sorry -- I thought you wanted one file per value. If you drop the repartition, it won't shuffle; it will just write in parallel on each machine.
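For reference, a minimal sketch of what the built-in sliding-window approach could look like. The input path, column names (`ts`, `value`), and aggregation are hypothetical; note also that `window()` takes fixed-length durations, so "365 days" sliding by "30 days" is only an approximation of calendar 12-month windows:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{window, sum}

val spark = SparkSession.builder.appName("rolling-windows").getOrCreate()
import spark.implicits._

// Hypothetical input: records with a timestamp column "ts" and a metric "value".
val df = spark.read.parquet("events.parquet")

// A sliding window assigns each record to up to ~12 overlapping windows.
// Spark handles the implied ~12x expansion internally and pipelines it,
// rather than requiring you to explode the data up front.
val rolled = df
  .groupBy(window($"ts", "365 days", "30 days"))
  .agg(sum($"value").as("total"))

// Writing without a repartition keeps the write parallel on each executor;
// adding .repartition($"window") first would instead shuffle each window
// to a single machine and produce one shard per window.
rolled.write.parquet("rolling-output")
```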