On Mon, Dec 5, 2016 at 5:33 PM, Michael Armbrust <mich...@databricks.com> wrote:
>> 1. In my case, I'd need to first explode my data by ~12x to assign each
>> record to multiple 12-month rolling output windows. I'm not sure Spark SQL
>> would be able to optimize this away, combining it with the output writing
>> to do it incrementally.
>
> You are right, but I wouldn't worry about the RAM use. If implemented
> properly (or if you just use the builtin window
> <https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#window(org.apache.spark.sql.Column,%20java.lang.String,%20java.lang.String)>
> function), it should all be pipelined.

Very cool! Will give it a go. I'm still on Spark 1.6.x, so I hadn't seen
that function either!

>> 2. Wouldn't each partition -- window in my case -- be shuffled to a
>> single machine and then written together as one output shard? For a large
>> amount of data per window, that seems less than ideal.
>
> Oh sorry, I thought you wanted one file per value. If you drop the
> repartition then it won't shuffle, but will just write in parallel on
> each machine.

Thanks!
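
P.S. For the archives, here's a rough sketch of how I read the builtin
window function (Spark 2.0, Scala). The column names, paths, and the
aggregation are all made up, and since month-based intervals don't seem
to be supported there as far as I can tell, I'm approximating 12 months
as 365 days:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{window, sum}

    val spark = SparkSession.builder().appName("rolling-windows").getOrCreate()
    import spark.implicits._

    // Hypothetical input: events with a timestamp column "ts" and a
    // numeric column "value".
    val events = spark.read.parquet("/path/to/events")

    // window(timeColumn, windowDuration, slideDuration) assigns each
    // record to every overlapping window -- the ~12x assignment from
    // question 1 -- but Spark pipelines this instead of materializing
    // an exploded copy of the data.
    val rolled = events
      .groupBy(window($"ts", "365 days", "30 days"))
      .agg(sum($"value").as("total"))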
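
And for question 2, my understanding of the "drop the repartition"
advice, again with made-up names: each task just writes files for
whatever partition values it holds, so the write stays parallel (the
trade-off being potentially many small files per partition directory):

    import org.apache.spark.sql.functions.date_format

    // Derive a partition column from the window start (hypothetical
    // "yyyy-MM" layout).
    rolled
      .withColumn("month", date_format($"window.start", "yyyy-MM"))
      .write
      .partitionBy("month") // no repartition first, so no shuffle
      .parquet("/path/to/output")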