>
> 1. In my case, I'd need to first explode my data by ~12x to assign each
> record to multiple 12-month rolling output windows. I'm not sure Spark SQL
> would be able to optimize this away, combining it with the output writing
> to do it incrementally.
>

You are right, but I wouldn't worry about the RAM use. If implemented
properly (or if you just use the built-in window
<https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#window(org.apache.spark.sql.Column,%20java.lang.String,%20java.lang.String)>
function), it should all be pipelined.
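
Roughly something like this (a minimal sketch -- the column names and paths
are made up, and since window() only takes fixed-length durations the 12
months are approximated as 365 days sliding every 30 days):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{window, sum, col}

    val spark = SparkSession.builder.appName("RollingWindows").getOrCreate()

    // Hypothetical input: records with an eventTime timestamp and an amount.
    val events = spark.read.parquet("hdfs:///path/to/events")

    // The built-in window() function assigns each record to its overlapping
    // windows; the aggregation is computed incrementally, so the ~12 copies
    // of each record are never all materialized in memory at once.
    val rolled = events
      .groupBy(window(col("eventTime"), "365 days", "30 days"))
      .agg(sum(col("amount")).as("total"))

    rolled.write.parquet("hdfs:///path/to/output")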


> 2. Wouldn't each partition -- window in my case -- be shuffled to a single
> machine and then written together as one output shard? For a large amount
> of data per window, that seems less than ideal.
>

Oh sorry, I thought you wanted one file per value. If you drop the
repartition then nothing gets shuffled; the data is just written in parallel
on each machine.
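
Continuing the sketch above, roughly (the window_start column name and
output path are again made up):

    // Flatten the window struct into a plain column for partitioned output.
    // With no repartition() call, nothing is shuffled: each executor writes
    // files for whatever window values its own partitions hold, in parallel.
    val byWindow = rolled.withColumn("window_start", col("window.start"))

    byWindow.write
      .partitionBy("window_start")
      .parquet("hdfs:///path/to/output_by_window")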
