On Mon, Dec 5, 2016 at 5:33 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> 1. In my case, I'd need to first explode my data by ~12x to assign each
>> record to multiple 12-month rolling output windows. I'm not sure Spark SQL
>> would be able to optimize this away, combining it with the output writing
>> to do it incrementally.
>>
>
> You are right, but I wouldn't worry about the RAM use.  If implemented
> properly (or if you just use the builtin window
> <https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html#window(org.apache.spark.sql.Column,%20java.lang.String,%20java.lang.String)>
> function), it should all be pipelined.
>


Very cool! I'll give it a go. I'm still on Spark 1.6.x, so I hadn't seen
that function before!
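
For reference, here's the rough (untested) shape of what I'm planning to
try once I upgrade -- the column names (ts, amount) and the input path are
placeholders for my actual schema, and I'm using day-based durations
rather than calendar months:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{window, sum}

  val spark = SparkSession.builder().appName("rolling-windows").getOrCreate()
  import spark.implicits._

  val events = spark.read.parquet("/path/to/events")  // placeholder input

  // Sliding windows: each record is assigned to every window covering its
  // timestamp, so there's no manual ~12x explode step on my side.
  val rolled = events
    .groupBy(window($"ts", "365 days", "30 days"))
    .agg(sum($"amount").as("total"))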


>
>
>> 2. Wouldn't each partition -- window in my case -- be shuffled to a
>> single machine and then written together as one output shard? For a large
>> amount of data per window, that seems less than ideal.
>>
>
> Oh sorry, I thought you wanted one file per value.  If you drop the
> repartition then it won't shuffle, but will just write in parallel on
> each machine.
>
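Got it. So, continuing the untested sketch above, just partitionBy on the
writer with no explicit repartition beforehand -- output path again a
placeholder:

  // Pull the window start out of the struct so it can be a partition column.
  rolled
    .withColumn("window_start", $"window.start")
    .write
    .partitionBy("window_start")  // one directory per window, many files each
    .parquet("/path/to/output")   // placeholder output path
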

Thanks!
