Hi,

Thanks for the reply!

On Mon, Dec 5, 2016 at 1:30 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> If you repartition($"column") and then do .write.partitionBy("column") you
> should end up with a single file for each value of the partition column.
>
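
Just to make sure I'm reading that right, something like this? (A rough
sketch; assuming my partition column is named "window" and
spark.implicits._ is in scope; format and path are placeholders.)

allData
  .repartition($"window")        // all rows for a given window land in one partition
  .write
  .partitionBy("window")         // one subdirectory (and thus one file) per window value
  .format("parquet")             // placeholder format
  .save("/some/output/path")     // placeholder path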

I have two concerns there:

1. In my case, I'd first need to explode my data by roughly 12x to assign
each record to every 12-month rolling output window it falls into (rough
sketch below). I'm not sure Spark SQL could optimize that expansion away by
combining it with the output writing so that it happens incrementally.

2. Wouldn't each partition -- window in my case -- be shuffled to a single
machine and then written together as one output shard? For a large amount
of data per window, that seems less than ideal.
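
Concretely, for (1) I'm picturing something like this rough sketch (assuming
a java.sql.Date column named "date", spark.implicits._ in scope, and each
window identified by its starting month; the helper and names are made up):

import java.sql.Date
import org.apache.spark.sql.functions._

// Hypothetical window assignment: a record dated in month M falls into the
// 12 rolling windows that start in months M-11 through M, so emit one window
// ID per window, keyed by the window's starting month ("yyyy-MM").
// Null dates aren't handled here.
val windowIds = udf { (d: Date) =>
  val firstOfMonth = d.toLocalDate.withDayOfMonth(1)
  (0 to 11).map { back =>
    val start = firstOfMonth.minusMonths(back)
    f"${start.getYear}%04d-${start.getMonthValue}%02d"
  }
}

// ~12x the rows: one per (record, window) pair, which would then feed the
// repartition/partitionBy write above.
val exploded = allData.withColumn("window", explode(windowIds($"date")))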


>
> On Mon, Dec 5, 2016 at 10:59 AM, Everett Anderson <
> ever...@nuna.com.invalid> wrote:
>
>> Hi,
>>
>> I have a DataFrame of records with dates, and I'd like to write all
>> 12-month (with overlap) windows to separate outputs.
>>
>> Currently, I have a loop equivalent to:
>>
>> for ((windowStart, windowEnd) <- windows) {
>>     val windowData = allData.filter(
>>         getFilterCriteria(windowStart, windowEnd))
>>     windowData.write.format(...).save(...)
>> }
>>
>> This works fine, but it has the drawback that Spark doesn't parallelize
>> the writes across windows, so the total cost grows with the number of
>> windows.
>>
>> Is there a way around this?
>>
>> In MapReduce, I'd probably multiply the data in a Mapper with a window ID
>> and then maybe use something like MultipleOutputs
>> <https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html>.
>> But I'm a bit worried about trying to do this in Spark because of the data
>> explosion and RAM use. What's the best approach?
>>
>> Thanks!
>>
>> - Everett
>>
>>
>
