Re: Prevent Shuffling on Writing Files

Reuven Lax Wed, 18 Sep 2019 14:06:17 -0700

Are you using streaming or batch? Also which runner are you using?

On Wed, Sep 18, 2019 at 1:57 PM Shannon Duncan <[email protected]>
wrote:


> So I followed up on why TextIO shuffles and dug into the code some. It is
> using the shards and getting all the values into a keyed group to write to
> a single file.
>
> However... I wonder if there is way to just take the records that are on a
> worker and write them out. Thus not needing a shard number and doing this.
> Closer to how hadoop handle's writes.
>
> Maybe just a regular pardo and on bundleSetup it creates a writer and
> processElement reuses that writter to write to the same file for all
> elements within a bundle?
>
> I feel like this goes beyond scope of simple user mailing list so I'm
> expanding it to dev as well.
> +dev <[email protected]>
>
> Finding a solution that prevents quadrupling shuffle costs when simply
> writing out a file is a necessity for large scale jobs that work with 100+
> TB of data. If anyone has any ideas I'd love to hear them.
>
> Thanks,
> Shannon Duncan
>
> On Wed, Sep 18, 2019 at 1:06 PM Shannon Duncan <[email protected]>
> wrote:
>
>> We have been using Beam for a bit now. However we just turned on the
>> dataflow shuffle service and were very surprised that the shuffled data
>> amounts were quadruple the amounts we expected.
>>
>> Turns out that the file writing TextIO is doing shuffles within itself.
>>
>> Is there a way to prevent shuffling in the writing phase?
>>
>> Thanks,
>> Shannon Duncan
>>
>

Re: Prevent Shuffling on Writing Files

Reply via email to