Are you using streaming or batch? Also which runner are you using?

On Wed, Sep 18, 2019 at 1:57 PM Shannon Duncan <joseph.dun...@liveramp.com>
wrote:

> So I followed up on why TextIO shuffles and dug into the code some. It is
> using the shards and getting all the values into a keyed group to write to
> a single file.
>
> However... I wonder if there is way to just take the records that are on a
> worker and write them out. Thus not needing a shard number and doing this.
> Closer to how hadoop handle's writes.
>
> Maybe just a regular pardo and on bundleSetup it creates a writer and
> processElement reuses that writter to write to the same file for all
> elements within a bundle?
>
> I feel like this goes beyond scope of simple user mailing list so I'm
> expanding it to dev as well.
> +dev <dev@beam.apache.org>
>
> Finding a solution that prevents quadrupling shuffle costs when simply
> writing out a file is a necessity for large scale jobs that work with 100+
> TB of data. If anyone has any ideas I'd love to hear them.
>
> Thanks,
> Shannon Duncan
>
> On Wed, Sep 18, 2019 at 1:06 PM Shannon Duncan <joseph.dun...@liveramp.com>
> wrote:
>
>> We have been using Beam for a bit now. However we just turned on the
>> dataflow shuffle service and were very surprised that the shuffled data
>> amounts were quadruple the amounts we expected.
>>
>> Turns out that the file writing TextIO is doing shuffles within itself.
>>
>> Is there a way to prevent shuffling in the writing phase?
>>
>> Thanks,
>> Shannon Duncan
>>
>

Reply via email to