So I followed up on why TextIO shuffles and dug into the code some. It is
using the shards and getting all the values into a keyed group to write to
a single file.

However... I wonder if there is way to just take the records that are on a
worker and write them out. Thus not needing a shard number and doing this.
Closer to how hadoop handle's writes.

Maybe just a regular pardo and on bundleSetup it creates a writer and
processElement reuses that writter to write to the same file for all
elements within a bundle?

I feel like this goes beyond scope of simple user mailing list so I'm
expanding it to dev as well.
+dev <dev@beam.apache.org>

Finding a solution that prevents quadrupling shuffle costs when simply
writing out a file is a necessity for large scale jobs that work with 100+
TB of data. If anyone has any ideas I'd love to hear them.

Thanks,
Shannon Duncan

On Wed, Sep 18, 2019 at 1:06 PM Shannon Duncan <joseph.dun...@liveramp.com>
wrote:

> We have been using Beam for a bit now. However we just turned on the
> dataflow shuffle service and were very surprised that the shuffled data
> amounts were quadruple the amounts we expected.
>
> Turns out that the file writing TextIO is doing shuffles within itself.
>
> Is there a way to prevent shuffling in the writing phase?
>
> Thanks,
> Shannon Duncan
>

Reply via email to