So I followed up on why TextIO shuffles and dug into the code a bit. It keys every value by its shard number and groups them, so that all the values for a shard can be written to a single file.
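To make the grouping concrete, here is a minimal sketch of the pattern described above, in plain Java rather than the actual Beam internals (the class and method names are hypothetical): elements are keyed by a shard number and grouped, and each group becomes one output file. In Beam that grouping step is a GroupByKey, which is what forces the shuffle.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only -- not the real TextIO code.
class ShardedWrite {
    // Assign each element a shard key and collect the groups.
    // In Beam, materializing these groups is the shuffle.
    static Map<Integer, List<String>> groupByShard(List<String> elements, int numShards) {
        Map<Integer, List<String>> shards = new HashMap<>();
        for (String e : elements) {
            int shard = Math.floorMod(e.hashCode(), numShards); // shard key
            shards.computeIfAbsent(shard, k -> new ArrayList<>()).add(e);
        }
        return shards;
    }

    // Write one file per shard group.
    static List<Path> writeShards(Map<Integer, List<String>> shards,
                                  Path dir, int numShards) throws IOException {
        List<Path> files = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> entry : shards.entrySet()) {
            Path f = dir.resolve(String.format("part-%05d-of-%05d.txt",
                    entry.getKey(), numShards));
            Files.write(f, entry.getValue());
            files.add(f);
        }
        return files;
    }
}
```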
However, I wonder if there is a way to simply take the records that are already on a worker and write them out directly, without assigning a shard number at all. That would be closer to how Hadoop handles writes. Maybe just a regular ParDo where @StartBundle creates a writer and @ProcessElement reuses that writer, so all elements within a bundle go to the same file?

I feel like this goes beyond the scope of the user mailing list, so I'm expanding it to dev as well. +dev <dev@beam.apache.org>

Finding a solution that avoids quadrupling shuffle costs when simply writing out a file is a necessity for large-scale jobs that work with 100+ TB of data. If anyone has any ideas, I'd love to hear them.

Thanks,
Shannon Duncan

On Wed, Sep 18, 2019 at 1:06 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:

> We have been using Beam for a bit now. However, we just turned on the
> Dataflow shuffle service and were very surprised that the shuffled data
> amounts were quadruple what we expected.
>
> It turns out that TextIO is doing shuffles within itself when writing files.
>
> Is there a way to prevent shuffling in the writing phase?
>
> Thanks,
> Shannon Duncan
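For reference, the bundle-scoped writer lifecycle suggested above can be sketched like this. This is plain java.io standing in for Beam's file-writing APIs, and the class and file names are hypothetical; in a real DoFn the three methods would carry the @StartBundle, @ProcessElement, and @FinishBundle annotations.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical sketch: one writer per bundle, no shard keys, no shuffle.
// Each bundle writes to its own uniquely named file.
class BundleFileWriter {
    private BufferedWriter writer;
    private Path outFile;

    // Analogous to @StartBundle: open a uniquely named file for this bundle.
    void startBundle(Path outputDir) throws IOException {
        outFile = outputDir.resolve("part-" + UUID.randomUUID() + ".txt");
        writer = Files.newBufferedWriter(outFile);
    }

    // Analogous to @ProcessElement: reuse the same writer for every element.
    void processElement(String element) throws IOException {
        writer.write(element);
        writer.newLine();
    }

    // Analogous to @FinishBundle: flush, close, and hand back the file.
    Path finishBundle() throws IOException {
        writer.close();
        return outFile;
    }
}
```

The trade-off, as with Hadoop-style writes, is that the number and size of output files then depends on how the runner happens to split bundles, rather than on a requested shard count.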