Are you specifying the number of shards to write to? https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L859
If so, this will incur an additional shuffle to redistribute the data written by all workers into the given number of shards before writing. In addition to that, I think we also run Reshuffle transforms on the set of files to break fusion when finalizing files, but that cost should not be significant. Posting a sketch of your pipeline would probably be helpful.

Thanks,
Cham

On Wed, Sep 18, 2019 at 1:57 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:

> So I followed up on why TextIO shuffles and dug into the code some. It is
> using the shards and getting all the values into a keyed group to write to
> a single file.
>
> However... I wonder if there is a way to just take the records that are on
> a worker and write them out, thus not needing a shard number. Closer to
> how Hadoop handles writes.
>
> Maybe just a regular ParDo that creates a writer on bundle setup, and
> processElement reuses that writer to write all elements within a bundle to
> the same file?
>
> I feel like this goes beyond the scope of the user mailing list, so I'm
> expanding it to dev as well.
> +dev <dev@beam.apache.org>
>
> Finding a solution that prevents quadrupling shuffle costs when simply
> writing out a file is a necessity for large-scale jobs that work with 100+
> TB of data. If anyone has any ideas, I'd love to hear them.
>
> Thanks,
> Shannon Duncan
>
> On Wed, Sep 18, 2019 at 1:06 PM Shannon Duncan <joseph.dun...@liveramp.com>
> wrote:
>
>> We have been using Beam for a bit now. However, we just turned on the
>> Dataflow shuffle service and were very surprised that the shuffled data
>> amounts were quadruple what we expected.
>>
>> It turns out that the file-writing TextIO is doing shuffles within itself.
>>
>> Is there a way to prevent shuffling in the writing phase?
>>
>> Thanks,
>> Shannon Duncan
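
The per-bundle-writer ParDo described above could be sketched roughly as below. This is a hand-written sketch, not Beam's actual TextIO implementation: the class name, the `outputDir` parameter, and the file-naming scheme are all assumptions, and a real version would need to handle bundle retries (a reprocessed bundle would leave a duplicate or partial file behind, which is exactly what TextIO's Reshuffle-based finalization step guards against).

```java
// Sketch only: writes one uniquely named file per bundle, with no shuffle.
// Does NOT handle bundle retries or atomic finalization.
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

import org.apache.beam.sdk.transforms.DoFn;

public class WritePerBundleFn extends DoFn<String, Void> {
  private final String outputDir;  // assumed worker-accessible path
  private transient PrintWriter writer;

  public WritePerBundleFn(String outputDir) {
    this.outputDir = outputDir;
  }

  @StartBundle
  public void startBundle() throws IOException {
    // Open a fresh file for this bundle; UUID avoids collisions
    // between bundles running on different workers.
    Path file = Paths.get(outputDir, "part-" + UUID.randomUUID() + ".txt");
    writer = new PrintWriter(
        Files.newBufferedWriter(file, StandardCharsets.UTF_8));
  }

  @ProcessElement
  public void processElement(@Element String line) {
    // Append each element to the bundle's file -- no keyed grouping.
    writer.println(line);
  }

  @FinishBundle
  public void finishBundle() {
    writer.close();
  }
}
```

Applied as `input.apply(ParDo.of(new WritePerBundleFn("/path/to/out")))`, this avoids the redistribution shuffle, but the trade-off is an unpredictable number of output files (one per bundle) and no finalize/rename step to make the output atomic.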