Are you using streaming or batch? Also, which runner are you using?

On Wed, Sep 18, 2019 at 1:57 PM Shannon Duncan <joseph.dun...@liveramp.com> wrote:
> So I followed up on why TextIO shuffles and dug into the code some. It is
> using the shards and getting all the values into a keyed group to write to
> a single file.
>
> However... I wonder if there is a way to just take the records that are on a
> worker and write them out, thus not needing a shard number or the keyed
> grouping. That would be closer to how Hadoop handles writes.
>
> Maybe just a regular ParDo where bundleSetup creates a writer and
> processElement reuses that writer to write to the same file for all
> elements within a bundle?
>
> I feel like this goes beyond the scope of the user mailing list, so I'm
> expanding it to dev as well.
> +dev <dev@beam.apache.org>
>
> Finding a solution that prevents quadrupling shuffle costs when simply
> writing out a file is a necessity for large-scale jobs that work with 100+
> TB of data. If anyone has any ideas, I'd love to hear them.
>
> Thanks,
> Shannon Duncan
>
> On Wed, Sep 18, 2019 at 1:06 PM Shannon Duncan <joseph.dun...@liveramp.com>
> wrote:
>
>> We have been using Beam for a bit now. However, we just turned on the
>> Dataflow shuffle service and were very surprised that the shuffled data
>> amounts were quadruple the amounts we expected.
>>
>> It turns out that the file-writing TextIO is doing shuffles within itself.
>>
>> Is there a way to prevent shuffling in the writing phase?
>>
>> Thanks,
>> Shannon Duncan
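
For reference, the per-bundle writer idea in the quoted message could be sketched roughly as below. This is only an illustration, not how TextIO is implemented. To stay runnable without Beam installed, it uses a plain Python class whose method names mirror the `start_bundle` / `process` / `finish_bundle` hooks of the Python SDK's `apache_beam.DoFn`. The class name `PerBundleFileWriter`, the helper `run_bundles`, and the `part-<uuid>.txt` naming scheme are all assumptions made up for this sketch.

```python
import os
import tempfile
import uuid
from pathlib import Path


class PerBundleFileWriter:
    """Stand-in for a Beam DoFn that writes one file per bundle, with no shuffle.

    Assumption: a real version would subclass apache_beam.DoFn; the three
    lifecycle method names below mirror that class.
    """

    def __init__(self, output_dir):
        self.output_dir = output_dir  # hypothetical destination directory
        self._tmp_path = None
        self._handle = None

    def start_bundle(self):
        # Open a uniquely named temp file so that a retried bundle never
        # overwrites the output of another attempt.
        self._tmp_path = os.path.join(self.output_dir, f".tmp-{uuid.uuid4().hex}")
        self._handle = open(self._tmp_path, "w", encoding="utf-8")

    def process(self, element):
        # Reuse the bundle's writer for every element; nothing is grouped by key.
        self._handle.write(f"{element}\n")

    def finish_bundle(self):
        # Close, then rename into place so only complete files become visible.
        self._handle.close()
        self._handle = None
        final_path = os.path.join(self.output_dir, f"part-{uuid.uuid4().hex}.txt")
        os.replace(self._tmp_path, final_path)
        return final_path


def run_bundles(bundles, output_dir):
    """Simulate a runner handing successive bundles to one DoFn instance."""
    fn = PerBundleFileWriter(output_dir)
    paths = []
    for bundle in bundles:
        fn.start_bundle()
        for element in bundle:
            fn.process(element)
        paths.append(fn.finish_bundle())
    return paths


# Two simulated bundles produce two independent output files.
with tempfile.TemporaryDirectory() as out_dir:
    paths = run_bundles([["a", "b"], ["c"]], out_dir)
    contents = [Path(p).read_text(encoding="utf-8") for p in paths]
```

The trade-off with this approach is that the runner, not the pipeline author, decides bundle boundaries, so the number and size of output files cannot be controlled. A bundle can also be retried, so a production version would need some way to discard files left behind by failed attempts.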