I'm not sure what you mean by "write out to a specific shard number." Are you talking about FileIO sinks?
Reuven

On Fri, Sep 27, 2019 at 7:41 AM Shannon Duncan <joseph.dun...@liveramp.com> wrote:

> So when Beam writes out to a specific shard number, as I understand it
> does a few things:
>
> - Assigns a shard key to each record (reduces parallelism)
> - Shuffles and groups by the shard key to colocate all records
> - Writes out to each shard file within a single DoFn per key...
>
> When thinking about this, I believe we might be able to eliminate the
> GroupByKey and go ahead and write out to each file with its records with
> only a DoFn after the shard key is assigned.
>
> As long as the shard key is the actual key of the PCollection, then could
> we use a state variable to force all records with the same key to
> share state with each other?
>
> On a DoFn can we use the setup to hold a Map of files being written to
> within bundles on that instance, and on teardown can we close all files
> within the map?
>
> If this is the case, does it reduce the need for a shuffle and allow a DoFn
> to safely write out in append mode to a file, batch, etc. held in state?
>
> It doesn't really decrease parallelism after the key is assigned, since it
> can parallelize over each key within its state window. Which is the same
> level of parallelism we achieve by doing a GroupByKey and doing a for loop
> over the result. So performance shouldn't be impacted if this holds true.
>
> It's kind of like combining both the shuffle and the data write in the
> same step?
>
> This does however have a significant cost reduction by eliminating a
> compute-based shuffle and also eliminating a Dataflow shuffle service call
> if shuffle service is enabled.
>
> Thoughts?
>
> Thanks,
> Shannon Duncan
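
For reference, here is a minimal sketch of the kind of DoFn the quoted proposal describes: one writer per shard key held in a Map, opened lazily and closed when the bundle finishes. The class name ShardedWriteFn, the outputDir parameter, the integer shard keys, and the file-naming scheme are all made up for illustration; it uses @StartBundle/@FinishBundle rather than @Setup/@Teardown, since @Teardown is best-effort and isn't guaranteed to run.

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

/**
 * Hypothetical sketch only: writes each element directly to a per-shard-key
 * file without a preceding GroupByKey. Not the existing Beam sink code.
 */
class ShardedWriteFn extends DoFn<KV<Integer, String>, Void> {

  private final String outputDir;  // e.g. "gs://my-bucket/out" (assumed parameter)
  private transient Map<Integer, Writer> writers;

  ShardedWriteFn(String outputDir) {
    this.outputDir = outputDir;
  }

  @StartBundle
  public void startBundle() {
    // One open writer per shard key seen in this bundle.
    writers = new HashMap<>();
  }

  @ProcessElement
  public void processElement(@Element KV<Integer, String> element) throws IOException {
    int shard = element.getKey();
    Writer writer = writers.get(shard);
    if (writer == null) {
      // Bundle-unique file name so a retried bundle doesn't clobber earlier output.
      ResourceId resource = FileSystems.matchNewResource(
          outputDir + "/shard-" + shard + "-" + UUID.randomUUID() + ".txt",
          /* isDirectory= */ false);
      writer = new BufferedWriter(new OutputStreamWriter(
          Channels.newOutputStream(FileSystems.create(resource, "text/plain")),
          StandardCharsets.UTF_8));
      writers.put(shard, writer);
    }
    writer.write(element.getValue());
    writer.write('\n');
  }

  @FinishBundle
  public void finishBundle() throws IOException {
    // Close (and flush) every file opened during this bundle.
    for (Writer writer : writers.values()) {
      writer.close();
    }
    writers.clear();
  }
}

Note that, as sketched, this produces one file per shard key per bundle rather than exactly one file per shard, which is part of what the GroupByKey in the existing sinks buys you.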