Yes, specifically TextIO's withNumShards().
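For reference, a minimal sketch of fixing the shard count on a TextIO write; the output prefix, shard count, and class name below are just illustrative:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class FixedShardWrite {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(Create.of("a", "b", "c"))
        // Forcing a fixed shard count is what introduces the shard-key
        // assignment and GroupByKey discussed below before the files are written.
        .apply(TextIO.write().to("/tmp/output/part").withNumShards(4));

    p.run().waitUntilFinish();
  }
}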
On Fri, Sep 27, 2019 at 10:45 AM Reuven Lax <[email protected]> wrote:

> I'm not sure what you mean by "write out to a specific shard number." Are
> you talking about FileIO sinks?
>
> Reuven
>
> On Fri, Sep 27, 2019 at 7:41 AM Shannon Duncan <[email protected]>
> wrote:
>
>> So when Beam writes out to a specific shard number, as I understand it,
>> it does a few things:
>>
>> - Assigns a shard key to each record (reduces parallelism)
>> - Shuffles and groups by the shard key to colocate all records
>> - Writes out to each shard file within a single DoFn per key...
>>
>> When thinking about this, I believe we might be able to eliminate the
>> GroupByKey and write each file's records with only a DoFn, once the
>> shard key is assigned.
>>
>> As long as the shard key is the actual key of the PCollection, could we
>> use a state variable so that all records with the same key share state
>> with each other?
>>
>> In a DoFn, can we use setup to hold a Map of the files being written to
>> by bundles on that instance, and close all files in the map on teardown?
>>
>> If so, does this remove the need for a shuffle and allow a DoFn to
>> safely write out in append mode to a file, batch, etc. held in state?
>>
>> It doesn't really decrease parallelism after the key is assigned, since
>> it can parallelize over each key within its state window, which is the
>> same level of parallelism we achieve by doing a GroupByKey and looping
>> over the result. So performance shouldn't be impacted if this holds
>> true.
>>
>> It's kind of like combining the shuffle and the data write in the same
>> step?
>>
>> This does, however, significantly reduce cost by eliminating a
>> compute-based shuffle and, if the shuffle service is enabled, a Dataflow
>> Shuffle service call.
>>
>> Thoughts?
>>
>> Thanks,
>> Shannon Duncan
>>
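To make the proposal quoted above concrete, here is a rough sketch of the DoFn as I read it, assuming the shard key has already been assigned as the key of a KV. This is not how TextIO's sink actually works: the local output path and class name are made up, it keeps plain per-worker instance state rather than Beam's per-key state API, and it ignores the temp-file-plus-finalize handling the real file sinks do.

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

/**
 * Sketch of the idea: after a shard key is assigned, write each element
 * straight to its shard file from the DoFn, keeping one open writer per
 * shard on this worker instead of doing a GroupByKey first.
 */
class DirectShardWriteFn extends DoFn<KV<Integer, String>, Void> {
  private transient Map<Integer, BufferedWriter> writers;

  @Setup
  public void setup() {
    writers = new HashMap<>();
  }

  @ProcessElement
  public void processElement(@Element KV<Integer, String> element) throws IOException {
    int shard = element.getKey();
    // Lazily open one append-mode writer per shard seen on this worker.
    BufferedWriter writer = writers.computeIfAbsent(shard, s -> {
      try {
        return Files.newBufferedWriter(
            Paths.get("/tmp/shards/shard-" + s + ".txt"),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
    writer.write(element.getValue());
    writer.newLine();
  }

  @FinishBundle
  public void finishBundle() throws IOException {
    // Flush at bundle boundaries so committed bundles are on disk.
    for (BufferedWriter writer : writers.values()) {
      writer.flush();
    }
  }

  @Teardown
  public void teardown() throws IOException {
    for (BufferedWriter writer : writers.values()) {
      writer.close();
    }
    writers.clear();
  }
}

One caveat with this sketch: runners may retry bundles, and appending directly to the final files is not idempotent, so retries could duplicate or interleave records. As far as I understand, that is part of what the existing GroupByKey-plus-temp-file sink design is guarding against.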
