Yes, Specifically TextIO withNumShards().

On Fri, Sep 27, 2019 at 10:45 AM Reuven Lax <[email protected]> wrote:

> I'm not sure what you mean by "write out ot a specific shard number." Are
> you talking about FIleIO sinks?
>
> Reuven
>
> On Fri, Sep 27, 2019 at 7:41 AM Shannon Duncan <[email protected]>
> wrote:
>
>> So when beam writes out to a specific shard number, as I understand it
>> does a few things:
>>
>> - Assigns a shard key to each record (reduces parallelism)
>> - Shuffles and Groups by the shard key to colocate all records
>> - Writes out to each shard file within a single DoFn per key...
>>
>> When thinking about this, I believe we might be able to eliminate the
>> GroupByKey to go ahead and write out to each file with its records with
>> only a DoFn after the shard key is assigned.
>>
>> As long as the shard key is the actual key of the PCollection, then could
>> we use a state variable to force all keys that are the same to process to
>> share state with each other?
>>
>> On a DoFn can we use the setup to hold a Map of files being written to
>> within bundles on that instance, and on teardown can we close all files
>> within the map?
>>
>> If this is the case does it reduce the need for a shuffle and allow a
>> DoFn to safely write out in append mode to a file, batch, etc held in
>> state?
>>
>> It doesn't really decrease parallelism after the key is assigned since it
>> can parallelize over each key within its state window. Which is the same
>> level of parallelism we achieve by doing a GroupByKey and doing a for loop
>> over the result. So performance shouldn't be impacted if this holds true.
>>
>> It's kind of like combining both the shuffle and the data write in the
>> same step?
>>
>> This does however have a significant cost reduction by eliminating a
>> compute based shuffle and also eliminating a Dataflow shuffle service call
>> if shuffle service is enabled.
>>
>> Thoughts?
>>
>> Thanks,
>> Shannon Duncan
>>
>

Reply via email to