Google Cloud Dataflow won't override your setting. Dynamic sharding only occurs if you don't explicitly set a numShards value.
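For concreteness, a minimal Java sketch of the two cases (untested; the Event class and output prefix are hypothetical):

    // numShards left unset: the runner (Dataflow) is free to pick the
    // shard count dynamically for each window.
    events.apply(AvroIO.write(Event.class)
        .to("gs://my-bucket/events")    // hypothetical output prefix
        .withWindowedWrites());

    // numShards set explicitly: Dataflow honors the value you set.
    events.apply(AvroIO.write(Event.class)
        .to("gs://my-bucket/events")
        .withWindowedWrites()
        .withNumShards(1));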
On Wed, May 24, 2017 at 9:14 AM, Josh <jof...@gmail.com> wrote:

> Hi Lukasz,
>
> Thanks for the example. That sounds like a nice solution. I am running on
> Dataflow though, which dynamically sets numShards, so if I set numShards
> to 1 on each of those AvroIO writers, I can't be sure that Dataflow won't
> override my setting, right? I guess this should work fine as long as I
> partition my stream into a large enough number of partitions that Dataflow
> won't override numShards.
>
> Josh
>
> On Wed, May 24, 2017 at 4:10 PM, Lukasz Cwik <lc...@google.com> wrote:
>
>> Since you're using a small number of shards, add a Partition transform
>> which uses a deterministic hash of the key to choose one of 4 partitions.
>> Write each partition with a single shard.
>>
>> (Fixed-width diagram below)
>>
>> Pipeline -> AvroIO(numShards = 4)
>>
>> Becomes:
>>
>> Pipeline -> Partition --> AvroIO(numShards = 1)
>>                       |-> AvroIO(numShards = 1)
>>                       |-> AvroIO(numShards = 1)
>>                       \-> AvroIO(numShards = 1)
>>
>> On Wed, May 24, 2017 at 1:05 AM, Josh <jof...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am using a FileBasedSink (AvroIO.write) on an unbounded stream
>>> (withWindowedWrites, hourly windows, numShards=4).
>>>
>>> I would like to partition the stream by some key in the element, so
>>> that all elements with the same key are processed by the same shard
>>> writer, and therefore written to the same file. Is there a way to do
>>> this? Note that in my stream the number of keys is very large (most
>>> elements have a unique key, while a few elements share a key).
>>>
>>> Thanks,
>>> Josh
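For anyone finding this thread later, here is a minimal Java sketch of the Partition approach Lukasz describes. It is untested; the Event class, its getKey() accessor, and the output prefix are hypothetical, while Partition, PCollectionList, and AvroIO are the Beam APIs discussed above:

    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.transforms.Partition;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    static void writePartitioned(PCollection<Event> events) {
      final int numPartitions = 4;

      // A deterministic hash of the key picks a stable partition, so all
      // elements sharing a key land in the same partition. Math.floorMod
      // avoids negative indices from hashCode().
      PCollectionList<Event> parts = events.apply(
          Partition.of(numPartitions,
              (Partition.PartitionFn<Event>) (event, n) ->
                  Math.floorMod(event.getKey().hashCode(), n)));

      // One single-shard writer per partition, so elements that share a
      // key are written to the same file within each window.
      for (int i = 0; i < numPartitions; i++) {
        parts.get(i).apply("WriteAvroPartition" + i,
            AvroIO.write(Event.class)
                .to("gs://my-bucket/events/part-" + i)  // hypothetical prefix
                .withWindowedWrites()
                .withNumShards(1));
      }
    }

Note the trade-off Josh raises: since most elements have a unique key, a hash-based partition groups many keys into each file rather than giving each key its own file, which is usually the practical compromise for a high-cardinality key space.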