Ahh I see - OK, I'll try out this solution then. Thanks Lukasz!

On Wed, May 24, 2017 at 5:20 PM, Lukasz Cwik <lc...@google.com> wrote:
> Google Cloud Dataflow won't override your setting. The dynamic sharding
> occurs only if you don't explicitly set a numShards value.
>
> On Wed, May 24, 2017 at 9:14 AM, Josh <jof...@gmail.com> wrote:
>
>> Hi Lukasz,
>>
>> Thanks for the example. That sounds like a nice solution. I am running
>> on Dataflow though, which dynamically sets numShards - so if I set
>> numShards to 1 on each of those AvroIO writers, I can't be sure that
>> Dataflow isn't going to override my setting, right? I guess this should
>> work fine as long as I partition my stream into a large enough number
>> of partitions so that Dataflow won't override numShards.
>>
>> Josh
>>
>> On Wed, May 24, 2017 at 4:10 PM, Lukasz Cwik <lc...@google.com> wrote:
>>
>>> Since you're using a small number of shards, add a Partition transform
>>> which uses a deterministic hash of the key to choose one of 4
>>> partitions. Write each partition with a single shard.
>>>
>>> (Fixed-width diagram below)
>>> Pipeline -> AvroIO(numShards = 4)
>>> Becomes:
>>> Pipeline -> Partition --> AvroIO(numShards = 1)
>>>                       |-> AvroIO(numShards = 1)
>>>                       |-> AvroIO(numShards = 1)
>>>                       \-> AvroIO(numShards = 1)
>>>
>>> On Wed, May 24, 2017 at 1:05 AM, Josh <jof...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am using a FileBasedSink (AvroIO.write) on an unbounded stream
>>>> (withWindowedWrites, hourly windows, numShards=4).
>>>>
>>>> I would like to partition the stream by some key in the element, so
>>>> that all elements with the same key will get processed by the same
>>>> shard writer, and therefore written to the same file. Is there a way
>>>> to do this? Note that in my stream the number of keys is very large
>>>> (most elements have a unique key, while a few elements share a key).
>>>>
>>>> Thanks,
>>>> Josh
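For reference, a rough sketch of Lukasz's Partition approach using the Beam
Java SDK is below. MyRecord and its getKey() accessor, plus the output path,
are placeholder assumptions, not from the thread - substitute your own element
type and destination. Depending on your Beam version, withWindowedWrites() may
also require an explicit filename policy.

    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.transforms.Partition;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    // Given a windowed, unbounded PCollection<MyRecord> named "records"
    // (MyRecord and getKey() are hypothetical placeholders):
    int numPartitions = 4;

    // Route each element to a partition via a deterministic hash of its key,
    // so elements sharing a key always land in the same partition.
    PCollectionList<MyRecord> partitions = records.apply(
        Partition.of(numPartitions, new Partition.PartitionFn<MyRecord>() {
          @Override
          public int partitionFor(MyRecord record, int n) {
            // floorMod keeps the result non-negative for negative hashCodes.
            return Math.floorMod(record.getKey().hashCode(), n);
          }
        }));

    // Write each partition with exactly one shard, so each partition
    // produces a single file per window.
    for (int i = 0; i < numPartitions; i++) {
      partitions.get(i).apply("WriteAvroPartition" + i,
          AvroIO.write(MyRecord.class)
              .to("gs://my-bucket/output/partition-" + i + "/records")  // placeholder path
              .withWindowedWrites()
              .withNumShards(1));
    }

One caveat: with hashing, many distinct keys will share each partition (and
hence each file), which matches the original goal of co-locating same-key
elements rather than isolating each key in its own file.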