On Fri, May 18, 2018 at 4:24 PM Raghu Angadi <[email protected]> wrote:
> On Fri, May 18, 2018 at 4:07 PM Kenneth Knowles <[email protected]> wrote: > >> It isn't any particular logic in Reshuffle - it is, semantically, an >> identity transform. It is the fact that other runners are perfectly able to >> re-run transform prior to a GBK. So, for example, randomly generated IDs >> will be re-generated. >> > > Ah, thanks, that makes sense. That implies to me Reshuffle is no more > broken than GBK itself. May be Reshuffle.viaRandomKey() could have a clear > caveat. Reshuffle's JavaDoc could add a caveat too about non-deterministic > keys and retries (though it applies to GroupByKey in general). > The "randomness" of Reshuffle.viaRandomKey() is fine, as the randomly generated key is never exposed to the users (so it doesn't matter if it changes). Reshuffle is broken if you are using it to get stable input on a runner that doesn't always have stable input as an implementation detail of GBKs. We tend to put in reshuffles in order to "commit" these random values and >> make them stable for the next stage, to be used to provide the needed >> idempotency for sinks. >> > > In such cases, I think the author should error out on the runner that > don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does > [1]. > > [1] > https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 > We're moving to a world where the runner may not be known at pipeline construction time. However, explicitly using a (distinct) make-input-stable transform when that's the intent (which could be a primitive that runners should implement, possibly by swapping in Reshuffle, or reject) would allow for this. That being said, the exact semantics of this transform is a bit of a rabbit hole which is why we never finished the job of deprecating Reshuffle. This is a case where doing something is better than doing nothing, and our use of URNs for this kind of thing is flexible enough that we can deprecate old ones if/when we have time to pound out the right solution. > >> Kenn >> >> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi <[email protected]> wrote: >> >>> >>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <[email protected]> >>> wrote: >>> >>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <[email protected]> >>>> wrote: >>>> >>>>> Thanks Kenn. >>>>> >>>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <[email protected]> >>>>> wrote: >>>>> >>>>>> The fact that its usage has grown probably indicates that we have a >>>>>> large number of transforms that can easily cause data loss / duplication. >>>>>> >>>>> >>>>> Is this specific to Reshuffle or it is true for any GroupByKey? I see >>>>> Reshuffle as just a wrapper around GBK. >>>>> >>>> The issue is when it's used in such a way that data corruption can >>>> occur when the underlying GBK output is not stable. >>>> >>> >>> Could you describe this breakage bit more in detail or give a example? >>> Apologies in advance, I know this came up in multiple contexts in the past, >>> but I haven't grokked the issue well. It is the window rewrite that >>> Reshuffle does that causes misuse of GBK? >>> >>> Thanks. >>> >>
