The problem isn't any particular logic in Reshuffle; semantically, it is an identity transform. The problem is that other runners are perfectly able to re-run a transform prior to a GBK, so, for example, randomly generated IDs will be regenerated. We tend to insert reshuffles to "commit" these random values and make them stable for the next stage, which provides the idempotency that sinks need.
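To make that concrete, here is a rough sketch in plain Python. It does not use the Beam API; it only simulates a retry. The names assign_ids, IdempotentSink, run_without_commit, and run_with_commit are hypothetical and exist only for illustration. It assumes a sink that deduplicates on the ID attached to each record, and a retry that re-executes everything upstream of the last materialization point.

```python
import uuid


def assign_ids(records):
    # Non-deterministic step: each execution attaches a fresh random ID.
    return [(uuid.uuid4().hex, r) for r in records]


class IdempotentSink:
    """Hypothetical sink that ignores writes whose ID it has already seen."""

    def __init__(self):
        self.rows = {}

    def write(self, keyed):
        for rec_id, rec in keyed:
            self.rows.setdefault(rec_id, rec)


def run_without_commit(records, sink, attempts=2):
    # Each retry re-runs the upstream step, so the IDs are regenerated
    # and the sink cannot recognize the records as duplicates.
    for _ in range(attempts):
        sink.write(assign_ids(records))


def run_with_commit(records, sink, attempts=2):
    # The IDs are materialized once (the role the reshuffle plays here),
    # and every retry of the write replays that same stored output.
    committed = assign_ids(records)
    for _ in range(attempts):
        sink.write(committed)


records = ["a", "b", "c"]

unstable = IdempotentSink()
run_without_commit(records, unstable)

stable = IdempotentSink()
run_with_commit(records, stable)

print(len(unstable.rows), len(stable.rows))
```

With three input records and one retry, the uncommitted run leaves six rows in the sink (each record written twice under different IDs), while the committed run leaves three. Whether a given runner actually treats the GBK inside Reshuffle as such a commit point is a property of the runner, not something this sketch demonstrates.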
Kenn

On Fri, May 18, 2018 at 4:05 PM Raghu Angadi <rang...@google.com> wrote:

> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <rober...@google.com> wrote:
>
>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <rang...@google.com> wrote:
>>
>>> Thanks Kenn.
>>>
>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <k...@google.com> wrote:
>>>
>>>> The fact that its usage has grown probably indicates that we have a
>>>> large number of transforms that can easily cause data loss / duplication.
>>>
>>> Is this specific to Reshuffle or it is true for any GroupByKey? I see
>>> Reshuffle as just a wrapper around GBK.
>>
>> The issue is when it's used in such a way that data corruption can occur
>> when the underlying GBK output is not stable.
>
> Could you describe this breakage bit more in detail or give a example?
> Apologies in advance, I know this came up in multiple contexts in the past,
> but I haven't grokked the issue well. It is the window rewrite that
> Reshuffle does that causes misuse of GBK?
>
> Thanks.