On Mon, May 21, 2018 at 9:56 AM Robert Bradshaw <rober...@google.com> wrote:
> We should probably keep the warning and all the caveats until we introduce > the alternative (and migrate to it for the non-parallelism uses of > reshuffle). I was just proposing we do this via a separate transform that > just calls Reshuffle until we have the new story fully fleshed out (I don't > know if any runner supports RequresStableInput, and it isn't translated > in the Fn API) to avoid being in this intermediate state for yet another > year. > I think all warnings and caveats really belong in JavaDoc for GroupByKey, since that is the actual transform the the semantics that we are concerned about incorrect use. A reshuffle transform can refer the users there. In addition, I think Reshuffle is really good name for reshuffle transform meant for most users. That said, any reorganization is much better than deprecation. Raghu. > > On Sun, May 20, 2018 at 6:38 PM Raghu Angadi <rang...@google.com> wrote: > >> >> >> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw <rober...@google.com> >> wrote: >> >>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi <rang...@google.com> wrote: >>> >>>> [...] >>>> >>> I think it would be much more user friendly to un-deprecate it to add a >>>> warning for advanced users about non-portability of durability/replay >>>> guarantees/stable input assumptions. >>>> >>>>> >>> Yes, I think everyone in this thread is in agreement here. We should >>> provide a *different* transform that provides the durability guarantees >>> (with caveats). In the meantime, this delegating to a reshuffle would be >>> better than using a reshuffle directly. >>> >> >> Great. Sent a PR to undeprecate Reshuffle : >> https://github.com/apache/beam/pull/5432 >> The wording there for JavaDoc just a proposal... >> >> Raghu. >> >> >>> We tend to put in reshuffles in order to "commit" these random values >>>>>>>>> and make them stable for the next stage, to be used to provide the >>>>>>>>> needed >>>>>>>>> idempotency for sinks. >>>>>>>>> >>>>>>>> >>>>>>>> In such cases, I think the author should error out on the runner >>>>>>>> that don't provide that guarantee. That is what ExactlyOnceSink in >>>>>>>> KafkaIO >>>>>>>> does [1]. >>>>>>>> >>>>>>>> [1] >>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 >>>>>>>> >>>>>>> >>>>>>> We're moving to a world where the runner may not be known at >>>>>>> pipeline construction time. However, explicitly using a (distinct) >>>>>>> make-input-stable transform when that's the intent (which could be a >>>>>>> primitive that runners should implement, possibly by swapping in >>>>>>> Reshuffle, >>>>>>> or reject) would allow for this. That being said, the exact semantics of >>>>>>> this transform is a bit of a rabbit hole which is why we never finished >>>>>>> the >>>>>>> job of deprecating Reshuffle. This is a case where doing something is >>>>>>> better than doing nothing, and our use of URNs for this kind of thing is >>>>>>> flexible enough that we can deprecate old ones if/when we have time to >>>>>>> pound out the right solution. >>>>>>> >>>>>>> >>>>>>>> >>>>>>>>> Kenn >>>>>>>>> >>>>>>>>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi <rang...@google.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw < >>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>> >>>>>>>>>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi < >>>>>>>>>>> rang...@google.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thanks Kenn. >>>>>>>>>>>> >>>>>>>>>>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles < >>>>>>>>>>>> k...@google.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> The fact that its usage has grown probably indicates that we >>>>>>>>>>>>> have a large number of transforms that can easily cause data loss >>>>>>>>>>>>> / >>>>>>>>>>>>> duplication. >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Is this specific to Reshuffle or it is true for any GroupByKey? >>>>>>>>>>>> I see Reshuffle as just a wrapper around GBK. >>>>>>>>>>>> >>>>>>>>>>> The issue is when it's used in such a way that data corruption >>>>>>>>>>> can occur when the underlying GBK output is not stable. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Could you describe this breakage bit more in detail or give a >>>>>>>>>> example? Apologies in advance, I know this came up in multiple >>>>>>>>>> contexts in >>>>>>>>>> the past, but I haven't grokked the issue well. It is the window >>>>>>>>>> rewrite >>>>>>>>>> that Reshuffle does that causes misuse of GBK? >>>>>>>>>> >>>>>>>>>> Thanks. >>>>>>>>>> >>>>>>>>>