Filed https://issues.apache.org/jira/browse/BEAM-4372 (unassigned).
On Mon, May 21, 2018 at 10:22 AM Raghu Angadi <rang...@google.com> wrote: > > > On Mon, May 21, 2018 at 9:56 AM Robert Bradshaw <rober...@google.com> > wrote: > >> We should probably keep the warning and all the caveats until we >> introduce the alternative (and migrate to it for the non-parallelism uses >> of reshuffle). I was just proposing we do this via a separate transform >> that just calls Reshuffle until we have the new story fully fleshed out (I >> don't know if any runner supports RequresStableInput, and it isn't >> translated in the Fn API) to avoid being in this intermediate state for >> yet another year. >> > > I think all warnings and caveats really belong in JavaDoc for GroupByKey, > since that is the actual transform the the semantics that we are concerned > about incorrect use. A reshuffle transform can refer the users there. > > In addition, I think Reshuffle is really good name for reshuffle transform > meant for most users. > > That said, any reorganization is much better than deprecation. > > Raghu. > > >> >> On Sun, May 20, 2018 at 6:38 PM Raghu Angadi <rang...@google.com> wrote: >> >>> >>> >>> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw <rober...@google.com> >>> wrote: >>> >>>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi <rang...@google.com> >>>> wrote: >>>> >>>>> [...] >>>>> >>>> I think it would be much more user friendly to un-deprecate it to add a >>>>> warning for advanced users about non-portability of durability/replay >>>>> guarantees/stable input assumptions. >>>>> >>>>>> >>>> Yes, I think everyone in this thread is in agreement here. We should >>>> provide a *different* transform that provides the durability guarantees >>>> (with caveats). In the meantime, this delegating to a reshuffle would be >>>> better than using a reshuffle directly. >>>> >>> >>> Great. Sent a PR to undeprecate Reshuffle : >>> https://github.com/apache/beam/pull/5432 >>> The wording there for JavaDoc just a proposal... >>> >>> Raghu. >>> >>> >>>> We tend to put in reshuffles in order to "commit" these random values >>>>>>>>>> and make them stable for the next stage, to be used to provide the >>>>>>>>>> needed >>>>>>>>>> idempotency for sinks. >>>>>>>>>> >>>>>>>>> >>>>>>>>> In such cases, I think the author should error out on the runner >>>>>>>>> that don't provide that guarantee. That is what ExactlyOnceSink in >>>>>>>>> KafkaIO >>>>>>>>> does [1]. >>>>>>>>> >>>>>>>>> [1] >>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 >>>>>>>>> >>>>>>>> >>>>>>>> We're moving to a world where the runner may not be known at >>>>>>>> pipeline construction time. However, explicitly using a (distinct) >>>>>>>> make-input-stable transform when that's the intent (which could be a >>>>>>>> primitive that runners should implement, possibly by swapping in >>>>>>>> Reshuffle, >>>>>>>> or reject) would allow for this. That being said, the exact semantics >>>>>>>> of >>>>>>>> this transform is a bit of a rabbit hole which is why we never >>>>>>>> finished the >>>>>>>> job of deprecating Reshuffle. This is a case where doing something is >>>>>>>> better than doing nothing, and our use of URNs for this kind of thing >>>>>>>> is >>>>>>>> flexible enough that we can deprecate old ones if/when we have time to >>>>>>>> pound out the right solution. >>>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>>> Kenn >>>>>>>>>> >>>>>>>>>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi <rang...@google.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw < >>>>>>>>>>> rober...@google.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi < >>>>>>>>>>>> rang...@google.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Thanks Kenn. >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles < >>>>>>>>>>>>> k...@google.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> The fact that its usage has grown probably indicates that we >>>>>>>>>>>>>> have a large number of transforms that can easily cause data >>>>>>>>>>>>>> loss / >>>>>>>>>>>>>> duplication. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Is this specific to Reshuffle or it is true for any >>>>>>>>>>>>> GroupByKey? I see Reshuffle as just a wrapper around GBK. >>>>>>>>>>>>> >>>>>>>>>>>> The issue is when it's used in such a way that data corruption >>>>>>>>>>>> can occur when the underlying GBK output is not stable. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Could you describe this breakage bit more in detail or give a >>>>>>>>>>> example? Apologies in advance, I know this came up in multiple >>>>>>>>>>> contexts in >>>>>>>>>>> the past, but I haven't grokked the issue well. It is the window >>>>>>>>>>> rewrite >>>>>>>>>>> that Reshuffle does that causes misuse of GBK? >>>>>>>>>>> >>>>>>>>>>> Thanks. >>>>>>>>>>> >>>>>>>>>>