On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw <rober...@google.com> wrote:
> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi <rang...@google.com> wrote: > >> [...] >> > I think it would be much more user friendly to un-deprecate it to add a >> warning for advanced users about non-portability of durability/replay >> guarantees/stable input assumptions. >> >>> > Yes, I think everyone in this thread is in agreement here. We should > provide a *different* transform that provides the durability guarantees > (with caveats). In the meantime, this delegating to a reshuffle would be > better than using a reshuffle directly. > Great. Sent a PR to undeprecate Reshuffle : https://github.com/apache/beam/pull/5432 The wording there for JavaDoc just a proposal... Raghu. > We tend to put in reshuffles in order to "commit" these random values and >>>>>>> make them stable for the next stage, to be used to provide the needed >>>>>>> idempotency for sinks. >>>>>>> >>>>>> >>>>>> In such cases, I think the author should error out on the runner that >>>>>> don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO >>>>>> does >>>>>> [1]. >>>>>> >>>>>> [1] >>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 >>>>>> >>>>> >>>>> We're moving to a world where the runner may not be known at pipeline >>>>> construction time. However, explicitly using a (distinct) >>>>> make-input-stable >>>>> transform when that's the intent (which could be a primitive that runners >>>>> should implement, possibly by swapping in Reshuffle, or reject) would >>>>> allow >>>>> for this. That being said, the exact semantics of this transform is a bit >>>>> of a rabbit hole which is why we never finished the job of deprecating >>>>> Reshuffle. This is a case where doing something is better than doing >>>>> nothing, and our use of URNs for this kind of thing is flexible enough >>>>> that >>>>> we can deprecate old ones if/when we have time to pound out the right >>>>> solution. >>>>> >>>>> >>>>>> >>>>>>> Kenn >>>>>>> >>>>>>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi <rang...@google.com> >>>>>>> wrote: >>>>>>> >>>>>>>> >>>>>>>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw < >>>>>>>> rober...@google.com> wrote: >>>>>>>> >>>>>>>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <rang...@google.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Thanks Kenn. >>>>>>>>>> >>>>>>>>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <k...@google.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> The fact that its usage has grown probably indicates that we >>>>>>>>>>> have a large number of transforms that can easily cause data loss / >>>>>>>>>>> duplication. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Is this specific to Reshuffle or it is true for any GroupByKey? I >>>>>>>>>> see Reshuffle as just a wrapper around GBK. >>>>>>>>>> >>>>>>>>> The issue is when it's used in such a way that data corruption can >>>>>>>>> occur when the underlying GBK output is not stable. >>>>>>>>> >>>>>>>> >>>>>>>> Could you describe this breakage bit more in detail or give a >>>>>>>> example? Apologies in advance, I know this came up in multiple >>>>>>>> contexts in >>>>>>>> the past, but I haven't grokked the issue well. It is the window >>>>>>>> rewrite >>>>>>>> that Reshuffle does that causes misuse of GBK? >>>>>>>> >>>>>>>> Thanks. >>>>>>>> >>>>>>>