We should probably keep the warning and all the caveats until we introduce
the alternative (and migrate to it for the non-parallelism uses of
reshuffle). I was just proposing we do this via a separate transform that
just calls Reshuffle until we have the new story fully fleshed out (I don't
know if any runner supports RequresStableInput, and it isn't translated in
the Fn API) to avoid being in this intermediate state for yet another year.

On Sun, May 20, 2018 at 6:38 PM Raghu Angadi <rang...@google.com> wrote:

>
>
> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi <rang...@google.com> wrote:
>>
>>> [...]
>>>
>> I think it would be much more user friendly to un-deprecate it to add a
>>> warning for advanced users about non-portability of durability/replay
>>> guarantees/stable input assumptions.
>>>
>>>>
>> Yes, I think everyone in this thread is in agreement here. We should
>> provide a *different* transform that provides the durability guarantees
>> (with caveats). In the meantime, this delegating to a reshuffle would be
>> better than using a reshuffle directly.
>>
>
> Great. Sent a PR to undeprecate Reshuffle :
> https://github.com/apache/beam/pull/5432
> The wording there for JavaDoc just a proposal...
>
> Raghu.
>
>
>> We tend to put in reshuffles in order to "commit" these random values and
>>>>>>>> make them stable for the next stage, to be used to provide the needed
>>>>>>>> idempotency for sinks.
>>>>>>>>
>>>>>>>
>>>>>>> In such cases, I think the author should error out on the runner
>>>>>>> that don't provide that guarantee. That is what ExactlyOnceSink in 
>>>>>>> KafkaIO
>>>>>>> does [1].
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>>>>>>>
>>>>>>
>>>>>> We're moving to a world where the runner may not be known at pipeline
>>>>>> construction time. However, explicitly using a (distinct) 
>>>>>> make-input-stable
>>>>>> transform when that's the intent (which could be a primitive that runners
>>>>>> should implement, possibly by swapping in Reshuffle, or reject) would 
>>>>>> allow
>>>>>> for this. That being said, the exact semantics of this transform is a bit
>>>>>> of a rabbit hole which is why we never finished the job of deprecating
>>>>>> Reshuffle. This is a case where doing something is better than doing
>>>>>> nothing, and our use of URNs for this kind of thing is flexible enough 
>>>>>> that
>>>>>> we can deprecate old ones if/when we have time to pound out the right
>>>>>> solution.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi <rang...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <rang...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Kenn.
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <k...@google.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The fact that its usage has grown probably indicates that we
>>>>>>>>>>>> have a large number of transforms that can easily cause data loss /
>>>>>>>>>>>> duplication.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Is this specific to Reshuffle or it is true for any GroupByKey?
>>>>>>>>>>> I see Reshuffle as just a wrapper around GBK.
>>>>>>>>>>>
>>>>>>>>>> The issue is when it's used in such a way that data corruption
>>>>>>>>>> can occur when the underlying GBK output is not stable.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Could you describe this breakage bit more in detail or give a
>>>>>>>>> example? Apologies in advance, I know this came up in multiple 
>>>>>>>>> contexts in
>>>>>>>>> the past, but I haven't grokked the issue well. It is the window 
>>>>>>>>> rewrite
>>>>>>>>> that Reshuffle does that causes misuse of GBK?
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>

Reply via email to