On Sat, May 19, 2018 at 8:11 AM Robert Bradshaw <rober...@google.com> wrote:

> On Fri, May 18, 2018 at 6:29 PM Raghu Angadi <rang...@google.com> wrote:
>
>> True. I am still failing to see what is broken about Reshuffle that is
>> also not broken with GroupByKey transform. If someone depends on GroupByKey
>> to get stable input, isn't that equally incorrect/unportable?
>>
>
> Yes, if people use GBK in that way, it's also just as broken. The thought
> is that fewer people would use it with that intent, as GBK is not a no-op
> (it transforms the shape of the data, and also does not preserve
> windowing). This is in contrast to Reshuffle which was encouraged for this
> usecase.
>

I see. I am not aware of any recommendation for users (excluding advanced
users) to use this for stable input/durability gaurantees. Every single
case where I recommended Reshuffle was related to parallelism (there were
many such cases). Most of use of Reshuflle/GBK for stable input were done
consciously by the authors, fully aware of the caveats (SDF in Dataflow,
Kafka EOS sink use of GBK, etc).

As a result, deprecation is only hurting the innocent users who are using
Reshuffle correctly.

I think it would be much more user friendly to un-deprecate it to add a
warning for advanced users about non-portability of durability/replay
guarantees/stable input assumptions.

Raghu.

We tend to put in reshuffles in order to "commit" these random values and
>>>>> make them stable for the next stage, to be used to provide the needed
>>>>> idempotency for sinks.
>>>>>
>>>>
>>>> In such cases, I think the author should error out on the runner that
>>>> don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does
>>>> [1].
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>>>>
>>>
>>> We're moving to a world where the runner may not be known at pipeline
>>> construction time. However, explicitly using a (distinct) make-input-stable
>>> transform when that's the intent (which could be a primitive that runners
>>> should implement, possibly by swapping in Reshuffle, or reject) would allow
>>> for this. That being said, the exact semantics of this transform is a bit
>>> of a rabbit hole which is why we never finished the job of deprecating
>>> Reshuffle. This is a case where doing something is better than doing
>>> nothing, and our use of URNs for this kind of thing is flexible enough that
>>> we can deprecate old ones if/when we have time to pound out the right
>>> solution.
>>>
>>>
>>>>
>>>>> Kenn
>>>>>
>>>>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi <rang...@google.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <rober...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <rang...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Kenn.
>>>>>>>>
>>>>>>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <k...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The fact that its usage has grown probably indicates that we have
>>>>>>>>> a large number of transforms that can easily cause data loss / 
>>>>>>>>> duplication.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Is this specific to Reshuffle or it is true for any GroupByKey? I
>>>>>>>> see Reshuffle as just a wrapper around GBK.
>>>>>>>>
>>>>>>> The issue is when it's used in such a way that data corruption can
>>>>>>> occur when the underlying GBK output is not stable.
>>>>>>>
>>>>>>
>>>>>> Could you describe this breakage bit more in detail or give a
>>>>>> example? Apologies in advance, I know this came up in multiple contexts 
>>>>>> in
>>>>>> the past, but I haven't grokked the issue well. It is the window rewrite
>>>>>> that Reshuffle does that causes misuse of GBK?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>

Reply via email to