On Fri, May 18, 2018 at 4:07 PM Kenneth Knowles <k...@google.com> wrote:

> It isn't any particular logic in Reshuffle - it is, semantically, an
> identity transform. It is the fact that other runners are perfectly able to
> re-run transforms prior to a GBK. So, for example, randomly generated IDs
> will be re-generated.
>

Ah, thanks, that makes sense. That implies to me that Reshuffle is no more
broken than GBK itself. Maybe Reshuffle.viaRandomKey() could carry a clear
caveat. Reshuffle's JavaDoc could add a caveat too about non-deterministic
keys and retries (though it applies to GroupByKey in general).
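
To make the retry concern concrete, here is a toy sketch in plain Java. It is
not Beam code: the class, the methods, and the in-memory "checkpoint" map are
all made up for illustration. The map stands in for whatever durable shuffle
storage a runner may or may not provide. The sketch only shows that re-running
a step that generates random IDs yields different IDs, while reading back a
persisted result yields the same ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Toy model only: none of these names are Beam APIs.
class StableIdDemo {

    // A non-deterministic step: tags each element with a fresh random ID.
    static List<String> assignRandomIds(List<String> elements) {
        List<String> out = new ArrayList<>();
        for (String e : elements) {
            out.add(e + ":" + UUID.randomUUID());
        }
        return out;
    }

    // Hypothetical stand-in for a runner's durable shuffle storage.
    static final Map<String, List<String>> checkpoint = new HashMap<>();

    // Runs the random step at most once per stage name and stores the result;
    // a retry of the downstream stage reads the stored output instead.
    static List<String> runWithCheckpoint(String stage, List<String> elements) {
        return checkpoint.computeIfAbsent(stage, k -> assignRandomIds(elements));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b");

        // Without a checkpoint, a retry re-executes the random step.
        List<String> first = assignRandomIds(input);
        List<String> retry = assignRandomIds(input);
        System.out.println("re-run yields same IDs: " + first.equals(retry));

        // With a checkpoint, a retry observes the IDs that were stored.
        List<String> committed = runWithCheckpoint("stage-1", input);
        List<String> replayed = runWithCheckpoint("stage-1", input);
        System.out.println("checkpointed yields same IDs: " + committed.equals(replayed));
    }
}
```

In this model the sink's idempotency depends entirely on the second path
existing, which is the part that varies between runners.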

> We tend to put in reshuffles in order to "commit" these random values and
> make them stable for the next stage, to be used to provide the needed
> idempotency for sinks.
>

In such cases, I think the author should error out on runners that don't
provide that guarantee. That is what ExactlyOnceSink in KafkaIO does [1].

[1]
https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
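
As a rough illustration of the "error out" idea (this is not the KafkaIO code
linked above; the class name, method name, and the runner names in the
allow-list are placeholders I made up), a transform author could fail fast at
pipeline construction time along these lines:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical guard, not taken from Beam or KafkaIO.
class StableInputGuard {

    // Placeholder allow-list of runners assumed to persist shuffle output
    // before downstream stages consume it. The names are invented.
    static final Set<String> RUNNERS_WITH_STABLE_SHUFFLE =
        new HashSet<>(Arrays.asList("RunnerA", "RunnerB"));

    // Throws if the given runner is not known to provide the guarantee.
    static void ensureStableShuffle(String runnerName) {
        if (!RUNNERS_WITH_STABLE_SHUFFLE.contains(runnerName)) {
            throw new UnsupportedOperationException(
                runnerName + " is not known to provide stable shuffle output; "
                    + "this sink cannot guarantee idempotent writes on it.");
        }
    }

    public static void main(String[] args) {
        ensureStableShuffle("RunnerA"); // passes silently
        try {
            ensureStableShuffle("RunnerC");
        } catch (UnsupportedOperationException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

An allow-list is the conservative choice here: an unknown runner is rejected
rather than silently risking duplicates.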


> Kenn
>
> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi <rang...@google.com> wrote:
>
>>
>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <rang...@google.com>
>>> wrote:
>>>
>>>> Thanks Kenn.
>>>>
>>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <k...@google.com>
>>>> wrote:
>>>>
>>>>> The fact that its usage has grown probably indicates that we have a
>>>>> large number of transforms that can easily cause data loss / duplication.
>>>>>
>>>>
>>>> Is this specific to Reshuffle, or is it true for any GroupByKey? I see
>>>> Reshuffle as just a wrapper around GBK.
>>>>
>>> The issue is when it's used in such a way that data corruption can occur
>>> when the underlying GBK output is not stable.
>>>
>>
>> Could you describe this breakage in a bit more detail or give an example?
>> Apologies in advance, I know this came up in multiple contexts in the past,
>> but I haven't grokked the issue well. Is it the window rewrite that
>> Reshuffle does that causes misuse of GBK?
>>
>> Thanks.
>>
>
