On Mon, May 21, 2018 at 9:56 AM Robert Bradshaw <rober...@google.com> wrote:

> We should probably keep the warning and all the caveats until we introduce
> the alternative (and migrate to it for the non-parallelism uses of
> reshuffle). I was just proposing we do this via a separate transform that
> just calls Reshuffle until we have the new story fully fleshed out (I don't
> know if any runner supports RequresStableInput, and it isn't translated
> in the Fn API) to avoid being in this intermediate state for yet another
> year.
>

I think all warnings and caveats really belong in JavaDoc for GroupByKey,
since that is the actual transform the the semantics that we are concerned
about incorrect use. A reshuffle transform can refer the users there.

In addition, I think Reshuffle is really good name for reshuffle transform
meant for most users.

That said, any reorganization is much better than deprecation.

Raghu.


>
> On Sun, May 20, 2018 at 6:38 PM Raghu Angadi <rang...@google.com> wrote:
>
>>
>>
>> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi <rang...@google.com> wrote:
>>>
>>>> [...]
>>>>
>>> I think it would be much more user friendly to un-deprecate it to add a
>>>> warning for advanced users about non-portability of durability/replay
>>>> guarantees/stable input assumptions.
>>>>
>>>>>
>>> Yes, I think everyone in this thread is in agreement here. We should
>>> provide a *different* transform that provides the durability guarantees
>>> (with caveats). In the meantime, this delegating to a reshuffle would be
>>> better than using a reshuffle directly.
>>>
>>
>> Great. Sent a PR to undeprecate Reshuffle :
>> https://github.com/apache/beam/pull/5432
>> The wording there for JavaDoc just a proposal...
>>
>> Raghu.
>>
>>
>>> We tend to put in reshuffles in order to "commit" these random values
>>>>>>>>> and make them stable for the next stage, to be used to provide the 
>>>>>>>>> needed
>>>>>>>>> idempotency for sinks.
>>>>>>>>>
>>>>>>>>
>>>>>>>> In such cases, I think the author should error out on the runner
>>>>>>>> that don't provide that guarantee. That is what ExactlyOnceSink in 
>>>>>>>> KafkaIO
>>>>>>>> does [1].
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>>>>>>>>
>>>>>>>
>>>>>>> We're moving to a world where the runner may not be known at
>>>>>>> pipeline construction time. However, explicitly using a (distinct)
>>>>>>> make-input-stable transform when that's the intent (which could be a
>>>>>>> primitive that runners should implement, possibly by swapping in 
>>>>>>> Reshuffle,
>>>>>>> or reject) would allow for this. That being said, the exact semantics of
>>>>>>> this transform is a bit of a rabbit hole which is why we never finished 
>>>>>>> the
>>>>>>> job of deprecating Reshuffle. This is a case where doing something is
>>>>>>> better than doing nothing, and our use of URNs for this kind of thing is
>>>>>>> flexible enough that we can deprecate old ones if/when we have time to
>>>>>>> pound out the right solution.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi <rang...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <
>>>>>>>>>>> rang...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Kenn.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <
>>>>>>>>>>>> k...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The fact that its usage has grown probably indicates that we
>>>>>>>>>>>>> have a large number of transforms that can easily cause data loss 
>>>>>>>>>>>>> /
>>>>>>>>>>>>> duplication.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Is this specific to Reshuffle or it is true for any GroupByKey?
>>>>>>>>>>>> I see Reshuffle as just a wrapper around GBK.
>>>>>>>>>>>>
>>>>>>>>>>> The issue is when it's used in such a way that data corruption
>>>>>>>>>>> can occur when the underlying GBK output is not stable.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Could you describe this breakage bit more in detail or give a
>>>>>>>>>> example? Apologies in advance, I know this came up in multiple 
>>>>>>>>>> contexts in
>>>>>>>>>> the past, but I haven't grokked the issue well. It is the window 
>>>>>>>>>> rewrite
>>>>>>>>>> that Reshuffle does that causes misuse of GBK?
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>

Reply via email to