Filed https://issues.apache.org/jira/browse/BEAM-4372 (unassigned).

On Mon, May 21, 2018 at 10:22 AM Raghu Angadi <rang...@google.com> wrote:

>
>
> On Mon, May 21, 2018 at 9:56 AM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> We should probably keep the warning and all the caveats until we
>> introduce the alternative (and migrate to it for the non-parallelism uses
>> of reshuffle). I was just proposing we do this via a separate transform
>> that just calls Reshuffle until we have the new story fully fleshed out (I
>> don't know if any runner supports RequresStableInput, and it isn't
>> translated in the Fn API) to avoid being in this intermediate state for
>> yet another year.
>>
>
> I think all warnings and caveats really belong in JavaDoc for GroupByKey,
> since that is the actual transform the the semantics that we are concerned
> about incorrect use. A reshuffle transform can refer the users there.
>
> In addition, I think Reshuffle is really good name for reshuffle transform
> meant for most users.
>
> That said, any reorganization is much better than deprecation.
>
> Raghu.
>
>
>>
>> On Sun, May 20, 2018 at 6:38 PM Raghu Angadi <rang...@google.com> wrote:
>>
>>>
>>>
>>> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw <rober...@google.com>
>>> wrote:
>>>
>>>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi <rang...@google.com>
>>>> wrote:
>>>>
>>>>> [...]
>>>>>
>>>> I think it would be much more user friendly to un-deprecate it to add a
>>>>> warning for advanced users about non-portability of durability/replay
>>>>> guarantees/stable input assumptions.
>>>>>
>>>>>>
>>>> Yes, I think everyone in this thread is in agreement here. We should
>>>> provide a *different* transform that provides the durability guarantees
>>>> (with caveats). In the meantime, this delegating to a reshuffle would be
>>>> better than using a reshuffle directly.
>>>>
>>>
>>> Great. Sent a PR to undeprecate Reshuffle :
>>> https://github.com/apache/beam/pull/5432
>>> The wording there for JavaDoc just a proposal...
>>>
>>> Raghu.
>>>
>>>
>>>> We tend to put in reshuffles in order to "commit" these random values
>>>>>>>>>> and make them stable for the next stage, to be used to provide the 
>>>>>>>>>> needed
>>>>>>>>>> idempotency for sinks.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In such cases, I think the author should error out on the runner
>>>>>>>>> that don't provide that guarantee. That is what ExactlyOnceSink in 
>>>>>>>>> KafkaIO
>>>>>>>>> does [1].
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>>>>>>>>>
>>>>>>>>
>>>>>>>> We're moving to a world where the runner may not be known at
>>>>>>>> pipeline construction time. However, explicitly using a (distinct)
>>>>>>>> make-input-stable transform when that's the intent (which could be a
>>>>>>>> primitive that runners should implement, possibly by swapping in 
>>>>>>>> Reshuffle,
>>>>>>>> or reject) would allow for this. That being said, the exact semantics 
>>>>>>>> of
>>>>>>>> this transform is a bit of a rabbit hole which is why we never 
>>>>>>>> finished the
>>>>>>>> job of deprecating Reshuffle. This is a case where doing something is
>>>>>>>> better than doing nothing, and our use of URNs for this kind of thing 
>>>>>>>> is
>>>>>>>> flexible enough that we can deprecate old ones if/when we have time to
>>>>>>>> pound out the right solution.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Kenn
>>>>>>>>>>
>>>>>>>>>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi <rang...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
>>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <
>>>>>>>>>>>> rang...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Kenn.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <
>>>>>>>>>>>>> k...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The fact that its usage has grown probably indicates that we
>>>>>>>>>>>>>> have a large number of transforms that can easily cause data 
>>>>>>>>>>>>>> loss /
>>>>>>>>>>>>>> duplication.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is this specific to Reshuffle or it is true for any
>>>>>>>>>>>>> GroupByKey? I see Reshuffle as just a wrapper around GBK.
>>>>>>>>>>>>>
>>>>>>>>>>>> The issue is when it's used in such a way that data corruption
>>>>>>>>>>>> can occur when the underlying GBK output is not stable.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Could you describe this breakage bit more in detail or give a
>>>>>>>>>>> example? Apologies in advance, I know this came up in multiple 
>>>>>>>>>>> contexts in
>>>>>>>>>>> the past, but I haven't grokked the issue well. It is the window 
>>>>>>>>>>> rewrite
>>>>>>>>>>> that Reshuffle does that causes misuse of GBK?
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>

Reply via email to