+1 to introducing alternative transforms even if they wrap Reshuffle

The benefits of making them distinct is that we can put appropriate Javadoc
in place and runners can figure out what the user is intending and whether
Reshuffle or some other implementation is appropriate. We can also see
which of these use cases is most common by inspecting the transform usage.
And, when better options are available, we can just introduce the
appropriate transform, to get the appropriate special behavior.

I think there were three cases of use that I can remember:

1. Splitting up retry domains (eg., if one transform is expensive,
preventing failures in nearby transforms from causing it to be retried)
2. Redistributing elements to improve parallelism
3. Checkpointing/snapshotting a pcollection before side-effecting the
outside world (eg., RequiresStableInput)

-- Ben

On Mon, May 21, 2018 at 9:56 AM Robert Bradshaw <rober...@google.com> wrote:

> We should probably keep the warning and all the caveats until we introduce
> the alternative (and migrate to it for the non-parallelism uses of
> reshuffle). I was just proposing we do this via a separate transform that
> just calls Reshuffle until we have the new story fully fleshed out (I don't
> know if any runner supports RequresStableInput, and it isn't translated
> in the Fn API) to avoid being in this intermediate state for yet another
> year.
>
> On Sun, May 20, 2018 at 6:38 PM Raghu Angadi <rang...@google.com> wrote:
>
>>
>>
>> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi <rang...@google.com> wrote:
>>>
>>>> [...]
>>>>
>>> I think it would be much more user friendly to un-deprecate it to add a
>>>> warning for advanced users about non-portability of durability/replay
>>>> guarantees/stable input assumptions.
>>>>
>>>>>
>>> Yes, I think everyone in this thread is in agreement here. We should
>>> provide a *different* transform that provides the durability guarantees
>>> (with caveats). In the meantime, this delegating to a reshuffle would be
>>> better than using a reshuffle directly.
>>>
>>
>> Great. Sent a PR to undeprecate Reshuffle :
>> https://github.com/apache/beam/pull/5432
>> The wording there for JavaDoc just a proposal...
>>
>> Raghu.
>>
>>
>>> We tend to put in reshuffles in order to "commit" these random values
>>>>>>>>> and make them stable for the next stage, to be used to provide the 
>>>>>>>>> needed
>>>>>>>>> idempotency for sinks.
>>>>>>>>>
>>>>>>>>
>>>>>>>> In such cases, I think the author should error out on the runner
>>>>>>>> that don't provide that guarantee. That is what ExactlyOnceSink in 
>>>>>>>> KafkaIO
>>>>>>>> does [1].
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>>>>>>>>
>>>>>>>
>>>>>>> We're moving to a world where the runner may not be known at
>>>>>>> pipeline construction time. However, explicitly using a (distinct)
>>>>>>> make-input-stable transform when that's the intent (which could be a
>>>>>>> primitive that runners should implement, possibly by swapping in 
>>>>>>> Reshuffle,
>>>>>>> or reject) would allow for this. That being said, the exact semantics of
>>>>>>> this transform is a bit of a rabbit hole which is why we never finished 
>>>>>>> the
>>>>>>> job of deprecating Reshuffle. This is a case where doing something is
>>>>>>> better than doing nothing, and our use of URNs for this kind of thing is
>>>>>>> flexible enough that we can deprecate old ones if/when we have time to
>>>>>>> pound out the right solution.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi <rang...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
>>>>>>>>>> rober...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <
>>>>>>>>>>> rang...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Kenn.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <
>>>>>>>>>>>> k...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The fact that its usage has grown probably indicates that we
>>>>>>>>>>>>> have a large number of transforms that can easily cause data loss 
>>>>>>>>>>>>> /
>>>>>>>>>>>>> duplication.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Is this specific to Reshuffle or it is true for any GroupByKey?
>>>>>>>>>>>> I see Reshuffle as just a wrapper around GBK.
>>>>>>>>>>>>
>>>>>>>>>>> The issue is when it's used in such a way that data corruption
>>>>>>>>>>> can occur when the underlying GBK output is not stable.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Could you describe this breakage bit more in detail or give a
>>>>>>>>>> example? Apologies in advance, I know this came up in multiple 
>>>>>>>>>> contexts in
>>>>>>>>>> the past, but I haven't grokked the issue well. It is the window 
>>>>>>>>>> rewrite
>>>>>>>>>> that Reshuffle does that causes misuse of GBK?
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>>

Reply via email to