Re: What is the future of Reshuffle?

2018-05-21 Thread Raghu Angadi
Filed https://issues.apache.org/jira/browse/BEAM-4372 (unassigned).

On Mon, May 21, 2018 at 10:22 AM Raghu Angadi  wrote:

>
>
> On Mon, May 21, 2018 at 9:56 AM Robert Bradshaw 
> wrote:
>
>> We should probably keep the warning and all the caveats until we
>> introduce the alternative (and migrate to it for the non-parallelism uses
>> of reshuffle). I was just proposing we do this via a separate transform
>> that just calls Reshuffle until we have the new story fully fleshed out (I
>> don't know if any runner supports RequresStableInput, and it isn't
>> translated in the Fn API) to avoid being in this intermediate state for
>> yet another year.
>>
>
> I think all warnings and caveats really belong in JavaDoc for GroupByKey,
> since that is the actual transform the the semantics that we are concerned
> about incorrect use. A reshuffle transform can refer the users there.
>
> In addition, I think Reshuffle is really good name for reshuffle transform
> meant for most users.
>
> That said, any reorganization is much better than deprecation.
>
> Raghu.
>
>
>>
>> On Sun, May 20, 2018 at 6:38 PM Raghu Angadi  wrote:
>>
>>>
>>>
>>> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw 
>>> wrote:
>>>
 On Sat, May 19, 2018 at 6:27 PM Raghu Angadi 
 wrote:

> [...]
>
 I think it would be much more user friendly to un-deprecate it to add a
> warning for advanced users about non-portability of durability/replay
> guarantees/stable input assumptions.
>
>>
 Yes, I think everyone in this thread is in agreement here. We should
 provide a *different* transform that provides the durability guarantees
 (with caveats). In the meantime, this delegating to a reshuffle would be
 better than using a reshuffle directly.

>>>
>>> Great. Sent a PR to undeprecate Reshuffle :
>>> https://github.com/apache/beam/pull/5432
>>> The wording there for JavaDoc just a proposal...
>>>
>>> Raghu.
>>>
>>>
 We tend to put in reshuffles in order to "commit" these random values
>> and make them stable for the next stage, to be used to provide the 
>> needed
>> idempotency for sinks.
>>
>
> In such cases, I think the author should error out on the runner
> that don't provide that guarantee. That is what ExactlyOnceSink in 
> KafkaIO
> does [1].
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>

 We're moving to a world where the runner may not be known at
 pipeline construction time. However, explicitly using a (distinct)
 make-input-stable transform when that's the intent (which could be a
 primitive that runners should implement, possibly by swapping in 
 Reshuffle,
 or reject) would allow for this. That being said, the exact semantics 
 of
 this transform is a bit of a rabbit hole which is why we never 
 finished the
 job of deprecating Reshuffle. This is a case where doing something is
 better than doing nothing, and our use of URNs for this kind of thing 
 is
 flexible enough that we can deprecate old ones if/when we have time to
 pound out the right solution.


>
>> Kenn
>>
>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi 
>> wrote:
>>
>>>
>>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
>>> rober...@google.com> wrote:
>>>
 On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <
 rang...@google.com> wrote:

> Thanks Kenn.
>
> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <
> k...@google.com> wrote:
>
>> The fact that its usage has grown probably indicates that we
>> have a large number of transforms that can easily cause data 
>> loss /
>> duplication.
>>
>
> Is this specific to Reshuffle or it is true for any
> GroupByKey? I see Reshuffle as just a wrapper around GBK.
>
 The issue is when it's used in such a way that data corruption
 can occur when the underlying GBK output is not stable.

>>>
>>> Could you describe this breakage bit more in detail or give a
>>> example? Apologies in advance, I know this came up in multiple 
>>> contexts in
>>> the past, but I haven't grokked the issue well. It is the window 
>>> rewrite
>>> that Reshuffle does that causes misuse of GBK?
>>>
>>> Thanks.
>>>
>>


Re: What is the future of Reshuffle?

2018-05-21 Thread Raghu Angadi
On Mon, May 21, 2018 at 9:56 AM Robert Bradshaw  wrote:

> We should probably keep the warning and all the caveats until we introduce
> the alternative (and migrate to it for the non-parallelism uses of
> reshuffle). I was just proposing we do this via a separate transform that
> just calls Reshuffle until we have the new story fully fleshed out (I don't
> know if any runner supports RequresStableInput, and it isn't translated
> in the Fn API) to avoid being in this intermediate state for yet another
> year.
>

I think all warnings and caveats really belong in JavaDoc for GroupByKey,
since that is the actual transform the the semantics that we are concerned
about incorrect use. A reshuffle transform can refer the users there.

In addition, I think Reshuffle is really good name for reshuffle transform
meant for most users.

That said, any reorganization is much better than deprecation.

Raghu.


>
> On Sun, May 20, 2018 at 6:38 PM Raghu Angadi  wrote:
>
>>
>>
>> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw 
>> wrote:
>>
>>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi  wrote:
>>>
 [...]

>>> I think it would be much more user friendly to un-deprecate it to add a
 warning for advanced users about non-portability of durability/replay
 guarantees/stable input assumptions.

>
>>> Yes, I think everyone in this thread is in agreement here. We should
>>> provide a *different* transform that provides the durability guarantees
>>> (with caveats). In the meantime, this delegating to a reshuffle would be
>>> better than using a reshuffle directly.
>>>
>>
>> Great. Sent a PR to undeprecate Reshuffle :
>> https://github.com/apache/beam/pull/5432
>> The wording there for JavaDoc just a proposal...
>>
>> Raghu.
>>
>>
>>> We tend to put in reshuffles in order to "commit" these random values
> and make them stable for the next stage, to be used to provide the 
> needed
> idempotency for sinks.
>

 In such cases, I think the author should error out on the runner
 that don't provide that guarantee. That is what ExactlyOnceSink in 
 KafkaIO
 does [1].

 [1]
 https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049

>>>
>>> We're moving to a world where the runner may not be known at
>>> pipeline construction time. However, explicitly using a (distinct)
>>> make-input-stable transform when that's the intent (which could be a
>>> primitive that runners should implement, possibly by swapping in 
>>> Reshuffle,
>>> or reject) would allow for this. That being said, the exact semantics of
>>> this transform is a bit of a rabbit hole which is why we never finished 
>>> the
>>> job of deprecating Reshuffle. This is a case where doing something is
>>> better than doing nothing, and our use of URNs for this kind of thing is
>>> flexible enough that we can deprecate old ones if/when we have time to
>>> pound out the right solution.
>>>
>>>

> Kenn
>
> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi 
> wrote:
>
>>
>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
>> rober...@google.com> wrote:
>>
>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <
>>> rang...@google.com> wrote:
>>>
 Thanks Kenn.

 On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <
 k...@google.com> wrote:

> The fact that its usage has grown probably indicates that we
> have a large number of transforms that can easily cause data loss 
> /
> duplication.
>

 Is this specific to Reshuffle or it is true for any GroupByKey?
 I see Reshuffle as just a wrapper around GBK.

>>> The issue is when it's used in such a way that data corruption
>>> can occur when the underlying GBK output is not stable.
>>>
>>
>> Could you describe this breakage bit more in detail or give a
>> example? Apologies in advance, I know this came up in multiple 
>> contexts in
>> the past, but I haven't grokked the issue well. It is the window 
>> rewrite
>> that Reshuffle does that causes misuse of GBK?
>>
>> Thanks.
>>
>


Re: What is the future of Reshuffle?

2018-05-21 Thread Ben Chambers
+1 to introducing alternative transforms even if they wrap Reshuffle

The benefits of making them distinct is that we can put appropriate Javadoc
in place and runners can figure out what the user is intending and whether
Reshuffle or some other implementation is appropriate. We can also see
which of these use cases is most common by inspecting the transform usage.
And, when better options are available, we can just introduce the
appropriate transform, to get the appropriate special behavior.

I think there were three cases of use that I can remember:

1. Splitting up retry domains (eg., if one transform is expensive,
preventing failures in nearby transforms from causing it to be retried)
2. Redistributing elements to improve parallelism
3. Checkpointing/snapshotting a pcollection before side-effecting the
outside world (eg., RequiresStableInput)

-- Ben

On Mon, May 21, 2018 at 9:56 AM Robert Bradshaw  wrote:

> We should probably keep the warning and all the caveats until we introduce
> the alternative (and migrate to it for the non-parallelism uses of
> reshuffle). I was just proposing we do this via a separate transform that
> just calls Reshuffle until we have the new story fully fleshed out (I don't
> know if any runner supports RequresStableInput, and it isn't translated
> in the Fn API) to avoid being in this intermediate state for yet another
> year.
>
> On Sun, May 20, 2018 at 6:38 PM Raghu Angadi  wrote:
>
>>
>>
>> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw 
>> wrote:
>>
>>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi  wrote:
>>>
 [...]

>>> I think it would be much more user friendly to un-deprecate it to add a
 warning for advanced users about non-portability of durability/replay
 guarantees/stable input assumptions.

>
>>> Yes, I think everyone in this thread is in agreement here. We should
>>> provide a *different* transform that provides the durability guarantees
>>> (with caveats). In the meantime, this delegating to a reshuffle would be
>>> better than using a reshuffle directly.
>>>
>>
>> Great. Sent a PR to undeprecate Reshuffle :
>> https://github.com/apache/beam/pull/5432
>> The wording there for JavaDoc just a proposal...
>>
>> Raghu.
>>
>>
>>> We tend to put in reshuffles in order to "commit" these random values
> and make them stable for the next stage, to be used to provide the 
> needed
> idempotency for sinks.
>

 In such cases, I think the author should error out on the runner
 that don't provide that guarantee. That is what ExactlyOnceSink in 
 KafkaIO
 does [1].

 [1]
 https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049

>>>
>>> We're moving to a world where the runner may not be known at
>>> pipeline construction time. However, explicitly using a (distinct)
>>> make-input-stable transform when that's the intent (which could be a
>>> primitive that runners should implement, possibly by swapping in 
>>> Reshuffle,
>>> or reject) would allow for this. That being said, the exact semantics of
>>> this transform is a bit of a rabbit hole which is why we never finished 
>>> the
>>> job of deprecating Reshuffle. This is a case where doing something is
>>> better than doing nothing, and our use of URNs for this kind of thing is
>>> flexible enough that we can deprecate old ones if/when we have time to
>>> pound out the right solution.
>>>
>>>

> Kenn
>
> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi 
> wrote:
>
>>
>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
>> rober...@google.com> wrote:
>>
>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <
>>> rang...@google.com> wrote:
>>>
 Thanks Kenn.

 On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <
 k...@google.com> wrote:

> The fact that its usage has grown probably indicates that we
> have a large number of transforms that can easily cause data loss 
> /
> duplication.
>

 Is this specific to Reshuffle or it is true for any GroupByKey?
 I see Reshuffle as just a wrapper around GBK.

>>> The issue is when it's used in such a way that data corruption
>>> can occur when the underlying GBK output is not stable.
>>>
>>
>> Could you describe this breakage bit more in detail or give a
>> example? Apologies in advance, I know this came up in multiple 
>> contexts in
>> the past, but I haven't grokked the issue well. It is the window 
>> rewrite
>> that Reshuffle does that causes misuse of GBK?
>>
>> Thanks.
>>>

Re: What is the future of Reshuffle?

2018-05-21 Thread Robert Bradshaw
We should probably keep the warning and all the caveats until we introduce
the alternative (and migrate to it for the non-parallelism uses of
reshuffle). I was just proposing we do this via a separate transform that
just calls Reshuffle until we have the new story fully fleshed out (I don't
know if any runner supports RequresStableInput, and it isn't translated in
the Fn API) to avoid being in this intermediate state for yet another year.

On Sun, May 20, 2018 at 6:38 PM Raghu Angadi  wrote:

>
>
> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw 
> wrote:
>
>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi  wrote:
>>
>>> [...]
>>>
>> I think it would be much more user friendly to un-deprecate it to add a
>>> warning for advanced users about non-portability of durability/replay
>>> guarantees/stable input assumptions.
>>>

>> Yes, I think everyone in this thread is in agreement here. We should
>> provide a *different* transform that provides the durability guarantees
>> (with caveats). In the meantime, this delegating to a reshuffle would be
>> better than using a reshuffle directly.
>>
>
> Great. Sent a PR to undeprecate Reshuffle :
> https://github.com/apache/beam/pull/5432
> The wording there for JavaDoc just a proposal...
>
> Raghu.
>
>
>> We tend to put in reshuffles in order to "commit" these random values and
 make them stable for the next stage, to be used to provide the needed
 idempotency for sinks.

>>>
>>> In such cases, I think the author should error out on the runner
>>> that don't provide that guarantee. That is what ExactlyOnceSink in 
>>> KafkaIO
>>> does [1].
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>>>
>>
>> We're moving to a world where the runner may not be known at pipeline
>> construction time. However, explicitly using a (distinct) 
>> make-input-stable
>> transform when that's the intent (which could be a primitive that runners
>> should implement, possibly by swapping in Reshuffle, or reject) would 
>> allow
>> for this. That being said, the exact semantics of this transform is a bit
>> of a rabbit hole which is why we never finished the job of deprecating
>> Reshuffle. This is a case where doing something is better than doing
>> nothing, and our use of URNs for this kind of thing is flexible enough 
>> that
>> we can deprecate old ones if/when we have time to pound out the right
>> solution.
>>
>>
>>>
 Kenn

 On Fri, May 18, 2018 at 4:05 PM Raghu Angadi 
 wrote:

>
> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
> rober...@google.com> wrote:
>
>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi 
>> wrote:
>>
>>> Thanks Kenn.
>>>
>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles 
>>> wrote:
>>>
 The fact that its usage has grown probably indicates that we
 have a large number of transforms that can easily cause data loss /
 duplication.

>>>
>>> Is this specific to Reshuffle or it is true for any GroupByKey?
>>> I see Reshuffle as just a wrapper around GBK.
>>>
>> The issue is when it's used in such a way that data corruption
>> can occur when the underlying GBK output is not stable.
>>
>
> Could you describe this breakage bit more in detail or give a
> example? Apologies in advance, I know this came up in multiple 
> contexts in
> the past, but I haven't grokked the issue well. It is the window 
> rewrite
> that Reshuffle does that causes misuse of GBK?
>
> Thanks.
>



Re: What is the future of Reshuffle?

2018-05-20 Thread Raghu Angadi
On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw 
wrote:

> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi  wrote:
>
>> [...]
>>
> I think it would be much more user friendly to un-deprecate it to add a
>> warning for advanced users about non-portability of durability/replay
>> guarantees/stable input assumptions.
>>
>>>
> Yes, I think everyone in this thread is in agreement here. We should
> provide a *different* transform that provides the durability guarantees
> (with caveats). In the meantime, this delegating to a reshuffle would be
> better than using a reshuffle directly.
>

Great. Sent a PR to undeprecate Reshuffle :
https://github.com/apache/beam/pull/5432
The wording there for JavaDoc just a proposal...

Raghu.


> We tend to put in reshuffles in order to "commit" these random values and
>>> make them stable for the next stage, to be used to provide the needed
>>> idempotency for sinks.
>>>
>>
>> In such cases, I think the author should error out on the runner that
>> don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO 
>> does
>> [1].
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>>
>
> We're moving to a world where the runner may not be known at pipeline
> construction time. However, explicitly using a (distinct) 
> make-input-stable
> transform when that's the intent (which could be a primitive that runners
> should implement, possibly by swapping in Reshuffle, or reject) would 
> allow
> for this. That being said, the exact semantics of this transform is a bit
> of a rabbit hole which is why we never finished the job of deprecating
> Reshuffle. This is a case where doing something is better than doing
> nothing, and our use of URNs for this kind of thing is flexible enough 
> that
> we can deprecate old ones if/when we have time to pound out the right
> solution.
>
>
>>
>>> Kenn
>>>
>>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi 
>>> wrote:
>>>

 On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
 rober...@google.com> wrote:

> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi 
> wrote:
>
>> Thanks Kenn.
>>
>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles 
>> wrote:
>>
>>> The fact that its usage has grown probably indicates that we
>>> have a large number of transforms that can easily cause data loss /
>>> duplication.
>>>
>>
>> Is this specific to Reshuffle or it is true for any GroupByKey? I
>> see Reshuffle as just a wrapper around GBK.
>>
> The issue is when it's used in such a way that data corruption can
> occur when the underlying GBK output is not stable.
>

 Could you describe this breakage bit more in detail or give a
 example? Apologies in advance, I know this came up in multiple 
 contexts in
 the past, but I haven't grokked the issue well. It is the window 
 rewrite
 that Reshuffle does that causes misuse of GBK?

 Thanks.

>>>


Re: What is the future of Reshuffle?

2018-05-20 Thread Reuven Lax
On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw 
wrote:

> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi  wrote:
>
>> On Sat, May 19, 2018 at 8:11 AM Robert Bradshaw 
>> wrote:
>>
>>> On Fri, May 18, 2018 at 6:29 PM Raghu Angadi  wrote:
>>>
 True. I am still failing to see what is broken about Reshuffle that is
 also not broken with GroupByKey transform. If someone depends on GroupByKey
 to get stable input, isn't that equally incorrect/unportable?

>>>
>>> Yes, if people use GBK in that way, it's also just as broken. The
>>> thought is that fewer people would use it with that intent, as GBK is not a
>>> no-op (it transforms the shape of the data, and also does not preserve
>>> windowing). This is in contrast to Reshuffle which was encouraged for this
>>> usecase.
>>>
>>
>> I see. I am not aware of any recommendation for users (excluding advanced
>> users) to use this for stable input/durability gaurantees. Every single
>> case where I recommended Reshuffle was related to parallelism (there were
>> many such cases). Most of use of Reshuflle/GBK for stable input were done
>> consciously by the authors, fully aware of the caveats (SDF in Dataflow,
>> Kafka EOS sink use of GBK, etc).
>>
>> As a result, deprecation is only hurting the innocent users who are using
>> Reshuffle correctly.
>>
>> I think it would be much more user friendly to un-deprecate it to add a
>> warning for advanced users about non-portability of durability/replay
>> guarantees/stable input assumptions.
>>
>>>
> Yes, I think everyone in this thread is in agreement here. We should
> provide a *different* transform that provides the durability guarantees
> (with caveats). In the meantime, this delegating to a reshuffle would be
> better than using a reshuffle directly.
>

The different thing has already been discussed on the dev thread - it's the
RequresStableInput property on a DoFn's process method. The annotation has
even been added already (for Java), it just has never been hooked up.


>
>
>> We tend to put in reshuffles in order to "commit" these random values and
>>> make them stable for the next stage, to be used to provide the needed
>>> idempotency for sinks.
>>>
>>
>> In such cases, I think the author should error out on the runner that
>> don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO 
>> does
>> [1].
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>>
>
> We're moving to a world where the runner may not be known at pipeline
> construction time. However, explicitly using a (distinct) 
> make-input-stable
> transform when that's the intent (which could be a primitive that runners
> should implement, possibly by swapping in Reshuffle, or reject) would 
> allow
> for this. That being said, the exact semantics of this transform is a bit
> of a rabbit hole which is why we never finished the job of deprecating
> Reshuffle. This is a case where doing something is better than doing
> nothing, and our use of URNs for this kind of thing is flexible enough 
> that
> we can deprecate old ones if/when we have time to pound out the right
> solution.
>
>
>>
>>> Kenn
>>>
>>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi 
>>> wrote:
>>>

 On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
 rober...@google.com> wrote:

> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi 
> wrote:
>
>> Thanks Kenn.
>>
>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles 
>> wrote:
>>
>>> The fact that its usage has grown probably indicates that we
>>> have a large number of transforms that can easily cause data loss /
>>> duplication.
>>>
>>
>> Is this specific to Reshuffle or it is true for any GroupByKey? I
>> see Reshuffle as just a wrapper around GBK.
>>
> The issue is when it's used in such a way that data corruption can
> occur when the underlying GBK output is not stable.
>

 Could you describe this breakage bit more in detail or give a
 example? Apologies in advance, I know this came up in multiple 
 contexts in
 the past, but I haven't grokked the issue well. It is the window 
 rewrite
 that Reshuffle does that causes misuse of GBK?

 Thanks.

>>>


Re: What is the future of Reshuffle?

2018-05-19 Thread Robert Bradshaw
On Sat, May 19, 2018 at 6:27 PM Raghu Angadi  wrote:

> On Sat, May 19, 2018 at 8:11 AM Robert Bradshaw 
> wrote:
>
>> On Fri, May 18, 2018 at 6:29 PM Raghu Angadi  wrote:
>>
>>> True. I am still failing to see what is broken about Reshuffle that is
>>> also not broken with GroupByKey transform. If someone depends on GroupByKey
>>> to get stable input, isn't that equally incorrect/unportable?
>>>
>>
>> Yes, if people use GBK in that way, it's also just as broken. The thought
>> is that fewer people would use it with that intent, as GBK is not a no-op
>> (it transforms the shape of the data, and also does not preserve
>> windowing). This is in contrast to Reshuffle which was encouraged for this
>> usecase.
>>
>
> I see. I am not aware of any recommendation for users (excluding advanced
> users) to use this for stable input/durability gaurantees. Every single
> case where I recommended Reshuffle was related to parallelism (there were
> many such cases). Most of use of Reshuflle/GBK for stable input were done
> consciously by the authors, fully aware of the caveats (SDF in Dataflow,
> Kafka EOS sink use of GBK, etc).
>
> As a result, deprecation is only hurting the innocent users who are using
> Reshuffle correctly.
>
> I think it would be much more user friendly to un-deprecate it to add a
> warning for advanced users about non-portability of durability/replay
> guarantees/stable input assumptions.
>
>>
Yes, I think everyone in this thread is in agreement here. We should
provide a *different* transform that provides the durability guarantees
(with caveats). In the meantime, this delegating to a reshuffle would be
better than using a reshuffle directly.


> We tend to put in reshuffles in order to "commit" these random values and
>> make them stable for the next stage, to be used to provide the needed
>> idempotency for sinks.
>>
>
> In such cases, I think the author should error out on the runner that
> don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does
> [1].
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>

 We're moving to a world where the runner may not be known at pipeline
 construction time. However, explicitly using a (distinct) make-input-stable
 transform when that's the intent (which could be a primitive that runners
 should implement, possibly by swapping in Reshuffle, or reject) would allow
 for this. That being said, the exact semantics of this transform is a bit
 of a rabbit hole which is why we never finished the job of deprecating
 Reshuffle. This is a case where doing something is better than doing
 nothing, and our use of URNs for this kind of thing is flexible enough that
 we can deprecate old ones if/when we have time to pound out the right
 solution.


>
>> Kenn
>>
>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi 
>> wrote:
>>
>>>
>>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
>>> rober...@google.com> wrote:
>>>
 On Fri, May 18, 2018 at 11:46 AM Raghu Angadi 
 wrote:

> Thanks Kenn.
>
> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles 
> wrote:
>
>> The fact that its usage has grown probably indicates that we have
>> a large number of transforms that can easily cause data loss / 
>> duplication.
>>
>
> Is this specific to Reshuffle or it is true for any GroupByKey? I
> see Reshuffle as just a wrapper around GBK.
>
 The issue is when it's used in such a way that data corruption can
 occur when the underlying GBK output is not stable.

>>>
>>> Could you describe this breakage bit more in detail or give a
>>> example? Apologies in advance, I know this came up in multiple contexts 
>>> in
>>> the past, but I haven't grokked the issue well. It is the window rewrite
>>> that Reshuffle does that causes misuse of GBK?
>>>
>>> Thanks.
>>>
>>


Re: What is the future of Reshuffle?

2018-05-19 Thread Raghu Angadi
On Sat, May 19, 2018 at 8:11 AM Robert Bradshaw  wrote:

> On Fri, May 18, 2018 at 6:29 PM Raghu Angadi  wrote:
>
>> True. I am still failing to see what is broken about Reshuffle that is
>> also not broken with GroupByKey transform. If someone depends on GroupByKey
>> to get stable input, isn't that equally incorrect/unportable?
>>
>
> Yes, if people use GBK in that way, it's also just as broken. The thought
> is that fewer people would use it with that intent, as GBK is not a no-op
> (it transforms the shape of the data, and also does not preserve
> windowing). This is in contrast to Reshuffle which was encouraged for this
> usecase.
>

I see. I am not aware of any recommendation for users (excluding advanced
users) to use this for stable input/durability gaurantees. Every single
case where I recommended Reshuffle was related to parallelism (there were
many such cases). Most of use of Reshuflle/GBK for stable input were done
consciously by the authors, fully aware of the caveats (SDF in Dataflow,
Kafka EOS sink use of GBK, etc).

As a result, deprecation is only hurting the innocent users who are using
Reshuffle correctly.

I think it would be much more user friendly to un-deprecate it to add a
warning for advanced users about non-portability of durability/replay
guarantees/stable input assumptions.

Raghu.

We tend to put in reshuffles in order to "commit" these random values and
> make them stable for the next stage, to be used to provide the needed
> idempotency for sinks.
>

 In such cases, I think the author should error out on the runner that
 don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does
 [1].

 [1]
 https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049

>>>
>>> We're moving to a world where the runner may not be known at pipeline
>>> construction time. However, explicitly using a (distinct) make-input-stable
>>> transform when that's the intent (which could be a primitive that runners
>>> should implement, possibly by swapping in Reshuffle, or reject) would allow
>>> for this. That being said, the exact semantics of this transform is a bit
>>> of a rabbit hole which is why we never finished the job of deprecating
>>> Reshuffle. This is a case where doing something is better than doing
>>> nothing, and our use of URNs for this kind of thing is flexible enough that
>>> we can deprecate old ones if/when we have time to pound out the right
>>> solution.
>>>
>>>

> Kenn
>
> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi 
> wrote:
>
>>
>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw 
>> wrote:
>>
>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi 
>>> wrote:
>>>
 Thanks Kenn.

 On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles 
 wrote:

> The fact that its usage has grown probably indicates that we have
> a large number of transforms that can easily cause data loss / 
> duplication.
>

 Is this specific to Reshuffle or it is true for any GroupByKey? I
 see Reshuffle as just a wrapper around GBK.

>>> The issue is when it's used in such a way that data corruption can
>>> occur when the underlying GBK output is not stable.
>>>
>>
>> Could you describe this breakage bit more in detail or give a
>> example? Apologies in advance, I know this came up in multiple contexts 
>> in
>> the past, but I haven't grokked the issue well. It is the window rewrite
>> that Reshuffle does that causes misuse of GBK?
>>
>> Thanks.
>>
>


Re: What is the future of Reshuffle?

2018-05-19 Thread Robert Bradshaw
On Fri, May 18, 2018 at 6:29 PM Raghu Angadi  wrote:

>
> On Fri, May 18, 2018 at 5:34 PM Robert Bradshaw 
> wrote:
>
>> Ah, thanks, that makes sense. That implies to me Reshuffle is no more
>>> broken than GBK itself. May be Reshuffle.viaRandomKey() could have a clear
>>> caveat. Reshuffle's JavaDoc could add a caveat too about non-deterministic
>>> keys and retries (though it applies to GroupByKey in general).
>>>
>>
>> The "randomness" of Reshuffle.viaRandomKey() is fine, as the randomly
>> generated key is never exposed to the users (so it doesn't matter if it
>> changes).
>>
>
> Agreed.
>
>
>> Reshuffle is broken if you are using it to get stable input on a runner
>> that doesn't always have stable input as an implementation detail of GBKs.
>>
>
> True. I am still failing to see what is broken about Reshuffle that is
> also not broken with GroupByKey transform. If someone depends on GroupByKey
> to get stable input, isn't that equally incorrect/unportable?
>

Yes, if people use GBK in that way, it's also just as broken. The thought
is that fewer people would use it with that intent, as GBK is not a no-op
(it transforms the shape of the data, and also does not preserve
windowing). This is in contrast to Reshuffle which was encouraged for this
usecase.


>
> Raghu.
>
>>
>> We tend to put in reshuffles in order to "commit" these random values and
 make them stable for the next stage, to be used to provide the needed
 idempotency for sinks.

>>>
>>> In such cases, I think the author should error out on the runner that
>>> don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does
>>> [1].
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>>>
>>
>> We're moving to a world where the runner may not be known at pipeline
>> construction time. However, explicitly using a (distinct) make-input-stable
>> transform when that's the intent (which could be a primitive that runners
>> should implement, possibly by swapping in Reshuffle, or reject) would allow
>> for this. That being said, the exact semantics of this transform is a bit
>> of a rabbit hole which is why we never finished the job of deprecating
>> Reshuffle. This is a case where doing something is better than doing
>> nothing, and our use of URNs for this kind of thing is flexible enough that
>> we can deprecate old ones if/when we have time to pound out the right
>> solution.
>>
>>
>>>
 Kenn

 On Fri, May 18, 2018 at 4:05 PM Raghu Angadi 
 wrote:

>
> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw 
> wrote:
>
>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi 
>> wrote:
>>
>>> Thanks Kenn.
>>>
>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles 
>>> wrote:
>>>
 The fact that its usage has grown probably indicates that we have a
 large number of transforms that can easily cause data loss / 
 duplication.

>>>
>>> Is this specific to Reshuffle or it is true for any GroupByKey? I
>>> see Reshuffle as just a wrapper around GBK.
>>>
>> The issue is when it's used in such a way that data corruption can
>> occur when the underlying GBK output is not stable.
>>
>
> Could you describe this breakage bit more in detail or give a example?
> Apologies in advance, I know this came up in multiple contexts in the 
> past,
> but I haven't grokked the issue well. It is the window rewrite that
> Reshuffle does that causes misuse of GBK?
>
> Thanks.
>



Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
On Fri, May 18, 2018 at 5:34 PM Robert Bradshaw  wrote:

> Ah, thanks, that makes sense. That implies to me Reshuffle is no more
>> broken than GBK itself. May be Reshuffle.viaRandomKey() could have a clear
>> caveat. Reshuffle's JavaDoc could add a caveat too about non-deterministic
>> keys and retries (though it applies to GroupByKey in general).
>>
>
> The "randomness" of Reshuffle.viaRandomKey() is fine, as the randomly
> generated key is never exposed to the users (so it doesn't matter if it
> changes).
>

Agreed.


> Reshuffle is broken if you are using it to get stable input on a runner
> that doesn't always have stable input as an implementation detail of GBKs.
>

True. I am still failing to see what is broken about Reshuffle that is also
not broken with GroupByKey transform. If someone depends on GroupByKey to
get stable input, isn't that equally incorrect/unportable?

Raghu.

>
> We tend to put in reshuffles in order to "commit" these random values and
>>> make them stable for the next stage, to be used to provide the needed
>>> idempotency for sinks.
>>>
>>
>> In such cases, I think the author should error out on the runner that
>> don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does
>> [1].
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>>
>
> We're moving to a world where the runner may not be known at pipeline
> construction time. However, explicitly using a (distinct) make-input-stable
> transform when that's the intent (which could be a primitive that runners
> should implement, possibly by swapping in Reshuffle, or reject) would allow
> for this. That being said, the exact semantics of this transform is a bit
> of a rabbit hole which is why we never finished the job of deprecating
> Reshuffle. This is a case where doing something is better than doing
> nothing, and our use of URNs for this kind of thing is flexible enough that
> we can deprecate old ones if/when we have time to pound out the right
> solution.
>
>
>>
>>> Kenn
>>>
>>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi  wrote:
>>>

 On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw 
 wrote:

> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi 
> wrote:
>
>> Thanks Kenn.
>>
>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles 
>> wrote:
>>
>>> The fact that its usage has grown probably indicates that we have a
>>> large number of transforms that can easily cause data loss / 
>>> duplication.
>>>
>>
>> Is this specific to Reshuffle or it is true for any GroupByKey? I see
>> Reshuffle as just a wrapper around GBK.
>>
> The issue is when it's used in such a way that data corruption can
> occur when the underlying GBK output is not stable.
>

 Could you describe this breakage bit more in detail or give a example?
 Apologies in advance, I know this came up in multiple contexts in the past,
 but I haven't grokked the issue well. It is the window rewrite that
 Reshuffle does that causes misuse of GBK?

 Thanks.

>>>


Re: What is the future of Reshuffle?

2018-05-18 Thread Robert Bradshaw
On Fri, May 18, 2018 at 4:24 PM Raghu Angadi  wrote:

> On Fri, May 18, 2018 at 4:07 PM Kenneth Knowles  wrote:
>
>> It isn't any particular logic in Reshuffle - it is, semantically, an
>> identity transform. It is the fact that other runners are perfectly able to
>> re-run transform prior to a GBK. So, for example, randomly generated IDs
>> will be re-generated.
>>
>
> Ah, thanks, that makes sense. That implies to me Reshuffle is no more
> broken than GBK itself. May be Reshuffle.viaRandomKey() could have a clear
> caveat. Reshuffle's JavaDoc could add a caveat too about non-deterministic
> keys and retries (though it applies to GroupByKey in general).
>

The "randomness" of Reshuffle.viaRandomKey() is fine, as the randomly
generated key is never exposed to the users (so it doesn't matter if it
changes). Reshuffle is broken if you are using it to get stable input on a
runner that doesn't always have stable input as an implementation detail of
GBKs.

We tend to put in reshuffles in order to "commit" these random values and
>> make them stable for the next stage, to be used to provide the needed
>> idempotency for sinks.
>>
>
> In such cases, I think the author should error out on the runner that
> don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does
> [1].
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>

We're moving to a world where the runner may not be known at pipeline
construction time. However, explicitly using a (distinct) make-input-stable
transform when that's the intent (which could be a primitive that runners
should implement, possibly by swapping in Reshuffle, or reject) would allow
for this. That being said, the exact semantics of this transform is a bit
of a rabbit hole which is why we never finished the job of deprecating
Reshuffle. This is a case where doing something is better than doing
nothing, and our use of URNs for this kind of thing is flexible enough that
we can deprecate old ones if/when we have time to pound out the right
solution.


>
>> Kenn
>>
>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi  wrote:
>>
>>>
>>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw 
>>> wrote:
>>>
 On Fri, May 18, 2018 at 11:46 AM Raghu Angadi 
 wrote:

> Thanks Kenn.
>
> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles 
> wrote:
>
>> The fact that its usage has grown probably indicates that we have a
>> large number of transforms that can easily cause data loss / duplication.
>>
>
> Is this specific to Reshuffle or it is true for any GroupByKey? I see
> Reshuffle as just a wrapper around GBK.
>
 The issue is when it's used in such a way that data corruption can
 occur when the underlying GBK output is not stable.

>>>
>>> Could you describe this breakage bit more in detail or give a example?
>>> Apologies in advance, I know this came up in multiple contexts in the past,
>>> but I haven't grokked the issue well. It is the window rewrite that
>>> Reshuffle does that causes misuse of GBK?
>>>
>>> Thanks.
>>>
>>


Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
On Fri, May 18, 2018 at 4:07 PM Kenneth Knowles  wrote:

> It isn't any particular logic in Reshuffle - it is, semantically, an
> identity transform. It is the fact that other runners are perfectly able to
> re-run transform prior to a GBK. So, for example, randomly generated IDs
> will be re-generated.
>

Ah, thanks, that makes sense. That implies to me Reshuffle is no more
broken than GBK itself. May be Reshuffle.viaRandomKey() could have a clear
caveat. Reshuffle's JavaDoc could add a caveat too about non-deterministic
keys and retries (though it applies to GroupByKey in general).

We tend to put in reshuffles in order to "commit" these random values and
> make them stable for the next stage, to be used to provide the needed
> idempotency for sinks.
>

In such cases, I think the author should error out on the runner that don't
provide that guarantee. That is what ExactlyOnceSink in KafkaIO does [1].

[1]
https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049


> Kenn
>
> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi  wrote:
>
>>
>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw 
>> wrote:
>>
>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi 
>>> wrote:
>>>
 Thanks Kenn.

 On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles 
 wrote:

> The fact that its usage has grown probably indicates that we have a
> large number of transforms that can easily cause data loss / duplication.
>

 Is this specific to Reshuffle or it is true for any GroupByKey? I see
 Reshuffle as just a wrapper around GBK.

>>> The issue is when it's used in such a way that data corruption can occur
>>> when the underlying GBK output is not stable.
>>>
>>
>> Could you describe this breakage bit more in detail or give a example?
>> Apologies in advance, I know this came up in multiple contexts in the past,
>> but I haven't grokked the issue well. It is the window rewrite that
>> Reshuffle does that causes misuse of GBK?
>>
>> Thanks.
>>
>


Re: What is the future of Reshuffle?

2018-05-18 Thread Kenneth Knowles
It isn't any particular logic in Reshuffle - it is, semantically, an
identity transform. It is the fact that other runners are perfectly able to
re-run transform prior to a GBK. So, for example, randomly generated IDs
will be re-generated. We tend to put in reshuffles in order to "commit"
these random values and make them stable for the next stage, to be used to
provide the needed idempotency for sinks.

Kenn

On Fri, May 18, 2018 at 4:05 PM Raghu Angadi  wrote:

>
> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw 
> wrote:
>
>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi  wrote:
>>
>>> Thanks Kenn.
>>>
>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles  wrote:
>>>
 The fact that its usage has grown probably indicates that we have a
 large number of transforms that can easily cause data loss / duplication.

>>>
>>> Is this specific to Reshuffle or it is true for any GroupByKey? I see
>>> Reshuffle as just a wrapper around GBK.
>>>
>> The issue is when it's used in such a way that data corruption can occur
>> when the underlying GBK output is not stable.
>>
>
> Could you describe this breakage bit more in detail or give a example?
> Apologies in advance, I know this came up in multiple contexts in the past,
> but I haven't grokked the issue well. It is the window rewrite that
> Reshuffle does that causes misuse of GBK?
>
> Thanks.
>


Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
On Fri, May 18, 2018 at 12:22 PM Robert Bradshaw 
wrote:

> [resending]
>
Agreed that keeping this deprecated without a clear replacement for so long
> is not ideal.
>
> I would at least break this into two separate transforms, the
> parallelism-breaking one (which seems OK) and the stable input one (which
> may just call the parallelism-breaking one, but should be decorated with
> lots of caveats and maybe even still have the deprecated annotation).
>

+1. Parallelism-breaking one is the most relevant to many users. Would love
to see that part deprecated, ideally keeping the name Reshuffle.

Raghu.


>
>
> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles  wrote:
>
>> The fact that its usage has grown probably indicates that we have a large
>> number of transforms that can easily cause data loss / duplication.
>>
>> Yes, it is deprecated because it is primarily used as a Dataflow-specific
>> way to ensure stable input. My understanding is that the SparkRunner also
>> materializes at every GBK so it works there too (is this still the case?).
>> It doesn't work at all for other runners AFAIK. So it is @Deprecated not
>> because there is a replacement, but because it is kind of dangerous to use. 
>> Beam
>> could just say "GBK must ensure stable output" and "a composite containing
>> a GBK has to ensure stable output even if replaced" and that would solve
>> the issue, but I think it would make Beam on Flink impossibly slow - I
>> could be wrong about that. Generally stable input is tied to durability
>> model which is a key design point for engines.
>>
>> True that it isn't the only use, and I know you have been trying to nail
>> down what the uses actually are. Ben wrote up various uses in a portable
>> manner at https://beam.apache.org/documentation/execution-model.
>>
>>  - Coupled failure is the use where Reshuffle is to provide stable input
>>  - Breaking dependent parallelism is more portable - but since it is the
>> identity transform a runner may just elide it; it is a hint, basically, and
>> that seem OK (but can we do it more directly?)
>>
>> What I don't want is to build something where the implementation details
>> are the spec, and not fundamental, which is sort of where Reshuffle lies.
>> This thread highlights that this is a pretty urgent problem with our SDKs
>> and runners that it would be very helpful to work on.
>>
>> Kenn
>>
>>
>>
>> On Fri, May 18, 2018 at 7:50 AM Eugene Kirpichov 
>> wrote:
>>
>>> Agreed that it should be undeprecated, many users are getting confused
>>> by this.
>>> I know that some people are working on a replacement for at least one of
>>> its use cases (RequiresStableInput), but the use case of breaking fusion
>>> is, as of yet, unaddressed, and there's not much to be gained by keeping it
>>> deprecated.
>>>
>>> On Fri, May 18, 2018 at 7:45 AM Raghu Angadi  wrote:
>>>
 I am interested in more clarity on this as well. It has been deprecated
 for a long time without a replacement, and its usage has only grown, both
 within Beam code base as well as in user applications.

 If we are certain that it will not be removed before there is a good
 replacement for it, can we undeprecate it until there are proper plans for
 replacement?

 On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:

> I saw in a recent thread that the use of the Reshuffle transform was
> recommended to solve an user issue:
>
>
> https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E
>
> I can see why it may fix the reported issue. I am just curious about
> the fact that the Reshuffle transform is marked as both @Internal and
> @Deprecated in Beam's SDK.
>
> Do we have some alternative? So far the class documentation does not
> recommend any replacement.
>



Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw 
wrote:

> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi  wrote:
>
>> Thanks Kenn.
>>
>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles  wrote:
>>
>>> The fact that its usage has grown probably indicates that we have a
>>> large number of transforms that can easily cause data loss / duplication.
>>>
>>
>> Is this specific to Reshuffle or it is true for any GroupByKey? I see
>> Reshuffle as just a wrapper around GBK.
>>
> The issue is when it's used in such a way that data corruption can occur
> when the underlying GBK output is not stable.
>

Could you describe this breakage bit more in detail or give a example?
Apologies in advance, I know this came up in multiple contexts in the past,
but I haven't grokked the issue well. It is the window rewrite that
Reshuffle does that causes misuse of GBK?

Thanks.


Re: What is the future of Reshuffle?

2018-05-18 Thread Robert Bradshaw
[resending]

Agreed that keeping this deprecated without a clear replacement for so long
is not ideal.

I would at least break this into two separate transforms, the
parallelism-breaking one (which seems OK) and the stable input one (which
may just call the parallelism-breaking one, but should be decorated with
lots of caveats and maybe even still have the deprecated annotation).

On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles  wrote:

> The fact that its usage has grown probably indicates that we have a large
> number of transforms that can easily cause data loss / duplication.
>
> Yes, it is deprecated because it is primarily used as a Dataflow-specific
> way to ensure stable input. My understanding is that the SparkRunner also
> materializes at every GBK so it works there too (is this still the case?).
> It doesn't work at all for other runners AFAIK. So it is @Deprecated not
> because there is a replacement, but because it is kind of dangerous to use. 
> Beam
> could just say "GBK must ensure stable output" and "a composite containing
> a GBK has to ensure stable output even if replaced" and that would solve
> the issue, but I think it would make Beam on Flink impossibly slow - I
> could be wrong about that. Generally stable input is tied to durability
> model which is a key design point for engines.
>
> True that it isn't the only use, and I know you have been trying to nail
> down what the uses actually are. Ben wrote up various uses in a portable
> manner at https://beam.apache.org/documentation/execution-model.
>
>  - Coupled failure is the use where Reshuffle is to provide stable input
>  - Breaking dependent parallelism is more portable - but since it is the
> identity transform a runner may just elide it; it is a hint, basically, and
> that seem OK (but can we do it more directly?)
>
> What I don't want is to build something where the implementation details
> are the spec, and not fundamental, which is sort of where Reshuffle lies.
> This thread highlights that this is a pretty urgent problem with our SDKs
> and runners that it would be very helpful to work on.
>
> Kenn
>
>
>
> On Fri, May 18, 2018 at 7:50 AM Eugene Kirpichov 
> wrote:
>
>> Agreed that it should be undeprecated, many users are getting confused by
>> this.
>> I know that some people are working on a replacement for at least one of
>> its use cases (RequiresStableInput), but the use case of breaking fusion
>> is, as of yet, unaddressed, and there's not much to be gained by keeping it
>> deprecated.
>>
>> On Fri, May 18, 2018 at 7:45 AM Raghu Angadi  wrote:
>>
>>> I am interested in more clarity on this as well. It has been deprecated
>>> for a long time without a replacement, and its usage has only grown, both
>>> within Beam code base as well as in user applications.
>>>
>>> If we are certain that it will not be removed before there is a good
>>> replacement for it, can we undeprecate it until there are proper plans for
>>> replacement?
>>>
>>> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:
>>>
 I saw in a recent thread that the use of the Reshuffle transform was
 recommended to solve an user issue:


 https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E

 I can see why it may fix the reported issue. I am just curious about
 the fact that the Reshuffle transform is marked as both @Internal and
 @Deprecated in Beam's SDK.

 Do we have some alternative? So far the class documentation does not
 recommend any replacement.

>>>


Re: What is the future of Reshuffle?

2018-05-18 Thread Robert Bradshaw
On Fri, May 18, 2018 at 11:46 AM Raghu Angadi  wrote:

> Thanks Kenn.
>
> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles  wrote:
>
>> The fact that its usage has grown probably indicates that we have a large
>> number of transforms that can easily cause data loss / duplication.
>>
>
> Is this specific to Reshuffle or it is true for any GroupByKey? I see
> Reshuffle as just a wrapper around GBK.
>

The issue is when it's used in such a way that data corruption can occur
when the underlying GBK output is not stable.


Re: What is the future of Reshuffle?

2018-05-18 Thread Robert Bradshaw
On Fri, May 18, 2018 at 11:46 AM Raghu Angadi
 wrote:>> Thanks
Kenn.>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles
 wrote: The fact that
its usage has grown probably indicates that we have a large number of
transforms that can easily cause data loss /
duplication.>> Is this specific to Reshuffle or it is
true for any GroupByKey? I see Reshuffle as just a wrapper around
GBK.The issue is when it's used in such a way that data
corruption can occur when the underlying GBK output is not stable.


Re: What is the future of Reshuffle?

2018-05-18 Thread Robert Bradshaw
Agreed that keeping this deprecated without a clear replacement for so
long is not ideal. I would at least break this into two
separate transforms, the parallelism-breaking one (which seems OK) and
the stable input one (which may just call the parallelism-breaking
one, but should be decorated with lots of caveats and maybe even still
have the deprecated annotation). On Fri, May 18, 2018 at 11:02 AM
Kenneth Knowles  wrote:>> The fact
that its usage has grown probably indicates that we have a large
number of transforms that can easily cause data loss /
duplication.>> Yes, it is deprecated because it is
primarily used as a Dataflow-specific way to ensure stable input. My
understanding is that the SparkRunner also materializes at every GBK
so it works there too (is this still the case?). It doesn't work
at all for other runners AFAIK. So it is @Deprecated not because there
is a replacement, but because it is kind of dangerous to use. Beam
could just say "GBK must ensure stable output" and "a
composite containing a GBK has to ensure stable output even if
replaced" and that would solve the issue, but I think it would
make Beam on Flink impossibly slow - I could be wrong about that.
Generally stable input is tied to durability model which is a key
design point for engines.>> True that it isn't the
only use, and I know you have been trying to nail down what the uses
actually are. Ben wrote up various uses in a portable manner at
https://beam.apache.org/documentation/execution-model.>>
 - Coupled failure is the use where Reshuffle is to provide
stable input>  - Breaking dependent parallelism is more
portable - but since it is the identity transform a runner may just
elide it; it is a hint, basically, and that seem OK (but can we do it
more directly?)>> What I don't want is to build
something where the implementation details are the spec, and not
fundamental, which is sort of where Reshuffle lies. This thread
highlights that this is a pretty urgent problem with our SDKs and
runners that it would be very helpful to work on.>>
Kenn On Fri, May 18, 2018 at 7:50 AM
Eugene Kirpichov 
wrote: Agreed that it should be undeprecated,
many users are getting confused by this.>> I know that some
people are working on a replacement for at least one of its use cases
(RequiresStableInput), but the use case of breaking fusion is, as of
yet, unaddressed, and there's not much to be gained by keeping it
deprecated. On Fri, May 18, 2018 at 7:45 AM
Raghu Angadi 
wrote:>> I am interested in more clarity
on this as well. It has been deprecated for a long time without a
replacement, and its usage has only grown, both within Beam code base
as well as in user applications.>> If we
are certain that it will not be removed before there is a good
replacement for it, can we undeprecate it until there are proper plans
for replacement? >> On Fri, May 18, 2018
at 7:12 AM Ismaël Mejía 
wrote: I saw in a recent
thread that the use of the Reshuffle transform was
recommended to solve an user
issue:
https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E
I can see why it may fix the reported issue. I am just curious
about the fact that the Reshuffle transform is
marked as both @Internal and @Deprecated in
Beam's SDK. Do we have
some alternative? So far the class documentation does
not recommend any replacement.


Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
Thanks Kenn.

On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles  wrote:

> The fact that its usage has grown probably indicates that we have a large
> number of transforms that can easily cause data loss / duplication.
>

Is this specific to Reshuffle or it is true for any GroupByKey? I see
Reshuffle as just a wrapper around GBK.

Raghu.

>
>
> Yes, it is deprecated because it is primarily used as a Dataflow-specific
> way to ensure stable input. My understanding is that the SparkRunner also
> materializes at every GBK so it works there too (is this still the case?).
> It doesn't work at all for other runners AFAIK. So it is @Deprecated not
> because there is a replacement, but because it is kind of dangerous to use. 
> Beam
> could just say "GBK must ensure stable output" and "a composite containing
> a GBK has to ensure stable output even if replaced" and that would solve
> the issue, but I think it would make Beam on Flink impossibly slow - I
> could be wrong about that. Generally stable input is tied to durability
> model which is a key design point for engines.
>
> True that it isn't the only use, and I know you have been trying to nail
> down what the uses actually are. Ben wrote up various uses in a portable
> manner at https://beam.apache.org/documentation/execution-model.
>
>  - Coupled failure is the use where Reshuffle is to provide stable input
>  - Breaking dependent parallelism is more portable - but since it is the
> identity transform a runner may just elide it; it is a hint, basically, and
> that seem OK (but can we do it more directly?)
>
> What I don't want is to build something where the implementation details
> are the spec, and not fundamental, which is sort of where Reshuffle lies.
> This thread highlights that this is a pretty urgent problem with our SDKs
> and runners that it would be very helpful to work on.
>
> Kenn
>
>
>
> On Fri, May 18, 2018 at 7:50 AM Eugene Kirpichov 
> wrote:
>
>> Agreed that it should be undeprecated, many users are getting confused by
>> this.
>> I know that some people are working on a replacement for at least one of
>> its use cases (RequiresStableInput), but the use case of breaking fusion
>> is, as of yet, unaddressed, and there's not much to be gained by keeping it
>> deprecated.
>>
>> On Fri, May 18, 2018 at 7:45 AM Raghu Angadi  wrote:
>>
>>> I am interested in more clarity on this as well. It has been deprecated
>>> for a long time without a replacement, and its usage has only grown, both
>>> within Beam code base as well as in user applications.
>>>
>>> If we are certain that it will not be removed before there is a good
>>> replacement for it, can we undeprecate it until there are proper plans for
>>> replacement?
>>>
>>> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:
>>>
 I saw in a recent thread that the use of the Reshuffle transform was
 recommended to solve an user issue:


 https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E

 I can see why it may fix the reported issue. I am just curious about
 the fact that the Reshuffle transform is marked as both @Internal and
 @Deprecated in Beam's SDK.

 Do we have some alternative? So far the class documentation does not
 recommend any replacement.

>>>


Re: What is the future of Reshuffle?

2018-05-18 Thread Kenneth Knowles
The fact that its usage has grown probably indicates that we have a large
number of transforms that can easily cause data loss / duplication.

Yes, it is deprecated because it is primarily used as a Dataflow-specific
way to ensure stable input. My understanding is that the SparkRunner also
materializes at every GBK so it works there too (is this still the case?).
It doesn't work at all for other runners AFAIK. So it is @Deprecated not
because there is a replacement, but because it is kind of dangerous to
use. Beam
could just say "GBK must ensure stable output" and "a composite containing
a GBK has to ensure stable output even if replaced" and that would solve
the issue, but I think it would make Beam on Flink impossibly slow - I
could be wrong about that. Generally stable input is tied to durability
model which is a key design point for engines.

True that it isn't the only use, and I know you have been trying to nail
down what the uses actually are. Ben wrote up various uses in a portable
manner at https://beam.apache.org/documentation/execution-model.

 - Coupled failure is the use where Reshuffle is to provide stable input
 - Breaking dependent parallelism is more portable - but since it is the
identity transform a runner may just elide it; it is a hint, basically, and
that seem OK (but can we do it more directly?)

What I don't want is to build something where the implementation details
are the spec, and not fundamental, which is sort of where Reshuffle lies.
This thread highlights that this is a pretty urgent problem with our SDKs
and runners that it would be very helpful to work on.

Kenn



On Fri, May 18, 2018 at 7:50 AM Eugene Kirpichov 
wrote:

> Agreed that it should be undeprecated, many users are getting confused by
> this.
> I know that some people are working on a replacement for at least one of
> its use cases (RequiresStableInput), but the use case of breaking fusion
> is, as of yet, unaddressed, and there's not much to be gained by keeping it
> deprecated.
>
> On Fri, May 18, 2018 at 7:45 AM Raghu Angadi  wrote:
>
>> I am interested in more clarity on this as well. It has been deprecated
>> for a long time without a replacement, and its usage has only grown, both
>> within Beam code base as well as in user applications.
>>
>> If we are certain that it will not be removed before there is a good
>> replacement for it, can we undeprecate it until there are proper plans for
>> replacement?
>>
>> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:
>>
>>> I saw in a recent thread that the use of the Reshuffle transform was
>>> recommended to solve an user issue:
>>>
>>>
>>> https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E
>>>
>>> I can see why it may fix the reported issue. I am just curious about
>>> the fact that the Reshuffle transform is marked as both @Internal and
>>> @Deprecated in Beam's SDK.
>>>
>>> Do we have some alternative? So far the class documentation does not
>>> recommend any replacement.
>>>
>>


Re: What is the future of Reshuffle?

2018-05-18 Thread Eugene Kirpichov
Agreed that it should be undeprecated, many users are getting confused by
this.
I know that some people are working on a replacement for at least one of
its use cases (RequiresStableInput), but the use case of breaking fusion
is, as of yet, unaddressed, and there's not much to be gained by keeping it
deprecated.

On Fri, May 18, 2018 at 7:45 AM Raghu Angadi  wrote:

> I am interested in more clarity on this as well. It has been deprecated
> for a long time without a replacement, and its usage has only grown, both
> within Beam code base as well as in user applications.
>
> If we are certain that it will not be removed before there is a good
> replacement for it, can we undeprecate it until there are proper plans for
> replacement?
>
> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:
>
>> I saw in a recent thread that the use of the Reshuffle transform was
>> recommended to solve an user issue:
>>
>>
>> https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E
>>
>> I can see why it may fix the reported issue. I am just curious about
>> the fact that the Reshuffle transform is marked as both @Internal and
>> @Deprecated in Beam's SDK.
>>
>> Do we have some alternative? So far the class documentation does not
>> recommend any replacement.
>>
>


Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
I am interested in more clarity on this as well. It has been deprecated for
a long time without a replacement, and its usage has only grown, both
within Beam code base as well as in user applications.

If we are certain that it will not be removed before there is a good
replacement for it, can we undeprecate it until there are proper plans for
replacement?

On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:

> I saw in a recent thread that the use of the Reshuffle transform was
> recommended to solve an user issue:
>
>
> https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E
>
> I can see why it may fix the reported issue. I am just curious about
> the fact that the Reshuffle transform is marked as both @Internal and
> @Deprecated in Beam's SDK.
>
> Do we have some alternative? So far the class documentation does not
> recommend any replacement.
>


What is the future of Reshuffle?

2018-05-18 Thread Ismaël Mejía
I saw in a recent thread that the use of the Reshuffle transform was
recommended to solve an user issue:

https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E

I can see why it may fix the reported issue. I am just curious about
the fact that the Reshuffle transform is marked as both @Internal and
@Deprecated in Beam's SDK.

Do we have some alternative? So far the class documentation does not
recommend any replacement.