Re: What is the future of Reshuffle?
Filed https://issues.apache.org/jira/browse/BEAM-4372 (unassigned). On Mon, May 21, 2018 at 10:22 AM Raghu Angadi wrote: > > > On Mon, May 21, 2018 at 9:56 AM Robert Bradshaw > wrote: > >> We should probably keep the warning and all the caveats until we >> introduce the alternative (and migrate to it for the non-parallelism uses >> of reshuffle). I was just proposing we do this via a separate transform >> that just calls Reshuffle until we have the new story fully fleshed out (I >> don't know if any runner supports RequresStableInput, and it isn't >> translated in the Fn API) to avoid being in this intermediate state for >> yet another year. >> > > I think all warnings and caveats really belong in JavaDoc for GroupByKey, > since that is the actual transform the the semantics that we are concerned > about incorrect use. A reshuffle transform can refer the users there. > > In addition, I think Reshuffle is really good name for reshuffle transform > meant for most users. > > That said, any reorganization is much better than deprecation. > > Raghu. > > >> >> On Sun, May 20, 2018 at 6:38 PM Raghu Angadi wrote: >> >>> >>> >>> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw >>> wrote: >>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi wrote: > [...] > I think it would be much more user friendly to un-deprecate it to add a > warning for advanced users about non-portability of durability/replay > guarantees/stable input assumptions. > >> Yes, I think everyone in this thread is in agreement here. We should provide a *different* transform that provides the durability guarantees (with caveats). In the meantime, this delegating to a reshuffle would be better than using a reshuffle directly. >>> >>> Great. Sent a PR to undeprecate Reshuffle : >>> https://github.com/apache/beam/pull/5432 >>> The wording there for JavaDoc just a proposal... >>> >>> Raghu. >>> >>> We tend to put in reshuffles in order to "commit" these random values >> and make them stable for the next stage, to be used to provide the >> needed >> idempotency for sinks. >> > > In such cases, I think the author should error out on the runner > that don't provide that guarantee. That is what ExactlyOnceSink in > KafkaIO > does [1]. > > [1] > https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 > We're moving to a world where the runner may not be known at pipeline construction time. However, explicitly using a (distinct) make-input-stable transform when that's the intent (which could be a primitive that runners should implement, possibly by swapping in Reshuffle, or reject) would allow for this. That being said, the exact semantics of this transform is a bit of a rabbit hole which is why we never finished the job of deprecating Reshuffle. This is a case where doing something is better than doing nothing, and our use of URNs for this kind of thing is flexible enough that we can deprecate old ones if/when we have time to pound out the right solution. > >> Kenn >> >> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi >> wrote: >> >>> >>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw < >>> rober...@google.com> wrote: >>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi < rang...@google.com> wrote: > Thanks Kenn. > > On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles < > k...@google.com> wrote: > >> The fact that its usage has grown probably indicates that we >> have a large number of transforms that can easily cause data >> loss / >> duplication. >> > > Is this specific to Reshuffle or it is true for any > GroupByKey? I see Reshuffle as just a wrapper around GBK. > The issue is when it's used in such a way that data corruption can occur when the underlying GBK output is not stable. >>> >>> Could you describe this breakage bit more in detail or give a >>> example? Apologies in advance, I know this came up in multiple >>> contexts in >>> the past, but I haven't grokked the issue well. It is the window >>> rewrite >>> that Reshuffle does that causes misuse of GBK? >>> >>> Thanks. >>> >>
Re: What is the future of Reshuffle?
On Mon, May 21, 2018 at 9:56 AM Robert Bradshaw wrote: > We should probably keep the warning and all the caveats until we introduce > the alternative (and migrate to it for the non-parallelism uses of > reshuffle). I was just proposing we do this via a separate transform that > just calls Reshuffle until we have the new story fully fleshed out (I don't > know if any runner supports RequresStableInput, and it isn't translated > in the Fn API) to avoid being in this intermediate state for yet another > year. > I think all warnings and caveats really belong in JavaDoc for GroupByKey, since that is the actual transform the the semantics that we are concerned about incorrect use. A reshuffle transform can refer the users there. In addition, I think Reshuffle is really good name for reshuffle transform meant for most users. That said, any reorganization is much better than deprecation. Raghu. > > On Sun, May 20, 2018 at 6:38 PM Raghu Angadi wrote: > >> >> >> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw >> wrote: >> >>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi wrote: >>> [...] >>> I think it would be much more user friendly to un-deprecate it to add a warning for advanced users about non-portability of durability/replay guarantees/stable input assumptions. > >>> Yes, I think everyone in this thread is in agreement here. We should >>> provide a *different* transform that provides the durability guarantees >>> (with caveats). In the meantime, this delegating to a reshuffle would be >>> better than using a reshuffle directly. >>> >> >> Great. Sent a PR to undeprecate Reshuffle : >> https://github.com/apache/beam/pull/5432 >> The wording there for JavaDoc just a proposal... >> >> Raghu. >> >> >>> We tend to put in reshuffles in order to "commit" these random values > and make them stable for the next stage, to be used to provide the > needed > idempotency for sinks. > In such cases, I think the author should error out on the runner that don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does [1]. [1] https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 >>> >>> We're moving to a world where the runner may not be known at >>> pipeline construction time. However, explicitly using a (distinct) >>> make-input-stable transform when that's the intent (which could be a >>> primitive that runners should implement, possibly by swapping in >>> Reshuffle, >>> or reject) would allow for this. That being said, the exact semantics of >>> this transform is a bit of a rabbit hole which is why we never finished >>> the >>> job of deprecating Reshuffle. This is a case where doing something is >>> better than doing nothing, and our use of URNs for this kind of thing is >>> flexible enough that we can deprecate old ones if/when we have time to >>> pound out the right solution. >>> >>> > Kenn > > On Fri, May 18, 2018 at 4:05 PM Raghu Angadi > wrote: > >> >> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw < >> rober...@google.com> wrote: >> >>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi < >>> rang...@google.com> wrote: >>> Thanks Kenn. On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles < k...@google.com> wrote: > The fact that its usage has grown probably indicates that we > have a large number of transforms that can easily cause data loss > / > duplication. > Is this specific to Reshuffle or it is true for any GroupByKey? I see Reshuffle as just a wrapper around GBK. >>> The issue is when it's used in such a way that data corruption >>> can occur when the underlying GBK output is not stable. >>> >> >> Could you describe this breakage bit more in detail or give a >> example? Apologies in advance, I know this came up in multiple >> contexts in >> the past, but I haven't grokked the issue well. It is the window >> rewrite >> that Reshuffle does that causes misuse of GBK? >> >> Thanks. >> >
Re: What is the future of Reshuffle?
+1 to introducing alternative transforms even if they wrap Reshuffle The benefits of making them distinct is that we can put appropriate Javadoc in place and runners can figure out what the user is intending and whether Reshuffle or some other implementation is appropriate. We can also see which of these use cases is most common by inspecting the transform usage. And, when better options are available, we can just introduce the appropriate transform, to get the appropriate special behavior. I think there were three cases of use that I can remember: 1. Splitting up retry domains (eg., if one transform is expensive, preventing failures in nearby transforms from causing it to be retried) 2. Redistributing elements to improve parallelism 3. Checkpointing/snapshotting a pcollection before side-effecting the outside world (eg., RequiresStableInput) -- Ben On Mon, May 21, 2018 at 9:56 AM Robert Bradshaw wrote: > We should probably keep the warning and all the caveats until we introduce > the alternative (and migrate to it for the non-parallelism uses of > reshuffle). I was just proposing we do this via a separate transform that > just calls Reshuffle until we have the new story fully fleshed out (I don't > know if any runner supports RequresStableInput, and it isn't translated > in the Fn API) to avoid being in this intermediate state for yet another > year. > > On Sun, May 20, 2018 at 6:38 PM Raghu Angadi wrote: > >> >> >> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw >> wrote: >> >>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi wrote: >>> [...] >>> I think it would be much more user friendly to un-deprecate it to add a warning for advanced users about non-portability of durability/replay guarantees/stable input assumptions. > >>> Yes, I think everyone in this thread is in agreement here. We should >>> provide a *different* transform that provides the durability guarantees >>> (with caveats). In the meantime, this delegating to a reshuffle would be >>> better than using a reshuffle directly. >>> >> >> Great. Sent a PR to undeprecate Reshuffle : >> https://github.com/apache/beam/pull/5432 >> The wording there for JavaDoc just a proposal... >> >> Raghu. >> >> >>> We tend to put in reshuffles in order to "commit" these random values > and make them stable for the next stage, to be used to provide the > needed > idempotency for sinks. > In such cases, I think the author should error out on the runner that don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does [1]. [1] https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 >>> >>> We're moving to a world where the runner may not be known at >>> pipeline construction time. However, explicitly using a (distinct) >>> make-input-stable transform when that's the intent (which could be a >>> primitive that runners should implement, possibly by swapping in >>> Reshuffle, >>> or reject) would allow for this. That being said, the exact semantics of >>> this transform is a bit of a rabbit hole which is why we never finished >>> the >>> job of deprecating Reshuffle. This is a case where doing something is >>> better than doing nothing, and our use of URNs for this kind of thing is >>> flexible enough that we can deprecate old ones if/when we have time to >>> pound out the right solution. >>> >>> > Kenn > > On Fri, May 18, 2018 at 4:05 PM Raghu Angadi > wrote: > >> >> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw < >> rober...@google.com> wrote: >> >>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi < >>> rang...@google.com> wrote: >>> Thanks Kenn. On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles < k...@google.com> wrote: > The fact that its usage has grown probably indicates that we > have a large number of transforms that can easily cause data loss > / > duplication. > Is this specific to Reshuffle or it is true for any GroupByKey? I see Reshuffle as just a wrapper around GBK. >>> The issue is when it's used in such a way that data corruption >>> can occur when the underlying GBK output is not stable. >>> >> >> Could you describe this breakage bit more in detail or give a >> example? Apologies in advance, I know this came up in multiple >> contexts in >> the past, but I haven't grokked the issue well. It is the window >> rewrite >> that Reshuffle does that causes misuse of GBK? >> >> Thanks. >>>
Re: What is the future of Reshuffle?
We should probably keep the warning and all the caveats until we introduce the alternative (and migrate to it for the non-parallelism uses of reshuffle). I was just proposing we do this via a separate transform that just calls Reshuffle until we have the new story fully fleshed out (I don't know if any runner supports RequresStableInput, and it isn't translated in the Fn API) to avoid being in this intermediate state for yet another year. On Sun, May 20, 2018 at 6:38 PM Raghu Angadi wrote: > > > On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw > wrote: > >> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi wrote: >> >>> [...] >>> >> I think it would be much more user friendly to un-deprecate it to add a >>> warning for advanced users about non-portability of durability/replay >>> guarantees/stable input assumptions. >>> >> Yes, I think everyone in this thread is in agreement here. We should >> provide a *different* transform that provides the durability guarantees >> (with caveats). In the meantime, this delegating to a reshuffle would be >> better than using a reshuffle directly. >> > > Great. Sent a PR to undeprecate Reshuffle : > https://github.com/apache/beam/pull/5432 > The wording there for JavaDoc just a proposal... > > Raghu. > > >> We tend to put in reshuffles in order to "commit" these random values and make them stable for the next stage, to be used to provide the needed idempotency for sinks. >>> >>> In such cases, I think the author should error out on the runner >>> that don't provide that guarantee. That is what ExactlyOnceSink in >>> KafkaIO >>> does [1]. >>> >>> [1] >>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 >>> >> >> We're moving to a world where the runner may not be known at pipeline >> construction time. However, explicitly using a (distinct) >> make-input-stable >> transform when that's the intent (which could be a primitive that runners >> should implement, possibly by swapping in Reshuffle, or reject) would >> allow >> for this. That being said, the exact semantics of this transform is a bit >> of a rabbit hole which is why we never finished the job of deprecating >> Reshuffle. This is a case where doing something is better than doing >> nothing, and our use of URNs for this kind of thing is flexible enough >> that >> we can deprecate old ones if/when we have time to pound out the right >> solution. >> >> >>> Kenn On Fri, May 18, 2018 at 4:05 PM Raghu Angadi wrote: > > On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw < > rober...@google.com> wrote: > >> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi >> wrote: >> >>> Thanks Kenn. >>> >>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles >>> wrote: >>> The fact that its usage has grown probably indicates that we have a large number of transforms that can easily cause data loss / duplication. >>> >>> Is this specific to Reshuffle or it is true for any GroupByKey? >>> I see Reshuffle as just a wrapper around GBK. >>> >> The issue is when it's used in such a way that data corruption >> can occur when the underlying GBK output is not stable. >> > > Could you describe this breakage bit more in detail or give a > example? Apologies in advance, I know this came up in multiple > contexts in > the past, but I haven't grokked the issue well. It is the window > rewrite > that Reshuffle does that causes misuse of GBK? > > Thanks. >
Re: What is the future of Reshuffle?
On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw wrote: > On Sat, May 19, 2018 at 6:27 PM Raghu Angadi wrote: > >> [...] >> > I think it would be much more user friendly to un-deprecate it to add a >> warning for advanced users about non-portability of durability/replay >> guarantees/stable input assumptions. >> >>> > Yes, I think everyone in this thread is in agreement here. We should > provide a *different* transform that provides the durability guarantees > (with caveats). In the meantime, this delegating to a reshuffle would be > better than using a reshuffle directly. > Great. Sent a PR to undeprecate Reshuffle : https://github.com/apache/beam/pull/5432 The wording there for JavaDoc just a proposal... Raghu. > We tend to put in reshuffles in order to "commit" these random values and >>> make them stable for the next stage, to be used to provide the needed >>> idempotency for sinks. >>> >> >> In such cases, I think the author should error out on the runner that >> don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO >> does >> [1]. >> >> [1] >> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 >> > > We're moving to a world where the runner may not be known at pipeline > construction time. However, explicitly using a (distinct) > make-input-stable > transform when that's the intent (which could be a primitive that runners > should implement, possibly by swapping in Reshuffle, or reject) would > allow > for this. That being said, the exact semantics of this transform is a bit > of a rabbit hole which is why we never finished the job of deprecating > Reshuffle. This is a case where doing something is better than doing > nothing, and our use of URNs for this kind of thing is flexible enough > that > we can deprecate old ones if/when we have time to pound out the right > solution. > > >> >>> Kenn >>> >>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi >>> wrote: >>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw < rober...@google.com> wrote: > On Fri, May 18, 2018 at 11:46 AM Raghu Angadi > wrote: > >> Thanks Kenn. >> >> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles >> wrote: >> >>> The fact that its usage has grown probably indicates that we >>> have a large number of transforms that can easily cause data loss / >>> duplication. >>> >> >> Is this specific to Reshuffle or it is true for any GroupByKey? I >> see Reshuffle as just a wrapper around GBK. >> > The issue is when it's used in such a way that data corruption can > occur when the underlying GBK output is not stable. > Could you describe this breakage bit more in detail or give a example? Apologies in advance, I know this came up in multiple contexts in the past, but I haven't grokked the issue well. It is the window rewrite that Reshuffle does that causes misuse of GBK? Thanks. >>>
Re: What is the future of Reshuffle?
On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw wrote: > On Sat, May 19, 2018 at 6:27 PM Raghu Angadi wrote: > >> On Sat, May 19, 2018 at 8:11 AM Robert Bradshaw >> wrote: >> >>> On Fri, May 18, 2018 at 6:29 PM Raghu Angadi wrote: >>> True. I am still failing to see what is broken about Reshuffle that is also not broken with GroupByKey transform. If someone depends on GroupByKey to get stable input, isn't that equally incorrect/unportable? >>> >>> Yes, if people use GBK in that way, it's also just as broken. The >>> thought is that fewer people would use it with that intent, as GBK is not a >>> no-op (it transforms the shape of the data, and also does not preserve >>> windowing). This is in contrast to Reshuffle which was encouraged for this >>> usecase. >>> >> >> I see. I am not aware of any recommendation for users (excluding advanced >> users) to use this for stable input/durability gaurantees. Every single >> case where I recommended Reshuffle was related to parallelism (there were >> many such cases). Most of use of Reshuflle/GBK for stable input were done >> consciously by the authors, fully aware of the caveats (SDF in Dataflow, >> Kafka EOS sink use of GBK, etc). >> >> As a result, deprecation is only hurting the innocent users who are using >> Reshuffle correctly. >> >> I think it would be much more user friendly to un-deprecate it to add a >> warning for advanced users about non-portability of durability/replay >> guarantees/stable input assumptions. >> >>> > Yes, I think everyone in this thread is in agreement here. We should > provide a *different* transform that provides the durability guarantees > (with caveats). In the meantime, this delegating to a reshuffle would be > better than using a reshuffle directly. > The different thing has already been discussed on the dev thread - it's the RequresStableInput property on a DoFn's process method. The annotation has even been added already (for Java), it just has never been hooked up. > > >> We tend to put in reshuffles in order to "commit" these random values and >>> make them stable for the next stage, to be used to provide the needed >>> idempotency for sinks. >>> >> >> In such cases, I think the author should error out on the runner that >> don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO >> does >> [1]. >> >> [1] >> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 >> > > We're moving to a world where the runner may not be known at pipeline > construction time. However, explicitly using a (distinct) > make-input-stable > transform when that's the intent (which could be a primitive that runners > should implement, possibly by swapping in Reshuffle, or reject) would > allow > for this. That being said, the exact semantics of this transform is a bit > of a rabbit hole which is why we never finished the job of deprecating > Reshuffle. This is a case where doing something is better than doing > nothing, and our use of URNs for this kind of thing is flexible enough > that > we can deprecate old ones if/when we have time to pound out the right > solution. > > >> >>> Kenn >>> >>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi >>> wrote: >>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw < rober...@google.com> wrote: > On Fri, May 18, 2018 at 11:46 AM Raghu Angadi > wrote: > >> Thanks Kenn. >> >> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles >> wrote: >> >>> The fact that its usage has grown probably indicates that we >>> have a large number of transforms that can easily cause data loss / >>> duplication. >>> >> >> Is this specific to Reshuffle or it is true for any GroupByKey? I >> see Reshuffle as just a wrapper around GBK. >> > The issue is when it's used in such a way that data corruption can > occur when the underlying GBK output is not stable. > Could you describe this breakage bit more in detail or give a example? Apologies in advance, I know this came up in multiple contexts in the past, but I haven't grokked the issue well. It is the window rewrite that Reshuffle does that causes misuse of GBK? Thanks. >>>
Re: What is the future of Reshuffle?
On Sat, May 19, 2018 at 6:27 PM Raghu Angadi wrote: > On Sat, May 19, 2018 at 8:11 AM Robert Bradshaw > wrote: > >> On Fri, May 18, 2018 at 6:29 PM Raghu Angadi wrote: >> >>> True. I am still failing to see what is broken about Reshuffle that is >>> also not broken with GroupByKey transform. If someone depends on GroupByKey >>> to get stable input, isn't that equally incorrect/unportable? >>> >> >> Yes, if people use GBK in that way, it's also just as broken. The thought >> is that fewer people would use it with that intent, as GBK is not a no-op >> (it transforms the shape of the data, and also does not preserve >> windowing). This is in contrast to Reshuffle which was encouraged for this >> usecase. >> > > I see. I am not aware of any recommendation for users (excluding advanced > users) to use this for stable input/durability gaurantees. Every single > case where I recommended Reshuffle was related to parallelism (there were > many such cases). Most of use of Reshuflle/GBK for stable input were done > consciously by the authors, fully aware of the caveats (SDF in Dataflow, > Kafka EOS sink use of GBK, etc). > > As a result, deprecation is only hurting the innocent users who are using > Reshuffle correctly. > > I think it would be much more user friendly to un-deprecate it to add a > warning for advanced users about non-portability of durability/replay > guarantees/stable input assumptions. > >> Yes, I think everyone in this thread is in agreement here. We should provide a *different* transform that provides the durability guarantees (with caveats). In the meantime, this delegating to a reshuffle would be better than using a reshuffle directly. > We tend to put in reshuffles in order to "commit" these random values and >> make them stable for the next stage, to be used to provide the needed >> idempotency for sinks. >> > > In such cases, I think the author should error out on the runner that > don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does > [1]. > > [1] > https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 > We're moving to a world where the runner may not be known at pipeline construction time. However, explicitly using a (distinct) make-input-stable transform when that's the intent (which could be a primitive that runners should implement, possibly by swapping in Reshuffle, or reject) would allow for this. That being said, the exact semantics of this transform is a bit of a rabbit hole which is why we never finished the job of deprecating Reshuffle. This is a case where doing something is better than doing nothing, and our use of URNs for this kind of thing is flexible enough that we can deprecate old ones if/when we have time to pound out the right solution. > >> Kenn >> >> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi >> wrote: >> >>> >>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw < >>> rober...@google.com> wrote: >>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi wrote: > Thanks Kenn. > > On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles > wrote: > >> The fact that its usage has grown probably indicates that we have >> a large number of transforms that can easily cause data loss / >> duplication. >> > > Is this specific to Reshuffle or it is true for any GroupByKey? I > see Reshuffle as just a wrapper around GBK. > The issue is when it's used in such a way that data corruption can occur when the underlying GBK output is not stable. >>> >>> Could you describe this breakage bit more in detail or give a >>> example? Apologies in advance, I know this came up in multiple contexts >>> in >>> the past, but I haven't grokked the issue well. It is the window rewrite >>> that Reshuffle does that causes misuse of GBK? >>> >>> Thanks. >>> >>
Re: What is the future of Reshuffle?
On Sat, May 19, 2018 at 8:11 AM Robert Bradshaw wrote: > On Fri, May 18, 2018 at 6:29 PM Raghu Angadi wrote: > >> True. I am still failing to see what is broken about Reshuffle that is >> also not broken with GroupByKey transform. If someone depends on GroupByKey >> to get stable input, isn't that equally incorrect/unportable? >> > > Yes, if people use GBK in that way, it's also just as broken. The thought > is that fewer people would use it with that intent, as GBK is not a no-op > (it transforms the shape of the data, and also does not preserve > windowing). This is in contrast to Reshuffle which was encouraged for this > usecase. > I see. I am not aware of any recommendation for users (excluding advanced users) to use this for stable input/durability gaurantees. Every single case where I recommended Reshuffle was related to parallelism (there were many such cases). Most of use of Reshuflle/GBK for stable input were done consciously by the authors, fully aware of the caveats (SDF in Dataflow, Kafka EOS sink use of GBK, etc). As a result, deprecation is only hurting the innocent users who are using Reshuffle correctly. I think it would be much more user friendly to un-deprecate it to add a warning for advanced users about non-portability of durability/replay guarantees/stable input assumptions. Raghu. We tend to put in reshuffles in order to "commit" these random values and > make them stable for the next stage, to be used to provide the needed > idempotency for sinks. > In such cases, I think the author should error out on the runner that don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does [1]. [1] https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 >>> >>> We're moving to a world where the runner may not be known at pipeline >>> construction time. However, explicitly using a (distinct) make-input-stable >>> transform when that's the intent (which could be a primitive that runners >>> should implement, possibly by swapping in Reshuffle, or reject) would allow >>> for this. That being said, the exact semantics of this transform is a bit >>> of a rabbit hole which is why we never finished the job of deprecating >>> Reshuffle. This is a case where doing something is better than doing >>> nothing, and our use of URNs for this kind of thing is flexible enough that >>> we can deprecate old ones if/when we have time to pound out the right >>> solution. >>> >>> > Kenn > > On Fri, May 18, 2018 at 4:05 PM Raghu Angadi > wrote: > >> >> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw >> wrote: >> >>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi >>> wrote: >>> Thanks Kenn. On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles wrote: > The fact that its usage has grown probably indicates that we have > a large number of transforms that can easily cause data loss / > duplication. > Is this specific to Reshuffle or it is true for any GroupByKey? I see Reshuffle as just a wrapper around GBK. >>> The issue is when it's used in such a way that data corruption can >>> occur when the underlying GBK output is not stable. >>> >> >> Could you describe this breakage bit more in detail or give a >> example? Apologies in advance, I know this came up in multiple contexts >> in >> the past, but I haven't grokked the issue well. It is the window rewrite >> that Reshuffle does that causes misuse of GBK? >> >> Thanks. >> >
Re: What is the future of Reshuffle?
On Fri, May 18, 2018 at 6:29 PM Raghu Angadi wrote: > > On Fri, May 18, 2018 at 5:34 PM Robert Bradshaw > wrote: > >> Ah, thanks, that makes sense. That implies to me Reshuffle is no more >>> broken than GBK itself. May be Reshuffle.viaRandomKey() could have a clear >>> caveat. Reshuffle's JavaDoc could add a caveat too about non-deterministic >>> keys and retries (though it applies to GroupByKey in general). >>> >> >> The "randomness" of Reshuffle.viaRandomKey() is fine, as the randomly >> generated key is never exposed to the users (so it doesn't matter if it >> changes). >> > > Agreed. > > >> Reshuffle is broken if you are using it to get stable input on a runner >> that doesn't always have stable input as an implementation detail of GBKs. >> > > True. I am still failing to see what is broken about Reshuffle that is > also not broken with GroupByKey transform. If someone depends on GroupByKey > to get stable input, isn't that equally incorrect/unportable? > Yes, if people use GBK in that way, it's also just as broken. The thought is that fewer people would use it with that intent, as GBK is not a no-op (it transforms the shape of the data, and also does not preserve windowing). This is in contrast to Reshuffle which was encouraged for this usecase. > > Raghu. > >> >> We tend to put in reshuffles in order to "commit" these random values and make them stable for the next stage, to be used to provide the needed idempotency for sinks. >>> >>> In such cases, I think the author should error out on the runner that >>> don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does >>> [1]. >>> >>> [1] >>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 >>> >> >> We're moving to a world where the runner may not be known at pipeline >> construction time. However, explicitly using a (distinct) make-input-stable >> transform when that's the intent (which could be a primitive that runners >> should implement, possibly by swapping in Reshuffle, or reject) would allow >> for this. That being said, the exact semantics of this transform is a bit >> of a rabbit hole which is why we never finished the job of deprecating >> Reshuffle. This is a case where doing something is better than doing >> nothing, and our use of URNs for this kind of thing is flexible enough that >> we can deprecate old ones if/when we have time to pound out the right >> solution. >> >> >>> Kenn On Fri, May 18, 2018 at 4:05 PM Raghu Angadi wrote: > > On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw > wrote: > >> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi >> wrote: >> >>> Thanks Kenn. >>> >>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles >>> wrote: >>> The fact that its usage has grown probably indicates that we have a large number of transforms that can easily cause data loss / duplication. >>> >>> Is this specific to Reshuffle or it is true for any GroupByKey? I >>> see Reshuffle as just a wrapper around GBK. >>> >> The issue is when it's used in such a way that data corruption can >> occur when the underlying GBK output is not stable. >> > > Could you describe this breakage bit more in detail or give a example? > Apologies in advance, I know this came up in multiple contexts in the > past, > but I haven't grokked the issue well. It is the window rewrite that > Reshuffle does that causes misuse of GBK? > > Thanks. >
Re: What is the future of Reshuffle?
On Fri, May 18, 2018 at 5:34 PM Robert Bradshaw wrote: > Ah, thanks, that makes sense. That implies to me Reshuffle is no more >> broken than GBK itself. May be Reshuffle.viaRandomKey() could have a clear >> caveat. Reshuffle's JavaDoc could add a caveat too about non-deterministic >> keys and retries (though it applies to GroupByKey in general). >> > > The "randomness" of Reshuffle.viaRandomKey() is fine, as the randomly > generated key is never exposed to the users (so it doesn't matter if it > changes). > Agreed. > Reshuffle is broken if you are using it to get stable input on a runner > that doesn't always have stable input as an implementation detail of GBKs. > True. I am still failing to see what is broken about Reshuffle that is also not broken with GroupByKey transform. If someone depends on GroupByKey to get stable input, isn't that equally incorrect/unportable? Raghu. > > We tend to put in reshuffles in order to "commit" these random values and >>> make them stable for the next stage, to be used to provide the needed >>> idempotency for sinks. >>> >> >> In such cases, I think the author should error out on the runner that >> don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does >> [1]. >> >> [1] >> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 >> > > We're moving to a world where the runner may not be known at pipeline > construction time. However, explicitly using a (distinct) make-input-stable > transform when that's the intent (which could be a primitive that runners > should implement, possibly by swapping in Reshuffle, or reject) would allow > for this. That being said, the exact semantics of this transform is a bit > of a rabbit hole which is why we never finished the job of deprecating > Reshuffle. This is a case where doing something is better than doing > nothing, and our use of URNs for this kind of thing is flexible enough that > we can deprecate old ones if/when we have time to pound out the right > solution. > > >> >>> Kenn >>> >>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi wrote: >>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw wrote: > On Fri, May 18, 2018 at 11:46 AM Raghu Angadi > wrote: > >> Thanks Kenn. >> >> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles >> wrote: >> >>> The fact that its usage has grown probably indicates that we have a >>> large number of transforms that can easily cause data loss / >>> duplication. >>> >> >> Is this specific to Reshuffle or it is true for any GroupByKey? I see >> Reshuffle as just a wrapper around GBK. >> > The issue is when it's used in such a way that data corruption can > occur when the underlying GBK output is not stable. > Could you describe this breakage bit more in detail or give a example? Apologies in advance, I know this came up in multiple contexts in the past, but I haven't grokked the issue well. It is the window rewrite that Reshuffle does that causes misuse of GBK? Thanks. >>>
Re: What is the future of Reshuffle?
On Fri, May 18, 2018 at 4:24 PM Raghu Angadi wrote: > On Fri, May 18, 2018 at 4:07 PM Kenneth Knowles wrote: > >> It isn't any particular logic in Reshuffle - it is, semantically, an >> identity transform. It is the fact that other runners are perfectly able to >> re-run transform prior to a GBK. So, for example, randomly generated IDs >> will be re-generated. >> > > Ah, thanks, that makes sense. That implies to me Reshuffle is no more > broken than GBK itself. May be Reshuffle.viaRandomKey() could have a clear > caveat. Reshuffle's JavaDoc could add a caveat too about non-deterministic > keys and retries (though it applies to GroupByKey in general). > The "randomness" of Reshuffle.viaRandomKey() is fine, as the randomly generated key is never exposed to the users (so it doesn't matter if it changes). Reshuffle is broken if you are using it to get stable input on a runner that doesn't always have stable input as an implementation detail of GBKs. We tend to put in reshuffles in order to "commit" these random values and >> make them stable for the next stage, to be used to provide the needed >> idempotency for sinks. >> > > In such cases, I think the author should error out on the runner that > don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does > [1]. > > [1] > https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 > We're moving to a world where the runner may not be known at pipeline construction time. However, explicitly using a (distinct) make-input-stable transform when that's the intent (which could be a primitive that runners should implement, possibly by swapping in Reshuffle, or reject) would allow for this. That being said, the exact semantics of this transform is a bit of a rabbit hole which is why we never finished the job of deprecating Reshuffle. This is a case where doing something is better than doing nothing, and our use of URNs for this kind of thing is flexible enough that we can deprecate old ones if/when we have time to pound out the right solution. > >> Kenn >> >> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi wrote: >> >>> >>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw >>> wrote: >>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi wrote: > Thanks Kenn. > > On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles > wrote: > >> The fact that its usage has grown probably indicates that we have a >> large number of transforms that can easily cause data loss / duplication. >> > > Is this specific to Reshuffle or it is true for any GroupByKey? I see > Reshuffle as just a wrapper around GBK. > The issue is when it's used in such a way that data corruption can occur when the underlying GBK output is not stable. >>> >>> Could you describe this breakage bit more in detail or give a example? >>> Apologies in advance, I know this came up in multiple contexts in the past, >>> but I haven't grokked the issue well. It is the window rewrite that >>> Reshuffle does that causes misuse of GBK? >>> >>> Thanks. >>> >>
Re: What is the future of Reshuffle?
On Fri, May 18, 2018 at 4:07 PM Kenneth Knowles wrote: > It isn't any particular logic in Reshuffle - it is, semantically, an > identity transform. It is the fact that other runners are perfectly able to > re-run transform prior to a GBK. So, for example, randomly generated IDs > will be re-generated. > Ah, thanks, that makes sense. That implies to me Reshuffle is no more broken than GBK itself. May be Reshuffle.viaRandomKey() could have a clear caveat. Reshuffle's JavaDoc could add a caveat too about non-deterministic keys and retries (though it applies to GroupByKey in general). We tend to put in reshuffles in order to "commit" these random values and > make them stable for the next stage, to be used to provide the needed > idempotency for sinks. > In such cases, I think the author should error out on the runner that don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does [1]. [1] https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049 > Kenn > > On Fri, May 18, 2018 at 4:05 PM Raghu Angadi wrote: > >> >> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw >> wrote: >> >>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi >>> wrote: >>> Thanks Kenn. On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles wrote: > The fact that its usage has grown probably indicates that we have a > large number of transforms that can easily cause data loss / duplication. > Is this specific to Reshuffle or it is true for any GroupByKey? I see Reshuffle as just a wrapper around GBK. >>> The issue is when it's used in such a way that data corruption can occur >>> when the underlying GBK output is not stable. >>> >> >> Could you describe this breakage bit more in detail or give a example? >> Apologies in advance, I know this came up in multiple contexts in the past, >> but I haven't grokked the issue well. It is the window rewrite that >> Reshuffle does that causes misuse of GBK? >> >> Thanks. >> >
Re: What is the future of Reshuffle?
It isn't any particular logic in Reshuffle - it is, semantically, an identity transform. It is the fact that other runners are perfectly able to re-run transform prior to a GBK. So, for example, randomly generated IDs will be re-generated. We tend to put in reshuffles in order to "commit" these random values and make them stable for the next stage, to be used to provide the needed idempotency for sinks. Kenn On Fri, May 18, 2018 at 4:05 PM Raghu Angadi wrote: > > On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw > wrote: > >> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi wrote: >> >>> Thanks Kenn. >>> >>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles wrote: >>> The fact that its usage has grown probably indicates that we have a large number of transforms that can easily cause data loss / duplication. >>> >>> Is this specific to Reshuffle or it is true for any GroupByKey? I see >>> Reshuffle as just a wrapper around GBK. >>> >> The issue is when it's used in such a way that data corruption can occur >> when the underlying GBK output is not stable. >> > > Could you describe this breakage bit more in detail or give a example? > Apologies in advance, I know this came up in multiple contexts in the past, > but I haven't grokked the issue well. It is the window rewrite that > Reshuffle does that causes misuse of GBK? > > Thanks. >
Re: What is the future of Reshuffle?
On Fri, May 18, 2018 at 12:22 PM Robert Bradshaw wrote: > [resending] > Agreed that keeping this deprecated without a clear replacement for so long > is not ideal. > > I would at least break this into two separate transforms, the > parallelism-breaking one (which seems OK) and the stable input one (which > may just call the parallelism-breaking one, but should be decorated with > lots of caveats and maybe even still have the deprecated annotation). > +1. Parallelism-breaking one is the most relevant to many users. Would love to see that part deprecated, ideally keeping the name Reshuffle. Raghu. > > > On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles wrote: > >> The fact that its usage has grown probably indicates that we have a large >> number of transforms that can easily cause data loss / duplication. >> >> Yes, it is deprecated because it is primarily used as a Dataflow-specific >> way to ensure stable input. My understanding is that the SparkRunner also >> materializes at every GBK so it works there too (is this still the case?). >> It doesn't work at all for other runners AFAIK. So it is @Deprecated not >> because there is a replacement, but because it is kind of dangerous to use. >> Beam >> could just say "GBK must ensure stable output" and "a composite containing >> a GBK has to ensure stable output even if replaced" and that would solve >> the issue, but I think it would make Beam on Flink impossibly slow - I >> could be wrong about that. Generally stable input is tied to durability >> model which is a key design point for engines. >> >> True that it isn't the only use, and I know you have been trying to nail >> down what the uses actually are. Ben wrote up various uses in a portable >> manner at https://beam.apache.org/documentation/execution-model. >> >> - Coupled failure is the use where Reshuffle is to provide stable input >> - Breaking dependent parallelism is more portable - but since it is the >> identity transform a runner may just elide it; it is a hint, basically, and >> that seem OK (but can we do it more directly?) >> >> What I don't want is to build something where the implementation details >> are the spec, and not fundamental, which is sort of where Reshuffle lies. >> This thread highlights that this is a pretty urgent problem with our SDKs >> and runners that it would be very helpful to work on. >> >> Kenn >> >> >> >> On Fri, May 18, 2018 at 7:50 AM Eugene Kirpichov >> wrote: >> >>> Agreed that it should be undeprecated, many users are getting confused >>> by this. >>> I know that some people are working on a replacement for at least one of >>> its use cases (RequiresStableInput), but the use case of breaking fusion >>> is, as of yet, unaddressed, and there's not much to be gained by keeping it >>> deprecated. >>> >>> On Fri, May 18, 2018 at 7:45 AM Raghu Angadi wrote: >>> I am interested in more clarity on this as well. It has been deprecated for a long time without a replacement, and its usage has only grown, both within Beam code base as well as in user applications. If we are certain that it will not be removed before there is a good replacement for it, can we undeprecate it until there are proper plans for replacement? On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía wrote: > I saw in a recent thread that the use of the Reshuffle transform was > recommended to solve an user issue: > > > https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E > > I can see why it may fix the reported issue. I am just curious about > the fact that the Reshuffle transform is marked as both @Internal and > @Deprecated in Beam's SDK. > > Do we have some alternative? So far the class documentation does not > recommend any replacement. >
Re: What is the future of Reshuffle?
On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw wrote: > On Fri, May 18, 2018 at 11:46 AM Raghu Angadi wrote: > >> Thanks Kenn. >> >> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles wrote: >> >>> The fact that its usage has grown probably indicates that we have a >>> large number of transforms that can easily cause data loss / duplication. >>> >> >> Is this specific to Reshuffle or it is true for any GroupByKey? I see >> Reshuffle as just a wrapper around GBK. >> > The issue is when it's used in such a way that data corruption can occur > when the underlying GBK output is not stable. > Could you describe this breakage bit more in detail or give a example? Apologies in advance, I know this came up in multiple contexts in the past, but I haven't grokked the issue well. It is the window rewrite that Reshuffle does that causes misuse of GBK? Thanks.
Re: What is the future of Reshuffle?
[resending] Agreed that keeping this deprecated without a clear replacement for so long is not ideal. I would at least break this into two separate transforms, the parallelism-breaking one (which seems OK) and the stable input one (which may just call the parallelism-breaking one, but should be decorated with lots of caveats and maybe even still have the deprecated annotation). On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles wrote: > The fact that its usage has grown probably indicates that we have a large > number of transforms that can easily cause data loss / duplication. > > Yes, it is deprecated because it is primarily used as a Dataflow-specific > way to ensure stable input. My understanding is that the SparkRunner also > materializes at every GBK so it works there too (is this still the case?). > It doesn't work at all for other runners AFAIK. So it is @Deprecated not > because there is a replacement, but because it is kind of dangerous to use. > Beam > could just say "GBK must ensure stable output" and "a composite containing > a GBK has to ensure stable output even if replaced" and that would solve > the issue, but I think it would make Beam on Flink impossibly slow - I > could be wrong about that. Generally stable input is tied to durability > model which is a key design point for engines. > > True that it isn't the only use, and I know you have been trying to nail > down what the uses actually are. Ben wrote up various uses in a portable > manner at https://beam.apache.org/documentation/execution-model. > > - Coupled failure is the use where Reshuffle is to provide stable input > - Breaking dependent parallelism is more portable - but since it is the > identity transform a runner may just elide it; it is a hint, basically, and > that seem OK (but can we do it more directly?) > > What I don't want is to build something where the implementation details > are the spec, and not fundamental, which is sort of where Reshuffle lies. > This thread highlights that this is a pretty urgent problem with our SDKs > and runners that it would be very helpful to work on. > > Kenn > > > > On Fri, May 18, 2018 at 7:50 AM Eugene Kirpichov > wrote: > >> Agreed that it should be undeprecated, many users are getting confused by >> this. >> I know that some people are working on a replacement for at least one of >> its use cases (RequiresStableInput), but the use case of breaking fusion >> is, as of yet, unaddressed, and there's not much to be gained by keeping it >> deprecated. >> >> On Fri, May 18, 2018 at 7:45 AM Raghu Angadi wrote: >> >>> I am interested in more clarity on this as well. It has been deprecated >>> for a long time without a replacement, and its usage has only grown, both >>> within Beam code base as well as in user applications. >>> >>> If we are certain that it will not be removed before there is a good >>> replacement for it, can we undeprecate it until there are proper plans for >>> replacement? >>> >>> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía wrote: >>> I saw in a recent thread that the use of the Reshuffle transform was recommended to solve an user issue: https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E I can see why it may fix the reported issue. I am just curious about the fact that the Reshuffle transform is marked as both @Internal and @Deprecated in Beam's SDK. Do we have some alternative? So far the class documentation does not recommend any replacement. >>>
Re: What is the future of Reshuffle?
On Fri, May 18, 2018 at 11:46 AM Raghu Angadi wrote: > Thanks Kenn. > > On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles wrote: > >> The fact that its usage has grown probably indicates that we have a large >> number of transforms that can easily cause data loss / duplication. >> > > Is this specific to Reshuffle or it is true for any GroupByKey? I see > Reshuffle as just a wrapper around GBK. > The issue is when it's used in such a way that data corruption can occur when the underlying GBK output is not stable.
Re: What is the future of Reshuffle?
On Fri, May 18, 2018 at 11:46 AM Raghu Angadiwrote:>> Thanks Kenn.>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles wrote: The fact that its usage has grown probably indicates that we have a large number of transforms that can easily cause data loss / duplication.>> Is this specific to Reshuffle or it is true for any GroupByKey? I see Reshuffle as just a wrapper around GBK.The issue is when it's used in such a way that data corruption can occur when the underlying GBK output is not stable.
Re: What is the future of Reshuffle?
Agreed that keeping this deprecated without a clear replacement for so long is not ideal. I would at least break this into two separate transforms, the parallelism-breaking one (which seems OK) and the stable input one (which may just call the parallelism-breaking one, but should be decorated with lots of caveats and maybe even still have the deprecated annotation). On Fri, May 18, 2018 at 11:02 AM Kenneth Knowleswrote:>> The fact that its usage has grown probably indicates that we have a large number of transforms that can easily cause data loss / duplication.>> Yes, it is deprecated because it is primarily used as a Dataflow-specific way to ensure stable input. My understanding is that the SparkRunner also materializes at every GBK so it works there too (is this still the case?). It doesn't work at all for other runners AFAIK. So it is @Deprecated not because there is a replacement, but because it is kind of dangerous to use. Beam could just say "GBK must ensure stable output" and "a composite containing a GBK has to ensure stable output even if replaced" and that would solve the issue, but I think it would make Beam on Flink impossibly slow - I could be wrong about that. Generally stable input is tied to durability model which is a key design point for engines.>> True that it isn't the only use, and I know you have been trying to nail down what the uses actually are. Ben wrote up various uses in a portable manner at https://beam.apache.org/documentation/execution-model.>> - Coupled failure is the use where Reshuffle is to provide stable input> - Breaking dependent parallelism is more portable - but since it is the identity transform a runner may just elide it; it is a hint, basically, and that seem OK (but can we do it more directly?)>> What I don't want is to build something where the implementation details are the spec, and not fundamental, which is sort of where Reshuffle lies. This thread highlights that this is a pretty urgent problem with our SDKs and runners that it would be very helpful to work on.>> Kenn On Fri, May 18, 2018 at 7:50 AM Eugene Kirpichov wrote: Agreed that it should be undeprecated, many users are getting confused by this.>> I know that some people are working on a replacement for at least one of its use cases (RequiresStableInput), but the use case of breaking fusion is, as of yet, unaddressed, and there's not much to be gained by keeping it deprecated. On Fri, May 18, 2018 at 7:45 AM Raghu Angadi wrote:>> I am interested in more clarity on this as well. It has been deprecated for a long time without a replacement, and its usage has only grown, both within Beam code base as well as in user applications.>> If we are certain that it will not be removed before there is a good replacement for it, can we undeprecate it until there are proper plans for replacement? >> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía wrote: I saw in a recent thread that the use of the Reshuffle transform was recommended to solve an user issue: https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E I can see why it may fix the reported issue. I am just curious about the fact that the Reshuffle transform is marked as both @Internal and @Deprecated in Beam's SDK. Do we have some alternative? So far the class documentation does not recommend any replacement.
Re: What is the future of Reshuffle?
Thanks Kenn. On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles wrote: > The fact that its usage has grown probably indicates that we have a large > number of transforms that can easily cause data loss / duplication. > Is this specific to Reshuffle or it is true for any GroupByKey? I see Reshuffle as just a wrapper around GBK. Raghu. > > > Yes, it is deprecated because it is primarily used as a Dataflow-specific > way to ensure stable input. My understanding is that the SparkRunner also > materializes at every GBK so it works there too (is this still the case?). > It doesn't work at all for other runners AFAIK. So it is @Deprecated not > because there is a replacement, but because it is kind of dangerous to use. > Beam > could just say "GBK must ensure stable output" and "a composite containing > a GBK has to ensure stable output even if replaced" and that would solve > the issue, but I think it would make Beam on Flink impossibly slow - I > could be wrong about that. Generally stable input is tied to durability > model which is a key design point for engines. > > True that it isn't the only use, and I know you have been trying to nail > down what the uses actually are. Ben wrote up various uses in a portable > manner at https://beam.apache.org/documentation/execution-model. > > - Coupled failure is the use where Reshuffle is to provide stable input > - Breaking dependent parallelism is more portable - but since it is the > identity transform a runner may just elide it; it is a hint, basically, and > that seem OK (but can we do it more directly?) > > What I don't want is to build something where the implementation details > are the spec, and not fundamental, which is sort of where Reshuffle lies. > This thread highlights that this is a pretty urgent problem with our SDKs > and runners that it would be very helpful to work on. > > Kenn > > > > On Fri, May 18, 2018 at 7:50 AM Eugene Kirpichov > wrote: > >> Agreed that it should be undeprecated, many users are getting confused by >> this. >> I know that some people are working on a replacement for at least one of >> its use cases (RequiresStableInput), but the use case of breaking fusion >> is, as of yet, unaddressed, and there's not much to be gained by keeping it >> deprecated. >> >> On Fri, May 18, 2018 at 7:45 AM Raghu Angadi wrote: >> >>> I am interested in more clarity on this as well. It has been deprecated >>> for a long time without a replacement, and its usage has only grown, both >>> within Beam code base as well as in user applications. >>> >>> If we are certain that it will not be removed before there is a good >>> replacement for it, can we undeprecate it until there are proper plans for >>> replacement? >>> >>> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía wrote: >>> I saw in a recent thread that the use of the Reshuffle transform was recommended to solve an user issue: https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E I can see why it may fix the reported issue. I am just curious about the fact that the Reshuffle transform is marked as both @Internal and @Deprecated in Beam's SDK. Do we have some alternative? So far the class documentation does not recommend any replacement. >>>
Re: What is the future of Reshuffle?
The fact that its usage has grown probably indicates that we have a large number of transforms that can easily cause data loss / duplication. Yes, it is deprecated because it is primarily used as a Dataflow-specific way to ensure stable input. My understanding is that the SparkRunner also materializes at every GBK so it works there too (is this still the case?). It doesn't work at all for other runners AFAIK. So it is @Deprecated not because there is a replacement, but because it is kind of dangerous to use. Beam could just say "GBK must ensure stable output" and "a composite containing a GBK has to ensure stable output even if replaced" and that would solve the issue, but I think it would make Beam on Flink impossibly slow - I could be wrong about that. Generally stable input is tied to durability model which is a key design point for engines. True that it isn't the only use, and I know you have been trying to nail down what the uses actually are. Ben wrote up various uses in a portable manner at https://beam.apache.org/documentation/execution-model. - Coupled failure is the use where Reshuffle is to provide stable input - Breaking dependent parallelism is more portable - but since it is the identity transform a runner may just elide it; it is a hint, basically, and that seem OK (but can we do it more directly?) What I don't want is to build something where the implementation details are the spec, and not fundamental, which is sort of where Reshuffle lies. This thread highlights that this is a pretty urgent problem with our SDKs and runners that it would be very helpful to work on. Kenn On Fri, May 18, 2018 at 7:50 AM Eugene Kirpichov wrote: > Agreed that it should be undeprecated, many users are getting confused by > this. > I know that some people are working on a replacement for at least one of > its use cases (RequiresStableInput), but the use case of breaking fusion > is, as of yet, unaddressed, and there's not much to be gained by keeping it > deprecated. > > On Fri, May 18, 2018 at 7:45 AM Raghu Angadi wrote: > >> I am interested in more clarity on this as well. It has been deprecated >> for a long time without a replacement, and its usage has only grown, both >> within Beam code base as well as in user applications. >> >> If we are certain that it will not be removed before there is a good >> replacement for it, can we undeprecate it until there are proper plans for >> replacement? >> >> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía wrote: >> >>> I saw in a recent thread that the use of the Reshuffle transform was >>> recommended to solve an user issue: >>> >>> >>> https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E >>> >>> I can see why it may fix the reported issue. I am just curious about >>> the fact that the Reshuffle transform is marked as both @Internal and >>> @Deprecated in Beam's SDK. >>> >>> Do we have some alternative? So far the class documentation does not >>> recommend any replacement. >>> >>
Re: What is the future of Reshuffle?
Agreed that it should be undeprecated, many users are getting confused by this. I know that some people are working on a replacement for at least one of its use cases (RequiresStableInput), but the use case of breaking fusion is, as of yet, unaddressed, and there's not much to be gained by keeping it deprecated. On Fri, May 18, 2018 at 7:45 AM Raghu Angadi wrote: > I am interested in more clarity on this as well. It has been deprecated > for a long time without a replacement, and its usage has only grown, both > within Beam code base as well as in user applications. > > If we are certain that it will not be removed before there is a good > replacement for it, can we undeprecate it until there are proper plans for > replacement? > > On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía wrote: > >> I saw in a recent thread that the use of the Reshuffle transform was >> recommended to solve an user issue: >> >> >> https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E >> >> I can see why it may fix the reported issue. I am just curious about >> the fact that the Reshuffle transform is marked as both @Internal and >> @Deprecated in Beam's SDK. >> >> Do we have some alternative? So far the class documentation does not >> recommend any replacement. >> >
Re: What is the future of Reshuffle?
I am interested in more clarity on this as well. It has been deprecated for a long time without a replacement, and its usage has only grown, both within Beam code base as well as in user applications. If we are certain that it will not be removed before there is a good replacement for it, can we undeprecate it until there are proper plans for replacement? On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía wrote: > I saw in a recent thread that the use of the Reshuffle transform was > recommended to solve an user issue: > > > https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E > > I can see why it may fix the reported issue. I am just curious about > the fact that the Reshuffle transform is marked as both @Internal and > @Deprecated in Beam's SDK. > > Do we have some alternative? So far the class documentation does not > recommend any replacement. >
What is the future of Reshuffle?
I saw in a recent thread that the use of the Reshuffle transform was recommended to solve an user issue: https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E I can see why it may fix the reported issue. I am just curious about the fact that the Reshuffle transform is marked as both @Internal and @Deprecated in Beam's SDK. Do we have some alternative? So far the class documentation does not recommend any replacement.