Hi Eugene,

This is very interesting. Let me see if I get this right: the "Redistribute" transformation assigns a "running id" key (per bundle), calls "Redistribute.byKey", and extracts the values back, correct? As for "Redistribute.byKey", it is made of a GroupByKey transformation that follows a Window transformation, which neutralises the "resolution" of triggers and panes that usually occurs in GroupByKey, correct?
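To make my reading concrete, here is a minimal sketch in plain Python. It is not the Beam API: the function name `redistribute` and the list-of-lists model of bundles are my own assumptions for illustration, and the Window step is left out entirely.

```python
from collections import defaultdict
from itertools import count


def redistribute(bundles):
    """Toy model of the pattern: assign keys, group by key, drop the keys.

    `bundles` is a list of lists standing in for the runner's input bundles
    (a hypothetical model, not how any runner represents them).
    """
    grouped = defaultdict(list)
    for bundle in bundles:
        running_id = count()  # the running id restarts for every bundle
        for value in bundle:
            # Stand-in for GroupByKey: collect values that share a key.
            grouped[next(running_id)].append(value)
    # Extract the values back; each key group could now be processed
    # independently of the step that produced it.
    return [values for _, values in sorted(grouped.items())]


if __name__ == "__main__":
    print(redistribute([["a", "b", "c"], ["d", "e"]]))
```

In this model the number of output groups is bounded by the number of distinct keys, which is what I mean below by the uniqueness of the assigned keys.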
So this is basically a "FanOut" transformation which will depend on the available resources of the runner (and the uniqueness of the assigned keys)? Would we want to Redistribute into a user-defined number of bundles (> current)? How about "FanIn"?

Thanks,
Amit

On Fri, Oct 7, 2016 at 10:49 PM Eugene Kirpichov <kirpic...@google.com.invalid> wrote:

> Hello,
>
> Heads up that https://github.com/apache/incubator-beam/pull/1036 will
> introduce a transform called "Redistribute", encapsulating a relatively
> common pattern - a "fusion break" [see
> https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion
> previously providing advice on that] - useful e.g. when you write an IO as
> a sequence of ParDo's: split a query into parts, read each part, and you
> want to prevent fusing these ParDo's because that would make the whole
> thing execute sequentially, and in other similar cases.
>
> The PR also uses it, as an example, in DatastoreIO and JdbcIO, both of
> which used to have a hand-rolled implementation of the same. The Write
> transform has something similar, but not quite identical, so I skipped it.
>
> This is not a model change - merely providing a common implementation of
> something useful that already existed but was scattered across the
> codebase.
>
> Redistribute also subsumes the old mostly-internal Reshuffle transform via
> Redistribute.byKey().
>
> I tried finding more cases in the Beam codebase that have an ad-hoc
> implementation of this; I did not find any, but I might have missed
> something. I suppose the transform will need to be advertised in
> documentation on best-practices for connector development; perhaps some
> StackOverflow answers should be updated; any other places?