Hello,

Heads up that https://github.com/apache/incubator-beam/pull/1036 will
introduce a transform called "Redistribute", which encapsulates a relatively
common pattern: a "fusion break" (see
https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion,
which previously provided advice on this). It is useful, for example, when
you write an IO as a sequence of ParDos: split a query into parts, then read
each part. You want to prevent these ParDos from being fused, because fusion
would make the whole thing execute sequentially. The same applies in other
similar cases.
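
For readers unfamiliar with the pattern, here is a rough plain-Java sketch of
the data flow that a hand-rolled fusion break typically follows: pair each
element with a random key, group by that key, then drop the keys and flatten.
This is not the code from the PR and it does not use the Beam API. The class
name, method names, and the numKeys parameter are all made up for
illustration. In a real pipeline the grouping step is what forces the runner
to materialize and redistribute the data; here it is only simulated in memory.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Hypothetical, Beam-free illustration of the key / group / ungroup pattern. */
final class FusionBreakSketch {

    /** Step 1: pair every element with a random key in [0, numKeys). */
    static <T> List<Map.Entry<Integer, T>> assignRandomKeys(
            List<T> input, int numKeys, Random rng) {
        List<Map.Entry<Integer, T>> keyed = new ArrayList<>();
        for (T element : input) {
            keyed.add(new AbstractMap.SimpleEntry<>(rng.nextInt(numKeys), element));
        }
        return keyed;
    }

    /** Step 2: group values by key (stands in for the grouping step of a pipeline). */
    static <T> Map<Integer, List<T>> groupByKey(List<Map.Entry<Integer, T>> keyed) {
        Map<Integer, List<T>> groups = new TreeMap<>();
        for (Map.Entry<Integer, T> kv : keyed) {
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return groups;
    }

    /** Step 3: discard the keys and flatten the groups back into single elements. */
    static <T> List<T> ungroup(Map<Integer, List<T>> groups) {
        List<T> out = new ArrayList<>();
        for (List<T> values : groups.values()) {
            out.addAll(values);
        }
        return out;
    }

    /** All three steps: same elements out as in, possibly in a different order. */
    static <T> List<T> redistribute(List<T> input, int numKeys, Random rng) {
        return ungroup(groupByKey(assignRandomKeys(input, numKeys, rng)));
    }
}
```

Calling FusionBreakSketch.redistribute(parts, 4, new Random()) on a list of
query parts returns the same parts, possibly reordered. The point of the
sketch is only that the output is semantically the same collection as the
input, which is why the pattern can be wrapped in a single reusable transform.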

The PR also uses it, as an example, in DatastoreIO and JdbcIO, both of
which used to have a hand-rolled implementation of the same. The Write
transform has something similar, but not quite identical, so I skipped it.

This is not a model change - merely providing a common implementation of
something useful that already existed but was scattered across the codebase.

Redistribute also subsumes the old mostly-internal Reshuffle transform via
Redistribute.byKey().

I tried finding more cases in the Beam codebase that have an ad-hoc
implementation of this; I did not find any, but I might have missed
something. I suppose the transform will need to be advertised in
documentation on best-practices for connector development; perhaps some
StackOverflow answers should be updated; any other places?
