Hi Eugene,

This is very interesting.
Let me see if I get this right: the "Redistribute" transformation assigns
a "running id" key (per bundle), calls "Redistribute.byKey", and then
extracts the values back, correct?
As for "Redistribute.byKey": it is made of a GroupByKey transformation that
follows a Window transformation, which neutralises the "resolution" of
triggers and panes that usually occurs in GroupByKey, correct?
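
To make my reading concrete, here is a minimal sketch in plain Java. It
does not use any Beam API: the class and method names are hypothetical, an
in-memory map stands in for GroupByKey, and windows, triggers and panes are
left out entirely. It only illustrates the "assign running id, group by key,
extract values" sequence as I understand it, not what the PR actually does.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RedistributeSketch {

    // Step 1: pair every element with a running id that restarts at 0
    // for each input bundle (assumption: this is how keys are assigned).
    public static <T> List<Map.Entry<Integer, T>> assignRunningIds(List<List<T>> bundles) {
        List<Map.Entry<Integer, T>> keyed = new ArrayList<>();
        for (List<T> bundle : bundles) {
            int id = 0;
            for (T value : bundle) {
                keyed.add(new AbstractMap.SimpleImmutableEntry<>(id++, value));
            }
        }
        return keyed;
    }

    // Step 2: in-memory stand-in for GroupByKey; elements sharing a key
    // end up together regardless of the bundle they came from.
    public static <T> Map<Integer, List<T>> groupByKey(List<Map.Entry<Integer, T>> keyed) {
        Map<Integer, List<T>> grouped = new LinkedHashMap<>();
        for (Map.Entry<Integer, T> entry : keyed) {
            grouped.computeIfAbsent(entry.getKey(), k -> new ArrayList<>()).add(entry.getValue());
        }
        return grouped;
    }

    // Step 3: drop the keys again and keep only the values.
    public static <T> List<T> extractValues(Map<Integer, List<T>> grouped) {
        List<T> values = new ArrayList<>();
        for (List<T> group : grouped.values()) {
            values.addAll(group);
        }
        return values;
    }

    public static void main(String[] args) {
        List<List<String>> bundles = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"));
        Map<Integer, List<String>> grouped = groupByKey(assignRunningIds(bundles));
        System.out.println(grouped);
        System.out.println(extractValues(grouped));
    }
}
```

With two input bundles of three and two elements, the grouping step yields
three groups, so the elements come out partitioned differently from how
they went in, while the set of values itself is unchanged.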

So this is basically a "FanOut" transformation whose effect will depend on
the available resources of the runner (and the uniqueness of the assigned
keys)?

Would we want to Redistribute into a user-defined number of bundles
(greater than the current number)?

How about "FanIn"?

Thanks,
Amit


On Fri, Oct 7, 2016 at 10:49 PM Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:

> Hello,
>
> Heads up that https://github.com/apache/incubator-beam/pull/1036 will
> introduce a transform called "Redistribute", encapsulating a relatively
> common pattern - a "fusion break" [see
> https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion
> which previously provided advice on that] - useful e.g. when you write an
> IO as a sequence
> of ParDo's: split a query into parts, read each part, and you want to
> prevent fusing these ParDo's because that would make the whole thing
> execute sequentially, and in other similar cases.
>
> The PR also uses it, as an example, in DatastoreIO and JdbcIO, both of
> which used to have a hand-rolled implementation of the same. The Write
> transform has something similar, but not quite identical, so I skipped it.
>
> This is not a model change - merely providing a common implementation of
> something useful that already existed but was scattered across the
> codebase.
>
> Redistribute also subsumes the old mostly-internal Reshuffle transform via
> Redistribute.byKey().
>
> I tried finding more cases in the Beam codebase that have an ad-hoc
> implementation of this; I did not find any, but I might have missed
> something. I suppose the transform will need to be advertised in
> documentation on best-practices for connector development; perhaps some
> StackOverflow answers should be updated; any other places?
>
