Re: Introducing a Redistribute transform

Kenneth Knowles Tue, 11 Oct 2016 10:10:06 -0700

On Mon, Oct 10, 2016 at 1:38 PM Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:


> The transform, the way it's implemented, actually does several things at
> the same time and that's why it's tricky to document it.
>

This thread has actually made me less sure about my thoughts on this
transform. I do know what the transform is about and I do think we need it.
But I don't know that it can be explained "within the model". Look at our
classic questions about Redistribute.arbitrarily() and Redistribute.byKey():

 - "what" is it computing? The identity on its input.
 - "where" is the event time windowing? Same as its input.
 - "when" is output produced? As fast as reasonable (runner-specific).
 - "how" are refinements related? Same as its input (I think this might
actually be incorrect if accumulating fired panes)

These points don't describe any of the real goals of Redistribute. Hence
describing it in terms of fusion and checkpointing, which are quite
runner-specific in their (optional) manifestations.

> - Introduces a fusion barrier (in runners that have it), making sure that
> the runner can fully parallelize processing the output PCollection with
> DoFn's
>

Can a runner introduce other fusion barriers whenever it wants? Yes.
Can a runner ignore a proposed fusion barrier? Yes. (or when can it not?
why not?)
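
To make the quoted fusion point concrete, here is a toy model in plain Python. It is not Beam code: the worker assignment, the `expand` step, and every name in it are hypothetical, and a real runner is free to behave differently. It only shows why a high fan-out step can starve parallelism when it is fused with its consumer, and what a barrier buys.

```python
from collections import Counter

def expand(source):
    # Hypothetical high fan-out step: one input yields eight outputs.
    return [f"{source}-{i}" for i in range(8)]

def fused_load(sources, num_workers):
    # Fused: each output is processed on the worker that ran expand().
    load = Counter()
    for index, source in enumerate(sources):
        load[index % num_workers] += len(expand(source))
    return dict(load)

def barrier_load(sources, num_workers):
    # Barrier: outputs are materialized, then spread across all workers.
    outputs = [out for source in sources for out in expand(source)]
    load = Counter()
    for index, _ in enumerate(outputs):
        load[index % num_workers] += 1
    return dict(load)

fused = fused_load(["file"], num_workers=4)
spread = barrier_load(["file"], num_workers=4)
```

With a single input and four workers, the fused version puts all eight outputs on one worker, while the barrier version gives each worker two. Both compute the same outputs, which is why the barrier reads as a hint rather than a semantic requirement.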


> - Introduces a fault-tolerance barrier, effectively "checkpointing" the
> input PCollection (again, in runners where it makes sense) and making sure
> that processing elements of the output PCollection with a DoFn, if the DoFn
> fails, will redo only that processing, but not need to recompute the input
> PCollection.
>

Can a runner introduce a checkpoint whenever appropriate? Yes.
Can a runner ignore a hint to checkpoint? Yes (if it can still compute the
same result - it may not even conceive of checkpointing in a compatible
way).

> - All of the above and also makes the collection "key-partitioned", giving
> access to per-key state to downstream key-preserving DoFns. However, this
> is also runner-specific, because it's conceivable that a runner might not
> need this "key-partitioned" property (in fact it's best if a runner
> inserted such a "redistribute by key" automatically if it needs it...), and
> it currently isn't exposed anyway.
>

Agreed. The runner should insert the necessary keying wherever needed. One
might say the same for other uses of Redistribute, but in practice hints
are useful.


> Still thinking about the best way to describe this in a way that's least
> confusing to users.
>

I think it isn't just about users. I don't think the transform is quite
well-defined at the "what the runner must do" level. Here is a question I
am considering: When is it _incorrect_ for a runner to replace a
Redistribute with an identity transform? I have some thoughts, such as
committing pseudorandomly generated data, but do you have some other ideas?
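
A rough sketch of the pseudorandom case, as a toy simulation in plain Python rather than Beam code. Everything here is an assumption for illustration: `produce`, `consume`, `run`, the in-memory `sink`, and the single simulated failure are all hypothetical. The producer tags elements with random ids, the consumer writes to an external sink keyed by id, and the first attempt fails after one write.

```python
import random

def produce(elements, rng):
    # Non-deterministic step: tag each element with a pseudorandom id.
    return [(rng.getrandbits(64), element) for element in elements]

def consume(tagged, sink, fail_at=None):
    # Side-effecting step: write to an external sink, idempotent per id.
    for index, (uid, element) in enumerate(tagged):
        if fail_at is not None and index == fail_at:
            raise RuntimeError("simulated worker failure")
        sink[uid] = element

def run(elements, checkpoint, seed=0):
    rng = random.Random(seed)
    sink = {}
    # With a checkpoint, the producer's output is committed exactly once.
    saved = produce(elements, rng) if checkpoint else None
    for attempt in range(2):
        # Without one, a retry re-runs the producer and draws new ids.
        tagged = saved if checkpoint else produce(elements, rng)
        try:
            consume(tagged, sink, fail_at=1 if attempt == 0 else None)
            break
        except RuntimeError:
            continue
    return sorted(sink.values())

with_checkpoint = run(["a", "b", "c"], checkpoint=True)
as_identity = run(["a", "b", "c"], checkpoint=False)
```

In this model the checkpointed run leaves three records in the sink, because the retry reuses the committed ids and the rewrite of "a" is a no-op. The identity run leaves four, because the retry writes "a" again under a fresh id. So under these assumptions the replacement is observable exactly when the consumer relies on the ids being stable across retries.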

Kenn
