Good points & questions. I'll try to be more clear.


> On 21 March 2017 at 13:52, Stephen Sisk <s...@google.com.invalid> wrote:
>
> > Hey Kenn-
> >
> > this seems important, but I don't have all the context on what the
> > problem is.
> >
> > Can you explain this sentence "Specifically, there is pseudorandom data
> > generated and once it has been observed and used to produce a side
> > effect, it cannot be regenerated without erroneous results." ?
>

On Tue, Mar 21, 2017 at 2:04 PM, vikas rk <vikky...@gmail.com> wrote:

>
> For the Write transform I believe you are talking about ApplyShardingKey
> <https://github.com/apache/beam/blob/d66029cafde152c0a46ebd276ddfa4c3e7fd3433/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Write.java#L304>
> which introduces non-deterministic behavior when retried?


Yes, exactly this. If the sharding key changes when the step is retried,
the rest of the transform doesn't function correctly.
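
For illustration, here is a simplified sketch of that kind of nondeterminism
(this is not the actual code behind the link above; ApplyShardingKeyFn and
numShards here are just stand-ins):

import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class ApplyShardingKeyFn<T> extends DoFn<T, KV<Integer, T>> {
  private final int numShards;

  ApplyShardingKeyFn(int numShards) {
    this.numShards = numShards;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // A fresh random draw per element: if this bundle is reprocessed, the
    // same element can land on a different shard than the one downstream
    // steps have already observed.
    int shard = ThreadLocalRandom.current().nextInt(numShards);
    c.output(KV.of(shard, c.element()));
  }
}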


> > Where is the pseudorandom data coming from? Perhaps a concrete example
> > would help?
>

I think the Write transform is a particularly complex example because of
the layers of abstraction. A simplified strawman might be:

Transform 1: Build RPC write descriptors identified by pseudo-random UUIDs.
Transform 2: Issue RPCs with those identifiers, so the endpoint will ignore
repeats of the same UUID (I tend to call this an "idempotency key" so I
might slip into that terminology sometimes)

In this case, transform 2 requires deterministic input: if the write fails
and is retried, a regenerated UUID means the endpoint won't know it is a
retry, resulting in duplicate data.
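
To make the strawman concrete, a minimal sketch (WriteDescriptor and
RpcClient are hypothetical stand-ins, not real Beam or service classes):

import java.util.UUID;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Transform 1: tag each write descriptor with a pseudo-random UUID.
class AssignIdempotencyKeyFn
    extends DoFn<WriteDescriptor, KV<String, WriteDescriptor>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // Re-executing this step produces a *different* UUID for the same element.
    c.output(KV.of(UUID.randomUUID().toString(), c.element()));
  }
}

// Transform 2: issue the RPC keyed by the UUID so the endpoint can ignore
// repeats of the same key.
class IssueRpcFn extends DoFn<KV<String, WriteDescriptor>, Void> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // If the upstream UUID was regenerated, the endpoint cannot recognize
    // this call as a retry, and the write is applied twice.
    RpcClient.write(c.element().getKey(), c.element().getValue());
  }
}

So transform 2 only deduplicates correctly if its input (the UUIDs produced
by transform 1) is stable across retries.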

Is this clearer?

Kenn
