On Mon, May 20, 2019 at 5:24 PM Jan Lukavský <je...@seznam.cz> wrote:
> I don't see batch vs. streaming as part of the model. One can have microbatch, or even a runner that alternates between different modes.
Although I understand the motivation of this statement, this project's name is "Apache Beam: An advanced unified programming model". What does the model unify, if "streaming vs. batch" is not part of the model?
What I mean is that streaming vs. batch is no longer part of the model (or ideally the API), but is pushed down to be a concern of the runner (executor) of the pipeline.
On Tue, May 21, 2019 at 10:32 AM Jan Lukavský <je...@seznam.cz> wrote:
Hi Kenn,
OK, so if we introduce an annotation, we can have stateful ParDo with sorting; that would perfectly resolve my issues. I still have some doubts, though. Let me explain. The current behavior of stateful ParDo has the following properties:

a) it might fail in batch, although it runs fine in streaming (that is due to the buffering, and the unbounded lateness in batch, which was discussed back and forth in this thread)

b) it might be non-deterministic (this is because the elements arrive in a somewhat random order, and even if you do the operation "assign unique ID to elements" this might produce different results when run multiple times)
PCollections are *explicitly* unordered. Any operations that assume or depend on a specific ordering for correctness (or determinism) must provide that ordering themselves (i.e. tolerate "arbitrary shuffling of inputs"). As you point out, that may be very expensive if you have very hot keys with very large (unbounded) timestamp skew.

StatefulDoFns are low-level operations that should be used with care; the simpler windowing model gives determinism in the face of unordered data (though late data and non-end-of-window triggering introduce some of the non-determinism back in).
What worries me most is property b), because it seems to me to have serious consequences - not only that if you run a batch pipeline twice you would get different results, but even in streaming, when the pipeline fails and gets restarted from a checkpoint, the produced output might differ from the previous run, and data from the first run might have already been persisted into the sink. That would create somewhat messy outputs.
Beam has an exactly-once model. If the data was consumed, state mutated, and outputs written downstream (these three are committed together atomically), it will not be replayed. That does not, of course, solve the non-determinism due to ordering (including the fact that two operations reading the same PCollection may view different orderings).
These two properties make me think that the current implementation is more of a _special case_ than the general one. The general one would be that your state doesn't have the properties needed to tolerate buffering problems and/or non-determinism - which is the case where you need sorting, in both streaming and batch, to be part of the model.
Let me point out one more analogy - merging vs. non-merging windows. The general case (merging windows) implies sorting by timestamp in both the batch case (explicitly) and streaming (via buffering). The special case (non-merging windows) doesn't rely on any timestamp ordering, so the sorting and buffering can be dropped. The underlying root cause is the same for both stateful ParDo and windowing (essentially, assigning window labels is a stateful operation when the windowing function is merging).
The reason for the current behavior of stateful ParDo seems to be performance, but is it right to abandon correctness in favor of performance? Wouldn't it be more consistent to have the default behavior prefer correctness, and when the state function has the specific special properties, you can annotate your DoFn (with something like @TimeOrderingAgnostic), which would yield better performance in that case?
There are two separable questions here.

1) Is it correct for a (Stateful)DoFn to assume elements are received in a specific order? In the current model, it is not. Being able to read, handle, and produce out-of-order data, including late data, is a pretty fundamental property of distributed systems.

2) Given that some operations are easier (or possibly only possible) to write when operating on ordered data, and that different runners may have (significantly) cheaper ways to provide this ordering than can be done by the user themselves, should we elevate this to a property of (Stateful?)DoFns that the runner can provide? I think a compelling argument can be made here that we should.
- Robert
On 5/21/19 1:00 AM, Kenneth Knowles wrote:
Thanks for the nice small example of a calculation that depends on order. You are right that many state machines have this property. I agree w/ you and Luke that it is convenient for batch processing to sort by event timestamp before running a stateful ParDo. In streaming you could also implement "sort by event timestamp" by buffering until you know all earlier data will be dropped - a slack buffer up to allowed lateness.
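To make this concrete, such a slack buffer might look roughly like the following in the Java SDK (a sketch only; the class and state names are illustrative, and a complete version would also retain elements newer than the timer's timestamp and manage watermark holds):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;

class SortBufferDoFn extends DoFn<KV<String, Long>, KV<String, Long>> {
  // Slack for late data; in a real pipeline this would be derived from the
  // window's allowed lateness.
  private static final Duration SLACK = Duration.standardMinutes(1);

  @StateId("buffer")
  private final StateSpec<BagState<TimestampedValue<KV<String, Long>>>>
      bufferSpec = StateSpecs.bag();

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(
      ProcessContext ctx,
      @StateId("buffer") BagState<TimestampedValue<KV<String, Long>>> buffer,
      @TimerId("flush") Timer flush) {
    buffer.add(TimestampedValue.of(ctx.element(), ctx.timestamp()));
    // Fire once the watermark passes this element's timestamp plus slack,
    // i.e. once anything earlier would be droppable anyway.
    flush.set(ctx.timestamp().plus(SLACK));
  }

  @OnTimer("flush")
  public void flush(
      OnTimerContext ctx,
      @StateId("buffer") BagState<TimestampedValue<KV<String, Long>>> buffer) {
    List<TimestampedValue<KV<String, Long>>> sorted = new ArrayList<>();
    buffer.read().forEach(sorted::add);
    buffer.clear();
    sorted.sort(Comparator.comparing(TimestampedValue::getTimestamp));
    // Emit the buffered elements in event-time order.
    for (TimestampedValue<KV<String, Long>> tv : sorted) {
      ctx.outputWithTimestamp(tv.getValue(), tv.getTimestamp());
    }
  }
}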
I do not think that it is OK to sort in batch and not in streaming. Many state machines diverge very rapidly when things are out of order. So each runner, if it sees the "@OrderByTimestamp" annotation (or whatever), needs to deliver sorted data (by some mix of buffering and dropping), or to reject the pipeline as unsupported.
And I also want to say that this is not the default case - many uses of state & timers in ParDo yield different results at the element level, but the results are equivalent in the big picture. Take the examples of "assign a unique sequence number to each element" or "group into batches": it doesn't matter exactly what the result is, only that it meets the spec. And other cases like user funnels are monotonic enough that you also don't actually need sorting.
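For instance, the sequence-number case might look like this sketch (names illustrative): any arrival order produces a different but equally valid numbering, so no sorting is required.

import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class AssignSequenceNumbers extends DoFn<KV<String, String>, KV<String, Long>> {
  @StateId("nextSeq")
  private final StateSpec<ValueState<Long>> nextSeqSpec = StateSpecs.value();

  @ProcessElement
  public void process(
      ProcessContext ctx, @StateId("nextSeq") ValueState<Long> nextSeq) {
    Long stored = nextSeq.read();
    long seq = stored == null ? 0L : stored;
    // Whatever order elements arrive in, each gets a distinct number per
    // key, which is all the spec asks for.
    ctx.output(KV.of(ctx.element().getKey(), seq));
    nextSeq.write(seq + 1);
  }
}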
Kenn
On Mon, May 20, 2019 at 2:59 PM Jan Lukavský <je...@seznam.cz> wrote:
Yes, the problem will probably arise mostly when you have poorly distributed keys (or too few keys). I'm really not sure if a pure GBK with a trigger can solve this - it might help to have a data driven trigger. There would still be some doubts, though. The main question is still here - people say that sorting by timestamp before a stateful ParDo would be prohibitively slow, but I don't really see why - the sorting is very probably already there. And if not (hash grouping instead of sorted grouping), then the sorting would affect only user defined StatefulParDos. This would suggest that the best way out of this would really be to add the annotation, so that the author of the pipeline can decide.

If that would be acceptable, I think I can try to prepare some basic functionality, but I'm not sure if I would be able to cover all runners / SDKs.
On 5/20/19 11:36 PM, Lukasz Cwik wrote:
It is "read all per key and window" and not just "read all" (this still won't scale with hot keys in the global window). The GBK preceding the StatefulParDo will guarantee that you are processing all the values for a specific key and window at any given time. Is there a specific window/trigger that is missing that you feel would remove the need for you to use StatefulParDo?
On Mon, May 20, 2019 at 12:54 PM Jan Lukavský <je...@seznam.cz> wrote:
Hi Lukasz,

> Today, if you must have a strict order, you must guarantee that your StatefulParDo implements the necessary "buffering & sorting" into state.

Yes, no problem with that. But this whole discussion started because *this doesn't work in batch*. You simply cannot first read everything from distributed storage and then buffer it all into memory, just to read it again, but sorted. That will not work. And even if it would, it would be a terrible waste of resources.

Jan
On 5/20/19 8:39 PM, Lukasz Cwik wrote:
On Mon, May 20, 2019 at 8:24 AM Jan Lukavský <je...@seznam.cz> wrote:
This discussion brings many really interesting questions for me. :-)
> I don't see batch vs. streaming as part of the model. One can have microbatch, or even a runner that alternates between different modes.
Although I understand the motivation of this statement, this project's name is "Apache Beam: An advanced unified programming model". What does the model unify, if "streaming vs. batch" is not part of the model?
Using microbatching, chaining of batch jobs, or pure streaming are exactly the "runtime conditions/characteristics" I refer to. All of these define several runtime parameters, which in turn define how well/badly the pipeline will perform and how many resources might be needed. From my point of view, pure streaming should be the most resource demanding (if not, why bother with batch? why not run everything in streaming only? what would there remain to "unify"?).
> Fortunately, for batch, only the state for a single key needs to be preserved at a time, rather than the state for all keys across the range of skew. Of course if you have few or hot keys, one can still have issues (and this is not specific to StatefulDoFns).
Yes, but there is still the presumption that my stateful DoFn can tolerate arbitrary shuffling of inputs. Let me explain the use case in more detail.

Suppose you have an input stream consisting of 1s and 0s (and some key for each element, which is irrelevant for the demonstration). Your task is to calculate, in a running global window, the actual number of changes between state 0 and state 1 and vice versa. When the state doesn't change, you don't calculate anything. If the input (for a given key) were (tN denotes timestamp N):
t1: 1
t2: 0
t3: 0
t4: 1
t5: 1
t6: 0
then the output should yield (supposing that the default state is zero):
t1: (one: 1, zero: 0)
t2: (one: 1, zero: 1)
t3: (one: 1, zero: 1)
t4: (one: 2, zero: 1)
t5: (one: 2, zero: 1)
t6: (one: 2, zero: 2)
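For concreteness, a naive transition counter might look like this sketch (Java SDK, names illustrative). It keeps the last seen value and both counters in state - and it is only correct if elements arrive in timestamp order, which is exactly what the model does not guarantee:

import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class CountTransitions extends DoFn<KV<String, Integer>, KV<String, String>> {
  @StateId("last")
  private final StateSpec<ValueState<Integer>> lastSpec = StateSpecs.value();
  @StateId("ones")
  private final StateSpec<ValueState<Long>> onesSpec = StateSpecs.value();
  @StateId("zeros")
  private final StateSpec<ValueState<Long>> zerosSpec = StateSpecs.value();

  @ProcessElement
  public void process(
      ProcessContext ctx,
      @StateId("last") ValueState<Integer> last,
      @StateId("ones") ValueState<Long> ones,
      @StateId("zeros") ValueState<Long> zeros) {
    int prev = last.read() == null ? 0 : last.read();   // default state is zero
    long one = ones.read() == null ? 0L : ones.read();
    long zero = zeros.read() == null ? 0L : zeros.read();
    int curr = ctx.element().getValue();
    if (prev == 0 && curr == 1) {
      one++;    // 0 -> 1 transition
    } else if (prev == 1 && curr == 0) {
      zero++;   // 1 -> 0 transition
    }
    last.write(curr);
    ones.write(one);
    zeros.write(zero);
    ctx.output(KV.of(ctx.element().getKey(),
        "(one: " + one + ", zero: " + zero + ")"));
  }
}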
How would you implement this in current Beam semantics?
I think what you're saying here is: I know that my input is ordered in a specific way, and since I assume that order when writing my pipeline, I can perform this optimization. But there is nothing preventing a runner from noticing that you're processing in the global window with a specific type of trigger and re-ordering your inputs/processing to get better performance (since you can't use an AfterWatermark trigger for your pipeline in streaming for the GlobalWindow).
Today, if you must have a strict order, you must guarantee that your StatefulParDo implements the necessary "buffering & sorting" into state. I can see why you would want an annotation that says "I must have timestamp ordered elements", since it makes writing certain StatefulParDos much easier. StatefulParDo is a low-level function; it really is the "here you go and do whatever you need to, but here be dragons" function, while windowing and triggering are meant to keep many people from writing StatefulParDo in the first place.
> Pipelines that fail in the "worst case" batch scenario are likely to degrade poorly (possibly catastrophically) when the watermark falls behind in streaming mode as well.
But the worst case is defined by an input of size (available resources + a single byte) -> the pipeline fails. Although it could have finished, given the right conditions.
> This might be reasonable, implemented by default by buffering everything and releasing elements as the watermark (+lateness) advances, but would likely lead to inefficient (though *maybe* easier to reason about) code.
Sure, the pipeline will be less efficient, because it would have to buffer and sort the inputs. But at least it will produce correct results in cases where updates to state are order-sensitive.
> Would it be roughly equivalent to GBK + FlatMap(lambda (key, values): [(key, value) for value in values])?
I'd say roughly yes, but the difference would be in the trigger. The trigger should ideally fire as soon as the watermark (+lateness) crosses the element with the lowest timestamp in the buffer. Although this could be somehow emulated by a fixed trigger firing every X millis.
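As a sketch of what that emulation might look like (the window, trigger, delay, and accumulation choices here are assumptions, not a worked-out proposal):

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class FixedIntervalBuffering {
  static PCollection<KV<String, Iterable<Long>>> expand(
      PCollection<KV<String, Long>> input) {
    return input
        .apply(
            Window.<KV<String, Long>>into(new GlobalWindows())
                .triggering(
                    Repeatedly.forever(
                        AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.millis(100))))  // "each X millis"
                .withAllowedLateness(Duration.ZERO)
                .discardingFiredPanes())
        .apply(GroupByKey.create());
  }
}

A downstream DoFn would then sort each emitted pane by timestamp before updating its state.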
> Or is the underlying desire just to be able to hint to the runner that the code may perform better (e.g. require less resources) as skew is reduced (and hence to order by timestamp iff it's cheap)?
No, the sorting would have to be done in the streaming case as well. That is an imperative of the unified model. I think it is possible to sort by timestamp only in the batch case (and do it for *all* batch stateful pardos without the annotation), or to introduce the annotation, but then make the same guarantees for the streaming case as well.
Jan
On 5/20/19 4:41 PM, Robert Bradshaw wrote:
On Mon, May 20, 2019 at 1:19 PM Jan Lukavský <je...@seznam.cz> wrote:
Hi Robert,
yes, I think you rephrased my point - although no *explicit* guarantees of ordering are given in either mode, there is *implicit* ordering in the streaming case that is due to the nature of the processing - the difference between the watermark and the timestamps of elements flowing through the pipeline is generally low (too high a difference leads to the overbuffering problem), but there is no such bound in batch.
Fortunately, for batch, only the state for a single key needs to be preserved at a time, rather than the state for all keys across the range of skew. Of course if you have few or hot keys, one can still have issues (and this is not specific to StatefulDoFns).
As a result, I see a few possible solutions:

- the best and most natural seems to be an extension of the model, so that it defines batch not only as "a streaming pipeline executed in batch fashion", but as "a pipeline with at least as good runtime characteristics as in the streaming case, executed in batch fashion". I really don't think that there are any conflicts with the current model, or that this could affect performance, because the required sorting (as pointed out by Aljoscha) is very probably already done during translation of stateful pardos. Also note that this definition only affects user defined stateful pardos
I don't see batch vs. streaming as part of the model. One can have microbatch, or even a runner that alternates between different modes. The model describes what the valid outputs are given a (sometimes partial) set of inputs. It becomes really hard to define things like "as good runtime characteristics." Once you allow any out-of-orderedness, it is not very feasible to try and define (and more cheaply implement) an "upper bound" of acceptable out-of-orderedness.
Pipelines that fail in the "worst case" batch scenario are likely to degrade poorly (possibly catastrophically) when the watermark falls behind in streaming mode as well.
- another option would be to introduce an annotation for DoFns (e.g. @RequiresStableTimeCharacteristics), which would result in the sorting in the batch case - but - this extension would have to ensure the sorting in streaming mode also - it would require a definition of allowed lateness, and a trigger (essentially similar to a window)
This might be reasonable, implemented by default by buffering everything and releasing elements as the watermark (+lateness) advances, but would likely lead to inefficient (though *maybe* easier to reason about) code. Not sure about the semantics of triggering here, especially data-driven triggers. Would it be roughly equivalent to GBK + FlatMap(lambda (key, values): [(key, value) for value in values])?
Or is the underlying desire just to be able to hint to the runner that the code may perform better (e.g. require less resources) as skew is reduced (and hence to order by timestamp iff it's cheap)?
- the last option would be to introduce these "higher order guarantees" in some extension DSL (e.g. Euphoria), but that seems to be the worst option to me
I see the first two options as roughly equally good, although the latter one is probably more time consuming to implement. But it would bring an additional feature to the streaming case as well.
Thanks for any thoughts.
Jan
On 5/20/19 12:41 PM, Robert Bradshaw wrote:
On Fri, May 17, 2019 at 4:48 PM Jan Lukavský <je...@seznam.cz> wrote:
Hi Reuven,
> How so? AFAIK stateful DoFns work just fine in batch runners.
Stateful ParDo works in batch as far as the logic inside the state works for absolutely unbounded out-of-orderness of elements. That basically (practically) can work only for cases where the order of input elements doesn't matter. But "state" can refer to "state machine", and any time you have a state machine involved, the ordering of elements matters.
No guarantees on order are provided in *either* streaming or batch mode by the model. However, in order to make forward progress, most streaming runners attempt to limit the amount of out-of-orderedness of elements (in terms of event time vs. processing time), which in turn can help cap the amount of state that must be held concurrently, whereas a batch runner may not allow any state to be safely discarded until the whole timeline from the infinite past to the infinite future has been observed.

Also, as pointed out, state is not preserved "batch to batch" in batch mode.
On Thu, May 16, 2019 at 3:59 PM Maximilian Michels <m...@apache.org> wrote:
> batch semantics and streaming semantics differ only in that I can have GlobalWindow with default trigger on batch and cannot on stream
You can have a GlobalWindow in streaming with a default trigger. You could define additional triggers that do early firings. And you could even trigger the global window by advancing the watermark to +inf.
IIRC, as a pragmatic note, we prohibited the global window with the default trigger on unbounded PCollections in the SDK because this is more likely to be user error than an actual desire to have no output until drain. But it's semantically valid in the model.
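For reference, a global window with early firings on an unbounded PCollection might be declared like this (a sketch in the Java SDK; the trigger, delay, and accumulation mode are illustrative choices):

import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class GlobalWindowWithEarlyFirings {
  static <T> PCollection<T> expand(PCollection<T> input) {
    return input.apply(
        Window.<T>into(new GlobalWindows())
            .triggering(
                // Fires at +inf (end of the global window), with early
                // firings every minute after the first element in a pane.
                AfterWatermark.pastEndOfWindow()
                    .withEarlyFirings(
                        AfterProcessingTime.pastFirstElementInPane()
                            .plusDelayOf(Duration.standardMinutes(1))))
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes());
  }
}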