Hi,

this is a follow-up to multiple threads covering the topic of how to process event streams in a unified way. Event streams share a common property: the ordering of events matters. The processing (usually) looks something like

  unordered stream -> buffer (per key) -> ordered stream -> stateful logic (DoFn)
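
The buffer-per-key step in the diagram above can be sketched in plain Java (this is an illustrative model, not the Beam API; the Event record and method names are made up for the example): out-of-order events are collected per key, then sorted by timestamp before being handed to the stateful logic.

```java
import java.util.*;
import java.util.function.Consumer;

public class PerKeyTimeSorter {
  // Illustrative timestamped event; fields are made up for this sketch.
  record Event(String key, long timestamp, String payload) {}

  private final Map<String, List<Event>> buffers = new HashMap<>();

  // Buffer an out-of-order event under its key.
  void accept(Event e) {
    buffers.computeIfAbsent(e.key(), k -> new ArrayList<>()).add(e);
  }

  // Flush one key's buffer in timestamp order to the stateful logic.
  void flush(String key, Consumer<Event> statefulLogic) {
    List<Event> buf = buffers.remove(key);
    if (buf == null) return;
    buf.sort(Comparator.comparingLong(Event::timestamp));
    buf.forEach(statefulLogic);
  }

  public static void main(String[] args) {
    PerKeyTimeSorter sorter = new PerKeyTimeSorter();
    sorter.accept(new Event("user1", 30, "c"));
    sorter.accept(new Event("user1", 10, "a"));
    sorter.accept(new Event("user1", 20, "b"));
    StringBuilder order = new StringBuilder();
    sorter.flush("user1", e -> order.append(e.payload()));
    System.out.println(order); // prints "abc"
  }
}
```

In a real pipeline the flush would of course be driven by the watermark rather than called explicitly.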

This is perfectly fine and can be handled by the tools Beam currently offers (state & timers), but *only in the streaming case*. The batch case is essentially broken, because:

 a) out-of-orderness is essentially *unbounded* (which, strangely, is not a contradiction with the input being bounded); in the streaming case, out-of-orderness is *bounded*, because the watermark can fall behind by only a limited amount of time (sooner or later, nobody would care about results from a streaming pipeline being months or years late, right?)

 b) with unbounded out-of-orderness, the spatial requirements of the state grow as O(N) in the worst case, where N is the size of the whole input

 c) moreover, many runners (Spark, Flink) restrict the size of state per key so that it must fit in memory
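
To make the worst case in b) concrete, consider a per-key stream delivered in strictly decreasing timestamp order. With unbounded out-of-orderness, an even earlier timestamp may still arrive at any point, so nothing can be safely emitted before the input ends and the buffer must hold all N elements. A tiny simulation (illustrative only):

```java
public class WorstCaseBuffer {
  // Simulates per-key buffering of a reversed stream: nothing can be
  // flushed early, so the peak buffer size equals the whole input size N.
  static int peakBufferSize(long[] timestamps) {
    java.util.List<Long> buffer = new java.util.ArrayList<>();
    int peak = 0;
    for (long ts : timestamps) {
      buffer.add(ts); // cannot flush: an earlier timestamp may still arrive
      peak = Math.max(peak, buffer.size());
    }
    return peak;
  }

  public static void main(String[] args) {
    long[] reversed = new long[1000];
    for (int i = 0; i < reversed.length; i++) {
      reversed[i] = reversed.length - i; // timestamps 1000, 999, ..., 1
    }
    System.out.println(peakBufferSize(reversed)); // prints 1000
  }
}
```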

Now, the solutions to these problems seem to be:

 1) refine the model guarantees for batch stateful processing so that we limit out-of-orderness (the source of the issues here) - the only reasonable way to do that is to enforce sorting before all stateful DoFns in the batch case (perhaps with an opt-out), or

 2) define a way to mark a stateful DoFn as requiring sorting (e.g. @RequiresTimeSortedInput) - note this has to be done for both the batch and streaming case, as opposed to 1), or

 3) define a different URN for an "ordered stateful DoFn", with a default expansion that uses state as a buffer (for both the batch and streaming case) - that way it can be overridden in batch runners that would otherwise get into trouble (and it could be regarded as a sort of natural extension of the current approach).
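
The "state as buffer" default expansion mentioned in 3) can be sketched as follows, again as an illustrative plain-Java model of a single key rather than any real Beam API: events are buffered sorted by timestamp, and whenever the watermark advances, everything below it is flushed in order. In streaming the watermark bounds the buffer size; in batch the watermark only jumps at the end of input, which is exactly problem b) above.

```java
import java.util.*;
import java.util.function.Consumer;

// Illustrative single-key sketch of the "state as buffer" expansion for an
// ordered stateful DoFn; class and method names are made up for the example.
public class OrderedBufferExpansion {
  // Buffered payloads, keyed and ordered by event timestamp.
  private final TreeMap<Long, List<String>> buffer = new TreeMap<>();

  void accept(long timestamp, String payload) {
    buffer.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(payload);
  }

  // Called when the watermark advances; emits events strictly below it,
  // in timestamp order, to the downstream stateful logic.
  void onWatermark(long watermark, Consumer<String> statefulLogic) {
    SortedMap<Long, List<String>> ready = buffer.headMap(watermark);
    ready.values().forEach(list -> list.forEach(statefulLogic));
    ready.clear(); // clearing the view drops flushed entries from the buffer
  }
}
```

A runner that can sort its input natively (as in option 1) could override this expansion and skip the buffering entirely.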

I still think that the best solution is 1), for multiple reasons, ranging from being internally logically consistent to being practical and easy to implement (a few lines of code in Flink's case, for instance). On the other hand, if this is really not what we want to do, then I'd like to know the community's opinion on the two other options (or on some other option I didn't cover).

Many thanks for your opinions and help with fixing what is (sort of) broken right now.

Jan
