Re: Pipeline termination in the unified Beam model

Etienne Chauchot Wed, 10 May 2017 01:21:07 -0700

Hi everyone,

I'm reopening this subject because, IMHO, it is important to unifypipelines termination semantics in the model. Here are the differences Ihave observed in streaming pipelines termination:

- direct runner: when the output watermarks of all of its PCollectionsprogress to +infinity

- apex runner: when the output watermarks of all of its PCollectionsprogress to +infinity

- dataflow runner: when the output watermarks of all of its PCollectionsprogress to +infinity

- spark runner: streaming pipelines do not terminate unless timeout isset in pipelineResult.waitUntilFinish()

- flink runner: streaming pipelines do not terminate unless timeout isset in pipelineResult.waitUntilFinish() (thanks to Aljoscha for timeoutsupport PRhttps://github.com/apache/beam/pull/2915#pullrequestreview-37090326)



=> Is the direct/apex/dataflow behavior the correct "beam model" behavior?

I know that, at least for spark (mails in this thread), there is no easyway to know that we're done reading a source, so it might be verydifficult (at least for this runner) to unify toward +infinity behaviorif it is chosen as the standard behavior.


Best

Etienne

Le 18/04/2017 à 12:05, Etienne Chauchot a écrit :

+1 on "runners really terminate in a timely manner to easilyprogrammatically orchestrate Beam pipelines in a portable way, you doneed to know whetherthe pipeline will finish without thinking about the specific runnerand its options"
As an example, in Nexmark, we have streaming mode tests, and for thebenchmark, we need all the queries to behave the same between runnerstowards termination.
For now, to have the consistent behavior, in this mode we need to seta timeout (a bit random and flaky) on waitUntilFinish() for spark butthis timeout is not needed for direct runner.
Etienne

Le 02/03/2017 à 19:27, Kenneth Knowles a écrit :
Isn't this already the case? I think semantically it is an unavoidable
conclusion, so certainly +1 to that.

The DirectRunner and TestDataflowRunner both have this behavior already.
I've always considered that a streaming job running forever is just[very]
suboptimal shutdown latency :-)
Some bits of the discussion on the ticket seem to surround whether orhow
to communicate this property in a generic way. Since a runner owns its
PipelineResult it doesn't seem necessary.

So is the bottom line just that you want to more strongly insist that
runners really terminate in a timely manner? I'm +1 to that, too, for
basically the reason Stas gives: In order to easily programmatically
orchestrate Beam pipelines in a portable way, you do need to knowwhetherthe pipeline will finish without thinking about the specific runnerand its
options (as with our RunnableOnService tests).

Kenn
On Thu, Mar 2, 2017 at 9:09 AM, Dan Halperin<[email protected]>
wrote:
Note that even "unbounded pipeline in a streamingrunner".waitUntilFinish()can return, e.g., if you cancel it or terminate it. It's totallyreasonable
for users to want to understand and handle these cases.

+1

Dan

On Thu, Mar 2, 2017 at 2:53 AM, Jean-Baptiste Onofré <[email protected]>
wrote:
+1

Good idea !!

Regards
JB


On 03/02/2017 02:54 AM, Eugene Kirpichov wrote:
Raising this onto the mailing list from
https://issues.apache.org/jira/browse/BEAM-849

The issue came up: what does it mean for a pipeline to finish, in the
Beam
model?

Note that I am deliberately not talking about "batch" and "streaming"
pipelines, because this distinction does not exist in the model.Severalrunners have batch/streaming *modes*, which implement the samesemantics
(potentially different subsets: in batch mode typically a runner will
reject pipelines that have at least one unbounded PCollection) butin an
operationally different way. However we should define pipeline
termination
at the level of the unified model, and then make sure that allrunners
in
all modes implement that properly.

One natural way is to say "a pipeline terminates when the output
watermarks
of all of its PCollection's progress to +infinity". (Note: thiscan be
generalized, I guess, to having partial executions of a pipeline: if
you're
interested in the full contents of only some collections, then youwait
until only the watermarks of those collections progress to infinity)

A typical "batch" runner mode does not implement watermarks - we can
think
of it as assigning watermark -infinity to an output of a transformthat
hasn't started executing yet, and +infinity to output of a transform
that
has finished executing. This is consistent with how such runners
implement
termination in practice.
Dataflow streaming runner additionally implements such terminationforpipeline drain operation: it has 2 parts: 1) stop consuming inputfrom
the
sources, and 2) wait until all watermarks progress to infinity.
Let us fill the gap by making this part of the Beam model anddeclaring
that all runners should implement this behavior. This will give nice
properties, e.g.:
- A pipeline that has only bounded collections can be run by anyrunner
in
any mode, with the same results and termination behavior (this is
actually
my motivating example for raising this issue is: I was running
Splittable
DoFn tests
<https://github.com/apache/beam/blob/master/sdks/java/core/
src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java>
with the streaming Dataflow runner - these tests produce only bounded
collections - and noticed that they wouldn't terminate even thoughall
data
was processed)
- It will be possible to implement pipelines that stream data for a
while
and then eventually successfully terminate based on some condition.
E.g. a
pipeline that watches a continuously growing file until it is marked
read-only, or a pipeline that reads a Kafka topic partition until it
receives a "poison pill" message. This seems handy.
--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Pipeline termination in the unified Beam model

Reply via email to