We are running a DataStream pipeline using Exactly Once/Event Time
semantics on 1.2-SNAPSHOT. The pipeline sources from S3 using the
ContinuousFileReaderOperator. We use a custom version of the
ContinuousFileMonitoringFunction since our source directory changes over
time. The pipeline transforms and aggregates tuples of data that is steady
over time (spike-less), windowing by hour with an allowed lateness of 6
hours. We are running on ec2 c4 instances in a simple YARN setup.

What we are seeing is that as we scale the system to hundreds of cores
reading around 10 million events per second, the pipeline might checkpoint
up to few times before it reaches a state where it becomes unable to
complete a checkpoint. The checkpoint interval does not seem to matter as
we've tested intervals from 5 minutes to one hour and timeouts up to one
hour as well.
What we'll usually see is something like:

checkpoint 1 (2G) 1 minute
checkpoint 2 (3-4G)  1-5 minutes
checkpoint 3 (5-6G) 1-6 minutes
checkpoint 4-5 (mixed bag, 4-25 minutes, or never)
checkpoint 6-n never

We've added debugging output to Flink internal code, e.g.
BarrierBuffer.java. From the debugging output it's clear that the actual
checkpoint per operator always completes in at most several seconds. What
seem to be happening, however, is that CheckpointBarriers start to become
slower to arrive, and after a few checkpoints it gets worse with
CheckpointBarriers going greatly askew and finally never arriving.
Meanwhile we can see that downstream counts are closely tailing upstream
counts,  < .01  behind, but once the barrier flow seemingly stops, the
downstream slows to a stop as (I guess) network buffers fill, then the
pipeline is dead.

Meanwhile, cluster resources are not being stressed.

At this point we've stripped down the pipeline, tried various StateBackend
configs, etc. but the result is invariably the same sad story. It would be
great if somebody could provide more insight into where things might be
going wrong. Hopefully this is a simple config issue, but we'd be open to
any and all suggestions regarding testing, tweaking, etc.

-Cliff

Reply via email to