Flink 1.7.2 extremely unstable and losing jobs in prod

Bruno Aranda Tue, 19 Mar 2019 04:03:40 -0700

Hi,

This is causing serious instability and data loss in our production
environment. Any help figuring out what's going on here would be really
appreciated.

We recently updated our two EMR clusters from flink 1.6.1 to flink 1.7.2
(running on AWS EMR). The road to the upgrade was fairly rocky, but we felt
like it was working sufficiently well in our pre-production environments
that we rolled it out to prod.

However we're now seeing the jobmanager crash spontaneously several times a
day. There doesn't seem to be any pattern to when this happens - it doesn't
coincide with an increase in the data flowing through the system, nor is it
at the same time of day.

The big problem is that when it recovers, sometimes a lot of the jobs fail
to resume with the following exception:

org.apache.flink.util.FlinkException: JobManager responsible for
2401cd85e70698b25ae4fb2955f96fd0 lost the leadership.
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1185)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:138)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
//...
Caused by: java.util.concurrent.TimeoutException: The heartbeat of
JobManager with id abb0e96af8966f93d839e4d9395c7697 timed out.
at
org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1626)
... 16 more

Starting them manually afterwards doesn't resume from checkpoint, which for
most jobs means it starts from the end of the source kafka topic. This
means whenever this surprise jobmanager restart happens, we have a ticking
clock during which we're losing data.

We speculate that those jobs die first and while they wait to be restarted
(they have a 30 second delay strategy), the job manager restarts and does
not recover them? In any case, we have never seen so many job failures and
JM restarts with exactly the same EMR config.

We've got some functionality we're building that uses the StreamingFileSink
over S3 bugfixes in 1.7.2, so rolling back isn't an ideal option.

Looking through the mailing list, we found
https://issues.apache.org/jira/browse/FLINK-11843 - does it seem possible
this might be related?

Best regards,

Bruno

Flink 1.7.2 extremely unstable and losing jobs in prod

Reply via email to