Hello,
I'm running Flink 1.10 on a YARN cluster. I have a streaming application that,
when under heavy load, fails from time to time. This is the only error message
in the whole YARN log:
(...)
2020-11-15 16:18:42,202 WARN
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late
message for now expired checkpoint attempt 63 from task
4cbc940112a596db54568b24f9209aac of job 1e1717d19bd8ea296314077e42e1c7e5 at
container_e38_1604477334666_0960_01_000004 @ xxx (dataPort=33099).
2020-11-15 16:18:55,043 INFO org.apache.flink.yarn.YarnResourceManager
- Closing TaskExecutor connection
container_e38_1604477334666_0960_01_000004 because: The TaskExecutor is
shutting down.
2020-11-15 16:18:55,087 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph - Map (7/15)
(c8e92cacddcd4e41f51a2433d07d2153) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The TaskExecutor is shutting down.
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:359)
at
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:218)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:509)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:175)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2020-11-15 16:18:55,092 INFO
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy
- Calculating tasks to restart to recover the failed task
2f6467d98899e64a4721f0a7b6a059a8_6.
2020-11-15 16:18:55,101 INFO
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy
- 230 tasks should be restarted to recover the failed task
2f6467d98899e64a4721f0a7b6a059a8_6.
(...)
What could be the cause of this failure? Why is there no other error message?
I tried increasing the value of heartbeat.timeout, thinking the failure might
be due to a slow-responding mapper, but that did not solve the issue.
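For reference, heartbeat.timeout is set in flink-conf.yaml, in milliseconds.
The fragment below is only a sketch: the value is illustrative, not the exact
one I tried. Flink's default is 50000.

```yaml
# flink-conf.yaml (illustrative value, not the one actually used)
# Heartbeat timeout between the JobManager/ResourceManager and the
# TaskExecutors, in milliseconds. Flink's default is 50000.
heartbeat.timeout: 180000
```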
Best regards,
Arnaud