I'd go with the network congestion theory for the time being; then the only remedy is throttling the download of said list, or somehow reducing its size significantly.
What the task thread is doing doesn't matter with regard to HA; it may cause checkpoints to time out, but it should have no other effects (unless it magically consumes all CPU resources of the system). If you block in any function, you're just blocking the task thread; nothing else.
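For the throttling remedy, one low-tech option is to cap the rate at which the bytes of the list are read. Below is a minimal sketch using Guava's RateLimiter, assuming the list comes in over a plain InputStream; the class and parameter names are illustrative and not from the actual job.

import com.google.common.util.concurrent.RateLimiter;

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Wraps the stream the list is read from and caps it at a fixed number of
 * bytes per second, so a reload cannot saturate the network link.
 */
public class ThrottledInputStream extends FilterInputStream {

    private final RateLimiter limiter;

    public ThrottledInputStream(InputStream in, long bytesPerSecond) {
        super(in);
        this.limiter = RateLimiter.create(bytesPerSecond);
    }

    @Override
    public int read() throws IOException {
        limiter.acquire(1); // one permit per byte
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int n = super.read(b, off, len);
        if (n > 0) {
            limiter.acquire(n); // pay for the bytes actually read; delays the next read
        }
        return n;
    }
}

Whether that is enough depends on how congested the link already is; shrinking the list (or shipping only deltas) attacks the problem more directly.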

On 11/5/2020 10:40 PM, Theo Diefenthal wrote:
Hi there,

I have a stream where I reload a huge list from time to time. I know there are various side-input patterns, but none of them seemed perfect, so I stuck with an easy approach: I use a Guava Cache, and if it expires and a new element comes in, processing of that element is blocked until the new list is loaded.
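[The original code is not shown; the following is only a minimal sketch of the blocking pattern described above, with loadListFromRemote() and the enrichment logic as hypothetical stand-ins.]

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class EnrichWithHugeListFunction extends RichMapFunction<String, String> {

    private static final String LIST_KEY = "the-list";

    private transient LoadingCache<String, List<String>> cache;

    @Override
    public void open(Configuration parameters) {
        // Single-entry cache: after expiry, the next lookup re-loads the list
        // synchronously, blocking the task thread (and checkpointing) meanwhile.
        cache = CacheBuilder.newBuilder()
                .expireAfterWrite(1, TimeUnit.HOURS)
                .build(new CacheLoader<String, List<String>>() {
                    @Override
                    public List<String> load(String key) {
                        return loadListFromRemote();
                    }
                });
    }

    @Override
    public String map(String value) {
        List<String> list = cache.getUnchecked(LIST_KEY); // may block for the whole download
        return list.contains(value) ? value + "|known" : value + "|unknown";
    }

    // Hypothetical stand-in for the multi-GB download.
    private List<String> loadListFromRemote() {
        return Collections.emptyList();
    }
}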

That approach has been running in production for a while now and works fine, as the cache has a mechanism to reload the list only on a real change. Today, however, the list grew from a few hundred MB to multiple GB, at a time when the network in general was already somewhat congested. One TaskManager needed about 4 minutes to load the list, but after 30 seconds it reported that it had lost its connection to ZooKeeper and thus had no more information about the leading JobManager, leading to a crash loop. That crash-and-restart loop continued for 30 minutes until the list was rolled back and was then loaded successfully again.

Now my question:
* If processing of an element blocks, I understand that it's also not possible to perform checkpoints at that time, but I didn't expect ZooKeeper, heartbeats or other threads of the TaskManager to time out. Was that just a coincidence of the network being congested, or is it something in the design of Flink that a long blocking call can lead to crashes (other than X checkpoints timing out and a configured forced crash occurring as a consequence)? Which threads can be blocked in Flink during a map in a MapFunction?
* For this approach with kind of a cached reload, should I switch to async IO or just put loading of the list in a background thread (see the sketch below)? In my case it's not really important that processing is blocked until the list is loaded. And in the case of async IO: 99.999% of the events would return directly and thus not be async; it's always just a single one triggering the reload of the list, so it doesn't seem to be perfectly suited here?
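[A sketch of the background-thread variant mentioned in the second question: Guava's refreshAfterWrite combined with CacheLoader.asyncReloading keeps returning the old list while the new one loads on a separate executor, so the task thread never blocks on the download. Names are again hypothetical; this is just one possible shape, not a recommendation of background reloading over async IO.]

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BackgroundReloadingList {

    private final ExecutorService reloadPool = Executors.newSingleThreadExecutor();

    // refreshAfterWrite + asyncReloading: once the entry is older than 1h, the next
    // lookup schedules a reload on reloadPool but returns the stale list immediately,
    // so only the very first lookup ever blocks on the download.
    private final LoadingCache<String, List<String>> cache = CacheBuilder.newBuilder()
            .refreshAfterWrite(1, TimeUnit.HOURS)
            .build(CacheLoader.asyncReloading(new CacheLoader<String, List<String>>() {
                @Override
                public List<String> load(String key) {
                    return loadListFromRemote();
                }
            }, reloadPool));

    public List<String> current() {
        return cache.getUnchecked("the-list");
    }

    // Hypothetical stand-in for the multi-GB download.
    private List<String> loadListFromRemote() {
        return Collections.emptyList();
    }
}

The trade-off is that elements are enriched against a slightly stale list while a refresh is in flight, which sounds acceptable here since blocking isn't important in this job.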

I'm running on Flink 1.11 and here's the relevant excerpt from the log:

2020-11-05T09:41:40.933865+01:00 [WARN] Client session timed out, have not heard from server in 33897ms for sessionid 0x374de97ba0afac9
2020-11-05T09:41:40.936488+01:00 [INFO] Client session timed out, have not heard from server in 33897ms for sessionid 0x374de97ba0afac9, closing socket connection and attempting reconnect
2020-11-05T09:41:41.042032+01:00 [INFO] State change: SUSPENDED
2020-11-05T09:41:41.168802+01:00 [WARN] Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2020-11-05T09:41:41.169276+01:00 [WARN] Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2020-11-05T09:41:41.169514+01:00 [INFO] Close ResourceManager connection e4d1f9acca4ea3c5a793877467218452.
2020-11-05T09:41:41.169514+01:00 [INFO] JobManager for job 0dcfb212136daefbcbfe480c6a260261 with leader id 8440610cd5de998bd6c65f3717de42b8 lost leadership.
2020-11-05T09:41:41.185104+01:00 [INFO] Close JobManager connection for job 0dcfb212136daefbcbfe480c6a260261.
2020-11-05T09:41:41.185354+01:00 [INFO] Attempting to fail task externally .... (c596aafa324b152911cb53ab4e6d1cc2).
2020-11-05T09:41:41.187980+01:00 [WARN] ... (c596aafa324b152911cb53ab4e6d1cc2) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: JobManager responsible for 0dcfb212136daefbcbfe480c6a260261 lost the leadership.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1415)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1300(TaskExecutor.java:173)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:1852)
    at java.util.Optional.ifPresent(Optional.java:159)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:1851)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Job leader for job id 0dcfb212136daefbcbfe480c6a260261 lost leadership.
    ... 24 more

Best regards
Theo

