I'm not aware of any significant changes to the HA components between
1.9/1.11.
Would you mind sharing the complete jobmanager/taskmanager logs?

Thank you~

Xintong Song



On Fri, Dec 18, 2020 at 8:53 AM Lu Niu <qqib...@gmail.com> wrote:

> Hi, Xintong
>
> Thanks for replying and your suggestion. I did check the ZK side but there
> is nothing interesting. The error message actually shows that only one TM
> thought JM lost leadership while others ran fine. Also, this happened only
> after we migrated from 1.9 to 1.11. Is it possible this is introduced by
> 1.11?
>
> Best
> Lu
>
> On Wed, Dec 16, 2020 at 5:56 PM Xintong Song <tonysong...@gmail.com>
> wrote:
>
>> Hi Lu,
>>
>> I assume you are using ZooKeeper as the HA service?
>>
>> A common cause of unexpected leadership lost is the instability of HA
>> service. E.g., if ZK does not receive heartbeat from Flink RM for a
>> certain period of time, it will revoke the leadership and notify
>> other components. You can look into the ZooKeeper logs checking why RM's
>> leadership is revoked.
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>>
>> On Thu, Dec 17, 2020 at 8:42 AM Lu Niu <qqib...@gmail.com> wrote:
>>
>>> Hi, Flink users
>>>
>>> Recently we migrated to flink 1.11 and see exceptions like:
>>> ```
>>> 2020-12-15 12:41:01,199 INFO
>>>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
>>> USER_MATERIALIZED_EVENT_SIGNAL-user_context-event ->
>>> USER_MATERIALIZED_EVENT_SIGNAL-user_context-event-as_nrtgtuple (21/60)
>>> (711d1d319691a4b80e30fe6ab7dfab5b) switched from RUNNING to FAILED on
>>> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@50abf386.
>>> java.lang.Exception: Job leader for job id
>>> 47b1531f79ffe3b86bc5910f6071e40c lost leadership.
>>> at
>>> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:1852)
>>> ~[nrtg-1.11_deploy.jar:?]
>>> at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_212-ga]
>>> at
>>> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:1851)
>>> ~[nrtg-1.11_deploy.jar:?]
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
>>> ~[nrtg-1.11_deploy.jar:?]
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
>>> ~[nrtg-1.11_deploy.jar:?]
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>>> ~[nrtg-1.11_deploy.jar:?]
>>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
>>> [nrtg-1.11_deploy.jar:?]
>>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
>>> [nrtg-1.11_deploy.jar:?]
>>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>>> [nrtg-1.11_deploy.jar:?]
>>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>>> [nrtg-1.11_deploy.jar:?]
>>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:539)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:227)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:612)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.actor.ActorCell.invoke(ActorCell.scala:581)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.dispatch.Mailbox.run(Mailbox.scala:229) [nrtg-1.11_deploy.jar:?]
>>> ```
>>>
>>> ```
>>> 2020-12-15 01:01:39,531 INFO
>>>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
>>> USER_MATERIALIZED_EVENT_SIGNAL.user_context.SINK-stream_joiner ->
>>> USER_MATERIALIZED_EVENT_SIGNAL-user_context-SINK-SINKS.realpin (260/360)
>>> (0c1f4495088ec9452c597f46a88a2c8e) switched from RUNNING to FAILED on
>>> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@2362b2fd.
>>> org.apache.flink.util.FlinkException: ResourceManager leader changed to
>>> new address null
>>> at
>>> org.apache.flink.runtime.taskexecutor.TaskExecutor.notifyOfNewResourceManagerLeader(TaskExecutor.java:1093)
>>> ~[nrtg-1.11_deploy.jar:?]
>>> at
>>> org.apache.flink.runtime.taskexecutor.TaskExecutor.access$800(TaskExecutor.java:173)
>>> ~[nrtg-1.11_deploy.jar:?]
>>> at
>>> org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerLeaderListener.lambda$notifyLeaderAddress$0(TaskExecutor.java:1816)
>>> ~[nrtg-1.11_deploy.jar:?]
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
>>> ~[nrtg-1.11_deploy.jar:?]
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
>>> ~[nrtg-1.11_deploy.jar:?]
>>> at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>>> ~[nrtg-1.11_deploy.jar:?]
>>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
>>> [nrtg-1.11_deploy.jar:?]
>>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
>>> [nrtg-1.11_deploy.jar:?]
>>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>>> [nrtg-1.11_deploy.jar:?]
>>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>>> [nrtg-1.11_deploy.jar:?]
>>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:539)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:227)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:612)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.actor.ActorCell.invoke(ActorCell.scala:581)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
>>> [nrtg-1.11_deploy.jar:?]
>>> at akka.dispatch.Mailbox.run(Mailbox.scala:229) [nrtg-1.11_deploy.jar:?]
>>> ```
>>>
>>> This happens a few times per week. It seems like one Task Manager
>>> wrongly thought JobMananger is lost and triggers a full restart of the
>>> whole job. Does anyone know how to resolve such errors? Thanks!
>>>
>>> Best
>>> Lu
>>>
>>

Reply via email to