[jira] [Commented] (FLINK-19022) AkkaRpcActor failed to start but no exception information

2020-09-02 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189286#comment-17189286
 ] 

Till Rohrmann commented on FLINK-19022:
---

I think we should do both things in order to be on the safe side.

> AkkaRpcActor failed to start but no exception information
> ---------------------------------------------------------
>
> Key: FLINK-19022
> URL: https://issues.apache.org/jira/browse/FLINK-19022
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.10.0, 1.12.0, 1.11.1
> Reporter: tartarus
> Assignee: tartarus
> Priority: Critical
> Fix For: 1.12.0, 1.11.2, 1.10.3
>
>
> The JM of my job could not start normally, and the JM container was 
> eventually killed by the RM.
> Through debugging, I found that the AkkaRpcActor failed to start because 
> the YARN version used by my job was incompatible with the version on the 
> cluster.
> [AkkaRpcActor exception 
> handling|https://github.com/apache/flink/blob/478c9657fe1240acdc1eb08ad32ea93e08b0cd5e/flink-runtime/src/main/java/org/apache/flink/runtime/rpc/akka/AkkaRpcActor.java#L550]
> I added log output here and then found the specific problem.
> {code:java}
> 2020-08-21 21:31:16,985 ERROR org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState [flink-akka.actor.default-dispatcher-4]  - Could not start RpcEndpoint resourcemanager.
> java.lang.NoSuchMethodError: org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.registerApplicationMaster(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/yarn/proto/YarnServiceProtos$RegisterApplicationMasterRequestProto;)Lorg/apache/hadoop/yarn/proto/YarnServiceProtos$RegisterApplicationMasterResponseProto;
>   at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:106)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy25.registerApplicationMaster(Unknown Source)
>   at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:222)
>   at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:214)
>   at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.registerApplicationMaster(AMRMClientAsyncImpl.java:138)
>   at org.apache.flink.yarn.YarnResourceManager.createAndStartResourceManagerClient(YarnResourceManager.java:229)
>   at org.apache.flink.yarn.YarnResourceManager.initialize(YarnResourceManager.java:262)
>   at org.apache.flink.runtime.resourcemanager.ResourceManager.startResourceManagerServices(ResourceManager.java:204)
>   at org.apache.flink.runtime.resourcemanager.ResourceManager.onStart(ResourceManager.java:192)
>   at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:185)
>   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:544)
>   at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:169)
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>   at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>   at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>   at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>   at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> 

[jira] [Commented] (FLINK-19022) AkkaRpcActor failed to start but no exception information

2020-09-02 Thread tartarus (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189152#comment-17189152
 ] 

tartarus commented on FLINK-19022:
--

[~trohrmann] Sorry, there is one more question to confirm.
If we catch {{Throwable}} in {{ResourceManager}} and {{Dispatcher}}, they will 
call {{fatalErrorHandler.onFatalError(throwable);}}, which in turn calls 
{{ClusterEntrypoint.onFatalError()}}. In that case, is it still necessary to 
register the {{terminationFuture}} with 
{{DispatcherResourceManagerComponent}}?


[jira] [Commented] (FLINK-19022) AkkaRpcActor failed to start but no exception information

2020-08-31 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187499#comment-17187499
 ] 

Till Rohrmann commented on FLINK-19022:
---

[~tartarus] This is what the underlying {{AkkaRpcActor}} should already do for 
us. So there should be no need to add another {{terminationFuture}}.


[jira] [Commented] (FLINK-19022) AkkaRpcActor failed to start but no exception information

2020-08-30 Thread tartarus (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187201#comment-17187201
 ] 

tartarus commented on FLINK-19022:
--

[~trohrmann] Hello, I have one question to confirm while working on this:

1) {{ResourceManager.getTerminationFuture}} is already implemented in 
{{RpcEndpoint}}:
{code:java}
/**
 * Return a future which is completed with true when the rpc endpoint has been terminated.
 * In case of a failure, this future is completed with the occurring exception.
 *
 * @return Future which is completed when the rpc endpoint has been terminated.
 */
public CompletableFuture<Void> getTerminationFuture() {
    return rpcServer.getTerminationFuture();
}
{code}
How about adding a {{terminationFuture}} to {{ResourceManager}} and registering 
the current {{ResourceManager.getTerminationFuture}} with it? Then, if the 
{{ResourceManager}} fails to start, we call {{completeExceptionally}} on the 
{{terminationFuture}}. Finally, we register this {{terminationFuture}} with the 
{{DispatcherResourceManagerComponent}}'s {{shutDownFuture}}.
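
The proposal above could be sketched roughly as follows. This is only an illustration using plain {{java.util.concurrent}} types; the class {{TerminationFutureSketch}}, its constructor argument, and the {{start}} method are hypothetical stand-ins and not Flink's actual {{ResourceManager}} or {{RpcEndpoint}} code.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for an endpoint that owns an extra termination future. */
final class TerminationFutureSketch {

    // The additional future suggested in the comment (assumed name).
    private final CompletableFuture<Void> terminationFuture = new CompletableFuture<>();

    TerminationFutureSketch(CompletableFuture<Void> rpcServerTermination) {
        // Forward the outcome of the existing (RPC server) termination future.
        rpcServerTermination.whenComplete((ignored, failure) -> {
            if (failure != null) {
                terminationFuture.completeExceptionally(failure);
            } else {
                terminationFuture.complete(null);
            }
        });
    }

    /** Runs the start logic; a failure completes the future exceptionally. */
    void start(Runnable onStart) {
        try {
            onStart.run();
        } catch (Throwable t) {
            terminationFuture.completeExceptionally(t);
        }
    }

    CompletableFuture<Void> getTerminationFuture() {
        return terminationFuture;
    }
}
```

A caller could then attach a callback to {{getTerminationFuture()}} to learn about a failed start, which is the role the comment assigns to the {{shutDownFuture}}.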




[jira] [Commented] (FLINK-19022) AkkaRpcActor failed to start but no exception information

2020-08-26 Thread tartarus (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185142#comment-17185142
 ] 

tartarus commented on FLINK-19022:
--

OK, I will try to complete this issue as soon as possible.


[jira] [Commented] (FLINK-19022) AkkaRpcActor failed to start but no exception information

2020-08-26 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184982#comment-17184982
 ] 

Till Rohrmann commented on FLINK-19022:
---

No, we don't remove the {{FatalErrorHandler}} from the {{Dispatcher}} and the 
{{ResourceManager}}.

In the {{DispatcherResourceManagerComponent}} we should react to the 
termination future of the {{Dispatcher}} and {{ResourceManager}} in such a way 
that we call the fatal error handler if either of them completes while the 
{{DispatcherResourceManagerComponent}} is still running (not closing).
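
A minimal sketch of that behaviour, assuming plain {{CompletableFuture}}s for the two termination futures. The class {{ComponentSketch}}, the {{running}} flag, and the use of {{Consumer<Throwable>}} in place of a {{FatalErrorHandler}} are hypothetical simplifications, not the actual Flink implementation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical stand-in for DispatcherResourceManagerComponent. */
final class ComponentSketch {

    // True until close() is called, i.e. while the component is "running (not closing)".
    private final AtomicBoolean running = new AtomicBoolean(true);

    // Stand-in for the fatal error handler passed to the component.
    private final Consumer<Throwable> fatalErrorHandler;

    ComponentSketch(
            CompletableFuture<Void> dispatcherTermination,
            CompletableFuture<Void> resourceManagerTermination,
            Consumer<Throwable> fatalErrorHandler) {
        this.fatalErrorHandler = fatalErrorHandler;
        watch("Dispatcher", dispatcherTermination);
        watch("ResourceManager", resourceManagerTermination);
    }

    private void watch(String name, CompletableFuture<Void> terminationFuture) {
        terminationFuture.whenComplete((ignored, failure) -> {
            // Any completion (normal or exceptional) is unexpected while still running.
            if (running.get()) {
                fatalErrorHandler.accept(new IllegalStateException(
                        name + " terminated unexpectedly while the component was still running.",
                        failure));
            }
        });
    }

    /** Marks the component as closing, so later terminations are not treated as fatal. */
    void close() {
        running.set(false);
    }
}
```

With this shape, a {{ResourceManager}} whose start fails would surface its exception through the handler instead of failing silently, while an orderly shutdown that calls {{close()}} first would not trigger it.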


[jira] [Commented] (FLINK-19022) AkkaRpcActor failed to start but no exception information

2020-08-25 Thread tartarus (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184921#comment-17184921
 ] 

tartarus commented on FLINK-19022:
--

[~trohrmann] I want to confirm a few details.

We pass a {{FatalErrorHandler}} to the {{DispatcherResourceManagerComponent}}. 
Do we then need to remove the {{FatalErrorHandler}} from {{ResourceManager}} and 
{{Dispatcher}}?

We would then just register the {{TerminationFuture}} of {{ResourceManager}} and 
{{Dispatcher}} with the {{DispatcherResourceManagerComponent}}.


[jira] [Commented] (FLINK-19022) AkkaRpcActor failed to start but no exception information

2020-08-25 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183768#comment-17183768
 ] 

Till Rohrmann commented on FLINK-19022:
---

Yes, for 3) we also need to pass a {{FatalErrorHandler}} to the 
{{DispatcherResourceManagerComponent}}.

And I would say that we add 4) logging to the 
{{AkkaRpcActor.StartedState.terminate}} and {{AkkaRpcActor.StoppedState.start}}.
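
As a rough illustration of point 4), the sketch below logs the failure cause before a start attempt is reported as failed. The names {{StartLoggingSketch}}, {{Endpoint}} and {{start}} are hypothetical stand-ins, not the actual {{AkkaRpcActor}} code, and {{java.util.logging}} is used only to keep the example self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class StartLoggingSketch {

    private static final Logger LOG = Logger.getLogger(StartLoggingSketch.class.getName());

    /** Hypothetical stand-in for an RPC endpoint with a start hook. */
    interface Endpoint {
        void onStart() throws Exception;
    }

    /**
     * Runs the start hook and logs any failure, including Errors,
     * before reporting it to the caller. Returns true if the start succeeded.
     */
    static boolean start(String endpointName, Endpoint endpoint) {
        try {
            endpoint.onStart();
            return true;
        } catch (Throwable t) {
            // Catching Throwable (not just Exception) so that linkage errors
            // such as NoSuchMethodError are logged as well.
            LOG.log(Level.SEVERE, "Could not start RpcEndpoint " + endpointName + ".", t);
            return false;
        }
    }

    public static void main(String[] args) {
        boolean started = start("resourcemanager", () -> {
            throw new NoSuchMethodError("registerApplicationMaster");
        });
        System.out.println("started = " + started);
    }
}
```

The same pattern would apply to the terminate path; whether the error should additionally be rethrown or handed to a fatal error handler is a separate design decision.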


[jira] [Commented] (FLINK-19022) AkkaRpcActor failed to start but no exception information

2020-08-24 Thread tartarus (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183295#comment-17183295
 ] 

tartarus commented on FLINK-19022:
--

[~trohrmann] I agree with you.

If we do this, three questions need to be confirmed:

1) We may need to catch {{Throwable}} in {{ResourceManager}} and {{Dispatcher}};

2) Do we call the {{FatalErrorHandler}} directly in {{ResourceManager}} and {{Dispatcher}}, or 
terminate through {{DispatcherResourceManagerComponent}}?

3) If we terminate through {{DispatcherResourceManagerComponent}}, we need to register 
both {{TerminationFuture}}s with the {{DispatcherResourceManagerComponent}};

Is there anything else that needs attention?


[jira] [Commented] (FLINK-19022) AkkaRpcActor failed to start but no exception information

2020-08-24 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183204#comment-17183204
 ] 

Till Rohrmann commented on FLINK-19022:
---

Well, the idea was that the {{DispatcherResourceManagerComponent}} must somehow 
react if one of the components shuts down unexpectedly. For example, one could 
monitor the termination futures of the {{ResourceManager}} and the 
{{Dispatcher}} and call a yet-to-be-passed-in {{FatalErrorHandler}} if they 
terminate while the {{DispatcherResourceManagerComponent}} is still running 
({{isRunning}}).
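
A minimal sketch of this monitoring idea, using plain {{CompletableFuture}}s. The {{Component}} class and the {{FatalErrorHandler}} interface below are simplified, hypothetical stand-ins for the Flink classes of similar name, not their real implementations.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

public class TerminationMonitorSketch {

    /** Hypothetical stand-in for a fatal error handler. */
    interface FatalErrorHandler {
        void onFatalError(Throwable cause);
    }

    /** Hypothetical stand-in for the component that owns both services. */
    static final class Component {
        private final AtomicBoolean running = new AtomicBoolean(true);

        Component(
                CompletableFuture<Void> resourceManagerTermination,
                CompletableFuture<Void> dispatcherTermination,
                FatalErrorHandler handler) {
            watch("ResourceManager", resourceManagerTermination, handler);
            watch("Dispatcher", dispatcherTermination, handler);
        }

        /** Reports a termination as fatal if the component is still running. */
        private void watch(String name, CompletableFuture<Void> termination, FatalErrorHandler handler) {
            termination.whenComplete((ignored, failure) -> {
                if (running.get()) {
                    handler.onFatalError(
                            new IllegalStateException(name + " terminated unexpectedly.", failure));
                }
            });
        }

        /** Marks an orderly shutdown; later terminations are expected and ignored. */
        void close() {
            running.set(false);
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Void> rmTermination = new CompletableFuture<>();
        CompletableFuture<Void> dispatcherTermination = new CompletableFuture<>();
        AtomicReference<Throwable> reported = new AtomicReference<>();

        new Component(rmTermination, dispatcherTermination, reported::set);

        // Simulate the ResourceManager failing during start-up.
        rmTermination.completeExceptionally(new NoSuchMethodError("registerApplicationMaster"));
        System.out.println(reported.get().getMessage());
    }
}
```

The {{running}} flag is what distinguishes an unexpected termination from an orderly shutdown; without it, every regular stop would also trigger the handler.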


[jira] [Commented] (FLINK-19022) AkkaRpcActor failed to start but no exception information

2020-08-24 Thread tartarus (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183136#comment-17183136
 ] 

tartarus commented on FLINK-19022:
--

[~trohrmann] Do you mean adding a {{TerminationFuture}} for the {{ResourceManager}} 
and registering it with the {{DispatcherResourceManagerComponent}}, like this?
{code:java}
private void registerShutDownFuture() {
   FutureUtils.forward(dispatcherRunner.getShutDownFuture(), shutDownFuture);
}
{code}
 


[jira] [Commented] (FLINK-19022) AkkaRpcActor failed to start but no exception information

2020-08-24 Thread tartarus (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183122#comment-17183122
 ] 

tartarus commented on FLINK-19022:
--

[~trohrmann] thanks for your reply.

Sorry, my description was not very clear.

The RM shut down the container because the failure happened in 
{{AMRMClientAsyncImpl#registerApplicationMaster}}, so the JM had not yet registered with the RM.

[https://github.com/apache/flink/blob/b4705edc841a8cf380d9a12d71551a4d38ec9e31/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L223]

This is because the code here

[https://github.com/apache/flink/blob/b4705edc841a8cf380d9a12d71551a4d38ec9e31/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L280]

only catches {{Exception}}, but my case is an {{Error}} ({{NoSuchMethodError}}).

In the current code, only {{AkkaRpcActor.StoppedState#start}} catches 
the {{Throwable}}, but it does not log it, so we miss the error message.
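
To illustrate why the {{Error}} slips through, here is a small self-contained sketch. The class and method names are hypothetical and the failure is simulated; the only point is that {{NoSuchMethodError}} extends {{Error}}, so a {{catch (Exception e)}} block does not see it.

```java
public class CatchScopeDemo {

    /** Stand-in for a start-up step that fails with a linkage error. */
    static void startService() {
        // NoSuchMethodError extends Error, not Exception.
        throw new NoSuchMethodError("registerApplicationMaster");
    }

    /** Returns which handler saw the failure: "Exception", "Throwable" or "none". */
    static String whichHandlerCatches() {
        try {
            try {
                startService();
                return "none";
            } catch (Exception e) {
                // Not reached for an Error.
                return "Exception";
            }
        } catch (Throwable t) {
            // The Error bypassed the inner handler and arrives here.
            return "Throwable";
        }
    }

    public static void main(String[] args) {
        System.out.println(whichHandlerCatches());
    }
}
```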


[jira] [Commented] (FLINK-19022) AkkaRpcActor failed to start but no exception information

2020-08-24 Thread Till Rohrmann (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183029#comment-17183029
 ] 

Till Rohrmann commented on FLINK-19022:
---

Thanks for reporting this issue [~tartarus]. I agree that this is a problem. 
Could you share the logs with us? I would like to learn why the RM is shutting 
down the container eventually.

Concerning the problem: One problem is that 
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L212
 only catches {{Exception}} instead of {{Throwable}}. The same actually applies 
to 
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L248.

The other problem is, as you've mentioned, that the {{AkkaRpcActor}} does not log 
the failure cause when a start or shutdown attempt fails. I think it 
would be a good improvement to add the logs in these places.

Last but not least, I believe that the {{DispatcherResourceManagerComponent}} 
should also react if either the {{Dispatcher}} or the {{ResourceManager}} 
fails during start-up. One way to do this is to combine the 
termination futures {{Dispatcher.getTerminationFuture}} and 
{{ResourceManager.getTerminationFuture}} into the {{shutDownFuture}} of the 
{{DispatcherResourceManagerComponent}}.
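
One possible shape of that combination, sketched with plain {{CompletableFuture}}s. The {{forward}} and {{combine}} helpers are hypothetical and written only for this example (they are not Flink's {{FutureUtils}}), and {{String}} stands in for whatever result type the real {{shutDownFuture}} carries.

```java
import java.util.concurrent.CompletableFuture;

public class ShutDownFutureSketch {

    /** Completes target with the outcome of source, unless target is already done. */
    static void forward(String name, CompletableFuture<Void> source, CompletableFuture<String> target) {
        source.whenComplete((ignored, failure) -> {
            if (failure != null) {
                target.completeExceptionally(failure);
            } else {
                target.complete(name + " terminated");
            }
        });
    }

    /** Returns a future that completes as soon as either service terminates. */
    static CompletableFuture<String> combine(
            CompletableFuture<Void> dispatcherTermination,
            CompletableFuture<Void> resourceManagerTermination) {
        CompletableFuture<String> shutDownFuture = new CompletableFuture<>();
        forward("Dispatcher", dispatcherTermination, shutDownFuture);
        forward("ResourceManager", resourceManagerTermination, shutDownFuture);
        return shutDownFuture;
    }

    public static void main(String[] args) {
        CompletableFuture<Void> dispatcherTermination = new CompletableFuture<>();
        CompletableFuture<Void> resourceManagerTermination = new CompletableFuture<>();
        CompletableFuture<String> shutDownFuture =
                combine(dispatcherTermination, resourceManagerTermination);

        // Simulate the ResourceManager failing during start-up.
        resourceManagerTermination.completeExceptionally(
                new NoSuchMethodError("registerApplicationMaster"));
        System.out.println("shut down exceptionally: " + shutDownFuture.isCompletedExceptionally());
    }
}
```

Because a {{CompletableFuture}} can only be completed once, whichever service terminates first determines the outcome; later completions are ignored.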

> AkkaRpcActor failed to start but no exception information
> -
>
> Key: FLINK-19022
> URL: https://issues.apache.org/jira/browse/FLINK-19022
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.10.0
> Reporter: tartarus
> Priority: Major
>

[jira] [Commented] (FLINK-19022) AkkaRpcActor failed to start but no exception information

2020-08-21 Thread tartarus (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181911#comment-17181911
 ] 

tartarus commented on FLINK-19022:
--

[~chesnay]  [~trohrmann]  How about adding log printing here to help quickly 
find the problem?

Please assign to me, thanks

> AkkaRpcActor failed to start but no exception information
> -
>
> Key: FLINK-19022
> URL: https://issues.apache.org/jira/browse/FLINK-19022
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.10.0
> Reporter: tartarus
> Priority: Major
>
> In my job, the JM could not start normally, and the JM container was 
> eventually killed by the RM.
> In the end, I found through debugging that AkkaRpcActor failed to start because 
> the YARN version in my job was incompatible with the version in the 
> cluster.
> [AkkaRpcActor exception 
> handling|https://github.com/apache/flink/blob/478c9657fe1240acdc1eb08ad32ea93e08b0cd5e/flink-runtime/src/main/java/org/apache/flink/runtime/rpc/akka/AkkaRpcActor.java#L550]
> I added log printing here, and then found the specific problem.
> {code:java}
> 2020-08-21 21:31:16,985 ERROR 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState 
> [flink-akka.actor.default-dispatcher-4]  - Could not start RpcEndpoint 
> resourcemanager.
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB.registerApplicationMaster(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/yarn/proto/YarnServiceProtos$RegisterApplicationMasterRequestProto;)Lorg/apache/hadoop/yarn/proto/YarnServiceProtos$RegisterApplicationMasterResponseProto;
>   at 
> org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:106)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy25.registerApplicationMaster(Unknown Source)
>   at 
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:222)
>   at 
> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.registerApplicationMaster(AMRMClientImpl.java:214)
>   at 
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl.registerApplicationMaster(AMRMClientAsyncImpl.java:138)
>   at 
> org.apache.flink.yarn.YarnResourceManager.createAndStartResourceManagerClient(YarnResourceManager.java:229)
>   at 
> org.apache.flink.yarn.YarnResourceManager.initialize(YarnResourceManager.java:262)
>   at 
> org.apache.flink.runtime.resourcemanager.ResourceManager.startResourceManagerServices(ResourceManager.java:204)
>   at 
> org.apache.flink.runtime.resourcemanager.ResourceManager.onStart(ResourceManager.java:192)
>   at 
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:185)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:544)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:169)
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>   at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>   at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>   at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>   at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>   at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>   at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>   at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
>