Hi Rohith,

I should have mentioned that we were using CDH5.7.2-2.6, which has those two patches reverted, and that causes the incompatibility. Yes, I have to backport YARN-8310 to fix another issue.
BTW: should we upgrade the NM first, as you mentioned before?

Thanks,
Aihua

On Thu, Feb 7, 2019 at 9:41 PM Rohith Sharma K S <rohithsharm...@apache.org> wrote:

> The above-mentioned JIRAs break compatibility, but those are fixed in 2.6
> itself. The only JIRA I see is YARN-8310, which is fixed in 2.10. Looking
> at the stack trace you mentioned, it doesn't seem related to your issue.
> Maybe try applying the patch and running a job.
> Otherwise, let's create a JIRA and discuss there in detail.
>
> -Rohith Sharma K S
>
> On Thu, 7 Feb 2019 at 22:52, Aihua Xu <aihu...@uber.com.invalid> wrote:
>
>> Hi Rohith,
>>
>> Thanks for your suggestion. I was tracing the issue and found out it's
>> caused by an incompatibility introduced by these two changes; the token
>> formats have changed:
>>
>> YARN-668. Changed
>> NMTokenIdentifier/AMRMTokenIdentifier/ContainerTokenIdentifier to use
>> protobuf object as the payload. Contributed by Junping Du.
>>
>> YARN-2615. Changed
>> ClientToAMTokenIdentifier/RM(Timeline)DelegationTokenIdentifier to use
>> protobuf as payload. Contributed by Junping Du.
>>
>> I was testing the new RM with the old NM.
>>
>> Following up on the order of the YARN upgrade: I checked the HWX blog
>> <https://hortonworks.com/blog/introducing-rolling-upgrades-downgrades-apache-hadoop-yarn-cluster/>
>> about rolling upgrades, and it suggests upgrading the RM first. But you
>> are saying we should upgrade the NM first and the RM second? Can you
>> confirm?
>>
>> Thanks,
>> Aihua
>>
>> On Wed, Feb 6, 2019 at 8:26 PM Rohith Sharma K S <
>> rohithsharm...@apache.org> wrote:
>>
>>> Hi Aihua,
>>>
>>> Could you give more clarity on when the job is submitted: a) before
>>> starting the upgrade, b) after the RM upgrade and before the NM upgrade,
>>> or c) after the YARN upgrade has fully completed?
>>> Typically, the suggested order of upgrade is NMs first and the RM second.
>>>
>>> Regarding the NM warn messages, you might be hitting
>>> https://issues.apache.org/jira/browse/HADOOP-11692.
>>>
>>> Did any subsequent jobs succeed post-upgrade?
>>> -Rohith Sharma K S
>>>
>>> On Thu, 7 Feb 2019 at 03:20, Aihua Xu <aihu...@uber.com.invalid> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm investigating the rolling upgrade process from Hadoop 2.6 to Hadoop
>>>> 2.9.1. I'm trying to upgrade the ResourceManager first and then upgrade
>>>> the NodeManager. When I submit a YARN job, the RM fails with the
>>>> following exception:
>>>>
>>>> Application application_1549408943468_0001 failed 2 times due to Error
>>>> launching appattempt_1549408943468_0001_000002. Got exception:
>>>> java.io.IOException: Failed on local exception: java.io.IOException:
>>>> java.io.EOFException; Host Details : local host is:
>>>> "hadoopbenchaqjm01-sjc1/10.67.2.171"; destination host is:
>>>> "hadoopbencha22-sjc1.prod.uber.internal":8041;
>>>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:805)
>>>> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1439)
>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1349)
>>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
>>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>>>> at com.sun.proxy.$Proxy87.startContainers(Unknown Source)
>>>> at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>>>> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>>>> at com.sun.proxy.$Proxy88.startContainers(Unknown Source)
>>>> at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
>>>> at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:307)
>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>> at java.lang.Thread.run(Thread.java:748)
>>>> Caused by: java.io.IOException: java.io.EOFException
>>>> at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:757)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
>>>> at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:720)
>>>> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:813)
>>>> at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:411)
>>>> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1554)
>>>> at org.apache.hadoop.ipc.Client.call(Client.java:1385)
>>>> ... 20 more
>>>> Caused by: java.io.EOFException
>>>> at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>> at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1798)
>>>> at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:365)
>>>> at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:615)
>>>> at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:411)
>>>> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:800)
>>>> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:796)
>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
>>>> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:795)
>>>> ... 23 more
>>>>
>>>> and the NM with:
>>>>
>>>> 2019-02-06 00:29:20,214 WARN SecurityLogger.org.apache.hadoop.ipc.Server:
>>>> Auth failed for 10.67.2.171:54588:null (DIGEST-MD5: IO error acquiring
>>>> password) with true cause: (null)
>>>>
>>>> I'm wondering if this is a known issue and whether anybody has insight
>>>> into it.
>>>>
>>>> Thanks,
>>>> Aihua
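For intuition on how the YARN-668/YARN-2615 change produces an EOFException like the one in this thread: before those patches, token identifiers were serialized field by field with Writable-style DataOutput calls, while afterwards the whole identifier is a single protobuf blob. The toy sketch below (hypothetical class and field layout, not Hadoop's actual token code) shows an old-style reader running off the end of a protobuf-style payload:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

// Toy sketch of the token-format incompatibility; the class name and
// field layout are hypothetical, not Hadoop's actual TokenIdentifier code.
class TokenFormatMismatch {

    // Pre-YARN-668 style: each identifier field written with DataOutput.
    static byte[] writeOldFormat(String user, int id) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        out.writeUTF(user);  // 2-byte length followed by modified-UTF-8 bytes
        out.writeInt(id);
        return baos.toByteArray();
    }

    // Post-YARN-668 style: the identifier is one opaque protobuf blob.
    // These bytes stand in for a serialized protobuf message
    // (field tag 0x0a, length 4, then the payload bytes).
    static byte[] writeNewFormat() {
        return new byte[] {0x0a, 0x04, 't', 'e', 's', 't'};
    }

    // Old NM-side reader: still parses the identifier field by field.
    static void readOldFormat(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        // Misreads the first two protobuf bytes (0x0a, 0x04) as a UTF
        // length of 2564 and tries to read that many bytes.
        String user = in.readUTF();
        int id = in.readInt();
        System.out.println("parsed user=" + user + " id=" + id);
    }

    public static void main(String[] args) throws IOException {
        // Old writer + old reader: works.
        readOldFormat(writeOldFormat("testuser", 42)); // prints parsed user=testuser id=42

        // New writer + old reader: the reader runs off the end of the buffer.
        try {
            readOldFormat(writeNewFormat());
        } catch (EOFException e) {
            System.out.println("EOFException parsing new-format token");
        }
    }
}
```

The parse failure happens while the old NM rebuilds the token identifier to derive the SASL password, which is consistent with the DIGEST-MD5 "IO error acquiring password" warning quoted above.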