The JIRAs mentioned above are breaking changes, but those are fixed in 2.6 itself. The only other JIRA I see is YARN-8310, which is fixed in 2.10. Judging from the stack trace you posted, it doesn't seem related to your issue. Maybe try applying the patch and running a job. Otherwise, let's create a JIRA and discuss it there in detail.
-Rohith Sharma K S

On Thu, 7 Feb 2019 at 22:52, Aihua Xu <aihu...@uber.com.invalid> wrote:

> Hi Rohith,
>
> Thanks for your suggestion. I was tracing the issue and found out it's
> caused by the incompatibility from these two changes. The tokens have been
> changed.
>
> YARN-668. Changed
> NMTokenIdentifier/AMRMTokenIdentifier/ContainerTokenIdentifier to use
> protobuf object as the payload. Contributed by Junping Du.
>
> YARN-2615. Changed
> ClientToAMTokenIdentifier/RM(Timeline)DelegationTokenIdentifier to use
> protobuf as payload. Contributed by Junping Du
>
> I was testing new RM with old NM.
>
> Followup on the the order of Yarn upgrade. I checked the HWX blog
> <https://hortonworks.com/blog/introducing-rolling-upgrades-downgrades-apache-hadoop-yarn-cluster/>
> about rolling upgrade and it's suggesting to upgrade RM first. But you are
> saying we should NM first and RM second? Can you confirm?
>
> Thanks,
> Aihua
>
> On Wed, Feb 6, 2019 at 8:26 PM Rohith Sharma K S <
> rohithsharm...@apache.org> wrote:
>
>> Hi Aihua,
>>
>> Could you give more clarity on when job is submitted like a) before
>> starting upgrade b) after RM upgrade and before NM upgrade c) after YARN
>> upgrade fully?
>> Typically, order of upgrade suggested is NM's first and RM second.
>>
>> Reg the NM warn messages you might be hitting
>> https://issues.apache.org/jira/browse/HADOOP-11692.
>>
>> Doesn't any subsequent jobs succeeded post upgrade?
>> -Rohith Sharma K S
>>
>> On Thu, 7 Feb 2019 at 03:20, Aihua Xu <aihu...@uber.com.invalid> wrote:
>>
>>> Hi all,
>>>
>>> I'm investigating the rolling upgrade process from Hadoop 2.6 to Hadoop
>>> 2.9.1. I'm trying to upgrade ResourceManager first and then try to upgrade
>>> NodeManager. When I submit a yarn job, RM fails with the following
>>> exception:
>>>
>>> Application application_1549408943468_0001 failed 2 times due to Error
>>> launching appattempt_1549408943468_0001_000002. Got exception:
>>> java.io.IOException: Failed on local exception: java.io.IOException:
>>> java.io.EOFException; Host Details : local host is:
>>> "hadoopbenchaqjm01-sjc1/10.67.2.171"; destination host is:
>>> "hadoopbencha22-sjc1.prod.uber.internal":8041;
>>>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:805)
>>>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1439)
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1349)
>>>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
>>>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>>>     at com.sun.proxy.$Proxy87.startContainers(Unknown Source)
>>>     at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>     at java.lang.reflect.Method.invoke(Method.java:498)
>>>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>>>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>>>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>>>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>>>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>>>     at com.sun.proxy.$Proxy88.startContainers(Unknown Source)
>>>     at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
>>>     at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:307)
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> Caused by: java.io.IOException: java.io.EOFException
>>>     at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:757)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
>>>     at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:720)
>>>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:813)
>>>     at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:411)
>>>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1554)
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1385)
>>>     ... 20 more
>>> Caused by: java.io.EOFException
>>>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>     at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1798)
>>>     at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:365)
>>>     at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:615)
>>>     at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:411)
>>>     at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:800)
>>>     at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:796)
>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
>>>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:795)
>>>     ... 23 more
>>>
>>> and NM with
>>>
>>> 2019-02-06 00:29:20,214 WARN SecurityLogger.org.apache.hadoop.ipc.Server:
>>> Auth failed for 10.67.2.171:54588:null (DIGEST-MD5: IO error acquiring
>>> password) with true cause: (null)
>>>
>>> I'm wondering if it's a known issue and anybody has an insight for it.
>>>
>>> Thanks,
>>> Aihua