Re: Question about Yarn rolling upgrade

Rohith Sharma K S Thu, 07 Feb 2019 21:42:07 -0800

The JIRAs mentioned above did break compatibility, but those breaks were
fixed in 2.6 itself. The only other JIRA I see is YARN-8310, which is fixed
in 2.10. Judging from the stack trace you posted, it doesn't seem related
to your issue. Maybe try applying the patch and running a job.
Otherwise, let's create a JIRA and discuss it there in detail.


-Rohith Sharma K S

On Thu, 7 Feb 2019 at 22:52, Aihua Xu <aihu...@uber.com.invalid> wrote:

> Hi Rohith,
>
> Thanks for your suggestion. I was tracing the issue and found out it's
> caused by the incompatibility from these two changes. The tokens have been
> changed.
>
> YARN-668. Changed 
> NMTokenIdentifier/AMRMTokenIdentifier/ContainerTokenIdentifier to use 
> protobuf object as the payload. Contributed by Junping Du.
>
> YARN-2615. Changed 
> ClientToAMTokenIdentifier/RM(Timeline)DelegationTokenIdentifier to use 
> protobuf as payload. Contributed by Junping Du
>
>
> I was testing new RM with old NM.
>
> Following up on the order of the YARN upgrade: I checked the HWX blog
> <https://hortonworks.com/blog/introducing-rolling-upgrades-downgrades-apache-hadoop-yarn-cluster/>
> about rolling upgrades, and it suggests upgrading the RM first. But you
> are saying we should upgrade the NMs first and the RM second? Can you
> confirm?
>
> Thanks,
> Aihua
>
>
>
> On Wed, Feb 6, 2019 at 8:26 PM Rohith Sharma K S <
> rohithsharm...@apache.org> wrote:
>
>> Hi Aihua,
>>
>> Could you clarify when the job was submitted: a) before starting the
>> upgrade, b) after the RM upgrade and before the NM upgrade, or c) after
>> the full YARN upgrade?
>> Typically, the suggested upgrade order is NMs first and RM second.
>>
>> Regarding the NM warning messages, you might be hitting
>> https://issues.apache.org/jira/browse/HADOOP-11692.
>>
>> Do any subsequent jobs succeed after the upgrade?
>> -Rohith Sharma K S
>>
>> On Thu, 7 Feb 2019 at 03:20, Aihua Xu <aihu...@uber.com.invalid> wrote:
>>
>>> Hi all,
>>>
>>> I'm investigating the rolling upgrade process from Hadoop 2.6 to Hadoop
>>> 2.9.1. I'm upgrading the ResourceManager first and then the NodeManagers.
>>> When I submit a YARN job, the RM fails with the following exception:
>>>
>>>  Application application_1549408943468_0001 failed 2 times due to Error 
>>> launching appattempt_1549408943468_0001_000002. Got exception: 
>>> java.io.IOException: Failed on local exception: java.io.IOException: 
>>> java.io.EOFException; Host Details : local host is: 
>>> "hadoopbenchaqjm01-sjc1/10.67.2.171"; destination host is: 
>>> "hadoopbencha22-sjc1.prod.uber.internal":8041;
>>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:805)
>>> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1497)
>>> at org.apache.hadoop.ipc.Client.call(Client.java:1439)
>>> at org.apache.hadoop.ipc.Client.call(Client.java:1349)
>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
>>> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>>> at com.sun.proxy.$Proxy87.startContainers(Unknown Source)
>>> at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:128)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>>> at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>>> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>>> at com.sun.proxy.$Proxy88.startContainers(Unknown Source)
>>> at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:122)
>>> at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:307)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:748)
>>> Caused by: java.io.IOException: java.io.EOFException
>>> at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:757)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
>>> at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:720)
>>> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:813)
>>> at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:411)
>>> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1554)
>>> at org.apache.hadoop.ipc.Client.call(Client.java:1385)
>>> ... 20 more
>>> Caused by: java.io.EOFException
>>> at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>> at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1798)
>>> at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:365)
>>> at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:615)
>>> at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:411)
>>> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:800)
>>> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:796)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at javax.security.auth.Subject.doAs(Subject.java:422)
>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)
>>> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:795)
>>> ... 23 more
>>>
>>>
>>> and NM with
>>>
>>> 2019-02-06 00:29:20,214 WARN SecurityLogger.org.apache.hadoop.ipc.Server: 
>>> Auth failed for 10.67.2.171:54588:null (DIGEST-MD5: IO error acquiring 
>>> password) with true cause: (null)
>>>
>>>
>>> I'm wondering whether this is a known issue and whether anybody has any
>>> insight into it.
>>>
>>> Thanks,
>>> Aihua
>>>
>>>
>>>
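
The failure mode discussed in this thread can be illustrated outside
Hadoop. Aihua attributes the error to the token identifier payload changing
format (YARN-668, YARN-2615) while a new RM talks to an old NM. The Python
sketch below is not Hadoop code and does not reproduce Hadoop's wire
formats; the two layouts and all function names (write_legacy,
write_tagged, read_legacy, read_exact) are hypothetical stand-ins. Under
those assumptions, it only shows how a reader that expects one binary
layout can run off the end of a payload written in another layout and
report an end-of-file error.

```python
import io
import struct


def write_legacy(user: str, expiry: int, key_id: int) -> bytes:
    """Hypothetical legacy layout: 2-byte length, UTF-8 user, 8-byte expiry, 4-byte key id."""
    raw = user.encode("utf-8")
    return struct.pack(">H", len(raw)) + raw + struct.pack(">qi", expiry, key_id)


def write_tagged(user: str) -> bytes:
    """Hypothetical newer layout: 1-byte field tag, 1-byte length, UTF-8 user."""
    raw = user.encode("utf-8")
    return bytes([0x0A, len(raw)]) + raw


def read_exact(stream: io.BytesIO, n: int) -> bytes:
    """Read exactly n bytes or raise EOFError, like a strict binary reader."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data


def read_legacy(payload: bytes):
    """Decode the hypothetical legacy layout; fails on any other layout."""
    stream = io.BytesIO(payload)
    (length,) = struct.unpack(">H", read_exact(stream, 2))
    user = read_exact(stream, length).decode("utf-8")
    expiry, key_id = struct.unpack(">qi", read_exact(stream, 12))
    return user, expiry, key_id


if __name__ == "__main__":
    # Same layout on both sides: the payload decodes cleanly.
    print(read_legacy(write_legacy("alice", 42, 7)))
    # Mismatched layouts: the tag and length bytes are misread as a
    # 2-byte length, so the reader asks for more bytes than exist.
    try:
        read_legacy(write_tagged("alice"))
    except EOFError as err:
        print("legacy reader failed:", err)
```

In this sketch read_legacy round-trips its own layout, but when given the
tagged payload it treats the first two bytes as a length and raises
EOFError before it can decode any field.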
