[ https://issues.apache.org/jira/browse/YARN-10791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Song Jiacheng updated YARN-10791:
---------------------------------
    Summary: Graceful decomission cause NPE during rolling upgrade from 2.6 to 3.2  (was: Graceful decomission cause NPE during Rolling upgrade from 2.6 to 3.2 )

> Graceful decomission cause NPE during rolling upgrade from 2.6 to 3.2
> ----------------------------------------------------------------------
>
>                 Key: YARN-10791
>                 URL: https://issues.apache.org/jira/browse/YARN-10791
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: RM
>    Affects Versions: 3.2.1
>            Reporter: Song Jiacheng
>            Priority: Minor
>         Attachments: YARN-10791.v1.patch, image-2021-05-31-10-32-17-541.png, image-2021-05-31-10-37-31-795.png
>
>
> We are doing a rolling upgrade of YARN from 2.6.0 to 3.2.1, and we hit this exception while upgrading the NMs.
> When we exclude a node and run a graceful refreshNodes, all the MR AMs fail.
> 2021-05-28 11:36:35,790 ERROR [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: ERROR IN CONTACTING RM.
> java.lang.NullPointerException
>     at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleUpdatedNodes(RMContainerAllocator.java:883)
>     at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:821)
>     at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:316)
>     at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:282)
>     at java.lang.Thread.run(Thread.java:745)
> The cause is that we gracefully decommission nodes while jobs are still running the 2.6 MR AM.
> handleUpdatedNodes in 2.6 MR does not recognize the node state DECOMMISSIONING.
> So I added a config that decides whether DECOMMISSIONING should be sent to the AMs.
> I am not sure whether this needs to be fixed; I am just proposing a solution for this situation.
> !image-2021-05-31-10-32-17-541.png!
> There are 2 nodes in the cluster. The AM runs on node 44. I excluded node 46, the other node in the cluster, and then ran refreshNodes, at which point the error above occurred.
> As I said, I think the root cause is the compatibility of NodeStateProto.
> !image-2021-05-31-10-37-31-795.png!
> 2.6 MR cannot recognize DECOMMISSIONING and SHUTDOWN.
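
The workaround described in the report (a switch that decides whether DECOMMISSIONING is sent to AMs) can be sketched roughly as below. This is only an illustration under stated assumptions, not the attached patch: `NodeState`, `NodeUpdate`, `UpdatedNodeFilter` and the `sendNewStates` flag are hypothetical stand-ins rather than Hadoop classes or real configuration keys, and also filtering SHUTDOWN is an assumption based on the remark that 2.6 MR cannot recognize that state either.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the YARN node states; not the real Hadoop enum.
enum NodeState {
    NEW, RUNNING, UNHEALTHY, DECOMMISSIONED, LOST, REBOOTED, DECOMMISSIONING, SHUTDOWN
}

// Hypothetical stand-in for a node report delivered to an AM in a heartbeat response.
final class NodeUpdate {
    final String host;
    final NodeState state;

    NodeUpdate(String host, NodeState state) {
        this.host = host;
        this.state = state;
    }
}

final class UpdatedNodeFilter {
    // States that, per the report, a 2.6 MR AM cannot recognize.
    static final Set<NodeState> UNKNOWN_TO_OLD_AMS =
        EnumSet.of(NodeState.DECOMMISSIONING, NodeState.SHUTDOWN);

    // sendNewStates models the proposed config switch (hypothetical name).
    // When it is false, updates carrying the newer states are dropped
    // before they would be handed to an AM.
    static List<NodeUpdate> filter(List<NodeUpdate> updates, boolean sendNewStates) {
        if (sendNewStates) {
            return new ArrayList<>(updates);
        }
        List<NodeUpdate> kept = new ArrayList<>();
        for (NodeUpdate update : updates) {
            if (!UNKNOWN_TO_OLD_AMS.contains(update.state)) {
                kept.add(update);
            }
        }
        return kept;
    }
}

class DecommissionFilterDemo {
    public static void main(String[] args) {
        List<NodeUpdate> updates = Arrays.asList(
            new NodeUpdate("node44", NodeState.RUNNING),
            new NodeUpdate("node46", NodeState.DECOMMISSIONING));
        // With the switch off, only the RUNNING update reaches the AM.
        for (NodeUpdate update : UpdatedNodeFilter.filter(updates, false)) {
            System.out.println(update.host + " " + update.state);
        }
    }
}
```

During the rolling upgrade the switch would stay off while 2.6 MR AMs are still running, and could be turned on once every AM understands the newer states.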