[ https://issues.apache.org/jira/browse/AMBARI-13396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958262#comment-14958262 ]
Hudson commented on AMBARI-13396:
---------------------------------

FAILURE: Integrated in Ambari-trunk-Commit #3645 (See [https://builds.apache.org/job/Ambari-trunk-Commit/3645/])
AMBARI-13396: RU: Handle Namenode being down scenarios (jluniya) (jluniya: [http://git-wip-us.apache.org/repos/asf?p=ambari.git&a=commit&h=c00908495953e7c725bd49b7a124883d12621324])
* ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/utils.py
* ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/params_linux.py

> RU: Handle Namenode being down scenarios
> ----------------------------------------
>
>                 Key: AMBARI-13396
>                 URL: https://issues.apache.org/jira/browse/AMBARI-13396
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.1.2
>            Reporter: Jayush Luniya
>            Assignee: Jayush Luniya
>             Fix For: 2.1.3
>
>         Attachments: AMBARI-13396.patch
>
>
> There are two scenarios that need to be handled during RU.
>
> *Setup:*
> * host1: namenode1, host2: namenode2
> * namenode1 on host1 is down
>
> *Scenario 1: During RU, namenode1 on host1 is upgraded before namenode2 on host2*
> Since namenode1 on host1 is already down, namenode2 is the active namenode. The logic should therefore simply restart namenode1, as namenode2 will remain active.
>
> *Scenario 2: During RU, namenode2 on host2 is upgraded before namenode1 on host1*
> Since namenode2 on host2 is active, we should fail, because there is no other namenode instance that can become active. However, today we do the following:
> # Call "hdfs haadmin -failover nn2 nn1", which fails since nn1 is not healthy.
> # When this command fails, we kill ZKFC on this host and then wait for this instance to come back as standby, which will never happen because this instance will come back as active.
> We should simply fail when the "haadmin failover" command fails instead of killing ZKFC.
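A rough sketch of the intended flow, for illustration only (this is not the committed patch; the function name prepare_namenode_for_upgrade, its parameters, and the active-state check are assumptions layered on the resource_management call/Fail/Logger APIs already used by the HDFS scripts):

{code}
# Illustrative sketch, not the actual utils.py change.
from resource_management.core.shell import call
from resource_management.core.exceptions import Fail
from resource_management.core.logger import Logger

def prepare_namenode_for_upgrade(this_nn_id, other_nn_id, hdfs_user='hdfs'):
  # Scenario 1: if this NameNode is not the active one (e.g. it is already
  # down), the other NameNode is or will remain active, so no failover is
  # needed; the caller can simply restart this instance.
  code, out = call("hdfs haadmin -getServiceState {0}".format(this_nn_id),
                   user=hdfs_user, logoutput=True)
  if code != 0 or "active" not in (out or ""):
    Logger.info("NameNode {0} is not active; skipping failover and "
                "proceeding with a plain restart.".format(this_nn_id))
    return

  # Scenario 2: this NameNode is active. Attempt a graceful failover to the
  # other instance; if that fails (e.g. the other NameNode is down and cannot
  # be a failover target), abort the upgrade step immediately instead of
  # killing ZKFC and waiting for a standby state that will never be reached.
  code, out = call("hdfs haadmin -failover {0} {1}".format(this_nn_id, other_nn_id),
                   user=hdfs_user, logoutput=True)
  if code != 0:
    raise Fail("Rolling Upgrade - failover from {0} to {1} returned {2}; "
               "failing instead of stopping ZKFC.".format(this_nn_id, other_nn_id, code))
{code}

With this shape, scenario 1 becomes a plain restart of the already-down namenode1, and scenario 2 surfaces the haadmin failure immediately rather than leaving ZKFC stopped and the upgrade stuck in the standby wait loop shown in the log below.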
> {noformat}
> 2015-10-12 22:35:15,307 - Rolling Upgrade - Initiating a ZKFC failover on active NameNode host jay-ams-2.c.pramod-thangali.internal.
> 2015-10-12 22:35:15,308 - call['hdfs haadmin -failover nn2 nn1'] {'logoutput': True, 'user': 'hdfs'}
> Operation failed: NameNode at jay-ams-1.c.pramod-thangali.internal/10.240.0.178:8020 is not currently healthy. Cannot be failover target
> 	at org.apache.hadoop.ha.ZKFailoverController.checkEligibleForFailover(ZKFailoverController.java:698)
> 	at org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:632)
> 	at org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:61)
> 	at org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:604)
> 	at org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:601)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 	at org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:601)
> 	at org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)
> 	at org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)
> 	at org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> 2015-10-12 22:35:17,748 - call returned (255, 'Operation failed: NameNode at jay-ams-1.c.pramod-thangali.internal/10.240.0.178:8020 is not currently healthy. Cannot be failover target\n\tat org.apache.hadoop.ha.ZKFailoverController.checkEligibleForFailover(ZKFailoverController.java:698)\n\tat org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:632)\n\tat org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:61)\n\tat org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:604)\n\tat org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:601)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:601)\n\tat org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)\n\tat org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)\n\tat org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)')
> 2015-10-12 22:35:17,748 - Rolling Upgrade - failover command returned 255
> 2015-10-12 22:35:17,749 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ls /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid > /dev/null 2>&1 && ps -p `cat /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid` > /dev/null 2>&1''] {}
> 2015-10-12 22:35:17,777 - call returned (0, '')
> 2015-10-12 22:35:17,778 - Execute['kill -15 `cat /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid`'] {'user': 'hdfs'}
> 2015-10-12 22:35:17,803 - File['/var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid'] {'action': ['delete']}
> 2015-10-12 22:35:17,803 - Deleting File['/var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid']
> 2015-10-12 22:35:17,803 - call['hdfs haadmin -getServiceState nn2 | grep standby'] {'logoutput': True, 'user': 'hdfs'}
> 2015-10-12 22:35:20,922 - call returned (1, '')
> 2015-10-12 22:35:20,923 - Rolling Upgrade - check for standby returned 1
> 2015-10-12 22:35:20,923 - Waiting for this NameNode to become the standby one.
> 2015-10-12 22:35:20,923 - Execute['hdfs haadmin -getServiceState nn2 | grep standby'] {'logoutput': True, 'tries': 50, 'user': 'hdfs', 'try_sleep': 6}
> 2015-10-12 22:35:23,135 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:35:31,388 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:35:39,709 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:35:47,992 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:35:56,289 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:36:04,627 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)