[ https://issues.apache.org/jira/browse/AMBARI-13396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958262#comment-14958262 ]
Hudson commented on AMBARI-13396:
---------------------------------

FAILURE: Integrated in Ambari-trunk-Commit #3645 (See [https://builds.apache.org/job/Ambari-trunk-Commit/3645/])
AMBARI-13396: RU: Handle Namenode being down scenarios (jluniya) (jluniya: [http://git-wip-us.apache.org/repos/asf?p=ambari.git&a=commit&h=c00908495953e7c725bd49b7a124883d12621324])
* ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/utils.py
* ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/scripts/params_linux.py

> RU: Handle Namenode being down scenarios
> ----------------------------------------
>
>                 Key: AMBARI-13396
>                 URL: https://issues.apache.org/jira/browse/AMBARI-13396
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.1.2
>            Reporter: Jayush Luniya
>            Assignee: Jayush Luniya
>             Fix For: 2.1.3
>
>         Attachments: AMBARI-13396.patch
>
>
> There are two scenarios that need to be handled during RU.
>
> *Setup:*
> * host1: namenode1, host2: namenode2
> * namenode1 on host1 is down
>
> *Scenario 1: During RU, namenode1 on host1 is upgraded before namenode2 on host2*
> Since namenode1 on host1 is already down, namenode2 is the active namenode. The logic should therefore simply restart namenode1, as namenode2 will remain active.
>
> *Scenario 2: During RU, namenode2 on host2 is upgraded before namenode1 on host1*
> Since namenode2 on host2 is active, we should fail, because there is no other namenode instance that can become active. However, today we do the following:
> # Call "hdfs haadmin -failover nn2 nn1", which fails since nn1 is not healthy.
> # When this command fails, we kill ZKFC on this host and then wait for this instance to come back as standby, which will never happen because this instance will come back as active.
> We should simply fail when the "haadmin failover" command fails instead of killing ZKFC.
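A rough sketch of the intended flow, for illustration only (this is not the committed patch; the function name prepare_namenode_for_upgrade, its parameters, and the active-state check are assumptions layered on the resource_management call/Fail/Logger APIs already used by the HDFS scripts):

{code}
# Illustrative sketch, not the actual utils.py change.
from resource_management.core.shell import call
from resource_management.core.exceptions import Fail
from resource_management.core.logger import Logger

def prepare_namenode_for_upgrade(this_nn_id, other_nn_id, hdfs_user='hdfs'):
  # Scenario 1: if this NameNode is not the active one (e.g. it is already
  # down), the other NameNode is or will remain active, so no failover is
  # needed; the caller can simply restart this instance.
  code, out = call("hdfs haadmin -getServiceState {0}".format(this_nn_id),
                   user=hdfs_user, logoutput=True)
  if code != 0 or "active" not in (out or ""):
    Logger.info("NameNode {0} is not active; skipping failover and "
                "proceeding with a plain restart.".format(this_nn_id))
    return

  # Scenario 2: this NameNode is active. Attempt a graceful failover to the
  # other instance; if that fails (e.g. the other NameNode is down and cannot
  # be a failover target), abort the upgrade step immediately instead of
  # killing ZKFC and waiting for a standby state that will never be reached.
  code, out = call("hdfs haadmin -failover {0} {1}".format(this_nn_id, other_nn_id),
                   user=hdfs_user, logoutput=True)
  if code != 0:
    raise Fail("Rolling Upgrade - failover from {0} to {1} returned {2}; "
               "failing instead of stopping ZKFC.".format(this_nn_id, other_nn_id, code))
{code}

With this shape, scenario 1 becomes a plain restart of the already-down namenode1, and scenario 2 surfaces the haadmin failure immediately rather than leaving ZKFC stopped and the upgrade stuck in the standby wait loop shown in the log below.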
> {noformat}
> 2015-10-12 22:35:15,307 - Rolling Upgrade - Initiating a ZKFC failover on active NameNode host jay-ams-2.c.pramod-thangali.internal.
> 2015-10-12 22:35:15,308 - call['hdfs haadmin -failover nn2 nn1'] {'logoutput': True, 'user': 'hdfs'}
> Operation failed: NameNode at jay-ams-1.c.pramod-thangali.internal/10.240.0.178:8020 is not currently healthy. Cannot be failover target
> 	at org.apache.hadoop.ha.ZKFailoverController.checkEligibleForFailover(ZKFailoverController.java:698)
> 	at org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:632)
> 	at org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:61)
> 	at org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:604)
> 	at org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:601)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 	at org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:601)
> 	at org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)
> 	at org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)
> 	at org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
> 	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
> 2015-10-12 22:35:17,748 - call returned (255, 'Operation failed: NameNode at jay-ams-1.c.pramod-thangali.internal/10.240.0.178:8020 is not currently healthy. Cannot be failover target\n\tat org.apache.hadoop.ha.ZKFailoverController.checkEligibleForFailover(ZKFailoverController.java:698)\n\tat org.apache.hadoop.ha.ZKFailoverController.doGracefulFailover(ZKFailoverController.java:632)\n\tat org.apache.hadoop.ha.ZKFailoverController.access$400(ZKFailoverController.java:61)\n\tat org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:604)\n\tat org.apache.hadoop.ha.ZKFailoverController$3.run(ZKFailoverController.java:601)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat org.apache.hadoop.ha.ZKFailoverController.gracefulFailoverToYou(ZKFailoverController.java:601)\n\tat org.apache.hadoop.ha.ZKFCRpcServer.gracefulFailover(ZKFCRpcServer.java:94)\n\tat org.apache.hadoop.ha.protocolPB.ZKFCProtocolServerSideTranslatorPB.gracefulFailover(ZKFCProtocolServerSideTranslatorPB.java:61)\n\tat org.apache.hadoop.ha.proto.ZKFCProtocolProtos$ZKFCProtocolService$2.callBlockingMethod(ZKFCProtocolProtos.java:1548)\n\tat org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)\n\tat org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)\n\tat org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)\n\tat org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)')
> 2015-10-12 22:35:17,748 - Rolling Upgrade - failover command returned 255
> 2015-10-12 22:35:17,749 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'ls /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid > /dev/null 2>&1 && ps -p `cat /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid` > /dev/null 2>&1''] {}
> 2015-10-12 22:35:17,777 - call returned (0, '')
> 2015-10-12 22:35:17,778 - Execute['kill -15 `cat /var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid`'] {'user': 'hdfs'}
> 2015-10-12 22:35:17,803 - File['/var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid'] {'action': ['delete']}
> 2015-10-12 22:35:17,803 - Deleting File['/var/run/hadoop/hdfs/hadoop-hdfs-zkfc.pid']
> 2015-10-12 22:35:17,803 - call['hdfs haadmin -getServiceState nn2 | grep standby'] {'logoutput': True, 'user': 'hdfs'}
> 2015-10-12 22:35:20,922 - call returned (1, '')
> 2015-10-12 22:35:20,923 - Rolling Upgrade - check for standby returned 1
> 2015-10-12 22:35:20,923 - Waiting for this NameNode to become the standby one.
> 2015-10-12 22:35:20,923 - Execute['hdfs haadmin -getServiceState nn2 | grep standby'] {'logoutput': True, 'tries': 50, 'user': 'hdfs', 'try_sleep': 6}
> 2015-10-12 22:35:23,135 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:35:31,388 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:35:39,709 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:35:47,992 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:35:56,289 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> 2015-10-12 22:36:04,627 - Retrying after 6 seconds. Reason: Execution of 'hdfs haadmin -getServiceState nn2 | grep standby' returned 1.
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)