Thanks Bob. That works fine, and I was able to resolve the issue.
I filed the bug https://issues.apache.org/jira/browse/AMBARI-12893. I can fix this and provide a patch. Could you point me to the build instructions wiki for Ambari?

On Wed, Aug 26, 2015 at 6:35 AM, Robert Nettleton <[email protected]> wrote:

> Hi Anand,
>
> I just tried out a simple HDFS HA deployment (with Ambari 2.1.0), using
> the HOSTGROUP syntax for these two properties, and it failed as I expected.
>
> I’m not sure why “dfs_ha_initial_namenode_active” includes the FQDN. I
> suspect that some other problem is causing this.
>
> As I mentioned before, these two properties are not currently meant for
> %HOSTGROUP% substitution, so the fix is to specify the FQDNs within these
> properties.
>
> If you are concerned about including hostnames in your Blueprint, for
> portability reasons, you can always set these properties in the cluster
> creation template instead.
>
> If you don’t need to select the initial state of the namenodes in your
> cluster, you can simply remove these properties from your Blueprint, and
> the Blueprint processor will select an “active” and a “standby” namenode.
>
> If it still appears to you that the property is being set by the
> Blueprints processor, please feel free to file a JIRA to track the
> investigation.
>
> Hope this helps!
>
> Thanks,
> Bob
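For readers following along, here is a minimal sketch of the workaround Bob describes, assuming a separate cluster creation template file: the Blueprint stays host-agnostic, and the two initial-state properties are pinned to FQDNs in the template. The blueprint name, host groups, and FQDNs below are placeholders, not values from this thread.

    # cluster_template.json: the two hadoop-env properties are set to concrete
    # FQDNs here (no %HOSTGROUP% tokens), keeping the Blueprint itself portable
    cat > cluster_template.json <<'EOF'
    {
      "blueprint": "hdfs-ha-blueprint",
      "default_password": "changeme",
      "configurations": [
        {
          "hadoop-env": {
            "dfs_ha_initial_namenode_active": "nn1.example.com",
            "dfs_ha_initial_namenode_standby": "nn2.example.com"
          }
        }
      ],
      "host_groups": [
        { "name": "host_group_master_1", "hosts": [ { "fqdn": "nn1.example.com" } ] },
        { "name": "host_group_master_2", "hosts": [ { "fqdn": "nn2.example.com" } ] }
      ]
    }
    EOF

This keeps hostnames out of the reusable Blueprint while still controlling which NameNode starts out active.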
> On Aug 26, 2015, at 2:29 AM, Anandha L Ranganathan <[email protected]> wrote:
>
> > + dev group.
> >
> > This is what I found in /var/lib/ambari-agent/data/command-#.json on one
> > of the master hosts. Here you can see that the active namenode has been
> > substituted with the FQDN, but the standby has not. Is this a bug in this
> > Ambari version?
> >
> > I am using *Ambari 2.1*.
> >
> > hadoop-env{
> >     "dfs_ha_initial_namenode_active": "usw2ha3dpma01.local",
> >     "hadoop_root_logger": "INFO,RFA",
> >     "dfs_ha_initial_namenode_standby": "%HOSTGROUP::host_group_master_2%",
> >     "namenode_opt_permsize": "128m"
> > }
> >
> > Thanks
> > Anand
> >
> > On Tue, Aug 25, 2015 at 11:23 AM Anandha L Ranganathan <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I am trying to install NameNode HA using blueprints.
> >> During cluster creation through the scripts, the following steps complete:
> >>
> >> 1) The journal nodes are started and initialized (the journal nodes are
> >> formatted).
> >> 2) The HA state is initialized in ZooKeeper (ZKFC), on both the active
> >> and standby namenodes.
> >>
> >> At 96% it fails. I logged into the cluster through the UI and restarted
> >> the standby namenode, but it threw an exception saying that the NameNode
> >> was not formatted.
> >> I had to manually copy the fsimage by running "hdfs namenode
> >> -bootstrapStandby -force" on the standby NN server; after restarting,
> >> the namenode comes up fine and goes into standby mode.
> >>
> >> Is there something I am missing in the configuration?
> >> My NameNode HA blueprint looks like this:
> >>
> >> hadoop-env{
> >>     "dfs_ha_initial_namenode_active": "%HOSTGROUP::host_group_master_1%"
> >>     "dfs_ha_initial_namenode_standby": "%HOSTGROUP::host_group_master_2"
> >> }
> >>
> >> hdfs-site{
> >>     "dfs.client.failover.proxy.provider.dfs-nameservices": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
> >>     "dfs.ha.automatic-failover.enabled": "true",
> >>     "dfs.ha.fencing.methods": "shell(/bin/true)",
> >>     "dfs.ha.namenodes.dfs-nameservices": "nn1,nn2",
> >>     "dfs.namenode.http-address.dfs-nameservices.nn1": "%HOSTGROUP::host_group_master_1%:50070",
> >>     "dfs.namenode.http-address.dfs-nameservices.nn2": "%HOSTGROUP::host_group_master_2%:50070",
> >>     "dfs.namenode.https-address.dfs-nameservices.nn1": "%HOSTGROUP::host_group_master_1%:50470",
> >>     "dfs.namenode.https-address.dfs-nameservices.nn2": "%HOSTGROUP::host_group_master_2%:50470",
> >>     "dfs.namenode.rpc-address.dfs-nameservices.nn1": "%HOSTGROUP::host_group_master_1%:8020",
> >>     "dfs.namenode.rpc-address.dfs-nameservices.nn2": "%HOSTGROUP::host_group_master_2%:8020",
> >>     "dfs.namenode.shared.edits.dir": "qjournal://%HOSTGROUP::host_group_master_1%:8485;%HOSTGROUP::host_group_master_2%:8485;%HOSTGROUP::host_group_master_3%:8485/dfs-nameservices",
> >>     "dfs.nameservices": "dfs-nameservices"
> >> }
> >>
> >> core-site{
> >>     "fs.defaultFS": "hdfs://dfs-nameservices",
> >>     "ha.zookeeper.quorum": "%HOSTGROUP::host_group_master_1%:2181,%HOSTGROUP::host_group_master_2%:2181,%HOSTGROUP::host_group_master_3%:2181"
> >> }
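For completeness, a sketch of how a Blueprint and cluster creation template like the ones above are usually submitted through Ambari's REST API. The server address, credentials, and file/cluster names are placeholders; the X-Requested-By header is required by Ambari.

    # Register the blueprint, then instantiate a cluster from it
    curl -u admin:admin -H 'X-Requested-By: ambari' -X POST \
         -d @blueprint.json \
         http://ambari-server.example.com:8080/api/v1/blueprints/hdfs-ha-blueprint
    curl -u admin:admin -H 'X-Requested-By: ambari' -X POST \
         -d @cluster_template.json \
         http://ambari-server.example.com:8080/api/v1/clusters/mycluster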
> >>
> >> This is the log from the standby NameNode server:
> >>
> >> 2015-08-25 08:26:26,373 INFO zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.dir=/usr/hdp/2.2.6.0-2800/hadoop
> >> 2015-08-25 08:26:26,380 INFO zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=usw2ha2dpma01.local:2181,usw2ha2dpma02.local:2181,usw2ha2dpma03.local:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5b7a5baa
> >> 2015-08-25 08:26:26,399 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server usw2ha2dpma02.local/172.17.213.51:2181. Will not attempt to authenticate using SASL (unknown error)
> >> 2015-08-25 08:26:26,405 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to usw2ha2dpma02.local/172.17.213.51:2181, initiating session
> >> 2015-08-25 08:26:26,413 INFO zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session establishment complete on server usw2ha2dpma02.local/172.17.213.51:2181, sessionid = 0x24f63f6f3050001, negotiated timeout = 5000
> >> 2015-08-25 08:26:26,416 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(547)) - Session connected.
> >> 2015-08-25 08:26:26,441 INFO ipc.CallQueueManager (CallQueueManager.java:<init>(53)) - Using callQueue class java.util.concurrent.LinkedBlockingQueue
> >> 2015-08-25 08:26:26,472 INFO ipc.Server (Server.java:run(605)) - Starting Socket Reader #1 for port 8019
> >> 2015-08-25 08:26:26,520 INFO ipc.Server (Server.java:run(827)) - IPC Server Responder: starting
> >> 2015-08-25 08:26:26,526 INFO ipc.Server (Server.java:run(674)) - IPC Server listener on 8019: starting
> >> 2015-08-25 08:26:27,596 INFO ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
> >> 2015-08-25 08:26:27,615 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020: Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> >> 2015-08-25 08:26:27,616 INFO ha.HealthMonitor (HealthMonitor.java:enterState(238)) - Entering state SERVICE_NOT_RESPONDING
> >> 2015-08-25 08:26:27,616 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(850)) - Local service NameNode at usw2ha2dpma02.local/172.17.213.51:8020 entered state: SERVICE_NOT_RESPONDING
> >> 2015-08-25 08:26:27,616 INFO ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(766)) - Quitting master election for NameNode at usw2ha2dpma02.local/172.17.213.51:8020 and marking that fencing is necessary
> >> 2015-08-25 08:26:27,617 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(354)) - Yielding from election
> >> 2015-08-25 08:26:27,621 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
> >> 2015-08-25 08:26:27,621 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x24f63f6f3050001 closed
> >> 2015-08-25 08:26:29,623 INFO ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
> >> 2015-08-25 08:26:29,624 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020: Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> >> 2015-08-25 08:26:31,626 INFO ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
> >> 2015-08-25 08:26:31,627 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020: Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> >> 2015-08-25 08:26:33,629 INFO ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
> >> 2015-08-25 08:26:33,630 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to
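For anyone who hits the same "NameNode not formatted" failure, a minimal sketch of the manual recovery sequence Anand describes above, assuming the standby NameNode is then restarted from the Ambari UI:

    # Run on the standby NameNode host as the hdfs service user.
    # -bootstrapStandby copies the latest fsimage from the active NameNode;
    # -force overwrites any existing, partially initialized metadata directory.
    sudo -u hdfs hdfs namenode -bootstrapStandby -force

    # Then restart the standby NameNode (e.g., from the Ambari UI); it should
    # come up cleanly and settle into standby state.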
