Thanks, Bob. It works fine and I was able to resolve the issue.

I filed the bug https://issues.apache.org/jira/browse/AMBARI-12893. I can fix this and provide a patch. Could you point me to the build instructions wiki for Ambari?

On Wed, Aug 26, 2015 at 6:35 AM, Robert Nettleton <[email protected]> wrote:

Hi Anand,

I just tried out a simple HDFS HA deployment (with Ambari 2.1.0), using the HOSTGROUP syntax for these two properties, and it failed as I expected.

I'm not sure why "dfs_ha_initial_namenode_active" includes the FQDN. I suspect that there is some other problem that is causing this.

As I mentioned before, these two properties are not currently meant for %HOSTGROUP% substitution, so the fix is to specify the FQDNs within these properties.

If you are concerned about including hostnames in your Blueprint for portability reasons, you can always set these properties in the cluster creation template instead.

If you don't need to select the initial state of the namenodes in your cluster, you can simply remove these properties from your Blueprint, and the Blueprint processor will select an "active" and a "standby" namenode for you.

If it still appears to you that the property is being set by the Blueprints processor, please feel free to file a JIRA to track the investigation into this.

Hope this helps!

Thanks,
Bob
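For example, Bob's suggestion of keeping the FQDNs out of the Blueprint would look roughly like the sketch below: the two hadoop-env properties move into the "configurations" section of the cluster creation template, which already carries the per-deployment host names. This is a minimal, illustrative sketch; the blueprint name and the FQDNs are placeholders, not values taken from the thread.

    {
      "blueprint": "hdfs-ha-blueprint",
      "configurations": [
        {
          "hadoop-env": {
            "dfs_ha_initial_namenode_active": "master1.example.com",
            "dfs_ha_initial_namenode_standby": "master2.example.com"
          }
        }
      ],
      "host_groups": [
        { "name": "host_group_master_1", "hosts": [ { "fqdn": "master1.example.com" } ] },
        { "name": "host_group_master_2", "hosts": [ { "fqdn": "master2.example.com" } ] }
      ]
    }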
On Aug 26, 2015, at 2:29 AM, Anandha L Ranganathan <[email protected]> wrote:

+ dev group.

This is what I found in /var/lib/ambari-agent/data/command-#.json on one of the master hosts. You can see that the active namenode has been substituted with the FQDN, but the standby namenode has not. Is this a bug in this Ambari version?

I am using Ambari 2.1.

    hadoop-env{
        "dfs_ha_initial_namenode_active": "usw2ha3dpma01.local",
        "hadoop_root_logger": "INFO,RFA",
        "dfs_ha_initial_namenode_standby": "%HOSTGROUP::host_group_master_2%",
        "namenode_opt_permsize": "128m"
    }

Thanks
Anand

On Tue, Aug 25, 2015 at 11:23 AM, Anandha L Ranganathan <[email protected]> wrote:

Hi,

I am trying to install NameNode HA using Blueprints. During cluster creation through scripts, it does the following:

1) The journal nodes start and are initialized (the journal nodes are formatted).
2) The HA state is initialized in ZooKeeper by ZKFC (on both the active and the standby namenode).

At 96% it fails. I logged into the cluster using the UI and restarted the standby namenode, but it threw an exception saying that the namenode is not formatted. I had to manually copy the fsimage by running "hdfs namenode -bootstrapStandby -force" on the standby NN server; after that, restarting the namenode works fine and it goes into standby mode.
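For reference, the manual recovery described above amounts to roughly the following on the standby NameNode host. Running the command as the hdfs service user is an assumption on my part; the thread gives only the bare command, and the restart was done through the Ambari UI.

    # Copy the current fsimage from the active NameNode and mark this
    # NameNode as formatted for standby duty (assumes the hdfs service user):
    su - hdfs -c "hdfs namenode -bootstrapStandby -force"

    # Then restart the NameNode, e.g. via the Ambari UI as described above.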
Is there something I am missing in the configuration? My NameNode HA Blueprint looks like this:

    hadoop-env{
        "dfs_ha_initial_namenode_active": "%HOSTGROUP::host_group_master_1%",
        "dfs_ha_initial_namenode_standby": "%HOSTGROUP::host_group_master_2%"
    }

    hdfs-site{
        "dfs.client.failover.proxy.provider.dfs-nameservices": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider",
        "dfs.ha.automatic-failover.enabled": "true",
        "dfs.ha.fencing.methods": "shell(/bin/true)",
        "dfs.ha.namenodes.dfs-nameservices": "nn1,nn2",
        "dfs.namenode.http-address.dfs-nameservices.nn1": "%HOSTGROUP::host_group_master_1%:50070",
        "dfs.namenode.http-address.dfs-nameservices.nn2": "%HOSTGROUP::host_group_master_2%:50070",
        "dfs.namenode.https-address.dfs-nameservices.nn1": "%HOSTGROUP::host_group_master_1%:50470",
        "dfs.namenode.https-address.dfs-nameservices.nn2": "%HOSTGROUP::host_group_master_2%:50470",
        "dfs.namenode.rpc-address.dfs-nameservices.nn1": "%HOSTGROUP::host_group_master_1%:8020",
        "dfs.namenode.rpc-address.dfs-nameservices.nn2": "%HOSTGROUP::host_group_master_2%:8020",
        "dfs.namenode.shared.edits.dir": "qjournal://%HOSTGROUP::host_group_master_1%:8485;%HOSTGROUP::host_group_master_2%:8485;%HOSTGROUP::host_group_master_3%:8485/dfs-nameservices",
        "dfs.nameservices": "dfs-nameservices"
    }

    core-site{
        "fs.defaultFS": "hdfs://dfs-nameservices",
        "ha.zookeeper.quorum": "%HOSTGROUP::host_group_master_1%:2181,%HOSTGROUP::host_group_master_2%:2181,%HOSTGROUP::host_group_master_3%:2181"
    }

This is the log from the standby NameNode server:

    2015-08-25 08:26:26,373 INFO zookeeper.ZooKeeper (Environment.java:logEnv(100)) - Client environment:user.dir=/usr/hdp/2.2.6.0-2800/hadoop
    2015-08-25 08:26:26,380 INFO zookeeper.ZooKeeper (ZooKeeper.java:<init>(438)) - Initiating client connection, connectString=usw2ha2dpma01.local:2181,usw2ha2dpma02.local:2181,usw2ha2dpma03.local:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5b7a5baa
    2015-08-25 08:26:26,399 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(975)) - Opening socket connection to server usw2ha2dpma02.local/172.17.213.51:2181. Will not attempt to authenticate using SASL (unknown error)
    2015-08-25 08:26:26,405 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(852)) - Socket connection established to usw2ha2dpma02.local/172.17.213.51:2181, initiating session
    2015-08-25 08:26:26,413 INFO zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1235)) - Session establishment complete on server usw2ha2dpma02.local/172.17.213.51:2181, sessionid = 0x24f63f6f3050001, negotiated timeout = 5000
    2015-08-25 08:26:26,416 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(547)) - Session connected.
    2015-08-25 08:26:26,441 INFO ipc.CallQueueManager (CallQueueManager.java:<init>(53)) - Using callQueue class java.util.concurrent.LinkedBlockingQueue
    2015-08-25 08:26:26,472 INFO ipc.Server (Server.java:run(605)) - Starting Socket Reader #1 for port 8019
    2015-08-25 08:26:26,520 INFO ipc.Server (Server.java:run(827)) - IPC Server Responder: starting
    2015-08-25 08:26:26,526 INFO ipc.Server (Server.java:run(674)) - IPC Server listener on 8019: starting
    2015-08-25 08:26:27,596 INFO ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
    2015-08-25 08:26:27,615 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020: Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    2015-08-25 08:26:27,616 INFO ha.HealthMonitor (HealthMonitor.java:enterState(238)) - Entering state SERVICE_NOT_RESPONDING
    2015-08-25 08:26:27,616 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(850)) - Local service NameNode at usw2ha2dpma02.local/172.17.213.51:8020 entered state: SERVICE_NOT_RESPONDING
    2015-08-25 08:26:27,616 INFO ha.ZKFailoverController (ZKFailoverController.java:recheckElectability(766)) - Quitting master election for NameNode at usw2ha2dpma02.local/172.17.213.51:8020 and marking that fencing is necessary
    2015-08-25 08:26:27,617 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:quitElection(354)) - Yielding from election
    2015-08-25 08:26:27,621 INFO zookeeper.ClientCnxn (ClientCnxn.java:run(512)) - EventThread shut down
    2015-08-25 08:26:27,621 INFO zookeeper.ZooKeeper (ZooKeeper.java:close(684)) - Session: 0x24f63f6f3050001 closed
    2015-08-25 08:26:29,623 INFO ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
    2015-08-25 08:26:29,624 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020: Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    2015-08-25 08:26:31,626 INFO ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
    2015-08-25 08:26:31,627 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to monitor health of NameNode at usw2ha2dpma02.local/172.17.213.51:8020: Call From usw2ha2dpma02.local/172.17.213.51 to usw2ha2dpma02.local:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    2015-08-25 08:26:33,629 INFO ipc.Client (Client.java:handleConnectionFailure(859)) - Retrying connect to server: usw2ha2dpma02.local/172.17.213.51:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
    2015-08-25 08:26:33,630 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(209)) - Transport-level exception trying to
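Since the thread mentions driving cluster creation through scripts, here is a sketch of the two REST calls a Blueprint deployment like this is typically submitted with. The server host, credentials, and blueprint/cluster names are placeholders; blueprint.json and cluster-template.json stand for the documents discussed above.

    # Register the Blueprint with the Ambari server
    curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
        -d @blueprint.json \
        http://ambari-server:8080/api/v1/blueprints/hdfs-ha-blueprint

    # Create the cluster from the Blueprint using the cluster creation template
    curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
        -d @cluster-template.json \
        http://ambari-server:8080/api/v1/clusters/mycluster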
