I am having a problem with Ambari not recognizing nodes on a network. The cluster is using CentOS 6. I am trying to install HDP 2.1. I have the following values in my hosts file:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6 192.168.200.144 datanode10.localdomain.com 192.168.200.143 namenode.localdomain.com 192.168.200.107 datanode01.localdomain.com When I try to connect from the namenode.localdomain.com to datanode10.localdomain.com i get this error in the registration log: ========================== Running setup agent script... DJN...expected_host not defined here DJN:bootstrap.py ...expected_host is: datanode10.localdomain.com ========================== .... Agent out at: /var/log/ambari-agent/ambari-agent.out Agent log at: /var/log/ambari-agent/ambari-agent.log ("WARNING 2014-12-17 16:22:50,380 NetUtil.py:92 - Server at https://namenode.localdomain.com.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode:8440 is not reachable, sleeping for 10 seconds... 
INFO 2014-12-17 16:23:00,390 NetUtil.py:48 - Connecting to https://namenode.localdomain.com.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode:8440/ca WARNING 2014-12-17 16:23:00,391 NetUtil.py:71 - Failed to connect to https://namenode.localdomain.com.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode:8440/ca due to [Errno -2] Name or service not known ... Connection to datanode10.localdomain.com closed. SSH command execution finished host=datanode10.localdomain.com, exitcode=0 Command end time 2014-12-17 16:23:26 datanode10.localdomain.com What follows is more detail. 
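
The "[Errno -2] Name or service not known" message means the resolver on the agent host could not look up the server name it was given. A quick way to check this from any node is sketched below in Python 3 (the cluster itself runs Python 2.6, so treat this as an illustration rather than something taken from Ambari). The function names trailing_label_repeats and resolves are made up for this example.

```python
import socket


def trailing_label_repeats(hostname):
    """Count how many times the last dot-separated label repeats at the end.

    A normal name such as "namenode.localdomain.com" gives 1; a name ending
    in ".namenode.namenode.namenode" gives 3.
    """
    labels = hostname.rstrip(".").split(".")
    last = labels[-1]
    count = 0
    for label in reversed(labels):
        if label != last:
            break
        count += 1
    return count


def resolves(hostname):
    """Return True if the local resolver (hosts file, then DNS) knows the name."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except (socket.gaierror, UnicodeError):
        # gaierror is what Python raises for "Name or service not known".
        return False
```

Calling resolves("namenode.localdomain.com") on datanode10 would show whether the hosts file entry is being used there, and a trailing_label_repeats() result above 1 for the server name seen in the agent log would point at a mangled name rather than a missing hosts entry.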
I also made some changes to the /usr/lib/python2.6/site-packages/ambari_server/bootstrap.py file:

def run(self):
    sshcommand = ["ssh",
                  "-o", "ConnectTimeOut=60",
                  "-o", "StrictHostKeyChecking=no",
                  "-o", "BatchMode=yes",
                  "-tt",  # Should prevent "tput: No value for $TERM and no -T specified" warning
                  "-i", self.sshkey_file,
                  self.user + "@" + self.host, self.command]
    if DEBUG:
        self.host_log.write("Running ssh command " + ' '.join(sshcommand))
    self.host_log.write("==========================")
    self.host_log.write("\nCommand start time " + datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " " + self.host + " " + self.user + " " + self.sshkey_file + " " + self.command)
    #self.host_log.write("djn:BOOTSTRAP the value is:" + self.host)
    sshstat = subprocess.Popen(sshcommand, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    log = sshstat.communicate()
    errorMsg = log[1]
    if self.errorMessage and sshstat.returncode != 0:
        errorMsg = self.errorMessage + "\n" + errorMsg
    log = log[0] + "\n" + errorMsg
    self.host_log.write(log)
    self.host_log.write("SSH command execution finished")
    self.host_log.write("host=" + self.host + ", exitcode=" + str(sshstat.returncode))
    self.host_log.write("Command end time " + datetime.now().strftime('%Y-%m-%d %H:%M:%S') + " " + self.host)
    return {"exitstatus": sshstat.returncode, "log": log, "errormsg": errorMsg}

I added some information to the host_log output, including self.host, self.user, self.sshkey_file and so on.

When I run the web front end I get two different results. First I will detail the connection to namenode.localdomain.com; second, the connection to datanode10.localdomain.com.

The connection to namenode.localdomain.com is successful. Here is the important part of the registration log:

==========================
Running setup agent script...
DJN...expected_host not defined here
DJN:bootstrap.py ...expected_host is: namenode.localdomain.com
==========================
Command start time 2014-12-17 16:23:17 namenode.localdomain.com root /var/run/ambari-server/bootstrap/25/sshKey sudo python /var/lib/ambari-agent/data/tmp/setupAgent1418854996.py namenode.localdomain.com DEV namenode.localdomain.com 1.7.0 8080
Verifying Python version compatibility...
Using python /usr/bin/python2.6
Found ambari-agent PID: 32172
Stopping ambari-agent
Removing PID file at /var/run/ambari-agent/ambari-agent.pid
ambari-agent successfully stopped
Restarting ambari-agent
Verifying Python version compatibility...
Using python /usr/bin/python2.6
ambari-agent is not running. No PID found at /var/run/ambari-agent/ambari-agent.pid
Verifying Python version compatibility...
Using python /usr/bin/python2.6
Checking for previously running Ambari Agent...
Starting ambari-agent
Verifying ambari-agent process status...
Ambari Agent successfully started
Agent PID at: /var/run/ambari-agent/ambari-agent.pid
Agent out at: /var/log/ambari-agent/ambari-agent.out
Agent log at: /var/log/ambari-agent/ambari-agent.log
('INFO 2014-12-17 16:22:56,352 Heartbeat.py:78 - Building Heartbeat: {responseId = 17, timestamp = 1418854976352, commandsInProgress = False, componentsMapped = False}
INFO 2014-12-17 16:22:56,407 Controller.py:214 - Heartbeat response received (id = 18)
INFO 2014-12-17 16:22:56,408 Controller.py:249 - No commands sent from namenode.localdomain.com
INFO 2014-12-17 16:23:06,409 Heartbeat.py:78 - Building Heartbeat: {responseId = 18, timestamp = 1418854986409, commandsInProgress = False, componentsMapped = False}
INFO 2014-12-17 16:23:13,422 HostCheckReportFileHandler.py:43 - Host check report at /var/lib/ambari-agent/data/hostcheck.result
INFO 2014-12-17 16:23:13,423 HostCheckReportFileHandler.py:104 - Removing old host check file at /var/lib/ambari-agent/data/hostcheck.result
INFO 2014-12-17 16:23:13,423 HostCheckReportFileHandler.py:109 - Creating host check file at /var/lib/ambari-agent/data/hostcheck.result
INFO 2014-12-17 16:23:13,491 Controller.py:214 - Heartbeat response received (id = 19)
INFO 2014-12-17 16:23:13,492 Controller.py:249 - No commands sent from namenode.localdomain.com
INFO 2014-12-17 16:23:21,942 main.py:83 - loglevel=logging.INFO
INFO 2014-12-17 16:23:23,493 Heartbeat.py:78 - Building Heartbeat: {responseId = 19, timestamp = 1418855003493, commandsInProgress = False, componentsMapped = False}
INFO 2014-12-17 16:23:23,544 Controller.py:214 - Heartbeat response received (id = 20)
INFO 2014-12-17 16:23:23,544 Controller.py:249 - No commands sent from namenode.localdomain.com
INFO 2014-12-17 16:23:28,845 main.py:83 - loglevel=logging.INFO
INFO 2014-12-17 16:23:28,846 DataCleaner.py:36 - Data cleanup thread started
INFO 2014-12-17 16:23:28,847 DataCleaner.py:117 - Data cleanup started
INFO 2014-12-17 16:23:28,857 DataCleaner.py:119 - Data cleanup finished
INFO 2014-12-17 16:23:28,967 PingPortListener.py:51 - Ping port listener started on port: 8670
INFO 2014-12-17 16:23:28,968 main.py:233 - Connecting to Ambari server at https://namenode.localdomain.com:8440 (192.168.200.143)
INFO 2014-12-17 16:23:28,969 NetUtil.py:48 - Connecting to https://namenode.localdomain.com:8440/ca
', None)
Connection to namenode.localdomain.com closed.
SSH command execution finished
host=namenode.localdomain.com, exitcode=0
Command end time 2014-12-17 16:23:31 namenode.localdomain.com

The connection to datanode10.localdomain.com does not work. Here is the registration log for that attempt:

==========================
Running setup agent script...
DJN...expected_host not defined here DJN:bootstrap.py ...expected_host is: datanode10.localdomain.com ========================== Command start time 2014-12-17 16:23:16 datanode10.localdomain.com root /var/run/ambari-server/bootstrap/25/sshKey sudo python /var/lib/ambari-agent/data/tmp/setupAgent1418854996.py datanode10.localdomain.com DEV namenode.localdomain.com 1.7.0 8080 Verifying Python version compatibility... Using python /usr/bin/python2.6 Found ambari-agent PID: 7325 Stopping ambari-agent Removing PID file at /var/run/ambari-agent/ambari-agent.pid ambari-agent successfully stopped Restarting ambari-agent Verifying Python version compatibility... Using python /usr/bin/python2.6 ambari-agent is not running. No PID found at /var/run/ambari-agent/ambari-agent.pid Verifying Python version compatibility... Using python /usr/bin/python2.6 Checking for previously running Ambari Agent... Starting ambari-agent Verifying ambari-agent process status... Ambari Agent successfully started Agent PID at: /var/run/ambari-agent/ambari-agent.pid Agent out at: /var/log/ambari-agent/ambari-agent.out Agent log at: /var/log/ambari-agent/ambari-agent.log ("WARNING 2014-12-17 16:22:50,380 NetUtil.py:92 - Server at https://namenode.localdomain.com.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode:8440 is not reachable, sleeping for 10 seconds... 
INFO 2014-12-17 16:23:00,390 NetUtil.py:48 - Connecting to https://namenode.localdomain.com.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode:8440/ca WARNING 2014-12-17 16:23:00,391 NetUtil.py:71 - Failed to connect to https://namenode.localdomain.com.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode:8440/ca due to [Errno -2] Name or service not known WARNING 2014-12-17 16:23:00,391 NetUtil.py:92 - Server at https://namenode.localdomain.com.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode:8440 is not reachable, sleeping for 10 seconds... 
INFO 2014-12-17 16:23:10,402 NetUtil.py:48 - Connecting to https://namenode.localdomain.com.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode:8440/ca WARNING 2014-12-17 16:23:10,402 NetUtil.py:71 - Failed to connect to https://namenode.localdomain.com.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode:8440/ca due to [Errno -2] Name or service not known WARNING 2014-12-17 16:23:10,402 NetUtil.py:92 - Server at https://namenode.localdomain.com.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode:8440 is not reachable, sleeping for 10 seconds... INFO 2014-12-17 16:23:17,959 main.py:83 - loglevel=logging.INFO INFO 2014-12-17 16:23:17,959 main.py:55 - signal received, exiting. 
INFO 2014-12-17 16:23:17,960 ProcessHelper.py:39 - Removing pid file INFO 2014-12-17 16:23:17,960 ProcessHelper.py:46 - Removing temp files INFO 2014-12-17 16:23:23,639 main.py:83 - loglevel=logging.INFO INFO 2014-12-17 16:23:23,639 DataCleaner.py:36 - Data cleanup thread started INFO 2014-12-17 16:23:23,641 DataCleaner.py:117 - Data cleanup started INFO 2014-12-17 16:23:23,642 DataCleaner.py:119 - Data cleanup finished INFO 2014-12-17 16:23:23,678 PingPortListener.py:51 - Ping port listener started on port: 8670 WARNING 2014-12-17 16:23:23,678 main.py:235 - Unable to determine the IP address of the Ambari server 'namenode.localdomain.com.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode' INFO 2014-12-17 16:23:23,678 NetUtil.py:48 - Connecting to https://namenode.localdomain.com.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode:8440/ca WARNING 2014-12-17 16:23:23,679 NetUtil.py:71 - Failed to connect to https://namenode.localdomain.com.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode:8440/ca due to [Errno -2] Name or service not known WARNING 2014-12-17 16:23:23,679 NetUtil.py:92 - Server at 
https://namenode.localdomain.com.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode.namenode:8440 is not reachable, sleeping for 10 seconds... ", None) Connection to datanode10.localdomain.com closed. SSH command execution finished host=datanode10.localdomain.com, exitcode=0 Command end time 2014-12-17 16:23:26 datanode10.localdomain.com Registering with the server... Registration with the server failed. =============================== To double check something I wrote the following command using the sshcommand in the bootstrap.py script: [root@namenode ~]# ssh -v -o ConnectTimeOut=60 -o StrictHostKeyChecking=no -o BatchMode=yes -tt -i /root/Desktop/id_rsa r...@datanode10.localdomain.com "[ -d /var/lib/ambari-agent/data/tmp ] || sudo mkdir -p /var/lib/ambari-agent/data/tmp ; sudo chown root /var/lib/ambari-agent/data/tmp" The command worked and exited with a code of 0. More detail follows. I added the -v option and the path to the id_rsa key file is the same one that I entered into the first page of the wizard. The result is as follows: [root@namenode ~]# ssh -v -o ConnectTimeOut=60 -o StrictHostKeyChecking=no -o BatchMode=yes -tt -i /root/Desktop/id_rsa r...@datanode10.localdomain.com "[ -d /var/lib/ambari-agent/data/tmp ] || sudo mkdir -p /var/lib/ambari-agent/data/tmp ; sudo chown root /var/lib/ambari-agent/data/tmp" OpenSSH_5.3p1, OpenSSL 1.0.1e-fips 11 Feb 2013 debug1: Reading configuration data /etc/ssh/ssh_config debug1: Applying options for * debug1: Connecting to datanode10.localdomain.com [192.168.200.144] port 22. debug1: fd 3 clearing O_NONBLOCK debug1: Connection established. 
debug1: permanently_set_uid: 0/0
debug1: identity file /root/Desktop/id_rsa type 1
debug1: identity file /root/Desktop/id_rsa-cert type -1
debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
debug1: match: OpenSSH_5.3 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.3
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr hmac-md5 none
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug1: Host 'datanode10.localdomain.com' is known and matches the RSA host key.
debug1: Found key in /root/.ssh/known_hosts:13
debug1: ssh_rsa_verify: signature correct
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,gssapi-keyex,gssapi-with-mic,password
debug1: Next authentication method: gssapi-keyex
debug1: No valid Key exchange context
debug1: Next authentication method: gssapi-with-mic
debug1: Unspecified GSS failure. Minor code may provide more information
Credentials cache file '/tmp/krb5cc_0' not found
debug1: Unspecified GSS failure. Minor code may provide more information
Credentials cache file '/tmp/krb5cc_0' not found
debug1: Unspecified GSS failure. Minor code may provide more information
debug1: Unspecified GSS failure. Minor code may provide more information
Credentials cache file '/tmp/krb5cc_0' not found
debug1: Next authentication method: publickey
debug1: Offering public key: /root/Desktop/id_rsa
debug1: Server accepts key: pkalg ssh-rsa blen 277
debug1: Authentication succeeded (publickey).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessi...@openssh.com
debug1: Entering interactive session.
debug1: Sending environment.
debug1: Sending env XMODIFIERS = @im=none
debug1: Sending env LANG = en_US.UTF-8
debug1: Sending command: [ -d /var/lib/ambari-agent/data/tmp ] || sudo mkdir -p /var/lib/ambari-agent/data/tmp ; sudo chown root /var/lib/ambari-agent/data/tmp
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug1: client_input_channel_req: channel 0 rtype e...@openssh.com reply 0
debug1: channel 0: free: client-session, nchannels 1
Connection to datanode10.localdomain.com closed.
Transferred: sent 2952, received 2352 bytes, in 0.0 seconds
Bytes per second: sent 106095.7, received 84531.6
debug1: Exit status 0

David Novogrodsky
david.novogrod...@gmail.com
http://www.linkedin.com/in/davidnovogrodsky
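
Since SSH itself works, the difference between the two hosts appears to be the server name the agent tries to reach. A small Python 3 sketch like the one below could list the distinct server names found in an agent log, to compare the two nodes side by side. It assumes only the "Connecting to https://HOST:PORT" line format visible in the excerpts above; the function name server_hosts_in_log is hypothetical and is not part of Ambari.

```python
import re

# Matches lines such as:
#   ... NetUtil.py:48 - Connecting to https://namenode.localdomain.com:8440/ca
CONNECT_RE = re.compile(r"Connecting to https://([^:/\s]+):(\d+)")


def server_hosts_in_log(lines):
    """Return the distinct server host names the agent tried, in first-seen order."""
    seen = []
    for line in lines:
        for host, _port in CONNECT_RE.findall(line):
            if host not in seen:
                seen.append(host)
    return seen


# Possible usage on a node (log path as shown in the registration output):
#   with open("/var/log/ambari-agent/ambari-agent.log") as f:
#       for host in server_hosts_in_log(f):
#           print(len(host), host)
```

Printing the length next to each name makes a name that keeps growing easy to spot.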