I would focus on this line:

Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable

To me this looks like a network / DNS issue. You can also check dmesg to see
what is going on.
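
As a rough sketch of that check: the function name ntp_sync_events is made up
for illustration, and the default path assumes the CentOS log location
mentioned in the quoted mail. Something like this would pull the ntpd lines
that show sync being lost or regained out of the syslog:

```shell
#!/bin/bash
# Hypothetical helper (not part of ntp or Hadoop): print the ntpd syslog lines
# that report a lost or regained time source.
# $1 is the log file; defaults to the CentOS path mentioned in the thread.
ntp_sync_events() {
    local logfile="${1:-/var/log/messages}"
    # Match only ntpd[PID] lines with one of the two messages seen in the log.
    grep -E 'ntpd\[[0-9]+\]: (no servers reachable|synchronized to )' "$logfile"
}

# Checks to run by hand on the affected host:
#   ntpq -pn             # peer list with numeric addresses; see the "reach" column
#   dmesg | tail -n 50   # recent kernel messages, e.g. link flaps on the NIC
```

Calling ntp_sync_events with no argument reads /var/log/messages. Comparing
the timestamps it prints against the cron schedule should show whether the
"no servers reachable" messages line up with the hourly script or with
something else on the network.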

BR
- Alexander

> On 09 Feb 2015, at 17:57, daemeon reiydelle <daeme...@gmail.com> wrote:
> 
> Absolutely a critical error to lose the configured ntpd time source in 
> Hadoop. The replication and many other services require absolutely 
> millisecond time sync between the nodes. Interesting that your SRE design 
> called for ntpd running on each node. Curious.
> 
> What is the problem you are trying to solve by stopping ntpd on the local 
> host? Did someone not understand how ntpd works? Did someone configure it to 
> (I sure hope not) be free running?
> 
> 
> 
> .......
> “Life should not be a journey to the grave with the intention of arriving 
> safely in a
> pretty and well preserved body, but rather to skid in broadside in a cloud of 
> smoke,
> thoroughly used up, totally worn out, and loudly proclaiming “Wow! What a 
> Ride!” 
> - Hunter Thompson
> 
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
> 
> On Sun, Feb 8, 2015 at 7:30 PM, David chen <c77...@163.com> wrote:
> A shell script is deployed on every node of the HDFS cluster. It is invoked 
> hourly by crontab, and its content is as follows:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
> 
> After several days, the NameNode crashed suddenly, but its log showed no 
> errors other than the following:
> 2015-01-07 14:00:00,709 ERROR 
> org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
> 
> I inspected the Linux log (CentOS /var/log/messages) and found the following 
> clues:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22 11:23:27 
> UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard, 
> 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard, ::#123 
> Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123 
> Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2, 
> fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo, 
> 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2, 
> 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22 for 
> interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from 
> /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished, 
> exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...  shutting 
> down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing 
> attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
> 
> It looks likely that the NameNode received the SIGTERM signal sent by the 
> command that stops ntpd.
> So far the problem has happened three times, at Jan  7 14:00:00, 
> Jan 14 14:00:00 and Feb  4 14:00:00.
> I know the script is not a proper way to synchronize time, and I know the 
> correct way. But I wonder why the NameNode would receive the SIGTERM signal 
> sent when stopping ntpd, and why all three occurrences happened at 14:00:00.
> Any ideas would be appreciated.
> 
