Losing the configured ntpd time source is absolutely a critical error in
Hadoop. Replication and many other services depend on tight (ideally
millisecond-level) time sync between the nodes. Interesting that your SRE
design called for ntpd running on each node. Curious.

What problem are you trying to solve by stopping ntpd on the local host?
Did someone not understand how ntpd works? Did someone (I sure hope not)
configure it to be free-running?
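
For what it is worth, sync state can be checked without ever stopping the
daemon. The following is only a minimal sketch, assuming ntpd is running and
`ntpq` is installed on the node; `selected_peer` is a hypothetical helper
name, not something shipped with ntp. It reads the `ntpq -pn` peer table and
reports the peer marked with the `*` tally code (the one ntpd is currently
synchronized to) together with its offset in milliseconds.

```shell
#!/bin/bash
# Hypothetical helper: read `ntpq -pn` output on stdin and print
# "<peer address> <offset in ms>" for the peer ntpd has selected.
# ntpq marks the selected (system) peer with a leading '*'.
# Returns non-zero when no peer is selected, i.e. ntpd is not synchronized.
selected_peer() {
    awk '/^\*/ { print substr($1, 2), $9; found = 1 }
         END   { exit !found }'
}

# Example use from cron, with no stop/start of ntpd involved:
#   ntpq -pn | selected_peer || echo "ntpd is not synchronized" >&2
```

A check like this leaves the running daemon alone, so ntpd keeps disciplining
the clock and nothing on the node is sent a signal.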




*.......*

*“Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter Thompson*

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Sun, Feb 8, 2015 at 7:30 PM, David chen <c77...@163.com> wrote:

> A shell script is deployed on every node of the HDFS cluster. It is
> invoked hourly by crontab, and its content is as follows:
> #!/bin/bash
> service ntpd stop
> ntpdate 192.168.0.1 #it's a valid ntpd server in LAN
> service ntpd start
> chkconfig ntpd on
>
> After several days, the NameNode crashed suddenly, but its log showed no
> errors other than the following:
> 2015-01-07 14:00:00,709 ERROR
> org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
>
> Inspecting the Linux log (CentOS, /var/log/messages), I also found the
> following clues:
> Jan  7 14:00:01 host1 ntpd[32101]: ntpd exiting on signal 15
> Jan  7 13:59:59 host1 ntpd[44764]: ntpd 4.2.4p8@1.1612-o Fri Feb 22
> 11:23:27 UTC 2013 (1)
> Jan  7 13:59:59 host1 ntpd[44765]: precision = 0.143 usec
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #0 wildcard,
> 0.0.0.0#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #1 wildcard,
> ::#123 Disabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #2 lo, ::1#123
> Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #3 em2,
> fe80::ca1f:66ff:fee1:eed#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #4 lo,
> 127.0.0.1#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on interface #5 em2,
> 192.168.1.151#123 Enabled
> Jan  7 13:59:59 host1 ntpd[44765]: Listening on routing socket on fd #22
> for interface updates
> Jan  7 13:59:59 host1 ntpd[44765]: kernel time sync status 2040
> Jan  7 13:59:59 host1 ntpd[44765]: frequency initialized 499.399 PPM from
> /var/lib/ntp/drift
> Jan  7 14:00:01 host1 ntpd_initres[32103]: parent died before we finished,
> exiting
> Jan  7 14:04:17 host1 ntpd[44765]: synchronized to 192.168.0.191, stratum 2
> Jan  7 14:04:17 host1 ntpd[44765]: kernel time sync status change 2001
> Jan  7 14:26:02 host1 snmpd[4842]: Received TERM or STOP signal...
>  shutting down...
> Jan  7 14:26:02 host1 kernel: netlink: 12 bytes leftover after parsing
> attributes.
> Jan  7 14:26:02 host1 snmpd[45667]: NET-SNMP version 5.5
> Jan  7 14:52:48 host1 ntpd[44765]: no servers reachable
>
> It looks likely that the NameNode received the SIGTERM sent by the
> command that stops ntpd.
> So far the problem has happened three times, on Jan  7, Jan 14 and
> Feb  4, each time at 14:00:00.
> I know the synchronization script is somewhat improper, and I know the
> correct way to synchronize. But I wonder why the NameNode receives the
> SIGTERM sent by stopping ntpd, and why all three occurrences happened at
> 14:00:00.
> Any ideas would be appreciated.
>
