Hi, I've initially reported this problem a while ago and I thought problem was related to IPv6. Now I belive it is not. I did some research and this is what I've found. I contacted Reyk couple of days ago, but didn't get any reply from him yet, so decided to post here.
I'm using OpenNTPD's constraint from the time it was introduced. However I found that ntpd(8) often is getting stuck at the very beginning when it starts and never recovers from that scenario. Usually it looks as follows when you hit the problem: # ntpctl -sa 0/4 peers valid, clock unsynced peer wt tl st next poll offset delay jitter 87.232.1.41 0.pool.ntp.org 1 2 - 0s 0s ---- peer not valid ---- 54.171.104.100 1.pool.ntp.org 1 2 - 0s 0s ---- peer not valid ---- 193.1.193.157 2.pool.ntp.org 1 2 - 0s 0s ---- peer not valid ---- 85.91.1.180 3.pool.ntp.org 1 2 - 0s 0s ---- peer not valid ---- All counters are zero and all peers are not valid. For me it usually stays like that and never recovers until rcctl restart ntpd. Then often it fails like above again. Multiple restarts in a row solve the problem. I did some debug modifications to ntpd and this is what I've found. I'm running my tests on: # sysctl -n kern.version OpenBSD 5.7-current (GENERIC) #955: Thu May 28 13:09:53 MDT 2015 dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC and to keep IPv6 out of the picture I have following settings in resolv.conf: # grep -w family /etc/resolv.conf family inet4 The ntpd server itself is configured as follows: # grep -ve '^#' /etc/ntpd.conf server 0.pool.ntp.org server 1.pool.ntp.org server 2.pool.ntp.org server 3.pool.ntp.org constraints from www.google.com I'm running debug version of ntpd as follows: ./ntpd.debug -vd 2>&1 | ts Here is example debug session when I hit the issue. Lines which contain XXX MK are added by me via log_debug(): May 28 23:25:54 ntp engine ready May 28 23:25:54 XXX MK STATE_NONE=0 May 28 23:25:54 XXX MK STATE_DNS_INPROGRESS=1 May 28 23:25:54 XXX MK STATE_DNS_TEMPFAIL=2 May 28 23:25:54 XXX MK STATE_DNS_DONE=3 May 28 23:25:54 XXX MK STATE_QUERY_SENT=4 May 28 23:25:54 XXX MK STATE_REPLY_RECEIVED=5 May 28 23:25:54 XXX MK STATE_INVALID=6 May 28 23:25:54 XXX MK ntp_main() constraint_cnt=0, constraint_median=0 May 28 23:25:54 XXX MK client_query() start May 28 23:25:54 XXX MK client_query() start May 28 23:25:54 XXX MK client_query() start May 28 23:25:54 XXX MK client_query() start May 28 23:25:54 XXX MK constraint_query() starts state=1 May 28 23:25:54 XXX MK constraint_query() fail with default? May 28 23:25:54 XXX MK ntp_main() function constraint_query() returned -1 May 28 23:25:54 XXX MK ntp_main() about to loop with control_dispatch_msg() and j=3 i=3 nfds=1 May 28 23:25:54 XXX MK ntp_main() about to loop with constraint_dispatch_msg() and j=3 i=3 nfds=1 May 28 23:25:54 XXX MK ntp_main() constraint_cnt=0, constraint_median=0 May 28 23:25:54 XXX MK constraint_query() starts state=1 May 28 23:25:54 XXX MK constraint_query() fail with default? May 28 23:25:54 XXX MK ntp_main() function constraint_query() returned -1 May 28 23:25:55 XXX MK ntp_main() about to loop with control_dispatch_msg() and j=3 i=3 nfds=0 May 28 23:25:55 XXX MK ntp_main() about to loop with constraint_dispatch_msg() and j=3 i=3 nfds=0 May 28 23:25:55 XXX MK ntp_main() constraint_cnt=1, constraint_median=0 May 28 23:25:55 XXX MK constraint_query() starts with 216.58.208.68 state=3 May 28 23:25:55 constraint request to 216.58.208.68 May 28 23:25:55 XXX MK constraint_query() succeeded? May 28 23:25:55 XXX MK httpsdate_query() with 216.58.208.68 May 28 23:25:55 XXX MK httpsdate_init() with 216.58.208.68 May 28 23:25:55 XXX MK httpsdate_request() with 216.58.208.68 port 443 May 28 23:25:55 XXX MK httpsdate_request() date from 216.58.208.68 is Date: Thu, 28 May 2015 22:25:56 GMT May 28 23:25:55 XXX MK httpsdate_request() success from 216.58.208.68 May 28 23:25:55 XXX MK httpsdate_query() with 216.58.208.68 done May 28 23:25:55 XXX MK ntp_main() function poll() returned -1, errno=4 i=4 May 28 23:25:55 XXX MK ntp_main() about to loop with control_dispatch_msg() and j=3 i=4 nfds=-1 May 28 23:25:55 XXX MK ntp_main() about to loop with constraint_dispatch_msg() and j=3 i=4 nfds=-1 May 28 23:25:55 XXX MK constraint_check_child() bummer, we bump senderrors now senderrors=1 fail=0 state=4 May 28 23:25:55 XXX MK constraint_close() are we going to set state to STATE_INVALID? state=4 May 28 23:25:55 XXX MK constraint_close() somehow we set the state to STATE_INVALID state=6 senderrors=1 May 28 23:25:55 XXX MK ntp_main() constraint_cnt=1, constraint_median=0 May 28 23:25:55 XXX MK constraint_query() starts with 216.58.208.68 state=6 May 28 23:25:55 XXX MK constraint_query() fail with STATE_INVALID last=850 now=850 diff=0 May 28 23:25:55 XXX MK ntp_main() function constraint_query() returned -1 for 216.58.208.68 May 28 23:25:56 XXX MK ntp_main() about to loop with control_dispatch_msg() and j=3 i=3 nfds=0 May 28 23:25:56 XXX MK ntp_main() about to loop with constraint_dispatch_msg() and j=3 i=3 nfds=0 May 28 23:25:56 XXX MK ntp_main() constraint_cnt=1, constraint_median=0 May 28 23:25:56 XXX MK constraint_query() starts with 216.58.208.68 state=6 May 28 23:25:56 XXX MK constraint_query() fail with STATE_INVALID last=850 now=851 diff=1 May 28 23:25:56 XXX MK ntp_main() function constraint_query() returned -1 for 216.58.208.68 May 28 23:25:57 XXX MK ntp_main() about to loop with control_dispatch_msg() and j=3 i=3 nfds=0 May 28 23:25:57 XXX MK ntp_main() about to loop with constraint_dispatch_msg() and j=3 i=3 nfds=0 May 28 23:25:57 XXX MK ntp_main() constraint_cnt=1, constraint_median=0 May 28 23:25:57 XXX MK constraint_query() starts with 216.58.208.68 state=6 May 28 23:25:57 XXX MK constraint_query() fail with STATE_INVALID last=850 now=852 diff=2 May 28 23:25:57 XXX MK ntp_main() function constraint_query() returned -1 for 216.58.208.68 May 28 23:25:58 XXX MK ntp_main() about to loop with control_dispatch_msg() and j=3 i=3 nfds=0 May 28 23:25:58 XXX MK ntp_main() about to loop with constraint_dispatch_msg() and j=3 i=3 nfds=0 May 28 23:25:58 XXX MK ntp_main() constraint_cnt=1, constraint_median=0 May 28 23:25:58 XXX MK constraint_query() starts with 216.58.208.68 state=6 May 28 23:25:58 XXX MK constraint_query() fail with STATE_INVALID last=850 now=853 diff=3 May 28 23:25:58 XXX MK ntp_main() function constraint_query() returned -1 for 216.58.208.68 May 28 23:26:00 XXX MK ntp_main() about to loop with control_dispatch_msg() and j=3 i=3 nfds=0 May 28 23:26:00 XXX MK ntp_main() about to loop with constraint_dispatch_msg() and j=3 i=3 nfds=0 May 28 23:26:00 XXX MK ntp_main() constraint_cnt=1, constraint_median=0 May 28 23:26:00 XXX MK constraint_query() starts with 216.58.208.68 state=6 May 28 23:26:00 XXX MK constraint_query() fail with STATE_INVALID last=850 now=854 diff=4 May 28 23:26:00 XXX MK ntp_main() function constraint_query() returned -1 for 216.58.208.68 May 28 23:26:01 XXX MK ntp_main() about to loop with control_dispatch_msg() and j=3 i=3 nfds=0 May 28 23:26:01 XXX MK ntp_main() about to loop with constraint_dispatch_msg() and j=3 i=3 nfds=0 May 28 23:26:01 XXX MK ntp_main() constraint_cnt=1, constraint_median=0 May 28 23:26:01 XXX MK constraint_query() starts with 216.58.208.68 state=6 May 28 23:26:01 XXX MK constraint_query() fail with STATE_INVALID last=850 now=855 diff=5 May 28 23:26:01 XXX MK ntp_main() function constraint_query() returned -1 for 216.58.208.68 and from that point constraint_query() stays with STATE_INVALID until time pases CONSTRAINT_SCAN_INTERVAL seconds. So what I understand is happening above: In ntp.c: 268 TAILQ_FOREACH(p, &conf->ntp_peers, entry) { 269 if (constraint_cnt && conf->constraint_median == 0) 270 continue; constraint_cnt is 1 but conf->constraint_median is always 0, so that part of the code always execute continue. Which means client_query() doesn't happen any more. In constraint.c: 162 case STATE_INVALID: 163 if (cstr->last + CONSTRAINT_SCAN_INTERVAL > now) { 164 /* Nothing to do */ 165 return (-1); 166 } Function constraint_query() returns -1 and only few times it succeeds. When constraint_query() actually succeded, poll() function returns -1 and from that point re-execution of constraint_query() always fails becaues of STATE_INVALID case (see above). In constraint.c: 282 if (fail || cstr->state < STATE_REPLY_RECEIVED) { 283 cstr->senderrors++; 284 constraint_close(cstr->fd); 285 } When poll() returns -1 the errno is set to EINTR, which in turns goes to SIGCHLD handler, which goes to constraint_check_child() and in line 282 (see above) cstr->state is set to STATE_QUERY_SENT so cstr->senderrors is bumped and constraint_close() is executed. In constraint.c: 347 if (cstr->senderrors) 348 cstr->state = STATE_INVALID; In constraint_close() because cstr->senderrors are non-zero state is set to STATE_INVALID and from that point ntpd is stuck. I'm not sure what would be the proper way to solve this, but very minimalisic patch as below fixed the issue for me: Index: constraint.c =================================================================== RCS file: /cvs/src/usr.sbin/ntpd/constraint.c,v retrieving revision 1.12 diff -u -p -u -r1.12 constraint.c --- constraint.c 28 May 2015 21:34:36 -0000 1.12 +++ constraint.c 28 May 2015 23:14:47 -0000 @@ -279,7 +279,7 @@ constraint_check_child(void) &cstr->addr->ss), CONSTRAINT_SCAN_INTERVAL); } - if (fail || cstr->state < STATE_REPLY_RECEIVED) { + if (fail || cstr->state < STATE_QUERY_SENT) { cstr->senderrors++; constraint_close(cstr->fd); } a bit more verbose change, which may (or may not) be a more clear is: Index: constraint.c =================================================================== RCS file: /cvs/src/usr.sbin/ntpd/constraint.c,v retrieving revision 1.12 diff -u -p -u -r1.12 constraint.c --- constraint.c 28 May 2015 21:34:36 -0000 1.12 +++ constraint.c 28 May 2015 23:16:57 -0000 @@ -272,6 +272,8 @@ constraint_check_child(void) fatalx("constraint %s, signal %d", log_sockaddr((struct sockaddr *) &cstr->addr->ss), sig); + if (!fail && cstr->state == STATE_QUERY_SENT) + continue; if (fail) { log_debug("no constraint reply from %s" " received in time, next query %ds", I'm running with the first patch for last couple of days and it works for me, but I'm not sure is this the correct(tm) approach. On Tue, Mar 31, 2015 at 02:16:31AM +0100, Mikolaj Kucharski wrote: > Hi, > > I have following ntpd.conf file on an OpenBSD machine: > > OpenBSD 5.7-current (GENERIC) #839: Mon Mar 30 14:21:47 MDT 2015 > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC > > # /etc/ntpd.conf > server 0.pool.ntp.org > server 1.pool.ntp.org > server 2.pool.ntp.org > server 3.pool.ntp.org > constraints from www.google.com > > I've noticed that ntpd(8) often ends up with all peers in 'peer not valid' > state, never syncing the time. I've enabled debug logging and this is > what I see in the logs: > > Mar 31 00:29:38 openbsd57 ntpd[12097]: ntp engine ready > Mar 31 00:29:39 openbsd57 ntpd[12097]: constraint request to > 2a00:1450:400b:800::2004 > Mar 31 00:29:39 openbsd57 ntpd[12097]: constraint request to 216.58.211.132 > Mar 31 00:29:39 openbsd57 ntpd[30212]: tls failed: 2a00:1450:400b:800::2004: > connect > Mar 31 00:29:39 openbsd57 ntpd[12097]: no constraint reply from > 2a00:1450:400b:800::2004 received in time, next query 900s > Mar 31 00:34:37 openbsd57 ntpd[12097]: 0 out of 4 peers valid > Mar 31 00:34:37 openbsd57 ntpd[12097]: bad peer 0.pool.ntp.org > (54.194.18.100) > Mar 31 00:34:37 openbsd57 ntpd[12097]: bad peer 1.pool.ntp.org > (149.157.192.5) > Mar 31 00:34:37 openbsd57 ntpd[12097]: bad peer 2.pool.ntp.org > (193.1.219.116) > Mar 31 00:34:37 openbsd57 ntpd[12097]: bad peer 3.pool.ntp.org (85.91.1.164) > Mar 31 00:44:39 openbsd57 ntpd[12097]: constraint request to > 2a00:1450:400b:800::2004 > Mar 31 00:44:39 openbsd57 ntpd[12097]: constraint request to 216.58.211.132 > Mar 31 00:44:39 openbsd57 ntpd[1720]: tls failed: 2a00:1450:400b:800::2004: > connect > Mar 31 00:44:39 openbsd57 ntpd[12097]: no constraint reply from > 2a00:1450:400b:800::2004 received in time, next query 900s > Mar 31 00:59:39 openbsd57 ntpd[12097]: constraint request to > 2a00:1450:400b:800::2004 > Mar 31 00:59:39 openbsd57 ntpd[12097]: constraint request to 216.58.211.132 > Mar 31 00:59:39 openbsd57 ntpd[13994]: tls failed: 2a00:1450:400b:800::2004: > connect > Mar 31 00:59:39 openbsd57 ntpd[12097]: no constraint reply from > 2a00:1450:400b:800::2004 received in time, next query 900s > > > That box doesn't have IPv6 configured: > > # ifconfig > lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 32768 > priority: 0 > groups: lo > inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3 > inet6 ::1 prefixlen 128 > inet 127.0.0.1 netmask 0xff000000 > vio0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500 > lladdr 52:54:00:12:34:72 > priority: 0 > groups: egress > media: Ethernet autoselect > status: active > inet 172.16.0.52 netmask 0xffffff00 broadcast 172.16.0.255 > enc0: flags=0<> > priority: 0 > groups: enc > status: active > > > # nc -v 2a00:1450:400b:800::2004 443 > nc: connect to 2a00:1450:400b:800::2004 port 443 (tcp) failed: No route to > host > -- best regards q#