On Thu, May 28, 2015 at 5:24 PM, Mikolaj Kucharski <miko...@kucharski.name> wrote: > Hi, > > I've initially reported this problem a while ago and I thought problem > was related to IPv6. Now I belive it is not. I did some research and > this is what I've found. I contacted Reyk couple of days ago, but didn't > get any reply from him yet, so decided to post here. > > I'm using OpenNTPD's constraint from the time it was introduced. However > I found that ntpd(8) often is getting stuck at the very beginning when > it starts and never recovers from that scenario. Usually it looks as > follows when you hit the problem: > > # ntpctl -sa > 0/4 peers valid, clock unsynced > > peer > wt tl st next poll offset delay jitter > 87.232.1.41 0.pool.ntp.org > 1 2 - 0s 0s ---- peer not valid ---- > 54.171.104.100 1.pool.ntp.org > 1 2 - 0s 0s ---- peer not valid ---- > 193.1.193.157 2.pool.ntp.org > 1 2 - 0s 0s ---- peer not valid ---- > 85.91.1.180 3.pool.ntp.org > 1 2 - 0s 0s ---- peer not valid ---- > > All counters are zero and all peers are not valid. For me it usually > stays like that and never recovers until rcctl restart ntpd. Then often > it fails like above again. Multiple restarts in a row solve the problem. > > I did some debug modifications to ntpd and this is what I've found. > I'm running my tests on: > > # sysctl -n kern.version > OpenBSD 5.7-current (GENERIC) #955: Thu May 28 13:09:53 MDT 2015 > dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC > > and to keep IPv6 out of the picture I have following settings in resolv.conf: > > # grep -w family /etc/resolv.conf > family inet4 > > The ntpd server itself is configured as follows: > > # grep -ve '^#' /etc/ntpd.conf > server 0.pool.ntp.org > server 1.pool.ntp.org > server 2.pool.ntp.org > server 3.pool.ntp.org > constraints from www.google.com > > I'm running debug version of ntpd as follows: > > ./ntpd.debug -vd 2>&1 | ts > > Here is example debug session when I hit the issue. Lines which contain > XXX MK are added by me via log_debug(): > > May 28 23:25:54 ntp engine ready > May 28 23:25:54 XXX MK STATE_NONE=0 > May 28 23:25:54 XXX MK STATE_DNS_INPROGRESS=1 > May 28 23:25:54 XXX MK STATE_DNS_TEMPFAIL=2 > May 28 23:25:54 XXX MK STATE_DNS_DONE=3 > May 28 23:25:54 XXX MK STATE_QUERY_SENT=4 > May 28 23:25:54 XXX MK STATE_REPLY_RECEIVED=5 > May 28 23:25:54 XXX MK STATE_INVALID=6 > May 28 23:25:54 XXX MK ntp_main() constraint_cnt=0, constraint_median=0 > May 28 23:25:54 XXX MK client_query() start > May 28 23:25:54 XXX MK client_query() start > May 28 23:25:54 XXX MK client_query() start > May 28 23:25:54 XXX MK client_query() start > May 28 23:25:54 XXX MK constraint_query() starts state=1 > May 28 23:25:54 XXX MK constraint_query() fail with default? > May 28 23:25:54 XXX MK ntp_main() function constraint_query() returned -1 > May 28 23:25:54 XXX MK ntp_main() about to loop with control_dispatch_msg() > and j=3 i=3 nfds=1 > May 28 23:25:54 XXX MK ntp_main() about to loop with > constraint_dispatch_msg() and j=3 i=3 nfds=1 > May 28 23:25:54 XXX MK ntp_main() constraint_cnt=0, constraint_median=0 > May 28 23:25:54 XXX MK constraint_query() starts state=1 > May 28 23:25:54 XXX MK constraint_query() fail with default? > May 28 23:25:54 XXX MK ntp_main() function constraint_query() returned -1 > May 28 23:25:55 XXX MK ntp_main() about to loop with control_dispatch_msg() > and j=3 i=3 nfds=0 > May 28 23:25:55 XXX MK ntp_main() about to loop with > constraint_dispatch_msg() and j=3 i=3 nfds=0 > May 28 23:25:55 XXX MK ntp_main() constraint_cnt=1, constraint_median=0 > May 28 23:25:55 XXX MK constraint_query() starts with 216.58.208.68 state=3 > May 28 23:25:55 constraint request to 216.58.208.68 > May 28 23:25:55 XXX MK constraint_query() succeeded? > May 28 23:25:55 XXX MK httpsdate_query() with 216.58.208.68 > May 28 23:25:55 XXX MK httpsdate_init() with 216.58.208.68 > May 28 23:25:55 XXX MK httpsdate_request() with 216.58.208.68 port 443 > May 28 23:25:55 XXX MK httpsdate_request() date from 216.58.208.68 is Date: > Thu, 28 May 2015 22:25:56 GMT > May 28 23:25:55 XXX MK httpsdate_request() success from 216.58.208.68 > May 28 23:25:55 XXX MK httpsdate_query() with 216.58.208.68 done > May 28 23:25:55 XXX MK ntp_main() function poll() returned -1, errno=4 i=4 > May 28 23:25:55 XXX MK ntp_main() about to loop with control_dispatch_msg() > and j=3 i=4 nfds=-1 > May 28 23:25:55 XXX MK ntp_main() about to loop with > constraint_dispatch_msg() and j=3 i=4 nfds=-1 > May 28 23:25:55 XXX MK constraint_check_child() bummer, we bump senderrors > now senderrors=1 fail=0 state=4 > May 28 23:25:55 XXX MK constraint_close() are we going to set state to > STATE_INVALID? state=4 > May 28 23:25:55 XXX MK constraint_close() somehow we set the state to > STATE_INVALID state=6 senderrors=1 > May 28 23:25:55 XXX MK ntp_main() constraint_cnt=1, constraint_median=0 > May 28 23:25:55 XXX MK constraint_query() starts with 216.58.208.68 state=6 > May 28 23:25:55 XXX MK constraint_query() fail with STATE_INVALID last=850 > now=850 diff=0 > May 28 23:25:55 XXX MK ntp_main() function constraint_query() returned -1 for > 216.58.208.68 > May 28 23:25:56 XXX MK ntp_main() about to loop with control_dispatch_msg() > and j=3 i=3 nfds=0 > May 28 23:25:56 XXX MK ntp_main() about to loop with > constraint_dispatch_msg() and j=3 i=3 nfds=0 > May 28 23:25:56 XXX MK ntp_main() constraint_cnt=1, constraint_median=0 > May 28 23:25:56 XXX MK constraint_query() starts with 216.58.208.68 state=6 > May 28 23:25:56 XXX MK constraint_query() fail with STATE_INVALID last=850 > now=851 diff=1 > May 28 23:25:56 XXX MK ntp_main() function constraint_query() returned -1 for > 216.58.208.68 > May 28 23:25:57 XXX MK ntp_main() about to loop with control_dispatch_msg() > and j=3 i=3 nfds=0 > May 28 23:25:57 XXX MK ntp_main() about to loop with > constraint_dispatch_msg() and j=3 i=3 nfds=0 > May 28 23:25:57 XXX MK ntp_main() constraint_cnt=1, constraint_median=0 > May 28 23:25:57 XXX MK constraint_query() starts with 216.58.208.68 state=6 > May 28 23:25:57 XXX MK constraint_query() fail with STATE_INVALID last=850 > now=852 diff=2 > May 28 23:25:57 XXX MK ntp_main() function constraint_query() returned -1 for > 216.58.208.68 > May 28 23:25:58 XXX MK ntp_main() about to loop with control_dispatch_msg() > and j=3 i=3 nfds=0 > May 28 23:25:58 XXX MK ntp_main() about to loop with > constraint_dispatch_msg() and j=3 i=3 nfds=0 > May 28 23:25:58 XXX MK ntp_main() constraint_cnt=1, constraint_median=0 > May 28 23:25:58 XXX MK constraint_query() starts with 216.58.208.68 state=6 > May 28 23:25:58 XXX MK constraint_query() fail with STATE_INVALID last=850 > now=853 diff=3 > May 28 23:25:58 XXX MK ntp_main() function constraint_query() returned -1 for > 216.58.208.68 > May 28 23:26:00 XXX MK ntp_main() about to loop with control_dispatch_msg() > and j=3 i=3 nfds=0 > May 28 23:26:00 XXX MK ntp_main() about to loop with > constraint_dispatch_msg() and j=3 i=3 nfds=0 > May 28 23:26:00 XXX MK ntp_main() constraint_cnt=1, constraint_median=0 > May 28 23:26:00 XXX MK constraint_query() starts with 216.58.208.68 state=6 > May 28 23:26:00 XXX MK constraint_query() fail with STATE_INVALID last=850 > now=854 diff=4 > May 28 23:26:00 XXX MK ntp_main() function constraint_query() returned -1 for > 216.58.208.68 > May 28 23:26:01 XXX MK ntp_main() about to loop with control_dispatch_msg() > and j=3 i=3 nfds=0 > May 28 23:26:01 XXX MK ntp_main() about to loop with > constraint_dispatch_msg() and j=3 i=3 nfds=0 > May 28 23:26:01 XXX MK ntp_main() constraint_cnt=1, constraint_median=0 > May 28 23:26:01 XXX MK constraint_query() starts with 216.58.208.68 state=6 > May 28 23:26:01 XXX MK constraint_query() fail with STATE_INVALID last=850 > now=855 diff=5 > May 28 23:26:01 XXX MK ntp_main() function constraint_query() returned -1 for > 216.58.208.68 > > and from that point constraint_query() stays with STATE_INVALID until time > pases CONSTRAINT_SCAN_INTERVAL seconds. > > So what I understand is happening above: > > In ntp.c: > 268 TAILQ_FOREACH(p, &conf->ntp_peers, entry) { > 269 if (constraint_cnt && conf->constraint_median > == 0) > 270 continue; > > constraint_cnt is 1 but conf->constraint_median is always 0, so that > part of the code always execute continue. Which means client_query() > doesn't happen any more. > > In constraint.c: > 162 case STATE_INVALID: > 163 if (cstr->last + CONSTRAINT_SCAN_INTERVAL > now) { > 164 /* Nothing to do */ > 165 return (-1); > 166 } > > Function constraint_query() returns -1 and only few times it succeeds. > When constraint_query() actually succeded, poll() function returns -1 > and from that point re-execution of constraint_query() always fails > becaues of STATE_INVALID case (see above). > > In constraint.c: > 282 if (fail || cstr->state < > STATE_REPLY_RECEIVED) { > 283 cstr->senderrors++; > 284 constraint_close(cstr->fd); > 285 } > > When poll() returns -1 the errno is set to EINTR, which in turns goes to > SIGCHLD handler, which goes to constraint_check_child() and in line 282 > (see above) cstr->state is set to STATE_QUERY_SENT so cstr->senderrors > is bumped and constraint_close() is executed. > > In constraint.c: > 347 if (cstr->senderrors) > 348 cstr->state = STATE_INVALID; > > In constraint_close() because cstr->senderrors are non-zero state is set > to STATE_INVALID and from that point ntpd is stuck. > > I'm not sure what would be the proper way to solve this, but very > minimalisic patch as below fixed the issue for me: > > Index: constraint.c > =================================================================== > RCS file: /cvs/src/usr.sbin/ntpd/constraint.c,v > retrieving revision 1.12 > diff -u -p -u -r1.12 constraint.c > --- constraint.c 28 May 2015 21:34:36 -0000 1.12 > +++ constraint.c 28 May 2015 23:14:47 -0000 > @@ -279,7 +279,7 @@ constraint_check_child(void) > &cstr->addr->ss), > CONSTRAINT_SCAN_INTERVAL); > } > > - if (fail || cstr->state < STATE_REPLY_RECEIVED) { > + if (fail || cstr->state < STATE_QUERY_SENT) { > cstr->senderrors++; > constraint_close(cstr->fd); > }
Thank you for the in-depth analysis. I have been running this patch the last couple of days and have not experienced any issues as well. What say you reyk?