On Thu, May 28, 2015 at 5:24 PM, Mikolaj Kucharski
<miko...@kucharski.name> wrote:
> Hi,
>
> I've initially reported this problem a while ago and I thought problem
> was related to IPv6. Now I belive it is not. I did some research and
> this is what I've found. I contacted Reyk couple of days ago, but didn't
> get any reply from him yet, so decided to post here.
>
> I'm using OpenNTPD's constraint from the time it was introduced. However
> I found that ntpd(8) often is getting stuck at the very beginning when
> it starts and never recovers from that scenario. Usually it looks as
> follows when you hit the problem:
>
> # ntpctl -sa
> 0/4 peers valid, clock unsynced
>
> peer
>    wt tl st  next  poll          offset       delay      jitter
> 87.232.1.41 0.pool.ntp.org
>     1  2  -    0s    0s             ---- peer not valid ----
> 54.171.104.100 1.pool.ntp.org
>     1  2  -    0s    0s             ---- peer not valid ----
> 193.1.193.157 2.pool.ntp.org
>     1  2  -    0s    0s             ---- peer not valid ----
> 85.91.1.180 3.pool.ntp.org
>     1  2  -    0s    0s             ---- peer not valid ----
>
> All counters are zero and all peers are not valid. For me it usually
> stays like that and never recovers until rcctl restart ntpd. Then often
> it fails like above again. Multiple restarts in a row solve the problem.
>
> I did some debug modifications to ntpd and this is what I've found.
> I'm running my tests on:
>
> # sysctl -n kern.version
> OpenBSD 5.7-current (GENERIC) #955: Thu May 28 13:09:53 MDT 2015
>     dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC
>
> and to keep IPv6 out of the picture I have following settings in resolv.conf:
>
> # grep -w family /etc/resolv.conf
> family inet4
>
> The ntpd server itself is configured as follows:
>
> # grep -ve '^#' /etc/ntpd.conf
> server 0.pool.ntp.org
> server 1.pool.ntp.org
> server 2.pool.ntp.org
> server 3.pool.ntp.org
> constraints from www.google.com
>
> I'm running debug version of ntpd as follows:
>
> ./ntpd.debug -vd 2>&1 | ts
>
> Here is example debug session when I hit the issue. Lines which contain
> XXX MK are added by me via log_debug():
>
> May 28 23:25:54 ntp engine ready
> May 28 23:25:54 XXX MK STATE_NONE=0
> May 28 23:25:54 XXX MK STATE_DNS_INPROGRESS=1
> May 28 23:25:54 XXX MK STATE_DNS_TEMPFAIL=2
> May 28 23:25:54 XXX MK STATE_DNS_DONE=3
> May 28 23:25:54 XXX MK STATE_QUERY_SENT=4
> May 28 23:25:54 XXX MK STATE_REPLY_RECEIVED=5
> May 28 23:25:54 XXX MK STATE_INVALID=6
> May 28 23:25:54 XXX MK ntp_main() constraint_cnt=0, constraint_median=0
> May 28 23:25:54 XXX MK client_query() start
> May 28 23:25:54 XXX MK client_query() start
> May 28 23:25:54 XXX MK client_query() start
> May 28 23:25:54 XXX MK client_query() start
> May 28 23:25:54 XXX MK constraint_query() starts state=1
> May 28 23:25:54 XXX MK constraint_query() fail with default?
> May 28 23:25:54 XXX MK ntp_main() function constraint_query() returned -1
> May 28 23:25:54 XXX MK ntp_main() about to loop with control_dispatch_msg() 
> and j=3 i=3 nfds=1
> May 28 23:25:54 XXX MK ntp_main() about to loop with 
> constraint_dispatch_msg() and j=3 i=3 nfds=1
> May 28 23:25:54 XXX MK ntp_main() constraint_cnt=0, constraint_median=0
> May 28 23:25:54 XXX MK constraint_query() starts state=1
> May 28 23:25:54 XXX MK constraint_query() fail with default?
> May 28 23:25:54 XXX MK ntp_main() function constraint_query() returned -1
> May 28 23:25:55 XXX MK ntp_main() about to loop with control_dispatch_msg() 
> and j=3 i=3 nfds=0
> May 28 23:25:55 XXX MK ntp_main() about to loop with 
> constraint_dispatch_msg() and j=3 i=3 nfds=0
> May 28 23:25:55 XXX MK ntp_main() constraint_cnt=1, constraint_median=0
> May 28 23:25:55 XXX MK constraint_query() starts with 216.58.208.68 state=3
> May 28 23:25:55 constraint request to 216.58.208.68
> May 28 23:25:55 XXX MK constraint_query() succeeded?
> May 28 23:25:55 XXX MK httpsdate_query() with 216.58.208.68
> May 28 23:25:55 XXX MK httpsdate_init() with 216.58.208.68
> May 28 23:25:55 XXX MK httpsdate_request() with 216.58.208.68 port 443
> May 28 23:25:55 XXX MK httpsdate_request() date from 216.58.208.68 is Date: 
> Thu, 28 May 2015 22:25:56 GMT
> May 28 23:25:55 XXX MK httpsdate_request() success from 216.58.208.68
> May 28 23:25:55 XXX MK httpsdate_query() with 216.58.208.68 done
> May 28 23:25:55 XXX MK ntp_main() function poll() returned -1, errno=4 i=4
> May 28 23:25:55 XXX MK ntp_main() about to loop with control_dispatch_msg() 
> and j=3 i=4 nfds=-1
> May 28 23:25:55 XXX MK ntp_main() about to loop with 
> constraint_dispatch_msg() and j=3 i=4 nfds=-1
> May 28 23:25:55 XXX MK constraint_check_child() bummer, we bump senderrors 
> now senderrors=1 fail=0 state=4
> May 28 23:25:55 XXX MK constraint_close() are we going to set state to 
> STATE_INVALID? state=4
> May 28 23:25:55 XXX MK constraint_close() somehow we set the state to 
> STATE_INVALID state=6 senderrors=1
> May 28 23:25:55 XXX MK ntp_main() constraint_cnt=1, constraint_median=0
> May 28 23:25:55 XXX MK constraint_query() starts with 216.58.208.68 state=6
> May 28 23:25:55 XXX MK constraint_query() fail with STATE_INVALID last=850 
> now=850 diff=0
> May 28 23:25:55 XXX MK ntp_main() function constraint_query() returned -1 for 
> 216.58.208.68
> May 28 23:25:56 XXX MK ntp_main() about to loop with control_dispatch_msg() 
> and j=3 i=3 nfds=0
> May 28 23:25:56 XXX MK ntp_main() about to loop with 
> constraint_dispatch_msg() and j=3 i=3 nfds=0
> May 28 23:25:56 XXX MK ntp_main() constraint_cnt=1, constraint_median=0
> May 28 23:25:56 XXX MK constraint_query() starts with 216.58.208.68 state=6
> May 28 23:25:56 XXX MK constraint_query() fail with STATE_INVALID last=850 
> now=851 diff=1
> May 28 23:25:56 XXX MK ntp_main() function constraint_query() returned -1 for 
> 216.58.208.68
> May 28 23:25:57 XXX MK ntp_main() about to loop with control_dispatch_msg() 
> and j=3 i=3 nfds=0
> May 28 23:25:57 XXX MK ntp_main() about to loop with 
> constraint_dispatch_msg() and j=3 i=3 nfds=0
> May 28 23:25:57 XXX MK ntp_main() constraint_cnt=1, constraint_median=0
> May 28 23:25:57 XXX MK constraint_query() starts with 216.58.208.68 state=6
> May 28 23:25:57 XXX MK constraint_query() fail with STATE_INVALID last=850 
> now=852 diff=2
> May 28 23:25:57 XXX MK ntp_main() function constraint_query() returned -1 for 
> 216.58.208.68
> May 28 23:25:58 XXX MK ntp_main() about to loop with control_dispatch_msg() 
> and j=3 i=3 nfds=0
> May 28 23:25:58 XXX MK ntp_main() about to loop with 
> constraint_dispatch_msg() and j=3 i=3 nfds=0
> May 28 23:25:58 XXX MK ntp_main() constraint_cnt=1, constraint_median=0
> May 28 23:25:58 XXX MK constraint_query() starts with 216.58.208.68 state=6
> May 28 23:25:58 XXX MK constraint_query() fail with STATE_INVALID last=850 
> now=853 diff=3
> May 28 23:25:58 XXX MK ntp_main() function constraint_query() returned -1 for 
> 216.58.208.68
> May 28 23:26:00 XXX MK ntp_main() about to loop with control_dispatch_msg() 
> and j=3 i=3 nfds=0
> May 28 23:26:00 XXX MK ntp_main() about to loop with 
> constraint_dispatch_msg() and j=3 i=3 nfds=0
> May 28 23:26:00 XXX MK ntp_main() constraint_cnt=1, constraint_median=0
> May 28 23:26:00 XXX MK constraint_query() starts with 216.58.208.68 state=6
> May 28 23:26:00 XXX MK constraint_query() fail with STATE_INVALID last=850 
> now=854 diff=4
> May 28 23:26:00 XXX MK ntp_main() function constraint_query() returned -1 for 
> 216.58.208.68
> May 28 23:26:01 XXX MK ntp_main() about to loop with control_dispatch_msg() 
> and j=3 i=3 nfds=0
> May 28 23:26:01 XXX MK ntp_main() about to loop with 
> constraint_dispatch_msg() and j=3 i=3 nfds=0
> May 28 23:26:01 XXX MK ntp_main() constraint_cnt=1, constraint_median=0
> May 28 23:26:01 XXX MK constraint_query() starts with 216.58.208.68 state=6
> May 28 23:26:01 XXX MK constraint_query() fail with STATE_INVALID last=850 
> now=855 diff=5
> May 28 23:26:01 XXX MK ntp_main() function constraint_query() returned -1 for 
> 216.58.208.68
>
> and from that point constraint_query() stays with STATE_INVALID until time
> pases CONSTRAINT_SCAN_INTERVAL seconds.
>
> So what I understand is happening above:
>
> In ntp.c:
>    268                  TAILQ_FOREACH(p, &conf->ntp_peers, entry) {
>    269                          if (constraint_cnt && conf->constraint_median 
> == 0)
>    270                                  continue;
>
> constraint_cnt is 1 but conf->constraint_median is always 0, so that
> part of the code always execute continue. Which means client_query()
> doesn't happen any more.
>
> In constraint.c:
>    162          case STATE_INVALID:
>    163                  if (cstr->last + CONSTRAINT_SCAN_INTERVAL > now) {
>    164                          /* Nothing to do */
>    165                          return (-1);
>    166                  }
>
> Function constraint_query() returns -1 and only few times it succeeds.
> When constraint_query() actually succeded, poll() function returns -1
> and from that point re-execution of constraint_query() always fails
> becaues of STATE_INVALID case (see above).
>
> In constraint.c:
>    282                          if (fail || cstr->state < 
> STATE_REPLY_RECEIVED) {
>    283                                  cstr->senderrors++;
>    284                                  constraint_close(cstr->fd);
>    285                          }
>
> When poll() returns -1 the errno is set to EINTR, which in turns goes to
> SIGCHLD handler, which goes to constraint_check_child() and in line 282
> (see above) cstr->state is set to STATE_QUERY_SENT so cstr->senderrors
> is bumped and constraint_close() is executed.
>
> In constraint.c:
>    347                  if (cstr->senderrors)
>    348                          cstr->state = STATE_INVALID;
>
> In constraint_close() because cstr->senderrors are non-zero state is set
> to STATE_INVALID and from that point ntpd is stuck.
>
> I'm not sure what would be the proper way to solve this, but very
> minimalisic patch as below fixed the issue for me:
>
> Index: constraint.c
> ===================================================================
> RCS file: /cvs/src/usr.sbin/ntpd/constraint.c,v
> retrieving revision 1.12
> diff -u -p -u -r1.12 constraint.c
> --- constraint.c        28 May 2015 21:34:36 -0000      1.12
> +++ constraint.c        28 May 2015 23:14:47 -0000
> @@ -279,7 +279,7 @@ constraint_check_child(void)
>                                     &cstr->addr->ss), 
> CONSTRAINT_SCAN_INTERVAL);
>                         }
>
> -                       if (fail || cstr->state < STATE_REPLY_RECEIVED) {
> +                       if (fail || cstr->state < STATE_QUERY_SENT) {
>                                 cstr->senderrors++;
>                                 constraint_close(cstr->fd);
>                         }

Thank you for the in-depth analysis. I have been running this patch
the last couple
of days and have not experienced any issues as well. What say you reyk?

Reply via email to