Hi,

I've initially reported this problem a while ago and I thought problem
was related to IPv6. Now I belive it is not. I did some research and
this is what I've found. I contacted Reyk couple of days ago, but didn't
get any reply from him yet, so decided to post here.

I'm using OpenNTPD's constraint from the time it was introduced. However
I found that ntpd(8) often is getting stuck at the very beginning when
it starts and never recovers from that scenario. Usually it looks as
follows when you hit the problem:

# ntpctl -sa
0/4 peers valid, clock unsynced

peer
   wt tl st  next  poll          offset       delay      jitter
87.232.1.41 0.pool.ntp.org 
    1  2  -    0s    0s             ---- peer not valid ----
54.171.104.100 1.pool.ntp.org 
    1  2  -    0s    0s             ---- peer not valid ----
193.1.193.157 2.pool.ntp.org 
    1  2  -    0s    0s             ---- peer not valid ----
85.91.1.180 3.pool.ntp.org 
    1  2  -    0s    0s             ---- peer not valid ----

All counters are zero and all peers are not valid. For me it usually
stays like that and never recovers until rcctl restart ntpd. Then often
it fails like above again. Multiple restarts in a row solve the problem.

I did some debug modifications to ntpd and this is what I've found.
I'm running my tests on:

# sysctl -n kern.version
OpenBSD 5.7-current (GENERIC) #955: Thu May 28 13:09:53 MDT 2015
    dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC

and to keep IPv6 out of the picture I have following settings in resolv.conf:

# grep -w family /etc/resolv.conf
family inet4

The ntpd server itself is configured as follows:

# grep -ve '^#' /etc/ntpd.conf
server 0.pool.ntp.org
server 1.pool.ntp.org
server 2.pool.ntp.org
server 3.pool.ntp.org
constraints from www.google.com

I'm running debug version of ntpd as follows:

./ntpd.debug -vd 2>&1 | ts 

Here is example debug session when I hit the issue. Lines which contain
XXX MK are added by me via log_debug():

May 28 23:25:54 ntp engine ready
May 28 23:25:54 XXX MK STATE_NONE=0
May 28 23:25:54 XXX MK STATE_DNS_INPROGRESS=1
May 28 23:25:54 XXX MK STATE_DNS_TEMPFAIL=2
May 28 23:25:54 XXX MK STATE_DNS_DONE=3
May 28 23:25:54 XXX MK STATE_QUERY_SENT=4
May 28 23:25:54 XXX MK STATE_REPLY_RECEIVED=5
May 28 23:25:54 XXX MK STATE_INVALID=6
May 28 23:25:54 XXX MK ntp_main() constraint_cnt=0, constraint_median=0
May 28 23:25:54 XXX MK client_query() start
May 28 23:25:54 XXX MK client_query() start
May 28 23:25:54 XXX MK client_query() start
May 28 23:25:54 XXX MK client_query() start
May 28 23:25:54 XXX MK constraint_query() starts state=1
May 28 23:25:54 XXX MK constraint_query() fail with default?
May 28 23:25:54 XXX MK ntp_main() function constraint_query() returned -1
May 28 23:25:54 XXX MK ntp_main() about to loop with control_dispatch_msg() and 
j=3 i=3 nfds=1
May 28 23:25:54 XXX MK ntp_main() about to loop with constraint_dispatch_msg() 
and j=3 i=3 nfds=1
May 28 23:25:54 XXX MK ntp_main() constraint_cnt=0, constraint_median=0
May 28 23:25:54 XXX MK constraint_query() starts state=1
May 28 23:25:54 XXX MK constraint_query() fail with default?
May 28 23:25:54 XXX MK ntp_main() function constraint_query() returned -1
May 28 23:25:55 XXX MK ntp_main() about to loop with control_dispatch_msg() and 
j=3 i=3 nfds=0
May 28 23:25:55 XXX MK ntp_main() about to loop with constraint_dispatch_msg() 
and j=3 i=3 nfds=0
May 28 23:25:55 XXX MK ntp_main() constraint_cnt=1, constraint_median=0
May 28 23:25:55 XXX MK constraint_query() starts with 216.58.208.68 state=3
May 28 23:25:55 constraint request to 216.58.208.68
May 28 23:25:55 XXX MK constraint_query() succeeded?
May 28 23:25:55 XXX MK httpsdate_query() with 216.58.208.68
May 28 23:25:55 XXX MK httpsdate_init() with 216.58.208.68
May 28 23:25:55 XXX MK httpsdate_request() with 216.58.208.68 port 443
May 28 23:25:55 XXX MK httpsdate_request() date from 216.58.208.68 is Date: 
Thu, 28 May 2015 22:25:56 GMT
May 28 23:25:55 XXX MK httpsdate_request() success from 216.58.208.68
May 28 23:25:55 XXX MK httpsdate_query() with 216.58.208.68 done
May 28 23:25:55 XXX MK ntp_main() function poll() returned -1, errno=4 i=4
May 28 23:25:55 XXX MK ntp_main() about to loop with control_dispatch_msg() and 
j=3 i=4 nfds=-1
May 28 23:25:55 XXX MK ntp_main() about to loop with constraint_dispatch_msg() 
and j=3 i=4 nfds=-1
May 28 23:25:55 XXX MK constraint_check_child() bummer, we bump senderrors now 
senderrors=1 fail=0 state=4
May 28 23:25:55 XXX MK constraint_close() are we going to set state to 
STATE_INVALID? state=4
May 28 23:25:55 XXX MK constraint_close() somehow we set the state to 
STATE_INVALID state=6 senderrors=1
May 28 23:25:55 XXX MK ntp_main() constraint_cnt=1, constraint_median=0
May 28 23:25:55 XXX MK constraint_query() starts with 216.58.208.68 state=6
May 28 23:25:55 XXX MK constraint_query() fail with STATE_INVALID last=850 
now=850 diff=0
May 28 23:25:55 XXX MK ntp_main() function constraint_query() returned -1 for 
216.58.208.68
May 28 23:25:56 XXX MK ntp_main() about to loop with control_dispatch_msg() and 
j=3 i=3 nfds=0
May 28 23:25:56 XXX MK ntp_main() about to loop with constraint_dispatch_msg() 
and j=3 i=3 nfds=0
May 28 23:25:56 XXX MK ntp_main() constraint_cnt=1, constraint_median=0
May 28 23:25:56 XXX MK constraint_query() starts with 216.58.208.68 state=6
May 28 23:25:56 XXX MK constraint_query() fail with STATE_INVALID last=850 
now=851 diff=1
May 28 23:25:56 XXX MK ntp_main() function constraint_query() returned -1 for 
216.58.208.68
May 28 23:25:57 XXX MK ntp_main() about to loop with control_dispatch_msg() and 
j=3 i=3 nfds=0
May 28 23:25:57 XXX MK ntp_main() about to loop with constraint_dispatch_msg() 
and j=3 i=3 nfds=0
May 28 23:25:57 XXX MK ntp_main() constraint_cnt=1, constraint_median=0
May 28 23:25:57 XXX MK constraint_query() starts with 216.58.208.68 state=6
May 28 23:25:57 XXX MK constraint_query() fail with STATE_INVALID last=850 
now=852 diff=2
May 28 23:25:57 XXX MK ntp_main() function constraint_query() returned -1 for 
216.58.208.68
May 28 23:25:58 XXX MK ntp_main() about to loop with control_dispatch_msg() and 
j=3 i=3 nfds=0
May 28 23:25:58 XXX MK ntp_main() about to loop with constraint_dispatch_msg() 
and j=3 i=3 nfds=0
May 28 23:25:58 XXX MK ntp_main() constraint_cnt=1, constraint_median=0
May 28 23:25:58 XXX MK constraint_query() starts with 216.58.208.68 state=6
May 28 23:25:58 XXX MK constraint_query() fail with STATE_INVALID last=850 
now=853 diff=3
May 28 23:25:58 XXX MK ntp_main() function constraint_query() returned -1 for 
216.58.208.68
May 28 23:26:00 XXX MK ntp_main() about to loop with control_dispatch_msg() and 
j=3 i=3 nfds=0
May 28 23:26:00 XXX MK ntp_main() about to loop with constraint_dispatch_msg() 
and j=3 i=3 nfds=0
May 28 23:26:00 XXX MK ntp_main() constraint_cnt=1, constraint_median=0
May 28 23:26:00 XXX MK constraint_query() starts with 216.58.208.68 state=6
May 28 23:26:00 XXX MK constraint_query() fail with STATE_INVALID last=850 
now=854 diff=4
May 28 23:26:00 XXX MK ntp_main() function constraint_query() returned -1 for 
216.58.208.68
May 28 23:26:01 XXX MK ntp_main() about to loop with control_dispatch_msg() and 
j=3 i=3 nfds=0
May 28 23:26:01 XXX MK ntp_main() about to loop with constraint_dispatch_msg() 
and j=3 i=3 nfds=0
May 28 23:26:01 XXX MK ntp_main() constraint_cnt=1, constraint_median=0
May 28 23:26:01 XXX MK constraint_query() starts with 216.58.208.68 state=6
May 28 23:26:01 XXX MK constraint_query() fail with STATE_INVALID last=850 
now=855 diff=5
May 28 23:26:01 XXX MK ntp_main() function constraint_query() returned -1 for 
216.58.208.68

and from that point constraint_query() stays with STATE_INVALID until time
pases CONSTRAINT_SCAN_INTERVAL seconds.

So what I understand is happening above:

In ntp.c:
   268                  TAILQ_FOREACH(p, &conf->ntp_peers, entry) {
   269                          if (constraint_cnt && conf->constraint_median 
== 0)
   270                                  continue;

constraint_cnt is 1 but conf->constraint_median is always 0, so that
part of the code always execute continue. Which means client_query()
doesn't happen any more.

In constraint.c:
   162          case STATE_INVALID:
   163                  if (cstr->last + CONSTRAINT_SCAN_INTERVAL > now) {
   164                          /* Nothing to do */
   165                          return (-1);
   166                  }

Function constraint_query() returns -1 and only few times it succeeds.
When constraint_query() actually succeded, poll() function returns -1
and from that point re-execution of constraint_query() always fails
becaues of STATE_INVALID case (see above).

In constraint.c:
   282                          if (fail || cstr->state < STATE_REPLY_RECEIVED) 
{
   283                                  cstr->senderrors++;
   284                                  constraint_close(cstr->fd);
   285                          }

When poll() returns -1 the errno is set to EINTR, which in turns goes to
SIGCHLD handler, which goes to constraint_check_child() and in line 282
(see above) cstr->state is set to STATE_QUERY_SENT so cstr->senderrors
is bumped and constraint_close() is executed.

In constraint.c:
   347                  if (cstr->senderrors)
   348                          cstr->state = STATE_INVALID;

In constraint_close() because cstr->senderrors are non-zero state is set
to STATE_INVALID and from that point ntpd is stuck. 

I'm not sure what would be the proper way to solve this, but very
minimalisic patch as below fixed the issue for me:

Index: constraint.c
===================================================================
RCS file: /cvs/src/usr.sbin/ntpd/constraint.c,v
retrieving revision 1.12
diff -u -p -u -r1.12 constraint.c
--- constraint.c        28 May 2015 21:34:36 -0000      1.12
+++ constraint.c        28 May 2015 23:14:47 -0000
@@ -279,7 +279,7 @@ constraint_check_child(void)
                                    &cstr->addr->ss), CONSTRAINT_SCAN_INTERVAL);
                        }
 
-                       if (fail || cstr->state < STATE_REPLY_RECEIVED) {
+                       if (fail || cstr->state < STATE_QUERY_SENT) {
                                cstr->senderrors++;
                                constraint_close(cstr->fd);
                        }


a bit more verbose change, which may (or may not) be a more clear is:

Index: constraint.c
===================================================================
RCS file: /cvs/src/usr.sbin/ntpd/constraint.c,v
retrieving revision 1.12
diff -u -p -u -r1.12 constraint.c
--- constraint.c        28 May 2015 21:34:36 -0000      1.12
+++ constraint.c        28 May 2015 23:16:57 -0000
@@ -272,6 +272,8 @@ constraint_check_child(void)
                                fatalx("constraint %s, signal %d", 
                                    log_sockaddr((struct sockaddr *)
                                    &cstr->addr->ss), sig);
+                       if (!fail && cstr->state == STATE_QUERY_SENT)
+                               continue;
                        if (fail) {
                                log_debug("no constraint reply from %s"
                                    " received in time, next query %ds",


I'm running with the first patch for last couple of days and it works
for me, but I'm not sure is this the correct(tm) approach.


On Tue, Mar 31, 2015 at 02:16:31AM +0100, Mikolaj Kucharski wrote:
> Hi,
> 
> I have following ntpd.conf file on an OpenBSD machine:
> 
> OpenBSD 5.7-current (GENERIC) #839: Mon Mar 30 14:21:47 MDT 2015
>     dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC
> 
> # /etc/ntpd.conf
> server 0.pool.ntp.org
> server 1.pool.ntp.org
> server 2.pool.ntp.org
> server 3.pool.ntp.org
> constraints from www.google.com
> 
> I've noticed that ntpd(8) often ends up with all peers in 'peer not valid'
> state, never syncing the time. I've enabled debug logging and this is
> what I see in the logs:
> 
> Mar 31 00:29:38 openbsd57 ntpd[12097]: ntp engine ready
> Mar 31 00:29:39 openbsd57 ntpd[12097]: constraint request to 
> 2a00:1450:400b:800::2004
> Mar 31 00:29:39 openbsd57 ntpd[12097]: constraint request to 216.58.211.132
> Mar 31 00:29:39 openbsd57 ntpd[30212]: tls failed: 2a00:1450:400b:800::2004: 
> connect
> Mar 31 00:29:39 openbsd57 ntpd[12097]: no constraint reply from 
> 2a00:1450:400b:800::2004 received in time, next query 900s
> Mar 31 00:34:37 openbsd57 ntpd[12097]: 0 out of 4 peers valid
> Mar 31 00:34:37 openbsd57 ntpd[12097]: bad peer 0.pool.ntp.org 
> (54.194.18.100) 
> Mar 31 00:34:37 openbsd57 ntpd[12097]: bad peer 1.pool.ntp.org 
> (149.157.192.5) 
> Mar 31 00:34:37 openbsd57 ntpd[12097]: bad peer 2.pool.ntp.org 
> (193.1.219.116) 
> Mar 31 00:34:37 openbsd57 ntpd[12097]: bad peer 3.pool.ntp.org (85.91.1.164) 
> Mar 31 00:44:39 openbsd57 ntpd[12097]: constraint request to 
> 2a00:1450:400b:800::2004
> Mar 31 00:44:39 openbsd57 ntpd[12097]: constraint request to 216.58.211.132
> Mar 31 00:44:39 openbsd57 ntpd[1720]: tls failed: 2a00:1450:400b:800::2004: 
> connect
> Mar 31 00:44:39 openbsd57 ntpd[12097]: no constraint reply from 
> 2a00:1450:400b:800::2004 received in time, next query 900s
> Mar 31 00:59:39 openbsd57 ntpd[12097]: constraint request to 
> 2a00:1450:400b:800::2004
> Mar 31 00:59:39 openbsd57 ntpd[12097]: constraint request to 216.58.211.132
> Mar 31 00:59:39 openbsd57 ntpd[13994]: tls failed: 2a00:1450:400b:800::2004: 
> connect
> Mar 31 00:59:39 openbsd57 ntpd[12097]: no constraint reply from 
> 2a00:1450:400b:800::2004 received in time, next query 900s
> 
> 
> That box doesn't have IPv6 configured:
> 
> # ifconfig
> lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 32768
>         priority: 0
>         groups: lo
>         inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
>         inet6 ::1 prefixlen 128
>         inet 127.0.0.1 netmask 0xff000000
> vio0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>         lladdr 52:54:00:12:34:72
>         priority: 0
>         groups: egress
>         media: Ethernet autoselect
>         status: active
>         inet 172.16.0.52 netmask 0xffffff00 broadcast 172.16.0.255
> enc0: flags=0<>
>         priority: 0
>         groups: enc
>         status: active
> 
> 
> # nc -v 2a00:1450:400b:800::2004 443
> nc: connect to 2a00:1450:400b:800::2004 port 443 (tcp) failed: No route to 
> host
> 

-- 
best regards
q#

Reply via email to