Re: crash
Hal,

On 14-04-2020 05:07, Hal Murray wrote:
> I just pushed a fix. Please test.

With this fix, ntpd has now been running for a few hours without issue.

Udo
___ devel mailing list devel@ntpsec.org http://lists.ntpsec.org/mailman/listinfo/devel
Re: crash
> -rw------- 1 root root 1708 Dec 13 11:05 ./keys/_key-certbot.pem
> Anything wrong in here?

Your configure line includes early-droproot, and your command line includes -u ntp:ntp. With that combination, it's probably trying to read the key after switching to user ntp.

--
These are my opinions. I hate spam.
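A rough way to answer "Anything wrong in here?" from the listing alone: once ntpd has switched to user ntp (assumed here to be neither root nor a member of group root), only the "other" permission bits apply to these root:root files, and every directory on the path needs search (x) permission in addition to read permission on the file itself. The sketch below applies that rule to the modes from Udo's `ls -ld` output; the modes of `/` and `/etc` are assumed, and ACLs, capabilities and SELinux are ignored.

```python
from typing import List

# Minimal sketch: can a user who is neither the owner nor in the owning group
# read a file, judging only by the symbolic modes printed by `ls -ld`?
# Ignores ACLs, Linux capabilities and SELinux.

def other_has(mode: str, perm: str) -> bool:
    """Check one bit of the 'other' triplet of a mode such as 'drwxr-xr-x'."""
    if len(mode) != 10:
        raise ValueError(f"unexpected mode string: {mode!r}")
    if perm == "r":
        return mode[7] == "r"
    if perm == "w":
        return mode[8] == "w"
    if perm == "x":
        # 't' = search permission plus sticky bit; 'T' = sticky bit only.
        return mode[9] in ("x", "t")
    raise ValueError(f"unknown permission: {perm!r}")


def other_can_read(dir_modes: List[str], file_mode: str) -> bool:
    """Every directory on the path must be searchable and the file readable."""
    return all(other_has(m, "x") for m in dir_modes) and other_has(file_mode, "r")


# /, /etc (both assumed) and /etc/letsencrypt (from the listing)
BASE = ["drwxr-xr-x", "drwxr-xr-x", "drwxr-xr-x"]

key_ok = other_can_read(BASE + ["drwx------"], "-rw-------")   # keys/_key-certbot.pem
cert_ok = other_can_read(BASE + ["drwxr-xr-x"], "-rw-r--r--")  # csr/_csr-certbot.pem
print(f"key readable as ntp: {key_ok}, cert readable as ntp: {cert_ok}")
```

On these modes the key path fails twice (the `keys` directory and the file itself), while the CSR path is readable, which fits Hal's guess that the key is the problem.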
Re: crash
On 14-04-2020 07:22, Hal Murray wrote:
> Given that you have tested most of the rest of your ntp.conf, my guess would
> be file permissions on the certificate or key. The key is most likely since
> there is no reason to hide the certificate.

# cd /etc/letsencrypt/
# find . -exec ls -ld {} \;
drwxr-xr-x 7 root root 4096 Mar 5 09:37 .
drwxr-xr-x 2 root root 4096 Dec 13 11:05 ./csr
-rw-r--r-- 1 root root 932 Dec 13 11:05 ./csr/_csr-certbot.pem
drwxr-xr-x 2 root root 4096 Dec 13 11:05 ./renewal
drwxr-xr-x 3 root root 4096 Dec 13 11:05 ./accounts
drwxr-xr-x 3 root root 4096 Dec 13 11:05 ./accounts/acme-v02.api.letsencrypt.org
drwx------ 3 root root 4096 Dec 13 11:05 ./accounts/acme-v02.api.letsencrypt.org/directory
drwx------ 2 root root 4096 Dec 13 11:05 ./accounts/acme-v02.api.letsencrypt.org/directory/020c96242a59060882fc55ae933cc35e
-rw-r--r-- 1 root root 78 Dec 13 11:05 ./accounts/acme-v02.api.letsencrypt.org/directory/020c96242a59060882fc55ae933cc35e/regr.json
-r-------- 1 root root 1632 Dec 13 11:05 ./accounts/acme-v02.api.letsencrypt.org/directory/020c96242a59060882fc55ae933cc35e/private_key.json
-rw-r--r-- 1 root root 77 Dec 13 11:05 ./accounts/acme-v02.api.letsencrypt.org/directory/020c96242a59060882fc55ae933cc35e/meta.json
drwxr-xr-x 5 root root 4096 Dec 13 11:05 ./renewal-hooks
drwxr-xr-x 2 root root 4096 Dec 13 11:05 ./renewal-hooks/deploy
drwxr-xr-x 2 root root 4096 Dec 13 11:05 ./renewal-hooks/post
drwxr-xr-x 2 root root 4096 Dec 13 11:05 ./renewal-hooks/pre
drwx------ 2 root root 4096 Dec 13 11:05 ./keys
-rw------- 1 root root 1708 Dec 13 11:05 ./keys/_key-certbot.pem

Anything wrong in here?

Udo
Re: crash
udo...@xs4all.nl said:
>> If you want the server side to support NTS, you need to add "nts enable"
> With that in ntp.conf the ntpd does not start. Config needed I guess.

The log file should have a useful message. It may take more than a few seconds to find due to all the cruft that is useful in other contexts. Start at the end and work back.

Given that you have tested most of the rest of your ntp.conf, my guess would be file permissions on the certificate or key. The key is most likely since there is no reason to hide the certificate.

--
These are my opinions. I hate spam.
Re: crash
On 14-04-2020 05:07, Hal Murray wrote:
>
>> # grep nts /etc/ntp.conf
>> nts key /etc/letsencrypt/keys/_key-certbot.pem
>> nts cert /etc/letsencrypt/csr/_csr-certbot.pem
>> server time.cloudflare.com:1234 nts # TLS1.3 only
> ...
>
> Thanks.
>
> I just pushed a fix. Please test.

Will do; building the rpm right now.

> If you want the server side to support NTS, you need to add "nts enable"

With that in ntp.conf, ntpd does not start. More configuration is needed, I guess.

Udo
Re: crash
On 13-04-2020 20:18, Hal Murray wrote:
> It's dying while trying to reload the certificate file.
>
> Is that happening after running for an hour?

Yes.

>
> That turns into 2 questions. Why is it trying to reload the certificates,
> and why is it crashing?
>
> What's in your ntp.conf? I don't need the whole thing, just the lines with
> "nts".

# grep nts /etc/ntp.conf
nts key /etc/letsencrypt/keys/_key-certbot.pem
nts cert /etc/letsencrypt/csr/_csr-certbot.pem
server time.cloudflare.com:1234 nts # TLS1.3 only
server ntpmon.dcs1.biz nts
server pi4.rellim.com nts
server ntp1.glypnod.com nts
server ntp2.glypnod.com nts

> Did this configuration work before a recent git pull?

No.

Udo
Re: crash
On 13-04-2020 19:39, Hal Murray wrote:
>> Or will I do the debug build?
>
> Please do it again with symbols.
>
> How long does it run before it crashes? Seconds? Hours? ...

(gdb) bt
#0 use_certificate_chain_file (ctx=ctx@entry=0x0, ssl=ssl@entry=0x0, file=file@entry=0x555f9640 "/etc/letsencrypt/csr/_csr-certbot.pem") at ssl/ssl_rsa.c:604
#1 0x77c5c36e in SSL_CTX_use_certificate_chain_file (ctx=ctx@entry=0x0, file=file@entry=0x555f9640 "/etc/letsencrypt/csr/_csr-certbot.pem") at ssl/ssl_rsa.c:688
#2 0x5558312e in nts_load_certificate (ctx=ctx@entry=0x0) at ../../ntpd/nts.c:225
#3 0x555832bc in nts_reload_certificate (ctx=0x0) at ../../ntpd/nts.c:204
#4 0x555840d5 in check_cert_file () at ../../ntpd/nts_server.c:171
#5 0x5558414d in nts_cert_timer () at ../../ntpd/nts_server.c:163
#6 0x55582d59 in nts_timer () at ../../ntpd/nts.c:107
#7 0x555739cd in timer () at ../../ntpd/ntp_timer.c:284
#8 0x55562051 in mainloop () at ../../ntpd/ntpd.c:940
#9 main (argc=, argv=) at ../../ntpd/ntpd.c:884
(gdb)

An hour or so?

Udo
Re: crash
I think I've found a way for that to happen. Were you missing "nts enable" in your config file, but did you have an "nts cert ..." line pointing to a valid file?

--
These are my opinions. I hate spam.
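If that combination is the trigger, it can be spotted from the config file alone. Below is a hypothetical lint check (not an NTPsec tool) that flags `nts cert` or `nts key` without `nts enable`, applied to a subset of the lines Udo posted. It only looks at the first two words of each line after stripping `#` comments.

```python
# Hypothetical lint check (not part of NTPsec): flag an ntp.conf that sets
# "nts cert"/"nts key" but never says "nts enable".

def nts_directives(conf_text: str) -> set:
    """Return the set of 'nts <subcommand>' words used in an ntp.conf text."""
    found = set()
    for raw in conf_text.splitlines():
        words = raw.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(words) >= 2 and words[0] == "nts":
            found.add(words[1])
    return found


def cert_without_enable(conf_text: str) -> bool:
    used = nts_directives(conf_text)
    return bool(used & {"cert", "key"}) and "enable" not in used


udo_conf = """\
nts key /etc/letsencrypt/keys/_key-certbot.pem
nts cert /etc/letsencrypt/csr/_csr-certbot.pem
server time.cloudflare.com:1234 nts # TLS1.3 only
server ntpmon.dcs1.biz nts
"""
print(cert_without_enable(udo_conf))                   # True
print(cert_without_enable(udo_conf + "nts enable\n"))  # False
```

The `server ... nts` lines are ignored because their first word is `server`; only top-level `nts` directives count.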
Re: crash
Thanks.

It's dying while trying to reload the certificate file. Is that happening after running for an hour?

That turns into 2 questions. Why is it trying to reload the certificates, and why is it crashing?

What's in your ntp.conf? I don't need the whole thing, just the lines with "nts".

Did this configuration work before a recent git pull?

One of your earlier messages had some logging, but I didn't see the NTS messages I expect. With the latest run, did it say anything about loading certificates during initialization? I expect 3 lines like this:

2 Apr 13:12:11 ntpd[685]: NTSs: loaded certificate (chain) from xxx
2 Apr 13:12:11 ntpd[685]: NTSs: loaded private key from xxx
2 Apr 13:12:11 ntpd[685]: NTSs: Private Key OK

Is there anything interesting with the permissions on the certificate or key files? You built with early-droproot, so I think it has already switched to user ntp when it loads them during initialization.

I'm trying to figure out why it's trying to reload them. Either there is a bug in the reload logic, or it didn't load them the first try and the error didn't get handled correctly.

--
These are my opinions. I hate spam.
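Checking for those three lines by hand is easy to get wrong in a long log. A small sketch that reports which of the expected initialization messages are missing from a log excerpt; the message texts are copied from Hal's example, while the surrounding log format is an assumption.

```python
import re
from typing import List

# Sketch: report which of the three expected NTS server init messages are
# absent from a log excerpt. Patterns match anywhere on a line.
EXPECTED = (
    r"NTSs: loaded certificate \(chain\) from ",
    r"NTSs: loaded private key from ",
    r"NTSs: Private Key OK",
)

def missing_nts_init_lines(log_text: str) -> List[str]:
    """Return the expected patterns that never appear in the log text."""
    return [pat for pat in EXPECTED if not re.search(pat, log_text)]

sample = """\
2 Apr 13:12:11 ntpd[685]: NTSs: loaded certificate (chain) from xxx
2 Apr 13:12:11 ntpd[685]: NTSs: loaded private key from xxx
"""
print(missing_nts_init_lines(sample))  # ['NTSs: Private Key OK']
```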
Re: crash
> Or will I do the debug build?

Please do it again with symbols.

How long does it run before it crashes? Seconds? Hours? ...

--
These are my opinions. I hate spam.
Re: crash
On 13-04-2020 16:01, Hal Murray wrote:
>
> udo...@xs4all.nl said:
>> Started things this way. One gdb line worries me a bit: (No debugging symbols
>> found in build/main/ntpd/ntpd)
>
>> Perhaps a different build is needed?
>
> I'm not sure how that stuff works.
>
> configure has an --enable-debug-gdb option. That may do it.

With that option and some debuginfos installed I get:

Thread 1 "ntpd" received signal SIGSEGV, Segmentation fault.
use_certificate_chain_file (ctx=ctx@entry=0x0, ssl=ssl@entry=0x0, file=file@entry=0x555f9640 "/etc/letsencrypt/csr/_csr-certbot.pem") at ssl/ssl_rsa.c:604
604 passwd_callback = ssl->default_passwd_callback;
Missing separate debuginfos, use: dnf debuginfo-install libgpg-error-1.36-2.fc31.x86_64
(gdb) bt
#0 use_certificate_chain_file (ctx=ctx@entry=0x0, ssl=ssl@entry=0x0, file=file@entry=0x555f9640 "/etc/letsencrypt/csr/_csr-certbot.pem") at ssl/ssl_rsa.c:604
#1 0x77c5c36e in SSL_CTX_use_certificate_chain_file (ctx=ctx@entry=0x0, file=file@entry=0x555f9640 "/etc/letsencrypt/csr/_csr-certbot.pem") at ssl/ssl_rsa.c:688
#2 0x5558312e in nts_load_certificate (ctx=ctx@entry=0x0) at ../../ntpd/nts.c:225
#3 0x555832bc in nts_reload_certificate (ctx=0x0) at ../../ntpd/nts.c:204
#4 0x555840d5 in check_cert_file () at ../../ntpd/nts_server.c:171
#5 0x5558414d in nts_cert_timer () at ../../ntpd/nts_server.c:163
#6 0x55582d59 in nts_timer () at ../../ntpd/nts.c:107
#7 0x555739cd in timer () at ../../ntpd/ntp_timer.c:284
#8 0x55562051 in mainloop () at ../../ntpd/ntpd.c:940
#9 main (argc=, argv=) at ../../ntpd/ntpd.c:884
(gdb)

Hopefully this helps fix the issue.

Kind regards,
Udo
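The backtrace shows `ctx=0x0` being passed all the way down from `nts_reload_certificate()`, i.e. a periodic timer reloading a certificate into a server TLS context that was never created. A minimal sketch of that pattern and one plausible guard, with Python's `ssl` module standing in for OpenSSL's `SSL_CTX_*` calls. The class and method names are hypothetical, and this is not the fix Hal pushed, which the thread does not show.

```python
import os
import ssl
from typing import Optional

# Sketch of the failure pattern: a cert-reload timer must not touch a server
# context that only exists when the NTS server side is enabled.
# All names here are hypothetical.

class NtsServerState:
    def __init__(self, cert_path: str, key_path: str, enabled: bool):
        self.cert_path = cert_path
        self.key_path = key_path
        # The context is only created when the server side is enabled.
        self.ctx: Optional[ssl.SSLContext] = (
            ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER) if enabled else None
        )
        self.last_mtime: Optional[float] = None

    def cert_timer(self) -> str:
        """Called periodically; reload only if there is a context to reload into."""
        if self.ctx is None:
            return "skipped: NTS server not enabled"  # the guard the crash lacked
        mtime = os.stat(self.cert_path).st_mtime
        if mtime == self.last_mtime:
            return "unchanged"
        self.ctx.load_cert_chain(self.cert_path, self.key_path)
        self.last_mtime = mtime
        return "reloaded"


print(NtsServerState("cert.pem", "key.pem", enabled=False).cert_timer())
```

With the server side disabled the timer returns early and never touches the file or the missing context.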
Re: crash
> > udo...@xs4all.nl said:
> >> Started things this way. One gdb line worries me a bit: (No debugging
> >> symbols
> >> found in build/main/ntpd/ntpd)
> >
> >> Perhaps a different build is needed?
> >
> > I'm not sure how that stuff works.
> >
> > configure has an --enable-debug-gdb option. That may do it.
> Without that option I get:
>
> Thread 1 "ntpd" received signal SIGSEGV, Segmentation fault.
> 0x77c5ba70 in use_certificate_chain_file () from
> /lib64/libssl.so.1.1
> Missing separate debuginfos, use: dnf debuginfo-install
> avahi-compat-libdns_sd-0.7-20.fc31.x86_64 avahi-libs-0.7-20.fc31.x86_64
> dbus-libs-1.12.16-3.fc31.x86_64 libcap-2.26-6.fc31.x86_64
> libgcc-9.3.1-1.fc31.x86_64 libgcrypt-1.8.5-1.fc31.x86_64
> lz4-libs-1.9.1-1.fc31.x86_64 nss-mdns-0.14.1-7.fc31.x86_64
> openssl-libs-1.1.1d-2.fc31.x86_64 systemd-libs-243.8-1.fc31.x86_64
> xz-libs-5.2.4-6.fc31.x86_64 zlib-1.2.11-20.fc31.x86_64
> (gdb) bt
> #0 0x77c5ba70 in use_certificate_chain_file ()
>    from /lib64/libssl.so.1.1
> #1 0x5558310e in ?? ()
> #2 0x5558329c in ?? ()
> #3 0x555840b5 in ?? ()
> #4 0x5558412d in ?? ()
> #5 0x55582d39 in ?? ()
> #6 0x555739ad in ?? ()
> #7 0x55562031 in ?? ()
> #8 0x778f01a3 in __libc_start_main () from /lib64/libc.so.6
> #9 0x5556232e in ?? ()
> (gdb)
>
> Does this help enough?

doubt it :)

> Or will I do the debug build?

Please do. This backtrace doesn't show which line, or even which source file, the segfault occurred in. It is possible (in theory) to figure that out from those addresses, but that is rather complicated.
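The kernel's segfault line in Udo's original report already carries enough to get a library-relative offset: the instruction pointer minus the library's load base. A sketch that extracts it, assuming the x86-64 Linux message format shown in that log. The resulting offset could then be handed to a symbolizer such as `addr2line -f -e /lib64/libssl.so.1.1 <offset>`, given the matching openssl-libs debuginfo.

```python
import re

# Sketch: decode an x86-64 Linux "segfault at ..." kernel line into a fault
# address, an offset inside the faulting object, and the page-fault error bits.
SEGV = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp [0-9a-f]+ "
    r"error (?P<err>\d+) in (?P<obj>\S+?)\[(?P<base>[0-9a-f]+)\+"
)

def decode_segfault(line: str) -> dict:
    m = SEGV.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    err = int(m["err"])
    return {
        "object": m["obj"],
        "fault_addr": int(m["addr"], 16),
        "offset_in_object": int(m["ip"], 16) - int(m["base"], 16),
        # x86 page-fault error code: bit 0 = protection fault (clear means the
        # page was not present), bit 1 = write access, bit 2 = user mode.
        "present": bool(err & 1),
        "write": bool(err & 2),
        "user": bool(err & 4),
    }

line = ("ntpd[204063]: segfault at 17f8 ip 7f9d70252a70 sp 7ffe3665adc0 "
        "error 4 in libssl.so.1.1.1d[7f9d7022e000+5]")
info = decode_segfault(line)
print(hex(info["offset_in_object"]), hex(info["fault_addr"]))  # 0x24a70 0x17f8
```

"error 4" decodes as a user-mode read of an unmapped page, and a fault address as small as 0x17f8 is the usual signature of reading a struct field through a NULL pointer, which agrees with `ssl=0x0` in the debug backtrace.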
Re: crash
On 13-04-2020 16:01, Hal Murray wrote:
>
> udo...@xs4all.nl said:
>> Started things this way. One gdb line worries me a bit: (No debugging symbols
>> found in build/main/ntpd/ntpd)
>
>> Perhaps a different build is needed?
>
> I'm not sure how that stuff works.
>
> configure has an --enable-debug-gdb option. That may do it.

Without that option I get:

Thread 1 "ntpd" received signal SIGSEGV, Segmentation fault.
0x77c5ba70 in use_certificate_chain_file () from /lib64/libssl.so.1.1
Missing separate debuginfos, use: dnf debuginfo-install avahi-compat-libdns_sd-0.7-20.fc31.x86_64 avahi-libs-0.7-20.fc31.x86_64 dbus-libs-1.12.16-3.fc31.x86_64 libcap-2.26-6.fc31.x86_64 libgcc-9.3.1-1.fc31.x86_64 libgcrypt-1.8.5-1.fc31.x86_64 lz4-libs-1.9.1-1.fc31.x86_64 nss-mdns-0.14.1-7.fc31.x86_64 openssl-libs-1.1.1d-2.fc31.x86_64 systemd-libs-243.8-1.fc31.x86_64 xz-libs-5.2.4-6.fc31.x86_64 zlib-1.2.11-20.fc31.x86_64
(gdb) bt
#0 0x77c5ba70 in use_certificate_chain_file () from /lib64/libssl.so.1.1
#1 0x5558310e in ?? ()
#2 0x5558329c in ?? ()
#3 0x555840b5 in ?? ()
#4 0x5558412d in ?? ()
#5 0x55582d39 in ?? ()
#6 0x555739ad in ?? ()
#7 0x55562031 in ?? ()
#8 0x778f01a3 in __libc_start_main () from /lib64/libc.so.6
#9 0x5556232e in ?? ()
(gdb)

Does this help enough? Or will I do the debug build?

Udo
Re: crash
udo...@xs4all.nl said:
> I could disable NTSc for now to avoid crashes. Or if you have a patch I can
> test with that one?

Changing that may break (fix?) the crash. I'd like to understand that before we change anything else.

Fixing compatibility with Cloudflare will break all other NTS servers unless they make the same change as Cloudflare. I'm hoping somebody on the IETF list will pick a date.

There is rate limiting on those messages. It shouldn't clutter up the log file too much.

--
These are my opinions. I hate spam.
Re: crash
On 13-04-2020 14:48, Hal Murray wrote:
>> Apr 13 06:10:27 doos ntpd[204063]: EX-REP: Count=1 Print=1, Score=0.500, M4
>> V4 from [2606:4700:f1::1]:123, lng=84
>
> That's saying the NTS stuff isn't working. 2606:4700:f1::1 is Cloudflare.
> They have updated their servers to use the latest tweak from the draft RFC.
> It's incompatible.

I could disable NTSc for now to avoid crashes. Or, if you have a patch, I can test with that.

Udo
Re: crash
udo...@xs4all.nl said:
> Started things this way. One gdb line worries me a bit: (No debugging symbols
> found in build/main/ntpd/ntpd)
> Perhaps a different build is needed?

I'm not sure how that stuff works.

configure has an --enable-debug-gdb option. That may do it.

--
These are my opinions. I hate spam.
Re: crash
On 13-04-2020 15:23, Hal Murray wrote:
> when it crashes, you should get back to gdb
> then
> bt should give you a stack trace

Started things this way. One gdb line worries me a bit:
(No debugging symbols found in build/main/ntpd/ntpd)

Perhaps a different build is needed?

Udo
Re: crash
udo...@xs4all.nl said:
> I did not find a core dump. How else can I get a stack dump?

Use gdb. You need to add -n to the command-line args or ntpd will detach itself.

cd build dir
gdb build/main/ntpd/ntpd
run -n
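The same steps can be run non-interactively, so gdb prints the backtrace as soon as ntpd stops. A sketch that builds such a command line using gdb's real `-batch`, `-ex` and `--args` options; the ntpd path is from the thread, and the extra ntpd arguments in the example are only an illustration.

```python
import shlex
from typing import List

# Sketch: a gdb invocation that runs ntpd in the foreground and prints a
# backtrace when it stops. Everything after --args goes to ntpd, so this
# "-n" is ntpd's no-detach flag, not gdb's own -n (skip init files).

def gdb_backtrace_cmd(ntpd: str, *ntpd_args: str) -> List[str]:
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt",
            "--args", ntpd, "-n", *ntpd_args]

cmd = gdb_backtrace_cmd("build/main/ntpd/ntpd", "-u", "ntp:ntp", "-g")
print(" ".join(shlex.quote(arg) for arg in cmd))
# gdb -batch -ex run -ex bt --args build/main/ntpd/ntpd -n -u ntp:ntp -g
```

Without debugging symbols the frames still show up as `?? ()`, so this only saves typing; it does not replace the --enable-debug-gdb build.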
Re: crash
On 13-04-2020 14:48, Hal Murray wrote:
> Can you get a stack trace?

I did not find a core dump. How else can I get a stack dump?

> What were your configure options?

CFLAGS="-O2" %{__python3} ./waf configure \
  --prefix=/usr \
  --enable-early-droproot \
  --refclock=nmea,generic \
  --libdir=%{_libdir} \
  --docdir=%{_docdir}/ntpsec \
  --enable-doc

>> Apr 13 06:10:27 doos ntpd[204063]: EX-REP: Count=1 Print=1, Score=0.500, M4
>> V4 from [2606:4700:f1::1]:123, lng=84
>
> That's saying the NTS stuff isn't working. 2606:4700:f1::1 is Cloudflare.
> They have updated their servers to use the latest tweak from the draft RFC.
> It's incompatible.

Ah. Would it help if I downgrade to a version a few weeks old?

-rw-r--r-- 1 root root 450224 Nov 22 07:32 RPMS/x86_64/ntpsec-1.1.8-0.fc31.x86_64.rpm
-rw-r--r-- 1 root root 450328 Dec 1 10:46 RPMS/x86_64/ntpsec-1.1.8-1.fc31.x86_64.rpm
-rw-r--r-- 1 root root 451781 Dec 13 10:53 RPMS/x86_64/ntpsec-1.1.8-2.fc31.x86_64.rpm
-rw-r--r-- 1 root root 451820 Dec 13 10:55 RPMS/x86_64/ntpsec-1.1.8-3.fc31.x86_64.rpm
-rw-r--r-- 1 root root 451848 Jan 4 17:49 RPMS/x86_64/ntpsec-1.1.8-4.fc31.x86_64.rpm
-rw-r--r-- 1 root root 452552 Feb 23 07:00 RPMS/x86_64/ntpsec-1.1.8-5.fc31.x86_64.rpm
-rw-r--r-- 1 root root 453601 Mar 14 14:39 RPMS/x86_64/ntpsec-1.1.8-6.fc31.x86_64.rpm
-rw-r--r-- 1 root root 453415 Apr 3 09:53 RPMS/x86_64/ntpsec-1.1.8-7.fc31.x86_64.rpm
-rw-r--r-- 1 root root 453667 Apr 12 09:30 RPMS/x86_64/ntpsec-1.1.8-8.fc31.x86_64.rpm

Or older?

Kind regards,
Udo
Re: crash
> Apr 13 07:10:23 doos kernel: ntpd[204063]: segfault at 17f8 ip
> 7f9d70252a70 sp 7ffe3665adc0 error 4 in libssl.so.1.1.1d[7f9d7022e000+
> 5]

Can you get a stack trace?

What were your configure options?

> Apr 13 06:10:27 doos ntpd[204063]: EX-REP: Count=1 Print=1, Score=0.500, M4
> V4 from [2606:4700:f1::1]:123, lng=84

That's saying the NTS stuff isn't working. 2606:4700:f1::1 is Cloudflare. They have updated their servers to use the latest tweak from the draft RFC. It's incompatible.

--
These are my opinions. I hate spam.
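A few fields of the EX-REP hex dumps in Udo's log can be read directly. This sketch assumes `e4` is the first byte of the NTP header, and uses field meanings from the published NTS specification (RFC 8915, which postdates this thread): kiss code "NTSN" is an NTS NAK, and extension-field type 0x0104 is the Unique Identifier. The dump evidently abbreviates the 48-byte header, so nothing else is decoded.

```python
import struct
from typing import Tuple

# Sketch: decode the readable pieces of one EX-REP dump line.

def ntp_first_byte(b: int) -> Tuple[int, int, int]:
    """Split the LI/VN/Mode byte into (leap indicator, version, mode)."""
    return (b >> 6, (b >> 3) & 0x7, b & 0x7)

def kiss_code(refid_hex: str) -> str:
    """A kiss-o'-death reference ID is four ASCII characters."""
    return bytes.fromhex(refid_hex).decode("ascii")

def ext_field_header(word_hex: str) -> Tuple[int, int]:
    """An NTP extension field starts with a 16-bit type and a 16-bit length."""
    ftype, length = struct.unpack("!HH", bytes.fromhex(word_hex))
    return ftype, length

li, vn, mode = ntp_first_byte(0xE4)
print(li, vn, mode)                 # 3 4 4   ("M4 V4" in the log)
print(kiss_code("4e54534e"))        # NTSN
ftype, flen = ext_field_header("01040024")
print(hex(ftype), flen, 48 + flen)  # 0x104 36 84   (matches lng=84)
```

The eight words after `01040024` in each dump are exactly the 32 bytes that a 36-byte field minus its 4-byte header leaves, which is at least consistent with Hal's reading that NTS is failing against that server.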
Re: crash
On 13-04-2020 14:13, Udo van den Heuvel via devel wrote:
> All,
>
> This happens since yesterday:

This is with a fairly recent 1.1.8 git build. Fedora is up to date.

Udo
crash
All,

This has been happening since yesterday:

Apr 13 06:10:23 doos ntpd[204062]: INIT: ntpd ntpsec-1.1.8 2019-08-02T00:00:00Z: Starting
Apr 13 06:10:23 doos ntpd[204062]: INIT: Command line: /usr/sbin/ntpd -u ntp:ntp -g -N -p /var/run/ntpd.pid
Apr 13 06:10:23 doos ntpd[204063]: INIT: precision = 1.397 usec (-19)
Apr 13 06:10:23 doos ntpd[204063]: INIT: successfully locked into RAM
Apr 13 06:10:23 doos ntpd[204063]: CONFIG: readconfig: parsing file: /etc/ntp.conf
Apr 13 06:10:23 doos ntpd[204063]: AUTH: authreadkeys: reading /etc/ntp/keys
Apr 13 06:10:23 doos ntpd[204063]: AUTH: authreadkeys: added 0 keys
Apr 13 06:10:23 doos ntpd[204063]: CONFIG: 'monitor' cannot be disabled while 'limited' is enabled
Apr 13 06:10:23 doos ntpd[204063]: INIT: Using SO_TIMESTAMPNS
Apr 13 06:10:23 doos ntpd[204063]: IO: Listen and drop on 0 v6wildcard [::]:123
Apr 13 06:10:23 doos ntpd[204063]: IO: Listen and drop on 1 v4wildcard 0.0.0.0:123
Apr 13 06:10:23 doos ntpd[204063]: IO: Listen normally on 2 lo 127.0.0.1:123
Apr 13 06:10:23 doos ntpd[204063]: IO: Listen normally on 3 eth0 192.168.10.70:123
Apr 13 06:10:23 doos ntpd[204063]: IO: Listen normally on 4 lo [::1]:123
Apr 13 06:10:23 doos ntpd[204063]: IO: Listen normally on 5 eth0 [fd00:c0a8:a00:1::70]:123
Apr 13 06:10:23 doos ntpd[204063]: IO: Listen normally on 6 eth0 [2001:981:a812:0:b62e:99ff:fe92:5264]:123
Apr 13 06:10:23 doos ntpd[204063]: IO: Listen normally on 7 eth0 [fe80::b62e:99ff:fe92:5264%2]:123
Apr 13 06:10:23 doos ntpd[204063]: IO: Listening on routing socket on fd #24 for interface updates
Apr 13 06:10:23 doos ntpd[204063]: SYNC: Found 14 servers, suggest minsane at least 3
Apr 13 06:10:23 doos ntpd[204063]: INIT: MRU 10922 entries, 13 hash bits, 65536 bytes
Apr 13 06:10:23 doos ntpd[204063]: INIT: OpenSSL 1.1.1d FIPS 10 Sep 2019, 1010104f
Apr 13 06:10:23 doos ntpd[204062]: 2020-04-13T06:10:23 ntpd[204062]: INIT: ntpd ntpsec-1.1.8 2019-08-02T00:00:00Z: Starting
Apr 13 06:10:23 doos ntpd[204062]: 2020-04-13T06:10:23 ntpd[204062]: INIT: Command line: /usr/sbin/ntpd -u ntp:ntp -g -N -p /var/run/ntpd.pid
Apr 13 06:10:27 doos ntpd[204063]: EX-REP: Count=1 Print=1, Score=0.500, M4 V4 from [2606:4700:f1::1]:123, lng=84
Apr 13 06:10:27 doos ntpd[204063]: EX-REP: e400 4e54534e 6274db5e 604c0da9 01040024 31ecdfb4 899336cb a523b5e6 0f5d614a 766ff4ca 99384f4d a84fd23e 7e959900
Apr 13 06:11:34 doos ntpd[204063]: EX-REP: Count=2 Print=2, Score=0.995, M4 V4 from [2606:4700:f1::1]:123, lng=84
Apr 13 06:11:34 doos ntpd[204063]: EX-REP: e400 4e54534e c0d6d0b1 7ff9fd3c 01040024 0ab1fdb5 06d34008 4477e202 f7c726b0 e8f662cc ef488b09 c4b3100b 8fc83793
Apr 13 06:12:38 doos ntpd[204063]: EX-REP: Count=3 Print=3, Score=1.487, M4 V4 from [2606:4700:f1::1]:123, lng=84
Apr 13 06:12:38 doos ntpd[204063]: EX-REP: e400 4e54534e 79b3755d fcf61bc4 01040024 6502774b aafc5e82 fb0692fc 2ab219c9 05be1d8a 8db3d63d 61d2591d 08fe9f00
Apr 13 06:13:42 doos ntpd[204063]: CLOCK: time stepped by -0.227362
Apr 13 06:13:42 doos ntpd[204063]: INIT: MRU 10922 entries, 13 hash bits, 65536 bytes
Apr 13 06:14:07 doos ntpd[204063]: EX-REP: Count=4 Print=4, Score=1.968, M4 V4 from [2606:4700:f1::1]:123, lng=84
Apr 13 06:14:07 doos ntpd[204063]: EX-REP: e400 4e54534e 735b5415 6c6af8af 01040024 e136691f 681af7eb 58590394 3fe8b189 4d7ec4cb 00658d17 a88bd4d7 542dd7da
Apr 13 06:15:14 doos ntpd[204063]: EX-REP: Count=5 Print=5, Score=2.450, M4 V4 from [2606:4700:f1::1]:123, lng=84
Apr 13 06:15:14 doos ntpd[204063]: EX-REP: e400 4e54534e 02999b2d 3e0fd596 01040024 3d2a4325 067694c7 4fce200e 841a6932 d94001f5 0fe4aa4f 09dd46d4 33149497
Apr 13 07:10:23 doos kernel: [52367.896238] ntpd[204063]: segfault at 17f8 ip 7f9d70252a70 sp 7ffe3665adc0 error 4 in libssl.so.1.1.1d[7f9d7022e000+5]
Apr 13 07:10:23 doos kernel: ntpd[204063]: segfault at 17f8 ip 7f9d70252a70 sp 7ffe3665adc0 error 4 in libssl.so.1.1.1d[7f9d7022e000+5]

The openssl rpm is intact.

Kind regards,
Udo
Re: Usefulness of noval (Was: Re: NTS crash...)
Yo Richard!

On Wed, 27 Mar 2019 16:23:19 -0500
Richard Laager via devel wrote:

> On 3/26/19 4:27 PM, Gary E. Miller via devel wrote:
> > I added noval, still can not connect:
> >
> > server 204.17.205.23 maxpoll 5 nts noval # pi3
>
> I wonder if we should revisit "noval". I think I originally argued in
> favor of having it, as a standard TLS client knob. But IIRC, Daniel
> suggested it was pointless.

The "pointless" claim was proven wrong by the Hackathon. And it is still needed today.

> Does NTS with noval actually buy us anything over plain NTP?

Yes, it was 100% essential for the NTS hackathon. Otherwise NTPsec would not have been able to connect to the ostfalia servers. Still needed today.

Maybe, just maybe, if one of the many flavors of cert pinning worked with NTPsec, then it might only be useful for debugging.

RGDS
GARY
---
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703
g...@rellim.com Tel:+1 541 382 8588

Veritas liberabit vos. -- Quid est veritas?
"If you can’t measure it, you can’t improve it." - Lord Kelvin
Usefulness of noval (Was: Re: NTS crash...)
On 3/26/19 4:27 PM, Gary E. Miller via devel wrote:
> I added noval, still can not connect:
>
> server 204.17.205.23 maxpoll 5 nts noval # pi3

I wonder if we should revisit "noval". I think I originally argued in favor of having it, as a standard TLS client knob. But IIRC, Daniel suggested it was pointless.

Does NTS with noval actually buy us anything over plain NTP?

--
Richard
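For reference, this is what the "standard TLS client knob" usually amounts to, sketched with Python's `ssl` module rather than NTPsec's C code. The handshake is still encrypted either way; with validation off, nothing ties the key-establishment server to the name or address that was configured. The TLS 1.3 minimum and the ALPN ID `ntske/1` follow the published NTS specification (RFC 8915); the function name is hypothetical.

```python
import ssl

# Sketch (not NTPsec code): an NTS-KE-style TLS client context, with an
# optional "noval" mode that skips certificate and hostname validation.

def nts_ke_context(validate: bool = True) -> ssl.SSLContext:
    ctx = ssl.create_default_context()          # verifies certs and hostnames
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    ctx.set_alpn_protocols(["ntske/1"])
    if not validate:
        ctx.check_hostname = False              # must be cleared before CERT_NONE
        ctx.verify_mode = ssl.CERT_NONE
    return ctx

strict = nts_ke_context()
loose = nts_ke_context(validate=False)
print(strict.verify_mode == ssl.CERT_REQUIRED, loose.verify_mode == ssl.CERT_NONE)
```

The ordering matters in Python's API: setting `verify_mode` to `CERT_NONE` while `check_hostname` is still true raises `ValueError`.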
Re: NTS crash...
Yo Hal!

On Tue, 26 Mar 2019 14:49:52 -0700
Hal Murray via devel wrote:

> > Now it does not crash, anyway to make it work?
> > I need to use some IPs for private, offgrid, networking.
>
> I use /etc/hosts, so that hasn't been a problem for me.

Now I have two files to copy/merge over a dozen places instead of one.

> > If you only need the name for the cert, and you are not checking
> > the cert, it should work.
>
> Yes, but I need to get some quiet time. It's not hard, just not
> simple, at least until I see an easy way to do it.

No rush. Take your time and do it right.

RGDS
GARY
---
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703
g...@rellim.com Tel:+1 541 382 8588

Veritas liberabit vos. -- Quid est veritas?
"If you can’t measure it, you can’t improve it." - Lord Kelvin
Re: NTS crash...
> Now it does not crash, anyway to make it work?
> I need to use some IPs for private, offgrid, networking.

I use /etc/hosts, so that hasn't been a problem for me.

> If you only need the name for the cert, and you are not checking the cert, it
> should work.

Yes, but I need to get some quiet time. It's not hard, just not simple, at least until I see an easy way to do it.

The early config stuff handles the IP Address case and throws away the hostname string. I want to understand that area before I try to fix it.

--
These are my opinions. I hate spam.
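The IP-literal case comes down to one early decision: whether the `server` argument is a name or an address. A sketch of a hypothetical helper that makes that decision explicit, returning `None` for an IP literal so the caller must handle "no hostname available for TLS" deliberately instead of copying a NULL string later. The `[v6]:port` and `name:port` handling is an assumption about input forms, not a statement of NTPsec's parser.

```python
import ipaddress
from typing import Optional

# Hypothetical helper: return the hostname usable for TLS, or None when the
# server argument is an IP literal and therefore has no name to offer.

def tls_name_for(server: str) -> Optional[str]:
    host = server
    if host.startswith("["):            # "[v6addr]:port"
        host = host[1:host.index("]")]
    elif host.count(":") == 1:          # "name:port" or "v4addr:port"
        host = host.split(":", 1)[0]
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return host                     # a DNS name
    return None                         # an IP literal: no name

print(tls_name_for("204.17.205.23"))             # None
print(tls_name_for("time.cloudflare.com:1234"))  # time.cloudflare.com
```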
Re: NTS crash...
Yo Hal!

On Tue, 26 Mar 2019 14:22:11 -0700
Hal Murray via devel wrote:

> >> Are you trying to use NTS on an IP Address? Known bug. [That
> >> "(null)" happens on that case.]
> > Nope. Here is the line from ntp.conf that crashes my slow RasPi.
> > But not my fast RasPi:
> > server 204.17.205.23 maxpoll 5 nts # pi3
>
> That sure looks like at IP Address to me.

Yup, I guess not enough coffee yet.

Now it does not crash; any way to make it work? I need to use some IPs for private, offgrid networking.

I added noval, still can not connect:

server 204.17.205.23 maxpoll 5 nts noval # pi3

If you only need the name for the cert, and you are not checking the cert, it should work.

RGDS
GARY
---
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703
g...@rellim.com Tel:+1 541 382 8588

Veritas liberabit vos. -- Quid est veritas?
"If you can’t measure it, you can’t improve it." - Lord Kelvin
Re: NTS crash...
>> Are you trying to use NTS on an IP Address? Known bug. [That
>> "(null)" happens on that case.]
> Nope. Here is the line from ntp.conf that crashes my slow RasPi. But not my
> fast RasPi:
> server 204.17.205.23 maxpoll 5 nts # pi3

That sure looks like an IP address to me.

--
These are my opinions. I hate spam.
Re: NTS crash...
Yo Hal!

On Tue, 26 Mar 2019 13:44:41 -0700
Hal Murray wrote:

> > Always fails for me. On seversl different RasPi.
> 2019-03-26T13:35:09 ntpd[26050]: DNS: dns_probe: (null),
> cast_flags:1, flags:21001
>
> Are you trying to use NTS on an IP Address? Known bug. [That
> "(null)" happens on that case.]

Nope. Here is the line from ntp.conf that crashes my slow RasPi. But not my fast RasPi:

server 204.17.205.23 maxpoll 5 nts # pi3

> I thought I mentioned that case before but I guess I wasn't loud
> enough.

Is crashing your idea of loud?

I know that fails, now. I know it will get fixed, eventually. But this is something else, a race.

RGDS
GARY
---
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703
g...@rellim.com Tel:+1 541 382 8588

Veritas liberabit vos. -- Quid est veritas?
"If you can’t measure it, you can’t improve it." - Lord Kelvin
Re: NTS crash...
> Always fails for me. On seversl different RasPi.
2019-03-26T13:35:09 ntpd[26050]: DNS: dns_probe: (null), cast_flags:1, flags:21001

Are you trying to use NTS on an IP Address? Known bug. [That "(null)" happens on that case.]

I thought I mentioned that case before but I guess I wasn't loud enough.

--
These are my opinions. I hate spam.
Re: NTS crash...
Yo Hal!

On Tue, 26 Mar 2019 13:29:45 -0700
Hal Murray via devel wrote:

> > I applied today's NTPsec git head, with the mysyslog patches.
> > Older and slower RasPi still crash on startup if the try to be an
> > NTS client.
>
> Works for me. It's old enough that it has only 2 USB ports.

Always fails for me. On several different RasPi.

2019-03-26T13:35:06 ntpd[26050]: PROTO: SHM(1) 8014 84 reachable
2019-03-26T13:35:06 ntpd[26050]: PROTO: SHM(1) 901a 8a sys_peer
2019-03-26T13:35:06 ntpd[26050]: PROTO: 0.0.0.0 c415 05 clock_sync
2019-03-26T13:35:07 ntpd[26050]: PROTO: 2001:470:e815::8 a014 84 reachable
2019-03-26T13:35:08 ntpd[26050]: PROTO: 204.17.205.17 a014 84 reachable
2019-03-26T13:35:09 ntpd[26050]: DNS: dns_probe: (null), cast_flags:1, flags:21001
[New Thread 0x75b1a460 (LWP 26062)]

Thread 4 "ntpd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x75b1a460 (LWP 26062)]
0x76cfb6fc in strlcpy () from /usr/lib/libbsd.so.0

Any idea how to make gdb more useful here?

RGDS
GARY
---
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703
g...@rellim.com Tel:+1 541 382 8588

Veritas liberabit vos. -- Quid est veritas?
"If you can’t measure it, you can’t improve it." - Lord Kelvin
NTS crash...
> I applied today's NTPsec git head, with the mysyslog patches. Older and
> slower RasPi still crash on startup if the try to be an NTS client.

Works for me. It's old enough that it has only 2 USB ports.

--
These are my opinions. I hate spam.
✘NTS crash...
Yo All!

I applied today's NTPsec git head, with the mysyslog patches. Older and slower RasPi still crash on startup if they try to be an NTS client.

RGDS
GARY
---
Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703
g...@rellim.com Tel:+1 541 382 8588

Veritas liberabit vos. -- Quid est veritas?
"If you can’t measure it, you can’t improve it." - Lord Kelvin
Re: Dave Morgan's report on the mystery crash
Dave Morgan:
> All,
> I am at work at moment. If logs still needed I will send in about 10
> hours when back home.

Thanks, we've found and fixed the problem.
--
Eric S. Raymond <http://www.catb.org/~esr/>

Please consider contributing to my Patreon page at https://www.patreon.com/esr so I can keep the invisible wheels of the Internet turning. Give generously - the civilization you save might be your own.
Re: Dave Morgan's report on the mystery crash
All,

I am at work at the moment. If logs are still needed I will send them in about 10 hours, when I'm back home.

Dave

On 05/09/2017, Eric S. Raymond via devel <devel@ntpsec.org> wrote:
> Dave Morgan sent me a report on two instances of the mystery crash
> tghat hapened to him last week (he also said the installation had been
> stable since). Alas, I somehow fat-fingered my copy of that mail.
>
> Dave, please repost to the list so we can all stare at your logs and
> config.
> --
> Eric S. Raymond <http://www.catb.org/~esr/>
>
> Non-cooperation with evil is as much a duty as cooperation with good.
> -- Mohandas Gandhi

--
http://www.morgad.co.uk/index.html
DP: http://www.pgdp.net
NTP: http://www.pool.ntp.org
L: http://www.lynton-rail.co.uk
Dave Morgan's report on the mystery crash
Dave Morgan sent me a report on two instances of the mystery crash that happened to him last week (he also said the installation had been stable since). Alas, I somehow fat-fingered my copy of that mail.

Dave, please repost to the list so we can all stare at your logs and config.
--
Eric S. Raymond <http://www.catb.org/~esr/>

Non-cooperation with evil is as much a duty as cooperation with good.
-- Mohandas Gandhi
All hands alert - crash of unknown origin
Everyone should read this thread:

https://gitlab.com/NTPsec/ntpsec/issues/375

The only empirical clue we have is that it only seems to manifest under the kind of high load characteristic of pool service. I have a suspicion that something is causing memory usage to spike and the OOM killer is reaping the process.

This is a serious bug and we need everyone with test facilities trying to reproduce it. If there is any way you can set up and watch a pool server, please do so.
--
Eric S. Raymond <http://www.catb.org/~esr/>

The Bible is not my book, and Christianity is not my religion. I could never give assent to the long, complicated statements of Christian dogma.
-- Abraham Lincoln
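One cheap way to test the OOM-killer suspicion on a Linux pool server is to sample ntpd's resident set size from `/proc/<pid>/status` at regular intervals. A steady climb before the process disappears would support the theory, and an OOM-killer entry naming ntpd in the kernel log would confirm it. A minimal sketch; the sampling interval and where the numbers are logged are left to the operator.

```python
import re
from typing import Optional

# Sketch (Linux only): read the VmRSS line from /proc/<pid>/status.

def vm_rss_kib(status_text: str) -> Optional[int]:
    """Return VmRSS in KiB from the text of a /proc/<pid>/status file."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None

def read_rss(pid: int) -> Optional[int]:
    with open(f"/proc/{pid}/status") as f:
        return vm_rss_kib(f.read())
```

Calling `read_rss()` with ntpd's PID from a loop or a cron job, and appending the value with a timestamp to a file, is enough to see the trend leading up to a crash.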
Re: ✘pyntpq crash
Gary E. Miller: > Yo Eric! > > Whoops: > > # ntpq/pyntpq -p > Traceback (most recent call last): > File "ntpq/pyntpq", line 1441, in <module> > interpreter.onecmd(cmd) > File "/usr/lib64/python2.7/cmd.py", line 221, in onecmd > return func(arg) > File "ntpq/pyntpq", line 1051, in do_peers > self.__dopeers(showall=False, mode="peers") > File "ntpq/pyntpq", line 270, in __dopeers > if not self.__dogetassoc(): > File "ntpq/pyntpq", line 173, in __dogetassoc > self.peers = self.session.readstat() > File "/u1/src/NTP/ntpsec/ntpq/ntp/packet.py", line 594, in readstat > self.doquery(opcode=CTL_OP_READSTAT, associd=associd) > File "/u1/src/NTP/ntpsec/ntpq/ntp/packet.py", line 581, in doquery > res = self.getresponse(opcode, associd, not retry) > File "/u1/src/NTP/ntpsec/ntpq/ntp/packet.py", line 461, in getresponse > rawdata = polystr(self.sock.recv(4096)) > socket.error: [Errno 111] Connection refused > > This happens when I kill the ntpd on the localhost. Prolly easy to fix. Trivial. Pushed. You should now get a "connection timed out" message. -- Eric S. Raymond <http://www.catb.org/~esr/>
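For reference, here is one way a Python control client can turn that traceback into a readable message. It is a sketch assuming Python 3, where `socket.error` is an alias of `OSError` and a refused UDP exchange surfaces as `ECONNREFUSED` on `recv()`; `ControlSessionError`, `describe_socket_error` and `recv_response` are hypothetical names, not the actual code in ntpq or `ntp/packet.py`.

```python
import errno
import socket


class ControlSessionError(Exception):
    """Hypothetical error type for reporting control-channel failures."""


def describe_socket_error(exc):
    """Map a low-level socket exception to a user-facing message, or None."""
    if isinstance(exc, socket.timeout):
        return "no response from ntpd (timed out)"
    if isinstance(exc, OSError) and exc.errno == errno.ECONNREFUSED:
        return "connection refused: is ntpd running on that host?"
    return None


def recv_response(sock, bufsize=4096):
    """Read one response datagram, converting known failures into ControlSessionError."""
    try:
        return sock.recv(bufsize)
    except OSError as exc:  # socket.timeout is an OSError subclass too
        message = describe_socket_error(exc)
        if message is None:
            raise  # unexpected errors still propagate with their traceback
        raise ControlSessionError(message) from exc
```

A command handler such as `peers` would catch `ControlSessionError`, print its message, and return to the prompt instead of dumping a traceback.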
✘pyntpq crash
Yo Eric! Whoops: # ntpq/pyntpq -p Traceback (most recent call last): File "ntpq/pyntpq", line 1441, in <module> interpreter.onecmd(cmd) File "/usr/lib64/python2.7/cmd.py", line 221, in onecmd return func(arg) File "ntpq/pyntpq", line 1051, in do_peers self.__dopeers(showall=False, mode="peers") File "ntpq/pyntpq", line 270, in __dopeers if not self.__dogetassoc(): File "ntpq/pyntpq", line 173, in __dogetassoc self.peers = self.session.readstat() File "/u1/src/NTP/ntpsec/ntpq/ntp/packet.py", line 594, in readstat self.doquery(opcode=CTL_OP_READSTAT, associd=associd) File "/u1/src/NTP/ntpsec/ntpq/ntp/packet.py", line 581, in doquery res = self.getresponse(opcode, associd, not retry) File "/u1/src/NTP/ntpsec/ntpq/ntp/packet.py", line 461, in getresponse rawdata = polystr(self.sock.recv(4096)) socket.error: [Errno 111] Connection refused This happens when I kill the ntpd on the localhost. Prolly easy to fix. RGDS GARY --- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703 g...@rellim.com Tel:+1 541 382 8588
Re: Use of pool servers reveals unacceptable crash rate in async DNS
Processing old mail... Hal Murray: > > I believe you're right that these platforms don't have it. The question is, > > how important is that fact? Is the performance hit from synchronous DNS > > really a showstopper? I don't know the answer. > > There are two cases I know of where ntpd does a DNS lookup after it gets > started. > > One is the try again when DNS for the normal server case doesn't work during > initialization. It will try again occasionally until it gets an answer. > (which might be negative) > > The main one is the pool code trying for a new server. I think we should be > extending this rather than dropping it. There are several possibles in this > area. The main one would be to verify that a server you are using is still > in the pool. (There isn't a way to do that yet - the pool doesn't have any > DNS support for that.) The other would be to try replacing the poorest > server rather than only replacing dead servers. > > DNS lookups can take a LONG time. I think I've seen 40 seconds on a failing > case. > > If we get the recv time stamp from the OS, I think the DNS delays won't > introduce any lies on the normal path. We could test that by putting a sleep > in the main loop. (There is a filter to reject packets that take too long, > but I think that's time-in-flight and excludes time sitting on the server.) > > There are two cases I can think of where a pause in ntpd would cause > troubles. One is that it would mess up refclocks. The other is that packets > will get dropped if too many of them arrive. > > I think that means we could use the pool command on a system without > refclocks. That covers end nodes and maybe lightly loaded servers. > > --- > > It's worth checking out the input buffering side of things. There may be > some code there that we don't need. I think there is a pool of buffers. > Where can a buffer sit other than on the free queue? Why do we need a pool? The project has more important priorities than chasing this down. 
But: I have edited this text, adding a few details I have learned since, into a new section for the internals tour (devel/tour.txt). That will give somebody a better-than-nothing place to start if we ever again try something like the cAres replacement. -- Eric S. Raymond <http://www.catb.org/~esr/>
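Hal's suggestion of replacing the poorest server, not just dead ones, amounts to a small change in how the pool code picks a victim before asking DNS for a fresh address. A minimal sketch, assuming jitter as the quality measure (Hal doesn't name one) and a hypothetical `PoolPeer` summary record; the addresses are RFC 5737 documentation addresses.

```python
from dataclasses import dataclass


@dataclass
class PoolPeer:
    """Hypothetical summary of one pool-derived association."""
    address: str
    reachable: bool
    jitter_ms: float


def pick_replacement_victim(peers, replace_poorest=False):
    """Choose which pool peer to drop before requesting a new one, or None."""
    dead = [peer for peer in peers if not peer.reachable]
    if dead:
        return dead[0]  # existing behaviour: only dead servers get replaced
    if replace_poorest and peers:
        # proposed behaviour: also cycle out the worst live server
        return max(peers, key=lambda peer: peer.jitter_ms)
    return None


peers = [
    PoolPeer("192.0.2.1", True, 0.4),
    PoolPeer("192.0.2.2", True, 3.1),
    PoolPeer("192.0.2.3", True, 0.9),
]
print(pick_replacement_victim(peers))                                # None
print(pick_replacement_victim(peers, replace_poorest=True).address)  # 192.0.2.2
```

Keeping the new behaviour behind a flag leaves today's policy as the default, which seems prudent given that every replacement costs another DNS query against the pool.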
Re: Use of pool servers reveals unacceptable crash rate in async DNS
On Tue, Jun 28, 2016 at 11:39:16PM -0700, Hal Murray wrote: > > matthew.sel...@twosigma.com said: > > "rlimit memlock 0" using Classic causes ntpd to die after 3 minutes with > > this error 2016-06-29T00:13:21.903+00:00 host.example.com ntpd[27206]: > > libgcc_s.so.1 must be installed for pthread_cancel to work > > What version of Classic are you running? I thought they had fixed that. This lab system happens to be running 4.2.7.0p368. Super-old, I know. I'll upgrade it over the weekend. > > I've attached 15 minute graphs for "rlimit memlock -1" and "rlimit memlock > > 128" using Classic. Locking memory seems to result in more stable graphs > > over the time period that I was able to collect quickly. > > What are you plotting? Y-axis is offset as measured by ntpq -p in microseconds. X-axis is time. And the 3 lines represent 3 different remote refclocks that my ntp client is pointing at. Cheers, -Matt
Re: Use of pool servers reveals unacceptable crash rate in async DNS
matthew.sel...@twosigma.com said: > "rlimit memlock 0" using Classic causes ntpd to die after 3 minutes with > this error 2016-06-29T00:13:21.903+00:00 host.example.com ntpd[27206]: > libgcc_s.so.1 must be installed for pthread_cancel to work What version of Classic are you running? I thought they had fixed that. > I've attached 15 minute graphs for "rlimit memlock -1" and "rlimit memlock > 128" using Classic. Locking memory seems to result in more stable graphs > over the time period that I was able to collect quickly. What are you plotting? -- These are my opinions. I hate spam.
Re: Use of pool servers reveals unacceptable crash rate in async DNS
On Tue, Jun 28, 2016 at 07:26:39PM -0400, Eric S. Raymond wrote: > Hal Murray: > > I think you have extrapolated from some modern systems to our whole target > > environment. I don't remember any discussion supporting memlock not being > > interesting/important. > > There were actually two threads about this attached to memlock-related bug > reports in Classic. They initially thought memlocking was important, then > figured out it wasn't. Matt Selsky has been following those bugs; he and I > discussed the issue on #ntpsec. "rlimit memlock 0" using Classic causes ntpd to die after 3 minutes with this error 2016-06-29T00:13:21.903+00:00 host.example.com ntpd[27206]: libgcc_s.so.1 must be installed for pthread_cancel to work I've attached 15 minute graphs for "rlimit memlock -1" and "rlimit memlock 128" using Classic. Locking memory seems to result in more stable graphs over the time period that I was able to collect quickly. Cheers, -Matt
Re: Use of pool servers reveals unacceptable crash rate in async DNS
Yo Eric! On Tue, 28 Jun 2016 19:47:14 -0400 "Eric S. Raymond" wrote: > Gary E. Miller: > > Yo Eric! > > > > On Tue, 28 Jun 2016 19:26:39 -0400 > > "Eric S. Raymond" wrote: > > > > > (You should camp on #ntpsec. Also join our Signal channel - > > > because that's secured, most of the vuln discussions happen > > > there.) > > > > Ah, how do we join the Signal channel? > > Install Signal on your smartphone and/or Chrome instance. One of us > can and will add you. Done. Tied to my cell phone: 541-390-3793. RGDS GARY --- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703 g...@rellim.com Tel:+1 541 382 8588
Re: Use of pool servers reveals unacceptable crash rate in async DNS
Gary E. Miller: > Yo Eric! > > On Tue, 28 Jun 2016 19:26:39 -0400 > "Eric S. Raymond" wrote: > > > (You should camp on #ntpsec. Also join our Signal channel - because > > that's secured, most of the vuln discussions happen there.) > > Ah, how do we join the Signal channel? Install Signal on your smartphone and/or Chrome instance. One of us can and will add you. -- Eric S. Raymond <http://www.catb.org/~esr/>
Re: Use of pool servers reveals unacceptable crash rate in async DNS
Yo Eric! On Tue, 28 Jun 2016 19:26:39 -0400 "Eric S. Raymond" wrote: > (You should camp on #ntpsec. Also join our Signal channel - because > that's secured, most of the vuln discussions happen there.) Ah, how do we join the Signal channel? RGDS GARY --- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97703 g...@rellim.com Tel:+1 541 382 8588
Re: Use of pool servers reveals unacceptable crash rate in async DNS
Hal Murray: > I think you have extrapolated from some modern systems to our whole target > environment. I don't remember any discussion supporting memlock not being > interesting/important. There were actually two threads about this attached to memlock-related bug reports in Classic. They initially thought memlocking was important, then figured out it wasn't. Matt Selsky has been following those bugs; he and I discussed the issue on #ntpsec. (You should camp on #ntpsec. Also join our Signal channel - because that's secured, most of the vuln discussions happen there.) > I'd be a lot happier if you had a plan for what to do if it turned out to be > a problem and/or a way to verify that we don't need it or detect that it > causes trouble. I have a plan. Love is the plan, the plan is git (classical reference). The two patches that removed it are pretty well isolated and should be easily revertible. As in, the work of minutes. As for how to tell if it causes problems, that's not very difficult either. If it's suspected, graph jitter against memory utilization. > Consider ntpd running on an old system that is mostly lightly loaded and > doesn't have a lot of memory. I could easily imagine ntpd getting swapped > out when some load did come along. I don't know how to evaluate if that will > cause problems and I don't think we have a test environment that is likely to > blunder into it. I remember page faults causing enough processing lag to be a real issue here, but not since the mid-1990s at the latest. And certainly not with SSDs. But I think you've brought another issue to the surface which I'll start a separate thread about. > I poked around a bit. Linux and NetBSD and FreeBSD all have getrusage(). I > didn't notice any differences. It covers page faults and CPU usage. When > I'm in the right mood, I'll add another file parallel to sysstats to collect > that sort of data. The CPU usage will probably be interesting even if page > faults are boring. 
That kind of data is always useful and I would welcome it. -- Eric S. Raymond <http://www.catb.org/~esr/>
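As a first numeric pass before drawing the jitter-versus-memory graph, one could compute the correlation between jitter samples and memory-utilization samples taken at the same times. A sketch, assuming the two series have already been paired up by timestamp (for example jitter from loopstats and RSS from periodic snapshots); a coefficient near zero would suggest memory pressure isn't driving jitter, though it says little about rare, brief swap-outs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series of at least two samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / math.sqrt(sxx * syy)
```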
Re: Use of pool servers reveals unacceptable crash rate in async DNS
e...@thyrsus.com said: > After discussion with Daniel about the performance and security issues I > deleted the memlock code. As the comment explains: I think changes like that are worthy of a general announcement. > on modern systems, which swap so seldom > that many people don't bother with swap partitions I think you have extrapolated from some modern systems to our whole target environment. I don't remember any discussion supporting memlock not being interesting/important. I'd be a lot happier if you had a plan for what to do if it turned out to be a problem and/or a way to verify that we don't need it or detect that it causes trouble. Consider ntpd running on an old system that is mostly lightly loaded and doesn't have a lot of memory. I could easily imagine ntpd getting swapped out when some load did come along. I don't know how to evaluate if that will cause problems and I don't think we have a test environment that is likely to blunder into it. I poked around a bit. Linux and NetBSD and FreeBSD all have getrusage(). I didn't notice any differences. It covers page faults and CPU usage. When I'm in the right mood, I'll add another file parallel to sysstats to collect that sort of data. The CPU usage will probably be interesting even if page faults are boring. -- These are my opinions. I hate spam.
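A stats file parallel to sysstats could be fed from `getrusage()`, which reports CPU time and minor/major page-fault counts for the calling process. The sketch below uses Python's `resource` module (a thin, Unix-only wrapper over the same call); the one-line-per-sample format is an assumption, not the format of any existing ntpd stats file. Note that `ru_maxrss` is in kilobytes on Linux but in bytes on some other systems such as macOS.

```python
import resource
import time

# Fields of interest: user/system CPU seconds, minor/major page faults, peak RSS.
FIELDS = ("ru_utime", "ru_stime", "ru_minflt", "ru_majflt", "ru_maxrss")


def rusage_line(now=None, usage=None):
    """Format one space-separated sample: timestamp, then the FIELDS values."""
    if usage is None:
        usage = resource.getrusage(resource.RUSAGE_SELF)
    if now is None:
        now = time.time()
    values = []
    for name in FIELDS:
        value = getattr(usage, name)
        # CPU times are floats in seconds; the counters are integers.
        values.append(f"{value:.3f}" if isinstance(value, float) else str(value))
    return " ".join([f"{now:.0f}"] + values)
```

Sampling once per stats interval and diffing consecutive `ru_majflt` values would show whether a memory-starved box is actually paging the daemon back in under load.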
Re: Use of pool servers reveals unacceptable crash rate in async DNS
On Mon, Jun 27, 2016 at 3:47 PM, Hal Murray wrote: > > cbwie...@gmail.com said: > > How are pool entries added when the service decides it needs more? > > There is some background stuff that roughly says "need more?", and if so > fires off the DNS lookup. > > > Would it be possible to leverage this code for adding all servers specified by name? > > Probably not directly, but it wouldn't be hard for the server code to use > more than one address if that was desired. Maybe it should be "servers" > rather than "server". Do you have an example where that would be useful? > > If you don't have lots of servers, you probably don't want to switch to using "pool" since that path will probably keep banging away at the DNS looking for more servers. > > I'm not looking to change the operation of the server or pool directive. I was thinking of setting up associations using the DNS lookup code. If the mechanism for adding new pool servers was blocking on the DNS call but asynchronous to the rest of the daemon, I was figuring to call the lookup with the name provided by the server directive. The only real difference between a specified server and a pool server is that you don't delete the specified server. I'm definitely not looking to bang on DNS servers any more than I have to. Clark B. Wierda
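The observation that a specified server and a pool server differ only in whether they may later be deleted can be sketched as one resolver path with a flag. This assumes a hypothetical `Association` record with a `prunable` field and takes `getaddrinfo()`-style result tuples; it is not how ntpd's pool code is actually structured.

```python
from dataclasses import dataclass


@dataclass
class Association:
    """Hypothetical association record."""
    address: str
    hostname: str
    prunable: bool  # True for pool-derived peers; configured servers are kept


def mobilize_from_lookup(hostname, addrinfo, from_pool, max_new=1):
    """Build associations from getaddrinfo()-style tuples, shared by 'server' and 'pool'."""
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in addrinfo:
        if sockaddr[0] not in addresses:  # getaddrinfo() can repeat an address
            addresses.append(sockaddr[0])
    return [Association(addr, hostname, prunable=from_pool)
            for addr in addresses[:max_new]]


def prune(associations, is_dead):
    """Drop dead pool associations; explicitly configured servers always survive."""
    return [a for a in associations if not (a.prunable and is_dead(a))]
```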
Re: Use of pool servers reveals unacceptable crash rate in async DNS
A question: How are pool entries added when the service decides it needs more? Would it be possible to leverage this code for adding all servers specified by name? The DNS cost would be the same. The only difference is the name used for the query. Once a server is associated, the IP is used. There should be no impact on the time calculations after an association is provisioned. Clark B. Wierda On Sun, Jun 26, 2016 at 9:13 PM, Hal Murray wrote: > > Possible crazy idea... > > How about we never kill the DNS helper thread. Just let it sit there in case it gets more work to do. The only cost is a bit of memory. > > Or maybe only do that if we are locking stuff into memory. > >
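The "never kill the DNS helper thread" idea can be sketched as a resolver worker that is started once and then blocks on a request queue forever, so no thread has to exit or be cancelled after startup. This is a Python illustration of the pattern, not ntpd's C implementation; `DnsHelper` is a hypothetical name, and 123 is simply the NTP port.

```python
import queue
import socket
import threading


class DnsHelper:
    """Long-lived resolver thread: started once, never cancelled or joined."""

    def __init__(self):
        self._requests = queue.Queue()
        self.results = queue.Queue()  # the main loop polls this with get_nowait()
        self._thread = threading.Thread(target=self._run, name="dns-helper", daemon=True)
        self._thread.start()

    def submit(self, hostname, port=123):
        """Queue a lookup; returns immediately."""
        self._requests.put((hostname, port))

    def _run(self):
        while True:
            hostname, port = self._requests.get()  # idles here between lookups
            try:
                info = socket.getaddrinfo(hostname, port, type=socket.SOCK_DGRAM)
                self.results.put((hostname, info, None))
            except OSError as exc:  # socket.gaierror is an OSError subclass
                self.results.put((hostname, [], exc))
```

A slow lookup (the 40-second failing case mentioned earlier in the thread) then only delays that one result, never the main loop.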
Re: Use of pool servers reveals unacceptable crash rate in async DNS
Hal Murray: > > e...@thyrsus.com said: > > Ugh. Our options have just narrowed. I've just seen > > libgcc_s.so.1 must be installed for pthread_cancel to work Aborted (core > > dumped) > > > with memlock off in the build. > > Can you reproduce it? > > My guess is that you didn't really get memlock turned off. How about putting > a break on mlockall or the call to it? (There is only one in ntpd.c.) This is possible. I will attempt to reproduce. -- Eric S. Raymond <http://www.catb.org/~esr/>
Re: Use of pool servers reveals unacceptable crash rate in async DNS
Hal Murray: > If it uses threads, we still have the problem of not being able to load the > thread cleanup code. Maybe. We don't know if the libc implementation is vulnerable to that bug or not. I should do an experimental implementation on a branch and find out. -- Eric S. Raymond <http://www.catb.org/~esr/>
Re: Use of pool servers reveals unacceptable crash rate in async DNS
e...@thyrsus.com said: >> Is getaddrinfo_a() in RTEMS? QNX? BSD? > It's not an OS thing, it's a toolchain thing. getaddrinfo_a() is > implemented using standard C and POSIX threads, it doesn't need OS-specific > support. Or it's in an optional extra library. > Linux has it because Linux uses libc whether you're compiling with gcc or > clang. Any of those other platforms will have it *if* they have (gcc || > clang) && glibc. My Linux man page says: #define _GNU_SOURCE /* See feature_test_macros(7) */ Link with -lanl. I couldn't find it in /usr/include/ on NetBSD or FreeBSD. On Linux, it's in netdb.h. -- If it uses threads, we still have the problem of not being able to load the thread cleanup code. -- These are my opinions. I hate spam.
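Whether a given machine actually provides `getaddrinfo_a()` can be checked empirically. A rough runtime probe using Python's `ctypes`, assuming a glibc-style layout where the function lives either in libc itself (recent glibc releases are understood to have folded libanl into libc) or in libanl, the library behind the `-lanl` the man page mentions. A real build would make the equivalent check at configure time in waf; this sketch does not do that.

```python
import ctypes
import ctypes.util


def have_getaddrinfo_a():
    """Best-effort check for getaddrinfo_a() in this process or in libanl."""
    candidates = [None]  # None = symbols already loaded into this process (libc)
    anl = ctypes.util.find_library("anl")
    if anl:
        candidates.append(anl)
    for name in candidates:
        try:
            lib = ctypes.CDLL(name)
        except (OSError, TypeError):
            continue  # library missing, or CDLL(None) unsupported on this platform
        if hasattr(lib, "getaddrinfo_a"):  # dlsym() lookup; False if the symbol is absent
            return True
    return False
```

Based on the observations above, one would expect `True` on glibc Linux and `False` on NetBSD or FreeBSD.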
Re: Use of pool servers reveals unacceptable crash rate in async DNS
Mark Atwood: > Is getaddrinfo_a() in RTEMS? QNX? BSD? It's not an OS thing, it's a toolchain thing. getaddrinfo_a() is implemented using standard C and POSIX threads, it doesn't need OS-specific support. Linux has it because Linux uses libc whether you're compiling with gcc or clang. Any of those other platforms will have it *if* they have (gcc || clang) && glibc. There is at least one other implementation out there, in a GPL-licensed package called "adns". -- Eric S. Raymond <http://www.catb.org/~esr/>
Re: Use of pool servers reveals unacceptable crash rate in async DNS
Is getaddrinfo_a() in RTEMS? QNX? BSD? On Sun, Jun 26, 2016 at 7:06 AM Eric S. Raymond wrote: > Eric S. Raymond: > > > What would you do if we discovered a case where we wanted it? > > > > Cry a lot. Then add logic to force synchronous DNS when memlocking is > > selected, and document this as a workaround for a bug we haven't fixed yet. > > Ugh. Our options have just narrowed. I've just seen > > libgcc_s.so.1 must be installed for pthread_cancel to work > Aborted (core dumped) > > with memlock off in the build. > > I think the homebrew async-lookup code has to go. Even if we installed > the warmup fix, I don't think I'd trust it. > -- > Eric S. Raymond <http://www.catb.org/~esr/>
Re: Use of pool servers reveals unacceptable crash rate in async DNS
Eric S. Raymond: > > What would you do if we discovered a case where we wanted it? > > Cry a lot. Then add logic to force synchronous DNS when memlocking is > selected, and document this as a workaround for a bug we haven't fixed yet. Ugh. Our options have just narrowed. I've just seen libgcc_s.so.1 must be installed for pthread_cancel to work Aborted (core dumped) with memlock off in the build. I think the homebrew async-lookup code has to go. Even if we installed the warmup fix, I don't think I'd trust it. -- Eric S. Raymond <http://www.catb.org/~esr/>
Re: Use of pool servers reveals unacceptable crash rate in async DNS
Hal Murray: > > e...@thyrsus.com said: > > In this case, we have two possible complexity-reducing fixes. One is to > > drop the memlock feature entirely. The other is to drop the buggy homebrew > > asynchronous-DNS lookup from Classic and use libc's. > > Dropping memlock is an interesting idea. I can't think of any place where it > is required today but my crystal ball for what we will need tomorrow has > never been very good. Crypto security *might* be it. I'll wait for Daniel to weigh in once he's done climbing mountains or whatever. > What would you do if we discovered a case where we wanted it? Cry a lot. Then add logic to force synchronous DNS when memlocking is selected, and document this as a workaround for a bug we haven't fixed yet. > We could try simplifying things to only supporting lock-everything-I-need > rather than specifying how much. There might be a slippery slope if > something like a thread stack needs a sane size specified. I'm not intimate with mlockall, but it looks like it works that way now. if (do_memlock) { /* * lock the process into memory */ if (!dumpopts && 0 != mlockall(MCL_CURRENT|MCL_FUTURE)) msyslog(LOG_ERR, "mlockall(): %m"); } > Is there a simple way to count page faults for a process? Or measure swapped > out data and/or code that isn't swapped in? I don't know. I can do some research, but I'm not sure "enough page faults to merit memory locking" would be a well-defined threshold even if I knew how to count them. > I don't think your use-libc approach will be as simple as you would > like. It's not available on NetBSD or FreeBSD. Maybe I just didn't > look in the right place. It's not in netdb.h where it is for Linux. I believe you're right that these platforms don't have it. The question is, how important is that fact? Is the performance hit from synchronous DNS really a showstopper? I don't know the answer. -- Eric S. Raymond <http://www.catb.org/~esr/>
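The two flags in that snippet matter for the crash discussed in this thread: `MCL_CURRENT` locks the pages mapped now, and `MCL_FUTURE` requires later mappings to be locked too, so a library that glibc loads lazily afterwards (such as the libgcc_s that `pthread_cancel` needs) plausibly has to fit under the locked-memory limit or fail to load. Below is a Python `ctypes` sketch of the same logic, assuming Linux's numeric flag values on common architectures (other platforms differ) and a hypothetical `lock_memory` helper; it illustrates the C block, it is not ntpd code.

```python
import ctypes
import ctypes.util
import os

# Linux values on common architectures (asm-generic/mman.h); an assumption
# elsewhere, since some platforms use different numbers.
MCL_CURRENT = 1
MCL_FUTURE = 2


def lock_memory(do_memlock, dumpopts=False):
    """Mirror of the ntpd.c block: None if skipped, 0 on success, else errno."""
    if not do_memlock or dumpopts:
        return None
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        err = ctypes.get_errno()
        # ntpd reports this via msyslog(LOG_ERR, "mlockall(): %m")
        print(f"mlockall(): {os.strerror(err)}")
        return err
    return 0
```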
Re: Use of pool servers reveals unacceptable crash rate in async DNS
e...@thyrsus.com said: > In this case, we have two possible complexity-reducing fixes. One is to > drop the memlock feature entirely. The other is to drop the buggy homebrew > asynchronous-DNS lookup from Classic and use libc's. Dropping memlock is an interesting idea. I can't think of any place where it is required today but my crystal ball for what we will need tomorrow has never been very good. What would you do if we discovered a case where we wanted it? We could try simplifying things to only supporting lock-everything-I-need rather than specifying how much. There might be a slippery slope if something like a thread stack needs a sane size specified. Is there a simple way to count page faults for a process? Or measure swapped out data and/or code that isn't swapped in? I don't think your use-libc approach will be as simple as you would like. It's not available on NetBSD or FreeBSD. Maybe I just didn't look in the right place. It's not in netdb.h where it is for Linux. -- These are my opinions. I hate spam.
Re: Use of pool servers reveals unacceptable crash rate in async DNS
Mark: Heads up! Policy issue. Important but not urgent. Hal Murray <hmur...@megapathdsl.net>: > > e...@thyrsus.com said: > > I think the hack is to force libgcc_s to be loaded early. I don't know how > > to do that in waf. > > There are two problems in this area. One is the end-of-thread code not > getting locked into memory. I think that is what you are running into. > > The other is a tangle of error handling on out-of-memory issues by things > like pthread_create and DNS lookup. I think the latter end up with a retry > error code. I think I fixed some/many of them to crash rather than retry on > the assumption that memory wasn't going to get freed and I didn't know of any > other reason to retry. But that was a long time ago (maybe pre fork) and I > don't remember the details. > > > I think we should copy the warmup code from ntp classic. It's basically an > upstream bug. Warmup seems like a reasonable work around. We could do that. But I'm opposed to the idea. Not because I think the warmup code is of itself bad, but because adding complexity seems like the wrong direction to go in general. The project motto is "Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away." I didn't pick it out of a hat. I wasn't just quoting it as a tribal shibboleth. I *meant* it, and I've acted on it to the project's great benefit. Given a choice, I will almost always opt for the fix that removes complexity and code bulk even if it sacrifices a feature I consider marginal. My being relentless about this is the direct reason we've dodged so many CVEs; that is real-world feedback telling me to keep up the simplifying pressure. In this case, we have two possible complexity-reducing fixes. One is to drop the memlock feature entirely. The other is to drop the buggy homebrew asynchronous-DNS lookup from Classic and use libc's. 
Before I will willingly sign off on any solution that adds code, someone needs to explain to me why neither of those approaches will fly. It could be, for example, that Daniel thinks we need memlocking for crypto security. (I'm not going to buy "performance", not when modern systems swap so seldom that many people have stopped bothering with swap partitions.) But if so, I want to hear him explain that and establish that the memory-locking code is worth its weight. It could be that Mark judges there's a really important platform out there that has POSIX threads but is non-libc, so getaddrinfo_a() is an unacceptable port blocker that can be solved with the homebrew code. But if so, I want to hear him explain that and establish that the homebrew lookup code is worth its weight. Nothing that increases our defect rate gets to stay in purely on historical inertia. Show me the use case, please. -- Eric S. Raymond <http://www.catb.org/~esr/>
Re: Use of pool servers reveals unacceptable crash rate in async DNS
e...@thyrsus.com said: > I think the hack is to force libgcc_s to be loaded early. I don't know how > to do that in waf. There are two problems in this area. One is the end-of-thread code not getting locked into memory. I think that is what you are running into. The other is a tangle of error handling on out-of-memory issues by things like pthread_create and DNS lookup. I think the latter end up with a retry error code. I think I fixed some/many of them to crash rather than retry on the assumption that memory wasn't going to get freed and I didn't know of any other reason to retry. But that was a long time ago (maybe pre fork) and I don't remember the details. I think we should copy the warmup code from ntp classic. It's basically an upstream bug. Warmup seems like a reasonable work around. It's in ntpd/ntpd.c Search for NEED_PTHREAD_WARMUP and backup over the long comment which describes what's going on. There is a note about not working on FreeBSD. I haven't sorted that out. It may refer to the linker hack. Here are the bugs I remember: https://bugs.ntp.org/show_bug.cgi?id=2831 FreeBSD page fault story, morphs into lock discussion https://bugs.ntp.org/show_bug.cgi?id=2905 rlimit/memlock discussion There is more info in various bugs: https://bugs.ntp.org/show_bug.cgi?id=2332 https://bugs.ntp.org/show_bug.cgi?id=2954 https://bugs.ntp.org/show_bug.cgi?id=2817 The signal/noise may not be good. -- These are my opinions. I hate spam.
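Forcing libgcc_s to load early boils down to getting the library mapped before memory is locked, so the later lazy load during thread cancellation should find it already resident instead of needing new locked pages. Classic does this with a dummy warmup thread (the NEED_PTHREAD_WARMUP code); a more direct sketch of the same idea, shown in Python via `ctypes` and assuming the usual Linux soname `libgcc_s.so.1`, is simply to open the library up front. `preload_libgcc_s` is a hypothetical helper.

```python
import ctypes


def preload_libgcc_s(soname="libgcc_s.so.1"):
    """Map libgcc_s into the process now; True if it is (or already was) loaded."""
    try:
        ctypes.CDLL(soname, mode=ctypes.RTLD_GLOBAL)
    except OSError:
        return False  # not installed under that name; the caller decides what to do
    return True
```

The ordering is the whole point: call this before `mlockall()`, not after.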
Re: Use of pool servers reveals unacceptable crash rate in async DNS
On Sat, Jun 25, 2016 at 06:13:56PM -0400, Eric S. Raymond wrote: > Hal Murray: > > > > e...@thyrsus.com said: > > > 1. Apply Classic's workaround for the problem, which I don't remember the > > > details of but involved some dodgy nonstandard linker hacks done through > > > the > > > build system. *However, I did not trust this method when I understood > > > it.* > > > It seemed sure to cause porting difficulties and is inherently fragile. > > > > k...@roeckx.be said: > > > If it's the one I'm thinking about, I think the solution is to remove the > > > locking of memory. > > > > We may be confusing several bugs. > > > > There was a problem with locking stuff into memory. Some library needed by > > end of thread processing wasn't loaded yet and things worked out such that > > with the default memory 32 bit systems worked but 64 bit systems didn't > > have > > enough room. > > > > I think one solution was to create a dummy thread early on to get that > > module > > loaded. Or disable memory locking, or tell it to use more memory, or ... > > This matches what I remember, except for "use more memory". There was a third > workaround involved weird linker options to force early loading of the > library. Like -Wl,-z,now? That's not such a weird option. Kurt