Thanks for the feedback. Just the bit about "when provided by Debian" concerns me - we have a slow-ish cadence of releases, and then distros take another year to ship them :\
You might be reasonably served by a custom build (in-place or even a package). YMMV...

Jim

On Tue, Jun 13, 2023, 18:38 Kari Lempiäinen <kari.lempiai...@summerday.net> wrote:

> Hi,
>
> Great work Jim! I'm glad you could reproduce the problem and found a potential culprit.
>
> Just for my own interest I restored upssched from my backups (version 2.7.4-13) and it seems to be running OK, so no big runtime changes in that regard with Debian 12. It is not hogging the CPU, and from the daemon log the heartbeat seems to be working. The only difference from the old logs (pre-Debian 12 update) is that this line, "Network UPS Tools upsmon 2.8.0", now appears every five minutes:
>
>    Jun 13 19:17:08 fricka upssched[1896873]: Timer daemon started
>    Jun 13 19:17:08 fricka upssched[1896873]: New timer: heartbeat-failure-timer (660 seconds)
>    Jun 13 19:17:08 fricka nut-monitor[1896870]: Network UPS Tools upsmon 2.8.0
>    Jun 13 19:22:08 fricka nut-monitor[1899911]: Network UPS Tools upsmon 2.8.0
>    Jun 13 19:27:08 fricka upssched[1896873]: Cancelling timer: heartbeat-failure-timer
>    Jun 13 19:27:08 fricka upssched[1896873]: New timer: heartbeat-failure-timer (660 seconds)
>    Jun 13 19:27:08 fricka nut-monitor[1903379]: Network UPS Tools upsmon 2.8.0
>    Jun 13 19:32:13 fricka nut-monitor[1906677]: Network UPS Tools upsmon 2.8.0
>
> That line wasn't there previously. I will leave the old version in place until NUT gets fixed and the fix is provided by Debian.
>
> This actually reminded me of a similar problem I had myself a loooong time ago. I was writing a large software distribution program for a Windows platform. The connection between server and client was done over TCP sockets. In my server code there was a place where I read the reply from the client, and the read-from-socket function didn't return an error code, but the data length was 0. In the documentation, at least at the time, it wasn't specified directly that this was an error situation.
> My server code looped in a thread, hogging the CPU. When I modified the code to treat 0 bytes read as an error, everything worked fine.
>
> Best regards,
> Kari
>
> *From:* Jim Klimov <jimklimov+...@gmail.com>
> *Date:* Tuesday, 13. June 2023 at 18.36
> *To:* Kari Lempiäinen <kari.lempiai...@summerday.net>
> *Cc:* Arnaud Quette via Nut-upsuser <nut-upsuser@alioth-lists.debian.net>, nut-upsdev <nut-ups...@alioth-lists.debian.net>, Dimitris Economou <dimitris.s.econo...@gmail.com>
> *Subject:* Re: [Nut-upsuser] Upssched 100% CPU after updating Debian 12
>
> So... determining that an FD is to be reaped proved hard. Internet lore suggests fcntl() and poll() on the FD, but the FD just seems valid to them. errno is also usually not raised (though once I did see a "111: Connection refused").
>
> So the best dumb idea so far is to bail out if we spent the whole loop (128 attempts) and only got zero-sized read replies and no errors.
>
> https://github.com/networkupstools/nut/pull/1965
>
> Jim
>
> On Tue, Jun 13, 2023 at 4:06 PM Jim Klimov <jimklimov+...@gmail.com> wrote:
>
> After launching the command several times, with debug output (posted by new code in a new branch for the investigation) confirming that the same daemon handles operations from the new client instances, its strace now has numerous FDs to report after select() - so I guess it is a problem of detecting an exit of the counterpart.
>
>     0.000000  [D2] parse_at: is 'heartbeat@localhost' in AT command the 'heartbeat@localhost' we were launched to process?
>     0.000034  [D1] parse_at: 'heartbeat@localhost' in AT command matched the 'heartbeat@localhost' UPSNAME we were launched to process
>     0.000041  [D1] parse_at: processing CANCEL-TIMER
>     0.000054  [D2] parse_at: is 'heartbeat@localhost' in AT command the 'heartbeat@localhost' we were launched to process?
>     0.000057  [D1] parse_at: 'heartbeat@localhost' in AT command matched the 'heartbeat@localhost' UPSNAME we were launched to process
>     0.000060  [D1] parse_at: processing START-TIMER
>     0.000151  [D1] Keeping stderr open due to debug verbosity 8
>     0.000195  Timer daemon started
>     0.000204  [D2] Timer daemon waiting for connections on pipefd 10
>     0.250325  [D3] new connection on fd 7
>     0.250377  New timer: heartbeat-failure-timer (660 seconds)
>     0.250423  [D1] Exiting upssched (CLI process)
>
>     0.000000  [D2] parse_at: is 'heartbeat@localhost' in AT command the 'heartbeat@localhost' we were launched to process?
>     0.000039  [D1] parse_at: 'heartbeat@localhost' in AT command matched the 'heartbeat@localhost' UPSNAME we were launched to process
>     0.000047  [D1] parse_at: processing CANCEL-TIMER
>    14.745773  [D3] new connection on fd 8
>    14.745829  Cancelling timer: heartbeat-failure-timer
>     0.000153  [D2] parse_at: is 'heartbeat@localhost' in AT command the 'heartbeat@localhost' we were launched to process?
>     0.000159  [D1] parse_at: 'heartbeat@localhost' in AT command matched the 'heartbeat@localhost' UPSNAME we were launched to process
>     0.000163  [D1] parse_at: processing START-TIMER
>    14.745890  [D3] new connection on fd 9
>    14.745969  New timer: heartbeat-failure-timer (660 seconds)
>     0.000293  [D1] Exiting upssched (CLI process)
>
>     0.000001  [D2] parse_at: is 'heartbeat@localhost' in AT command the 'heartbeat@localhost' we were launched to process?
>     0.000042  [D1] parse_at: 'heartbeat@localhost' in AT command matched the 'heartbeat@localhost' UPSNAME we were launched to process
>     0.000050  [D1] parse_at: processing CANCEL-TIMER
>    47.674032  [D3] new connection on fd 11
>    47.674160  Cancelling timer: heartbeat-failure-timer
>     0.000266  [D2] parse_at: is 'heartbeat@localhost' in AT command the 'heartbeat@localhost' we were launched to process?
>     0.000277  [D1] parse_at: 'heartbeat@localhost' in AT command matched the 'heartbeat@localhost' UPSNAME we were launched to process
>     0.000284  [D1] parse_at: processing START-TIMER
>    47.674235  [D3] new connection on fd 12
>    47.674388  New timer: heartbeat-failure-timer (660 seconds)
>     0.000487  [D1] Exiting upssched (CLI process)
>
> Note that `select` only lists the remnants of the "new connection" FDs but not FD#10 (the pipefd):
>
>    select(13, [7 8 9 10 11 12], NULL, NULL, {tv_sec=1, tv_usec=0}) = 5 (in [7 8 9 11 12], left {tv_sec=0, tv_usec=999997})
>
> Jim
>
> On Tue, Jun 13, 2023 at 3:41 PM Jim Klimov <jimklimov+...@gmail.com> wrote:
>
> So, some good news: I hear(*) I managed to reproduce the problem with current NUT master and an adapted copy of your posted configs and script :D Experimental debugging now sounds possible.
>
> (*) The PC under the desk wails with all its cooling fans as soon as I start the client, which spawns a daemon and then itself exits:
>
>    $ UPSNAME=heartbeat@localhost NOTIFYTYPE=ONBATT NUT_STATEPATH=/dev/shm NUT_CONFPATH=`pwd`/tests/NIT/tmp/etc ./clients/upssched -DDDDDDDD
>     0.000000  [D2] parse_at: is 'heartbeat@localhost' in AT command the 'heartbeat@localhost' we were launched to process?
>     0.000053  [D1] parse_at: 'heartbeat@localhost' in AT command matched the 'heartbeat@localhost' UPSNAME we were launched to process
>     0.000066  [D1] parse_at: processing CANCEL-TIMER
>     0.000089  [D2] parse_at: is 'heartbeat@localhost' in AT command the 'heartbeat@localhost' we were launched to process?
>     0.000096  [D1] parse_at: 'heartbeat@localhost' in AT command matched the 'heartbeat@localhost' UPSNAME we were launched to process
>     0.000104  [D1] parse_at: processing START-TIMER
>    $
>
> Jim
>
> On Tue, Jun 13, 2023 at 1:58 PM Kari Lempiäinen <kari.lempiai...@summerday.net> wrote:
>
> Hi,
>
> Thanks Jim! Here is more system information from the commands you mentioned.
> Kari
>
>    root@fricka:~# lsof -p 1716171
>    lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1002/gvfs
>          Output information may be incomplete.
>    COMMAND      PID USER  FD  TYPE             DEVICE SIZE/OFF     NODE NAME
>    upssched 1716171 root  cwd  DIR                8,2     4096        2 /
>    upssched 1716171 root  rtd  DIR                8,2     4096        2 /
>    upssched 1716171 root  txt  REG                8,2    39240 21762148 /usr/sbin/upssched
>    upssched 1716171 root  mem  REG                8,2  1922136 21764230 /usr/lib/x86_64-linux-gnu/libc.so.6
>    upssched 1716171 root  mem  REG                8,2   210968 21764476 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
>    upssched 1716171 root   0u  CHR                1,3      0t0        4 /dev/null
>    upssched 1716171 root   1u  CHR                1,3      0t0        4 /dev/null
>    upssched 1716171 root   2u  CHR                1,3      0t0        4 /dev/null
>    upssched 1716171 root   3u unix 0x00000000eeb52825      0t0 16427010 type=DGRAM (CONNECTED)
>    upssched 1716171 root   4r  REG                8,2      545 39847858 /etc/nut/upssched.conf
>    upssched 1716171 root   5u unix 0x00000000a035b466      0t0 16426351 type=STREAM (UNCONNECTED)
>    upssched 1716171 root   6u unix 0x0000000050820b2b      0t0 16426352 type=STREAM (UNCONNECTED)
>    upssched 1716171 root   7u unix 0x00000000447aac0e      0t0 16424724 /var/run/nut/upssched.pipe type=STREAM (CONNECTED)
>    upssched 1716171 root  10u unix 0x00000000d7bd8088      0t0 16424722 /var/run/nut/upssched.pipe type=STREAM (LISTEN)
>
>    root@fricka:~# ps -ef | grep ups
>    root      1425       1  0 Jun11 ?     00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
>    root   1156722       1  0 00:00 ?     00:00:00 /usr/sbin/cupsd -l
>    root   1156723       1  0 00:00 ?     00:00:00 /usr/sbin/cups-browsed
>    lp     1156748 1156722  0 00:00 ?     00:00:00 /usr/lib/cups/notifier/dbus dbus://
>    nut    1713117       1  0 14:30 ?     00:00:00 /lib/nut/dummy-ups -a heartbeat
>    nut    1713154       1  0 14:30 ?     00:00:00 /lib/nut/usbhid-ups -a eaton5s
>    nut    1713155       1  0 14:30 ?     00:00:00 /lib/nut/upsd -F
>    root   1713156       1  0 14:30 ?     00:00:00 /lib/nut/upsmon -F
>    root   1713157 1713156  0 14:30 ?     00:00:00 /lib/nut/upsmon -F
>    root   1716171       1 99 14:35 ?     00:05:10 /sbin/upssched UPS heartbeat@localhost: On battery.
>    root   1719510 1550505  0 14:40 pts/2 00:00:00 grep ups
>
>    root@fricka:~# fuser /var/run/nut/upssched.pipe
>    /run/nut/upssched.pipe: 1716171
>
>    root@fricka:/proc/1716171/fd# ls -la
>    total 0
>    dr-x------ 2 root root  0 Jun 13 14:36 .
>    dr-xr-xr-x 9 root root  0 Jun 13 14:35 ..
>    lrwx------ 1 root root 64 Jun 13 14:36 0 -> /dev/null
>    lrwx------ 1 root root 64 Jun 13 14:36 1 -> /dev/null
>    lrwx------ 1 root root 64 Jun 13 14:36 10 -> 'socket:[16424722]'
>    lrwx------ 1 root root 64 Jun 13 14:36 2 -> /dev/null
>    lrwx------ 1 root root 64 Jun 13 14:36 3 -> 'socket:[16427010]'
>    lr-x------ 1 root root 64 Jun 13 14:36 4 -> /etc/nut/upssched.conf
>    lrwx------ 1 root root 64 Jun 13 14:36 5 -> 'socket:[16426351]'
>    lrwx------ 1 root root 64 Jun 13 14:36 6 -> 'socket:[16426352]'
>    lrwx------ 1 root root 64 Jun 13 14:36 7 -> 'socket:[16424724]'
>
> ------------------------------
>
> *From:* Jim Klimov <jimklimov+...@gmail.com>
> *Sent:* Tuesday, June 13, 2023 14:17
> *To:* Kari Lempiäinen <kari.lempiai...@summerday.net>
> *Cc:* Arnaud Quette via Nut-upsuser <nut-upsuser@alioth-lists.debian.net>; nut-upsdev <nut-ups...@alioth-lists.debian.net>; Dimitris Economou <dimitris.s.econo...@gmail.com>
> *Subject:* Re: [Nut-upsuser] Upssched 100% CPU after updating Debian 12
>
> FWIW, cross-posted the issue to the NUT GitHub tracker:
> https://github.com/networkupstools/nut/issues/1964
>
> Jim
_______________________________________________
Nut-upsuser mailing list
Nut-upsuser@alioth-lists.debian.net
https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/nut-upsuser