On Wed, May 15, 2024 at 2:59 PM Michal Sekletar <msekl...@redhat.com> wrote:
> Hi everyone, > > I am currently struggling to understand why I am running into D-Bus > related timeout on PID 1 API bus, which in turn causes systemd to > transition the bus to BUS_CLOSING state and stop all Type=dbus services. > Here is a reproducer that was put together by Frantisek. > > #/bin/bash > set -eux > set -o pipefail > > NMOUNTS=800 > MPOINT="$(findmnt --mountpoint / --noheading -o SOURCE)" > TS="$(date +"%s.%N")" > > systemctl mask --now --runtime systemd-oomd.{socket,service} > id testuser || useradd testuser > systemctl stop "tmp-mnt*.mount" || : > systemctl log-level debug > > for i in {0..100}; do > echo "=== $i ===" > for i in $(seq 0 $NMOUNTS); do > mkdir -p "/tmp/mnt$i" > systemd-mount "$MPOINT" "/tmp/mnt$i" & > done > wait > for i in $(seq 0 $NMOUNTS); do > systemd-umount "/tmp/mnt$i" & > done > wait > > busctl status --user --machine=testuser@ --no-pager > systemctl stop user@4711.service > (! journalctl --since="@$TS" --grep "Connection terminated") > done > > The timeout is detected in process_timeout() in sd-bus, however, the > reply_callback which timed out has a message with matching reply_cookie > already in the bus read queue (for a long time). How is it possible it > wasn't processed in the previous event loop iteration? Clearly in > reproducer we force systemd to do a bunch of work, i.e. even loop is > getting dispatched all the time? Btw, in order to verify that reply is > already sitting in the queue I've compiled systemd with the following patch. > Great, my mailer screwed up the patch so if anyone would want to apply it, please take it from Github instead. https://github.com/msekletar/systemd/commit/a636dcd44996891b0c841e037cd36eb46c15b681.patch > > diff --git a/src/libsystemd/sd-bus/bus-message.h > b/src/libsystemd/sd-bus/bus-message.h > index 76f0d853d3..16d25e11a7 100644 > --- a/src/libsystemd/sd-bus/bus-message.h > +++ b/src/libsystemd/sd-bus/bus-message.h > @@ -119,6 +119,8 @@ struct sd_bus_message { > unsigned n_header_offsets; > > uint64_t read_counter; > + > + time_t rq_in; > }; > > static inline bool BUS_MESSAGE_NEED_BSWAP(sd_bus_message *m) { > diff --git a/src/libsystemd/sd-bus/bus-socket.c > b/src/libsystemd/sd-bus/bus-socket.c > index 07179e0705..6c1bc3e13b 100644 > --- a/src/libsystemd/sd-bus/bus-socket.c > +++ b/src/libsystemd/sd-bus/bus-socket.c > @@ -1337,6 +1337,7 @@ static int bus_socket_make_message(sd_bus *bus, > size_t size) { > if (t) { > t->read_counter = ++bus->read_counter; > bus->rqueue[bus->rqueue_size++] = > bus_message_ref_queued(t, bus); > + t->rq_in = time(NULL); > sd_bus_message_unref(t); > } > > diff --git a/src/libsystemd/sd-bus/sd-bus.c > b/src/libsystemd/sd-bus/sd-bus.c > index 22784e8f66..9cc7ce212f 100644 > --- a/src/libsystemd/sd-bus/sd-bus.c > +++ b/src/libsystemd/sd-bus/sd-bus.c > @@ -511,6 +511,7 @@ static int synthesize_connected_signal(sd_bus *bus) { > memmove(bus->rqueue + 1, bus->rqueue, sizeof(sd_bus_message*) * > bus->rqueue_size); > bus->rqueue[0] = bus_message_ref_queued(m, bus); > bus->rqueue_size++; > + m->rq_in = time(NULL); > > return 0; > } > @@ -2681,6 +2682,7 @@ static int process_timeout(sd_bus *bus) { > struct reply_callback *c; > sd_bus_slot *slot; > bool is_hello; > + unsigned i; > usec_t n; > int r; > > @@ -2695,6 +2697,15 @@ static int process_timeout(sd_bus *bus) { > if (c->timeout_usec > n) > return 0; > > + for (i = 0; i < bus->rqueue_size; i++) { > + sd_bus_message *q = bus->rqueue[i]; > + > + if (c->cookie == q->reply_cookie) { > + log_error("Message was put to read queue %" > PRIiMAX " seconds ago.", time(NULL) - q->rq_in); > + assert_not_reached(); > + } > + } > + > r = bus_message_new_synthetic_error( > bus, > c->cookie, > @@ -4464,5 +4475,6 @@ _public_ int sd_bus_enqueue_for_read(sd_bus *bus, > sd_bus_message *m) { > return r; > > bus->rqueue[bus->rqueue_size++] = bus_message_ref_queued(m, bus); > + m->rq_in = time(NULL); > return 0; > } > > Once compiled with above patch and running reproducer for some time I am > hitting assert_not_reached() and log says that matching reply was in read > queue for more than 82 seconds! Any ideas or debugging tips would be highly > appreciated. > > Cheers, > Michal > > >