Hi Maxim!

Maxim Cournoyer <maxim.courno...@gmail.com> skribis:
[...]

> Oh! Marius found this interesting issue [0] that they shared on IRC
> today; I wonder if it may be related.  sd-bus apparently pipelines
> things aggressively, which is not always handled well by other D-Bus
> participants.
>
> [0] https://github.com/systemd/systemd/issues/16610

Interesting.

>> Anyway, the “Hello” message is sent to the system bus asynchronously in
>> ‘sd-bus.c’:

[...]

>> static int hello_callback(sd_bus_message *reply, void *userdata,
>>                           sd_bus_error *error) {
>>
>>         [...]
>>
>> fail:
>>         /* When Hello() failed, let's propagate this in two ways: first we return the error immediately here,
>>          * which is the propagated up towards the event loop.  Let's also invalidate the connection, so that
>>          * if the user then calls back into us again we won't wait any longer. */
>>
>>         bus_set_state(bus, BUS_CLOSING);
>>         return r;
>> }
>>
>> It’s not clear from that whether the authors intended for the thing to
>> keep going in case of failure.  In our case it’s not helpful.
>
> If we picture this in the systemd use case, I believe sd-bus must be
> used as *the* d-bus daemon, perhaps?  So it should never die, and expect
> users to call back into it to retry things?  In our case, it acts as a
> D-Bus client, not a server (IIUC), so perhaps its behavior is not tuned
> right for our use case.

I think systemd-logind is a separate process, just like elogind, that
talks over D-Bus; I don’t think there’s any difference here.  It just
seems that this corner case, where we don’t get a reply for ‘Hello’, is
not correctly handled.

> Interesting that you mention this; that's what I worked on yesterday
> (see attached patches).  I managed to get elogind panicking the kernel
> during a guix system reconfigure, which corrupted GRUB, so had to chroot
> and reconfigure from there [1].  Not sure what happened, but it seems
> that killing and restarting elogind is susceptible to cause hard locks.

Ouch!  This is weird.
Note that “everything” depends on elogind, as can be seen with ‘guix
system shepherd-graph’.  So “herd stop elogind” is not something you
should try at home.

Incidentally, can you reproduce the bug in a VM?  That would be much
nicer.

>> We know that elogind is started after dbus-daemon has written its PID
>> file (there’s a Shepherd service dependency).  Is there a possibility
>> that dbus-daemon writes its PID file before it has set
>> ‘new_connection_function’?
>
> Are PID files conventionally agreed to be synchronization primitives?

Yes.  Daemons are supposed to write their PID file once they’re ready to
do their job.

> I'm sorry for my lack of deep analysis here, I've spent most of my
> evening attempting to fix my system just to boot ^^'.  I've at least
> managed to collect the following verbose D-Bus log (54 MiB uncompressed)
> which hopefully can shed some light onto how this failure came to be.

Excerpt:

--8<---------------cut here---------------start------------->8---
366: 0x7f28f396e740: 1639108961.938358 [dbus-sysdeps-util-unix.c(237):_dbus_write_pid_to_file_and_pipe] writing pid file /var/run/dbus/pid
366: 0x7f28f396e740: 1639108961.938438 [dbus-sysdeps-util-unix.c(291):_dbus_write_pid_to_file_and_pipe] No pid pipe to write to
366: 0x7f28f396e740: 1639108961.938474 [dbus-userdb.c(177):_dbus_user_database_lookup] Using cache for UID 986 information
366: 0x7f28f396e740: 1639108961.938665 [dbus-sysdeps-unix.c(3514):_dbus_socketpair] full-duplex pipe 6 <-> 7
366: 0x7f28f396e740: 1639108961.938700 [main.c(719):main] We are on D-Bus...
366: 0x7f28f396e740: 1639108961.938725 [dbus-mainloop.c(884):_dbus_loop_run] Running main loop, depth 0 -> 1
366: 0x7f28f396e740: 1639108962.566557 [dbus-server-socket.c(182):socket_handle_watch] Handling client connection, flags 0x1
366: 0x7f28f396e740: 1639108962.566623 [dbus-sysdeps-unix.c(2344):_dbus_accept] client fd 8 accepted
366: 0x7f28f396e740: 1639108962.566665 [dbus-server-socket.c(94):handle_new_client_fd_and_unlock] Creating new client connection with fd 8
366: 0x7f28f396e740: 1639108962.566779 [dbus-connection.c(1360):_dbus_connection_new_for_transport] LOCK
366: 0x7f28f396e740: 1639108962.566824 [dbus-transport-socket.c(180):check_read_watch] fd = 8
366: 0x7f28f396e740: 1639108962.566869 [dbus-transport-socket.c(227):check_read_watch] setting read watch enabled = 1
[...]
366: 0x7f28f396e740: 1639108962.568765 [dbus-transport-socket.c(974):socket_handle_watch] handling read watch 0x23c2b00 flags = 9
366: 0x7f28f396e740: 1639108962.568807 [dbus-transport-socket.c(348):exchange_credentials] exchange_credentials: do_reading = 1, do_writing = 0
366: 0x7f28f396e740: 1639108962.568862 [dbus-transport-socket.c(383):exchange_credentials] Failed to read credentials Failed to read credentials byte (zero-length read)
366: 0x7f28f396e740: 1639108962.568904 [dbus-transport.c(510):_dbus_transport_disconnect] start
366: 0x7f28f396e740: 1639108962.568945 [dbus-transport-socket.c(1041):socket_disconnect]
366: 0x7f28f396e740: 1639108962.568986 [dbus-transport-socket.c(76):free_watches] start
366: 0x7f28f396e740: 1639108962.569029 [dbus-watch.c(423):_dbus_watch_list_remove_watch] Removing watch on fd 8
--8<---------------cut here---------------end--------------->8---

We see this sequence:

  1. Writes PID file.
  2. Accepts incoming connection.
  3. “Failed to read credentials”.
  4. Close connection.

So we need to understand #3.  Perhaps going back to strace logs could
help this time…

Thanks,
Ludo’.