Hi Maxim!

Maxim Cournoyer <maxim.courno...@gmail.com> skribis:
[...]

> Oh! Marius found this interesting issue [0] that they shared on IRC
> today; I wonder if it may be related.  sd-bus apparently pipelines
> things aggressively, which is not always handled well by other D-Bus
> participants.
>
> [0] https://github.com/systemd/systemd/issues/16610

Interesting.

>> Anyway, the “Hello” message is sent to the system bus asynchronously in
>> ‘sd-bus.c’:

[...]

>> static int hello_callback(sd_bus_message *reply, void *userdata,
>>                           sd_bus_error *error) {
>>
>>         [...]
>>
>> fail:
>>         /* When Hello() failed, let's propagate this in two ways: first we return the error immediately here,
>>          * which is the propagated up towards the event loop.  Let's also invalidate the connection, so that
>>          * if the user then calls back into us again we won't wait any longer. */
>>
>>         bus_set_state(bus, BUS_CLOSING);
>>         return r;
>> }
>>
>> It’s not clear from that whether the authors intended for the thing to
>> keep going in case of failure.  In our case it’s not helpful.
>
> If we picture this in the systemd use case, I believe sd-bus must be
> used as *the* d-bus daemon, perhaps?  So it should never die, and expect
> users to call back into it to retry things?  In our case, it acts as a
> D-Bus client, not a server (IIUC), so perhaps its behavior is not tuned
> right for our use case.

I think systemd-logind is a separate process, just like elogind, that
talks over D-Bus; I don’t think there’s any difference here.  It just
seems that this corner case, where we don’t get a reply for ‘Hello’, is
not correctly handled.

> Interesting that you mention this; that's what I worked on yesterday
> (see attached patches).  I managed to get elogind panicking the kernel
> during a guix system reconfigure, which corrupted GRUB, so had to chroot
> and reconfigure from there [1].  Not sure what happened, but it seems
> that killing and restarting elogind is susceptible to cause hard locks.

Ouch!  This is weird.
Note that “everything” depends on elogind, as can be seen with ‘guix
system shepherd-graph’.  So “herd stop elogind” is not something you
should try at home.

Incidentally, can you reproduce the bug in a VM?  That would be much
nicer.

>> We know that elogind is started after dbus-daemon has written its PID
>> file (there’s a Shepherd service dependency).  Is there a possibility
>> that dbus-daemon writes its PID file before it has set
>> ‘new_connection_function’?
>
> Are PID files conventionally agreed to be synchronization primitives?

Yes.  Daemons are supposed to write their PID file once they’re ready to
do their job.

> I'm sorry for my lack of deep analysis here, I've spent most of my
> evening attempting to fix my system just to boot ^^'.  I've at least
> managed to collect the following verbose D-Bus log (54 MiB uncompressed)
> which hopefully can shed some light onto how this failure came to be.

Excerpt:

--8<---------------cut here---------------start------------->8---
366: 0x7f28f396e740: 1639108961.938358 [dbus-sysdeps-util-unix.c(237):_dbus_write_pid_to_file_and_pipe] writing pid file /var/run/dbus/pid
366: 0x7f28f396e740: 1639108961.938438 [dbus-sysdeps-util-unix.c(291):_dbus_write_pid_to_file_and_pipe] No pid pipe to write to
366: 0x7f28f396e740: 1639108961.938474 [dbus-userdb.c(177):_dbus_user_database_lookup] Using cache for UID 986 information
366: 0x7f28f396e740: 1639108961.938665 [dbus-sysdeps-unix.c(3514):_dbus_socketpair] full-duplex pipe 6 <-> 7
366: 0x7f28f396e740: 1639108961.938700 [main.c(719):main] We are on D-Bus...
366: 0x7f28f396e740: 1639108961.938725 [dbus-mainloop.c(884):_dbus_loop_run] Running main loop, depth 0 -> 1
366: 0x7f28f396e740: 1639108962.566557 [dbus-server-socket.c(182):socket_handle_watch] Handling client connection, flags 0x1
366: 0x7f28f396e740: 1639108962.566623 [dbus-sysdeps-unix.c(2344):_dbus_accept] client fd 8 accepted
366: 0x7f28f396e740: 1639108962.566665 [dbus-server-socket.c(94):handle_new_client_fd_and_unlock] Creating new client connection with fd 8
366: 0x7f28f396e740: 1639108962.566779 [dbus-connection.c(1360):_dbus_connection_new_for_transport] LOCK
366: 0x7f28f396e740: 1639108962.566824 [dbus-transport-socket.c(180):check_read_watch] fd = 8
366: 0x7f28f396e740: 1639108962.566869 [dbus-transport-socket.c(227):check_read_watch] setting read watch enabled = 1
[...]
366: 0x7f28f396e740: 1639108962.568765 [dbus-transport-socket.c(974):socket_handle_watch] handling read watch 0x23c2b00 flags = 9
366: 0x7f28f396e740: 1639108962.568807 [dbus-transport-socket.c(348):exchange_credentials] exchange_credentials: do_reading = 1, do_writing = 0
366: 0x7f28f396e740: 1639108962.568862 [dbus-transport-socket.c(383):exchange_credentials] Failed to read credentials Failed to read credentials byte (zero-length read)
366: 0x7f28f396e740: 1639108962.568904 [dbus-transport.c(510):_dbus_transport_disconnect] start
366: 0x7f28f396e740: 1639108962.568945 [dbus-transport-socket.c(1041):socket_disconnect]
366: 0x7f28f396e740: 1639108962.568986 [dbus-transport-socket.c(76):free_watches] start
366: 0x7f28f396e740: 1639108962.569029 [dbus-watch.c(423):_dbus_watch_list_remove_watch] Removing watch on fd 8
--8<---------------cut here---------------end--------------->8---

We see this sequence:

  1. Writes PID file.
  2. Accepts incoming connection.
  3. “Failed to read credentials”.
  4. Close connection.

So we need to understand #3.  Perhaps going back to strace logs could
help this time…

Thanks,
Ludo’.