On Mon, Dec 13, 2021 at 09:42:04AM -0700, Jim Fehlig wrote:
On 12/12/21 14:15, Martin Kletzander wrote:
On Sun, Dec 12, 2021 at 10:40:46AM -0700, Jim Fehlig wrote:
On 12/11/21 03:28, Martin Kletzander wrote:
On Sat, Dec 11, 2021 at 11:16:13AM +0100, Martin Kletzander wrote:
On Fri, Dec 10, 2021 at 05:48:03PM -0700, Jim Fehlig wrote:
Hi Martin!

I recently received a bug report (sorry, not public) about simple
operations like 'virsh list' hanging when invoked with an internal test
tool. I found this commit to be the culprit.


OK, one more thing though: the fact that pkttyagent is spawned cannot
cause virsh to hang.  If authentication is not required, then it will
just wait there for a while and then be killed.  If authentication *is*
required, then either you already have an agent running and that one
should be used, since we're starting pkttyagent with `--fallback`, or
you do not have any agent running, in which case virsh list would fail
to connect.  Where does virsh hang, and what's the backtrace?
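
For reference, the spawn itself is just pkttyagent run with --fallback
and pointed at virsh's own PID (the debug log further down shows the
exact command line).  A rough standalone sketch of that idea, not the
actual virCommand-based code in libvirt:

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Rough illustration only: start pkttyagent as a fallback agent for
     * the calling process.  libvirt additionally passes --notify-fd so
     * it can tell when the agent has registered. */
    static pid_t
    spawn_tty_agent(void)
    {
        pid_t child = fork();

        if (child == 0) {
            char pidstr[32];

            snprintf(pidstr, sizeof(pidstr), "%ld", (long) getppid());
            execlp("pkttyagent", "pkttyagent",
                   "--process", pidstr, "--fallback", (char *) NULL);
            _exit(127);
        }
        return child;
    }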

The last scenario you describe appears to be the case: virsh fails to
connect, then gets stuck trying to kill off pkttyagent:

#0  0x00007f9f07530241 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007f9f07535ad3 in nanosleep () from /lib64/libc.so.6
#2  0x00007f9f07f478af in g_usleep () from /usr/lib64/libglib-2.0.so.0
#3  0x00007f9f086694fa in virProcessAbort (pid=367) at ../src/util/virprocess.c:187
#4  0x00007f9f0861ed9b in virCommandAbort (cmd=cmd@entry=0x55a798660c50) at ../src/util/vircommand.c:2774
#5  0x00007f9f08621478 in virCommandFree (cmd=0x55a798660c50) at ../src/util/vircommand.c:3061
#6  0x00007f9f08668581 in virPolkitAgentDestroy (agent=0x55a7986426e0) at ../src/util/virpolkit.c:164
#7  0x000055a797836d93 in virshConnect (ctl=ctl@entry=0x7ffc551dd980, uri=0x0, readonly=readonly@entry=false) at ../tools/virsh.c:187
#8  0x000055a797837007 in virshReconnect (ctl=ctl@entry=0x7ffc551dd980, name=name@entry=0x0, readonly=<optimized out>, readonly@entry=false, force=force@entry=false) at ../tools/virsh.c:223
#9  0x000055a7978371e0 in virshConnectionHandler (ctl=0x7ffc551dd980) at ../tools/virsh.c:325
#10 0x000055a797880172 in vshCommandRun (ctl=ctl@entry=0x7ffc551dd980, cmd=0x55a79865f580) at ../tools/vsh.c:1308
#11 0x000055a7978367b7 in main (argc=2, argv=<optimized out>) at ../tools/virsh.c:907

The odd thing is, I attached gdb to this virsh process several minutes
after invoking the test tool that calls 'virsh list'. I can't explain
why the process is still blocked in g_usleep, which should only have
slept for 10 milliseconds. Even odder, detaching from the process
appears to awaken g_usleep and allows process shutdown to continue. The
oddness can also be seen in the debug output:

2021-12-12 16:35:38.783+0000: 5912: debug : virCommandRunAsync:2629 : About to run /usr/bin/pkttyagent --process 5912 --notify-fd 4 --fallback
2021-12-12 16:35:38.787+0000: 5912: debug : virCommandRunAsync:2632 : Command result 0, with PID 5914
...
2021-12-12 16:35:38.830+0000: 5912: debug : virProcessAbort:177 : aborting child process 5914
2021-12-12 16:35:38.830+0000: 5912: debug : virProcessAbort:185 : trying SIGTERM to child process 5914

(Attach gdb to the process, observe the above backtrace, quit gdb.)

2021-12-12 16:44:18.059+0000: 5912: debug : virProcessAbort:195 : trying SIGKILL to child process 5914
2021-12-12 16:44:18.061+0000: 5912: debug : virProcessAbort:201 : process has ended: fatal signal 9
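
The abort sequence visible in the log is the usual "SIGTERM, poll for a
bit, then SIGKILL" pattern.  A minimal standalone sketch of that idea
(not the actual virProcessAbort() code, which differs in details; the
loop bound here is arbitrary):

    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Ask the child to exit, poll briefly, then force the issue.  The
     * 10 ms sleep here stands in for the g_usleep() that the backtrace
     * above is stuck in. */
    static void
    abort_child(pid_t pid)
    {
        int i;

        kill(pid, SIGTERM);
        for (i = 0; i < 10; i++) {
            if (waitpid(pid, NULL, WNOHANG) == pid)
                return;              /* child is already gone */
            usleep(10 * 1000);       /* ~10 ms between checks */
        }
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
    }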


The stuck g_usleep() is weird.  Isn't there a tremendous load on the
machine?  I can't think of much else.

The machine is mostly idle.

Also, I am looking at the pkttyagent code and it looks like it blocks
the first SIGTERM, so sending two of them should help in that case; but
if we want to wait a few ms between them, then we'll be in the same
pickle.
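
In code the idea would be something like this (hypothetical and
untested, just to illustrate):

    #include <glib.h>
    #include <signal.h>
    #include <sys/types.h>

    /* Hypothetical illustration of the double-SIGTERM idea: if
     * pkttyagent really swallows the first SIGTERM, a second one after a
     * short gap should do it -- but that gap is exactly the kind of
     * sleep that is biting us here. */
    static void
    terminate_agent_twice(pid_t agent_pid)
    {
        kill(agent_pid, SIGTERM);
        g_usleep(10 * 1000);        /* arbitrary small gap, microseconds */
        kill(agent_pid, SIGTERM);
    }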


Anyway, if just adding:

     if (!isatty(STDIN_FILENO))
         return false;

This indeed fixes the regression in the test tool.


That just means that it won't start the agent.  Let's do this, but I
would really, *really* like to figure out what the heck is happening
there, because there has to be something wrong, and it might just be
waiting around the corner to bite us in a year or so.  Although I
understand how improbable that is.

Do you have additional suggestions that may help us gain a better understanding
of the problem?


Unfortunately I cannot think of anything other than trying to figure
out whether the kernel is scheduling that task at all, and if not, why.
How to do that, however, is out of my territory :(
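
(One possible first look, offered only as an idea: the process state
and the kernel wait channel for the stuck PID are exported under /proc,
which would at least say whether the task is sleeping, running, or in
uninterruptible wait.  A tiny sketch, helper name mine; ftrace or
perf sched would give a more definitive answer:)

    #include <stdio.h>

    /* Print the wait channel and the raw stat line for a PID; field 3
     * of /proc/<pid>/stat is the scheduler state (R/S/D/T/Z). */
    static void
    dump_sched_hints(long pid)
    {
        char path[64], line[512];
        FILE *fp;

        snprintf(path, sizeof(path), "/proc/%ld/wchan", pid);
        if ((fp = fopen(path, "r")) != NULL) {
            if (fgets(line, sizeof(line), fp))
                printf("wchan: %s\n", line);
            fclose(fp);
        }

        snprintf(path, sizeof(path), "/proc/%ld/stat", pid);
        if ((fp = fopen(path, "r")) != NULL) {
            if (fgets(line, sizeof(line), fp))
                printf("stat:  %s", line);
            fclose(fp);
        }
    }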

Regards,
Jim
