Hi Stefano and all,

Thank you, Stefano, for your response and for your skepticism about whether
this was a kernel issue - you were absolutely right to question it!

After extensive debugging with strace on both guest and host, I've
determined this was NOT a kernel bug at all, but rather an OpenSSH issue
specific to vsock connections.

Root Cause:
-----------
The 10-20 second delay was caused by OpenSSH's sshd attempting DNS lookups
on the literal string "UNKNOWN" (the placeholder hostname used for vsock
connections where no IP address exists). This triggered two 5-second DNS
timeouts during login recording and audit subsystem operations, totaling
~10 seconds of delay.

The strace showed:
   17:11:14.465 sendmmsg(13, DNS query for "UNKNOWN")
   17:11:14.465 poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout) <5.005s>
   17:11:19.472 sendmmsg(13, DNS query for "UNKNOWN") [RETRY]
   17:11:19.472 poll([{fd=13, events=POLLIN}], 1, 5000) = 0 (Timeout) <5.005s>
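
For anyone who wants to check whether a guest is exposed to the same
stall without running strace on sshd, here is a minimal Python sketch
that times a single resolver call on a name. It is not what sshd does
internally, and the helper name timed_lookup is my own invention; it
only measures how long the system resolver takes to answer or give up.

```python
import socket
import time


def timed_lookup(name, resolver=socket.getaddrinfo):
    """Return (resolved, seconds) for one resolver call on `name`.

    `resolver` is injectable so the helper can be exercised without a
    network; by default it goes through the system resolver.
    """
    start = time.monotonic()
    try:
        resolver(name, None)
        resolved = True
    except OSError:
        # socket.gaierror is a subclass of OSError
        resolved = False
    return resolved, time.monotonic() - start
```

Calling timed_lookup("UNKNOWN") on an affected guest should take
roughly as long as the resolver timeouts seen in the trace above, while
a guest that can answer or reject the name locally should return almost
immediately.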

Why I Initially Thought It Was a Kernel Issue:
----------------------------------------------
- bpftrace showed ppoll() timeouts while data appeared to be queued
- The pattern looked like a classic lost wakeup race condition

However, the vsock kernel modules were working perfectly. The delay
happened in userspace during sshd's session setup, specifically when
mm_record_login() tried to resolve the peer hostname for logging.

The Fix:
--------
OpenSSH 10.1 and 10.2 include fixes to prevent passing "UNKNOWN" to
subsystems that would attempt DNS resolution:

- 10.1: Skip audit logging for UNKNOWN hostnames
- 10.2: Don't set PAM_RHOST when remote host is "UNKNOWN"

References:
- https://github.com/openssh/openssh-portable/pull/388
- https://gitlab.archlinux.org/archlinux/packaging/packages/openssh/-/issues/16
- https://www.openssh.org/releasenotes.html

Workaround for older OpenSSH versions:
Add the following line to /etc/hosts:
   127.0.0.1 UNKNOWN
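
To confirm the workaround is actually in place on a given guest, a
small Python sketch like the one below can scan the hosts file for the
name. The function name hosts_maps_name is hypothetical, and the
case-insensitive comparison is my assumption about how hosts-file
lookups behave; adjust if your resolver differs.

```python
def hosts_maps_name(hosts_text, name):
    """Return True if any non-comment line of a hosts file lists `name`."""
    wanted = name.lower()
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated fields
        fields = line.split("#", 1)[0].split()
        # fields[0] is the address; the rest are hostnames and aliases
        if len(fields) >= 2 and wanted in (f.lower() for f in fields[1:]):
            return True
    return False
```

Typical use would be to read /etc/hosts into a string and pass it in
together with "UNKNOWN"; a False result means the DNS path is still
reachable for that name.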

Apologies for the noise on netdev - the vsock kernel implementation is
working correctly. The misleading symptoms (PTY-specific, ppoll timeouts,
state between connections) made it appear kernel-related when it was
actually sshd's login recording code hitting DNS timeouts.

Thanks again for your help and for maintaining the vsock subsystem!

Best regards,
[Your name - don't forget to update it this time or you'll look even 
more stupid]


