On 5/7/26 22:13, Polina Vishneva wrote:
From: "Denis V. Lunev" <[email protected]>
When the host initiates an AF_VSOCK connect() to a guest that has not
yet loaded the virtio-vsock transport (i.e. still booting), the caller
blocks for VSOCK_DEFAULT_CONNECT_TIMEOUT (2 seconds), because
vhost_transport_do_send_pkt() silently exits when
vhost_vq_get_backend(vq) returns NULL.
If the guest doesn't start listening within this timeout, connect()
returns ETIMEDOUT.
This delay is usually pointless, and it does not align well with our
behavior at other initialization stages: for example, if a connection is
attempted when the guest driver is already loaded but nothing is
listening yet, it returns ECONNRESET immediately without any wait.
Fix this by checking the RX virtqueue backend in
vhost_transport_send_pkt() before queuing. If the backend is NULL,
return -ECONNREFUSED immediately.
Signed-off-by: Denis V. Lunev <[email protected]>
Co-authored-by: Polina Vishneva <[email protected]>
Signed-off-by: Polina Vishneva <[email protected]>
---
drivers/vhost/vsock.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 1d8ec6bed53e..e6de1e23121b 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -302,6 +302,20 @@ vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
return -ENODEV;
}
+ /* If the guest has not yet initialized the RX virtqueue, fail
+ * immediately rather than queueing the packet and letting the
+ * caller wait for VSOCK_DEFAULT_CONNECT_TIMEOUT.
+ *
+ * Reading private_data without vq->mutex is a deliberate racy
+ * check: if the backend is NULL the guest driver is definitely
+ * not ready; if it becomes NULL right after, the worker
+ * (do_send_pkt) rechecks under the mutex. */
+ if (!READ_ONCE(vsock->vqs[VSOCK_VQ_RX].private_data)) {
+ rcu_read_unlock();
+ kfree_skb(skb);
+ return -ECONNREFUSED;
I'm a bit hesitant about the error code returned here.
Who eventually receives this error code, and how does it handle it?
We are in the middle of VM startup, and the guest has not been fully
initialized yet.
But we believe it will be initialized soon, so I'd expect the attempt to be
repeated after a while.
On the other hand, I'm not sure that a process that gets -ECONNREFUSED will
actually retry the attempt.
Maybe use -EAGAIN here: that error code clearly signals that a new
attempt is expected.
AI also suggests -EHOSTUNREACH (and, by the way, AI does not recommend EAGAIN,
he-he :))) ).
EHOSTUNREACH as the error code for "guest transport not ready"
Semantics: EHOSTUNREACH means "the destination host cannot be reached" - the peer exists
conceptually but the communication path to it is currently unavailable. This maps precisely to the
situation: the guest VM exists, QEMU has opened the vhost-vsock device and assigned a CID, but the
guest has not yet loaded its virtio-vsock driver, so the transport path is not established.
Existing usage in vsock subsystem:
• vmci_transport.c:95 - VMCI_ERROR_INVALID_RESOURCE is mapped to EHOSTUNREACH. This is the case
  where the VMCI endpoint for the peer cannot be located - the peer's transport resource does not
  exist yet or has been destroyed.
• vmci_transport_notify.c:436,525 - returned when send_waiting_read() / send_waiting_write() fails,
  meaning the notification could not reach the peer. The peer is considered unreachable.
Both cases share the same pattern: the peer is known to exist (has a CID, was previously connected,
etc.) but the transport layer cannot deliver data to it right now.
Why it fits better than ECONNREFUSED:
• ECONNREFUSED implies the peer received the request and actively rejected it (e.g., nothing
  listening on that port). Here the guest never sees the request at all - the virtqueue backend is
  NULL, so the packet cannot even enter the guest.
• EHOSTUNREACH implies the packet could not be routed/delivered to the destination. This is exactly
  what happens - the RX virtqueue has no backend, so delivery is impossible.
Userspace behavior:
• Programs and retry frameworks commonly treat EHOSTUNREACH as a transient condition worth retrying
  (the host may come up), whereas ECONNREFUSED is typically treated as "service does not exist at
  this address" and not retried.
• For the specific use case (host connecting to a guest that is still booting), retry is the
  correct behavior - the guest will eventually load its driver and become reachable.
It is a standard connect() error code - unlike EAGAIN, which is not expected from connect() and
would confuse most userspace socket code.
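To make the "who receives this error and what does it do with it" question concrete, here is a minimal sketch of a host-side caller that retries on its own, whichever code the kernel ends up returning. Everything in it is an assumption for illustration: the `connect_attempt_fn` callback, `connect_with_retry()`, the `fake_guest` stand-in and, above all, the set of errno values treated as transient, which is a policy choice of the caller and not something the vsock code defines.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: one connect attempt, returns 0 on success or
 * -1 with errno set (the same contract as connect(2)). */
typedef int (*connect_attempt_fn)(void *ctx);

/* Policy assumption: errors that may go away once the guest has booted.
 * The caller picks this set; it is not defined by the kernel. */
static int is_transient_connect_errno(int err)
{
	switch (err) {
	case EHOSTUNREACH:
	case ECONNREFUSED:
	case ECONNRESET:
	case ETIMEDOUT:
		return 1;
	default:
		return 0;
	}
}

/* Retry up to max_attempts times, sleeping delay_ms between attempts.
 * Returns 0 on success, -1 with errno from the last failed attempt. */
static int connect_with_retry(connect_attempt_fn attempt, void *ctx,
			      int max_attempts, long delay_ms)
{
	assert(attempt != NULL && max_attempts > 0);

	for (int i = 0; i < max_attempts; i++) {
		if (attempt(ctx) == 0)
			return 0;
		if (!is_transient_connect_errno(errno))
			return -1;	/* permanent error: give up now */
		if (i + 1 < max_attempts && delay_ms > 0) {
			struct timespec ts = {
				.tv_sec = delay_ms / 1000,
				.tv_nsec = (delay_ms % 1000) * 1000000L,
			};
			nanosleep(&ts, NULL);
		}
	}
	return -1;
}

/* Hypothetical stand-in for a booting guest: the first ready_after
 * attempts fail with fail_errno, later ones succeed. */
struct fake_guest {
	int ready_after;
	int fail_errno;
	int calls;
};

static int fake_attempt(void *ctx)
{
	struct fake_guest *g = ctx;

	if (g->calls++ < g->ready_after) {
		errno = g->fail_errno;
		return -1;
	}
	return 0;
}
```

In a real caller the callback would create an AF_VSOCK socket and call connect() on it. A caller written this way keeps working with ECONNREFUSED as well as with EHOSTUNREACH, so the choice of code matters mostly for callers that do not have such a loop.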
+ }
+
if (virtio_vsock_skb_reply(skb))
atomic_inc(&vsock->queued_replies);
@@ -624,9 +638,6 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
mutex_unlock(&vq->mutex);
}
- /* Some packets may have been queued before the device was started,
- * let's kick the send worker to send them.
- */
vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);
I think the vhost_vq_work_queue() call should be removed here as well, not only
the comment.
Before the patch: packets accumulate while backend is NULL
Timeline from the QEMU/host perspective:
1. QEMU opens /dev/vhost-vsock - struct vhost_vsock is created, but virtqueue backend
(private_data) is still NULL.
2. QEMU issues ioctl(VHOST_VSOCK_SET_GUEST_CID) - sets vsock->guest_cid, inserts vsock into
vhost_vsock_hash. From this point vhost_vsock_get(cid) can find it.
3. Guest is still booting, virtio-vsock driver not loaded yet. But the vsock is already
discoverable by CID lookup.
4. Host calls connect() - the packet gets queued but cannot be delivered:
    connect(fd, {AF_VSOCK, guest_cid, port})
      vsock_connect()                                [af_vsock.c:1650]
        transport->connect(vsk)                      [af_vsock.c:1730]
          virtio_transport_connect()                 [virtio_transport_common.c:1076]
            virtio_transport_send_pkt_info()         [virtio_transport_common.c:328]
              t_ops->send_pkt(skb, net)
                vhost_transport_send_pkt()           [vsock.c:289]
                  vhost_vsock_get(dst_cid) -> found (CID already in hash)
                  virtio_vsock_skb_queue_tail()      ← PACKET QUEUED
                  vhost_vq_work_queue()              ← WORKER KICKED
                  return len                         ← SUCCESS (positive)
Worker wakes up but cannot deliver:
    vhost_transport_send_pkt_work()
      vhost_transport_do_send_pkt(vsock, vq)         [vsock.c:107]
        mutex_lock(&vq->mutex)
        vhost_vq_get_backend(vq) == NULL             ← guest not ready
        goto out                                     ← PACKET STAYS IN QUEUE
        mutex_unlock(&vq->mutex)
Back in vsock_connect() - transport->connect() returned success (len > 0), so the code enters the
wait loop:
    sk->sk_state = TCP_SYN_SENT;
    err = transport->connect(vsk);                   → returns len (success)
    if (err < 0) goto out;                           → NOT taken
    ...
    while (sk->sk_state != TCP_ESTABLISHED && ...) {
        timeout = schedule_timeout(timeout);         ← SLEEPS 2 SECONDS
        if (timeout == 0) {
            err = -ETIMEDOUT;                        ← GIVES UP
        }
    }
The guest never receives the CONNECT request (it is stuck in the queue), so no response arrives,
and connect() returns ETIMEDOUT after 2 seconds.
5. Later the guest finishes booting, loads the virtio-vsock driver, negotiates virtqueues. QEMU
issues ioctl(VHOST_VSOCK_SET_RUNNING, 1) which calls vhost_vsock_start():
    vhost_vsock_start()                              [vsock.c:609]
      for each vq:
        mutex_lock(&vq->mutex)
        vhost_vq_set_backend(vq, vsock)              ← backend becomes NON-NULL
        mutex_unlock(&vq->mutex)
      vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX],  ← KICKS WORKER AGAIN
                          &vsock->send_pkt_work)
Worker wakes up, now vhost_vq_get_backend(vq) != NULL, delivers the queued packet to the guest. But
it is too late - connect() on the host side already timed out.
Why the kick in vhost_vsock_start() is essential here: between steps 4 and 5 nobody else will wake
the worker. The kick from step 4 already fired and did nothing (backend was NULL). No new packets are
coming - the only connect() caller is sleeping. Without this kick the packet would remain in the queue
forever.
────────────────────────────────────────
After the patch: packets no longer accumulate
Same initial conditions - QEMU has set the CID, guest is still booting.
Host calls connect():
    connect(fd, {AF_VSOCK, guest_cid, port})
      vsock_connect()                                [af_vsock.c:1650]
        transport->connect(vsk)                      [af_vsock.c:1730]
          virtio_transport_connect()                 [virtio_transport_common.c:1076]
            virtio_transport_send_pkt_info()         [virtio_transport_common.c:328]
              t_ops->send_pkt(skb, net)
                vhost_transport_send_pkt()           [vsock.c:289]
                  vhost_vsock_get(dst_cid) -> found
                  READ_ONCE(vsock->vqs[VSOCK_VQ_RX].private_data) == NULL
                  kfree_skb(skb)                     ← PACKET FREED
                  return -ECONNREFUSED               ← ERROR RETURNED
The error propagates back immediately:
    virtio_transport_send_pkt_info():
      ret = t_ops->send_pkt(skb, net)                → -ECONNREFUSED
      if (ret < 0) break                             → breaks out
    virtio_transport_connect() returns -ECONNREFUSED
    vsock_connect():
      err = transport->connect(vsk)                  → -ECONNREFUSED
      if (err < 0) goto out                          → TAKEN, skips wait loop
    connect() returns ECONNREFUSED to userspace immediately
The packet never enters send_pkt_queue. When vhost_vsock_start() runs later, the queue is
guaranteed to be empty - there is nothing for the worker kick to flush.
────────────────────────────────────────
Summary: SET_GUEST_CID makes the vsock discoverable, SET_RUNNING actually enables the virtqueues.
Between these two ioctls there is a window where packets are accepted into the queue but cannot be
delivered. The kick in vhost_vsock_start() existed to drain this backlog. The patch closes the window
at the entry point instead - refusing packets outright - so the backlog can never form.
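The before/after behavior described above can be checked with a small userspace toy model. This is only a sketch under stated assumptions: `struct model_vsock` and the `model_*()` functions are hypothetical names that mimic the control flow of vhost_transport_send_pkt(), the send worker and vhost_vsock_start(); they are not kernel code. The model is single-threaded and does not cover concurrency or a stop/restart of the device.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model of the host-side state; all names are hypothetical. */
struct model_vsock {
	void *rx_backend;	/* NULL until the model's "SET_RUNNING" */
	int queued;		/* packets waiting in the send queue */
	int delivered;		/* packets handed over to the guest */
};

/* Mimics the send worker: does nothing while the backend is NULL,
 * otherwise drains the whole queue. */
static void model_worker(struct model_vsock *v)
{
	assert(v != NULL);
	if (!v->rx_backend)
		return;		/* packets stay in the queue */
	v->delivered += v->queued;
	v->queued = 0;
}

/* Mimics the send path. With early_check == 0 this is the old behavior
 * (always queue and kick the worker); with early_check != 0 it is the
 * patched behavior (refuse while the backend is NULL). */
static int model_send_pkt(struct model_vsock *v, int len, int early_check)
{
	if (early_check && !v->rx_backend)
		return -ECONNREFUSED;	/* packet dropped, nothing queued */
	v->queued++;
	model_worker(v);
	return len;
}

/* Mimics the start path: set the backend, optionally kick the worker. */
static void model_start(struct model_vsock *v, int kick)
{
	static int backend_token;	/* any non-NULL address will do */

	v->rx_backend = &backend_token;
	if (kick)
		model_worker(v);
}
```

In the old flow a packet sent before model_start() sits in the queue until the kick delivers it. In the patched flow the same send fails with -ECONNREFUSED and the queue is still empty when model_start() runs, which is the situation the summary describes.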
mutex_unlock(&vsock->dev.mutex);