Raphael Norwitz <raphael.s.norw...@gmail.com> writes:
> As an afterthought - if VHOST_USER_F_PROTOCOL_FEATURES is indeed > unset, the issue may well be caused by QEMU reading an uninitialized > value for dev->protocol_features. Some device types like cryptodev > explicitly zero it out. As I said, it isn't set anywhere else in the > source and If dev->protocol_features had REPLY_ACK set when the > vhost_dev device is initialized, it would exactly explain the behavior > you are seeing. Can I point you at the proposed clarification of the vhost-user specification: Subject: [VHOST USER SPEC PATCH] vhost-user.rst: add clarifying language about protocol negotiation Date: Fri, 26 Feb 2021 11:16:19 +0000 Message-Id: <20210226111619.21178-1-alex.ben...@linaro.org> > > On Fri, Feb 26, 2021 at 9:58 AM Raphael Norwitz > <raphael.s.norw...@gmail.com> wrote: >> >> There are two sets of features being negotiated - virtio and >> vhost-user. Based on what you've posted here, I suspect the >> VHOST_USER_F_PROTOCOL_FEATURES virtio feature may not be negotiated by >> the backend, preventing the vhost-user protocol feature negotiation >> from happening at all. I'm not 100% sure why this would cause QEMU to >> assume that REPLY_ACK was negotiated though. >> >> some questions: >> >> On Mon, Feb 22, 2021 at 3:26 AM Alex Bennée <alex.ben...@linaro.org> wrote: >> > >> > >> > Dr. David Alan Gilbert <dgilb...@redhat.com> writes: >> > >> > > * Alex Bennée (alex.ben...@linaro.org) wrote: >> > >> Hi, >> > >> >> > >> I finally got a chance to get down into the guts of vhost-user while >> > >> attempting to port my original C RPMB daemon to Rust using the >> > >> vhost-user-backend and related crates. I ended up with this hang during >> > >> negotiation: >> > >> >> > >> startup >> > >> >> > >> vhost_user_write req:1 flags:0x1 >> > >> vhost_user_read_start >> > >> vhost_user_read req:1 flags:0x5 >> > >> vhost_user_backend_init: we got 170000000 >> > >> > GET_FEATURES >> >> Do we also see a GET_PROTOCOL_FEATURES and a SET_PROTOCOL_FEATURES >> message here? If so can you confirm what flags they contained? >> >> vhost-user feature negotiation works as follows (see >> vhost_user_backend_init()): >> >> err = vhost_user_get_features(dev, &features); >> if (err < 0) { >> return err; >> } >> >> if (virtio_has_feature(features, VHOST_USER_F_PROTOCOL_FEATURES)) { >> dev->backend_features |= 1ULL << VHOST_USER_F_PROTOCOL_FEATURES; >> >> err = vhost_user_get_u64(dev, VHOST_USER_GET_PROTOCOL_FEATURES, >> &protocol_features); >> if (err < 0) { >> return err; >> } >> >> dev->protocol_features = >> protocol_features & VHOST_USER_PROTOCOL_FEATURE_MASK; >> ... >> >> err = vhost_user_set_protocol_features(dev, dev->protocol_features); >> if (err < 0) { >> return err; >> } >> } >> >> So we first get the virtio features and check if the backend >> advertises VHOST_USER_F_PROTOCOL_FEATURES. If it does, we proceed to >> negotiate vhost-user features, in which case we should see >> GET_PROTOCOL_FEATURES and a SET_PROTOCOL_FEATURES. Otherwise it looks >> like the function just returns, and we leave the vhost-user features >> uninitialized (presumably zeroed out?), and the backend will never >> even receive a GET/SET_PROTOCOL_FEATURES. >> >> dev->protocol_features is not touched anywhere else, and, if >> VHOST_USER_F_PROTOCOL_FEATURES is negotiated, comes directly to the >> backend from the protocol_features the backend &ed with >> VHOST_USER_PROTOCOL_FEATURE_MASK. Therefore if >> VHOST_USER_F_PROTOCOL_FEATURES is indeed negotiated here I'm not sure >> what could cause QEMU to think REPLY_ACK was negotiated while the >> backend does not, spare something obvious like the backend mishandling >> the GET/SET_PROTOCOL_FEATURES messages. I briefly checked the rustvmm >> code for that and didn't see anything obvious. >> >> mst - are backend devices meant to function if >> VHOST_USER_F_PROTOCOL_FEATURES is not advertised? Do we know of any >> functioning backend which does not advertise this virtio feature? If >> not, maybe we consider failing out here? >> >> alex - Are you sure QEMU gets stuck waiting on a reply_ack message, >> and not somewhere else in the setup path? I trust a SET_MEM_TABLE >> message was actually received by the backend. Did you confirm that >> QEMU was indeed stuck waiting for a reply and not somewhere else later >> on? >> >> > >> > >> vhost_user_write req:15 flags:0x1 >> > >> vhost_user_read_start >> > >> vhost_user_read req:15 flags:0x5 >> > >> vhost_user_set_protocol_features: 2008 >> > >> vhost_user_write req:16 flags:0x1 >> > >> vhost_user_write req:3 flags:0x1 >> > >> vhost_user_write req:1 flags:0x1 >> > >> vhost_user_read_start >> > >> vhost_user_read req:1 flags:0x5 >> > >> vhost_user_write req:13 flags:0x1 >> > >> >> > >> kernel initialises device >> > >> >> > >> virtio_rpmb virtio1: init done! >> > >> vhost_user_write req:13 flags:0x1 >> > >> vhost_dev_set_features: 130000000 >> > >> vhost_user_set_features: 130000000 >> > >> > SET_FEATURES >> >> This is setting virtio features - should have nothing to do with REPLY_ACK. >> >> > >> > >> vhost_user_write req:2 flags:0x1 >> > >> vhost_user_write req:5 flags:0x9 >> > >> vhost_user_read_start >> > >> >> > <snip> >> > >> >> > >> - Should QEMU have preserved VhostUserVirtioFeatures::PROTOCOL_FEATURES >> > >> when doing the eventual VHOST_USER_SET_FEATURES reply? >> > >> >> > >> - Is vhost.rs being to strict or libvhost-user too lax in interpreting >> > >> the negotiated features before processing the ``need_reply`` [Bit 3] >> > >> field of the messages? >> > > >> > > I think vhost.rs is being correctly strict - but there would be no harm >> > > in it flagging that you'd hit an inconsistency if it finds a need_reply >> > > without the feature. >> > >> > But the feature should have been negotiated. So unless the slave can >> > assume it is enabled because it asked I think QEMU is in the wrong by >> > not preserving the feature bits in it's SET_FEATURES reply. We just gets >> > away with it with libvhostuser being willing to reply anyway. >> > >> > > >> > >> - are VHOST_USER_SET_MEM_TABLE to VHOST_USER_SET_INFLIGHT_FD included >> > >> in the "list of the ones that do" require replies or do they only >> > >> reply when REPLY_ACK has been negotiated as the ambiguous "seealso::" >> > >> box out seems to imply? >> > > >> > > set_mem_table gives a reply when postcopy is enabled (and then qemu >> > > replies to the reply!) but otherwise doesn't. >> > > (Note there's an issue opened for .rs to support ADD_MEM_REGION >> > > since it's a lot better than SET_MEM_TABLE which has a fixed size table >> > > that's small). >> > >> > Thanks for the heads up. >> > >> > > >> > > Dave >> > > >> > >> Currently I have some hacks in: >> > >> >> > >> https://github.com/stsquad/vhost/tree/my-hacks >> > >> >> > >> which gets my daemon booting up to the point we actually need to do a >> > >> transaction. However I won't submit a PR until I've worked out exactly >> > >> where the problems are. >> > >> >> > >> -- >> > >> Alex Bennée >> > >> >> > >> > >> > -- >> > Alex Bennée >> > -- Alex Bennée