On Tue, May 04, 2021 at 11:27:12AM +0200, Kevin Wolf wrote:
> On 04.05.2021 at 10:59, Michael S. Tsirkin wrote:
> > On Thu, Apr 29, 2021 at 07:13:12PM +0200, Kevin Wolf wrote:
> > > This is a partial revert of commits 77542d43149 and bc79c87bcde.
> > >
> > > Usually, an error during initialisation means that the configuration was
> > > wrong. Reconnecting won't make the error go away, but just turn the
> > > error condition into an endless loop. Avoid this and return errors
> > > again.
> >
> > So there are several possible reasons for an error:
> >
> > 1. remote restarted - we would like to reconnect,
> >    this was the original use-case for reconnect.
> >
> >    I am not very happy that we are killing this usecase.
>
> This patch is killing it only during initialisation, where it's quite
> unlikely compared to other cases and where the current implementation is
> rather broken. So reverting the broken feature and going back to a
> simpler correct state feels like a good idea to me.
>
> The idea is to add the "retry during initialisation" feature back on top
> of this, but it requires some more changes in the error paths so that we
> can actually distinguish different kinds of errors and don't retry when
> we already know that it can't succeed.

Okay ... let's make all this explicit in the commit log though, ok?

> > 2. qemu detected an error and closed the connection
> >    looks like we try to handle that by reconnect,
> >    this is something we should address.
>
> Yes, if qemu produces the error locally, retrying is useless.
>
> > 3. remote failed due to a bad command from qemu.
> >    this usecase isn't well supported at the moment.
> >
> >    How about supporting it on the remote side? I think that if the
> >    data is well-formed but just has a configuration the remote cannot
> >    support, then instead of closing the connection, remote can wait for
> >    commands with need_reply set, and respond with an error. Or at
> >    least do it if VHOST_USER_PROTOCOL_F_REPLY_ACK has been negotiated.
> >    If VHOST_USER_SET_VRING_ERR is used then signalling that fd might
> >    also be reasonable.
> >
> >    OTOH if qemu is buggy and sends malformed data and remote detects
> >    that, then having qemu retry forever is ok, might actually be
> >    beneficial for debugging.
>
> I haven't really checked this case yet, it seems to be less common.
> Explicitly communicating an error is certainly better than just cutting
> the connection. But as you say, it means QEMU is buggy, so blindly
> retrying in this case is kind of acceptable.
>
> Raphael suggested that we could limit the number of retries during
> initialisation so that it wouldn't result in a hang at least.

Not sure how I feel about random limits ... how would we set the limit?

> > > Additionally, calling vhost_user_blk_disconnect() from the chardev event
> > > handler could result in use-after-free because none of the
> > > initialisation code expects that the device could just go away in the
> > > middle. So removing the call fixes crashes in several places.
> > > For example, using a num-queues setting that is incompatible with the
> > > backend would result in a crash like this (dereferencing dev->opaque,
> > > which is already NULL):
> > >
> > > #0 0x0000555555d0a4bd in vhost_user_read_cb (source=0x5555568f4690,
> > > condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffffffcbf0) at
> > > ../hw/virtio/vhost-user.c:313
> > > #1 0x0000555555d950d3 in qio_channel_fd_source_dispatch
> > > (source=0x555557c3f750, callback=0x555555d0a478 <vhost_user_read_cb>,
> > > user_data=0x7fffffffcbf0) at ../io/channel-watch.c:84
> > > #2 0x00007ffff7b32a9f in g_main_context_dispatch () at
> > > /lib64/libglib-2.0.so.0
> > > #3 0x00007ffff7b84a98 in g_main_context_iterate.constprop () at
> > > /lib64/libglib-2.0.so.0
> > > #4 0x00007ffff7b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0
> > > #5 0x0000555555d0a724 in vhost_user_read (dev=0x555557bc62f8,
> > > msg=0x7fffffffcc50) at ../hw/virtio/vhost-user.c:402
> > > #6 0x0000555555d0ee6b in vhost_user_get_config (dev=0x555557bc62f8,
> > > config=0x555557bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133
> > > #7 0x0000555555d56d46 in vhost_dev_get_config (hdev=0x555557bc62f8,
> > > config=0x555557bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566
> > > #8 0x0000555555cdd150 in vhost_user_blk_device_realize
> > > (dev=0x555557bc60b0, errp=0x7fffffffcf90) at
> > > ../hw/block/vhost-user-blk.c:510
> > > #9 0x0000555555d08f6d in virtio_device_realize (dev=0x555557bc60b0,
> > > errp=0x7fffffffcff0) at ../hw/virtio/virtio.c:3660
> >
> > Right. So that's definitely something to fix.
> >
> > >
> > > Signed-off-by: Kevin Wolf <kw...@redhat.com>
>
> Kevin
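
To make the "distinguish different kinds of errors" idea and the retry-limit
question a bit more concrete, here is a rough sketch in C. It is only an
illustration, not QEMU code: the enum init_err classes, init_with_retry(),
and the fake_backend stub are all hypothetical names invented for this
example. The assumption is that the init path can report whether a failure
came from a disconnect (a retry may help) or from a rejected configuration or
a local error (a retry cannot help), and that the caller supplies the limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error classes for this sketch; not QEMU identifiers. */
enum init_err {
    INIT_OK = 0,
    INIT_ERR_DISCONNECTED, /* backend went away: a retry may help */
    INIT_ERR_CONFIG,       /* backend rejected the config: retry is futile */
    INIT_ERR_LOCAL,        /* error produced locally: retry is futile */
};

typedef enum init_err (*init_fn)(void *opaque);

/* Only a lost connection is worth retrying. */
static bool init_err_is_retryable(enum init_err err)
{
    return err == INIT_ERR_DISCONNECTED;
}

/*
 * Call init() until it succeeds, fails with a non-retryable error, or
 * max_attempts calls have been made. Returns the last result and stores
 * the number of calls in *attempts_out (if non-NULL). With max_attempts
 * of 0, init() is never called and INIT_ERR_LOCAL is returned.
 */
static enum init_err init_with_retry(init_fn init, void *opaque,
                                     unsigned max_attempts,
                                     unsigned *attempts_out)
{
    enum init_err err = INIT_ERR_LOCAL;
    unsigned attempts = 0;

    assert(init != NULL);

    while (attempts < max_attempts) {
        attempts++;
        err = init(opaque);
        if (err == INIT_OK || !init_err_is_retryable(err)) {
            break;
        }
    }

    if (attempts_out) {
        *attempts_out = attempts;
    }
    return err;
}

/* Stub backend for demonstration: fails a fixed number of times. */
struct fake_backend {
    unsigned failures_left;
    enum init_err fail_with;
};

static enum init_err fake_init(void *opaque)
{
    struct fake_backend *b = opaque;

    if (b->failures_left > 0) {
        b->failures_left--;
        return b->fail_with;
    }
    return INIT_OK;
}
```

With this shape, a configuration error is returned after a single attempt,
while a disconnect is retried up to the limit. It does not answer how the
limit should be chosen; it only shows that a bound and an error
classification can be kept separate.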