On Tue, May 04, 2021 at 12:57:29PM +0200, Kevin Wolf wrote:
> Am 04.05.2021 um 11:44 hat Michael S. Tsirkin geschrieben:
> > On Tue, May 04, 2021 at 11:27:12AM +0200, Kevin Wolf wrote:
> > > Am 04.05.2021 um 10:59 hat Michael S. Tsirkin geschrieben:
> > > > On Thu, Apr 29, 2021 at 07:13:12PM +0200, Kevin Wolf wrote:
> > > > > This is a partial revert of commits 77542d43149 and bc79c87bcde.
> > > > >
> > > > > Usually, an error during initialisation means that the
> > > > > configuration was wrong. Reconnecting won't make the error go
> > > > > away, but just turn the error condition into an endless loop.
> > > > > Avoid this and return errors again.
> > > >
> > > > So there are several possible reasons for an error:
> > > >
> > > > 1. remote restarted - we would like to reconnect; this was the
> > > >    original use case for reconnect.
> > > >
> > > >    I am not very happy that we are killing this use case.
> > >
> > > This patch is killing it only during initialisation, where it's
> > > quite unlikely compared to other cases and where the current
> > > implementation is rather broken. So reverting the broken feature
> > > and going back to a simpler correct state feels like a good idea
> > > to me.
> > >
> > > The idea is to add the "retry during initialisation" feature back
> > > on top of this, but it requires some more changes in the error
> > > paths so that we can actually distinguish different kinds of
> > > errors and don't retry when we already know that it can't succeed.
> >
> > Okay ... let's make all this explicit in the commit log though, ok?
>
> That's fair, I'll add a paragraph addressing this case when merging
> the series, like this:
>
>     Note that this removes the ability to reconnect during
>     initialisation (but not during operation) when there is no
>     permanent error, but the backend restarts, as the implementation
>     was buggy. This feature can be added back in a follow-up series
>     after changing error paths to distinguish cases where retrying
>     could help from cases with permanent errors.
>
> > > > 2. qemu detected an error and closed the connection; it looks
> > > >    like we try to handle that by reconnecting. This is something
> > > >    we should address.
> > >
> > > Yes, if qemu produces the error locally, retrying is useless.
> > >
> > > > 3. remote failed due to a bad command from qemu; this use case
> > > >    isn't well supported at the moment.
> > > >
> > > > How about supporting it on the remote side? I think that if the
> > > > data is well-formed but has a configuration the remote cannot
> > > > support, then instead of closing the connection, the remote can
> > > > wait for commands with need_reply set and respond with an error,
> > > > or at least do so if VHOST_USER_PROTOCOL_F_REPLY_ACK has been
> > > > negotiated. If VHOST_USER_SET_VRING_ERR is used, then signalling
> > > > that fd might also be reasonable.
> > > >
> > > > OTOH, if qemu is buggy and sends malformed data and the remote
> > > > detects that, then having qemu retry forever is ok; it might
> > > > actually be beneficial for debugging.
> > >
> > > I haven't really checked this case yet; it seems to be less
> > > common. Explicitly communicating an error is certainly better
> > > than just cutting the connection. But as you say, it means QEMU
> > > is buggy, so blindly retrying in this case is kind of acceptable.
> > >
> > > Raphael suggested that we could limit the number of retries during
> > > initialisation so that it wouldn't result in a hang at least.
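To make the error-reply idea for case 3 concrete: with
VHOST_USER_PROTOCOL_F_REPLY_ACK negotiated, the backend can answer a
request that has need_reply set with a non-zero u64 payload instead of
dropping the connection. A minimal backend-side sketch of that - the
struct layout and backend_send() helper are made up for illustration,
not the libvhost-user API, while the flag bits and the "non-zero u64
means error" convention follow the vhost-user spec:

#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_VERSION    0x1u         /* version in flag bits 0-1 */
#define VHOST_USER_REPLY      (0x1u << 2)  /* this message is a reply */
#define VHOST_USER_NEED_REPLY (0x1u << 3)  /* master asked for a reply */

typedef struct {
    uint32_t request;
    uint32_t flags;
    uint32_t size;
} VhostUserMsgHdr;

/* Assumed helper that writes header + payload to the vhost-user socket. */
extern int backend_send(int fd, const VhostUserMsgHdr *hdr,
                        const void *payload);

/*
 * Called when a well-formed request carries a configuration the backend
 * cannot support: report the error in-band instead of closing the socket.
 */
static int backend_reject_request(int fd, const VhostUserMsgHdr *req,
                                  bool reply_ack_negotiated)
{
    if (!reply_ack_negotiated || !(req->flags & VHOST_USER_NEED_REPLY)) {
        /*
         * No in-band way to report failure; signalling the fd set up
         * with VHOST_USER_SET_VRING_ERR would be the alternative.
         */
        return -1;
    }

    VhostUserMsgHdr reply = {
        .request = req->request,
        .flags   = VHOST_USER_VERSION | VHOST_USER_REPLY,
        .size    = sizeof(uint64_t),
    };
    uint64_t status = 1; /* non-zero means error under REPLY_ACK */

    return backend_send(fd, &reply, &status);
}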
> > not sure how I feel about random limits ... how would we set the
> > limit?
>
> To be honest, probably even 1 would already be good enough in
> practice. Make it 5 or something and you definitely cover any
> realistic case where there is no bug involved.
>
> Even hitting this case once requires bad luck with the timing, so
> that the restart of the backend coincides with already having
> connected to the socket, but not having completed the configuration
> yet, which is a really short window. Having the backend drop the
> connection again in the same short window on the second attempt is
> an almost sure sign of a bug in one of the operations done during
> initialisation.
>
> Even if this corner case turned out to be a bit less unlikely to
> happen than I'm thinking (which is, it won't happen at all),
> randomly failing a device-add once in a while still feels a lot
> better than hanging the VM once in a while.
>
> Kevin
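For what it's worth, a bounded retry along those lines is only a
handful of lines. A sketch of the shape it could take - untested, and
the vu_* names are invented for illustration, not the actual
vhost-user-blk code:

#include <stdio.h>

/*
 * Stand-in for the real connect-and-configure step in realize;
 * returns 0 on success, a negative errno on failure.
 */
extern int vu_realize_connect(void *opaque);

enum { VU_REALIZE_MAX_RETRIES = 5 }; /* "make it 5 or something" */

static int vu_realize_with_retry(void *opaque)
{
    int ret = -1;

    for (int attempt = 1; attempt <= 1 + VU_REALIZE_MAX_RETRIES; attempt++) {
        ret = vu_realize_connect(opaque);
        if (ret == 0) {
            return 0;
        }
        /*
         * The failure might just be the backend restarting in the short
         * window between connecting and completing the configuration,
         * so try again a few times.
         */
        fprintf(stderr, "vhost-user init failed (attempt %d)\n", attempt);
    }

    /*
     * Repeated failures in that same short window almost surely mean a
     * bug in one of the init-time operations: fail device-add instead
     * of reconnecting forever.
     */
    return ret;
}

Given how short the window is, the exact value of the limit barely
matters; the point is only to turn an endless loop into a bounded one.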
Well, if the backend is e.g. just stuck and the connection does not
close, then the VM hangs anyway. So IMHO it's not such a big deal.

If we really want to address this, we should handle all of it
asynchronously: make device-add succeed and then progress in stages,
but do not block the monitor. That would be nice, but it's a big
change in the code.
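The staged variant could be driven from a bottom half so the monitor
never waits; roughly something like this, as a sketch of the idea only.
The Vu* types and vu_stage_*() helpers are invented here, while
qemu_bh_new() and qemu_bh_schedule() are the existing main-loop API:

#include "qemu/osdep.h"
#include "qemu/main-loop.h"

typedef enum {
    VU_INIT_CONNECT,    /* connect to the vhost-user socket */
    VU_INIT_FEATURES,   /* feature / protocol feature negotiation */
    VU_INIT_RINGS,      /* vring setup */
    VU_INIT_DONE,
} VuInitStage;

typedef struct {
    VuInitStage stage;
    QEMUBH *bh;
} VuInitState;

/*
 * Assumed non-blocking helpers, one per stage; each returns 0 once its
 * stage has completed, a positive value if it should be called again
 * later (e.g. still waiting for the socket), negative on a permanent
 * error.
 */
extern int vu_stage_connect(VuInitState *s);
extern int vu_stage_features(VuInitState *s);
extern int vu_stage_rings(VuInitState *s);

static void vu_init_step(void *opaque)
{
    VuInitState *s = opaque;
    int ret = 0;

    switch (s->stage) {
    case VU_INIT_CONNECT:  ret = vu_stage_connect(s);  break;
    case VU_INIT_FEATURES: ret = vu_stage_features(s); break;
    case VU_INIT_RINGS:    ret = vu_stage_rings(s);    break;
    case VU_INIT_DONE:     return;
    }

    if (ret < 0) {
        /* A real implementation would mark the device as failed here. */
        return;
    }
    if (ret == 0) {
        s->stage++;
    }
    if (s->stage != VU_INIT_DONE) {
        /* Keep making progress in the background; the monitor never blocks. */
        qemu_bh_schedule(s->bh);
    }
}

/* device-add / realize returns immediately; init continues in stages. */
static void vu_start_async_init(VuInitState *s)
{
    s->stage = VU_INIT_CONNECT;
    s->bh = qemu_bh_new(vu_init_step, s);
    qemu_bh_schedule(s->bh);
}

--
MST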