On Fri, Oct 13, 2023 at 12:22 PM Julien Grall <jul...@xen.org> wrote: > > Hi George, > > On 13/10/2023 11:16, George Dunlap wrote: > > On Thu, Oct 12, 2023 at 11:36 PM Stefano Stabellini > > <sstabell...@kernel.org> wrote: > >> > >> On Thu, 12 Oct 2023, George Dunlap wrote: > >>>>> Stop tinkering in the hope that it hides the problem. You're only > >>>>> making it harder to fix properly. > >>>> > >>>> Making it harder to fix properly would be a valid reason not to commit > >>>> the (maybe partial) fix. But looking at the fix again: > >>>> > >>>> diff --git a/tools/xenstored/domain.c b/tools/xenstored/domain.c > >>>> index a6cd199fdc..9cd6678015 100644 > >>>> --- a/tools/xenstored/domain.c > >>>> +++ b/tools/xenstored/domain.c > >>>> @@ -989,6 +989,7 @@ static struct domain *introduce_domain(const void > >>>> *ctx, > >>>> talloc_steal(domain->conn, domain); > >>>> > >>>> if (!restore) { > >>>> + domain_conn_reset(domain); > >>>> /* Notify the domain that xenstore is available > >>>> */ > >>>> interface->connection = XENSTORE_CONNECTED; > >>>> xenevtchn_notify(xce_handle, domain->port); > >>>> @@ -1031,8 +1032,6 @@ int do_introduce(const void *ctx, struct > >>>> connection *conn, > >>>> if (!domain) > >>>> return errno; > >>>> > >>>> - domain_conn_reset(domain); > >>>> - > >>>> send_ack(conn, XS_INTRODUCE); > >>>> > >>>> It is a 1-line movement. Textually small. Easy to understand and to > >>>> revert. It doesn't seem to be making things harder to fix? We could > >>>> revert it any time if a better fix is offered. > >>>> > >>>> Maybe we could have a XXX note in the commit message or in-code > >>>> comment? > >>> > >>> It moves a line from one function (do_domain_introduce()) into a > >>> completely different function (introduce_domain()), nested inside two > >>> if() statements; with no analysis on how the change will impact > >>> things. > >> > >> I am not the original author of the patch, and I am not the maintainer > >> of the code, so I don't feel I have the qualifications to give you the > >> answers you are seeking. Julien as author of the patch and xenstore > >> reviewer might be in a better position to answer. Or Juergen as xenstore > >> maintainer. > > > > I understand that; my main point is that the change is more complex > > than you're characterizing it. This is information necessary to > > understand whether the patch is correct, but it's not in the patch > > description, nor in the subsequent thread back in May. > > > >>> Are there any paths through do_domain_introduce() that now *won't* get > >>> a domain_conn_reset() call? Is that OK? > >> > >> Yes, the already-introduced and the restore code paths. The operations in > >> the already-introduced or the restore code paths seem simple enough not > >> to require a domain_conn_reset. Julien and Juergen should confirm. > > > > There is no "restore" codepath through do_domain_introduce(); it > > passes "false" for the "restore" argument. So we only have two paths > > to consider through do_domain_introduce(): The "not introduced and not > > restoring" path, and the "already-introduced" path. > > > > I'm not sure what the "simple" elements on the branch in > > introduce_domain() have to do with whether the content of the page > > needs to be cleaned up. As I said, I don't 100% understand this code, > > but it seems like if anything, the reset would be *more* important to > > have in the "reintroduce" case than in the "initial introduction" > > case, since I'd expect the "initial introduction" case to be empty > > already. > Indeed, there should be no watches/transactions/buffered I/O for the > initial introduction. However, the function is also clear part of the > interface because we can't guaranteed it was zeroed. > > The latter matter for the initial introduction. I believe the rest is > just called for simplicity. > > > > >> Doesn't it seem weird to you that we set a connection to CONNECTED, > >> notify the domain that it is ready to go, and only *after* that we reset > >> the connection to zero? > >> > >> What happens if a domain starts using the connection as soon as it > >> receives the event channel notification and before domain_conn_reset is > >> called? > > > > Yes, it does seem weird, which is why I said the following. :-) > > > >>> I mean, it certainly seems strange to set the state to CONNECTED, send > >>> off an event channel, and then after that delete all watches / > >>> transactions / buffered data and so on; > > > > But just because the current code is probably wrong, doesn't mean that > > the modified code is probably correct. > > > > If the problem is the delay between the xenevtchn_notify() in > > introduce_domain() and the domain_conn_reset() afterwards in > > do_domain(), would it make sense instead to move the notification into > > do_introduce(), after the domain_conn_reset()? It is, after all, in > > response to XS_INTRODUCE that we want to send the notification, not in > > dom0_init() or read_state_connection() (which seems to be more about > > restoring a domain). > > I understand that the event channel notification was specifically added > for dom0less. But I don't see why we don't want to send it to dom0 as well. > > Technically, dom0 has exactly the same problem as dom0less domains it > boots before Xenstored is running and therefore it may need to know when > it is ready to receive commands.
It seemed to work fine before 2022, when the notification was introduced. How was that coordination done? Will dom0 honor the "XENSTORE_RECONNECT" state during bring-up as described in xenstore-ring.txt? Personally I'd just take the patch as I wrote it, restoring dom0's pre-2022 behavior (after review and testing of course); but if you wanted to test & resubmit with a similar notification inside dom0_init(), I wouldn't object to it. Naturally it would be ideal if we could avoid the code duplication; but that's not possible without a lot more refactoring, which I don't think we should be doing at this stage in the release. -George