On Fri, Oct 13, 2023 at 12:22 PM Julien Grall <jul...@xen.org> wrote:
>
> Hi George,
>
> On 13/10/2023 11:16, George Dunlap wrote:
> > On Thu, Oct 12, 2023 at 11:36 PM Stefano Stabellini
> > <sstabell...@kernel.org> wrote:
> >>
> >> On Thu, 12 Oct 2023, George Dunlap wrote:
> >>>>> Stop tinkering in the hope that it hides the problem.  You're only
> >>>>> making it harder to fix properly.
> >>>>
> >>>> Making it harder to fix properly would be a valid reason not to commit
> >>>> the (maybe partial) fix. But looking at the fix again:
> >>>>
> >>>> diff --git a/tools/xenstored/domain.c b/tools/xenstored/domain.c
> >>>> index a6cd199fdc..9cd6678015 100644
> >>>> --- a/tools/xenstored/domain.c
> >>>> +++ b/tools/xenstored/domain.c
> >>>> @@ -989,6 +989,7 @@ static struct domain *introduce_domain(const void 
> >>>> *ctx,
> >>>>                  talloc_steal(domain->conn, domain);
> >>>>
> >>>>                  if (!restore) {
> >>>> +                       domain_conn_reset(domain);
> >>>>                          /* Notify the domain that xenstore is available 
> >>>> */
> >>>>                          interface->connection = XENSTORE_CONNECTED;
> >>>>                          xenevtchn_notify(xce_handle, domain->port);
> >>>> @@ -1031,8 +1032,6 @@ int do_introduce(const void *ctx, struct 
> >>>> connection *conn,
> >>>>          if (!domain)
> >>>>                  return errno;
> >>>>
> >>>> -       domain_conn_reset(domain);
> >>>> -
> >>>>          send_ack(conn, XS_INTRODUCE);
> >>>>
> >>>> It is a 1-line movement. Textually small. Easy to understand and to
> >>>> revert. It doesn't seem to be making things harder to fix? We could
> >>>> revert it any time if a better fix is offered.
> >>>>
> >>>> Maybe we could have a XXX note in the commit message or in-code
> >>>> comment?
> >>>
> >>> It moves a line from one function (do_domain_introduce()) into a
> >>> completely different function (introduce_domain()), nested inside two
> >>> if() statements; with no analysis on how the change will impact
> >>> things.
> >>
> >> I am not the original author of the patch, and I am not the maintainer
> >> of the code, so I don't feel I have the qualifications to give you the
> >> answers you are seeking. Julien as author of the patch and xenstore
> >> reviewer might be in a better position to answer. Or Juergen as xenstore
> >> maintainer.
> >
> > I understand that; my main point is that the change is more complex
> > than you're characterizing it.  This is information necessary to
> > understand whether the patch is correct, but it's not in the patch
> > description, nor in the subsequent thread back in May.
> >
> >>> Are there any paths through do_domain_introduce() that now *won't* get
> >>> a domain_conn_reset() call?  Is that OK?
> >>
> >> Yes, the already-introduced and the restore code paths. The operations in
> >> the already-introduced or the restore code paths seem simple enough not
> >> to require a domain_conn_reset. Julien and Juergen should confirm.
> >
> > There is no "restore" codepath through do_domain_introduce(); it
> > passes "false" for the "restore" argument.  So we  only have two paths
> > to consider through do_domain_introduce(): The "not introduced and not
> > restoring" path, and the "already-introduced" path.
> >
> > I'm not sure what the "simple" elements on the branch in
> > introduce_domain() have to do with whether the content of the page
> > needs to be cleaned up.  As I said, I don't 100% understand this code,
> > but it seems like if anything, the reset would be *more* important to
> > have in the "reintroduce" case than in the "initial introduction"
> > case, since I'd expect the "initial introduction" case to be empty
> > already.
> Indeed, there should be no watches/transactions/buffered I/O for the
> initial introduction. However, the function is also clear part of the
> interface because we can't guaranteed it was zeroed.
>
> The latter matter for the initial introduction. I believe the rest is
> just called for simplicity.
>
> >
> >> Doesn't it seem weird to you that we set a connection to CONNECTED,
> >> notify the domain that it is ready to go, and only *after* that we reset
> >> the connection to zero?
> >>
> >> What happens if a domain starts using the connection as soon as it
> >> receives the event channel notification and before domain_conn_reset is
> >> called?
> >
> > Yes, it does seem weird, which is why I said the following. :-)
> >
> >>> I mean, it certainly seems strange to set the state to CONNECTED, send
> >>> off an event channel, and then after that delete all watches /
> >>> transactions / buffered data and so on;
> >
> > But just because the current code is probably wrong, doesn't mean that
> > the modified code is probably correct.
> >
> > If the problem is the delay between the xenevtchn_notify() in
> > introduce_domain() and the domain_conn_reset() afterwards in
> > do_domain(), would it make sense instead to move the notification into
> > do_introduce(), after the domain_conn_reset()?  It is, after all, in
> > response to XS_INTRODUCE that we want to send the notification, not in
> > dom0_init() or read_state_connection() (which seems to be more about
> > restoring a domain).
>
> I understand that the event channel notification was specifically added
> for dom0less. But I don't see why we don't want to send it to dom0 as well.
>
> Technically, dom0 has exactly the same problem as dom0less domains it
> boots before Xenstored is running and therefore it may need to know when
> it is ready to receive commands.

It seemed to work fine before 2022, when the notification was
introduced.  How was that coordination done?  Will dom0 honor the
"XENSTORE_RECONNECT" state during bring-up as described in
xenstore-ring.txt?

Personally I'd just take the patch as I wrote it, restoring dom0's
pre-2022 behavior (after review and testing of course); but if you
wanted to test & resubmit with a similar notification inside
dom0_init(), I wouldn't object to it.

Naturally it would be ideal if we could avoid the code duplication;
but that's not possible without a lot more refactoring, which I don't
think we should be doing at this stage in the release.

 -George

Reply via email to