> On 20 Oct 2021, at 15:45, Julien Grall <jul...@xen.org> wrote:
>
> From: Julien Grall <jgr...@amazon.com>
>
> Commit 939775cfd3 "handle dying domains in live update" was meant to
> handle gracefully dying domain. However, the @releaseDomain watch
> will end up to be sent as soon as we finished to restore Xenstored
> state.
>
> This may be before Xen reports the domain to be dying (such as if
> the guest decided to revoke access to the xenstore page). Consequently
> daemon like xenconsoled will not clean-up the domain and it will be
> left as a zombie.
>
> To avoid the problem, mark the connection as ignored. This also
> requires to tweak conn_can_write() and conn_can_read() to prevent
> dereferencing a NULL pointer (the interface will not mapped).
>
> The check conn->is_ignored was originally added after the callbacks
> because the helpers for a socket connection may close the fd. However,
> ignore_connection() will close a socket connection directly. So it is
> fine to do the re-order.
>
> Signed-off-by: Julien Grall <jgr...@amazon.com>
Reviewed-by: Luca Fancellu <luca.fance...@arm.com>
>
> ---
>
> This issue was originally found when developping commit 939775cfd3.
> Unfortunately, due to some miscommunication the wrong patch was sent
> upstream. The approach is re-using the one we discussed back then.
>
> This was tested by modifying Linux to revoke the Xenstore grant during
> boot. Without this patch, the domain will be left after Live-Update. Note
> that on a basic setup (only xenconsoled and xl watch @releaseDomain),
> the domain may be cleaned on the next domain shutdown/start.
>
> Xenstore Live-Update is so far a tech preview feature. But I would still
> like to request this patch to be included in 4.16 as this was meant to
> be part of the original work.
> ---
> tools/xenstore/xenstored_core.c | 8 ++++----
> tools/xenstore/xenstored_core.h | 1 +
> tools/xenstore/xenstored_domain.c | 21 ++++++++++++---------
> 3 files changed, 17 insertions(+), 13 deletions(-)
>
> diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c
> index 0d4c73d6e20c..91d093a12ea6 100644
> --- a/tools/xenstore/xenstored_core.c
> +++ b/tools/xenstore/xenstored_core.c
> @@ -338,10 +338,10 @@ static int destroy_conn(void *_conn)
>
> static bool conn_can_read(struct connection *conn)
> {
> - if (!conn->funcs->can_read(conn))
> + if (conn->is_ignored)
> return false;
>
> - if (conn->is_ignored)
> + if (!conn->funcs->can_read(conn))
> return false;
>
> /*
> @@ -356,7 +356,7 @@ static bool conn_can_read(struct connection *conn)
>
> static bool conn_can_write(struct connection *conn)
> {
> - return conn->funcs->can_write(conn) && !conn->is_ignored;
> + return !conn->is_ignored && conn->funcs->can_write(conn);
> }
>
> /* This function returns index inside the array if succeed, -1 if fail */
> @@ -1466,7 +1466,7 @@ static struct {
> *
> * All watches, transactions, buffers will be freed.
> */
> -static void ignore_connection(struct connection *conn)
> +void ignore_connection(struct connection *conn)
> {
> struct buffered_data *out, *tmp;
>
> diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h
> index 258f6ff38279..07d861d92499 100644
> --- a/tools/xenstore/xenstored_core.h
> +++ b/tools/xenstore/xenstored_core.h
> @@ -206,6 +206,7 @@ struct node *read_node(struct connection *conn, const
> void *ctx,
>
> struct connection *new_connection(const struct interface_funcs *funcs);
> struct connection *get_connection_by_id(unsigned int conn_id);
> +void ignore_connection(struct connection *conn);
> void check_store(void);
> void corrupt(struct connection *conn, const char *fmt, ...);
>
> diff --git a/tools/xenstore/xenstored_domain.c
> b/tools/xenstore/xenstored_domain.c
> index 47e9107c144e..d03c7d93a9e7 100644
> --- a/tools/xenstore/xenstored_domain.c
> +++ b/tools/xenstore/xenstored_domain.c
> @@ -268,14 +268,7 @@ void check_domains(void)
> domain->shutdown = true;
> notify = 1;
> }
> - /*
> - * On Restore, we may have been unable to remap the
> - * interface and the port. As we don't know whether
> - * this was because of a dying domain, we need to
> - * check if the interface and port are still valid.
> - */
> - if (!dominfo.dying && domain->port &&
> - domain->interface)
> + if (!dominfo.dying)
> continue;
> }
> if (domain->conn) {
> @@ -1303,6 +1296,17 @@ void read_state_connection(const void *ctx, const void
> *state)
> if (!domain)
> barf("domain allocation error");
>
> + conn = domain->conn;
> +
> + /*
> + * We may not have been able to restore the domain (for
> + * instance because it revoked the Xenstore grant). We need
> + * to keep it around to send @releaseDomain when it is
> + * dead. So mark it as ignored.
> + */
> + if (!domain->port || !domain->interface)
> + ignore_connection(conn);
> +
> if (sc->spec.ring.tdomid != DOMID_INVALID) {
> tdomain = find_or_alloc_domain(ctx,
> sc->spec.ring.tdomid);
> @@ -1311,7 +1315,6 @@ void read_state_connection(const void *ctx, const void
> *state)
> talloc_reference(domain->conn, tdomain->conn);
> domain->conn->target = tdomain->conn;
> }
> - conn = domain->conn;
> }
>
> conn->conn_id = sc->conn_id;
> --
> 2.32.0
>
>