On Sun, Nov 29, 2020 at 06:29:55PM +0000, Michael Kelley wrote:
> From: Andrea Parri (Microsoft) <[email protected]> Sent: Thursday,
> November 26, 2020 11:12 AM
> >
> > Quoting from commit 7527810573436f ("Drivers: hv: vmbus: Introduce
> > the CHANNELMSG_MODIFYCHANNEL message type"),
> >
> > "[...] Hyper-V can *not* currently ACK CHANNELMSG_MODIFYCHANNEL
> > messages with the promise that (after the ACK is sent) the
> > channel won't send any more interrupts to the "old" CPU.
> >
> > The peculiarity of the CHANNELMSG_MODIFYCHANNEL messages is
> > problematic if the user want to take a CPU offline, since we
> > don't want to take a CPU offline (and, potentially, "lose"
> > channel interrupts on such CPU) if the host is still processing
> > a CHANNELMSG_MODIFYCHANNEL message associated to that CPU."
> >
> > Introduce the CHANNELMSG_MODIFYCHANNEL_RESPONSE(24) message type,
> > which embodies the type of the CHANNELMSG_MODIFYCHANNEL ACK.
>
> I feel like the commit message needs a bit more detail about the new
> functionality that is added, including introducing the new VMbus protocol
> version 5.3, and then waiting for the response message when running
> with that protocol version of later. Also, when taking a CPU offline, new
> functionality ensures that there are no pending channel interrupts for
> that CPU.
I'll add details along these lines in the next submission.
>
> Could/should this patch be broken into multiple patches?
>
> 1) Introduce and negotiate VMbus protocol version 5.3. Note that just
> negotiating version 5.3 should be safe because any ACK messages will
> be cleanly ignored by the old code.
> 2) Check for pending channel interrupts before taking a CPU offline.
> Wouldn't this check be valid even with the existing code where there is no
> ACK message? The old code is, in effect, assuming that enough time has
> passed such that the modify channel message has been processed.
> 3) Code to wait for an ACK message, and code to receive and process an
> ACK message.
I think that the patch could be broken into multiple patches. I'm still
wondering whether we could/should invoke hv_synic_event_pending() without
an ACK message (like the description above seems to suggest), say, from
the introduction of MODIFYCHANNEL messages... Thoughts?
> > +static int send_modifychannel_with_ack(struct vmbus_channel *channel, u32
> > target_vp)
> > +{
> > + struct vmbus_channel_modifychannel *msg;
> > + struct vmbus_channel_msginfo *info;
> > + unsigned long flags;
> > + int ret;
> > +
> > + info = kzalloc(sizeof(struct vmbus_channel_msginfo) +
> > + sizeof(struct vmbus_channel_modifychannel),
> > + GFP_KERNEL);
> > + if (!info)
> > + return -ENOMEM;
> > +
> > + init_completion(&info->waitevent);
> > + info->waiting_channel = channel;
> > +
> > + msg = (struct vmbus_channel_modifychannel *)info->msg;
> > + msg->header.msgtype = CHANNELMSG_MODIFYCHANNEL;
> > + msg->child_relid = channel->offermsg.child_relid;
> > + msg->target_vp = target_vp;
> > +
> > + spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
> > + list_add_tail(&info->msglistentry, &vmbus_connection.chn_msg_list);
> > + spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
> > +
> > + if (channel->rescind) {
> > + ret = -ENODEV;
> > + goto free_info;
> > + }
>
> Does the above check really do anything? Such checks are sprinkled throughout
> the VMbus code, and I've always questioned their efficacy. The rescind flag
> can
> be set without holding the channel_mutex, so as soon as the above code tests
> the flag, it seems like it could change.
Same question (and questioning) here actually. I'm planning to remove
this check in v2. Similarly for the ->rescind check below.
>
> > +
> > + ret = vmbus_post_msg(msg, sizeof(struct vmbus_channel_modifychannel),
> > true);
>
> Use "sizeof(*msg)" instead?
Applied.
>
> > + trace_vmbus_send_modifychannel(msg, ret);
> > + if (ret != 0)
> > + goto clean_msglist;
> > +
> > + /*
> > + * Release channel_mutex; otherwise, vmbus_onoffer_rescind() could
> > block on
> > + * the mutex and be unable to signal the completion.
> > + */
> > + mutex_unlock(&vmbus_connection.channel_mutex);
> > + wait_for_completion(&info->waitevent);
> > + mutex_lock(&vmbus_connection.channel_mutex);
>
> The calling function, target_cpu_store(), obtains the mutex to synchronize
> against
> channel closure. Does releasing the mutex here still ensure the needed
> synchronization?
> If so, some comments explaining why could be helpful. I didn't try to reason
> through it
> all, so I'm not sure.
AFAICT, the synchronization against channel closure just want the mutex
critical section to include the MODIFYCHANNEL post.
AFAICT, different target_cpu_store() invocations remain serialized (by
the mutex in kernfs_fop_write()); maybe this is worth noticing?
>
> > +
> > + spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
> > + list_del(&info->msglistentry);
> > + spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
> > +
> > + if (channel->rescind) {
> > + ret = -ENODEV;
> > + goto free_info;
> > + }
>
> Again, I'm wondering if the above is actually useful.
See above.
>
> > +
> > + if (info->response.modify_response.status) {
>
> I'm thinking that current Hyper-V never sends a non-zero status. But it's
> good
> to check in case of future changes. The implication is that a non-zero
> status means
> that the request to change the target CPU was not performed, correct?
This corresponds to my understanding too. But you're right, I should
have verified with the Hyper-V team...
>
> > + kfree(info);
> > + return -EAGAIN;
> > + }
> > +
> > + kfree(info);
> > + return 0;
> > +
> > +clean_msglist:
> > + spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
> > + list_del(&info->msglistentry);
> > + spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags);
>
> The error handling seems more complex than needed. The "clean_msglist" label
> is only called from one place, so the above code could go inline at that
> place.
Mmh, it seems also wrong... IAC, I'll review it before the next
submission, taking into account these remarks.
>
> > +free_info:
> > + kfree(info);
> > + return ret;
>
> More error handling: The kfree(info) call is done in three different places.
> With
> consistent use of the 'ret' local variable, only one place would be needed.
See above.
>
> > +}
> > +
> > /*
> > * Set/change the vCPU (@target_vp) the channel (@child_relid) will
> > interrupt.
> > *
> > * CHANNELMSG_MODIFYCHANNEL messages are aynchronous. Also, Hyper-V does
> > not
> > - * ACK such messages. IOW we can't know when the host will stop
> > interrupting
> > - * the "old" vCPU and start interrupting the "new" vCPU for the given
> > channel.
> > + * ACK such messages before VERSION_WIN10_V5_3. Without ACK, we can not
> > know
> > + * when the host will stop interrupting the "old" vCPU and start
> > interrupting
> > + * the "new" vCPU for the given channel.
>
> In the interest of clarity, make the above comment slightly more explicit:
> When VMbus
> version 5.3 or later is negotiated, Hyper-V always sends an ACK in response to
> CHANNELMSG_MODIFYCHANNEL. For VMbus version 5.2 and earlier, it never sends
> an ACK.
Applied.
> > +static void vmbus_onmodifychannel_response(struct
> > vmbus_channel_message_header *hdr)
> > +{
> > + struct vmbus_channel_modifychannel_response *response;
> > + struct vmbus_channel_msginfo *msginfo;
> > + unsigned long flags;
> > +
> > + response = (struct vmbus_channel_modifychannel_response *)hdr;
> > +
> > + trace_vmbus_onmodifychannel_response(response);
> > +
> > + /*
> > + * Find the modify msg, copy the response and signal/unblock the wait
> > event.
> > + */
> > + spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags);
> > +
> > + list_for_each_entry(msginfo, &vmbus_connection.chn_msg_list,
> > msglistentry) {
> > + struct vmbus_channel_message_header *responseheader =
> > + (struct vmbus_channel_message_header
> > *)msginfo->msg;
> > +
> > + if (responseheader->msgtype == CHANNELMSG_MODIFYCHANNEL) {
> > + struct vmbus_channel_modifychannel *modifymsg;
> > +
> > + modifymsg = (struct vmbus_channel_modifychannel
> > *)msginfo->msg;
> > + if (modifymsg->child_relid == response->child_relid) {
> > + memcpy(&msginfo->response.modify_response,
> > response,
> > + sizeof(struct
> > vmbus_channel_modifychannel_response));
>
> Use "sizeof(*response)" ?
Applied.
> > @@ -237,6 +238,40 @@ void hv_synic_disable_regs(unsigned int cpu)
> > hv_set_synic_state(sctrl.as_uint64);
> > }
> >
> > +#define HV_MAX_TRIES 3
> > +/*
> > + * Scan the event flags page of 'this' CPU looking for any bit that is
> > set. If we find one
> > + * bit set, then wait for a few milliseconds. Repeat these steps for a
> > maximum of 3 times.
>
> A bit more comment here might be warranted. If a bit is set, that means
> there is a
> pending channel interrupt. The expectation is that the normal interrupt
> handling
> mechanism will find and process the channel interrupt "very soon", and in the
> process
> clear the bit. Since Hyper-V won't be setting any additional channel
> interrupt bits, we
> should very soon reach a state where no bits are set.
I could add something like the above. Some editing seems to be required
if we want it to be part of 2/3 (that is, before ACKs are introduced).
Similar consideration holds for my comment in hv_synic_cleanup() below.
Suggestions?
>
> > + * Return 'true', if there is still any set bit after this operation;
> > 'false', otherwise.
> > + */
> > +static bool hv_synic_event_pending(void)
> > +{
> > + struct hv_per_cpu_context *hv_cpu =
> > this_cpu_ptr(hv_context.cpu_context);
> > + union hv_synic_event_flags *event =
> > + (union hv_synic_event_flags *)hv_cpu->synic_event_page +
> > VMBUS_MESSAGE_SINT;
> > + unsigned long *recv_int_page = event->flags;
>
> I think a comment is warranted here. The similar vmbus_chan_sched() code has
> two ways
> to get the recv_int_page, depending on the negotiated VMbus protocol version.
> This code
> assumes the "new" way to get the recv_int_page, which makes sense given that
> it is invoked
> only for newer VMbus protocol versions. But a note about that assumption
> would be a good
> warning for future readers/coders.
I've added a comment to this effect.
>
> > + bool pending;
> > + u32 relid;
> > + int tries = 0;
> > +
> > +retry:
> > + pending = false;
> > + for_each_set_bit(relid, recv_int_page, HV_EVENT_FLAGS_COUNT) {
> > + /* Special case - VMBus channel protocol messages */
> > + if (relid == 0)
> > + continue;
> > + if (sync_test_bit(relid, recv_int_page)) {
> > + pending = true;
> > + break;
>
> I'm not clear on why the above test is needed. If the bit was found to be set
> by for_each_set_bit(), that seems good enough to set pending=true.
Agreed. I've removed the test.
Thanks,
Andrea