On Sun, Nov 29, 2020 at 06:29:55PM +0000, Michael Kelley wrote: > From: Andrea Parri (Microsoft) <parri.and...@gmail.com> Sent: Thursday, > November 26, 2020 11:12 AM > > > > Quoting from commit 7527810573436f ("Drivers: hv: vmbus: Introduce > > the CHANNELMSG_MODIFYCHANNEL message type"), > > > > "[...] Hyper-V can *not* currently ACK CHANNELMSG_MODIFYCHANNEL > > messages with the promise that (after the ACK is sent) the > > channel won't send any more interrupts to the "old" CPU. > > > > The peculiarity of the CHANNELMSG_MODIFYCHANNEL messages is > > problematic if the user want to take a CPU offline, since we > > don't want to take a CPU offline (and, potentially, "lose" > > channel interrupts on such CPU) if the host is still processing > > a CHANNELMSG_MODIFYCHANNEL message associated to that CPU." > > > > Introduce the CHANNELMSG_MODIFYCHANNEL_RESPONSE(24) message type, > > which embodies the type of the CHANNELMSG_MODIFYCHANNEL ACK. > > I feel like the commit message needs a bit more detail about the new > functionality that is added, including introducing the new VMbus protocol > version 5.3, and then waiting for the response message when running > with that protocol version of later. Also, when taking a CPU offline, new > functionality ensures that there are no pending channel interrupts for > that CPU.
I'll add details along these lines in the next submission. > > Could/should this patch be broken into multiple patches? > > 1) Introduce and negotiate VMbus protocol version 5.3. Note that just > negotiating version 5.3 should be safe because any ACK messages will > be cleanly ignored by the old code. > 2) Check for pending channel interrupts before taking a CPU offline. > Wouldn't this check be valid even with the existing code where there is no > ACK message? The old code is, in effect, assuming that enough time has > passed such that the modify channel message has been processed. > 3) Code to wait for an ACK message, and code to receive and process an > ACK message. I think that the patch could be broken into multiple patches. I'm still wondering whether we could/should invoke hv_synic_event_pending() without an ACK message (like the description above seems to suggest), say, from the introduction of MODIFYCHANNEL messages... Thoughts? > > +static int send_modifychannel_with_ack(struct vmbus_channel *channel, u32 > > target_vp) > > +{ > > + struct vmbus_channel_modifychannel *msg; > > + struct vmbus_channel_msginfo *info; > > + unsigned long flags; > > + int ret; > > + > > + info = kzalloc(sizeof(struct vmbus_channel_msginfo) + > > + sizeof(struct vmbus_channel_modifychannel), > > + GFP_KERNEL); > > + if (!info) > > + return -ENOMEM; > > + > > + init_completion(&info->waitevent); > > + info->waiting_channel = channel; > > + > > + msg = (struct vmbus_channel_modifychannel *)info->msg; > > + msg->header.msgtype = CHANNELMSG_MODIFYCHANNEL; > > + msg->child_relid = channel->offermsg.child_relid; > > + msg->target_vp = target_vp; > > + > > + spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags); > > + list_add_tail(&info->msglistentry, &vmbus_connection.chn_msg_list); > > + spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags); > > + > > + if (channel->rescind) { > > + ret = -ENODEV; > > + goto free_info; > > + } > > Does the above check really do anything? Such checks are sprinkled throughout > the VMbus code, and I've always questioned their efficacy. The rescind flag > can > be set without holding the channel_mutex, so as soon as the above code tests > the flag, it seems like it could change. Same question (and questioning) here actually. I'm planning to remove this check in v2. Similarly for the ->rescind check below. > > > + > > + ret = vmbus_post_msg(msg, sizeof(struct vmbus_channel_modifychannel), > > true); > > Use "sizeof(*msg)" instead? Applied. > > > + trace_vmbus_send_modifychannel(msg, ret); > > + if (ret != 0) > > + goto clean_msglist; > > + > > + /* > > + * Release channel_mutex; otherwise, vmbus_onoffer_rescind() could > > block on > > + * the mutex and be unable to signal the completion. > > + */ > > + mutex_unlock(&vmbus_connection.channel_mutex); > > + wait_for_completion(&info->waitevent); > > + mutex_lock(&vmbus_connection.channel_mutex); > > The calling function, target_cpu_store(), obtains the mutex to synchronize > against > channel closure. Does releasing the mutex here still ensure the needed > synchronization? > If so, some comments explaining why could be helpful. I didn't try to reason > through it > all, so I'm not sure. AFAICT, the synchronization against channel closure just want the mutex critical section to include the MODIFYCHANNEL post. AFAICT, different target_cpu_store() invocations remain serialized (by the mutex in kernfs_fop_write()); maybe this is worth noticing? > > > + > > + spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags); > > + list_del(&info->msglistentry); > > + spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags); > > + > > + if (channel->rescind) { > > + ret = -ENODEV; > > + goto free_info; > > + } > > Again, I'm wondering if the above is actually useful. See above. > > > + > > + if (info->response.modify_response.status) { > > I'm thinking that current Hyper-V never sends a non-zero status. But it's > good > to check in case of future changes. The implication is that a non-zero > status means > that the request to change the target CPU was not performed, correct? This corresponds to my understanding too. But you're right, I should have verified with the Hyper-V team... > > > + kfree(info); > > + return -EAGAIN; > > + } > > + > > + kfree(info); > > + return 0; > > + > > +clean_msglist: > > + spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags); > > + list_del(&info->msglistentry); > > + spin_unlock_irqrestore(&vmbus_connection.channelmsg_lock, flags); > > The error handling seems more complex than needed. The "clean_msglist" label > is only called from one place, so the above code could go inline at that > place. Mmh, it seems also wrong... IAC, I'll review it before the next submission, taking into account these remarks. > > > +free_info: > > + kfree(info); > > + return ret; > > More error handling: The kfree(info) call is done in three different places. > With > consistent use of the 'ret' local variable, only one place would be needed. See above. > > > +} > > + > > /* > > * Set/change the vCPU (@target_vp) the channel (@child_relid) will > > interrupt. > > * > > * CHANNELMSG_MODIFYCHANNEL messages are aynchronous. Also, Hyper-V does > > not > > - * ACK such messages. IOW we can't know when the host will stop > > interrupting > > - * the "old" vCPU and start interrupting the "new" vCPU for the given > > channel. > > + * ACK such messages before VERSION_WIN10_V5_3. Without ACK, we can not > > know > > + * when the host will stop interrupting the "old" vCPU and start > > interrupting > > + * the "new" vCPU for the given channel. > > In the interest of clarity, make the above comment slightly more explicit: > When VMbus > version 5.3 or later is negotiated, Hyper-V always sends an ACK in response to > CHANNELMSG_MODIFYCHANNEL. For VMbus version 5.2 and earlier, it never sends > an ACK. Applied. > > +static void vmbus_onmodifychannel_response(struct > > vmbus_channel_message_header *hdr) > > +{ > > + struct vmbus_channel_modifychannel_response *response; > > + struct vmbus_channel_msginfo *msginfo; > > + unsigned long flags; > > + > > + response = (struct vmbus_channel_modifychannel_response *)hdr; > > + > > + trace_vmbus_onmodifychannel_response(response); > > + > > + /* > > + * Find the modify msg, copy the response and signal/unblock the wait > > event. > > + */ > > + spin_lock_irqsave(&vmbus_connection.channelmsg_lock, flags); > > + > > + list_for_each_entry(msginfo, &vmbus_connection.chn_msg_list, > > msglistentry) { > > + struct vmbus_channel_message_header *responseheader = > > + (struct vmbus_channel_message_header > > *)msginfo->msg; > > + > > + if (responseheader->msgtype == CHANNELMSG_MODIFYCHANNEL) { > > + struct vmbus_channel_modifychannel *modifymsg; > > + > > + modifymsg = (struct vmbus_channel_modifychannel > > *)msginfo->msg; > > + if (modifymsg->child_relid == response->child_relid) { > > + memcpy(&msginfo->response.modify_response, > > response, > > + sizeof(struct > > vmbus_channel_modifychannel_response)); > > Use "sizeof(*response)" ? Applied. > > @@ -237,6 +238,40 @@ void hv_synic_disable_regs(unsigned int cpu) > > hv_set_synic_state(sctrl.as_uint64); > > } > > > > +#define HV_MAX_TRIES 3 > > +/* > > + * Scan the event flags page of 'this' CPU looking for any bit that is > > set. If we find one > > + * bit set, then wait for a few milliseconds. Repeat these steps for a > > maximum of 3 times. > > A bit more comment here might be warranted. If a bit is set, that means > there is a > pending channel interrupt. The expectation is that the normal interrupt > handling > mechanism will find and process the channel interrupt "very soon", and in the > process > clear the bit. Since Hyper-V won't be setting any additional channel > interrupt bits, we > should very soon reach a state where no bits are set. I could add something like the above. Some editing seems to be required if we want it to be part of 2/3 (that is, before ACKs are introduced). Similar consideration holds for my comment in hv_synic_cleanup() below. Suggestions? > > > + * Return 'true', if there is still any set bit after this operation; > > 'false', otherwise. > > + */ > > +static bool hv_synic_event_pending(void) > > +{ > > + struct hv_per_cpu_context *hv_cpu = > > this_cpu_ptr(hv_context.cpu_context); > > + union hv_synic_event_flags *event = > > + (union hv_synic_event_flags *)hv_cpu->synic_event_page + > > VMBUS_MESSAGE_SINT; > > + unsigned long *recv_int_page = event->flags; > > I think a comment is warranted here. The similar vmbus_chan_sched() code has > two ways > to get the recv_int_page, depending on the negotiated VMbus protocol version. > This code > assumes the "new" way to get the recv_int_page, which makes sense given that > it is invoked > only for newer VMbus protocol versions. But a note about that assumption > would be a good > warning for future readers/coders. I've added a comment to this effect. > > > + bool pending; > > + u32 relid; > > + int tries = 0; > > + > > +retry: > > + pending = false; > > + for_each_set_bit(relid, recv_int_page, HV_EVENT_FLAGS_COUNT) { > > + /* Special case - VMBus channel protocol messages */ > > + if (relid == 0) > > + continue; > > + if (sync_test_bit(relid, recv_int_page)) { > > + pending = true; > > + break; > > I'm not clear on why the above test is needed. If the bit was found to be set > by for_each_set_bit(), that seems good enough to set pending=true. Agreed. I've removed the test. Thanks, Andrea