Re: [RFC] adding a generic QAPI event for failed device hotunplug

Markus Armbruster Mon, 08 Mar 2021 22:23:33 -0800

Cc: Paolo and Julia in addition to Igor, because the thread is wandering
towards DeviceState member pending_deleted_event.


Cc: Laine for libvirt expertise.  Laine, if you're not the right person,
please loop in the right person.

David Gibson <da...@gibson.dropbear.id.au> writes:

> On Mon, Mar 08, 2021 at 03:01:53PM -0300, Daniel Henrique Barboza wrote:
>> 
>> 
>> On 3/8/21 2:04 PM, Markus Armbruster wrote:
>> > Daniel Henrique Barboza <danielhb...@gmail.com> writes:
>> > 
>> > > On 3/6/21 3:57 AM, Markus Armbruster wrote:
[...]
>> > > > We should document the event's reliability.  Are there unplug 
>> > > > operations
>> > > > where we can't detect failure?  Are there unplug operations where we
>> > > > could, but haven't implemented the event?
>> > > 
>> > > The current version of the PowerPC spec that the pseries machine 
>> > > implements
>> > > (LOPAR) does not predict a way for the virtual machine to report a 
>> > > hotunplug
>> > > error back to the hypervisor. If something fails, the hypervisor is left
>> > > in the dark.
>> > > 
>> > > What happened in the 6.0.0 dev cycle is that we faced a reliable way of
>> > 
>> > Do you mean "unreliable way"?
>> 
>> I guess a better word would be 'reproducible', as in we discovered a 
>> reproducible
>> way of getting the guest kernel to refuse the CPU hotunplug.
>
> Right.  It's worth noting here that in the PAPR model, there are no
> "forced" unplugs.  Unplugs always consist of a request to the guest,
> which is then resposible for offlining the device and signalling back
> to the hypervisor that it's done with it.
>
>> > > making CPU hotunplug fail in the guest (trying to hotunplug the last 
>> > > online
>> > > CPU) and the pseries machine was unprepared for dealing with it. We 
>> > > ended up
>> > > implementing a hotunplug timeout and, if the timeout kicks in, we're 
>> > > assuming
>> > > that the CPU hotunplug failed in the guest. This is the first scenario 
>> > > we have
>> > > today where we want to send a QAPI event informing the CPU hotunplug 
>> > > error,
>> > > but this event does not exist in QEMU ATM.
>> > 
>> > When the timeout kicks in, how can you know the operation failed?  You
>> > better be sure when you send an error event.  In other words: what
>> > prevents the scenario where the operation is much slower than you
>> > expect, the timeout expires, the error event is sent, and then the
>> > operation completes successfully?
>> 
>> A CPU hotunplug in a pseries guest takes no more than a couple of seconds, 
>> even
>> if the guest is under heavy load. The timeout is set to 15 seconds.
>
> Right.  We're well aware that a timeout is an ugly hack, since it's
> not really possible to distinguish it from a guest that's just being
> really slow.

As long as unplug failure cannot be detected reliably, we need a timeout
*somewhere*.  I vaguely recall libvirt has one.  Laine?

Putting the timeout in QEMU might be better.  QEMU might be in a better
position to pick a suitable timeout.

> It was the best thing we could come up with in the short term though.
> Without the changes we're suggesting here, the guest can positively
> assert the unplug is complete, but it has no way to signal failure.
> So, without a timeout qemu is stuck waiting indefinitely (or at least
> until the next machine reset) on the guest for an unplug that might
> never complete.
>
> It's particularly nasty if the guest does really fail the hotplug
> internally, but then conditions change so that the same unplug could
> now succeed.  The unplug request is just a message - there's no guest
> visible persistent state saying we want the device unplugged, so if
> conditions change the guest won't then re-attempt the unplug.
> Meanwhile the user can't re-attempt the device_del, because on the
> qemu side it's still stuck in a pending unplug waiting on the guest.

"Can't re-attempt" feels like a bug.  Let me explain.

Depending on the device, device_del can complete the unplug, or merely
initiate it.  Documentation:

# Notes: When this command completes, the device may not be removed from the
#        guest.  Hot removal is an operation that requires guest cooperation.
#        This command merely requests that the guest begin the hot removal
#        process.  Completion of the device removal process is signaled with a
#        DEVICE_DELETED event. Guest reset will automatically complete removal
#        for all devices.

This is not as accurate as it could be.  Behavior depends on the device.

For some kinds of devices, the command completes the unplug before it
sends its reply.  Example: USB devices.  Fine print: the DEVICE_DELETED
event gets send with a slight delay because device cleanup uses RCU.

For others, notably PCI devices, the command only pokes the guest to do
its bit.  QEMU reacts to the guest's actions, which eventually leads to
the actual unplug.  DEVICE_DELETED gets sent then.  If the guest doesn't
play ball to the end, the event doesn't get send.

The "can't retry unplug" behavior is new.  Another device_del used to
simply poke the guest again.  I think this regressed in commit
cce8944cc9 "qdev-monitor: Forbid repeated device_del", 2020-02-25.
Feels like a must-fix for 6.0.

> As we're discussing we think we have a better way, but it relies on
> guest changes, so we can't assume we already have that on the qemu
> side.
>
>> I'm aware that there's always that special use case, that we haven't seen 
>> yet,
>> where this assumption is no longer valid. The plan is to change the spec and 
>> the
>> guest kernel to signal the CPU hotunplug error back to QEMU before the dragon
>> lands. For now, timing out the CPU hotunplug process when we're almost 
>> certain
>> (but not 100%) that it failed in the guest is the best can do.
>> 
>> For both cases I want to use DEVICE_UNPLUG_ERROR right from the start, 
>> avoiding
>> guest visible changes when we change how we're detecting the unplug errors.
>> 
>> > > The second scenario is a memory hotunplug error. I found out that the 
>> > > pseries
>> > > guest kernel does a reconfiguration step when re-attaching the DIMM right
>> > > after refusing the hotunplug, and this reconfiguration is visible to 
>> > > QEMU.
>> > > I proceeded to make the pseries machine detect this error case, rollback 
>> > > the
>> > > unplug operation and fire up the MEM_UNPLUG_ERROR. This case is already 
>> > > covered
>> > > by QAPI, but if we add a DEVICE_UNPLUG_ERROR event I would use it in 
>> > > this case as
>> > > well instead of the MEM specific one.
>> > > 
>> > > This investigation and work in the mem hotunplug error path triggered a
>> > > discussion in qemu-ppc, where we're considering whether we should do the 
>> > > same
>> > > signalling the kernel does for the DIMM hotunplug error for all other 
>> > > device
>> > > hotunplug errors, given that the reconfiguration per se is not forbidden 
>> > > by LOPAR
>> > > and it's currently a no-op. We would make a LOPAR spec change to make 
>> > > this an
>> > > official hotunplug error report mechanism, and all pseries hotunplug 
>> > > operations,
>> > > for all devices, would report DEVICE_UNPLUG_ERROR QAPI events in the 
>> > > error path.
>> > > 
>> > > Granted, the spec change + Kernel change is not something that we will 
>> > > be able
>> > > to nail down in the 6.0.0 cycle, but having the DEVICE_UNPLUG_ERROR QAPI 
>> > > event
>> > > already in place would make it easier for the future as well.
>> > > 
>> > > 
>> > > I have a doc draft of these changes/infos that I forgot to post. I would 
>> > > post
>> > > it as docs/system/ppc-spapr-hotunplug-notes.rst. I can add the QAPI 
>> > > events
>> > > information there as well. Does that work for you as far as documentation
>> > > goes?
>> > 
>> > A DEVICE_UNPLUG_ERROR event makes perfect sense regardless of machine
>> > and device.
>
> Ack.  Fwiw, I don't think this can ever be more than a "best effort"
> notification.  Even with a machine and OS that should support it, a
> guest bug or hang could mean that it never acks *or* nacks an unplug
> request.

Since the success event is named DEVICE_DELETED, I'd name the
probably-failed event DEVICE_NOT_DELETED.  Bonus: can read it as a
statement of fact "still not deleted" instead of an error (that just
might not be one).

>> > I'm not asking you to to implement it for all machines and devices.  But
>> > I want its design to support growth towards that goal, and its
>> > documentation reflect its current state.
>> > 
>> > In particular, the doc comment in the QAPI schema should list the
>> > limitation.  If the event is only implemented for LOPAR for now, say so.
>> > If it's only implemented for certain devices, say so.
>> > 
>> > Questions?
>> 
>> 
>> Nope. Thanks for the pointers. I'll add the DEVICE_UNPLUG_ERROR QAPI event
>> in a way that it can be used by other machines in the future, and documenting
>> where the event is being used ATM.
>> 
>> 
>> Thanks,
>> 
>> 
>> DHB
>> 
>> 
>> > 
>>

Re: [RFC] adding a generic QAPI event for failed device hotunplug

Reply via email to