On Tue, 19 Jan 2021 10:11:19 -0300
Daniel Henrique Barboza <danielhb...@gmail.com> wrote:

> 
> 
> On 1/19/21 6:50 AM, Greg Kurz wrote:
> > On Mon, 18 Jan 2021 16:30:35 -0300
> > Daniel Henrique Barboza <danielhb...@gmail.com> wrote:
> > 
> >> Commit 47c8c915b162 fixed a problem where multiple spapr_drc_detach()
> >> requests were breaking QEMU. The solution was to just spapr_drc_detach()
> >> once, and use spapr_drc_unplug_requested() to filter whether we
> >> already detached it or not. The commit also tied the hotplug request
> >> to the guest in the same condition.
> >>
> >> Turns out that there is a reliable way for a CPU hotunplug to fail. If
> >> a guest with one CPU hotplugs a CPU1, then offline CPU0s via
> >> 'echo 0 > /sys/devices/system/cpu/cpu0/online', then attempts to
> >> hotunplug CPU1, the kernel will refuse it because it's the last online
> >> CPU of the system. Given that we're pulsing the IRQ only in the first try,
> >> in a failed attempt, all other CPU1 hotunplug attempts will fail, 
> >> regardless
> >> of the online state of CPU1 in the kernel, because we're simply not letting
> >> the guest know that we want to hotunplug the device.
> >>
> >> Let's move spapr_hotplug_req_remove_by_index() back out of the
> >> "if (!spapr_drc_unplug_requested(drc))" conditional, allowing for multiple
> >> 'device_del' requests to the same CPU core to reach the guest, in case
> >> the CPU core didn't fully hotunplugged previously. Granted, this is not
> >> optimal because this can allow for multiple hotplug events queueing in the
> >> guest, like it was already possible before 47c8c915b162.
> >>
> >> Other possible alternatives would be:
> >>
> >> - check if the given CPU is the last online CPU in the guest before 
> >> attempting
> >> to hotunplug. This can be done by checking 'cs->halted' and msr states of
> >> the core. Problem is, this approach will fail if the guest offlines/onlines
> >> a CPU while we're in the middle of the unplug request, given that we can't
> >> control whether the CPU core states will change in the kernel. This option
> >> sure makes it harder to allow a hotunplug failure to happen, but will never
> >> be enough to fully avoid it;
> >>
> >> - let the user handled it. In this case, we would advise the user to 
> >> reboot the
> >> guest and the CPU will be removed during machine reset.
> >>
> > 
> > This is actually the only viable option since there's no way for the guest 
> > to
> > report an unplug request failure to QEMU. And this isn't specific to CPUs, 
> > eg.
> > Linux can also block unplug requests for DIMM devices if some LMB doesn't 
> > belong
> > to ZONE_MOVABLE. The solution for this is to tell linux to always put 
> > hot-plugged
> > memory in ZONE_MOVABLE.
> 
> Indeed, the lack of a 'not going to comply with this unplug request' logic in
> PAPR is the source of a lot of bugs and code in QEMU. As I said to you 
> offline,
> I had an old idea of proposing a PAPR change to add this mechanic, and then
> QEMU can be aware of unplug requests that are denied by the kernel right 
> before
> all the DRC state transition occurs. This will demand some changes, but the 
> net
> result is more predictability in all unplug operations.
> 

Yes, this could probably help if PowerVM also consider this change as
valuable. Not sure on the impact this would cause on the linux and QEMU
code though.

> Note that the new LMB DT attribte doesn't fix the problems for DRCs. It just 
> makes
> it harder to occur. There's nothing holding the kernel back from refusing LMB 
> unplug
> requests in stress conditions (e.g. a stress-ng workload overloading all the 
> RAM).
> 

Indeed. The use of 'movable_node' itself can be an issue as it can cause OOM
more easily for kernel allocations that cannot go to ZONE_MOVABLE.

> > 
> > Could something similar be done for CPUs ? For example, forbidding the 
> > off-lining
> > of CPU0 at the linux level : this would ensure all cores, except the one 
> > that has
> > CPU0, are always hot-unpluggable.
> 
> I don't think so. LMBs has a strong case for that extra DT flag because the 
> OS can
> expand the kernel space to the hotplugged DIMMs and so on. As you said in the 
> v1
> review, PAPR already mentions that offlining all VCPUs should just terminate 
> the
> partition, so why do we need a flag to forbid CPU offlining?
> 

Problem is that CPU teardown doesn't even allow to offline the last
CPU. I cannot think of a way to bypass that cleanly and get the last
CPU of linux guest to call RTAS "stop-self" anyway... I don't think
we can implement this _feature_ of PAPR. So we need something else.

> We're dealing with a kernel restriction (always need an online CPU) that PAPR 
> doesn't
> predicts in the spec. Not only that, but it's also a dynamic restriction. The 
> OS can
> deny hotunplug of any of the available CPUs, as long as it is the last 
> online. This
> puts QEMU into this weird spot we are now: we can't guarantee that the vcpu 
> state
> will remain in the guest during the unplug request.
> 
> x86 for instance forbids offlining CPU0. I'm not sure if it's an ACPI 
> restriction or
> kernel design choice. But this would fix the issue we're having here. Of all 
> the
> available choices for a fix, patching the pseries kernel to forbid offlining 
> CPU0
> looks a very sane one in my estimation.
> 

AFAICT the limitation comes from ACPI. POWER doesn't have such a restriction,
on powernv and pseries, so we would introduce it artificially as a workaround
to avoid the issue this patch is trying to address. I understand from the above
that adding a new flag in PAPR for QEMU to control this behavior has poor 
chances
of success. So this could be hardcoded to CPU0.

> 
> > 
> >> None of the alternatives are clear winnners, so this patch goes for the 
> >> approach
> >> makes the IRQ queue of the guest prone to multiple hotunplug requests for 
> >> the
> >> same CPU, but at least the user can successfully hotunplug the CPU after a 
> >> failed
> >> attempt, without the need of guest reboot.
> >>
> >> Reported-by: Xujun Ma <x...@redhat.com>
> >> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1911414
> >> Signed-off-by: Daniel Henrique Barboza <danielhb...@gmail.com>
> >> ---
> >>   hw/ppc/spapr.c | 11 ++++++++++-
> >>   1 file changed, 10 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index e9e3816cd3..e2f12ce413 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -3737,8 +3737,17 @@ void spapr_core_unplug_request(HotplugHandler 
> >> *hotplug_dev, DeviceState *dev,
> >>   
> >>       if (!spapr_drc_unplug_requested(drc)) {
> >>           spapr_drc_detach(drc);
> >> -        spapr_hotplug_req_remove_by_index(drc);
> >>       }
> >> +
> >> +    /*
> >> +     * spapr_hotplug_req_remove_by_index is left unguarded, out of the
> >> +     * "!spapr_drc_unplug_requested" check, to allow for multiple IRQ
> >> +     * pulses removing the same CPU core. Otherwise, in an failed 
> >> hotunplug
> >> +     * attempt (e.g. the kernel will refuse to remove the last online CPU
> >> +     * core), we will never attempt it again because unplug_requested will
> >> +     * still be 'true' in that case.
> >> +     */
> >> +    spapr_hotplug_req_remove_by_index(drc);
> > 
> > This not only fire the IRQ again, it also enqueues a new event... have
> > you tried hammering QEMU with CPU hot-plug/unplug requests. I seem to
> > remember that the troubles fixed by 47c8c915b162 had more to do with
> > the DRC state machine than the hot-plug event itself, but posting the
> > same event several times during a regular hot-unplug sequence could
> > maybe cause subtle bugs as well.
> 
> I tried hammering the guest with lots of IRQ events. Aside from the dmesg
> flood in the guest kernel everything went fine. No guest breakage of any
> sort.
> 
> It is also worth noticing that the unplug_request code is protected by
> BQL. It is not possible to spam the guest queue out of control.
> 

True.

> > 
> > Honestly, this is still a band-aid : it doesn't fix anything, it just
> > gives an alternative solution to reboot when someone has done something
> > silly. I'd rather not loosen our sanity checks for such a corner case.
> 
> It doesn't fix indeed. But documenting that the user shouldn't offline CPU0
> in the kernel because QEMU can't handle unplug properly isn't fixing anything
> either.
> 

Indeed but users know that they're walking on quicksand at least.

> Thing is, our spapr code is too optimistic, and not justifiable in my opinion,

PAPR is "too much something" in a lot of aspects and this certainly
had an impact on the code. ;-)

> in all unplug code for all devices. It assumes that everything will go as
> intended, but in reality we're hoping it will work. Otherwise, we're going to
> ask the user to reboot the guest so the unplug can happen in CAS.
> 
> This change I'm making makes the CPU unplug code way more pessimistic, and it
> can open a way for device_del abuse via an IRQ spam. But in the end it will
> alleviate he situation not only of this particular bug, but all "insert a 
> random
> reason why the kernel denied a CPU hotunplug, and now I can't unplug the
> CPU anymore" bugs in the future as well. It gives the user the choice to try
> again if something didn't go as planned. As far as I'm concerned, we should
> fire up spapr_hotplug_req_remove_by_index() from all unplug_request() calls
> of all devices regardless of unplug_request being true or not.
> 

I agree that if we go for this approach, we should fix all our unplug_request()
handlers to be consistent. BTW, what do other archs do in similar situation ?


> And if we're concerned about abuse of device_del spam, we can document that
> device_del can't be spammed in a 'unplug until it works' fashion. I'd rather
> advice the user to use device_del properly than to tell the user "if you 
> offline
> CPUs and enter this situation, you'll need to reboot the guest".
> 

It all boils down to what "use device_del properly" does mean exactly.

> > 
> > On the other side, the at-least-one-cpu thing is a linux limitation.
> > It seems fair that linux should provide a way to mitigate the effects.
> > Like suggested above, this could be the ability to elect an individual
> > vCPU to be always on-line. Since QEMU refuses the hotplug of the boot
> > core, QEMU could maybe tell the guest to elect CPU0 through some
> > flag in the DT, like we've done recently for LMBs.
> 
> 
> As I said above, I think that patching the kernel to avoid CPU0 offlining 
> looks
> like a viable idea.
> 

My only concern is that I don't have a clear idea on CPU offlining use
cases that could break with such a change.

> 
> I still believe that this patch should go through regardless though. I'm 
> convinced
> that we can't be optimistic about the CPU unplug process as a whole, and 
> advising
> the user to not spam the guest IRQ is a better option than documenting that we
> need to reset the machine to handle the unplug, because the user did something
> that the kernel fully allows, but we can't deal with and won't even allow 
> further
> attempts at it. I mean, what are we? Windows?
> 

David, any insight here ?

> 
> 
> Thanks,
> 
> 
> DHB
> 
> 
> 
> 
> > 
> >>   }
> >>   
> >>   int spapr_core_dt_populate(SpaprDrc *drc, SpaprMachineState *spapr,
> > 


Reply via email to