Re: [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events
On Wed, Sep 13, 2023 at 06:55:08PM -0700, Sean Christopherson wrote:
> Add flags to "struct kvm_gfn_range" to let notifier events target only
> shared and only private mappings, and wire up the existing mmu_notifier
> events to be shared-only (private memory is never associated with a
> userspace virtual address, i.e. can't be reached via mmu_notifiers).
>
> Add two flags so that KVM can handle the three possibilities (shared,
> private, and shared+private) without needing something like a tri-state
> enum.
>
> Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com
> Signed-off-by: Sean Christopherson
> ---
> include/linux/kvm_host.h | 2 ++
> virt/kvm/kvm_main.c | 7 +++
> 2 files changed, 9 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d8c6ce6c8211..b5373cee2b08 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -263,6 +263,8 @@ struct kvm_gfn_range {
>  	gfn_t start;
>  	gfn_t end;
>  	union kvm_mmu_notifier_arg arg;
> +	bool only_private;
> +	bool only_shared;
>  	bool may_block;
>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 174de2789657..a41f8658dfe0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>  	 * the second or later invocation of the handler).
>  	 */
>  	gfn_range.arg = range->arg;
> +
> +	/*
> +	 * HVA-based notifications aren't relevant to private
> +	 * mappings as they don't have a userspace mapping.
> +	 */
> +	gfn_range.only_private = false;
> +	gfn_range.only_shared = true;
>  	gfn_range.may_block = range->may_block;

Who is supposed to read only_private/only_shared? Is it supposed to be
plumbed into arch code and handled specially there?
I ask because I see elsewhere you have:

	/*
	 * If one or more memslots were found and thus zapped, notify arch code
	 * that guest memory has been reclaimed. This needs to be done *after*
	 * dropping mmu_lock, as x86's reclaim path is slow.
	 */
	if (__kvm_handle_hva_range(kvm, _range).found_memslot)
		kvm_arch_guest_memory_reclaimed(kvm);

and if there are any MMU notifier events that touch HVAs, then
kvm_arch_guest_memory_reclaimed()->wbinvd_on_all_cpus() will get called,
which causes the performance issues for SEV and SNP that Ashish had
brought up. Technically that would only need to happen if there are GPAs
in that memslot that aren't currently backed by gmem pages (and then
gmem could handle its own wbinvd_on_all_cpus() (or maybe clflush
per-page)).

Actually, even if there are shared pages in the GPA range, the
kvm_arch_guest_memory_reclaimed()->wbinvd_on_all_cpus() can be skipped
for guests that only use gmem pages for private memory.

Is that acceptable? Just trying to figure out where this
only_private/only_shared handling ties into that (or if it's a separate
thing entirely).

-Mike

>
> /*
> --
> 2.42.0.283.g2d96d420d3-goog
>
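To make the two-flag encoding above concrete, here is a small userspace sketch (hypothetical names — `range_applies()` and `gfn_range_flags` are not part of KVM) of how a consumer of the flags could decide which mappings to touch; two bools cover the three legal cases without a tri-state enum:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the two new fields in struct kvm_gfn_range:
 *   only_private=0, only_shared=0 -> act on both shared and private
 *   only_private=0, only_shared=1 -> shared only (all mmu_notifier events)
 *   only_private=1, only_shared=0 -> private only
 * only_private=1 && only_shared=1 would be a contradiction and is never set.
 */
struct gfn_range_flags {
	bool only_private;
	bool only_shared;
};

/* Decide whether a mapping of the given kind falls under this event. */
bool range_applies(const struct gfn_range_flags *f, bool is_private)
{
	if (f->only_private && !is_private)
		return false;
	if (f->only_shared && is_private)
		return false;
	return true;
}
```

With this encoding, the mmu_notifier path in the patch simply sets `only_shared = true`, and a private-memory-only invalidation would set `only_private = true`; arch code that doesn't care can ignore both flags and behave as before.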
Re: [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
On Wed, Sep 13, 2023 at 06:55:12PM -0700, Sean Christopherson wrote:
> TODO
>
> Cc: Fuad Tabba
> Cc: Vishal Annapurve
> Cc: Ackerley Tng
> Cc: Jarkko Sakkinen
> Cc: Maciej Szmigiero
> Cc: Vlastimil Babka
> Cc: David Hildenbrand
> Cc: Quentin Perret
> Cc: Michael Roth
> Cc: Wang
> Cc: Liam Merwick
> Cc: Isaku Yamahata
> Co-developed-by: Kirill A. Shutemov
> Signed-off-by: Kirill A. Shutemov
> Co-developed-by: Yu Zhang
> Signed-off-by: Yu Zhang
> Co-developed-by: Chao Peng
> Signed-off-by: Chao Peng
> Co-developed-by: Ackerley Tng
> Signed-off-by: Ackerley Tng
> Co-developed-by: Isaku Yamahata
> Signed-off-by: Isaku Yamahata
> Signed-off-by: Sean Christopherson
> ---
> include/linux/kvm_host.h | 48 +++
> include/uapi/linux/kvm.h | 15 +-
> include/uapi/linux/magic.h | 1 +
> virt/kvm/Kconfig | 4 +
> virt/kvm/Makefile.kvm | 1 +
> virt/kvm/guest_mem.c | 593 +
> virt/kvm/kvm_main.c | 61 +++-
> virt/kvm/kvm_mm.h | 38 +++
> 8 files changed, 756 insertions(+), 5 deletions(-)
> create mode 100644 virt/kvm/guest_mem.c
>
> +static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> +{
> +	struct list_head *gmem_list = &inode->i_mapping->private_list;
> +	pgoff_t start = offset >> PAGE_SHIFT;
> +	pgoff_t end = (offset + len) >> PAGE_SHIFT;
> +	struct kvm_gmem *gmem;
> +
> +	/*
> +	 * Bindings must be stable across invalidation to ensure the start+end
> +	 * are balanced.
> +	 */
> +	filemap_invalidate_lock(inode->i_mapping);
> +
> +	list_for_each_entry(gmem, gmem_list, entry) {
> +		kvm_gmem_invalidate_begin(gmem, start, end);

In v11 we used to call truncate_inode_pages_range() here to drop
filemap's reference on the folio. AFAICT the folios are only getting
free'd upon guest shutdown without this. Was this on purpose?
> +		kvm_gmem_invalidate_end(gmem, start, end);
> +	}
> +
> +	filemap_invalidate_unlock(inode->i_mapping);
> +
> +	return 0;
> +}
> +
> +static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
> +{
> +	struct address_space *mapping = inode->i_mapping;
> +	pgoff_t start, index, end;
> +	int r;
> +
> +	/* Dedicated guest is immutable by default. */
> +	if (offset + len > i_size_read(inode))
> +		return -EINVAL;
> +
> +	filemap_invalidate_lock_shared(mapping);

We take the filemap lock here, but not for
kvm_gmem_get_pfn()->kvm_gmem_get_folio(). Is it needed there as well?

> +
> +	start = offset >> PAGE_SHIFT;
> +	end = (offset + len) >> PAGE_SHIFT;
> +
> +	r = 0;
> +	for (index = start; index < end; ) {
> +		struct folio *folio;
> +
> +		if (signal_pending(current)) {
> +			r = -EINTR;
> +			break;
> +		}
> +
> +		folio = kvm_gmem_get_folio(inode, index);
> +		if (!folio) {
> +			r = -ENOMEM;
> +			break;
> +		}
> +
> +		index = folio_next_index(folio);
> +
> +		folio_unlock(folio);
> +		folio_put(folio);
> +
> +		/* 64-bit only, wrapping the index should be impossible. */
> +		if (WARN_ON_ONCE(!index))
> +			break;
> +
> +		cond_resched();
> +	}
> +
> +	filemap_invalidate_unlock_shared(mapping);
> +
> +	return r;
> +}
> +
> +static int __kvm_gmem_create(struct kvm *kvm, loff_t size, struct vfsmount *mnt)
> +{
> +	const char *anon_name = "[kvm-gmem]";
> +	const struct qstr qname = QSTR_INIT(anon_name, strlen(anon_name));
> +	struct kvm_gmem *gmem;
> +	struct inode *inode;
> +	struct file *file;
> +	int fd, err;
> +
> +	inode = alloc_anon_inode(mnt->mnt_sb);
> +	if (IS_ERR(inode))
> +		return PTR_ERR(inode);
> +
> +	err = security_inode_init_security_anon(inode, &qname, NULL);
> +	if (err)
> +		goto err_inode;
> +
> +	inode->i_private = (void *)(unsigned long)flags;

The 'flags' argument isn't added until the subsequent patch that adds
THP support.
> +static bool kvm_gmem_is_valid_size(loff_t size, u64 flags)
> +{
> +	if (size < 0 || !PAGE_ALIGNED(size))
> +		return false;
> +
> +	return true;
> +}
> +
> +int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> +{
> +	loff_t size = args->size;
> +	u6
[PATCH v2] powerpc/pseries/hotplug-cpu: wait indefinitely for vCPU death
588.ppc64le #1
[ 1538.362687] Workqueue: pseries hotplug workque pseries_hp_work_fn
[ 1538.362725] Call Trace:
[ 1538.362743] [c009d4adf590] [c0e0e0fc] dump_stack+0xb0/0xf4 (unreliable)
[ 1538.362789] [c009d4adf5d0] [c0475dfc] bad_page+0x12c/0x1b0
[ 1538.362827] [c009d4adf660] [c04784bc] free_pcppages_bulk+0x5bc/0x940
[ 1538.362871] [c009d4adf760] [c0478c38] page_alloc_cpu_dead+0x118/0x120
[ 1538.362918] [c009d4adf7b0] [c015b898] cpuhp_invoke_callback.constprop.5+0xb8/0x760
[ 1538.362969] [c009d4adf820] [c015eee8] _cpu_down+0x188/0x340
[ 1538.363007] [c009d4adf890] [c015d75c] cpu_down+0x5c/0xa0
[ 1538.363045] [c009d4adf8c0] [c092c544] cpu_subsys_offline+0x24/0x40
[ 1538.363091] [c009d4adf8e0] [c09212f0] device_offline+0xf0/0x130
[ 1538.363129] [c009d4adf920] [c010aee4] dlpar_offline_cpu+0x1c4/0x2a0
[ 1538.363174] [c009d4adf9e0] [c010b2f8] dlpar_cpu_remove+0xb8/0x190
[ 1538.363219] [c009d4adfa60] [c010b4fc] dlpar_cpu_remove_by_index+0x12c/0x150
[ 1538.363264] [c009d4adfaf0] [c010ca24] dlpar_cpu+0x94/0x800
[ 1538.363302] [c009d4adfc00] [c0102cc8] pseries_hp_work_fn+0x128/0x1e0
[ 1538.363347] [c009d4adfc70] [c018aa84] process_one_work+0x304/0x5d0
[ 1538.363394] [c009d4adfd10] [c018b5cc] worker_thread+0xcc/0x7a0
[ 1538.363433] [c009d4adfdc0] [c019567c] kthread+0x1ac/0x1c0
[ 1538.363469] [c009d4adfe30] [c000b7dc] ret_from_kernel_thread+0x5c/0x80

The latter trace is due to the following sequence:

  page_alloc_cpu_dead
    drain_pages
      drain_pages_zone
        free_pcppages_bulk

where drain_pages() in this case is called under the assumption that the
unplugged cpu is no longer executing. To ensure that is the case, an
early call is made to __cpu_die()->pseries_cpu_die(), which runs a loop
that waits for the cpu to reach a halted state by polling its status via
query-cpu-stopped-state RTAS calls.
It only polls for 25 iterations before giving up, however, and in the
trace above this results in the following being printed only 0.1 seconds
after the hotplug worker thread begins processing the unplug request:

[ 1538.253044] pseries-hotplug-cpu: Attempting to remove CPU , drc index: 113a
[ 1538.360259] Querying DEAD? cpu 314 (314) shows 2

At that point the worker thread assumes the unplugged CPU is in some
unknown/dead state and proceeds with the cleanup, causing the race with
the XIVE cleanup code executed by the unplugged CPU.

Fix this by waiting indefinitely, but also making an effort to avoid
spurious lockup messages by allowing for rescheduling after polling the
CPU status and printing a warning if we wait for longer than 120s.

Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
Suggested-by: Michael Ellerman
Cc: Thiago Jung Bauermann
Cc: Michael Ellerman
Cc: Cedric Le Goater
Cc: Greg Kurz
Cc: Nathan Lynch
Signed-off-by: Michael Roth
---
changes from v1:
- renamed from "powerpc/pseries/hotplug-cpu: increase wait time for vCPU death"
- wait indefinitely, but issue cond_resched() when RTAS reports that CPU
  hasn't stopped (Michael)
- print a warning after 120s of waiting (Michael)
- use pr_warn() instead of default printk() level
---
arch/powerpc/platforms/pseries/hotplug-cpu.c | 18 --
1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index c6e0d8abf75e..7a974ed6b240 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -107,22 +107,28 @@ static int pseries_cpu_disable(void)
  */
 static void pseries_cpu_die(unsigned int cpu)
 {
-	int tries;
 	int cpu_status = 1;
 	unsigned int pcpu = get_hard_smp_processor_id(cpu);
+	unsigned long timeout = jiffies + msecs_to_jiffies(120000);

-	for (tries = 0; tries < 25; tries++)
{
+	while (true) {
 		cpu_status = smp_query_cpu_stopped(pcpu);
 		if (cpu_status == QCSS_STOPPED ||
 		    cpu_status == QCSS_HARDWARE_ERROR)
 			break;
-		cpu_relax();
+		if (time_after(jiffies, timeout)) {
+			pr_warn("CPU %i (hwid %i) didn't die after 120 seconds\n",
+				cpu, pcpu);
+			timeout = jiffies + msecs_to_jiffies(120000);
+		}
+
+		cond_resched();
 	}

-	if (cpu_status != 0) {
-		printk("Querying DEAD? cpu %i (%i) shows %i\n",
-			cpu, pcpu, cpu_status);
+	if (cpu_status == QCSS_HARDWARE_ERROR) {
+
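The waiting policy of the v2 patch above can be modeled as a plain userspace sketch (all names are stand-ins, not the kernel's: `fake_query()` simulates smp_query_cpu_stopped(), the poll-count threshold stands in for the 120s jiffies timeout, and the comment marks where cond_resched() goes in the kernel):

```c
#include <assert.h>
#include <stdbool.h>

enum qcss { QCSS_STOPPED = 0, QCSS_HARDWARE_ERROR = -1, QCSS_NOT_STOPPED = 2 };

/* Fake CPU whose status flips to STOPPED after a fixed number of polls. */
struct fake_cpu {
	int polls_until_stopped;
	int polls;
	int warnings;
};

int fake_query(struct fake_cpu *c)
{
	c->polls++;
	return c->polls >= c->polls_until_stopped ? QCSS_STOPPED : QCSS_NOT_STOPPED;
}

/*
 * Wait indefinitely for the CPU to stop, "warning" every warn_every
 * unsuccessful polls (stand-in for the 120s timeout) and yielding
 * between polls rather than spinning with cpu_relax().
 * Returns the final status.
 */
int wait_for_death(struct fake_cpu *c, int warn_every)
{
	int status;
	int since_warn = 0;

	while (true) {
		status = fake_query(c);
		if (status == QCSS_STOPPED || status == QCSS_HARDWARE_ERROR)
			break;
		if (++since_warn >= warn_every) {
			c->warnings++;	/* pr_warn() in the kernel */
			since_warn = 0;
		}
		/* cond_resched() here in the kernel version */
	}
	return status;
}
```

The key property is that the loop never gives up and declares the CPU dead on its own, so the "Querying DEAD?" race can't happen; the trade-off is that a genuinely wedged CPU leaves the worker stuck, which is why the periodic warning matters.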
Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
Quoting Nathan Lynch (2020-08-07 02:05:09) > Hi everyone, > > Michael Ellerman writes: > > Greg Kurz writes: > >> On Tue, 04 Aug 2020 23:35:10 +1000 > >> Michael Ellerman wrote: > >>> Spinning forever seems like a bad idea, but as has been demonstrated at > >>> least twice now, continuing when we don't know the state of the other > >>> CPU can lead to straight up crashes. > >>> > >>> So I think I'm persuaded that it's preferable to have the kernel stuck > >>> spinning rather than oopsing. > >>> > >> > >> +1 > >> > >>> I'm 50/50 on whether we should have a cond_resched() in the loop. My > >>> first instinct is no, if we're stuck here for 20s a stack trace would be > >>> good. But then we will probably hit that on some big and/or heavily > >>> loaded machine. > >>> > >>> So possibly we should call cond_resched() but have some custom logic in > >>> the loop to print a warning if we are stuck for more than some > >>> sufficiently long amount of time. > >> > >> How long should that be ? > > > > Yeah good question. > > > > I guess step one would be seeing how long it can take on the 384 vcpu > > machine. And we can probably test on some other big machines. > > > > Hopefully Nathan can give us some idea of how long he's seen it take on > > large systems? I know he was concerned about the 20s timeout of the > > softlockup detector. > > Maybe I'm not quite clear what this is referring to, but I don't think > stop-self/query-stopped-state latency increases with processor count, at > least not on PowerVM. And IIRC I was concerned with the earlier patch's > potential for causing the softlockup watchdog to rightly complain by > polling the stopped state without ever scheduling away. > > The fact that smp_query_cpu_stopped() kind of collapses the two distinct > results from the query-cpu-stopped-state RTAS call into one return value > may make it harder than necessary to reason about the questions around > cond_resched() and whether to warn. 
>
> Sorry to pull this stunt but I have had some code sitting in a neglected
> branch that I think gets the logic around this right.
>
> What we should have is a simple C wrapper for the RTAS call that
> reflects the architected inputs and outputs:
>
> (-- rtas.c --)
>
> /**
>  * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state.
>  * @hwcpu: Identifies the processor thread to be queried.
>  * @status: Pointer to status, valid only on success.
>  *
>  * Determine whether the given processor thread is in the stopped
>  * state. If successful and @status is non-NULL, the thread's status
>  * is stored to @status.
>  *
>  * Return:
>  * * 0 - Success
>  * * -1 - Hardware error
>  * * -2 - Busy, try again later
>  */
> int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status)
> {
> 	unsigned int cpu_status;
> 	int token;
> 	int fwrc;
>
> 	token = rtas_token("query-cpu-stopped-state");
>
> 	fwrc = rtas_call(token, 1, 2, &cpu_status, hwcpu);
> 	if (fwrc != 0)
> 		goto out;
>
> 	if (status != NULL)
> 		*status = cpu_status;
> out:
> 	return fwrc;
> }
>
> And then a utility function that waits for the remote thread to enter
> stopped state, with higher-level logic for rescheduling and warning. The
> fact that smp_query_cpu_stopped() currently does not handle a -2/busy
> status is a bug, fixed below by using rtas_busy_delay(). Note the
> justification for the explicit cond_resched() in the outer loop:
>
> (-- rtas.h --)
>
> /* query-cpu-stopped-state CPU_status */
> #define RTAS_QCSS_STATUS_STOPPED	0
> #define RTAS_QCSS_STATUS_IN_PROGRESS	1
> #define RTAS_QCSS_STATUS_NOT_STOPPED	2
>
> (-- pseries/hotplug-cpu.c --)
>
> /**
>  * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state.
>  */
> static void wait_for_cpu_stopped(unsigned int cpu)
> {
> 	unsigned int status;
> 	unsigned int hwcpu;
>
> 	hwcpu = get_hard_smp_processor_id(cpu);
>
> 	do {
> 		int fwrc;
>
> 		/*
> 		 * rtas_busy_delay() will yield only if RTAS returns a
> 		 * busy status.
> 		 * Since query-cpu-stopped-state can
> 		 * yield RTAS_QCSS_STATUS_IN_PROGRESS or
> 		 * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded
> 		 * period before the target thread stops, we must take
> 		 * care to explicitly reschedule while polling.
> 		 */
> 		cond_resched();
>
> 		do {
> 			fwrc = rtas_query_cpu_stopped_state(hwcpu, &status);
> 		} while (rtas_busy_delay(fwrc));
>
> 		if (fwrc == 0)
> 			continue;
>
> 		pr_err_ratelimited("query-cpu-stopped-state for "
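The loop structure proposed above — an inner retry for -2/busy returns, an outer poll that must reschedule explicitly because busy-delay alone never yields for IN_PROGRESS/NOT_STOPPED — can be simulated in userspace against a scripted fake firmware (all names here are illustrative; `fake_query()` replaces the real RTAS plumbing):

```c
#include <assert.h>

#define RTAS_QCSS_STATUS_STOPPED	0
#define RTAS_QCSS_STATUS_NOT_STOPPED	2

/* Scripted firmware: a sequence of return codes (-2 = busy), and a
 * sequence of statuses consumed one per successful (rc == 0) call. */
struct fake_fw {
	const int *rcs;
	const unsigned int *statuses;
	int call;
	int ok_calls;
};

int fake_query(struct fake_fw *fw, unsigned int *status)
{
	int rc = fw->rcs[fw->call++];

	if (rc == 0)
		*status = fw->statuses[fw->ok_calls++];
	return rc;
}

/*
 * Mirror of the proposed wait_for_cpu_stopped() structure: the inner
 * loop retries busy returns (rtas_busy_delay() would sleep there), the
 * outer loop re-polls while the status is not yet STOPPED (with a
 * cond_resched() at the top in the kernel version).
 * Returns the number of firmware calls made, or -1 on hardware error.
 */
int wait_for_cpu_stopped_sim(struct fake_fw *fw)
{
	unsigned int status;

	do {
		int rc;

		/* cond_resched() would go here */
		do {
			rc = fake_query(fw, &status);
			/* rtas_busy_delay(rc) would sleep here on -2 */
		} while (rc == -2);

		if (rc != 0)
			return -1;
	} while (status != RTAS_QCSS_STATUS_STOPPED);

	return fw->call;
}
```

This separation is exactly what the existing smp_query_cpu_stopped() obscures by folding the firmware return code and the CPU status into one value: busy wants a short sleep-and-retry, NOT_STOPPED wants a reschedule-and-repoll, and a hardware error wants to bail out.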
Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
Quoting Michael Roth (2020-08-05 17:29:28) > Quoting Michael Roth (2020-08-04 23:37:32) > > Quoting Michael Ellerman (2020-08-04 22:07:08) > > > Greg Kurz writes: > > > > On Tue, 04 Aug 2020 23:35:10 +1000 > > > > Michael Ellerman wrote: > > > >> Spinning forever seems like a bad idea, but as has been demonstrated at > > > >> least twice now, continuing when we don't know the state of the other > > > >> CPU can lead to straight up crashes. > > > >> > > > >> So I think I'm persuaded that it's preferable to have the kernel stuck > > > >> spinning rather than oopsing. > > > >> > > > > > > > > +1 > > > > > > > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My > > > >> first instinct is no, if we're stuck here for 20s a stack trace would > > > >> be > > > >> good. But then we will probably hit that on some big and/or heavily > > > >> loaded machine. > > > >> > > > >> So possibly we should call cond_resched() but have some custom logic in > > > >> the loop to print a warning if we are stuck for more than some > > > >> sufficiently long amount of time. > > > > > > > > How long should that be ? > > > > > > Yeah good question. > > > > > > I guess step one would be seeing how long it can take on the 384 vcpu > > > machine. And we can probably test on some other big machines. > > > > > > Hopefully Nathan can give us some idea of how long he's seen it take on > > > large systems? I know he was concerned about the 20s timeout of the > > > softlockup detector. > > > > > > Maybe a minute or two? > > > > Hmm, so I took a stab at this where I called cond_resched() after > > every 5 seconds of polling and printed a warning at the same time (FWIW > > that doesn't seem to trigger any warnings on a loaded 96-core mihawk > > system using KVM running the 384vcpu unplug loop) > > > > But it sounds like that's not quite what you had in mind. How frequently > > do you think we should call cond_resched()? 
Maybe after 25 iterations > > of polling smp_query_cpu_stopped() to keep original behavior somewhat > > similar? > > > > I'll let the current patch run on the mihawk system overnight in the > > meantime so we at least have that data point, but would be good to > > know what things look like a large pHyp machine. > > At one point I did manage to get the system in a state where unplug > operations were taking 1-2s, but still not enough to trigger any > 5s warning, and I wasn't able to reproduce that in subsequent runs. > > I also tried reworking the patch so that we print a warning and > cond_resched() after 200 ms to make sure that path gets executed, but > only managed to trigger the warning twice after a few hours. > > So, if we print a warning after a couple minutes, that seems pretty > conservative as far as avoiding spurious warnings. And if we > cond_resched() after 25 loops of polling (~0.1 ms in the cases ~0.1 seconds I mean > that caused the original crash), that would avoid most of the > default RCU/lockup warnings. > > But having a second timeout to trigger the cond_resched() after some > set interval like 2s seems more deterministic since we're less > susceptible to longer delays due to things like the RTAS calls > contending for QEMU's global mutex in the the KVM case. > > > > > > Thanks! > > > > > > > > >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE > > > >> > interrupt controller") > > > >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588 > > > >> > > > >> This is not public. > > > > > > > > I'll have a look at changing that. > > > > > > Thanks. > > > > > > cheers
Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
Quoting Michael Roth (2020-08-04 23:37:32) > Quoting Michael Ellerman (2020-08-04 22:07:08) > > Greg Kurz writes: > > > On Tue, 04 Aug 2020 23:35:10 +1000 > > > Michael Ellerman wrote: > > >> Spinning forever seems like a bad idea, but as has been demonstrated at > > >> least twice now, continuing when we don't know the state of the other > > >> CPU can lead to straight up crashes. > > >> > > >> So I think I'm persuaded that it's preferable to have the kernel stuck > > >> spinning rather than oopsing. > > >> > > > > > > +1 > > > > > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My > > >> first instinct is no, if we're stuck here for 20s a stack trace would be > > >> good. But then we will probably hit that on some big and/or heavily > > >> loaded machine. > > >> > > >> So possibly we should call cond_resched() but have some custom logic in > > >> the loop to print a warning if we are stuck for more than some > > >> sufficiently long amount of time. > > > > > > How long should that be ? > > > > Yeah good question. > > > > I guess step one would be seeing how long it can take on the 384 vcpu > > machine. And we can probably test on some other big machines. > > > > Hopefully Nathan can give us some idea of how long he's seen it take on > > large systems? I know he was concerned about the 20s timeout of the > > softlockup detector. > > > > Maybe a minute or two? > > Hmm, so I took a stab at this where I called cond_resched() after > every 5 seconds of polling and printed a warning at the same time (FWIW > that doesn't seem to trigger any warnings on a loaded 96-core mihawk > system using KVM running the 384vcpu unplug loop) > > But it sounds like that's not quite what you had in mind. How frequently > do you think we should call cond_resched()? Maybe after 25 iterations > of polling smp_query_cpu_stopped() to keep original behavior somewhat > similar? 
> > I'll let the current patch run on the mihawk system overnight in the > meantime so we at least have that data point, but would be good to > know what things look like a large pHyp machine. At one point I did manage to get the system in a state where unplug operations were taking 1-2s, but still not enough to trigger any 5s warning, and I wasn't able to reproduce that in subsequent runs. I also tried reworking the patch so that we print a warning and cond_resched() after 200 ms to make sure that path gets executed, but only managed to trigger the warning twice after a few hours. So, if we print a warning after a couple minutes, that seems pretty conservative as far as avoiding spurious warnings. And if we cond_resched() after 25 loops of polling (~0.1 ms in the cases that caused the original crash), that would avoid most of the default RCU/lockup warnings. But having a second timeout to trigger the cond_resched() after some set interval like 2s seems more deterministic since we're less susceptible to longer delays due to things like the RTAS calls contending for QEMU's global mutex in the the KVM case. > > Thanks! > > > > > >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE > > >> > interrupt controller") > > >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588 > > >> > > >> This is not public. > > > > > > I'll have a look at changing that. > > > > Thanks. > > > > cheers
Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
Quoting Michael Ellerman (2020-08-04 22:07:08) > Greg Kurz writes: > > On Tue, 04 Aug 2020 23:35:10 +1000 > > Michael Ellerman wrote: > >> There is a bit of history to this code, but not in a good way :) > >> > >> Michael Roth writes: > >> > For a power9 KVM guest with XIVE enabled, running a test loop > >> > where we hotplug 384 vcpus and then unplug them, the following traces > >> > can be seen (generally within a few loops) either from the unplugged > >> > vcpu: > >> > > >> > [ 1767.353447] cpu 65 (hwid 65) Ready to die... > >> > [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2 > >> > [ 1767.952311] list_del corruption. next->prev should be > >> > c00a02470208, but was c00a02470048 > >> ... > >> > > >> > At that point the worker thread assumes the unplugged CPU is in some > >> > unknown/dead state and procedes with the cleanup, causing the race with > >> > the XIVE cleanup code executed by the unplugged CPU. > >> > > >> > Fix this by inserting an msleep() after each RTAS call to avoid > >> > >> We previously had an msleep(), but it was removed: > >> > >> b906cfa397fd ("powerpc/pseries: Fix cpu hotplug") > > > > Ah, I hadn't seen that one... > > > >> > pseries_cpu_die() returning prematurely, and double the number of > >> > attempts so we wait at least a total of 5 seconds. While this isn't an > >> > ideal solution, it is similar to how we dealt with a similar issue for > >> > cede_offline mode in the past (940ce422a3). > >> > >> Thiago tried to fix this previously but there was a bit of discussion > >> that didn't quite resolve: > >> > >> > >> https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauer...@linux.ibm.com/ > > > > Yeah it appears that the motivation at the time was to make the "Querying > > DEAD?" > > messages to disappear and to avoid potentially concurrent calls to > > rtas-stop-self > > which is prohibited by PAPR... not fixing actual crashes. 
> > I'm pretty sure at one point we were triggering crashes *in* RTAS via > this path, I think that got resolved. > > >> Spinning forever seems like a bad idea, but as has been demonstrated at > >> least twice now, continuing when we don't know the state of the other > >> CPU can lead to straight up crashes. > >> > >> So I think I'm persuaded that it's preferable to have the kernel stuck > >> spinning rather than oopsing. > >> > > > > +1 > > > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My > >> first instinct is no, if we're stuck here for 20s a stack trace would be > >> good. But then we will probably hit that on some big and/or heavily > >> loaded machine. > >> > >> So possibly we should call cond_resched() but have some custom logic in > >> the loop to print a warning if we are stuck for more than some > >> sufficiently long amount of time. > > > > How long should that be ? > > Yeah good question. > > I guess step one would be seeing how long it can take on the 384 vcpu > machine. And we can probably test on some other big machines. > > Hopefully Nathan can give us some idea of how long he's seen it take on > large systems? I know he was concerned about the 20s timeout of the > softlockup detector. > > Maybe a minute or two? Hmm, so I took a stab at this where I called cond_resched() after every 5 seconds of polling and printed a warning at the same time (FWIW that doesn't seem to trigger any warnings on a loaded 96-core mihawk system using KVM running the 384vcpu unplug loop) But it sounds like that's not quite what you had in mind. How frequently do you think we should call cond_resched()? Maybe after 25 iterations of polling smp_query_cpu_stopped() to keep original behavior somewhat similar? I'll let the current patch run on the mihawk system overnight in the meantime so we at least have that data point, but would be good to know what things look like a large pHyp machine. Thanks! 
> > >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE > >> > interrupt controller") > >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588 > >> > >> This is not public. > > > > I'll have a look at changing that. > > Thanks. > > cheers
[PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death
588.ppc64le #1
[ 1538.362687] Workqueue: pseries hotplug workque pseries_hp_work_fn
[ 1538.362725] Call Trace:
[ 1538.362743] [c009d4adf590] [c0e0e0fc] dump_stack+0xb0/0xf4 (unreliable)
[ 1538.362789] [c009d4adf5d0] [c0475dfc] bad_page+0x12c/0x1b0
[ 1538.362827] [c009d4adf660] [c04784bc] free_pcppages_bulk+0x5bc/0x940
[ 1538.362871] [c009d4adf760] [c0478c38] page_alloc_cpu_dead+0x118/0x120
[ 1538.362918] [c009d4adf7b0] [c015b898] cpuhp_invoke_callback.constprop.5+0xb8/0x760
[ 1538.362969] [c009d4adf820] [c015eee8] _cpu_down+0x188/0x340
[ 1538.363007] [c009d4adf890] [c015d75c] cpu_down+0x5c/0xa0
[ 1538.363045] [c009d4adf8c0] [c092c544] cpu_subsys_offline+0x24/0x40
[ 1538.363091] [c009d4adf8e0] [c09212f0] device_offline+0xf0/0x130
[ 1538.363129] [c009d4adf920] [c010aee4] dlpar_offline_cpu+0x1c4/0x2a0
[ 1538.363174] [c009d4adf9e0] [c010b2f8] dlpar_cpu_remove+0xb8/0x190
[ 1538.363219] [c009d4adfa60] [c010b4fc] dlpar_cpu_remove_by_index+0x12c/0x150
[ 1538.363264] [c009d4adfaf0] [c010ca24] dlpar_cpu+0x94/0x800
[ 1538.363302] [c009d4adfc00] [c0102cc8] pseries_hp_work_fn+0x128/0x1e0
[ 1538.363347] [c009d4adfc70] [c018aa84] process_one_work+0x304/0x5d0
[ 1538.363394] [c009d4adfd10] [c018b5cc] worker_thread+0xcc/0x7a0
[ 1538.363433] [c009d4adfdc0] [c019567c] kthread+0x1ac/0x1c0
[ 1538.363469] [c009d4adfe30] [c000b7dc] ret_from_kernel_thread+0x5c/0x80

The latter trace is due to the following sequence:

  page_alloc_cpu_dead
    drain_pages
      drain_pages_zone
        free_pcppages_bulk

where drain_pages() in this case is called under the assumption that the
unplugged cpu is no longer executing. To ensure that is the case, an
early call is made to __cpu_die()->pseries_cpu_die(), which runs a loop
that waits for the cpu to reach a halted state by polling its status via
query-cpu-stopped-state RTAS calls.
It only polls for 25 iterations before giving up, however, and in the
trace above this results in the following being printed only 0.1 seconds
after the hotplug worker thread begins processing the unplug request:

[ 1538.253044] pseries-hotplug-cpu: Attempting to remove CPU , drc index: 113a
[ 1538.360259] Querying DEAD? cpu 314 (314) shows 2

At that point the worker thread assumes the unplugged CPU is in some
unknown/dead state and proceeds with the cleanup, causing the race with
the XIVE cleanup code executed by the unplugged CPU.

Fix this by inserting an msleep() after each RTAS call to avoid
pseries_cpu_die() returning prematurely, and double the number of
attempts so we wait at least a total of 5 seconds. While this isn't an
ideal solution, it is similar to how we dealt with a similar issue for
cede_offline mode in the past (940ce422a3).

Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
Cc: Michael Ellerman
Cc: Cedric Le Goater
Cc: Greg Kurz
Cc: Nathan Lynch
Signed-off-by: Michael Roth
---
arch/powerpc/platforms/pseries/hotplug-cpu.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index c6e0d8abf75e..3cb172758052 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -111,13 +111,12 @@ static void pseries_cpu_die(unsigned int cpu)
 	int cpu_status = 1;
 	unsigned int pcpu = get_hard_smp_processor_id(cpu);

-	for (tries = 0; tries < 25; tries++) {
+	for (tries = 0; tries < 50; tries++) {
 		cpu_status = smp_query_cpu_stopped(pcpu);
 		if (cpu_status == QCSS_STOPPED ||
 		    cpu_status == QCSS_HARDWARE_ERROR)
 			break;
-		cpu_relax();
-
+		msleep(100);
 	}

 	if (cpu_status != 0) {
--
2.17.1
[PATCH] KVM: PPC: Book3S HV: Fix H_CEDE return code for nested guests
The h_cede_tm kvm-unit-test currently fails when run inside an L1 guest
via the guest/nested hypervisor:

  ./run-tests.sh -v
  ...
  TESTNAME=h_cede_tm TIMEOUT=90s ACCEL= ./powerpc/run powerpc/tm.elf -smp 2,threads=2 -machine cap-htm=on -append "h_cede_tm"
  FAIL h_cede_tm (2 tests, 1 unexpected failures)

While the test relates to transactional memory instructions, the actual
failure is due to the return code of the H_CEDE hypercall, which is
reported as 224 instead of 0. This happens even when no TM instructions
are issued.

224 is the value placed in r3 to execute a hypercall for H_CEDE, and r3
is where the caller expects the return code to be placed upon return.

In the case of a guest running under a nested hypervisor, issuing H_CEDE
causes a return from H_ENTER_NESTED. In this case H_CEDE is
specially-handled immediately rather than later in
kvmppc_pseries_do_hcall() as with most other hcalls, but we forget to
set the return code for the caller, hence why kvm-unit-test sees the
224 return code and reports an error.

Guest kernels generally don't check the return value of H_CEDE, so that
likely explains why this hasn't caused issues outside of kvm-unit-tests
so far.

Fix this by setting r3 to 0 after we finish processing the H_CEDE.

RHBZ: 1778556

Fixes: 4bad77799fed ("KVM: PPC: Book3S HV: Handle hypercalls correctly when nested")
Cc: linuxppc-...@ozlabs.org
Cc: David Gibson
Cc: Paul Mackerras
Signed-off-by: Michael Roth
---
arch/powerpc/kvm/book3s_hv.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2cefd071b848..c0c43a733830 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3616,6 +3616,7 @@ int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 time_limit,
 		if (trap == BOOK3S_INTERRUPT_SYSCALL && !vcpu->arch.nested &&
 		    kvmppc_get_gpr(vcpu, 3) == H_CEDE) {
 			kvmppc_nested_cede(vcpu);
+			kvmppc_set_gpr(vcpu, 3, 0);
 			trap = 0;
 		}
 	} else {
--
2.17.1
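The r3 round-trip described above can be captured in a minimal model (names and structure are illustrative, not the real kvmhv_p9_guest_entry() path): r3 carries the hypercall number on entry (H_CEDE == 0xE0 == 224) and must be overwritten with the return code before control returns to the caller, otherwise the stale hypercall number leaks back as the "return code".

```c
#include <assert.h>

#define H_CEDE		0xE0UL	/* 224: hcall number placed in r3 by the caller */
#define H_SUCCESS	0UL

struct vcpu { unsigned long gpr[4]; };

/*
 * Model of the special-cased nested-cede path.  'fixed' toggles the
 * patch's one-line fix (kvmppc_set_gpr(vcpu, 3, 0) in the real code).
 * Returns what the guest would read back from r3 as the hcall rc.
 */
unsigned long do_nested_cede(struct vcpu *v, int fixed)
{
	v->gpr[3] = H_CEDE;		/* caller loads the hcall number */
	/* ... kvmppc_nested_cede() processing would happen here ... */
	if (fixed)
		v->gpr[3] = H_SUCCESS;	/* the fix: store the return code */
	return v->gpr[3];
}
```

Without the fix, the caller sees 224 in r3 and interprets it as an error, which is precisely the symptom the kvm-unit-test reported.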
Re: [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.
Quoting Alexey Kardashevskiy (2019-12-11 16:47:30) > > > On 12/12/2019 07:31, Michael Roth wrote: > > Quoting Alexey Kardashevskiy (2019-12-11 02:15:44) > >> > >> > >> On 11/12/2019 02:35, Ram Pai wrote: > >>> On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote: > >>>> > >>>> > >>>> On 10/12/2019 16:12, Ram Pai wrote: > >>>>> On Tue, Dec 10, 2019 at 02:07:36PM +1100, Alexey Kardashevskiy wrote: > >>>>>> > >>>>>> > >>>>>> On 07/12/2019 12:12, Ram Pai wrote: > >>>>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one > >>>>>>> of > >>>>>>> its parameters. On secure VMs, hypervisor cannot access the contents > >>>>>>> of > >>>>>>> this page since it gets encrypted. Hence share the page with the > >>>>>>> hypervisor, and unshare when done. > >>>>>> > >>>>>> > >>>>>> I thought the idea was to use H_PUT_TCE and avoid sharing any extra > >>>>>> pages. There is small problem that when DDW is enabled, > >>>>>> FW_FEATURE_MULTITCE is ignored (easy to fix); I also noticed complains > >>>>>> about the performance on slack but this is caused by initial cleanup of > >>>>>> the default TCE window (which we do not use anyway) and to battle this > >>>>>> we can simply reduce its size by adding > >>>>> > >>>>> something that takes hardly any time with H_PUT_TCE_INDIRECT, takes > >>>>> 13secs per device for H_PUT_TCE approach, during boot. This is with a > >>>>> 30GB guest. With larger guest, the time will further detoriate. > >>>> > >>>> > >>>> No it will not, I checked. The time is the same for 2GB and 32GB guests- > >>>> the delay is caused by clearing the small DMA window which is small by > >>>> the space mapped (1GB) but quite huge in TCEs as it uses 4K pages; and > >>>> for DDW window + emulated devices the IOMMU page size will be 2M/16M/1G > >>>> (depends on the system) so the number of TCEs is much smaller. > >>> > >>> I cant get your results. What changes did you make to get it? > >> > >> > >> Get what? 
I passed "-m 2G" and "-m 32G", got the same time - 13s spent > >> in clearing the default window and the huge window took a fraction of a > >> second to create and map. > > > > Is this if we disable FW_FEATURE_MULTITCE in the guest and force the use > > of H_PUT_TCE everywhere? > > > Yes. Well, for the DDW case FW_FEATURE_MULTITCE is ignored but even when > fixed (I have it in my local branch), this does not make a difference. > > > > > > In theory couldn't we leave FW_FEATURE_MULTITCE in place so that > > iommu_table_clear() can still use H_STUFF_TCE (which I guess is basically > > instant), > > PAPR/LoPAPR "conveniently" do not describe what hcall-multi-tce does > exactly. But I am pretty sure the idea is that either both H_STUFF_TCE > and H_PUT_TCE_INDIRECT are present or neither. That was my interpretation (or maybe I just went by what your implementation did :), but just because they are available doesn't mean the guest has to use them. I agree it's ugly to condition it on is_secure_guest(), but to me that seems better than sharing memory unnecessarily, or potentially leaving stale mappings in the default IOMMU window. Not sure if there are other alternatives though. > > > > and then force H_PUT_TCE for new mappings via something like: > > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c > > b/arch/powerpc/platforms/pseries/iommu.c > > index 6ba081dd61c9..85d092baf17d 100644 > > --- a/arch/powerpc/platforms/pseries/iommu.c > > +++ b/arch/powerpc/platforms/pseries/iommu.c > > @@ -194,6 +194,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table > > *tbl, long tcenum, > > unsigned long flags; > > > > if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) { > > + if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE) || > > is_secure_guest()) { > > > Nobody (including myself) seems to like the idea of having > is_secure_guest() all over the place. > > And with KVM acceleration enabled, it is pretty fast anyway.
Just now we > do not have H_PUT_TCE in KVM/UV for
Re: [PATCH v5 2/2] powerpc/pseries/iommu: Use dma_iommu_ops for Secure VM.
Quoting Ram Pai (2019-12-12 00:45:02) > On Tue, Dec 10, 2019 at 07:43:24PM -0600, Michael Roth wrote: > > Quoting Ram Pai (2019-12-06 19:12:39) > > > Commit edea902c1c1e ("powerpc/pseries/iommu: Don't use dma_iommu_ops on > > > secure guests") > > > disabled dma_iommu_ops path, for secure VMs. Disabling dma_iommu_ops > > > path for secure VMs, helped enable dma_direct path. This enabled > > > support for bounce-buffering through SWIOTLB. However it fails to > > > operate when IOMMU is enabled, since I/O pages are not TCE mapped. > > > > > > Renable dma_iommu_ops path for pseries Secure VMs. It handles all > > > cases including, TCE mapping I/O pages, in the presence of a > > > IOMMU. > > > > Wasn't clear to me at first, but I guess the main gist of this series is > > that we want to continue to use SWIOTLB, but also need to create mappings > > of it's bounce buffers in the IOMMU, so we revert to using dma_iommu_ops > > and rely on the various dma_iommu_{map,alloc}_bypass() hooks throughout > > to call into dma_direct_* ops rather than relying on the dma_is_direct(ops) > > checks in DMA API functions to do the same. > > > > That makes sense, but one issue I see with that is that > > dma_iommu_map_bypass() only tests true if all the following are true: > > > > 1) the device requests a 64-bit DMA mask via > >dma_set_mask/dma_set_coherent_mask > > 2) DDW is enabled (i.e. we don't pass disable_ddw on command-line) > > > > dma_is_direct() checks don't have this limitation, so I think for > > anything cases, such as devices that use a smaller DMA mask, we'll > > end up falling back to the non-bypass functions in dma_iommu_ops, which > > will likely break for things like dma_alloc_coherent/dma_map_single > > since they won't use SWIOTLB pages and won't do the necessary calls to > > set_memory_unencrypted() to share those non-SWIOTLB buffers with > > hypervisor. > > > > Maybe that's ok, but I think we should be clearer about how to > > fail/handle these cases. > > Yes. 
makes sense. Device that cannot handle 64bit dma mask will not work. > > > > > Though I also agree with some concerns Alexey stated earlier: it seems > > wasteful to map the entire DDW window just so these bounce buffers can be > > mapped. Especially if you consider the lack of a mapping to be an > > additional > > safe-guard against things like buggy device implementations on the QEMU > > side. E.g. if we leaked pages to the hypervisor on accident, those pages > > wouldn't be immediately accessible to a device, and would still require > > additional work get past the IOMMU. > > Well, an accidental unintented page leak to the hypervisor, is a very > bad thing, regardless of any DMA mapping. The device may not be able to > access it, but the hypervisor still can access it. Agreed, but if IOMMU can provide additional isolation we should make use of it when reasonable. > > > > > What would it look like if we try to make all this work with disable_ddw > > passed > > to kernel command-line (or forced for is_secure_guest())? > > > > 1) dma_iommu_{alloc,map}_bypass() would no longer get us to dma_direct_* > > ops, > > but an additional case or hook that considers is_secure_guest() might > > do > > it. > > > > 2) We'd also need to set up an IOMMU mapping for the bounce buffers via > > io_tlb_start/io_tlb_end. We could do it once, on-demand via > > dma_iommu_bypass_supported() like we do for the 64-bit DDW window, or > > maybe in some init function. > > Hmm... i not sure how to accomplish (2). we need use some DDW window > to setup the mappings. right? If disable_ddw is set, there wont be any > ddw. What am i missing? We have 2 windows, the default 32-bit window that normally covers DMA addresses 0..1GB, and an additional DDW window that's created on demand for 64-bit devices (since they can address a full linear mapping of all guest memory at DMA address 1<<59). 
Your current patch uses the latter, but we could potentially use the 32-bit window since we only need to map the SWIOTLB pages. Not saying that's necessarily better, but one upside is it only requires devices to support 32-bit DMA addressing, which is a larger class of devices than those that support 64-bit (since 64-bit device drivers generally have a 32-bit fallback). Required changes are a bit more pervasive though. It might make sense to get both scenarios working at some point, but maybe it's okay to handle 64-bit first. 64-bit does give us more legroom if we ant
Re: [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.
Quoting Alexey Kardashevskiy (2019-12-11 02:15:44) > > > On 11/12/2019 02:35, Ram Pai wrote: > > On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote: > >> > >> > >> On 10/12/2019 16:12, Ram Pai wrote: > >>> On Tue, Dec 10, 2019 at 02:07:36PM +1100, Alexey Kardashevskiy wrote: > > > On 07/12/2019 12:12, Ram Pai wrote: > > H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of > > its parameters. On secure VMs, hypervisor cannot access the contents of > > this page since it gets encrypted. Hence share the page with the > > hypervisor, and unshare when done. > > > I thought the idea was to use H_PUT_TCE and avoid sharing any extra > pages. There is small problem that when DDW is enabled, > FW_FEATURE_MULTITCE is ignored (easy to fix); I also noticed complains > about the performance on slack but this is caused by initial cleanup of > the default TCE window (which we do not use anyway) and to battle this > we can simply reduce its size by adding > >>> > >>> something that takes hardly any time with H_PUT_TCE_INDIRECT, takes > >>> 13secs per device for H_PUT_TCE approach, during boot. This is with a > >>> 30GB guest. With larger guest, the time will further detoriate. > >> > >> > >> No it will not, I checked. The time is the same for 2GB and 32GB guests- > >> the delay is caused by clearing the small DMA window which is small by > >> the space mapped (1GB) but quite huge in TCEs as it uses 4K pages; and > >> for DDW window + emulated devices the IOMMU page size will be 2M/16M/1G > >> (depends on the system) so the number of TCEs is much smaller. > > > > I cant get your results. What changes did you make to get it? > > > Get what? I passed "-m 2G" and "-m 32G", got the same time - 13s spent > in clearing the default window and the huge window took a fraction of a > second to create and map. Is this if we disable FW_FEATURE_MULTITCE in the guest and force the use of H_PUT_TCE everywhere? 
In theory couldn't we leave FW_FEATURE_MULTITCE in place so that iommu_table_clear() can still use H_STUFF_TCE (which I guess is basically instant), and then force H_PUT_TCE for new mappings via something like:

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 6ba081dd61c9..85d092baf17d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -194,6 +194,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 	unsigned long flags;
 
-	if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
+	if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE) ||
+	    is_secure_guest()) {
 		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
 					   direction, attrs);
 	}

That seems like it would avoid the extra 13s. If we take the additional step of only mapping the SWIOTLB range in enable_ddw() for is_secure_guest() that might further improve things (though the bigger motivation with that is the extra isolation it would grant us for stuff behind the IOMMU, since it apparently doesn't affect boot-time all that much).

> >
> > -global
> > spapr-pci-host-bridge.dma_win_size=0x400
> >>>
> >>> This option, speeds it up tremendously. But than should this option be
> >>> enabled in qemu by default? only for secure VMs? for both VMs?
> >>
> >>
> >> As discussed in slack, by default we do not need to clear the entire TCE
> >> table and we only have to map swiotlb buffer using the small window. It
> >> is a guest kernel change only. Thanks,
> >
> > Can you tell me what code you are talking about here. Where is the TCE
> > table getting cleared? What code needs to be changed to not clear it?
> > > pci_dma_bus_setup_pSeriesLP() > iommu_init_table() > iommu_table_clear() > for () tbl->it_ops->get() > > We do not really need to clear it there, we only need it for VFIO with > IOMMU SPAPR TCE v1 which reuses these tables but there are > iommu_take_ownership/iommu_release_ownership to clear these tables. I'll > send a patch for this. > > > > Is the code in tce_buildmulti_pSeriesLP(), the one that does the clear > > aswell? > > > This one does not need to clear TCEs as this creates a window of known > size and maps it all. > > Well, actually, it only maps actual guest RAM, if there are gaps in RAM, > then TCEs for the gaps will have what hypervisor had there (which is > zeroes, qemu/kvm clears it anyway). > > > > But before I close, you have not told me clearly, what is the problem > > with; 'share the page, make the H_PUT_INDIRECT_TCE hcall, unshare the > > page'. > > Between share and unshare you have a (tiny) window of opportunity to > attack the guest. No, I do not know how exactly. > > For example, the hypervisor does a lot of PHB+PCI hotplug-unplug
Re: [PATCH v5 2/2] powerpc/pseries/iommu: Use dma_iommu_ops for Secure VM.
Quoting Alexey Kardashevskiy (2019-12-11 02:36:29) > > > On 11/12/2019 12:43, Michael Roth wrote: > > Quoting Ram Pai (2019-12-06 19:12:39) > >> Commit edea902c1c1e ("powerpc/pseries/iommu: Don't use dma_iommu_ops on > >> secure guests") > >> disabled dma_iommu_ops path, for secure VMs. Disabling dma_iommu_ops > >> path for secure VMs, helped enable dma_direct path. This enabled > >> support for bounce-buffering through SWIOTLB. However it fails to > >> operate when IOMMU is enabled, since I/O pages are not TCE mapped. > >> > >> Renable dma_iommu_ops path for pseries Secure VMs. It handles all > >> cases including, TCE mapping I/O pages, in the presence of a > >> IOMMU. > > > > Wasn't clear to me at first, but I guess the main gist of this series is > > that we want to continue to use SWIOTLB, but also need to create mappings > > of it's bounce buffers in the IOMMU, so we revert to using dma_iommu_ops > > and rely on the various dma_iommu_{map,alloc}_bypass() hooks throughout > > to call into dma_direct_* ops rather than relying on the dma_is_direct(ops) > > checks in DMA API functions to do the same. > > > Correct. Took me a bit of time to realize what we got here :) We only > rely on dma_iommu_ops::.dma_supported to write the DMA offset to a > device (when creating a huge window), and after that we know it is > mapped directly and swiotlb gets this 1<<59 offset via __phys_to_dma(). > > > > That makes sense, but one issue I see with that is that > > dma_iommu_map_bypass() only tests true if all the following are true: > > > > 1) the device requests a 64-bit DMA mask via > >dma_set_mask/dma_set_coherent_mask > > 2) DDW is enabled (i.e. 
we don't pass disable_ddw on command-line) > > > > dma_is_direct() checks don't have this limitation, so I think for > > anything cases, such as devices that use a smaller DMA mask, we'll > > end up falling back to the non-bypass functions in dma_iommu_ops, which > > will likely break for things like dma_alloc_coherent/dma_map_single > > since they won't use SWIOTLB pages and won't do the necessary calls to > > set_memory_unencrypted() to share those non-SWIOTLB buffers with > > hypervisor. > > > > Maybe that's ok, but I think we should be clearer about how to > > fail/handle these cases. > > > > Though I also agree with some concerns Alexey stated earlier: it seems > > wasteful to map the entire DDW window just so these bounce buffers can be > > mapped. Especially if you consider the lack of a mapping to be an > > additional > > safe-guard against things like buggy device implementations on the QEMU > > side. E.g. if we leaked pages to the hypervisor on accident, those pages > > wouldn't be immediately accessible to a device, and would still require > > additional work get past the IOMMU. > > > > What would it look like if we try to make all this work with disable_ddw > > passed > > to kernel command-line (or forced for is_secure_guest())? > > > > 1) dma_iommu_{alloc,map}_bypass() would no longer get us to dma_direct_* > > ops, > > but an additional case or hook that considers is_secure_guest() might > > do > > it. > > > > 2) We'd also need to set up an IOMMU mapping for the bounce buffers via > > io_tlb_start/io_tlb_end. We could do it once, on-demand via > > dma_iommu_bypass_supported() like we do for the 64-bit DDW window, or > > maybe in some init function. > > > io_tlb_start/io_tlb_end are only guaranteed to stay within 4GB and our > default DMA window is 1GB (KVM) or 2GB (PowerVM), ok, we can define > ARCH_LOW_ADDRESS_LIMIT as 1GB. 
True, and limiting allocations to under 1GB might be brittle (I also saw a patch floating around that increased IO_TLB_DEFAULT_SIZE to 1GB, which obviously wouldn't work out with this approach, but I'm not sure if that's still needed or not: "powerpc/svm: Increase SWIOTLB buffer size").

However, that's only an issue if we insist on using an identity mapping in the IOMMU, which would be nice because non-IOMMU virtio would magically work, but since that's not a goal of this series I think we do have the option of mapping io_tlb_start at DMA address 0 (or thereabouts). We'd probably need to modify __phys_to_dma to treat archdata.dma_offset as a negative offset in this case, but it seems like it would work about the same as with the DDW offset.

But yah, it does make things a bit less appealing than what I was initially thinking with that approach...

> But it has also been mentioned that we are likely to be
Re: [PATCH v5 2/2] powerpc/pseries/iommu: Use dma_iommu_ops for Secure VM.
Quoting Ram Pai (2019-12-06 19:12:39)
> Commit edea902c1c1e ("powerpc/pseries/iommu: Don't use dma_iommu_ops on
> secure guests")
> disabled dma_iommu_ops path, for secure VMs. Disabling dma_iommu_ops
> path for secure VMs, helped enable dma_direct path. This enabled
> support for bounce-buffering through SWIOTLB. However it fails to
> operate when IOMMU is enabled, since I/O pages are not TCE mapped.
>
> Renable dma_iommu_ops path for pseries Secure VMs. It handles all
> cases including, TCE mapping I/O pages, in the presence of a
> IOMMU.

Wasn't clear to me at first, but I guess the main gist of this series is that we want to continue to use SWIOTLB, but also need to create mappings of its bounce buffers in the IOMMU, so we revert to using dma_iommu_ops and rely on the various dma_iommu_{map,alloc}_bypass() hooks throughout to call into dma_direct_* ops rather than relying on the dma_is_direct(ops) checks in DMA API functions to do the same.

That makes sense, but one issue I see with that is that dma_iommu_map_bypass() only tests true if all the following are true:

1) the device requests a 64-bit DMA mask via dma_set_mask/dma_set_coherent_mask

2) DDW is enabled (i.e. we don't pass disable_ddw on the command-line)

dma_is_direct() checks don't have this limitation, so I think in some cases, such as devices that use a smaller DMA mask, we'll end up falling back to the non-bypass functions in dma_iommu_ops, which will likely break for things like dma_alloc_coherent/dma_map_single since they won't use SWIOTLB pages and won't do the necessary calls to set_memory_unencrypted() to share those non-SWIOTLB buffers with the hypervisor.

Maybe that's ok, but I think we should be clearer about how to fail/handle these cases.

Though I also agree with some concerns Alexey stated earlier: it seems wasteful to map the entire DDW window just so these bounce buffers can be mapped.
Especially if you consider the lack of a mapping to be an additional safe-guard against things like buggy device implementations on the QEMU side. E.g. if we leaked pages to the hypervisor on accident, those pages wouldn't be immediately accessible to a device, and would still require additional work to get past the IOMMU.

What would it look like if we try to make all this work with disable_ddw passed to the kernel command-line (or forced for is_secure_guest())?

1) dma_iommu_{alloc,map}_bypass() would no longer get us to dma_direct_* ops, but an additional case or hook that considers is_secure_guest() might do it.

2) We'd also need to set up an IOMMU mapping for the bounce buffers via io_tlb_start/io_tlb_end. We could do it once, on-demand via dma_iommu_bypass_supported() like we do for the 64-bit DDW window, or maybe in some init function.

That also has the benefit of not requiring devices to support 64-bit DMA.

Alternatively, we could continue to rely on the 64-bit DDW window, but modify the call to enable_ddw() to only map the io_tlb_start/end range in the case of is_secure_guest(). This is a little cleaner implementation-wise since we can rely on the existing dma_iommu_{alloc,map}_bypass() hooks, but devices that don't support 64-bit will fall back to not using dma_direct_* ops and fail miserably. We'd probably want to handle that more gracefully. Or we handle both cases gracefully.

To me it makes more sense to enable the non-DDW case, then consider adding the DDW case later if there's some reason why 64-bit DMA is needed. But it would be good to hear if there are other opinions.
> > Signed-off-by: Ram Pai > --- > arch/powerpc/platforms/pseries/iommu.c | 11 +-- > 1 file changed, 1 insertion(+), 10 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/iommu.c > b/arch/powerpc/platforms/pseries/iommu.c > index 67b5009..4e27d66 100644 > --- a/arch/powerpc/platforms/pseries/iommu.c > +++ b/arch/powerpc/platforms/pseries/iommu.c > @@ -36,7 +36,6 @@ > #include > #include > #include > -#include > #include > > #include "pseries.h" > @@ -1346,15 +1345,7 @@ void iommu_init_early_pSeries(void) > of_reconfig_notifier_register(_reconfig_nb); > register_memory_notifier(_mem_nb); > > - /* > -* Secure guest memory is inacessible to devices so regular DMA isn't > -* possible. > -* > -* In that case keep devices' dma_map_ops as NULL so that the generic > -* DMA code path will use SWIOTLB to bounce buffers for DMA. > -*/ > - if (!is_secure_guest()) > - set_pci_dma_ops(_iommu_ops); > + set_pci_dma_ops(_iommu_ops); > } > > static int __init disable_multitce(char *str) > -- > 1.8.3.1 >
Re: [RFC v1 0/2] Enable IOMMU support for pseries Secure VMs
Quoting Michael S. Tsirkin (2019-11-06 12:06:37) > On Wed, Nov 06, 2019 at 12:59:50PM +1100, Alexey Kardashevskiy wrote: > > > > > > On 05/11/2019 08:28, Ram Pai wrote: > > > This patch series enables IOMMU support for pseries Secure VMs. > > > > > > > > > Tested using QEMU command line option: > > > > > > "-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4, > > > iommu_platform=on,disable-modern=off,disable-legacy=on" > > > and > > > > > > "-device virtio-blk-pci,scsi=off,bus=pci.0, > > > addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0, > > > iommu_platform=on,disable-modern=off,disable-legacy=on" > > > > > > Worth mentioning that SLOF won't boot with such devices as SLOF does not > > know about iommu_platform=on. > > Shouldn't be hard to support: set up the iommu to allow everything > and ack the feature. Right? It's not a static/linear mapping in this case so we need calls to map DMA buffers as-needed. I've gotten it to boot with virtio-blk, but the patches have some hacks and need cleanup, hoping to post them soon. I'm a bit perplexed how we would manage to boot a guest without those changes though, this is what I get with qemu 4.1.0: qemu-system-ppc64 -M pseries,ic-mode=xics -m 512M -bios /home/mdroth/w/build/qemu-4.1.0-build/pc-bios/slof.bin -device virtio-blk-pci,drive=drive0,id=blk0,disable-modern=off,disable-legacy=on,iommu_platform=on -drive file=/home/mdroth/vm/bionic-server-cloudimg-ppc64el.img,if=none,id=drive0 -trace enable=spapr_iommu\*,file=trace.out -monitor unix:/tmp/mon.sock,server,nowait -vga none -nographic qemu-system-ppc64: warning: TCG doesn't support requested feature, cap-cfpc=workaround qemu-system-ppc64: warning: TCG doesn't support requested feature, cap-sbbc=workaround qemu-system-ppc64: warning: TCG doesn't support requested feature, cap-ibs=workaround SLOF ** QEMU Starting Build Date = Jul 3 2019 12:26:14 FW Version = git-ba1ab360eebe6338 Press "s" to enter Open Firmware. 
  Populating /vdevice methods
  Populating /vdevice/vty@7100
  Populating /vdevice/nvram@7101
  Populating /vdevice/l-lan@7102
  Populating /vdevice/v-scsi@7103
  SCSI: Looking for devices
     8200 CD-ROM : "QEMU QEMU CD-ROM 2.5+"
  Populating /pci@8002000
     00 (D) : 1af4 1042 virtio [ block ]
  No NVRAM common partition, re-initializing...
  Scanning USB
  Using default console: /vdevice/vty@7100

  Welcome to Open Firmware

  Copyright (c) 2004, 2017 IBM Corporation All rights reserved.
  This program and the accompanying materials are made available
  under the terms of the BSD License available at
  http://www.opensource.org/licenses/bsd-license.php

  Trying to load: from: /pci@8002000/scsi@0 ...
  virtioblk_init: failed
  virtioblk_transfer: Access beyond end of device!

And then it hangs. This is with TCG so maybe it behaves differently with KVM, but that's the result I would expect with the current SLOF code.

> >
> > Ram Pai (2):
> >   powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.
> >   powerpc/pseries/iommu: Use dma_iommu_ops for Secure VMs aswell.
> >
> >  arch/powerpc/platforms/pseries/iommu.c | 30 ++
> >  1 file changed, 18 insertions(+), 12 deletions(-)
> >
> --
> Alexey
[PATCH v2] KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag
c0002073fb0ca900 R27 = c19e2464 R12 = c0050790 R28 = c00812b0 R13 = c000207fff623e00 R29 = c0002073fb0ca808 R14 = 7d1bbee0 R30 = c0002073fb0ca800 R15 = 7d1bcd60 R31 = 0800 pc = c020b260 smp_call_function_many+0x3d0/0x460 cfar= c020b270 smp_call_function_many+0x3e0/0x460 lr = c020b20c smp_call_function_many+0x37c/0x460 msr = 90010288b033 cr = 44024824 ctr = c0050790 xer = trap = 100 CPU 42 is running normally, doing VCPU work: # CPU 42 stack trace (via xmon) [link register ] c0081be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv] [c08ed3343820] c08ed3343850 (unreliable) [c08ed33438d0] c0081be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv] [c08ed33439d0] c0081be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv] [c08ed3343ae0] c0081c10891c kvmppc_vcpu_run+0x34/0x48 [kvm] [c08ed3343b00] c0081c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm] [c08ed3343b90] c0081c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm] [c08ed3343d00] c0475450 do_vfs_ioctl+0xe0/0xc70 [c08ed3343db0] c04760e4 ksys_ioctl+0x104/0x120 [c08ed3343e00] c0476128 sys_ioctl+0x28/0x80 [c08ed3343e20] c000b388 system_call+0x5c/0x70 --- Exception: c00 (System Call) at 7d545cfd7694 SP (7d53ff7edf50) is in userspace It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION] was set for CPU 42 by at least 1 of the CPUs waiting in smp_call_function_many(), but somehow the corresponding call_single_queue entries were never processed by CPU 42, causing the callers to spin in csd_lock_wait() indefinitely. 
Nick Piggin suggested something similar to the following sequence as a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE IPI messages seems to be most common, but any mix of CALL_FUNCTION and !CALL_FUNCTION messages could trigger it):

  CPU X: smp_muxed_ipi_set_message():
  X:   smp_mb()
  X:   message[RESCHEDULE] = 1
  X: doorbell_global_ipi(42):
  X:   kvmppc_set_host_ipi(42, 1)
  X:   ppc_msgsnd_sync()/smp_mb()
  X:   ppc_msgsnd() -> 42
  42: doorbell_exception(): // from CPU X
  42:   ppc_msgsync()
  105: smp_muxed_ipi_set_message():
  105:   smp_mb()
           // STORE DEFERRED DUE TO RE-ORDERING
  --105:   message[CALL_FUNCTION] = 1
  | 105: doorbell_global_ipi(42):
  | 105:   kvmppc_set_host_ipi(42, 1)
  | 42:   kvmppc_set_host_ipi(42, 0)
  | 42: smp_ipi_demux_relaxed()
  | 42: // returns to executing guest
  |        // RE-ORDERED STORE COMPLETES
  ->105:   message[CALL_FUNCTION] = 1
  105:   ppc_msgsnd_sync()/smp_mb()
  105:   ppc_msgsnd() -> 42
  42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
  105: // hangs waiting on 42 to process messages/call_single_queue

This can be prevented with an smp_mb() at the beginning of kvmppc_set_host_ipi(), such that stores to message[] (or other state indicated by the host_ipi flag) are ordered vs. the store to host_ipi.
However, doing so might still allow for the following scenario (not yet observed):

  CPU X: smp_muxed_ipi_set_message():
  X:   smp_mb()
  X:   message[RESCHEDULE] = 1
  X: doorbell_global_ipi(42):
  X:   kvmppc_set_host_ipi(42, 1)
  X:   ppc_msgsnd_sync()/smp_mb()
  X:   ppc_msgsnd() -> 42
  42: doorbell_exception(): // from CPU X
  42:   ppc_msgsync()
         // STORE DEFERRED DUE TO RE-ORDERING
  -- 42:   kvmppc_set_host_ipi(42, 0)
  | 42: smp_ipi_demux_relaxed()
  | 105: smp_muxed_ipi_set_message():
  | 105:   smp_mb()
  | 105:   message[CALL_FUNCTION] = 1
  | 105: doorbell_global_ipi(42):
  | 105:   kvmppc_set_host_ipi(42, 1)
  |        // RE-ORDERED STORE COMPLETES
  -> 42:   kvmppc_set_host_ipi(42, 0)
  42: // returns to executing guest
  105:   ppc_msgsnd_sync()/smp_mb()
  105:   ppc_msgsnd() -> 42
  42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
  105: // hangs waiting on 42 to process messages/call_single_queue

Fixing this scenario would require an smp_mb() *after* clearing the host_ipi flag in kvmppc_set_host_ipi() to order the store vs. subsequent processing of IPI messages.

To handle both cases, this patch splits kvmppc_set_host_ipi() into separate set/clear functions, where we execute smp_mb() prior to setting the host_ipi flag, and after clearing the host_ipi flag. These functions pair with each other to synchronize the sender and receiver sides.

With that change in place the above workload ran for 20 hours without triggering any lock-ups.

Fixes: 755563bc79c7 ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0
Cc: Michael Ellerman
Cc: Paul Mackerras
Cc: Nicholas Piggin
Cc: kvm-...@vger.kernel.org
Signed-off-by: Michael Roth
---
v2:
 - changelog: don't reference CPUs in hex
 - changelog: make pseudocode clearer
 - ch
Re: [PATCH] KVM: PPC: Book3S HV: add smp_mb() in kvmppc_set_host_ipi()
Quoting Michael Roth (2019-09-05 18:21:22)
> Quoting Michael Ellerman (2019-09-04 22:04:48)
> > That raises the question of whether this needs to be a full barrier or
> > just a write barrier, and where is the matching barrier on the reading
> > side?
>
> For this particular case I think the same barrier orders it on the
> read-side via kvmppc_set_host_ipi(42, 0) above, but I'm not sure that
> work as a general solution, unless maybe we make that sort of usage
> (clear-before-processing) part of the protocol of using
> kvmppc_set_host_ipi()... it makes sense given we already need to take
> care to not miss clearing them else we get issues like what was fixed
> in 755563bc79c7, which introduced the clear in doorbell_exception(). So
> then it's a matter of additionally making sure we do it prior to
> processing host_ipi state. I haven't looked too closely at the other
> users of kvmppc_set_host_ipi() yet though.
>
> As far as using rw barriers, I can't think of any reason we couldn't, but
> I wouldn't say I'm at all confident in declaring that safe atm...

I think we need a full barrier after all. The following seems possible otherwise:

  CPU X: smp_mb()
  X: ipi_message[RESCHEDULE] = 1
  X: kvmppc_set_host_ipi(42, 1)
  X: smp_mb()
  X: doorbell/msgsnd -> 42
  42: doorbell_exception() (from CPU X)
  42: msgsync
  42: kvmppc_set_host_ipi(42, 0) // STORE DEFERRED DUE TO RE-ORDERING
  42: smp_ipi_demux_relaxed()
  105: smp_mb()
  105: ipi_message[CALL_FUNCTION] = 1
  105: smp_mb()
  105: kvmppc_set_host_ipi(42, 1)
  42: kvmppc_set_host_ipi(42, 0) // RE-ORDERED STORE COMPLETES
  42: // returns to executing guest
  105: doorbell/msgsnd -> 42
  42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
  105: // hangs waiting on 42 to process messages/call_single_queue

However that also means the current patch is insufficient, since the barrier for preventing this scenario needs to come *after* setting paca_ptrs[cpu]->kvm_hstate.host_ipi to 0.
So I think the right interface for this is to split kvmppc_set_host_ipi()
out into:

  static inline void kvmppc_set_host_ipi(int cpu)
  {
          smp_mb();
          paca_ptrs[cpu]->kvm_hstate.host_ipi = 1;
  }

  static inline void kvmppc_clear_host_ipi(int cpu)
  {
          paca_ptrs[cpu]->kvm_hstate.host_ipi = 0;
          smp_mb();
  }
Re: [PATCH] KVM: PPC: Book3S HV: add smp_mb() in kvmppc_set_host_ipi()
Quoting Michael Ellerman (2019-09-04 22:04:48) > Hi Mike, > > Thanks for the patch & great change log, just a few comments. Hi Michael, Thank you for the suggestions. I will roll them into v2 unless otherwise noted below. > > Michael Roth writes: > > On a 2-socket Witherspoon system with 128 cores and 1TB of memory >^ >Power9 (not everyone knows what a Witherspoon is) > > > running the following guest configs: > > > > guest A: > > - 224GB of memory > > - 56 VCPUs (sockets=1,cores=28,threads=2), where: > > VCPUs 0-1 are pinned to CPUs 0-3, > > VCPUs 2-3 are pinned to CPUs 4-7, > > ... > > VCPUs 54-55 are pinned to CPUs 108-111 > > > > guest B: > > - 4GB of memory > > - 4 VCPUs (sockets=1,cores=4,threads=1) > > > > with the following workloads (with KSM and THP enabled in all): > > > > guest A: > > stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M > > > > guest B: > > stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M > > > > host: > > stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M > > > > the below soft-lockup traces were observed after an hour or so and > > persisted until the host was reset (this was found to be reliably > > reproducible for this configuration, for kernels 4.15, 4.18, 5.0, > > and 5.3-rc5): > > > > [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU > > [ 1253.183319] rcu: 124-: (5250 ticks this GP) > > idle=10a/1/0x4002 softirq=5408/5408 fqs=1941 > > [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU > > 52/KVM:19709] > > [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! > > [worker:19913] > > [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! > > [worker:20331] > > [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! > > [worker:20338] > > [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! > > [avocado:19525] > > [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! 
> > [ksmd:791] > > [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU > > [ 1316.198032] rcu: 124-: (21003 ticks this GP) > > idle=10a/1/0x4002 softirq=5408/5408 fqs=8243 > > [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! > > [ksmd:791] > > [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU > > [ 1379.212629] rcu: 124-: (36756 ticks this GP) > > idle=10a/1/0x4002 softirq=5408/5408 fqs=14714 > > [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! > > [ksmd:791] > > [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU > > [ 1442.227115] rcu: 124-: (52509 ticks this GP) > > idle=10a/1/0x4002 softirq=5408/5408 fqs=21403 > > [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds. > > [ 1455.111822] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1 > > [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables this message. > > [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds. > > [ 1455.111905] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1 > > [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables this message. > > [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds. > > [ 1455.111986] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1 > > [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables this message. > > [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds. > > [ 1455.112068] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1 > > [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables this message. > > [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds. > > [ 1455.112159] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1 > > [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables this message. > > [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds. 
> > [ 1455.112231] Tainted: G L5.3.0-rc5-mdr-vanilla+ #1 > > [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > &
[PATCH] KVM: PPC: Book3S HV: add smp_mb() in kvmppc_set_host_ipi()
02073fb0ca900 R27 = c19e2464 R12 = c0050790 R28 = c00812b0 R13 = c000207fff623e00 R29 = c0002073fb0ca808 R14 = 7d1bbee0 R30 = c0002073fb0ca800 R15 = 7d1bcd60 R31 = 0800 pc = c020b260 smp_call_function_many+0x3d0/0x460 cfar= c020b270 smp_call_function_many+0x3e0/0x460 lr = c020b20c smp_call_function_many+0x37c/0x460 msr = 90010288b033 cr = 44024824 ctr = c0050790 xer = trap = 100 CPU 42 is running normally, doing VCPU work: 1f:mon> c2a [link register ] c0081be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv] [c08ed3343820] c08ed3343850 (unreliable) [c08ed33438d0] c0081be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv] [c08ed33439d0] c0081be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv] [c08ed3343ae0] c0081c10891c kvmppc_vcpu_run+0x34/0x48 [kvm] [c08ed3343b00] c0081c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm] [c08ed3343b90] c0081c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm] [c08ed3343d00] c0475450 do_vfs_ioctl+0xe0/0xc70 [c08ed3343db0] c04760e4 ksys_ioctl+0x104/0x120 [c08ed3343e00] c0476128 sys_ioctl+0x28/0x80 [c08ed3343e20] c000b388 system_call+0x5c/0x70 --- Exception: c00 (System Call) at 7d545cfd7694 SP (7d53ff7edf50) is in userspace It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION] was set for CPU 42 by at least 1 of the CPUs waiting in smp_call_function_many(), but somehow the corresponding call_single_queue entries were never processed, causing the callers to spin in csd_lock_wait() indefinitely. 
Nick Piggin suggested something similar to the following sequence as a
possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE IPI
messages seems to be most common, but any mix of CALL_FUNCTION and
!CALL_FUNCTION messages could trigger it):

  CPU X: smp_mb()
  X: ipi_message[RESCHEDULE] = 1
  X: kvmppc_set_host_ipi(42, 1)
  X: smp_mb()
  X: doorbell/msgsnd -> 42
  42: doorbell_exception() (from CPU X)
  42: msgsync
  105: smp_mb()
  105: kvmppc_set_host_ipi(42, 1)
  42: kvmppc_set_host_ipi(42, 0)
  42: smp_ipi_demux_relaxed()
  42: // returns to executing guest
  105: ipi_message[CALL_FUNCTION] = 1 // due to re-ordering
  105: smp_mb()
  105: doorbell/msgsnd -> 42
  42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
  105: // hangs waiting on 42 to process messages/call_single_queue

This patch avoids that scenario by placing a barrier at the start of
kvmppc_set_host_ipi() such that the storing of these messages (or other
state indicated by host_ipi being set) cannot be re-ordered past it.
With this the above workload ran for 6 hours (so far) without triggering
any lock-ups.

Cc: Paul Mackerras
Cc: Nicholas Piggin
Cc: kvm-...@vger.kernel.org
Signed-off-by: Michael Roth
---
 arch/powerpc/include/asm/kvm_ppc.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 2484e6a8f5ca..254abad0f55e 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -454,6 +454,7 @@ static inline u32 kvmppc_get_xics_latch(void)

 static inline void kvmppc_set_host_ipi(int cpu, u8 host_ipi)
 {
+	smp_mb();
 	paca_ptrs[cpu]->kvm_hstate.host_ipi = host_ipi;
 }
--
2.17.1
Re: [PATCH] bpf: fix overflow of bpf_jit_limit when PAGE_SIZE >= 64K
Quoting Daniel Borkmann (2018-12-10 08:26:31) > On 12/07/2018 04:36 PM, Michael Roth wrote: > > Quoting Michael Ellerman (2018-12-07 06:31:13) > >> Michael Roth writes: > >> > >>> Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF > >>> JIT allocations. At compile time it defaults to PAGE_SIZE * 4, > >>> and is adjusted again at init time if MODULES_VADDR is defined. > >>> > >>> For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with > >> > >> But maybe it should be, I don't know why we don't define it. > >> > >>> the compile-time default at boot-time, which is 0x9c40 when > >>> using 64K page size. This overflows the signed 32-bit bpf_jit_limit > >>> value: > >>> > >>> root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit > >>> -1673527296 > >>> > >>> and can cause various unexpected failures throughout the network > >>> stack. In one case `strace dhclient eth0` reported: > >>> > >>> setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, > >>> filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524) > >>> > >>> and similar failures can be seen with tools like tcpdump. This doesn't > >>> always reproduce however, and I'm not sure why. The more consistent > >>> failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host > >>> would time out on systemd/netplan configuring a virtio-net NIC with no > >>> noticeable errors in the logs. > >>> > >>> Fix this by limiting the compile-time default for bpf_jit_limit to > >>> INT_MAX. > >> > >> INT_MAX is a lot more than (4k * 4), so I guess I'm not clear on > >> whether we should be using PAGE_SIZE here at all. I guess each BPF > >> program uses at least one page is the thinking? > > > > That seems to be the case, at least, the max number of minimum-sized > > allocations would be less on ppc64 since the allocations are always at > > least PAGE_SIZE in size. The init-time default also limits to INT_MAX, > > so it seemed consistent to do that here too. 
> > > >> > >> Thanks for tracking this down. For some reason none of my ~10 test boxes > >> have hit this, perhaps I don't have new enough userspace? > > > > I'm not too sure, I would've thought things like the dhclient case in > > the commit log would fail every time, but sometimes I need to reboot the > > guest before I start seeing the behavior. Maybe there's something special > > about when JIT allocations are actually done that can affect > > reproducibility? > > > > In my case at least the virtio-net networking timeout was consistent > > enough for a bisect, but maybe it depends on the specific network > > configuration (single NIC, basic DHCP through netplan/systemd in my case). > > > >> > >> You don't mention why you needed to add BPF_MIN(), I assume because the > >> kernel version of min() has gotten too complicated to work here? > > > > I wasn't sure if it was safe here or not, so I tried looking at other > > users and came across: > > > > mm/vmalloc.c:777:#define VMAP_MIN(x, y) ((x) < (y) ? (x) : > > (y)) /* can't use min() */ > > > > I'm not sure what the reasoning was (or whether it still applies), but I > > figured it was safer to do the same here. Maybe Nick still recalls? > > > >> > >> Daniel I assume you'll merge this via your tree? > >> > >> cheers > >> > >>> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv > >>> allocations") > >>> Cc: linuxppc-...@ozlabs.org > >>> Cc: Daniel Borkmann > >>> Cc: Sandipan Das > >>> Cc: Alexei Starovoitov > >>> Signed-off-by: Michael Roth > > Thanks for the reports / fixes and sorry for my late reply (bit too > swamped last week), some more thoughts below. 
> > >>> kernel/bpf/core.c | 3 ++- > >>> 1 file changed, 2 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > >>> index b1a3545d0ec8..55de4746cdfd 100644 > >>> --- a/kernel/bpf/core.c > >>> +++ b/kernel/bpf/core.c > >>> @@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp) > >>> } > >>> > >>> #ifdef CONFIG_BPF_JIT > >>> -# define BPF_JIT_L
Re: [PATCH] bpf: fix overflow of bpf_jit_limit when PAGE_SIZE >= 64K
Quoting Michael Ellerman (2018-12-07 06:31:13) > Michael Roth writes: > > > Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF > > JIT allocations. At compile time it defaults to PAGE_SIZE * 4, > > and is adjusted again at init time if MODULES_VADDR is defined. > > > > For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with > > But maybe it should be, I don't know why we don't define it. > > > the compile-time default at boot-time, which is 0x9c40 when > > using 64K page size. This overflows the signed 32-bit bpf_jit_limit > > value: > > > > root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit > > -1673527296 > > > > and can cause various unexpected failures throughout the network > > stack. In one case `strace dhclient eth0` reported: > > > > setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8}, > > 16) = -1 ENOTSUPP (Unknown error 524) > > > > and similar failures can be seen with tools like tcpdump. This doesn't > > always reproduce however, and I'm not sure why. The more consistent > > failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host > > would time out on systemd/netplan configuring a virtio-net NIC with no > > noticeable errors in the logs. > > > > Fix this by limiting the compile-time default for bpf_jit_limit to > > INT_MAX. > > INT_MAX is a lot more than (4k * 4), so I guess I'm not clear on > whether we should be using PAGE_SIZE here at all. I guess each BPF > program uses at least one page is the thinking? That seems to be the case, at least, the max number of minimum-sized allocations would be less on ppc64 since the allocations are always at least PAGE_SIZE in size. The init-time default also limits to INT_MAX, so it seemed consistent to do that here too. > > Thanks for tracking this down. For some reason none of my ~10 test boxes > have hit this, perhaps I don't have new enough userspace? 
I'm not too sure, I would've thought things like the dhclient case in the commit log would fail every time, but sometimes I need to reboot the guest before I start seeing the behavior. Maybe there's something special about when JIT allocations are actually done that can affect reproducibility? In my case at least the virtio-net networking timeout was consistent enough for a bisect, but maybe it depends on the specific network configuration (single NIC, basic DHCP through netplan/systemd in my case). > > You don't mention why you needed to add BPF_MIN(), I assume because the > kernel version of min() has gotten too complicated to work here? I wasn't sure if it was safe here or not, so I tried looking at other users and came across: mm/vmalloc.c:777:#define VMAP_MIN(x, y) ((x) < (y) ? (x) : (y)) /* can't use min() */ I'm not sure what the reasoning was (or whether it still applies), but I figured it was safer to do the same here. Maybe Nick still recalls? > > Daniel I assume you'll merge this via your tree? > > cheers > > > Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv > > allocations") > > Cc: linuxppc-...@ozlabs.org > > Cc: Daniel Borkmann > > Cc: Sandipan Das > > Cc: Alexei Starovoitov > > Signed-off-by: Michael Roth > > --- > > kernel/bpf/core.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > > index b1a3545d0ec8..55de4746cdfd 100644 > > --- a/kernel/bpf/core.c > > +++ b/kernel/bpf/core.c > > @@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp) > > } > > > > #ifdef CONFIG_BPF_JIT > > -# define BPF_JIT_LIMIT_DEFAULT (PAGE_SIZE * 4) > > +# define BPF_MIN(x, y) ((x) < (y) ? (x) : (y)) > > +# define BPF_JIT_LIMIT_DEFAULT BPF_MIN((PAGE_SIZE * 4), INT_MAX) > > > > /* All BPF JIT sysctl knobs here. */ > > int bpf_jit_enable __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON); > > -- > > 2.17.1 >
[PATCH] bpf: fix overflow of bpf_jit_limit when PAGE_SIZE >= 64K
Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
JIT allocations. At compile time it defaults to PAGE_SIZE * 4,
and is adjusted again at init time if MODULES_VADDR is defined.

For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
the compile-time default at boot-time, which is 0x9c40 when
using 64K page size. This overflows the signed 32-bit bpf_jit_limit
value:

  root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
  -1673527296

and can cause various unexpected failures throughout the network
stack. In one case `strace dhclient eth0` reported:

  setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11,
             filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524)

and similar failures can be seen with tools like tcpdump. This doesn't
always reproduce however, and I'm not sure why. The more consistent
failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host
would time out on systemd/netplan configuring a virtio-net NIC with no
noticeable errors in the logs.

Fix this by limiting the compile-time default for bpf_jit_limit to
INT_MAX.

Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
Cc: linuxppc-...@ozlabs.org
Cc: Daniel Borkmann
Cc: Sandipan Das
Cc: Alexei Starovoitov
Signed-off-by: Michael Roth
---
 kernel/bpf/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index b1a3545d0ec8..55de4746cdfd 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
 }

 #ifdef CONFIG_BPF_JIT
-# define BPF_JIT_LIMIT_DEFAULT	(PAGE_SIZE * 4)
+# define BPF_MIN(x, y)		((x) < (y) ? (x) : (y))
+# define BPF_JIT_LIMIT_DEFAULT	BPF_MIN((PAGE_SIZE * 4), INT_MAX)

 /* All BPF JIT sysctl knobs here. */
 int bpf_jit_enable __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
--
2.17.1
Re: [PATCH bpf] bpf: fix default unprivileged allocation limit
Quoting Sandipan Das (2018-12-06 03:27:32)
> When using a large page size, the default value of the bpf_jit_limit
> knob becomes invalid and users are not able to run unprivileged bpf
> programs.
>
> The bpf_jit_limit knob is represented internally as a 32-bit signed
> integer because of which the default value, i.e. PAGE_SIZE * 4,
> overflows in case of an architecture like powerpc64 which uses 64K
> as the default page size (i.e. CONFIG_PPC_64K_PAGES is set).
>
> So, instead of depending on the page size, use a constant value.
>
> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")

This also consistently caused a virtio-net KVM Ubuntu 18.04 guest to
time out on configuring networking during boot via systemd/netplan. A
bisect pointed to the same commit this patch addresses.

> Signed-off-by: Sandipan Das
> ---
>  kernel/bpf/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index b1a3545d0ec8..a81d097a17fb 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -365,7 +365,7 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
>  }
>
>  #ifdef CONFIG_BPF_JIT
> -# define BPF_JIT_LIMIT_DEFAULT	(PAGE_SIZE * 4)
> +# define BPF_JIT_LIMIT_DEFAULT	(4096 * 4)

This isn't quite right, as we still use (bpf_jit_limit >> PAGE_SHIFT) to
check allocations in bpf_jit_charge_modmem(), so that should be fixed up
as well.

Another alternative is to clamp BPF_JIT_LIMIT_DEFAULT to INT_MAX, which
fixes the issue for me and is similar to what bpf_jit_charge_init() does
for kernels where MODULES_VADDR is defined. I'll go ahead and send the
patch in case that seems preferable.

>
>  /* All BPF JIT sysctl knobs here. */
>  int bpf_jit_enable __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
> --
> 2.19.2
>
[PATCH] KVM: PPC: Book3S HV: fix handling for interrupted H_ENTER_NESTED
Gibson
Cc: Paul Mackerras
Cc: Suraj Jitindar Singh
Signed-off-by: Michael Roth
---
 arch/powerpc/kvm/book3s_hv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index d65b961661fb..a56f8413758a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -983,6 +983,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 		ret = kvmhv_enter_nested_guest(vcpu);
 		if (ret == H_INTERRUPT) {
 			kvmppc_set_gpr(vcpu, 3, 0);
+			vcpu->arch.hcall_needed = 0;
 			return -EINTR;
 		}
 		break;
--
2.17.1
[PATCH] powerpc/pseries: advertise Hot Plug Event support to firmware
With the inclusion of:

  powerpc/pseries: Implement indexed-count hotplug memory remove
  powerpc/pseries: Implement indexed-count hotplug memory add

we now have complete handling of the RTAS hotplug event format as
described by PAPR via ACR "PAPR Changes for Hotplug RTAS Events".

This capability is indicated by byte 6, bit 5 of architecture option
vector 5, and allows for greater control over cpu/memory/pci hot
plug/unplug operations.

Existing pseries kernels will utilize this capability based on the
existence of the /event-sources/hot-plug-events DT property, so we only
need to advertise it via CAS and do not need a corresponding
FW_FEATURE_* value to test for.

Cc: Michael Ellerman <m...@ellerman.id.au>
Cc: Nathan Fontenot <nf...@linux.vnet.ibm.com>
Cc: David Gibson <da...@gibson.dropbear.id.au>
Signed-off-by: Michael Roth <mdr...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/prom.h | 1 +
 arch/powerpc/kernel/prom_init.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 2c8001c..4a90634 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -153,6 +153,7 @@ struct of_drconf_cell {
 #define OV5_XCMO		0x0440	/* Page Coalescing */
 #define OV5_TYPE1_AFFINITY	0x0580	/* Type 1 NUMA affinity */
 #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
+#define OV5_HP_EVT		0x0604	/* Hot Plug Event support */
 #define OV5_RESIZE_HPT		0x0601	/* Hash Page Table resizing */
 #define OV5_PFO_HW_RNG		0x1180	/* PFO Random Number Generator */
 #define OV5_PFO_HW_842		0x1140	/* PFO Compression Accelerator */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index f3c8799..1a835e7 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -839,7 +839,7 @@ struct ibm_arch_vec __cacheline_aligned ibm_architecture_vec = {
 	0,
 #endif
 	.associativity = OV5_FEAT(OV5_TYPE1_AFFINITY) | OV5_FEAT(OV5_PRRN),
-	.bin_opts = OV5_FEAT(OV5_RESIZE_HPT),
+	.bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
 	.micro_checkpoint = 0,
 	.reserved0 = 0,
 	.max_cpus = cpu_to_be32(NR_CPUS),	/* number of cores supported */
--
2.7.4
Re: [PATCH v6 3/3] powerpc/pseries: Implement indexed-count hotplug memory remove
Quoting Nathan Fontenot (2016-10-18 12:21:06) > From: Sahil Mehta> > Indexed-count remove for memory hotplug guarantees that a contiguous block > of lmbs beginning at a specified will be unassigned (NOT > that lmbs will be removed). Because of Qemu's per-DIMM memory > management, the removal of a contiguous block of memory currently > requires a series of individual calls. Indexed-count remove reduces > this series into a single call. > > Signed-off-by: Sahil Mehta > Signed-off-by: Nathan Fontenot > --- > v2: -use u32s drc_index and count instead of u32 ic[] > in dlpar_memory > v3: -add logic to handle invalid drc_index input > v4: -none > v5: -Update for() loop to start at start_index > v6: -none > > arch/powerpc/platforms/pseries/hotplug-memory.c | 90 > +++ > 1 file changed, 90 insertions(+) > > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c > b/arch/powerpc/platforms/pseries/hotplug-memory.c > index badc66d..19ad081 100644 > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c > @@ -558,6 +558,92 @@ static int dlpar_memory_remove_by_index(u32 drc_index, > struct property *prop) > return rc; > } > > +static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index, > +struct property *prop) > +{ > + struct of_drconf_cell *lmbs; > + u32 num_lmbs, *p; > + int i, rc, start_lmb_found; > + int lmbs_available = 0, start_index = 0, end_index; > + > + pr_info("Attempting to hot-remove %u LMB(s) at %x\n", > + lmbs_to_remove, drc_index); > + > + if (lmbs_to_remove == 0) > + return -EINVAL; > + > + p = prop->value; > + num_lmbs = *p++; > + lmbs = (struct of_drconf_cell *)p; > + start_lmb_found = 0; > + > + /* Navigate to drc_index */ > + while (start_index < num_lmbs) { > + if (lmbs[start_index].drc_index == drc_index) { > + start_lmb_found = 1; > + break; > + } > + > + start_index++; > + } > + > + if (!start_lmb_found) > + return -EINVAL; > + > + end_index = start_index + lmbs_to_remove; > + > + 
> +	/* Validate that there are enough LMBs to satisfy the request */
> +	for (i = start_index; i < end_index; i++) {
> +		if (lmbs[i].flags & DRCONF_MEM_RESERVED)
> +			break;
> +
> +		lmbs_available++;
> +	}
> +
> +	if (lmbs_available < lmbs_to_remove)
> +		return -EINVAL;
> +
> +	for (i = start_index; i < end_index; i++) {
> +		if (!(lmbs[i].flags & DRCONF_MEM_ASSIGNED))
> +			continue;
> +
> +		rc = dlpar_remove_lmb(&lmbs[i]);

dlpar_remove_lmb() currently does both offlining of the memory as well
as releasing the LMB back to the platform, but the specification for
hotplug notifications has the following verbiage regarding
indexed-count/count identifiers:

  'When using “drc count” or “drc count indexed” as the Hotplug
  Identifier, the OS should take steps to verify the entirety of the
  request can be satisfied before proceeding with the hotplug / unplug
  operations. If only a partial count can be satisfied, the OS should
  ignore the entirety of the request. If the OS cannot determine this
  beforehand, it should satisfy the hotplug / unplug request for as
  many of the requested resources as possible, and attempt to revert
  to the original OS / DRC state.'

So doing the dlpar_remove->dlpar_add in case of failure is in line with
the spec, but it should only be done as a last resort. To me that
suggests that we should be attempting to offline all the LMBs
beforehand, and only after that's successful should we begin attempting
to release LMBs back to the platform. Should we consider introducing
that logic in the patchset? Or maybe as a follow-up?
> + if (rc) > + break; > + > + lmbs[i].reserved = 1; > + } > + > + if (rc) { > + pr_err("Memory indexed-count-remove failed, adding any > removed LMBs\n"); > + > + for (i = start_index; i < end_index; i++) { > + if (!lmbs[i].reserved) > + continue; > + > + rc = dlpar_add_lmb([i]); > + if (rc) > + pr_err("Failed to add LMB, drc index %x\n", > + be32_to_cpu(lmbs[i].drc_index)); > + > + lmbs[i].reserved = 0; > + } > + rc = -EINVAL; > + } else { > + for (i = start_index; i < end_index; i++) { > + if (!lmbs[i].reserved) > +
Re: [PATCH v4 2/2] powerpc/pseries: Implement indexed-count hotplug memory remove
Quoting Sahil Mehta (2016-08-01 12:23:16) > Indexed-count remove for memory hotplug guarantees that a contiguous block > of lmbs beginning at a specified will be unassigned (NOT > that lmbs will be removed). Because of Qemu's per-DIMM memory > management, the removal of a contiguous block of memory currently > requires a series of individual calls. Indexed-count remove reduces > this series into a single call. > > Signed-off-by: Sahil Mehta> --- > v2: -use u32s drc_index and count instead of u32 ic[] > in dlpar_memory > v3: -add logic to handle invalid drc_index input > v4: -none > > arch/powerpc/platforms/pseries/hotplug-memory.c | 90 > +++ > 1 file changed, 90 insertions(+) > > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c > b/arch/powerpc/platforms/pseries/hotplug-memory.c > index 2d4ceb3..dd5eb38 100644 > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c > @@ -503,6 +503,92 @@ static int dlpar_memory_remove_by_index(u32 drc_index, > struct property *prop) > return rc; > } > > +static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index, > +struct property *prop) > +{ > + struct of_drconf_cell *lmbs; > + u32 num_lmbs, *p; > + int i, rc, start_lmb_found; > + int lmbs_available = 0, start_index = 0, end_index; > + > + pr_info("Attempting to hot-remove %u LMB(s) at %x\n", > + lmbs_to_remove, drc_index); > + > + if (lmbs_to_remove == 0) > + return -EINVAL; > + > + p = prop->value; > + num_lmbs = *p++; > + lmbs = (struct of_drconf_cell *)p; > + start_lmb_found = 0; > + > + /* Navigate to drc_index */ > + while (start_index < num_lmbs) { > + if (lmbs[start_index].drc_index == drc_index) { > + start_lmb_found = 1; > + break; > + } > + > + start_index++; > + } > + > + if (!start_lmb_found) > + return -EINVAL; > + > + end_index = start_index + lmbs_to_remove; > + > + /* Validate that there are enough LMBs to satisfy the request */ > + for (i = start_index; i < end_index; i++) { > + if 
(lmbs[i].flags & DRCONF_MEM_RESERVED) > + break; > + > + lmbs_available++; > + } > + > + if (lmbs_available < lmbs_to_remove) > + return -EINVAL; > + > + for (i = 0; i < end_index; i++) { Shouldn't this be i = start_index? Otherwise it seems we'd attempt to satisfy the request using LMBs outside of the requested range. > + if (!(lmbs[i].flags & DRCONF_MEM_ASSIGNED)) > + continue; > + > + rc = dlpar_remove_lmb([i]); > + if (rc) > + break; > + > + lmbs[i].reserved = 1; > + } > + > + if (rc) { > + pr_err("Memory indexed-count-remove failed, adding any > removed LMBs\n"); > + > + for (i = start_index; i < end_index; i++) { > + if (!lmbs[i].reserved) > + continue; > + > + rc = dlpar_add_lmb([i]); > + if (rc) > + pr_err("Failed to add LMB, drc index %x\n", > + be32_to_cpu(lmbs[i].drc_index)); > + > + lmbs[i].reserved = 0; > + } > + rc = -EINVAL; > + } else { > + for (i = start_index; i < end_index; i++) { > + if (!lmbs[i].reserved) > + continue; > + > + pr_info("Memory at %llx (drc index %x) was > hot-removed\n", > + lmbs[i].base_addr, lmbs[i].drc_index); > + > + lmbs[i].reserved = 0; > + } > + } > + > + return rc; > +} > + > #else > static inline int pseries_remove_memblock(unsigned long base, > unsigned int memblock_size) > @@ -829,6 +915,10 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog) > } else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX) { > drc_index = hp_elog->_drc_u.drc_index; > rc = dlpar_memory_remove_by_index(drc_index, prop); > + } else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_IC) { > + count = hp_elog->_drc_u.indexed_count[0]; > + drc_index = hp_elog->_drc_u.indexed_count[1]; > + rc = dlpar_memory_remove_by_ic(count, drc_index, > prop); > } else { > rc = -EINVAL; > } > > ___ >