Re: [RFC PATCH v12 10/33] KVM: Set the stage for handling only shared mappings in mmu_notifier events

2023-09-18 Thread Michael Roth
On Wed, Sep 13, 2023 at 06:55:08PM -0700, Sean Christopherson wrote:
> Add flags to "struct kvm_gfn_range" to let notifier events target only
> shared and only private mappings, and wire up the existing mmu_notifier
> events to be shared-only (private memory is never associated with a
> userspace virtual address, i.e. can't be reached via mmu_notifiers).
> 
> Add two flags so that KVM can handle the three possibilities (shared,
> private, and shared+private) without needing something like a tri-state
> enum.
> 
> Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com
> Signed-off-by: Sean Christopherson 
> ---
>  include/linux/kvm_host.h | 2 ++
>  virt/kvm/kvm_main.c  | 7 +++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d8c6ce6c8211..b5373cee2b08 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -263,6 +263,8 @@ struct kvm_gfn_range {
>   gfn_t start;
>   gfn_t end;
>   union kvm_mmu_notifier_arg arg;
> + bool only_private;
> + bool only_shared;
>   bool may_block;
>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 174de2789657..a41f8658dfe0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t 
> __kvm_handle_hva_range(struct kvm *kvm,
>* the second or later invocation of the handler).
>*/
>   gfn_range.arg = range->arg;
> +
> + /*
> +  * HVA-based notifications aren't relevant to private
> +  * mappings as they don't have a userspace mapping.
> +  */
> + gfn_range.only_private = false;
> + gfn_range.only_shared = true;
>   gfn_range.may_block = range->may_block;

Who is supposed to read only_private/only_shared? Is it supposed to be
plumbed into arch code and handled specially there?

I ask because I see elsewhere you have:

/*
 * If one or more memslots were found and thus zapped, notify arch code
 * that guest memory has been reclaimed.  This needs to be done *after*
 * dropping mmu_lock, as x86's reclaim path is slow.
 */
if (__kvm_handle_hva_range(kvm, &hva_range).found_memslot)
kvm_arch_guest_memory_reclaimed(kvm);

and if there are any MMU notifier events that touch HVAs, then
kvm_arch_guest_memory_reclaimed()->wbinvd_on_all_cpus() will get called,
which causes the performance issues for SEV and SNP that Ashish had brought
up. Technically that would only need to happen if there are GPAs in that
memslot that aren't currently backed by gmem pages (and then gmem could handle
its own wbinvd_on_all_cpus() (or maybe clflush per-page)). 

Actually, even if there are shared pages in the GPA range, the
kvm_arch_guest_memory_reclaimed()->wbinvd_on_all_cpus() can be skipped for
guests that only use gmem pages for private memory. Is that acceptable? Just
trying to figure out where this only_private/only_shared handling ties into
that (or if it's a separate thing entirely).
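
For reference, the sort of arch-side consumer I'm imagining (purely
illustrative sketch, not code from this series; "kvm->arch.has_private_mem"
is a made-up stand-in for "this guest keeps its private memory in gmem"):

  void sev_guest_memory_reclaimed(struct kvm *kvm)
  {
          if (!sev_guest(kvm))
                  return;

          /*
           * HVA-based reclaim can only touch shared memory for gmem-backed
           * guests, so the (slow) wbinvd could be skipped in that case.
           */
          if (kvm->arch.has_private_mem)
                  return;

          wbinvd_on_all_cpus();
  }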

-Mike

>  
>   /*
> -- 
> 2.42.0.283.g2d96d420d3-goog
> 


Re: [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-09-18 Thread Michael Roth
On Wed, Sep 13, 2023 at 06:55:12PM -0700, Sean Christopherson wrote:
> TODO
> 
> Cc: Fuad Tabba 
> Cc: Vishal Annapurve 
> Cc: Ackerley Tng 
> Cc: Jarkko Sakkinen 
> Cc: Maciej Szmigiero 
> Cc: Vlastimil Babka 
> Cc: David Hildenbrand 
> Cc: Quentin Perret 
> Cc: Michael Roth 
> Cc: Wang 
> Cc: Liam Merwick 
> Cc: Isaku Yamahata 
> Co-developed-by: Kirill A. Shutemov 
> Signed-off-by: Kirill A. Shutemov 
> Co-developed-by: Yu Zhang 
> Signed-off-by: Yu Zhang 
> Co-developed-by: Chao Peng 
> Signed-off-by: Chao Peng 
> Co-developed-by: Ackerley Tng 
> Signed-off-by: Ackerley Tng 
> Co-developed-by: Isaku Yamahata 
> Signed-off-by: Isaku Yamahata 
> Signed-off-by: Sean Christopherson 
> ---
>  include/linux/kvm_host.h   |  48 +++
>  include/uapi/linux/kvm.h   |  15 +-
>  include/uapi/linux/magic.h |   1 +
>  virt/kvm/Kconfig   |   4 +
>  virt/kvm/Makefile.kvm  |   1 +
>  virt/kvm/guest_mem.c   | 593 +
>  virt/kvm/kvm_main.c|  61 +++-
>  virt/kvm/kvm_mm.h  |  38 +++
>  8 files changed, 756 insertions(+), 5 deletions(-)
>  create mode 100644 virt/kvm/guest_mem.c
> 



> +static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t 
> len)
> +{
> + struct list_head *gmem_list = &inode->i_mapping->private_list;
> + pgoff_t start = offset >> PAGE_SHIFT;
> + pgoff_t end = (offset + len) >> PAGE_SHIFT;
> + struct kvm_gmem *gmem;
> +
> + /*
> +  * Bindings must be stable across invalidation to ensure the start+end
> +  * are balanced.
> +  */
> + filemap_invalidate_lock(inode->i_mapping);
> +
> + list_for_each_entry(gmem, gmem_list, entry) {
> + kvm_gmem_invalidate_begin(gmem, start, end);

In v11 we used to call truncate_inode_pages_range() here to drop filemap's
reference on the folio. AFAICT the folios are only getting free'd upon
guest shutdown without this. Was this on purpose?
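
E.g. I would have expected something roughly like the v11 flow here
(untested sketch, reusing the names from this function):

  list_for_each_entry(gmem, gmem_list, entry)
          kvm_gmem_invalidate_begin(gmem, start, end);

  truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);

  list_for_each_entry(gmem, gmem_list, entry)
          kvm_gmem_invalidate_end(gmem, start, end);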

> + kvm_gmem_invalidate_end(gmem, start, end);
> + }
> +
> + filemap_invalidate_unlock(inode->i_mapping);
> +
> + return 0;
> +}
> +
> +static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
> +{
> + struct address_space *mapping = inode->i_mapping;
> + pgoff_t start, index, end;
> + int r;
> +
> + /* Dedicated guest is immutable by default. */
> + if (offset + len > i_size_read(inode))
> + return -EINVAL;
> +
> + filemap_invalidate_lock_shared(mapping);

We take the filemap lock here, but not for
kvm_gmem_get_pfn()->kvm_gmem_get_folio(). Is it needed there as well?
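
If so, a minimal sketch of what I'd expect there (hypothetical, and whether
the lock would need to be held until the PFN is actually recorded is part of
the question):

  struct address_space *mapping = inode->i_mapping;

  filemap_invalidate_lock_shared(mapping);
  folio = kvm_gmem_get_folio(inode, index);
  filemap_invalidate_unlock_shared(mapping);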

> +
> + start = offset >> PAGE_SHIFT;
> + end = (offset + len) >> PAGE_SHIFT;
> +
> + r = 0;
> + for (index = start; index < end; ) {
> + struct folio *folio;
> +
> + if (signal_pending(current)) {
> + r = -EINTR;
> + break;
> + }
> +
> + folio = kvm_gmem_get_folio(inode, index);
> + if (!folio) {
> + r = -ENOMEM;
> + break;
> + }
> +
> + index = folio_next_index(folio);
> +
> + folio_unlock(folio);
> + folio_put(folio);
> +
> + /* 64-bit only, wrapping the index should be impossible. */
> + if (WARN_ON_ONCE(!index))
> + break;
> +
> + cond_resched();
> + }
> +
> + filemap_invalidate_unlock_shared(mapping);
> +
> + return r;
> +}
> +



> +static int __kvm_gmem_create(struct kvm *kvm, loff_t size, struct vfsmount 
> *mnt)
> +{
> + const char *anon_name = "[kvm-gmem]";
> + const struct qstr qname = QSTR_INIT(anon_name, strlen(anon_name));
> + struct kvm_gmem *gmem;
> + struct inode *inode;
> + struct file *file;
> + int fd, err;
> +
> + inode = alloc_anon_inode(mnt->mnt_sb);
> + if (IS_ERR(inode))
> + return PTR_ERR(inode);
> +
> + err = security_inode_init_security_anon(inode, &qname, NULL);
> + if (err)
> + goto err_inode;
> +
> + inode->i_private = (void *)(unsigned long)flags;

The 'flags' argument isn't added until the subsequent patch that adds THP
support.



> +static bool kvm_gmem_is_valid_size(loff_t size, u64 flags)
> +{
> + if (size < 0 || !PAGE_ALIGNED(size))
> + return false;
> +
> + return true;
> +}
> +
> +int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> +{
> + loff_t size = args->size;
> + u6

[PATCH v2] powerpc/pseries/hotplug-cpu: wait indefinitely for vCPU death

2020-08-11 Thread Michael Roth
588.ppc64le #1
  [ 1538.362687] Workqueue: pseries hotplug workque pseries_hp_work_fn
  [ 1538.362725] Call Trace:
  [ 1538.362743] [c009d4adf590] [c0e0e0fc] dump_stack+0xb0/0xf4 
(unreliable)
  [ 1538.362789] [c009d4adf5d0] [c0475dfc] bad_page+0x12c/0x1b0
  [ 1538.362827] [c009d4adf660] [c04784bc] 
free_pcppages_bulk+0x5bc/0x940
  [ 1538.362871] [c009d4adf760] [c0478c38] 
page_alloc_cpu_dead+0x118/0x120
  [ 1538.362918] [c009d4adf7b0] [c015b898] 
cpuhp_invoke_callback.constprop.5+0xb8/0x760
  [ 1538.362969] [c009d4adf820] [c015eee8] _cpu_down+0x188/0x340
  [ 1538.363007] [c009d4adf890] [c015d75c] cpu_down+0x5c/0xa0
  [ 1538.363045] [c009d4adf8c0] [c092c544] 
cpu_subsys_offline+0x24/0x40
  [ 1538.363091] [c009d4adf8e0] [c09212f0] device_offline+0xf0/0x130
  [ 1538.363129] [c009d4adf920] [c010aee4] 
dlpar_offline_cpu+0x1c4/0x2a0
  [ 1538.363174] [c009d4adf9e0] [c010b2f8] 
dlpar_cpu_remove+0xb8/0x190
  [ 1538.363219] [c009d4adfa60] [c010b4fc] 
dlpar_cpu_remove_by_index+0x12c/0x150
  [ 1538.363264] [c009d4adfaf0] [c010ca24] dlpar_cpu+0x94/0x800
  [ 1538.363302] [c009d4adfc00] [c0102cc8] 
pseries_hp_work_fn+0x128/0x1e0
  [ 1538.363347] [c009d4adfc70] [c018aa84] 
process_one_work+0x304/0x5d0
  [ 1538.363394] [c009d4adfd10] [c018b5cc] worker_thread+0xcc/0x7a0
  [ 1538.363433] [c009d4adfdc0] [c019567c] kthread+0x1ac/0x1c0
  [ 1538.363469] [c009d4adfe30] [c000b7dc] 
ret_from_kernel_thread+0x5c/0x80

The latter trace is due to the following sequence:

  page_alloc_cpu_dead
drain_pages
  drain_pages_zone
free_pcppages_bulk

where drain_pages() in this case is called under the assumption that
the unplugged cpu is no longer executing. To ensure that is the case,
an early call is made to __cpu_die()->pseries_cpu_die(), which runs
a loop that waits for the cpu to reach a halted state by polling its
status via query-cpu-stopped-state RTAS calls. It only polls for
25 iterations before giving up, however, and in the trace above this
results in the following being printed only 0.1 seconds after the
hotplug worker thread begins processing the unplug request:

  [ 1538.253044] pseries-hotplug-cpu: Attempting to remove CPU , drc 
index: 113a
  [ 1538.360259] Querying DEAD? cpu 314 (314) shows 2

At that point the worker thread assumes the unplugged CPU is in some
unknown/dead state and proceeds with the cleanup, causing the race with
the XIVE cleanup code executed by the unplugged CPU.

Fix this by waiting indefinitely, but also making an effort to avoid
spurious lockup messages by allowing for rescheduling after polling
the CPU status and printing a warning if we wait for longer than 120s.

Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt 
controller")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
Suggested-by: Michael Ellerman 
Cc: Thiago Jung Bauermann 
Cc: Michael Ellerman 
Cc: Cedric Le Goater 
Cc: Greg Kurz 
Cc: Nathan Lynch 
Signed-off-by: Michael Roth 
---
changes from v1:
 - renamed from "powerpc/pseries/hotplug-cpu: increase wait time for vCPU death"
 - wait indefinitely, but issue cond_resched() when RTAS reports that CPU
   hasn't stopped (Michael)
 - print a warning after 120s of waiting (Michael)
 - use pr_warn() instead of default printk() level
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index c6e0d8abf75e..7a974ed6b240 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -107,22 +107,28 @@ static int pseries_cpu_disable(void)
  */
 static void pseries_cpu_die(unsigned int cpu)
 {
-   int tries;
int cpu_status = 1;
unsigned int pcpu = get_hard_smp_processor_id(cpu);
+   unsigned long timeout = jiffies + msecs_to_jiffies(120000);
 
-   for (tries = 0; tries < 25; tries++) {
+   while (true) {
cpu_status = smp_query_cpu_stopped(pcpu);
if (cpu_status == QCSS_STOPPED ||
cpu_status == QCSS_HARDWARE_ERROR)
break;
-   cpu_relax();
 
+   if (time_after(jiffies, timeout)) {
+   pr_warn("CPU %i (hwid %i) didn't die after 120 
seconds\n",
+   cpu, pcpu);
+   timeout = jiffies + msecs_to_jiffies(120000);
+   }
+
+   cond_resched();
}
 
-   if (cpu_status != 0) {
-   printk("Querying DEAD? cpu %i (%i) shows %i\n",
-  cpu, pcpu, cpu_status);
+   if (cpu_status == QCSS_HARDWARE_ERROR) {
+  

Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-10 Thread Michael Roth
Quoting Nathan Lynch (2020-08-07 02:05:09)
> Hi everyone,
> 
> Michael Ellerman  writes:
> > Greg Kurz  writes:
> >> On Tue, 04 Aug 2020 23:35:10 +1000
> >> Michael Ellerman  wrote:
> >>> Spinning forever seems like a bad idea, but as has been demonstrated at
> >>> least twice now, continuing when we don't know the state of the other
> >>> CPU can lead to straight up crashes.
> >>> 
> >>> So I think I'm persuaded that it's preferable to have the kernel stuck
> >>> spinning rather than oopsing.
> >>> 
> >>
> >> +1
> >>
> >>> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> >>> first instinct is no, if we're stuck here for 20s a stack trace would be
> >>> good. But then we will probably hit that on some big and/or heavily
> >>> loaded machine.
> >>> 
> >>> So possibly we should call cond_resched() but have some custom logic in
> >>> the loop to print a warning if we are stuck for more than some
> >>> sufficiently long amount of time.
> >>
> >> How long should that be ?
> >
> > Yeah good question.
> >
> > I guess step one would be seeing how long it can take on the 384 vcpu
> > machine. And we can probably test on some other big machines.
> >
> > Hopefully Nathan can give us some idea of how long he's seen it take on
> > large systems? I know he was concerned about the 20s timeout of the
> > softlockup detector.
> 
> Maybe I'm not quite clear what this is referring to, but I don't think
> stop-self/query-stopped-state latency increases with processor count, at
> least not on PowerVM. And IIRC I was concerned with the earlier patch's
> potential for causing the softlockup watchdog to rightly complain by
> polling the stopped state without ever scheduling away.
> 
> The fact that smp_query_cpu_stopped() kind of collapses the two distinct
> results from the query-cpu-stopped-state RTAS call into one return value
> may make it harder than necessary to reason about the questions around
> cond_resched() and whether to warn.
> 
> Sorry to pull this stunt but I have had some code sitting in a neglected
> branch that I think gets the logic around this right.
> 
> What we should have is a simple C wrapper for the RTAS call that reflects the
> architected inputs and outputs:
> 
> 
> (-- rtas.c --)
> 
> /**
>  * rtas_query_cpu_stopped_state() - Call RTAS query-cpu-stopped-state.
>  * @hwcpu: Identifies the processor thread to be queried.
>  * @status: Pointer to status, valid only on success.
>  *
>  * Determine whether the given processor thread is in the stopped
>  * state.  If successful and @status is non-NULL, the thread's status
>  * is stored to @status.
>  *
>  * Return:
>  * * 0   - Success
>  * * -1  - Hardware error
>  * * -2  - Busy, try again later
>  */
> int rtas_query_cpu_stopped_state(unsigned int hwcpu, unsigned int *status)
> {
>unsigned int cpu_status;
>int token;
>int fwrc;
> 
>token = rtas_token("query-cpu-stopped-state");
> 
>fwrc = rtas_call(token, 1, 2, &cpu_status, hwcpu);
>if (fwrc != 0)
>goto out;
> 
>if (status != NULL)
>*status = cpu_status;
> out:
>return fwrc;
> }
> 
> 
> 
> And then a utility function that waits for the remote thread to enter
> stopped state, with higher-level logic for rescheduling and warning. The
> fact that smp_query_cpu_stopped() currently does not handle a -2/busy
> status is a bug, fixed below by using rtas_busy_delay(). Note the
> justification for the explicit cond_resched() in the outer loop:
> 
> 
> (-- rtas.h --)
> 
> /* query-cpu-stopped-state CPU_status */
> #define RTAS_QCSS_STATUS_STOPPED 0
> #define RTAS_QCSS_STATUS_IN_PROGRESS 1
> #define RTAS_QCSS_STATUS_NOT_STOPPED 2
> 
> (-- pseries/hotplug-cpu.c --)
> 
> /**
>  * wait_for_cpu_stopped() - Wait for a cpu to enter RTAS stopped state.
>  */
> static void wait_for_cpu_stopped(unsigned int cpu)
> {
>unsigned int status;
>unsigned int hwcpu;
> 
>hwcpu = get_hard_smp_processor_id(cpu);
> 
>do {
>int fwrc;
> 
>/*
> * rtas_busy_delay() will yield only if RTAS returns a
> * busy status. Since query-cpu-stopped-state can
> * yield RTAS_QCSS_STATUS_IN_PROGRESS or
> * RTAS_QCSS_STATUS_NOT_STOPPED for an unbounded
> * period before the target thread stops, we must take
> * care to explicitly reschedule while polling.
> */
>cond_resched();
> 
>do {
>fwrc = rtas_query_cpu_stopped_state(hwcpu, &status);
>} while (rtas_busy_delay(fwrc));
> 
>if (fwrc == 0)
>continue;
> 
>pr_err_ratelimited("query-cpu-stopped-state for "
>  

Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-05 Thread Michael Roth
Quoting Michael Roth (2020-08-05 17:29:28)
> Quoting Michael Roth (2020-08-04 23:37:32)
> > Quoting Michael Ellerman (2020-08-04 22:07:08)
> > > Greg Kurz  writes:
> > > > On Tue, 04 Aug 2020 23:35:10 +1000
> > > > Michael Ellerman  wrote:
> > > >> Spinning forever seems like a bad idea, but as has been demonstrated at
> > > >> least twice now, continuing when we don't know the state of the other
> > > >> CPU can lead to straight up crashes.
> > > >> 
> > > >> So I think I'm persuaded that it's preferable to have the kernel stuck
> > > >> spinning rather than oopsing.
> > > >> 
> > > >
> > > > +1
> > > >
> > > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> > > >> first instinct is no, if we're stuck here for 20s a stack trace would 
> > > >> be
> > > >> good. But then we will probably hit that on some big and/or heavily
> > > >> loaded machine.
> > > >> 
> > > >> So possibly we should call cond_resched() but have some custom logic in
> > > >> the loop to print a warning if we are stuck for more than some
> > > >> sufficiently long amount of time.
> > > >
> > > > How long should that be ?
> > > 
> > > Yeah good question.
> > > 
> > > I guess step one would be seeing how long it can take on the 384 vcpu
> > > machine. And we can probably test on some other big machines.
> > > 
> > > Hopefully Nathan can give us some idea of how long he's seen it take on
> > > large systems? I know he was concerned about the 20s timeout of the
> > > softlockup detector.
> > > 
> > > Maybe a minute or two?
> > 
> > Hmm, so I took a stab at this where I called cond_resched() after
> > every 5 seconds of polling and printed a warning at the same time (FWIW
> > that doesn't seem to trigger any warnings on a loaded 96-core mihawk
> > system using KVM running the 384vcpu unplug loop)
> > 
> > But it sounds like that's not quite what you had in mind. How frequently
> > do you think we should call cond_resched()? Maybe after 25 iterations
> > of polling smp_query_cpu_stopped() to keep original behavior somewhat
> > similar?
> > 
> > I'll let the current patch run on the mihawk system overnight in the
> > meantime so we at least have that data point, but would be good to
> > know what things look like on a large pHyp machine.
> 
> At one point I did manage to get the system in a state where unplug
> operations were taking 1-2s, but still not enough to trigger any
> 5s warning, and I wasn't able to reproduce that in subsequent runs.
> 
> I also tried reworking the patch so that we print a warning and
> cond_resched() after 200 ms to make sure that path gets executed, but
> only managed to trigger the warning twice after a few hours.
> 
> So, if we print a warning after a couple minutes, that seems pretty
> conservative as far as avoiding spurious warnings. And if we
> cond_resched() after 25 loops of polling (~0.1 ms in the cases

~0.1 seconds I mean

> that caused the original crash), that would avoid most of the
> default RCU/lockup warnings.
> 
> But having a second timeout to trigger the cond_resched() after some
> set interval like 2s seems more deterministic since we're less
> susceptible to longer delays due to things like the RTAS calls
> contending for QEMU's global mutex in the KVM case.
> 
> 
> > 
> > Thanks!
> > 
> > > 
> > > >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE 
> > > >> > interrupt controller")
> > > >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> > > >> 
> > > >> This is not public.
> > > >
> > > > I'll have a look at changing that.
> > > 
> > > Thanks.
> > > 
> > > cheers


Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-05 Thread Michael Roth
Quoting Michael Roth (2020-08-04 23:37:32)
> Quoting Michael Ellerman (2020-08-04 22:07:08)
> > Greg Kurz  writes:
> > > On Tue, 04 Aug 2020 23:35:10 +1000
> > > Michael Ellerman  wrote:
> > >> Spinning forever seems like a bad idea, but as has been demonstrated at
> > >> least twice now, continuing when we don't know the state of the other
> > >> CPU can lead to straight up crashes.
> > >> 
> > >> So I think I'm persuaded that it's preferable to have the kernel stuck
> > >> spinning rather than oopsing.
> > >> 
> > >
> > > +1
> > >
> > >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> > >> first instinct is no, if we're stuck here for 20s a stack trace would be
> > >> good. But then we will probably hit that on some big and/or heavily
> > >> loaded machine.
> > >> 
> > >> So possibly we should call cond_resched() but have some custom logic in
> > >> the loop to print a warning if we are stuck for more than some
> > >> sufficiently long amount of time.
> > >
> > > How long should that be ?
> > 
> > Yeah good question.
> > 
> > I guess step one would be seeing how long it can take on the 384 vcpu
> > machine. And we can probably test on some other big machines.
> > 
> > Hopefully Nathan can give us some idea of how long he's seen it take on
> > large systems? I know he was concerned about the 20s timeout of the
> > softlockup detector.
> > 
> > Maybe a minute or two?
> 
> Hmm, so I took a stab at this where I called cond_resched() after
> every 5 seconds of polling and printed a warning at the same time (FWIW
> that doesn't seem to trigger any warnings on a loaded 96-core mihawk
> system using KVM running the 384vcpu unplug loop)
> 
> But it sounds like that's not quite what you had in mind. How frequently
> do you think we should call cond_resched()? Maybe after 25 iterations
> of polling smp_query_cpu_stopped() to keep original behavior somewhat
> similar?
> 
> I'll let the current patch run on the mihawk system overnight in the
> meantime so we at least have that data point, but would be good to
> know what things look like on a large pHyp machine.

At one point I did manage to get the system in a state where unplug
operations were taking 1-2s, but still not enough to trigger any
5s warning, and I wasn't able to reproduce that in subsequent runs.

I also tried reworking the patch so that we print a warning and
cond_resched() after 200 ms to make sure that path gets executed, but
only managed to trigger the warning twice after a few hours.

So, if we print a warning after a couple minutes, that seems pretty
conservative as far as avoiding spurious warnings. And if we
cond_resched() after 25 loops of polling (~0.1 ms in the cases
that caused the original crash), that would avoid most of the
default RCU/lockup warnings.

But having a second timeout to trigger the cond_resched() after some
set interval like 2s seems more deterministic since we're less
susceptible to longer delays due to things like the RTAS calls
contending for QEMU's global mutex in the KVM case.
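
Roughly what I mean by the two timeouts (illustrative sketch only, not the
posted patch):

  unsigned long warn_timeout = jiffies + msecs_to_jiffies(120000);
  unsigned long resched_timeout = jiffies + msecs_to_jiffies(2000);

  while (true) {
          cpu_status = smp_query_cpu_stopped(pcpu);
          if (cpu_status == QCSS_STOPPED ||
              cpu_status == QCSS_HARDWARE_ERROR)
                  break;

          /* warn if the CPU still hasn't stopped after a long while */
          if (time_after(jiffies, warn_timeout)) {
                  pr_warn("CPU %i (hwid %i) still alive after 120 seconds\n",
                          cpu, pcpu);
                  warn_timeout = jiffies + msecs_to_jiffies(120000);
          }

          /* reschedule on a fixed interval rather than every iteration */
          if (time_after(jiffies, resched_timeout)) {
                  cond_resched();
                  resched_timeout = jiffies + msecs_to_jiffies(2000);
          }
  }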


> 
> Thanks!
> 
> > 
> > >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE 
> > >> > interrupt controller")
> > >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> > >> 
> > >> This is not public.
> > >
> > > I'll have a look at changing that.
> > 
> > Thanks.
> > 
> > cheers


Re: [PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-04 Thread Michael Roth
Quoting Michael Ellerman (2020-08-04 22:07:08)
> Greg Kurz  writes:
> > On Tue, 04 Aug 2020 23:35:10 +1000
> > Michael Ellerman  wrote:
> >> There is a bit of history to this code, but not in a good way :)
> >> 
> >> Michael Roth  writes:
> >> > For a power9 KVM guest with XIVE enabled, running a test loop
> >> > where we hotplug 384 vcpus and then unplug them, the following traces
> >> > can be seen (generally within a few loops) either from the unplugged
> >> > vcpu:
> >> >
> >> >   [ 1767.353447] cpu 65 (hwid 65) Ready to die...
> >> >   [ 1767.952096] Querying DEAD? cpu 66 (66) shows 2
> >> >   [ 1767.952311] list_del corruption. next->prev should be 
> >> > c00a02470208, but was c00a02470048
> >> ...
> >> >
> >> > At that point the worker thread assumes the unplugged CPU is in some
> >> > unknown/dead state and procedes with the cleanup, causing the race with
> >> > the XIVE cleanup code executed by the unplugged CPU.
> >> >
> >> > Fix this by inserting an msleep() after each RTAS call to avoid
> >> 
> >> We previously had an msleep(), but it was removed:
> >> 
> >>   b906cfa397fd ("powerpc/pseries: Fix cpu hotplug")
> >
> > Ah, I hadn't seen that one...
> >
> >> > pseries_cpu_die() returning prematurely, and double the number of
> >> > attempts so we wait at least a total of 5 seconds. While this isn't an
> >> > ideal solution, it is similar to how we dealt with a similar issue for
> >> > cede_offline mode in the past (940ce422a3).
> >> 
> >> Thiago tried to fix this previously but there was a bit of discussion
> >> that didn't quite resolve:
> >> 
> >>   
> >> https://lore.kernel.org/linuxppc-dev/20190423223914.3882-1-bauer...@linux.ibm.com/
> >
> > Yeah it appears that the motivation at the time was to make the "Querying 
> > DEAD?"
> > messages to disappear and to avoid potentially concurrent calls to 
> > rtas-stop-self
> > which is prohibited by PAPR... not fixing actual crashes.
> 
> I'm pretty sure at one point we were triggering crashes *in* RTAS via
> this path, I think that got resolved.
> 
> >> Spinning forever seems like a bad idea, but as has been demonstrated at
> >> least twice now, continuing when we don't know the state of the other
> >> CPU can lead to straight up crashes.
> >> 
> >> So I think I'm persuaded that it's preferable to have the kernel stuck
> >> spinning rather than oopsing.
> >> 
> >
> > +1
> >
> >> I'm 50/50 on whether we should have a cond_resched() in the loop. My
> >> first instinct is no, if we're stuck here for 20s a stack trace would be
> >> good. But then we will probably hit that on some big and/or heavily
> >> loaded machine.
> >> 
> >> So possibly we should call cond_resched() but have some custom logic in
> >> the loop to print a warning if we are stuck for more than some
> >> sufficiently long amount of time.
> >
> > How long should that be ?
> 
> Yeah good question.
> 
> I guess step one would be seeing how long it can take on the 384 vcpu
> machine. And we can probably test on some other big machines.
> 
> Hopefully Nathan can give us some idea of how long he's seen it take on
> large systems? I know he was concerned about the 20s timeout of the
> softlockup detector.
> 
> Maybe a minute or two?

Hmm, so I took a stab at this where I called cond_resched() after
every 5 seconds of polling and printed a warning at the same time (FWIW
that doesn't seem to trigger any warnings on a loaded 96-core mihawk
system using KVM running the 384vcpu unplug loop)

But it sounds like that's not quite what you had in mind. How frequently
do you think we should call cond_resched()? Maybe after 25 iterations
of polling smp_query_cpu_stopped() to keep original behavior somewhat
similar?

I'll let the current patch run on the mihawk system overnight in the
meantime so we at least have that data point, but would be good to
know what things look like on a large pHyp machine.

Thanks!

> 
> >> > Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE 
> >> > interrupt controller")
> >> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
> >> 
> >> This is not public.
> >
> > I'll have a look at changing that.
> 
> Thanks.
> 
> cheers


[PATCH] powerpc/pseries/hotplug-cpu: increase wait time for vCPU death

2020-08-03 Thread Michael Roth
588.ppc64le #1
  [ 1538.362687] Workqueue: pseries hotplug workque pseries_hp_work_fn
  [ 1538.362725] Call Trace:
  [ 1538.362743] [c009d4adf590] [c0e0e0fc] dump_stack+0xb0/0xf4 
(unreliable)
  [ 1538.362789] [c009d4adf5d0] [c0475dfc] bad_page+0x12c/0x1b0
  [ 1538.362827] [c009d4adf660] [c04784bc] 
free_pcppages_bulk+0x5bc/0x940
  [ 1538.362871] [c009d4adf760] [c0478c38] 
page_alloc_cpu_dead+0x118/0x120
  [ 1538.362918] [c009d4adf7b0] [c015b898] 
cpuhp_invoke_callback.constprop.5+0xb8/0x760
  [ 1538.362969] [c009d4adf820] [c015eee8] _cpu_down+0x188/0x340
  [ 1538.363007] [c009d4adf890] [c015d75c] cpu_down+0x5c/0xa0
  [ 1538.363045] [c009d4adf8c0] [c092c544] 
cpu_subsys_offline+0x24/0x40
  [ 1538.363091] [c009d4adf8e0] [c09212f0] device_offline+0xf0/0x130
  [ 1538.363129] [c009d4adf920] [c010aee4] 
dlpar_offline_cpu+0x1c4/0x2a0
  [ 1538.363174] [c009d4adf9e0] [c010b2f8] 
dlpar_cpu_remove+0xb8/0x190
  [ 1538.363219] [c009d4adfa60] [c010b4fc] 
dlpar_cpu_remove_by_index+0x12c/0x150
  [ 1538.363264] [c009d4adfaf0] [c010ca24] dlpar_cpu+0x94/0x800
  [ 1538.363302] [c009d4adfc00] [c0102cc8] 
pseries_hp_work_fn+0x128/0x1e0
  [ 1538.363347] [c009d4adfc70] [c018aa84] 
process_one_work+0x304/0x5d0
  [ 1538.363394] [c009d4adfd10] [c018b5cc] worker_thread+0xcc/0x7a0
  [ 1538.363433] [c009d4adfdc0] [c019567c] kthread+0x1ac/0x1c0
  [ 1538.363469] [c009d4adfe30] [c000b7dc] 
ret_from_kernel_thread+0x5c/0x80

The latter trace is due to the following sequence:

  page_alloc_cpu_dead
drain_pages
  drain_pages_zone
free_pcppages_bulk

where drain_pages() in this case is called under the assumption that
the unplugged cpu is no longer executing. To ensure that is the case,
an early call is made to __cpu_die()->pseries_cpu_die(), which runs
a loop that waits for the cpu to reach a halted state by polling its
status via query-cpu-stopped-state RTAS calls. It only polls for
25 iterations before giving up, however, and in the trace above this
results in the following being printed only 0.1 seconds after the
hotplug worker thread begins processing the unplug request:

  [ 1538.253044] pseries-hotplug-cpu: Attempting to remove CPU , drc 
index: 113a
  [ 1538.360259] Querying DEAD? cpu 314 (314) shows 2

At that point the worker thread assumes the unplugged CPU is in some
unknown/dead state and proceeds with the cleanup, causing the race with
the XIVE cleanup code executed by the unplugged CPU.

Fix this by inserting an msleep() after each RTAS call to avoid
pseries_cpu_die() returning prematurely, and double the number of
attempts so we wait at least a total of 5 seconds. While this isn't an
ideal solution, it is similar to how we dealt with a similar issue for
cede_offline mode in the past (940ce422a3).

Fixes: eac1e731b59ee ("powerpc/xive: guest exploitation of the XIVE interrupt 
controller")
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1856588
Cc: Michael Ellerman 
Cc: Cedric Le Goater 
Cc: Greg Kurz 
Cc: Nathan Lynch 
Signed-off-by: Michael Roth 
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index c6e0d8abf75e..3cb172758052 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -111,13 +111,12 @@ static void pseries_cpu_die(unsigned int cpu)
int cpu_status = 1;
unsigned int pcpu = get_hard_smp_processor_id(cpu);
 
-   for (tries = 0; tries < 25; tries++) {
+   for (tries = 0; tries < 50; tries++) {
cpu_status = smp_query_cpu_stopped(pcpu);
if (cpu_status == QCSS_STOPPED ||
cpu_status == QCSS_HARDWARE_ERROR)
break;
-   cpu_relax();
-
+   msleep(100);
}
 
if (cpu_status != 0) {
-- 
2.17.1



[PATCH] KVM: PPC: Book3S HV: Fix H_CEDE return code for nested guests

2020-03-10 Thread Michael Roth
The h_cede_tm kvm-unit-test currently fails when run inside an L1 guest
via the guest/nested hypervisor.

  ./run-tests.sh -v
  ...
  TESTNAME=h_cede_tm TIMEOUT=90s ACCEL= ./powerpc/run powerpc/tm.elf -smp 
2,threads=2 -machine cap-htm=on -append "h_cede_tm"
  FAIL h_cede_tm (2 tests, 1 unexpected failures)

While the test relates to transactional memory instructions, the actual
failure is due to the return code of the H_CEDE hypercall, which is
reported as 224 instead of 0. This happens even when no TM instructions
are issued.

224 is the value placed in r3 to execute a hypercall for H_CEDE, and r3
is where the caller expects the return code to be placed upon return.

In the case of a guest running under a nested hypervisor, issuing H_CEDE
causes a return from H_ENTER_NESTED. In this case H_CEDE is
specially-handled immediately rather than later in
kvmppc_pseries_do_hcall() as with most other hcalls, but we forget to
set the return code for the caller, hence why kvm-unit-test sees the
224 return code and reports an error.

Guest kernels generally don't check the return value of H_CEDE, so
that likely explains why this hasn't caused issues outside of
kvm-unit-tests so far.

Fix this by setting r3 to 0 after we finish processing the H_CEDE.

RHBZ: 1778556

Fixes: 4bad77799fed ("KVM: PPC: Book3S HV: Handle hypercalls correctly when 
nested")
Cc: linuxppc-...@ozlabs.org
Cc: David Gibson 
Cc: Paul Mackerras 
Signed-off-by: Michael Roth 
---
 arch/powerpc/kvm/book3s_hv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2cefd071b848..c0c43a733830 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3616,6 +3616,7 @@ int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, u64 
time_limit,
if (trap == BOOK3S_INTERRUPT_SYSCALL && !vcpu->arch.nested &&
kvmppc_get_gpr(vcpu, 3) == H_CEDE) {
kvmppc_nested_cede(vcpu);
+   kvmppc_set_gpr(vcpu, 3, 0);
trap = 0;
}
} else {
-- 
2.17.1



Re: [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-12 Thread Michael Roth
Quoting Alexey Kardashevskiy (2019-12-11 16:47:30)
> 
> 
> On 12/12/2019 07:31, Michael Roth wrote:
> > Quoting Alexey Kardashevskiy (2019-12-11 02:15:44)
> >>
> >>
> >> On 11/12/2019 02:35, Ram Pai wrote:
> >>> On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote:
> >>>>
> >>>>
> >>>> On 10/12/2019 16:12, Ram Pai wrote:
> >>>>> On Tue, Dec 10, 2019 at 02:07:36PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 07/12/2019 12:12, Ram Pai wrote:
> >>>>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one 
> >>>>>>> of
> >>>>>>> its parameters.  On secure VMs, hypervisor cannot access the contents 
> >>>>>>> of
> >>>>>>> this page since it gets encrypted.  Hence share the page with the
> >>>>>>> hypervisor, and unshare when done.
> >>>>>>
> >>>>>>
> >>>>>> I thought the idea was to use H_PUT_TCE and avoid sharing any extra
> >>>>>> pages. There is small problem that when DDW is enabled,
> >>>>>> FW_FEATURE_MULTITCE is ignored (easy to fix); I also noticed complains
> >>>>>> about the performance on slack but this is caused by initial cleanup of
> >>>>>> the default TCE window (which we do not use anyway) and to battle this
> >>>>>> we can simply reduce its size by adding
> >>>>>
> >>>>> something that takes hardly any time with H_PUT_TCE_INDIRECT,  takes
> >>>>> 13secs per device for H_PUT_TCE approach, during boot. This is with a
> >>>>> 30GB guest. With larger guest, the time will further detoriate.
> >>>>
> >>>>
> >>>> No it will not, I checked. The time is the same for 2GB and 32GB guests-
> >>>> the delay is caused by clearing the small DMA window which is small by
> >>>> the space mapped (1GB) but quite huge in TCEs as it uses 4K pages; and
> >>>> for DDW window + emulated devices the IOMMU page size will be 2M/16M/1G
> >>>> (depends on the system) so the number of TCEs is much smaller.
> >>>
> >>> I cant get your results.  What changes did you make to get it?
> >>
> >>
> >> Get what? I passed "-m 2G" and "-m 32G", got the same time - 13s spent
> >> in clearing the default window and the huge window took a fraction of a
> >> second to create and map.
> > 
> > Is this if we disable FW_FEATURE_MULTITCE in the guest and force the use
> > of H_PUT_TCE everywhere?
> 
> 
> Yes. Well, for the DDW case FW_FEATURE_MULTITCE is ignored but even when
> fixed (I have it in my local branch), this does not make a difference.
> 
> 
> > 
> > In theory couldn't we leave FW_FEATURE_MULTITCE in place so that
> > iommu_table_clear() can still use H_STUFF_TCE (which I guess is basically
> > instant),
> 
> PAPR/LoPAPR "conveniently" do not describe what hcall-multi-tce does
> exactly. But I am pretty sure the idea is that either both H_STUFF_TCE
> and H_PUT_TCE_INDIRECT are present or neither.

That was my interpretation (or maybe I just went by what your
implementation did :), but just because they are available doesn't mean
the guest has to use them. I agree it's ugly to condition it on
is_secure_guest(), but to me that seems better than sharing memory
unnecessarily, or potentially leaving stale mappings in the default IOMMU.

Not sure if there are other alternatives though.

> 
> 
> > and then force H_PUT_TCE for new mappings via something like:
> > 
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> > b/arch/powerpc/platforms/pseries/iommu.c
> > index 6ba081dd61c9..85d092baf17d 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -194,6 +194,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
> > *tbl, long tcenum,
> > unsigned long flags;
> >  
> > if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
> > +   if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE) || 
> > is_secure_guest()) {
> 
> 
> Nobody (including myself) seems to like the idea of having
> is_secure_guest() all over the place.
> 
> And with KVM acceleration enabled, it is pretty fast anyway. Just now we
> do not have H_PUT_TCE in KVM/UV for 

Re: [PATCH v5 2/2] powerpc/pseries/iommu: Use dma_iommu_ops for Secure VM.

2019-12-12 Thread Michael Roth
Quoting Ram Pai (2019-12-12 00:45:02)
> On Tue, Dec 10, 2019 at 07:43:24PM -0600, Michael Roth wrote:
> > Quoting Ram Pai (2019-12-06 19:12:39)
> > > Commit edea902c1c1e ("powerpc/pseries/iommu: Don't use dma_iommu_ops on
> > > secure guests")
> > > disabled dma_iommu_ops path, for secure VMs. Disabling dma_iommu_ops
> > > path for secure VMs, helped enable dma_direct path.  This enabled
> > > support for bounce-buffering through SWIOTLB.  However it fails to
> > > operate when IOMMU is enabled, since I/O pages are not TCE mapped.
> > > 
> > > Renable dma_iommu_ops path for pseries Secure VMs.  It handles all
> > > cases including, TCE mapping I/O pages, in the presence of a
> > > IOMMU.
> > 
> > Wasn't clear to me at first, but I guess the main gist of this series is
> > that we want to continue to use SWIOTLB, but also need to create mappings
> > of its bounce buffers in the IOMMU, so we revert to using dma_iommu_ops
> > and rely on the various dma_iommu_{map,alloc}_bypass() hooks throughout
> > to call into dma_direct_* ops rather than relying on the dma_is_direct(ops)
> > checks in DMA API functions to do the same.
> > 
> > That makes sense, but one issue I see with that is that
> > dma_iommu_map_bypass() only tests true if all the following are true:
> > 
> > 1) the device requests a 64-bit DMA mask via
> >dma_set_mask/dma_set_coherent_mask
> > 2) DDW is enabled (i.e. we don't pass disable_ddw on command-line)
> > 
> > dma_is_direct() checks don't have this limitation, so I think for
> > any other cases, such as devices that use a smaller DMA mask, we'll
> > end up falling back to the non-bypass functions in dma_iommu_ops, which
> > will likely break for things like dma_alloc_coherent/dma_map_single
> > since they won't use SWIOTLB pages and won't do the necessary calls to
> > set_memory_unencrypted() to share those non-SWIOTLB buffers with
> > hypervisor.
> > 
> > Maybe that's ok, but I think we should be clearer about how to
> > fail/handle these cases.
> 
> Yes. makes sense. Device that cannot handle 64bit dma mask will not work.
> 
> > 
> > Though I also agree with some concerns Alexey stated earlier: it seems
> > wasteful to map the entire DDW window just so these bounce buffers can be
> > mapped.  Especially if you consider the lack of a mapping to be an 
> > additional
> > safe-guard against things like buggy device implementations on the QEMU
> > side. E.g. if we leaked pages to the hypervisor on accident, those pages
> > wouldn't be immediately accessible to a device, and would still require
> > additional work to get past the IOMMU.
> 
> Well, an accidental unintented page leak to the hypervisor, is a very
> bad thing, regardless of any DMA mapping. The device may not be able to
> access it, but the hypervisor still can access it.

Agreed, but if IOMMU can provide additional isolation we should make use
of it when reasonable.

> 
> > 
> > What would it look like if we try to make all this work with disable_ddw 
> > passed
> > to kernel command-line (or forced for is_secure_guest())?
> > 
> >   1) dma_iommu_{alloc,map}_bypass() would no longer get us to dma_direct_* 
> > ops,
> >  but an additional case or hook that considers is_secure_guest() might 
> > do
> >  it.
> >  
> >   2) We'd also need to set up an IOMMU mapping for the bounce buffers via
> >  io_tlb_start/io_tlb_end. We could do it once, on-demand via
> >  dma_iommu_bypass_supported() like we do for the 64-bit DDW window, or
> >  maybe in some init function.
> 
> Hmm... i not sure how to accomplish (2).   we need use some DDW window
> to setup the mappings. right?  If disable_ddw is set, there wont be any
> ddw.  What am i missing?

We have 2 windows, the default 32-bit window that normally covers DMA addresses
0..1GB, and an additional DDW window that's created on demand for 64-bit
devices (since they can address a full linear mapping of all guest
memory at DMA address 1<<59). Your current patch uses the latter, but we
could potentially use the 32-bit window since we only need to map the
SWIOTLB pages.

Not saying that's necessarily better, but one upside is it only requires
devices to support 32-bit DMA addressing, which is a larger class of
devices than those that support 64-bit (since 64-bit device drivers
generally have a 32-bit fallback). Required changes are a bit more
pervasive though.

It might make sense to get both scenarios working at some point, but
maybe it's okay to handle 64-bit first. 64-bit does give us more legroom
if we ant

Re: [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

2019-12-11 Thread Michael Roth
Quoting Alexey Kardashevskiy (2019-12-11 02:15:44)
> 
> 
> On 11/12/2019 02:35, Ram Pai wrote:
> > On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 10/12/2019 16:12, Ram Pai wrote:
> >>> On Tue, Dec 10, 2019 at 02:07:36PM +1100, Alexey Kardashevskiy wrote:
> 
> 
>  On 07/12/2019 12:12, Ram Pai wrote:
> > H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
> > its parameters.  On secure VMs, hypervisor cannot access the contents of
> > this page since it gets encrypted.  Hence share the page with the
> > hypervisor, and unshare when done.
> 
> 
>  I thought the idea was to use H_PUT_TCE and avoid sharing any extra
>  pages. There is small problem that when DDW is enabled,
>  FW_FEATURE_MULTITCE is ignored (easy to fix); I also noticed complains
>  about the performance on slack but this is caused by initial cleanup of
>  the default TCE window (which we do not use anyway) and to battle this
>  we can simply reduce its size by adding
> >>>
> >>> something that takes hardly any time with H_PUT_TCE_INDIRECT,  takes
> >>> 13secs per device for H_PUT_TCE approach, during boot. This is with a
> >>> 30GB guest. With larger guest, the time will further detoriate.
> >>
> >>
> >> No it will not, I checked. The time is the same for 2GB and 32GB guests-
> >> the delay is caused by clearing the small DMA window which is small by
> >> the space mapped (1GB) but quite huge in TCEs as it uses 4K pages; and
> >> for DDW window + emulated devices the IOMMU page size will be 2M/16M/1G
> >> (depends on the system) so the number of TCEs is much smaller.
> > 
> > I cant get your results.  What changes did you make to get it?
> 
> 
> Get what? I passed "-m 2G" and "-m 32G", got the same time - 13s spent
> in clearing the default window and the huge window took a fraction of a
> second to create and map.

Is this if we disable FW_FEATURE_MULTITCE in the guest and force the use
of H_PUT_TCE everywhere?

In theory couldn't we leave FW_FEATURE_MULTITCE in place so that
iommu_table_clear() can still use H_STUFF_TCE (which I guess is basically
instant), and then force H_PUT_TCE for new mappings via something like:

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 6ba081dd61c9..85d092baf17d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -194,6 +194,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
unsigned long flags;
 
if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
+   if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE) || 
is_secure_guest()) {
return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
   direction, attrs);
}

That seems like it would avoid the extra 13s.

If we take the additional step of only mapping SWIOTLB range in
enable_ddw() for is_secure_guest() that might further improve things
(though the bigger motivation with that is the extra isolation it would
grant us for stuff behind the IOMMU, since it apparently doesn't affect
boot-time all that much)
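
Something along these lines in enable_ddw() is what I have in mind (untested
sketch; io_tlb_start/io_tlb_end are the global SWIOTLB bounds, other names as
in pseries/iommu.c):

  if (is_secure_guest())
          ret = walk_system_ram_range(PFN_DOWN(io_tlb_start),
                          PFN_UP(io_tlb_end) - PFN_DOWN(io_tlb_start),
                          win64->value, tce_setrange_multi_pSeriesLP_walk);
  else
          ret = walk_system_ram_range(0, memblock_end_of_DRAM() >> PAGE_SHIFT,
                          win64->value, tce_setrange_multi_pSeriesLP_walk);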

> 
> 
> 
>  -global
>  spapr-pci-host-bridge.dma_win_size=0x400
> >>>
> >>> This option, speeds it up tremendously.  But than should this option be
> >>> enabled in qemu by default?  only for secure VMs? for both VMs?
> >>
> >>
> >> As discussed in slack, by default we do not need to clear the entire TCE
> >> table and we only have to map swiotlb buffer using the small window. It
> >> is a guest kernel change only. Thanks,
> > 
> > Can you tell me what code you are talking about here.  Where is the TCE
> > table getting cleared? What code needs to be changed to not clear it?
> 
> 
> pci_dma_bus_setup_pSeriesLP()
> iommu_init_table()
> iommu_table_clear()
> for () tbl->it_ops->get()
> 
> We do not really need to clear it there, we only need it for VFIO with
> IOMMU SPAPR TCE v1 which reuses these tables but there are
> iommu_take_ownership/iommu_release_ownership to clear these tables. I'll
> send a patch for this.


> 
> 
> > Is the code in tce_buildmulti_pSeriesLP(), the one that does the clear
> > aswell?
> 
> 
> This one does not need to clear TCEs as this creates a window of known
> size and maps it all.
> 
> Well, actually, it only maps actual guest RAM, if there are gaps in RAM,
> then TCEs for the gaps will have what hypervisor had there (which is
> zeroes, qemu/kvm clears it anyway).
> 
> 
> > But before I close, you have not told me clearly, what is the problem
> > with;  'share the page, make the H_PUT_INDIRECT_TCE hcall, unshare the 
> > page'.
> 
> Between share and unshare you have a (tiny) window of opportunity to
> attack the guest. No, I do not know how exactly.
> 
> For example, the hypervisor does a lot of PHB+PCI hotplug-unplug 

Re: [PATCH v5 2/2] powerpc/pseries/iommu: Use dma_iommu_ops for Secure VM.

2019-12-11 Thread Michael Roth
Quoting Alexey Kardashevskiy (2019-12-11 02:36:29)
> 
> 
> On 11/12/2019 12:43, Michael Roth wrote:
> > Quoting Ram Pai (2019-12-06 19:12:39)
> >> Commit edea902c1c1e ("powerpc/pseries/iommu: Don't use dma_iommu_ops on
> >> secure guests")
> >> disabled dma_iommu_ops path, for secure VMs. Disabling dma_iommu_ops
> >> path for secure VMs, helped enable dma_direct path.  This enabled
> >> support for bounce-buffering through SWIOTLB.  However it fails to
> >> operate when IOMMU is enabled, since I/O pages are not TCE mapped.
> >>
> >> Renable dma_iommu_ops path for pseries Secure VMs.  It handles all
> >> cases including, TCE mapping I/O pages, in the presence of a
> >> IOMMU.
> > 
> > Wasn't clear to me at first, but I guess the main gist of this series is
> > that we want to continue to use SWIOTLB, but also need to create mappings
> > of its bounce buffers in the IOMMU, so we revert to using dma_iommu_ops
> > and rely on the various dma_iommu_{map,alloc}_bypass() hooks throughout
> > to call into dma_direct_* ops rather than relying on the dma_is_direct(ops)
> > checks in DMA API functions to do the same.
> 
> 
> Correct. Took me a bit of time to realize what we got here :) We only
> rely on  dma_iommu_ops::.dma_supported to write the DMA offset to a
> device (when creating a huge window), and after that we know it is
> mapped directly and swiotlb gets this 1<<59 offset via __phys_to_dma().
> 
> 
> > That makes sense, but one issue I see with that is that
> > dma_iommu_map_bypass() only tests true if all the following are true:
> > 
> > 1) the device requests a 64-bit DMA mask via
> >dma_set_mask/dma_set_coherent_mask
> > 2) DDW is enabled (i.e. we don't pass disable_ddw on command-line)
> > 
> > dma_is_direct() checks don't have this limitation, so I think for
> > any other cases, such as devices that use a smaller DMA mask, we'll
> > end up falling back to the non-bypass functions in dma_iommu_ops, which
> > will likely break for things like dma_alloc_coherent/dma_map_single
> > since they won't use SWIOTLB pages and won't do the necessary calls to
> > set_memory_unencrypted() to share those non-SWIOTLB buffers with
> > hypervisor.
> > 
> > Maybe that's ok, but I think we should be clearer about how to
> > fail/handle these cases.
> > 
> > Though I also agree with some concerns Alexey stated earlier: it seems
> > wasteful to map the entire DDW window just so these bounce buffers can be
> > mapped.  Especially if you consider the lack of a mapping to be an 
> > additional
> > safe-guard against things like buggy device implementations on the QEMU
> > side. E.g. if we leaked pages to the hypervisor on accident, those pages
> > wouldn't be immediately accessible to a device, and would still require
> > additional work to get past the IOMMU.
> > 
> > What would it look like if we try to make all this work with disable_ddw 
> > passed
> > to kernel command-line (or forced for is_secure_guest())?
> > 
> >   1) dma_iommu_{alloc,map}_bypass() would no longer get us to dma_direct_* 
> > ops,
> >  but an additional case or hook that considers is_secure_guest() might 
> > do
> >  it.
> >  
> >   2) We'd also need to set up an IOMMU mapping for the bounce buffers via
> >  io_tlb_start/io_tlb_end. We could do it once, on-demand via
> >  dma_iommu_bypass_supported() like we do for the 64-bit DDW window, or
> >  maybe in some init function.
> 
> 
> io_tlb_start/io_tlb_end are only guaranteed to stay within 4GB and our
> default DMA window is 1GB (KVM) or 2GB (PowerVM), ok, we can define
> ARCH_LOW_ADDRESS_LIMIT as 1GB.

True, and limiting allocations to under 1GB might be brittle (also saw a
patch floating around that increased IO_TLB_DEFAULT_SIZE to 1GB,
which obviously wouldn't work out with this approach, but not sure if
that's still needed or not: "powerpc/svm: Increase SWIOTLB buffer size")

However that's only an issue if we insist on using an identity mapping
in the IOMMU, which would be nice because non-IOMMU virtio would
magically work, but since that's not a goal of this series I think we do
have the option of mapping io_tlb_start at DMA address 0 (or
thereabouts).

We'd probably need to modify __phys_to_dma to treat archdata.dma_offset
as a negative offset in this case, but it seems like it would work about
the same as with DDW offset.
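
i.e. something like this (sketch only, assuming archdata.dma_offset holds the
window offset as it does for the DDW case):

  static inline dma_addr_t __phys_to_dma(struct device *dev, phys_addr_t paddr)
  {
          /* map io_tlb_start to DMA address 0 by subtracting the offset */
          if (is_secure_guest())
                  return paddr - dev->archdata.dma_offset;

          return paddr + dev->archdata.dma_offset;
  }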

But yah, it does make things a bit less appealing than what I was initially
thinking with that approach...

> 
> But it has also been mentioned that we are likely to be 

Re: [PATCH v5 2/2] powerpc/pseries/iommu: Use dma_iommu_ops for Secure VM.

2019-12-10 Thread Michael Roth
Quoting Ram Pai (2019-12-06 19:12:39)
> Commit edea902c1c1e ("powerpc/pseries/iommu: Don't use dma_iommu_ops on
> secure guests")
> disabled dma_iommu_ops path, for secure VMs. Disabling dma_iommu_ops
> path for secure VMs, helped enable dma_direct path.  This enabled
> support for bounce-buffering through SWIOTLB.  However it fails to
> operate when IOMMU is enabled, since I/O pages are not TCE mapped.
> 
> Renable dma_iommu_ops path for pseries Secure VMs.  It handles all
> cases including, TCE mapping I/O pages, in the presence of a
> IOMMU.

Wasn't clear to me at first, but I guess the main gist of this series is
that we want to continue to use SWIOTLB, but also need to create mappings
of its bounce buffers in the IOMMU, so we revert to using dma_iommu_ops
and rely on the various dma_iommu_{map,alloc}_bypass() hooks throughout
to call into dma_direct_* ops rather than relying on the dma_is_direct(ops)
checks in DMA API functions to do the same.

That makes sense, but one issue I see with that is that
dma_iommu_map_bypass() only tests true if all the following are true:

1) the device requests a 64-bit DMA mask via
   dma_set_mask/dma_set_coherent_mask
2) DDW is enabled (i.e. we don't pass disable_ddw on command-line)

dma_is_direct() checks don't have this limitation, so I think for
any other cases, such as devices that use a smaller DMA mask, we'll
end up falling back to the non-bypass functions in dma_iommu_ops, which
will likely break for things like dma_alloc_coherent/dma_map_single
since they won't use SWIOTLB pages and won't do the necessary calls to
set_memory_unencrypted() to share those non-SWIOTLB buffers with
hypervisor.

Maybe that's ok, but I think we should be clearer about how to
fail/handle these cases.

Though I also agree with some concerns Alexey stated earlier: it seems
wasteful to map the entire DDW window just so these bounce buffers can be
mapped.  Especially if you consider the lack of a mapping to be an additional
safe-guard against things like buggy device implementations on the QEMU
side. E.g. if we leaked pages to the hypervisor on accident, those pages
wouldn't be immediately accessible to a device, and would still require
additional work to get past the IOMMU.

What would it look like if we try to make all this work with disable_ddw passed
to kernel command-line (or forced for is_secure_guest())?

  1) dma_iommu_{alloc,map}_bypass() would no longer get us to dma_direct_* ops,
 but an additional case or hook that considers is_secure_guest() might do
 it.
 
  2) We'd also need to set up an IOMMU mapping for the bounce buffers via
 io_tlb_start/io_tlb_end. We could do it once, on-demand via
 dma_iommu_bypass_supported() like we do for the 64-bit DDW window, or
 maybe in some init function.

That also has the benefit of not requiring devices to support 64-bit DMA.

Alternatively, we could continue to rely on the 64-bit DDW window, but
modify the call to enable_ddw() to only map the io_tlb_start/end range in
the case of is_secure_guest(). This is a little cleaner implementation-wise
since we can rely on the existing dma_iommu_{alloc,map}_bypass() hooks, but
devices that don't support 64-bit will fall back to not using dma_direct_* ops
and fail miserably. We'd probably want to handle that more gracefully.

Or we handle both cases gracefully. To me it makes more sense to enable the
non-DDW case, then consider adding the DDW case later if there's some reason
why 64-bit DMA is needed. But would be good to hear if there are other
opinions.

> 
> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 11 +--
>  1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 67b5009..4e27d66 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -36,7 +36,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
> 
>  #include "pseries.h"
> @@ -1346,15 +1345,7 @@ void iommu_init_early_pSeries(void)
> of_reconfig_notifier_register(_reconfig_nb);
> register_memory_notifier(_mem_nb);
> 
> -   /*
> -* Secure guest memory is inacessible to devices so regular DMA isn't
> -* possible.
> -*
> -* In that case keep devices' dma_map_ops as NULL so that the generic
> -* DMA code path will use SWIOTLB to bounce buffers for DMA.
> -*/
> -   if (!is_secure_guest())
> -   set_pci_dma_ops(_iommu_ops);
> +   set_pci_dma_ops(_iommu_ops);
>  }
> 
>  static int __init disable_multitce(char *str)
> -- 
> 1.8.3.1
> 


Re: [RFC v1 0/2] Enable IOMMU support for pseries Secure VMs

2019-11-06 Thread Michael Roth
Quoting Michael S. Tsirkin (2019-11-06 12:06:37)
> On Wed, Nov 06, 2019 at 12:59:50PM +1100, Alexey Kardashevskiy wrote:
> > 
> > 
> > On 05/11/2019 08:28, Ram Pai wrote:
> > > This patch series enables IOMMU support for pseries Secure VMs.
> > > 
> > > 
> > > Tested using QEMU command line option:
> > > 
> > >  "-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4,
> > > iommu_platform=on,disable-modern=off,disable-legacy=on"
> > >  and 
> > > 
> > >  "-device virtio-blk-pci,scsi=off,bus=pci.0,
> > > addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,
> > > iommu_platform=on,disable-modern=off,disable-legacy=on"
> > 
> > 
> > Worth mentioning that SLOF won't boot with such devices as SLOF does not 
> > know about iommu_platform=on.
> 
> Shouldn't be hard to support: set up the iommu to allow everything
> and ack the feature. Right?

It's not a static/linear mapping in this case so we need calls to map DMA
buffers as-needed. I've gotten it to boot with virtio-blk, but the patches
have some hacks and need cleanup, hoping to post them soon.

I'm a bit perplexed how we would manage to boot a guest without those
changes though, this is what I get with qemu 4.1.0:

  qemu-system-ppc64 -M pseries,ic-mode=xics -m 512M -bios 
/home/mdroth/w/build/qemu-4.1.0-build/pc-bios/slof.bin -device 
virtio-blk-pci,drive=drive0,id=blk0,disable-modern=off,disable-legacy=on,iommu_platform=on
 -drive 
file=/home/mdroth/vm/bionic-server-cloudimg-ppc64el.img,if=none,id=drive0 
-trace enable=spapr_iommu\*,file=trace.out -monitor 
unix:/tmp/mon.sock,server,nowait -vga none -nographic
  qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-cfpc=workaround
  qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-sbbc=workaround
  qemu-system-ppc64: warning: TCG doesn't support requested feature, 
cap-ibs=workaround
  
  
  SLOF **
  QEMU Starting
   Build Date = Jul  3 2019 12:26:14
   FW Version = git-ba1ab360eebe6338
   Press "s" to enter Open Firmware.
  
  Populating /vdevice methods
  Populating /vdevice/vty@7100
  Populating /vdevice/nvram@7101
  Populating /vdevice/l-lan@7102
  Populating /vdevice/v-scsi@7103
 SCSI: Looking for devices
8200 CD-ROM   : "QEMU QEMU CD-ROM  2.5+"
  Populating /pci@8002000
   00  (D) : 1af4 1042virtio [ block ]
  No NVRAM common partition, re-initializing...
  Scanning USB
  Using default console: /vdevice/vty@7100
  
Welcome to Open Firmware
  
Copyright (c) 2004, 2017 IBM Corporation All rights reserved.
This program and the accompanying materials are made available
under the terms of the BSD License available at
http://www.opensource.org/licenses/bsd-license.php
  
  
  Trying to load:  from: /pci@8002000/scsi@0 ... virtioblk_init: failed
  virtioblk_transfer: Access beyond end of device!

And then it hangs. This is with TCG so maybe it behaves differently with
KVM, but that's the result I would expect with the current SLOF code.

> 
> > > 
> > > Ram Pai (2):
> > >   powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.
> > >   powerpc/pseries/iommu: Use dma_iommu_ops for Secure VMs aswell.
> > > 
> > >  arch/powerpc/platforms/pseries/iommu.c | 30 
> > > ++
> > >  1 file changed, 18 insertions(+), 12 deletions(-)
> > > 
> > 
> > -- 
> > Alexey
> 


[PATCH v2] KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag

2019-09-11 Thread Michael Roth
c0002073fb0ca900   R27 = c19e2464
  R12 = c0050790   R28 = c00812b0
  R13 = c000207fff623e00   R29 = c0002073fb0ca808
  R14 = 7d1bbee0   R30 = c0002073fb0ca800
  R15 = 7d1bcd60   R31 = 0800
  pc  = c020b260 smp_call_function_many+0x3d0/0x460
  cfar= c020b270 smp_call_function_many+0x3e0/0x460
  lr  = c020b20c smp_call_function_many+0x37c/0x460
  msr = 90010288b033   cr  = 44024824
  ctr = c0050790   xer =    trap =  100

CPU 42 is running normally, doing VCPU work:

  # CPU 42 stack trace (via xmon)
  [link register   ] c0081be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 
[kvm_hv]
  [c08ed3343820] c08ed3343850 (unreliable)
  [c08ed33438d0] c0081be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 
[kvm_hv]
  [c08ed33439d0] c0081be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv]
  [c08ed3343ae0] c0081c10891c kvmppc_vcpu_run+0x34/0x48 [kvm]
  [c08ed3343b00] c0081c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
  [c08ed3343b90] c0081c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm]
  [c08ed3343d00] c0475450 do_vfs_ioctl+0xe0/0xc70
  [c08ed3343db0] c04760e4 ksys_ioctl+0x104/0x120
  [c08ed3343e00] c0476128 sys_ioctl+0x28/0x80
  [c08ed3343e20] c000b388 system_call+0x5c/0x70
  --- Exception: c00 (System Call) at 7d545cfd7694
  SP (7d53ff7edf50) is in userspace

It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION]
was set for CPU 42 by at least 1 of the CPUs waiting in
smp_call_function_many(), but somehow the corresponding
call_single_queue entries were never processed by CPU 42, causing the
callers to spin in csd_lock_wait() indefinitely.

Nick Piggin suggested something similar to the following sequence as
a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE
IPI messages seems to be most common, but any mix of CALL_FUNCTION and
!CALL_FUNCTION messages could trigger it):

CPU
  X: smp_muxed_ipi_set_message():
  X:   smp_mb()
  X:   message[RESCHEDULE] = 1
  X: doorbell_global_ipi(42):
  X:   kvmppc_set_host_ipi(42, 1)
  X:   ppc_msgsnd_sync()/smp_mb()
  X:   ppc_msgsnd() -> 42
 42: doorbell_exception(): // from CPU X
 42:   ppc_msgsync()
105: smp_muxed_ipi_set_message():
105:   smp_mb()
 // STORE DEFERRED DUE TO RE-ORDERING
  --105:   message[CALL_FUNCTION] = 1
  | 105: doorbell_global_ipi(42):
  | 105:   kvmppc_set_host_ipi(42, 1)
  |  42:   kvmppc_set_host_ipi(42, 0)
  |  42: smp_ipi_demux_relaxed()
  |  42: // returns to executing guest
  |  // RE-ORDERED STORE COMPLETES
  ->105:   message[CALL_FUNCTION] = 1
105:   ppc_msgsnd_sync()/smp_mb()
105:   ppc_msgsnd() -> 42
 42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
105: // hangs waiting on 42 to process messages/call_single_queue

This can be prevented with an smp_mb() at the beginning of
kvmppc_set_host_ipi(), such that stores to message[] (or other
state indicated by the host_ipi flag) are ordered vs. the store
to host_ipi.

However, doing so might still allow for the following scenario (not
yet observed):

CPU
  X: smp_muxed_ipi_set_message():
  X:   smp_mb()
  X:   message[RESCHEDULE] = 1
  X: doorbell_global_ipi(42):
  X:   kvmppc_set_host_ipi(42, 1)
  X:   ppc_msgsnd_sync()/smp_mb()
  X:   ppc_msgsnd() -> 42
 42: doorbell_exception(): // from CPU X
 42:   ppc_msgsync()
 // STORE DEFERRED DUE TO RE-ORDERING
  -- 42:   kvmppc_set_host_ipi(42, 0)
  |  42: smp_ipi_demux_relaxed()
  | 105: smp_muxed_ipi_set_message():
  | 105:   smp_mb()
  | 105:   message[CALL_FUNCTION] = 1
  | 105: doorbell_global_ipi(42):
  | 105:   kvmppc_set_host_ipi(42, 1)
  |  // RE-ORDERED STORE COMPLETES
  -> 42:   kvmppc_set_host_ipi(42, 0)
 42: // returns to executing guest
105:   ppc_msgsnd_sync()/smp_mb()
105:   ppc_msgsnd() -> 42
 42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
105: // hangs waiting on 42 to process messages/call_single_queue

Fixing this scenario would require an smp_mb() *after* clearing
host_ipi flag in kvmppc_set_host_ipi() to order the store vs.
subsequent processing of IPI messages.

To handle both cases, this patch splits kvmppc_set_host_ipi() into
separate set/clear functions, where we execute smp_mb() prior to
setting host_ipi flag, and after clearing host_ipi flag. These
functions pair with each other to synchronize the sender and receiver
sides.
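
Concretely, the split looks like this (with existing callers converted
over to the new pair):

static inline void kvmppc_set_host_ipi(int cpu)
{
   smp_mb();
   paca_ptrs[cpu]->kvm_hstate.host_ipi = 1;
}

static inline void kvmppc_clear_host_ipi(int cpu)
{
   paca_ptrs[cpu]->kvm_hstate.host_ipi = 0;
   smp_mb();
}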

With that change in place the above workload ran for 20 hours without
triggering any lock-ups.

Fixes: 755563bc79c7 ("powerpc/powernv: Fixes for hypervisor doorbell handling") 
# v4.0
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: kvm-...@vger.kernel.org
Signed-off-by: Michael Roth 
---
v2:
 - changelog: don't reference CPUs in hex
 - changelog: make pseudocode clearer
 - ch

Re: [PATCH] KVM: PPC: Book3S HV: add smp_mb() in kvmppc_set_host_ipi()

2019-09-10 Thread Michael Roth
Quoting Michael Roth (2019-09-05 18:21:22)
> Quoting Michael Ellerman (2019-09-04 22:04:48)
> > That raises the question of whether this needs to be a full barrier or
> > just a write barrier, and where is the matching barrier on the reading
> > side?
> 
> For this particular case I think the same barrier orders it on the
> read-side via kvmppc_set_host_ipi(42, 0) above, but I'm not sure that
> work as a general solution, unless maybe we make that sort of usage
> (clear-before-processing) part of the protocol of using
> kvmppc_set_host_ipi()... it makes sense given we already need to take
> care to not miss clearing them else we get issues like what was fixed
> in 755563bc79c7, which introduced the clear in doorbell_exception(). So
> then it's a matter of additionally making sure we do it prior to
> processing host_ipi state. I haven't looked too closely at the other
> users of kvmppc_set_host_ipi() yet though.



> As far as using rw barriers, I can't think of any reason we couldn't, but
> I wouldn't say I'm at all confident in declaring that safe atm...

I think we need a full barrier after all. The following seems possible
otherwise:

  CPU
X: smp_mb()
X: ipi_message[RESCHEDULE] = 1
X: kvmppc_set_host_ipi(42, 1)
X: smp_mb()
X: doorbell/msgsnd -> 42
   42: doorbell_exception() (from CPU X)
   42: msgsync
   42: kvmppc_set_host_ipi(42, 0) // STORE DEFERRED DUE TO RE-ORDERING
   42: smp_ipi_demux_relaxed()
  105: smp_mb()
  105: ipi_message[CALL_FUNCTION] = 1
  105: smp_mb()
  105: kvmppc_set_host_ipi(42, 1)
   42: kvmppc_set_host_ipi(42, 0) // RE-ORDERED STORE COMPLETES
   42: // returns to executing guest
  105: doorbell/msgsnd -> 42
   42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
  105: // hangs waiting on 42 to process messages/call_single_queue

However that also means the current patch is insufficient, since the
barrier for preventing this scenario needs to come *after* setting
paca_ptrs[cpu]->kvm_hstate.host_ipi to 0.

So I think the right interface is for this is to split
kvmppc_set_host_ipi out into:

static inline void kvmppc_set_host_ipi(int cpu)
{
   smp_mb();
   paca_ptrs[cpu]->kvm_hstate.host_ipi = 1;
}

static inline void kvmppc_clear_host_ipi(int cpu)
{
   paca_ptrs[cpu]->kvm_hstate.host_ipi = 0;
   smp_mb();
}
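
Callers would then switch over accordingly -- e.g. in the scenarios
above, doorbell_global_ipi() would use kvmppc_set_host_ipi(cpu) before
the msgsnd, and doorbell_exception() would use kvmppc_clear_host_ipi(cpu)
before calling smp_ipi_demux_relaxed().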


Re: [PATCH] KVM: PPC: Book3S HV: add smp_mb() in kvmppc_set_host_ipi()

2019-09-05 Thread Michael Roth
Quoting Michael Ellerman (2019-09-04 22:04:48)
> Hi Mike,
> 
> Thanks for the patch & great change log, just a few comments.

Hi Michael,

Thank you for the suggestions. I will roll them into v2 unless otherwise
noted below.

> 
> Michael Roth  writes:
> > On a 2-socket Witherspoon system with 128 cores and 1TB of memory
>^
>Power9 (not everyone knows what a Witherspoon is)
> 
> > running the following guest configs:
> >
> >   guest A:
> > - 224GB of memory
> > - 56 VCPUs (sockets=1,cores=28,threads=2), where:
> >   VCPUs 0-1 are pinned to CPUs 0-3,
> >   VCPUs 2-3 are pinned to CPUs 4-7,
> >   ...
> >   VCPUs 54-55 are pinned to CPUs 108-111
> >
> >   guest B:
> > - 4GB of memory
> > - 4 VCPUs (sockets=1,cores=4,threads=1)
> >
> > with the following workloads (with KSM and THP enabled in all):
> >
> >   guest A:
> > stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M
> >
> >   guest B:
> > stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M
> >
> >   host:
> > stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M
> >
> > the below soft-lockup traces were observed after an hour or so and
> > persisted until the host was reset (this was found to be reliably
> > reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
> > and 5.3-rc5):
> >
> >   [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
> >   [ 1253.183319] rcu: 124-: (5250 ticks this GP) 
> > idle=10a/1/0x4002 softirq=5408/5408 fqs=1941
> >   [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 
> > 52/KVM:19709]
> >   [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! 
> > [worker:19913]
> >   [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! 
> > [worker:20331]
> >   [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! 
> > [worker:20338]
> >   [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! 
> > [avocado:19525]
> >   [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! 
> > [ksmd:791]
> >   [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
> >   [ 1316.198032] rcu: 124-: (21003 ticks this GP) 
> > idle=10a/1/0x4002 softirq=5408/5408 fqs=8243
> >   [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! 
> > [ksmd:791]
> >   [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
> >   [ 1379.212629] rcu: 124-: (36756 ticks this GP) 
> > idle=10a/1/0x4002 softirq=5408/5408 fqs=14714
> >   [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! 
> > [ksmd:791]
> >   [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
> >   [ 1442.227115] rcu: 124-: (52509 ticks this GP) 
> > idle=10a/1/0x4002 softirq=5408/5408 fqs=21403
> >   [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
> >   [ 1455.111822]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
> >   [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> > disables this message.
> >   [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
> >   [ 1455.111905]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
> >   [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> > disables this message.
> >   [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
> >   [ 1455.111986]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
> >   [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> > disables this message.
> >   [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
> >   [ 1455.112068]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
> >   [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> > disables this message.
> >   [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
> >   [ 1455.112159]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
> >   [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> > disables this message.
> >   [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
> >   [ 1455.112231]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
> >   [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> &

[PATCH] KVM: PPC: Book3S HV: add smp_mb() in kvmppc_set_host_ipi()

2019-09-04 Thread Michael Roth
02073fb0ca900   R27 = c19e2464
  R12 = c0050790   R28 = c00812b0
  R13 = c000207fff623e00   R29 = c0002073fb0ca808
  R14 = 7d1bbee0   R30 = c0002073fb0ca800
  R15 = 7d1bcd60   R31 = 0800
  pc  = c020b260 smp_call_function_many+0x3d0/0x460
  cfar= c020b270 smp_call_function_many+0x3e0/0x460
  lr  = c020b20c smp_call_function_many+0x37c/0x460
  msr = 90010288b033   cr  = 44024824
  ctr = c0050790   xer =    trap =  100

CPU 42 is running normally, doing VCPU work:

  1f:mon> c2a
  [link register   ] c0081be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 
[kvm_hv]
  [c08ed3343820] c08ed3343850 (unreliable)
  [c08ed33438d0] c0081be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 
[kvm_hv]
  [c08ed33439d0] c0081be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv]
  [c08ed3343ae0] c0081c10891c kvmppc_vcpu_run+0x34/0x48 [kvm]
  [c08ed3343b00] c0081c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
  [c08ed3343b90] c0081c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm]
  [c08ed3343d00] c0475450 do_vfs_ioctl+0xe0/0xc70
  [c08ed3343db0] c04760e4 ksys_ioctl+0x104/0x120
  [c08ed3343e00] c0476128 sys_ioctl+0x28/0x80
  [c08ed3343e20] c000b388 system_call+0x5c/0x70
  --- Exception: c00 (System Call) at 7d545cfd7694
  SP (7d53ff7edf50) is in userspace

It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION]
was set for CPU 42 by at least 1 of the CPUs waiting in
smp_call_function_many(), but somehow the corresponding
call_single_queue entries were never processed, causing the callers
to spin in csd_lock_wait() indefinitely.

Nick Piggin suggested something similar to the following sequence as
a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE
IPI messages seems to be most common, but any mix of CALL_FUNCTION and
!CALL_FUNCTION messages could trigger it):

  CPU
X: smp_mb()
X: ipi_message[RESCHEDULE] = 1
X: kvmppc_set_host_ipi(42, 1)
X: smp_mb()
X: doorbell/msgsnd -> 42
   42: doorbell_exception() (from CPU X)
   42: msgsync
  105: smp_mb()
  105: kvmppc_set_host_ipi(42, 1)
   42: kvmppc_set_host_ipi(42, 0)
   42: smp_ipi_demux_relaxed()
   42: // returns to executing guest
  105: ipi_message[CALL_FUNCTION] = 1 // due to re-ordering
  105: smp_mb()
  105: doorbell/msgsnd -> 42
   42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
  105: // hangs waiting on 42 to process messages/call_single_queue

This patch avoids that scenario by placing a barrier at the start
of kvmppc_set_host_ipi() such that the storing of these messages (or
other state indicated by host_ipi being set) cannot be re-ordered past
it.

With this the above workload ran for 6 hours (so far) without
triggering any lock-ups.

Cc: Paul Mackerras 
Cc: Nicholas Piggin 
Cc: kvm-...@vger.kernel.org
Signed-off-by: Michael Roth 
---
 arch/powerpc/include/asm/kvm_ppc.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 2484e6a8f5ca..254abad0f55e 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -454,6 +454,7 @@ static inline u32 kvmppc_get_xics_latch(void)
 
 static inline void kvmppc_set_host_ipi(int cpu, u8 host_ipi)
 {
+   smp_mb();
paca_ptrs[cpu]->kvm_hstate.host_ipi = host_ipi;
 }
 
-- 
2.17.1



Re: [PATCH] bpf: fix overflow of bpf_jit_limit when PAGE_SIZE >= 64K

2018-12-10 Thread Michael Roth
Quoting Daniel Borkmann (2018-12-10 08:26:31)
> On 12/07/2018 04:36 PM, Michael Roth wrote:
> > Quoting Michael Ellerman (2018-12-07 06:31:13)
> >> Michael Roth  writes:
> >>
> >>> Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
> >>> JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
> >>> and is adjusted again at init time if MODULES_VADDR is defined.
> >>>
> >>> For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
> >>
> >> But maybe it should be, I don't know why we don't define it.
> >>
> >>> the compile-time default at boot-time, which is 0x9c400000 when
> >>> using 64K page size. This overflows the signed 32-bit bpf_jit_limit
> >>> value:
> >>>
> >>>   root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
> >>>   -1673527296
> >>>
> >>> and can cause various unexpected failures throughout the network
> >>> stack. In one case `strace dhclient eth0` reported:
> >>>
> >>>   setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, 
> >>> filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524)
> >>>
> >>> and similar failures can be seen with tools like tcpdump. This doesn't
> >>> always reproduce however, and I'm not sure why. The more consistent
> >>> failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host
> >>> would time out on systemd/netplan configuring a virtio-net NIC with no
> >>> noticeable errors in the logs.
> >>>
> >>> Fix this by limiting the compile-time default for bpf_jit_limit to
> >>> INT_MAX.
> >>
> >> INT_MAX is a lot more than (4k * 40000), so I guess I'm not clear on
> >> whether we should be using PAGE_SIZE here at all. I guess each BPF
> >> program uses at least one page is the thinking?
> > 
> > That seems to be the case, at least, the max number of minimum-sized
> > allocations would be less on ppc64 since the allocations are always at
> > least PAGE_SIZE in size. The init-time default also limits to INT_MAX,
> > so it seemed consistent to do that here too.
> > 
> >>
> >> Thanks for tracking this down. For some reason none of my ~10 test boxes
> >> have hit this, perhaps I don't have new enough userspace?
> > 
> > I'm not too sure, I would've thought things like the dhclient case in
> > the commit log would fail every time, but sometimes I need to reboot the
> > guest before I start seeing the behavior. Maybe there's something special
> > about when JIT allocations are actually done that can affect
> > reproducibility?
> > 
> > In my case at least the virtio-net networking timeout was consistent
> > enough for a bisect, but maybe it depends on the specific network
> > configuration (single NIC, basic DHCP through netplan/systemd in my case).
> > 
> >>
> >> You don't mention why you needed to add BPF_MIN(), I assume because the
> >> kernel version of min() has gotten too complicated to work here?
> > 
> > I wasn't sure if it was safe here or not, so I tried looking at other
> > users and came across:
> > 
> > mm/vmalloc.c:777:#define VMAP_MIN(x, y)   ((x) < (y) ? (x) : 
> > (y)) /* can't use min() */
> > 
> > I'm not sure what the reasoning was (or whether it still applies), but I
> > figured it was safer to do the same here. Maybe Nick still recalls?
> > 
> >>
> >> Daniel I assume you'll merge this via your tree?
> >>
> >> cheers
> >>
> >>> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv 
> >>> allocations")
> >>> Cc: linuxppc-...@ozlabs.org
> >>> Cc: Daniel Borkmann 
> >>> Cc: Sandipan Das 
> >>> Cc: Alexei Starovoitov 
> >>> Signed-off-by: Michael Roth 
> 
> Thanks for the reports / fixes and sorry for my late reply (bit too
> swamped last week), some more thoughts below.
> 
> >>>  kernel/bpf/core.c | 3 ++-
> >>>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> >>> index b1a3545d0ec8..55de4746cdfd 100644
> >>> --- a/kernel/bpf/core.c
> >>> +++ b/kernel/bpf/core.c
> >>> @@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
> >>>  }
> >>>  
> >>>  #ifdef CONFIG_BPF_JIT
> >>> -# define BPF_JIT_L

Re: [PATCH] bpf: fix overflow of bpf_jit_limit when PAGE_SIZE >= 64K

2018-12-07 Thread Michael Roth
Quoting Michael Ellerman (2018-12-07 06:31:13)
> Michael Roth  writes:
> 
> > Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
> > JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
> > and is adjusted again at init time if MODULES_VADDR is defined.
> >
> > For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
> 
> But maybe it should be, I don't know why we don't define it.
> 
> > the compile-time default at boot-time, which is 0x9c400000 when
> > using 64K page size. This overflows the signed 32-bit bpf_jit_limit
> > value:
> >
> >   root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
> >   -1673527296
> >
> > and can cause various unexpected failures throughout the network
> > stack. In one case `strace dhclient eth0` reported:
> >
> >   setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8}, 
> > 16) = -1 ENOTSUPP (Unknown error 524)
> >
> > and similar failures can be seen with tools like tcpdump. This doesn't
> > always reproduce however, and I'm not sure why. The more consistent
> > failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host
> > would time out on systemd/netplan configuring a virtio-net NIC with no
> > noticeable errors in the logs.
> >
> > Fix this by limiting the compile-time default for bpf_jit_limit to
> > INT_MAX.
> 
> INT_MAX is a lot more than (4k * 40000), so I guess I'm not clear on
> whether we should be using PAGE_SIZE here at all. I guess each BPF
> program uses at least one page is the thinking?

That seems to be the case, at least, the max number of minimum-sized
allocations would be less on ppc64 since the allocations are always at
least PAGE_SIZE in size. The init-time default also limits to INT_MAX,
so it seemed consistent to do that here too.

> 
> Thanks for tracking this down. For some reason none of my ~10 test boxes
> have hit this, perhaps I don't have new enough userspace?

I'm not too sure, I would've thought things like the dhclient case in
the commit log would fail every time, but sometimes I need to reboot the
guest before I start seeing the behavior. Maybe there's something special
about when JIT allocations are actually done that can affect
reproducibility?

In my case at least the virtio-net networking timeout was consistent
enough for a bisect, but maybe it depends on the specific network
configuration (single NIC, basic DHCP through netplan/systemd in my case).

> 
> You don't mention why you needed to add BPF_MIN(), I assume because the
> kernel version of min() has gotten too complicated to work here?

I wasn't sure if it was safe here or not, so I tried looking at other
users and came across:

mm/vmalloc.c:777:#define VMAP_MIN(x, y) ((x) < (y) ? (x) : (y)) /* 
can't use min() */

I'm not sure what the reasoning was (or whether it still applies), but I
figured it was safer to do the same here. Maybe Nick still recalls?

> 
> Daniel I assume you'll merge this via your tree?
> 
> cheers
> 
> > Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv 
> > allocations")
> > Cc: linuxppc-...@ozlabs.org
> > Cc: Daniel Borkmann 
> > Cc: Sandipan Das 
> > Cc: Alexei Starovoitov 
> > Signed-off-by: Michael Roth 
> > ---
> >  kernel/bpf/core.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index b1a3545d0ec8..55de4746cdfd 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
> >  }
> >  
> >  #ifdef CONFIG_BPF_JIT
> > -# define BPF_JIT_LIMIT_DEFAULT   (PAGE_SIZE * 40000)
> > +# define BPF_MIN(x, y) ((x) < (y) ? (x) : (y))
> > +# define BPF_JIT_LIMIT_DEFAULT   BPF_MIN((PAGE_SIZE * 40000), INT_MAX)
> >  
> >  /* All BPF JIT sysctl knobs here. */
> >  int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
> > -- 
> > 2.17.1
> 


[PATCH] bpf: fix overflow of bpf_jit_limit when PAGE_SIZE >= 64K

2018-12-06 Thread Michael Roth
Commit ede95a63b5 introduced a bpf_jit_limit tuneable to limit BPF
JIT allocations. At compile time it defaults to PAGE_SIZE * 40000,
and is adjusted again at init time if MODULES_VADDR is defined.

For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with
the compile-time default at boot-time, which is 0x9c400000 when
using 64K page size. This overflows the signed 32-bit bpf_jit_limit
value:

  root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
  -1673527296
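  (i.e. with 64K pages, 65536 * 40000 = 2621440000 = 0x9c400000, which
  wraps around to -1673527296 when interpreted as a signed 32-bit value)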

and can cause various unexpected failures throughout the network
stack. In one case `strace dhclient eth0` reported:

  setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8}, 16) 
= -1 ENOTSUPP (Unknown error 524)

and similar failures can be seen with tools like tcpdump. This doesn't
always reproduce however, and I'm not sure why. The more consistent
failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host
would time out on systemd/netplan configuring a virtio-net NIC with no
noticeable errors in the logs.

Fix this by limiting the compile-time default for bpf_jit_limit to
INT_MAX.

Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv 
allocations")
Cc: linuxppc-...@ozlabs.org
Cc: Daniel Borkmann 
Cc: Sandipan Das 
Cc: Alexei Starovoitov 
Signed-off-by: Michael Roth 
---
 kernel/bpf/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index b1a3545d0ec8..55de4746cdfd 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -365,7 +365,8 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
 }
 
 #ifdef CONFIG_BPF_JIT
-# define BPF_JIT_LIMIT_DEFAULT (PAGE_SIZE * 40000)
+# define BPF_MIN(x, y) ((x) < (y) ? (x) : (y))
+# define BPF_JIT_LIMIT_DEFAULT BPF_MIN((PAGE_SIZE * 40000), INT_MAX)
 
 /* All BPF JIT sysctl knobs here. */
 int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
-- 
2.17.1



Re: [PATCH bpf] bpf: fix default unprivileged allocation limit

2018-12-06 Thread Michael Roth
Quoting Sandipan Das (2018-12-06 03:27:32)
> When using a large page size, the default value of the bpf_jit_limit
> knob becomes invalid and users are not able to run unprivileged bpf
> programs.
> 
> The bpf_jit_limit knob is represented internally as a 32-bit signed
> integer because of which the default value, i.e. PAGE_SIZE * 40000,
> overflows in case of an architecture like powerpc64 which uses 64K
> as the default page size (i.e. CONFIG_PPC_64K_PAGES is set).
> 
> So, instead of depending on the page size, use a constant value.
> 
> Fixes: ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv 
> allocations")

This also consistently caused a virtio-net KVM Ubuntu 18.04 guest to time out
on configuring networking during boot via systemd/netplan. A bisect
pointed to the same commit this patch addresses.

> Signed-off-by: Sandipan Das 
> ---
>  kernel/bpf/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index b1a3545d0ec8..a81d097a17fb 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -365,7 +365,7 @@ void bpf_prog_kallsyms_del_all(struct bpf_prog *fp)
>  }
> 
>  #ifdef CONFIG_BPF_JIT
> +# define BPF_JIT_LIMIT_DEFAULT (4096 * 40000)
> +# define BPF_JIT_LIMIT_DEFAULT (4096 * 4)

This isn't quite right as we still use (bpf_jit_limit >> PAGE_SHIFT) to check
allocations in bpf_jit_charge_modmem(), so that should be fixed up as well.
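
For reference, the check I'm referring to looks roughly like this
(paraphrasing from kernel/bpf/core.c as of ede95a63b5e8, so possibly
not verbatim):

	static int bpf_jit_charge_modmem(u32 pages)
	{
		if (atomic_long_add_return(pages, &bpf_jit_current) >
		    (bpf_jit_limit >> PAGE_SHIFT)) {
			if (!capable(CAP_SYS_ADMIN)) {
				atomic_long_sub(pages, &bpf_jit_current);
				return -EPERM;
			}
		}

		return 0;
	}

i.e. the limit is consumed in units of pages, which is why just changing
the byte constant isn't the whole story on 64K-page kernels.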

Another alternative is to clamp BPF_JIT_LIMIT_DEFAULT to INT_MAX,
which fixes the issue for me and is similar to what
bpf_jit_charge_init() does for kernels where MODULES_VADDR is defined.
I'll go ahead and send the patch in case that seems preferable.

> 
>  /* All BPF JIT sysctl knobs here. */
>  int bpf_jit_enable   __read_mostly = IS_BUILTIN(CONFIG_BPF_JIT_ALWAYS_ON);
> -- 
> 2.19.2
> 



[PATCH] KVM: PPC: Book3S HV: fix handling for interrupted H_ENTER_NESTED

2018-11-08 Thread Michael Roth
Gibson 
Cc: Paul Mackerras 
Cc: Suraj Jitindar Singh 
Signed-off-by: Michael Roth 
---
 arch/powerpc/kvm/book3s_hv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index d65b961661fb..a56f8413758a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -983,6 +983,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
ret = kvmhv_enter_nested_guest(vcpu);
if (ret == H_INTERRUPT) {
kvmppc_set_gpr(vcpu, 3, 0);
+   vcpu->arch.hcall_needed = 0;
return -EINTR;
}
break;
-- 
2.17.1



[PATCH] powerpc/pseries: advertise Hot Plug Event support to firmware

2017-02-20 Thread Michael Roth
With the inclusion of:

  powerpc/pseries: Implement indexed-count hotplug memory remove
  powerpc/pseries: Implement indexed-count hotplug memory add

we now have complete handling of the RTAS hotplug event format
as described by PAPR via ACR "PAPR Changes for Hotplug RTAS Events".

This capability is indicated by byte 6, bit 5 of architecture
option vector 5, and allows for greater control over cpu/memory/pci
hot plug/unplug operations.
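
(In the OV5_* encoding used below, the upper byte is the option vector
byte index and the lower byte is the bit mask within that byte, so
OV5_HP_EVT = 0x0604 means byte 6, mask 0x04 -- i.e. bit 5 in IBM bit
numbering -- with OV5_INDX()/OV5_FEAT() extracting the two halves.)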

Existing pseries kernels will utilize this capability based on the
existence of the /event-sources/hot-plug-events DT property, so we
only need to advertise it via CAS and do not need a corresponding
FW_FEATURE_* value to test for.

Cc: Michael Ellerman <m...@ellerman.id.au>
Cc: Nathan Fontenot <nf...@linux.vnet.ibm.com>
Cc: David Gibson <da...@gibson.dropbear.id.au>
Signed-off-by: Michael Roth <mdr...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/prom.h | 1 +
 arch/powerpc/kernel/prom_init.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 2c8001c..4a90634 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -153,6 +153,7 @@ struct of_drconf_cell {
 #define OV5_XCMO   0x0440  /* Page Coalescing */
 #define OV5_TYPE1_AFFINITY 0x0580  /* Type 1 NUMA affinity */
 #define OV5_PRRN   0x0540  /* Platform Resource Reassignment */
+#define OV5_HP_EVT 0x0604  /* Hot Plug Event support */
 #define OV5_RESIZE_HPT 0x0601  /* Hash Page Table resizing */
 #define OV5_PFO_HW_RNG 0x1180  /* PFO Random Number Generator */
 #define OV5_PFO_HW_842 0x1140  /* PFO Compression Accelerator */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index f3c8799..1a835e7 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -839,7 +839,7 @@ struct ibm_arch_vec __cacheline_aligned 
ibm_architecture_vec = {
0,
 #endif
.associativity = OV5_FEAT(OV5_TYPE1_AFFINITY) | 
OV5_FEAT(OV5_PRRN),
-   .bin_opts = OV5_FEAT(OV5_RESIZE_HPT),
+   .bin_opts = OV5_FEAT(OV5_RESIZE_HPT) | OV5_FEAT(OV5_HP_EVT),
.micro_checkpoint = 0,
.reserved0 = 0,
.max_cpus = cpu_to_be32(NR_CPUS),   /* number of cores 
supported */
-- 
2.7.4



Re: [PATCH v6 3/3] powerpc/pseries: Implement indexed-count hotplug memory remove

2016-11-14 Thread Michael Roth
Quoting Nathan Fontenot (2016-10-18 12:21:06)
> From: Sahil Mehta 
> 
> Indexed-count remove for memory hotplug guarantees that a contiguous block
> of <count> lmbs beginning at a specified <drc index> will be unassigned (NOT
> that <count> lmbs will be removed). Because of Qemu's per-DIMM memory
> management, the removal of a contiguous block of memory currently
> requires a series of individual calls. Indexed-count remove reduces
> this series into a single call.
> 
> Signed-off-by: Sahil Mehta 
> Signed-off-by: Nathan Fontenot 
> ---
> v2: -use u32s drc_index and count instead of u32 ic[]
>  in dlpar_memory
> v3: -add logic to handle invalid drc_index input
> v4: -none
> v5: -Update for() loop to start at start_index
> v6: -none
> 
>  arch/powerpc/platforms/pseries/hotplug-memory.c |   90 
> +++
>  1 file changed, 90 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index badc66d..19ad081 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -558,6 +558,92 @@ static int dlpar_memory_remove_by_index(u32 drc_index, 
> struct property *prop)
> return rc;
>  }
> 
> +static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index,
> +struct property *prop)
> +{
> +   struct of_drconf_cell *lmbs;
> +   u32 num_lmbs, *p;
> +   int i, rc, start_lmb_found;
> +   int lmbs_available = 0, start_index = 0, end_index;
> +
> +   pr_info("Attempting to hot-remove %u LMB(s) at %x\n",
> +   lmbs_to_remove, drc_index);
> +
> +   if (lmbs_to_remove == 0)
> +   return -EINVAL;
> +
> +   p = prop->value;
> +   num_lmbs = *p++;
> +   lmbs = (struct of_drconf_cell *)p;
> +   start_lmb_found = 0;
> +
> +   /* Navigate to drc_index */
> +   while (start_index < num_lmbs) {
> +   if (lmbs[start_index].drc_index == drc_index) {
> +   start_lmb_found = 1;
> +   break;
> +   }
> +
> +   start_index++;
> +   }
> +
> +   if (!start_lmb_found)
> +   return -EINVAL;
> +
> +   end_index = start_index + lmbs_to_remove;
> +
> +   /* Validate that there are enough LMBs to satisfy the request */
> +   for (i = start_index; i < end_index; i++) {
> +   if (lmbs[i].flags & DRCONF_MEM_RESERVED)
> +   break;
> +
> +   lmbs_available++;
> +   }
> +
> +   if (lmbs_available < lmbs_to_remove)
> +   return -EINVAL;
> +
> +   for (i = start_index; i < end_index; i++) {
> +   if (!(lmbs[i].flags & DRCONF_MEM_ASSIGNED))
> +   continue;
> +
> +   rc = dlpar_remove_lmb(&lmbs[i]);

dlpar_remove_lmb() currently does both offlining of the memory as well
as releasing the LMB back to the platform, but the specification for
hotplug notifications has the following verbage regarding
indexed-count/count identifiers:

'When using “drc count” or “drc count indexed” as the Hotplug Identifier,
the OS should take steps to verify the entirety of the request can be
satisfied before proceeding with the hotplug / unplug operations. If
only a partial count can be satisfied, the OS should ignore the entirety
of the request. If the OS cannot determine this beforehand, it should
satisfy the hotplug / unplug request for as many of the requested
resources as possible, and attempt to revert to the original OS / DRC
state.'

So doing the dlpar_remove->dlpar_add in case of failure is in line with
the spec, but it should only be done as a last-resort. To me that
suggests that we should be attempting to offline all the LMBs
beforehand, and only after that's successful should we begin attempting
to release LMBs back to the platform. Should we consider introducing
that logic in the patchset? Or maybe as a follow-up?
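
Roughly what I have in mind, assuming dlpar_remove_lmb() were split into
hypothetical dlpar_offline_lmb()/dlpar_release_lmb() helpers (names made
up, just to illustrate the two phases):

	/* phase 1: offline everything first so we can still back out cleanly */
	for (i = start_index; i < end_index; i++) {
		rc = dlpar_offline_lmb(&lmbs[i]);
		if (rc)
			goto err_online_lmbs;	/* re-online what we offlined */
	}

	/* phase 2: only once that has succeeded, release the LMBs back to
	 * the platform */
	for (i = start_index; i < end_index; i++)
		dlpar_release_lmb(&lmbs[i]);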

> +   if (rc)
> +   break;
> +
> +   lmbs[i].reserved = 1;
> +   }
> +
> +   if (rc) {
> +   pr_err("Memory indexed-count-remove failed, adding any 
> removed LMBs\n");
> +
> +   for (i = start_index; i < end_index; i++) {
> +   if (!lmbs[i].reserved)
> +   continue;
> +
> +   rc = dlpar_add_lmb(&lmbs[i]);
> +   if (rc)
> +   pr_err("Failed to add LMB, drc index %x\n",
> +  be32_to_cpu(lmbs[i].drc_index));
> +
> +   lmbs[i].reserved = 0;
> +   }
> +   rc = -EINVAL;
> +   } else {
> +   for (i = start_index; i < end_index; i++) {
> +   if (!lmbs[i].reserved)
> +   

Re: [PATCH v4 2/2] powerpc/pseries: Implement indexed-count hotplug memory remove

2016-08-23 Thread Michael Roth
Quoting Sahil Mehta (2016-08-01 12:23:16)
> Indexed-count remove for memory hotplug guarantees that a contiguous block
> of <count> lmbs beginning at a specified <drc index> will be unassigned (NOT
> that <count> lmbs will be removed). Because of Qemu's per-DIMM memory
> management, the removal of a contiguous block of memory currently
> requires a series of individual calls. Indexed-count remove reduces
> this series into a single call.
> 
> Signed-off-by: Sahil Mehta 
> ---
> v2: -use u32s drc_index and count instead of u32 ic[]
>  in dlpar_memory
> v3: -add logic to handle invalid drc_index input
> v4: -none
> 
>  arch/powerpc/platforms/pseries/hotplug-memory.c |   90 
> +++
>  1 file changed, 90 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 2d4ceb3..dd5eb38 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -503,6 +503,92 @@ static int dlpar_memory_remove_by_index(u32 drc_index, 
> struct property *prop)
> return rc;
>  }
> 
> +static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index,
> +struct property *prop)
> +{
> +   struct of_drconf_cell *lmbs;
> +   u32 num_lmbs, *p;
> +   int i, rc, start_lmb_found;
> +   int lmbs_available = 0, start_index = 0, end_index;
> +
> +   pr_info("Attempting to hot-remove %u LMB(s) at %x\n",
> +   lmbs_to_remove, drc_index);
> +
> +   if (lmbs_to_remove == 0)
> +   return -EINVAL;
> +
> +   p = prop->value;
> +   num_lmbs = *p++;
> +   lmbs = (struct of_drconf_cell *)p;
> +   start_lmb_found = 0;
> +
> +   /* Navigate to drc_index */
> +   while (start_index < num_lmbs) {
> +   if (lmbs[start_index].drc_index == drc_index) {
> +   start_lmb_found = 1;
> +   break;
> +   }
> +
> +   start_index++;
> +   }
> +
> +   if (!start_lmb_found)
> +   return -EINVAL;
> +
> +   end_index = start_index + lmbs_to_remove;
> +
> +   /* Validate that there are enough LMBs to satisfy the request */
> +   for (i = start_index; i < end_index; i++) {
> +   if (lmbs[i].flags & DRCONF_MEM_RESERVED)
> +   break;
> +
> +   lmbs_available++;
> +   }
> +
> +   if (lmbs_available < lmbs_to_remove)
> +   return -EINVAL;
> +
> +   for (i = 0; i < end_index; i++) {

Shouldn't this be i = start_index? Otherwise it seems we'd attempt to
satisfy the request using LMBs outside of the requested range.

> +   if (!(lmbs[i].flags & DRCONF_MEM_ASSIGNED))
> +   continue;
> +
> +   rc = dlpar_remove_lmb(&lmbs[i]);
> +   if (rc)
> +   break;
> +
> +   lmbs[i].reserved = 1;
> +   }
> +
> +   if (rc) {
> +   pr_err("Memory indexed-count-remove failed, adding any 
> removed LMBs\n");
> +
> +   for (i = start_index; i < end_index; i++) {
> +   if (!lmbs[i].reserved)
> +   continue;
> +
> +   rc = dlpar_add_lmb(&lmbs[i]);
> +   if (rc)
> +   pr_err("Failed to add LMB, drc index %x\n",
> +  be32_to_cpu(lmbs[i].drc_index));
> +
> +   lmbs[i].reserved = 0;
> +   }
> +   rc = -EINVAL;
> +   } else {
> +   for (i = start_index; i < end_index; i++) {
> +   if (!lmbs[i].reserved)
> +   continue;
> +
> +   pr_info("Memory at %llx (drc index %x) was 
> hot-removed\n",
> +   lmbs[i].base_addr, lmbs[i].drc_index);
> +
> +   lmbs[i].reserved = 0;
> +   }
> +   }
> +
> +   return rc;
> +}
> +
>  #else
>  static inline int pseries_remove_memblock(unsigned long base,
>   unsigned int memblock_size)
> @@ -829,6 +915,10 @@ int dlpar_memory(struct pseries_hp_errorlog *hp_elog)
> } else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_INDEX) {
> drc_index = hp_elog->_drc_u.drc_index;
> rc = dlpar_memory_remove_by_index(drc_index, prop);
> +   } else if (hp_elog->id_type == PSERIES_HP_ELOG_ID_DRC_IC) {
> +   count = hp_elog->_drc_u.indexed_count[0];
> +   drc_index = hp_elog->_drc_u.indexed_count[1];
> +   rc = dlpar_memory_remove_by_ic(count, drc_index, 
> prop);
> } else {
> rc = -EINVAL;
> }
> 