Re: [PATCH] KVM: PPC: Book3S HV: Fix invalid use of register expression

2017-08-30 Thread Paul Mackerras
On Tue, Aug 15, 2017 at 02:37:01PM +1000, Michael Ellerman wrote:
> From: Andreas Schwab 
> 
> binutils >= 2.26 now warns about misuse of register expressions in
> assembler operands that are actually literals. In this instance r0 is
> being used where a literal 0 should be used.
> 
> Signed-off-by: Andreas Schwab 
> [mpe: Split into separate KVM patch, tweak change log]
> Signed-off-by: Michael Ellerman 
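
For illustration, the class of construct binutils now flags looks like this
(a hypothetical snippet, not the actual instance fixed by this patch, and
the exact warning text may differ):

/* With binutils >= 2.26, writing a register name in an operand that
 * the assembler treats as a literal (here the displacement field)
 * is flagged as an invalid register expression. */
static inline unsigned long load_first(unsigned long *p)
{
	unsigned long v;

	/* correct: literal 0 as the displacement */
	asm volatile("ld %0, 0(%1)" : "=r" (v) : "b" (p));
	/* asm volatile("ld %0, r0(%1)" ...) would now be warned about */
	return v;
}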

Thanks, patch applied to my kvm-ppc-next branch.

Paul.


Re: [PATCH v2 1/9] KVM: PPC: Book3S HV: Fix H_REGISTER_VPA VPA size validation

2017-08-30 Thread Paul Mackerras
On Sun, Aug 13, 2017 at 11:33:38AM +1000, Nicholas Piggin wrote:
> KVM currently validates the size of the VPA registered by the client
> against sizeof(struct lppaca), however we align (and therefore size)
> that struct to 1kB to avoid crossing a 4kB boundary in the client.
> 
> PAPR calls for sizes >= 640 bytes to be accepted. Hard code this with
> a comment.
> 
> Signed-off-by: Nicholas Piggin 
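
A minimal sketch of the resulting check (the shape and the error code are
assumptions, not the literal hunk from the patch):

/* Accept any VPA of at least the 640 bytes PAPR mandates, instead of
 * comparing against sizeof(struct lppaca), which is padded to 1kB so
 * that the structure never crosses a 4kB boundary in the client. */
if (len < 640)		/* PAPR: minimum VPA size */
	return H_PARAMETER;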

Thanks, patch applied to my kvm-ppc-next branch.

Paul.


Re: [PATCH 1/2] KVM: PPC: e500: fix some NULL dereferences on error

2017-08-30 Thread Paul Mackerras
On Thu, Jul 13, 2017 at 10:38:29AM +0300, Dan Carpenter wrote:
> There are some error paths in kvmppc_core_vcpu_create_e500() where we
> forget to set the error code.  It means that we return ERR_PTR(0) which
> is NULL and it results in a NULL pointer dereference in the caller.
> 
> Signed-off-by: Dan Carpenter 
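
The bug pattern being fixed, as a sketch (function and cache names are
illustrative, not the exact e500 code):

static struct kvm_vcpu *vcpu_create_sketch(void)
{
	struct kvm_vcpu *vcpu;
	int err = 0;	/* stays 0 on the buggy error paths */

	vcpu = kmem_cache_zalloc(vcpu_cache, GFP_KERNEL);
	if (!vcpu) {
		err = -ENOMEM;	/* the kind of assignment the fix adds */
		goto out;
	}
	return vcpu;
out:
	/* without the assignment above this is ERR_PTR(0), i.e. NULL,
	 * which an IS_ERR() check in the caller does not catch */
	return ERR_PTR(err);
}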

Thanks, both patches applied to my kvm-ppc-next branch.

Paul.


[PATCH kernel] powerpc/powernv: Update comment about shifting IOV BAR

2017-08-30 Thread Alexey Kardashevskiy
From: Benjamin Herrenschmidt 

From: Alexey Kardashevskiy 

This updates the comment about creating a hole in /proc/iomem, which
should not normally happen but does on the powernv platform due to the
way MMIO M64 BARs are organised in IODA2-capable hardware.

Signed-off-by: Alexey Kardashevskiy 
---

It has been mentioned multiple times (most recently at
https://www.spinics.net/lists/linux-pci/msg64084.html ) that the comment
is not informative enough for people not particularly familiar with
the POWER8 I/O hardware.

This attempt aims to:
1. explain why we shift the resource
2. explain why nothing can use that hole as a resource while it is "free"
(I am not sure that this is the case actually)

Please comment, everyone, let's have this very well documented while
I remember these bits :) Thanks.
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 48de308224d6..c4a36ae78c95 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1002,9 +1002,13 @@ static int pnv_pci_vf_resource_shift(struct pci_dev 
*dev, int offset)
}
 
/*
-* After doing so, there would be a "hole" in the /proc/iomem when
-* offset is a positive value. It looks like the device return some
-* mmio back to the system, which actually no one could use it.
+* Since the M64 BAR shares segments among all possible 256 PEs,
+* we have to shift the beginning of the PF IOV BAR so that it starts
+* from the segment belonging to the PE number assigned to the first VF.
+* This creates a "hole" in /proc/iomem which could be used for
+* allocating other resources, however this is not expected to happen
+* on IODA as the only possibility would be a PCI hotplug, and IODA
+* hardware only allows that on a slot with a dedicated PHB.
 */
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = &dev->resource[i + PCI_IOV_RESOURCES];
-- 
2.11.0



Re: [PATCH kernel] PCI: Disable IOV before pcibios_sriov_disable()

2017-08-30 Thread Alexey Kardashevskiy
On 31/08/17 05:02, Bjorn Helgaas wrote:
> On Fri, Aug 11, 2017 at 06:19:33PM +1000, Alexey Kardashevskiy wrote:
>> From: Gavin Shan 
>>
>> The PowerNV platform is the only user of pcibios_sriov_disable().
>> The IOV BAR could be shifted by pci_iov_update_resource(). The
>> warning message in the function is printed if the IOV capability
>> is in the enabled (PCI_SRIOV_CTRL_VFE && PCI_SRIOV_CTRL_MSE) state.
>>
>> This is the backtrace of what is happening:
>>pci_disable_sriov
>>sriov_disable
>>pnv_pci_sriov_disable
>>pnv_pci_vf_resource_shift
>>pci_update_resource
>>pci_iov_update_resource
>>
>> This fixes the issue by disabling IOV capability before calling
>> pcibios_sriov_disable(). With it, the disabling path matches
>> the enabling path: pcibios_sriov_enable() is called before the
>> IOV capability is enabled.
>>
>> Cc: shan.ga...@gmail.com
>> Cc: Benjamin Herrenschmidt 
>> Cc: Paul Mackerras 
>> Reported-by: Carol L Soto 
>> Signed-off-by: Gavin Shan 
>> Tested-by: Carol L Soto 
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>>
>> This is a repost. Since Gavin left the team, I am trying to push it out.
>> The previous conversation is here: https://patchwork.ozlabs.org/patch/732653/
> 
> I gave up on the previous issue.  I think this patch makes sense as-is
> at least as far as the fact that we can't update a struct resource
> while the device is still consuming it.  I reworked the changelog to
> emphasize that.
> 
> I assume the fact that pci_iov_update_resource() dropped the resource
> update caused some user-visible issue later on, and I might mention
> that, too, if I knew what it was.

I could not identify any issue so far in my test setup - I recreated VFs
several times and ran some traffic through them on one of the Mellanox
adapters, so the message+backtrace seems to be the only issue for now.


> Here's what I would consider putting on pci/virtualization (the diff
> is unchanged from your post):


This sounds good to me, thanks for updating the commit log.

I'll still try and finish my homework with updating that comment about the
hole in arch/powerpc/platforms/powernv/pci-ioda.c.

Cheers.

> 
> 
> commit 08132e7759b3929bea0ccdf8afe81ebf05351389
> Author: Gavin Shan 
> Date:   Fri Aug 11 18:19:33 2017 +1000
> 
> PCI: Disable VF decoding before updating resources in 
> pcibios_sriov_disable()
> 
> A struct resource represents the address space consumed by a device.  We
> should not modify that resource while the device is actively using the
> address space.  For VFs, pci_iov_update_resource() enforces this by
> printing a warning and doing nothing if the VFE (VF Enable) and MSE (VF
> Memory Space Enable) bits are set.
> 
> Previously, both sriov_enable() and sriov_disable() called the
> pcibios_sriov_disable() arch hook, which may update the struct resource,
> while VFE and MSE were enabled.  This effectively dropped the resource
> update pcibios_sriov_disable() intended to do.
> 
> Disable VF memory decoding before calling pcibios_sriov_disable().
> 
> Reported-by: Carol L Soto 
> Tested-by: Carol L Soto 
> Signed-off-by: Gavin Shan 
> Signed-off-by: Alexey Kardashevskiy 
> [bhelgaas: changelog]
> Signed-off-by: Bjorn Helgaas 
> Cc: shan.ga...@gmail.com
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 

> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 120485d6f352..ac41c8be9200 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -331,7 +331,6 @@ static int sriov_enable(struct pci_dev *dev, int 
> nr_virtfn)
>   while (i--)
>   pci_iov_remove_virtfn(dev, i, 0);
>  
> - pcibios_sriov_disable(dev);
>  err_pcibios:
>   iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
>   pci_cfg_access_lock(dev);
> @@ -339,6 +338,8 @@ static int sriov_enable(struct pci_dev *dev, int 
> nr_virtfn)
>   ssleep(1);
>   pci_cfg_access_unlock(dev);
>  
> + pcibios_sriov_disable(dev);
> +
>   if (iov->link != dev->devfn)
>   sysfs_remove_link(&dev->dev.kobj, "dep_link");
>  
> @@ -357,14 +358,14 @@ static void sriov_disable(struct pci_dev *dev)
>   for (i = 0; i < iov->num_VFs; i++)
>   pci_iov_remove_virtfn(dev, i, 0);
>  
> - pcibios_sriov_disable(dev);
> -
>   iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
>   pci_cfg_access_lock(dev);
>   pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
>   ssleep(1);
>   pci_cfg_access_unlock(dev);
>  
> + pcibios_sriov_disable(dev);
> +
>   if (iov->link != dev->devfn)
>   

Re: [PATCH v3 00/17] powerpc: Do alignment fixups using analyse_instr etc.

2017-08-30 Thread Michael Neuling
On Thu, 2017-08-31 at 10:49 +1000, Michael Neuling wrote:
> Tested-by: Michael Neuling 
> 
> FWIW I've written a test case for alignment faults (which I'll convert to a
> selftest and upstream). It tests all load/store instructions supported by POWER9 (results
> below).

Sorry, this is not quite right.  It doesn't test load/store quad or string
instructions.

It also doesn't test atomic memory operations (AMO) since they can't be emulated
anyway.

Mikey


Re: [PATCH v3 00/17] powerpc: Do alignment fixups using analyse_instr etc.

2017-08-30 Thread Michael Neuling
Tested-by: Michael Neuling 

FWIW I've written a test case for alignment faults (which I'll convert to a
selftest and upstream). It tests all load/store instructions supported by POWER9 (results
below).

VSX: 2.06B
Doing lxvd2x:   PASSED
Doing lxvw4x:   PASSED
Doing lxsdx:PASSED
Doing lxvdsx:   PASSED
Doing stxvd2x:  PASSED
Doing stxvw4x:  PASSED
Doing stxsdx:   PASSED
VSX: 2.07B
Doing lxsspx:   PASSED
Doing lxsiwax:  PASSED
Doing lxsiwzx:  PASSED
Doing stxsspx:  PASSED
Doing stxsiwx:  PASSED
VSX: 3.00B
Doing lxsd: PASSED
Doing lxsibzx:  PASSED
Doing lxsihzx:  PASSED
Doing lxssp:PASSED
Doing lxv:  PASSED
Doing lxvb16x:  PASSED
Doing lxvh8x:   PASSED
Doing lxvx: PASSED
Doing lxvwsx:   PASSED
Doing lxvl: PASSED
Doing lxvll:PASSED
Doing stxsd:PASSED
Doing stxsibx:  PASSED
Doing stxsihx:  PASSED
Doing stxssp:   PASSED
Doing stxv: PASSED
Doing stxvb16x: PASSED
Doing stxvh8x:  PASSED
Doing stxvx:PASSED
Doing stxvl:PASSED
Doing stxvll:   PASSED
Integer
Doing lbz:  PASSED
Doing lbzu: PASSED
Doing lbzx: PASSED
Doing lbzux:PASSED
Doing lhz:  PASSED
Doing lhzu: PASSED
Doing lhzx: PASSED
Doing lhzux:PASSED
Doing lha:  PASSED
Doing lhau: PASSED
Doing lhax: PASSED
Doing lhaux:PASSED
Doing lhbrx:PASSED
Doing lwz:  PASSED
Doing lwzu: PASSED
Doing lwzx: PASSED
Doing lwzux:PASSED
Doing lwa:  PASSED
Doing lwax: PASSED
Doing lwaux:PASSED
Doing lwbrx:PASSED
Doing ld:   PASSED
Doing ldu:  PASSED
Doing ldx:  PASSED
Doing ldux: PASSED
Doing ldbrx:PASSED
Doing lmw:  PASSED
Doing stb:  PASSED
Doing stbx: PASSED
Doing stbu: PASSED
Doing stbux:PASSED
Doing sth:  PASSED
Doing sthx: PASSED
Doing sthu: PASSED
Doing sthux:PASSED
Doing sthbrx:   PASSED
Doing stw:  PASSED
Doing stwx: PASSED
Doing stwu: PASSED
Doing stwux:PASSED
Doing stwbrx:   PASSED
Doing std:  PASSED
Doing stdx: PASSED
Doing stdu: PASSED
Doing stdux:PASSED
Doing stdbrx:   PASSED
Doing stmw: PASSED
VMX
Doing stvx: PASSED
Doing stvebx:   PASSED
Doing stvehx:   PASSED
Doing stvewx:   PASSED
Doing stvxl:PASSED
Floating point
Doing lfd:  PASSED
Doing lfdx: PASSED
Doing lfdp: PASSED
Doing lfdpx:PASSED
Doing lfdu: PASSED
Doing lfdux:PASSED
Doing lfs:  PASSED
Doing lfsx: PASSED
Doing lfsu: PASSED
Doing lfsux:PASSED
Doing lfiwzx:   PASSED
Doing lfiwax:   PASSED
Doing stfd: PASSED
Doing stfdx:PASSED
Doing stfdp:PASSED
Doing stfdpx:   PASSED
Doing stfdu:PASSED
Doing stfdux:   PASSED
Doing stfs: PASSED
Doing stfsx:PASSED
Doing stfsu:PASSED
Doing stfsux:   PASSED
Doing stfiwx:   PASSED


On Wed, 2017-08-30 at 14:12 +1000, Paul Mackerras wrote:
> This series extends the instruction emulation infrastructure in
> arch/powerpc/lib/sstep.c and uses it for emulating instructions when
> we get an alignment interrupt.  The advantage of this is that we only
> have to add the new POWER9 instructions in one place, and it fixes
> several bugs in alignment interrupt handling that have been identified
> recently.
> 
> With this, analyse_instr() and emulate_step() handle almost all load
> and store instructions in Power ISA v3.00 -- all except the atomic
> memory operations (lwat, stwat, etc.).  We now always use the largest
> possible aligned memory accesses (up to 8 bytes) to emulate unaligned
> accesses.  If we get a fault, the faulting address is accurately
> recorded in regs->dar.  We also can now access FP/VMX/VSX registers
> directly if they are live, without having to spill them all to the
> thread_struct and then reload them all later.  There are also various
> other fixes in the series.
> 
> This version is based on the current powerpc next branch.
> 
> Paul.
> 
>  arch/powerpc/Kconfig  |4 -
>  arch/powerpc/include/asm/ppc-opcode.h |   10 +-
>  arch/powerpc/include/asm/sstep.h  |   90 +-
>  arch/powerpc/kernel/align.c   |  774 +---
>  arch/powerpc/lib/Makefile |3 +-
>  
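
The "largest possible aligned memory accesses" approach described above can
be sketched as follows (illustrative user-space code, not the actual
sstep.c logic):

#include <stdint.h>
#include <string.h>

/* Emulate an unaligned load by issuing the largest naturally-aligned
 * accesses (8, 4, 2 or 1 bytes) that cover the requested range. */
static uint64_t read_unaligned(const uint8_t *p, unsigned int len)
{
	uint64_t val = 0;
	unsigned int done = 0;

	while (done < len) {
		uintptr_t addr = (uintptr_t)(p + done);
		unsigned int chunk = 8;

		/* shrink until the access is aligned and still fits */
		while (chunk > 1 &&
		       ((addr & (chunk - 1)) || done + chunk > len))
			chunk >>= 1;

		memcpy((uint8_t *)&val + done, p + done, chunk);
		done += chunk;
	}
	return val;	/* endianness handling omitted in this sketch */
}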

Re: [PATCH RFC] powerpc: Implements MMIO emulation for lvx/stvx instructions

2017-08-30 Thread joserz
On Wed, Aug 30, 2017 at 07:45:17PM +1000, Paul Mackerras wrote:
> On Tue, Aug 29, 2017 at 07:18:01PM -0300, Jose Ricardo Ziviani wrote:
> > Hello!
> > 
> > This patch implements MMIO emulation for two instructions: lvx and stvx. I 
> > started to implement other instructions but I'd like to have this reviewed 
> > beforehand because this is my first patch here and I'll certainly have some 
> > rework/fixes :-).
> > 
> > Note: stvx is only storing 8 bytes; for some reason the code
> > "vcpu->arch.paddr_accessed += run->mmio.len;", which adds the 8-byte offset
> > after the first write, is not making any difference (interestingly, it
> > works for load operations). I'm still investigating, but any ideas about
> > it will be appreciated.
> 
> The run structure is mmapped by userspace (i.e. QEMU) and can be
> written by userspace between the first and the second exits to
> userspace (you have to do two exits to userspace because you can only
> transfer 8 bytes on each exit).  It's possible that userspace might be
> clearing run->mmio.len.  In general it's better not to rely on
> anything in *run (except of course the mmio_data for a MMIO read) when
> we come in from userspace to the kernel.
> 
> Paul.
> 
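
The rule of thumb here can be sketched as a data-layout choice (the field
names below are invented for illustration, not the actual kvm_vcpu_arch
layout):

/* Userspace mmaps *run read-write, so between the two MMIO exits it
 * may overwrite anything in it, e.g. run->mmio.len.  State that must
 * survive both exits belongs in kernel-owned vcpu state instead: */
struct mmio_emul_state_sketch {
	unsigned long paddr_accessed;	/* next guest physical address   */
	unsigned int  bytes_remaining;	/* 16-byte op done in two halves */
};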

Hello Paul,

My bad, actually it works. I was mmap'ping an address that doesn't allow
16-byte write access. After mmap'ping a higher address (of the same device) I
was able to perform 16-byte reads and writes.

== before stvx ==

  (gdb) info registers vr0
  vr0  {uint128 = 0x12345678abcdef09, ...}

  (gdb) info registers r9
  r9 0x3fffb7c90010

  (gdb) x /4wx 0x3fffb7c90010
  0x3fffb7c90010: 0x 0x 0x 0x

  (gdb) info registers r28
  r28 0x0 

stvxv0,r28,r9

== after stvx ==

  (gdb) x /4wx 0x3fffb7c90010
  0x3fffb7c90010: 0x12345678 0x 0xabcdef09 0x

== before lvx ==

  (gdb) info registers vr10
  vr10 {uint128 = 0x,...}

lvx v10,r28,r9

== after lvx ==

  (gdb) info registers vr10
  vr10 {uint128 = 0x12345678abcdef09,...}

If you think it's ok I'll submit this patch without the RFC.

Thank you very much!

Ziviani



Re: [PATCH RFC] Interface to set SPRN_TIDR

2017-08-30 Thread Sukadev Bhattiprolu
Michael Neuling [mi...@neuling.org] wrote:
> Suka,
> 
> Please CC Christophe, who has an alternative way of doing this. We need to get
> agreement across all users of TIDR/AS_notify...

Mikey,

Thanks. There is overlap between the two patches. I will send a patch on
top of Christophe's for the interfaces to assign/clear the TIDR value and
clear the thread->tidr during arch_dup_task_struct(). I will also drop the
CONFIG_VAS check since it's not only for VAS.

Christophe, can you let me know of any other comments on this patch?

Suka
> 
> His patch is here:
>   https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-August/161582.html
> 
> Mikey
> 
> On Tue, 2017-08-29 at 19:38 -0700, Sukadev Bhattiprolu wrote:
> > We need the SPRN_TIDR to be set for use with fast thread-wakeup
> > (core-to-core wakeup) in VAS. Each user thread that has a receive
> > window setup and expects to be notified when a sender issues a paste
> > needs to have a unique SPRN_TIDR value.
> > 
> > The SPRN_TIDR value only needs to be unique within the process but for
> > now we use a globally unique thread id as described below.
> > 
> > Signed-off-by: Sukadev Bhattiprolu 
> > ---
> > Changelog[v2]
> > - Michael Ellerman: Use an interface to assign TIDR so it is
> >   assigned to only threads that need it; move assignment to
> >   restore_sprs(). Drop lint from rebase;
> > 
> > 
> >  arch/powerpc/include/asm/processor.h |  4 ++
> >  arch/powerpc/include/asm/switch_to.h |  3 ++
> >  arch/powerpc/kernel/process.c| 97
> > 
> >  3 files changed, 104 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/processor.h
> > b/arch/powerpc/include/asm/processor.h
> > index fab7ff8..bf6ba63 100644
> > --- a/arch/powerpc/include/asm/processor.h
> > +++ b/arch/powerpc/include/asm/processor.h
> > @@ -232,6 +232,10 @@ struct debug_reg {
> >  struct thread_struct {
> >     unsigned long   ksp;/* Kernel stack pointer */
> > 
> > +#ifdef CONFIG_PPC_VAS
> > +   unsigned long   tidr;
> > +#endif
> > +
> >  #ifdef CONFIG_PPC64
> >     unsigned long   ksp_vsid;
> >  #endif
> > diff --git a/arch/powerpc/include/asm/switch_to.h
> > b/arch/powerpc/include/asm/switch_to.h
> > index 17c8380..4962455 100644
> > --- a/arch/powerpc/include/asm/switch_to.h
> > +++ b/arch/powerpc/include/asm/switch_to.h
> > @@ -91,4 +91,7 @@ static inline void clear_task_ebb(struct task_struct *t)
> >  #endif
> >  }
> > 
> > +extern void set_thread_tidr(struct task_struct *t);
> > +extern void clear_thread_tidr(struct task_struct *t);
> > +
> >  #endif /* _ASM_POWERPC_SWITCH_TO_H */
> > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> > index 1f0fd36..13abb22 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -1132,6 +1132,10 @@ static inline void restore_sprs(struct thread_struct
> > *old_thread,
> >     mtspr(SPRN_TAR, new_thread->tar);
> >     }
> >  #endif
> > +#ifdef CONFIG_PPC_VAS
> > +   if (old_thread->tidr != new_thread->tidr)
> > +   mtspr(SPRN_TIDR, new_thread->tidr);
> > +#endif
> >  }
> > 
> >  #ifdef CONFIG_PPC_BOOK3S_64
> > @@ -1446,9 +1450,97 @@ void flush_thread(void)
> >  #endif /* CONFIG_HAVE_HW_BREAKPOINT */
> >  }
> > 
> > +#ifdef CONFIG_PPC_VAS
> > +static DEFINE_SPINLOCK(vas_thread_id_lock);
> > +static DEFINE_IDA(vas_thread_ida);
> > +
> > +/*
> > + * We need to assign an unique thread id to each thread in a process. This
> > + * thread id is intended to be used with the Fast Thread-wakeup (aka Core-
> > + * to-core wakeup) mechanism being implemented on top of Virtual 
> > Accelerator
> > + * Switchboard (VAS).
> > + *
> > + * To get a unique thread-id per process we could simply use task_pid_nr()
> > + * but the problem is that task_pid_nr() is not yet available for the 
> > thread
> > + * when copy_thread() is called. Fixing that would require changing more
> > + * intrusive arch-neutral code in the copy_process() code path.
> > + *
> > + * Further, to assign unique thread ids within each process, we need an
> > + * atomic field (or an IDR) in task_struct, which again intrudes into the
> > + * arch-neutral code.
> > + *
> > + * So try to assign globally unique thread ids for now.
> > + *
> > + * NOTE: TIDR 0 indicates that the thread does not need a TIDR value.
> > + *  For now, only threads that expect to be notified by the VAS
> > + *  hardware need a TIDR value and we assign values > 0 for those.
> > + */
> > +#define MAX_THREAD_CONTEXT ((1 << 15) - 2)
> > +static int assign_thread_tidr(void)
> > +{
> > +   int index;
> > +   int err;
> > +
> > +again:
> > +   if (!ida_pre_get(&vas_thread_ida, GFP_KERNEL))
> > +   return -ENOMEM;
> > +
> > +   spin_lock(&vas_thread_id_lock);
> > +   err = ida_get_new_above(&vas_thread_ida, 1, &index);
> > +   spin_unlock(&vas_thread_id_lock);
> > +
> > +   if (err == -EAGAIN)
> > +   goto again;
> > +   else if (err)
> > + 

[GIT PULL] Please pull JSON files for POWER9 PMU events

2017-08-30 Thread Sukadev Bhattiprolu

Hi Arnaldo,

Please pull an update to the JSON files for POWER9 PMU events. This
removes alternate event codes from the JSON files which seem to confuse
perf.

The following changes since commit 1b2f76d77a277bb70d38ad0991ed7f16bbc115a9:

  Merge tag 'perf-core-for-mingo-4.14-20170829' of 
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
(2017-08-29 23:13:56 +0200)

are available in the git repository at:

  https://github.com/sukadev/linux/ p9-json-v3

Sukadev Bhattiprolu (1):
  perf vendor events powerpc: remove duplicate events

 .../pmu-events/arch/powerpc/power9/frontend.json   |7 +-
 .../perf/pmu-events/arch/powerpc/power9/other.json |  120 
 .../pmu-events/arch/powerpc/power9/pipeline.json   |7 +-
 tools/perf/pmu-events/arch/powerpc/power9/pmc.json |7 +-
 4 files changed, 3 insertions(+), 138 deletions(-)



Re: [PATCH v7 08/11] mm: zero reserved and unavailable struct pages

2017-08-30 Thread kbuild test robot
Hi Pavel,

[auto build test ERROR on sparc/master]
[also build test ERROR on v4.13-rc7 next-20170829]
[cannot apply to mmotm/master]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Pavel-Tatashin/complete-deferred-page-initialization/20170831-041021
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc.git master
config: cris-allmodconfig (attached as .config)
compiler: cris-linux-gcc (GCC) 6.2.0
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=cris 

All errors (new ones prefixed by >>):

   In file included from mm/page_alloc.c:18:0:
   include/linux/mm.h:1974:33: error: expected '=', ',', ';', 'asm' or 
'__attribute__' before 'zero_resv_unavail'
static inline void __paginginit zero_resv_unavail(void) {}
^
   mm/page_alloc.c: In function 'free_area_init':
>> mm/page_alloc.c:6863:2: error: implicit declaration of function 
>> 'zero_resv_unavail' [-Werror=implicit-function-declaration]
 zero_resv_unavail();
 ^
   cc1: some warnings being treated as errors

vim +/zero_resv_unavail +6863 mm/page_alloc.c

  6858  
  6859  void __init free_area_init(unsigned long *zones_size)
  6860  {
  6861  free_area_init_node(0, zones_size,
  6862  __pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL);
> 6863  zero_resv_unavail();
  6864  }
  6865  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: mpic IRQ_TYPE_BOTH handling

2017-08-30 Thread Michael Ellerman
Hi Gregory,

Gregory Fong  writes:
> Hi all,
>
> In arch/powerpc/sysdev/mpic.c , it looks like IRQ_TYPE_EDGE_BOTH is
> handled the same way as IRQ_TYPE_EDGE_FALLING:
>
> static unsigned int mpic_type_to_vecpri(struct mpic *mpic, unsigned int type)
> {
> /* Now convert sense value */
> switch(type & IRQ_TYPE_SENSE_MASK) {
> case IRQ_TYPE_EDGE_RISING:
> return MPIC_INFO(VECPRI_SENSE_EDGE) |
>MPIC_INFO(VECPRI_POLARITY_POSITIVE);
> case IRQ_TYPE_EDGE_FALLING:
> case IRQ_TYPE_EDGE_BOTH:
> return MPIC_INFO(VECPRI_SENSE_EDGE) |
>MPIC_INFO(VECPRI_POLARITY_NEGATIVE);
> case IRQ_TYPE_LEVEL_HIGH:
> return MPIC_INFO(VECPRI_SENSE_LEVEL) |
>MPIC_INFO(VECPRI_POLARITY_POSITIVE);
> case IRQ_TYPE_LEVEL_LOW:
> default:
> return MPIC_INFO(VECPRI_SENSE_LEVEL) |
>MPIC_INFO(VECPRI_POLARITY_NEGATIVE);
> }
> }
>
> If IRQ_TYPE_EDGE_BOTH is unsupported, shouldn't we be returning an
> error, instead of silently setting to use IRQ_TYPE_EDGE_FALLING?
> Something like the following (sorry if the diff wraps weirdly, on
> webmail at the moment):

I don't know this code so I asked Ben and he said something like
"PowerMacs never use BOTH, so it hasn't mattered, but Freescale machines
might".

So if you want to send a proper signed-off patch, and confirm that all
the callers can handle the error properly, then we can merge it.

cheers

> 8<
> diff --git a/arch/powerpc/sysdev/mpic.c b/arch/powerpc/sysdev/mpic.c
> index b9aac95..5867ea2 100644
> --- a/arch/powerpc/sysdev/mpic.c
> +++ b/arch/powerpc/sysdev/mpic.c
> @@ -876,6 +876,9 @@ int mpic_set_irq_type(struct irq_data *d, unsigned
> int flow_type)
> if (src >= mpic->num_sources)
> return -EINVAL;
>
> +   if ((flow_type & IRQ_TYPE_SENSE_MASK) == IRQ_TYPE_EDGE_BOTH)
> +   return -EINVAL;
> +
> vold = mpic_irq_read(src, MPIC_INFO(IRQ_VECTOR_PRI));
>
> /* We don't support "none" type */
> >8
>
> Thanks,
> Gregory


Re: [PATCH v7 08/11] mm: zero reserved and unavailable struct pages

2017-08-30 Thread kbuild test robot
Hi Pavel,

[auto build test ERROR on sparc/master]
[also build test ERROR on v4.13-rc7 next-20170829]
[cannot apply to mmotm/master]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Pavel-Tatashin/complete-deferred-page-initialization/20170831-041021
base:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc.git master
config: tile-allmodconfig (attached as .config)
compiler: tilegx-linux-gcc (GCC) 4.6.2
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=tile 

All errors (new ones prefixed by >>):

   In file included from include/linux/pid_namespace.h:6:0,
from include/linux/ptrace.h:9,
from arch/tile/kernel/asm-offsets.c:35:
>> include/linux/mm.h:1974:33: error: expected '=', ',', ';', 'asm' or 
>> '__attribute__' before 'zero_resv_unavail'
   make[2]: *** [arch/tile/kernel/asm-offsets.s] Error 1
   make[2]: Target '__build' not remade because of errors.
   make[1]: *** [prepare0] Error 2
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [sub-make] Error 2

vim +1974 include/linux/mm.h

  1970  
  1971  #ifdef CONFIG_HAVE_MEMBLOCK
  1972  void zero_resv_unavail(void);
  1973  #else
> 1974  static inline void __paginginit zero_resv_unavail(void) {}
  1975  #endif
  1976  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH v2 1/3] powerpc/mm: Export flush_all_mm()

2017-08-30 Thread Benjamin Herrenschmidt
On Wed, 2017-08-30 at 15:59 +0200, Frederic Barrat wrote:
> > It's not clear why it makes sense for these to be empty. Either for the
> > general idea of the "flush_all_mm()" API, or for your intended use by
> > CXL.
> 
> I was not too sure what to do for hash, but the idea is that the new 
> flush_all_mm() is really the equivalent of the old flush_tlb_mm() from 
> before Ben's optimizations for radix, and that was/still is an empty 
> operation on hash, so I kept it that way.
> 
> We don't support hash for capi2 yet. Adding it will definitely require 
> some work in that area, as the current approach (use count on the driver 
> and all TLBIs becoming global when the driver is in use) won't hold much 
> longer.

Why not? It would work fine on hash, but you do need a way to flush
the TLB when decreasing the count, and that is indeed missing for hash.

Cheers,
Ben.



Re: [PATCH v2 1/3] powerpc/mm: Export flush_all_mm()

2017-08-30 Thread Benjamin Herrenschmidt
On Wed, 2017-08-30 at 23:17 +1000, Michael Ellerman wrote:
> It's not clear why it makes sense for these to be empty. Either for the
> general idea of the "flush_all_mm()" API, or for your intended use by
> CXL.

Indeed. On hash we don't have a way to flush a PID out of the TLB,
but you can flush the whole LPID.

Cheers,
Ben.



Re: [RESEND PATCH v5 00/16] eeprom: at24: Add OF device ID table

2017-08-30 Thread Javier Martinez Canillas
Hello Geert,

On Wed, Aug 30, 2017 at 10:15 PM, Geert Uytterhoeven
 wrote:
> Hi Javier,
>
> On Wed, Aug 30, 2017 at 9:57 PM, Javier Martinez Canillas
>  wrote:
>>> I think we should talk about the same case: Let me repeat what I did:
>>>
>>> 1) I added your patch "eeprom: at24: Add OF device ID table"
>>> 2) I added an EEPROM node to an I2C
>>>
>>> +   eeprom@50 {
>>> +   compatible = "renesas,24c01";
>>> +   reg = <0x50>;
>>> +   };
>>>
>>> -> no at24 binding to the device
>>>
>>> 3) I revert your patch
>>>
>>> -> at24 binding to the device
>>>
>>
>> I've tested this and you are right, it fails...
>>
>> The problem is that the patch also changes how the driver obtains the
>> EEPROM parameters (the magic value in the entry's data field).
>>
>> So even when module autoload and device / driver matching works, the
>> driver probe function fails because if (client->dev.of_node) the
>> driver attempts to get the entry data using
>> of_device_get_match_data(), which is obviously wrong since the
>> compatible string in the dev node isn't present in the OF table.
>>
>> The id->driver_data from the I2C table should be used instead since
>> that's the table that matches in this case.
>>
>> One option is to fallback to id->driver_data if
>> of_device_get_match_data() fails, but that's just an (ugly)
>> workaround. So I agree with you that the best option is to wait for
>> the DTS patches to land first.
>
> Which means new kernels won't work with old DTBs. Oops...
> I'm afraid that needs to be fixed.  People care about DTB backward
> compatibility on many platforms.
>

Right, I've yet to find one of those mythical platforms that ship old
DTBs with new kernels, but I agree with you since people seem to care
about backward compatibility (at least in theory).

So I see two options then:

1) Use the workaround I mentioned and lookup the I2C device ID table
entry data if of_device_get_match_data() fails

2) Only call of_device_get_match_data() if (dev->of_node &&
of_match_device(dev->driver->of_match_table, dev))

Not sure what the preferred idiom is for these cases; a sketch of option 1
follows.
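
A sketch of option 1 (the at24 names are assumed from context, not verified
against the actual patch):

static int at24_probe(struct i2c_client *client,
		      const struct i2c_device_id *id)
{
	unsigned long magic;

	if (client->dev.of_node &&
	    of_device_get_match_data(&client->dev))
		magic = (unsigned long)of_device_get_match_data(&client->dev);
	else
		magic = id->driver_data;	/* legacy I2C table match */

	/* ... derive chip size and flags from "magic" ... */
	return 0;
}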

The good thing about keeping backward compatibility is that Wolfram
would be able to pick up the driver patch even before the DTS patches
land.

Best regards,
Javier


[PATCH v3 4/8] powerpc/xive: introduce xive_esb_write()

2017-08-30 Thread Cédric Le Goater
Some sources support MMIO stores on the ESB page to perform EOI. Let's
introduce a specific routine for this case even if this should be the
only use of it.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---
 arch/powerpc/sysdev/xive/common.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 8a58662ed793..ac5f18a66742 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -203,6 +203,15 @@ static u8 xive_esb_read(struct xive_irq_data *xd, u32 
offset)
return (u8)val;
 }
 
+static void xive_esb_write(struct xive_irq_data *xd, u32 offset, u64 data)
+{
+   /* Handle HW errata */
+   if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
+   offset |= offset << 4;
+
+   out_be64(xd->eoi_mmio + offset, data);
+}
+
 #ifdef CONFIG_XMON
 static void xive_dump_eq(const char *name, struct xive_q *q)
 {
@@ -297,7 +306,7 @@ void xive_do_source_eoi(u32 hw_irq, struct xive_irq_data 
*xd)
 {
/* If the XIVE supports the new "store EOI facility, use it */
if (xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
-   out_be64(xd->eoi_mmio + XIVE_ESB_STORE_EOI, 0);
+   xive_esb_write(xd, XIVE_ESB_STORE_EOI, 0);
else if (hw_irq && xd->flags & XIVE_IRQ_FLAG_EOI_FW) {
/*
 * The FW told us to call it. This happens for some
-- 
2.13.5



Re: [RESEND PATCH v5 00/16] eeprom: at24: Add OF device ID table

2017-08-30 Thread Geert Uytterhoeven
Hi Javier,

On Wed, Aug 30, 2017 at 9:57 PM, Javier Martinez Canillas
 wrote:
>> I think we should talk about the same case: Let me repeat what I did:
>>
>> 1) I added your patch "eeprom: at24: Add OF device ID table"
>> 2) I added an EEPROM node to an I2C
>>
>> +   eeprom@50 {
>> +   compatible = "renesas,24c01";
>> +   reg = <0x50>;
>> +   };
>>
>> -> no at24 binding to the device
>>
>> 3) I revert your patch
>>
>> -> at24 binding to the device
>>
>
> I've tested this and you are right, it fails...
>
> The problem is that the patch also changes how the driver obtains the
> EEPROM parameters (the magic value in the entry's data field).
>
> So even when module autoload and device / driver matching works, the
> driver probe function fails because if (client->dev.of_node) the
> driver attempts to get the entry data using
> of_device_get_match_data(), which is obviously wrong since the
> compatible string in the dev node isn't present in the OF table.
>
> The id->driver_data from the I2C table should be used instead since
> that's the table that matches in this case.
>
> One option is to fallback to id->driver_data if
> of_device_get_match_data() fails, but that's just an (ugly)
> workaround. So I agree with you that the best option is to wait for
> the DTS patches to land first.

Which means new kernels won't work with old DTBs. Oops...
I'm afraid that needs to be fixed.  People care about DTB backward
compatibility on many platforms.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


[PATCH v3 5/8] powerpc/xive: add the HW IRQ number under xive_irq_data

2017-08-30 Thread Cédric Le Goater
It will be required later by the H_INT_ESB hcall.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/xive.h   | 1 +
 arch/powerpc/sysdev/xive/native.c | 2 ++
 arch/powerpc/sysdev/xive/spapr.c  | 2 ++
 3 files changed, 5 insertions(+)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 473f133a8555..64ec9bbcf03e 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -45,6 +45,7 @@ struct xive_irq_data {
void __iomem *trig_mmio;
u32 esb_shift;
int src_chip;
+   u32 hw_irq;
 
/* Setup/used by frontend */
int target;
diff --git a/arch/powerpc/sysdev/xive/native.c 
b/arch/powerpc/sysdev/xive/native.c
index ef92a83090e1..f8bcff15b0f9 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -82,6 +82,8 @@ int xive_native_populate_irq_data(u32 hw_irq, struct 
xive_irq_data *data)
return -ENOMEM;
}
 
+   data->hw_irq = hw_irq;
+
if (!data->trig_page)
return 0;
if (data->trig_page == data->eoi_page) {
diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
index 797bb0636ab7..0fcae7504353 100644
--- a/arch/powerpc/sysdev/xive/spapr.c
+++ b/arch/powerpc/sysdev/xive/spapr.c
@@ -264,6 +264,8 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, struct 
xive_irq_data *data)
return -ENOMEM;
}
 
+   data->hw_irq = hw_irq;
+
/* Full function page supports trigger */
if (flags & XIVE_SRC_TRIGGER) {
data->trig_mmio = data->eoi_mmio;
-- 
2.13.5



Re: [RESEND PATCH v5 00/16] eeprom: at24: Add OF device ID table

2017-08-30 Thread Javier Martinez Canillas
>
> I think we should talk about the same case: Let me repeat what I did:
>
> 1) I added your patch "eeprom: at24: Add OF device ID table"
> 2) I added an EEPROM node to an I2C
>
> +   eeprom@50 {
> +   compatible = "renesas,24c01";
> +   reg = <0x50>;
> +   };
>
> -> no at24 binding to the device
>
> 3) I revert your patch
>
> -> at24 binding to the device
>

I've tested this and you are right, it fails...

The problem is that the patch also changes how the driver obtains the
EEPROM parameters (the magic value in the entry's data field).

So even when module autoload and device / driver matching works, the
driver probe function fails because if (client->dev.of_node) the
driver attempts to get the entry data using
of_device_get_match_data(), which is obviously wrong since the
compatible string in the dev node isn't present in the OF table.

The id->driver_data from the I2C table should be used instead since
that's the table that matches in this case.

One option is to fallback to id->driver_data if
of_device_get_match_data() fails, but that's just an (ugly)
workaround. So I agree with you that the best option is to wait for
the DTS patches to land first.

It worked for me on my previous tests because the tested drivers
didn't use the table entry data; I'm so sorry for missing this :(

Best regards,
Javier


[PATCH v3 2/8] powerpc/xive: guest exploitation of the XIVE interrupt controller

2017-08-30 Thread Cédric Le Goater
This is the framework for using XIVE in a PowerVM guest. The support
is very similar to the native one in a much simpler form.

Each source is associated with an Event State Buffer (ESB). This is a
two-bit state machine which is used to trigger events. The bits are
named "P" (pending) and "Q" (queued) and can be controlled by MMIO.
The Guest OS registers event (or notification) queues on which the HW
will post event data for a target to notify.
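
As a sketch, the four states the P and Q bits encode look like this (the
enum and its comments are illustrative; the code uses XIVE_ESB_* constants
for the MMIO offsets that manipulate these states):

enum esb_pq_sketch {
	PQ_00 = 0,	/* idle: the next trigger generates an event */
	PQ_01 = 1,	/* Q set: used while the source is masked    */
	PQ_10 = 2,	/* P set: an event is pending in a queue     */
	PQ_11 = 3,	/* P and Q: pending plus a further trigger   */
};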

Instead of OPAL calls, a set of hypervisor calls is used to configure
the interrupt sources and the event/notification queues of the guest:

 - H_INT_GET_SOURCE_INFO

   used to obtain the address of the MMIO page of the Event State
   Buffer (PQ bits) entry associated with the source.

 - H_INT_SET_SOURCE_CONFIG

   assigns a source to a "target".

 - H_INT_GET_SOURCE_CONFIG

   determines which "target" and "priority" are assigned to a source

 - H_INT_GET_QUEUE_INFO

   returns the address of the notification management page associated
   with the specified "target" and "priority".

 - H_INT_SET_QUEUE_CONFIG

   sets or resets the event queue for a given "target" and "priority".
   It is also used to set the notification config associated with the
   queue, only unconditional notification for the moment.  Reset is
   performed with a queue size of 0 and queueing is disabled in that
   case.

 - H_INT_GET_QUEUE_CONFIG

   returns the queue settings for a given "target" and "priority".

 - H_INT_RESET

   resets all of the partition's interrupt exploitation structures to
   their initial state, losing all configuration set via the hcalls
   H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.

 - H_INT_SYNC

   issue a synchronisation on a source to make sure all
   notifications have reached their queue.

As for XICS, the XIVE interface for the guest is described in the
device tree under the "interrupt-controller" node. A couple of new
properties are specific to XIVE :

 - "reg"

   contains the base address and size of the thread interrupt
   management areas (TIMA), also called rings, for the User level and
   for the Guest OS level. Only the Guest OS level is taken into
   account today.

 - "ibm,xive-eq-sizes"

   the size of the event queues. One cell per size supported, contains
   log2 of size, in ascending order.

 - "ibm,xive-lisn-ranges"

   the interrupt numbers ranges assigned to the guest. These are
   allocated using a simple bitmap.

and also :

 - "/ibm,plat-res-int-priorities"

   contains a list of priorities that the hypervisor has reserved for
   its own use.

Tested with a QEMU XIVE model for pseries and with the Power hypervisor.

Signed-off-by: Cédric Le Goater 
---

 Changes since v2 :

 - changed H_INT_SYNC hcall to reflect new api
 
 Changes since v1 :

 - added a xive_teardown_cpu() routine
 - removed P9 doorbell support when xive is enabled.
 - merged in patch for "ibm,plat-res-int-priorities" support
 - added some comments on the usage of raw I/O accessors.
 
 Changes since RFC :

 - renamed backend to spapr
 - fixed hotplug support
 - fixed kexec support
 - fixed src_chip value (XIVE_INVALID_CHIP_ID)
 - added doorbell support 
 - added some hcall debug logs

 arch/powerpc/include/asm/hvcall.h|  13 +-
 arch/powerpc/include/asm/xive.h  |   3 +
 arch/powerpc/platforms/pseries/Kconfig   |   1 +
 arch/powerpc/platforms/pseries/hotplug-cpu.c |  11 +-
 arch/powerpc/platforms/pseries/kexec.c   |   6 +-
 arch/powerpc/platforms/pseries/setup.c   |   8 +-
 arch/powerpc/platforms/pseries/smp.c |  27 +-
 arch/powerpc/sysdev/xive/Kconfig |   5 +
 arch/powerpc/sysdev/xive/Makefile|   1 +
 arch/powerpc/sysdev/xive/common.c|  13 +
 arch/powerpc/sysdev/xive/spapr.c | 618 +++
 11 files changed, 698 insertions(+), 8 deletions(-)
 create mode 100644 arch/powerpc/sysdev/xive/spapr.c

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index 57d38b504ff7..3d34dc0869f6 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -280,7 +280,18 @@
 #define H_RESIZE_HPT_COMMIT0x370
 #define H_REGISTER_PROC_TBL0x37C
 #define H_SIGNAL_SYS_RESET 0x380
-#define MAX_HCALL_OPCODE   H_SIGNAL_SYS_RESET
+#define H_INT_GET_SOURCE_INFO   0x3A8
+#define H_INT_SET_SOURCE_CONFIG 0x3AC
+#define H_INT_GET_SOURCE_CONFIG 0x3B0
+#define H_INT_GET_QUEUE_INFO0x3B4
+#define H_INT_SET_QUEUE_CONFIG  0x3B8
+#define H_INT_GET_QUEUE_CONFIG  0x3BC
+#define H_INT_SET_OS_REPORTING_LINE 0x3C0
+#define H_INT_GET_OS_REPORTING_LINE 0x3C4
+#define H_INT_ESB   0x3C8
+#define H_INT_SYNC  0x3CC
+#define H_INT_RESET 0x3D0
+#define MAX_HCALL_OPCODE   H_INT_RESET
 
 /* H_VIOCTL functions */
 #define H_GET_VIOA_DUMP_SIZE   0x01
diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index c23ff4389ca2..473f133a8555 

[PATCH v3 1/8] powerpc/xive: introduce a common routine xive_queue_page_alloc()

2017-08-30 Thread Cédric Le Goater
This routine will be used in the spapr backend. Also introduce a short
xive_alloc_order() helper.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---
 arch/powerpc/sysdev/xive/common.c| 16 
 arch/powerpc/sysdev/xive/native.c| 16 +---
 arch/powerpc/sysdev/xive/xive-internal.h |  6 ++
 3 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 6e0c9dee724f..26999ceae20e 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1424,6 +1424,22 @@ bool xive_core_init(const struct xive_ops *ops, void 
__iomem *area, u32 offset,
return true;
 }
 
+__be32 *xive_queue_page_alloc(unsigned int cpu, u32 queue_shift)
+{
+   unsigned int alloc_order;
+   struct page *pages;
+   __be32 *qpage;
+
+   alloc_order = xive_alloc_order(queue_shift);
+   pages = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, alloc_order);
+   if (!pages)
+   return ERR_PTR(-ENOMEM);
+   qpage = (__be32 *)page_address(pages);
+   memset(qpage, 0, 1 << queue_shift);
+
+   return qpage;
+}
+
 static int __init xive_off(char *arg)
 {
xive_cmdline_disabled = true;
diff --git a/arch/powerpc/sysdev/xive/native.c 
b/arch/powerpc/sysdev/xive/native.c
index 0f95476b01f6..ef92a83090e1 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -202,17 +202,12 @@ EXPORT_SYMBOL_GPL(xive_native_disable_queue);
 static int xive_native_setup_queue(unsigned int cpu, struct xive_cpu *xc, u8 
prio)
 {
struct xive_q *q = &xc->queue[prio];
-   unsigned int alloc_order;
-   struct page *pages;
__be32 *qpage;
 
-   alloc_order = (xive_queue_shift > PAGE_SHIFT) ?
-   (xive_queue_shift - PAGE_SHIFT) : 0;
-   pages = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, alloc_order);
-   if (!pages)
-   return -ENOMEM;
-   qpage = (__be32 *)page_address(pages);
-   memset(qpage, 0, 1 << xive_queue_shift);
+   qpage = xive_queue_page_alloc(cpu, xive_queue_shift);
+   if (IS_ERR(qpage))
+   return PTR_ERR(qpage);
+
return xive_native_configure_queue(get_hard_smp_processor_id(cpu),
   q, prio, qpage, xive_queue_shift, 
false);
 }
@@ -227,8 +222,7 @@ static void xive_native_cleanup_queue(unsigned int cpu, 
struct xive_cpu *xc, u8
 * from an IPI and iounmap isn't safe
 */
__xive_native_disable_queue(get_hard_smp_processor_id(cpu), q, prio);
-   alloc_order = (xive_queue_shift > PAGE_SHIFT) ?
-   (xive_queue_shift - PAGE_SHIFT) : 0;
+   alloc_order = xive_alloc_order(xive_queue_shift);
free_pages((unsigned long)q->qpage, alloc_order);
q->qpage = NULL;
 }
diff --git a/arch/powerpc/sysdev/xive/xive-internal.h 
b/arch/powerpc/sysdev/xive/xive-internal.h
index d07ef2d29caf..dd1e2022cce4 100644
--- a/arch/powerpc/sysdev/xive/xive-internal.h
+++ b/arch/powerpc/sysdev/xive/xive-internal.h
@@ -56,6 +56,12 @@ struct xive_ops {
 
 bool xive_core_init(const struct xive_ops *ops, void __iomem *area, u32 offset,
u8 max_prio);
+__be32 *xive_queue_page_alloc(unsigned int cpu, u32 queue_shift);
+
+static inline u32 xive_alloc_order(u32 queue_shift)
+{
+   return (queue_shift > PAGE_SHIFT) ? (queue_shift - PAGE_SHIFT) : 0;
+}
 
 extern bool xive_cmdline_disabled;
 
-- 
2.13.5



[PATCH v3 7/8] powerpc/xive: add XIVE Exploitation Mode to CAS

2017-08-30 Thread Cédric Le Goater
On POWER9, the Client Architecture Support (CAS) negotiation process
determines whether the guest operates in XIVE Legacy compatibility or
in XIVE exploitation mode. Now that we have initial guest support for
the XIVE interrupt controller, let's inform the hypervisor what we can
do.

The platform advertises the XIVE Exploitation Mode support using the
property "ibm,arch-vec-5-platform-support-vec-5", byte 23 bits 0-1 :

 - 0b00 XIVE legacy mode Only
 - 0b01 XIVE exploitation mode Only
 - 0b10 XIVE legacy or exploitation mode

The OS asks for XIVE Exploitation Mode support using the property
"ibm,architecture-vec-5", byte 23 bits 0-1:

 - 0b00 XIVE legacy mode Only
 - 0b01 XIVE exploitation mode Only

Signed-off-by: Cédric Le Goater 
---

 Changes since v1:

 - fixed XIVE mode parsing 
 - integrated the prom.h definitions
 - introduced extra bits definition : OV5_XIVE_LEGACY and OV5_XIVE_EITHER
 
 arch/powerpc/include/asm/prom.h |  5 -
 arch/powerpc/kernel/prom_init.c | 34 +-
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 35c00d7a0cf8..825bd5998701 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -159,7 +159,10 @@ struct of_drconf_cell {
 #define OV5_PFO_HW_842 0x1140  /* PFO Compression Accelerator */
 #define OV5_PFO_HW_ENCR0x1120  /* PFO Encryption Accelerator */
 #define OV5_SUB_PROCESSORS 0x1501  /* 1,2,or 4 Sub-Processors supported */
-#define OV5_XIVE_EXPLOIT   0x1701  /* XIVE exploitation supported */
+#define OV5_XIVE_SUPPORT   0x17C0  /* XIVE Exploitation Support Mask */
+#define OV5_XIVE_LEGACY0x1700  /* XIVE legacy mode Only */
+#define OV5_XIVE_EXPLOIT   0x1740  /* XIVE exploitation mode Only */
+#define OV5_XIVE_EITHER0x1780  /* XIVE legacy or exploitation 
mode */
 /* MMU Base Architecture */
 #define OV5_MMU_SUPPORT0x18C0  /* MMU Mode Support Mask */
 #define OV5_MMU_HASH   0x1800  /* Hash MMU Only */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 613f79f03877..02190e90c7ae 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -177,6 +177,7 @@ struct platform_support {
bool hash_mmu;
bool radix_mmu;
bool radix_gtse;
+   bool xive;
 };
 
 /* Platforms codes are now obsolete in the kernel. Now only used within this
@@ -1041,6 +1042,27 @@ static void __init prom_parse_mmu_model(u8 val,
}
 }
 
+static void __init prom_parse_xive_model(u8 val,
+struct platform_support *support)
+{
+   switch (val) {
+   case OV5_FEAT(OV5_XIVE_EITHER): /* Either Available */
+   prom_debug("XIVE - either mode supported\n");
+   support->xive = true;
+   break;
+   case OV5_FEAT(OV5_XIVE_EXPLOIT): /* Only Exploitation mode */
+   prom_debug("XIVE - exploitation mode supported\n");
+   support->xive = true;
+   break;
+   case OV5_FEAT(OV5_XIVE_LEGACY): /* Only Legacy mode */
+   prom_debug("XIVE - legacy mode supported\n");
+   break;
+   default:
+   prom_debug("Unknown xive support option: 0x%x\n", val);
+   break;
+   }
+}
+
 static void __init prom_parse_platform_support(u8 index, u8 val,
   struct platform_support *support)
 {
@@ -1054,6 +1076,10 @@ static void __init prom_parse_platform_support(u8 index, 
u8 val,
support->radix_gtse = true;
}
break;
+   case OV5_INDX(OV5_XIVE_SUPPORT): /* Interrupt mode */
+   prom_parse_xive_model(val & OV5_FEAT(OV5_XIVE_SUPPORT),
+ support);
+   break;
}
 }
 
@@ -1062,7 +1088,8 @@ static void __init prom_check_platform_support(void)
struct platform_support supported = {
.hash_mmu = false,
.radix_mmu = false,
-   .radix_gtse = false
+   .radix_gtse = false,
+   .xive = false
};
int prop_len = prom_getproplen(prom.chosen,
   "ibm,arch-vec-5-platform-support");
@@ -1095,6 +1122,11 @@ static void __init prom_check_platform_support(void)
/* We're probably on a legacy hypervisor */
prom_debug("Assuming legacy hash support\n");
}
+
+   if (supported.xive) {
+   prom_debug("Asking for XIVE\n");
+   ibm_architecture_vec.vec5.intarch = OV5_FEAT(OV5_XIVE_EXPLOIT);
+   }
 }
 
 static void __init prom_send_capabilities(void)
-- 
2.13.5



[PATCH v3 0/8] guest exploitation of the XIVE interrupt controller

2017-08-30 Thread Cédric Le Goater
Hello,

On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
negotiation process determines whether the guest operates with an
interrupt controller using the legacy model, as found on POWER8, or in
XIVE exploitation mode, the newer POWER9 interrupt model. This
patchset is the latest proposal to add XIVE support in the sPAPR
machine.

Tested with a QEMU XIVE model for sPAPR machine and with the Power
hypervisor.

Code is here:

  https://github.com/legoater/linux/commits/xive
  https://github.com/legoater/qemu/commits/xive   

Thanks,

C.

Changes since v2 :

 - changed H_INT_SYNC hcall to reflect new api from the specs
 - added some Reviewed-by's

Changes since v1 :

 - introduced a common subroutine xive_queue_page_alloc()
 - introduced a xive_teardown_cpu() routine
 - removed P9 doorbell support when xive is enabled.
 - fixed xive_esb_read() naming
 - fixed XIVE mode parsing in CAS (just got the final specs)

Changes since RFC :

 - renamed backend to 'spapr'
 - fixed hotplug support
 - fixed kexec support
 - fixed src_chip value (XIVE_INVALID_CHIP_ID)
 - added doorbell support
 - added some debug logs
 - added  H_INT_ESB hcall
 - took into account '/ibm,plat-res-int-priorities'
 - fixed WARNING in xive_find_target_in_mask()

Cédric Le Goater (8):
  powerpc/xive: introduce a common routine xive_queue_page_alloc()
  powerpc/xive: guest exploitation of the XIVE interrupt controller
  powerpc/xive: rename xive_poke_esb() in xive_esb_read()
  powerpc/xive: introduce xive_esb_write()
  powerpc/xive: add the HW IRQ number under xive_irq_data
  powerpc/xive: introduce H_INT_ESB hcall
  powerpc/xive: add XIVE Exploitation Mode to CAS
  powerpc/xive: improve debugging macros

 arch/powerpc/include/asm/hvcall.h|  13 +-
 arch/powerpc/include/asm/prom.h  |   5 +-
 arch/powerpc/include/asm/xive.h  |   5 +
 arch/powerpc/kernel/prom_init.c  |  34 +-
 arch/powerpc/platforms/pseries/Kconfig   |   1 +
 arch/powerpc/platforms/pseries/hotplug-cpu.c |  11 +-
 arch/powerpc/platforms/pseries/kexec.c   |   6 +-
 arch/powerpc/platforms/pseries/setup.c   |   8 +-
 arch/powerpc/platforms/pseries/smp.c |  27 +-
 arch/powerpc/sysdev/xive/Kconfig |   5 +
 arch/powerpc/sysdev/xive/Makefile|   1 +
 arch/powerpc/sysdev/xive/common.c|  76 ++-
 arch/powerpc/sysdev/xive/native.c|  18 +-
 arch/powerpc/sysdev/xive/spapr.c | 662 +++
 arch/powerpc/sysdev/xive/xive-internal.h |   7 +
 15 files changed, 844 insertions(+), 35 deletions(-)
 create mode 100644 arch/powerpc/sysdev/xive/spapr.c

-- 
2.13.5



[PATCH v3 6/8] powerpc/xive: introduce H_INT_ESB hcall

2017-08-30 Thread Cédric Le Goater
The H_INT_ESB hcall() is used to issue a load or store to the ESB page
instead of using the MMIO pages. This can be used as a workaround for
some HW issues. The OS knows that this hcall should be used on an
interrupt source when the ESB hcall flag is set to 1 in the hcall
H_INT_GET_SOURCE_INFO.

To maintain the frontier between the xive frontend and backend, we
introduce a new xive operation 'esb_rw' to be used in the routines
doing memory accesses on the ESBs.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/include/asm/xive.h  |  1 +
 arch/powerpc/sysdev/xive/common.c| 10 ++--
 arch/powerpc/sysdev/xive/spapr.c | 44 +++-
 arch/powerpc/sysdev/xive/xive-internal.h |  1 +
 4 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 64ec9bbcf03e..371fbebf1ec9 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -56,6 +56,7 @@ struct xive_irq_data {
 #define XIVE_IRQ_FLAG_SHIFT_BUG0x04
 #define XIVE_IRQ_FLAG_MASK_FW  0x08
 #define XIVE_IRQ_FLAG_EOI_FW   0x10
+#define XIVE_IRQ_FLAG_H_INT_ESB0x20
 
 #define XIVE_INVALID_CHIP_ID   -1
 
diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index ac5f18a66742..8fd58773c241 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -198,7 +198,10 @@ static u8 xive_esb_read(struct xive_irq_data *xd, u32 
offset)
if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
offset |= offset << 4;
 
-   val = in_be64(xd->eoi_mmio + offset);
+   if ((xd->flags & XIVE_IRQ_FLAG_H_INT_ESB) && xive_ops->esb_rw)
+   val = xive_ops->esb_rw(xd->hw_irq, offset, 0, 0);
+   else
+   val = in_be64(xd->eoi_mmio + offset);
 
return (u8)val;
 }
@@ -209,7 +212,10 @@ static void xive_esb_write(struct xive_irq_data *xd, u32 
offset, u64 data)
if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
offset |= offset << 4;
 
-   out_be64(xd->eoi_mmio + offset, data);
+   if ((xd->flags & XIVE_IRQ_FLAG_H_INT_ESB) && xive_ops->esb_rw)
+   xive_ops->esb_rw(xd->hw_irq, offset, data, 1);
+   else
+   out_be64(xd->eoi_mmio + offset, data);
 }
 
 #ifdef CONFIG_XMON
diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
index 0fcae7504353..43e9eeb0d39f 100644
--- a/arch/powerpc/sysdev/xive/spapr.c
+++ b/arch/powerpc/sysdev/xive/spapr.c
@@ -224,7 +224,46 @@ static long plpar_int_sync(unsigned long flags, unsigned 
long lisn)
return 0;
 }
 
-#define XIVE_SRC_H_INT_ESB (1ull << (63 - 60)) /* TODO */
+#define XIVE_ESB_FLAG_STORE (1ull << (63 - 63))
+
+static long plpar_int_esb(unsigned long flags,
+ unsigned long lisn,
+ unsigned long offset,
+ unsigned long in_data,
+ unsigned long *out_data)
+{
+   unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+   long rc;
+
+   pr_devel("H_INT_ESB flags=%lx lisn=%lx offset=%lx in=%lx\n",
+   flags,  lisn, offset, in_data);
+
+   rc = plpar_hcall(H_INT_ESB, retbuf, flags, lisn, offset, in_data);
+   if (rc) {
+   pr_err("H_INT_ESB lisn=%ld offset=%ld returned %ld\n",
+  lisn, offset, rc);
+   return  rc;
+   }
+
+   *out_data = retbuf[0];
+
+   return 0;
+}
+
+static u64 xive_spapr_esb_rw(u32 lisn, u32 offset, u64 data, bool write)
+{
+   unsigned long read_data;
+   long rc;
+
+   rc = plpar_int_esb(write ? XIVE_ESB_FLAG_STORE : 0,
+  lisn, offset, data, &read_data);
+   if (rc)
+   return -1;
+
+   return write ? 0 : read_data;
+}
+
+#define XIVE_SRC_H_INT_ESB (1ull << (63 - 60))
 #define XIVE_SRC_LSI   (1ull << (63 - 61))
 #define XIVE_SRC_TRIGGER   (1ull << (63 - 62))
 #define XIVE_SRC_STORE_EOI (1ull << (63 - 63))
@@ -244,6 +283,8 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, struct 
xive_irq_data *data)
if (rc)
return  -EINVAL;
 
+   if (flags & XIVE_SRC_H_INT_ESB)
+   data->flags  |= XIVE_IRQ_FLAG_H_INT_ESB;
if (flags & XIVE_SRC_STORE_EOI)
data->flags  |= XIVE_IRQ_FLAG_STORE_EOI;
if (flags & XIVE_SRC_LSI)
@@ -487,6 +528,7 @@ static const struct xive_ops xive_spapr_ops = {
.setup_cpu  = xive_spapr_setup_cpu,
.teardown_cpu   = xive_spapr_teardown_cpu,
.sync_source= xive_spapr_sync_source,
+   .esb_rw = xive_spapr_esb_rw,
 #ifdef CONFIG_SMP
.get_ipi= xive_spapr_get_ipi,
.put_ipi= xive_spapr_put_ipi,
diff --git a/arch/powerpc/sysdev/xive/xive-internal.h 
b/arch/powerpc/sysdev/xive/xive-internal.h
index dd1e2022cce4..f34abed0c05f 

[PATCH v3 3/8] powerpc/xive: rename xive_poke_esb() in xive_esb_read()

2017-08-30 Thread Cédric Le Goater
xive_poke_esb() performs a load/read, so it is better named
xive_esb_read(), as we will need to introduce a xive_esb_write()
routine. Also use the XIVE_ESB_LOAD_EOI offset when EOI'ing LSI
interrupts.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

 Changes since v1:

 - fixed naming.

 arch/powerpc/sysdev/xive/common.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 8774af7a4105..8a58662ed793 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -190,7 +190,7 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, bool 
just_peek)
  * This is used to perform the magic loads from an ESB
  * described in xive.h
  */
-static u8 xive_poke_esb(struct xive_irq_data *xd, u32 offset)
+static u8 xive_esb_read(struct xive_irq_data *xd, u32 offset)
 {
u64 val;
 
@@ -227,7 +227,7 @@ void xmon_xive_do_dump(int cpu)
xive_dump_eq("IRQ", >queue[xive_irq_priority]);
 #ifdef CONFIG_SMP
{
-   u64 val = xive_poke_esb(&xc->ipi_data, XIVE_ESB_GET);
+   u64 val = xive_esb_read(&xc->ipi_data, XIVE_ESB_GET);
xmon_printf("  IPI state: %x:%c%c\n", xc->hw_ipi,
val & XIVE_ESB_VAL_P ? 'P' : 'p',
		val & XIVE_ESB_VAL_Q ? 'Q' : 'q');
@@ -326,9 +326,9 @@ void xive_do_source_eoi(u32 hw_irq, struct xive_irq_data *xd)
 * properly.
 */
if (xd->flags & XIVE_IRQ_FLAG_LSI)
-   in_be64(xd->eoi_mmio);
+   xive_esb_read(xd, XIVE_ESB_LOAD_EOI);
else {
-   eoi_val = xive_poke_esb(xd, XIVE_ESB_SET_PQ_00);
+   eoi_val = xive_esb_read(xd, XIVE_ESB_SET_PQ_00);
DBG_VERBOSE("eoi_val=%x\n", offset, eoi_val);
 
/* Re-trigger if needed */
@@ -383,12 +383,12 @@ static void xive_do_source_set_mask(struct xive_irq_data *xd,
 * ESB accordingly on unmask.
 */
if (mask) {
-   val = xive_poke_esb(xd, XIVE_ESB_SET_PQ_01);
+   val = xive_esb_read(xd, XIVE_ESB_SET_PQ_01);
xd->saved_p = !!(val & XIVE_ESB_VAL_P);
} else if (xd->saved_p)
-   xive_poke_esb(xd, XIVE_ESB_SET_PQ_10);
+   xive_esb_read(xd, XIVE_ESB_SET_PQ_10);
else
-   xive_poke_esb(xd, XIVE_ESB_SET_PQ_00);
+   xive_esb_read(xd, XIVE_ESB_SET_PQ_00);
 }
 
 /*
@@ -768,7 +768,7 @@ static int xive_irq_retrigger(struct irq_data *d)
 * To perform a retrigger, we first set the PQ bits to
 * 11, then perform an EOI.
 */
-   xive_poke_esb(xd, XIVE_ESB_SET_PQ_11);
+   xive_esb_read(xd, XIVE_ESB_SET_PQ_11);
 
/*
 * Note: We pass "0" to the hw_irq argument in order to
@@ -803,7 +803,7 @@ static int xive_irq_set_vcpu_affinity(struct irq_data *d, void *state)
irqd_set_forwarded_to_vcpu(d);
 
/* Set it to PQ=10 state to prevent further sends */
-   pq = xive_poke_esb(xd, XIVE_ESB_SET_PQ_10);
+   pq = xive_esb_read(xd, XIVE_ESB_SET_PQ_10);
 
/* No target ? nothing to do */
if (xd->target == XIVE_INVALID_TARGET) {
@@ -832,7 +832,7 @@ static int xive_irq_set_vcpu_affinity(struct irq_data *d, void *state)
 * for sure the queue slot is no longer in use.
 */
if (pq & 2) {
-   pq = xive_poke_esb(xd, XIVE_ESB_SET_PQ_11);
+   pq = xive_esb_read(xd, XIVE_ESB_SET_PQ_11);
xd->saved_p = true;
 
/*
-- 
2.13.5



[PATCH v3 8/8] powerpc/xive: improve debugging macros

2017-08-30 Thread Cédric Le Goater
Having the CPU identifier in the debug logs is helpful when tracking
issues. Also add some more logging and fix a compile issue in
xive_do_source_eoi().

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/common.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/common.c b/arch/powerpc/sysdev/xive/common.c
index 8fd58773c241..1c087ed7427f 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -40,7 +40,8 @@
 #undef DEBUG_ALL
 
 #ifdef DEBUG_ALL
-#define DBG_VERBOSE(fmt...)	pr_devel(fmt)
+#define DBG_VERBOSE(fmt, ...)  pr_devel("cpu %d - " fmt, \
+smp_processor_id(), ## __VA_ARGS__)
 #else
 #define DBG_VERBOSE(fmt...)	do { } while(0)
 #endif
@@ -344,7 +345,7 @@ void xive_do_source_eoi(u32 hw_irq, struct xive_irq_data *xd)
xive_esb_read(xd, XIVE_ESB_LOAD_EOI);
else {
eoi_val = xive_esb_read(xd, XIVE_ESB_SET_PQ_00);
-   DBG_VERBOSE("eoi_val=%x\n", offset, eoi_val);
+   DBG_VERBOSE("eoi_val=%x\n", eoi_val);
 
/* Re-trigger if needed */
if ((eoi_val & XIVE_ESB_VAL_Q) && xd->trig_mmio)
@@ -1004,6 +1005,9 @@ static void xive_ipi_eoi(struct irq_data *d)
 {
struct xive_cpu *xc = __this_cpu_read(xive_cpu);
 
+   DBG_VERBOSE("IPI eoi: irq=%d [0x%lx] (HW IRQ 0x%x) pending=%02x\n",
+   d->irq, irqd_to_hwirq(d), xc->hw_ipi, xc->pending_prio);
+
/* Handle possible race with unplug and drop stale IPIs */
if (!xc)
return;
-- 
2.13.5



Re: [PATCH kernel] PCI: Disable IOV before pcibios_sriov_disable()

2017-08-30 Thread Bjorn Helgaas
On Fri, Aug 11, 2017 at 06:19:33PM +1000, Alexey Kardashevskiy wrote:
> From: Gavin Shan 
> 
> The PowerNV platform is the only user of pcibios_sriov_disable().
> The IOV BAR could be shifted by pci_iov_update_resource(). The
> warning message in the function is printed if the IOV capability
> is in enabled (PCI_SRIOV_CTRL_VFE && PCI_SRIOV_CTRL_MSE) state.
> 
> This is the backtrace of what is happening:
>pci_disable_sriov
>sriov_disable
>pnv_pci_sriov_disable
>pnv_pci_vf_resource_shift
>pci_update_resource
>pci_iov_update_resource
> 
> This fixes the issue by disabling IOV capability before calling
> pcibios_sriov_disable(). With it, the disabling path matches
> the enabling path: pcibios_sriov_enable() is called before the
> IOV capability is enabled.
> 
> Cc: shan.ga...@gmail.com
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Reported-by: Carol L Soto 
> Signed-off-by: Gavin Shan 
> Tested-by: Carol L Soto 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> 
> This is a repost. Since Gavin left the team, I am trying to push it out.
> The previous conversation is here: https://patchwork.ozlabs.org/patch/732653/

I gave up on the previous issue.  I think this patch makes sense as-is
at least as far as the fact that we can't update a struct resource
while the device is still consuming it.  I reworked the changelog to
emphasize that.

I assume the fact that pci_iov_update_resource() dropped the resource
update caused some user-visible issue later on, and I might mention
that, too, if I knew what it was.

Here's what I would consider putting on pci/virtualization (the diff
is unchanged from your post):


commit 08132e7759b3929bea0ccdf8afe81ebf05351389
Author: Gavin Shan 
Date:   Fri Aug 11 18:19:33 2017 +1000

PCI: Disable VF decoding before updating resources in pcibios_sriov_disable()

A struct resource represents the address space consumed by a device.  We
should not modify that resource while the device is actively using the
address space.  For VFs, pci_iov_update_resource() enforces this by
printing a warning and doing nothing if the VFE (VF Enable) and MSE (VF
Memory Space Enable) bits are set.

Previously, both sriov_enable() and sriov_disable() called the
pcibios_sriov_disable() arch hook, which may update the struct resource,
while VFE and MSE were enabled.  This effectively dropped the resource
update pcibios_sriov_disable() intended to do.

Disable VF memory decoding before calling pcibios_sriov_disable().

Reported-by: Carol L Soto 
Tested-by: Carol L Soto 
Signed-off-by: Gavin Shan 
Signed-off-by: Alexey Kardashevskiy 
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas 
Cc: shan.ga...@gmail.com
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 120485d6f352..ac41c8be9200 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -331,7 +331,6 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
while (i--)
pci_iov_remove_virtfn(dev, i, 0);
 
-   pcibios_sriov_disable(dev);
 err_pcibios:
iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
pci_cfg_access_lock(dev);
@@ -339,6 +338,8 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
ssleep(1);
pci_cfg_access_unlock(dev);
 
+   pcibios_sriov_disable(dev);
+
if (iov->link != dev->devfn)
		sysfs_remove_link(&dev->dev.kobj, "dep_link");
 
@@ -357,14 +358,14 @@ static void sriov_disable(struct pci_dev *dev)
for (i = 0; i < iov->num_VFs; i++)
pci_iov_remove_virtfn(dev, i, 0);
 
-   pcibios_sriov_disable(dev);
-
iov->ctrl &= ~(PCI_SRIOV_CTRL_VFE | PCI_SRIOV_CTRL_MSE);
pci_cfg_access_lock(dev);
pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, iov->ctrl);
ssleep(1);
pci_cfg_access_unlock(dev);
 
+   pcibios_sriov_disable(dev);
+
if (iov->link != dev->devfn)
		sysfs_remove_link(&dev->dev.kobj, "dep_link");
 


Re: [RFC Part1 PATCH v3 16/17] X86/KVM: Provide support to create Guest and HV shared per-CPU variables

2017-08-30 Thread Borislav Petkov
On Wed, Aug 30, 2017 at 11:18:42AM -0500, Brijesh Singh wrote:
> I was trying to avoid mixing early and no-early set_memory_decrypted() but if
> feedback is: use early_set_memory_decrypted() only if its required otherwise
> use set_memory_decrypted() then I can improve the logic in next rev. thanks

Yes, I think you should use the early versions when you're, well,
*early* :-) But get rid of that for_each_possible_cpu() and do it only
on the current CPU, as this is a per-CPU path anyway. If you need to
do it on *every* CPU and very early, then you need a separate function
which is called in kvm_smp_prepare_boot_cpu() as there you're pre-SMP.
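
For illustration, a rough sketch of that pre-SMP variant -- the function
name is made up, and early_set_memory_decrypted()/sev_active() are
assumed from this patch series, not established API:

/* Hypothetical helper, called once from kvm_smp_prepare_boot_cpu(),
 * i.e. before the secondary CPUs are brought up. */
static void __init kvm_sev_map_percpu_data(void)
{
	int cpu;

	if (!sev_active())
		return;

	for_each_possible_cpu(cpu) {
		struct kvm_steal_time *st = &per_cpu(steal_time, cpu);

		/* early variant: safe before pageattr/SMP are ready */
		early_set_memory_decrypted((unsigned long)st, sizeof(*st));
	}
}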

Thx.

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 


Re: [PATCH v7 07/11] sparc64: optimized struct page zeroing

2017-08-30 Thread David Miller
From: Pasha Tatashin 
Date: Wed, 30 Aug 2017 09:19:58 -0400

> The reason I am not doing initializing stores is that they require
> a membar, even if only regular stores follow (I had hoped to do a
> membar before the first load). This is something I thought was not
> true, but after consulting with colleagues and checking the processor
> manual, I verified that it is the case.

Oh yes, that's right, now I remember.


Re: [RESEND PATCH v5 00/16] eeprom: at24: Add OF device ID table

2017-08-30 Thread Wolfram Sang
On Wed, Aug 30, 2017 at 06:19:02PM +0200, Javier Martinez Canillas wrote:
> Hello Wolfram,
> 
> On Tue, Aug 29, 2017 at 10:48 AM, Wolfram Sang  wrote:
> >
> >> I don't have a DT based system at hand now, but I'll test it again and
> >> let you know probably tomorrow.
> >
> > I will try again today, too. Thanks!
> >
> 
> Ok, I had some time to do some tests again. I used an ARM Chromebook
> (Exynos Peach Pi) that has an I2C touchpad (Atmel maXTouch).

I tried again as well and it still fails for me.

> Tested the following cases:

I think we should talk about the same case: Let me repeat what I did:

1) I added your patch "eeprom: at24: Add OF device ID table"
2) I added an EEPROM node to an I2C

+   eeprom@50 {
+   compatible = "renesas,24c01";
+   reg = <0x50>;
+   };

-> no at24 binding to the device

3) I revert your patch

-> at24 binding to the device

I think you should be able to test this DTS snippet even without a real
eeprom. Especially after applying this to the at24 driver.

diff --git a/drivers/misc/eeprom/at24.c b/drivers/misc/eeprom/at24.c
index 79c5c39be29cac..f9f547680c53db 100644
--- a/drivers/misc/eeprom/at24.c
+++ b/drivers/misc/eeprom/at24.c
@@ -805,11 +805,6 @@ static int at24_probe(struct i2c_client *client, const struct i2c_device_id *id)
 * Perform a one-byte test read to verify that the
 * chip is functional.
 */
-	err = at24_read(at24, 0, &test_byte, 1);
-   if (err) {
-   err = -ENODEV;
-   goto err_clients;
-   }
 
	at24->nvmem_config.name = dev_name(&client->dev);
	at24->nvmem_config.dev = &client->dev;


Can you check this?

Thanks,

   Wolfram





Build failures in powerpc ptrace selftests

2017-08-30 Thread Seth Forshee
With gcc 7 from Ubuntu 17.10 I'm getting the follwing error building the
ptrace selftests for powerpc:

  ptrace-tm-vsx.c: In function ‘tm_vsx’:
  ptrace-tm-vsx.c:42:2: error: PIC register clobbered by ‘r2’ in ‘asm’
asm __volatile__(
^~~
  make[1]: *** [ptrace-tm-vsx] Error 1
  ptrace-tm-spd-vsx.c: In function ‘tm_spd_vsx’:
  ptrace-tm-spd-vsx.c:55:2: error: PIC register clobbered by ‘r2’ in ‘asm’
asm __volatile__(
^~~
  make[1]: *** [ptrace-tm-spd-vsx] Error 1
  ptrace-tm-spr.c: In function ‘tm_spr’:
  ptrace-tm-spr.c:46:2: error: PIC register clobbered by ‘r2’ in ‘asm’
asm __volatile__(
^~~

Best I can tell gcc now considers any inline asm which clobbers r2 an
error, and I get it even with -fno-pic. I'm not familiar with powerpc
asm, but it's not obvious to me why r2 is in the clobber list for any of
these.
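
FWIW, here's a minimal reproducer of what I believe gcc 7 is objecting
to (my assumption: r2 is the TOC/PIC register on ppc64, so naming it as
an asm clobber is now rejected outright):

/* build with: gcc -c r2-clobber.c   (powerpc64, gcc >= 7) */
void clobber_r2(void)
{
	/* r2 holds the TOC pointer, which gcc now treats as fixed */
	asm volatile("nop" : : : "r2");
}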

Thanks,
Seth


[PATCH] powerpc: 4xx: constify platform_suspend_ops

2017-08-30 Thread Arvind Yadav
platform_suspend_ops are not supposed to change at runtime.
suspend_set_ops() works with a const platform_suspend_ops pointer,
so mark the non-const struct as const.

Signed-off-by: Arvind Yadav 
---
 arch/powerpc/sysdev/ppc4xx_cpm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/ppc4xx_cpm.c b/arch/powerpc/sysdev/ppc4xx_cpm.c
index ba95adf..6483675 100644
--- a/arch/powerpc/sysdev/ppc4xx_cpm.c
+++ b/arch/powerpc/sysdev/ppc4xx_cpm.c
@@ -240,7 +240,7 @@ static int cpm_suspend_enter(suspend_state_t state)
return 0;
 }
 
-static struct platform_suspend_ops cpm_suspend_ops = {
+static const struct platform_suspend_ops cpm_suspend_ops = {
.valid  = cpm_suspend_valid,
.enter  = cpm_suspend_enter,
 };
-- 
2.7.4



Re: [RESEND PATCH v5 00/16] eeprom: at24: Add OF device ID table

2017-08-30 Thread Javier Martinez Canillas
Hello Wolfram,

On Tue, Aug 29, 2017 at 10:48 AM, Wolfram Sang  wrote:
>
>> I don't have a DT based system at hand now, but I'll test it again and
>> let you know probably tomorrow.
>
> I will try again today, too. Thanks!
>

Ok, I had some time to do some tests again. I used an ARM Chromebook
(Exynos Peach Pi) that has an I2C touchpad (Atmel maXTouch).

Tested the following cases:

1) Driver without OF device ID table (only a I2C table with a
"maxtouch" entry) and DTS defining a device node with a
"atmel,maxtouch" compatible string. This is the case without any of
the patches in this series.

$ modinfo drivers/input/touchscreen/atmel_mxt_ts.ko | grep maxtouch
alias:  i2c:maxtouch

$ grep maxtouch /sys/devices/platform/soc/12e0.i2c/i2c-8/8-004b/uevent
OF_COMPATIBLE_0=atmel,maxtouch
MODALIAS=i2c:maxtouch

2) Driver without OF device ID table (only a I2C table with a
"maxtouch" entry) and DTS defining a device node with a
"atmel,maxtouch", "generic,maxtouch" compatible string. This is the
case when platform maintainers merge the DTS patches without the
driver patch.

$ modinfo drivers/input/touchscreen/atmel_mxt_ts.ko | grep maxtouch
alias:  i2c:maxtouch

$ grep maxtouch /sys/devices/platform/soc/12e0.i2c/i2c-8/8-004b/uevent
OF_COMPATIBLE_0=atmel,maxtouch
OF_COMPATIBLE_1=generic,maxtouch
MODALIAS=i2c:maxtouch

3) Driver with an OF device ID table (with a "generic,maxtouch" entry)
and DTS defining a device node with a "atmel,maxtouch" compatible
string. This is the case when the driver patch is merged without the
DTS patches.

$ modinfo drivers/input/touchscreen/atmel_mxt_ts.ko | grep maxtouch
alias:  of:N*T*Cgeneric,maxtouchC*
alias:  of:N*T*Cgeneric,maxtouch
alias:  i2c:maxtouch

$ grep maxtouch /sys/devices/platform/soc/12e0.i2c/i2c-8/8-004b/uevent
OF_COMPATIBLE_0=atmel,maxtouch
MODALIAS=i2c:maxtouch

4) Driver with an OF device ID table (with a "generic,maxtouch" entry)
and DTS defining a device node with a "atmel,maxtouch",
"generic,maxtouch" compatible string. This is the case when both the
DTS and driver patches are merged.

$ modinfo drivers/input/touchscreen/atmel_mxt_ts.ko | grep maxtouch
alias:  of:N*T*Cgeneric,maxtouchC*
alias:  of:N*T*Cgeneric,maxtouch
alias:  i2c:maxtouch

$ grep maxtouch /sys/devices/platform/soc/12e0.i2c/i2c-8/8-004b/uevent
OF_COMPATIBLE_0=atmel,maxtouch
OF_COMPATIBLE_1=generic,maxtouch
MODALIAS=i2c:maxtouch

For all cases module autoload, driver probe and evtest worked for me.

You said that (3) doesn't work but I don't understand why is failing
for you. Probably I'm missing something.

Best regards,
Javier


Re: [RFC Part1 PATCH v3 16/17] X86/KVM: Provide support to create Guest and HV shared per-CPU variables

2017-08-30 Thread Brijesh Singh

Hi Boris,

On 08/29/2017 05:22 AM, Borislav Petkov wrote:

[...]


On Mon, Jul 24, 2017 at 02:07:56PM -0500, Brijesh Singh wrote:

Some KVM specific MSR's (steal-time, asyncpf, avic_eio) allocates per-CPU


   MSRs


variable at compile time and share its physical address with hypervisor.


That sentence needs changing - the MSRs don't allocate - for them gets
allocated.


It presents a challege when SEV is active in guest OS, when SEV is active,
the guest memory is encrypted with guest key hence hypervisor will not
able to modify the guest memory. When SEV is active, we need to clear the
encryption attribute (aka C-bit) of shared physical addresses so that both
guest and hypervisor can access the data.


This whole paragraph needs rewriting.



I will improve the commit message in next rev.

[...]


+/* NOTE: function is marked as __ref because it is used by __init functions */


No need for that comment.

What should you look into is why do you need to call the early versions:

" * producing a warning (of course, no warning does not mean code is
  * correct, so optimally document why the __ref is needed and why it's OK)."

And we do have the normal set_memory_decrypted() etc helpers so why
aren't we using those?



Since kvm_guest_init() is called early in the boot process, we will not be
able to use set_memory_decrypted(). IIRC, if we try calling
set_memory_decrypted() early then we will hit a BUG_ON [1] -- mainly when it
tries to flush the caches.

[1] 
http://elixir.free-electrons.com/linux/latest/source/arch/x86/mm/pageattr.c#L167




If you need to use the early ones too, then you probably need to
differentiate this in the callers by passing a "bool early", which calls
the proper flavor.



Sure I can rearrange code to make it more readable and use "bool early"
parameter to differentiate it.



+static int __ref kvm_map_hv_shared_decrypted(void)
+{
+   static int once, ret;
+   int cpu;
+
+   if (once)
+   return ret;


So this function gets called per-CPU but you need to do this ugly "once"
thing - i.e., global function called in a per-CPU context.

Why can't you do that mapping only on the current CPU and then
when that function is called on the next CPU, it will do the same thing
on that next CPU?




Yes, it can be done but I remember running into issues during the CPU hot plug.
The patch uses early_set_memory_decrypted() -- which calls
kernel_physical_mapping_init() to split the large pages into smaller ones. IIRC, the
API did not work after the system is successfully booted. After the system is
booted we must use the set_memory_decrypted().

I was trying to avoid mixing early and no-early set_memory_decrypted() but if
feedback is: use early_set_memory_decrypted() only if its required otherwise
use set_memory_decrypted() then I can improve the logic in next rev. thanks


[...]


diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index da0be9a..52854cf 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -783,6 +783,9 @@
. = ALIGN(cacheline);   \
*(.data..percpu)\
*(.data..percpu..shared_aligned)\
+   . = ALIGN(PAGE_SIZE);   \
+   *(.data..percpu..hv_shared) \
+   . = ALIGN(PAGE_SIZE);   \
VMLINUX_SYMBOL(__per_cpu_end) = .;


Yeah, no, you can't do that. That's adding this section unconditionally
on *every* arch. You need to do some ifdeffery like it is done at the
beginning of that file and have this only on the arch which supports SEV.




Will do . thanks

-Brijesh


Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback

2017-08-30 Thread Adam Borowski
On Tue, Aug 29, 2017 at 08:56:15PM -0400, Jerome Glisse wrote:
> I will wait for people to test and for result of my own test before
> reposting if need be, otherwise i will post as separate patch.
>
> > But from a _very_ quick read-through this looks fine. But it obviously
> > needs testing.
> > 
> > People - *especially* the people who saw issues under KVM - can you
> > try out Jérôme's patch-series? I aded some people to the cc, the full
> > series is on lkml. Jérôme - do you have a git branch for people to
> > test that they could easily pull and try out?
> 
> https://cgit.freedesktop.org/~glisse/linux mmu-notifier branch
> git://people.freedesktop.org/~glisse/linux

Tested your branch as of 10f07641, on a long list of guest VMs.
No earth-shattering kaboom.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄ 


Re: [PATCH] powerpc/pseries: Don't attempt to acquire drc during memory hot add for assigned lmbs

2017-08-30 Thread John Allen
On 08/30/2017 09:35 AM, Nathan Fontenot wrote:
> On 08/29/2017 09:35 PM, Michael Ellerman wrote:
>> John Allen  writes:
>>
>>> Check if an LMB is assigned before attempting to call dlpar_acquire_drc in
>>> order to avoid any unnecessary rtas calls. This substantially reduces the
>>> running time of memory hot add on lpars with large amounts of memory.
>>>
>>> Signed-off-by: John Allen 
>>> ---
>>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
>>> index ca9b2f4..95cf2ff 100644
>>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>>> @@ -817,6 +817,9 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add, struct property *prop)
>>> return -EINVAL;
>>>
>>> for (i = 0; i < num_lmbs && lmbs_to_add != lmbs_added; i++) {
>>> +   if (lmbs[i].flags & DRCONF_MEM_ASSIGNED)
>>> +   continue;
>>> +
>>> rc = dlpar_acquire_drc(lmbs[i].drc_index);
>>> if (rc)
>>> continue;
>>
>> This doesn't build for me, see below. What compiler are you using to
>> test this?
>>   arch/powerpc/platforms/pseries/hotplug-memory.c: In function 
>> 'dlpar_memory':
>>   arch/powerpc/platforms/pseries/hotplug-memory.c:1081:2: error: 'rc' may be 
>> used uninitialized in this function [-Werror=maybe-uninitialized]
>> return rc;
>> ^
>>
> 
> I don't see this build failure either, looks like its time to update my
> compiler too.

This also builds for me. I'm using gcc version 4.8.5

> 
>> It's a bit confusing because you didn't modify that function, but the
>> function you did modify has been inlined into there.
>>
>> Possibly the compiler is wrong and we do always initialise rc, but it's
>> not clear at all.
>>
>> And it raises a bigger question, how is this supposed to actually work?
>>
>> If we go around the loop and find that something is already assigned:
>>
>>  for (i = 0; i < num_lmbs && lmbs_to_add != lmbs_added; i++) {
>>  if (lmbs[i].flags & DRCONF_MEM_ASSIGNED)
>>  continue;
>> ...
>>  lmbs_added++;
>> ...
>> }
>>
>> We don't increment lmbs_added, so at the end of the loop we will see
>> that lmbs_added is not equal to lmbs_to_add, and so we remove
>> everything:
> 
> There is a loop just before this code verifies there are enough
> LMBs available to satisfy the add request. This counts the number
> of LMBs that do not have the assigned flag set. This is where we
> update lmbs_available and if there are not enough lmbs available to
> satisfy the request, we bail.
> 
> When the loop you referenced above exits we should have attempted to
> add enough LMBs to satisfy the request.
> 
>>
>>  if (lmbs_added != lmbs_to_add) {
>>  pr_err("Memory hot-add failed, removing any added LMBs\n");
>>
>>  for (i = 0; i < num_lmbs; i++) {
>>  if (!lmbs[i].reserved)
>>  continue;
>>
>>  rc = dlpar_remove_lmb(&lmbs[i]);
>>
>>
> 
> if we get into this recovery path, we only try to remove any LMBs
> that were added in the loop above. The code marks any LMB that is
> added as 'reserved' so that we only try to remove LMBs marked
> as 'reserved' in the cleanup path.
> 
>> So it seems like if we ever hit that continue you added, the whole
>> operation will fail anyway. So I'm confused.
> 
> The added code was to skip over any LMBs that are already assigned to
> the partition. As the code is written now, we walk through the entire list
> of LMBs possible. For large systems this can be tens, or hundreds of thousands
> of LMBs.
> 
> Without this check the code will try to acquire the drc for each LMB (this
> is done via rtas-set-indicator calls) which will fail for LMBs that are
> already assigned to the partition.
> 
> As an example on a large system that has 100,000 LMBs where the first
> 50,000 are assigned to the partition. When we get an add request to add
> X more LMBs we do see that there are 50,000 available to add. We then try
> to add X LMBs but start at the beginning of the LMB list when trying to
> add them. This results in 50,000 attempts to acquire the drc for LMbs
> that we already own. Instead we should just skip trying to acquire them.
> 
> Hopefully that helps clarify things.
> 
> -Nathan
> 
>>
>> cheers
>>
> 



Re: Question: handling early hotplug interrupts

2017-08-30 Thread Nathan Fontenot
On 08/30/2017 01:09 AM, Michael Ellerman wrote:
> Daniel Henrique Barboza  writes:
> 
>> Hi Ben,
>>
>> On 08/29/2017 06:55 PM, Benjamin Herrenschmidt wrote:
>>> On Tue, 2017-08-29 at 17:43 -0300, Daniel Henrique Barboza wrote:
 Hi,

 This is a scenario I've been facing when working in early device
 hotplugs in QEMU. When a device is added, a IRQ pulse is fired to warn
 the guest of the event, then the kernel fetches it by calling
 'check_exception' and handles it. If the hotplug is done too early
 (before SLOF, for example), the pulse is ignored and the hotplug event
 is left unchecked in the events queue.

 One solution would be to pulse the hotplug queue interrupt after CAS,
 when we are sure that the hotplug queue is negotiated. However, this
 panics the kernel with sig 11 kernel access of bad area, which suggests
 that the kernel wasn't quite ready to handle it.
>>> That's not right. This is a bug that needs fixing. The interrupt should
>>> be masked anyway but still.
>>>
>>> Tell us more about the crash (backtrace etc...)  this definitely needs
>>> fixing.
>>
>> This is the backtrace using a 4.13.0-rc3 guest:
>>
>> -
>> [0.008913] Unable to handle kernel paging request for data at address 
>> 0x0100
>> [0.008989] Faulting instruction address: 0xc012c318
>> [0.009046] Oops: Kernel access of bad area, sig: 11 [#1]
>> [0.009092] SMP NR_CPUS=1024
>> [0.009092] NUMA
>> [0.009128] pSeries
>> [0.009173] Modules linked in:
>> [0.009210] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0-rc3+ #1
>> [0.009268] task: c000feb02580 task.stack: c000fe108000
>> [0.009325] NIP: c012c318 LR: c012c9c4 CTR: 
>> 
>> [0.009394] REGS: c000fffef910 TRAP: 0380   Not tainted (4.13.0-rc3+)
>> [0.009450] MSR: 82009033 
>> [0.009454]   CR: 28000822  XER: 2000
>> [0.009554] CFAR: c012c9c0 SOFTE: 0
>> [0.009554] GPR00: c012c9c4 c000fffefb90 c141f100 
>> 0400
>> [0.009554] GPR04:  c000fe1851c0  
>> fee6
>> [0.009554] GPR08: 000fffe1  0001 
>> 02001001
>> [0.009554] GPR12: 0040 cfd8 c000db58 
>> 
>> [0.009554] GPR16:    
>> 
>> [0.009554] GPR20:    
>> 0001
>> [0.009554] GPR24: 0002 0013 c000fe14bc00 
>> 0400
>> [0.009554] GPR28: 0400  c000fe1851c0 
>> 0001
>> [0.010121] NIP [c012c318] __queue_work+0x48/0x640
>> [0.010168] LR [c012c9c4] queue_work_on+0xb4/0xf0
>> [0.010213] Call Trace:
>> [0.010239] [c000fffefb90] [c000db58] kernel_init+0x8/0x160 
>> (unreliable)
>> [0.010308] [c000fffefc70] [c012c9c4] queue_work_on+0xb4/0xf0
>> [0.010368] [c000fffefcb0] [c00c4608] 
>> queue_hotplug_event+0xd8/0x150
>> [0.010435] [c000fffefd00] [c00c30d0] 
>> ras_hotplug_interrupt+0x140/0x190
>> [0.010505] [c000fffefd90] [c018c8b0] 
>> __handle_irq_event_percpu+0x90/0x310
>> [0.010573] [c000fffefe50] [c018cb6c] 
>> handle_irq_event_percpu+0x3c/0x90
>> [0.010642] [c000fffefe90] [c018cc24] 
>> handle_irq_event+0x64/0xc0
>> [0.010710] [c000fffefec0] [c01928b0] 
>> handle_fasteoi_irq+0xc0/0x230
>> [0.010779] [c000fffefef0] [c018ae14] 
>> generic_handle_irq+0x54/0x80
>> [0.010847] [c000fffeff20] [c00189f0] __do_irq+0x90/0x210
>> [0.010904] [c000fffeff90] [c002e730] call_do_irq+0x14/0x24
>> [0.010961] [c000fe10b640] [c0018c10] do_IRQ+0xa0/0x130
>> [0.011021] [c000fe10b6a0] [c0008c58] 
>> hardware_interrupt_common+0x158/0x160
>> [0.011090] --- interrupt: 501 at __replay_interrupt+0x38/0x3c
>> [0.011090] LR = arch_local_irq_restore+0x74/0x90
>> [0.011179] [c000fe10b990] [c000fe10b9e0] 0xc000fe10b9e0 
>> (unreliable)
>> [0.011249] [c000fe10b9b0] [c0b967fc] 
>> _raw_spin_unlock_irqrestore+0x4c/0xb0
>> [0.011316] [c000fe10b9e0] [c018ff50] __setup_irq+0x630/0x9e0
>> [0.011374] [c000fe10ba90] [c019054c] 
>> request_threaded_irq+0x13c/0x250
>> [0.011441] [c000fe10baf0] [c00c2cd0] 
>> request_event_sources_irqs+0x100/0x180
>> [0.011511] [c000fe10bc10] [c0eceda8] 
>> __machine_initcall_pseries_init_ras_IRQ+0xc4/0x12c
>> [0.011591] [c000fe10bc40] [c000d8c8] 
>> do_one_initcall+0x68/0x1e0
>> [0.011659] [c000fe10bd00] [c0eb4484] 
>> 

Re: [PATCH] powerpc/pseries: Don't attempt to acquire drc during memory hot add for assigned lmbs

2017-08-30 Thread Nathan Fontenot
On 08/29/2017 09:35 PM, Michael Ellerman wrote:
> John Allen  writes:
> 
>> Check if an LMB is assigned before attempting to call dlpar_acquire_drc in
>> order to avoid any unnecessary rtas calls. This substantially reduces the
>> running time of memory hot add on lpars with large amounts of memory.
>>
>> Signed-off-by: John Allen 
>> ---
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> index ca9b2f4..95cf2ff 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
>> @@ -817,6 +817,9 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add, struct property *prop)
>>  return -EINVAL;
>>
>>  for (i = 0; i < num_lmbs && lmbs_to_add != lmbs_added; i++) {
>> +if (lmbs[i].flags & DRCONF_MEM_ASSIGNED)
>> +continue;
>> +
>>  rc = dlpar_acquire_drc(lmbs[i].drc_index);
>>  if (rc)
>>  continue;
> 
> This doesn't build for me, see below. What compiler are you using to
> test this?
>   arch/powerpc/platforms/pseries/hotplug-memory.c: In function 'dlpar_memory':
>   arch/powerpc/platforms/pseries/hotplug-memory.c:1081:2: error: 'rc' may be 
> used uninitialized in this function [-Werror=maybe-uninitialized]
> return rc;
> ^
>

I don't see this build failure either, looks like its time to update my
compiler too.
 
> It's a bit confusing because you didn't modify that function, but the
> function you did modify has been inlined into there.
> 
> Possibly the compiler is wrong and we do always initialise rc, but it's
> not clear at all.
> 
> And it raises a bigger question, how is this supposed to actually work?
> 
> If we go around the loop and find that something is already assigned:
> 
>   for (i = 0; i < num_lmbs && lmbs_to_add != lmbs_added; i++) {
>   if (lmbs[i].flags & DRCONF_MEM_ASSIGNED)
>   continue;
> ...
>   lmbs_added++;
> ...
> }
> 
> We don't increment lmbs_added, so at the end of the loop we will see
> that lmbs_added is not equal to lmbs_to_add, and so we remove
> everything:

There is a loop just before this code verifies there are enough
LMBs available to satisfy the add request. This counts the number
of LMBs that do not have the assigned flag set. This is where we
update lmbs_available and if there are not enough lmbs available to
satisfy the request, we bail.

When the loop you referenced above exits we should have attempted to
add enough LMBs to satisfy the request.

> 
>   if (lmbs_added != lmbs_to_add) {
>   pr_err("Memory hot-add failed, removing any added LMBs\n");
> 
>   for (i = 0; i < num_lmbs; i++) {
>   if (!lmbs[i].reserved)
>   continue;
> 
>   rc = dlpar_remove_lmb(&lmbs[i]);
> 
>

if we get into this recovery path, we only try to remove any LMBs
that were added in the loop above. The code marks any LMB that is
added as 'reserved' so that we only try to remove LMBs marked
as 'reserved' in the cleanup path.
 
> So it seems like if we ever hit that continue you added, the whole
> operation will fail anyway. So I'm confused.

The added code was to skip over any LMBs that are already assigned to
the partition. As the code is written now, we walk through the entire list
of LMBs possible. For large systems this can be tens, or hundreds of thousands
of LMBs.

Without this check the code will try to acquire the drc for each LMB (this
is done via rtas-set-indicator calls) which will fail for LMBs that are
already assigned to the partition.

As an example on a large system that has 100,000 LMBs where the first
50,000 are assigned to the partition. When we get an add request to add
X more LMBs we do see that there are 50,000 available to add. We then try
to add X LMBs but start at the beginning of the LMB list when trying to
add them. This results in 50,000 attempts to acquire the drc for LMbs
that we already own. Instead we should just skip trying to acquire them.
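
Condensed, the add path then looks roughly like this (a simplified
sketch of dlpar_memory_add_by_count(), not the literal kernel code):

	/* Phase 1: are there enough unassigned LMBs at all? */
	for (i = 0; i < num_lmbs; i++) {
		if (!(lmbs[i].flags & DRCONF_MEM_ASSIGNED))
			lmbs_available++;
	}
	if (lmbs_available < lmbs_to_add)
		return -EINVAL;

	/* Phase 2: add LMBs, skipping the ones we already own. */
	for (i = 0; i < num_lmbs && lmbs_added != lmbs_to_add; i++) {
		if (lmbs[i].flags & DRCONF_MEM_ASSIGNED)
			continue;	/* the new check: no futile rtas call */

		if (dlpar_acquire_drc(lmbs[i].drc_index))
			continue;

		if (dlpar_add_lmb(&lmbs[i])) {
			dlpar_release_drc(lmbs[i].drc_index);
			continue;
		}

		/* only 'reserved' LMBs are rolled back on failure */
		lmbs[i].reserved = 1;
		lmbs_added++;
	}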

Hopefully that helps clarify things.

-Nathan

> 
> cheers
> 



Re: [PATCH v2 1/3] powerpc/mm: Export flush_all_mm()

2017-08-30 Thread Frederic Barrat



Le 30/08/2017 à 15:17, Michael Ellerman a écrit :

Frederic Barrat  writes:


With the optimizations introduced by commit a46cc7a90fd8
("powerpc/mm/radix: Improve TLB/PWC flushes"), flush_tlb_mm() no
longer flushes the page walk cache with radix. This patch introduces
flush_all_mm(), which flushes everything, tlb and pwc, for a given mm.

Signed-off-by: Frederic Barrat 
---
Changelog:
v2: this patch is new

  arch/powerpc/include/asm/book3s/64/tlbflush-hash.h  |  8 
  arch/powerpc/include/asm/book3s/64/tlbflush-radix.h |  3 +++
  arch/powerpc/include/asm/book3s/64/tlbflush.h   | 15 +++
  arch/powerpc/mm/tlb-radix.c |  6 --
  4 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 2f6373144e2c..c5d89d271a96 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -65,6 +65,14 @@ static inline void hash__flush_tlb_mm(struct mm_struct *mm)
  {
  }
  
+static inline void hash__local_flush_all_mm(struct mm_struct *mm)

+{
+}
+
+static inline void hash__flush_all_mm(struct mm_struct *mm)
+{
+}


It's not clear why it makes sense for these to be empty. Either for the
general idea of the "flush_all_mm()" API, or for your intended use by
CXL.


I was not too sure what to do for hash, but the idea is that the new 
flush_all_mm() is really the equivalent of the old flush_tlb_mm() from 
before Ben's optimizations for radix, and that was/still is an empty 
operation on hash, so I kept it that way.


We don't support hash for capi2 yet. Adding it will definitely require 
some work in that area, as the current approach (use count on the driver 
and all TLBIs becoming global when the driver is in use) won't hold much 
longer.


  Fred





RE: [PATCH 1/2] powerpc/xmon: hdec is now 64bits

2017-08-30 Thread David Laight
From: Balbir Singh
> Sent: 30 August 2017 01:28
> ISA 300 defines hypervisor decrementer to be 64 bits in length.
> This patch extends the print format for all archs to be 64 bits
> 
> Signed-off-by: Balbir Singh 
> ---
>  arch/powerpc/xmon/xmon.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> index 9e68f1d..1b26d53 100644
> --- a/arch/powerpc/xmon/xmon.c
> +++ b/arch/powerpc/xmon/xmon.c
> @@ -1749,7 +1749,7 @@ static void dump_206_sprs(void)
> 
>   printf("sdr1   = %.16lx  hdar  = %.16lx hdsisr = %.8x\n",
>   mfspr(SPRN_SDR1), mfspr(SPRN_HDAR), mfspr(SPRN_HDSISR));
> - printf("hsrr0  = %.16lx hsrr1  = %.16lx hdec = %.8x\n",
> + printf("hsrr0  = %.16lx hsrr1  = %.16lx hdec = %.16lx\n",
>   mfspr(SPRN_HSRR0), mfspr(SPRN_HSRR1), mfspr(SPRN_HDEC));
>   printf("lpcr   = %.16lx  pcr   = %.16lx lpidr = %.8x\n",
>   mfspr(SPRN_LPCR), mfspr(SPRN_PCR), mfspr(SPRN_LPID));

On the face of it the patch doesn't do what the commit message says.
Not only that, it is really silly to print a 32-bit value with 8 extra
leading zero digits.

Something more subtle was also wrong:
There were 3 mfspr() calls, 2 printed with %lx and one with %x.
That ought to generate a warning from gcc.

David



Re: [PATCH v3 0/3] powerpc/mm: Mark memory contexts requiring global TLBIs

2017-08-30 Thread Frederic Barrat
I'm dropping this series, as there was a recent change done in the 
memory context that I can reuse.

The follow up of the story is:
http://patchwork.ozlabs.org/patch/807570/

  Fred


Le 02/08/2017 à 22:29, Frederic Barrat a écrit :

capi2 and opencapi require the TLB invalidations being sent for
addresses used on the cxl adapter or opencapi device to be global, as
there's a translation cache in the PSL (for capi2) or NPU (for
opencapi). The CAPP, on behalf of the PSL, and NPU snoop the power bus.

This is not new: for the hash memory model, as soon as the cxl driver
is active, all local TLBIs become global. We need a similar mechanism
for the radix memory model. This patch tries to improve things a bit
by flagging the contexts requiring global TLBIs, therefore limiting
the "upgrade" and not affecting contexts not used by the card.

A longer-term goal is to modify the current implementation for hash to
follow the same direction, i.e. identify contexts needing global
TLBIs, but that will be for later. It would be required to support
hash for opencapi.

Changelog:
v3:
  - convert from RFC to PATCH
  - mark contexts used by XSL (cxllib) as needed global invalidation
RFC v2:
  - address comments received
  - rename MM_CONTEXT_GLOBAL_TLBI -> MM_GLOBAL_TLBIE
  - add memory barriers to make sure the device doesn't miss any TLBI
  - also add barrier for the hash implemention to fix the same issue

Frederic Barrat (3):
   powerpc/mm: Add marker for contexts requiring global TLB invalidations
   cxl: Mark context requiring global TLBIs
   cxl: Add memory barrier to guarantee TLBI scope

  arch/powerpc/include/asm/book3s/64/mmu.h | 18 ++
  arch/powerpc/include/asm/tlb.h   | 27 +++
  arch/powerpc/mm/mmu_context_book3s64.c   |  1 +
  arch/powerpc/mm/tlb-radix.c  |  8 
  arch/powerpc/mm/tlb_hash64.c |  3 ++-
  drivers/misc/cxl/api.c   | 12 ++--
  drivers/misc/cxl/cxllib.c|  7 +++
  drivers/misc/cxl/file.c  | 12 ++--
  include/misc/cxl-base.h  | 22 +++---
  9 files changed, 94 insertions(+), 16 deletions(-)





Re: [PATCH v7 07/11] sparc64: optimized struct page zeroing

2017-08-30 Thread Pasha Tatashin

Hi Dave,

Thank you for acking.

The reason I am not doing initializing stores is that they require a
membar, even if only regular stores follow (I had hoped to do a
membar before the first load). This is something I thought was not
true, but after consulting with colleagues and checking the processor
manual, I verified that it is the case.


Pasha



You should probably use initializing stores when you are doing 8
stores and we thus know the page struct is cache line aligned.

But other than that:

Acked-by: David S. Miller 


Re: [PATCH v2 2/3] cxl: Fix driver use count

2017-08-30 Thread Michael Ellerman
Frederic Barrat  writes:

> cxl keeps a driver use count, which is used with the hash memory model
> on p8 to know when to upgrade local TLBIs to global and to trigger
> callbacks to manage the MMU for PSL8.
>
> If a process opens a context and closes without attaching or fails the
> attachment, the driver use count is never decremented. As a
> consequence, TLB invalidations remain global, even if there are no
> active cxl contexts.
>
> We should increment the driver use count when the process is attaching
> to the cxl adapter, and not on open. It's not needed before the
> adapter starts using the context and the use count is decremented on
> the detach path, so it makes more sense.
>
> It affects only the user api. The kernel api is already doing The
> Right Thing.
>
> Signed-off-by: Frederic Barrat 
> Cc: sta...@vger.kernel.org # v4.2+
> Fixes: 7bb5d91a4dda ("cxl: Rework context lifetimes")
> Acked-by: Andrew Donnellan 

I'll pick this up straight away, because it's a fix.

So you can drop this from the series.

cheers


Re: [PATCH v2 1/3] powerpc/mm: Export flush_all_mm()

2017-08-30 Thread Michael Ellerman
Frederic Barrat  writes:

> With the optimizations introduced by commit a46cc7a90fd8
> ("powerpc/mm/radix: Improve TLB/PWC flushes"), flush_tlb_mm() no
> longer flushes the page walk cache with radix. This patch introduces
> flush_all_mm(), which flushes everything, tlb and pwc, for a given mm.
>
> Signed-off-by: Frederic Barrat 
> ---
> Changelog:
> v2: this patch is new
>
>  arch/powerpc/include/asm/book3s/64/tlbflush-hash.h  |  8 
>  arch/powerpc/include/asm/book3s/64/tlbflush-radix.h |  3 +++
>  arch/powerpc/include/asm/book3s/64/tlbflush.h   | 15 +++
>  arch/powerpc/mm/tlb-radix.c |  6 --
>  4 files changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> index 2f6373144e2c..c5d89d271a96 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
> @@ -65,6 +65,14 @@ static inline void hash__flush_tlb_mm(struct mm_struct *mm)
>  {
>  }
>  
> +static inline void hash__local_flush_all_mm(struct mm_struct *mm)
> +{
> +}
> +
> +static inline void hash__flush_all_mm(struct mm_struct *mm)
> +{
> +}

It's not clear why it makes sense for these to be empty. Either for the
general idea of the "flush_all_mm()" API, or for your intended use by
CXL.

cheers


Re: [PATCH] powerpc/powernv: Turn on SCSI_AACRAID in powernv_defconfig

2017-08-30 Thread Guilherme G. Piccoli
On 08/30/2017 08:07 AM, Michael Ellerman wrote:
>[...]
> OK. So maybe when the petitboot kernel moves up to 4.14 they may want to
> flip it back to being a module.
> 

OK, I'll be tuned! Thanks


> I think the majority of machines that have one of these adapters will be
> using it for their root disk, so I think the slow initialisation is just
> something we have to suffer.
> 
> cheers
> 



Re: [PATCH] cxl: Set the valid bit in PE for dedicated mode

2017-08-30 Thread Michael Ellerman
Vaibhav Jain  writes:

> Hi Mpe,
>
> Thanks for reviewing the patch
>
> Michael Ellerman  writes:
>
>>> +   ctx->elem->software_state = cpu_to_be32(CXL_PE_SOFTWARE_STATE_V);
>>> +   /* Make sure the changes to the PE are visible to the card */
>>
>> A barrier orders something vs something else. So what's the something
>> else in this case? Is it the afu_reset() below, what does that actually do?
>>
>
> The issue is with the call to afu_enable() after the call to afu_reset()
> that would start the AFU. If this load gets reordered and the PSL doesn't
> see the valid bit set for this structure then it will result in the PSL
> entering a freeze-state.

OK, so it's ordering the store above to ctx->elem->software_state vs the
store to the AFU in afu_enable().

> Though on second thoughts, afu_enable() grabs a spin-lock before
> doing an MMIO to start the AFU, which would force a barrier
> anyway. But since that spans the function boundary, to be safe I
> have added a write barrier after populating the process element.

The spin lock doesn't help you, stores are allowed to leak into the
locked region.

But the MMIO is preceded by a sync.

> Lastly, the function is not performance critical as it will usually be
> called only once in the lifetime of a process, so the impact of
> smp_wmb() would be minimal.

Sure. Performance is not the issue, barriers are subtle so it's
important that they're well documented.


So I don't think you need the barrier, because the out_be64() will do it
for you. But if you really want to add one, I don't mind.

But, you should use wmb(), not smp_wmb(), because the ordering is still
required on non-SMP systems. And please update the comment to capture
all of the above discussion.
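
i.e. something like this (sketch only):

	ctx->elem->software_state = cpu_to_be32(CXL_PE_SOFTWARE_STATE_V);
	/*
	 * Order the PE update above vs the MMIO store in afu_enable()
	 * that starts the AFU. Required on UP too (the PSL observes the
	 * store regardless of SMP), hence wmb() rather than smp_wmb().
	 */
	wmb();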

cheers


Re: [PATCH v3 4/4] powerpc/64s: idle ESL=0 stop can avoid MSR and save/restore overhead

2017-08-30 Thread Nicholas Piggin
On Wed, 30 Aug 2017 21:25:59 +1000
Michael Ellerman  wrote:

> Nicholas Piggin  writes:
> 
> > When stop is executed with EC=ESL=0, it appears to execute like a
> > normal instruction (resuming from NIP when woken by interrupt).
> > So all the save/restore handling can be avoided completely. In
> > particular NV GPRs do not have to be saved, and MSR does not have
> > to be switched back to kernel MSR.
> >
> > So move the test for "lite" sleep states out to power9_idle_stop.
> >
> > Reviewed-by: Gautham R. Shenoy 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/kernel/idle_book3s.S | 35 ---
> >  1 file changed, 24 insertions(+), 11 deletions(-)  
> 
> This is blowing up for me on mambo:

Oh this is a known bug in mambo that does not match the hardware.

You need >= Mambo.7.8.21, or this firmware patch to work around
the issue for old mambos.

mambo.git 58d3162f4d6204fc077ff4a6ba47e4d1e19d5120

https://lists.ozlabs.org/pipermail/skiboot/2017-August/008768.html

Thanks,
Nick


Re: [PATCH] cxl: Set the valid bit in PE for dedicated mode

2017-08-30 Thread Vaibhav Jain
Hi Mpe,

Thanks for reviewing the patch

Michael Ellerman  writes:

>> +ctx->elem->software_state = cpu_to_be32(CXL_PE_SOFTWARE_STATE_V);
>> +/* Make sure the changes to the PE are visible to the card */
>
> A barrier orders something vs something else. So what's the something
> else in this case? Is it the afu_reset() below, what does that actually do?
>

The issue is with the call to afu_enable() after the call to afu_reset()
that would start the AFU. If this load gets reordered and the PSL doesn't
see the valid bit set for this structure then it will result in the PSL
entering a freeze-state.

Though on second thoughts, afu_enable() grabs a spin-lock before
doing an MMIO to start the AFU, which would force a barrier
anyway. But since that spans the function boundary, to be safe I
have added a write barrier after populating the process element.

Lastly, the function is not performance critical as it will usually be
called only once in the lifetime of a process, so the impact of
smp_wmb() would be minimal.

-- 
Vaibhav Jain 
Linux Technology Center, IBM India Pvt. Ltd.



Re: [PATCH v3 4/4] powerpc/64s: idle ESL=0 stop can avoid MSR and save/restore overhead

2017-08-30 Thread Michael Ellerman
Nicholas Piggin  writes:

> When stop is executed with EC=ESL=0, it appears to execute like a
> normal instruction (resuming from NIP when woken by interrupt).
> So all the save/restore handling can be avoided completely. In
> particular NV GPRs do not have to be saved, and MSR does not have
> to be switched back to kernel MSR.
>
> So move the test for "lite" sleep states out to power9_idle_stop.
>
> Reviewed-by: Gautham R. Shenoy 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/kernel/idle_book3s.S | 35 ---
>  1 file changed, 24 insertions(+), 11 deletions(-)

This is blowing up for me on mambo:

  usbcore: registered new interface driver usb-storage
  Disabling lock debugging due to kernel taint
  Severe Machine check interrupt [Not recovered]
NIP [c02a0c04]: kmem_cache_free+0x64/0x2c0
Initiator: CPU
Error type: Real address [Load/Store (foreign)]
  opal: Hardware platform error: Unrecoverable Machine Check exception
  CPU: 0 PID: 0 Comm: swapper/0 Tainted: G   M
4.13.0-rc2-gcc-6.3.1-00257-g26268bb39bff #543
  task: c16b1200 task.stack: c175c000
  NIP:  c02a0c04 LR: c01128ec CTR: c0112800
  REGS: c0003fff7d80 TRAP: 0200   Tainted: G   M 
(4.13.0-rc2-gcc-6.3.1-00257-g26268bb39bff)
  MSR:  90209003   CR: 28002828  XER: 2000
  CFAR: c01128e8 DAR: c00a003c3ce0 DSISR: 0008 SOFTE: 1 
  GPR00: c01128ec c175f7b0 c1760500 c000f001a000 
  GPR04: c000f0f37f00 0003 000220c3  
  GPR08:  003c3cc0 c17e7758  
  GPR12: c0112800 cfff c0112800 c000f0f37f90 
  GPR16: c175c000 c175c000 0001 c16d7900 
  GPR20: c179ba98 7fff c175c000  
  GPR24: c017a8c0 000a c16d8a00 c175f8d0 
  GPR28: c01128ec c000f0f37f00 c00a003c3cc0 c000f001a000 
  NIP [c02a0c04] kmem_cache_free+0x64/0x2c0
  LR [c01128ec] put_cred_rcu+0xec/0x140
  Call Trace:
  [c175f7b0] [c175f800] init_thread_union+0x3800/0x4000 
(unreliable)
  [c175f840] [c01128ec] put_cred_rcu+0xec/0x140
  [c175f8b0] [c017a908] rcu_process_callbacks+0x438/0x6a0
  [c175f980] [c00e5e28] __do_softirq+0x198/0x310
  [c175fa70] [c00e6248] irq_exit+0xf8/0x140
  [c175fa90] [c0023710] timer_interrupt+0xa0/0xe0
  [c175fac0] [c0008fcc] decrementer_common+0x11c/0x120
  --- interrupt: 901 at replay_interrupt_return+0x0/0x4
  LR = arch_local_irq_restore.part.1+0x84/0xb0
  [c175fdb0] [c175c000] init_thread_union+0x0/0x4000 
(unreliable)
  [c175fdd0] [c001cde0] arch_cpu_idle+0xe0/0x140
  [c175fe00] [c07f4f44] default_idle_call+0x44/0x84
  [c175fe20] [c0142e54] do_idle+0x254/0x320
  [c175fe90] [c0143280] cpu_startup_entry+0x30/0x40
  [c175fec0] [c000d2d8] rest_init+0x2f8/0x320
  [c175ff00] [c1000d74] start_kernel+0x510/0x52c
  [c175ff90] [c000ab70] start_here_common+0x1c/0x4ac
  Instruction dump:
  6000 e9230008 71280100 40820170 2fbf 419e013c 3d420008 394a7258 
  7bbe8502 7bc93664 ebca 7fde4a14  712a0001 40820234 e93f0008 
  [1.217346467,0] OPAL: Reboot requested due to Platform error.
  [1.217351582,3] OPAL: failed to log an error
  [1.217358188,5] OPAL: Reboot request...


I'll just drop it for now.

cheers


Re: [PATCH] powerpc/powernv: Turn on SCSI_AACRAID in powernv_defconfig

2017-08-30 Thread Michael Ellerman
"Guilherme G. Piccoli"  writes:

> On 08/29/2017 08:22 AM, Michael Ellerman wrote:
>> "Guilherme G. Piccoli"  writes:
>> 
>>> On 08/28/2017 02:56 AM, Michael Ellerman wrote:
 Some Power9 boxes will have this adapter installed, so add it to the
 defconfig so we can boot on those machines without an initrd.
>>>
>>> Michael, not sure if this affects Petitboot (I know it has its own
>>> default config files), but in the past we had some issues with this
>>> driver and it's nice to have the possibility to blacklist it.
>> 
>> Why we were black listing it?
>
> Basically for the same reason you want it built-in - debug purposes heheh
>
> In fact, this driver takes some time in its initialization process, so
> if my rootfs is not under it, to speedup my boot-time I could blacklist
> it; also, to debug the aacraid driver we could blacklist it in Petitboot
> and debug on distro (easier to build kernels there), since the issue
> could be in first probe only (or be affected by kexec-out from Petitboot).
>
> Anyway, since make it a module will harm your workflow, I agree built-in
> is the best option!

OK. So maybe when the petitboot kernel moves up to 4.14 they may want to
flip it back to being a module.

I think the majority of machines that have one of these adapters will be
using it for their root disk, so I think the slow initialisation is just
something we have to suffer.

cheers


[PATCH] selftests/powerpc: Force ptrace tests to build -fno-pie

2017-08-30 Thread Michael Neuling
Currently these tests won't build with a `--enable-default-pie`
compiler as they require r30 to be clobbered. This gives
an error:
  ptrace-tm-spd-gpr.c:41:2: error: PIC register clobbered by 'r30' in 'asm'

This forces these tests to be built no-pie.

Signed-off-by: Michael Neuling 
---
 tools/testing/selftests/powerpc/ptrace/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/powerpc/ptrace/Makefile b/tools/testing/selftests/powerpc/ptrace/Makefile
index fe6bc60dfc..8932263e5a 100644
--- a/tools/testing/selftests/powerpc/ptrace/Makefile
+++ b/tools/testing/selftests/powerpc/ptrace/Makefile
@@ -6,7 +6,7 @@ include ../../lib.mk
 
 all: $(TEST_PROGS)
 
-CFLAGS += -m64 -I../../../../../usr/include -I../tm -mhtm
+CFLAGS += -m64 -I../../../../../usr/include -I../tm -mhtm -fno-pie
 
 $(TEST_PROGS): ../harness.c ../utils.c ../lib/reg.S ptrace.h
 
-- 
2.11.0



[PATCH v2 3/3] cxl: Enable global TLBIs for cxl contexts

2017-08-30 Thread Frederic Barrat
The PSL and nMMU need to see all TLB invalidations for the memory
contexts used on the adapter. For the hash memory model, it is done by
making all TLBIs global as soon as the cxl driver is in use. For
radix, we need something similar, but we can refine and only convert
to global the invalidations for contexts actually used by the device.

So increment the 'active_cpus' count for the contexts attached to the
cxl adapter. As soon as there's more than 1 active cpu, the TLBIs for
the context become global. Active cpu count must be decremented when
detaching to restore locality if possible and to avoid overflowing the
counter.

Signed-off-by: Frederic Barrat 
---
Changelog:
v2: Replace flush_tlb_mm() by the new flush_all_mm() to flush the TLBs
and PWCs (thanks to Ben)

 arch/powerpc/include/asm/mmu_context.h | 35 ++
 arch/powerpc/mm/mmu_context.c  |  9 -
 drivers/misc/cxl/api.c | 22 ++---
 drivers/misc/cxl/context.c |  3 +++
 drivers/misc/cxl/file.c| 19 --
 5 files changed, 74 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 309592589e30..71a1d01ff206 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -77,6 +77,41 @@ extern void switch_cop(struct mm_struct *next);
 extern int use_cop(unsigned long acop, struct mm_struct *mm);
 extern void drop_cop(unsigned long acop, struct mm_struct *mm);
 
+#ifdef CONFIG_PPC_BOOK3S_64
+static inline void inc_mm_active_cpus(struct mm_struct *mm)
+{
+	atomic_inc(&mm->context.active_cpus);
+}
+
+static inline void dec_mm_active_cpus(struct mm_struct *mm)
+{
+	atomic_dec(&mm->context.active_cpus);
+}
+
+static inline void mm_context_add_copro(struct mm_struct *mm)
+{
+   inc_mm_active_cpus(mm);
+}
+
+static inline void mm_context_remove_copro(struct mm_struct *mm)
+{
+   /*
+* Need to broadcast a global flush of the full mm before
+* decrementing active_cpus count, as the next TLBI may be
+* local and the nMMU and/or PSL need to be cleaned up.
+* Should be rare enough so that it's acceptable.
+*/
+   flush_all_mm(mm);
+   dec_mm_active_cpus(mm);
+}
+#else
+static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
+static inline void dec_mm_active_cpus(struct mm_struct *mm) { }
+static inline void mm_context_add_copro(struct mm_struct *mm) { }
+static inline void mm_context_remove_copro(struct mm_struct *mm) { }
+#endif
+
+
 extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
   struct task_struct *tsk);
 
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 0f613bc63c50..d60a62bf4fc7 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -34,15 +34,6 @@ static inline void switch_mm_pgdir(struct task_struct *tsk,
   struct mm_struct *mm) { }
 #endif
 
-#ifdef CONFIG_PPC_BOOK3S_64
-static inline void inc_mm_active_cpus(struct mm_struct *mm)
-{
-	atomic_inc(&mm->context.active_cpus);
-}
-#else
-static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
-#endif
-
 void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
 {
diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index a0c44d16bf30..1137a2cc1d3e 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "cxl.h"
 
@@ -331,9 +332,12 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
/* ensure this mm_struct can't be freed */
cxl_context_mm_count_get(ctx);
 
-   /* decrement the use count */
-   if (ctx->mm)
+   if (ctx->mm) {
+   /* decrement the use count from above */
mmput(ctx->mm);
+   /* make TLBIs for this context global */
+   mm_context_add_copro(ctx->mm);
+   }
}
 
/*
@@ -342,13 +346,25 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
 */
cxl_ctx_get();
 
+   /*
+* Barrier is needed to make sure all TLBIs are global before
+* we attach and the context starts being used by the adapter.
+*
+* Needed after mm_context_add_copro() for radix and
+* cxl_ctx_get() for hash/p8
+*/
+   smp_mb();
+
if ((rc = cxl_ops->attach_process(ctx, kernel, wed, 0))) {
put_pid(ctx->pid);
ctx->pid = NULL;
cxl_adapter_context_put(ctx->afu->adapter);
cxl_ctx_put();
-   if (task)
+   if (task) {

[PATCH v2 2/3] cxl: Fix driver use count

2017-08-30 Thread Frederic Barrat
cxl keeps a driver use count, which is used with the hash memory model
on p8 to know when to upgrade local TLBIs to global and to trigger
callbacks to manage the MMU for PSL8.

If a process opens a context and closes without attaching or fails the
attachment, the driver use count is never decremented. As a
consequence, TLB invalidations remain global, even if there are no
active cxl contexts.

We should increment the driver use count when the process is attaching
to the cxl adapter, and not on open. It's not needed before the
adapter starts using the context and the use count is decremented on
the detach path, so it makes more sense.

It affects only the user api. The kernel api is already doing The
Right Thing.

Signed-off-by: Frederic Barrat 
Cc: sta...@vger.kernel.org # v4.2+
Fixes: 7bb5d91a4dda ("cxl: Rework context lifetimes")
Acked-by: Andrew Donnellan 
---
Changelog:
v2: fix typo in comments (Thanks to Andrew)

 drivers/misc/cxl/api.c  | 4 
 drivers/misc/cxl/file.c | 8 +++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index 1a138c83f877..a0c44d16bf30 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -336,6 +336,10 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
mmput(ctx->mm);
}
 
+   /*
+* Increment driver use count. Enables global TLBIs for hash
+* and callbacks to handle the segment table
+*/
cxl_ctx_get();
 
if ((rc = cxl_ops->attach_process(ctx, kernel, wed, 0))) {
diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
index 0761271d68c5..4bfad9f6dc9f 100644
--- a/drivers/misc/cxl/file.c
+++ b/drivers/misc/cxl/file.c
@@ -95,7 +95,6 @@ static int __afu_open(struct inode *inode, struct file *file, 
bool master)
 
pr_devel("afu_open pe: %i\n", ctx->pe);
file->private_data = ctx;
-   cxl_ctx_get();
 
/* indicate success */
rc = 0;
@@ -225,6 +224,12 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
if (ctx->mm)
mmput(ctx->mm);
 
+   /*
+* Increment driver use count. Enables global TLBIs for hash
+* and callbacks to handle the segment table
+*/
+   cxl_ctx_get();
+
trace_cxl_attach(ctx, work.work_element_descriptor, 
work.num_interrupts, amr);
 
if ((rc = cxl_ops->attach_process(ctx, false, 
work.work_element_descriptor,
@@ -233,6 +238,7 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
cxl_adapter_context_put(ctx->afu->adapter);
put_pid(ctx->pid);
ctx->pid = NULL;
+   cxl_ctx_put();
cxl_context_mm_count_put(ctx);
goto out;
}
-- 
2.11.0
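
For clarity, the resulting pairing (illustrative summary based on the
changelog above, not code from the patch):

	open / __afu_open()        -> no driver use count taken
	attach (start_work/start_context) -> cxl_ctx_get()
	attach failure             -> cxl_ctx_put() (undo)
	detach / release path      -> cxl_ctx_put()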



[PATCH v2 1/3] powerpc/mm: Export flush_all_mm()

2017-08-30 Thread Frederic Barrat
With the optimizations introduced by commit a46cc7a90fd8
("powerpc/mm/radix: Improve TLB/PWC flushes"), flush_tlb_mm() no
longer flushes the page walk cache with radix. This patch introduces
flush_all_mm(), which flushes everything, tlb and pwc, for a given mm.

Signed-off-by: Frederic Barrat 
---
Changelog:
v2: this patch is new

 arch/powerpc/include/asm/book3s/64/tlbflush-hash.h  |  8 
 arch/powerpc/include/asm/book3s/64/tlbflush-radix.h |  3 +++
 arch/powerpc/include/asm/book3s/64/tlbflush.h   | 15 +++
 arch/powerpc/mm/tlb-radix.c |  6 --
 4 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index 2f6373144e2c..c5d89d271a96 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -65,6 +65,14 @@ static inline void hash__flush_tlb_mm(struct mm_struct *mm)
 {
 }
 
+static inline void hash__local_flush_all_mm(struct mm_struct *mm)
+{
+}
+
+static inline void hash__flush_all_mm(struct mm_struct *mm)
+{
+}
+
 static inline void hash__local_flush_tlb_page(struct vm_area_struct *vma,
  unsigned long vmaddr)
 {
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
index 9b433a624bf3..af06c6fe8a9f 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-radix.h
@@ -21,17 +21,20 @@ extern void radix__flush_tlb_range(struct vm_area_struct 
*vma, unsigned long sta
 extern void radix__flush_tlb_kernel_range(unsigned long start, unsigned long 
end);
 
 extern void radix__local_flush_tlb_mm(struct mm_struct *mm);
+extern void radix__local_flush_all_mm(struct mm_struct *mm);
 extern void radix__local_flush_tlb_page(struct vm_area_struct *vma, unsigned 
long vmaddr);
 extern void radix__local_flush_tlb_page_psize(struct mm_struct *mm, unsigned 
long vmaddr,
  int psize);
 extern void radix__tlb_flush(struct mmu_gather *tlb);
 #ifdef CONFIG_SMP
 extern void radix__flush_tlb_mm(struct mm_struct *mm);
+extern void radix__flush_all_mm(struct mm_struct *mm);
 extern void radix__flush_tlb_page(struct vm_area_struct *vma, unsigned long 
vmaddr);
 extern void radix__flush_tlb_page_psize(struct mm_struct *mm, unsigned long 
vmaddr,
int psize);
 #else
 #define radix__flush_tlb_mm(mm)		radix__local_flush_tlb_mm(mm)
+#define radix__flush_all_mm(mm)		radix__local_flush_all_mm(mm)
 #define radix__flush_tlb_page(vma,addr)	radix__local_flush_tlb_page(vma,addr)
 #define radix__flush_tlb_page_psize(mm,addr,p)	radix__local_flush_tlb_page_psize(mm,addr,p)
 #endif
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 72b925f97bab..70760d018bcd 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -57,6 +57,13 @@ static inline void local_flush_tlb_page(struct 
vm_area_struct *vma,
return hash__local_flush_tlb_page(vma, vmaddr);
 }
 
+static inline void local_flush_all_mm(struct mm_struct *mm)
+{
+   if (radix_enabled())
+   return radix__local_flush_all_mm(mm);
+   return hash__local_flush_all_mm(mm);
+}
+
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
if (radix_enabled())
@@ -79,9 +86,17 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
return radix__flush_tlb_page(vma, vmaddr);
return hash__flush_tlb_page(vma, vmaddr);
 }
+
+static inline void flush_all_mm(struct mm_struct *mm)
+{
+   if (radix_enabled())
+   return radix__flush_all_mm(mm);
+   return hash__flush_all_mm(mm);
+}
 #else
 #define flush_tlb_mm(mm)   local_flush_tlb_mm(mm)
 #define flush_tlb_page(vma, addr)  local_flush_tlb_page(vma, addr)
+#define flush_all_mm(mm)   local_flush_all_mm(mm)
 #endif /* CONFIG_SMP */
 /*
  * flush the page walk cache for the address
diff --git a/arch/powerpc/mm/tlb-radix.c b/arch/powerpc/mm/tlb-radix.c
index b3e849c4886e..5a1f46eff3a2 100644
--- a/arch/powerpc/mm/tlb-radix.c
+++ b/arch/powerpc/mm/tlb-radix.c
@@ -144,7 +144,7 @@ void radix__local_flush_tlb_mm(struct mm_struct *mm)
 EXPORT_SYMBOL(radix__local_flush_tlb_mm);
 
 #ifndef CONFIG_SMP
-static void radix__local_flush_all_mm(struct mm_struct *mm)
+void radix__local_flush_all_mm(struct mm_struct *mm)
 {
unsigned long pid;
 
@@ -154,6 +154,7 @@ static void radix__local_flush_all_mm(struct mm_struct *mm)
_tlbiel_pid(pid, RIC_FLUSH_ALL);
preempt_enable();
 }
+EXPORT_SYMBOL(radix__local_flush_all_mm);
 #endif /* CONFIG_SMP */
 
 void radix__local_flush_tlb_page_psize(struct 

Re: [PATCH RFC] powerpc: Implements MMIO emulation for lvx/stvx instructions

2017-08-30 Thread Paul Mackerras
On Tue, Aug 29, 2017 at 07:18:01PM -0300, Jose Ricardo Ziviani wrote:
> Hello!
> 
> This patch implements MMIO emulation for two instructions: lvx and stvx. I 
> started to implement other instructions but I'd like to have this reviewed 
> beforehand because this is my first patch here and I'll certainly have some 
> rework/fixes :-).
> 
> Note: stvx is only storing 8 bytes, for some reason the code 
> "vcpu->arch.paddr_accessed += run->mmio.len;", which adds the 8-byte offset 
> after the first write is not making any difference (interesting that it works 
> for load operations). I'm still investigating it but any idea about it will 
> be appreciated.

The run structure is mmapped by userspace (i.e. QEMU) and can be
written by userspace between the first and the second exits to
userspace (you have to do two exits to userspace because you can only
transfer 8 bytes on each exit).  It's possible that userspace might be
clearing run->mmio.len.  In general it's better not to rely on
anything in *run (except of course the mmio_data for a MMIO read) when
we come in from userspace to the kernel.

Paul.
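
For illustration, the kind of thing that avoids trusting *run across
exits (sketch only; "mmio_copies_left" is a hypothetical field, not
existing code):

	/* kernel-owned state survives the exit to userspace */
	vcpu->arch.mmio_copies_left = 2;	/* 16-byte lvx = 2 x 8 bytes */
	...
	/* on re-entry from userspace, decide from trusted state,
	 * not from the userspace-writable run->mmio.len */
	if (--vcpu->arch.mmio_copies_left) {
		vcpu->arch.paddr_accessed += 8;
		/* set up the second 8-byte MMIO exit here */
	}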


Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure

2017-08-30 Thread Laurent Dufour
On 30/08/2017 07:03, Anshuman Khandual wrote:
> On 08/29/2017 07:15 PM, Peter Zijlstra wrote:
>> On Tue, Aug 29, 2017 at 03:18:25PM +0200, Laurent Dufour wrote:
>>> On 29/08/2017 14:04, Peter Zijlstra wrote:
 On Tue, Aug 29, 2017 at 09:59:30AM +0200, Laurent Dufour wrote:
> On 27/08/2017 02:18, Kirill A. Shutemov wrote:
>>> +
>>> +   if (unlikely(!vma->anon_vma))
>>> +   goto unlock;
>>
>> It deserves a comment.
>
> You're right I'll add it in the next version.
> For the record, the root cause is that __anon_vma_prepare() requires the
> mmap_sem to be held because vm_next and vm_prev must be safe.

 But should that test not be:

if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma))
goto unlock;

 Because !anon vmas will never have ->anon_vma set and you don't want to
 exclude those.
>>>
>>> Yes in the case we later allow non anonymous vmas to be handled.
>>> Currently only anonymous vmas are supported so the check is good enough,
>>> isn't it ?
>>
>> That wasn't at all clear from reading the code. This makes it clear
>> ->anon_vma is only ever looked at for anonymous.
>>
>> And like Kirill says, we _really_ should start allowing some (if not
>> all) vm_ops. Large file based mappings aren't particularly rare.
>>
>> I'm not sure we want to introduce a white-list or just bite the bullet
>> and audit all ->fault() implementations. But either works and isn't
>> terribly difficult, auditing all is more work though.
> 
> filemap_fault() is used as vma-vm_ops->fault() for most of the file
> systems. Changing it can enable speculative fault support for all of
> them. It will still exclude other driver based vma-vm_ops->fault()
> implementation. AFAICS, __lock_page_or_retry() function can drop
> mm->mmap_sem if the page could not be locked right away. As suggested
> by Peterz, making it understand FAULT_FLAG_SPECULATIVE should be good
> enough. The patch is lightly tested for file mappings on top of this
> series.

Hi Anshuman,

This sounds pretty good, except for the FAULT_FLAG_RETRY_NOWAIT case I
mentioned in another mail.

The next step would be to find a way to discriminate between the vm_fault()
functions. Any idea?
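
One hypothetical shape would be a whitelist of audited handlers,
something like (sketch only):

	static bool vma_can_speculate(struct vm_area_struct *vma)
	{
		if (vma_is_anonymous(vma))
			return true;
		/* only ->fault handlers audited for FAULT_FLAG_SPECULATIVE;
		 * filemap_fault() would be the first candidate */
		return vma->vm_ops && vma->vm_ops->fault == filemap_fault;
	}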

Thanks,
Laurent.

> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a497024..08f3042 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>  unsigned int flags)
>  {
> +   if (flags & FAULT_FLAG_SPECULATIVE) {
> +   if (flags & FAULT_FLAG_KILLABLE) {
> +   int ret;
> +
> +   ret = __lock_page_killable(page);
> +   if (ret)
> +   return 0;
> +   } else
> +   __lock_page(page);
> +   return 1;
> +   }
> +
> if (flags & FAULT_FLAG_ALLOW_RETRY) {
> /*
>  * CAUTION! In this case, mmap_sem is not released
> diff --git a/mm/memory.c b/mm/memory.c
> index 549d235..02347f3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3836,8 +3836,6 @@ static int handle_pte_fault(struct vm_fault *vmf)
> if (!vmf->pte) {
> if (vma_is_anonymous(vmf->vma))
> return do_anonymous_page(vmf);
> -   else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
> -   return VM_FAULT_RETRY;
> else
> return do_fault(vmf);
> }
> @@ -4012,17 +4010,7 @@ int handle_speculative_fault(struct mm_struct *mm, 
> unsigned long address,
> goto unlock;
> }
> 
> -   /*
> -* Can't call vm_ops service as we don't know what they would do
> -* with the VMA.
> -* This includes huge pages from hugetlbfs.
> -*/
> -   if (vma->vm_ops) {
> -   trace_spf_vma_notsup(_RET_IP_, vma, address);
> -   goto unlock;
> -   }
> -
> -   if (unlikely(!vma->anon_vma)) {
> +   if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
> trace_spf_vma_notsup(_RET_IP_, vma, address);
> goto unlock;
> }
> 



Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure

2017-08-30 Thread Laurent Dufour
On 30/08/2017 07:58, Peter Zijlstra wrote:
> On Wed, Aug 30, 2017 at 10:33:50AM +0530, Anshuman Khandual wrote:
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index a497024..08f3042 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1181,6 +1181,18 @@ int __lock_page_killable(struct page *__page)
>>  int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
>>  unsigned int flags)
>>  {
>> +   if (flags & FAULT_FLAG_SPECULATIVE) {
>> +   if (flags & FAULT_FLAG_KILLABLE) {
>> +   int ret;
>> +
>> +   ret = __lock_page_killable(page);
>> +   if (ret)
>> +   return 0;
>> +   } else
>> +   __lock_page(page);
>> +   return 1;
>> +   }
>> +
>> if (flags & FAULT_FLAG_ALLOW_RETRY) {
>> /*
>>  * CAUTION! In this case, mmap_sem is not released
> 
> Yeah, that looks right.

Hum, I'm wondering if FAULT_FLAG_RETRY_NOWAIT should be forced in the
speculative path in that case to match the semantics of
__lock_page_or_retry().
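
Something like the below, maybe (untested sketch, mirroring what
__lock_page_or_retry() does in the FAULT_FLAG_ALLOW_RETRY case):

	if (flags & FAULT_FLAG_SPECULATIVE) {
		if (flags & FAULT_FLAG_RETRY_NOWAIT)
			return 0;	/* don't wait, caller will retry */
		if (flags & FAULT_FLAG_KILLABLE) {
			if (__lock_page_killable(page))
				return 0;
		} else
			__lock_page(page);
		return 1;
	}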

> 
>> @@ -4012,17 +4010,7 @@ int handle_speculative_fault(struct mm_struct *mm, 
>> unsigned long address,
>> goto unlock;
>> }
>>
>> +   if (unlikely(vma_is_anonymous(vma) && !vma->anon_vma)) {
>> trace_spf_vma_notsup(_RET_IP_, vma, address);
>> goto unlock;
>> }
> 
> As riel pointed out on IRC slightly later, private file maps also need
> ->anon_vma and those actually have ->vm_ops IIRC so the condition needs
> to be slightly more complicated.

Yes, I read the code again and came to the same conclusion.
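
For the record, the check would then need to cover both anonymous vmas
and private file maps, something like (sketch only, using the vma_flags
snapshot as the rest of the speculative path does):

	/* anonymous vmas and private file maps both rely on ->anon_vma */
	if (unlikely(!vma->anon_vma &&
		     (vma_is_anonymous(vma) || !(vmf.vma_flags & VM_SHARED))))
		goto unlock;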



Re: [PATCH v3 3/6] dt: booting-without-of: DT fix s/#interrupt-cell/#interrupt-cells/

2017-08-30 Thread Michael Ellerman
Geert Uytterhoeven  writes:

> Signed-off-by: Geert Uytterhoeven 
> Acked-by: Rob Herring 
> ---

Rob this has your ack, but I'd expect it to go via your tree? Or should
I grab it?

cheers

> diff --git a/Documentation/devicetree/booting-without-of.txt 
> b/Documentation/devicetree/booting-without-of.txt
> index 280d283304bb82d8..f35d3adacb987f7d 100644
> --- a/Documentation/devicetree/booting-without-of.txt
> +++ b/Documentation/devicetree/booting-without-of.txt
> @@ -1309,7 +1309,7 @@ number and level/sense information. All interrupt 
> children in an
>  OpenPIC interrupt domain use 2 cells per interrupt in their interrupts
>  property.
>  
> -The PCI bus binding specifies a #interrupt-cell value of 1 to encode
> +The PCI bus binding specifies a #interrupt-cells value of 1 to encode
>  which interrupt pin (INTA,INTB,INTC,INTD) is used.
>  
>  2) interrupt-parent property
> -- 
> 2.7.4


Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure

2017-08-30 Thread Laurent Dufour
On 27/08/2017 02:18, Kirill A. Shutemov wrote:
> On Fri, Aug 18, 2017 at 12:05:13AM +0200, Laurent Dufour wrote:
>> +/*
>> + * vm_normal_page() adds some processing which should be done while
>> + * holding the mmap_sem.
>> + */
>> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>> + unsigned int flags)
>> +{
>> +struct vm_fault vmf = {
>> +.address = address,
>> +};
>> +pgd_t *pgd;
>> +p4d_t *p4d;
>> +pud_t *pud;
>> +pmd_t *pmd;
>> +int dead, seq, idx, ret = VM_FAULT_RETRY;
>> +struct vm_area_struct *vma;
>> +struct mempolicy *pol;
>> +
>> +/* Clear flags that may lead to release the mmap_sem to retry */
>> +flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
>> +flags |= FAULT_FLAG_SPECULATIVE;
>> +
>> +idx = srcu_read_lock(&vma_srcu);
>> +vma = find_vma_srcu(mm, address);
>> +if (!vma)
>> +goto unlock;
>> +
>> +/*
>> + * Validate the VMA found by the lockless lookup.
>> + */
>> +dead = RB_EMPTY_NODE(&vma->vm_rb);
>> +seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> 
>> seqlock,vma_rb_erase() */
>> +if ((seq & 1) || dead)
>> +goto unlock;
>> +
>> +/*
>> + * Can't call vm_ops service as we don't know what they would do
>> + * with the VMA.
>> + * This includes huge pages from hugetlbfs.
>> + */
>> +if (vma->vm_ops)
>> +goto unlock;
> 
> I think we need to have a way to white-list safe ->vm_ops.
> 
>> +
>> +if (unlikely(!vma->anon_vma))
>> +goto unlock;
> 
> It deserves a comment.
> 
>> +
>> +vmf.vma_flags = READ_ONCE(vma->vm_flags);
>> +vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
>> +
>> +/* Can't call userland page fault handler in the speculative path */
>> +if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
>> +goto unlock;
>> +
>> +/*
>> + * MPOL_INTERLEAVE implies additional check in mpol_misplaced() which
>> + * are not compatible with the speculative page fault processing.
>> + */
>> +pol = __get_vma_policy(vma, address);
>> +if (!pol)
>> +pol = get_task_policy(current);
>> +if (pol && pol->mode == MPOL_INTERLEAVE)
>> +goto unlock;
>> +
>> +if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
>> +/*
>> + * This could be detected by the check address against VMA's
>> + * boundaries but we want to trace it as not supported instead
>> + * of changed.
>> + */
>> +goto unlock;
>> +
>> +if (address < READ_ONCE(vma->vm_start)
>> +|| READ_ONCE(vma->vm_end) <= address)
>> +goto unlock;
>> +
>> +/*
>> + * The three following checks are copied from access_error from
>> + * arch/x86/mm/fault.c
>> + */
>> +if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>> +   flags & FAULT_FLAG_INSTRUCTION,
>> +   flags & FAULT_FLAG_REMOTE))
>> +goto unlock;
>> +
>> +/* This one is required to check that the VMA has write access set */
>> +if (flags & FAULT_FLAG_WRITE) {
>> +if (unlikely(!(vmf.vma_flags & VM_WRITE)))
>> +goto unlock;
>> +} else {
>> +if (unlikely(!(vmf.vma_flags & (VM_READ | VM_EXEC | VM_WRITE
>> +goto unlock;
>> +}
>> +
>> +/*
>> + * Do a speculative lookup of the PTE entry.
>> + */
>> +local_irq_disable();
>> +pgd = pgd_offset(mm, address);
>> +if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
>> +goto out_walk;
>> +
>> +p4d = p4d_alloc(mm, pgd, address);
>> +if (p4d_none(*p4d) || unlikely(p4d_bad(*p4d)))
>> +goto out_walk;
>> +
>> +pud = pud_alloc(mm, p4d, address);
>> +if (pud_none(*pud) || unlikely(pud_bad(*pud)))
>> +goto out_walk;
>> +
>> +pmd = pmd_offset(pud, address);
>> +if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
>> +goto out_walk;
>> +
>> +/*
>> + * The above does not allocate/instantiate page-tables because doing so
>> + * would lead to the possibility of instantiating page-tables after
>> + * free_pgtables() -- and consequently leaking them.
>> + *
>> + * The result is that we take at least one !speculative fault per PMD
>> + * in order to instantiate it.
>> + */
> 
> 
> Doing all this job and just give up because we cannot allocate page tables
> looks very wasteful to me.
> 
> Have you considered to look how we can hand over from speculative to
> non-speculative path without starting from scratch (when possible)?
> 
>> +/* Transparent huge pages are not supported. */
>> +if (unlikely(pmd_trans_huge(*pmd)))
>> +goto out_walk;
> 
> That's looks like a blocker to me.
> 
> Is there any problem with making it supported (besides plain coding)?

This is not 

Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback

2017-08-30 Thread Mike Galbraith
On Tue, 2017-08-29 at 20:56 -0400, Jerome Glisse wrote:
> On Tue, Aug 29, 2017 at 05:11:24PM -0700, Linus Torvalds wrote:
> 
> > People - *especially* the people who saw issues under KVM - can you
> > try out Jérôme's patch-series? I aded some people to the cc, the full
> > series is on lkml. Jérôme - do you have a git branch for people to
> > test that they could easily pull and try out?
> 
> https://cgit.freedesktop.org/~glisse/linux mmu-notifier branch
> git://people.freedesktop.org/~glisse/linux

Looks good here.

I reproduced fairly quickly with RT host and 1 RT guest by just having
the guest do a parallel kbuild over NFS (the guest had to be restored
afterward, was corrupted).  I'm currently flogging 2 guests as well as
the host, whimper free.  I'll let the lot broil for while longer, but
at this point, smoke/flame appearance seems comfortingly unlikely.

-Mike




Re: [PATCH RFC] Interface to set SPRN_TIDR

2017-08-30 Thread Michael Neuling
Suka,

Please CC Christophe, who has an alternative way of doing this. We need to
get agreement across all users of TIDR/AS_notify...

His patch is here:
  https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-August/161582.html

Mikey

On Tue, 2017-08-29 at 19:38 -0700, Sukadev Bhattiprolu wrote:
> We need the SPRN_TIDR to be set for use with fast thread-wakeup
> (core-to-core wakeup) in VAS. Each user thread that has a receive
> window setup and expects to be notified when a sender issues a paste
> needs to have a unique SPRN_TIDR value.
> 
> The SPRN_TIDR value only needs to be unique within the process but for
> now we use a globally unique thread id as described below.
> 
> Signed-off-by: Sukadev Bhattiprolu 
> ---
> Changelog[v2]
>   - Michael Ellerman: Use an interface to assign TIDR so it is
>     assigned to only threads that need it; move assignment to
>     restore_sprs(). Drop lint from rebase;
> 
> 
>  arch/powerpc/include/asm/processor.h |  4 ++
>  arch/powerpc/include/asm/switch_to.h |  3 ++
>  arch/powerpc/kernel/process.c| 97
> 
>  3 files changed, 104 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/processor.h
> b/arch/powerpc/include/asm/processor.h
> index fab7ff8..bf6ba63 100644
> --- a/arch/powerpc/include/asm/processor.h
> +++ b/arch/powerpc/include/asm/processor.h
> @@ -232,6 +232,10 @@ struct debug_reg {
>  struct thread_struct {
>   unsigned long   ksp;/* Kernel stack pointer */
> 
> +#ifdef CONFIG_PPC_VAS
> + unsigned long   tidr;
> +#endif
> +
>  #ifdef CONFIG_PPC64
>   unsigned long   ksp_vsid;
>  #endif
> diff --git a/arch/powerpc/include/asm/switch_to.h
> b/arch/powerpc/include/asm/switch_to.h
> index 17c8380..4962455 100644
> --- a/arch/powerpc/include/asm/switch_to.h
> +++ b/arch/powerpc/include/asm/switch_to.h
> @@ -91,4 +91,7 @@ static inline void clear_task_ebb(struct task_struct *t)
>  #endif
>  }
> 
> +extern void set_thread_tidr(struct task_struct *t);
> +extern void clear_thread_tidr(struct task_struct *t);
> +
>  #endif /* _ASM_POWERPC_SWITCH_TO_H */
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index 1f0fd36..13abb22 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1132,6 +1132,10 @@ static inline void restore_sprs(struct thread_struct
> *old_thread,
>   mtspr(SPRN_TAR, new_thread->tar);
>   }
>  #endif
> +#ifdef CONFIG_PPC_VAS
> + if (old_thread->tidr != new_thread->tidr)
> + mtspr(SPRN_TIDR, new_thread->tidr);
> +#endif
>  }
> 
>  #ifdef CONFIG_PPC_BOOK3S_64
> @@ -1446,9 +1450,97 @@ void flush_thread(void)
>  #endif /* CONFIG_HAVE_HW_BREAKPOINT */
>  }
> 
> +#ifdef CONFIG_PPC_VAS
> +static DEFINE_SPINLOCK(vas_thread_id_lock);
> +static DEFINE_IDA(vas_thread_ida);
> +
> +/*
> + * We need to assign a unique thread id to each thread in a process. This
> + * thread id is intended to be used with the Fast Thread-wakeup (aka Core-
> + * to-core wakeup) mechanism being implemented on top of Virtual Accelerator
> + * Switchboard (VAS).
> + *
> + * To get a unique thread-id per process we could simply use task_pid_nr()
> + * but the problem is that task_pid_nr() is not yet available for the thread
> + * when copy_thread() is called. Fixing that would require more intrusive
> + * changes to arch-neutral code in the copy_process() code path.
> + *
> + * Further, to assign unique thread ids within each process, we need an
> + * atomic field (or an IDR) in task_struct, which again intrudes into the
> + * arch-neutral code.
> + *
> + * So try to assign globally unique thread ids for now.
> + *
> + * NOTE: TIDR 0 indicates that the thread does not need a TIDR value.
> + *    For now, only threads that expect to be notified by the VAS
> + *    hardware need a TIDR value and we assign values > 0 for those.
> + */
> +#define MAX_THREAD_CONTEXT   ((1 << 15) - 2)
> +static int assign_thread_tidr(void)
> +{
> + int index;
> + int err;
> +
> +again:
> + if (!ida_pre_get(&vas_thread_ida, GFP_KERNEL))
> + return -ENOMEM;
> +
> + spin_lock(&vas_thread_id_lock);
> + err = ida_get_new_above(&vas_thread_ida, 1, &index);
> + spin_unlock(&vas_thread_id_lock);
> +
> + if (err == -EAGAIN)
> + goto again;
> + else if (err)
> + return err;
> +
> + if (index > MAX_THREAD_CONTEXT) {
> + spin_lock(&vas_thread_id_lock);
> + ida_remove(&vas_thread_ida, index);
> + spin_unlock(&vas_thread_id_lock);
> + return -ENOMEM;
> + }
> +
> + return index;
> +}
> +
> +static void free_thread_tidr(int id)
> +{
> + spin_lock(&vas_thread_id_lock);
> + ida_remove(&vas_thread_ida, id);
> + spin_unlock(&vas_thread_id_lock);
> +}
> +
> +void clear_thread_tidr(struct task_struct *t)
> +{
> + if (t->thread.tidr) {
> + free_thread_tidr(t->thread.tidr);
> + t->thread.tidr = 0;
> +  

[PATCH v3 18/17] powerpc: Emulate load/store floating point as integer word instructions

2017-08-30 Thread Paul Mackerras
This adds emulation for the lfiwax, lfiwzx and stfiwx instructions.
This necessitated adding a new flag to indicate whether a floating
point or an integer conversion was needed for LOAD_FP and STORE_FP,
so this moves the size field in op->type up 4 bits.

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/sstep.h |  5 ++--
 arch/powerpc/lib/sstep.c | 60 ++--
 2 files changed, 48 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/sstep.h b/arch/powerpc/include/asm/sstep.h
index 309d1c5..ab9d849 100644
--- a/arch/powerpc/include/asm/sstep.h
+++ b/arch/powerpc/include/asm/sstep.h
@@ -68,6 +68,7 @@ enum instruction_type {
 #define SIGNEXT	0x20
 #define UPDATE	0x40	/* matches bit in opcode 31 instructions */
 #define BYTEREV	0x80
+#define FPCONV	0x100
 
 /* Barrier type field, ORed in with type */
 #define BARRIER_MASK   0xe0
@@ -93,8 +94,8 @@ enum instruction_type {
 #define VSX_CHECK_VEC  8   /* check MSR_VEC not MSR_VSX for reg >= 32 */
 
 /* Size field in type word */
-#define SIZE(n)	((n) << 8)
-#define GETSIZE(w)	((w) >> 8)
+#define SIZE(n)	((n) << 12)
+#define GETSIZE(w)	((w) >> 12)
 
 #define MKOP(t, f, s)  ((t) | (f) | SIZE(s))
 
diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 24031ca..2f6897c 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -457,19 +457,23 @@ NOKPROBE_SYMBOL(write_mem);
  * These access either the real FP register or the image in the
  * thread_struct, depending on regs->msr & MSR_FP.
  */
-static int do_fp_load(int rn, unsigned long ea, int nb, struct pt_regs *regs,
- bool cross_endian)
+static int do_fp_load(struct instruction_op *op, unsigned long ea,
+ struct pt_regs *regs, bool cross_endian)
 {
-   int err;
+   int err, rn, nb;
union {
+   int i;
+   unsigned int u;
float f;
double d[2];
unsigned long l[2];
u8 b[2 * sizeof(double)];
} u;
 
+   nb = GETSIZE(op->type);
if (!address_ok(regs, ea, nb))
return -EFAULT;
+   rn = op->reg;
err = copy_mem_in(u.b, ea, nb, regs);
if (err)
return err;
@@ -479,8 +483,14 @@ static int do_fp_load(int rn, unsigned long ea, int nb, 
struct pt_regs *regs,
do_byte_reverse(&u.b[8], 8);
}
preempt_disable();
-   if (nb == 4)
-   conv_sp_to_dp(&u.f, &u.d[0]);
+   if (nb == 4) {
+   if (op->type & FPCONV)
+   conv_sp_to_dp(&u.f, &u.d[0]);
+   else if (op->type & SIGNEXT)
+   u.l[0] = u.i;
+   else
+   u.l[0] = u.u;
+   }
if (regs->msr & MSR_FP)
put_fpr(rn, &u.d[0]);
else
@@ -498,25 +508,33 @@ static int do_fp_load(int rn, unsigned long ea, int nb, 
struct pt_regs *regs,
 }
 NOKPROBE_SYMBOL(do_fp_load);
 
-static int do_fp_store(int rn, unsigned long ea, int nb, struct pt_regs *regs,
-  bool cross_endian)
+static int do_fp_store(struct instruction_op *op, unsigned long ea,
+  struct pt_regs *regs, bool cross_endian)
 {
+   int rn, nb;
union {
+   unsigned int u;
float f;
double d[2];
unsigned long l[2];
u8 b[2 * sizeof(double)];
} u;
 
+   nb = GETSIZE(op->type);
if (!address_ok(regs, ea, nb))
return -EFAULT;
+   rn = op->reg;
preempt_disable();
if (regs->msr & MSR_FP)
get_fpr(rn, &u.d[0]);
else
u.l[0] = current->thread.TS_FPR(rn);
-   if (nb == 4)
-   conv_dp_to_sp(&u.d[0], &u.f);
+   if (nb == 4) {
+   if (op->type & FPCONV)
+   conv_dp_to_sp(&u.d[0], &u.f);
+   else
+   u.u = u.l[0];
+   }
if (nb == 16) {
rn |= 1;
if (regs->msr & MSR_FP)
@@ -2049,7 +2067,7 @@ int analyse_instr(struct instruction_op *op, const struct 
pt_regs *regs,
 #ifdef CONFIG_PPC_FPU
case 535:   /* lfsx */
case 567:   /* lfsux */
-   op->type = MKOP(LOAD_FP, u, 4);
+   op->type = MKOP(LOAD_FP, u | FPCONV, 4);
break;
 
case 599:   /* lfdx */
@@ -2059,7 +2077,7 @@ int analyse_instr(struct instruction_op *op, const struct 
pt_regs *regs,
 
case 663:   /* stfsx */
case 695:   /* stfsux */
-   op->type = MKOP(STORE_FP, u, 4);
+   op->type = MKOP(STORE_FP, u | FPCONV, 4);
break;
 
case 727:   /* stfdx */
@@ -2072,9 

Re: [PATCH v2 14/20] mm: Provide speculative fault infrastructure

2017-08-30 Thread Peter Zijlstra
On Wed, Aug 30, 2017 at 07:19:30AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2017-08-29 at 13:27 +0200, Peter Zijlstra wrote:
> > mpe helped me out and explained that is the PWC hint to TBLIE.
> > 
> > So, you set need_flush_all when you unhook pud/pmd/pte which you then
> > use to set PWC. So free_pgtables() will do the PWC when it unhooks
> > higher level pages.
> > 
> > But you're right that there's some issues, free_pgtables() itself
> > doesn't seem to use mm->page_table_lock,pmd->lock _AT_ALL_ to unhook the
> > pages.
> > 
> > If it were to do that, things should work fine since those locks would
> > then serialize against the speculative faults, we would never install a
> > page if the VMA would be under tear-down and it would thus not be
> > visible to your caches either.
> 
> That's one case. I don't remember of *all* the cases to be honest, but
> I do remember several times over the past few years thinking "ah we are
> fine because the mm sem taken for writing protects us from any
> concurrent tree structure change" :-)

Well, installing always seems to use the locks (it needs to, because its
always done with down_read()), that only leaves removal, and the only
place I know that removes stuff is free_pgtables().

But I think I found another fun place, copy_page_range(). While it
(pointlessly) takes all the PTLs on the dst mm it walks the src page
tables without any PTLs.

This means that if we have a multi-threaded process doing fork() a
thread of the src mm could instantiate page-tables that will not be
copied over.

Of course, this is highly dubious behaviour to begin with, and I don't
think there's anything fundamentally wrong with missing those pages but
we should document this stuff.
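
Something like the below could be a starting point for that
documentation (sketch of a comment, not a code change):

	/*
	 * NOTE: dup_mmap() holds the src mmap_sem for write, which keeps
	 * classic faults out, but speculative faults never take mmap_sem,
	 * so a thread of the src mm can still instantiate page tables
	 * under us that the child will not see.  The child just refaults,
	 * so missing those pages is benign -- but it is subtle.
	 */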


Re: Question: handling early hotplug interrupts

2017-08-30 Thread Michael Ellerman
Daniel Henrique Barboza  writes:

> Hi Ben,
>
> On 08/29/2017 06:55 PM, Benjamin Herrenschmidt wrote:
>> On Tue, 2017-08-29 at 17:43 -0300, Daniel Henrique Barboza wrote:
>>> Hi,
>>>
>>> This is a scenario I've been facing when working in early device
>>> hotplugs in QEMU. When a device is added, a IRQ pulse is fired to warn
>>> the guest of the event, then the kernel fetches it by calling
>>> 'check_exception' and handles it. If the hotplug is done too early
>>> (before SLOF, for example), the pulse is ignored and the hotplug event
>>> is left unchecked in the events queue.
>>>
>>> One solution would be to pulse the hotplug queue interrupt after CAS,
>>> when we are sure that the hotplug queue is negotiated. However, this
>>> panics the kernel with sig 11 kernel access of bad area, which suggests
>>> that the kernel wasn't quite ready to handle it.
>> That's not right. This is a bug that needs fixing. The interrupt should
>> be masked anyway but still.
>>
>> Tell us more about the crash (backtrace etc...)  this definitely needs
>> fixing.
>
> This is the backtrace using a 4.13.0-rc3 guest:
>
> -
> [0.008913] Unable to handle kernel paging request for data at address 
> 0x0100
> [0.008989] Faulting instruction address: 0xc012c318
> [0.009046] Oops: Kernel access of bad area, sig: 11 [#1]
> [0.009092] SMP NR_CPUS=1024
> [0.009092] NUMA
> [0.009128] pSeries
> [0.009173] Modules linked in:
> [0.009210] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0-rc3+ #1
> [0.009268] task: c000feb02580 task.stack: c000fe108000
> [0.009325] NIP: c012c318 LR: c012c9c4 CTR: 
> 
> [0.009394] REGS: c000fffef910 TRAP: 0380   Not tainted (4.13.0-rc3+)
> [0.009450] MSR: 82009033 
> [0.009454]   CR: 28000822  XER: 2000
> [0.009554] CFAR: c012c9c0 SOFTE: 0
> [0.009554] GPR00: c012c9c4 c000fffefb90 c141f100 
> 0400
> [0.009554] GPR04:  c000fe1851c0  
> fee6
> [0.009554] GPR08: 000fffe1  0001 
> 02001001
> [0.009554] GPR12: 0040 cfd8 c000db58 
> 
> [0.009554] GPR16:    
> 
> [0.009554] GPR20:    
> 0001
> [0.009554] GPR24: 0002 0013 c000fe14bc00 
> 0400
> [0.009554] GPR28: 0400  c000fe1851c0 
> 0001
> [0.010121] NIP [c012c318] __queue_work+0x48/0x640
> [0.010168] LR [c012c9c4] queue_work_on+0xb4/0xf0
> [0.010213] Call Trace:
> [0.010239] [c000fffefb90] [c000db58] kernel_init+0x8/0x160 
> (unreliable)
> [0.010308] [c000fffefc70] [c012c9c4] queue_work_on+0xb4/0xf0
> [0.010368] [c000fffefcb0] [c00c4608] 
> queue_hotplug_event+0xd8/0x150
> [0.010435] [c000fffefd00] [c00c30d0] 
> ras_hotplug_interrupt+0x140/0x190
> [0.010505] [c000fffefd90] [c018c8b0] 
> __handle_irq_event_percpu+0x90/0x310
> [0.010573] [c000fffefe50] [c018cb6c] 
> handle_irq_event_percpu+0x3c/0x90
> [0.010642] [c000fffefe90] [c018cc24] 
> handle_irq_event+0x64/0xc0
> [0.010710] [c000fffefec0] [c01928b0] 
> handle_fasteoi_irq+0xc0/0x230
> [0.010779] [c000fffefef0] [c018ae14] 
> generic_handle_irq+0x54/0x80
> [0.010847] [c000fffeff20] [c00189f0] __do_irq+0x90/0x210
> [0.010904] [c000fffeff90] [c002e730] call_do_irq+0x14/0x24
> [0.010961] [c000fe10b640] [c0018c10] do_IRQ+0xa0/0x130
> [0.011021] [c000fe10b6a0] [c0008c58] 
> hardware_interrupt_common+0x158/0x160
> [0.011090] --- interrupt: 501 at __replay_interrupt+0x38/0x3c
> [0.011090] LR = arch_local_irq_restore+0x74/0x90
> [0.011179] [c000fe10b990] [c000fe10b9e0] 0xc000fe10b9e0 
> (unreliable)
> [0.011249] [c000fe10b9b0] [c0b967fc] 
> _raw_spin_unlock_irqrestore+0x4c/0xb0
> [0.011316] [c000fe10b9e0] [c018ff50] __setup_irq+0x630/0x9e0
> [0.011374] [c000fe10ba90] [c019054c] 
> request_threaded_irq+0x13c/0x250
> [0.011441] [c000fe10baf0] [c00c2cd0] 
> request_event_sources_irqs+0x100/0x180
> [0.011511] [c000fe10bc10] [c0eceda8] 
> __machine_initcall_pseries_init_ras_IRQ+0xc4/0x12c
> [0.011591] [c000fe10bc40] [c000d8c8] 
> do_one_initcall+0x68/0x1e0
> [0.011659] [c000fe10bd00] [c0eb4484] 
> kernel_init_freeable+0x284/0x370
> [0.011725] [c000fe10bdc0] [c000db7c] kernel_init+0x2c/0x160
> [0.011782] [c000fe10be30] [c000bc9c] 
> 

Re: Question: handling early hotplug interrupts

2017-08-30 Thread Michael Ellerman
Daniel Henrique Barboza  writes:

> Hi,
>
> This is a scenario I've been facing when working in early device 
> hotplugs in QEMU. When a device is added, an IRQ pulse is fired to warn 
> the guest of the event, then the kernel fetches it by calling 
> 'check_exception' and handles it. If the hotplug is done too early 
> (before SLOF, for example), the pulse is ignored and the hotplug event 
> is left unchecked in the events queue.
>
> One solution would be to pulse the hotplug queue interrupt after CAS, 
> when we are sure that the hotplug queue is negotiated. However, this 
> panics the kernel with sig 11 kernel access of bad area, which suggests 
> that the kernel wasn't quite ready to handle it.
>
> In my experiments using upstream 4.13 I saw that there is a 'safe time' 
> to pulse the queue, sometime after CAS and before mounting the root fs, 
> but I wasn't able to pinpoint it. From QEMU perspective, the last hcall 
> done (an h_set_mode) is still too early to pulse it and the kernel 
> panics. Looking at the kernel source I saw that the IRQ handling is 
> initiated quite early in the init process.
>
> So my question (ok, actually 2 questions):
>
> - Is my analysis correct? Is there an unsafe time to fire a IRQ pulse 
> before CAS that can break the kernel or am I overlooking/doing something 
> wrong?
> - is there a reliable way to know when can the kernel safely handle the 
> hotplug interrupt?

In addition to Ben's comments, you need to think about this differently.

The operating system you're booting may not be Linux.

Whatever Qemu does needs to make sense without reference to the exact
details or ordering of the Linux code. Qemu needs to provide a mechanism
that any operating system could use, and then we can make it work with
Linux.

cheers