Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.12-1 tag

2021-02-22 Thread Oliver O'Halloran
On Tue, Feb 23, 2021 at 9:44 AM Linus Torvalds
 wrote:
>
> On Mon, Feb 22, 2021 at 4:06 AM Michael Ellerman  wrote:
> >
> > Please pull powerpc updates for 5.12.
>
> Pulled. However:
>
> >  mode change 100755 => 100644 
> > tools/testing/selftests/powerpc/eeh/eeh-functions.sh
> >  create mode 100755 tools/testing/selftests/powerpc/eeh/eeh-vf-aware.sh
> >  create mode 100755 tools/testing/selftests/powerpc/eeh/eeh-vf-unaware.sh
>
> Somebody is being confused.
>
> Why create two new shell scripts with the proper executable bit, and
> then remove the executable bit from an existing one?
>
> That just seems very inconsistent.

eeh-function.sh just provides some helper functions for the other
scripts and doesn't do anything when executed directly. I thought
making it non-executable made sense.

>
>  Linus


Re: [PATCH] arch:powerpc simple_write_to_buffer return check

2021-02-04 Thread Oliver O'Halloran
On Fri, Feb 5, 2021 at 5:17 AM Mayank Suman  wrote:
>
> Signed-off-by: Mayank Suman 

commit messages aren't optional

> ---
>  arch/powerpc/kernel/eeh.c| 8 
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 4 ++--
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 813713c9120c..2dbe1558a71f 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1628,8 +1628,8 @@ static ssize_t eeh_force_recover_write(struct file 
> *filp,
> char buf[20];
> int ret;
>
> -   ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
> -   if (!ret)
> +   ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, 
> count);

We should probably be zeroing the buffer. Reading to sizeof(buf) - 1
is done in a few places to guarantee that the string is nul
terminated, but without the preceeding memset() that isn't actually
guaranteed.

> +   if (ret <= 0)
> return -EFAULT;

EFAULT is supposed to be returned when the user supplies a buffer to
write(2) which is outside their address space. I figured letting the
sscanf() in the next step fail if the user passes writes a zero-length
buffer and returning EINVAL made more sense. That said, the exact
semantics around zero length writes are pretty handwavy so I guess
this isn't wrong, but I don't think it's better either.

> /*
> @@ -1696,7 +1696,7 @@ static ssize_t eeh_dev_check_write(struct file *filp,
>
> memset(buf, 0, sizeof(buf));
> ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, 
> count);
> -   if (!ret)
> +   if (ret <= 0)
> return -EFAULT;
>
> ret = sscanf(buf, "%x:%x:%x.%x", , , , );
> @@ -1836,7 +1836,7 @@ static ssize_t eeh_dev_break_write(struct file *filp,
>
> memset(buf, 0, sizeof(buf));
> ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, 
> count);
> -   if (!ret)
> +   if (ret <= 0)
> return -EFAULT;
>
> ret = sscanf(buf, "%x:%x:%x.%x", , , , );
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
> b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index 89e22c460ebf..36ed2b8f7375 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -76,8 +76,8 @@ static ssize_t pnv_eeh_ei_write(struct file *filp,
> return -ENXIO;
>
> /* Copy over argument buffer */
> -   ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
> -   if (!ret)
> +   ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, 
> count);
> +   if (ret <= 0)
> return -EFAULT;
>
> /* Retrieve parameters */
> --
> 2.30.0
>


Re: [PATCH] powerpc/eeh: remove unneeded semicolon

2021-02-01 Thread Oliver O'Halloran
On Tue, Feb 2, 2021 at 2:21 PM Yang Li  wrote:
>
> Eliminate the following coccicheck warning:
> ./arch/powerpc/kernel/eeh.c:782:2-3: Unneeded semicolon

Doesn't appear to be a load-bearing semicolon. It's hard to tell with EEH.

Reviewed-by: Oliver O'Halloran 

>
> Reported-by: Abaci Robot 
> Signed-off-by: Yang Li 
> ---
>  arch/powerpc/kernel/eeh.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 813713c..02c058d 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -779,7 +779,7 @@ int pcibios_set_pcie_reset_state(struct pci_dev *dev, 
> enum pcie_reset_state stat
> default:
> eeh_pe_state_clear(pe, EEH_PE_ISOLATED | EEH_PE_CFG_BLOCKED, 
> true);
> return -EINVAL;
> -   };
> +   }
>
> return 0;
>  }
> --
> 1.8.3.1
>


Re: [PATCH v2] powerpc/pci: unmap legacy INTx interrupts when a PHB is removed

2020-11-02 Thread Oliver O'Halloran
On Tue, Nov 3, 2020 at 1:39 AM Cédric Le Goater  wrote:
>
> On 10/14/20 4:55 AM, Alexey Kardashevskiy wrote:
> >
> > How do you remove PHBs exactly? There is no such thing in the powernv 
> > platform, I thought someone added this and you are fixing it but no. PHBs 
> > on powernv are created at the boot time and there is no way to remove them, 
> > you can only try removing all the bridges.
>
> yes. I noticed that later when proposing the fix for the double
> free.
>
> > So what exactly are you doing?
>
> What you just said above, with the commands :
>
>   echo 1 >  /sys/devices/pci0031\:00/0031\:00\:00.0/remove
>   echo 1 >  /sys/devices/pci0031\:00/pci_bus/0031\:00/rescan

Right, so that'll remove the root port device (and Bus 01 beneath it),
but the PHB itself is still there. If it was removed the root bus
would also disappear.


Re: [PATCH] powerpc/eeh_cache: Fix a possible debugfs deadlock

2020-10-29 Thread Oliver O'Halloran
On Thu, Oct 29, 2020 at 2:27 AM Qian Cai  wrote:
>
> Lockdep complains that a possible deadlock below in
> eeh_addr_cache_show() because it is acquiring a lock with IRQ enabled,
> but eeh_addr_cache_insert_dev() needs to acquire the same lock with IRQ
> disabled. Let's just make eeh_addr_cache_show() acquire the lock with
> IRQ disabled as well.
>
> CPU0CPU1
> 
>lock(_io_addr_cache_root.piar_lock);
> local_irq_disable();
> lock(>lock);
> lock(_io_addr_cache_root.piar_lock);
>
>  lock(>lock);
>
>   *** DEADLOCK ***
>
>   lock_acquire+0x140/0x5f0
>   _raw_spin_lock_irqsave+0x64/0xb0
>   eeh_addr_cache_insert_dev+0x48/0x390
>   eeh_probe_device+0xb8/0x1a0
>   pnv_pcibios_bus_add_device+0x3c/0x80
>   pcibios_bus_add_device+0x118/0x290
>   pci_bus_add_device+0x28/0xe0
>   pci_bus_add_devices+0x54/0xb0
>   pcibios_init+0xc4/0x124
>   do_one_initcall+0xac/0x528
>   kernel_init_freeable+0x35c/0x3fc
>   kernel_init+0x24/0x148
>   ret_from_kernel_thread+0x5c/0x80
>
>   lock_acquire+0x140/0x5f0
>   _raw_spin_lock+0x4c/0x70
>   eeh_addr_cache_show+0x38/0x110
>   seq_read+0x1a0/0x660
>   vfs_read+0xc8/0x1f0
>   ksys_read+0x74/0x130
>   system_call_exception+0xf8/0x1d0
>   system_call_common+0xe8/0x218
>
> Fixes: 5ca85ae6318d ("powerpc/eeh_cache: Add a way to dump the EEH address 
> cache")
> Signed-off-by: Qian Cai 

Good catch,

Reviewed-by: Oliver O'Halloran 


Re: [Skiboot] [PATCH 0/3] warn and suppress irqflood

2020-10-25 Thread Oliver O'Halloran
On Mon, Oct 26, 2020 at 12:11 AM Pingfan Liu  wrote:
>
> On Sun, Oct 25, 2020 at 8:21 PM Oliver O'Halloran  wrote:
> >
> > On Sun, Oct 25, 2020 at 10:22 PM Pingfan Liu  wrote:
> > >
> > > On Thu, Oct 22, 2020 at 4:37 PM Thomas Gleixner  
> > > wrote:
> > > >
> > > > On Thu, Oct 22 2020 at 13:56, Pingfan Liu wrote:
> > > > > I hit a irqflood bug on powerpc platform, and two years ago, on a x86 
> > > > > platform.
> > > > > When the bug happens, the kernel is totally occupies by irq.  
> > > > > Currently, there
> > > > > may be nothing or just soft lockup warning showed in console. It is 
> > > > > better
> > > > > to warn users with irq flood info.
> > > > >
> > > > > In the kdump case, the kernel can move on by suppressing the irq 
> > > > > flood.
> > > >
> > > > You're curing the symptom not the cause and the cure is just magic and
> > > > can't work reliably.
> > > Yeah, it is magic. But at least, it is better to printk something and
> > > alarm users about what happens. With current code, it may show nothing
> > > when system hangs.
> > > >
> > > > Where is that irq flood originated from and why is none of the
> > > > mechanisms we have in place to shut it up working?
> > > The bug originates from a driver tpm_i2c_nuvoton, which calls i2c-bus
> > > driver (i2c-opal.c). After i2c_opal_send_request(), the bug is
> > > triggered.
> > >
> > > But things are complicated by introducing a firmware layer: Skiboot.
> > > This software layer hides the detail of manipulating the hardware from
> > > Linux.
> > >
> > > I guess the software logic can not enter a sane state when kernel crashes.
> > >
> > > Cc Skiboot and ppc64 community to see whether anyone has idea about it.
> >
> > What system are you using?
>
> Here is the info, if not enough, I will get more.
>  Product Name  : OpenPOWER Firmware
>  Product Version   : open-power-SUPERMICRO-P9DSU-V1.16-20180531-imp
>  Product Extra : op-build-e4b3eb5
>  Product Extra : skiboot-v6.0-p1da203b
>  Product Extra : hostboot-f911e5c-pda8239f
>  Product Extra : occ-77bb5e6-p623d1cd
>  Product Extra : linux-4.16.7-openpower2-pbc45895
>  Product Extra : petitboot-v1.7.1-pf773c0d
>  Product Extra : machine-xml-218a77a

Unfortunately I don't have a schematic for that one.

> > There's an external interrupt pin which is supposed to be wired to the
> > TPM. I think we bounce that interrupt to FW by default since the
> > external interrupt is sometimes used for other system-specific
> > purposes. Odds are FW doesn't know what to do with it so you
> > effectively have an always-on LSI. I fixed a similar bug a while ago
> > by having skiboot mask any interrupts it doesn't have a handler for,
>
> This sounds like the root cause. But here Skiboot should have handler,
> otherwise the first kernel can not run smoothly.

I don't know why the TPM interrupt is asserted. If the TPM driver is
polling for a response it might clear the underlying condition as a
side effect of it's normal operation.

> Do you have any idea about an unexpected re-initialization introducing
> an unsane stage?

No idea, but those TPMs have a history of bricking themselves if you
do anything slightly odd to them. It wouldn't surprise me if the
re-probe can cause issues.

> Thanks,
> Pingfan


Re: [Skiboot] [PATCH 0/3] warn and suppress irqflood

2020-10-25 Thread Oliver O'Halloran
On Sun, Oct 25, 2020 at 10:22 PM Pingfan Liu  wrote:
>
> On Thu, Oct 22, 2020 at 4:37 PM Thomas Gleixner  wrote:
> >
> > On Thu, Oct 22 2020 at 13:56, Pingfan Liu wrote:
> > > I hit a irqflood bug on powerpc platform, and two years ago, on a x86 
> > > platform.
> > > When the bug happens, the kernel is totally occupies by irq.  Currently, 
> > > there
> > > may be nothing or just soft lockup warning showed in console. It is better
> > > to warn users with irq flood info.
> > >
> > > In the kdump case, the kernel can move on by suppressing the irq flood.
> >
> > You're curing the symptom not the cause and the cure is just magic and
> > can't work reliably.
> Yeah, it is magic. But at least, it is better to printk something and
> alarm users about what happens. With current code, it may show nothing
> when system hangs.
> >
> > Where is that irq flood originated from and why is none of the
> > mechanisms we have in place to shut it up working?
> The bug originates from a driver tpm_i2c_nuvoton, which calls i2c-bus
> driver (i2c-opal.c). After i2c_opal_send_request(), the bug is
> triggered.
>
> But things are complicated by introducing a firmware layer: Skiboot.
> This software layer hides the detail of manipulating the hardware from
> Linux.
>
> I guess the software logic can not enter a sane state when kernel crashes.
>
> Cc Skiboot and ppc64 community to see whether anyone has idea about it.

What system are you using?

There's an external interrupt pin which is supposed to be wired to the
TPM. I think we bounce that interrupt to FW by default since the
external interrupt is sometimes used for other system-specific
purposes. Odds are FW doesn't know what to do with it so you
effectively have an always-on LSI. I fixed a similar bug a while ago
by having skiboot mask any interrupts it doesn't have a handler for,
but I have no idea when that fix will land in a released FW build. Oh
well.


Re: [PATCH -next] Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"

2020-10-14 Thread Oliver O'Halloran
On Thu, Oct 15, 2020 at 5:28 AM Qian Cai  wrote:
>
> This reverts commit 3a3181e16fbde752007759f8759d25e0ff1fc425 which
> causes memory corruptions on POWER9 NV.

I was going to post this along with a fix for Cedric's original bug,
but I can do that separately so:

Acked-by: Oliver O'Halloran 


Re: PCI: Race condition in pci_create_sysfs_dev_files

2020-10-06 Thread Oliver O'Halloran
On Wed, Oct 7, 2020 at 10:26 AM Bjorn Helgaas  wrote:
>
> I'm not really a fan of this because pci_sysfs_init() is a bit of a
> hack to begin with, and this makes it even more complicated.
>
> It's not obvious from the code why we need pci_sysfs_init(), but
> Yinghai hinted [1] that we need to create sysfs after assigning
> resources.  I experimented by removing pci_sysfs_init() and skipping
> the ROM BAR sizing.  In that case, we create sysfs files in
> pci_bus_add_device() and later assign space for the ROM BAR, so we
> fail to create the "rom" sysfs file.
>
> The current solution to that is to delay the sysfs files until
> pci_sysfs_init(), a late_initcall(), which runs after resource
> assignments.  But I think it would be better if we could create the
> sysfs file when we assign the BAR.  Then we could get rid of the
> late_initcall() and that implicit ordering requirement.

You could probably fix that by using an attribute_group to control
whether the attribute shows up in sysfs or not. The .is_visible() for
the group can look at the current state of the device and hide the rom
attribute if the BAR isn't assigned or doesn't exist. That way we
don't need to care when the actual assignment occurs.

> But I haven't tried to code it up, so it's probably more complicated
> than this.  I guess ideally we would assign all the resources before
> pci_bus_add_device().  If we could do that, we could just remove
> pci_sysfs_init() and everything would just work, but I think that's a
> HUGE can of worms.

I was under the impression the whole point of pci_bus_add_device() was
to handle any initialisation that needed to be done after resources
were assigned. Is the ROM BAR being potentially unassigned an x86ism
or is there some bigger point I'm missing?

Oliver


Re: [PATCH] rpadlpar_io:Add MODULE_DESCRIPTION entries to kernel modules

2020-09-28 Thread Oliver O'Halloran
On Tue, Sep 29, 2020 at 6:50 AM Tyrel Datwyler  wrote:
>
> On 9/23/20 11:41 PM, Oliver O'Halloran wrote:
> > On Thu, Sep 24, 2020 at 3:15 PM Mamatha Inamdar
> >  wrote:
> >>
> >> This patch adds a brief MODULE_DESCRIPTION to rpadlpar_io kernel modules
> >> (descriptions taken from Kconfig file)
> >>
> >> Signed-off-by: Mamatha Inamdar 
> >> ---
> >>  drivers/pci/hotplug/rpadlpar_core.c |1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/drivers/pci/hotplug/rpadlpar_core.c 
> >> b/drivers/pci/hotplug/rpadlpar_core.c
> >> index f979b70..bac65ed 100644
> >> --- a/drivers/pci/hotplug/rpadlpar_core.c
> >> +++ b/drivers/pci/hotplug/rpadlpar_core.c
> >> @@ -478,3 +478,4 @@ static void __exit rpadlpar_io_exit(void)
> >>  module_init(rpadlpar_io_init);
> >>  module_exit(rpadlpar_io_exit);
> >>  MODULE_LICENSE("GPL");
> >> +MODULE_DESCRIPTION("RPA Dynamic Logical Partitioning driver for I/O 
> >> slots");
> >
> > RPA as a spec was superseded by PAPR in the early 2000s. Can we rename
> > this already?
>
> I seem to recall Michael and I discussed the naming briefly when I added the
> maintainer entries for the drivers and that the PAPR acronym is almost as
> meaningless to most as the original RPA. While, IBM no longer uses the term
> pseries for Power hardware marketing it is the defacto platform identifier in
> the Linux kernel tree for what we would call PAPR compliant. All in all I have
> no problem with renaming, but maybe we should consider pseries_dlpar or even
> simpler ibmdlpar.

I'm not too bothered by what we call it so long as it's consistent
with *something* else in the tree. Using pseries rather than ibm as a
prefix would probably be better since the legacy ibmphp driver is in
the same directory.


Re: [PATCH] rpadlpar_io:Add MODULE_DESCRIPTION entries to kernel modules

2020-09-27 Thread Oliver O'Halloran
On Sat, Sep 26, 2020 at 5:43 AM Bjorn Helgaas  wrote:
>
> On Thu, Sep 24, 2020 at 04:41:39PM +1000, Oliver O'Halloran wrote:
> > On Thu, Sep 24, 2020 at 3:15 PM Mamatha Inamdar
> >  wrote:
> > >
> > > This patch adds a brief MODULE_DESCRIPTION to rpadlpar_io kernel modules
> > > (descriptions taken from Kconfig file)
> > >
> > > Signed-off-by: Mamatha Inamdar 
> > > ---
> > >  drivers/pci/hotplug/rpadlpar_core.c |1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/drivers/pci/hotplug/rpadlpar_core.c 
> > > b/drivers/pci/hotplug/rpadlpar_core.c
> > > index f979b70..bac65ed 100644
> > > --- a/drivers/pci/hotplug/rpadlpar_core.c
> > > +++ b/drivers/pci/hotplug/rpadlpar_core.c
> > > @@ -478,3 +478,4 @@ static void __exit rpadlpar_io_exit(void)
> > >  module_init(rpadlpar_io_init);
> > >  module_exit(rpadlpar_io_exit);
> > >  MODULE_LICENSE("GPL");
> > > +MODULE_DESCRIPTION("RPA Dynamic Logical Partitioning driver for I/O 
> > > slots");
> >
> > RPA as a spec was superseded by PAPR in the early 2000s. Can we rename
> > this already?
> >
> > The only potential problem I can see is scripts doing: modprobe
> > rpadlpar_io or similar
> >
> > However, we should be able to fix that with a module alias.
>
> Is MODULE_DESCRIPTION() connected with how modprobe works?

I don't think so. The description is just there as an FYI.

> If this patch just improves documentation, without breaking users of
> modprobe, I'm fine with it, even if it would be nice to rename to PAPR
> or something in the future.

Right, the change in this patch is just a documentation fix and
shouldn't cause any problems.

I was suggesting renaming the module itself since the term "RPA" is
only used in this hotplug driver and some of the corresponding PHB add
/ remove handling in arch/powerpc/platforms/pseries/. We can make that
change in a follow up though.

> But, please use "git log --oneline drivers/pci/hotplug/rpadlpar*" and
> match the style, and also look through the rest of drivers/pci/ to see
> if we should do the same thing to any other modules.
>
> Bjorn


Re: [PATCH] rpadlpar_io:Add MODULE_DESCRIPTION entries to kernel modules

2020-09-24 Thread Oliver O'Halloran
On Thu, Sep 24, 2020 at 3:15 PM Mamatha Inamdar
 wrote:
>
> This patch adds a brief MODULE_DESCRIPTION to rpadlpar_io kernel modules
> (descriptions taken from Kconfig file)
>
> Signed-off-by: Mamatha Inamdar 
> ---
>  drivers/pci/hotplug/rpadlpar_core.c |1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/pci/hotplug/rpadlpar_core.c 
> b/drivers/pci/hotplug/rpadlpar_core.c
> index f979b70..bac65ed 100644
> --- a/drivers/pci/hotplug/rpadlpar_core.c
> +++ b/drivers/pci/hotplug/rpadlpar_core.c
> @@ -478,3 +478,4 @@ static void __exit rpadlpar_io_exit(void)
>  module_init(rpadlpar_io_init);
>  module_exit(rpadlpar_io_exit);
>  MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("RPA Dynamic Logical Partitioning driver for I/O slots");

RPA as a spec was superseded by PAPR in the early 2000s. Can we rename
this already?

The only potential problem I can see is scripts doing: modprobe
rpadlpar_io or similar

However, we should be able to fix that with a module alias.

Oliver


Re: [RFC PATCH] PCI/portdrv: No need to call pci_disable_device() during shutdown

2020-09-10 Thread Oliver O'Halloran
On Fri, Sep 11, 2020 at 11:55 AM Tiezhu Yang  wrote:
>
> On 09/11/2020 04:21 AM, Bjorn Helgaas wrote:
> > [+cc Huacai]
> >
> > On Thu, Sep 10, 2020 at 02:41:39PM -0500, Bjorn Helgaas wrote:
> >> On Sat, Sep 05, 2020 at 04:33:26PM +0800, Tiezhu Yang wrote:
> >>> After commit 745be2e700cd ("PCIe: portdrv: call pci_disable_device
> >>> during remove") and commit cc27b735ad3a ("PCI/portdrv: Turn off PCIe
> >>> services during shutdown"), it also calls pci_disable_device() during
> >>> shutdown, this seems unnecessary, so just remove it.
> >> I would like to get rid of the portdrv completely by folding its
> >> functionality into the PCI core itself, so there would be no portdrv
> >> probe or remove.
> >>
> >> Does this solve a problem?
>
> Yes, sometimes it can not shutdown or reboot normally with
> pci_disable_device().

Do you have any more details about what goes wrong here? Leaving
devices enabled when actually shutting down probably doesn't matter.
However, .shutdown() is also used when kexec()ing into a new kernel
and we probably should be disabling devices before handing over to the
new kernel.

Is the real issue that we're closing the bridge windows before the
endpoint drivers have had a chance to clean up?

Oliver


Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

2020-08-30 Thread Oliver O'Halloran
On Mon, Aug 31, 2020 at 10:08 AM Alexey Kardashevskiy  wrote:
>
> On 29/08/2020 05:55, Leonardo Bras wrote:
> > On Fri, 2020-08-28 at 12:27 +1000, Alexey Kardashevskiy wrote:
> >>
> >> On 28/08/2020 01:32, Leonardo Bras wrote:
> >>> Hello Alexey, thank you for this feedback!
> >>>
> >>> On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
> > +#define TCE_RPN_BITS 52  /* Bits 0-51 
> > represent RPN on TCE */
> 
>  Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
>  is the actual limit.
> >>>
> >>> I understand this MAX_PHYSMEM_BITS(51) comes from the maximum physical 
> >>> memory addressable in the machine. IIUC, it means we can access physical 
> >>> address up to (1ul << MAX_PHYSMEM_BITS).
> >>>
> >>> This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
> >>> 0-51 as the RPN. By looking at code, I understand that it means we may 
> >>> input any address < (1ul << 52) to TCE.
> >>>
> >>> In practice, MAX_PHYSMEM_BITS should be enough as of today, because I 
> >>> suppose we can't ever pass a physical page address over
> >>> (1ul << 51), and TCE accepts up to (1ul << 52).
> >>> But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily means 
> >>> that TCE_RPN_BITS will also be increased, so I think they are independent 
> >>> values.
> >>>
> >>> Does it make sense? Please let me know if I am missing something.
> >>
> >> The underlying hardware is PHB3/4 about which the IODA2 Version 2.4
> >> 6Apr2012.pdf spec says:
> >>
> >> "The number of most significant RPN bits implemented in the TCE is
> >> dependent on the max size of System Memory to be supported by the 
> >> platform".
> >>
> >> IODA3 is the same on this matter.
> >>
> >> This is MAX_PHYSMEM_BITS and PHB itself does not have any other limits
> >> on top of that. So the only real limit comes from MAX_PHYSMEM_BITS and
> >> where TCE_RPN_BITS comes from exactly - I have no idea.
> >
> > Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
> > hardcoded 40-bit mask (0xfful), for hard-coded 12-bit (4k)
> > pagesize, and on PAPR+/LoPAR also defines TCE as having bits 0-51
> > described as RPN, as described before.
> >
> > IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
> > shows system memory mapping into a TCE, and the TCE also has bits 0-51
> > for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
> >> In fact, by the looks of those figures, the RPN_MASK should always be a
> > 52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.
>
> I suspect the mask is there in the first place for extra protection
> against too big addresses going to the TCE table (or/and for virtial vs
> physical addresses). Using 52bit mask makes no sense for anything, you
> could just drop the mask and let c compiler deal with 64bit "uint" as it
> is basically a 4K page address anywhere in the 64bit space. Thanks,

Assuming 4K pages you need 52 RPN bits to cover the whole 64bit
physical address space. The IODA3 spec does explicitly say the upper
bits are optional and the implementation only needs to support enough
to cover up to the physical address limit, which is 56bits of P9 /
PHB4. If you want to validate that the address will fit inside of
MAX_PHYSMEM_BITS then fine, but I think that should be done as a
WARN_ON or similar rather than just silently masking off the bits.


Re: [PATCH] powerpc/pseries: Add pcibios_default_alignment implementation

2020-08-22 Thread Oliver O'Halloran
On Sat, Aug 22, 2020 at 6:51 AM Shawn Anastasio  wrote:
>
> Implement pcibios_default_alignment for pseries so that
> resources are page-aligned. The main benefit of this is being
> able to map any resource from userspace via mechanisms like VFIO.

Reviewed-by: Oliver O'Halloran 

That said, there's nothing power specific about this so we should
probably drop the pcibios hacks and fix the default alignment in the
PCI core.

> This is identical to powernv's implementation.
>
> Signed-off-by: Shawn Anastasio 
> ---
>  arch/powerpc/platforms/pseries/pci.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/powerpc/platforms/pseries/pci.c 
> b/arch/powerpc/platforms/pseries/pci.c
> index 911534b89c85..6d922c096354 100644
> --- a/arch/powerpc/platforms/pseries/pci.c
> +++ b/arch/powerpc/platforms/pseries/pci.c
> @@ -210,6 +210,11 @@ int pseries_pcibios_sriov_disable(struct pci_dev *pdev)
>  }
>  #endif
>
> +static resource_size_t pseries_pcibios_default_alignment(void)
> +{
> +   return PAGE_SIZE;
> +}
> +
>  static void __init pSeries_request_regions(void)
>  {
> if (!isa_io_base)
> @@ -231,6 +236,8 @@ void __init pSeries_final_fixup(void)
>
> eeh_show_enabled();
>
> +   ppc_md.pcibios_default_alignment = pseries_pcibios_default_alignment;
> +
>  #ifdef CONFIG_PCI_IOV
> ppc_md.pcibios_sriov_enable = pseries_pcibios_sriov_enable;
> ppc_md.pcibios_sriov_disable = pseries_pcibios_sriov_disable;
> --
> 2.28.0
>


Re: [PATCH v2 1/4] libnvdimm: fix memmory leaks in of_pmem.c

2020-08-19 Thread Oliver O'Halloran
On Wed, Aug 19, 2020 at 12:05 PM Zhen Lei  wrote:
>
> The memory priv->bus_desc.provider_name allocated by kstrdup() is not
> freed correctly.
>
> Fixes: 49bddc73d15c ("libnvdimm/of_pmem: Provide a unique name for bus 
> provider")
> Signed-off-by: Zhen Lei 

Yep, that's a bug.

Reviewed-by: Oliver O'Halloran 

> ---
>  drivers/nvdimm/of_pmem.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c
> index 10dbdcdfb9ce913..1292ffca7b2ecc0 100644
> --- a/drivers/nvdimm/of_pmem.c
> +++ b/drivers/nvdimm/of_pmem.c
> @@ -36,6 +36,7 @@ static int of_pmem_region_probe(struct platform_device 
> *pdev)
>
> priv->bus = bus = nvdimm_bus_register(>dev, >bus_desc);
> if (!bus) {
> +   kfree(priv->bus_desc.provider_name);
> kfree(priv);
> return -ENODEV;
> }
> @@ -83,6 +84,7 @@ static int of_pmem_region_remove(struct platform_device 
> *pdev)
> struct of_pmem_private *priv = platform_get_drvdata(pdev);
>
> nvdimm_bus_unregister(priv->bus);
> +   kfree(priv->bus_desc.provider_name);
> kfree(priv);
>
> return 0;
> --
> 1.8.3
>
>


Re: [PATCH v2 1/4] libnvdimm: Fix memory leaks in of_pmem.c

2020-08-19 Thread Oliver O'Halloran
On Wed, Aug 19, 2020 at 10:28 PM Markus Elfring  wrote:
>
> > The memory priv->bus_desc.provider_name allocated by kstrdup() is not
> > freed correctly.

Personally I thought his commit message was perfectly fine. A little
unorthodox, but it works.

> How do you think about to choose an imperative wording for
> a corresponding change description?

...but this! This is word salad.

> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?id=18445bf405cb331117bc98427b1ba6f12418ad17#n151
>
> Regards,
> Markus


Re: [PATCH v2] PCI: Introduce flag for detached virtual functions

2020-08-13 Thread Oliver O'Halloran
On Thu, Aug 13, 2020 at 7:00 PM Niklas Schnelle  wrote:
>
>
> On 8/13/20 3:55 AM, Oliver O'Halloran wrote:
> > On Thu, Aug 13, 2020 at 5:21 AM Matthew Rosato  
> > wrote:
> >> *snip*
> >> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> >> index 3902c9f..04ac76d 100644
> >> --- a/arch/s390/pci/pci.c
> >> +++ b/arch/s390/pci/pci.c
> >> @@ -581,6 +581,14 @@ int pcibios_enable_device(struct pci_dev *pdev, int 
> >> mask)
> >>  {
> >> struct zpci_dev *zdev = to_zpci(pdev);
> >>
> >> +   /*
> >> +* If we have a VF on a non-multifunction bus, it must be a VF 
> >> that is
> >> +* detached from its parent PF.  We rely on firmware emulation to
> >> +* provide underlying PF details.
> >> +*/
> >> +   if (zdev->vfn && !zdev->zbus->multifunction)
> >> +   pdev->detached_vf = 1;
> >
> > The enable hook seems like it's a bit too late for this sort of
> > screwing around with the pci_dev. Anything in the setup path that
> > looks at ->detached_vf would see it cleared while anything that looks
> > after the device is enabled will see it set. Can this go into
> > pcibios_add_device() or a fixup instead?
> >
>
> This particular check could go into pcibios_add_device() yes.
> We're also currently working on a slight rework of how
> we establish the VF to parent PF linking including the sysfs
> part of that. The latter sadly can only go after the sysfs
> for the virtfn has been created and that only happens
> after all fixups. We would like to do both together because
> the latter sets pdev->is_virtfn which I think is closely related.
>
> I was thinking of starting another discussion
> about adding a hook that is executed just after the sysfs entries
> for the PCI device are created but haven't yet.

if all you need is sysfs then pcibios_bus_add_device() or a bus
notifier should work

> That said pcibios_enable_device() is called before drivers
> like vfio-pci are enabled

Hmm, is that an s390 thing? I was under the impression that drivers
handled enabling the device rather than assuming the platform did it
for them. Granted it's usually one of the first things a driver does,
but there's still scope for surprising behaviour.

> and so as long as all uses of pdev->detached_vf
> are in drivers it should be early enough. AFAIK almost everything
> dealing with VFs before that is already skipped with pdev->no_vf_scan
> though.

I'm sure it works fine in your particular case. My main gripe is that
you're adding a flag in a generic structure so people reading the code
without that context may make assumptions about when it's valid to
use. The number of pcibios_* hooks we have means that working out when
and where something happens in the pci setup path usually involves
going on a ~magical journey~ through generic and arch specific code.
It's not *that* bad once you've worked out how it all fits together,
but it's still a pain. If we can initialise stuff before the pci_dev
is added to the bus it's usually for the better.

Oliver


Re: [PATCH v2] PCI: Introduce flag for detached virtual functions

2020-08-12 Thread Oliver O'Halloran
On Thu, Aug 13, 2020 at 6:33 AM Alex Williamson
 wrote:
>
> On Wed, 12 Aug 2020 15:21:11 -0400
> Matthew Rosato  wrote:
>
> > @@ -521,7 +522,8 @@ static int vfio_basic_config_read(struct 
> > vfio_pci_device *vdev, int pos,
> >   count = vfio_default_config_read(vdev, pos, count, perm, offset, val);
> >
> >   /* Mask in virtual memory enable for SR-IOV devices */
> > - if (offset == PCI_COMMAND && vdev->pdev->is_virtfn) {
> > + if ((offset == PCI_COMMAND) &&
> > + (vdev->pdev->is_virtfn || vdev->pdev->detached_vf)) {
> >   u16 cmd = le16_to_cpu(*(__le16 *)>vconfig[PCI_COMMAND]);
> >   u32 tmp_val = le32_to_cpu(*val);
> >
> > @@ -1734,7 +1736,8 @@ int vfio_config_init(struct vfio_pci_device *vdev)
> >vconfig[PCI_INTERRUPT_PIN]);
> >
> >   vconfig[PCI_INTERRUPT_PIN] = 0; /* Gratuitous for good VFs */
> > -
> > + }
> > + if (pdev->is_virtfn || pdev->detached_vf) {
> >   /*
> >* VFs do no implement the memory enable bit of the COMMAND
> >* register therefore we'll not have it set in our initial
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 8355306..23a6972 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -445,6 +445,7 @@ struct pci_dev {
> >   unsigned intis_probed:1;/* Device probing in progress 
> > */
> >   unsigned intlink_active_reporting:1;/* Device capable of 
> > reporting link active */
> >   unsigned intno_vf_scan:1;   /* Don't scan for VFs after 
> > IOV enablement */
> > + unsigned intdetached_vf:1;  /* VF without local PF access 
> > */
>
> Is there too much implicit knowledge in defining a "detached VF"?  For
> example, why do we know that we can skip the portion of
> vfio_config_init() that copies the vendor and device IDs from the
> struct pci_dev into the virtual config space?  It's true on s390x, but
> I think that's because we know that firmware emulates those registers
> for us.
>
> We also skip the INTx pin register sanity checking.  Do we do
> that because we haven't installed the broken device into an s390x
> system?  Because we know firmware manages that for us too?  Or simply
> because s390x doesn't support INTx anyway, and therefore it's another
> architecture implicit decision?

Agreed. Any hacks we put in for normal VFs are going to be needed for
the passed-though VF case. Only applying the memory space enable
workaround doesn't make sense to me either.

> If detached_vf is really equivalent to is_virtfn for all cases that
> don't care about referencing physfn on the pci_dev, then we should
> probably have a macro to that effect.

A pci_is_virtfn() helper would be better than open coding both checks
everywhere. That said, it might be solving the wrong problem. The
union between ->physfn and ->sriov has always seemed like a footgun to
me so we might be better off switching the users who want a physfn to
a helper instead. i.e.

struct pci_dev *pci_get_vf_physfn(struct pci_dev *vf)
{
if (!vf->is_virtfn)
return NULL;

return vf->physfn;
}

...

pf = pci_get_vf_physfn(vf)
if (pf)
/* do pf things */

Then we can just use ->is_virtfn for the normal and detached cases.

Oliver


Re: [PATCH v2] PCI: Introduce flag for detached virtual functions

2020-08-12 Thread Oliver O'Halloran
On Thu, Aug 13, 2020 at 5:21 AM Matthew Rosato  wrote:
>
> s390x has the notion of providing VFs to the kernel in a manner
> where the associated PF is inaccessible other than via firmware.
> These are not treated as typical VFs and access to them is emulated
> by underlying firmware which can still access the PF.  After
> abafbc55 however these detached VFs were no longer able to work
> with vfio-pci as the firmware does not provide emulation of the
> PCI_COMMAND_MEMORY bit.  In this case, let's explicitly recognize
> these detached VFs so that vfio-pci can allow memory access to
> them again.

Hmm, cool. I think we have a similar feature on pseries so that's
probably broken too.

> Signed-off-by: Matthew Rosato 
> ---
>  arch/s390/pci/pci.c|  8 
>  drivers/vfio/pci/vfio_pci_config.c | 11 +++
>  include/linux/pci.h|  1 +
>  3 files changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index 3902c9f..04ac76d 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -581,6 +581,14 @@ int pcibios_enable_device(struct pci_dev *pdev, int mask)
>  {
> struct zpci_dev *zdev = to_zpci(pdev);
>
> +   /*
> +* If we have a VF on a non-multifunction bus, it must be a VF that is
> +* detached from its parent PF.  We rely on firmware emulation to
> +* provide underlying PF details.
> +*/
> +   if (zdev->vfn && !zdev->zbus->multifunction)
> +   pdev->detached_vf = 1;

The enable hook seems like it's a bit too late for this sort of
screwing around with the pci_dev. Anything in the setup path that
looks at ->detached_vf would see it cleared while anything that looks
after the device is enabled will see it set. Can this go into
pcibios_add_device() or a fixup instead?


Re: [PATCH -next] powerpc/powernv/sriov: Remove unused but set variable 'phb'

2020-07-27 Thread Oliver O'Halloran
On Tue, Jul 28, 2020 at 3:01 AM Wei Yongjun  wrote:
>
> Gcc report warning as follows:
>
> arch/powerpc/platforms/powernv/pci-sriov.c:602:25: warning:
>  variable 'phb' set but not used [-Wunused-but-set-variable]
>   602 |  struct pnv_phb*phb;
>   | ^~~
>
> This variable is not used, so this commit removing it.
>
> Reported-by: Hulk Robot 
> Signed-off-by: Wei Yongjun 
> ---
>  arch/powerpc/platforms/powernv/pci-sriov.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
> b/arch/powerpc/platforms/powernv/pci-sriov.c
> index 8404d8c3901d..7894745fd4f8 100644
> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> @@ -599,10 +599,8 @@ static int pnv_pci_vf_resource_shift(struct pci_dev 
> *dev, int offset)
>  static void pnv_pci_sriov_disable(struct pci_dev *pdev)
>  {
> u16num_vfs, base_pe;
> -   struct pnv_phb*phb;
> struct pnv_iov_data   *iov;
>
> -   phb = pci_bus_to_pnvhb(pdev->bus);
> iov = pnv_iov_get(pdev);
>     num_vfs = iov->num_vfs;
> base_pe = iov->vf_pe_arr[0].pe_number;
>

Acked-by: Oliver O'Halloran 


Re: [PATCH v2] powerpc/powernv/pci: use ifdef to avoid dead code

2020-07-19 Thread Oliver O'Halloran
On Sun, Jul 19, 2020 at 5:13 AM Greg Thelen  wrote:
>
> Oliver O'Halloran  wrote:
>
> > On Mon, Jun 15, 2020 at 9:33 AM Greg Thelen  wrote:
> >>
> >> Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE
> >> configuration") removed a couple pnv_ioda_setup_bus_dma() calls.  The
> >> only remaining calls are behind CONFIG_IOMMU_API.  Thus builds without
> >> CONFIG_IOMMU_API see:
> >>   arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 
> >> 'pnv_ioda_setup_bus_dma' defined but not used
> >>
> >> Move pnv_ioda_setup_bus_dma() under CONFIG_IOMMU_API to avoid dead code.
> >
> > Doh! Thanks for the fix.
> >
> > Reviewed-by: Oliver O'Halloran 
>
> Is there anything else needed from me on this patch?
> Given that it fixes a 5.8 commit I figured it'd be 5.8 material.

Oh sorry, I completely forgot about this patch. I sent another series
that included a more-or-less identical fix after the kbuild robot sent
a reminder:

http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=187630=*

That's current in powerpc/next, but if it's causing a build break then
I agree it should probably go into 5.8 too.

Oliver


Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Oliver O'Halloran
On Wed, Jul 15, 2020 at 8:03 AM Arnd Bergmann  wrote:
>
> - Most error checking is static: PCIBIOS_BAD_REGISTER_NUMBER
>   only happens if you pass an invalid register number, but most
>   callers pass a compile-time constant register number that is
>   known to be correct, or the driver would never work. Similarly,
>   PCIBIOS_DEVICE_NOT_FOUND wouldn't normally happen
>   since you pass a valid pci_device pointer that was already
>   probed.

Having some feedback about obvious programming errors is still useful
when doing driver development. Reporting those via printk() would
probably be more useful to those who care though.

> - config space accesses are very rare compared to memory
>   space access and on the hardware side the error handling
>   would be similar, but readl/writel don't return errors, they just
>   access wrong registers or return 0x.
>   arch/powerpc/kernel/eeh.c has a ton extra code written to
>   deal with it, but no other architectures do.

TBH the EEH MMIO hooks were probably a mistake to begin with. Errors
detected via MMIO are almost always asynchronous to the error itself
so you usually just wind up with a misleading stack trace rather than
any kind of useful synchronous error reporting. It seems like most
drivers don't bother checking for 0xFFs either and rely on the
asynchronous reporting via .error_detected() instead, so I have to
wonder what the point is. I've been thinking of removing the MMIO
hooks and using a background poller to check for errors on each PHB
periodically (assuming we don't have an EEH interrupt) instead. That
would remove the requirement for eeh_dev_check_failure() to be
interrupt safe too, so it might even let us fix all the godawful races
in EEH.

> - If we add code to detect errors in pci_read_config_*
>   and do some of the stuff from powerpc's
>   eeh_dev_check_failure(), we are more likely to catch
>   intermittent failures when drivers don't check, or bugs
>   with invalid arguments in device drivers than relying on
>   drivers to get their error handling right when those code
>   paths don't ever get covered in normal testing.

Adding some kind of error detection to the generic config accessors
wouldn't hurt, but detection is only half the problem. The main job of
eeh_dev_check_failure() is waking up the EEH recovery thread which
actually handles notifying drivers, device resets, etc and you'd want
something in the PCI core. Right now there's two implementations of
that reporting logic: one for EEH in arch/powerpc/eeh_driver.c and one
for AER/DPC in drivers/pci/pcie/err.c. I think the latter could be
moved into the PCI core easily enough since there's not much about it
that's really specific to PCIe. Ideally we could drop the EEH specific
one too, but I'm not sure how to implement that without it devolving
into callback spaghetti.

Oliver


Re: [PATCH v2 5/7] driver core: Add device location to "struct device" and expose it in sysfs

2020-07-02 Thread Oliver O'Halloran
On Thu, 2020-07-02 at 09:32 +0200, Greg Kroah-Hartman wrote:
> On Thu, Jul 02, 2020 at 03:23:23PM +1000, Oliver O'Halloran wrote:
> > Yep, that's a problem. If we want to provide a useful mechanism to
> > userspace then the default behaviour of the kernel can't undermine
> > that mechanism. If that means we need another kernel command line
> > parameter then I guess we just have to live with it.
> 
> I really do not want yet-another-kernel-command-line-option if we can
> help it at all.  Sane defaults are the best thing to do here.  Userspace
> comes up really early, put your policy in there, not in blobs passed
> from your bootloader.

Userspace comes up early, but builtin drivers will bind before init is
started. e.g.

# dmesg | egrep '0002:01:00.0|/init'
[0.976800][T1] pci 0002:01:00.0: [8086:1589] type 00 class 0x02
[0.976923][T1] pci 0002:01:00.0: reg 0x10: [mem 
0x2200-0x227f 64bit pref]
[0.977004][T1] pci 0002:01:00.0: reg 0x1c: [mem 
0x22000200-0x220002007fff 64bit pref]
[0.977068][T1] pci 0002:01:00.0: reg 0x30: [mem 0x-0x0007 
pref]
[0.977122][T1] pci 0002:01:00.0: BAR3 [mem size 0x8000 64bit pref]: 
requesting alignment to 0x1
[0.977401][T1] pci 0002:01:00.0: PME# supported from D0 D3hot
[1.011929][T1] pci 0002:01:00.0: BAR 0: assigned [mem 
0x2200-0x227f 64bit pref]
[1.012085][T1] pci 0002:01:00.0: BAR 6: assigned [mem 
0x3fe1-0x3fe10007 pref]
[1.012127][T1] pci 0002:01:00.0: BAR 3: assigned [mem 
0x22000200-0x220002007fff 64bit pref]
[4.399588][   T12] i40e 0002:01:00.0: enabling device (0140 -> 0142)
[4.410891][   T12] i40e 0002:01:00.0: fw 5.1.40981 api 1.5 nvm 5.03 
0x80002469 1.1313.0 [8086:1589] [15d9:]
[4.647524][   T12] i40e 0002:01:00.0: MAC address: 0c:c4:7a:b7:fc:74
[4.647685][   T12] i40e 0002:01:00.0: FW LLDP is enabled
[4.653918][   T12] i40e 0002:01:00.0 eth0: NIC Link is Up, 1000 Mbps Full 
Duplex, Flow Control: None
[4.62][   T12] i40e 0002:01:00.0: PCI-Express: Speed 8.0GT/s Width x8
[4.656071][   T12] i40e 0002:01:00.0: Features: PF-id[0] VSIs: 34 QP: 80 
RSS FD_ATR FD_SB NTUPLE VxLAN Geneve PTP VEPA
[   13.803709][T1] Run /init as init process
[   13.963242][  T711] i40e 0002:01:00.0 enP2p1s0f0: renamed from eth0

Building everything into the kernel is admittedly pretty niche. I only
do it to avoid re-building the initramfs for my test kernels. It does
seem relatively common on embedded systems, but I'm not sure how many
of those care about PCIe. It would be nice to provide *something* to
cover that case for the people who care.

Oliver




Re: [PATCH v2 5/7] driver core: Add device location to "struct device" and expose it in sysfs

2020-07-01 Thread Oliver O'Halloran
On Thu, Jul 2, 2020 at 4:07 AM Rajat Jain  wrote:
>
> *snip*
>
> > > I guess it would make sense to have an attribute for user space to
> > > write to in order to make the kernel reject device plug-in events
> > > coming from a given port or connector, but the kernel has no reliable
> > > means to determine *which* ports or connectors are "safe", and even if
> > > there was a way for it to do that, it still may not agree with user
> > > space on which ports or connectors should be regarded as "safe".
> >
> > Again, we have been doing this for USB devices for a very long time, PCI
> > shouldn't be any different.  Why people keep ignoring working solutions
> > is beyond me, there's nothing "special" about PCI devices here for this
> > type of "worry" or reasoning to try to create new solutions.
> >
> > So, again, I ask, go do what USB does, and to do that, take the logic
> > out of the USB core, make it bus-agnositic, and _THEN_ add it to the PCI
> > code. Why the original submitter keeps ignoring my request to do this
> > is beyond me, I guess they like making patches that will get rejected :(
>
> IMHO I'm actually trying to precisely do what I think was the
> conclusion of our discussion, and then some changes because of the
> further feedback I received on those patches. Let's take a step back
> and please allow me to explain how I got here (my apologies but this
> spans a couple of threads, and I"m trying to tie them all together
> here):

The previous thread had some suggestions, but no real conclusions.
That's probably why we're still arguing about it...

> GOAL: To allow user space to control what (PCI) drivers he wants to
> allow on external (thunderbolt) ports. There was a lot of debate about
> the need for such a policy at
> https://lore.kernel.org/linux-pci/CACK8Z6GR7-wseug=ttvyrarvzx_ao2geoldnbwjtb+5y7vw...@mail.gmail.com/
> with the final conclusion that it should be OK to implement such a
> policy in userspace, as long as the policy is not implemented in the
> kernel. The kernel only needs to expose bits & info that is needed by
> the userspace to implement such a policy, and it can be used in
> conjunction with "drivers_autoprobe" to implement this policy:
> 
> 
> That's an odd thing, but sure, if you want to write up such a policy for
> your systems, great.  But that policy does not belong in the kernel, it
> belongs in userspace.
> 
> 
> 1) The post 
> https://lore.kernel.org/linux-pci/20200609210400.GA1461839@bjorn-Precision-5520/
> lists out the approach that was agreed on. Replicating it here:
> ---
>   - Expose the PCI pdev->untrusted bit in sysfs.  We don't expose this
> today, but doing so would be trivial.  I think I would prefer a
> sysfs name like "external" so it's more descriptive and less of a
> judgment.
>
> This comes from either the DT "external-facing" property or the
> ACPI "ExternalFacingPort" property.
>
>   - All devices present at boot are enumerated.  Any statically built
> drivers will bind to them before any userspace code runs.
>
> If you want to keep statically built drivers from binding, you'd
> need to invent some mechanism so pci_driver_init() could clear
> drivers_autoprobe after registering pci_bus_type.
>
>   - Early userspace code prevents modular drivers from automatically
> binding to PCI devices:
>
>   echo 0 > /sys/bus/pci/drivers_autoprobe
>
> This prevents modular drivers from binding to all devices, whether
> present at boot or hot-added.
>
>   - Userspace code uses the sysfs "bind" file to control which drivers
> are loaded and can bind to each device, e.g.,
>
>   echo :02:00.0 > /sys/bus/pci/drivers/nvme/bind

I think this is a reasonable suggestion. However, as Greg pointed out
it's gratuitously different to what USB does for no real reason.

> ---
> 2) As part of implementing the above agreed approach, when I exposed
> PCI "untrusted" attribute to userspace, it ran into discussion that
> concluded that instead of this, the device core should be enhanced
> with a location attribute.
> https://lore.kernel.org/linux-pci/20200618184621.ga446...@kroah.com/
> ---
> ...
> The attribute should be called something like "location" or something
> like that (naming is hard), as you don't always know if something is
> external or not (it could be internal, it could be unknown, it could be
> internal to an external device that you trust (think PCI drawers for
> "super" computers that are hot pluggable but yet really part of the
> internal bus).
> 
> "trust" has no direct relation to the location, except in a policy of
> what you wish to do with that 

Re: [PATCH 2/2] pci: Add parameter to disable attaching untrusted devices

2020-06-26 Thread Oliver O'Halloran
On Fri, Jun 26, 2020 at 10:27 AM Rajat Jain  wrote:
>
> Introduce a PCI parameter that disables the automatic attachment of
> untrusted devices to their drivers.
>
> Signed-off-by: Rajat Jain 
> ---
> Context:
>
>   I set out to implement the approach outlined in
> https://lkml.org/lkml/2020/6/9/1331
> https://lkml.org/lkml/2020/6/15/1453
>
>   But to my surprise, I found that the new hotplugged PCI devices
>   were getting automatically attached to drivers even though
>   /sys/bus/pci/drivers_autoprobe was set to 0.
>
>   I realized that the device core's "drivers_autoprobe":
>
>   * only disables the *initial* probe of the device (i.e. from
> device_add()). If a subsystem calls device_attach() explicitly
> for its devices like PCI subsystem does, the drivers_autoprobe
> setting does not matter. The core will attach device to the driver.
> This looks like correct semantic behavior to me because PCI is
> explicitly calling device_attach(), which is a way to explicitly
> ask the core to find and attach a driver for a device.

Right, but we're doing using device_attach() largely because the
driver core doesn't provide any mechanism for deferring the initial
probe. I didn't think there was any deeper reason for it, but while
looking I noticed that the initial probe can be async and
device_attach() forces probing to be synchronous. That has the side
effect of serialising all PCI device probing which might be
intentional to avoid device renaming due to the change in probe order.
Userspace is better at dealing with device names changing now days,
but you might still get some people mad at you for changing it.

>   2) Make the drivers_autoprobe property available to PCI to use
>  (currently it is private to device core). The PCI could use this
>  to determine whether or not to call device_attach(). This still
>  leaves the other problem (of not being able to set
>  drivers_autoprobe via command line open).
>
>   3) I found the pci_dev->match_driver, which seemed similar to what I
>  am trying to do, but can't be controlled from userspace. I considered
>  populating that field based on drivers_autoprobe (still need (2)).
>  But the problem is that there is the AMD IOMMU driver which is setting
>  this independently, so setting the match_driver based on
>  drivers_autoprobe may not be a good idea.

Huh, that's pretty weird. Even with that hack you should be able
trigger the bug they're working around by removing the IOMMU device in
sysfs and doing a rescan. I wouldn't worry much about making
match_device user controllable since you would need to work pretty
hard for it to be an issue.

Oliver


Re: [PATCH 3/4] powerpc/pseries/iommu: Move window-removing part of remove_ddw into remove_dma_window

2020-06-22 Thread Oliver O'Halloran
On Tue, Jun 23, 2020 at 11:12 AM Alexey Kardashevskiy  wrote:
>
> On 23/06/2020 04:59, Leonardo Bras wrote:
> >
> >> Also, despite this particular file, the "pdn" name is usually used for
> >> struct pci_dn (not device_node), let's keep it that way.
> >
> > Sure, I got confused for some time about this, as we have:
> > static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn).
> > but on *_ddw() we have "struct pci_dn *pdn".
>
> True again, not the cleanest style here.
>
>
> > I will also add a patch that renames those 'struct device_node *pdn' to
> > something like 'struct device_node *parent_dn'.

I usually go with "np" or "node". In this case I'd use "parent_np" or
just "parent." As you said pci_dn conventionally uses pdn so that
should be avoided if at all possible. There's some places that just
use "dn" for device_node, but I don't think that's something we should
encourage due to how similar it is to pdn.

> I would not go that far, we (well, Oliver) are getting rid of many
> occurrences of pci_dn and Oliver may have a stronger opinion here.

I'm trying to remove the use of pci_dn from non-RTAS platforms which
doesn't apply to pseries. For RTAS platforms having pci_dn sort of
makes sense since it's used to cache data from the device_node and
having it saves you from needing to parse and validate the DT at
runtime since we're supposed to be relying on the FW provided settings
in the DT. I want to get rid of it on PowerNV because it's become a
dumping ground for random bits and pieces of platform specific data.
It's confusing at best and IMO it duplicates a lot of what's already
available in the per-PHB structures which the platform specific stuff
should actually be looking at.

Oliver


Re: [PATCH v2] powerpc/powernv/pci: use ifdef to avoid dead code

2020-06-14 Thread Oliver O'Halloran
On Mon, Jun 15, 2020 at 9:33 AM Greg Thelen  wrote:
>
> Commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE
> configuration") removed a couple pnv_ioda_setup_bus_dma() calls.  The
> only remaining calls are behind CONFIG_IOMMU_API.  Thus builds without
> CONFIG_IOMMU_API see:
>   arch/powerpc/platforms/powernv/pci-ioda.c:1888:13: error: 
> 'pnv_ioda_setup_bus_dma' defined but not used
>
> Move pnv_ioda_setup_bus_dma() under CONFIG_IOMMU_API to avoid dead code.

Doh! Thanks for the fix.

Reviewed-by: Oliver O'Halloran 


Re: [RFC] Restrict the untrusted devices, to bind to only a set of "whitelisted" drivers

2020-06-09 Thread Oliver O'Halloran
On Wed, Jun 10, 2020 at 7:04 AM Bjorn Helgaas  wrote:
>
> To sketch this out, my understanding of how this would work is:
>
>   - Expose the PCI pdev->untrusted bit in sysfs.  We don't expose this
> today, but doing so would be trivial.  I think I would prefer a
> sysfs name like "external" so it's more descriptive and less of a
> judgment.
>
> This comes from either the DT "external-facing" property or the
> ACPI "ExternalFacingPort" property.

I don't think internal / external is the right distinction to be
making. We have a similar trust issue with the BMC in servers even
though they're internal devices. They're typically network accessible
and infrequently updated so treating them as trustworthy isn't a great
idea. We have been slowly de-privileging the BMC over the last few
years, but the PCIe interface isn't locked down enough for my liking
since the SoCs we use do allow software to set the VDID and perform
arbitrary DMAs (thankfully limited to 32bit). If we're going to add in
infrastructure for handling possibly untrustworthy PCI devices then
I'd like to use that for BMCs too.

>   - All devices present at boot are enumerated.  Any statically built
> drivers will bind to them before any userspace code runs.
>
> If you want to keep statically built drivers from binding, you'd
> need to invent some mechanism so pci_driver_init() could clear
> drivers_autoprobe after registering pci_bus_type.
>
>   - Early userspace code prevents modular drivers from automatically
> binding to PCI devices:
>
>   echo 0 > /sys/bus/pci/drivers_autoprobe
>
> This prevents modular drivers from binding to all devices, whether
> present at boot or hot-added.

I don't see why this is preferable to just disabling autoprobe for
untrusted devices. That would dovetail nicely with Rajat's whitelist
idea if we want to go down that route and I think we might want to.
The BMC usually provides some form of VGA console and we'd like that
to continue working out-of-the-box without too much user (or distro)
intervention.

Oliver


Re: [PATCH v1 1/1] PCI/ERR: Handle fatal error recovery for non-hotplug capable devices

2020-05-26 Thread Oliver O'Halloran
On Wed, May 27, 2020 at 1:06 PM Kuppuswamy, Sathyanarayanan
 wrote:
>
> Yes, in case of DPC (Fatal errors) link is already reset. So we
> don't need any special handling. This reset logic is mainly for
> non-fatal errors.

Why? In our experience most fatal errors aren't all that fatal and can
be recovered by resetting the device. The base spec backs that up (see
gen5 base, sec 6.2) too saying the main point of distinction between
fatal and non-fatal errors is whether handling the error requires a
reset or not. For EEH we always try to recover the device and only
mark it as permanently failed once the devices goes over the max error
threshold (5 errors per hour, by default). Doing something similar for
(native) DPC would make sense IMO.


Re: [PATCH v1 1/1] PCI/ERR: Handle fatal error recovery for non-hotplug capable devices

2020-05-26 Thread Oliver O'Halloran
On Wed, May 27, 2020 at 12:00 PM Kuppuswamy, Sathyanarayanan
 wrote:
>
> Hi,
>
> On 5/21/20 7:56 PM, Yicong Yang wrote:
> >
> >
> > On 2020/5/22 3:31, Kuppuswamy, Sathyanarayanan wrote:
> >>
> > Not exactly. In pci_bus_error_reset(), we call pci_slot_reset() only if it's
> > hotpluggable. But we always call pci_bus_reset() to perform a secondary bus
> > reset for the bridge. That's what I think is unnecessary for a normal link,
> > and that's what reset link indicates us to do. The slot reset is introduced
> > in the process only to solve side effects. (c4eed62a2143, PCI/ERR: Use slot 
> > reset if available)
>
> IIUC, pci_bus_reset() will do slot reset if its supported (hot-plug
> capable slots). If its not supported then it will attempt secondary
> bus reset. So secondary bus reset will be attempted only if slot
> reset is not supported.
>
> Since reported_error_detected() requests us to do reset, we will have
> to attempt some kind of reset before we call ->slot_reset() right?

Yes, the driver returns PCI_ERS_RESULT_NEED_RESET from
->error_detected() to indicate that it doesn't know how to recover
from the error. How that reset is performed doesn't really matter, but
it does need to happen.


> > PCI_ERS_RESULT_NEED_RESET indicates that the driver
> > wants a platform-dependent slot reset and its ->slot_reset() method to be 
> > called then.
> > I don't think it's same as slot reset mentioned above, which is only for 
> > hotpluggable
> > ones.
> What you think is the correct reset implementation ? Is it something
> like this?
>
> if (hotplug capable)
> try_slot_reset()
> else
> do_nothing()

Looks broken to me, but all the reset handling is a rat's nest so
maybe I'm missing something. In the case of a DPC trip the link is
disabled which has the side-effect of hot-resetting the downstream
device. Maybe it's fine?

As an aside, why do we have both ->slot_reset() and ->reset_done() in
the error handling callbacks? Seems like their roles are almost
identical.

Oliver


Re: ioremap() called early from pnv_pci_init_ioda_phb()

2020-05-09 Thread Oliver O'Halloran
On Sun, May 10, 2020 at 1:51 AM Christophe Leroy
 wrote:
>
>
>
> Le 08/05/2020 à 19:41, Qian Cai a écrit :
> >
> >
> >> On May 8, 2020, at 10:39 AM, Qian Cai  wrote:
> >>
> >> Booting POWER9 PowerNV has this message,
> >>
> >> "ioremap() called early from pnv_pci_init_ioda_phb+0x420/0xdfc. Use 
> >> early_ioremap() instead”
> >>
> >> but use the patch below will result in leaks because it will never call 
> >> early_iounmap() anywhere. However, it looks me it was by design that 
> >> phb->regs mapping would be there forever where it would be used in 
> >> pnv_ioda_get_inval_reg(), so is just that check_early_ioremap_leak() 
> >> initcall too strong?
> >>
> >> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >> @@ -36,6 +36,7 @@
> >> #include 
> >> #include 
> >> #include 
> >> +#include 
> >>
> >> #include 
> >>
> >> @@ -3827,7 +3828,7 @@ static void __init pnv_pci_init_ioda_phb(struct 
> >> device_node *np,
> >> /* Get registers */
> >> if (!of_address_to_resource(np, 0, )) {
> >> phb->regs_phys = r.start;
> >> -   phb->regs = ioremap(r.start, resource_size());
> >> +   phb->regs = early_ioremap(r.start, resource_size());
> >> if (phb->regs == NULL)
> >> pr_err("  Failed to map registers !\n”);
> >
> > This will also trigger a panic with debugfs reads, so isn’t that this 
> > commit bogus at least for powerpc64?
> >
> > d538aadc2718 (“powerpc/ioremap: warn on early use of ioremap()")
>
> No d538aadc2718 is not bogus. That's the point, we want to remove all
> early usages of ioremap() in order to remove the hack with the
> ioremap_bot stuff and all, and stick to the generic ioremap logic.
>
> In order to do so, all early use of ioremap() has to be converted to
> early_ioremap() or to fixmap or anything else that allows to do ioremaps
> before the slab is ready.
>
> early_ioremap() is for temporary mappings necessary at boottime. For
> long lasting mappings, another method is to be used.
>
> Now, the point is that other architectures like for instance x86 don't
> seem to have to use early_ioremap() much. Powerpc is for instance doing
> early mappings for PCI. Seems like x86 initialises PCI once slab is
> ready. Can't powerpc do the same ?

Patches welcome.


Re: ioremap() called early from pnv_pci_init_ioda_phb()

2020-05-09 Thread Oliver O'Halloran
On Sat, May 9, 2020 at 12:41 AM Qian Cai  wrote:
>
>  Booting POWER9 PowerNV has this message,
>
> "ioremap() called early from pnv_pci_init_ioda_phb+0x420/0xdfc. Use 
> early_ioremap() instead”
>
> but use the patch below will result in leaks because it will never call 
> early_iounmap() anywhere. However, it looks me it was by design that 
> phb->regs mapping would be there forever where it would be used in 
> pnv_ioda_get_inval_reg(), so is just that check_early_ioremap_leak() initcall 
> too strong?

The warning there is junk. The PHBs are setup at boot and never torn
down so we're not "leaking" the mapping. It's supposed to be there for
the lifetime of the kernel.

That said, we could probably move the PCI setup to a point later in
boot where the normal ioremap can be be used. We would have to check
for initcalls which depend on the PHBs being setup and delay those too
though.

Oliver


Re: [PATCH v3 4/4] PCI: pciehp: Remove pciehp_green_led_{on,off,blink}()

2019-08-27 Thread Oliver O'Halloran
On Tue, Aug 20, 2019 at 2:09 AM Denis Efremov  wrote:
>
> Remove pciehp_green_led_{on,off,blink}() and use pciehp_set_indicators()
> instead, since the code is mostly the same.
>
> Signed-off-by: Denis Efremov 
> ---
>  drivers/pci/hotplug/pciehp.h  |  3 ---
>  drivers/pci/hotplug/pciehp_ctrl.c | 12 ---
>  drivers/pci/hotplug/pciehp_hpc.c  | 36 ---
>  3 files changed, 9 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
> index acda513f37d7..da429345cf70 100644
> --- a/drivers/pci/hotplug/pciehp.h
> +++ b/drivers/pci/hotplug/pciehp.h
> @@ -170,9 +170,6 @@ void pciehp_get_power_status(struct controller *ctrl, u8 
> *status);
>  void pciehp_set_indicators(struct controller *ctrl, int pwr, int attn);
>  void pciehp_get_latch_status(struct controller *ctrl, u8 *status);
>  int pciehp_query_power_fault(struct controller *ctrl);
> -void pciehp_green_led_on(struct controller *ctrl);
> -void pciehp_green_led_off(struct controller *ctrl);
> -void pciehp_green_led_blink(struct controller *ctrl);
>  bool pciehp_card_present(struct controller *ctrl);
>  bool pciehp_card_present_or_link_active(struct controller *ctrl);
>  int pciehp_check_link_status(struct controller *ctrl);
> diff --git a/drivers/pci/hotplug/pciehp_ctrl.c 
> b/drivers/pci/hotplug/pciehp_ctrl.c
> index 232f7bfcfce9..862fe86e87cc 100644
> --- a/drivers/pci/hotplug/pciehp_ctrl.c
> +++ b/drivers/pci/hotplug/pciehp_ctrl.c
> @@ -65,7 +65,9 @@ static int board_added(struct controller *ctrl)
> return retval;
> }
>
> -   pciehp_green_led_blink(ctrl);
> +   pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_BLINK,
> + PCI_EXP_SLTCTL_ATTN_IND_NONE);

I think it woud be good if you provided a helper macro for setting one
of the indicators rather than open coding the _NONE constant. e.g.

#define set_power_indicator(ctrl, x) pciehp_set_indicators(ctrl, (x),
PCI_EXP_SLTCTL_ATTN_IND_NONE);

then you can do:

set_power_indicator(ctrl, PCI_EXP_SLTCTL_PWR_IND_BLINK);

which is a bit nicer to read.


Re: [PATCH v3 1/4] PCI: pciehp: Add pciehp_set_indicators() to jointly set LED indicators

2019-08-27 Thread Oliver O'Halloran
On Tue, Aug 20, 2019 at 2:07 AM Denis Efremov  wrote:
>
> Add pciehp_set_indicators() to set power and attention indicators with a
> single register write. Thus, avoiding waiting twice for Command Complete.
>
> Signed-off-by: Denis Efremov 
> ---
>  drivers/pci/hotplug/pciehp.h |  1 +
>  drivers/pci/hotplug/pciehp_hpc.c | 29 +
>  include/uapi/linux/pci_regs.h|  2 ++
>  3 files changed, 32 insertions(+)
>
> diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h
> index 8c51a04b8083..0e272bf3deb4 100644
> --- a/drivers/pci/hotplug/pciehp.h
> +++ b/drivers/pci/hotplug/pciehp.h
> @@ -167,6 +167,7 @@ int pciehp_power_on_slot(struct controller *ctrl);
>  void pciehp_power_off_slot(struct controller *ctrl);
>  void pciehp_get_power_status(struct controller *ctrl, u8 *status);
>
> +void pciehp_set_indicators(struct controller *ctrl, int pwr, int attn);
>  void pciehp_set_attention_status(struct controller *ctrl, u8 status);
>  void pciehp_get_latch_status(struct controller *ctrl, u8 *status);
>  int pciehp_query_power_fault(struct controller *ctrl);
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c 
> b/drivers/pci/hotplug/pciehp_hpc.c
> index bd990e3371e3..5474b9854a7f 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -443,6 +443,35 @@ void pciehp_set_attention_status(struct controller 
> *ctrl, u8 value)
>  pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, slot_cmd);
>  }
>
> +void pciehp_set_indicators(struct controller *ctrl, int pwr, int attn)
> +{
> +   u16 cmd = 0, mask = 0;
> +
> +   if (PWR_LED(ctrl))
> +   switch (pwr) {
> +   case PCI_EXP_SLTCTL_PWR_IND_ON:
> +   case PCI_EXP_SLTCTL_PWR_IND_BLINK:
> +   case PCI_EXP_SLTCTL_PWR_IND_OFF:

Do you need an explicit /* fallthrough */ comment? I thought the
fallthrough warning was enabled by default now.

> +   cmd |= pwr;
> +   mask |= PCI_EXP_SLTCTL_PIC;
> +   }
> +
> +   if (ATTN_LED(ctrl))
> +   switch (attn) {
> +   case PCI_EXP_SLTCTL_ATTN_IND_ON:
> +   case PCI_EXP_SLTCTL_ATTN_IND_BLINK:
> +   case PCI_EXP_SLTCTL_ATTN_IND_OFF:
> +   cmd |= attn;
> +   mask |= PCI_EXP_SLTCTL_AIC;
> +   }
> +
> +   if (cmd) {
> +   pcie_write_cmd_nowait(ctrl, cmd, mask);
> +   ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__,
> +pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, 
> cmd);
> +   }
> +}
> +
>  void pciehp_green_led_on(struct controller *ctrl)
>  {
> if (!PWR_LED(ctrl))
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index f28e562d7ca8..291788b58f3a 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -591,10 +591,12 @@
>  #define  PCI_EXP_SLTCTL_CCIE   0x0010  /* Command Completed Interrupt Enable 
> */
>  #define  PCI_EXP_SLTCTL_HPIE   0x0020  /* Hot-Plug Interrupt Enable */
>  #define  PCI_EXP_SLTCTL_AIC0x00c0  /* Attention Indicator Control */

> +#define  PCI_EXP_SLTCTL_ATTN_IND_NONE  0x0/* Attention Indicator noop */
>  #define  PCI_EXP_SLTCTL_ATTN_IND_ON0x0040 /* Attention Indicator on */
>  #define  PCI_EXP_SLTCTL_ATTN_IND_BLINK 0x0080 /* Attention Indicator 
> blinking */
>  #define  PCI_EXP_SLTCTL_ATTN_IND_OFF   0x00c0 /* Attention Indicator off */
>  #define  PCI_EXP_SLTCTL_PIC0x0300  /* Power Indicator Control */
> +#define  PCI_EXP_SLTCTL_PWR_IND_NONE   0x0/* Power Indicator noop */

I don't think pci_regs.h is the right place for this. The contents of
pci_regs.h is essentially a transcription of bit definitions from the
various PCI standards documents. The PCIe Base spec says the the 0b00
combination for the two-bit Power and Attn indicator fields is
reserved and there's nothing suggesting that writing a zero to those
fields should preserve the current value. Basicly, we now have a
implementation quirk of pcie_set_indicators() masquerading as part of
the standard.

>  #define  PCI_EXP_SLTCTL_PWR_IND_ON 0x0100 /* Power Indicator on */
>  #define  PCI_EXP_SLTCTL_PWR_IND_BLINK  0x0200 /* Power Indicator blinking */
>  #define  PCI_EXP_SLTCTL_PWR_IND_OFF0x0300 /* Power Indicator off */
> --
> 2.21.0
>


Re: [PATCH v3 1/4] PCI: pciehp: Add pciehp_set_indicators() to jointly set LED indicators

2019-08-27 Thread Oliver O'Halloran
On Tue, Aug 20, 2019 at 2:17 AM Denis Efremov  wrote:
>
> Hi,
>
> On 8/19/19 7:06 PM, Denis Efremov wrote:
> On 8/12/19 11:25 AM, sathyanarayanan kuppuswamy wrote:
> > Do we need to switch case here ? if (pwr > 0) {} should work right ?
>
> I saved the switch here from v2. I think switch makes the inputs check more
> precise and filters-out all non-valid values. Maybe this check is too strict?

Sounds like you're overthinking it tbh. If want to catch programming
errors then a WARN_ON_ONCE() in the default case would be better than
silently ignoring invalid values, but it's pretty hard to care.

> We could use mask here ON|OFF|BLINK for the check, but I don't know how 
> hardware
> will handle a case, for example, pwr == ON|BLINK.

ON|BLINK is the same as OFF


Re: [PATCH 2/4] powerpc/32: activate ARCH_HAS_PMEM_API and ARCH_HAS_UACCESS_FLUSHCACHE

2019-07-15 Thread Oliver O'Halloran
On Mon, Jul 15, 2019 at 4:49 PM Michael Ellerman  wrote:
>
> Christophe Leroy  writes:
> > PPC32 also have flush_dcache_range() so it can also support
> > ARCH_HAS_PMEM_API and ARCH_HAS_UACCESS_FLUSHCACHE without changes.
> >
> > Signed-off-by: Christophe Leroy 
> > ---
> >  arch/powerpc/Kconfig | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> > index d7996cfaceca..cf6e30f637be 100644
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -127,13 +127,13 @@ config PPC
> >   select ARCH_HAS_KCOV
> >   select ARCH_HAS_MMIOWB  if PPC64
> >   select ARCH_HAS_PHYS_TO_DMA
> > - select ARCH_HAS_PMEM_APIif PPC64
> > + select ARCH_HAS_PMEM_API
> >   select ARCH_HAS_PTE_SPECIAL
> >   select ARCH_HAS_MEMBARRIER_CALLBACKS
> >   select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
> > && PPC64
> >   select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) 
> > && !RELOCATABLE && !HIBERNATION)
> >   select ARCH_HAS_TICK_BROADCAST  if 
> > GENERIC_CLOCKEVENTS_BROADCAST
> > - select ARCH_HAS_UACCESS_FLUSHCACHE  if PPC64
> > + select ARCH_HAS_UACCESS_FLUSHCACHE
> >   select ARCH_HAS_UBSAN_SANITIZE_ALL
> >   select ARCH_HAS_ZONE_DEVICE if PPC_BOOK3S_64
> >   select ARCH_HAVE_NMI_SAFE_CMPXCHG
>
> This didn't build for me, probably due to something that's changed in
> the long period between you posting it and me applying it?
>
> corenet32_smp_defconfig:
>
>   powerpc64-unknown-linux-gnu-ld: lib/iov_iter.o: in function 
> `_copy_from_iter_flushcache':
>   powerpc64-unknown-linux-gnu-ld: 
> /scratch/michael/build/maint/build~/../lib/iov_iter.c:825: undefined 
> reference to `memcpy_page_flushcache'
>   powerpc64-unknown-linux-gnu-ld: 
> /scratch/michael/build/maint/build~/../lib/iov_iter.c:825: undefined 
> reference to `memcpy_flushcache'
>   powerpc64-unknown-linux-gnu-ld: 
> /scratch/michael/build/maint/build~/../lib/iov_iter.c:825: undefined 
> reference to `__copy_from_user_flushcache'
>   powerpc64-unknown-linux-gnu-ld: 
> /scratch/michael/build/maint/build~/../lib/iov_iter.c:825: undefined 
> reference to `memcpy_flushcache'

I think lib/pmem.c just needs to be moved out of obj64-y.

>
> cheers


Re: [PATCH 4/4] powerpc/64: reuse PPC32 static inline flush_dcache_range()

2019-07-08 Thread Oliver O'Halloran
On Tue, Jul 9, 2019 at 12:52 PM Aneesh Kumar K.V
 wrote:
>
> On 7/9/19 7:50 AM, Oliver O'Halloran wrote:
> > On Tue, Jul 9, 2019 at 12:22 AM Aneesh Kumar K.V
> >  wrote:
> >>
> >> Christophe Leroy  writes:
> >>
> >>> *snip*
> >>> + if (IS_ENABLED(CONFIG_PPC64))
> >>> + isync();
> >>>   }
> >>
> >>
> >> Was checking with Michael about why we need that extra isync. Michael
> >> pointed this came via
> >>
> >> https://github.com/mpe/linux-fullhistory/commit/faa5ee3743ff9b6df9f9a03600e34fdae596cfb2#diff-67c7ffa8e420c7d4206cae4a9e888e14
> >>
> >> for 970 which doesn't have coherent icache. So possibly isync there is
> >> to flush the prefetch instructions? But even so we would need an icbi
> >> there before that isync.
> >
> > I don't think it's that, there's some magic in flush_icache_range() to
> > handle dropping prefetched instructions on 970.
> >
> >> So overall wondering why we need that extra barriers there.
> >
> > I think the isync is needed there because the architecture only
> > requires sync to provide ordering. A sync alone doesn't guarantee the
> > dcbfs have actually completed so the isync is necessary to ensure the
> > flushed cache lines are back in memory. That said, as far as I know
> > all the IBM book3s chips from power4 onwards will wait for pending
> > dcbfs when they hit a sync, but that might change in the future.
> >
>
> ISA doesn't list that as the sequence. Only place where isync was
> mentioned was w.r.t  icbi where want to discards the prefetch.

doesn't list that as the sequence for what?

> > If it's a problem we could add a cpu-feature section around the isync
> > to no-op it in the common case. However, when I had a look with perf
> > it always showed that the sync was the hotspot so I don't think it'll
> > help much.
> >
>
> What about the preceding barriers (sync; isync;) before dcbf? Why are
> they needed?

Dunno, the sync might just be to ensure ordering between prior stores
and the dcbf.

>
> -aneesh


Re: [PATCH 4/4] powerpc/64: reuse PPC32 static inline flush_dcache_range()

2019-07-08 Thread Oliver O'Halloran
On Tue, Jul 9, 2019 at 12:22 AM Aneesh Kumar K.V
 wrote:
>
> Christophe Leroy  writes:
>
> > *snip*
> > + if (IS_ENABLED(CONFIG_PPC64))
> > + isync();
> >  }
>
>
> Was checking with Michael about why we need that extra isync. Michael
> pointed this came via
>
> https://github.com/mpe/linux-fullhistory/commit/faa5ee3743ff9b6df9f9a03600e34fdae596cfb2#diff-67c7ffa8e420c7d4206cae4a9e888e14
>
> for 970 which doesn't have coherent icache. So possibly isync there is
> to flush the prefetch instructions? But even so we would need an icbi
> there before that isync.

I don't think it's that, there's some magic in flush_icache_range() to
handle dropping prefetched instructions on 970.

> So overall wondering why we need that extra barriers there.

I think the isync is needed there because the architecture only
requires sync to provide ordering. A sync alone doesn't guarantee the
dcbfs have actually completed so the isync is necessary to ensure the
flushed cache lines are back in memory. That said, as far as I know
all the IBM book3s chips from power4 onwards will wait for pending
dcbfs when they hit a sync, but that might change in the future.

If it's a problem we could add a cpu-feature section around the isync
to no-op it in the common case. However, when I had a look with perf
it always showed that the sync was the hotspot so I don't think it'll
help much.

Oliver


Re: [PATCH 2/4] powerpc/powernv: remove the unused tunneling exports

2019-06-20 Thread Oliver O'Halloran
On Thu, May 23, 2019 at 5:51 PM Christoph Hellwig  wrote:
>
> These have been unused ever since they've been added to the kernel.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/include/asm/pnv-pci.h|  4 --
>  arch/powerpc/platforms/powernv/pci-ioda.c |  4 +-
>  arch/powerpc/platforms/powernv/pci.c  | 71 ---
>  arch/powerpc/platforms/powernv/pci.h  |  1 -
>  4 files changed, 3 insertions(+), 77 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/pnv-pci.h 
> b/arch/powerpc/include/asm/pnv-pci.h
> index 9fcb0bc462c6..1ab4b0111abc 100644
> --- a/arch/powerpc/include/asm/pnv-pci.h
> +++ b/arch/powerpc/include/asm/pnv-pci.h
> @@ -27,12 +27,8 @@ extern int pnv_pci_get_power_state(uint64_t id, uint8_t 
> *state);
>  extern int pnv_pci_set_power_state(uint64_t id, uint8_t state,
>struct opal_msg *msg);
>
> -extern int pnv_pci_enable_tunnel(struct pci_dev *dev, uint64_t *asnind);
> -extern int pnv_pci_disable_tunnel(struct pci_dev *dev);
>  extern int pnv_pci_set_tunnel_bar(struct pci_dev *dev, uint64_t addr,
>   int enable);
> -extern int pnv_pci_get_as_notify_info(struct task_struct *task, u32 *lpid,
> - u32 *pid, u32 *tid);

IIRC as-notify was for CAPI which has an in-tree driver (cxl). Fred or
Andrew (+cc), what's going on with this? Will it ever see the light of
day?

>  int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode);
>  int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
>unsigned int virq);
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 126602b4e399..6b0caa2d0425 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -54,6 +54,8 @@
>  static const char * const pnv_phb_names[] = { "IODA1", "IODA2", "NPU_NVLINK",
>   "NPU_OCAPI" };
>
> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable);
> +
>  void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level,
> const char *fmt, ...)
>  {
> @@ -2360,7 +2362,7 @@ static long pnv_pci_ioda2_set_window(struct 
> iommu_table_group *table_group,
> return 0;
>  }
>
> -void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
>  {
> uint16_t window_id = (pe->pe_number << 1 ) + 1;
> int64_t rc;
> diff --git a/arch/powerpc/platforms/powernv/pci.c 
> b/arch/powerpc/platforms/powernv/pci.c
> index 8d28f2932c3b..fc69f5611020 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -868,54 +868,6 @@ struct device_node *pnv_pci_get_phb_node(struct pci_dev 
> *dev)
>  }
>  EXPORT_SYMBOL(pnv_pci_get_phb_node);
>
> -int pnv_pci_enable_tunnel(struct pci_dev *dev, u64 *asnind)
> -{
> -   struct device_node *np;
> -   const __be32 *prop;
> -   struct pnv_ioda_pe *pe;
> -   uint16_t window_id;
> -   int rc;
> -
> -   if (!radix_enabled())
> -   return -ENXIO;
> -
> -   if (!(np = pnv_pci_get_phb_node(dev)))
> -   return -ENXIO;
> -
> -   prop = of_get_property(np, "ibm,phb-indications", NULL);
> -   of_node_put(np);
> -
> -   if (!prop || !prop[1])
> -   return -ENXIO;
> -
> -   *asnind = (u64)be32_to_cpu(prop[1]);
> -   pe = pnv_ioda_get_pe(dev);
> -   if (!pe)
> -   return -ENODEV;
> -
> -   /* Increase real window size to accept as_notify messages. */
> -   window_id = (pe->pe_number << 1 ) + 1;
> -   rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id, pe->pe_number,
> -window_id, pe->tce_bypass_base,
> -(uint64_t)1 << 48);
> -   return opal_error_code(rc);
> -}
> -EXPORT_SYMBOL_GPL(pnv_pci_enable_tunnel);
> -
> -int pnv_pci_disable_tunnel(struct pci_dev *dev)
> -{
> -   struct pnv_ioda_pe *pe;
> -
> -   pe = pnv_ioda_get_pe(dev);
> -   if (!pe)
> -   return -ENODEV;
> -
> -   /* Restore default real window size. */
> -   pnv_pci_ioda2_set_bypass(pe, true);
> -   return 0;
> -}
> -EXPORT_SYMBOL_GPL(pnv_pci_disable_tunnel);
> -
>  int pnv_pci_set_tunnel_bar(struct pci_dev *dev, u64 addr, int enable)
>  {
> __be64 val;
> @@ -970,29 +922,6 @@ int pnv_pci_set_tunnel_bar(struct pci_dev *dev, u64 
> addr, int enable)
>  }
>  EXPORT_SYMBOL_GPL(pnv_pci_set_tunnel_bar);
>
> -#ifdef CONFIG_PPC64/* for thread.tidr */
> -int pnv_pci_get_as_notify_info(struct task_struct *task, u32 *lpid, u32 *pid,
> -  u32 *tid)
> -{
> -   struct mm_struct *mm = NULL;
> -
> -   if (task == NULL)
> -   return -EINVAL;
> -
> -   

Re: [PATCH 4/4] powerpc/powernv: remove the unused vas_win_paste_addr and vas_win_id functions

2019-06-20 Thread Oliver O'Halloran
On Thu, May 23, 2019 at 5:56 PM Christoph Hellwig  wrote:
>
> These two function have never been used since they were added to the
> kernel.
>
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/powerpc/include/asm/vas.h  | 10 --
>  arch/powerpc/platforms/powernv/vas-window.c | 19 ---
>  arch/powerpc/platforms/powernv/vas.h| 20 
>  3 files changed, 49 deletions(-)

Sukadev (+cc), what's the reason this is not being used?

IIRC the VAS hardware on P9 had some issues, but I don't know any of
the details.

> diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
> index 771456227496..9b5b7261df7b 100644
> --- a/arch/powerpc/include/asm/vas.h
> +++ b/arch/powerpc/include/asm/vas.h
> @@ -167,14 +167,4 @@ int vas_copy_crb(void *crb, int offset);
>   */
>  int vas_paste_crb(struct vas_window *win, int offset, bool re);
>
> -/*
> - * Return a system-wide unique id for the VAS window @win.
> - */
> -extern u32 vas_win_id(struct vas_window *win);
> -
> -/*
> - * Return the power bus paste address associated with @win so the caller
> - * can map that address into their address space.
> - */
> -extern u64 vas_win_paste_addr(struct vas_window *win);
>  #endif /* __ASM_POWERPC_VAS_H */
> diff --git a/arch/powerpc/platforms/powernv/vas-window.c 
> b/arch/powerpc/platforms/powernv/vas-window.c
> index e59e0e60e5b5..e48c44cb3a16 100644
> --- a/arch/powerpc/platforms/powernv/vas-window.c
> +++ b/arch/powerpc/platforms/powernv/vas-window.c
> @@ -44,16 +44,6 @@ static void compute_paste_address(struct vas_window 
> *window, u64 *addr, int *len
> pr_debug("Txwin #%d: Paste addr 0x%llx\n", winid, *addr);
>  }
>
> -u64 vas_win_paste_addr(struct vas_window *win)
> -{
> -   u64 addr;
> -
> -   compute_paste_address(win, , NULL);
> -
> -   return addr;
> -}
> -EXPORT_SYMBOL(vas_win_paste_addr);
> -
>  static inline void get_hvwc_mmio_bar(struct vas_window *window,
> u64 *start, int *len)
>  {
> @@ -1268,12 +1258,3 @@ int vas_win_close(struct vas_window *window)
> return 0;
>  }
>  EXPORT_SYMBOL_GPL(vas_win_close);
> -
> -/*
> - * Return a system-wide unique window id for the window @win.
> - */
> -u32 vas_win_id(struct vas_window *win)
> -{
> -   return encode_pswid(win->vinst->vas_id, win->winid);
> -}
> -EXPORT_SYMBOL_GPL(vas_win_id);
> diff --git a/arch/powerpc/platforms/powernv/vas.h 
> b/arch/powerpc/platforms/powernv/vas.h
> index f5493dbdd7ff..551affaddd59 100644
> --- a/arch/powerpc/platforms/powernv/vas.h
> +++ b/arch/powerpc/platforms/powernv/vas.h
> @@ -448,26 +448,6 @@ static inline u64 read_hvwc_reg(struct vas_window *win,
> return in_be64(win->hvwc_map+reg);
>  }
>
> -/*
> - * Encode/decode the Partition Send Window ID (PSWID) for a window in
> - * a way that we can uniquely identify any window in the system. i.e.
> - * we should be able to locate the 'struct vas_window' given the PSWID.
> - *
> - * BitsUsage
> - * 0:7 VAS id (8 bits)
> - * 8:15Unused, 0 (3 bits)
> - * 16:31   Window id (16 bits)
> - */
> -static inline u32 encode_pswid(int vasid, int winid)
> -{
> -   u32 pswid = 0;
> -
> -   pswid |= vasid << (31 - 7);
> -   pswid |= winid;
> -
> -   return pswid;
> -}
> -
>  static inline void decode_pswid(u32 pswid, int *vasid, int *winid)
>  {
> if (vasid)
> --
> 2.20.1
>


Re: [PATCH v2 8/8] habanalabs: enable 64-bit DMA mask in POWER9

2019-06-12 Thread Oliver O'Halloran
On Wed, Jun 12, 2019 at 4:53 PM Christoph Hellwig  wrote:
>
> On Wed, Jun 12, 2019 at 04:35:22PM +1000, Oliver O'Halloran wrote:
> > Setting a 48 bit DMA mask doesn't work today because we only allocate
> > IOMMU tables to cover the 0..2GB range of PCI bus addresses.
>
> I don't think that is true upstream, and if it is we need to fix bug
> in the powerpc code.  powerpc should be falling back treating a 48-bit
> dma mask like a 32-bit one at least, that is use dynamic iommu mappings
> instead of using the direct mapping.  And from my reding of
> arch/powerpc/kernel/dma-iommu.c that is exactly what it does.

This is more or less what Alexey's patches fix. The IOMMU table
allocated for the 32bit DMA window is only sized for 2GB in the
platform code, see pnv_pci_ioda2_setup_default_config().


Re: [PATCH v2 8/8] habanalabs: enable 64-bit DMA mask in POWER9

2019-06-12 Thread Oliver O'Halloran
On Wed, Jun 12, 2019 at 3:25 AM Oded Gabbay  wrote:
>
> On Tue, Jun 11, 2019 at 8:03 PM Oded Gabbay  wrote:
> >
> > On Tue, Jun 11, 2019 at 6:26 PM Greg KH  wrote:
> > > *snip*
> >
> > Now, when I tried to integrate Goya into a POWER9 machine, I got a
> > reject from the call to pci_set_dma_mask(pdev, 48). The standard code,
> > as I wrote above, is to call the same function with 32-bits. That
> > works BUT it is not practical, as our applications require much more
> > memory mapped then 32-bits.

Setting a 48 bit DMA mask doesn't work today because we only allocate
IOMMU tables to cover the 0..2GB range of PCI bus addresses. Alexey
has some patches to expand that range so we can support devices that
can't hit the 64 bit bypass window. You need:

This fix: http://patchwork.ozlabs.org/patch/1113506/
This series: 
http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=110810

Give that a try and see if the IOMMU overhead is tolerable.

> >In addition, once you add more cards which
> > are all mapped to the same range, it is simply not usable at all.

Each IOMMU group should have a separate bus address space and seperate
cards shouldn't be in the same IOMMU group. If they are then there's
something up.

Oliver


Re: [PATCH v2 8/8] habanalabs: enable 64-bit DMA mask in POWER9

2019-06-11 Thread Oliver O'Halloran
On Wed, Jun 12, 2019 at 8:54 AM Benjamin Herrenschmidt
 wrote:
>
> On Tue, 2019-06-11 at 20:22 +0300, Oded Gabbay wrote:
> >
> > > So, to summarize:
> > > If I call pci_set_dma_mask with 48, then it fails on POWER9. However,
> > > in runtime, I don't know if its POWER9 or not, so upon failure I will
> > > call it again with 32, which makes our device pretty much unusable.
> > > If I call pci_set_dma_mask with 64, and do the dedicated configuration
> > > in Goya's PCIe controller, then it won't work on x86-64, because bit
> > > 59 will be set and the host won't like it (I checked it). In addition,
> > > I might get addresses above 50 bits, which my device can't generate.
> > >
> > > I hope this makes things more clear. Now, please explain to me how I
> > > can call pci_set_dma_mask without any regard to whether I run on
> > > x86-64 or POWER9, considering what I wrote above ?
> > >
> > > Thanks,
> > > Oded
> >
> > Adding ppc mailing list.
>
> You can't. Your device is broken. Devices that don't support DMAing to
> the full 64-bit deserve to be added to the trash pile.
>
> As a result, getting it to work will require hacks. Some GPUs have
> similar issues and require similar hacks, it's unfortunate.
>
> Added a couple of guys on CC who might be able to help get those hacks
> right.

> It's still very fishy .. the idea is to detect the case where setting a
> 64-bit mask will give your system memory mapped at a fixed high address
> (1 << 59 in our case) and program that in your chip in the "Fixed high
> bits" register that you seem to have (also make sure it doesn't affect
> MSIs or it will break them).

Judging from the patch (https://lkml.org/lkml/2019/6/11/59) this is
what they're doing.

Also, are you sure about the MSI thing? The IODA3 spec says the only
important bits for a 64bit MSI are bits 61:60 (to hit the window) and
the lower bits that determine what IVE to use. Everything in between
is ignored so ORing in bit 59 shouldn't break anything.

> This will only work as long as all of the system memory can be
> addressed at an offset from that fixed address that itself fits your
> device addressing capabilities (50 bits in this case). It may or may
> not be the case but there's no way to check since the DMA mask logic
> won't really apply.
>
> You might want to consider fixing your HW in the next iteration... This
> is going to bite you when x86 increases the max physical memory for
> example, or on other architectures.

Yes, do this. The easiest way to avoid this sort of wierd hack is to
just design the PCIe interface to the spec in the first place.


[PATCH] drivers/of: Introduce ARCH_HAS_OWN_OF_NUMA

2018-04-09 Thread Oliver O'Halloran
Some OF platforms (pseries and some SPARC systems) has their own
implementations of NUMA affinity detection rather than using the generic
OF_NUMA driver, which mainly exists for arm64. For other platforms one
of two fallbacks provided by the base OF driver are used depending on
CONFIG_NUMA.

In the CONFIG_NUMA=n case the fallback is an inline function in of.h.
In the =y case the fallback is a real function which is defined as a
weak symbol so that it may be overwritten by the architecture if desired.

The problem with this arrangement is that the real implementations all
export of_node_to_nid(). Unfortunately it's not possible to export the
fallback since it would clash with the non-weak version. As a result
we get build failures when:

a) CONFIG_NUMA=y && CONFIG_OF=y, and
b) The platform doesn't implement of_node_to_nid(), and
c) A module uses of_node_to_nid()

Given b) will be true for most platforms this is fairly easy to hit
and has been observed on ia64 and x86.

This patch remedies the problem by introducing the ARCH_HAS_OWN_OF_NUMA
Kconfig option which is selected if an architecture provides an
implementation of of_node_to_nid(). If a platform does not use it's own,
or the generic OF_NUMA, then always use the inline fallback in of.h so
we don't need to futz around with exports.

Cc: devicet...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Fixes: 298535c00a2c ("of, numa: Add NUMA of binding implementation.")
Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
---
 arch/powerpc/Kconfig | 1 +
 arch/sparc/Kconfig   | 1 +
 drivers/of/Kconfig   | 3 +++
 drivers/of/base.c| 7 ---
 include/linux/of.h   | 2 +-
 5 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c32a181a7cbb..74ce5f3564ae 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -625,6 +625,7 @@ config NUMA
bool "NUMA support"
depends on PPC64
default y if SMP && PPC_PSERIES
+   select ARCH_HAS_OWN_OF_NUMA
 
 config NODES_SHIFT
int
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 8767e45f1b2b..f8071f1c3edb 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -299,6 +299,7 @@ config GENERIC_LOCKBREAK
 config NUMA
bool "NUMA support"
depends on SPARC64 && SMP
+   select ARCH_HAS_OWN_OF_NUMA
 
 config NODES_SHIFT
int "Maximum NUMA Nodes (as a power of 2)"
diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
index ad3fcad4d75b..01c62b747b25 100644
--- a/drivers/of/Kconfig
+++ b/drivers/of/Kconfig
@@ -103,4 +103,7 @@ config OF_OVERLAY
 config OF_NUMA
bool
 
+config ARCH_HAS_OWN_OF_NUMA
+   bool
+
 endif # OF
diff --git a/drivers/of/base.c b/drivers/of/base.c
index 848f549164cd..82a9584bb0e2 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -84,13 +84,6 @@ int of_n_size_cells(struct device_node *np)
 }
 EXPORT_SYMBOL(of_n_size_cells);
 
-#ifdef CONFIG_NUMA
-int __weak of_node_to_nid(struct device_node *np)
-{
-   return NUMA_NO_NODE;
-}
-#endif
-
 static struct device_node **phandle_cache;
 static u32 phandle_cache_mask;
 
diff --git a/include/linux/of.h b/include/linux/of.h
index 4d25e4f952d9..9bb42dac5e65 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -942,7 +942,7 @@ static inline int of_cpu_node_to_id(struct device_node *np)
 #define of_node_cmp(s1, s2)strcasecmp((s1), (s2))
 #endif
 
-#if defined(CONFIG_OF) && defined(CONFIG_NUMA)
+#if defined(CONFIG_OF_NUMA) || defined(CONFIG_ARCH_HAS_OWN_OF_NUMA)
 extern int of_node_to_nid(struct device_node *np);
 #else
 static inline int of_node_to_nid(struct device_node *device)
-- 
2.9.5



[PATCH] drivers/of: Introduce ARCH_HAS_OWN_OF_NUMA

2018-04-09 Thread Oliver O'Halloran
Some OF platforms (pseries and some SPARC systems) has their own
implementations of NUMA affinity detection rather than using the generic
OF_NUMA driver, which mainly exists for arm64. For other platforms one
of two fallbacks provided by the base OF driver are used depending on
CONFIG_NUMA.

In the CONFIG_NUMA=n case the fallback is an inline function in of.h.
In the =y case the fallback is a real function which is defined as a
weak symbol so that it may be overwritten by the architecture if desired.

The problem with this arrangement is that the real implementations all
export of_node_to_nid(). Unfortunately it's not possible to export the
fallback since it would clash with the non-weak version. As a result
we get build failures when:

a) CONFIG_NUMA=y && CONFIG_OF=y, and
b) The platform doesn't implement of_node_to_nid(), and
c) A module uses of_node_to_nid()

Given b) will be true for most platforms this is fairly easy to hit
and has been observed on ia64 and x86.

This patch remedies the problem by introducing the ARCH_HAS_OWN_OF_NUMA
Kconfig option which is selected if an architecture provides an
implementation of of_node_to_nid(). If a platform does not use it's own,
or the generic OF_NUMA, then always use the inline fallback in of.h so
we don't need to futz around with exports.

Cc: devicet...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linuxppc-...@lists.ozlabs.org
Fixes: 298535c00a2c ("of, numa: Add NUMA of binding implementation.")
Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/Kconfig | 1 +
 arch/sparc/Kconfig   | 1 +
 drivers/of/Kconfig   | 3 +++
 drivers/of/base.c| 7 ---
 include/linux/of.h   | 2 +-
 5 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c32a181a7cbb..74ce5f3564ae 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -625,6 +625,7 @@ config NUMA
bool "NUMA support"
depends on PPC64
default y if SMP && PPC_PSERIES
+   select ARCH_HAS_OWN_OF_NUMA
 
 config NODES_SHIFT
int
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 8767e45f1b2b..f8071f1c3edb 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -299,6 +299,7 @@ config GENERIC_LOCKBREAK
 config NUMA
bool "NUMA support"
depends on SPARC64 && SMP
+   select ARCH_HAS_OWN_OF_NUMA
 
 config NODES_SHIFT
int "Maximum NUMA Nodes (as a power of 2)"
diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
index ad3fcad4d75b..01c62b747b25 100644
--- a/drivers/of/Kconfig
+++ b/drivers/of/Kconfig
@@ -103,4 +103,7 @@ config OF_OVERLAY
 config OF_NUMA
bool
 
+config ARCH_HAS_OWN_OF_NUMA
+   bool
+
 endif # OF
diff --git a/drivers/of/base.c b/drivers/of/base.c
index 848f549164cd..82a9584bb0e2 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -84,13 +84,6 @@ int of_n_size_cells(struct device_node *np)
 }
 EXPORT_SYMBOL(of_n_size_cells);
 
-#ifdef CONFIG_NUMA
-int __weak of_node_to_nid(struct device_node *np)
-{
-   return NUMA_NO_NODE;
-}
-#endif
-
 static struct device_node **phandle_cache;
 static u32 phandle_cache_mask;
 
diff --git a/include/linux/of.h b/include/linux/of.h
index 4d25e4f952d9..9bb42dac5e65 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -942,7 +942,7 @@ static inline int of_cpu_node_to_id(struct device_node *np)
 #define of_node_cmp(s1, s2)strcasecmp((s1), (s2))
 #endif
 
-#if defined(CONFIG_OF) && defined(CONFIG_NUMA)
+#if defined(CONFIG_OF_NUMA) || defined(CONFIG_ARCH_HAS_OWN_OF_NUMA)
 extern int of_node_to_nid(struct device_node *np);
 #else
 static inline int of_node_to_nid(struct device_node *device)
-- 
2.9.5



[PATCH] drm/sti: Depend on OF rather than selecting it

2018-04-02 Thread Oliver O'Halloran
Commit cc6b741c6f63 ("drm: sti: remove useless fields from vtg
structure") reworked some code inside of this driver and made it select
CONFIG_OF. This results in the entire OF layer being enabled when
building an allmodconfig on ia64. OF on ia64 is completely unsupported
so this isn't a great state of affairs.

The 0day robot noticed a link-time failure on ia64 caused by
using of_node_to_nid() in an otherwise unrelated driver. The
generic fallback for of_node_to_nid() only exists when:

defined(CONFIG_OF) && defined(CONFIG_NUMA) == false

Since CONFIG_NUMA is usually selected for IA64 we get the link failure.
Fix this by making the driver depend on OF rather than selecting it,
odds are that was the original intent.

Link: https://lists.01.org/pipermail/kbuild-all/2018-March/045172.html
Fixes: cc6b741c6f63 ("drm: sti: remove useless fields from vtg structure")
Cc: Benjamin Gaignard <benjamin.gaign...@linaro.org>
Cc: Vincent Abriou <vincent.abr...@st.com>
Cc: David Airlie <airl...@linux.ie>
Cc: dri-de...@lists.freedesktop.org
Cc: linux-i...@vger.kernel.org
Cc: sta...@vger.kernel.org
Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
---
Cc`ed to stable since the ia64 guys might want it backported. I'm not
bothered if it just goes into mainline.
---
 drivers/gpu/drm/sti/Kconfig | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/sti/Kconfig b/drivers/gpu/drm/sti/Kconfig
index cca4b3c9aeb5..1963cc1b1cc5 100644
--- a/drivers/gpu/drm/sti/Kconfig
+++ b/drivers/gpu/drm/sti/Kconfig
@@ -1,6 +1,6 @@
 config DRM_STI
tristate "DRM Support for STMicroelectronics SoC stiH4xx Series"
-   depends on DRM && (ARCH_STI || ARCH_MULTIPLATFORM)
+   depends on OF && DRM && (ARCH_STI || ARCH_MULTIPLATFORM)
select RESET_CONTROLLER
select DRM_KMS_HELPER
select DRM_GEM_CMA_HELPER
@@ -8,6 +8,5 @@ config DRM_STI
select DRM_PANEL
select FW_LOADER
select SND_SOC_HDMI_CODEC if SND_SOC
-   select OF
help
  Choose this option to enable DRM on STM stiH4xx chipset
-- 
2.9.5



[PATCH] drm/sti: Depend on OF rather than selecting it

2018-04-02 Thread Oliver O'Halloran
Commit cc6b741c6f63 ("drm: sti: remove useless fields from vtg
structure") reworked some code inside of this driver and made it select
CONFIG_OF. This results in the entire OF layer being enabled when
building an allmodconfig on ia64. OF on ia64 is completely unsupported
so this isn't a great state of affairs.

The 0day robot noticed a link-time failure on ia64 caused by
using of_node_to_nid() in an otherwise unrelated driver. The
generic fallback for of_node_to_nid() only exists when:

defined(CONFIG_OF) && defined(CONFIG_NUMA) == false

Since CONFIG_NUMA is usually selected for IA64 we get the link failure.
Fix this by making the driver depend on OF rather than selecting it,
odds are that was the original intent.

Link: https://lists.01.org/pipermail/kbuild-all/2018-March/045172.html
Fixes: cc6b741c6f63 ("drm: sti: remove useless fields from vtg structure")
Cc: Benjamin Gaignard 
Cc: Vincent Abriou 
Cc: David Airlie 
Cc: dri-de...@lists.freedesktop.org
Cc: linux-i...@vger.kernel.org
Cc: sta...@vger.kernel.org
Signed-off-by: Oliver O'Halloran 
---
Cc`ed to stable since the ia64 guys might want it backported. I'm not
bothered if it just goes into mainline.
---
 drivers/gpu/drm/sti/Kconfig | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/sti/Kconfig b/drivers/gpu/drm/sti/Kconfig
index cca4b3c9aeb5..1963cc1b1cc5 100644
--- a/drivers/gpu/drm/sti/Kconfig
+++ b/drivers/gpu/drm/sti/Kconfig
@@ -1,6 +1,6 @@
 config DRM_STI
tristate "DRM Support for STMicroelectronics SoC stiH4xx Series"
-   depends on DRM && (ARCH_STI || ARCH_MULTIPLATFORM)
+   depends on OF && DRM && (ARCH_STI || ARCH_MULTIPLATFORM)
select RESET_CONTROLLER
select DRM_KMS_HELPER
select DRM_GEM_CMA_HELPER
@@ -8,6 +8,5 @@ config DRM_STI
select DRM_PANEL
select FW_LOADER
select SND_SOC_HDMI_CODEC if SND_SOC
-   select OF
help
  Choose this option to enable DRM on STM stiH4xx chipset
-- 
2.9.5



[PATCH] kernel/memremap: Remove stale devres_free() call

2018-03-05 Thread Oliver O'Halloran
devm_memremap_pages() was re-worked in e8d513483300 to take a caller
allocated struct dev_pagemap as a function parameter. A call to
devres_free() was left in the error cleanup path which results in
a kernel panic if the remap fails for some reason. Remove it
to fix the panic and let devm_memremap_pages() fail gracefully.

Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface to use 
struct dev_pagemap")
Cc: Logan Gunthorpe <log...@deltatee.com>
Cc: Christoph Hellwig <h...@lst.de>
Cc: Dan Williams <dan.j.willi...@intel.com>
Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
---
Both in-tree users of devm_memremap_pages() embed dev_pagemap into other
structures so this shouldn't cause any leaks. Logan's p2p series does
add one usage that assumes pgmap will be freed on error so that'll
need fixing.
---
 kernel/memremap.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 4dd4274cabe2..895e6b76b25e 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -427,7 +427,6 @@ void *devm_memremap_pages(struct device *dev, struct 
dev_pagemap *pgmap)
  err_pfn_remap:
  err_radix:
pgmap_radix_release(res, pgoff);
-   devres_free(pgmap);
return ERR_PTR(error);
 }
 EXPORT_SYMBOL(devm_memremap_pages);
-- 
2.9.5



[PATCH] kernel/memremap: Remove stale devres_free() call

2018-03-05 Thread Oliver O'Halloran
devm_memremap_pages() was re-worked in e8d513483300 to take a caller
allocated struct dev_pagemap as a function parameter. A call to
devres_free() was left in the error cleanup path which results in
a kernel panic if the remap fails for some reason. Remove it
to fix the panic and let devm_memremap_pages() fail gracefully.

Fixes: e8d513483300 ("memremap: change devm_memremap_pages interface to use 
struct dev_pagemap")
Cc: Logan Gunthorpe 
Cc: Christoph Hellwig 
Cc: Dan Williams 
Signed-off-by: Oliver O'Halloran 
---
Both in-tree users of devm_memremap_pages() embed dev_pagemap into other
structures so this shouldn't cause any leaks. Logan's p2p series does
add one usage that assumes pgmap will be freed on error so that'll
need fixing.
---
 kernel/memremap.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/memremap.c b/kernel/memremap.c
index 4dd4274cabe2..895e6b76b25e 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -427,7 +427,6 @@ void *devm_memremap_pages(struct device *dev, struct 
dev_pagemap *pgmap)
  err_pfn_remap:
  err_radix:
pgmap_radix_release(res, pgoff);
-   devres_free(pgmap);
return ERR_PTR(error);
 }
 EXPORT_SYMBOL(devm_memremap_pages);
-- 
2.9.5



Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-18 Thread Oliver O'Halloran
On Wed, Apr 19, 2017 at 2:46 AM, Rob Herring  wrote:
> On Mon, Apr 17, 2017 at 7:32 PM, Tyrel Datwyler
>  wrote:
>> This patch introduces event tracepoints for tracking a device_nodes
>> reference cycle as well as reconfig notifications generated in response
>> to node/property manipulations.
>>
>> With the recent upstreaming of the refcount API several device_node
>> underflows and leaks have come to my attention in the pseries (DLPAR) dynamic
>> logical partitioning code (ie. POWER speak for hotplugging virtual and 
>> physcial
>> resources at runtime such as cpus or IOAs). These tracepoints provide a
>> easy and quick mechanism for validating the reference counting of
>> device_nodes during their lifetime.
>
> Not really relevant for this patch, but since you are looking at
> pseries and refcounting, the refcounting largely exists for pseries.
> It's also hard to get right as this type of fix is fairly common. It's
> now used for overlays, but we really probably only need to refcount
> the overlays or changesets as a whole, not at a node level. If you
> have any thoughts on how a different model of refcounting could work
> for pseries, I'd like to discuss it.

One idea I've been kicking around is differentiating short and long
term references to a node. I figure most leaks are due to a missing
of_node_put() within a stack frame so it might be possible to use the
ftrace infrastructure to detect and emit warnings if a short term
reference is leaked. Long term references are slightly harder to deal
with, but they're less common so we can add more detailed reference
tracking there (devm_of_get_node?).

Oliver


Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-18 Thread Oliver O'Halloran
On Wed, Apr 19, 2017 at 2:46 AM, Rob Herring  wrote:
> On Mon, Apr 17, 2017 at 7:32 PM, Tyrel Datwyler
>  wrote:
>> This patch introduces event tracepoints for tracking a device_nodes
>> reference cycle as well as reconfig notifications generated in response
>> to node/property manipulations.
>>
>> With the recent upstreaming of the refcount API several device_node
>> underflows and leaks have come to my attention in the pseries (DLPAR) dynamic
>> logical partitioning code (ie. POWER speak for hotplugging virtual and 
>> physcial
>> resources at runtime such as cpus or IOAs). These tracepoints provide a
>> easy and quick mechanism for validating the reference counting of
>> device_nodes during their lifetime.
>
> Not really relevant for this patch, but since you are looking at
> pseries and refcounting, the refcounting largely exists for pseries.
> It's also hard to get right as this type of fix is fairly common. It's
> now used for overlays, but we really probably only need to refcount
> the overlays or changesets as a whole, not at a node level. If you
> have any thoughts on how a different model of refcounting could work
> for pseries, I'd like to discuss it.

One idea I've been kicking around is differentiating short and long
term references to a node. I figure most leaks are due to a missing
of_node_put() within a stack frame so it might be possible to use the
ftrace infrastructure to detect and emit warnings if a short term
reference is leaked. Long term references are slightly harder to deal
with, but they're less common so we can add more detailed reference
tracking there (devm_of_get_node?).

Oliver


[PATCH v2] mm, x86: Add ARCH_HAS_ZONE_DEVICE to Kconfig

2017-04-12 Thread Oliver O'Halloran
Currently ZONE_DEVICE depends on X86_64 and this will get unwieldly as
new architectures (and platforms) get ZONE_DEVICE support. Move to an
arch selected Kconfig option to save us the trouble.

Cc: linux...@kvack.org
Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
---
v2: add missing hunk
---
 arch/x86/Kconfig | 1 +
 mm/Kconfig   | 6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c43f47622440..535b4d514792 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -59,6 +59,7 @@ config X86
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_UBSAN_SANITIZE_ALL
+   select ARCH_HAS_ZONE_DEVICE if X86_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/mm/Kconfig b/mm/Kconfig
index c89f472b658c..392bf8d7574c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -684,12 +684,16 @@ config IDLE_PAGE_TRACKING
 
  See Documentation/vm/idle_page_tracking.txt for more details.
 
+# arch_add_memory() comprehends device memory
+config ARCH_HAS_ZONE_DEVICE
+   bool
+
 config ZONE_DEVICE
bool "Device memory (pmem, etc...) hotplug support"
depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE
depends on SPARSEMEM_VMEMMAP
-   depends on X86_64 #arch_add_memory() comprehends device memory
+   depends on ARCH_HAS_ZONE_DEVICE
 
help
  Device memory hotplug support allows for establishing pmem,
-- 
2.9.3



[PATCH v2] mm, x86: Add ARCH_HAS_ZONE_DEVICE to Kconfig

2017-04-12 Thread Oliver O'Halloran
Currently ZONE_DEVICE depends on X86_64 and this will get unwieldly as
new architectures (and platforms) get ZONE_DEVICE support. Move to an
arch selected Kconfig option to save us the trouble.

Cc: linux...@kvack.org
Signed-off-by: Oliver O'Halloran 
---
v2: add missing hunk
---
 arch/x86/Kconfig | 1 +
 mm/Kconfig   | 6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c43f47622440..535b4d514792 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -59,6 +59,7 @@ config X86
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_UBSAN_SANITIZE_ALL
+   select ARCH_HAS_ZONE_DEVICE if X86_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/mm/Kconfig b/mm/Kconfig
index c89f472b658c..392bf8d7574c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -684,12 +684,16 @@ config IDLE_PAGE_TRACKING
 
  See Documentation/vm/idle_page_tracking.txt for more details.
 
+# arch_add_memory() comprehends device memory
+config ARCH_HAS_ZONE_DEVICE
+   bool
+
 config ZONE_DEVICE
bool "Device memory (pmem, etc...) hotplug support"
depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE
depends on SPARSEMEM_VMEMMAP
-   depends on X86_64 #arch_add_memory() comprehends device memory
+   depends on ARCH_HAS_ZONE_DEVICE
 
help
  Device memory hotplug support allows for establishing pmem,
-- 
2.9.3



[PATCH] mm, x86: Add ARCH_HAS_ZONE_DEVICE to Kconfig

2017-04-12 Thread Oliver O'Halloran
Currently ZONE_DEVICE depends on X86_64 and this will get unwieldly as
new architectures (and platforms) get ZONE_DEVICE support. Move to an
arch selected Kconfig option to save us the trouble.

Cc: linux...@kvack.org
Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
---
 arch/x86/Kconfig | 1 +
 mm/Kconfig   | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c43f47622440..535b4d514792 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -59,6 +59,7 @@ config X86
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_UBSAN_SANITIZE_ALL
+   select ARCH_HAS_ZONE_DEVICE if X86_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/mm/Kconfig b/mm/Kconfig
index c89f472b658c..57c1cbd9a050 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -689,7 +689,7 @@ config ZONE_DEVICE
depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE
depends on SPARSEMEM_VMEMMAP
-   depends on X86_64 #arch_add_memory() comprehends device memory
+   depends on ARCH_HAS_ZONE_DEVICE
 
help
  Device memory hotplug support allows for establishing pmem,
-- 
2.9.3



[PATCH] mm, x86: Add ARCH_HAS_ZONE_DEVICE to Kconfig

2017-04-12 Thread Oliver O'Halloran
Currently ZONE_DEVICE depends on X86_64 and this will get unwieldly as
new architectures (and platforms) get ZONE_DEVICE support. Move to an
arch selected Kconfig option to save us the trouble.

Cc: linux...@kvack.org
Signed-off-by: Oliver O'Halloran 
---
 arch/x86/Kconfig | 1 +
 mm/Kconfig   | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c43f47622440..535b4d514792 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -59,6 +59,7 @@ config X86
select ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_UBSAN_SANITIZE_ALL
+   select ARCH_HAS_ZONE_DEVICE if X86_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/mm/Kconfig b/mm/Kconfig
index c89f472b658c..57c1cbd9a050 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -689,7 +689,7 @@ config ZONE_DEVICE
depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE
depends on SPARSEMEM_VMEMMAP
-   depends on X86_64 #arch_add_memory() comprehends device memory
+   depends on ARCH_HAS_ZONE_DEVICE
 
help
  Device memory hotplug support allows for establishing pmem,
-- 
2.9.3



Re: [PowerPC] 4.10.0 fails to build on BE config

2017-02-21 Thread Oliver O'Halloran
On Tue, Feb 21, 2017 at 6:25 PM, abdul  wrote:
> Hi,
>
> Today's mainline build, breaks on Power6 and Power7 (all BE config) with
> these build errors
>
> arch/powerpc/kernel/time.c: In function ‘running_clock’:
> arch/powerpc/kernel/time.c:712:2: error: implicit declaration of function
> ‘cputime_to_nsecs’ [-Werror=implicit-function-declaration]
> return local_clock() -
> cputime_to_nsecs(kcpustat_this_cpu->cpustat[CPUTIME_STEAL]);
> ^
> cc1: some warnings being treated as errors
> make[1]: *** [arch/powerpc/kernel/time.o] Error 1
>
>
> Regard's
> Abdul Haleem
> IBM Linux Technology Center.

Hi Abdul,

Are there any extra patches in your tree? I briefly tried to reproduce
this, but in my local tree this line:

> return local_clock() - 
> cputime_to_nsecs(kcpustat_this_cpu->cpustat[CPUTIME_STEAL]);

Is at time.c:692 rather than time.c:712

Oliver


Re: [PowerPC] 4.10.0 fails to build on BE config

2017-02-21 Thread Oliver O'Halloran
On Tue, Feb 21, 2017 at 6:25 PM, abdul  wrote:
> Hi,
>
> Today's mainline build, breaks on Power6 and Power7 (all BE config) with
> these build errors
>
> arch/powerpc/kernel/time.c: In function ‘running_clock’:
> arch/powerpc/kernel/time.c:712:2: error: implicit declaration of function
> ‘cputime_to_nsecs’ [-Werror=implicit-function-declaration]
> return local_clock() -
> cputime_to_nsecs(kcpustat_this_cpu->cpustat[CPUTIME_STEAL]);
> ^
> cc1: some warnings being treated as errors
> make[1]: *** [arch/powerpc/kernel/time.o] Error 1
>
>
> Regard's
> Abdul Haleem
> IBM Linux Technology Center.

Hi Abdul,

Are there any extra patches in your tree? I briefly tried to reproduce
this, but in my local tree this line:

> return local_clock() - 
> cputime_to_nsecs(kcpustat_this_cpu->cpustat[CPUTIME_STEAL]);

Is at time.c:692 rather than time.c:712

Oliver


Re: [PATCH v5 2/5] powernv:stop: Uniformly rename power9 to arch300

2017-01-12 Thread Oliver O'Halloran
On Fri, Jan 13, 2017 at 2:44 PM, Gautham R Shenoy
 wrote:
> On Thu, Jan 12, 2017 at 03:17:33PM +0530, Balbir Singh wrote:
>> On Tue, Jan 10, 2017 at 02:37:01PM +0530, Gautham R. Shenoy wrote:
>> > From: "Gautham R. Shenoy" 
>> >
>> > Balbir pointed out that in idle_book3s.S and powernv/idle.c some
>> > functions and variables had power9 in their names while some others
>> > had arch300.
>> >
>>
>> I would prefer power9 to arch300
>>
>
>
> I don't have a strong preference for arch300 vs power9, will change it
> to power9 if that looks better.

Personally I think we should be as descriptive as possible and use
power_9_arch_300_the_bikeshed_is_red_dammit.

Oliver


Re: [PATCH v5 2/5] powernv:stop: Uniformly rename power9 to arch300

2017-01-12 Thread Oliver O'Halloran
On Fri, Jan 13, 2017 at 2:44 PM, Gautham R Shenoy
 wrote:
> On Thu, Jan 12, 2017 at 03:17:33PM +0530, Balbir Singh wrote:
>> On Tue, Jan 10, 2017 at 02:37:01PM +0530, Gautham R. Shenoy wrote:
>> > From: "Gautham R. Shenoy" 
>> >
>> > Balbir pointed out that in idle_book3s.S and powernv/idle.c some
>> > functions and variables had power9 in their names while some others
>> > had arch300.
>> >
>>
>> I would prefer power9 to arch300
>>
>
>
> I don't have a strong preference for arch300 vs power9, will change it
> to power9 if that looks better.

Personally I think we should be as descriptive as possible and use
power_9_arch_300_the_bikeshed_is_red_dammit.

Oliver


Re: [PATCH v2 2/3] cpuidle:powernv: Add helper function to populate powernv idle states.

2016-11-01 Thread Oliver O'Halloran
exit_latency, 0);
> } else if ((flags[i] & OPAL_PM_STOP_INST_FAST) &&
> !(flags[i] & OPAL_PM_TIMEBASE_STOP)) {
> -   strncpy(powernv_states[nr_idle_states].name,
> -   names[i], CPUIDLE_NAME_LEN);
> -   strncpy(powernv_states[nr_idle_states].desc,
> -   names[i], CPUIDLE_NAME_LEN);
> -   powernv_states[nr_idle_states].flags = 0;
> -
> -   powernv_states[nr_idle_states].enter = stop_loop;
> -   stop_psscr_table[nr_idle_states] = psscr_val[i];
> +   add_powernv_state(nr_idle_states, names[i],
> + CPUIDLE_FLAG_NONE, stop_loop,
> + target_residency, exit_latency,
> + psscr_val[i]);
> }
>
> /*
> @@ -274,32 +300,20 @@ static int powernv_add_idle_states(void)
>  #ifdef CONFIG_TICK_ONESHOT
> if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
> flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
> +   target_residency = 30;

Same comment as above.

> /* Add FASTSLEEP state */
> -   strcpy(powernv_states[nr_idle_states].name, 
> "FastSleep");
> -   strcpy(powernv_states[nr_idle_states].desc, 
> "FastSleep");
> -   powernv_states[nr_idle_states].flags = 
> CPUIDLE_FLAG_TIMER_STOP;
> -   powernv_states[nr_idle_states].target_residency = 
> 30;
> -   powernv_states[nr_idle_states].enter = fastsleep_loop;
> +   add_powernv_state(nr_idle_states, "FastSleep",
> + CPUIDLE_FLAG_TIMER_STOP,
> + fastsleep_loop,
> + target_residency, exit_latency, 0);
> } else if ((flags[i] & OPAL_PM_STOP_INST_DEEP) &&
> (flags[i] & OPAL_PM_TIMEBASE_STOP)) {
> -   strncpy(powernv_states[nr_idle_states].name,
> -   names[i], CPUIDLE_NAME_LEN);
> -   strncpy(powernv_states[nr_idle_states].desc,
> -   names[i], CPUIDLE_NAME_LEN);
> -
> -   powernv_states[nr_idle_states].flags = 
> CPUIDLE_FLAG_TIMER_STOP;
> -   powernv_states[nr_idle_states].enter = stop_loop;
> -   stop_psscr_table[nr_idle_states] = psscr_val[i];
> +   add_powernv_state(nr_idle_states, names[i],
> + CPUIDLE_FLAG_TIMER_STOP, stop_loop,
> + target_residency, exit_latency,
> + psscr_val[i]);
> }
>  #endif
> -   powernv_states[nr_idle_states].exit_latency =
> -   ((unsigned int)latency_ns[i]) / 1000;
> -
> -   if (!rc) {
> -   powernv_states[nr_idle_states].target_residency =
> -   ((unsigned int)residency_ns[i]) / 1000;
> -   }
> -
> nr_idle_states++;
> }
>  out:
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index bb31373..c4e10f8 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -62,6 +62,7 @@ struct cpuidle_state {
>  };
>
>  /* Idle State Flags */
> +#define CPUIDLE_FLAG_NONE   (0x00)
>  #define CPUIDLE_FLAG_COUPLED   (0x02) /* state applies to multiple cpus */
>  #define CPUIDLE_FLAG_TIMER_STOP (0x04)  /* timer is stopped on this state */
>
> --
> 1.9.4
>

Looks good otherwise.

Reviewed-by: Oliver O'Halloran <ooh...@gmail.com>


Re: [PATCH v2 2/3] cpuidle:powernv: Add helper function to populate powernv idle states.

2016-11-01 Thread Oliver O'Halloran
NST_FAST) &&
> !(flags[i] & OPAL_PM_TIMEBASE_STOP)) {
> -   strncpy(powernv_states[nr_idle_states].name,
> -   names[i], CPUIDLE_NAME_LEN);
> -   strncpy(powernv_states[nr_idle_states].desc,
> -   names[i], CPUIDLE_NAME_LEN);
> -   powernv_states[nr_idle_states].flags = 0;
> -
> -   powernv_states[nr_idle_states].enter = stop_loop;
> -   stop_psscr_table[nr_idle_states] = psscr_val[i];
> +   add_powernv_state(nr_idle_states, names[i],
> + CPUIDLE_FLAG_NONE, stop_loop,
> + target_residency, exit_latency,
> + psscr_val[i]);
> }
>
> /*
> @@ -274,32 +300,20 @@ static int powernv_add_idle_states(void)
>  #ifdef CONFIG_TICK_ONESHOT
> if (flags[i] & OPAL_PM_SLEEP_ENABLED ||
> flags[i] & OPAL_PM_SLEEP_ENABLED_ER1) {
> +   target_residency = 30;

Same comment as above.

> /* Add FASTSLEEP state */
> -   strcpy(powernv_states[nr_idle_states].name, 
> "FastSleep");
> -   strcpy(powernv_states[nr_idle_states].desc, 
> "FastSleep");
> -   powernv_states[nr_idle_states].flags = 
> CPUIDLE_FLAG_TIMER_STOP;
> -   powernv_states[nr_idle_states].target_residency = 
> 30;
> -   powernv_states[nr_idle_states].enter = fastsleep_loop;
> +   add_powernv_state(nr_idle_states, "FastSleep",
> + CPUIDLE_FLAG_TIMER_STOP,
> + fastsleep_loop,
> + target_residency, exit_latency, 0);
> } else if ((flags[i] & OPAL_PM_STOP_INST_DEEP) &&
> (flags[i] & OPAL_PM_TIMEBASE_STOP)) {
> -   strncpy(powernv_states[nr_idle_states].name,
> -   names[i], CPUIDLE_NAME_LEN);
> -   strncpy(powernv_states[nr_idle_states].desc,
> -   names[i], CPUIDLE_NAME_LEN);
> -
> -   powernv_states[nr_idle_states].flags = 
> CPUIDLE_FLAG_TIMER_STOP;
> -   powernv_states[nr_idle_states].enter = stop_loop;
> -   stop_psscr_table[nr_idle_states] = psscr_val[i];
> +   add_powernv_state(nr_idle_states, names[i],
> + CPUIDLE_FLAG_TIMER_STOP, stop_loop,
> + target_residency, exit_latency,
> + psscr_val[i]);
> }
>  #endif
> -   powernv_states[nr_idle_states].exit_latency =
> -   ((unsigned int)latency_ns[i]) / 1000;
> -
> -   if (!rc) {
> -   powernv_states[nr_idle_states].target_residency =
> -   ((unsigned int)residency_ns[i]) / 1000;
> -   }
> -
> nr_idle_states++;
> }
>  out:
> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
> index bb31373..c4e10f8 100644
> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -62,6 +62,7 @@ struct cpuidle_state {
>  };
>
>  /* Idle State Flags */
> +#define CPUIDLE_FLAG_NONE   (0x00)
>  #define CPUIDLE_FLAG_COUPLED   (0x02) /* state applies to multiple cpus */
>  #define CPUIDLE_FLAG_TIMER_STOP (0x04)  /* timer is stopped on this state */
>
> --
> 1.9.4
>

Looks good otherwise.

Reviewed-by: Oliver O'Halloran 


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-12 Thread Oliver O'Halloran
On Mon, Sep 12, 2016 at 3:27 PM, Christoph Hellwig  wrote:
> On Thu, Sep 08, 2016 at 04:56:36PM -0600, Ross Zwisler wrote:
>> I think this goes back to our previous discussion about support for the PMEM
>> programming model.  Really I think what NVML needs isn't a way to tell if it
>> is getting a DAX mapping, but whether it is getting a DAX mapping on a
>> filesystem that fully supports the PMEM programming model.  This of course is
>> defined to be a filesystem where it can do all of its flushes from userspace
>> safely and never call fsync/msync, and that allocations that happen in page
>> faults will be synchronized to media before the page fault completes.
>
> That's a an easy way to flag:  you will never get that from a Linux
> filesystem, period.
>
> NVML folks really need to stop taking crack and dreaming this could
> happen.

Well, that's a bummer.

What are the problems here? Is this a matter of existing filesystems
being unable/unwilling to support this or is it just fundamentally
broken? The end goal is to let applications manage the persistence of
their own data without having to involve the kernel in every IOP, but
if we can't do that then what would a 90% solution look like? I think
most people would be OK with having to do an fsync() occasionally, but
not after ever write to pmem.


Re: DAX mapping detection (was: Re: [PATCH] Fix region lost in /proc/self/smaps)

2016-09-12 Thread Oliver O'Halloran
On Mon, Sep 12, 2016 at 3:27 PM, Christoph Hellwig  wrote:
> On Thu, Sep 08, 2016 at 04:56:36PM -0600, Ross Zwisler wrote:
>> I think this goes back to our previous discussion about support for the PMEM
>> programming model.  Really I think what NVML needs isn't a way to tell if it
>> is getting a DAX mapping, but whether it is getting a DAX mapping on a
>> filesystem that fully supports the PMEM programming model.  This of course is
>> defined to be a filesystem where it can do all of its flushes from userspace
>> safely and never call fsync/msync, and that allocations that happen in page
>> faults will be synchronized to media before the page fault completes.
>
> That's a an easy way to flag:  you will never get that from a Linux
> filesystem, period.
>
> NVML folks really need to stop taking crack and dreaming this could
> happen.

Well, that's a bummer.

What are the problems here? Is this a matter of existing filesystems
being unable/unwilling to support this or is it just fundamentally
broken? The end goal is to let applications manage the persistence of
their own data without having to involve the kernel in every IOP, but
if we can't do that then what would a 90% solution look like? I think
most people would be OK with having to do an fsync() occasionally, but
not after ever write to pmem.


Re: [RFC PATCH 1/2] mm, mincore2(): retrieve dax and tlb-size attributes of an address range

2016-09-12 Thread Oliver O'Halloran
On Mon, Sep 12, 2016 at 3:31 AM, Dan Williams  wrote:
> As evidenced by this bug report [1], userspace libraries are interested
> in whether a mapping is DAX mapped, i.e. no intervening page cache.
> Rather than using the ambiguous VM_MIXEDMAP flag in smaps, provide an
> explicit "is dax" indication as a new flag in the page vector populated
> by mincore.
>
> There are also cases, particularly for testing and validating a
> configuration to know the hardware mapping geometry of the pages in a
> given process address range.  Consider filesystem-dax where a
> configuration needs to take care to align partitions and block
> allocations before huge page mappings might be used, or
> anonymous-transparent-huge-pages where a process is opportunistically
> assigned large pages.  mincore2() allows these configurations to be
> surveyed and validated.
>
> The implementation takes advantage of the unused bits in the per-page
> byte returned for each PAGE_SIZE extent of a given address range.  The
> new format of each vector byte is:
>
> (TLB_SHIFT - PAGE_SHIFT) << 2 | vma_is_dax() << 1 | page_present

What is userspace expected to do with the information in vec? Whether
PMD or THP mappings can be used is going to depend more on the block
allocations done by the filesystem rather than anything the an
application can directly influence. Returning a vector for each page
makes some sense in the mincore() case since the application can touch
each page to fault them in, but I don't see what they can do here.

Why not just get rid of vec entirely and make mincore2() a yes/no
check over the range for whatever is supplied in flags? That would
work for NVML's use case and it should be easier to extend if needed.

Oliver


Re: [RFC PATCH 1/2] mm, mincore2(): retrieve dax and tlb-size attributes of an address range

2016-09-12 Thread Oliver O'Halloran
On Mon, Sep 12, 2016 at 3:31 AM, Dan Williams  wrote:
> As evidenced by this bug report [1], userspace libraries are interested
> in whether a mapping is DAX mapped, i.e. no intervening page cache.
> Rather than using the ambiguous VM_MIXEDMAP flag in smaps, provide an
> explicit "is dax" indication as a new flag in the page vector populated
> by mincore.
>
> There are also cases, particularly for testing and validating a
> configuration to know the hardware mapping geometry of the pages in a
> given process address range.  Consider filesystem-dax where a
> configuration needs to take care to align partitions and block
> allocations before huge page mappings might be used, or
> anonymous-transparent-huge-pages where a process is opportunistically
> assigned large pages.  mincore2() allows these configurations to be
> surveyed and validated.
>
> The implementation takes advantage of the unused bits in the per-page
> byte returned for each PAGE_SIZE extent of a given address range.  The
> new format of each vector byte is:
>
> (TLB_SHIFT - PAGE_SHIFT) << 2 | vma_is_dax() << 1 | page_present

What is userspace expected to do with the information in vec? Whether
PMD or THP mappings can be used is going to depend more on the block
allocations done by the filesystem rather than anything the an
application can directly influence. Returning a vector for each page
makes some sense in the mincore() case since the application can touch
each page to fault them in, but I don't see what they can do here.

Why not just get rid of vec entirely and make mincore2() a yes/no
check over the range for whatever is supplied in flags? That would
work for NVML's use case and it should be easier to extend if needed.

Oliver


Re: [PATCH v5 04/13] powerpc: Factor out relocation code from module_64.c to elf_util_64.c.

2016-08-23 Thread Oliver O'Halloran
On Tue, Aug 23, 2016 at 1:21 PM, Balbir Singh  wrote:
>
>> zImage on ppc64 BE is an ELF32 file. This patch set only supports loading
>> ELF files of the same class as the kernel, so a 64 bit kernel can't load an
>> ELF32 file. It would be possible to add such support, but it would be a new
>> feature.
>>
>> The distros I was able to check on ppc64 LE and BE all use vmlinux.
>> kexec-tools with kexec_load also doesn't support zImage. Do you think it is
>> important to support zImage?
>
> Well if it didn't work already, I think its low priority. Michael should be
> able to confirm this. Oliver's been trying to cleanup the zImage to get rid
> the old zImage limitation, cc'ing him

I don't think it's ever worked so I wouldn't worry too much about
supporting it. Fixing kexec-into-zImage and fixing the 32bit wrapper
on 64bit BE kernel problem has been on my TODO list for a while, but
it's not a priority.

oliver


Re: [PATCH v5 04/13] powerpc: Factor out relocation code from module_64.c to elf_util_64.c.

2016-08-23 Thread Oliver O'Halloran
On Tue, Aug 23, 2016 at 1:21 PM, Balbir Singh  wrote:
>
>> zImage on ppc64 BE is an ELF32 file. This patch set only supports loading
>> ELF files of the same class as the kernel, so a 64 bit kernel can't load an
>> ELF32 file. It would be possible to add such support, but it would be a new
>> feature.
>>
>> The distros I was able to check on ppc64 LE and BE all use vmlinux.
>> kexec-tools with kexec_load also doesn't support zImage. Do you think it is
>> important to support zImage?
>
> Well if it didn't work already, I think its low priority. Michael should be
> able to confirm this. Oliver's been trying to cleanup the zImage to get rid
> the old zImage limitation, cc'ing him

I don't think it's ever worked so I wouldn't worry too much about
supporting it. Fixing kexec-into-zImage and fixing the 32bit wrapper
on 64bit BE kernel problem has been on my TODO list for a while, but
it's not a priority.

oliver