Re: [PATCH 07/12] dma-mapping: move CONFIG_DMA_CMA to kernel/dma/Kconfig

2019-02-11 Thread Greg Kroah-Hartman
On Mon, Feb 11, 2019 at 02:35:49PM +0100, Christoph Hellwig wrote:
> This is where all the related code already lives.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/base/Kconfig | 77 -------------------------------------------------
>  kernel/dma/Kconfig   | 77 +++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 77 insertions(+), 77 deletions(-)

Much nicer, thanks!

Reviewed-by: Greg Kroah-Hartman 


Re: [PATCH 09/12] dma-mapping: remove the DMA_MEMORY_EXCLUSIVE flag

2019-02-11 Thread Greg Kroah-Hartman
On Mon, Feb 11, 2019 at 02:35:51PM +0100, Christoph Hellwig wrote:
> All users of dma_declare_coherent want their allocations to be
> exclusive, so default to exclusive allocations.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  Documentation/DMA-API.txt |  9 +--
>  arch/arm/mach-imx/mach-imx27_visstrim_m10.c   | 12 +++--
>  arch/arm/mach-imx/mach-mx31moboard.c  |  3 +--
>  arch/sh/boards/mach-ap325rxa/setup.c  |  5 ++--
>  arch/sh/boards/mach-ecovec24/setup.c  |  6 ++---
>  arch/sh/boards/mach-kfr2r09/setup.c   |  5 ++--
>  arch/sh/boards/mach-migor/setup.c |  5 ++--
>  arch/sh/boards/mach-se/7724/setup.c   |  6 ++---
>  arch/sh/drivers/pci/fixups-dreamcast.c|  3 +--
>  .../soc_camera/sh_mobile_ceu_camera.c |  3 +--
>  drivers/usb/host/ohci-sm501.c |  3 +--
>  drivers/usb/host/ohci-tmio.c  |  2 +-
>  include/linux/dma-mapping.h   |  7 ++
>  kernel/dma/coherent.c | 25 ++-
>  14 files changed, 29 insertions(+), 65 deletions(-)

Reviewed-by: Greg Kroah-Hartman 
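
For readers skimming the archive, the change amounts to dropping the flag
argument at each call site.  A minimal before/after sketch (the call-site
values are illustrative, not taken from the diff):

	/* before: callers had to ask for exclusive behaviour explicitly */
	dma_declare_coherent_memory(dev, res->start, res->start,
				    resource_size(res), DMA_MEMORY_EXCLUSIVE);

	/* after: exclusive is the only behaviour, so the flag goes away */
	dma_declare_coherent_memory(dev, res->start, res->start,
				    resource_size(res));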


Re: [PATCH 06/12] dma-mapping: improve selection of dma_declare_coherent availability

2019-02-11 Thread Greg Kroah-Hartman
On Mon, Feb 11, 2019 at 02:35:48PM +0100, Christoph Hellwig wrote:
> This API is primarily used through DT entries, but two architectures
> and two drivers call it directly.  So instead of selecting the config
> symbol for random architectures pull it in implicitly for the actual
> users.  Also rename the Kconfig option to describe the feature better.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Greg Kroah-Hartman 
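
A rough Kconfig sketch of what "pull it in implicitly for the actual
users" means (DMA_DECLARE_COHERENT is the renamed option as it ended up
upstream; the selecting symbol is a made-up example):

	config DMA_DECLARE_COHERENT
		bool

	# an architecture or driver that calls dma_declare_coherent_memory()
	# selects the symbol itself instead of it being set per architecture:
	config ARCH_EXAMPLE
		select DMA_DECLARE_COHERENT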


Re: [PATCH 02/12] device.h: dma_mem is only needed for HAVE_GENERIC_DMA_COHERENT

2019-02-11 Thread Greg Kroah-Hartman
On Mon, Feb 11, 2019 at 02:35:44PM +0100, Christoph Hellwig wrote:
> No need to carry an unused field around.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/device.h | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Greg Kroah-Hartman 
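
Given the two-line diffstat, the change presumably just guards the field;
a sketch of the resulting structure (not the verbatim diff):

	struct device {
		...
	#ifdef CONFIG_HAVE_GENERIC_DMA_COHERENT
		struct dma_coherent_mem	*dma_mem; /* internal for coherent
						     mem override */
	#endif
		...
	};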


Re: [PATCH kernel] powerpc/powernv/ioda: Store correct amount of memory used for table

2019-02-11 Thread Alexey Kardashevskiy



On 12/02/2019 11:20, David Gibson wrote:
> On Mon, Feb 11, 2019 at 06:48:01PM +1100, Alexey Kardashevskiy wrote:
>> We store two multilevel tables in iommu_table - one for the hardware and
>> one with the corresponding userspace addresses. Before allocating
>> the tables, the iommu_table_group_ops::get_table_size() hook returns
>> the combined size of the two, and the VFIO SPAPR TCE IOMMU driver adjusts
>> the locked_vm counter correctly. When the table is actually allocated,
>> the amount of allocated memory is stored in iommu_table::it_allocated_size
>> and used to adjust the locked_vm counter when we release the memory used
>> by the table; .get_table_size() and .create_table() calculate it
>> independently, but the result is expected to be the same.
> 
> Any way we can remove that redundant calculation?  That seems like
> begging for bugs.


I do not see an easy way. One way could be adding a "dryrun" flag to
pnv_pci_ioda2_table_alloc_pages(), counting allocated memory there and
calling it from .get_table_size(), but for multilevel TCEs it only
allocates the first level...


>> Unfortunately the allocator does not add the userspace table size to
>> ::it_allocated_size, so when we destroy the table because of VFIO PCI
>> unplug (i.e. the VFIO container is gone but the userspace keeps running),
>> we decrement locked_vm by just half of the size of the memory we are
>> releasing. As a result, we leak locked_vm and may not be able to allocate
>> more IOMMU tables after a few iterations of hotplug/unplug.
>>
>> This adjusts it_allocated_size if the userspace addresses table was
>> requested (total_allocated_uas is initialized to zero).
>>
>> Fixes: 090bad39b "powerpc/powernv: Add indirect levels to it_userspace"
>> Signed-off-by: Alexey Kardashevskiy 
> 
> Reviewed-by: David Gibson 
> 
>> ---
>>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
>> index 697449a..58146e1 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
>> @@ -313,7 +313,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
>>  			page_shift);
>>  tbl->it_level_size = 1ULL << (level_shift - 3);
>>  tbl->it_indirect_levels = levels - 1;
>> -tbl->it_allocated_size = total_allocated;
>> +tbl->it_allocated_size = total_allocated + total_allocated_uas;
>>  tbl->it_userspace = uas;
>>  tbl->it_nid = nid;
>>  
> 

-- 
Alexey
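
To make the leak concrete, with purely illustrative sizes (not taken from
the patch):

	get_table_size()      ->  hw + uas  =  16 MB + 16 MB  =  32 MB
	locked_vm on create   +=  32 MB
	it_allocated_size      =  hw only   =  16 MB            (the bug)
	locked_vm on destroy  -=  16 MB
	=>  16 MB of locked_vm leaked per hotplug/unplug cycle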


Re: [GIT PULL] of: overlay: validation checks, subsequent fixes for v20 -- correction: v4.20

2019-02-11 Thread Greg Kroah-Hartman
On Mon, Feb 11, 2019 at 02:43:48PM -0600, Alan Tull wrote:
> On Mon, Feb 11, 2019 at 1:13 PM Greg Kroah-Hartman  wrote:
> >
> > On Mon, Feb 11, 2019 at 12:41:40PM -0600, Alan Tull wrote:
> > > On Fri, Nov 9, 2018 at 12:58 AM Frank Rowand  wrote:
> > >
> > > What LTSI's are these patches likely to end up in?  Just to be clear,
> > > I'm not pushing for any specific answer, I just want to know what to
> > > expect.
> >
> > I have no idea what you are asking here.
> >
> > What patches?
> 
> I probably should have asked my question *below* the pertinent context
> of the 17 patches listed in the pull request, which was:
> 
> >   of: overlay: add tests to validate kfrees from overlay removal
> >   of: overlay: add missing of_node_put() after add new node to changeset
> >   of: overlay: add missing of_node_get() in __of_attach_node_sysfs
> >   powerpc/pseries: add of_node_put() in dlpar_detach_node()
> >   of: overlay: use prop add changeset entry for property in new nodes
> >   of: overlay: do not duplicate properties from overlay for new nodes
> >   of: overlay: reorder fields in struct fragment
> >   of: overlay: validate overlay properties #address-cells and #size-cells
> >   of: overlay: make all pr_debug() and pr_err() messages unique
> >   of: overlay: test case of two fragments adding same node
> >   of: overlay: check prevents multiple fragments add or delete same node
> >   of: overlay: check prevents multiple fragments touching same property
> >   of: unittest: remove unused of_unittest_apply_overlay() argument
> >   of: overlay: set node fields from properties when add new overlay node
> >   of: unittest: allow base devicetree to have symbol metadata
> >   of: unittest: find overlays[] entry by name instead of index
> >   of: unittest: initialize args before calling of_*parse_*()
> 
> > What is "LTSI's"?
> 
> I have recently seen some devicetree patches being picked up for
> the 4.20 stable-queue.  That seemed to suggest that some, but not all,
> of these will end up in the next LTS release.

If the git commit has the "cc: stable@" marking in it, yes, it will be
picked up.  Without the actual git ids, it's hard to know what did, and
what did not, get backported :)

> Also I was wondering if any of this is likely to get backported to
> LTSI-4.14.

Note, "LTSI" and "LTS" are two different things.  "LTSI" is a project
run by some LF member companies, and "LTS" are the normal long term
kernels that I release on kernel.org.  They have vastly different
requirements for inclusion in them.

If you have questions about LTSI, I recommend go asking on their mailing
list.

As for showing up in the 4.14 "LTS" kernel, again, I need git commit ids
to know for sure.

Also, as these are now in Linus's tree, you should be able to look at
the stable releases yourself to see if they are present there, right?

thanks,

greg k-h


Re: [PATCH 2/5] vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned pages

2019-02-11 Thread Alexey Kardashevskiy



On 12/02/2019 09:44, Daniel Jordan wrote:
> Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
> pages"), locked and pinned pages are accounted separately.  The SPAPR
> TCE VFIO IOMMU driver accounts pinned pages to locked_vm; use pinned_vm
> instead.
> 
> pinned_vm recently became atomic and so no longer relies on mmap_sem
> held as writer: delete.
> 
> Signed-off-by: Daniel Jordan 
> ---
>  Documentation/vfio.txt  |  6 +--
>  drivers/vfio/vfio_iommu_spapr_tce.c | 64 ++---
>  2 files changed, 33 insertions(+), 37 deletions(-)
> 
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> index f1a4d3c3ba0b..fa37d65363f9 100644
> --- a/Documentation/vfio.txt
> +++ b/Documentation/vfio.txt
> @@ -308,7 +308,7 @@ This implementation has some specifics:
> currently there is no way to reduce the number of calls. In order to make
> things faster, the map/unmap handling has been implemented in real mode
> which provides an excellent performance which has limitations such as
> -   inability to do locked pages accounting in real time.
> +   inability to do pinned pages accounting in real time.
>  
>  4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
> subtree that can be treated as a unit for the purposes of partitioning and
> @@ -324,7 +324,7 @@ This implementation has some specifics:
>   returns the size and the start of the DMA window on the PCI bus.
>  
>   VFIO_IOMMU_ENABLE
> - enables the container. The locked pages accounting
> + enables the container. The pinned pages accounting
>   is done at this point. This lets user first to know what
>   the DMA window is and adjust rlimit before doing any real job.
>  
> @@ -454,7 +454,7 @@ This implementation has some specifics:
>  
> PPC64 paravirtualized guests generate a lot of map/unmap requests,
> and the handling of those includes pinning/unpinning pages and updating
> -   mm::locked_vm counter to make sure we do not exceed the rlimit.
> +   mm::pinned_vm counter to make sure we do not exceed the rlimit.
> The v2 IOMMU splits accounting and pinning into separate operations:
>  
> - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY 
> ioctls
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index c424913324e3..f47e020dc5e4 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -34,9 +34,11 @@
>  static void tce_iommu_detach_group(void *iommu_data,
>   struct iommu_group *iommu_group);
>  
> -static long try_increment_locked_vm(struct mm_struct *mm, long npages)
> +static long try_increment_pinned_vm(struct mm_struct *mm, long npages)
>  {
> - long ret = 0, locked, lock_limit;
> + long ret = 0;
> + s64 pinned;
> + unsigned long lock_limit;
>  
>   if (WARN_ON_ONCE(!mm))
>   return -EPERM;
> @@ -44,39 +46,33 @@ static long try_increment_locked_vm(struct mm_struct *mm, long npages)
>   if (!npages)
>   return 0;
>  
> - down_write(&mm->mmap_sem);
> - locked = mm->locked_vm + npages;
> + pinned = atomic64_add_return(npages, &mm->pinned_vm);
>   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> - if (locked > lock_limit && !capable(CAP_IPC_LOCK))
> + if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
>   ret = -ENOMEM;
> - else
> - mm->locked_vm += npages;
> + atomic64_sub(npages, &mm->pinned_vm);
> + }
>  
> - pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
> + pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%lu%s\n", current->pid,
>   npages << PAGE_SHIFT,
> - mm->locked_vm << PAGE_SHIFT,
> - rlimit(RLIMIT_MEMLOCK),
> - ret ? " - exceeded" : "");
> -
> - up_write(&mm->mmap_sem);
> + atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
> + rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
>  
>   return ret;
>  }
>  
> -static void decrement_locked_vm(struct mm_struct *mm, long npages)
> +static void decrement_pinned_vm(struct mm_struct *mm, long npages)
>  {
>   if (!mm || !npages)
>   return;
>  
> - down_write(&mm->mmap_sem);
> - if (WARN_ON_ONCE(npages > mm->locked_vm))
> - npages = mm->locked_vm;
> - mm->locked_vm -= npages;
> - pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
> + if (WARN_ON_ONCE(npages > atomic64_read(&mm->pinned_vm)))
> + npages = atomic64_read(&mm->pinned_vm);
> + atomic64_sub(npages, &mm->pinned_vm);
> + pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%lu\n", current->pid,
>   npages << PAGE_SHIFT,
> - mm->locked_vm << PAGE_SHIFT,
> + atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
>  

[PATCH kernel] KVM: PPC: Release all hardware TCE tables attached to a group

2019-02-11 Thread Alexey Kardashevskiy
The SPAPR TCE KVM device references all hardware IOMMU tables assigned to
some IOMMU group to ensure that in-kernel KVM acceleration of H_PUT_TCE
can work. The tables are referenced when an IOMMU group gets registered
with the VFIO KVM device by the KVM_DEV_VFIO_GROUP_ADD ioctl;
KVM_DEV_VFIO_GROUP_DEL calls into the dereferencing code
in kvm_spapr_tce_release_iommu_group(), which walks through the list of
LIOBNs, finds a matching IOMMU table and calls kref_put() when found.

However, that code stops after the very first successful dereference,
leaving the other tables referenced until the SPAPR TCE KVM device is
destroyed, which normally happens on guest reboot or termination; so if we
do hotplug and unplug in a loop, we are leaking IOMMU tables here.

This removes the premature return to let kvm_spapr_tce_release_iommu_group()
find and dereference all attached tables.

Fixes: 121f80ba68f "KVM: PPC: VFIO: Add in-kernel acceleration for VFIO"
Signed-off-by: Alexey Kardashevskiy 
---

I kinda hoped to blame RCU for misbehaviour but it was me all over again :)

---
 arch/powerpc/kvm/book3s_64_vio.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 532ab797..6630dde 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -133,7 +133,6 @@ extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
continue;
 
	kref_put(&stit->kref, kvm_spapr_tce_liobn_put);
-   return;
}
}
}
-- 
2.17.1
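
For orientation, the shape of the release path after this fix,
reconstructed from the diff context above (the surrounding loops and
variable names are an approximation, not a verbatim copy of the tree):

	list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
		list_for_each_entry_safe(stit, tmp, &stt->iommu_tables, next) {
			for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) {
				if (table_group->tables[i] != stit->tbl)
					continue;

				kref_put(&stit->kref, kvm_spapr_tce_liobn_put);
				/* no early return: keep walking so every
				 * attached table gets dereferenced */
			}
		}
	}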



Re: [PATCH 06/12] dma-mapping: improve selection of dma_declare_coherent availability

2019-02-11 Thread Paul Burton
Hi Christoph,

On Mon, Feb 11, 2019 at 02:35:48PM +0100, Christoph Hellwig wrote:
> This API is primarily used through DT entries, but two architectures
> and two drivers call it directly.  So instead of selecting the config
> symbol for random architectures pull it in implicitly for the actual
> users.  Also rename the Kconfig option to describe the feature better.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Paul Burton  # MIPS

Thanks,
Paul


Re: [PATCH kernel] powerpc/powernv/ioda: Store correct amount of memory used for table

2019-02-11 Thread David Gibson
On Mon, Feb 11, 2019 at 06:48:01PM +1100, Alexey Kardashevskiy wrote:
> We store two multilevel tables in iommu_table - one for the hardware and
> one with the corresponding userspace addresses. Before allocating
> the tables, the iommu_table_group_ops::get_table_size() hook returns
> the combined size of the two, and the VFIO SPAPR TCE IOMMU driver adjusts
> the locked_vm counter correctly. When the table is actually allocated,
> the amount of allocated memory is stored in iommu_table::it_allocated_size
> and used to adjust the locked_vm counter when we release the memory used
> by the table; .get_table_size() and .create_table() calculate it
> independently, but the result is expected to be the same.

Any way we can remove that redundant calculation?  That seems like
begging for bugs.

> Unfortunately the allocator does not add the userspace table size to
> ::it_allocated_size, so when we destroy the table because of VFIO PCI
> unplug (i.e. the VFIO container is gone but the userspace keeps running),
> we decrement locked_vm by just half of the size of the memory we are
> releasing. As a result, we leak locked_vm and may not be able to allocate
> more IOMMU tables after a few iterations of hotplug/unplug.
> 
> This adjusts it_allocated_size if the userspace addresses table was
> requested (total_allocated_uas is initialized to zero).
> 
> Fixes: 090bad39b "powerpc/powernv: Add indirect levels to it_userspace"
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> index 697449a..58146e1 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> @@ -313,7 +313,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
>  			page_shift);
>   tbl->it_level_size = 1ULL << (level_shift - 3);
>   tbl->it_indirect_levels = levels - 1;
> - tbl->it_allocated_size = total_allocated;
> + tbl->it_allocated_size = total_allocated + total_allocated_uas;
>   tbl->it_userspace = uas;
>   tbl->it_nid = nid;
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel] vfio/spapr_tce: Skip unsetting already unset table

2019-02-11 Thread David Gibson
On Mon, Feb 11, 2019 at 06:49:17PM +1100, Alexey Kardashevskiy wrote:
> The VFIO TCE IOMMU v2 driver owns IOMMU tables, so when we detach an IOMMU
> group from a container, we need to unset those tables from the group; we do
> this by calling unset_window(), and we do so unconditionally. We also unset
> tables when removing a DMA window via the VFIO_IOMMU_SPAPR_TCE_REMOVE ioctl.
> 
> The window removal checks whether the table actually exists (hidden inside
> tce_iommu_find_table()), but the group detach path does not, so the user
> may see duplicate messages:
> pci 0009:03 : [PE# fd] Removing DMA window #0
> pci 0009:03 : [PE# fd] Removing DMA window #1
> pci 0009:03 : [PE# fd] Removing DMA window #0
> pci 0009:03 : [PE# fd] Removing DMA window #1
> 
> At the moment this is not a problem as the second invocation
> of unset_window() writes zeroes to the HW registers again and exits early
> as there is no table.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

> ---
> 
> When doing VFIO PCI hot unplug, first we remove the DMA window and
> set container->tables[num] - this produces the first couple of messages.
> Then we detach the group, and we see another couple of the same
> messages, which confused me.
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index c424913..8dbb270 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -1235,7 +1235,8 @@ static void tce_iommu_release_ownership_ddw(struct tce_container *container,
>   }
>  
>   for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i)
> - table_group->ops->unset_window(table_group, i);
> + if (container->tables[i])
> + table_group->ops->unset_window(table_group, i);
>  
>   table_group->ops->release_ownership(table_group);
>  }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH] powerpc/configs: Enable CONFIG_USB_XHCI_HCD by default

2019-02-11 Thread Joel Stanley
On Mon, 11 Feb 2019 at 22:07, Thomas Huth  wrote:
>
> Recent versions of QEMU provide a XHCI device by default these
> days instead of an old-fashioned OHCI device:
>
>  https://git.qemu.org/?p=qemu.git;a=commitdiff;h=57040d451315320b7d27

"recent" :D

> So to get the keyboard working in the graphical console there again,
> we should now include XHCI support in the kernel by default, too.
>
> Signed-off-by: Thomas Huth 

Acked-by: Joel Stanley 

> ---
>  arch/powerpc/configs/pseries_defconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig
> index ea79c51..62e12f6 100644
> --- a/arch/powerpc/configs/pseries_defconfig
> +++ b/arch/powerpc/configs/pseries_defconfig
> @@ -217,6 +217,7 @@ CONFIG_USB_MON=m
>  CONFIG_USB_EHCI_HCD=y
>  # CONFIG_USB_EHCI_HCD_PPC_OF is not set
>  CONFIG_USB_OHCI_HCD=y
> +CONFIG_USB_XHCI_HCD=y
>  CONFIG_USB_STORAGE=m
>  CONFIG_NEW_LEDS=y
>  CONFIG_LEDS_CLASS=m
> --
> 1.8.3.1
>


Re: [PATCH v4 3/3] powerpc/32: Add KASAN support

2019-02-11 Thread Daniel Axtens
Andrey Ryabinin  writes:

> On 2/11/19 3:25 PM, Andrey Konovalov wrote:
>> On Sat, Feb 9, 2019 at 12:55 PM christophe leroy
>>  wrote:
>>>
>>> Hi Andrey,
>>>
>>> Le 08/02/2019 à 18:40, Andrey Konovalov a écrit :
 On Fri, Feb 8, 2019 at 6:17 PM Christophe Leroy  
 wrote:
>
> Hi Daniel,
>
> Le 08/02/2019 à 17:18, Daniel Axtens a écrit :
>> Hi Christophe,
>>
>> I've been attempting to port this to 64-bit Book3e nohash (e6500),
>> although I think I've ended up with an approach more similar to Aneesh's
>> much earlier (2015) series for book3s.
>>
>> Part of this is just due to the changes between 32 and 64 bits - we need
>> to hack around the discontiguous mappings - but one thing that I'm
>> particularly puzzled by is what the kasan_early_init is supposed to do.
>
> It should not be a problem as my patch uses a 'for_each_memblock(memory,
> reg)' loop.
>
>>
>>> +void __init kasan_early_init(void)
>>> +{
>>> +	unsigned long addr = KASAN_SHADOW_START;
>>> +	unsigned long end = KASAN_SHADOW_END;
>>> +	unsigned long next;
>>> +	pmd_t *pmd = pmd_offset(pud_offset(pgd_offset_k(addr), addr), addr);
>>> +	int i;
>>> +	phys_addr_t pa = __pa(kasan_early_shadow_page);
>>> +
>>> +	BUILD_BUG_ON(KASAN_SHADOW_START & ~PGDIR_MASK);
>>> +
>>> +	if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE))
>>> +		panic("KASAN not supported with Hash MMU\n");
>>> +
>>> +	for (i = 0; i < PTRS_PER_PTE; i++)
>>> +		__set_pte_at(&init_mm, (unsigned long)kasan_early_shadow_page,
>>> +			     kasan_early_shadow_pte + i,
>>> +			     pfn_pte(PHYS_PFN(pa), PAGE_KERNEL_RO), 0);
>>> +
>>> +	do {
>>> +		next = pgd_addr_end(addr, end);
>>> +		pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte);
>>> +	} while (pmd++, addr = next, addr != end);
>>> +}
>>
>> As far as I can tell it's mapping the early shadow page, read-only, over
>> the KASAN_SHADOW_START->KASAN_SHADOW_END range, and it's using the early
>> shadow PTE array from the generic code.
>>
>> I haven't been able to find an answer to why this is in the docs, so I
>> was wondering if you or anyone else could explain the early part of
>> kasan init a bit better.
>
> See https://www.kernel.org/doc/html/latest/dev-tools/kasan.html for an
> explanation of the shadow.
>
> When shadow is 0, it means the memory area is entirely accessible.
>
> It is necessary to set up a shadow area as soon as possible because all
> data accesses check the shadow area, from the beginning (except for a few
> files where sanitizing has been disabled in Makefiles).
>
> Until the real shadow area is set, all accesses are granted thanks to the
> zero shadow area being full of zeros.

 Not entirely correct. kasan_early_init() indeed maps the whole shadow
 memory range to the same kasan_early_shadow_page. However as kernel
 loads and memory gets allocated this shadow page gets rewritten with
 non-zero values by different KASAN allocator hooks. Since these values
 come from completely different parts of the kernel, but all land on
 the same page, kasan_early_shadow_page's content can be considered
 garbage. When KASAN checks memory accesses for validity it detects
 these garbage shadow values, but doesn't print any reports, as the
 reporting routine bails out on the current->kasan_depth check (which
 has the value of 1 initially). Only after kasan_init() completes, when
 the proper shadow memory is mapped, current->kasan_depth gets set to 0
 and we start reporting bad accesses.
>>>
>>> That's surprising, because in the early phase I map the shadow area
>>> read-only, so I do not expect it to get modified unless RO protection is
>>> failing for some reason.
>> 
>> Actually it might be that the allocator hooks don't modify shadow at
>> this point, as the allocator is not yet initialized. However stack
>> should be getting poisoned and unpoisoned from the very start. But the
>> generic statement that early shadow gets dirtied should be correct.
>> Might it be that you don't use stack instrumentation?
>> 
>
> Yes, stack instrumentation is not used here, because the shadow offset
> which we pass to the -fasan-shadow-offset= cflag is not specified here.
> So the logic in scripts/Makefile.kasan just falls back to
> CFLAGS_KASAN_MINIMAL, which is outline and without stack instrumentation.
>
> Christophe, you can specify KASAN_SHADOW_OFFSET either in Kconfig (e.g.
> x86_64) or in the Makefile (e.g. arm64). And make the early mapping
> writable, because compiler-generated code will write to shadow memory
> in the function prologue/epilogue.

Hmm. Is this limitation just that compilers have not implemented
out-of-line support for 
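
For reference, the two routes Andrey mentions look roughly like this (the
offset value is a placeholder, not a vetted ppc32 choice):

	# Kconfig route, modelled on x86_64's CONFIG_KASAN_SHADOW_OFFSET:
	config KASAN_SHADOW_OFFSET
		hex
		default 0xe0000000

	# Makefile route, modelled on arm64; scripts/Makefile.kasan then
	# passes -fasan-shadow-offset=$(KASAN_SHADOW_OFFSET) and enables
	# inline/stack instrumentation:
	KASAN_SHADOW_OFFSET := 0xe0000000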

[PATCH] powerpc/powernv: Don't reprogram SLW image on every KVM guest entry/exit

2019-02-11 Thread Paul Mackerras
Commit 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via stop-api
only on Hotplug", 2017-07-21) added two calls to opal_slw_set_reg()
inside pnv_cpu_offline(), with the aim of changing the LPCR value in
the SLW image to disable wakeups from the decrementer while a CPU is
offline.  However, pnv_cpu_offline() gets called each time a secondary
CPU thread is woken up to participate in running a KVM guest, that is,
not just when a CPU is offlined.

Since opal_slw_set_reg() is a very slow operation (with observed
execution times around 20 milliseconds), this means that an offline
secondary CPU can often be busy doing the opal_slw_set_reg() call
when the primary CPU wants to grab all the secondary threads so that
it can run a KVM guest.  This leads to messages like "KVM: couldn't
grab CPU n" being printed and guest execution failing.

There is no need to reprogram the SLW image on every KVM guest entry
and exit.  So that we do it only when a CPU is really transitioning
between online and offline, this moves the calls to
pnv_program_cpu_hotplug_lpcr() into pnv_smp_cpu_kill_self().

Fixes: 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via stop-api only on Hotplug")
Cc: sta...@vger.kernel.org # v4.14+
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/powernv.h|  2 ++
 arch/powerpc/platforms/powernv/idle.c | 27 ++-
 arch/powerpc/platforms/powernv/smp.c  | 25 +
 3 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/powernv.h b/arch/powerpc/include/asm/powernv.h
index 2f3ff7a27881..d85fcfea32ca 100644
--- a/arch/powerpc/include/asm/powernv.h
+++ b/arch/powerpc/include/asm/powernv.h
@@ -23,6 +23,8 @@ extern int pnv_npu2_handle_fault(struct npu_context *context, 
uintptr_t *ea,
unsigned long *flags, unsigned long *status,
int count);
 
+void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val);
+
 void pnv_tm_init(void);
 #else
 static inline void powernv_set_nmmu_ptcr(unsigned long ptcr) { }
diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 35f699ebb662..e52f9b06dd9c 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -458,7 +458,8 @@ EXPORT_SYMBOL_GPL(pnv_power9_force_smt4_release);
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 
 #ifdef CONFIG_HOTPLUG_CPU
-static void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val)
+
+void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val)
 {
u64 pir = get_hard_smp_processor_id(cpu);
 
@@ -481,20 +482,6 @@ unsigned long pnv_cpu_offline(unsigned int cpu)
 {
unsigned long srr1;
u32 idle_states = pnv_get_supported_cpuidle_states();
-   u64 lpcr_val;
-
-   /*
-* We don't want to take decrementer interrupts while we are
-* offline, so clear LPCR:PECE1. We keep PECE2 (and
-* LPCR_PECE_HVEE on P9) enabled as to let IPIs in.
-*
-* If the CPU gets woken up by a special wakeup, ensure that
-* the SLW engine sets LPCR with decrementer bit cleared, else
-* the CPU will come back to the kernel due to a spurious
-* wakeup.
-*/
-   lpcr_val = mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1;
-   pnv_program_cpu_hotplug_lpcr(cpu, lpcr_val);
 
__ppc64_runlatch_off();
 
@@ -526,16 +513,6 @@ unsigned long pnv_cpu_offline(unsigned int cpu)
 
__ppc64_runlatch_on();
 
-   /*
-* Re-enable decrementer interrupts in LPCR.
-*
-* Further, we want stop states to be woken up by decrementer
-* for non-hotplug cases. So program the LPCR via stop api as
-* well.
-*/
-   lpcr_val = mfspr(SPRN_LPCR) | (u64)LPCR_PECE1;
-   pnv_program_cpu_hotplug_lpcr(cpu, lpcr_val);
-
return srr1;
 }
 #endif
diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
index 0d354e19ef92..db09c7022635 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include <asm/powernv.h>
 
 #include "powernv.h"
 
@@ -153,6 +154,7 @@ static void pnv_smp_cpu_kill_self(void)
 {
unsigned int cpu;
unsigned long srr1, wmask;
+   u64 lpcr_val;
 
/* Standard hot unplug procedure */
/*
@@ -174,6 +176,19 @@ static void pnv_smp_cpu_kill_self(void)
if (cpu_has_feature(CPU_FTR_ARCH_207S))
wmask = SRR1_WAKEMASK_P8;
 
+   /*
+* We don't want to take decrementer interrupts while we are
+* offline, so clear LPCR:PECE1. We keep PECE2 (and
+* LPCR_PECE_HVEE on P9) enabled so as to let IPIs in.
+*
+* If the CPU gets woken up by a special wakeup, ensure that
+* the SLW engine sets LPCR with decrementer bit cleared, else
+* the CPU will come back to the kernel due to a spurious wakeup.

Re: [QUESTION] powerpc, libseccomp, and spu

2019-02-11 Thread Michael Ellerman
Hi Tom,

Sorry this has caused you trouble, using "spu" there is a bit of a hack
and I want to remove it.

See: https://patchwork.ozlabs.org/patch/1025830/

Unfortunately that series clashed with some of Arnd's work and I haven't
got around to rebasing it.

Tom Hromatka  writes:
> PowerPC experts,
>
> Paul Moore and I are working on the v2.4 release of libseccomp,
> and as part of this work I need to update the syscall table for
> each architecture.
>
> I have incorporated the new ppc syscall.tbl into libseccomp, but
> I am not familiar with the value of "spu" in the ABI column.  For
> example:
>
> 22  32   umount  sys_oldumount
> 22  64   umount  sys_ni_syscall
> 22  spu  umount  sys_ni_syscall
>
> In libseccomp, we maintain a 32-bit ppc syscall table and a 64-bit
> ppc syscall table.  Do we also need to add a "spu" ppc syscall
> table?  Some clarification on the syscalls marked "spu" and "nospu"
> would be greatly appreciated.

The name "spu" comes from SPU, which are the small cores in the
Playstation 3. The value in the syscall table says whether that syscall
is available to SPU programs ("spu") or blocked ("nospu"). I don't think
you want to support libseccomp on SPUs, so basically you can just ignore
the spu/nospu distinction.

So I'm pretty sure you can just remove all the "spu" lines, and then
replace "nospu" with "common". As I've done below.

I'll try and get my patch above into a branch and into linux-next
somehow, so that you can at least refer to an upstream commit.

cheers


# SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
#
# system call numbers and entry vectors for powerpc
#
# The format is:
# <number> <abi> <name> <entry point> <compat entry point>
#
# The <abi> can be common, 64, or 32 for this file.
#
0    common   restart_syscall         sys_restart_syscall
1    common   exit                    sys_exit
2    common   fork                    ppc_fork
3    common   read                    sys_read
4    common   write                   sys_write
5    common   open                    sys_open                compat_sys_open
6    common   close                   sys_close
7    common   waitpid                 sys_waitpid
8    common   creat                   sys_creat
9    common   link                    sys_link
10   common   unlink                  sys_unlink
11   common   execve                  sys_execve              compat_sys_execve
12   common   chdir                   sys_chdir
13   common   time                    sys_time                compat_sys_time
14   common   mknod                   sys_mknod
15   common   chmod                   sys_chmod
16   common   lchown                  sys_lchown
17   common   break                   sys_ni_syscall
18   32       oldstat                 sys_stat                sys_ni_syscall
18   64       oldstat                 sys_ni_syscall
19   common   lseek                   sys_lseek               compat_sys_lseek
20   common   getpid                  sys_getpid
21   common   mount                   sys_mount               compat_sys_mount
22   32       umount                  sys_oldumount
22   64       umount                  sys_ni_syscall
23   common   setuid                  sys_setuid
24   common   getuid                  sys_getuid
25   common   stime                   sys_stime               compat_sys_stime
26   common   ptrace                  sys_ptrace              compat_sys_ptrace
27   common   alarm                   sys_alarm
28   32       oldfstat                sys_fstat               sys_ni_syscall
28   64       oldfstat                sys_ni_syscall
29   common   pause                   sys_pause
30   common   utime                   sys_utime               compat_sys_utime
31   common   stty                    sys_ni_syscall
32   common   gtty                    sys_ni_syscall
33   common   access                  sys_access
34   common   nice                    sys_nice
35   common   ftime                   sys_ni_syscall
36   common   sync                    sys_sync
37   common   kill                    sys_kill
38   common   rename                  sys_rename
39   common   mkdir                   sys_mkdir
40   common   rmdir                   sys_rmdir
41   common   dup                     sys_dup
42   common   pipe                    sys_pipe
43   common   times                   sys_times

Re: [QUESTION] powerpc, libseccomp, and spu

2019-02-11 Thread Benjamin Herrenschmidt
On Mon, 2019-02-11 at 11:54 -0700, Tom Hromatka wrote:
> PowerPC experts,
> 
> Paul Moore and I are working on the v2.4 release of libseccomp,
> and as part of this work I need to update the syscall table for
> each architecture.
> 
> I have incorporated the new ppc syscall.tbl into libseccomp, but
> I am not familiar with the value of "spu" in the ABI column.  For
> example:
> 
> 22  32   umount  sys_oldumount
> 22  64   umount  sys_ni_syscall
> 22  spu  umount  sys_ni_syscall
> 
> In libseccomp, we maintain a 32-bit ppc syscall table and a 64-bit
> ppc syscall table.  Do we also need to add a "spu" ppc syscall
> table?  Some clarification on the syscalls marked "spu" and "nospu"
> would be greatly appreciated.

On the Cell processor, there are a number of little co-processors (SPUs)
that run alongside the main PowerPC core. Userspace can run code on
them; they operate within the user context via their own MMUs. We
provide a facility for them to issue syscalls (via some kind of RPC to
the main core). The "spu" marker indicates syscalls that can be
called from the SPUs via that mechanism.

Now, the big question is, anybody still using Cell ? :-)

Cheers,
Ben.





Re: [PATCH] powerpc/configs: Enable CONFIG_USB_XHCI_HCD by default

2019-02-11 Thread David Gibson
On Mon, 11 Feb 2019 12:37:12 +0100
Thomas Huth  wrote:

> Recent versions of QEMU provide a XHCI device by default these
> days instead of an old-fashioned OHCI device:
> 
>  https://git.qemu.org/?p=qemu.git;a=commitdiff;h=57040d451315320b7d27
> 
> So to get the keyboard working in the graphical console there again,
> we should now include XHCI support in the kernel by default, too.
> 
> Signed-off-by: Thomas Huth 

Wow, we didn't before?  That's bonkers.

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/configs/pseries_defconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig
> index ea79c51..62e12f6 100644
> --- a/arch/powerpc/configs/pseries_defconfig
> +++ b/arch/powerpc/configs/pseries_defconfig
> @@ -217,6 +217,7 @@ CONFIG_USB_MON=m
>  CONFIG_USB_EHCI_HCD=y
>  # CONFIG_USB_EHCI_HCD_PPC_OF is not set
>  CONFIG_USB_OHCI_HCD=y
> +CONFIG_USB_XHCI_HCD=y
>  CONFIG_USB_STORAGE=m
>  CONFIG_NEW_LEDS=y
>  CONFIG_LEDS_CLASS=m
> -- 
> 1.8.3.1
> 


-- 
David Gibson 
Principal Software Engineer, Virtualization, Red Hat




Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages

2019-02-11 Thread Daniel Jordan
On Mon, Feb 11, 2019 at 03:54:47PM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> > Hi,
> > 
> > This series converts users that account pinned pages with locked_vm to
> > account with pinned_vm instead, pinned_vm being the correct counter to
> > use.  It's based on a similar patch I posted recently[0].
> > 
> > The patches are based on rdma/for-next to build on Davidlohr Bueso's
> > recent conversion of pinned_vm to an atomic64_t[1].  Seems to make some
> > sense for these to be routed the same way, despite lack of rdma content?
> 
> Oy.. I'd be willing to accumulate a branch with acks to send to Linus
> *separately* from RDMA to Linus, but this is very abnormal.
> 
> Better to wait a few weeks for -rc1 and send patches through the
> subsystem trees.

Ok, I can do that.  It did seem strange, so I made it a question...


Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages

2019-02-11 Thread Daniel Jordan
On Mon, Feb 11, 2019 at 03:56:20PM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 05:44:33PM -0500, Daniel Jordan wrote:
> > @@ -266,24 +267,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
> > if (!mm)
> > return -ESRCH; /* process exited */
> >  
> > -   ret = down_write_killable(>mmap_sem);
> > -   if (!ret) {
> > -   if (npage > 0) {
> > -   if (!dma->lock_cap) {
> > -   unsigned long limit;
> > -
> > -   limit = task_rlimit(dma->task,
> > -   RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > +   pinned_vm = atomic64_add_return(npage, &mm->pinned_vm);
> >  
> > -   if (mm->locked_vm + npage > limit)
> > -   ret = -ENOMEM;
> > -   }
> > +   if (npage > 0 && !dma->lock_cap) {
> > +   unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >>
> > +         PAGE_SHIFT;
> 
> I haven't looked at this super closely, but how does this stuff work?
> 
> do_mlock doesn't touch pinned_vm, and this doesn't touch locked_vm...
> 
> Shouldn't all this be 'if (locked_vm + pinned_vm < RLIMIT_MEMLOCK)' ?
>
> Otherwise MEMLOCK is really doubled..

So this has been a problem for some time, but it's not as easy as adding them
together, see [1][2] for a start.

The locked_vm/pinned_vm issue definitely needs fixing, but all this series is
trying to do is account to the right counter.

Daniel

[1] http://lkml.kernel.org/r/20130523104154.ga23...@twins.programming.kicks-ass.net
[2] http://lkml.kernel.org/r/20130524140114.gk23...@twins.programming.kicks-ass.net


Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages

2019-02-11 Thread Jason Gunthorpe
On Mon, Feb 11, 2019 at 05:44:33PM -0500, Daniel Jordan wrote:
> Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
> pages"), locked and pinned pages are accounted separately.  Type1
> accounts pinned pages to locked_vm; use pinned_vm instead.
> 
> pinned_vm recently became atomic and so no longer relies on mmap_sem
> held as writer: delete.
> 
> Signed-off-by: Daniel Jordan 
>  drivers/vfio/vfio_iommu_type1.c | 31 ---
>  1 file changed, 12 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 73652e21efec..a56cc341813f 100644
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -257,7 +257,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, 
> struct vfio_pfn *vpfn)
>  static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>  {
>   struct mm_struct *mm;
> - int ret;
> + s64 pinned_vm;
> + int ret = 0;
>  
>   if (!npage)
>   return 0;
> @@ -266,24 +267,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>   if (!mm)
>   return -ESRCH; /* process exited */
>  
> - ret = down_write_killable(>mmap_sem);
> - if (!ret) {
> - if (npage > 0) {
> - if (!dma->lock_cap) {
> - unsigned long limit;
> -
> - limit = task_rlimit(dma->task,
> - RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> + pinned_vm = atomic64_add_return(npage, &mm->pinned_vm);
>  
> - if (mm->locked_vm + npage > limit)
> - ret = -ENOMEM;
> - }
> + if (npage > 0 && !dma->lock_cap) {
> + unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >>
> +       PAGE_SHIFT;

I haven't looked at this super closely, but how does this stuff work?

do_mlock doesn't touch pinned_vm, and this doesn't touch locked_vm...

Shouldn't all this be 'if (locked_vm + pinned_vm < RLIMIT_MEMLOCK)' ?

Otherwise MEMLOCK is really doubled..

Jason
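
Spelled out, the combined check Jason suggests would look something like
this sketch (locked_vm is still a plain counter protected by mmap_sem,
which is part of why the two cannot simply be added up today):

	if (mm->locked_vm + atomic64_read(&mm->pinned_vm) + npage > limit)
		ret = -ENOMEM;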


Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages

2019-02-11 Thread Jason Gunthorpe
On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> Hi,
> 
> This series converts users that account pinned pages with locked_vm to
> account with pinned_vm instead, pinned_vm being the correct counter to
> use.  It's based on a similar patch I posted recently[0].
> 
> The patches are based on rdma/for-next to build on Davidlohr Bueso's
> recent conversion of pinned_vm to an atomic64_t[1].  Seems to make some
> sense for these to be routed the same way, despite lack of rdma content?

Oy.. I'd be willing to accumulate a branch with acks to send to Linus
*separately* from RDMA to Linus, but this is very abnormal.

Better to wait a few weeks for -rc1 and send patches through the
subsystem trees.

> All five of these places, and probably some of Davidlohr's conversions,
> probably want to be collapsed into a common helper in the core mm for
> accounting pinned pages.  I tried, and there are several details that
> likely need discussion, so this can be done as a follow-on.

I've wondered the same..

Jason


[PATCH 5/5] kvm/book3s: use pinned_vm instead of locked_vm to account pinned pages

2019-02-11 Thread Daniel Jordan
Memory used for TCE tables in kvm_vm_ioctl_create_spapr_tce is currently
accounted to locked_vm because it stays resident and its allocation is
directly triggered from userspace as explained in f8626985c7c2 ("KVM:
PPC: Account TCE-containing pages in locked_vm").

However, since the memory comes straight from the page allocator (and to
a lesser extent unreclaimable slab) and is effectively pinned, it should
be accounted with pinned_vm (see bc3e53f682d9 ("mm: distinguish between
mlocked and pinned pages")).

pinned_vm recently became atomic and so no longer relies on mmap_sem
held as writer: delete.

Signed-off-by: Daniel Jordan 
---
 arch/powerpc/kvm/book3s_64_vio.c | 35 ++--
 1 file changed, 15 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 532ab79734c7..2f8d7c051e4e 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -56,39 +56,34 @@ static unsigned long kvmppc_stt_pages(unsigned long tce_pages)
return tce_pages + ALIGN(stt_bytes, PAGE_SIZE) / PAGE_SIZE;
 }
 
-static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc)
+static long kvmppc_account_memlimit(unsigned long pages, bool inc)
 {
long ret = 0;
+   s64 pinned_vm;
 
if (!current || !current->mm)
return ret; /* process exited */
 
-   down_write(&current->mm->mmap_sem);
-
if (inc) {
-   unsigned long locked, lock_limit;
+   unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
-   locked = current->mm->locked_vm + stt_pages;
-   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-   if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+   pinned_vm = atomic64_add_return(pages, &current->mm->pinned_vm);
+   if (pinned_vm > lock_limit && !capable(CAP_IPC_LOCK)) {
ret = -ENOMEM;
-   else
-   current->mm->locked_vm += stt_pages;
+   atomic64_sub(pages, &current->mm->pinned_vm);
+   }
} else {
-   if (WARN_ON_ONCE(stt_pages > current->mm->locked_vm))
-   stt_pages = current->mm->locked_vm;
+   pinned_vm = atomic64_read(&current->mm->pinned_vm);
+   if (WARN_ON_ONCE(pages > pinned_vm))
+   pages = pinned_vm;
 
-   current->mm->locked_vm -= stt_pages;
+   atomic64_sub(pages, &current->mm->pinned_vm);
}
 
-   pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%ld %ld/%ld%s\n", current->pid,
-   inc ? '+' : '-',
-   stt_pages << PAGE_SHIFT,
-   current->mm->locked_vm << PAGE_SHIFT,
-   rlimit(RLIMIT_MEMLOCK),
-   ret ? " - exceeded" : "");
-
-   up_write(&current->mm->mmap_sem);
+   pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%lu %ld/%lu%s\n", current->pid,
+   inc ? '+' : '-', pages << PAGE_SHIFT,
+   atomic64_read(&current->mm->pinned_vm) << PAGE_SHIFT,
+   rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
 
return ret;
 }
-- 
2.20.1



[PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages

2019-02-11 Thread Daniel Jordan
Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
pages"), locked and pinned pages are accounted separately.  Type1
accounts pinned pages to locked_vm; use pinned_vm instead.

pinned_vm recently became atomic and so no longer relies on mmap_sem
held as writer: delete.

Signed-off-by: Daniel Jordan 
---
 drivers/vfio/vfio_iommu_type1.c | 31 ---
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 73652e21efec..a56cc341813f 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -257,7 +257,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, 
struct vfio_pfn *vpfn)
 static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 {
struct mm_struct *mm;
-   int ret;
+   s64 pinned_vm;
+   int ret = 0;
 
if (!npage)
return 0;
@@ -266,24 +267,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
if (!mm)
return -ESRCH; /* process exited */
 
-   ret = down_write_killable(&mm->mmap_sem);
-   if (!ret) {
-   if (npage > 0) {
-   if (!dma->lock_cap) {
-   unsigned long limit;
-
-   limit = task_rlimit(dma->task,
-   RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+   pinned_vm = atomic64_add_return(npage, &mm->pinned_vm);
 
-   if (mm->locked_vm + npage > limit)
-   ret = -ENOMEM;
-   }
+   if (npage > 0 && !dma->lock_cap) {
+   unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >>
+  PAGE_SHIFT;
+   if (pinned_vm > limit) {
+   atomic64_sub(npage, &mm->pinned_vm);
+   ret = -ENOMEM;
}
-
-   if (!ret)
-   mm->locked_vm += npage;
-
-   up_write(&mm->mmap_sem);
}
 
if (async)
@@ -401,6 +393,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
long ret, pinned = 0, lock_acct = 0;
bool rsvd;
dma_addr_t iova = vaddr - dma->vaddr + dma->iova;
+   atomic64_t *pinned_vm = &current->mm->pinned_vm;
 
/* This code path is only user initiated */
if (!current->mm)
@@ -418,7 +411,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
 * pages are already counted against the user.
 */
if (!rsvd && !vfio_find_vpfn(dma, iova)) {
-   if (!dma->lock_cap && current->mm->locked_vm + 1 > limit) {
+   if (!dma->lock_cap && atomic64_read(pinned_vm) + 1 > limit) {
put_pfn(*pfn_base, dma->prot);
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
limit << PAGE_SHIFT);
@@ -445,7 +438,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, 
unsigned long vaddr,
 
if (!rsvd && !vfio_find_vpfn(dma, iova)) {
if (!dma->lock_cap &&
-   current->mm->locked_vm + lock_acct + 1 > limit) {
+   atomic64_read(pinned_vm) + lock_acct + 1 > limit) {
put_pfn(pfn, dma->prot);
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
__func__, limit << PAGE_SHIFT);
-- 
2.20.1



[PATCH 3/5] fpga/dlf/afu: use pinned_vm instead of locked_vm to account pinned pages

2019-02-11 Thread Daniel Jordan
Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
pages"), locked and pinned pages are accounted separately.  The FPGA AFU
driver accounts pinned pages to locked_vm; use pinned_vm instead.

pinned_vm recently became atomic and so no longer relies on mmap_sem
held as writer: delete.

Signed-off-by: Daniel Jordan 
---
 drivers/fpga/dfl-afu-dma-region.c | 50 ++-
 1 file changed, 23 insertions(+), 27 deletions(-)

diff --git a/drivers/fpga/dfl-afu-dma-region.c b/drivers/fpga/dfl-afu-dma-region.c
index e18a786fc943..a9a6b317fe2e 100644
--- a/drivers/fpga/dfl-afu-dma-region.c
+++ b/drivers/fpga/dfl-afu-dma-region.c
@@ -32,47 +32,43 @@ void afu_dma_region_init(struct dfl_feature_platform_data *pdata)
 }
 
 /**
- * afu_dma_adjust_locked_vm - adjust locked memory
+ * afu_dma_adjust_pinned_vm - adjust pinned memory
  * @dev: port device
  * @npages: number of pages
- * @incr: increase or decrease locked memory
  *
- * Increase or decrease the locked memory size with npages input.
+ * Increase or decrease the pinned memory size with npages input.
  *
  * Return 0 on success.
- * Return -ENOMEM if locked memory size is over the limit and no CAP_IPC_LOCK.
+ * Return -ENOMEM if pinned memory size is over the limit and no CAP_IPC_LOCK.
  */
-static int afu_dma_adjust_locked_vm(struct device *dev, long npages, bool incr)
+static int afu_dma_adjust_pinned_vm(struct device *dev, long pages)
 {
-   unsigned long locked, lock_limit;
+   unsigned long lock_limit;
+   s64 pinned_vm;
int ret = 0;
 
/* the task is exiting. */
-   if (!current->mm)
+   if (!current->mm || !pages)
return 0;
 
-   down_write(&current->mm->mmap_sem);
-
-   if (incr) {
-   locked = current->mm->locked_vm + npages;
+   if (pages > 0) {
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-   if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+   pinned_vm = atomic64_add_return(pages, &current->mm->pinned_vm);
+   if (pinned_vm > lock_limit && !capable(CAP_IPC_LOCK)) {
ret = -ENOMEM;
-   else
-   current->mm->locked_vm += npages;
+   atomic64_sub(pages, &current->mm->pinned_vm);
+   }
} else {
-   if (WARN_ON_ONCE(npages > current->mm->locked_vm))
-   npages = current->mm->locked_vm;
-   current->mm->locked_vm -= npages;
+   pinned_vm = atomic64_read(&current->mm->pinned_vm);
+   if (WARN_ON_ONCE(pages > pinned_vm))
+   pages = pinned_vm;
+   atomic64_sub(pages, &current->mm->pinned_vm);
}
 
-   dev_dbg(dev, "[%d] RLIMIT_MEMLOCK %c%ld %ld/%ld%s\n", current->pid,
-   incr ? '+' : '-', npages << PAGE_SHIFT,
-   current->mm->locked_vm << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK),
-   ret ? "- exceeded" : "");
-
-   up_write(&current->mm->mmap_sem);
+   dev_dbg(dev, "[%d] RLIMIT_MEMLOCK %c%ld %lld/%lu%s\n", current->pid,
+   (pages > 0) ? '+' : '-', pages << PAGE_SHIFT,
+   (s64)atomic64_read(&current->mm->pinned_vm) << PAGE_SHIFT,
+   rlimit(RLIMIT_MEMLOCK), ret ? "- exceeded" : "");
 
return ret;
 }
@@ -92,7 +88,7 @@ static int afu_dma_pin_pages(struct dfl_feature_platform_data *pdata,
 	struct device *dev = &pdata->dev->dev;
int ret, pinned;
 
-   ret = afu_dma_adjust_locked_vm(dev, npages, true);
+   ret = afu_dma_adjust_pinned_vm(dev, npages);
if (ret)
return ret;
 
@@ -121,7 +117,7 @@ static int afu_dma_pin_pages(struct dfl_feature_platform_data *pdata,
 free_pages:
kfree(region->pages);
 unlock_vm:
-   afu_dma_adjust_locked_vm(dev, npages, false);
+   afu_dma_adjust_pinned_vm(dev, -npages);
return ret;
 }
 
@@ -141,7 +137,7 @@ static void afu_dma_unpin_pages(struct dfl_feature_platform_data *pdata,
 
put_all_pages(region->pages, npages);
kfree(region->pages);
-   afu_dma_adjust_locked_vm(dev, npages, false);
+   afu_dma_adjust_pinned_vm(dev, -npages);
 
dev_dbg(dev, "%ld pages unpinned\n", npages);
 }
-- 
2.20.1



[PATCH 4/5] powerpc/mmu: use pinned_vm instead of locked_vm to account pinned pages

2019-02-11 Thread Daniel Jordan
Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
pages"), locked and pinned pages are accounted separately.  The IOMMU
MMU helpers on powerpc account pinned pages to locked_vm; use pinned_vm
instead.

pinned_vm recently became atomic and so no longer relies on mmap_sem
held as writer: delete.

Signed-off-by: Daniel Jordan 
---
 arch/powerpc/mm/mmu_context_iommu.c | 43 ++---
 1 file changed, 21 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index a712a650a8b6..fdf670542847 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -40,36 +40,35 @@ struct mm_iommu_table_group_mem_t {
u64 dev_hpa;/* Device memory base address */
 };
 
-static long mm_iommu_adjust_locked_vm(struct mm_struct *mm,
+static long mm_iommu_adjust_pinned_vm(struct mm_struct *mm,
unsigned long npages, bool incr)
 {
-   long ret = 0, locked, lock_limit;
+   long ret = 0;
+   unsigned long lock_limit;
+   s64 pinned_vm;
 
if (!npages)
return 0;
 
-   down_write(&mm->mmap_sem);
-
if (incr) {
-   locked = mm->locked_vm + npages;
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-   if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+   pinned_vm = atomic64_add_return(npages, &mm->pinned_vm);
+   if (pinned_vm > lock_limit && !capable(CAP_IPC_LOCK)) {
ret = -ENOMEM;
-   else
-   mm->locked_vm += npages;
+   atomic64_sub(npages, &mm->pinned_vm);
+   }
} else {
-   if (WARN_ON_ONCE(npages > mm->locked_vm))
-   npages = mm->locked_vm;
-   mm->locked_vm -= npages;
+   pinned_vm = atomic64_read(&mm->pinned_vm);
+   if (WARN_ON_ONCE(npages > pinned_vm))
+   npages = pinned_vm;
+   atomic64_sub(npages, &mm->pinned_vm);
}
 
-   pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n",
-   current ? current->pid : 0,
-   incr ? '+' : '-',
+   pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%lu %ld/%lu\n",
+   current ? current->pid : 0, incr ? '+' : '-',
npages << PAGE_SHIFT,
-   mm->locked_vm << PAGE_SHIFT,
+   atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
rlimit(RLIMIT_MEMLOCK));
-   up_write(&mm->mmap_sem);
 
return ret;
 }
@@ -133,7 +132,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
struct mm_iommu_table_group_mem_t **pmem)
 {
struct mm_iommu_table_group_mem_t *mem;
-   long i, j, ret = 0, locked_entries = 0;
+   long i, j, ret = 0, pinned_entries = 0;
unsigned int pageshift;
unsigned long flags;
unsigned long cur_ua;
@@ -154,11 +153,11 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
}
 
if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
-   ret = mm_iommu_adjust_locked_vm(mm, entries, true);
+   ret = mm_iommu_adjust_pinned_vm(mm, entries, true);
if (ret)
goto unlock_exit;
 
-   locked_entries = entries;
+   pinned_entries = entries;
}
 
mem = kzalloc(sizeof(*mem), GFP_KERNEL);
@@ -252,8 +251,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
	list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
 
 unlock_exit:
-   if (locked_entries && ret)
-   mm_iommu_adjust_locked_vm(mm, locked_entries, false);
+   if (pinned_entries && ret)
+   mm_iommu_adjust_pinned_vm(mm, pinned_entries, false);
 
	mutex_unlock(&mem_list_mutex);
 
@@ -352,7 +351,7 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem)
mm_iommu_release(mem);
 
if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)
-   mm_iommu_adjust_locked_vm(mm, entries, false);
+   mm_iommu_adjust_pinned_vm(mm, entries, false);
 
 unlock_exit:
	mutex_unlock(&mem_list_mutex);
-- 
2.20.1



[PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages

2019-02-11 Thread Daniel Jordan
Hi,

This series converts users that account pinned pages with locked_vm to
account with pinned_vm instead, pinned_vm being the correct counter to
use.  It's based on a similar patch I posted recently[0].

The patches are based on rdma/for-next to build on Davidlohr Bueso's
recent conversion of pinned_vm to an atomic64_t[1].  Seems to make some
sense for these to be routed the same way, despite lack of rdma content?

All five of these places, and probably some of Davidlohr's conversions,
probably want to be collapsed into a common helper in the core mm for
accounting pinned pages.  I tried, and there are several details that
likely need discussion, so this can be done as a follow-on.
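
As a strawman for that follow-on, the pattern repeated across patches 1-5
could collapse into something like this (helper name and placement are
assumptions, not part of this series):

	/*
	 * Try to account npages to mm->pinned_vm against RLIMIT_MEMLOCK,
	 * undoing the addition on failure; bypass corresponds to the
	 * CAP_IPC_LOCK checks in the callers above.
	 */
	static inline int account_pinned_vm(struct mm_struct *mm,
					    unsigned long npages, bool bypass)
	{
		unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
		s64 pinned = atomic64_add_return(npages, &mm->pinned_vm);

		if (pinned > limit && !bypass) {
			atomic64_sub(npages, &mm->pinned_vm);
			return -ENOMEM;
		}
		return 0;
	}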

I'd appreciate a look at patch 5 especially since the accounting is
unusual no matter whether locked_vm or pinned_vm are used.

On powerpc, this was cross-compile tested only.

[0] http://lkml.kernel.org/r/20181105165558.11698-8-daniel.m.jor...@oracle.com
[1] http://lkml.kernel.org/r/20190206175920.31082-1-d...@stgolabs.net
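
As a concrete sketch of the common helper alluded to above, the accounting
pattern each of the five conversions repeats is small.  The names
account_pinned_vm()/unaccount_pinned_vm() below are hypothetical, not part
of this series:

/* Hypothetical common helper: charge npages against RLIMIT_MEMLOCK. */
static int account_pinned_vm(struct mm_struct *mm, unsigned long npages)
{
        unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

        if (atomic64_add_return(npages, &mm->pinned_vm) > lock_limit &&
            !capable(CAP_IPC_LOCK)) {
                atomic64_sub(npages, &mm->pinned_vm);
                return -ENOMEM;
        }
        return 0;
}

/* Hypothetical counterpart: undo the charge on unpin or on error. */
static void unaccount_pinned_vm(struct mm_struct *mm, unsigned long npages)
{
        atomic64_sub(npages, &mm->pinned_vm);
}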

Daniel Jordan (5):
  vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
  vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned
pages
  fpga/dlf/afu: use pinned_vm instead of locked_vm to account pinned
pages
  powerpc/mmu: use pinned_vm instead of locked_vm to account pinned
pages
  kvm/book3s: use pinned_vm instead of locked_vm to account pinned pages

 Documentation/vfio.txt  |  6 +--
 arch/powerpc/kvm/book3s_64_vio.c| 35 +++-
 arch/powerpc/mm/mmu_context_iommu.c | 43 ++-
 drivers/fpga/dfl-afu-dma-region.c   | 50 +++---
 drivers/vfio/vfio_iommu_spapr_tce.c | 64 ++---
 drivers/vfio/vfio_iommu_type1.c | 31 ++
 6 files changed, 104 insertions(+), 125 deletions(-)

-- 
2.20.1



[PATCH 2/5] vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned pages

2019-02-11 Thread Daniel Jordan
Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned
pages"), locked and pinned pages are accounted separately.  The SPAPR
TCE VFIO IOMMU driver accounts pinned pages to locked_vm; use pinned_vm
instead.

pinned_vm recently became atomic and so no longer relies on mmap_sem
held as writer: delete.

Signed-off-by: Daniel Jordan 
---
 Documentation/vfio.txt  |  6 +--
 drivers/vfio/vfio_iommu_spapr_tce.c | 64 ++---
 2 files changed, 33 insertions(+), 37 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index f1a4d3c3ba0b..fa37d65363f9 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -308,7 +308,7 @@ This implementation has some specifics:
currently there is no way to reduce the number of calls. In order to make
things faster, the map/unmap handling has been implemented in real mode
which provides an excellent performance which has limitations such as
-   inability to do locked pages accounting in real time.
+   inability to do pinned pages accounting in real time.
 
 4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
subtree that can be treated as a unit for the purposes of partitioning and
@@ -324,7 +324,7 @@ This implementation has some specifics:
returns the size and the start of the DMA window on the PCI bus.
 
VFIO_IOMMU_ENABLE
-   enables the container. The locked pages accounting
+   enables the container. The pinned pages accounting
is done at this point. This lets user first to know what
the DMA window is and adjust rlimit before doing any real job.
 
@@ -454,7 +454,7 @@ This implementation has some specifics:
 
PPC64 paravirtualized guests generate a lot of map/unmap requests,
and the handling of those includes pinning/unpinning pages and updating
-   mm::locked_vm counter to make sure we do not exceed the rlimit.
+   mm::pinned_vm counter to make sure we do not exceed the rlimit.
The v2 IOMMU splits accounting and pinning into separate operations:
 
- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index c424913324e3..f47e020dc5e4 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -34,9 +34,11 @@
 static void tce_iommu_detach_group(void *iommu_data,
struct iommu_group *iommu_group);
 
-static long try_increment_locked_vm(struct mm_struct *mm, long npages)
+static long try_increment_pinned_vm(struct mm_struct *mm, long npages)
 {
-   long ret = 0, locked, lock_limit;
+   long ret = 0;
+   s64 pinned;
+   unsigned long lock_limit;
 
if (WARN_ON_ONCE(!mm))
return -EPERM;
@@ -44,39 +46,33 @@ static long try_increment_locked_vm(struct mm_struct *mm, 
long npages)
if (!npages)
return 0;
 
-   down_write(&mm->mmap_sem);
-   locked = mm->locked_vm + npages;
+   pinned = atomic64_add_return(npages, &mm->pinned_vm);
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-   if (locked > lock_limit && !capable(CAP_IPC_LOCK))
+   if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) {
ret = -ENOMEM;
-   else
-   mm->locked_vm += npages;
+   atomic64_sub(npages, &mm->pinned_vm);
+   }
 
-   pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid,
+   pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%lu%s\n", current->pid,
npages << PAGE_SHIFT,
-   mm->locked_vm << PAGE_SHIFT,
-   rlimit(RLIMIT_MEMLOCK),
-   ret ? " - exceeded" : "");
-
-   up_write(&mm->mmap_sem);
+   atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
+   rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : "");
 
return ret;
 }
 
-static void decrement_locked_vm(struct mm_struct *mm, long npages)
+static void decrement_pinned_vm(struct mm_struct *mm, long npages)
 {
if (!mm || !npages)
return;
 
-   down_write(&mm->mmap_sem);
-   if (WARN_ON_ONCE(npages > mm->locked_vm))
-   npages = mm->locked_vm;
-   mm->locked_vm -= npages;
-   pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid,
+   if (WARN_ON_ONCE(npages > atomic64_read(&mm->pinned_vm)))
+   npages = atomic64_read(&mm->pinned_vm);
+   atomic64_sub(npages, &mm->pinned_vm);
+   pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%lu\n", current->pid,
npages << PAGE_SHIFT,
-   mm->locked_vm << PAGE_SHIFT,
+   atomic64_read(&mm->pinned_vm) << PAGE_SHIFT,
rlimit(RLIMIT_MEMLOCK));
-   up_write(&mm->mmap_sem);
 }
 
 /*
@@ -110,7 +106,7 @@ struct tce_container {
bool enabled;
bool v2;
bool 

Re: [PATCH] powerpc: fix 32-bit KVM-PR lockup and panic with MacOS guest

2019-02-11 Thread Mark Cave-Ayland
On 11/02/2019 00:30, Benjamin Herrenschmidt wrote:

> On Fri, 2019-02-08 at 14:51 +, Mark Cave-Ayland wrote:
>>
>> Indeed, but there are still some questions to be asked here:
>>
>> 1) Why were these bits removed from the original bitmask in the first place 
>> without
>> it being documented in the commit message?
>>
>> 2) Is this the right fix? I'm told that MacOS guests already run without 
>> this patch
>> on a G5 under 64-bit KVM-PR which may suggest that this is a workaround for 
>> another
>> bug elsewhere in the 32-bit powerpc code.
>>
>>
>> If you think that these points don't matter, then I'm happy to resubmit the 
>> patch
>> as-is based upon your comments above.
> 
> We should write a test case to verify that FE0/FE1 are properly
> preserved/context-switched etc... I bet if we accidentally wiped them,
> we wouldn't notice 99.9% of the time.

Right, I guess it's more likely to cause an issue in the KVM PR case because
the guest can alter the flags in a way that doesn't go through the normal
process switch mechanism.

The original patchset at
https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg98326.html does 
include
some tests in the first few patches, but AFAICT they are concerned with the 
contents
of the FP registers rather than the related MSRs.

Who is the right person to ask about fixing issues related to context switching 
with
KVM PR? I did add the original author's email address to my first few emails 
but have
had no response back :/


ATB,

Mark.
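
A minimal sketch of the kind of test being suggested, assuming the prctl(2)
floating-point exception-mode interface (PR_SET_FPEXC/PR_GET_FPEXC, which
maps onto MSR[FE0]/MSR[FE1] on powerpc) is a suitable observation point;
this is illustrative only, not a patch from this thread:

#include <sched.h>
#include <stdio.h>
#include <sys/prctl.h>

int main(void)
{
        unsigned int mode = 0;
        int i;

        /* Request precise FP exceptions (FE0=FE1=1 on powerpc). */
        if (prctl(PR_SET_FPEXC, PR_FP_EXC_PRECISE) == -1) {
                perror("PR_SET_FPEXC");
                return 1;
        }

        /* Force plenty of context switches. */
        for (i = 0; i < 100000; i++)
                sched_yield();

        if (prctl(PR_GET_FPEXC, &mode) == -1) {
                perror("PR_GET_FPEXC");
                return 1;
        }
        if ((mode & PR_FP_EXC_PRECISE) != PR_FP_EXC_PRECISE) {
                fprintf(stderr, "FP exception mode lost: 0x%x\n", mode);
                return 1;
        }
        printf("FP exception mode preserved: 0x%x\n", mode);
        return 0;
}

A real regression test for the KVM-PR case would additionally need to run
inside a guest (or drive the flags through KVM), which this sketch does not
attempt.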


Re: [GIT PULL] of: overlay: validation checks, subsequent fixes for v20 -- correction: v4.20

2019-02-11 Thread Alan Tull
On Mon, Feb 11, 2019 at 1:13 PM Greg Kroah-Hartman
 wrote:
>
> On Mon, Feb 11, 2019 at 12:41:40PM -0600, Alan Tull wrote:
> > On Fri, Nov 9, 2018 at 12:58 AM Frank Rowand  wrote:
> >
> > What LTSI's are these patches likely to end up in?  Just to be clear,
> > I'm not pushing for any specific answer, I just want to know what to
> > expect.
>
> I have no idea what you are asking here.
>
> What patches?

I probably should have asked my question *below* the pertinent context
of the 17 patches listed in the pull request, which was:

>   of: overlay: add tests to validate kfrees from overlay removal
>   of: overlay: add missing of_node_put() after add new node to changeset
>   of: overlay: add missing of_node_get() in __of_attach_node_sysfs
>   powerpc/pseries: add of_node_put() in dlpar_detach_node()
>   of: overlay: use prop add changeset entry for property in new nodes
>   of: overlay: do not duplicate properties from overlay for new nodes
>   of: overlay: reorder fields in struct fragment
>   of: overlay: validate overlay properties #address-cells and #size-cells
>   of: overlay: make all pr_debug() and pr_err() messages unique
>   of: overlay: test case of two fragments adding same node
>   of: overlay: check prevents multiple fragments add or delete same node
>   of: overlay: check prevents multiple fragments touching same property
>   of: unittest: remove unused of_unittest_apply_overlay() argument
>   of: overlay: set node fields from properties when add new overlay node
>   of: unittest: allow base devicetree to have symbol metadata
>   of: unittest: find overlays[] entry by name instead of index
>   of: unittest: initialize args before calling of_*parse_*()

> What is "LTSI's"?

I have recently seen some devicetree patches being picked up for
the 4.20 stable-queue.  That seemed to suggest that some, but not all,
of these will end up in the next LTS release.  Also I was wondering if
any of this is likely to get backported to LTSI-4.14.

>
> confused,

Yes, and now I'm confused about the confusion.  Sorry for spreading confusion.

Alan

>
> greg k-h


[QUESTION] powerpc, libseccomp, and spu

2019-02-11 Thread Tom Hromatka

PowerPC experts,

Paul Moore and I are working on the v2.4 release of libseccomp,
and as part of this work I need to update the syscall table for
each architecture.

I have incorporated the new ppc syscall.tbl into libseccomp, but
I am not familiar with the value of "spu" in the ABI column.  For
example:

22  32  umount  sys_oldumount
22  64  umount  sys_ni_syscall
22  spu umount  sys_ni_syscall

In libseccomp, we maintain a 32-bit ppc syscall table and a 64-bit
ppc syscall table.  Do we also need to add a "spu" ppc syscall
table?  Some clarification on the syscalls marked "spu" and "nospu"
would be greatly appreciated.

Thanks.

Tom
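
For context, consumers currently see the per-ABI tables through libseccomp's
public resolver; a small sketch (build with -lseccomp), which makes no claim
about how an spu table would or should be exposed:

#include <stdio.h>
#include <seccomp.h>

int main(void)
{
        /* The same syscall name can resolve differently per ABI token. */
        int nr32 = seccomp_syscall_resolve_name_arch(SCMP_ARCH_PPC, "umount");
        int nr64 = seccomp_syscall_resolve_name_arch(SCMP_ARCH_PPC64, "umount");

        printf("umount: ppc=%d ppc64=%d\n", nr32, nr64);
        return 0;
}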


[PATCH v2 2/2] locking/rwsem: Optimize down_read_trylock()

2019-02-11 Thread Waiman Long
Modify __down_read_trylock() to make it generate slightly better code
(smaller and maybe a tiny bit faster).

Before this patch, down_read_trylock:

   0x0000 <+0>:  callq  0x5 <down_read_trylock+0x5>
   0x0005 <+5>:  jmp    0x18 <down_read_trylock+0x18>
   0x0007 <+7>:  lea    0x1(%rdx),%rcx
   0x000b <+11>: mov    %rdx,%rax
   0x000e <+14>: lock cmpxchg %rcx,(%rdi)
   0x0013 <+19>: cmp    %rax,%rdx
   0x0016 <+22>: je     0x23 <down_read_trylock+0x23>
   0x0018 <+24>: mov    (%rdi),%rdx
   0x001b <+27>: test   %rdx,%rdx
   0x001e <+30>: jns    0x7 <down_read_trylock+0x7>
   0x0020 <+32>: xor    %eax,%eax
   0x0022 <+34>: retq
   0x0023 <+35>: mov    %gs:0x0,%rax
   0x002c <+44>: or     $0x3,%rax
   0x0030 <+48>: mov    %rax,0x20(%rdi)
   0x0034 <+52>: mov    $0x1,%eax
   0x0039 <+57>: retq

After patch, down_read_trylock:

   0x0000 <+0>:  callq  0x5 <down_read_trylock+0x5>
   0x0005 <+5>:  mov    (%rdi),%rax
   0x0008 <+8>:  test   %rax,%rax
   0x000b <+11>: js     0x2f <down_read_trylock+0x2f>
   0x000d <+13>: lea    0x1(%rax),%rdx
   0x0011 <+17>: lock cmpxchg %rdx,(%rdi)
   0x0016 <+22>: jne    0x8 <down_read_trylock+0x8>
   0x0018 <+24>: mov    %gs:0x0,%rax
   0x0021 <+33>: or     $0x3,%rax
   0x0025 <+37>: mov    %rax,0x20(%rdi)
   0x0029 <+41>: mov    $0x1,%eax
   0x002e <+46>: retq
   0x002f <+47>: xor    %eax,%eax
   0x0031 <+49>: retq

By using a rwsem microbenchmark, the down_read_trylock() rates on an
x86-64 system before and after the patch were:

                 Before Patch    After Patch
   # of Threads     rlock           rlock
   ------------     -----           -----
        1           27,787          28,259
        2            8,359           9,234

On an ARM64 system, the performance results were:

                 Before Patch    After Patch
   # of Threads     rlock           rlock
   ------------     -----           -----
        1           24,155          25,000
        2            6,820           8,699

Suggested-by: Peter Zijlstra 
Signed-off-by: Waiman Long 
---
 kernel/locking/rwsem.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 067e265..028bc33 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -175,11 +175,11 @@ static inline int __down_read_killable(struct 
rw_semaphore *sem)
 
 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
-   long tmp;
+   long tmp = atomic_long_read(&sem->count);
 
-   while ((tmp = atomic_long_read(&sem->count)) >= 0) {
-   if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
-  tmp + RWSEM_ACTIVE_READ_BIAS)) {
+   while (tmp >= 0) {
+   if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+   tmp + RWSEM_ACTIVE_READ_BIAS)) {
return 1;
}
}
-- 
1.8.3.1
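
For readers without the kernel tree handy, the two loop shapes can be
reproduced in userspace with GCC's __atomic builtins standing in for the
kernel's atomic_long_* helpers; a sketch, with +1 standing in for
RWSEM_ACTIVE_READ_BIAS:

/* Old shape: re-read the counter after every failed attempt. */
static int trylock_old(long *count)
{
        long tmp;

        while ((tmp = __atomic_load_n(count, __ATOMIC_RELAXED)) >= 0) {
                long expected = tmp;

                if (__atomic_compare_exchange_n(count, &expected, tmp + 1, 0,
                                __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
                        return 1;
        }
        return 0;
}

/* New shape: one initial load; a failed CAS itself refreshes tmp. */
static int trylock_new(long *count)
{
        long tmp = __atomic_load_n(count, __ATOMIC_RELAXED);

        while (tmp >= 0) {
                if (__atomic_compare_exchange_n(count, &tmp, tmp + 1, 0,
                                __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
                        return 1;
                /* On failure, tmp now holds the freshly observed value. */
        }
        return 0;
}

The second form matches try_cmpxchg() semantics, which is where the smaller
loop body in the disassembly above comes from.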



[PATCH v2 1/2] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Waiman Long
As the generic rwsem-xadd code is using the appropriate acquire and
release versions of the atomic operations, the arch specific rwsem.h
files will not be that much faster than the generic code as long as the
atomic functions are properly implemented. So we can remove those arch
specific rwsem.h and stop building asm/rwsem.h to reduce maintenance
effort.

Currently, only x86, alpha and ia64 have implemented architecture
specific fast paths. I don't have access to alpha and ia64 systems for
testing, but they are legacy systems that are not likely to be updated
to the latest kernel anyway.

By using a rwsem microbenchmark, the total locking rates on a 4-socket
56-core 112-thread x86-64 system before and after the patch were as
follows (mixed means equal # of read and write locks):

                        Before Patch                  After Patch
   # of Threads   wlock    rlock    mixed       wlock    rlock    mixed
   ------------   -----    -----    -----       -----    -----    -----
        1        29,201   30,143   29,458      28,615   30,172   29,201
        2         6,807   13,299    1,171       7,725   15,025    1,804
        4         6,504   12,755    1,520       7,127   14,286    1,345
        8         6,762   13,412      764       6,826   13,652      726
       16         6,693   15,408      662       6,599   15,938      626
       32         6,145   15,286      496       5,549   15,487      511
       64         5,812   15,495       60       5,858   15,572       60

There were some run-to-run variations for the multi-thread tests. For
x86-64, using the generic C code fast path seems to be a little bit
faster than the assembly version with low lock contention.  Looking at
the assembly version of the fast paths, there are assembly to/from C
code wrappers that save and restore all the callee-clobbered registers
(7 registers on x86-64). The assembly generated from the generic C
code doesn't need to do that. That may explain the slight performance
gain here.

The generic asm rwsem.h can also be merged into kernel/locking/rwsem.h
with no code change, as no code other than that under
kernel/locking needs to access the internal rwsem macros and functions.

Signed-off-by: Waiman Long 
---
 MAINTAINERS |   1 -
 arch/alpha/include/asm/rwsem.h  | 211 ---
 arch/arm/include/asm/Kbuild |   1 -
 arch/arm64/include/asm/Kbuild   |   1 -
 arch/hexagon/include/asm/Kbuild |   1 -
 arch/ia64/include/asm/rwsem.h   | 172 -
 arch/powerpc/include/asm/Kbuild |   1 -
 arch/s390/include/asm/Kbuild|   1 -
 arch/sh/include/asm/Kbuild  |   1 -
 arch/sparc/include/asm/Kbuild   |   1 -
 arch/x86/include/asm/rwsem.h| 237 
 arch/x86/lib/Makefile   |   1 -
 arch/x86/lib/rwsem.S| 156 --
 arch/xtensa/include/asm/Kbuild  |   1 -
 include/asm-generic/rwsem.h | 140 
 include/linux/rwsem.h   |   4 +-
 kernel/locking/percpu-rwsem.c   |   2 +
 kernel/locking/rwsem.h  | 130 ++
 18 files changed, 133 insertions(+), 929 deletions(-)
 delete mode 100644 arch/alpha/include/asm/rwsem.h
 delete mode 100644 arch/ia64/include/asm/rwsem.h
 delete mode 100644 arch/x86/include/asm/rwsem.h
 delete mode 100644 arch/x86/lib/rwsem.S
 delete mode 100644 include/asm-generic/rwsem.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 9919840..053f536 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8926,7 +8926,6 @@ F:arch/*/include/asm/spinlock*.h
 F: include/linux/rwlock*.h
 F: include/linux/mutex*.h
 F: include/linux/rwsem*.h
-F: arch/*/include/asm/rwsem.h
 F: include/linux/seqlock.h
 F: lib/locking*.[ch]
 F: kernel/locking/
diff --git a/arch/alpha/include/asm/rwsem.h b/arch/alpha/include/asm/rwsem.h
deleted file mode 100644
index cf8fc8f9..000
--- a/arch/alpha/include/asm/rwsem.h
+++ /dev/null
@@ -1,211 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ALPHA_RWSEM_H
-#define _ALPHA_RWSEM_H
-
-/*
- * Written by Ivan Kokshaysky , 2001.
- * Based on asm-alpha/semaphore.h and asm-i386/rwsem.h
- */
-
-#ifndef _LINUX_RWSEM_H
-#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead"
-#endif
-
-#ifdef __KERNEL__
-
-#include 
-
-#define RWSEM_UNLOCKED_VALUE   0xL
-#define RWSEM_ACTIVE_BIAS  0x0001L
-#define RWSEM_ACTIVE_MASK  0xL
-#define RWSEM_WAITING_BIAS (-0x0001L)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS(RWSEM_WAITING_BIAS + 
RWSEM_ACTIVE_BIAS)
-
-static inline int ___down_read(struct rw_semaphore *sem)
-{
-   long oldcount;
-#ifndef CONFIG_SMP
-   oldcount = sem->count.counter;
-   sem->count.counter += RWSEM_ACTIVE_READ_BIAS;
-#else
-   long temp;
-   __asm__ __volatile__(
-   "1: ldq_l   %0,%1\n"
-   " 

[PATCH v2 0/2] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Waiman Long
v2:
 - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
 - Update performance test data in patch 1.

This is part 0 of my rwsem patchset. It just removes the architecture
specific files to make it easier to add enhancements in the upcoming
rwsem patches.

Since the two ll/sc platforms that I can test on (arm64 & ppc) are
both using the generic C code, the rwsem performance shouldn't be
affected by this patch, except for the down_read_trylock() code, which
was included in patch 2 for arm64.

Waiman Long (2):
  locking/rwsem: Remove arch specific rwsem files
  locking/rwsem: Optimize down_read_trylock()

 MAINTAINERS |   1 -
 arch/alpha/include/asm/rwsem.h  | 211 ---
 arch/arm/include/asm/Kbuild |   1 -
 arch/arm64/include/asm/Kbuild   |   1 -
 arch/hexagon/include/asm/Kbuild |   1 -
 arch/ia64/include/asm/rwsem.h   | 172 -
 arch/powerpc/include/asm/Kbuild |   1 -
 arch/s390/include/asm/Kbuild|   1 -
 arch/sh/include/asm/Kbuild  |   1 -
 arch/sparc/include/asm/Kbuild   |   1 -
 arch/x86/include/asm/rwsem.h| 237 
 arch/x86/lib/Makefile   |   1 -
 arch/x86/lib/rwsem.S| 156 --
 arch/xtensa/include/asm/Kbuild  |   1 -
 include/asm-generic/rwsem.h | 140 
 include/linux/rwsem.h   |   4 +-
 kernel/locking/percpu-rwsem.c   |   2 +
 kernel/locking/rwsem.h  | 130 ++
 18 files changed, 133 insertions(+), 929 deletions(-)
 delete mode 100644 arch/alpha/include/asm/rwsem.h
 delete mode 100644 arch/ia64/include/asm/rwsem.h
 delete mode 100644 arch/x86/include/asm/rwsem.h
 delete mode 100644 arch/x86/lib/rwsem.S
 delete mode 100644 include/asm-generic/rwsem.h

-- 
1.8.3.1



Re: [GIT PULL] of: overlay: validation checks, subsequent fixes for v20 -- correction: v4.20

2019-02-11 Thread Greg Kroah-Hartman
On Mon, Feb 11, 2019 at 12:41:40PM -0600, Alan Tull wrote:
> On Fri, Nov 9, 2018 at 12:58 AM Frank Rowand  wrote:
> 
> What LTSI's are these patches likely to end up in?  Just to be clear,
> I'm not pushing for any specific answer, I just want to know what to
> expect.

I have no idea what you are asking here.

What patches?  What is "LTSI's"?

confused,

greg k-h


Re: [GIT PULL] of: overlay: validation checks, subsequent fixes for v20 -- correction: v4.20

2019-02-11 Thread Alan Tull
On Fri, Nov 9, 2018 at 12:58 AM Frank Rowand  wrote:

What LTSI's are these patches likely to end up in?  Just to be clear,
I'm not pushing for any specific answer, I just want to know what to
expect.

Thanks,
Alan

>
> On 11/8/18 10:56 PM, Frank Rowand wrote:
> > Hi Rob,
> >
> > Please pull the changes to add the overlay validation checks.
> >
> > This is the v7 version of the patch series.
> >
> > -Frank
> >
> >
> > The following changes since commit 651022382c7f8da46cb4872a545ee1da6d097d2a:
> >
> >   Linux 4.20-rc1 (2018-11-04 15:37:52 -0800)
> >
> > are available in the git repository at:
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/frowand/linux.git 
> > tags/kfree_validate_v7-for-4.20
> >
> > for you to fetch changes up to eeb07c573ec307c53fe2f6ac6d8d11c261f64006:
> >
> >   of: unittest: initialize args before calling of_*parse_*() (2018-11-08 
> > 22:12:37 -0800)
> >
> > 
> > Add checks to (1) overlay apply process and (2) memory freeing
> > triggered by overlay release.  The checks are intended to detect
> > possible memory leaks and invalid overlays.
> >
> > The checks revealed bugs in existing code.  Fixed the bugs.
> >
> > While fixing bugs, noted other issues, which are fixed in
> > separate patches.
> >
> > 
> > Frank Rowand (17):
> >   of: overlay: add tests to validate kfrees from overlay removal
> >   of: overlay: add missing of_node_put() after add new node to changeset
> >   of: overlay: add missing of_node_get() in __of_attach_node_sysfs
> >   powerpc/pseries: add of_node_put() in dlpar_detach_node()
> >   of: overlay: use prop add changeset entry for property in new nodes
> >   of: overlay: do not duplicate properties from overlay for new nodes
> >   of: overlay: reorder fields in struct fragment
> >   of: overlay: validate overlay properties #address-cells and 
> > #size-cells
> >   of: overlay: make all pr_debug() and pr_err() messages unique
> >   of: overlay: test case of two fragments adding same node
> >   of: overlay: check prevents multiple fragments add or delete same node
> >   of: overlay: check prevents multiple fragments touching same property
> >   of: unittest: remove unused of_unittest_apply_overlay() argument
> >   of: overlay: set node fields from properties when add new overlay node
> >   of: unittest: allow base devicetree to have symbol metadata
> >   of: unittest: find overlays[] entry by name instead of index
> >   of: unittest: initialize args before calling of_*parse_*()
> >
> >  arch/powerpc/platforms/pseries/dlpar.c |   2 +
> >  drivers/of/dynamic.c   |  59 -
> >  drivers/of/kobj.c  |   4 +-
> >  drivers/of/overlay.c   | 292 
> > -
> >  drivers/of/unittest-data/Makefile  |   2 +
> >  .../of/unittest-data/overlay_bad_add_dup_node.dts  |  28 ++
> >  .../of/unittest-data/overlay_bad_add_dup_prop.dts  |  24 ++
> >  drivers/of/unittest-data/overlay_base.dts  |   1 +
> >  drivers/of/unittest.c  |  96 +--
> >  include/linux/of.h |  21 +-
> >  10 files changed, 432 insertions(+), 97 deletions(-)
> >  create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_node.dts
> >  create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_prop.dts
> >
>


Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Peter Zijlstra
On Mon, Feb 11, 2019 at 11:35:24AM -0500, Waiman Long wrote:
> On 02/11/2019 06:58 AM, Peter Zijlstra wrote:
> > Which is clearly worse. Now we can write that as:
> >
> >   int __down_read_trylock2(unsigned long *l)
> >   {
> >   long tmp = READ_ONCE(*l);
> >
> >   while (tmp >= 0) {
> >   if (try_cmpxchg(l, &tmp, tmp + 1))
> >   return 1;
> >   }
> >
> >   return 0;
> >   }
> >
> > which generates:
> >
> >   0030 <__down_read_trylock2>:
> >   30:   48 8b 07                mov    (%rdi),%rax
> >   33:   48 85 c0                test   %rax,%rax
> >   36:   78 18                   js     50 <__down_read_trylock2+0x20>
> >   38:   48 8d 50 01             lea    0x1(%rax),%rdx
> >   3c:   f0 48 0f b1 17          lock cmpxchg %rdx,(%rdi)
> >   41:   75 f0                   jne    33 <__down_read_trylock2+0x3>
> >   43:   b8 01 00 00 00          mov    $0x1,%eax
> >   48:   c3                      retq
> >   49:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
> >   50:   31 c0                   xor    %eax,%eax
> >   52:   c3                      retq
> >
> > Which is a lot better; but not quite there yet.
> >
> >
> > I've tried quite a bit, but I can't seem to get GCC to generate the:
> >
> > add $1,%rdx
> > jle
> >
> > required; stuff like:
> >
> > new = old + 1;
> > if (new <= 0)
> >
> > generates:
> >
> > lea 0x1(%rax),%rdx
> > test %rdx, %rdx
> > jle
> 
> Thanks for the suggested code snippet. So you want to replace "lea
> 0x1(%rax), %rdx" by "add $1,%rdx"?
> 
> I think the compiler is doing that so as to use the address generation
> unit for addition instead of using the ALU. That will leave the ALU
> available for doing other arithmetic operations in parallel. I don't
> think it is a good idea to override the compiler and force it to use
> the ALU. So I am not going to try doing that. It is only 1 or 2 more
> instructions anyway.

Yeah, I was trying to see what I could make it do.. #2 really should be
good enough, but you know how it is once you're poking at it :-)


[PATCH] mmap.2: describe the 5level paging hack

2019-02-11 Thread Jann Horn
The manpage is missing information about the compatibility hack for
5-level paging that went in in 4.14, around commit ee00f4a32a76 ("x86/mm:
Allow userspace have mappings above 47-bit"). Add some information about
that.

While I don't think any hardware supporting this is shipping yet (?), I
think it's useful to try to write a manpage for this API, partly to
figure out how usable that API actually is, and partly because when this
hardware does ship, it'd be nice if distro manpages had information about
how to use it.

Signed-off-by: Jann Horn 
---
This patch goes on top of the patch "[PATCH] mmap.2: fix description of
treatment of the hint" that I just sent, but I'm not sending them in a
series because I want the first one to go in, and I think this one might
be a bit more controversial.

It would be nice if the architecture maintainers and mm folks could have
a look at this and check that what I wrote is right - I only looked at
the source for this, I haven't tried it.

 man2/mmap.2 | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/man2/mmap.2 b/man2/mmap.2
index 8556bbfeb..977782fa8 100644
--- a/man2/mmap.2
+++ b/man2/mmap.2
@@ -67,6 +67,8 @@ is NULL,
 then the kernel chooses the (page-aligned) address
 at which to create the mapping;
 this is the most portable method of creating a new mapping.
+On Linux, in this case, the kernel may limit the maximum address that can be
+used for allocations to a legacy limit for compatibility reasons.
 If
 .I addr
 is not NULL,
@@ -77,6 +79,19 @@ or equal to the value specified by
 and attempt to create the mapping there.
 If another mapping already exists there, the kernel picks a new
 address, independent of the hint.
+However, if a hint above the architecture's legacy address limit is provided
+(on x86-64: above 0x7000, on arm64: above 0x1, on ppc64 
with
+book3s: above 0x7fff or 0x3fff, depending on page size), the
+kernel is permitted to allocate mappings beyond the architecture's legacy
+address limit. The availability of such addresses is hardware-dependent.
+Therefore, if you want to be able to use the full virtual address space of
+hardware that supports addresses beyond the legacy range, you need to specify 
an
+address above that limit; however, for security reasons, you should avoid
+specifying a fixed valid address outside the compatibility range,
+since that would reduce the value of userspace address space layout
+randomization. Therefore, it is recommended to specify an address
+.I beyond
+the end of the userspace address space.
 .\" Before Linux 2.6.24, the address was rounded up to the next page
 .\" boundary; since 2.6.24, it is rounded down!
 The address of the new mapping is returned as the result of the call.
-- 
2.20.1.791.gb4d0f1c61a-goog

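
As a companion to the manpage text, a hedged sketch of the usage it
recommends; whether an address above the legacy limit is actually granted
depends on the hardware and kernel, and the hint below is simply "beyond
the end of the userspace address space":

#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

int main(void)
{
        /* Page-aligned hint past any possible userspace address. */
        void *hint = (void *)(UINTPTR_MAX & ~(uintptr_t)0xfff);
        void *p = mmap(hint, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        /* p may or may not lie above the legacy limit. */
        printf("mapped at %p\n", p);
        munmap(p, 4096);
        return 0;
}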


Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Waiman Long
On 02/11/2019 06:58 AM, Peter Zijlstra wrote:
> Which is clearly worse. Now we can write that as:
>
>   int __down_read_trylock2(unsigned long *l)
>   {
> long tmp = READ_ONCE(*l);
>
> while (tmp >= 0) {
> if (try_cmpxchg(l, &tmp, tmp + 1))
> return 1;
> }
>
> return 0;
>   }
>
> which generates:
>
>   0030 <__down_read_trylock2>:
>   30:   48 8b 07                mov    (%rdi),%rax
>   33:   48 85 c0                test   %rax,%rax
>   36:   78 18                   js     50 <__down_read_trylock2+0x20>
>   38:   48 8d 50 01             lea    0x1(%rax),%rdx
>   3c:   f0 48 0f b1 17          lock cmpxchg %rdx,(%rdi)
>   41:   75 f0                   jne    33 <__down_read_trylock2+0x3>
>   43:   b8 01 00 00 00          mov    $0x1,%eax
>   48:   c3                      retq
>   49:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
>   50:   31 c0                   xor    %eax,%eax
>   52:   c3                      retq
>
> Which is a lot better; but not quite there yet.
>
>
> I've tried quite a bit, but I can't seem to get GCC to generate the:
>
>   add $1,%rdx
>   jle
>
> required; stuff like:
>
>   new = old + 1;
>   if (new <= 0)
>
> generates:
>
>   lea 0x1(%rax),%rdx
>   test %rdx, %rdx
>   jle

Thanks for the suggested code snippet. So you want to replace "lea
0x1(%rax), %rdx" by "add $1,%rdx"?

I think the compiler is doing that so as to use the address generation
unit for addition instead of using the ALU. That will leave the ALU
available for doing other arithmetic operations in parallel. I don't
think it is a good idea to override the compiler and force it to use
the ALU. So I am not going to try doing that. It is only 1 or 2 more
instructions anyway.

Cheers,
Longman



Re: [PATCH v4 3/3] powerpc/32: Add KASAN support

2019-02-11 Thread Andrey Ryabinin



On 2/11/19 3:25 PM, Andrey Konovalov wrote:
> On Sat, Feb 9, 2019 at 12:55 PM christophe leroy
>  wrote:
>>
>> Hi Andrey,
>>
>> Le 08/02/2019 à 18:40, Andrey Konovalov a écrit :
>>> On Fri, Feb 8, 2019 at 6:17 PM Christophe Leroy  
>>> wrote:

 Hi Daniel,

 Le 08/02/2019 à 17:18, Daniel Axtens a écrit :
> Hi Christophe,
>
> I've been attempting to port this to 64-bit Book3e nohash (e6500),
> although I think I've ended up with an approach more similar to Aneesh's
> much earlier (2015) series for book3s.
>
> Part of this is just due to the changes between 32 and 64 bits - we need
> to hack around the discontiguous mappings - but one thing that I'm
> particularly puzzled by is what the kasan_early_init is supposed to do.

 It should not be a problem, as my patch uses a 'for_each_memblock(memory,
 reg)' loop.

>
>> +void __init kasan_early_init(void)
>> +{
>> +unsigned long addr = KASAN_SHADOW_START;
>> +unsigned long end = KASAN_SHADOW_END;
>> +unsigned long next;
>> +pmd_t *pmd = pmd_offset(pud_offset(pgd_offset_k(addr), addr), addr);
>> +int i;
>> +phys_addr_t pa = __pa(kasan_early_shadow_page);
>> +
>> +BUILD_BUG_ON(KASAN_SHADOW_START & ~PGDIR_MASK);
>> +
>> +if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE))
>> +panic("KASAN not supported with Hash MMU\n");
>> +
>> +for (i = 0; i < PTRS_PER_PTE; i++)
>> +__set_pte_at(&init_mm, (unsigned
>> long)kasan_early_shadow_page,
>> + kasan_early_shadow_pte + i,
>> + pfn_pte(PHYS_PFN(pa), PAGE_KERNEL_RO), 0);
>> +
>> +do {
>> +next = pgd_addr_end(addr, end);
>> +pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte);
>> +} while (pmd++, addr = next, addr != end);
>> +}
>
> As far as I can tell it's mapping the early shadow page, read-only, over
> the KASAN_SHADOW_START->KASAN_SHADOW_END range, and it's using the early
> shadow PTE array from the generic code.
>
> I haven't been able to find an answer to why this is in the docs, so I
> was wondering if you or anyone else could explain the early part of
> kasan init a bit better.

 See https://www.kernel.org/doc/html/latest/dev-tools/kasan.html for an
 explanation of the shadow.

 When shadow is 0, it means the memory area is entirely accessible.

 It is necessary to set up a shadow area as soon as possible because all
 data accesses check the shadow area, from the beginning (except for a few
 files where sanitizing has been disabled in Makefiles).

 Until the real shadow area is set, all accesses are granted thanks to the
 zero shadow area being full of zeros.
>>>
>>> Not entirely correct. kasan_early_init() indeed maps the whole shadow
>>> memory range to the same kasan_early_shadow_page. However as kernel
>>> loads and memory gets allocated this shadow page gets rewritten with
>>> non-zero values by different KASAN allocator hooks. Since these values
>>> come from completely different parts of the kernel, but all land on
>>> the same page, kasan_early_shadow_page's content can be considered
>>> garbage. When KASAN checks memory accesses for validity it detects
>>> these garbage shadow values, but doesn't print any reports, as the
>>> reporting routine bails out on the current->kasan_depth check (which
>>> has the value of 1 initially). Only after kasan_init() completes, when
>>> the proper shadow memory is mapped, current->kasan_depth gets set to 0
>>> and we start reporting bad accesses.
>>
>> That's surprising, because in the early phase I map the shadow area
>> read-only, so I do not expect it to get modified unless RO protection is
>> failing for some reason.
> 
> Actually it might be that the allocator hooks don't modify shadow at
> this point, as the allocator is not yet initialized. However stack
> should be getting poisoned and unpoisoned from the very start. But the
> generic statement that early shadow gets dirtied should be correct.
> Might it be that you don't use stack instrumentation?
> 

Yes, stack instrumentation is not used here, because the shadow offset which
we pass to the -fasan-shadow-offset= cflag is not specified here. So the
logic in scripts/Makefile.kasan just falls back to CFLAGS_KASAN_MINIMAL,
which is outline and without stack instrumentation.

Christophe, you can specify KASAN_SHADOW_OFFSET either in Kconfig (e.g. x86_64)
or in Makefile (e.g. arm64). And make the early mapping writable, because
compiler-generated code will write to shadow memory in the function
prologue/epilogue.
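
For reference, the reason the compiler needs both a shadow offset and a
writable early mapping is visible from the generic address-to-shadow
translation; this sketch mirrors the scheme in include/linux/kasan.h:

/*
 * One shadow byte tracks an 8-byte granule of real memory
 * (KASAN_SHADOW_SCALE_SHIFT == 3).
 */
static inline void *kasan_mem_to_shadow(const void *addr)
{
        return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
                + KASAN_SHADOW_OFFSET;
}

Instrumented function prologues both read this shadow (for the access
checks) and write it (to poison and unpoison stack redzones), so a
read-only early mapping can only survive while stack instrumentation is
disabled.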


Re: [PATCH v2 0/4] [powerpc] perf vendor events: Add JSON metrics for POWER9

2019-02-11 Thread Arnaldo Carvalho de Melo
On Sat, Feb 09, 2019 at 01:14:25PM -0500, Paul Clarke wrote:
> [Note this is for POWER*9* and is different content than a
> previous patchset for POWER*8*.]
> 
> The patches define metrics and metric groups for computation by "perf"
> for POWER9 processors.

Applied, thanks.

- Arnaldo


[PATCH 4.9 096/137] block/swim3: Fix -EBUSY error when re-opening device after unmount

2019-02-11 Thread Greg Kroah-Hartman
4.9-stable review patch.  If anyone has any objections, please let me know.

--

[ Upstream commit 296dcc40f2f2e402facf7cd26cf3f2c8f4b17d47 ]

When the block device is opened with FMODE_EXCL, ref_count is set to -1.
This value doesn't get reset when the device is closed which means the
device cannot be opened again. Fix this by checking for refcount <= 0
in the release method.

Reported-and-tested-by: Stan Johnson 
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Finn Thain 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
---
 drivers/block/swim3.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index c264f2d284a7..2e0a9e2531cb 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -1027,7 +1027,11 @@ static void floppy_release(struct gendisk *disk, fmode_t 
mode)
struct swim3 __iomem *sw = fs->swim3;
 
mutex_lock(&swim3_mutex);
-   if (fs->ref_count > 0 && --fs->ref_count == 0) {
+   if (fs->ref_count > 0)
+   --fs->ref_count;
+   else if (fs->ref_count == -1)
+   fs->ref_count = 0;
+   if (fs->ref_count == 0) {
swim3_action(fs, MOTOR_OFF);
out_8(&sw->control_bic, 0xff);
swim3_select(fs, RELAX);
-- 
2.19.1





[PATCH 4.14 153/205] block/swim3: Fix -EBUSY error when re-opening device after unmount

2019-02-11 Thread Greg Kroah-Hartman
4.14-stable review patch.  If anyone has any objections, please let me know.

--

[ Upstream commit 296dcc40f2f2e402facf7cd26cf3f2c8f4b17d47 ]

When the block device is opened with FMODE_EXCL, ref_count is set to -1.
This value doesn't get reset when the device is closed which means the
device cannot be opened again. Fix this by checking for refcount <= 0
in the release method.

Reported-and-tested-by: Stan Johnson 
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Finn Thain 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
---
 drivers/block/swim3.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index 0d7527c6825a..2f7acdb830c3 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -1027,7 +1027,11 @@ static void floppy_release(struct gendisk *disk, fmode_t 
mode)
struct swim3 __iomem *sw = fs->swim3;
 
mutex_lock(&swim3_mutex);
-   if (fs->ref_count > 0 && --fs->ref_count == 0) {
+   if (fs->ref_count > 0)
+   --fs->ref_count;
+   else if (fs->ref_count == -1)
+   fs->ref_count = 0;
+   if (fs->ref_count == 0) {
swim3_action(fs, MOTOR_OFF);
out_8(&sw->control_bic, 0xff);
swim3_select(fs, RELAX);
-- 
2.19.1





Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-02-11 Thread Michal Hocko
On Mon 11-02-19 14:49:09, Ingo Molnar wrote:
> 
> * Michal Hocko  wrote:
> 
> > On Thu 24-01-19 11:10:50, Dave Hansen wrote:
> > > On 1/24/19 6:17 AM, Michal Hocko wrote:
> > > and nr_cpus set to 4. The underlying reason is that the device is bound
> > > > to node 2 which doesn't have any memory and init_cpu_to_node only
> > > initializes memory-less nodes for possible cpus which nr_cpus restricts.
> > > > This in turn means that proper zonelists are not allocated and the page
> > > > allocator blows up.
> > > 
> > > This looks OK to me.
> > > 
> > > Could we add a few DEBUG_VM checks that *look* for these invalid
> > > zonelists?  Or, would our existing list debugging have caught this?
> > 
> > Currently we simply blow up because those zonelists are NULL. I do not
> > think we have a way to check whether an existing zonelist is actually 
> > _correct_ other thatn check it for NULL. But what would we do in the
> > later case?
> > 
> > > Basically, is this bug also a sign that we need better debugging around
> > > this?
> > 
> > My earlier patch had a debugging printk to display the zonelists and
> > that might be worthwhile I guess. Basically something like this
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2e097f336126..c30d59f803fb 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5259,6 +5259,11 @@ static void build_zonelists(pg_data_t *pgdat)
> >  
> > build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
> > build_thisnode_zonelists(pgdat);
> > +
> > +   pr_info("node[%d] zonelist: ", pgdat->node_id);
> > +   for_each_zone_zonelist(zone, z, 
> > &pgdat->node_zonelists[ZONELIST_FALLBACK], MAX_NR_ZONES-1)
> > +   pr_cont("%d:%s ", zone_to_nid(zone), zone->name);
> > +   pr_cont("\n");
> >  }
> 
> Looks like this patch fell through the cracks - any update on this?

I was waiting for some feedback. As there were no complaints about the
above debugging output I will make it a separate patch and post both
patches later this week. I just have to go through my backlog pile after
vacation.
-- 
Michal Hocko
SUSE Labs


[PATCH 4.19 233/313] block/swim3: Fix -EBUSY error when re-opening device after unmount

2019-02-11 Thread Greg Kroah-Hartman
4.19-stable review patch.  If anyone has any objections, please let me know.

--

[ Upstream commit 296dcc40f2f2e402facf7cd26cf3f2c8f4b17d47 ]

When the block device is opened with FMODE_EXCL, ref_count is set to -1.
This value doesn't get reset when the device is closed which means the
device cannot be opened again. Fix this by checking for refcount <= 0
in the release method.

Reported-and-tested-by: Stan Johnson 
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Finn Thain 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
---
 drivers/block/swim3.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index 469541c1e51e..20907a0a043b 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -1026,7 +1026,11 @@ static void floppy_release(struct gendisk *disk, fmode_t 
mode)
struct swim3 __iomem *sw = fs->swim3;
 
mutex_lock(&swim3_mutex);
-   if (fs->ref_count > 0 && --fs->ref_count == 0) {
+   if (fs->ref_count > 0)
+   --fs->ref_count;
+   else if (fs->ref_count == -1)
+   fs->ref_count = 0;
+   if (fs->ref_count == 0) {
swim3_action(fs, MOTOR_OFF);
out_8(>control_bic, 0xff);
swim3_select(fs, RELAX);
-- 
2.19.1





Re: [PATCH v3 1/7] dump_stack: Support adding to the dump stack arch description

2019-02-11 Thread Petr Mladek
On Mon 2019-02-11 13:50:35, Andrea Parri wrote:
> Hi Michael,
> 
> 
> On Thu, Feb 07, 2019 at 11:46:29PM +1100, Michael Ellerman wrote:
> > Arch code can set a "dump stack arch description string" which is
> > displayed with oops output to describe the hardware platform.
> > 
> > It is useful to initialise this as early as possible, so that an early
> > oops will have the hardware description.
> > 
> > However in practice we discover the hardware platform in stages, so it
> > would be useful to be able to incrementally fill in the hardware
> > description as we discover it.
> > 
> > This patch adds that ability, by creating dump_stack_add_arch_desc().
> > 
> > If there is no existing string it behaves exactly like
> > dump_stack_set_arch_desc(). However if there is an existing string it
> > appends to it, with a leading space.
> > 
> > This makes it easy to call it multiple times from different parts of the
> > code and get a reasonable looking result.
> > 
> > Signed-off-by: Michael Ellerman 
> > ---
> >  include/linux/printk.h |  5 
> >  lib/dump_stack.c   | 58 ++
> >  2 files changed, 63 insertions(+)
> > 
> > v3: No change, just widened Cc list.
> > 
> > v2: Add a smp_wmb() and comment.
> > 
> > v1 is here for reference 
> > https://lore.kernel.org/lkml/1430824337-15339-1-git-send-email-...@ellerman.id.au/
> > 
> > I'll take this series via the powerpc tree if no one minds?
> > 
> > 
> > diff --git a/include/linux/printk.h b/include/linux/printk.h
> > index 77740a506ebb..d5fb4f960271 100644
> > --- a/include/linux/printk.h
> > +++ b/include/linux/printk.h
> > @@ -198,6 +198,7 @@ u32 log_buf_len_get(void);
> >  void log_buf_vmcoreinfo_setup(void);
> >  void __init setup_log_buf(int early);
> >  __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...);
> > +__printf(1, 2) void dump_stack_add_arch_desc(const char *fmt, ...);
> >  void dump_stack_print_info(const char *log_lvl);
> >  void show_regs_print_info(const char *log_lvl);
> >  extern asmlinkage void dump_stack(void) __cold;
> > @@ -256,6 +257,10 @@ static inline __printf(1, 2) void 
> > dump_stack_set_arch_desc(const char *fmt, ...)
> >  {
> >  }
> >  
> > +static inline __printf(1, 2) void dump_stack_add_arch_desc(const char 
> > *fmt, ...)
> > +{
> > +}
> > +
> >  static inline void dump_stack_print_info(const char *log_lvl)
> >  {
> >  }
> > diff --git a/lib/dump_stack.c b/lib/dump_stack.c
> > index 5cff72f18c4a..69b710ff92b5 100644
> > --- a/lib/dump_stack.c
> > +++ b/lib/dump_stack.c
> > @@ -35,6 +35,64 @@ void __init dump_stack_set_arch_desc(const char *fmt, 
> > ...)
> > va_end(args);
> >  }
> >  
> > +/**
> > + * dump_stack_add_arch_desc - add arch-specific info to show with task 
> > dumps
> > + * @fmt: printf-style format string
> > + * @...: arguments for the format string
> > + *
> > + * See dump_stack_set_arch_desc() for why you'd want to use this.
> > + *
> > + * This version adds to any existing string already created with either
> > + * dump_stack_set_arch_desc() or dump_stack_add_arch_desc(). If there is an
> > + * existing string a space will be prepended to the passed string.
> > + */
> > +void __init dump_stack_add_arch_desc(const char *fmt, ...)
> > +{
> > +   va_list args;
> > +   int pos, len;
> > +   char *p;
> > +
> > +   /*
> > +* If there's an existing string we snprintf() past the end of it, and
> > +* then turn the terminating NULL of the existing string into a space
> > +* to create one string separated by a space.
> > +*
> > +* If there's no existing string we just snprintf() to the buffer, like
> > +* dump_stack_set_arch_desc(), but without calling it because we'd need
> > +* a varargs version.
> > +*/
> > +   len = strnlen(dump_stack_arch_desc_str, 
> > sizeof(dump_stack_arch_desc_str));
> > +   pos = len;
> > +
> > +   if (len)
> > +   pos++;
> > +
> > +   if (pos >= sizeof(dump_stack_arch_desc_str))
> > +   return; /* Ran out of space */
> > +
> > +   p = &dump_stack_arch_desc_str[pos];
> > +
> > +   va_start(args, fmt);
> > +   vsnprintf(p, sizeof(dump_stack_arch_desc_str) - pos, fmt, args);
> > +   va_end(args);
> > +
> > +   if (len) {
> > +   /*
> > +* Order the stores above in vsnprintf() vs the store of the
> > +* space below which joins the two strings. Note this doesn't
> > +* make the code truly race free because there is no barrier on
> > +* the read side. ie. Another CPU might load the uninitialised
> > +* tail of the buffer first and then the space below (rather
> > +* than the NULL that was there previously), and so print the
> > +* uninitialised tail. But the whole string lives in BSS so in
> > +* practice it should just see NULLs.
> 
> The comment doesn't say _why_ we need to order these stores: IOW, what
> will or can go wrong without this order?  This isn't clear to me.
>
> Another good 

[PATCH 4.20 278/352] block/swim3: Fix regression on PowerBook G3

2019-02-11 Thread Greg Kroah-Hartman
4.20-stable review patch.  If anyone has any objections, please let me know.

--

[ Upstream commit 427c5ce4417cba0801fbf79c8525d1330704759c ]

As of v4.20, the swim3 driver crashes when loaded on a PowerBook G3
(Wallstreet).

MacIO PCI driver attached to Gatwick chipset
MacIO PCI driver attached to Heathrow chipset
swim3 0.00015000:floppy: [fd0] SWIM3 floppy controller in media bay
0.00013020:ch-a: ttyS0 at MMIO 0xf3013020 (irq = 16, base_baud = 230400) is a 
Z85c30 ESCC - Serial port
0.00013000:ch-b: ttyS1 at MMIO 0xf3013000 (irq = 17, base_baud = 230400) is a 
Z85c30 ESCC - Infrared port
macio: fixed media-bay irq on gatwick
macio: fixed left floppy irqs
swim3 1.00015000:floppy: [fd1] Couldn't request interrupt
Unable to handle kernel paging request for data at address 0x0024
Faulting instruction address: 0xc02652f8
Oops: Kernel access of bad area, sig: 11 [#1]
BE SMP NR_CPUS=2 PowerMac
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.20.0 #2
NIP:  c02652f8 LR: c026915c CTR: c0276d1c
REGS: df43ba10 TRAP: 0300   Not tainted  (4.20.0)
MSR:  9032   CR: 28228288  XER: 0100
DAR: 0024 DSISR: 4000
GPR00: c026915c df43bac0 df439060 c0731524 df494700  c06e1c08 0001
GPR08: 0001  df5ff220 1032 28228282  c0004ca4 
GPR16:    c073144c dfffe064 c0731524 0120 c0586108
GPR24: c073132c c073143c c073143c  c0731524 df67cd70 df494700 0001
NIP [c02652f8] blk_mq_free_rqs+0x28/0xf8
LR [c026915c] blk_mq_sched_tags_teardown+0x58/0x84
Call Trace:
[df43bac0] [c0045f50] flush_workqueue_prep_pwqs+0x178/0x1c4 (unreliable)
[df43bae0] [c026915c] blk_mq_sched_tags_teardown+0x58/0x84
[df43bb00] [c02697f0] blk_mq_exit_sched+0x9c/0xb8
[df43bb20] [c0252794] elevator_exit+0x84/0xa4
[df43bb40] [c0256538] blk_exit_queue+0x30/0x50
[df43bb50] [c0256640] blk_cleanup_queue+0xe8/0x184
[df43bb70] [c034732c] swim3_attach+0x330/0x5f0
[df43bbb0] [c034fb24] macio_device_probe+0x58/0xec
[df43bbd0] [c032ba88] really_probe+0x1e4/0x2f4
[df43bc00] [c032bd28] driver_probe_device+0x64/0x204
[df43bc20] [c0329ac4] bus_for_each_drv+0x60/0xac
[df43bc50] [c032b824] __device_attach+0xe8/0x160
[df43bc80] [c032ab38] bus_probe_device+0xa0/0xbc
[df43bca0] [c0327338] device_add+0x3d8/0x630
[df43bcf0] [c0350848] macio_add_one_device+0x444/0x48c
[df43bd50] [c03509f8] macio_pci_add_devices+0x168/0x1bc
[df43bd90] [c03500ec] macio_pci_probe+0xc0/0x10c
[df43bda0] [c02ad884] pci_device_probe+0xd4/0x184
[df43bdd0] [c032ba88] really_probe+0x1e4/0x2f4
[df43be00] [c032bd28] driver_probe_device+0x64/0x204
[df43be20] [c032bfcc] __driver_attach+0x104/0x108
[df43be40] [c0329a00] bus_for_each_dev+0x64/0xb4
[df43be70] [c032add8] bus_add_driver+0x154/0x238
[df43be90] [c032ca24] driver_register+0x84/0x148
[df43bea0] [c0004aa0] do_one_initcall+0x40/0x188
[df43bf00] [c0690100] kernel_init_freeable+0x138/0x1d4
[df43bf30] [c0004cbc] kernel_init+0x18/0x10c
[df43bf40] [c00121e4] ret_from_kernel_thread+0x14/0x1c
Instruction dump:
5484d97e 4bfff4f4 9421ffe0 7c0802a6 bf410008 7c9e2378 90010024 8124005c
2f89 419e0078 81230004 7c7c1b78 <81290024> 2f89 419e0064 8144
---[ end trace 12025ab921a9784c ]---

Reverting commit 8ccb8cb1892b ("swim3: convert to blk-mq") resolves the
problem.

That commit added a struct blk_mq_tag_set to struct floppy_state and
initialized it with a blk_mq_init_sq_queue() call. Unfortunately, there
is a memset() in swim3_add_device() that subsequently clears the
floppy_state struct. That means fs->tag_set->ops is a NULL pointer, and
it gets dereferenced by blk_mq_free_rqs() which gets called in the
request_irq() error path. Move the memset() to fix this bug.

BTW, the request_irq() failure for the left mediabay floppy (fd1) is not
a regression. I don't know why it happens. The right media bay floppy
(fd0) works fine however.

Reported-and-tested-by: Stan Johnson 
Fixes: 8ccb8cb1892b ("swim3: convert to blk-mq")
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Finn Thain 

Signed-off-by: Jens Axboe 

Signed-off-by: Sasha Levin 
---
 drivers/block/swim3.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index 3f6df3f1f5d9..1046459f172b 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -1091,8 +1091,6 @@ static int swim3_add_device(struct macio_dev *mdev, int 
index)
struct floppy_state *fs = &floppy_states[index];
int rc = -EBUSY;
 
-   /* Do this first for message macros */
-   memset(fs, 0, sizeof(*fs));
fs->mdev = mdev;
fs->index = index;
 
@@ -1192,14 +1190,15 @@ static int swim3_attach(struct macio_dev *mdev,
return rc;
}
 
-   fs = &floppy_states[floppy_count];
-
disk = alloc_disk(1);
if (disk == NULL) {
rc = -ENOMEM;
goto out_unregister;
}
 
+   fs = &floppy_states[floppy_count];
+   memset(fs, 0, sizeof(*fs));
+

[PATCH 4.20 273/352] block/swim3: Fix -EBUSY error when re-opening device after unmount

2019-02-11 Thread Greg Kroah-Hartman
4.20-stable review patch.  If anyone has any objections, please let me know.

--

[ Upstream commit 296dcc40f2f2e402facf7cd26cf3f2c8f4b17d47 ]

When the block device is opened with FMODE_EXCL, ref_count is set to -1.
This value doesn't get reset when the device is closed which means the
device cannot be opened again. Fix this by checking for refcount <= 0
in the release method.

Reported-and-tested-by: Stan Johnson 
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Finn Thain 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
---
 drivers/block/swim3.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index c1c676a33e4a..3f6df3f1f5d9 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -995,7 +995,11 @@ static void floppy_release(struct gendisk *disk, fmode_t 
mode)
struct swim3 __iomem *sw = fs->swim3;
 
mutex_lock(&swim3_mutex);
-   if (fs->ref_count > 0 && --fs->ref_count == 0) {
+   if (fs->ref_count > 0)
+   --fs->ref_count;
+   else if (fs->ref_count == -1)
+   fs->ref_count = 0;
+   if (fs->ref_count == 0) {
swim3_action(fs, MOTOR_OFF);
out_8(&sw->control_bic, 0xff);
swim3_select(fs, RELAX);
-- 
2.19.1





Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-02-11 Thread Ingo Molnar


* Michal Hocko  wrote:

> On Thu 24-01-19 11:10:50, Dave Hansen wrote:
> > On 1/24/19 6:17 AM, Michal Hocko wrote:
> > > and nr_cpus set to 4. The underlying reason is that the device is bound
> > > to node 2 which doesn't have any memory and init_cpu_to_node only
> > > initializes memory-less nodes for possible cpus which nr_cpus restricts.
> > > This in turn means that proper zonelists are not allocated and the page
> > > allocator blows up.
> > 
> > This looks OK to me.
> > 
> > Could we add a few DEBUG_VM checks that *look* for these invalid
> > zonelists?  Or, would our existing list debugging have caught this?
> 
> Currently we simply blow up because those zonelists are NULL. I do not
> think we have a way to check whether an existing zonelist is actually 
> _correct_ other thatn check it for NULL. But what would we do in the
> later case?
> 
> > Basically, is this bug also a sign that we need better debugging around
> > this?
> 
> My earlier patch had a debugging printk to display the zonelists and
> that might be worthwhile I guess. Basically something like this
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2e097f336126..c30d59f803fb 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5259,6 +5259,11 @@ static void build_zonelists(pg_data_t *pgdat)
>  
>   build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
>   build_thisnode_zonelists(pgdat);
> +
> + pr_info("node[%d] zonelist: ", pgdat->node_id);
> + for_each_zone_zonelist(zone, z, 
> &pgdat->node_zonelists[ZONELIST_FALLBACK], MAX_NR_ZONES-1)
> + pr_cont("%d:%s ", zone_to_nid(zone), zone->name);
> + pr_cont("\n");
>  }

Looks like this patch fell through the cracks - any update on this?

Thanks,

Ingo


[PATCH 10/12] dma-mapping: simplify allocations from per-device coherent memory

2019-02-11 Thread Christoph Hellwig
All users of per-device coherent memory are exclusive; that is, if we can't
allocate from the per-device pool we can't use the system memory either.
Unfold the current dma_{alloc,free}_from_dev_coherent implementation and
always use the per-device pool if it exists.

Signed-off-by: Christoph Hellwig 
---
 arch/arm/mm/dma-mapping-nommu.c | 12 ++---
 include/linux/dma-mapping.h | 14 ++
 kernel/dma/coherent.c   | 89 -
 kernel/dma/internal.h   | 19 +++
 kernel/dma/mapping.c| 12 +++--
 5 files changed, 55 insertions(+), 91 deletions(-)
 create mode 100644 kernel/dma/internal.h

diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
index f304b10e23a4..c72f024f1e82 100644
--- a/arch/arm/mm/dma-mapping-nommu.c
+++ b/arch/arm/mm/dma-mapping-nommu.c
@@ -70,16 +70,10 @@ static void arm_nommu_dma_free(struct device *dev, size_t 
size,
   void *cpu_addr, dma_addr_t dma_addr,
   unsigned long attrs)
 {
-   if (attrs & DMA_ATTR_NON_CONSISTENT) {
+   if (attrs & DMA_ATTR_NON_CONSISTENT)
dma_direct_free_pages(dev, size, cpu_addr, dma_addr, attrs);
-   } else {
-   int ret = dma_release_from_global_coherent(get_order(size),
-  cpu_addr);
-
-   WARN_ON_ONCE(ret == 0);
-   }
-
-   return;
+   else
+   dma_release_from_global_coherent(size, cpu_addr);
 }
 
 static int arm_nommu_dma_mmap(struct device *dev, struct vm_area_struct *vma,
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index b12fba725f19..018e37a0870e 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -158,30 +158,24 @@ static inline int is_device_dma_capable(struct device 
*dev)
  * These three functions are only for dma allocator.
  * Don't use them in device drivers.
  */
-int dma_alloc_from_dev_coherent(struct device *dev, ssize_t size,
-  dma_addr_t *dma_handle, void **ret);
-int dma_release_from_dev_coherent(struct device *dev, int order, void *vaddr);
-
 int dma_mmap_from_dev_coherent(struct device *dev, struct vm_area_struct *vma,
void *cpu_addr, size_t size, int *ret);
 
-void *dma_alloc_from_global_coherent(ssize_t size, dma_addr_t *dma_handle);
-int dma_release_from_global_coherent(int order, void *vaddr);
+void *dma_alloc_from_global_coherent(size_t size, dma_addr_t *dma_handle);
+void dma_release_from_global_coherent(size_t size, void *vaddr);
 int dma_mmap_from_global_coherent(struct vm_area_struct *vma, void *cpu_addr,
  size_t size, int *ret);
 
 #else
-#define dma_alloc_from_dev_coherent(dev, size, handle, ret) (0)
-#define dma_release_from_dev_coherent(dev, order, vaddr) (0)
 #define dma_mmap_from_dev_coherent(dev, vma, vaddr, order, ret) (0)
 
-static inline void *dma_alloc_from_global_coherent(ssize_t size,
+static inline void *dma_alloc_from_global_coherent(size_t size,
   dma_addr_t *dma_handle)
 {
return NULL;
 }
 
-static inline int dma_release_from_global_coherent(int order, void *vaddr)
+static inline void dma_release_from_global_coherent(size_t size, void *vaddr)
 {
return 0;
 }
diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c
index 29fd6590dc1e..d1da1048e470 100644
--- a/kernel/dma/coherent.c
+++ b/kernel/dma/coherent.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include "internal.h"
 
 struct dma_coherent_mem {
void*virt_base;
@@ -21,13 +22,6 @@ struct dma_coherent_mem {
 
 static struct dma_coherent_mem *dma_coherent_default_memory __ro_after_init;
 
-static inline struct dma_coherent_mem *dev_get_coherent_memory(struct device 
*dev)
-{
-   if (dev && dev->dma_mem)
-   return dev->dma_mem;
-   return NULL;
-}
-
 static inline dma_addr_t dma_get_device_base(struct device *dev,
 struct dma_coherent_mem * mem)
 {
@@ -135,8 +129,8 @@ void dma_release_declared_memory(struct device *dev)
 }
 EXPORT_SYMBOL(dma_release_declared_memory);
 
-static void *__dma_alloc_from_coherent(struct dma_coherent_mem *mem,
-   ssize_t size, dma_addr_t *dma_handle)
+void *__dma_alloc_from_coherent(struct dma_coherent_mem *mem, size_t size,
+   dma_addr_t *dma_handle)
 {
int order = get_order(size);
unsigned long flags;
@@ -165,33 +159,7 @@ static void *__dma_alloc_from_coherent(struct 
dma_coherent_mem *mem,
return NULL;
 }
 
-/**
- * dma_alloc_from_dev_coherent() - allocate memory from device coherent pool
- * @dev:   device from which we allocate memory
- * @size:  size of requested memory area
- * @dma_handle:This will be filled with the correct dma handle
- * @ret:   This pointer will be filled with the 

[PATCH 12/12] dma-mapping: remove dma_assign_coherent_memory

2019-02-11 Thread Christoph Hellwig
The only useful bit in this function was the already assigned check.
Once that is moved to dma_init_coherent_memory, the rest can easily
be handled in the two callers.

Signed-off-by: Christoph Hellwig 
---
 kernel/dma/coherent.c | 47 +--
 1 file changed, 14 insertions(+), 33 deletions(-)

diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c
index d7a27008f228..1e3ce71cd993 100644
--- a/kernel/dma/coherent.c
+++ b/kernel/dma/coherent.c
@@ -41,6 +41,9 @@ static int dma_init_coherent_memory(phys_addr_t phys_addr,
int bitmap_size = BITS_TO_LONGS(pages) * sizeof(long);
int ret;
 
+   if (*mem)
+   return -EBUSY;
+
if (!size) {
ret = -EINVAL;
goto out;
@@ -88,33 +91,11 @@ static void dma_release_coherent_memory(struct 
dma_coherent_mem *mem)
kfree(mem);
 }
 
-static int dma_assign_coherent_memory(struct device *dev,
- struct dma_coherent_mem *mem)
-{
-   if (!dev)
-   return -ENODEV;
-
-   if (dev->dma_mem)
-   return -EBUSY;
-
-   dev->dma_mem = mem;
-   return 0;
-}
-
 int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr,
dma_addr_t device_addr, size_t size)
 {
-   struct dma_coherent_mem *mem;
-   int ret;
-
-   ret = dma_init_coherent_memory(phys_addr, device_addr, size, &mem);
-   if (ret)
-   return ret;
-
-   ret = dma_assign_coherent_memory(dev, mem);
-   if (ret)
-   dma_release_coherent_memory(mem);
-   return ret;
+   return dma_init_coherent_memory(phys_addr, device_addr, size,
+   &dev->dma_mem);
 }
 EXPORT_SYMBOL(dma_declare_coherent_memory);
 
@@ -238,18 +219,18 @@ static int rmem_dma_device_init(struct reserved_mem 
*rmem, struct device *dev)
struct dma_coherent_mem *mem = rmem->priv;
int ret;
 
-   if (!mem) {
-   ret = dma_init_coherent_memory(rmem->base, rmem->base,
-  rmem->size, &mem);
-   if (ret) {
-   pr_err("Reserved memory: failed to init DMA memory pool 
at %pa, size %ld MiB\n",
-   &rmem->base, (unsigned long)rmem->size / SZ_1M);
-   return ret;
-   }
+   ret = dma_init_coherent_memory(rmem->base, rmem->base, rmem->size,
+   &mem);
+   if (ret && ret != -EBUSY) {
+   pr_err("Reserved memory: failed to init DMA memory pool at %pa, 
size %ld MiB\n",
+   &rmem->base, (unsigned long)rmem->size / SZ_1M);
+   return ret;
}
+
mem->use_dev_dma_pfn_offset = true;
+   if (dev)
+   dev->dma_mem = mem;
rmem->priv = mem;
-   dma_assign_coherent_memory(dev, mem);
return 0;
 }
 
-- 
2.20.1



[PATCH 11/12] dma-mapping: handle per-device coherent memory mmap in common code

2019-02-11 Thread Christoph Hellwig
We handle allocation and freeing in common code, so we should handle
mmap the same way.  Also, all users of per-device coherent memory are
exclusive; that is, if we can't allocate from the per-device pool we
can't use the system memory either.  Unfold the current
dma_mmap_from_dev_coherent implementation and always use the
per-device pool if it exists.
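
As a rough sketch of the resulting flow (my illustration, not part of
the patch; the real dispatch sits in kernel/dma/mapping.c and is only
partly visible in the hunks below):

static int dma_mmap_sketch(struct device *dev, struct vm_area_struct *vma,
		void *cpu_addr, dma_addr_t dma_addr, size_t size,
		unsigned long attrs)
{
	/* per-device pools are exclusive: if the device has one, the
	 * buffer came from it, so the mmap must come from it too */
	if (dev->dma_mem)
		return __dma_mmap_from_coherent(dev->dma_mem, vma,
						cpu_addr, size);
	return dma_common_mmap(dev, vma, cpu_addr, dma_addr, size, attrs);
}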

Signed-off-by: Christoph Hellwig 
---
 arch/arm/mm/dma-mapping-nommu.c |  7 ++--
 arch/arm/mm/dma-mapping.c   |  3 --
 arch/arm64/mm/dma-mapping.c |  3 --
 include/linux/dma-mapping.h | 11 ++-
 kernel/dma/coherent.c   | 58 -
 kernel/dma/internal.h   |  2 ++
 kernel/dma/mapping.c|  8 ++---
 7 files changed, 24 insertions(+), 68 deletions(-)

diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
index c72f024f1e82..4eeb7e5d9c07 100644
--- a/arch/arm/mm/dma-mapping-nommu.c
+++ b/arch/arm/mm/dma-mapping-nommu.c
@@ -80,11 +80,8 @@ static int arm_nommu_dma_mmap(struct device *dev, struct 
vm_area_struct *vma,
  void *cpu_addr, dma_addr_t dma_addr, size_t size,
  unsigned long attrs)
 {
-   int ret;
-
-   if (dma_mmap_from_global_coherent(vma, cpu_addr, size, ))
-   return ret;
-
+   if (!(attrs & DMA_ATTR_NON_CONSISTENT))
+   return dma_mmap_from_global_coherent(vma, cpu_addr, size);
return dma_common_mmap(dev, vma, cpu_addr, dma_addr, size, attrs);
 }
 
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 3c8534904209..e2993e5a7166 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -830,9 +830,6 @@ static int __arm_dma_mmap(struct device *dev, struct 
vm_area_struct *vma,
unsigned long pfn = dma_to_pfn(dev, dma_addr);
unsigned long off = vma->vm_pgoff;
 
-   if (dma_mmap_from_dev_coherent(dev, vma, cpu_addr, size, ))
-   return ret;
-
if (off < nr_pages && nr_vma_pages <= (nr_pages - off)) {
ret = remap_pfn_range(vma, vma->vm_start,
  pfn + off,
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 78c0a72f822c..a55be91c1d1a 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -246,9 +246,6 @@ static int __iommu_mmap_attrs(struct device *dev, struct 
vm_area_struct *vma,
 
vma->vm_page_prot = arch_dma_mmap_pgprot(dev, vma->vm_page_prot, attrs);
 
-   if (dma_mmap_from_dev_coherent(dev, vma, cpu_addr, size, ))
-   return ret;
-
if (attrs & DMA_ATTR_FORCE_CONTIGUOUS) {
/*
 * DMA_ATTR_FORCE_CONTIGUOUS allocations are always remapped,
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 018e37a0870e..ae6fe66f97b7 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -158,17 +158,12 @@ static inline int is_device_dma_capable(struct device 
*dev)
  * These three functions are only for dma allocator.
  * Don't use them in device drivers.
  */
-int dma_mmap_from_dev_coherent(struct device *dev, struct vm_area_struct *vma,
-   void *cpu_addr, size_t size, int *ret);
-
 void *dma_alloc_from_global_coherent(size_t size, dma_addr_t *dma_handle);
 void dma_release_from_global_coherent(size_t size, void *vaddr);
 int dma_mmap_from_global_coherent(struct vm_area_struct *vma, void *cpu_addr,
- size_t size, int *ret);
+ size_t size);
 
 #else
-#define dma_mmap_from_dev_coherent(dev, vma, vaddr, order, ret) (0)
-
 static inline void *dma_alloc_from_global_coherent(size_t size,
   dma_addr_t *dma_handle)
 {
@@ -177,12 +172,10 @@ static inline void *dma_alloc_from_global_coherent(size_t 
size,
 
 static inline void dma_release_from_global_coherent(size_t size, void *vaddr)
 {
-   return 0;
 }
 
 static inline int dma_mmap_from_global_coherent(struct vm_area_struct *vma,
-   void *cpu_addr, size_t size,
-   int *ret)
+   void *cpu_addr, size_t size)
 {
return 0;
 }
diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c
index d1da1048e470..d7a27008f228 100644
--- a/kernel/dma/coherent.c
+++ b/kernel/dma/coherent.c
@@ -197,60 +197,30 @@ void dma_release_from_global_coherent(size_t size, void 
*vaddr)
__dma_release_from_coherent(dma_coherent_default_memory, size, vaddr);
 }
 
-static int __dma_mmap_from_coherent(struct dma_coherent_mem *mem,
-   struct vm_area_struct *vma, void *vaddr, size_t size, int *ret)
+int __dma_mmap_from_coherent(struct dma_coherent_mem *mem,
+   struct vm_area_struct *vma, void *vaddr, size_t size)
 {
-   if (mem && vaddr >= mem->virt_base && vaddr + size <=
-  

[PATCH 09/12] dma-mapping: remove the DMA_MEMORY_EXCLUSIVE flag

2019-02-11 Thread Christoph Hellwig
All users of dma_declare_coherent want their allocations to be
exclusive, so default to exclusive allocations.
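
For callers this just means dropping the flag argument. A before/after
sketch based on the imx27 call site in the diff below:

	/* before: */
	dma_declare_coherent_memory(&pdev->dev, mx2_camera_base,
				    mx2_camera_base, MX2_CAMERA_BUF_SIZE,
				    DMA_MEMORY_EXCLUSIVE);

	/* after: exclusive is the only behaviour, so the flag goes away */
	dma_declare_coherent_memory(&pdev->dev, mx2_camera_base,
				    mx2_camera_base, MX2_CAMERA_BUF_SIZE);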

Signed-off-by: Christoph Hellwig 
---
 Documentation/DMA-API.txt |  9 +--
 arch/arm/mach-imx/mach-imx27_visstrim_m10.c   | 12 +++--
 arch/arm/mach-imx/mach-mx31moboard.c  |  3 +--
 arch/sh/boards/mach-ap325rxa/setup.c  |  5 ++--
 arch/sh/boards/mach-ecovec24/setup.c  |  6 ++---
 arch/sh/boards/mach-kfr2r09/setup.c   |  5 ++--
 arch/sh/boards/mach-migor/setup.c |  5 ++--
 arch/sh/boards/mach-se/7724/setup.c   |  6 ++---
 arch/sh/drivers/pci/fixups-dreamcast.c|  3 +--
 .../soc_camera/sh_mobile_ceu_camera.c |  3 +--
 drivers/usb/host/ohci-sm501.c |  3 +--
 drivers/usb/host/ohci-tmio.c  |  2 +-
 include/linux/dma-mapping.h   |  7 ++
 kernel/dma/coherent.c | 25 ++-
 14 files changed, 29 insertions(+), 65 deletions(-)

diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index b9d0cba83877..38e561b773b4 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -566,8 +566,7 @@ boundaries when doing this.
 
int
dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr,
-   dma_addr_t device_addr, size_t size, int
-   flags)
+   dma_addr_t device_addr, size_t size);
 
 Declare region of memory to be handed out by dma_alloc_coherent() when
 it's asked for coherent memory for this device.
@@ -581,12 +580,6 @@ dma_addr_t in dma_alloc_coherent()).
 
 size is the size of the area (must be multiples of PAGE_SIZE).
 
-flags can be ORed together and are:
-
-- DMA_MEMORY_EXCLUSIVE - only allocate memory from the declared regions.
-  Do not allow dma_alloc_coherent() to fall back to system memory when
-  it's out of memory in the declared region.
-
 As a simplification for the platforms, only *one* such region of
 memory may be declared per device.
 
diff --git a/arch/arm/mach-imx/mach-imx27_visstrim_m10.c 
b/arch/arm/mach-imx/mach-imx27_visstrim_m10.c
index 5169dfba9718..07d4fcfe5c2e 100644
--- a/arch/arm/mach-imx/mach-imx27_visstrim_m10.c
+++ b/arch/arm/mach-imx/mach-imx27_visstrim_m10.c
@@ -258,8 +258,7 @@ static void __init visstrim_analog_camera_init(void)
return;
 
dma_declare_coherent_memory(>dev, mx2_camera_base,
-   mx2_camera_base, MX2_CAMERA_BUF_SIZE,
-   DMA_MEMORY_EXCLUSIVE);
+   mx2_camera_base, MX2_CAMERA_BUF_SIZE);
 }
 
 static void __init visstrim_reserve(void)
@@ -445,8 +444,7 @@ static void __init visstrim_coda_init(void)
dma_declare_coherent_memory(>dev,
mx2_camera_base + MX2_CAMERA_BUF_SIZE,
mx2_camera_base + MX2_CAMERA_BUF_SIZE,
-   MX2_CAMERA_BUF_SIZE,
-   DMA_MEMORY_EXCLUSIVE);
+   MX2_CAMERA_BUF_SIZE);
 }
 
 /* DMA deinterlace */
@@ -465,8 +463,7 @@ static void __init visstrim_deinterlace_init(void)
dma_declare_coherent_memory(>dev,
mx2_camera_base + 2 * MX2_CAMERA_BUF_SIZE,
mx2_camera_base + 2 * MX2_CAMERA_BUF_SIZE,
-   MX2_CAMERA_BUF_SIZE,
-   DMA_MEMORY_EXCLUSIVE);
+   MX2_CAMERA_BUF_SIZE);
 }
 
 /* Emma-PrP for format conversion */
@@ -485,8 +482,7 @@ static void __init visstrim_emmaprp_init(void)
 */
ret = dma_declare_coherent_memory(>dev,
mx2_camera_base, mx2_camera_base,
-   MX2_CAMERA_BUF_SIZE,
-   DMA_MEMORY_EXCLUSIVE);
+   MX2_CAMERA_BUF_SIZE);
if (ret)
pr_err("Failed to declare memory for emmaprp\n");
 }
diff --git a/arch/arm/mach-imx/mach-mx31moboard.c 
b/arch/arm/mach-imx/mach-mx31moboard.c
index 643a3d749703..fe50f4cf00a7 100644
--- a/arch/arm/mach-imx/mach-mx31moboard.c
+++ b/arch/arm/mach-imx/mach-mx31moboard.c
@@ -475,8 +475,7 @@ static int __init mx31moboard_init_cam(void)
 
ret = dma_declare_coherent_memory(>dev,
  mx3_camera_base, mx3_camera_base,
- MX3_CAMERA_BUF_SIZE,
- DMA_MEMORY_EXCLUSIVE);
+ MX3_CAMERA_BUF_SIZE);
if (ret)
goto err;
 
diff --git a/arch/sh/boards/mach-ap325rxa/setup.c 
b/arch/sh/boards/mach-ap325rxa/setup.c
index 8f234d0435aa..7899b4f51fdd 100644
--- a/arch/sh/boards/mach-ap325rxa/setup.c
+++ 

[PATCH 08/12] dma-mapping: remove dma_mark_declared_memory_occupied

2019-02-11 Thread Christoph Hellwig
This API is not used anywhere, so remove it.

Signed-off-by: Christoph Hellwig 
---
 Documentation/DMA-API.txt   | 17 -
 include/linux/dma-mapping.h |  9 -
 kernel/dma/coherent.c   | 23 ---
 3 files changed, 49 deletions(-)

diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index 78114ee63057..b9d0cba83877 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -605,23 +605,6 @@ unconditionally having removed all the required 
structures.  It is the
 driver's job to ensure that no parts of this memory region are
 currently in use.
 
-::
-
-   void *
-   dma_mark_declared_memory_occupied(struct device *dev,
- dma_addr_t device_addr, size_t size)
-
-This is used to occupy specific regions of the declared space
-(dma_alloc_coherent() will hand out the first free region it finds).
-
-device_addr is the *device* address of the region requested.
-
-size is the size (and should be a page-sized multiple).
-
-The return value will be either a pointer to the processor virtual
-address of the memory, or an error (via PTR_ERR()) if any part of the
-region is occupied.
-
 Part III - Debug drivers use of the DMA-API
 ---
 
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index fde0cfc71824..9df0f4d318c5 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -735,8 +735,6 @@ static inline int dma_get_cache_alignment(void)
 int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr,
dma_addr_t device_addr, size_t size, int flags);
 void dma_release_declared_memory(struct device *dev);
-void *dma_mark_declared_memory_occupied(struct device *dev,
-   dma_addr_t device_addr, size_t size);
 #else
 static inline int
 dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr,
@@ -749,13 +747,6 @@ static inline void
 dma_release_declared_memory(struct device *dev)
 {
 }
-
-static inline void *
-dma_mark_declared_memory_occupied(struct device *dev,
- dma_addr_t device_addr, size_t size)
-{
-   return ERR_PTR(-EBUSY);
-}
 #endif /* CONFIG_DMA_DECLARE_COHERENT */
 
 static inline void *dmam_alloc_coherent(struct device *dev, size_t size,
diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c
index 4b76aba574c2..1d12a31af6d7 100644
--- a/kernel/dma/coherent.c
+++ b/kernel/dma/coherent.c
@@ -137,29 +137,6 @@ void dma_release_declared_memory(struct device *dev)
 }
 EXPORT_SYMBOL(dma_release_declared_memory);
 
-void *dma_mark_declared_memory_occupied(struct device *dev,
-   dma_addr_t device_addr, size_t size)
-{
-   struct dma_coherent_mem *mem = dev->dma_mem;
-   unsigned long flags;
-   int pos, err;
-
-   size += device_addr & ~PAGE_MASK;
-
-   if (!mem)
-   return ERR_PTR(-EINVAL);
-
-   spin_lock_irqsave(&mem->spinlock, flags);
-   pos = PFN_DOWN(device_addr - dma_get_device_base(dev, mem));
-   err = bitmap_allocate_region(mem->bitmap, pos, get_order(size));
-   spin_unlock_irqrestore(&mem->spinlock, flags);
-
-   if (err != 0)
-   return ERR_PTR(err);
-   return mem->virt_base + (pos << PAGE_SHIFT);
-}
-EXPORT_SYMBOL(dma_mark_declared_memory_occupied);
-
 static void *__dma_alloc_from_coherent(struct dma_coherent_mem *mem,
ssize_t size, dma_addr_t *dma_handle)
 {
-- 
2.20.1



[PATCH 07/12] dma-mapping: move CONFIG_DMA_CMA to kernel/dma/Kconfig

2019-02-11 Thread Christoph Hellwig
This is where all the related code already lives.

Signed-off-by: Christoph Hellwig 
---
 drivers/base/Kconfig | 77 
 kernel/dma/Kconfig   | 77 
 2 files changed, 77 insertions(+), 77 deletions(-)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 3e63a900b330..059700ea3521 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -191,83 +191,6 @@ config DMA_FENCE_TRACE
  lockup related problems for dma-buffers shared across multiple
  devices.
 
-config DMA_CMA
-   bool "DMA Contiguous Memory Allocator"
-   depends on HAVE_DMA_CONTIGUOUS && CMA
-   help
- This enables the Contiguous Memory Allocator which allows drivers
- to allocate big physically-contiguous blocks of memory for use with
- hardware components that do not support I/O map nor scatter-gather.
-
- You can disable CMA by specifying "cma=0" on the kernel's command
- line.
-
- For more information see <include/linux/dma-contiguous.h>.
- If unsure, say "n".
-
-if  DMA_CMA
-comment "Default contiguous memory area size:"
-
-config CMA_SIZE_MBYTES
-   int "Size in Mega Bytes"
-   depends on !CMA_SIZE_SEL_PERCENTAGE
-   default 0 if X86
-   default 16
-   help
- Defines the size (in MiB) of the default memory area for Contiguous
- Memory Allocator.  If the size of 0 is selected, CMA is disabled by
- default, but it can be enabled by passing cma=size[MG] to the kernel.
-
-
-config CMA_SIZE_PERCENTAGE
-   int "Percentage of total memory"
-   depends on !CMA_SIZE_SEL_MBYTES
-   default 0 if X86
-   default 10
-   help
- Defines the size of the default memory area for Contiguous Memory
- Allocator as a percentage of the total memory in the system.
- If 0 percent is selected, CMA is disabled by default, but it can be
- enabled by passing cma=size[MG] to the kernel.
-
-choice
-   prompt "Selected region size"
-   default CMA_SIZE_SEL_MBYTES
-
-config CMA_SIZE_SEL_MBYTES
-   bool "Use mega bytes value only"
-
-config CMA_SIZE_SEL_PERCENTAGE
-   bool "Use percentage value only"
-
-config CMA_SIZE_SEL_MIN
-   bool "Use lower value (minimum)"
-
-config CMA_SIZE_SEL_MAX
-   bool "Use higher value (maximum)"
-
-endchoice
-
-config CMA_ALIGNMENT
-   int "Maximum PAGE_SIZE order of alignment for contiguous buffers"
-   range 4 12
-   default 8
-   help
- DMA mapping framework by default aligns all buffers to the smallest
- PAGE_SIZE order which is greater than or equal to the requested buffer
- size. This works well for buffers up to a few hundreds kilobytes, but
- for larger buffers it just a memory waste. With this parameter you can
- specify the maximum PAGE_SIZE order for contiguous buffers. Larger
- buffers will be aligned only to this specified order. The order is
- expressed as a power of two multiplied by the PAGE_SIZE.
-
- For example, if your system defaults to 4KiB pages, the order value
- of 8 means that the buffers will be aligned up to 1MiB only.
-
- If unsure, leave the default value "8".
-
-endif
-
 config GENERIC_ARCH_TOPOLOGY
bool
help
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index b122ab100d66..d785286ad868 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -53,3 +53,80 @@ config DMA_REMAP
 config DMA_DIRECT_REMAP
bool
select DMA_REMAP
+
+config DMA_CMA
+   bool "DMA Contiguous Memory Allocator"
+   depends on HAVE_DMA_CONTIGUOUS && CMA
+   help
+ This enables the Contiguous Memory Allocator which allows drivers
+ to allocate big physically-contiguous blocks of memory for use with
+ hardware components that do not support I/O map nor scatter-gather.
+
+ You can disable CMA by specifying "cma=0" on the kernel's command
+ line.
+
+ For more information see <include/linux/dma-contiguous.h>.
+ If unsure, say "n".
+
+if  DMA_CMA
+comment "Default contiguous memory area size:"
+
+config CMA_SIZE_MBYTES
+   int "Size in Mega Bytes"
+   depends on !CMA_SIZE_SEL_PERCENTAGE
+   default 0 if X86
+   default 16
+   help
+ Defines the size (in MiB) of the default memory area for Contiguous
+ Memory Allocator.  If the size of 0 is selected, CMA is disabled by
+ default, but it can be enabled by passing cma=size[MG] to the kernel.
+
+
+config CMA_SIZE_PERCENTAGE
+   int "Percentage of total memory"
+   depends on !CMA_SIZE_SEL_MBYTES
+   default 0 if X86
+   default 10
+   help
+ Defines the size of the default memory area for Contiguous Memory
+ Allocator as a percentage of the total memory in the system.
+ If 0 percent is selected, CMA is disabled by default, but it can be
+ enabled by passing cma=size[MG] 

[PATCH 06/12] dma-mapping: improve selection of dma_declare_coherent availability

2019-02-11 Thread Christoph Hellwig
This API is primarily used through DT entries, but two architectures
and two drivers call it directly.  So instead of selecting the config
symbol for random architectures, pull it in implicitly for the actual
users.  Also rename the Kconfig option to describe the feature better.
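
A direct user then selects the renamed symbol from its own Kconfig
entry instead of the architecture doing it, e.g. (sketch assuming the
SM501 entry from patch 01 is one of the two direct users):

config MFD_SM501
	tristate "Silicon Motion SM501"
	depends on HAS_DMA
	select DMA_DECLARE_COHERENT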

Signed-off-by: Christoph Hellwig 
---
 arch/arc/Kconfig| 1 -
 arch/arm/Kconfig| 2 +-
 arch/arm64/Kconfig  | 1 -
 arch/csky/Kconfig   | 1 -
 arch/mips/Kconfig   | 1 -
 arch/riscv/Kconfig  | 1 -
 arch/sh/Kconfig | 2 +-
 arch/unicore32/Kconfig  | 1 -
 arch/x86/Kconfig| 1 -
 drivers/mfd/Kconfig | 2 ++
 drivers/of/Kconfig  | 3 ++-
 include/linux/device.h  | 2 +-
 include/linux/dma-mapping.h | 8 
 kernel/dma/Kconfig  | 2 +-
 kernel/dma/Makefile | 2 +-
 15 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 4103f23b6cea..56e9397542e0 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -30,7 +30,6 @@ config ARC
select HAVE_ARCH_TRACEHOOK
select HAVE_DEBUG_STACKOVERFLOW
select HAVE_FUTEX_CMPXCHG if FUTEX
-   select HAVE_GENERIC_DMA_COHERENT
select HAVE_IOREMAP_PROT
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_LZMA
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 9395f138301a..25fbbd3cb91d 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -30,6 +30,7 @@ config ARM
select CLONE_BACKWARDS
select CPU_PM if SUSPEND || CPU_IDLE
select DCACHE_WORD_ACCESS if HAVE_EFFICIENT_UNALIGNED_ACCESS
+   select DMA_DECLARE_COHERENT
select DMA_REMAP if MMU
select EDAC_SUPPORT
select EDAC_ATOMIC_SCRUB
@@ -72,7 +73,6 @@ config ARM
select HAVE_FUNCTION_GRAPH_TRACER if !THUMB2_KERNEL
select HAVE_FUNCTION_TRACER if !XIP_KERNEL
select HAVE_GCC_PLUGINS
-   select HAVE_GENERIC_DMA_COHERENT
select HAVE_HW_BREAKPOINT if PERF_EVENTS && (CPU_V6 || CPU_V6K || 
CPU_V7)
select HAVE_IDE if PCI || ISA || PCMCIA
select HAVE_IRQ_TIME_ACCOUNTING
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1d22e969bdcb..d558461a5107 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -137,7 +137,6 @@ config ARM64
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_GCC_PLUGINS
-   select HAVE_GENERIC_DMA_COHERENT
select HAVE_HW_BREAKPOINT if PERF_EVENTS
select HAVE_IRQ_TIME_ACCOUNTING
select HAVE_MEMBLOCK_NODE_MAP if NUMA
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 0a9595afe9be..c009a8c63946 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -30,7 +30,6 @@ config CSKY
select HAVE_ARCH_TRACEHOOK
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_GRAPH_TRACER
-   select HAVE_GENERIC_DMA_COHERENT
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_LZO
select HAVE_KERNEL_LZMA
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 0d14f51d0002..ba50dc2d37dc 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -56,7 +56,6 @@ config MIPS
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_TRACER
-   select HAVE_GENERIC_DMA_COHERENT
select HAVE_IDE
select HAVE_IOREMAP_PROT
select HAVE_IRQ_EXIT_ON_IRQ_STACK
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index feeeaa60697c..51b9c97751bf 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -32,7 +32,6 @@ config RISCV
select HAVE_MEMBLOCK_NODE_MAP
select HAVE_DMA_CONTIGUOUS
select HAVE_FUTEX_CMPXCHG if FUTEX
-   select HAVE_GENERIC_DMA_COHERENT
select HAVE_PERF_EVENTS
select HAVE_SYSCALL_TRACEPOINTS
select IRQ_DOMAIN
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index a9c36f95744a..a3d2a24e75c7 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -7,11 +7,11 @@ config SUPERH
select ARCH_NO_COHERENT_DMA_MMAP if !MMU
select HAVE_PATA_PLATFORM
select CLKDEV_LOOKUP
+   select DMA_DECLARE_COHERENT
select HAVE_IDE if HAS_IOPORT_MAP
select HAVE_MEMBLOCK_NODE_MAP
select ARCH_DISCARD_MEMBLOCK
select HAVE_OPROFILE
-   select HAVE_GENERIC_DMA_COHERENT
select HAVE_ARCH_TRACEHOOK
select HAVE_PERF_EVENTS
select HAVE_DEBUG_BUGVERBOSE
diff --git a/arch/unicore32/Kconfig b/arch/unicore32/Kconfig
index c3a41bfe161b..6d2891d37e32 100644
--- a/arch/unicore32/Kconfig
+++ b/arch/unicore32/Kconfig
@@ -4,7 +4,6 @@ config UNICORE32
select ARCH_HAS_DEVMEM_IS_ALLOWED
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
-   select HAVE_GENERIC_DMA_COHERENT
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_BZIP2
select GENERIC_ATOMIC64
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig

[PATCH 05/12] dma-mapping: remove an incorrect __iomem annotation

2019-02-11 Thread Christoph Hellwig
memremap returns a regular void pointer, not an __iomem one.

Signed-off-by: Christoph Hellwig 
---
 kernel/dma/coherent.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c
index 66f0fb7e9a3a..4b76aba574c2 100644
--- a/kernel/dma/coherent.c
+++ b/kernel/dma/coherent.c
@@ -43,7 +43,7 @@ static int dma_init_coherent_memory(
struct dma_coherent_mem **mem)
 {
struct dma_coherent_mem *dma_mem = NULL;
-   void __iomem *mem_base = NULL;
+   void *mem_base = NULL;
int pages = size >> PAGE_SHIFT;
int bitmap_size = BITS_TO_LONGS(pages) * sizeof(long);
int ret;
-- 
2.20.1



[PATCH 04/12] of: select OF_RESERVED_MEM automatically

2019-02-11 Thread Christoph Hellwig
The OF_RESERVED_MEM can be used if we have either CMA or the generic
declare coherent code built and we support the early flattened DT.

So don't bother making it a user-visible option that is selected
by most configs that fit the above category, but just select it when
the requirements are met.

Signed-off-by: Christoph Hellwig 
---
 arch/arc/Kconfig | 1 -
 arch/arm/Kconfig | 1 -
 arch/arm64/Kconfig   | 1 -
 arch/csky/Kconfig| 1 -
 arch/powerpc/Kconfig | 1 -
 arch/xtensa/Kconfig  | 1 -
 drivers/of/Kconfig   | 5 ++---
 7 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 376366a7db81..4103f23b6cea 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -44,7 +44,6 @@ config ARC
select MODULES_USE_ELF_RELA
select OF
select OF_EARLY_FLATTREE
-   select OF_RESERVED_MEM
select PCI_SYSCALL if PCI
select PERF_USE_VMALLOC if ARC_CACHE_VIPT_ALIASING
 
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 664e918e2624..9395f138301a 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -101,7 +101,6 @@ config ARM
select MODULES_USE_ELF_REL
select NEED_DMA_MAP_STATE
select OF_EARLY_FLATTREE if OF
-   select OF_RESERVED_MEM if OF
select OLD_SIGACTION
select OLD_SIGSUSPEND3
select PCI_SYSCALL if PCI
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a4168d366127..1d22e969bdcb 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -163,7 +163,6 @@ config ARM64
select NEED_SG_DMA_LENGTH
select OF
select OF_EARLY_FLATTREE
-   select OF_RESERVED_MEM
select PCI_DOMAINS_GENERIC if PCI
select PCI_ECAM if (ACPI && PCI)
select PCI_SYSCALL if PCI
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 398113c845f5..0a9595afe9be 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -42,7 +42,6 @@ config CSKY
select MODULES_USE_ELF_RELA if MODULES
select OF
select OF_EARLY_FLATTREE
-   select OF_RESERVED_MEM
select PERF_USE_VMALLOC if CPU_CK610
select RTC_LIB
select TIMER_OF
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2890d36eb531..5cc4eea362c6 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -233,7 +233,6 @@ config PPC
select NEED_SG_DMA_LENGTH
select OF
select OF_EARLY_FLATTREE
-   select OF_RESERVED_MEM
select OLD_SIGACTIONif PPC32
select OLD_SIGSUSPEND
select PCI_DOMAINS  if PCI
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index 20a0756f27ef..e242a405151e 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -447,7 +447,6 @@ config USE_OF
bool "Flattened Device Tree support"
select OF
select OF_EARLY_FLATTREE
-   select OF_RESERVED_MEM
help
  Include support for flattened device tree machine descriptions.
 
diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
index ad3fcad4d75b..3607fd2810e4 100644
--- a/drivers/of/Kconfig
+++ b/drivers/of/Kconfig
@@ -81,10 +81,9 @@ config OF_MDIO
  OpenFirmware MDIO bus (Ethernet PHY) accessors
 
 config OF_RESERVED_MEM
-   depends on OF_EARLY_FLATTREE
bool
-   help
- Helpers to allow for reservation of memory regions
+   depends on OF_EARLY_FLATTREE
+   default y if HAVE_GENERIC_DMA_COHERENT || DMA_CMA
 
 config OF_RESOLVE
bool
-- 
2.20.1



[PATCH 03/12] of: mark early_init_dt_alloc_reserved_memory_arch static

2019-02-11 Thread Christoph Hellwig
This function is only used in of_reserved_mem.c, and never overridden
despite the __weak marker.

Signed-off-by: Christoph Hellwig 
---
 drivers/of/of_reserved_mem.c| 2 +-
 include/linux/of_reserved_mem.h | 7 ---
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 1977ee0adcb1..9f165fc1d1a2 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -26,7 +26,7 @@
 static struct reserved_mem reserved_mem[MAX_RESERVED_REGIONS];
 static int reserved_mem_count;
 
-int __init __weak early_init_dt_alloc_reserved_memory_arch(phys_addr_t size,
+static int __init early_init_dt_alloc_reserved_memory_arch(phys_addr_t size,
phys_addr_t align, phys_addr_t start, phys_addr_t end, bool nomap,
phys_addr_t *res_base)
 {
diff --git a/include/linux/of_reserved_mem.h b/include/linux/of_reserved_mem.h
index 67ab8d271df3..60f541912ccf 100644
--- a/include/linux/of_reserved_mem.h
+++ b/include/linux/of_reserved_mem.h
@@ -35,13 +35,6 @@ int of_reserved_mem_device_init_by_idx(struct device *dev,
   struct device_node *np, int idx);
 void of_reserved_mem_device_release(struct device *dev);
 
-int early_init_dt_alloc_reserved_memory_arch(phys_addr_t size,
-phys_addr_t align,
-phys_addr_t start,
-phys_addr_t end,
-bool nomap,
-phys_addr_t *res_base);
-
 void fdt_init_reserved_mem(void);
 void fdt_reserved_mem_save_node(unsigned long node, const char *uname,
   phys_addr_t base, phys_addr_t size);
-- 
2.20.1



[PATCH 02/12] device.h: dma_mem is only needed for HAVE_GENERIC_DMA_COHERENT

2019-02-11 Thread Christoph Hellwig
No need to carry an unused field around.

Signed-off-by: Christoph Hellwig 
---
 include/linux/device.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 6cb4640b6160..be544400acdd 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1017,8 +1017,10 @@ struct device {
 
struct list_headdma_pools;  /* dma pools (if dma'ble) */
 
+#ifdef CONFIG_HAVE_GENERIC_DMA_COHERENT
struct dma_coherent_mem *dma_mem; /* internal for coherent mem
 override */
+#endif
 #ifdef CONFIG_DMA_CMA
struct cma *cma_area;   /* contiguous memory area for dma
   allocations */
-- 
2.20.1



[PATCH 01/12] mfd/sm501: depend on HAS_DMA

2019-02-11 Thread Christoph Hellwig
Currently the sm501 mfd driver can be compiled without any dependencies,
but through the use of dma_declare_coherent it really depends on
having DMA and iomem support.  Normally we don't explicitly require DMA
support as we have stubs for it when building for UML, but in this case
the driver selects support for dma_declare_coherent and thus also
requires memremap support.  Guard this with an explicit dependency.

Signed-off-by: Christoph Hellwig 
---
 drivers/mfd/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/mfd/Kconfig b/drivers/mfd/Kconfig
index f461460a2aeb..f15f6489803d 100644
--- a/drivers/mfd/Kconfig
+++ b/drivers/mfd/Kconfig
@@ -1066,6 +1066,7 @@ config MFD_SI476X_CORE
 
 config MFD_SM501
tristate "Silicon Motion SM501"
+   depends on HAS_DMA
 ---help---
  This is the core driver for the Silicon Motion SM501 multimedia
  companion chip. This device is a multifunction device which may
-- 
2.20.1



dma_declare_coherent spring cleaning

2019-02-11 Thread Christoph Hellwig
Hi all,

this series removes various bits of dead code and refactors the
remaining functionality around dma_declare_coherent to be a somewhat
more coherent code base.


Re: [PATCH v3 1/2] drivers/mtd: Use mtd->name when registering nvmem device

2019-02-11 Thread Aneesh Kumar K.V

On 2/11/19 4:46 PM, Boris Brezillon wrote:

On Mon, 11 Feb 2019 16:26:38 +0530
"Aneesh Kumar K.V"  wrote:


On 2/10/19 6:25 PM, Boris Brezillon wrote:

Hello Aneesh,

On Fri,  8 Feb 2019 20:44:18 +0530
"Aneesh Kumar K.V"  wrote:
   

With this patch, we use the mtd->name instead of concatenating the name with '0'

Fixes: c4dfa25ab307 ("mtd: add support for reading MTD devices via the nvmem 
API")
Signed-off-by: Aneesh Kumar K.V 


You forgot to Cc the MTD ML and maintainers. Can you please send a new
version?
   


linux-mtd list is on CC: Is that not sufficient?


Not in your original email, I added it in my reply.




Sorry about that. I will now resend with linux-mtd on CC: I missed that 
earlier.


-aneesh



[PATCH v3 2/2] drivers/mtd: Fix device registration error

2019-02-11 Thread Aneesh Kumar K.V
This change helps me to get multiple MTD devices registered. Without it
I get:

sysfs: cannot create duplicate filename '/bus/nvmem/devices/flash0'
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2-00557-g1ef20ef21f22 #13
Call Trace:
[c000b38e3220] [c0b58fe4] dump_stack+0xe8/0x164 (unreliable)
[c000b38e3270] [c04cf074] sysfs_warn_dup+0x84/0xb0
[c000b38e32f0] [c04cf6c4] sysfs_do_create_link_sd.isra.0+0x114/0x150
[c000b38e3340] [c0726a84] bus_add_device+0x94/0x1e0
[c000b38e33c0] [c07218f0] device_add+0x4d0/0x830
[c000b38e3480] [c09d54a8] nvmem_register.part.2+0x1c8/0xb30
[c000b38e3560] [c0834530] mtd_nvmem_add+0x90/0x120
[c000b38e3650] [c0835bc8] add_mtd_device+0x198/0x4e0
[c000b38e36f0] [c083619c] mtd_device_parse_register+0x11c/0x280
[c000b38e3780] [c0840830] powernv_flash_probe+0x180/0x250
[c000b38e3820] [c072c120] platform_drv_probe+0x60/0xf0
[c000b38e38a0] [c07283c8] really_probe+0x138/0x4d0
[c000b38e3930] [c0728acc] driver_probe_device+0x13c/0x1b0
[c000b38e39b0] [c0728c7c] __driver_attach+0x13c/0x1c0
[c000b38e3a30] [c0725130] bus_for_each_dev+0xa0/0x120
[c000b38e3a90] [c0727b2c] driver_attach+0x2c/0x40
[c000b38e3ab0] [c07270f8] bus_add_driver+0x228/0x360
[c000b38e3b40] [c072a2e0] driver_register+0x90/0x1a0
[c000b38e3bb0] [c072c020] __platform_driver_register+0x50/0x70
[c000b38e3bd0] [c105c984] powernv_flash_driver_init+0x24/0x38
[c000b38e3bf0] [c0010904] do_one_initcall+0x84/0x464
[c000b38e3cd0] [c1004548] kernel_init_freeable+0x530/0x634
[c000b38e3db0] [c0011154] kernel_init+0x1c/0x168
[c000b38e3e20] [c000bed4] ret_from_kernel_thread+0x5c/0x68
mtd mtd1: Failed to register NVMEM device

With the change we now have

root@(none):/sys/bus/nvmem/devices# ls -al
total 0
drwxr-xr-x 2 root root 0 Feb  6 20:49 .
drwxr-xr-x 4 root root 0 Feb  6 20:49 ..
lrwxrwxrwx 1 root root 0 Feb  6 20:49 flash@0 -> 
../../../devices/platform/ibm,opal:flash@0/mtd/mtd0/flash@0
lrwxrwxrwx 1 root root 0 Feb  6 20:49 flash@1 -> 
../../../devices/platform/ibm,opal:flash@1/mtd/mtd1/flash@1

Fixes: acfe63ec1c59 ("mtd: Convert to using %pOFn instead of device_node.name")
Signed-off-by: Aneesh Kumar K.V 
---
 drivers/mtd/devices/powernv_flash.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/mtd/devices/powernv_flash.c 
b/drivers/mtd/devices/powernv_flash.c
index 22f753e555ac..83f88b8b5d9f 100644
--- a/drivers/mtd/devices/powernv_flash.c
+++ b/drivers/mtd/devices/powernv_flash.c
@@ -212,7 +212,7 @@ static int powernv_flash_set_driver_info(struct device *dev,
 * Going to have to check what details I need to set and how to
 * get them
 */
-   mtd->name = devm_kasprintf(dev, GFP_KERNEL, "%pOFn", dev->of_node);
+   mtd->name = devm_kasprintf(dev, GFP_KERNEL, "%pOFP", dev->of_node);
mtd->type = MTD_NORFLASH;
mtd->flags = MTD_WRITEABLE;
mtd->size = size;
-- 
2.20.1
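
For context, my reading of the %pOF modifiers in lib/vsprintf.c (treat
the exact strings as a hedged illustration): %pOFn prints the bare node
name, while %pOFP prints the last path component including the unit
address, which is what keeps the two flash nodes apart:

	/* node0 and node1 are hypothetical handles for the two nodes above */
	pr_info("%pOFn\n", node0);	/* "flash"   - collides */
	pr_info("%pOFn\n", node1);	/* "flash"   - collides */
	pr_info("%pOFP\n", node0);	/* "flash@0" - unique   */
	pr_info("%pOFP\n", node1);	/* "flash@1" - unique   */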



[PATCH v3 1/2] drivers/mtd: Use mtd->name when registering nvmem device

2019-02-11 Thread Aneesh Kumar K.V
With this patch, we use mtd->name instead of concatenating the name with '0'.

Fixes: c4dfa25ab307 ("mtd: add support for reading MTD devices via the nvmem 
API")
Signed-off-by: Aneesh Kumar K.V 
---
 drivers/mtd/mtdcore.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 999b705769a8..3ef01baef9b6 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -507,6 +507,7 @@ static int mtd_nvmem_add(struct mtd_info *mtd)
 {
struct nvmem_config config = {};
 
+   config.id = -1;
config.dev = >dev;
config.name = mtd->name;
config.owner = THIS_MODULE;
-- 
2.20.1



Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Waiman Long
On 02/11/2019 05:39 AM, Ingo Molnar wrote:
> * Ingo Molnar  wrote:
>
>> Sounds good to me - I've merged this patch, will push it out after 
>> testing.
> Based on Peter's feedback I'm delaying this - performance testing on at 
> least one key ll/sc arch would be nice indeed.
>
> Thanks,
>
>   Ingo

Yes, I will twist the generic code to generate better code.

As I said in the commit log, only x86, ia64 and alpha provide assembly
code to replace the generic C code. The ll/sc archs that I have access
to (ARM64, ppc) are all using the generic C code anyway. I actually had
done some performance measurement on both those platforms and didn't see
any performance difference. I didn't include them as they were using
generic code before. I will rerun the tests after I've twisted the generic
C code.

Thanks,
Longman




Re: [PATCH v3 1/7] dump_stack: Support adding to the dump stack arch description

2019-02-11 Thread Andrea Parri
Hi Michael,


On Thu, Feb 07, 2019 at 11:46:29PM +1100, Michael Ellerman wrote:
> Arch code can set a "dump stack arch description string" which is
> displayed with oops output to describe the hardware platform.
> 
> It is useful to initialise this as early as possible, so that an early
> oops will have the hardware description.
> 
> However in practice we discover the hardware platform in stages, so it
> would be useful to be able to incrementally fill in the hardware
> description as we discover it.
> 
> This patch adds that ability, by creating dump_stack_add_arch_desc().
> 
> If there is no existing string it behaves exactly like
> dump_stack_set_arch_desc(). However if there is an existing string it
> appends to it, with a leading space.
> 
> This makes it easy to call it multiple times from different parts of the
> code and get a reasonable looking result.
> 
> Signed-off-by: Michael Ellerman 
> ---
>  include/linux/printk.h |  5 
>  lib/dump_stack.c   | 58 ++
>  2 files changed, 63 insertions(+)
> 
> v3: No change, just widened Cc list.
> 
> v2: Add a smp_wmb() and comment.
> 
> v1 is here for reference 
> https://lore.kernel.org/lkml/1430824337-15339-1-git-send-email-...@ellerman.id.au/
> 
> I'll take this series via the powerpc tree if no one minds?
> 
> 
> diff --git a/include/linux/printk.h b/include/linux/printk.h
> index 77740a506ebb..d5fb4f960271 100644
> --- a/include/linux/printk.h
> +++ b/include/linux/printk.h
> @@ -198,6 +198,7 @@ u32 log_buf_len_get(void);
>  void log_buf_vmcoreinfo_setup(void);
>  void __init setup_log_buf(int early);
>  __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...);
> +__printf(1, 2) void dump_stack_add_arch_desc(const char *fmt, ...);
>  void dump_stack_print_info(const char *log_lvl);
>  void show_regs_print_info(const char *log_lvl);
>  extern asmlinkage void dump_stack(void) __cold;
> @@ -256,6 +257,10 @@ static inline __printf(1, 2) void 
> dump_stack_set_arch_desc(const char *fmt, ...)
>  {
>  }
>  
> +static inline __printf(1, 2) void dump_stack_add_arch_desc(const char *fmt, 
> ...)
> +{
> +}
> +
>  static inline void dump_stack_print_info(const char *log_lvl)
>  {
>  }
> diff --git a/lib/dump_stack.c b/lib/dump_stack.c
> index 5cff72f18c4a..69b710ff92b5 100644
> --- a/lib/dump_stack.c
> +++ b/lib/dump_stack.c
> @@ -35,6 +35,64 @@ void __init dump_stack_set_arch_desc(const char *fmt, ...)
>   va_end(args);
>  }
>  
> +/**
> + * dump_stack_add_arch_desc - add arch-specific info to show with task dumps
> + * @fmt: printf-style format string
> + * @...: arguments for the format string
> + *
> + * See dump_stack_set_arch_desc() for why you'd want to use this.
> + *
> + * This version adds to any existing string already created with either
> + * dump_stack_set_arch_desc() or dump_stack_add_arch_desc(). If there is an
> + * existing string a space will be prepended to the passed string.
> + */
> +void __init dump_stack_add_arch_desc(const char *fmt, ...)
> +{
> + va_list args;
> + int pos, len;
> + char *p;
> +
> + /*
> +  * If there's an existing string we snprintf() past the end of it, and
> +  * then turn the terminating NULL of the existing string into a space
> +  * to create one string separated by a space.
> +  *
> +  * If there's no existing string we just snprintf() to the buffer, like
> +  * dump_stack_set_arch_desc(), but without calling it because we'd need
> +  * a varargs version.
> +  */
> + len = strnlen(dump_stack_arch_desc_str, 
> sizeof(dump_stack_arch_desc_str));
> + pos = len;
> +
> + if (len)
> + pos++;
> +
> + if (pos >= sizeof(dump_stack_arch_desc_str))
> + return; /* Ran out of space */
> +
> + p = &dump_stack_arch_desc_str[pos];
> +
> + va_start(args, fmt);
> + vsnprintf(p, sizeof(dump_stack_arch_desc_str) - pos, fmt, args);
> + va_end(args);
> +
> + if (len) {
> + /*
> +  * Order the stores above in vsnprintf() vs the store of the
> +  * space below which joins the two strings. Note this doesn't
> +  * make the code truly race free because there is no barrier on
> +  * the read side. ie. Another CPU might load the uninitialised
> +  * tail of the buffer first and then the space below (rather
> +  * than the NULL that was there previously), and so print the
> +  * uninitialised tail. But the whole string lives in BSS so in
> +  * practice it should just see NULLs.

The comment doesn't say _why_ we need to order these stores: IOW, what
will or can go wrong without this order?  This isn't clear to me.

Another good practice when adding smp_*-constructs (as discussed, e.g.,
at KS'18) is to indicate the matching construct/synch. mechanism.

  Andrea


> +  */
> + smp_wmb();
> +
> + dump_stack_arch_desc_str[len] = ' ';
> + }
> 

Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Peter Zijlstra
On Sun, Feb 10, 2019 at 09:00:50PM -0500, Waiman Long wrote:

> +static inline int __down_read_trylock(struct rw_semaphore *sem)
> +{
> + long tmp;
> +
> + while ((tmp = atomic_long_read(&sem->count)) >= 0) {
> + if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
> +tmp + RWSEM_ACTIVE_READ_BIAS)) {
> + return 1;
> + }
> + }
> + return 0;
> +}

So the orignal x86 implementation reads:

  static inline bool __down_read_trylock(struct rw_semaphore *sem)
  {
  long result, tmp;
  asm volatile("# beginning __down_read_trylock\n\t"
   "  mov  %[count],%[result]\n\t"
   "1:\n\t"
   "  mov  %[result],%[tmp]\n\t"
   "  add  %[inc],%[tmp]\n\t"
   "  jle2f\n\t"
   LOCK_PREFIX "  cmpxchg  %[tmp],%[count]\n\t"
   "  jnz1b\n\t"
   "2:\n\t"
   "# ending __down_read_trylock\n\t"
   : [count] "+m" (sem->count), [result] "=&a" (result),
 [tmp] "=&r" (tmp)
   : [inc] "i" (RWSEM_ACTIVE_READ_BIAS)
   : "memory", "cc");
  return result >= 0;
  }

you replace that with:

  int __down_read_trylock1(unsigned long *l)
  {
  long tmp;

  while ((tmp = READ_ONCE(*l)) >= 0) {
  if (tmp == cmpxchg(l, tmp, tmp + 1))
  return 1;
  }

  return 0;
  }

which generates:

   <__down_read_trylock1>:
   0:   eb 17   jmp19 <__down_read_trylock1+0x19>
   2:   66 0f 1f 44 00 00   nopw   0x0(%rax,%rax,1)
   8:   48 8d 4a 01 lea0x1(%rdx),%rcx
   c:   48 89 d0mov%rdx,%rax
   f:   f0 48 0f b1 0f  lock cmpxchg %rcx,(%rdi)
  14:   48 39 c2cmp%rax,%rdx
  17:   74 0f   je 28 <__down_read_trylock1+0x28>
  19:   48 8b 17mov(%rdi),%rdx
  1c:   48 85 d2test   %rdx,%rdx
  1f:   79 e7   jns8 <__down_read_trylock1+0x8>
  21:   31 c0   xor%eax,%eax
  23:   c3  retq
  24:   0f 1f 40 00 nopl   0x0(%rax)
  28:   b8 01 00 00 00  mov$0x1,%eax
  2d:   c3  retq


Which is clearly worse. Now we can write that as:

  int __down_read_trylock2(unsigned long *l)
  {
  long tmp = READ_ONCE(*l);

  while (tmp >= 0) {
   if (try_cmpxchg(l, &tmp, tmp + 1))
  return 1;
  }

  return 0;
  }

which generates:

  0030 <__down_read_trylock2>:
  30:   48 8b 07mov(%rdi),%rax
  33:   48 85 c0test   %rax,%rax
  36:   78 18   js 50 <__down_read_trylock2+0x20>
  38:   48 8d 50 01 lea0x1(%rax),%rdx
  3c:   f0 48 0f b1 17  lock cmpxchg %rdx,(%rdi)
  41:   75 f0   jne33 <__down_read_trylock2+0x3>
  43:   b8 01 00 00 00  mov$0x1,%eax
  48:   c3  retq
  49:   0f 1f 80 00 00 00 00nopl   0x0(%rax)
  50:   31 c0   xor%eax,%eax
  52:   c3  retq

Which is a lot better; but not quite there yet.


I've tried quite a bit, but I can't seem to get GCC to generate the:

add $1,%rdx
jle

required; stuff like:

new = old + 1;
if (new <= 0)

generates:

lea 0x1(%rax),%rdx
test %rdx, %rdx
jle


Ah well, have fun :-)

typedef unsigned char u8;
typedef unsigned short u16;
typedef unsigned int u32;
typedef unsigned long long u64;
typedef signed char s8;
typedef signed short s16;
typedef signed int s32;
typedef signed long long s64;
typedef _Bool bool;

# define CC_SET(c) "\n\t/* output condition code " #c "*/\n"
# define CC_OUT(c) "=@cc" #c

#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x)   __builtin_expect(!!(x), 0)

extern void __cmpxchg_wrong_size(void);

#define __raw_cmpxchg(ptr, old, new, size, lock)			\
({	\
	__typeof__(*(ptr)) __ret;	\
	__typeof__(*(ptr)) __old = (old);\
	__typeof__(*(ptr)) __new = (new);\
	switch (size) {			\
	case 1:		\
	{\
		volatile u8 *__ptr = (volatile u8 *)(ptr);		\
		asm volatile(lock "cmpxchgb %2,%1"			\
			 : "=a" (__ret), "+m" (*__ptr)		\
			 : "q" (__new), "0" (__old)			\
			 : "memory");\
		break;			\
	}\
	case 2:		\
	{\
		volatile u16 *__ptr = (volatile u16 *)(ptr);		\
		asm volatile(lock "cmpxchgw %2,%1"			\
			 : "=a" (__ret), "+m" (*__ptr)		\
			 : "r" (__new), "0" (__old)			\
			 : "memory");\
		break;			\
	}\
	case 4:		\
	{\
		volatile u32 *__ptr = (volatile u32 *)(ptr);		\
		asm volatile(lock 

[PATCH] powerpc/configs: Enable CONFIG_USB_XHCI_HCD by default

2019-02-11 Thread Thomas Huth
Recent versions of QEMU provide an XHCI device by default these
days instead of an old-fashioned OHCI device:

 https://git.qemu.org/?p=qemu.git;a=commitdiff;h=57040d451315320b7d27

So to get the keyboard working in the graphical console there again,
we should now include XHCI support in the kernel by default, too.

Signed-off-by: Thomas Huth 
---
 arch/powerpc/configs/pseries_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/pseries_defconfig 
b/arch/powerpc/configs/pseries_defconfig
index ea79c51..62e12f6 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -217,6 +217,7 @@ CONFIG_USB_MON=m
 CONFIG_USB_EHCI_HCD=y
 # CONFIG_USB_EHCI_HCD_PPC_OF is not set
 CONFIG_USB_OHCI_HCD=y
+CONFIG_USB_XHCI_HCD=y
 CONFIG_USB_STORAGE=m
 CONFIG_NEW_LEDS=y
 CONFIG_LEDS_CLASS=m
-- 
1.8.3.1



Re: [PATCH v3 1/2] drivers/mtd: Use mtd->name when registering nvmem device

2019-02-11 Thread Boris Brezillon
On Mon, 11 Feb 2019 16:26:38 +0530
"Aneesh Kumar K.V"  wrote:

> On 2/10/19 6:25 PM, Boris Brezillon wrote:
> > Hello Aneesh,
> > 
> > On Fri,  8 Feb 2019 20:44:18 +0530
> > "Aneesh Kumar K.V"  wrote:
> >   
> >> With this patch, we use the mtd->name instead of concatenating the name 
> >> with '0'
> >>
> >> Fixes: c4dfa25ab307 ("mtd: add support for reading MTD devices via the 
> >> nvmem API")
> >> Signed-off-by: Aneesh Kumar K.V   
> > 
> > You forgot to Cc the MTD ML and maintainers. Can you please send a new
> > version?
> >   
> 
> linux-mtd list is on CC: Is that not sufficient?

Not in your original email, I added it in my reply.


Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Peter Zijlstra
On Mon, Feb 11, 2019 at 10:40:44AM +0100, Peter Zijlstra wrote:
> On Mon, Feb 11, 2019 at 10:36:01AM +0100, Peter Zijlstra wrote:
> > On Sun, Feb 10, 2019 at 09:00:50PM -0500, Waiman Long wrote:
> > > +static inline int __down_read_trylock(struct rw_semaphore *sem)
> > > +{
> > > + long tmp;
> > > +
> > > + while ((tmp = atomic_long_read(&sem->count)) >= 0) {
> > > + if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
> > > +tmp + RWSEM_ACTIVE_READ_BIAS)) {
> > > + return 1;
> > 
> > That really wants to be:
> > 
> > if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
> > tmp + 
> > RWSEM_ACTIVE_READ_BIAS))
> > 
> > > + }
> > > + }
> > > + return 0;
> > > +}
> 
> Also, this is the one case where LL/SC can actually do 'better'. Do you
> have benchmarks for say PowerPC or ARM64 ?

Ah, I see they already used asm-generic/rwsem.h which has similar code
to the above.


Re: [PATCH v3 1/2] drivers/mtd: Use mtd->name when registering nvmem device

2019-02-11 Thread Aneesh Kumar K.V

On 2/10/19 6:25 PM, Boris Brezillon wrote:

Hello Aneesh,

On Fri,  8 Feb 2019 20:44:18 +0530
"Aneesh Kumar K.V"  wrote:


With this patch, we use the mtd->name instead of concatenating the name with '0'

Fixes: c4dfa25ab307 ("mtd: add support for reading MTD devices via the nvmem 
API")
Signed-off-by: Aneesh Kumar K.V 


You forgot to Cc the MTD ML and maintainers. Can you please send a new
version?



linux-mtd list is on CC: Is that not sufficient?

-aneesh



Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Ingo Molnar


* Will Deacon  wrote:

> On Mon, Feb 11, 2019 at 11:39:27AM +0100, Ingo Molnar wrote:
> > 
> > * Ingo Molnar  wrote:
> > 
> > > Sounds good to me - I've merged this patch, will push it out after 
> > > testing.
> > 
> > Based on Peter's feedback I'm delaying this - performance testing on at 
> > least one key ll/sc arch would be nice indeed.
> 
> Once Waiman has posted a new version, I can take it for a spin on some
> arm64 boxen if he shares his workload.

Cool, thanks!

Ingo


Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Will Deacon
On Mon, Feb 11, 2019 at 11:39:27AM +0100, Ingo Molnar wrote:
> 
> * Ingo Molnar  wrote:
> 
> > Sounds good to me - I've merged this patch, will push it out after 
> > testing.
> 
> Based on Peter's feedback I'm delaying this - performance testing on at 
> least one key ll/sc arch would be nice indeed.

Once Waiman has posted a new version, I can take it for a spin on some
arm64 boxen if he shares his workload.

Will


Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Ingo Molnar


* Ingo Molnar  wrote:

> Sounds good to me - I've merged this patch, will push it out after 
> testing.

Based on Peter's feedback I'm delaying this - performance testing on at 
least one key ll/sc arch would be nice indeed.

Thanks,

Ingo


Re: [RFC PATCH 3/5] powerpc: sstep: Add instruction emulation selftests

2019-02-11 Thread Sandipan Das



On 11/02/19 6:17 AM, Daniel Axtens wrote:
> Hi Sandipan,
> 
> I'm not really confident to review the asm, but I did have a couple of
> questions about the C:
> 
>> +#define MAX_INSNS   32
> This doesn't seem to be used...
> 

True. Thanks for pointing this out.

>> +int execute_instr(struct pt_regs *regs, unsigned int instr)
>> +{
>> +extern unsigned int exec_instr_execute[];
>> +extern int exec_instr(struct pt_regs *regs);
> 
> These externs sit inside the function scope. This feels less than ideal
> to me - is there a reason not to have these at global scope?
> 

Currently, execute_instr() is the only consumer. So, I thought I'd keep
them local for now.

>> +
>> +if (!regs || !instr)
>> +return -EINVAL;
>> +
>> +/* Patch the NOP with the actual instruction */
>> +patch_instruction(&exec_instr_execute[0], instr);
>> +if (exec_instr(regs)) {
>> +pr_info("execution failed, opcode = 0x%08x\n", instr);
>> +return -EFAULT;
>> +}
>> +
>> +return 0;
>> +}
> 
>> +late_initcall(run_sstep_tests);
> A design question: is there a reason to run these as an initcall rather
> than as a module that could either be built in or loaded separately? I'm
> not saying you have to do this, but I was wondering if you had
> considered it?
> 

I did. As of now, there are some existing tests in test_emulate_step.c
which use the same approach. So, I thought I'd stick with that approach
to start off. This is anyway controlled by a Kconfig option.

> Lastly, snowpatch reports some checkpatch issues for this and your
> remaining patches: https://patchwork.ozlabs.org/patch/1035683/ (You are
> allowed to violate checkpatch rules with justification, FWIW)
> 

Will look into them.

> Regards,
> Daniel
>> -- 
>> 2.19.2
> 



Re: [RFC PATCH 5/5] powerpc: sstep: Add selftests for addc[.] instruction

2019-02-11 Thread Sandipan Das



On 11/02/19 6:30 AM, Daniel Axtens wrote:
> Hi Sandipan,
> 
>> +{
>> +.descr = "RA = LONG_MIN | INT_MIN, RB = 
>> LONG_MIN | INT_MIN",
>> +.instr = PPC_INST_ADDC | ___PPC_RT(20) | 
>> ___PPC_RA(21) | ___PPC_RB(22),
>> +.regs =
>> +{
>> +.gpr[21] = LONG_MIN | (uint) INT_MIN,
>> +.gpr[22] = LONG_MIN | (uint) INT_MIN,
>> +}
>> +}
> I don't know what this bit pattern is supposed to represent - is it
> supposed to be the smallest 32bit integer and the smallest 64bit
> integer - 0x8000000080000000 - so you test 32 and 64 bit overflow at the
> same time? 
> 

Yes, exactly.
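
For anyone following along, a quick userspace model (mine, assuming a
64-bit long) of why this single pattern exercises both carry widths at
once:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t a = 0x8000000080000000ULL;	/* LONG_MIN | (uint) INT_MIN */
	uint64_t sum = a + a;
	int carry64 = sum < a;			/* carry out of bit 63 (CA) */
	int carry32 = (uint32_t)sum < (uint32_t)a; /* carry out of bit 31 (CA32) */

	printf("sum=0x%016llx carry64=%d carry32=%d\n",
	       (unsigned long long)sum, carry64, carry32);
	return 0;
}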

> 
> For the series:
> Tested-by: Daniel Axtens  # Power8 LE
> 
> I notice the output is quite verbose, and doesn't include a line when it
> starts:
> 
> [0.826181] Running code patching self-tests ...
> [0.826607] Running feature fixup self-tests ...
> [0.826615] nop : R0 = LONG_MAX  
> [PASS]
> [0.826617] add : RA = LONG_MIN, RB = LONG_MIN   
> [PASS]
> 
> Maybe it would be good to include a line saying "Running single-step
> emulation self-tests" and perhaps by default only printing when there is a
> failure.
> 

That makes sense. Will include it in the next revision.

> Finally, I think you might be able to squash patches 1 and 2 and patches
> 4 and 5, but that's just my personal preference.
> 
> Regards,
> Daniel
> 



Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Peter Zijlstra
On Mon, Feb 11, 2019 at 10:36:01AM +0100, Peter Zijlstra wrote:
> On Sun, Feb 10, 2019 at 09:00:50PM -0500, Waiman Long wrote:
> > +static inline int __down_read_trylock(struct rw_semaphore *sem)
> > +{
> > +   long tmp;
> > +
> > +   while ((tmp = atomic_long_read(&sem->count)) >= 0) {
> > +   if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
> > +  tmp + RWSEM_ACTIVE_READ_BIAS)) {
> > +   return 1;
> 
> That really wants to be:
> 
>   if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
>   tmp + 
> RWSEM_ACTIVE_READ_BIAS))
> 
> > +   }
> > +   }
> > +   return 0;
> > +}

Also, this is the one case where LL/SC can actually do 'better'. Do you
have benchmarks for say PowerPC or ARM64 ?


Re: [PATCH] locking/rwsem: Remove arch specific rwsem files

2019-02-11 Thread Peter Zijlstra
On Sun, Feb 10, 2019 at 09:00:50PM -0500, Waiman Long wrote:
> diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
> index bad2bca..067e265 100644
> --- a/kernel/locking/rwsem.h
> +++ b/kernel/locking/rwsem.h
> @@ -32,6 +32,26 @@
>  # define DEBUG_RWSEMS_WARN_ON(c)
>  #endif
>  
> +/*
> + * R/W semaphores originally for PPC using the stuff in lib/rwsem.c.
> + * Adapted largely from include/asm-i386/rwsem.h
> + * by Paul Mackerras .
> + */
> +
> +/*
> + * the semaphore definition
> + */
> +#ifdef CONFIG_64BIT
> +# define RWSEM_ACTIVE_MASK   0xffffffffL
> +#else
> +# define RWSEM_ACTIVE_MASK   0x0000ffffL
> +#endif
> +
> +#define RWSEM_ACTIVE_BIAS   0x00000001L
> +#define RWSEM_WAITING_BIAS   (-RWSEM_ACTIVE_MASK-1)
> +#define RWSEM_ACTIVE_READ_BIAS   RWSEM_ACTIVE_BIAS
> +#define RWSEM_ACTIVE_WRITE_BIAS  (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
> +
>  #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
>  /*
>   * All writes to owner are protected by WRITE_ONCE() to make sure that
> @@ -132,3 +152,113 @@ static inline void rwsem_clear_reader_owned(struct 
> rw_semaphore *sem)
>  {
>  }
>  #endif
> +
> +#ifdef CONFIG_RWSEM_XCHGADD_ALGORITHM
> +/*
> + * lock for reading
> + */
> +static inline void __down_read(struct rw_semaphore *sem)
> +{
> + if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0))
> + rwsem_down_read_failed(sem);
> +}
> +
> +static inline int __down_read_killable(struct rw_semaphore *sem)
> +{
> + if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
> + if (IS_ERR(rwsem_down_read_failed_killable(sem)))
> + return -EINTR;
> + }
> +
> + return 0;
> +}
> +
> +static inline int __down_read_trylock(struct rw_semaphore *sem)
> +{
> + long tmp;
> +
> + while ((tmp = atomic_long_read(&sem->count)) >= 0) {
> + if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
> +tmp + RWSEM_ACTIVE_READ_BIAS)) {
> + return 1;

That really wants to be:

if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
tmp + 
RWSEM_ACTIVE_READ_BIAS))

> + }
> + }
> + return 0;
> +}
> +
> +/*
> + * lock for writing
> + */
> +static inline void __down_write(struct rw_semaphore *sem)
> +{
> + long tmp;
> +
> + tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
> +  &sem->count);
> + if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
> + rwsem_down_write_failed(sem);
> +}
> +
> +static inline int __down_write_killable(struct rw_semaphore *sem)
> +{
> + long tmp;
> +
> + tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
> +  &sem->count);
> + if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
> + if (IS_ERR(rwsem_down_write_failed_killable(sem)))
> + return -EINTR;
> + return 0;
> +}
> +
> +static inline int __down_write_trylock(struct rw_semaphore *sem)
> +{
> + long tmp;

tmp = RWSEM_UNLOCKED_VALUE;

> +
> + tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
> +   RWSEM_ACTIVE_WRITE_BIAS);
> + return tmp == RWSEM_UNLOCKED_VALUE;

return atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
   RWSEM_ACTIVE_WRITE_BIAS);

> +}
> +
> +/*
> + * unlock after reading
> + */
> +static inline void __up_read(struct rw_semaphore *sem)
> +{
> + long tmp;
> +
> + tmp = atomic_long_dec_return_release(&sem->count);
> + if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
> + rwsem_wake(sem);
> +}
> +
> +/*
> + * unlock after writing
> + */
> +static inline void __up_write(struct rw_semaphore *sem)
> +{
> + if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
> + &sem->count) < 0))
> + rwsem_wake(sem);
> +}
> +
> +/*
> + * downgrade write lock to read lock
> + */
> +static inline void __downgrade_write(struct rw_semaphore *sem)
> +{
> + long tmp;
> +
> + /*
> +  * When downgrading from exclusive to shared ownership,
> +  * anything inside the write-locked region cannot leak
> +  * into the read side. In contrast, anything in the
> +  * read-locked region is ok to be re-ordered into the
> +  * write side. As such, rely on RELEASE semantics.
> +  */
> + tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
> + if (tmp < 0)
> + rwsem_downgrade_wake(sem);
> +}
> +
> +#endif /* CONFIG_RWSEM_XCHGADD_ALGORITHM */
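
To make the bias arithmetic above concrete, a tiny userspace model (my
own illustration using the 64-bit values, not from the patch):

#include <stdio.h>

#define RWSEM_ACTIVE_MASK	0xffffffffL
#define RWSEM_ACTIVE_BIAS	0x00000001L
#define RWSEM_WAITING_BIAS	(-RWSEM_ACTIVE_MASK-1)
#define RWSEM_ACTIVE_READ_BIAS	RWSEM_ACTIVE_BIAS
#define RWSEM_ACTIVE_WRITE_BIAS	(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)

int main(void)
{
	long count = 0;				/* unlocked */

	count += RWSEM_ACTIVE_READ_BIAS;	/* one reader takes the lock */
	printf("one reader: %#lx (>= 0, so a read trylock succeeds)\n", count);

	count = RWSEM_ACTIVE_WRITE_BIAS;	/* one writer holds the lock */
	printf("one writer: %#lx (< 0, so everyone else backs off)\n", count);
	return 0;
}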


Re: [PATCH v1 03/16] powerpc/32: move LOAD_MSR_KERNEL() into head_32.h and use it

2019-02-11 Thread Benjamin Herrenschmidt
On Mon, 2019-02-11 at 07:26 +0100, Christophe Leroy wrote:
> 
> Le 11/02/2019 à 01:21, Benjamin Herrenschmidt a écrit :
> > On Fri, 2019-02-08 at 12:52 +, Christophe Leroy wrote:
> > >   /*
> > > + * MSR_KERNEL is > 0x8000 on 4xx/Book-E since it includes MSR_CE.
> > > + */
> > > +.macro __LOAD_MSR_KERNEL r, x
> > > +.if \x >= 0x8000
> > > +   lis \r, (\x)@h
> > > +   ori \r, \r, (\x)@l
> > > +.else
> > > +   li \r, (\x)
> > > +.endif
> > > +.endm
> > > +#define LOAD_MSR_KERNEL(r, x) __LOAD_MSR_KERNEL r, x
> > > +
> > 
> > You changed the limit from >= 0x10000 to >= 0x8000 without a
> > corresponding explanation as to why...
> 
> Yes, the existing LOAD_MSR_KERNEL() was buggy because 'li' takes a 
> signed u16, ie between -0x8000 and 0x7fff.

Ah yes, I was only looking at the "large" case which is fine...

> By chance it was working because until now nobody was trying to set 
> MSR_KERNEL | MSR_EE.
> 
> Christophe
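
A standalone check of the 'li' immediate range discussed above
(userspace model, mine):

#include <stdio.h>

/* 'li' sign-extends a 16-bit immediate, so only this range fits */
static int fits_li(long x)
{
	return x >= -0x8000 && x <= 0x7fff;
}

int main(void)
{
	printf("0x7fff  fits li: %d\n", fits_li(0x7fff));
	printf("0x8000  fits li: %d (needs lis/ori)\n", fits_li(0x8000));
	/* the old >= 0x10000 cutoff wrongly let 0x8000..0xffff through */
	printf("0x10000 fits li: %d\n", fits_li(0x10000));
	return 0;
}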