Re: [PATCH 07/12] dma-mapping: move CONFIG_DMA_CMA to kernel/dma/Kconfig
On Mon, Feb 11, 2019 at 02:35:49PM +0100, Christoph Hellwig wrote:
> This is where all the related code already lives.
>
> Signed-off-by: Christoph Hellwig
> ---
>  drivers/base/Kconfig | 77
>  kernel/dma/Kconfig   | 77
>  2 files changed, 77 insertions(+), 77 deletions(-)

Much nicer, thanks!

Reviewed-by: Greg Kroah-Hartman
Re: [PATCH 09/12] dma-mapping: remove the DMA_MEMORY_EXCLUSIVE flag
On Mon, Feb 11, 2019 at 02:35:51PM +0100, Christoph Hellwig wrote:
> All users of dma_declare_coherent want their allocations to be
> exclusive, so default to exclusive allocations.
>
> Signed-off-by: Christoph Hellwig
> ---
>  Documentation/DMA-API.txt                   | 9 +--
>  arch/arm/mach-imx/mach-imx27_visstrim_m10.c | 12 +++--
>  arch/arm/mach-imx/mach-mx31moboard.c        | 3 +--
>  arch/sh/boards/mach-ap325rxa/setup.c        | 5 ++--
>  arch/sh/boards/mach-ecovec24/setup.c        | 6 ++---
>  arch/sh/boards/mach-kfr2r09/setup.c         | 5 ++--
>  arch/sh/boards/mach-migor/setup.c           | 5 ++--
>  arch/sh/boards/mach-se/7724/setup.c         | 6 ++---
>  arch/sh/drivers/pci/fixups-dreamcast.c      | 3 +--
>  .../soc_camera/sh_mobile_ceu_camera.c       | 3 +--
>  drivers/usb/host/ohci-sm501.c               | 3 +--
>  drivers/usb/host/ohci-tmio.c                | 2 +-
>  include/linux/dma-mapping.h                 | 7 ++
>  kernel/dma/coherent.c                       | 25 ++-
>  14 files changed, 29 insertions(+), 65 deletions(-)

Reviewed-by: Greg Kroah-Hartman
Re: [PATCH 06/12] dma-mapping: improve selection of dma_declare_coherent availability
On Mon, Feb 11, 2019 at 02:35:48PM +0100, Christoph Hellwig wrote:
> This API is primarily used through DT entries, but two architectures
> and two drivers call it directly. So instead of selecting the config
> symbol for random architectures pull it in implicitly for the actual
> users. Also rename the Kconfig option to describe the feature better.
>
> Signed-off-by: Christoph Hellwig

Reviewed-by: Greg Kroah-Hartman
Re: [PATCH 02/12] device.h: dma_mem is only needed for HAVE_GENERIC_DMA_COHERENT
On Mon, Feb 11, 2019 at 02:35:44PM +0100, Christoph Hellwig wrote:
> No need to carry an unused field around.
>
> Signed-off-by: Christoph Hellwig
> ---
>  include/linux/device.h | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Greg Kroah-Hartman
Re: [PATCH kernel] powerpc/powernv/ioda: Store correct amount of memory used for table
On 12/02/2019 11:20, David Gibson wrote:
> On Mon, Feb 11, 2019 at 06:48:01PM +1100, Alexey Kardashevskiy wrote:
>> We store 2 multilevel tables in iommu_table - one for the hardware and
>> one with the corresponding userspace addresses. Before allocating
>> the tables, the iommu_table_group_ops::get_table_size() hook returns
>> the combined size of the two, and the VFIO SPAPR TCE IOMMU driver
>> adjusts the locked_vm counter correctly. When the table is actually
>> allocated, the amount of allocated memory is stored in
>> iommu_table::it_allocated_size and used to adjust the locked_vm counter
>> when we release the memory used by the table; .get_table_size() and
>> .create_table() calculate it independently, but the result is expected
>> to be the same.
>
> Any way we can remove that redundant calculation? That seems like
> begging for bugs.

I do not see an easy way. One way could be adding a "dryrun" flag to
pnv_pci_ioda2_table_alloc_pages(), counting allocated memory there and
calling it from .get_table_size(), but for multilevel TCEs it only
allocates the first level...

>> Unfortunately the allocator does not add the userspace table size to
>> ::it_allocated_size, so when we destroy the table because of VFIO PCI
>> unplug (i.e. the VFIO container is gone but the userspace keeps
>> running), we decrement locked_vm by just half of the size of the memory
>> we are releasing. As a result, we leak locked_vm and may not be able to
>> allocate more IOMMU tables after a few iterations of hotplug/unplug.
>>
>> This adjusts it_allocated_size if the userspace addresses table was
>> requested (total_allocated_uas is initialized to zero).
>>
>> Fixes: 090bad39b "powerpc/powernv: Add indirect levels to it_userspace"
>> Signed-off-by: Alexey Kardashevskiy
>
> Reviewed-by: David Gibson
>
>> ---
>>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
>> index 697449a..58146e1 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
>> @@ -313,7 +313,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
>>  			page_shift);
>>  	tbl->it_level_size = 1ULL << (level_shift - 3);
>>  	tbl->it_indirect_levels = levels - 1;
>> -	tbl->it_allocated_size = total_allocated;
>> +	tbl->it_allocated_size = total_allocated + total_allocated_uas;
>>  	tbl->it_userspace = uas;
>>  	tbl->it_nid = nid;
>>

--
Alexey
Re: [GIT PULL] of: overlay: validation checks, subsequent fixes for v20 -- correction: v4.20
On Mon, Feb 11, 2019 at 02:43:48PM -0600, Alan Tull wrote:
> On Mon, Feb 11, 2019 at 1:13 PM Greg Kroah-Hartman wrote:
> >
> > On Mon, Feb 11, 2019 at 12:41:40PM -0600, Alan Tull wrote:
> > > On Fri, Nov 9, 2018 at 12:58 AM Frank Rowand wrote:
> > >
> > > What LTSI's are these patches likely to end up in? Just to be clear,
> > > I'm not pushing for any specific answer, I just want to know what to
> > > expect.
> >
> > I have no idea what you are asking here.
> >
> > What patches?
>
> I probably should have asked my question *below* the pertinent context
> of the 17 patches listed in the pull request, which was:
>
> > of: overlay: add tests to validate kfrees from overlay removal
> > of: overlay: add missing of_node_put() after add new node to changeset
> > of: overlay: add missing of_node_get() in __of_attach_node_sysfs
> > powerpc/pseries: add of_node_put() in dlpar_detach_node()
> > of: overlay: use prop add changeset entry for property in new nodes
> > of: overlay: do not duplicate properties from overlay for new nodes
> > of: overlay: reorder fields in struct fragment
> > of: overlay: validate overlay properties #address-cells and #size-cells
> > of: overlay: make all pr_debug() and pr_err() messages unique
> > of: overlay: test case of two fragments adding same node
> > of: overlay: check prevents multiple fragments add or delete same node
> > of: overlay: check prevents multiple fragments touching same property
> > of: unittest: remove unused of_unittest_apply_overlay() argument
> > of: overlay: set node fields from properties when add new overlay node
> > of: unittest: allow base devicetree to have symbol metadata
> > of: unittest: find overlays[] entry by name instead of index
> > of: unittest: initialize args before calling of_*parse_*()
>
> > What is "LTSI's"?
>
> I have recently seen some of the devicetree patches being picked up for
> the 4.20 stable-queue.
> That seemed to suggest that some, but not all,
> of these will end up in the next LTS release.

If the git commit has the "cc: stable@" marking in it, yes, it will be
picked up. Without the actual git ids, it's hard to know what did, and
what did not, get backported :)

> Also I was wondering if any of this is likely to get backported to
> LTSI-4.14.

Note, "LTSI" and "LTS" are two different things. "LTSI" is a project
run by some LF member companies, and "LTS" are the normal long-term
kernels that I release on kernel.org. They have vastly different
requirements for inclusion in them. If you have questions about LTSI, I
recommend asking on their mailing list.

As for showing up in the 4.14 "LTS" kernel, again, I need git commit
ids to know for sure. Also, as these are now in Linus's tree, you
should be able to look at the stable releases yourself to see if they
are present there, right?

thanks,

greg k-h
Re: [PATCH 2/5] vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned pages
On 12/02/2019 09:44, Daniel Jordan wrote: > Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned > pages"), locked and pinned pages are accounted separately. The SPAPR > TCE VFIO IOMMU driver accounts pinned pages to locked_vm; use pinned_vm > instead. > > pinned_vm recently became atomic and so no longer relies on mmap_sem > held as writer: delete. > > Signed-off-by: Daniel Jordan > --- > Documentation/vfio.txt | 6 +-- > drivers/vfio/vfio_iommu_spapr_tce.c | 64 ++--- > 2 files changed, 33 insertions(+), 37 deletions(-) > > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt > index f1a4d3c3ba0b..fa37d65363f9 100644 > --- a/Documentation/vfio.txt > +++ b/Documentation/vfio.txt > @@ -308,7 +308,7 @@ This implementation has some specifics: > currently there is no way to reduce the number of calls. In order to make > things faster, the map/unmap handling has been implemented in real mode > which provides an excellent performance which has limitations such as > - inability to do locked pages accounting in real time. > + inability to do pinned pages accounting in real time. > > 4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O > subtree that can be treated as a unit for the purposes of partitioning and > @@ -324,7 +324,7 @@ This implementation has some specifics: > returns the size and the start of the DMA window on the PCI bus. > > VFIO_IOMMU_ENABLE > - enables the container. The locked pages accounting > + enables the container. The pinned pages accounting > is done at this point. This lets user first to know what > the DMA window is and adjust rlimit before doing any real job. > > @@ -454,7 +454,7 @@ This implementation has some specifics: > > PPC64 paravirtualized guests generate a lot of map/unmap requests, > and the handling of those includes pinning/unpinning pages and updating > - mm::locked_vm counter to make sure we do not exceed the rlimit. 
> + mm::pinned_vm counter to make sure we do not exceed the rlimit. > The v2 IOMMU splits accounting and pinning into separate operations: > > - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY > ioctls > diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c > b/drivers/vfio/vfio_iommu_spapr_tce.c > index c424913324e3..f47e020dc5e4 100644 > --- a/drivers/vfio/vfio_iommu_spapr_tce.c > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c > @@ -34,9 +34,11 @@ > static void tce_iommu_detach_group(void *iommu_data, > struct iommu_group *iommu_group); > > -static long try_increment_locked_vm(struct mm_struct *mm, long npages) > +static long try_increment_pinned_vm(struct mm_struct *mm, long npages) > { > - long ret = 0, locked, lock_limit; > + long ret = 0; > + s64 pinned; > + unsigned long lock_limit; > > if (WARN_ON_ONCE(!mm)) > return -EPERM; > @@ -44,39 +46,33 @@ static long try_increment_locked_vm(struct mm_struct *mm, > long npages) > if (!npages) > return 0; > > - down_write(&mm->mmap_sem); > - locked = mm->locked_vm + npages; > + pinned = atomic64_add_return(npages, &mm->pinned_vm); > lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > - if (locked > lock_limit && !capable(CAP_IPC_LOCK)) > + if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) { > ret = -ENOMEM; > - else > - mm->locked_vm += npages; > + atomic64_sub(npages, &mm->pinned_vm); > + } > > - pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid, > + pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%lu%s\n", current->pid, > npages << PAGE_SHIFT, > - mm->locked_vm << PAGE_SHIFT, > - rlimit(RLIMIT_MEMLOCK), > - ret ? " - exceeded" : ""); > - > - up_write(&mm->mmap_sem); > + atomic64_read(&mm->pinned_vm) << PAGE_SHIFT, > + rlimit(RLIMIT_MEMLOCK), ret ? 
" - exceeded" : ""); > > return ret; > } > > -static void decrement_locked_vm(struct mm_struct *mm, long npages) > +static void decrement_pinned_vm(struct mm_struct *mm, long npages) > { > if (!mm || !npages) > return; > > - down_write(&mm->mmap_sem); > - if (WARN_ON_ONCE(npages > mm->locked_vm)) > - npages = mm->locked_vm; > - mm->locked_vm -= npages; > - pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid, > + if (WARN_ON_ONCE(npages > atomic64_read(&mm->pinned_vm))) > + npages = atomic64_read(&mm->pinned_vm); > + atomic64_sub(npages, &mm->pinned_vm); > + pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%lu\n", current->pid, > npages << PAGE_SHIFT, > - mm->locked_vm << PAGE_SHIFT, > + atomic64_read(&mm->p
[PATCH kernel] KVM: PPC: Release all hardware TCE tables attached to a group
The SPAPR TCE KVM device references all hardware IOMMU tables assigned
to some IOMMU group to ensure that in-kernel KVM acceleration of
H_PUT_TCE can work. The tables are referenced when an IOMMU group gets
registered with the VFIO KVM device by the KVM_DEV_VFIO_GROUP_ADD
ioctl; KVM_DEV_VFIO_GROUP_DEL calls into the dereferencing code in
kvm_spapr_tce_release_iommu_group(), which walks through the list of
LIOBNs, finds a matching IOMMU table and calls kref_put() when found.

However, that code stops after the very first successful dereference,
leaving the other tables referenced until the SPAPR TCE KVM device is
destroyed, which normally happens on guest reboot or termination. So if
we do hotplug and unplug in a loop, we are leaking IOMMU tables here.

This removes the premature return to let
kvm_spapr_tce_release_iommu_group() find and dereference all attached
tables.

Fixes: 121f80ba68f "KVM: PPC: VFIO: Add in-kernel acceleration for VFIO"
Signed-off-by: Alexey Kardashevskiy
---

I kinda hoped to blame RCU for misbehaviour but it was me all over again :)

---
 arch/powerpc/kvm/book3s_64_vio.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 532ab797..6630dde 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -133,7 +133,6 @@ extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
 				continue;
 
 			kref_put(&stit->kref, kvm_spapr_tce_liobn_put);
-			return;
 		}
 	}
 }
-- 
2.17.1
Re: [PATCH 06/12] dma-mapping: improve selection of dma_declare_coherent availability
Hi Christoph,

On Mon, Feb 11, 2019 at 02:35:48PM +0100, Christoph Hellwig wrote:
> This API is primarily used through DT entries, but two architectures
> and two drivers call it directly. So instead of selecting the config
> symbol for random architectures pull it in implicitly for the actual
> users. Also rename the Kconfig option to describe the feature better.
>
> Signed-off-by: Christoph Hellwig

Acked-by: Paul Burton # MIPS

Thanks,
Paul
Re: [PATCH kernel] powerpc/powernv/ioda: Store correct amount of memory used for table
On Mon, Feb 11, 2019 at 06:48:01PM +1100, Alexey Kardashevskiy wrote:
> We store 2 multilevel tables in iommu_table - one for the hardware and
> one with the corresponding userspace addresses. Before allocating
> the tables, the iommu_table_group_ops::get_table_size() hook returns
> the combined size of the two, and the VFIO SPAPR TCE IOMMU driver
> adjusts the locked_vm counter correctly. When the table is actually
> allocated, the amount of allocated memory is stored in
> iommu_table::it_allocated_size and used to adjust the locked_vm counter
> when we release the memory used by the table; .get_table_size() and
> .create_table() calculate it independently, but the result is expected
> to be the same.

Any way we can remove that redundant calculation? That seems like
begging for bugs.

> Unfortunately the allocator does not add the userspace table size to
> ::it_allocated_size, so when we destroy the table because of VFIO PCI
> unplug (i.e. the VFIO container is gone but the userspace keeps
> running), we decrement locked_vm by just half of the size of the memory
> we are releasing. As a result, we leak locked_vm and may not be able to
> allocate more IOMMU tables after a few iterations of hotplug/unplug.
>
> This adjusts it_allocated_size if the userspace addresses table was
> requested (total_allocated_uas is initialized to zero).
>
> Fixes: 090bad39b "powerpc/powernv: Add indirect levels to it_userspace"
> Signed-off-by: Alexey Kardashevskiy

Reviewed-by: David Gibson

> ---
>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> index 697449a..58146e1 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> @@ -313,7 +313,7 @@ long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
>  			page_shift);
>  	tbl->it_level_size = 1ULL << (level_shift - 3);
>  	tbl->it_indirect_levels = levels - 1;
> -	tbl->it_allocated_size = total_allocated;
> +	tbl->it_allocated_size = total_allocated + total_allocated_uas;
>  	tbl->it_userspace = uas;
>  	tbl->it_nid = nid;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [PATCH kernel] vfio/spapr_tce: Skip unsetting already unset table
On Mon, Feb 11, 2019 at 06:49:17PM +1100, Alexey Kardashevskiy wrote:
> VFIO TCE IOMMU v2 owns the IOMMU tables, so when we detach an IOMMU
> group from a container, we need to unset those tables from the group,
> which we do by calling unset_window() unconditionally. We also unset
> tables when removing a DMA window via the VFIO_IOMMU_SPAPR_TCE_REMOVE
> ioctl.
>
> The window removal checks whether the table actually exists (hidden
> inside tce_iommu_find_table()) but the group detaching does not, so
> the user may see duplicate messages:
> pci 0009:03 : [PE# fd] Removing DMA window #0
> pci 0009:03 : [PE# fd] Removing DMA window #1
> pci 0009:03 : [PE# fd] Removing DMA window #0
> pci 0009:03 : [PE# fd] Removing DMA window #1
>
> At the moment this is not a problem, as the second invocation of
> unset_window() writes zeroes to the HW registers again and exits early
> as there is no table.
>
> Signed-off-by: Alexey Kardashevskiy

Reviewed-by: David Gibson

> ---
>
> When doing VFIO PCI hot unplug, first we remove the DMA window and
> unset container->tables[num] - this is the first couple of messages.
> Then we detach the group and then we see another couple of the same
> messages, which confused myself.
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index c424913..8dbb270 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -1235,7 +1235,8 @@ static void tce_iommu_release_ownership_ddw(struct tce_container *container,
>  	}
>
>  	for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i)
> -		table_group->ops->unset_window(table_group, i);
> +		if (container->tables[i])
> +			table_group->ops->unset_window(table_group, i);
>
>  	table_group->ops->release_ownership(table_group);
> }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.
NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH] powerpc/configs: Enable CONFIG_USB_XHCI_HCD by default
On Mon, 11 Feb 2019 at 22:07, Thomas Huth wrote:
>
> Recent versions of QEMU provide a XHCI device by default these
> days instead of an old-fashioned OHCI device:
>
> https://git.qemu.org/?p=qemu.git;a=commitdiff;h=57040d451315320b7d27

"recent" :D

> So to get the keyboard working in the graphical console there again,
> we should now include XHCI support in the kernel by default, too.
>
> Signed-off-by: Thomas Huth

Acked-by: Joel Stanley

> ---
>  arch/powerpc/configs/pseries_defconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig
> index ea79c51..62e12f6 100644
> --- a/arch/powerpc/configs/pseries_defconfig
> +++ b/arch/powerpc/configs/pseries_defconfig
> @@ -217,6 +217,7 @@ CONFIG_USB_MON=m
>  CONFIG_USB_EHCI_HCD=y
>  # CONFIG_USB_EHCI_HCD_PPC_OF is not set
>  CONFIG_USB_OHCI_HCD=y
> +CONFIG_USB_XHCI_HCD=y
>  CONFIG_USB_STORAGE=m
>  CONFIG_NEW_LEDS=y
>  CONFIG_LEDS_CLASS=m
> --
> 1.8.3.1
>
Re: [PATCH v4 3/3] powerpc/32: Add KASAN support
Andrey Ryabinin writes: > On 2/11/19 3:25 PM, Andrey Konovalov wrote: >> On Sat, Feb 9, 2019 at 12:55 PM christophe leroy >> wrote: >>> >>> Hi Andrey, >>> >>> Le 08/02/2019 à 18:40, Andrey Konovalov a écrit : On Fri, Feb 8, 2019 at 6:17 PM Christophe Leroy wrote: > > Hi Daniel, > > Le 08/02/2019 à 17:18, Daniel Axtens a écrit : >> Hi Christophe, >> >> I've been attempting to port this to 64-bit Book3e nohash (e6500), >> although I think I've ended up with an approach more similar to Aneesh's >> much earlier (2015) series for book3s. >> >> Part of this is just due to the changes between 32 and 64 bits - we need >> to hack around the discontiguous mappings - but one thing that I'm >> particularly puzzled by is what the kasan_early_init is supposed to do. > > It should be a problem as my patch uses a 'for_each_memblock(memory, > reg)' loop. > >> >>> +void __init kasan_early_init(void) >>> +{ >>> +unsigned long addr = KASAN_SHADOW_START; >>> +unsigned long end = KASAN_SHADOW_END; >>> +unsigned long next; >>> +pmd_t *pmd = pmd_offset(pud_offset(pgd_offset_k(addr), addr), >>> addr); >>> +int i; >>> +phys_addr_t pa = __pa(kasan_early_shadow_page); >>> + >>> +BUILD_BUG_ON(KASAN_SHADOW_START & ~PGDIR_MASK); >>> + >>> +if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE)) >>> +panic("KASAN not supported with Hash MMU\n"); >>> + >>> +for (i = 0; i < PTRS_PER_PTE; i++) >>> +__set_pte_at(&init_mm, (unsigned >>> long)kasan_early_shadow_page, >>> + kasan_early_shadow_pte + i, >>> + pfn_pte(PHYS_PFN(pa), PAGE_KERNEL_RO), 0); >>> + >>> +do { >>> +next = pgd_addr_end(addr, end); >>> +pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte); >>> +} while (pmd++, addr = next, addr != end); >>> +} >> >> As far as I can tell it's mapping the early shadow page, read-only, over >> the KASAN_SHADOW_START->KASAN_SHADOW_END range, and it's using the early >> shadow PTE array from the generic code. 
>> >> I haven't been able to find an answer to why this is in the docs, so I >> was wondering if you or anyone else could explain the early part of >> kasan init a bit better. > > See https://www.kernel.org/doc/html/latest/dev-tools/kasan.html for an > explanation of the shadow. > > When shadow is 0, it means the memory area is entirely accessible. > > It is necessary to setup a shadow area as soon as possible because all > data accesses check the shadow area, from the begining (except for a few > files where sanitizing has been disabled in Makefiles). > > Until the real shadow area is set, all access are granted thanks to the > zero shadow area beeing for of zeros. Not entirely correct. kasan_early_init() indeed maps the whole shadow memory range to the same kasan_early_shadow_page. However as kernel loads and memory gets allocated this shadow page gets rewritten with non-zero values by different KASAN allocator hooks. Since these values come from completely different parts of the kernel, but all land on the same page, kasan_early_shadow_page's content can be considered garbage. When KASAN checks memory accesses for validity it detects these garbage shadow values, but doesn't print any reports, as the reporting routine bails out on the current->kasan_depth check (which has the value of 1 initially). Only after kasan_init() completes, when the proper shadow memory is mapped, current->kasan_depth gets set to 0 and we start reporting bad accesses. >>> >>> That's surprising, because in the early phase I map the shadow area >>> read-only, so I do not expect it to get modified unless RO protection is >>> failing for some reason. >> >> Actually it might be that the allocator hooks don't modify shadow at >> this point, as the allocator is not yet initialized. However stack >> should be getting poisoned and unpoisoned from the very start. But the >> generic statement that early shadow gets dirtied should be correct. >> Might it be that you don't use stack instrumentation? 
>> > > Yes, stack instrumentation is not used here, because shadow offset which we > pass to > the -fasan-shadow-offset= cflag is not specified here. So the logic in > scrpits/Makefile.kasan > just fallbacks to CFLAGS_KASAN_MINIMAL, which is outline and without stack > instrumentation. > > Christophe, you can specify KASAN_SHADOW_OFFSET either in Kconfig (e.g. > x86_64) or > in Makefile (e.g. arm64). And make early mapping writable, because compiler > generated code will write > to shadow memory in function prologue/epilogue. Hmm. Is this limitation just that compilers have not implemented out-of-line suppor
[PATCH] powerpc/powernv: Don't reprogram SLW image on every KVM guest entry/exit
Commit 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via stop-api
only on Hotplug", 2017-07-21) added two calls to opal_slw_set_reg()
inside pnv_cpu_offline(), with the aim of changing the LPCR value in
the SLW image to disable wakeups from the decrementer while a CPU is
offline.

However, pnv_cpu_offline() gets called each time a secondary CPU thread
is woken up to participate in running a KVM guest, that is, not just
when a CPU is offlined. Since opal_slw_set_reg() is a very slow
operation (with observed execution times around 20 milliseconds), this
means that an offline secondary CPU can often be busy doing the
opal_slw_set_reg() call when the primary CPU wants to grab all the
secondary threads so that it can run a KVM guest. This leads to
messages like "KVM: couldn't grab CPU n" being printed and guest
execution failing.

There is no need to reprogram the SLW image on every KVM guest entry
and exit. So that we do it only when a CPU is really transitioning
between online and offline, this moves the calls to
pnv_program_cpu_hotplug_lpcr() into pnv_smp_cpu_kill_self().
Fixes: 24be85a23d1f ("powerpc/powernv: Clear PECE1 in LPCR via stop-api only on Hotplug") Cc: sta...@vger.kernel.org # v4.14+ Signed-off-by: Paul Mackerras --- arch/powerpc/include/asm/powernv.h| 2 ++ arch/powerpc/platforms/powernv/idle.c | 27 ++- arch/powerpc/platforms/powernv/smp.c | 25 + 3 files changed, 29 insertions(+), 25 deletions(-) diff --git a/arch/powerpc/include/asm/powernv.h b/arch/powerpc/include/asm/powernv.h index 2f3ff7a27881..d85fcfea32ca 100644 --- a/arch/powerpc/include/asm/powernv.h +++ b/arch/powerpc/include/asm/powernv.h @@ -23,6 +23,8 @@ extern int pnv_npu2_handle_fault(struct npu_context *context, uintptr_t *ea, unsigned long *flags, unsigned long *status, int count); +void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val); + void pnv_tm_init(void); #else static inline void powernv_set_nmmu_ptcr(unsigned long ptcr) { } diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c index 35f699ebb662..e52f9b06dd9c 100644 --- a/arch/powerpc/platforms/powernv/idle.c +++ b/arch/powerpc/platforms/powernv/idle.c @@ -458,7 +458,8 @@ EXPORT_SYMBOL_GPL(pnv_power9_force_smt4_release); #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */ #ifdef CONFIG_HOTPLUG_CPU -static void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val) + +void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val) { u64 pir = get_hard_smp_processor_id(cpu); @@ -481,20 +482,6 @@ unsigned long pnv_cpu_offline(unsigned int cpu) { unsigned long srr1; u32 idle_states = pnv_get_supported_cpuidle_states(); - u64 lpcr_val; - - /* -* We don't want to take decrementer interrupts while we are -* offline, so clear LPCR:PECE1. We keep PECE2 (and -* LPCR_PECE_HVEE on P9) enabled as to let IPIs in. -* -* If the CPU gets woken up by a special wakeup, ensure that -* the SLW engine sets LPCR with decrementer bit cleared, else -* the CPU will come back to the kernel due to a spurious -* wakeup. 
-*/ - lpcr_val = mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1; - pnv_program_cpu_hotplug_lpcr(cpu, lpcr_val); __ppc64_runlatch_off(); @@ -526,16 +513,6 @@ unsigned long pnv_cpu_offline(unsigned int cpu) __ppc64_runlatch_on(); - /* -* Re-enable decrementer interrupts in LPCR. -* -* Further, we want stop states to be woken up by decrementer -* for non-hotplug cases. So program the LPCR via stop api as -* well. -*/ - lpcr_val = mfspr(SPRN_LPCR) | (u64)LPCR_PECE1; - pnv_program_cpu_hotplug_lpcr(cpu, lpcr_val); - return srr1; } #endif diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c index 0d354e19ef92..db09c7022635 100644 --- a/arch/powerpc/platforms/powernv/smp.c +++ b/arch/powerpc/platforms/powernv/smp.c @@ -39,6 +39,7 @@ #include #include #include +#include #include "powernv.h" @@ -153,6 +154,7 @@ static void pnv_smp_cpu_kill_self(void) { unsigned int cpu; unsigned long srr1, wmask; + u64 lpcr_val; /* Standard hot unplug procedure */ /* @@ -174,6 +176,19 @@ static void pnv_smp_cpu_kill_self(void) if (cpu_has_feature(CPU_FTR_ARCH_207S)) wmask = SRR1_WAKEMASK_P8; + /* +* We don't want to take decrementer interrupts while we are +* offline, so clear LPCR:PECE1. We keep PECE2 (and +* LPCR_PECE_HVEE on P9) enabled so as to let IPIs in. +* +* If the CPU gets woken up by a special wakeup, ensure that +* the SLW engine sets LPCR with decrementer bit cleared, else +* the CPU will come back to the kernel due to a sp
Re: [QUESTION] powerpc, libseccomp, and spu
Hi Tom,

Sorry this has caused you trouble, using "spu" there is a bit of a hack
and I want to remove it. See:

  https://patchwork.ozlabs.org/patch/1025830/

Unfortunately that series clashed with some of Arnd's work and I
haven't got around to rebasing it.

Tom Hromatka writes:
> PowerPC experts,
>
> Paul Moore and I are working on the v2.4 release of libseccomp,
> and as part of this work I need to update the syscall table for
> each architecture.
>
> I have incorporated the new ppc syscall.tbl into libseccomp, but
> I am not familiar with the value of "spu" in the ABI column. For
> example:
>
> 22	32	umount	sys_oldumount
> 22	64	umount	sys_ni_syscall
> 22	spu	umount	sys_ni_syscall
>
> In libseccomp, we maintain a 32-bit ppc syscall table and a 64-bit
> ppc syscall table. Do we also need to add a "spu" ppc syscall
> table? Some clarification on the syscalls marked "spu" and "nospu"
> would be greatly appreciated.

The name "spu" comes from SPU, which are the small cores in the
Playstation 3. The value in the syscall table says whether that syscall
is available to SPU programs ("spu") or blocked ("nospu").

I don't think you want to support libseccomp on SPUs, so basically you
can just ignore the spu/nospu distinction.

So I'm pretty sure you can just remove all the "spu" lines, and then
replace "nospu" with "common". As I've done below.

I'll try and get my patch above into a branch and into linux-next
somehow, so that you can at least refer to an upstream commit.

cheers

# SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
#
# system call numbers and entry vectors for powerpc
#
# The format is:
# <number> <abi> <name> <entry point> <compat entry point>
#
# The <abi> can be common, 64, or 32 for this file.
#
0	common	restart_syscall	sys_restart_syscall
1	common	exit	sys_exit
2	common	fork	ppc_fork
3	common	read	sys_read
4	common	write	sys_write
5	common	open	sys_open	compat_sys_open
6	common	close	sys_close
7	common	waitpid	sys_waitpid
8	common	creat	sys_creat
9	common	link	sys_link
10	common	unlink	sys_unlink
11	common	execve	sys_execve	compat_sys_execve
12	common	chdir	sys_chdir
13	common	time	sys_time	compat_sys_time
14	common	mknod	sys_mknod
15	common	chmod	sys_chmod
16	common	lchown	sys_lchown
17	common	break	sys_ni_syscall
18	32	oldstat	sys_stat	sys_ni_syscall
18	64	oldstat	sys_ni_syscall
19	common	lseek	sys_lseek	compat_sys_lseek
20	common	getpid	sys_getpid
21	common	mount	sys_mount	compat_sys_mount
22	32	umount	sys_oldumount
22	64	umount	sys_ni_syscall
23	common	setuid	sys_setuid
24	common	getuid	sys_getuid
25	common	stime	sys_stime	compat_sys_stime
26	common	ptrace	sys_ptrace	compat_sys_ptrace
27	common	alarm	sys_alarm
28	32	oldfstat	sys_fstat	sys_ni_syscall
28	64	oldfstat	sys_ni_syscall
29	common	pause	sys_pause
30	common	utime	sys_utime	compat_sys_utime
31	common	stty	sys_ni_syscall
32	common	gtty	sys_ni_syscall
33	common	access	sys_access
34	common	nice	sys_nice
35	common	ftime	sys_ni_syscall
36	common	sync	sys_sync
37	common	kill	sys_kill
38	common	rename	sys_rename
39	common	mkdir	sys_mkdir
40	common	rmdir	sys_rmdir
41	common	dup	sys_dup
42	common	pipe	sys_pipe
43	common	times	sys_times	compat_sys_time
Re: [QUESTION] powerpc, libseccomp, and spu
On Mon, 2019-02-11 at 11:54 -0700, Tom Hromatka wrote:
> PowerPC experts,
>
> Paul Moore and I are working on the v2.4 release of libseccomp,
> and as part of this work I need to update the syscall table for
> each architecture.
>
> I have incorporated the new ppc syscall.tbl into libseccomp, but
> I am not familiar with the value of "spu" in the ABI column. For
> example:
>
> 22   32    umount    sys_oldumount
> 22   64    umount    sys_ni_syscall
> 22   spu   umount    sys_ni_syscall
>
> In libseccomp, we maintain a 32-bit ppc syscall table and a 64-bit
> ppc syscall table. Do we also need to add a "spu" ppc syscall
> table? Some clarification on the syscalls marked "spu" and "nospu"
> would be greatly appreciated.

On the Cell processor, there are a number of little co-processors (SPUs) that run alongside the main PowerPC core. Userspace can run code on them; they operate within the user context via their own MMUs. We provide a facility for them to issue syscalls (via some kind of RPC to the main core).

The "spu" indication marks syscalls that can be called from the SPUs via that mechanism.

Now, the big question is, anybody still using Cell ? :-)

Cheers,
Ben.
Re: [PATCH] powerpc/configs: Enable CONFIG_USB_XHCI_HCD by default
On Mon, 11 Feb 2019 12:37:12 +0100 Thomas Huth wrote:
> Recent versions of QEMU provide a XHCI device by default these
> days instead of an old-fashioned OHCI device:
>
> https://git.qemu.org/?p=qemu.git;a=commitdiff;h=57040d451315320b7d27
>
> So to get the keyboard working in the graphical console there again,
> we should now include XHCI support in the kernel by default, too.
>
> Signed-off-by: Thomas Huth

Wow, we didn't before? That's bonkers.

Reviewed-by: David Gibson

> ---
>  arch/powerpc/configs/pseries_defconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig
> index ea79c51..62e12f6 100644
> --- a/arch/powerpc/configs/pseries_defconfig
> +++ b/arch/powerpc/configs/pseries_defconfig
> @@ -217,6 +217,7 @@ CONFIG_USB_MON=m
>  CONFIG_USB_EHCI_HCD=y
>  # CONFIG_USB_EHCI_HCD_PPC_OF is not set
>  CONFIG_USB_OHCI_HCD=y
> +CONFIG_USB_XHCI_HCD=y
>  CONFIG_USB_STORAGE=m
>  CONFIG_NEW_LEDS=y
>  CONFIG_LEDS_CLASS=m
> --
> 1.8.3.1

--
David Gibson
Principal Software Engineer, Virtualization, Red Hat
Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages
On Mon, Feb 11, 2019 at 03:54:47PM -0700, Jason Gunthorpe wrote:
> On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> > Hi,
> >
> > This series converts users that account pinned pages with locked_vm to
> > account with pinned_vm instead, pinned_vm being the correct counter to
> > use. It's based on a similar patch I posted recently[0].
> >
> > The patches are based on rdma/for-next to build on Davidlohr Bueso's
> > recent conversion of pinned_vm to an atomic64_t[1]. Seems to make some
> > sense for these to be routed the same way, despite lack of rdma content?
>
> Oy.. I'd be willing to accumulate a branch with acks to send to Linus
> *separately* from RDMA to Linus, but this is very abnormal.
>
> Better to wait a few weeks for -rc1 and send patches through the
> subsystem trees.

Ok, I can do that. It did seem strange, so I made it a question...
Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
On Mon, Feb 11, 2019 at 03:56:20PM -0700, Jason Gunthorpe wrote: > On Mon, Feb 11, 2019 at 05:44:33PM -0500, Daniel Jordan wrote: > > @@ -266,24 +267,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long > > npage, bool async) > > if (!mm) > > return -ESRCH; /* process exited */ > > > > - ret = down_write_killable(&mm->mmap_sem); > > - if (!ret) { > > - if (npage > 0) { > > - if (!dma->lock_cap) { > > - unsigned long limit; > > - > > - limit = task_rlimit(dma->task, > > - RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > + pinned_vm = atomic64_add_return(npage, &mm->pinned_vm); > > > > - if (mm->locked_vm + npage > limit) > > - ret = -ENOMEM; > > - } > > + if (npage > 0 && !dma->lock_cap) { > > + unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >> > > + > > - PAGE_SHIFT; > > I haven't looked at this super closely, but how does this stuff work? > > do_mlock doesn't touch pinned_vm, and this doesn't touch locked_vm... > > Shouldn't all this be 'if (locked_vm + pinned_vm < RLIMIT_MEMLOCK)' ? > > Otherwise MEMLOCK is really doubled.. So this has been a problem for some time, but it's not as easy as adding them together, see [1][2] for a start. The locked_vm/pinned_vm issue definitely needs fixing, but all this series is trying to do is account to the right counter. Daniel [1] http://lkml.kernel.org/r/20130523104154.ga23...@twins.programming.kicks-ass.net [2] http://lkml.kernel.org/r/20130524140114.gk23...@twins.programming.kicks-ass.net
Re: [PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
On Mon, Feb 11, 2019 at 05:44:33PM -0500, Daniel Jordan wrote: > Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned > pages"), locked and pinned pages are accounted separately. Type1 > accounts pinned pages to locked_vm; use pinned_vm instead. > > pinned_vm recently became atomic and so no longer relies on mmap_sem > held as writer: delete. > > Signed-off-by: Daniel Jordan > drivers/vfio/vfio_iommu_type1.c | 31 --- > 1 file changed, 12 insertions(+), 19 deletions(-) > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c > index 73652e21efec..a56cc341813f 100644 > +++ b/drivers/vfio/vfio_iommu_type1.c > @@ -257,7 +257,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, > struct vfio_pfn *vpfn) > static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) > { > struct mm_struct *mm; > - int ret; > + s64 pinned_vm; > + int ret = 0; > > if (!npage) > return 0; > @@ -266,24 +267,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long > npage, bool async) > if (!mm) > return -ESRCH; /* process exited */ > > - ret = down_write_killable(&mm->mmap_sem); > - if (!ret) { > - if (npage > 0) { > - if (!dma->lock_cap) { > - unsigned long limit; > - > - limit = task_rlimit(dma->task, > - RLIMIT_MEMLOCK) >> PAGE_SHIFT; > + pinned_vm = atomic64_add_return(npage, &mm->pinned_vm); > > - if (mm->locked_vm + npage > limit) > - ret = -ENOMEM; > - } > + if (npage > 0 && !dma->lock_cap) { > + unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >> > + > - PAGE_SHIFT; I haven't looked at this super closely, but how does this stuff work? do_mlock doesn't touch pinned_vm, and this doesn't touch locked_vm... Shouldn't all this be 'if (locked_vm + pinned_vm < RLIMIT_MEMLOCK)' ? Otherwise MEMLOCK is really doubled.. Jason
Re: [PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages
On Mon, Feb 11, 2019 at 05:44:32PM -0500, Daniel Jordan wrote:
> Hi,
>
> This series converts users that account pinned pages with locked_vm to
> account with pinned_vm instead, pinned_vm being the correct counter to
> use. It's based on a similar patch I posted recently[0].
>
> The patches are based on rdma/for-next to build on Davidlohr Bueso's
> recent conversion of pinned_vm to an atomic64_t[1]. Seems to make some
> sense for these to be routed the same way, despite lack of rdma content?

Oy.. I'd be willing to accumulate a branch with acks to send to Linus *separately* from RDMA to Linus, but this is very abnormal.

Better to wait a few weeks for -rc1 and send patches through the subsystem trees.

> All five of these places, and probably some of Davidlohr's conversions,
> probably want to be collapsed into a common helper in the core mm for
> accounting pinned pages. I tried, and there are several details that
> likely need discussion, so this can be done as a follow-on.

I've wondered the same..

Jason
[PATCH 5/5] kvm/book3s: use pinned_vm instead of locked_vm to account pinned pages
Memory used for TCE tables in kvm_vm_ioctl_create_spapr_tce is currently accounted to locked_vm because it stays resident and its allocation is directly triggered from userspace as explained in f8626985c7c2 ("KVM: PPC: Account TCE-containing pages in locked_vm"). However, since the memory comes straight from the page allocator (and to a lesser extent unreclaimable slab) and is effectively pinned, it should be accounted with pinned_vm (see bc3e53f682d9 ("mm: distinguish between mlocked and pinned pages")). pinned_vm recently became atomic and so no longer relies on mmap_sem held as writer: delete. Signed-off-by: Daniel Jordan --- arch/powerpc/kvm/book3s_64_vio.c | 35 ++-- 1 file changed, 15 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c index 532ab79734c7..2f8d7c051e4e 100644 --- a/arch/powerpc/kvm/book3s_64_vio.c +++ b/arch/powerpc/kvm/book3s_64_vio.c @@ -56,39 +56,34 @@ static unsigned long kvmppc_stt_pages(unsigned long tce_pages) return tce_pages + ALIGN(stt_bytes, PAGE_SIZE) / PAGE_SIZE; } -static long kvmppc_account_memlimit(unsigned long stt_pages, bool inc) +static long kvmppc_account_memlimit(unsigned long pages, bool inc) { long ret = 0; + s64 pinned_vm; if (!current || !current->mm) return ret; /* process exited */ - down_write(¤t->mm->mmap_sem); - if (inc) { - unsigned long locked, lock_limit; + unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - locked = current->mm->locked_vm + stt_pages; - lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - if (locked > lock_limit && !capable(CAP_IPC_LOCK)) + pinned_vm = atomic64_add_return(pages, ¤t->mm->pinned_vm); + if (pinned_vm > lock_limit && !capable(CAP_IPC_LOCK)) { ret = -ENOMEM; - else - current->mm->locked_vm += stt_pages; + atomic64_sub(pages, ¤t->mm->pinned_vm); + } } else { - if (WARN_ON_ONCE(stt_pages > current->mm->locked_vm)) - stt_pages = current->mm->locked_vm; + pinned_vm = atomic64_read(¤t->mm->pinned_vm); + if 
(WARN_ON_ONCE(pages > pinned_vm)) + pages = pinned_vm; - current->mm->locked_vm -= stt_pages; + atomic64_sub(pages, ¤t->mm->pinned_vm); } - pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%ld %ld/%ld%s\n", current->pid, - inc ? '+' : '-', - stt_pages << PAGE_SHIFT, - current->mm->locked_vm << PAGE_SHIFT, - rlimit(RLIMIT_MEMLOCK), - ret ? " - exceeded" : ""); - - up_write(¤t->mm->mmap_sem); + pr_debug("[%d] RLIMIT_MEMLOCK KVM %c%lu %ld/%lu%s\n", current->pid, + inc ? '+' : '-', pages << PAGE_SHIFT, + atomic64_read(¤t->mm->pinned_vm) << PAGE_SHIFT, + rlimit(RLIMIT_MEMLOCK), ret ? " - exceeded" : ""); return ret; } -- 2.20.1
[PATCH 1/5] vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned pages"), locked and pinned pages are accounted separately. Type1 accounts pinned pages to locked_vm; use pinned_vm instead. pinned_vm recently became atomic and so no longer relies on mmap_sem held as writer: delete. Signed-off-by: Daniel Jordan --- drivers/vfio/vfio_iommu_type1.c | 31 --- 1 file changed, 12 insertions(+), 19 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 73652e21efec..a56cc341813f 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -257,7 +257,8 @@ static int vfio_iova_put_vfio_pfn(struct vfio_dma *dma, struct vfio_pfn *vpfn) static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) { struct mm_struct *mm; - int ret; + s64 pinned_vm; + int ret = 0; if (!npage) return 0; @@ -266,24 +267,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async) if (!mm) return -ESRCH; /* process exited */ - ret = down_write_killable(&mm->mmap_sem); - if (!ret) { - if (npage > 0) { - if (!dma->lock_cap) { - unsigned long limit; - - limit = task_rlimit(dma->task, - RLIMIT_MEMLOCK) >> PAGE_SHIFT; + pinned_vm = atomic64_add_return(npage, &mm->pinned_vm); - if (mm->locked_vm + npage > limit) - ret = -ENOMEM; - } + if (npage > 0 && !dma->lock_cap) { + unsigned long limit = task_rlimit(dma->task, RLIMIT_MEMLOCK) >> + PAGE_SHIFT; + if (pinned_vm > limit) { + atomic64_sub(npage, &mm->pinned_vm); + ret = -ENOMEM; } - - if (!ret) - mm->locked_vm += npage; - - up_write(&mm->mmap_sem); } if (async) @@ -401,6 +393,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, long ret, pinned = 0, lock_acct = 0; bool rsvd; dma_addr_t iova = vaddr - dma->vaddr + dma->iova; + atomic64_t *pinned_vm = ¤t->mm->pinned_vm; /* This code path is only user initiated */ if (!current->mm) @@ -418,7 +411,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, 
* pages are already counted against the user. */ if (!rsvd && !vfio_find_vpfn(dma, iova)) { - if (!dma->lock_cap && current->mm->locked_vm + 1 > limit) { + if (!dma->lock_cap && atomic64_read(pinned_vm) + 1 > limit) { put_pfn(*pfn_base, dma->prot); pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); @@ -445,7 +438,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr, if (!rsvd && !vfio_find_vpfn(dma, iova)) { if (!dma->lock_cap && - current->mm->locked_vm + lock_acct + 1 > limit) { + atomic64_read(pinned_vm) + lock_acct + 1 > limit) { put_pfn(pfn, dma->prot); pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__, limit << PAGE_SHIFT); -- 2.20.1
[PATCH 3/5] fpga/dlf/afu: use pinned_vm instead of locked_vm to account pinned pages
Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned pages"), locked and pinned pages are accounted separately. The FPGA AFU driver accounts pinned pages to locked_vm; use pinned_vm instead. pinned_vm recently became atomic and so no longer relies on mmap_sem held as writer: delete. Signed-off-by: Daniel Jordan --- drivers/fpga/dfl-afu-dma-region.c | 50 ++- 1 file changed, 23 insertions(+), 27 deletions(-) diff --git a/drivers/fpga/dfl-afu-dma-region.c b/drivers/fpga/dfl-afu-dma-region.c index e18a786fc943..a9a6b317fe2e 100644 --- a/drivers/fpga/dfl-afu-dma-region.c +++ b/drivers/fpga/dfl-afu-dma-region.c @@ -32,47 +32,43 @@ void afu_dma_region_init(struct dfl_feature_platform_data *pdata) } /** - * afu_dma_adjust_locked_vm - adjust locked memory + * afu_dma_adjust_pinned_vm - adjust pinned memory * @dev: port device * @npages: number of pages - * @incr: increase or decrease locked memory * - * Increase or decrease the locked memory size with npages input. + * Increase or decrease the pinned memory size with npages input. * * Return 0 on success. - * Return -ENOMEM if locked memory size is over the limit and no CAP_IPC_LOCK. + * Return -ENOMEM if pinned memory size is over the limit and no CAP_IPC_LOCK. */ -static int afu_dma_adjust_locked_vm(struct device *dev, long npages, bool incr) +static int afu_dma_adjust_pinned_vm(struct device *dev, long pages) { - unsigned long locked, lock_limit; + unsigned long lock_limit; + s64 pinned_vm; int ret = 0; /* the task is exiting. 
*/ - if (!current->mm) + if (!current->mm || !pages) return 0; - down_write(¤t->mm->mmap_sem); - - if (incr) { - locked = current->mm->locked_vm + npages; + if (pages > 0) { lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - - if (locked > lock_limit && !capable(CAP_IPC_LOCK)) + pinned_vm = atomic64_add_return(pages, ¤t->mm->pinned_vm); + if (pinned_vm > lock_limit && !capable(CAP_IPC_LOCK)) { ret = -ENOMEM; - else - current->mm->locked_vm += npages; + atomic64_sub(pages, ¤t->mm->pinned_vm); + } } else { - if (WARN_ON_ONCE(npages > current->mm->locked_vm)) - npages = current->mm->locked_vm; - current->mm->locked_vm -= npages; + pinned_vm = atomic64_read(¤t->mm->pinned_vm); + if (WARN_ON_ONCE(pages > pinned_vm)) + pages = pinned_vm; + atomic64_sub(pages, ¤t->mm->pinned_vm); } - dev_dbg(dev, "[%d] RLIMIT_MEMLOCK %c%ld %ld/%ld%s\n", current->pid, - incr ? '+' : '-', npages << PAGE_SHIFT, - current->mm->locked_vm << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK), - ret ? "- exceeded" : ""); - - up_write(¤t->mm->mmap_sem); + dev_dbg(dev, "[%d] RLIMIT_MEMLOCK %c%ld %lld/%lu%s\n", current->pid, + (pages > 0) ? '+' : '-', pages << PAGE_SHIFT, + (s64)atomic64_read(¤t->mm->pinned_vm) << PAGE_SHIFT, + rlimit(RLIMIT_MEMLOCK), ret ? 
"- exceeded" : ""); return ret; } @@ -92,7 +88,7 @@ static int afu_dma_pin_pages(struct dfl_feature_platform_data *pdata, struct device *dev = &pdata->dev->dev; int ret, pinned; - ret = afu_dma_adjust_locked_vm(dev, npages, true); + ret = afu_dma_adjust_pinned_vm(dev, npages); if (ret) return ret; @@ -121,7 +117,7 @@ static int afu_dma_pin_pages(struct dfl_feature_platform_data *pdata, free_pages: kfree(region->pages); unlock_vm: - afu_dma_adjust_locked_vm(dev, npages, false); + afu_dma_adjust_pinned_vm(dev, -npages); return ret; } @@ -141,7 +137,7 @@ static void afu_dma_unpin_pages(struct dfl_feature_platform_data *pdata, put_all_pages(region->pages, npages); kfree(region->pages); - afu_dma_adjust_locked_vm(dev, npages, false); + afu_dma_adjust_pinned_vm(dev, -npages); dev_dbg(dev, "%ld pages unpinned\n", npages); } -- 2.20.1
[PATCH 4/5] powerpc/mmu: use pinned_vm instead of locked_vm to account pinned pages
Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned pages"), locked and pinned pages are accounted separately. The IOMMU MMU helpers on powerpc account pinned pages to locked_vm; use pinned_vm instead. pinned_vm recently became atomic and so no longer relies on mmap_sem held as writer: delete. Signed-off-by: Daniel Jordan --- arch/powerpc/mm/mmu_context_iommu.c | 43 ++--- 1 file changed, 21 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c index a712a650a8b6..fdf670542847 100644 --- a/arch/powerpc/mm/mmu_context_iommu.c +++ b/arch/powerpc/mm/mmu_context_iommu.c @@ -40,36 +40,35 @@ struct mm_iommu_table_group_mem_t { u64 dev_hpa;/* Device memory base address */ }; -static long mm_iommu_adjust_locked_vm(struct mm_struct *mm, +static long mm_iommu_adjust_pinned_vm(struct mm_struct *mm, unsigned long npages, bool incr) { - long ret = 0, locked, lock_limit; + long ret = 0; + unsigned long lock_limit; + s64 pinned_vm; if (!npages) return 0; - down_write(&mm->mmap_sem); - if (incr) { - locked = mm->locked_vm + npages; lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - if (locked > lock_limit && !capable(CAP_IPC_LOCK)) + pinned_vm = atomic64_add_return(npages, &mm->pinned_vm); + if (pinned_vm > lock_limit && !capable(CAP_IPC_LOCK)) { ret = -ENOMEM; - else - mm->locked_vm += npages; + atomic64_sub(npages, &mm->pinned_vm); + } } else { - if (WARN_ON_ONCE(npages > mm->locked_vm)) - npages = mm->locked_vm; - mm->locked_vm -= npages; + pinned_vm = atomic64_read(&mm->pinned_vm); + if (WARN_ON_ONCE(npages > pinned_vm)) + npages = pinned_vm; + atomic64_sub(npages, &mm->pinned_vm); } - pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%ld %ld/%ld\n", - current ? current->pid : 0, - incr ? '+' : '-', + pr_debug("[%d] RLIMIT_MEMLOCK HASH64 %c%lu %ld/%lu\n", + current ? current->pid : 0, incr ? 
'+' : '-', npages << PAGE_SHIFT, - mm->locked_vm << PAGE_SHIFT, + atomic64_read(&mm->pinned_vm) << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK)); - up_write(&mm->mmap_sem); return ret; } @@ -133,7 +132,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua, struct mm_iommu_table_group_mem_t **pmem) { struct mm_iommu_table_group_mem_t *mem; - long i, j, ret = 0, locked_entries = 0; + long i, j, ret = 0, pinned_entries = 0; unsigned int pageshift; unsigned long flags; unsigned long cur_ua; @@ -154,11 +153,11 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua, } if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) { - ret = mm_iommu_adjust_locked_vm(mm, entries, true); + ret = mm_iommu_adjust_pinned_vm(mm, entries, true); if (ret) goto unlock_exit; - locked_entries = entries; + pinned_entries = entries; } mem = kzalloc(sizeof(*mem), GFP_KERNEL); @@ -252,8 +251,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua, list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list); unlock_exit: - if (locked_entries && ret) - mm_iommu_adjust_locked_vm(mm, locked_entries, false); + if (pinned_entries && ret) + mm_iommu_adjust_pinned_vm(mm, pinned_entries, false); mutex_unlock(&mem_list_mutex); @@ -352,7 +351,7 @@ long mm_iommu_put(struct mm_struct *mm, struct mm_iommu_table_group_mem_t *mem) mm_iommu_release(mem); if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) - mm_iommu_adjust_locked_vm(mm, entries, false); + mm_iommu_adjust_pinned_vm(mm, entries, false); unlock_exit: mutex_unlock(&mem_list_mutex); -- 2.20.1
[PATCH 0/5] use pinned_vm instead of locked_vm to account pinned pages
Hi,

This series converts users that account pinned pages with locked_vm to account with pinned_vm instead, pinned_vm being the correct counter to use. It's based on a similar patch I posted recently[0].

The patches are based on rdma/for-next to build on Davidlohr Bueso's recent conversion of pinned_vm to an atomic64_t[1]. Seems to make some sense for these to be routed the same way, despite lack of rdma content?

All five of these places, and probably some of Davidlohr's conversions, probably want to be collapsed into a common helper in the core mm for accounting pinned pages. I tried, and there are several details that likely need discussion, so this can be done as a follow-on.

I'd appreciate a look at patch 5 especially, since the accounting is unusual no matter whether locked_vm or pinned_vm are used. On powerpc, this was cross-compile tested only.

[0] http://lkml.kernel.org/r/20181105165558.11698-8-daniel.m.jor...@oracle.com
[1] http://lkml.kernel.org/r/20190206175920.31082-1-d...@stgolabs.net

Daniel Jordan (5):
  vfio/type1: use pinned_vm instead of locked_vm to account pinned pages
  vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned pages
  fpga/dlf/afu: use pinned_vm instead of locked_vm to account pinned pages
  powerpc/mmu: use pinned_vm instead of locked_vm to account pinned pages
  kvm/book3s: use pinned_vm instead of locked_vm to account pinned pages

 Documentation/vfio.txt              |  6 +--
 arch/powerpc/kvm/book3s_64_vio.c    | 35 +++-
 arch/powerpc/mm/mmu_context_iommu.c | 43 ++-
 drivers/fpga/dfl-afu-dma-region.c   | 50 +++---
 drivers/vfio/vfio_iommu_spapr_tce.c | 64 ++---
 drivers/vfio/vfio_iommu_type1.c     | 31 ++
 6 files changed, 104 insertions(+), 125 deletions(-)

--
2.20.1
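Every patch in this series replaces the same take-mmap_sem-and-test shape with the same lock-free shape: add to the atomic counter, compare the new total against the RLIMIT_MEMLOCK limit, and subtract again on failure. A minimal user-space model of that pattern using C11 atomics (the function names and the bool capability flag are mine, not the kernel's) might look like:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the accounting pattern used across this series:
 * optimistically add, then back out if the new total exceeds the
 * limit and the caller lacks CAP_IPC_LOCK.  Note the transient
 * overshoot between the add and the sub -- the atomic add/sub pair
 * replaces the mmap_sem critical section the old locked_vm code
 * relied on. */
static _Atomic long pinned_vm;

bool try_account_pinned(long npages, long limit, bool cap_ipc_lock)
{
	long pinned = atomic_fetch_add(&pinned_vm, npages) + npages;

	if (pinned > limit && !cap_ipc_lock) {
		atomic_fetch_sub(&pinned_vm, npages);	/* undo, as the patches do */
		return false;
	}
	return true;
}

void unaccount_pinned(long npages)
{
	atomic_fetch_sub(&pinned_vm, npages);
}
```

Two racing callers can each see the other's in-flight add and both fail even though one of them alone would have fit; the old mmap_sem version serialized the check instead. That trade-off is inherent to the atomic form.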
[PATCH 2/5] vfio/spapr_tce: use pinned_vm instead of locked_vm to account pinned pages
Beginning with bc3e53f682d9 ("mm: distinguish between mlocked and pinned pages"), locked and pinned pages are accounted separately. The SPAPR TCE VFIO IOMMU driver accounts pinned pages to locked_vm; use pinned_vm instead. pinned_vm recently became atomic and so no longer relies on mmap_sem held as writer: delete. Signed-off-by: Daniel Jordan --- Documentation/vfio.txt | 6 +-- drivers/vfio/vfio_iommu_spapr_tce.c | 64 ++--- 2 files changed, 33 insertions(+), 37 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index f1a4d3c3ba0b..fa37d65363f9 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -308,7 +308,7 @@ This implementation has some specifics: currently there is no way to reduce the number of calls. In order to make things faster, the map/unmap handling has been implemented in real mode which provides an excellent performance which has limitations such as - inability to do locked pages accounting in real time. + inability to do pinned pages accounting in real time. 4) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O subtree that can be treated as a unit for the purposes of partitioning and @@ -324,7 +324,7 @@ This implementation has some specifics: returns the size and the start of the DMA window on the PCI bus. VFIO_IOMMU_ENABLE - enables the container. The locked pages accounting + enables the container. The pinned pages accounting is done at this point. This lets user first to know what the DMA window is and adjust rlimit before doing any real job. @@ -454,7 +454,7 @@ This implementation has some specifics: PPC64 paravirtualized guests generate a lot of map/unmap requests, and the handling of those includes pinning/unpinning pages and updating - mm::locked_vm counter to make sure we do not exceed the rlimit. + mm::pinned_vm counter to make sure we do not exceed the rlimit. 
The v2 IOMMU splits accounting and pinning into separate operations: - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index c424913324e3..f47e020dc5e4 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -34,9 +34,11 @@ static void tce_iommu_detach_group(void *iommu_data, struct iommu_group *iommu_group); -static long try_increment_locked_vm(struct mm_struct *mm, long npages) +static long try_increment_pinned_vm(struct mm_struct *mm, long npages) { - long ret = 0, locked, lock_limit; + long ret = 0; + s64 pinned; + unsigned long lock_limit; if (WARN_ON_ONCE(!mm)) return -EPERM; @@ -44,39 +46,33 @@ static long try_increment_locked_vm(struct mm_struct *mm, long npages) if (!npages) return 0; - down_write(&mm->mmap_sem); - locked = mm->locked_vm + npages; + pinned = atomic64_add_return(npages, &mm->pinned_vm); lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - if (locked > lock_limit && !capable(CAP_IPC_LOCK)) + if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) { ret = -ENOMEM; - else - mm->locked_vm += npages; + atomic64_sub(npages, &mm->pinned_vm); + } - pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid, + pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%lu%s\n", current->pid, npages << PAGE_SHIFT, - mm->locked_vm << PAGE_SHIFT, - rlimit(RLIMIT_MEMLOCK), - ret ? " - exceeded" : ""); - - up_write(&mm->mmap_sem); + atomic64_read(&mm->pinned_vm) << PAGE_SHIFT, + rlimit(RLIMIT_MEMLOCK), ret ? 
" - exceeded" : ""); return ret; } -static void decrement_locked_vm(struct mm_struct *mm, long npages) +static void decrement_pinned_vm(struct mm_struct *mm, long npages) { if (!mm || !npages) return; - down_write(&mm->mmap_sem); - if (WARN_ON_ONCE(npages > mm->locked_vm)) - npages = mm->locked_vm; - mm->locked_vm -= npages; - pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid, + if (WARN_ON_ONCE(npages > atomic64_read(&mm->pinned_vm))) + npages = atomic64_read(&mm->pinned_vm); + atomic64_sub(npages, &mm->pinned_vm); + pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%lu\n", current->pid, npages << PAGE_SHIFT, - mm->locked_vm << PAGE_SHIFT, + atomic64_read(&mm->pinned_vm) << PAGE_SHIFT, rlimit(RLIMIT_MEMLOCK)); - up_write(&mm->mmap_sem); } /* @@ -110,7 +106,7 @@ struct tce_container { bool en
Re: [PATCH] powerpc: fix 32-bit KVM-PR lockup and panic with MacOS guest
On 11/02/2019 00:30, Benjamin Herrenschmidt wrote:
> On Fri, 2019-02-08 at 14:51 +, Mark Cave-Ayland wrote:
>> Indeed, but there are still some questions to be asked here:
>>
>> 1) Why were these bits removed from the original bitmask in the first place
>> without it being documented in the commit message?
>>
>> 2) Is this the right fix? I'm told that MacOS guests already run without this
>> patch on a G5 under 64-bit KVM-PR which may suggest that this is a workaround
>> for another bug elsewhere in the 32-bit powerpc code.
>>
>> If you think that these points don't matter, then I'm happy to resubmit the
>> patch as-is based upon your comments above.
>
> We should write a test case to verify that FE0/FE1 are properly
> preserved/context-switched etc... I bet if we accidentally wiped them,
> we wouldn't notice 99.9% of the time.

Right, I guess it's more likely to cause an issue in the KVM PR case because the guest can alter the flags in a way that doesn't go through the normal process switch mechanism.

The original patchset at https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg98326.html does include some tests in the first few patches, but AFAICT they are concerned with the contents of the FP registers rather than the related MSRs.

Who is the right person to ask about fixing issues related to context switching with KVM PR? I did add the original author's email address to my first few emails but have had no response back :/

ATB,

Mark.
Re: [GIT PULL] of: overlay: validation checks, subsequent fixes for v20 -- correction: v4.20
On Mon, Feb 11, 2019 at 1:13 PM Greg Kroah-Hartman wrote:
>
> On Mon, Feb 11, 2019 at 12:41:40PM -0600, Alan Tull wrote:
> > On Fri, Nov 9, 2018 at 12:58 AM Frank Rowand wrote:
> >
> > What LTSI's are these patches likely to end up in? Just to be clear,
> > I'm not pushing for any specific answer, I just want to know what to
> > expect.
>
> I have no idea what you are asking here.
>
> What patches?

I probably should have asked my question *below* the pertinent context of the 17 patches listed in the pull request, which was:

> of: overlay: add tests to validate kfrees from overlay removal
> of: overlay: add missing of_node_put() after add new node to changeset
> of: overlay: add missing of_node_get() in __of_attach_node_sysfs
> powerpc/pseries: add of_node_put() in dlpar_detach_node()
> of: overlay: use prop add changeset entry for property in new nodes
> of: overlay: do not duplicate properties from overlay for new nodes
> of: overlay: reorder fields in struct fragment
> of: overlay: validate overlay properties #address-cells and #size-cells
> of: overlay: make all pr_debug() and pr_err() messages unique
> of: overlay: test case of two fragments adding same node
> of: overlay: check prevents multiple fragments add or delete same node
> of: overlay: check prevents multiple fragments touching same property
> of: unittest: remove unused of_unittest_apply_overlay() argument
> of: overlay: set node fields from properties when add new overlay node
> of: unittest: allow base devicetree to have symbol metadata
> of: unittest: find overlays[] entry by name instead of index
> of: unittest: initialize args before calling of_*parse_*()

> What is "LTSI's"?

I have recently seen some of the devicetree patches being picked up for the 4.20 stable-queue. That seemed to suggest that some, but not all, of these will end up in the next LTS release. Also, I was wondering if any of this is likely to get backported to LTSI-4.14.

> confused,

Yes, and now I'm confused about the confusion. Sorry for spreading confusion.

Alan

> greg k-h
[QUESTION] powerpc, libseccomp, and spu
PowerPC experts,

Paul Moore and I are working on the v2.4 release of libseccomp, and as part of this work I need to update the syscall table for each architecture.

I have incorporated the new ppc syscall.tbl into libseccomp, but I am not familiar with the value of "spu" in the ABI column. For example:

22   32    umount    sys_oldumount
22   64    umount    sys_ni_syscall
22   spu   umount    sys_ni_syscall

In libseccomp, we maintain a 32-bit ppc syscall table and a 64-bit ppc syscall table. Do we also need to add a "spu" ppc syscall table? Some clarification on the syscalls marked "spu" and "nospu" would be greatly appreciated.

Thanks.

Tom
[PATCH v2 2/2] locking/rwsem: Optimize down_read_trylock()
Modify __down_read_trylock() to make it generate slightly better code (smaller and maybe a tiny bit faster).

Before this patch, down_read_trylock:

   0x0000 <+0>:     callq  0x5
   0x0005 <+5>:     jmp    0x18
   0x0007 <+7>:     lea    0x1(%rdx),%rcx
   0x000b <+11>:    mov    %rdx,%rax
   0x000e <+14>:    lock cmpxchg %rcx,(%rdi)
   0x0013 <+19>:    cmp    %rax,%rdx
   0x0016 <+22>:    je     0x23
   0x0018 <+24>:    mov    (%rdi),%rdx
   0x001b <+27>:    test   %rdx,%rdx
   0x001e <+30>:    jns    0x7
   0x0020 <+32>:    xor    %eax,%eax
   0x0022 <+34>:    retq
   0x0023 <+35>:    mov    %gs:0x0,%rax
   0x002c <+44>:    or     $0x3,%rax
   0x0030 <+48>:    mov    %rax,0x20(%rdi)
   0x0034 <+52>:    mov    $0x1,%eax
   0x0039 <+57>:    retq

After patch, down_read_trylock:

   0x0000 <+0>:     callq  0x5
   0x0005 <+5>:     mov    (%rdi),%rax
   0x0008 <+8>:     test   %rax,%rax
   0x000b <+11>:    js     0x2f
   0x000d <+13>:    lea    0x1(%rax),%rdx
   0x0011 <+17>:    lock cmpxchg %rdx,(%rdi)
   0x0016 <+22>:    jne    0x8
   0x0018 <+24>:    mov    %gs:0x0,%rax
   0x0021 <+33>:    or     $0x3,%rax
   0x0025 <+37>:    mov    %rax,0x20(%rdi)
   0x0029 <+41>:    mov    $0x1,%eax
   0x002e <+46>:    retq
   0x002f <+47>:    xor    %eax,%eax
   0x0031 <+49>:    retq

By using a rwsem microbenchmark, the down_read_trylock() rate on a x86-64 system before and after the patch were:

                 Before Patch    After Patch
   # of Threads     rlock           rlock
   ------------     ------          ------
        1           27,787          28,259
                    28,359           9,234

On a ARM64 system, the performance results were:

                 Before Patch    After Patch
   # of Threads     rlock           rlock
   ------------     ------          ------
        1           24,155          25,000
                    26,820           8,699

Suggested-by: Peter Zijlstra
Signed-off-by: Waiman Long
---
 kernel/locking/rwsem.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 067e265..028bc33 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -175,11 +175,11 @@ static inline int __down_read_killable(struct rw_semaphore *sem)

 static inline int __down_read_trylock(struct rw_semaphore *sem)
 {
-	long tmp;
+	long tmp = atomic_long_read(&sem->count);

-	while ((tmp = atomic_long_read(&sem->count)) >= 0) {
-		if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
-					tmp + RWSEM_ACTIVE_READ_BIAS)) {
+	while (tmp >= 0) {
+		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
+					tmp + RWSEM_ACTIVE_READ_BIAS)) {
 			return 1;
 		}
 	}
--
1.8.3.1
[PATCH v2 1/2] locking/rwsem: Remove arch specific rwsem files
As the generic rwsem-xadd code is using the appropriate acquire and release versions of the atomic operations, the arch specific rwsem.h files will not be that much faster than the generic code as long as the atomic functions are properly implemented. So we can remove those arch specific rwsem.h and stop building asm/rwsem.h to reduce maintenance effort.

Currently, only x86, alpha and ia64 have implemented architecture specific fast paths. I don't have access to alpha and ia64 systems for testing, but they are legacy systems that are not likely to be updated to the latest kernel anyway.

By using a rwsem microbenchmark, the total locking rates on a 4-socket 56-core 112-thread x86-64 system before and after the patch were as follows (mixed means equal # of read and write locks):

                      Before Patch              After Patch
   # of Threads  wlock   rlock   mixed     wlock   rlock   mixed
   ------------  -----   -----   -----     -----   -----   -----
        1        29,201  30,143  29,458    28,615  30,172  29,201
        2         6,807  13,299   1,171     7,725  15,025   1,804
        4         6,504  12,755   1,520     7,127  14,286   1,345
        8         6,762  13,412     764     6,826  13,652     726
       16         6,693  15,408     662     6,599  15,938     626
       32         6,145  15,286     496     5,549  15,487     511
       64         5,812  15,495      60     5,858  15,572      60

There were some run-to-run variations for the multi-thread tests. For x86-64, using the generic C code fast path seems to be a little bit faster than the assembly version with low lock contention. Looking at the assembly version of the fast paths, there are assembly to/from C code wrappers that save and restore all the callee-clobbered registers (7 registers on x86-64). The assembly generated from the generic C code doesn't need to do that. That may explain the slight performance gain here.

The generic asm rwsem.h can also be merged into kernel/locking/rwsem.h with no code change as no other code other than those under kernel/locking needs to access the internal rwsem macros and functions.
Signed-off-by: Waiman Long
---
 MAINTAINERS                     |   1 -
 arch/alpha/include/asm/rwsem.h  | 211 -------------
 arch/arm/include/asm/Kbuild     |   1 -
 arch/arm64/include/asm/Kbuild   |   1 -
 arch/hexagon/include/asm/Kbuild |   1 -
 arch/ia64/include/asm/rwsem.h   | 172 ----------
 arch/powerpc/include/asm/Kbuild |   1 -
 arch/s390/include/asm/Kbuild    |   1 -
 arch/sh/include/asm/Kbuild      |   1 -
 arch/sparc/include/asm/Kbuild   |   1 -
 arch/x86/include/asm/rwsem.h    | 237 ---------------
 arch/x86/lib/Makefile           |   1 -
 arch/x86/lib/rwsem.S            | 156 ---------
 arch/xtensa/include/asm/Kbuild  |   1 -
 include/asm-generic/rwsem.h     | 140 --------
 include/linux/rwsem.h           |   4 +-
 kernel/locking/percpu-rwsem.c   |   2 +
 kernel/locking/rwsem.h          | 130 ++++++++
 18 files changed, 133 insertions(+), 929 deletions(-)
 delete mode 100644 arch/alpha/include/asm/rwsem.h
 delete mode 100644 arch/ia64/include/asm/rwsem.h
 delete mode 100644 arch/x86/include/asm/rwsem.h
 delete mode 100644 arch/x86/lib/rwsem.S
 delete mode 100644 include/asm-generic/rwsem.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 9919840..053f536 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8926,7 +8926,6 @@ F:	arch/*/include/asm/spinlock*.h
 F:	include/linux/rwlock*.h
 F:	include/linux/mutex*.h
 F:	include/linux/rwsem*.h
-F:	arch/*/include/asm/rwsem.h
 F:	include/linux/seqlock.h
 F:	lib/locking*.[ch]
 F:	kernel/locking/
diff --git a/arch/alpha/include/asm/rwsem.h b/arch/alpha/include/asm/rwsem.h
deleted file mode 100644
index cf8fc8f9..000
--- a/arch/alpha/include/asm/rwsem.h
+++ /dev/null
@@ -1,211 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ALPHA_RWSEM_H
-#define _ALPHA_RWSEM_H
-
-/*
- * Written by Ivan Kokshaysky, 2001.
- * Based on asm-alpha/semaphore.h and asm-i386/rwsem.h
- */
-
-#ifndef _LINUX_RWSEM_H
-#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead"
-#endif
-
-#ifdef __KERNEL__
-
-#include <linux/compiler.h>
-
-#define RWSEM_UNLOCKED_VALUE		0x0000000000000000L
-#define RWSEM_ACTIVE_BIAS		0x0000000000000001L
-#define RWSEM_ACTIVE_MASK		0x00000000ffffffffL
-#define RWSEM_WAITING_BIAS		(-0x0000000100000000L)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS	(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-
-static inline int ___down_read(struct rw_semaphore *sem)
-{
-	long oldcount;
-#ifndef	CONFIG_SMP
-	oldcount = sem->count.counter;
-	sem->count.counter += RWSEM_ACTIVE_READ_BIAS;
-#else
-	long temp;
-	__asm__ __volatile__(
-	"1:	ldq_l	%0,%1\n"
-	"
[PATCH v2 0/2] locking/rwsem: Remove arch specific rwsem files
v2:
 - Add patch 2 to optimize __down_read_trylock() as suggested by PeterZ.
 - Update performance test data in patch 1.

This is part 0 of my rwsem patchset. It just removes the architecture specific files to make it easier to add enhancements in the upcoming rwsem patches. Since the two ll/sc platforms that I can test on (arm64 & ppc) are both using the generic C code, the rwsem performance shouldn't be affected by this patch, except for the down_read_trylock() code, which was included in patch 2 for arm64.

Waiman Long (2):
  locking/rwsem: Remove arch specific rwsem files
  locking/rwsem: Optimize down_read_trylock()

 MAINTAINERS                     |   1 -
 arch/alpha/include/asm/rwsem.h  | 211 -------------
 arch/arm/include/asm/Kbuild     |   1 -
 arch/arm64/include/asm/Kbuild   |   1 -
 arch/hexagon/include/asm/Kbuild |   1 -
 arch/ia64/include/asm/rwsem.h   | 172 ----------
 arch/powerpc/include/asm/Kbuild |   1 -
 arch/s390/include/asm/Kbuild    |   1 -
 arch/sh/include/asm/Kbuild      |   1 -
 arch/sparc/include/asm/Kbuild   |   1 -
 arch/x86/include/asm/rwsem.h    | 237 ---------------
 arch/x86/lib/Makefile           |   1 -
 arch/x86/lib/rwsem.S            | 156 ---------
 arch/xtensa/include/asm/Kbuild  |   1 -
 include/asm-generic/rwsem.h     | 140 --------
 include/linux/rwsem.h           |   4 +-
 kernel/locking/percpu-rwsem.c   |   2 +
 kernel/locking/rwsem.h          | 130 ++++++++
 18 files changed, 133 insertions(+), 929 deletions(-)
 delete mode 100644 arch/alpha/include/asm/rwsem.h
 delete mode 100644 arch/ia64/include/asm/rwsem.h
 delete mode 100644 arch/x86/include/asm/rwsem.h
 delete mode 100644 arch/x86/lib/rwsem.S
 delete mode 100644 include/asm-generic/rwsem.h

--
1.8.3.1
Re: [GIT PULL] of: overlay: validation checks, subsequent fixes for v20 -- correction: v4.20
On Mon, Feb 11, 2019 at 12:41:40PM -0600, Alan Tull wrote: > On Fri, Nov 9, 2018 at 12:58 AM Frank Rowand wrote: > > What LTSI's are these patches likely to end up in? Just to be clear, > I'm not pushing for any specific answer, I just want to know what to > expect. I have no idea what you are asking here. What patches? What is "LTSI's"? confused, greg k-h
Re: [GIT PULL] of: overlay: validation checks, subsequent fixes for v20 -- correction: v4.20
On Fri, Nov 9, 2018 at 12:58 AM Frank Rowand wrote: What LTSI's are these patches likely to end up in? Just to be clear, I'm not pushing for any specific answer, I just want to know what to expect. Thanks, Alan > > On 11/8/18 10:56 PM, Frank Rowand wrote: > > Hi Rob, > > > > Please pull the changes to add the overlay validation checks. > > > > This is the v7 version of the patch series. > > > > -Frank > > > > > > The following changes since commit 651022382c7f8da46cb4872a545ee1da6d097d2a: > > > > Linux 4.20-rc1 (2018-11-04 15:37:52 -0800) > > > > are available in the git repository at: > > > > git://git.kernel.org/pub/scm/linux/kernel/git/frowand/linux.git > > tags/kfree_validate_v7-for-4.20 > > > > for you to fetch changes up to eeb07c573ec307c53fe2f6ac6d8d11c261f64006: > > > > of: unittest: initialize args before calling of_*parse_*() (2018-11-08 > > 22:12:37 -0800) > > > > > > Add checks to (1) overlay apply process and (2) memory freeing > > triggered by overlay release. The checks are intended to detect > > possible memory leaks and invalid overlays. > > > > The checks revealed bugs in existing code. Fixed the bugs. > > > > While fixing bugs, noted other issues, which are fixed in > > separate patches. 
> > > > > > Frank Rowand (17): > > of: overlay: add tests to validate kfrees from overlay removal > > of: overlay: add missing of_node_put() after add new node to changeset > > of: overlay: add missing of_node_get() in __of_attach_node_sysfs > > powerpc/pseries: add of_node_put() in dlpar_detach_node() > > of: overlay: use prop add changeset entry for property in new nodes > > of: overlay: do not duplicate properties from overlay for new nodes > > of: overlay: reorder fields in struct fragment > > of: overlay: validate overlay properties #address-cells and > > #size-cells > > of: overlay: make all pr_debug() and pr_err() messages unique > > of: overlay: test case of two fragments adding same node > > of: overlay: check prevents multiple fragments add or delete same node > > of: overlay: check prevents multiple fragments touching same property > > of: unittest: remove unused of_unittest_apply_overlay() argument > > of: overlay: set node fields from properties when add new overlay node > > of: unittest: allow base devicetree to have symbol metadata > > of: unittest: find overlays[] entry by name instead of index > > of: unittest: initialize args before calling of_*parse_*() > > > > arch/powerpc/platforms/pseries/dlpar.c | 2 + > > drivers/of/dynamic.c | 59 - > > drivers/of/kobj.c | 4 +- > > drivers/of/overlay.c | 292 > > - > > drivers/of/unittest-data/Makefile | 2 + > > .../of/unittest-data/overlay_bad_add_dup_node.dts | 28 ++ > > .../of/unittest-data/overlay_bad_add_dup_prop.dts | 24 ++ > > drivers/of/unittest-data/overlay_base.dts | 1 + > > drivers/of/unittest.c | 96 +-- > > include/linux/of.h | 21 +- > > 10 files changed, 432 insertions(+), 97 deletions(-) > > create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_node.dts > > create mode 100644 drivers/of/unittest-data/overlay_bad_add_dup_prop.dts > > >
Re: [PATCH] locking/rwsem: Remove arch specific rwsem files
On Mon, Feb 11, 2019 at 11:35:24AM -0500, Waiman Long wrote:
> On 02/11/2019 06:58 AM, Peter Zijlstra wrote:
> > Which is clearly worse. Now we can write that as:
> >
> > int __down_read_trylock2(unsigned long *l)
> > {
> > 	long tmp = READ_ONCE(*l);
> >
> > 	while (tmp >= 0) {
> > 		if (try_cmpxchg(l, &tmp, tmp + 1))
> > 			return 1;
> > 	}
> >
> > 	return 0;
> > }
> >
> > which generates:
> >
> > 0030 <__down_read_trylock2>:
> >   30:	48 8b 07             	mov    (%rdi),%rax
> >   33:	48 85 c0             	test   %rax,%rax
> >   36:	78 18                	js     50 <__down_read_trylock2+0x20>
> >   38:	48 8d 50 01          	lea    0x1(%rax),%rdx
> >   3c:	f0 48 0f b1 17       	lock cmpxchg %rdx,(%rdi)
> >   41:	75 f0                	jne    33 <__down_read_trylock2+0x3>
> >   43:	b8 01 00 00 00       	mov    $0x1,%eax
> >   48:	c3                   	retq
> >   49:	0f 1f 80 00 00 00 00	nopl   0x0(%rax)
> >   50:	31 c0                	xor    %eax,%eax
> >   52:	c3                   	retq
> >
> > Which is a lot better; but not quite there yet.
> >
> >
> > I've tried quite a bit, but I can't seem to get GCC to generate the:
> >
> > 	add $1,%rdx
> > 	jle
> >
> > required; stuff like:
> >
> > 	new = old + 1;
> > 	if (new <= 0)
> >
> > generates:
> >
> > 	lea 0x1(%rax),%rdx
> > 	test %rdx, %rdx
> > 	jle
>
> Thanks for the suggested code snippet. So you want to replace
> "lea 0x1(%rax),%rdx" by "add $1,%rdx"?
>
> I think the compiler is doing that so as to use the address generation
> unit for addition instead of using the ALU. That will leave the ALU
> available for doing other arithmetic operations in parallel. I don't
> think it is a good idea to override the compiler and force it to use
> the ALU. So I am not going to try doing that. It is only 1 or 2 more
> lines of code anyway.

Yeah, I was trying to see what I could make it do.. #2 really should be good enough, but you know how it is once you're poking at it :-)
[PATCH] mmap.2: describe the 5level paging hack
The manpage is missing information about the compatibility hack for 5-level paging that went in in 4.14, around commit ee00f4a32a76 ("x86/mm: Allow userspace have mappings above 47-bit"). Add some information about that.

While I don't think any hardware supporting this is shipping yet (?), I think it's useful to try to write a manpage for this API, partly to figure out how usable that API actually is, and partly because when this hardware does ship, it'd be nice if distro manpages had information about how to use it.

Signed-off-by: Jann Horn
---
This patch goes on top of the patch "[PATCH] mmap.2: fix description of treatment of the hint" that I just sent, but I'm not sending them in a series because I want the first one to go in, and I think this one might be a bit more controversial.

It would be nice if the architecture maintainers and mm folks could have a look at this and check that what I wrote is right - I only looked at the source for this, I haven't tried it.

 man2/mmap.2 | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/man2/mmap.2 b/man2/mmap.2
index 8556bbfeb..977782fa8 100644
--- a/man2/mmap.2
+++ b/man2/mmap.2
@@ -67,6 +67,8 @@
 is NULL,
 then the kernel chooses the (page-aligned) address
 at which to create the mapping;
 this is the most portable method of creating a new mapping.
+On Linux, in this case, the kernel may limit the maximum address that can be
+used for allocations to a legacy limit for compatibility reasons.
 If
 .I addr
 is not NULL,
@@ -77,6 +79,19 @@
 or equal to the value specified by
 and attempt to create the mapping there.
 If another mapping already exists there, the kernel picks a new address,
 independent of the hint.
+However, if a hint above the architecture's legacy address limit is provided
+(on x86-64: above 0x7ffffffff000, on arm64: above 0x1000000000000, on ppc64 with
+book3s: above 0x7fffffffffff or 0x3fffffffffff, depending on page size), the
+kernel is permitted to allocate mappings beyond the architecture's legacy
+address limit.
+The availability of such addresses is hardware-dependent.
+Therefore, if you want to be able to use the full virtual address space of
+hardware that supports addresses beyond the legacy range, you need to specify an
+address above that limit; however, for security reasons, you should avoid
+specifying a fixed valid address outside the compatibility range,
+since that would reduce the value of userspace address space layout
+randomization. Therefore, it is recommended to specify an address
+.I beyond
+the end of the userspace address space.
 .\" Before Linux 2.6.24, the address was rounded up to the next page
 .\" boundary; since 2.6.24, it is rounded down!
 The address of the new mapping is returned as the result of the call.
--
2.20.1.791.gb4d0f1c61a-goog
Re: [PATCH] locking/rwsem: Remove arch specific rwsem files
On 02/11/2019 06:58 AM, Peter Zijlstra wrote:
> Which is clearly worse. Now we can write that as:
>
> int __down_read_trylock2(unsigned long *l)
> {
> 	long tmp = READ_ONCE(*l);
>
> 	while (tmp >= 0) {
> 		if (try_cmpxchg(l, &tmp, tmp + 1))
> 			return 1;
> 	}
>
> 	return 0;
> }
>
> which generates:
>
> 0030 <__down_read_trylock2>:
>   30:	48 8b 07             	mov    (%rdi),%rax
>   33:	48 85 c0             	test   %rax,%rax
>   36:	78 18                	js     50 <__down_read_trylock2+0x20>
>   38:	48 8d 50 01          	lea    0x1(%rax),%rdx
>   3c:	f0 48 0f b1 17       	lock cmpxchg %rdx,(%rdi)
>   41:	75 f0                	jne    33 <__down_read_trylock2+0x3>
>   43:	b8 01 00 00 00       	mov    $0x1,%eax
>   48:	c3                   	retq
>   49:	0f 1f 80 00 00 00 00	nopl   0x0(%rax)
>   50:	31 c0                	xor    %eax,%eax
>   52:	c3                   	retq
>
> Which is a lot better; but not quite there yet.
>
>
> I've tried quite a bit, but I can't seem to get GCC to generate the:
>
> 	add $1,%rdx
> 	jle
>
> required; stuff like:
>
> 	new = old + 1;
> 	if (new <= 0)
>
> generates:
>
> 	lea 0x1(%rax),%rdx
> 	test %rdx, %rdx
> 	jle

Thanks for the suggested code snippet. So you want to replace "lea 0x1(%rax),%rdx" by "add $1,%rdx"?

I think the compiler is doing that so as to use the address generation unit for addition instead of using the ALU. That will leave the ALU available for doing other arithmetic operations in parallel. I don't think it is a good idea to override the compiler and force it to use the ALU. So I am not going to try doing that. It is only 1 or 2 more lines of code anyway.

Cheers,
Longman
Re: [PATCH v4 3/3] powerpc/32: Add KASAN support
On 2/11/19 3:25 PM, Andrey Konovalov wrote: > On Sat, Feb 9, 2019 at 12:55 PM christophe leroy > wrote: >> >> Hi Andrey, >> >> Le 08/02/2019 à 18:40, Andrey Konovalov a écrit : >>> On Fri, Feb 8, 2019 at 6:17 PM Christophe Leroy >>> wrote: Hi Daniel, Le 08/02/2019 à 17:18, Daniel Axtens a écrit : > Hi Christophe, > > I've been attempting to port this to 64-bit Book3e nohash (e6500), > although I think I've ended up with an approach more similar to Aneesh's > much earlier (2015) series for book3s. > > Part of this is just due to the changes between 32 and 64 bits - we need > to hack around the discontiguous mappings - but one thing that I'm > particularly puzzled by is what the kasan_early_init is supposed to do. It should be a problem as my patch uses a 'for_each_memblock(memory, reg)' loop. > >> +void __init kasan_early_init(void) >> +{ >> +unsigned long addr = KASAN_SHADOW_START; >> +unsigned long end = KASAN_SHADOW_END; >> +unsigned long next; >> +pmd_t *pmd = pmd_offset(pud_offset(pgd_offset_k(addr), addr), addr); >> +int i; >> +phys_addr_t pa = __pa(kasan_early_shadow_page); >> + >> +BUILD_BUG_ON(KASAN_SHADOW_START & ~PGDIR_MASK); >> + >> +if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE)) >> +panic("KASAN not supported with Hash MMU\n"); >> + >> +for (i = 0; i < PTRS_PER_PTE; i++) >> +__set_pte_at(&init_mm, (unsigned >> long)kasan_early_shadow_page, >> + kasan_early_shadow_pte + i, >> + pfn_pte(PHYS_PFN(pa), PAGE_KERNEL_RO), 0); >> + >> +do { >> +next = pgd_addr_end(addr, end); >> +pmd_populate_kernel(&init_mm, pmd, kasan_early_shadow_pte); >> +} while (pmd++, addr = next, addr != end); >> +} > > As far as I can tell it's mapping the early shadow page, read-only, over > the KASAN_SHADOW_START->KASAN_SHADOW_END range, and it's using the early > shadow PTE array from the generic code. > > I haven't been able to find an answer to why this is in the docs, so I > was wondering if you or anyone else could explain the early part of > kasan init a bit better. 
See https://www.kernel.org/doc/html/latest/dev-tools/kasan.html for an explanation of the shadow. When shadow is 0, it means the memory area is entirely accessible. It is necessary to setup a shadow area as soon as possible because all data accesses check the shadow area, from the begining (except for a few files where sanitizing has been disabled in Makefiles). Until the real shadow area is set, all access are granted thanks to the zero shadow area beeing for of zeros. >>> >>> Not entirely correct. kasan_early_init() indeed maps the whole shadow >>> memory range to the same kasan_early_shadow_page. However as kernel >>> loads and memory gets allocated this shadow page gets rewritten with >>> non-zero values by different KASAN allocator hooks. Since these values >>> come from completely different parts of the kernel, but all land on >>> the same page, kasan_early_shadow_page's content can be considered >>> garbage. When KASAN checks memory accesses for validity it detects >>> these garbage shadow values, but doesn't print any reports, as the >>> reporting routine bails out on the current->kasan_depth check (which >>> has the value of 1 initially). Only after kasan_init() completes, when >>> the proper shadow memory is mapped, current->kasan_depth gets set to 0 >>> and we start reporting bad accesses. >> >> That's surprising, because in the early phase I map the shadow area >> read-only, so I do not expect it to get modified unless RO protection is >> failing for some reason. > > Actually it might be that the allocator hooks don't modify shadow at > this point, as the allocator is not yet initialized. However stack > should be getting poisoned and unpoisoned from the very start. But the > generic statement that early shadow gets dirtied should be correct. > Might it be that you don't use stack instrumentation? > Yes, stack instrumentation is not used here, because shadow offset which we pass to the -fasan-shadow-offset= cflag is not specified here. 
So the logic in scripts/Makefile.kasan just falls back to CFLAGS_KASAN_MINIMAL, which is outline and without stack instrumentation. Christophe, you can specify KASAN_SHADOW_OFFSET either in Kconfig (e.g. x86_64) or in Makefile (e.g. arm64). And make the early mapping writable, because compiler generated code will write to shadow memory in function prologue/epilogue.
Re: [PATCH v2 0/4] [powerpc] perf vendor events: Add JSON metrics for POWER9
Em Sat, Feb 09, 2019 at 01:14:25PM -0500, Paul Clarke escreveu: > [Note this is for POWER*9* and is different content than a > previous patchset for POWER*8*.] > > The patches define metrics and metric groups for computation by "perf" > for POWER9 processors. Applied, thanks. - Arnaldo
[PATCH 4.9 096/137] block/swim3: Fix -EBUSY error when re-opening device after unmount
4.9-stable review patch. If anyone has any objections, please let me know.

--

[ Upstream commit 296dcc40f2f2e402facf7cd26cf3f2c8f4b17d47 ]

When the block device is opened with FMODE_EXCL, ref_count is set to -1. This value doesn't get reset when the device is closed which means the device cannot be opened again.

Fix this by checking for refcount <= 0 in the release method.

Reported-and-tested-by: Stan Johnson
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Finn Thain
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin
---
 drivers/block/swim3.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index c264f2d284a7..2e0a9e2531cb 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -1027,7 +1027,11 @@ static void floppy_release(struct gendisk *disk, fmode_t mode)
 	struct swim3 __iomem *sw = fs->swim3;

 	mutex_lock(&swim3_mutex);
-	if (fs->ref_count > 0 && --fs->ref_count == 0) {
+	if (fs->ref_count > 0)
+		--fs->ref_count;
+	else if (fs->ref_count == -1)
+		fs->ref_count = 0;
+	if (fs->ref_count == 0) {
 		swim3_action(fs, MOTOR_OFF);
 		out_8(&sw->control_bic, 0xff);
 		swim3_select(fs, RELAX);
--
2.19.1
[PATCH 4.14 153/205] block/swim3: Fix -EBUSY error when re-opening device after unmount
4.14-stable review patch. If anyone has any objections, please let me know.

--

[ Upstream commit 296dcc40f2f2e402facf7cd26cf3f2c8f4b17d47 ]

When the block device is opened with FMODE_EXCL, ref_count is set to -1. This value doesn't get reset when the device is closed which means the device cannot be opened again.

Fix this by checking for refcount <= 0 in the release method.

Reported-and-tested-by: Stan Johnson
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Finn Thain
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin
---
 drivers/block/swim3.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index 0d7527c6825a..2f7acdb830c3 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -1027,7 +1027,11 @@ static void floppy_release(struct gendisk *disk, fmode_t mode)
 	struct swim3 __iomem *sw = fs->swim3;

 	mutex_lock(&swim3_mutex);
-	if (fs->ref_count > 0 && --fs->ref_count == 0) {
+	if (fs->ref_count > 0)
+		--fs->ref_count;
+	else if (fs->ref_count == -1)
+		fs->ref_count = 0;
+	if (fs->ref_count == 0) {
 		swim3_action(fs, MOTOR_OFF);
 		out_8(&sw->control_bic, 0xff);
 		swim3_select(fs, RELAX);
--
2.19.1
Re: [RFC PATCH] x86, numa: always initialize all possible nodes
On Mon 11-02-19 14:49:09, Ingo Molnar wrote:
>
> * Michal Hocko wrote:
>
> > On Thu 24-01-19 11:10:50, Dave Hansen wrote:
> > > On 1/24/19 6:17 AM, Michal Hocko wrote:
> > > > and nr_cpus set to 4. The underlying reason is that the device is bound
> > > > to node 2 which doesn't have any memory and init_cpu_to_node only
> > > > initializes memory-less nodes for possible cpus which nr_cpus restricts.
> > > > This in turn means that proper zonelists are not allocated and the page
> > > > allocator blows up.
> > >
> > > This looks OK to me.
> > >
> > > Could we add a few DEBUG_VM checks that *look* for these invalid
> > > zonelists? Or, would our existing list debugging have caught this?
> >
> > Currently we simply blow up because those zonelists are NULL. I do not
> > think we have a way to check whether an existing zonelist is actually
> > _correct_ other than checking it for NULL. But what would we do in the
> > latter case?
> >
> > > Basically, is this bug also a sign that we need better debugging around
> > > this?
> >
> > My earlier patch had a debugging printk to display the zonelists and
> > that might be worthwhile I guess. Basically something like this
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2e097f336126..c30d59f803fb 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5259,6 +5259,11 @@ static void build_zonelists(pg_data_t *pgdat)
> >
> >  	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
> >  	build_thisnode_zonelists(pgdat);
> > +
> > +	pr_info("node[%d] zonelist: ", pgdat->node_id);
> > +	for_each_zone_zonelist(zone, z, &pgdat->node_zonelists[ZONELIST_FALLBACK], MAX_NR_ZONES-1)
> > +		pr_cont("%d:%s ", zone_to_nid(zone), zone->name);
> > +	pr_cont("\n");
> >  }
>
> Looks like this patch fell through the cracks - any update on this?

I was waiting for some feedback. As there were no complaints about the above debugging output I will make it a separate patch and post both patches later this week. I just have to go through my backlog pile after vacation.

--
Michal Hocko
SUSE Labs
[PATCH 4.19 233/313] block/swim3: Fix -EBUSY error when re-opening device after unmount
4.19-stable review patch. If anyone has any objections, please let me know.

--

[ Upstream commit 296dcc40f2f2e402facf7cd26cf3f2c8f4b17d47 ]

When the block device is opened with FMODE_EXCL, ref_count is set to -1. This value doesn't get reset when the device is closed which means the device cannot be opened again.

Fix this by checking for refcount <= 0 in the release method.

Reported-and-tested-by: Stan Johnson
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Finn Thain
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin
---
 drivers/block/swim3.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index 469541c1e51e..20907a0a043b 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -1026,7 +1026,11 @@ static void floppy_release(struct gendisk *disk, fmode_t mode)
 	struct swim3 __iomem *sw = fs->swim3;

 	mutex_lock(&swim3_mutex);
-	if (fs->ref_count > 0 && --fs->ref_count == 0) {
+	if (fs->ref_count > 0)
+		--fs->ref_count;
+	else if (fs->ref_count == -1)
+		fs->ref_count = 0;
+	if (fs->ref_count == 0) {
 		swim3_action(fs, MOTOR_OFF);
 		out_8(&sw->control_bic, 0xff);
 		swim3_select(fs, RELAX);
--
2.19.1
Re: [PATCH v3 1/7] dump_stack: Support adding to the dump stack arch description
On Mon 2019-02-11 13:50:35, Andrea Parri wrote: > Hi Michael, > > > On Thu, Feb 07, 2019 at 11:46:29PM +1100, Michael Ellerman wrote: > > Arch code can set a "dump stack arch description string" which is > > displayed with oops output to describe the hardware platform. > > > > It is useful to initialise this as early as possible, so that an early > > oops will have the hardware description. > > > > However in practice we discover the hardware platform in stages, so it > > would be useful to be able to incrementally fill in the hardware > > description as we discover it. > > > > This patch adds that ability, by creating dump_stack_add_arch_desc(). > > > > If there is no existing string it behaves exactly like > > dump_stack_set_arch_desc(). However if there is an existing string it > > appends to it, with a leading space. > > > > This makes it easy to call it multiple times from different parts of the > > code and get a reasonable looking result. > > > > Signed-off-by: Michael Ellerman > > --- > > include/linux/printk.h | 5 > > lib/dump_stack.c | 58 ++ > > 2 files changed, 63 insertions(+) > > > > v3: No change, just widened Cc list. > > > > v2: Add a smp_wmb() and comment. > > > > v1 is here for reference > > https://lore.kernel.org/lkml/1430824337-15339-1-git-send-email-...@ellerman.id.au/ > > > > I'll take this series via the powerpc tree if no one minds? 
> > > > > > diff --git a/include/linux/printk.h b/include/linux/printk.h > > index 77740a506ebb..d5fb4f960271 100644 > > --- a/include/linux/printk.h > > +++ b/include/linux/printk.h > > @@ -198,6 +198,7 @@ u32 log_buf_len_get(void); > > void log_buf_vmcoreinfo_setup(void); > > void __init setup_log_buf(int early); > > __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...); > > +__printf(1, 2) void dump_stack_add_arch_desc(const char *fmt, ...); > > void dump_stack_print_info(const char *log_lvl); > > void show_regs_print_info(const char *log_lvl); > > extern asmlinkage void dump_stack(void) __cold; > > @@ -256,6 +257,10 @@ static inline __printf(1, 2) void > > dump_stack_set_arch_desc(const char *fmt, ...) > > { > > } > > > > +static inline __printf(1, 2) void dump_stack_add_arch_desc(const char > > *fmt, ...) > > +{ > > +} > > + > > static inline void dump_stack_print_info(const char *log_lvl) > > { > > } > > diff --git a/lib/dump_stack.c b/lib/dump_stack.c > > index 5cff72f18c4a..69b710ff92b5 100644 > > --- a/lib/dump_stack.c > > +++ b/lib/dump_stack.c > > @@ -35,6 +35,64 @@ void __init dump_stack_set_arch_desc(const char *fmt, > > ...) > > va_end(args); > > } > > > > +/** > > + * dump_stack_add_arch_desc - add arch-specific info to show with task > > dumps > > + * @fmt: printf-style format string > > + * @...: arguments for the format string > > + * > > + * See dump_stack_set_arch_desc() for why you'd want to use this. > > + * > > + * This version adds to any existing string already created with either > > + * dump_stack_set_arch_desc() or dump_stack_add_arch_desc(). If there is an > > + * existing string a space will be prepended to the passed string. > > + */ > > +void __init dump_stack_add_arch_desc(const char *fmt, ...) 
> > +{ > > + va_list args; > > + int pos, len; > > + char *p; > > + > > + /* > > +* If there's an existing string we snprintf() past the end of it, and > > +* then turn the terminating NULL of the existing string into a space > > +* to create one string separated by a space. > > +* > > +* If there's no existing string we just snprintf() to the buffer, like > > +* dump_stack_set_arch_desc(), but without calling it because we'd need > > +* a varargs version. > > +*/ > > + len = strnlen(dump_stack_arch_desc_str, > > sizeof(dump_stack_arch_desc_str)); > > + pos = len; > > + > > + if (len) > > + pos++; > > + > > + if (pos >= sizeof(dump_stack_arch_desc_str)) > > + return; /* Ran out of space */ > > + > > + p = &dump_stack_arch_desc_str[pos]; > > + > > + va_start(args, fmt); > > + vsnprintf(p, sizeof(dump_stack_arch_desc_str) - pos, fmt, args); > > + va_end(args); > > + > > + if (len) { > > + /* > > +* Order the stores above in vsnprintf() vs the store of the > > +* space below which joins the two strings. Note this doesn't > > +* make the code truly race free because there is no barrier on > > +* the read side. ie. Another CPU might load the uninitialised > > +* tail of the buffer first and then the space below (rather > > +* than the NULL that was there previously), and so print the > > +* uninitialised tail. But the whole string lives in BSS so in > > +* practice it should just see NULLs. > > The comment doesn't say _why_ we need to order these stores: IOW, what > will or can go wrong without this order? This isn't clear to me. > > Another good
[PATCH 4.20 278/352] block/swim3: Fix regression on PowerBook G3
4.20-stable review patch. If anyone has any objections, please let me know. -- [ Upstream commit 427c5ce4417cba0801fbf79c8525d1330704759c ] As of v4.20, the swim3 driver crashes when loaded on a PowerBook G3 (Wallstreet). MacIO PCI driver attached to Gatwick chipset MacIO PCI driver attached to Heathrow chipset swim3 0.00015000:floppy: [fd0] SWIM3 floppy controller in media bay 0.00013020:ch-a: ttyS0 at MMIO 0xf3013020 (irq = 16, base_baud = 230400) is a Z85c30 ESCC - Serial port 0.00013000:ch-b: ttyS1 at MMIO 0xf3013000 (irq = 17, base_baud = 230400) is a Z85c30 ESCC - Infrared port macio: fixed media-bay irq on gatwick macio: fixed left floppy irqs swim3 1.00015000:floppy: [fd1] Couldn't request interrupt Unable to handle kernel paging request for data at address 0x0024 Faulting instruction address: 0xc02652f8 Oops: Kernel access of bad area, sig: 11 [#1] BE SMP NR_CPUS=2 PowerMac Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.20.0 #2 NIP: c02652f8 LR: c026915c CTR: c0276d1c REGS: df43ba10 TRAP: 0300 Not tainted (4.20.0) MSR: 9032 CR: 28228288 XER: 0100 DAR: 0024 DSISR: 4000 GPR00: c026915c df43bac0 df439060 c0731524 df494700 c06e1c08 0001 GPR08: 0001 df5ff220 1032 28228282 c0004ca4 GPR16: c073144c dfffe064 c0731524 0120 c0586108 GPR24: c073132c c073143c c073143c c0731524 df67cd70 df494700 0001 NIP [c02652f8] blk_mq_free_rqs+0x28/0xf8 LR [c026915c] blk_mq_sched_tags_teardown+0x58/0x84 Call Trace: [df43bac0] [c0045f50] flush_workqueue_prep_pwqs+0x178/0x1c4 (unreliable) [df43bae0] [c026915c] blk_mq_sched_tags_teardown+0x58/0x84 [df43bb00] [c02697f0] blk_mq_exit_sched+0x9c/0xb8 [df43bb20] [c0252794] elevator_exit+0x84/0xa4 [df43bb40] [c0256538] blk_exit_queue+0x30/0x50 [df43bb50] [c0256640] blk_cleanup_queue+0xe8/0x184 [df43bb70] [c034732c] swim3_attach+0x330/0x5f0 [df43bbb0] [c034fb24] macio_device_probe+0x58/0xec [df43bbd0] [c032ba88] really_probe+0x1e4/0x2f4 [df43bc00] [c032bd28] driver_probe_device+0x64/0x204 [df43bc20] [c0329ac4] 
bus_for_each_drv+0x60/0xac [df43bc50] [c032b824] __device_attach+0xe8/0x160 [df43bc80] [c032ab38] bus_probe_device+0xa0/0xbc [df43bca0] [c0327338] device_add+0x3d8/0x630 [df43bcf0] [c0350848] macio_add_one_device+0x444/0x48c [df43bd50] [c03509f8] macio_pci_add_devices+0x168/0x1bc [df43bd90] [c03500ec] macio_pci_probe+0xc0/0x10c [df43bda0] [c02ad884] pci_device_probe+0xd4/0x184 [df43bdd0] [c032ba88] really_probe+0x1e4/0x2f4 [df43be00] [c032bd28] driver_probe_device+0x64/0x204 [df43be20] [c032bfcc] __driver_attach+0x104/0x108 [df43be40] [c0329a00] bus_for_each_dev+0x64/0xb4 [df43be70] [c032add8] bus_add_driver+0x154/0x238 [df43be90] [c032ca24] driver_register+0x84/0x148 [df43bea0] [c0004aa0] do_one_initcall+0x40/0x188 [df43bf00] [c0690100] kernel_init_freeable+0x138/0x1d4 [df43bf30] [c0004cbc] kernel_init+0x18/0x10c [df43bf40] [c00121e4] ret_from_kernel_thread+0x14/0x1c Instruction dump: 5484d97e 4bfff4f4 9421ffe0 7c0802a6 bf410008 7c9e2378 90010024 8124005c 2f89 419e0078 81230004 7c7c1b78 <81290024> 2f89 419e0064 8144 ---[ end trace 12025ab921a9784c ]--- Reverting commit 8ccb8cb1892b ("swim3: convert to blk-mq") resolves the problem. That commit added a struct blk_mq_tag_set to struct floppy_state and initialized it with a blk_mq_init_sq_queue() call. Unfortunately, there is a memset() in swim3_add_device() that subsequently clears the floppy_state struct. That means fs->tag_set->ops is a NULL pointer, and it gets dereferenced by blk_mq_free_rqs() which gets called in the request_irq() error path. Move the memset() to fix this bug. BTW, the request_irq() failure for the left mediabay floppy (fd1) is not a regression. I don't know why it happens. The right media bay floppy (fd0) works fine however. 
Reported-and-tested-by: Stan Johnson Fixes: 8ccb8cb1892b ("swim3: convert to blk-mq") Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Finn Thain Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin --- drivers/block/swim3.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c index 3f6df3f1f5d9..1046459f172b 100644 --- a/drivers/block/swim3.c +++ b/drivers/block/swim3.c @@ -1091,8 +1091,6 @@ static int swim3_add_device(struct macio_dev *mdev, int index) struct floppy_state *fs = &floppy_states[index]; int rc = -EBUSY; - /* Do this first for message macros */ - memset(fs, 0, sizeof(*fs)); fs->mdev = mdev; fs->index = index; @@ -1192,14 +1190,15 @@ static int swim3_attach(struct macio_dev *mdev, return rc; } - fs = &floppy_states[floppy_count]; - disk = alloc_disk(1); if (disk == NULL) { rc = -ENOMEM; goto out_unregister; } + fs = &floppy_states[floppy_count]; + memset(fs, 0, sizeof(*
[PATCH 4.20 273/352] block/swim3: Fix -EBUSY error when re-opening device after unmount
4.20-stable review patch. If anyone has any objections, please let me know. -- [ Upstream commit 296dcc40f2f2e402facf7cd26cf3f2c8f4b17d47 ] When the block device is opened with FMODE_EXCL, ref_count is set to -1. This value doesn't get reset when the device is closed which means the device cannot be opened again. Fix this by checking for refcount <= 0 in the release method. Reported-and-tested-by: Stan Johnson Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: linuxppc-dev@lists.ozlabs.org Signed-off-by: Finn Thain Signed-off-by: Jens Axboe Signed-off-by: Sasha Levin --- drivers/block/swim3.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c index c1c676a33e4a..3f6df3f1f5d9 100644 --- a/drivers/block/swim3.c +++ b/drivers/block/swim3.c @@ -995,7 +995,11 @@ static void floppy_release(struct gendisk *disk, fmode_t mode) struct swim3 __iomem *sw = fs->swim3; mutex_lock(&swim3_mutex); - if (fs->ref_count > 0 && --fs->ref_count == 0) { + if (fs->ref_count > 0) + --fs->ref_count; + else if (fs->ref_count == -1) + fs->ref_count = 0; + if (fs->ref_count == 0) { swim3_action(fs, MOTOR_OFF); out_8(&sw->control_bic, 0xff); swim3_select(fs, RELAX); -- 2.19.1
Re: [RFC PATCH] x86, numa: always initialize all possible nodes
* Michal Hocko wrote: > On Thu 24-01-19 11:10:50, Dave Hansen wrote: > > On 1/24/19 6:17 AM, Michal Hocko wrote: > > > and nr_cpus set to 4. The underlying reason is that the device is bound > > > to node 2 which doesn't have any memory and init_cpu_to_node only > > > initializes memory-less nodes for possible cpus which nr_cpus restricts. > > > This in turn means that proper zonelists are not allocated and the page > > > allocator blows up. > > > > This looks OK to me. > > > > Could we add a few DEBUG_VM checks that *look* for these invalid > > zonelists? Or, would our existing list debugging have caught this? > > Currently we simply blow up because those zonelists are NULL. I do not > think we have a way to check whether an existing zonelist is actually > _correct_ other than checking it for NULL. But what would we do in the > latter case? > > > Basically, is this bug also a sign that we need better debugging around > > this? > > My earlier patch had a debugging printk to display the zonelists and > that might be worthwhile I guess. Basically something like this > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 2e097f336126..c30d59f803fb 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -5259,6 +5259,11 @@ static void build_zonelists(pg_data_t *pgdat) > > build_zonelists_in_node_order(pgdat, node_order, nr_nodes); > build_thisnode_zonelists(pgdat); > + > + pr_info("node[%d] zonelist: ", pgdat->node_id); > + for_each_zone_zonelist(zone, z, > &pgdat->node_zonelists[ZONELIST_FALLBACK], MAX_NR_ZONES-1) > + pr_cont("%d:%s ", zone_to_nid(zone), zone->name); > + pr_cont("\n"); > } Looks like this patch fell through the cracks - any update on this? Thanks, Ingo
[PATCH 10/12] dma-mapping: simplify allocations from per-device coherent memory
All users of per-device coherent memory are exclusive, that is if we can't allocate from the per-device pool we can't use the system memory either. Unfold the current dma_{alloc,free}_from_dev_coherent implementation and always use the per-device pool if it exists. Signed-off-by: Christoph Hellwig --- arch/arm/mm/dma-mapping-nommu.c | 12 ++--- include/linux/dma-mapping.h | 14 ++ kernel/dma/coherent.c | 89 - kernel/dma/internal.h | 19 +++ kernel/dma/mapping.c| 12 +++-- 5 files changed, 55 insertions(+), 91 deletions(-) create mode 100644 kernel/dma/internal.h diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c index f304b10e23a4..c72f024f1e82 100644 --- a/arch/arm/mm/dma-mapping-nommu.c +++ b/arch/arm/mm/dma-mapping-nommu.c @@ -70,16 +70,10 @@ static void arm_nommu_dma_free(struct device *dev, size_t size, void *cpu_addr, dma_addr_t dma_addr, unsigned long attrs) { - if (attrs & DMA_ATTR_NON_CONSISTENT) { + if (attrs & DMA_ATTR_NON_CONSISTENT) dma_direct_free_pages(dev, size, cpu_addr, dma_addr, attrs); - } else { - int ret = dma_release_from_global_coherent(get_order(size), - cpu_addr); - - WARN_ON_ONCE(ret == 0); - } - - return; + else + dma_release_from_global_coherent(size, cpu_addr); } static int arm_nommu_dma_mmap(struct device *dev, struct vm_area_struct *vma, diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index b12fba725f19..018e37a0870e 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -158,30 +158,24 @@ static inline int is_device_dma_capable(struct device *dev) * These three functions are only for dma allocator. * Don't use them in device drivers. 
*/ -int dma_alloc_from_dev_coherent(struct device *dev, ssize_t size, - dma_addr_t *dma_handle, void **ret); -int dma_release_from_dev_coherent(struct device *dev, int order, void *vaddr); - int dma_mmap_from_dev_coherent(struct device *dev, struct vm_area_struct *vma, void *cpu_addr, size_t size, int *ret); -void *dma_alloc_from_global_coherent(ssize_t size, dma_addr_t *dma_handle); -int dma_release_from_global_coherent(int order, void *vaddr); +void *dma_alloc_from_global_coherent(size_t size, dma_addr_t *dma_handle); +void dma_release_from_global_coherent(size_t size, void *vaddr); int dma_mmap_from_global_coherent(struct vm_area_struct *vma, void *cpu_addr, size_t size, int *ret); #else -#define dma_alloc_from_dev_coherent(dev, size, handle, ret) (0) -#define dma_release_from_dev_coherent(dev, order, vaddr) (0) #define dma_mmap_from_dev_coherent(dev, vma, vaddr, order, ret) (0) -static inline void *dma_alloc_from_global_coherent(ssize_t size, +static inline void *dma_alloc_from_global_coherent(size_t size, dma_addr_t *dma_handle) { return NULL; } -static inline int dma_release_from_global_coherent(int order, void *vaddr) +static inline void dma_release_from_global_coherent(size_t size, void *vaddr) { return 0; } diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c index 29fd6590dc1e..d1da1048e470 100644 --- a/kernel/dma/coherent.c +++ b/kernel/dma/coherent.c @@ -8,6 +8,7 @@ #include #include #include +#include "internal.h" struct dma_coherent_mem { void*virt_base; @@ -21,13 +22,6 @@ struct dma_coherent_mem { static struct dma_coherent_mem *dma_coherent_default_memory __ro_after_init; -static inline struct dma_coherent_mem *dev_get_coherent_memory(struct device *dev) -{ - if (dev && dev->dma_mem) - return dev->dma_mem; - return NULL; -} - static inline dma_addr_t dma_get_device_base(struct device *dev, struct dma_coherent_mem * mem) { @@ -135,8 +129,8 @@ void dma_release_declared_memory(struct device *dev) } EXPORT_SYMBOL(dma_release_declared_memory); 
-static void *__dma_alloc_from_coherent(struct dma_coherent_mem *mem, - ssize_t size, dma_addr_t *dma_handle) +void *__dma_alloc_from_coherent(struct dma_coherent_mem *mem, size_t size, + dma_addr_t *dma_handle) { int order = get_order(size); unsigned long flags; @@ -165,33 +159,7 @@ static void *__dma_alloc_from_coherent(struct dma_coherent_mem *mem, return NULL; } -/** - * dma_alloc_from_dev_coherent() - allocate memory from device coherent pool - * @dev: device from which we allocate memory - * @size: size of requested memory area - * @dma_handle:This will be filled with the correct dma handle - * @ret: This pointer will be filled with the vi
[PATCH 12/12] dma-mapping: remove dma_assign_coherent_memory
The only useful bit in this function was the already assigned check. Once that is moved to dma_init_coherent_memory the rest can easily be handled in the two callers. Signed-off-by: Christoph Hellwig --- kernel/dma/coherent.c | 47 +-- 1 file changed, 14 insertions(+), 33 deletions(-) diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c index d7a27008f228..1e3ce71cd993 100644 --- a/kernel/dma/coherent.c +++ b/kernel/dma/coherent.c @@ -41,6 +41,9 @@ static int dma_init_coherent_memory(phys_addr_t phys_addr, int bitmap_size = BITS_TO_LONGS(pages) * sizeof(long); int ret; + if (*mem) + return -EBUSY; + if (!size) { ret = -EINVAL; goto out; @@ -88,33 +91,11 @@ static void dma_release_coherent_memory(struct dma_coherent_mem *mem) kfree(mem); } -static int dma_assign_coherent_memory(struct device *dev, - struct dma_coherent_mem *mem) -{ - if (!dev) - return -ENODEV; - - if (dev->dma_mem) - return -EBUSY; - - dev->dma_mem = mem; - return 0; -} - int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, dma_addr_t device_addr, size_t size) { - struct dma_coherent_mem *mem; - int ret; - - ret = dma_init_coherent_memory(phys_addr, device_addr, size, &mem); - if (ret) - return ret; - - ret = dma_assign_coherent_memory(dev, mem); - if (ret) - dma_release_coherent_memory(mem); - return ret; + return dma_init_coherent_memory(phys_addr, device_addr, size, + &dev->dma_mem); } EXPORT_SYMBOL(dma_declare_coherent_memory); @@ -238,18 +219,18 @@ static int rmem_dma_device_init(struct reserved_mem *rmem, struct device *dev) struct dma_coherent_mem *mem = rmem->priv; int ret; - if (!mem) { - ret = dma_init_coherent_memory(rmem->base, rmem->base, - rmem->size, &mem); - if (ret) { - pr_err("Reserved memory: failed to init DMA memory pool at %pa, size %ld MiB\n", - &rmem->base, (unsigned long)rmem->size / SZ_1M); - return ret; - } + ret = dma_init_coherent_memory(rmem->base, rmem->base, rmem->size, + &mem); + if (ret && ret != -EBUSY) { + pr_err("Reserved memory: 
failed to init DMA memory pool at %pa, size %ld MiB\n", + &rmem->base, (unsigned long)rmem->size / SZ_1M); + return ret; } + mem->use_dev_dma_pfn_offset = true; + if (dev) + dev->dma_mem = mem; rmem->priv = mem; - dma_assign_coherent_memory(dev, mem); return 0; } -- 2.20.1
[PATCH 11/12] dma-mapping: handle per-device coherent memory mmap in common code
We handle allocation and freeing in common code, so we should handle mmap the same way. Also all users of per-device coherent memory are exclusive, that is if we can't allocate from the per-device pool we can't use the system memory either. Unfold the current dma_mmap_from_dev_coherent implementation and always use the per-device pool if it exists. Signed-off-by: Christoph Hellwig --- arch/arm/mm/dma-mapping-nommu.c | 7 ++-- arch/arm/mm/dma-mapping.c | 3 -- arch/arm64/mm/dma-mapping.c | 3 -- include/linux/dma-mapping.h | 11 ++- kernel/dma/coherent.c | 58 - kernel/dma/internal.h | 2 ++ kernel/dma/mapping.c| 8 ++--- 7 files changed, 24 insertions(+), 68 deletions(-) diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c index c72f024f1e82..4eeb7e5d9c07 100644 --- a/arch/arm/mm/dma-mapping-nommu.c +++ b/arch/arm/mm/dma-mapping-nommu.c @@ -80,11 +80,8 @@ static int arm_nommu_dma_mmap(struct device *dev, struct vm_area_struct *vma, void *cpu_addr, dma_addr_t dma_addr, size_t size, unsigned long attrs) { - int ret; - - if (dma_mmap_from_global_coherent(vma, cpu_addr, size, &ret)) - return ret; - + if (!(attrs & DMA_ATTR_NON_CONSISTENT)) + return dma_mmap_from_global_coherent(vma, cpu_addr, size); return dma_common_mmap(dev, vma, cpu_addr, dma_addr, size, attrs); } diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c index 3c8534904209..e2993e5a7166 100644 --- a/arch/arm/mm/dma-mapping.c +++ b/arch/arm/mm/dma-mapping.c @@ -830,9 +830,6 @@ static int __arm_dma_mmap(struct device *dev, struct vm_area_struct *vma, unsigned long pfn = dma_to_pfn(dev, dma_addr); unsigned long off = vma->vm_pgoff; - if (dma_mmap_from_dev_coherent(dev, vma, cpu_addr, size, &ret)) - return ret; - if (off < nr_pages && nr_vma_pages <= (nr_pages - off)) { ret = remap_pfn_range(vma, vma->vm_start, pfn + off, diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c index 78c0a72f822c..a55be91c1d1a 100644 --- a/arch/arm64/mm/dma-mapping.c +++ 
b/arch/arm64/mm/dma-mapping.c @@ -246,9 +246,6 @@ static int __iommu_mmap_attrs(struct device *dev, struct vm_area_struct *vma, vma->vm_page_prot = arch_dma_mmap_pgprot(dev, vma->vm_page_prot, attrs); - if (dma_mmap_from_dev_coherent(dev, vma, cpu_addr, size, &ret)) - return ret; - if (attrs & DMA_ATTR_FORCE_CONTIGUOUS) { /* * DMA_ATTR_FORCE_CONTIGUOUS allocations are always remapped, diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index 018e37a0870e..ae6fe66f97b7 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -158,17 +158,12 @@ static inline int is_device_dma_capable(struct device *dev) * These three functions are only for dma allocator. * Don't use them in device drivers. */ -int dma_mmap_from_dev_coherent(struct device *dev, struct vm_area_struct *vma, - void *cpu_addr, size_t size, int *ret); - void *dma_alloc_from_global_coherent(size_t size, dma_addr_t *dma_handle); void dma_release_from_global_coherent(size_t size, void *vaddr); int dma_mmap_from_global_coherent(struct vm_area_struct *vma, void *cpu_addr, - size_t size, int *ret); + size_t size); #else -#define dma_mmap_from_dev_coherent(dev, vma, vaddr, order, ret) (0) - static inline void *dma_alloc_from_global_coherent(size_t size, dma_addr_t *dma_handle) { @@ -177,12 +172,10 @@ static inline void *dma_alloc_from_global_coherent(size_t size, static inline void dma_release_from_global_coherent(size_t size, void *vaddr) { - return 0; } static inline int dma_mmap_from_global_coherent(struct vm_area_struct *vma, - void *cpu_addr, size_t size, - int *ret) + void *cpu_addr, size_t size) { return 0; } diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c index d1da1048e470..d7a27008f228 100644 --- a/kernel/dma/coherent.c +++ b/kernel/dma/coherent.c @@ -197,60 +197,30 @@ void dma_release_from_global_coherent(size_t size, void *vaddr) __dma_release_from_coherent(dma_coherent_default_memory, size, vaddr); } -static int __dma_mmap_from_coherent(struct 
dma_coherent_mem *mem, - struct vm_area_struct *vma, void *vaddr, size_t size, int *ret) +int __dma_mmap_from_coherent(struct dma_coherent_mem *mem, + struct vm_area_struct *vma, void *vaddr, size_t size) { - if (mem && vaddr >= mem->virt_base && vaddr +
[PATCH 09/12] dma-mapping: remove the DMA_MEMORY_EXCLUSIVE flag
All users of dma_declare_coherent want their allocations to be exclusive, so default to exclusive allocations. Signed-off-by: Christoph Hellwig --- Documentation/DMA-API.txt | 9 +-- arch/arm/mach-imx/mach-imx27_visstrim_m10.c | 12 +++-- arch/arm/mach-imx/mach-mx31moboard.c | 3 +-- arch/sh/boards/mach-ap325rxa/setup.c | 5 ++-- arch/sh/boards/mach-ecovec24/setup.c | 6 ++--- arch/sh/boards/mach-kfr2r09/setup.c | 5 ++-- arch/sh/boards/mach-migor/setup.c | 5 ++-- arch/sh/boards/mach-se/7724/setup.c | 6 ++--- arch/sh/drivers/pci/fixups-dreamcast.c| 3 +-- .../soc_camera/sh_mobile_ceu_camera.c | 3 +-- drivers/usb/host/ohci-sm501.c | 3 +-- drivers/usb/host/ohci-tmio.c | 2 +- include/linux/dma-mapping.h | 7 ++ kernel/dma/coherent.c | 25 ++- 14 files changed, 29 insertions(+), 65 deletions(-) diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index b9d0cba83877..38e561b773b4 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -566,8 +566,7 @@ boundaries when doing this. int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, - dma_addr_t device_addr, size_t size, int - flags) + dma_addr_t device_addr, size_t size); Declare region of memory to be handed out by dma_alloc_coherent() when it's asked for coherent memory for this device. @@ -581,12 +580,6 @@ dma_addr_t in dma_alloc_coherent()). size is the size of the area (must be multiples of PAGE_SIZE). -flags can be ORed together and are: - -- DMA_MEMORY_EXCLUSIVE - only allocate memory from the declared regions. - Do not allow dma_alloc_coherent() to fall back to system memory when - it's out of memory in the declared region. - As a simplification for the platforms, only *one* such region of memory may be declared per device. 
diff --git a/arch/arm/mach-imx/mach-imx27_visstrim_m10.c b/arch/arm/mach-imx/mach-imx27_visstrim_m10.c index 5169dfba9718..07d4fcfe5c2e 100644 --- a/arch/arm/mach-imx/mach-imx27_visstrim_m10.c +++ b/arch/arm/mach-imx/mach-imx27_visstrim_m10.c @@ -258,8 +258,7 @@ static void __init visstrim_analog_camera_init(void) return; dma_declare_coherent_memory(&pdev->dev, mx2_camera_base, - mx2_camera_base, MX2_CAMERA_BUF_SIZE, - DMA_MEMORY_EXCLUSIVE); + mx2_camera_base, MX2_CAMERA_BUF_SIZE); } static void __init visstrim_reserve(void) @@ -445,8 +444,7 @@ static void __init visstrim_coda_init(void) dma_declare_coherent_memory(&pdev->dev, mx2_camera_base + MX2_CAMERA_BUF_SIZE, mx2_camera_base + MX2_CAMERA_BUF_SIZE, - MX2_CAMERA_BUF_SIZE, - DMA_MEMORY_EXCLUSIVE); + MX2_CAMERA_BUF_SIZE); } /* DMA deinterlace */ @@ -465,8 +463,7 @@ static void __init visstrim_deinterlace_init(void) dma_declare_coherent_memory(&pdev->dev, mx2_camera_base + 2 * MX2_CAMERA_BUF_SIZE, mx2_camera_base + 2 * MX2_CAMERA_BUF_SIZE, - MX2_CAMERA_BUF_SIZE, - DMA_MEMORY_EXCLUSIVE); + MX2_CAMERA_BUF_SIZE); } /* Emma-PrP for format conversion */ @@ -485,8 +482,7 @@ static void __init visstrim_emmaprp_init(void) */ ret = dma_declare_coherent_memory(&pdev->dev, mx2_camera_base, mx2_camera_base, - MX2_CAMERA_BUF_SIZE, - DMA_MEMORY_EXCLUSIVE); + MX2_CAMERA_BUF_SIZE); if (ret) pr_err("Failed to declare memory for emmaprp\n"); } diff --git a/arch/arm/mach-imx/mach-mx31moboard.c b/arch/arm/mach-imx/mach-mx31moboard.c index 643a3d749703..fe50f4cf00a7 100644 --- a/arch/arm/mach-imx/mach-mx31moboard.c +++ b/arch/arm/mach-imx/mach-mx31moboard.c @@ -475,8 +475,7 @@ static int __init mx31moboard_init_cam(void) ret = dma_declare_coherent_memory(&pdev->dev, mx3_camera_base, mx3_camera_base, - MX3_CAMERA_BUF_SIZE, - DMA_MEMORY_EXCLUSIVE); + MX3_CAMERA_BUF_SIZE); if (ret) goto err; diff --git a/arch/sh/boards/mach-ap325rxa/setup.c b/arch/sh/boards/mach-ap325rxa/setup.c index 8f234d0435aa..7899b4f51fdd 100644 --- 
a/arch/sh/boards/mach-ap325rxa/
[PATCH 08/12] dma-mapping: remove dma_mark_declared_memory_occupied
This API is not used anywhere, so remove it. Signed-off-by: Christoph Hellwig --- Documentation/DMA-API.txt | 17 - include/linux/dma-mapping.h | 9 - kernel/dma/coherent.c | 23 --- 3 files changed, 49 deletions(-) diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt index 78114ee63057..b9d0cba83877 100644 --- a/Documentation/DMA-API.txt +++ b/Documentation/DMA-API.txt @@ -605,23 +605,6 @@ unconditionally having removed all the required structures. It is the driver's job to ensure that no parts of this memory region are currently in use. -:: - - void * - dma_mark_declared_memory_occupied(struct device *dev, - dma_addr_t device_addr, size_t size) - -This is used to occupy specific regions of the declared space -(dma_alloc_coherent() will hand out the first free region it finds). - -device_addr is the *device* address of the region requested. - -size is the size (and should be a page-sized multiple). - -The return value will be either a pointer to the processor virtual -address of the memory, or an error (via PTR_ERR()) if any part of the -region is occupied. 
- Part III - Debug drivers use of the DMA-API --- diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index fde0cfc71824..9df0f4d318c5 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -735,8 +735,6 @@ static inline int dma_get_cache_alignment(void) int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, dma_addr_t device_addr, size_t size, int flags); void dma_release_declared_memory(struct device *dev); -void *dma_mark_declared_memory_occupied(struct device *dev, - dma_addr_t device_addr, size_t size); #else static inline int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr, @@ -749,13 +747,6 @@ static inline void dma_release_declared_memory(struct device *dev) { } - -static inline void * -dma_mark_declared_memory_occupied(struct device *dev, - dma_addr_t device_addr, size_t size) -{ - return ERR_PTR(-EBUSY); -} #endif /* CONFIG_DMA_DECLARE_COHERENT */ static inline void *dmam_alloc_coherent(struct device *dev, size_t size, diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c index 4b76aba574c2..1d12a31af6d7 100644 --- a/kernel/dma/coherent.c +++ b/kernel/dma/coherent.c @@ -137,29 +137,6 @@ void dma_release_declared_memory(struct device *dev) } EXPORT_SYMBOL(dma_release_declared_memory); -void *dma_mark_declared_memory_occupied(struct device *dev, - dma_addr_t device_addr, size_t size) -{ - struct dma_coherent_mem *mem = dev->dma_mem; - unsigned long flags; - int pos, err; - - size += device_addr & ~PAGE_MASK; - - if (!mem) - return ERR_PTR(-EINVAL); - - spin_lock_irqsave(&mem->spinlock, flags); - pos = PFN_DOWN(device_addr - dma_get_device_base(dev, mem)); - err = bitmap_allocate_region(mem->bitmap, pos, get_order(size)); - spin_unlock_irqrestore(&mem->spinlock, flags); - - if (err != 0) - return ERR_PTR(err); - return mem->virt_base + (pos << PAGE_SHIFT); -} -EXPORT_SYMBOL(dma_mark_declared_memory_occupied); - static void *__dma_alloc_from_coherent(struct 
dma_coherent_mem *mem, ssize_t size, dma_addr_t *dma_handle) { -- 2.20.1
[PATCH 07/12] dma-mapping: move CONFIG_DMA_CMA to kernel/dma/Kconfig
This is where all the related code already lives. Signed-off-by: Christoph Hellwig --- drivers/base/Kconfig | 77 kernel/dma/Kconfig | 77 2 files changed, 77 insertions(+), 77 deletions(-) diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index 3e63a900b330..059700ea3521 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -191,83 +191,6 @@ config DMA_FENCE_TRACE lockup related problems for dma-buffers shared across multiple devices. -config DMA_CMA - bool "DMA Contiguous Memory Allocator" - depends on HAVE_DMA_CONTIGUOUS && CMA - help - This enables the Contiguous Memory Allocator which allows drivers - to allocate big physically-contiguous blocks of memory for use with - hardware components that do not support I/O map nor scatter-gather. - - You can disable CMA by specifying "cma=0" on the kernel's command - line. - - For more information see . - If unsure, say "n". - -if DMA_CMA -comment "Default contiguous memory area size:" - -config CMA_SIZE_MBYTES - int "Size in Mega Bytes" - depends on !CMA_SIZE_SEL_PERCENTAGE - default 0 if X86 - default 16 - help - Defines the size (in MiB) of the default memory area for Contiguous - Memory Allocator. If the size of 0 is selected, CMA is disabled by - default, but it can be enabled by passing cma=size[MG] to the kernel. - - -config CMA_SIZE_PERCENTAGE - int "Percentage of total memory" - depends on !CMA_SIZE_SEL_MBYTES - default 0 if X86 - default 10 - help - Defines the size of the default memory area for Contiguous Memory - Allocator as a percentage of the total memory in the system. - If 0 percent is selected, CMA is disabled by default, but it can be - enabled by passing cma=size[MG] to the kernel. 
- -choice - prompt "Selected region size" - default CMA_SIZE_SEL_MBYTES - -config CMA_SIZE_SEL_MBYTES - bool "Use mega bytes value only" - -config CMA_SIZE_SEL_PERCENTAGE - bool "Use percentage value only" - -config CMA_SIZE_SEL_MIN - bool "Use lower value (minimum)" - -config CMA_SIZE_SEL_MAX - bool "Use higher value (maximum)" - -endchoice - -config CMA_ALIGNMENT - int "Maximum PAGE_SIZE order of alignment for contiguous buffers" - range 4 12 - default 8 - help - DMA mapping framework by default aligns all buffers to the smallest - PAGE_SIZE order which is greater than or equal to the requested buffer - size. This works well for buffers up to a few hundreds kilobytes, but - for larger buffers it just a memory waste. With this parameter you can - specify the maximum PAGE_SIZE order for contiguous buffers. Larger - buffers will be aligned only to this specified order. The order is - expressed as a power of two multiplied by the PAGE_SIZE. - - For example, if your system defaults to 4KiB pages, the order value - of 8 means that the buffers will be aligned up to 1MiB only. - - If unsure, leave the default value "8". - -endif - config GENERIC_ARCH_TOPOLOGY bool help diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index b122ab100d66..d785286ad868 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -53,3 +53,80 @@ config DMA_REMAP config DMA_DIRECT_REMAP bool select DMA_REMAP + +config DMA_CMA + bool "DMA Contiguous Memory Allocator" + depends on HAVE_DMA_CONTIGUOUS && CMA + help + This enables the Contiguous Memory Allocator which allows drivers + to allocate big physically-contiguous blocks of memory for use with + hardware components that do not support I/O map nor scatter-gather. + + You can disable CMA by specifying "cma=0" on the kernel's command + line. + + For more information see . + If unsure, say "n". 
+ +if DMA_CMA +comment "Default contiguous memory area size:" + +config CMA_SIZE_MBYTES + int "Size in Mega Bytes" + depends on !CMA_SIZE_SEL_PERCENTAGE + default 0 if X86 + default 16 + help + Defines the size (in MiB) of the default memory area for Contiguous + Memory Allocator. If the size of 0 is selected, CMA is disabled by + default, but it can be enabled by passing cma=size[MG] to the kernel. + + +config CMA_SIZE_PERCENTAGE + int "Percentage of total memory" + depends on !CMA_SIZE_SEL_MBYTES + default 0 if X86 + default 10 + help + Defines the size of the default memory area for Contiguous Memory + Allocator as a percentage of the total memory in the system. + If 0 percent is selected, CMA is disabled by default, but it can be + enabled by passing cma=size[MG]
[PATCH 06/12] dma-mapping: improve selection of dma_declare_coherent availability
This API is primarily used through DT entries, but two architectures and two drivers call it directly. So instead of selecting the config symbol for random architectures pull it in implicitly for the actual users. Also rename the Kconfig option to describe the feature better.

Signed-off-by: Christoph Hellwig
---
 arch/arc/Kconfig            | 1 -
 arch/arm/Kconfig            | 2 +-
 arch/arm64/Kconfig          | 1 -
 arch/csky/Kconfig           | 1 -
 arch/mips/Kconfig           | 1 -
 arch/riscv/Kconfig          | 1 -
 arch/sh/Kconfig             | 2 +-
 arch/unicore32/Kconfig      | 1 -
 arch/x86/Kconfig            | 1 -
 drivers/mfd/Kconfig         | 2 ++
 drivers/of/Kconfig          | 3 ++-
 include/linux/device.h      | 2 +-
 include/linux/dma-mapping.h | 8
 kernel/dma/Kconfig          | 2 +-
 kernel/dma/Makefile         | 2 +-
 15 files changed, 13 insertions(+), 17 deletions(-)

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 4103f23b6cea..56e9397542e0 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -30,7 +30,6 @@ config ARC
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_DEBUG_STACKOVERFLOW
 	select HAVE_FUTEX_CMPXCHG if FUTEX
-	select HAVE_GENERIC_DMA_COHERENT
 	select HAVE_IOREMAP_PROT
 	select HAVE_KERNEL_GZIP
 	select HAVE_KERNEL_LZMA
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 9395f138301a..25fbbd3cb91d 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -30,6 +30,7 @@ config ARM
 	select CLONE_BACKWARDS
 	select CPU_PM if SUSPEND || CPU_IDLE
 	select DCACHE_WORD_ACCESS if HAVE_EFFICIENT_UNALIGNED_ACCESS
+	select DMA_DECLARE_COHERENT
 	select DMA_REMAP if MMU
 	select EDAC_SUPPORT
 	select EDAC_ATOMIC_SCRUB
@@ -72,7 +73,6 @@ config ARM
 	select HAVE_FUNCTION_GRAPH_TRACER if !THUMB2_KERNEL
 	select HAVE_FUNCTION_TRACER if !XIP_KERNEL
 	select HAVE_GCC_PLUGINS
-	select HAVE_GENERIC_DMA_COHERENT
 	select HAVE_HW_BREAKPOINT if PERF_EVENTS && (CPU_V6 || CPU_V6K || CPU_V7)
 	select HAVE_IDE if PCI || ISA || PCMCIA
 	select HAVE_IRQ_TIME_ACCOUNTING
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1d22e969bdcb..d558461a5107 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -137,7 +137,6 @@ config ARM64
 	select HAVE_FUNCTION_TRACER
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_GCC_PLUGINS
-	select HAVE_GENERIC_DMA_COHERENT
 	select HAVE_HW_BREAKPOINT if PERF_EVENTS
 	select HAVE_IRQ_TIME_ACCOUNTING
 	select HAVE_MEMBLOCK_NODE_MAP if NUMA
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 0a9595afe9be..c009a8c63946 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -30,7 +30,6 @@ config CSKY
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_FUNCTION_TRACER
 	select HAVE_FUNCTION_GRAPH_TRACER
-	select HAVE_GENERIC_DMA_COHERENT
 	select HAVE_KERNEL_GZIP
 	select HAVE_KERNEL_LZO
 	select HAVE_KERNEL_LZMA
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 0d14f51d0002..ba50dc2d37dc 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -56,7 +56,6 @@ config MIPS
 	select HAVE_FTRACE_MCOUNT_RECORD
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_FUNCTION_TRACER
-	select HAVE_GENERIC_DMA_COHERENT
 	select HAVE_IDE
 	select HAVE_IOREMAP_PROT
 	select HAVE_IRQ_EXIT_ON_IRQ_STACK
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index feeeaa60697c..51b9c97751bf 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -32,7 +32,6 @@ config RISCV
 	select HAVE_MEMBLOCK_NODE_MAP
 	select HAVE_DMA_CONTIGUOUS
 	select HAVE_FUTEX_CMPXCHG if FUTEX
-	select HAVE_GENERIC_DMA_COHERENT
 	select HAVE_PERF_EVENTS
 	select HAVE_SYSCALL_TRACEPOINTS
 	select IRQ_DOMAIN
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index a9c36f95744a..a3d2a24e75c7 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -7,11 +7,11 @@ config SUPERH
 	select ARCH_NO_COHERENT_DMA_MMAP if !MMU
 	select HAVE_PATA_PLATFORM
 	select CLKDEV_LOOKUP
+	select DMA_DECLARE_COHERENT
 	select HAVE_IDE if HAS_IOPORT_MAP
 	select HAVE_MEMBLOCK_NODE_MAP
 	select ARCH_DISCARD_MEMBLOCK
 	select HAVE_OPROFILE
-	select HAVE_GENERIC_DMA_COHERENT
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_PERF_EVENTS
 	select HAVE_DEBUG_BUGVERBOSE
diff --git a/arch/unicore32/Kconfig b/arch/unicore32/Kconfig
index c3a41bfe161b..6d2891d37e32 100644
--- a/arch/unicore32/Kconfig
+++ b/arch/unicore32/Kconfig
@@ -4,7 +4,6 @@ config UNICORE32
 	select ARCH_HAS_DEVMEM_IS_ALLOWED
 	select ARCH_MIGHT_HAVE_PC_PARPORT
 	select ARCH_MIGHT_HAVE_PC_SERIO
-	select HAVE_GENERIC_DMA_COHERENT
 	select HAVE_KERNEL_GZIP
 	select HAVE_KERNEL_BZIP2
 	select GENERIC_ATOMIC64
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig ind
[PATCH 05/12] dma-mapping: remove an incorrect __iomem annotation
memremap() returns a regular void pointer, not an __iomem one.

Signed-off-by: Christoph Hellwig
---
 kernel/dma/coherent.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c
index 66f0fb7e9a3a..4b76aba574c2 100644
--- a/kernel/dma/coherent.c
+++ b/kernel/dma/coherent.c
@@ -43,7 +43,7 @@ static int dma_init_coherent_memory(
 	struct dma_coherent_mem **mem)
 {
 	struct dma_coherent_mem *dma_mem = NULL;
-	void __iomem *mem_base = NULL;
+	void *mem_base = NULL;
 	int pages = size >> PAGE_SHIFT;
 	int bitmap_size = BITS_TO_LONGS(pages) * sizeof(long);
 	int ret;
-- 
2.20.1
[PATCH 04/12] of: select OF_RESERVED_MEM automatically
OF_RESERVED_MEM can be used if we have either CMA or the generic declare coherent code built, and we support the early flattened DT. So don't bother making it a user-visible option that is selected by most configs that fit the above category; just select it when the requirements are met.

Signed-off-by: Christoph Hellwig
---
 arch/arc/Kconfig     | 1 -
 arch/arm/Kconfig     | 1 -
 arch/arm64/Kconfig   | 1 -
 arch/csky/Kconfig    | 1 -
 arch/powerpc/Kconfig | 1 -
 arch/xtensa/Kconfig  | 1 -
 drivers/of/Kconfig   | 5 ++---
 7 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 376366a7db81..4103f23b6cea 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -44,7 +44,6 @@ config ARC
 	select MODULES_USE_ELF_RELA
 	select OF
 	select OF_EARLY_FLATTREE
-	select OF_RESERVED_MEM
 	select PCI_SYSCALL if PCI
 	select PERF_USE_VMALLOC if ARC_CACHE_VIPT_ALIASING
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 664e918e2624..9395f138301a 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -101,7 +101,6 @@ config ARM
 	select MODULES_USE_ELF_REL
 	select NEED_DMA_MAP_STATE
 	select OF_EARLY_FLATTREE if OF
-	select OF_RESERVED_MEM if OF
 	select OLD_SIGACTION
 	select OLD_SIGSUSPEND3
 	select PCI_SYSCALL if PCI
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a4168d366127..1d22e969bdcb 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -163,7 +163,6 @@ config ARM64
 	select NEED_SG_DMA_LENGTH
 	select OF
 	select OF_EARLY_FLATTREE
-	select OF_RESERVED_MEM
 	select PCI_DOMAINS_GENERIC if PCI
 	select PCI_ECAM if (ACPI && PCI)
 	select PCI_SYSCALL if PCI
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 398113c845f5..0a9595afe9be 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -42,7 +42,6 @@ config CSKY
 	select MODULES_USE_ELF_RELA if MODULES
 	select OF
 	select OF_EARLY_FLATTREE
-	select OF_RESERVED_MEM
 	select PERF_USE_VMALLOC if CPU_CK610
 	select RTC_LIB
 	select TIMER_OF
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2890d36eb531..5cc4eea362c6 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -233,7 +233,6 @@ config PPC
 	select NEED_SG_DMA_LENGTH
 	select OF
 	select OF_EARLY_FLATTREE
-	select OF_RESERVED_MEM
 	select OLD_SIGACTION			if PPC32
 	select OLD_SIGSUSPEND
 	select PCI_DOMAINS			if PCI
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index 20a0756f27ef..e242a405151e 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -447,7 +447,6 @@ config USE_OF
 	bool "Flattened Device Tree support"
 	select OF
 	select OF_EARLY_FLATTREE
-	select OF_RESERVED_MEM
 	help
 	  Include support for flattened device tree machine descriptions.
diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
index ad3fcad4d75b..3607fd2810e4 100644
--- a/drivers/of/Kconfig
+++ b/drivers/of/Kconfig
@@ -81,10 +81,9 @@ config OF_MDIO
 	  OpenFirmware MDIO bus (Ethernet PHY) accessors
 
 config OF_RESERVED_MEM
-	depends on OF_EARLY_FLATTREE
 	bool
-	help
-	  Helpers to allow for reservation of memory regions
+	depends on OF_EARLY_FLATTREE
+	default y if HAVE_GENERIC_DMA_COHERENT || DMA_CMA
 
 config OF_RESOLVE
 	bool
-- 
2.20.1
[PATCH 03/12] of: mark early_init_dt_alloc_reserved_memory_arch static
This function is only used in of_reserved_mem.c, and never overridden despite the __weak marker.

Signed-off-by: Christoph Hellwig
---
 drivers/of/of_reserved_mem.c    | 2 +-
 include/linux/of_reserved_mem.h | 7 ---
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 1977ee0adcb1..9f165fc1d1a2 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -26,7 +26,7 @@
 static struct reserved_mem reserved_mem[MAX_RESERVED_REGIONS];
 static int reserved_mem_count;
 
-int __init __weak early_init_dt_alloc_reserved_memory_arch(phys_addr_t size,
+static int __init early_init_dt_alloc_reserved_memory_arch(phys_addr_t size,
 	phys_addr_t align, phys_addr_t start, phys_addr_t end, bool nomap,
 	phys_addr_t *res_base)
 {
diff --git a/include/linux/of_reserved_mem.h b/include/linux/of_reserved_mem.h
index 67ab8d271df3..60f541912ccf 100644
--- a/include/linux/of_reserved_mem.h
+++ b/include/linux/of_reserved_mem.h
@@ -35,13 +35,6 @@ int of_reserved_mem_device_init_by_idx(struct device *dev,
 				       struct device_node *np, int idx);
 void of_reserved_mem_device_release(struct device *dev);
 
-int early_init_dt_alloc_reserved_memory_arch(phys_addr_t size,
-					     phys_addr_t align,
-					     phys_addr_t start,
-					     phys_addr_t end,
-					     bool nomap,
-					     phys_addr_t *res_base);
-
 void fdt_init_reserved_mem(void);
 void fdt_reserved_mem_save_node(unsigned long node, const char *uname,
 				phys_addr_t base, phys_addr_t size);
-- 
2.20.1
[PATCH 02/12] device.h: dma_mem is only needed for HAVE_GENERIC_DMA_COHERENT
No need to carry an unused field around.

Signed-off-by: Christoph Hellwig
---
 include/linux/device.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 6cb4640b6160..be544400acdd 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1017,8 +1017,10 @@ struct device {
 	struct list_head	dma_pools;	/* dma pools (if dma'ble) */
 
+#ifdef CONFIG_HAVE_GENERIC_DMA_COHERENT
 	struct dma_coherent_mem	*dma_mem;	/* internal for coherent mem
 						   override */
+#endif
 #ifdef CONFIG_DMA_CMA
 	struct cma *cma_area;		/* contiguous memory area for dma
 					   allocations */
-- 
2.20.1
[PATCH 01/12] mfd/sm501: depend on HAS_DMA
Currently the sm501 mfd driver can be compiled without any dependencies, but through the use of dma_declare_coherent it really depends on having DMA and iomem support. Normally we don't explicitly require DMA support as we have stubs for it if on UML, but in this case the driver selects support for dma_declare_coherent and thus also requires memremap support. Guard this with an explicit dependency.

Signed-off-by: Christoph Hellwig
---
 drivers/mfd/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/mfd/Kconfig b/drivers/mfd/Kconfig
index f461460a2aeb..f15f6489803d 100644
--- a/drivers/mfd/Kconfig
+++ b/drivers/mfd/Kconfig
@@ -1066,6 +1066,7 @@ config MFD_SI476X_CORE
 
 config MFD_SM501
 	tristate "Silicon Motion SM501"
+	depends on HAS_DMA
 	---help---
 	  This is the core driver for the Silicon Motion SM501 multimedia
 	  companion chip. This device is a multifunction device which may
-- 
2.20.1
dma_declare_coherent spring cleaning
Hi all, this series removes various bits of dead code and refactors the remaining functionality around dma_declare_coherent to be a somewhat more coherent code base.
Re: [PATCH v3 1/2] drivers/mtd: Use mtd->name when registering nvmem device
On 2/11/19 4:46 PM, Boris Brezillon wrote:
> On Mon, 11 Feb 2019 16:26:38 +0530
> "Aneesh Kumar K.V" wrote:
>> On 2/10/19 6:25 PM, Boris Brezillon wrote:
>>> Hello Aneesh,
>>>
>>> On Fri, 8 Feb 2019 20:44:18 +0530
>>> "Aneesh Kumar K.V" wrote:
>>>> With this patch, we use the mtd->name instead of concatenating the
>>>> name with '0'
>>>>
>>>> Fixes: c4dfa25ab307 ("mtd: add support for reading MTD devices via
>>>> the nvmem API")
>>>> Signed-off-by: Aneesh Kumar K.V
>>>
>>> You forgot to Cc the MTD ML and maintainers. Can you please send a
>>> new version?
>>
>> linux-mtd list is on CC: Is that not sufficient?
>
> Not in your original email, I added it in my reply.

Sorry about that. I will now resend with linux-mtd on CC: I missed that
earlier.

-aneesh
[PATCH v3 2/2] drivers/mtd: Fix device registration error
This change helps me to get multiple mtd devices registered. Without this I get:

sysfs: cannot create duplicate filename '/bus/nvmem/devices/flash0'
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2-00557-g1ef20ef21f22 #13
Call Trace:
[c000b38e3220] [c0b58fe4] dump_stack+0xe8/0x164 (unreliable)
[c000b38e3270] [c04cf074] sysfs_warn_dup+0x84/0xb0
[c000b38e32f0] [c04cf6c4] sysfs_do_create_link_sd.isra.0+0x114/0x150
[c000b38e3340] [c0726a84] bus_add_device+0x94/0x1e0
[c000b38e33c0] [c07218f0] device_add+0x4d0/0x830
[c000b38e3480] [c09d54a8] nvmem_register.part.2+0x1c8/0xb30
[c000b38e3560] [c0834530] mtd_nvmem_add+0x90/0x120
[c000b38e3650] [c0835bc8] add_mtd_device+0x198/0x4e0
[c000b38e36f0] [c083619c] mtd_device_parse_register+0x11c/0x280
[c000b38e3780] [c0840830] powernv_flash_probe+0x180/0x250
[c000b38e3820] [c072c120] platform_drv_probe+0x60/0xf0
[c000b38e38a0] [c07283c8] really_probe+0x138/0x4d0
[c000b38e3930] [c0728acc] driver_probe_device+0x13c/0x1b0
[c000b38e39b0] [c0728c7c] __driver_attach+0x13c/0x1c0
[c000b38e3a30] [c0725130] bus_for_each_dev+0xa0/0x120
[c000b38e3a90] [c0727b2c] driver_attach+0x2c/0x40
[c000b38e3ab0] [c07270f8] bus_add_driver+0x228/0x360
[c000b38e3b40] [c072a2e0] driver_register+0x90/0x1a0
[c000b38e3bb0] [c072c020] __platform_driver_register+0x50/0x70
[c000b38e3bd0] [c105c984] powernv_flash_driver_init+0x24/0x38
[c000b38e3bf0] [c0010904] do_one_initcall+0x84/0x464
[c000b38e3cd0] [c1004548] kernel_init_freeable+0x530/0x634
[c000b38e3db0] [c0011154] kernel_init+0x1c/0x168
[c000b38e3e20] [c000bed4] ret_from_kernel_thread+0x5c/0x68
mtd mtd1: Failed to register NVMEM device

With the change we now have:

root@(none):/sys/bus/nvmem/devices# ls -al
total 0
drwxr-xr-x 2 root root 0 Feb  6 20:49 .
drwxr-xr-x 4 root root 0 Feb  6 20:49 ..
lrwxrwxrwx 1 root root 0 Feb  6 20:49 flash@0 -> ../../../devices/platform/ibm,opal:flash@0/mtd/mtd0/flash@0
lrwxrwxrwx 1 root root 0 Feb  6 20:49 flash@1 -> ../../../devices/platform/ibm,opal:flash@1/mtd/mtd1/flash@1

Fixes: acfe63ec1c59 ("mtd: Convert to using %pOFn instead of device_node.name")
Signed-off-by: Aneesh Kumar K.V
---
 drivers/mtd/devices/powernv_flash.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/mtd/devices/powernv_flash.c b/drivers/mtd/devices/powernv_flash.c
index 22f753e555ac..83f88b8b5d9f 100644
--- a/drivers/mtd/devices/powernv_flash.c
+++ b/drivers/mtd/devices/powernv_flash.c
@@ -212,7 +212,7 @@ static int powernv_flash_set_driver_info(struct device *dev,
 	 * Going to have to check what details I need to set and how to
 	 * get them
 	 */
-	mtd->name = devm_kasprintf(dev, GFP_KERNEL, "%pOFn", dev->of_node);
+	mtd->name = devm_kasprintf(dev, GFP_KERNEL, "%pOFP", dev->of_node);
 	mtd->type = MTD_NORFLASH;
 	mtd->flags = MTD_WRITEABLE;
 	mtd->size = size;
-- 
2.20.1
[PATCH v3 1/2] drivers/mtd: Use mtd->name when registering nvmem device
With this patch, we use the mtd->name instead of concatenating the name with '0'.

Fixes: c4dfa25ab307 ("mtd: add support for reading MTD devices via the nvmem API")
Signed-off-by: Aneesh Kumar K.V
---
 drivers/mtd/mtdcore.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 999b705769a8..3ef01baef9b6 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -507,6 +507,7 @@ static int mtd_nvmem_add(struct mtd_info *mtd)
 {
 	struct nvmem_config config = {};
 
+	config.id = -1;
 	config.dev = &mtd->dev;
 	config.name = mtd->name;
 	config.owner = THIS_MODULE;
-- 
2.20.1
Re: [PATCH] locking/rwsem: Remove arch specific rwsem files
On 02/11/2019 05:39 AM, Ingo Molnar wrote: > * Ingo Molnar wrote: > >> Sounds good to me - I've merged this patch, will push it out after >> testing. > Based on Peter's feedback I'm delaying this - performance testing on at > least one key ll/sc arch would be nice indeed. > > Thanks, > > Ingo Yes, I will twist the generic code to generate better code. As I said in the commit log, only x86, ia64 and alpha provide assembly code to replace the generic C code. The ll/sc archs that I have access to (ARM64, ppc) are all using the generic C code anyway. I actually had done some performance measurement on both those platforms and didn't see any performance difference. I didn't include them as they were using generic code before. I will rerun the tests after I twisted the generic C code. Thanks, Longman
Re: [PATCH v3 1/7] dump_stack: Support adding to the dump stack arch description
Hi Michael, On Thu, Feb 07, 2019 at 11:46:29PM +1100, Michael Ellerman wrote: > Arch code can set a "dump stack arch description string" which is > displayed with oops output to describe the hardware platform. > > It is useful to initialise this as early as possible, so that an early > oops will have the hardware description. > > However in practice we discover the hardware platform in stages, so it > would be useful to be able to incrementally fill in the hardware > description as we discover it. > > This patch adds that ability, by creating dump_stack_add_arch_desc(). > > If there is no existing string it behaves exactly like > dump_stack_set_arch_desc(). However if there is an existing string it > appends to it, with a leading space. > > This makes it easy to call it multiple times from different parts of the > code and get a reasonable looking result. > > Signed-off-by: Michael Ellerman > --- > include/linux/printk.h | 5 > lib/dump_stack.c | 58 ++ > 2 files changed, 63 insertions(+) > > v3: No change, just widened Cc list. > > v2: Add a smp_wmb() and comment. > > v1 is here for reference > https://lore.kernel.org/lkml/1430824337-15339-1-git-send-email-...@ellerman.id.au/ > > I'll take this series via the powerpc tree if no one minds? > > > diff --git a/include/linux/printk.h b/include/linux/printk.h > index 77740a506ebb..d5fb4f960271 100644 > --- a/include/linux/printk.h > +++ b/include/linux/printk.h > @@ -198,6 +198,7 @@ u32 log_buf_len_get(void); > void log_buf_vmcoreinfo_setup(void); > void __init setup_log_buf(int early); > __printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...); > +__printf(1, 2) void dump_stack_add_arch_desc(const char *fmt, ...); > void dump_stack_print_info(const char *log_lvl); > void show_regs_print_info(const char *log_lvl); > extern asmlinkage void dump_stack(void) __cold; > @@ -256,6 +257,10 @@ static inline __printf(1, 2) void > dump_stack_set_arch_desc(const char *fmt, ...) 
> { > } > > +static inline __printf(1, 2) void dump_stack_add_arch_desc(const char *fmt, > ...) > +{ > +} > + > static inline void dump_stack_print_info(const char *log_lvl) > { > } > diff --git a/lib/dump_stack.c b/lib/dump_stack.c > index 5cff72f18c4a..69b710ff92b5 100644 > --- a/lib/dump_stack.c > +++ b/lib/dump_stack.c > @@ -35,6 +35,64 @@ void __init dump_stack_set_arch_desc(const char *fmt, ...) > va_end(args); > } > > +/** > + * dump_stack_add_arch_desc - add arch-specific info to show with task dumps > + * @fmt: printf-style format string > + * @...: arguments for the format string > + * > + * See dump_stack_set_arch_desc() for why you'd want to use this. > + * > + * This version adds to any existing string already created with either > + * dump_stack_set_arch_desc() or dump_stack_add_arch_desc(). If there is an > + * existing string a space will be prepended to the passed string. > + */ > +void __init dump_stack_add_arch_desc(const char *fmt, ...) > +{ > + va_list args; > + int pos, len; > + char *p; > + > + /* > + * If there's an existing string we snprintf() past the end of it, and > + * then turn the terminating NULL of the existing string into a space > + * to create one string separated by a space. > + * > + * If there's no existing string we just snprintf() to the buffer, like > + * dump_stack_set_arch_desc(), but without calling it because we'd need > + * a varargs version. > + */ > + len = strnlen(dump_stack_arch_desc_str, > sizeof(dump_stack_arch_desc_str)); > + pos = len; > + > + if (len) > + pos++; > + > + if (pos >= sizeof(dump_stack_arch_desc_str)) > + return; /* Ran out of space */ > + > + p = &dump_stack_arch_desc_str[pos]; > + > + va_start(args, fmt); > + vsnprintf(p, sizeof(dump_stack_arch_desc_str) - pos, fmt, args); > + va_end(args); > + > + if (len) { > + /* > + * Order the stores above in vsnprintf() vs the store of the > + * space below which joins the two strings. 
Note this doesn't > + * make the code truly race free because there is no barrier on > + * the read side. ie. Another CPU might load the uninitialised > + * tail of the buffer first and then the space below (rather > + * than the NULL that was there previously), and so print the > + * uninitialised tail. But the whole string lives in BSS so in > + * practice it should just see NULLs. The comment doesn't say _why_ we need to order these stores: IOW, what will or can go wrong without this order? This isn't clear to me. Another good practice when adding smp_*-constructs (as discussed, e.g., at KS'18) is to indicate the matching construct/synch. mechanism. Andrea > + */ > + smp_wmb(); > + > + dump_stack_arch_desc_str[len] = ' '; > +
Re: [PATCH] locking/rwsem: Remove arch specific rwsem files
On Sun, Feb 10, 2019 at 09:00:50PM -0500, Waiman Long wrote:
> +static inline int __down_read_trylock(struct rw_semaphore *sem)
> +{
> +	long tmp;
> +
> +	while ((tmp = atomic_long_read(&sem->count)) >= 0) {
> +		if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
> +				tmp + RWSEM_ACTIVE_READ_BIAS)) {
> +			return 1;
> +		}
> +	}
> +	return 0;
> +}

So the original x86 implementation reads:

static inline bool __down_read_trylock(struct rw_semaphore *sem)
{
	long result, tmp;

	asm volatile("# beginning __down_read_trylock\n\t"
		     "  mov          %[count],%[result]\n\t"
		     "1:\n\t"
		     "  mov          %[result],%[tmp]\n\t"
		     "  add          %[inc],%[tmp]\n\t"
		     "  jle          2f\n\t"
		     LOCK_PREFIX "  cmpxchg  %[tmp],%[count]\n\t"
		     "  jnz          1b\n\t"
		     "2:\n\t"
		     "# ending __down_read_trylock\n\t"
		     : [count] "+m" (sem->count), [result] "=&a" (result),
		       [tmp] "=&r" (tmp)
		     : [inc] "i" (RWSEM_ACTIVE_READ_BIAS)
		     : "memory", "cc");

	return result >= 0;
}

you replace that with:

int __down_read_trylock1(unsigned long *l)
{
	long tmp;

	while ((tmp = READ_ONCE(*l)) >= 0) {
		if (tmp == cmpxchg(l, tmp, tmp + 1))
			return 1;
	}
	return 0;
}

which generates:

<__down_read_trylock1>:
   0:	eb 17                	jmp    19 <__down_read_trylock1+0x19>
   2:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
   8:	48 8d 4a 01          	lea    0x1(%rdx),%rcx
   c:	48 89 d0             	mov    %rdx,%rax
   f:	f0 48 0f b1 0f       	lock cmpxchg %rcx,(%rdi)
  14:	48 39 c2             	cmp    %rax,%rdx
  17:	74 0f                	je     28 <__down_read_trylock1+0x28>
  19:	48 8b 17             	mov    (%rdi),%rdx
  1c:	48 85 d2             	test   %rdx,%rdx
  1f:	79 e7                	jns    8 <__down_read_trylock1+0x8>
  21:	31 c0                	xor    %eax,%eax
  23:	c3                   	retq
  24:	0f 1f 40 00          	nopl   0x0(%rax)
  28:	b8 01 00 00 00       	mov    $0x1,%eax
  2d:	c3                   	retq

Which is clearly worse.

Now we can write that as:

int __down_read_trylock2(unsigned long *l)
{
	long tmp = READ_ONCE(*l);

	while (tmp >= 0) {
		if (try_cmpxchg(l, &tmp, tmp + 1))
			return 1;
	}
	return 0;
}

which generates:

0030 <__down_read_trylock2>:
  30:	48 8b 07             	mov    (%rdi),%rax
  33:	48 85 c0             	test   %rax,%rax
  36:	78 18                	js     50 <__down_read_trylock2+0x20>
  38:	48 8d 50 01          	lea    0x1(%rax),%rdx
  3c:	f0 48 0f b1 17       	lock cmpxchg %rdx,(%rdi)
  41:	75 f0                	jne    33 <__down_read_trylock2+0x3>
  43:	b8 01 00 00 00       	mov    $0x1,%eax
  48:	c3                   	retq
  49:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  50:	31 c0                	xor    %eax,%eax
  52:	c3                   	retq

Which is a lot better; but not quite there yet.

I've tried quite a bit, but I can't seem to get GCC to generate the:

	add $1,%rdx
	jle

required; stuff like:

	new = old + 1;
	if (new <= 0)

generates:

	lea 0x1(%rax),%rdx
	test %rdx, %rdx
	jle

Ah well, have fun :-)

typedef unsigned char u8;
typedef unsigned short u16;
typedef unsigned int u32;
typedef unsigned long long u64;

typedef signed char s8;
typedef signed short s16;
typedef signed int s32;
typedef signed long long s64;

typedef _Bool bool;

# define CC_SET(c) "\n\t/* output condition code " #c "*/\n"
# define CC_OUT(c) "=@cc" #c

#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

extern void __cmpxchg_wrong_size(void);

#define __raw_cmpxchg(ptr, old, new, size, lock)		\
({								\
	__typeof__(*(ptr)) __ret;				\
	__typeof__(*(ptr)) __old = (old);			\
	__typeof__(*(ptr)) __new = (new);			\
	switch (size) {						\
	case 1:							\
	{							\
		volatile u8 *__ptr = (volatile u8 *)(ptr);	\
		asm volatile(lock "cmpxchgb %2,%1"		\
			     : "=a" (__ret), "+m" (*__ptr)	\
			     : "q" (__new), "0" (__old)		\
			     : "memory");			\
		break;						\
	}							\
	case 2:							\
	{							\
		volatile u16 *__ptr = (volatile u16 *)(ptr);	\
		asm volatile(lock "cmpxchgw %2,%1"		\
			     : "=a" (__ret), "+m" (*__ptr)	\
			     : "r" (__new), "0" (__old)		\
			     : "memory");			\
		break;						\
	}							\
	case 4:							\
	{							\
		volatile u32 *__ptr = (volatile u32 *)(ptr);	\
		a
[PATCH] powerpc/configs: Enable CONFIG_USB_XHCI_HCD by default
Recent versions of QEMU provide an XHCI device by default these days instead of an old-fashioned OHCI device:

https://git.qemu.org/?p=qemu.git;a=commitdiff;h=57040d451315320b7d27

So to get the keyboard working in the graphical console there again, we should now include XHCI support in the kernel by default, too.

Signed-off-by: Thomas Huth
---
 arch/powerpc/configs/pseries_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/pseries_defconfig b/arch/powerpc/configs/pseries_defconfig
index ea79c51..62e12f6 100644
--- a/arch/powerpc/configs/pseries_defconfig
+++ b/arch/powerpc/configs/pseries_defconfig
@@ -217,6 +217,7 @@ CONFIG_USB_MON=m
 CONFIG_USB_EHCI_HCD=y
 # CONFIG_USB_EHCI_HCD_PPC_OF is not set
 CONFIG_USB_OHCI_HCD=y
+CONFIG_USB_XHCI_HCD=y
 CONFIG_USB_STORAGE=m
 CONFIG_NEW_LEDS=y
 CONFIG_LEDS_CLASS=m
-- 
1.8.3.1
Re: [PATCH v3 1/2] drivers/mtd: Use mtd->name when registering nvmem device
On Mon, 11 Feb 2019 16:26:38 +0530 "Aneesh Kumar K.V" wrote: > On 2/10/19 6:25 PM, Boris Brezillon wrote: > > Hello Aneesh, > > > > On Fri, 8 Feb 2019 20:44:18 +0530 > > "Aneesh Kumar K.V" wrote: > > > >> With this patch, we use the mtd->name instead of concatenating the name > >> with '0' > >> > >> Fixes: c4dfa25ab307 ("mtd: add support for reading MTD devices via the > >> nvmem API") > >> Signed-off-by: Aneesh Kumar K.V > > > > You forgot to Cc the MTD ML and maintainers. Can you please send a new > > version? > > > > linux-mtd list is on CC: Is that not sufficient? Not in your original email, I added it in my reply.
Re: [PATCH] locking/rwsem: Remove arch specific rwsem files
On Mon, Feb 11, 2019 at 10:40:44AM +0100, Peter Zijlstra wrote:
> On Mon, Feb 11, 2019 at 10:36:01AM +0100, Peter Zijlstra wrote:
> > On Sun, Feb 10, 2019 at 09:00:50PM -0500, Waiman Long wrote:
> > > +static inline int __down_read_trylock(struct rw_semaphore *sem)
> > > +{
> > > +	long tmp;
> > > +
> > > +	while ((tmp = atomic_long_read(&sem->count)) >= 0) {
> > > +		if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
> > > +				tmp + RWSEM_ACTIVE_READ_BIAS)) {
> > > +			return 1;
> >
> > That really wants to be:
> >
> > 	if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
> > 					    tmp + RWSEM_ACTIVE_READ_BIAS))
> >
> > > +		}
> > > +	}
> > > +	return 0;
> > > +}
>
> Also, this is the one case where LL/SC can actually do 'better'. Do you
> have benchmarks for say PowerPC or ARM64 ?

Ah, I see they already used asm-generic/rwsem.h which has similar code
to the above.
Re: [PATCH v3 1/2] drivers/mtd: Use mtd->name when registering nvmem device
On 2/10/19 6:25 PM, Boris Brezillon wrote:
> Hello Aneesh,
>
> On Fri, 8 Feb 2019 20:44:18 +0530
> "Aneesh Kumar K.V" wrote:
>> With this patch, we use the mtd->name instead of concatenating the
>> name with '0'
>>
>> Fixes: c4dfa25ab307 ("mtd: add support for reading MTD devices via
>> the nvmem API")
>> Signed-off-by: Aneesh Kumar K.V
>
> You forgot to Cc the MTD ML and maintainers. Can you please send a new
> version?

linux-mtd list is on CC: Is that not sufficient?

-aneesh
Re: [PATCH] locking/rwsem: Remove arch specific rwsem files
* Will Deacon wrote: > On Mon, Feb 11, 2019 at 11:39:27AM +0100, Ingo Molnar wrote: > > > > * Ingo Molnar wrote: > > > > > Sounds good to me - I've merged this patch, will push it out after > > > testing. > > > > Based on Peter's feedback I'm delaying this - performance testing on at > > least one key ll/sc arch would be nice indeed. > > Once Waiman has posted a new version, I can take it for a spin on some > arm64 boxen if he shares his workload. Cool, thanks! Ingo
Re: [PATCH] locking/rwsem: Remove arch specific rwsem files
On Mon, Feb 11, 2019 at 11:39:27AM +0100, Ingo Molnar wrote: > > * Ingo Molnar wrote: > > > Sounds good to me - I've merged this patch, will push it out after > > testing. > > Based on Peter's feedback I'm delaying this - performance testing on at > least one key ll/sc arch would be nice indeed. Once Waiman has posted a new version, I can take it for a spin on some arm64 boxen if he shares his workload. Will
Re: [PATCH] locking/rwsem: Remove arch specific rwsem files
* Ingo Molnar wrote: > Sounds good to me - I've merged this patch, will push it out after > testing. Based on Peter's feedback I'm delaying this - performance testing on at least one key ll/sc arch would be nice indeed. Thanks, Ingo
Re: [RFC PATCH 3/5] powerpc: sstep: Add instruction emulation selftests
On 11/02/19 6:17 AM, Daniel Axtens wrote: > Hi Sandipan, > > I'm not really confident to review the asm, but I did have a couple of > questions about the C: > >> +#define MAX_INSNS 32 > This doesn't seem to be used... > True. Thanks for pointing this out. >> +int execute_instr(struct pt_regs *regs, unsigned int instr) >> +{ >> +extern unsigned int exec_instr_execute[]; >> +extern int exec_instr(struct pt_regs *regs); > > These externs sit inside the function scope. This feels less than ideal > to me - is there a reason not to have these at global scope? > Currently, execute_instr() is the only consumer. So, I thought I'd keep them local for now. >> + >> +if (!regs || !instr) >> +return -EINVAL; >> + >> +/* Patch the NOP with the actual instruction */ >> +patch_instruction(&exec_instr_execute[0], instr); >> +if (exec_instr(regs)) { >> +pr_info("execution failed, opcode = 0x%08x\n", instr); >> +return -EFAULT; >> +} >> + >> +return 0; >> +} > >> +late_initcall(run_sstep_tests); > A design question: is there a reason to run these as an initcall rather > than as a module that could either be built in or loaded separately? I'm > not saying you have to do this, but I was wondering if you had > considered it? > I did. As of now, there are some existing tests in test_emulate_step.c which use the same approach. So, I thought I'd stick with that approach to start off. This is anyway controlled by a Kconfig option. > Lastly, snowpatch reports some checkpatch issues for this and your > remaining patches: https://patchwork.ozlabs.org/patch/1035683/ (You are > allowed to violate checkpatch rules with justification, FWIW) > Will look into them. > Regards, > Daniel >> -- >> 2.19.2 >
Re: [RFC PATCH 5/5] powerpc: sstep: Add selftests for addc[.] instruction
On 11/02/19 6:30 AM, Daniel Axtens wrote: > Hi Sandipan, > >> +{ >> +.descr = "RA = LONG_MIN | INT_MIN, RB = >> LONG_MIN | INT_MIN", >> +.instr = PPC_INST_ADDC | ___PPC_RT(20) | >> ___PPC_RA(21) | ___PPC_RB(22), >> +.regs = >> +{ >> +.gpr[21] = LONG_MIN | (uint) INT_MIN, >> +.gpr[22] = LONG_MIN | (uint) INT_MIN, >> +} >> +} > I don't know what this bit pattern is supposed to represent - is it > supposed to be the smallest 32bit integer and the smallest 64bit > integer 80008000 - so you test 32 and 64 bit overflow at the > same time? > Yes, exactly. > > For the series: > Tested-by: Daniel Axtens # Power8 LE > > I notice the output is quite verbose, and doesn't include a line when it > starts: > > [0.826181] Running code patching self-tests ... > [0.826607] Running feature fixup self-tests ... > [0.826615] nop : R0 = LONG_MAX > [PASS] > [0.826617] add : RA = LONG_MIN, RB = LONG_MIN > [PASS] > > Maybe it would be good to include a line saying "Running single-step > emulation self-tests" and perhaps by default on printing when there is a > failure. > That makes sense. Will include it in the next revision. > Finally, I think you might be able to squash patches 1 and 2 and patches > 4 and 5, but that's just my personal preference. > > Regards, > Daniel >
Re: [PATCH] locking/rwsem: Remove arch specific rwsem files
On Mon, Feb 11, 2019 at 10:36:01AM +0100, Peter Zijlstra wrote:
> On Sun, Feb 10, 2019 at 09:00:50PM -0500, Waiman Long wrote:
> > +static inline int __down_read_trylock(struct rw_semaphore *sem)
> > +{
> > +	long tmp;
> > +
> > +	while ((tmp = atomic_long_read(&sem->count)) >= 0) {
> > +		if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
> > +				tmp + RWSEM_ACTIVE_READ_BIAS)) {
> > +			return 1;
>
> That really wants to be:
>
> 	if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
> 					    tmp + RWSEM_ACTIVE_READ_BIAS))
>
> > +		}
> > +	}
> > +	return 0;
> > +}

Also, this is the one case where LL/SC can actually do 'better'. Do you
have benchmarks for say PowerPC or ARM64 ?
Re: [PATCH] locking/rwsem: Remove arch specific rwsem files
On Sun, Feb 10, 2019 at 09:00:50PM -0500, Waiman Long wrote:
> diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
> index bad2bca..067e265 100644
> --- a/kernel/locking/rwsem.h
> +++ b/kernel/locking/rwsem.h
> @@ -32,6 +32,26 @@
>  # define DEBUG_RWSEMS_WARN_ON(c)
>  #endif
>  
> +/*
> + * R/W semaphores originally for PPC using the stuff in lib/rwsem.c.
> + * Adapted largely from include/asm-i386/rwsem.h
> + * by Paul Mackerras .
> + */
> +
> +/*
> + * the semaphore definition
> + */
> +#ifdef CONFIG_64BIT
> +# define RWSEM_ACTIVE_MASK		0xffffffffL
> +#else
> +# define RWSEM_ACTIVE_MASK		0x0000ffffL
> +#endif
> +
> +#define RWSEM_ACTIVE_BIAS		0x00000001L
> +#define RWSEM_WAITING_BIAS		(-RWSEM_ACTIVE_MASK-1)
> +#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
> +#define RWSEM_ACTIVE_WRITE_BIAS	(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
> +
>  #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
>  /*
>   * All writes to owner are protected by WRITE_ONCE() to make sure that
> @@ -132,3 +152,113 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
>  {
>  }
>  #endif
> +
> +#ifdef CONFIG_RWSEM_XCHGADD_ALGORITHM
> +/*
> + * lock for reading
> + */
> +static inline void __down_read(struct rw_semaphore *sem)
> +{
> +	if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0))
> +		rwsem_down_read_failed(sem);
> +}
> +
> +static inline int __down_read_killable(struct rw_semaphore *sem)
> +{
> +	if (unlikely(atomic_long_inc_return_acquire(&sem->count) <= 0)) {
> +		if (IS_ERR(rwsem_down_read_failed_killable(sem)))
> +			return -EINTR;
> +	}
> +
> +	return 0;
> +}
> +
> +static inline int __down_read_trylock(struct rw_semaphore *sem)
> +{
> +	long tmp;
> +
> +	while ((tmp = atomic_long_read(&sem->count)) >= 0) {
> +		if (tmp == atomic_long_cmpxchg_acquire(&sem->count, tmp,
> +				   tmp + RWSEM_ACTIVE_READ_BIAS)) {
> +			return 1;

That really wants to be:

		if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
						    tmp + RWSEM_ACTIVE_READ_BIAS))

> +		}
> +	}
> +	return 0;
> +}
> +
> +/*
> + * lock for writing
> + */
> +static inline void __down_write(struct rw_semaphore *sem)
> +{
> +	long tmp;
> +
> +	tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
> +					     &sem->count);
> +	if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
> +		rwsem_down_write_failed(sem);
> +}
> +
> +static inline int __down_write_killable(struct rw_semaphore *sem)
> +{
> +	long tmp;
> +
> +	tmp = atomic_long_add_return_acquire(RWSEM_ACTIVE_WRITE_BIAS,
> +					     &sem->count);
> +	if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
> +		if (IS_ERR(rwsem_down_write_failed_killable(sem)))
> +			return -EINTR;
> +	return 0;
> +}
> +
> +static inline int __down_write_trylock(struct rw_semaphore *sem)
> +{
> +	long tmp;

	tmp = RWSEM_UNLOCKED_VALUE;

> +
> +	tmp = atomic_long_cmpxchg_acquire(&sem->count, RWSEM_UNLOCKED_VALUE,
> +					  RWSEM_ACTIVE_WRITE_BIAS);
> +	return tmp == RWSEM_UNLOCKED_VALUE;

	return atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
					       RWSEM_ACTIVE_WRITE_BIAS);

> +}
> +
> +/*
> + * unlock after reading
> + */
> +static inline void __up_read(struct rw_semaphore *sem)
> +{
> +	long tmp;
> +
> +	tmp = atomic_long_dec_return_release(&sem->count);
> +	if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
> +		rwsem_wake(sem);
> +}
> +
> +/*
> + * unlock after writing
> + */
> +static inline void __up_write(struct rw_semaphore *sem)
> +{
> +	if (unlikely(atomic_long_sub_return_release(RWSEM_ACTIVE_WRITE_BIAS,
> +						    &sem->count) < 0))
> +		rwsem_wake(sem);
> +}
> +
> +/*
> + * downgrade write lock to read lock
> + */
> +static inline void __downgrade_write(struct rw_semaphore *sem)
> +{
> +	long tmp;
> +
> +	/*
> +	 * When downgrading from exclusive to shared ownership,
> +	 * anything inside the write-locked region cannot leak
> +	 * into the read side. In contrast, anything in the
> +	 * read-locked region is ok to be re-ordered into the
> +	 * write side. As such, rely on RELEASE semantics.
> +	 */
> +	tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS, &sem->count);
> +	if (tmp < 0)
> +		rwsem_downgrade_wake(sem);
> +}
> +
> +#endif /* CONFIG_RWSEM_XCHGADD_ALGORITHM */
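The `try_cmpxchg` suggestion above can be sketched outside the kernel with C11 atomics: `atomic_compare_exchange_*` has the same contract as the kernel's `atomic_long_try_cmpxchg_acquire` in that, on failure, it writes the observed value back into the expected-value variable, so the retry loop needs no separate re-read. A minimal model of the read-trylock fastpath (the `model_` naming and the plain C11 types are illustrative, not kernel code):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define RWSEM_ACTIVE_READ_BIAS	1L	/* reader bias, as in the patch */

/*
 * Model of __down_read_trylock() in the try_cmpxchg style: on failure
 * the compare-exchange updates 'tmp' with the current count, so the
 * loop re-tests the freshly observed value with no extra load.
 */
static bool model_down_read_trylock(atomic_long *count)
{
	long tmp = atomic_load_explicit(count, memory_order_relaxed);

	while (tmp >= 0) {
		if (atomic_compare_exchange_weak_explicit(count, &tmp,
				tmp + RWSEM_ACTIVE_READ_BIAS,
				memory_order_acquire, memory_order_relaxed))
			return true;	/* got a read lock */
		/* failed: tmp now holds the current count, loop re-tests */
	}
	return false;	/* count negative: writer active or waiters queued */
}
```

Besides saving one atomic read per contended retry, on x86 the try_cmpxchg form lets the compiler test the flags left by `cmpxchg` instead of comparing the returned value, which is why it is the preferred idiom.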
Re: [PATCH v1 03/16] powerpc/32: move LOAD_MSR_KERNEL() into head_32.h and use it
On Mon, 2019-02-11 at 07:26 +0100, Christophe Leroy wrote:
> 
> Le 11/02/2019 à 01:21, Benjamin Herrenschmidt a écrit :
> > On Fri, 2019-02-08 at 12:52 +, Christophe Leroy wrote:
> > >  /*
> > > + * MSR_KERNEL is > 0x10000 on 4xx/Book-E since it includes MSR_CE.
> > > + */
> > > +.macro __LOAD_MSR_KERNEL r, x
> > > +.if \x >= 0x8000
> > > +	lis \r, (\x)@h
> > > +	ori \r, \r, (\x)@l
> > > +.else
> > > +	li \r, (\x)
> > > +.endif
> > > +.endm
> > > +#define LOAD_MSR_KERNEL(r, x) __LOAD_MSR_KERNEL r, x
> > > +
> > 
> > You changed the limit from >= 0x10000 to >= 0x8000 without a
> > corresponding explanation as to why...
> 
> Yes, the existing LOAD_MSR_KERNEL() was buggy because 'li' takes a
> signed 16-bit immediate, ie between -0x8000 and 0x7fff.

Ah yes, I was only looking at the "large" case which is fine...

> By chance it was working because until now nobody was trying to set
> MSR_KERNEL | MSR_EE.
> 
> Christophe
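The range constraint behind the new threshold can be sanity-checked in plain C: `li` sign-extends its 16-bit immediate, so a single instruction can only materialize values in [-0x8000, 0x7fff], and the `.if \x >= 0x8000` branch to `lis`/`ori` is what keeps a value like MSR_KERNEL | MSR_EE (MSR_EE being bit 0x8000) from being loaded as a negative number. A small sketch (the helper names and the sign-extension model are illustrative, not the kernel macro):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * What the CPU does with li's immediate: keep the low 16 bits and
 * sign-extend them to the full register width.
 */
static long li_result(long x)
{
	return (int16_t)(x & 0xffff);
}

/*
 * A single li reproduces x only inside the signed 16-bit range;
 * anything else needs the lis/ori pair, hence the >= 0x8000 test
 * for the non-negative MSR values LOAD_MSR_KERNEL handles.
 */
static bool fits_li(long x)
{
	return li_result(x) == x;
}
```

With the old `>= 0x10000` limit, a value such as 0x8000 fell through to the `li` branch and was silently loaded as -0x8000, which is exactly the latent bug Christophe describes.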