Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On Thu 19-12-19 12:30:31, John Hubbard wrote:
> On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > Hi,
> > >
> > > This implements an API naming change (put_user_page*() -->
> > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > extends that tracking to a few select subsystems. More subsystems will
> > > be added in follow up work.
> >
> > Hi John,
> >
> > The patchset generates kernel panics in our IB testing. In our tests, we
> > allocated a single memory block and registered multiple MRs using the
> > single block.
> >
> > The possible bad flow is:
> >   ib_umem_get() ->
> >    pin_user_pages_fast(FOLL_WRITE) ->
> >     internal_get_user_pages_fast(FOLL_WRITE) ->
> >      gup_pgd_range() ->
> >       gup_huge_pd() ->
> >        gup_hugepte() ->
> >         try_grab_compound_head() ->
>
> Hi Leon,
>
> Thanks very much for the detailed report! So we're overflowing...
>
> At first look, this seems likely to be hitting a weak point in the
> GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> 99-121). Basically it's pretty easy to overflow the page->_refcount
> with huge pages if the pages have a *lot* of subpages.
>
> We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> Do you have any idea how many pins (repeated pins on the same page, which
> it sounds like you have) might be involved in your test case,
> and the huge page and system page sizes? That would allow calculating
> whether we're likely overflowing for that reason.
>
> So, ideas and next steps:
>
> 1. Assuming that you *are* hitting this, I think I may have to fall back to
> implementing the "deferred" part of this design, as part of this series,
> after all. That means:
>
> For the pin/unpin calls at least, stop treating all pages as if they are
> a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
> That's not how it works now, and the need to hand back a huge array of
> subpages is part of the problem. This affects the callers too, so it's not
> a super quick change to make. (I was really hoping not to have to do this
> yet.)

Does that mean that you would need to make all GUP users huge page aware?
Otherwise I don't see how what you suggest would work... And I don't think
making all GUP users huge page aware is realistic (effort-wise) or even
wanted (maintenance overhead in all those places).

I believe there might be also a different solution for this: For
transparent huge pages, we could find space in the 'struct page' of the
second page in the huge page for a proper pin counter and just account pins
there, so we'd have the full width of 32 bits for it.

Honza
--
Jan Kara
SUSE Labs, CR
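For reference, the "about 7 pins on 1GB huge pages" figure quoted above can be sanity-checked with a quick calculation. This is only a sketch of the arithmetic: it assumes, as described in the thread, that pinning adds GUP_PIN_COUNTING_BIAS to page->_refcount once per 4KB subpage, and that _refcount is a signed 32-bit counter.

```python
# Rough model of the FOLL_PIN overflow arithmetic discussed above.
PAGE_SIZE = 4096
REFCOUNT_MAX = 2**31 - 1          # page->_refcount is a signed 32-bit atomic
GUP_PIN_COUNTING_BIAS = 1 << 10   # the bias used by this patch series

def max_pins(huge_page_size):
    """Approximate number of FOLL_PIN pins before _refcount overflows."""
    subpages = huge_page_size // PAGE_SIZE
    return REFCOUNT_MAX // (subpages * GUP_PIN_COUNTING_BIAS)

print(max_pins(1 << 30))  # 1GB huge page -> 7, matching the figure above
print(max_pins(2 << 20))  # 2MB huge page -> 4095
```

This also shows why the 2MB-huge-page case Jason mentions later in the thread has far more headroom than the 1GB case, yet can still overflow under enough repeated pins.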
Re: [PATCH] powerpc/book3s64: Fix error handling in mm_iommu_do_alloc()
On Fri 20-12-19 16:06:05, Alexey Kardashevskiy wrote:
> On 11/12/2019 21:42, Jan Kara wrote:
> > The last jump to free_exit in mm_iommu_do_alloc() happens after page
> > pointers in struct mm_iommu_table_group_mem_t were already converted to
> > physical addresses. Thus calling put_page() on these physical addresses
> > will likely crash. Convert physical addresses back to page pointers
> > during the error cleanup.
> >
> > Signed-off-by: Jan Kara
> > ---
> >  arch/powerpc/mm/book3s64/iommu_api.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > Beware, this is completely untested, spotted just by code audit.
> >
> > diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
> > index 56cc84520577..06c403381c9c 100644
> > --- a/arch/powerpc/mm/book3s64/iommu_api.c
> > +++ b/arch/powerpc/mm/book3s64/iommu_api.c
> > @@ -154,7 +154,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
> >  				       (mem2->entries << PAGE_SHIFT)))) {
> >  			ret = -EINVAL;
> >  			mutex_unlock(&mem_list_mutex);
> > -			goto free_exit;
> > +			goto convert_exit;
> >  		}
> >  	}
> >
> > @@ -166,6 +166,9 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
> >
> >  	return 0;
> >
> > +convert_exit:
> > +	for (i = 0; i < pinned; i++)
> > +		mem->hpages[i] = pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
>
> Good find. Although doing it where you added "goto convert_exit" seems
> cleaner imho.

Thanks, I don't really care :). V2 attached.

Honza
--
Jan Kara
SUSE Labs, CR

From 947c7a893c282b829a8623da73276a2fe56fdcd3 Mon Sep 17 00:00:00 2001
From: Jan Kara
Date: Wed, 11 Dec 2019 11:36:54 +0100
Subject: [PATCH v2] powerpc/book3s64: Fix error handling in mm_iommu_do_alloc()

The last jump to free_exit in mm_iommu_do_alloc() happens after page
pointers in struct mm_iommu_table_group_mem_t were already converted to
physical addresses. Thus calling put_page() on these physical addresses
will likely crash.

Convert physical addresses back to page pointers during the error
cleanup.

Signed-off-by: Jan Kara
---
 arch/powerpc/mm/book3s64/iommu_api.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..1aa06584d783 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -154,6 +154,11 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 				       (mem2->entries << PAGE_SHIFT)))) {
 			ret = -EINVAL;
 			mutex_unlock(&mem_list_mutex);
+			/* Convert back to page pointers */
+			for (i = 0; i < pinned; i++) {
+				mem->hpages[i] =
+					pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
+			}
 			goto free_exit;
 		}
 	}
--
2.16.4
Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()
On Thu, Dec 19, 2019 at 2:08 PM Cédric Le Goater wrote:
> There is a patch for msgsndp in my tree you could try :
>
> https://github.com/legoater/qemu/tree/powernv-5.0
>
> Currently being reviewed. I have to address some remarks from David before
> it can be merged.

Thanks. I've cherry-picked
https://github.com/legoater/qemu/commit/910c9ea5ecc and can confirm
that I no longer receive the crashes. QEMU 5.1 or 5.0.1 release, I guess?

Jason
Re: [PATCH v17 06/23] powerpc: mm: Add p?d_leaf() definitions
On 19/12/2019 11:49, Michael Ellerman wrote: > Steven Price writes: >> walk_page_range() is going to be allowed to walk page tables other than >> those of user space. For this it needs to know when it has reached a >> 'leaf' entry in the page tables. This information is provided by the >> p?d_leaf() functions/macros. >> >> For powerpc p?d_is_leaf() functions already exist. Export them using the >> new p?d_leaf() name. >> >> CC: Benjamin Herrenschmidt >> CC: Paul Mackerras >> CC: Michael Ellerman >> CC: linuxppc-dev@lists.ozlabs.org >> CC: kvm-...@vger.kernel.org >> Signed-off-by: Steven Price >> --- >> arch/powerpc/include/asm/book3s/64/pgtable.h | 3 +++ >> 1 file changed, 3 insertions(+) > > We have fallback versions of our pmd_is_leaf() etc. in > arch/powerpc/include/asm/pgtable.h, eg: > > #ifndef pmd_is_leaf > #define pmd_is_leaf pmd_is_leaf > static inline bool pmd_is_leaf(pmd_t pmd) > { > return false; > } > #endif > > Because we support several different MMUs and most of them don't need to > do anything. > > So we could put the compatibility #defines to your names along with the > fallback versions in asm/pgtable.h, rather than in > asm/book3s/64/pgtable.h > > But I see you also have fallbacks for your versions, so it probably > doesn't matter either way. > > So I'm OK with this version, unless you think there's a compelling > reason to do the compatibility #defines in our asm/pgtable.h I was thinking that (assuming this series actually gets merged this time), the p?d_is_leaf() versions could be removed and replaced by p?d_leaf() in the future. Since the fallbacks are in the asm-generic code it makes sense for the pmd_leaf() definitions to be next to the non-fallback versions. > Acked-by: Michael Ellerman Thanks! 
Steve > cheers > > >> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h >> b/arch/powerpc/include/asm/book3s/64/pgtable.h >> index b01624e5c467..201a69e6a355 100644 >> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h >> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h >> @@ -1355,18 +1355,21 @@ static inline bool is_pte_rw_upgrade(unsigned long >> old_val, unsigned long new_va >> * Like pmd_huge() and pmd_large(), but works regardless of config options >> */ >> #define pmd_is_leaf pmd_is_leaf >> +#define pmd_leaf pmd_is_leaf >> static inline bool pmd_is_leaf(pmd_t pmd) >> { >> return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE)); >> } >> >> #define pud_is_leaf pud_is_leaf >> +#define pud_leaf pud_is_leaf >> static inline bool pud_is_leaf(pud_t pud) >> { >> return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE)); >> } >> >> #define pgd_is_leaf pgd_is_leaf >> +#define pgd_leaf pgd_is_leaf >> static inline bool pgd_is_leaf(pgd_t pgd) >> { >> return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE)); >> -- >> 2.20.1 > > ___ > linux-arm-kernel mailing list > linux-arm-ker...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel >
[Bug 205889] CONFIG_PPC_85xx with CONFIG_CORENET_GENERIC outputs uImage instead of zImage
https://bugzilla.kernel.org/show_bug.cgi?id=205889

--- Comment #2 from Bradley Gamble (bradley.gam...@ncipher.com) ---
I understand the desire not to break scripts etc. that may be in use, but
zImage and uImage files are different and may not both be bootable on the
same device.

I am currently able to work around the issue by stripping the first 64
bytes off the uImage to turn it back into a zImage.

Is it possible to overcome the implicit link between the platform and
enabling CONFIG_DEFAULT_UIMAGE? I have tried marking it as "not set" in my
defconfig, but it is still enabled.

--
You are receiving this mail because:
You are watching the assignee of the bug.
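As a concrete illustration of the workaround described in the comment above: a legacy U-Boot uImage is the wrapped payload with a fixed 64-byte image header prepended, so dropping those bytes recovers the zImage. This is only a sketch; in practice the same thing can be done with `dd if=uImage of=zImage bs=64 skip=1`.

```python
# Model of "strip the first 64 bytes of the uImage to get the zImage back".
UIMAGE_HEADER_LEN = 64  # size of the legacy U-Boot image header

def uimage_to_zimage(uimage: bytes) -> bytes:
    """Drop the U-Boot header, leaving the wrapped payload (the zImage)."""
    return uimage[UIMAGE_HEADER_LEN:]

# Demo with a fake image: the uImage magic, header padding, then a payload.
fake_uimage = b"\x27\x05\x19\x56" + b"\x00" * 60 + b"ZIMAGE-PAYLOAD"
print(uimage_to_zimage(fake_uimage))  # b'ZIMAGE-PAYLOAD'
```

Note this recovers only the header-free payload; it does not verify the header's CRC fields, which a real unwrapping tool would want to check.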
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On Thu, Dec 19, 2019 at 01:13:54PM -0800, John Hubbard wrote:
> On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> > > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > > Hi,
> > > > >
> > > > > This implements an API naming change (put_user_page*() -->
> > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > > > extends that tracking to a few select subsystems. More subsystems will
> > > > > be added in follow up work.
> > > >
> > > > Hi John,
> > > >
> > > > The patchset generates kernel panics in our IB testing. In our tests, we
> > > > allocated single memory block and registered multiple MRs using the
> > > > single block.
> > > >
> > > > The possible bad flow is:
> > > >   ib_umem_get() ->
> > > >    pin_user_pages_fast(FOLL_WRITE) ->
> > > >     internal_get_user_pages_fast(FOLL_WRITE) ->
> > > >      gup_pgd_range() ->
> > > >       gup_huge_pd() ->
> > > >        gup_hugepte() ->
> > > >         try_grab_compound_head() ->
> > >
> > > Hi Leon,
> > >
> > > Thanks very much for the detailed report! So we're overflowing...
> > >
> > > At first look, this seems likely to be hitting a weak point in the
> > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> > > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> > > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > > with huge pages if the pages have a *lot* of subpages.
> > >
> > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> >
> > Considering that establishing these pins is entirely under user
> > control, we can't have a limit here.
>
> There's already a limit, it's just a much larger one. :) What does "no limit"
> really mean, numerically, to you in this case?

I guess I mean 'hidden limit' - hitting the limit and failing would be
manageable.

I think 7 is probably too low though, but we are not using 1GB huge pages,
only 2M..

> > If the number of allowed pins are exhausted then the
> > pin_user_pages_fast() must fail back to the user.
>
> I'll poke around the IB call stack and see how much of that return
> path is in place, if any. Because it's the same situation for
> get_user_pages_fast(). This code just added a warning on overflow
> so we could spot it early.

All GUP callers must be prepared for failure, IB should be fine...

Jason
Re: [PATCH] serial: ucc_uart: remove redundant assignment to pointer bdp
On 12/19/19 6:10 PM, Colin King wrote:
> From: Colin Ian King
>
> The variable bdp is being initialized with a value that is never read and
> it is being updated later with a new value. The initialization is
> redundant and can be removed.
>
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King

Acked-by: Timur Tabi

Looks like this bug has been there since day 1.
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
Hi Thomas,

In do_hres(), I see:

	cycles = __arch_get_hw_counter(vd->clock_mode);
	ns = vdso_ts->nsec;
	last = vd->cycle_last;
	if (unlikely((s64)cycles < 0))
		return -1;

__arch_get_hw_counter() returns a u64 value. On PPC, this is read from the
timebase, which is a 64-bit counter.

Why return -1 if (s64)cycles < 0? Does it mean we have to mask out the most
significant bit when reading the HW counter?

Christophe
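(Not an authoritative answer to the question above, but a sketch of what that check implies. The assumption here, mine rather than anything stated in this mail, is that the generic vDSO reserves "negative" s64 values as a sentinel meaning "counter unusable, fall back to the syscall", so only 63 of a 64-bit timebase's bits are usable by do_hres().)

```python
# Model of the (s64)cycles < 0 test quoted above.
U64_MAX = 2**64 - 1  # e.g. a sentinel an arch might return on failure

def counter_is_valid(cycles):
    """True unless the top bit of the 64-bit value is set."""
    return cycles < (1 << 63)   # equivalent to (s64)cycles >= 0 in C

print(counter_is_valid((1 << 63) - 1))  # True: largest usable counter value
print(counter_is_valid(U64_MAX))        # False: treated as "fall back"
```

Under that reading, a timebase would only "go negative" after 2^63 ticks, so the check costs no real range in practice while giving the arch code an in-band way to signal failure.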
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On Thu, Dec 19, 2019 at 05:07:43PM -0400, Jason Gunthorpe wrote: > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote: > > On 12/19/19 5:26 AM, Leon Romanovsky wrote: > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote: > > > > Hi, > > > > > > > > This implements an API naming change (put_user_page*() --> > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It > > > > extends that tracking to a few select subsystems. More subsystems will > > > > be added in follow up work. > > > > > > Hi John, > > > > > > The patchset generates kernel panics in our IB testing. In our tests, we > > > allocated single memory block and registered multiple MRs using the single > > > block. > > > > > > The possible bad flow is: > > > ib_umem_geti() -> > > >pin_user_pages_fast(FOLL_WRITE) -> > > > internal_get_user_pages_fast(FOLL_WRITE) -> > > > gup_pgd_range() -> > > > gup_huge_pd() -> > > >gup_hugepte() -> > > > try_grab_compound_head() -> > > > > Hi Leon, > > > > Thanks very much for the detailed report! So we're overflowing... > > > > At first look, this seems likely to be hitting a weak point in the > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred > > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines > > 99-121). Basically it's pretty easy to overflow the page->_refcount > > with huge pages if the pages have a *lot* of subpages. > > > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages. > > Considering that establishing these pins is entirely under user > control, we can't have a limit here. > > If the number of allowed pins are exhausted then the > pin_user_pages_fast() must fail back to the user. > > > 3. It would be nice if I could reproduce this. 
I have a two-node mlx5 > > Infiniband > > test setup, but I have done only the tiniest bit of user space IB coding, so > > if you have any test programs that aren't too hard to deal with that could > > possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind > > that I'm not an advanced IB programmer. At all. :) > > Clone this: > > https://github.com/linux-rdma/rdma-core.git > > Install all the required deps to build it (notably cython), see the README.md > > $ ./build.sh > $ build/bin/run_tests.py > > If you get things that far I think Leon can get a reproduction for you I'm not so optimistic about that. Thanks > > Jason
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On Thu, Dec 19, 2019 at 02:58:43PM -0800, John Hubbard wrote: > On 12/19/19 1:07 PM, Jason Gunthorpe wrote: > ... > > > 3. It would be nice if I could reproduce this. I have a two-node mlx5 > > > Infiniband > > > test setup, but I have done only the tiniest bit of user space IB coding, > > > so > > > if you have any test programs that aren't too hard to deal with that could > > > possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in > > > mind > > > that I'm not an advanced IB programmer. At all. :) > > > > Clone this: > > > > https://github.com/linux-rdma/rdma-core.git > > > > Install all the required deps to build it (notably cython), see the > > README.md > > > > $ ./build.sh > > $ build/bin/run_tests.py > > > > If you get things that far I think Leon can get a reproduction for you > > > > Cool, it's up and running (1 failure, 3 skipped, out of 67 tests). > > This is a great test suite to have running, I'll add it to my scripts. Here's > the > full output in case the failure or skip cases are a problem: > > $ sudo ./build/bin/run_tests.py --verbose > > test_create_ah (tests.test_addr.AHTest) ... ok > test_create_ah_roce (tests.test_addr.AHTest) ... skipped "Can't run RoCE > tests on IB link layer" > test_destroy_ah (tests.test_addr.AHTest) ... ok > test_create_comp_channel (tests.test_cq.CCTest) ... ok > test_destroy_comp_channel (tests.test_cq.CCTest) ... ok > test_create_cq_ex (tests.test_cq.CQEXTest) ... ok > test_create_cq_ex_bad_flow (tests.test_cq.CQEXTest) ... ok > test_destroy_cq_ex (tests.test_cq.CQEXTest) ... ok > test_create_cq (tests.test_cq.CQTest) ... ok > test_create_cq_bad_flow (tests.test_cq.CQTest) ... ok > test_destroy_cq (tests.test_cq.CQTest) ... ok > test_rc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok > test_ud_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok > test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok > test_create_dm (tests.test_device.DMTest) ... 
ok > test_create_dm_bad_flow (tests.test_device.DMTest) ... ok > test_destroy_dm (tests.test_device.DMTest) ... ok > test_destroy_dm_bad_flow (tests.test_device.DMTest) ... ok > test_dm_read (tests.test_device.DMTest) ... ok > test_dm_write (tests.test_device.DMTest) ... ok > test_dm_write_bad_flow (tests.test_device.DMTest) ... ok > test_dev_list (tests.test_device.DeviceTest) ... ok > test_open_dev (tests.test_device.DeviceTest) ... ok > test_query_device (tests.test_device.DeviceTest) ... ok > test_query_device_ex (tests.test_device.DeviceTest) ... ok > test_query_gid (tests.test_device.DeviceTest) ... ok > test_query_port (tests.test_device.DeviceTest) ... FAIL > test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok > test_create_dm_mr (tests.test_mr.DMMRTest) ... ok > test_destroy_dm_mr (tests.test_mr.DMMRTest) ... ok > test_buffer (tests.test_mr.MRTest) ... ok > test_dereg_mr (tests.test_mr.MRTest) ... ok > test_dereg_mr_twice (tests.test_mr.MRTest) ... ok > test_lkey (tests.test_mr.MRTest) ... ok > test_read (tests.test_mr.MRTest) ... ok > test_reg_mr (tests.test_mr.MRTest) ... ok > test_reg_mr_bad_flags (tests.test_mr.MRTest) ... ok > test_reg_mr_bad_flow (tests.test_mr.MRTest) ... ok > test_rkey (tests.test_mr.MRTest) ... ok > test_write (tests.test_mr.MRTest) ... ok > test_dereg_mw_type1 (tests.test_mr.MWTest) ... ok > test_dereg_mw_type2 (tests.test_mr.MWTest) ... ok > test_reg_mw_type1 (tests.test_mr.MWTest) ... ok > test_reg_mw_type2 (tests.test_mr.MWTest) ... ok > test_reg_mw_wrong_type (tests.test_mr.MWTest) ... ok > test_odp_rc_traffic (tests.test_odp.OdpTestCase) ... ok > test_odp_ud_traffic (tests.test_odp.OdpTestCase) ... skipped 'ODP is not > supported - ODP recv not supported' > test_odp_xrc_traffic (tests.test_odp.OdpTestCase) ... ok > test_default_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok > test_mem_align_allocators (tests.test_parent_domain.ParentDomainTestCase) ... 
> ok > test_without_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok > test_alloc_pd (tests.test_pd.PDTest) ... ok > test_create_pd_none_ctx (tests.test_pd.PDTest) ... ok > test_dealloc_pd (tests.test_pd.PDTest) ... ok > test_destroy_pd_twice (tests.test_pd.PDTest) ... ok > test_multiple_pd_creation (tests.test_pd.PDTest) ... ok > test_create_qp_ex_no_attr (tests.test_qp.QPTest) ... ok > test_create_qp_ex_no_attr_connected (tests.test_qp.QPTest) ... ok > test_create_qp_ex_with_attr (tests.test_qp.QPTest) ... ok > test_create_qp_ex_with_attr_connected (tests.test_qp.QPTest) ... ok > test_create_qp_no_attr (tests.test_qp.QPTest) ... ok > test_create_qp_no_attr_connected (tests.test_qp.QPTest) ... ok > test_create_qp_with_attr (tests.test_qp.QPTest) ... ok > test_create_qp_with_attr_connected (tests.test_qp.QPTest) ... ok > test_modify_qp (tests.test_qp.QPTest) ... ok > test_query_qp (tests.test_qp.QPTest) ... ok > test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No > devices with net interface' > > ==
Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()
On 12/20/19 12:32 PM, Jason A. Donenfeld wrote:
> On Thu, Dec 19, 2019 at 2:08 PM Cédric Le Goater wrote:
>> There is a patch for msgsndp in my tree you could try :
>>
>> https://github.com/legoater/qemu/tree/powernv-5.0
>>
>> Currently being reviewed. I have to address some remarks from David before
>> it can be merged.
>
> Thanks. I've cherry-picked
> https://github.com/legoater/qemu/commit/910c9ea5ecc and can confirm
> that I no longer receive the crashes. QEMU 5.1 or 5.0.1 release, I
> guess?

QEMU 5.1 I would say. That's enough time to address the comments.

Michael's patch to drop self IPIs is also interesting for Linux. I wonder
how we can reach that point.

Thanks,

C.
[Bug 205283] BUG: KASAN: global-out-of-bounds in _copy_to_iter+0x3d4/0x5a8
https://bugzilla.kernel.org/show_bug.cgi?id=205283 --- Comment #5 from Erhard F. (erhar...@mailbox.org) --- Created attachment 286385 --> https://bugzilla.kernel.org/attachment.cgi?id=286385&action=edit dmesg (kernel 5.5.0-rc2+, PowerMac G4 DP) -- You are receiving this mail because: You are watching someone on the CC list of the bug.
[Bug 205283] BUG: KASAN: global-out-of-bounds in _copy_to_iter+0x3d4/0x5a8
https://bugzilla.kernel.org/show_bug.cgi?id=205283 --- Comment #6 from Erhard F. (erhar...@mailbox.org) --- Created attachment 286387 --> https://bugzilla.kernel.org/attachment.cgi?id=286387&action=edit 5.5.0-rc2+ kernel .config (PowerMac G4 DP) -- You are receiving this mail because: You are watching someone on the CC list of the bug.
[Bug 205283] BUG: KASAN: global-out-of-bounds in _copy_to_iter+0x3d4/0x5a8
https://bugzilla.kernel.org/show_bug.cgi?id=205283

--- Comment #7 from Erhard F. (erhar...@mailbox.org) ---
(In reply to Christophe Leroy from comment #4)
> Can you apply https://patchwork.ozlabs.org/patch/1213028/ and select
> CONFIG_KASAN_VMALLOC

Re-tried 5.5-rc2 with your KASAN_VMALLOC patch and this patch
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7de7de7ca0ae0fc70515ee3154af33af75edae2c)
to make -rc2 boot again.

The result with KASAN is the same as before. If I enable
CONFIG_KASAN_VMALLOC additionally, the KASAN hit is gone: modprobing and
removing btrfs shows no error then. Interesting! Double-checked just to be
sure.

--
You are receiving this mail because:
You are watching someone on the CC list of the bug.
Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()
On Fri, Dec 20, 2019 at 12:32:06PM +0100, Jason A. Donenfeld wrote:
> On Thu, Dec 19, 2019 at 2:08 PM Cédric Le Goater wrote:
> > There is a patch for msgsndp in my tree you could try :
> >
> > https://github.com/legoater/qemu/tree/powernv-5.0
> >
> > Currently being reviewed. I have to address some remarks from David before
> > it can be merged.
>
> Thanks. I've cherry-picked
> https://github.com/legoater/qemu/commit/910c9ea5ecc and can confirm
> that I no longer receive the crashes. QEMU 5.1 or 5.0.1 release, I
> guess?

Unless the revision takes much longer than I expect, I'd anticipate it in
5.0.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_! http://www.ozlabs.org/~dgibson
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On 12/20/19 10:48 AM, Leon Romanovsky wrote:
...
>> test_query_qp (tests.test_qp.QPTest) ... ok
>> test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No
>> devices with net interface'
>>
>> ==
>> FAIL: test_query_port (tests.test_device.DeviceTest)
>> --
>> Traceback (most recent call last):
>>   File "/kernel_work/rdma-core/tests/test_device.py", line 129, in test_query_port
>>     self.verify_port_attr(port_attr)
>>   File "/kernel_work/rdma-core/tests/test_device.py", line 113, in verify_port_attr
>>     assert 'Invalid' not in d.speed_to_str(attr.active_speed)
>> AssertionError
>
> I'm very curious how did you get this assert. "d.speed_to_str" covers all
> known speeds according to the IBTA.

Hi Leon,

Short answer: I can make that one pass, with a small fix to the rdma-core
test suite:

commit a1b9fb0846e1b2356d7a16f4fbdd1960cf8dcbe5 (HEAD -> fix_speed_to_str)
Author: John Hubbard
Date:   Fri Dec 20 15:07:47 2019 -0800

    device: fix speed_to_str(), to handle disabled links

    For disabled links, the raw speed token is 0. However, speed_to_str()
    doesn't have that in the list. This leads to an assertion when running
    tests (test_query_port) when one link is down and other link(s) are up.

    Fix this by returning '(Disabled/down)' for the zero speed case.

diff --git a/pyverbs/device.pyx b/pyverbs/device.pyx
index 33d133fd..f8b7826b 100755
--- a/pyverbs/device.pyx
+++ b/pyverbs/device.pyx
@@ -923,8 +923,8 @@ def width_to_str(width):
 def speed_to_str(speed):
-    l = {1: '2.5 Gbps', 2: '5.0 Gbps', 4: '5.0 Gbps', 8: '10.0 Gbps',
-         16: '14.0 Gbps', 32: '25.0 Gbps', 64: '50.0 Gbps'}
+    l = {0: '(Disabled/down)', 1: '2.5 Gbps', 2: '5.0 Gbps', 4: '5.0 Gbps',
+         8: '10.0 Gbps', 16: '14.0 Gbps', 32: '25.0 Gbps', 64: '50.0 Gbps'}
     try:
         return '{s} ({n})'.format(s=l[speed], n=speed)
     except KeyError:

Longer answer:
==
It looks like this test suite assumes that every link is connected!
(Probably in most test systems, they are.)
But in my setup, the ConnectX cards each have two slots, and I only have
(and only need) one cable. So one link is up, and the other is disabled.

This leads to the other problem, which is that if a link is disabled, the
test suite finds a "0" token for attr.active_speed. That token is not in
the approved list, and so d.speed_to_str() asserts.

With some diagnostics added, I can see it checking each link: one passes,
and the other asserts:

diff --git a/tests/test_device.py b/tests/test_device.py
index 524e0e89..7b33d7db 100644
--- a/tests/test_device.py
+++ b/tests/test_device.py
@@ -110,6 +110,12 @@ class DeviceTest(unittest.TestCase):
         assert 'Invalid' not in d.translate_mtu(attr.max_mtu)
         assert 'Invalid' not in d.translate_mtu(attr.active_mtu)
         assert 'Invalid' not in d.width_to_str(attr.active_width)
+        print("")
+        print('Diagnostics ===')
+        print('phys_state:', d.phys_state_to_str(attr.phys_state))
+        print('active_width): ', d.width_to_str(attr.active_width))
+        print('active_speed: ', d.speed_to_str(attr.active_speed))
+        print('END of Diagnostics ')
         assert 'Invalid' not in d.speed_to_str(attr.active_speed)
         assert 'Invalid' not in d.translate_link_layer(attr.link_layer)
         assert attr.max_msg_sz > 0x1000

...and the test run from that is:

# ./build/bin/run_tests.py --verbose tests.test_device.DeviceTest
test_dev_list (tests.test_device.DeviceTest) ... ok
test_open_dev (tests.test_device.DeviceTest) ... ok
test_query_device (tests.test_device.DeviceTest) ... ok
test_query_device_ex (tests.test_device.DeviceTest) ... ok
test_query_gid (tests.test_device.DeviceTest) ... ok
test_query_port (tests.test_device.DeviceTest) ...
Diagnostics ===
phys_state: Link up (5)
active_width): 4X (2)
active_speed: 25.0 Gbps (32)
END of Diagnostics
Diagnostics ===
phys_state: Disabled (3)
active_width): 4X (2)
active_speed: Invalid speed
END of Diagnostics
FAIL
test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok

==
FAIL: test_query_port (tests.test_device.DeviceTest)
--
Traceback (most recent call last):
  File "/kernel_work/rdma-core/tests/test_device.py", line 135, in test_query_port
    self.verify_port_attr(port_attr)
  File "/kernel_work/rdma-core/tests/test_device.py", line 119, in verify_port_attr
    assert 'Invalid' not in d
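The patched speed_to_str() from the quoted rdma-core diff can be exercised standalone. The dict and format string below are copied from the diff hunk above; the `except KeyError` branch returning 'Invalid speed' is an assumption inferred from the "Invalid speed" strings in the diagnostic output, since that branch's body isn't shown in the hunk.

```python
# Standalone check that the zero-speed (disabled link) case no longer
# falls through to the KeyError path.
def speed_to_str(speed):
    l = {0: '(Disabled/down)', 1: '2.5 Gbps', 2: '5.0 Gbps', 4: '5.0 Gbps',
         8: '10.0 Gbps', 16: '14.0 Gbps', 32: '25.0 Gbps', 64: '50.0 Gbps'}
    try:
        return '{s} ({n})'.format(s=l[speed], n=speed)
    except KeyError:
        return 'Invalid speed'   # assumed from the test output above

print(speed_to_str(0))   # '(Disabled/down) (0)' -- previously 'Invalid speed'
print(speed_to_str(32))  # '25.0 Gbps (32)'
```

With speed token 0 mapped, `verify_port_attr()`'s `'Invalid' not in ...` assertion passes for a disabled link.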
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On 12/20/19 10:29 AM, Leon Romanovsky wrote: ... >> $ ./build.sh >> $ build/bin/run_tests.py >> >> If you get things that far I think Leon can get a reproduction for you > > I'm not so optimistic about that. > OK, I'm going to proceed for now on the assumption that I've got an overflow problem that happens when huge pages are pinned. If I can get more information, great, otherwise it's probably enough. One thing: for your repro, if you know the huge page size, and the system page size for that case, that would really help. Also the number of pins per page, more or less, that you'd expect. Because Jason says that only 2M huge pages are used... Because the other possibility is that the refcount really is going negative, likely due to a mismatched pin/unpin somehow. If there's not an obvious repro case available, but you do have one (is it easy to repro, though?), then *if* you have the time, I could point you to a github branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this: diff --git a/include/linux/mm.h b/include/linux/mm.h index bb44c4d2ada7..8526fd03b978 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page) * get_user_pages and page_mkclean and other calls that race to set up page * table entries. */ -#define GUP_PIN_COUNTING_BIAS (1U << 10) +#define GUP_PIN_COUNTING_BIAS (1U << 8) void unpin_user_page(struct page *page); void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages, If that fails to repro, then we would be zeroing in on the root cause. The branch is here (I just tested it and it seems healthy): g...@github.com:johnhubbard/linux.git pin_user_pages_tracking_v11_with_diags thanks, -- John Hubbard NVIDIA
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On 12/20/19 1:21 AM, Jan Kara wrote: ... >> So, ideas and next steps: >> >> 1. Assuming that you *are* hitting this, I think I may have to fall back to >> implementing the "deferred" part of this design, as part of this series, >> after >> all. That means: >> >> For the pin/unpin calls at least, stop treating all pages as if they are >> a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page. >> That's not how it works now, and the need to hand back a huge array of >> subpages is part of the problem. This affects the callers too, so it's not >> a super quick change to make. (I was really hoping not to have to do this >> yet.) > > Does that mean that you would need to make all GUP users huge page aware? > Otherwise I don't see how what you suggest would work... And I don't think > making all GUP users huge page aware is realistic (effort-wise) or even > wanted (maintenance overhead in all those places). > Well, pretty much yes. It's really just the pin_user_pages*() callers, but the internals, follow_page() and such, are so interconnected right now that it would probably blow up into a huge effort, as you point out. > I believe there might be also a different solution for this: For > transparent huge pages, we could find a space in 'struct page' of the > second page in the huge page for proper pin counter and just account pins > there so we'd have full width of 32-bits for it. > > Honza > OK, let me pursue that. Given that I shouldn't need to handle pages splitting, it should be not *too* bad. I am starting to think that I should just post the first 9 or so prerequisite patches (first 9 patches, plus the v4l2 fix that arguably should have been earlier in the sequence I guess), as 5.6 candidates, while I go back to the drawing board here. thanks, -- John Hubbard NVIDIA
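Jan's "pin counter in the second struct page" suggestion discussed above could look roughly like this toy model. It is purely illustrative; the field and method names are mine, not from any posted patch, and real compound-page accounting involves locking and refcount rules this model ignores.

```python
# Illustrative model: compound pages get a dedicated pin counter stored in
# the second (tail) struct page, so pin accounting no longer consumes
# _refcount headroom proportional to the number of subpages.
class CompoundPage:
    def __init__(self):
        self.refcount = 1   # models page->_refcount on the head page
        self.pincount = 0   # hypothetical full-width counter in a tail page

    def pin(self):          # models pinning the huge page as one unit
        self.pincount += 1
        self.refcount += 1  # one ordinary reference per pin, no bias

    def unpin(self):        # models the matching unpin
        self.pincount -= 1
        self.refcount -= 1

    def dma_pinned(self):   # models page_maybe_dma_pinned(), exactly now
        return self.pincount > 0

page = CompoundPage()
for _ in range(100000):     # far beyond the ~7-pin limit described upthread
    page.pin()
print(page.dma_pinned())    # True, with no refcount overflow
```

The point of the sketch is the scaling: headroom becomes the full width of the dedicated counter, independent of how many 4KB subpages the huge page contains.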
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On Fri, Dec 20, 2019 at 5:34 AM Jason Gunthorpe wrote:
>
> On Thu, Dec 19, 2019 at 01:13:54PM -0800, John Hubbard wrote:
> > On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> > > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> > > > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This implements an API naming change (put_user_page*() -->
> > > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN
> > > > > > pages. It extends that tracking to a few select subsystems.
> > > > > > More subsystems will be added in follow up work.
> > > > >
> > > > > Hi John,
> > > > >
> > > > > The patchset generates kernel panics in our IB testing. In our
> > > > > tests, we allocated a single memory block and registered multiple
> > > > > MRs using the single block.
> > > > >
> > > > > The possible bad flow is:
> > > > >   ib_umem_get() ->
> > > > >    pin_user_pages_fast(FOLL_WRITE) ->
> > > > >     internal_get_user_pages_fast(FOLL_WRITE) ->
> > > > >      gup_pgd_range() ->
> > > > >       gup_huge_pd() ->
> > > > >        gup_hugepte() ->
> > > > >         try_grab_compound_head() ->
> > > >
> > > > Hi Leon,
> > > >
> > > > Thanks very much for the detailed report! So we're overflowing...
> > > >
> > > > At first look, this seems likely to be hitting a weak point in the
> > > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be
> > > > deferred (there's a writeup in
> > > > Documentation/core-api/pin_user_page.rst, lines 99-121). Basically
> > > > it's pretty easy to overflow the page->_refcount with huge pages if
> > > > the pages have a *lot* of subpages.
> > > >
> > > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> > >
> > > Considering that establishing these pins is entirely under user
> > > control, we can't have a limit here.
> >
> > There's already a limit, it's just a much larger one. :) What does
> > "no limit" really mean, numerically, to you in this case?
>
> I guess I mean 'hidden limit' - hitting the limit and failing would
> be manageable.
>
> I think 7 is probably too low though, but we are not using 1GB huge
> pages, only 2M..

What about RDMA to 1GB-hugetlbfs and 1GB-device-dax mappings?
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On Fri, Dec 20, 2019 at 1:22 AM Jan Kara wrote:
>
> On Thu 19-12-19 12:30:31, John Hubbard wrote:
> > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > Hi,
> > > >
> > > > This implements an API naming change (put_user_page*() -->
> > > > unpin_user_page*()), and also implements tracking of FOLL_PIN
> > > > pages. It extends that tracking to a few select subsystems. More
> > > > subsystems will be added in follow up work.
> > >
> > > Hi John,
> > >
> > > The patchset generates kernel panics in our IB testing. In our tests,
> > > we allocated a single memory block and registered multiple MRs using
> > > the single block.
> > >
> > > The possible bad flow is:
> > >   ib_umem_get() ->
> > >    pin_user_pages_fast(FOLL_WRITE) ->
> > >     internal_get_user_pages_fast(FOLL_WRITE) ->
> > >      gup_pgd_range() ->
> > >       gup_huge_pd() ->
> > >        gup_hugepte() ->
> > >         try_grab_compound_head() ->
> >
> > Hi Leon,
> >
> > Thanks very much for the detailed report! So we're overflowing...
> >
> > At first look, this seems likely to be hitting a weak point in the
> > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be
> > deferred (there's a writeup in
> > Documentation/core-api/pin_user_page.rst, lines 99-121). Basically
> > it's pretty easy to overflow the page->_refcount with huge pages if
> > the pages have a *lot* of subpages.
> >
> > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> > Do you have any idea how many pins (repeated pins on the same page,
> > which it sounds like you have) might be involved in your test case,
> > and the huge page and system page sizes? That would allow calculating
> > if we're likely overflowing for that reason.
> >
> > So, ideas and next steps:
> >
> > 1. Assuming that you *are* hitting this, I think I may have to fall
> > back to implementing the "deferred" part of this design, as part of
> > this series, after all. That means:
> >
> > For the pin/unpin calls at least, stop treating all pages as if they
> > are a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one
> > page. That's not how it works now, and the need to hand back a huge
> > array of subpages is part of the problem. This affects the callers
> > too, so it's not a super quick change to make. (I was really hoping
> > not to have to do this yet.)
>
> Does that mean that you would need to make all GUP users huge page aware?
> Otherwise I don't see how what you suggest would work... And I don't think
> making all GUP users huge page aware is realistic (effort-wise) or even
> wanted (maintenance overhead in all those places).
>
> I believe there might be also a different solution for this: For
> transparent huge pages, we could find a space in 'struct page' of the
> second page in the huge page for proper pin counter and just account pins
> there so we'd have full width of 32-bits for it.

That would require THP accounting for dax pages. It is something that was
probably going to be needed, but this would seem to force the issue.
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On 12/20/19 4:33 PM, Dan Williams wrote:
...
>> I believe there might be also a different solution for this: For
>> transparent huge pages, we could find a space in 'struct page' of the
>> second page in the huge page for proper pin counter and just account
>> pins there so we'd have full width of 32-bits for it.
>
> That would require THP accounting for dax pages. It is something that
> was probably going to be needed, but this would seem to force the
> issue.

Thanks for mentioning that, it wasn't obvious to me yet.

How easy is it for mere mortals outside of Intel to set up a DAX (nvdimm?)
test setup? I'd hate to go into this without having that coverage up and
running. It's been sketchy enough as it is. :)

thanks,
--
John Hubbard
NVIDIA
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On Fri, Dec 20, 2019 at 4:41 PM John Hubbard wrote:
>
> On 12/20/19 4:33 PM, Dan Williams wrote:
> ...
> >> I believe there might be also a different solution for this: For
> >> transparent huge pages, we could find a space in 'struct page' of the
> >> second page in the huge page for proper pin counter and just account
> >> pins there so we'd have full width of 32-bits for it.
> >
> > That would require THP accounting for dax pages. It is something that
> > was probably going to be needed, but this would seem to force the
> > issue.
>
> Thanks for mentioning that, it wasn't obvious to me yet.
>
> How easy is it for mere mortals outside of Intel, to set up a DAX
> (nvdimm?) test setup? I'd hate to go into this without having that
> coverage up and running. It's been sketchy enough as it is. :)

You too can have the power of the gods for the low low price of a kernel
command line parameter, or a qemu setup.

Details here:

https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
https://nvdimm.wiki.kernel.org/pmem_in_qemu
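[Editor's note: for readers following along, the memmap route from the first link boils down to something like the fragment below. The size and base address are examples only; the chosen range must be ordinary RAM that your e820 map reports as usable, so check that first.]

```shell
# Carve 2GB of regular RAM, starting at physical address 16GB, out as
# emulated persistent memory by adding this to the kernel command line
# (e.g. via GRUB_CMDLINE_LINUX in /etc/default/grub):
#
#     memmap=2G!16G
#
# After a reboot the region shows up as /dev/pmem0, which can carry a
# DAX-capable filesystem:
mkfs.ext4 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmem
```

That gives fsdax-style pages to test against without any NVDIMM hardware; the second link covers doing the same under qemu.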
Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
On 12/20/19 4:51 PM, Dan Williams wrote:
> On Fri, Dec 20, 2019 at 4:41 PM John Hubbard wrote:
>>
>> On 12/20/19 4:33 PM, Dan Williams wrote:
>> ...
>>>> I believe there might be also a different solution for this: For
>>>> transparent huge pages, we could find a space in 'struct page' of the
>>>> second page in the huge page for proper pin counter and just account
>>>> pins there so we'd have full width of 32-bits for it.
>>>
>>> That would require THP accounting for dax pages. It is something that
>>> was probably going to be needed, but this would seem to force the
>>> issue.
>>
>> Thanks for mentioning that, it wasn't obvious to me yet.
>>
>> How easy is it for mere mortals outside of Intel, to set up a DAX
>> (nvdimm?) test setup? I'd hate to go into this without having that
>> coverage up and running. It's been sketchy enough as it is. :)
>
> You too can have the power of the gods for the low low price of a
> kernel command line parameter, or a qemu setup.
>
> Details here:
>
> https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
> https://nvdimm.wiki.kernel.org/pmem_in_qemu

Sweet! Now I can really cause some damage. :)

thanks,
--
John Hubbard
NVIDIA
Re: PPC64: G5 & 4k/64k page size (was: Re: Call for report - G5/PPC970 status)
Romain Dolbeau writes:

> On Thu, Dec 12, 2019 at 10:40 PM, Andreas Schwab wrote:
>> I'm using 4K pages, in case that matters
>
> Yes it does matter, as it seems to be the difference between "working"
> and "not working" :-)
> Thank you for the config & pointing out the culprit!
>
> With your config, my machine boots (though it's missing some features
> as the config seems quite tuned).
>
> Moving from 64k pages to 4k pages on 'my' config (essentially,
> Debian's 5.3 with default values for changes since), my machine boots
> as well & everything seems to work fine.
>
> So question to Aneesh - did you try 64k pages on your G5, or only 4k?
> In the second case, could you try with 64k to see if you can reproduce
> the crash?

I don't have direct access to this system; I have asked if we can get a run
with 64K. Meanwhile, is there a way to find out what caused the Machine
Check? Any more details on that? I was checking the manual and I don't see
any restrictions w.r.t. the effective address. We now have a very high EA
with the 64K page size.

-aneesh