Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread Jan Kara
On Thu 19-12-19 12:30:31, John Hubbard wrote:
> On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > Hi,
> > > 
> > > This implements an API naming change (put_user_page*() -->
> > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > extends that tracking to a few select subsystems. More subsystems will
> > > be added in follow up work.
> > 
> > Hi John,
> > 
> > The patchset generates kernel panics in our IB testing. In our tests, we
> > allocated single memory block and registered multiple MRs using the single
> > block.
> > 
> > The possible bad flow is:
> >   ib_umem_get() ->
> >    pin_user_pages_fast(FOLL_WRITE) ->
> >     internal_get_user_pages_fast(FOLL_WRITE) ->
> >      gup_pgd_range() ->
> >       gup_huge_pd() ->
> >        gup_hugepte() ->
> >         try_grab_compound_head() ->
> 
> Hi Leon,
> 
> Thanks very much for the detailed report! So we're overflowing...
> 
> At first look, this seems likely to be hitting a weak point in the
> GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> (there's a writeup in Documentation/core-api/pin_user_pages.rst, lines
> 99-121). Basically it's pretty easy to overflow the page->_refcount
> with huge pages if the pages have a *lot* of subpages.
> 
> We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> Do you have any idea how many pins (repeated pins on the same page, which
> it sounds like you have) might be involved in your test case,
> and the huge page and system page sizes? That would allow calculating
> if we're likely overflowing for that reason.
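
The "about 7 pins" figure follows from simple arithmetic; a rough sketch,
assuming GUP_PIN_COUNTING_BIAS is 1U << 10 as in this series and a 32-bit
page->_refcount:

/*
 * Pinning every 4KB subpage of a 1GB huge page adds GUP_PIN_COUNTING_BIAS
 * per subpage to the head page's _refcount, roughly:
 *
 *   subpages per 1GB page:   1GB / 4KB          = 262144  (2^18)
 *   refs added per full pin: 262144 * (1 << 10) = 2^28
 *   headroom in _refcount:   ~2^31 (signed int)
 *
 *   2^31 / 2^28 = 8  ->  about 7 full pins before overflowing
 */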
> 
> So, ideas and next steps:
> 
> 1. Assuming that you *are* hitting this, I think I may have to fall back to
> implementing the "deferred" part of this design, as part of this series, after
> all. That means:
> 
>   For the pin/unpin calls at least, stop treating all pages as if they are
>   a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
>   That's not how it works now, and the need to hand back a huge array of
>   subpages is part of the problem. This affects the callers too, so it's not
>   a super quick change to make. (I was really hoping not to have to do this
>   yet.)

Does that mean that you would need to make all GUP users huge page aware?
Otherwise I don't see how what you suggest would work... And I don't think
making all GUP users huge page aware is realistic (effort-wise) or even
wanted (maintenance overhead in all those places).

I believe there might be also a different solution for this: For
transparent huge pages, we could find a space in 'struct page' of the
second page in the huge page for proper pin counter and just account pins
there so we'd have full width of 32-bits for it.
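
A minimal sketch of one way such a counter could be laid out; the field and
helper names below are illustrative assumptions, not from any posted patch:

/* Hypothetical: keep a full 32-bit pin count in the second page of the
 * compound (huge) page, instead of biasing the head page's _refcount. */
static inline atomic_t *huge_page_pincount_ptr(struct page *head)
{
	return &head[1].hypothetical_pincount;	/* head[1] == second page */
}

static inline void huge_page_pin(struct page *head)
{
	atomic_inc(huge_page_pincount_ptr(head));
}

static inline void huge_page_unpin(struct page *head)
{
	atomic_dec(huge_page_pincount_ptr(head));
}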

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] powerpc/book3s64: Fix error handling in mm_iommu_do_alloc()

2019-12-20 Thread Jan Kara
On Fri 20-12-19 16:06:05, Alexey Kardashevskiy wrote:
> 
> 
> On 11/12/2019 21:42, Jan Kara wrote:
> > The last jump to free_exit in mm_iommu_do_alloc() happens after page
> > pointers in struct mm_iommu_table_group_mem_t were already converted to
> > physical addresses. Thus calling put_page() on these physical addresses
> > will likely crash. Convert physical addresses back to page pointers
> > during the error cleanup.
> > 
> > Signed-off-by: Jan Kara 
> > ---
> >  arch/powerpc/mm/book3s64/iommu_api.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> >  Beware, this is completely untested, spotted just by code audit.
> > 
> > diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
> > index 56cc84520577..06c403381c9c 100644
> > --- a/arch/powerpc/mm/book3s64/iommu_api.c
> > +++ b/arch/powerpc/mm/book3s64/iommu_api.c
> > @@ -154,7 +154,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
> >  				       (mem2->entries << PAGE_SHIFT)))) {
> >  			ret = -EINVAL;
> >  			mutex_unlock(&mem_list_mutex);
> > -			goto free_exit;
> > +			goto convert_exit;
> >  		}
> >  	}
> >  
> > @@ -166,6 +166,9 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
> >  
> >  	return 0;
> >  
> > +convert_exit:
> > +	for (i = 0; i < pinned; i++)
> > +		mem->hpages[i] = pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
> 
> 
> Good find. Although doing it where you added "goto convert_exit" seems
> cleaner imho. Thanks,

I don't really care :). V2 attached.

Honza
-- 
Jan Kara 
SUSE Labs, CR
From 947c7a893c282b829a8623da73276a2fe56fdcd3 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Wed, 11 Dec 2019 11:36:54 +0100
Subject: [PATCH v2] powerpc/book3s64: Fix error handling in mm_iommu_do_alloc()

The last jump to free_exit in mm_iommu_do_alloc() happens after page
pointers in struct mm_iommu_table_group_mem_t were already converted to
physical addresses. Thus calling put_page() on these physical addresses
will likely crash. Convert physical addresses back to page pointers
during the error cleanup.

Signed-off-by: Jan Kara 
---
 arch/powerpc/mm/book3s64/iommu_api.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..1aa06584d783 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -154,6 +154,11 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 				       (mem2->entries << PAGE_SHIFT)))) {
 			ret = -EINVAL;
 			mutex_unlock(&mem_list_mutex);
+			/* Convert back to page pointers */
+			for (i = 0; i < pinned; i++) {
+				mem->hpages[i] =
+					pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
+			}
 			goto free_exit;
 		}
 	}
-- 
2.16.4



Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()

2019-12-20 Thread Jason A. Donenfeld
On Thu, Dec 19, 2019 at 2:08 PM Cédric Le Goater  wrote:
> There is a patch for msgsndp in my tree you could try :
>
>   https://github.com/legoater/qemu/tree/powernv-5.0
>
> Currently being reviewed. I have to address some remarks from David before
> it can be merged.

Thanks. I've cherry-picked
https://github.com/legoater/qemu/commit/910c9ea5ecc and can confirm
that I no longer receive the crashes. QEMU 5.1 or 5.0.1 release, I
guess?

Jason


Re: [PATCH v17 06/23] powerpc: mm: Add p?d_leaf() definitions

2019-12-20 Thread Steven Price
On 19/12/2019 11:49, Michael Ellerman wrote:
> Steven Price  writes:
>> walk_page_range() is going to be allowed to walk page tables other than
>> those of user space. For this it needs to know when it has reached a
>> 'leaf' entry in the page tables. This information is provided by the
>> p?d_leaf() functions/macros.
>>
>> For powerpc p?d_is_leaf() functions already exist. Export them using the
>> new p?d_leaf() name.
>>
>> CC: Benjamin Herrenschmidt 
>> CC: Paul Mackerras 
>> CC: Michael Ellerman 
>> CC: linuxppc-dev@lists.ozlabs.org
>> CC: kvm-...@vger.kernel.org
>> Signed-off-by: Steven Price 
>> ---
>>  arch/powerpc/include/asm/book3s/64/pgtable.h | 3 +++
>>  1 file changed, 3 insertions(+)
> 
> We have fallback versions of our pmd_is_leaf() etc. in
> arch/powerpc/include/asm/pgtable.h, eg:
> 
> #ifndef pmd_is_leaf
> #define pmd_is_leaf pmd_is_leaf
> static inline bool pmd_is_leaf(pmd_t pmd)
> {
>   return false;
> }
> #endif
> 
> Because we support several different MMUs and most of them don't need to
> do anything.
> 
> So we could put the compatibility #defines to your names along with the
> fallback versions in asm/pgtable.h, rather than in
> asm/book3s/64/pgtable.h
> 
> But I see you also have fallbacks for your versions, so it probably
> doesn't matter either way.
> 
> So I'm OK with this version, unless you think there's a compelling
> reason to do the compatibility #defines in our asm/pgtable.h

I was thinking that (assuming this series actually gets merged this
time), the p?d_is_leaf() versions could be removed and replaced by
p?d_leaf() in the future. Since the fallbacks are in the asm-generic
code it makes sense for the pmd_leaf() definitions to be next to the
non-fallback versions.
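
For reference, those generic fallbacks follow the usual pattern, roughly like
this (a sketch; the exact header the series touches may differ):

/* Architectures that don't provide p?d_leaf() get a trivial fallback,
 * meaning "this entry is never a leaf/huge mapping": */
#ifndef pmd_leaf
#define pmd_leaf(x)	0
#endif

#ifndef pud_leaf
#define pud_leaf(x)	0
#endif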

> Acked-by: Michael Ellerman 

Thanks!

Steve

> cheers
> 
> 
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> index b01624e5c467..201a69e6a355 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -1355,18 +1355,21 @@ static inline bool is_pte_rw_upgrade(unsigned long old_val, unsigned long new_va
>>   * Like pmd_huge() and pmd_large(), but works regardless of config options
>>   */
>>  #define pmd_is_leaf pmd_is_leaf
>> +#define pmd_leaf pmd_is_leaf
>>  static inline bool pmd_is_leaf(pmd_t pmd)
>>  {
>>  return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
>>  }
>>  
>>  #define pud_is_leaf pud_is_leaf
>> +#define pud_leaf pud_is_leaf
>>  static inline bool pud_is_leaf(pud_t pud)
>>  {
>>  return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
>>  }
>>  
>>  #define pgd_is_leaf pgd_is_leaf
>> +#define pgd_leaf pgd_is_leaf
>>  static inline bool pgd_is_leaf(pgd_t pgd)
>>  {
>>  return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PTE));
>> -- 
>> 2.20.1
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 



[Bug 205889] CONFIG_PPC_85xx with CONFIG_CORENET_GENERIC outputs uImage instead of zImage

2019-12-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205889

--- Comment #2 from Bradley Gamble (bradley.gam...@ncipher.com) ---
I understand the desire not to break scripts etc that may be used, but zImage
and uImage files are different and may not both be bootable on the same device.
I am currently able to work around the issue by stripping the first 64 bytes of
the uImage to turn it back into a zImage.
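
For context on the 64-byte figure: a uImage is just the payload (here, the
zImage) with U-Boot's legacy image header prepended. A rough sketch of that
header, shown only to explain the offset being stripped:

/* Legacy U-Boot uImage header: exactly 64 bytes in front of the payload. */
struct uimage_header {
	uint32_t ih_magic;	/* 0x27051956 */
	uint32_t ih_hcrc;	/* header CRC */
	uint32_t ih_time;	/* build timestamp */
	uint32_t ih_size;	/* payload size */
	uint32_t ih_load;	/* load address */
	uint32_t ih_ep;		/* entry point */
	uint32_t ih_dcrc;	/* payload CRC */
	uint8_t  ih_os, ih_arch, ih_type, ih_comp;
	char     ih_name[32];	/* image name */
};				/* 7*4 + 4 + 32 = 64 bytes */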

Is it possible to overcome the implicit link between the platform and enabling
CONFIG_DEFAULT_UIMAGE? I have tried marking it as "not set" in my defconfig but
it is still enabled.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread Jason Gunthorpe
On Thu, Dec 19, 2019 at 01:13:54PM -0800, John Hubbard wrote:
> On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> > > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > > Hi,
> > > > > 
> > > > > This implements an API naming change (put_user_page*() -->
> > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. 
> > > > > It
> > > > > extends that tracking to a few select subsystems. More subsystems will
> > > > > be added in follow up work.
> > > > 
> > > > Hi John,
> > > > 
> > > > The patchset generates kernel panics in our IB testing. In our tests, we
> > > > allocated single memory block and registered multiple MRs using the 
> > > > single
> > > > block.
> > > > 
> > > > The possible bad flow is:
> > > >   ib_umem_get() ->
> > > >    pin_user_pages_fast(FOLL_WRITE) ->
> > > >     internal_get_user_pages_fast(FOLL_WRITE) ->
> > > >      gup_pgd_range() ->
> > > >       gup_huge_pd() ->
> > > >        gup_hugepte() ->
> > > >         try_grab_compound_head() ->
> > > 
> > > Hi Leon,
> > > 
> > > Thanks very much for the detailed report! So we're overflowing...
> > > 
> > > At first look, this seems likely to be hitting a weak point in the
> > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> > > (there's a writeup in Documentation/core-api/pin_user_pages.rst, lines
> > > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > > with huge pages if the pages have a *lot* of subpages.
> > > 
> > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> > 
> > Considering that establishing these pins is entirely under user
> > control, we can't have a limit here.
> 
> There's already a limit, it's just a much larger one. :) What does "no limit"
> really mean, numerically, to you in this case?

I guess I mean 'hidden limit' - hitting the limit and failing would
be manageable.

I think 7 is probably too low though, but we are not using 1GB huge
pages, only 2M..

> > If the number of allowed pins are exhausted then the
> > pin_user_pages_fast() must fail back to the user.
> 
> I'll poke around the IB call stack and see how much of that return
> path is in place, if any. Because it's the same situation for
> get_user_pages_fast().  This code just added a warning on overflow
> so we could spot it early.

All GUP callers must be prepared for failure, IB should be fine...
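
As a rough illustration of what "prepared for failure" means for a
pin_user_pages_fast() caller (a sketch with an illustrative helper name,
not taken from the IB code):

/* Sketch: a pin_user_pages_fast() caller must handle short or failed pins. */
static int pin_user_buffer(unsigned long start, int nr_pages,
			   struct page **pages)
{
	int ret = pin_user_pages_fast(start, nr_pages, FOLL_WRITE, pages);

	if (ret < 0)
		return ret;			/* hard failure, e.g. -ENOMEM */
	if (ret != nr_pages) {
		/* short pin: release what we did get, then fail back */
		unpin_user_pages(pages, ret);
		return -EFAULT;
	}
	return 0;
}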

Jason


Re: [PATCH] serial: ucc_uart: remove redundant assignment to pointer bdp

2019-12-20 Thread Timur Tabi

On 12/19/19 6:10 PM, Colin King wrote:

From: Colin Ian King

The variable bdp is being initialized with a value that is never
read and it is being updated later with a new value. The initialization
is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King


Acked-by: Timur Tabi 

Looks like this bug has been there since day 1.


Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.

2019-12-20 Thread Christophe Leroy

Hi Thomas,

In do_hres(), I see:

	cycles = __arch_get_hw_counter(vd->clock_mode);
	ns = vdso_ts->nsec;
	last = vd->cycle_last;
	if (unlikely((s64)cycles < 0))
		return -1;

__arch_get_hw_counter() returns a u64 value. On the PPC, this is read
from the timebase, which is a 64-bit counter.


Why return -1 if (s64)cycles < 0 ? Does it mean we have to mask out
the most significant bit when reading the HW counter ?


Christophe


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread Leon Romanovsky
On Thu, Dec 19, 2019 at 05:07:43PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > Hi,
> > > >
> > > > This implements an API naming change (put_user_page*() -->
> > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > > extends that tracking to a few select subsystems. More subsystems will
> > > > be added in follow up work.
> > >
> > > Hi John,
> > >
> > > The patchset generates kernel panics in our IB testing. In our tests, we
> > > allocated single memory block and registered multiple MRs using the single
> > > block.
> > >
> > > The possible bad flow is:
> > >   ib_umem_get() ->
> > >    pin_user_pages_fast(FOLL_WRITE) ->
> > >     internal_get_user_pages_fast(FOLL_WRITE) ->
> > >      gup_pgd_range() ->
> > >       gup_huge_pd() ->
> > >        gup_hugepte() ->
> > >         try_grab_compound_head() ->
> >
> > Hi Leon,
> >
> > Thanks very much for the detailed report! So we're overflowing...
> >
> > At first look, this seems likely to be hitting a weak point in the
> > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> > (there's a writeup in Documentation/core-api/pin_user_pages.rst, lines
> > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > with huge pages if the pages have a *lot* of subpages.
> >
> > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
>
> Considering that establishing these pins is entirely under user
> control, we can't have a limit here.
>
> If the number of allowed pins are exhausted then the
> pin_user_pages_fast() must fail back to the user.
>
> > 3. It would be nice if I could reproduce this. I have a two-node mlx5 
> > Infiniband
> > test setup, but I have done only the tiniest bit of user space IB coding, so
> > if you have any test programs that aren't too hard to deal with that could
> > possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
> > that I'm not an advanced IB programmer. At all. :)
>
> Clone this:
>
> https://github.com/linux-rdma/rdma-core.git
>
> Install all the required deps to build it (notably cython), see the README.md
>
> $ ./build.sh
> $ build/bin/run_tests.py
>
> If you get things that far I think Leon can get a reproduction for you

I'm not so optimistic about that.

Thanks

>
> Jason


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread Leon Romanovsky
On Thu, Dec 19, 2019 at 02:58:43PM -0800, John Hubbard wrote:
> On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> ...
> > > 3. It would be nice if I could reproduce this. I have a two-node mlx5 
> > > Infiniband
> > > test setup, but I have done only the tiniest bit of user space IB coding, 
> > > so
> > > if you have any test programs that aren't too hard to deal with that could
> > > possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in 
> > > mind
> > > that I'm not an advanced IB programmer. At all. :)
> >
> > Clone this:
> >
> > https://github.com/linux-rdma/rdma-core.git
> >
> > Install all the required deps to build it (notably cython), see the 
> > README.md
> >
> > $ ./build.sh
> > $ build/bin/run_tests.py
> >
> > If you get things that far I think Leon can get a reproduction for you
> >
>
> Cool, it's up and running (1 failure, 3 skipped, out of 67 tests).
>
> This is a great test suite to have running, I'll add it to my scripts. Here's 
> the
> full output in case the failure or skip cases are a problem:
>
> $ sudo ./build/bin/run_tests.py --verbose
>
> test_create_ah (tests.test_addr.AHTest) ... ok
> test_create_ah_roce (tests.test_addr.AHTest) ... skipped "Can't run RoCE 
> tests on IB link layer"
> test_destroy_ah (tests.test_addr.AHTest) ... ok
> test_create_comp_channel (tests.test_cq.CCTest) ... ok
> test_destroy_comp_channel (tests.test_cq.CCTest) ... ok
> test_create_cq_ex (tests.test_cq.CQEXTest) ... ok
> test_create_cq_ex_bad_flow (tests.test_cq.CQEXTest) ... ok
> test_destroy_cq_ex (tests.test_cq.CQEXTest) ... ok
> test_create_cq (tests.test_cq.CQTest) ... ok
> test_create_cq_bad_flow (tests.test_cq.CQTest) ... ok
> test_destroy_cq (tests.test_cq.CQTest) ... ok
> test_rc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
> test_ud_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
> test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
> test_create_dm (tests.test_device.DMTest) ... ok
> test_create_dm_bad_flow (tests.test_device.DMTest) ... ok
> test_destroy_dm (tests.test_device.DMTest) ... ok
> test_destroy_dm_bad_flow (tests.test_device.DMTest) ... ok
> test_dm_read (tests.test_device.DMTest) ... ok
> test_dm_write (tests.test_device.DMTest) ... ok
> test_dm_write_bad_flow (tests.test_device.DMTest) ... ok
> test_dev_list (tests.test_device.DeviceTest) ... ok
> test_open_dev (tests.test_device.DeviceTest) ... ok
> test_query_device (tests.test_device.DeviceTest) ... ok
> test_query_device_ex (tests.test_device.DeviceTest) ... ok
> test_query_gid (tests.test_device.DeviceTest) ... ok
> test_query_port (tests.test_device.DeviceTest) ... FAIL
> test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok
> test_create_dm_mr (tests.test_mr.DMMRTest) ... ok
> test_destroy_dm_mr (tests.test_mr.DMMRTest) ... ok
> test_buffer (tests.test_mr.MRTest) ... ok
> test_dereg_mr (tests.test_mr.MRTest) ... ok
> test_dereg_mr_twice (tests.test_mr.MRTest) ... ok
> test_lkey (tests.test_mr.MRTest) ... ok
> test_read (tests.test_mr.MRTest) ... ok
> test_reg_mr (tests.test_mr.MRTest) ... ok
> test_reg_mr_bad_flags (tests.test_mr.MRTest) ... ok
> test_reg_mr_bad_flow (tests.test_mr.MRTest) ... ok
> test_rkey (tests.test_mr.MRTest) ... ok
> test_write (tests.test_mr.MRTest) ... ok
> test_dereg_mw_type1 (tests.test_mr.MWTest) ... ok
> test_dereg_mw_type2 (tests.test_mr.MWTest) ... ok
> test_reg_mw_type1 (tests.test_mr.MWTest) ... ok
> test_reg_mw_type2 (tests.test_mr.MWTest) ... ok
> test_reg_mw_wrong_type (tests.test_mr.MWTest) ... ok
> test_odp_rc_traffic (tests.test_odp.OdpTestCase) ... ok
> test_odp_ud_traffic (tests.test_odp.OdpTestCase) ... skipped 'ODP is not 
> supported - ODP recv not supported'
> test_odp_xrc_traffic (tests.test_odp.OdpTestCase) ... ok
> test_default_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
> test_mem_align_allocators (tests.test_parent_domain.ParentDomainTestCase) ... 
> ok
> test_without_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
> test_alloc_pd (tests.test_pd.PDTest) ... ok
> test_create_pd_none_ctx (tests.test_pd.PDTest) ... ok
> test_dealloc_pd (tests.test_pd.PDTest) ... ok
> test_destroy_pd_twice (tests.test_pd.PDTest) ... ok
> test_multiple_pd_creation (tests.test_pd.PDTest) ... ok
> test_create_qp_ex_no_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_ex_no_attr_connected (tests.test_qp.QPTest) ... ok
> test_create_qp_ex_with_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_ex_with_attr_connected (tests.test_qp.QPTest) ... ok
> test_create_qp_no_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_no_attr_connected (tests.test_qp.QPTest) ... ok
> test_create_qp_with_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_with_attr_connected (tests.test_qp.QPTest) ... ok
> test_modify_qp (tests.test_qp.QPTest) ... ok
> test_query_qp (tests.test_qp.QPTest) ... ok
> test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No 
> devices with net interface'
>
> ==

Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()

2019-12-20 Thread Cédric Le Goater
On 12/20/19 12:32 PM, Jason A. Donenfeld wrote:
> On Thu, Dec 19, 2019 at 2:08 PM Cédric Le Goater  wrote:
>> There is a patch for msgsndp in my tree you could try :
>>
>>   https://github.com/legoater/qemu/tree/powernv-5.0
>>
>> Currently being reviewed. I have to address some remarks from David before
>> it can be merged.
> 
> Thanks. I've cherry-picked
> https://github.com/legoater/qemu/commit/910c9ea5ecc and can confirm
> that I no longer receive the crashes. QEMU 5.1 or 5.0.1 release, I
> guess?

QEMU 5.1 I would say. That's enough time to address the comments.

Michael's patch to drop self IPIs is also interesting for Linux. I wonder
how we can reach that point. 

Thanks,

C. 



[Bug 205283] BUG: KASAN: global-out-of-bounds in _copy_to_iter+0x3d4/0x5a8

2019-12-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205283

--- Comment #5 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 286385
  --> https://bugzilla.kernel.org/attachment.cgi?id=286385&action=edit
dmesg (kernel 5.5.0-rc2+, PowerMac G4 DP)

-- 
You are receiving this mail because:
You are watching someone on the CC list of the bug.

[Bug 205283] BUG: KASAN: global-out-of-bounds in _copy_to_iter+0x3d4/0x5a8

2019-12-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205283

--- Comment #6 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 286387
  --> https://bugzilla.kernel.org/attachment.cgi?id=286387&action=edit
5.5.0-rc2+ kernel .config (PowerMac G4 DP)

-- 
You are receiving this mail because:
You are watching someone on the CC list of the bug.

[Bug 205283] BUG: KASAN: global-out-of-bounds in _copy_to_iter+0x3d4/0x5a8

2019-12-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=205283

--- Comment #7 from Erhard F. (erhar...@mailbox.org) ---
(In reply to Christophe Leroy from comment #4)
> Can you apply https://patchwork.ozlabs.org/patch/1213028/ and select
> CONFIG_KASAN_VMALLOC
Re-tried 5.5-rc2 with your KASAN_VMALLOC patch and this patch
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7de7de7ca0ae0fc70515ee3154af33af75edae2c)
to make -rc2 boot again.

The result with KASAN is the same as before. If I additionally enable
CONFIG_KASAN_VMALLOC, the KASAN hit is gone. Modprobing and removing btrfs
shows no error then. Interesting!

Double-checked it just to be sure.

-- 
You are receiving this mail because:
You are watching someone on the CC list of the bug.

Re: [Qemu-ppc] pseries on qemu-system-ppc64le crashes in doorbell_core_ipi()

2019-12-20 Thread David Gibson
On Fri, Dec 20, 2019 at 12:32:06PM +0100, Jason A. Donenfeld wrote:
> On Thu, Dec 19, 2019 at 2:08 PM Cédric Le Goater  wrote:
> > There is a patch for msgsndp in my tree you could try :
> >
> >   https://github.com/legoater/qemu/tree/powernv-5.0
> >
> > Currently being reviewed. I have to address some remarks from David before
> > it can be merged.
> 
> Thanks. I've cherry-picked
> https://github.com/legoater/qemu/commit/910c9ea5ecc and can confirm
> that I no longer receive the crashes. QEMU 5.1 or 5.0.1 release, I
> guess?

Unless the revision takes much longer than I expect, I'd anticipate it
in 5.0.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread John Hubbard
On 12/20/19 10:48 AM, Leon Romanovsky wrote:
...
>> test_query_qp (tests.test_qp.QPTest) ... ok
>> test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No 
>> devices with net interface'
>>
>> ==
>> FAIL: test_query_port (tests.test_device.DeviceTest)
>> --
>> Traceback (most recent call last):
>>   File "/kernel_work/rdma-core/tests/test_device.py", line 129, in 
>> test_query_port
>> self.verify_port_attr(port_attr)
>>   File "/kernel_work/rdma-core/tests/test_device.py", line 113, in 
>> verify_port_attr
>> assert 'Invalid' not in d.speed_to_str(attr.active_speed)
>> AssertionError
> 
> I'm very curious how you got this assert; "d.speed_to_str" covers all
> known speeds according to the IBTA.
> 

Hi Leon,

Short answer: I can make that one pass, with a small fix to the rdma-core test
suite:

commit a1b9fb0846e1b2356d7a16f4fbdd1960cf8dcbe5 (HEAD -> fix_speed_to_str)
Author: John Hubbard 
Date:   Fri Dec 20 15:07:47 2019 -0800

device: fix speed_to_str(), to handle disabled links

For disabled links, the raw speed token is 0. However,
speed_to_str() doesn't have that in the list. This leads
to an assertion when running tests (test_query_port) when
one link is down and other link(s) are up.

Fix this by returning '(Disabled/down)' for the zero speed
case.

diff --git a/pyverbs/device.pyx b/pyverbs/device.pyx
index 33d133fd..f8b7826b 100755
--- a/pyverbs/device.pyx
+++ b/pyverbs/device.pyx
@@ -923,8 +923,8 @@ def width_to_str(width):
 
 
 def speed_to_str(speed):
-    l = {1: '2.5 Gbps', 2: '5.0 Gbps', 4: '5.0 Gbps', 8: '10.0 Gbps',
-         16: '14.0 Gbps', 32: '25.0 Gbps', 64: '50.0 Gbps'}
+    l = {0: '(Disabled/down)', 1: '2.5 Gbps', 2: '5.0 Gbps', 4: '5.0 Gbps',
+         8: '10.0 Gbps', 16: '14.0 Gbps', 32: '25.0 Gbps', 64: '50.0 Gbps'}
     try:
         return '{s} ({n})'.format(s=l[speed], n=speed)
     except KeyError:


Longer answer:
==

It looks like this test suite assumes that every link is connected! (Probably
in most test systems, they are.) But in my setup, the ConnectX cards each have
two slots, and I only have (and only need) one cable. So one link is up, and
the other is disabled. 

This leads to the other problem, which is that if a link is disabled, the
test suite finds a "0" token for attr.active_speed. That token is not in the
approved list, and so d.speed_to_str() asserts.

With some diagnostics added, I can see it checking each link: one passes, and
the other asserts:

diff --git a/tests/test_device.py b/tests/test_device.py
index 524e0e89..7b33d7db 100644
--- a/tests/test_device.py
+++ b/tests/test_device.py
@@ -110,6 +110,12 @@ class DeviceTest(unittest.TestCase):
         assert 'Invalid' not in d.translate_mtu(attr.max_mtu)
         assert 'Invalid' not in d.translate_mtu(attr.active_mtu)
         assert 'Invalid' not in d.width_to_str(attr.active_width)
+        print("")
+        print('Diagnostics ===')
+        print('phys_state:', d.phys_state_to_str(attr.phys_state))
+        print('active_width): ', d.width_to_str(attr.active_width))
+        print('active_speed:  ',   d.speed_to_str(attr.active_speed))
+        print('END of Diagnostics ')
         assert 'Invalid' not in d.speed_to_str(attr.active_speed)
         assert 'Invalid' not in d.translate_link_layer(attr.link_layer)
         assert attr.max_msg_sz > 0x1000


...and the test run from that is:

# ./build/bin/run_tests.py --verbose tests.test_device.DeviceTest
test_dev_list (tests.test_device.DeviceTest) ... ok
test_open_dev (tests.test_device.DeviceTest) ... ok
test_query_device (tests.test_device.DeviceTest) ... ok
test_query_device_ex (tests.test_device.DeviceTest) ... ok
test_query_gid (tests.test_device.DeviceTest) ... ok
test_query_port (tests.test_device.DeviceTest) ... 
Diagnostics ===
phys_state: Link up (5)
active_width):  4X (2)
active_speed:   25.0 Gbps (32)
END of Diagnostics 

Diagnostics ===
phys_state: Disabled (3)
active_width):  4X (2)
active_speed:   Invalid speed
END of Diagnostics 
FAIL
test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok

==
FAIL: test_query_port (tests.test_device.DeviceTest)
--
Traceback (most recent call last):
  File "/kernel_work/rdma-core/tests/test_device.py", line 135, in 
test_query_port
self.verify_port_attr(port_attr)
  File "/kernel_work/rdma-core/tests/test_device.py", line 119, in 
verify_port_attr
assert 'Invalid' not in d

Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread John Hubbard
On 12/20/19 10:29 AM, Leon Romanovsky wrote:
...
>> $ ./build.sh
>> $ build/bin/run_tests.py
>>
>> If you get things that far I think Leon can get a reproduction for you
> 
> I'm not so optimistic about that.
> 

OK, I'm going to proceed for now on the assumption that I've got an overflow
problem that happens when huge pages are pinned. If I can get more information,
great, otherwise it's probably enough.

One thing: for your repro, if you know the huge page size, and the system
page size for that case, that would really help. Also the number of pins per
page, more or less, that you'd expect. Because Jason says that only 2M huge 
pages are used...

Because the other possibility is that the refcount really is going negative, 
likely due to a mismatched pin/unpin somehow.

If there's not an obvious repro case available, but you do have one (is it easy
to repro, though?), then *if* you have the time, I could point you to a github
branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bb44c4d2ada7..8526fd03b978 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
  * get_user_pages and page_mkclean and other calls that race to set up page
  * table entries.
  */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define GUP_PIN_COUNTING_BIAS (1U << 8)
 
 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,

If that fails to repro, then we would be zeroing in on the root cause. 
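
Roughly, the arithmetic behind this experiment, assuming 2M huge pages with
4KB subpages (as Jason describes) and a 32-bit page->_refcount:

/*
 * A full pin of a 2M huge page covers 512 subpages, each adding
 * GUP_PIN_COUNTING_BIAS to the head page's _refcount:
 *
 *   bias = 1U << 10:  512 * 1024 = 2^19 per pin  ->  ~2^31 / 2^19 = 4096 pins
 *   bias = 1U <<  8:  512 *  256 = 2^17 per pin  ->  ~2^31 / 2^17 = 16384 pins
 *
 * So if the crash stops reproducing with the smaller bias, _refcount
 * overflow is the likely culprit; if it still reproduces, a mismatched
 * pin/unpin (refcount going negative) looks more likely.
 */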

The branch is here (I just tested it and it seems healthy):

g...@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags



thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread John Hubbard
On 12/20/19 1:21 AM, Jan Kara wrote:
...
>> So, ideas and next steps:
>>
>> 1. Assuming that you *are* hitting this, I think I may have to fall back to
>> implementing the "deferred" part of this design, as part of this series, 
>> after
>> all. That means:
>>
>>   For the pin/unpin calls at least, stop treating all pages as if they are
>>   a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
>>   That's not how it works now, and the need to hand back a huge array of
>>   subpages is part of the problem. This affects the callers too, so it's not
>>   a super quick change to make. (I was really hoping not to have to do this
>>   yet.)
> 
> Does that mean that you would need to make all GUP users huge page aware?
> Otherwise I don't see how what you suggest would work... And I don't think
> making all GUP users huge page aware is realistic (effort-wise) or even
> wanted (maintenance overhead in all those places).
> 

Well, pretty much yes. It's really just the pin_user_pages*() callers, but
the internals, follow_page() and such, are so interconnected right now that
it would probably blow up into a huge effort, as you point out.

> I believe there might be also a different solution for this: For
> transparent huge pages, we could find a space in 'struct page' of the
> second page in the huge page for proper pin counter and just account pins
> there so we'd have full width of 32-bits for it.
> 
>   Honza
> 

OK, let me pursue that. Given that I shouldn't need to handle page
splitting, it should not be *too* bad.

I am starting to think that I should just post the first 9 or so 
prerequisite patches (first 9 patches, plus the v4l2 fix that arguably should 
have been earlier in the sequence I guess), as 5.6 candidates, while I go
back to the drawing board here. 

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread Dan Williams
On Fri, Dec 20, 2019 at 5:34 AM Jason Gunthorpe  wrote:
>
> On Thu, Dec 19, 2019 at 01:13:54PM -0800, John Hubbard wrote:
> > On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> > > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> > > > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This implements an API naming change (put_user_page*() -->
> > > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN 
> > > > > > pages. It
> > > > > > extends that tracking to a few select subsystems. More subsystems 
> > > > > > will
> > > > > > be added in follow up work.
> > > > >
> > > > > Hi John,
> > > > >
> > > > > The patchset generates kernel panics in our IB testing. In our tests, 
> > > > > we
> > > > > allocated single memory block and registered multiple MRs using the 
> > > > > single
> > > > > block.
> > > > >
> > > > > The possible bad flow is:
> > > > >   ib_umem_get() ->
> > > > >    pin_user_pages_fast(FOLL_WRITE) ->
> > > > >     internal_get_user_pages_fast(FOLL_WRITE) ->
> > > > >      gup_pgd_range() ->
> > > > >       gup_huge_pd() ->
> > > > >        gup_hugepte() ->
> > > > >         try_grab_compound_head() ->
> > > >
> > > > Hi Leon,
> > > >
> > > > Thanks very much for the detailed report! So we're overflowing...
> > > >
> > > > At first look, this seems likely to be hitting a weak point in the
> > > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be 
> > > > deferred
> > > > (there's a writeup in Documentation/core-api/pin_user_pages.rst, lines
> > > > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > > > with huge pages if the pages have a *lot* of subpages.
> > > >
> > > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> > >
> > > Considering that establishing these pins is entirely under user
> > > control, we can't have a limit here.
> >
> > There's already a limit, it's just a much larger one. :) What does "no 
> > limit"
> > really mean, numerically, to you in this case?
>
> I guess I mean 'hidden limit' - hitting the limit and failing would
> be manageable.
>
> I think 7 is probably too low though, but we are not using 1GB huge
> pages, only 2M..

What about RDMA to 1GB-hugetlbfs and 1GB-device-dax mappings?


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread Dan Williams
On Fri, Dec 20, 2019 at 1:22 AM Jan Kara  wrote:
>
> On Thu 19-12-19 12:30:31, John Hubbard wrote:
> > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > Hi,
> > > >
> > > > This implements an API naming change (put_user_page*() -->
> > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > > extends that tracking to a few select subsystems. More subsystems will
> > > > be added in follow up work.
> > >
> > > Hi John,
> > >
> > > The patchset generates kernel panics in our IB testing. In our tests, we
> > > allocated single memory block and registered multiple MRs using the single
> > > block.
> > >
> > > The possible bad flow is:
> > >   ib_umem_get() ->
> > >    pin_user_pages_fast(FOLL_WRITE) ->
> > >     internal_get_user_pages_fast(FOLL_WRITE) ->
> > >      gup_pgd_range() ->
> > >       gup_huge_pd() ->
> > >        gup_hugepte() ->
> > >         try_grab_compound_head() ->
> >
> > Hi Leon,
> >
> > Thanks very much for the detailed report! So we're overflowing...
> >
> > At first look, this seems likely to be hitting a weak point in the
> > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> > (there's a writeup in Documentation/core-api/pin_user_pages.rst, lines
> > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > with huge pages if the pages have a *lot* of subpages.
> >
> > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> > Do you have any idea how many pins (repeated pins on the same page, which
> > it sounds like you have) might be involved in your test case,
> > and the huge page and system page sizes? That would allow calculating
> > if we're likely overflowing for that reason.
> >
> > So, ideas and next steps:
> >
> > 1. Assuming that you *are* hitting this, I think I may have to fall back to
> > implementing the "deferred" part of this design, as part of this series, 
> > after
> > all. That means:
> >
> >   For the pin/unpin calls at least, stop treating all pages as if they are
> >   a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
> >   That's not how it works now, and the need to hand back a huge array of
> >   subpages is part of the problem. This affects the callers too, so it's not
> >   a super quick change to make. (I was really hoping not to have to do this
> >   yet.)
>
> Does that mean that you would need to make all GUP users huge page aware?
> Otherwise I don't see how what you suggest would work... And I don't think
> making all GUP users huge page aware is realistic (effort-wise) or even
> wanted (maintenance overhead in all those places).
>
> I believe there might be also a different solution for this: For
> transparent huge pages, we could find a space in 'struct page' of the
> second page in the huge page for proper pin counter and just account pins
> there so we'd have full width of 32-bits for it.

That would require THP accounting for dax pages. It is something that
was probably going to be needed, but this would seem to force the
issue.


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread John Hubbard
On 12/20/19 4:33 PM, Dan Williams wrote:
...
>> I believe there might be also a different solution for this: For
>> transparent huge pages, we could find a space in 'struct page' of the
>> second page in the huge page for proper pin counter and just account pins
>> there so we'd have full width of 32-bits for it.
> 
> That would require THP accounting for dax pages. It is something that
> was probably going to be needed, but this would seem to force the
> issue.
> 

Thanks for mentioning that, it wasn't obvious to me yet. 

How easy is it for mere mortals outside of Intel, to set up a DAX (nvdimm?)
test setup? I'd hate to go into this without having that coverage up
and running. It's been sketchy enough as it is. :)

thanks,
-- 
John Hubbard
NVIDIA


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread Dan Williams
On Fri, Dec 20, 2019 at 4:41 PM John Hubbard  wrote:
>
> On 12/20/19 4:33 PM, Dan Williams wrote:
> ...
> >> I believe there might be also a different solution for this: For
> >> transparent huge pages, we could find a space in 'struct page' of the
> >> second page in the huge page for proper pin counter and just account pins
> >> there so we'd have full width of 32-bits for it.
> >
> > That would require THP accounting for dax pages. It is something that
> > was probably going to be needed, but this would seem to force the
> > issue.
> >
>
> Thanks for mentioning that, it wasn't obvious to me yet.
>
> How easy is it for mere mortals outside of Intel, to set up a DAX (nvdimm?)
> test setup? I'd hate to go into this without having that coverage up
> and running. It's been sketchy enough as it is. :)

You too can have the power of the gods for the low low price of a
kernel command line parameter, or a qemu setup.

Details here:

https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
https://nvdimm.wiki.kernel.org/pmem_in_qemu


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread John Hubbard
On 12/20/19 4:51 PM, Dan Williams wrote:
> On Fri, Dec 20, 2019 at 4:41 PM John Hubbard  wrote:
>>
>> On 12/20/19 4:33 PM, Dan Williams wrote:
>> ...
 I believe there might be also a different solution for this: For
 transparent huge pages, we could find a space in 'struct page' of the
 second page in the huge page for proper pin counter and just account pins
 there so we'd have full width of 32-bits for it.
>>>
>>> That would require THP accounting for dax pages. It is something that
>>> was probably going to be needed, but this would seem to force the
>>> issue.
>>>
>>
>> Thanks for mentioning that, it wasn't obvious to me yet.
>>
>> How easy is it for mere mortals outside of Intel, to set up a DAX (nvdimm?)
>> test setup? I'd hate to go into this without having that coverage up
>> and running. It's been sketchy enough as it is. :)
> 
> You too can have the power of the gods for the low low price of a
> kernel command line parameter, or a qemu setup.
> 
> Details here:
> 
> https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
> https://nvdimm.wiki.kernel.org/pmem_in_qemu
> 

Sweet! Now I can really cause some damage. :)

thanks,
-- 
John Hubbard
NVIDIA


Re: PPC64: G5 & 4k/64k page size (was: Re: Call for report - G5/PPC970 status)

2019-12-20 Thread Aneesh Kumar K.V
Romain Dolbeau  writes:

> On Thu, Dec 12, 2019 at 22:40, Andreas Schwab  wrote:
>> I'm using 4K pages, in case that matters
>
> Yes it does matter, as it seems to be the difference between "working"
> and "not working" :-)
> Thank you for the config & pointing out the culprit!
>
> With your config, my machine boots (though it's missing some features
> as the config seems quite tuned).
>
> Moving from 64k pages to 4k pages on 'my' config (essentially,
> Debian's 5.3 with default values for changes since), my machine boots
> as well & everything seems to work fine.
>
> So question to Aneesh - did you try 64k pages on your G5, or only 4k?
> In the second case, could you try with 64k to see if you can reproduce
> the crash?

I don't have direct access to this system, I have asked if we can get a run
with 64K. 

Meanwhile, is there a way to find out what caused the machine check? Any more
details on this? I was checking the manual and I don't see any
restrictions w.r.t. the effective address. We now have a very high EA with a
64K page size.

-aneesh