Re: [PATCH 01/20] kernel/dma/direct: take DMA offset into account in dma_direct_supported

2018-08-22 Thread Benjamin Herrenschmidt
On Thu, 2018-08-23 at 07:24 +0200, Christoph Hellwig wrote:
> > Well, iommus can have bypass regions, which we also use for
> > performance, so we do at dma_set_mask() time "swap" the ops around, and
> > in that case, we do want to check the mask against the actual top of
> > memory...
> 
> That is a bit of a powerpc special case (we also had one other arch
> doing that, but it got removed in the great purge, can't remember which
> one right now).  Everyone else has one set of ops, and they just switch
> to the direct mapping inside the iommu ops.

We more or less do that too in some of ours these days, because of the whole
coherent_mask vs. mask issue, where a given device might need either depending
on the type of mapping.

Ben.



Re: [PATCH 02/20] kernel/dma/direct: refine dma_direct_alloc zone selection

2018-08-22 Thread Christoph Hellwig
On Thu, Aug 23, 2018 at 10:01:45AM +1000, Benjamin Herrenschmidt wrote:
> > The general scheme that architectures should implement is:
> > 
> > ZONE_DMA:   Any memory below a magic threshold that is lower than
> > 32-bit.  Only enabled if actually required (usually
> > either 24-bit for ISA, or some other weird architecture
> > specific value like 32-bit for S/390)
> 
> It should have been ZONE_ISA_DMA :-)

For most of these use cases it should have been indeed, and that
would avoid a lot of confusion where people use GFP_DMA just because
they do DMA.

Anyway, switching powerpc to this scheme would be great, but I don't
think it is required - GFP_KERNEL will silently fall back to ZONE_DMA,
so except for an additional GFP_DMA fallback allocation when the
GFP_KERNEL one fails, the code should just work.
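The fallback in question is the retry inside dma_direct_alloc(); it has roughly
this shape (a sketch only, names approximate, not the literal 4.19 code):

	page = alloc_pages_node(dev_to_node(dev), gfp, get_order(size));
	if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
		/* the GFP_KERNEL allocation landed above what the device's
		 * mask can reach: free it and retry from a lower zone */
		__free_pages(page, get_order(size));
		page = alloc_pages_node(dev_to_node(dev), gfp | GFP_DMA,
					get_order(size));
	}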


Re: [PATCH 01/20] kernel/dma/direct: take DMA offset into account in dma_direct_supported

2018-08-22 Thread Christoph Hellwig
On Thu, Aug 23, 2018 at 09:59:18AM +1000, Benjamin Herrenschmidt wrote:
> > Yeah, the other platforms that support these devices support ZONE_DMA
> > to reliably handle these devices. But there are two other ways the
> > current code would actually handle these fine despite the dma_direct
> > checks:
> > 
> >  1) if the device only has physical addresses up to 31-bit anyway
> >  2) by trying again to find a lower address.  But this only works
> > for coherent allocations and not streaming maps (unless we have
> > swiotlb with a buffer below 31-bits).
> > 
> > It seems powerpc can have ZONE_DMA, though and we will cover these
> > devices just fine.  If it didn't have that the current powerpc
> > code would not work either.
> 
> Not exactly. powerpc has ZONE_DMA covering all of system memory.
> 
> What happens in ppc32 is that we somewhat "know" that none of the
> systems with those stupid 31-bit limited pieces of HW is capable of
> having more than 2GB of memory anyway.
> 
> So we get away with just returning "1".

I think I can come up with a proper way of handling that by checking
the actual amount of physical memory present instead of the hard-coded
32-bit limit.
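Presumably something along these lines (a hedged sketch of that idea, not the
actual patch):

	static int dma_direct_supported(struct device *dev, u64 mask)
	{
		/* good enough if the mask covers the DMA address of the
		 * highest byte of RAM, rather than a hard-coded 32-bit */
		return mask >= phys_to_dma(dev, memblock_end_of_DRAM() - 1);
	}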

> > If your PCI bridge / PCIe root port doesn't support dma to addresses
> > larger than 32-bit the device capabilities above that don't matter, it
> > just won't work.  We have this case at least for some old VIA x86 chipsets
> > and some relatively modern Xilinx FPGAs with PCIe.
> 
> Hrm... that's the usual confusion dma_capable() vs. dma_set_mask().
> 
> It's always been perfectly fine for a driver to do a dma_set_mask(64-
> bit) on a system where the bridge can only do 32-bits ...

No, it hasn't.  That's why we have this pattern of trying a 64-bit
mask first and then setting a 32-bit mask if that fails all over
drivers/.  However with all the work we've done over the last month
we are getting really close to a world where:

 - the driver just does one dma_set_mask for the capabilities and
   stores that in the dma_mask
 - other limitations go elsewhere and will be automatically taken
   into account.

Which is I guess what you always wanted, but which wasn't how things
actually worked before.
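The driver-side pattern being retired here is the familiar one (illustrative
only, not quoted from any particular driver):

	if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64))) {
		/* 64-bit DMA not usable on this system, fall back to 32-bit */
		if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32))) {
			dev_err(&pdev->dev, "no usable DMA configuration\n");
			return -EIO;
		}
	}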

> We shouldn't fail there, we should instead "clamp" the mask to 32-bit,
> see what I mean ? It doesn't matter that the device itself is capable
> of issuing >32 addresses, I agree, but what we need to express is that
> the combination device+bridge doesn't want addresses above 32-bit, so
> it's equivalent to making the device do a set_mask(32-bit).

As said, we'll get there (but with the new separate bus_dma_mask in 4.19),
but this is not how things currently work.
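With the separate bus_dma_mask, the clamping can then live in the dma-direct
core instead of failing dma_set_mask(); conceptually (hedged sketch, the helper
name is made up for illustration):

	static u64 dma_direct_effective_mask(struct device *dev)
	{
		/* device capability and bus/bridge limit are tracked
		 * separately; 0 in bus_dma_mask means "no bus limit" */
		return min_not_zero(*dev->dma_mask, dev->bus_dma_mask);
	}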

> > Your observation is right, but there always has been the implicit
> > assumption that architectures with more than 4GB of physical address
> > space must either support an iommu or swiotlb and use that.  It's
> > never been document anywhere, but I'm working on integrating all
> > this code to make more sense.
> 
> Well, iommus can have bypass regions, which we also use for
> performance, so we do at dma_set_mask() time "swap" the ops around, and
> in that case, we do want to check the mask against the actual top of
> memory...

That is a bit of a powerpc special case (we also had one other arch
doing that, but it got removed in the great purge, can't remember which
one right now).  Everyone else has one set of ops, and they just switch
to the direct mapping inside the iommu ops.


[PATCH v2] powerpc/mce: Fix SLB rebolting during MCE recovery path.

2018-08-22 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

With the powerpc next commit e7e81847478 (powerpc/mce: Fix SLB rebolting
during MCE recovery path.), the SLB error recovery is broken. The new
change no longer ORs the index value into RB[52-63], which selects the SLB
entry while rebolting; instead it assumes that the shadow save area
already has the index embedded correctly in the esid field. While all valid
bolted save areas do contain the index set correctly, there is a case where
the 3rd (KSTACK_INDEX) entry for the kernel stack does not embed the index
in its NULL esid entry. This patch fixes that.

Without this patch the SLB rebolt code overwrites the 1st entry of the kernel
linear mapping and causes SLB recovery to fail.
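To illustrate why the index matters (a toy user-space program, not kernel code,
with made-up esid values): slbmte takes the target entry number in RB[52-63],
i.e. the low 12 bits of the value loaded from the shadow esid field, so an
all-zero esid always selects slot 0.

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t shadow_esid[3] = {
			0xc000000008000000ULL | 0, /* entry 0: kernel linear mapping */
			0xd000000008000000ULL | 1, /* entry 1: vmalloc */
			0,                         /* entry 2 (KSTACK): cleared -- the bug */
		};

		for (int i = 0; i < 3; i++)
			printf("shadow entry %d rebolts SLB slot %llu\n",
			       i, (unsigned long long)(shadow_esid[i] & 0xFFF));

		/* the fix: an invalid entry keeps its own index in the esid
		 * field, so the rebolt loop no longer clobbers slot 0 */
		shadow_esid[2] = 2;
		printf("after the fix, entry 2 rebolts SLB slot %llu\n",
		       (unsigned long long)(shadow_esid[2] & 0xFFF));
		return 0;
	}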

Signed-off-by: Mahesh Salgaonkar 
Signed-off-by: Nicholas Piggin 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/mm/slb.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index 0b095fa54049..9f574e59d178 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -70,7 +70,7 @@ static inline void slb_shadow_update(unsigned long ea, int 
ssize,
 
 static inline void slb_shadow_clear(enum slb_index index)
 {
-   WRITE_ONCE(get_slb_shadow()->save_area[index].esid, 0);
+   WRITE_ONCE(get_slb_shadow()->save_area[index].esid, cpu_to_be64(index));
 }
 
 static inline void create_shadowed_slbe(unsigned long ea, int ssize,



Re: [PATCH] powerpc/mce: Fix SLB rebolting during MCE recovery path.

2018-08-22 Thread Nicholas Piggin
On Thu, 23 Aug 2018 09:58:31 +0530
Mahesh Jagannath Salgaonkar  wrote:

> On 08/21/2018 03:57 PM, Nicholas Piggin wrote:
> > On Fri, 17 Aug 2018 14:51:47 +0530
> > Mahesh J Salgaonkar  wrote:
> >   
> >> From: Mahesh Salgaonkar 
> >>
> >> With the powerpc next commit e7e81847478 (powerpc/mce: Fix SLB rebolting
> >> during MCE recovery path.), the SLB error recovery is broken. The
> >> commit missed a crucial change of OR-ing index value to RB[52-63] which
> >> selects the SLB entry while rebolting. This patch fixes that.
> >>
> >> Signed-off-by: Mahesh Salgaonkar 
> >> Reviewed-by: Nicholas Piggin 
> >> ---
> >>  arch/powerpc/mm/slb.c |5 -
> >>  1 file changed, 4 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
> >> index 0b095fa54049..6dd9913425bc 100644
> >> --- a/arch/powerpc/mm/slb.c
> >> +++ b/arch/powerpc/mm/slb.c
> >> @@ -101,9 +101,12 @@ void __slb_restore_bolted_realmode(void)
> >>  
> >> /* No isync needed because realmode. */
> >>for (index = 0; index < SLB_NUM_BOLTED; index++) {
> >> +  unsigned long rb = be64_to_cpu(p->save_area[index].esid);
> >> +
> >> +  rb = (rb & ~0xFFFul) | index;
> >>asm volatile("slbmte  %0,%1" :
> >> : "r" (be64_to_cpu(p->save_area[index].vsid)),
> >> - "r" (be64_to_cpu(p->save_area[index].esid)));
> >> + "r" (rb));
> >>}
> >>  }
> >>  
> >>  
> > 
> > I'm just looking at this again. The bolted save areas do have the
> > index field set. So for the OS, your patch should be equivalent to
> > this, right?
> > 
> >  static inline void slb_shadow_clear(enum slb_index index)
> >  {
> > -   WRITE_ONCE(get_slb_shadow()->save_area[index].esid, 0);
> > +   WRITE_ONCE(get_slb_shadow()->save_area[index].esid, index);
> >  }
> > 
> > Which seems like a better fix.  
> 
> Yeah this also fixes the issue. The only additional change required is
> cpu_to_be64(index).

Ah yep.

> As long as we maintain index in bolted save areas
> (for valid/invalid entries) we should be ok. Will respin v2 with this
> change.

Cool, Reviewed-by: Nicholas Piggin  in that case :)

Thanks,
Nick


Re: [PATCH] powerpc/mce: Fix SLB rebolting during MCE recovery path.

2018-08-22 Thread Mahesh Jagannath Salgaonkar
On 08/21/2018 03:57 PM, Nicholas Piggin wrote:
> On Fri, 17 Aug 2018 14:51:47 +0530
> Mahesh J Salgaonkar  wrote:
> 
>> From: Mahesh Salgaonkar 
>>
>> With the powerpc next commit e7e81847478 (powerpc/mce: Fix SLB rebolting
>> during MCE recovery path.), the SLB error recovery is broken. The
>> commit missed a crucial change of OR-ing index value to RB[52-63] which
>> selects the SLB entry while rebolting. This patch fixes that.
>>
>> Signed-off-by: Mahesh Salgaonkar 
>> Reviewed-by: Nicholas Piggin 
>> ---
>>  arch/powerpc/mm/slb.c |5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
>> index 0b095fa54049..6dd9913425bc 100644
>> --- a/arch/powerpc/mm/slb.c
>> +++ b/arch/powerpc/mm/slb.c
>> @@ -101,9 +101,12 @@ void __slb_restore_bolted_realmode(void)
>>  
>>   /* No isync needed because realmode. */
>>  for (index = 0; index < SLB_NUM_BOLTED; index++) {
>> +unsigned long rb = be64_to_cpu(p->save_area[index].esid);
>> +
>> +rb = (rb & ~0xFFFul) | index;
>>  asm volatile("slbmte  %0,%1" :
>>   : "r" (be64_to_cpu(p->save_area[index].vsid)),
>> -   "r" (be64_to_cpu(p->save_area[index].esid)));
>> +   "r" (rb));
>>  }
>>  }
>>  
>>
> 
> I'm just looking at this again. The bolted save areas do have the
> index field set. So for the OS, your patch should be equivalent to
> this, right?
> 
>  static inline void slb_shadow_clear(enum slb_index index)
>  {
> -   WRITE_ONCE(get_slb_shadow()->save_area[index].esid, 0);
> +   WRITE_ONCE(get_slb_shadow()->save_area[index].esid, index);
>  }
> 
> Which seems like a better fix.

Yeah this also fixes the issue. The only additional change required is
cpu_to_be64(index). As long as we maintain index in bolted save areas
(for valid/invalid entries) we should be ok. Will respin v2 with this
change.

Thanks,
-Mahesh.



Re: [PATCH 1/2] macintosh: therm_windtunnel: drop using attach_adapter

2018-08-22 Thread Michael Ellerman
Wolfram Sang  writes:

> As we now have deferred probing, we can use a custom mechanism and
> finally get rid of the legacy interface from the i2c core.
>
> Signed-off-by: Wolfram Sang 
> ---
>  drivers/macintosh/therm_windtunnel.c | 25 +++--
>  1 file changed, 23 insertions(+), 2 deletions(-)

I don't have a G4 to test this on, so merge it and if it breaks we can
fix it up later.

Acked-by: Michael Ellerman 

cheers

> diff --git a/drivers/macintosh/therm_windtunnel.c 
> b/drivers/macintosh/therm_windtunnel.c
> index 68dcbcb4fc5b..8c744578122a 100644
> --- a/drivers/macintosh/therm_windtunnel.c
> +++ b/drivers/macintosh/therm_windtunnel.c
> @@ -432,7 +432,6 @@ static struct i2c_driver g4fan_driver = {
>   .driver = {
>   .name   = "therm_windtunnel",
>   },
> - .attach_adapter = do_attach,
>   .probe  = do_probe,
>   .remove = do_remove,
>   .id_table   = therm_windtunnel_id,
> @@ -445,7 +444,29 @@ static struct i2c_driver g4fan_driver = {
>  
>  static int therm_of_probe(struct platform_device *dev)
>  {
> - return i2c_add_driver( &g4fan_driver );
> + struct i2c_adapter *adap;
> + int ret, i = 0;
> +
> + adap = i2c_get_adapter(0);
> + if (!adap)
> + return -EPROBE_DEFER;
> +
> + ret = i2c_add_driver(&g4fan_driver);
> + if (ret) {
> + i2c_put_adapter(adap);
> + return ret;
> + }
> +
> + /* We assume Macs have consecutive I2C bus numbers starting at 0 */
> + while (adap) {
> + do_attach(adap);
> + if (x.running)
> + return 0;
> + i2c_put_adapter(adap);
> + adap = i2c_get_adapter(++i);
> + }
> +
> + return -ENODEV;
>  }
>  
>  static int
> -- 
> 2.11.0


Re: [PATCH 0/2] i2c: remove deprecated attach_adapter callback

2018-08-22 Thread Michael Ellerman
Wolfram Sang  writes:

> So, I wanted to do this in the next cycle, but Linus seems to want it this
> cycle already [1], so here it is:
>
> Remove the attach_adapter callback from the 2.4 times by converting
> the last user to a custom probing mechanism based on deferred probing. We used
> this already in commit ac397c80de89 ("ALSA: ppc: keywest: drop using attach
> adapter") successfully on HW, so we agreed to use it on the windtunnel driver
> as well.
>
> With the last user gone, we can then remove the callback \o/ I think this
> allows for more cleanup in the core, but let's do this later and focus on the
> removal for now.
>
> Tested on a Renesas R-Car Salvator-XS board (M3N) by using and rebinding
> various I2C busses. Build bot and checkpatch are happy, too.
>
> I'd like to send a pull request to Linus this merge window, so looking forward
> to super fast comments, acks, etc...

Sure, I don't have a G4 hooked up to test this, so just merge it and if
it breaks we can fix it.

cheers


Re: [PATCH] powerpc/xive: Initialize symbol before usage

2018-08-22 Thread Michael Ellerman
Hi Breno,

Breno Leitao  writes:
> Function xive_native_get_ipi() might use chip_id without it being
> initialized. This gives the following error on 'smatch' tool:
>
>   error: uninitialized symbol 'chip_id'

Which is correct, it can be used uninitialised. I'm surprised GCC
doesn't warn about it.

> This patch simply sets chip_id initial value to 0.

I'd prefer we fixed it differently, by explicitly initialising to zero
at the appropriate place in the code.

> diff --git a/arch/powerpc/sysdev/xive/native.c 
> b/arch/powerpc/sysdev/xive/native.c
> index 311185b9960a..fc56673a3c0f 100644
> --- a/arch/powerpc/sysdev/xive/native.c
> +++ b/arch/powerpc/sysdev/xive/native.c
> @@ -239,7 +239,7 @@ static bool xive_native_match(struct device_node *node)
>  static int xive_native_get_ipi(unsigned int cpu, struct xive_cpu *xc)
>  {
>   struct device_node *np;
> - unsigned int chip_id;
> + unsigned int chip_id = 0;
>   s64 irq;
>  
>   /* Find the chip ID */

The current code is:

	/* Find the chip ID */
	np = of_get_cpu_node(cpu, NULL);
	if (np) {
		if (of_property_read_u32(np, "ibm,chip-id", &chip_id) < 0)
			chip_id = 0;
	}

Where if np is NULL then we don't initialise chip_id.

Which could be:

	np = of_get_cpu_node(cpu, NULL);
	if (of_property_read_u32(np, "ibm,chip-id", &chip_id) < 0)
		chip_id = 0;

Because of_property_read_u32() will just return an error if np is NULL.

It's also missing an of_node_put() of np, you should do a separate patch
to fix that. You can just do it unconditionally after the
of_property_read_u32().
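Putting the two together, something like this should do (untested sketch):

	np = of_get_cpu_node(cpu, NULL);
	/* of_property_read_u32() rejects a NULL node, so np doesn't need to
	 * be checked first; default the chip ID explicitly on any failure */
	if (of_property_read_u32(np, "ibm,chip-id", &chip_id) < 0)
		chip_id = 0;
	of_node_put(np);	/* of_node_put(NULL) is a no-op */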

cheers


Re: Infinite looping observed in __offline_pages

2018-08-22 Thread Aneesh Kumar K.V

On 08/23/2018 12:28 AM, Mike Kravetz wrote:
> On 08/22/2018 02:30 AM, Aneesh Kumar K.V wrote:
>> commit 2e9d754ac211f2af3731f15df3cd8cd070b4cc54
>> Author: Aneesh Kumar K.V 
>> Date:   Tue Aug 21 14:17:55 2018 +0530
>>
>>     mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.
>>
>>     When scanning for movable pages, filter out Hugetlb pages if hugepage
>>     migration is not supported. Without this we hit an infinite loop in
>>     __offline_pages where we do
>>         pfn = scan_movable_pages(start_pfn, end_pfn);
>>         if (pfn) { /* We have movable pages */
>>             ret = do_migrate_range(pfn, end_pfn);
>>             goto repeat;
>>         }
>>
>>     We do support hugetlb migration only if the hugetlb pages are at pmd level. Here
>
> I thought migration at pgd level was added for POWER?  commit 94310cbcaa3c
> (mm/madvise: enable (soft|hard) offline of HugeTLB pages at PGD level).
> Only remember, because I did not fully understand the use case. :)

yes. We hit the issue on older distro kernels.

>>     we just check for Kernel config. The gigantic page size check is done in
>>     page_huge_active.
>>
>>     Reported-by: Haren Myneni 
>>     CC: Naoya Horiguchi 
>>     Signed-off-by: Aneesh Kumar K.V 
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 4eb6e824a80c..f9bdea685cf4 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1338,7 +1338,8 @@ static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
>>  			return pfn;
>>  		if (__PageMovable(page))
>>  			return pfn;
>> -		if (PageHuge(page)) {
>> +		if (IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION) &&
>> +		    PageHuge(page)) {
>
> How about using hugepage_migration_supported instead?  It would automatically
> catch those non-migratable huge page sizes.  Something like:

Will do that.

> if (PageHuge(page) &&
>     hugepage_migration_supported(page_hstate(page))) {

-aneesh



Re: DT case sensitivity

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 20:26 -0500, Rob Herring wrote:
> On Wed, Aug 22, 2018 at 8:14 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > On Wed, 2018-08-22 at 19:47 -0500, Rob Herring wrote:
> > > The default DT string handling in the kernel is node names and
> > > compatibles are case insensitive and property names are case sensitive
> > > (Sparc is the only variation and is opposite). It seems only PPC
> > > (and perhaps only Power Macs?) needs to support case insensitive
> > > comparisons. It was probably a mistake to follow PPC for new arches
> > > and we should have made everything case sensitive from the start. So I
> > > have a few questions for the DT historians. :)
> > 
> > Open Firmware itself is insensitive.
> 
> Doesn't it depend on the implementation? Otherwise, how is Sparc different?

Not sure ... Forth itself is insensitive for words but maybe not for
string comparisons.

> 
> > > What PPC systems are case insensitive? Can we limit that to certain 
> > > systems?
> > 
> > All PowerMacs at least, the problem is that I don't have DT images or
> > access to all the historical systems (and yes some people occasionally
> > still use them) to properly test a change in that area.
> 
> I'm tempted to break them so I can find folks to provide me with DT dumps. :)

I have a collection of DT dumps but I'm not sure about the legality of
publishing them...

Cheers,
Ben.




Re: DT case sensitivity

2018-08-22 Thread Rob Herring
On Wed, Aug 22, 2018 at 8:14 PM Benjamin Herrenschmidt
 wrote:
>
> On Wed, 2018-08-22 at 19:47 -0500, Rob Herring wrote:
> > The default DT string handling in the kernel is node names and
> > compatibles are case insensitive and property names are case sensitive
> > (Sparc is the only variation and is opposite). It seems only PPC
> > (and perhaps only Power Macs?) needs to support case insensitive
> > comparisons. It was probably a mistake to follow PPC for new arches
> > and we should have made everything case sensitive from the start. So I
> > have a few questions for the DT historians. :)
>
> Open Firmware itself is insensitive.

Doesn't it depend on the implementation? Otherwise, how is Sparc different?

> > What PPC systems are case insensitive? Can we limit that to certain systems?
>
> All PowerMacs at least, the problem is that I don't have DT images or
> access to all the historical systems (and yes some people occasionally
> still use them) to properly test a change in that area.

I'm tempted to break them so I can find folks to provide me with DT dumps. :)

Rob


Re: Odd SIGSEGV issue introduced by commit 6b31d5955cb29 ("mm, oom: fix potential data corruption when oom_reaper races with writer")

2018-08-22 Thread Michael Ellerman
Christophe LEROY  writes:
> Hello,
>
> I have an odd issue on my powerpc 8xx board.
>
> I am running latest 4.14 and get the following SIGSEGV which appears 
> more or less randomly.
>
> [9.190354] touch[91]: unhandled signal 11 at 67807b58 nip 777cf114 
> lr 777cf100 code 30001
> [   24.634810] ifconfig[160]: unhandled signal 11 at 67ae7b58 nip 
> 77aaf114 lr 77aaf100 code 30001


It would be interesting to see the code dump here and which registers
are being used.

Can you backport the show unhandled signal changes and see what that
shows us?

cheers


Re: DT case sensitivity

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 19:47 -0500, Rob Herring wrote:
> The default DT string handling in the kernel is node names and
> compatibles are case insensitive and property names are case sensitive
> (Sparc is the only variation and is opposite). It seems only PPC
> (and perhaps only Power Macs?) needs to support case insensitive
> comparisons. It was probably a mistake to follow PPC for new arches
> and we should have made everything case sensitive from the start. So I
> have a few questions for the DT historians. :)

Open Firmware itself is insensitive.

> What PPC systems are case insensitive? Can we limit that to certain systems?

All PowerMacs at least, the problem is that I don't have DT images or
access to all the historical systems (and yes some people occasionally
still use them) to properly test a change in that area.

> AFAICT, dtc at least (if not anything FDT based) has always been case
> sensitive at least for node and property names. I'm not sure about
> compatible strings?
> 
> Anyone see potential issues with switching all platforms except PPC
> and Sparc to case sensitive comparisons?

Cheers,
Ben.



DT case sensitivity

2018-08-22 Thread Rob Herring
The default DT string handling in the kernel is node names and
compatibles are case insensitive and property names are case sensitive
(Sparc is the only variation and is opposite). It seems only PPC
(and perhaps only Power Macs?) needs to support case insensitive
comparisons. It was probably a mistake to follow PPC for new arches
and we should have made everything case sensitive from the start. So I
have a few questions for the DT historians. :)

What PPC systems are case insensitive? Can we limit that to certain systems?

AFAICT, dtc at least (if not anything FDT based) has always been case
sensitive at least for node and property names. I'm not sure about
compatible strings?

Anyone see potential issues with switching all platforms except PPC
and Sparc to case sensitive comparisons?
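For reference, "case insensitive by default" boils down to the generic string
comparison helpers, i.e. the non-Sparc definitions in include/linux/of.h
(quoted from memory here, check the tree for the authoritative copy):

	#define of_compat_cmp(s1, s2, l)	strcasecmp((s1), (s2))
	#define of_prop_cmp(s1, s2)		strcmp((s1), (s2))
	#define of_node_cmp(s1, s2)		strcasecmp((s1), (s2))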

Rob


[PATCH] KVM: PPC: Book3S: Fix guest DMA when guest partially backed by THP pages

2018-08-22 Thread Paul Mackerras
Commit 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in the
pinned physical page", 2018-07-17) added some checks to ensure that guest
DMA mappings don't attempt to map more than the guest is entitled to
access.  However, errors in the logic mean that legitimate guest requests
to map pages for DMA are being denied in some situations.  Specifically,
if the first page of the range passed to mm_iommu_get() is mapped with
a normal page, and subsequent pages are mapped with transparent huge
pages, we end up with mem->pageshift == 0.  That means that the page
size checks in mm_iommu_ua_to_hpa() and mm_iommu_up_to_hpa_rm() will
always fail for every page in that region, and thus the guest can never
map any memory in that region for DMA, typically leading to a flood of
error messages like this:

qemu-system-ppc64: VFIO_MAP_DMA: -22
qemu-system-ppc64: vfio_dma_map(0x10005f47780, 0x800, 0x1, 
0x7fff63ff) = -22 (Invalid argument)

The logic errors in mm_iommu_get() are:

(a) use of 'ua' not 'ua + (i << PAGE_SHIFT)' in the find_linux_pte() call
(meaning that find_linux_pte() returns the pte for the first address
in the range, not the address we are currently up to);
(b) use of 'pageshift' as the variable to receive the hugepage shift
returned by find_linux_pte() - for a normal page this gets set to
0, leading to us setting mem->pageshift to 0 when we conclude that
the pte returned by find_linux_pte didn't match the page we were
looking at;
(c) comparing 'compshift', which is a page order, i.e. log base 2 of the
number of pages, with 'pageshift', which is a log base 2 of the number
of bytes.

To fix these problems, this patch introduces 'cur_ua' to hold the current
user address and uses that in the find_linux_pte call; introduces
'pteshift' to hold the hugepage shift found by find_linux_pte(); and
compares 'pteshift' with 'compshift + PAGE_SHIFT' rather than 'compshift'.
The patch also moves the local_irq_restore to the point after the pte
pointer returned by find_linux_pte has been dereferenced because that
seems safer, and adds a check to avoid doing the find_linux_pte() call
once mem->pageshift has been reduced to PAGE_SHIFT, as an optimization.
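A worked example of point (c), taking for instance a 64K base page
(PAGE_SHIFT = 16) and a 2MB transparent huge page:

	compshift = compound_order(head) = 5    (2MB / 64KB = 32 pages)
	shift of the 2MB mapping         = 21   (log2 of the size in bytes)

Comparing pageshift (a byte shift) with compshift (a page order) mixes units;
the comparable quantity is compshift + PAGE_SHIFT = 21.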

Cc: sta...@vger.kernel.org # v4.12+
Fixes: 76fa4975f3ed ("KVM: PPC: Check if IOMMU page is contained in the
pinned physical page")
Signed-off-by: Paul Mackerras 
---
 arch/powerpc/mm/mmu_context_iommu.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
b/arch/powerpc/mm/mmu_context_iommu.c
index a4ca57612558..c9ee9e23845f 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -129,6 +129,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, 
unsigned long entries,
long i, j, ret = 0, locked_entries = 0;
unsigned int pageshift;
unsigned long flags;
+   unsigned long cur_ua;
struct page *page = NULL;
 
mutex_lock(_list_mutex);
@@ -177,7 +178,8 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, 
unsigned long entries,
}
 
for (i = 0; i < entries; ++i) {
-   if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
+   cur_ua = ua + (i << PAGE_SHIFT);
+   if (1 != get_user_pages_fast(cur_ua,
1/* pages */, 1/* iswrite */, &page)) {
ret = -EFAULT;
for (j = 0; j < i; ++j)
@@ -196,7 +198,7 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, 
unsigned long entries,
if (is_migrate_cma_page(page)) {
if (mm_iommu_move_page_from_cma(page))
goto populate;
-   if (1 != get_user_pages_fast(ua + (i << PAGE_SHIFT),
+   if (1 != get_user_pages_fast(cur_ua,
1/* pages */, 1/* iswrite */,
&page)) {
ret = -EFAULT;
@@ -210,20 +212,21 @@ long mm_iommu_get(struct mm_struct *mm, unsigned long ua, 
unsigned long entries,
}
 populate:
pageshift = PAGE_SHIFT;
-   if (PageCompound(page)) {
+   if (mem->pageshift > PAGE_SHIFT && PageCompound(page)) {
pte_t *pte;
struct page *head = compound_head(page);
unsigned int compshift = compound_order(head);
+   unsigned int pteshift;
 
local_irq_save(flags); /* disables as well */
-   pte = find_linux_pte(mm->pgd, ua, NULL, &pageshift);
-   local_irq_restore(flags);
+   pte = find_linux_pte(mm->pgd, cur_ua, NULL, &pteshift);
 
/* Double check it is still the same pinned page */
if (pte && pte_page(*pte) == 

Re: [PATCH v2] crypto: vmx - Fix sleep-in-atomic bugs

2018-08-22 Thread Marcelo Henrique Cerri
That looks good to me. Maybe Paulo can help testing it.
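For context, the shape of the problem and of Ondrej's fix, condensed from the
diff quoted below (sketch, not the literal code):

	/* before: the whole walk runs with preemption and page faults
	 * disabled, but blkcipher_walk_done() may sleep via crypto_yield() */
	preempt_disable(); pagefault_disable(); enable_kernel_vsx();
	blkcipher_walk_init(&walk, dst, src, nbytes);
	ret = blkcipher_walk_virt(desc, &walk);
	while ((nbytes = walk.nbytes)) {
		aes_p8_cbc_encrypt(...);
		ret = blkcipher_walk_done(desc, &walk, nbytes);
	}
	disable_kernel_vsx(); pagefault_enable(); preempt_enable();

	/* after: only the VSX-using cipher call is wrapped, so the walk
	 * helpers that may sleep run in preemptible context */
	blkcipher_walk_init(&walk, dst, src, nbytes);
	ret = blkcipher_walk_virt(desc, &walk);
	while ((nbytes = walk.nbytes)) {
		preempt_disable(); pagefault_disable(); enable_kernel_vsx();
		aes_p8_cbc_encrypt(...);
		disable_kernel_vsx(); pagefault_enable(); preempt_enable();
		ret = blkcipher_walk_done(desc, &walk, nbytes);
	}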

-- 
Regards,
Marcelo

On Wed, Aug 22, 2018 at 08:26:31AM +0200, Ondrej Mosnacek wrote:
> This patch fixes sleep-in-atomic bugs in AES-CBC and AES-XTS VMX
> implementations. The problem is that the blkcipher_* functions should
> not be called in atomic context.
> 
> The bugs can be reproduced via the AF_ALG interface by trying to
> encrypt/decrypt sufficiently large buffers (at least 64 KiB) using the
> VMX implementations of 'cbc(aes)' or 'xts(aes)'. Such operations then
> trigger BUG in crypto_yield():
> 
> [  891.863680] BUG: sleeping function called from invalid context at 
> include/crypto/algapi.h:424
> [  891.864622] in_atomic(): 1, irqs_disabled(): 0, pid: 12347, name: kcapi-enc
> [  891.864739] 1 lock held by kcapi-enc/12347:
> [  891.864811]  #0: f5d42c46 (sk_lock-AF_ALG){+.+.}, at: 
> skcipher_recvmsg+0x50/0x530
> [  891.865076] CPU: 5 PID: 12347 Comm: kcapi-enc Not tainted 
> 4.19.0-0.rc0.git3.1.fc30.ppc64le #1
> [  891.865251] Call Trace:
> [  891.865340] [c003387578c0] [c0d67ea4] dump_stack+0xe8/0x164 
> (unreliable)
> [  891.865511] [c00338757910] [c0172a58] 
> ___might_sleep+0x2f8/0x310
> [  891.865679] [c00338757990] [c06bff74] 
> blkcipher_walk_done+0x374/0x4a0
> [  891.865825] [c003387579e0] [d7e73e70] 
> p8_aes_cbc_encrypt+0x1c8/0x260 [vmx_crypto]
> [  891.865993] [c00338757ad0] [c06c0ee0] 
> skcipher_encrypt_blkcipher+0x60/0x80
> [  891.866128] [c00338757b10] [c06ec504] 
> skcipher_recvmsg+0x424/0x530
> [  891.866283] [c00338757bd0] [c0b00654] sock_recvmsg+0x74/0xa0
> [  891.866403] [c00338757c10] [c0b00f64] ___sys_recvmsg+0xf4/0x2f0
> [  891.866515] [c00338757d90] [c0b02bb8] __sys_recvmsg+0x68/0xe0
> [  891.866631] [c00338757e30] [c000bbe4] system_call+0x5c/0x70
> 
> Fixes: 8c755ace357c ("crypto: vmx - Adding CBC routines for VMX module")
> Fixes: c07f5d3da643 ("crypto: vmx - Adding support for XTS")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Ondrej Mosnacek 
> ---
> Still untested, please test and review if possible.
> 
> Changes in v2:
> - fix leaving preemption, etc. disabled when leaving the function
>   (I switched to the more obvious and less efficient variant for the
>   sake of clarity.)
> 
>  drivers/crypto/vmx/aes_cbc.c | 30 ++
>  drivers/crypto/vmx/aes_xts.c | 21 ++---
>  2 files changed, 28 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/crypto/vmx/aes_cbc.c b/drivers/crypto/vmx/aes_cbc.c
> index 5285ece4f33a..b71895871be3 100644
> --- a/drivers/crypto/vmx/aes_cbc.c
> +++ b/drivers/crypto/vmx/aes_cbc.c
> @@ -107,24 +107,23 @@ static int p8_aes_cbc_encrypt(struct blkcipher_desc 
> *desc,
>   ret = crypto_skcipher_encrypt(req);
>   skcipher_request_zero(req);
>   } else {
> - preempt_disable();
> - pagefault_disable();
> - enable_kernel_vsx();
> -
>   blkcipher_walk_init(, dst, src, nbytes);
>   ret = blkcipher_walk_virt(desc, );
>   while ((nbytes = walk.nbytes)) {
> + preempt_disable();
> + pagefault_disable();
> + enable_kernel_vsx();
>   aes_p8_cbc_encrypt(walk.src.virt.addr,
>  walk.dst.virt.addr,
>  nbytes & AES_BLOCK_MASK,
>  >enc_key, walk.iv, 1);
> + disable_kernel_vsx();
> + pagefault_enable();
> + preempt_enable();
> +
>   nbytes &= AES_BLOCK_SIZE - 1;
>   ret = blkcipher_walk_done(desc, , nbytes);
>   }
> -
> - disable_kernel_vsx();
> - pagefault_enable();
> - preempt_enable();
>   }
>  
>   return ret;
> @@ -147,24 +146,23 @@ static int p8_aes_cbc_decrypt(struct blkcipher_desc 
> *desc,
>   ret = crypto_skcipher_decrypt(req);
>   skcipher_request_zero(req);
>   } else {
> - preempt_disable();
> - pagefault_disable();
> - enable_kernel_vsx();
> -
>   blkcipher_walk_init(, dst, src, nbytes);
>   ret = blkcipher_walk_virt(desc, );
>   while ((nbytes = walk.nbytes)) {
> + preempt_disable();
> + pagefault_disable();
> + enable_kernel_vsx();
>   aes_p8_cbc_encrypt(walk.src.virt.addr,
>  walk.dst.virt.addr,
>  nbytes & AES_BLOCK_MASK,
>  >dec_key, walk.iv, 0);
> + disable_kernel_vsx();
> + pagefault_enable();
> + preempt_enable();

Re: [PATCH 02/20] kernel/dma/direct: refine dma_direct_alloc zone selection

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 08:58 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 09:54:33AM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > > We need to take the DMA offset and encryption bit into account when 
> > > selecting
> > > a zone.  Add a helper that takes those into account and use it.
> > 
> > That whole "encryption" stuff seems to be completely specific to the
> > way x86 does memory encryption, or am I mistaken ? It's not clear to me
> > what that does in practice and how it relates to DMA mappings.
> 
> Not even all of x86, but AMD in particular, Intel does it yet another
> way.  But it still is easier to take this into the core with a few
> overrides than duplicating all the code.
> 
> > I'm also not sure about that whole business with ZONE_DMA and
> > ARCH_ZONE_DMA_BITS...
> 
> ZONE_DMA usually (but not always) maps to 24-bits of address space,
> if it doesn't (I'm mostly thinking of s390 with its odd 31 bits) the
> architecture can override it if it cares.
> 
> > On ppc64, unless you enable swiotlb (which we only do currently on
> > some embedded platforms), you have all of memory in ZONE_DMA.
> > 
> > [0.00] Zone ranges:
> > [0.00]   DMA  [mem 0x-0x001f]
> > [0.00]   DMA32empty
> > [0.00]   Normal   empty
> > [0.00]   Device   empty
> 
> This is really weird.  Why would you wire up ZONE_DMA like this?

We always did :-) It predates my involvement and I think it predates
even Paul's. It's quite silly actually since the first powerpc machines
actually had ISA devices in them, but that's how it's been forever. I
suppose we could change it but that would mean digging out some old
stuff to test.

> The general scheme that architectures should implement is:
> 
> ZONE_DMA: Any memory below a magic threshold that is lower than
>   32-bit.  Only enabled if actually required (usually
>   either 24-bit for ISA, or some other weird architecture
>   specific value like 32-bit for S/390)

It should have been ZONE_ISA_DMA :-)

> ZONE_DMA32:   Memory <= 32-bit if the architecture supports more than
>   32-bits worth of physical address space.  Should generally
>   be enabled on all 64-bit architectures unless you have
>   a very good reason not to.

Yeah so we sort-of enable the config option but only populate the zone
on platforms using swiotlb (freescale stuff). It's a bit messy at the
moment I must admit.

> ZONE_NORMAL:  Everything above 32-bit not falling into HIGHMEM or
>   MOVEABLE.

Cheers,
Ben.




Re: [PATCH 01/20] kernel/dma/direct: take DMA offset into account in dma_direct_supported

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 08:53 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 09:44:18AM +1000, Benjamin Herrenschmidt wrote:
> > We do have the occasional device with things like 31-bit DMA
> > limitation. We know they happen to work because those systems
> > can't have enough memory to be a problem. This is why our current
> > DMA direct ops in powerpc just unconditionally return true on ppc32.
> > 
> > The test against a full 32-bit mask here will break them I think.
> > 
> > Thing is, I'm not sure I still have access to one of these things
> > to test, I'll have to dig (from memory things like b43 wifi).
> 
> Yeah, the other platforms that support these devices support ZONE_DMA
> to reliably handle these devices. But there are two other ways the
> current code would actually handle these fine despite the dma_direct
> checks:
> 
>  1) if the device only has physical addresses up to 31-bit anyway
>  2) by trying again to find a lower address.  But this only works
> for coherent allocations and not streaming maps (unless we have
> swiotlb with a buffer below 31-bits).
> 
> It seems powerpc can have ZONE_DMA, though and we will cover these
> devices just fine.  If it didn't have that the current powerpc
> code would not work either.

Not exactly. powerpc has ZONE_DMA covering all of system memory.

What happens in ppc32 is that we somewhat "know" that none of the
systems with those stupid 31-bit limited pieces of HW is capable of
having more than 2GB of memory anyway.

So we get away with just returning "1".

> 
> >  - What is this trying to achieve ?
> > 
> > /*
> >  * Various PCI/PCIe bridges have broken support for > 32bit DMA even
> >  * if the device itself might support it.
> >  */
> > if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
> > return 0;
> > 
> > IE, if the device has a 32-bit limit, we fail an attempt at checking
> > if a >32-bit mask works ? That doesn't quite seem to be the right thing
> > to do... Shouldn't this be in dma_set_mask() and just clamp the mask down ?
> > 
> > IE, dma_set_mask() is what a driver uses to establish the device capability,
> > so it makes sense to have dma_32bit_limit just reduce that capability, not
> > fail because the device can do more than what the bridge can 
> 
> If your PCI bridge / PCIe root port doesn't support dma to addresses
> larger than 32-bit the device capabilities above that don't matter, it
> just won't work.  We have this case at least for some old VIA x86 chipsets
> and some relatively modern Xilinx FPGAs with PCIe.

Hrm... that's the usual confusion dma_capable() vs. dma_set_mask().

It's always been perfectly fine for a driver to do a dma_set_mask(64-
bit) on a system where the bridge can only do 32-bits ...

We shouldn't fail there, we should instead "clamp" the mask to 32-bit,
see what I mean ? It doesn't matter that the device itself is capable
of issuing >32 addresses, I agree, but what we need to express is that
the combination device+bridge doesn't want addresses above 32-bit, so
it's equivalent to making the device do a set_mask(32-bit).

This will succeed if the system can limit the addresses (for example
because memory is never above 32-bit) and will fail if the system
can't.

So that's equivalent to writing

if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
mask = phys_to_dma(dev, DMA_BIT_MASK(32)); 

Effectively meaning "don't give me addresses above 32-bit".

Still, your code doesn't check the mask against the memory size. Which
means it will fail for 32-bit masks even on systems that do not have
memory above 4G.

> >  - How is that file supposed to work on 64-bit platforms ? From what I can
> > tell, dma_supported() will unconditionally return true if the mask is
> > 32-bit or larger (appart from the above issue). This doesn't look right,
> > the mask needs to be compared to the max memory address. There are a bunch
> > of devices out there with masks anywhere bettween 40 and 64 bits, and
> > some of these will not work "out of the box" if the offseted top
> > of memory is beyond the mask limit. Or am I missing something ?
> 
> Your are not missing anything except for the history of this code.
> 
> Your observation is right, but there always has been the implicit
> assumption that architectures with more than 4GB of physical address
> space must either support an iommu or swiotlb and use that.  It's
> never been document anywhere, but I'm working on integrating all
> this code to make more sense.

Well, iommus can have bypass regions, which we also use for
performance, so we do at dma_set_mask() time "swap" the ops around, and
in that case, we do want to check the mask against the actual top of
memory...

Cheers,
Ben.




Re: [PATCH 08/20] powerpc/dma: remove the unused dma_nommu_ops export

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 08:45 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 10:01:16AM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2018-07-31 at 14:16 +0200, Christoph Hellwig wrote:
> > > It turns out cxl actually uses it.  So for now skip this patch,
> > > although random code in drivers messing with dma ops will need to
> > > be sorted out sooner or later.
> > 
> > CXL devices are "special", they bypass the classic iommu in favor of
> > allowing the device to operate using the main processor page tables
> > using an MMU context (so basically the device can use userspace
> > addresses directly), akin to ATS.
> > 
> > I think the code currently uses the nommu ops as a way to do a simple
> > kernel mapping for kernel drivers using CXL (not userspace stuff)
> > though.
> 
> It's still a horrible idea to have this in drivers/, we need some
> core API to mediate this behavior.  Also if the device supports
> using virtual addresses dma_nommu_ops seems wrong as it won't do
> the right thing for e.g. vmalloc addresses not mapped into the
> kernel linear mapping (which I guess can't currently happen on
> powerpc, but still..)

You are right it won't do the right thing, but neither will standard
DMA ops, will they ? Drivers know not to try to dma_map vmalloc
addresses without first getting the underlying page, nothing unusual
there.

Yes I agree having this in drivers somewhat sucks though.

Cheers,
Ben.



Re: [PATCH 10/20] powerpc/dma-noncoherent: don't disable irqs over kmap_atomic

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 09:02 +0200, Christoph Hellwig wrote:
> On Thu, Aug 09, 2018 at 10:27:46AM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > > The requirement to disable local irqs over kmap_atomic is long gone,
> > > so remove those calls.
> > 
> > Really ? I'm trying to verify that and getting lost in a mess of macros
> > from hell in the per-cpu stuff but if you look at our implementation
> > of kmap_atomic_prot(), all it does is a preempt_disable(), and then
> > it uses kmap_atomic_idx_push():
> > 
> > int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
> > 
> > Note the use of __this_cpu_inc_return(), not this_cpu_inc_return(),
> > ie this is the non-interrupt safe version...
> 
Looks like the powerpc variant indeed isn't safe.
> 
> I did look a bit more through the code and history, and it seems
like we removed the need to disable irqs when called from process
> context a while ago, but we still require disabling irqs when called
> from irq context.  Given that this code can also be called from
> irq context we'll have to keep the local_irq_save.

This is the same with x86 no ?

32-bit x86 kmap_atomic_prot is the same as ours...

In fact I wonder why the preempt_disable() is in there, since it needs
to be protected against interrupts ?

Or is it that we never actually call kmap_atomic_* these days from
interrupt, and the atomic versions are just about dealing with
spinlocks ?

Cheers,
Ben.
 



Re: Odd SIGSEGV issue introduced by commit 6b31d5955cb29 ("mm, oom: fix potential data corruption when oom_reaper races with writer")

2018-08-22 Thread Ram Pai
On Wed, Aug 22, 2018 at 10:19:02AM +0200, Christophe LEROY wrote:
> 
> 
> On 21/08/2018 at 19:50, Ram Pai wrote:
> >On Tue, Aug 21, 2018 at 04:40:15PM +1000, Michael Ellerman wrote:
> >>Christophe LEROY  writes:
> >>...
> >>>
> >>>And I bisected its disappearance with commit 99cd1302327a2 ("powerpc:
> >>>Deliver SEGV signal on pkey violation")
> >>
> >>Whoa that's weird.
> >>
> >>>Looking at those two commits, especially the one which makes it
> >>>disappear, I'm quite sceptical. Any idea on what could be the cause and/or
> >>>how to investigate further ?
> >>
> >>Are you sure it's not some corruption that just happens to be masked by
> >>that commit? I can't see anything in that commit that could explain that
> >>change in behaviour.
> >>
> >>The only real change is if you're hitting DSISR_KEYFAULT isn't it?
> >
> >even with the 'commit 99cd1302327a2', a SEGV signal should get generated;
> >which should kill the process. Unless the process handles SEGV signals
> >with SEGV_PKUERR differently.
> 
> No, the sigsegvs are not handled differently. And the trace showed it
> is SEGV_MAPERR which is generated.
> 
> >
> >The other surprising thing is, why is DSISR_KEYFAULT getting generated
> >in the first place?  Are keys somehow getting programmed into the HPTE?
> 
> Can't be that, because DSISR_KEYFAULT is filtered out when applying
> DSISR_SRR1_MATCH_32S mask.

Ah.. in that case, 99cd1302327a2 does nothing to fix the problem.

Are you sure it is this patch that fixes the problem?


RP



Re: Infinite looping observed in __offline_pages

2018-08-22 Thread Mike Kravetz
On 08/22/2018 02:30 AM, Aneesh Kumar K.V wrote:
> commit 2e9d754ac211f2af3731f15df3cd8cd070b4cc54
> Author: Aneesh Kumar K.V 
> Date:   Tue Aug 21 14:17:55 2018 +0530
> 
> mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not 
> supported.
> 
> When scanning for movable pages, filter out Hugetlb pages if hugepage 
> migration
> is not supported. Without this we hit an infinite loop in __offline_pages 
> where we
> do
> pfn = scan_movable_pages(start_pfn, end_pfn);
> if (pfn) { /* We have movable pages */
> ret = do_migrate_range(pfn, end_pfn);
> goto repeat;
> }
> 
> We do support hugetlb migration only if the hugetlb pages are at pmd 
> level. Here

I thought migration at pgd level was added for POWER?  commit 94310cbcaa3c
(mm/madvise: enable (soft|hard) offline of HugeTLB pages at PGD level).
Only remember, because I did not fully understand the use case. :)

> we just check for Kernel config. The gigantic page size check is done in
> page_huge_active.
> 
> Reported-by: Haren Myneni 
> CC: Naoya Horiguchi 
> Signed-off-by: Aneesh Kumar K.V 
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 4eb6e824a80c..f9bdea685cf4 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1338,7 +1338,8 @@ static unsigned long scan_movable_pages(unsigned long 
> start, unsigned long end)
>   return pfn;
>   if (__PageMovable(page))
>   return pfn;
> - if (PageHuge(page)) {
> + if (IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION) &&
> + PageHuge(page)) {

How about using hugepage_migration_supported instead?  It would automatically
catch those non-migratable huge page sizes.  Something like:

if (PageHuge(page) &&
hugepage_migration_supported(page_hstate(page))) {

-- 
Mike Kravetz

>   if (page_huge_active(page))
>   return pfn;
>   else
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 15ea511fb41c..a3f81e18c882 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7649,6 +7649,10 @@ bool has_unmovable_pages(struct zone *zone, struct 
> page *page, int count,
>* handle each tail page individually in migration.
>*/
>   if (PageHuge(page)) {
> +
> + if (!IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION))
> + goto unmovable;
> +
>   iter = round_up(iter + 1, 1<   continue;
>   }
> 


[PATCH 1/2] powerpc/mm/book3s: Add new pte bit to mark pte temporarily invalid.

2018-08-22 Thread Aneesh Kumar K.V
When splitting a huge pmd pte, we need to mark the pmd entry invalid. We
can do that by clearing the _PAGE_PRESENT bit. But then it would be taken as a
swap pte. In order to differentiate between the two, use a software pte bit
when invalidating.

For a regular pte, due to bd5050e38aec ("powerpc/mm/radix: Change pte relax
sequence to handle nest MMU hang") we need to mark the pte entry invalid when
relaxing access permission. Instead of marking it pte_none, which can result in
different page table walk routines possibly skipping this pte entry, invalidate
it but still keep it marked present.
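Spelled out, the three states a pte can now be in (as defined by this patch):

	normal mapping:         _PAGE_PRESENT = 1
	temporarily invalid:    _PAGE_PRESENT = 0, _PAGE_INVALID = 1
	swap / none:            _PAGE_PRESENT = 0, _PAGE_INVALID = 0

pte_present() treating the second case as present is what keeps concurrent
page table walkers from seeing the entry as pte_none() in the middle of the
update.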

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 676118743a06..13a688fc8cd0 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -44,6 +44,16 @@
 
 #define _PAGE_PTE  0x4000UL/* distinguishes PTEs 
from pointers */
 #define _PAGE_PRESENT  0x8000UL/* pte contains a 
translation */
+/*
+ * We need to mark a pmd pte invalid while splitting. We can do that by 
clearing
+ * the _PAGE_PRESENT bit. But then that will be taken as a swap pte. In order 
to
+ * differentiate between two use a SW field when invalidating.
+ *
+ * We do that temporary invalidate for regular pte entry in 
ptep_set_access_flags
+ *
+ * This is used only when _PAGE_PRESENT is cleared.
+ */
+#define _PAGE_INVALID  _RPAGE_SW0
 
 /*
  * Top and bottom bits of RPN which can be used by hash
@@ -568,7 +578,13 @@ static inline pte_t pte_clear_savedwrite(pte_t pte)
 
 static inline int pte_present(pte_t pte)
 {
-   return !!(pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT));
+   /*
+* A pte is considerent present if _PAGE_PRESENT is set.
+* We also need to consider the pte present which is marked
+* invalid during ptep_set_access_flags. Hence we look for _PAGE_INVALID
+* if we find _PAGE_PRESENT cleared.
+*/
+   return !!(pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_INVALID));
 }
 
 #ifdef CONFIG_PPC_MEM_KEYS
-- 
2.17.1



[PATCH 2/2] powerpc/mm/hash: Only need the Nest MMU workaround for R -> RW transition

2018-08-22 Thread Aneesh Kumar K.V
The Nest MMU workaround is only needed for RW upgrades. Avoid doing that
for other pte updates.

We also avoid clearing the pte while marking it invalid. This is because other
page table walks would find this pte none, which can result in unexpected
behaviour. Instead we clear _PAGE_PRESENT and set the software pte bit
_PAGE_INVALID. pte_present() is already updated to check for both bits. This
makes sure page table walkers will find the pte present and things like
pte_pfn(pte) return the right value.

Based on the original patch from Benjamin Herrenschmidt 


Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/pgtable-radix.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 7be99fd9af15..c879979faa73 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -1045,20 +1045,22 @@ void radix__ptep_set_access_flags(struct vm_area_struct 
*vma, pte_t *ptep,
struct mm_struct *mm = vma->vm_mm;
unsigned long set = pte_val(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED |
  _PAGE_RW | _PAGE_EXEC);
+
+   unsigned long change = pte_val(entry) ^ pte_val(*ptep);
/*
 * To avoid NMMU hang while relaxing access, we need mark
 * the pte invalid in between.
 */
-   if (atomic_read(&mm->context.copros) > 0) {
+   if ((change & _PAGE_RW) && atomic_read(&mm->context.copros) > 0) {
unsigned long old_pte, new_pte;
 
-   old_pte = __radix_pte_update(ptep, ~0, 0);
+   old_pte = __radix_pte_update(ptep, _PAGE_PRESENT, 
_PAGE_INVALID);
/*
 * new value of pte
 */
new_pte = old_pte | set;
radix__flush_tlb_page_psize(mm, address, psize);
-   __radix_pte_update(ptep, 0, new_pte);
+   __radix_pte_update(ptep, _PAGE_INVALID, new_pte);
} else {
__radix_pte_update(ptep, 0, set);
/*
-- 
2.17.1



Re: [PATCH 2/2] powerpc/mm/hash: Only need the Nest MMU workaround for R -> RW transition

2018-08-22 Thread Aneesh Kumar K.V

On 08/22/2018 10:46 PM, Aneesh Kumar K.V wrote:

The Nest MMU workaround is only needed for RW upgrades. Avoid doing that
for other pte updates.

We also avoid clearing the pte while marking it invalid. This is because other
page table walks would find this pte none, which can result in unexpected
behaviour. Instead we clear _PAGE_PRESENT and set the software pte bit
_PAGE_INVALID. pte_present() is already updated to check for both bits. This
makes sure page table walkers will find the pte present and things like
pte_pfn(pte) return the right value.

Based on the original patch from Benjamin Herrenschmidt 




Fixes: bd5050e38aec3
("powerpc/mm/radix: Change pte relax sequence to handle nest MMU hang")


Signed-off-by: Aneesh Kumar K.V 
---
  arch/powerpc/mm/pgtable-radix.c | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 7be99fd9af15..c879979faa73 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -1045,20 +1045,22 @@ void radix__ptep_set_access_flags(struct vm_area_struct 
*vma, pte_t *ptep,
struct mm_struct *mm = vma->vm_mm;
unsigned long set = pte_val(entry) & (_PAGE_DIRTY | _PAGE_ACCESSED |
  _PAGE_RW | _PAGE_EXEC);
+
+   unsigned long change = pte_val(entry) ^ pte_val(*ptep);
/*
 * To avoid NMMU hang while relaxing access, we need mark
 * the pte invalid in between.
 */
-   if (atomic_read(&mm->context.copros) > 0) {
+   if ((change & _PAGE_RW) && atomic_read(&mm->context.copros) > 0) {
unsigned long old_pte, new_pte;
  
-		old_pte = __radix_pte_update(ptep, ~0, 0);

+   old_pte = __radix_pte_update(ptep, _PAGE_PRESENT, 
_PAGE_INVALID);
/*
 * new value of pte
 */
new_pte = old_pte | set;
radix__flush_tlb_page_psize(mm, address, psize);
-   __radix_pte_update(ptep, 0, new_pte);
+   __radix_pte_update(ptep, _PAGE_INVALID, new_pte);
} else {
__radix_pte_update(ptep, 0, set);
/*





powerpc compile failure: linux/crc32poly.h: No such file or directory

2018-08-22 Thread Meelis Roos
In 4.18 + todays git on 32-bit powerpc:

  BOOTCC  arch/powerpc/boot/decompress.o
In file included from arch/powerpc/boot/../../../lib/decompress_unxz.c:233,
 from arch/powerpc/boot/decompress.c:42:
arch/powerpc/boot/../../../lib/xz/xz_crc32.c:18:10: fatal error: 
linux/crc32poly.h: No such file or directory
 #include <linux/crc32poly.h>
  ^~~


-- 
Meelis Roos (mr...@linux.ee)


Re: [PATCH v11 00/26] Speculative page faults

2018-08-22 Thread Laurent Dufour
On 03/08/2018 08:36, Song, HaiyanX wrote:
> Hi Laurent,

Hi Haiyan,

Sorry for the late answer, I was off a couple of days.

> 
> Thanks for your analysis for the last perf results.
> You mentioned, "the major difference at the head of the perf report is the
> 92% testcase which is weirdly not reported on the head side"; that is a bug
> in 0-day, and it caused the item to not be counted in perf.
> 
> I've triggered the tests page_fault2 and page_fault3 again, only with the
> thread mode of will-it-scale, on 0-day (on the same test box, every case
> tested 3 times).
> I checked the perf report have no above mentioned problem.
> 
> I have compared them and found some items with differences, such as the cases below:
>    page_fault2-thp-always: handle_mm_fault, base: 45.22%   head: 29.41%
>    page_fault3-thp-always: handle_mm_fault, base: 22.95%   head: 14.15%
>  

These would mean that the system spends less time running handle_mm_fault()
when SPF is in the picture in these 2 cases, which is good. This should lead to
better results with the SPF series, and I can't find any values higher on the
head side.

> 
> So I attached the perf results in this mail again; could you have a look again
> at the difference between the base and head commits.

I took a close look at all the perf results you sent, but I can't identify any
major difference. But the compiler optimization is getting rid of the
handle_pte_fault() symbol on the base kernel, which adds complexity to checking
the differences.

To get rid of that, I'm proposing that you apply the attached patch to the
spf kernel. This patch allows turning the SPF handler on/off through
/proc/sys/vm/speculative_page_fault.

This should ease the testing by limiting reboots and avoiding kernel symbol
mismatches. Obviously there is still a small overhead due to the check, but it
should not be noticeable.

With this patch applied you can simply run
echo 1 > /proc/sys/vm/speculative_page_fault
to run a test with the speculative page fault handler activated. Or run
echo 0 > /proc/sys/vm/speculative_page_fault
to run a test without it.

I'm really sorry for asking that again, but could you please run the test
page_fault3_base_THP-Always with and without SPF and capture the perf output?

I think we should focus on that test which showed the biggest regression.

Thanks,
Laurent.


> 
> Thanks,
> Haiyan, Song
>  
> 
> From: owner-linux...@kvack.org [owner-linux...@kvack.org] on behalf of 
> Laurent Dufour [lduf...@linux.vnet.ibm.com]
> Sent: Tuesday, July 17, 2018 5:36 PM
> To: Song, HaiyanX
> Cc: a...@linux-foundation.org; mho...@kernel.org; pet...@infradead.org; 
> kir...@shutemov.name; a...@linux.intel.com; d...@stgolabs.net; j...@suse.cz; 
> Matthew Wilcox; khand...@linux.vnet.ibm.com; aneesh.ku...@linux.vnet.ibm.com; 
> b...@kernel.crashing.org; m...@ellerman.id.au; pau...@samba.org; Thomas 
> Gleixner; Ingo Molnar; h...@zytor.com; Will Deacon; Sergey Senozhatsky; 
> sergey.senozhatsky.w...@gmail.com; Andrea Arcangeli; Alexei Starovoitov; 
> Wang, Kemi; Daniel Jordan; David Rientjes; Jerome Glisse; Ganesh Mahendran; 
> Minchan Kim; Punit Agrawal; vinayak menon; Yang Shi; 
> linux-ker...@vger.kernel.org; linux...@kvack.org; ha...@linux.vnet.ibm.com; 
> npig...@gmail.com; bsinghar...@gmail.com; paul...@linux.vnet.ibm.com; Tim 
> Chen; linuxppc-dev@lists.ozlabs.org; x...@kernel.org
> Subject: Re: [PATCH v11 00/26] Speculative page faults
> 
> On 13/07/2018 05:56, Song, HaiyanX wrote:
>> Hi Laurent,
> 
> Hi Haiyan,
> 
> Thanks a lot for sharing this perf reports.
> 
> I looked at them closely, and I've to admit that I was not able to found a
> major difference between the base and the head report, except that
> handle_pte_fault() is no more in-lined in the head one.
> 
> As expected, __handle_speculative_fault() is never traced since these tests 
> are
> dealing with file mapping, not handled in the speculative way.
> 
> When running these tests did you see a major difference in the test results
> between base and head ?
> 
> From the number of cycles counted, the biggest difference is page_fault3 when
> run with the THP enabled:
>                             BASE            HEAD            Delta
> page_fault2_base_thp_never  1142252426747   1065866197589   -6.69%
> page_fault2_base_THP-Alwasys1124844374523   1076312228927   -4.31%
> page_fault3_base_thp_never  1099387298152   1134118402345   3.16%
> page_fault3_base_THP-Always 1059370178101   853985561949-19.39%
> 
> 
> The very weird thing is the difference in the delta cycles reported between
> thp never and thp always, because the speculative way is aborted when checking
> the vma->ops field, which is the same in both cases, and the thp is never
> checked. So there is no code coverage difference, on the speculative path,
> between these 2 cases. This leads me to think that there are other
> interactions interfering with the measurement.
> 
> Looking at the 

Re: [PATCH v2 3/4] powerpc/mm: fix a warning when a cache is common to PGD and hugepages

2018-08-22 Thread Aneesh Kumar K.V

On 08/17/2018 04:14 PM, Christophe LEROY wrote:



On 17/08/2018 at 05:32, Aneesh Kumar K.V wrote:

On 08/14/2018 08:24 PM, Christophe Leroy wrote:

While implementing TLB miss HW assistance on the 8xx, the following
warning was encountered:

[  423.732965] WARNING: CPU: 0 PID: 345 at mm/slub.c:2412 
___slab_alloc.constprop.30+0x26c/0x46c
[  423.733033] CPU: 0 PID: 345 Comm: mmap Not tainted 
4.18.0-rc8-00664-g2dfff9121c55 #671

[  423.733075] NIP:  c0108f90 LR: c0109ad0 CTR: 0004
[  423.733121] REGS: c455bba0 TRAP: 0700   Not tainted 
(4.18.0-rc8-00664-g2dfff9121c55)

[  423.733147] MSR:  00021032   CR: 24224848  XER: 2000
[  423.733319]
[  423.733319] GPR00: c0109ad0 c455bc50 c4521910 c60053c0 007080c0 
c0011b34 c7fa41e0 c455be30
[  423.733319] GPR08: 0001 c00103a0 c7fa41e0 c49afcc4 24282842 
10018840 c079b37c 0040
[  423.733319] GPR16: 73f0 00210d00  0001 c455a000 
0100 0200 c455a000
[  423.733319] GPR24: c60053c0 c0011b34 007080c0 c455a000 c455a000 
c7fa41e0  9032

[  423.734190] NIP [c0108f90] ___slab_alloc.constprop.30+0x26c/0x46c
[  423.734257] LR [c0109ad0] kmem_cache_alloc+0x210/0x23c
[  423.734283] Call Trace:
[  423.734326] [c455bc50] [0100] 0x100 (unreliable)
[  423.734430] [c455bcc0] [c0109ad0] kmem_cache_alloc+0x210/0x23c
[  423.734543] [c455bcf0] [c0011b34] huge_pte_alloc+0xc0/0x1dc
[  423.734633] [c455bd20] [c01044dc] hugetlb_fault+0x408/0x48c
[  423.734720] [c455bdb0] [c0104b20] follow_hugetlb_page+0x14c/0x44c
[  423.734826] [c455be10] [c00e8e54] __get_user_pages+0x1c4/0x3dc
[  423.734919] [c455be80] [c00e9924] __mm_populate+0xac/0x140
[  423.735020] [c455bec0] [c00db14c] vm_mmap_pgoff+0xb4/0xb8
[  423.735127] [c455bf00] [c00f27c0] ksys_mmap_pgoff+0xcc/0x1fc
[  423.735222] [c455bf40] [c000e0f8] ret_from_syscall+0x0/0x38
[  423.735271] Instruction dump:
[  423.735321] 7cbf482e 38fd0008 7fa6eb78 7fc4f378 4bfff5dd 7fe3fb78 
4bfffe24 81370010
[  423.735536] 71280004 41a2ff88 4840c571 4b80 <0fe0> 
4bfffeb8 81340010 712a0004

[  423.735757] ---[ end trace e9b222919a470790 ]---

This warning occurs when calling kmem_cache_zalloc() on a
cache having a constructor.

In this case it happens because PGD cache and 512k hugepte cache are
the same size (4k). While a cache with constructor is created for
the PGD, hugepages create a cache without a constructor and use
kmem_cache_zalloc(). As both expect a cache with the same size,
the hugepages reuse the cache created for PGD, hence the conflict.

In order to avoid this conflict, this patch:
- modifies pgtable_cache_add() so that a zeroising constructor is
added for any cache size.
- replaces calls to kmem_cache_zalloc() by kmem_cache_alloc()



Can't we just do kmem_cache_alloc with the __GFP_ZERO gfp flag and
remove the constructor completely?


I don't understand what you mean. That's exactly what I did in v1 (by 
using kmem_cache_zalloc()), and you commented that doing this we would 
zeroise at allocation whereas the constructors are called when adding 
memory to the slab and when freeing the allocated block. Or did I 
misunderstand your comment ?


static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
{
 return kmem_cache_alloc(k, flags | __GFP_ZERO);
}




I completely misunderstood kmem_cache_zalloc. I took it as we zero out 
after each alloc. I guess your earlier patch is then good. We may want 
to double check this, I haven't looked at the slab internals.


What we want is to make sure that when we add new memory to the slab, it is
zeroed. If we are allocating objects from the existing slab memory pool, we
don't need to zero them out, because when we release objects to the slab we
make sure we clear them.
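
(For illustration, the two approaches only differ in where the zeroing
happens; a rough sketch, with made-up names, of the constructor variant
discussed here:)

#include <linux/slab.h>
#include <linux/string.h>

#define EXAMPLE_PGT_SIZE	4096	/* made-up object size */

static struct kmem_cache *example_pgt_cache;

/* runs when fresh pages are added to the slab; together with clearing the
 * object at free time this lets the allocation path skip __GFP_ZERO */
static void example_pgt_ctor(void *addr)
{
	memset(addr, 0, EXAMPLE_PGT_SIZE);
}

static int __init example_pgt_cache_init(void)
{
	example_pgt_cache = kmem_cache_create("example-pgt", EXAMPLE_PGT_SIZE,
					      EXAMPLE_PGT_SIZE, 0,
					      example_pgt_ctor);
	return example_pgt_cache ? 0 : -ENOMEM;
}

static void *example_pgt_alloc(gfp_t gfp)
{
	/* plain kmem_cache_alloc(): zeroing comes from the ctor, not __GFP_ZERO */
	return kmem_cache_alloc(example_pgt_cache, gfp);
}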


-aneesh



Re: [PATCH v2 3/4] powerpc/mm: fix a warning when a cache is common to PGD and hugepages

2018-08-22 Thread Christophe LEROY

Aneesh,

On 17/08/2018 at 12:44, Christophe LEROY wrote:



On 17/08/2018 at 05:32, Aneesh Kumar K.V wrote:

On 08/14/2018 08:24 PM, Christophe Leroy wrote:

While implementing TLB miss HW assistance on the 8xx, the following
warning was encountered:

[  423.732965] WARNING: CPU: 0 PID: 345 at mm/slub.c:2412 
___slab_alloc.constprop.30+0x26c/0x46c
[  423.733033] CPU: 0 PID: 345 Comm: mmap Not tainted 
4.18.0-rc8-00664-g2dfff9121c55 #671

[  423.733075] NIP:  c0108f90 LR: c0109ad0 CTR: 0004
[  423.733121] REGS: c455bba0 TRAP: 0700   Not tainted 
(4.18.0-rc8-00664-g2dfff9121c55)

[  423.733147] MSR:  00021032   CR: 24224848  XER: 2000
[  423.733319]
[  423.733319] GPR00: c0109ad0 c455bc50 c4521910 c60053c0 007080c0 
c0011b34 c7fa41e0 c455be30
[  423.733319] GPR08: 0001 c00103a0 c7fa41e0 c49afcc4 24282842 
10018840 c079b37c 0040
[  423.733319] GPR16: 73f0 00210d00  0001 c455a000 
0100 0200 c455a000
[  423.733319] GPR24: c60053c0 c0011b34 007080c0 c455a000 c455a000 
c7fa41e0  9032

[  423.734190] NIP [c0108f90] ___slab_alloc.constprop.30+0x26c/0x46c
[  423.734257] LR [c0109ad0] kmem_cache_alloc+0x210/0x23c
[  423.734283] Call Trace:
[  423.734326] [c455bc50] [0100] 0x100 (unreliable)
[  423.734430] [c455bcc0] [c0109ad0] kmem_cache_alloc+0x210/0x23c
[  423.734543] [c455bcf0] [c0011b34] huge_pte_alloc+0xc0/0x1dc
[  423.734633] [c455bd20] [c01044dc] hugetlb_fault+0x408/0x48c
[  423.734720] [c455bdb0] [c0104b20] follow_hugetlb_page+0x14c/0x44c
[  423.734826] [c455be10] [c00e8e54] __get_user_pages+0x1c4/0x3dc
[  423.734919] [c455be80] [c00e9924] __mm_populate+0xac/0x140
[  423.735020] [c455bec0] [c00db14c] vm_mmap_pgoff+0xb4/0xb8
[  423.735127] [c455bf00] [c00f27c0] ksys_mmap_pgoff+0xcc/0x1fc
[  423.735222] [c455bf40] [c000e0f8] ret_from_syscall+0x0/0x38
[  423.735271] Instruction dump:
[  423.735321] 7cbf482e 38fd0008 7fa6eb78 7fc4f378 4bfff5dd 7fe3fb78 
4bfffe24 81370010
[  423.735536] 71280004 41a2ff88 4840c571 4b80 <0fe0> 
4bfffeb8 81340010 712a0004

[  423.735757] ---[ end trace e9b222919a470790 ]---

This warning occurs when calling kmem_cache_zalloc() on a
cache having a constructor.

In this case it happens because PGD cache and 512k hugepte cache are
the same size (4k). While a cache with constructor is created for
the PGD, hugepages create a cache without a constructor and use
kmem_cache_zalloc(). As both expect a cache with the same size,
the hugepages reuse the cache created for PGD, hence the conflict.

In order to avoid this conflict, this patch:
- modifies pgtable_cache_add() so that a zeroising constructor is
added for any cache size.
- replaces calls to kmem_cache_zalloc() by kmem_cache_alloc()



Can't we just do kmem_cache_alloc with the __GFP_ZERO gfp flag and
remove the constructor completely?


I don't understand what you mean. That's exactly what I did in v1 (by 
using kmem_cache_zalloc()), and you commented that doing this we would 
zeroise at allocation whereas the constructors are called when adding 
memory to the slab and when freeing the allocated block. Or did I 
misunderstand your comment ?


static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags)
{
 return kmem_cache_alloc(k, flags | __GFP_ZERO);
}



Wasn't that what you meant in your comment on v1 ? If not, could you
detail your thinking so that I can take it into account in a v3 ?


Thanks
Christophe


Re: Infinite looping observed in __offline_pages

2018-08-22 Thread Michal Hocko
On Wed 22-08-18 15:00:18, Aneesh Kumar K.V wrote:
> 
> Hi Michal,
> 
> Michal Hocko  writes:
> 
> > On Wed 25-07-18 13:11:15, John Allen wrote:
> > [...]
> >> Does a failure in do_migrate_range indicate that the range is unmigratable
> >> and the loop in __offline_pages should terminate and goto failed_removal? 
> >> Or
> >> should we allow a certain number of retries before we
> >> give up on migrating the range?
> >
> > Unfortunately not. Migration code doesn't tell the difference between
> > ephemeral and permanent failures. We are relying on
> > start_isolate_page_range to tell us this. So the question is, what kind
> > of page is not migratable and for what reason.
> >
> > Are you able to add some debugging to give us more information. The
> > current debugging code in the hotplug/migration sucks...
> 
> Haren did most of the debugging, so at minimum we need a patch like this
> I guess. 
> 
> modified   mm/page_alloc.c
> @@ -7649,6 +7649,10 @@ bool has_unmovable_pages(struct zone *zone, struct 
> page *page, int count,
>* handle each tail page individually in migration.
>*/
>   if (PageHuge(page)) {
> +
> + if (!IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION))
> + goto unmovable;
> +
>   iter = round_up(iter + 1, 1<<compound_order(page)) - 1;
>   continue;
>   }
> 
> 
> The problem is that start_isolate_page_range doesn't look at hugetlb pages and
> return an error if the architecture doesn't support HUGEPAGE migration.

Yes this makes sense. I didn't really have arches without huge page
migration in mind.
 
> Now, discussing with Naoya, I was suggesting whether we should add a
> similar check in 
> 
> modified   mm/memory_hotplug.c
> @@ -1338,7 +1338,8 @@ static unsigned long scan_movable_pages(unsigned long 
> start, unsigned long end)
>   return pfn;
>   if (__PageMovable(page))
>   return pfn;
> - if (PageHuge(page)) {
> + if (IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION) &&
> + PageHuge(page)) {
>   if (page_huge_active(page))
>   return pfn;
> 
> One of the questions there is whether it is possible to get new hugetlb pages
> allocated in that range after start_isolate_page_range ran. But I guess since
> we marked all the free pages as MIGRATE_ISOLATE that is not possible?

I do not follow. You are usually allocating new pages outside of the
offlined range. But the above change makes sense because it doesn't
really make sense to migrate a pfn if it is backed by a non-migratable
huge page. dissolve_free_huge_pages should then try to remove it
completely and the offlining fails if that is not possible.

> But then it is good to have scan_movable_pages also check for
> HUGEPAGE_MIGRATION?

Good question. It is not necessary right now because has_unmovable_pages
called earlier should take care of it. But I guess it will not hurt to
have it there as well for the clarity.
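
(For context, the ordering being relied on is roughly the following; a
heavily simplified sketch, not the real __offline_pages() code:)

static int offline_pages_sketch(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long pfn;

	/* start_isolate_page_range() consults has_unmovable_pages() and fails
	 * the offline early if the range contains unmovable pages */
	if (start_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE, true))
		return -EBUSY;

	/* only afterwards do we scan for pages that still need migrating */
	while ((pfn = scan_movable_pages(start_pfn, end_pfn)))
		do_migrate_range(pfn, end_pfn);

	return 0;
}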

> 
> Complete patch below.
> 
> commit 2e9d754ac211f2af3731f15df3cd8cd070b4cc54
> Author: Aneesh Kumar K.V 
> Date:   Tue Aug 21 14:17:55 2018 +0530
> 
> mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not 
> supported.
> 
> When scanning for movable pages, filter out Hugetlb pages if hugepage 
> migration
> is not supported. Without this we hit an infinite loop in __offline_pages
> where we
> do
> pfn = scan_movable_pages(start_pfn, end_pfn);
> if (pfn) { /* We have movable pages */
> ret = do_migrate_range(pfn, end_pfn);
> goto repeat;
> }
> 
> We do support hugetlb migration only if the hugetlb pages are at pmd
> level. Here
> we just check for the kernel config. The gigantic page size check is done in
> page_huge_active.

That being said. The patch makes sense to me.

> 
> Reported-by: Haren Myneni 
> CC: Naoya Horiguchi 
> Signed-off-by: Aneesh Kumar K.V 

I guess we should mark it for stable, even though I am not sure how often
we see CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=n.

Acked-by: Michal Hocko 

> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 4eb6e824a80c..f9bdea685cf4 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1338,7 +1338,8 @@ static unsigned long scan_movable_pages(unsigned long 
> start, unsigned long end)
>   return pfn;
>   if (__PageMovable(page))
>   return pfn;
> - if (PageHuge(page)) {
> + if (IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION) &&
> + PageHuge(page)) {
>   if (page_huge_active(page))
>   return pfn;
>   else
> 

Re: [v5] powerpc/topology: Get topology for shared processors at boot

2018-08-22 Thread Michael Ellerman
Srikar Dronamraju  writes:
> * Michael Ellerman  [2018-08-21 20:35:23]:
>
>> On Fri, 2018-08-17 at 14:54:39 UTC, Srikar Dronamraju wrote:
>> > On a shared lpar, Phyp will not update the cpu associativity at boot
>> > time. Just after boot the system does recognize itself as a shared lpar and
>> > triggers a request for the correct cpu associativity. But by then the scheduler
>> > would have already created/destroyed its sched domains.
>> > 
>> > This causes
>> > - Broken load balance across Nodes causing islands of cores.
>> > - Performance degradation esp if the system is lightly loaded
>> > - dmesg to wrongly report all cpus to be in Node 0.
>> > - Messages in dmesg saying broken topology.
>> > - With commit 051f3ca02e46 ("sched/topology: Introduce NUMA identity
>> >   node sched domain"), can cause rcu stalls at boot up.
>> > 
>> > 
>> > Previous attempt to solve this problem
>> > https://patchwork.ozlabs.org/patch/530090/
>> > 
>> > Reported-by: Manjunatha H R 
>> > Signed-off-by: Srikar Dronamraju 
>> 
>> Applied to powerpc next, thanks.
>> 
>> https://git.kernel.org/powerpc/c/2ea62630681027c455117aa471ea3a
>> 
>
> Once it gets to Linus's tree, can we request this to be included in
> stable trees?

You can yes.

I'd prefer if we wait a week or two so it can get some testing first.

cheers



Re: [RFC PATCH 1/5] powerpc/64s/hash: convert SLB miss handlers to C

2018-08-22 Thread Michael Ellerman
Nicholas Piggin  writes:
> On Tue, 21 Aug 2018 16:46:02 +1000
> Michael Ellerman  wrote:
>> Nicholas Piggin  writes:
>> > This patch moves SLB miss handlers completely to C, using the standard
>> > exception handler macros to set up the stack and branch to C.
>> >
>> > This can be done because the segment containing the kernel stack is
>> > always bolted, so accessing it with relocation on will not cause an
>> > SLB exception.
>> >
>> > Arbitrary kernel memory may not be accessed when handling kernel space
>> > SLB misses, so care should be taken there.  
>> 
>> We'll need to mark everything that's used in slb.c as notrace, otherwise
>> 
>> Probably we just need to mark the whole file as not traceable.
>
> Yeah good point there. I'll do that. The whole file including things we
> allow today? How do we do that, like this?

For now yeah do the whole file, if there's anything in there we're sure
is safe then we can move it out later.

> CFLAGS_REMOVE_slb.o = -mno-sched-epilog $(CC_FLAGS_FTRACE)

Yeah AFAIK.
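
(For contrast, the per-function alternative set aside here would be the
notrace annotation, roughly as below; purely illustrative, the helper name
and body are made up:)

#include <linux/compiler.h>	/* for notrace */

/* sketch: keep one SLB-miss helper out of ftrace instead of stripping
 * CC_FLAGS_FTRACE from the whole file */
static notrace unsigned long slb_example_ea_to_esid(unsigned long ea)
{
	/* must only touch data that cannot itself take an SLB miss */
	return ea >> 28;	/* SID_SHIFT on hash MMUs */
}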

And yet another reminder for me to finally work out if we still need the
epilog crap.

cheers


Re: Infinite looping observed in __offline_pages

2018-08-22 Thread Aneesh Kumar K.V


Hi Michal,

Michal Hocko  writes:

> On Wed 25-07-18 13:11:15, John Allen wrote:
> [...]
>> Does a failure in do_migrate_range indicate that the range is unmigratable
>> and the loop in __offline_pages should terminate and goto failed_removal? Or
>> should we allow a certain number of retries before we
>> give up on migrating the range?
>
> Unfortunately not. Migration code doesn't tell the difference between
> ephemeral and permanent failures. We are relying on
> start_isolate_page_range to tell us this. So the question is, what kind
> of page is not migratable and for what reason.
>
> Are you able to add some debugging to give us more information. The
> current debugging code in the hotplug/migration sucks...

Haren did most of the debugging, so at minimum we need a patch like this
I guess. 

modified   mm/page_alloc.c
@@ -7649,6 +7649,10 @@ bool has_unmovable_pages(struct zone *zone, struct page 
*page, int count,
 * handle each tail page individually in migration.
 */
if (PageHuge(page)) {
+
+   if (!IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION))
+   goto unmovable;
+
	iter = round_up(iter + 1, 1<<compound_order(page)) - 1;
	continue;
	}

Complete patch below.

commit 2e9d754ac211f2af3731f15df3cd8cd070b4cc54
Author: Aneesh Kumar K.V 
Date:   Tue Aug 21 14:17:55 2018 +0530

mm/hugetlb: filter out hugetlb pages if HUGEPAGE migration is not supported.

When scanning for movable pages, filter out Hugetlb pages if hugepage 
migration
is not supported. Without this we hit an infinite loop in __offline_pages where 
we
do
pfn = scan_movable_pages(start_pfn, end_pfn);
if (pfn) { /* We have movable pages */
ret = do_migrate_range(pfn, end_pfn);
goto repeat;
}

We do support hugetlb migration only if the hugetlb pages are at pmd level.
Here
we just check for the kernel config. The gigantic page size check is done in
page_huge_active.

Reported-by: Haren Myneni 
CC: Naoya Horiguchi 
Signed-off-by: Aneesh Kumar K.V 

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4eb6e824a80c..f9bdea685cf4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1338,7 +1338,8 @@ static unsigned long scan_movable_pages(unsigned long 
start, unsigned long end)
return pfn;
if (__PageMovable(page))
return pfn;
-   if (PageHuge(page)) {
+   if (IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION) &&
+   PageHuge(page)) {
if (page_huge_active(page))
return pfn;
else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15ea511fb41c..a3f81e18c882 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7649,6 +7649,10 @@ bool has_unmovable_pages(struct zone *zone, struct page 
*page, int count,
 * handle each tail page individually in migration.
 */
if (PageHuge(page)) {
+
+   if (!IS_ENABLED(CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION))
+   goto unmovable;
+
	iter = round_up(iter + 1, 1<<compound_order(page)) - 1;
	continue;
	}

Patch "net/ethernet/freescale/fman: fix cross-build error" has been added to the 4.4-stable tree

2018-08-22 Thread gregkh


This is a note to let you know that I've just added the patch titled

net/ethernet/freescale/fman: fix cross-build error

to the 4.4-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 net-ethernet-freescale-fman-fix-cross-build-error.patch
and it can be found in the queue-4.4 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From foo@baz Wed Aug 22 10:28:26 CEST 2018
From: Randy Dunlap 
Date: Fri, 13 Jul 2018 21:25:19 -0700
Subject: net/ethernet/freescale/fman: fix cross-build error

From: Randy Dunlap 

[ Upstream commit c133459765fae249ba482f62e12f987aec4376f0 ]

  CC [M]  drivers/net/ethernet/freescale/fman/fman.o
In file included from ../drivers/net/ethernet/freescale/fman/fman.c:35:
../include/linux/fsl/guts.h: In function 'guts_set_dmacr':
../include/linux/fsl/guts.h:165:2: error: implicit declaration of function 
'clrsetbits_be32' [-Werror=implicit-function-declaration]
  clrsetbits_be32(&guts->dmacr, 3 << shift, device << shift);
  ^~~

Signed-off-by: Randy Dunlap 
Cc: Madalin Bucur 
Cc: net...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman 
---
 include/linux/fsl/guts.h |1 +
 1 file changed, 1 insertion(+)

--- a/include/linux/fsl/guts.h
+++ b/include/linux/fsl/guts.h
@@ -16,6 +16,7 @@
 #define __FSL_GUTS_H__
 
 #include <linux/types.h>
+#include <linux/io.h>
 
 /**
  * Global Utility Registers.


Patches currently in stable-queue which might be from rdun...@infradead.org are

queue-4.4/tcp-identify-cryptic-messages-as-tcp-seq-bugs.patch
queue-4.4/net-ethernet-freescale-fman-fix-cross-build-error.patch


[PATCH v2] crypto: vmx - Fix sleep-in-atomic bugs

2018-08-22 Thread Ondrej Mosnacek
This patch fixes sleep-in-atomic bugs in AES-CBC and AES-XTS VMX
implementations. The problem is that the blkcipher_* functions should
not be called in atomic context.

The bugs can be reproduced via the AF_ALG interface by trying to
encrypt/decrypt sufficiently large buffers (at least 64 KiB) using the
VMX implementations of 'cbc(aes)' or 'xts(aes)'. Such operations then
trigger BUG in crypto_yield():

[  891.863680] BUG: sleeping function called from invalid context at 
include/crypto/algapi.h:424
[  891.864622] in_atomic(): 1, irqs_disabled(): 0, pid: 12347, name: kcapi-enc
[  891.864739] 1 lock held by kcapi-enc/12347:
[  891.864811]  #0: f5d42c46 (sk_lock-AF_ALG){+.+.}, at: 
skcipher_recvmsg+0x50/0x530
[  891.865076] CPU: 5 PID: 12347 Comm: kcapi-enc Not tainted 
4.19.0-0.rc0.git3.1.fc30.ppc64le #1
[  891.865251] Call Trace:
[  891.865340] [c003387578c0] [c0d67ea4] dump_stack+0xe8/0x164 
(unreliable)
[  891.865511] [c00338757910] [c0172a58] ___might_sleep+0x2f8/0x310
[  891.865679] [c00338757990] [c06bff74] 
blkcipher_walk_done+0x374/0x4a0
[  891.865825] [c003387579e0] [d7e73e70] 
p8_aes_cbc_encrypt+0x1c8/0x260 [vmx_crypto]
[  891.865993] [c00338757ad0] [c06c0ee0] 
skcipher_encrypt_blkcipher+0x60/0x80
[  891.866128] [c00338757b10] [c06ec504] 
skcipher_recvmsg+0x424/0x530
[  891.866283] [c00338757bd0] [c0b00654] sock_recvmsg+0x74/0xa0
[  891.866403] [c00338757c10] [c0b00f64] ___sys_recvmsg+0xf4/0x2f0
[  891.866515] [c00338757d90] [c0b02bb8] __sys_recvmsg+0x68/0xe0
[  891.866631] [c00338757e30] [c000bbe4] system_call+0x5c/0x70

Fixes: 8c755ace357c ("crypto: vmx - Adding CBC routines for VMX module")
Fixes: c07f5d3da643 ("crypto: vmx - Adding support for XTS")
Cc: sta...@vger.kernel.org
Signed-off-by: Ondrej Mosnacek 
---
Still untested, please test and review if possible.

Changes in v2:
- fix leaving preemption, etc. disabled when leaving the function
  (I switched to the more obvious and less efficient variant for the
  sake of clarity.)

 drivers/crypto/vmx/aes_cbc.c | 30 ++
 drivers/crypto/vmx/aes_xts.c | 21 ++---
 2 files changed, 28 insertions(+), 23 deletions(-)

diff --git a/drivers/crypto/vmx/aes_cbc.c b/drivers/crypto/vmx/aes_cbc.c
index 5285ece4f33a..b71895871be3 100644
--- a/drivers/crypto/vmx/aes_cbc.c
+++ b/drivers/crypto/vmx/aes_cbc.c
@@ -107,24 +107,23 @@ static int p8_aes_cbc_encrypt(struct blkcipher_desc *desc,
ret = crypto_skcipher_encrypt(req);
skcipher_request_zero(req);
} else {
-   preempt_disable();
-   pagefault_disable();
-   enable_kernel_vsx();
-
blkcipher_walk_init(&walk, dst, src, nbytes);
ret = blkcipher_walk_virt(desc, &walk);
while ((nbytes = walk.nbytes)) {
+   preempt_disable();
+   pagefault_disable();
+   enable_kernel_vsx();
aes_p8_cbc_encrypt(walk.src.virt.addr,
   walk.dst.virt.addr,
   nbytes & AES_BLOCK_MASK,
   &ctx->enc_key, walk.iv, 1);
+   disable_kernel_vsx();
+   pagefault_enable();
+   preempt_enable();
+
nbytes &= AES_BLOCK_SIZE - 1;
ret = blkcipher_walk_done(desc, &walk, nbytes);
}
-
-   disable_kernel_vsx();
-   pagefault_enable();
-   preempt_enable();
}
 
return ret;
@@ -147,24 +146,23 @@ static int p8_aes_cbc_decrypt(struct blkcipher_desc *desc,
ret = crypto_skcipher_decrypt(req);
skcipher_request_zero(req);
} else {
-   preempt_disable();
-   pagefault_disable();
-   enable_kernel_vsx();
-
blkcipher_walk_init(&walk, dst, src, nbytes);
ret = blkcipher_walk_virt(desc, &walk);
while ((nbytes = walk.nbytes)) {
+   preempt_disable();
+   pagefault_disable();
+   enable_kernel_vsx();
aes_p8_cbc_encrypt(walk.src.virt.addr,
   walk.dst.virt.addr,
   nbytes & AES_BLOCK_MASK,
   &ctx->dec_key, walk.iv, 0);
+   disable_kernel_vsx();
+   pagefault_enable();
+   preempt_enable();
+
nbytes &= AES_BLOCK_SIZE - 1;
ret = blkcipher_walk_done(desc, &walk, nbytes);
}
-
-   disable_kernel_vsx();
-   pagefault_enable();
-   preempt_enable();
}
 

Re: [PATCH] crypto: vmx - Fix sleep-in-atomic bugs

2018-08-22 Thread Ondrej Mosnáček
On Tue 21 Aug 2018 at 18:41, Marcelo Henrique Cerri
 wrote:
> On Tue, Aug 21, 2018 at 05:24:45PM +0200, Ondrej Mosnáček wrote:
> > CC: Paulo Flabiano Smorigo ,
> > linuxppc-dev@lists.ozlabs.org
> >
> > (Sorry, sent this before reading new e-mails in the thread...)
> >
> > On Tue 21 Aug 2018 at 17:18, Ondrej Mosnacek  wrote:
> > >
> > > This patch fixes sleep-in-atomic bugs in AES-CBC and AES-XTS VMX
> > > implementations. The problem is that the blkcipher_* functions should
> > > not be called in atomic context.
> > >
> > > The bugs can be reproduced via the AF_ALG interface by trying to
> > > encrypt/decrypt sufficiently large buffers (at least 64 KiB) using the
> > > VMX implementations of 'cbc(aes)' or 'xts(aes)'. Such operations then
> > > trigger BUG in crypto_yield():
> > >
> > > [  891.863680] BUG: sleeping function called from invalid context at 
> > > include/crypto/algapi.h:424
> > > [  891.864622] in_atomic(): 1, irqs_disabled(): 0, pid: 12347, name: 
> > > kcapi-enc
> > > [  891.864739] 1 lock held by kcapi-enc/12347:
> > > [  891.864811]  #0: f5d42c46 (sk_lock-AF_ALG){+.+.}, at: 
> > > skcipher_recvmsg+0x50/0x530
> > > [  891.865076] CPU: 5 PID: 12347 Comm: kcapi-enc Not tainted 
> > > 4.19.0-0.rc0.git3.1.fc30.ppc64le #1
> > > [  891.865251] Call Trace:
> > > [  891.865340] [c003387578c0] [c0d67ea4] 
> > > dump_stack+0xe8/0x164 (unreliable)
> > > [  891.865511] [c00338757910] [c0172a58] 
> > > ___might_sleep+0x2f8/0x310
> > > [  891.865679] [c00338757990] [c06bff74] 
> > > blkcipher_walk_done+0x374/0x4a0
> > > [  891.865825] [c003387579e0] [d7e73e70] 
> > > p8_aes_cbc_encrypt+0x1c8/0x260 [vmx_crypto]
> > > [  891.865993] [c00338757ad0] [c06c0ee0] 
> > > skcipher_encrypt_blkcipher+0x60/0x80
> > > [  891.866128] [c00338757b10] [c06ec504] 
> > > skcipher_recvmsg+0x424/0x530
> > > [  891.866283] [c00338757bd0] [c0b00654] 
> > > sock_recvmsg+0x74/0xa0
> > > [  891.866403] [c00338757c10] [c0b00f64] 
> > > ___sys_recvmsg+0xf4/0x2f0
> > > [  891.866515] [c00338757d90] [c0b02bb8] 
> > > __sys_recvmsg+0x68/0xe0
> > > [  891.866631] [c00338757e30] [c000bbe4] system_call+0x5c/0x70
> > >
> > > Fixes: 8c755ace357c ("crypto: vmx - Adding CBC routines for VMX module")
> > > Fixes: c07f5d3da643 ("crypto: vmx - Adding support for XTS")
> > > Cc: sta...@vger.kernel.org
> > > Signed-off-by: Ondrej Mosnacek 
> > > ---
> > > This patch should fix the issue, but I didn't test it. (I'll see if I
> > > can find some time tomorrow to try and recompile the kernel on a PPC
> > > machine... in the meantime please review :)
> > >
> > >  drivers/crypto/vmx/aes_cbc.c | 30 ++
> > >  drivers/crypto/vmx/aes_xts.c | 19 ---
> > >  2 files changed, 26 insertions(+), 23 deletions(-)
> > >
> > > diff --git a/drivers/crypto/vmx/aes_cbc.c b/drivers/crypto/vmx/aes_cbc.c
> > > index 5285ece4f33a..b71895871be3 100644
> > > --- a/drivers/crypto/vmx/aes_cbc.c
> > > +++ b/drivers/crypto/vmx/aes_cbc.c
> > > @@ -107,24 +107,23 @@ static int p8_aes_cbc_encrypt(struct blkcipher_desc 
> > > *desc,
> > > ret = crypto_skcipher_encrypt(req);
> > > skcipher_request_zero(req);
> > > } else {
> > > -   preempt_disable();
> > > -   pagefault_disable();
> > > -   enable_kernel_vsx();
> > > -
> > > blkcipher_walk_init(&walk, dst, src, nbytes);
> > > ret = blkcipher_walk_virt(desc, &walk);
> > > while ((nbytes = walk.nbytes)) {
> > > +   preempt_disable();
> > > +   pagefault_disable();
> > > +   enable_kernel_vsx();
> > > aes_p8_cbc_encrypt(walk.src.virt.addr,
> > >walk.dst.virt.addr,
> > >nbytes & AES_BLOCK_MASK,
> > > &ctx->enc_key, walk.iv, 1);
> > > +   disable_kernel_vsx();
> > > +   pagefault_enable();
> > > +   preempt_enable();
> > > +
> > > nbytes &= AES_BLOCK_SIZE - 1;
> > > ret = blkcipher_walk_done(desc, &walk, nbytes);
> > > }
> > > -
> > > -   disable_kernel_vsx();
> > > -   pagefault_enable();
> > > -   preempt_enable();
> > > }
> > >
> > > return ret;
> > > @@ -147,24 +146,23 @@ static int p8_aes_cbc_decrypt(struct blkcipher_desc 
> > > *desc,
> > > ret = crypto_skcipher_decrypt(req);
> > > skcipher_request_zero(req);
> > > } else {
> > > -   preempt_disable();
> > > -   pagefault_disable();
> > > -   enable_kernel_vsx();
> > > -
> > > blkcipher_walk_init(&walk, dst, src, nbytes);
> > >   

Patch "net/ethernet/freescale/fman: fix cross-build error" has been added to the 4.14-stable tree

2018-08-22 Thread gregkh


This is a note to let you know that I've just added the patch titled

net/ethernet/freescale/fman: fix cross-build error

to the 4.14-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 net-ethernet-freescale-fman-fix-cross-build-error.patch
and it can be found in the queue-4.14 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From foo@baz Wed Aug 22 09:33:46 CEST 2018
From: Randy Dunlap 
Date: Fri, 13 Jul 2018 21:25:19 -0700
Subject: net/ethernet/freescale/fman: fix cross-build error

From: Randy Dunlap 

[ Upstream commit c133459765fae249ba482f62e12f987aec4376f0 ]

  CC [M]  drivers/net/ethernet/freescale/fman/fman.o
In file included from ../drivers/net/ethernet/freescale/fman/fman.c:35:
../include/linux/fsl/guts.h: In function 'guts_set_dmacr':
../include/linux/fsl/guts.h:165:2: error: implicit declaration of function 
'clrsetbits_be32' [-Werror=implicit-function-declaration]
  clrsetbits_be32(&guts->dmacr, 3 << shift, device << shift);
  ^~~

Signed-off-by: Randy Dunlap 
Cc: Madalin Bucur 
Cc: net...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman 
---
 include/linux/fsl/guts.h |1 +
 1 file changed, 1 insertion(+)

--- a/include/linux/fsl/guts.h
+++ b/include/linux/fsl/guts.h
@@ -16,6 +16,7 @@
 #define __FSL_GUTS_H__
 
 #include <linux/types.h>
+#include <linux/io.h>
 
 /**
  * Global Utility Registers.


Patches currently in stable-queue which might be from rdun...@infradead.org are

queue-4.14/tcp-identify-cryptic-messages-as-tcp-seq-bugs.patch
queue-4.14/net-ethernet-freescale-fman-fix-cross-build-error.patch


Re: Odd SIGSEGV issue introduced by commit 6b31d5955cb29 ("mm, oom: fix potential data corruption when oom_reaper races with writer")

2018-08-22 Thread Christophe LEROY




On 21/08/2018 at 19:50, Ram Pai wrote:

On Tue, Aug 21, 2018 at 04:40:15PM +1000, Michael Ellerman wrote:

Christophe LEROY  writes:
...


And I bisected its disappearance with commit 99cd1302327a2 ("powerpc:
Deliver SEGV signal on pkey violation")


Whoa that's weird.


Looking at those two commits, especially the one which makes it
disappear, I'm quite sceptical. Any idea what could be the cause and/or
how to investigate further ?


Are you sure it's not some corruption that just happens to be masked by
that commit? I can't see anything in that commit that could explain that
change in behaviour.

The only real change is if you're hitting DSISR_KEYFAULT isn't it?


even with commit 99cd1302327a2, a SEGV signal should get generated,
which should kill the process. Unless the process handles SEGV signals
with SEGV_PKUERR differently.


No, the sigsegvs are not handled differently. And the trace showed it is
SEGV_MAPERR that is generated.




The other surprising thing is, why is DSISR_KEYFAULT getting generated
in the first place?  Are keys somehow getting programmed into the HPTE?


It can't be that, because DSISR_KEYFAULT is filtered out when applying the
DSISR_SRR1_MATCH_32S mask.




Feels like some random corruption.


In a way yes, except that it is always at the same instruction (in
ld.so) and always because the accessed address is 0x67xx instead of
0x77xx.
I also tested with TASK_SIZE set to 0xa000 instead of 0x8000,
and I get the same failure with the bad address being 0x87xx instead of
0x97xx.


Christophe



Is this behavior seen with power8 or power9?

RP



Patch "net/ethernet/freescale/fman: fix cross-build error" has been added to the 4.9-stable tree

2018-08-22 Thread gregkh


This is a note to let you know that I've just added the patch titled

net/ethernet/freescale/fman: fix cross-build error

to the 4.9-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 net-ethernet-freescale-fman-fix-cross-build-error.patch
and it can be found in the queue-4.9 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From foo@baz Wed Aug 22 09:42:09 CEST 2018
From: Randy Dunlap 
Date: Fri, 13 Jul 2018 21:25:19 -0700
Subject: net/ethernet/freescale/fman: fix cross-build error

From: Randy Dunlap 

[ Upstream commit c133459765fae249ba482f62e12f987aec4376f0 ]

  CC [M]  drivers/net/ethernet/freescale/fman/fman.o
In file included from ../drivers/net/ethernet/freescale/fman/fman.c:35:
../include/linux/fsl/guts.h: In function 'guts_set_dmacr':
../include/linux/fsl/guts.h:165:2: error: implicit declaration of function 
'clrsetbits_be32' [-Werror=implicit-function-declaration]
  clrsetbits_be32(&guts->dmacr, 3 << shift, device << shift);
  ^~~

Signed-off-by: Randy Dunlap 
Cc: Madalin Bucur 
Cc: net...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman 
---
 include/linux/fsl/guts.h |1 +
 1 file changed, 1 insertion(+)

--- a/include/linux/fsl/guts.h
+++ b/include/linux/fsl/guts.h
@@ -16,6 +16,7 @@
 #define __FSL_GUTS_H__
 
 #include <linux/types.h>
+#include <linux/io.h>
 
 /**
  * Global Utility Registers.


Patches currently in stable-queue which might be from rdun...@infradead.org are

queue-4.9/tcp-identify-cryptic-messages-as-tcp-seq-bugs.patch
queue-4.9/net-ethernet-freescale-fman-fix-cross-build-error.patch


Patch "net/ethernet/freescale/fman: fix cross-build error" has been added to the 4.17-stable tree

2018-08-22 Thread gregkh


This is a note to let you know that I've just added the patch titled

net/ethernet/freescale/fman: fix cross-build error

to the 4.17-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 net-ethernet-freescale-fman-fix-cross-build-error.patch
and it can be found in the queue-4.17 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From foo@baz Wed Aug 22 09:16:56 CEST 2018
From: Randy Dunlap 
Date: Fri, 13 Jul 2018 21:25:19 -0700
Subject: net/ethernet/freescale/fman: fix cross-build error

From: Randy Dunlap 

[ Upstream commit c133459765fae249ba482f62e12f987aec4376f0 ]

  CC [M]  drivers/net/ethernet/freescale/fman/fman.o
In file included from ../drivers/net/ethernet/freescale/fman/fman.c:35:
../include/linux/fsl/guts.h: In function 'guts_set_dmacr':
../include/linux/fsl/guts.h:165:2: error: implicit declaration of function 
'clrsetbits_be32' [-Werror=implicit-function-declaration]
  clrsetbits_be32(&guts->dmacr, 3 << shift, device << shift);
  ^~~

Signed-off-by: Randy Dunlap 
Cc: Madalin Bucur 
Cc: net...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman 
---
 include/linux/fsl/guts.h |1 +
 1 file changed, 1 insertion(+)

--- a/include/linux/fsl/guts.h
+++ b/include/linux/fsl/guts.h
@@ -16,6 +16,7 @@
 #define __FSL_GUTS_H__
 
 #include <linux/types.h>
+#include <linux/io.h>
 
 /**
  * Global Utility Registers.


Patches currently in stable-queue which might be from rdun...@infradead.org are

queue-4.17/tcp-identify-cryptic-messages-as-tcp-seq-bugs.patch
queue-4.17/net-ethernet-freescale-fman-fix-cross-build-error.patch


Re: [PATCH 17/20] powerpc/dma-swiotlb: use generic swiotlb_dma_ops

2018-08-22 Thread Christoph Hellwig
On Thu, Aug 09, 2018 at 11:57:53AM +1000, Benjamin Herrenschmidt wrote:
> Note: We will still need to implement some custom variant of this
> for our secure VMs ... 
> 
> Basically we'll need to use the existing bounce buffering as-is, but
> the condition will be different: it won't be whether the address is
> below a certain limit, it will be *always*.

The conditions are in the dma_capable() helper that the architecture
can override (and which powerpc already does override).
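
(As an illustration of where that hook sits, not of any planned
implementation: an arch override could make the bounce condition
unconditional for secure guests along these lines; is_secure_guest() is a
placeholder, not an existing helper.)

/* sketch of an arch-specific dma_capable() that forces swiotlb bouncing */
static inline bool dma_capable(struct device *dev, dma_addr_t addr, size_t size)
{
	if (is_secure_guest())		/* placeholder predicate */
		return false;		/* never "capable": always bounce */

	if (!dev->dma_mask)
		return false;

	return addr + size - 1 <= *dev->dma_mask;
}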


Re: [PATCH 10/20] powerpc/dma-noncoherent: don't disable irqs over kmap_atomic

2018-08-22 Thread Christoph Hellwig
On Thu, Aug 09, 2018 at 10:27:46AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > The requirement to disable local irqs over kmap_atomic is long gone,
> > so remove those calls.
> 
> Really ? I'm trying to verify that and getting lost in a mess of macros
> from hell in the per-cpu stuff but if you look at our implementation
> of kmap_atomic_prot(), all it does is a preempt_disable(), and then
> it uses kmap_atomic_idx_push():
> 
>   int idx = __this_cpu_inc_return(__kmap_atomic_idx) - 1;
> 
> Note the use of __this_cpu_inc_return(), not this_cpu_inc_return(),
> ie this is the non-interrupt safe version...

Looks like the powerpc variant indeed isn't safe.

I did look a bit more through the code and history, and it seems
like we removed the need to disable irqs when called from process
context a while ago, but we still require disabling irqs when called
from irq context.  Given that this code can also be called from
irq context, we'll have to keep the local_irq_save.
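
(The pattern being kept is roughly the one below; a simplified sketch of an
atomic-kmap user that may run in interrupt context, not the actual powerpc
dma-noncoherent code.)

#include <linux/highmem.h>
#include <linux/irqflags.h>
#include <linux/string.h>

static void example_clear_highpage_chunk(struct page *page, size_t offset,
					 size_t len)
{
	unsigned long flags;
	void *vaddr;

	local_irq_save(flags);		/* callable from irq context too */
	vaddr = kmap_atomic(page);
	memset(vaddr + offset, 0, len);	/* work on the temporary mapping */
	kunmap_atomic(vaddr);
	local_irq_restore(flags);
}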


Re: [PATCH 02/20] kernel/dma/direct: refine dma_direct_alloc zone selection

2018-08-22 Thread Christoph Hellwig
On Thu, Aug 09, 2018 at 09:54:33AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-07-30 at 18:38 +0200, Christoph Hellwig wrote:
> > We need to take the DMA offset and encryption bit into account when 
> > selecting
> > a zone.  Add a helper that takes those into account and use it.
> 
> That whole "encryption" stuff seems to be completely specific to the
> way x86 does memory encryption, or am I mistaken ? It's not clear to me
> what that does in practice and how it relates to DMA mappings.

Not even all of x86, but AMD in particular; Intel does it yet another
way.  But it is still easier to take this into the core with a few
overrides than to duplicate all the code.

> I'm also not sure about that whole business with ZONE_DMA and
> ARCH_ZONE_DMA_BITS...

ZONE_DMA usually (but not always) maps to 24 bits of address space;
if it doesn't (I'm mostly thinking about s390 with its odd 31 bits),
the architecture can override it if it cares.

> On ppc64, unless you enable swiotlb (which we only do currently on
> some embedded platforms), you have all of memory in ZONE_DMA.
> 
> [0.00] Zone ranges:
> [0.00]   DMA  [mem 0x-0x001f]
> [0.00]   DMA32empty
> [0.00]   Normal   empty
> [0.00]   Device   empty

This is really weird.  Why would you wire up ZONE_DMA like this?
The general scheme that architectures should implement is:

ZONE_DMA:   Any memory below a magic threshold that is lower than
32-bit.  Only enabled if actually required (usually
either 24-bit for ISA, or some other weird architecture
specific value like 32-bit for S/390)
ZONE_DMA32: Memory <= 32-bit if the architecture supports more than
32-bits worth of physical address space.  Should generally
be enabled on all 64-bit architectures unless you have
a very good reason not to.
ZONE_NORMAL:Everything above 32-bit not falling into HIGHMEM or
MOVEABLE.
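
(A sketch of how a 64-bit architecture might wire those limits up at zone
init time; the 24-bit ZONE_DMA cut-off is just the common example value, not
a recommendation for any particular architecture.)

#include <linux/bootmem.h>	/* max_pfn */
#include <linux/mm.h>

/* illustrative only: zone limits following the scheme described above */
static void __init example_zone_limits_init(void)
{
	unsigned long max_zone_pfns[MAX_NR_ZONES] = { 0 };

#ifdef CONFIG_ZONE_DMA
	max_zone_pfns[ZONE_DMA]   = min(max_pfn, 1UL << (24 - PAGE_SHIFT));
#endif
#ifdef CONFIG_ZONE_DMA32
	max_zone_pfns[ZONE_DMA32] = min(max_pfn, 1UL << (32 - PAGE_SHIFT));
#endif
	max_zone_pfns[ZONE_NORMAL] = max_pfn;

	free_area_init_nodes(max_zone_pfns);
}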


Re: [PATCH 01/20] kernel/dma/direct: take DMA offset into account in dma_direct_supported

2018-08-22 Thread Christoph Hellwig
On Thu, Aug 09, 2018 at 09:44:18AM +1000, Benjamin Herrenschmidt wrote:
> We do have the occasional device with things like 31-bit DMA
> limitation. We know they happen to work because those systems
> can't have enough memory to be a problem. This is why our current
> DMA direct ops in powerpc just unconditionally return true on ppc32.
> 
> The test against a full 32-bit mask here will break them I think.
> 
> Thing is, I'm not sure I still have access to one of these things
> to test, I'll have to dig (from memory things like b43 wifi).

Yeah, the other platforms that support these devices support ZONE_DMA
to reliably handle these devices. But there is two other ways the
current code would actually handle these fine despite the dma_direct
checks:

 1) if the device only has physical addresses up to 31-bit anyway
 2) by trying again to find a lower address.  But this only works
for coherent allocations and not streaming maps (unless we have
swiotlb with a buffer below 31-bits).

It seems powerpc can have ZONE_DMA, though and we will cover these
devices just fine.  If it didn't have that the current powerpc
code would not work either.

>  - What is this trying to achieve ?
> 
>   /*
>* Various PCI/PCIe bridges have broken support for > 32bit DMA even
>* if the device itself might support it.
>*/
>   if (dev->dma_32bit_limit && mask > phys_to_dma(dev, DMA_BIT_MASK(32)))
>   return 0;
> 
> IE, if the device has a 32-bit limit, we fail an attempt at checking
> if a >32-bit mask works ? That doesn't quite seem to be the right thing
> to do... Shouldn't this be in dma_set_mask() and just clamp the mask down ?
> 
> IE, dma_set_mask() is what a driver uses to establish the device capability,
> so it makes sense to have dma_32bit_limit just reduce that capability, not
> fail because the device can do more than what the bridge can 

If your PCI bridge / PCIe root port doesn't support dma to addresses
larger than 32-bit the device capabilities above that don't matter, it
just won't work.  We have this case at least for some old VIA x86 chipsets
and some relatively modern Xilinx FPGAs with PCIe.

>  - How is that file supposed to work on 64-bit platforms ? From what I can
> tell, dma_supported() will unconditionally return true if the mask is
> 32-bit or larger (apart from the above issue). This doesn't look right,
> the mask needs to be compared to the max memory address. There are a bunch
> of devices out there with masks anywhere between 40 and 64 bits, and
> some of these will not work "out of the box" if the offset top
> of memory is beyond the mask limit. Or am I missing something ?

You are not missing anything except for the history of this code.

Your observation is right, but there always has been the implicit
assumption that architectures with more than 4GB of physical address
space must either support an iommu or swiotlb and use that.  It's
never been documented anywhere, but I'm working on integrating all
this code to make more sense.


Re: [PATCH 08/20] powerpc/dma: remove the unused dma_nommu_ops export

2018-08-22 Thread Christoph Hellwig
On Thu, Aug 09, 2018 at 10:01:16AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2018-07-31 at 14:16 +0200, Christoph Hellwig wrote:
> > It turns out cxl actually uses it.  So for now skip this patch,
> > although random code in drivers messing with dma ops will need to
> > be sorted out sooner or later.
> 
> CXL devices are "special", they bypass the classic iommu in favor of
> allowing the device to operate using the main processor page tables
> using an MMU context (so basically the device can use userspace
> addresses directly), akin to ATS.
> 
> I think the code currently uses the nommu ops as a way to do a simple
> kernel mapping for kernel drivers using CXL (not userspace stuff)
> though.

It's still a horrible idea to have this in drivers/; we need some
core API to mediate this behavior.  Also, if the device supports
using virtual addresses, dma_nommu_ops seems wrong, as it won't do
the right thing for e.g. vmalloc addresses not mapped into the
kernel linear mapping (which I guess can't currently happen on
powerpc, but still..)
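
(To illustrate the vmalloc point: such an address has to be resolved to its
backing page before it can be handed to the DMA API; a rough sketch, helper
name made up.)

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

/* sketch: map one page of a vmalloc buffer; assuming virt == linear map
 * here would hand the device a bogus bus address */
static dma_addr_t example_map_vmalloc_page(struct device *dev, void *vaddr,
					   size_t len,
					   enum dma_data_direction dir)
{
	struct page *page = vmalloc_to_page(vaddr);

	return dma_map_page(dev, page, offset_in_page(vaddr), len, dir);
}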