Re: linux-next: Tree for Nov 7

2017-11-13 Thread Michal Hocko
On Mon 13-11-17 09:35:22, Khalid Aziz wrote:
> On 11/13/2017 09:06 AM, Michal Hocko wrote:
> > OK, so this one should take care of the backward compatibility while
> > still not touching the arch code
> > ---
> > commit 39ff9bf8597e79a032da0954aea1f0d77d137765
> > Author: Michal Hocko 
> > Date:   Mon Nov 13 17:06:24 2017 +0100
> > 
> >  mm: introduce MAP_FIXED_SAFE
> >  MAP_FIXED is used quite often but it is inherently dangerous because it
> >  unmaps an existing mapping covered by the requested range. While this
> >  might be the really desired behavior in many cases, there are others
> >  which would rather see a failure than a silent memory corruption.
> >  Introduce a new MAP_FIXED_SAFE flag for mmap to achieve this behavior.
> >  It is a MAP_FIXED extension with a single exception: it fails with
> >  ENOMEM if the requested address is already covered by an existing
> >  mapping. We still rely on get_unmapped_area to handle all the arch
> >  specific MAP_FIXED treatment and check for a conflicting vma after it
> >  returns.
> >  Signed-off-by: Michal Hocko 
> > 
> > .. deleted ...
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 680506faceae..aad8d37f0205 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1358,6 +1358,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > if (mm->map_count > sysctl_max_map_count)
> > return -ENOMEM;
> > +   /* force arch specific MAP_FIXED handling in get_unmapped_area */
> > +   if (flags & MAP_FIXED_SAFE)
> > +   flags |= MAP_FIXED;
> > +
> > /* Obtain the address to map to. we verify (or select) it and ensure
> >  * that it represents a valid section of the address space.
> >  */
> 
> Do you need to move this code above:
> 
> if (!(flags & MAP_FIXED))
> addr = round_hint_to_min(addr);
> 
> /* Careful about overflows.. */
> len = PAGE_ALIGN(len);
> if (!len)
> return -ENOMEM;
> 
> Not doing that might mean the hint address will end up being rounded for
> MAP_FIXED_SAFE which would change the behavior from MAP_FIXED.

Yes, I will move it.
-- 
Michal Hocko
SUSE Labs


Re: [PATCHv4 0/6] printk/ia64/ppc64/parisc64: let's deprecate %pF/%pf printk specifiers

2017-11-13 Thread Sergey Senozhatsky
On (11/13/17 18:17), Helge Deller wrote:
> On 10.11.2017 00:48, Sergey Senozhatsky wrote:
> > All Ack-s/Tested-by-s were dropped, since the patch set has been
> > reworked. I'm kindly asking arch-s maintainers and developers to test it
> > once again. Sorry for any inconveniences and thanks for your help in
> > advance.
> 
> I tested it successfully on parisc64.
> You can add back the
> Tested-by: Helge Deller  #parisc64

thanks!

-ss


Re: [PATCH kernel v2] powerpc/powernv/ioda: Remove explicit max window size check

2017-11-13 Thread David Gibson
On Tue, Nov 07, 2017 at 02:43:01PM +1100, Alexey Kardashevskiy wrote:
> DMA windows can only have a power-of-two size on IODA2 hardware, and
> using memory_hotplug_max() to determine the upper limit won't work
> correctly if it returns a non-power-of-two value.
> 
> This removes the check, as the platform code does this check in
> pnv_pci_ioda2_setup_default_config() anyway; the other client is VFIO,
> which checks against the locked_vm limit and thus prevents the userspace
> from locking too much memory.
> 
> This is expected to impact mostly DPDK on machines with a
> non-power-of-two RAM size. KVM guests are less likely to be affected, as
> guests usually get less than half of the host's RAM.
> 
> Signed-off-by: Alexey Kardashevskiy 

Reviewed-by: David Gibson 

This _can_ be called with essentially arbitrary window_size via the
VFIO code.  However, that should be constrained by the locked memory
limit, which is checked before this gets called.

> ---
> Changes:
> v2:
> * instead of relaxing the check, this simply removes it
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 1de94fb..433cf84 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2776,7 +2776,7 @@ static long pnv_pci_ioda2_table_alloc_pages(int nid, __u64 bus_offset,
>   if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS))
>   return -EINVAL;
>  
> - if ((window_size > memory_hotplug_max()) || !is_power_of_2(window_size))
> + if (!is_power_of_2(window_size))
>   return -EINVAL;
>  
>   /* Adjust direct table size from window_size and levels */

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH] ibmveth: Kernel crash LSO offload flag toggle

2017-11-13 Thread Daniel Axtens
Hi Bryant,

A few things:

1) The commit message could probably be trimmed, especially the stack
traces.

2) What exactly are you changing and why does it fix the issue? I
couldn't figure that out from the commit message.

3) Does this need a Fixes: tag?


>   }
>  
> - netdev->min_mtu = IBMVETH_MIN_MTU;
> - netdev->max_mtu = ETH_MAX_MTU;
> -

4) What does the above hunk do? It seems unrelated...


Regards,
Daniel


Re: [PATCH kernel] vfio/spapr: Add trace points for map/unmap

2017-11-13 Thread Alexey Kardashevskiy
On 27/10/17 14:00, Alexey Kardashevskiy wrote:
> This adds trace_map/trace_unmap tracepoints to spapr driver. Type1 already
> uses these via the IOMMU API (iommu_map/__iommu_unmap).
> 
> Signed-off-by: Alexey Kardashevskiy 

Ping?



> ---
> 
> Example:
>  qemu-system-ppc-8655  [096]   724.662740: unmap:IOMMU: iova=0x3000 size=4096 unmapped_size=4096
>  qemu-system-ppc-8656  [104]   724.970912: map:  IOMMU: iova=0x0800 paddr=0x7ffef7ff size=65536
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> index 63112c36ab2d..4531486c77c6 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -502,17 +503,19 @@ static int tce_iommu_clear(struct tce_container *container,
>   struct iommu_table *tbl,
>   unsigned long entry, unsigned long pages)
>  {
> - unsigned long oldhpa;
> + unsigned long oldhpa, unmapped, firstentry = entry, totalpages = pages;
>   long ret;
>   enum dma_data_direction direction;
>  
> - for ( ; pages; --pages, ++entry) {
> + for (unmapped = 0; pages; --pages, ++entry) {
>   direction = DMA_NONE;
>   oldhpa = 0;
>   ret = iommu_tce_xchg(tbl, entry, &oldhpa, &direction);
>   if (ret)
>   continue;
>  
> + ++unmapped;
> +
>   if (direction == DMA_NONE)
>   continue;
>  
> @@ -523,6 +526,9 @@ static int tce_iommu_clear(struct tce_container *container,
>  
>   tce_iommu_unuse_page(container, oldhpa);
>   }
> + trace_unmap(firstentry << tbl->it_page_shift,
> + totalpages << tbl->it_page_shift,
> + unmapped << tbl->it_page_shift);
>  
>   return 0;
>  }
> @@ -965,6 +971,8 @@ static long tce_iommu_ioctl(void *iommu_data,
>   direction);
>  
>   iommu_flush_tce(tbl);
> + if (!ret)
> + trace_map(param.iova, param.vaddr, param.size);
>  
>   return ret;
>   }
> 


-- 
Alexey


[PATCH] ibmveth: Kernel crash LSO offload flag toggle

2017-11-13 Thread Bryant G. Ly
The following script, when run along with some iperf traffic, recreates
the crash within 5-10 minutes or so.

while true
do
ethtool -k ibmveth0 | grep tcp-segmentation-offload
ethtool -K ibmveth0 tso off
ethtool -k ibmveth0 | grep tcp-segmentation-offload
ethtool -K ibmveth0 tso on
done

Note: This issue also happens the very first time large send offload is
turned off (but the above script recreates the issue every time).

Stack trace output:
[76563.914380] NIP [c0063940] memcpy_power7+0x40/0x800
[76563.914387] LR [d0d31788] ibmveth_start_xmit+0x1c8/0x8d0 [ibmveth]
[76563.914392] Call Trace:
[76563.914396] [c000feab3270] [c000feab32d0] 0xc000feab32d0 (unreliable)
[76563.914407] [c000feab3360] [c09816f4] dev_hard_start_xmit+0x304/0x530
[76563.914415] [c000feab3440] [c09b6564] sch_direct_xmit+0x124/0x330
[76563.914423] [c000feab34e0] [c0981ddc] __dev_queue_xmit+0x26c/0x770
[76563.914431] [c000feab3580] [c0a1efc0] arp_xmit+0x30/0xd0
[76563.914438] [c000feab35f0] [c0a1f0f4] arp_send_dst.part.0+0x94/0xb0
[76563.914445] [c000feab3660] [c0a1fcb4] arp_solicit+0x114/0x2b0
[76563.914452] [c000feab3730] [c098d8f4] neigh_probe+0x84/0xd0
[76563.914460] [c000feab3760] [c09937cc] neigh_timer_handler+0xbc/0x320
[76563.914468] [c000feab37a0] [c014a3fc] call_timer_fn+0x5c/0x1c0
[76563.914474] [c000feab3830] [c014a8bc] run_timer_softirq+0x31c/0x3f0
[76563.914483] [c000feab3900] [c00bec58] __do_softirq+0x188/0x3e0
[76563.914490] [c000feab39f0] [c00bf128] irq_exit+0xc8/0x100
[76563.914498] [c000feab3a10] [c001f974] timer_interrupt+0xa4/0xe0
[76563.914505] [c000feab3a40] [c0002714] decrementer_common+0x114/0x180

Oops output:
[76563.914173] Unable to handle kernel paging request for data at address 0x
[76563.914197] Faulting instruction address: 0xc0063940
[76563.914205] Oops: Kernel access of bad area, sig: 11 [#1]
[76563.914210] SMP NR_CPUS=2048 NUMA pSeries
[76563.914217] Modules linked in: rpadlpar_io rpaphp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag nls_utf8 isofs binfmt_misc pseries_rng rtc_generic autofs4 ibmvfc scsi_transport_fc ibmvscsi ibmveth
[76563.914251] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.0-34-generic #53-Ubuntu
[76563.914258] task: c000fe9efcc0 ti: c000feab task.ti: c000feab
[76563.914265] NIP: c0063940 LR: d0d31788 CTR: c0064100
[76563.914271] REGS: c000feab2ff0 TRAP: 0300   Not tainted  (4.4.0-34-generic)
[76563.914277] MSR: 80009033   CR: 4800284e  XER: 001a
[76563.914294] CFAR: c0008468 DAR:  DSISR: 4200 SOFTE: 1
GPR00: f240 c000feab3270 c15b5d00 
GPR04: cd9b0004 002a 0006 c000efc0ccac
GPR08: d0d3dd28 0080  d0d34758
GPR12: c0064100 ce7f1c80 c000ffdca938 0100
GPR16: c000ffdca738 c000ffdca538 c000feab34a0 c15f4d00
GPR20:  c15f4cf0 c000f5945900 
GPR24:  8000 c000feab32d0 c000efc0ccac
GPR28: c000f23ccb00 c000f5945000 c000f23ccb00 c000efc0cc00
[76563.914380] NIP [c0063940] memcpy_power7+0x40/0x800
[76563.914387] LR [d0d31788] ibmveth_start_xmit+0x1c8/0x8d0 [ibmveth]
[76563.914392] Call Trace:
[76563.914396] [c000feab3270] [c000feab32d0] 0xc000feab32d0 (unreliable)
[76563.914407] [c000feab3360] [c09816f4] dev_hard_start_xmit+0x304/0x530
[76563.914415] [c000feab3440] [c09b6564] sch_direct_xmit+0x124/0x330
[76563.914423] [c000feab34e0] [c0981ddc] __dev_queue_xmit+0x26c/0x770
[76563.914431] [c000feab3580] [c0a1efc0] arp_xmit+0x30/0xd0
[76563.914438] [c000feab35f0] [c0a1f0f4] arp_send_dst.part.0+0x94/0xb0
[76563.914445] [c000feab3660] [c0a1fcb4] arp_solicit+0x114/0x2b0
[76563.914452] [c000feab3730] [c098d8f4] neigh_probe+0x84/0xd0
[76563.914460] [c000feab3760] [c09937cc] neigh_timer_handler+0xbc/0x320
[76563.914468] [c000feab37a0] [c014a3fc] call_timer_fn+0x5c/0x1c0
[76563.914474] [c000feab3830] [c014a8bc] run_timer_softirq+0x31c/0x3f0
[76563.914483] [c000feab3900] [c00bec58] __do_softirq+0x188/0x3e0
[76563.914490] [c000feab39f0] [c00bf128] irq_exit+0xc8/0x100
[76563.914498] [c000feab3a10] [c001f974] timer_interrupt+0xa4/0xe0
[76563.914505] [c000feab3a40] [c0002714] decrementer_common+0x114/0x180
[76563.914515] --- interrupt: 901 at plpar_hcall_norets+0x1c/0x28
[76563.914515] LR = check_and_cede_processor+0x34/0x50
[76563.914525] 

[PATCH] [net-next, v4] ibmvnic: Feature implementation of Vital Product Data (VPD) for the ibmvnic driver

2017-11-13 Thread Desnes Augusto Nunes do Rosario
This patch implements and enables VPD support for the ibmvnic driver.
Moreover, it includes the implementation of suitable structs, signal
transmission/handling, and functions which allow the retrieval of firmware
information from the ibmvnic card through the ethtool command.

Signed-off-by: Desnes A. Nunes do Rosario 
Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 153 -
 drivers/net/ethernet/ibm/ibmvnic.h |  27 ++-
 2 files changed, 176 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index b918bc2..04aaacb 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -574,6 +574,15 @@ static int reset_tx_pools(struct ibmvnic_adapter *adapter)
return 0;
 }
 
+static void release_vpd_data(struct ibmvnic_adapter *adapter)
+{
+   if (!adapter->vpd)
+   return;
+
+   kfree(adapter->vpd->buff);
+   kfree(adapter->vpd);
+}
+
 static void release_tx_pools(struct ibmvnic_adapter *adapter)
 {
struct ibmvnic_tx_pool *tx_pool;
@@ -754,6 +763,8 @@ static void release_resources(struct ibmvnic_adapter *adapter)
 {
int i;
 
+   release_vpd_data(adapter);
+
release_tx_pools(adapter);
release_rx_pools(adapter);
 
@@ -834,6 +845,57 @@ static int set_real_num_queues(struct net_device *netdev)
return rc;
 }
 
+static int ibmvnic_get_vpd(struct ibmvnic_adapter *adapter)
+{
+   struct device *dev = &adapter->vdev->dev;
+   union ibmvnic_crq crq;
+   dma_addr_t dma_addr;
+   int len = 0;
+
+   if (adapter->vpd->buff)
+   len = adapter->vpd->len;
+
+   reinit_completion(&adapter->fw_done);
+   crq.get_vpd_size.first = IBMVNIC_CRQ_CMD;
+   crq.get_vpd_size.cmd = GET_VPD_SIZE;
+   ibmvnic_send_crq(adapter, &crq);
+   wait_for_completion(&adapter->fw_done);
+
+   if (!adapter->vpd->len)
+   return -ENODATA;
+
+   if (!adapter->vpd->buff)
+   adapter->vpd->buff = kzalloc(adapter->vpd->len, GFP_KERNEL);
+   else if (adapter->vpd->len != len)
+   adapter->vpd->buff =
+   krealloc(adapter->vpd->buff,
+adapter->vpd->len, GFP_KERNEL);
+
+   if (!adapter->vpd->buff) {
+   dev_err(dev, "Could not allocate VPD buffer\n");
+   return -ENOMEM;
+   }
+
+   dma_addr = dma_map_single(dev, adapter->vpd->buff, adapter->vpd->len,
+  DMA_FROM_DEVICE);
+   if (dma_mapping_error(dev, dma_addr)) {
+   dev_err(dev, "Could not map VPD buffer\n");
+   kfree(adapter->vpd->buff);
+   return -ENOMEM;
+   }
+   adapter->vpd->dma_addr = dma_addr;
+
+   reinit_completion(&adapter->fw_done);
+   crq.get_vpd.first = IBMVNIC_CRQ_CMD;
+   crq.get_vpd.cmd = GET_VPD;
+   crq.get_vpd.ioba = cpu_to_be32(adapter->vpd->dma_addr);
+   crq.get_vpd.len = cpu_to_be32((u32)adapter->vpd->len);
+   ibmvnic_send_crq(adapter, &crq);
+   wait_for_completion(&adapter->fw_done);
+
+   return 0;
+}
+
+
 static int init_resources(struct ibmvnic_adapter *adapter)
 {
struct net_device *netdev = adapter->netdev;
@@ -851,6 +913,10 @@ static int init_resources(struct ibmvnic_adapter *adapter)
if (rc)
return rc;
 
+   adapter->vpd = kzalloc(sizeof(*adapter->vpd), GFP_KERNEL);
+   if (!adapter->vpd)
+   return -ENOMEM;
+
adapter->map_id = 1;
adapter->napi = kcalloc(adapter->req_rx_queues,
sizeof(struct napi_struct), GFP_KERNEL);
@@ -924,7 +990,7 @@ static int __ibmvnic_open(struct net_device *netdev)
 static int ibmvnic_open(struct net_device *netdev)
 {
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
-   int rc;
+   int rc, vpd;
 
mutex_lock(&adapter->reset_lock);
 
@@ -951,6 +1017,12 @@ static int ibmvnic_open(struct net_device *netdev)
 
rc = __ibmvnic_open(netdev);
netif_carrier_on(netdev);
+
+   /* Vital Product Data (VPD) */
+   vpd = ibmvnic_get_vpd(adapter);
+   if (vpd)
+   netdev_err(netdev, "failed to initialize Vital Product Data (VPD)\n");
+
mutex_unlock(&adapter->reset_lock);
 
return rc;
@@ -1879,11 +1951,15 @@ static int ibmvnic_get_link_ksettings(struct net_device *netdev,
return 0;
 }
 
-static void ibmvnic_get_drvinfo(struct net_device *dev,
+static void ibmvnic_get_drvinfo(struct net_device *netdev,
struct ethtool_drvinfo *info)
 {
+   struct ibmvnic_adapter *adapter = netdev_priv(netdev);
+
strlcpy(info->driver, ibmvnic_driver_name, sizeof(info->driver));
strlcpy(info->version, IBMVNIC_DRIVER_VERSION, sizeof(info->version));
+   strlcpy(info->fw_version, adapter->fw_version,
+   sizeof(info->fw_version));
 }
 
 

Re: [PATCHv4 0/6] printk/ia64/ppc64/parisc64: let's deprecate %pF/%pf printk specifiers

2017-11-13 Thread Helge Deller

On 10.11.2017 00:48, Sergey Senozhatsky wrote:

All Ack-s/Tested-by-s were dropped, since the patch set has been
reworked. I'm kindly asking arch-s maintainers and developers to test it
once again. Sorry for any inconveniences and thanks for your help in
advance.


I tested it successfully on parisc64.
You can add back the
Tested-by: Helge Deller  #parisc64

Thanks!
Helge


Re: [PATCH v2] powerpc64/exceptions: Refactor code to eliminate a few memory loads

2017-11-13 Thread Naveen N. Rao
Michael Ellerman wrote:
> "Naveen N. Rao"  writes:
> 
>> On 2017/06/19 03:21PM, Aneesh Kumar K.V wrote:
>>> > @@ -1445,8 +1446,8 @@ do_hash_page:
>>> >  handle_page_fault:
>>> >   andis.  r0,r4,DSISR_DABRMATCH@h
>>> >   bne-handle_dabr_fault
>>> > - ld  r4,_DAR(r1)
>>> > - ld  r5,_DSISR(r1)
>>> > + mr  r5,r4
>>> > + mr  r4,r3
>>> >   addi    r3,r1,STACK_FRAME_OVERHEAD
>>> >   bl  do_page_fault
>>> >   cmpdi   r3,0
>>> 
>>> 
>>> Can we avoid that if we rearrange args of other functions calls, so that
>>> we can use r3 and r4 as it is ?
>>
>> Here's a version that does that. Again, boot tested with radix and 
>> disable_radix.
>>
>> Thanks,
>> Naveen
>>
>> -
>> Change data_access_common() and instruction_access_common() to load the
>> trap number in r3, DAR in r4 and DSISR in r5 (rather than in r5, r3 and
>> r4 respectively). This change allows us to eliminate a few unnecessary
>> memory loads and register move operations in handle_page_fault(),
>> handle_dabr_fault() and label '77'.
>>
>> Signed-off-by: Naveen N. Rao 
>> ---
>>  arch/powerpc/kernel/exceptions-64s.S | 38 ++++++++++++++++++--------------------
>>  1 file changed, 18 insertions(+), 20 deletions(-)
> 
> Sorry I missed this and now it doesn't apply. Do you mind rebasing.

No problem - this is just a small refactoring after all :).
Here's a rebased version.  Boot tested on mambo with/without radix.

Thanks,
Naveen

--
[PATCH v3] powerpc/exceptions64s: Eliminate a few unnecessary memory loads

Change data_access_common() and instruction_access_common() to load the
trap number in r3, DAR in r4 and DSISR in r5 (rather than in r5, r3 and
r4 respectively). This change allows us to eliminate a few unnecessary
memory loads and register move operations in handle_page_fault(),
handle_dabr_fault() and label '77'.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/exceptions-64s.S | 38 ++++++++++++++++++--------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 72cffa0c4af6..b8166b4fa728 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -499,11 +499,11 @@ EXC_COMMON_BEGIN(data_access_common)
EXCEPTION_PROLOG_COMMON(0x300, PACA_EXGEN)
RECONCILE_IRQ_STATE(r10, r11)
ld  r12,_MSR(r1)
-   ld  r3,PACA_EXGEN+EX_DAR(r13)
-   lwz r4,PACA_EXGEN+EX_DSISR(r13)
-   li  r5,0x300
-   std r3,_DAR(r1)
-   std r4,_DSISR(r1)
+   ld  r4,PACA_EXGEN+EX_DAR(r13)
+   lwz r5,PACA_EXGEN+EX_DSISR(r13)
+   li  r3,0x300
+   std r4,_DAR(r1)
+   std r5,_DSISR(r1)
 BEGIN_MMU_FTR_SECTION
b   do_hash_page/* Try to handle as hpte fault */
 MMU_FTR_SECTION_ELSE
@@ -543,11 +543,11 @@ EXC_COMMON_BEGIN(instruction_access_common)
EXCEPTION_PROLOG_COMMON(0x400, PACA_EXGEN)
RECONCILE_IRQ_STATE(r10, r11)
ld  r12,_MSR(r1)
-   ld  r3,_NIP(r1)
-   andis.  r4,r12,DSISR_BAD_FAULT_64S@h
-   li  r5,0x400
-   std r3,_DAR(r1)
-   std r4,_DSISR(r1)
+   ld  r4,_NIP(r1)
+   andis.  r5,r12,DSISR_BAD_FAULT_64S@h
+   li  r3,0x400
+   std r4,_DAR(r1)
+   std r5,_DSISR(r1)
 BEGIN_MMU_FTR_SECTION
b   do_hash_page/* Try to handle as hpte fault */
 MMU_FTR_SECTION_ELSE
@@ -1523,7 +1523,7 @@ do_hash_page:
 #ifdef CONFIG_PPC_BOOK3S_64
lis r0,DSISR_BAD_FAULT_64S@h
ori r0,r0,DSISR_BAD_FAULT_64S@l
-   and.r0,r4,r0/* weird error? */
+   and.r0,r5,r0/* weird error? */
bne-handle_page_fault   /* if not, try to insert a HPTE */
CURRENT_THREAD_INFO(r11, r1)
lwz r0,TI_PREEMPT(r11)  /* If we're in an "NMI" */
@@ -1538,8 +1538,10 @@ do_hash_page:
 *
 * at return r3 = 0 for success, 1 for page fault, negative for error
 */
+   mr  r6,r5
+   mr  r5,r3
+   mr  r3,r4
 mr r4,r12
-   ld  r6,_DSISR(r1)
bl  __hash_page /* build HPTE if possible */
 cmpdi  r3,0/* see if __hash_page succeeded */
 
@@ -1549,16 +1551,15 @@ do_hash_page:
/* Error */
blt-13f
 
-   /* Reload DSISR into r4 for the DABR check below */
-   ld  r4,_DSISR(r1)
+   /* Reload DAR/DSISR for handle_page_fault */
+   ld  r4,_DAR(r1)
+   ld  r5,_DSISR(r1)
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 /* Here we have a page fault that hash_page can't handle. */
 handle_page_fault:
-11:andis.  r0,r4,DSISR_DABRMATCH@h
+   andis.  r0,r5,DSISR_DABRMATCH@h
bne-handle_dabr_fault
-   ld  r4,_DAR(r1)
-   ld  r5,_DSISR(r1)
addi

Re: linux-next: Tree for Nov 7

2017-11-13 Thread Khalid Aziz

On 11/13/2017 09:06 AM, Michal Hocko wrote:

OK, so this one should take care of the backward compatibility while
still not touching the arch code
---
commit 39ff9bf8597e79a032da0954aea1f0d77d137765
Author: Michal Hocko 
Date:   Mon Nov 13 17:06:24 2017 +0100

 mm: introduce MAP_FIXED_SAFE
 
 MAP_FIXED is used quite often but it is inherently dangerous because it
 unmaps an existing mapping covered by the requested range. While this
 might be the really desired behavior in many cases, there are others
 which would rather see a failure than a silent memory corruption.
 Introduce a new MAP_FIXED_SAFE flag for mmap to achieve this behavior.
 It is a MAP_FIXED extension with a single exception: it fails with
 ENOMEM if the requested address is already covered by an existing
 mapping. We still rely on get_unmapped_area to handle all the arch
 specific MAP_FIXED treatment and check for a conflicting vma after it
 returns.
 
 Signed-off-by: Michal Hocko 


.. deleted ...
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506faceae..aad8d37f0205 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1358,6 +1358,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
if (mm->map_count > sysctl_max_map_count)
return -ENOMEM;
  
+	/* force arch specific MAP_FIXED handling in get_unmapped_area */

+   if (flags & MAP_FIXED_SAFE)
+   flags |= MAP_FIXED;
+
/* Obtain the address to map to. we verify (or select) it and ensure
 * that it represents a valid section of the address space.
 */


Do you need to move this code above:

if (!(flags & MAP_FIXED))
addr = round_hint_to_min(addr);

/* Careful about overflows.. */
len = PAGE_ALIGN(len);
if (!len)
return -ENOMEM;

Not doing that might mean the hint address will end up being rounded for 
MAP_FIXED_SAFE which would change the behavior from MAP_FIXED.


--
Khalid


Re: linux-next: Tree for Nov 7

2017-11-13 Thread Michal Hocko
[Sorry for spamming, this one is the last attempt hopefully]

On Mon 13-11-17 16:49:39, Michal Hocko wrote:
> On Mon 13-11-17 16:16:41, Michal Hocko wrote:
> > On Mon 13-11-17 13:00:57, Michal Hocko wrote:
> > [...]
> > > Yes, I have mentioned that in the previous email but the amount of code
> > > would be even larger. Basically every arch which reimplements
> > > arch_get_unmapped_area would have to special case new MAP_FIXED flag to
> > > do vma lookup.
> > 
> > It turned out that this might be much easier than I thought after
> > all. It seems we can really handle that in the common code. This would
> > mean that we are exposing new functionality to the userspace though.
> > Maybe this would be useful on its own though. Just a quick draft (not
> > even compile tested) of whether this makes sense in general. I would be
> > worried about unexpected behavior when somebody sets some other bit
> > without a good reason and we might fail with ENOMEM for such a call now.
> 
> Hmm, the bigger problem would actually be backward compatibility. We
> would get silent corruptions, which is exactly what the flag is trying to
> fix. mmap flags handling really sucks. So I guess we would have to make
> the flag internal only :/

OK, so this one should take care of the backward compatibility while
still not touching the arch code
---
commit 39ff9bf8597e79a032da0954aea1f0d77d137765
Author: Michal Hocko 
Date:   Mon Nov 13 17:06:24 2017 +0100

mm: introduce MAP_FIXED_SAFE

MAP_FIXED is used quite often but it is inherently dangerous because it
unmaps an existing mapping covered by the requested range. While this
might be the really desired behavior in many cases, there are others
which would rather see a failure than a silent memory corruption.
Introduce a new MAP_FIXED_SAFE flag for mmap to achieve this behavior.
It is a MAP_FIXED extension with a single exception: it fails with
ENOMEM if the requested address is already covered by an existing
mapping. We still rely on get_unmapped_area to handle all the arch
specific MAP_FIXED treatment and check for a conflicting vma after it
returns.

Signed-off-by: Michal Hocko 

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 3b26cc62dadb..767bcb8a4c28 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -31,6 +31,8 @@
 #define MAP_STACK      0x8     /* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB    0x10    /* create a huge page mapping */
 
+#define MAP_FIXED_SAFE 0x200   /* MAP_FIXED which doesn't unmap underlying mapping */
+
 #define MS_ASYNC   1   /* sync memory asynchronously */
 #define MS_SYNC2   /* synchronous memory sync */
 #define MS_INVALIDATE  4   /* invalidate the caches */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index da3216007fe0..c2311eb7219b 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -49,6 +49,8 @@
 #define MAP_STACK      0x4     /* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB    0x8     /* create a huge page mapping */
 
+#define MAP_FIXED_SAFE 0x200   /* MAP_FIXED which doesn't unmap underlying mapping */
+
 /*
  * Flags for msync
  */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index cc9ba1d34779..b06fd830bc6f 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -25,6 +25,8 @@
 #define MAP_STACK      0x4     /* give out an address that is best suited for process/thread stacks */
 #define MAP_HUGETLB    0x8     /* create a huge page mapping */
 
+#define MAP_FIXED_SAFE 0x200   /* MAP_FIXED which doesn't unmap underlying mapping */
+
 #define MS_SYNC1   /* synchronous memory sync */
 #define MS_ASYNC   2   /* sync memory asynchronously */
 #define MS_INVALIDATE  4   /* invalidate the caches */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index b15b278aa314..f4b291bca764 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -62,6 +62,8 @@
 # define MAP_UNINITIALIZED 0x0 /* Don't support this flag */
 #endif
 
+#define MAP_FIXED_SAFE 0x200   /* MAP_FIXED which doesn't unmap underlying mapping */
+
 /*
  * Flags for msync
  */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 203268f9231e..03c518777f83 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -25,6 +25,8 @@
 # define MAP_UNINITIALIZED 0x0 /* Don't support this flag */
 #endif

Re: linux-next: Tree for Nov 7

2017-11-13 Thread Michal Hocko
On Mon 13-11-17 15:48:13, Russell King - ARM Linux wrote:
> On Mon, Nov 13, 2017 at 04:16:41PM +0100, Michal Hocko wrote:
> > On Mon 13-11-17 13:00:57, Michal Hocko wrote:
> > [...]
> > > Yes, I have mentioned that in the previous email but the amount of code
> > > would be even larger. Basically every arch which reimplements
> > > arch_get_unmapped_area would have to special case new MAP_FIXED flag to
> > > do vma lookup.
> > 
> > It turned out that this might be much easier than I thought after
> > all. It seems we can really handle that in the common code. This would
> > mean that we are exposing new functionality to the userspace though.
> > Maybe this would be useful on its own though. Just a quick draft (not
> > even compile tested) of whether this makes sense in general. I would be
> > worried about unexpected behavior when somebody sets some other bit
> > without a good reason and we might fail with ENOMEM for such a call now.
> > 
> > Elf loader would then use MAP_FIXED_SAFE rather than MAP_FIXED.
> > ---
> > diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> > index 3b26cc62dadb..d021c21f9b01 100644
> > --- a/arch/alpha/include/uapi/asm/mman.h
> > +++ b/arch/alpha/include/uapi/asm/mman.h
> > @@ -31,6 +31,9 @@
> >  #define MAP_STACK      0x8     /* give out an address that is best suited for process/thread stacks */
> >  #define MAP_HUGETLB    0x10    /* create a huge page mapping */
> >  
> > +#define MAP_KEEP_MAPPING 0x200
> > +#define MAP_FIXED_SAFE MAP_FIXED|MAP_KEEP_MAPPING /* enforce MAP_FIXED without clobbering an existing mapping */
> 
> A few things...
> 
> 1. Does this need to be exposed to userland?

As I've written in another email, exposing the flag this way would be
really dangerous wrt. backward compatibility. So we would either need
some translation, or make it a flag on its own and touch the arch
specific code, which I really wanted to avoid.

Whether this is something useful for the userspace is a separate
question which I should bring up to linux-api for a wider audience to
discuss.

So I guess this comes down to whether we want/need something like
MAP_FIXED_SAFE, or to opt out of the specific hardening code for arches
that cannot make unaligned mappings for the requested address.

> 2. Can it end up in include/uapi/asm-generic/mman*.h ?
> 3. The definition of MAP_FIXED_SAFE should really have parens around it.

Of course. I thought I did...

> > @@ -1365,6 +1365,13 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> > if (offset_in_page(addr))
> > return addr;
> >  
> > +   if ((flags & MAP_FIXED_SAFE) == MAP_FIXED_SAFE) {
> 
> I'm surprised this doesn't warn - since this effectively expands to:
> 
>   flags & MAP_FIXED | MAP_KEEP_MAPPING
> 
> hence why MAP_FIXED_SAFE needs parens.

It sure does.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: linux-next: Tree for Nov 7

2017-11-13 Thread Michal Hocko
On Mon 13-11-17 16:16:41, Michal Hocko wrote:
> On Mon 13-11-17 13:00:57, Michal Hocko wrote:
> [...]
> > Yes, I have mentioned that in the previous email but the amount of code
> > would be even larger. Basically every arch which reimplements
> > arch_get_unmapped_area would have to special case new MAP_FIXED flag to
> > do vma lookup.
> 
> It turned out that this might be much easier than I thought after
> all. It seems we can really handle that in the common code. This would
> mean that we are exposing new functionality to the userspace though.
> Maybe this would be useful on its own though. Just a quick draft (not
> even compile tested) of whether this makes sense in general. I would be
> worried about unexpected behavior when somebody sets some other bit
> without a good reason and we might fail with ENOMEM for such a call now.

Hmm, the bigger problem would actually be backward compatibility. We
would get silent corruptions, which is exactly what the flag is trying to
fix. mmap flags handling really sucks. So I guess we would have to make
the flag internal only :/

-- 
Michal Hocko
SUSE Labs


Re: linux-next: Tree for Nov 7

2017-11-13 Thread Russell King - ARM Linux
On Mon, Nov 13, 2017 at 04:16:41PM +0100, Michal Hocko wrote:
> On Mon 13-11-17 13:00:57, Michal Hocko wrote:
> [...]
> > Yes, I have mentioned that in the previous email but the amount of code
> > would be even larger. Basically every arch which reimplements
> > arch_get_unmapped_area would have to special case new MAP_FIXED flag to
> > do vma lookup.
> 
> It turned out that this might be much easier than I thought after
> all. It seems we can really handle that in the common code. This would
> mean that we are exposing new functionality to the userspace though.
> Maybe this would be useful on its own though. Just a quick draft (not
> even compile tested) to check whether this makes sense in general. I would be
> worried about unexpected behavior when somebody sets the other bit without a
> good reason and we might fail with ENOMEM for such a call now.
> 
> Elf loader would then use MAP_FIXED_SAFE rather than MAP_FIXED.
> ---
> diff --git a/arch/alpha/include/uapi/asm/mman.h 
> b/arch/alpha/include/uapi/asm/mman.h
> index 3b26cc62dadb..d021c21f9b01 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -31,6 +31,9 @@
>  #define MAP_STACK0x8 /* give out an address that is best 
> suited for process/thread stacks */
>  #define MAP_HUGETLB  0x10/* create a huge page mapping */
>  
> +#define MAP_KEEP_MAPPING 0x200
> +#define MAP_FIXED_SAFE   MAP_FIXED|MAP_KEEP_MAPPING /* enforce MAP_FIXED 
> without clobbering an existing mapping */

A few things...

1. Does this need to be exposed to userland?
2. Can it end up in include/uapi/asm-generic/mman*.h ?
3. The definition of MAP_FIXED_SAFE should really have parens around it.

> @@ -1365,6 +1365,13 @@ unsigned long do_mmap(struct file *file, unsigned long 
> addr,
>   if (offset_in_page(addr))
>   return addr;
>  
> + if ((flags & MAP_FIXED_SAFE) == MAP_FIXED_SAFE) {

I'm surprised this doesn't warn - since this effectively expands to:

flags & MAP_FIXED | MAP_KEEP_MAPPING

hence why MAP_FIXED_SAFE needs parens.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


Re: linux-next: Tree for Nov 7

2017-11-13 Thread Michal Hocko
On Mon 13-11-17 15:09:09, Russell King - ARM Linux wrote:
> On Mon, Nov 13, 2017 at 03:11:40PM +0100, Michal Hocko wrote:
> > On Mon 13-11-17 10:20:06, Michal Hocko wrote:
> > > [Cc arm and ppc maintainers]
> > > 
> > > Thanks a lot for testing!
> > > 
> > > On Sun 12-11-17 11:38:02, Joel Stanley wrote:
> > > > On Fri, Nov 10, 2017 at 11:00 PM, Michal Hocko  
> > > > wrote:
> > > > > Hi Joel,
> > > > >
> > > > > On Wed 08-11-17 15:20:50, Michal Hocko wrote:
> > > > > [...]
> > > > >> > There are a lot of messages on the way up that look like this:
> > > > >> >
> > > > >> > [2.527460] Uhuuh, elf segement at 000d9000 requested but the
> > > > >> > memory is mapped already
> > > > >> > [2.540160] Uhuuh, elf segement at 000d9000 requested but the
> > > > >> > memory is mapped already
> > > > >> > [2.546153] Uhuuh, elf segement at 000d9000 requested but the
> > > > >> > memory is mapped already
> > > > >> >
> > > > >> > And then trying to run userspace looks like this:
> > > > >>
> > > > >> Could you please run with debugging patch posted
> > > > >> http://lkml.kernel.org/r/20171107102854.vylrtaodla63k...@dhcp22.suse.cz
> > > > >
> > > > > Did you have chance to test with this debugging patch, please?
> > > > 
> > > > Lots of this:
> > > > 
> > > > [1.177266] Uhuuh, elf segement at 000d9000 requested but the  
> > > > memory is mapped already, got 000dd000
> > > > [1.177555] Clashing vma [dd000, de000] flags:100873 name:(null)
> > > 
> > > This smells like the problem I've expected that mmap with hint doesn't
> > > respect the hint even though there is no clashing mapping. The above
> > > basically says that we didn't map at 0xd9000 but it has placed it at
> > > 0xdd000. The nearest (clashing) vma is at 0xdd000 so this is our new
> > > mapping. find_vma returns the closest vma (with addr < vm_end) for the
> > > given address 0xd9000 so this address cannot be mapped by any other vma.
> > > 
> > > Now that I am looking at arm's arch_get_unmapped_area it does perform
> > > aligning for shared vmas.
> > 
> > Sorry for confusion here. These are not shared mappings as pointed out
> > by Russell in a private email. I got confused by the above flags which I
> > have misinterpreted as bit 0 set => MAP_SHARED. These are vm_flags
> > obviously so the bit 0 is VM_READ. Sorry about the confusion. The real
> > reason we are doing the alignment is that we do a file mapping
> > /*
> >  * We only need to do colour alignment if either the I or D
> >  * caches alias.
> >  */
> > if (aliasing)
> > do_align = filp || (flags & MAP_SHARED);
> > 
> I am not really familiar with this architecture to understand why we
> need aliasing for file mappings, though.
> 
> I think it's there so that flush_dcache_page() works - possibly
> get_user_pages() being used on a private mapping of page cache pages,
> but that's guessing.

I fail to see how the mixture of MAP_FIXED and a regular mapping of the
same file works then, but as I've said I really do not understand this
code.

> I'm afraid I don't remember all the details, this is code from around
> 15 years ago, and I'd be very nervous about changing it now without
> fully understanding the issues.

Ohh, absolutely! I didn't dare to touch this code and that's why I took
the easy way and simply opted out of the hardening for all those archs
that are basically sharing this pattern. But after a closer look it
seems that we can really introduce MAP_FIXED_SAFE, which would keep the
arch mmap code intact yet get us the hardening for all archs.
It would also allow a safer MAP_FIXED semantic for userspace.
-- 
Michal Hocko
SUSE Labs


Re: powerpc-linux-gnu-ld: warning: orphan section `.data.rel.ro' from `arch/powerpc/kernel/head_44x.o' being placed in section `.data.rel.ro'.

2017-11-13 Thread Masahiro Yamada
Hi Nicholas,


2017-11-11 17:45 GMT+09:00 Nicholas Piggin :
> On Fri, 10 Nov 2017 23:41:49 +0800
> kbuild test robot  wrote:
>
>> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
>> master
>> head:   1c9dbd4615fd751e5e0b99807a3c7c8612e28e20
>> commit: cb87481ee89dbd6609e227afbf64900fb4e5c930 kbuild: linker script do 
>> not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured
>> date:   3 months ago
>> config: powerpc-fsp2_defconfig (attached as .config)
>> compiler: powerpc-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
>> reproduce:
>> wget 
>> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
>> ~/bin/make.cross
>> chmod +x ~/bin/make.cross
>> git checkout cb87481ee89dbd6609e227afbf64900fb4e5c930
>> # save the attached .config to linux build tree
>> make.cross ARCH=powerpc
>>
>> All warnings (new ones prefixed by >>):
>>
>> >> powerpc-linux-gnu-ld: warning: orphan section `.data.rel.ro' from 
>> >> `arch/powerpc/kernel/head_44x.o' being placed in section `.data.rel.ro'.
>> >> powerpc-linux-gnu-ld: warning: orphan section `.data.rel.ro' from 
>> >> `arch/powerpc/kernel/head_44x.o' being placed in section `.data.rel.ro'.
>> >> powerpc-linux-gnu-ld: warning: orphan section `.data.rel.ro' from 
>> >> `arch/powerpc/kernel/head_44x.o' being placed in section `.data.rel.ro'.
>
> Okay this is not caused by the above patch, it was just hiding it.
> This should do the trick I think:
> --
> powerpc/32: Add .data.rel* sections explicitly
>
> Match powerpc/64 and include .data.rel* input sections in the .data output
> section explicitly.
>
> This should solve the warning:
>
> powerpc-linux-gnu-ld: warning: orphan section `.data.rel.ro' from 
> `arch/powerpc/kernel/head_44x.o' being placed in section `.data.rel.ro'.
>
> Reported-by: kbuild test robot 
> Signed-off-by: Nicholas Piggin 


I think this will go to the PPC tree.

I will not touch this patch, but
if necessary, please feel free to add

Reviewed-by: Masahiro Yamada 




> ---
>  arch/powerpc/kernel/vmlinux.lds.S | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
> b/arch/powerpc/kernel/vmlinux.lds.S
> index 0494e1566ee2..51e4ec92ade1 100644
> --- a/arch/powerpc/kernel/vmlinux.lds.S
> +++ b/arch/powerpc/kernel/vmlinux.lds.S
> @@ -264,6 +264,7 @@ SECTIONS
>  #ifdef CONFIG_PPC32
> .data : AT(ADDR(.data) - LOAD_OFFSET) {
> DATA_DATA
> +   *(.data.rel*)
> *(.sdata)
> *(.sdata2)
> *(.got.plt) *(.got)
> --
> 2.15.0
>



-- 
Best Regards
Masahiro Yamada


Re: linux-next: Tree for Nov 7

2017-11-13 Thread Michal Hocko
On Mon 13-11-17 13:00:57, Michal Hocko wrote:
[...]
> Yes, I have mentioned that in the previous email but the amount of code
> would be even larger. Basically every arch which reimplements
> arch_get_unmapped_area would have to special case new MAP_FIXED flag to
> do vma lookup.

It turned out that this might be much easier than I thought after
all. It seems we can really handle that in the common code. This would
mean that we are exposing new functionality to the userspace though.
Maybe this would be useful on its own though. Just a quick draft (not
even compile tested) to check whether this makes sense in general. I would be
worried about unexpected behavior when somebody sets the other bit without a
good reason and we might fail with ENOMEM for such a call now.

Elf loader would then use MAP_FIXED_SAFE rather than MAP_FIXED.
---
diff --git a/arch/alpha/include/uapi/asm/mman.h 
b/arch/alpha/include/uapi/asm/mman.h
index 3b26cc62dadb..d021c21f9b01 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -31,6 +31,9 @@
 #define MAP_STACK  0x8 /* give out an address that is best 
suited for process/thread stacks */
 #define MAP_HUGETLB0x10/* create a huge page mapping */
 
+#define MAP_KEEP_MAPPING 0x200
+#define MAP_FIXED_SAFE MAP_FIXED|MAP_KEEP_MAPPING /* enforce MAP_FIXED without 
clobbering an existing mapping */
+
 #define MS_ASYNC   1   /* sync memory asynchronously */
 #define MS_SYNC2   /* synchronous memory sync */
 #define MS_INVALIDATE  4   /* invalidate the caches */
diff --git a/arch/mips/include/uapi/asm/mman.h 
b/arch/mips/include/uapi/asm/mman.h
index da3216007fe0..51e3885fbfc1 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -49,6 +49,9 @@
 #define MAP_STACK  0x4 /* give out an address that is best 
suited for process/thread stacks */
 #define MAP_HUGETLB0x8 /* create a huge page mapping */
 
+#define MAP_KEEP_MAPPING 0x200
+#define MAP_FIXED_SAFE MAP_FIXED|MAP_KEEP_MAPPING /* enforce MAP_FIXED without 
clobbering an existing mapping */
+
 /*
  * Flags for msync
  */
diff --git a/arch/parisc/include/uapi/asm/mman.h 
b/arch/parisc/include/uapi/asm/mman.h
index cc9ba1d34779..5a4381484fc5 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -25,6 +25,9 @@
 #define MAP_STACK  0x4 /* give out an address that is best 
suited for process/thread stacks */
 #define MAP_HUGETLB0x8 /* create a huge page mapping */
 
+#define MAP_KEEP_MAPPING 0x200
+#define MAP_FIXED_SAFE MAP_FIXED|MAP_KEEP_MAPPING /* enforce MAP_FIXED without 
clobbering an existing mapping */
+
 #define MS_SYNC1   /* synchronous memory sync */
 #define MS_ASYNC   2   /* sync memory asynchronously */
 #define MS_INVALIDATE  4   /* invalidate the caches */
diff --git a/arch/xtensa/include/uapi/asm/mman.h 
b/arch/xtensa/include/uapi/asm/mman.h
index b15b278aa314..5df8a81524da 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -62,6 +62,9 @@
 # define MAP_UNINITIALIZED 0x0 /* Don't support this flag */
 #endif
 
+#define MAP_KEEP_MAPPING 0x200
+#define MAP_FIXED_SAFE MAP_FIXED|MAP_KEEP_MAPPING /* enforce MAP_FIXED without 
clobbering an existing mapping */
+
 /*
  * Flags for msync
  */
diff --git a/include/uapi/asm-generic/mman-common.h 
b/include/uapi/asm-generic/mman-common.h
index 203268f9231e..22442846f5c8 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -25,6 +25,9 @@
 # define MAP_UNINITIALIZED 0x0 /* Don't support this flag */
 #endif
 
+#define MAP_KEEP_MAPPING 0x200
+#define MAP_FIXED_SAFE MAP_FIXED|MAP_KEEP_MAPPING /* enforce MAP_FIXED without 
clobbering an existing mapping */
+
 /*
  * Flags for mlock
  */
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506faceae..e53b6b15a8d9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1365,6 +1365,13 @@ unsigned long do_mmap(struct file *file, unsigned long 
addr,
if (offset_in_page(addr))
return addr;
 
+   if ((flags & MAP_FIXED_SAFE) == MAP_FIXED_SAFE) {
+   struct vm_area_struct *vma = find_vma(mm, addr);
+
+   if (vma && vma->vm_start <= addr)
+   return -ENOMEM;
+   }
+
if (prot == PROT_EXEC) {
pkey = execute_only_pkey(mm);
if (pkey < 0)
-- 
Michal Hocko
SUSE Labs


Re: linux-next: Tree for Nov 7

2017-11-13 Thread Russell King - ARM Linux
On Mon, Nov 13, 2017 at 03:11:40PM +0100, Michal Hocko wrote:
> On Mon 13-11-17 10:20:06, Michal Hocko wrote:
> > [Cc arm and ppc maintainers]
> > 
> > Thanks a lot for testing!
> > 
> > On Sun 12-11-17 11:38:02, Joel Stanley wrote:
> > > On Fri, Nov 10, 2017 at 11:00 PM, Michal Hocko  wrote:
> > > > Hi Joel,
> > > >
> > > > On Wed 08-11-17 15:20:50, Michal Hocko wrote:
> > > > [...]
> > > >> > There are a lot of messages on the way up that look like this:
> > > >> >
> > > >> > [2.527460] Uhuuh, elf segement at 000d9000 requested but the
> > > >> > memory is mapped already
> > > >> > [2.540160] Uhuuh, elf segement at 000d9000 requested but the
> > > >> > memory is mapped already
> > > >> > [2.546153] Uhuuh, elf segement at 000d9000 requested but the
> > > >> > memory is mapped already
> > > >> >
> > > >> > And then trying to run userspace looks like this:
> > > >>
> > > >> Could you please run with debugging patch posted
> > > >> http://lkml.kernel.org/r/20171107102854.vylrtaodla63k...@dhcp22.suse.cz
> > > >
> > > > Did you have chance to test with this debugging patch, please?
> > > 
> > > Lots of this:
> > > 
> > > [1.177266] Uhuuh, elf segement at 000d9000 requested but the  memory 
> > > is mapped already, got 000dd000
> > > [1.177555] Clashing vma [dd000, de000] flags:100873 name:(null)
> > 
> > This smells like the problem I've expected that mmap with hint doesn't
> > respect the hint even though there is no clashing mapping. The above
> > basically says that we didn't map at 0xd9000 but it has placed it at
> > 0xdd000. The nearest (clashing) vma is at 0xdd000 so this is our new
> > mapping. find_vma returns the closest vma (with addr < vm_end) for the
> > given address 0xd9000 so this address cannot be mapped by any other vma.
> > 
> > Now that I am looking at arm's arch_get_unmapped_area it does perform
> > aligning for shared vmas.
> 
> Sorry for confusion here. These are not shared mappings as pointed out
> by Russell in a private email. I got confused by the above flags which I
> have misinterpreted as bit 0 set => MAP_SHARED. These are vm_flags
> obviously so the bit 0 is VM_READ. Sorry about the confusion. The real
> reason we are doing the alignment is that we do a file mapping
>   /*
>* We only need to do colour alignment if either the I or D
>* caches alias.
>*/
>   if (aliasing)
>   do_align = filp || (flags & MAP_SHARED);
> 
> I am not really familiar with this architecture to understand why we
> need aliasing for file mappings, though.

I think it's there so that flush_dcache_page() works - possibly
get_user_pages() being used on a private mapping of page cache pages,
but that's guessing.

I'm afraid I don't remember all the details, this is code from around
15 years ago, and I'd be very nervous about changing it now without
fully understanding the issues.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


Re: [PATCH] [net-next,v3] ibmvnic: Feature implementation of Vital Product Data (VPD) for the ibmvnic driver

2017-11-13 Thread Nathan Fontenot
On 11/10/2017 01:13 PM, Desnes Augusto Nunes do Rosário wrote:
> 
> 
> On 11/10/2017 12:54 PM, Nathan Fontenot wrote:
>> On 11/10/2017 08:41 AM, Desnes Augusto Nunes do Rosário wrote:
>>>
>>>
>>> On 11/09/2017 06:31 PM, Nathan Fontenot wrote:
 On 11/09/2017 01:00 PM, Desnes Augusto Nunes do Rosario wrote:
> This patch implements and enables VPD support for the ibmvnic driver.
> Moreover, it includes the implementation of suitable structs, signal
>    transmission/handling and functions which allow the retrieval of firmware
>    information from the ibmvnic card through the ethtool command.
>
> Signed-off-by: Desnes A. Nunes do Rosario 
> Signed-off-by: Thomas Falcon 
> ---
>    drivers/net/ethernet/ibm/ibmvnic.c | 149 
> -
>    drivers/net/ethernet/ibm/ibmvnic.h |  27 ++-
>    2 files changed, 173 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
> b/drivers/net/ethernet/ibm/ibmvnic.c
> index d0cff28..693b502 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -573,6 +573,15 @@ static int reset_tx_pools(struct ibmvnic_adapter 
> *adapter)
>    return 0;
>    }
>
> +static void release_vpd_data(struct ibmvnic_adapter *adapter)
> +{
> +    if (!adapter->vpd)
> +    return;
> +
> +    kfree(adapter->vpd->buff);
> +    kfree(adapter->vpd);
> +}
> +
>    static void release_tx_pools(struct ibmvnic_adapter *adapter)
>    {
>    struct ibmvnic_tx_pool *tx_pool;
> @@ -753,6 +762,8 @@ static void release_resources(struct ibmvnic_adapter 
> *adapter)
>    {
>    int i;
>
> +    release_vpd_data(adapter);
> +
>    release_tx_pools(adapter);
>    release_rx_pools(adapter);
>
> @@ -833,6 +844,53 @@ static int set_real_num_queues(struct net_device 
> *netdev)
>    return rc;
>    }
>
> +static int ibmvnic_get_vpd(struct ibmvnic_adapter *adapter)
> +{
> +    struct device *dev = &adapter->vdev->dev;
> +    union ibmvnic_crq crq;
> +    dma_addr_t dma_addr;
> +    int len;
> +
> +    if (adapter->vpd->buff)
> +    len = adapter->vpd->len;
> +
> +    reinit_completion(&adapter->fw_done);
> +    crq.get_vpd_size.first = IBMVNIC_CRQ_CMD;
> +    crq.get_vpd_size.cmd = GET_VPD_SIZE;
> +    ibmvnic_send_crq(adapter, &crq);
> +    wait_for_completion(&adapter->fw_done);
> +

 Shouldn't there be a check for the return code when getting the
 vpd size?
>>>
>>> Hello Nathan,
>>>
>>> This check is already being performed on the handle_vpd_size_rsp() function 
>>> down below.
>>>
>>> In short, a GET_VPD_SIZE signal is sent here through an ibmvnic_crq union in 
>>> ibmvnic_send_crq(), whereas handle_vpd_size_rsp() receives from the 
>>> VNIC adapter a GET_VPD_SIZE_RSP containing an ibmvnic_crq union with the vpd 
>>> size information and the rc.code. If successful, the &adapter->fw_done 
>>> completion is signalled and this part of the code continues; if not, a 
>>> dev_err() is raised. Same logic applies to GET_VPD/GET_VPD_RSP.
>>>
>>
>> Yes, I did see that code. You complete the completion variable for 
>> both success and failure,
>> which then lets this routine continue regardless of the result of the get 
>> vpd size request. The
>> call to dev_err will print the error message but does not prevent us from 
>> bailing if the
>> get vpd size fails. Perhaps setting vpd->len to -1 to indicate the get vpd 
>> call failed, which could
>> then be checked by the requester.
>>
>> -Nathan
>>
>>
> What I am adding on the next version of the patch is a check if 
> adapter->vpd->len is different from 0 before allocating adapter->vpd->buff, 
> since in the case of a failure, adapter->vpd->len will be 0.
> 
> I do concur with your observation that the break is necessary.
> 
> If the reception of the vpd failed, adapter->vpd->len will still be zeroed out 
> since it was created with kzalloc in init_resources().
> 
> Thus, do you agree if in the next version I send the following code?

Yes, this would be good.

I think you should also add an explicit setting of len to 0 before the call too.
Looking at the code before the get vpd size call, if a vpd buffer already exists
then the len will be non-zero.

-Nathan

> 
> ===
>   +   reinit_completion(&adapter->fw_done);
>   +   crq.get_vpd_size.first = IBMVNIC_CRQ_CMD;
>   +   crq.get_vpd_size.cmd = GET_VPD_SIZE;
>   +   ibmvnic_send_crq(adapter, &crq);
>   +   wait_for_completion(&adapter->fw_done);
>   +
> ->+   if (!adapter->vpd->len)
> ->+   return -ENODATA;
>   +
>   +   if (!adapter->vpd->buff)
>   +   adapter->vpd->buff = kzalloc(adapter->vpd->len, GFP_KERNEL);
>   +   else if (adapter->vpd->len != len)
> 

Re: [PATCH v2 3/8] powerpc/mm: Separate ibm, dynamic-memory data from DT format

2017-11-13 Thread Nathan Fontenot
On 11/12/2017 06:43 AM, Michael Ellerman wrote:
> Hi Nathan,
> 
> Nathan Fontenot  writes:
>> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
>> index f83056297441..917184c13890 100644
>> --- a/arch/powerpc/kernel/prom.c
>> +++ b/arch/powerpc/kernel/prom.c
>> @@ -454,92 +455,93 @@ static int __init 
>> early_init_dt_scan_chosen_ppc(unsigned long node,
> ...
>>  
>>  static int __init early_init_dt_scan_memory_ppc(unsigned long node,
>>  const char *uname,
>>  int depth, void *data)
>>  {
>> +int rc;
>> +
>>  if (depth == 1 &&
>> -strcmp(uname, "ibm,dynamic-reconfiguration-memory") == 0)
>> -return early_init_dt_scan_drconf_memory(node);
>> +strcmp(uname, "ibm,dynamic-reconfiguration-memory") == 0) {
>> +rc = init_drmem_lmbs(node);
>> +if (!rc)
>> +early_init_dt_scan_drmem_lmbs(node);
>> +
>> +return rc;
>> +}
>>  
>>  return early_init_dt_scan_memory(node, uname, depth, data);
>>  }
> 
> There's one bug in here which is that you return rc as returned by
> init_drmem_lmbs(). Returning non-zero from these scan routines
> terminates the scan, which means if anything goes wrong in
> init_drmem_lmbs() we may not call early_init_dt_scan_memory()
> in which case we won't have any memory at all.
> 

I didn't know this would stop scanning the device tree, thanks for letting me 
know.

> I say "may not" because it depends on the order of the nodes in the
> device tree whether you hit the memory nodes or the dynamic reconfig mem
> info first. And the order of the nodes in the device tree is arbitrary
> so we can't rely on that.
> 
> 
>> diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
>> new file mode 100644
>> index ..8ad7cf36b2c4
>> --- /dev/null
>> +++ b/arch/powerpc/mm/drmem.c
>> @@ -0,0 +1,84 @@
> ...
>> +
>> +int __init init_drmem_lmbs(unsigned long node)
>> +{
>> +struct drmem_lmb *lmb;
>> +const __be32 *prop;
>> +int prop_sz;
>> +u32 len;
>> +
>> +prop = of_get_flat_dt_prop(node, "ibm,lmb-size", );
>> +if (!prop || len < dt_root_size_cells * sizeof(__be32))
>> +return -1;
>> +
>> +drmem_info->lmb_size = dt_mem_next_cell(dt_root_size_cells, );
>> +
>> +prop = of_get_flat_dt_prop(node, "ibm,dynamic-memory", );
>> +if (!prop || len < dt_root_size_cells * sizeof(__be32))
>> +return -1;
>> +
>> +drmem_info->n_lmbs = of_read_number(prop++, 1);
>> +prop_sz = drmem_info->n_lmbs * sizeof(struct of_drconf_cell)
>> +  + sizeof(__be32);
>> +if (prop_sz < len)
>> +return -1;
>> +
>> +drmem_info->lmbs = alloc_bootmem(drmem_info->n_lmbs * sizeof(*lmb));
>> +if (!drmem_info->lmbs)
>> +return -1;
> 
> The bigger problem we have though is that you're trying to allocate
> memory, in order to find out what memory we have :)
> 
> I suspect it works in some cases because you hit the memory@0 node first
> in the device tree, and add that memory to memblock, which means
> init_drmem_lmbs() *can* allocate memory, and everything's good.
> 
> But if we hit init_drmem_lmbs() first, or there's not enough space in
> memory@0, then allocating memory in order to discover memory is not
> going to work.
> 
> I'm not sure what the best solution is. One option would be to
> statically allocate some space, so that we can discover some of the LMBs
> without doing an allocation. But we wouldn't be able to guarantee that
> we had enough space i nthat static allocation, so the code would need to
> handle doing that and then potentially finding more LMBs later using a
> dynamic alloc. So that could be a bit messy.
> 
> The other option would be for the early_init_dt_scan_drmem_lmbs() code
> to still work on the device tree directly, rather than using the
> drmem_info array. That would make for uglier code, but may be necessary.
> 

I have been thinking about my initial approach, and the more I look at it
the more I do not like trying to do the bootmem allocation. As you mention
there is just too much that could go wrong with that.

I have started looking at a design where an interface similar to
walk_memory_range() is used for the prom and numa code so we do not have to rely
on making the allocation for the lmb array early in boot. The lmb array
could then be allocated in the late_initcall in drmem.c at which point
the general kernel allocator is available.

I'm still working on getting this coded up and will send out a new patch set
once it's ready, unless anyone has objections to this approach.

-Nathan



Re: [PATCH v2 2/2] powerpc/pci: Unroll two pass loop when scanning bridges

2017-11-13 Thread Mika Westerberg
On Fri, Nov 10, 2017 at 07:52:30PM +0200, Andy Shevchenko wrote:
> The current scanning code is really hard to understand because it calls
> the same function in a loop where pass value is changed without any
> comments explaining it:
> 
>   for (pass = 0; pass < 2; pass++)
> for_each_pci_bridge(dev, bus)
>   max = pci_scan_bridge(bus, dev, max, pass);
> 
> Unfamiliar reader cannot tell easily what is the purpose of this loop
> without looking at internals of pci_scan_bridge().
> 
> In order to make this bit easier to understand, open-code the loop in
> pci_scan_child_bus() and pci_hp_add_bridge() with added comments.
> 
> No functional changes intended.
> 
> Cc: Mika Westerberg 

Reviewed-by: Mika Westerberg 


Re: linux-next: Tree for Nov 7

2017-11-13 Thread Michal Hocko
On Mon 13-11-17 10:20:06, Michal Hocko wrote:
> [Cc arm and ppc maintainers]
> 
> Thanks a lot for testing!
> 
> On Sun 12-11-17 11:38:02, Joel Stanley wrote:
> > On Fri, Nov 10, 2017 at 11:00 PM, Michal Hocko  wrote:
> > > Hi Joel,
> > >
> > > On Wed 08-11-17 15:20:50, Michal Hocko wrote:
> > > [...]
> > >> > There are a lot of messages on the way up that look like this:
> > >> >
> > >> > [2.527460] Uhuuh, elf segement at 000d9000 requested but the
> > >> > memory is mapped already
> > >> > [2.540160] Uhuuh, elf segement at 000d9000 requested but the
> > >> > memory is mapped already
> > >> > [2.546153] Uhuuh, elf segement at 000d9000 requested but the
> > >> > memory is mapped already
> > >> >
> > >> > And then trying to run userspace looks like this:
> > >>
> > >> Could you please run with debugging patch posted
> > >> http://lkml.kernel.org/r/20171107102854.vylrtaodla63k...@dhcp22.suse.cz
> > >
> > > Did you have chance to test with this debugging patch, please?
> > 
> > Lots of this:
> > 
> > [1.177266] Uhuuh, elf segement at 000d9000 requested but the  memory is 
> > mapped already, got 000dd000
> > [1.177555] Clashing vma [dd000, de000] flags:100873 name:(null)
> 
> This smells like the problem I've expected that mmap with hint doesn't
> respect the hint even though there is no clashing mapping. The above
> basically says that we didn't map at 0xd9000 but it has placed it at
> 0xdd000. The nearest (clashing) vma is at 0xdd000 so this is our new
> mapping. find_vma returns the closest vma (with addr < vm_end) for the
> given address 0xd9000 so this address cannot be mapped by any other vma.
> 
> Now that I am looking at arm's arch_get_unmapped_area it does perform
> aligning for shared vmas.

Sorry for confusion here. These are not shared mappings as pointed out
by Russell in a private email. I got confused by the above flags which I
have misinterpreted as bit 0 set => MAP_SHARED. These are vm_flags
obviously so the bit 0 is VM_READ. Sorry about the confusion. The real
reason we are doing the alignment is that we do a file mapping
/*
 * We only need to do colour alignment if either the I or D
 * caches alias.
 */
if (aliasing)
do_align = filp || (flags & MAP_SHARED);

I am not really familiar with this architecture to understand why we
need aliasing for file mappings, though.
-- 
Michal Hocko
SUSE Labs


[PATCH] powerpc/32: Add .data.rel* sections explicitly

2017-11-13 Thread Nicholas Piggin
Match powerpc/64 and include .data.rel* input sections in the .data output
section explicitly.

This solves the warning:

powerpc-linux-gnu-ld: warning: orphan section `.data.rel.ro' from 
`arch/powerpc/kernel/head_44x.o' being placed in section `.data.rel.ro'.

Link: https://lists.01.org/pipermail/kbuild-all/2017-November/040010.html
Reported-by: kbuild test robot 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/vmlinux.lds.S | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index 0494e1566ee2..51e4ec92ade1 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -264,6 +264,7 @@ SECTIONS
 #ifdef CONFIG_PPC32
.data : AT(ADDR(.data) - LOAD_OFFSET) {
DATA_DATA
+   *(.data.rel*)
*(.sdata)
*(.sdata2)
*(.got.plt) *(.got)
-- 
2.15.0



Re: linux-next: Tree for Nov 7

2017-11-13 Thread Michal Hocko
On Mon 13-11-17 22:34:50, Michael Ellerman wrote:
> Hi Michal,
> 
> Michal Hocko  writes:
> > On Mon 13-11-17 10:20:06, Michal Hocko wrote:
> >> [Cc arm and ppc maintainers]
> >
> > Hmm, it turned out to be a problem on other architectures as well.
> > CCing more maintainers. For your reference, we are talking about
> > http://lkml.kernel.org/r/20171023082608.6167-1-mho...@kernel.org
> > which has broken architectures which do apply aligning on the mmap
> > address hint without MAP_FIXED applied. See below my proposed way
> > around this issue because I belive that the above patch is quite
> > valuable on its own to be dropped for all archs.
> 
> I don't really like your solution sorry :)  The fact that you've had to
> patch seven arches seems like a red flag.
> 
> I think this is a generic problem with MAP_FIXED, which I've heard
> userspace folks complain about in the past.

The thing is that we cannot change MAP_FIXED behavior as it is carved in
stone.

> Currently MAP_FIXED does two things:
>   1. makes addr not a hint but the required address
>   2. blasts any existing mapping
> 
> You want 1) but not 2).

+ fail if there is a clashing range

> So the right solution IMHO would be to add a new mmap flag to request
> that behaviour, ie. a fixed address but iff there is nothing already
> mapped there.
> 
> I don't know the mm code well enough to know if that's hard for some
> reason, but it *seems* like it should be doable.

Yes, I have mentioned that in the previous email but the amount of code
would be even larger. Basically every arch which reimplements
arch_get_unmapped_area would have to special case the new MAP_FIXED flag to
do the vma lookup. So this was the simplest solution I could come up
with. If there is general interest in MAP_FIXED_SAFE then we can
introduce it later of course. I would just like the hardening merged
sooner rather than later.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] kernel/module_64.c: Add REL24 relocation support of livepatch symbols

2017-11-13 Thread Kamalesh Babulal

On Monday 13 November 2017 02:08 PM, Balbir Singh wrote:

On Fri, Nov 10, 2017 at 2:28 PM, Josh Poimboeuf  wrote:

On Fri, Nov 10, 2017 at 01:06:25PM +1100, Balbir Singh wrote:

On Fri, Nov 10, 2017 at 2:19 AM, Josh Poimboeuf  wrote:

FWIW, I think it won't matter anyway.  I'm currently pursuing the option
of inserting nops after local calls, because it has less runtime
complexity than using a stub.

I think I've figured out a way to do it with a GCC plugin, but if that
doesn't work I'll try the asm listing sed approach suggested by Michael.



A plugin that runs for the new kernel with the patch? Just for
specific files involved in the patch?


The plugin will affect the code generation of all functions in the patch
module.  So all calls in all replacement functions will have the nops.

Here's a prototype (not yet fully tested):

  
https://github.com/jpoimboe/kpatch/blob/TODO-ppc-fix/kpatch-build/gcc-plugins/ppc64le-plugin.c





[...]


I'd have to look closer, but I wonder what happens if a function has no
global entry point.

(See http://mpe.github.io/posts/2016/05/23/kernel-live-patching-for-ppc64le/,
section "Understanding the TOC & entry points", and the example of
int_to_scsilun().)



int_to_scsilun() is accessed via its global entry point, but the TOC
setup in the function prologue is skipped as the function has no need
to save/restore the TOC. The reason being, it doesn't call any global
functions, so the r2 register is not clobbered.

As per my understanding, the plug-in inserts a nop instruction after a
branch to a function called via its local entry point, i.e. a call to a
local function. In the case of int_to_scsilun(), the plug-in will not
modify the function in any way, as int_to_scsilun() makes no local
calls either.

--
cheers,
Kamalesh.



Re: linux-next: Tree for Nov 7

2017-11-13 Thread Michael Ellerman
Hi Michal,

Michal Hocko  writes:
> On Mon 13-11-17 10:20:06, Michal Hocko wrote:
>> [Cc arm and ppc maintainers]
>
> Hmm, it turned out to be a problem on other architectures as well.
> CCing more maintainers. For your reference, we are talking about
> http://lkml.kernel.org/r/20171023082608.6167-1-mho...@kernel.org
> which has broken architectures that apply alignment to the mmap
> address hint when MAP_FIXED is not given. See below for my proposed
> way around this issue, because I believe the above patch is too
> valuable on its own to be dropped for all archs.

I don't really like your solution sorry :)  The fact that you've had to
patch seven arches seems like a red flag.

I think this is a generic problem with MAP_FIXED, which I've heard
userspace folks complain about in the past.

Currently MAP_FIXED does two things:
  1. makes addr not a hint but the required address
  2. blasts any existing mapping

You want 1) but not 2).

So the right solution IMHO would be to add a new mmap flag to request
that behaviour, ie. a fixed address but iff there is nothing already
mapped there.

I don't know the mm code well enough to know if that's hard for some
reason, but it *seems* like it should be doable.

cheers


Re: [PATCH] bootwrapper: mspsc.c: fix pointer-to-int-cast warnings

2017-11-13 Thread Michael Ellerman
Hi Michal,

Thanks for trying to fix this.

Michal Suchanek  writes:
> I get these warnings:
>
> ../arch/powerpc/boot/mpsc.c: In function 'mpsc_get_virtreg_of_phandle':
> ../arch/powerpc/boot/mpsc.c:113:35: warning: cast from pointer to
> integer of different size [-Wpointer-to-int-cast]
>
> ../arch/powerpc/boot/mpsc.c: In function 'mpsc_console_init':
> ../arch/powerpc/boot/mpsc.c:147:12: warning: cast from pointer to
> integer of different size [-Wpointer-to-int-cast]
>
> Presumably the patch below fixes these, and presumably the DT defines
> that pointers and integers have the same size, so this is fine

Actually no it doesn't. A device tree doesn't contain pointers, and it
doesn't say anything about the size of a pointer on a platform that uses
that device tree.

phandles are not pointers, they are handles, and they are always 32-bit.
So this code is just wrong on 64-bit, you can't safely cast from a
64-bit pointer to a phandle.

But I think this warning has actually been fixed differently by commit:

  866bfc75f40e ("powerpc: conditionally compile platform-specific serial drivers")

If you're still seeing the warning let me know.

cheers


Re: linux-next: Tree for Nov 7

2017-11-13 Thread Michal Hocko
On Mon 13-11-17 10:20:06, Michal Hocko wrote:
> [Cc arm and ppc maintainers]

Hmm, it turned out to be a problem on other architectures as well.
CCing more maintainers. For your reference, we are talking about
http://lkml.kernel.org/r/20171023082608.6167-1-mho...@kernel.org
which has broken architectures that apply alignment to the mmap
address hint when MAP_FIXED is not given. See below for my proposed
way around this issue, because I believe the above patch is too
valuable on its own to be dropped for all archs.

> Thanks a lot for testing!
> 
> On Sun 12-11-17 11:38:02, Joel Stanley wrote:
> > On Fri, Nov 10, 2017 at 11:00 PM, Michal Hocko  wrote:
> > > Hi Joel,
> > >
> > > On Wed 08-11-17 15:20:50, Michal Hocko wrote:
> > > [...]
> > >> > There are a lot of messages on the way up that look like this:
> > >> >
> > >> > [2.527460] Uhuuh, elf segement at 000d9000 requested but the
> > >> > memory is mapped already
> > >> > [2.540160] Uhuuh, elf segement at 000d9000 requested but the
> > >> > memory is mapped already
> > >> > [2.546153] Uhuuh, elf segement at 000d9000 requested but the
> > >> > memory is mapped already
> > >> >
> > >> > And then trying to run userspace looks like this:
> > >>
> > >> Could you please run with debugging patch posted
> > >> http://lkml.kernel.org/r/20171107102854.vylrtaodla63k...@dhcp22.suse.cz
> > >
> > > Did you have chance to test with this debugging patch, please?
> > 
> > Lots of this:
> > 
> > [1.177266] Uhuuh, elf segement at 000d9000 requested but the  memory is 
> > mapped already, got 000dd000
> > [1.177555] Clashing vma [dd000, de000] flags:100873 name:(null)
> 
> This smells like the problem I expected: mmap with a hint doesn't
> respect the hint even though there is no clashing mapping. The above
> basically says that we didn't map at 0xd9000 but it has been placed at
> 0xdd000. The nearest (clashing) vma is at 0xdd000, so this is our new
> mapping. find_vma returns the closest vma (with addr < vm_end) for the
> given address 0xd9000, so this address cannot be mapped by any other vma.
> 
> Now that I am looking at arm's arch_get_unmapped_area, it does perform
> aligning for shared vmas. We do not do that for MAP_FIXED. Powerpc,
> reported earlier [1], seems to suffer from a similar problem:
> slice_get_unmapped_area aligns to slices, whatever that means.
> 
> I can see two possible ways around that. Either we explicitly request
> non-aligned mappings via a special MAP_$FOO (e.g. MAP_FIXED_SAFE) or
> simply opt out from the MAP_FIXED protection via ifdefs. The first
> option sounds more generic to me but also more tricky to not introduce
> other user visible effects. The latter is quite straightforward. What do
> you think about the following on top of the previous patch?
> 
> It is rather terse and disables the MAP_FIXED protection for arm
> completely because I couldn't find a way to make it conditional on
> CACHEID_VIPT_ALIASING. But this can always be handled later. I find the
> protection for other architectures useful enough to have this working
> for most architectures now and handle others specially.
> 
> [1] http://lkml.kernel.org/r/1510048229.12079.7.ca...@abdul.in.ibm.com
> ---
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 61a0cb15067e..018d041a30e6 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -99,6 +99,7 @@ config ARM
select PERF_USE_VMALLOC
select RTC_LIB
select SYS_SUPPORTS_APM_EMULATION
+   select ARCH_ALIGNED_MMAPS
# Above selects are sorted alphabetically; please add new ones
# according to that.  Thanks.
help
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 48d91d5be4e9..eca59d27e9f1 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -72,6 +72,7 @@ config MIPS
select RTC_LIB if !MACH_LOONGSON64
select SYSCTL_EXCEPTION_TRACE
select VIRT_TO_BUS
+   select ARCH_ALIGNED_MMAPS
 
 menu "Machine selection"
 
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 22f27ec8c117..8376d16e0a4a 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -40,6 +40,7 @@ config PARISC
select GENERIC_CLOCKEVENTS
select ARCH_NO_COHERENT_DMA_MMAP
select CPU_NO_EFFICIENT_FFS
+   select ARCH_ALIGNED_MMAPS
 
help
  The PA-RISC microprocessor is designed by Hewlett-Packard and used
diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index 2f629e0551e9..156f69c09c7f 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -368,6 +368,7 @@ config PPC_MM_SLICES
bool
default y if PPC_STD_MMU_64
default n
+   select ARCH_ALIGNED_MMAPS
 
 config PPC_HAVE_PMU_SUPPORT
bool
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 640a85925060..ac1d4637a728 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -49,6 +49,7 @@ config SUPERH
select 

Re: [PATCHv4 3/6] powerpc64: Add .opd based function descriptor dereference

2017-11-13 Thread Sergey Senozhatsky
On (11/13/17 12:41), Santosh Sivaraj wrote:
> * Sergey Senozhatsky  wrote (on 2017-11-10 
> 08:48:27 +0900):
> 
> > We are moving towards separate kernel and module function descriptor
> > dereference callbacks. This patch enables it for powerpc64.
> > 
> > For pointers that belong to the kernel
> > -  Added __start_opd and __end_opd pointers, to track the kernel
> >.opd section address range;
> > 
> > -  Added dereference_kernel_function_descriptor(). Now we
> >will dereference only function pointers that are within
> >[__start_opd, __end_opd);
> > 
> > For pointers that belong to a module
> > -  Added dereference_module_function_descriptor() to handle module
> >function descriptor dereference. Now we will dereference only
> >pointers that are within [module->opd.start, module->opd.end).
> > 
> > Signed-off-by: Sergey Senozhatsky 
> > ---
> >  arch/powerpc/include/asm/module.h   |  3 +++
> >  arch/powerpc/include/asm/sections.h | 12 
> >  arch/powerpc/kernel/module_64.c | 14 ++
> >  arch/powerpc/kernel/vmlinux.lds.S   |  2 ++
> >  4 files changed, 31 insertions(+)
> >
> 
> Looks good on powerpc. If you wish:
> 
> Tested-by: Santosh Sivaraj  # for powerpc

thanks!

-ss


Re: linux-next: Tree for Nov 7

2017-11-13 Thread Russell King - ARM Linux
On Mon, Nov 13, 2017 at 10:20:06AM +0100, Michal Hocko wrote:
> [Cc arm and ppc maintainers]
> 
> Thanks a lot for testing!
> 
> On Sun 12-11-17 11:38:02, Joel Stanley wrote:
> > On Fri, Nov 10, 2017 at 11:00 PM, Michal Hocko  wrote:
> > > Hi Joel,
> > >
> > > On Wed 08-11-17 15:20:50, Michal Hocko wrote:
> > > [...]
> > >> > There are a lot of messages on the way up that look like this:
> > >> >
> > >> > [2.527460] Uhuuh, elf segement at 000d9000 requested but the
> > >> > memory is mapped already
> > >> > [2.540160] Uhuuh, elf segement at 000d9000 requested but the
> > >> > memory is mapped already
> > >> > [2.546153] Uhuuh, elf segement at 000d9000 requested but the
> > >> > memory is mapped already
> > >> >
> > >> > And then trying to run userspace looks like this:
> > >>
> > >> Could you please run with debugging patch posted
> > >> http://lkml.kernel.org/r/20171107102854.vylrtaodla63k...@dhcp22.suse.cz
> > >
> > > Did you have chance to test with this debugging patch, please?
> > 
> > Lots of this:
> > 
> > [1.177266] Uhuuh, elf segement at 000d9000 requested but the  memory is 
> > mapped already, got 000dd000
> > [1.177555] Clashing vma [dd000, de000] flags:100873 name:(null)
> 
> This smells like the problem I expected: mmap with a hint doesn't
> respect the hint even though there is no clashing mapping. The above
> basically says that we didn't map at 0xd9000 but it has been placed at
> 0xdd000. The nearest (clashing) vma is at 0xdd000, so this is our new
> mapping. find_vma returns the closest vma (with addr < vm_end) for the
> given address 0xd9000, so this address cannot be mapped by any other vma.
> 
> Now that I am looking at arm's arch_get_unmapped_area, it does perform
> aligning for shared vmas. We do not do that for MAP_FIXED. Powerpc,
> reported earlier [1], seems to suffer from a similar problem:
> slice_get_unmapped_area aligns to slices, whatever that means.
> 
> I can see two possible ways around that. Either we explicitly request
> non-aligned mappings via a special MAP_$FOO (e.g. MAP_FIXED_SAFE) or
> simply opt out from the MAP_FIXED protection via ifdefs. The first
> option sounds more generic to me but also more tricky to not introduce
> other user visible effects. The latter is quite straightforward. What do
> you think about the following on top of the previous patch?
> 
> It is rather terse and disables the MAP_FIXED protection for arm
> completely because I couldn't find a way to make it conditional on
> CACHEID_VIPT_ALIASING. But this can always be handled later. I find the
> protection for other architectures useful enough to have this working
> for most architectures now and handle others specially.

Can someone provide the background information for this please?

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


Re: linux-next: Tree for Nov 7

2017-11-13 Thread Michal Hocko
[Cc arm and ppc maintainers]

Thanks a lot for testing!

On Sun 12-11-17 11:38:02, Joel Stanley wrote:
> On Fri, Nov 10, 2017 at 11:00 PM, Michal Hocko  wrote:
> > Hi Joel,
> >
> > On Wed 08-11-17 15:20:50, Michal Hocko wrote:
> > [...]
> >> > There are a lot of messages on the way up that look like this:
> >> >
> >> > [2.527460] Uhuuh, elf segement at 000d9000 requested but the
> >> > memory is mapped already
> >> > [2.540160] Uhuuh, elf segement at 000d9000 requested but the
> >> > memory is mapped already
> >> > [2.546153] Uhuuh, elf segement at 000d9000 requested but the
> >> > memory is mapped already
> >> >
> >> > And then trying to run userspace looks like this:
> >>
> >> Could you please run with debugging patch posted
> >> http://lkml.kernel.org/r/20171107102854.vylrtaodla63k...@dhcp22.suse.cz
> >
> > Did you have chance to test with this debugging patch, please?
> 
> Lots of this:
> 
> [1.177266] Uhuuh, elf segement at 000d9000 requested but the  memory is 
> mapped already, got 000dd000
> [1.177555] Clashing vma [dd000, de000] flags:100873 name:(null)

This smells like the problem I expected: mmap with a hint doesn't
respect the hint even though there is no clashing mapping. The above
basically says that we didn't map at 0xd9000 but it has been placed at
0xdd000. The nearest (clashing) vma is at 0xdd000, so this is our new
mapping. find_vma returns the closest vma (with addr < vm_end) for the
given address 0xd9000, so this address cannot be mapped by any other vma.

Now that I am looking at arm's arch_get_unmapped_area, it does perform
aligning for shared vmas. We do not do that for MAP_FIXED. Powerpc,
reported earlier [1], seems to suffer from a similar problem:
slice_get_unmapped_area aligns to slices, whatever that means.

I can see two possible ways around that. Either we explicitly request
non-aligned mappings via a special MAP_$FOO (e.g. MAP_FIXED_SAFE) or
simply opt out from the MAP_FIXED protection via ifdefs. The first
option sounds more generic to me but also more tricky to not introduce
other user visible effects. The latter is quite straightforward. What do
you think about the following on top of the previous patch?

It is rather terse and disables the MAP_FIXED protection for arm
completely because I couldn't find a way to make it conditional on
CACHEID_VIPT_ALIASING. But this can always be handled later. I find the
protection for other architectures useful enough to have this working
for most architectures now and handle others specially.

[1] http://lkml.kernel.org/r/1510048229.12079.7.ca...@abdul.in.ibm.com
---
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 61a0cb15067e..018d041a30e6 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -99,6 +99,7 @@ config ARM
select PERF_USE_VMALLOC
select RTC_LIB
select SYS_SUPPORTS_APM_EMULATION
+   select ARCH_ALIGNED_MMAPS
# Above selects are sorted alphabetically; please add new ones
# according to that.  Thanks.
help
diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index 2f629e0551e9..156f69c09c7f 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -368,6 +368,7 @@ config PPC_MM_SLICES
bool
default y if PPC_STD_MMU_64
default n
+   select ARCH_ALIGNED_MMAPS
 
 config PPC_HAVE_PMU_SUPPORT
bool
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index a22718de42db..d23eb89f31c0 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -345,13 +345,19 @@ static unsigned long elf_vm_mmap(struct file *filep, 
unsigned long addr,
unsigned long size, int prot, int type, unsigned long off)
 {
unsigned long map_addr;
+   unsigned long map_type = type;
 
/*
 * If caller requests the mapping at a specific place, make sure we fail
 * rather than potentially clobber an existing mapping which can have
-* security consequences (e.g. smash over the stack area).
+* security consequences (e.g. smash over the stack area). Be careful
+* about architectures which do not respect the address hint due to
+* aligning restrictions for !fixed mappings.
 */
-   map_addr = vm_mmap(filep, addr, size, prot, type & ~MAP_FIXED, off);
+   if (!IS_ENABLED(ARCH_ALIGNED_MMAPS))
+   map_type &= ~MAP_FIXED;
+
+   map_addr = vm_mmap(filep, addr, size, prot, map_type, off);
if (BAD_ADDR(map_addr))
return map_addr;
 
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] kernel/module_64.c: Add REL24 relocation support of livepatch symbols

2017-11-13 Thread Balbir Singh
On Fri, Nov 10, 2017 at 2:28 PM, Josh Poimboeuf  wrote:
> On Fri, Nov 10, 2017 at 01:06:25PM +1100, Balbir Singh wrote:
>> On Fri, Nov 10, 2017 at 2:19 AM, Josh Poimboeuf  wrote:
>> > FWIW, I think it won't matter anyway.  I'm currently pursuing the option
>> > of inserting nops after local calls, because it has less runtime
>> > complexity than using a stub.
>> >
>> > I think I've figured out a way to do it with a GCC plugin, but if that
>> > doesn't work I'll try the asm listing sed approach suggested by Michael.
>> >
>>
>> A plugin that runs for the new kernel with the patch? Just for
>> specific files involved in the patch?
>
> The plugin will affect the code generation of all functions in the patch
> module.  So all calls in all replacement functions will have the nops.
>
> Here's a prototype (not yet fully tested):
>
>   
> https://github.com/jpoimboe/kpatch/blob/TODO-ppc-fix/kpatch-build/gcc-plugins/ppc64le-plugin.c
>


Thanks! I looked at it briefly, I need to catch up with gcc plugins;
code < 1000 was interesting. I guess now we don't need to pass via the
new kpatch helper stub, we would just replace the no-ops and be done!
I'd have to look closer, but I wonder what happens if a function has no
global entry point.

(See http://mpe.github.io/posts/2016/05/23/kernel-live-patching-for-ppc64le/,
section "Understanding the TOC & entry points", and the example of
int_to_scsilun().)

Balbir Singh.


[PATCH] powerpc/npu: Cleanup MMIO ATSD flushing

2017-11-13 Thread Balbir Singh
While reviewing the code I found that the flush assumes all
pages are of mmu_linear_psize, which is not correct. The patch
uses find_linux_pte to find the right page size and uses that
for launching the ATSD invalidation. A new helper is added
to abstract the invalidation from the various notifiers.

The patch also cleans up a bit by removing AP size from PID
flushes.

Signed-off-by: Balbir Singh 
---

Changelog since RFC
 - Implement review comments from Aneesh

 arch/powerpc/platforms/powernv/npu-dma.c | 66 
 1 file changed, 49 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c 
b/arch/powerpc/platforms/powernv/npu-dma.c
index 2cb6cbea4b3b..bf12a023a8de 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -27,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "powernv.h"
 #include "pci.h"
@@ -459,9 +461,6 @@ static int mmio_invalidate_pid(struct npu *npu, unsigned 
long pid, bool flush)
/* PRS set to process-scoped */
launch |= PPC_BIT(13);
 
-   /* AP */
-   launch |= (u64) mmu_get_ap(mmu_virtual_psize) << PPC_BITLSHIFT(17);
-
/* PID */
launch |= pid << PPC_BITLSHIFT(38);
 
@@ -473,7 +472,8 @@ static int mmio_invalidate_pid(struct npu *npu, unsigned 
long pid, bool flush)
 }
 
 static int mmio_invalidate_va(struct npu *npu, unsigned long va,
-   unsigned long pid, bool flush)
+   unsigned long pid, bool flush,
+   unsigned int shift)
 {
unsigned long launch;
 
@@ -484,9 +484,8 @@ static int mmio_invalidate_va(struct npu *npu, unsigned 
long va,
launch |= PPC_BIT(13);
 
/* AP */
-   launch |= (u64) mmu_get_ap(mmu_virtual_psize) << PPC_BITLSHIFT(17);
+   launch |= (u64) mmu_get_ap(shift) << PPC_BITLSHIFT(17);
 
-   /* PID */
launch |= pid << PPC_BITLSHIFT(38);
 
/* No flush */
@@ -503,7 +502,8 @@ struct mmio_atsd_reg {
 };
 
 static void mmio_invalidate_wait(
-   struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS], bool flush)
+   struct mmio_atsd_reg mmio_atsd_reg[NV_MAX_NPUS], bool flush,
+   unsigned int shift)
 {
struct npu *npu;
int i, reg;
@@ -536,7 +536,8 @@ static void mmio_invalidate_wait(
  * the value of va.
  */
 static void mmio_invalidate(struct npu_context *npu_context, int va,
-   unsigned long address, bool flush)
+   unsigned long address, bool flush,
+   unsigned int shift)
 {
int i, j;
struct npu *npu;
@@ -569,7 +570,7 @@ static void mmio_invalidate(struct npu_context 
*npu_context, int va,
if (va)
mmio_atsd_reg[i].reg =
mmio_invalidate_va(npu, address, pid,
-   flush);
+   flush, shift);
else
mmio_atsd_reg[i].reg =
mmio_invalidate_pid(npu, pid, flush);
@@ -582,10 +583,43 @@ static void mmio_invalidate(struct npu_context 
*npu_context, int va,
}
}
 
-   mmio_invalidate_wait(mmio_atsd_reg, flush);
+   mmio_invalidate_wait(mmio_atsd_reg, flush, shift);
if (flush)
/* Wait for the flush to complete */
-   mmio_invalidate_wait(mmio_atsd_reg, false);
+   mmio_invalidate_wait(mmio_atsd_reg, false, shift);
+}
+
+static void pnv_npu2_invalidate_helper(struct npu_context *npu_context,
+   struct mm_struct *mm, unsigned long start,
+   unsigned long end, bool flush)
+{
+   unsigned long address;
+   bool is_thp = false;
+   unsigned int hshift = 0, shift;
+
+   address = start;
+   do {
+   local_irq_disable();
+   find_linux_pte(mm->pgd, address, &is_thp, &hshift);
+   if (!is_thp)
+   shift = PAGE_SHIFT;
+   else if (hshift && !is_thp)
+   shift = hshift;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   else
+   shift = HPAGE_PMD_SIZE;
+#else
+   else {
+   shift = PAGE_SHIFT;
+   pr_warn_once("unsupport page size for mm %p,addr %lx\n",
+   mm, start);
+   }
+#endif
+   mmio_invalidate(npu_context, address > 0, address, flush,
+   shift);
+   local_irq_enable();
+   address += (1ull << shift);
+   } while (address < end);
 }
 
 static void pnv_npu2_mn_release(struct mmu_notifier *mn,
@@ -601,7 +635,7 @@ static void