Re: linux-next: powerpc le qemu boot failure after merge of the akpm tree

2019-01-30 Thread Mike Rapoport
(added Andrey Konovalov)

On Thu, Jan 31, 2019 at 07:15:26AM +0100, Christophe Leroy wrote:
> 
> Le 31/01/2019 à 07:06, Stephen Rothwell a écrit :
> >Hi all,
> >
> >On Thu, 31 Jan 2019 16:38:54 +1100 Stephen Rothwell  
> >wrote:
> >>
> >>[I am guessing that it is something in Andrew's tree that has caused
> >>this.]
> >>
> >>My qemu boot of the powerpc pseries_le_defconfig config failed like this:
> >>
> >>htab_hash_mask= 0x1
> >>-
> >>numa:   NODE_DATA [mem 0x7ffe7000-0x7ffebfff]
> >>Kernel panic - not syncing: sparse_buffer_init: Failed to allocate 
> >>2147483648 bytes align=0x1 nid=0 from=fff
> >>CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc4 #2
> >>Call Trace:
> >>[c105bbd0] [c0b1345c] dump_stack+0xb0/0xf4 (unreliable)
> >>[c105bc10] [c020] panic+0x168/0x3b8
> >>[c105bcb0] [c0e701c8] sparse_init_nid+0x178/0x550
> >>[c105bd70] [c0e709b4] sparse_init+0x210/0x238
> >>[c105bdb0] [c0e468f4] initmem_init+0x1e0/0x260
> >>[c105be80] [c0e3b9b0] setup_arch+0x354/0x3d4
> >>[c105bef0] [c0e33afc] start_kernel+0x98/0x648
> >>[c105bf90] [c000b270] start_here_common+0x1c/0x52c
> >
> >A quick bisect leads to this:
> >
> >1c3c9328cde027eb875ba4692f0a5d66b0afe862 is the first bad commit
> >commit 1c3c9328cde027eb875ba4692f0a5d66b0afe862
> >Author: Mike Rapoport 
> >Date:   Thu Jan 31 10:51:32 2019 +1100
> >
> > treewide: add checks for the return value of memblock_alloc*()
> > Add check for the return value of memblock_alloc*() functions and call
> > panic() in case of error.  The panic message repeats the one used by
> > panicking memblock allocators with adjustment of parameters to include 
> > only
> > relevant ones.
> >
> >Which is just adding the panic we hit.  So, presumably, the bug is in a
> >preceding patch :-(
> >
> >I have left the kernel not booting for today.
> >
> 
> No I think the error is really in that patch, see my other mail.
> 
> See https://elixir.bootlin.com/linux/v5.0-rc4/source/mm/memblock.c#L1455,
> memblock_alloc_try_nid_raw() is not supposed to panic, so the last hunk of
> this patch should be reverted.
> 
> Found in total three problematic hunks in that patch:
> 
> @@ -48,6 +53,11 @@ static phys_addr_t __init kasan_alloc_raw_page(int node)
>   void *p = memblock_alloc_try_nid_raw(PAGE_SIZE, PAGE_SIZE,
>   __pa(MAX_DMA_ADDRESS),
>   MEMBLOCK_ALLOC_KASAN, node);
> + if (!p)
> + panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d 
> from=%llx\n",
> +   __func__, PAGE_SIZE, PAGE_SIZE, node,
> +   __pa(MAX_DMA_ADDRESS));
> +
>   return __pa(p);
>  }
 
I've looked more closely at the code that uses this function, and it does
not seem to handle allocation errors.
I could replace the panic() with a WARN(), but I think that panic() is
appropriate here.
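
For illustration only, a WARN()-based variant of the kasan_alloc_raw_page() hunk
above would look roughly like the sketch below (not proposed for merging; note
that the caller would then get __pa(NULL), i.e. physical address 0, which is why
panic() looks like the safer choice here):

static phys_addr_t __init kasan_alloc_raw_page(int node)
{
        void *p = memblock_alloc_try_nid_raw(PAGE_SIZE, PAGE_SIZE,
                                              __pa(MAX_DMA_ADDRESS),
                                              MEMBLOCK_ALLOC_KASAN, node);

        /* warn instead of panicking; the caller ends up mapping phys addr 0 */
        if (WARN(!p, "%s: Failed to allocate %lu bytes align=0x%lx nid=%d\n",
                 __func__, PAGE_SIZE, PAGE_SIZE, node))
                return 0;

        return __pa(p);
}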

Andrey, can you comment?


> @@ -211,6 +211,9 @@ static int __init iob_init(struct device_node *dn)
>   iob_l2_base = memblock_alloc_try_nid_raw(1UL << 21, 1UL << 21,
>   MEMBLOCK_LOW_LIMIT, 0x8000,
>   NUMA_NO_NODE);
> + if (!iob_l2_base)
> + panic("%s: Failed to allocate %lu bytes align=0x%lx 
> max_addr=%x\n",
> +   __func__, 1UL << 21, 1UL << 21, 0x8000);
> 
>   pr_info("IOBMAP L2 allocated at: %p\n", iob_l2_base);
 
This one actually fixes my own mistake from one of the previous patches,
which converted memblock_alloc_base() to memblock_alloc_try_nid_raw() without
adding the panic() (commit 47e382eb08cfa0199c4ea9f9cc73f1b48a3a4b1d,
"powerpc: prefer memblock APIs returning virtual address").
 
> @@ -425,6 +436,10 @@ static void __init sparse_buffer_init(unsigned long
> size, int nid)
>   memblock_alloc_try_nid_raw(size, PAGE_SIZE,
>   __pa(MAX_DMA_ADDRESS),
>   MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> + if (!sparsemap_buf)
> + panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d 
> from=%lx\n",
> +   __func__, size, PAGE_SIZE, nid, __pa(MAX_DMA_ADDRESS));
> +
>   sparsemap_buf_end = sparsemap_buf + size;
>  }
 
This hunk was not needed as sparse can deal with this allocation failure.

Andrew, can you please add the patch below as a fixup to "treewide: add
checks for the return value of memblock_alloc*()"?
 
>From 854f54b9d4fe52f477765b905a4b2c421d30f46e Mon Sep 17 00:00:00 2001
From: Mike Rapoport 
Date: Thu, 31 Jan 2019 09:18:50 +0200
Subject: [PATCH] mm/sparse: don't panic if the allocation in
 sparse_buffer_init fails

Adding a panic() when the memblock_alloc_try_nid_raw() call in
sparse_buffer_init() fails was overenthusiastic, as sparse can cope with this
allocation failure.
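
A sketch of the kind of fixup being described, i.e. dropping the panic() hunk
again from sparse_buffer_init() (hunk context abbreviated and purely illustrative):

--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ static void __init sparse_buffer_init(unsigned long size, int nid)
 	sparsemap_buf =
 		memblock_alloc_try_nid_raw(size, PAGE_SIZE,
 					   __pa(MAX_DMA_ADDRESS),
 					   MEMBLOCK_ALLOC_ACCESSIBLE, nid);
-	if (!sparsemap_buf)
-		panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d from=%lx\n",
-		      __func__, size, PAGE_SIZE, nid, __pa(MAX_DMA_ADDRESS));
-
 	sparsemap_buf_end = sparsemap_buf + size;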

Re: [PATCH v2 19/21] treewide: add checks for the return value of memblock_alloc*()

2019-01-30 Thread Mike Rapoport
On Thu, Jan 31, 2019 at 08:07:29AM +0100, Christophe Leroy wrote:
> 
> 
> Le 31/01/2019 à 07:44, Christophe Leroy a écrit :
> >
> >
> >Le 31/01/2019 à 07:41, Mike Rapoport a écrit :
> >>On Thu, Jan 31, 2019 at 07:07:46AM +0100, Christophe Leroy wrote:
> >>>
> >>>
> >>>Le 21/01/2019 à 09:04, Mike Rapoport a écrit :
> Add check for the return value of memblock_alloc*() functions and call
> panic() in case of error.
> The panic message repeats the one used by panicking memblock
> allocators with
> adjustment of parameters to include only relevant ones.
> 
> The replacement was mostly automated with semantic patches like the one
> below with manual massaging of format strings.
> 
> @@
> expression ptr, size, align;
> @@
> ptr = memblock_alloc(size, align);
> + if (!ptr)
> + panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__,
> size, align);
> 
> Signed-off-by: Mike Rapoport 
> Reviewed-by: Guo Ren  # c-sky
> Acked-by: Paul Burton  # MIPS
> Acked-by: Heiko Carstens  # s390
> Reviewed-by: Juergen Gross  # Xen
> ---
> >>>
> >>>[...]
> >>>
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 7ea5dc6..ad94242 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> >>>
> >>>[...]
> >>>
> @@ -425,6 +436,10 @@ static void __init sparse_buffer_init(unsigned
> long size, int nid)
>   memblock_alloc_try_nid_raw(size, PAGE_SIZE,
>   __pa(MAX_DMA_ADDRESS),
>   MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> +    if (!sparsemap_buf)
> +    panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d
> from=%lx\n",
> +  __func__, size, PAGE_SIZE, nid, __pa(MAX_DMA_ADDRESS));
> +
> >>>
> >>>memblock_alloc_try_nid_raw() does not panic (help explicitly says:
> >>>Does not
> >>>zero allocated memory, does not panic if request cannot be satisfied.).
> >>
> >>"Does not panic" does not mean it always succeeds.
> >
> >I agree, but at least here you are changing the behaviour by making it
> >panic explicitly. Are we sure there are no cases where the system could
> >just continue functioning? Maybe a WARN_ON() would be enough there?
> 
> Looking at it in more detail, it seems everything is in place to live with
> a NULL sparsemap_buf: all functions using it check it, so having it NULL
> shouldn't imply a panic, I believe; see the code below.

You are right, I'm preparing the fix right now.
 
> static void *sparsemap_buf __meminitdata;
> static void *sparsemap_buf_end __meminitdata;
> 
> static void __init sparse_buffer_init(unsigned long size, int nid)
> {
>   WARN_ON(sparsemap_buf); /* forgot to call sparse_buffer_fini()? */
>   sparsemap_buf =
>   memblock_alloc_try_nid_raw(size, PAGE_SIZE,
>   __pa(MAX_DMA_ADDRESS),
>   MEMBLOCK_ALLOC_ACCESSIBLE, nid);
>   sparsemap_buf_end = sparsemap_buf + size;
> }
> 
> static void __init sparse_buffer_fini(void)
> {
>   unsigned long size = sparsemap_buf_end - sparsemap_buf;
> 
>   if (sparsemap_buf && size > 0)
>   memblock_free_early(__pa(sparsemap_buf), size);
>   sparsemap_buf = NULL;
> }
> 
> void * __meminit sparse_buffer_alloc(unsigned long size)
> {
>   void *ptr = NULL;
> 
>   if (sparsemap_buf) {
>   ptr = PTR_ALIGN(sparsemap_buf, size);
>   if (ptr + size > sparsemap_buf_end)
>   ptr = NULL;
>   else
>   sparsemap_buf = ptr + size;
>   }
>   return ptr;
> }
> 
> 
> Christophe
> 

-- 
Sincerely yours,
Mike.



Re: linux-next: powerpc le qemu boot failure after merge of the akpm tree

2019-01-30 Thread Stephen Rothwell
Hi Mike,

On Thu, 31 Jan 2019 08:39:30 +0200 Mike Rapoport  wrote:
>
> On Thu, Jan 31, 2019 at 07:15:26AM +0100, Christophe Leroy wrote:
> > 
> > Le 31/01/2019 à 07:06, Stephen Rothwell a écrit :  
> > >>My qemu boot of the powerpc pseries_le_defconfig config failed like this:
> > >>
> > >>htab_hash_mask= 0x1
> > >>-
> > >>numa:   NODE_DATA [mem 0x7ffe7000-0x7ffebfff]
> > >>Kernel panic - not syncing: sparse_buffer_init: Failed to allocate 
> > >>2147483648 bytes align=0x1 nid=0 from=fff  
> 
> This means that sparse_buffer_init tries to allocate 2G for the 
> sparsemap_buf...
> 
> Stephen, how much memory do you give to your VM?

Exactly 2G.

 qemu-system-ppc64 -M pseries -m 2G 

The boot normally continues like this:

rfi-flush: fallback displacement flush available
count-cache-flush: software flush disabled.
stf-barrier: hwsync barrier available
PCI host bridge /pci@8002000  ranges:
  IO 0x2000..0x2000 -> 0x
 MEM 0x20008000..0x2000 -> 0x8000
 MEM 0x2100..0x21ff -> 0x2100
PPC64 nvram contains 65536 bytes
barrier-nospec: using ORI speculation barrier
Zone ranges:
  Normal   [mem 0x-0x7fff]
Movable zone start for each node
Early memory node ranges
  node   0: [mem 0x-0x7fff]
Initmem setup node 0 [mem 0x-0x7fff]

-- 
Cheers,
Stephen Rothwell




Re: [PATCH v2 19/21] treewide: add checks for the return value of memblock_alloc*()

2019-01-30 Thread Christophe Leroy




Le 31/01/2019 à 07:44, Christophe Leroy a écrit :



Le 31/01/2019 à 07:41, Mike Rapoport a écrit :

On Thu, Jan 31, 2019 at 07:07:46AM +0100, Christophe Leroy wrote:



Le 21/01/2019 à 09:04, Mike Rapoport a écrit :

Add check for the return value of memblock_alloc*() functions and call
panic() in case of error.
The panic message repeats the one used by panicking memblock 
allocators with

adjustment of parameters to include only relevant ones.

The replacement was mostly automated with semantic patches like the one
below with manual massaging of format strings.

@@
expression ptr, size, align;
@@
ptr = memblock_alloc(size, align);
+ if (!ptr)
+ panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__,
size, align);

Signed-off-by: Mike Rapoport 
Reviewed-by: Guo Ren  # c-sky
Acked-by: Paul Burton  # MIPS
Acked-by: Heiko Carstens  # s390
Reviewed-by: Juergen Gross  # Xen
---


[...]


diff --git a/mm/sparse.c b/mm/sparse.c
index 7ea5dc6..ad94242 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c


[...]

@@ -425,6 +436,10 @@ static void __init sparse_buffer_init(unsigned 
long size, int nid)

  memblock_alloc_try_nid_raw(size, PAGE_SIZE,
  __pa(MAX_DMA_ADDRESS),
  MEMBLOCK_ALLOC_ACCESSIBLE, nid);
+    if (!sparsemap_buf)
+    panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d 
from=%lx\n",

+  __func__, size, PAGE_SIZE, nid, __pa(MAX_DMA_ADDRESS));
+


memblock_alloc_try_nid_raw() does not panic (help explicitly says: 
Does not

zero allocated memory, does not panic if request cannot be satisfied.).


"Does not panic" does not mean it always succeeds.


I agree, but at least here you are changing the behaviour by making it 
panic explicitly. Are we sure there are no cases where the system could 
just continue functioning? Maybe a WARN_ON() would be enough there?


Looking at it in more detail, it seems everything is in place to live with 
a NULL sparsemap_buf: all functions using it check it, so having it NULL 
shouldn't imply a panic, I believe; see the code below.


static void *sparsemap_buf __meminitdata;
static void *sparsemap_buf_end __meminitdata;

static void __init sparse_buffer_init(unsigned long size, int nid)
{
WARN_ON(sparsemap_buf); /* forgot to call sparse_buffer_fini()? */
sparsemap_buf =
memblock_alloc_try_nid_raw(size, PAGE_SIZE,
__pa(MAX_DMA_ADDRESS),
MEMBLOCK_ALLOC_ACCESSIBLE, nid);
sparsemap_buf_end = sparsemap_buf + size;
}

static void __init sparse_buffer_fini(void)
{
unsigned long size = sparsemap_buf_end - sparsemap_buf;

if (sparsemap_buf && size > 0)
memblock_free_early(__pa(sparsemap_buf), size);
sparsemap_buf = NULL;
}

void * __meminit sparse_buffer_alloc(unsigned long size)
{
void *ptr = NULL;

if (sparsemap_buf) {
ptr = PTR_ALIGN(sparsemap_buf, size);
if (ptr + size > sparsemap_buf_end)
ptr = NULL;
else
sparsemap_buf = ptr + size;
}
return ptr;
}
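
For reference, the vmemmap consumer copes with a missing buffer in the same way;
roughly (paraphrased from mm/sparse-vmemmap.c of this era, details may differ):

void * __meminit vmemmap_alloc_block_buf(unsigned long size, int node)
{
        void *ptr = sparse_buffer_alloc(size);

        /* buffer exhausted or never allocated: fall back to a fresh allocation */
        if (!ptr)
                ptr = vmemmap_alloc_block(size, node);
        return ptr;
}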


Christophe


Re: [PATCH] powerpc/prom_init: add __init markers to all functions

2019-01-30 Thread kbuild test robot
Hi Masahiro,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.0-rc4 next-20190130]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Masahiro-Yamada/powerpc-prom_init-add-__init-markers-to-all-functions/20190131-134035
config: powerpc-mpc837x_mds_defconfig (attached as .config)
compiler: powerpc-linux-gnu-gcc (Debian 8.2.0-11) 8.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.2.0 make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

>> arch/powerpc/kernel/prom_init.c:511:19: error: 'prom_getproplen' defined but 
>> not used [-Werror=unused-function]
static int __init prom_getproplen(phandle node, const char *pname)
  ^~~
   cc1: all warnings being treated as errors

vim +/prom_getproplen +511 arch/powerpc/kernel/prom_init.c

   510  
 > 511  static int __init prom_getproplen(phandle node, const char *pname)
   512  {
   513  return call_prom("getproplen", 2, 1, node, ADDR(pname));
   514  }
   515  

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH v2 19/21] treewide: add checks for the return value of memblock_alloc*()

2019-01-30 Thread Christophe Leroy




Le 31/01/2019 à 07:41, Mike Rapoport a écrit :

On Thu, Jan 31, 2019 at 07:07:46AM +0100, Christophe Leroy wrote:



Le 21/01/2019 à 09:04, Mike Rapoport a écrit :

Add check for the return value of memblock_alloc*() functions and call
panic() in case of error.
The panic message repeats the one used by panicking memblock allocators with
adjustment of parameters to include only relevant ones.

The replacement was mostly automated with semantic patches like the one
below with manual massaging of format strings.

@@
expression ptr, size, align;
@@
ptr = memblock_alloc(size, align);
+ if (!ptr)
+   panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__,
size, align);

Signed-off-by: Mike Rapoport 
Reviewed-by: Guo Ren  # c-sky
Acked-by: Paul Burton# MIPS
Acked-by: Heiko Carstens  # s390
Reviewed-by: Juergen Gross  # Xen
---


[...]


diff --git a/mm/sparse.c b/mm/sparse.c
index 7ea5dc6..ad94242 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c


[...]


@@ -425,6 +436,10 @@ static void __init sparse_buffer_init(unsigned long size, 
int nid)
memblock_alloc_try_nid_raw(size, PAGE_SIZE,
__pa(MAX_DMA_ADDRESS),
MEMBLOCK_ALLOC_ACCESSIBLE, nid);
+   if (!sparsemap_buf)
+   panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d 
from=%lx\n",
+ __func__, size, PAGE_SIZE, nid, __pa(MAX_DMA_ADDRESS));
+


memblock_alloc_try_nid_raw() does not panic (help explicitly says: Does not
zero allocated memory, does not panic if request cannot be satisfied.).


"Does not panic" does not mean it always succeeds.


I agree, but at least here you are changing the behaviour by making it 
panic explicitly. Are we sure there are no cases where the system could 
just continue functioning? Maybe a WARN_ON() would be enough there?


Christophe

  

Stephen Rothwell reports a boot failure due to this change.


Please see my reply on that thread.


Christophe


sparsemap_buf_end = sparsemap_buf + size;
  }







Re: [PATCH v2 19/21] treewide: add checks for the return value of memblock_alloc*()

2019-01-30 Thread Mike Rapoport
On Thu, Jan 31, 2019 at 07:07:46AM +0100, Christophe Leroy wrote:
> 
> 
> Le 21/01/2019 à 09:04, Mike Rapoport a écrit :
> >Add check for the return value of memblock_alloc*() functions and call
> >panic() in case of error.
> >The panic message repeats the one used by panicking memblock allocators with
> >adjustment of parameters to include only relevant ones.
> >
> >The replacement was mostly automated with semantic patches like the one
> >below with manual massaging of format strings.
> >
> >@@
> >expression ptr, size, align;
> >@@
> >ptr = memblock_alloc(size, align);
> >+ if (!ptr)
> >+panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__,
> >size, align);
> >
> >Signed-off-by: Mike Rapoport 
> >Reviewed-by: Guo Ren  # c-sky
> >Acked-by: Paul Burton   # MIPS
> >Acked-by: Heiko Carstens  # s390
> >Reviewed-by: Juergen Gross  # Xen
> >---
> 
> [...]
> 
> >diff --git a/mm/sparse.c b/mm/sparse.c
> >index 7ea5dc6..ad94242 100644
> >--- a/mm/sparse.c
> >+++ b/mm/sparse.c
> 
> [...]
> 
> >@@ -425,6 +436,10 @@ static void __init sparse_buffer_init(unsigned long 
> >size, int nid)
> > memblock_alloc_try_nid_raw(size, PAGE_SIZE,
> > __pa(MAX_DMA_ADDRESS),
> > MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> >+if (!sparsemap_buf)
> >+panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d 
> >from=%lx\n",
> >+  __func__, size, PAGE_SIZE, nid, __pa(MAX_DMA_ADDRESS));
> >+
> 
> memblock_alloc_try_nid_raw() does not panic (help explicitly says: Does not
> zero allocated memory, does not panic if request cannot be satisfied.).

"Does not panic" does not mean it always succeeds.
 
> Stephen Rothwell reports a boot failure due to this change.

Please see my reply on that thread.

> Christophe
> 
> > sparsemap_buf_end = sparsemap_buf + size;
> >  }
> >
> 

-- 
Sincerely yours,
Mike.



Re: linux-next: powerpc le qemu boot failure after merge of the akpm tree

2019-01-30 Thread Mike Rapoport
On Thu, Jan 31, 2019 at 07:15:26AM +0100, Christophe Leroy wrote:
> 
> 
> Le 31/01/2019 à 07:06, Stephen Rothwell a écrit :
> >Hi all,
> >
> >On Thu, 31 Jan 2019 16:38:54 +1100 Stephen Rothwell  
> >wrote:
> >>
> >>[I am guessing that it is something in Andrew's tree that has caused
> >>this.]
> >>
> >>My qemu boot of the powerpc pseries_le_defconfig config failed like this:
> >>
> >>htab_hash_mask= 0x1
> >>-
> >>numa:   NODE_DATA [mem 0x7ffe7000-0x7ffebfff]
> >>Kernel panic - not syncing: sparse_buffer_init: Failed to allocate 
> >>2147483648 bytes align=0x1 nid=0 from=fff

This means that sparse_buffer_init tries to allocate 2G for the sparsemap_buf...

Stephen, how much memory do you give to your VM?

> >>CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc4 #2
> >>Call Trace:
> >>[c105bbd0] [c0b1345c] dump_stack+0xb0/0xf4 (unreliable)
> >>[c105bc10] [c020] panic+0x168/0x3b8
> >>[c105bcb0] [c0e701c8] sparse_init_nid+0x178/0x550
> >>[c105bd70] [c0e709b4] sparse_init+0x210/0x238
> >>[c105bdb0] [c0e468f4] initmem_init+0x1e0/0x260
> >>[c105be80] [c0e3b9b0] setup_arch+0x354/0x3d4
> >>[c105bef0] [c0e33afc] start_kernel+0x98/0x648
> >>[c105bf90] [c000b270] start_here_common+0x1c/0x52c
> >
> >A quick bisect leads to this:
> >
> >1c3c9328cde027eb875ba4692f0a5d66b0afe862 is the first bad commit
> >commit 1c3c9328cde027eb875ba4692f0a5d66b0afe862
> >Author: Mike Rapoport 
> >Date:   Thu Jan 31 10:51:32 2019 +1100
> >
> > treewide: add checks for the return value of memblock_alloc*()
> > Add check for the return value of memblock_alloc*() functions and call
> > panic() in case of error.  The panic message repeats the one used by
> > panicking memblock allocators with adjustment of parameters to include 
> > only
> > relevant ones.
> > The replacement was mostly automated with semantic patches like the one
> > below with manual massaging of format strings.
> > @@
> > expression ptr, size, align;
> > @@
> > ptr = memblock_alloc(size, align);
> > + if (!ptr)
> > +   panic("%s: Failed to allocate %lu bytes align=0x%lx\n", 
> > __func__,
> > size, align);
> > Link: 
> > http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-r...@linux.ibm.com
> > Signed-off-by: Mike Rapoport 
> > Reviewed-by: Guo Ren [c-sky]
> > Acked-by: Paul Burton [MIPS]
> > Acked-by: Heiko Carstens [s390]
> > Reviewed-by: Juergen Gross [Xen]
> > Reviewed-by: Geert Uytterhoeven   [m68k]
> > Cc: Catalin Marinas 
> > Cc: Christophe Leroy 
> > Cc: Christoph Hellwig 
> > Cc: "David S. Miller" 
> > Cc: Dennis Zhou 
> > Cc: Greentime Hu 
> > Cc: Greg Kroah-Hartman 
> > Cc: Guan Xuetao 
> > Cc: Guo Ren 
> > Cc: Mark Salter 
> > Cc: Matt Turner 
> > Cc: Max Filippov 
> > Cc: Michael Ellerman 
> > Cc: Michal Simek 
> > Cc: Petr Mladek 
> > Cc: Richard Weinberger 
> > Cc: Rich Felker 
> > Cc: Rob Herring 
> > Cc: Rob Herring 
> > Cc: Russell King 
> > Cc: Stafford Horne 
> > Cc: Tony Luck 
> > Cc: Vineet Gupta 
> > Cc: Yoshinori Sato 
> > Signed-off-by: Andrew Morton 
> >
> >Which is just adding the panic we hit.  So, presumably, the bug is in a
> >preceding patch :-(
> >
> >I have left the kernel not booting for today.
> >
> 
> No I think the error is really in that patch, see my other mail.
> 
> See https://elixir.bootlin.com/linux/v5.0-rc4/source/mm/memblock.c#L1455,
> memblock_alloc_try_nid_raw() is not supposed to panic, so the last hunk of
> this patch should be reverted.

It is not supposed to panic, but it can still fail, so simply ignoring its
return value seems a bit odd, at the very least.
 
> Found in total three problematic hunks in that patch:
> 
> @@ -48,6 +53,11 @@ static phys_addr_t __init kasan_alloc_raw_page(int node)
>   void *p = memblock_alloc_try_nid_raw(PAGE_SIZE, PAGE_SIZE,
>   __pa(MAX_DMA_ADDRESS),
>   MEMBLOCK_ALLOC_KASAN, node);
> + if (!p)
> + panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d 
> from=%llx\n",
> +   __func__, PAGE_SIZE, PAGE_SIZE, node,
> +   __pa(MAX_DMA_ADDRESS));
> +
>   return __pa(p);
>  }
> 
> @@ -211,6 +211,9 @@ static int __init iob_init(struct device_node *dn)
>   iob_l2_base = memblock_alloc_try_nid_raw(1UL << 21, 1UL << 21,
>   MEMBLOCK_LOW_LIMIT, 0x8000,
>   NUMA_NO_NODE);
> + if (!iob_l2_base)
> + panic("%s: Failed to allocate %lu bytes align=0x%lx 
> max_addr=%x\n",
> +   __func__, 1UL << 21, 1UL << 21, 0x8000);
> 

Re: linux-next: powerpc le qemu boot failure after merge of the akpm tree

2019-01-30 Thread Christophe Leroy




Le 31/01/2019 à 07:06, Stephen Rothwell a écrit :

Hi all,

On Thu, 31 Jan 2019 16:38:54 +1100 Stephen Rothwell  
wrote:


[I am guessing that it is something in Andrew's tree that has caused
this.]

My qemu boot of the powerpc pseries_le_defconfig config failed like this:

htab_hash_mask= 0x1
-
numa:   NODE_DATA [mem 0x7ffe7000-0x7ffebfff]
Kernel panic - not syncing: sparse_buffer_init: Failed to allocate 2147483648 
bytes align=0x1 nid=0 from=fff
CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc4 #2
Call Trace:
[c105bbd0] [c0b1345c] dump_stack+0xb0/0xf4 (unreliable)
[c105bc10] [c020] panic+0x168/0x3b8
[c105bcb0] [c0e701c8] sparse_init_nid+0x178/0x550
[c105bd70] [c0e709b4] sparse_init+0x210/0x238
[c105bdb0] [c0e468f4] initmem_init+0x1e0/0x260
[c105be80] [c0e3b9b0] setup_arch+0x354/0x3d4
[c105bef0] [c0e33afc] start_kernel+0x98/0x648
[c105bf90] [c000b270] start_here_common+0x1c/0x52c


A quick bisect leads to this:

1c3c9328cde027eb875ba4692f0a5d66b0afe862 is the first bad commit
commit 1c3c9328cde027eb875ba4692f0a5d66b0afe862
Author: Mike Rapoport 
Date:   Thu Jan 31 10:51:32 2019 +1100

 treewide: add checks for the return value of memblock_alloc*()
 
 Add check for the return value of memblock_alloc*() functions and call

 panic() in case of error.  The panic message repeats the one used by
 panicking memblock allocators with adjustment of parameters to include only
 relevant ones.
 
 The replacement was mostly automated with semantic patches like the one

 below with manual massaging of format strings.
 
 @@

 expression ptr, size, align;
 @@
 ptr = memblock_alloc(size, align);
 + if (!ptr)
 +   panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__,
 size, align);
 
 Link: http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-r...@linux.ibm.com

 Signed-off-by: Mike Rapoport 
 Reviewed-by: Guo Ren [c-sky]
 Acked-by: Paul Burton [MIPS]
 Acked-by: Heiko Carstens [s390]
 Reviewed-by: Juergen Gross [Xen]
 Reviewed-by: Geert Uytterhoeven   [m68k]
 Cc: Catalin Marinas 
 Cc: Christophe Leroy 
 Cc: Christoph Hellwig 
 Cc: "David S. Miller" 
 Cc: Dennis Zhou 
 Cc: Greentime Hu 
 Cc: Greg Kroah-Hartman 
 Cc: Guan Xuetao 
 Cc: Guo Ren 
 Cc: Mark Salter 
 Cc: Matt Turner 
 Cc: Max Filippov 
 Cc: Michael Ellerman 
 Cc: Michal Simek 
 Cc: Petr Mladek 
 Cc: Richard Weinberger 
 Cc: Rich Felker 
 Cc: Rob Herring 
 Cc: Rob Herring 
 Cc: Russell King 
 Cc: Stafford Horne 
 Cc: Tony Luck 
 Cc: Vineet Gupta 
 Cc: Yoshinori Sato 
 Signed-off-by: Andrew Morton 

Which is just adding the panic we hit.  So, presumably, the bug is in a
preceding patch :-(

I have left the kernel not booting for today.



No I think the error is really in that patch, see my other mail.

See 
https://elixir.bootlin.com/linux/v5.0-rc4/source/mm/memblock.c#L1455, 
memblock_alloc_try_nid_raw() is not supposed to panic, so the last hunk 
of this patch should be reverted.


Found in total three problematic hunks in that patch:

@@ -48,6 +53,11 @@ static phys_addr_t __init kasan_alloc_raw_page(int node)
void *p = memblock_alloc_try_nid_raw(PAGE_SIZE, PAGE_SIZE,
__pa(MAX_DMA_ADDRESS),
MEMBLOCK_ALLOC_KASAN, node);
+   if (!p)
+   panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d 
from=%llx\n",
+ __func__, PAGE_SIZE, PAGE_SIZE, node,
+ __pa(MAX_DMA_ADDRESS));
+
return __pa(p);
 }

@@ -211,6 +211,9 @@ static int __init iob_init(struct device_node *dn)
iob_l2_base = memblock_alloc_try_nid_raw(1UL << 21, 1UL << 21,
MEMBLOCK_LOW_LIMIT, 0x8000,
NUMA_NO_NODE);
+   if (!iob_l2_base)
+   panic("%s: Failed to allocate %lu bytes align=0x%lx 
max_addr=%x\n",
+ __func__, 1UL << 21, 1UL << 21, 0x8000);

pr_info("IOBMAP L2 allocated at: %p\n", iob_l2_base);


@@ -425,6 +436,10 @@ static void __init sparse_buffer_init(unsigned long 
size, int nid)

memblock_alloc_try_nid_raw(size, PAGE_SIZE,
__pa(MAX_DMA_ADDRESS),
MEMBLOCK_ALLOC_ACCESSIBLE, nid);
+   if (!sparsemap_buf)
+   panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d 
from=%lx\n",
+ __func__, size, PAGE_SIZE, nid, __pa(MAX_DMA_ADDRESS));
+
sparsemap_buf_end = sparsemap_buf + 

Re: [PATCH v2 19/21] treewide: add checks for the return value of memblock_alloc*()

2019-01-30 Thread Christophe Leroy




Le 21/01/2019 à 09:04, Mike Rapoport a écrit :

Add check for the return value of memblock_alloc*() functions and call
panic() in case of error.
The panic message repeats the one used by panicking memblock allocators with
adjustment of parameters to include only relevant ones.

The replacement was mostly automated with semantic patches like the one
below with manual massaging of format strings.

@@
expression ptr, size, align;
@@
ptr = memblock_alloc(size, align);
+ if (!ptr)
+   panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__,
size, align);

Signed-off-by: Mike Rapoport 
Reviewed-by: Guo Ren  # c-sky
Acked-by: Paul Burton# MIPS
Acked-by: Heiko Carstens  # s390
Reviewed-by: Juergen Gross  # Xen
---


[...]


diff --git a/mm/sparse.c b/mm/sparse.c
index 7ea5dc6..ad94242 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c


[...]


@@ -425,6 +436,10 @@ static void __init sparse_buffer_init(unsigned long size, 
int nid)
memblock_alloc_try_nid_raw(size, PAGE_SIZE,
__pa(MAX_DMA_ADDRESS),
MEMBLOCK_ALLOC_ACCESSIBLE, nid);
+   if (!sparsemap_buf)
+   panic("%s: Failed to allocate %lu bytes align=0x%lx nid=%d 
from=%lx\n",
+ __func__, size, PAGE_SIZE, nid, __pa(MAX_DMA_ADDRESS));
+


memblock_alloc_try_nid_raw() does not panic (help explicitly says: Does 
not zero allocated memory, does not panic if request cannot be satisfied.).


Stephen Rothwell reports a boot failure due to this change.

Christophe


sparsemap_buf_end = sparsemap_buf + size;
  }
  



Re: linux-next: powerpc le qemu boot failure after merge of the akpm tree

2019-01-30 Thread Stephen Rothwell
Hi all,

On Thu, 31 Jan 2019 16:38:54 +1100 Stephen Rothwell  
wrote:
>
> [I am guessing that it is something in Andrew's tree that has caused
> this.]
> 
> My qemu boot of the powerpc pseries_le_defconfig config failed like this:
> 
> htab_hash_mask= 0x1
> -
> numa:   NODE_DATA [mem 0x7ffe7000-0x7ffebfff]
> Kernel panic - not syncing: sparse_buffer_init: Failed to allocate 2147483648 
> bytes align=0x1 nid=0 from=fff
> CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc4 #2
> Call Trace:
> [c105bbd0] [c0b1345c] dump_stack+0xb0/0xf4 (unreliable)
> [c105bc10] [c020] panic+0x168/0x3b8
> [c105bcb0] [c0e701c8] sparse_init_nid+0x178/0x550
> [c105bd70] [c0e709b4] sparse_init+0x210/0x238
> [c105bdb0] [c0e468f4] initmem_init+0x1e0/0x260
> [c105be80] [c0e3b9b0] setup_arch+0x354/0x3d4
> [c105bef0] [c0e33afc] start_kernel+0x98/0x648
> [c105bf90] [c000b270] start_here_common+0x1c/0x52c

A quick bisect leads to this:

1c3c9328cde027eb875ba4692f0a5d66b0afe862 is the first bad commit
commit 1c3c9328cde027eb875ba4692f0a5d66b0afe862
Author: Mike Rapoport 
Date:   Thu Jan 31 10:51:32 2019 +1100

treewide: add checks for the return value of memblock_alloc*()

Add check for the return value of memblock_alloc*() functions and call
panic() in case of error.  The panic message repeats the one used by
panicking memblock allocators with adjustment of parameters to include only
relevant ones.

The replacement was mostly automated with semantic patches like the one
below with manual massaging of format strings.

@@
expression ptr, size, align;
@@
ptr = memblock_alloc(size, align);
+ if (!ptr)
+   panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__,
size, align);

Link: 
http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-r...@linux.ibm.com
Signed-off-by: Mike Rapoport 
Reviewed-by: Guo Ren [c-sky]
Acked-by: Paul Burton [MIPS]
Acked-by: Heiko Carstens [s390]
Reviewed-by: Juergen Gross [Xen]
Reviewed-by: Geert Uytterhoeven   [m68k]
Cc: Catalin Marinas 
Cc: Christophe Leroy 
Cc: Christoph Hellwig 
Cc: "David S. Miller" 
Cc: Dennis Zhou 
Cc: Greentime Hu 
Cc: Greg Kroah-Hartman 
Cc: Guan Xuetao 
Cc: Guo Ren 
Cc: Mark Salter 
Cc: Matt Turner 
Cc: Max Filippov 
Cc: Michael Ellerman 
Cc: Michal Simek 
Cc: Petr Mladek 
Cc: Richard Weinberger 
Cc: Rich Felker 
Cc: Rob Herring 
Cc: Rob Herring 
Cc: Russell King 
Cc: Stafford Horne 
Cc: Tony Luck 
Cc: Vineet Gupta 
Cc: Yoshinori Sato 
Signed-off-by: Andrew Morton 

Which is just adding the panic we hit.  So, presumably, the bug is in a
preceding patch :-(

I have left the kernel not booting for today.
-- 
Cheers,
Stephen Rothwell




Re: [PATCH 0/4] powerpc/livepatch: reliable stack unwinder fixes

2019-01-30 Thread Michael Ellerman
Jiri Kosina  writes:

> On Wed, 30 Jan 2019, Michael Ellerman wrote:
>
>> I'm happy to take it, unless there's some reason you'd rather it go via 
>> the LP tree?
>
> As this is more about reliable stack unwinding than live patching per se 
> (which is only a user of that facility), I'd actually slightly prefer if 
> it goes via your tree.

Sure thing, I'll take it.

cheers


Re: [PATCH 1/4] powerpc/64s: Clear on-stack exception marker upon exception return

2019-01-30 Thread Michael Ellerman
Nicolai Stange  writes:

> Michael Ellerman  writes:
>
>> Joe Lawrence  writes:
>>> From: Nicolai Stange 
>>>
>>> The ppc64 specific implementation of the reliable stacktracer,
>>> save_stack_trace_tsk_reliable(), bails out and reports an "unreliable
>>> trace" whenever it finds an exception frame on the stack. Stack frames
>>> are classified as exception frames if the STACK_FRAME_REGS_MARKER magic,
>>> as written by exception prologues, is found at a particular location.
>>>
>>> However, as observed by Joe Lawrence, it is possible in practice that
>>> non-exception stack frames can alias with prior exception frames and thus,
>>> that the reliable stacktracer can find a stale STACK_FRAME_REGS_MARKER on
>>> the stack. It in turn falsely reports an unreliable stacktrace and blocks
>>> any live patching transition to finish. Said condition lasts until the
>>> stack frame is overwritten/initialized by function call or other means.
>>>
>>> In principle, we could mitigate this by making the exception frame
>>> classification condition in save_stack_trace_tsk_reliable() stronger:
>>> in addition to testing for STACK_FRAME_REGS_MARKER, we could also take into
>>> account that for all exceptions executing on the kernel stack
>>> - their stack frames's backlink pointers always match what is saved
>>>   in their pt_regs instance's ->gpr[1] slot and that
>>> - their exception frame size equals STACK_INT_FRAME_SIZE, a value
>>>   uncommonly large for non-exception frames.
>>>
>>> However, while these are currently true, relying on them would make the
>>> reliable stacktrace implementation more sensitive towards future changes in
>>> the exception entry code. Note that false negatives, i.e. not detecting
>>> exception frames, would silently break the live patching consistency model.
>>>
>>> Furthermore, certain other places (diagnostic stacktraces, perf, xmon)
>>> rely on STACK_FRAME_REGS_MARKER as well.
>>>
>>> Make the exception exit code clear the on-stack STACK_FRAME_REGS_MARKER
>>> for those exceptions running on the "normal" kernel stack and returning
>>> to kernelspace: because the topmost frame is ignored by the reliable stack
>>> tracer anyway, returns to userspace don't need to take care of clearing
>>> the marker.
>>>
>>> Furthermore, as I don't have the ability to test this on Book 3E or
>>> 32 bits, limit the change to Book 3S and 64 bits.
>>>
>>> Finally, make the HAVE_RELIABLE_STACKTRACE Kconfig option depend on
>>> PPC_BOOK3S_64 for documentation purposes. Before this patch, it depended
>>> on PPC64 && CPU_LITTLE_ENDIAN and because CPU_LITTLE_ENDIAN implies
>>> PPC_BOOK3S_64, there's no functional change here.
>>
>> That has nothing to do with the fix and should really be in a separate
>> patch.
>>
>> I can split it when applying.
>
> If you don't mind, that would be nice! Or simply drop that
> chunk... Otherwise, let me know if I shall send a split v2 for this
> patch [1/4] only.

No worries, I split it out:

commit a50d3250d7ae34c561177a1f9cfb79816fcbcff1
Author: Nicolai Stange 
AuthorDate: Thu Jan 31 16:41:50 2019 +1100
Commit: Michael Ellerman 
CommitDate: Thu Jan 31 16:43:29 2019 +1100

powerpc/64s: Make reliable stacktrace dependency clearer

Make the HAVE_RELIABLE_STACKTRACE Kconfig option depend on
PPC_BOOK3S_64 for documentation purposes. Before this patch, it
depended on PPC64 && CPU_LITTLE_ENDIAN and because CPU_LITTLE_ENDIAN
implies PPC_BOOK3S_64, there's no functional change here.

Signed-off-by: Nicolai Stange 
Signed-off-by: Joe Lawrence 
[mpe: Split out of larger patch]
Signed-off-by: Michael Ellerman 

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2890d36eb531..73bf87b1d274 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -220,7 +220,7 @@ config PPC
select HAVE_PERF_USER_STACK_DUMP
select HAVE_RCU_TABLE_FREE  if SMP
select HAVE_REGS_AND_STACK_ACCESS_API
-   select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
+   select HAVE_RELIABLE_STACKTRACE if PPC_BOOK3S_64 && 
CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
select HAVE_IRQ_TIME_ACCOUNTING


cheers


Re: [PATCH v02] powerpc/pseries: Check for ceded CPU's during LPAR migration

2019-01-30 Thread Michael Ellerman
Michael Bringmann  writes:
> This patch is to check for cede'ed CPUs during LPM.  Some extreme
> tests encountered a problem where Linux has put some threads to
> sleep (possibly to save energy or something), LPM was attempted,
> and the Linux kernel didn't awaken the sleeping threads, but issued
> the H_JOIN for the active threads.  Since the sleeping threads
> are not awake, they can not issue the expected H_JOIN, and the
> partition would never suspend.  This patch wakes the sleeping
> threads back up.

I don't think this is the right solution.

Just after your for loop we do an on_each_cpu() call, which sends an IPI
to every CPU, and that should wake all CPUs up from CEDE.

If that's not happening then there is a bug somewhere, and we need to
work out where.
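
For context, the call being referred to, paraphrased from rtas_ibm_suspend_me()
(the helper name is quoted from memory):

        /* Call function on all CPUs.  One of us will make the rtas call. */
        on_each_cpu(rtas_percpu_suspend_me, &data, 0);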


> diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
> b/arch/powerpc/include/asm/plpar_wrappers.h
> index cff5a41..8292eff 100644
> --- a/arch/powerpc/include/asm/plpar_wrappers.h
> +++ b/arch/powerpc/include/asm/plpar_wrappers.h
> @@ -26,10 +26,8 @@ static inline void set_cede_latency_hint(u8 latency_hint)
>   get_lppaca()->cede_latency_hint = latency_hint;
>  }
>  
> -static inline long cede_processor(void)
> -{
> - return plpar_hcall_norets(H_CEDE);
> -}
> +int cpu_is_ceded(int cpu);
> +long cede_processor(void);
>  
>  static inline long extended_cede_processor(unsigned long latency_hint)
>  {
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index de35bd8f..fea3d21 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* This is here deliberately so it's only used in this file */
>  void enter_rtas(unsigned long);
> @@ -942,7 +943,7 @@ int rtas_ibm_suspend_me(u64 handle)
>   struct rtas_suspend_me_data data;
>   DECLARE_COMPLETION_ONSTACK(done);
>   cpumask_var_t offline_mask;
> - int cpuret;
> + int cpuret, cpu;
>  
>   if (!rtas_service_present("ibm,suspend-me"))
>   return -ENOSYS;
> @@ -991,6 +992,11 @@ int rtas_ibm_suspend_me(u64 handle)
>   goto out_hotplug_enable;
>   }
>  
> + for_each_present_cpu(cpu) {
> + if (cpu_is_ceded(cpu))
> + plpar_hcall_norets(H_PROD, 
> get_hard_smp_processor_id(cpu));
> + }

There's a race condition here, there's nothing to prevent the CPUs you
just PROD'ed from going back into CEDE before you do the on_each_cpu()
call below.

>   /* Call function on all CPUs.  One of us will make the
>* rtas call
>*/
> diff --git a/arch/powerpc/platforms/pseries/setup.c 
> b/arch/powerpc/platforms/pseries/setup.c
> index 41f62ca2..48ae6d4 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -331,6 +331,24 @@ static int alloc_dispatch_log_kmem_cache(void)
>  }
>  machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
>  
> +static DEFINE_PER_CPU(int, cpu_ceded);
> +
> +int cpu_is_ceded(int cpu)
> +{
> + return per_cpu(cpu_ceded, cpu);
> +}
> +
> +long cede_processor(void)
> +{
> + long rc;
> +
> + per_cpu(cpu_ceded, raw_smp_processor_id()) = 1;

And there's also a race condition here. From the other CPU's perspective
the store to cpu_ceded is not necessarily ordered vs the hcall below.
Which means the other CPU can see cpu_ceded = 0, and therefore not prod
us, but this CPU has already called H_CEDE.
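
An illustrative interleaving of the ordering problem (not from the patch):

   CPU A (idle path)                  CPU B (rtas_ibm_suspend_me)
   -----------------                  ---------------------------
   per_cpu(cpu_ceded, A) = 1
   plpar_hcall_norets(H_CEDE)         sees cpu_ceded[A] == 0 (store not yet visible)
                                      skips the H_PROD for A, even though A has ceded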

> + rc = plpar_hcall_norets(H_CEDE);
> + per_cpu(cpu_ceded, raw_smp_processor_id()) = 0;
> +
> + return rc;
> +}
> +
>  static void pseries_lpar_idle(void)
>  {
>   /*

cheers


linux-next: powerpc le qemu boot failure after merge of the akpm tree

2019-01-30 Thread Stephen Rothwell
Hi all,

[I am guessing that it is something in Andrew's tree that has caused
this.]

My qemu boot of the powerpc pseries_le_defconfig config failed like this:

htab_hash_mask= 0x1
-
numa:   NODE_DATA [mem 0x7ffe7000-0x7ffebfff]
Kernel panic - not syncing: sparse_buffer_init: Failed to allocate 2147483648 
bytes align=0x1 nid=0 from=fff
CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc4 #2
Call Trace:
[c105bbd0] [c0b1345c] dump_stack+0xb0/0xf4 (unreliable)
[c105bc10] [c020] panic+0x168/0x3b8
[c105bcb0] [c0e701c8] sparse_init_nid+0x178/0x550
[c105bd70] [c0e709b4] sparse_init+0x210/0x238
[c105bdb0] [c0e468f4] initmem_init+0x1e0/0x260
[c105be80] [c0e3b9b0] setup_arch+0x354/0x3d4
[c105bef0] [c0e33afc] start_kernel+0x98/0x648
[c105bf90] [c000b270] start_here_common+0x1c/0x52c

-- 
Cheers,
Stephen Rothwell




Re: [PATCH V5 3/5] arch/powerpc/mm: Nest MMU workaround for mprotect RW upgrade.

2019-01-30 Thread Aneesh Kumar K.V
Michael Ellerman  writes:

> "Aneesh Kumar K.V"  writes:
>> NestMMU requires us to mark the pte invalid and flush the tlb when we do a
>> RW upgrade of pte. We fixed a variant of this in the fault path in commit
>> Fixes: bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to handle 
>> nest MMU hang")
>
> You don't want the "Fixes:" there.
>
>>
>> Do the same for mprotect upgrades.
>>
>> Hugetlb is handled in the next patch.
>>
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  arch/powerpc/include/asm/book3s/64/pgtable.h | 18 ++
>>  arch/powerpc/include/asm/book3s/64/radix.h   |  4 
>>  arch/powerpc/mm/pgtable-book3s64.c   | 25 
>>  arch/powerpc/mm/pgtable-radix.c  | 18 ++
>>  4 files changed, 65 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
>> b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> index 2e6ada28da64..92eaea164700 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -1314,6 +1314,24 @@ static inline int pud_pfn(pud_t pud)
>>  BUILD_BUG();
>>  return 0;
>>  }
>
> Can we get a blank line here?
>
>> +#define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
>> +pte_t ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t 
>> *);
>> +void ptep_modify_prot_commit(struct vm_area_struct *, unsigned long,
>> + pte_t *, pte_t, pte_t);
>
> So these are not inline ...
>
>> +/*
>> + * Returns true for a R -> RW upgrade of pte
>> + */
>> +static inline bool is_pte_rw_upgrade(unsigned long old_val, unsigned long 
>> new_val)
>> +{
>> +if (!(old_val & _PAGE_READ))
>> +return false;
>> +
>> +if ((!(old_val & _PAGE_WRITE)) && (new_val & _PAGE_WRITE))
>> +return true;
>> +
>> +return false;
>> +}
>>  
>>  #endif /* __ASSEMBLY__ */
>>  #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
>> diff --git a/arch/powerpc/mm/pgtable-book3s64.c 
>> b/arch/powerpc/mm/pgtable-book3s64.c
>> index f3c31f5e1026..47c742f002ea 100644
>> --- a/arch/powerpc/mm/pgtable-book3s64.c
>> +++ b/arch/powerpc/mm/pgtable-book3s64.c
>> @@ -400,3 +400,28 @@ void arch_report_meminfo(struct seq_file *m)
>> atomic_long_read(&direct_pages_count[MMU_PAGE_1G]) << 20);
>>  }
>>  #endif /* CONFIG_PROC_FS */
>> +
>> +pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
>> + pte_t *ptep)
>> +{
>> +unsigned long pte_val;
>> +
>> +/*
>> + * Clear the _PAGE_PRESENT so that no hardware parallel update is
>> + * possible. Also keep the pte_present true so that we don't take
>> + * wrong fault.
>> + */
>> +pte_val = pte_update(vma->vm_mm, addr, ptep, _PAGE_PRESENT, 
>> _PAGE_INVALID, 0);
>> +
>> +return __pte(pte_val);
>> +
>> +}
>> +
>> +void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
>> + pte_t *ptep, pte_t old_pte, pte_t pte)
>> +{
>
> Which means we're going to be doing a function call to get to here ...
>
>> +if (radix_enabled())
>> +return radix__ptep_modify_prot_commit(vma, addr,
>> +  ptep, old_pte, pte);
>
> And then another function call to get to the radix version ...
>
>> +set_pte_at(vma->vm_mm, addr, ptep, pte);
>> +}
>> diff --git a/arch/powerpc/mm/pgtable-radix.c 
>> b/arch/powerpc/mm/pgtable-radix.c
>> index 931156069a81..dced3cd241c2 100644
>> --- a/arch/powerpc/mm/pgtable-radix.c
>> +++ b/arch/powerpc/mm/pgtable-radix.c
>> @@ -1063,3 +1063,21 @@ void radix__ptep_set_access_flags(struct 
>> vm_area_struct *vma, pte_t *ptep,
>>  }
>>  /* See ptesync comment in radix__set_pte_at */
>>  }
>> +
>> +void radix__ptep_modify_prot_commit(struct vm_area_struct *vma,
>> +unsigned long addr, pte_t *ptep,
>> +pte_t old_pte, pte_t pte)
>> +{
>> +struct mm_struct *mm = vma->vm_mm;
>> +
>> +/*
>> + * To avoid NMMU hang while relaxing access we need to flush the tlb 
>> before
>> + * we set the new value. We need to do this only for radix, because hash
>> + * translation does flush when updating the linux pte.
>> + */
>> +if (is_pte_rw_upgrade(pte_val(old_pte), pte_val(pte)) &&
>> +(atomic_read(&mm->context.copros) > 0))
>> +radix__flush_tlb_page(vma, addr);
>
> To finally get here, where we'll realise that 99.99% of processes don't
> use copros and so we have nothing to do except set the PTE.
>
>> +
>> +set_pte_at(mm, addr, ptep, pte);
>> +}
>
> So can we just make it all inline in the header? Or do we think it's not
> a hot enough path to worry about it?
>

I did try that earlier, but IIRC it didn't work due to header
inclusion issues. I can try that again in an add-on patch. That would
require moving things around so that the different struct
definitions are found correctly.

-aneesh



Re: [PATCH V5 2/5] mm: update ptep_modify_prot_commit to take old pte value as arg

2019-01-30 Thread Aneesh Kumar K.V
Michael Ellerman  writes:

> "Aneesh Kumar K.V"  writes:
>
>> Architectures like ppc64 require to do a conditional tlb flush based on the 
>> old
>> and new value of pte. Enable that by passing old pte value as the arg.
>
> It's not actually the architecture, it's to work around a specific bug
> on Power9.
>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index c89ce07923c8..028c724dcb1a 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -110,8 +110,8 @@ static unsigned long change_pte_range(struct 
>> vm_area_struct *vma, pmd_t *pmd,
>>  continue;
>>  }
>>  
>> -ptent = ptep_modify_prot_start(vma, addr, pte);
>> -ptent = pte_modify(ptent, newprot);
>> +oldpte = ptep_modify_prot_start(vma, addr, pte);
>> +ptent = pte_modify(oldpte, newprot);
>>  if (preserve_write)
>>  ptent = pte_mk_savedwrite(ptent);
>
> Is it OK to reuse oldpte here?
>
> It was set at the top of the loop with:
>
>   oldpte = *pte;
>
> Is it guaranteed that ptep_modify_prot_start() returns the old value
> unmodified, or could an implementation conceivably filter some bits out?
>
> If so then it could be confusing for oldpte to have its value change
> half way through the loop.
>

ptep_modify_prot_start() and ptep_modify_prot_commit() are the sequence
we can safely use to read/modify/update a pte entry. With respect to the old
pte: the pte bits can't be updated by other software because we are holding
the page table lock (ptl). We could still end up with hardware-updated
reference and change bits, but we make sure we don't lose those by using
the prot_start/prot_commit sequence.

-aneesh



Re: [PATCH V7 3/4] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_do_alloc

2019-01-30 Thread Aneesh Kumar K.V
Michael Ellerman  writes:

> "Aneesh Kumar K.V"  writes:
>
>> The current code doesn't do page migration if the page allocated is a 
>> compound page.
>> With HugeTLB migration support, we can end up allocating hugetlb pages from
>> CMA region. Also, THP pages can be allocated from CMA region. This patch 
>> updates
>> the code to handle compound pages correctly. The patch also switches to a 
>> single
>> get_user_pages with the right count, instead of doing one get_user_pages per 
>> page.
>> That avoids reading page table multiple times.
>
> It's not very obvious from the above description that the migration
> logic is now being done by get_user_pages_longterm(), it just looks like
> it's all being deleted in this patch. Would be good to mention that.
>
>> Since these page reference updates are long term pin, switch to
>> get_user_pages_longterm. That makes sure we fail correctly if the guest RAM
>> is backed by DAX pages.
>
> Can you explain that in more detail?

The lifetime of DAX pages is dictated by file system rules and, as such, we need
to make sure that we free these pages on operations like truncate and
punch hole. If we have a long-term pin on these pages, which are mostly
returned to userspace with an elevated page count, the entity holding the
long-term pin may not be aware that the file got truncated and that
the file system blocks possibly got reused. That can result in corruption.

Work is going on to solve this issue, either by making operations like
truncate wait, or by not releasing these elevated-refcount pages/file
system blocks back to the file system free list.

Until then we prevent long-term pins on DAX pages.

Now that we have an API for long-term pins, we should ideally be using
it in the VFIO code.
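
A rough sketch of taking such a long-term pin with the current API (the function
name and error handling below are illustrative, not taken from the patch):

static long pin_user_range(unsigned long ua, unsigned long entries,
                           struct page **pages)
{
        long ret, i;

        down_read(&current->mm->mmap_sem);
        /* refuses FS-DAX backed ranges with -EOPNOTSUPP instead of pinning */
        ret = get_user_pages_longterm(ua, entries, FOLL_WRITE, pages, NULL);
        up_read(&current->mm->mmap_sem);

        if (ret == entries)
                return 0;

        /* partial pin or error: drop what we got and fail */
        for (i = 0; i < ret; i++)
                put_page(pages[i]);
        return ret < 0 ? ret : -EFAULT;
}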


>
>> The patch also converts the hpas member of mm_iommu_table_group_mem_t to a 
>> union.
>> We use the same storage location to store pointers to struct page. We cannot
>> update all the code path use struct page *, because we access hpas in real 
>> mode
>> and we can't do that struct page * to pfn conversion in real mode.
>
> That's a pain, it's asking for bugs mixing two different values in the
> same array. But I guess it's the least worst option.
>
> It sounds like that's a separate change you could do in a separate
> patch. But it's not, because it's tied to the fact that we're doing a
> single GUP call.

-aneesh



Re: [PATCH v3 1/2] mm: add probe_user_read()

2019-01-30 Thread Michael Ellerman
Christophe Leroy  writes:

> In powerpc code, there are several places implementing safe
> access to user data. This is sometimes implemented using
> probe_kernel_address() with additional access_ok() verification,
> sometimes with get_user() enclosed in a pagefault_disable()/enable()
> pair, etc. :
> show_user_instructions()
> bad_stack_expansion()
> p9_hmi_special_emu()
> fsl_pci_mcheck_exception()
> read_user_stack_64()
> read_user_stack_32() on PPC64
> read_user_stack_32() on PPC32
> power_pmu_bhrb_to()
>
> In the same spirit as probe_kernel_read(), this patch adds
> probe_user_read().
>
> probe_user_read() does the same as probe_kernel_read() but
> first checks that it is really a user address.
>
> The patch defines this function as a static inline so the "size"
> variable can be examined for const-ness by the check_object_size()
> in __copy_from_user_inatomic()
>
> Signed-off-by: Christophe Leroy 
> ---
>  v3: Moved 'Returns:" comment after description.
>  Explained in the commit log why the function is defined static inline
>
>  v2: Added "Returns:" comment and removed probe_user_address()
>
>  include/linux/uaccess.h | 34 ++
>  1 file changed, 34 insertions(+)
>
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 37b226e8df13..ef99edd63da3 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -263,6 +263,40 @@ extern long strncpy_from_unsafe(char *dst, const void 
> *unsafe_addr, long count);
>  #define probe_kernel_address(addr, retval)   \
>   probe_kernel_read(&retval, addr, sizeof(retval))
>  
> +/**
> + * probe_user_read(): safely attempt to read from a user location
> + * @dst: pointer to the buffer that shall take the data
> + * @src: address to read from
> + * @size: size of the data chunk
> + *
> + * Safely read from address @src to the buffer at @dst.  If a kernel fault
> + * happens, handle that and return -EFAULT.
> + *
> + * We ensure that the copy_from_user is executed in atomic context so that
> + * do_page_fault() doesn't attempt to take mmap_sem.  This makes
> + * probe_user_read() suitable for use within regions where the caller
> + * already holds mmap_sem, or other locks which nest inside mmap_sem.
> + *
> + * Returns: 0 on success, -EFAULT on error.
> + */
> +
> +#ifndef probe_user_read
> +static __always_inline long probe_user_read(void *dst, const void __user 
> *src,
> + size_t size)
> +{
> + long ret;
> +

I wonder if we should explicitly switch to USER_DS here?

That would be sort of unusual, but the whole reason for this helper
existing is to make sure we safely read from user memory and not
accidentally from kernel.
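
A sketch of that suggestion (illustrative only, not part of the patch): force
USER_DS around the copy so that a kernel address can never pass the checks,
even when the caller happens to run with KERNEL_DS.

static __always_inline long probe_user_read(void *dst, const void __user *src,
                                            size_t size)
{
        long ret;
        mm_segment_t old_fs = get_fs();

        set_fs(USER_DS);
        if (!access_ok(src, size)) {
                set_fs(old_fs);
                return -EFAULT;
        }

        pagefault_disable();
        ret = __copy_from_user_inatomic(dst, src, size);
        pagefault_enable();
        set_fs(old_fs);

        return ret ? -EFAULT : 0;
}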

cheers

> + if (!access_ok(src, size))
> + return -EFAULT;
> +
> + pagefault_disable();
> + ret = __copy_from_user_inatomic(dst, src, size);
> + pagefault_enable();
> +
> + return ret ? -EFAULT : 0;
> +}
> +#endif
> +
>  #ifndef user_access_begin
>  #define user_access_begin(ptr,len) access_ok(ptr, len)
>  #define user_access_end() do { } while (0)
> -- 
> 2.13.3


Re: [PATCH v3 2/2] powerpc: use probe_user_read()

2019-01-30 Thread Michael Ellerman
Christophe Leroy  writes:

> Instead of opencoding, use probe_user_read() to safely
> read a user location.
>
> Signed-off-by: Christophe Leroy 
> ---
>  v3: No change
>
>  v2: Using probe_user_read() instead of probe_user_address()
>
>  arch/powerpc/kernel/process.c   | 12 +---
>  arch/powerpc/mm/fault.c |  6 +-
>  arch/powerpc/perf/callchain.c   | 20 +++-
>  arch/powerpc/perf/core-book3s.c |  8 +---
>  arch/powerpc/sysdev/fsl_pci.c   | 10 --
>  5 files changed, 10 insertions(+), 46 deletions(-)

Looks good.

Acked-by: Michael Ellerman 

cheers

> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index ce393df243aa..6a4b59d574c2 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1298,16 +1298,6 @@ void show_user_instructions(struct pt_regs *regs)
>  
>   pc = regs->nip - (NR_INSN_TO_PRINT * 3 / 4 * sizeof(int));
>  
> - /*
> -  * Make sure the NIP points at userspace, not kernel text/data or
> -  * elsewhere.
> -  */
> - if (!__access_ok(pc, NR_INSN_TO_PRINT * sizeof(int), USER_DS)) {
> - pr_info("%s[%d]: Bad NIP, not dumping instructions.\n",
> - current->comm, current->pid);
> - return;
> - }
> -
>   seq_buf_init(&s, buf, sizeof(buf));
>  
>   while (n) {
> @@ -1318,7 +1308,7 @@ void show_user_instructions(struct pt_regs *regs)
>   for (i = 0; i < 8 && n; i++, n--, pc += sizeof(int)) {
>   int instr;
>  
> - if (probe_kernel_address((const void *)pc, instr)) {
> + if (probe_user_read(&instr, (void __user *)pc, 
> sizeof(instr))) {
>   seq_buf_printf(&s, " ");
>   continue;
>   }
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 887f11bcf330..ec74305fa330 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -276,12 +276,8 @@ static bool bad_stack_expansion(struct pt_regs *regs, 
> unsigned long address,
>   if ((flags & FAULT_FLAG_WRITE) && (flags & FAULT_FLAG_USER) &&
>   access_ok(nip, sizeof(*nip))) {
>   unsigned int inst;
> - int res;
>  
> - pagefault_disable();
> - res = __get_user_inatomic(inst, nip);
> - pagefault_enable();
> - if (!res)
> + if (!probe_user_read(&inst, nip, sizeof(inst)))
>   return !store_updates_sp(inst);
>   *must_retry = true;
>   }
> diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
> index 0af051a1974e..0680efb2237b 100644
> --- a/arch/powerpc/perf/callchain.c
> +++ b/arch/powerpc/perf/callchain.c
> @@ -159,12 +159,8 @@ static int read_user_stack_64(unsigned long __user *ptr, 
> unsigned long *ret)
>   ((unsigned long)ptr & 7))
>   return -EFAULT;
>  
> - pagefault_disable();
> - if (!__get_user_inatomic(*ret, ptr)) {
> - pagefault_enable();
> + if (!probe_user_read(ret, ptr, sizeof(*ret)))
>   return 0;
> - }
> - pagefault_enable();
>  
>   return read_user_stack_slow(ptr, ret, 8);
>  }
> @@ -175,12 +171,8 @@ static int read_user_stack_32(unsigned int __user *ptr, 
> unsigned int *ret)
>   ((unsigned long)ptr & 3))
>   return -EFAULT;
>  
> - pagefault_disable();
> - if (!__get_user_inatomic(*ret, ptr)) {
> - pagefault_enable();
> + if (!probe_user_read(ret, ptr, sizeof(*ret)))
>   return 0;
> - }
> - pagefault_enable();
>  
>   return read_user_stack_slow(ptr, ret, 4);
>  }
> @@ -307,17 +299,11 @@ static inline int current_is_64bit(void)
>   */
>  static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
>  {
> - int rc;
> -
>   if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
>   ((unsigned long)ptr & 3))
>   return -EFAULT;
>  
> - pagefault_disable();
> - rc = __get_user_inatomic(*ret, ptr);
> - pagefault_enable();
> -
> - return rc;
> + return probe_user_read(ret, ptr, sizeof(*ret));
>  }
>  
>  static inline void perf_callchain_user_64(struct perf_callchain_entry_ctx 
> *entry,
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index b0723002a396..4b64ddf0db68 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -416,7 +416,6 @@ static void power_pmu_sched_task(struct 
> perf_event_context *ctx, bool sched_in)
>  static __u64 power_pmu_bhrb_to(u64 addr)
>  {
>   unsigned int instr;
> - int ret;
>   __u64 target;
>  
>   if (is_kernel_addr(addr)) {
> @@ -427,13 +426,8 @@ static __u64 power_pmu_bhrb_to(u64 addr)
>   }
>  
>   /

Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode

2019-01-30 Thread Paul Mackerras
On Wed, Jan 30, 2019 at 08:01:22AM +0100, Cédric Le Goater wrote:
> On 1/30/19 5:29 AM, Paul Mackerras wrote:
> > On Mon, Jan 28, 2019 at 06:35:34PM +0100, Cédric Le Goater wrote:
> >> On 1/22/19 6:05 AM, Paul Mackerras wrote:
> >>> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
>  This is the basic framework for the new KVM device supporting the XIVE
>  native exploitation mode. The user interface exposes a new capability
>  and a new KVM device to be used by QEMU.
> >>>
> >>> [snip]
>  @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>   #ifdef CONFIG_KVM_XIVE
>   if (xive_enabled()) {
>   kvmppc_xive_init_module();
>  +kvmppc_xive_native_init_module();
>   kvm_register_device_ops(&kvm_xive_ops, 
>  KVM_DEV_TYPE_XICS);
>  +kvm_register_device_ops(&kvm_xive_native_ops,
>  +KVM_DEV_TYPE_XIVE);
> >>>
> >>> I think we want tighter conditions on initializing the xive_native
> >>> stuff and creating the xive device class.  We could have
> >>> xive_enabled() returning true in a guest, and this code will get
> >>> called both by PR KVM and HV KVM (and HV KVM no longer implies that we
> >>> are running bare metal).
> >>
> >> So yes, I gave nested a try with kernel_irqchip=on and the nested 
> >> hypervisor 
> >> (L1) obviously crashes trying to call OPAL. I have tightened the test with:
> >>
> >>if (xive_enabled() && !kvmhv_on_pseries()) {
> >>
> >> for now.
> >>
> >> As this is a problem today in 5.0.x, I will send a patch for it if you 
> >> think
> > 
> > How do you mean this is a problem today in 5.0?  I just tried 5.0-rc1
> > with kernel_irqchip=on in a nested guest and it works just fine.  What
> > exactly did you test?
> 
> L0: Linux 5.0.0-rc3 (+ KVM HV)
> L1: QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3 (+ KVM HV)
> L2:  QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3
> 
> L1 crashes when L2 starts and tries to initialize the KVM IRQ device as 
> it does an OPAL call and its running under SLOF. See below.

OK, you must have a QEMU that advertises XIVE to the guest (L1).  In
that case I can see that L1 would try to do XICS-on-XIVE, which won't
work.  We need to fix that.  Unfortunately the XICS-on-XICS emulation
won't work as is in L1 either, but I think we can fix that by
disabling the real-mode XICS hcall handling.

> I don't understand how L2 can work with kernel_irqchip=on. Could you
> please explain ? 

If QEMU decides to advertise XIVE to the L2 guest and the L2 guest can
do XIVE, then the only possibility is to use the XIVE software
emulation in QEMU, and if kernel_irqchip=on has been specified
explicitly, maybe QEMU decides to terminate the guest rather than
implicitly turning off kernel_irqchip.

If QEMU decides not to advertise XIVE to the L2 guest, or the L2 guest
can't do XIVE, then we could use the XICS-on-XICS emulation in L1 as
long as either (a) L1 is not using XIVE, or (b) we modify the
XICS-on-XICS code to avoid using any XICS or XIVE access (i.e. just
using calls to generic kernel facilities).

Ultimately, if the spapr xive backend code in the kernel could be
extended to provide all the low-level functions that the XICS-on-XIVE
code needs, then we could do XICS-on-XIVE in a guest.

Paul.


Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

2019-01-30 Thread Paul Mackerras
On Wed, Jan 30, 2019 at 04:54:23PM +0100, Cédric Le Goater wrote:
> On 1/30/19 7:20 AM, Paul Mackerras wrote:
> > On Tue, Jan 29, 2019 at 02:47:55PM +0100, Cédric Le Goater wrote:
> >> On 1/29/19 3:45 AM, Paul Mackerras wrote:
> >>> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>  On 1/28/19 7:13 AM, Paul Mackerras wrote:
> > Would we end up with too many VMAs if we just used mmap() to
> > change the mappings from the software-generated pages to the
> > hardware-generated interrupt pages?  
>  The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
>  dedicated to CPU IPIs and the remaining 4K are for devices. We can 
> >>>
> >>> Confused.  You say the number space has 32768 entries but then imply
> >>> there are only 8K entries.  Do you mean that the API allows for 15-bit
> >>> IRQ numbers but we are only making using of 8192 of them?
> >>
> >> Ouch. My bad. Let's do it again. 
> >>
> >> The sPAPR IRQ number space is 0x2000 wide :
> >>
> >> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396
> >>
> >> The first 4K are dedicated to the CPU IPIs and the remaining 4K are for 
> >> devices (which can be extended if needed).
> >>
> >> So that's 8192 x 2 ESB pages.
> >>
>  extend the last range if needed as these are for MSIs. Dynamic 
>  extensions under KVM should work also.
> 
>  This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
>  lot of mmap(), too much. Also the KVM model needs to be compatible
> >>>
> >>> I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
> >>> space would be covered by a single mmap, overlaid by subsequent mmaps
> >>> where we need to map real device interrupts.
> >>
> >> ok. The same fault handler could be used to populate the VMA with the 
> >> ESB pages. 
> >>
> >> But it would mean extra work on the QEMU side, which is not needed 
> >> with this patch. 
> > 
> > Maybe, but just storing a single vma pointer in our private data is
> > not a feasible approach.  First, you have no control on the lifetime
> > of the vma and thus this is a use-after-free waiting to happen, and
> > secondly, there could be multiple vmas that you need to worry about.
> 
> I fully agree. That's why I was uncomfortable with the solution. There 
> are a few other drivers (GPUs if I recall) doing that but it feels wrong.

There is the HMM infrastructure (heterogeneous memory model) which
could possibly be used to do this, but it's very heavyweight for the
problem we have here.

> > Userspace could do multiple mmaps, or could do mprotect or similar on
> > part of an existing mmap, which would split the vma for the mmap into
> > multiple vmas.  You don't get notified about munmap either as far as I
> > can tell, so the vma is liable to go away.  
> 
> yes ...
> 
> > And it doesn't matter if
> > QEMU would never do such things; if userspace can provoke a
> > use-after-free in the kernel using KVM then someone will write a
> > specially crafted user program to do that.
> > 
> > So we either solve it in userspace, or we have to write and maintain
> > complex kernel code with deep links into the MM subsystem.  
> >
> > I'd much rather solve it in userspace.
> 
> OK, then. I have been reluctant doing so but it seems there are no
> other in-kernel solution. 

I discussed the problem today with David Gibson and he mentioned that
QEMU does have a lot of freedom in how it assigns the guest hwirq
numbers.  So you may be able to avoid the problem by (for example)
never assigning a hwirq to a VFIO device that has previously been used
for an emulated device (or something like that).

>  with the QEMU emulated one and it was simpler to have one overall
>  memory region for the IPI ESBs, one for the END ESBs (if we support
>  that one day) and one for the TIMA.
> 
> > Are the necessary pages for a PCI
> > passthrough device contiguous in both host real space 
> 
>  They should as they are the PHB4 ESBs.
> 
> > and guest real space ? 
> 
>  also. That's how we organized the mapping. 
> >>>
> >>> "How we organized the mapping" is a significant design decision that I
> >>> haven't seen documented anywhere, and is really needed for
> >>> understanding what's going on.
> >>
> >> OK. I will add comments on that. See below for some description.
> >>
> >> There is nothing fancy, it's simply indexed with the interrupt number,
> >> like for HW, or for the QEMU XIVE emulated model.
> >>
> > If so we'd only need one mmap() for all the device's interrupt
> > pages.
> 
>  Ah. So we would want to make a special case for the passthrough 
>  device and have a mmap() and a memory region for its ESBs. Hmm.
> 
>  Wouldn't that require to hot plug a memory region under the guest ? 
> >>>
> >>> No; the way that a memory region works is that userspace can do
> >>> whatever disparate mappings it likes within 

[PATCH] powerpc/prom_init: add __init markers to all functions

2019-01-30 Thread Masahiro Yamada
It is fragile to rely on the compiler's optimization to avoid the
section mismatch. Some functions may not necessarily be inlined
when the compiler's inlining heuristic changes.

Add __init markers consistently.

As for prom_getprop() and prom_getproplen(), they are marked as
'inline', so inlining is guaranteed because PowerPC never enables
CONFIG_OPTIMIZE_INLINING. However, it would be better to leave the
inlining decision to the compiler. I replaced 'inline' with __init.
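
For illustration, the kind of breakage this guards against (a hypothetical
example, not code from prom_init.c): if an __init function is referenced
from a function that ends up in regular .text, modpost reports a section
mismatch, because the reference may outlive the freeing of init memory.

	/* Hypothetical example of a section mismatch */
	static int __init setup_foo(void)	/* placed in .init.text */
	{
		return 0;
	}

	static int helper(void)			/* not __init: placed in .text */
	{
		/* Unless helper() itself is inlined into an __init caller,
		 * this reference from .text into .init.text triggers a
		 * modpost section mismatch warning.
		 */
		return setup_foo();
	}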

Signed-off-by: Masahiro Yamada 
---

 arch/powerpc/kernel/prom_init.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index f33ff41..85b0719 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -501,19 +501,19 @@ static int __init prom_next_node(phandle *nodep)
}
 }
 
-static inline int prom_getprop(phandle node, const char *pname,
+static int __init prom_getprop(phandle node, const char *pname,
   void *value, size_t valuelen)
 {
return call_prom("getprop", 4, 1, node, ADDR(pname),
 (u32)(unsigned long) value, (u32) valuelen);
 }
 
-static inline int prom_getproplen(phandle node, const char *pname)
+static int __init prom_getproplen(phandle node, const char *pname)
 {
return call_prom("getproplen", 2, 1, node, ADDR(pname));
 }
 
-static void add_string(char **str, const char *q)
+static void __init add_string(char **str, const char *q)
 {
char *p = *str;
 
@@ -523,7 +523,7 @@ static void add_string(char **str, const char *q)
*str = p;
 }
 
-static char *tohex(unsigned int x)
+static char __init *tohex(unsigned int x)
 {
static const char digits[] __initconst = "0123456789abcdef";
static char result[9] __prombss;
@@ -570,7 +570,7 @@ static int __init prom_setprop(phandle node, const char 
*nodename,
 #define islower(c) ('a' <= (c) && (c) <= 'z')
 #define toupper(c) (islower(c) ? ((c) - 'a' + 'A') : (c))
 
-static unsigned long prom_strtoul(const char *cp, const char **endp)
+static unsigned long __init prom_strtoul(const char *cp, const char **endp)
 {
unsigned long result = 0, base = 10, value;
 
@@ -595,7 +595,7 @@ static unsigned long prom_strtoul(const char *cp, const 
char **endp)
return result;
 }
 
-static unsigned long prom_memparse(const char *ptr, const char **retptr)
+static unsigned long __init prom_memparse(const char *ptr, const char **retptr)
 {
unsigned long ret = prom_strtoul(ptr, retptr);
int shift = 0;
@@ -2924,7 +2924,7 @@ static void __init fixup_device_tree_pasemi(void)
prom_setprop(iob, name, "device_type", "isa", sizeof("isa"));
 }
 #else  /* !CONFIG_PPC_PASEMI_NEMO */
-static inline void fixup_device_tree_pasemi(void) { }
+static inline void __init fixup_device_tree_pasemi(void) { }
 #endif
 
 static void __init fixup_device_tree(void)
@@ -2986,15 +2986,15 @@ static void __init prom_check_initrd(unsigned long r3, 
unsigned long r4)
 
 #ifdef CONFIG_PPC64
 #ifdef CONFIG_RELOCATABLE
-static void reloc_toc(void)
+static void __init reloc_toc(void)
 {
 }
 
-static void unreloc_toc(void)
+static void __init unreloc_toc(void)
 {
 }
 #else
-static void __reloc_toc(unsigned long offset, unsigned long nr_entries)
+static void __init __reloc_toc(unsigned long offset, unsigned long nr_entries)
 {
unsigned long i;
unsigned long *toc_entry;
@@ -3008,7 +3008,7 @@ static void __reloc_toc(unsigned long offset, unsigned 
long nr_entries)
}
 }
 
-static void reloc_toc(void)
+static void __init reloc_toc(void)
 {
unsigned long offset = reloc_offset();
unsigned long nr_entries =
@@ -3019,7 +3019,7 @@ static void reloc_toc(void)
mb();
 }
 
-static void unreloc_toc(void)
+static void __init unreloc_toc(void)
 {
unsigned long offset = reloc_offset();
unsigned long nr_entries =
-- 
2.7.4



[PATCH] powerpc/papr_scm: Use the correct bind address

2019-01-30 Thread Oliver O'Halloran
When binding an SCM volume to a physical address the hypervisor has the
option to return early with a continue token with the expectation that
the guest will resume the bind operation until it completes. A quirk of
this interface is that the bind address will only be returned by the
first bind h-call and the subsequent calls will return
0xFFFF_FFFF_FFFF_FFFF for the bind address.

We currently do not save the address returned by the first h-call. As a
result we will use the junk address as the base of the bound region if
the hypervisor decides to split the bind across multiple h-calls. This
bug was found when testing with very large SCM volumes where the bind
process would take more time than the hypervisor's internal h-call time
limit would allow. This patch fixes the issue by saving the bind address
from the first call.

Cc: sta...@vger.kernel.org
Fixes: b5beae5e224f ("powerpc/pseries: Add driver for PAPR SCM regions")
Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 7d6457ab5d34..bba281b1fe1b 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -43,6 +43,7 @@ static int drc_pmem_bind(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
uint64_t rc, token;
+   uint64_t saved = 0;
 
/*
 * When the hypervisor cannot map all the requested memory in a single
@@ -56,6 +57,8 @@ static int drc_pmem_bind(struct papr_scm_priv *p)
rc = plpar_hcall(H_SCM_BIND_MEM, ret, p->drc_index, 0,
p->blocks, BIND_ANY_ADDR, token);
token = ret[0];
+   if (!saved)
+   saved = ret[1];
cond_resched();
} while (rc == H_BUSY);
 
@@ -64,7 +67,7 @@ static int drc_pmem_bind(struct papr_scm_priv *p)
return -ENXIO;
}
 
-   p->bound_addr = ret[1];
+   p->bound_addr = saved;
 
dev_dbg(&p->pdev->dev, "bound drc %x to %pR\n", p->drc_index, &p->res);
 
-- 
2.20.1



[PATCH v02] powerpc/pseries: Check for ceded CPU's during LPAR migration

2019-01-30 Thread Michael Bringmann
This patch is to check for cede'ed CPUs during LPM.  Some extreme
tests encountered a problem where Linux has put some threads to
sleep (possibly to save energy or something), LPM was attempted,
and the Linux kernel didn't awaken the sleeping threads, but issued
the H_JOIN for the active threads.  Since the sleeping threads
are not awake, they can not issue the expected H_JOIN, and the
partition would never suspend.  This patch wakes the sleeping
threads back up.

Signed-off-by: Nathan Fontenot 
Signed-off-by: Gustavo Walbon 
---
Changes in v02:
   -- Rebase to latest powerpc kernel source.
---
 arch/powerpc/include/asm/plpar_wrappers.h |6 ++
 arch/powerpc/kernel/rtas.c|8 +++-
 arch/powerpc/platforms/pseries/setup.c|   18 ++
 3 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
b/arch/powerpc/include/asm/plpar_wrappers.h
index cff5a41..8292eff 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -26,10 +26,8 @@ static inline void set_cede_latency_hint(u8 latency_hint)
get_lppaca()->cede_latency_hint = latency_hint;
 }
 
-static inline long cede_processor(void)
-{
-   return plpar_hcall_norets(H_CEDE);
-}
+int cpu_is_ceded(int cpu);
+long cede_processor(void);
 
 static inline long extended_cede_processor(unsigned long latency_hint)
 {
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index de35bd8f..fea3d21 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* This is here deliberately so it's only used in this file */
 void enter_rtas(unsigned long);
@@ -942,7 +943,7 @@ int rtas_ibm_suspend_me(u64 handle)
struct rtas_suspend_me_data data;
DECLARE_COMPLETION_ONSTACK(done);
cpumask_var_t offline_mask;
-   int cpuret;
+   int cpuret, cpu;
 
if (!rtas_service_present("ibm,suspend-me"))
return -ENOSYS;
@@ -991,6 +992,11 @@ int rtas_ibm_suspend_me(u64 handle)
goto out_hotplug_enable;
}
 
+   for_each_present_cpu(cpu) {
+   if (cpu_is_ceded(cpu))
+   plpar_hcall_norets(H_PROD, 
get_hard_smp_processor_id(cpu));
+   }
+
/* Call function on all CPUs.  One of us will make the
 * rtas call
 */
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 41f62ca2..48ae6d4 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -331,6 +331,24 @@ static int alloc_dispatch_log_kmem_cache(void)
 }
 machine_early_initcall(pseries, alloc_dispatch_log_kmem_cache);
 
+static DEFINE_PER_CPU(int, cpu_ceded);
+
+int cpu_is_ceded(int cpu)
+{
+   return per_cpu(cpu_ceded, cpu);
+}
+
+long cede_processor(void)
+{
+   long rc;
+
+   per_cpu(cpu_ceded, raw_smp_processor_id()) = 1;
+   rc = plpar_hcall_norets(H_CEDE);
+   per_cpu(cpu_ceded, raw_smp_processor_id()) = 0;
+
+   return rc;
+}
+
 static void pseries_lpar_idle(void)
 {
/*



Re: [PATCH] ucc_geth: Reset BQL queue when stopping device

2019-01-30 Thread David Miller
From: Mathias Thore 
Date: Mon, 28 Jan 2019 10:07:47 +0100

> After a timeout event caused by for example a broadcast storm, when
> the MAC and PHY are reset, the BQL TX queue needs to be reset as
> well. Otherwise, the device will exhibit severe performance issues
> even after the storm has ended.
> 
> Co-authored-by: David Gounaris 
> Signed-off-by: Mathias Thore 

Applied and queued up for -stable, thanks.
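
For reference, the BQL reset described in the commit message typically
amounts to a single call in the driver's recovery path. A minimal sketch
(illustrative only, not the actual ucc_geth diff; the function name is
made up):

	static void example_timeout_recovery(struct net_device *ndev)
	{
		/* ... reset the MAC and PHY as the driver already does ... */

		/* Clear BQL's in-flight byte/packet accounting for TX queue 0
		 * so stale state does not keep throttling the queue after the
		 * reset completes.
		 */
		netdev_reset_queue(ndev);
	}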


Re: [PATCH 1/4] powerpc/64s: Clear on-stack exception marker upon exception return

2019-01-30 Thread Nicolai Stange
Michael Ellerman  writes:

> Joe Lawrence  writes:
>> From: Nicolai Stange 
>>
>> The ppc64 specific implementation of the reliable stacktracer,
>> save_stack_trace_tsk_reliable(), bails out and reports an "unreliable
>> trace" whenever it finds an exception frame on the stack. Stack frames
>> are classified as exception frames if the STACK_FRAME_REGS_MARKER magic,
>> as written by exception prologues, is found at a particular location.
>>
>> However, as observed by Joe Lawrence, it is possible in practice that
>> non-exception stack frames can alias with prior exception frames and thus,
>> that the reliable stacktracer can find a stale STACK_FRAME_REGS_MARKER on
>> the stack. It in turn falsely reports an unreliable stacktrace and blocks
>> any live patching transition to finish. Said condition lasts until the
>> stack frame is overwritten/initialized by function call or other means.
>>
>> In principle, we could mitigate this by making the exception frame
>> classification condition in save_stack_trace_tsk_reliable() stronger:
>> in addition to testing for STACK_FRAME_REGS_MARKER, we could also take into
>> account that for all exceptions executing on the kernel stack
>> - their stack frames's backlink pointers always match what is saved
>>   in their pt_regs instance's ->gpr[1] slot and that
>> - their exception frame size equals STACK_INT_FRAME_SIZE, a value
>>   uncommonly large for non-exception frames.
>>
>> However, while these are currently true, relying on them would make the
>> reliable stacktrace implementation more sensitive towards future changes in
>> the exception entry code. Note that false negatives, i.e. not detecting
>> exception frames, would silently break the live patching consistency model.
>>
>> Furthermore, certain other places (diagnostic stacktraces, perf, xmon)
>> rely on STACK_FRAME_REGS_MARKER as well.
>>
>> Make the exception exit code clear the on-stack STACK_FRAME_REGS_MARKER
>> for those exceptions running on the "normal" kernel stack and returning
>> to kernelspace: because the topmost frame is ignored by the reliable stack
>> tracer anyway, returns to userspace don't need to take care of clearing
>> the marker.
>>
>> Furthermore, as I don't have the ability to test this on Book 3E or
>> 32 bits, limit the change to Book 3S and 64 bits.
>>
>> Finally, make the HAVE_RELIABLE_STACKTRACE Kconfig option depend on
>> PPC_BOOK3S_64 for documentation purposes. Before this patch, it depended
>> on PPC64 && CPU_LITTLE_ENDIAN and because CPU_LITTLE_ENDIAN implies
>> PPC_BOOK3S_64, there's no functional change here.
>
> That has nothing to do with the fix and should really be in a separate
> patch.
>
> I can split it when applying.

If you don't mind, that would be nice! Or simply drop that
chunk... Otherwise, let me know if I shall send a split v2 for this
patch [1/4] only.

Thanks,

Nicolai

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
HRB 21284 (AG Nürnberg)


Re: [PATCH v4 2/3] ASoC: add fsl_audmix DT binding documentation

2019-01-30 Thread Rob Herring
On Fri, Jan 25, 2019 at 11:43:16AM -0800, Nicolin Chen wrote:
> On Tue, Jan 22, 2019 at 11:14:27AM +, Viorel Suman wrote:
> > Add the DT binding documentation for NXP Audio Mixer
> > CPU DAI driver.
> > 
> > Signed-off-by: Viorel Suman 
> > ---
> >  .../devicetree/bindings/sound/fsl,audmix.txt   | 54 
> > ++
> >  1 file changed, 54 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/sound/fsl,audmix.txt
> > 
> > diff --git a/Documentation/devicetree/bindings/sound/fsl,audmix.txt 
> > b/Documentation/devicetree/bindings/sound/fsl,audmix.txt
> > new file mode 100644
> > index 000..45f807e
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/sound/fsl,audmix.txt
> > @@ -0,0 +1,54 @@
> > +NXP Audio Mixer (AUDMIX).
> > +
> > +The Audio Mixer is a on-chip functional module that allows mixing of two
> > +audio streams into a single audio stream. Audio Mixer has two input serial
> > +audio interfaces. These are driven by two Synchronous Audio interface
> > +modules (SAI). Each input serial interface carries 8 audio channels in its
> > +frame in TDM manner. Mixer mixes audio samples of corresponding channels
> > +from two interfaces into a single sample. Before mixing, audio samples of
> > +two inputs can be attenuated based on configuration. The output of the
> > +Audio Mixer is also a serial audio interface. Like input interfaces it has
> > +the same TDM frame format. This output is used to drive the serial DAC TDM
> > +interface of audio codec and also sent to the external pins along with the
> > +receive path of normal audio SAI module for readback by the CPU.
> > +
> > +The output of Audio Mixer can be selected from any of the three streams
> > + - serial audio input 1
> > + - serial audio input 2
> > + - mixed audio
> > +
> > +Mixing operation is independent of audio sample rate but the two audio
> > +input streams must have same audio sample rate with same number of channels
> > +in TDM frame to be eligible for mixing.
> > +
> > +Device driver required properties:
> > +=
> > +  - compatible : Compatible list, contains "fsl,imx8qm-audmix"
> > +
> > +  - reg: Offset and length of the register set for the 
> > device.
> > +
> > +  - clocks : Must contain an entry for each entry in clock-names.
> > +
> > +  - clock-names: Must include the "ipg" for register access.
> > +
> > +  - power-domains  : Must contain the phandle to AUDMIX power domain node
> > +
> > +  - dais   : Must contain a list of phandles to AUDMIX connected
> > + DAIs. The current implementation requires two phandles
> > + to SAI interfaces to be provided, the first SAI in the
> > + list being used to route the AUDMIX output.
> > +
> > +  - model  : Must contain machine driver name which will configure
> > + and instantiate the appropriate audio card.
> 
> I don't feel this is necessary for DT node. The driver should be
> able to figure out a proper name for the machine driver: it could
> use the compatible string to get "imx8qm-audmix" as the name like
> fsl_ssi; Or it could have built-in "imx-audmix" of_device_id data
> corresponding to "fsl,imx8qm-audmix" compatible string -- for any
> non-imx compatible in the future, use something else.
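
(As an illustration of that suggestion, and purely hypothetical rather than
part of the posted series: the platform driver could carry the machine
driver name in its OF match data instead of a DT property.)

	static const struct of_device_id fsl_audmix_ids[] = {
		{ .compatible = "fsl,imx8qm-audmix", .data = "imx-audmix" },
		{ /* sentinel */ }
	};

	/* in probe(): */
	const char *card_name = of_device_get_match_data(&pdev->dev);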

I tend to agree, but either way:

Rob Herring 


Re: [PATCH V2] powerpc/ptrace: Mitigate potential Spectre v1

2019-01-30 Thread Gustavo A. R. Silva



On 1/30/19 6:46 AM, Breno Leitao wrote:
> 'regno' is directly controlled by user space, hence leading to a potential
> exploitation of the Spectre variant 1 vulnerability.
> 
> On PTRACE_SETREGS and PTRACE_GETREGS requests, user space passes the
> register number that would be read or written. This register number is
> called 'regno' which is part of the 'addr' syscall parameter.
> 
> This 'regno' value is checked against the maximum pt_regs structure size,
> and then used to dereference it, which matches the initial part of a
> Spectre v1 (and Spectre v1.1) attack. The dereferenced value, then,
> is returned to userspace in the GETREGS case.
> 
> This patch sanitizes 'regno' before using it to dereference pt_reg.
> 
> Notice that given that speculation windows are large, the policy is
> to kill the speculation on the first load and not worry if it can be
> completed with a dependent load/store [1].
> 
> [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
> 
> Signed-off-by: Breno Leitao 

Acked-by: Gustavo A. R. Silva 

> ---
>  arch/powerpc/kernel/ptrace.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
> index cdd5d1d3ae41..7535f89e08cd 100644
> --- a/arch/powerpc/kernel/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -274,6 +275,8 @@ static int set_user_trap(struct task_struct *task, 
> unsigned long trap)
>   */
>  int ptrace_get_reg(struct task_struct *task, int regno, unsigned long *data)
>  {
> + unsigned int regs_max;
> +
>   if ((task->thread.regs == NULL) || !data)
>   return -EIO;
>  
> @@ -297,7 +300,9 @@ int ptrace_get_reg(struct task_struct *task, int regno, 
> unsigned long *data)
>   }
>  #endif
>  
> - if (regno < (sizeof(struct user_pt_regs) / sizeof(unsigned long))) {
> + regs_max = sizeof(struct user_pt_regs) / sizeof(unsigned long);
> + if (regno < regs_max) {
> + regno = array_index_nospec(regno, regs_max);
>   *data = ((unsigned long *)task->thread.regs)[regno];
>   return 0;
>   }
> @@ -321,6 +326,7 @@ int ptrace_put_reg(struct task_struct *task, int regno, 
> unsigned long data)
>   return set_user_dscr(task, data);
>  
>   if (regno <= PT_MAX_PUT_REG) {
> + regno = array_index_nospec(regno, PT_MAX_PUT_REG + 1);
>   ((unsigned long *)task->thread.regs)[regno] = data;
>   return 0;
>   }
> 


Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

2019-01-30 Thread Cédric Le Goater
On 1/30/19 7:20 AM, Paul Mackerras wrote:
> On Tue, Jan 29, 2019 at 02:47:55PM +0100, Cédric Le Goater wrote:
>> On 1/29/19 3:45 AM, Paul Mackerras wrote:
>>> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
 On 1/28/19 7:13 AM, Paul Mackerras wrote:
> Would we end up with too many VMAs if we just used mmap() to
> change the mappings from the software-generated pages to the
> hardware-generated interrupt pages?  
 The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
 dedicated to CPU IPIs and the remaining 4K are for devices. We can 
>>>
>>> Confused.  You say the number space has 32768 entries but then imply
>>> there are only 8K entries.  Do you mean that the API allows for 15-bit
>>> IRQ numbers but we are only making using of 8192 of them?
>>
>> Ouch. My bad. Let's do it again. 
>>
>> The sPAPR IRQ number space is 0x2000 wide :
>>
>> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396
>>
>> The first 4K are dedicated to the CPU IPIs and the remaining 4K are for 
>> devices (which can be extended if needed).
>>
>> So that's 8192 x 2 ESB pages.
>>
 extend the last range if needed as these are for MSIs. Dynamic 
 extensions under KVM should work also.

 This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
 lot of mmap(), too much. Also the KVM model needs to be compatible
>>>
>>> I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
>>> space would be covered by a single mmap, overlaid by subsequent mmaps
>>> where we need to map real device interrupts.
>>
>> ok. The same fault handler could be used to populate the VMA with the 
>> ESB pages. 
>>
>> But it would mean extra work on the QEMU side, which is not needed 
>> with this patch. 
> 
> Maybe, but just storing a single vma pointer in our private data is
> not a feasible approach.  First, you have no control on the lifetime
> of the vma and thus this is a use-after-free waiting to happen, and
> secondly, there could be multiple vmas that you need to worry about.

I fully agree. That's why I was uncomfortable with the solution. There 
are a few other drivers (GPUs if I recall) doing that but it feels wrong.

> Userspace could do multiple mmaps, or could do mprotect or similar on
> part of an existing mmap, which would split the vma for the mmap into
> multiple vmas.  You don't get notified about munmap either as far as I
> can tell, so the vma is liable to go away.  

yes ...

> And it doesn't matter if
> QEMU would never do such things; if userspace can provoke a
> use-after-free in the kernel using KVM then someone will write a
> specially crafted user program to do that.
> 
> So we either solve it in userspace, or we have to write and maintain
> complex kernel code with deep links into the MM subsystem.  
>
> I'd much rather solve it in userspace.

OK, then. I have been reluctant doing so but it seems there are no
other in-kernel solution. 

 with the QEMU emulated one and it was simpler to have one overall
 memory region for the IPI ESBs, one for the END ESBs (if we support
 that one day) and one for the TIMA.

> Are the necessary pages for a PCI
> passthrough device contiguous in both host real space 

 They should as they are the PHB4 ESBs.

> and guest real space ? 

 also. That's how we organized the mapping. 
>>>
>>> "How we organized the mapping" is a significant design decision that I
>>> haven't seen documented anywhere, and is really needed for
>>> understanding what's going on.
>>
>> OK. I will add comments on that. See below for some description.
>>
>> There is nothing fancy, it's simply indexed with the interrupt number,
>> like for HW, or for the QEMU XIVE emulated model.
>>
> If so we'd only need one mmap() for all the device's interrupt
> pages.

 Ah. So we would want to make a special case for the passthrough 
 device and have a mmap() and a memory region for its ESBs. Hmm.

 Wouldn't that require to hot plug a memory region under the guest ? 
>>>
>>> No; the way that a memory region works is that userspace can do
>>> whatever disparate mappings it likes within the region on the user
>>> process side, and the corresponding region of guest real address space
>>> follows that automatically.
>>
>> yes. I suppose this should work also for 'ram device' memory mappings.
>>
>> So when the passthrough device is added to the guest, we would add a 
>> new 'ram device' memory region for the device interrupt ESB pages 
>> that would overlap with the initial guest ESB pages.  
> 
> Not knowing the QEMU internals all that well, I don't at all
> understand why a new ram device is necesssary. 

'ram device' memory regions are of a special type which is used to 
directly map into the guest the result of a mmap() in QEMU. 

This is how we propagate the XIVE ESB pages from HW (and the TIMA) 
to the guest and the Li

Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode

2019-01-30 Thread Cédric Le Goater
On 1/30/19 6:40 AM, Paul Mackerras wrote:
> On Tue, Jan 29, 2019 at 02:51:05PM +0100, Cédric Le Goater wrote:
> Another general comment is that you seem to have written all this
> code assuming we are using HV KVM in a host running bare-metal.

 Yes. I didn't look at the other configurations. I thought that we could
 use the kernel_irqchip=off option to begin with. A couple of checks
 are indeed missing.
>>>
>>> Using kernel_irqchip=off would mean that we would not be able to use
>>> the in-kernel XICS emulation, which would have a performance impact.
>>
>> yes. But it is not supported today. Correct ? 
> 
> Not correct, it has been working for years, and works in v5.0-rc1 (I
> just tested it), at both L0 and L1.

Please see other email for the test is did.

>>> We need an explicit capability for XIVE exploitation that can be
>>> enabled or disabled on the qemu command line, so that we can enforce a
>>> uniform set of capabilities across all the hosts in a migration
>>> domain.  And it's no good to say we have the capability when all
>>> attempts to use it will fail.  Therefore the kernel needs to say that
>>> it doesn't have the capability in a PR KVM guest or in a nested HV
>>> guest.
>>
>> OK. I will work on adding a KVM_CAP_PPC_NESTED_IRQ_HV capability 
>> for future use.
> 
> That's not what I meant.  Why do we need that?  I meant that querying
> the new KVM_CAP_PPC_IRQ_XIVE capability should return 0 if we are in a
> guest.  It should only return 1 if we are running bare-metal on a P9.

ok. I guess I need to understand first how the nested guest uses the 
KVM IRQ device. That is a question in another email thread.   

> However, we could be using PR KVM (either in a bare-metal host or in a
> guest), or we could be doing nested HV KVM where we are using the
> kvm_hv module inside a KVM guest and using special hypercalls for
> controlling our guests.

 Yes. 

 It would be good to talk a little about the nested support (offline 
 may be) to make sure that we are not missing some major interface that 
 would require a lot of change. If we need to prepare ground, I think
 the timing is good.

 The size of the IRQ number space might be a problem. It seems we 
 would need to increase it considerably to support multiple nested 
 guests. That said I haven't look much how nested is designed.  
>>>
>>> The current design of nested HV is that the entire non-volatile state
>>> of all the nested guests is encapsulated within the state and
>>> resources of the L1 hypervisor.  That means that if the L1 hypervisor
>>> gets migrated, all of its guests go across inside it and there is no
>>> extra state that L0 needs to be aware of.  That would imply that the
>>> VP number space for the nested guests would need to come from within
>>> the VP number space for L1; but the amount of VP space we allocate to
>>> each guest doesn't seem to be large enough for that to be practical.
>>
>> If the KVM XIVE device had some information on the max number of CPUs 
>> provisioned for the guest, we could optimize the VP allocation.
> 
> The problem is that we might have 1000 guests running under L0, or we
> might have 1 guest running under L0 and 1000 guests running under it,
> and we have no way to know which situation to optimize for at the
> point where an L1 guest starts.  If we had an enormous VP space then
> we could just give each L1 guest a large amount of VP space and solve
> it that way; but we don't.

There are some ideas to increase our VP space size. Using multiblock 
per XIVE chip in skiboot is one I think. It's not an obvious change. 
Also, XIVE2 will add more bits to the NVT index so we will be free 
to allocate more at once when P10 is available.

On the same topic, may be we could move the VP allocator from skiboot
to KVM, allocate the full VP space at the KVM level and let KVM do 
the VP segmentation. 

Anyhow, I think that if we knew how many VPs we need to provision for
when the KVM XIVE device is created, we would make better use of the
available space. Shouldn't we?

Thanks,

C.



Re: [RESEND PATCH v3] cxl: Wrap iterations over afu slices inside 'afu_list_lock'

2019-01-30 Thread Sasha Levin
Hi,

[This is an automated email]

This commit has been processed because it contains a "Fixes:" tag,
fixing commit: .

The bot has tested the following trees: v4.20.5, v4.19.18, v4.14.96, v4.9.153, 
v4.4.172, v3.18.133.

v4.20.5: Build OK!
v4.19.18: Build OK!
v4.14.96: Build OK!
v4.9.153: Failed to apply! Possible dependencies:
bb81733de28c ("cxl: Fix error handling in 
_cxl_pci_associate_default_context()")

v4.4.172: Failed to apply! Possible dependencies:
14baf4d9c739 ("cxl: Add guest-specific code")
2b04cf310ba8 ("cxl: Rename some bare-metal specific functions")
5be587b11101 ("cxl: Introduce implementation-specific API")
73d55c3b59f7 ("cxl: IRQ allocation for guests")
8633186209e3 ("cxl: Move common code away from bare-metal-specific files")
cbffa3a5146a ("cxl: Separate bare-metal fields in adapter and AFU data 
structures")
d56d301b5174 ("cxl: Move bare-metal specific code to specialized files")
ea2d1f95efd7 ("cxl: Isolate a few bare-metal-specific calls")

v3.18.133: Failed to apply! Possible dependencies:
0520336afe5d ("cxl: Export file ops for use by API")
0b3f9c757cab ("cxl: Drop commands if the PCI channel is not in normal 
state")
14baf4d9c739 ("cxl: Add guest-specific code")
27d4dc7116ee ("cxl: Implement an ioctl to fetch afu card-id, offset-id and 
mode")
3d6b040e7338 ("cxl: sparse: Make declarations static")
406e12ec0b64 ("cxl: Cleanup Makefile")
456295e284be ("cxl: Fix leaking interrupts if attach process fails")
588b34be20bc ("cxl: Convert MMIO read/write macros to inline functions")
5be587b11101 ("cxl: Introduce implementation-specific API")
62fa19d4b4fd ("cxl: Add ability to reset the card")
6428832a7bfa ("cxl: Add cookie parameter to afu_release_irqs()")
6f7f0b3df6d4 ("cxl: Add AFU virtual PHB and kernel API")
73d55c3b59f7 ("cxl: IRQ allocation for guests")
7bb5d91a4dda ("cxl: Rework context lifetimes")
80c394fab896 ("cxl: Add explicit precision specifiers")
80fa93fce37d ("cxl: Name interrupts in /proc/interrupt")
8dde152ea348 ("cxl: fix leak of IRQ names in cxl_free_afu_irqs()")
95bc11bcd142 ("cxl: Add image control to sysfs")
9bcf28cdb28e ("cxl: Add tracepoints")
b087e6190ddc ("cxl: Export optional AFU configuration record in sysfs")
b12994fbfe93 ("cxl: cxl_afu_reset() -> __cxl_afu_reset()")
bc78b05bb412 ("cxl: Return error to PSL if IRQ demultiplexing fails & print 
clearer warning")
c358d84b4e57 ("cxl: Split afu_register_irqs() function")
cbffa3a5146a ("cxl: Separate bare-metal fields in adapter and AFU data 
structures")
de369538436a ("cxl: use more common format specifier")
e36f6fe1f7aa ("cxl: Export AFU error buffer via sysfs")
ea2d1f95efd7 ("cxl: Isolate a few bare-metal-specific calls")
eda3693c842e ("cxl: Rework detach context functions")


How should we proceed with this patch?

--
Thanks,
Sasha


Re: [PATCH v2 21/21] memblock: drop memblock_alloc_*_nopanic() variants

2019-01-30 Thread Petr Mladek
On Mon 2019-01-21 10:04:08, Mike Rapoport wrote:
> As all the memblock allocation functions return NULL in case of error
> rather than panic(), the duplicates with _nopanic suffix can be removed.
> 
> Signed-off-by: Mike Rapoport 
> Acked-by: Greg Kroah-Hartman 
> ---
>  arch/arc/kernel/unwind.c   |  3 +--
>  arch/sh/mm/init.c  |  2 +-
>  arch/x86/kernel/setup_percpu.c | 10 +-
>  arch/x86/mm/kasan_init_64.c| 14 --
>  drivers/firmware/memmap.c  |  2 +-
>  drivers/usb/early/xhci-dbc.c   |  2 +-
>  include/linux/memblock.h   | 35 ---
>  kernel/dma/swiotlb.c   |  2 +-
>  kernel/printk/printk.c |  9 +

For printk:
Reviewed-by: Petr Mladek 
Acked-by: Petr Mladek 

Best Regards,
Petr

>  mm/memblock.c  | 35 ---
>  mm/page_alloc.c| 10 +-
>  mm/page_ext.c  |  2 +-
>  mm/percpu.c| 11 ---
>  mm/sparse.c|  6 ++
>  14 files changed, 31 insertions(+), 112 deletions(-)
> 


Re: [PATCH] powerpc/mm: Add _PAGE_SAO to _PAGE_CACHE_CTL mask

2019-01-30 Thread Michael Ellerman
Reza Arbab  writes:

> On Tue, Jan 29, 2019 at 08:37:28PM +0530, Aneesh Kumar K.V wrote:
>>Not sure what the fix is about. We set the related hash pte flags via
>>
>>  if ((pteflags & _PAGE_CACHE_CTL) == _PAGE_TOLERANT)
>>  rflags |= HPTE_R_I;
>>  else if ((pteflags & _PAGE_CACHE_CTL) == _PAGE_NON_IDEMPOTENT)
>>  rflags |= (HPTE_R_I | HPTE_R_G);
>>  else if ((pteflags & _PAGE_CACHE_CTL) == _PAGE_SAO)
>>  rflags |= (HPTE_R_W | HPTE_R_I | HPTE_R_M);
>
> Again, nothing broken here, just a code readability thing. As Alexey 
> (and Charlie) noted, given the above it is a little confusing to define 
> _PAGE_CACHE_CTL this way:
>
>   #define _PAGE_CACHE_CTL  (_PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT)

Yeah that's confusing I agree.

It's not really a maintainability thing, because those bits are in the
architecture, so they can't change.

> I like Alexey's idea, maybe just use a literal?
>
>   #define _PAGE_CACHE_CTL 0x30

I prefer your original patch. It serves as documentation on what values
we expect to see in that field.
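
(For context, the patch being referred to presumably widens the mask to
cover all three cache-control flags, along these lines; quoted from the
thread's description rather than from the posted diff:)

	#define _PAGE_CACHE_CTL	(_PAGE_SAO | _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT)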

cheers


Re: [PATCH 0/4] powerpc/livepatch: reliable stack unwinder fixes

2019-01-30 Thread Jiri Kosina
On Wed, 30 Jan 2019, Michael Ellerman wrote:

> I'm happy to take it, unless there's some reason you'd rather it go via 
> the LP tree?

As this is more about reliable stack unwinding than live patching per se 
(which is only a user of that facility), I'd actually slightly prefer if 
it goes via your tree.

Thanks,

-- 
Jiri Kosina
SUSE Labs



Re: [RESEND PATCH v3] cxl: Wrap iterations over afu slices inside 'afu_list_lock'

2019-01-30 Thread Vaibhav Jain
Michael Ellerman  writes:

> FYI RESEND means you didn't change anything, but you sent the patch
> again for some other reason, like the Cc list was wrong or you thought
> it had been ignored.
>
> In this case you should have just sent a v4, updating the change log is
> a perfectly valid reason for a new version of the patch.
Thanks for pointing this out. Will take care of this in future.

>
> I've applied it, no need to RESEND ;)
Thanks.

-- 
Vaibhav Jain 
Linux Technology Center, IBM India Pvt. Ltd.



Re: [RFC PATCH] powerpc: fix get_arch_dma_ops() for NTB devices

2019-01-30 Thread Michael Ellerman
Alexander Fomichev  writes:

> get_dma_ops() falls into the arch-dependent get_arch_dma_ops(), which
> historically returns NULL on PowerPC. Therefore dma_set_mask() fails.
> This affects Switchtec (and probably other) NTB devices, which then fail
> to initialize.

What's an NTB device?

drivers/ntb I assume?

So it's a PCI device of some sort, but presumably the device you're
calling dma_set_mask() on is an NTB device not a PCI device?

But then it works if you tell it to use the PCI DMA ops?

At the very least the code should be checking for the NTB bus type and
only returning the PCI ops in that specific case, not for all devices.

cheers


[PATCH V2] powerpc/ptrace: Mitigate potential Spectre v1

2019-01-30 Thread Breno Leitao
'regno' is directly controlled by user space, hence leading to a potential
exploitation of the Spectre variant 1 vulnerability.

On PTRACE_SETREGS and PTRACE_GETREGS requests, user space passes the
register number that would be read or written. This register number is
called 'regno' which is part of the 'addr' syscall parameter.

This 'regno' value is checked against the maximum pt_regs structure size,
and then used to dereference it, which matches the initial part of a
Spectre v1 (and Spectre v1.1) attack. The dereferenced value, then,
is returned to userspace in the GETREGS case.

This patch sanitizes 'regno' before using it to dereference pt_reg.

Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
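
For readers unfamiliar with the helper: array_index_nospec(index, size)
returns the index unchanged when it is in bounds and forces it to 0
otherwise, even under speculative execution, so the dependent load cannot
be steered outside the array. A generic illustration (not taken from this
patch):

	if (idx < ARRAY_SIZE(table)) {
		/* Under speculation idx may still be out of range; the
		 * helper masks it to 0 in that case.
		 */
		idx = array_index_nospec(idx, ARRAY_SIZE(table));
		val = table[idx];
	}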

Signed-off-by: Breno Leitao 
---
 arch/powerpc/kernel/ptrace.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index cdd5d1d3ae41..7535f89e08cd 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -274,6 +275,8 @@ static int set_user_trap(struct task_struct *task, unsigned 
long trap)
  */
 int ptrace_get_reg(struct task_struct *task, int regno, unsigned long *data)
 {
+   unsigned int regs_max;
+
if ((task->thread.regs == NULL) || !data)
return -EIO;
 
@@ -297,7 +300,9 @@ int ptrace_get_reg(struct task_struct *task, int regno, 
unsigned long *data)
}
 #endif
 
-   if (regno < (sizeof(struct user_pt_regs) / sizeof(unsigned long))) {
+   regs_max = sizeof(struct user_pt_regs) / sizeof(unsigned long);
+   if (regno < regs_max) {
+   regno = array_index_nospec(regno, regs_max);
*data = ((unsigned long *)task->thread.regs)[regno];
return 0;
}
@@ -321,6 +326,7 @@ int ptrace_put_reg(struct task_struct *task, int regno, 
unsigned long data)
return set_user_dscr(task, data);
 
if (regno <= PT_MAX_PUT_REG) {
+   regno = array_index_nospec(regno, PT_MAX_PUT_REG + 1);
((unsigned long *)task->thread.regs)[regno] = data;
return 0;
}
-- 
2.19.0



Re: [RESEND PATCH v3] cxl: Wrap iterations over afu slices inside 'afu_list_lock'

2019-01-30 Thread Michael Ellerman
Vaibhav Jain  writes:

> Within the cxl module, iteration over the array 'adapter->afu' may be racy
> at a few points, as it might be simultaneously read during an EEH and its
> contents being set to NULL while the driver is being unloaded or unbound
> from the adapter. This might result in a NULL pointer to 'struct afu'
> being de-referenced during an EEH thereby causing a kernel oops.
>
> This patch fixes this by making sure that all access to the array
> 'adapter->afu' is wrapped within the context of spin-lock
> 'adapter->afu_list_lock'.
>
> Cc: sta...@vger.kernel.org
> Fixes: 9e8df8a2196("cxl: EEH support")
> Acked-by: Andrew Donnellan 
> Acked-by: Frederic Barrat 
> Acked-by: Christophe Lombard 
> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
>
> Resend:
> * Fixed the reference to 'adapter->afu' in patch description. [Andrew]
> * Added the 'Fixes' tag and marked the patch to stable

FYI RESEND means you didn't change anything, but you sent the patch
again for some other reason, like the Cc list was wrong or you thought
it had been ignored.

In this case you should have just sent a v4, updating the change log is
a perfectly valid reason for a new version of the patch.

I've applied it, no need to RESEND ;)

cheers


Re: [PATCH v2] powerpc/traps: fix the message printed when stack overflows

2019-01-30 Thread Michael Ellerman
Christophe Leroy  writes:

> Today's message is useless:
>
> [   42.253267] Kernel stack overflow in process (ptrval), r1=c65500b0
>
> This patch fixes it:
>
> [   66.905235] Kernel stack overflow in process sh[356], r1=c65560b0
>
> Fixes: ad67b74d2469 ("printk: hash addresses printed with %p")
> Cc: 
> Signed-off-by: Christophe Leroy 
> ---
>  v2: make checkpatch happy :)
>
>  arch/powerpc/kernel/traps.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index 5e917a84f949..a3dc6872d5aa 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -1535,8 +1535,8 @@ void alignment_exception(struct pt_regs *regs)
>  
>  void StackOverflow(struct pt_regs *regs)
>  {
> - printk(KERN_CRIT "Kernel stack overflow in process %p, r1=%lx\n",
> -current, regs->gpr[1]);
> + pr_crit("Kernel stack overflow in process %s[%d], r1=%lx\n",
> + current->comm, current->pid, regs->gpr[1]);

I think you're meant to use task_pid_nr(current) these days.

I fixed it up when applying.
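
(Presumably the applied version reads roughly as follows; illustrative,
based on the suggestion above rather than on the final commit:)

	pr_crit("Kernel stack overflow in process %s[%d], r1=%lx\n",
		current->comm, task_pid_nr(current), regs->gpr[1]);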

cheers



Re: [PATCH] powerpc/pseries: Perform full re-add of CPU for topology update post-migration

2019-01-30 Thread Michael Ellerman
Michael Bringmann  writes:
> On 1/29/19 3:37 AM, Michael Ellerman wrote:
>> Michael Bringmann  writes:
>> 
>>> On 10/29/18 1:43 PM, Nathan Fontenot wrote:
 On pseries systems, performing a partition migration can result in
 altering the nodes a CPU is assigned to on the destination system. For
 example, pre-migration on the source system CPUs are in nodes 1 and 3,
 post-migration on the destination system CPUs are in nodes 2 and 3.

 Handling the node change for a CPU can cause corruption in the slab
 cache if we hit a timing where a CPUs node is changed while cache_reap()
 is invoked. The corruption occurs because the slab cache code appears
 to rely on the CPU and slab cache pages being on the same node.

 The current dynamic updating of a CPUs node done in arch/powerpc/mm/numa.c
 does not prevent us from hitting this scenario.

 Changing the device tree property update notification handler that
 recognizes an affinity change for a CPU to do a full DLPAR remove and
 add of the CPU instead of dynamically changing its node resolves this
 issue.

 Signed-off-by: Nathan Fontenot 
>>> Signed-off-by: Michael W. Bringmann 
>
> Tested-by: Michael W. Bringmann 

Thanks.

cheers


[PATCH] scsi: cxlflash: Prevent deadlock when adapter probe fails

2019-01-30 Thread Vaibhav Jain


Presently when an error is encountered during probe of the cxlflash
adapter, a deadlock is seen with a CPU thread stuck inside
cxlflash_remove(). Below is the trace of the deadlock as logged by
khungtaskd:

cxlflash 0006:00:00.0: cxlflash_probe: init_afu failed rc=-16
INFO: task kworker/80:1:890 blocked for more than 120 seconds.
   Not tainted 5.0.0-rc4-capi2-kexec+ #2
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/80:1D0   890  2 0x0808
Workqueue: events work_for_cpu_fn

Call Trace:
 0x4d72136320 (unreliable)
 __switch_to+0x2cc/0x460
 __schedule+0x2bc/0xac0
 schedule+0x40/0xb0
 cxlflash_remove+0xec/0x640 [cxlflash]
 cxlflash_probe+0x370/0x8f0 [cxlflash]
 local_pci_probe+0x6c/0x140
 work_for_cpu_fn+0x38/0x60
 process_one_work+0x260/0x530
 worker_thread+0x280/0x5d0
 kthread+0x1a8/0x1b0
 ret_from_kernel_thread+0x5c/0x80
INFO: task systemd-udevd:5160 blocked for more than 120 seconds.

The deadlock occurs as cxlflash_remove() is called from
cxlflash_probe() without setting 'cxlflash_cfg->state' to STATE_PROBED
and the probe thread starts to wait on
'cxlflash_cfg->reset_waitq'. Since the device was never successfully
probed the 'cxlflash_cfg->state' never changes from STATE_PROBING
hence the deadlock occurs.
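
To illustrate the blocked path (a simplified sketch, not the exact driver
code): cxlflash_remove() yields to in-flight probe/recovery activity with
a wait of roughly this shape, which can never be satisfied when probe
bails out while the state is still STATE_PROBING:

	/* Simplified sketch of the wait performed early in cxlflash_remove() */
	wait_event(cfg->reset_waitq, cfg->state != STATE_RESET &&
				     cfg->state != STATE_PROBING);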

We fix this deadlock by setting the variable 'cxlflash_cfg->state' to
STATE_PROBED in case an error occurs during cxlflash_probe() and just
before calling cxlflash_remove().

Cc: sta...@vger.kernel.org
Fixes: c21e0bbfc485("cxlflash: Base support for IBM CXL Flash Adapter")
Signed-off-by: Vaibhav Jain 
---
 drivers/scsi/cxlflash/main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/scsi/cxlflash/main.c b/drivers/scsi/cxlflash/main.c
index bfa13e3b191c..c8bad2c093b8 100644
--- a/drivers/scsi/cxlflash/main.c
+++ b/drivers/scsi/cxlflash/main.c
@@ -3687,6 +3687,7 @@ static int cxlflash_probe(struct pci_dev *pdev,
host->max_cmd_len = CXLFLASH_MAX_CDB_LEN;
 
cfg = shost_priv(host);
+   cfg->state = STATE_PROBING;
cfg->host = host;
rc = alloc_mem(cfg);
if (rc) {
@@ -3775,6 +3776,7 @@ static int cxlflash_probe(struct pci_dev *pdev,
return rc;
 
 out_remove:
+   cfg->state = STATE_PROBED;
cxlflash_remove(pdev);
goto out;
 }
-- 
2.20.1



Re: [PATCH 1/4] powerpc/64s: Clear on-stack exception marker upon exception return

2019-01-30 Thread Michael Ellerman
Joe Lawrence  writes:
> From: Nicolai Stange 
>
> The ppc64 specific implementation of the reliable stacktracer,
> save_stack_trace_tsk_reliable(), bails out and reports an "unreliable
> trace" whenever it finds an exception frame on the stack. Stack frames
> are classified as exception frames if the STACK_FRAME_REGS_MARKER magic,
> as written by exception prologues, is found at a particular location.
>
> However, as observed by Joe Lawrence, it is possible in practice that
> non-exception stack frames can alias with prior exception frames and thus,
> that the reliable stacktracer can find a stale STACK_FRAME_REGS_MARKER on
> the stack. It in turn falsely reports an unreliable stacktrace and blocks
> any live patching transition to finish. Said condition lasts until the
> stack frame is overwritten/initialized by function call or other means.
>
> In principle, we could mitigate this by making the exception frame
> classification condition in save_stack_trace_tsk_reliable() stronger:
> in addition to testing for STACK_FRAME_REGS_MARKER, we could also take into
> account that for all exceptions executing on the kernel stack
> - their stack frames's backlink pointers always match what is saved
>   in their pt_regs instance's ->gpr[1] slot and that
> - their exception frame size equals STACK_INT_FRAME_SIZE, a value
>   uncommonly large for non-exception frames.
>
> However, while these are currently true, relying on them would make the
> reliable stacktrace implementation more sensitive towards future changes in
> the exception entry code. Note that false negatives, i.e. not detecting
> exception frames, would silently break the live patching consistency model.
>
> Furthermore, certain other places (diagnostic stacktraces, perf, xmon)
> rely on STACK_FRAME_REGS_MARKER as well.
>
> Make the exception exit code clear the on-stack STACK_FRAME_REGS_MARKER
> for those exceptions running on the "normal" kernel stack and returning
> to kernelspace: because the topmost frame is ignored by the reliable stack
> tracer anyway, returns to userspace don't need to take care of clearing
> the marker.
>
> Furthermore, as I don't have the ability to test this on Book 3E or
> 32 bits, limit the change to Book 3S and 64 bits.
>
> Finally, make the HAVE_RELIABLE_STACKTRACE Kconfig option depend on
> PPC_BOOK3S_64 for documentation purposes. Before this patch, it depended
> on PPC64 && CPU_LITTLE_ENDIAN and because CPU_LITTLE_ENDIAN implies
> PPC_BOOK3S_64, there's no functional change here.

That has nothing to do with the fix and should really be in a separate
patch.

I can split it when applying.

cheers

> Fixes: df78d3f61480 ("powerpc/livepatch: Implement reliable stack tracing for 
> the consistency model")
> Reported-by: Joe Lawrence 
> Signed-off-by: Nicolai Stange 
> Signed-off-by: Joe Lawrence 
> ---
>  arch/powerpc/Kconfig   | 2 +-
>  arch/powerpc/kernel/entry_64.S | 7 +++
>  2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 2890d36eb531..73bf87b1d274 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -220,7 +220,7 @@ config PPC
>   select HAVE_PERF_USER_STACK_DUMP
>   select HAVE_RCU_TABLE_FREE  if SMP
>   select HAVE_REGS_AND_STACK_ACCESS_API
> - select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
> + select HAVE_RELIABLE_STACKTRACE if PPC_BOOK3S_64 && 
> CPU_LITTLE_ENDIAN
>   select HAVE_SYSCALL_TRACEPOINTS
>   select HAVE_VIRT_CPU_ACCOUNTING
>   select HAVE_IRQ_TIME_ACCOUNTING
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 435927f549c4..a2c168b395d2 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -1002,6 +1002,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
>   ld  r2,_NIP(r1)
>   mtspr   SPRN_SRR0,r2
>  
> + /*
> +  * Leaving a stale exception_marker on the stack can confuse
> +  * the reliable stack unwinder later on. Clear it.
> +  */
> + li  r2,0
> + std r2,STACK_FRAME_OVERHEAD-16(r1)
> +
>   ld  r0,GPR0(r1)
>   ld  r2,GPR2(r1)
>   ld  r3,GPR3(r1)
> -- 
> 2.20.1


Re: [PATCH 0/4] powerpc/livepatch: reliable stack unwinder fixes

2019-01-30 Thread Michael Ellerman
Jiri Kosina  writes:
> On Tue, 22 Jan 2019, Joe Lawrence wrote:
>
>> This patchset fixes a false negative report (ie, unreliable) from the
>> ppc64 reliable stack unwinder, discussed here [1] when it may
>> inadvertently trip over a stale exception marker left on the stack.
>> 
>> The first two patches fix this bug.  Nicolai's change clears the marker
>> from the stack when an exception is finished.  The next patch modifies
>> the unwinder to only look for such on stack elements when the ABI
>> guarantees that they will actually be initialized.
>> 
>> The final two patches consist of code cleanups that Nicolai and I
>> spotted during the development of the fixes.
>> 
>> Testing included re-running the original test scenario (loading a
>> livepatch module on ppc64le) on a 5.0.0-rc2 kernel as well as a RHEL-7
>> backport.  I ran internal tests on the RHEL-7 backport and no new test
>> failures were introduced.  I believe that Nicolai has done the same
>> with respect to the first patch.
>> 
>> [1] 
>> https://lore.kernel.org/lkml/7f468285-b149-37e2-e782-c9e538b99...@redhat.com/
>> 
>> Joe Lawrence (3):
>>   powerpc/livepatch: relax reliable stack tracer checks for first-frame
>>   powerpc/livepatch: small cleanups in save_stack_trace_tsk_reliable()
>>   powerpc/livepatch: return -ERRNO values in
>> save_stack_trace_tsk_reliable()
>> 
>> Nicolai Stange (1):
>>   powerpc/64s: Clear on-stack exception marker upon exception return
>
> Michael, are you fine with this going through LP tree, or do you plan to 
> take it through yours?

I'm happy to take it, unless there's some reason you'd rather it go via
the LP tree?

I don't have any automated live patch tests, but I assume if it's in
linux-next someone can test it? :)

cheers


Re: Does SMP work at all on 40x ?

2019-01-30 Thread Michael Ellerman
Christophe Leroy  writes:

> In transfer_to_handler() (entry_32.S), we have:
>
> #if defined(CONFIG_40x) || defined(CONFIG_BOOKE)
> ...
> #ifdef CONFIG_SMP
>   CURRENT_THREAD_INFO(r9, r1)
>   lwz r9,TI_CPU(r9)
>   slwir9,r9,3
>   add r11,r11,r9
> #endif
> #endif
>
> When running this piece of code, MMU translation is off. But r9 contains 
> the virtual addr of current_thread_info, so unless I miss something, 
> this cannot work on the 40x, can it ?
>
> On CONFIG_BOOKE it works because phys addr = virt addr

AFAIK 40x can't be SMP:

  config SMP
depends on PPC_BOOK3S || PPC_BOOK3E || FSL_BOOKE || PPC_47x


But this stuff is all before my time.

The commit that added the SMP block was clearly only meant for BookE:

  4eaddb4d7ec3 ("[POWERPC] Make Book-E debug handling SMP safe")

cheers


Re: [PATCH V7 3/4] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_do_alloc

2019-01-30 Thread Michael Ellerman
"Aneesh Kumar K.V"  writes:

> The current code doesn't do page migration if the page allocated is a compound
> page. With HugeTLB migration support, we can end up allocating hugetlb pages
> from CMA region. Also, THP pages can be allocated from CMA region. This patch
> updates the code to handle compound pages correctly. The patch also switches to
> a single get_user_pages with the right count, instead of doing one
> get_user_pages per page. That avoids reading page table multiple times.

It's not very obvious from the above description that the migration
logic is now being done by get_user_pages_longterm(); it just looks like
it's all being deleted in this patch. It would be good to mention that.
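
A rough sketch of what the single long-term GUP presumably ends up looking
like (the relevant hunk isn't quoted below, so details such as the unlock_exit
label are assumptions, not the literal patch):

	down_read(&mm->mmap_sem);
	ret = get_user_pages_longterm(ua, entries, FOLL_WRITE,
				      mem->hpages, NULL /* vmas */);
	up_read(&mm->mmap_sem);
	if (ret != entries) {
		/* drop whatever references were taken before failing */
		for (i = 0; i < ret; ++i)
			put_page(mem->hpages[i]);
		ret = -EFAULT;
		goto unlock_exit;
	}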

> Since these page reference updates are long term pin, switch to
> get_user_pages_longterm. That makes sure we fail correctly if the guest RAM
> is backed by DAX pages.

Can you explain that in more detail?

> The patch also converts the hpas member of mm_iommu_table_group_mem_t to a
> union. We use the same storage location to store pointers to struct page. We
> cannot update all the code paths to use struct page *, because we access hpas
> in real mode and we can't do that struct page * to pfn conversion in real mode.

That's a pain, it's asking for bugs mixing two different values in the
same array. But I guess it's the least worst option.

It sounds like that's a separate change you could do in a separate
patch. But it's not, because it's tied to the fact that we're doing a
single GUP call.
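
To make that coupling concrete, the in-place conversion enabled by the single
GUP call presumably looks something like this sketch (assumed, not quoted from
the patch): GUP fills the array as hpages[], then each slot is overwritten with
the physical address so real mode can read the same storage as hpas[].

	for (i = 0; i < entries; ++i) {
		struct page *page = mem->hpages[i];

		/* same vmalloc'ed slot, reused for the real-mode view */
		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
	}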


> diff --git a/arch/powerpc/mm/mmu_context_iommu.c 
> b/arch/powerpc/mm/mmu_context_iommu.c
> index a712a650a8b6..f11a2f15071f 100644
> --- a/arch/powerpc/mm/mmu_context_iommu.c
> +++ b/arch/powerpc/mm/mmu_context_iommu.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static DEFINE_MUTEX(mem_list_mutex);
>  
> @@ -34,8 +35,18 @@ struct mm_iommu_table_group_mem_t {
>   atomic64_t mapped;
>   unsigned int pageshift;
>   u64 ua; /* userspace address */
> - u64 entries;/* number of entries in hpas[] */
> - u64 *hpas;  /* vmalloc'ed */
> + u64 entries;/* number of entries in hpas/hpages[] */
> + /*
> +  * in mm_iommu_get we temporarily use this to store
> +  * struct page address.
> +  *
> +  * We need to convert ua to hpa in real mode. Make it
> +  * simpler by storing physical address.
> +  */
> + union {
> + struct page **hpages;   /* vmalloc'ed */
> + phys_addr_t *hpas;
> + };
>  #define MM_IOMMU_TABLE_INVALID_HPA   ((uint64_t)-1)
>   u64 dev_hpa;/* Device memory base address */
>  };
> @@ -80,64 +91,15 @@ bool mm_iommu_preregistered(struct mm_struct *mm)
>  }
>  EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
>  
> -/*
> - * Taken from alloc_migrate_target with changes to remove CMA allocations
> - */
> -struct page *new_iommu_non_cma_page(struct page *page, unsigned long private)
> -{
> - gfp_t gfp_mask = GFP_USER;
> - struct page *new_page;
> -
> - if (PageCompound(page))
> - return NULL;
> -
> - if (PageHighMem(page))
> - gfp_mask |= __GFP_HIGHMEM;
> -
> - /*
> -  * We don't want the allocation to force an OOM if possibe
> -  */
> - new_page = alloc_page(gfp_mask | __GFP_NORETRY | __GFP_NOWARN);
> - return new_page;
> -}
> -
> -static int mm_iommu_move_page_from_cma(struct page *page)
> -{
> - int ret = 0;
> - LIST_HEAD(cma_migrate_pages);
> -
> - /* Ignore huge pages for now */
> - if (PageCompound(page))
> - return -EBUSY;
> -
> - lru_add_drain();
> - ret = isolate_lru_page(page);
> - if (ret)
> - return ret;
> -
> - list_add(&page->lru, &cma_migrate_pages);
> - put_page(page); /* Drop the gup reference */
> -
> - ret = migrate_pages(&cma_migrate_pages, new_iommu_non_cma_page,
> - NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE);
> - if (ret) {
> - if (!list_empty(&cma_migrate_pages))
> - putback_movable_pages(&cma_migrate_pages);
> - }
> -
> - return 0;
> -}
> -
>  static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
> - unsigned long entries, unsigned long dev_hpa,
> - struct mm_iommu_table_group_mem_t **pmem)
> +   unsigned long entries, unsigned long dev_hpa,
> +   struct mm_iommu_table_group_mem_t **pmem)
>  {
>   struct mm_iommu_table_group_mem_t *mem;
> - long i, j, ret = 0, locked_entries = 0;
> + long i, ret = 0, locked_entries = 0;

I'd prefer we didn't initialise ret here.

>   unsigned int pageshift;
>   unsigned long flags;
>   unsigned long cur_ua;
> - struct page *page = NULL;
>  
>   mutex_lock(&mem_list_mutex);
>  
> @@ -187,41 +149,27 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> unsigned long u

[PATCH v2 1/2] dt-bindings: soc: fsl: Document Qixis FPGA usage

2019-01-30 Thread Pankaj Bansal
An FPGA-based system controller, called “Qixis”, manages several critical
system features, including:
• Reset sequencing
• Power supply configuration
• Board configuration
• Hardware configuration

The qixis registers are accessible over one or more system-specific
interfaces, typically I2C, JTAG or an embedded processor.

Signed-off-by: Pankaj Bansal 
---

Notes:
V2:
- No change

 .../bindings/soc/fsl/qixis_ctrl.txt  | 33 ++
 1 file changed, 33 insertions(+)

diff --git a/Documentation/devicetree/bindings/soc/fsl/qixis_ctrl.txt 
b/Documentation/devicetree/bindings/soc/fsl/qixis_ctrl.txt
new file mode 100644
index ..bc2950cab71d
--- /dev/null
+++ b/Documentation/devicetree/bindings/soc/fsl/qixis_ctrl.txt
@@ -0,0 +1,33 @@
+* QIXIS FPGA block
+
+an FPGA-based system controller, called “Qixis”, which
+manages several critical system features, including:
+• Configuration switch monitoring
+• Power on/off sequencing
+• Reset sequencing
+• Power supply configuration
+• Board configuration
+• hardware configuration
+• Background power data collection (DCM)
+• Fault monitoring
+• RCW bypass SRAM (replace flash RCW with internal RCW) (NOR only)
+• Dedicated functional validation blocks (POSt/IRS, triggered event, and so on)
+• I2C master for remote board control even with no DUT available
+
+The qixis registers are accessible over one or more system-specific interfaces,
+typically I2C, JTAG or an embedded processor.
+
+Required properties:
+
+ - compatible : string, must contain "fsl,fpga-qixis-i2c"
+ - reg : i2c address of the qixis device.
+
+Examples:
+   /* The FPGA node */
+fpga@66 {
+   compatible = "fsl,lx2160aqds-fpga", "fsl,fpga-qixis-i2c";
+   reg = <0x66>;
+   #address-cells = <1>;
+   #size-cells = <0>;
+   }
+
-- 
2.17.1



[PATCH v2 0/2] add i2c controlled qixis driver

2019-01-30 Thread Pankaj Bansal
The FPGA on LX2160AQDS/LX2160ARDB is connected on the I2C bus, so add a qixis
driver, which is basically an i2c client driver to control the FPGA.

This driver is essential to control MDIO mux multiplexing.

This driver is dependent on below patches:
https://www.mail-archive.com/netdev@vger.kernel.org/msg281274.html

V1 of this series:
https://patchwork.kernel.org/cover/10627297/

Cc: Varun Sethi 

Pankaj Bansal (2):
  dt-bindings: soc: fsl: Document Qixis FPGA usage
  fsl: add i2c controlled qixis driver

 .../bindings/soc/fsl/qixis_ctrl.txt   |  33 +
 drivers/soc/fsl/Kconfig   |   8 ++
 drivers/soc/fsl/Makefile  |   1 +
 drivers/soc/fsl/qixis_ctrl.c  | 121 ++
 4 files changed, 163 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/qixis_ctrl.txt
 create mode 100644 drivers/soc/fsl/qixis_ctrl.c

-- 
2.17.1



[PATCH v2 2/2] fsl: add i2c controlled qixis driver

2019-01-30 Thread Pankaj Bansal
The FPGA on LX2160AQDS/LX2160ARDB is connected on the I2C bus, so add a qixis
driver, which is basically an i2c client driver to control the FPGA.

Signed-off-by: Wang Dongsheng 
Signed-off-by: Pankaj Bansal 
---

Notes:
V2:
- Modify the driver to not create platform devices corresponding to subnodes,
  because the subnodes are not actual devices.
- Use mdio_mux_regmap_init/mdio_mux_regmap_uninit
- Remove header file from include folder, as no qixis api is called from outside
- Add regmap_exit in driver's remove function
Dependencies:
- https://www.mail-archive.com/netdev@vger.kernel.org/msg281274.html

 drivers/soc/fsl/Kconfig  |   8 +++
 drivers/soc/fsl/Makefile |   1 +
 drivers/soc/fsl/qixis_ctrl.c | 121 +
 3 files changed, 130 insertions(+)

diff --git a/drivers/soc/fsl/Kconfig b/drivers/soc/fsl/Kconfig
index 8a0ba6da6f8a..e787d3e637a4 100644
--- a/drivers/soc/fsl/Kconfig
+++ b/drivers/soc/fsl/Kconfig
@@ -36,6 +36,14 @@ config FSL_SLEEP_FSM
  The FSM is used to finish clean-ups at the last stage of system entering deep
  sleep, and also wakes up system when a wake up event happens.
 
+config FSL_QIXIS
+   tristate "QIXIS system controller driver"
+   depends on REGMAP_I2C
+   default n
+   help
+ Say y here to enable QIXIS system controller api. The qixis driver
+ provides FPGA functions to control system.
+
 if ARM || ARM64
 source "drivers/soc/fsl/Kconfig.arm"
 endif
diff --git a/drivers/soc/fsl/Makefile b/drivers/soc/fsl/Makefile
index ffdf93b34501..db7b09bfbd91 100644
--- a/drivers/soc/fsl/Makefile
+++ b/drivers/soc/fsl/Makefile
@@ -5,6 +5,7 @@
 obj-$(CONFIG_FSL_DPAA) += qbman/
 obj-$(CONFIG_QUICC_ENGINE) += qe/
 obj-$(CONFIG_CPM)  += qe/
+obj-$(CONFIG_FSL_QIXIS)+= qixis_ctrl.o
 obj-$(CONFIG_FSL_GUTS) += guts.o
 obj-$(CONFIG_FSL_MC_DPIO)  += dpio/
 obj-$(CONFIG_FSL_LS2_CONSOLE)  += ls2-console/
diff --git a/drivers/soc/fsl/qixis_ctrl.c b/drivers/soc/fsl/qixis_ctrl.c
new file mode 100644
index ..336f366e228e
--- /dev/null
+++ b/drivers/soc/fsl/qixis_ctrl.c
@@ -0,0 +1,121 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+/* Freescale QIXIS system controller driver.
+ *
+ * Copyright 2015 Freescale Semiconductor, Inc.
+ * Copyright 2018 NXP
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* QIXIS MAP */
+struct fsl_qixis_regs {
+   u8  id; /* Identification Registers */
+   u8  version;/* Version Register */
+   u8  qixis_ver;  /* QIXIS Version Register */
+   u8  reserved1[0x1f];
+};
+
+struct mdio_mux_data {
+   void*data;
+   struct list_headlink;
+};
+
+struct qixis_priv {
+   struct regmap   *regmap;
+   struct list_headmdio_mux_list;
+};
+
+static struct regmap_config qixis_regmap_config = {
+   .reg_bits = 8,
+   .val_bits = 8,
+};
+
+static int fsl_qixis_i2c_probe(struct i2c_client *client)
+{
+   struct device_node *child;
+   struct qixis_priv *priv;
+   int ret;
+   u32 qver;
+   struct mdio_mux_data *mux_data;
+
+   if (!i2c_check_functionality(client->adapter, I2C_FUNC_SMBUS_BYTE_DATA))
+   return -EOPNOTSUPP;
+
+   priv = devm_kzalloc(&client->dev, sizeof(struct qixis_priv),
+   GFP_KERNEL);
+   if (!priv)
+   return -ENOMEM;
+
+   priv->regmap = regmap_init_i2c(client, &qixis_regmap_config);
+   INIT_LIST_HEAD(&priv->mdio_mux_list);
+
+   regmap_read(priv->regmap, offsetof(struct fsl_qixis_regs, qixis_ver),
+   &qver);
+   pr_info("Freescale QIXIS Version: 0x%08x\n", qver);
+
+   for_each_child_of_node(client->dev.of_node, child) {
+   if (of_node_name_prefix(child, "mdio-mux")) {
+   mux_data = devm_kzalloc(&client->dev,
+   sizeof(struct mdio_mux_data),
+   GFP_KERNEL);
+   if (!mux_data)
+   return -ENOMEM;
+   ret = mdio_mux_regmap_init(&client->dev, child,
+  &mux_data->data);
+   if (ret)
+   return ret;
+   list_add(&mux_data->link, &priv->mdio_mux_list);
+   }
+   };
+
+   i2c_set_clientdata(client, priv);
+
+   return 0;
+}
+
+static int fsl_qixis_i2c_remove(struct i2c_client *client)
+{
+   struct qixis_priv *priv;
+   struct list_head *pos;
+   struct mdio_mux_data *mux_data;
+
+   priv = i2c_get_clientdata(client);
+   list_for_each(pos, &priv->mdio_mux_list) {
+   mux_data =

Re: [PATCH V5 5/5] arch/powerpc/mm/hugetlb: NestMMU workaround for hugetlb mprotect RW upgrade

2019-01-30 Thread Michael Ellerman
"Aneesh Kumar K.V"  writes:

> NestMMU requires us to mark the pte invalid and flush the tlb when we do a
> RW upgrade of pte. We fixed a variant of this in the fault path in commit
> Fixes: bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to handle nest MMU hang")

Odd "Fixes:" again.

> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/hugetlb.h | 12 ++
>  arch/powerpc/mm/hugetlbpage-hash64.c | 25 
>  arch/powerpc/mm/hugetlbpage-radix.c  | 17 +
>  3 files changed, 54 insertions(+)

Same comment about inlining.

But otherwise looks good.

Reviewed-by: Michael Ellerman 

cheers


Re: [PATCH V5 4/5] mm/hugetlb: Add prot_modify_start/commit sequence for hugetlb update

2019-01-30 Thread Michael Ellerman
"Aneesh Kumar K.V"  writes:

> Architectures like ppc64 require to do a conditional tlb flush based on the
> old and new value of pte. Follow the regular pte change protection sequence for
> hugetlb too. This allows the architectures to override the update sequence.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  include/linux/hugetlb.h | 20 
>  mm/hugetlb.c|  8 +---
>  2 files changed, 25 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 087fd5f48c91..39e78b80375c 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -543,6 +543,26 @@ static inline void set_huge_swap_pte_at(struct mm_struct 
> *mm, unsigned long addr
>   set_huge_pte_at(mm, addr, ptep, pte);
>  }
>  #endif
> +
> +#ifndef huge_ptep_modify_prot_start
> +#define huge_ptep_modify_prot_start huge_ptep_modify_prot_start
> +static inline pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep)
> +{
> + return huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
> +}
> +#endif
> +
> +#ifndef huge_ptep_modify_prot_commit
> +#define huge_ptep_modify_prot_commit huge_ptep_modify_prot_commit
> +static inline void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + pte_t old_pte, pte_t pte)
> +{
> + set_huge_pte_at(vma->vm_mm, addr, ptep, pte);
> +}
> +#endif
> +
>  #else/* CONFIG_HUGETLB_PAGE */
>  struct hstate {};
>  #define alloc_huge_page(v, a, r) NULL
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index df2e7dd5ff17..f824d2200ca9 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4387,10 +4387,12 @@ unsigned long hugetlb_change_protection(struct 
> vm_area_struct *vma,
>   continue;
>   }
>   if (!huge_pte_none(pte)) {
> - pte = huge_ptep_get_and_clear(mm, address, ptep);
> - pte = pte_mkhuge(huge_pte_modify(pte, newprot));
> + pte_t old_pte;
> +
> + old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
> + pte = pte_mkhuge(huge_pte_modify(old_pte, newprot));
>   pte = arch_make_huge_pte(pte, vma, NULL, 0);
> - set_huge_pte_at(mm, address, ptep, pte);
> + huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
>   pages++;
>   }
>   spin_unlock(ptl);

Looks like a faithful conversion.

Reviewed-by: Michael Ellerman 

cheers


Re: [PATCH V5 3/5] arch/powerpc/mm: Nest MMU workaround for mprotect RW upgrade.

2019-01-30 Thread Michael Ellerman
"Aneesh Kumar K.V"  writes:
> NestMMU requires us to mark the pte invalid and flush the tlb when we do a
> RW upgrade of pte. We fixed a variant of this in the fault path in commit
> Fixes: bd5050e38aec ("powerpc/mm/radix: Change pte relax sequence to handle nest MMU hang")

You don't want the "Fixes:" there.

>
> Do the same for mprotect upgrades.
>
> Hugetlb is handled in the next patch.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 18 ++
>  arch/powerpc/include/asm/book3s/64/radix.h   |  4 
>  arch/powerpc/mm/pgtable-book3s64.c   | 25 
>  arch/powerpc/mm/pgtable-radix.c  | 18 ++
>  4 files changed, 65 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 2e6ada28da64..92eaea164700 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -1314,6 +1314,24 @@ static inline int pud_pfn(pud_t pud)
>   BUILD_BUG();
>   return 0;
>  }

Can we get a blank line here?

> +#define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
> +pte_t ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t *);
> +void ptep_modify_prot_commit(struct vm_area_struct *, unsigned long,
> +  pte_t *, pte_t, pte_t);

So these are not inline ...

> +/*
> + * Returns true for a R -> RW upgrade of pte
> + */
> +static inline bool is_pte_rw_upgrade(unsigned long old_val, unsigned long 
> new_val)
> +{
> + if (!(old_val & _PAGE_READ))
> + return false;
> +
> + if ((!(old_val & _PAGE_WRITE)) && (new_val & _PAGE_WRITE))
> + return true;
> +
> + return false;
> +}
>  
>  #endif /* __ASSEMBLY__ */
>  #endif /* _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ */
> diff --git a/arch/powerpc/mm/pgtable-book3s64.c 
> b/arch/powerpc/mm/pgtable-book3s64.c
> index f3c31f5e1026..47c742f002ea 100644
> --- a/arch/powerpc/mm/pgtable-book3s64.c
> +++ b/arch/powerpc/mm/pgtable-book3s64.c
> @@ -400,3 +400,28 @@ void arch_report_meminfo(struct seq_file *m)
>  atomic_long_read(&direct_pages_count[MMU_PAGE_1G]) << 20);
>  }
>  #endif /* CONFIG_PROC_FS */
> +
> +pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
> +  pte_t *ptep)
> +{
> + unsigned long pte_val;
> +
> + /*
> +  * Clear the _PAGE_PRESENT so that no hardware parallel update is
> +  * possible. Also keep the pte_present true so that we don't take
> +  * wrong fault.
> +  */
> + pte_val = pte_update(vma->vm_mm, addr, ptep, _PAGE_PRESENT, _PAGE_INVALID, 0);
> +
> + return __pte(pte_val);
> +
> +}
> +
> +void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
> +  pte_t *ptep, pte_t old_pte, pte_t pte)
> +{

Which means we're going to be doing a function call to get to here ...

> + if (radix_enabled())
> + return radix__ptep_modify_prot_commit(vma, addr,
> +   ptep, old_pte, pte);

And then another function call to get to the radix version ...

> + set_pte_at(vma->vm_mm, addr, ptep, pte);
> +}
> diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
> index 931156069a81..dced3cd241c2 100644
> --- a/arch/powerpc/mm/pgtable-radix.c
> +++ b/arch/powerpc/mm/pgtable-radix.c
> @@ -1063,3 +1063,21 @@ void radix__ptep_set_access_flags(struct 
> vm_area_struct *vma, pte_t *ptep,
>   }
>   /* See ptesync comment in radix__set_pte_at */
>  }
> +
> +void radix__ptep_modify_prot_commit(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + pte_t old_pte, pte_t pte)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> +
> + /*
> +  * To avoid NMMU hang while relaxing access we need to flush the tlb before
> +  * we set the new value. We need to do this only for radix, because hash
> +  * translation does flush when updating the linux pte.
> +  */
> + if (is_pte_rw_upgrade(pte_val(old_pte), pte_val(pte)) &&
> + (atomic_read(&mm->context.copros) > 0))
> + radix__flush_tlb_page(vma, addr);

To finally get here, where we'll realise that 99.99% of processes don't
use copros and so we have nothing to do except set the PTE.

> +
> + set_pte_at(mm, addr, ptep, pte);
> +}

So can we just make it all inline in the header? Or do we think it's not
a hot enough path to worry about it?
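
Purely to illustrate the suggestion, an all-inline variant might look like the
sketch below (an assumption, not the submitted patch; it only folds the calls
quoted above into the header):

	static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
						   unsigned long addr, pte_t *ptep,
						   pte_t old_pte, pte_t pte)
	{
		struct mm_struct *mm = vma->vm_mm;

		/* The nest MMU workaround only matters for radix with active copros */
		if (radix_enabled() &&
		    is_pte_rw_upgrade(pte_val(old_pte), pte_val(pte)) &&
		    atomic_read(&mm->context.copros) > 0)
			radix__flush_tlb_page(vma, addr);

		set_pte_at(mm, addr, ptep, pte);
	}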

cheers


Re: [PATCH V5 2/5] mm: update ptep_modify_prot_commit to take old pte value as arg

2019-01-30 Thread Michael Ellerman
"Aneesh Kumar K.V"  writes:

> Architectures like ppc64 require to do a conditional tlb flush based on the
> old and new value of pte. Enable that by passing old pte value as the arg.

It's not actually the architecture, it's to work around a specific bug
on Power9.

> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index c89ce07923c8..028c724dcb1a 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -110,8 +110,8 @@ static unsigned long change_pte_range(struct 
> vm_area_struct *vma, pmd_t *pmd,
>   continue;
>   }
>  
> - ptent = ptep_modify_prot_start(vma, addr, pte);
> - ptent = pte_modify(ptent, newprot);
> + oldpte = ptep_modify_prot_start(vma, addr, pte);
> + ptent = pte_modify(oldpte, newprot);
>   if (preserve_write)
>   ptent = pte_mk_savedwrite(ptent);

Is it OK to reuse oldpte here?

It was set at the top of the loop with:

oldpte = *pte;

Is it guaranteed that ptep_modify_prot_start() returns the old value
unmodified, or could an implementation conceivably filter some bits out?

If so then it could be confusing for oldpte to have its value change
half way through the loop.


cheers


Re: [PATCH V5 1/5] mm: Update ptep_modify_prot_start/commit to take vm_area_struct as arg

2019-01-30 Thread Michael Ellerman
"Aneesh Kumar K.V"  writes:

> Some architecture may want to call flush_tlb_range from these helpers.

That's what we want to do, but wouldn't a better description be that
some architectures may need access to the vma for some reason, one of
which might be flushing the TLB.

> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/s390/include/asm/pgtable.h   |  4 ++--
>  arch/s390/mm/pgtable.c|  6 --
>  arch/x86/include/asm/paravirt.h   | 11 ++-
>  arch/x86/include/asm/paravirt_types.h |  5 +++--
>  arch/x86/xen/mmu.h|  4 ++--
>  arch/x86/xen/mmu_pv.c |  8 
>  fs/proc/task_mmu.c|  4 ++--
>  include/asm-generic/pgtable.h | 16 
>  mm/memory.c   |  4 ++--
>  mm/mprotect.c |  4 ++--
>  10 files changed, 35 insertions(+), 31 deletions(-)
>
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 063732414dfb..5d730199e37b 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -1069,8 +1069,8 @@ static inline pte_t ptep_get_and_clear(struct mm_struct 
> *mm,
>  }
>  
>  #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
> -pte_t ptep_modify_prot_start(struct mm_struct *, unsigned long, pte_t *);
> -void ptep_modify_prot_commit(struct mm_struct *, unsigned long, pte_t *, 
> pte_t);
> +pte_t ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t *);
> +void ptep_modify_prot_commit(struct vm_area_struct *, unsigned long, pte_t *, pte_t);
>  
>  #define __HAVE_ARCH_PTEP_CLEAR_FLUSH
>  static inline pte_t ptep_clear_flush(struct vm_area_struct *vma,
> diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
> index f2cc7da473e4..29c0a21cd34a 100644
> --- a/arch/s390/mm/pgtable.c
> +++ b/arch/s390/mm/pgtable.c
> @@ -301,12 +301,13 @@ pte_t ptep_xchg_lazy(struct mm_struct *mm, unsigned 
> long addr,
>  }
>  EXPORT_SYMBOL(ptep_xchg_lazy);
>  
> -pte_t ptep_modify_prot_start(struct mm_struct *mm, unsigned long addr,
> +pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
>pte_t *ptep)
>  {
>   pgste_t pgste;
>   pte_t old;
>   int nodat;
> + struct mm_struct *mm = vma->vm_mm;

If this was my code I'd want the mm as the first variable, to preserve
the Reverse Christmas tree format.
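
For anyone unfamiliar with the convention, "reverse Christmas tree" simply
means ordering the local declarations from the longest line to the shortest,
roughly (using the locals from the hunk above):

	struct mm_struct *mm = vma->vm_mm;
	pgste_t pgste;
	pte_t old;
	int nodat;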

>   preempt_disable();
>   pgste = ptep_xchg_start(mm, addr, ptep);
> @@ -320,10 +321,11 @@ pte_t ptep_modify_prot_start(struct mm_struct *mm, 
> unsigned long addr,
>  }
>  EXPORT_SYMBOL(ptep_modify_prot_start);
>  
> -void ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr,
> +void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
>pte_t *ptep, pte_t pte)
>  {
>   pgste_t pgste;
> + struct mm_struct *mm = vma->vm_mm;
  
Ditto.

>   if (!MACHINE_HAS_NX)
>   pte_val(pte) &= ~_PAGE_NOEXEC;
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index a97f28d914d5..c5a7f18cce7e 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -422,25 +422,26 @@ static inline pgdval_t pgd_val(pgd_t pgd)
>  }
>  
>  #define  __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
> -static inline pte_t ptep_modify_prot_start(struct mm_struct *mm, unsigned long addr,
> +static inline pte_t ptep_modify_prot_start(struct vm_area_struct *vma, unsigned long addr,
>  pte_t *ptep)
>  {
>   pteval_t ret;
>  
> - ret = PVOP_CALL3(pteval_t, mmu.ptep_modify_prot_start, mm, addr, ptep);
> + ret = PVOP_CALL3(pteval_t, mmu.ptep_modify_prot_start, vma, addr, ptep);
>  
>   return (pte_t) { .pte = ret };
>  }
>  
> -static inline void ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr,
> +static inline void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr,
>  pte_t *ptep, pte_t pte)
>  {
> +

Unnecessary blank line here.

>   if (sizeof(pteval_t) > sizeof(long))
>   /* 5 arg words */
> - pv_ops.mmu.ptep_modify_prot_commit(mm, addr, ptep, pte);
> + pv_ops.mmu.ptep_modify_prot_commit(vma, addr, ptep, pte);
>   else
>   PVOP_VCALL4(mmu.ptep_modify_prot_commit,
> - mm, addr, ptep, pte.pte);
> + vma, addr, ptep, pte.pte);
>  }
>  
>  static inline void set_pte(pte_t *ptep, pte_t pte)

The rest looks good.

cheers


Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

2019-01-30 Thread Cédric Le Goater
On 1/30/19 6:55 AM, Paul Mackerras wrote:
> On Tue, Jan 29, 2019 at 06:44:50PM +0100, Cédric Le Goater wrote:
>> On 1/29/19 5:12 AM, Paul Mackerras wrote:
>>> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:

 Is clearing the PTEs and repopulating the VMA unsafe ? 
>>>
>>> Actually, now that I come to think of it, there could be any number of
>>> VMAs (well, up to almost 64k of them), since once you have a file
>>> descriptor you can call mmap on it multiple times.
>>>
>>> The more I think about it, the more I think that getting userspace to
>>> manage the mappings with mmap() and munmap() is the right idea if it
>>> can be made to work.
>>
>> We might be able to mmap() and munmap() regions of ESB pages in the RTAS 
>> call "ibm,change-msi".  I think that's the right spot for it. 
> 
> I was thinking that the ESB pages should be mmapped for device
> interrupts at VM startup or when a device is hot-plugged in, and
> munmapped when the device is hot-removed.  Maybe the mmap could be
> done in conjunction with the KVM_IRQFD call?
> 
> What is the reasoning behind doing it in the ibm,change-msi call?

Because when the device is plugged in, it has no interrupts. These are 
allocated on demand only when the driver is loaded in the guest. 

C.



Re: [PATCH v14 01/12] powerpc/32: Fix CONFIG_VIRT_CPU_ACCOUNTING_NATIVE for 40x/booke

2019-01-30 Thread Christophe Leroy




Le 24/01/2019 à 17:19, Christophe Leroy a écrit :

40x/booke have another path to reach 3f from transfer_to_handler,
so ACCOUNT_CPU_USER_ENTRY() have to be moved there.

Signed-off-by: Christophe Leroy 


Now ACCOUNT_CPU_USER_ENTRY() is also in the non-user entry path.

This change is crazy, please ignore it :-S

Christophe


---
  arch/powerpc/kernel/entry_32.S | 13 ++---
  1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 0768dfd8a64e..0165edd03b38 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -185,12 +185,6 @@ transfer_to_handler:
addir12,r12,-1
stw r12,4(r11)
  #endif
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-   CURRENT_THREAD_INFO(r9, r1)
-   tophys(r9, r9)
-   ACCOUNT_CPU_USER_ENTRY(r9, r11, r12)
-#endif
-
b   3f
  
  2:	/* if from kernel, check interrupted DOZE/NAP mode and

@@ -208,9 +202,14 @@ transfer_to_handler:
bt- 31-TLF_NAPPING,4f
bt- 31-TLF_SLEEPING,7f
  #endif /* CONFIG_PPC_BOOK3S_32 || CONFIG_E500 */
+3:
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+   CURRENT_THREAD_INFO(r9, r1)
+   tophys(r9, r9)
+   ACCOUNT_CPU_USER_ENTRY(r9, r11, r12)
+#endif
.globl transfer_to_handler_cont
  transfer_to_handler_cont:
-3:
mflrr9
lwz r11,0(r9)   /* virtual address of handler */
lwz r9,4(r9)/* where to go when done */