Re: [PATCH v2 1/2] mm: add probe_user_read()
On Tue, Jan 08, 2019 at 07:37:44AM +, Christophe Leroy wrote:
> In powerpc code, there are several places implementing safe
> access to user data. This is sometimes implemented using
> probe_kernel_address() with additional access_ok() verification,
> sometimes with get_user() enclosed in a pagefault_disable()/enable()
> pair, etc.:
>
>     show_user_instructions()
>     bad_stack_expansion()
>     p9_hmi_special_emu()
>     fsl_pci_mcheck_exception()
>     read_user_stack_64()
>     read_user_stack_32() on PPC64
>     read_user_stack_32() on PPC32
>     power_pmu_bhrb_to()
>
> In the same spirit as probe_kernel_read(), this patch adds
> probe_user_read().
>
> probe_user_read() does the same as probe_kernel_read() but
> first checks that it is really a user address.
>
> Signed-off-by: Christophe Leroy
> ---
> v2: Added "Returns:" comment and removed probe_user_address()
>
> Changes since RFC: Made a static inline function instead of weak function
> as recommended by Kees.
>
>  include/linux/uaccess.h | 34 ++
>  1 file changed, 34 insertions(+)
>
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 37b226e8df13..07f4f0ed69bc 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -263,6 +263,40 @@ extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
>  #define probe_kernel_address(addr, retval)	\
>  	probe_kernel_read(&retval, addr, sizeof(retval))
>
> +/**
> + * probe_user_read(): safely attempt to read from a user location
> + * @dst: pointer to the buffer that shall take the data
> + * @src: address to read from
> + * @size: size of the data chunk
> + *
> + * Returns: 0 on success, -EFAULT on error.

Nit: please put the "Returns:" comment after the description, otherwise
kernel-doc considers it a part of the elaborate description.

> + *
> + * Safely read from address @src to the buffer at @dst. If a kernel fault
> + * happens, handle that and return -EFAULT.
> + *
> + * We ensure that the copy_from_user is executed in atomic context so that
> + * do_page_fault() doesn't attempt to take mmap_sem. This makes
> + * probe_user_read() suitable for use within regions where the caller
> + * already holds mmap_sem, or other locks which nest inside mmap_sem.
> + */
> +
> +#ifndef probe_user_read
> +static __always_inline long probe_user_read(void *dst, const void __user *src,
> +					    size_t size)
> +{
> +	long ret;
> +
> +	if (!access_ok(src, size))
> +		return -EFAULT;
> +
> +	pagefault_disable();
> +	ret = __copy_from_user_inatomic(dst, src, size);
> +	pagefault_enable();
> +
> +	return ret ? -EFAULT : 0;
> +}
> +#endif
> +
>  #ifndef user_access_begin
>  #define user_access_begin(ptr,len) access_ok(ptr, len)
>  #define user_access_end() do { } while (0)
> --
> 2.13.3

--
Sincerely yours,
Mike.
[PATCH v2 2/2] powerpc: use probe_user_read()
Instead of opencoding it, use probe_user_read() to safely read a user
location.

Signed-off-by: Christophe Leroy
---
v2: Using probe_user_read() instead of probe_user_address()

 arch/powerpc/kernel/process.c   | 12 +---
 arch/powerpc/mm/fault.c         |  6 +-
 arch/powerpc/perf/callchain.c   | 20 +++-
 arch/powerpc/perf/core-book3s.c |  8 +---
 arch/powerpc/sysdev/fsl_pci.c   | 10 --
 5 files changed, 10 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index ce393df243aa..6a4b59d574c2 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1298,16 +1298,6 @@ void show_user_instructions(struct pt_regs *regs)

 	pc = regs->nip - (NR_INSN_TO_PRINT * 3 / 4 * sizeof(int));

-	/*
-	 * Make sure the NIP points at userspace, not kernel text/data or
-	 * elsewhere.
-	 */
-	if (!__access_ok(pc, NR_INSN_TO_PRINT * sizeof(int), USER_DS)) {
-		pr_info("%s[%d]: Bad NIP, not dumping instructions.\n",
-			current->comm, current->pid);
-		return;
-	}
-
 	seq_buf_init(&s, buf, sizeof(buf));

 	while (n) {
@@ -1318,7 +1308,7 @@ void show_user_instructions(struct pt_regs *regs)
 		for (i = 0; i < 8 && n; i++, n--, pc += sizeof(int)) {
 			int instr;

-			if (probe_kernel_address((const void *)pc, instr)) {
+			if (probe_user_read(&instr, (void __user *)pc, sizeof(instr))) {
 				seq_buf_printf(&s, " ");
 				continue;
 			}
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 887f11bcf330..ec74305fa330 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -276,12 +276,8 @@ static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address,
 		if ((flags & FAULT_FLAG_WRITE) && (flags & FAULT_FLAG_USER) &&
 		    access_ok(nip, sizeof(*nip))) {
 			unsigned int inst;
-			int res;

-			pagefault_disable();
-			res = __get_user_inatomic(inst, nip);
-			pagefault_enable();
-			if (!res)
+			if (!probe_user_read(&inst, nip, sizeof(inst)))
 				return !store_updates_sp(inst);
 			*must_retry = true;
 		}
diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index 0af051a1974e..0680efb2237b 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -159,12 +159,8 @@ static int read_user_stack_64(unsigned long __user *ptr, unsigned long *ret)
 	    ((unsigned long)ptr & 7))
 		return -EFAULT;

-	pagefault_disable();
-	if (!__get_user_inatomic(*ret, ptr)) {
-		pagefault_enable();
+	if (!probe_user_read(ret, ptr, sizeof(*ret)))
 		return 0;
-	}
-	pagefault_enable();

 	return read_user_stack_slow(ptr, ret, 8);
 }
@@ -175,12 +171,8 @@ static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
 	    ((unsigned long)ptr & 3))
 		return -EFAULT;

-	pagefault_disable();
-	if (!__get_user_inatomic(*ret, ptr)) {
-		pagefault_enable();
+	if (!probe_user_read(ret, ptr, sizeof(*ret)))
 		return 0;
-	}
-	pagefault_enable();

 	return read_user_stack_slow(ptr, ret, 4);
 }
@@ -307,17 +299,11 @@ static inline int current_is_64bit(void)
  */
 static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
 {
-	int rc;
-
 	if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
 	    ((unsigned long)ptr & 3))
 		return -EFAULT;

-	pagefault_disable();
-	rc = __get_user_inatomic(*ret, ptr);
-	pagefault_enable();
-
-	return rc;
+	return probe_user_read(ret, ptr, sizeof(*ret));
 }

 static inline void perf_callchain_user_64(struct perf_callchain_entry_ctx *entry,
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index b0723002a396..4b64ddf0db68 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -416,7 +416,6 @@ static void power_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
 static __u64 power_pmu_bhrb_to(u64 addr)
 {
 	unsigned int instr;
-	int ret;
 	__u64 target;

 	if (is_kernel_addr(addr)) {
@@ -427,13 +426,8 @@ static __u64 power_pmu_bhrb_to(u64 addr)
 	}

 	/* Userspace: need copy instruction here then translate it */
-	pagefault_disable();
-	ret = __get_user_inatomic(instr, (unsigned int __user *)addr);
-	if (ret) {
-		pagefault_enable();
+	if (pr
[PATCH v2 1/2] mm: add probe_user_read()
In powerpc code, there are several places implementing safe access to
user data. This is sometimes implemented using probe_kernel_address()
with additional access_ok() verification, sometimes with get_user()
enclosed in a pagefault_disable()/enable() pair, etc.:

    show_user_instructions()
    bad_stack_expansion()
    p9_hmi_special_emu()
    fsl_pci_mcheck_exception()
    read_user_stack_64()
    read_user_stack_32() on PPC64
    read_user_stack_32() on PPC32
    power_pmu_bhrb_to()

In the same spirit as probe_kernel_read(), this patch adds
probe_user_read().

probe_user_read() does the same as probe_kernel_read() but first checks
that it is really a user address.

Signed-off-by: Christophe Leroy
---
v2: Added "Returns:" comment and removed probe_user_address()

Changes since RFC: Made a static inline function instead of weak function
as recommended by Kees.

 include/linux/uaccess.h | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 37b226e8df13..07f4f0ed69bc 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -263,6 +263,40 @@ extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
 #define probe_kernel_address(addr, retval)	\
 	probe_kernel_read(&retval, addr, sizeof(retval))

+/**
+ * probe_user_read(): safely attempt to read from a user location
+ * @dst: pointer to the buffer that shall take the data
+ * @src: address to read from
+ * @size: size of the data chunk
+ *
+ * Returns: 0 on success, -EFAULT on error.
+ *
+ * Safely read from address @src to the buffer at @dst. If a kernel fault
+ * happens, handle that and return -EFAULT.
+ *
+ * We ensure that the copy_from_user is executed in atomic context so that
+ * do_page_fault() doesn't attempt to take mmap_sem. This makes
+ * probe_user_read() suitable for use within regions where the caller
+ * already holds mmap_sem, or other locks which nest inside mmap_sem.
+ */
+
+#ifndef probe_user_read
+static __always_inline long probe_user_read(void *dst, const void __user *src,
+					    size_t size)
+{
+	long ret;
+
+	if (!access_ok(src, size))
+		return -EFAULT;
+
+	pagefault_disable();
+	ret = __copy_from_user_inatomic(dst, src, size);
+	pagefault_enable();
+
+	return ret ? -EFAULT : 0;
+}
+#endif
+
 #ifndef user_access_begin
 #define user_access_begin(ptr,len) access_ok(ptr, len)
 #define user_access_end() do { } while (0)
--
2.13.3
Re: [Bug 202149] New: NULL Pointer Dereference in __split_huge_pmd on PPC64LE
Andrew Morton writes:
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Fri, 04 Jan 2019 22:49:52 + bugzilla-dae...@bugzilla.kernel.org wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=202149
>>
>> Bug ID: 202149
>>    Summary: NULL Pointer Dereference in __split_huge_pmd on
>>             PPC64LE
>
> I think that trace is pointing at the ppc-specific
> pgtable_trans_huge_withdraw()?

That is correct.

Matt, can you share the .config used for the kernel? Does this happen
only with a 4K page size?

-aneesh
[PATCH v4 1/2] crypto: talitos - reorder code in talitos_edesc_alloc()
This patch moves the mapping of the IV after the kmalloc(). This avoids
having to unmap in case kmalloc() fails.

Signed-off-by: Christophe Leroy
---
new in v4

 drivers/crypto/talitos.c | 25 +++--
 1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 45e20707cef8..54d80e7edb86 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -1361,23 +1361,18 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
 	struct talitos_private *priv = dev_get_drvdata(dev);
 	bool is_sec1 = has_ftr_sec1(priv);
 	int max_len = is_sec1 ? TALITOS1_MAX_DATA_LEN : TALITOS2_MAX_DATA_LEN;
-	void *err;

 	if (cryptlen + authsize > max_len) {
 		dev_err(dev, "length exceeds h/w max limit\n");
 		return ERR_PTR(-EINVAL);
 	}

-	if (ivsize)
-		iv_dma = dma_map_single(dev, iv, ivsize, DMA_TO_DEVICE);
-
 	if (!dst || dst == src) {
 		src_len = assoclen + cryptlen + authsize;
 		src_nents = sg_nents_for_len(src, src_len);
 		if (src_nents < 0) {
 			dev_err(dev, "Invalid number of src SG.\n");
-			err = ERR_PTR(-EINVAL);
-			goto error_sg;
+			return ERR_PTR(-EINVAL);
 		}
 		src_nents = (src_nents == 1) ? 0 : src_nents;
 		dst_nents = dst ? src_nents : 0;
@@ -1387,16 +1382,14 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
 		src_nents = sg_nents_for_len(src, src_len);
 		if (src_nents < 0) {
 			dev_err(dev, "Invalid number of src SG.\n");
-			err = ERR_PTR(-EINVAL);
-			goto error_sg;
+			return ERR_PTR(-EINVAL);
 		}
 		src_nents = (src_nents == 1) ? 0 : src_nents;
 		dst_len = assoclen + cryptlen + (encrypt ? authsize : 0);
 		dst_nents = sg_nents_for_len(dst, dst_len);
 		if (dst_nents < 0) {
 			dev_err(dev, "Invalid number of dst SG.\n");
-			err = ERR_PTR(-EINVAL);
-			goto error_sg;
+			return ERR_PTR(-EINVAL);
 		}
 		dst_nents = (dst_nents == 1) ? 0 : dst_nents;
 	}
@@ -1425,10 +1418,10 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
 		alloc_len += sizeof(struct talitos_desc);

 	edesc = kmalloc(alloc_len, GFP_DMA | flags);
-	if (!edesc) {
-		err = ERR_PTR(-ENOMEM);
-		goto error_sg;
-	}
+	if (!edesc)
+		return ERR_PTR(-ENOMEM);
+	if (ivsize)
+		iv_dma = dma_map_single(dev, iv, ivsize, DMA_TO_DEVICE);

 	memset(&edesc->desc, 0, sizeof(edesc->desc));
 	edesc->src_nents = src_nents;
@@ -1445,10 +1438,6 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
 				     DMA_BIDIRECTIONAL);
 	}
 	return edesc;
-error_sg:
-	if (iv_dma)
-		dma_unmap_single(dev, iv_dma, ivsize, DMA_TO_DEVICE);
-	return err;
 }

 static struct talitos_edesc *aead_edesc_alloc(struct aead_request *areq, u8 *iv,
--
2.13.3
[PATCH v4 2/2] crypto: talitos - fix ablkcipher for CONFIG_VMAP_STACK
[    2.364486] WARNING: CPU: 0 PID: 60 at ./arch/powerpc/include/asm/io.h:837 dma_nommu_map_page+0x44/0xd4
[    2.373579] CPU: 0 PID: 60 Comm: cryptomgr_test Tainted: G        W 4.20.0-rc5-00560-g6bfb52e23a00-dirty #531
[    2.384740] NIP: c000c540 LR: c000c584 CTR:
[    2.389743] REGS: c95abab0 TRAP: 0700 Tainted: G        W (4.20.0-rc5-00560-g6bfb52e23a00-dirty)
[    2.400042] MSR: 00029032 CR: 24042204 XER:
[    2.406669]
[    2.406669] GPR00: c02f2244 c95abb60 c6262990 c95abd80 256a 0001 0001 0001
[    2.406669] GPR08: 2000 0010 0010 24042202 0100 c95abd88
[    2.406669] GPR16: c05569d4 0001 0010 c95abc88 c0615664 0004
[    2.406669] GPR24: 0010 c95abc88 c95abc88 c61ae210 c7ff6d40 c61ae210 3d68
[    2.441559] NIP [c000c540] dma_nommu_map_page+0x44/0xd4
[    2.446720] LR [c000c584] dma_nommu_map_page+0x88/0xd4
[    2.451762] Call Trace:
[    2.454195] [c95abb60] [82000808] 0x82000808 (unreliable)
[    2.459572] [c95abb80] [c02f2244] talitos_edesc_alloc+0xbc/0x3c8
[    2.465493] [c95abbb0] [c02f2600] ablkcipher_edesc_alloc+0x4c/0x5c
[    2.471606] [c95abbd0] [c02f4ed0] ablkcipher_encrypt+0x20/0x64
[    2.477389] [c95abbe0] [c02023b0] __test_skcipher+0x4bc/0xa08
[    2.483049] [c95abe00] [c0204b60] test_skcipher+0x2c/0xcc
[    2.488385] [c95abe20] [c0204c48] alg_test_skcipher+0x48/0xbc
[    2.494064] [c95abe40] [c0205cec] alg_test+0x164/0x2e8
[    2.499142] [c95abf00] [c0200dec] cryptomgr_test+0x48/0x50
[    2.504558] [c95abf10] [c0039ff4] kthread+0xe4/0x110
[    2.509471] [c95abf40] [c000e1d0] ret_from_kernel_thread+0x14/0x1c
[    2.515532] Instruction dump:
[    2.518468] 7c7e1b78 7c9d2378 7cbf2b78 41820054 3d20c076 8089c200 3d20c076 7c84e850
[    2.526127] 8129c204 7c842e70 7f844840 419c0008 <0fe0> 2f9e 54847022 7c84fa14
[    2.533960] ---[ end trace bf78d94af73fe3b8 ]---
[    2.539123] talitos ff02.crypto: master data transfer error
[    2.544775] talitos ff02.crypto: TEA error: ISR 0x2000_0040
[    2.551625] alg: skcipher: encryption failed on test 1 for ecb-aes-talitos: ret=22

IV cannot be on stack when CONFIG_VMAP_STACK is selected because the
stack cannot be DMA mapped anymore.

This patch copies the IV into the extended descriptor.

Fixes: 4de9d0b547b9 ("crypto: talitos - Add ablkcipher algorithms")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy
---
v4: Split in two patches; made the copy unconditional.
v3: Using struct edesc buffer.
v2: Using per-request context.

 drivers/crypto/talitos.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 54d80e7edb86..f8e2c5c3f4eb 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -1416,12 +1416,15 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
 	/* if its a ahash, add space for a second desc next to the first one */
 	if (is_sec1 && !dst)
 		alloc_len += sizeof(struct talitos_desc);
+	alloc_len += ivsize;

 	edesc = kmalloc(alloc_len, GFP_DMA | flags);
 	if (!edesc)
 		return ERR_PTR(-ENOMEM);
-	if (ivsize)
+	if (ivsize) {
+		iv = memcpy(((u8 *)edesc) + alloc_len - ivsize, iv, ivsize);
 		iv_dma = dma_map_single(dev, iv, ivsize, DMA_TO_DEVICE);
+	}

 	memset(&edesc->desc, 0, sizeof(edesc->desc));
 	edesc->src_nents = src_nents;
--
2.13.3
Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]
On Mon, Jan 07, 2019 at 09:01:29PM -0700, Jason Gunthorpe wrote:
> On Sun, Jan 06, 2019 at 09:43:46AM +1100, Benjamin Herrenschmidt wrote:
> > On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote:
> >
> > > > Interesting. I've investigated this further, though I don't have as
> > > > many new clues as I'd like. The problem occurs reliably, at least on
> > > > one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
> > > > I don't yet know if it occurs with other machines, I'm having trouble
> > > > getting access to other machines with a suitable card. I didn't
> > > > manage to reproduce it on a different POWER8 machine with a
> > > > ConnectX-5, but I don't know if it's the difference in machine or
> > > > difference in card revision that's important.
> > >
> > > Make sure the card has the latest firmware is always good advice..
> > >
> > > > So possibilities that occur to me:
> > > > * It's something specific about how the vfio-pci driver uses D3
> > > >   state - have you tried rebinding your device to vfio-pci?
> > > > * It's something specific about POWER, either the kernel or the PCI
> > > >   bridge hardware
> > > > * It's something specific about this particular type of machine
> > >
> > > Does the EEH indicate what happened to actually trigger it?
> >
> > In a very cryptic way that requires manual parsing using non-public
> > docs sadly but yes. From the look of it, it's a completion timeout.
> >
> > Looks to me like we don't get a response to a config space access
> > during the change of D state. I don't know if it's the write of the D3
> > state itself or the read back though (it's probably detected on the
> > read back or a subsequent read, but that doesn't tell me which specific
> > one failed).
>
> If it is just one card doing it (again, check you have latest
> firmware) I wonder if it is a sketchy PCI-E electrical link that is
> causing a long re-training cycle? Can you tell if the PCI-E link is
> permanently gone or does it eventually return?
>
> Does the card work in Gen 3 when it starts? Is there any indication of
> PCI-E link errors?
>
> Every time or sometimes?
>
> POWER 8 firmware is good? If the link does eventually come back, is
> the POWER8's D3 resumption timeout long enough?
>
> If this doesn't lead to an obvious conclusion you'll probably need to
> connect to IBM's Mellanox support team to get more information from
> the card side.

+1, I tried to find any Mellanox-internal bugs related to your issue and
didn't find anything concrete.

Thanks

> Jason
[PATCH 3/3] sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW
Add new socket timeout options that are y2038 safe.

Signed-off-by: Deepa Dinamani
Cc: ccaul...@redhat.com
Cc: da...@davemloft.net
Cc: del...@gmx.de
Cc: pau...@samba.org
Cc: r...@linux-mips.org
Cc: r...@twiddle.net
Cc: cluster-de...@redhat.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-al...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
---
 arch/alpha/include/uapi/asm/socket.h  | 12 +++--
 arch/mips/include/uapi/asm/socket.h   | 11 -
 arch/parisc/include/uapi/asm/socket.h | 11 -
 arch/sparc/include/uapi/asm/socket.h  | 11 -
 include/net/sock.h                    |  4 +-
 include/uapi/asm-generic/socket.h     | 11 -
 net/core/sock.c                       | 64 +--
 7 files changed, 98 insertions(+), 26 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index ea3ba981d8a0..3d800d5d3d5d 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -118,19 +118,25 @@
 #define SO_TIMESTAMPNS_NEW	63
 #define SO_TIMESTAMPING_NEW	64

-#if !defined(__KERNEL__)
+#define SO_RCVTIMEO_NEW	65
+#define SO_SNDTIMEO_NEW	66

-#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
-#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD
+#if !defined(__KERNEL__)

 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP	SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS	SO_TIMESTAMPNS_OLD
 #define SO_TIMESTAMPING	SO_TIMESTAMPING_OLD
+
+#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
+#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD
 #else
 #define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
 #define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
 #define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW)
+
+#define SO_RCVTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_RCVTIMEO_OLD : SO_RCVTIMEO_NEW)
+#define SO_SNDTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_SNDTIMEO_OLD : SO_SNDTIMEO_NEW)
 #endif

 #define SCM_TIMESTAMP	SO_TIMESTAMP

diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 4dde20d64690..5a7f9010c090 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -128,18 +128,25 @@
 #define SO_TIMESTAMPNS_NEW	63
 #define SO_TIMESTAMPING_NEW	64

+#define SO_RCVTIMEO_NEW	65
+#define SO_SNDTIMEO_NEW	66
+
 #if !defined(__KERNEL__)
-#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
-#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD

 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP	SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS	SO_TIMESTAMPNS_OLD
 #define SO_TIMESTAMPING	SO_TIMESTAMPING_OLD
+
+#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
+#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD
 #else
 #define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
 #define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
 #define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW)
+
+#define	SO_RCVTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_RCVTIMEO_OLD : SO_RCVTIMEO_NEW)
+#define	SO_SNDTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_SNDTIMEO_OLD : SO_SNDTIMEO_NEW)
 #endif

 #define SCM_TIMESTAMP	SO_TIMESTAMP

diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 546937fa0d8b..bd35de5b4666 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -109,18 +109,25 @@
 #define SO_TIMESTAMPNS_NEW	0x4038
 #define SO_TIMESTAMPING_NEW	0x4039

+#define SO_RCVTIMEO_NEW	0x4040
+#define SO_SNDTIMEO_NEW	0x4041
+
 #if !defined(__KERNEL__)
-#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
-#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD

 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP	SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS	SO_TIMESTAMPNS_OLD
 #define SO_TIMESTAMPING	SO_TIMESTAMPING_OLD
+
+#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
+#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD
 #else
 #define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
 #define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
 #define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW)
+
+#define	SO_RCVTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_RCVTIMEO_OLD : SO_RCVTIMEO_NEW)
+#define	SO_SNDTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_SNDTIMEO_O
[PATCH 2/3] socket: Rename SO_RCVTIMEO/ SO_SNDTIMEO with _OLD suffixes
SO_RCVTIMEO and SO_SNDTIMEO socket options use struct timeval as the
time format. struct timeval is not y2038 safe.
The subsequent patches in the series add support for new socket timeout
options with the _NEW suffix that are y2038 safe.
Rename the existing options with _OLD suffix forms so that the right
option is enabled for userspace applications according to the
architecture and time_t definition of libc.

Signed-off-by: Deepa Dinamani
Cc: ccaul...@redhat.com
Cc: del...@gmx.de
Cc: pau...@samba.org
Cc: r...@linux-mips.org
Cc: r...@twiddle.net
Cc: cluster-de...@redhat.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-al...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
---
 arch/alpha/include/uapi/asm/socket.h   | 7 +--
 arch/mips/include/uapi/asm/socket.h    | 6 --
 arch/parisc/include/uapi/asm/socket.h  | 6 --
 arch/powerpc/include/uapi/asm/socket.h | 4 ++--
 arch/sparc/include/uapi/asm/socket.h   | 6 --
 fs/dlm/lowcomms.c                      | 4 ++--
 include/net/sock.h                     | 4 ++--
 include/uapi/asm-generic/socket.h      | 6 --
 net/compat.c                           | 4 ++--
 net/core/sock.c                        | 8
 10 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index da08412bd49f..ea3ba981d8a0 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -31,8 +31,8 @@
 #define SO_RCVBUFFORCE	0x100b
 #define	SO_RCVLOWAT	0x1010
 #define	SO_SNDLOWAT	0x1011
-#define	SO_RCVTIMEO	0x1012
-#define	SO_SNDTIMEO	0x1013
+#define	SO_RCVTIMEO_OLD	0x1012
+#define	SO_SNDTIMEO_OLD	0x1013
 #define SO_ACCEPTCONN	0x1014

 #define SO_PROTOCOL	0x1028
 #define SO_DOMAIN	0x1029
@@ -120,6 +120,9 @@

 #if !defined(__KERNEL__)

+#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
+#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD
+
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP	SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS	SO_TIMESTAMPNS_OLD

diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 1e48f67f1052..4dde20d64690 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -39,8 +39,8 @@
 #define SO_RCVBUF	0x1002	/* Receive buffer. */
 #define SO_SNDLOWAT	0x1003	/* send low-water mark */
 #define SO_RCVLOWAT	0x1004	/* receive low-water mark */
-#define SO_SNDTIMEO	0x1005	/* send timeout */
-#define SO_RCVTIMEO	0x1006	/* receive timeout */
+#define SO_SNDTIMEO_OLD	0x1005	/* send timeout */
+#define SO_RCVTIMEO_OLD	0x1006	/* receive timeout */
 #define SO_ACCEPTCONN	0x1009
 #define SO_PROTOCOL	0x1028	/* protocol type */
 #define SO_DOMAIN	0x1029	/* domain/socket family */
@@ -130,6 +130,8 @@

 #if !defined(__KERNEL__)

+#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
+#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP	SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS	SO_TIMESTAMPNS_OLD

diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index e8d6cf20f9a4..546937fa0d8b 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -22,8 +22,8 @@
 #define SO_RCVBUFFORCE	0x100b
 #define SO_SNDLOWAT	0x1003
 #define SO_RCVLOWAT	0x1004
-#define SO_SNDTIMEO	0x1005
-#define SO_RCVTIMEO	0x1006
+#define SO_SNDTIMEO_OLD	0x1005
+#define SO_RCVTIMEO_OLD	0x1006
 #define SO_ERROR	0x1007
 #define SO_TYPE		0x1008
 #define SO_PROTOCOL	0x1028
@@ -111,6 +111,8 @@

 #if !defined(__KERNEL__)

+#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
+#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP	SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS	SO_TIMESTAMPNS_OLD

diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index 94de465e0920..12aa0c43e775 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -11,8 +11,8 @@

 #define SO_RCVLOWAT	16
 #define SO_SNDLOWAT	17
-#define SO_RCVTIMEO	18
-#define SO_SNDTIMEO	19
+#define SO_RCVTIMEO_OLD	18
+#define SO_SNDTIMEO_OLD	19
 #define SO_PASSCRED	20
 #define SO_PEERCRED	21

diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index fc65bf6b6440..bdc396211627 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -21,8 +21,8 @@
 #define SO_BSDCOMPAT	0x0400
 #define SO_RCVLOWAT	0x0800
 #define SO_SNDLOWAT	0x1000
-#define SO_RCVTIMEO	0x2000
-#define SO_SNDTIMEO	0x4000
+#define SO_RCVTIMEO_OLD	0x2000
+#define SO_SNDTIMEO_OLD
[PATCH 0/3] net: y2038-safe socket timeout options
The series is aimed at adding y2038-safe timeout options:
SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW. This is similar to the previous
series adding y2038-safe SO_TIMESTAMP* options.

The series needs to be applied after the socket timestamp series:
https://lore.kernel.org/lkml/20190108032657.8331-1-deepa.ker...@gmail.com

Deepa Dinamani (3):
  socket: Use old_timeval types for socket timeouts
  socket: Rename SO_RCVTIMEO/ SO_SNDTIMEO with _OLD suffixes
  sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW

 arch/alpha/include/uapi/asm/socket.h   | 13 -
 arch/mips/include/uapi/asm/socket.h    | 13 -
 arch/parisc/include/uapi/asm/socket.h  | 13 -
 arch/powerpc/include/uapi/asm/socket.h |  4 +-
 arch/sparc/include/uapi/asm/socket.h   | 13 -
 fs/dlm/lowcomms.c                      |  4 +-
 include/uapi/asm-generic/socket.h      | 13 -
 net/compat.c                           | 14 ++---
 net/core/sock.c                        | 78 +++---
 net/vmw_vsock/af_vsock.c               |  4 +-
 10 files changed, 126 insertions(+), 43 deletions(-)

base-commit: a4983672f9ca4c8393f26b6b80710e6c78886b8c
prerequisite-patch-id: a03ec6afbdd328cd90557f7ee6675016a5f5c653
prerequisite-patch-id: 724d26c3036e6f3a38f810c2f10db3f7ddbf843b
prerequisite-patch-id: 14017867b6eb4d5231eec1b563edcd840a1be26e
prerequisite-patch-id: 8df0edfd9b973ff5aae91c7709c8223be096a5bc
prerequisite-patch-id: 9850ad48d41bf068f074c0dd3c7610fb7177c89f
prerequisite-patch-id: bd31f35bba11902d1cc3e8726492b54df34b5c59
prerequisite-patch-id: ea4b005c5ad188a4e0899d728357c114710a3a8e
prerequisite-patch-id: cc3ee912c1ee1ea502ca079de81236a467950501
--
2.17.1

Cc: ccaul...@redhat.com
Cc: cluster-de...@redhat.com
Cc: da...@davemloft.net
Cc: del...@gmx.de
Cc: linux-al...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: pau...@samba.org
Cc: r...@linux-mips.org
Cc: r...@twiddle.net
Cc: sparcli...@vger.kernel.org
Re: [PATCH v3] crypto: talitos - fix ablkcipher for CONFIG_VMAP_STACK
On Fri, Dec 21, 2018 at 08:07:52AM +, Christophe Leroy wrote:
> [    2.364486] WARNING: CPU: 0 PID: 60 at ./arch/powerpc/include/asm/io.h:837 dma_nommu_map_page+0x44/0xd4
> [    2.373579] CPU: 0 PID: 60 Comm: cryptomgr_test Tainted: G        W 4.20.0-rc5-00560-g6bfb52e23a00-dirty #531
> [    2.384740] NIP: c000c540 LR: c000c584 CTR:
> [    2.389743] REGS: c95abab0 TRAP: 0700 Tainted: G        W (4.20.0-rc5-00560-g6bfb52e23a00-dirty)
> [    2.400042] MSR: 00029032 CR: 24042204 XER:
> [    2.406669]
> [    2.406669] GPR00: c02f2244 c95abb60 c6262990 c95abd80 256a 0001 0001 0001
> [    2.406669] GPR08: 2000 0010 0010 24042202 0100 c95abd88
> [    2.406669] GPR16: c05569d4 0001 0010 c95abc88 c0615664 0004
> [    2.406669] GPR24: 0010 c95abc88 c95abc88 c61ae210 c7ff6d40 c61ae210 3d68
> [    2.441559] NIP [c000c540] dma_nommu_map_page+0x44/0xd4
> [    2.446720] LR [c000c584] dma_nommu_map_page+0x88/0xd4
> [    2.451762] Call Trace:
> [    2.454195] [c95abb60] [82000808] 0x82000808 (unreliable)
> [    2.459572] [c95abb80] [c02f2244] talitos_edesc_alloc+0xbc/0x3c8
> [    2.465493] [c95abbb0] [c02f2600] ablkcipher_edesc_alloc+0x4c/0x5c
> [    2.471606] [c95abbd0] [c02f4ed0] ablkcipher_encrypt+0x20/0x64
> [    2.477389] [c95abbe0] [c02023b0] __test_skcipher+0x4bc/0xa08
> [    2.483049] [c95abe00] [c0204b60] test_skcipher+0x2c/0xcc
> [    2.488385] [c95abe20] [c0204c48] alg_test_skcipher+0x48/0xbc
> [    2.494064] [c95abe40] [c0205cec] alg_test+0x164/0x2e8
> [    2.499142] [c95abf00] [c0200dec] cryptomgr_test+0x48/0x50
> [    2.504558] [c95abf10] [c0039ff4] kthread+0xe4/0x110
> [    2.509471] [c95abf40] [c000e1d0] ret_from_kernel_thread+0x14/0x1c
> [    2.515532] Instruction dump:
> [    2.518468] 7c7e1b78 7c9d2378 7cbf2b78 41820054 3d20c076 8089c200 3d20c076 7c84e850
> [    2.526127] 8129c204 7c842e70 7f844840 419c0008 <0fe0> 2f9e 54847022 7c84fa14
> [    2.533960] ---[ end trace bf78d94af73fe3b8 ]---
> [    2.539123] talitos ff02.crypto: master data transfer error
> [    2.544775] talitos ff02.crypto: TEA error: ISR 0x2000_0040
> [    2.551625] alg: skcipher: encryption failed on test 1 for
> ecb-aes-talitos: ret=22
>
> IV cannot be on stack when CONFIG_VMAP_STACK is selected because the stack
> cannot be DMA mapped anymore.
>
> This patch copies the IV into the extended descriptor when iv is not
> a valid linear address.

Please make the copy unconditional.

Thanks.
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH V6 4/4] powerpc/mm/iommu: Allow large IOMMU page size only for hugetlb backing
THP pages can get split during different code paths. An incremented
reference count does imply we will not split the compound page. But the
pmd entry can be converted to level 4 pte entries. Keep the code simpler
by allowing large IOMMU page size only if the guest ram is backed by
hugetlb pages.

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/mm/mmu_context_iommu.c | 24 +++-
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 52ccab294b47..62c7590378d4 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -98,8 +98,6 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	struct mm_iommu_table_group_mem_t *mem;
 	long i, ret = 0, locked_entries = 0;
 	unsigned int pageshift;
-	unsigned long flags;
-	unsigned long cur_ua;

 	mutex_lock(&mem_list_mutex);

@@ -167,22 +165,14 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	for (i = 0; i < entries; ++i) {
 		struct page *page = mem->hpages[i];

-		cur_ua = ua + (i << PAGE_SHIFT);
-		if (mem->pageshift > PAGE_SHIFT && PageCompound(page)) {
-			pte_t *pte;
+		/*
+		 * Allow to use larger than 64k IOMMU pages. Only do that
+		 * if we are backed by hugetlb.
+		 */
+		if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page)) {
 			struct page *head = compound_head(page);
-			unsigned int compshift = compound_order(head);
-			unsigned int pteshift;
-
-			local_irq_save(flags); /* disables as well */
-			pte = find_linux_pte(mm->pgd, cur_ua, NULL, &pteshift);
-
-			/* Double check it is still the same pinned page */
-			if (pte && pte_page(*pte) == head &&
-			    pteshift == compshift + PAGE_SHIFT)
-				pageshift = max_t(unsigned int, pteshift,
-						  PAGE_SHIFT);
-			local_irq_restore(flags);
+
+			pageshift = compound_order(head) + PAGE_SHIFT;
 		}
 		mem->pageshift = min(mem->pageshift, pageshift);
 		/*
--
2.20.1
[PATCH V6 3/4] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_get
Current code doesn't do page migration if the page allocated is a compound page. With HugeTLB migration support, we can end up allocating hugetlb pages from the CMA region. Also THP pages can be allocated from the CMA region. This patch updates the code to handle compound pages correctly.

This uses the new helper get_user_pages_cma_migrate(). It does a single get_user_pages() with the right count, instead of doing one get_user_pages() per page. That avoids reading the page table multiple times.

The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union. We use the same storage location to store pointers to struct page. We cannot update all the code paths to use struct page *, because we access hpas in real mode and we can't do the struct page * to pfn conversion in real mode.

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/mm/mmu_context_iommu.c | 124 +---
 1 file changed, 37 insertions(+), 87 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index a712a650a8b6..52ccab294b47 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include
 
 static DEFINE_MUTEX(mem_list_mutex);
 
@@ -34,8 +35,18 @@ struct mm_iommu_table_group_mem_t {
 	atomic64_t mapped;
 	unsigned int pageshift;
 	u64 ua;			/* userspace address */
-	u64 entries;		/* number of entries in hpas[] */
-	u64 *hpas;		/* vmalloc'ed */
+	u64 entries;		/* number of entries in hpas/hpages[] */
+	/*
+	 * in mm_iommu_get we temporarily use this to store
+	 * struct page address.
+	 *
+	 * We need to convert ua to hpa in real mode. Make it
+	 * simpler by storing physical address.
+	 */
+	union {
+		struct page **hpages;	/* vmalloc'ed */
+		phys_addr_t *hpas;
+	};
 #define MM_IOMMU_TABLE_INVALID_HPA	((uint64_t)-1)
 	u64 dev_hpa;		/* Device memory base address */
 };
@@ -80,64 +91,15 @@ bool mm_iommu_preregistered(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
-/*
- * Taken from alloc_migrate_target with changes to remove CMA allocations
- */
-struct page *new_iommu_non_cma_page(struct page *page, unsigned long private)
-{
-	gfp_t gfp_mask = GFP_USER;
-	struct page *new_page;
-
-	if (PageCompound(page))
-		return NULL;
-
-	if (PageHighMem(page))
-		gfp_mask |= __GFP_HIGHMEM;
-
-	/*
-	 * We don't want the allocation to force an OOM if possibe
-	 */
-	new_page = alloc_page(gfp_mask | __GFP_NORETRY | __GFP_NOWARN);
-	return new_page;
-}
-
-static int mm_iommu_move_page_from_cma(struct page *page)
-{
-	int ret = 0;
-	LIST_HEAD(cma_migrate_pages);
-
-	/* Ignore huge pages for now */
-	if (PageCompound(page))
-		return -EBUSY;
-
-	lru_add_drain();
-	ret = isolate_lru_page(page);
-	if (ret)
-		return ret;
-
-	list_add(&page->lru, &cma_migrate_pages);
-	put_page(page); /* Drop the gup reference */
-
-	ret = migrate_pages(&cma_migrate_pages, new_iommu_non_cma_page,
-				NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE);
-	if (ret) {
-		if (!list_empty(&cma_migrate_pages))
-			putback_movable_pages(&cma_migrate_pages);
-	}
-
-	return 0;
-}
-
 static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
-		unsigned long entries, unsigned long dev_hpa,
-		struct mm_iommu_table_group_mem_t **pmem)
+			      unsigned long entries, unsigned long dev_hpa,
+			      struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
-	long i, j, ret = 0, locked_entries = 0;
+	long i, ret = 0, locked_entries = 0;
 	unsigned int pageshift;
 	unsigned long flags;
 	unsigned long cur_ua;
-	struct page *page = NULL;
 
 	mutex_lock(&mem_list_mutex);
 
@@ -187,41 +149,25 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 		goto unlock_exit;
 	}
 
+	ret = get_user_pages_cma_migrate(ua, entries, 1, mem->hpages);
+	if (ret != entries) {
+		/* free the reference taken */
+		for (i = 0; i < ret; i++)
+			put_page(mem->hpages[i]);
+
+		vfree(mem->hpas);
+		kfree(mem);
+		ret = -EFAULT;
+		goto unlock_exit;
+	} else {
+		ret = 0;
+	}
+
+	pageshift = PAGE_SHIFT;
 	for (i = 0; i < entries; ++i) {
+		struct page *page = mem->hpages[i];
+
 		cur_ua = ua + (i << PAGE_SHIFT);
-		if (1 != get_user_pages_fast(cur_ua,
-
[PATCH V6 2/4] mm: Add get_user_pages_cma_migrate
This helper does a get_user_pages_fast(), making sure we migrate pages found in the CMA area before taking the page reference. This makes sure that we don't keep non-movable pages (due to the page reference count) in the CMA area.

This will be used by ppc64 in a later patch to avoid pinning pages in the CMA region. ppc64 uses the CMA region for allocation of the hardware page table (hash page table), and not being able to migrate pages out of the CMA region results in page table allocation failures. One case where we hit this easily is when a guest uses a VFIO passthrough device. VFIO locks all the guest's memory, and if the guest memory is backed by the CMA region, it becomes unmovable, resulting in fragmenting the CMA and possibly preventing other guests from allocating a large enough hash page table.

NOTE: We allocate the new page without using __GFP_THISNODE

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/hugetlb.h |   2 +
 include/linux/migrate.h |   3 +
 mm/hugetlb.c            |   4 +-
 mm/migrate.c            | 149
 4 files changed, 156 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 087fd5f48c91..1eed0cdaec0e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -371,6 +371,8 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask);
 struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address);
+struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+				     int nid, nodemask_t *nmask);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e13d9bf2f9a5..bc83e12a06e9 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -285,6 +285,9 @@ static inline int migrate_vma(const struct migrate_vma_ops *ops,
 }
 #endif /* IS_ENABLED(CONFIG_MIGRATE_VMA_HELPER) */
 
+extern int get_user_pages_cma_migrate(unsigned long start, int nr_pages, int write,
+				      struct page **pages);
+
 #endif /* CONFIG_MIGRATION */
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 745088810965..fc4afaec1055 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1586,8 +1586,8 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	return page;
 }
 
-static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
-				    int nid, nodemask_t *nmask)
+struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+				     int nid, nodemask_t *nmask)
 {
 	struct page *page;
 
diff --git a/mm/migrate.c b/mm/migrate.c
index ccf8966caf6f..5e21c7aee942 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2982,3 +2982,152 @@ int migrate_vma(const struct migrate_vma_ops *ops,
 }
 EXPORT_SYMBOL(migrate_vma);
 #endif /* defined(MIGRATE_VMA_HELPER) */
+
+static struct page *new_non_cma_page(struct page *page, unsigned long private)
+{
+	/*
+	 * We want to make sure we allocate the new page from the same node
+	 * as the source page.
+	 */
+	int nid = page_to_nid(page);
+	/*
+	 * Trying to allocate a page for migration. Ignore allocation
+	 * failure warnings. We don't force __GFP_THISNODE here because
+	 * this node here is the node where we have CMA reservation and
+	 * in some case these nodes will have really less non movable
+	 * allocation memory.
+	 */
+	gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;
+
+	if (PageHighMem(page))
+		gfp_mask |= __GFP_HIGHMEM;
+
+#ifdef CONFIG_HUGETLB_PAGE
+	if (PageHuge(page)) {
+		struct hstate *h = page_hstate(page);
+		/*
+		 * We don't want to dequeue from the pool because pool pages will
+		 * mostly be from the CMA region.
+		 */
+		return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
+	}
+#endif
+	if (PageTransHuge(page)) {
+		struct page *thp;
+		/*
+		 * ignore allocation failure warnings
+		 */
+		gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_NOWARN;
+
+		/*
+		 * Remove the movable mask so that we don't allocate from
+		 * CMA area again.
+		 */
+		thp_gfpmask &= ~__GFP_MOVABLE;
+		thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	}
+
+	return __alloc_pages_node(nid, gfp_mask, 0);
+}
+
+/**
+ * get_user_pages_cma_migrate() - pin user pages in memory by migrating pa
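The gfp-mask policy in new_non_cma_page() above boils down to a few bit operations, which can be exercised in userspace. This is a hedged mimic with made-up flag values (the real constants live in the kernel's gfp.h, and the real function also takes hugetlb and node placement into account):

```c
/* Hypothetical flag values for this userspace sketch only */
#define GFP_USER      0x01u
#define __GFP_NOWARN  0x02u
#define __GFP_HIGHMEM 0x04u
#define __GFP_MOVABLE 0x08u
#define GFP_TRANSHUGE (GFP_USER | __GFP_MOVABLE)

/*
 * Mimic of the mask selection in new_non_cma_page(): THP targets start
 * from GFP_TRANSHUGE but drop __GFP_MOVABLE so the replacement page
 * cannot land in the CMA area again; ordinary pages keep GFP_USER and
 * pick up __GFP_HIGHMEM only when the source was highmem.
 */
unsigned int target_gfp(int is_thp, int is_highmem)
{
	unsigned int gfp = GFP_USER | __GFP_NOWARN;

	if (is_highmem)
		gfp |= __GFP_HIGHMEM;
	if (is_thp)
		return (GFP_TRANSHUGE | __GFP_NOWARN) & ~__GFP_MOVABLE;
	return gfp;
}
```

The key invariant is simply that no migration target ever carries __GFP_MOVABLE, since movable allocations are the ones the page allocator may satisfy from CMA.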
[PATCH V6 1/4] mm/cma: Add PF flag to force non cma alloc
This patch adds PF_MEMALLOC_NOCMA, which makes sure any allocation in that context is marked non-movable and hence cannot be satisfied by the CMA region. This is useful with get_user_pages_cma_migrate, where we take a page pin by migrating pages from the CMA region. Marking the section PF_MEMALLOC_NOCMA ensures that we avoid unnecessary page migration later.

Suggested-by: Andrea Arcangeli
Signed-off-by: Aneesh Kumar K.V
---
 include/linux/sched.h    |  1 +
 include/linux/sched/mm.h | 36
 2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 89541d248893..047c8c469774 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1406,6 +1406,7 @@ extern struct pid *cad_pid;
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
 #define PF_MEMSTALL		0x01000000	/* Stalled due to lack of memory */
+#define PF_MEMALLOC_NOCMA	0x02000000	/* All allocation request will have __GFP_MOVABLE cleared */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_allowed */
 #define PF_MCE_EARLY		0x08000000	/* Early kill for mce process policy */
 #define PF_MUTEX_TESTER		0x20000000	/* Thread belongs to the rt mutex tester */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 3bfa6a0cbba4..b336e7e2ca49 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -148,17 +148,24 @@ static inline bool in_vfork(struct task_struct *tsk)
  * Applies per-task gfp context to the given allocation flags.
  * PF_MEMALLOC_NOIO implies GFP_NOIO
  * PF_MEMALLOC_NOFS implies GFP_NOFS
+ * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
  */
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
-	/*
-	 * NOIO implies both NOIO and NOFS and it is a weaker context
-	 * so always make sure it makes precedence
-	 */
-	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
-		flags &= ~(__GFP_IO | __GFP_FS);
-	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
-		flags &= ~__GFP_FS;
+	if (unlikely(current->flags &
+		     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
+		/*
+		 * NOIO implies both NOIO and NOFS and it is a weaker context
+		 * so always make sure it makes precedence
+		 */
+		if (current->flags & PF_MEMALLOC_NOIO)
+			flags &= ~(__GFP_IO | __GFP_FS);
+		else if (current->flags & PF_MEMALLOC_NOFS)
+			flags &= ~__GFP_FS;
+
+		if (current->flags & PF_MEMALLOC_NOCMA)
+			flags &= ~__GFP_MOVABLE;
+	}
 	return flags;
 }
 
@@ -248,6 +255,19 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC) | flags;
 }
 
+static inline unsigned int memalloc_nocma_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
+
+	current->flags |= PF_MEMALLOC_NOCMA;
+	return flags;
+}
+
+static inline void memalloc_nocma_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
+}
+
 #ifdef CONFIG_MEMCG
 /**
  * memalloc_use_memcg - Starts the remote memcg charging scope.
--
2.20.1
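The save/restore pattern the patch introduces is the same idiom used by the existing PF_MEMALLOC_NOIO/NOFS scopes. Below is a self-contained userspace mimic of the plumbing, with invented flag values and a plain global standing in for current->flags, purely to illustrate how the scope nests and how the gfp mask is filtered:

```c
/* Hypothetical flag values; the real ones live in sched.h and gfp.h */
#define PF_MEMALLOC_NOIO  0x1u
#define PF_MEMALLOC_NOFS  0x2u
#define PF_MEMALLOC_NOCMA 0x4u
#define __GFP_IO          0x10u
#define __GFP_FS          0x20u
#define __GFP_MOVABLE     0x40u

static unsigned int current_flags; /* stands in for current->flags */

/* Mimic of current_gfp_context(): strip gfp bits forbidden by the task context */
unsigned int current_gfp_context(unsigned int flags)
{
	if (current_flags &
	    (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA)) {
		/* NOIO is the weaker context, so it takes precedence */
		if (current_flags & PF_MEMALLOC_NOIO)
			flags &= ~(__GFP_IO | __GFP_FS);
		else if (current_flags & PF_MEMALLOC_NOFS)
			flags &= ~__GFP_FS;
		if (current_flags & PF_MEMALLOC_NOCMA)
			flags &= ~__GFP_MOVABLE;
	}
	return flags;
}

/* Open a no-CMA scope; the returned value lets scopes nest safely */
unsigned int memalloc_nocma_save(void)
{
	unsigned int flags = current_flags & PF_MEMALLOC_NOCMA;

	current_flags |= PF_MEMALLOC_NOCMA;
	return flags;
}

/* Close the scope, restoring whatever the outer scope had set */
void memalloc_nocma_restore(unsigned int flags)
{
	current_flags = (current_flags & ~PF_MEMALLOC_NOCMA) | flags;
}
```

Because restore puts back the saved bit rather than unconditionally clearing it, a nested save/restore pair inside an outer no-CMA scope leaves the outer scope intact.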
[PATCH V6 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region
ppc64 uses a CMA area for the allocation of the guest page table (hash page table). We won't be able to start a guest if we fail to allocate the hash page table. We have observed hash page table allocation failures because we failed to migrate pages out of the CMA region because they were pinned. This happens when we are using VFIO. VFIO on ppc64 pins the entire guest RAM. If the guest RAM pages get allocated out of the CMA region, we won't be able to migrate those pages. The pages are also pinned for the lifetime of the guest.

Currently we support migration of non-compound pages. With THP and with the addition of hugetlb migration we can end up allocating compound pages from the CMA region. This patch series adds support for migrating compound pages. The first patch adds the helper get_user_pages_cma_migrate(), which pins the pages, making sure we migrate them out of the CMA region before incrementing the reference count.

Changes from V5:
* Add PF_MEMALLOC_NOCMA
* Remove __GFP_THISNODE when allocating the target page for migration

Changes from V4:
* Use __GFP_NOWARN when allocating pages to avoid page allocation failure warnings.

Changes from V3:
* Move the hugetlb check before the transhuge check
* Use the compound head page when isolating a hugetlb page

Aneesh Kumar K.V (4):
  mm/cma: Add PF flag to force non cma alloc
  mm: Add get_user_pages_cma_migrate
  powerpc/mm/iommu: Allow migration of cma allocated pages during
    mm_iommu_get
  powerpc/mm/iommu: Allow large IOMMU page size only for hugetlb backing

 arch/powerpc/mm/mmu_context_iommu.c | 144 ---
 include/linux/hugetlb.h             |   2 +
 include/linux/migrate.h             |   3 +
 include/linux/sched.h               |   1 +
 include/linux/sched/mm.h            |  36 +--
 mm/hugetlb.c                        |   4 +-
 mm/migrate.c                        | 149
 7 files changed, 227 insertions(+), 112 deletions(-)

--
2.20.1
Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]
On Sun, Jan 06, 2019 at 09:43:46AM +1100, Benjamin Herrenschmidt wrote: > On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote: > > > > > Interesting. I've investigated this further, though I don't have as > > > many new clues as I'd like. The problem occurs reliably, at least on > > > one particular type of machine (a POWER8 "Garrison" with ConnectX-4). > > > I don't yet know if it occurs with other machines, I'm having trouble > > > getting access to other machines with a suitable card. I didn't > > > manage to reproduce it on a different POWER8 machine with a > > > ConnectX-5, but I don't know if it's the difference in machine or > > > difference in card revision that's important. > > > > Make sure the card has the latest firmware is always good advice.. > > > > > So possibilities that occur to me: > > > * It's something specific about how the vfio-pci driver uses D3 > > > state - have you tried rebinding your device to vfio-pci? > > > * It's something specific about POWER, either the kernel or the PCI > > > bridge hardware > > > * It's something specific about this particular type of machine > > > > Does the EEH indicate what happend to actually trigger it? > > In a very cryptic way that requires manual parsing using non-public > docs sadly but yes. From the look of it, it's a completion timeout. > > Looks to me like we don't get a response to a config space access > during the change of D state. I don't know if it's the write of the D3 > state itself or the read back though (it's probably detected on the > read back or a subsequent read, but that doesn't tell me which specific > one failed). If it is just one card doing it (again, check you have latest firmware) I wonder if it is a sketchy PCI-E electrical link that is causing a long re-training cycle? Can you tell if the PCI-E link is permanently gone or does it eventually return? Does the card work in Gen 3 when it starts? Is there any indication of PCI-E link errors? Everytime or sometimes? 
POWER 8 firmware is good? If the link does eventually come back, is the POWER8's D3 resumption timeout long enough? If this doesn't lead to an obvious conclusion you'll probably need to connect to IBM's Mellanox support team to get more information from the card side. Jason
[PATCHv4 3/4] pci: layerscape: Add the EP mode support.
Add the PCIe EP mode support for layerscape platform. Signed-off-by: Xiaowei Bao --- v2: - remove the EP mode check function. v3: - modif the return value when enter default case. v4: - no change. drivers/pci/controller/dwc/Makefile|2 +- drivers/pci/controller/dwc/pci-layerscape-ep.c | 146 2 files changed, 147 insertions(+), 1 deletions(-) create mode 100644 drivers/pci/controller/dwc/pci-layerscape-ep.c diff --git a/drivers/pci/controller/dwc/Makefile b/drivers/pci/controller/dwc/Makefile index fcf91ea..e97e920 100644 --- a/drivers/pci/controller/dwc/Makefile +++ b/drivers/pci/controller/dwc/Makefile @@ -8,7 +8,7 @@ obj-$(CONFIG_PCI_EXYNOS) += pci-exynos.o obj-$(CONFIG_PCI_IMX6) += pci-imx6.o obj-$(CONFIG_PCIE_SPEAR13XX) += pcie-spear13xx.o obj-$(CONFIG_PCI_KEYSTONE) += pci-keystone.o -obj-$(CONFIG_PCI_LAYERSCAPE) += pci-layerscape.o +obj-$(CONFIG_PCI_LAYERSCAPE) += pci-layerscape.o pci-layerscape-ep.o obj-$(CONFIG_PCIE_QCOM) += pcie-qcom.o obj-$(CONFIG_PCIE_ARMADA_8K) += pcie-armada8k.o obj-$(CONFIG_PCIE_ARTPEC6) += pcie-artpec6.o diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c b/drivers/pci/controller/dwc/pci-layerscape-ep.c new file mode 100644 index 000..dafb528 --- /dev/null +++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c @@ -0,0 +1,146 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * PCIe controller EP driver for Freescale Layerscape SoCs + * + * Copyright (C) 2018 NXP Semiconductor. 
+ * + * Author: Xiaowei Bao + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "pcie-designware.h" + +#define PCIE_DBI2_OFFSET 0x1000 /* DBI2 base address*/ + +struct ls_pcie_ep { + struct dw_pcie *pci; +}; + +#define to_ls_pcie_ep(x) dev_get_drvdata((x)->dev) + +static int ls_pcie_establish_link(struct dw_pcie *pci) +{ + return 0; +} + +static const struct dw_pcie_ops ls_pcie_ep_ops = { + .start_link = ls_pcie_establish_link, +}; + +static const struct of_device_id ls_pcie_ep_of_match[] = { + { .compatible = "fsl,ls-pcie-ep",}, + { }, +}; + +static void ls_pcie_ep_init(struct dw_pcie_ep *ep) +{ + struct dw_pcie *pci = to_dw_pcie_from_ep(ep); + struct pci_epc *epc = ep->epc; + enum pci_barno bar; + + for (bar = BAR_0; bar <= BAR_5; bar++) + dw_pcie_ep_reset_bar(pci, bar); + + epc->features |= EPC_FEATURE_NO_LINKUP_NOTIFIER; +} + +static int ls_pcie_ep_raise_irq(struct dw_pcie_ep *ep, u8 func_no, + enum pci_epc_irq_type type, u16 interrupt_num) +{ + struct dw_pcie *pci = to_dw_pcie_from_ep(ep); + + switch (type) { + case PCI_EPC_IRQ_LEGACY: + return dw_pcie_ep_raise_legacy_irq(ep, func_no); + case PCI_EPC_IRQ_MSI: + return dw_pcie_ep_raise_msi_irq(ep, func_no, interrupt_num); + case PCI_EPC_IRQ_MSIX: + return dw_pcie_ep_raise_msix_irq(ep, func_no, interrupt_num); + default: + dev_err(pci->dev, "UNKNOWN IRQ type\n"); + return -EINVAL; + } +} + +static struct dw_pcie_ep_ops pcie_ep_ops = { + .ep_init = ls_pcie_ep_init, + .raise_irq = ls_pcie_ep_raise_irq, +}; + +static int __init ls_add_pcie_ep(struct ls_pcie_ep *pcie, + struct platform_device *pdev) +{ + struct dw_pcie *pci = pcie->pci; + struct device *dev = pci->dev; + struct dw_pcie_ep *ep; + struct resource *res; + int ret; + + ep = &pci->ep; + ep->ops = &pcie_ep_ops; + + res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "addr_space"); + if (!res) + return -EINVAL; + + ep->phys_base = res->start; + ep->addr_size = resource_size(res); + + ret = 
dw_pcie_ep_init(ep); + if (ret) { + dev_err(dev, "failed to initialize endpoint\n"); + return ret; + } + + return 0; +} + +static int __init ls_pcie_ep_probe(struct platform_device *pdev) +{ + struct device *dev = &pdev->dev; + struct dw_pcie *pci; + struct ls_pcie_ep *pcie; + struct resource *dbi_base; + int ret; + + pcie = devm_kzalloc(dev, sizeof(*pcie), GFP_KERNEL); + if (!pcie) + return -ENOMEM; + + pci = devm_kzalloc(dev, sizeof(*pci), GFP_KERNEL); + if (!pci) + return -ENOMEM; + + dbi_base = platform_get_resource_byname(pdev, IORESOURCE_MEM, "regs"); + pci->dbi_base = devm_pci_remap_cfg_resource(dev, dbi_base); + if (IS_ERR(pci->dbi_base)) + return PTR_ERR(pci->dbi_base); + + pci->dbi_base2 = pci->dbi_base + PCIE_DBI2_OFFSET; + pci->dev = dev; + pci->ops = &ls_pcie_ep_ops; + pcie->pci = pci; + + platform_set_drvdata(pdev, pcie); + + ret = ls_add_pcie_ep(pcie, pdev); + + return ret; +} + +static struct pla
[PATCHv4 4/4] misc: pci_endpoint_test: Add the layerscape EP device support
Add the layerscape EP device support in pci_endpoint_test driver.

Signed-off-by: Xiaowei Bao
---
v2:
 - no change
v3:
 - no change
v4:
 - delete the comments.

 drivers/misc/pci_endpoint_test.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/misc/pci_endpoint_test.c b/drivers/misc/pci_endpoint_test.c
index 896e2df..29582fe 100644
--- a/drivers/misc/pci_endpoint_test.c
+++ b/drivers/misc/pci_endpoint_test.c
@@ -788,6 +788,7 @@ static void pci_endpoint_test_remove(struct pci_dev *pdev)
 static const struct pci_device_id pci_endpoint_test_tbl[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_DRA74x) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_DRA72x) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_FREESCALE, 0x81c0) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_SYNOPSYS, 0xedda) },
 	{ }
 };
--
1.7.1
[PATCHv4 1/4] dt-bindings: add DT binding for the layerscape PCIe controller with EP mode
Add the documentation for the Device Tree binding for the layerscape PCIe controller with EP mode. Signed-off-by: Xiaowei Bao --- v2: - Add the SoC specific compatibles. v3: - modify the commit message. v4: - no change. .../devicetree/bindings/pci/layerscape-pci.txt |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/Documentation/devicetree/bindings/pci/layerscape-pci.txt b/Documentation/devicetree/bindings/pci/layerscape-pci.txt index 9b2b8d6..e20ceaa 100644 --- a/Documentation/devicetree/bindings/pci/layerscape-pci.txt +++ b/Documentation/devicetree/bindings/pci/layerscape-pci.txt @@ -13,6 +13,7 @@ information. Required properties: - compatible: should contain the platform identifier such as: + RC mode: "fsl,ls1021a-pcie" "fsl,ls2080a-pcie", "fsl,ls2085a-pcie" "fsl,ls2088a-pcie" @@ -20,6 +21,8 @@ Required properties: "fsl,ls1046a-pcie" "fsl,ls1043a-pcie" "fsl,ls1012a-pcie" + EP mode: + "fsl,ls1046a-pcie-ep", "fsl,ls-pcie-ep" - reg: base addresses and lengths of the PCIe controller register blocks. - interrupts: A list of interrupt outputs of the controller. Must contain an entry for each entry in the interrupt-names property. -- 1.7.1
[PATCHv4 2/4] arm64: dts: Add the PCIE EP node in dts
Add the PCIE EP node in dts for ls1046a. Signed-off-by: Xiaowei Bao --- v2: - Add the SoC specific compatibles. v3: - no change v4: - no change arch/arm64/boot/dts/freescale/fsl-ls1046a.dtsi | 34 +++- 1 files changed, 33 insertions(+), 1 deletions(-) diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1046a.dtsi b/arch/arm64/boot/dts/freescale/fsl-ls1046a.dtsi index 9a2106e..e373826 100644 --- a/arch/arm64/boot/dts/freescale/fsl-ls1046a.dtsi +++ b/arch/arm64/boot/dts/freescale/fsl-ls1046a.dtsi @@ -657,6 +657,17 @@ status = "disabled"; }; + pcie_ep@340 { + compatible = "fsl,ls1046a-pcie-ep","fsl,ls-pcie-ep"; + reg = <0x00 0x0340 0x0 0x0010 + 0x40 0x 0x8 0x>; + reg-names = "regs", "addr_space"; + num-ib-windows = <6>; + num-ob-windows = <6>; + num-lanes = <2>; + status = "disabled"; + }; + pcie@350 { compatible = "fsl,ls1046a-pcie"; reg = <0x00 0x0350 0x0 0x0010 /* controller registers */ @@ -683,6 +694,17 @@ status = "disabled"; }; + pcie_ep@350 { + compatible = "fsl,ls1046a-pcie-ep","fsl,ls-pcie-ep"; + reg = <0x00 0x0350 0x0 0x0010 + 0x48 0x 0x8 0x>; + reg-names = "regs", "addr_space"; + num-ib-windows = <6>; + num-ob-windows = <6>; + num-lanes = <2>; + status = "disabled"; + }; + pcie@360 { compatible = "fsl,ls1046a-pcie"; reg = <0x00 0x0360 0x0 0x0010 /* controller registers */ @@ -709,6 +731,17 @@ status = "disabled"; }; + pcie_ep@360 { + compatible = "fsl,ls1046a-pcie-ep", "fsl,ls-pcie-ep"; + reg = <0x00 0x0360 0x0 0x0010 + 0x50 0x 0x8 0x>; + reg-names = "regs", "addr_space"; + num-ib-windows = <6>; + num-ob-windows = <6>; + num-lanes = <2>; + status = "disabled"; + }; + qdma: dma-controller@838 { compatible = "fsl,ls1046a-qdma", "fsl,ls1021a-qdma"; reg = <0x0 0x838 0x0 0x1000>, /* Controller regs */ @@ -729,7 +762,6 @@ queue-sizes = <64 64>; big-endian; }; - }; reserved-memory { -- 1.7.1
[PATCH 16/19] KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration
These are used to capture the XIVE END table of the KVM device. It relies on an OPAL call to retrieve from the XIVE IC the EQ toggle bit and index which are updated by the HW when events are enqueued in the guest RAM. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/uapi/asm/kvm.h | 21 arch/powerpc/kvm/book3s_xive_native.c | 166 ++ 2 files changed, 187 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index faf024f39858..95302558ce10 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -684,6 +684,7 @@ struct kvm_ppc_cpu_char { #define KVM_DEV_XIVE_GRP_SOURCES 2 /* 64-bit source attributes */ #define KVM_DEV_XIVE_GRP_SYNC 3 /* 64-bit source attributes */ #define KVM_DEV_XIVE_GRP_EAS 4 /* 64-bit eas attributes */ +#define KVM_DEV_XIVE_GRP_EQ5 /* 64-bit eq attributes */ /* Layout of 64-bit XIVE source attribute values */ #define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0) @@ -699,4 +700,24 @@ struct kvm_ppc_cpu_char { #define KVM_XIVE_EAS_EISN_SHIFT33 #define KVM_XIVE_EAS_EISN_MASK 0xfffeULL +/* Layout of 64-bit eq attribute */ +#define KVM_XIVE_EQ_PRIORITY_SHIFT 0 +#define KVM_XIVE_EQ_PRIORITY_MASK 0x7 +#define KVM_XIVE_EQ_SERVER_SHIFT 3 +#define KVM_XIVE_EQ_SERVER_MASK0xfff8ULL + +/* Layout of 64-bit eq attribute values */ +struct kvm_ppc_xive_eq { + __u32 flags; + __u32 qsize; + __u64 qpage; + __u32 qtoggle; + __u32 qindex; +}; + +#define KVM_XIVE_EQ_FLAG_ENABLED 0x0001 +#define KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY 0x0002 +#define KVM_XIVE_EQ_FLAG_ESCALATE 0x0004 + + #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 0468b605baa7..f4eb71eafc57 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -607,6 +607,164 @@ static int kvmppc_xive_native_get_eas(struct kvmppc_xive *xive, long irq, return 0; } +static int kvmppc_xive_native_set_queue(struct 
kvmppc_xive *xive, long eq_idx, + u64 addr) +{ + struct kvm *kvm = xive->kvm; + struct kvm_vcpu *vcpu; + struct kvmppc_xive_vcpu *xc; + void __user *ubufp = (u64 __user *) addr; + u32 server; + u8 priority; + struct kvm_ppc_xive_eq kvm_eq; + int rc; + __be32 *qaddr = 0; + struct page *page; + struct xive_q *q; + + /* +* Demangle priority/server tuple from the EQ index +*/ + priority = (eq_idx & KVM_XIVE_EQ_PRIORITY_MASK) >> + KVM_XIVE_EQ_PRIORITY_SHIFT; + server = (eq_idx & KVM_XIVE_EQ_SERVER_MASK) >> + KVM_XIVE_EQ_SERVER_SHIFT; + + if (copy_from_user(&kvm_eq, ubufp, sizeof(kvm_eq))) + return -EFAULT; + + vcpu = kvmppc_xive_find_server(kvm, server); + if (!vcpu) { + pr_err("Can't find server %d\n", server); + return -ENOENT; + } + xc = vcpu->arch.xive_vcpu; + + if (priority != xive_prio_from_guest(priority)) { + pr_err("Trying to restore invalid queue %d for VCPU %d\n", + priority, server); + return -EINVAL; + } + q = &xc->queues[priority]; + + pr_devel("%s VCPU %d priority %d fl:%x sz:%d addr:%llx g:%d idx:%d\n", +__func__, server, priority, kvm_eq.flags, +kvm_eq.qsize, kvm_eq.qpage, kvm_eq.qtoggle, kvm_eq.qindex); + + rc = xive_native_validate_queue_size(kvm_eq.qsize); + if (rc || !kvm_eq.qsize) { + pr_err("invalid queue size %d\n", kvm_eq.qsize); + return rc; + } + + page = gfn_to_page(kvm, gpa_to_gfn(kvm_eq.qpage)); + if (is_error_page(page)) { + pr_warn("Couldn't get guest page for %llx!\n", kvm_eq.qpage); + return -ENOMEM; + } + qaddr = page_to_virt(page) + (kvm_eq.qpage & ~PAGE_MASK); + + /* Backup queue page guest address for migration */ + q->guest_qpage = kvm_eq.qpage; + q->guest_qsize = kvm_eq.qsize; + + rc = xive_native_configure_queue(xc->vp_id, q, priority, +(__be32 *) qaddr, kvm_eq.qsize, true); + if (rc) { + pr_err("Failed to configure queue %d for VCPU %d: %d\n", + priority, xc->server_num, rc); + put_page(page); + return rc; + } + + rc = xive_native_set_queue_state(xc->vp_id, priority, kvm_eq.qtoggle, +kvm_eq.qindex); + if (rc) + goto error; + + 
rc = kvmppc_xive_attach_escalation(vcpu, priority); +error: + if (rc) + xive_native_cl
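The EQ attribute above demangles a (server, priority) tuple from the 64-bit attribute index using the shift/mask constants defined in the uapi header. The pair of helpers below is a hedged userspace sketch of that encoding (the function names are invented, and the priority field width follows the 3-bit mask in the patch):

```c
#include <stdint.h>

/* Shifts and masks mirroring the patch's KVM_XIVE_EQ_* definitions */
#define KVM_XIVE_EQ_PRIORITY_SHIFT 0
#define KVM_XIVE_EQ_PRIORITY_MASK  0x7ULL
#define KVM_XIVE_EQ_SERVER_SHIFT   3

/* Pack a (server, priority) tuple into an EQ attribute index */
uint64_t xive_eq_idx(uint32_t server, uint8_t priority)
{
	return ((uint64_t)server << KVM_XIVE_EQ_SERVER_SHIFT) |
	       ((uint64_t)priority & KVM_XIVE_EQ_PRIORITY_MASK);
}

/* Demangle the tuple again, as kvmppc_xive_native_set_queue() does */
void xive_eq_demangle(uint64_t eq_idx, uint32_t *server, uint8_t *priority)
{
	*priority = (eq_idx & KVM_XIVE_EQ_PRIORITY_MASK) >>
		    KVM_XIVE_EQ_PRIORITY_SHIFT;
	*server = (uint32_t)(eq_idx >> KVM_XIVE_EQ_SERVER_SHIFT);
}
```

Packing both values into the attribute index leaves the 64-bit attribute payload free to carry the struct kvm_ppc_xive_eq describing the queue itself.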
[PATCH 13/19] KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration
When migration of a VM is initiated, a first copy of the RAM is transferred to the destination before the VM is stopped. At that time, QEMU needs to perform a XIVE quiesce sequence to stop the flow of event notifications and stabilize the EQs. The sources are masked and the XIVE IC is synced with the KVM ioctl KVM_DEV_XIVE_GRP_SYNC. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/uapi/asm/kvm.h | 1 + arch/powerpc/kvm/book3s_xive_native.c | 32 +++ 2 files changed, 33 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 6fc9660c5aec..f3b859223b80 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -681,6 +681,7 @@ struct kvm_ppc_cpu_char { #define KVM_DEV_XIVE_GET_TIMA_FD 2 #define KVM_DEV_XIVE_VC_BASE 3 #define KVM_DEV_XIVE_GRP_SOURCES 2 /* 64-bit source attributes */ +#define KVM_DEV_XIVE_GRP_SYNC 3 /* 64-bit source attributes */ /* Layout of 64-bit XIVE source attribute values */ #define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0) diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 4ca75aade069..a8052867afc1 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -459,6 +459,35 @@ static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq, return 0; } +static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr) +{ + struct kvmppc_xive_src_block *sb; + struct kvmppc_xive_irq_state *state; + struct xive_irq_data *xd; + u32 hw_num; + u16 src; + + pr_devel("%s irq=0x%lx\n", __func__, irq); + + sb = kvmppc_xive_find_source(xive, irq, &src); + if (!sb) + return -ENOENT; + + state = &sb->irq_state[src]; + + if (!state->valid) + return -ENOENT; + + arch_spin_lock(&sb->lock); + + kvmppc_xive_select_irq(state, &hw_num, &xd); + xive_native_sync_source(hw_num); + xive_native_sync_queue(hw_num); + + arch_spin_unlock(&sb->lock); + return 0; +} + static int 
kvmppc_xive_native_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { @@ -474,6 +503,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev, case KVM_DEV_XIVE_GRP_SOURCES: return kvmppc_xive_native_set_source(xive, attr->attr, attr->addr); + case KVM_DEV_XIVE_GRP_SYNC: + return kvmppc_xive_native_sync(xive, attr->attr, attr->addr); } return -ENXIO; } @@ -511,6 +542,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev, } break; case KVM_DEV_XIVE_GRP_SOURCES: + case KVM_DEV_XIVE_GRP_SYNC: if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ && attr->attr < KVMPPC_XIVE_NR_IRQS) return 0; -- 2.20.1
[PATCH 10/19] KVM: PPC: Book3S HV: add a EISN attribute to kvmppc_xive_irq_state
The Effective IRQ Source Number is the interrupt number pushed in the event queue that the guest OS will use to dispatch events internally. Signed-off-by: Cédric Le Goater --- arch/powerpc/kvm/book3s_xive.h | 3 +++ arch/powerpc/kvm/book3s_xive.c | 1 + 2 files changed, 4 insertions(+) diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index ae4a670eea63..67e07b41061d 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -57,6 +57,9 @@ struct kvmppc_xive_irq_state { bool saved_p; bool saved_q; u8 saved_scan_prio; + + /* Xive native */ + u32 eisn; /* Guest Effective IRQ number */ }; /* Select the "right" interrupt (IPI vs. passthrough) */ diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index bb5d32f7e4e6..e9f05d9c9ad5 100644 --- a/arch/powerpc/kvm/book3s_xive.c +++ b/arch/powerpc/kvm/book3s_xive.c @@ -1515,6 +1515,7 @@ struct kvmppc_xive_src_block *kvmppc_xive_create_src_block( for (i = 0; i < KVMPPC_XICS_IRQ_PER_ICS; i++) { sb->irq_state[i].number = (bid << KVMPPC_XICS_ICS_SHIFT) | i; + sb->irq_state[i].eisn = 0; sb->irq_state[i].guest_priority = MASKED; sb->irq_state[i].saved_priority = MASKED; sb->irq_state[i].act_priority = MASKED; -- 2.20.1
[PATCH 08/19] KVM: PPC: Book3S HV: add a VC_BASE control to the XIVE native device
The ESB MMIO region controls the interrupt sources of the guest. QEMU will query an fd (GET_ESB_FD ioctl) and map this region at a specific address for the guest to use. The guest will obtain this information using the H_INT_GET_SOURCE_INFO hcall. To inform KVM of the address setting used by QEMU, add a VC_BASE control to the KVM XIVE device Signed-off-by: Cédric Le Goater --- arch/powerpc/include/uapi/asm/kvm.h | 1 + arch/powerpc/kvm/book3s_xive.h| 3 +++ arch/powerpc/kvm/book3s_xive_native.c | 39 +++ 3 files changed, 43 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 89c140cb9e79..8b78b12aa118 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -679,5 +679,6 @@ struct kvm_ppc_cpu_char { #define KVM_DEV_XIVE_GRP_CTRL 1 #define KVM_DEV_XIVE_GET_ESB_FD 1 #define KVM_DEV_XIVE_GET_TIMA_FD 2 +#define KVM_DEV_XIVE_VC_BASE 3 #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index 5f22415520b4..ae4a670eea63 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -125,6 +125,9 @@ struct kvmppc_xive { /* Flags */ u8 single_escalation; + + /* VC base address for ESBs */ + u64 vc_base; }; #define KVMPPC_XIVE_Q_COUNT8 diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index ee9d12bf2dae..29a62914de55 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -153,6 +153,25 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev, return rc; } +static int kvmppc_xive_native_set_vc_base(struct kvmppc_xive *xive, u64 addr) +{ + u64 __user *ubufp = (u64 __user *) addr; + + if (get_user(xive->vc_base, ubufp)) + return -EFAULT; + return 0; +} + +static int kvmppc_xive_native_get_vc_base(struct kvmppc_xive *xive, u64 addr) +{ + u64 __user *ubufp = (u64 __user *) addr; + + if (put_user(xive->vc_base, ubufp)) + 
return -EFAULT; + + return 0; +} + static int xive_native_esb_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; @@ -289,6 +308,16 @@ static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr) static int kvmppc_xive_native_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { + struct kvmppc_xive *xive = dev->private; + + switch (attr->group) { + case KVM_DEV_XIVE_GRP_CTRL: + switch (attr->attr) { + case KVM_DEV_XIVE_VC_BASE: + return kvmppc_xive_native_set_vc_base(xive, attr->addr); + } + break; + } return -ENXIO; } @@ -304,6 +333,8 @@ static int kvmppc_xive_native_get_attr(struct kvm_device *dev, return kvmppc_xive_native_get_esb_fd(xive, attr->addr); case KVM_DEV_XIVE_GET_TIMA_FD: return kvmppc_xive_native_get_tima_fd(xive, attr->addr); + case KVM_DEV_XIVE_VC_BASE: + return kvmppc_xive_native_get_vc_base(xive, attr->addr); } break; } @@ -318,6 +349,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev, switch (attr->attr) { case KVM_DEV_XIVE_GET_ESB_FD: case KVM_DEV_XIVE_GET_TIMA_FD: + case KVM_DEV_XIVE_VC_BASE: return 0; } break; @@ -353,6 +385,11 @@ static void kvmppc_xive_native_free(struct kvm_device *dev) kfree(dev); } +/* + * ESB MMIO address of chip 0 + */ +#define XIVE_VC_BASE 0x00060100ull + static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type) { struct kvmppc_xive *xive; @@ -387,6 +424,8 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type) if (xive->vp_base == XIVE_INVALID_VP) ret = -ENOMEM; + xive->vc_base = XIVE_VC_BASE; + xive->single_escalation = xive_native_has_single_escalation(); if (ret) -- 2.20.1
[PATCH 09/19] KVM: PPC: Book3S HV: add a SET_SOURCE control to the XIVE native device
Interrupt sources are simply created at the OPAL level and then MASKED. KVM only needs to know about their type: LSI or MSI. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/uapi/asm/kvm.h | 5 + arch/powerpc/kvm/book3s_xive_native.c | 98 +++ .../powerpc/kvm/book3s_xive_native_template.c | 27 + 3 files changed, 130 insertions(+) create mode 100644 arch/powerpc/kvm/book3s_xive_native_template.c diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 8b78b12aa118..6fc9660c5aec 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -680,5 +680,10 @@ struct kvm_ppc_cpu_char { #define KVM_DEV_XIVE_GET_ESB_FD 1 #define KVM_DEV_XIVE_GET_TIMA_FD 2 #define KVM_DEV_XIVE_VC_BASE 3 +#define KVM_DEV_XIVE_GRP_SOURCES 2 /* 64-bit source attributes */ + +/* Layout of 64-bit XIVE source attribute values */ +#define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0) +#define KVM_XIVE_LEVEL_ASSERTED(1ULL << 1) #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 29a62914de55..2518640d4a58 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -31,6 +31,24 @@ #include "book3s_xive.h" +/* + * We still instantiate them here because we use some of the + * generated utility functions as well in this file. 
+ */ +#define XIVE_RUNTIME_CHECKS +#define X_PFX xive_vm_ +#define X_STATIC static +#define X_STAT_PFX stat_vm_ +#define __x_tima xive_tima +#define __x_eoi_page(xd) ((void __iomem *)((xd)->eoi_mmio)) +#define __x_trig_page(xd) ((void __iomem *)((xd)->trig_mmio)) +#define __x_writeb __raw_writeb +#define __x_readw __raw_readw +#define __x_readq __raw_readq +#define __x_writeq __raw_writeq + +#include "book3s_xive_native_template.c" + static void xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio) { struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu; @@ -305,6 +323,78 @@ static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr) return put_user(ret, ubufp); } +static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq, +u64 addr) +{ + struct kvmppc_xive_src_block *sb; + struct kvmppc_xive_irq_state *state; + u64 __user *ubufp = (u64 __user *) addr; + u64 val; + u16 idx; + + pr_devel("%s irq=0x%lx\n", __func__, irq); + + if (irq < KVMPPC_XIVE_FIRST_IRQ || irq >= KVMPPC_XIVE_NR_IRQS) + return -ENOENT; + + sb = kvmppc_xive_find_source(xive, irq, &idx); + if (!sb) { + pr_debug("No source, creating source block...\n"); + sb = kvmppc_xive_create_src_block(xive, irq); + if (!sb) { + pr_err("Failed to create block...\n"); + return -ENOMEM; + } + } + state = &sb->irq_state[idx]; + + if (get_user(val, ubufp)) { + pr_err("fault getting user info !\n"); + return -EFAULT; + } + + /* +* If the source doesn't already have an IPI, allocate +* one and get the corresponding data +*/ + if (!state->ipi_number) { + state->ipi_number = xive_native_alloc_irq(); + if (state->ipi_number == 0) { + pr_err("Failed to allocate IRQ !\n"); + return -ENOMEM; + } + xive_native_populate_irq_data(state->ipi_number, + &state->ipi_data); + pr_debug("%s allocated hw_irq=0x%x for irq=0x%lx\n", __func__, +state->ipi_number, irq); + } + + arch_spin_lock(&sb->lock); + + /* Restore LSI state */ + if (val & KVM_XIVE_LEVEL_SENSITIVE) { + state->lsi = true; + if (val & 
KVM_XIVE_LEVEL_ASSERTED) + state->asserted = true; + pr_devel(" LSI ! Asserted=%d\n", state->asserted); + } + + /* Mask IRQ to start with */ + state->act_server = 0; + state->act_priority = MASKED; + xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01); + xive_native_configure_irq(state->ipi_number, 0, MASKED, 0); + + /* Increment the number of valid sources and mark this one valid */ + if (!state->valid) + xive->src_count++; + state->valid = true; + + arch_spin_unlock(&sb->lock); + + return 0; +} + static int kvmppc_xive_native_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { @@ -317,6 +407,9 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev, return kvmppc_xive_native_s
[PATCH 07/19] KVM: PPC: Book3S HV: add a GET_TIMA_FD control to the XIVE native device
This will let the guest create a memory mapping to expose the XIVE MMIO region (TIMA) used for interrupt management at the CPU level. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/xive.h | 1 + arch/powerpc/include/uapi/asm/kvm.h | 1 + arch/powerpc/kvm/book3s_xive_native.c | 57 +++ arch/powerpc/sysdev/xive/native.c | 11 ++ 4 files changed, 70 insertions(+) diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h index d6be3e4d9fa4..7a7aa22d8258 100644 --- a/arch/powerpc/include/asm/xive.h +++ b/arch/powerpc/include/asm/xive.h @@ -23,6 +23,7 @@ * same offset regardless of where the code is executing */ extern void __iomem *xive_tima; +extern unsigned long xive_tima_os; /* * Offset in the TM area of our current execution level (provided by diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 6bb61ba141c2..89c140cb9e79 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -678,5 +678,6 @@ struct kvm_ppc_cpu_char { /* POWER9 XIVE Native Interrupt Controller */ #define KVM_DEV_XIVE_GRP_CTRL 1 #define KVM_DEV_XIVE_GET_ESB_FD 1 +#define KVM_DEV_XIVE_GET_TIMA_FD 2 #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index e20081f0c8d4..ee9d12bf2dae 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -232,6 +232,60 @@ static int kvmppc_xive_native_get_esb_fd(struct kvmppc_xive *xive, u64 addr) return put_user(ret, ubufp); } +static int xive_native_tima_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + + switch (vmf->pgoff) { + case 0: /* HW - forbid access */ + case 1: /* HV - forbid access */ + return VM_FAULT_SIGBUS; + case 2: /* OS */ + vmf_insert_pfn(vma, vmf->address, xive_tima_os >> PAGE_SHIFT); + return VM_FAULT_NOPAGE; + case 3: /* USER - TODO */ + default: + return VM_FAULT_SIGBUS; + } +} + +static const struct 
vm_operations_struct xive_native_tima_vmops = { + .fault = xive_native_tima_fault, +}; + +static int xive_native_tima_mmap(struct file *file, struct vm_area_struct *vma) +{ + /* +* The TIMA is four pages wide but only the last two pages (OS +* and User view) are accessible to the guest. The page fault +* handler will handle the permissions. +*/ + if (vma_pages(vma) + vma->vm_pgoff > 4) + return -EINVAL; + + vma->vm_flags |= VM_IO | VM_PFNMAP; + vma->vm_page_prot = pgprot_noncached_wc(vma->vm_page_prot); + vma->vm_ops = &xive_native_tima_vmops; + return 0; +} + +static const struct file_operations xive_native_tima_fops = { + .mmap = xive_native_tima_mmap, +}; + +static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr) +{ + u64 __user *ubufp = (u64 __user *) addr; + int ret; + + ret = anon_inode_getfd("[xive-tima]", &xive_native_tima_fops, xive, + O_RDWR | O_CLOEXEC); + if (ret < 0) + return ret; + + return put_user(ret, ubufp); +} + static int kvmppc_xive_native_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { @@ -248,6 +302,8 @@ static int kvmppc_xive_native_get_attr(struct kvm_device *dev, switch (attr->attr) { case KVM_DEV_XIVE_GET_ESB_FD: return kvmppc_xive_native_get_esb_fd(xive, attr->addr); + case KVM_DEV_XIVE_GET_TIMA_FD: + return kvmppc_xive_native_get_tima_fd(xive, attr->addr); } break; } @@ -261,6 +317,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev, case KVM_DEV_XIVE_GRP_CTRL: switch (attr->attr) { case KVM_DEV_XIVE_GET_ESB_FD: + case KVM_DEV_XIVE_GET_TIMA_FD: return 0; } break; diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c index 0c037e933e55..7782201e5fe8 100644 --- a/arch/powerpc/sysdev/xive/native.c +++ b/arch/powerpc/sysdev/xive/native.c @@ -521,6 +521,9 @@ u32 xive_native_default_eq_shift(void) } EXPORT_SYMBOL_GPL(xive_native_default_eq_shift); +unsigned long xive_tima_os; +EXPORT_SYMBOL_GPL(xive_tima_os); + bool __init xive_native_init(void) { struct 
device_node *np; @@ -573,6 +576,14 @@ bool __init xive_native_init(void) for_each_possible_cpu(cpu) kvmppc_set_xive_tima(cpu, r.start, tima); + /* Resource 2 is OS window */ + if (of_address_to_resource(np, 2, &r)) { + pr_err("Failed to get thread mgmnt area resource\n"); +
[PATCH 02/19] powerpc/xive: add OPAL extensions for the XIVE native exploitation support
The support for XIVE native exploitation mode in Linux/KVM needs a couple more OPAL calls to configure the sPAPR guest and to get/set the state of the XIVE internal structures. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/opal-api.h | 11 ++- arch/powerpc/include/asm/opal.h | 7 ++ arch/powerpc/include/asm/xive.h | 14 +++ arch/powerpc/sysdev/xive/native.c | 99 +++ .../powerpc/platforms/powernv/opal-wrappers.S | 3 + 5 files changed, 130 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index 870fb7b239ea..cdfc54f78101 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -186,8 +186,8 @@ #define OPAL_XIVE_FREE_IRQ 140 #define OPAL_XIVE_SYNC 141 #define OPAL_XIVE_DUMP 142 -#define OPAL_XIVE_RESERVED3143 -#define OPAL_XIVE_RESERVED4144 +#define OPAL_XIVE_GET_QUEUE_STATE 143 +#define OPAL_XIVE_SET_QUEUE_STATE 144 #define OPAL_SIGNAL_SYSTEM_RESET 145 #define OPAL_NPU_INIT_CONTEXT 146 #define OPAL_NPU_DESTROY_CONTEXT 147 @@ -209,8 +209,11 @@ #define OPAL_SENSOR_GROUP_ENABLE 163 #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR 164 #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR 165 -#defineOPAL_NX_COPROC_INIT 167 -#define OPAL_LAST 167 +#define OPAL_HANDLE_HMI2 166 +#define OPAL_NX_COPROC_INIT167 +#define OPAL_NPU_SET_RELAXED_ORDER 168 +#define OPAL_NPU_GET_RELAXED_ORDER 169 +#define OPAL_XIVE_GET_VP_STATE 170 #define QUIESCE_HOLD 1 /* Spin all calls at entry */ #define QUIESCE_REJECT 2 /* Fail all calls with OPAL_BUSY */ diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index a55b01c90bb1..4e978d4dea5c 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -279,6 +279,13 @@ int64_t opal_xive_allocate_irq(uint32_t chip_id); int64_t opal_xive_free_irq(uint32_t girq); int64_t opal_xive_sync(uint32_t type, uint32_t id); int64_t opal_xive_dump(uint32_t type, uint32_t id); +int64_t opal_xive_get_queue_state(uint64_t vp, 
uint32_t prio, + __be32 *out_qtoggle, + __be32 *out_qindex); +int64_t opal_xive_set_queue_state(uint64_t vp, uint32_t prio, + uint32_t qtoggle, + uint32_t qindex); +int64_t opal_xive_get_vp_state(uint64_t vp, __be64 *out_w01); int64_t opal_pci_set_p2p(uint64_t phb_init, uint64_t phb_target, uint64_t desc, uint16_t pe_number); diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h index 32f033bfbf42..d6be3e4d9fa4 100644 --- a/arch/powerpc/include/asm/xive.h +++ b/arch/powerpc/include/asm/xive.h @@ -132,12 +132,26 @@ extern int xive_native_configure_queue(u32 vp_id, struct xive_q *q, u8 prio, extern void xive_native_disable_queue(u32 vp_id, struct xive_q *q, u8 prio); extern void xive_native_sync_source(u32 hw_irq); +extern void xive_native_sync_queue(u32 hw_irq); extern bool is_xive_irq(struct irq_chip *chip); extern int xive_native_enable_vp(u32 vp_id, bool single_escalation); extern int xive_native_disable_vp(u32 vp_id); extern int xive_native_get_vp_info(u32 vp_id, u32 *out_cam_id, u32 *out_chip_id); extern bool xive_native_has_single_escalation(void); +extern int xive_native_get_queue_info(u32 vp_id, uint32_t prio, + u64 *out_qpage, + u64 *out_qsize, + u64 *out_qeoi_page, + u32 *out_escalate_irq, + u64 *out_qflags); + +extern int xive_native_get_queue_state(u32 vp_id, uint32_t prio, u32 *qtoggle, + u32 *qindex); +extern int xive_native_set_queue_state(u32 vp_id, uint32_t prio, u32 qtoggle, + u32 qindex); +extern int xive_native_get_vp_state(u32 vp_id, u64 *out_state); + #else static inline bool xive_enabled(void) { return false; } diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c index 1ca127d052a6..0c037e933e55 100644 --- a/arch/powerpc/sysdev/xive/native.c +++ b/arch/powerpc/sysdev/xive/native.c @@ -437,6 +437,12 @@ void xive_native_sync_source(u32 hw_irq) } EXPORT_SYMBOL_GPL(xive_native_sync_source); +void xive_native_sync_queue(u32 hw_irq) +{ + opal_xive_sync(XIVE_SYNC_QUEUE, hw_irq); +} 
+EXPORT_SYMBOL_GPL(xive_native_sync_queue); + static const struct xive_ops xive_native_ops = {
[PATCH 19/19] KVM: introduce a KVM_DELETE_DEVICE ioctl
This will be used to destroy the KVM XICS or XIVE device when the sPAPR machine is reseted. When the VM boots, the CAS negotiation process will determine which interrupt mode to use and the appropriate KVM device will then be created. Signed-off-by: Cédric Le Goater --- include/linux/kvm_host.h | 2 ++ include/uapi/linux/kvm.h | 2 ++ arch/powerpc/kvm/book3s_xive.c| 38 +- arch/powerpc/kvm/book3s_xive_native.c | 24 + virt/kvm/kvm_main.c | 39 +++ 5 files changed, 104 insertions(+), 1 deletion(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index c38cc5eb7e73..259b6885dc74 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1218,6 +1218,8 @@ struct kvm_device_ops { */ void (*destroy)(struct kvm_device *dev); + int (*delete)(struct kvm_device *dev); + int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr); int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr); int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr); diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 52bf74a1616e..b00cb4d986cf 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1331,6 +1331,8 @@ struct kvm_s390_ucas_mapping { #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr) #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr) +#define KVM_DELETE_DEVICE_IOWR(KVMIO, 0xf0, struct kvm_create_device) + /* * ioctls for vcpu fds */ diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index 9b4751713554..5449fb4c87f9 100644 --- a/arch/powerpc/kvm/book3s_xive.c +++ b/arch/powerpc/kvm/book3s_xive.c @@ -1109,11 +1109,19 @@ void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu) void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu) { struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu; - struct kvmppc_xive *xive = xc->xive; + struct kvmppc_xive *xive; int i; + if (!kvmppc_xics_enabled(vcpu)) + return; + + if (!xc) + return; + 
pr_devel("cleanup_vcpu(cpu=%d)\n", xc->server_num); + xive = xc->xive; + /* Ensure no interrupt is still routed to that VP */ xc->valid = false; kvmppc_xive_disable_vcpu_interrupts(vcpu); @@ -1150,6 +1158,10 @@ void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu) } /* Free the VP */ kfree(xc); + + /* Cleanup the vcpu */ + vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT; + vcpu->arch.xive_vcpu = NULL; } int kvmppc_xive_connect_vcpu(struct kvm_device *dev, @@ -1861,6 +1873,29 @@ static void kvmppc_xive_free(struct kvm_device *dev) kfree(dev); } +static int kvmppc_xive_delete(struct kvm_device *dev) +{ + struct kvm *kvm = dev->kvm; + unsigned int i; + struct kvm_vcpu *vcpu; + + if (!kvm->arch.xive) + return -EPERM; + + /* +* call kick_all_cpus_sync() to ensure that all CPUs have +* executed any pending interrupts +*/ + if (is_kvmppc_hv_enabled(kvm)) + kick_all_cpus_sync(); + + kvm_for_each_vcpu(i, vcpu, kvm) + kvmppc_xive_cleanup_vcpu(vcpu); + + kvmppc_xive_free(dev); + return 0; +} + static int kvmppc_xive_create(struct kvm_device *dev, u32 type) { struct kvmppc_xive *xive; @@ -2035,6 +2070,7 @@ struct kvm_device_ops kvm_xive_ops = { .create = kvmppc_xive_create, .init = kvmppc_xive_init, .destroy = kvmppc_xive_free, + .delete = kvmppc_xive_delete, .set_attr = xive_set_attr, .get_attr = xive_get_attr, .has_attr = xive_has_attr, diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 12edac29995e..7367962e670a 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -979,6 +979,29 @@ static void kvmppc_xive_native_free(struct kvm_device *dev) kfree(dev); } +static int kvmppc_xive_native_delete(struct kvm_device *dev) +{ + struct kvm *kvm = dev->kvm; + unsigned int i; + struct kvm_vcpu *vcpu; + + if (!kvm->arch.xive) + return -EPERM; + + /* +* call kick_all_cpus_sync() to ensure that all CPUs have +* executed any pending interrupts +*/ + if (is_kvmppc_hv_enabled(kvm)) + kick_all_cpus_sync(); + + 
kvm_for_each_vcpu(i, vcpu, kvm) + kvmppc_xive_native_cleanup_vcpu(vcpu); + + kvmppc_xive_native_free(dev); + return 0; +} + /* * ESB MMIO address of chip 0 */ @@ -1350,6 +1373,7 @@ struct kvm_device_ops kvm_xive_native_ops = { .create = kvmppc_xive_native_create, .init = kvmppc_xive_na
[PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state
At a VCPU level, the state of the thread context interrupt management registers needs to be collected. These registers are cached under the 'xive_saved_state.w01' field of the VCPU when the VPCU context is pulled from the HW thread. An OPAL call retrieves the backup of the IPB register in the NVT structure and merges it in the KVM state. The structures of the interface between QEMU and KVM provisions some extra room (two u64) for further extensions if more state needs to be transferred back to QEMU. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/kvm_ppc.h| 5 ++ arch/powerpc/include/uapi/asm/kvm.h | 2 + arch/powerpc/kvm/book3s.c | 24 + arch/powerpc/kvm/book3s_xive_native.c | 78 +++ 4 files changed, 109 insertions(+) diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 4cc897039485..49c488af168c 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -270,6 +270,7 @@ union kvmppc_one_reg { u64 addr; u64 length; } vpaval; + u64 xive_timaval[4]; }; struct kvmppc_ops { @@ -603,6 +604,8 @@ extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu); extern void kvmppc_xive_native_init_module(void); extern void kvmppc_xive_native_exit_module(void); extern int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd); +extern int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val); +extern int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val); #else static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server, @@ -637,6 +640,8 @@ static inline void kvmppc_xive_native_init_module(void) { } static inline void kvmppc_xive_native_exit_module(void) { } static inline int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd) { return 0; } +static inline int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val) { return 0; } +static inline int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union 
kvmppc_one_reg *val) { return -ENOENT; } #endif /* CONFIG_KVM_XIVE */ diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 95302558ce10..3c958c39a782 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -480,6 +480,8 @@ struct kvm_ppc_cpu_char { #define KVM_REG_PPC_ICP_PPRI_SHIFT16 /* pending irq priority */ #define KVM_REG_PPC_ICP_PPRI_MASK 0xff +#define KVM_REG_PPC_VP_STATE (KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x8d) + /* Device control API: PPC-specific devices */ #define KVM_DEV_MPIC_GRP_MISC 1 #define KVM_DEV_MPIC_BASE_ADDR 0 /* 64-bit */ diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index de7eed191107..5ad658077a35 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -641,6 +641,18 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id, *val = get_reg_val(id, kvmppc_xics_get_icp(vcpu)); break; #endif /* CONFIG_KVM_XICS */ +#ifdef CONFIG_KVM_XIVE + case KVM_REG_PPC_VP_STATE: + if (!vcpu->arch.xive_vcpu) { + r = -ENXIO; + break; + } + if (xive_enabled()) + r = kvmppc_xive_native_get_vp(vcpu, val); + else + r = -ENXIO; + break; +#endif /* CONFIG_KVM_XIVE */ case KVM_REG_PPC_FSCR: *val = get_reg_val(id, vcpu->arch.fscr); break; @@ -714,6 +726,18 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, r = kvmppc_xics_set_icp(vcpu, set_reg_val(id, *val)); break; #endif /* CONFIG_KVM_XICS */ +#ifdef CONFIG_KVM_XIVE + case KVM_REG_PPC_VP_STATE: + if (!vcpu->arch.xive_vcpu) { + r = -ENXIO; + break; + } + if (xive_enabled()) + r = kvmppc_xive_native_set_vp(vcpu, val); + else + r = -ENXIO; + break; +#endif /* CONFIG_KVM_XIVE */ case KVM_REG_PPC_FSCR: vcpu->arch.fscr = set_reg_val(id, *val); break; diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index f4eb71eafc57..1aefb366df0b 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -424,6 +424,84 @@ static int 
xive_native_validate_queue_size(u32 qsize) } } +#define TM_I
[PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
Clear the ESB pages from the VMA of the IRQ being pass through to the guest and let the fault handler repopulate the VMA when the ESB pages are accessed for an EOI or for a trigger. Storing the VMA under the KVM XIVE device is a little ugly. Signed-off-by: Cédric Le Goater --- arch/powerpc/kvm/book3s_xive.h| 8 +++ arch/powerpc/kvm/book3s_xive.c| 15 ++ arch/powerpc/kvm/book3s_xive_native.c | 30 +++ 3 files changed, 53 insertions(+) diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index 31e598e62589..6e64d3496a2c 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -90,6 +90,11 @@ struct kvmppc_xive_src_block { struct kvmppc_xive_irq_state irq_state[KVMPPC_XICS_IRQ_PER_ICS]; }; +struct kvmppc_xive; + +struct kvmppc_xive_ops { + int (*reset_mapped)(struct kvm *kvm, unsigned long guest_irq); +}; struct kvmppc_xive { struct kvm *kvm; @@ -131,6 +136,9 @@ struct kvmppc_xive { /* VC base address for ESBs */ u64 vc_base; + + struct kvmppc_xive_ops *ops; + struct vm_area_struct *vma; }; #define KVMPPC_XIVE_Q_COUNT8 diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index e9f05d9c9ad5..9b4751713554 100644 --- a/arch/powerpc/kvm/book3s_xive.c +++ b/arch/powerpc/kvm/book3s_xive.c @@ -946,6 +946,13 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq, /* Turn the IPI hard off */ xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01); + /* +* Reset ESB guest mapping. Needed when ESB pages are exposed +* to the guest in XIVE native mode +*/ + if (xive->ops && xive->ops->reset_mapped) + xive->ops->reset_mapped(kvm, guest_irq); + /* Grab info about irq */ state->pt_number = hw_irq; state->pt_data = irq_data_get_irq_handler_data(host_data); @@ -1031,6 +1038,14 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq, state->pt_number = 0; state->pt_data = NULL; + /* +* Reset ESB guest mapping. 
Needed when ESB pages are exposed +* to the guest in XIVE native mode +*/ + if (xive->ops && xive->ops->reset_mapped) { + xive->ops->reset_mapped(kvm, guest_irq); + } + /* Reconfigure the IPI */ xive_native_configure_irq(state->ipi_number, xive_vp(xive, state->act_server), diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 1aefb366df0b..12edac29995e 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -240,6 +240,32 @@ static int kvmppc_xive_native_get_vc_base(struct kvmppc_xive *xive, u64 addr) return 0; } +static int kvmppc_xive_native_reset_mapped(struct kvm *kvm, unsigned long irq) +{ + struct kvmppc_xive *xive = kvm->arch.xive; + struct mm_struct *mm = kvm->mm; + struct vm_area_struct *vma = xive->vma; + unsigned long address; + + if (irq >= KVMPPC_XIVE_NR_IRQS) + return -EINVAL; + + pr_debug("clearing esb pages for girq 0x%lx\n", irq); + + down_read(&mm->mmap_sem); + /* TODO: can we clear the PTEs without keeping a VMA pointer ? 
*/ + if (vma) { + address = vma->vm_start + irq * (2ull << PAGE_SHIFT); + zap_vma_ptes(vma, address, 2ull << PAGE_SHIFT); + } + up_read(&mm->mmap_sem); + return 0; +} + +static struct kvmppc_xive_ops kvmppc_xive_native_ops = { + .reset_mapped = kvmppc_xive_native_reset_mapped, +}; + static int xive_native_esb_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; @@ -292,6 +318,8 @@ static const struct vm_operations_struct xive_native_esb_vmops = { static int xive_native_esb_mmap(struct file *file, struct vm_area_struct *vma) { + struct kvmppc_xive *xive = vma->vm_file->private_data; + /* There are two ESB pages (trigger and EOI) per IRQ */ if (vma_pages(vma) + vma->vm_pgoff > KVMPPC_XIVE_NR_IRQS * 2) return -EINVAL; @@ -299,6 +327,7 @@ static int xive_native_esb_mmap(struct file *file, struct vm_area_struct *vma) vma->vm_flags |= VM_IO | VM_PFNMAP; vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); vma->vm_ops = &xive_native_esb_vmops; + xive->vma = vma; /* TODO: get rid of the VMA pointer */ return 0; } @@ -992,6 +1021,7 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type) xive->vc_base = XIVE_VC_BASE; xive->single_escalation = xive_native_has_single_escalation(); + xive->ops = &kvmppc_xive_native_ops; if (ret) kfree(xive); -- 2.20.1
[PATCH 04/19] KVM: PPC: Book3S HV: export services for the XIVE native exploitation device
The KVM device for the XIVE native exploitation mode will reuse the structures of the XICS-over-XIVE glue implementation. Some code will also be shared : source block creation and destruction, target selection and escalation attachment. Signed-off-by: Cédric Le Goater --- arch/powerpc/kvm/book3s_xive.h | 11 + arch/powerpc/kvm/book3s_xive.c | 89 +++--- 2 files changed, 62 insertions(+), 38 deletions(-) diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index a08ae6fd4c51..10c4aa5cd010 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -248,5 +248,16 @@ extern int (*__xive_vm_h_ipi)(struct kvm_vcpu *vcpu, unsigned long server, extern int (*__xive_vm_h_cppr)(struct kvm_vcpu *vcpu, unsigned long cppr); extern int (*__xive_vm_h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr); +/* + * Common Xive routines for XICS-over-XIVE and XIVE native + */ +struct kvmppc_xive_src_block *kvmppc_xive_create_src_block( + struct kvmppc_xive *xive, int irq); +void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb); +int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio); +void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu); +int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio); +int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu); + #endif /* CONFIG_KVM_XICS */ #endif /* _KVM_PPC_BOOK3S_XICS_H */ diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index 8a4fa45f07f8..bb5d32f7e4e6 100644 --- a/arch/powerpc/kvm/book3s_xive.c +++ b/arch/powerpc/kvm/book3s_xive.c @@ -166,7 +166,7 @@ static irqreturn_t xive_esc_irq(int irq, void *data) return IRQ_HANDLED; } -static int xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio) +int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio) { struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu; struct xive_q *q = &xc->queues[prio]; @@ -291,7 +291,7 @@ static int xive_check_provisioning(struct 
kvm *kvm, u8 prio) continue; rc = xive_provision_queue(vcpu, prio); if (rc == 0 && !xive->single_escalation) - xive_attach_escalation(vcpu, prio); + kvmppc_xive_attach_escalation(vcpu, prio); if (rc) return rc; } @@ -342,7 +342,7 @@ static int xive_try_pick_queue(struct kvm_vcpu *vcpu, u8 prio) return atomic_add_unless(&q->count, 1, max) ? 0 : -EBUSY; } -static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio) +int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio) { struct kvm_vcpu *vcpu; int i, rc; @@ -535,7 +535,7 @@ static int xive_target_interrupt(struct kvm *kvm, * priority. The count for that new target will have * already been incremented. */ - rc = xive_select_target(kvm, &server, prio); + rc = kvmppc_xive_select_target(kvm, &server, prio); /* * We failed to find a target ? Not much we can do @@ -1055,7 +1055,7 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq, } EXPORT_SYMBOL_GPL(kvmppc_xive_clr_mapped); -static void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu) +void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu) { struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu; struct kvm *kvm = vcpu->kvm; @@ -1225,7 +1225,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev, if (xive->qmap & (1 << i)) { r = xive_provision_queue(vcpu, i); if (r == 0 && !xive->single_escalation) - xive_attach_escalation(vcpu, i); + kvmppc_xive_attach_escalation(vcpu, i); if (r) goto bail; } else { @@ -1240,7 +1240,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev, } /* If not done above, attach priority 0 escalation */ - r = xive_attach_escalation(vcpu, 0); + r = kvmppc_xive_attach_escalation(vcpu, 0); if (r) goto bail; @@ -1491,8 +1491,8 @@ static int xive_get_source(struct kvmppc_xive *xive, long irq, u64 addr) return 0; } -static struct kvmppc_xive_src_block *xive_create_src_block(struct kvmppc_xive *xive, - int irq) +struct kvmppc_xive_src_block *kvmppc_xive_create_src_block( + struct kvmppc_xive 
*xive, int irq) { struct kvm *kvm = xive->kvm; struct kvmppc_xive_src_block *sb; @@ -1571,7 +1571,7 @@ static int xive_set_source(struct kvmppc_xive *xive, long irq, u64 addr) sb = kvmppc_xive_find_source(xive, irq, &idx); if (!sb) { pr_devel("No source, creating source block...\n"); -
[PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration
These are used to capture the XIVE EAS table of the KVM device, the configuration of the source targets.

Signed-off-by: Cédric Le Goater
---
 arch/powerpc/include/uapi/asm/kvm.h   | 11
 arch/powerpc/kvm/book3s_xive_native.c | 87 +++
 2 files changed, 98 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 1a8740629acf..faf024f39858 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
 #define KVM_DEV_XIVE_SAVE_EQ_PAGES	4

 #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
 #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
+#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */

 /* Layout of 64-bit XIVE source attribute values */
 #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
 #define KVM_XIVE_LEVEL_ASSERTED	(1ULL << 1)

+/* Layout of 64-bit eas attribute values */
+#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
+#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
+#define KVM_XIVE_EAS_SERVER_SHIFT	3
+#define KVM_XIVE_EAS_SERVER_MASK	0xfff8ULL
+#define KVM_XIVE_EAS_MASK_SHIFT	32
+#define KVM_XIVE_EAS_MASK_MASK	0x1ULL
+#define KVM_XIVE_EAS_EISN_SHIFT	33
+#define KVM_XIVE_EAS_EISN_MASK	0xfffeULL
+
 #endif /* __LINUX_KVM_POWERPC_H */

diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index f2de1bcf3b35..0468b605baa7 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
 	return 0;
 }

+static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
+				      u64 addr)
+{
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	u64 __user *ubufp = (u64 __user *) addr;
+	u16 src;
+	u64 kvm_eas;
+	u32 server;
+	u8 priority;
+	u32 eisn;
+
+	sb = kvmppc_xive_find_source(xive, irq, &src);
+	if (!sb)
+		return -ENOENT;
+
+	state = &sb->irq_state[src];
+
+	if (!state->valid)
+		return -EINVAL;
+
+	if (get_user(kvm_eas, ubufp))
+		return -EFAULT;
+
+	pr_devel("%s irq=0x%lx eas=%016llx\n", __func__, irq, kvm_eas);
+
+	priority = (kvm_eas & KVM_XIVE_EAS_PRIORITY_MASK) >>
+		KVM_XIVE_EAS_PRIORITY_SHIFT;
+	server = (kvm_eas & KVM_XIVE_EAS_SERVER_MASK) >>
+		KVM_XIVE_EAS_SERVER_SHIFT;
+	eisn = (kvm_eas & KVM_XIVE_EAS_EISN_MASK) >> KVM_XIVE_EAS_EISN_SHIFT;
+
+	if (priority != xive_prio_from_guest(priority)) {
+		pr_err("invalid priority for queue %d for VCPU %d\n",
+		       priority, server);
+		return -EINVAL;
+	}
+
+	return kvmppc_xive_native_set_source_config(xive, sb, state, server,
+						    priority, eisn);
+}
+
+static int kvmppc_xive_native_get_eas(struct kvmppc_xive *xive, long irq,
+				      u64 addr)
+{
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	u64 __user *ubufp = (u64 __user *) addr;
+	u16 src;
+	u64 kvm_eas;
+
+	sb = kvmppc_xive_find_source(xive, irq, &src);
+	if (!sb)
+		return -ENOENT;
+
+	state = &sb->irq_state[src];
+
+	if (!state->valid)
+		return -EINVAL;
+
+	arch_spin_lock(&sb->lock);
+
+	if (state->act_priority == MASKED)
+		kvm_eas = KVM_XIVE_EAS_MASK_MASK;
+	else {
+		kvm_eas = (state->act_priority << KVM_XIVE_EAS_PRIORITY_SHIFT) &
+			KVM_XIVE_EAS_PRIORITY_MASK;
+		kvm_eas |= (state->act_server << KVM_XIVE_EAS_SERVER_SHIFT) &
+			KVM_XIVE_EAS_SERVER_MASK;
+		kvm_eas |= ((u64) state->eisn << KVM_XIVE_EAS_EISN_SHIFT) &
+			KVM_XIVE_EAS_EISN_MASK;
+	}
+	arch_spin_unlock(&sb->lock);
+
+	pr_devel("%s irq=0x%lx eas=%016llx\n", __func__, irq, kvm_eas);
+
+	if (put_user(kvm_eas, ubufp))
+		return -EFAULT;
+
+	return 0;
+}
+
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 				       struct kvm_device_attr *attr)
 {
@@ -544,6 +626,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 					       attr->addr);
 	case KVM_DEV_XIVE_GRP_SYNC:
 		return kvmppc_xive_native_sync(xive, attr->attr, attr->addr);
+	case KVM_DEV_XIVE_GRP_EAS:
+		return kvmppc_xive_native_set_eas(xive, attr->attr, attr->addr);
 	}

 	return -ENXI
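As a sketch of how a 64-bit EAS attribute value is packed and unpacked, the snippet below mirrors the shifts defined in the patch (0, 3, 32, 33). The mask constants and helper names here are illustrative assumptions — the hex values in this excerpt are truncated — so this is not the authoritative uapi layout:

```c
#include <assert.h>
#include <stdint.h>

/* Shifts as in the patch; the masks below are assumed reconstructions. */
#define EAS_PRIORITY_SHIFT	0
#define EAS_PRIORITY_MASK	0x7ULL
#define EAS_SERVER_SHIFT	3
#define EAS_SERVER_MASK		0xfffffff8ULL		/* assumption */
#define EAS_MASK_SHIFT		32
#define EAS_MASK_MASK		(1ULL << EAS_MASK_SHIFT)	/* assumption */
#define EAS_EISN_SHIFT		33
#define EAS_EISN_MASK		0xfffffffe00000000ULL	/* assumption */

/* Pack server/priority/eisn the way set_eas() expects to decode them. */
static uint64_t eas_pack(uint8_t priority, uint32_t server, uint32_t eisn)
{
	uint64_t v = 0;

	v |= ((uint64_t)priority << EAS_PRIORITY_SHIFT) & EAS_PRIORITY_MASK;
	v |= ((uint64_t)server << EAS_SERVER_SHIFT) & EAS_SERVER_MASK;
	v |= ((uint64_t)eisn << EAS_EISN_SHIFT) & EAS_EISN_MASK;
	return v;
}

/* The decode steps used by kvmppc_xive_native_set_eas() */
static uint8_t eas_priority(uint64_t v)
{
	return (v & EAS_PRIORITY_MASK) >> EAS_PRIORITY_SHIFT;
}

static uint32_t eas_server(uint64_t v)
{
	return (v & EAS_SERVER_MASK) >> EAS_SERVER_SHIFT;
}

static uint32_t eas_eisn(uint64_t v)
{
	return (v & EAS_EISN_MASK) >> EAS_EISN_SHIFT;
}

static int eas_is_masked(uint64_t v)
{
	return !!(v & EAS_MASK_MASK);
}
```

A round trip through these helpers shows why the accessors can transport the whole source configuration in a single u64 read or written with get_user()/put_user().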
[PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode
This is the basic framework for the new KVM device supporting the XIVE native exploitation mode. The user interface exposes a new capability and a new KVM device to be used by QEMU. Internally, the interface to the new KVM device is protected with a new interrupt mode: KVMPPC_IRQ_XIVE. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/kvm_host.h | 2 + arch/powerpc/include/asm/kvm_ppc.h| 21 ++ arch/powerpc/kvm/book3s_xive.h| 3 + include/uapi/linux/kvm.h | 3 + arch/powerpc/kvm/book3s.c | 7 +- arch/powerpc/kvm/book3s_xive_native.c | 332 ++ arch/powerpc/kvm/powerpc.c| 30 +++ arch/powerpc/kvm/Makefile | 2 +- 8 files changed, 398 insertions(+), 2 deletions(-) create mode 100644 arch/powerpc/kvm/book3s_xive_native.c diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 0f98f00da2ea..c522e8274ad9 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -220,6 +220,7 @@ extern struct kvm_device_ops kvm_xics_ops; struct kvmppc_xive; struct kvmppc_xive_vcpu; extern struct kvm_device_ops kvm_xive_ops; +extern struct kvm_device_ops kvm_xive_native_ops; struct kvmppc_passthru_irqmap; @@ -446,6 +447,7 @@ struct kvmppc_passthru_irqmap { #define KVMPPC_IRQ_DEFAULT 0 #define KVMPPC_IRQ_MPIC1 #define KVMPPC_IRQ_XICS2 /* Includes a XIVE option */ +#define KVMPPC_IRQ_XIVE3 /* XIVE native exploitation mode */ #define MMIO_HPTE_CACHE_SIZE 4 diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index eb0d79f0ca45..1bb313f238fe 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -591,6 +591,18 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval); extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level, bool line_status); extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu); + +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu) +{ + return vcpu->arch.irq_type == 
KVMPPC_IRQ_XIVE; +} + +extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev, + struct kvm_vcpu *vcpu, u32 cpu); +extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu); +extern void kvmppc_xive_native_init_module(void); +extern void kvmppc_xive_native_exit_module(void); + #else static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server, u32 priority) { return -1; } @@ -614,6 +626,15 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) { retur static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level, bool line_status) { return -ENODEV; } static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { } + +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu) + { return 0; } +static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev, + struct kvm_vcpu *vcpu, u32 cpu) { return -EBUSY; } +static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { } +static inline void kvmppc_xive_native_init_module(void) { } +static inline void kvmppc_xive_native_exit_module(void) { } + #endif /* CONFIG_KVM_XIVE */ /* diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index 10c4aa5cd010..5f22415520b4 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -12,6 +12,9 @@ #ifdef CONFIG_KVM_XICS #include "book3s_xics.h" +#define KVMPPC_XIVE_FIRST_IRQ 0 +#define KVMPPC_XIVE_NR_IRQSKVMPPC_XICS_NR_IRQS + /* * State for one guest irq source. 
* diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 6d4ea4b6c922..52bf74a1616e 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_ARM_VM_IPA_SIZE 165 #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166 #define KVM_CAP_HYPERV_CPUID 167 +#define KVM_CAP_PPC_IRQ_XIVE 168 #ifdef KVM_CAP_IRQ_ROUTING @@ -1211,6 +1212,8 @@ enum kvm_device_type { #define KVM_DEV_TYPE_ARM_VGIC_V3 KVM_DEV_TYPE_ARM_VGIC_V3 KVM_DEV_TYPE_ARM_VGIC_ITS, #define KVM_DEV_TYPE_ARM_VGIC_ITS KVM_DEV_TYPE_ARM_VGIC_ITS + KVM_DEV_TYPE_XIVE, +#define KVM_DEV_TYPE_XIVE KVM_DEV_TYPE_XIVE KVM_DEV_TYPE_MAX, }; diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index bd1a677dd9e4..de7eed191107 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void) #ifdef CONFIG_KVM_XIVE if (xive_enabled()) { kvmppc_xive_init_module(); +
[PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
Hello,

On the POWER9 processor, the XIVE interrupt controller can control interrupt sources using MMIO to trigger events, to EOI or to turn off the sources. Priority management and interrupt acknowledgment are also controlled by MMIO in the CPU presenter subengine.

PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need special support from the hypervisor to do the same. This is called the XIVE native exploitation mode and today, it can be activated under the PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support and still offers the old interrupt mode interface using a XICS-over-XIVE glue which implements the XICS hcalls.

The following series is a proposal to add the same support under KVM. A new KVM device is introduced for the XIVE native exploitation mode. It reuses most of the XICS-over-XIVE glue implementation structures, which are internal to KVM, but has a completely different interface. A set of Hypervisor calls configures the sources and the event queues, and from there, all control is done by the guest through MMIOs.

These MMIO regions (ESB and TIMA) are exposed to guests in QEMU, similarly to VFIO, and the associated VMAs are populated dynamically with the appropriate pages using a fault handler. This is implemented with a couple of KVM device ioctls.

On a POWER9 sPAPR machine, the Client Architecture Support (CAS) negotiation process determines whether the guest operates with an interrupt controller using the XICS legacy model, as found on POWER8, or in XIVE exploitation mode. This means that the KVM interrupt device should be created at runtime, after the machine has started. This requires extra KVM support to create/destroy KVM devices. The last patches are an attempt to solve that problem.

Migration has its own specific needs. The patchset provides the necessary routines to quiesce XIVE, to capture and restore the state of the different structures used by KVM, OPAL and HW. Extra OPAL support is required for these.

GitHub trees available here : QEMU sPAPR: https://github.com/legoater/qemu/commits/xive-next Linux/KVM: https://github.com/legoater/linux/commits/xive-5.0 OPAL: https://github.com/legoater/skiboot/commits/xive Best wishes for 2019 ! C. Cédric Le Goater (19): powerpc/xive: export flags for the XIVE native exploitation mode hcalls powerpc/xive: add OPAL extensions for the XIVE native exploitation support KVM: PPC: Book3S HV: check the IRQ controller type KVM: PPC: Book3S HV: export services for the XIVE native exploitation device KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device KVM: PPC: Book3S HV: add a GET_TIMA_FD control to XIVE native device KVM: PPC: Book3S HV: add a VC_BASE control to the XIVE native device KVM: PPC: Book3S HV: add a SET_SOURCE control to the XIVE native device KVM: PPC: Book3S HV: add a EISN attribute to kvmppc_xive_irq_state KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls KVM: PPC: Book3S HV: record guest queue page address KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty KVM: PPC: Book3S HV: add get/set accessors for the source configuration KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state KVM: PPC: Book3S HV: add passthrough support KVM: introduce a KVM_DELETE_DEVICE ioctl arch/powerpc/include/asm/kvm_host.h |2 + arch/powerpc/include/asm/kvm_ppc.h| 69 + arch/powerpc/include/asm/opal-api.h | 11 +- arch/powerpc/include/asm/opal.h |7 + arch/powerpc/include/asm/xive.h | 40 + arch/powerpc/include/uapi/asm/kvm.h | 47 + arch/powerpc/kvm/book3s_xive.h| 82 + include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |5 + arch/powerpc/kvm/book3s.c | 31 +- arch/powerpc/kvm/book3s_hv.c | 29 + arch/powerpc/kvm/book3s_hv_builtin.c | 196 +++ 
arch/powerpc/kvm/book3s_hv_rm_xive_native.c | 47 + arch/powerpc/kvm/book3s_xive.c| 149 +- arch/powerpc/kvm/book3s_xive_native.c | 1406 + .../powerpc/kvm/book3s_xive_native_template.c | 398 + arch/powerpc/kvm/powerpc.c| 30 + arch/powerpc/sysdev/xive/native.c | 110 ++ arch/powerpc/sysdev/xive/spapr.c | 28 +- virt/kvm/kvm_main.c | 39 + arch/powerpc/kvm/Makefile |4 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 52 + .../powerpc/platforms/powernv/opal-wrappers.S |3 + 23 files changed, 2722 insertions(+), 65 deletions(-) create mode 100644
[PATCH 14/19] KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty
When the VM is stopped in a migration sequence, the sources are masked and the XIVE IC is synced to stabilize the EQs. When done, the KVM ioctl KVM_DEV_XIVE_SAVE_EQ_PAGES is called to mark dirty the EQ pages. The migration can then transfer the remaining dirty pages to the destination and start collecting the state of the devices. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/uapi/asm/kvm.h | 1 + arch/powerpc/kvm/book3s_xive_native.c | 40 +++ 2 files changed, 41 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index f3b859223b80..1a8740629acf 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -680,6 +680,7 @@ struct kvm_ppc_cpu_char { #define KVM_DEV_XIVE_GET_ESB_FD 1 #define KVM_DEV_XIVE_GET_TIMA_FD 2 #define KVM_DEV_XIVE_VC_BASE 3 +#define KVM_DEV_XIVE_SAVE_EQ_PAGES 4 #define KVM_DEV_XIVE_GRP_SOURCES 2 /* 64-bit source attributes */ #define KVM_DEV_XIVE_GRP_SYNC 3 /* 64-bit source attributes */ diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index a8052867afc1..f2de1bcf3b35 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -373,6 +373,43 @@ static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr) return put_user(ret, ubufp); } +static int kvmppc_xive_native_vcpu_save_eq_pages(struct kvm_vcpu *vcpu) +{ + struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu; + unsigned int prio; + + if (!xc) + return -ENOENT; + + for (prio = 0; prio < KVMPPC_XIVE_Q_COUNT; prio++) { + struct xive_q *q = &xc->queues[prio]; + + if (!q->qpage) + continue; + + /* Mark EQ page dirty for migration */ + mark_page_dirty(vcpu->kvm, gpa_to_gfn(q->guest_qpage)); + } + return 0; +} + +static int kvmppc_xive_native_save_eq_pages(struct kvmppc_xive *xive) +{ + struct kvm *kvm = xive->kvm; + struct kvm_vcpu *vcpu; + unsigned int i; + + pr_devel("%s\n", __func__); + + 
mutex_lock(&kvm->lock); + kvm_for_each_vcpu(i, vcpu, kvm) { + kvmppc_xive_native_vcpu_save_eq_pages(vcpu); + } + mutex_unlock(&kvm->lock); + + return 0; +} + static int xive_native_validate_queue_size(u32 qsize) { switch (qsize) { @@ -498,6 +535,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev, switch (attr->attr) { case KVM_DEV_XIVE_VC_BASE: return kvmppc_xive_native_set_vc_base(xive, attr->addr); + case KVM_DEV_XIVE_SAVE_EQ_PAGES: + return kvmppc_xive_native_save_eq_pages(xive); } break; case KVM_DEV_XIVE_GRP_SOURCES: @@ -538,6 +577,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev, case KVM_DEV_XIVE_GET_ESB_FD: case KVM_DEV_XIVE_GET_TIMA_FD: case KVM_DEV_XIVE_VC_BASE: + case KVM_DEV_XIVE_SAVE_EQ_PAGES: return 0; } break; -- 2.20.1
[PATCH 01/19] powerpc/xive: export flags for the XIVE native exploitation mode hcalls
These flags are shared between Linux/KVM implementing the hypervisor calls for the XIVE native exploitation mode and the driver for the sPAPR guests. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/xive.h | 23 +++ arch/powerpc/sysdev/xive/spapr.c | 28 2 files changed, 31 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h index 3c704f5dd3ae..32f033bfbf42 100644 --- a/arch/powerpc/include/asm/xive.h +++ b/arch/powerpc/include/asm/xive.h @@ -93,6 +93,29 @@ extern void xive_flush_interrupt(void); /* xmon hook */ extern void xmon_xive_do_dump(int cpu); +/* + * Hcall flags shared by the sPAPR backend and KVM + */ + +/* H_INT_GET_SOURCE_INFO */ +#define XIVE_SPAPR_SRC_H_INT_ESB PPC_BIT(60) +#define XIVE_SPAPR_SRC_LSI PPC_BIT(61) +#define XIVE_SPAPR_SRC_TRIGGER PPC_BIT(62) +#define XIVE_SPAPR_SRC_STORE_EOI PPC_BIT(63) + +/* H_INT_SET_SOURCE_CONFIG */ +#define XIVE_SPAPR_SRC_SET_EISNPPC_BIT(62) +#define XIVE_SPAPR_SRC_MASKPPC_BIT(63) /* unused */ + +/* H_INT_SET_QUEUE_CONFIG */ +#define XIVE_SPAPR_EQ_ALWAYS_NOTIFYPPC_BIT(63) + +/* H_INT_SET_QUEUE_CONFIG */ +#define XIVE_SPAPR_EQ_DEBUGPPC_BIT(63) + +/* H_INT_ESB */ +#define XIVE_SPAPR_ESB_STORE PPC_BIT(63) + /* APIs used by KVM */ extern u32 xive_native_default_eq_shift(void); extern u32 xive_native_alloc_vp_block(u32 max_vcpus); diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c index 575db3b06a6b..730284f838c8 100644 --- a/arch/powerpc/sysdev/xive/spapr.c +++ b/arch/powerpc/sysdev/xive/spapr.c @@ -184,9 +184,6 @@ static long plpar_int_get_source_info(unsigned long flags, return 0; } -#define XIVE_SRC_SET_EISN (1ull << (63 - 62)) -#define XIVE_SRC_MASK (1ull << (63 - 63)) /* unused */ - static long plpar_int_set_source_config(unsigned long flags, unsigned long lisn, unsigned long target, @@ -243,8 +240,6 @@ static long plpar_int_get_queue_info(unsigned long flags, return 0; } -#define XIVE_EQ_ALWAYS_NOTIFY (1ull << (63 - 
63)) - static long plpar_int_set_queue_config(unsigned long flags, unsigned long target, unsigned long priority, @@ -286,8 +281,6 @@ static long plpar_int_sync(unsigned long flags, unsigned long lisn) return 0; } -#define XIVE_ESB_FLAG_STORE (1ull << (63 - 63)) - static long plpar_int_esb(unsigned long flags, unsigned long lisn, unsigned long offset, @@ -321,7 +314,7 @@ static u64 xive_spapr_esb_rw(u32 lisn, u32 offset, u64 data, bool write) unsigned long read_data; long rc; - rc = plpar_int_esb(write ? XIVE_ESB_FLAG_STORE : 0, + rc = plpar_int_esb(write ? XIVE_SPAPR_ESB_STORE : 0, lisn, offset, data, &read_data); if (rc) return -1; @@ -329,11 +322,6 @@ static u64 xive_spapr_esb_rw(u32 lisn, u32 offset, u64 data, bool write) return write ? 0 : read_data; } -#define XIVE_SRC_H_INT_ESB (1ull << (63 - 60)) -#define XIVE_SRC_LSI (1ull << (63 - 61)) -#define XIVE_SRC_TRIGGER (1ull << (63 - 62)) -#define XIVE_SRC_STORE_EOI (1ull << (63 - 63)) - static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data) { long rc; @@ -349,11 +337,11 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data) if (rc) return -EINVAL; - if (flags & XIVE_SRC_H_INT_ESB) + if (flags & XIVE_SPAPR_SRC_H_INT_ESB) data->flags |= XIVE_IRQ_FLAG_H_INT_ESB; - if (flags & XIVE_SRC_STORE_EOI) + if (flags & XIVE_SPAPR_SRC_STORE_EOI) data->flags |= XIVE_IRQ_FLAG_STORE_EOI; - if (flags & XIVE_SRC_LSI) + if (flags & XIVE_SPAPR_SRC_LSI) data->flags |= XIVE_IRQ_FLAG_LSI; data->eoi_page = eoi_page; data->esb_shift = esb_shift; @@ -374,7 +362,7 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data) data->hw_irq = hw_irq; /* Full function page supports trigger */ - if (flags & XIVE_SRC_TRIGGER) { + if (flags & XIVE_SPAPR_SRC_TRIGGER) { data->trig_mmio = data->eoi_mmio; return 0; } @@ -391,8 +379,8 @@ static int xive_spapr_configure_irq(u32 hw_irq, u32 target, u8 prio, u32 sw_irq) { long rc; - rc = 
plpar_int_set_source_config(XIVE_SRC_SET_EISN, hw_irq, target, -prio, sw_irq); + rc = plpar_int_set_source_config(XIVE_SPAPR_SRC_SET_EISN, hw_irq, +target, prio, sw_irq); return rc == 0 ? 0 : -ENXIO; }
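The shared flags above are expressed with the kernel's PPC_BIT() macro, which uses IBM bit numbering (bit 0 is the most-significant bit of a 64-bit word) — exactly equivalent to the removed open-coded `(1ull << (63 - n))` constants. A minimal userspace sketch of that equivalence, using a local stand-in for the macro:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Local stand-in for the kernel's PPC_BIT(): IBM/Power numbering counts
 * bit 0 as the most-significant bit of a 64-bit word, so PPC_BIT(63)
 * is the least-significant bit.
 */
#define PPC_BIT(bit)	(1ULL << (63 - (bit)))

/* The H_INT_GET_SOURCE_INFO flags from the patch, in both notations */
#define XIVE_SPAPR_SRC_H_INT_ESB	PPC_BIT(60)	/* == 1ull << 3 */
#define XIVE_SPAPR_SRC_LSI		PPC_BIT(61)	/* == 1ull << 2 */
#define XIVE_SPAPR_SRC_TRIGGER		PPC_BIT(62)	/* == 1ull << 1 */
#define XIVE_SPAPR_SRC_STORE_EOI	PPC_BIT(63)	/* == 1ull << 0 */
```

This is why the conversion in the diff is purely mechanical: each `(1ull << (63 - n))` becomes `PPC_BIT(n)` with the same value.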
[PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type
We will have different KVM devices for interrupts, one for the XICS-over-XIVE mode and one for the XIVE native exploitation mode. Let's add some checks to make sure we are not mixing the interfaces in KVM. Signed-off-by: Cédric Le Goater --- arch/powerpc/kvm/book3s_xive.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index f78d002f0fe0..8a4fa45f07f8 100644 --- a/arch/powerpc/kvm/book3s_xive.c +++ b/arch/powerpc/kvm/book3s_xive.c @@ -819,6 +819,9 @@ u64 kvmppc_xive_get_icp(struct kvm_vcpu *vcpu) { struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu; + if (!kvmppc_xics_enabled(vcpu)) + return -EPERM; + if (!xc) return 0; @@ -835,6 +838,9 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) u8 cppr, mfrr; u32 xisr; + if (!kvmppc_xics_enabled(vcpu)) + return -EPERM; + if (!xc || !xive) return -ENOENT; -- 2.20.1
[PATCH 12/19] KVM: PPC: Book3S HV: record guest queue page address
The guest physical address of the event queue will be part of the state to transfer in the migration. Cache its value when the queue is configured, it will save us an OPAL call. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/xive.h | 2 ++ arch/powerpc/kvm/book3s_xive_native.c | 4 2 files changed, 6 insertions(+) diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h index 7a7aa22d8258..e90c3c5d9533 100644 --- a/arch/powerpc/include/asm/xive.h +++ b/arch/powerpc/include/asm/xive.h @@ -74,6 +74,8 @@ struct xive_q { u32 esc_irq; atomic_tcount; atomic_tpending_count; + u64 guest_qpage; + u32 guest_qsize; }; /* Global enable flags for the XIVE support */ diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 35d806740c3a..4ca75aade069 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -708,6 +708,10 @@ static int kvmppc_h_int_set_queue_config(struct kvm_vcpu *vcpu, } qaddr = page_to_virt(page) + (qpage & ~PAGE_MASK); + /* Backup queue page address and size for migration */ + q->guest_qpage = qpage; + q->guest_qsize = qsize; + rc = xive_native_configure_queue(xc->vp_id, q, priority, (__be32 *) qaddr, qsize, true); if (rc) { -- 2.20.1
[PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
This will let the guest create a memory mapping to expose the ESB MMIO regions used to control the interrupt sources, to trigger events, to EOI or to turn off the sources. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/uapi/asm/kvm.h | 4 ++ arch/powerpc/kvm/book3s_xive_native.c | 97 +++ 2 files changed, 101 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 8c876c166ef2..6bb61ba141c2 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char { #define KVM_XICS_PRESENTED(1ULL << 43) #define KVM_XICS_QUEUED (1ULL << 44) +/* POWER9 XIVE Native Interrupt Controller */ +#define KVM_DEV_XIVE_GRP_CTRL 1 +#define KVM_DEV_XIVE_GET_ESB_FD 1 + #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 115143e76c45..e20081f0c8d4 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -153,6 +153,85 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev, return rc; } +static int xive_native_esb_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct kvmppc_xive *xive = vma->vm_file->private_data; + struct kvmppc_xive_src_block *sb; + struct kvmppc_xive_irq_state *state; + struct xive_irq_data *xd; + u32 hw_num; + u16 src; + u64 page; + unsigned long irq; + + /* +* Linux/KVM uses a two pages ESB setting, one for trigger and +* one for EOI +*/ + irq = vmf->pgoff / 2; + + sb = kvmppc_xive_find_source(xive, irq, &src); + if (!sb) { + pr_err("%s: source %lx not found !\n", __func__, irq); + return VM_FAULT_SIGBUS; + } + + state = &sb->irq_state[src]; + kvmppc_xive_select_irq(state, &hw_num, &xd); + + arch_spin_lock(&sb->lock); + + /* +* first/even page is for trigger +* second/odd page is for EOI and management. +*/ + page = vmf->pgoff % 2 ? 
xd->eoi_page : xd->trig_page; + arch_spin_unlock(&sb->lock); + + if (!page) { + pr_err("%s: acessing invalid ESB page for source %lx !\n", + __func__, irq); + return VM_FAULT_SIGBUS; + } + + vmf_insert_pfn(vma, vmf->address, page >> PAGE_SHIFT); + return VM_FAULT_NOPAGE; +} + +static const struct vm_operations_struct xive_native_esb_vmops = { + .fault = xive_native_esb_fault, +}; + +static int xive_native_esb_mmap(struct file *file, struct vm_area_struct *vma) +{ + /* There are two ESB pages (trigger and EOI) per IRQ */ + if (vma_pages(vma) + vma->vm_pgoff > KVMPPC_XIVE_NR_IRQS * 2) + return -EINVAL; + + vma->vm_flags |= VM_IO | VM_PFNMAP; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_ops = &xive_native_esb_vmops; + return 0; +} + +static const struct file_operations xive_native_esb_fops = { + .mmap = xive_native_esb_mmap, +}; + +static int kvmppc_xive_native_get_esb_fd(struct kvmppc_xive *xive, u64 addr) +{ + u64 __user *ubufp = (u64 __user *) addr; + int ret; + + ret = anon_inode_getfd("[xive-esb]", &xive_native_esb_fops, xive, + O_RDWR | O_CLOEXEC); + if (ret < 0) + return ret; + + return put_user(ret, ubufp); +} + static int kvmppc_xive_native_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { @@ -162,12 +241,30 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev, static int kvmppc_xive_native_get_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { + struct kvmppc_xive *xive = dev->private; + + switch (attr->group) { + case KVM_DEV_XIVE_GRP_CTRL: + switch (attr->attr) { + case KVM_DEV_XIVE_GET_ESB_FD: + return kvmppc_xive_native_get_esb_fd(xive, attr->addr); + } + break; + } return -ENXIO; } static int kvmppc_xive_native_has_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { + switch (attr->group) { + case KVM_DEV_XIVE_GRP_CTRL: + switch (attr->attr) { + case KVM_DEV_XIVE_GET_ESB_FD: + return 0; + } + break; + } return -ENXIO; } -- 2.20.1
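The file layout the fault handler above decodes — two ESB pages per IRQ, even page for trigger, odd page for EOI/management — can be sketched from the userspace side as pure offset arithmetic. The 64K page shift and the helper names are assumptions for illustration (the real page size comes from the source's esb_shift):

```c
#include <assert.h>
#include <stdint.h>

#define ESB_PAGE_SHIFT	16	/* assumption: 64K ESB pages, as on ppc64 */

/*
 * Userspace view: the mmap offset of the trigger (eoi == 0) or
 * EOI/management (eoi == 1) page of a given IRQ in the [xive-esb] fd.
 */
static uint64_t esb_mmap_offset(uint32_t irq, int eoi)
{
	return ((uint64_t)irq * 2 + (eoi ? 1 : 0)) << ESB_PAGE_SHIFT;
}

/* Inverse mapping, as computed in xive_native_esb_fault() from vmf->pgoff */
static uint32_t esb_irq_from_pgoff(uint64_t pgoff)
{
	return pgoff / 2;
}

static int esb_is_eoi_page(uint64_t pgoff)
{
	return pgoff % 2;
}
```

The bound check in xive_native_esb_mmap() (`vma_pages(vma) + vma->vm_pgoff > KVMPPC_XIVE_NR_IRQS * 2`) follows directly from this two-pages-per-IRQ layout.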
[PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls
The XIVE native exploitation mode specs define a set of Hypervisor calls to configure the sources and the event queues:

- H_INT_GET_SOURCE_INFO is used to obtain the address of the MMIO page of the Event State Buffer (PQ bits) entry associated with the source.

- H_INT_SET_SOURCE_CONFIG assigns a source to a "target".

- H_INT_GET_SOURCE_CONFIG determines which "target" and "priority" is assigned to a source.

- H_INT_GET_QUEUE_INFO returns the address of the notification management page associated with the specified "target" and "priority".

- H_INT_SET_QUEUE_CONFIG sets or resets the event queue for a given "target" and "priority". It is also used to set the notification configuration associated with the queue; only unconditional notification is supported for the moment. Reset is performed with a queue size of 0, and queueing is disabled in that case.

- H_INT_GET_QUEUE_CONFIG returns the queue settings for a given "target" and "priority".

- H_INT_RESET resets all of the guest's internal interrupt structures to their initial state, losing all configuration set via the hcalls H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.

- H_INT_SYNC issues a synchronisation on a source to make sure all notifications have reached their queue.
Calls that still need to be addressed : H_INT_SET_OS_REPORTING_LINE H_INT_GET_OS_REPORTING_LINE Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/kvm_ppc.h| 43 ++ arch/powerpc/kvm/book3s_xive.h| 54 +++ arch/powerpc/kvm/book3s_hv.c | 29 ++ arch/powerpc/kvm/book3s_hv_builtin.c | 196 + arch/powerpc/kvm/book3s_hv_rm_xive_native.c | 47 +++ arch/powerpc/kvm/book3s_xive_native.c | 326 ++- .../powerpc/kvm/book3s_xive_native_template.c | 371 ++ arch/powerpc/kvm/Makefile | 2 + arch/powerpc/kvm/book3s_hv_rmhandlers.S | 52 +++ 9 files changed, 1118 insertions(+), 2 deletions(-) create mode 100644 arch/powerpc/kvm/book3s_hv_rm_xive_native.c diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 1bb313f238fe..4cc897039485 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -602,6 +602,7 @@ extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev, extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu); extern void kvmppc_xive_native_init_module(void); extern void kvmppc_xive_native_exit_module(void); +extern int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd); #else static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server, @@ -634,6 +635,8 @@ static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev, static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { } static inline void kvmppc_xive_native_init_module(void) { } static inline void kvmppc_xive_native_exit_module(void) { } +static inline int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd) + { return 0; } #endif /* CONFIG_KVM_XIVE */ @@ -682,6 +685,46 @@ int kvmppc_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr); int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr); void kvmppc_guest_entry_inject_int(struct kvm_vcpu *vcpu); +int kvmppc_rm_h_int_get_source_info(struct kvm_vcpu *vcpu, + unsigned long flag, + unsigned long lisn); +int 
kvmppc_rm_h_int_set_source_config(struct kvm_vcpu *vcpu, + unsigned long flag, + unsigned long lisn, + unsigned long target, + unsigned long priority, + unsigned long eisn); +int kvmppc_rm_h_int_get_source_config(struct kvm_vcpu *vcpu, + unsigned long flag, + unsigned long lisn); +int kvmppc_rm_h_int_get_queue_info(struct kvm_vcpu *vcpu, + unsigned long flag, + unsigned long target, + unsigned long priority); +int kvmppc_rm_h_int_set_queue_config(struct kvm_vcpu *vcpu, +unsigned long flag, +unsigned long target, +unsigned long priority, +unsigned long qpage, +unsigned long qsize); +int kvmppc_rm_h_int_get_queue_config(struct kvm_vcpu *vcpu, +unsigned long flag, +unsigned long target, +unsigned long priority); +int kvmppc_rm_h_int_set_os_reporting_line(struct kvm_vcpu *vcpu, +
Re: [PATCH v3] crypto: talitos - fix ablkcipher for CONFIG_VMAP_STACK
On 04/01/2019 at 16:24, Horia Geanta wrote:
On 1/4/2019 5:17 PM, Horia Geanta wrote:
On 12/21/2018 10:07 AM, Christophe Leroy wrote:
[snip]

IV cannot be on stack when CONFIG_VMAP_STACK is selected because the stack cannot be DMA mapped anymore.

This looks better, thanks.

This patch copies the IV into the extended descriptor when iv is not a valid linear address.

Though I am not sure the checks in place are enough.

Fixes: 4de9d0b547b9 ("crypto: talitos - Add ablkcipher algorithms")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy
---
v3: Using struct edesc buffer.
v2: Using per-request context.
[snip]
+	if (ivsize && !virt_addr_valid(iv))
+		alloc_len += ivsize;
[snip]
+	if (ivsize && !virt_addr_valid(iv))

A more precise condition would be (!is_vmalloc_addr || is_vmalloc_addr(iv))

Sorry for the typo, I meant: (!virt_addr_valid(iv) || is_vmalloc_addr(iv))

As far as I know, virt_addr_valid() means the address is in the linear memory space. So it cannot be a vmalloc_addr if it is a linear space addr, can it? At least, it is that way on powerpc, which is the arch embedding the talitos crypto engine. virt_addr_valid() means we are under max_pfn, while VMALLOC_START is above max_pfn.

Christophe

It matches the checks in the debug_dma_map_single() helper, though I am not sure they are enough to rule out all exceptions of the DMA API.
Re: [PATCH 4/11] KVM/MMU: Introduce tlb flush with range list
On 04/01/19 09:53, lantianyu1...@gmail.com wrote: > struct kvm_mmu_page { > struct list_head link; > + > + /* > + * Tlb flush with range list uses struct kvm_mmu_page as list entry > + * and all list operations should be under protection of mmu_lock. > + */ > + struct list_head flush_link; > struct hlist_node hash_link; > bool unsync; > > @@ -443,6 +449,7 @@ struct kvm_mmu { Again, it would be nice not to grow the struct too much, though I understand that it's already relatively big (168 bytes). Can you at least make this an hlist, so that it only takes a single word? Paolo
Re: [PATCH 3/11] KVM: Add spte's point in the struct kvm_mmu_page
On 04/01/19 09:53, lantianyu1...@gmail.com wrote: > @@ -332,6 +332,7 @@ struct kvm_mmu_page { > int root_count; /* Currently serving as active root */ > unsigned int unsync_children; > struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */ > + u64 *sptep; Is this really needed? Can we put the "last" flag in the struct instead as a bool? In fact, if you do u16 unsync_children; bool unsync; bool last_level; the struct does not grow at all. :) (I'm not sure where "large" is tested using the sptep field, even though it is in the commit message). Paolo > /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */ > unsigned long mmu_valid_gen;
[PATCH v4 13/13] drivers/perf: use PERF_PMU_CAP_NO_EXCLUDE for Cavium TX2 PMU
The Cavium ThunderX2 UNCORE PMU driver doesn't support any event filtering. Let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability to simplify the code. Signed-off-by: Andrew Murray --- drivers/perf/thunderx2_pmu.c | 10 +- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/drivers/perf/thunderx2_pmu.c b/drivers/perf/thunderx2_pmu.c index c9a1701..43d76c8 100644 --- a/drivers/perf/thunderx2_pmu.c +++ b/drivers/perf/thunderx2_pmu.c @@ -424,15 +424,6 @@ static int tx2_uncore_event_init(struct perf_event *event) if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK) return -EINVAL; - /* We have no filtering of any kind */ - if (event->attr.exclude_user|| - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle|| - event->attr.exclude_host|| - event->attr.exclude_guest) - return -EINVAL; - if (event->cpu < 0) return -EINVAL; @@ -572,6 +563,7 @@ static int tx2_uncore_pmu_register( .start = tx2_uncore_event_start, .stop = tx2_uncore_event_stop, .read = tx2_uncore_event_read, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; tx2_pmu->pmu.name = devm_kasprintf(dev, GFP_KERNEL, -- 2.7.4
[PATCH v4 12/13] perf/core: remove unused perf_flags
Now that perf_flags is not used we remove it. Signed-off-by: Andrew Murray --- include/uapi/linux/perf_event.h | 2 -- tools/include/uapi/linux/perf_event.h | 2 -- 2 files changed, 4 deletions(-) diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index 9de8780..ea19b5d 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -445,8 +445,6 @@ struct perf_event_query_bpf { __u32 ids[0]; }; -#define perf_flags(attr) (*(&(attr)->read_format + 1)) - /* * Ioctls that can be done on a perf event fd: */ diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h index 9de8780..ea19b5d 100644 --- a/tools/include/uapi/linux/perf_event.h +++ b/tools/include/uapi/linux/perf_event.h @@ -445,8 +445,6 @@ struct perf_event_query_bpf { __u32 ids[0]; }; -#define perf_flags(attr) (*(&(attr)->read_format + 1)) - /* * Ioctls that can be done on a perf event fd: */ -- 2.7.4
[PATCH v4 11/13] x86: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
For x86 PMUs that do not support context exclusion let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability. This ensures that perf will prevent us from handling events where any exclusion flags are set. Let's also remove the now unnecessary check for exclusion flags. This change means that amd/iommu and amd/uncore will now also indicate that they do not support exclude_{hv|idle} and intel/uncore that it does not support exclude_{guest|host}. Signed-off-by: Andrew Murray --- arch/x86/events/amd/iommu.c| 6 +- arch/x86/events/amd/uncore.c | 7 ++- arch/x86/events/intel/uncore.c | 9 + 3 files changed, 4 insertions(+), 18 deletions(-) diff --git a/arch/x86/events/amd/iommu.c b/arch/x86/events/amd/iommu.c index 3210fee..7635c23 100644 --- a/arch/x86/events/amd/iommu.c +++ b/arch/x86/events/amd/iommu.c @@ -223,11 +223,6 @@ static int perf_iommu_event_init(struct perf_event *event) if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK) return -EINVAL; - /* IOMMU counters do not have usr/os/guest/host bits */ - if (event->attr.exclude_user || event->attr.exclude_kernel || - event->attr.exclude_host || event->attr.exclude_guest) - return -EINVAL; - if (event->cpu < 0) return -EINVAL; @@ -414,6 +409,7 @@ static const struct pmu iommu_pmu __initconst = { .read = perf_iommu_read, .task_ctx_nr= perf_invalid_context, .attr_groups= amd_iommu_attr_groups, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; static __init int init_one_iommu(unsigned int idx) diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c index 398df6e..79cfd3b 100644 --- a/arch/x86/events/amd/uncore.c +++ b/arch/x86/events/amd/uncore.c @@ -201,11 +201,6 @@ static int amd_uncore_event_init(struct perf_event *event) if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK) return -EINVAL; - /* NB and Last level cache counters do not have usr/os/guest/host bits */ - if (event->attr.exclude_user || event->attr.exclude_kernel || - event->attr.exclude_host || 
event->attr.exclude_guest) - return -EINVAL; - /* and we do not enable counter overflow interrupts */ hwc->config = event->attr.config & AMD64_RAW_EVENT_MASK_NB; hwc->idx = -1; @@ -307,6 +302,7 @@ static struct pmu amd_nb_pmu = { .start = amd_uncore_start, .stop = amd_uncore_stop, .read = amd_uncore_read, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; static struct pmu amd_llc_pmu = { @@ -317,6 +313,7 @@ static struct pmu amd_llc_pmu = { .start = amd_uncore_start, .stop = amd_uncore_stop, .read = amd_uncore_read, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; static struct amd_uncore *amd_uncore_alloc(unsigned int cpu) diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index 27a4614..d516161 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -695,14 +695,6 @@ static int uncore_pmu_event_init(struct perf_event *event) if (pmu->func_id < 0) return -ENOENT; - /* -* Uncore PMU does measure at all privilege level all the time. -* So it doesn't make sense to specify any exclude bits. -*/ - if (event->attr.exclude_user || event->attr.exclude_kernel || - event->attr.exclude_hv || event->attr.exclude_idle) - return -EINVAL; - /* Sampling not supported yet */ if (hwc->sample_period) return -EINVAL; @@ -800,6 +792,7 @@ static int uncore_pmu_register(struct intel_uncore_pmu *pmu) .stop = uncore_pmu_event_stop, .read = uncore_pmu_event_read, .module = THIS_MODULE, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; } else { pmu->pmu = *pmu->type->pmu; -- 2.7.4
[PATCH v4 10/13] x86: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
For drivers that do not support context exclusion let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability. This ensures that perf will prevent us from handling events where any exclusion flags are set. Let's also remove the now unnecessary check for exclusion flags. Signed-off-by: Andrew Murray --- arch/x86/events/amd/ibs.c | 13 + arch/x86/events/amd/power.c| 10 ++ arch/x86/events/intel/cstate.c | 12 +++- arch/x86/events/intel/rapl.c | 9 ++--- arch/x86/events/intel/uncore_snb.c | 9 ++--- arch/x86/events/msr.c | 10 ++ 6 files changed, 12 insertions(+), 51 deletions(-) diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c index d50bb4d..62f317c 100644 --- a/arch/x86/events/amd/ibs.c +++ b/arch/x86/events/amd/ibs.c @@ -253,15 +253,6 @@ static int perf_ibs_precise_event(struct perf_event *event, u64 *config) return -EOPNOTSUPP; } -static const struct perf_event_attr ibs_notsupp = { - .exclude_user = 1, - .exclude_kernel = 1, - .exclude_hv = 1, - .exclude_idle = 1, - .exclude_host = 1, - .exclude_guest = 1, -}; - static int perf_ibs_init(struct perf_event *event) { struct hw_perf_event *hwc = &event->hw; @@ -282,9 +273,6 @@ static int perf_ibs_init(struct perf_event *event) if (event->pmu != &perf_ibs->pmu) return -ENOENT; - if (perf_flags(&event->attr) & perf_flags(&ibs_notsupp)) - return -EINVAL; - if (config & ~perf_ibs->config_mask) return -EINVAL; @@ -537,6 +525,7 @@ static struct perf_ibs perf_ibs_fetch = { .start = perf_ibs_start, .stop = perf_ibs_stop, .read = perf_ibs_read, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }, .msr= MSR_AMD64_IBSFETCHCTL, .config_mask= IBS_FETCH_CONFIG_MASK, diff --git a/arch/x86/events/amd/power.c b/arch/x86/events/amd/power.c index 2aefacf..c5ff084 100644 --- a/arch/x86/events/amd/power.c +++ b/arch/x86/events/amd/power.c @@ -136,14 +136,7 @@ static int pmu_event_init(struct perf_event *event) return -ENOENT; /* Unsupported modes and filters.
*/ - if (event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest || - /* no sampling */ - event->attr.sample_period) + if (event->attr.sample_period) return -EINVAL; if (cfg != AMD_POWER_EVENTSEL_PKG) @@ -226,6 +219,7 @@ static struct pmu pmu_class = { .start = pmu_event_start, .stop = pmu_event_stop, .read = pmu_event_read, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; static int power_cpu_exit(unsigned int cpu) diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c index d2e7807..94a4b7f 100644 --- a/arch/x86/events/intel/cstate.c +++ b/arch/x86/events/intel/cstate.c @@ -280,13 +280,7 @@ static int cstate_pmu_event_init(struct perf_event *event) return -ENOENT; /* unsupported modes and filters */ - if (event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest || - event->attr.sample_period) /* no sampling */ + if (event->attr.sample_period) /* no sampling */ return -EINVAL; if (event->cpu < 0) @@ -437,7 +431,7 @@ static struct pmu cstate_core_pmu = { .start = cstate_pmu_event_start, .stop = cstate_pmu_event_stop, .read = cstate_pmu_event_update, - .capabilities = PERF_PMU_CAP_NO_INTERRUPT, + .capabilities = PERF_PMU_CAP_NO_INTERRUPT | PERF_PMU_CAP_NO_EXCLUDE, .module = THIS_MODULE, }; @@ -451,7 +445,7 @@ static struct pmu cstate_pkg_pmu = { .start = cstate_pmu_event_start, .stop = cstate_pmu_event_stop, .read = cstate_pmu_event_update, - .capabilities = PERF_PMU_CAP_NO_INTERRUPT, + .capabilities = PERF_PMU_CAP_NO_INTERRUPT | PERF_PMU_CAP_NO_EXCLUDE, .module = THIS_MODULE, }; diff --git a/arch/x86/events/intel/rapl.c b/arch/x86/events/intel/rapl.c index 91039ff..94dc564 100644 --- a/arch/x86/events/intel/rapl.c +++ b/arch/x86/events/intel/rapl.c @@ -397,13 +397,7 @@ static int rapl_pmu_event_init(struct 
perf_event *event) return -EINVAL; /* unsupported modes and filters */ - if (event->attr.exclude_user || - event->attr.exclude_kernel ||
[PATCH v4 09/13] powerpc: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
For PowerPC PMUs that do not support context exclusion let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability. This ensures that perf will prevent us from handling events where any exclusion flags are set. Let's also remove the now unnecessary check for exclusion flags. Signed-off-by: Andrew Murray Reviewed-by: Madhavan Srinivasan Acked-by: Michael Ellerman --- arch/powerpc/perf/hv-24x7.c | 10 +- arch/powerpc/perf/hv-gpci.c | 10 +- arch/powerpc/perf/imc-pmu.c | 19 +-- 3 files changed, 3 insertions(+), 36 deletions(-) diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c index 72238ee..d2b8e60 100644 --- a/arch/powerpc/perf/hv-24x7.c +++ b/arch/powerpc/perf/hv-24x7.c @@ -1306,15 +1306,6 @@ static int h_24x7_event_init(struct perf_event *event) return -EINVAL; } - /* unsupported modes and filters */ - if (event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest) - return -EINVAL; - /* no branch sampling */ if (has_branch_stack(event)) return -EOPNOTSUPP; @@ -1577,6 +1568,7 @@ static struct pmu h_24x7_pmu = { .start_txn = h_24x7_event_start_txn, .commit_txn = h_24x7_event_commit_txn, .cancel_txn = h_24x7_event_cancel_txn, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; static int hv_24x7_init(void) diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c index 43fabb3..735e77b 100644 --- a/arch/powerpc/perf/hv-gpci.c +++ b/arch/powerpc/perf/hv-gpci.c @@ -232,15 +232,6 @@ static int h_gpci_event_init(struct perf_event *event) return -EINVAL; } - /* unsupported modes and filters */ - if (event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest) - return -EINVAL; - /* no branch sampling */ if (has_branch_stack(event)) return -EOPNOTSUPP; @@ -285,6 +276,7 @@ static struct pmu h_gpci_pmu = { .start = 
h_gpci_event_start, .stop= h_gpci_event_stop, .read= h_gpci_event_update, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; static int hv_gpci_init(void) diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index f292a3f..b1c37cc 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -473,15 +473,6 @@ static int nest_imc_event_init(struct perf_event *event) if (event->hw.sample_period) return -EINVAL; - /* unsupported modes and filters */ - if (event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest) - return -EINVAL; - if (event->cpu < 0) return -EINVAL; @@ -748,15 +739,6 @@ static int core_imc_event_init(struct perf_event *event) if (event->hw.sample_period) return -EINVAL; - /* unsupported modes and filters */ - if (event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest) - return -EINVAL; - if (event->cpu < 0) return -EINVAL; @@ -1069,6 +1051,7 @@ static int update_pmu_ops(struct imc_pmu *pmu) pmu->pmu.stop = imc_event_stop; pmu->pmu.read = imc_event_update; pmu->pmu.attr_groups = pmu->attr_groups; + pmu->pmu.capabilities = PERF_PMU_CAP_NO_EXCLUDE; pmu->attr_groups[IMC_FORMAT_ATTR] = &imc_format_group; switch (pmu->domain) { -- 2.7.4
[PATCH v4 08/13] drivers/perf: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
For drivers that do not support context exclusion let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability. This ensures that perf will prevent us from handling events where any exclusion flags are set. Let's also remove the now unnecessary check for exclusion flags. This change means that qcom_{l2|l3}_pmu will now also indicate that they do not support exclude_{host|guest} and that xgene_pmu also does not support exclude_idle and exclude_hv. Note that for qcom_l2_pmu we now implicitly return -EINVAL instead of -EOPNOTSUPP. This change will result in the perf userspace utility retrying the perf_event_open system call with fallback event attributes that do not fail. Signed-off-by: Andrew Murray Acked-by: Will Deacon --- drivers/perf/qcom_l2_pmu.c | 9 + drivers/perf/qcom_l3_pmu.c | 8 +--- drivers/perf/xgene_pmu.c | 6 +- 3 files changed, 3 insertions(+), 20 deletions(-) diff --git a/drivers/perf/qcom_l2_pmu.c b/drivers/perf/qcom_l2_pmu.c index 842135c..091b4d7 100644 --- a/drivers/perf/qcom_l2_pmu.c +++ b/drivers/perf/qcom_l2_pmu.c @@ -509,14 +509,6 @@ static int l2_cache_event_init(struct perf_event *event) return -EOPNOTSUPP; } - /* We cannot filter accurately so we just don't allow it.
*/ - if (event->attr.exclude_user || event->attr.exclude_kernel || - event->attr.exclude_hv || event->attr.exclude_idle) { - dev_dbg_ratelimited(&l2cache_pmu->pdev->dev, - "Can't exclude execution levels\n"); - return -EOPNOTSUPP; - } - if (((L2_EVT_GROUP(event->attr.config) > L2_EVT_GROUP_MAX) || ((event->attr.config & ~L2_EVT_MASK) != 0)) && (event->attr.config != L2CYCLE_CTR_RAW_CODE)) { @@ -982,6 +974,7 @@ static int l2_cache_pmu_probe(struct platform_device *pdev) .stop = l2_cache_event_stop, .read = l2_cache_event_read, .attr_groups= l2_cache_pmu_attr_grps, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; l2cache_pmu->num_counters = get_num_counters(); diff --git a/drivers/perf/qcom_l3_pmu.c b/drivers/perf/qcom_l3_pmu.c index 2dc63d6..5d70646 100644 --- a/drivers/perf/qcom_l3_pmu.c +++ b/drivers/perf/qcom_l3_pmu.c @@ -495,13 +495,6 @@ static int qcom_l3_cache__event_init(struct perf_event *event) return -ENOENT; /* -* There are no per-counter mode filters in the PMU. -*/ - if (event->attr.exclude_user || event->attr.exclude_kernel || - event->attr.exclude_hv || event->attr.exclude_idle) - return -EINVAL; - - /* * Sampling not supported since these events are not core-attributable. 
*/ if (hwc->sample_period) @@ -777,6 +770,7 @@ static int qcom_l3_cache_pmu_probe(struct platform_device *pdev) .read = qcom_l3_cache__event_read, .attr_groups= qcom_l3_cache_pmu_attr_grps, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; memrc = platform_get_resource(pdev, IORESOURCE_MEM, 0); diff --git a/drivers/perf/xgene_pmu.c b/drivers/perf/xgene_pmu.c index 0dc9ff0..d4ec048 100644 --- a/drivers/perf/xgene_pmu.c +++ b/drivers/perf/xgene_pmu.c @@ -917,11 +917,6 @@ static int xgene_perf_event_init(struct perf_event *event) if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK) return -EINVAL; - /* SOC counters do not have usr/os/guest/host bits */ - if (event->attr.exclude_user || event->attr.exclude_kernel || - event->attr.exclude_host || event->attr.exclude_guest) - return -EINVAL; - if (event->cpu < 0) return -EINVAL; /* @@ -1136,6 +1131,7 @@ static int xgene_init_perf(struct xgene_pmu_dev *pmu_dev, char *name) .start = xgene_perf_start, .stop = xgene_perf_stop, .read = xgene_perf_read, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; /* Hardware counter init */ -- 2.7.4
[PATCH v4 07/13] drivers/perf: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
For drivers that do not support context exclusion let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability. This ensures that perf will prevent us from handling events where any exclusion flags are set. Let's also remove the now unnecessary check for exclusion flags. Signed-off-by: Andrew Murray Acked-by: Will Deacon --- drivers/perf/arm-cci.c| 10 +- drivers/perf/arm-ccn.c| 6 ++ drivers/perf/arm_dsu_pmu.c| 9 ++--- drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c | 1 + drivers/perf/hisilicon/hisi_uncore_hha_pmu.c | 1 + drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c | 1 + drivers/perf/hisilicon/hisi_uncore_pmu.c | 9 - 7 files changed, 8 insertions(+), 29 deletions(-) diff --git a/drivers/perf/arm-cci.c b/drivers/perf/arm-cci.c index 1bfeb16..bfd03e0 100644 --- a/drivers/perf/arm-cci.c +++ b/drivers/perf/arm-cci.c @@ -1327,15 +1327,6 @@ static int cci_pmu_event_init(struct perf_event *event) if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK) return -EOPNOTSUPP; - /* We have no filtering of any kind */ - if (event->attr.exclude_user|| - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle|| - event->attr.exclude_host|| - event->attr.exclude_guest) - return -EINVAL; - /* * Following the example set by other "uncore" PMUs, we accept any CPU * and rewrite its affinity dynamically rather than having perf core @@ -1433,6 +1424,7 @@ static int cci_pmu_init(struct cci_pmu *cci_pmu, struct platform_device *pdev) .stop = cci_pmu_stop, .read = pmu_read, .attr_groups= pmu_attr_groups, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; cci_pmu->plat_device = pdev; diff --git a/drivers/perf/arm-ccn.c b/drivers/perf/arm-ccn.c index 7dd850e..2ae7602 100644 --- a/drivers/perf/arm-ccn.c +++ b/drivers/perf/arm-ccn.c @@ -741,10 +741,7 @@ static int arm_ccn_pmu_event_init(struct perf_event *event) return -EOPNOTSUPP; } - if (has_branch_stack(event) || event->attr.exclude_user || - event->attr.exclude_kernel || event->attr.exclude_hv || - 
event->attr.exclude_idle || event->attr.exclude_host || - event->attr.exclude_guest) { + if (has_branch_stack(event)) { dev_dbg(ccn->dev, "Can't exclude execution levels!\n"); return -EINVAL; } @@ -1290,6 +1287,7 @@ static int arm_ccn_pmu_init(struct arm_ccn *ccn) .read = arm_ccn_pmu_event_read, .pmu_enable = arm_ccn_pmu_enable, .pmu_disable = arm_ccn_pmu_disable, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; /* No overflow interrupt? Have to use a timer instead. */ diff --git a/drivers/perf/arm_dsu_pmu.c b/drivers/perf/arm_dsu_pmu.c index 660cb8a..5851de5 100644 --- a/drivers/perf/arm_dsu_pmu.c +++ b/drivers/perf/arm_dsu_pmu.c @@ -562,13 +562,7 @@ static int dsu_pmu_event_init(struct perf_event *event) return -EINVAL; } - if (has_branch_stack(event) || - event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest) { + if (has_branch_stack(event)) { dev_dbg(dsu_pmu->pmu.dev, "Can't support filtering\n"); return -EINVAL; } @@ -735,6 +729,7 @@ static int dsu_pmu_device_probe(struct platform_device *pdev) .read = dsu_pmu_read, .attr_groups= dsu_pmu_attr_groups, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; rc = perf_pmu_register(&dsu_pmu->pmu, name, -1); diff --git a/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c b/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c index 69372e2..0eba947 100644 --- a/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c +++ b/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c @@ -396,6 +396,7 @@ static int hisi_ddrc_pmu_probe(struct platform_device *pdev) .stop = hisi_uncore_pmu_stop, .read = hisi_uncore_pmu_read, .attr_groups= hisi_ddrc_pmu_attr_groups, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; ret = perf_pmu_register(&ddrc_pmu->pmu, name, -1); diff --git a/drivers/perf/hisilicon/hisi_uncore_hha_pmu.c b/drivers/perf/hisilicon/hisi_uncore_hha_pmu.c index 443906e..2553a84 100644 --- a/drivers/perf/hisilicon/hisi_uncore_hha_pmu.c +++ 
b/drivers/perf/hisilicon/hisi_uncore_hha_pmu.c @@ -407,6 +407,7 @@ static int hisi_hha_pmu_
[PATCH v4 06/13] arm: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
For drivers that do not support context exclusion let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability. This ensures that perf will prevent us from handling events where any exclusion flags are set. Let's also remove the now unnecessary check for exclusion flags. Signed-off-by: Andrew Murray Acked-by: Shawn Guo Acked-by: Will Deacon --- arch/arm/mach-imx/mmdc.c | 9 ++--- arch/arm/mm/cache-l2x0-pmu.c | 9 + 2 files changed, 3 insertions(+), 15 deletions(-) diff --git a/arch/arm/mach-imx/mmdc.c b/arch/arm/mach-imx/mmdc.c index e49e068..fce4b42 100644 --- a/arch/arm/mach-imx/mmdc.c +++ b/arch/arm/mach-imx/mmdc.c @@ -294,13 +294,7 @@ static int mmdc_pmu_event_init(struct perf_event *event) return -EOPNOTSUPP; } - if (event->attr.exclude_user|| - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle|| - event->attr.exclude_host|| - event->attr.exclude_guest || - event->attr.sample_period) + if (event->attr.sample_period) return -EINVAL; if (cfg < 0 || cfg >= MMDC_NUM_COUNTERS) @@ -456,6 +450,7 @@ static int mmdc_pmu_init(struct mmdc_pmu *pmu_mmdc, .start = mmdc_pmu_event_start, .stop = mmdc_pmu_event_stop, .read = mmdc_pmu_event_update, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }, .mmdc_base = mmdc_base, .dev = dev, diff --git a/arch/arm/mm/cache-l2x0-pmu.c b/arch/arm/mm/cache-l2x0-pmu.c index afe5b4c..99bcd07 100644 --- a/arch/arm/mm/cache-l2x0-pmu.c +++ b/arch/arm/mm/cache-l2x0-pmu.c @@ -314,14 +314,6 @@ static int l2x0_pmu_event_init(struct perf_event *event) event->attach_state & PERF_ATTACH_TASK) return -EINVAL; - if (event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest) - return -EINVAL; - if (event->cpu < 0) return -EINVAL; @@ -544,6 +536,7 @@ static __init int l2x0_pmu_init(void) .del = l2x0_pmu_event_del, .event_init = l2x0_pmu_event_init, .attr_groups = l2x0_pmu_attr_groups, + .capabilities = 
PERF_PMU_CAP_NO_EXCLUDE, }; l2x0_pmu_reset(); -- 2.7.4
[PATCH v4 05/13] arm: perf: conditionally use PERF_PMU_CAP_NO_EXCLUDE
The ARM PMU driver can be used to represent a variety of ARM-based PMUs. Some of these PMUs do not provide support for context exclusion; where this is the case, we advertise the PERF_PMU_CAP_NO_EXCLUDE capability to ensure that perf prevents us from handling events where any exclusion flags are set. Signed-off-by: Andrew Murray Acked-by: Will Deacon --- drivers/perf/arm_pmu.c | 15 +-- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c index d0b7dd8..eec75b9 100644 --- a/drivers/perf/arm_pmu.c +++ b/drivers/perf/arm_pmu.c @@ -357,13 +357,6 @@ static irqreturn_t armpmu_dispatch_irq(int irq, void *dev) } static int -event_requires_mode_exclusion(struct perf_event_attr *attr) -{ - return attr->exclude_idle || attr->exclude_user || - attr->exclude_kernel || attr->exclude_hv; -} - -static int __hw_perf_event_init(struct perf_event *event) { struct arm_pmu *armpmu = to_arm_pmu(event->pmu); @@ -393,9 +386,8 @@ __hw_perf_event_init(struct perf_event *event) /* * Check whether we need to exclude the counter from certain modes. */ - if ((!armpmu->set_event_filter || -armpmu->set_event_filter(hwc, &event->attr)) && -event_requires_mode_exclusion(&event->attr)) { + if (armpmu->set_event_filter && + armpmu->set_event_filter(hwc, &event->attr)) { pr_debug("ARM performance counters do not support " "mode exclusion\n"); return -EOPNOTSUPP; } @@ -867,6 +859,9 @@ int armpmu_register(struct arm_pmu *pmu) if (ret) return ret; + if (!pmu->set_event_filter) + pmu->pmu.capabilities |= PERF_PMU_CAP_NO_EXCLUDE; + ret = perf_pmu_register(&pmu->pmu, pmu->name, -1); if (ret) goto out_destroy; -- 2.7.4
[PATCH v4 04/13] alpha: perf/core: use PERF_PMU_CAP_NO_EXCLUDE
As the Alpha PMU doesn't support context exclusion let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability. This ensures that perf will prevent us from handling events where any exclusion flags are set. Let's also remove the now unnecessary check for exclusion flags. This change means that __hw_perf_event_init will now also indicate that it doesn't support exclude_host and exclude_guest and will now implicitly return -EINVAL instead of -EPERM. This is likely more desirable as -EPERM will result in a kernel.perf_event_paranoid related warning from the perf userspace utility. Signed-off-by: Andrew Murray --- arch/alpha/kernel/perf_event.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c index 5613aa37..4341ccf 100644 --- a/arch/alpha/kernel/perf_event.c +++ b/arch/alpha/kernel/perf_event.c @@ -630,12 +630,6 @@ static int __hw_perf_event_init(struct perf_event *event) return ev; } - /* The EV67 does not support mode exclusion */ - if (attr->exclude_kernel || attr->exclude_user - || attr->exclude_hv || attr->exclude_idle) { - return -EPERM; - } - /* * We place the event type in event_base here and leave calculation * of the codes to programme the PMU for alpha_pmu_enable() because @@ -771,6 +765,7 @@ static struct pmu pmu = { .start = alpha_pmu_start, .stop = alpha_pmu_stop, .read = alpha_pmu_read, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; -- 2.7.4
[PATCH v4 03/13] perf/core: add PERF_PMU_CAP_NO_EXCLUDE for exclusion incapable PMUs
Many PMU drivers do not have the capability to exclude counting events that occur in specific contexts such as idle, kernel, guest, etc. These drivers indicate this by returning an error in their event_init upon testing the event's attribute flags. This approach is error-prone and often inconsistent. Let's instead allow PMU drivers to advertise their inability to exclude based on context via a new capability: PERF_PMU_CAP_NO_EXCLUDE. This allows the perf core to reject requests for exclusion events where there is no support in the PMU. Signed-off-by: Andrew Murray --- include/linux/perf_event.h | 1 + kernel/events/core.c | 9 + 2 files changed, 10 insertions(+) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 54a78d2..cec02dc 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -244,6 +244,7 @@ struct perf_event; #define PERF_PMU_CAP_EXCLUSIVE 0x10 #define PERF_PMU_CAP_ITRACE0x20 #define PERF_PMU_CAP_HETEROGENEOUS_CPUS0x40 +#define PERF_PMU_CAP_NO_EXCLUDE0x80 /** * struct pmu - generic performance monitoring unit diff --git a/kernel/events/core.c b/kernel/events/core.c index 3cd13a3..fbe59b7 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9772,6 +9772,15 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event) if (ctx) perf_event_ctx_unlock(event->group_leader, ctx); + if (!ret) { + if (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE && + event_has_any_exclude_flag(event)) { + if (event->destroy) + event->destroy(event); + ret = -EINVAL; + } + } + if (ret) module_put(pmu->module); -- 2.7.4
[PATCH v4 02/13] perf/core: add function to test for event exclusion flags
Add a function that tests if any of the perf event exclusion flags are set on a given event. Signed-off-by: Andrew Murray --- include/linux/perf_event.h | 9 + 1 file changed, 9 insertions(+) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 1d5c551..54a78d2 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1004,6 +1004,15 @@ perf_event__output_id_sample(struct perf_event *event, extern void perf_log_lost_samples(struct perf_event *event, u64 lost); +static inline bool event_has_any_exclude_flag(struct perf_event *event) +{ + struct perf_event_attr *attr = &event->attr; + + return attr->exclude_idle || attr->exclude_user || + attr->exclude_kernel || attr->exclude_hv || + attr->exclude_guest || attr->exclude_host; +} + static inline bool is_sampling_event(struct perf_event *event) { return event->attr.sample_period != 0; -- 2.7.4
[PATCH v4 00/13] perf/core: Generalise event exclusion checking
Many PMU drivers do not have the capability to exclude counting events that occur in specific contexts such as idle, kernel, guest, etc. These drivers indicate this by returning an error in their event_init upon testing the event's attribute flags. However, this approach requires that each time a new event modifier is added to perf, all the perf drivers need to be modified to indicate that they don't support the attribute. This results in additional boiler-plate code common to many drivers that needs to be maintained. Furthermore, the drivers are not consistent with regards to the error value they return when reporting unsupported attributes. This patchset allows PMU drivers to advertise their inability to exclude based on context via a new capability: PERF_PMU_CAP_NO_EXCLUDE. This allows the perf core to reject requests for exclusion events where there is no support in the PMU. This is a functional change, in particular:

- Some drivers will now additionally (but correctly) report unsupported exclusion flags. It's typical for existing userspace tools such as perf to handle such errors by retrying the system call without the unsupported flags.

- Drivers that do not support any exclusion and previously reported -EPERM or -EOPNOTSUPP will now report -EINVAL - this is consistent with the majority and results in userspace perf retrying without exclusion.

All drivers touched by this patchset have been compile tested.
Changes from v3:
 - Added PERF_PMU_CAP_NO_EXCLUDE to Cavium TX2 PMU driver

Changes from v2:
 - Invert logic from CAP_EXCLUDE to CAP_NO_EXCLUDE

Changes from v1:
 - Changed approach from explicitly rejecting events in unsupporting PMU
   drivers to explicitly advertising a capability in PMU drivers that do
   support exclusion events
 - Added additional information to tools/perf/design.txt
 - Rename event_has_exclude_flags to event_has_any_exclude_flag and
   update commit log to reflect that it's a function

Andrew Murray (13):
  perf/doc: update design.txt for exclude_{host|guest} flags
  perf/core: add function to test for event exclusion flags
  perf/core: add PERF_PMU_CAP_NO_EXCLUDE for exclusion incapable PMUs
  alpha: perf/core: use PERF_PMU_CAP_NO_EXCLUDE
  arm: perf: conditionally use PERF_PMU_CAP_NO_EXCLUDE
  arm: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
  drivers/perf: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude
    incapable PMUs
  drivers/perf: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude
    incapable PMUs
  powerpc: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable
    PMUs
  x86: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
  x86: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
  perf/core: remove unused perf_flags
  drivers/perf: use PERF_PMU_CAP_NO_EXCLUDE for Cavium TX2 PMU

 arch/alpha/kernel/perf_event.c                |  7 +--
 arch/arm/mach-imx/mmdc.c                      |  9 ++---
 arch/arm/mm/cache-l2x0-pmu.c                  |  9 +
 arch/powerpc/perf/hv-24x7.c                   | 10 +-
 arch/powerpc/perf/hv-gpci.c                   | 10 +-
 arch/powerpc/perf/imc-pmu.c                   | 19 +--
 arch/x86/events/amd/ibs.c                     | 13 +
 arch/x86/events/amd/iommu.c                   |  6 +-
 arch/x86/events/amd/power.c                   | 10 ++
 arch/x86/events/amd/uncore.c                  |  7 ++-
 arch/x86/events/intel/cstate.c                | 12 +++-
 arch/x86/events/intel/rapl.c                  |  9 ++---
 arch/x86/events/intel/uncore.c                |  9 +
 arch/x86/events/intel/uncore_snb.c            |  9 ++---
 arch/x86/events/msr.c                         | 10 ++
 drivers/perf/arm-cci.c                        | 10 +-
 drivers/perf/arm-ccn.c                        |  6 ++
 drivers/perf/arm_dsu_pmu.c                    |  9 ++---
 drivers/perf/arm_pmu.c                        | 15 +--
 drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c |  1 +
 drivers/perf/hisilicon/hisi_uncore_hha_pmu.c  |  1 +
 drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c  |  1 +
 drivers/perf/hisilicon/hisi_uncore_pmu.c      |  9 -
 drivers/perf/qcom_l2_pmu.c                    |  9 +
 drivers/perf/qcom_l3_pmu.c                    |  8 +---
 drivers/perf/thunderx2_pmu.c                  | 10 +-
 drivers/perf/xgene_pmu.c                      |  6 +-
 include/linux/perf_event.h                    | 10 ++
 include/uapi/linux/perf_event.h               |  2 --
 kernel/events/core.c                          |  9 +
 tools/include/uapi/linux/perf_event.h         |  2 --
 tools/perf/design.txt                         |  4
 32 files changed, 63 insertions(+), 198 deletions(-)

-- 
2.7.4
[PATCH v4 01/13] perf/doc: update design.txt for exclude_{host|guest} flags
Update design.txt to reflect the presence of the exclude_host and
exclude_guest perf flags.

Signed-off-by: Andrew Murray
---
 tools/perf/design.txt | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/perf/design.txt b/tools/perf/design.txt
index a28dca2..0453ba2 100644
--- a/tools/perf/design.txt
+++ b/tools/perf/design.txt
@@ -222,6 +222,10 @@ The 'exclude_user', 'exclude_kernel' and 'exclude_hv' bits provide a
 way to request that counting of events be restricted to times when the
 CPU is in user, kernel and/or hypervisor mode.
 
+Furthermore the 'exclude_host' and 'exclude_guest' bits provide a way
+to request counting of events restricted to guest and host contexts when
+using Linux as the hypervisor.
+
 The 'mmap' and 'munmap' bits allow recording of PROT_EXEC mmap/munmap
 operations, these can be used to relate userspace IP addresses to actual
 code, even after the mapping (or even the whole process) is gone,
-- 
2.7.4
Re: [PATCH 9/11] KVM/MMU: Flush tlb in the kvm_mmu_write_protect_pt_masked()
On 04/01/19 09:54, lantianyu1...@gmail.com wrote:
> 		rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
> 					  PT_PAGE_TABLE_LEVEL, slot);
> -		__rmap_write_protect(kvm, rmap_head, false);
> +		flush |= __rmap_write_protect(kvm, rmap_head, false);
> 
> 		/* clear the first set bit */
> 		mask &= mask - 1;
> 	}
> +
> +	if (flush && kvm_available_flush_tlb_with_range()) {
> +		kvm_flush_remote_tlbs_with_address(kvm,
> +				slot->base_gfn + gfn_offset,
> +				hweight_long(mask));

Mask is zero here, so this probably won't work.

In addition, I suspect calling the hypercall once for every 64 pages is
not very efficient. Passing a flush list into
kvm_mmu_write_protect_pt_masked, and flushing in
kvm_arch_mmu_enable_log_dirty_pt_masked, isn't efficient either because
kvm_arch_mmu_enable_log_dirty_pt_masked is also called once per word.

I don't have any good ideas, except for moving the whole
kvm_clear_dirty_log_protect loop into architecture-specific code (which
is not the direction we want---architectures should share more code, not
less).

Paolo

> +		flush = false;
> +	}
> +
Re: [PATCH 7/11] KVM: Remove redundant check in the kvm_get_dirty_log_protect()
On 04/01/19 16:50, Sean Christopherson wrote:
> Tangentially related, does mmu_lock actually need to be held while we
> walk dirty_bitmap in kvm_{clear,get}_dirty_log_protect()?  The bitmap
> itself is protected by slots_lock (a lockdep assertion would be nice
> too), e.g. can we grab the lock iff dirty_bitmap[i] != 0?

Yes, we could avoid grabbing it as long as the bitmap is zero. However,
without kvm->manual_dirty_log_protect, the granularity of
kvm_get_dirty_log_protect() is too coarse so it won't happen in
practice. Instead, with the new manual clear, kvm_get_dirty_log_protect()
does not take the lock and a well-written userspace is not going to call
the clear ioctl unless some bits are set.

Paolo
Re: [PATCH 6/11] KVM/MMU: Flush tlb with range list in sync_page()
On 04/01/19 17:30, Sean Christopherson wrote:
>> +
>> +		if (kvm_available_flush_tlb_with_range()
>> +		    && (tmp_spte_ret & SET_SPTE_NEED_REMOTE_TLB_FLUSH)) {
>> +			struct kvm_mmu_page *leaf_sp = page_header(sp->spt[i]
>> +					& PT64_BASE_ADDR_MASK);
>> +			list_add(&leaf_sp->flush_link, &flush_list);
>> +		}
>> +
>> +		set_spte_ret |= tmp_spte_ret;
>> +
>>  	}
>>  
>>  	if (set_spte_ret & SET_SPTE_NEED_REMOTE_TLB_FLUSH)
>> -		kvm_flush_remote_tlbs(vcpu->kvm);
>> +		kvm_flush_remote_tlbs_with_list(vcpu->kvm, &flush_list);
> This is a bit confusing and potentially fragile.  It's not obvious that
> kvm_flush_remote_tlbs_with_list() is guaranteed to call
> kvm_flush_remote_tlbs() when kvm_available_flush_tlb_with_range() is
> false, and you're relying on the kvm_flush_remote_tlbs_with_list() call
> chain to never optimize away the empty list case.  Rechecking
> kvm_available_flush_tlb_with_range() isn't expensive.

Alternatively, do not check it during the loop: always build the
flush_list, and always call kvm_flush_remote_tlbs_with_list. The
function can then check whether the list is empty, and the OR of
tmp_spte_ret on every iteration goes away.

Paolo
Re: [PATCH v3 1/2] selftests/powerpc: Add MSR bits
hi Christophe,

On 1/7/19 10:47 AM, Christophe Leroy wrote:
> Hi Breno,
> 
> Le 07/01/2019 à 13:44, Breno Leitao a écrit :
>> hi Christophe,
>>
>> On 1/3/19 3:19 PM, LEROY Christophe wrote:
>>> Breno Leitao a écrit :
>>>> This patch simply adds definitions for the MSR bits and some macros
>>>> to test for MSR TM bits.
>>>>
>>>> This was copied from arch/powerpc/include/asm/reg.h generic MSR part.
>>>
>>> Can't we find a way to avoid duplicating such defines ?
>>
>> I think there are three possible ways, but none of them respect the
>> premises we are used to. These are the possible ways I can think of:
>>
>> 1) Including arch/powerpc/include/asm as part of the selftest
>> compilation process.
>> Problem: This might break the selftest independence of the kbuild
>> system.
>>
>> 2) Generate a temporary header file inside selftests/include which
>> contains these macros at compilation time.
>> Problem: The same problem as above.
>>
>> 3) Define MSR fields at userspace headers (/usr/include).
>> Problem: I am not sure userspace should have MSR bits information.
>>
>> Do you suggest me to investigate any other way?
> 
> Looking at other .h in selftests, it looks like they are limited to
> only the strictly necessary values.
> 
> Are all the values you have listed used ? If not, could you only
> include in the file the necessary ones ?

Sure. That works also. Let me send a v4 patch.
Re: [PATCH v3 1/2] selftests/powerpc: Add MSR bits
Hi Breno,

Le 07/01/2019 à 13:44, Breno Leitao a écrit :
> hi Christophe,
> 
> On 1/3/19 3:19 PM, LEROY Christophe wrote:
>> Breno Leitao a écrit :
>>> This patch simply adds definitions for the MSR bits and some macros
>>> to test for MSR TM bits.
>>>
>>> This was copied from arch/powerpc/include/asm/reg.h generic MSR part.
>>
>> Can't we find a way to avoid duplicating such defines ?
> 
> I think there are three possible ways, but none of them respect the
> premises we are used to. These are the possible ways I can think of:
> 
> 1) Including arch/powerpc/include/asm as part of the selftest
> compilation process.
> Problem: This might break the selftest independence of the kbuild
> system.
> 
> 2) Generate a temporary header file inside selftests/include which
> contains these macros at compilation time.
> Problem: The same problem as above.
> 
> 3) Define MSR fields at userspace headers (/usr/include).
> Problem: I am not sure userspace should have MSR bits information.
> 
> Do you suggest me to investigate any other way?

Looking at other .h in selftests, it looks like they are limited to only
the strictly necessary values.

Are all the values you have listed used ? If not, could you only include
in the file the necessary ones ?

Christophe
Re: [PATCH v3 1/2] selftests/powerpc: Add MSR bits
hi Christophe,

On 1/3/19 3:19 PM, LEROY Christophe wrote:
> Breno Leitao a écrit :
> 
>> This patch simply adds definitions for the MSR bits and some macros to
>> test for MSR TM bits.
>>
>> This was copied from arch/powerpc/include/asm/reg.h generic MSR part.
> 
> Can't we find a way to avoid duplicating such defines ?

I think there are three possible ways, but none of them respect the
premises we are used to. These are the possible ways I can think of:

1) Including arch/powerpc/include/asm as part of the selftest
compilation process.
Problem: This might break the selftest independence of the kbuild
system.

2) Generate a temporary header file inside selftests/include which
contains these macros at compilation time.
Problem: The same problem as above.

3) Define MSR fields at userspace headers (/usr/include).
Problem: I am not sure userspace should have MSR bits information.

Do you suggest me to investigate any other way?
[PATCH] powerpc: use a CONSOLE_LOGLEVEL_DEBUG macro
Use a CONSOLE_LOGLEVEL_DEBUG macro for console_loglevel rather than a
naked number.

Signed-off-by: Sergey Senozhatsky
---
 arch/powerpc/kernel/udbg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/udbg.c b/arch/powerpc/kernel/udbg.c
index 7cc38b5b58bc..8db4891acdaf 100644
--- a/arch/powerpc/kernel/udbg.c
+++ b/arch/powerpc/kernel/udbg.c
@@ -74,7 +74,7 @@ void __init udbg_early_init(void)
 #endif
 
 #ifdef CONFIG_PPC_EARLY_DEBUG
-	console_loglevel = 10;
+	console_loglevel = CONSOLE_LOGLEVEL_DEBUG;
 	register_early_udbg_console();
 #endif
-- 
2.20.1