Re: [PATCH v2 1/2] mm: add probe_user_read()
On Tue, Jan 08, 2019 at 07:37:44AM +, Christophe Leroy wrote:
> In powerpc code, there are several places implementing safe
> access to user data. This is sometimes implemented using
> probe_kernel_address() with additional access_ok() verification,
> sometimes with get_user() enclosed in a pagefault_disable()/enable()
> pair, etc.:
>
>     show_user_instructions()
>     bad_stack_expansion()
>     p9_hmi_special_emu()
>     fsl_pci_mcheck_exception()
>     read_user_stack_64()
>     read_user_stack_32() on PPC64
>     read_user_stack_32() on PPC32
>     power_pmu_bhrb_to()
>
> In the same spirit as probe_kernel_read(), this patch adds
> probe_user_read().
>
> probe_user_read() does the same as probe_kernel_read() but
> first checks that it is really a user address.
>
> Signed-off-by: Christophe Leroy
> ---
> v2: Added "Returns:" comment and removed probe_user_address()
>
> Changes since RFC: Made a static inline function instead of weak function
> as recommended by Kees.
>
>  include/linux/uaccess.h | 34 ++
>  1 file changed, 34 insertions(+)
>
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 37b226e8df13..07f4f0ed69bc 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -263,6 +263,40 @@ extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
>  #define probe_kernel_address(addr, retval)	\
>  	probe_kernel_read(&retval, addr, sizeof(retval))
>
> +/**
> + * probe_user_read(): safely attempt to read from a user location
> + * @dst: pointer to the buffer that shall take the data
> + * @src: address to read from
> + * @size: size of the data chunk
> + *
> + * Returns: 0 on success, -EFAULT on error.

Nit: please put the "Returns:" comment after the description, otherwise
kernel-doc considers it a part of the elaborate description.

> + *
> + * Safely read from address @src to the buffer at @dst. If a kernel fault
> + * happens, handle that and return -EFAULT.
> + *
> + * We ensure that the copy_from_user is executed in atomic context so that
> + * do_page_fault() doesn't attempt to take mmap_sem. This makes
> + * probe_user_read() suitable for use within regions where the caller
> + * already holds mmap_sem, or other locks which nest inside mmap_sem.
> + */
> +
> +#ifndef probe_user_read
> +static __always_inline long probe_user_read(void *dst, const void __user *src,
> +					    size_t size)
> +{
> +	long ret;
> +
> +	if (!access_ok(src, size))
> +		return -EFAULT;
> +
> +	pagefault_disable();
> +	ret = __copy_from_user_inatomic(dst, src, size);
> +	pagefault_enable();
> +
> +	return ret ? -EFAULT : 0;
> +}
> +#endif
> +
>  #ifndef user_access_begin
>  #define user_access_begin(ptr,len) access_ok(ptr, len)
>  #define user_access_end() do { } while (0)
> --
> 2.13.3

--
Sincerely yours,
Mike.
[PATCH v2 2/2] powerpc: use probe_user_read()
Instead of opencoding it, use probe_user_read() to safely read a user
location.

Signed-off-by: Christophe Leroy
---
v2: Using probe_user_read() instead of probe_user_address()

 arch/powerpc/kernel/process.c   | 12 +---
 arch/powerpc/mm/fault.c         |  6 +-
 arch/powerpc/perf/callchain.c   | 20 +++-
 arch/powerpc/perf/core-book3s.c |  8 +---
 arch/powerpc/sysdev/fsl_pci.c   | 10 --
 5 files changed, 10 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index ce393df243aa..6a4b59d574c2 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1298,16 +1298,6 @@ void show_user_instructions(struct pt_regs *regs)

 	pc = regs->nip - (NR_INSN_TO_PRINT * 3 / 4 * sizeof(int));

-	/*
-	 * Make sure the NIP points at userspace, not kernel text/data or
-	 * elsewhere.
-	 */
-	if (!__access_ok(pc, NR_INSN_TO_PRINT * sizeof(int), USER_DS)) {
-		pr_info("%s[%d]: Bad NIP, not dumping instructions.\n",
-			current->comm, current->pid);
-		return;
-	}
-
 	seq_buf_init(&s, buf, sizeof(buf));

 	while (n) {
@@ -1318,7 +1308,7 @@ void show_user_instructions(struct pt_regs *regs)
 		for (i = 0; i < 8 && n; i++, n--, pc += sizeof(int)) {
 			int instr;

-			if (probe_kernel_address((const void *)pc, instr)) {
+			if (probe_user_read(&instr, (void __user *)pc, sizeof(instr))) {
 				seq_buf_printf(&s, " ");
 				continue;
 			}
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 887f11bcf330..ec74305fa330 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -276,12 +276,8 @@ static bool bad_stack_expansion(struct pt_regs *regs, unsigned long address,
 		if ((flags & FAULT_FLAG_WRITE) && (flags & FAULT_FLAG_USER) &&
 		    access_ok(nip, sizeof(*nip))) {
 			unsigned int inst;
-			int res;

-			pagefault_disable();
-			res = __get_user_inatomic(inst, nip);
-			pagefault_enable();
-			if (!res)
+			if (!probe_user_read(&inst, nip, sizeof(inst)))
 				return !store_updates_sp(inst);
 			*must_retry = true;
 		}
diff --git a/arch/powerpc/perf/callchain.c b/arch/powerpc/perf/callchain.c
index 0af051a1974e..0680efb2237b 100644
--- a/arch/powerpc/perf/callchain.c
+++ b/arch/powerpc/perf/callchain.c
@@ -159,12 +159,8 @@ static int read_user_stack_64(unsigned long __user *ptr, unsigned long *ret)
 	    ((unsigned long)ptr & 7))
 		return -EFAULT;

-	pagefault_disable();
-	if (!__get_user_inatomic(*ret, ptr)) {
-		pagefault_enable();
+	if (!probe_user_read(ret, ptr, sizeof(*ret)))
 		return 0;
-	}
-	pagefault_enable();

 	return read_user_stack_slow(ptr, ret, 8);
 }
@@ -175,12 +171,8 @@ static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
 	    ((unsigned long)ptr & 3))
 		return -EFAULT;

-	pagefault_disable();
-	if (!__get_user_inatomic(*ret, ptr)) {
-		pagefault_enable();
+	if (!probe_user_read(ret, ptr, sizeof(*ret)))
 		return 0;
-	}
-	pagefault_enable();

 	return read_user_stack_slow(ptr, ret, 4);
 }
@@ -307,17 +299,11 @@ static inline int current_is_64bit(void)
  */
 static int read_user_stack_32(unsigned int __user *ptr, unsigned int *ret)
 {
-	int rc;
-
 	if ((unsigned long)ptr > TASK_SIZE - sizeof(unsigned int) ||
 	    ((unsigned long)ptr & 3))
 		return -EFAULT;

-	pagefault_disable();
-	rc = __get_user_inatomic(*ret, ptr);
-	pagefault_enable();
-
-	return rc;
+	return probe_user_read(ret, ptr, sizeof(*ret));
 }

 static inline void perf_callchain_user_64(struct perf_callchain_entry_ctx *entry,
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index b0723002a396..4b64ddf0db68 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -416,7 +416,6 @@ static void power_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
 static __u64 power_pmu_bhrb_to(u64 addr)
 {
 	unsigned int instr;
-	int ret;
 	__u64 target;

 	if (is_kernel_addr(addr)) {
@@ -427,13 +426,8 @@ static __u64 power_pmu_bhrb_to(u64 addr)
 	}

 	/* Userspace: need copy instruction here then translate it */
-	pagefault_disable();
-	ret = __get_user_inatomic(instr, (unsigned int __user *)addr);
-	if (ret) {
-		pagefault_enable();
+	if (pr
[PATCH v2 1/2] mm: add probe_user_read()
In powerpc code, there are several places implementing safe access to
user data. This is sometimes implemented using probe_kernel_address()
with additional access_ok() verification, sometimes with get_user()
enclosed in a pagefault_disable()/enable() pair, etc.:

    show_user_instructions()
    bad_stack_expansion()
    p9_hmi_special_emu()
    fsl_pci_mcheck_exception()
    read_user_stack_64()
    read_user_stack_32() on PPC64
    read_user_stack_32() on PPC32
    power_pmu_bhrb_to()

In the same spirit as probe_kernel_read(), this patch adds
probe_user_read().

probe_user_read() does the same as probe_kernel_read() but first checks
that it is really a user address.

Signed-off-by: Christophe Leroy
---
v2: Added "Returns:" comment and removed probe_user_address()

Changes since RFC: Made a static inline function instead of weak function
as recommended by Kees.

 include/linux/uaccess.h | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 37b226e8df13..07f4f0ed69bc 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -263,6 +263,40 @@ extern long strncpy_from_unsafe(char *dst, const void *unsafe_addr, long count);
 #define probe_kernel_address(addr, retval)	\
 	probe_kernel_read(&retval, addr, sizeof(retval))

+/**
+ * probe_user_read(): safely attempt to read from a user location
+ * @dst: pointer to the buffer that shall take the data
+ * @src: address to read from
+ * @size: size of the data chunk
+ *
+ * Returns: 0 on success, -EFAULT on error.
+ *
+ * Safely read from address @src to the buffer at @dst. If a kernel fault
+ * happens, handle that and return -EFAULT.
+ *
+ * We ensure that the copy_from_user is executed in atomic context so that
+ * do_page_fault() doesn't attempt to take mmap_sem. This makes
+ * probe_user_read() suitable for use within regions where the caller
+ * already holds mmap_sem, or other locks which nest inside mmap_sem.
+ */
+
+#ifndef probe_user_read
+static __always_inline long probe_user_read(void *dst, const void __user *src,
+					    size_t size)
+{
+	long ret;
+
+	if (!access_ok(src, size))
+		return -EFAULT;
+
+	pagefault_disable();
+	ret = __copy_from_user_inatomic(dst, src, size);
+	pagefault_enable();
+
+	return ret ? -EFAULT : 0;
+}
+#endif
+
 #ifndef user_access_begin
 #define user_access_begin(ptr,len) access_ok(ptr, len)
 #define user_access_end() do { } while (0)
--
2.13.3
Re: [Bug 202149] New: NULL Pointer Dereference in __split_huge_pmd on PPC64LE
Andrew Morton writes:
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Fri, 04 Jan 2019 22:49:52 + bugzilla-dae...@bugzilla.kernel.org wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=202149
>>
>> Bug ID: 202149
>>    Summary: NULL Pointer Dereference in __split_huge_pmd on
>>             PPC64LE
>
> I think that trace is pointing at the ppc-specific
> pgtable_trans_huge_withdraw()?

That is correct.

Matt, can you share the .config used for the kernel? Does this happen
only with a 4K page size?

-aneesh
[PATCH v4 1/2] crypto: talitos - reorder code in talitos_edesc_alloc()
This patch moves the mapping of the IV after the kmalloc(). This avoids
having to unmap in case kmalloc() fails.

Signed-off-by: Christophe Leroy
---
new in v4

 drivers/crypto/talitos.c | 25 +++--
 1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 45e20707cef8..54d80e7edb86 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -1361,23 +1361,18 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
 	struct talitos_private *priv = dev_get_drvdata(dev);
 	bool is_sec1 = has_ftr_sec1(priv);
 	int max_len = is_sec1 ? TALITOS1_MAX_DATA_LEN : TALITOS2_MAX_DATA_LEN;
-	void *err;

 	if (cryptlen + authsize > max_len) {
 		dev_err(dev, "length exceeds h/w max limit\n");
 		return ERR_PTR(-EINVAL);
 	}

-	if (ivsize)
-		iv_dma = dma_map_single(dev, iv, ivsize, DMA_TO_DEVICE);
-
 	if (!dst || dst == src) {
 		src_len = assoclen + cryptlen + authsize;
 		src_nents = sg_nents_for_len(src, src_len);
 		if (src_nents < 0) {
 			dev_err(dev, "Invalid number of src SG.\n");
-			err = ERR_PTR(-EINVAL);
-			goto error_sg;
+			return ERR_PTR(-EINVAL);
 		}
 		src_nents = (src_nents == 1) ? 0 : src_nents;
 		dst_nents = dst ? src_nents : 0;
@@ -1387,16 +1382,14 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
 		src_nents = sg_nents_for_len(src, src_len);
 		if (src_nents < 0) {
 			dev_err(dev, "Invalid number of src SG.\n");
-			err = ERR_PTR(-EINVAL);
-			goto error_sg;
+			return ERR_PTR(-EINVAL);
 		}
 		src_nents = (src_nents == 1) ? 0 : src_nents;
 		dst_len = assoclen + cryptlen + (encrypt ? authsize : 0);
 		dst_nents = sg_nents_for_len(dst, dst_len);
 		if (dst_nents < 0) {
 			dev_err(dev, "Invalid number of dst SG.\n");
-			err = ERR_PTR(-EINVAL);
-			goto error_sg;
+			return ERR_PTR(-EINVAL);
 		}
 		dst_nents = (dst_nents == 1) ? 0 : dst_nents;
 	}
@@ -1425,10 +1418,10 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
 		alloc_len += sizeof(struct talitos_desc);

 	edesc = kmalloc(alloc_len, GFP_DMA | flags);
-	if (!edesc) {
-		err = ERR_PTR(-ENOMEM);
-		goto error_sg;
-	}
+	if (!edesc)
+		return ERR_PTR(-ENOMEM);
+	if (ivsize)
+		iv_dma = dma_map_single(dev, iv, ivsize, DMA_TO_DEVICE);

 	memset(&edesc->desc, 0, sizeof(edesc->desc));
 	edesc->src_nents = src_nents;
@@ -1445,10 +1438,6 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
 				     DMA_BIDIRECTIONAL);
 	}
 	return edesc;
-error_sg:
-	if (iv_dma)
-		dma_unmap_single(dev, iv_dma, ivsize, DMA_TO_DEVICE);
-	return err;
 }

 static struct talitos_edesc *aead_edesc_alloc(struct aead_request *areq, u8 *iv,
--
2.13.3
[PATCH v4 2/2] crypto: talitos - fix ablkcipher for CONFIG_VMAP_STACK
[    2.364486] WARNING: CPU: 0 PID: 60 at ./arch/powerpc/include/asm/io.h:837 dma_nommu_map_page+0x44/0xd4
[    2.373579] CPU: 0 PID: 60 Comm: cryptomgr_test Tainted: G        W 4.20.0-rc5-00560-g6bfb52e23a00-dirty #531
[    2.384740] NIP: c000c540 LR: c000c584 CTR:
[    2.389743] REGS: c95abab0 TRAP: 0700 Tainted: G        W (4.20.0-rc5-00560-g6bfb52e23a00-dirty)
[    2.400042] MSR: 00029032 CR: 24042204 XER:
[    2.406669]
[    2.406669] GPR00: c02f2244 c95abb60 c6262990 c95abd80 256a 0001 0001 0001
[    2.406669] GPR08: 2000 0010 0010 24042202 0100 c95abd88
[    2.406669] GPR16: c05569d4 0001 0010 c95abc88 c0615664 0004
[    2.406669] GPR24: 0010 c95abc88 c95abc88 c61ae210 c7ff6d40 c61ae210 3d68
[    2.441559] NIP [c000c540] dma_nommu_map_page+0x44/0xd4
[    2.446720] LR [c000c584] dma_nommu_map_page+0x88/0xd4
[    2.451762] Call Trace:
[    2.454195] [c95abb60] [82000808] 0x82000808 (unreliable)
[    2.459572] [c95abb80] [c02f2244] talitos_edesc_alloc+0xbc/0x3c8
[    2.465493] [c95abbb0] [c02f2600] ablkcipher_edesc_alloc+0x4c/0x5c
[    2.471606] [c95abbd0] [c02f4ed0] ablkcipher_encrypt+0x20/0x64
[    2.477389] [c95abbe0] [c02023b0] __test_skcipher+0x4bc/0xa08
[    2.483049] [c95abe00] [c0204b60] test_skcipher+0x2c/0xcc
[    2.488385] [c95abe20] [c0204c48] alg_test_skcipher+0x48/0xbc
[    2.494064] [c95abe40] [c0205cec] alg_test+0x164/0x2e8
[    2.499142] [c95abf00] [c0200dec] cryptomgr_test+0x48/0x50
[    2.504558] [c95abf10] [c0039ff4] kthread+0xe4/0x110
[    2.509471] [c95abf40] [c000e1d0] ret_from_kernel_thread+0x14/0x1c
[    2.515532] Instruction dump:
[    2.518468] 7c7e1b78 7c9d2378 7cbf2b78 41820054 3d20c076 8089c200 3d20c076 7c84e850
[    2.526127] 8129c204 7c842e70 7f844840 419c0008 <0fe0> 2f9e 54847022 7c84fa14
[    2.533960] ---[ end trace bf78d94af73fe3b8 ]---
[    2.539123] talitos ff02.crypto: master data transfer error
[    2.544775] talitos ff02.crypto: TEA error: ISR 0x2000_0040
[    2.551625] alg: skcipher: encryption failed on test 1 for ecb-aes-talitos: ret=22

IV cannot be on stack when CONFIG_VMAP_STACK is selected because the
stack cannot be DMA mapped anymore.

This patch copies the IV into the extended descriptor.

Fixes: 4de9d0b547b9 ("crypto: talitos - Add ablkcipher algorithms")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy
---
v4: Split in two patches; made the copy unconditional.
v3: Using struct edesc buffer.
v2: Using per-request context.

 drivers/crypto/talitos.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
index 54d80e7edb86..f8e2c5c3f4eb 100644
--- a/drivers/crypto/talitos.c
+++ b/drivers/crypto/talitos.c
@@ -1416,12 +1416,15 @@ static struct talitos_edesc *talitos_edesc_alloc(struct device *dev,
 	/* if its a ahash, add space for a second desc next to the first one */
 	if (is_sec1 && !dst)
 		alloc_len += sizeof(struct talitos_desc);
+	alloc_len += ivsize;

 	edesc = kmalloc(alloc_len, GFP_DMA | flags);
 	if (!edesc)
 		return ERR_PTR(-ENOMEM);
-	if (ivsize)
+	if (ivsize) {
+		iv = memcpy(((u8 *)edesc) + alloc_len - ivsize, iv, ivsize);
 		iv_dma = dma_map_single(dev, iv, ivsize, DMA_TO_DEVICE);
+	}

 	memset(&edesc->desc, 0, sizeof(edesc->desc));
 	edesc->src_nents = src_nents;
--
2.13.3
Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]
On Mon, Jan 07, 2019 at 09:01:29PM -0700, Jason Gunthorpe wrote:
> On Sun, Jan 06, 2019 at 09:43:46AM +1100, Benjamin Herrenschmidt wrote:
> > On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote:
> >
> > > > Interesting. I've investigated this further, though I don't have as
> > > > many new clues as I'd like. The problem occurs reliably, at least on
> > > > one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
> > > > I don't yet know if it occurs with other machines, I'm having trouble
> > > > getting access to other machines with a suitable card. I didn't
> > > > manage to reproduce it on a different POWER8 machine with a
> > > > ConnectX-5, but I don't know if it's the difference in machine or
> > > > difference in card revision that's important.
> > >
> > > Make sure the card has the latest firmware is always good advice..
> > >
> > > > So possibilities that occur to me:
> > > > * It's something specific about how the vfio-pci driver uses D3
> > > >   state - have you tried rebinding your device to vfio-pci?
> > > > * It's something specific about POWER, either the kernel or the PCI
> > > >   bridge hardware
> > > > * It's something specific about this particular type of machine
> > >
> > > Does the EEH indicate what happened to actually trigger it?
> >
> > In a very cryptic way that requires manual parsing using non-public
> > docs sadly but yes. From the look of it, it's a completion timeout.
> >
> > Looks to me like we don't get a response to a config space access
> > during the change of D state. I don't know if it's the write of the D3
> > state itself or the read back though (it's probably detected on the
> > read back or a subsequent read, but that doesn't tell me which specific
> > one failed).
>
> If it is just one card doing it (again, check you have latest
> firmware) I wonder if it is a sketchy PCI-E electrical link that is
> causing a long re-training cycle? Can you tell if the PCI-E link is
> permanently gone or does it eventually return?
>
> Does the card work in Gen 3 when it starts? Is there any indication of
> PCI-E link errors?
>
> Every time or sometimes?
>
> POWER 8 firmware is good? If the link does eventually come back, is
> the POWER8's D3 resumption timeout long enough?
>
> If this doesn't lead to an obvious conclusion you'll probably need to
> connect to IBM's Mellanox support team to get more information from
> the card side.

+1, I tried to find any Mellanox-internal bugs related to your issue and
didn't find anything concrete.

Thanks

> Jason
[PATCH 3/3] sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW
Add new socket timeout options that are y2038 safe.

Signed-off-by: Deepa Dinamani
Cc: ccaul...@redhat.com
Cc: da...@davemloft.net
Cc: del...@gmx.de
Cc: pau...@samba.org
Cc: r...@linux-mips.org
Cc: r...@twiddle.net
Cc: cluster-de...@redhat.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-al...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
---
 arch/alpha/include/uapi/asm/socket.h  | 12 +++--
 arch/mips/include/uapi/asm/socket.h   | 11 -
 arch/parisc/include/uapi/asm/socket.h | 11 -
 arch/sparc/include/uapi/asm/socket.h  | 11 -
 include/net/sock.h                    |  4 +-
 include/uapi/asm-generic/socket.h     | 11 -
 net/core/sock.c                       | 64 +--
 7 files changed, 98 insertions(+), 26 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index ea3ba981d8a0..3d800d5d3d5d 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -118,19 +118,25 @@
 #define SO_TIMESTAMPNS_NEW	63
 #define SO_TIMESTAMPING_NEW	64

-#if !defined(__KERNEL__)
+#define SO_RCVTIMEO_NEW	65
+#define SO_SNDTIMEO_NEW	66

-#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
-#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD
+#if !defined(__KERNEL__)

 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP	SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS	SO_TIMESTAMPNS_OLD
 #define SO_TIMESTAMPING	SO_TIMESTAMPING_OLD
+
+#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
+#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD
 #else
 #define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
 #define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
 #define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW)
+
+#define SO_RCVTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_RCVTIMEO_OLD : SO_RCVTIMEO_NEW)
+#define SO_SNDTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_SNDTIMEO_OLD : SO_SNDTIMEO_NEW)
 #endif

 #define SCM_TIMESTAMP	SO_TIMESTAMP

diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 4dde20d64690..5a7f9010c090 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -128,18 +128,25 @@
 #define SO_TIMESTAMPNS_NEW	63
 #define SO_TIMESTAMPING_NEW	64

+#define SO_RCVTIMEO_NEW	65
+#define SO_SNDTIMEO_NEW	66
+
 #if !defined(__KERNEL__)
-#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
-#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD

 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP	SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS	SO_TIMESTAMPNS_OLD
 #define SO_TIMESTAMPING	SO_TIMESTAMPING_OLD
+
+#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
+#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD
 #else
 #define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
 #define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
 #define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW)
+
+#define	SO_RCVTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_RCVTIMEO_OLD : SO_RCVTIMEO_NEW)
+#define	SO_SNDTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_SNDTIMEO_OLD : SO_SNDTIMEO_NEW)
 #endif

 #define SCM_TIMESTAMP	SO_TIMESTAMP

diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 546937fa0d8b..bd35de5b4666 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -109,18 +109,25 @@
 #define SO_TIMESTAMPNS_NEW	0x4038
 #define SO_TIMESTAMPING_NEW	0x4039

+#define SO_RCVTIMEO_NEW	0x4040
+#define SO_SNDTIMEO_NEW	0x4041
+
 #if !defined(__KERNEL__)
-#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
-#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD

 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP	SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS	SO_TIMESTAMPNS_OLD
 #define SO_TIMESTAMPING	SO_TIMESTAMPING_OLD
+
+#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
+#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD
 #else
 #define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
 #define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
 #define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW)
+
+#define	SO_RCVTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_RCVTIMEO_OLD : SO_RCVTIMEO_NEW)
+#define	SO_SNDTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? SO_SNDTIMEO_O
[PATCH 2/3] socket: Rename SO_RCVTIMEO/ SO_SNDTIMEO with _OLD suffixes
SO_RCVTIMEO and SO_SNDTIMEO socket options use struct timeval as the
time format. struct timeval is not y2038 safe.
The subsequent patches in the series add support for new socket timeout
options with the _NEW suffix that are y2038 safe.
Rename the existing options with _OLD suffix forms so that the right
option is enabled for userspace applications according to the
architecture and time_t definition of libc.

Signed-off-by: Deepa Dinamani
Cc: ccaul...@redhat.com
Cc: del...@gmx.de
Cc: pau...@samba.org
Cc: r...@linux-mips.org
Cc: r...@twiddle.net
Cc: cluster-de...@redhat.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-al...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
---
 arch/alpha/include/uapi/asm/socket.h   | 7 +--
 arch/mips/include/uapi/asm/socket.h    | 6 --
 arch/parisc/include/uapi/asm/socket.h  | 6 --
 arch/powerpc/include/uapi/asm/socket.h | 4 ++--
 arch/sparc/include/uapi/asm/socket.h   | 6 --
 fs/dlm/lowcomms.c                      | 4 ++--
 include/net/sock.h                     | 4 ++--
 include/uapi/asm-generic/socket.h      | 6 --
 net/compat.c                           | 4 ++--
 net/core/sock.c                        | 8
 10 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index da08412bd49f..ea3ba981d8a0 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -31,8 +31,8 @@
 #define SO_RCVBUFFORCE	0x100b
 #define	SO_RCVLOWAT	0x1010
 #define	SO_SNDLOWAT	0x1011
-#define	SO_RCVTIMEO	0x1012
-#define	SO_SNDTIMEO	0x1013
+#define	SO_RCVTIMEO_OLD	0x1012
+#define	SO_SNDTIMEO_OLD	0x1013
 #define SO_ACCEPTCONN	0x1014

 #define SO_PROTOCOL	0x1028
 #define SO_DOMAIN	0x1029
@@ -120,6 +120,9 @@

 #if !defined(__KERNEL__)

+#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
+#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD
+
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP	SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS	SO_TIMESTAMPNS_OLD

diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 1e48f67f1052..4dde20d64690 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -39,8 +39,8 @@
 #define SO_RCVBUF	0x1002	/* Receive buffer. */
 #define SO_SNDLOWAT	0x1003	/* send low-water mark */
 #define SO_RCVLOWAT	0x1004	/* receive low-water mark */
-#define SO_SNDTIMEO	0x1005	/* send timeout */
-#define SO_RCVTIMEO	0x1006	/* receive timeout */
+#define SO_SNDTIMEO_OLD	0x1005	/* send timeout */
+#define SO_RCVTIMEO_OLD	0x1006	/* receive timeout */
 #define SO_ACCEPTCONN	0x1009
 #define SO_PROTOCOL	0x1028	/* protocol type */
 #define SO_DOMAIN	0x1029	/* domain/socket family */
@@ -130,6 +130,8 @@

 #if !defined(__KERNEL__)

+#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
+#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP	SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS	SO_TIMESTAMPNS_OLD

diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index e8d6cf20f9a4..546937fa0d8b 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -22,8 +22,8 @@
 #define SO_RCVBUFFORCE	0x100b
 #define SO_SNDLOWAT	0x1003
 #define SO_RCVLOWAT	0x1004
-#define SO_SNDTIMEO	0x1005
-#define SO_RCVTIMEO	0x1006
+#define SO_SNDTIMEO_OLD	0x1005
+#define SO_RCVTIMEO_OLD	0x1006
 #define SO_ERROR	0x1007
 #define SO_TYPE		0x1008
 #define SO_PROTOCOL	0x1028
@@ -111,6 +111,8 @@

 #if !defined(__KERNEL__)

+#define	SO_RCVTIMEO	SO_RCVTIMEO_OLD
+#define	SO_SNDTIMEO	SO_SNDTIMEO_OLD
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP	SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS	SO_TIMESTAMPNS_OLD

diff --git a/arch/powerpc/include/uapi/asm/socket.h b/arch/powerpc/include/uapi/asm/socket.h
index 94de465e0920..12aa0c43e775 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -11,8 +11,8 @@

 #define SO_RCVLOWAT	16
 #define SO_SNDLOWAT	17
-#define SO_RCVTIMEO	18
-#define SO_SNDTIMEO	19
+#define SO_RCVTIMEO_OLD	18
+#define SO_SNDTIMEO_OLD	19
 #define SO_PASSCRED	20
 #define SO_PEERCRED	21

diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index fc65bf6b6440..bdc396211627 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -21,8 +21,8 @@
 #define SO_BSDCOMPAT	0x0400
 #define SO_RCVLOWAT	0x0800
 #define SO_SNDLOWAT	0x1000
-#define SO_RCVTIMEO	0x2000
-#define SO_SNDTIMEO	0x4000
+#define SO_RCVTIMEO_OLD	0x2000
+#define SO_SNDTIMEO_OLD
[PATCH 0/3] net: y2038-safe socket timeout options
The series is aimed at adding y2038-safe timeout options:
SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW. This is similar to the previous
series adding y2038-safe SO_TIMESTAMP* options.

The series needs to be applied after the socket timestamp series:
https://lore.kernel.org/lkml/20190108032657.8331-1-deepa.ker...@gmail.com

Deepa Dinamani (3):
  socket: Use old_timeval types for socket timeouts
  socket: Rename SO_RCVTIMEO/ SO_SNDTIMEO with _OLD suffixes
  sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW

 arch/alpha/include/uapi/asm/socket.h   | 13 -
 arch/mips/include/uapi/asm/socket.h    | 13 -
 arch/parisc/include/uapi/asm/socket.h  | 13 -
 arch/powerpc/include/uapi/asm/socket.h |  4 +-
 arch/sparc/include/uapi/asm/socket.h   | 13 -
 fs/dlm/lowcomms.c                      |  4 +-
 include/uapi/asm-generic/socket.h      | 13 -
 net/compat.c                           | 14 ++---
 net/core/sock.c                        | 78 +++---
 net/vmw_vsock/af_vsock.c               |  4 +-
 10 files changed, 126 insertions(+), 43 deletions(-)

base-commit: a4983672f9ca4c8393f26b6b80710e6c78886b8c
prerequisite-patch-id: a03ec6afbdd328cd90557f7ee6675016a5f5c653
prerequisite-patch-id: 724d26c3036e6f3a38f810c2f10db3f7ddbf843b
prerequisite-patch-id: 14017867b6eb4d5231eec1b563edcd840a1be26e
prerequisite-patch-id: 8df0edfd9b973ff5aae91c7709c8223be096a5bc
prerequisite-patch-id: 9850ad48d41bf068f074c0dd3c7610fb7177c89f
prerequisite-patch-id: bd31f35bba11902d1cc3e8726492b54df34b5c59
prerequisite-patch-id: ea4b005c5ad188a4e0899d728357c114710a3a8e
prerequisite-patch-id: cc3ee912c1ee1ea502ca079de81236a467950501
--
2.17.1

Cc: ccaul...@redhat.com
Cc: cluster-de...@redhat.com
Cc: da...@davemloft.net
Cc: del...@gmx.de
Cc: linux-al...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: pau...@samba.org
Cc: r...@linux-mips.org
Cc: r...@twiddle.net
Cc: sparcli...@vger.kernel.org
Re: [PATCH v3] crypto: talitos - fix ablkcipher for CONFIG_VMAP_STACK
On Fri, Dec 21, 2018 at 08:07:52AM +, Christophe Leroy wrote:
> [    2.364486] WARNING: CPU: 0 PID: 60 at ./arch/powerpc/include/asm/io.h:837 dma_nommu_map_page+0x44/0xd4
> [    2.373579] CPU: 0 PID: 60 Comm: cryptomgr_test Tainted: G        W 4.20.0-rc5-00560-g6bfb52e23a00-dirty #531
> [    2.384740] NIP: c000c540 LR: c000c584 CTR:
> [    2.389743] REGS: c95abab0 TRAP: 0700 Tainted: G        W (4.20.0-rc5-00560-g6bfb52e23a00-dirty)
> [    2.400042] MSR: 00029032 CR: 24042204 XER:
> [    2.406669]
> [    2.406669] GPR00: c02f2244 c95abb60 c6262990 c95abd80 256a 0001 0001 0001
> [    2.406669] GPR08: 2000 0010 0010 24042202 0100 c95abd88
> [    2.406669] GPR16: c05569d4 0001 0010 c95abc88 c0615664 0004
> [    2.406669] GPR24: 0010 c95abc88 c95abc88 c61ae210 c7ff6d40 c61ae210 3d68
> [    2.441559] NIP [c000c540] dma_nommu_map_page+0x44/0xd4
> [    2.446720] LR [c000c584] dma_nommu_map_page+0x88/0xd4
> [    2.451762] Call Trace:
> [    2.454195] [c95abb60] [82000808] 0x82000808 (unreliable)
> [    2.459572] [c95abb80] [c02f2244] talitos_edesc_alloc+0xbc/0x3c8
> [    2.465493] [c95abbb0] [c02f2600] ablkcipher_edesc_alloc+0x4c/0x5c
> [    2.471606] [c95abbd0] [c02f4ed0] ablkcipher_encrypt+0x20/0x64
> [    2.477389] [c95abbe0] [c02023b0] __test_skcipher+0x4bc/0xa08
> [    2.483049] [c95abe00] [c0204b60] test_skcipher+0x2c/0xcc
> [    2.488385] [c95abe20] [c0204c48] alg_test_skcipher+0x48/0xbc
> [    2.494064] [c95abe40] [c0205cec] alg_test+0x164/0x2e8
> [    2.499142] [c95abf00] [c0200dec] cryptomgr_test+0x48/0x50
> [    2.504558] [c95abf10] [c0039ff4] kthread+0xe4/0x110
> [    2.509471] [c95abf40] [c000e1d0] ret_from_kernel_thread+0x14/0x1c
> [    2.515532] Instruction dump:
> [    2.518468] 7c7e1b78 7c9d2378 7cbf2b78 41820054 3d20c076 8089c200 3d20c076 7c84e850
> [    2.526127] 8129c204 7c842e70 7f844840 419c0008 <0fe0> 2f9e 54847022 7c84fa14
> [    2.533960] ---[ end trace bf78d94af73fe3b8 ]---
> [    2.539123] talitos ff02.crypto: master data transfer error
> [    2.544775] talitos ff02.crypto: TEA error: ISR 0x2000_0040
> [    2.551625] alg: skcipher: encryption failed on test 1 for
> ecb-aes-talitos: ret=22
>
> IV cannot be on stack when CONFIG_VMAP_STACK is selected because the stack
> cannot be DMA mapped anymore.
>
> This patch copies the IV into the extended descriptor when iv is not
> a valid linear address.

Please make the copy unconditional.

Thanks.
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH V6 4/4] powerpc/mm/iommu: Allow large IOMMU page size only for hugetlb backing
THP pages can get split during different code paths. An incremented
reference count does imply we will not split the compound page. But the
pmd entry can be converted to level 4 pte entries. Keep the code simpler
by allowing large IOMMU page size only if the guest ram is backed by
hugetlb pages.

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/mm/mmu_context_iommu.c | 24 +++-
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index 52ccab294b47..62c7590378d4 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -98,8 +98,6 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	struct mm_iommu_table_group_mem_t *mem;
 	long i, ret = 0, locked_entries = 0;
 	unsigned int pageshift;
-	unsigned long flags;
-	unsigned long cur_ua;

 	mutex_lock(&mem_list_mutex);

@@ -167,22 +165,14 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	for (i = 0; i < entries; ++i) {
 		struct page *page = mem->hpages[i];

-		cur_ua = ua + (i << PAGE_SHIFT);
-		if (mem->pageshift > PAGE_SHIFT && PageCompound(page)) {
-			pte_t *pte;
+		/*
+		 * Allow to use larger than 64k IOMMU pages. Only do that
+		 * if we are backed by hugetlb.
+		 */
+		if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page)) {
 			struct page *head = compound_head(page);
-			unsigned int compshift = compound_order(head);
-			unsigned int pteshift;
-
-			local_irq_save(flags); /* disables as well */
-			pte = find_linux_pte(mm->pgd, cur_ua, NULL, &pteshift);
-
-			/* Double check it is still the same pinned page */
-			if (pte && pte_page(*pte) == head &&
-			    pteshift == compshift + PAGE_SHIFT)
-				pageshift = max_t(unsigned int, pteshift,
-						  PAGE_SHIFT);
-			local_irq_restore(flags);
+
+			pageshift = compound_order(head) + PAGE_SHIFT;
 		}
 		mem->pageshift = min(mem->pageshift, pageshift);
 		/*
--
2.20.1
[PATCH V6 3/4] powerpc/mm/iommu: Allow migration of cma allocated pages during mm_iommu_get
Current code doesn't do page migration if the page allocated is a compound page. With HugeTLB migration support, we can end up allocating hugetlb pages from the CMA region. Also THP pages can be allocated from the CMA region. This patch updates the code to handle compound pages correctly.

This uses the new helper get_user_pages_cma_migrate(). It does a single get_user_pages() with the right count, instead of doing one get_user_pages() per page. That avoids reading the page table multiple times.

The patch also converts the hpas member of mm_iommu_table_group_mem_t to a union. We use the same storage location to store pointers to struct page. We cannot update all the code paths to use struct page *, because we access hpas in real mode and we can't do the struct page * to pfn conversion in real mode.

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/mm/mmu_context_iommu.c | 124 +---
 1 file changed, 37 insertions(+), 87 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_iommu.c b/arch/powerpc/mm/mmu_context_iommu.c
index a712a650a8b6..52ccab294b47 100644
--- a/arch/powerpc/mm/mmu_context_iommu.c
+++ b/arch/powerpc/mm/mmu_context_iommu.c
@@ -21,6 +21,7 @@
 #include
 #include
 #include
+#include
 
 static DEFINE_MUTEX(mem_list_mutex);
 
@@ -34,8 +35,18 @@ struct mm_iommu_table_group_mem_t {
 	atomic64_t mapped;
 	unsigned int pageshift;
 	u64 ua;			/* userspace address */
-	u64 entries;		/* number of entries in hpas[] */
-	u64 *hpas;		/* vmalloc'ed */
+	u64 entries;		/* number of entries in hpas/hpages[] */
+	/*
+	 * in mm_iommu_get we temporarily use this to store
+	 * struct page address.
+	 *
+	 * We need to convert ua to hpa in real mode. Make it
+	 * simpler by storing physical address.
+	 */
+	union {
+		struct page **hpages;	/* vmalloc'ed */
+		phys_addr_t *hpas;
+	};
 #define MM_IOMMU_TABLE_INVALID_HPA	((uint64_t)-1)
 	u64 dev_hpa;		/* Device memory base address */
 };
@@ -80,64 +91,15 @@ bool mm_iommu_preregistered(struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(mm_iommu_preregistered);
 
-/*
- * Taken from alloc_migrate_target with changes to remove CMA allocations
- */
-struct page *new_iommu_non_cma_page(struct page *page, unsigned long private)
-{
-	gfp_t gfp_mask = GFP_USER;
-	struct page *new_page;
-
-	if (PageCompound(page))
-		return NULL;
-
-	if (PageHighMem(page))
-		gfp_mask |= __GFP_HIGHMEM;
-
-	/*
-	 * We don't want the allocation to force an OOM if possibe
-	 */
-	new_page = alloc_page(gfp_mask | __GFP_NORETRY | __GFP_NOWARN);
-	return new_page;
-}
-
-static int mm_iommu_move_page_from_cma(struct page *page)
-{
-	int ret = 0;
-	LIST_HEAD(cma_migrate_pages);
-
-	/* Ignore huge pages for now */
-	if (PageCompound(page))
-		return -EBUSY;
-
-	lru_add_drain();
-	ret = isolate_lru_page(page);
-	if (ret)
-		return ret;
-
-	list_add(&page->lru, &cma_migrate_pages);
-	put_page(page); /* Drop the gup reference */
-
-	ret = migrate_pages(&cma_migrate_pages, new_iommu_non_cma_page,
-				NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE);
-	if (ret) {
-		if (!list_empty(&cma_migrate_pages))
-			putback_movable_pages(&cma_migrate_pages);
-	}
-
-	return 0;
-}
-
 static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
-		unsigned long entries, unsigned long dev_hpa,
-		struct mm_iommu_table_group_mem_t **pmem)
+			      unsigned long entries, unsigned long dev_hpa,
+			      struct mm_iommu_table_group_mem_t **pmem)
 {
 	struct mm_iommu_table_group_mem_t *mem;
-	long i, j, ret = 0, locked_entries = 0;
+	long i, ret = 0, locked_entries = 0;
 	unsigned int pageshift;
 	unsigned long flags;
 	unsigned long cur_ua;
-	struct page *page = NULL;
 
 	mutex_lock(&mem_list_mutex);
 
@@ -187,41 +149,25 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 		goto unlock_exit;
 	}
 
+	ret = get_user_pages_cma_migrate(ua, entries, 1, mem->hpages);
+	if (ret != entries) {
+		/* free the reference taken */
+		for (i = 0; i < ret; i++)
+			put_page(mem->hpages[i]);
+
+		vfree(mem->hpas);
+		kfree(mem);
+		ret = -EFAULT;
+		goto unlock_exit;
+	} else {
+		ret = 0;
+	}
+
+	pageshift = PAGE_SHIFT;
 	for (i = 0; i < entries; ++i) {
+		struct page *page = mem->hpages[i];
+
 		cur_ua = ua + (i << PAGE_SHIFT);
-		if (1 != get_user_pages_fast(cur_ua,
-
[PATCH V6 2/4] mm: Add get_user_pages_cma_migrate
This helper does a get_user_pages_fast(), making sure we migrate pages found in the CMA area before taking the page reference. This makes sure that we don't keep non-movable pages (due to the page reference count) in the CMA area.

This will be used by ppc64 in a later patch to avoid pinning pages in the CMA region. ppc64 uses the CMA region for allocation of the hardware page table (hash page table), and not being able to migrate pages out of the CMA region results in page table allocation failures. One case where we hit this easily is when a guest uses a VFIO passthrough device. VFIO locks all the guest's memory, and if the guest memory is backed by the CMA region, it becomes unmovable, resulting in fragmenting the CMA and possibly preventing other guests from allocating a large enough hash page table.

NOTE: We allocate the new page without using __GFP_THISNODE

Signed-off-by: Aneesh Kumar K.V
---
 include/linux/hugetlb.h |   2 +
 include/linux/migrate.h |   3 +
 mm/hugetlb.c            |   4 +-
 mm/migrate.c            | 149
 4 files changed, 156 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 087fd5f48c91..1eed0cdaec0e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -371,6 +371,8 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
 				nodemask_t *nmask);
 struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address);
+struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+				     int nid, nodemask_t *nmask);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e13d9bf2f9a5..bc83e12a06e9 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -285,6 +285,9 @@ static inline int migrate_vma(const struct migrate_vma_ops *ops,
 }
 #endif /* IS_ENABLED(CONFIG_MIGRATE_VMA_HELPER) */
 
+extern int get_user_pages_cma_migrate(unsigned long start, int nr_pages, int write,
+				      struct page **pages);
+
 #endif /* CONFIG_MIGRATION */
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 745088810965..fc4afaec1055 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1586,8 +1586,8 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	return page;
 }
 
-static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
-				    int nid, nodemask_t *nmask)
+struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
+				     int nid, nodemask_t *nmask)
 {
 	struct page *page;
 
diff --git a/mm/migrate.c b/mm/migrate.c
index ccf8966caf6f..5e21c7aee942 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2982,3 +2982,152 @@ int migrate_vma(const struct migrate_vma_ops *ops,
 }
 EXPORT_SYMBOL(migrate_vma);
 #endif /* defined(MIGRATE_VMA_HELPER) */
+
+static struct page *new_non_cma_page(struct page *page, unsigned long private)
+{
+	/*
+	 * We want to make sure we allocate the new page from the same node
+	 * as the source page.
+	 */
+	int nid = page_to_nid(page);
+	/*
+	 * Trying to allocate a page for migration. Ignore allocation
+	 * failure warnings. We don't force __GFP_THISNODE here because
+	 * this node here is the node where we have CMA reservation and
+	 * in some case these nodes will have really less non movable
+	 * allocation memory.
+	 */
+	gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;
+
+	if (PageHighMem(page))
+		gfp_mask |= __GFP_HIGHMEM;
+
+#ifdef CONFIG_HUGETLB_PAGE
+	if (PageHuge(page)) {
+		struct hstate *h = page_hstate(page);
+		/*
+		 * We don't want to dequeue from the pool because pool pages will
+		 * mostly be from the CMA region.
+		 */
+		return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
+	}
+#endif
+	if (PageTransHuge(page)) {
+		struct page *thp;
+		/*
+		 * ignore allocation failure warnings
+		 */
+		gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_NOWARN;
+
+		/*
+		 * Remove the movable mask so that we don't allocate from
+		 * CMA area again.
+		 */
+		thp_gfpmask &= ~__GFP_MOVABLE;
+		thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	}
+
+	return __alloc_pages_node(nid, gfp_mask, 0);
+}
+
+/**
+ * get_user_pages_cma_migrate() - pin user pages in memory by migrating pa
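The gfp-mask policy in new_non_cma_page() above boils down to a few bit operations, which can be exercised in userspace. This is a hedged mimic with made-up flag values (the real constants live in the kernel's gfp.h, and the real function also takes hugetlb and node placement into account):

```c
/* Hypothetical flag values for this userspace sketch only */
#define GFP_USER      0x01u
#define __GFP_NOWARN  0x02u
#define __GFP_HIGHMEM 0x04u
#define __GFP_MOVABLE 0x08u
#define GFP_TRANSHUGE (GFP_USER | __GFP_MOVABLE)

/*
 * Mimic of the mask selection in new_non_cma_page(): THP targets start
 * from GFP_TRANSHUGE but drop __GFP_MOVABLE so the replacement page
 * cannot land in the CMA area again; ordinary pages keep GFP_USER and
 * pick up __GFP_HIGHMEM only when the source was highmem.
 */
unsigned int target_gfp(int is_thp, int is_highmem)
{
	unsigned int gfp = GFP_USER | __GFP_NOWARN;

	if (is_highmem)
		gfp |= __GFP_HIGHMEM;
	if (is_thp)
		return (GFP_TRANSHUGE | __GFP_NOWARN) & ~__GFP_MOVABLE;
	return gfp;
}
```

The key invariant is simply that no migration target ever carries __GFP_MOVABLE, since movable allocations are the ones the page allocator may satisfy from CMA.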
[PATCH V6 1/4] mm/cma: Add PF flag to force non cma alloc
This patch adds PF_MEMALLOC_NOCMA, which makes sure any allocation in that context is marked non-movable and hence cannot be satisfied by the CMA region. This is useful with get_user_pages_cma_migrate, where we take a page pin by migrating pages from the CMA region. Marking the section PF_MEMALLOC_NOCMA ensures that we avoid unnecessary page migration later.

Suggested-by: Andrea Arcangeli
Signed-off-by: Aneesh Kumar K.V
---
 include/linux/sched.h    |  1 +
 include/linux/sched/mm.h | 36
 2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 89541d248893..047c8c469774 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1406,6 +1406,7 @@ extern struct pid *cad_pid;
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
 #define PF_MEMSTALL		0x01000000	/* Stalled due to lack of memory */
+#define PF_MEMALLOC_NOCMA	0x02000000	/* All allocation request will have __GFP_MOVABLE cleared */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_allowed */
 #define PF_MCE_EARLY		0x08000000	/* Early kill for mce process policy */
 #define PF_MUTEX_TESTER		0x20000000	/* Thread belongs to the rt mutex tester */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 3bfa6a0cbba4..b336e7e2ca49 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -148,17 +148,24 @@ static inline bool in_vfork(struct task_struct *tsk)
  * Applies per-task gfp context to the given allocation flags.
  * PF_MEMALLOC_NOIO implies GFP_NOIO
  * PF_MEMALLOC_NOFS implies GFP_NOFS
+ * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
  */
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
-	/*
-	 * NOIO implies both NOIO and NOFS and it is a weaker context
-	 * so always make sure it makes precedence
-	 */
-	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
-		flags &= ~(__GFP_IO | __GFP_FS);
-	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
-		flags &= ~__GFP_FS;
+	if (unlikely(current->flags &
+		     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
+		/*
+		 * NOIO implies both NOIO and NOFS and it is a weaker context
+		 * so always make sure it makes precedence
+		 */
+		if (current->flags & PF_MEMALLOC_NOIO)
+			flags &= ~(__GFP_IO | __GFP_FS);
+		else if (current->flags & PF_MEMALLOC_NOFS)
+			flags &= ~__GFP_FS;
+
+		if (current->flags & PF_MEMALLOC_NOCMA)
+			flags &= ~__GFP_MOVABLE;
+	}
 	return flags;
 }
 
@@ -248,6 +255,19 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC) | flags;
 }
 
+static inline unsigned int memalloc_nocma_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOCMA;
+
+	current->flags |= PF_MEMALLOC_NOCMA;
+	return flags;
+}
+
+static inline void memalloc_nocma_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOCMA) | flags;
+}
+
 #ifdef CONFIG_MEMCG
 /**
  * memalloc_use_memcg - Starts the remote memcg charging scope.
--
2.20.1
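The save/restore pattern the patch introduces is the same idiom used by the existing PF_MEMALLOC_NOIO/NOFS scopes. Below is a self-contained userspace mimic of the plumbing, with invented flag values and a plain global standing in for current->flags, purely to illustrate how the scope nests and how the gfp mask is filtered:

```c
/* Hypothetical flag values; the real ones live in sched.h and gfp.h */
#define PF_MEMALLOC_NOIO  0x1u
#define PF_MEMALLOC_NOFS  0x2u
#define PF_MEMALLOC_NOCMA 0x4u
#define __GFP_IO          0x10u
#define __GFP_FS          0x20u
#define __GFP_MOVABLE     0x40u

static unsigned int current_flags; /* stands in for current->flags */

/* Mimic of current_gfp_context(): strip gfp bits forbidden by the task context */
unsigned int current_gfp_context(unsigned int flags)
{
	if (current_flags &
	    (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA)) {
		/* NOIO is the weaker context, so it takes precedence */
		if (current_flags & PF_MEMALLOC_NOIO)
			flags &= ~(__GFP_IO | __GFP_FS);
		else if (current_flags & PF_MEMALLOC_NOFS)
			flags &= ~__GFP_FS;
		if (current_flags & PF_MEMALLOC_NOCMA)
			flags &= ~__GFP_MOVABLE;
	}
	return flags;
}

/* Open a no-CMA scope; the returned value lets scopes nest safely */
unsigned int memalloc_nocma_save(void)
{
	unsigned int flags = current_flags & PF_MEMALLOC_NOCMA;

	current_flags |= PF_MEMALLOC_NOCMA;
	return flags;
}

/* Close the scope, restoring whatever the outer scope had set */
void memalloc_nocma_restore(unsigned int flags)
{
	current_flags = (current_flags & ~PF_MEMALLOC_NOCMA) | flags;
}
```

Because restore puts back the saved bit rather than unconditionally clearing it, a nested save/restore pair inside an outer no-CMA scope leaves the outer scope intact.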
[PATCH V6 0/4] mm/kvm/vfio/ppc64: Migrate compound pages out of CMA region
ppc64 uses a CMA area for the allocation of the guest page table (hash page table). We won't be able to start a guest if we fail to allocate the hash page table. We have observed hash page table allocation failures because we failed to migrate pages out of the CMA region because they were pinned. This happens when we are using VFIO. VFIO on ppc64 pins the entire guest RAM. If the guest RAM pages get allocated out of the CMA region, we won't be able to migrate those pages. The pages are also pinned for the lifetime of the guest.

Currently we support migration of non-compound pages. With THP and with the addition of hugetlb migration we can end up allocating compound pages from the CMA region. This patch series adds support for migrating compound pages. The first patch adds the helper get_user_pages_cma_migrate(), which pins the pages, making sure we migrate them out of the CMA region before incrementing the reference count.

Changes from V5:
* Add PF_MEMALLOC_NOCMA
* Remove __GFP_THISNODE when allocating the target page for migration

Changes from V4:
* Use __GFP_NOWARN when allocating pages to avoid page allocation failure warnings.

Changes from V3:
* Move the hugetlb check before the transhuge check
* Use the compound head page when isolating a hugetlb page

Aneesh Kumar K.V (4):
  mm/cma: Add PF flag to force non cma alloc
  mm: Add get_user_pages_cma_migrate
  powerpc/mm/iommu: Allow migration of cma allocated pages during
    mm_iommu_get
  powerpc/mm/iommu: Allow large IOMMU page size only for hugetlb backing

 arch/powerpc/mm/mmu_context_iommu.c | 144 ---
 include/linux/hugetlb.h             |   2 +
 include/linux/migrate.h             |   3 +
 include/linux/sched.h               |   1 +
 include/linux/sched/mm.h            |  36 +--
 mm/hugetlb.c                        |   4 +-
 mm/migrate.c                        | 149
 7 files changed, 227 insertions(+), 112 deletions(-)

--
2.20.1
Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]
On Sun, Jan 06, 2019 at 09:43:46AM +1100, Benjamin Herrenschmidt wrote: > On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote: > > > > > Interesting. I've investigated this further, though I don't have as > > > many new clues as I'd like. The problem occurs reliably, at least on > > > one particular type of machine (a POWER8 "Garrison" with ConnectX-4). > > > I don't yet know if it occurs with other machines, I'm having trouble > > > getting access to other machines with a suitable card. I didn't > > > manage to reproduce it on a different POWER8 machine with a > > > ConnectX-5, but I don't know if it's the difference in machine or > > > difference in card revision that's important. > > > > Make sure the card has the latest firmware is always good advice.. > > > > > So possibilities that occur to me: > > > * It's something specific about how the vfio-pci driver uses D3 > > > state - have you tried rebinding your device to vfio-pci? > > > * It's something specific about POWER, either the kernel or the PCI > > > bridge hardware > > > * It's something specific about this particular type of machine > > > > Does the EEH indicate what happend to actually trigger it? > > In a very cryptic way that requires manual parsing using non-public > docs sadly but yes. From the look of it, it's a completion timeout. > > Looks to me like we don't get a response to a config space access > during the change of D state. I don't know if it's the write of the D3 > state itself or the read back though (it's probably detected on the > read back or a subsequent read, but that doesn't tell me which specific > one failed). If it is just one card doing it (again, check you have latest firmware) I wonder if it is a sketchy PCI-E electrical link that is causing a long re-training cycle? Can you tell if the PCI-E link is permanently gone or does it eventually return? Does the card work in Gen 3 when it starts? Is there any indication of PCI-E link errors? Everytime or sometimes? 
POWER 8 firmware is good? If the link does eventually come back, is the POWER8's D3 resumption timeout long enough? If this doesn't lead to an obvious conclusion you'll probably need to connect to IBM's Mellanox support team to get more information from the card side. Jason
[PATCHv4 3/4] pci: layerscape: Add the EP mode support.
Add the PCIe EP mode support for layerscape platform. Signed-off-by: Xiaowei Bao --- v2: - remove the EP mode check function. v3: - modif the return value when enter default case. v4: - no change. drivers/pci/controller/dwc/Makefile|2 +- drivers/pci/controller/dwc/pci-layerscape-ep.c | 146 2 files changed, 147 insertions(+), 1 deletions(-) create mode 100644 drivers/pci/controller/dwc/pci-layerscape-ep.c diff --git a/drivers/pci/controller/dwc/Makefile b/drivers/pci/controller/dwc/Makefile index fcf91ea..e97e920 100644 --- a/drivers/pci/controller/dwc/Makefile +++ b/drivers/pci/controller/dwc/Makefile @@ -8,7 +8,7 @@ obj-$(CONFIG_PCI_EXYNOS) += pci-exynos.o obj-$(CONFIG_PCI_IMX6) += pci-imx6.o obj-$(CONFIG_PCIE_SPEAR13XX) += pcie-spear13xx.o obj-$(CONFIG_PCI_KEYSTONE) += pci-keystone.o -obj-$(CONFIG_PCI_LAYERSCAPE) += pci-layerscape.o +obj-$(CONFIG_PCI_LAYERSCAPE) += pci-layerscape.o pci-layerscape-ep.o obj-$(CONFIG_PCIE_QCOM) += pcie-qcom.o obj-$(CONFIG_PCIE_ARMADA_8K) += pcie-armada8k.o obj-$(CONFIG_PCIE_ARTPEC6) += pcie-artpec6.o diff --git a/drivers/pci/controller/dwc/pci-layerscape-ep.c b/drivers/pci/controller/dwc/pci-layerscape-ep.c new file mode 100644 index 000..dafb528 --- /dev/null +++ b/drivers/pci/controller/dwc/pci-layerscape-ep.c @@ -0,0 +1,146 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * PCIe controller EP driver for Freescale Layerscape SoCs + * + * Copyright (C) 2018 NXP Semiconductor. 
+ * + * Author: Xiaowei Bao + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "pcie-designware.h" + +#define PCIE_DBI2_OFFSET 0x1000 /* DBI2 base address*/ + +struct ls_pcie_ep { + struct dw_pcie *pci; +}; + +#define to_ls_pcie_ep(x) dev_get_drvdata((x)->dev) + +static int ls_pcie_establish_link(struct dw_pcie *pci) +{ + return 0; +} + +static const struct dw_pcie_ops ls_pcie_ep_ops = { + .start_link = ls_pcie_establish_link, +}; + +static const struct of_device_id ls_pcie_ep_of_match[] = { + { .compatible = "fsl,ls-pcie-ep",}, + { }, +}; + +static void ls_pcie_ep_init(struct dw_pcie_ep *ep) +{ + struct dw_pcie *pci = to_dw_pcie_from_ep(ep); + struct pci_epc *epc = ep->epc; + enum pci_barno bar; + + for (bar = BAR_0; bar <= BAR_5; bar++) + dw_pcie_ep_reset_bar(pci, bar); + + epc->features |= EPC_FEATURE_NO_LINKUP_NOTIFIER; +} + +static int ls_pcie_ep_raise_irq(struct dw_pcie_ep *ep, u8 func_no, + enum pci_epc_irq_type type, u16 interrupt_num) +{ + struct dw_pcie *pci = to_dw_pcie_from_ep(ep); + + switch (type) { + case PCI_EPC_IRQ_LEGACY: + return dw_pcie_ep_raise_legacy_irq(ep, func_no); + case PCI_EPC_IRQ_MSI: + return dw_pcie_ep_raise_msi_irq(ep, func_no, interrupt_num); + case PCI_EPC_IRQ_MSIX: + return dw_pcie_ep_raise_msix_irq(ep, func_no, interrupt_num); + default: + dev_err(pci->dev, "UNKNOWN IRQ type\n"); + return -EINVAL; + } +} + +static struct dw_pcie_ep_ops pcie_ep_ops = { + .ep_init = ls_pcie_ep_init, + .raise_irq = ls_pcie_ep_raise_irq, +}; + +static int __init ls_add_pcie_ep(struct ls_pcie_ep *pcie, + struct platform_device *pdev) +{ + struct dw_pcie *pci = pcie->pci; + struct device *dev = pci->dev; + struct dw_pcie_ep *ep; + struct resource *res; + int ret; + + ep = &pci->ep; + ep->ops = &pcie_ep_ops; + + res = platform_get_resource_byname(pdev, IORESOURCE_MEM, "addr_space"); + if (!res) + return -EINVAL; + + ep->phys_base = res->start; + ep->addr_size = resource_size(res); + + ret = 
dw_pcie_ep_init(ep); + if (ret) { + dev_err(dev, "failed to initialize endpoint\n"); + return ret; + } + + return 0; +} + +static int __init ls_pcie_ep_probe(struct platform_device *pdev) +{ + struct device *dev = &pdev->dev; + struct dw_pcie *pci; + struct ls_pcie_ep *pcie; + struct resource *dbi_base; + int ret; + + pcie = devm_kzalloc(dev, sizeof(*pcie), GFP_KERNEL); + if (!pcie) + return -ENOMEM; + + pci = devm_kzalloc(dev, sizeof(*pci), GFP_KERNEL); + if (!pci) + return -ENOMEM; + + dbi_base = platform_get_resource_byname(pdev, IORESOURCE_MEM, "regs"); + pci->dbi_base = devm_pci_remap_cfg_resource(dev, dbi_base); + if (IS_ERR(pci->dbi_base)) + return PTR_ERR(pci->dbi_base); + + pci->dbi_base2 = pci->dbi_base + PCIE_DBI2_OFFSET; + pci->dev = dev; + pci->ops = &ls_pcie_ep_ops; + pcie->pci = pci; + + platform_set_drvdata(pdev, pcie); + + ret = ls_add_pcie_ep(pcie, pdev); + + return ret; +} + +static struct pla
[PATCHv4 4/4] misc: pci_endpoint_test: Add the layerscape EP device support
Add the layerscape EP device support in pci_endpoint_test driver.

Signed-off-by: Xiaowei Bao
---
v2:
 - no change
v3:
 - no change
v4:
 - delete the comments.

 drivers/misc/pci_endpoint_test.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/misc/pci_endpoint_test.c b/drivers/misc/pci_endpoint_test.c
index 896e2df..29582fe 100644
--- a/drivers/misc/pci_endpoint_test.c
+++ b/drivers/misc/pci_endpoint_test.c
@@ -788,6 +788,7 @@ static void pci_endpoint_test_remove(struct pci_dev *pdev)
 static const struct pci_device_id pci_endpoint_test_tbl[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_DRA74x) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_TI, PCI_DEVICE_ID_TI_DRA72x) },
+	{ PCI_DEVICE(PCI_VENDOR_ID_FREESCALE, 0x81c0) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_SYNOPSYS, 0xedda) },
 	{ }
 };
--
1.7.1
[PATCHv4 1/4] dt-bindings: add DT binding for the layerscape PCIe controller with EP mode
Add the documentation for the Device Tree binding for the layerscape PCIe controller with EP mode. Signed-off-by: Xiaowei Bao --- v2: - Add the SoC specific compatibles. v3: - modify the commit message. v4: - no change. .../devicetree/bindings/pci/layerscape-pci.txt |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/Documentation/devicetree/bindings/pci/layerscape-pci.txt b/Documentation/devicetree/bindings/pci/layerscape-pci.txt index 9b2b8d6..e20ceaa 100644 --- a/Documentation/devicetree/bindings/pci/layerscape-pci.txt +++ b/Documentation/devicetree/bindings/pci/layerscape-pci.txt @@ -13,6 +13,7 @@ information. Required properties: - compatible: should contain the platform identifier such as: + RC mode: "fsl,ls1021a-pcie" "fsl,ls2080a-pcie", "fsl,ls2085a-pcie" "fsl,ls2088a-pcie" @@ -20,6 +21,8 @@ Required properties: "fsl,ls1046a-pcie" "fsl,ls1043a-pcie" "fsl,ls1012a-pcie" + EP mode: + "fsl,ls1046a-pcie-ep", "fsl,ls-pcie-ep" - reg: base addresses and lengths of the PCIe controller register blocks. - interrupts: A list of interrupt outputs of the controller. Must contain an entry for each entry in the interrupt-names property. -- 1.7.1
[PATCHv4 2/4] arm64: dts: Add the PCIE EP node in dts
Add the PCIE EP node in dts for ls1046a. Signed-off-by: Xiaowei Bao --- v2: - Add the SoC specific compatibles. v3: - no change v4: - no change arch/arm64/boot/dts/freescale/fsl-ls1046a.dtsi | 34 +++- 1 files changed, 33 insertions(+), 1 deletions(-) diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1046a.dtsi b/arch/arm64/boot/dts/freescale/fsl-ls1046a.dtsi index 9a2106e..e373826 100644 --- a/arch/arm64/boot/dts/freescale/fsl-ls1046a.dtsi +++ b/arch/arm64/boot/dts/freescale/fsl-ls1046a.dtsi @@ -657,6 +657,17 @@ status = "disabled"; }; + pcie_ep@340 { + compatible = "fsl,ls1046a-pcie-ep","fsl,ls-pcie-ep"; + reg = <0x00 0x0340 0x0 0x0010 + 0x40 0x 0x8 0x>; + reg-names = "regs", "addr_space"; + num-ib-windows = <6>; + num-ob-windows = <6>; + num-lanes = <2>; + status = "disabled"; + }; + pcie@350 { compatible = "fsl,ls1046a-pcie"; reg = <0x00 0x0350 0x0 0x0010 /* controller registers */ @@ -683,6 +694,17 @@ status = "disabled"; }; + pcie_ep@350 { + compatible = "fsl,ls1046a-pcie-ep","fsl,ls-pcie-ep"; + reg = <0x00 0x0350 0x0 0x0010 + 0x48 0x 0x8 0x>; + reg-names = "regs", "addr_space"; + num-ib-windows = <6>; + num-ob-windows = <6>; + num-lanes = <2>; + status = "disabled"; + }; + pcie@360 { compatible = "fsl,ls1046a-pcie"; reg = <0x00 0x0360 0x0 0x0010 /* controller registers */ @@ -709,6 +731,17 @@ status = "disabled"; }; + pcie_ep@360 { + compatible = "fsl,ls1046a-pcie-ep", "fsl,ls-pcie-ep"; + reg = <0x00 0x0360 0x0 0x0010 + 0x50 0x 0x8 0x>; + reg-names = "regs", "addr_space"; + num-ib-windows = <6>; + num-ob-windows = <6>; + num-lanes = <2>; + status = "disabled"; + }; + qdma: dma-controller@838 { compatible = "fsl,ls1046a-qdma", "fsl,ls1021a-qdma"; reg = <0x0 0x838 0x0 0x1000>, /* Controller regs */ @@ -729,7 +762,6 @@ queue-sizes = <64 64>; big-endian; }; - }; reserved-memory { -- 1.7.1
[PATCH 16/19] KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration
These are used to capture the XIVE END table of the KVM device. It relies on an OPAL call to retrieve from the XIVE IC the EQ toggle bit and index which are updated by the HW when events are enqueued in the guest RAM. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/uapi/asm/kvm.h | 21 arch/powerpc/kvm/book3s_xive_native.c | 166 ++ 2 files changed, 187 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index faf024f39858..95302558ce10 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -684,6 +684,7 @@ struct kvm_ppc_cpu_char { #define KVM_DEV_XIVE_GRP_SOURCES 2 /* 64-bit source attributes */ #define KVM_DEV_XIVE_GRP_SYNC 3 /* 64-bit source attributes */ #define KVM_DEV_XIVE_GRP_EAS 4 /* 64-bit eas attributes */ +#define KVM_DEV_XIVE_GRP_EQ5 /* 64-bit eq attributes */ /* Layout of 64-bit XIVE source attribute values */ #define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0) @@ -699,4 +700,24 @@ struct kvm_ppc_cpu_char { #define KVM_XIVE_EAS_EISN_SHIFT33 #define KVM_XIVE_EAS_EISN_MASK 0xfffeULL +/* Layout of 64-bit eq attribute */ +#define KVM_XIVE_EQ_PRIORITY_SHIFT 0 +#define KVM_XIVE_EQ_PRIORITY_MASK 0x7 +#define KVM_XIVE_EQ_SERVER_SHIFT 3 +#define KVM_XIVE_EQ_SERVER_MASK0xfff8ULL + +/* Layout of 64-bit eq attribute values */ +struct kvm_ppc_xive_eq { + __u32 flags; + __u32 qsize; + __u64 qpage; + __u32 qtoggle; + __u32 qindex; +}; + +#define KVM_XIVE_EQ_FLAG_ENABLED 0x0001 +#define KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY 0x0002 +#define KVM_XIVE_EQ_FLAG_ESCALATE 0x0004 + + #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 0468b605baa7..f4eb71eafc57 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -607,6 +607,164 @@ static int kvmppc_xive_native_get_eas(struct kvmppc_xive *xive, long irq, return 0; } +static int kvmppc_xive_native_set_queue(struct 
kvmppc_xive *xive, long eq_idx, + u64 addr) +{ + struct kvm *kvm = xive->kvm; + struct kvm_vcpu *vcpu; + struct kvmppc_xive_vcpu *xc; + void __user *ubufp = (u64 __user *) addr; + u32 server; + u8 priority; + struct kvm_ppc_xive_eq kvm_eq; + int rc; + __be32 *qaddr = 0; + struct page *page; + struct xive_q *q; + + /* +* Demangle priority/server tuple from the EQ index +*/ + priority = (eq_idx & KVM_XIVE_EQ_PRIORITY_MASK) >> + KVM_XIVE_EQ_PRIORITY_SHIFT; + server = (eq_idx & KVM_XIVE_EQ_SERVER_MASK) >> + KVM_XIVE_EQ_SERVER_SHIFT; + + if (copy_from_user(&kvm_eq, ubufp, sizeof(kvm_eq))) + return -EFAULT; + + vcpu = kvmppc_xive_find_server(kvm, server); + if (!vcpu) { + pr_err("Can't find server %d\n", server); + return -ENOENT; + } + xc = vcpu->arch.xive_vcpu; + + if (priority != xive_prio_from_guest(priority)) { + pr_err("Trying to restore invalid queue %d for VCPU %d\n", + priority, server); + return -EINVAL; + } + q = &xc->queues[priority]; + + pr_devel("%s VCPU %d priority %d fl:%x sz:%d addr:%llx g:%d idx:%d\n", +__func__, server, priority, kvm_eq.flags, +kvm_eq.qsize, kvm_eq.qpage, kvm_eq.qtoggle, kvm_eq.qindex); + + rc = xive_native_validate_queue_size(kvm_eq.qsize); + if (rc || !kvm_eq.qsize) { + pr_err("invalid queue size %d\n", kvm_eq.qsize); + return rc; + } + + page = gfn_to_page(kvm, gpa_to_gfn(kvm_eq.qpage)); + if (is_error_page(page)) { + pr_warn("Couldn't get guest page for %llx!\n", kvm_eq.qpage); + return -ENOMEM; + } + qaddr = page_to_virt(page) + (kvm_eq.qpage & ~PAGE_MASK); + + /* Backup queue page guest address for migration */ + q->guest_qpage = kvm_eq.qpage; + q->guest_qsize = kvm_eq.qsize; + + rc = xive_native_configure_queue(xc->vp_id, q, priority, +(__be32 *) qaddr, kvm_eq.qsize, true); + if (rc) { + pr_err("Failed to configure queue %d for VCPU %d: %d\n", + priority, xc->server_num, rc); + put_page(page); + return rc; + } + + rc = xive_native_set_queue_state(xc->vp_id, priority, kvm_eq.qtoggle, +kvm_eq.qindex); + if (rc) + goto error; + + 
rc = kvmppc_xive_attach_escalation(vcpu, priority); +error: + if (rc) + xive_native_cl
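The EQ attribute above demangles a (server, priority) tuple from the 64-bit attribute index using the shift/mask constants defined in the uapi header. The pair of helpers below is a hedged userspace sketch of that encoding (the function names are invented, and the priority field width follows the 3-bit mask in the patch):

```c
#include <stdint.h>

/* Shifts and masks mirroring the patch's KVM_XIVE_EQ_* definitions */
#define KVM_XIVE_EQ_PRIORITY_SHIFT 0
#define KVM_XIVE_EQ_PRIORITY_MASK  0x7ULL
#define KVM_XIVE_EQ_SERVER_SHIFT   3

/* Pack a (server, priority) tuple into an EQ attribute index */
uint64_t xive_eq_idx(uint32_t server, uint8_t priority)
{
	return ((uint64_t)server << KVM_XIVE_EQ_SERVER_SHIFT) |
	       ((uint64_t)priority & KVM_XIVE_EQ_PRIORITY_MASK);
}

/* Demangle the tuple again, as kvmppc_xive_native_set_queue() does */
void xive_eq_demangle(uint64_t eq_idx, uint32_t *server, uint8_t *priority)
{
	*priority = (eq_idx & KVM_XIVE_EQ_PRIORITY_MASK) >>
		    KVM_XIVE_EQ_PRIORITY_SHIFT;
	*server = (uint32_t)(eq_idx >> KVM_XIVE_EQ_SERVER_SHIFT);
}
```

Packing both values into the attribute index leaves the 64-bit attribute payload free to carry the struct kvm_ppc_xive_eq describing the queue itself.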
[PATCH 13/19] KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration
When migration of a VM is initiated, a first copy of the RAM is transferred to the destination before the VM is stopped. At that time, QEMU needs to perform a XIVE quiesce sequence to stop the flow of event notifications and stabilize the EQs. The sources are masked and the XIVE IC is synced with the KVM ioctl KVM_DEV_XIVE_GRP_SYNC. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/uapi/asm/kvm.h | 1 + arch/powerpc/kvm/book3s_xive_native.c | 32 +++ 2 files changed, 33 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 6fc9660c5aec..f3b859223b80 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -681,6 +681,7 @@ struct kvm_ppc_cpu_char { #define KVM_DEV_XIVE_GET_TIMA_FD 2 #define KVM_DEV_XIVE_VC_BASE 3 #define KVM_DEV_XIVE_GRP_SOURCES 2 /* 64-bit source attributes */ +#define KVM_DEV_XIVE_GRP_SYNC 3 /* 64-bit source attributes */ /* Layout of 64-bit XIVE source attribute values */ #define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0) diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 4ca75aade069..a8052867afc1 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -459,6 +459,35 @@ static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq, return 0; } +static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr) +{ + struct kvmppc_xive_src_block *sb; + struct kvmppc_xive_irq_state *state; + struct xive_irq_data *xd; + u32 hw_num; + u16 src; + + pr_devel("%s irq=0x%lx\n", __func__, irq); + + sb = kvmppc_xive_find_source(xive, irq, &src); + if (!sb) + return -ENOENT; + + state = &sb->irq_state[src]; + + if (!state->valid) + return -ENOENT; + + arch_spin_lock(&sb->lock); + + kvmppc_xive_select_irq(state, &hw_num, &xd); + xive_native_sync_source(hw_num); + xive_native_sync_queue(hw_num); + + arch_spin_unlock(&sb->lock); + return 0; +} + static int 
kvmppc_xive_native_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { @@ -474,6 +503,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev, case KVM_DEV_XIVE_GRP_SOURCES: return kvmppc_xive_native_set_source(xive, attr->attr, attr->addr); + case KVM_DEV_XIVE_GRP_SYNC: + return kvmppc_xive_native_sync(xive, attr->attr, attr->addr); } return -ENXIO; } @@ -511,6 +542,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev, } break; case KVM_DEV_XIVE_GRP_SOURCES: + case KVM_DEV_XIVE_GRP_SYNC: if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ && attr->attr < KVMPPC_XIVE_NR_IRQS) return 0; -- 2.20.1
[PATCH 10/19] KVM: PPC: Book3S HV: add a EISN attribute to kvmppc_xive_irq_state
The Effective IRQ Source Number is the interrupt number pushed in the event queue that the guest OS will use to dispatch events internally. Signed-off-by: Cédric Le Goater --- arch/powerpc/kvm/book3s_xive.h | 3 +++ arch/powerpc/kvm/book3s_xive.c | 1 + 2 files changed, 4 insertions(+) diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index ae4a670eea63..67e07b41061d 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -57,6 +57,9 @@ struct kvmppc_xive_irq_state { bool saved_p; bool saved_q; u8 saved_scan_prio; + + /* Xive native */ + u32 eisn; /* Guest Effective IRQ number */ }; /* Select the "right" interrupt (IPI vs. passthrough) */ diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index bb5d32f7e4e6..e9f05d9c9ad5 100644 --- a/arch/powerpc/kvm/book3s_xive.c +++ b/arch/powerpc/kvm/book3s_xive.c @@ -1515,6 +1515,7 @@ struct kvmppc_xive_src_block *kvmppc_xive_create_src_block( for (i = 0; i < KVMPPC_XICS_IRQ_PER_ICS; i++) { sb->irq_state[i].number = (bid << KVMPPC_XICS_ICS_SHIFT) | i; + sb->irq_state[i].eisn = 0; sb->irq_state[i].guest_priority = MASKED; sb->irq_state[i].saved_priority = MASKED; sb->irq_state[i].act_priority = MASKED; -- 2.20.1
[PATCH 08/19] KVM: PPC: Book3S HV: add a VC_BASE control to the XIVE native device
The ESB MMIO region controls the interrupt sources of the guest. QEMU will query an fd (GET_ESB_FD ioctl) and map this region at a specific address for the guest to use. The guest will obtain this information using the H_INT_GET_SOURCE_INFO hcall. To inform KVM of the address setting used by QEMU, add a VC_BASE control to the KVM XIVE device Signed-off-by: Cédric Le Goater --- arch/powerpc/include/uapi/asm/kvm.h | 1 + arch/powerpc/kvm/book3s_xive.h| 3 +++ arch/powerpc/kvm/book3s_xive_native.c | 39 +++ 3 files changed, 43 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 89c140cb9e79..8b78b12aa118 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -679,5 +679,6 @@ struct kvm_ppc_cpu_char { #define KVM_DEV_XIVE_GRP_CTRL 1 #define KVM_DEV_XIVE_GET_ESB_FD 1 #define KVM_DEV_XIVE_GET_TIMA_FD 2 +#define KVM_DEV_XIVE_VC_BASE 3 #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index 5f22415520b4..ae4a670eea63 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -125,6 +125,9 @@ struct kvmppc_xive { /* Flags */ u8 single_escalation; + + /* VC base address for ESBs */ + u64 vc_base; }; #define KVMPPC_XIVE_Q_COUNT8 diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index ee9d12bf2dae..29a62914de55 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -153,6 +153,25 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev, return rc; } +static int kvmppc_xive_native_set_vc_base(struct kvmppc_xive *xive, u64 addr) +{ + u64 __user *ubufp = (u64 __user *) addr; + + if (get_user(xive->vc_base, ubufp)) + return -EFAULT; + return 0; +} + +static int kvmppc_xive_native_get_vc_base(struct kvmppc_xive *xive, u64 addr) +{ + u64 __user *ubufp = (u64 __user *) addr; + + if (put_user(xive->vc_base, ubufp)) + 
return -EFAULT; + + return 0; +} + static int xive_native_esb_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; @@ -289,6 +308,16 @@ static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr) static int kvmppc_xive_native_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { + struct kvmppc_xive *xive = dev->private; + + switch (attr->group) { + case KVM_DEV_XIVE_GRP_CTRL: + switch (attr->attr) { + case KVM_DEV_XIVE_VC_BASE: + return kvmppc_xive_native_set_vc_base(xive, attr->addr); + } + break; + } return -ENXIO; } @@ -304,6 +333,8 @@ static int kvmppc_xive_native_get_attr(struct kvm_device *dev, return kvmppc_xive_native_get_esb_fd(xive, attr->addr); case KVM_DEV_XIVE_GET_TIMA_FD: return kvmppc_xive_native_get_tima_fd(xive, attr->addr); + case KVM_DEV_XIVE_VC_BASE: + return kvmppc_xive_native_get_vc_base(xive, attr->addr); } break; } @@ -318,6 +349,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev, switch (attr->attr) { case KVM_DEV_XIVE_GET_ESB_FD: case KVM_DEV_XIVE_GET_TIMA_FD: + case KVM_DEV_XIVE_VC_BASE: return 0; } break; @@ -353,6 +385,11 @@ static void kvmppc_xive_native_free(struct kvm_device *dev) kfree(dev); } +/* + * ESB MMIO address of chip 0 + */ +#define XIVE_VC_BASE 0x00060100ull + static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type) { struct kvmppc_xive *xive; @@ -387,6 +424,8 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type) if (xive->vp_base == XIVE_INVALID_VP) ret = -ENOMEM; + xive->vc_base = XIVE_VC_BASE; + xive->single_escalation = xive_native_has_single_escalation(); if (ret) -- 2.20.1
[PATCH 09/19] KVM: PPC: Book3S HV: add a SET_SOURCE control to the XIVE native device
Interrupt sources are simply created at the OPAL level and then MASKED. KVM only needs to know about their type: LSI or MSI. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/uapi/asm/kvm.h | 5 + arch/powerpc/kvm/book3s_xive_native.c | 98 +++ .../powerpc/kvm/book3s_xive_native_template.c | 27 + 3 files changed, 130 insertions(+) create mode 100644 arch/powerpc/kvm/book3s_xive_native_template.c diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 8b78b12aa118..6fc9660c5aec 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -680,5 +680,10 @@ struct kvm_ppc_cpu_char { #define KVM_DEV_XIVE_GET_ESB_FD 1 #define KVM_DEV_XIVE_GET_TIMA_FD 2 #define KVM_DEV_XIVE_VC_BASE 3 +#define KVM_DEV_XIVE_GRP_SOURCES 2 /* 64-bit source attributes */ + +/* Layout of 64-bit XIVE source attribute values */ +#define KVM_XIVE_LEVEL_SENSITIVE (1ULL << 0) +#define KVM_XIVE_LEVEL_ASSERTED(1ULL << 1) #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 29a62914de55..2518640d4a58 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -31,6 +31,24 @@ #include "book3s_xive.h" +/* + * We still instantiate them here because we use some of the + * generated utility functions as well in this file. 
+ */ +#define XIVE_RUNTIME_CHECKS +#define X_PFX xive_vm_ +#define X_STATIC static +#define X_STAT_PFX stat_vm_ +#define __x_tima xive_tima +#define __x_eoi_page(xd) ((void __iomem *)((xd)->eoi_mmio)) +#define __x_trig_page(xd) ((void __iomem *)((xd)->trig_mmio)) +#define __x_writeb __raw_writeb +#define __x_readw __raw_readw +#define __x_readq __raw_readq +#define __x_writeq __raw_writeq + +#include "book3s_xive_native_template.c" + static void xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio) { struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu; @@ -305,6 +323,78 @@ static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr) return put_user(ret, ubufp); } +static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq, +u64 addr) +{ + struct kvmppc_xive_src_block *sb; + struct kvmppc_xive_irq_state *state; + u64 __user *ubufp = (u64 __user *) addr; + u64 val; + u16 idx; + + pr_devel("%s irq=0x%lx\n", __func__, irq); + + if (irq < KVMPPC_XIVE_FIRST_IRQ || irq >= KVMPPC_XIVE_NR_IRQS) + return -ENOENT; + + sb = kvmppc_xive_find_source(xive, irq, &idx); + if (!sb) { + pr_debug("No source, creating source block...\n"); + sb = kvmppc_xive_create_src_block(xive, irq); + if (!sb) { + pr_err("Failed to create block...\n"); + return -ENOMEM; + } + } + state = &sb->irq_state[idx]; + + if (get_user(val, ubufp)) { + pr_err("fault getting user info !\n"); + return -EFAULT; + } + + /* +* If the source doesn't already have an IPI, allocate +* one and get the corresponding data +*/ + if (!state->ipi_number) { + state->ipi_number = xive_native_alloc_irq(); + if (state->ipi_number == 0) { + pr_err("Failed to allocate IRQ !\n"); + return -ENOMEM; + } + xive_native_populate_irq_data(state->ipi_number, + &state->ipi_data); + pr_debug("%s allocated hw_irq=0x%x for irq=0x%lx\n", __func__, +state->ipi_number, irq); + } + + arch_spin_lock(&sb->lock); + + /* Restore LSI state */ + if (val & KVM_XIVE_LEVEL_SENSITIVE) { + state->lsi = true; + if (val & 
KVM_XIVE_LEVEL_ASSERTED) + state->asserted = true; + pr_devel(" LSI ! Asserted=%d\n", state->asserted); + } + + /* Mask IRQ to start with */ + state->act_server = 0; + state->act_priority = MASKED; + xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01); + xive_native_configure_irq(state->ipi_number, 0, MASKED, 0); + + /* Increment the number of valid sources and mark this one valid */ + if (!state->valid) + xive->src_count++; + state->valid = true; + + arch_spin_unlock(&sb->lock); + + return 0; +} + static int kvmppc_xive_native_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { @@ -317,6 +407,9 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev, return kvmppc_xive_native_s
[PATCH 07/19] KVM: PPC: Book3S HV: add a GET_TIMA_FD control to the XIVE native device
This will let the guest create a memory mapping to expose the XIVE MMIO region (TIMA) used for interrupt management at the CPU level. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/xive.h | 1 + arch/powerpc/include/uapi/asm/kvm.h | 1 + arch/powerpc/kvm/book3s_xive_native.c | 57 +++ arch/powerpc/sysdev/xive/native.c | 11 ++ 4 files changed, 70 insertions(+) diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h index d6be3e4d9fa4..7a7aa22d8258 100644 --- a/arch/powerpc/include/asm/xive.h +++ b/arch/powerpc/include/asm/xive.h @@ -23,6 +23,7 @@ * same offset regardless of where the code is executing */ extern void __iomem *xive_tima; +extern unsigned long xive_tima_os; /* * Offset in the TM area of our current execution level (provided by diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 6bb61ba141c2..89c140cb9e79 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -678,5 +678,6 @@ struct kvm_ppc_cpu_char { /* POWER9 XIVE Native Interrupt Controller */ #define KVM_DEV_XIVE_GRP_CTRL 1 #define KVM_DEV_XIVE_GET_ESB_FD 1 +#define KVM_DEV_XIVE_GET_TIMA_FD 2 #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index e20081f0c8d4..ee9d12bf2dae 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -232,6 +232,60 @@ static int kvmppc_xive_native_get_esb_fd(struct kvmppc_xive *xive, u64 addr) return put_user(ret, ubufp); } +static int xive_native_tima_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + + switch (vmf->pgoff) { + case 0: /* HW - forbid access */ + case 1: /* HV - forbid access */ + return VM_FAULT_SIGBUS; + case 2: /* OS */ + vmf_insert_pfn(vma, vmf->address, xive_tima_os >> PAGE_SHIFT); + return VM_FAULT_NOPAGE; + case 3: /* USER - TODO */ + default: + return VM_FAULT_SIGBUS; + } +} + +static const struct 
vm_operations_struct xive_native_tima_vmops = { + .fault = xive_native_tima_fault, +}; + +static int xive_native_tima_mmap(struct file *file, struct vm_area_struct *vma) +{ + /* +* The TIMA is four pages wide but only the last two pages (OS +* and User view) are accessible to the guest. The page fault +* handler will handle the permissions. +*/ + if (vma_pages(vma) + vma->vm_pgoff > 4) + return -EINVAL; + + vma->vm_flags |= VM_IO | VM_PFNMAP; + vma->vm_page_prot = pgprot_noncached_wc(vma->vm_page_prot); + vma->vm_ops = &xive_native_tima_vmops; + return 0; +} + +static const struct file_operations xive_native_tima_fops = { + .mmap = xive_native_tima_mmap, +}; + +static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr) +{ + u64 __user *ubufp = (u64 __user *) addr; + int ret; + + ret = anon_inode_getfd("[xive-tima]", &xive_native_tima_fops, xive, + O_RDWR | O_CLOEXEC); + if (ret < 0) + return ret; + + return put_user(ret, ubufp); +} + static int kvmppc_xive_native_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { @@ -248,6 +302,8 @@ static int kvmppc_xive_native_get_attr(struct kvm_device *dev, switch (attr->attr) { case KVM_DEV_XIVE_GET_ESB_FD: return kvmppc_xive_native_get_esb_fd(xive, attr->addr); + case KVM_DEV_XIVE_GET_TIMA_FD: + return kvmppc_xive_native_get_tima_fd(xive, attr->addr); } break; } @@ -261,6 +317,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev, case KVM_DEV_XIVE_GRP_CTRL: switch (attr->attr) { case KVM_DEV_XIVE_GET_ESB_FD: + case KVM_DEV_XIVE_GET_TIMA_FD: return 0; } break; diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c index 0c037e933e55..7782201e5fe8 100644 --- a/arch/powerpc/sysdev/xive/native.c +++ b/arch/powerpc/sysdev/xive/native.c @@ -521,6 +521,9 @@ u32 xive_native_default_eq_shift(void) } EXPORT_SYMBOL_GPL(xive_native_default_eq_shift); +unsigned long xive_tima_os; +EXPORT_SYMBOL_GPL(xive_tima_os); + bool __init xive_native_init(void) { struct 
device_node *np; @@ -573,6 +576,14 @@ bool __init xive_native_init(void) for_each_possible_cpu(cpu) kvmppc_set_xive_tima(cpu, r.start, tima); + /* Resource 2 is OS window */ + if (of_address_to_resource(np, 2, &r)) { + pr_err("Failed to get thread mgmnt area resource\n"); +
[PATCH 02/19] powerpc/xive: add OPAL extensions for the XIVE native exploitation support
The support for XIVE native exploitation mode in Linux/KVM needs a couple more OPAL calls to configure the sPAPR guest and to get/set the state of the XIVE internal structures. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/opal-api.h | 11 ++- arch/powerpc/include/asm/opal.h | 7 ++ arch/powerpc/include/asm/xive.h | 14 +++ arch/powerpc/sysdev/xive/native.c | 99 +++ .../powerpc/platforms/powernv/opal-wrappers.S | 3 + 5 files changed, 130 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index 870fb7b239ea..cdfc54f78101 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -186,8 +186,8 @@ #define OPAL_XIVE_FREE_IRQ 140 #define OPAL_XIVE_SYNC 141 #define OPAL_XIVE_DUMP 142 -#define OPAL_XIVE_RESERVED3143 -#define OPAL_XIVE_RESERVED4144 +#define OPAL_XIVE_GET_QUEUE_STATE 143 +#define OPAL_XIVE_SET_QUEUE_STATE 144 #define OPAL_SIGNAL_SYSTEM_RESET 145 #define OPAL_NPU_INIT_CONTEXT 146 #define OPAL_NPU_DESTROY_CONTEXT 147 @@ -209,8 +209,11 @@ #define OPAL_SENSOR_GROUP_ENABLE 163 #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR 164 #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR 165 -#defineOPAL_NX_COPROC_INIT 167 -#define OPAL_LAST 167 +#define OPAL_HANDLE_HMI2 166 +#define OPAL_NX_COPROC_INIT167 +#define OPAL_NPU_SET_RELAXED_ORDER 168 +#define OPAL_NPU_GET_RELAXED_ORDER 169 +#define OPAL_XIVE_GET_VP_STATE 170 #define QUIESCE_HOLD 1 /* Spin all calls at entry */ #define QUIESCE_REJECT 2 /* Fail all calls with OPAL_BUSY */ diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index a55b01c90bb1..4e978d4dea5c 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -279,6 +279,13 @@ int64_t opal_xive_allocate_irq(uint32_t chip_id); int64_t opal_xive_free_irq(uint32_t girq); int64_t opal_xive_sync(uint32_t type, uint32_t id); int64_t opal_xive_dump(uint32_t type, uint32_t id); +int64_t opal_xive_get_queue_state(uint64_t vp, 
uint32_t prio, + __be32 *out_qtoggle, + __be32 *out_qindex); +int64_t opal_xive_set_queue_state(uint64_t vp, uint32_t prio, + uint32_t qtoggle, + uint32_t qindex); +int64_t opal_xive_get_vp_state(uint64_t vp, __be64 *out_w01); int64_t opal_pci_set_p2p(uint64_t phb_init, uint64_t phb_target, uint64_t desc, uint16_t pe_number); diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h index 32f033bfbf42..d6be3e4d9fa4 100644 --- a/arch/powerpc/include/asm/xive.h +++ b/arch/powerpc/include/asm/xive.h @@ -132,12 +132,26 @@ extern int xive_native_configure_queue(u32 vp_id, struct xive_q *q, u8 prio, extern void xive_native_disable_queue(u32 vp_id, struct xive_q *q, u8 prio); extern void xive_native_sync_source(u32 hw_irq); +extern void xive_native_sync_queue(u32 hw_irq); extern bool is_xive_irq(struct irq_chip *chip); extern int xive_native_enable_vp(u32 vp_id, bool single_escalation); extern int xive_native_disable_vp(u32 vp_id); extern int xive_native_get_vp_info(u32 vp_id, u32 *out_cam_id, u32 *out_chip_id); extern bool xive_native_has_single_escalation(void); +extern int xive_native_get_queue_info(u32 vp_id, uint32_t prio, + u64 *out_qpage, + u64 *out_qsize, + u64 *out_qeoi_page, + u32 *out_escalate_irq, + u64 *out_qflags); + +extern int xive_native_get_queue_state(u32 vp_id, uint32_t prio, u32 *qtoggle, + u32 *qindex); +extern int xive_native_set_queue_state(u32 vp_id, uint32_t prio, u32 qtoggle, + u32 qindex); +extern int xive_native_get_vp_state(u32 vp_id, u64 *out_state); + #else static inline bool xive_enabled(void) { return false; } diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c index 1ca127d052a6..0c037e933e55 100644 --- a/arch/powerpc/sysdev/xive/native.c +++ b/arch/powerpc/sysdev/xive/native.c @@ -437,6 +437,12 @@ void xive_native_sync_source(u32 hw_irq) } EXPORT_SYMBOL_GPL(xive_native_sync_source); +void xive_native_sync_queue(u32 hw_irq) +{ + opal_xive_sync(XIVE_SYNC_QUEUE, hw_irq); +} 
+EXPORT_SYMBOL_GPL(xive_native_sync_queue); + static const struct xive_ops xive_native_ops = {
[PATCH 19/19] KVM: introduce a KVM_DELETE_DEVICE ioctl
This will be used to destroy the KVM XICS or XIVE device when the sPAPR machine is reseted. When the VM boots, the CAS negotiation process will determine which interrupt mode to use and the appropriate KVM device will then be created. Signed-off-by: Cédric Le Goater --- include/linux/kvm_host.h | 2 ++ include/uapi/linux/kvm.h | 2 ++ arch/powerpc/kvm/book3s_xive.c| 38 +- arch/powerpc/kvm/book3s_xive_native.c | 24 + virt/kvm/kvm_main.c | 39 +++ 5 files changed, 104 insertions(+), 1 deletion(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index c38cc5eb7e73..259b6885dc74 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1218,6 +1218,8 @@ struct kvm_device_ops { */ void (*destroy)(struct kvm_device *dev); + int (*delete)(struct kvm_device *dev); + int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr); int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr); int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr); diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 52bf74a1616e..b00cb4d986cf 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1331,6 +1331,8 @@ struct kvm_s390_ucas_mapping { #define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xe2, struct kvm_device_attr) #define KVM_HAS_DEVICE_ATTR _IOW(KVMIO, 0xe3, struct kvm_device_attr) +#define KVM_DELETE_DEVICE_IOWR(KVMIO, 0xf0, struct kvm_create_device) + /* * ioctls for vcpu fds */ diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index 9b4751713554..5449fb4c87f9 100644 --- a/arch/powerpc/kvm/book3s_xive.c +++ b/arch/powerpc/kvm/book3s_xive.c @@ -1109,11 +1109,19 @@ void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu) void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu) { struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu; - struct kvmppc_xive *xive = xc->xive; + struct kvmppc_xive *xive; int i; + if (!kvmppc_xics_enabled(vcpu)) + return; + + if (!xc) + return; + 
pr_devel("cleanup_vcpu(cpu=%d)\n", xc->server_num); + xive = xc->xive; + /* Ensure no interrupt is still routed to that VP */ xc->valid = false; kvmppc_xive_disable_vcpu_interrupts(vcpu); @@ -1150,6 +1158,10 @@ void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu) } /* Free the VP */ kfree(xc); + + /* Cleanup the vcpu */ + vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT; + vcpu->arch.xive_vcpu = NULL; } int kvmppc_xive_connect_vcpu(struct kvm_device *dev, @@ -1861,6 +1873,29 @@ static void kvmppc_xive_free(struct kvm_device *dev) kfree(dev); } +static int kvmppc_xive_delete(struct kvm_device *dev) +{ + struct kvm *kvm = dev->kvm; + unsigned int i; + struct kvm_vcpu *vcpu; + + if (!kvm->arch.xive) + return -EPERM; + + /* +* call kick_all_cpus_sync() to ensure that all CPUs have +* executed any pending interrupts +*/ + if (is_kvmppc_hv_enabled(kvm)) + kick_all_cpus_sync(); + + kvm_for_each_vcpu(i, vcpu, kvm) + kvmppc_xive_cleanup_vcpu(vcpu); + + kvmppc_xive_free(dev); + return 0; +} + static int kvmppc_xive_create(struct kvm_device *dev, u32 type) { struct kvmppc_xive *xive; @@ -2035,6 +2070,7 @@ struct kvm_device_ops kvm_xive_ops = { .create = kvmppc_xive_create, .init = kvmppc_xive_init, .destroy = kvmppc_xive_free, + .delete = kvmppc_xive_delete, .set_attr = xive_set_attr, .get_attr = xive_get_attr, .has_attr = xive_has_attr, diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 12edac29995e..7367962e670a 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -979,6 +979,29 @@ static void kvmppc_xive_native_free(struct kvm_device *dev) kfree(dev); } +static int kvmppc_xive_native_delete(struct kvm_device *dev) +{ + struct kvm *kvm = dev->kvm; + unsigned int i; + struct kvm_vcpu *vcpu; + + if (!kvm->arch.xive) + return -EPERM; + + /* +* call kick_all_cpus_sync() to ensure that all CPUs have +* executed any pending interrupts +*/ + if (is_kvmppc_hv_enabled(kvm)) + kick_all_cpus_sync(); + + 
kvm_for_each_vcpu(i, vcpu, kvm) + kvmppc_xive_native_cleanup_vcpu(vcpu); + + kvmppc_xive_native_free(dev); + return 0; +} + /* * ESB MMIO address of chip 0 */ @@ -1350,6 +1373,7 @@ struct kvm_device_ops kvm_xive_native_ops = { .create = kvmppc_xive_native_create, .init = kvmppc_xive_na
[PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state
At a VCPU level, the state of the thread context interrupt management registers needs to be collected. These registers are cached under the 'xive_saved_state.w01' field of the VCPU when the VPCU context is pulled from the HW thread. An OPAL call retrieves the backup of the IPB register in the NVT structure and merges it in the KVM state. The structures of the interface between QEMU and KVM provisions some extra room (two u64) for further extensions if more state needs to be transferred back to QEMU. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/kvm_ppc.h| 5 ++ arch/powerpc/include/uapi/asm/kvm.h | 2 + arch/powerpc/kvm/book3s.c | 24 + arch/powerpc/kvm/book3s_xive_native.c | 78 +++ 4 files changed, 109 insertions(+) diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 4cc897039485..49c488af168c 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -270,6 +270,7 @@ union kvmppc_one_reg { u64 addr; u64 length; } vpaval; + u64 xive_timaval[4]; }; struct kvmppc_ops { @@ -603,6 +604,8 @@ extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu); extern void kvmppc_xive_native_init_module(void); extern void kvmppc_xive_native_exit_module(void); extern int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd); +extern int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val); +extern int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val); #else static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server, @@ -637,6 +640,8 @@ static inline void kvmppc_xive_native_init_module(void) { } static inline void kvmppc_xive_native_exit_module(void) { } static inline int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd) { return 0; } +static inline int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val) { return 0; } +static inline int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union 
kvmppc_one_reg *val) { return -ENOENT; } #endif /* CONFIG_KVM_XIVE */ diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 95302558ce10..3c958c39a782 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -480,6 +480,8 @@ struct kvm_ppc_cpu_char { #define KVM_REG_PPC_ICP_PPRI_SHIFT16 /* pending irq priority */ #define KVM_REG_PPC_ICP_PPRI_MASK 0xff +#define KVM_REG_PPC_VP_STATE (KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x8d) + /* Device control API: PPC-specific devices */ #define KVM_DEV_MPIC_GRP_MISC 1 #define KVM_DEV_MPIC_BASE_ADDR 0 /* 64-bit */ diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index de7eed191107..5ad658077a35 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -641,6 +641,18 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id, *val = get_reg_val(id, kvmppc_xics_get_icp(vcpu)); break; #endif /* CONFIG_KVM_XICS */ +#ifdef CONFIG_KVM_XIVE + case KVM_REG_PPC_VP_STATE: + if (!vcpu->arch.xive_vcpu) { + r = -ENXIO; + break; + } + if (xive_enabled()) + r = kvmppc_xive_native_get_vp(vcpu, val); + else + r = -ENXIO; + break; +#endif /* CONFIG_KVM_XIVE */ case KVM_REG_PPC_FSCR: *val = get_reg_val(id, vcpu->arch.fscr); break; @@ -714,6 +726,18 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, r = kvmppc_xics_set_icp(vcpu, set_reg_val(id, *val)); break; #endif /* CONFIG_KVM_XICS */ +#ifdef CONFIG_KVM_XIVE + case KVM_REG_PPC_VP_STATE: + if (!vcpu->arch.xive_vcpu) { + r = -ENXIO; + break; + } + if (xive_enabled()) + r = kvmppc_xive_native_set_vp(vcpu, val); + else + r = -ENXIO; + break; +#endif /* CONFIG_KVM_XIVE */ case KVM_REG_PPC_FSCR: vcpu->arch.fscr = set_reg_val(id, *val); break; diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index f4eb71eafc57..1aefb366df0b 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -424,6 +424,84 @@ static int 
xive_native_validate_queue_size(u32 qsize) } } +#define TM_I
[PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
Clear the ESB pages from the VMA of the IRQ being pass through to the guest and let the fault handler repopulate the VMA when the ESB pages are accessed for an EOI or for a trigger. Storing the VMA under the KVM XIVE device is a little ugly. Signed-off-by: Cédric Le Goater --- arch/powerpc/kvm/book3s_xive.h| 8 +++ arch/powerpc/kvm/book3s_xive.c| 15 ++ arch/powerpc/kvm/book3s_xive_native.c | 30 +++ 3 files changed, 53 insertions(+) diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index 31e598e62589..6e64d3496a2c 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -90,6 +90,11 @@ struct kvmppc_xive_src_block { struct kvmppc_xive_irq_state irq_state[KVMPPC_XICS_IRQ_PER_ICS]; }; +struct kvmppc_xive; + +struct kvmppc_xive_ops { + int (*reset_mapped)(struct kvm *kvm, unsigned long guest_irq); +}; struct kvmppc_xive { struct kvm *kvm; @@ -131,6 +136,9 @@ struct kvmppc_xive { /* VC base address for ESBs */ u64 vc_base; + + struct kvmppc_xive_ops *ops; + struct vm_area_struct *vma; }; #define KVMPPC_XIVE_Q_COUNT8 diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index e9f05d9c9ad5..9b4751713554 100644 --- a/arch/powerpc/kvm/book3s_xive.c +++ b/arch/powerpc/kvm/book3s_xive.c @@ -946,6 +946,13 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq, /* Turn the IPI hard off */ xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01); + /* +* Reset ESB guest mapping. Needed when ESB pages are exposed +* to the guest in XIVE native mode +*/ + if (xive->ops && xive->ops->reset_mapped) + xive->ops->reset_mapped(kvm, guest_irq); + /* Grab info about irq */ state->pt_number = hw_irq; state->pt_data = irq_data_get_irq_handler_data(host_data); @@ -1031,6 +1038,14 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq, state->pt_number = 0; state->pt_data = NULL; + /* +* Reset ESB guest mapping. 
Needed when ESB pages are exposed +* to the guest in XIVE native mode +*/ + if (xive->ops && xive->ops->reset_mapped) { + xive->ops->reset_mapped(kvm, guest_irq); + } + /* Reconfigure the IPI */ xive_native_configure_irq(state->ipi_number, xive_vp(xive, state->act_server), diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 1aefb366df0b..12edac29995e 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -240,6 +240,32 @@ static int kvmppc_xive_native_get_vc_base(struct kvmppc_xive *xive, u64 addr) return 0; } +static int kvmppc_xive_native_reset_mapped(struct kvm *kvm, unsigned long irq) +{ + struct kvmppc_xive *xive = kvm->arch.xive; + struct mm_struct *mm = kvm->mm; + struct vm_area_struct *vma = xive->vma; + unsigned long address; + + if (irq >= KVMPPC_XIVE_NR_IRQS) + return -EINVAL; + + pr_debug("clearing esb pages for girq 0x%lx\n", irq); + + down_read(&mm->mmap_sem); + /* TODO: can we clear the PTEs without keeping a VMA pointer ? 
*/ + if (vma) { + address = vma->vm_start + irq * (2ull << PAGE_SHIFT); + zap_vma_ptes(vma, address, 2ull << PAGE_SHIFT); + } + up_read(&mm->mmap_sem); + return 0; +} + +static struct kvmppc_xive_ops kvmppc_xive_native_ops = { + .reset_mapped = kvmppc_xive_native_reset_mapped, +}; + static int xive_native_esb_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; @@ -292,6 +318,8 @@ static const struct vm_operations_struct xive_native_esb_vmops = { static int xive_native_esb_mmap(struct file *file, struct vm_area_struct *vma) { + struct kvmppc_xive *xive = vma->vm_file->private_data; + /* There are two ESB pages (trigger and EOI) per IRQ */ if (vma_pages(vma) + vma->vm_pgoff > KVMPPC_XIVE_NR_IRQS * 2) return -EINVAL; @@ -299,6 +327,7 @@ static int xive_native_esb_mmap(struct file *file, struct vm_area_struct *vma) vma->vm_flags |= VM_IO | VM_PFNMAP; vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); vma->vm_ops = &xive_native_esb_vmops; + xive->vma = vma; /* TODO: get rid of the VMA pointer */ return 0; } @@ -992,6 +1021,7 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type) xive->vc_base = XIVE_VC_BASE; xive->single_escalation = xive_native_has_single_escalation(); + xive->ops = &kvmppc_xive_native_ops; if (ret) kfree(xive); -- 2.20.1
[PATCH 04/19] KVM: PPC: Book3S HV: export services for the XIVE native exploitation device
The KVM device for the XIVE native exploitation mode will reuse the structures of the XICS-over-XIVE glue implementation. Some code will also be shared : source block creation and destruction, target selection and escalation attachment. Signed-off-by: Cédric Le Goater --- arch/powerpc/kvm/book3s_xive.h | 11 + arch/powerpc/kvm/book3s_xive.c | 89 +++--- 2 files changed, 62 insertions(+), 38 deletions(-) diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index a08ae6fd4c51..10c4aa5cd010 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -248,5 +248,16 @@ extern int (*__xive_vm_h_ipi)(struct kvm_vcpu *vcpu, unsigned long server, extern int (*__xive_vm_h_cppr)(struct kvm_vcpu *vcpu, unsigned long cppr); extern int (*__xive_vm_h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr); +/* + * Common Xive routines for XICS-over-XIVE and XIVE native + */ +struct kvmppc_xive_src_block *kvmppc_xive_create_src_block( + struct kvmppc_xive *xive, int irq); +void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb); +int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio); +void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu); +int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio); +int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu); + #endif /* CONFIG_KVM_XICS */ #endif /* _KVM_PPC_BOOK3S_XICS_H */ diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index 8a4fa45f07f8..bb5d32f7e4e6 100644 --- a/arch/powerpc/kvm/book3s_xive.c +++ b/arch/powerpc/kvm/book3s_xive.c @@ -166,7 +166,7 @@ static irqreturn_t xive_esc_irq(int irq, void *data) return IRQ_HANDLED; } -static int xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio) +int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio) { struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu; struct xive_q *q = &xc->queues[prio]; @@ -291,7 +291,7 @@ static int xive_check_provisioning(struct 
kvm *kvm, u8 prio) continue; rc = xive_provision_queue(vcpu, prio); if (rc == 0 && !xive->single_escalation) - xive_attach_escalation(vcpu, prio); + kvmppc_xive_attach_escalation(vcpu, prio); if (rc) return rc; } @@ -342,7 +342,7 @@ static int xive_try_pick_queue(struct kvm_vcpu *vcpu, u8 prio) return atomic_add_unless(&q->count, 1, max) ? 0 : -EBUSY; } -static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio) +int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio) { struct kvm_vcpu *vcpu; int i, rc; @@ -535,7 +535,7 @@ static int xive_target_interrupt(struct kvm *kvm, * priority. The count for that new target will have * already been incremented. */ - rc = xive_select_target(kvm, &server, prio); + rc = kvmppc_xive_select_target(kvm, &server, prio); /* * We failed to find a target ? Not much we can do @@ -1055,7 +1055,7 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq, } EXPORT_SYMBOL_GPL(kvmppc_xive_clr_mapped); -static void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu) +void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu) { struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu; struct kvm *kvm = vcpu->kvm; @@ -1225,7 +1225,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev, if (xive->qmap & (1 << i)) { r = xive_provision_queue(vcpu, i); if (r == 0 && !xive->single_escalation) - xive_attach_escalation(vcpu, i); + kvmppc_xive_attach_escalation(vcpu, i); if (r) goto bail; } else { @@ -1240,7 +1240,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev, } /* If not done above, attach priority 0 escalation */ - r = xive_attach_escalation(vcpu, 0); + r = kvmppc_xive_attach_escalation(vcpu, 0); if (r) goto bail; @@ -1491,8 +1491,8 @@ static int xive_get_source(struct kvmppc_xive *xive, long irq, u64 addr) return 0; } -static struct kvmppc_xive_src_block *xive_create_src_block(struct kvmppc_xive *xive, - int irq) +struct kvmppc_xive_src_block *kvmppc_xive_create_src_block( + struct kvmppc_xive 
*xive, int irq) { struct kvm *kvm = xive->kvm; struct kvmppc_xive_src_block *sb; @@ -1571,7 +1571,7 @@ static int xive_set_source(struct kvmppc_xive *xive, long irq, u64 addr) sb = kvmppc_xive_find_source(xive, irq, &idx); if (!sb) { pr_devel("No source, creating source block...\n"); -
[PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration
These are used to capture the XIVE EAS table of the KVM device, the configuration of the source targets.

Signed-off-by: Cédric Le Goater
---
 arch/powerpc/include/uapi/asm/kvm.h   | 11
 arch/powerpc/kvm/book3s_xive_native.c | 87 +++
 2 files changed, 98 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 1a8740629acf..faf024f39858 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
 #define KVM_DEV_XIVE_SAVE_EQ_PAGES	4

 #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
 #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
+#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */

 /* Layout of 64-bit XIVE source attribute values */
 #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
 #define KVM_XIVE_LEVEL_ASSERTED	(1ULL << 1)

+/* Layout of 64-bit eas attribute values */
+#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
+#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
+#define KVM_XIVE_EAS_SERVER_SHIFT	3
+#define KVM_XIVE_EAS_SERVER_MASK	0xfff8ULL
+#define KVM_XIVE_EAS_MASK_SHIFT	32
+#define KVM_XIVE_EAS_MASK_MASK	0x1ULL
+#define KVM_XIVE_EAS_EISN_SHIFT	33
+#define KVM_XIVE_EAS_EISN_MASK	0xfffeULL
+
 #endif /* __LINUX_KVM_POWERPC_H */

diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index f2de1bcf3b35..0468b605baa7 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
 	return 0;
 }

+static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
+				      u64 addr)
+{
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	u64 __user *ubufp = (u64 __user *) addr;
+	u16 src;
+	u64 kvm_eas;
+	u32 server;
+	u8 priority;
+	u32 eisn;
+
+	sb = kvmppc_xive_find_source(xive, irq, &src);
+	if (!sb)
+		return -ENOENT;
+
+	state = &sb->irq_state[src];
+
+	if (!state->valid)
+		return -EINVAL;
+
+	if (get_user(kvm_eas, ubufp))
+		return -EFAULT;
+
+	pr_devel("%s irq=0x%lx eas=%016llx\n", __func__, irq, kvm_eas);
+
+	priority = (kvm_eas & KVM_XIVE_EAS_PRIORITY_MASK) >>
+		KVM_XIVE_EAS_PRIORITY_SHIFT;
+	server = (kvm_eas & KVM_XIVE_EAS_SERVER_MASK) >>
+		KVM_XIVE_EAS_SERVER_SHIFT;
+	eisn = (kvm_eas & KVM_XIVE_EAS_EISN_MASK) >> KVM_XIVE_EAS_EISN_SHIFT;
+
+	if (priority != xive_prio_from_guest(priority)) {
+		pr_err("invalid priority for queue %d for VCPU %d\n",
+		       priority, server);
+		return -EINVAL;
+	}
+
+	return kvmppc_xive_native_set_source_config(xive, sb, state, server,
+						    priority, eisn);
+}
+
+static int kvmppc_xive_native_get_eas(struct kvmppc_xive *xive, long irq,
+				      u64 addr)
+{
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	u64 __user *ubufp = (u64 __user *) addr;
+	u16 src;
+	u64 kvm_eas;
+
+	sb = kvmppc_xive_find_source(xive, irq, &src);
+	if (!sb)
+		return -ENOENT;
+
+	state = &sb->irq_state[src];
+
+	if (!state->valid)
+		return -EINVAL;
+
+	arch_spin_lock(&sb->lock);
+
+	if (state->act_priority == MASKED)
+		kvm_eas = KVM_XIVE_EAS_MASK_MASK;
+	else {
+		kvm_eas = (state->act_priority << KVM_XIVE_EAS_PRIORITY_SHIFT) &
+			KVM_XIVE_EAS_PRIORITY_MASK;
+		kvm_eas |= (state->act_server << KVM_XIVE_EAS_SERVER_SHIFT) &
+			KVM_XIVE_EAS_SERVER_MASK;
+		kvm_eas |= ((u64) state->eisn << KVM_XIVE_EAS_EISN_SHIFT) &
+			KVM_XIVE_EAS_EISN_MASK;
+	}
+	arch_spin_unlock(&sb->lock);
+
+	pr_devel("%s irq=0x%lx eas=%016llx\n", __func__, irq, kvm_eas);
+
+	if (put_user(kvm_eas, ubufp))
+		return -EFAULT;
+
+	return 0;
+}
+
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 				       struct kvm_device_attr *attr)
 {
@@ -544,6 +626,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 					       attr->addr);
 	case KVM_DEV_XIVE_GRP_SYNC:
 		return kvmppc_xive_native_sync(xive, attr->attr, attr->addr);
+	case KVM_DEV_XIVE_GRP_EAS:
+		return kvmppc_xive_native_set_eas(xive, attr->attr, attr->addr);
 	}

 	return -ENXI
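As a sketch of how a 64-bit EAS attribute value is packed and unpacked, the snippet below mirrors the shifts defined in the patch (0, 3, 32, 33). The mask constants and helper names here are illustrative assumptions — the hex values in this excerpt are truncated — so this is not the authoritative uapi layout:

```c
#include <assert.h>
#include <stdint.h>

/* Shifts as in the patch; the masks below are assumed reconstructions. */
#define EAS_PRIORITY_SHIFT	0
#define EAS_PRIORITY_MASK	0x7ULL
#define EAS_SERVER_SHIFT	3
#define EAS_SERVER_MASK		0xfffffff8ULL		/* assumption */
#define EAS_MASK_SHIFT		32
#define EAS_MASK_MASK		(1ULL << EAS_MASK_SHIFT)	/* assumption */
#define EAS_EISN_SHIFT		33
#define EAS_EISN_MASK		0xfffffffe00000000ULL	/* assumption */

/* Pack server/priority/eisn the way set_eas() expects to decode them. */
static uint64_t eas_pack(uint8_t priority, uint32_t server, uint32_t eisn)
{
	uint64_t v = 0;

	v |= ((uint64_t)priority << EAS_PRIORITY_SHIFT) & EAS_PRIORITY_MASK;
	v |= ((uint64_t)server << EAS_SERVER_SHIFT) & EAS_SERVER_MASK;
	v |= ((uint64_t)eisn << EAS_EISN_SHIFT) & EAS_EISN_MASK;
	return v;
}

/* The decode steps used by kvmppc_xive_native_set_eas() */
static uint8_t eas_priority(uint64_t v)
{
	return (v & EAS_PRIORITY_MASK) >> EAS_PRIORITY_SHIFT;
}

static uint32_t eas_server(uint64_t v)
{
	return (v & EAS_SERVER_MASK) >> EAS_SERVER_SHIFT;
}

static uint32_t eas_eisn(uint64_t v)
{
	return (v & EAS_EISN_MASK) >> EAS_EISN_SHIFT;
}

static int eas_is_masked(uint64_t v)
{
	return !!(v & EAS_MASK_MASK);
}
```

A round trip through these helpers shows why the accessors can transport the whole source configuration in a single u64 read or written with get_user()/put_user().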
[PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode
This is the basic framework for the new KVM device supporting the XIVE native exploitation mode. The user interface exposes a new capability and a new KVM device to be used by QEMU. Internally, the interface to the new KVM device is protected with a new interrupt mode: KVMPPC_IRQ_XIVE. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/kvm_host.h | 2 + arch/powerpc/include/asm/kvm_ppc.h| 21 ++ arch/powerpc/kvm/book3s_xive.h| 3 + include/uapi/linux/kvm.h | 3 + arch/powerpc/kvm/book3s.c | 7 +- arch/powerpc/kvm/book3s_xive_native.c | 332 ++ arch/powerpc/kvm/powerpc.c| 30 +++ arch/powerpc/kvm/Makefile | 2 +- 8 files changed, 398 insertions(+), 2 deletions(-) create mode 100644 arch/powerpc/kvm/book3s_xive_native.c diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 0f98f00da2ea..c522e8274ad9 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -220,6 +220,7 @@ extern struct kvm_device_ops kvm_xics_ops; struct kvmppc_xive; struct kvmppc_xive_vcpu; extern struct kvm_device_ops kvm_xive_ops; +extern struct kvm_device_ops kvm_xive_native_ops; struct kvmppc_passthru_irqmap; @@ -446,6 +447,7 @@ struct kvmppc_passthru_irqmap { #define KVMPPC_IRQ_DEFAULT 0 #define KVMPPC_IRQ_MPIC1 #define KVMPPC_IRQ_XICS2 /* Includes a XIVE option */ +#define KVMPPC_IRQ_XIVE3 /* XIVE native exploitation mode */ #define MMIO_HPTE_CACHE_SIZE 4 diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index eb0d79f0ca45..1bb313f238fe 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -591,6 +591,18 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval); extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level, bool line_status); extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu); + +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu) +{ + return vcpu->arch.irq_type == 
KVMPPC_IRQ_XIVE; +} + +extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev, + struct kvm_vcpu *vcpu, u32 cpu); +extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu); +extern void kvmppc_xive_native_init_module(void); +extern void kvmppc_xive_native_exit_module(void); + #else static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server, u32 priority) { return -1; } @@ -614,6 +626,15 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) { retur static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq, int level, bool line_status) { return -ENODEV; } static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { } + +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu) + { return 0; } +static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev, + struct kvm_vcpu *vcpu, u32 cpu) { return -EBUSY; } +static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { } +static inline void kvmppc_xive_native_init_module(void) { } +static inline void kvmppc_xive_native_exit_module(void) { } + #endif /* CONFIG_KVM_XIVE */ /* diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h index 10c4aa5cd010..5f22415520b4 100644 --- a/arch/powerpc/kvm/book3s_xive.h +++ b/arch/powerpc/kvm/book3s_xive.h @@ -12,6 +12,9 @@ #ifdef CONFIG_KVM_XICS #include "book3s_xics.h" +#define KVMPPC_XIVE_FIRST_IRQ 0 +#define KVMPPC_XIVE_NR_IRQSKVMPPC_XICS_NR_IRQS + /* * State for one guest irq source. 
* diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 6d4ea4b6c922..52bf74a1616e 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_ARM_VM_IPA_SIZE 165 #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166 #define KVM_CAP_HYPERV_CPUID 167 +#define KVM_CAP_PPC_IRQ_XIVE 168 #ifdef KVM_CAP_IRQ_ROUTING @@ -1211,6 +1212,8 @@ enum kvm_device_type { #define KVM_DEV_TYPE_ARM_VGIC_V3 KVM_DEV_TYPE_ARM_VGIC_V3 KVM_DEV_TYPE_ARM_VGIC_ITS, #define KVM_DEV_TYPE_ARM_VGIC_ITS KVM_DEV_TYPE_ARM_VGIC_ITS + KVM_DEV_TYPE_XIVE, +#define KVM_DEV_TYPE_XIVE KVM_DEV_TYPE_XIVE KVM_DEV_TYPE_MAX, }; diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c index bd1a677dd9e4..de7eed191107 100644 --- a/arch/powerpc/kvm/book3s.c +++ b/arch/powerpc/kvm/book3s.c @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void) #ifdef CONFIG_KVM_XIVE if (xive_enabled()) { kvmppc_xive_init_module(); +
[PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
Hello,

On the POWER9 processor, the XIVE interrupt controller can control interrupt sources using MMIO to trigger events, to EOI or to turn off the sources. Priority management and interrupt acknowledgment are also controlled by MMIO in the CPU presenter subengine.

PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need special support from the hypervisor to do the same. This is called the XIVE native exploitation mode and today, it can be activated under the PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support and still offers the old interrupt mode interface using a XICS-over-XIVE glue which implements the XICS hcalls.

The following series is a proposal to add the same support under KVM. A new KVM device is introduced for the XIVE native exploitation mode. It reuses most of the XICS-over-XIVE glue implementation structures, which are internal to KVM, but has a completely different interface. A set of Hypervisor calls configures the sources and the event queues, and from there, all control is done by the guest through MMIOs.

These MMIO regions (ESB and TIMA) are exposed to guests in QEMU, similarly to VFIO, and the associated VMAs are populated dynamically with the appropriate pages using a fault handler. This is implemented with a couple of KVM device ioctls.

On a POWER9 sPAPR machine, the Client Architecture Support (CAS) negotiation process determines whether the guest operates with an interrupt controller using the XICS legacy model, as found on POWER8, or in XIVE exploitation mode. This means that the KVM interrupt device should be created at runtime, after the machine has started. This requires extra KVM support to create/destroy KVM devices. The last patches are an attempt to solve that problem.

Migration has its own specific needs. The patchset provides the necessary routines to quiesce XIVE, to capture and restore the state of the different structures used by KVM, OPAL and HW. Extra OPAL support is required for these.

GitHub trees available here : QEMU sPAPR: https://github.com/legoater/qemu/commits/xive-next Linux/KVM: https://github.com/legoater/linux/commits/xive-5.0 OPAL: https://github.com/legoater/skiboot/commits/xive Best wishes for 2019 ! C. Cédric Le Goater (19): powerpc/xive: export flags for the XIVE native exploitation mode hcalls powerpc/xive: add OPAL extensions for the XIVE native exploitation support KVM: PPC: Book3S HV: check the IRQ controller type KVM: PPC: Book3S HV: export services for the XIVE native exploitation device KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device KVM: PPC: Book3S HV: add a GET_TIMA_FD control to XIVE native device KVM: PPC: Book3S HV: add a VC_BASE control to the XIVE native device KVM: PPC: Book3S HV: add a SET_SOURCE control to the XIVE native device KVM: PPC: Book3S HV: add a EISN attribute to kvmppc_xive_irq_state KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls KVM: PPC: Book3S HV: record guest queue page address KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty KVM: PPC: Book3S HV: add get/set accessors for the source configuration KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state KVM: PPC: Book3S HV: add passthrough support KVM: introduce a KVM_DELETE_DEVICE ioctl arch/powerpc/include/asm/kvm_host.h |2 + arch/powerpc/include/asm/kvm_ppc.h| 69 + arch/powerpc/include/asm/opal-api.h | 11 +- arch/powerpc/include/asm/opal.h |7 + arch/powerpc/include/asm/xive.h | 40 + arch/powerpc/include/uapi/asm/kvm.h | 47 + arch/powerpc/kvm/book3s_xive.h| 82 + include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |5 + arch/powerpc/kvm/book3s.c | 31 +- arch/powerpc/kvm/book3s_hv.c | 29 + arch/powerpc/kvm/book3s_hv_builtin.c | 196 +++ 
arch/powerpc/kvm/book3s_hv_rm_xive_native.c | 47 + arch/powerpc/kvm/book3s_xive.c| 149 +- arch/powerpc/kvm/book3s_xive_native.c | 1406 + .../powerpc/kvm/book3s_xive_native_template.c | 398 + arch/powerpc/kvm/powerpc.c| 30 + arch/powerpc/sysdev/xive/native.c | 110 ++ arch/powerpc/sysdev/xive/spapr.c | 28 +- virt/kvm/kvm_main.c | 39 + arch/powerpc/kvm/Makefile |4 +- arch/powerpc/kvm/book3s_hv_rmhandlers.S | 52 + .../powerpc/platforms/powernv/opal-wrappers.S |3 + 23 files changed, 2722 insertions(+), 65 deletions(-) create mode 100644
[PATCH 14/19] KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty
When the VM is stopped in a migration sequence, the sources are masked and the XIVE IC is synced to stabilize the EQs. When done, the KVM ioctl KVM_DEV_XIVE_SAVE_EQ_PAGES is called to mark dirty the EQ pages. The migration can then transfer the remaining dirty pages to the destination and start collecting the state of the devices. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/uapi/asm/kvm.h | 1 + arch/powerpc/kvm/book3s_xive_native.c | 40 +++ 2 files changed, 41 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index f3b859223b80..1a8740629acf 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -680,6 +680,7 @@ struct kvm_ppc_cpu_char { #define KVM_DEV_XIVE_GET_ESB_FD 1 #define KVM_DEV_XIVE_GET_TIMA_FD 2 #define KVM_DEV_XIVE_VC_BASE 3 +#define KVM_DEV_XIVE_SAVE_EQ_PAGES 4 #define KVM_DEV_XIVE_GRP_SOURCES 2 /* 64-bit source attributes */ #define KVM_DEV_XIVE_GRP_SYNC 3 /* 64-bit source attributes */ diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index a8052867afc1..f2de1bcf3b35 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -373,6 +373,43 @@ static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr) return put_user(ret, ubufp); } +static int kvmppc_xive_native_vcpu_save_eq_pages(struct kvm_vcpu *vcpu) +{ + struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu; + unsigned int prio; + + if (!xc) + return -ENOENT; + + for (prio = 0; prio < KVMPPC_XIVE_Q_COUNT; prio++) { + struct xive_q *q = &xc->queues[prio]; + + if (!q->qpage) + continue; + + /* Mark EQ page dirty for migration */ + mark_page_dirty(vcpu->kvm, gpa_to_gfn(q->guest_qpage)); + } + return 0; +} + +static int kvmppc_xive_native_save_eq_pages(struct kvmppc_xive *xive) +{ + struct kvm *kvm = xive->kvm; + struct kvm_vcpu *vcpu; + unsigned int i; + + pr_devel("%s\n", __func__); + + 
mutex_lock(&kvm->lock); + kvm_for_each_vcpu(i, vcpu, kvm) { + kvmppc_xive_native_vcpu_save_eq_pages(vcpu); + } + mutex_unlock(&kvm->lock); + + return 0; +} + static int xive_native_validate_queue_size(u32 qsize) { switch (qsize) { @@ -498,6 +535,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev, switch (attr->attr) { case KVM_DEV_XIVE_VC_BASE: return kvmppc_xive_native_set_vc_base(xive, attr->addr); + case KVM_DEV_XIVE_SAVE_EQ_PAGES: + return kvmppc_xive_native_save_eq_pages(xive); } break; case KVM_DEV_XIVE_GRP_SOURCES: @@ -538,6 +577,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev, case KVM_DEV_XIVE_GET_ESB_FD: case KVM_DEV_XIVE_GET_TIMA_FD: case KVM_DEV_XIVE_VC_BASE: + case KVM_DEV_XIVE_SAVE_EQ_PAGES: return 0; } break; -- 2.20.1
[PATCH 01/19] powerpc/xive: export flags for the XIVE native exploitation mode hcalls
These flags are shared between Linux/KVM implementing the hypervisor calls for the XIVE native exploitation mode and the driver for the sPAPR guests. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/xive.h | 23 +++ arch/powerpc/sysdev/xive/spapr.c | 28 2 files changed, 31 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h index 3c704f5dd3ae..32f033bfbf42 100644 --- a/arch/powerpc/include/asm/xive.h +++ b/arch/powerpc/include/asm/xive.h @@ -93,6 +93,29 @@ extern void xive_flush_interrupt(void); /* xmon hook */ extern void xmon_xive_do_dump(int cpu); +/* + * Hcall flags shared by the sPAPR backend and KVM + */ + +/* H_INT_GET_SOURCE_INFO */ +#define XIVE_SPAPR_SRC_H_INT_ESB PPC_BIT(60) +#define XIVE_SPAPR_SRC_LSI PPC_BIT(61) +#define XIVE_SPAPR_SRC_TRIGGER PPC_BIT(62) +#define XIVE_SPAPR_SRC_STORE_EOI PPC_BIT(63) + +/* H_INT_SET_SOURCE_CONFIG */ +#define XIVE_SPAPR_SRC_SET_EISNPPC_BIT(62) +#define XIVE_SPAPR_SRC_MASKPPC_BIT(63) /* unused */ + +/* H_INT_SET_QUEUE_CONFIG */ +#define XIVE_SPAPR_EQ_ALWAYS_NOTIFYPPC_BIT(63) + +/* H_INT_SET_QUEUE_CONFIG */ +#define XIVE_SPAPR_EQ_DEBUGPPC_BIT(63) + +/* H_INT_ESB */ +#define XIVE_SPAPR_ESB_STORE PPC_BIT(63) + /* APIs used by KVM */ extern u32 xive_native_default_eq_shift(void); extern u32 xive_native_alloc_vp_block(u32 max_vcpus); diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c index 575db3b06a6b..730284f838c8 100644 --- a/arch/powerpc/sysdev/xive/spapr.c +++ b/arch/powerpc/sysdev/xive/spapr.c @@ -184,9 +184,6 @@ static long plpar_int_get_source_info(unsigned long flags, return 0; } -#define XIVE_SRC_SET_EISN (1ull << (63 - 62)) -#define XIVE_SRC_MASK (1ull << (63 - 63)) /* unused */ - static long plpar_int_set_source_config(unsigned long flags, unsigned long lisn, unsigned long target, @@ -243,8 +240,6 @@ static long plpar_int_get_queue_info(unsigned long flags, return 0; } -#define XIVE_EQ_ALWAYS_NOTIFY (1ull << (63 - 
63)) - static long plpar_int_set_queue_config(unsigned long flags, unsigned long target, unsigned long priority, @@ -286,8 +281,6 @@ static long plpar_int_sync(unsigned long flags, unsigned long lisn) return 0; } -#define XIVE_ESB_FLAG_STORE (1ull << (63 - 63)) - static long plpar_int_esb(unsigned long flags, unsigned long lisn, unsigned long offset, @@ -321,7 +314,7 @@ static u64 xive_spapr_esb_rw(u32 lisn, u32 offset, u64 data, bool write) unsigned long read_data; long rc; - rc = plpar_int_esb(write ? XIVE_ESB_FLAG_STORE : 0, + rc = plpar_int_esb(write ? XIVE_SPAPR_ESB_STORE : 0, lisn, offset, data, &read_data); if (rc) return -1; @@ -329,11 +322,6 @@ static u64 xive_spapr_esb_rw(u32 lisn, u32 offset, u64 data, bool write) return write ? 0 : read_data; } -#define XIVE_SRC_H_INT_ESB (1ull << (63 - 60)) -#define XIVE_SRC_LSI (1ull << (63 - 61)) -#define XIVE_SRC_TRIGGER (1ull << (63 - 62)) -#define XIVE_SRC_STORE_EOI (1ull << (63 - 63)) - static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data) { long rc; @@ -349,11 +337,11 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data) if (rc) return -EINVAL; - if (flags & XIVE_SRC_H_INT_ESB) + if (flags & XIVE_SPAPR_SRC_H_INT_ESB) data->flags |= XIVE_IRQ_FLAG_H_INT_ESB; - if (flags & XIVE_SRC_STORE_EOI) + if (flags & XIVE_SPAPR_SRC_STORE_EOI) data->flags |= XIVE_IRQ_FLAG_STORE_EOI; - if (flags & XIVE_SRC_LSI) + if (flags & XIVE_SPAPR_SRC_LSI) data->flags |= XIVE_IRQ_FLAG_LSI; data->eoi_page = eoi_page; data->esb_shift = esb_shift; @@ -374,7 +362,7 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data) data->hw_irq = hw_irq; /* Full function page supports trigger */ - if (flags & XIVE_SRC_TRIGGER) { + if (flags & XIVE_SPAPR_SRC_TRIGGER) { data->trig_mmio = data->eoi_mmio; return 0; } @@ -391,8 +379,8 @@ static int xive_spapr_configure_irq(u32 hw_irq, u32 target, u8 prio, u32 sw_irq) { long rc; - rc = 
plpar_int_set_source_config(XIVE_SRC_SET_EISN, hw_irq, target, -prio, sw_irq); + rc = plpar_int_set_source_config(XIVE_SPAPR_SRC_SET_EISN, hw_irq, +target, prio, sw_irq); return rc == 0 ? 0 : -ENXIO; }
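The shared flags above are expressed with the kernel's PPC_BIT() macro, which uses IBM bit numbering (bit 0 is the most-significant bit of a 64-bit word) — exactly equivalent to the removed open-coded `(1ull << (63 - n))` constants. A minimal userspace sketch of that equivalence, using a local stand-in for the macro:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Local stand-in for the kernel's PPC_BIT(): IBM/Power numbering counts
 * bit 0 as the most-significant bit of a 64-bit word, so PPC_BIT(63)
 * is the least-significant bit.
 */
#define PPC_BIT(bit)	(1ULL << (63 - (bit)))

/* The H_INT_GET_SOURCE_INFO flags from the patch, in both notations */
#define XIVE_SPAPR_SRC_H_INT_ESB	PPC_BIT(60)	/* == 1ull << 3 */
#define XIVE_SPAPR_SRC_LSI		PPC_BIT(61)	/* == 1ull << 2 */
#define XIVE_SPAPR_SRC_TRIGGER		PPC_BIT(62)	/* == 1ull << 1 */
#define XIVE_SPAPR_SRC_STORE_EOI	PPC_BIT(63)	/* == 1ull << 0 */
```

This is why the conversion in the diff is purely mechanical: each `(1ull << (63 - n))` becomes `PPC_BIT(n)` with the same value.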
[PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type
We will have different KVM devices for interrupts, one for the XICS-over-XIVE mode and one for the XIVE native exploitation mode. Let's add some checks to make sure we are not mixing the interfaces in KVM. Signed-off-by: Cédric Le Goater --- arch/powerpc/kvm/book3s_xive.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c index f78d002f0fe0..8a4fa45f07f8 100644 --- a/arch/powerpc/kvm/book3s_xive.c +++ b/arch/powerpc/kvm/book3s_xive.c @@ -819,6 +819,9 @@ u64 kvmppc_xive_get_icp(struct kvm_vcpu *vcpu) { struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu; + if (!kvmppc_xics_enabled(vcpu)) + return -EPERM; + if (!xc) return 0; @@ -835,6 +838,9 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) u8 cppr, mfrr; u32 xisr; + if (!kvmppc_xics_enabled(vcpu)) + return -EPERM; + if (!xc || !xive) return -ENOENT; -- 2.20.1
[PATCH 12/19] KVM: PPC: Book3S HV: record guest queue page address
The guest physical address of the event queue will be part of the state to transfer in the migration. Cache its value when the queue is configured, it will save us an OPAL call. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/xive.h | 2 ++ arch/powerpc/kvm/book3s_xive_native.c | 4 2 files changed, 6 insertions(+) diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h index 7a7aa22d8258..e90c3c5d9533 100644 --- a/arch/powerpc/include/asm/xive.h +++ b/arch/powerpc/include/asm/xive.h @@ -74,6 +74,8 @@ struct xive_q { u32 esc_irq; atomic_tcount; atomic_tpending_count; + u64 guest_qpage; + u32 guest_qsize; }; /* Global enable flags for the XIVE support */ diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 35d806740c3a..4ca75aade069 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -708,6 +708,10 @@ static int kvmppc_h_int_set_queue_config(struct kvm_vcpu *vcpu, } qaddr = page_to_virt(page) + (qpage & ~PAGE_MASK); + /* Backup queue page address and size for migration */ + q->guest_qpage = qpage; + q->guest_qsize = qsize; + rc = xive_native_configure_queue(xc->vp_id, q, priority, (__be32 *) qaddr, qsize, true); if (rc) { -- 2.20.1
[PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
This will let the guest create a memory mapping to expose the ESB MMIO regions used to control the interrupt sources, to trigger events, to EOI or to turn off the sources. Signed-off-by: Cédric Le Goater --- arch/powerpc/include/uapi/asm/kvm.h | 4 ++ arch/powerpc/kvm/book3s_xive_native.c | 97 +++ 2 files changed, 101 insertions(+) diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 8c876c166ef2..6bb61ba141c2 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char { #define KVM_XICS_PRESENTED(1ULL << 43) #define KVM_XICS_QUEUED (1ULL << 44) +/* POWER9 XIVE Native Interrupt Controller */ +#define KVM_DEV_XIVE_GRP_CTRL 1 +#define KVM_DEV_XIVE_GET_ESB_FD 1 + #endif /* __LINUX_KVM_POWERPC_H */ diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c index 115143e76c45..e20081f0c8d4 100644 --- a/arch/powerpc/kvm/book3s_xive_native.c +++ b/arch/powerpc/kvm/book3s_xive_native.c @@ -153,6 +153,85 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev, return rc; } +static int xive_native_esb_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct kvmppc_xive *xive = vma->vm_file->private_data; + struct kvmppc_xive_src_block *sb; + struct kvmppc_xive_irq_state *state; + struct xive_irq_data *xd; + u32 hw_num; + u16 src; + u64 page; + unsigned long irq; + + /* +* Linux/KVM uses a two pages ESB setting, one for trigger and +* one for EOI +*/ + irq = vmf->pgoff / 2; + + sb = kvmppc_xive_find_source(xive, irq, &src); + if (!sb) { + pr_err("%s: source %lx not found !\n", __func__, irq); + return VM_FAULT_SIGBUS; + } + + state = &sb->irq_state[src]; + kvmppc_xive_select_irq(state, &hw_num, &xd); + + arch_spin_lock(&sb->lock); + + /* +* first/even page is for trigger +* second/odd page is for EOI and management. +*/ + page = vmf->pgoff % 2 ? 
xd->eoi_page : xd->trig_page; + arch_spin_unlock(&sb->lock); + + if (!page) { + pr_err("%s: acessing invalid ESB page for source %lx !\n", + __func__, irq); + return VM_FAULT_SIGBUS; + } + + vmf_insert_pfn(vma, vmf->address, page >> PAGE_SHIFT); + return VM_FAULT_NOPAGE; +} + +static const struct vm_operations_struct xive_native_esb_vmops = { + .fault = xive_native_esb_fault, +}; + +static int xive_native_esb_mmap(struct file *file, struct vm_area_struct *vma) +{ + /* There are two ESB pages (trigger and EOI) per IRQ */ + if (vma_pages(vma) + vma->vm_pgoff > KVMPPC_XIVE_NR_IRQS * 2) + return -EINVAL; + + vma->vm_flags |= VM_IO | VM_PFNMAP; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_ops = &xive_native_esb_vmops; + return 0; +} + +static const struct file_operations xive_native_esb_fops = { + .mmap = xive_native_esb_mmap, +}; + +static int kvmppc_xive_native_get_esb_fd(struct kvmppc_xive *xive, u64 addr) +{ + u64 __user *ubufp = (u64 __user *) addr; + int ret; + + ret = anon_inode_getfd("[xive-esb]", &xive_native_esb_fops, xive, + O_RDWR | O_CLOEXEC); + if (ret < 0) + return ret; + + return put_user(ret, ubufp); +} + static int kvmppc_xive_native_set_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { @@ -162,12 +241,30 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev, static int kvmppc_xive_native_get_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { + struct kvmppc_xive *xive = dev->private; + + switch (attr->group) { + case KVM_DEV_XIVE_GRP_CTRL: + switch (attr->attr) { + case KVM_DEV_XIVE_GET_ESB_FD: + return kvmppc_xive_native_get_esb_fd(xive, attr->addr); + } + break; + } return -ENXIO; } static int kvmppc_xive_native_has_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { + switch (attr->group) { + case KVM_DEV_XIVE_GRP_CTRL: + switch (attr->attr) { + case KVM_DEV_XIVE_GET_ESB_FD: + return 0; + } + break; + } return -ENXIO; } -- 2.20.1
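The file layout the fault handler above decodes — two ESB pages per IRQ, even page for trigger, odd page for EOI/management — can be sketched from the userspace side as pure offset arithmetic. The 64K page shift and the helper names are assumptions for illustration (the real page size comes from the source's esb_shift):

```c
#include <assert.h>
#include <stdint.h>

#define ESB_PAGE_SHIFT	16	/* assumption: 64K ESB pages, as on ppc64 */

/*
 * Userspace view: the mmap offset of the trigger (eoi == 0) or
 * EOI/management (eoi == 1) page of a given IRQ in the [xive-esb] fd.
 */
static uint64_t esb_mmap_offset(uint32_t irq, int eoi)
{
	return ((uint64_t)irq * 2 + (eoi ? 1 : 0)) << ESB_PAGE_SHIFT;
}

/* Inverse mapping, as computed in xive_native_esb_fault() from vmf->pgoff */
static uint32_t esb_irq_from_pgoff(uint64_t pgoff)
{
	return pgoff / 2;
}

static int esb_is_eoi_page(uint64_t pgoff)
{
	return pgoff % 2;
}
```

The bound check in xive_native_esb_mmap() (`vma_pages(vma) + vma->vm_pgoff > KVMPPC_XIVE_NR_IRQS * 2`) follows directly from this two-pages-per-IRQ layout.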
[PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls
The XIVE native exploitation mode specs define a set of Hypervisor calls to configure the sources and the event queues:

- H_INT_GET_SOURCE_INFO is used to obtain the address of the MMIO page of the Event State Buffer (PQ bits) entry associated with the source.

- H_INT_SET_SOURCE_CONFIG assigns a source to a "target".

- H_INT_GET_SOURCE_CONFIG determines which "target" and "priority" is assigned to a source.

- H_INT_GET_QUEUE_INFO returns the address of the notification management page associated with the specified "target" and "priority".

- H_INT_SET_QUEUE_CONFIG sets or resets the event queue for a given "target" and "priority". It is also used to set the notification configuration associated with the queue; only unconditional notification is supported for the moment. Reset is performed with a queue size of 0, and queueing is disabled in that case.

- H_INT_GET_QUEUE_CONFIG returns the queue settings for a given "target" and "priority".

- H_INT_RESET resets all of the guest's internal interrupt structures to their initial state, losing all configuration set via the hcalls H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.

- H_INT_SYNC issues a synchronisation on a source to make sure all notifications have reached their queue.
Calls that still need to be addressed : H_INT_SET_OS_REPORTING_LINE H_INT_GET_OS_REPORTING_LINE Signed-off-by: Cédric Le Goater --- arch/powerpc/include/asm/kvm_ppc.h| 43 ++ arch/powerpc/kvm/book3s_xive.h| 54 +++ arch/powerpc/kvm/book3s_hv.c | 29 ++ arch/powerpc/kvm/book3s_hv_builtin.c | 196 + arch/powerpc/kvm/book3s_hv_rm_xive_native.c | 47 +++ arch/powerpc/kvm/book3s_xive_native.c | 326 ++- .../powerpc/kvm/book3s_xive_native_template.c | 371 ++ arch/powerpc/kvm/Makefile | 2 + arch/powerpc/kvm/book3s_hv_rmhandlers.S | 52 +++ 9 files changed, 1118 insertions(+), 2 deletions(-) create mode 100644 arch/powerpc/kvm/book3s_hv_rm_xive_native.c diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 1bb313f238fe..4cc897039485 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -602,6 +602,7 @@ extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev, extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu); extern void kvmppc_xive_native_init_module(void); extern void kvmppc_xive_native_exit_module(void); +extern int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd); #else static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server, @@ -634,6 +635,8 @@ static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev, static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { } static inline void kvmppc_xive_native_init_module(void) { } static inline void kvmppc_xive_native_exit_module(void) { } +static inline int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd) + { return 0; } #endif /* CONFIG_KVM_XIVE */ @@ -682,6 +685,46 @@ int kvmppc_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr); int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr); void kvmppc_guest_entry_inject_int(struct kvm_vcpu *vcpu); +int kvmppc_rm_h_int_get_source_info(struct kvm_vcpu *vcpu, + unsigned long flag, + unsigned long lisn); +int 
kvmppc_rm_h_int_set_source_config(struct kvm_vcpu *vcpu, + unsigned long flag, + unsigned long lisn, + unsigned long target, + unsigned long priority, + unsigned long eisn); +int kvmppc_rm_h_int_get_source_config(struct kvm_vcpu *vcpu, + unsigned long flag, + unsigned long lisn); +int kvmppc_rm_h_int_get_queue_info(struct kvm_vcpu *vcpu, + unsigned long flag, + unsigned long target, + unsigned long priority); +int kvmppc_rm_h_int_set_queue_config(struct kvm_vcpu *vcpu, +unsigned long flag, +unsigned long target, +unsigned long priority, +unsigned long qpage, +unsigned long qsize); +int kvmppc_rm_h_int_get_queue_config(struct kvm_vcpu *vcpu, +unsigned long flag, +unsigned long target, +unsigned long priority); +int kvmppc_rm_h_int_set_os_reporting_line(struct kvm_vcpu *vcpu, +
Re: [PATCH v3] crypto: talitos - fix ablkcipher for CONFIG_VMAP_STACK
On 04/01/2019 at 16:24, Horia Geanta wrote:
On 1/4/2019 5:17 PM, Horia Geanta wrote:
On 12/21/2018 10:07 AM, Christophe Leroy wrote:
[snip]

IV cannot be on stack when CONFIG_VMAP_STACK is selected because the stack cannot be DMA mapped anymore.

This looks better, thanks.

This patch copies the IV into the extended descriptor when iv is not a valid linear address.

Though I am not sure the checks in place are enough.

Fixes: 4de9d0b547b9 ("crypto: talitos - Add ablkcipher algorithms")
Cc: sta...@vger.kernel.org
Signed-off-by: Christophe Leroy
---
v3: Using struct edesc buffer.
v2: Using per-request context.
[snip]
+	if (ivsize && !virt_addr_valid(iv))
+		alloc_len += ivsize;
[snip]
+	if (ivsize && !virt_addr_valid(iv))

A more precise condition would be (!is_vmalloc_addr || is_vmalloc_addr(iv))

Sorry for the typo, I meant: (!virt_addr_valid(iv) || is_vmalloc_addr(iv))

As far as I know, virt_addr_valid() means the address is in the linear memory space. So it cannot be a vmalloc_addr if it is a linear space addr, can it? At least, it is that way on powerpc, which is the arch embedding the talitos crypto engine. virt_addr_valid() means we are under max_pfn, while VMALLOC_START is above max_pfn.

Christophe

It matches the checks in the debug_dma_map_single() helper, though I am not sure they are enough to rule out all exceptions of the DMA API.
Re: [PATCH 4/11] KVM/MMU: Introduce tlb flush with range list
On 04/01/19 09:53, lantianyu1...@gmail.com wrote: > struct kvm_mmu_page { > struct list_head link; > + > + /* > + * Tlb flush with range list uses struct kvm_mmu_page as list entry > + * and all list operations should be under protection of mmu_lock. > + */ > + struct list_head flush_link; > struct hlist_node hash_link; > bool unsync; > > @@ -443,6 +449,7 @@ struct kvm_mmu { Again, it would be nice not to grow the struct too much, though I understand that it's already relatively big (168 bytes). Can you at least make this an hlist, so that it only takes a single word? Paolo
Re: [PATCH 3/11] KVM: Add spte's point in the struct kvm_mmu_page
On 04/01/19 09:53, lantianyu1...@gmail.com wrote: > @@ -332,6 +332,7 @@ struct kvm_mmu_page { > int root_count; /* Currently serving as active root */ > unsigned int unsync_children; > struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */ > + u64 *sptep; Is this really needed? Can we put the "last" flag in the struct instead as a bool? In fact, if you do u16 unsync_children; bool unsync; bool last_level; the struct does not grow at all. :) (I'm not sure where "large" is tested using the sptep field, even though it is in the commit message). Paolo > /* The page is obsolete if mmu_valid_gen != kvm->arch.mmu_valid_gen. */ > unsigned long mmu_valid_gen;
[PATCH v4 13/13] drivers/perf: use PERF_PMU_CAP_NO_EXCLUDE for Cavium TX2 PMU
The Cavium ThunderX2 UNCORE PMU driver doesn't support any event filtering. Let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability to simplify the code. Signed-off-by: Andrew Murray --- drivers/perf/thunderx2_pmu.c | 10 +- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/drivers/perf/thunderx2_pmu.c b/drivers/perf/thunderx2_pmu.c index c9a1701..43d76c8 100644 --- a/drivers/perf/thunderx2_pmu.c +++ b/drivers/perf/thunderx2_pmu.c @@ -424,15 +424,6 @@ static int tx2_uncore_event_init(struct perf_event *event) if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK) return -EINVAL; - /* We have no filtering of any kind */ - if (event->attr.exclude_user|| - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle|| - event->attr.exclude_host|| - event->attr.exclude_guest) - return -EINVAL; - if (event->cpu < 0) return -EINVAL; @@ -572,6 +563,7 @@ static int tx2_uncore_pmu_register( .start = tx2_uncore_event_start, .stop = tx2_uncore_event_stop, .read = tx2_uncore_event_read, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; tx2_pmu->pmu.name = devm_kasprintf(dev, GFP_KERNEL, -- 2.7.4
[PATCH v4 12/13] perf/core: remove unused perf_flags
Now that perf_flags is not used we remove it. Signed-off-by: Andrew Murray --- include/uapi/linux/perf_event.h | 2 -- tools/include/uapi/linux/perf_event.h | 2 -- 2 files changed, 4 deletions(-) diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index 9de8780..ea19b5d 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -445,8 +445,6 @@ struct perf_event_query_bpf { __u32 ids[0]; }; -#define perf_flags(attr) (*(&(attr)->read_format + 1)) - /* * Ioctls that can be done on a perf event fd: */ diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h index 9de8780..ea19b5d 100644 --- a/tools/include/uapi/linux/perf_event.h +++ b/tools/include/uapi/linux/perf_event.h @@ -445,8 +445,6 @@ struct perf_event_query_bpf { __u32 ids[0]; }; -#define perf_flags(attr) (*(&(attr)->read_format + 1)) - /* * Ioctls that can be done on a perf event fd: */ -- 2.7.4
[PATCH v4 11/13] x86: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
For x86 PMUs that do not support context exclusion let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability. This ensures that perf will prevent us from handling events where any exclusion flags are set. Let's also remove the now unnecessary check for exclusion flags. This change means that amd/iommu and amd/uncore will now also indicate that they do not support exclude_{hv|idle} and intel/uncore that it does not support exclude_{guest|host}. Signed-off-by: Andrew Murray --- arch/x86/events/amd/iommu.c| 6 +- arch/x86/events/amd/uncore.c | 7 ++- arch/x86/events/intel/uncore.c | 9 + 3 files changed, 4 insertions(+), 18 deletions(-) diff --git a/arch/x86/events/amd/iommu.c b/arch/x86/events/amd/iommu.c index 3210fee..7635c23 100644 --- a/arch/x86/events/amd/iommu.c +++ b/arch/x86/events/amd/iommu.c @@ -223,11 +223,6 @@ static int perf_iommu_event_init(struct perf_event *event) if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK) return -EINVAL; - /* IOMMU counters do not have usr/os/guest/host bits */ - if (event->attr.exclude_user || event->attr.exclude_kernel || - event->attr.exclude_host || event->attr.exclude_guest) - return -EINVAL; - if (event->cpu < 0) return -EINVAL; @@ -414,6 +409,7 @@ static const struct pmu iommu_pmu __initconst = { .read = perf_iommu_read, .task_ctx_nr= perf_invalid_context, .attr_groups= amd_iommu_attr_groups, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; static __init int init_one_iommu(unsigned int idx) diff --git a/arch/x86/events/amd/uncore.c b/arch/x86/events/amd/uncore.c index 398df6e..79cfd3b 100644 --- a/arch/x86/events/amd/uncore.c +++ b/arch/x86/events/amd/uncore.c @@ -201,11 +201,6 @@ static int amd_uncore_event_init(struct perf_event *event) if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK) return -EINVAL; - /* NB and Last level cache counters do not have usr/os/guest/host bits */ - if (event->attr.exclude_user || event->attr.exclude_kernel || - event->attr.exclude_host || 
event->attr.exclude_guest) - return -EINVAL; - /* and we do not enable counter overflow interrupts */ hwc->config = event->attr.config & AMD64_RAW_EVENT_MASK_NB; hwc->idx = -1; @@ -307,6 +302,7 @@ static struct pmu amd_nb_pmu = { .start = amd_uncore_start, .stop = amd_uncore_stop, .read = amd_uncore_read, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; static struct pmu amd_llc_pmu = { @@ -317,6 +313,7 @@ static struct pmu amd_llc_pmu = { .start = amd_uncore_start, .stop = amd_uncore_stop, .read = amd_uncore_read, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; static struct amd_uncore *amd_uncore_alloc(unsigned int cpu) diff --git a/arch/x86/events/intel/uncore.c b/arch/x86/events/intel/uncore.c index 27a4614..d516161 100644 --- a/arch/x86/events/intel/uncore.c +++ b/arch/x86/events/intel/uncore.c @@ -695,14 +695,6 @@ static int uncore_pmu_event_init(struct perf_event *event) if (pmu->func_id < 0) return -ENOENT; - /* -* Uncore PMU does measure at all privilege level all the time. -* So it doesn't make sense to specify any exclude bits. -*/ - if (event->attr.exclude_user || event->attr.exclude_kernel || - event->attr.exclude_hv || event->attr.exclude_idle) - return -EINVAL; - /* Sampling not supported yet */ if (hwc->sample_period) return -EINVAL; @@ -800,6 +792,7 @@ static int uncore_pmu_register(struct intel_uncore_pmu *pmu) .stop = uncore_pmu_event_stop, .read = uncore_pmu_event_read, .module = THIS_MODULE, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; } else { pmu->pmu = *pmu->type->pmu; -- 2.7.4
[PATCH v4 10/13] x86: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
For drivers that do not support context exclusion let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability. This ensures that perf will prevent us from handling events where any exclusion flags are set. Let's also remove the now unnecessary check for exclusion flags. Signed-off-by: Andrew Murray --- arch/x86/events/amd/ibs.c | 13 + arch/x86/events/amd/power.c| 10 ++ arch/x86/events/intel/cstate.c | 12 +++- arch/x86/events/intel/rapl.c | 9 ++--- arch/x86/events/intel/uncore_snb.c | 9 ++--- arch/x86/events/msr.c | 10 ++ 6 files changed, 12 insertions(+), 51 deletions(-) diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c index d50bb4d..62f317c 100644 --- a/arch/x86/events/amd/ibs.c +++ b/arch/x86/events/amd/ibs.c @@ -253,15 +253,6 @@ static int perf_ibs_precise_event(struct perf_event *event, u64 *config) return -EOPNOTSUPP; } -static const struct perf_event_attr ibs_notsupp = { - .exclude_user = 1, - .exclude_kernel = 1, - .exclude_hv = 1, - .exclude_idle = 1, - .exclude_host = 1, - .exclude_guest = 1, -}; - static int perf_ibs_init(struct perf_event *event) { struct hw_perf_event *hwc = &event->hw; @@ -282,9 +273,6 @@ static int perf_ibs_init(struct perf_event *event) if (event->pmu != &perf_ibs->pmu) return -ENOENT; - if (perf_flags(&event->attr) & perf_flags(&ibs_notsupp)) - return -EINVAL; - if (config & ~perf_ibs->config_mask) return -EINVAL; @@ -537,6 +525,7 @@ static struct perf_ibs perf_ibs_fetch = { .start = perf_ibs_start, .stop = perf_ibs_stop, .read = perf_ibs_read, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }, .msr= MSR_AMD64_IBSFETCHCTL, .config_mask= IBS_FETCH_CONFIG_MASK, diff --git a/arch/x86/events/amd/power.c b/arch/x86/events/amd/power.c index 2aefacf..c5ff084 100644 --- a/arch/x86/events/amd/power.c +++ b/arch/x86/events/amd/power.c @@ -136,14 +136,7 @@ static int pmu_event_init(struct perf_event *event) return -ENOENT; /* Unsupported modes and filters.
*/ - if (event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest || - /* no sampling */ - event->attr.sample_period) + if (event->attr.sample_period) return -EINVAL; if (cfg != AMD_POWER_EVENTSEL_PKG) @@ -226,6 +219,7 @@ static struct pmu pmu_class = { .start = pmu_event_start, .stop = pmu_event_stop, .read = pmu_event_read, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; static int power_cpu_exit(unsigned int cpu) diff --git a/arch/x86/events/intel/cstate.c b/arch/x86/events/intel/cstate.c index d2e7807..94a4b7f 100644 --- a/arch/x86/events/intel/cstate.c +++ b/arch/x86/events/intel/cstate.c @@ -280,13 +280,7 @@ static int cstate_pmu_event_init(struct perf_event *event) return -ENOENT; /* unsupported modes and filters */ - if (event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest || - event->attr.sample_period) /* no sampling */ + if (event->attr.sample_period) /* no sampling */ return -EINVAL; if (event->cpu < 0) @@ -437,7 +431,7 @@ static struct pmu cstate_core_pmu = { .start = cstate_pmu_event_start, .stop = cstate_pmu_event_stop, .read = cstate_pmu_event_update, - .capabilities = PERF_PMU_CAP_NO_INTERRUPT, + .capabilities = PERF_PMU_CAP_NO_INTERRUPT | PERF_PMU_CAP_NO_EXCLUDE, .module = THIS_MODULE, }; @@ -451,7 +445,7 @@ static struct pmu cstate_pkg_pmu = { .start = cstate_pmu_event_start, .stop = cstate_pmu_event_stop, .read = cstate_pmu_event_update, - .capabilities = PERF_PMU_CAP_NO_INTERRUPT, + .capabilities = PERF_PMU_CAP_NO_INTERRUPT | PERF_PMU_CAP_NO_EXCLUDE, .module = THIS_MODULE, }; diff --git a/arch/x86/events/intel/rapl.c b/arch/x86/events/intel/rapl.c index 91039ff..94dc564 100644 --- a/arch/x86/events/intel/rapl.c +++ b/arch/x86/events/intel/rapl.c @@ -397,13 +397,7 @@ static int rapl_pmu_event_init(struct 
perf_event *event) return -EINVAL; /* unsupported modes and filters */ - if (event->attr.exclude_user || - event->attr.exclude_kernel ||
[PATCH v4 09/13] powerpc: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
For PowerPC PMUs that do not support context exclusion let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability. This ensures that perf will prevent us from handling events where any exclusion flags are set. Let's also remove the now unnecessary check for exclusion flags. Signed-off-by: Andrew Murray Reviewed-by: Madhavan Srinivasan Acked-by: Michael Ellerman --- arch/powerpc/perf/hv-24x7.c | 10 +- arch/powerpc/perf/hv-gpci.c | 10 +- arch/powerpc/perf/imc-pmu.c | 19 +-- 3 files changed, 3 insertions(+), 36 deletions(-) diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c index 72238ee..d2b8e60 100644 --- a/arch/powerpc/perf/hv-24x7.c +++ b/arch/powerpc/perf/hv-24x7.c @@ -1306,15 +1306,6 @@ static int h_24x7_event_init(struct perf_event *event) return -EINVAL; } - /* unsupported modes and filters */ - if (event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest) - return -EINVAL; - /* no branch sampling */ if (has_branch_stack(event)) return -EOPNOTSUPP; @@ -1577,6 +1568,7 @@ static struct pmu h_24x7_pmu = { .start_txn = h_24x7_event_start_txn, .commit_txn = h_24x7_event_commit_txn, .cancel_txn = h_24x7_event_cancel_txn, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; static int hv_24x7_init(void) diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c index 43fabb3..735e77b 100644 --- a/arch/powerpc/perf/hv-gpci.c +++ b/arch/powerpc/perf/hv-gpci.c @@ -232,15 +232,6 @@ static int h_gpci_event_init(struct perf_event *event) return -EINVAL; } - /* unsupported modes and filters */ - if (event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest) - return -EINVAL; - /* no branch sampling */ if (has_branch_stack(event)) return -EOPNOTSUPP; @@ -285,6 +276,7 @@ static struct pmu h_gpci_pmu = { .start = 
h_gpci_event_start, .stop= h_gpci_event_stop, .read= h_gpci_event_update, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; static int hv_gpci_init(void) diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index f292a3f..b1c37cc 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -473,15 +473,6 @@ static int nest_imc_event_init(struct perf_event *event) if (event->hw.sample_period) return -EINVAL; - /* unsupported modes and filters */ - if (event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest) - return -EINVAL; - if (event->cpu < 0) return -EINVAL; @@ -748,15 +739,6 @@ static int core_imc_event_init(struct perf_event *event) if (event->hw.sample_period) return -EINVAL; - /* unsupported modes and filters */ - if (event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest) - return -EINVAL; - if (event->cpu < 0) return -EINVAL; @@ -1069,6 +1051,7 @@ static int update_pmu_ops(struct imc_pmu *pmu) pmu->pmu.stop = imc_event_stop; pmu->pmu.read = imc_event_update; pmu->pmu.attr_groups = pmu->attr_groups; + pmu->pmu.capabilities = PERF_PMU_CAP_NO_EXCLUDE; pmu->attr_groups[IMC_FORMAT_ATTR] = &imc_format_group; switch (pmu->domain) { -- 2.7.4
[PATCH v4 08/13] drivers/perf: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
For drivers that do not support context exclusion let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability. This ensures that perf will prevent us from handling events where any exclusion flags are set. Let's also remove the now unnecessary check for exclusion flags. This change means that qcom_{l2|l3}_pmu will now also indicate that they do not support exclude_{host|guest} and that xgene_pmu also does not support exclude_idle and exclude_hv. Note that for qcom_l2_pmu we now implicitly return -EINVAL instead of -EOPNOTSUPP. This change will result in the perf userspace utility retrying the perf_event_open system call with fallback event attributes that do not fail. Signed-off-by: Andrew Murray Acked-by: Will Deacon --- drivers/perf/qcom_l2_pmu.c | 9 + drivers/perf/qcom_l3_pmu.c | 8 +--- drivers/perf/xgene_pmu.c | 6 +- 3 files changed, 3 insertions(+), 20 deletions(-) diff --git a/drivers/perf/qcom_l2_pmu.c b/drivers/perf/qcom_l2_pmu.c index 842135c..091b4d7 100644 --- a/drivers/perf/qcom_l2_pmu.c +++ b/drivers/perf/qcom_l2_pmu.c @@ -509,14 +509,6 @@ static int l2_cache_event_init(struct perf_event *event) return -EOPNOTSUPP; } - /* We cannot filter accurately so we just don't allow it.
*/ - if (event->attr.exclude_user || event->attr.exclude_kernel || - event->attr.exclude_hv || event->attr.exclude_idle) { - dev_dbg_ratelimited(&l2cache_pmu->pdev->dev, - "Can't exclude execution levels\n"); - return -EOPNOTSUPP; - } - if (((L2_EVT_GROUP(event->attr.config) > L2_EVT_GROUP_MAX) || ((event->attr.config & ~L2_EVT_MASK) != 0)) && (event->attr.config != L2CYCLE_CTR_RAW_CODE)) { @@ -982,6 +974,7 @@ static int l2_cache_pmu_probe(struct platform_device *pdev) .stop = l2_cache_event_stop, .read = l2_cache_event_read, .attr_groups= l2_cache_pmu_attr_grps, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; l2cache_pmu->num_counters = get_num_counters(); diff --git a/drivers/perf/qcom_l3_pmu.c b/drivers/perf/qcom_l3_pmu.c index 2dc63d6..5d70646 100644 --- a/drivers/perf/qcom_l3_pmu.c +++ b/drivers/perf/qcom_l3_pmu.c @@ -495,13 +495,6 @@ static int qcom_l3_cache__event_init(struct perf_event *event) return -ENOENT; /* -* There are no per-counter mode filters in the PMU. -*/ - if (event->attr.exclude_user || event->attr.exclude_kernel || - event->attr.exclude_hv || event->attr.exclude_idle) - return -EINVAL; - - /* * Sampling not supported since these events are not core-attributable. 
*/ if (hwc->sample_period) @@ -777,6 +770,7 @@ static int qcom_l3_cache_pmu_probe(struct platform_device *pdev) .read = qcom_l3_cache__event_read, .attr_groups= qcom_l3_cache_pmu_attr_grps, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; memrc = platform_get_resource(pdev, IORESOURCE_MEM, 0); diff --git a/drivers/perf/xgene_pmu.c b/drivers/perf/xgene_pmu.c index 0dc9ff0..d4ec048 100644 --- a/drivers/perf/xgene_pmu.c +++ b/drivers/perf/xgene_pmu.c @@ -917,11 +917,6 @@ static int xgene_perf_event_init(struct perf_event *event) if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK) return -EINVAL; - /* SOC counters do not have usr/os/guest/host bits */ - if (event->attr.exclude_user || event->attr.exclude_kernel || - event->attr.exclude_host || event->attr.exclude_guest) - return -EINVAL; - if (event->cpu < 0) return -EINVAL; /* @@ -1136,6 +1131,7 @@ static int xgene_init_perf(struct xgene_pmu_dev *pmu_dev, char *name) .start = xgene_perf_start, .stop = xgene_perf_stop, .read = xgene_perf_read, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; /* Hardware counter init */ -- 2.7.4
[PATCH v4 07/13] drivers/perf: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
For drivers that do not support context exclusion let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability. This ensures that perf will prevent us from handling events where any exclusion flags are set. Let's also remove the now unnecessary check for exclusion flags. Signed-off-by: Andrew Murray Acked-by: Will Deacon --- drivers/perf/arm-cci.c| 10 +- drivers/perf/arm-ccn.c| 6 ++ drivers/perf/arm_dsu_pmu.c| 9 ++--- drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c | 1 + drivers/perf/hisilicon/hisi_uncore_hha_pmu.c | 1 + drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c | 1 + drivers/perf/hisilicon/hisi_uncore_pmu.c | 9 - 7 files changed, 8 insertions(+), 29 deletions(-) diff --git a/drivers/perf/arm-cci.c b/drivers/perf/arm-cci.c index 1bfeb16..bfd03e0 100644 --- a/drivers/perf/arm-cci.c +++ b/drivers/perf/arm-cci.c @@ -1327,15 +1327,6 @@ static int cci_pmu_event_init(struct perf_event *event) if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK) return -EOPNOTSUPP; - /* We have no filtering of any kind */ - if (event->attr.exclude_user|| - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle|| - event->attr.exclude_host|| - event->attr.exclude_guest) - return -EINVAL; - /* * Following the example set by other "uncore" PMUs, we accept any CPU * and rewrite its affinity dynamically rather than having perf core @@ -1433,6 +1424,7 @@ static int cci_pmu_init(struct cci_pmu *cci_pmu, struct platform_device *pdev) .stop = cci_pmu_stop, .read = pmu_read, .attr_groups= pmu_attr_groups, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; cci_pmu->plat_device = pdev; diff --git a/drivers/perf/arm-ccn.c b/drivers/perf/arm-ccn.c index 7dd850e..2ae7602 100644 --- a/drivers/perf/arm-ccn.c +++ b/drivers/perf/arm-ccn.c @@ -741,10 +741,7 @@ static int arm_ccn_pmu_event_init(struct perf_event *event) return -EOPNOTSUPP; } - if (has_branch_stack(event) || event->attr.exclude_user || - event->attr.exclude_kernel || event->attr.exclude_hv || - 
event->attr.exclude_idle || event->attr.exclude_host || - event->attr.exclude_guest) { + if (has_branch_stack(event)) { dev_dbg(ccn->dev, "Can't exclude execution levels!\n"); return -EINVAL; } @@ -1290,6 +1287,7 @@ static int arm_ccn_pmu_init(struct arm_ccn *ccn) .read = arm_ccn_pmu_event_read, .pmu_enable = arm_ccn_pmu_enable, .pmu_disable = arm_ccn_pmu_disable, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; /* No overflow interrupt? Have to use a timer instead. */ diff --git a/drivers/perf/arm_dsu_pmu.c b/drivers/perf/arm_dsu_pmu.c index 660cb8a..5851de5 100644 --- a/drivers/perf/arm_dsu_pmu.c +++ b/drivers/perf/arm_dsu_pmu.c @@ -562,13 +562,7 @@ static int dsu_pmu_event_init(struct perf_event *event) return -EINVAL; } - if (has_branch_stack(event) || - event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest) { + if (has_branch_stack(event)) { dev_dbg(dsu_pmu->pmu.dev, "Can't support filtering\n"); return -EINVAL; } @@ -735,6 +729,7 @@ static int dsu_pmu_device_probe(struct platform_device *pdev) .read = dsu_pmu_read, .attr_groups= dsu_pmu_attr_groups, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; rc = perf_pmu_register(&dsu_pmu->pmu, name, -1); diff --git a/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c b/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c index 69372e2..0eba947 100644 --- a/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c +++ b/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c @@ -396,6 +396,7 @@ static int hisi_ddrc_pmu_probe(struct platform_device *pdev) .stop = hisi_uncore_pmu_stop, .read = hisi_uncore_pmu_read, .attr_groups= hisi_ddrc_pmu_attr_groups, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; ret = perf_pmu_register(&ddrc_pmu->pmu, name, -1); diff --git a/drivers/perf/hisilicon/hisi_uncore_hha_pmu.c b/drivers/perf/hisilicon/hisi_uncore_hha_pmu.c index 443906e..2553a84 100644 --- a/drivers/perf/hisilicon/hisi_uncore_hha_pmu.c +++ 
b/drivers/perf/hisilicon/hisi_uncore_hha_pmu.c @@ -407,6 +407,7 @@ static int hisi_hha_pmu_
[PATCH v4 06/13] arm: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
For drivers that do not support context exclusion let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability. This ensures that perf will prevent us from handling events where any exclusion flags are set. Let's also remove the now unnecessary check for exclusion flags. Signed-off-by: Andrew Murray Acked-by: Shawn Guo Acked-by: Will Deacon --- arch/arm/mach-imx/mmdc.c | 9 ++--- arch/arm/mm/cache-l2x0-pmu.c | 9 + 2 files changed, 3 insertions(+), 15 deletions(-) diff --git a/arch/arm/mach-imx/mmdc.c b/arch/arm/mach-imx/mmdc.c index e49e068..fce4b42 100644 --- a/arch/arm/mach-imx/mmdc.c +++ b/arch/arm/mach-imx/mmdc.c @@ -294,13 +294,7 @@ static int mmdc_pmu_event_init(struct perf_event *event) return -EOPNOTSUPP; } - if (event->attr.exclude_user|| - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle|| - event->attr.exclude_host|| - event->attr.exclude_guest || - event->attr.sample_period) + if (event->attr.sample_period) return -EINVAL; if (cfg < 0 || cfg >= MMDC_NUM_COUNTERS) @@ -456,6 +450,7 @@ static int mmdc_pmu_init(struct mmdc_pmu *pmu_mmdc, .start = mmdc_pmu_event_start, .stop = mmdc_pmu_event_stop, .read = mmdc_pmu_event_update, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }, .mmdc_base = mmdc_base, .dev = dev, diff --git a/arch/arm/mm/cache-l2x0-pmu.c b/arch/arm/mm/cache-l2x0-pmu.c index afe5b4c..99bcd07 100644 --- a/arch/arm/mm/cache-l2x0-pmu.c +++ b/arch/arm/mm/cache-l2x0-pmu.c @@ -314,14 +314,6 @@ static int l2x0_pmu_event_init(struct perf_event *event) event->attach_state & PERF_ATTACH_TASK) return -EINVAL; - if (event->attr.exclude_user || - event->attr.exclude_kernel || - event->attr.exclude_hv || - event->attr.exclude_idle || - event->attr.exclude_host || - event->attr.exclude_guest) - return -EINVAL; - if (event->cpu < 0) return -EINVAL; @@ -544,6 +536,7 @@ static __init int l2x0_pmu_init(void) .del = l2x0_pmu_event_del, .event_init = l2x0_pmu_event_init, .attr_groups = l2x0_pmu_attr_groups, + .capabilities = 
PERF_PMU_CAP_NO_EXCLUDE, }; l2x0_pmu_reset(); -- 2.7.4
[PATCH v4 05/13] arm: perf: conditionally use PERF_PMU_CAP_NO_EXCLUDE
The ARM PMU driver can be used to represent a variety of ARM-based PMUs. Some of these PMUs do not provide support for context exclusion; where this is the case, we advertise the PERF_PMU_CAP_NO_EXCLUDE capability to ensure that perf prevents us from handling events where any exclusion flags are set. Signed-off-by: Andrew Murray Acked-by: Will Deacon --- drivers/perf/arm_pmu.c | 15 +-- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c index d0b7dd8..eec75b9 100644 --- a/drivers/perf/arm_pmu.c +++ b/drivers/perf/arm_pmu.c @@ -357,13 +357,6 @@ static irqreturn_t armpmu_dispatch_irq(int irq, void *dev) } static int -event_requires_mode_exclusion(struct perf_event_attr *attr) -{ - return attr->exclude_idle || attr->exclude_user || - attr->exclude_kernel || attr->exclude_hv; -} - -static int __hw_perf_event_init(struct perf_event *event) { struct arm_pmu *armpmu = to_arm_pmu(event->pmu); @@ -393,9 +386,8 @@ __hw_perf_event_init(struct perf_event *event) /* * Check whether we need to exclude the counter from certain modes. */ - if ((!armpmu->set_event_filter || -armpmu->set_event_filter(hwc, &event->attr)) && -event_requires_mode_exclusion(&event->attr)) { + if (armpmu->set_event_filter && + armpmu->set_event_filter(hwc, &event->attr)) { pr_debug("ARM performance counters do not support " "mode exclusion\n"); return -EOPNOTSUPP; } @@ -867,6 +859,9 @@ int armpmu_register(struct arm_pmu *pmu) if (ret) return ret; + if (!pmu->set_event_filter) + pmu->pmu.capabilities |= PERF_PMU_CAP_NO_EXCLUDE; + ret = perf_pmu_register(&pmu->pmu, pmu->name, -1); if (ret) goto out_destroy; -- 2.7.4
[PATCH v4 04/13] alpha: perf/core: use PERF_PMU_CAP_NO_EXCLUDE
As the Alpha PMU doesn't support context exclusion let's advertise the PERF_PMU_CAP_NO_EXCLUDE capability. This ensures that perf will prevent us from handling events where any exclusion flags are set. Let's also remove the now unnecessary check for exclusion flags. This change means that __hw_perf_event_init will now also indicate that it doesn't support exclude_host and exclude_guest and will now implicitly return -EINVAL instead of -EPERM. This is likely more desirable as -EPERM will result in a kernel.perf_event_paranoid related warning from the perf userspace utility. Signed-off-by: Andrew Murray --- arch/alpha/kernel/perf_event.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c index 5613aa37..4341ccf 100644 --- a/arch/alpha/kernel/perf_event.c +++ b/arch/alpha/kernel/perf_event.c @@ -630,12 +630,6 @@ static int __hw_perf_event_init(struct perf_event *event) return ev; } - /* The EV67 does not support mode exclusion */ - if (attr->exclude_kernel || attr->exclude_user - || attr->exclude_hv || attr->exclude_idle) { - return -EPERM; - } - /* * We place the event type in event_base here and leave calculation * of the codes to programme the PMU for alpha_pmu_enable() because @@ -771,6 +765,7 @@ static struct pmu pmu = { .start = alpha_pmu_start, .stop = alpha_pmu_stop, .read = alpha_pmu_read, + .capabilities = PERF_PMU_CAP_NO_EXCLUDE, }; -- 2.7.4
[PATCH v4 03/13] perf/core: add PERF_PMU_CAP_NO_EXCLUDE for exclusion incapable PMUs
Many PMU drivers do not have the capability to exclude counting events that occur in specific contexts such as idle, kernel, guest, etc. These drivers indicate this by returning an error in their event_init upon testing the event's attribute flags. This approach is error-prone and often inconsistent. Let's instead allow PMU drivers to advertise their inability to exclude based on context via a new capability: PERF_PMU_CAP_NO_EXCLUDE. This allows the perf core to reject requests for exclusion events where there is no support in the PMU. Signed-off-by: Andrew Murray --- include/linux/perf_event.h | 1 + kernel/events/core.c | 9 + 2 files changed, 10 insertions(+) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 54a78d2..cec02dc 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -244,6 +244,7 @@ struct perf_event; #define PERF_PMU_CAP_EXCLUSIVE 0x10 #define PERF_PMU_CAP_ITRACE0x20 #define PERF_PMU_CAP_HETEROGENEOUS_CPUS0x40 +#define PERF_PMU_CAP_NO_EXCLUDE0x80 /** * struct pmu - generic performance monitoring unit diff --git a/kernel/events/core.c b/kernel/events/core.c index 3cd13a3..fbe59b7 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9772,6 +9772,15 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event) if (ctx) perf_event_ctx_unlock(event->group_leader, ctx); + if (!ret) { + if (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE && + event_has_any_exclude_flag(event)) { + if (event->destroy) + event->destroy(event); + ret = -EINVAL; + } + } + if (ret) module_put(pmu->module); -- 2.7.4
[PATCH v4 02/13] perf/core: add function to test for event exclusion flags
Add a function that tests if any of the perf event exclusion flags are set on a given event. Signed-off-by: Andrew Murray --- include/linux/perf_event.h | 9 + 1 file changed, 9 insertions(+) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 1d5c551..54a78d2 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -1004,6 +1004,15 @@ perf_event__output_id_sample(struct perf_event *event, extern void perf_log_lost_samples(struct perf_event *event, u64 lost); +static inline bool event_has_any_exclude_flag(struct perf_event *event) +{ + struct perf_event_attr *attr = &event->attr; + + return attr->exclude_idle || attr->exclude_user || + attr->exclude_kernel || attr->exclude_hv || + attr->exclude_guest || attr->exclude_host; +} + static inline bool is_sampling_event(struct perf_event *event) { return event->attr.sample_period != 0; -- 2.7.4
[PATCH v4 00/13] perf/core: Generalise event exclusion checking
Many PMU drivers do not have the capability to exclude counting events that occur in specific contexts such as idle, kernel, guest, etc. These drivers indicate this by returning an error in their event_init upon testing the event's attribute flags. However, this approach requires that each time a new event modifier is added to perf, all the perf drivers need to be modified to indicate that they don't support the attribute. This results in additional boiler-plate code common to many drivers that needs to be maintained. Furthermore, the drivers are not consistent with regards to the error value they return when reporting unsupported attributes. This patchset allows PMU drivers to advertise their inability to exclude based on context via a new capability: PERF_PMU_CAP_NO_EXCLUDE. This allows the perf core to reject requests for exclusion events where there is no support in the PMU. This is a functional change, in particular:

- Some drivers will now additionally (but correctly) report unsupported exclusion flags. It's typical for existing userspace tools such as perf to handle such errors by retrying the system call without the unsupported flags.

- Drivers that do not support any exclusion and previously reported -EPERM or -EOPNOTSUPP will now report -EINVAL - this is consistent with the majority and results in userspace perf retrying without exclusion.

All drivers touched by this patchset have been compile tested.
Changes from v3:
 - Added PERF_PMU_CAP_NO_EXCLUDE to Cavium TX2 PMU driver

Changes from v2:
 - Invert logic from CAP_EXCLUDE to CAP_NO_EXCLUDE

Changes from v1:
 - Changed approach from explicitly rejecting events in unsupporting PMU
   drivers to explicitly advertising a capability in PMU drivers that do
   support exclusion events
 - Added additional information to tools/perf/design.txt
 - Rename event_has_exclude_flags to event_has_any_exclude_flag and
   update commit log to reflect that it's a function

Andrew Murray (13):
  perf/doc: update design.txt for exclude_{host|guest} flags
  perf/core: add function to test for event exclusion flags
  perf/core: add PERF_PMU_CAP_NO_EXCLUDE for exclusion incapable PMUs
  alpha: perf/core: use PERF_PMU_CAP_NO_EXCLUDE
  arm: perf: conditionally use PERF_PMU_CAP_NO_EXCLUDE
  arm: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
  drivers/perf: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude
    incapable PMUs
  drivers/perf: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude
    incapable PMUs
  powerpc: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable
    PMUs
  x86: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
  x86: perf/core: use PERF_PMU_CAP_NO_EXCLUDE for exclude incapable PMUs
  perf/core: remove unused perf_flags
  drivers/perf: use PERF_PMU_CAP_NO_EXCLUDE for Cavium TX2 PMU

 arch/alpha/kernel/perf_event.c                |  7 +--
 arch/arm/mach-imx/mmdc.c                      |  9 ++---
 arch/arm/mm/cache-l2x0-pmu.c                  |  9 +
 arch/powerpc/perf/hv-24x7.c                   | 10 +-
 arch/powerpc/perf/hv-gpci.c                   | 10 +-
 arch/powerpc/perf/imc-pmu.c                   | 19 +--
 arch/x86/events/amd/ibs.c                     | 13 +
 arch/x86/events/amd/iommu.c                   |  6 +-
 arch/x86/events/amd/power.c                   | 10 ++
 arch/x86/events/amd/uncore.c                  |  7 ++-
 arch/x86/events/intel/cstate.c                | 12 +++-
 arch/x86/events/intel/rapl.c                  |  9 ++---
 arch/x86/events/intel/uncore.c                |  9 +
 arch/x86/events/intel/uncore_snb.c            |  9 ++---
 arch/x86/events/msr.c                         | 10 ++
 drivers/perf/arm-cci.c                        | 10 +-
 drivers/perf/arm-ccn.c                        |  6 ++
 drivers/perf/arm_dsu_pmu.c                    |  9 ++---
 drivers/perf/arm_pmu.c                        | 15 +--
 drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c |  1 +
 drivers/perf/hisilicon/hisi_uncore_hha_pmu.c  |  1 +
 drivers/perf/hisilicon/hisi_uncore_l3c_pmu.c  |  1 +
 drivers/perf/hisilicon/hisi_uncore_pmu.c      |  9 -
 drivers/perf/qcom_l2_pmu.c                    |  9 +
 drivers/perf/qcom_l3_pmu.c                    |  8 +---
 drivers/perf/thunderx2_pmu.c                  | 10 +-
 drivers/perf/xgene_pmu.c                      |  6 +-
 include/linux/perf_event.h                    | 10 ++
 include/uapi/linux/perf_event.h               |  2 --
 kernel/events/core.c                          |  9 +
 tools/include/uapi/linux/perf_event.h         |  2 --
 tools/perf/design.txt                         |  4
 32 files changed, 63 insertions(+), 198 deletions(-)

-- 
2.7.4
[PATCH v4 01/13] perf/doc: update design.txt for exclude_{host|guest} flags
Update design.txt to reflect the presence of the exclude_host and
exclude_guest perf flags.

Signed-off-by: Andrew Murray
---
 tools/perf/design.txt | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tools/perf/design.txt b/tools/perf/design.txt
index a28dca2..0453ba2 100644
--- a/tools/perf/design.txt
+++ b/tools/perf/design.txt
@@ -222,6 +222,10 @@ The 'exclude_user', 'exclude_kernel' and 'exclude_hv' bits provide a
 way to request that counting of events be restricted to times when the
 CPU is in user, kernel and/or hypervisor mode.
 
+Furthermore the 'exclude_host' and 'exclude_guest' bits provide a way
+to request counting of events restricted to guest and host contexts when
+using Linux as the hypervisor.
+
 The 'mmap' and 'munmap' bits allow recording of PROT_EXEC mmap/munmap
 operations, these can be used to relate userspace IP addresses to actual
 code, even after the mapping (or even the whole process) is gone,
-- 
2.7.4
Re: [PATCH 9/11] KVM/MMU: Flush tlb in the kvm_mmu_write_protect_pt_masked()
On 04/01/19 09:54, lantianyu1...@gmail.com wrote:
> 		rmap_head = __gfn_to_rmap(slot->base_gfn + gfn_offset + __ffs(mask),
> 					  PT_PAGE_TABLE_LEVEL, slot);
> -		__rmap_write_protect(kvm, rmap_head, false);
> +		flush |= __rmap_write_protect(kvm, rmap_head, false);
> 
> 		/* clear the first set bit */
> 		mask &= mask - 1;
> 	}
> +
> +	if (flush && kvm_available_flush_tlb_with_range()) {
> +		kvm_flush_remote_tlbs_with_address(kvm,
> +				slot->base_gfn + gfn_offset,
> +				hweight_long(mask));

Mask is zero here, so this probably won't work.

In addition, I suspect calling the hypercall once for every 64 pages is
not very efficient. Passing a flush list into
kvm_mmu_write_protect_pt_masked, and flushing in
kvm_arch_mmu_enable_log_dirty_pt_masked, isn't efficient either because
kvm_arch_mmu_enable_log_dirty_pt_masked is also called once per word.

I don't have any good ideas, except for moving the whole
kvm_clear_dirty_log_protect loop into architecture-specific code (which
is not the direction we want---architectures should share more code, not
less).

Paolo

> +		flush = false;
> +	}
> +
Re: [PATCH 7/11] KVM: Remove redundant check in the kvm_get_dirty_log_protect()
On 04/01/19 16:50, Sean Christopherson wrote:
> Tangentially related, does mmu_lock actually need to be held while we
> walk dirty_bitmap in kvm_{clear,get}_dirty_log_protect()?  The bitmap
> itself is protected by slots_lock (a lockdep assertion would be nice
> too), e.g. can we grab the lock iff dirty_bitmap[i] != 0?

Yes, we could avoid grabbing it as long as the bitmap is zero. However,
without kvm->manual_dirty_log_protect, the granularity of
kvm_get_dirty_log_protect() is too coarse so it won't happen in
practice. Instead, with the new manual clear, kvm_get_dirty_log_protect()
does not take the lock and a well-written userspace is not going to call
the clear ioctl unless some bits are set.

Paolo
Re: [PATCH 6/11] KVM/MMU: Flush tlb with range list in sync_page()
On 04/01/19 17:30, Sean Christopherson wrote:
>> +
>> +		if (kvm_available_flush_tlb_with_range()
>> +		    && (tmp_spte_ret & SET_SPTE_NEED_REMOTE_TLB_FLUSH)) {
>> +			struct kvm_mmu_page *leaf_sp = page_header(sp->spt[i]
>> +					& PT64_BASE_ADDR_MASK);
>> +			list_add(&leaf_sp->flush_link, &flush_list);
>> +		}
>> +
>> +		set_spte_ret |= tmp_spte_ret;
>> +
>>  	}
>>  
>>  	if (set_spte_ret & SET_SPTE_NEED_REMOTE_TLB_FLUSH)
>> -		kvm_flush_remote_tlbs(vcpu->kvm);
>> +		kvm_flush_remote_tlbs_with_list(vcpu->kvm, &flush_list);
> This is a bit confusing and potentially fragile.  It's not obvious that
> kvm_flush_remote_tlbs_with_list() is guaranteed to call
> kvm_flush_remote_tlbs() when kvm_available_flush_tlb_with_range() is
> false, and you're relying on the kvm_flush_remote_tlbs_with_list() call
> chain to never optimize away the empty list case.  Rechecking
> kvm_available_flush_tlb_with_range() isn't expensive.

Alternatively, do not check it during the loop: always build the
flush_list, and always call kvm_flush_remote_tlbs_with_list. The
function can then check whether the list is empty, and the OR of
tmp_spte_ret on every iteration goes away.

Paolo
Re: [PATCH v3 1/2] selftests/powerpc: Add MSR bits
hi Christophe,

On 1/7/19 10:47 AM, Christophe Leroy wrote:
> Hi Breno,
> 
> Le 07/01/2019 à 13:44, Breno Leitao a écrit :
>> hi Christophe,
>>
>> On 1/3/19 3:19 PM, LEROY Christophe wrote:
>>> Breno Leitao a écrit :
>>>> This patch simply adds definitions for the MSR bits and some macros
>>>> to test for MSR TM bits.
>>>>
>>>> This was copied from arch/powerpc/include/asm/reg.h generic MSR part.
>>>
>>> Can't we find a way to avoid duplicating such defines ?
>>
>> I think there are three possible ways, but none of them respect the
>> premises we are used to. These are the possible ways I can think of:
>>
>> 1) Including arch/powerpc/include/asm as part of the selftest
>> compilation process.
>> Problem: This might break the selftest independence of the kbuild
>> system.
>>
>> 2) Generate a temporary header file inside selftests/include which
>> contains these macros at compilation time.
>> Problem: The same problem as above.
>>
>> 3) Define MSR fields at userspace headers (/usr/include).
>> Problem: I am not sure userspace should have MSR bits information.
>>
>> Do you suggest me to investigate any other way?
> 
> Looking at other .h in selftests, it looks like they are limited to
> only the strictly necessary values.
> 
> Are all the values you have listed used ? If not, could you only
> include in the file the necessary ones ?

Sure. That works also. Let me send a v4 patch.
Re: [PATCH v3 1/2] selftests/powerpc: Add MSR bits
Hi Breno,

Le 07/01/2019 à 13:44, Breno Leitao a écrit :
> hi Christophe,
> 
> On 1/3/19 3:19 PM, LEROY Christophe wrote:
>> Breno Leitao a écrit :
>>> This patch simply adds definitions for the MSR bits and some macros
>>> to test for MSR TM bits.
>>>
>>> This was copied from arch/powerpc/include/asm/reg.h generic MSR part.
>>
>> Can't we find a way to avoid duplicating such defines ?
> 
> I think there are three possible ways, but none of them respect the
> premises we are used to. These are the possible ways I can think of:
> 
> 1) Including arch/powerpc/include/asm as part of the selftest
> compilation process.
> Problem: This might break the selftest independence of the kbuild
> system.
> 
> 2) Generate a temporary header file inside selftests/include which
> contains these macros at compilation time.
> Problem: The same problem as above.
> 
> 3) Define MSR fields at userspace headers (/usr/include).
> Problem: I am not sure userspace should have MSR bits information.
> 
> Do you suggest me to investigate any other way?

Looking at other .h in selftests, it looks like they are limited to only
the strictly necessary values.

Are all the values you have listed used ? If not, could you only include
in the file the necessary ones ?

Christophe
Re: [PATCH v3 1/2] selftests/powerpc: Add MSR bits
hi Christophe,

On 1/3/19 3:19 PM, LEROY Christophe wrote:
> Breno Leitao a écrit :
> 
>> This patch simply adds definitions for the MSR bits and some macros to
>> test for MSR TM bits.
>>
>> This was copied from arch/powerpc/include/asm/reg.h generic MSR part.
> 
> Can't we find a way to avoid duplicating such defines ?

I think there are three possible ways, but none of them respect the
premises we are used to. These are the possible ways I can think of:

1) Including arch/powerpc/include/asm as part of the selftest
compilation process.
Problem: This might break the selftest independence of the kbuild
system.

2) Generate a temporary header file inside selftests/include which
contains these macros at compilation time.
Problem: The same problem as above.

3) Define MSR fields at userspace headers (/usr/include).
Problem: I am not sure userspace should have MSR bits information.

Do you suggest me to investigate any other way?
[PATCH] powerpc: use a CONSOLE_LOGLEVEL_DEBUG macro
Use a CONSOLE_LOGLEVEL_DEBUG macro for console_loglevel rather than a
naked number.

Signed-off-by: Sergey Senozhatsky
---
 arch/powerpc/kernel/udbg.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/udbg.c b/arch/powerpc/kernel/udbg.c
index 7cc38b5b58bc..8db4891acdaf 100644
--- a/arch/powerpc/kernel/udbg.c
+++ b/arch/powerpc/kernel/udbg.c
@@ -74,7 +74,7 @@ void __init udbg_early_init(void)
 #endif
 
 #ifdef CONFIG_PPC_EARLY_DEBUG
-	console_loglevel = 10;
+	console_loglevel = CONSOLE_LOGLEVEL_DEBUG;
 	register_early_udbg_console();
 #endif
-- 
2.20.1