[PATCH 1/1] swiotlb: dump used and total slots when swiotlb buffer is full
So far the kernel prints only the requested size when the swiotlb buffer is full. It is not possible to tell whether the buffer is simply exhausted, or whether swiotlb cannot allocate a buffer of the requested size because of fragmentation. As 'io_tlb_used' is available since commit 71602fe6d4e9 ("swiotlb: add debugfs to track swiotlb buffer usage"), print both 'io_tlb_used' and 'io_tlb_nslabs' when the swiotlb buffer is full.

Signed-off-by: Dongli Zhang
---
 kernel/dma/swiotlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 53012db..3f43b37 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -540,7 +540,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
 not_found:
 	spin_unlock_irqrestore(&io_tlb_lock, flags);
 	if (!(attrs & DMA_ATTR_NO_WARN) && printk_ratelimit())
-		dev_warn(hwdev, "swiotlb buffer is full (sz: %zd bytes)\n", size);
+		dev_warn(hwdev, "swiotlb buffer is full (sz: %zd bytes), total %lu, used %lu\n",
+			 size, io_tlb_nslabs, io_tlb_used);
 	return DMA_MAPPING_ERROR;
 found:
 	io_tlb_used += nslots;
-- 
2.7.4
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
On 4/3/19 10:10 PM, Andy Lutomirski wrote: > On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz wrote: >> >> XPFO flushes kernel space TLB entries for pages that are now mapped >> in userspace on not only the current CPU but also all other CPUs >> synchronously. Processes on each core allocating pages causes a >> flood of IPI messages to all other cores to flush TLB entries. >> Many of these messages are to flush the entire TLB on the core if >> the number of entries being flushed from local core exceeds >> tlb_single_page_flush_ceiling. The cost of TLB flush caused by >> unmapping pages from physmap goes up dramatically on machines with >> high core count. >> >> This patch flushes relevant TLB entries for current process or >> entire TLB depending upon number of entries for the current CPU >> and posts a pending TLB flush on all other CPUs when a page is >> unmapped from kernel space and mapped in userspace. Each core >> checks the pending TLB flush flag for itself on every context >> switch, flushes its TLB if the flag is set and clears it. >> This patch potentially aggregates multiple TLB flushes into one. >> This has very significant impact especially on machines with large >> core counts. > > Why is this a reasonable strategy? Ideally when pages are unmapped from physmap, all CPUs would be sent IPI synchronously to flush TLB entry for those pages immediately. This may be ideal from correctness and consistency point of view, but it also results in IPI storm and repeated TLB flushes on all processors. Any time a page is allocated to userspace, we are going to go through this and it is very expensive. On a 96-core server, performance degradation is 26x!! When xpfo unmaps a page from physmap only (after mapping the page in userspace in response to an allocation request from userspace) on one processor, there is a small window of opportunity for ret2dir attack on other cpus until the TLB entry in physmap for the unmapped pages on other cpus is cleared. 
Forcing that to happen synchronously is the expensive part. Multiple such requests can come in over a very short time across multiple processors, resulting in every cpu asking every other cpu to flush TLB just to close this small window of vulnerability in the kernel. If each request is processed synchronously, each CPU will do multiple TLB flushes in short order. If we could consolidate these TLB flush requests instead and do one TLB flush on each cpu at the time of context switch, we can reduce the performance impact significantly. This bears out in real life when measuring the system time for a parallel kernel build on a large server. Without this, system time on a 96-core server doing "make -j60 all" went up 26x. After this optimization, the impact went down to 1.44x. The trade-off with this strategy is that the kernel on a cpu is vulnerable for a short time if the currently running process is malicious. Is that an acceptable trade-off? I am open to other ideas on reducing the performance impact due to xpfo.

>
>> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
>> +{
>> +	struct cpumask tmp_mask;
>> +
>> +	/*
>> +	 * Balance as user space task's flush, a bit conservative.
>> +	 * Do a local flush immediately and post a pending flush on all
>> +	 * other CPUs. Local flush can be a range flush or full flush
>> +	 * depending upon the number of entries to be flushed. Remote
>> +	 * flushes will be done by individual processors at the time of
>> +	 * context switch and this allows multiple flush requests from
>> +	 * other CPUs to be batched together.
>> +	 */
>
> I don't like this function at all. A core function like this is a
> contract of sorts between the caller and the implementation. There is
> no such thing as an "xpfo" flush, and this function's behavior isn't
> at all well defined. For flush_tlb_kernel_range(), I can tell you
> exactly what that function does, and the implementation is either
> correct or incorrect.
> With this function, I have no idea what is
> actually required, and I can't possibly tell whether it's correct.
>
> As far as I can see, xpfo_flush_tlb_kernel_range() actually means
> "flush this range on this CPU right now, and flush it on remote CPUs
> eventually". It would be valid, but probably silly, to flush locally
> and to never flush at all on remote CPUs. This makes me wonder what
> the point is.
>

I would restate that as "flush this range on this cpu right now, and flush it on remote cpus at the next context switch". A better name for the routine and a better description is a reasonable change to make.

Thanks,
Khalid
Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
> On Apr 4, 2019, at 10:28 AM, Thomas Gleixner wrote:
>
>> On Thu, 4 Apr 2019, Tycho Andersen wrote:
>>	leaq	-PTREGS_SIZE(%rax), %rsp
>>	UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
>>
>> +	/*
>> +	 * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
>> +	 * them back on before going back to "normal" code.
>> +	 */
>> +	sti
>
> That breaks the paravirt muck and tracing/lockdep.
>
> ENABLE_INTERRUPTS() is what you want plus TRACE_IRQ_ON to keep the tracer
> and lockdep happy.

I'm sure we'll find some other thing we forgot to reset eventually, so let's do this in C. Change the call to do_exit to call __finish_rewind_stack_do_exit and add the latter as a C function that does local_irq_enable() and do_exit().
Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
On Thu, Apr 04, 2019 at 09:15:46AM -0600, Khalid Aziz wrote:
> Thanks Peter. I really appreciate your review. Your feedback helps make
> this code better and closer to where I can feel comfortable not calling
> it RFC any more.
>
> The more I look at xpfo_kmap()/xpfo_kunmap() code, the more I get
> uncomfortable with it. As you pointed out about calling kmap_atomic from
> NMI context, that just makes the kmap_atomic code look even worse. I
> pointed out one problem with this code in cover letter and suggested a
> rewrite. I see these problems with this code:

Well, I no longer use it from NMI context, but I did do that for a
while. We now have a giant heap of magic in the NMI path that allows us
to take faults from NMI context (please don't ask), this means we can
mostly do copy_from_user_inatomic() now.

> 1. When xpfo_kmap maps a page back in physmap, it opens up the ret2dir
> attack security hole again even if just for the duration of kmap. A kmap
> can stay around for some time if the page is being used for I/O.

Correct.

> 2. This code uses spinlock which leads to problems. If it does not
> disable IRQ, it is exposed to deadlock around xpfo_lock. If it disables
> IRQ, I think it can still deadlock around pgd_lock.

I've not spotted that inversion yet, but then I didn't look at the lock
usage outside of k{,un}map_xpfo().

> I think a better implementation of xpfo_kmap()/xpfo_kunmap() would map
> the page at a new virtual address similar to what kmap_high for i386
> does. This avoids re-opening the ret2dir security hole. We can also
> possibly do away with xpfo_lock saving bytes in page-frame and the not
> so sane code sequence can go away.

Right, the TLB invalidation issues are still tricky, even there :/
Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
On Thu, 4 Apr 2019, Tycho Andersen wrote:
> 	leaq	-PTREGS_SIZE(%rax), %rsp
> 	UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
>
> +	/*
> +	 * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
> +	 * them back on before going back to "normal" code.
> +	 */
> +	sti

That breaks the paravirt muck and tracing/lockdep.

ENABLE_INTERRUPTS() is what you want plus TRACE_IRQ_ON to keep the tracer
and lockdep happy.

Thanks,

	tglx
[PATCH] sparc64/pci_sun4v: fix ATU checks for large DMA masks
Now that drivers are allowed to set DMA masks larger than required, we need to be a little more careful in the sun4v PCI iommu driver about when to select the ATU support - a larger DMA mask can be set even when the platform does not support ATU, so we always have to check that it is available before using it. Add a little helper for that and use it in all the places where we make ATU usage decisions based on the DMA mask.

Fixes: 24132a419c68 ("sparc64/pci_sun4v: allow large DMA masks")
Reported-by: Meelis Roos
Signed-off-by: Christoph Hellwig
Tested-by: Meelis Roos
---
 arch/sparc/kernel/pci_sun4v.c | 20 ++++++++++++--------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c
index a8af6023c126..14b93c5564e3 100644
--- a/arch/sparc/kernel/pci_sun4v.c
+++ b/arch/sparc/kernel/pci_sun4v.c
@@ -73,6 +73,11 @@ static inline void iommu_batch_start(struct device *dev, unsigned long prot, uns
 	p->npages = 0;
 }
 
+static inline bool iommu_use_atu(struct iommu *iommu, u64 mask)
+{
+	return iommu->atu && mask > DMA_BIT_MASK(32);
+}
+
 /* Interrupts must be disabled.  */
 static long iommu_batch_flush(struct iommu_batch *p, u64 mask)
 {
@@ -92,7 +97,7 @@ static long iommu_batch_flush(struct iommu_batch *p, u64 mask)
 	prot &= (HV_PCI_MAP_ATTR_READ | HV_PCI_MAP_ATTR_WRITE);
 
 	while (npages != 0) {
-		if (mask <= DMA_BIT_MASK(32) || !pbm->iommu->atu) {
+		if (!iommu_use_atu(pbm->iommu, mask)) {
 			num = pci_sun4v_iommu_map(devhandle,
						  HV_PCI_TSBID(0, entry),
						  npages,
@@ -179,7 +184,6 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size,
 	unsigned long flags, order, first_page, npages, n;
 	unsigned long prot = 0;
 	struct iommu *iommu;
-	struct atu *atu;
 	struct iommu_map_table *tbl;
 	struct page *page;
 	void *ret;
@@ -205,13 +209,11 @@ static void *dma_4v_alloc_coherent(struct device *dev, size_t size,
 	memset((char *)first_page, 0, PAGE_SIZE << order);
 
 	iommu = dev->archdata.iommu;
-	atu = iommu->atu;
-
 	mask = dev->coherent_dma_mask;
-	if (mask <= DMA_BIT_MASK(32) || !atu)
+	if (!iommu_use_atu(iommu, mask))
 		tbl = &iommu->tbl;
 	else
-		tbl = &atu->tbl;
+		tbl = &iommu->atu->tbl;
 
 	entry = iommu_tbl_range_alloc(dev, tbl, npages, NULL,
				      (unsigned long)(-1), 0);
@@ -333,7 +335,7 @@ static void dma_4v_free_coherent(struct device *dev, size_t size, void *cpu,
 	atu = iommu->atu;
 	devhandle = pbm->devhandle;
 
-	if (dvma <= DMA_BIT_MASK(32)) {
+	if (!iommu_use_atu(iommu, dvma)) {
 		tbl = &iommu->tbl;
 		iotsb_num = 0; /* we don't care for legacy iommu */
 	} else {
@@ -374,7 +376,7 @@ static dma_addr_t dma_4v_map_page(struct device *dev, struct page *page,
 	npages >>= IO_PAGE_SHIFT;
 
 	mask = *dev->dma_mask;
-	if (mask <= DMA_BIT_MASK(32))
+	if (!iommu_use_atu(iommu, mask))
 		tbl = &iommu->tbl;
 	else
 		tbl = &atu->tbl;
@@ -510,7 +512,7 @@ static int dma_4v_map_sg(struct device *dev, struct scatterlist *sglist,
 				  IO_PAGE_SIZE) >> IO_PAGE_SHIFT;
 
 	mask = *dev->dma_mask;
-	if (mask <= DMA_BIT_MASK(32))
+	if (!iommu_use_atu(iommu, mask))
 		tbl = &iommu->tbl;
 	else
 		tbl = &atu->tbl;
-- 
2.20.1
Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
- stepping on del button while browsing through CCs.

On 2019-04-04 09:47:27 [-0600], Tycho Andersen wrote:
> > Hmm. do_exit() isn't really meant to be "try your best to leave the
> > system somewhat usable without returning" -- it's a function that,
> > other than in OOPSes, is called from a well-defined state. So I think
> > rewind_stack_do_exit() is probably a better spot. But we need to
> > rewind the stack and *then* turn on IRQs, since we otherwise risk
> > exploding quite badly.
>
> Ok, sounds good. I guess we can include something like this patch in
> the next series.

The tracing infrastructure probably doesn't know that the interrupts are
back on. Also, if you were holding a spin lock then your preempt count
isn't 0, which means that might_sleep() will trigger a splat (in your
backtrace it was zero).

> Thanks,
>
> Tycho

Sebastian
Re: [RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap
On 4/4/19 1:56 AM, Peter Zijlstra wrote: > On Wed, Apr 03, 2019 at 11:34:12AM -0600, Khalid Aziz wrote: >> From: Julian Stecklina >> >> Only the xpfo_kunmap call that needs to actually unmap the page >> needs to be serialized. We need to be careful to handle the case, >> where after the atomic decrement of the mapcount, a xpfo_kmap >> increased the mapcount again. In this case, we can safely skip >> modifying the page table. >> >> Model-checked with up to 4 concurrent callers with Spin. >> >> Signed-off-by: Julian Stecklina >> Signed-off-by: Khalid Aziz >> Cc: Khalid Aziz >> Cc: x...@kernel.org >> Cc: kernel-harden...@lists.openwall.com >> Cc: Vasileios P. Kemerlis >> Cc: Juerg Haefliger >> Cc: Tycho Andersen >> Cc: Marco Benatto >> Cc: David Woodhouse >> --- >> include/linux/xpfo.h | 24 +++- >> 1 file changed, 15 insertions(+), 9 deletions(-) >> >> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h >> index 2318c7eb5fb7..37e7f52fa6ce 100644 >> --- a/include/linux/xpfo.h >> +++ b/include/linux/xpfo.h >> @@ -61,6 +61,7 @@ static inline void xpfo_kmap(void *kaddr, struct page >> *page) >> static inline void xpfo_kunmap(void *kaddr, struct page *page) >> { >> unsigned long flags; >> +bool flush_tlb = false; >> >> if (!static_branch_unlikely(&xpfo_inited)) >> return; >> @@ -72,18 +73,23 @@ static inline void xpfo_kunmap(void *kaddr, struct page >> *page) >> * The page is to be allocated back to user space, so unmap it from >> * the kernel, flush the TLB and tag it as a user page. >> */ >> -spin_lock_irqsave(&page->xpfo_lock, flags); >> - >> if (atomic_dec_return(&page->xpfo_mapcount) == 0) { >> -#ifdef CONFIG_XPFO_DEBUG >> -WARN_ON(PageXpfoUnmapped(page)); >> -#endif >> -SetPageXpfoUnmapped(page); >> -set_kpte(kaddr, page, __pgprot(0)); >> -xpfo_flush_kernel_tlb(page, 0); >> +spin_lock_irqsave(&page->xpfo_lock, flags); >> + >> +/* >> + * In the case, where we raced with kmap after the >> + * atomic_dec_return, we must not nuke the mapping. 
>> +	 */
>> +	if (atomic_read(&page->xpfo_mapcount) == 0) {
>> +		SetPageXpfoUnmapped(page);
>> +		set_kpte(kaddr, page, __pgprot(0));
>> +		flush_tlb = true;
>> +	}
>> +	spin_unlock_irqrestore(&page->xpfo_lock, flags);
>>  	}
>>
>> -	spin_unlock_irqrestore(&page->xpfo_lock, flags);
>> +	if (flush_tlb)
>> +		xpfo_flush_kernel_tlb(page, 0);
>>  }
>
> This doesn't help with the TLB invalidation issue, AFAICT this is still
> completely buggered. kunmap_atomic() can be called from IRQ context.
>

OK. xpfo_kmap/xpfo_kunmap need redesign.

--
Khalid
Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault
On Wed, Apr 03, 2019 at 09:12:16PM -0700, Andy Lutomirski wrote: > On Wed, Apr 3, 2019 at 6:42 PM Tycho Andersen wrote: > > > > On Wed, Apr 03, 2019 at 05:12:56PM -0700, Andy Lutomirski wrote: > > > On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz > > > wrote: > > > > > > > > From: Tycho Andersen > > > > > > > > Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, > > > > and > > > > that might sleep: > > > > > > > > > > > > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > > > > index 9d5c75f02295..7891add0913f 100644 > > > > --- a/arch/x86/mm/fault.c > > > > +++ b/arch/x86/mm/fault.c > > > > @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long > > > > error_code, > > > > /* Executive summary in case the body of the oops scrolled away > > > > */ > > > > printk(KERN_DEFAULT "CR2: %016lx\n", address); > > > > > > > > + /* > > > > +* We're about to oops, which might kill the task. Make sure > > > > we're > > > > +* allowed to sleep. > > > > +*/ > > > > + flags |= X86_EFLAGS_IF; > > > > + > > > > oops_end(flags, regs, sig); > > > > } > > > > > > > > > > > > > NAK. If there's a bug in rewind_stack_do_exit(), please fix it in > > > rewind_stack_do_exit(). > > > > [I trimmed the CC list since google rejected it with E2BIG :)] > > > > I guess the problem is really that do_exit() (or really > > exit_signals()) might sleep. Maybe we should put an irq_enable() at > > the beginning of do_exit() instead and fix this problem for all > > arches? > > > > Hmm. do_exit() isn't really meant to be "try your best to leave the > system somewhat usable without returning" -- it's a function that, > other than in OOPSes, is called from a well-defined state. So I think > rewind_stack_do_exit() is probably a better spot. But we need to > rewind the stack and *then* turn on IRQs, since we otherwise risk > exploding quite badly. Ok, sounds good. I guess we can include something like this patch in the next series. 
Thanks,

Tycho

From 34dce229a4f43f90db823671eb0b8da7c4906045 Mon Sep 17 00:00:00 2001
From: Tycho Andersen
Date: Thu, 4 Apr 2019 09:41:32 -0600
Subject: [PATCH] x86/entry: re-enable interrupts before exiting

If the kernel oopses in an interrupt, nothing re-enables interrupts:

Aug 23 19:30:27 xpfo kernel: [   38.302714] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [   38.303837] in_atomic(): 0, irqs_disabled(): 1, pid: 1970, name: lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [   38.304758] CPU: 3 PID: 1970 Comm: lkdtm_xpfo_test Tainted: G D 4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [   38.305813] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [   38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [   38.307243]  dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [   38.307665]  ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [   38.308139]  __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [   38.308593]  exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [   38.309046]  ? blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [   38.309677]  do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [   38.310078]  ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.310604]  ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [   38.311045]  ? read_user_with_flags+0x60/0x60 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.311680]  rewind_stack_do_exit+0x17/0x20

do_exit() expects to be called in a well-defined environment, so let's
re-enable interrupts after unwinding the stack, in case they were
disabled.
Signed-off-by: Tycho Andersen
---
 arch/x86/entry/entry_32.S | 6 ++++++
 arch/x86/entry/entry_64.S | 6 ++++++
 2 files changed, 12 insertions(+)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..8ddb7b41669d 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1507,6 +1507,12 @@ ENTRY(rewind_stack_do_exit)
 	movl	PER_CPU_VAR(cpu_current_top_of_stack), %esi
 	leal	-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
 
+	/*
+	 * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
+	 * them back on before going back to "normal" code.
+	 */
+	sti
+
 	call	do_exit
 1:	jmp 1b
 END(rewind_stack_do_exit)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..c0759f3e3ad2 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1672,5 +1672,11 @@ ENTRY(rewind_stack_do_exit)
 	leaq	-PTREGS_SIZE(%rax), %rsp
 	UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
 
+	/*
+	 * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
+	 * them back on before going back to "normal" code.
+	 */
+	sti
+
Re: 5.1-rc1: mpt init crash in scsi_map_dma, dma_4v_map_sg on sparc64
On Thu, Apr 04, 2019 at 06:38:53PM +0300, Meelis Roos wrote:
> > > Yes, reverting this commit makes my T1000 boot.
> >
> > Does the patch attached to the last mail work as well?
>
> Sorry for misreading your mail - tested now and yes, it works.

Thanks, I'll submit it with a proper changelog.
Re: [RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64
[trimmed To: and Cc: It was too large] On 4/4/19 1:52 AM, Peter Zijlstra wrote: > On Wed, Apr 03, 2019 at 11:34:05AM -0600, Khalid Aziz wrote: >> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h >> index 2779ace16d23..5c0e1581fa56 100644 >> --- a/arch/x86/include/asm/pgtable.h >> +++ b/arch/x86/include/asm/pgtable.h >> @@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void) >> return boot_cpu_has_bug(X86_BUG_L1TF); >> } >> >> +/* >> + * The current flushing context - we pass it instead of 5 arguments: >> + */ >> +struct cpa_data { >> +unsigned long *vaddr; >> +pgd_t *pgd; >> +pgprot_tmask_set; >> +pgprot_tmask_clr; >> +unsigned long numpages; >> +unsigned long curpage; >> +unsigned long pfn; >> +unsigned intflags; >> +unsigned intforce_split : 1, >> +force_static_prot : 1; >> +struct page **pages; >> +}; >> + >> + >> +int >> +should_split_large_page(pte_t *kpte, unsigned long address, >> +struct cpa_data *cpa); >> +extern spinlock_t cpa_lock; >> +int >> +__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address, >> + struct page *base); >> + > > I really hate exposing all that. I believe this was done so set_kpte() could split large pages if needed. I will look into creating a helper function instead so this does not have to be exposed. > >> #include >> #endif /* __ASSEMBLY__ */ >> > >> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c >> new file mode 100644 >> index ..3045bb7e4659 >> --- /dev/null >> +++ b/arch/x86/mm/xpfo.c >> @@ -0,0 +1,123 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P. >> + * Copyright (C) 2016 Brown University. All rights reserved. >> + * >> + * Authors: >> + * Juerg Haefliger >> + * Vasileios P. 
Kemerlis >> + * >> + * This program is free software; you can redistribute it and/or modify it >> + * under the terms of the GNU General Public License version 2 as published >> by >> + * the Free Software Foundation. >> + */ >> + >> +#include >> + >> +#include >> + >> +extern spinlock_t cpa_lock; >> + >> +/* Update a single kernel page table entry */ >> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot) >> +{ >> +unsigned int level; >> +pgprot_t msk_clr; >> +pte_t *pte = lookup_address((unsigned long)kaddr, &level); >> + >> +if (unlikely(!pte)) { >> +WARN(1, "xpfo: invalid address %p\n", kaddr); >> +return; >> +} >> + >> +switch (level) { >> +case PG_LEVEL_4K: >> +set_pte_atomic(pte, pfn_pte(page_to_pfn(page), >> + canon_pgprot(prot))); > > (sorry, do we also need a nikon_pgprot() ? :-) Are we trying to encourage nikon as well to sponsor this patch :) > >> +break; >> +case PG_LEVEL_2M: >> +case PG_LEVEL_1G: { >> +struct cpa_data cpa = { }; >> +int do_split; >> + >> +if (level == PG_LEVEL_2M) >> +msk_clr = pmd_pgprot(*(pmd_t *)pte); >> +else >> +msk_clr = pud_pgprot(*(pud_t *)pte); >> + >> +cpa.vaddr = kaddr; >> +cpa.pages = &page; >> +cpa.mask_set = prot; >> +cpa.mask_clr = msk_clr; >> +cpa.numpages = 1; >> +cpa.flags = 0; >> +cpa.curpage = 0; >> +cpa.force_split = 0; >> + >> + >> +do_split = should_split_large_page(pte, (unsigned long)kaddr, >> + &cpa); >> +if (do_split) { >> +struct page *base; >> + >> +base = alloc_pages(GFP_ATOMIC, 0); >> +if (!base) { >> +WARN(1, "xpfo: failed to split large page\n"); > > You have to be fcking kidding right? A WARN when a GFP_ATOMIC allocation > fails?! > Not sure what the reasoning was for this WARN in original patch, but I think this is trying to warn about failure to split the large page in order to unmap this single page as opposed to warning about allocation failure. Nevertheless this could be done better. 
>> +break; >> +} >> + >> +if (!debug_pagealloc_enabled()) >> +spin_lock(&cpa_lock); >> +if (__split_large_page(&cpa, pte, (unsigned long)kaddr, >> +base) < 0) { >> +__free_page(base); >> +WARN(1, "xpfo: failed to split large page\n"); >> +} >> +if (!debug_pagealloc_enabled()) >> +spin_unlock(&cpa_lock); >> +} >> + >> +break; > > Ever h
Re: 5.1-rc1: mpt init crash in scsi_map_dma, dma_4v_map_sg on sparc64
> > Yes, reverting this commit makes my T1000 boot.
>
> Does the patch attached to the last mail work as well?

Sorry for misreading your mail - tested now and yes, it works.

-- 
Meelis Roos
Re: [PATCH v2 0/2] iommu/arm-smmu-v3: make sure the kdump kernel can work well when smmu is enabled
Hi Zhen Lei,

On Mon, Mar 18, 2019 at 09:12:41PM +0800, Zhen Lei wrote:
> v1 --> v2:
> 1. Drop part2. Now, we only use the SMMUv3 hardware feature STE.config=0b000
> (Report abort to device, no event recorded) to suppress the event messages
> caused by the unexpected devices.
> 2. rewrite the patch description.

This issue came up a while back:

https://lore.kernel.org/linux-pci/20180302103032.gb19...@arm.com/

and I'd still prefer to solve it using the disable_bypass logic which
we already have. Something along the lines of the diff below?

We're relying on the DMA API not subsequently requesting a passthrough
domain, but it should only do that if you've configured your
crashkernel to do so.

Will

--->8

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index d3880010c6cf..91b8f3b2ee25 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2454,13 +2454,9 @@ static int arm_smmu_device_reset(struct arm_smmu_device *smmu, bool bypass)
 	/* Clear CR0 and sync (disables SMMU and queue processing) */
 	reg = readl_relaxed(smmu->base + ARM_SMMU_CR0);
 	if (reg & CR0_SMMUEN) {
-		if (is_kdump_kernel()) {
-			arm_smmu_update_gbpa(smmu, GBPA_ABORT, 0);
-			arm_smmu_device_disable(smmu);
-			return -EBUSY;
-		}
-
 		dev_warn(smmu->dev, "SMMU currently enabled! Resetting...\n");
+		WARN_ON(is_kdump_kernel() && !disable_bypass);
+		arm_smmu_update_gbpa(smmu, GBPA_ABORT, 0);
 	}
 
 	ret = arm_smmu_device_disable(smmu);
Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
On 4/4/19 1:43 AM, Peter Zijlstra wrote: > > You must be so glad I no longer use kmap_atomic from NMI context :-) > > On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote: >> +static inline void xpfo_kmap(void *kaddr, struct page *page) >> +{ >> +unsigned long flags; >> + >> +if (!static_branch_unlikely(&xpfo_inited)) >> +return; >> + >> +if (!PageXpfoUser(page)) >> +return; >> + >> +/* >> + * The page was previously allocated to user space, so >> + * map it back into the kernel if needed. No TLB flush required. >> + */ >> +spin_lock_irqsave(&page->xpfo_lock, flags); >> + >> +if ((atomic_inc_return(&page->xpfo_mapcount) == 1) && >> +TestClearPageXpfoUnmapped(page)) >> +set_kpte(kaddr, page, PAGE_KERNEL); >> + >> +spin_unlock_irqrestore(&page->xpfo_lock, flags); > > That's a really sad sequence, not wrong, but sad. _3_ atomic operations, > 2 on likely the same cacheline. And mostly all pointless. > > This patch makes xpfo_mapcount an atomic, but then all modifications are > under the spinlock, what gives? > > Anyway, a possibly saner sequence might be: > > if (atomic_inc_not_zero(&page->xpfo_mapcount)) > return; > > spin_lock_irqsave(&page->xpfo_lock, flag); > if ((atomic_inc_return(&page->xpfo_mapcount) == 1) && > TestClearPageXpfoUnmapped(page)) > set_kpte(kaddr, page, PAGE_KERNEL); > spin_unlock_irqrestore(&page->xpfo_lock, flags); > >> +} >> + >> +static inline void xpfo_kunmap(void *kaddr, struct page *page) >> +{ >> +unsigned long flags; >> + >> +if (!static_branch_unlikely(&xpfo_inited)) >> +return; >> + >> +if (!PageXpfoUser(page)) >> +return; >> + >> +/* >> + * The page is to be allocated back to user space, so unmap it from >> + * the kernel, flush the TLB and tag it as a user page. 
>> + */ >> +spin_lock_irqsave(&page->xpfo_lock, flags); >> + >> +if (atomic_dec_return(&page->xpfo_mapcount) == 0) { >> +#ifdef CONFIG_XPFO_DEBUG >> +WARN_ON(PageXpfoUnmapped(page)); >> +#endif >> +SetPageXpfoUnmapped(page); >> +set_kpte(kaddr, page, __pgprot(0)); >> +xpfo_flush_kernel_tlb(page, 0); > > You didn't speak about the TLB invalidation anywhere. But basically this > is one that x86 cannot do. > >> +} >> + >> +spin_unlock_irqrestore(&page->xpfo_lock, flags); > > Idem: > > if (atomic_add_unless(&page->xpfo_mapcount, -1, 1)) > return; > > > > >> +} > > Also I'm failing to see the point of PG_xpfo_unmapped, afaict it > is identical to !atomic_read(&page->xpfo_mapcount). > Thanks Peter. I really appreciate your review. Your feedback helps make this code better and closer to where I can feel comfortable not calling it RFC any more. The more I look at xpfo_kmap()/xpfo_kunmap() code, the more I get uncomfortable with it. As you pointed out about calling kmap_atomic from NMI context, that just makes the kmap_atomic code look even worse. I pointed out one problem with this code in cover letter and suggested a rewrite. I see these problems with this code: 1. When xpfo_kmap maps a page back in physmap, it opens up the ret2dir attack security hole again even if just for the duration of kmap. A kmap can stay around for some time if the page is being used for I/O. 2. This code uses spinlock which leads to problems. If it does not disable IRQ, it is exposed to deadlock around xpfo_lock. If it disables IRQ, I think it can still deadlock around pgd_lock. I think a better implementation of xpfo_kmap()/xpfo_kunmap() would map the page at a new virtual address similar to what kmap_high for i386 does. This avoids re-opening the ret2dir security hole. We can also possibly do away with xpfo_lock saving bytes in page-frame and the not so sane code sequence can go away. Good point about PG_xpfo_unmapped flag. 
You are right that it can be replaced with !atomic_read(&page->xpfo_mapcount).

Thanks,
Khalid
Re: [PATCH v2] iommu/arm-smmu: Break insecure users by disabling bypass by default
On Fri, Mar 01, 2019 at 11:20:17AM -0800, Douglas Anderson wrote:
> If you're bisecting why your peripherals stopped working, it's
> probably this CL. Specifically if you see this in your dmesg:
>   Unexpected global fault, this could be serious
> ...then it's almost certainly this CL.
>
> Running your IOMMU-enabled peripherals with the IOMMU in bypass mode
> is insecure and effectively disables the protection they provide.
> There are few reasons to allow unmatched stream bypass, and even
> fewer good ones.
>
> This patch starts the transition over to make it much harder to run
> your system insecurely. Expected steps:
>
> 1. By default disable bypass (so anyone insecure will notice) but make
>    it easy for someone to re-enable bypass with just a KConfig change.
>    That's this patch.
>
> 2. After people have had a little time to come to grips with the fact
>    that they need to set their IOMMUs properly and have had time to
>    dig into how to do this, the KConfig will be eliminated and bypass
>    will simply be disabled. Folks who are truly upset and still
>    haven't fixed their system can either figure out how to add
>    'arm-smmu.disable_bypass=n' to their command line or revert the
>    patch in their own private kernel. Of course these folks will be
>    less secure.
>
> Suggested-by: Robin Murphy
> Signed-off-by: Douglas Anderson
> ---
>
> Changes in v2:
> - Flipped default to 'yes' and changed comments a lot.
>
>  drivers/iommu/Kconfig    | 25 +++++++++++++++++++++++++
>  drivers/iommu/arm-smmu.c |  3 ++-
>  2 files changed, 27 insertions(+), 1 deletion(-)

Cheers, I'll pick this one up for 5.2.

Will
Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
On Thu, Apr 04, 2019 at 09:21:52AM +0200, Peter Zijlstra wrote: > On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote: > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > > index 2c471a2c43fa..d17d33f36a01 100644 > > --- a/include/linux/mm_types.h > > +++ b/include/linux/mm_types.h > > @@ -204,6 +204,14 @@ struct page { > > #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS > > int _last_cpupid; > > #endif > > + > > +#ifdef CONFIG_XPFO > > + /* Counts the number of times this page has been kmapped. */ > > + atomic_t xpfo_mapcount; > > + > > + /* Serialize kmap/kunmap of this page */ > > + spinlock_t xpfo_lock; > > NAK, see ALLOC_SPLIT_PTLOCKS > > spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_ > sure you want spinlock_t and not raw_spinlock_t ? For > CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex. > > > +#endif > > Growing the page-frame by 8 bytes (in the good case) is really sad, > that's a _lot_ of memory. Originally we had this in page_ext; it's not really clear to me why we moved it out. Julien? Tycho
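Peter's ALLOC_SPLIT_PTLOCKS pointer is worth unpacking: when the lock type is bloated by debug options, the kernel keeps only a pointer in the containing structure and allocates the lock separately, so the common layout does not grow with the lock. A minimal userspace sketch of that shape (all type and function names here are hypothetical stand-ins, not the kernel's implementation):

```c
#include <stdlib.h>

/* Stand-in for a spinlock_t bloated by debug options
 * (e.g. CONFIG_PROVE_LOCKING); the sizes are made up. */
struct debug_spinlock {
	int locked;
	char lockdep_state[56];
};

/* Embedding the lock: every page frame grows by the full lock size. */
struct page_embedded {
	unsigned long flags;
	struct debug_spinlock lock;
};

/* Split-ptlock shape: the frame carries only a pointer and the lock is
 * allocated on demand, so the frame stays small no matter how big the
 * lock type gets. */
struct page_split {
	unsigned long flags;
	struct debug_spinlock *ptl;
};

static struct debug_spinlock *ptlock_alloc(struct page_split *page)
{
	page->ptl = calloc(1, sizeof(*page->ptl));
	return page->ptl;
}
```

The trade-off is an extra allocation and pointer chase on the slow path, in exchange for not paying the debug-lock size in every page frame.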
Re: [PATCH 2/4] iommu/arm-smmu-v3: Link domains and devices
Hi Jean-Philippe, First off, thanks for posting this: it's definitely something that I'm keen to support, and getting bits in a piece at a time is probably a good idea. On Wed, Mar 20, 2019 at 05:36:32PM +, Jean-Philippe Brucker wrote: > When removing a mapping from a domain, we need to send an invalidation to > all devices that might have stored it in their Address Translation Cache > (ATC). In addition when updating the context descriptor of a live domain, > we'll need to send invalidations for all devices attached to it. > > Maintain a list of devices in each domain, protected by a spinlock. It is > updated every time we attach or detach devices to and from domains. > > It needs to be a spinlock because we'll invalidate ATC entries from > within hardirq-safe contexts, but it may be possible to relax the read > side with RCU later. > > Signed-off-by: Jean-Philippe Brucker > --- > drivers/iommu/arm-smmu-v3.c | 28 > 1 file changed, 28 insertions(+) > > diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c > index d3880010c6cf..66a29c113dbc 100644 > --- a/drivers/iommu/arm-smmu-v3.c > +++ b/drivers/iommu/arm-smmu-v3.c > @@ -594,6 +594,11 @@ struct arm_smmu_device { > struct arm_smmu_master_data { > struct arm_smmu_device *smmu; > struct arm_smmu_strtab_ent ste; > + > + struct arm_smmu_domain *domain; > + struct list_headdomain_head; > + > + struct device *dev; > }; > > /* SMMU private data for an IOMMU domain */ > @@ -618,6 +623,9 @@ struct arm_smmu_domain { > }; > > struct iommu_domain domain; > + > + struct list_headdevices; > + spinlock_t devices_lock; > }; > > struct arm_smmu_option_prop { > @@ -1493,6 +1501,9 @@ static struct iommu_domain > *arm_smmu_domain_alloc(unsigned type) > } > > mutex_init(&smmu_domain->init_mutex); > + INIT_LIST_HEAD(&smmu_domain->devices); > + spin_lock_init(&smmu_domain->devices_lock); I'm wondering whether we can't take this a bit further and re-organise the data structures to make this a little simpler overall. 
Something along the lines of: struct arm_smmu_master_data { struct list_head list; // masters in the same domain struct arm_smmu_device *smmu; unsigned int num_sids; u32 *sids; // Points into fwspec struct arm_smmu_domain *domain; // NULL -> !assigned }; and then just add a list_head into struct arm_smmu_domain to track the masters that have been attached (if you're feeling brave, you could put this into the s1_cfg). The ATC invalidation logic would then be: - Detaching a device: walk over the sids from the master data - Unmapping a range from a domain: walk over the attached masters I think this would also allow us to remove struct arm_smmu_strtab_ent completely. Dunno: this is one of those things where you really have to try it to figure out why it doesn't work... Will
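The two invalidation paths Will describes fall straight out of the per-domain master list. A simplified userspace model (the list representation, names, and counter stand-ins are assumptions for illustration, not the driver's actual code — the real driver would queue CMD_ATC_INV commands instead of counting):

```c
#include <stddef.h>

struct master {
	struct master *next;		/* masters attached to the same domain */
	unsigned int num_sids;
	const unsigned int *sids;	/* points into fwspec in the real driver */
	struct domain *domain;		/* NULL -> !assigned */
};

struct domain {
	struct master *masters;		/* head of the attached-master list */
};

/* Detaching a device: invalidate the ATC for that master's SIDs only. */
static int atc_inval_master(const struct master *m)
{
	int sent = 0;

	for (unsigned int i = 0; i < m->num_sids; i++)
		sent++;			/* would queue one invalidation per SID */
	return sent;
}

/* Unmapping a range from a domain: walk all attached masters. */
static int atc_inval_domain(const struct domain *d)
{
	int sent = 0;

	for (const struct master *m = d->masters; m; m = m->next)
		sent += atc_inval_master(m);
	return sent;
}
```

This also shows why the list must tolerate hardirq-safe contexts in the real driver: the domain walk runs on the unmap path.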
Re: 5.1-rc1: mpt init crash in scsi_map_dma, dma_4v_map_sg on sparc64
On Thu, Apr 04, 2019 at 10:11:43AM +0300, Meelis Roos wrote: > > I think this might have been this commit: > > > > commit 24132a419c68f1d69eb8ecc91b3c80d730ecbb59 > > Author: Christoph Hellwig > > Date: Fri Feb 15 09:30:28 2019 +0100 > > > > sparc64/pci_sun4v: allow large DMA masks > > > > the patch below adds a few missing checks and hopefully should fix > > your problem. If not can you try to revert the commit to check if > > my theory was correct to start with? > Yes, reverting this commit makes my T1000 boot. Does the patch attached to the last mail work as well?
Re: [PATCH] driver core: Postpone DMA tear-down until after devres release for probe failure
On 03/04/2019 10:20, John Garry wrote: On 03/04/2019 09:14, Greg KH wrote: On Wed, Apr 03, 2019 at 09:02:36AM +0100, John Garry wrote: On 28/03/2019 10:08, John Garry wrote: In commit 376991db4b64 ("driver core: Postpone DMA tear-down until after devres release"), we changed the ordering of tearing down the device DMA ops and releasing all the device's resources; this was because the DMA ops should be maintained until we release the device's managed DMA memories. Hi all, A friendly reminder on this patch... I didn't see any update. I thought that it had some importance. Thanks, John However, we have seen another crash on an arm64 system when a device driver probe fails: hisi_sas_v3_hw :74:02.0: Adding to iommu group 2 scsi host1: hisi_sas_v3_hw BUG: Bad page state in process swapper/0 pfn:313f5 page:7ec4fd40 count:1 mapcount:0 mapping: index:0x0 flags: 0xfffe0001000(reserved) raw: 0fffe0001000 7ec4fd48 7ec4fd48 raw: 0001 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: 0x1000(reserved) Modules linked in: CPU: 49 PID: 1 Comm: swapper/0 Not tainted 5.1.0-rc1-43081-g22d97fd-dirty #1433 Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI RC0 - V1.12.01 01/29/2019 Call trace: dump_backtrace+0x0/0x118 show_stack+0x14/0x1c dump_stack+0xa4/0xc8 bad_page+0xe4/0x13c free_pages_check_bad+0x4c/0xc0 __free_pages_ok+0x30c/0x340 __free_pages+0x30/0x44 __dma_direct_free_pages+0x30/0x38 dma_direct_free+0x24/0x38 dma_free_attrs+0x9c/0xd8 dmam_release+0x20/0x28 release_nodes+0x17c/0x220 devres_release_all+0x34/0x54 really_probe+0xc4/0x2c8 driver_probe_device+0x58/0xfc device_driver_attach+0x68/0x70 __driver_attach+0x94/0xdc bus_for_each_dev+0x5c/0xb4 driver_attach+0x20/0x28 bus_add_driver+0x14c/0x200 driver_register+0x6c/0x124 __pci_register_driver+0x48/0x50 sas_v3_pci_driver_init+0x20/0x28 do_one_initcall+0x40/0x25c kernel_init_freeable+0x2b8/0x3c0 kernel_init+0x10/0x100 ret_from_fork+0x10/0x18 Disabling lock debugging due to kernel taint BUG: Bad 
page state in process swapper/0 pfn:313f6 page:7ec4fd80 count:1 mapcount:0 mapping: index:0x0 [ 89.322983] flags: 0xfffe0001000(reserved) raw: 0fffe0001000 7ec4fd88 7ec4fd88 raw: 0001 The crash occurs for the same reason. In this case, on the really_probe() failure path, we are still clearing the DMA ops prior to releasing the device's managed memories. This patch fixes this issue by reordering the DMA ops teardown and the call to devres_release_all() on the failure path. Reported-by: Xiang Chen Tested-by: Xiang Chen Signed-off-by: John Garry So does this "fix" 376991db4b64? If so, should this be added to the patch and also backported to the stable trees? Hi Greg, No, I don't think so. I'd say it supplements it. Here I'm trying to fix up another path in which we tear down the DMA ops prior to releasing the device's resources. I didn't add a fixes tag as 376991db4b64 didn't have one either. It will need to be backported to stable, I figure the same as 376991db4b64. So 376991db4b64 required manual backporting to stable, and this patch would require the same: https://www.spinics.net/lists/stable/msg289685.html Robin, any further comment? Thanks, John thanks, greg k-h . ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v1 3/3] arm64: dts: qcom: msm8998: Add PCIe PHY and RC nodes
On 28/03/2019 18:06, Marc Gonzalez wrote: > Add MSM8998 PCIe QMP PHY and PCIe root complex DT nodes. I suppose I should add a reference to the downstream DTS: https://source.codeaurora.org/quic/la/kernel/msm-4.4/tree/arch/arm/boot/dts/qcom/msm8998.dtsi?h=LE.UM.1.3.r3.25#n2537 > diff --git a/arch/arm64/boot/dts/qcom/msm8998.dtsi > b/arch/arm64/boot/dts/qcom/msm8998.dtsi > index 5a1c0961b281..9f979a51f679 100644 > --- a/arch/arm64/boot/dts/qcom/msm8998.dtsi > +++ b/arch/arm64/boot/dts/qcom/msm8998.dtsi > @@ -621,6 +621,84 @@ > ; > }; > > + pcie0: pci@1c0 { > + compatible = "qcom,pcie-msm8996"; Should probably be "qcom,pcie-msm8998"; in case we need to fudge something in 8998 and not in 8996. Apart from this minor issue, I think it's good to go for 5.2 (I'll respin without the SMMU patch). I just need Kishon to take the PHY series, and Stephen to take the clock hack, and we're rolling. Oh and it would be useful to have someone review the PCIE20_PARF_BDF_TRANSLATE_N patch. Regards.
Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
On Thu, Apr 04, 2019 at 09:21:52AM +0200, Peter Zijlstra wrote: > On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote: > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > > index 2c471a2c43fa..d17d33f36a01 100644 > > --- a/include/linux/mm_types.h > > +++ b/include/linux/mm_types.h > > @@ -204,6 +204,14 @@ struct page { > > #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS > > int _last_cpupid; > > #endif > > + > > +#ifdef CONFIG_XPFO > > + /* Counts the number of times this page has been kmapped. */ > > + atomic_t xpfo_mapcount; > > + > > + /* Serialize kmap/kunmap of this page */ > > + spinlock_t xpfo_lock; > > NAK, see ALLOC_SPLIT_PTLOCKS > > spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_ > sure you want spinlock_t and not raw_spinlock_t ? For > CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex. > > > +#endif > > Growing the page-frame by 8 bytes (in the good case) is really sad, > that's a _lot_ of memory. So if you use the original kmap_atomic/kmap code from i386 and create an alias per user you can do away with all that. Now, that leaves you with the fixmap kmap_atomic code, which I also hate, but it gets rid of a lot of the ugly you introduce in these here patches. As to the fixmap kmap_atomic; so fundamentally the PTEs are only used on a single CPU and therefore CPU local TLB invalidation _should_ suffice. However, speculation... Another CPU can speculatively hit upon a fixmap entry for another CPU and populate its own TLB entry. Then the TLB invalidate is insufficient; it leaves a stale entry in a remote TLB. If the remote CPU then re-uses that fixmap slot to alias another page, we have two CPUs with different translations for the same VA, a condition that AMD CPUs dislike enough to machine check on (IIRC). Actually hitting that is incredibly difficult (we have to have speculation, fixmap reuse and not get a full TLB invalidate in between), but, afaict, not impossible.
Your monstrosity from the last patch avoids this particular issue by not aliasing in this manner, but it comes at the cost of this page-frame bloat. Also, I'm still not sure there aren't other problems with it. Bah.
Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
On Wed, Apr 03, 2019 at 11:34:13AM -0600, Khalid Aziz wrote: > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c > index 999d6d8f0bef..cc806a01a0eb 100644 > --- a/arch/x86/mm/tlb.c > +++ b/arch/x86/mm/tlb.c > @@ -37,6 +37,20 @@ > */ > #define LAST_USER_MM_IBPB 0x1UL > > +/* > + * A TLB flush may be needed to flush stale TLB entries > + * for pages that have been mapped into userspace and unmapped > + * from kernel space. This TLB flush needs to be propagated to > + * all CPUs. Asynchronous flush requests to all CPUs can cause > + * significant performance impact. Queue a pending flush for > + * a CPU instead. Multiple of these requests can then be handled > + * by a CPU at a less disruptive time, like context switch, in > + * one go and reduce performance impact significantly. Following > + * data structure is used to keep track of CPUs with pending full > + * TLB flush forced by xpfo. > + */ > +static cpumask_t pending_xpfo_flush; > + > /* > * We get here when we do something requiring a TLB invalidation > * but could not go invalidate all of the contexts. We do the > @@ -321,6 +335,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct > mm_struct *next, > __flush_tlb_all(); > } > #endif > + > + /* > + * If there is a pending TLB flush for this CPU due to XPFO > + * flush, do it now. > + */ > + if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) { > + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); > + __flush_tlb_all(); > + } That really should be: if (cpumask_test_cpu(cpu, &pending_xpfo_flush)) { cpumask_clear_cpu(cpu, &pending_xpfo_flush); count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); __flush_tlb_all(); } test_and_clear is an unconditional RmW and can cause cacheline contention between adjacent CPUs even if none of the bits are set.
> + > this_cpu_write(cpu_tlbstate.is_lazy, false); > > /* > @@ -803,6 +827,34 @@ void flush_tlb_kernel_range(unsigned long start, > unsigned long end) > } > } > > +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end) > +{ > + struct cpumask tmp_mask; > + > + /* > + * Balance as user space task's flush, a bit conservative. > + * Do a local flush immediately and post a pending flush on all > + * other CPUs. Local flush can be a range flush or full flush > + * depending upon the number of entries to be flushed. Remote > + * flushes will be done by individual processors at the time of > + * context switch and this allows multiple flush requests from > + * other CPUs to be batched together. > + */ > + if (end == TLB_FLUSH_ALL || > + (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) { > + do_flush_tlb_all(NULL); > + } else { > + struct flush_tlb_info info; > + > + info.start = start; > + info.end = end; > + do_kernel_range_flush(&info); > + } > + cpumask_setall(&tmp_mask); > + __cpumask_clear_cpu(smp_processor_id(), &tmp_mask); > + cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask); > +} > + > void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) > { > struct flush_tlb_info info = { > diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c > index b42513347865..638eee5b1f09 100644 > --- a/arch/x86/mm/xpfo.c > +++ b/arch/x86/mm/xpfo.c > @@ -118,7 +118,7 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int > order) > return; > } > > - flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size); > + xpfo_flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size); > } > EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb); So this patch is the one that makes it 'work', but I'm with Andy on hating it something fierce. Up until this point x86_64 is completely buggered in this series, after this it sorta works but *urgh* what crap. All in all your changelog is complete and utter garbage, this is _NOT_ a performance issue. 
It is very much a correctness issue. Also, I distinctly dislike the inconsistent TLB states this generates; it makes it very hard to argue for its correctness.
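Peter's remark that cpumask_test_and_clear_cpu is an unconditional read-modify-write can be demonstrated with a toy bitmask: count the RmW operations each variant performs when the bit is not set. This is a userspace sketch with hypothetical helper names standing in for the kernel cpumask API:

```c
#include <stdatomic.h>
#include <stdbool.h>

static int rmw_ops;	/* counts atomic read-modify-writes issued */

static bool test_and_clear_bit(int bit, atomic_ulong *mask)
{
	rmw_ops++;	/* unconditional RmW: the cacheline goes exclusive */
	return atomic_fetch_and(mask, ~(1UL << bit)) & (1UL << bit);
}

static bool test_bit(int bit, atomic_ulong *mask)
{
	return atomic_load(mask) & (1UL << bit);	/* read-only; line can stay shared */
}

static void clear_bit(int bit, atomic_ulong *mask)
{
	rmw_ops++;
	atomic_fetch_and(mask, ~(1UL << bit));
}

/* Peter's suggested shape: only do the RmW when the bit is actually set. */
static bool check_pending(int bit, atomic_ulong *mask)
{
	if (!test_bit(bit, mask))
		return false;
	clear_bit(bit, mask);
	return true;
}
```

In the common case (no pending flush for this CPU) the test-first shape never dirties the cacheline, which is exactly the contention test_and_clear avoids failing to avoid.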
Re: [RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap
On Wed, Apr 03, 2019 at 11:34:12AM -0600, Khalid Aziz wrote: > From: Julian Stecklina > > Only the xpfo_kunmap call that needs to actually unmap the page > needs to be serialized. We need to be careful to handle the case, > where after the atomic decrement of the mapcount, a xpfo_kmap > increased the mapcount again. In this case, we can safely skip > modifying the page table. > > Model-checked with up to 4 concurrent callers with Spin. > > Signed-off-by: Julian Stecklina > Signed-off-by: Khalid Aziz > Cc: Khalid Aziz > Cc: x...@kernel.org > Cc: kernel-harden...@lists.openwall.com > Cc: Vasileios P. Kemerlis > Cc: Juerg Haefliger > Cc: Tycho Andersen > Cc: Marco Benatto > Cc: David Woodhouse > --- > include/linux/xpfo.h | 24 +++- > 1 file changed, 15 insertions(+), 9 deletions(-) > > diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h > index 2318c7eb5fb7..37e7f52fa6ce 100644 > --- a/include/linux/xpfo.h > +++ b/include/linux/xpfo.h > @@ -61,6 +61,7 @@ static inline void xpfo_kmap(void *kaddr, struct page *page) > static inline void xpfo_kunmap(void *kaddr, struct page *page) > { > unsigned long flags; > + bool flush_tlb = false; > > if (!static_branch_unlikely(&xpfo_inited)) > return; > @@ -72,18 +73,23 @@ static inline void xpfo_kunmap(void *kaddr, struct page > *page) >* The page is to be allocated back to user space, so unmap it from >* the kernel, flush the TLB and tag it as a user page. >*/ > - spin_lock_irqsave(&page->xpfo_lock, flags); > - > if (atomic_dec_return(&page->xpfo_mapcount) == 0) { > -#ifdef CONFIG_XPFO_DEBUG > - WARN_ON(PageXpfoUnmapped(page)); > -#endif > - SetPageXpfoUnmapped(page); > - set_kpte(kaddr, page, __pgprot(0)); > - xpfo_flush_kernel_tlb(page, 0); > + spin_lock_irqsave(&page->xpfo_lock, flags); > + > + /* > + * In the case, where we raced with kmap after the > + * atomic_dec_return, we must not nuke the mapping. 
> + */ > + if (atomic_read(&page->xpfo_mapcount) == 0) { > + SetPageXpfoUnmapped(page); > + set_kpte(kaddr, page, __pgprot(0)); > + flush_tlb = true; > + } > + spin_unlock_irqrestore(&page->xpfo_lock, flags); > } > > - spin_unlock_irqrestore(&page->xpfo_lock, flags); > + if (flush_tlb) > + xpfo_flush_kernel_tlb(page, 0); > } This doesn't help with the TLB invalidation issue, AFAICT this is still completely buggered. kunmap_atomic() can be called from IRQ context.
Re: [RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64
On Wed, Apr 03, 2019 at 11:34:05AM -0600, Khalid Aziz wrote: > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h > index 2779ace16d23..5c0e1581fa56 100644 > --- a/arch/x86/include/asm/pgtable.h > +++ b/arch/x86/include/asm/pgtable.h > @@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void) > return boot_cpu_has_bug(X86_BUG_L1TF); > } > > +/* > + * The current flushing context - we pass it instead of 5 arguments: > + */ > +struct cpa_data { > + unsigned long *vaddr; > + pgd_t *pgd; > + pgprot_tmask_set; > + pgprot_tmask_clr; > + unsigned long numpages; > + unsigned long curpage; > + unsigned long pfn; > + unsigned intflags; > + unsigned intforce_split : 1, > + force_static_prot : 1; > + struct page **pages; > +}; > + > + > +int > +should_split_large_page(pte_t *kpte, unsigned long address, > + struct cpa_data *cpa); > +extern spinlock_t cpa_lock; > +int > +__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address, > +struct page *base); > + I really hate exposing all that. > #include > #endif /* __ASSEMBLY__ */ > > diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c > new file mode 100644 > index ..3045bb7e4659 > --- /dev/null > +++ b/arch/x86/mm/xpfo.c > @@ -0,0 +1,123 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P. > + * Copyright (C) 2016 Brown University. All rights reserved. > + * > + * Authors: > + * Juerg Haefliger > + * Vasileios P. Kemerlis > + * > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of the GNU General Public License version 2 as published > by > + * the Free Software Foundation. 
> + */ > + > +#include > + > +#include > + > +extern spinlock_t cpa_lock; > + > +/* Update a single kernel page table entry */ > +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot) > +{ > + unsigned int level; > + pgprot_t msk_clr; > + pte_t *pte = lookup_address((unsigned long)kaddr, &level); > + > + if (unlikely(!pte)) { > + WARN(1, "xpfo: invalid address %p\n", kaddr); > + return; > + } > + > + switch (level) { > + case PG_LEVEL_4K: > + set_pte_atomic(pte, pfn_pte(page_to_pfn(page), > +canon_pgprot(prot))); (sorry, do we also need a nikon_pgprot() ? :-) > + break; > + case PG_LEVEL_2M: > + case PG_LEVEL_1G: { > + struct cpa_data cpa = { }; > + int do_split; > + > + if (level == PG_LEVEL_2M) > + msk_clr = pmd_pgprot(*(pmd_t *)pte); > + else > + msk_clr = pud_pgprot(*(pud_t *)pte); > + > + cpa.vaddr = kaddr; > + cpa.pages = &page; > + cpa.mask_set = prot; > + cpa.mask_clr = msk_clr; > + cpa.numpages = 1; > + cpa.flags = 0; > + cpa.curpage = 0; > + cpa.force_split = 0; > + > + > + do_split = should_split_large_page(pte, (unsigned long)kaddr, > +&cpa); > + if (do_split) { > + struct page *base; > + > + base = alloc_pages(GFP_ATOMIC, 0); > + if (!base) { > + WARN(1, "xpfo: failed to split large page\n"); You have to be fcking kidding right? A WARN when a GFP_ATOMIC allocation fails?! > + break; > + } > + > + if (!debug_pagealloc_enabled()) > + spin_lock(&cpa_lock); > + if (__split_large_page(&cpa, pte, (unsigned long)kaddr, > + base) < 0) { > + __free_page(base); > + WARN(1, "xpfo: failed to split large page\n"); > + } > + if (!debug_pagealloc_enabled()) > + spin_unlock(&cpa_lock); > + } > + > + break; Ever heard of helper functions? > + } > + case PG_LEVEL_512G: > + /* fallthrough, splitting infrastructure doesn't > + * support 512G pages. > + */ Broken comment style.
> + default: > + WARN(1, "xpfo: unsupported page level %x\n", level); > + } > + > +} > +EXPORT_SYMBOL_GPL(set_kpte); > + > +inline void xpfo_flush_kernel_tlb(struct page *page, int order) > +{ > + int level; > + unsigned long size, kaddr; > + > + kaddr = (unsigned long)page_address(page); > + > + if (unlikely(!lookup_address(kaddr, &level))) { > + WARN(1, "xpfo: invalid address to flush %lx %d\n", kadd
Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
You must be so glad I no longer use kmap_atomic from NMI context :-) On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote: > +static inline void xpfo_kmap(void *kaddr, struct page *page) > +{ > + unsigned long flags; > + > + if (!static_branch_unlikely(&xpfo_inited)) > + return; > + > + if (!PageXpfoUser(page)) > + return; > + > + /* > + * The page was previously allocated to user space, so > + * map it back into the kernel if needed. No TLB flush required. > + */ > + spin_lock_irqsave(&page->xpfo_lock, flags); > + > + if ((atomic_inc_return(&page->xpfo_mapcount) == 1) && > + TestClearPageXpfoUnmapped(page)) > + set_kpte(kaddr, page, PAGE_KERNEL); > + > + spin_unlock_irqrestore(&page->xpfo_lock, flags); That's a really sad sequence, not wrong, but sad. _3_ atomic operations, 2 on likely the same cacheline. And mostly all pointless. This patch makes xpfo_mapcount an atomic, but then all modifications are under the spinlock, what gives? Anyway, a possibly saner sequence might be: if (atomic_inc_not_zero(&page->xpfo_mapcount)) return; spin_lock_irqsave(&page->xpfo_lock, flag); if ((atomic_inc_return(&page->xpfo_mapcount) == 1) && TestClearPageXpfoUnmapped(page)) set_kpte(kaddr, page, PAGE_KERNEL); spin_unlock_irqrestore(&page->xpfo_lock, flags); > +} > + > +static inline void xpfo_kunmap(void *kaddr, struct page *page) > +{ > + unsigned long flags; > + > + if (!static_branch_unlikely(&xpfo_inited)) > + return; > + > + if (!PageXpfoUser(page)) > + return; > + > + /* > + * The page is to be allocated back to user space, so unmap it from > + * the kernel, flush the TLB and tag it as a user page. > + */ > + spin_lock_irqsave(&page->xpfo_lock, flags); > + > + if (atomic_dec_return(&page->xpfo_mapcount) == 0) { > +#ifdef CONFIG_XPFO_DEBUG > + WARN_ON(PageXpfoUnmapped(page)); > +#endif > + SetPageXpfoUnmapped(page); > + set_kpte(kaddr, page, __pgprot(0)); > + xpfo_flush_kernel_tlb(page, 0); You didn't speak about the TLB invalidation anywhere. 
But basically this is one that x86 cannot do. > + } > + > + spin_unlock_irqrestore(&page->xpfo_lock, flags); Idem: if (atomic_add_unless(&page->xpfo_mapcount, -1, 1)) return; > +} Also I'm failing to see the point of PG_xpfo_unmapped, afaict it is identical to !atomic_read(&page->xpfo_mapcount).
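Peter's two suggested fast paths — atomic_inc_not_zero() on map and atomic_add_unless(..., -1, 1) on unmap — can be modelled with C11 atomics. This is a userspace sketch of the locking shape only, not the kernel code: the spinlock, set_kpte() and TLB flush work on the slow paths are reduced to a counter, and atomic_inc_not_zero(v) is written as atomic_add_unless(v, 1, 0).

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int mapcount;
static int slow_path;	/* how often the locked slow path ran */

/* atomic_add_unless(v, a, u): add a to v unless v == u; true if added. */
static bool atomic_add_unless(atomic_int *v, int a, int u)
{
	int c = atomic_load(v);

	while (c != u)
		if (atomic_compare_exchange_weak(v, &c, c + a))
			return true;
	return false;
}

static void kmap_sketch(void)
{
	/* Fast path: page already mapped; just take another reference. */
	if (atomic_add_unless(&mapcount, 1, 0))
		return;
	/* Slow path: 0 -> 1 transition; would lock, re-check, set_kpte(). */
	slow_path++;
	atomic_fetch_add(&mapcount, 1);
}

static void kunmap_sketch(void)
{
	/* Fast path: other mappings remain; just drop a reference. */
	if (atomic_add_unless(&mapcount, -1, 1))
		return;
	/* Slow path: 1 -> 0 transition; would lock, unmap, flush the TLB. */
	slow_path++;
	atomic_fetch_sub(&mapcount, 1);
}
```

Nested maps of an already-mapped page thus stay on the lock-free path; only the edge transitions pay for the lock and the page-table update.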
Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)
On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote: > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 2c471a2c43fa..d17d33f36a01 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -204,6 +204,14 @@ struct page { > #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS > int _last_cpupid; > #endif > + > +#ifdef CONFIG_XPFO > + /* Counts the number of times this page has been kmapped. */ > + atomic_t xpfo_mapcount; > + > + /* Serialize kmap/kunmap of this page */ > + spinlock_t xpfo_lock; NAK, see ALLOC_SPLIT_PTLOCKS spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_ sure you want spinlock_t and not raw_spinlock_t ? For CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex. > +#endif Growing the page-frame by 8 bytes (in the good case) is really sad, that's a _lot_ of memory. > } _struct_page_alignment;
Re: 5.1-rc1: mpt init crash in scsi_map_dma, dma_4v_map_sg on sparc64
I think this might have been this commit: commit 24132a419c68f1d69eb8ecc91b3c80d730ecbb59 Author: Christoph Hellwig Date: Fri Feb 15 09:30:28 2019 +0100 sparc64/pci_sun4v: allow large DMA masks the patch below adds a few missing checks and hopefully should fix your problem. If not can you try to revert the commit to check if my theory was correct to start with? Yes, reverting this commit makes my T1000 boot. -- Meelis Roos