[PATCH 1/1] swiotlb: dump used and total slots when swiotlb buffer is full

2019-04-04 Thread Dongli Zhang
So far the kernel only prints the requested size when the swiotlb buffer is
full.  It is not possible to tell whether the buffer is simply exhausted, or
whether swiotlb cannot allocate a buffer of the requested size because of
fragmentation.

As 'io_tlb_used' is available since commit 71602fe6d4e9 ("swiotlb: add
debugfs to track swiotlb buffer usage"), both 'io_tlb_used' and
'io_tlb_nslabs' are printed when the swiotlb buffer is full.

Signed-off-by: Dongli Zhang 
---
 kernel/dma/swiotlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 53012db..3f43b37 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -540,7 +540,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
 not_found:
spin_unlock_irqrestore(&io_tlb_lock, flags);
if (!(attrs & DMA_ATTR_NO_WARN) && printk_ratelimit())
-   dev_warn(hwdev, "swiotlb buffer is full (sz: %zd bytes)\n", size);
+   dev_warn(hwdev, "swiotlb buffer is full (sz: %zd bytes), total %lu, used %lu\n",
+            size, io_tlb_nslabs, io_tlb_used);
return DMA_MAPPING_ERROR;
 found:
io_tlb_used += nslots;
-- 
2.7.4



Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-04-04 Thread Khalid Aziz
On 4/3/19 10:10 PM, Andy Lutomirski wrote:
> On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz  wrote:
>>
>> XPFO flushes kernel space TLB entries for pages that are now mapped
>> in userspace on not only the current CPU but also all other CPUs
>> synchronously. Processes on each core allocating pages causes a
>> flood of IPI messages to all other cores to flush TLB entries.
>> Many of these messages are to flush the entire TLB on the core if
>> the number of entries being flushed from local core exceeds
>> tlb_single_page_flush_ceiling. The cost of TLB flush caused by
>> unmapping pages from physmap goes up dramatically on machines with
>> high core count.
>>
>> This patch flushes relevant TLB entries for current process or
>> entire TLB depending upon number of entries for the current CPU
>> and posts a pending TLB flush on all other CPUs when a page is
>> unmapped from kernel space and mapped in userspace. Each core
>> checks the pending TLB flush flag for itself on every context
>> switch, flushes its TLB if the flag is set and clears it.
>> This patch potentially aggregates multiple TLB flushes into one.
>> This has very significant impact especially on machines with large
>> core counts.
> 
> Why is this a reasonable strategy?

Ideally when pages are unmapped from physmap, all CPUs would be sent IPI
synchronously to flush TLB entry for those pages immediately. This may
be ideal from correctness and consistency point of view, but it also
results in IPI storm and repeated TLB flushes on all processors. Any
time a page is allocated to userspace, we are going to go through this
and it is very expensive. On a 96-core server, performance degradation
is 26x!!

When xpfo unmaps a page from physmap only (after mapping the page in
userspace in response to an allocation request from userspace) on one
processor, there is a small window of opportunity for ret2dir attack on
other cpus until the TLB entry in physmap for the unmapped pages on
other cpus is cleared. Forcing that to happen synchronously is the
expensive part. Many of these requests can come in over a very
short time across multiple processors, resulting in every cpu asking
every other cpu to flush its TLB just to close this small window of
vulnerability in the kernel. If each request is processed synchronously,
each CPU will do multiple TLB flushes in short order. If we could
consolidate these TLB flush requests instead and do one TLB flush on
each cpu at the time of context switch, we can reduce the performance
impact significantly. This bears out in real life when measuring the system
time of a parallel kernel build on a large server. Without this change,
system time on a 96-core server doing "make -j60 all" went up 26x.
After this optimization, impact went down to 1.44x.
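
Roughly, the deferral looks like this (a condensed sketch of the patch
quoted further down in this thread, not new code):

	/* CPUs that still owe a full TLB flush for an xpfo unmap. */
	static cpumask_t pending_xpfo_flush;

	void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
	{
		struct cpumask tmp_mask;

		/* Flush the local TLB right away, ranged or full. */
		if (end == TLB_FLUSH_ALL ||
		    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
			do_flush_tlb_all(NULL);
		} else {
			struct flush_tlb_info info = { .start = start, .end = end };

			do_kernel_range_flush(&info);
		}

		/* Remote CPUs only get a pending bit ... */
		cpumask_setall(&tmp_mask);
		__cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
		cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
	}

	/* ... which is honoured in switch_mm_irqs_off(): */
	if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) {
		count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
		__flush_tlb_all();
	}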

The trade-off with this strategy is that the kernel on a cpu is vulnerable
for a short time if the process currently running on that cpu is
malicious. Is that an acceptable trade-off?

I am open to other ideas on reducing the performance impact due to xpfo.

> 
>> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
>> +{
>> +   struct cpumask tmp_mask;
>> +
>> +   /*
>> +* Balance as user space task's flush, a bit conservative.
>> +* Do a local flush immediately and post a pending flush on all
>> +* other CPUs. Local flush can be a range flush or full flush
>> +* depending upon the number of entries to be flushed. Remote
>> +* flushes will be done by individual processors at the time of
>> +* context switch and this allows multiple flush requests from
>> +* other CPUs to be batched together.
>> +*/
> 
> I don't like this function at all.  A core function like this is a
> contract of sorts between the caller and the implementation.  There is
> no such thing as an "xpfo" flush, and this function's behavior isn't
> at all well defined.  For flush_tlb_kernel_range(), I can tell you
> exactly what that function does, and the implementation is either
> correct or incorrect.  With this function, I have no idea what is
> actually required, and I can't possibly tell whether it's correct.
> 
> As far as I can see, xpfo_flush_tlb_kernel_range() actually means
> "flush this range on this CPU right now, and flush it on remote CPUs
> eventually".  It would be valid, but probably silly, to flush locally
> and to never flush at all on remote CPUs.  This makes me wonder what
> the point is.
> 

I would restate that as "flush this range on this cpu right now, and
flush it on remote cpus at the next context switch". A better name for
the routine and a better description is a reasonable change to make.
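
For instance, something along these lines (a sketch only; the name is just
a suggestion):

	/*
	 * xpfo_flush_tlb_kernel_range_deferred() - flush start..end from the
	 * local TLB immediately; all other CPUs are marked pending and will
	 * perform a full TLB flush no later than their next context switch.
	 */
	void xpfo_flush_tlb_kernel_range_deferred(unsigned long start,
						  unsigned long end);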

Thanks,
Khalid




Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault

2019-04-04 Thread Andy Lutomirski

> On Apr 4, 2019, at 10:28 AM, Thomas Gleixner  wrote:
> 
>> On Thu, 4 Apr 2019, Tycho Andersen wrote:
>>leaq-PTREGS_SIZE(%rax), %rsp
>>UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
>> 
>> +/*
>> + * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
>> + * them back on before going back to "normal" code.
>> + */
>> +sti
> 
> That breaks the paravirt muck and tracing/lockdep.
> 
> ENABLE_INTERRUPTS() is what you want plus TRACE_IRQ_ON to keep the tracer
> and lockdep happy.
> 
> 

I’m sure we’ll find some other thing we forgot to reset eventually, so let’s do 
this in C.  Change the call to do_exit into a call to __finish_rewind_stack_do_exit 
and add the latter as a C function that does local_irq_enable() and do_exit().
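
Something like this, as a sketch (untested; the signature is assumed to
mirror do_exit()):

	/* Called from the rewind_stack_do_exit() asm stub instead of do_exit(). */
	__visible void __noreturn __finish_rewind_stack_do_exit(long code)
	{
		/*
		 * We may have oopsed with IRQs off (or with other state we
		 * forgot to reset); repair that in C before actually exiting.
		 */
		local_irq_enable();
		do_exit(code);
	}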

Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-04 Thread Peter Zijlstra
On Thu, Apr 04, 2019 at 09:15:46AM -0600, Khalid Aziz wrote:
> Thanks Peter. I really appreciate your review. Your feedback helps make
> this code better and closer to where I can feel comfortable not calling
> it RFC any more.
> 
> The more I look at xpfo_kmap()/xpfo_kunmap() code, the more I get
> uncomfortable with it. As you pointed out about calling kmap_atomic from
> NMI context, that just makes the kmap_atomic code look even worse. I
> pointed out one problem with this code in cover letter and suggested a
> rewrite. I see these problems with this code:

Well, I no longer use it from NMI context, but I did do that for a
while. We now have a giant heap of magic in the NMI path that allows us
to take faults from NMI context (please don't ask), this means we can
mostly do copy_from_user_inatomic() now.

> 1. When xpfo_kmap maps a page back in physmap, it opens up the ret2dir
> attack security hole again even if just for the duration of kmap. A kmap
> can stay around for some time if the page is being used for I/O.

Correct.

> 2. This code uses spinlock which leads to problems. If it does not
> disable IRQ, it is exposed to deadlock around xpfo_lock. If it disables
> IRQ, I think it can still deadlock around pgd_lock.

I've not spotted that inversion yet, but then I didn't look at the lock
usage outside of k{,un}map_xpfo().

> I think a better implementation of xpfo_kmap()/xpfo_kunmap() would map
> the page at a new virtual address similar to what kmap_high for i386
> does. This avoids re-opening the ret2dir security hole. We can also
> possibly do away with xpfo_lock saving bytes in page-frame and the not
> so sane code sequence can go away.

Right, the TLB invalidation issues are still tricky, even there :/


Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault

2019-04-04 Thread Thomas Gleixner
On Thu, 4 Apr 2019, Tycho Andersen wrote:
>   leaq-PTREGS_SIZE(%rax), %rsp
>   UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
>  
> + /*
> +  * If we oopsed in an interrupt handler, interrupts may be off. Let's turn
> +  * them back on before going back to "normal" code.
> +  */
> + sti

That breaks the paravirt muck and tracing/lockdep.

ENABLE_INTERRUPTS() is what you want plus TRACE_IRQ_ON to keep the tracer
and lockdep happy.

Thanks,

tglx


[PATCH] sparc64/pci_sun4v: fix ATU checks for large DMA masks

2019-04-04 Thread Christoph Hellwig
Now that we allow drivers to set larger-than-required DMA masks, we need
to be a little more careful in the sun4v PCI iommu driver about when to
select the ATU support - a larger DMA mask can be set even when the
platform does not support ATU, so we always have to check that it is
available before using it.  Add a little helper for that and use it in
all the places where we make ATU usage decisions based on the DMA mask.

Fixes: 24132a419c68 ("sparc64/pci_sun4v: allow large DMA masks")
Reported-by: Meelis Roos 
Signed-off-by: Christoph Hellwig 
Tested-by: Meelis Roos 
---
 arch/sparc/kernel/pci_sun4v.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c
index a8af6023c126..14b93c5564e3 100644
--- a/arch/sparc/kernel/pci_sun4v.c
+++ b/arch/sparc/kernel/pci_sun4v.c
@@ -73,6 +73,11 @@ static inline void iommu_batch_start(struct device *dev, 
unsigned long prot, uns
p->npages   = 0;
 }
 
+static inline bool iommu_use_atu(struct iommu *iommu, u64 mask)
+{
+   return iommu->atu && mask > DMA_BIT_MASK(32);
+}
+
 /* Interrupts must be disabled.  */
 static long iommu_batch_flush(struct iommu_batch *p, u64 mask)
 {
@@ -92,7 +97,7 @@ static long iommu_batch_flush(struct iommu_batch *p, u64 mask)
prot &= (HV_PCI_MAP_ATTR_READ | HV_PCI_MAP_ATTR_WRITE);
 
while (npages != 0) {
-   if (mask <= DMA_BIT_MASK(32) || !pbm->iommu->atu) {
+   if (!iommu_use_atu(pbm->iommu, mask)) {
num = pci_sun4v_iommu_map(devhandle,
  HV_PCI_TSBID(0, entry),
  npages,
@@ -179,7 +184,6 @@ static void *dma_4v_alloc_coherent(struct device *dev, 
size_t size,
unsigned long flags, order, first_page, npages, n;
unsigned long prot = 0;
struct iommu *iommu;
-   struct atu *atu;
struct iommu_map_table *tbl;
struct page *page;
void *ret;
@@ -205,13 +209,11 @@ static void *dma_4v_alloc_coherent(struct device *dev, 
size_t size,
memset((char *)first_page, 0, PAGE_SIZE << order);
 
iommu = dev->archdata.iommu;
-   atu = iommu->atu;
-
mask = dev->coherent_dma_mask;
-   if (mask <= DMA_BIT_MASK(32) || !atu)
+   if (!iommu_use_atu(iommu, mask))
tbl = &iommu->tbl;
else
-   tbl = &atu->tbl;
+   tbl = &iommu->atu->tbl;
 
entry = iommu_tbl_range_alloc(dev, tbl, npages, NULL,
  (unsigned long)(-1), 0);
@@ -333,7 +335,7 @@ static void dma_4v_free_coherent(struct device *dev, size_t 
size, void *cpu,
atu = iommu->atu;
devhandle = pbm->devhandle;
 
-   if (dvma <= DMA_BIT_MASK(32)) {
+   if (!iommu_use_atu(iommu, dvma)) {
tbl = &iommu->tbl;
iotsb_num = 0; /* we don't care for legacy iommu */
} else {
@@ -374,7 +376,7 @@ static dma_addr_t dma_4v_map_page(struct device *dev, 
struct page *page,
npages >>= IO_PAGE_SHIFT;
 
mask = *dev->dma_mask;
-   if (mask <= DMA_BIT_MASK(32))
+   if (!iommu_use_atu(iommu, mask))
tbl = &iommu->tbl;
else
tbl = &atu->tbl;
@@ -510,7 +512,7 @@ static int dma_4v_map_sg(struct device *dev, struct 
scatterlist *sglist,
  IO_PAGE_SIZE) >> IO_PAGE_SHIFT;
 
mask = *dev->dma_mask;
-   if (mask <= DMA_BIT_MASK(32))
+   if (!iommu_use_atu(iommu, mask))
tbl = &iommu->tbl;
else
tbl = &atu->tbl;
-- 
2.20.1



Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault

2019-04-04 Thread Sebastian Andrzej Siewior
- stepping on del button while browsing through CCs.
On 2019-04-04 09:47:27 [-0600], Tycho Andersen wrote:
> > Hmm.  do_exit() isn't really meant to be "try your best to leave the
> > system somewhat usable without returning" -- it's a function that,
> > other than in OOPSes, is called from a well-defined state.  So I think
> > rewind_stack_do_exit() is probably a better spot.  But we need to
> > rewind the stack and *then* turn on IRQs, since we otherwise risk
> > exploding quite badly.
> 
> Ok, sounds good. I guess we can include something like this patch in
> the next series.

The tracing infrastructure probably doesn't know that the interrupts are
back on. Also if you were holding a spin lock then your preempt count
isn't 0 which means that might_sleep() will trigger a splat (in your
backtrace it was zero).

> Thanks,
> 
> Tycho
Sebastian


Re: [RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap

2019-04-04 Thread Khalid Aziz
On 4/4/19 1:56 AM, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:34:12AM -0600, Khalid Aziz wrote:
>> From: Julian Stecklina 
>>
>> Only the xpfo_kunmap call that needs to actually unmap the page
>> needs to be serialized. We need to be careful to handle the case,
>> where after the atomic decrement of the mapcount, a xpfo_kmap
>> increased the mapcount again. In this case, we can safely skip
>> modifying the page table.
>>
>> Model-checked with up to 4 concurrent callers with Spin.
>>
>> Signed-off-by: Julian Stecklina 
>> Signed-off-by: Khalid Aziz 
>> Cc: Khalid Aziz 
>> Cc: x...@kernel.org
>> Cc: kernel-harden...@lists.openwall.com
>> Cc: Vasileios P. Kemerlis 
>> Cc: Juerg Haefliger 
>> Cc: Tycho Andersen 
>> Cc: Marco Benatto 
>> Cc: David Woodhouse 
>> ---
>>  include/linux/xpfo.h | 24 +++-
>>  1 file changed, 15 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
>> index 2318c7eb5fb7..37e7f52fa6ce 100644
>> --- a/include/linux/xpfo.h
>> +++ b/include/linux/xpfo.h
>> @@ -61,6 +61,7 @@ static inline void xpfo_kmap(void *kaddr, struct page 
>> *page)
>>  static inline void xpfo_kunmap(void *kaddr, struct page *page)
>>  {
>>  unsigned long flags;
>> +bool flush_tlb = false;
>>  
>>  if (!static_branch_unlikely(&xpfo_inited))
>>  return;
>> @@ -72,18 +73,23 @@ static inline void xpfo_kunmap(void *kaddr, struct page 
>> *page)
>>   * The page is to be allocated back to user space, so unmap it from
>>   * the kernel, flush the TLB and tag it as a user page.
>>   */
>> -spin_lock_irqsave(&page->xpfo_lock, flags);
>> -
>>  if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
>> -#ifdef CONFIG_XPFO_DEBUG
>> -WARN_ON(PageXpfoUnmapped(page));
>> -#endif
>> -SetPageXpfoUnmapped(page);
>> -set_kpte(kaddr, page, __pgprot(0));
>> -xpfo_flush_kernel_tlb(page, 0);
>> +spin_lock_irqsave(&page->xpfo_lock, flags);
>> +
>> +/*
>> + * In the case, where we raced with kmap after the
>> + * atomic_dec_return, we must not nuke the mapping.
>> + */
>> +if (atomic_read(&page->xpfo_mapcount) == 0) {
>> +SetPageXpfoUnmapped(page);
>> +set_kpte(kaddr, page, __pgprot(0));
>> +flush_tlb = true;
>> +}
>> +spin_unlock_irqrestore(&page->xpfo_lock, flags);
>>  }
>>  
>> -spin_unlock_irqrestore(&page->xpfo_lock, flags);
>> +if (flush_tlb)
>> +xpfo_flush_kernel_tlb(page, 0);
>>  }
> 
> This doesn't help with the TLB invalidation issue, AFAICT this is still
> completely buggered. kunmap_atomic() can be called from IRQ context.
> 

OK. xpfo_kmap/xpfo_kunmap need redesign.

--
Khalid



Re: [RFC PATCH v9 02/13] x86: always set IF before oopsing from page fault

2019-04-04 Thread Tycho Andersen
On Wed, Apr 03, 2019 at 09:12:16PM -0700, Andy Lutomirski wrote:
> On Wed, Apr 3, 2019 at 6:42 PM Tycho Andersen  wrote:
> >
> > On Wed, Apr 03, 2019 at 05:12:56PM -0700, Andy Lutomirski wrote:
> > > On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz  
> > > wrote:
> > > >
> > > > From: Tycho Andersen 
> > > >
> > > > Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, 
> > > > and
> > > > that might sleep:
> > > >
> > >
> > >
> > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > > index 9d5c75f02295..7891add0913f 100644
> > > > --- a/arch/x86/mm/fault.c
> > > > +++ b/arch/x86/mm/fault.c
> > > > @@ -858,6 +858,12 @@ no_context(struct pt_regs *regs, unsigned long 
> > > > error_code,
> > > > /* Executive summary in case the body of the oops scrolled away 
> > > > */
> > > > printk(KERN_DEFAULT "CR2: %016lx\n", address);
> > > >
> > > > +   /*
> > > > +* We're about to oops, which might kill the task. Make sure 
> > > > we're
> > > > +* allowed to sleep.
> > > > +*/
> > > > +   flags |= X86_EFLAGS_IF;
> > > > +
> > > > oops_end(flags, regs, sig);
> > > >  }
> > > >
> > >
> > >
> > > NAK.  If there's a bug in rewind_stack_do_exit(), please fix it in
> > > rewind_stack_do_exit().
> >
> > [I trimmed the CC list since google rejected it with E2BIG :)]
> >
> > I guess the problem is really that do_exit() (or really
> > exit_signals()) might sleep. Maybe we should put an irq_enable() at
> > the beginning of do_exit() instead and fix this problem for all
> > arches?
> >
> 
> Hmm.  do_exit() isn't really meant to be "try your best to leave the
> system somewhat usable without returning" -- it's a function that,
> other than in OOPSes, is called from a well-defined state.  So I think
> rewind_stack_do_exit() is probably a better spot.  But we need to
> rewind the stack and *then* turn on IRQs, since we otherwise risk
> exploding quite badly.

Ok, sounds good. I guess we can include something like this patch in
the next series.

Thanks,

Tycho


From 34dce229a4f43f90db823671eb0b8da7c4906045 Mon Sep 17 00:00:00 2001
From: Tycho Andersen 
Date: Thu, 4 Apr 2019 09:41:32 -0600
Subject: [PATCH] x86/entry: re-enable interrupts before exiting

If the kernel oopses in an interrupt, nothing re-enables interrupts:

Aug 23 19:30:27 xpfo kernel: [   38.302714] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [   38.303837] in_atomic(): 0, irqs_disabled(): 1, pid: 1970, name: lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [   38.304758] CPU: 3 PID: 1970 Comm: lkdtm_xpfo_test Tainted: G  D 4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [   38.305813] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [   38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [   38.307243]  dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [   38.307665]  ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [   38.308139]  __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [   38.308593]  exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [   38.309046]  ? blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [   38.309677]  do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [   38.310078]  ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.310604]  ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [   38.311045]  ? read_user_with_flags+0x60/0x60 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.311680]  rewind_stack_do_exit+0x17/0x20

do_exit() expects to be called in a well-defined environment, so let's
re-enable interrupts after unwinding the stack, in case they were disabled.

Signed-off-by: Tycho Andersen 
---
 arch/x86/entry/entry_32.S | 6 ++
 arch/x86/entry/entry_64.S | 6 ++
 2 files changed, 12 insertions(+)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..8ddb7b41669d 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1507,6 +1507,12 @@ ENTRY(rewind_stack_do_exit)
movlPER_CPU_VAR(cpu_current_top_of_stack), %esi
leal-TOP_OF_KERNEL_STACK_PADDING-PTREGS_SIZE(%esi), %esp
 
+   /*
+* If we oopsed in an interrupt handler, interrupts may be off. Let's turn
+* them back on before going back to "normal" code.
+*/
+   sti
+
calldo_exit
 1: jmp 1b
 END(rewind_stack_do_exit)
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..c0759f3e3ad2 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1672,5 +1672,11 @@ ENTRY(rewind_stack_do_exit)
leaq-PTREGS_SIZE(%rax), %rsp
UNWIND_HINT_FUNC sp_offset=PTREGS_SIZE
 
+   /*
+* If we oopsed in an interrupt handler, interrupts may be off. Let's turn
+* them back on before going back to "normal" code.
+*/
+   sti
+
   

Re: 5.1-rc1: mpt init crash in scsi_map_dma, dma_4v_map_sg on sparc64

2019-04-04 Thread Christoph Hellwig
On Thu, Apr 04, 2019 at 06:38:53PM +0300, Meelis Roos wrote:
> > > Yes, reverting this commit makes my T1000 boot.
> > 
> > Does the patch attached to the last mail work as well?
> 
> Sorry for misreading your mail - tested now and yes, it works.

Thanks, I'll submit it with a proper changelog.


Re: [RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64

2019-04-04 Thread Khalid Aziz
[trimmed To: and Cc: It was too large]

On 4/4/19 1:52 AM, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:34:05AM -0600, Khalid Aziz wrote:
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index 2779ace16d23..5c0e1581fa56 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void)
>>  return boot_cpu_has_bug(X86_BUG_L1TF);
>>  }
>>  
>> +/*
>> + * The current flushing context - we pass it instead of 5 arguments:
>> + */
>> +struct cpa_data {
>> +unsigned long   *vaddr;
>> +pgd_t   *pgd;
>> +pgprot_tmask_set;
>> +pgprot_tmask_clr;
>> +unsigned long   numpages;
>> +unsigned long   curpage;
>> +unsigned long   pfn;
>> +unsigned intflags;
>> +unsigned intforce_split : 1,
>> +force_static_prot   : 1;
>> +struct page **pages;
>> +};
>> +
>> +
>> +int
>> +should_split_large_page(pte_t *kpte, unsigned long address,
>> +struct cpa_data *cpa);
>> +extern spinlock_t cpa_lock;
>> +int
>> +__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
>> +   struct page *base);
>> +
> 
> I really hate exposing all that.

I believe this was done so set_kpte() could split large pages if needed.
I will look into creating a helper function instead so this does not
have to be exposed.
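
For instance, a helper along these lines in pageattr.c (purely a sketch,
the name is made up) would keep struct cpa_data private to that file:

	/*
	 * Remap the 4K page backing @kaddr with @prot, splitting any large
	 * mapping that covers it first, so callers never touch cpa_data.
	 */
	int split_and_set_kpte(void *kaddr, struct page *page, pgprot_t prot);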

> 
>>  #include 
>>  #endif  /* __ASSEMBLY__ */
>>  
> 
>> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
>> new file mode 100644
>> index ..3045bb7e4659
>> --- /dev/null
>> +++ b/arch/x86/mm/xpfo.c
>> @@ -0,0 +1,123 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
>> + * Copyright (C) 2016 Brown University. All rights reserved.
>> + *
>> + * Authors:
>> + *   Juerg Haefliger 
>> + *   Vasileios P. Kemerlis 
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published 
>> by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include 
>> +
>> +#include 
>> +
>> +extern spinlock_t cpa_lock;
>> +
>> +/* Update a single kernel page table entry */
>> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
>> +{
>> +unsigned int level;
>> +pgprot_t msk_clr;
>> +pte_t *pte = lookup_address((unsigned long)kaddr, &level);
>> +
>> +if (unlikely(!pte)) {
>> +WARN(1, "xpfo: invalid address %p\n", kaddr);
>> +return;
>> +}
>> +
>> +switch (level) {
>> +case PG_LEVEL_4K:
>> +set_pte_atomic(pte, pfn_pte(page_to_pfn(page),
>> +   canon_pgprot(prot)));
> 
> (sorry, do we also need a nikon_pgprot() ? :-)

Are we trying to encourage nikon as well to sponsor this patch :)

> 
>> +break;
>> +case PG_LEVEL_2M:
>> +case PG_LEVEL_1G: {
>> +struct cpa_data cpa = { };
>> +int do_split;
>> +
>> +if (level == PG_LEVEL_2M)
>> +msk_clr = pmd_pgprot(*(pmd_t *)pte);
>> +else
>> +msk_clr = pud_pgprot(*(pud_t *)pte);
>> +
>> +cpa.vaddr = kaddr;
>> +cpa.pages = &page;
>> +cpa.mask_set = prot;
>> +cpa.mask_clr = msk_clr;
>> +cpa.numpages = 1;
>> +cpa.flags = 0;
>> +cpa.curpage = 0;
>> +cpa.force_split = 0;
>> +
>> +
>> +do_split = should_split_large_page(pte, (unsigned long)kaddr,
>> +   &cpa);
>> +if (do_split) {
>> +struct page *base;
>> +
>> +base = alloc_pages(GFP_ATOMIC, 0);
>> +if (!base) {
>> +WARN(1, "xpfo: failed to split large page\n");
> 
> You have to be fcking kidding right? A WARN when a GFP_ATOMIC allocation
> fails?!
> 

Not sure what the reasoning was for this WARN in the original patch, but I
think it is trying to warn about the failure to split the large page (and
hence to unmap this single page), rather than about the allocation failure
itself. Nevertheless this could be done better.

>> +break;
>> +}
>> +
>> +if (!debug_pagealloc_enabled())
>> +spin_lock(&cpa_lock);
>> +if  (__split_large_page(&cpa, pte, (unsigned long)kaddr,
>> +base) < 0) {
>> +__free_page(base);
>> +WARN(1, "xpfo: failed to split large page\n");
>> +}
>> +if (!debug_pagealloc_enabled())
>> +spin_unlock(&cpa_lock);
>> +}
>> +
>> +break;
> 
> Ever heard of helper functions?

Re: 5.1-rc1: mpt init crash in scsi_map_dma, dma_4v_map_sg on sparc64

2019-04-04 Thread Meelis Roos

>> Yes, reverting this commit makes my T1000 boot.
>
> Does the patch attached to the last mail work as well?

Sorry for misreading your mail - tested now and yes, it works.

--
Meelis Roos


Re: [PATCH v2 0/2] iommu/arm-smmu-v3: make sure the kdump kernel can work well when smmu is enabled

2019-04-04 Thread Will Deacon
Hi Zhen Lei,

On Mon, Mar 18, 2019 at 09:12:41PM +0800, Zhen Lei wrote:
> v1 --> v2:
> 1. Drop part2. Now, we only use the SMMUv3 hardware feature STE.config=0b000
> (Report abort to device, no event recorded) to suppress the event messages
> caused by the unexpected devices.
> 2. rewrite the patch description.

This issue came up a while back:

https://lore.kernel.org/linux-pci/20180302103032.gb19...@arm.com/

and I'd still prefer to solve it using the disable_bypass logic which we
already have. Something along the lines of the diff below?

We're relying on the DMA API not subsequently requesting a passthrough
domain, but it should only do that if you've configured your crashkernel
to do so.

Will

--->8

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index d3880010c6cf..91b8f3b2ee25 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -2454,13 +2454,9 @@ static int arm_smmu_device_reset(struct arm_smmu_device 
*smmu, bool bypass)
/* Clear CR0 and sync (disables SMMU and queue processing) */
reg = readl_relaxed(smmu->base + ARM_SMMU_CR0);
if (reg & CR0_SMMUEN) {
-   if (is_kdump_kernel()) {
-   arm_smmu_update_gbpa(smmu, GBPA_ABORT, 0);
-   arm_smmu_device_disable(smmu);
-   return -EBUSY;
-   }
-
dev_warn(smmu->dev, "SMMU currently enabled! Resetting...\n");
+   WARN_ON(is_kdump_kernel() && !disable_bypass);
+   arm_smmu_update_gbpa(smmu, GBPA_ABORT, 0);
}
 
ret = arm_smmu_device_disable(smmu);


Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-04 Thread Khalid Aziz
On 4/4/19 1:43 AM, Peter Zijlstra wrote:
> 
> You must be so glad I no longer use kmap_atomic from NMI context :-)
> 
> On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
>> +static inline void xpfo_kmap(void *kaddr, struct page *page)
>> +{
>> +unsigned long flags;
>> +
>> +if (!static_branch_unlikely(&xpfo_inited))
>> +return;
>> +
>> +if (!PageXpfoUser(page))
>> +return;
>> +
>> +/*
>> + * The page was previously allocated to user space, so
>> + * map it back into the kernel if needed. No TLB flush required.
>> + */
>> +spin_lock_irqsave(&page->xpfo_lock, flags);
>> +
>> +if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
>> +TestClearPageXpfoUnmapped(page))
>> +set_kpte(kaddr, page, PAGE_KERNEL);
>> +
>> +spin_unlock_irqrestore(&page->xpfo_lock, flags);
> 
> That's a really sad sequence, not wrong, but sad. _3_ atomic operations,
> 2 on likely the same cacheline. And mostly all pointless.
> 
> This patch makes xpfo_mapcount an atomic, but then all modifications are
> under the spinlock, what gives?
> 
> Anyway, a possibly saner sequence might be:
> 
>   if (atomic_inc_not_zero(&page->xpfo_mapcount))
>   return;
> 
>   spin_lock_irqsave(&page->xpfo_lock, flag);
>   if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
>   TestClearPageXpfoUnmapped(page))
>   set_kpte(kaddr, page, PAGE_KERNEL);
>   spin_unlock_irqrestore(&page->xpfo_lock, flags);
> 
>> +}
>> +
>> +static inline void xpfo_kunmap(void *kaddr, struct page *page)
>> +{
>> +unsigned long flags;
>> +
>> +if (!static_branch_unlikely(&xpfo_inited))
>> +return;
>> +
>> +if (!PageXpfoUser(page))
>> +return;
>> +
>> +/*
>> + * The page is to be allocated back to user space, so unmap it from
>> + * the kernel, flush the TLB and tag it as a user page.
>> + */
>> +spin_lock_irqsave(&page->xpfo_lock, flags);
>> +
>> +if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
>> +#ifdef CONFIG_XPFO_DEBUG
>> +WARN_ON(PageXpfoUnmapped(page));
>> +#endif
>> +SetPageXpfoUnmapped(page);
>> +set_kpte(kaddr, page, __pgprot(0));
>> +xpfo_flush_kernel_tlb(page, 0);
> 
> You didn't speak about the TLB invalidation anywhere. But basically this
> is one that x86 cannot do.
> 
>> +}
>> +
>> +spin_unlock_irqrestore(&page->xpfo_lock, flags);
> 
> Idem:
> 
>   if (atomic_add_unless(&page->xpfo_mapcount, -1, 1))
>   return;
> 
>   
> 
> 
>> +}
> 
> Also I'm failing to see the point of PG_xpfo_unmapped, afaict it
> is identical to !atomic_read(&page->xpfo_mapcount).
> 

Thanks Peter. I really appreciate your review. Your feedback helps make
this code better and closer to where I can feel comfortable not calling
it RFC any more.

The more I look at xpfo_kmap()/xpfo_kunmap() code, the more I get
uncomfortable with it. As you pointed out about calling kmap_atomic from
NMI context, that just makes the kmap_atomic code look even worse. I
pointed out one problem with this code in cover letter and suggested a
rewrite. I see these problems with this code:

1. When xpfo_kmap maps a page back in physmap, it opens up the ret2dir
attack security hole again even if just for the duration of kmap. A kmap
can stay around for some time if the page is being used for I/O.

2. This code uses spinlock which leads to problems. If it does not
disable IRQ, it is exposed to deadlock around xpfo_lock. If it disables
IRQ, I think it can still deadlock around pgd_lock.

I think a better implementation of xpfo_kmap()/xpfo_kunmap() would map
the page at a new virtual address similar to what kmap_high for i386
does. This avoids re-opening the ret2dir security hole. We can also
possibly do away with xpfo_lock saving bytes in page-frame and the not
so sane code sequence can go away.

Good point about PG_xpfo_unmapped flag. You are right that it can be
replaced with !atomic_read(&page->xpfo_mapcount).
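
Something like the below, as a sketch (helper name made up):

	/* Derive "unmapped" from the mapcount instead of a page flag. */
	static inline bool xpfo_page_is_unmapped(struct page *page)
	{
		return atomic_read(&page->xpfo_mapcount) == 0;
	}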

Thanks,
Khalid



Re: [PATCH v2] iommu/arm-smmu: Break insecure users by disabling bypass by default

2019-04-04 Thread Will Deacon
On Fri, Mar 01, 2019 at 11:20:17AM -0800, Douglas Anderson wrote:
> If you're bisecting why your peripherals stopped working, it's
> probably this CL.  Specifically if you see this in your dmesg:
>   Unexpected global fault, this could be serious
> ...then it's almost certainly this CL.
> 
> Running your IOMMU-enabled peripherals with the IOMMU in bypass mode
> is insecure and effectively disables the protection they provide.
> There are few reasons to allow unmatched stream bypass, and even fewer
> good ones.
> 
> This patch starts the transition over to make it much harder to run
> your system insecurely.  Expected steps:
> 
> 1. By default disable bypass (so anyone insecure will notice) but make
>it easy for someone to re-enable bypass with just a KConfig change.
>That's this patch.
> 
> 2. After people have had a little time to come to grips with the fact
>that they need to set their IOMMUs properly and have had time to
>dig into how to do this, the KConfig will be eliminated and bypass
>will simply be disabled.  Folks who are truly upset and still
>haven't fixed their system can either figure out how to add
>'arm-smmu.disable_bypass=n' to their command line or revert the
>patch in their own private kernel.  Of course these folks will be
>less secure.
> 
> Suggested-by: Robin Murphy 
> Signed-off-by: Douglas Anderson 
> ---
> 
> Changes in v2:
> - Flipped default to 'yes' and changed comments a lot.
> 
>  drivers/iommu/Kconfig| 25 +
>  drivers/iommu/arm-smmu.c |  3 ++-
>  2 files changed, 27 insertions(+), 1 deletion(-)
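
For context, the driver side of this boils down to making the default of the
existing disable_bypass module parameter follow the new Kconfig symbol,
roughly as below (a sketch; the symbol name is assumed), while
'arm-smmu.disable_bypass=n' on the command line can still override it:

	static bool disable_bypass =
		IS_ENABLED(CONFIG_ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT);
	module_param(disable_bypass, bool, S_IRUGO);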

Cheers, I'll pick this one up for 5.2.

Will


Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-04 Thread Tycho Andersen
On Thu, Apr 04, 2019 at 09:21:52AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 2c471a2c43fa..d17d33f36a01 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -204,6 +204,14 @@ struct page {
> >  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> > int _last_cpupid;
> >  #endif
> > +
> > +#ifdef CONFIG_XPFO
> > +   /* Counts the number of times this page has been kmapped. */
> > +   atomic_t xpfo_mapcount;
> > +
> > +   /* Serialize kmap/kunmap of this page */
> > +   spinlock_t xpfo_lock;
> 
> NAK, see ALLOC_SPLIT_PTLOCKS
> 
> spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_
> sure you want spinlock_t and not raw_spinlock_t ? For
> CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex.
> 
> > +#endif
> 
> Growing the page-frame by 8 bytes (in the good case) is really sad,
> that's a _lot_ of memory.

Originally we had this in page_ext, it's not really clear to me why we
moved it out.

Julien?

Tycho


Re: [PATCH 2/4] iommu/arm-smmu-v3: Link domains and devices

2019-04-04 Thread Will Deacon
Hi Jean-Philippe,

First off, thanks for posting this: it's definitely something that I'm keen
to support, and getting bits in a piece at a time is probably a good idea.

On Wed, Mar 20, 2019 at 05:36:32PM +, Jean-Philippe Brucker wrote:
> When removing a mapping from a domain, we need to send an invalidation to
> all devices that might have stored it in their Address Translation Cache
> (ATC). In addition when updating the context descriptor of a live domain,
> we'll need to send invalidations for all devices attached to it.
> 
> Maintain a list of devices in each domain, protected by a spinlock. It is
> updated every time we attach or detach devices to and from domains.
> 
> It needs to be a spinlock because we'll invalidate ATC entries from
> within hardirq-safe contexts, but it may be possible to relax the read
> side with RCU later.
> 
> Signed-off-by: Jean-Philippe Brucker 
> ---
>  drivers/iommu/arm-smmu-v3.c | 28 
>  1 file changed, 28 insertions(+)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index d3880010c6cf..66a29c113dbc 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -594,6 +594,11 @@ struct arm_smmu_device {
>  struct arm_smmu_master_data {
>   struct arm_smmu_device  *smmu;
>   struct arm_smmu_strtab_ent  ste;
> +
> + struct arm_smmu_domain  *domain;
> + struct list_headdomain_head;
> +
> + struct device   *dev;
>  };
>  
>  /* SMMU private data for an IOMMU domain */
> @@ -618,6 +623,9 @@ struct arm_smmu_domain {
>   };
>  
>   struct iommu_domain domain;
> +
> + struct list_headdevices;
> + spinlock_t  devices_lock;
>  };
>  
>  struct arm_smmu_option_prop {
> @@ -1493,6 +1501,9 @@ static struct iommu_domain 
> *arm_smmu_domain_alloc(unsigned type)
>   }
>  
>   mutex_init(&smmu_domain->init_mutex);
> + INIT_LIST_HEAD(&smmu_domain->devices);
> + spin_lock_init(&smmu_domain->devices_lock);

I'm wondering whether we can't take this a bit further and re-organise the
data structures to make this a little simpler overall. Something along the
lines of:

struct arm_smmu_master_data {
	struct list_head		list;		// masters in the same domain
	struct arm_smmu_device		*smmu;
	unsigned int			num_sids;
	u32				*sids;		// Points into fwspec
	struct arm_smmu_domain		*domain;	// NULL -> !assigned
};

and then just add a list_head into struct arm_smmu_domain to track the
masters that have been attached (if you're feeling brave, you could put
this into the s1_cfg).

The ATC invalidation logic would then be:

  - Detaching a device: walk over the sids from the master data
  - Unmapping a range from a domain: walk over the attached masters

I think this would also allow us to remove struct arm_smmu_strtab_ent
completely.
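
For the unmap side, that would boil down to something like this (a sketch
adapted from the two structures above; the per-master ATC helper is
assumed, and the devices list/lock are the ones added in this patch):

	static void arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
					    unsigned long iova, size_t size)
	{
		unsigned long flags;
		struct arm_smmu_master_data *master;

		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
		list_for_each_entry(master, &smmu_domain->devices, list)
			/* assumed helper: ATC invalidation for all of master's SIDs */
			arm_smmu_atc_inv_master_range(master, iova, size);
		spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
	}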

Dunno: this is one of those things where you really have to try it
to figure out why it doesn't work...

Will


Re: 5.1-rc1: mpt init crash in scsi_map_dma, dma_4v_map_sg on sparc64

2019-04-04 Thread Christoph Hellwig
On Thu, Apr 04, 2019 at 10:11:43AM +0300, Meelis Roos wrote:
> > I think this might have been this commit:
> > 
> > commit 24132a419c68f1d69eb8ecc91b3c80d730ecbb59
> > Author: Christoph Hellwig 
> > Date:   Fri Feb 15 09:30:28 2019 +0100
> > 
> >  sparc64/pci_sun4v: allow large DMA masks
> > 
> > the patch below adds a few missing checks and hopefully should fix
> > your problem.  If not can you try to revert the commit to check if
> > my theory was correct to start with?
> Yes, reverting this commit makes my T1000 boot.

Does the patch attached to the last mail work as well?


Re: [PATCH] driver core: Postpone DMA tear-down until after devres release for probe failure

2019-04-04 Thread John Garry

On 03/04/2019 10:20, John Garry wrote:

On 03/04/2019 09:14, Greg KH wrote:

On Wed, Apr 03, 2019 at 09:02:36AM +0100, John Garry wrote:

On 28/03/2019 10:08, John Garry wrote:

In commit 376991db4b64 ("driver core: Postpone DMA tear-down until
after
devres release"), we changed the ordering of tearing down the device
DMA
ops and releasing all the device's resources; this was because the
DMA ops
should be maintained until we release the device's managed DMA
memories.



Hi all,

A friendly reminder on this patch... I didn't see any update.

I thought that it had some importance.

Thanks,
John


However, we have seen another crash on an arm64 system when a
device driver probe fails:

  hisi_sas_v3_hw :74:02.0: Adding to iommu group 2
  scsi host1: hisi_sas_v3_hw
  BUG: Bad page state in process swapper/0  pfn:313f5
  page:7ec4fd40 count:1 mapcount:0
  mapping: index:0x0
  flags: 0xfffe0001000(reserved)
  raw: 0fffe0001000 7ec4fd48 7ec4fd48

  raw:   0001

  page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
  bad because of flags: 0x1000(reserved)
  Modules linked in:
  CPU: 49 PID: 1 Comm: swapper/0 Not tainted
5.1.0-rc1-43081-g22d97fd-dirty #1433
  Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI
RC0 - V1.12.01 01/29/2019
  Call trace:
  dump_backtrace+0x0/0x118
  show_stack+0x14/0x1c
  dump_stack+0xa4/0xc8
  bad_page+0xe4/0x13c
  free_pages_check_bad+0x4c/0xc0
  __free_pages_ok+0x30c/0x340
  __free_pages+0x30/0x44
  __dma_direct_free_pages+0x30/0x38
  dma_direct_free+0x24/0x38
  dma_free_attrs+0x9c/0xd8
  dmam_release+0x20/0x28
  release_nodes+0x17c/0x220
  devres_release_all+0x34/0x54
  really_probe+0xc4/0x2c8
  driver_probe_device+0x58/0xfc
  device_driver_attach+0x68/0x70
  __driver_attach+0x94/0xdc
  bus_for_each_dev+0x5c/0xb4
  driver_attach+0x20/0x28
  bus_add_driver+0x14c/0x200
  driver_register+0x6c/0x124
  __pci_register_driver+0x48/0x50
  sas_v3_pci_driver_init+0x20/0x28
  do_one_initcall+0x40/0x25c
  kernel_init_freeable+0x2b8/0x3c0
  kernel_init+0x10/0x100
  ret_from_fork+0x10/0x18
  Disabling lock debugging due to kernel taint
  BUG: Bad page state in process swapper/0  pfn:313f6
  page:7ec4fd80 count:1 mapcount:0
mapping: index:0x0
[   89.322983] flags: 0xfffe0001000(reserved)
  raw: 0fffe0001000 7ec4fd88 7ec4fd88

  raw:   0001


The crash occurs for the same reason.

In this case, on the really_probe() failure path, we are still clearing
the DMA ops prior to releasing the device's managed memories.

This patch fixes this issue by reordering the DMA ops teardown and the
call to devres_release_all() on the failure path.
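
i.e. the failure path in really_probe() ends up doing, in this order (a
sketch of the intent, not the exact hunk):

	/* Release managed resources (including dmam_*() allocations)
	 * while the device's DMA ops are still in place ... */
	devres_release_all(dev);
	/* ... and only then tear the DMA ops down. */
	arch_teardown_dma_ops(dev);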

Reported-by: Xiang Chen 
Tested-by: Xiang Chen 
Signed-off-by: John Garry 


So does this "fix" 376991db4b64?  If so, should this be added to the
patch and also backported to the stable trees?


Hi Greg,

No, I don't think so. I'd say it supplements it. Here I'm trying to fix
up another path in which we tear down the DMA ops prior to releasing the
device's resources.

I didn't add a fixes tag as 376991db4b64 didn't have one either. It will
need to be backported to stable, I figure the same as 376991db4b64.


So 376991db4b64 required manual backporting to stable, and this patch 
would require the same:


https://www.spinics.net/lists/stable/msg289685.html

Robin, any further comment?

Thanks,
John



thanks,

greg k-h



Re: [PATCH v1 3/3] arm64: dts: qcom: msm8998: Add PCIe PHY and RC nodes

2019-04-04 Thread Marc Gonzalez
On 28/03/2019 18:06, Marc Gonzalez wrote:

> Add MSM8998 PCIe QMP PHY and PCIe root complex DT nodes.

I suppose I should add a reference to the downstream DTS:

https://source.codeaurora.org/quic/la/kernel/msm-4.4/tree/arch/arm/boot/dts/qcom/msm8998.dtsi?h=LE.UM.1.3.r3.25#n2537


> diff --git a/arch/arm64/boot/dts/qcom/msm8998.dtsi 
> b/arch/arm64/boot/dts/qcom/msm8998.dtsi
> index 5a1c0961b281..9f979a51f679 100644
> --- a/arch/arm64/boot/dts/qcom/msm8998.dtsi
> +++ b/arch/arm64/boot/dts/qcom/msm8998.dtsi
> @@ -621,6 +621,84 @@
>   ;
>   };
>  
> + pcie0: pci@1c0 {
> + compatible = "qcom,pcie-msm8996";

Should probably be "qcom,pcie-msm8998"; in case we need to fudge something
in 8998 and not in 8996.

Apart from this minor issue, I think it's good to go for 5.2
(I'll respin without the SMMU patch)

I just need Kishon to take the PHY series, and Stephen to take the clock hack,
and we're rolling.

Oh and it would be useful to have someone review the PCIE20_PARF_BDF_TRANSLATE_N
patch.

Regards.


Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-04 Thread Peter Zijlstra
On Thu, Apr 04, 2019 at 09:21:52AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 2c471a2c43fa..d17d33f36a01 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -204,6 +204,14 @@ struct page {
> >  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
> > int _last_cpupid;
> >  #endif
> > +
> > +#ifdef CONFIG_XPFO
> > +   /* Counts the number of times this page has been kmapped. */
> > +   atomic_t xpfo_mapcount;
> > +
> > +   /* Serialize kmap/kunmap of this page */
> > +   spinlock_t xpfo_lock;
> 
> NAK, see ALLOC_SPLIT_PTLOCKS
> 
> spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_
> sure you want spinlock_t and not raw_spinlock_t ? For
> CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex.
> 
> > +#endif
> 
> Growing the page-frame by 8 bytes (in the good case) is really sad,
> that's a _lot_ of memory.

So if you use the original kmap_atomic/kmap code from i386 and create
an alias per user you can do away with all that.

Now, that leaves you with the fixmap kmap_atomic code, which I also
hate, but it gets rid of a lot of the ugly you introduce in these here
patches.

As to the fixmap kmap_atomic; so fundamentally the PTEs are only used on
a single CPU and therefore CPU local TLB invalidation _should_ suffice.

However, speculation...

Another CPU can speculatively hit upon a fixmap entry for another CPU
and populate its own TLB entry. Then the TLB invalidate is
insufficient; it leaves a stale entry in a remote TLB.

If the remote CPU then re-uses that fixmap slot to alias another page,
we have two CPUs with different translations for the same VA, a
condition that AMD CPUs dislike enough to machine check on (IIRC).

Actually hitting that is incredibly difficult (we have to have
speculation, fixmap reuse and not get a full TLB invalidate in between),
but, afaict, not impossible.

Your monstrosity from the last patch avoids this particular issue by not
aliasing in this manner, but it comes at the cost of this page-frame
bloat. Also, I'm still not sure there's not other problems with it.

Bah..


Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-04-04 Thread Peter Zijlstra
On Wed, Apr 03, 2019 at 11:34:13AM -0600, Khalid Aziz wrote:
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 999d6d8f0bef..cc806a01a0eb 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -37,6 +37,20 @@
>   */
>  #define LAST_USER_MM_IBPB0x1UL
>  
> +/*
> + * A TLB flush may be needed to flush stale TLB entries
> + * for pages that have been mapped into userspace and unmapped
> + * from kernel space. This TLB flush needs to be propagated to
> + * all CPUs. Asynchronous flush requests to all CPUs can cause
> + * significant performance impact. Queue a pending flush for
> + * a CPU instead. Multiple of these requests can then be handled
> + * by a CPU at a less disruptive time, like context switch, in
> + * one go and reduce performance impact significantly. Following
> + * data structure is used to keep track of CPUs with pending full
> + * TLB flush forced by xpfo.
> + */
> +static cpumask_t pending_xpfo_flush;
> +
>  /*
>   * We get here when we do something requiring a TLB invalidation
>   * but could not go invalidate all of the contexts.  We do the
> @@ -321,6 +335,16 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
> mm_struct *next,
>   __flush_tlb_all();
>   }
>  #endif
> +
> + /*
> +  * If there is a pending TLB flush for this CPU due to XPFO
> +  * flush, do it now.
> +  */
> + if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) {
> + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
> + __flush_tlb_all();
> + }

That really should be:

if (cpumask_test_cpu(cpu, &pending_xpfo_flush)) {
cpumask_clear_cpu(cpu, &pending_xpfo_flush);
count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
__flush_tlb_all();
}

test_and_clear is an unconditional RmW and can cause cacheline
contention between adjecent CPUs even if none of the bits are set.

> +
>   this_cpu_write(cpu_tlbstate.is_lazy, false);
>  
>   /*
> @@ -803,6 +827,34 @@ void flush_tlb_kernel_range(unsigned long start, 
> unsigned long end)
>   }
>  }
>  
> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
> +{
> + struct cpumask tmp_mask;
> +
> + /*
> +  * Balance as user space task's flush, a bit conservative.
> +  * Do a local flush immediately and post a pending flush on all
> +  * other CPUs. Local flush can be a range flush or full flush
> +  * depending upon the number of entries to be flushed. Remote
> +  * flushes will be done by individual processors at the time of
> +  * context switch and this allows multiple flush requests from
> +  * other CPUs to be batched together.
> +  */
> + if (end == TLB_FLUSH_ALL ||
> + (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
> + do_flush_tlb_all(NULL);
> + } else {
> + struct flush_tlb_info info;
> +
> + info.start = start;
> + info.end = end;
> + do_kernel_range_flush(&info);
> + }
> + cpumask_setall(&tmp_mask);
> + __cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
> + cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
> +}
> +
>  void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>  {
>   struct flush_tlb_info info = {
> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
> index b42513347865..638eee5b1f09 100644
> --- a/arch/x86/mm/xpfo.c
> +++ b/arch/x86/mm/xpfo.c
> @@ -118,7 +118,7 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int 
> order)
>   return;
>   }
>  
> - flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
> + xpfo_flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
>  }
>  EXPORT_SYMBOL_GPL(xpfo_flush_kernel_tlb);

So this patch is the one that makes it 'work', but I'm with Andy on
hating it something fierce.

Up until this point x86_64 is completely buggered in this series, after
this it sorta works but *urgh* what crap.

All in all your changelog is complete and utter garbage; this is _NOT_ a
performance issue. It is very much a correctness issue.

Also; I distinctly dislike the inconsistent TLB states this generates.
It makes it very hard to argue for its correctness..



Re: [RFC PATCH v9 11/13] xpfo, mm: optimize spinlock usage in xpfo_kunmap

2019-04-04 Thread Peter Zijlstra
On Wed, Apr 03, 2019 at 11:34:12AM -0600, Khalid Aziz wrote:
> From: Julian Stecklina 
> 
> Only the xpfo_kunmap call that needs to actually unmap the page
> needs to be serialized. We need to be careful to handle the case,
> where after the atomic decrement of the mapcount, a xpfo_kmap
> increased the mapcount again. In this case, we can safely skip
> modifying the page table.
> 
> Model-checked with up to 4 concurrent callers with Spin.
> 
> Signed-off-by: Julian Stecklina 
> Signed-off-by: Khalid Aziz 
> Cc: Khalid Aziz 
> Cc: x...@kernel.org
> Cc: kernel-harden...@lists.openwall.com
> Cc: Vasileios P. Kemerlis 
> Cc: Juerg Haefliger 
> Cc: Tycho Andersen 
> Cc: Marco Benatto 
> Cc: David Woodhouse 
> ---
>  include/linux/xpfo.h | 24 +++-
>  1 file changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
> index 2318c7eb5fb7..37e7f52fa6ce 100644
> --- a/include/linux/xpfo.h
> +++ b/include/linux/xpfo.h
> @@ -61,6 +61,7 @@ static inline void xpfo_kmap(void *kaddr, struct page *page)
>  static inline void xpfo_kunmap(void *kaddr, struct page *page)
>  {
>   unsigned long flags;
> + bool flush_tlb = false;
>  
>   if (!static_branch_unlikely(&xpfo_inited))
>   return;
> @@ -72,18 +73,23 @@ static inline void xpfo_kunmap(void *kaddr, struct page 
> *page)
>* The page is to be allocated back to user space, so unmap it from
>* the kernel, flush the TLB and tag it as a user page.
>*/
> - spin_lock_irqsave(&page->xpfo_lock, flags);
> -
>   if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
> -#ifdef CONFIG_XPFO_DEBUG
> - WARN_ON(PageXpfoUnmapped(page));
> -#endif
> - SetPageXpfoUnmapped(page);
> - set_kpte(kaddr, page, __pgprot(0));
> - xpfo_flush_kernel_tlb(page, 0);
> + spin_lock_irqsave(&page->xpfo_lock, flags);
> +
> + /*
> +  * In the case, where we raced with kmap after the
> +  * atomic_dec_return, we must not nuke the mapping.
> +  */
> + if (atomic_read(&page->xpfo_mapcount) == 0) {
> + SetPageXpfoUnmapped(page);
> + set_kpte(kaddr, page, __pgprot(0));
> + flush_tlb = true;
> + }
> + spin_unlock_irqrestore(&page->xpfo_lock, flags);
>   }
>  
> - spin_unlock_irqrestore(&page->xpfo_lock, flags);
> + if (flush_tlb)
> + xpfo_flush_kernel_tlb(page, 0);
>  }

This doesn't help with the TLB invalidation issue, AFAICT this is still
completely buggered. kunmap_atomic() can be called from IRQ context.


Re: [RFC PATCH v9 04/13] xpfo, x86: Add support for XPFO for x86-64

2019-04-04 Thread Peter Zijlstra
On Wed, Apr 03, 2019 at 11:34:05AM -0600, Khalid Aziz wrote:
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 2779ace16d23..5c0e1581fa56 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void)
>   return boot_cpu_has_bug(X86_BUG_L1TF);
>  }
>  
> +/*
> + * The current flushing context - we pass it instead of 5 arguments:
> + */
> +struct cpa_data {
> + unsigned long   *vaddr;
> + pgd_t   *pgd;
> + pgprot_tmask_set;
> + pgprot_tmask_clr;
> + unsigned long   numpages;
> + unsigned long   curpage;
> + unsigned long   pfn;
> + unsigned intflags;
> + unsigned intforce_split : 1,
> + force_static_prot   : 1;
> + struct page **pages;
> +};
> +
> +
> +int
> +should_split_large_page(pte_t *kpte, unsigned long address,
> + struct cpa_data *cpa);
> +extern spinlock_t cpa_lock;
> +int
> +__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
> +struct page *base);
> +

I really hate exposing all that.

>  #include 
>  #endif   /* __ASSEMBLY__ */
>  

> diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
> new file mode 100644
> index ..3045bb7e4659
> --- /dev/null
> +++ b/arch/x86/mm/xpfo.c
> @@ -0,0 +1,123 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
> + * Copyright (C) 2016 Brown University. All rights reserved.
> + *
> + * Authors:
> + *   Juerg Haefliger 
> + *   Vasileios P. Kemerlis 
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published 
> by
> + * the Free Software Foundation.
> + */
> +
> +#include 
> +
> +#include 
> +
> +extern spinlock_t cpa_lock;
> +
> +/* Update a single kernel page table entry */
> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
> +{
> + unsigned int level;
> + pgprot_t msk_clr;
> + pte_t *pte = lookup_address((unsigned long)kaddr, &level);
> +
> + if (unlikely(!pte)) {
> + WARN(1, "xpfo: invalid address %p\n", kaddr);
> + return;
> + }
> +
> + switch (level) {
> + case PG_LEVEL_4K:
> + set_pte_atomic(pte, pfn_pte(page_to_pfn(page),
> +canon_pgprot(prot)));

(sorry, do we also need a nikon_pgprot() ? :-)

> + break;
> + case PG_LEVEL_2M:
> + case PG_LEVEL_1G: {
> + struct cpa_data cpa = { };
> + int do_split;
> +
> + if (level == PG_LEVEL_2M)
> + msk_clr = pmd_pgprot(*(pmd_t *)pte);
> + else
> + msk_clr = pud_pgprot(*(pud_t *)pte);
> +
> + cpa.vaddr = kaddr;
> + cpa.pages = &page;
> + cpa.mask_set = prot;
> + cpa.mask_clr = msk_clr;
> + cpa.numpages = 1;
> + cpa.flags = 0;
> + cpa.curpage = 0;
> + cpa.force_split = 0;
> +
> +
> + do_split = should_split_large_page(pte, (unsigned long)kaddr,
> +&cpa);
> + if (do_split) {
> + struct page *base;
> +
> + base = alloc_pages(GFP_ATOMIC, 0);
> + if (!base) {
> + WARN(1, "xpfo: failed to split large page\n");

You have to be fcking kidding right? A WARN when a GFP_ATOMIC allocation
fails?!

> + break;
> + }
> +
> + if (!debug_pagealloc_enabled())
> + spin_lock(&cpa_lock);
> + if  (__split_large_page(&cpa, pte, (unsigned long)kaddr,
> + base) < 0) {
> + __free_page(base);
> + WARN(1, "xpfo: failed to split large page\n");
> + }
> + if (!debug_pagealloc_enabled())
> + spin_unlock(&cpa_lock);
> + }
> +
> + break;

Ever heard of helper functions?

> + }
> + case PG_LEVEL_512G:
> + /* fallthrough, splitting infrastructure doesn't
> +  * support 512G pages.
> +  */

Broken comment style.

> + default:
> + WARN(1, "xpfo: unsupported page level %x\n", level);
> + }
> +
> +}
> +EXPORT_SYMBOL_GPL(set_kpte);
> +
> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
> +{
> + int level;
> + unsigned long size, kaddr;
> +
> + kaddr = (unsigned long)page_address(page);
> +
> + if (unlikely(!lookup_address(kaddr, &level))) {
> + WARN(1, "xpfo: invalid address to flush %lx %d\n", kadd

Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-04 Thread Peter Zijlstra


You must be so glad I no longer use kmap_atomic from NMI context :-)

On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
> +static inline void xpfo_kmap(void *kaddr, struct page *page)
> +{
> + unsigned long flags;
> +
> + if (!static_branch_unlikely(&xpfo_inited))
> + return;
> +
> + if (!PageXpfoUser(page))
> + return;
> +
> + /*
> +  * The page was previously allocated to user space, so
> +  * map it back into the kernel if needed. No TLB flush required.
> +  */
> + spin_lock_irqsave(&page->xpfo_lock, flags);
> +
> + if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
> + TestClearPageXpfoUnmapped(page))
> + set_kpte(kaddr, page, PAGE_KERNEL);
> +
> + spin_unlock_irqrestore(&page->xpfo_lock, flags);

That's a really sad sequence, not wrong, but sad. _3_ atomic operations,
2 on likely the same cacheline. And mostly all pointless.

This patch makes xpfo_mapcount an atomic, but then all modifications are
under the spinlock, what gives?

Anyway, a possibly saner sequence might be:

if (atomic_inc_not_zero(&page->xpfo_mapcount))
return;

spin_lock_irqsave(&page->xpfo_lock, flag);
if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
TestClearPageXpfoUnmapped(page))
set_kpte(kaddr, page, PAGE_KERNEL);
spin_unlock_irqrestore(&page->xpfo_lock, flags);

> +}
> +
> +static inline void xpfo_kunmap(void *kaddr, struct page *page)
> +{
> + unsigned long flags;
> +
> + if (!static_branch_unlikely(&xpfo_inited))
> + return;
> +
> + if (!PageXpfoUser(page))
> + return;
> +
> + /*
> +  * The page is to be allocated back to user space, so unmap it from
> +  * the kernel, flush the TLB and tag it as a user page.
> +  */
> + spin_lock_irqsave(&page->xpfo_lock, flags);
> +
> + if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
> +#ifdef CONFIG_XPFO_DEBUG
> + WARN_ON(PageXpfoUnmapped(page));
> +#endif
> + SetPageXpfoUnmapped(page);
> + set_kpte(kaddr, page, __pgprot(0));
> + xpfo_flush_kernel_tlb(page, 0);

You didn't speak about the TLB invalidation anywhere. But basically this
is one that x86 cannot do.

> + }
> +
> + spin_unlock_irqrestore(&page->xpfo_lock, flags);

Idem:

if (atomic_add_unless(&page->xpfo_mapcount, -1, 1))
return;




> +}

Also I'm failing to see the point of PG_xpfo_unmapped, afaict it
is identical to !atomic_read(&page->xpfo_mapcount).



Re: [RFC PATCH v9 03/13] mm: Add support for eXclusive Page Frame Ownership (XPFO)

2019-04-04 Thread Peter Zijlstra
On Wed, Apr 03, 2019 at 11:34:04AM -0600, Khalid Aziz wrote:
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2c471a2c43fa..d17d33f36a01 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -204,6 +204,14 @@ struct page {
>  #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
>   int _last_cpupid;
>  #endif
> +
> +#ifdef CONFIG_XPFO
> + /* Counts the number of times this page has been kmapped. */
> + atomic_t xpfo_mapcount;
> +
> + /* Serialize kmap/kunmap of this page */
> + spinlock_t xpfo_lock;

NAK, see ALLOC_SPLIT_PTLOCKS

spinlock_t can be _huge_ (CONFIG_PROVE_LOCKING=y), also are you _really_
sure you want spinlock_t and not raw_spinlock_t ? For
CONFIG_PREEMPT_FULL spinlock_t turns into a rtmutex.

> +#endif

Growing the page-frame by 8 bytes (in the good case) is really sad,
that's a _lot_ of memory.

>  } _struct_page_alignment;


Re: 5.1-rc1: mpt init crash in scsi_map_dma, dma_4v_map_sg on sparc64

2019-04-04 Thread Meelis Roos

> I think this might have been this commit:
> 
> commit 24132a419c68f1d69eb8ecc91b3c80d730ecbb59
> Author: Christoph Hellwig 
> Date:   Fri Feb 15 09:30:28 2019 +0100
> 
>  sparc64/pci_sun4v: allow large DMA masks
> 
> the patch below adds a few missing checks and hopefully should fix
> your problem.  If not can you try to revert the commit to check if
> my theory was correct to start with?

Yes, reverting this commit makes my T1000 boot.

--
Meelis Roos 