Re: kernel BUG at mm/huge_memory.c:2613!

2020-06-22 Thread Andrea Arcangeli
Hello,

On Mon, Jun 22, 2020 at 04:30:41PM +0100, Robin Murphy wrote:
> On 2020-06-22 13:46, Joerg Roedel wrote:
> > + Robin
> > 
> > Robin, any idea on this?
> 
> After a bit of archaeology, this dates back to the original review:
> 
> https://lore.kernel.org/linux-arm-kernel/54c285d4.3070...@arm.com/
> https://lore.kernel.org/linux-arm-kernel/54da2666.9030...@arm.com/
> 
> In summary: originally this inherited from other arch code that did 
> simply strip __GFP_COMP; that was deemed questionable because of the 
> nonsensical comment about CONFIG_HUGETLBFS that was stuck to it; the 
> current code is like it is because in 5 and a half years nobody said 
> that it's wrong :)
> 
> If there actually *are* good reasons for stripping __GFP_COMP, then I've 
> certainly no objection to doing so.

The main question is whether there's any good reason for not
forbidding __GFP_COMP in the callers. The reason given in the
comment isn't convincing.

I don't see how a caller that gets back a plain pointer can care how
the page structure looks, and in turn why it would ask for
__GFP_COMP.

As far as I can tell there are two orthogonal issues in play here:

1) The comment about __GFP_COMP letting the sound driver do partial
   mappings doesn't make much sense. It's probably best to WARN_ON
   immediately in dma_alloc_coherent if __GFP_COMP is specified, not
   only down the call stack in the __iommu_dma_alloc_pages() path.

   Note: the CMA paths already ignore __GFP_COMP if it's specified,
   so a __GFP_COMP request can already go unhonored. It sounds
   preferable to warn the caller that it's asking for something it
   can't get than to silently ignore __GFP_COMP.

   On a side note: hugetlbfs/THP pages can only be allocated with
   __GFP_COMP because, for example, put_page() must work on all tail
   pages (you can't call compound_head() unless the tail page is part
   of a compound page). But for private driver pages mapped by
   remap_pfn_range, any full or partial mapping is done manually and
   nobody can call GUP on VM_PFNMAP|VM_IO anyway (in fact those
   mappings aren't even required to be backed by a page struct).

2) __iommu_dma_alloc_pages cannot use __GFP_COMP if it intends to
   return an array of small pages, which is the only input the
   current sg_alloc_table_from_pages() supports (see the sketch
   below). split_page will work as expected to generate small pages
   from non-compound order>0 pages; incidentally it's implemented in
   mm/page_alloc.c, not in huge_memory.c.

   split_huge_page, as opposed, is not intended to be used on newly
   allocated compound pages. Maybe we should rename it to
   split_trans_huge_page to make that more explicit, since it won't
   even work on hugetlbfs (compound) pages.
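
   As a concrete illustration of point 2, here is a minimal, untested
   sketch (hypothetical helper name, kernel context assumed; not
   existing code) of the allocation pattern that produces a plain
   page array:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Illustrative sketch only: allocate an order>0 block without
 * __GFP_COMP and split it into independently refcounted small pages,
 * which is the only thing an array-based consumer like
 * sg_alloc_table_from_pages() can deal with.
 */
static unsigned int alloc_split_block(struct page **pages,
				      unsigned int order, gfp_t gfp)
{
	unsigned int i, count = 1U << order;
	struct page *page;

	/* split_page() only works on non-compound pages, so strip __GFP_COMP */
	page = alloc_pages(gfp & ~__GFP_COMP, order);
	if (!page)
		return 0;

	split_page(page, order);	/* each small page gets its own refcount */
	for (i = 0; i < count; i++)
		pages[i] = page + i;	/* each can later be freed with __free_page() */

	return count;
}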

Thanks,
Andrea



Re: kernel BUG at mm/huge_memory.c:2613!

2020-06-18 Thread Andrea Arcangeli
Hello,

On Thu, Jun 18, 2020 at 06:14:49PM -0700, Roman Gushchin wrote:
> I agree. The whole
> 
> 			page = alloc_pages_node(nid, alloc_flags, order);
> 			if (!page)
> 				continue;
> 			if (!order)
> 				break;
> 			if (!PageCompound(page)) {
> 				split_page(page, order);
> 				break;
> 			} else if (!split_huge_page(page)) {
> 				break;
> 			}
> 
> looks very suspicious to me.
> My wild guess is that gfp flags changed somewhere above, so we hit
> the branch which was never hit before.

Right to be suspicious about the above: split_huge_page on a regular
page allocated by a driver was never meant to work.

The PageLocked BUG_ON is just a symptom of a bigger issue: the page
may survive split_huge_page, but it'll stay compound and in turn it
must be freed as compound.

The respective free method doesn't even contemplate freeing compound
pages; the only way the free method can survive is by forcefully
removing __GFP_COMP from the allocation flags, which were perhaps set
here (there aren't that many __GFP_COMP users in that directory):

static void snd_malloc_dev_pages(struct snd_dma_buffer *dmab, size_t size)
{
	gfp_t gfp_flags;

	gfp_flags = GFP_KERNEL
		    | __GFP_COMP	/* compound page lets parts be mapped */

And I'm not sure what the comment means here: compound or
non-compound doesn't make a difference when you map it. It's not a
THP, the mappings must be handled manually, so nothing should check
PG_compound in the mapping code anyway.

Something like this may improve things. It's an untested quick hack,
but it assumes it's always a bug to set up a compound page for these
DMA allocations, and given the API that's probably a correct
assumption. Compound is slower: unless you need it, you can avoid it,
and then split_page will give contiguous memory at page granularity.
Ideally the code shouldn't call split_page at all and should free
everything at once, by keeping track of the order and by returning
the order to the caller, something the API can't do right now since
it returns a plain array that can only represent individual small
pages (a rough sketch of that idea follows).
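
For example, a hypothetical pair of helpers (illustration only, not a
proposed API) that keeps the order around so the whole buffer is
freed in one call, with no split_page at all:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/types.h>

/* Rough sketch only: remember the allocation order instead of splitting. */
struct dma_block {
	struct page	*page;
	unsigned int	order;
};

static bool dma_block_alloc(struct dma_block *blk, unsigned int order,
			    gfp_t gfp)
{
	/* non-compound allocation, like in the hack below */
	blk->page = alloc_pages(gfp & ~__GFP_COMP, order);
	blk->order = order;
	return blk->page != NULL;
}

static void dma_block_free(struct dma_block *blk)
{
	if (blk->page)
		__free_pages(blk->page, blk->order); /* frees the whole order>0 block */
}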

Once this is resolved, you may want to check your config: iommu
passthrough sounds more optimal for a soundcard.

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index f68a62c3c32b..3dfbc010fa83 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -499,6 +499,10 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
 
 	/* IOMMU can map any pages, so himem can also be used here */
 	gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
+	if (unlikely(gfp & __GFP_COMP)) {
+		WARN_ON(1);
+		gfp &= ~__GFP_COMP;
+	}
 
 	while (count) {
 		struct page *page = NULL;
@@ -522,13 +526,8 @@ static struct page **__iommu_dma_alloc_pages(struct device *dev,
 			continue;
 		if (!order)
 			break;
-		if (!PageCompound(page)) {
-			split_page(page, order);
-			break;
-		} else if (!split_huge_page(page)) {
-			break;
-		}
-		__free_pages(page, order);
+		split_page(page, order);
+		break;
 	}
 	if (!page) {
 		__iommu_dma_free_pages(pages, i);
diff --git a/sound/core/memalloc.c b/sound/core/memalloc.c
index 6850d13aa98c..378f5a36ec5f 100644
--- a/sound/core/memalloc.c
+++ b/sound/core/memalloc.c
@@ -28,7 +28,6 @@ static void snd_malloc_dev_pages(struct snd_dma_buffer *dmab, size_t size)
 	gfp_t gfp_flags;
 
 	gfp_flags = GFP_KERNEL
-		    | __GFP_COMP	/* compound page lets parts be mapped */
 		    | __GFP_NORETRY /* don't trigger OOM-killer */
 		    | __GFP_NOWARN; /* no stack trace print - this call is non-critical */
 	dmab->area = dma_alloc_coherent(dmab->dev.dev, size, &dmab->addr,


Re: [PATCH] mm/mmu_notifier: avoid double notification when it is useless

2017-10-03 Thread Andrea Arcangeli
Hello Jerome,

On Fri, Sep 01, 2017 at 01:30:11PM -0400, Jerome Glisse wrote:
> +Case A is obvious you do not want to take the risk for the device to write to
> +a page that might now be use by some completely different task.

used

> +is true ven if the thread doing the page table update is preempted right 
> after

even

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 90731e3b7e58..5706252b828a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1167,8 +1167,15 @@ static int do_huge_pmd_wp_page_fallback(struct 
> vm_fault *vmf, pmd_t orig_pmd,
>   goto out_free_pages;
>   VM_BUG_ON_PAGE(!PageHead(page), page);
>  
> + /*
> +  * Leave pmd empty until pte is filled note we must notify here as
> +  * concurrent CPU thread might write to new page before the call to
> +  * mmu_notifier_invalidate_range_end() happen which can lead to a

happens

> +  * device seeing memory write in different order than CPU.
> +  *
> +  * See Documentation/vm/mmu_notifier.txt
> +  */
>   pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
> - /* leave pmd empty until pte is filled */
>  

Here we can change the following mmu_notifier_invalidate_range_end to
skip calling ->invalidate_range. It could be called
mmu_notifier_invalidate_range_only_end, or other suggestions
welcome. Otherwise we'll repeat the call for nothing.

We need it inside the PT lock for the ordering issue, but we don't
need to run it twice.

Same in do_huge_pmd_wp_page, wp_page_copy and
migrate_vma_insert_page. Every time *clear_flush_notify is used,
mmu_notifier_invalidate_range_only_end should be called after it,
instead of mmu_notifier_invalidate_range_end.
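
For example, a rough sketch of the intended pattern (the _only_end
name is only the suggestion above, not an existing API; the actual
pte refill and error handling are trimmed):

/*
 * Sketch only: *clear_flush_notify already flushed the secondary TLBs
 * via ->invalidate_range under the PT lock, so the closing call can
 * skip that callback instead of repeating it.
 */
static void example_clear_and_refill_pmd(struct vm_area_struct *vma,
					 unsigned long haddr, pmd_t *pmd,
					 unsigned long start, unsigned long end)
{
	struct mm_struct *mm = vma->vm_mm;
	spinlock_t *ptl;

	mmu_notifier_invalidate_range_start(mm, start, end);

	ptl = pmd_lock(mm, pmd);
	/* Clears the pmd, flushes the CPU TLB and calls ->invalidate_range. */
	pmdp_huge_clear_flush_notify(vma, haddr, pmd);
	/* ... the new ptes would be filled in here, while the pmd is empty ... */
	spin_unlock(ptl);

	/* Like _range_end(), but skips the redundant ->invalidate_range. */
	mmu_notifier_invalidate_range_only_end(mm, start, end);
}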

I think optimizing that part too fits in the context of this patchset
(if not in the same patch), because the objective is still the same:
to remove unnecessary ->invalidate_range calls.

> +  * No need to notify as we downgrading page

are

> +  * table protection not changing it to point
> +  * to a new page.
> +  *

> +  * No need to notify as we downgrading page table to read only

are

> +  * No need to notify as we replacing a read only page with another

are

> @@ -1510,13 +1515,43 @@ static bool try_to_unmap_one(struct page *page, 
> struct vm_area_struct *vma,
>   if (pte_soft_dirty(pteval))
>   swp_pte = pte_swp_mksoft_dirty(swp_pte);
>   set_pte_at(mm, address, pvmw.pte, swp_pte);
> - } else
> + } else {
> + /*
> +  * We should not need to notify here as we reach this
> +  * case only from freeze_page() itself only call from
> +  * split_huge_page_to_list() so everything below must
> +  * be true:
> +  *   - page is not anonymous
> +  *   - page is locked
> +  *
> +  * So as it is a shared page and it is locked, it can
> +  * not be remove from the page cache and replace by
> +  * a new page before mmu_notifier_invalidate_range_end
> +  * so no concurrent thread might update its page table
> +  * to point at new page while a device still is using
> +  * this page.
> +  *
> +  * But we can not assume that new user of try_to_unmap
> +  * will have that in mind so just to be safe here call
> +  * mmu_notifier_invalidate_range()
> +  *
> +  * See Documentation/vm/mmu_notifier.txt
> +  */
>   dec_mm_counter(mm, mm_counter_file(page));
> + mmu_notifier_invalidate_range(mm, address,
> +   address + PAGE_SIZE);
> + }
>  discard:
> + /*
> +  * No need to call mmu_notifier_invalidate_range() as we are
> +  * either replacing a present pte with non present one (either
> +  * a swap or special one). We handling the clearing pte case
> +  * above.
> +  *
> +  * See Documentation/vm/mmu_notifier.txt
> +  */
>   page_remove_rmap(subpage, PageHuge(page));
>   put_page(page);
> - mmu_notifier_invalidate_range(mm, address,
> -   address + PAGE_SIZE);
>   }
>  
>   mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);

That is the path that unmaps filebacked pages (btw, not necessarily
shared unlike the comment says; they can be private but still
filebacked).

I'd like some more explanation about the inner working of "that new

Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback v2

2017-09-02 Thread Andrea Arcangeli
On Thu, Aug 31, 2017 at 05:17:25PM -0400, Jerome Glisse wrote:
> Jérôme Glisse (13):
>   dax: update to new mmu_notifier semantic
>   mm/rmap: update to new mmu_notifier semantic
>   powerpc/powernv: update to new mmu_notifier semantic
>   drm/amdgpu: update to new mmu_notifier semantic
>   IB/umem: update to new mmu_notifier semantic
>   IB/hfi1: update to new mmu_notifier semantic
>   iommu/amd: update to new mmu_notifier semantic
>   iommu/intel: update to new mmu_notifier semantic
>   misc/mic/scif: update to new mmu_notifier semantic
>   sgi-gru: update to new mmu_notifier semantic
>   xen/gntdev: update to new mmu_notifier semantic
>   KVM: update to new mmu_notifier semantic
>   mm/mmu_notifier: kill invalidate_page

Reviewed-by: Andrea Arcangeli <aarca...@redhat.com>


Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic

2017-08-31 Thread Andrea Arcangeli
On Wed, Aug 30, 2017 at 08:47:19PM -0400, Jerome Glisse wrote:
> On Wed, Aug 30, 2017 at 04:25:54PM -0700, Nadav Amit wrote:
> > For both CoW and KSM, the correctness is maintained by calling
> > ptep_clear_flush_notify(). If you defer the secondary MMU invalidation
> > (i.e., replacing ptep_clear_flush_notify() with ptep_clear_flush() ), you
> > will cause memory corruption, and page-lock would not be enough.
> 
> Just to add up, the IOMMU have its own CPU page table walker and it can
> walk the page table at any time (not the page table current to current
> CPU, IOMMU have an array that match a PASID with a page table and device
> request translation for a given virtual address against a PASID).
> 
> So this means the following can happen with ptep_clear_flush() only:
> 
>   CPU                                 | IOMMU
>                                       | - walk page table populate tlb at addr A
>   - clear pte at addr A               |
>   - set new pte                       |
    - mmu_notifier_invalidate_range_end | - flush IOMMU/device tlb

> 
> Device is using old page and CPU new page :(

That condition won't be persistent.

> 
> But with ptep_clear_flush_notify()
> 
>   CPU                            | IOMMU
>                                  | - walk page table populate tlb at addr A
>   - clear pte at addr A          |
>   - notify -> invalidate_range   | > flush IOMMU/device tlb
>   - set new pte                  |
> 
> So now either the IOMMU see the empty pte and trigger a page fault (this is
> if there is a racing IOMMU ATS right after the IOMMU/device tlb flush but
> before setting the new pte) or it see the new pte. Either way both IOMMU
> and CPU have a coherent view of what a virtual address points to.

Sure, the _notify version is obviously safe.

> Andrea explained to me the historical reasons set_pte_at_notify call the
> change_pte callback and it was intended so that KVM could update the
> secondary page table directly without having to fault. It is now a pointless
> optimization as the call to range_start() happening in all the places before
> any set_pte_at_notify() invalidate the secondary page table and thus will
> lead to page fault for the vm. I have talk with Andrea on way to bring back
> this optimization.

Yes, we've known for a long time: the optimization got basically
disabled when range_start/end expanded. It'd be nice to optimize
change_pte again but this is for later.

> Yes we need the following sequence for IOMMU:
>  - clear pte
>  - invalidate IOMMU/device TLB
>  - set new pte
> 
> Otherwise the IOMMU page table walker can populate IOMMU/device tlb with
> stall entry.
> 
> Note that this is not necessary for all the case. For try_to_unmap it
> is fine for instance to move the IOMMU tlb shoot down after changing the
> CPU page table as we are not pointing the pte to a different page. Either
> we clear the pte or we set a swap entry and as long as the page that use
> to be pointed by the pte is not free before the IOMMU tlb flush then we
> are fine.
> 
> In fact i think the only case where we need the above sequence (clear,
> flush secondary tlb, set new pte) is for COW. I think all other cases
> we can get rid of invalidate_range() from inside the page table lock
> and rely on invalidate_range_end() to call unconditionaly.

Even with CoW, it's not a big issue if the IOMMU keeps reading from
the old page for a little while longer (in between the PT lock
release and mmu_notifier_invalidate_range_end).

How can you tell that you read the old data a bit longer because the
device noticed the new page only when
mmu_notifier_invalidate_range_end ran, and not because the CPU was
faster at writing than the IOMMU was at loading the new pagetable?

I figure it would be detectable only in that the CPU could see pageA
changing before pageB, while the iommu-v2 could see pageB changing
before pageA. If that's a concern, it's the only good reason I can
think of right now for requiring ->invalidate_page inside the CoW PT
lock, to enforce the same ordering. However if the modifications
happen simultaneously and it's a correct runtime, the order must not
matter; still, it would be detectable, which may not be desirable.

Currently the _notify is absolutely needed to run ->invalidate_range
before put_page is run in the CoW code below. Because that put_page
is done in a not very scalable place, it would be better moved down
regardless of ->invalidate_range, to reduce the size of the
PT-lock-protected critical section.

-	if (new_page)
-		put_page(new_page);
 
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	if (new_page)
+		put_page(new_page);

Of course the iommu will not immediately start reading from the new
page, but even if it triggers a write protect fault and calls
handle_mm_fault, it will find a writeable pte already there and it'll
get an ->invalidate_range as soon as mmu_notifier_invalidate_range_end
runs so it can sure notice 

Re: [PATCH 0/3] mmu_notifier: Allow to manage CPU external TLBs

2014-07-24 Thread Andrea Arcangeli
On Thu, Jul 24, 2014 at 04:35:38PM +0200, Joerg Roedel wrote:
 To solve this situation I wrote a patch-set to introduce a
 new notifier call-back: mmu_notifier_invalidate_range(). This
 notifier lifts the strict requirement that no new
 references are taken in the range between _start() and
 _end(). When the subsystem can't guarantee that no new
 references are taken, it has to provide the
 invalidate_range() call-back to clear any new references in
 there.
 
 It is called between invalidate_range_start() and _end()
 every time the VMM has to wipe out any references to a
 couple of pages. These are usually the places where the CPU
 TLBs are flushed too and where it's important that this
 happens before invalidate_range_end() is called.
 
 Any comments and review appreciated!

Reviewed-by: Andrea Arcangeli aarca...@redhat.com