> On Dec 23, 2020, at 2:29 PM, Yu Zhao <yuz...@google.com> wrote:
> 
> I was hesitant to suggest the following because it isn't that
> straightforward. But since you seem to be less concerned with the
> complexity, I'll just put it on the table -- it would take care of both
> uffd and clear_refs_write, wouldn't it?
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e9ca612d7d7..af38c5ee327e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>        goto unlock;
>    }
>    if (vmf->flags & FAULT_FLAG_WRITE) {
> -        if (!pte_write(entry))
> +        if (!pte_write(entry)) {
> +            if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> +                flush_tlb_page(vmf->vma, vmf->address);
>            return do_wp_page(vmf);
> +        }

I don’t love this as a long-term fix. AFAICT we can have mm_tlb_flush_pending 
set for quite a while; mprotect, for example, can wait on IO while splitting a 
huge page. That gives us a window in which every write fault turns into a TLB 
flush.

I’m not immediately sure how to do much better, though. We could potentially 
keep a record of pending ranges that need flushing, per mm or per PTL and 
protected by the PTL, and arrange to do the flush if we notice one is pending 
when we want to do_wp_page().  That would at least limit us to one extra flush 
until the concurrent mprotect (or whatever) makes further progress.  The 
bookkeeping might be nasty, though.

But x86 already sort of does some of this bookkeeping, and arguably x86’s code 
could be improved by tracking TLB ranges to flush per mm instead of per flush 
request; Nadav already got us halfway there by making a little cache of 
flush_tlb_info structs.  IMO it wouldn’t be totally crazy to integrate this 
better with tlb_gather_mmu to make the pending-flush data visible to other CPUs 
even before actually kicking off the flush.

In the limit, this starts to look a bit like a fully async flush mechanism.  We 
would have a function to request a flush, and that function would return a 
generation count but not actually flush anything.  The case of flushing a range 
adjacent to a still-pending range would be explicitly optimized.  Then another 
function would actually initiate and wait for the flush to complete.  And we 
could, while holding the PTL, scan the list of pending flushes, if any, to see 
if the PTE we’re looking at has a flush pending.

This is sort of easy in the one-PTL-per-mm case but potentially rather 
complicated in the split-PTL case.  And I’m genuinely unsure where the “batch” 
TLB flush interface fits in, because it can create a batch that spans more than 
one mm.  x86 can deal with this moderately efficiently since we limit the 
number of live mms per CPU and our flushes are (for now?) per CPU, not per mm.

So the munmap/mprotect-style path would do something like:

u64 gen = 0;
for (...)
  gen = queue_flush(mm, start, end, freed_tables);
flush_to_gen(mm, gen);

and the wp fault path does:

wait_for_pending_flushes(mm, address);

Other than the possibility of devolving to one flush per operation if one 
thread is page faulting at the same speed that another thread is modifying 
PTEs, this should be decently performant.
