> On Dec 23, 2020, at 2:29 PM, Yu Zhao <[email protected]> wrote:
>
>
> I was hesitant to suggest the following because it isn't that straight
> forward. But since you seem to be less concerned with the complexity,
> I'll just bring it on the table -- it would take care of both ufd and
> clear_refs_write, wouldn't it?
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e9ca612d7d7..af38c5ee327e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> 			goto unlock;
> 	}
> 	if (vmf->flags & FAULT_FLAG_WRITE) {
> -		if (!pte_write(entry))
> +		if (!pte_write(entry)) {
> +			if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> +				flush_tlb_page(vmf->vma, vmf->address);
> 			return do_wp_page(vmf);
> +		}
I don’t love this as a long-term fix. AFAICT we can have mm_tlb_flush_pending
set for quite a while -- mprotect, for example, seems like it can block on IO
while splitting a huge page. That gives us a window in which every write fault
turns into a TLB flush.
I’m not immediately sure how to do much better, though. We could potentially
keep a record of pending ranges that need flushing, per mm or per PTL,
protected by the PTL, and arrange to do the flush if we notice that flushes
are pending when we want to do_wp_page(). That would limit us to at most one
extra flush, at least until the concurrent mprotect (or whatever) makes
further progress. The bookkeeping might be nasty, though.
But x86 already sort of does some of this bookkeeping, and arguably x86’s code
could be improved by tracking TLB ranges to flush per mm instead of per flush
request -- Nadav already got us halfway there by making a little cache of
flush_tlb_info structs. IMO it wouldn’t be totally crazy to integrate this
better with tlb_gather_mmu so that the pending flush data is visible to other
CPUs even before the flush is actually kicked off.

In the limit, this starts to look like a fully async flush mechanism. One
function would request a flush and return a generation count without actually
flushing anything, with the case of flushing a range adjacent to a
still-pending range explicitly optimized. Another function would actually
initiate the flush and wait for it to complete. And we could, while holding
the PTL, scan the list of pending flushes, if any, to see whether the PTE
we’re looking at has a flush pending.

This is sort of easy in the one-PTL-per-mm case but potentially rather
complicated in the split-PTL case. And I’m genuinely unsure where the “batch”
TLB flush interface fits in, because it can create a batch that spans more
than one mm. x86 can deal with this moderately efficiently, since we limit the
number of live mms per CPU and our flushes are (for now?) per CPU, not per mm.
The requesting side would look something like:

	u64 gen = 0;
	for (...)
		gen = queue_flush(mm, start, end, freed_tables);
	flush_to_gen(mm, gen);

and the wp fault path would do:

	wait_for_pending_flushes(mm, address);
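For illustration, here is a user-space sketch of what the generation-counted
side of that API could look like, with the mm and freed_tables arguments
dropped, the adjacent-range case handled by merging into a single pending
range, and the actual IPI/flush and all locking elided. Everything beyond the
queue_flush/flush_to_gen names from the snippet above is an assumption:

```c
/* Hypothetical per-mm flush queue: one merged pending range plus two
 * generation counters. */
struct flush_queue {
	unsigned long start, end;	/* merged pending range, [start, end) */
	unsigned long long queued_gen;	/* last generation handed out */
	unsigned long long done_gen;	/* last generation actually flushed */
};

/* Record a range to flush and return its generation.  A range queued
 * while another is still pending is merged into it -- the explicitly
 * optimized adjacent-range case. */
static unsigned long long queue_flush(struct flush_queue *q,
				      unsigned long start, unsigned long end)
{
	if (q->queued_gen == q->done_gen) {
		q->start = start;
		q->end = end;
	} else {
		if (start < q->start)
			q->start = start;
		if (end > q->end)
			q->end = end;
	}
	return ++q->queued_gen;
}

/* Initiate and wait for the flush covering everything up to @gen.
 * If a concurrent caller already flushed past @gen, do nothing. */
static void flush_to_gen(struct flush_queue *q, unsigned long long gen)
{
	if (q->done_gen >= gen)
		return;
	/* Real code would flush q->start..q->end on all relevant CPUs
	 * here, then publish the new done_gen. */
	q->done_gen = q->queued_gen;
}
```

One consequence of the merge-into-one-range simplification is that a single
flush_to_gen() call can satisfy several queued generations at once, which is
exactly the batching we want when many ranges are queued before anyone waits.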
Other than the possibility of devolving to one flush per operation when one
thread is write-faulting at the same rate that another thread is modifying
PTEs, this should be decently performant.