On 01/12/2025 20:57, Peter Xu wrote:
On Mon, Dec 01, 2025 at 08:12:38PM +0000, Nikita Kalyazin wrote:


On 01/12/2025 18:35, Peter Xu wrote:
On Mon, Dec 01, 2025 at 04:48:22PM +0000, Nikita Kalyazin wrote:
I believe I found the precise point where we convinced ourselves that minor
support was sufficient: [1].  If at this moment we don't find that reasoning
valid anymore, then indeed implementing missing is the only option.

[1] https://lore.kernel.org/kvm/[email protected]

Now after I re-read the discussion, I may have made a wrong statement
there, sorry.  I could have got slightly confused on when the write()
syscall can be involved.

I agree if you want to get an event when cache missed with the current uffd
definitions and when pre-population is forbidden, then MISSING trap is
required.  That is, with/without the need of UFFDIO_COPY being available.

Do I understand it right that UFFDIO_COPY is not allowed in your case, but
only write()?

No, UFFDIO_COPY would work perfectly fine.  We will still use write()
whenever we resolve stage-2 faults as they aren't visible to UFFD.  When a
userfault occurs at an offset that already has a page in the cache, we will
have to keep using UFFDIO_CONTINUE so it looks like both will be required:

  - user mapping major fault -> UFFDIO_COPY (fills the cache and sets up
userspace PT)
  - user mapping minor fault -> UFFDIO_CONTINUE (only sets up userspace PT)
  - stage-2 fault -> write() (only fills the cache)

Is stage-2 fault about KVM_MEMORY_EXIT_FLAG_USERFAULT, per James's series?

Yes, that's the one ([1]).

[1] https://lore.kernel.org/kvm/[email protected]


It looks fine indeed, but it looks slightly weird then, as you'll have two
ways to populate the page cache.  Logically here atomicity is indeed not
needed when you trap both MISSING + MINOR.

I reran the test based on the UFFDIO_COPY prototype I had using your series [2], and UFFDIO_COPY is slower than write() to populate 512 MiB: 237 vs 202 ms (+17%). Even though UFFDIO_COPY alone is functionally sufficient, I would prefer to have an option to use write() where possible and only falling back to UFFDIO_COPY for userspace faults to have better performance.

[2] https://lore.kernel.org/all/[email protected]




One way that might work this around, is introducing a new UFFD_FEATURE bit
allowing the MINOR registration to trap all pgtable faults, which will
change the MINOR fault semantics.

This would equally work for us.  I suppose this MINOR+MAJOR semantics would
be more intrusive from the API point of view though.

Yes it is, it's just that I don't know whether it'll be harder when you
want to completely support UFFDIO_COPY here, per previous discussions.

After a 2nd thought, such UFFD_FEATURE is probably not a good design,
because it essentially means that feature bit will functionally overlap
with what MISSING trap was trying to do, however duplicating that concept
in a VMA that was registered as MINOR only.

Maybe it's possible instead if we allow a module to support MISSING trap,
but without supporting UFFDIO_COPY ioctl.

That is, the MISSING events will be properly generated if MISSING traps are
supported, however the module needs to provide its own way to resolve it if
UFFDIO_COPY ioctl isn't available.  Gmem is fine in this case as long as
it'll always be registered with both MISSING+MINOR traps, then resolving
using write()s would work.

Yes, this would also work for me. This is almost how it was (accidentally) working until this version of the patches. If this is a lighter undertaking compared to implementing UFFDIO_COPY, I'd be happy with it too.


Such would be possible when with something like my v3 previously:

https://lore.kernel.org/all/[email protected]/#t

Then gmem needs to declare VM_UFFD_MISSING + VM_UFFD_MINOR in
uffd_features, but _UFFDIO_CONTINUE only (without _UFFDIO_COPY) in
uffd_ioctls.

Since Mike already took this series over, I'll leave that to you all to
decide.

Thanks for you input, Peter.


--
Peter Xu



Reply via email to