On 01/12/2025 18:35, Peter Xu wrote:
On Mon, Dec 01, 2025 at 04:48:22PM +0000, Nikita Kalyazin wrote:
I believe I found the precise point where we convinced ourselves that MINOR
support was sufficient: [1]. If we no longer find that reasoning valid, then
implementing MISSING is indeed the only option.
[1] https://lore.kernel.org/kvm/[email protected]
Now after I re-read the discussion, I may have made a wrong statement
there, sorry. I could have got slightly confused on when the write()
syscall can be involved.
I agree that if you want to get an event on a page cache miss with the
current uffd definitions, and pre-population is forbidden, then a MISSING
trap is required. That holds whether or not UFFDIO_COPY is available.
Do I understand it right that UFFDIO_COPY is not allowed in your case, but
only write()?
No, UFFDIO_COPY would work perfectly fine. We will still use write()
whenever we resolve stage-2 faults, as they aren't visible to UFFD. When
a userfault occurs at an offset that already has a page in the cache, we
will have to keep using UFFDIO_CONTINUE, so it looks like both
UFFDIO_COPY and UFFDIO_CONTINUE will be required:
- user mapping major fault -> UFFDIO_COPY (fills the cache and sets up
userspace PT)
- user mapping minor fault -> UFFDIO_CONTINUE (only sets up userspace PT)
- stage-2 fault -> write() (only fills the cache)
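
As a rough sketch of the three cases above, the choice could be written as
a small decision function. All names here (fault_origin, resolve_action,
pick_action) are hypothetical and exist neither in the kernel nor in any
VMM; only the UFFDIO_COPY / UFFDIO_CONTINUE / write() mapping comes from
the list above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: where the fault was observed. */
enum fault_origin {
    FAULT_USER_MAPPING, /* fault via the userspace mapping, reported by uffd */
    FAULT_STAGE2,       /* stage-2 fault, not visible to uffd */
};

/* Hypothetical: which mechanism resolves the fault. */
enum resolve_action {
    RESOLVE_UFFDIO_COPY,     /* fills the cache and sets up userspace PT */
    RESOLVE_UFFDIO_CONTINUE, /* only sets up userspace PT */
    RESOLVE_WRITE,           /* only fills the cache */
};

/*
 * Pick the resolution for a fault. page_in_cache says whether the faulting
 * offset already has a page in the cache, i.e. whether a user mapping
 * fault is minor (true) or major (false). It is ignored for stage-2 faults.
 */
static enum resolve_action pick_action(enum fault_origin origin,
                                       bool page_in_cache)
{
    assert(origin == FAULT_USER_MAPPING || origin == FAULT_STAGE2);

    if (origin == FAULT_STAGE2)
        return RESOLVE_WRITE;

    return page_in_cache ? RESOLVE_UFFDIO_CONTINUE : RESOLVE_UFFDIO_COPY;
}
```

This only captures the decision; issuing the actual ioctl or write() and
handling races between the two fault paths are left out.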
One way to work around this might be to introduce a new UFFD_FEATURE bit
that allows a MINOR registration to trap all pgtable faults, which would
change the MINOR fault semantics.
This would work equally well for us. I suppose these MINOR+MAJOR semantics
would be more intrusive from the API point of view, though.
That'll need some further thought. Meanwhile, we may also want to make sure
the old shmem/hugetlbfs semantics are kept (e.g. they should fail MINOR
registrations if the new feature bit is enabled in the ctx somehow, or
support them properly in the codebase).
Thanks,
--
Peter Xu