> Date: Tue, 25 Mar 2025 18:59:46 +0000
> From: Miod Vallat <[email protected]>
>
> > However, on amd64, with the diff applied the kernel faults when writing
> > to curproc. In the trace below statclock+0x108 corresponds to
> > tu_enter(&p->p_tu) in statclock().
>
> I have tried this and it fails even earlier for me.
>
> The uvm_map_protect() call in kern_exec.c will now end up invoking
> pmap_protect(), which is an inline function ending up in
> pmap_write_protect(pmap_kernel(), va, va + PAGE_SIZE).
>
> In my case, va = 0xffff.8000.4210.c000 which is in kernel space.
> However, at pmap_write_protect+0x213, which is the pmap_pte_clearbits()
> macro expansion here in the loop:
>
> 	for (/*null */; spte < epte ; spte++) {
> 		if (!pmap_valid_entry(*spte))
> 			continue;
> 		pmap_pte_clearbits(spte, clear);
> 		pmap_pte_setbits(spte, set);
> 	}
>
> we end up with spte == 0x7fffe.c000.8000, which is BELOW the kernel (and
> *spte == 0x464c457f == the ELF signature). Therefore the attempt to flip
> bits at this bogus address faults, pcb_onfault is (correctly) not set, and
> kpageflttrap() panics.
>
> Now if you look at the beginning of pmap_write_protect(), it does this:
>
> 	/* should be ok, but just in case ... */
> 	sva &= PG_FRAME;
> 	eva &= PG_FRAME;
>
> and I'm afraid I don't understand this. My understanding is that
> PG_FRAME is a mask that is supposed to be applied to physical
> addresses, not virtual addresses!
Indeed. That code seems to be inherited from i386, where it isn't the
right thing to do either, but doesn't do any actual harm.
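For what it's worth, with the values as I read them (PG_FRAME ==
0x000ffffffffff000 and PAGE_MASK == 0xfff), the two masks behave quite
differently on a kernel VA:

	0xffff.8000.4210.c000 & PG_FRAME   == 0x000f.8000.4210.c000  (bits 48-63 lost)
	0xffff.8000.4210.c000 & ~PAGE_MASK == 0xffff.8000.4210.c000  (only the page offset stripped)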
> Because of this, my initial page address, known as sva, gets
> "normalized" from 0xffff.8000.4210.c000 to 0x000f.8000.4210.c000, which
> is now LOWER than VM_MIN_KERNEL_ADDRESS and will not sign-extend
> correctly.
>
> Is the PG_FRAME masking really only intended to mask the low-order
> bits, and should it use ~PAGE_MASK instead?
Maybe. But something needs to be done to handle the VA hole. So
something like:
	sva = VA_SIGN_POS(sva);
	eva = VA_SIGN_POS(eva);
might work instead and ...
> In addition to this, the computation of `blockend' in the main loop of
> that routine will clear high-order bits (in my case, to
> 0x0000.8000.4220.0000), and because it assumes blockend > va to make
> progress at every iteration, this will actually become an infinite loop
> which will corrupt memory until it faults or you get tired of waiting
> for it to complete.
... fix this endless loop. But we have to pass the real VA to
pmap_tlb_shootrange(). So that wouldn't work either.
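To spell the loop problem out with your numbers: as far as I can see the
per-block boundary is computed along the lines of

	blockend = (va & L2_FRAME) + NBPD_L2;	/* next 2MB boundary */
	if (blockend > eva)
		blockend = eva;

and since L2_FRAME only keeps bits below the sign extension, a va of
0xffff.8000.4210.c000 yields exactly the 0x0000.8000.4220.0000 you quote,
which is below va, so `va = blockend' never gets any closer to eva.
VA_SIGN_POS()-normalized addresses would make that arithmetic consistent
again, but then the range later handed to pmap_tlb_shootrange() is no
longer the real kernel range.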
> This STRONGLY hints that this routine has never been used on
> pmap_kernel() addresses until now.
I guess we stopped swapping out kernel stacks long before amd64 was a
thing?
> Can anyone with some amd64 MMU knowledge confirm this analysis and
> do the required work to make that routine cope with non-userland
> addresses?
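In case it helps whoever picks this up: the direction I would try first
(completely untested, and assuming VA_SIGN_MASK is the mask that
VA_SIGN_POS() strips) is to keep the real, sign-extended addresses
throughout, only drop the page offset at the top, and put the high bits
back after the L2_FRAME rounding, so that pmap_tlb_shootrange() still
sees the real range:

	/* untested sketch, not a diff */
	sva &= ~PAGE_MASK;
	eva &= ~PAGE_MASK;

	for (va = sva; va < eva; va = blockend) {
		/*
		 * L2_FRAME drops bits 48-63; restore them from va so
		 * blockend stays above va for kernel addresses.
		 */
		blockend = (va & L2_FRAME) + NBPD_L2;
		blockend |= va & VA_SIGN_MASK;
		if (blockend > eva)
			blockend = eva;

		/* existing PTE walk over [va, blockend) unchanged */
	}

This glosses over wraparound in the very last block of the address space
and assumes the range never straddles the VA hole, which I would expect
to hold for anything uvm hands us.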