On 22.07.2025 16:57, Oleksii Kurochko wrote:
> 
> On 7/21/25 3:34 PM, Jan Beulich wrote:
>> On 17.07.2025 18:37, Oleksii Kurochko wrote:
>>> On 7/2/25 11:25 AM, Jan Beulich wrote:
>>>> On 10.06.2025 15:05, Oleksii Kurochko wrote:
>>>>> Add support for breaking down large memory mappings ("superpages") in
>>>>> the RISC-V p2m mapping code so that smaller, more precise mappings
>>>>> ("finer-grained entries") can be inserted into lower levels of the page
>>>>> table hierarchy.
>>>>>
>>>>> To implement that, the following is done:
>>>>> - Introduce p2m_split_superpage(): recursively shatters a superpage into
>>>>>     smaller page table entries down to the target level, preserving the
>>>>>     original permissions and attributes.
>>>>> - Update __p2m_set_entry() to invoke superpage splitting when inserting
>>>>>     entries at lower levels within a superpage-mapped region.
>>>>>
>>>>> This implementation is based on the ARM code, with modifications to the
>>>>> part that follows the BBM (break-before-make) approach. Unlike ARM,
>>>>> RISC-V does not require BBM, so there is no need to invalidate the PTE
>>>>> and flush the TLB before updating it with the newly created, split page
>>>>> table.
>>>> But some flushing is going to be necessary. As long as you only ever do
>>>> global flushes, the one after the individual PTE modification (within the
>>>> split table) will do (if BBM isn't required, see below), but once you move
>>>> to more fine-grained flushing, that's not going to be enough anymore. Not
>>>> sure it's a good idea to leave such a pitfall.
>>> I think that I don't fully understand what is an issue.
>> Whether a flush is necessary after solely breaking up a superpage is arch-
>> defined. I don't know how many restrictions the spec places on possible
>> hardware behavior for RISC-V. However, the eventual change of (at least)
>> one entry to fulfill the original request will surely require a flush. What
>> I was trying to say is that this required flush would better not also cover
>> for the flushes that may or may not be required by the spec. IOW if the
>> spec leaves any room for flushes to possibly be needed, those flushes would
>> better be explicit.
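>>
>> To illustrate with a sketch (p2m_flush_range() is a made-up fine-grained
>> helper here, not an existing function; the other names mirror the ones
>> used in this series):
>>
>>     /* 1) Publish the split table. */
>>     p2m_write_pte(entry, split_pte, p2m->clean_pte);
>>     /* Explicit flush, in case a stale superpage translation may be used. */
>>     p2m_force_tlb_flush_sync(p2m);
>>
>>     /* 2) Change the one entry fulfilling the original request. */
>>     p2m_write_pte(target_entry, new_pte, p2m->clean_pte);
>>     /*
>>      * Once flushing becomes fine-grained, this covers only the range just
>>      * modified, and hence no longer papers over the flush in step 1.
>>      */
>>     p2m_flush_range(p2m, gfn, 1UL << page_order);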
> 
> I think that I still don't understand why what I wrote above works as long
> as global flushes are used, but will stop working once more fine-grained
> flushing is used.
> 
> Inside p2m_split_superpage() we don't need any kind of TLB flush operation,
> as it allocates a new page for a table and works only with that page, so no
> one can be using this page yet or have it cached in the TLB.
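>
> Roughly, the split has this shape (a sketch only; p2m_alloc_page(),
> pte_set_mfn(), write_pte() and XEN_PT_ENTRIES are placeholder names, not
> necessarily the ones from the patch):
>
>          /* Allocate a fresh table page; nothing references it yet. */
>          page = p2m_alloc_page(p2m);
>          table = __map_domain_page(page);
>
>          /* Fill all entries while the page is still private to us. */
>          for ( i = 0; i < XEN_PT_ENTRIES; i++ )
>          {
>              pte_t pte = orig_superpage_pte;
>
>              pte_set_mfn(&pte, mfn_add(mfn, i << level_order));
>              write_pte(&table[i], pte);
>          }
>
>          /*
>           * Only the later p2m_write_pte() makes the table visible, so no
>           * TLB can hold translations through it before that point.
>           */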
> 
> The only question is whether BBM is needed before starting to use the split
> entry:
>          ....
>          if ( !p2m_split_superpage(p2m, &split_pte, level, target, offsets) )
>          {
>              /* Free the allocated sub-tree */
>              p2m_free_subtree(p2m, split_pte, level);
> 
>              rc = -ENOMEM;
>              goto out;
>          }
> 
> ------> /* Should BBM be used here? */
>          p2m_write_pte(entry, split_pte, p2m->clean_pte);
> 
> And I can't find anything in the spec that tells me to use BBM here like Arm
> does:

But what you need is a statement in the spec that you can get away without
it. Imo at least.

>          /*
>           * Follow the break-before-make sequence to update the entry.
>           * For more details see (D4.7.1 in ARM DDI 0487A.j).
>           */
>          p2m_remove_pte(entry, p2m->clean_pte);
>          p2m_force_tlb_flush_sync(p2m);
> 
>          p2m_write_pte(entry, split_pte, p2m->clean_pte);
> 
> I agree that even if BBM isn't mandated explicitly, it could sometimes be
> useful, but here I am not really sure what the point of a TLB flush before
> p2m_write_pte() is, as nothing serious will happen if an old mapping is
> used for some time, because we keep the same rights for the smaller (split)
> mappings.
> The reason why Arm does p2m_remove_pte() & p2m_force_tlb_flush() before
> updating an entry here is that the architecture doesn't guarantee that
> everything will be okay otherwise, and they explicitly say:
>   This situation can possibly break coherency and ordering guarantees,
>   leading to inconsistent unwanted behavior in our Processing Element (PE).
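>
> IOW on RISC-V the sequence could simply be (a sketch):
>
>          /* No break-before-make: publish the split table right away... */
>          p2m_write_pte(entry, split_pte, p2m->clean_pte);
>          /*
>           * ...and flush afterwards; a stale superpage translation used in
>           * the meantime maps the same memory with the same permissions.
>           */
>          p2m_force_tlb_flush_sync(p2m);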
> 
> 
>>>> As to (no need for) BBM: I couldn't find anything to that effect in the
>>>> privileged spec. Can you provide some pointer? What I found instead is e.g.
>>>> this sentence: "To ensure that implicit reads observe writes to the same
>>>> memory locations, an SFENCE.VMA instruction must be executed after the
>>>> writes to flush the relevant cached translations." And this: "Accessing the
>>>> same location using different cacheability attributes may cause loss of
>>>> coherence." (This may not only occur when the same physical address is
>>>> mapped twice at different VAs, but also after the shattering of a superpage
>>>> when the new entry differs in cacheability.)
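>>>>
>>>> In code terms the first of these sentences amounts to (a sketch;
>>>> write_pte() stands in for whatever PTE write helper is used):
>>>>
>>>>     write_pte(ptep, new_pte);
>>>>     asm volatile ( "sfence.vma" ::: "memory" );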
>>> I also couldn't find that the RISC-V spec mandates BBM explicitly as Arm
>>> does.
>>>
>>> What I meant is that on RISC-V one can do:
>>> - Write new PTE
>>> - Flush TLB
>>>
>>> While on Arm it is almost always necessary to do:
>>> - Write zero to PTE
>>> - Flush TLB
>>> - Write new PTE
>>>
>>> For example, the common CoW code path, where you copy from a read-only
>>> page to a new page and then map that new page as writable, just works on
>>> RISC-V without extra considerations, while on Arm it requires BBM; see the
>>> sketch below.
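>>>
>>> Sketched (copy_page(), mk_writable_pte() and flush_tlb_page() stand in
>>> for whatever helpers the actual code would use):
>>>
>>>          copy_page(new_page, old_page);         /* copy the contents */
>>>          write_pte(ptep, mk_writable_pte(new_page));
>>>          flush_tlb_page(vaddr);                 /* sfence.vma on RISC-V */
>>>
>>> On Arm, the write_pte() above would additionally have to be preceded by
>>> clearing the PTE and flushing, i.e. BBM.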
>> CoW is a specific sub-case with increasing privilege.
> 
> Could you please explain why increasing privilege matters in terms of TLB
> flushing and PTE clearing before writing a new PTE?

Some architectures automatically re-walk the page tables when a permission
violation occurs based on (stale) TLB contents. Hence increasing privilege
can be a special case.

>>> It seems to me that BBM is mandated for Arm only because the TLB is shared
>>> among cores, so there is no guarantee that no prior entry for the same VA
>>> remains in the TLB. In RISC-V's case the TLB isn't shared, and after a
>>> flush it is guaranteed that no prior entry for the same VA remains in the
>>> TLB.
>> Not sure that's the sole reason. But again the question is: Is this written
>> down explicitly anywhere? Generally there can be multiple levels of TLBs, and
>> while some of them may be per-core, others may be shared.
> 
> The spec mentions that:
>    If a hart employs an address-translation cache, that cache must appear
>    to be private to that hart.

Hmm, okay, that indeed pretty much excludes shared TLBs.

Jan
