On 11/06/2024 7:23 pm, Oleksii K. wrote:
> On Tue, 2024-06-11 at 16:53 +0100, Andrew Cooper wrote:
>> On 30/05/2024 7:22 pm, Oleksii K. wrote:
>>> On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote:
>>>> On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
>>>>> Signed-off-by: Oleksii Kurochko <oleksii.kuroc...@gmail.com>
>>>>> Acked-by: Jan Beulich <jbeul...@suse.com>
>>>> This patch looks like it can go in independently?  Or does it
>>>> depend on having bitops.h working in practice?
>>>>
>>>> However, one very strong suggestion...
>>>>
>>>>
>>>>> diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
>>>>> index 07c7a0abba..cc4a07a71c 100644
>>>>> --- a/xen/arch/riscv/include/asm/mm.h
>>>>> +++ b/xen/arch/riscv/include/asm/mm.h
>>>>> @@ -3,11 +3,246 @@
>>>>> <snip>
>>>>> +/* PDX of the first page in the frame table. */
>>>>> +extern unsigned long frametable_base_pdx;
>>>>> +
>>>>> +/* Convert between machine frame numbers and page-info structures. */
>>>>> +#define mfn_to_page(mfn)                                            \
>>>>> +    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
>>>>> +#define page_to_mfn(pg)                                             \
>>>>> +    pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx)
>>>> Do yourself a favour and don't introduce frametable_base_pdx to
>>>> begin with.
>>>>
>>>> Every RISC-V board I can find has things starting from 0 in
>>>> physical address space, with RAM starting immediately after.
>>> I checked the Linux kernel and grepped there:
>>>    [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/ --exclude "*.tmp" -I
>>>    arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive-2.dtsi:33: memory@40000000 {
>>>    arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28: memory@80000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49: ddrc_cache: memory@1000000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33: ddrc_cache_lo: memory@80000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37: ddrc_cache_hi: memory@1040000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34: ddrc_cache_lo: memory@80000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40: ddrc_cache_hi: memory@1000000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22: ddrc_cache_lo: memory@80000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27: ddrc_cache_hi: memory@1000000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57: ddrc_cache_lo: memory@80000000 {
>>>    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63: ddrc_cache_hi: memory@1040000000 {
>>>    arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32: memory@0 {
>>>    arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14: memory@0 {
>>>    arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26: memory@80000000 {
>>>    arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12: memory@80000000 {
>>>    arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26: memory@80000000 {
>>>    arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25: memory@80000000 {
>>>    arch/riscv/boot/dts/canaan/k210.dtsi:82: sram: memory@80000000 {
>>> Based on that, the majority of boards supported by the Linux kernel
>>> have RAM starting at a non-zero physical address. Am I confusing
>>> something?
>>>
>>>> Taking the microchip board as an example, RAM actually starts at
>>>> 0x80000000,
>>> Today we had a conversation with a guy from SiFive in the xen-devel
>>> channel, and he mentioned that they are using "starfive visionfive2
>>> and sifive unleashed platforms", which, based on the grep above,
>>> have RAM not at address 0.
>>>
>>> Also, QEMU uses 0x80000000.
>>>
>>>>  which means that having frametable_base_pdx, and assuming it
>>>> does get set to 0x80000 (which isn't even a certainty, given that
>>>> I think you'll need struct pages covering the PLICs), what you
>>>> are trading off is:
>>>>
>>>> * Saving 32M of virtual address space only (no need to even
>>>> allocate memory for this range of the frametable), by
>>>> * Having an extra memory load and add/sub in every page <-> mfn
>>>> conversion, which is a screaming hotpath all over Xen.
>>>>
>>>> It's a terribly short-sighted tradeoff.
>>>>
>>>> 32M of VA space might be worth saving in a 32bit build (I
>>>> personally wouldn't - especially as there's no need to share
>>>> Xen's VA space with guests, given no PV guests on ARM/RISC-V),
>>>> but it's absolutely not at all in a 64bit build with TB of VA
>>>> space available.
>>>>
>>>> Even if we do find a board with the first interesting thing in the
>>>> frametable starting sufficiently far away from 0 that it might be
>>>> worth considering this slide, it should still be Kconfig-able in a
>>>> similar way to PDX_COMPRESSION.
>>> I find your tradeoffs reasonable, but I don't understand how it
>>> will work if RAM does not start from 0, as the frametable address
>>> and the RAM address are linked.
>>> I tried to look at the PDX_COMPRESSION config and couldn't find any
>>> "slide" there. Could you please clarify this for me?
>>> If we use this "slide", would it help to avoid the tradeoffs
>>> mentioned above?
>>>
>>> One more question: if we decide to go without frametable_base_pdx,
>>> would it be sufficient to simply remove mentions of it from the
>>> code (at least, for now)?
>> There is a relationship between system/host physical addresses (what
>> Xen calls maddr/mfn) and the frametable.  The frametable has one
>> entry per mfn.
>>
>> In the simplest case, there's a 1:1 relationship, i.e. frametable[0]
>> = maddr(0), frametable[1] = maddr(4k), etc.  This is very simple,
>> and very easy to calculate (page_to_mfn()/mfn_to_page()).
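>>
>> As a minimal sketch of that 1:1 case (ignoring Xen's typesafe mfn_t
>> wrapper, so the types here are illustrative only):
>>
>>     /* frame_table[mfn] describes the page at maddr = mfn << PAGE_SHIFT. */
>>     #define mfn_to_page(mfn) (frame_table + (mfn))
>>     #define page_to_mfn(pg)  ((unsigned long)((pg) - frame_table))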
>>
>> The frametable is one big array.  It starts at a compile-time fixed
>> address, and needs to be long enough to cover everything interesting
>> in memory.  Therefore it potentially takes a large amount of virtual
>> address space.
>>
>> However, only interesting maddrs need to have data in the frametable,
>> so it's fine for the backing RAM to be sparsely allocated/mapped in
>> the frametable virtual addresses.
>>
>> For 64bit, that's really all you need, because there's always far
>> more virtual address space than physical RAM in the system, even
>> when you're looking at TB-scale giant servers.
>>
>>
>> For 32bit, virtual address space is a limited resource.  (Also, to an
>> extent, 64bit x86 with PV guests, because we give 98% of the virtual
>> address space to the guest kernel to use.)
>>
>> There are two tricks to reduce the virtual address space used, but
>> they both cost performance in fastpaths.
>>
>> 1) PDX Compression.
>>
>> PDX compression makes a non-linear mfn <-> maddr mapping.  This is
>> for a usecase where you've got multiple RAM banks which are
>> separated by a large distance (and evenly spaced); you can then
>> "compress" a single range of 0's out of the middle of the
>> system/host physical address.
>>
>> The cost is that all page <-> mfn conversions need to read two masks
>> and a shift-count from variables in memory, to split/shift/recombine
>> the address bits.
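>>
>> As a rough sketch of that split/shift/recombine (modelled on the
>> mask-and-shift scheme in Xen's common PDX code; take the names as
>> illustrative - the real masks and shift-count are computed at boot
>> from the RAM layout):
>>
>>     static unsigned long pfn_pdx_bottom_mask; /* bits below the hole */
>>     static unsigned long pfn_top_mask;        /* bits above the hole */
>>     static unsigned int pfn_pdx_hole_shift;   /* width of the hole  */
>>
>>     static inline unsigned long pfn_to_pdx(unsigned long pfn)
>>     {
>>         /* Keep the low bits; shift the high bits down across the hole. */
>>         return (pfn & pfn_pdx_bottom_mask) |
>>                ((pfn & pfn_top_mask) >> pfn_pdx_hole_shift);
>>     }
>>
>>     static inline unsigned long pdx_to_pfn(unsigned long pdx)
>>     {
>>         /* Inverse: shift the high bits back up past the hole. */
>>         return (pdx & pfn_pdx_bottom_mask) |
>>                ((pdx << pfn_pdx_hole_shift) & pfn_top_mask);
>>     }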
>>
>> 2) A slide, which is frametable_base_pdx in this context.
>>
>> When there's a big gap between 0 and the start of something
>> interesting, you could chop out that range by just subtracting
>> base_pdx.  What qualifies as "big" is subjective, but Qemu's RAM
>> starting at 2G certainly does not qualify as big enough to warrant
>> frametable_base_pdx.
>>
>> This is less expensive than PDX compression.  It only adds one
>> memory read to the fastpath, but it also doesn't save as much
>> virtual address space as PDX compression.
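>>
>> Concretely, the slide is what the macros in the patch quoted at the
>> top of this thread implement - note the read of frametable_base_pdx
>> on every conversion:
>>
>>     #define mfn_to_page(mfn)                                            \
>>         (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
>>     #define page_to_mfn(pg)                                             \
>>         pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx)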
>>
>>
>> When virtual address space is a major constraint (32 bit builds),
>> both of these techniques are worth doing.  But when there's no
>> constraint on virtual address space (64 bit builds), there's no
>> reason to use either, and the performance will definitely improve
>> as a result.
> Thanks for such a good explanation.
>
> For RISC-V we have the PDX config disabled, as I haven't seen multiple
> RAM banks on boards which have the hypervisor extension. Thereby
> mfn_to_pdx() and pdx_to_mfn() do nothing. The same goes for
> frametable_base_pdx: with PDX disabled, it is just an offset (or a
> slide).
>
> IIUC, you meant that it makes sense to map the frametable not to the
> address where RAM starts, but to 0, so that we don't need this
> +-frametable_base_pdx. The price for that is losing VA space for the
> range from 0 to the RAM start address.
>
> Right now, we are trying to support 3 boards with the following RAM
> addresses:
> 1. 0x8000_0000 - QEMU and SiFive boards
> 2. 0x40_0000_0000 - Microchip board
>
> So if we map the frametable to 0 (not to the RAM start), we will lose:
> 1. 0x8_0000 (number of page entries to cover the range [0,
> 0x8000_0000)) * 64 (size of struct page_info) = 32 MB
> 2. 0x400_0000 (number of page entries to cover the range [0,
> 0x40_0000_0000)) * 64 (size of struct page_info) = 4 GB
>
> In terms of available virtual address space on RV64, we can consider
> both options acceptable.

For Qemu and SiFive, 32M is definitely not worth worrying about.

I personally wouldn't worry about Microchip either.  That's 4G out of
512G of VA space (assuming Sv39).
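
As a standalone cross-check of the arithmetic (assuming 4 KiB pages and
a 64-byte struct page_info, as in your calculation):

```
#include <stdio.h>

int main(void)
{
    const unsigned long long page_size = 1ULL << 12; /* 4 KiB pages */
    const unsigned long long info_size = 64; /* sizeof(struct page_info) */

    /* Frametable entries covering the hole [0, RAM start), times entry size. */
    unsigned long long qemu = (0x80000000ULL   / page_size) * info_size;
    unsigned long long mchp = (0x4000000000ULL / page_size) * info_size;

    printf("QEMU/SiFive: %llu MiB\n", qemu >> 20); /* prints 32 */
    printf("Microchip:   %llu GiB\n", mchp >> 30); /* prints 4 */
    return 0;
}
```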

> [OPTION 1] If we accept losing 4 GB of VA, then we could
> implement mfn_to_page() and page_to_mfn() in the following way:
> ```
>    diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
>    index cc4a07a71c..fdac7e0646 100644
>    --- a/xen/arch/riscv/include/asm/mm.h
>    +++ b/xen/arch/riscv/include/asm/mm.h
>    @@ -107,14 +107,11 @@ struct page_info
>    
>     #define frame_table ((struct page_info *)FRAMETABLE_VIRT_START)
>    
>    -/* PDX of the first page in the frame table. */
>    -extern unsigned long frametable_base_pdx;
>    -
>     /* Convert between machine frame numbers and page-info structures. */
>     #define mfn_to_page(mfn)                                            \
>    -    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
>    +    (frame_table + mfn_x(mfn))
>     #define page_to_mfn(pg)                                             \
>    -    pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx)
>    +    _mfn((unsigned long)((pg) - frame_table))
>    
>     static inline void *page_to_virt(const struct page_info *pg)
>     {
>    diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
>    index 9c0fd80588..8f6dbdc699 100644
>    --- a/xen/arch/riscv/mm.c
>    +++ b/xen/arch/riscv/mm.c
>    @@ -15,7 +15,6 @@
>     #include <asm/page.h>
>     #include <asm/processor.h>
>    
>    -unsigned long __ro_after_init frametable_base_pdx;
>     unsigned long __ro_after_init frametable_virt_end;
>    
>     struct mmu_desc {
> ```

I firmly recommend option 1, especially at this point.

If specific boards really have a problem with losing 4G of VA space,
then option 2 can be added easily at a later point.

That said, I'd think carefully about doing option 2.  Even subtracting a
constant - which is far better than subtracting a variable - is still
extra overhead in fastpaths.  Option 2 needs careful consideration on a
board-by-board basis.
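
(For illustration only: if option 2 ever becomes necessary, a constant
slide could look something like the sketch below, with
FRAMETABLE_BASE_PDX a hypothetical per-board Kconfig constant - nothing
by that name exists today.)

```
/* Hypothetical per-board constant: RAM start >> PAGE_SHIFT. */
#define FRAMETABLE_BASE_PDX 0x80000

/* No memory load: the constant folds into the addressing arithmetic. */
#define mfn_to_page(mfn) (frame_table + (mfn_x(mfn) - FRAMETABLE_BASE_PDX))
#define page_to_mfn(pg)  \
    _mfn((unsigned long)((pg) - frame_table) + FRAMETABLE_BASE_PDX)
```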

~Andrew
