On 30/05/2024 7:22 pm, Oleksii K. wrote:
> On Thu, 2024-05-30 at 18:23 +0100, Andrew Cooper wrote:
>> On 29/05/2024 8:55 pm, Oleksii Kurochko wrote:
>>> Signed-off-by: Oleksii Kurochko <oleksii.kuroc...@gmail.com>
>>> Acked-by: Jan Beulich <jbeul...@suse.com>
>> This patch looks like it can go in independently?  Or does it depend on
>> having bitops.h working in practice?
>>
>> However, one very strong suggestion...
>>
>>
>>> diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
>>> index 07c7a0abba..cc4a07a71c 100644
>>> --- a/xen/arch/riscv/include/asm/mm.h
>>> +++ b/xen/arch/riscv/include/asm/mm.h
>>> @@ -3,11 +3,246 @@
>>> <snip>
>>> +/* PDX of the first page in the frame table. */
>>> +extern unsigned long frametable_base_pdx;
>>> +
>>> +/* Convert between machine frame numbers and page-info structures. */
>>> +#define mfn_to_page(mfn)                                            \
>>> +    (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))
>>> +#define page_to_mfn(pg)                                             \
>>> +    pdx_to_mfn((unsigned long)((pg) - frame_table) + frametable_base_pdx)
>> Do yourself a favour and not introduce frametable_base_pdx to begin
>> with.
>>
>> Every RISC-V board I can find has things starting from 0 in physical
>> address space, with RAM starting immediately after.
> I checked the Linux kernel and grepped there:
>    [ok@fedora linux-aia]$ grep -Rni "memory@" arch/riscv/boot/dts/ --exclude "*.tmp" -I
>    arch/riscv/boot/dts/starfive/jh7110-starfive-visionfive-2.dtsi:33:     memory@40000000 {
>    arch/riscv/boot/dts/starfive/jh7100-common.dtsi:28:     memory@80000000 {
>    arch/riscv/boot/dts/microchip/mpfs-sev-kit.dts:49:      ddrc_cache: memory@1000000000 {
>    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:33:   ddrc_cache_lo: memory@80000000 {
>    arch/riscv/boot/dts/microchip/mpfs-m100pfsevp.dts:37:   ddrc_cache_hi: memory@1040000000 {
>    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:34:      ddrc_cache_lo: memory@80000000 {
>    arch/riscv/boot/dts/microchip/mpfs-tysom-m.dts:40:      ddrc_cache_hi: memory@1000000000 {
>    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:22:   ddrc_cache_lo: memory@80000000 {
>    arch/riscv/boot/dts/microchip/mpfs-polarberry.dts:27:   ddrc_cache_hi: memory@1000000000 {
>    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:57:   ddrc_cache_lo: memory@80000000 {
>    arch/riscv/boot/dts/microchip/mpfs-icicle-kit.dts:63:   ddrc_cache_hi: memory@1040000000 {
>    arch/riscv/boot/dts/thead/th1520-beaglev-ahead.dts:32:  memory@0 {
>    arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi:14:      memory@0 {
>    arch/riscv/boot/dts/sophgo/cv1800b-milkv-duo.dts:26:    memory@80000000 {
>    arch/riscv/boot/dts/sophgo/cv1812h.dtsi:12:     memory@80000000 {
>    arch/riscv/boot/dts/sifive/hifive-unmatched-a00.dts:26: memory@80000000 {
>    arch/riscv/boot/dts/sifive/hifive-unleashed-a00.dts:25: memory@80000000 {
>    arch/riscv/boot/dts/canaan/k210.dtsi:82:        sram: memory@80000000 {
>
> And based on that, the majority of boards supported by the Linux kernel
> have RAM starting not at 0 in the physical address space. Am I confusing
> something?
>
>> Taking the microchip board as an example, RAM actually starts at
>> 0x8000000,
> Today we had a conversation with a guy from SiFive on the xen-devel
> channel, and he mentioned that they are using "starfive visionfive2 and
> sifive unleashed platforms", which, based on the grep above, have RAM
> not at address 0.
>
> Also, QEMU uses 0x8000000.
>
>>  which means that, if you have frametable_base_pdx and it does get set
>> to 0x8000 (which isn't even a certainty, given that I think you'll need
>> struct pages covering the PLICs), what you are trading off is:
>>
>> * Saving 32k of virtual address space only (no need to even allocate
>> memory for this range of the frametable), by
>> * Having an extra memory load and add/sub in every page <-> mfn
>> conversion, which is a screaming hotpath all over Xen.
>>
>> It's a terribly short-sighted tradeoff.
>>
>> 32k of VA space might be worth saving in a 32bit build (I personally
>> wouldn't - especially as there's no need to share Xen's VA space with
>> guests, given no PV guests on ARM/RISC-V), but it's absolutely not at
>> all in a 64bit build with TB of VA space available.
>>
>> Even if we do find a board where the first interesting thing in the
>> frametable starts sufficiently far away from 0 that this slide might be
>> worth considering, it should still be Kconfig-able in a similar way to
>> PDX_COMPRESSION.
> I find your tradeoffs reasonable, but I don't understand how this will
> work if RAM does not start at 0, as the frametable address and the RAM
> address are linked.
> I tried to look at the PDX_COMPRESSION config and couldn't find any
> "slide" there. Could you please clarify this for me?
> If we used this "slide", would it help to avoid the tradeoffs mentioned
> above?
>
> One more question: if we decide to go without frametable_base_pdx,
> would it be sufficient to simply remove mentions of it from the code
> (at least, for now)?

There is a relationship between system/host physical addresses (what Xen
calls maddr/mfn) and the frametable.  The frametable has one entry per
mfn.

In the simplest case, there's a 1:1 relationship, i.e. frametable[0]
= maddr(0), frametable[1] = maddr(4k), etc.  This is very simple, and
very easy to calculate (page_to_mfn()/mfn_to_page()).
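
As a rough sketch (using Xen's typesafe mfn_t helpers; illustrative
rather than the exact RISC-V code), the 1:1 case collapses to plain
pointer arithmetic:

    /* Simple 1:1 case: the frametable index is just the MFN. */
    #define mfn_to_page(mfn) (frame_table + mfn_x(mfn))
    #define page_to_mfn(pg)  _mfn((unsigned long)((pg) - frame_table))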

The frametable is one big array.  It starts at a compile-time fixed
address, and needs to be long enough to cover everything interesting in
memory.  Therefore it potentially takes a large amount of virtual
address space.

However, only interesting maddrs need to have data in the frametable, so
it's fine for the backing RAM to be sparsely allocated/mapped in the
frametable virtual addresses.
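
Purely as a sketch (the helpers named below are hypothetical, not the
real setup code), populating the frametable then amounts to mapping
backing pages only for the slots covering each interesting maddr range:

    /* Hypothetical: back the frame_table entries covering [start, end) only. */
    static void __init init_frametable_chunk(paddr_t start, paddr_t end)
    {
        unsigned long spdx = mfn_to_pdx(maddr_to_mfn(start));
        unsigned long epdx = mfn_to_pdx(maddr_to_mfn(end));

        /* Allocate RAM and map it at frame_table + spdx .. frame_table + epdx. */
        map_frametable_range(spdx, epdx);   /* hypothetical helper */
    }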

For 64bit, that's really all you need, because there's always far more
virtual address space than physical RAM in the system, even when you're
looking at TB-scale giant servers.


For 32bit, virtual address space is a limited resource.  (Also to an
extent, 64bit x86 with PV guests because we give 98% of the virtual
address space to the guest kernel to use.)

There are two tricks to reduce the virtual address space used, but they
both cost performance in fastpaths.

1) PDX Compression.

PDX compression makes a non-linear mfn <-> maddr mapping.  It is for the
usecase where you've got multiple RAM banks separated by a large
distance (and evenly spaced); you can then "compress" a single range of
zeros out of the middle of the system/host physical address.

The cost is that all page <-> mfn conversions need to read two masks and
a shift-count from variables in memory, to split/shift/recombine the
address bits.
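
The transform looks roughly like this (the same shape as Xen's
pfn_to_pdx()/pdx_to_pfn(); exact names quoted from memory):

    /* Filled in at boot after scanning the memory map. */
    extern unsigned long pfn_pdx_bottom_mask;  /* address bits below the hole */
    extern unsigned long pfn_top_mask;         /* address bits above the hole */
    extern unsigned int pfn_pdx_hole_shift;    /* width of the hole, in bits */

    static inline unsigned long pfn_to_pdx(unsigned long pfn)
    {
        /* Keep the low bits; shift the high bits down across the hole. */
        return (pfn & pfn_pdx_bottom_mask) |
               ((pfn & pfn_top_mask) >> pfn_pdx_hole_shift);
    }

    static inline unsigned long pdx_to_pfn(unsigned long pdx)
    {
        /* Inverse: shift the high bits back up, reinserting the hole. */
        return (pdx & pfn_pdx_bottom_mask) |
               ((pdx << pfn_pdx_hole_shift) & pfn_top_mask);
    }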

2) A slide, which is frametable_base_pdx in this context.

When there's a big gap between 0 and the start of something interesting,
you could chop out that range by just subtracting base_pdx.  What
qualifies as "big" is subjective, but Qemu starting at 128M certainly
does not qualify as big enough to warrant frametable_base_pdx.
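
Concretely, it's just the difference between the posted macros with and
without the subtraction (suffixes added here only so both can be shown
side by side):

    /* With the slide (as in the patch): one extra memory read per conversion. */
    #define mfn_to_page_slide(mfn) \
        (frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx))

    /* Without it: pure pointer arithmetic off a compile-time base. */
    #define mfn_to_page_plain(mfn) \
        (frame_table + mfn_to_pdx(mfn))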

This is less expensive than PDX compression.  It only adds one memory
read to the fastpath, but it also doesn't save as much virtual address
space as PDX compression.


When virtual address space is a major constraint (32 bit builds), both
of these techniques are worth doing.  But when there's no constraint on
virtual address space (64 bit builds), there's no reason to use either;
and the performance will definitely improve as a result.

~Andrew
