> On 14-Sep-2023, at 2:07 PM, David Hildenbrand <da...@redhat.com> wrote:
>
> On 14.09.23 07:53, Ani Sinha wrote:
>>> On 12-Sep-2023, at 9:04 PM, David Hildenbrand <da...@redhat.com> wrote:
>>>
>>> [...]
>>>
>>>>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>>>>> index 54838c0c41..d187890675 100644
>>>>> --- a/hw/i386/pc.c
>>>>> +++ b/hw/i386/pc.c
>>>>> @@ -908,9 +908,12 @@ static hwaddr pc_max_used_gpa(PCMachineState *pcms,
>>>>> uint64_t pci_hole64_size)
>>>>> {
>>>>> X86CPU *cpu = X86_CPU(first_cpu);
>>>>>
>>>>> - /* 32-bit systems don't have hole64 thus return max CPU address */
>>>>> - if (cpu->phys_bits <= 32) {
>>>>> - return ((hwaddr)1 << cpu->phys_bits) - 1;
>>>>> + /*
>>>>> + * 32-bit systems don't have hole64, but we might have a region for
>>>>> + * memory hotplug.
>>>>> + */
>>>>> + if (!(cpu->env.features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM)) {
>>>>> + return pc_pci_hole64_start() - 1;
>>>> Ok this is very confusing! I am looking at pc_pci_hole64_start() function.
>>>> I have a few questions …
>>>> (a) pc_get_device_memory_range() returns the size of the device memory as
>>>> the difference between ram_size and maxram_size. But from what I
>>>> understand, ram_size is the actual size of the ram present and maxram_size
>>>> is the max size of ram *after* hot plugging additional memory. How can we
>>>> assume that the additional available space is already occupied by hot
>>>> plugged memory?
>>>
>>> Let's take a look at an example:
>>>
>>> $ ./build/qemu-system-x86_64 -m 8g,maxmem=16g,slots=1 \
>>> -object memory-backend-ram,id=mem0,size=1g \
>>> -device pc-dimm,memdev=mem0 \
>>> -nodefaults -nographic -S -monitor stdio
>>>
>>> (qemu) info mtree
>>> ...
>>> memory-region: system
>>> 0000000000000000-ffffffffffffffff (prio 0, i/o): system
>>> 0000000000000000-00000000bfffffff (prio 0, ram): alias ram-below-4g
>>> @pc.ram 0000000000000000-00000000bfffffff
>>> 0000000000000000-ffffffffffffffff (prio -1, i/o): pci
>>> 00000000000c0000-00000000000dffff (prio 1, rom): pc.rom
>>> 00000000000e0000-00000000000fffff (prio 1, rom): alias isa-bios
>>> @pc.bios 0000000000020000-000000000003ffff
>>> 00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios
>>> 00000000000a0000-00000000000bffff (prio 1, i/o): alias smram-region @pci
>>> 00000000000a0000-00000000000bffff
>>> 00000000000c0000-00000000000c3fff (prio 1, i/o): alias pam-pci @pci
>>> 00000000000c0000-00000000000c3fff
>>> 00000000000c4000-00000000000c7fff (prio 1, i/o): alias pam-pci @pci
>>> 00000000000c4000-00000000000c7fff
>>> 00000000000c8000-00000000000cbfff (prio 1, i/o): alias pam-pci @pci
>>> 00000000000c8000-00000000000cbfff
>>> 00000000000cc000-00000000000cffff (prio 1, i/o): alias pam-pci @pci
>>> 00000000000cc000-00000000000cffff
>>> 00000000000d0000-00000000000d3fff (prio 1, i/o): alias pam-pci @pci
>>> 00000000000d0000-00000000000d3fff
>>> 00000000000d4000-00000000000d7fff (prio 1, i/o): alias pam-pci @pci
>>> 00000000000d4000-00000000000d7fff
>>> 00000000000d8000-00000000000dbfff (prio 1, i/o): alias pam-pci @pci
>>> 00000000000d8000-00000000000dbfff
>>> 00000000000dc000-00000000000dffff (prio 1, i/o): alias pam-pci @pci
>>> 00000000000dc000-00000000000dffff
>>> 00000000000e0000-00000000000e3fff (prio 1, i/o): alias pam-pci @pci
>>> 00000000000e0000-00000000000e3fff
>>> 00000000000e4000-00000000000e7fff (prio 1, i/o): alias pam-pci @pci
>>> 00000000000e4000-00000000000e7fff
>>> 00000000000e8000-00000000000ebfff (prio 1, i/o): alias pam-pci @pci
>>> 00000000000e8000-00000000000ebfff
>>> 00000000000ec000-00000000000effff (prio 1, i/o): alias pam-pci @pci
>>> 00000000000ec000-00000000000effff
>>> 00000000000f0000-00000000000fffff (prio 1, i/o): alias pam-pci @pci
>>> 00000000000f0000-00000000000fffff
>>> 00000000fec00000-00000000fec00fff (prio 0, i/o): ioapic
>>> 00000000fed00000-00000000fed003ff (prio 0, i/o): hpet
>>> 00000000fee00000-00000000feefffff (prio 4096, i/o): apic-msi
>>> 0000000100000000-000000023fffffff (prio 0, ram): alias ram-above-4g
>>> @pc.ram 00000000c0000000-00000001ffffffff
>>> 0000000240000000-000000047fffffff (prio 0, i/o): device-memory
>>> 0000000240000000-000000027fffffff (prio 0, ram): mem0
>>>
>>>
>>> We requested 8G of boot memory, which is split between "<4G" memory and
>>> ">=4G" memory.
>>>
>>> We only place exactly 3G (0x0->0xbfffffff) under 4G, starting at address 0.
>> I can’t reconcile this with this code for q35:
>> if (machine->ram_size >= 0xb0000000) {
>> lowmem = 0x80000000; // RAM below 4G ends at 0x7fffffff (2 GiB)
>> } else {
>> lowmem = 0xb0000000; // RAM below 4G ends at 0xafffffff (2.75 GiB)
>> }
>> You assigned 8 GiB to ram which is > 0xb0000000 (2.75 GiB)
>
> QEMU defaults to the "pc" machine. If you add "-M q35" you get:
>
> address-space: memory
> 0000000000000000-ffffffffffffffff (prio 0, i/o): system
> 0000000000000000-000000007fffffff (prio 0, ram): alias ram-below-4g
> @pc.ram 0000000000000000-000000007fffffff
> [...]
> 0000000100000000-000000027fffffff (prio 0, ram): alias ram-above-4g
> @pc.ram 0000000080000000-00000001ffffffff
> 0000000280000000-00000004bfffffff (prio 0, i/o): device-memory
> 0000000280000000-00000002bfffffff (prio 0, ram): mem0
>
>
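To convince myself of the q35 numbers, here is a minimal sketch of the split. It only mirrors the lowmem selection from the q35 snippet quoted above; struct ram_split and q35_split_ram() are names I made up for illustration, they are not QEMU code.

```c
#include <assert.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Hypothetical container for the result; not a QEMU type. */
struct ram_split {
    uint64_t below_4g_size;  /* boot RAM mapped starting at GPA 0 */
    uint64_t above_4g_size;  /* boot RAM mapped starting at GPA 4 GiB */
};

/* Pick lowmem as in the quoted q35 snippet, then split ram_size around it. */
static struct ram_split q35_split_ram(uint64_t ram_size)
{
    uint64_t lowmem = (ram_size >= 0xb0000000ULL) ? 0x80000000ULL
                                                  : 0xb0000000ULL;
    struct ram_split s;

    if (ram_size >= lowmem) {
        s.below_4g_size = lowmem;
        s.above_4g_size = ram_size - lowmem;
    } else {
        /* Everything fits below 4G; nothing is placed above. */
        s.below_4g_size = ram_size;
        s.above_4g_size = 0;
    }
    /* The two pieces must always add up to the requested boot RAM. */
    assert(s.below_4g_size + s.above_4g_size == ram_size);
    return s;
}
```

With ram_size = 8 GiB this gives 2 GiB below 4G and 6 GiB above, so ram-above-4g ends at 4 GiB + 6 GiB - 1 = 0x27fffffff, which matches the q35 mtree output.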
>>>
>>> We leave the remainder (1G) of the <4G addresses available for I/O devices
>>> (32bit PCI hole).
>>>
>>> So we end up with 5G (0x100000000->0x23fffffff) of memory starting exactly
>>> at address 4G.
>>>
>>> "maxram_size - ram_size"=8G is the maximum amount of memory you can
>>> hotplug. We use it to size the "device-memory" region:
>>>
>>> 0x47fffffff - 0x240000000+1 = 0x240000000
>>> -> 9 GiB
>>>
>>> We requested to hotplug a maximum of "8 GiB", and sized the area slightly
>>> larger to allow for some flexibility when it comes to placing DIMMs in
>>> that "device-memory" area.
>> Right but here in this example you do not hot plug memory while the VM is
>> running. We can hot plug 8G yes, but the memory may not physically exist yet
>> (and may never exist). How can we use this math to provision device-memory
>> when the memory may not exist physically?
>
> We simply reserve a region in GPA space into which we can coldplug and
> hotplug memory devices, up to a predefined maximum amount.
>
> What do you think is wrong with that?
The only issue I have is that even though we are accounting for it, the memory
actually might not be physically present.
>
>>>
>>> We place that area for memory devices after the RAM. So it starts after the
>>> 5G of ">=4G" boot memory.
>>>
>>>
>>> Long story short, based on the initial RAM size and the maximum RAM size,
>>> you can construct the layout above and exactly know
>>> a) How much memory is below 4G, starting at address 0 -> leaving 1G for the
>>> 32bit PCI hole
>>> b) How much memory is above 4G, starting at address 4g.
>>> c) Where the region for memory devices starts (aligned after b) ) and how
>>> big it is.
>>> d) Where the 64bit PCI hole is (after c) )
>>>
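The steps a) to d) can be written down as a small calculation. This is only a sketch under my own assumptions: struct gpa_layout and compute_layout() are hypothetical names, the caller passes in how much RAM sits below 4G, and I assume the device-memory region gets 1 GiB of slack per slot on top of maxram_size - ram_size (that is what would explain 9 GiB instead of 8 GiB with slots=1).

```c
#include <assert.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Round x up to the next multiple of align (align must be a power of two). */
static uint64_t round_up(uint64_t x, uint64_t align)
{
    assert(align && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* Hypothetical summary of the layout; not a QEMU structure. */
struct gpa_layout {
    uint64_t above_4g_end;  /* first address after the ">=4G" boot RAM */
    uint64_t devmem_start;  /* start of the "device-memory" region */
    uint64_t devmem_size;   /* size of the "device-memory" region */
    uint64_t hole64_start;  /* start of the 64bit PCI hole */
};

static struct gpa_layout compute_layout(uint64_t ram_size,
                                        uint64_t maxram_size,
                                        uint64_t below_4g_size,
                                        unsigned slots)
{
    struct gpa_layout l;

    assert(below_4g_size <= ram_size && ram_size <= maxram_size);

    /* b) boot RAM that does not fit below 4G starts at address 4G. */
    l.above_4g_end = 4 * GiB + (ram_size - below_4g_size);

    /* c) device memory follows, aligned; no region if nothing can be plugged. */
    l.devmem_start = round_up(l.above_4g_end, GiB);
    l.devmem_size = 0;
    if (maxram_size > ram_size) {
        /* Assumption: 1 GiB of alignment slack per slot. */
        l.devmem_size = (maxram_size - ram_size) + (uint64_t)slots * GiB;
    }

    /* d) the 64bit PCI hole comes after device memory (or after RAM). */
    l.hole64_start = round_up(l.devmem_start + l.devmem_size, GiB);
    return l;
}
```

For the "pc" example (-m 8g,maxmem=16g,slots=1 with 3G below 4G) this yields devmem_start = 0x240000000 and devmem_size = 0x240000000, so the region ends at 0x47fffffff as in the mtree output. With 2G below 4G (the q35 case) it ends at 0x4bfffffff.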
>>>> (b) Another question is, in pc_pci_hole64_start(), why are we adding this
>>>> size to the start address?
>>>> } else if (pcmc->has_reserved_memory && (ms->ram_size < ms->maxram_size)) {
>>>> pc_get_device_memory_range(pcms, &hole64_start, &size);
>>>> if (!pcmc->broken_reserved_end) {
>>>> hole64_start += size;
>>>
>>> The 64bit PCI hole starts after "device-memory" above.
>>>
>>> Apparently, we have to take care of some layout issues before QEMU 2.5. You
>>> can assume that nowadays,
>>> "pcmc->broken_reserved_end" is never set. So the PCI64 hole is always after
>>> the device-memory region.
>>>
>>>> I think this is trying to put the hole after the device memory. But if the
>>>> ram size is <=maxram_size then the hole is after the above_4G memory? Why?
>>>
>>> I didn't quite get what the concern is, can you elaborate?
>> Oh I meant the else part here and made a typo, the else implies ram size ==
>> maxram_size
>> } else {
>> hole64_start = pc_above_4g_end(pcms);
>> }
>> So in this case, there is no device_memory region?!
>
> Yes. In this case ms->ram_size == ms->maxram_size and you cannot cold/hotplug
> any memory devices.
>
> See how pc_memory_init() doesn't call machine_memory_devices_init() in that
> case.
>
> That's what the QEMU user asked for when *not* specifying maxmem (e.g., -m
> 4g).
>
> In order to cold/hotplug any memory devices, you have to tell QEMU ahead of
> time how much memory
> you are intending to provide using memory devices (DIMM, NVDIMM, virtio-pmem,
> virtio-mem).
So that means that when we are actually hot plugging the memory, there is no
need to perform additional checks. It can all be done statically when -m,
maxmem etc. are provided on the command line.
>
> So when specifying, say -m 4g,maxmem=20g, we can have memory devices of a
> total of 16g (20 - 4).
> We reserve a GPA space for device_memory that is at least 16g, into which
> we can either coldplug (QEMU cmdline) or hotplug (qmp/hmp) memory later.
>
>> Another thing I do not understand is, for 32-bit,
>> above_4g_mem_start is 4GiB and above_4g_mem_size = ram_size - lowmem.
>> So we are allocating “above-4G” ram above the address space of the
>> processor?!
>>>
>>>> (c) in your above change, what does long mode have anything to do with all
>>>> of this?
>>>
>>> According to my understanding, 32bit (i386) doesn't have a 64bit hole. And
>>> 32bit vs. 64bit (i386 vs. x86_64) is decided based on LM, not on the
>>> address bits (as we learned, PSE36, and PAE).
>>>
>>> But really, I just did what x86_cpu_realizefn() does to decide 32bit vs.
>>> 64bit ;)
>>>
>>> /* For 64bit systems think about the number of physical bits to present.
>>> * ideally this should be the same as the host; anything other than
>>> * matching the host can cause incorrect guest behaviour.
>>> * QEMU used to pick the magic value of 40 bits that corresponds to
>>> * consumer AMD devices but nothing else.
>>> *
>>> * Note that this code assumes features expansion has already been done
>>> * (as it checks for CPUID_EXT2_LM), and also assumes that potential
>>> * phys_bits adjustments to match the host have been already done in
>>> * accel-specific code in cpu_exec_realizefn.
>>> */
>>> if (env->features[FEAT_8000_0001_EDX] & CPUID_EXT2_LM) {
>>> ...
>>> } else {
>>> /* For 32 bit systems don't use the user set value, but keep
>>> * phys_bits consistent with what we tell the guest.
>>> */
>> Ah I see. I missed this. But I still can’t understand why for 32 bit,
>> pc_pci_hole64_start() would be the right address for max gpa?
>
> You want "end of device memory region" if there is one, or
> "end of RAM" if there is none.
>
> What pc_pci_hole64_start() does:
>
> /*
> * The 64bit pci hole starts after "above 4G RAM" and
> * potentially the space reserved for memory hotplug.
> */
>
> There is the
> ROUND_UP(hole64_start, 1 * GiB);
> in there that is not really required for the !hole64 case. It
> shouldn't matter much in practice I think (besides an aligned value
> showing up in the error message).
>
> We could factor out most of that calculation into a
> separate function, skipping that alignment to make that
> clearer.
Yeah this whole memory segmentation is quite complicated and might benefit from
a qemu doc or a refactoring.
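
Something along these lines is how I picture the factored-out calculation. It is a rough sketch and not a proposed patch: end_of_ram_or_devmem(), hole64_start_aligned() and max_used_gpa_no_hole64() are hypothetical helper names, and the inputs are plain integers rather than PCMachineState.

```c
#include <assert.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Round x up to the next multiple of align (align must be a power of two). */
static uint64_t round_up(uint64_t x, uint64_t align)
{
    assert(align && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/*
 * Unaligned end of everything placed below the 64bit hole:
 * end of the device memory region if there is one, else end of RAM.
 */
static uint64_t end_of_ram_or_devmem(uint64_t above_4g_end,
                                     uint64_t devmem_start,
                                     uint64_t devmem_size)
{
    return devmem_size ? devmem_start + devmem_size : above_4g_end;
}

/* 64bit case: the hole starts at the next 1 GiB boundary. */
static uint64_t hole64_start_aligned(uint64_t above_4g_end,
                                     uint64_t devmem_start,
                                     uint64_t devmem_size)
{
    return round_up(end_of_ram_or_devmem(above_4g_end, devmem_start,
                                         devmem_size), GiB);
}

/* No-hole64 (32bit) case: highest used GPA, without the 1 GiB rounding. */
static uint64_t max_used_gpa_no_hole64(uint64_t above_4g_end,
                                       uint64_t devmem_start,
                                       uint64_t devmem_size)
{
    return end_of_ram_or_devmem(above_4g_end, devmem_start, devmem_size) - 1;
}
```

With the numbers from the "pc" example (above_4g_end = 0x240000000, device memory of 9 GiB starting there) both helpers agree on 0x480000000 and 0x47fffffff. They only differ when the end is not 1 GiB aligned, which is exactly the case where the error message would otherwise show a rounded value.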