Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On Wed, 23 Feb 2022 09:16:51 + "Dr. David Alan Gilbert" wrote: > * Igor Mammedov (imamm...@redhat.com) wrote: > > On Tue, 22 Feb 2022 10:42:55 +0100 > > Gerd Hoffmann wrote: > > > > > Hi, > > > > > > > > And the upstream code is now pretty much identical except for the > > > > > default; note that for TCG you do need to keep to 40 I think. > > > > > > > > will TCG work with 40bits on host that supports less than that? > > > > > > When I understand things correctly the problem is that the phys-bits > > > limit applies to the npt/ept tables too, effectively restricting guest > > > physical address space to host physical address space. > > > > > > TCG is not affected by that and should work just fine. > > > > > > Not sure what happens if you turn off npt/ept and run on softmmu. > > > Possibly that works fine too. > > > > > > > Also quick look at host-phys-bits shows that it affects only 'host' > > > > cpu model and is NOP for all other models. > > > > > > I don't think so. microvm forces host-phys-bits=on and that works with > > > all cpu models. > > > > I just don't see how host-phys-bits can work for other than 'host' cpu > > model. > > It's true that property is available for all cpu models, but the field it > > sets > > is only used in target/i386/host-cpu.c, the same applies to > > host-phys-bits-limit. > > Am I missing something? > > The hook in kvm/kvm-cpu.c kvm_cpu_realizefn: > > /* > * The realize order is important, since x86_cpu_realize() checks if > * nothing else has been set by the user (or by accelerators) in > * cpu->ucode_rev and cpu->phys_bits, and updates the CPUID results in > * mwait.ecx. > * This accel realization code also assumes cpu features are already > expanded. > * > * realize order: > * > * x86_cpu_realize(): > * -> x86_cpu_expand_features() > * -> cpu_exec_realizefn(): > *-> accel_cpu_realizefn() > * kvm_cpu_realizefn() -> host_cpu_realizefn() > * -> check/update ucode_rev, phys_bits, mwait > */ Thanks, I didn't expect host_cpu_realizefn being called from elsewhere beside of cpu model it belongs to or models inherited from it. > > Dave > > > > > > > take care, > > > Gerd > > > > >
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
* Igor Mammedov (imamm...@redhat.com) wrote: > On Tue, 22 Feb 2022 10:42:55 +0100 > Gerd Hoffmann wrote: > > > Hi, > > > > > > And the upstream code is now pretty much identical except for the > > > > default; note that for TCG you do need to keep to 40 I think. > > > > > > will TCG work with 40bits on host that supports less than that? > > > > When I understand things correctly the problem is that the phys-bits > > limit applies to the npt/ept tables too, effectively restricting guest > > physical address space to host physical address space. > > > > TCG is not affected by that and should work just fine. > > > > Not sure what happens if you turn off npt/ept and run on softmmu. > > Possibly that works fine too. > > > > > Also quick look at host-phys-bits shows that it affects only 'host' > > > cpu model and is NOP for all other models. > > > > I don't think so. microvm forces host-phys-bits=on and that works with > > all cpu models. > > I just don't see how host-phys-bits can work for other than 'host' cpu model. > It's true that property is available for all cpu models, but the field it sets > is only used in target/i386/host-cpu.c, the same applies to > host-phys-bits-limit. > Am I missing something? The hook in kvm/kvm-cpu.c kvm_cpu_realizefn: /* * The realize order is important, since x86_cpu_realize() checks if * nothing else has been set by the user (or by accelerators) in * cpu->ucode_rev and cpu->phys_bits, and updates the CPUID results in * mwait.ecx. * This accel realization code also assumes cpu features are already expanded. * * realize order: * * x86_cpu_realize(): * -> x86_cpu_expand_features() * -> cpu_exec_realizefn(): *-> accel_cpu_realizefn() * kvm_cpu_realizefn() -> host_cpu_realizefn() * -> check/update ucode_rev, phys_bits, mwait */ Dave > > > > take care, > > Gerd > > > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On Tue, 22 Feb 2022 10:42:55 +0100 Gerd Hoffmann wrote: > Hi, > > > > And the upstream code is now pretty much identical except for the > > > default; note that for TCG you do need to keep to 40 I think. > > > > will TCG work with 40bits on host that supports less than that? > > When I understand things correctly the problem is that the phys-bits > limit applies to the npt/ept tables too, effectively restricting guest > physical address space to host physical address space. > > TCG is not affected by that and should work just fine. > > Not sure what happens if you turn off npt/ept and run on softmmu. > Possibly that works fine too. > > > Also quick look at host-phys-bits shows that it affects only 'host' > > cpu model and is NOP for all other models. > > I don't think so. microvm forces host-phys-bits=on and that works with > all cpu models. I just don't see how host-phys-bits can work for other than 'host' cpu model. It's true that property is available for all cpu models, but the field it sets is only used in target/i386/host-cpu.c, the same applies to host-phys-bits-limit. Am I missing something? > > take care, > Gerd >
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On Tue, 22 Feb 2022 11:00:55 + Joao Martins wrote: > On 2/21/22 15:28, Joao Martins wrote: > > On 2/21/22 06:58, Igor Mammedov wrote: > >> On Fri, 18 Feb 2022 17:12:21 + > >> Joao Martins wrote: > >> > >>> On 2/14/22 15:31, Igor Mammedov wrote: > On Mon, 14 Feb 2022 15:05:00 + > Joao Martins wrote: > > On 2/14/22 14:53, Igor Mammedov wrote: > >> On Mon, 7 Feb 2022 20:24:20 + > >> Joao Martins wrote: > >>> +{ > >>> +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms); > >>> +X86MachineState *x86ms = X86_MACHINE(pcms); > >>> +ram_addr_t device_mem_size = 0; > >>> +uint32_t eax, vendor[3]; > >>> + > >>> +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]); > >>> +if (!IS_AMD_VENDOR(vendor)) { > >>> +return; > >>> +} > >>> + > >>> +if (pcmc->has_reserved_memory && > >>> + (machine->ram_size < machine->maxram_size)) { > >>> +device_mem_size = machine->maxram_size - machine->ram_size; > >>> +} > >>> + > >>> +if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size + > >>> + device_mem_size) < AMD_HT_START) { > >> > > And I was at two minds on this one, whether to advertise *always* > > the 1T hole, regardless of relocation. Or account the size > > we advertise for the pci64 hole and make that part of the equation > > above. Although that has the flaw that the firmware at admin request > > may pick some ludricous number (limited by maxphysaddr). > > it this point we have only pci64 hole size (machine property), > so I'd include that in equation to make firmware assign > pci64 aperture above HT range. > > as for checking maxphysaddr, we can only check 'default' PCI hole > range at this stage (i.e. 1Gb aligned hole size after all possible RAM) > and hard error on it. > > >>> > >>> Igor, in the context of your comment above, I'll be introducing another > >>> preparatory patch that adds up pci_hole64_size to pc_memory_init() such > >>> that all used/max physaddr space checks are consolidated in > >>> pc_memory_init(). > >>> > >>> To that end, the changes involve mainly moves the the pcihost qdev > >>> creation > >>> to be before pc_memory_init(). Q35 just needs a 2-line order change. > >>> i440fx > >>> needs slightly more of a dance to extract that from i440fx_init() and also > >>> because most i440fx state is private (hence the new helper for size). But > >>> the actual initialization of I440fx/q35 pci host is still after > >>> pc_memory_init(), > >>> it is just to extra pci_hole64_size from the object + user passed args > >>> (-global etc). > >> > >> Shuffling init order is looks too intrusive and in practice > >> quite risky. > > > > Yeah, it is an intrusive change sadly. Although, why would you consider it > > risky (curious)? We are "only" moving this: > > > > qdev_new(host_type); > > > > ... located at the very top of i440fx_init() and called at the top of > > q35_host > > initilization to be instead before pc_memory_init(). And that means that an > > instance of an > > object gets made and its properties initialized i.e. @instance_init of q35 > > and i440fx and > > its properties. I don't see a particular dependence in PC code to tell that > > this > > would affect its surroundings parts. I don't see anything wrong her as well (but I'm probably overcautious since more often than not changing initialization order, has broken things in non obvious ways) > > > > The actual pcihost-related initialization is still kept entirely unchanged. > > > >> How about moving maxphysaddr check to pc_machine_done() instead? > >> (this way you won't have to move pcihost around) > >> > > I can move it. But be there will be a slight disconnect between what > > pc_memory_init() > > checks against "max used address" between ... dictating if the 4G mem > > start should change > > to 1T or not ... and when the phys-bits check is actually made which > > includes the pci hole64. > > > > For example, we create a guest with maxram 1009G (which 4G mem start would > > get at > > unchanged) and then the pci_hole64 goes likely assigned across the rest > > until 1023G (i.e. > > across the HT region). Here it would need an extra check and fail if > > pci_hole64 crosses > > the HT region. Whereby if it was added in pc_memory_init() then we could > > just relocate to > > 1T and the guest creation would still proceed. > > > Actually, on a second thought, not having the pci_hole64_size > to pc_memory_init() to instead introduce it in pc_machine_done() to > include pci_hole64_size looks like a half-step :( because otherwise the user > needs to play games on how much it should include as -m|-object-memory* > number to force it to relocate to 1T and avoid the guest creation > failure. > > So either we: > > 1) co
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On 2/21/22 15:28, Joao Martins wrote: > On 2/21/22 06:58, Igor Mammedov wrote: >> On Fri, 18 Feb 2022 17:12:21 + >> Joao Martins wrote: >> >>> On 2/14/22 15:31, Igor Mammedov wrote: On Mon, 14 Feb 2022 15:05:00 + Joao Martins wrote: > On 2/14/22 14:53, Igor Mammedov wrote: >> On Mon, 7 Feb 2022 20:24:20 + >> Joao Martins wrote: >>> +{ >>> +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms); >>> +X86MachineState *x86ms = X86_MACHINE(pcms); >>> +ram_addr_t device_mem_size = 0; >>> +uint32_t eax, vendor[3]; >>> + >>> +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]); >>> +if (!IS_AMD_VENDOR(vendor)) { >>> +return; >>> +} >>> + >>> +if (pcmc->has_reserved_memory && >>> + (machine->ram_size < machine->maxram_size)) { >>> +device_mem_size = machine->maxram_size - machine->ram_size; >>> +} >>> + >>> +if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size + >>> + device_mem_size) < AMD_HT_START) { >> > And I was at two minds on this one, whether to advertise *always* > the 1T hole, regardless of relocation. Or account the size > we advertise for the pci64 hole and make that part of the equation > above. Although that has the flaw that the firmware at admin request > may pick some ludricous number (limited by maxphysaddr). it this point we have only pci64 hole size (machine property), so I'd include that in equation to make firmware assign pci64 aperture above HT range. as for checking maxphysaddr, we can only check 'default' PCI hole range at this stage (i.e. 1Gb aligned hole size after all possible RAM) and hard error on it. >>> >>> Igor, in the context of your comment above, I'll be introducing another >>> preparatory patch that adds up pci_hole64_size to pc_memory_init() such >>> that all used/max physaddr space checks are consolidated in >>> pc_memory_init(). >>> >>> To that end, the changes involve mainly moves the the pcihost qdev creation >>> to be before pc_memory_init(). Q35 just needs a 2-line order change. i440fx >>> needs slightly more of a dance to extract that from i440fx_init() and also >>> because most i440fx state is private (hence the new helper for size). But >>> the actual initialization of I440fx/q35 pci host is still after >>> pc_memory_init(), >>> it is just to extra pci_hole64_size from the object + user passed args >>> (-global etc). >> >> Shuffling init order is looks too intrusive and in practice >> quite risky. > > Yeah, it is an intrusive change sadly. Although, why would you consider it > risky (curious)? We are "only" moving this: > > qdev_new(host_type); > > ... located at the very top of i440fx_init() and called at the top of q35_host > initilization to be instead before pc_memory_init(). And that means that an > instance of an > object gets made and its properties initialized i.e. @instance_init of q35 > and i440fx and > its properties. I don't see a particular dependence in PC code to tell that > this > would affect its surroundings parts. > > The actual pcihost-related initialization is still kept entirely unchanged. > >> How about moving maxphysaddr check to pc_machine_done() instead? >> (this way you won't have to move pcihost around) >> > I can move it. But be there will be a slight disconnect between what > pc_memory_init() > checks against "max used address" between ... dictating if the 4G mem start > should change > to 1T or not ... and when the phys-bits check is actually made which > includes the pci hole64. > > For example, we create a guest with maxram 1009G (which 4G mem start would > get at > unchanged) and then the pci_hole64 goes likely assigned across the rest until > 1023G (i.e. > across the HT region). Here it would need an extra check and fail if > pci_hole64 crosses > the HT region. Whereby if it was added in pc_memory_init() then we could just > relocate to > 1T and the guest creation would still proceed. > Actually, on a second thought, not having the pci_hole64_size to pc_memory_init() to instead introduce it in pc_machine_done() to include pci_hole64_size looks like a half-step :( because otherwise the user needs to play games on how much it should include as -m|-object-memory* number to force it to relocate to 1T and avoid the guest creation failure. So either we: 1) consider pci_hole64_size in pc_memory_init() and move default pci-hole64-size (sort of what I was proposing in this last exchange) 2) keep the maxphysaddr check as is (but generic in pc_memory_init() and disregarding pci-hole64-size) and advertise the actual 1T reserved hole (regardless of above-4G relocation) letting BIOS consider reserved regions when picking hole64-start.
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
Hi, > > And the upstream code is now pretty much identical except for the > > default; note that for TCG you do need to keep to 40 I think. > > will TCG work with 40bits on host that supports less than that? When I understand things correctly the problem is that the phys-bits limit applies to the npt/ept tables too, effectively restricting guest physical address space to host physical address space. TCG is not affected by that and should work just fine. Not sure what happens if you turn off npt/ept and run on softmmu. Possibly that works fine too. > Also quick look at host-phys-bits shows that it affects only 'host' > cpu model and is NOP for all other models. I don't think so. microvm forces host-phys-bits=on and that works with all cpu models. take care, Gerd
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
* Igor Mammedov (imamm...@redhat.com) wrote: > On Mon, 21 Feb 2022 13:15:40 + > "Dr. David Alan Gilbert" wrote: > > > * Daniel P. Berrangé (berra...@redhat.com) wrote: > > > On Tue, Feb 15, 2022 at 10:53:58AM +0100, Gerd Hoffmann wrote: > > > > Hi, > > > > > > > > > I don't know what behavior should be if firmware tries to program > > > > > PCI64 hole beyond supported phys-bits. > > > > > > > > Well, you are basically f*cked. > > > > > > > > Unfortunately there is no reliable way to figure what phys-bits actually > > > > is. Because of that the firmware (both seabios and edk2) tries to place > > > > the pci64 hole as low as possible. > > > > > > > > The long version: > > > > > > > > qemu advertises phys-bits=40 to the guest by default. Probably because > > > > this is what the first amd opteron processors had, assuming that it > > > > would be a safe default. Then intel came, releasing processors with > > > > phys-bits=36, even recent (desktop-class) hardware has phys-bits=39. > > > > Boom. > > > > > > > > End result is that edk2 uses a 32G pci64 window by default, which is > > > > placed at the first 32G border beyond normal ram. So for virtual > > > > machines with up to ~ 30G ram (including reservations for memory > > > > hotplug) the pci64 hole covers 32G -> 64G in guest physical address > > > > space, which is low enough that it works on hardware with phys-bits=36. > > > > > > > > If your VM has more than 32G of memory the pci64 hole will move and > > > > phys-bits=36 isn't enough any more, but given that you probably only do > > > > that on more beefy hosts which can take >= 64G of RAM and have a larger > > > > physical address space this heuristic works good enough in practice. > > > > > > > > Changing phys-bits behavior has been discussed on and off since years. > > > > It's tricky to change for live migration compatibility reasons. > > > > > > > > We got the host-phys-bits and host-phys-bits-limit properties, which > > > > solve some of the phys-bits problems. > > > > > > > > * host-phys-bits=on makes sure the phys-bits advertised to the guest > > > >actually works. It's off by default though for backward > > > >compatibility reasons (except microvm). Also because turning it on > > > >breaks live migration of machines between hosts with different > > > >phys-bits. > > > > > > RHEL has shipped with host-phys-bits=on in its machine types > > > sinec RHEL-7. If it is good enough for RHEL machine types > > > for 8 years, IMHO, it is a sign that its reasonable to do the > > > same with upstream for new machine types. > > > > And the upstream code is now pretty much identical except for the > > default; note that for TCG you do need to keep to 40 I think. > > will TCG work with 40bits on host that supports less than that? > > Also quick look at host-phys-bits shows that it affects only 'host' > cpu model and is NOP for all other models. > If it's so than we probably need to expand it's scope to other cpu > models to cap them at actually supported range. (We shouldn't really bring TCG oddities into this series!) As I remember it effectively gets it from the accelerator, and TCG being portable, there's no portable way of reading the phys-bits. Whether it would work, hmm. I'm assuming the host OS would stop you allocating a huge ram block, so it shouldn't break from that. But then the guest address translation is done in software, not using the host MMU, so I think the guests view of addressing should be able to be larger than the host. (Unless you try things like vfio/iommu on tcg, which I'm told does work in some combos). Dave > > > > Dave > > > > > > > * host-phys-bits-limit can be used to tweak phys-bits to > > > >be lower than what the host supports. Which can be used for > > > >live migration compatibility, i.e. if you have a pool of machines > > > >where some have 36 and some 39 you can limit phys-bits to 36 so > > > >live migration from 39 hosts to 36 hosts works. > > > > > > RHEL machine types have set this to host-phys-bits-limit=48 > > > since RHEL-8 days, to avoid accidentally enabling 5-level > > > paging in guests without explicit user opt-in. > > > > > > > What is missing: > > > > > > > > * Some way for the firmware to get a phys-bits value it can actually > > > >use. One possible way would be to have a paravirtual bit somewhere > > > >telling whenever host-phys-bits is enabled or not. > > > > > > > > > Regards, > > > Daniel > > > -- > > > |: https://berrange.com -o- > > > https://www.flickr.com/photos/dberrange :| > > > |: https://libvirt.org -o- > > > https://fstop138.berrange.com :| > > > |: https://entangle-photo.org-o- > > > https://www.instagram.com/dberrange :| > > > > > > > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On Mon, 21 Feb 2022 13:15:40 + "Dr. David Alan Gilbert" wrote: > * Daniel P. Berrangé (berra...@redhat.com) wrote: > > On Tue, Feb 15, 2022 at 10:53:58AM +0100, Gerd Hoffmann wrote: > > > Hi, > > > > > > > I don't know what behavior should be if firmware tries to program > > > > PCI64 hole beyond supported phys-bits. > > > > > > Well, you are basically f*cked. > > > > > > Unfortunately there is no reliable way to figure what phys-bits actually > > > is. Because of that the firmware (both seabios and edk2) tries to place > > > the pci64 hole as low as possible. > > > > > > The long version: > > > > > > qemu advertises phys-bits=40 to the guest by default. Probably because > > > this is what the first amd opteron processors had, assuming that it > > > would be a safe default. Then intel came, releasing processors with > > > phys-bits=36, even recent (desktop-class) hardware has phys-bits=39. > > > Boom. > > > > > > End result is that edk2 uses a 32G pci64 window by default, which is > > > placed at the first 32G border beyond normal ram. So for virtual > > > machines with up to ~ 30G ram (including reservations for memory > > > hotplug) the pci64 hole covers 32G -> 64G in guest physical address > > > space, which is low enough that it works on hardware with phys-bits=36. > > > > > > If your VM has more than 32G of memory the pci64 hole will move and > > > phys-bits=36 isn't enough any more, but given that you probably only do > > > that on more beefy hosts which can take >= 64G of RAM and have a larger > > > physical address space this heuristic works good enough in practice. > > > > > > Changing phys-bits behavior has been discussed on and off since years. > > > It's tricky to change for live migration compatibility reasons. > > > > > > We got the host-phys-bits and host-phys-bits-limit properties, which > > > solve some of the phys-bits problems. > > > > > > * host-phys-bits=on makes sure the phys-bits advertised to the guest > > >actually works. It's off by default though for backward > > >compatibility reasons (except microvm). Also because turning it on > > >breaks live migration of machines between hosts with different > > >phys-bits. > > > > RHEL has shipped with host-phys-bits=on in its machine types > > sinec RHEL-7. If it is good enough for RHEL machine types > > for 8 years, IMHO, it is a sign that its reasonable to do the > > same with upstream for new machine types. > > And the upstream code is now pretty much identical except for the > default; note that for TCG you do need to keep to 40 I think. will TCG work with 40bits on host that supports less than that? Also quick look at host-phys-bits shows that it affects only 'host' cpu model and is NOP for all other models. If it's so than we probably need to expand it's scope to other cpu models to cap them at actually supported range. > > Dave > > > > > * host-phys-bits-limit can be used to tweak phys-bits to > > >be lower than what the host supports. Which can be used for > > >live migration compatibility, i.e. if you have a pool of machines > > >where some have 36 and some 39 you can limit phys-bits to 36 so > > >live migration from 39 hosts to 36 hosts works. > > > > RHEL machine types have set this to host-phys-bits-limit=48 > > since RHEL-8 days, to avoid accidentally enabling 5-level > > paging in guests without explicit user opt-in. > > > > > What is missing: > > > > > > * Some way for the firmware to get a phys-bits value it can actually > > >use. One possible way would be to have a paravirtual bit somewhere > > >telling whenever host-phys-bits is enabled or not. > > > > > > Regards, > > Daniel > > -- > > |: https://berrange.com -o-https://www.flickr.com/photos/dberrange > > :| > > |: https://libvirt.org -o-https://fstop138.berrange.com > > :| > > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange > > :| > > > >
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On 2/21/22 06:58, Igor Mammedov wrote: > On Fri, 18 Feb 2022 17:12:21 + > Joao Martins wrote: > >> On 2/14/22 15:31, Igor Mammedov wrote: >>> On Mon, 14 Feb 2022 15:05:00 + >>> Joao Martins wrote: On 2/14/22 14:53, Igor Mammedov wrote: > On Mon, 7 Feb 2022 20:24:20 + > Joao Martins wrote: >> +{ >> +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms); >> +X86MachineState *x86ms = X86_MACHINE(pcms); >> +ram_addr_t device_mem_size = 0; >> +uint32_t eax, vendor[3]; >> + >> +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]); >> +if (!IS_AMD_VENDOR(vendor)) { >> +return; >> +} >> + >> +if (pcmc->has_reserved_memory && >> + (machine->ram_size < machine->maxram_size)) { >> +device_mem_size = machine->maxram_size - machine->ram_size; >> +} >> + >> +if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size + >> + device_mem_size) < AMD_HT_START) { > And I was at two minds on this one, whether to advertise *always* the 1T hole, regardless of relocation. Or account the size we advertise for the pci64 hole and make that part of the equation above. Although that has the flaw that the firmware at admin request may pick some ludricous number (limited by maxphysaddr). >>> >>> it this point we have only pci64 hole size (machine property), >>> so I'd include that in equation to make firmware assign >>> pci64 aperture above HT range. >>> >>> as for checking maxphysaddr, we can only check 'default' PCI hole >>> range at this stage (i.e. 1Gb aligned hole size after all possible RAM) >>> and hard error on it. >>> >> >> Igor, in the context of your comment above, I'll be introducing another >> preparatory patch that adds up pci_hole64_size to pc_memory_init() such >> that all used/max physaddr space checks are consolidated in pc_memory_init(). >> >> To that end, the changes involve mainly moves the the pcihost qdev creation >> to be before pc_memory_init(). Q35 just needs a 2-line order change. i440fx >> needs slightly more of a dance to extract that from i440fx_init() and also >> because most i440fx state is private (hence the new helper for size). But >> the actual initialization of I440fx/q35 pci host is still after >> pc_memory_init(), >> it is just to extra pci_hole64_size from the object + user passed args >> (-global etc). > > Shuffling init order is looks too intrusive and in practice > quite risky. Yeah, it is an intrusive change sadly. Although, why would you consider it risky (curious)? We are "only" moving this: qdev_new(host_type); ... located at the very top of i440fx_init() and called at the top of q35_host initilization to be instead before pc_memory_init(). And that means that an instance of an object gets made and its properties initialized i.e. @instance_init of q35 and i440fx and its properties. I don't see a particular dependence in PC code to tell that this would affect its surroundings parts. The actual pcihost-related initialization is still kept entirely unchanged. > How about moving maxphysaddr check to pc_machine_done() instead? > (this way you won't have to move pcihost around) > I can move it. But be there will be a slight disconnect between what pc_memory_init() checks against "max used address" between ... dictating if the 4G mem start should change to 1T or not ... and when the phys-bits check is actually made which includes the pci hole64. For example, we create a guest with maxram 1009G (which 4G mem start would get at unchanged) and then the pci_hole64 goes likely assigned across the rest until 1023G (i.e. across the HT region). Here it would need an extra check and fail if pci_hole64 crosses the HT region. Whereby if it was added in pc_memory_init() then we could just relocate to 1T and the guest creation would still proceed. > >> Raw staging changes below the scissors mark so far. >> >> -->8-- >> >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c >> index b2e43eba1106..902977081350 100644 >> --- a/hw/i386/pc.c >> +++ b/hw/i386/pc.c >> @@ -875,7 +875,8 @@ static void x86_update_above_4g_mem_start(PCMachineState >> *pcms) >> void pc_memory_init(PCMachineState *pcms, >> MemoryRegion *system_memory, >> MemoryRegion *rom_memory, >> -MemoryRegion **ram_memory) >> +MemoryRegion **ram_memory, >> +uint64_t pci_hole64_size) >> { >> int linux_boot, i; >> MemoryRegion *option_rom_mr; >> diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c >> index d9b344248dac..5a608e30e28f 100644 >> --- a/hw/i386/pc_piix.c >> +++ b/hw/i386/pc_piix.c >> @@ -91,6 +91,8 @@ static void pc_init1(MachineState *machine, >> MemoryRegion *pci_memory; >> MemoryRegion *rom_memory; >> ram_addr_t lowmem; >> +uint64_t hole64_size; >> +DeviceState
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
* Daniel P. Berrangé (berra...@redhat.com) wrote: > On Tue, Feb 15, 2022 at 10:53:58AM +0100, Gerd Hoffmann wrote: > > Hi, > > > > > I don't know what behavior should be if firmware tries to program > > > PCI64 hole beyond supported phys-bits. > > > > Well, you are basically f*cked. > > > > Unfortunately there is no reliable way to figure what phys-bits actually > > is. Because of that the firmware (both seabios and edk2) tries to place > > the pci64 hole as low as possible. > > > > The long version: > > > > qemu advertises phys-bits=40 to the guest by default. Probably because > > this is what the first amd opteron processors had, assuming that it > > would be a safe default. Then intel came, releasing processors with > > phys-bits=36, even recent (desktop-class) hardware has phys-bits=39. > > Boom. > > > > End result is that edk2 uses a 32G pci64 window by default, which is > > placed at the first 32G border beyond normal ram. So for virtual > > machines with up to ~ 30G ram (including reservations for memory > > hotplug) the pci64 hole covers 32G -> 64G in guest physical address > > space, which is low enough that it works on hardware with phys-bits=36. > > > > If your VM has more than 32G of memory the pci64 hole will move and > > phys-bits=36 isn't enough any more, but given that you probably only do > > that on more beefy hosts which can take >= 64G of RAM and have a larger > > physical address space this heuristic works good enough in practice. > > > > Changing phys-bits behavior has been discussed on and off since years. > > It's tricky to change for live migration compatibility reasons. > > > > We got the host-phys-bits and host-phys-bits-limit properties, which > > solve some of the phys-bits problems. > > > > * host-phys-bits=on makes sure the phys-bits advertised to the guest > >actually works. It's off by default though for backward > >compatibility reasons (except microvm). Also because turning it on > >breaks live migration of machines between hosts with different > >phys-bits. > > RHEL has shipped with host-phys-bits=on in its machine types > sinec RHEL-7. If it is good enough for RHEL machine types > for 8 years, IMHO, it is a sign that its reasonable to do the > same with upstream for new machine types. And the upstream code is now pretty much identical except for the default; note that for TCG you do need to keep to 40 I think. Dave > > > * host-phys-bits-limit can be used to tweak phys-bits to > >be lower than what the host supports. Which can be used for > >live migration compatibility, i.e. if you have a pool of machines > >where some have 36 and some 39 you can limit phys-bits to 36 so > >live migration from 39 hosts to 36 hosts works. > > RHEL machine types have set this to host-phys-bits-limit=48 > since RHEL-8 days, to avoid accidentally enabling 5-level > paging in guests without explicit user opt-in. > > > What is missing: > > > > * Some way for the firmware to get a phys-bits value it can actually > >use. One possible way would be to have a paravirtual bit somewhere > >telling whenever host-phys-bits is enabled or not. > > > Regards, > Daniel > -- > |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o-https://fstop138.berrange.com :| > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :| > > -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On Fri, 18 Feb 2022 17:12:21 + Joao Martins wrote: > On 2/14/22 15:31, Igor Mammedov wrote: > > On Mon, 14 Feb 2022 15:05:00 + > > Joao Martins wrote: > >> On 2/14/22 14:53, Igor Mammedov wrote: > >>> On Mon, 7 Feb 2022 20:24:20 + > >>> Joao Martins wrote: > +{ > +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms); > +X86MachineState *x86ms = X86_MACHINE(pcms); > +ram_addr_t device_mem_size = 0; > +uint32_t eax, vendor[3]; > + > +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]); > +if (!IS_AMD_VENDOR(vendor)) { > +return; > +} > + > +if (pcmc->has_reserved_memory && > + (machine->ram_size < machine->maxram_size)) { > +device_mem_size = machine->maxram_size - machine->ram_size; > +} > + > +if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size + > + device_mem_size) < AMD_HT_START) { > >>> > >> And I was at two minds on this one, whether to advertise *always* > >> the 1T hole, regardless of relocation. Or account the size > >> we advertise for the pci64 hole and make that part of the equation > >> above. Although that has the flaw that the firmware at admin request > >> may pick some ludricous number (limited by maxphysaddr). > > > > it this point we have only pci64 hole size (machine property), > > so I'd include that in equation to make firmware assign > > pci64 aperture above HT range. > > > > as for checking maxphysaddr, we can only check 'default' PCI hole > > range at this stage (i.e. 1Gb aligned hole size after all possible RAM) > > and hard error on it. > > > > Igor, in the context of your comment above, I'll be introducing another > preparatory patch that adds up pci_hole64_size to pc_memory_init() such > that all used/max physaddr space checks are consolidated in pc_memory_init(). > > To that end, the changes involve mainly moves the the pcihost qdev creation > to be before pc_memory_init(). Q35 just needs a 2-line order change. i440fx > needs slightly more of a dance to extract that from i440fx_init() and also > because most i440fx state is private (hence the new helper for size). But > the actual initialization of I440fx/q35 pci host is still after > pc_memory_init(), > it is just to extra pci_hole64_size from the object + user passed args > (-global etc). Shuffling init order is looks too intrusive and in practice quite risky. How about moving maxphysaddr check to pc_machine_done() instead? (this way you won't have to move pcihost around) > Raw staging changes below the scissors mark so far. > > -->8-- > > diff --git a/hw/i386/pc.c b/hw/i386/pc.c > index b2e43eba1106..902977081350 100644 > --- a/hw/i386/pc.c > +++ b/hw/i386/pc.c > @@ -875,7 +875,8 @@ static void x86_update_above_4g_mem_start(PCMachineState > *pcms) > void pc_memory_init(PCMachineState *pcms, > MemoryRegion *system_memory, > MemoryRegion *rom_memory, > -MemoryRegion **ram_memory) > +MemoryRegion **ram_memory, > +uint64_t pci_hole64_size) > { > int linux_boot, i; > MemoryRegion *option_rom_mr; > diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c > index d9b344248dac..5a608e30e28f 100644 > --- a/hw/i386/pc_piix.c > +++ b/hw/i386/pc_piix.c > @@ -91,6 +91,8 @@ static void pc_init1(MachineState *machine, > MemoryRegion *pci_memory; > MemoryRegion *rom_memory; > ram_addr_t lowmem; > +uint64_t hole64_size; > +DeviceState *i440fx_dev; > > /* > * Calculate ram split, for memory below and above 4G. It's a bit > @@ -164,9 +166,13 @@ static void pc_init1(MachineState *machine, > pci_memory = g_new(MemoryRegion, 1); > memory_region_init(pci_memory, NULL, "pci", UINT64_MAX); > rom_memory = pci_memory; > +i440fx_dev = qdev_new(host_type); > +hole64_size = i440fx_pci_hole64_size(i440fx_dev); > } else { > pci_memory = NULL; > rom_memory = system_memory; > +i440fx_dev = NULL; > +hole64_size = 0; > } > > pc_guest_info_init(pcms); > @@ -183,7 +189,7 @@ static void pc_init1(MachineState *machine, > /* allocate ram and load rom/bios */ > if (!xen_enabled()) { > pc_memory_init(pcms, system_memory, > - rom_memory, &ram_memory); > + rom_memory, &ram_memory, hole64_size); > } else { > pc_system_flash_cleanup_unused(pcms); > if (machine->kernel_filename != NULL) { > @@ -199,7 +205,7 @@ static void pc_init1(MachineState *machine, > > pci_bus = i440fx_init(host_type, >pci_type, > - &i440fx_state, > + i440fx_dev, &i440fx_state, >system_memory, system_io, machine->ram_size, >
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On 2/14/22 15:31, Igor Mammedov wrote: > On Mon, 14 Feb 2022 15:05:00 + > Joao Martins wrote: >> On 2/14/22 14:53, Igor Mammedov wrote: >>> On Mon, 7 Feb 2022 20:24:20 + >>> Joao Martins wrote: +{ +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms); +X86MachineState *x86ms = X86_MACHINE(pcms); +ram_addr_t device_mem_size = 0; +uint32_t eax, vendor[3]; + +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]); +if (!IS_AMD_VENDOR(vendor)) { +return; +} + +if (pcmc->has_reserved_memory && + (machine->ram_size < machine->maxram_size)) { +device_mem_size = machine->maxram_size - machine->ram_size; +} + +if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size + + device_mem_size) < AMD_HT_START) { >>> >> And I was at two minds on this one, whether to advertise *always* >> the 1T hole, regardless of relocation. Or account the size >> we advertise for the pci64 hole and make that part of the equation >> above. Although that has the flaw that the firmware at admin request >> may pick some ludricous number (limited by maxphysaddr). > > it this point we have only pci64 hole size (machine property), > so I'd include that in equation to make firmware assign > pci64 aperture above HT range. > > as for checking maxphysaddr, we can only check 'default' PCI hole > range at this stage (i.e. 1Gb aligned hole size after all possible RAM) > and hard error on it. > Igor, in the context of your comment above, I'll be introducing another preparatory patch that adds up pci_hole64_size to pc_memory_init() such that all used/max physaddr space checks are consolidated in pc_memory_init(). To that end, the changes involve mainly moves the the pcihost qdev creation to be before pc_memory_init(). Q35 just needs a 2-line order change. i440fx needs slightly more of a dance to extract that from i440fx_init() and also because most i440fx state is private (hence the new helper for size). But the actual initialization of I440fx/q35 pci host is still after pc_memory_init(), it is just to extra pci_hole64_size from the object + user passed args (-global etc). Raw staging changes below the scissors mark so far. -->8-- diff --git a/hw/i386/pc.c b/hw/i386/pc.c index b2e43eba1106..902977081350 100644 --- a/hw/i386/pc.c +++ b/hw/i386/pc.c @@ -875,7 +875,8 @@ static void x86_update_above_4g_mem_start(PCMachineState *pcms) void pc_memory_init(PCMachineState *pcms, MemoryRegion *system_memory, MemoryRegion *rom_memory, -MemoryRegion **ram_memory) +MemoryRegion **ram_memory, +uint64_t pci_hole64_size) { int linux_boot, i; MemoryRegion *option_rom_mr; diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c index d9b344248dac..5a608e30e28f 100644 --- a/hw/i386/pc_piix.c +++ b/hw/i386/pc_piix.c @@ -91,6 +91,8 @@ static void pc_init1(MachineState *machine, MemoryRegion *pci_memory; MemoryRegion *rom_memory; ram_addr_t lowmem; +uint64_t hole64_size; +DeviceState *i440fx_dev; /* * Calculate ram split, for memory below and above 4G. It's a bit @@ -164,9 +166,13 @@ static void pc_init1(MachineState *machine, pci_memory = g_new(MemoryRegion, 1); memory_region_init(pci_memory, NULL, "pci", UINT64_MAX); rom_memory = pci_memory; +i440fx_dev = qdev_new(host_type); +hole64_size = i440fx_pci_hole64_size(i440fx_dev); } else { pci_memory = NULL; rom_memory = system_memory; +i440fx_dev = NULL; +hole64_size = 0; } pc_guest_info_init(pcms); @@ -183,7 +189,7 @@ static void pc_init1(MachineState *machine, /* allocate ram and load rom/bios */ if (!xen_enabled()) { pc_memory_init(pcms, system_memory, - rom_memory, &ram_memory); + rom_memory, &ram_memory, hole64_size); } else { pc_system_flash_cleanup_unused(pcms); if (machine->kernel_filename != NULL) { @@ -199,7 +205,7 @@ static void pc_init1(MachineState *machine, pci_bus = i440fx_init(host_type, pci_type, - &i440fx_state, + i440fx_dev, &i440fx_state, system_memory, system_io, machine->ram_size, x86ms->below_4g_mem_size, x86ms->above_4g_mem_size, diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c index 1780f79bc127..b7cf44d4755e 100644 --- a/hw/i386/pc_q35.c +++ b/hw/i386/pc_q35.c @@ -203,12 +203,13 @@ static void pc_q35_init(MachineState *machine) pcms->smbios_entry_point_type); } -/* allocate ram and load rom/bios */ -pc_memory_init(pcms, get_system_memory(), rom_memory, &ram_memory);
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
Hi, > What I overlooked was the emphasis you had on desktops (qemu default bigger > than > host-advertised), where I am thinking mostly in servers. Yep, on servers you have the reverse problem that phys-bits=40 might be too small for very large guests. > > To make things even worse: The default > > value (phys-bits=40) is larger than what many intel boxes support. > > > > host-phys-bits=on fixes that. It makes guest-phys-bits == host-phys-bits > > by default, and also enforces guest-phys-bits <= host-phys-bits. > > So with host-phys-bits=on the guest can actually use CPUID.8008.EAX > > to figure how big the guest physical address space is. > > > Your 2 paragraphs sound like it's two different things, but +host-phys-bits > just sets CPUID.8008.EAX with the host CPUID equivalent leaf/register > value. Which yes, it makes it reliable, but the way to convey is still the > same. That is regardless, of phys-bits=bogus-bigger-than-host-number, > host-phys-bits=on or host-phys-bits-limit=N. Yep, it's CPUID.8008.EAX in all cases. > > Problem is the guest doesn't know whenever host-phys-bits is enabled or > > not ... > > > > So the options to fix that are: > > > > (1) Make the host-phys-bits option visible to the guest (as suggested > > above), or > > (2) Advertise a _reliable_ phys-bits value to the guest somehow. > > What I am not enterily sure from (1) is the value on having a 'guest > phys-bits' > and a 'host phys-bits' *exposed to the guest* when it seems we are picking > the wrong > value for guests. Well, the guest can read CPUID.8008.EAX but as of today the guest just doesn't know whenever it's bogus or not. > It seems unnecessary redirection (compared to real hw) unless > there's a use-case in keeping both that I am probably missing. Well, the use case is backward compatibility. If we want make better use of the guest physical address space on newer qemu without breaking compatibility with older qemu versions the firmware needs to do something along the lines of: if (i-can-trust-phys-bits) { // new qemu read CPUID.8008.EAX and go with that } else { // old qemu use current heuristic, placing stuff as low as possible. } And for that to actually work new qemu needs to expose something to the guest which can be used to evaluate the "i-can-trust-phys-bits" expression. A guest-readable bit somewhere which is 1 for host-phys-bits=on and 0 otherwise would do the trick, but there are surely other ways to solve the problem. take care, Gerd
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On 2/16/22 08:19, Gerd Hoffmann wrote: > On Tue, Feb 15, 2022 at 07:37:40PM +, Joao Martins wrote: >> On 2/15/22 09:53, Gerd Hoffmann wrote: >>> What is missing: >>> >>> * Some way for the firmware to get a phys-bits value it can actually >>>use. One possible way would be to have a paravirtual bit somewhere >>>telling whenever host-phys-bits is enabled or not. >>> >> If we are not talking about *very old* processors... isn't what already >> advertised in CPUID.8008 EAX enough? That's the maxphysaddr width >> on x86, which on qemu we do set it with the phys-bits value; > > Sigh. Nope. Did you read the complete reply? > Yes, I did. What I overlooked was the emphasis you had on desktops (qemu default bigger than host-advertised), where I am thinking mostly in servers. > Problem is this is not reliable. With host-phys-bits=off (default) qemu > allows to set phys-bits to whatever value you want, including values > larger than what the host actually supports. Which renders > CPUID.8008.EAX unusable. I am seeing from another angle, which the way to convey the phys-bits is via this CPUID leaf is *maybe* enough (like real hardware). But we are setting with a bigger value than we should have (or in other words ... not honoring our physical boundary). > To make things even worse: The default > value (phys-bits=40) is larger than what many intel boxes support. > > host-phys-bits=on fixes that. It makes guest-phys-bits == host-phys-bits > by default, and also enforces guest-phys-bits <= host-phys-bits. > So with host-phys-bits=on the guest can actually use CPUID.8008.EAX > to figure how big the guest physical address space is. > Your 2 paragraphs sound like it's two different things, but +host-phys-bits just sets CPUID.8008.EAX with the host CPUID equivalent leaf/register value. Which yes, it makes it reliable, but the way to convey is still the same. That is regardless, of phys-bits=bogus-bigger-than-host-number, host-phys-bits=on or host-phys-bits-limit=N. > Problem is the guest doesn't know whenever host-phys-bits is enabled or > not ... > > So the options to fix that are: > > (1) Make the host-phys-bits option visible to the guest (as suggested > above), or > (2) Advertise a _reliable_ phys-bits value to the guest somehow. What I am not enterily sure from (1) is the value on having a 'guest phys-bits' and a 'host phys-bits' *exposed to the guest* when it seems we are picking the wrong value for guests. It seems unnecessary redirection (compared to real hw) unless there's a use-case in keeping both that I am probably missing. Joao
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On Tue, Feb 15, 2022 at 10:53:58AM +0100, Gerd Hoffmann wrote: > Hi, > > > I don't know what behavior should be if firmware tries to program > > PCI64 hole beyond supported phys-bits. > > Well, you are basically f*cked. > > Unfortunately there is no reliable way to figure what phys-bits actually > is. Because of that the firmware (both seabios and edk2) tries to place > the pci64 hole as low as possible. > > The long version: > > qemu advertises phys-bits=40 to the guest by default. Probably because > this is what the first amd opteron processors had, assuming that it > would be a safe default. Then intel came, releasing processors with > phys-bits=36, even recent (desktop-class) hardware has phys-bits=39. > Boom. > > End result is that edk2 uses a 32G pci64 window by default, which is > placed at the first 32G border beyond normal ram. So for virtual > machines with up to ~ 30G ram (including reservations for memory > hotplug) the pci64 hole covers 32G -> 64G in guest physical address > space, which is low enough that it works on hardware with phys-bits=36. > > If your VM has more than 32G of memory the pci64 hole will move and > phys-bits=36 isn't enough any more, but given that you probably only do > that on more beefy hosts which can take >= 64G of RAM and have a larger > physical address space this heuristic works good enough in practice. > > Changing phys-bits behavior has been discussed on and off since years. > It's tricky to change for live migration compatibility reasons. > > We got the host-phys-bits and host-phys-bits-limit properties, which > solve some of the phys-bits problems. > > * host-phys-bits=on makes sure the phys-bits advertised to the guest >actually works. It's off by default though for backward >compatibility reasons (except microvm). Also because turning it on >breaks live migration of machines between hosts with different >phys-bits. RHEL has shipped with host-phys-bits=on in its machine types sinec RHEL-7. If it is good enough for RHEL machine types for 8 years, IMHO, it is a sign that its reasonable to do the same with upstream for new machine types. > * host-phys-bits-limit can be used to tweak phys-bits to >be lower than what the host supports. Which can be used for >live migration compatibility, i.e. if you have a pool of machines >where some have 36 and some 39 you can limit phys-bits to 36 so >live migration from 39 hosts to 36 hosts works. RHEL machine types have set this to host-phys-bits-limit=48 since RHEL-8 days, to avoid accidentally enabling 5-level paging in guests without explicit user opt-in. > What is missing: > > * Some way for the firmware to get a phys-bits value it can actually >use. One possible way would be to have a paravirtual bit somewhere >telling whenever host-phys-bits is enabled or not. Regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On Tue, Feb 15, 2022 at 07:37:40PM +, Joao Martins wrote: > On 2/15/22 09:53, Gerd Hoffmann wrote: > > What is missing: > > > > * Some way for the firmware to get a phys-bits value it can actually > >use. One possible way would be to have a paravirtual bit somewhere > >telling whenever host-phys-bits is enabled or not. > > > If we are not talking about *very old* processors... isn't what already > advertised in CPUID.8008 EAX enough? That's the maxphysaddr width > on x86, which on qemu we do set it with the phys-bits value; Sigh. Nope. Did you read the complete reply? Problem is this is not reliable. With host-phys-bits=off (default) qemu allows to set phys-bits to whatever value you want, including values larger than what the host actually supports. Which renders CPUID.8008.EAX unusable. To make things even worse: The default value (phys-bits=40) is larger than what many intel boxes support. host-phys-bits=on fixes that. It makes guest-phys-bits == host-phys-bits by default, and also enforces guest-phys-bits <= host-phys-bits. So with host-phys-bits=on the guest can actually use CPUID.8008.EAX to figure how big the guest physical address space is. Problem is the guest doesn't know whenever host-phys-bits is enabled or not ... So the options to fix that are: (1) Make the host-phys-bits option visible to the guest (as suggested above), or (2) Advertise a _reliable_ phys-bits value to the guest somehow. take care, Gerd
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On 2/15/22 09:53, Gerd Hoffmann wrote: > What is missing: > > * Some way for the firmware to get a phys-bits value it can actually >use. One possible way would be to have a paravirtual bit somewhere >telling whenever host-phys-bits is enabled or not. > If we are not talking about *very old* processors... isn't what already advertised in CPUID.8008 EAX enough? That's the maxphysaddr width on x86, which on qemu we do set it with the phys-bits value; Joao
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
Hi, > I don't know what behavior should be if firmware tries to program > PCI64 hole beyond supported phys-bits. Well, you are basically f*cked. Unfortunately there is no reliable way to figure what phys-bits actually is. Because of that the firmware (both seabios and edk2) tries to place the pci64 hole as low as possible. The long version: qemu advertises phys-bits=40 to the guest by default. Probably because this is what the first amd opteron processors had, assuming that it would be a safe default. Then intel came, releasing processors with phys-bits=36, even recent (desktop-class) hardware has phys-bits=39. Boom. End result is that edk2 uses a 32G pci64 window by default, which is placed at the first 32G border beyond normal ram. So for virtual machines with up to ~ 30G ram (including reservations for memory hotplug) the pci64 hole covers 32G -> 64G in guest physical address space, which is low enough that it works on hardware with phys-bits=36. If your VM has more than 32G of memory the pci64 hole will move and phys-bits=36 isn't enough any more, but given that you probably only do that on more beefy hosts which can take >= 64G of RAM and have a larger physical address space this heuristic works good enough in practice. Changing phys-bits behavior has been discussed on and off since years. It's tricky to change for live migration compatibility reasons. We got the host-phys-bits and host-phys-bits-limit properties, which solve some of the phys-bits problems. * host-phys-bits=on makes sure the phys-bits advertised to the guest actually works. It's off by default though for backward compatibility reasons (except microvm). Also because turning it on breaks live migration of machines between hosts with different phys-bits. * host-phys-bits-limit can be used to tweak phys-bits to be lower than what the host supports. Which can be used for live migration compatibility, i.e. if you have a pool of machines where some have 36 and some 39 you can limit phys-bits to 36 so live migration from 39 hosts to 36 hosts works. What is missing: * Some way for the firmware to get a phys-bits value it can actually use. One possible way would be to have a paravirtual bit somewhere telling whenever host-phys-bits is enabled or not. If edk2 could figure what the usable (guest) physical address space actually is it could: (a) make sure it never crosses that limit, and (b) pick better defaults, for example make the pci64 hole larger than 32G in case the available address space allows that. take care, Gerd
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On Mon, 14 Feb 2022 15:05:00 + Joao Martins wrote: > On 2/14/22 14:53, Igor Mammedov wrote: > > On Mon, 7 Feb 2022 20:24:20 + > > Joao Martins wrote: > > > >> It is assumed that the whole GPA space is available to be DMA > >> addressable, within a given address space limit, expect for a > >> tiny region before the 4G. Since Linux v5.4, VFIO validates > >> whether the selected GPA is indeed valid i.e. not reserved by > >> IOMMU on behalf of some specific devices or platform-defined > >> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with > >> -EINVAL. > >> > >> AMD systems with an IOMMU are examples of such platforms and > >> particularly may only have these ranges as allowed: > >> > >> - fedf (0 .. 3.982G) > >>fef0 - 00fc (3.983G .. 1011.9G) > >>0100 - (1Tb.. 16Pb[*]) > >> > >> We already account for the 4G hole, albeit if the guest is big > >> enough we will fail to allocate a guest with >1010G due to the > >> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT). > >> > >> [*] there is another reserved region unrelated to HT that exists > >> in the 256T boundaru in Fam 17h according to Errata #1286, > >> documeted also in "Open-Source Register Reference for AMD Family > >> 17h Processors (PUB)" > >> > >> When creating the region above 4G, take into account that on AMD > >> platforms the HyperTransport range is reserved and hence it > >> cannot be used either as GPAs. On those cases rather than > >> establishing the start of ram-above-4g to be 4G, relocate instead > >> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical > >> Topology", for more information on the underlying restriction of > >> IOVAs. > >> > >> After accounting for the 1Tb hole on AMD hosts, mtree should > >> look like: > >> > >> -7fff (prio 0, i/o): > >> alias ram-below-4g @pc.ram -7fff > >> 0100-01ff7fff (prio 0, i/o): > >>alias ram-above-4g @pc.ram 8000-00ff > >> > >> If the relocation is done, we also add the the reserved HT > >> e820 range as reserved. > >> > >> Suggested-by: Igor Mammedov > >> Signed-off-by: Joao Martins > >> --- > >> hw/i386/pc.c | 66 +++ > >> target/i386/cpu.h | 4 +++ > >> 2 files changed, 70 insertions(+) > >> > >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c > >> index 7de0e87f4a3f..b060aedd38f3 100644 > >> --- a/hw/i386/pc.c > >> +++ b/hw/i386/pc.c > >> @@ -802,6 +802,65 @@ void xen_load_linux(PCMachineState *pcms) > >> #define PC_ROM_ALIGN 0x800 > >> #define PC_ROM_SIZE(PC_ROM_MAX - PC_ROM_MIN_VGA) > >> > >> +/* > >> + * AMD systems with an IOMMU have an additional hole close to the > >> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending > >> + * on kernel version, VFIO may or may not let you DMA map those ranges. > >> + * Starting Linux v5.4 we validate it, and can't create guests on AMD > >> machines > >> + * with certain memory sizes. It's also wrong to use those IOVA ranges > >> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse. > >> + * The ranges reserved for Hyper-Transport are: > >> + * > >> + * FD__h - FF__h > >> + * > >> + * The ranges represent the following: > >> + * > >> + * Base Address Top Address Use > >> + * > >> + * FD__h FD_F7FF_h Reserved interrupt address space > >> + * FD_F800_h FD_F8FF_h Interrupt/EOI IntCtl > >> + * FD_F900_h FD_F90F_h Legacy PIC IACK > >> + * FD_F910_h FD_F91F_h System Management > >> + * FD_F920_h FD_FAFF_h Reserved Page Tables > >> + * FD_FB00_h FD_FBFF_h Address Translation > >> + * FD_FC00_h FD_FDFF_h I/O Space > >> + * FD_FE00_h FD__h Configuration > >> + * FE__h FE_1FFF_h Extended Configuration/Device Messages > >> + * FE_2000_h FF__h Reserved > >> + * > >> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology", > >> + * Table 3: Special Address Controls (GPA) for more information. > >> + */ > >> +#define AMD_HT_START 0xfdUL > >> +#define AMD_HT_END 0xffUL > >> +#define AMD_ABOVE_1TB_START (AMD_HT_END + 1) > >> +#define AMD_HT_SIZE (AMD_ABOVE_1TB_START - AMD_HT_START) > >> + > >> +static void relocate_4g(MachineState *machine, PCMachineState *pcms) > > > > perhaps rename it to x86_update_above_4g_mem_start() ? > > > Yeap. > > >> +{ > >> +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms); > >> +X86MachineState *x86ms = X86_MACHINE(pcms); > >> +ram_addr_t device_mem_size = 0; > >> +uint32_t eax, vendor[3]; > >> + > >> +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]); > >> +if (!IS_AMD_VENDOR(vendor)) { > >> +return; > >> +} > >> + > >> +if (pcmc->has_reserved_memory && > >> + (machine->r
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On 2/14/22 14:53, Igor Mammedov wrote: > On Mon, 7 Feb 2022 20:24:20 + > Joao Martins wrote: > >> It is assumed that the whole GPA space is available to be DMA >> addressable, within a given address space limit, expect for a >> tiny region before the 4G. Since Linux v5.4, VFIO validates >> whether the selected GPA is indeed valid i.e. not reserved by >> IOMMU on behalf of some specific devices or platform-defined >> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with >> -EINVAL. >> >> AMD systems with an IOMMU are examples of such platforms and >> particularly may only have these ranges as allowed: >> >> - fedf (0 .. 3.982G) >> fef0 - 00fc (3.983G .. 1011.9G) >> 0100 - (1Tb.. 16Pb[*]) >> >> We already account for the 4G hole, albeit if the guest is big >> enough we will fail to allocate a guest with >1010G due to the >> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT). >> >> [*] there is another reserved region unrelated to HT that exists >> in the 256T boundaru in Fam 17h according to Errata #1286, >> documeted also in "Open-Source Register Reference for AMD Family >> 17h Processors (PUB)" >> >> When creating the region above 4G, take into account that on AMD >> platforms the HyperTransport range is reserved and hence it >> cannot be used either as GPAs. On those cases rather than >> establishing the start of ram-above-4g to be 4G, relocate instead >> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical >> Topology", for more information on the underlying restriction of >> IOVAs. >> >> After accounting for the 1Tb hole on AMD hosts, mtree should >> look like: >> >> -7fff (prio 0, i/o): >> alias ram-below-4g @pc.ram -7fff >> 0100-01ff7fff (prio 0, i/o): >> alias ram-above-4g @pc.ram 8000-00ff >> >> If the relocation is done, we also add the the reserved HT >> e820 range as reserved. >> >> Suggested-by: Igor Mammedov >> Signed-off-by: Joao Martins >> --- >> hw/i386/pc.c | 66 +++ >> target/i386/cpu.h | 4 +++ >> 2 files changed, 70 insertions(+) >> >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c >> index 7de0e87f4a3f..b060aedd38f3 100644 >> --- a/hw/i386/pc.c >> +++ b/hw/i386/pc.c >> @@ -802,6 +802,65 @@ void xen_load_linux(PCMachineState *pcms) >> #define PC_ROM_ALIGN 0x800 >> #define PC_ROM_SIZE(PC_ROM_MAX - PC_ROM_MIN_VGA) >> >> +/* >> + * AMD systems with an IOMMU have an additional hole close to the >> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending >> + * on kernel version, VFIO may or may not let you DMA map those ranges. >> + * Starting Linux v5.4 we validate it, and can't create guests on AMD >> machines >> + * with certain memory sizes. It's also wrong to use those IOVA ranges >> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse. >> + * The ranges reserved for Hyper-Transport are: >> + * >> + * FD__h - FF__h >> + * >> + * The ranges represent the following: >> + * >> + * Base Address Top Address Use >> + * >> + * FD__h FD_F7FF_h Reserved interrupt address space >> + * FD_F800_h FD_F8FF_h Interrupt/EOI IntCtl >> + * FD_F900_h FD_F90F_h Legacy PIC IACK >> + * FD_F910_h FD_F91F_h System Management >> + * FD_F920_h FD_FAFF_h Reserved Page Tables >> + * FD_FB00_h FD_FBFF_h Address Translation >> + * FD_FC00_h FD_FDFF_h I/O Space >> + * FD_FE00_h FD__h Configuration >> + * FE__h FE_1FFF_h Extended Configuration/Device Messages >> + * FE_2000_h FF__h Reserved >> + * >> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology", >> + * Table 3: Special Address Controls (GPA) for more information. >> + */ >> +#define AMD_HT_START 0xfdUL >> +#define AMD_HT_END 0xffUL >> +#define AMD_ABOVE_1TB_START (AMD_HT_END + 1) >> +#define AMD_HT_SIZE (AMD_ABOVE_1TB_START - AMD_HT_START) >> + >> +static void relocate_4g(MachineState *machine, PCMachineState *pcms) > > perhaps rename it to x86_update_above_4g_mem_start() ? > Yeap. >> +{ >> +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms); >> +X86MachineState *x86ms = X86_MACHINE(pcms); >> +ram_addr_t device_mem_size = 0; >> +uint32_t eax, vendor[3]; >> + >> +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]); >> +if (!IS_AMD_VENDOR(vendor)) { >> +return; >> +} >> + >> +if (pcmc->has_reserved_memory && >> + (machine->ram_size < machine->maxram_size)) { >> +device_mem_size = machine->maxram_size - machine->ram_size; >> +} >> + >> +if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size + >> + device_mem_size) < AMD_HT_START) { > should it account for sgx as well? > Yes,
Re: [PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
On Mon, 7 Feb 2022 20:24:20 + Joao Martins wrote: > It is assumed that the whole GPA space is available to be DMA > addressable, within a given address space limit, expect for a > tiny region before the 4G. Since Linux v5.4, VFIO validates > whether the selected GPA is indeed valid i.e. not reserved by > IOMMU on behalf of some specific devices or platform-defined > restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with > -EINVAL. > > AMD systems with an IOMMU are examples of such platforms and > particularly may only have these ranges as allowed: > > - fedf (0 .. 3.982G) > fef0 - 00fc (3.983G .. 1011.9G) > 0100 - (1Tb.. 16Pb[*]) > > We already account for the 4G hole, albeit if the guest is big > enough we will fail to allocate a guest with >1010G due to the > ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT). > > [*] there is another reserved region unrelated to HT that exists > in the 256T boundaru in Fam 17h according to Errata #1286, > documeted also in "Open-Source Register Reference for AMD Family > 17h Processors (PUB)" > > When creating the region above 4G, take into account that on AMD > platforms the HyperTransport range is reserved and hence it > cannot be used either as GPAs. On those cases rather than > establishing the start of ram-above-4g to be 4G, relocate instead > to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical > Topology", for more information on the underlying restriction of > IOVAs. > > After accounting for the 1Tb hole on AMD hosts, mtree should > look like: > > -7fff (prio 0, i/o): >alias ram-below-4g @pc.ram -7fff > 0100-01ff7fff (prio 0, i/o): > alias ram-above-4g @pc.ram 8000-00ff > > If the relocation is done, we also add the the reserved HT > e820 range as reserved. > > Suggested-by: Igor Mammedov > Signed-off-by: Joao Martins > --- > hw/i386/pc.c | 66 +++ > target/i386/cpu.h | 4 +++ > 2 files changed, 70 insertions(+) > > diff --git a/hw/i386/pc.c b/hw/i386/pc.c > index 7de0e87f4a3f..b060aedd38f3 100644 > --- a/hw/i386/pc.c > +++ b/hw/i386/pc.c > @@ -802,6 +802,65 @@ void xen_load_linux(PCMachineState *pcms) > #define PC_ROM_ALIGN 0x800 > #define PC_ROM_SIZE(PC_ROM_MAX - PC_ROM_MIN_VGA) > > +/* > + * AMD systems with an IOMMU have an additional hole close to the > + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending > + * on kernel version, VFIO may or may not let you DMA map those ranges. > + * Starting Linux v5.4 we validate it, and can't create guests on AMD > machines > + * with certain memory sizes. It's also wrong to use those IOVA ranges > + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse. > + * The ranges reserved for Hyper-Transport are: > + * > + * FD__h - FF__h > + * > + * The ranges represent the following: > + * > + * Base Address Top Address Use > + * > + * FD__h FD_F7FF_h Reserved interrupt address space > + * FD_F800_h FD_F8FF_h Interrupt/EOI IntCtl > + * FD_F900_h FD_F90F_h Legacy PIC IACK > + * FD_F910_h FD_F91F_h System Management > + * FD_F920_h FD_FAFF_h Reserved Page Tables > + * FD_FB00_h FD_FBFF_h Address Translation > + * FD_FC00_h FD_FDFF_h I/O Space > + * FD_FE00_h FD__h Configuration > + * FE__h FE_1FFF_h Extended Configuration/Device Messages > + * FE_2000_h FF__h Reserved > + * > + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology", > + * Table 3: Special Address Controls (GPA) for more information. > + */ > +#define AMD_HT_START 0xfdUL > +#define AMD_HT_END 0xffUL > +#define AMD_ABOVE_1TB_START (AMD_HT_END + 1) > +#define AMD_HT_SIZE (AMD_ABOVE_1TB_START - AMD_HT_START) > + > +static void relocate_4g(MachineState *machine, PCMachineState *pcms) perhaps rename it to x86_update_above_4g_mem_start() ? > +{ > +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms); > +X86MachineState *x86ms = X86_MACHINE(pcms); > +ram_addr_t device_mem_size = 0; > +uint32_t eax, vendor[3]; > + > +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]); > +if (!IS_AMD_VENDOR(vendor)) { > +return; > +} > + > +if (pcmc->has_reserved_memory && > + (machine->ram_size < machine->maxram_size)) { > +device_mem_size = machine->maxram_size - machine->ram_size; > +} > + > +if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size + > + device_mem_size) < AMD_HT_START) { should it account for sgx as well? what if above sum ends up right before AMD_HT_START, and exit without adjusting above_4g_mem_start, but pci64 hole eventually will fall into HT range? Is it expected
[PATCH RFCv2 2/4] i386/pc: relocate 4g start to 1T where applicable
It is assumed that the whole GPA space is available to be DMA addressable, within a given address space limit, expect for a tiny region before the 4G. Since Linux v5.4, VFIO validates whether the selected GPA is indeed valid i.e. not reserved by IOMMU on behalf of some specific devices or platform-defined restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with -EINVAL. AMD systems with an IOMMU are examples of such platforms and particularly may only have these ranges as allowed: - fedf (0 .. 3.982G) fef0 - 00fc (3.983G .. 1011.9G) 0100 - (1Tb.. 16Pb[*]) We already account for the 4G hole, albeit if the guest is big enough we will fail to allocate a guest with >1010G due to the ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT). [*] there is another reserved region unrelated to HT that exists in the 256T boundaru in Fam 17h according to Errata #1286, documeted also in "Open-Source Register Reference for AMD Family 17h Processors (PUB)" When creating the region above 4G, take into account that on AMD platforms the HyperTransport range is reserved and hence it cannot be used either as GPAs. On those cases rather than establishing the start of ram-above-4g to be 4G, relocate instead to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology", for more information on the underlying restriction of IOVAs. After accounting for the 1Tb hole on AMD hosts, mtree should look like: -7fff (prio 0, i/o): alias ram-below-4g @pc.ram -7fff 0100-01ff7fff (prio 0, i/o): alias ram-above-4g @pc.ram 8000-00ff If the relocation is done, we also add the the reserved HT e820 range as reserved. Suggested-by: Igor Mammedov Signed-off-by: Joao Martins --- hw/i386/pc.c | 66 +++ target/i386/cpu.h | 4 +++ 2 files changed, 70 insertions(+) diff --git a/hw/i386/pc.c b/hw/i386/pc.c index 7de0e87f4a3f..b060aedd38f3 100644 --- a/hw/i386/pc.c +++ b/hw/i386/pc.c @@ -802,6 +802,65 @@ void xen_load_linux(PCMachineState *pcms) #define PC_ROM_ALIGN 0x800 #define PC_ROM_SIZE(PC_ROM_MAX - PC_ROM_MIN_VGA) +/* + * AMD systems with an IOMMU have an additional hole close to the + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending + * on kernel version, VFIO may or may not let you DMA map those ranges. + * Starting Linux v5.4 we validate it, and can't create guests on AMD machines + * with certain memory sizes. It's also wrong to use those IOVA ranges + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse. + * The ranges reserved for Hyper-Transport are: + * + * FD__h - FF__h + * + * The ranges represent the following: + * + * Base Address Top Address Use + * + * FD__h FD_F7FF_h Reserved interrupt address space + * FD_F800_h FD_F8FF_h Interrupt/EOI IntCtl + * FD_F900_h FD_F90F_h Legacy PIC IACK + * FD_F910_h FD_F91F_h System Management + * FD_F920_h FD_FAFF_h Reserved Page Tables + * FD_FB00_h FD_FBFF_h Address Translation + * FD_FC00_h FD_FDFF_h I/O Space + * FD_FE00_h FD__h Configuration + * FE__h FE_1FFF_h Extended Configuration/Device Messages + * FE_2000_h FF__h Reserved + * + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology", + * Table 3: Special Address Controls (GPA) for more information. + */ +#define AMD_HT_START 0xfdUL +#define AMD_HT_END 0xffUL +#define AMD_ABOVE_1TB_START (AMD_HT_END + 1) +#define AMD_HT_SIZE (AMD_ABOVE_1TB_START - AMD_HT_START) + +static void relocate_4g(MachineState *machine, PCMachineState *pcms) +{ +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms); +X86MachineState *x86ms = X86_MACHINE(pcms); +ram_addr_t device_mem_size = 0; +uint32_t eax, vendor[3]; + +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]); +if (!IS_AMD_VENDOR(vendor)) { +return; +} + +if (pcmc->has_reserved_memory && + (machine->ram_size < machine->maxram_size)) { +device_mem_size = machine->maxram_size - machine->ram_size; +} + +if ((x86ms->above_4g_mem_start + x86ms->above_4g_mem_size + + device_mem_size) < AMD_HT_START) { +return; +} + +x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START; +} + void pc_memory_init(PCMachineState *pcms, MemoryRegion *system_memory, MemoryRegion *rom_memory, @@ -821,6 +880,8 @@ void pc_memory_init(PCMachineState *pcms, linux_boot = (machine->kernel_filename != NULL); +relocate_4g(machine, pcms); + /* * Split single memory region and use aliases to address portions of it, * done for backwards compatibility with older qemus. @@ -83