On 7/15/22 12:57, Igor Mammedov wrote:
> On Thu, 14 Jul 2022 19:28:19 +0100
> Joao Martins wrote:
>
>> It is assumed that the whole GPA space is available to be DMA
>> addressable, within a given address space limit, except for a
>> tiny region before the 4G. Since Linux v5.4, VFIO validates
>> whether the selected GPA is indeed valid i.e. not reserved by
>> IOMMU on behalf of some specific devices or platform-defined
>> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>> -EINVAL.
>>
>> AMD systems with an IOMMU are examples of such platforms and
>> particularly may only have these ranges as allowed:
>>
>> - fedf (0 .. 3.982G)
>> fef0 - 00fc (3.983G .. 1011.9G)
>> 0100 - (1Tb.. 16Pb[*])
>>
>> We already account for the 4G hole, albeit if the guest is big
>> enough we will fail to allocate a guest with >1010G due to the
>> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
>>
>> [*] there is another reserved region unrelated to HT that exists
>> in the 256T boundary in Fam 17h according to Errata #1286,
>> documeted also in "Open-Source Register Reference for AMD Family
>> 17h Processors (PUB)"
>>
>> When creating the region above 4G, take into account that on AMD
>> platforms the HyperTransport range is reserved and hence it
>> cannot be used either as GPAs. On those cases rather than
>> establishing the start of ram-above-4g to be 4G, relocate instead
>> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
>> Topology", for more information on the underlying restriction of
>> IOVAs.
>>
>> After accounting for the 1Tb hole on AMD hosts, mtree should
>> look like:
>>
>> -7fff (prio 0, i/o):
>> alias ram-below-4g @pc.ram -7fff
>> 0100-01ff7fff (prio 0, i/o):
>> alias ram-above-4g @pc.ram 8000-00ff
>>
>> If the relocation is done or the address space covers it, we
>> also add the the reserved HT e820 range as reserved.
>>
>> Default phys-bits on Qemu is TCG_PHYS_ADDR_BITS (40) which is enough
>> to address 1Tb (0xff ). On AMD platforms, if a
>> ram-above-4g relocation may be desired and the CPU wasn't configured
>> with a big enough phys-bits, print an error message to the user
>> and do not make the relocation of the above-4g-region if phys-bits
>> is too low.
>>
>> Suggested-by: Igor Mammedov
>> Signed-off-by: Joao Martins
>> ---
>> hw/i386/pc.c | 82
>> 1 file changed, 82 insertions(+)
>>
>> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>> index cda435e3baeb..17613974163e 100644
>> --- a/hw/i386/pc.c
>> +++ b/hw/i386/pc.c
>> @@ -880,6 +880,52 @@ static uint64_t pc_get_cxl_range_end(PCMachineState
>> *pcms)
>> return start;
>> }
>>
>> +static hwaddr pc_max_used_gpa(PCMachineState *pcms, uint64_t
>> pci_hole64_size)
>> +{
>> +X86CPU *cpu = X86_CPU(first_cpu);
>> +
>> +/* 32-bit systems don't have hole64 thus return max CPU address */
>> +if (cpu->phys_bits <= 32) {
>> +return ((hwaddr)1 << cpu->phys_bits) - 1;
>> +}
>> +
>> +return pc_pci_hole64_start() + pci_hole64_size - 1;
>> +}
>> +
> [...]
>
>> +
>> +/*
>> + * Relocating ram-above-4G requires more than TCG_PHYS_ADDR_BITS (40).
>> + * So make sure phys-bits is required to be appropriately sized in order
>> + * to proceed with the above-4g-region relocation and thus boot.
>
> drop mention of relocation here as it's orthogonal to the check.
> Important thing we are checking here is that max used GPA is
> reachable by configured vCPU (physbits).
>
OK
>> + */
>> +maxusedaddr = pc_max_used_gpa(pcms, pci_hole64_size);
>> +maxphysaddr = ((hwaddr)1 << cpu->phys_bits) - 1;
>> +if (maxphysaddr < maxusedaddr) {
>> +error_report("Address space limit 0x%"PRIx64" < 0x%"PRIx64
>> + " phys-bits too low (%u)",
>> + maxphysaddr, maxusedaddr, cpu->phys_bits);
>> +exit(EXIT_FAILURE);
>> +}
>
> these hunks should be a separate patch preceding relocation patch
> as it basically does max_gpa vs physbits check regardless
> of relocation (i.e. relocation is only one of the reasons
> max_used_gpa might exceed physbits).
>
Yeap, makes sense given that this is now generic regardless of AMD 1Tb hole.
>> +
>> /*
>> * Split single memory region and use aliases to address portions of it,
>> * done for backwards compatibility with older qemus.
>