On 17/04/2025 5:27 pm, Alejandro Jimenez wrote:
> Caution: External email. Do not open attachments or click links, unless
> this email comes from a known sender and you know the content is safe.
>
>
> On 4/17/25 8:40 AM, CLEMENT MATHIEU--DRIF wrote:
>>
>>
>> On 14/04/2025 4:02 am, Alejandro Jimenez wrote:
>>>
>>>
>>> The current amdvi_page_walk() is designed to be called by the replay()
>>> method. Rather than drastically altering it, introduce helpers to fetch
>>> guest PTEs that will be used by a page walker implementation.
>>>
>>> Signed-off-by: Alejandro Jimenez <[email protected]>
>>> ---
>>> hw/i386/amd_iommu.c | 125 ++++++++++++++++++++++++++++++++++++++++++++
>>> hw/i386/amd_iommu.h | 42 +++++++++++++++
>>> 2 files changed, 167 insertions(+)
>>>
>>> diff --git a/hw/i386/amd_iommu.c b/hw/i386/amd_iommu.c
>>> index 0af873b66a31..d089fdc28ef1 100644
>>> --- a/hw/i386/amd_iommu.c
>>> +++ b/hw/i386/amd_iommu.c
>>> @@ -1563,6 +1563,131 @@ static const MemoryRegionOps amdvi_ir_ops = {
>>> }
>>> };
>>>
>>> +/*
>>> + * For a PTE encoding a large page, return the page size it encodes as
>>> + * described by the AMD IOMMU Specification Table 14: Example Page Size
>>> + * Encodings.
>>> + * No need to adjust the value of the PTE to point to the first PTE in the
>>> + * large page since the encoding guarantees all "base" PTEs in the large
>>> + * page are the same.
>>> + */
>>> +static uint64_t large_pte_page_size(uint64_t pte)
>>> +{
>>> + assert(PTE_NEXT_LEVEL(pte) == 7);
>>> +
>>> + /* Determine size of the large/contiguous page encoded in the PTE */
>>> + return PTE_LARGE_PAGE_SIZE(pte);
>>> +}
>>> +
>>> +/*
>>> + * Helper function to fetch a PTE using AMD v1 pgtable format.
>>> + * Returns:
>>> + * -2: The Page Table Root could not be read from DTE, or IOVA is larger
>>> + *     than supported by the current page table level encoded in DTE[Mode].
>>> + * -1: PTE could not be read from guest memory during a page table walk.
>>> + *     This means that the DTE has valid data, and one of the lower level
>>> + *     entries in the Page Table could not be read.
>>> + * 0: PTE is marked not present, or entry is 0.
>>> + * >0: Leaf PTE value resolved from walking Guest IO Page Table.
>>> + */
>>
>> This seems to be a bit error-prone: since the return type is unsigned, a
>> statement like "if (pte < 0)" is always false and may be silently
>> removed by the compiler during optimization.
>
> Yes, caller(s) of fetch_pte() must cast to uint64_t in any comparison to
> check for error values. As is the case in many of the patches, I am
> following the examples from the VTD implementations where they do the
> same thing in vtd_iova_to_slpte() to validate the return of vtd_get_pte().
Yes, I know :)
Note that VT-d only has one potential error code (-1), which seems easier
to handle at the call site.
>
> When using fetch_pte() again in patch [17/18] I considered adding a
> helper to check if fetch_pte() returned a valid PTE, but seemed
> unnecessary given that there are only two errors to be checked.
>
> Another choice was to make fetch_pte() return an int so the error check
> could be done simply via (pte < 0), since bit 63 is either Reserved
> (DTE) or Ignored (PDE/PTE), but that seemed more prone to confusion
> since you'd expect a PTE to be a 64-bit value. Though I am aware
> the way error return/checking is implemented essentially relies on that
> behavior.
>
>> If you want to reuse such "high" values, defines could help.
>
> Sorry, I don't follow. Do you mean using defines as in still returning a
> uint64_t but giving -1 and -2 special definitions? That might make the
> code somewhat more readable when checking the error values, but still
> requires casting to uint64_t on the check, and doesn't solve the problem
> of a careless caller using (pte < 0) to check for errors...
Yes, I think that it would be more readable.
When using defines, the caller no longer needs to be aware of the fact
that the value has been cast from a negative number, which reduces the
risk of writing things like (pte < 0).
I prefer the out parameter solution but let's see what other reviews say.
Thanks for this patch set :)
>
>> Otherwise, pte could be an out parameter.
>
> In general, I think we have to accept the caveat that callers of
> fetch_pte() must have some implementation-specific knowledge to know
> they cannot check for errors using (pte < 0). Maybe with the aid of a
> strongly worded warning on the function header comment...
>
> But if that argument doesn't convince you, and none of the alternatives
> above seem better, then I am leaning towards using the out parameter
> approach.
>
> Thank you for the feedback.
> Alejandro
>
>>
>>> +static uint64_t __attribute__((unused))
>>> +fetch_pte(AMDVIAddressSpace *as, const hwaddr address, uint64_t dte,
>>> + hwaddr *page_size)
>>> +{
>>> + IOMMUAccessFlags perms = amdvi_get_perms(dte);
>>> +
>>> + uint8_t level, mode;
>>> + uint64_t pte = dte, pte_addr;
>>> +
>>> + *page_size = 0;
>>> +
>>> + if (perms == IOMMU_NONE) {
>>> + return (uint64_t)-2;
>>> + }
>>> +
>>> + /*
>>> + * The Linux kernel driver initializes the default mode to 3, corresponding
>>> + * to a 39-bit GPA space, where each entry in the pagetable translates to a
>>> + * 1GB (2^30) page size.
>>> + */
>>> + level = mode = get_pte_translation_mode(dte);
>>> + assert(mode > 0 && mode < 7);
>>> +
>>> + /*
>>> + * If IOVA is larger than the max supported by the current pgtable level,
>>> + * there is nothing to do. This signals that the pagetable level should be
>>> + * increased, or is an address meant to have special behavior like
>>> + * invalidating the entire cache.
>>> + */
>>> + if (address > PT_LEVEL_MAX_ADDR(mode - 1)) {
>>> + /* IOVA too large for the current DTE */
>>> + return (uint64_t)-2;
>>> + }
>>> +
>>> + do {
>>> + level -= 1;
>>> +
>>> + /* Update the page_size */
>>> + *page_size = PTE_LEVEL_PAGE_SIZE(level);
>>> +
>>> + /* Permission bits are ANDed at every level, including the DTE */
>>> + perms &= amdvi_get_perms(pte);
>>> + if (perms == IOMMU_NONE) {
>>> + return pte;
>>> + }
>>> +
>>> + /* Not Present */
>>> + if (!IOMMU_PTE_PRESENT(pte)) {
>>> + return 0;
>>> + }
>>> +
>>> + /* Large or Leaf PTE found */
>>> + if (PTE_NEXT_LEVEL(pte) == 7 || PTE_NEXT_LEVEL(pte) == 0) {
>>> + /* Leaf PTE found */
>>> + break;
>>> + }
>>> +
>>> + /*
>>> + * Index the pgtable using the IOVA bits corresponding to current level
>>> + * and walk down to the lower level.
>>> + */
>>> + pte_addr = NEXT_PTE_ADDR(pte, level, address);
>>> + pte = amdvi_get_pte_entry(as->iommu_state, pte_addr, as->devfn);
>>> +
>>> + if (pte == (uint64_t)-1) {
>>> + /*
>>> + * A returned PTE of -1 indicates a failure to read the page table
>>> + * entry from guest memory.
>>> + */
>>> + if (level == mode - 1) {
>>> + /* Failure to retrieve the Page Table from Root Pointer */
>>> + *page_size = 0;
>>> + return (uint64_t)-2;
>>> + } else {
>>> + /* Failure to read PTE. Page walk skips a page_size chunk */
>>> + return pte;
>>> + }
>>> + }
>>> + } while (level > 0);
>>> +
>>> + /*
>>> + * Page walk ends when Next Level field on PTE shows that either a leaf PTE
>>> + * or a series of large PTEs have been reached. In the latter case, return
>>> + * the pointer to the first PTE of the series.
>>> + */
>>> + assert(level == 0 || PTE_NEXT_LEVEL(pte) == 0 || PTE_NEXT_LEVEL(pte) == 7);
>>> +
>>> + /*
>>> + * In case the range starts in the middle of a contiguous page, need to
>>> + * return the first PTE
>>> + */
>>> + if (PTE_NEXT_LEVEL(pte) == 7) {
>>> + /* Update page_size with the large PTE page size */
>>> + *page_size = large_pte_page_size(pte);
>>> + }
>>> +
>>> + return pte;
>>> +}
>>> +
>>> /*
>>> * Toggle between address translation and passthrough modes by enabling the
>>> * corresponding memory regions.
>>> diff --git a/hw/i386/amd_iommu.h b/hw/i386/amd_iommu.h
>>> index c89e7dc9947d..fc4d2f7a4575 100644
>>> --- a/hw/i386/amd_iommu.h
>>> +++ b/hw/i386/amd_iommu.h
>>> @@ -25,6 +25,8 @@
>>> #include "hw/i386/x86-iommu.h"
>>> #include "qom/object.h"
>>>
>>> +#define GENMASK64(h, l) (((~0ULL) >> (63 - (h) + (l))) << (l))
>>> +
>>> /* Capability registers */
>>> #define AMDVI_CAPAB_BAR_LOW 0x04
>>> #define AMDVI_CAPAB_BAR_HIGH 0x08
>>> @@ -174,6 +176,46 @@
>>> #define AMDVI_GATS_MODE (2ULL << 12)
>>> #define AMDVI_HATS_MODE (2ULL << 10)
>>>
>>> +/* Page Table format */
>>> +
>>> +#define AMDVI_PTE_PR (1ULL << 0)
>>> +#define AMDVI_PTE_NEXT_LEVEL_MASK GENMASK64(11, 9)
>>> +
>>> +#define IOMMU_PTE_PRESENT(pte) ((pte) & AMDVI_PTE_PR)
>>> +
>>> +/* Using level=0 for leaf PTE at 4K page size */
>>> +#define PT_LEVEL_SHIFT(level) (12 + ((level) * 9))
>>> +
>>> +/* Return IOVA bit group used to index the Page Table at specific level */
>>> +#define PT_LEVEL_INDEX(level, iova) (((iova) >> PT_LEVEL_SHIFT(level)) & \
>>> + GENMASK64(8, 0))
>>> +
>>> +/* Return the max address for a specified level i.e. max_oaddr */
>>> +#define PT_LEVEL_MAX_ADDR(x) (((x) < 5) ? \
>>> + ((1ULL << PT_LEVEL_SHIFT((x + 1))) - 1) : \
>>> + (~(0ULL)))
>>> +
>>> +/* Extract the NextLevel field from PTE/PDE */
>>> +#define PTE_NEXT_LEVEL(pte) (((pte) & AMDVI_PTE_NEXT_LEVEL_MASK) >> 9)
>>> +
>>> +/* Take page table level and return default pagetable size for level */
>>> +#define PTE_LEVEL_PAGE_SIZE(level) (1ULL << (PT_LEVEL_SHIFT(level)))
>>> +
>>> +/*
>>> + * Return address of lower level page table encoded in PTE and specified
>>> + * by current level and corresponding IOVA bit group at such level.
>>> + */
>>> +#define NEXT_PTE_ADDR(pte, level, iova) (((pte) & AMDVI_DEV_PT_ROOT_MASK) + \
>>> + (PT_LEVEL_INDEX(level, iova) * 8))
>>> +
>>> +/*
>>> + * Take a PTE value with NextLevel=7 and return the page size it encodes.
>>> + */
>>> +#define PTE_LARGE_PAGE_SIZE(pte) (1ULL << (1 + cto64(((pte) | 0xfffULL))))
>>> +
>>> +/* Return number of PTEs to use for a given page size (expected power of 2) */
>>> +#define PAGE_SIZE_PTE_COUNT(pgsz) (1ULL << ((ctz64(pgsz) - 12) % 9))
>>> +
>>> /* IOTLB */
>>> #define AMDVI_IOTLB_MAX_SIZE 1024
>>> #define AMDVI_DEVID_SHIFT 36
>>> --
>>> 2.43.5
>>>
>>>
>