On 18/1/26 02:43, Jason Gunthorpe wrote:
> On Sat, Jan 17, 2026 at 03:54:52PM +1100, Alexey Kardashevskiy wrote:

>> I am trying this with TEE-IO on AMD SEV and hitting problems.

> My understanding is that if you want to use SEV today you also have to
> use the kernel command line parameter to force 4k IOMMU pages?

No, not only 4K. By default I do not enforce any page size, so it is
"everything but 512G". Only when the device is "accepted" do I unmap
everything in QEMU, "accept" the device, and then map everything again;
this time the IOMMU uses the (4K|2M) pagemask and takes the RMP entry
sizes into account.
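
(By the "(4K|2M) pagemask" I mean the bitmap of IOVA page sizes the IOMMU
domain is allowed to use; purely as an illustration, with made-up macro
names, it is something like this:)

#include <linux/bits.h>
#include <linux/sizes.h>

/* Illustration only, not the real plumbing: the allowed IOVA page sizes
 * before and after the device is "accepted"; both macro names are
 * invented for this sketch.
 */
#define PGSIZES_DEFAULT	 (GENMASK_ULL(63, 12) & ~(1ULL << 39))	/* everything but 512G */
#define PGSIZES_ACCEPTED (SZ_4K | SZ_2M)			/* RMP-compatible sizes only */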

> So, I think your questions are about trying to enhance this to get
> larger pages in the IOMMU when possible?

I did test my 6-month-old stuff with 2MB pages + runtime smashing and it
works fine, although it is ugly as it uses that page-size-recalculating
hack I mentioned before.
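
The hack boils down to something like this, recomputed for every chunk at
map time (a simplified sketch, not the actual code from that branch;
whether the backing RMP entry is 2M comes from an RMP lookup done
elsewhere, so the rmp_is_2m parameter here is an assumption):

#include <linux/align.h>
#include <linux/sizes.h>

/*
 * Simplified sketch: for each chunk being mapped, pick the largest IOPTE
 * size that the IOVA/PA alignment, the remaining length and the backing
 * RMP entry size all allow.
 */
static unsigned long pick_iopte_size(unsigned long iova, unsigned long paddr,
				     size_t remaining, bool rmp_is_2m)
{
	if (rmp_is_2m && IS_ALIGNED(iova | paddr, SZ_2M) && remaining >= SZ_2M)
		return SZ_2M;
	return SZ_4K;
}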

>> Now, from time to time the guest will share 4K pages which makes the
>> host OS smash NPT's 2MB PDEs to 4K PTEs, and 2M RMP entries to 4K
>> RMP entries, and since the IOMMU performs RMP checks - IOMMU PDEs
>> have to use the same granularity as NPT and RMP.

> IMHO this is a bad hardware choice; it is going to make some very
> troublesome software, so sigh.

afaik the Other OS is still not using 2MB pages (or uses them, but not
much?) and runs on the same HW :)

Sure, we can enforce some rules in Linux to make the SW simpler though.

>> So I end up in a situation when QEMU asks to map, for example, 2GB
>> of guest RAM and I want most of it to be 2MB mappings, with only a
>> handful of 2MB pages split into 4K pages. But it appears that the
>> above enforces the same page size for the entire range.

>> In the old IOMMU code, I handled it like this:
>>
>> https://github.com/AMDESE/linux-kvm/commit/0a40130987b7b65c367390d23821cc4ecaeb94bd#diff-f22bea128ddb136c3adc56bc09de9822a53ba1ca60c8be662a48c3143c511963L341
>>
>> tl;dr: I constantly re-calculate the page size while mapping.

> Doing it at mapping time doesn't seem right to me; AFAICT the RMP can
> change dynamically whenever the guest decides to change the
> private/shared status of memory?

The guest requests a page state conversion, which makes KVM change the RMP
entries and potentially smash huge pages; the guest only (in)validates the
RMP entry but does not change the ASID+GPA+other bits, the host does that.
But yeah, a race is possible here.
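
Roughly, the runtime path is (all names invented, just to show where the
IOMMU side has to follow the RMP smash, and where that race lives):

#include <linux/sizes.h>
#include <linux/types.h>

/*
 * Sketch only: on a page state conversion that has to smash a 2M RMP
 * entry, KVM updates the RMP and then has to bring the IOMMU granularity
 * back in sync; both helpers are made up for illustration.
 */
static int psc_smash_2m(u64 gpa)
{
	int ret;

	ret = rmp_smash_to_4k(gpa);	/* one 2M RMP entry -> 512 x 4K entries */
	if (ret)
		return ret;

	/* the matching 2M IOPDE must become 4K PTEs too, since the IOMMU
	 * checks the RMP at the same granularity; the window between these
	 * two steps is where the race lives
	 */
	return iommu_smash_iopde(gpa & ~(SZ_2M - 1));
}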


> My expectation for AMD was that the VMM would be monitoring the RMP
> granularity and use cut or "increase/decrease page size" through
> iommupt to adjust the S2 mapping so it works with these RMP
> limitations.
>
> Those don't fully exist yet, but they are in the plans.

I remember the talks about hitless smashing, but in the case of RMPs an
atomic xchg is not enough (we have a HW engine for that).

> It assumes that the VMM is continually aware of what all the RMP PTEs
> look like and when they are changing so it can make the required
> adjustments.
>
> The flow would be something like:
>   1) Create an IOAS
>   2) Create a HWPT. If there is some known upper bound on RMP/etc page
>      size then limit the HWPT page size to the upper bound
>   3) Map stuff into the IOAS
>   4) Build the RMP/etc and map ranges of page granularity
>   5) Call iommufd to adjust the page size within ranges

Say, I hotplug a device into a VM with a mix of 4K and 2M RMP entries.
QEMU will ask iommufd to map everything (and that would be 2M/1G
mappings); should QEMU then ask KVM to walk through the ranges and call
iommufd directly to make the IO PDEs/PTEs match the RMP entries?

I mean, I have to do the KVM->iommufd part anyway when 2M->4K smashing
happens at runtime, but the initial mapping could be simpler if iommufd
could check the RMP.
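
Something like this is what I mean by "check RMP" for the initial mapping
(a sketch only: snp_lookup_rmpentry() is the existing helper from the SNP
host code, everything else, including where exactly this would be called
from in iommufd, is made up; header paths are from memory):

#include <linux/sizes.h>
#include <asm/sev.h>			/* snp_lookup_rmpentry() */
#include <asm/pgtable_types.h>		/* PG_LEVEL_2M */

/*
 * Sketch: clamp the IOPTE size for a pfn to what the current RMP entry
 * allows, so the initial mapping picks 2M only where the RMP is 2M.
 */
static unsigned long rmp_max_iopte_size(unsigned long pfn)
{
	bool assigned;
	int level;

	if (snp_lookup_rmpentry(pfn, &assigned, &level))
		return SZ_4K;		/* no usable RMP info, be conservative */

	return level == PG_LEVEL_2M ? SZ_2M : SZ_4K;
}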


>   6) Guest changes encrypted state so RMP changes
>   7) VMM adjusts the ranges of page granularity and calls iommufd with
>      the updates
>   8) iommupt code increases/decreases page size as required.
>
> Does this seem reasonable?

It does.
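
Just to check that I read the 1)-8) flow right, the VMM side would be
roughly this (a sketch with imaginary wrappers: ioas_alloc(), hwpt_alloc(),
attach_device() and ioas_map() stand in for the real IOMMU_IOAS_ALLOC,
IOMMU_HWPT_ALLOC, VFIO_DEVICE_ATTACH_IOMMUFD_PT and IOMMU_IOAS_MAP ioctls,
while rmp_range_is_4k() and adjust_page_size() do not exist and only mark
where the new pieces would go):

#include <stddef.h>
#include <stdint.h>

#define SZ_4K	0x1000UL
#define SZ_2M	0x200000UL

/* imaginary wrappers, see the note above */
uint32_t ioas_alloc(int iommufd);
uint32_t hwpt_alloc(int iommufd, uint32_t dev_id, uint32_t ioas_id,
		    unsigned long max_pgsize);
void attach_device(int iommufd, uint32_t dev_id, uint32_t hwpt_id);
void ioas_map(int iommufd, uint32_t ioas_id, uint64_t iova, void *va,
	      size_t len);
int rmp_range_is_4k(uint64_t iova);
void adjust_page_size(int iommufd, uint32_t ioas_id, uint64_t iova,
		      size_t len, unsigned long new_pgsize);

/* Rough VMM-side sketch of steps 1)-5); error handling omitted. */
static int map_guest_ram(int iommufd, uint32_t dev_id,
			 uint64_t iova, void *va, size_t len)
{
	uint32_t ioas_id, hwpt_id;

	ioas_id = ioas_alloc(iommufd);			/* 1) create an IOAS */
	hwpt_id = hwpt_alloc(iommufd, dev_id, ioas_id,	/* 2) create a HWPT, page */
			     SZ_2M);			/*    sizes capped at 2M   */
	attach_device(iommufd, dev_id, hwpt_id);	/* attach the device to it */
	ioas_map(iommufd, ioas_id, iova, va, len);	/* 3) map everything */

	/* 4)-5) once the RMP is built, shrink every 2M chunk that is
	 * backed by 4K RMP entries down to 4K IOPTEs
	 */
	for (uint64_t off = 0; off < len; off += SZ_2M)
		if (rmp_range_is_4k(iova + off))
			adjust_page_size(iommufd, ioas_id, iova + off,
					 SZ_2M, SZ_4K);
	return 0;
}

/* 6)-8) would then be the same adjust_page_size() call (plus its
 * increase-size counterpart) driven by the page state conversions at
 * runtime.
 */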

For the time being I bypass the IOMMU code and make KVM call another FW+HW
DMA engine to smash the IOPDEs.


>> I know, ideally we would only share memory in 2MB chunks but we are
>> not there yet as I do not know the early boot stage on x86 enough to

> Even 2M is too small, I'd expect real scenarios to want to get up to
> 1GB??

Except for SWIOTLB, afaict there is really no good reason to ever share
more than half a MB of memory; 1GB is just way too much waste imho. The
biggest RMP entry size is 2M; the 1G RMP optimization is done quite
differently. Thanks,


ps. I am still curious about:

"btw just realized - does the code check that the folio_size matches the
IO page size? Or is batch_to_domain() expected to start a new batch if the
next page size is not the same as the previous one? With THP, we can have
a mix of page sizes."


> Jason

--
Alexey

