Hi Eric,

On Wed, Nov 05, 2025 at 08:47:56AM +0100, Eric Auger wrote:
> > We aligned with Intel previously about this system address space.
> > You might know these very well, yet here are the breakdowns:
> >
> > 1. VFIO core has a container that manages an HWPT. By default, it
> >    allocates a stage-1 normal HWPT, unless vIOMMU requests a
> You may precise this stage-1 normal HWPT is used to map GPA to HPA
> (so eventually implements stage 2).

Functional-wise, that would work. But it's not as clean as creating
an S2 parent hwpt from the beginning, right?

> >    nesting parent HWPT for accelerated cases.
> > 2. VFIO core adds a listener for that HWPT and sets up a handler
> >    vfio_container_region_add() where it checks whether the memory
> >    region is iommu or not.
> >    a. In case of !IOMMU as (i.e. system address space), it treats
> >       the address space as a RAM region, and handles all stage-2
> >       mappings for the core-allocated nesting parent HWPT.
> >    b. In case of IOMMU as (i.e. a translation type), it sets up
> >       the IOTLB notifier and translation replay while bypassing
> >       the listener for the RAM region.
> yes S1+S2 are combined through vfio_iommu_map_notify()

But that map/unmap notifier is useless in the accelerated mode: we
don't need that translation code from the emulated mode (MSI is
likely to bypass translation as well), and we don't need the emulated
IOTLB either, since there is no page table walkthrough.

Also, S1 and S2 are separated following the iommufd design. In this
regard, letting the core manage the S2 hwpt and mappings while the
vIOMMU handles the S1 hwpt allocation/attach/invalidation looks much
cleaner.
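To make the 2.a/2.b dispatch above concrete, here is roughly what I
mean (a simplified sketch against QEMU's memory API, not the actual
hw/vfio code; the two vfio_*() helpers below are placeholders):

    #include "exec/memory.h"  /* MemoryListener, MemoryRegionSection */

    static void vfio_container_region_add(MemoryListener *listener,
                                          MemoryRegionSection *section)
    {
        if (memory_region_is_iommu(section->mr)) {
            /* 2.b: IOMMU AS -- hook up the IOTLB notifier and the
             * translation replay; skip the RAM-region path entirely. */
            vfio_setup_iotlb_notifier(section);       /* placeholder */
        } else {
            /* 2.a: system AS -- treat the section as RAM and install
             * the GPA->HPA (stage-2) mappings into the HWPT that the
             * core allocated, i.e. the nesting parent in our case. */
            vfio_map_ram_section(section);            /* placeholder */
        }
    }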
> > In an accelerated case, we need stage-2 mappings to match with the
> > nesting parent HWPT. So, returning the system address space or an
> > alias of that notifies the vfio core to take the 2.a path.
> >
> > If we take the 2.b path by returning an IOMMU as in
> > smmu_find_add_as, the VFIO core would no longer listen to the RAM
> > region for us, i.e. no stage-2 HWPT nor mappings. vIOMMU would
> > have to allocate a nesting
> except if you change the VFIO common.c as I did in the past to force
> the S2 mapping in the nested config.
> See
> https://lore.kernel.org/all/[email protected]/
> and vfio_prereg_listener()

Yea, I remember that. But that's somewhat duplicated IMHO. The VFIO
core already registers a listener on guest RAM for the system address
space, so having another set of vfio_prereg_listener does not feel
optimal.

> Again I do not say this is the right way to do but using the system
> address space is not the "only" implementation choice I think

Oh, neither do I mean that's the "only" way. Sorry I did not make
that clear. I had studied your vfio_prereg_listener approach and
Intel's approach using the system address space, and concluded that
this "cleaner" way works for both architectures.
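On the vIOMMU side, what I am proposing boils down to something like
this in the PCIIOMMUOps get_address_space callback (sketch only;
smmu_dev_is_accel() and smmu_get_iommu_as() are made-up helpers, and
the real patch may look different):

    /* include/hw/pci/pci.h provides PCIIOMMUOps, and
     * exec/address-spaces.h the global address_space_memory. */
    static AddressSpace *smmu_find_add_as(PCIBus *bus, void *opaque,
                                          int devfn)
    {
        SMMUState *s = opaque;

        if (smmu_dev_is_accel(s, bus, devfn)) {      /* hypothetical */
            /* Accelerated mode: return the system address space (or
             * an alias of it) so the VFIO core takes the 2.a path
             * and owns the stage-2 mappings. */
            return &address_space_memory;
        }
        /* Emulated mode: keep returning the per-device IOMMU AS so
         * the core takes the 2.b path (IOTLB notifier + replay). */
        return smmu_get_iommu_as(s, bus, devfn);     /* made-up */
    }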
> and it needs to be
> properly justified, especially as it has at least 2 side effects:
> - somehow abusing the semantic of the returned address space and
> pretending there is no IOMMU translation in place and

Perhaps we shall say "there is no emulated translation" :)

> - also impacting the way MSIs are handled (introduction of a new
> PCIIOMMUOps).

That is a solid point. Yet I think it's less confusing now per
Jason's remarks -- we will bypass the translation pathway for MSI in
the accelerated mode.

> This kind of explanation you wrote is absolutely needed in the
> commit msg for reviewers to understand the design choice I think.

Sure. My bad that I didn't explain it well in the first place.

Thanks
Nicolin