On Thu, Mar 30, 2017 at 07:57:38PM +0800, Jason Wang wrote: > > > On 2017年03月30日 19:06, Liu, Yi L wrote: > >>-----Original Message----- > >>From: Liu, Yi L > >>Sent: Monday, March 27, 2017 5:22 PM > >>To: Peter Xu <pet...@redhat.com> > >>Cc: alex.william...@redhat.com; Lan, Tianyu <tianyu....@intel.com>; Tian, > >>Kevin > >><kevin.t...@intel.com>; m...@redhat.com; jan.kis...@siemens.com; > >>jasow...@redhat.com; bd.a...@gmail.com; David Gibson > >><da...@gibson.dropbear.id.au>; qemu-devel@nongnu.org > >>Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add > >>MemoryRegionIOMMUOps.replay() callback > >> > >>>-----Original Message----- > >>>From: Peter Xu [mailto:pet...@redhat.com] > >>>Sent: Monday, March 27, 2017 5:12 PM > >>>To: Liu, Yi L <yi.l....@intel.com> > >>>Cc: alex.william...@redhat.com; Lan, Tianyu <tianyu....@intel.com>; > >>>Tian, Kevin <kevin.t...@intel.com>; m...@redhat.com; > >>>jan.kis...@siemens.com; jasow...@redhat.com; bd.a...@gmail.com; David > >>>Gibson <da...@gibson.dropbear.id.au>; qemu-devel@nongnu.org > >>>Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add > >>>MemoryRegionIOMMUOps.replay() callback > >>> > >>>On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote: > >>>>>-----Original Message----- > >>>>>From: Qemu-devel > >>>>>[mailto:qemu-devel-bounces+yi.l.liu=intel....@nongnu.org] On > >>>>>Behalf Of Peter Xu > >>>>>Sent: Tuesday, February 7, 2017 4:28 PM > >>>>>To: qemu-devel@nongnu.org > >>>>>Cc: Lan, Tianyu <tianyu....@intel.com>; Tian, Kevin > >>>>><kevin.t...@intel.com>; m...@redhat.com; jan.kis...@siemens.com; > >>>>>jasow...@redhat.com; pet...@redhat.com; > >>>>>alex.william...@redhat.com; bd.a...@gmail.com; David Gibson > >>>>><da...@gibson.dropbear.id.au> > >>>>>Subject: [Qemu-devel] [PATCH v7 14/17] memory: add > >>>>>MemoryRegionIOMMUOps.replay() callback > >>>>> > >>>>>Originally we have one memory_region_iommu_replay() function, > >>>>>which is the default behavior to replay the translations of the > >>>>>whole IOMMU region. However, on some platform like x86, we may > >>>>>want our own > >>>replay logic for IOMMU regions. > >>>>>This patch add one more hook for IOMMUOps for the callback, and > >>>>>it'll override the default if set. > >>>>> > >>>>>Signed-off-by: Peter Xu <pet...@redhat.com> > >>>>>--- > >>>>> include/exec/memory.h | 2 ++ > >>>>> memory.c | 6 ++++++ > >>>>> 2 files changed, 8 insertions(+) > >>>>> > >>>>>diff --git a/include/exec/memory.h b/include/exec/memory.h index > >>>>>0767888..30b2a74 100644 > >>>>>--- a/include/exec/memory.h > >>>>>+++ b/include/exec/memory.h > >>>>>@@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps { > >>>>> void (*notify_flag_changed)(MemoryRegion *iommu, > >>>>> IOMMUNotifierFlag old_flags, > >>>>> IOMMUNotifierFlag new_flags); > >>>>>+ /* Set this up to provide customized IOMMU replay function */ > >>>>>+ void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier); > >>>>> }; > >>>>> > >>>>> typedef struct CoalescedMemoryRange CoalescedMemoryRange; diff > >>>>>--git a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644 > >>>>>--- a/memory.c > >>>>>+++ b/memory.c > >>>>>@@ -1630,6 +1630,12 @@ void > >>>>>memory_region_iommu_replay(MemoryRegion > >>>>>*mr, IOMMUNotifier *n, > >>>>> hwaddr addr, granularity; > >>>>> IOMMUTLBEntry iotlb; > >>>>>+ /* If the IOMMU has its own replay callback, override */ > >>>>>+ if (mr->iommu_ops->replay) { > >>>>>+ mr->iommu_ops->replay(mr, n); > >>>>>+ return; > >>>>>+ } > >>>>Hi Alex, Peter, > >>>> > >>>>Will all the other vendors(e.g. PPC, s390, ARM) add their own replay > >>>>callback as well? I guess it depends on whether the original replay > >>>>algorithm work well for them? Do you have such knowledge? > >>>I guess so. At least for VT-d we had this callback since the default > >>>replay mechanism did not work well on x86 due to its extremely large > >>>memory region size. Thanks, > >>thx. that would make sense. > >Peter, > > > >Just come to mind that there may be a corner case here. > > > >Intel VT-d actually has a "pt" mode which allows device use physical address > >even when VT-d is enabled. In kernel, there is a iommu_identity_mapping. > >If a device is in this map, then it would use "pt" mode. So that IOMMU driver > >would not build second-level page table for it. > > Yes, but qemu does not support ECAP_PT now, so guest will still have a page > table in this case. > > > > >Back to the virtual IOVA implementation, if an assigned device is in the > >iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA. > >So it demands a GPA->HPA mapping in host. However, the iommu->ops.replay > >is not able to build it when guest SL page table is empty. > > > >So I think building an entire guest PA->HPA mapping before guest kernel boot > >would be recommended. Any thoughts? > > We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar > region and use another region without iommu_ops. Then > vfio_listener_region_add() will just do the correct mappings.
Even without any new region. With the patch 16/17 ("intel_iommu: allow dynamic switch of IOMMU region"), we can just turn the IOMMU region on/off, following the device's PT bit, maybe using the new vtd_switch_address_space() interface. That should be enough. Again, we just need to wait until current series merged. (Oh, then I found why I had an extra "on/off" parameter in previous versions in vtd_switch_address_space(), but it was removed.) Thanks, -- peterx