RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-28 Thread Liu, Yi L
> From: Peter Xu 
> Sent: Monday, November 29, 2021 11:14 AM
> 
> On Mon, Nov 29, 2021 at 10:28:42AM +0800, Jason Wang wrote:
> >
> > And in the future, it could be even more troublesome,e.g there's one
> > day we found another bit that needs not to be checked. Maybe we should
> > even remove all the rsvd bits checks?
> 
> When a real hardware sees any of the reserved bits set, it'll bail out and
> raise an error, right?

I think so. vtd spec has defined Non-zero reserved field error code against
all the translation structures (root/context/pasid dir/pasid table/page table)
for it. And it makes sense since any such error indicates a potential
misunderstanding on the spec.

> If that's the case, I'm wondering whether we should always follow the
> hardware behavior as an emulator.

I think so. and current virtual Intel IOMMU does a good job to detect the
SNP setting.:)

> Now I'm trying to remember normally how a spec could re-use a bit that was
> used to be reserved: should the hardware bumps the version of the version reg 
> in so
> that softwares will know what to expect?

defining a new capability bit is also a way for it. New software will probe the
capability bit and then program the bit which was reserved but now redefined.
Old software doesn’t have any idea on the new capability bit, so it will not
program the reserved bit.

> So I'm thinking whether the emulator code can identify the version bump by
> "scalable mode enabled", if so we know some resved bits are "ignored" now,
> and IIUC that's mostly the original proposal to add a quirk when scalable mode
> in vtd_init().

do you mean the spec version or?

> But again, I really think it should be the spec owner who should have
> considered all these..

yes, spec owner should consider it.

> e.g. explicitly document "this bit was used to reserved,
> but when scalable mode enabled it's ignored and programmable by the guest
> driver", or something like that.

there is a good example for your above sentence. It's the root table entry
and the scalable mode root table entry. In legacy mode, the high 64 bits of
root table entry are all reserved. In scalable mode, some of the high 64 bits
are used. I think we have defined scalable mode reserved bits macro in the
emulator code.

But regards to minor changes within a working mode, it may be more common to
define a capability bit when a reserved bit is re-used.

Regards,
Yi Liu


Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-28 Thread Jason Wang
On Mon, Nov 29, 2021 at 11:14 AM Peter Xu  wrote:
>
> On Mon, Nov 29, 2021 at 10:28:42AM +0800, Jason Wang wrote:
> > > I think we can still have Jason's patch continued because the kernel 
> > > commit to
> > > apply SNP bit is merged in v5.13, so we may need the qemu change to let it
> > > still work with v5.13-v5.15+ guest kernels.  We'll loose the resv bit 
> > > check a
> > > bit, but looks worthwhile.  Jason?
> >
> > Yes, I agree. The only thing that may worry me is the migration
> > compatibility. If we migrate from new to old we may break the guests,
> > we probably need compatibility props for that.
>
> Hmm.. How important is new->old migrations?  Is that normally for the
> old->new->old migration so that users can always fallback to the old hosts?

I meant e.g migrating from new qemu with machine=6.3 to old qemu with
machine=6.3.

So guest works on the src but not dst.

>
> If that's the case then IMHO we're fine here, since the new binary check less
> on resv bits than the old binary, then if the guest code can work with the old
> binary already then migrating back to old binary should work too.  Changing
> guest OS during migration of new->old can have a problem but hopefully rare.
>
> OTOH - do you know any of the real enterprise user that uses scalable mode 
> yet?
> To my own understanding it's still mostly "experimental", then hopefully we 
> can
> avoid worrying on that too much?

Probably, it has an "x" prefix.

>
> >
> > And in the future, it could be even more troublesome,e.g there's one
> > day we found another bit that needs not to be checked. Maybe we should
> > even remove all the rsvd bits checks?
>
> When a real hardware sees any of the reserved bits set, it'll bail out and
> raise an error, right?

To say the truth I don't know (though spec said that).

>
> If that's the case, I'm wondering whether we should always follow the hardware
> behavior as an emulator.
>
> Now I'm trying to remember normally how a spec could re-use a bit that was 
> used
> to be reserved: should the hardware bumps the version of the version reg in so
> that softwares will know what to expect?

I think so or at least some kind of negotiation.

>
> So I'm thinking whether the emulator code can identify the version bump by
> "scalable mode enabled", if so we know some resved bits are "ignored" now, and
> IIUC that's mostly the original proposal to add a quirk when scalable mode in
> vtd_init().
>
> But again, I really think it should be the spec owner who should have
> considered all these.. e.g. explicitly document "this bit was used to 
> reserved,
> but when scalable mode enabled it's ignored and programmable by the guest
> driver", or something like that.

I fully agree, we've learnt a lot when dealing with migration
compatibility in the past decade.

I will prepare a V2, it was basically what you suggested. And we can
leave the rest for future investigation. The motivation is prototyping
PASID support for virtio, it's sufficient for this patch to unblock
the work.

Thanks

>
> Thanks,
>
> --
> Peter Xu
>




Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-28 Thread Peter Xu
On Mon, Nov 29, 2021 at 10:28:42AM +0800, Jason Wang wrote:
> > I think we can still have Jason's patch continued because the kernel commit 
> > to
> > apply SNP bit is merged in v5.13, so we may need the qemu change to let it
> > still work with v5.13-v5.15+ guest kernels.  We'll loose the resv bit check 
> > a
> > bit, but looks worthwhile.  Jason?
> 
> Yes, I agree. The only thing that may worry me is the migration
> compatibility. If we migrate from new to old we may break the guests,
> we probably need compatibility props for that.

Hmm.. How important is new->old migrations?  Is that normally for the
old->new->old migration so that users can always fallback to the old hosts?

If that's the case then IMHO we're fine here, since the new binary check less
on resv bits than the old binary, then if the guest code can work with the old
binary already then migrating back to old binary should work too.  Changing
guest OS during migration of new->old can have a problem but hopefully rare.

OTOH - do you know any of the real enterprise user that uses scalable mode yet?
To my own understanding it's still mostly "experimental", then hopefully we can
avoid worrying on that too much?

> 
> And in the future, it could be even more troublesome,e.g there's one
> day we found another bit that needs not to be checked. Maybe we should
> even remove all the rsvd bits checks?

When a real hardware sees any of the reserved bits set, it'll bail out and
raise an error, right?

If that's the case, I'm wondering whether we should always follow the hardware
behavior as an emulator.

Now I'm trying to remember normally how a spec could re-use a bit that was used
to be reserved: should the hardware bumps the version of the version reg in so
that softwares will know what to expect?

So I'm thinking whether the emulator code can identify the version bump by
"scalable mode enabled", if so we know some resved bits are "ignored" now, and
IIUC that's mostly the original proposal to add a quirk when scalable mode in
vtd_init().

But again, I really think it should be the spec owner who should have
considered all these.. e.g. explicitly document "this bit was used to reserved,
but when scalable mode enabled it's ignored and programmable by the guest
driver", or something like that.

Thanks,

-- 
Peter Xu




Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-28 Thread Jason Wang
On Mon, Nov 29, 2021 at 9:19 AM Peter Xu  wrote:
>
> On Sun, Nov 28, 2021 at 07:06:18AM +, Liu, Yi L wrote:
> > > From: Peter Xu 
> > > Sent: Thursday, November 25, 2021 2:14 PM
> > >
> > > On Thu, Nov 25, 2021 at 05:49:38AM +, Liu, Yi L wrote:
> > > > > From: Peter Xu 
> > > > > Sent: Thursday, November 25, 2021 12:31 PM
> > > > >
> > > > > On Thu, Nov 25, 2021 at 04:03:34AM +, Liu, Yi L wrote:
> > > > > > > From: Peter Xu 
> > > > > > > Sent: Wednesday, November 24, 2021 3:57 PM
> > > > > > >
> > > > > > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > > > > > When booting with scalable mode, I hit this error:
> > > > > > > >
> > > > > > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve
> > > non-
> > > > > > > zero iova=0xf002, level=0x1slpte=0x102681803)
> > > > > > > > qemu-system-x86_64: vtd_iommu_translate: detected translation
> > > > > failure
> > > > > > > (dev=01:00:00, iova=0xf002)
> > > > > > > > qemu-system-x86_64: New fault is not recorded due to
> > > compression
> > > > > of
> > > > > > > faults
> > > > > > > >
> > > > > > > > This is because the SNP bit is set since Linux kernel commit
> > > > > > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when
> > > > > using
> > > > > > > > FL for IOVA") where SNP bit is set if scalable mode is on 
> > > > > > > > though this
> > > > > > > > seems to be an violation on the spec which said the SNP bit is
> > > > > > > > considered to be reserved if SC is not supported.
> > > > > > >
> > > > > > > When I was reading that commit, I was actually confused by this
> > > change:
> > > > > > >
> > > > > > > ---8<---
> > > > > > > diff --git a/drivers/iommu/intel/iommu.c
> > > > > b/drivers/iommu/intel/iommu.c
> > > > > > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > > > > > --- a/drivers/iommu/intel/iommu.c
> > > > > > > +++ b/drivers/iommu/intel/iommu.c
> > > > > > > @@ -658,7 +658,14 @@ static int
> > > > > domain_update_iommu_snooping(struct
> > > > > > > intel_iommu *skip)
> > > > > > > rcu_read_lock();
> > > > > > > for_each_active_iommu(iommu, drhd) {
> > > > > > > if (iommu != skip) {
> > > > > > > -   if (!ecap_sc_support(iommu->ecap)) {
> > > > > > > +   /*
> > > > > > > +* If the hardware is operating in the 
> > > > > > > scalable mode,
> > > > > > > +* the snooping control is always 
> > > > > > > supported since we
> > > > > > > +* always set PASID-table-entry.PGSNP bit 
> > > > > > > if the domain
> > > > > > > +* is managed outside (UNMANAGED).
> > > > > > > +*/
> > > > > > > +   if (!sm_supported(iommu) &&
> > > > > > > +   !ecap_sc_support(iommu->ecap)) {
> > > > > > > ret = 0;
> > > > > > > break;
> > > > > > > }
> > > > > > > ---8<---
> > > > > > >
> > > > > > > Does it mean that for some hardwares that has
> > > sm_supported()==true,
> > > > > it'll
> > > > > > > have  SC bit cleared in ecap register?  That sounds odd, and not 
> > > > > > > sure
> > > why.
> > > > > Maybe
> > > > > > > Yi Liu or Yi Sun may know?
> > > > > >
> > > > > > scalable mode has no dependency on SC, so it's possible.
> > > > >
> > > > > I see; thanks, Yi.
> > > > >
> > > > > However then OTOH I don't understand above comment
> > > > >
> > > > >   "If the hardware is operating in the scalable mode, the snooping 
> > > > > control
> > > is
> > > > >always supported since... "
> > > > >
> > > > > Because the current qemu vt-d emulation should fall into the case 
> > > > > that Yi
> > > > > mentioned - we support initial scalable mode but no SC yet..
> > > >
> > > > chapter 3.9 of 3.2 spec says below.
> > > >
> > > > “If the remapping hardware is setup in scalable-mode
> > > (RTADDR_REG.TTM=01b)
> > > > and the Page Snoop (PGSNP) field in PASID-table entry is Set, access to
> > > the
> > > > final page is snooped.”
> > > >
> > > > It means the PGSNP field is available under scalable mode. And spec also
> > > > says below in chapter 96. of 3.2 spec.
> > > >
> > > > "Requests snoop processor caches irrespective of, other attributes in 
> > > > the
> > > > request or other fields in paging structure entries used to translate 
> > > > the
> > > > request."
> > > >
> > > > It means the PGSNP field of PASID table entry is the first class control
> > > > of the snoop behaviour. Also it means the scalable mode has the snoop
> > > > control by default. ^_^. So the comment in the above commit is correct
> > > > since the policy of intel iommu driver is always setting the PGSNP bit.
> > >
> > > I see.  Setting PGSNP bit in the pasid entry looks fine to me.
> > >
> > > However IIUC what's triggering the crash (that Jason is fixing) is the 
> > > guest
> > > iommu driver "thinks" SC 

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-28 Thread Peter Xu
On Sun, Nov 28, 2021 at 07:06:18AM +, Liu, Yi L wrote:
> > From: Peter Xu 
> > Sent: Thursday, November 25, 2021 2:14 PM
> > 
> > On Thu, Nov 25, 2021 at 05:49:38AM +, Liu, Yi L wrote:
> > > > From: Peter Xu 
> > > > Sent: Thursday, November 25, 2021 12:31 PM
> > > >
> > > > On Thu, Nov 25, 2021 at 04:03:34AM +, Liu, Yi L wrote:
> > > > > > From: Peter Xu 
> > > > > > Sent: Wednesday, November 24, 2021 3:57 PM
> > > > > >
> > > > > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > > > > When booting with scalable mode, I hit this error:
> > > > > > >
> > > > > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve
> > non-
> > > > > > zero iova=0xf002, level=0x1slpte=0x102681803)
> > > > > > > qemu-system-x86_64: vtd_iommu_translate: detected translation
> > > > failure
> > > > > > (dev=01:00:00, iova=0xf002)
> > > > > > > qemu-system-x86_64: New fault is not recorded due to
> > compression
> > > > of
> > > > > > faults
> > > > > > >
> > > > > > > This is because the SNP bit is set since Linux kernel commit
> > > > > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when
> > > > using
> > > > > > > FL for IOVA") where SNP bit is set if scalable mode is on though 
> > > > > > > this
> > > > > > > seems to be an violation on the spec which said the SNP bit is
> > > > > > > considered to be reserved if SC is not supported.
> > > > > >
> > > > > > When I was reading that commit, I was actually confused by this
> > change:
> > > > > >
> > > > > > ---8<---
> > > > > > diff --git a/drivers/iommu/intel/iommu.c
> > > > b/drivers/iommu/intel/iommu.c
> > > > > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > > > > --- a/drivers/iommu/intel/iommu.c
> > > > > > +++ b/drivers/iommu/intel/iommu.c
> > > > > > @@ -658,7 +658,14 @@ static int
> > > > domain_update_iommu_snooping(struct
> > > > > > intel_iommu *skip)
> > > > > > rcu_read_lock();
> > > > > > for_each_active_iommu(iommu, drhd) {
> > > > > > if (iommu != skip) {
> > > > > > -   if (!ecap_sc_support(iommu->ecap)) {
> > > > > > +   /*
> > > > > > +* If the hardware is operating in the 
> > > > > > scalable mode,
> > > > > > +* the snooping control is always supported 
> > > > > > since we
> > > > > > +* always set PASID-table-entry.PGSNP bit 
> > > > > > if the domain
> > > > > > +* is managed outside (UNMANAGED).
> > > > > > +*/
> > > > > > +   if (!sm_supported(iommu) &&
> > > > > > +   !ecap_sc_support(iommu->ecap)) {
> > > > > > ret = 0;
> > > > > > break;
> > > > > > }
> > > > > > ---8<---
> > > > > >
> > > > > > Does it mean that for some hardwares that has
> > sm_supported()==true,
> > > > it'll
> > > > > > have  SC bit cleared in ecap register?  That sounds odd, and not 
> > > > > > sure
> > why.
> > > > Maybe
> > > > > > Yi Liu or Yi Sun may know?
> > > > >
> > > > > scalable mode has no dependency on SC, so it's possible.
> > > >
> > > > I see; thanks, Yi.
> > > >
> > > > However then OTOH I don't understand above comment
> > > >
> > > >   "If the hardware is operating in the scalable mode, the snooping 
> > > > control
> > is
> > > >always supported since... "
> > > >
> > > > Because the current qemu vt-d emulation should fall into the case that 
> > > > Yi
> > > > mentioned - we support initial scalable mode but no SC yet..
> > >
> > > chapter 3.9 of 3.2 spec says below.
> > >
> > > “If the remapping hardware is setup in scalable-mode
> > (RTADDR_REG.TTM=01b)
> > > and the Page Snoop (PGSNP) field in PASID-table entry is Set, access to
> > the
> > > final page is snooped.”
> > >
> > > It means the PGSNP field is available under scalable mode. And spec also
> > > says below in chapter 96. of 3.2 spec.
> > >
> > > "Requests snoop processor caches irrespective of, other attributes in the
> > > request or other fields in paging structure entries used to translate the
> > > request."
> > >
> > > It means the PGSNP field of PASID table entry is the first class control
> > > of the snoop behaviour. Also it means the scalable mode has the snoop
> > > control by default. ^_^. So the comment in the above commit is correct
> > > since the policy of intel iommu driver is always setting the PGSNP bit.
> > 
> > I see.  Setting PGSNP bit in the pasid entry looks fine to me.
> > 
> > However IIUC what's triggering the crash (that Jason is fixing) is the guest
> > iommu driver "thinks" SC is supported since scalable is enabled (even if
> > qemu vIOMMU has declared ECAP.SC==0 there), then it'll update
> > iommu_snooping bit, then it'll try to attach the SNP bit in the 2nd level
> > pgtable (intel_iommu_map):
> > 
> > if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
> 

RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-27 Thread Liu, Yi L
> From: Peter Xu 
> Sent: Thursday, November 25, 2021 2:14 PM
> 
> On Thu, Nov 25, 2021 at 05:49:38AM +, Liu, Yi L wrote:
> > > From: Peter Xu 
> > > Sent: Thursday, November 25, 2021 12:31 PM
> > >
> > > On Thu, Nov 25, 2021 at 04:03:34AM +, Liu, Yi L wrote:
> > > > > From: Peter Xu 
> > > > > Sent: Wednesday, November 24, 2021 3:57 PM
> > > > >
> > > > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > > > When booting with scalable mode, I hit this error:
> > > > > >
> > > > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve
> non-
> > > > > zero iova=0xf002, level=0x1slpte=0x102681803)
> > > > > > qemu-system-x86_64: vtd_iommu_translate: detected translation
> > > failure
> > > > > (dev=01:00:00, iova=0xf002)
> > > > > > qemu-system-x86_64: New fault is not recorded due to
> compression
> > > of
> > > > > faults
> > > > > >
> > > > > > This is because the SNP bit is set since Linux kernel commit
> > > > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when
> > > using
> > > > > > FL for IOVA") where SNP bit is set if scalable mode is on though 
> > > > > > this
> > > > > > seems to be an violation on the spec which said the SNP bit is
> > > > > > considered to be reserved if SC is not supported.
> > > > >
> > > > > When I was reading that commit, I was actually confused by this
> change:
> > > > >
> > > > > ---8<---
> > > > > diff --git a/drivers/iommu/intel/iommu.c
> > > b/drivers/iommu/intel/iommu.c
> > > > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > > > --- a/drivers/iommu/intel/iommu.c
> > > > > +++ b/drivers/iommu/intel/iommu.c
> > > > > @@ -658,7 +658,14 @@ static int
> > > domain_update_iommu_snooping(struct
> > > > > intel_iommu *skip)
> > > > > rcu_read_lock();
> > > > > for_each_active_iommu(iommu, drhd) {
> > > > > if (iommu != skip) {
> > > > > -   if (!ecap_sc_support(iommu->ecap)) {
> > > > > +   /*
> > > > > +* If the hardware is operating in the 
> > > > > scalable mode,
> > > > > +* the snooping control is always supported 
> > > > > since we
> > > > > +* always set PASID-table-entry.PGSNP bit if 
> > > > > the domain
> > > > > +* is managed outside (UNMANAGED).
> > > > > +*/
> > > > > +   if (!sm_supported(iommu) &&
> > > > > +   !ecap_sc_support(iommu->ecap)) {
> > > > > ret = 0;
> > > > > break;
> > > > > }
> > > > > ---8<---
> > > > >
> > > > > Does it mean that for some hardwares that has
> sm_supported()==true,
> > > it'll
> > > > > have  SC bit cleared in ecap register?  That sounds odd, and not sure
> why.
> > > Maybe
> > > > > Yi Liu or Yi Sun may know?
> > > >
> > > > scalable mode has no dependency on SC, so it's possible.
> > >
> > > I see; thanks, Yi.
> > >
> > > However then OTOH I don't understand above comment
> > >
> > >   "If the hardware is operating in the scalable mode, the snooping control
> is
> > >always supported since... "
> > >
> > > Because the current qemu vt-d emulation should fall into the case that Yi
> > > mentioned - we support initial scalable mode but no SC yet..
> >
> > chapter 3.9 of 3.2 spec says below.
> >
> > “If the remapping hardware is setup in scalable-mode
> (RTADDR_REG.TTM=01b)
> > and the Page Snoop (PGSNP) field in PASID-table entry is Set, access to
> the
> > final page is snooped.”
> >
> > It means the PGSNP field is available under scalable mode. And spec also
> > says below in chapter 96. of 3.2 spec.
> >
> > "Requests snoop processor caches irrespective of, other attributes in the
> > request or other fields in paging structure entries used to translate the
> > request."
> >
> > It means the PGSNP field of PASID table entry is the first class control
> > of the snoop behaviour. Also it means the scalable mode has the snoop
> > control by default. ^_^. So the comment in the above commit is correct
> > since the policy of intel iommu driver is always setting the PGSNP bit.
> 
> I see.  Setting PGSNP bit in the pasid entry looks fine to me.
> 
> However IIUC what's triggering the crash (that Jason is fixing) is the guest
> iommu driver "thinks" SC is supported since scalable is enabled (even if
> qemu vIOMMU has declared ECAP.SC==0 there), then it'll update
> iommu_snooping bit, then it'll try to attach the SNP bit in the 2nd level
> pgtable (intel_iommu_map):
> 
>   if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
>   prot |= DMA_PTE_SNP;
> 

Above is the fact today.

> So what I'm wondering is: whether the kernel should _not_ set SNP bit in
> the 2nd level pgtable, even if we set PGSNP in the pasid entry.. because
> as you mentioned, the hardware (here the emulated vIOMMU) is allowed to have
> both scalable==1 but 

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-24 Thread Peter Xu
On Thu, Nov 25, 2021 at 05:49:38AM +, Liu, Yi L wrote:
> > From: Peter Xu 
> > Sent: Thursday, November 25, 2021 12:31 PM
> > 
> > On Thu, Nov 25, 2021 at 04:03:34AM +, Liu, Yi L wrote:
> > > > From: Peter Xu 
> > > > Sent: Wednesday, November 24, 2021 3:57 PM
> > > >
> > > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > > When booting with scalable mode, I hit this error:
> > > > >
> > > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-
> > > > zero iova=0xf002, level=0x1slpte=0x102681803)
> > > > > qemu-system-x86_64: vtd_iommu_translate: detected translation
> > failure
> > > > (dev=01:00:00, iova=0xf002)
> > > > > qemu-system-x86_64: New fault is not recorded due to compression
> > of
> > > > faults
> > > > >
> > > > > This is because the SNP bit is set since Linux kernel commit
> > > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when
> > using
> > > > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > > > seems to be an violation on the spec which said the SNP bit is
> > > > > considered to be reserved if SC is not supported.
> > > >
> > > > When I was reading that commit, I was actually confused by this change:
> > > >
> > > > ---8<---
> > > > diff --git a/drivers/iommu/intel/iommu.c
> > b/drivers/iommu/intel/iommu.c
> > > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > > --- a/drivers/iommu/intel/iommu.c
> > > > +++ b/drivers/iommu/intel/iommu.c
> > > > @@ -658,7 +658,14 @@ static int
> > domain_update_iommu_snooping(struct
> > > > intel_iommu *skip)
> > > > rcu_read_lock();
> > > > for_each_active_iommu(iommu, drhd) {
> > > > if (iommu != skip) {
> > > > -   if (!ecap_sc_support(iommu->ecap)) {
> > > > +   /*
> > > > +* If the hardware is operating in the scalable 
> > > > mode,
> > > > +* the snooping control is always supported 
> > > > since we
> > > > +* always set PASID-table-entry.PGSNP bit if 
> > > > the domain
> > > > +* is managed outside (UNMANAGED).
> > > > +*/
> > > > +   if (!sm_supported(iommu) &&
> > > > +   !ecap_sc_support(iommu->ecap)) {
> > > > ret = 0;
> > > > break;
> > > > }
> > > > ---8<---
> > > >
> > > > Does it mean that for some hardwares that has sm_supported()==true,
> > it'll
> > > > have  SC bit cleared in ecap register?  That sounds odd, and not sure 
> > > > why.
> > Maybe
> > > > Yi Liu or Yi Sun may know?
> > >
> > > scalable mode has no dependency on SC, so it's possible.
> > 
> > I see; thanks, Yi.
> > 
> > However then OTOH I don't understand above comment
> > 
> >   "If the hardware is operating in the scalable mode, the snooping control 
> > is
> >always supported since... "
> > 
> > Because the current qemu vt-d emulation should fall into the case that Yi
> > mentioned - we support initial scalable mode but no SC yet..
> 
> chapter 3.9 of 3.2 spec says below.
> 
> “If the remapping hardware is setup in scalable-mode (RTADDR_REG.TTM=01b)
> and the Page Snoop (PGSNP) field in PASID-table entry is Set, access to the
> final page is snooped.”
> 
> It means the PGSNP field is available under scalable mode. And spec also
> says below in chapter 96. of 3.2 spec.
> 
> "Requests snoop processor caches irrespective of, other attributes in the
> request or other fields in paging structure entries used to translate the
> request."
> 
> It means the PGSNP field of PASID table entry is the first class control
> of the snoop behaviour. Also it means the scalable mode has the snoop
> control by default. ^_^. So the comment in the above commit is correct
> since the policy of intel iommu driver is always setting the PGSNP bit.

I see.  Setting PGSNP bit in the pasid entry looks fine to me.

However IIUC what's triggering the crash (that Jason is fixing) is the guest
iommu driver "thinks" SC is supported since scalable is enabled (even if qemu
vIOMMU has declared ECAP.SC==0 there), then it'll update iommu_snooping bit,
then it'll try to attach the SNP bit in the 2nd level pgtable (intel_iommu_map):

if ((iommu_prot & IOMMU_CACHE) && dmar_domain->iommu_snooping)
prot |= DMA_PTE_SNP;

So what I'm wondering is: whether the kernel should _not_ set SNP bit in the
2nd level pgtable, even if we set PGSNP in the pasid entry.. because as you
mentioned, the hardware (here the emulated vIOMMU) is allowed to have both
scalable==1 but sc==0 so it may recognize PGSNP in pasid entry but not the SNP
bit in pgtables.

If we'll skip pgtable SNP bit anyway for scalable mode, it looks weird to
explicitly set it too.

I think it's fine for Jason's solution to just skip checking SNP bit so we
ignore it in qemu, however just to double check we're on 

RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-24 Thread Liu, Yi L
> From: Jason Wang 
> Sent: Wednesday, November 24, 2021 4:29 PM
> 
> On Wed, Nov 24, 2021 at 3:57 PM Peter Xu  wrote:
> >
> > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > When booting with scalable mode, I hit this error:
> > >
> > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-
> zero iova=0xf002, level=0x1slpte=0x102681803)
> > > qemu-system-x86_64: vtd_iommu_translate: detected translation
> failure (dev=01:00:00, iova=0xf002)
> > > qemu-system-x86_64: New fault is not recorded due to compression of
> faults
> > >
> > > This is because the SNP bit is set since Linux kernel commit
> > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > seems to be an violation on the spec which said the SNP bit is
> > > considered to be reserved if SC is not supported.
> >
> > When I was reading that commit, I was actually confused by this change:
> >
> > ---8<---
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 956a02eb40b4..0ee5f1bd8af2 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -658,7 +658,14 @@ static int
> domain_update_iommu_snooping(struct intel_iommu *skip)
> > rcu_read_lock();
> > for_each_active_iommu(iommu, drhd) {
> > if (iommu != skip) {
> > -   if (!ecap_sc_support(iommu->ecap)) {
> > +   /*
> > +* If the hardware is operating in the scalable 
> > mode,
> > +* the snooping control is always supported since we
> > +* always set PASID-table-entry.PGSNP bit if the 
> > domain
> > +* is managed outside (UNMANAGED).
> > +*/
> > +   if (!sm_supported(iommu) &&
> > +   !ecap_sc_support(iommu->ecap)) {
> > ret = 0;
> > break;
> > }
> > ---8<---
> >
> > Does it mean that for some hardwares that has sm_supported()==true,
> it'll have
> > SC bit cleared in ecap register?
> 
> I guess not, so it's probably only the problem of vIOMMU.
> 
> > That sounds odd, and not sure why.  Maybe Yi
> > Liu or Yi Sun may know?
> 
> Another interesting point is that, it looks to me after that commit
> SNP is used for the domain that is not UNMANAGED even if PGSNP is not
> set.

Per spec, if the PGSNP is set, it means the final page access is snooped.
If it's not set, then it's up to other bit to decide it. For detail, you may
refer to table 6 of chapter 3.9 in vtd 3.2 spec.

Regards,
Yi Liu




RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-24 Thread Liu, Yi L
> From: Jason Wang 
> Sent: Wednesday, November 24, 2021 5:35 PM
> 
> On Wed, Nov 24, 2021 at 5:23 PM Peter Xu  wrote:
> >
> > On Wed, Nov 24, 2021 at 05:01:42PM +0800, Jason Wang wrote:
> > > > > > > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t
> level)
> > > > > > > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > > > > > > +   uint64_t slpte, uint32_t 
> > > > > > > level)
> > > > > > >  {
> > > > > > >  uint64_t rsvd_mask = vtd_spte_rsvd[level];
> > > > > > >
> > > > > > > @@ -979,6 +980,10 @@ static bool
> vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > > >  rsvd_mask = vtd_spte_rsvd_large[level];
> > > > > > >  }
> > > > > > >
> > > > > > > +if (s->scalable_mode) {
> > > > > > > +rsvd_mask &= ~VTD_SPTE_SNP;
> > > > > > > +}
> > > > > >
> > > > > > IMHO what we want to do is only to skip the leaves of pgtables on
> SNP, so maybe
> > > > > > we still want to keep checking the bit 11 reserved for e.g. common
> pgtable dir
> > > > > > entries?
> > >
> > > Maybe, but it's probably a question that can only be answered by
> > > Intel. I can change it for the next version if you stick.
> >
> > I'm reading vtd spec v3.1 (June 2019) here, and chap 9.8 told me they're
> > reserved bits for pgdir entries, as no SNP bit defined on pgdir entries.
> 
> Yes, you're right.

yeah. The SNP bit is only available in the leaf paging entry. e.g. for 4KB 
pages,
the SNP bit is in the PTE, but for 2MB pages, the SNP bit is in PDE, and etc.

Regards,
Yi Liu



RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-24 Thread Liu, Yi L
> From: Peter Xu 
> Sent: Thursday, November 25, 2021 12:31 PM
> 
> On Thu, Nov 25, 2021 at 04:03:34AM +, Liu, Yi L wrote:
> > > From: Peter Xu 
> > > Sent: Wednesday, November 24, 2021 3:57 PM
> > >
> > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > When booting with scalable mode, I hit this error:
> > > >
> > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-
> > > zero iova=0xf002, level=0x1slpte=0x102681803)
> > > > qemu-system-x86_64: vtd_iommu_translate: detected translation
> failure
> > > (dev=01:00:00, iova=0xf002)
> > > > qemu-system-x86_64: New fault is not recorded due to compression
> of
> > > faults
> > > >
> > > > This is because the SNP bit is set since Linux kernel commit
> > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when
> using
> > > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > > seems to be an violation on the spec which said the SNP bit is
> > > > considered to be reserved if SC is not supported.
> > >
> > > When I was reading that commit, I was actually confused by this change:
> > >
> > > ---8<---
> > > diff --git a/drivers/iommu/intel/iommu.c
> b/drivers/iommu/intel/iommu.c
> > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > --- a/drivers/iommu/intel/iommu.c
> > > +++ b/drivers/iommu/intel/iommu.c
> > > @@ -658,7 +658,14 @@ static int
> domain_update_iommu_snooping(struct
> > > intel_iommu *skip)
> > > rcu_read_lock();
> > > for_each_active_iommu(iommu, drhd) {
> > > if (iommu != skip) {
> > > -   if (!ecap_sc_support(iommu->ecap)) {
> > > +   /*
> > > +* If the hardware is operating in the scalable 
> > > mode,
> > > +* the snooping control is always supported since 
> > > we
> > > +* always set PASID-table-entry.PGSNP bit if the 
> > > domain
> > > +* is managed outside (UNMANAGED).
> > > +*/
> > > +   if (!sm_supported(iommu) &&
> > > +   !ecap_sc_support(iommu->ecap)) {
> > > ret = 0;
> > > break;
> > > }
> > > ---8<---
> > >
> > > Does it mean that for some hardwares that has sm_supported()==true,
> it'll
> > > have  SC bit cleared in ecap register?  That sounds odd, and not sure why.
> Maybe
> > > Yi Liu or Yi Sun may know?
> >
> > scalable mode has no dependency on SC, so it's possible.
> 
> I see; thanks, Yi.
> 
> However then OTOH I don't understand above comment
> 
>   "If the hardware is operating in the scalable mode, the snooping control is
>always supported since... "
> 
> Because the current qemu vt-d emulation should fall into the case that Yi
> mentioned - we support initial scalable mode but no SC yet..

chapter 3.9 of 3.2 spec says below.

“If the remapping hardware is setup in scalable-mode (RTADDR_REG.TTM=01b)
and the Page Snoop (PGSNP) field in PASID-table entry is Set, access to the
final page is snooped.”

It means the PGSNP field is available under scalable mode. And spec also
says below in chapter 96. of 3.2 spec.

"Requests snoop processor caches irrespective of, other attributes in the
request or other fields in paging structure entries used to translate the
request."

It means the PGSNP field of PASID table entry is the first class control
of the snoop behaviour. Also it means the scalable mode has the snoop
control by default. ^_^. So the comment in the above commit is correct
since the policy of intel iommu driver is always setting the PGSNP bit.
But spec is not so clear. Will reach out to make it more clearer in the
spec. thanks for catching it. :-)

Regards,
Yi Liu



Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-24 Thread Peter Xu
On Thu, Nov 25, 2021 at 04:03:34AM +, Liu, Yi L wrote:
> > From: Peter Xu 
> > Sent: Wednesday, November 24, 2021 3:57 PM
> > 
> > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > When booting with scalable mode, I hit this error:
> > >
> > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-
> > zero iova=0xf002, level=0x1slpte=0x102681803)
> > > qemu-system-x86_64: vtd_iommu_translate: detected translation failure
> > (dev=01:00:00, iova=0xf002)
> > > qemu-system-x86_64: New fault is not recorded due to compression of
> > faults
> > >
> > > This is because the SNP bit is set since Linux kernel commit
> > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > seems to be an violation on the spec which said the SNP bit is
> > > considered to be reserved if SC is not supported.
> > 
> > When I was reading that commit, I was actually confused by this change:
> > 
> > ---8<---
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 956a02eb40b4..0ee5f1bd8af2 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct
> > intel_iommu *skip)
> > rcu_read_lock();
> > for_each_active_iommu(iommu, drhd) {
> > if (iommu != skip) {
> > -   if (!ecap_sc_support(iommu->ecap)) {
> > +   /*
> > +* If the hardware is operating in the scalable 
> > mode,
> > +* the snooping control is always supported since we
> > +* always set PASID-table-entry.PGSNP bit if the 
> > domain
> > +* is managed outside (UNMANAGED).
> > +*/
> > +   if (!sm_supported(iommu) &&
> > +   !ecap_sc_support(iommu->ecap)) {
> > ret = 0;
> > break;
> > }
> > ---8<---
> > 
> > Does it mean that for some hardwares that has sm_supported()==true, it'll
> > have  SC bit cleared in ecap register?  That sounds odd, and not sure why.  
> > Maybe
> > Yi Liu or Yi Sun may know?
> 
> scalable mode has no dependency on SC, so it's possible.

I see; thanks, Yi.

However then OTOH I don't understand above comment 

  "If the hardware is operating in the scalable mode, the snooping control is
   always supported since... "

Because the current qemu vt-d emulation should fall into the case that Yi
mentioned - we support initial scalable mode but no SC yet..

Cc Baolu too.

-- 
Peter Xu




RE: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-24 Thread Liu, Yi L
> From: Peter Xu 
> Sent: Wednesday, November 24, 2021 3:57 PM
> 
> On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > When booting with scalable mode, I hit this error:
> >
> > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-
> zero iova=0xf002, level=0x1slpte=0x102681803)
> > qemu-system-x86_64: vtd_iommu_translate: detected translation failure
> (dev=01:00:00, iova=0xf002)
> > qemu-system-x86_64: New fault is not recorded due to compression of
> faults
> >
> > This is because the SNP bit is set since Linux kernel commit
> > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > seems to be an violation on the spec which said the SNP bit is
> > considered to be reserved if SC is not supported.
> 
> When I was reading that commit, I was actually confused by this change:
> 
> ---8<---
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 956a02eb40b4..0ee5f1bd8af2 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct
> intel_iommu *skip)
> rcu_read_lock();
> for_each_active_iommu(iommu, drhd) {
> if (iommu != skip) {
> -   if (!ecap_sc_support(iommu->ecap)) {
> +   /*
> +* If the hardware is operating in the scalable mode,
> +* the snooping control is always supported since we
> +* always set PASID-table-entry.PGSNP bit if the 
> domain
> +* is managed outside (UNMANAGED).
> +*/
> +   if (!sm_supported(iommu) &&
> +   !ecap_sc_support(iommu->ecap)) {
> ret = 0;
> break;
> }
> ---8<---
> 
> Does it mean that for some hardwares that has sm_supported()==true, it'll
> have  SC bit cleared in ecap register?  That sounds odd, and not sure why.  
> Maybe
> Yi Liu or Yi Sun may know?

scalable mode has no dependency on SC, so it's possible.

> >
> > To unbreak the guest, ignore the SNP bit for scalable mode first. In
> > the future we may consider to add SC support.
> 
> Oh yes, I remembered the last time we discussed this.  Could you remind
> me what's missing for us to support SC?
> 
> IIUC, for common device emulations we can just declare SC==1, right?  As all
> the DMAs (including kernel accels like vhost) will be from host processors so
> there're no coherent issues with guest vcpu threads.
> 
> If that's correct, the only challenge is device assignment in any form (I am
> not familiar with vdpa; so perhaps that includes vfio, vpda and any other
> kind of assigning host devices to guest?), then we'll try to detect 
> IOMMU_CACHE
> capability from the host iommu groups that covers the assigned devices,
> and we only set SC==1 if we have cache coherency on all the devices?

above looks good to me. SC bit means SNP field available in leaf paging
structure. So we need to check the host side's SC capability for the assigned
devices, then decide whether set SC or not. Then guest iommu driver can
set the SNP bit per its policy.

btw. there is a discussion on the IOMMU_CACHE, it's considered to be
an incorrect usage to let it linked with no_snoop. So there may be some
cleanup later. anyhow, just let you two be aware of it.

https://lore.kernel.org/kvm/20210922234954.gb964...@nvidia.com/

Regards,
Yi Liu

> 
> >
> > Signed-off-by: Jason Wang 
> > ---
> >  hw/i386/intel_iommu.c  | 18 --
> >  hw/i386/intel_iommu_internal.h |  2 ++
> >  2 files changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 294499ee20..3bcac56c3e 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -969,7 +969,8 @@ static dma_addr_t
> vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
> >  static uint64_t vtd_spte_rsvd[5];
> >  static uint64_t vtd_spte_rsvd_large[5];
> >
> > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > +   uint64_t slpte, uint32_t level)
> >  {
> >  uint64_t rsvd_mask = vtd_spte_rsvd[level];
> >
> > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte,
> uint32_t level)
> >  rsvd_mask = vtd_spte_rsvd_large[level];
> >  }
> >
> > +if (s->scalable_mode) {
> > +rsvd_mask &= ~VTD_SPTE_SNP;
> > +}
> 
> IMHO what we want to do is only to skip the leaves of pgtables on SNP, so
> maybe
> we still want to keep checking the bit 11 reserved for e.g. common pgtable
> dir
> entries?
> 
> To do so, how about directly modifying the vtd_spte_rsvd* fields in
> vtd_init()?
> I think we only need to modify 

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-24 Thread Jason Wang
On Wed, Nov 24, 2021 at 6:10 PM Peter Xu  wrote:
>
> On Wed, Nov 24, 2021 at 05:35:18PM +0800, Jason Wang wrote:
> > On Wed, Nov 24, 2021 at 5:23 PM Peter Xu  wrote:
> > >
> > > On Wed, Nov 24, 2021 at 05:01:42PM +0800, Jason Wang wrote:
> > > > > > > > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t 
> > > > > > > > level)
> > > > > > > > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > > > > > > > +   uint64_t slpte, uint32_t 
> > > > > > > > level)
> > > > > > > >  {
> > > > > > > >  uint64_t rsvd_mask = vtd_spte_rsvd[level];
> > > > > > > >
> > > > > > > > @@ -979,6 +980,10 @@ static bool 
> > > > > > > > vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > > > >  rsvd_mask = vtd_spte_rsvd_large[level];
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +if (s->scalable_mode) {
> > > > > > > > +rsvd_mask &= ~VTD_SPTE_SNP;
> > > > > > > > +}
> > > > > > >
> > > > > > > IMHO what we want to do is only to skip the leaves of pgtables on 
> > > > > > > SNP, so maybe
> > > > > > > we still want to keep checking the bit 11 reserved for e.g. 
> > > > > > > common pgtable dir
> > > > > > > entries?
> > > >
> > > > Maybe, but it's probably a question that can only be answered by
> > > > Intel. I can change it for the next version if you stick.
> > >
> > > I'm reading vtd spec v3.1 (June 2019) here, and chap 9.8 told me they're
> > > reserved bits for pgdir entries, as no SNP bit defined on pgdir entries.
> >
> > Yes, you're right.
> >
> > >
> > > >
> > > > > > >
> > > > > > > To do so, how about directly modifying the vtd_spte_rsvd* fields 
> > > > > > > in vtd_init()?
> > > > > > > I think we only need to modify 4k/2m/1g entries to mask bit 11, 
> > > > > > > they're:
> > > > > > >
> > > > > > >   - vtd_spte_rsvd[1] (4K)
> > > > > > >   - vtd_spte_rsvd_large[2] (2M)
> > > > > > >   - vtd_spte_rsvd_large[3] (1G)
> > > > > > >
> > > > > > > What do you think?  Then we avoid passing IntelIOMMUState* all 
> > > > > > > over too.
> > > >
> > > > I started a version like that:), it should work, I will change that if
> > > > it was agreed by everyone.
> > > >
> > > > The reason that I changed to pass IntelIOMMUState is that it results
> > > > in a smaller changeset. The reason is that I tend to introduce new
> > > > rsvd bits for SM mode since after checking vtd 3.3 it looks have
> > > > different reserved bit requirement than before (at least 1.2)
> > >
> > > Oh I thought changing vtd_spte_rsvd* should have smaller changeset 
> > > instead,
> > > hmm? :)
> > >
> > > IMHO it'll be:
> > >
> > >   if (s->scalable_mode) {
> > > vtd_spte_rsvd[1] &= ~BIT(11);
> > > vtd_spte_rsvd_large[2] &= ~BIT(11);
> > > vtd_spte_rsvd_large[3] &= ~BIT(11);
> > >   }
> > >
> > > Would that work?  Thanks,
> >
> > Works for sure, if we just want to fix the SNP bit.
> >
> > (I actually have a version like this as well). I can go this way
>
> Sounds good at least to me.  Thanks!
>
> >
> > The reason why I had another big changset is to align the reserved
> > bits to vtd 3.3. E.g it equires e.g bit 62 to be reserved 63 to be
> > ignored which seems not as strict as VTD_SL_IGN_COM. But it can be
> > addressed in the future.
>
> Ah I see.  We can do that later, or if the patch is ready anway IMHO we can
> have them fixed altogether too.
>
> It's weird that VT-d spec seems to be very prone to changes.. that's rare as a
> spec, and it even happened multiple times.

A side-effect is to bring troubles for the migration compatibility :(

>
> Another trivial thing: for that SNP bit code change, shall we also reference
> the Linux commit 6c00612d0cba ("iommu/vt-d: Report right snoop capability when
> using FL for IOVA", 2021-04-07) directly in the code as comment?  Just want to
> make sure we'll never forget why we added it as no one will be able to find a
> clue in the spec, meanwhile that explicit hint let us remember when we added 
> SC
> support we can drop it.

Adding BaoLu.

As discussed, according to my test with vIOMMU, there are side effect
of that commit which uses SNP even for the second level page table for
domains that is not IOMMU_DOMAIN_UNMANAGED. If I was wrong, we can
refer to that in the code.

Thanks

>
> --
> Peter Xu
>




Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-24 Thread Peter Xu
On Wed, Nov 24, 2021 at 05:35:18PM +0800, Jason Wang wrote:
> On Wed, Nov 24, 2021 at 5:23 PM Peter Xu  wrote:
> >
> > On Wed, Nov 24, 2021 at 05:01:42PM +0800, Jason Wang wrote:
> > > > > > > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t 
> > > > > > > level)
> > > > > > > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > > > > > > +   uint64_t slpte, uint32_t 
> > > > > > > level)
> > > > > > >  {
> > > > > > >  uint64_t rsvd_mask = vtd_spte_rsvd[level];
> > > > > > >
> > > > > > > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t 
> > > > > > > slpte, uint32_t level)
> > > > > > >  rsvd_mask = vtd_spte_rsvd_large[level];
> > > > > > >  }
> > > > > > >
> > > > > > > +if (s->scalable_mode) {
> > > > > > > +rsvd_mask &= ~VTD_SPTE_SNP;
> > > > > > > +}
> > > > > >
> > > > > > IMHO what we want to do is only to skip the leaves of pgtables on 
> > > > > > SNP, so maybe
> > > > > > we still want to keep checking the bit 11 reserved for e.g. common 
> > > > > > pgtable dir
> > > > > > entries?
> > >
> > > Maybe, but it's probably a question that can only be answered by
> > > Intel. I can change it for the next version if you stick.
> >
> > I'm reading vtd spec v3.1 (June 2019) here, and chap 9.8 told me they're
> > reserved bits for pgdir entries, as no SNP bit defined on pgdir entries.
> 
> Yes, you're right.
> 
> >
> > >
> > > > > >
> > > > > > To do so, how about directly modifying the vtd_spte_rsvd* fields in 
> > > > > > vtd_init()?
> > > > > > I think we only need to modify 4k/2m/1g entries to mask bit 11, 
> > > > > > they're:
> > > > > >
> > > > > >   - vtd_spte_rsvd[1] (4K)
> > > > > >   - vtd_spte_rsvd_large[2] (2M)
> > > > > >   - vtd_spte_rsvd_large[3] (1G)
> > > > > >
> > > > > > What do you think?  Then we avoid passing IntelIOMMUState* all over 
> > > > > > too.
> > >
> > > I started a version like that:), it should work, I will change that if
> > > it was agreed by everyone.
> > >
> > > The reason that I changed to pass IntelIOMMUState is that it results
> > > in a smaller changeset. The reason is that I tend to introduce new
> > > rsvd bits for SM mode since after checking vtd 3.3 it looks have
> > > different reserved bit requirement than before (at least 1.2)
> >
> > Oh I thought changing vtd_spte_rsvd* should have smaller changeset instead,
> > hmm? :)
> >
> > IMHO it'll be:
> >
> >   if (s->scalable_mode) {
> > vtd_spte_rsvd[1] &= ~BIT(11);
> > vtd_spte_rsvd_large[2] &= ~BIT(11);
> > vtd_spte_rsvd_large[3] &= ~BIT(11);
> >   }
> >
> > Would that work?  Thanks,
> 
> Works for sure, if we just want to fix the SNP bit.
> 
> (I actually have a version like this as well). I can go this way

Sounds good at least to me.  Thanks!

> 
> The reason why I had another big changset is to align the reserved
> bits to vtd 3.3. E.g it equires e.g bit 62 to be reserved 63 to be
> ignored which seems not as strict as VTD_SL_IGN_COM. But it can be
> addressed in the future.

Ah I see.  We can do that later, or if the patch is ready anway IMHO we can
have them fixed altogether too.

It's weird that VT-d spec seems to be very prone to changes.. that's rare as a
spec, and it even happened multiple times.

Another trivial thing: for that SNP bit code change, shall we also reference
the Linux commit 6c00612d0cba ("iommu/vt-d: Report right snoop capability when
using FL for IOVA", 2021-04-07) directly in the code as comment?  Just want to
make sure we'll never forget why we added it as no one will be able to find a
clue in the spec, meanwhile that explicit hint let us remember when we added SC
support we can drop it.

-- 
Peter Xu




Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-24 Thread Jason Wang
On Wed, Nov 24, 2021 at 5:23 PM Peter Xu  wrote:
>
> On Wed, Nov 24, 2021 at 05:01:42PM +0800, Jason Wang wrote:
> > > > > > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > > > > > +   uint64_t slpte, uint32_t level)
> > > > > >  {
> > > > > >  uint64_t rsvd_mask = vtd_spte_rsvd[level];
> > > > > >
> > > > > > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t 
> > > > > > slpte, uint32_t level)
> > > > > >  rsvd_mask = vtd_spte_rsvd_large[level];
> > > > > >  }
> > > > > >
> > > > > > +if (s->scalable_mode) {
> > > > > > +rsvd_mask &= ~VTD_SPTE_SNP;
> > > > > > +}
> > > > >
> > > > > IMHO what we want to do is only to skip the leaves of pgtables on 
> > > > > SNP, so maybe
> > > > > we still want to keep checking the bit 11 reserved for e.g. common 
> > > > > pgtable dir
> > > > > entries?
> >
> > Maybe, but it's probably a question that can only be answered by
> > Intel. I can change it for the next version if you stick.
>
> I'm reading vtd spec v3.1 (June 2019) here, and chap 9.8 told me they're
> reserved bits for pgdir entries, as no SNP bit defined on pgdir entries.

Yes, you're right.

>
> >
> > > > >
> > > > > To do so, how about directly modifying the vtd_spte_rsvd* fields in 
> > > > > vtd_init()?
> > > > > I think we only need to modify 4k/2m/1g entries to mask bit 11, 
> > > > > they're:
> > > > >
> > > > >   - vtd_spte_rsvd[1] (4K)
> > > > >   - vtd_spte_rsvd_large[2] (2M)
> > > > >   - vtd_spte_rsvd_large[3] (1G)
> > > > >
> > > > > What do you think?  Then we avoid passing IntelIOMMUState* all over 
> > > > > too.
> >
> > I started a version like that:), it should work, I will change that if
> > it was agreed by everyone.
> >
> > The reason that I changed to pass IntelIOMMUState is that it results
> > in a smaller changeset. The reason is that I tend to introduce new
> > rsvd bits for SM mode since after checking vtd 3.3 it looks have
> > different reserved bit requirement than before (at least 1.2)
>
> Oh I thought changing vtd_spte_rsvd* should have smaller changeset instead,
> hmm? :)
>
> IMHO it'll be:
>
>   if (s->scalable_mode) {
> vtd_spte_rsvd[1] &= ~BIT(11);
> vtd_spte_rsvd_large[2] &= ~BIT(11);
> vtd_spte_rsvd_large[3] &= ~BIT(11);
>   }
>
> Would that work?  Thanks,

Works for sure, if we just want to fix the SNP bit.

(I actually have a version like this as well). I can go this way

The reason why I had another big changset is to align the reserved
bits to vtd 3.3. E.g it equires e.g bit 62 to be reserved 63 to be
ignored which seems not as strict as VTD_SL_IGN_COM. But it can be
addressed in the future.

Thanks

>
> --
> Peter Xu
>




Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-24 Thread Peter Xu
On Wed, Nov 24, 2021 at 05:01:42PM +0800, Jason Wang wrote:
> > > > > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > > > > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > > > > +   uint64_t slpte, uint32_t level)
> > > > >  {
> > > > >  uint64_t rsvd_mask = vtd_spte_rsvd[level];
> > > > >
> > > > > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t 
> > > > > slpte, uint32_t level)
> > > > >  rsvd_mask = vtd_spte_rsvd_large[level];
> > > > >  }
> > > > >
> > > > > +if (s->scalable_mode) {
> > > > > +rsvd_mask &= ~VTD_SPTE_SNP;
> > > > > +}
> > > >
> > > > IMHO what we want to do is only to skip the leaves of pgtables on SNP, 
> > > > so maybe
> > > > we still want to keep checking the bit 11 reserved for e.g. common 
> > > > pgtable dir
> > > > entries?
> 
> Maybe, but it's probably a question that can only be answered by
> Intel. I can change it for the next version if you stick.

I'm reading vtd spec v3.1 (June 2019) here, and chap 9.8 told me they're
reserved bits for pgdir entries, as no SNP bit defined on pgdir entries.

> 
> > > >
> > > > To do so, how about directly modifying the vtd_spte_rsvd* fields in 
> > > > vtd_init()?
> > > > I think we only need to modify 4k/2m/1g entries to mask bit 11, they're:
> > > >
> > > >   - vtd_spte_rsvd[1] (4K)
> > > >   - vtd_spte_rsvd_large[2] (2M)
> > > >   - vtd_spte_rsvd_large[3] (1G)
> > > >
> > > > What do you think?  Then we avoid passing IntelIOMMUState* all over too.
> 
> I started a version like that:), it should work, I will change that if
> it was agreed by everyone.
> 
> The reason that I changed to pass IntelIOMMUState is that it results
> in a smaller changeset. The reason is that I tend to introduce new
> rsvd bits for SM mode since after checking vtd 3.3 it looks have
> different reserved bit requirement than before (at least 1.2)

Oh I thought changing vtd_spte_rsvd* should have smaller changeset instead,
hmm? :)

IMHO it'll be:

  if (s->scalable_mode) {
vtd_spte_rsvd[1] &= ~BIT(11);
vtd_spte_rsvd_large[2] &= ~BIT(11);
vtd_spte_rsvd_large[3] &= ~BIT(11);
  }

Would that work?  Thanks,

-- 
Peter Xu




Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-24 Thread Jason Wang
On Wed, Nov 24, 2021 at 4:52 PM Peter Xu  wrote:
>
> On Wed, Nov 24, 2021 at 04:28:52PM +0800, Jason Wang wrote:
> > On Wed, Nov 24, 2021 at 3:57 PM Peter Xu  wrote:
> > >
> > > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > > When booting with scalable mode, I hit this error:
> > > >
> > > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-zero 
> > > > iova=0xf002, level=0x1slpte=0x102681803)
> > > > qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
> > > > (dev=01:00:00, iova=0xf002)
> > > > qemu-system-x86_64: New fault is not recorded due to compression of 
> > > > faults
> > > >
> > > > This is because the SNP bit is set since Linux kernel commit
> > > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > > seems to be an violation on the spec which said the SNP bit is
> > > > considered to be reserved if SC is not supported.
> > >
> > > When I was reading that commit, I was actually confused by this change:
> > >
> > > ---8<---
> > > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > > index 956a02eb40b4..0ee5f1bd8af2 100644
> > > --- a/drivers/iommu/intel/iommu.c
> > > +++ b/drivers/iommu/intel/iommu.c
> > > @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct 
> > > intel_iommu *skip)
> > > rcu_read_lock();
> > > for_each_active_iommu(iommu, drhd) {
> > > if (iommu != skip) {
> > > -   if (!ecap_sc_support(iommu->ecap)) {
> > > +   /*
> > > +* If the hardware is operating in the scalable 
> > > mode,
> > > +* the snooping control is always supported since 
> > > we
> > > +* always set PASID-table-entry.PGSNP bit if the 
> > > domain
> > > +* is managed outside (UNMANAGED).
> > > +*/
> > > +   if (!sm_supported(iommu) &&
> > > +   !ecap_sc_support(iommu->ecap)) {
> > > ret = 0;
> > > break;
> > > }
> > > ---8<---
> > >
> > > Does it mean that for some hardwares that has sm_supported()==true, it'll 
> > > have
> > > SC bit cleared in ecap register?
> >
> > I guess not, so it's probably only the problem of vIOMMU.
>
> But then what does the code mean above?
>
> If SC is required for scalable mode,

I guess the code was tested on hardware with both SC and SM.

> ecap_sc_support()==false already implies
> sm_supported()==false too.  Then that check seems redundant.

To tell the truth, I'm not sure. And according to the spec SNP is
reserved if SC==false.

>
> >
> > > That sounds odd, and not sure why.  Maybe Yi
> > > Liu or Yi Sun may know?
> >
> > Another interesting point is that, it looks to me after that commit
> > SNP is used for the domain that is not UNMANAGED even if PGSNP is not
> > set.
> >
> >
> > >
> > > >
> > > > To unbreak the guest, ignore the SNP bit for scalable mode first. In
> > > > the future we may consider to add SC support.
> > >
> > > Oh yes, I remembered the last time we discussed this.  Could you remind me
> > > what's missing for us to support SC?
> >
> > Exactly what you described below.
> >
> > >
> > > IIUC, for common device emulations we can just declare SC==1, right?
> >
> > Strictly speaking, only safe for the datapath that is running in the
> > Qemu. For things like vhost-user, I'm not sure it can check CC when
> > using VFIO.
>
> Hmm yeah.. though I'll just call those vhost-user backends to fall into below
> "device assignment" category too.  Great to know we're on the same page here.

I see.

>
> >
> > >  As all
> > > the DMAs (including kernel accels like vhost) will be from host 
> > > processors so
> > > there're no coherent issues with guest vcpu threads.
> > >
> > > If that's correct, the only challenge is device assignment in any form (I 
> > > am
> > > not familiar with vdpa; so perhaps that includes vfio, vpda and any other 
> > > kind
> > > of assigning host devices to guest?), then we'll try to detect IOMMU_CACHE
> > > capability from the host iommu groups that covers the assigned devices, 
> > > and we
> > > only set SC==1 if we have cache coherency on all the devices?
> >
> > For VFIO yes, and we should prevent VFIO without CC to be plugged if
> > SC is advertised.
> >
> > For vDPA, we don't need to worry about it at all, kernel vDPA forces
> > IOMMU_CACHE now.
> >
> > vhost_vdpa_alloc_domain():
> >
> > if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> > return -ENOTSUPP;
> >
> > (For device with on-chip IOMMU, it's the parent and device that
> > guarantees the CC)
>
> Ah right, yes you mentioned it and I forgot..  Though I'm not sure we'd simply
> double-check again here (if we'll support vfio anyway, then 

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-24 Thread Peter Xu
On Wed, Nov 24, 2021 at 04:28:52PM +0800, Jason Wang wrote:
> On Wed, Nov 24, 2021 at 3:57 PM Peter Xu  wrote:
> >
> > On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > > When booting with scalable mode, I hit this error:
> > >
> > > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-zero 
> > > iova=0xf002, level=0x1slpte=0x102681803)
> > > qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
> > > (dev=01:00:00, iova=0xf002)
> > > qemu-system-x86_64: New fault is not recorded due to compression of faults
> > >
> > > This is because the SNP bit is set since Linux kernel commit
> > > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > > seems to be an violation on the spec which said the SNP bit is
> > > considered to be reserved if SC is not supported.
> >
> > When I was reading that commit, I was actually confused by this change:
> >
> > ---8<---
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 956a02eb40b4..0ee5f1bd8af2 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct 
> > intel_iommu *skip)
> > rcu_read_lock();
> > for_each_active_iommu(iommu, drhd) {
> > if (iommu != skip) {
> > -   if (!ecap_sc_support(iommu->ecap)) {
> > +   /*
> > +* If the hardware is operating in the scalable 
> > mode,
> > +* the snooping control is always supported since we
> > +* always set PASID-table-entry.PGSNP bit if the 
> > domain
> > +* is managed outside (UNMANAGED).
> > +*/
> > +   if (!sm_supported(iommu) &&
> > +   !ecap_sc_support(iommu->ecap)) {
> > ret = 0;
> > break;
> > }
> > ---8<---
> >
> > Does it mean that for some hardwares that has sm_supported()==true, it'll 
> > have
> > SC bit cleared in ecap register?
> 
> I guess not, so it's probably only the problem of vIOMMU.

But then what does the code mean above?

If SC is required for scalable mode, ecap_sc_support()==false already implies
sm_supported()==false too.  Then that check seems redundant.

> 
> > That sounds odd, and not sure why.  Maybe Yi
> > Liu or Yi Sun may know?
> 
> Another interesting point is that, it looks to me after that commit
> SNP is used for the domain that is not UNMANAGED even if PGSNP is not
> set.
> 
> 
> >
> > >
> > > To unbreak the guest, ignore the SNP bit for scalable mode first. In
> > > the future we may consider to add SC support.
> >
> > Oh yes, I remembered the last time we discussed this.  Could you remind me
> > what's missing for us to support SC?
> 
> Exactly what you described below.
> 
> >
> > IIUC, for common device emulations we can just declare SC==1, right?
> 
> Strictly speaking, only safe for the datapath that is running in the
> Qemu. For things like vhost-user, I'm not sure it can check CC when
> using VFIO.

Hmm yeah.. though I'll just call those vhost-user backends to fall into below
"device assignment" category too.  Great to know we're on the same page here.

> 
> >  As all
> > the DMAs (including kernel accels like vhost) will be from host processors 
> > so
> > there're no coherent issues with guest vcpu threads.
> >
> > If that's correct, the only challenge is device assignment in any form (I am
> > not familiar with vdpa; so perhaps that includes vfio, vpda and any other 
> > kind
> > of assigning host devices to guest?), then we'll try to detect IOMMU_CACHE
> > capability from the host iommu groups that covers the assigned devices, and 
> > we
> > only set SC==1 if we have cache coherency on all the devices?
> 
> For VFIO yes, and we should prevent VFIO without CC to be plugged if
> SC is advertised.
> 
> For vDPA, we don't need to worry about it at all, kernel vDPA forces
> IOMMU_CACHE now.
> 
> vhost_vdpa_alloc_domain():
> 
> if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> return -ENOTSUPP;
> 
> (For device with on-chip IOMMU, it's the parent and device that
> guarantees the CC)

Ah right, yes you mentioned it and I forgot..  Though I'm not sure we'd simply
double-check again here (if we'll support vfio anyway, then we'll need to be
able to read those out from IOMMU groups), because we shouldn't rely on that
fact which is an implementation detail of vdpa, imho (say, when vdpa starts to
support !SC someday).

PS: I have other comments below in previous reply - please have a look too! :-D

> 
> Thanks
> 
> 
> >
> > >
> > > Signed-off-by: Jason Wang 
> > > ---
> > >  hw/i386/intel_iommu.c  | 18 --
> > >  hw/i386/intel_iommu_internal.h |  2 

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-24 Thread Jason Wang
On Wed, Nov 24, 2021 at 3:57 PM Peter Xu  wrote:
>
> On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> > When booting with scalable mode, I hit this error:
> >
> > qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-zero 
> > iova=0xf002, level=0x1slpte=0x102681803)
> > qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
> > (dev=01:00:00, iova=0xf002)
> > qemu-system-x86_64: New fault is not recorded due to compression of faults
> >
> > This is because the SNP bit is set since Linux kernel commit
> > 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> > FL for IOVA") where SNP bit is set if scalable mode is on though this
> > seems to be an violation on the spec which said the SNP bit is
> > considered to be reserved if SC is not supported.
>
> When I was reading that commit, I was actually confused by this change:
>
> ---8<---
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 956a02eb40b4..0ee5f1bd8af2 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct 
> intel_iommu *skip)
> rcu_read_lock();
> for_each_active_iommu(iommu, drhd) {
> if (iommu != skip) {
> -   if (!ecap_sc_support(iommu->ecap)) {
> +   /*
> +* If the hardware is operating in the scalable mode,
> +* the snooping control is always supported since we
> +* always set PASID-table-entry.PGSNP bit if the 
> domain
> +* is managed outside (UNMANAGED).
> +*/
> +   if (!sm_supported(iommu) &&
> +   !ecap_sc_support(iommu->ecap)) {
> ret = 0;
> break;
> }
> ---8<---
>
> Does it mean that for some hardwares that has sm_supported()==true, it'll have
> SC bit cleared in ecap register?

I guess not, so it's probably only the problem of vIOMMU.

> That sounds odd, and not sure why.  Maybe Yi
> Liu or Yi Sun may know?

Another interesting point is that, it looks to me after that commit
SNP is used for the domain that is not UNMANAGED even if PGSNP is not
set.


>
> >
> > To unbreak the guest, ignore the SNP bit for scalable mode first. In
> > the future we may consider to add SC support.
>
> Oh yes, I remembered the last time we discussed this.  Could you remind me
> what's missing for us to support SC?

Exactly what you described below.

>
> IIUC, for common device emulations we can just declare SC==1, right?

Strictly speaking, only safe for the datapath that is running in the
Qemu. For things like vhost-user, I'm not sure it can check CC when
using VFIO.

>  As all
> the DMAs (including kernel accels like vhost) will be from host processors so
> there're no coherent issues with guest vcpu threads.
>
> If that's correct, the only challenge is device assignment in any form (I am
> not familiar with vdpa; so perhaps that includes vfio, vpda and any other kind
> of assigning host devices to guest?), then we'll try to detect IOMMU_CACHE
> capability from the host iommu groups that covers the assigned devices, and we
> only set SC==1 if we have cache coherency on all the devices?

For VFIO yes, and we should prevent VFIO without CC to be plugged if
SC is advertised.

For vDPA, we don't need to worry about it at all, kernel vDPA forces
IOMMU_CACHE now.

vhost_vdpa_alloc_domain():

if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
return -ENOTSUPP;

(For device with on-chip IOMMU, it's the parent and device that
guarantees the CC)

Thanks


>
> >
> > Signed-off-by: Jason Wang 
> > ---
> >  hw/i386/intel_iommu.c  | 18 --
> >  hw/i386/intel_iommu_internal.h |  2 ++
> >  2 files changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > index 294499ee20..3bcac56c3e 100644
> > --- a/hw/i386/intel_iommu.c
> > +++ b/hw/i386/intel_iommu.c
> > @@ -969,7 +969,8 @@ static dma_addr_t 
> > vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
> >  static uint64_t vtd_spte_rsvd[5];
> >  static uint64_t vtd_spte_rsvd_large[5];
> >
> > -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> > +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> > +   uint64_t slpte, uint32_t level)
> >  {
> >  uint64_t rsvd_mask = vtd_spte_rsvd[level];
> >
> > @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, 
> > uint32_t level)
> >  rsvd_mask = vtd_spte_rsvd_large[level];
> >  }
> >
> > +if (s->scalable_mode) {
> > +rsvd_mask &= ~VTD_SPTE_SNP;
> > +}
>
> IMHO what we want to do is only to skip the leaves of pgtables on SNP, so 
> maybe
> we still want to keep checking the bit 11 

Re: [PATCH] intel-iommu: ignore SNP bit in scalable mode

2021-11-24 Thread Peter Xu
On Wed, Nov 24, 2021 at 02:03:09PM +0800, Jason Wang wrote:
> When booting with scalable mode, I hit this error:
> 
> qemu-system-x86_64: vtd_iova_to_slpte: detected splte reserve non-zero 
> iova=0xf002, level=0x1slpte=0x102681803)
> qemu-system-x86_64: vtd_iommu_translate: detected translation failure 
> (dev=01:00:00, iova=0xf002)
> qemu-system-x86_64: New fault is not recorded due to compression of faults
> 
> This is because the SNP bit is set since Linux kernel commit
> 6c00612d0cba1 ("iommu/vt-d: Report right snoop capability when using
> FL for IOVA") where SNP bit is set if scalable mode is on though this
> seems to be an violation on the spec which said the SNP bit is
> considered to be reserved if SC is not supported.

When I was reading that commit, I was actually confused by this change:

---8<---
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 956a02eb40b4..0ee5f1bd8af2 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -658,7 +658,14 @@ static int domain_update_iommu_snooping(struct intel_iommu 
*skip)
rcu_read_lock();
for_each_active_iommu(iommu, drhd) {
if (iommu != skip) {
-   if (!ecap_sc_support(iommu->ecap)) {
+   /*
+* If the hardware is operating in the scalable mode,
+* the snooping control is always supported since we
+* always set PASID-table-entry.PGSNP bit if the domain
+* is managed outside (UNMANAGED).
+*/
+   if (!sm_supported(iommu) &&
+   !ecap_sc_support(iommu->ecap)) {
ret = 0;
break;
}
---8<---

Does it mean that for some hardwares that has sm_supported()==true, it'll have
SC bit cleared in ecap register?  That sounds odd, and not sure why.  Maybe Yi
Liu or Yi Sun may know?

> 
> To unbreak the guest, ignore the SNP bit for scalable mode first. In
> the future we may consider to add SC support.

Oh yes, I remembered the last time we discussed this.  Could you remind me
what's missing for us to support SC?

IIUC, for common device emulations we can just declare SC==1, right?  As all
the DMAs (including kernel accels like vhost) will be from host processors so
there're no coherent issues with guest vcpu threads.

If that's correct, the only challenge is device assignment in any form (I am
not familiar with vdpa; so perhaps that includes vfio, vpda and any other kind
of assigning host devices to guest?), then we'll try to detect IOMMU_CACHE
capability from the host iommu groups that covers the assigned devices, and we
only set SC==1 if we have cache coherency on all the devices?

> 
> Signed-off-by: Jason Wang 
> ---
>  hw/i386/intel_iommu.c  | 18 --
>  hw/i386/intel_iommu_internal.h |  2 ++
>  2 files changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 294499ee20..3bcac56c3e 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -969,7 +969,8 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState 
> *s,
>  static uint64_t vtd_spte_rsvd[5];
>  static uint64_t vtd_spte_rsvd_large[5];
>  
> -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> +static bool vtd_slpte_nonzero_rsvd(IntelIOMMUState *s,
> +   uint64_t slpte, uint32_t level)
>  {
>  uint64_t rsvd_mask = vtd_spte_rsvd[level];
>  
> @@ -979,6 +980,10 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, 
> uint32_t level)
>  rsvd_mask = vtd_spte_rsvd_large[level];
>  }
>  
> +if (s->scalable_mode) {
> +rsvd_mask &= ~VTD_SPTE_SNP;
> +}

IMHO what we want to do is only to skip the leaves of pgtables on SNP, so maybe
we still want to keep checking the bit 11 reserved for e.g. common pgtable dir
entries?

To do so, how about directly modifying the vtd_spte_rsvd* fields in vtd_init()?
I think we only need to modify 4k/2m/1g entries to mask bit 11, they're:

  - vtd_spte_rsvd[1] (4K)
  - vtd_spte_rsvd_large[2] (2M)
  - vtd_spte_rsvd_large[3] (1G)

What do you think?  Then we avoid passing IntelIOMMUState* all over too.

> +
>  return slpte & rsvd_mask;
>  }
>  
> @@ -1054,7 +1059,7 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, 
> VTDContextEntry *ce,
>iova, level, slpte, is_write);
>  return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
>  }
> -if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> +if (vtd_slpte_nonzero_rsvd(s, slpte, level)) {
>  error_report_once("%s: detected splte reserve non-zero "
>"iova=0x%" PRIx64 ", level=0x%" PRIx32
>"slpte=0x%" PRIx64 ")", __func__, iova,
> @@ -1185,7 +1190,8 @@ static int