Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-08-04 Thread Jason Wang



On 2020/8/5 上午4:30, Peter Xu wrote:

On Mon, Aug 03, 2020 at 06:00:34PM +0200, Eugenio Pérez wrote:

On Fri, 2020-07-03 at 15:24 +0800, Jason Wang wrote:

On 2020/7/2 下午11:45, Peter Xu wrote:

On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:

So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?

That should work but I wonder something as following is better.

Instead of introducing new flags, how about carry the type of event in
the notifier then the device (vhost) can choose the message it want to
process like:

static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)

{

switch (event->type) {

case IOMMU_MAP:
case IOMMU_UNMAP:
case IOMMU_DEV_IOTLB_UNMAP:
...

}

Thanks



Hi!

Sorry, I thought I had this clear but now it seems not so clear to me. Do you 
mean to add that switch to the current
vhost_iommu_unmap_notify, and then the "type" field to the IOMMUTLBEntry? Is 
that the scope of the changes, or there is
something I'm missing?

If that is correct, what is the advantage for vhost or other notifiers? I 
understand that move the IOMMUTLBEntry (addr,
len) -> (iova, mask) split/transformation to the different notifiers 
implementation could pollute them, but this is even a deeper change and vhost is 
not insterested in other events but IOMMU_UNMAP, isn't?

On the other hand, who decide what type of event is? If I follow the backtrace 
of the assert in
https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01015.html, it seems 
to me that it should be
vtd_process_device_iotlb_desc. How do I know if it should be IOMMU_UNMAP or 
IOMMU_DEV_IOTLB_UNMAP? If I set it in some
function of memory.c, I should decide the type looking the actual notifier, 
isn't?

(Since Jason didn't reply yesterday, I'll try to; Jason, feel free to correct
  me...)

IMHO whether to put the type into the IOMMUTLBEntry is not important.  The
important change should be that we introduce IOMMU_DEV_IOTLB_UNMAP (or I'd
rather call it IOMMU_DEV_IOTLB directly which is shorter and cleaner).  With
that information we can make the failing assertion conditional for MAP/UNMAP
only.



Or having another dedicated device IOTLB notifier.



   We can also allow dev-iotlb messages to take arbitrary addr_mask (so it
becomes a length of address range; imho we can keep using addr_mask for
simplicity, but we can comment for addr_mask that for dev-iotlb it can be not a
real mask).



Yes.

Thanks




Thanks,





Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-08-04 Thread Peter Xu
On Mon, Aug 03, 2020 at 06:00:34PM +0200, Eugenio Pérez wrote:
> On Fri, 2020-07-03 at 15:24 +0800, Jason Wang wrote:
> > On 2020/7/2 下午11:45, Peter Xu wrote:
> > > On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> > > > So I think we agree that a new notifier is needed?
> > > Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
> > 
> > That should work but I wonder something as following is better.
> > 
> > Instead of introducing new flags, how about carry the type of event in 
> > the notifier then the device (vhost) can choose the message it want to 
> > process like:
> > 
> > static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
> > 
> > {
> > 
> > switch (event->type) {
> > 
> > case IOMMU_MAP:
> > case IOMMU_UNMAP:
> > case IOMMU_DEV_IOTLB_UNMAP:
> > ...
> > 
> > }
> > 
> > Thanks
> > 
> > 
> 
> Hi!
> 
> Sorry, I thought I had this clear but now it seems not so clear to me. Do you 
> mean to add that switch to the current
> vhost_iommu_unmap_notify, and then the "type" field to the IOMMUTLBEntry? Is 
> that the scope of the changes, or there is
> something I'm missing?
> 
> If that is correct, what is the advantage for vhost or other notifiers? I 
> understand that move the IOMMUTLBEntry (addr,
> len) -> (iova, mask) split/transformation to the different notifiers 
> implementation could pollute them, but this is even a deeper change and vhost 
> is not insterested in other events but IOMMU_UNMAP, isn't?
> 
> On the other hand, who decide what type of event is? If I follow the 
> backtrace of the assert in 
> https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01015.html, it seems 
> to me that it should be
> vtd_process_device_iotlb_desc. How do I know if it should be IOMMU_UNMAP or 
> IOMMU_DEV_IOTLB_UNMAP? If I set it in some
> function of memory.c, I should decide the type looking the actual notifier, 
> isn't?

(Since Jason didn't reply yesterday, I'll try to; Jason, feel free to correct
 me...)

IMHO whether to put the type into the IOMMUTLBEntry is not important.  The
important change should be that we introduce IOMMU_DEV_IOTLB_UNMAP (or I'd
rather call it IOMMU_DEV_IOTLB directly which is shorter and cleaner).  With
that information we can make the failing assertion conditional for MAP/UNMAP
only.  We can also allow dev-iotlb messages to take arbitrary addr_mask (so it
becomes a length of address range; imho we can keep using addr_mask for
simplicity, but we can comment for addr_mask that for dev-iotlb it can be not a
real mask).

Thanks,

-- 
Peter Xu



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-08-03 Thread Eugenio Pérez
On Fri, 2020-07-03 at 15:24 +0800, Jason Wang wrote:
> On 2020/7/2 下午11:45, Peter Xu wrote:
> > On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> > > So I think we agree that a new notifier is needed?
> > Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
> 
> That should work but I wonder something as following is better.
> 
> Instead of introducing new flags, how about carry the type of event in 
> the notifier then the device (vhost) can choose the message it want to 
> process like:
> 
> static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
> 
> {
> 
> switch (event->type) {
> 
> case IOMMU_MAP:
> case IOMMU_UNMAP:
> case IOMMU_DEV_IOTLB_UNMAP:
> ...
> 
> }
> 
> Thanks
> 
> 

Hi!

Sorry, I thought I had this clear but now it seems not so clear to me. Do you 
mean to add that switch to the current
vhost_iommu_unmap_notify, and then the "type" field to the IOMMUTLBEntry? Is 
that the scope of the changes, or there is
something I'm missing?

If that is correct, what is the advantage for vhost or other notifiers? I 
understand that move the IOMMUTLBEntry (addr,
len) -> (iova, mask) split/transformation to the different notifiers 
implementation could pollute them, but this is even a deeper change and vhost 
is not insterested in other events but IOMMU_UNMAP, isn't?

On the other hand, who decide what type of event is? If I follow the backtrace 
of the assert in 
https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01015.html, it seems 
to me that it should be
vtd_process_device_iotlb_desc. How do I know if it should be IOMMU_UNMAP or 
IOMMU_DEV_IOTLB_UNMAP? If I set it in some
function of memory.c, I should decide the type looking the actual notifier, 
isn't?

Thanks!



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-21 Thread Peter Xu
On Tue, Jul 21, 2020 at 02:20:01PM +0800, Jason Wang wrote:
> 
> On 2020/7/20 下午9:03, Peter Xu wrote:
> > On Mon, Jul 20, 2020 at 12:02:06PM +0800, Jason Wang wrote:
> > > Right, so there's no need to deal with unmap in vtd's replay 
> > > implementation
> > > (as what generic one did).
> > We don't even for now; see vtd_page_walk_info.notify_unmap.  Thanks,
> 
> 
> Right, but I meant the vtd_address_space_unmap() in vtd_iomm_replay(). It
> looks to me it will trigger UNMAP notifiers.

Should be pretty safe for now, but I agree it seems optional (as we've
discussed about this previously).  Thanks,

-- 
Peter Xu



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-21 Thread Jason Wang



On 2020/7/20 下午9:03, Peter Xu wrote:

On Mon, Jul 20, 2020 at 12:02:06PM +0800, Jason Wang wrote:

Right, so there's no need to deal with unmap in vtd's replay implementation
(as what generic one did).

We don't even for now; see vtd_page_walk_info.notify_unmap.  Thanks,



Right, but I meant the vtd_address_space_unmap() in vtd_iomm_replay(). 
It looks to me it will trigger UNMAP notifiers.


Thanks




Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-20 Thread Peter Xu
On Mon, Jul 20, 2020 at 12:02:06PM +0800, Jason Wang wrote:
> Right, so there's no need to deal with unmap in vtd's replay implementation
> (as what generic one did).

We don't even for now; see vtd_page_walk_info.notify_unmap.  Thanks,

-- 
Peter Xu



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-19 Thread Jason Wang



On 2020/7/17 下午10:18, Peter Xu wrote:

On Thu, Jul 16, 2020 at 10:54:31AM +0800, Jason Wang wrote:

On 2020/7/16 上午9:00, Peter Xu wrote:

On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote:

On 2020/7/10 下午9:30, Peter Xu wrote:

On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:

On 2020/7/9 下午10:10, Peter Xu wrote:

On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:

- If we care the performance, it's better to implement the MAP event for
vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

   - The first vmexit caused by an invalidation to MAP the page tables, so 
vhost
 will setup the page table before IO starts

   - IO/DMA triggers and completes

   - The second vmexit caused by another invalidation to UNMAP the page 
tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.

Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
dedicated command for flushing device IOTLB. But the check for
vtd_as_has_map_notifier is used to skip the device which can do demand
paging via ATS or device specific way. If we have two different notifiers,
vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have iommu notifier that only registers to UNMAP even after we
introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
TCG should be the only one so far, but I don't know.. maybe there can still be
new ones?

I think you're right. But looking at the codes, it looks like the check of
vtd_as_has_map_notifier() was only used in:

1) vtd_iommu_replay()
2) vtd_iotlb_page_invalidate_notify() (PSI)

For the replay, it's expensive anyhow. For PSI, I think it's just about one
or few mappings, not sure it will have obvious performance impact.

And I had two questions:

1) The codes doesn't check map for DSI or GI, does this match what spec
said? (It looks to me the spec is unclear in this part)

Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
vtd_iotlb_domain_invalidate().

I meant the code doesn't check whether there's an MAP notifier :)

It's actually checked, because it loops over vtd_as_with_notifiers, and only
MAP notifiers register to that. :)


I may miss something but I don't find the code to block UNMAP notifiers?

vhost_iommu_region_add()
     memory_region_register_iommu_notifier()
         memory_region_update_iommu_notify_flags()
             imrc->notify_flag_changed()
                 vtd_iommu_notify_flag_changed()

?

Yeah I think you're right.  I might have confused with some previous
implementations.  Maybe we should also do similar thing for DSI/GI just like
what we do in PSI.



Ok.





2) for the replay() I don't see other implementations (either spapr or
generic one) that did unmap (actually they skip unmap explicitly), any
reason for doing this in intel IOMMU?

I could be wrong, but I'd guess it's because vt-d implemented the caching mode
by leveraging the same invalidation strucuture, so it's harder to make all
things right (IOW, we can't clearly identify MAP with UNMAP when we receive an
invalidation request, because MAP/UNMAP requests look the same).

I didn't check others, but I believe spapr is doing it differently by using
some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
from the guest, logically the replay indeed does not need to do any unmap
because we don't need to call replay() on an already existing device but only
for e.g. hot plug.

But this looks conflict with what memory_region_iommu_replay( ) did, for
IOMMU that doesn't have a replay method, it skips UNMAP request:

      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
      iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
      if (iotlb.perm != IOMMU_NONE) {
      n->notify(n, );
      }

I guess there's no knowledge of whether guest have an explicit MAP/UMAP for
this generic code. Or replay implies that guest doesn't have explicit
MAP/UNMAP?

I think it matches exactly with a 

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-17 Thread Peter Xu
On Thu, Jul 16, 2020 at 10:54:31AM +0800, Jason Wang wrote:
> 
> On 2020/7/16 上午9:00, Peter Xu wrote:
> > On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote:
> > > On 2020/7/10 下午9:30, Peter Xu wrote:
> > > > On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:
> > > > > On 2020/7/9 下午10:10, Peter Xu wrote:
> > > > > > On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
> > > > > > > > > - If we care the performance, it's better to implement the 
> > > > > > > > > MAP event for
> > > > > > > > > vhost, otherwise it could be a lot of IOTLB miss
> > > > > > > > I feel like these are two things.
> > > > > > > > 
> > > > > > > > So far what we are talking about is whether vt-d should have 
> > > > > > > > knowledge about
> > > > > > > > what kind of events one iommu notifier is interested in.  I 
> > > > > > > > still think we
> > > > > > > > should keep this as answered in question 1.
> > > > > > > > 
> > > > > > > > The other question is whether we want to switch vhost from 
> > > > > > > > UNMAP to MAP/UNMAP
> > > > > > > > events even without vDMA, so that vhost can establish the 
> > > > > > > > mapping even before
> > > > > > > > IO starts.  IMHO it's doable, but only if the guest runs DPDK 
> > > > > > > > workloads.  When
> > > > > > > > the guest is using dynamic iommu page mappings, I feel like 
> > > > > > > > that can be even
> > > > > > > > slower, because then the worst case is for each IO we'll need 
> > > > > > > > to vmexit twice:
> > > > > > > > 
> > > > > > > >   - The first vmexit caused by an invalidation to MAP the 
> > > > > > > > page tables, so vhost
> > > > > > > > will setup the page table before IO starts
> > > > > > > > 
> > > > > > > >   - IO/DMA triggers and completes
> > > > > > > > 
> > > > > > > >   - The second vmexit caused by another invalidation to 
> > > > > > > > UNMAP the page tables
> > > > > > > > 
> > > > > > > > So it seems to be worse than when vhost only uses UNMAP like 
> > > > > > > > right now.  At
> > > > > > > > least we only have one vmexit (when UNMAP).  We'll have a vhost 
> > > > > > > > translate()
> > > > > > > > request from kernel to userspace, but IMHO that's cheaper than 
> > > > > > > > the vmexit.
> > > > > > > Right but then I would still prefer to have another notifier.
> > > > > > > 
> > > > > > > Since vtd_page_walk has nothing to do with device IOTLB. IOMMU 
> > > > > > > have a
> > > > > > > dedicated command for flushing device IOTLB. But the check for
> > > > > > > vtd_as_has_map_notifier is used to skip the device which can do 
> > > > > > > demand
> > > > > > > paging via ATS or device specific way. If we have two different 
> > > > > > > notifiers,
> > > > > > > vhost will be on the device iotlb notifier so we don't need it at 
> > > > > > > all?
> > > > > > But we can still have iommu notifier that only registers to UNMAP 
> > > > > > even after we
> > > > > > introduce dev-iotlb notifier?  We don't want to do page walk for 
> > > > > > them as well.
> > > > > > TCG should be the only one so far, but I don't know.. maybe there 
> > > > > > can still be
> > > > > > new ones?
> > > > > I think you're right. But looking at the codes, it looks like the 
> > > > > check of
> > > > > vtd_as_has_map_notifier() was only used in:
> > > > > 
> > > > > 1) vtd_iommu_replay()
> > > > > 2) vtd_iotlb_page_invalidate_notify() (PSI)
> > > > > 
> > > > > For the replay, it's expensive anyhow. For PSI, I think it's just 
> > > > > about one
> > > > > or few mappings, not sure it will have obvious performance impact.
> > > > > 
> > > > > And I had two questions:
> > > > > 
> > > > > 1) The codes doesn't check map for DSI or GI, does this match what 
> > > > > spec
> > > > > said? (It looks to me the spec is unclear in this part)
> > > > Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
> > > > vtd_iotlb_domain_invalidate().
> > > 
> > > I meant the code doesn't check whether there's an MAP notifier :)
> > It's actually checked, because it loops over vtd_as_with_notifiers, and only
> > MAP notifiers register to that. :)
> 
> 
> I may miss something but I don't find the code to block UNMAP notifiers?
> 
> vhost_iommu_region_add()
>     memory_region_register_iommu_notifier()
>         memory_region_update_iommu_notify_flags()
>             imrc->notify_flag_changed()
>                 vtd_iommu_notify_flag_changed()
> 
> ?

Yeah I think you're right.  I might have confused with some previous
implementations.  Maybe we should also do similar thing for DSI/GI just like
what we do in PSI.

> > > > > 2) for the replay() I don't see other implementations (either spapr or
> > > > > generic one) that did unmap (actually they skip unmap explicitly), any
> > > > > reason for doing this in intel IOMMU?
> > > > I could be wrong, but I'd guess it's because vt-d implemented the 
> > > > caching mode
> > > > by leveraging the same invalidation strucuture, so it's harder to make 
> > > > all
> > > > things 

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-15 Thread Jason Wang



On 2020/7/16 上午9:00, Peter Xu wrote:

On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote:

On 2020/7/10 下午9:30, Peter Xu wrote:

On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:

On 2020/7/9 下午10:10, Peter Xu wrote:

On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:

- If we care the performance, it's better to implement the MAP event for
vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

  - The first vmexit caused by an invalidation to MAP the page tables, so 
vhost
will setup the page table before IO starts

  - IO/DMA triggers and completes

  - The second vmexit caused by another invalidation to UNMAP the page 
tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.

Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
dedicated command for flushing device IOTLB. But the check for
vtd_as_has_map_notifier is used to skip the device which can do demand
paging via ATS or device specific way. If we have two different notifiers,
vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have iommu notifier that only registers to UNMAP even after we
introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
TCG should be the only one so far, but I don't know.. maybe there can still be
new ones?

I think you're right. But looking at the codes, it looks like the check of
vtd_as_has_map_notifier() was only used in:

1) vtd_iommu_replay()
2) vtd_iotlb_page_invalidate_notify() (PSI)

For the replay, it's expensive anyhow. For PSI, I think it's just about one
or few mappings, not sure it will have obvious performance impact.

And I had two questions:

1) The codes doesn't check map for DSI or GI, does this match what spec
said? (It looks to me the spec is unclear in this part)

Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
vtd_iotlb_domain_invalidate().


I meant the code doesn't check whether there's an MAP notifier :)

It's actually checked, because it loops over vtd_as_with_notifiers, and only
MAP notifiers register to that. :)



I may miss something but I don't find the code to block UNMAP notifiers?

vhost_iommu_region_add()
    memory_region_register_iommu_notifier()
        memory_region_update_iommu_notify_flags()
            imrc->notify_flag_changed()
                vtd_iommu_notify_flag_changed()

?




But I agree with you that it should be cleaner to introduce the dev-iotlb
notifier type.



Yes, it can solve the issues of duplicated UNMAP issue of vhost.





2) for the replay() I don't see other implementations (either spapr or
generic one) that did unmap (actually they skip unmap explicitly), any
reason for doing this in intel IOMMU?

I could be wrong, but I'd guess it's because vt-d implemented the caching mode
by leveraging the same invalidation strucuture, so it's harder to make all
things right (IOW, we can't clearly identify MAP with UNMAP when we receive an
invalidation request, because MAP/UNMAP requests look the same).

I didn't check others, but I believe spapr is doing it differently by using
some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
from the guest, logically the replay indeed does not need to do any unmap
because we don't need to call replay() on an already existing device but only
for e.g. hot plug.


But this looks conflict with what memory_region_iommu_replay( ) did, for
IOMMU that doesn't have a replay method, it skips UNMAP request:

     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
     iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
     if (iotlb.perm != IOMMU_NONE) {
     n->notify(n, );
     }

I guess there's no knowledge of whether guest have an explicit MAP/UMAP for
this generic code. Or replay implies that guest doesn't have explicit
MAP/UNMAP?

I think it matches exactly with a hot plug case?  Note that when IOMMU_NONE
could also mean the translation does not exist.  So it's actually trying to map

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-15 Thread Peter Xu
On Mon, Jul 13, 2020 at 12:04:16PM +0800, Jason Wang wrote:
> 
> On 2020/7/10 下午9:30, Peter Xu wrote:
> > On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:
> > > On 2020/7/9 下午10:10, Peter Xu wrote:
> > > > On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
> > > > > > > - If we care the performance, it's better to implement the MAP 
> > > > > > > event for
> > > > > > > vhost, otherwise it could be a lot of IOTLB miss
> > > > > > I feel like these are two things.
> > > > > > 
> > > > > > So far what we are talking about is whether vt-d should have 
> > > > > > knowledge about
> > > > > > what kind of events one iommu notifier is interested in.  I still 
> > > > > > think we
> > > > > > should keep this as answered in question 1.
> > > > > > 
> > > > > > The other question is whether we want to switch vhost from UNMAP to 
> > > > > > MAP/UNMAP
> > > > > > events even without vDMA, so that vhost can establish the mapping 
> > > > > > even before
> > > > > > IO starts.  IMHO it's doable, but only if the guest runs DPDK 
> > > > > > workloads.  When
> > > > > > the guest is using dynamic iommu page mappings, I feel like that 
> > > > > > can be even
> > > > > > slower, because then the worst case is for each IO we'll need to 
> > > > > > vmexit twice:
> > > > > > 
> > > > > >  - The first vmexit caused by an invalidation to MAP the page 
> > > > > > tables, so vhost
> > > > > >will setup the page table before IO starts
> > > > > > 
> > > > > >  - IO/DMA triggers and completes
> > > > > > 
> > > > > >  - The second vmexit caused by another invalidation to UNMAP 
> > > > > > the page tables
> > > > > > 
> > > > > > So it seems to be worse than when vhost only uses UNMAP like right 
> > > > > > now.  At
> > > > > > least we only have one vmexit (when UNMAP).  We'll have a vhost 
> > > > > > translate()
> > > > > > request from kernel to userspace, but IMHO that's cheaper than the 
> > > > > > vmexit.
> > > > > Right but then I would still prefer to have another notifier.
> > > > > 
> > > > > Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
> > > > > dedicated command for flushing device IOTLB. But the check for
> > > > > vtd_as_has_map_notifier is used to skip the device which can do demand
> > > > > paging via ATS or device specific way. If we have two different 
> > > > > notifiers,
> > > > > vhost will be on the device iotlb notifier so we don't need it at all?
> > > > But we can still have iommu notifier that only registers to UNMAP even 
> > > > after we
> > > > introduce dev-iotlb notifier?  We don't want to do page walk for them 
> > > > as well.
> > > > TCG should be the only one so far, but I don't know.. maybe there can 
> > > > still be
> > > > new ones?
> > > 
> > > I think you're right. But looking at the codes, it looks like the check of
> > > vtd_as_has_map_notifier() was only used in:
> > > 
> > > 1) vtd_iommu_replay()
> > > 2) vtd_iotlb_page_invalidate_notify() (PSI)
> > > 
> > > For the replay, it's expensive anyhow. For PSI, I think it's just about 
> > > one
> > > or few mappings, not sure it will have obvious performance impact.
> > > 
> > > And I had two questions:
> > > 
> > > 1) The codes doesn't check map for DSI or GI, does this match what spec
> > > said? (It looks to me the spec is unclear in this part)
> > Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
> > vtd_iotlb_domain_invalidate().
> 
> 
> I meant the code doesn't check whether there's an MAP notifier :)

It's actually checked, because it loops over vtd_as_with_notifiers, and only
MAP notifiers register to that. :)

But I agree with you that it should be cleaner to introduce the dev-iotlb
notifier type.

> 
> 
> > 
> > > 2) for the replay() I don't see other implementations (either spapr or
> > > generic one) that did unmap (actually they skip unmap explicitly), any
> > > reason for doing this in intel IOMMU?
> > I could be wrong, but I'd guess it's because vt-d implemented the caching 
> > mode
> > by leveraging the same invalidation strucuture, so it's harder to make all
> > things right (IOW, we can't clearly identify MAP with UNMAP when we receive 
> > an
> > invalidation request, because MAP/UNMAP requests look the same).
> > 
> > I didn't check others, but I believe spapr is doing it differently by using
> > some hypercalls to deliver IOMMU map/unmap requests, which seems a bit 
> > close to
> > what virtio-iommu is doing.  Anyway, the point is if we have explicit 
> > MAP/UNMAP
> > from the guest, logically the replay indeed does not need to do any unmap
> > because we don't need to call replay() on an already existing device but 
> > only
> > for e.g. hot plug.
> 
> 
> But this looks conflict with what memory_region_iommu_replay( ) did, for
> IOMMU that doesn't have a replay method, it skips UNMAP request:
> 
>     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
>     iotlb = imrc->translate(iommu_mr, addr, 

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-12 Thread Jason Wang



On 2020/7/10 下午9:30, Peter Xu wrote:

On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:

On 2020/7/9 下午10:10, Peter Xu wrote:

On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:

- If we care the performance, it's better to implement the MAP event for
vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

 - The first vmexit caused by an invalidation to MAP the page tables, so 
vhost
   will setup the page table before IO starts

 - IO/DMA triggers and completes

 - The second vmexit caused by another invalidation to UNMAP the page tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.

Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
dedicated command for flushing device IOTLB. But the check for
vtd_as_has_map_notifier is used to skip the device which can do demand
paging via ATS or device specific way. If we have two different notifiers,
vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have iommu notifier that only registers to UNMAP even after we
introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
TCG should be the only one so far, but I don't know.. maybe there can still be
new ones?


I think you're right. But looking at the codes, it looks like the check of
vtd_as_has_map_notifier() was only used in:

1) vtd_iommu_replay()
2) vtd_iotlb_page_invalidate_notify() (PSI)

For the replay, it's expensive anyhow. For PSI, I think it's just about one
or few mappings, not sure it will have obvious performance impact.

And I had two questions:

1) The codes doesn't check map for DSI or GI, does this match what spec
said? (It looks to me the spec is unclear in this part)

Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
vtd_iotlb_domain_invalidate().



I meant the code doesn't check whether there's an MAP notifier :)





2) for the replay() I don't see other implementations (either spapr or
generic one) that did unmap (actually they skip unmap explicitly), any
reason for doing this in intel IOMMU?

I could be wrong, but I'd guess it's because vt-d implemented the caching mode
by leveraging the same invalidation strucuture, so it's harder to make all
things right (IOW, we can't clearly identify MAP with UNMAP when we receive an
invalidation request, because MAP/UNMAP requests look the same).

I didn't check others, but I believe spapr is doing it differently by using
some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
from the guest, logically the replay indeed does not need to do any unmap
because we don't need to call replay() on an already existing device but only
for e.g. hot plug.



But this looks conflict with what memory_region_iommu_replay( ) did, for 
IOMMU that doesn't have a replay method, it skips UNMAP request:


    for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
    iotlb = imrc->translate(iommu_mr, addr, IOMMU_NONE, n->iommu_idx);
    if (iotlb.perm != IOMMU_NONE) {
    n->notify(n, );
    }

I guess there's no knowledge of whether guest have an explicit MAP/UMAP 
for this generic code. Or replay implies that guest doesn't have 
explicit MAP/UNMAP?


(btw, the code shortcut the memory_region_notify_one(), not sure the reason)



  VT-d does not have that clear interface, so VT-d needs to
maintain its own mapping structures, and also vt-d is using the same replay &
page_walk operations to sync all these structures, which complicated the vt-d
replay a bit.  With that, we assume replay() can be called anytime on a device,
and we won't notify duplicated MAPs to lower layer like vfio if it is mapped
before.  At the meantime, since we'll compare the latest mapping with the one
we cached in the iova tree, UNMAP becomes possible too.



AFAIK vtd_iommu_replay() did a completely UNMAP:

    /*
 * The replay can be triggered by either a invalidation or a newly
 * created entry. No matter what, we release existing mappings
 * (it means flushing caches 

Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-10 Thread Peter Xu
On Fri, Jul 10, 2020 at 02:34:11PM +0800, Jason Wang wrote:
> 
> On 2020/7/9 下午10:10, Peter Xu wrote:
> > On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
> > > > > - If we care the performance, it's better to implement the MAP event 
> > > > > for
> > > > > vhost, otherwise it could be a lot of IOTLB miss
> > > > I feel like these are two things.
> > > > 
> > > > So far what we are talking about is whether vt-d should have knowledge 
> > > > about
> > > > what kind of events one iommu notifier is interested in.  I still think 
> > > > we
> > > > should keep this as answered in question 1.
> > > > 
> > > > The other question is whether we want to switch vhost from UNMAP to 
> > > > MAP/UNMAP
> > > > events even without vDMA, so that vhost can establish the mapping even 
> > > > before
> > > > IO starts.  IMHO it's doable, but only if the guest runs DPDK 
> > > > workloads.  When
> > > > the guest is using dynamic iommu page mappings, I feel like that can be 
> > > > even
> > > > slower, because then the worst case is for each IO we'll need to vmexit 
> > > > twice:
> > > > 
> > > > - The first vmexit caused by an invalidation to MAP the page 
> > > > tables, so vhost
> > > >   will setup the page table before IO starts
> > > > 
> > > > - IO/DMA triggers and completes
> > > > 
> > > > - The second vmexit caused by another invalidation to UNMAP the 
> > > > page tables
> > > > 
> > > > So it seems to be worse than when vhost only uses UNMAP like right now. 
> > > >  At
> > > > least we only have one vmexit (when UNMAP).  We'll have a vhost 
> > > > translate()
> > > > request from kernel to userspace, but IMHO that's cheaper than the 
> > > > vmexit.
> > > 
> > > Right but then I would still prefer to have another notifier.
> > > 
> > > Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
> > > dedicated command for flushing device IOTLB. But the check for
> > > vtd_as_has_map_notifier is used to skip the device which can do demand
> > > paging via ATS or device specific way. If we have two different notifiers,
> > > vhost will be on the device iotlb notifier so we don't need it at all?
> > But we can still have iommu notifier that only registers to UNMAP even 
> > after we
> > introduce dev-iotlb notifier?  We don't want to do page walk for them as 
> > well.
> > TCG should be the only one so far, but I don't know.. maybe there can still 
> > be
> > new ones?
> 
> 
> I think you're right. But looking at the codes, it looks like the check of
> vtd_as_has_map_notifier() was only used in:
> 
> 1) vtd_iommu_replay()
> 2) vtd_iotlb_page_invalidate_notify() (PSI)
> 
> For the replay, it's expensive anyhow. For PSI, I think it's just about one
> or few mappings, not sure it will have obvious performance impact.
> 
> And I had two questions:
> 
> 1) The codes doesn't check map for DSI or GI, does this match what spec
> said? (It looks to me the spec is unclear in this part)

Both DSI/GI should cover maps too?  E.g. vtd_sync_shadow_page_table() in
vtd_iotlb_domain_invalidate().

> 2) for the replay() I don't see other implementations (either spapr or
> generic one) that did unmap (actually they skip unmap explicitly), any
> reason for doing this in intel IOMMU?

I could be wrong, but I'd guess it's because vt-d implemented the caching mode
by leveraging the same invalidation strucuture, so it's harder to make all
things right (IOW, we can't clearly identify MAP with UNMAP when we receive an
invalidation request, because MAP/UNMAP requests look the same).

I didn't check others, but I believe spapr is doing it differently by using
some hypercalls to deliver IOMMU map/unmap requests, which seems a bit close to
what virtio-iommu is doing.  Anyway, the point is if we have explicit MAP/UNMAP
from the guest, logically the replay indeed does not need to do any unmap
because we don't need to call replay() on an already existing device but only
for e.g. hot plug.  VT-d does not have that clear interface, so VT-d needs to
maintain its own mapping structures, and also vt-d is using the same replay &
page_walk operations to sync all these structures, which complicated the vt-d
replay a bit.  With that, we assume replay() can be called anytime on a device,
and we won't notify duplicated MAPs to lower layer like vfio if it is mapped
before.  At the meantime, since we'll compare the latest mapping with the one
we cached in the iova tree, UNMAP becomes possible too.

Thanks,

-- 
Peter Xu



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-10 Thread Jason Wang



On 2020/7/9 下午10:10, Peter Xu wrote:

On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:

- If we care the performance, it's better to implement the MAP event for
vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

- The first vmexit caused by an invalidation to MAP the page tables, so 
vhost
  will setup the page table before IO starts

- IO/DMA triggers and completes

- The second vmexit caused by another invalidation to UNMAP the page tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.


Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
dedicated command for flushing device IOTLB. But the check for
vtd_as_has_map_notifier is used to skip the device which can do demand
paging via ATS or device specific way. If we have two different notifiers,
vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have iommu notifier that only registers to UNMAP even after we
introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
TCG should be the only one so far, but I don't know.. maybe there can still be
new ones?



I think you're right. But looking at the codes, it looks like the check 
of vtd_as_has_map_notifier() was only used in:


1) vtd_iommu_replay()
2) vtd_iotlb_page_invalidate_notify() (PSI)

For the replay, it's expensive anyhow. For PSI, I think it's just about 
one or few mappings, not sure it will have obvious performance impact.


And I had two questions:

1) The codes doesn't check map for DSI or GI, does this match what spec 
said? (It looks to me the spec is unclear in this part)
2) for the replay() I don't see other implementations (either spapr or 
generic one) that did unmap (actually they skip unmap explicitly), any 
reason for doing this in intel IOMMU?


Thanks




Thanks,





Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-09 Thread Peter Xu
On Thu, Jul 09, 2020 at 01:58:33PM +0800, Jason Wang wrote:
> > > - If we care the performance, it's better to implement the MAP event for
> > > vhost, otherwise it could be a lot of IOTLB miss
> > I feel like these are two things.
> > 
> > So far what we are talking about is whether vt-d should have knowledge about
> > what kind of events one iommu notifier is interested in.  I still think we
> > should keep this as answered in question 1.
> > 
> > The other question is whether we want to switch vhost from UNMAP to 
> > MAP/UNMAP
> > events even without vDMA, so that vhost can establish the mapping even 
> > before
> > IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  
> > When
> > the guest is using dynamic iommu page mappings, I feel like that can be even
> > slower, because then the worst case is for each IO we'll need to vmexit 
> > twice:
> > 
> >- The first vmexit caused by an invalidation to MAP the page tables, so 
> > vhost
> >  will setup the page table before IO starts
> > 
> >- IO/DMA triggers and completes
> > 
> >- The second vmexit caused by another invalidation to UNMAP the page 
> > tables
> > 
> > So it seems to be worse than when vhost only uses UNMAP like right now.  At
> > least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
> > request from kernel to userspace, but IMHO that's cheaper than the vmexit.
> 
> 
> Right but then I would still prefer to have another notifier.
> 
> Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a
> dedicated command for flushing device IOTLB. But the check for
> vtd_as_has_map_notifier is used to skip the device which can do demand
> paging via ATS or device specific way. If we have two different notifiers,
> vhost will be on the device iotlb notifier so we don't need it at all?

But we can still have iommu notifier that only registers to UNMAP even after we
introduce dev-iotlb notifier?  We don't want to do page walk for them as well.
TCG should be the only one so far, but I don't know.. maybe there can still be
new ones?

Thanks,

-- 
Peter Xu



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-09 Thread Jason Wang



On 2020/7/8 下午10:16, Peter Xu wrote:

On Wed, Jul 08, 2020 at 01:42:30PM +0800, Jason Wang wrote:

So it should be functional equivalent to vtd_as_has_notifier().

For example: in vtd_iommu_replay() we'll skip the replay if vhost has
registered the iommu notifier because vtd_as_has_map_notifier() will return
false.


Two questions:

- Do we care the performance here? If not, vhost may just ignore the MAP
event?

I think we care, because vtd_page_walk() can be expensive.



Ok.





- If we care the performance, it's better to implement the MAP event for
vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

   - The first vmexit caused by an invalidation to MAP the page tables, so vhost
 will setup the page table before IO starts

   - IO/DMA triggers and completes

   - The second vmexit caused by another invalidation to UNMAP the page tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.



Right but then I would still prefer to have another notifier.

Since vtd_page_walk has nothing to do with device IOTLB. IOMMU have a 
dedicated command for flushing device IOTLB. But the check for 
vtd_as_has_map_notifier is used to skip the device which can do demand 
paging via ATS or device specific way. If we have two different 
notifiers, vhost will be on the device iotlb notifier so we don't need 
it at all?


Thanks








Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-08 Thread Peter Xu
On Wed, Jul 08, 2020 at 01:42:30PM +0800, Jason Wang wrote:
> > > So it should be functional equivalent to vtd_as_has_notifier().
> > For example: in vtd_iommu_replay() we'll skip the replay if vhost has
> > registered the iommu notifier because vtd_as_has_map_notifier() will return
> > false.
> 
> 
> Two questions:
> 
> - Do we care the performance here? If not, vhost may just ignore the MAP
> event?

I think we care, because vtd_page_walk() can be expensive.

> - If we care the performance, it's better to implement the MAP event for
> vhost, otherwise it could be a lot of IOTLB miss

I feel like these are two things.

So far what we are talking about is whether vt-d should have knowledge about
what kind of events one iommu notifier is interested in.  I still think we
should keep this as answered in question 1.

The other question is whether we want to switch vhost from UNMAP to MAP/UNMAP
events even without vDMA, so that vhost can establish the mapping even before
IO starts.  IMHO it's doable, but only if the guest runs DPDK workloads.  When
the guest is using dynamic iommu page mappings, I feel like that can be even
slower, because then the worst case is for each IO we'll need to vmexit twice:

  - The first vmexit caused by an invalidation to MAP the page tables, so vhost
will setup the page table before IO starts

  - IO/DMA triggers and completes

  - The second vmexit caused by another invalidation to UNMAP the page tables

So it seems to be worse than when vhost only uses UNMAP like right now.  At
least we only have one vmexit (when UNMAP).  We'll have a vhost translate()
request from kernel to userspace, but IMHO that's cheaper than the vmexit.

-- 
Peter Xu



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-07 Thread Jason Wang



On 2020/7/8 上午3:54, Peter Xu wrote:

On Tue, Jul 07, 2020 at 04:03:10PM +0800, Jason Wang wrote:

On 2020/7/3 下午9:03, Peter Xu wrote:

On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote:

On 2020/7/2 下午11:45, Peter Xu wrote:

On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:

So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?

That should work but I wonder something as following is better.

Instead of introducing new flags, how about carry the type of event in the
notifier then the device (vhost) can choose the message it want to process
like:

static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)

{

switch (event->type) {

case IOMMU_MAP:
case IOMMU_UNMAP:
case IOMMU_DEV_IOTLB_UNMAP:
...

}

Looks ok to me, though imo we should still keep the registration information,
so VT-d knows which notifiers is interested in which events.  E.g., we can
still do something like vtd_as_has_map_notifier().


Is this for a better performance?

I wonder whether it's worth since we can't not have both vhost and vtd to be
registered into the same as.

/me failed to follow this sentence.. :(



Sorry, I meant "vfio" instead "vtd".





So it should be functional equivalent to vtd_as_has_notifier().

For example: in vtd_iommu_replay() we'll skip the replay if vhost has
registered the iommu notifier because vtd_as_has_map_notifier() will return
false.



Two questions:

- Do we care the performance here? If not, vhost may just ignore the MAP 
event?
- If we care the performance, it's better to implement the MAP event for 
vhost, otherwise it could be a lot of IOTLB miss


Thanks



   It'll only return true if the device is a vfio-pci device.

Without vtd_as_has_map_notifier(), how could we do that?





Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-07 Thread Peter Xu
On Tue, Jul 07, 2020 at 04:03:10PM +0800, Jason Wang wrote:
> 
> On 2020/7/3 下午9:03, Peter Xu wrote:
> > On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote:
> > > On 2020/7/2 下午11:45, Peter Xu wrote:
> > > > On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> > > > > So I think we agree that a new notifier is needed?
> > > > Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
> > > 
> > > That should work but I wonder something as following is better.
> > > 
> > > Instead of introducing new flags, how about carry the type of event in the
> > > notifier then the device (vhost) can choose the message it want to process
> > > like:
> > > 
> > > static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
> > > 
> > > {
> > > 
> > > switch (event->type) {
> > > 
> > > case IOMMU_MAP:
> > > case IOMMU_UNMAP:
> > > case IOMMU_DEV_IOTLB_UNMAP:
> > > ...
> > > 
> > > }
> > Looks ok to me, though imo we should still keep the registration 
> > information,
> > so VT-d knows which notifiers is interested in which events.  E.g., we can
> > still do something like vtd_as_has_map_notifier().
> 
> 
> Is this for a better performance?
> 
> I wonder whether it's worth since we can't not have both vhost and vtd to be
> registered into the same as.

/me failed to follow this sentence.. :(

> 
> So it should be functional equivalent to vtd_as_has_notifier().

For example: in vtd_iommu_replay() we'll skip the replay if vhost has
registered the iommu notifier because vtd_as_has_map_notifier() will return
false.  It'll only return true if the device is a vfio-pci device.

Without vtd_as_has_map_notifier(), how could we do that?

-- 
Peter Xu



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-07 Thread Jason Wang



On 2020/7/3 下午9:03, Peter Xu wrote:

On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote:

On 2020/7/2 下午11:45, Peter Xu wrote:

On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:

So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?


That should work but I wonder something as following is better.

Instead of introducing new flags, how about carry the type of event in the
notifier then the device (vhost) can choose the message it want to process
like:

static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)

{

switch (event->type) {

case IOMMU_MAP:
case IOMMU_UNMAP:
case IOMMU_DEV_IOTLB_UNMAP:
...

}

Looks ok to me, though imo we should still keep the registration information,
so VT-d knows which notifiers is interested in which events.  E.g., we can
still do something like vtd_as_has_map_notifier().



Is this for a better performance?

I wonder whether it's worth since we can't not have both vhost and vtd 
to be registered into the same as.


So it should be functional equivalent to vtd_as_has_notifier().

Thanks




So these are probably two different things: the new IOMMU_NOTIFIER_DEV_IOTLB
flag is one as discussed; the other one is whether we would like to introduce
IOMMUTLBEvent to include the type, so that we can avoid introduce two notifiers
for one device majorly to identify dev-iotlb from unmaps.

Thanks,





Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-03 Thread Peter Xu
On Fri, Jul 03, 2020 at 03:24:19PM +0800, Jason Wang wrote:
> 
> On 2020/7/2 下午11:45, Peter Xu wrote:
> > On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> > > So I think we agree that a new notifier is needed?
> > Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?
> 
> 
> That should work but I wonder something as following is better.
> 
> Instead of introducing new flags, how about carry the type of event in the
> notifier then the device (vhost) can choose the message it want to process
> like:
> 
> static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)
> 
> {
> 
> switch (event->type) {
> 
> case IOMMU_MAP:
> case IOMMU_UNMAP:
> case IOMMU_DEV_IOTLB_UNMAP:
> ...
> 
> }

Looks ok to me, though imo we should still keep the registration information,
so VT-d knows which notifiers is interested in which events.  E.g., we can
still do something like vtd_as_has_map_notifier().

So these are probably two different things: the new IOMMU_NOTIFIER_DEV_IOTLB
flag is one as discussed; the other one is whether we would like to introduce
IOMMUTLBEvent to include the type, so that we can avoid introduce two notifiers
for one device majorly to identify dev-iotlb from unmaps.

Thanks,

-- 
Peter Xu



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-03 Thread Jason Wang



On 2020/7/2 下午11:45, Peter Xu wrote:

On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:

So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?



That should work but I wonder something as following is better.

Instead of introducing new flags, how about carry the type of event in 
the notifier then the device (vhost) can choose the message it want to 
process like:


static vhost_iommu_event(IOMMUNotifier *n, IOMMUTLBEvent *event)

{

switch (event->type) {

case IOMMU_MAP:
case IOMMU_UNMAP:
case IOMMU_DEV_IOTLB_UNMAP:
...

}

Thanks








Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-02 Thread Peter Xu
On Thu, Jul 02, 2020 at 11:01:54AM +0800, Jason Wang wrote:
> So I think we agree that a new notifier is needed?

Good to me, or a new flag should be easier (IOMMU_NOTIFIER_DEV_IOTLB)?

-- 
Peter Xu



Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-01 Thread Jason Wang



On 2020/7/1 下午4:09, Jason Wang wrote:


On 2020/6/30 下午11:39, Peter Xu wrote:

On Tue, Jun 30, 2020 at 10:41:10AM +0800, Jason Wang wrote:

  /* According to ATS spec table 2.4:
   * S = 0, bits 15:12 =  range size: 4K
   * S = 1, bits 15:12 = xxx0 range size: 8K
   * S = 1, bits 15:12 = xx01 range size: 16K
   * S = 1, bits 15:12 = x011 range size: 32K
   * S = 1, bits 15:12 = 0111 range size: 64K
   * ...
   */


Right, but the comment is probably misleading here, since it's for 
the PCI-E
transaction between IOMMU and device not for the device IOTLB 
invalidation

descriptor.

For device IOTLB invalidation descriptor, spec allows a [0, ~0ULL]
invalidation:

"

6.5.2.5 Device-TLB Invalidate Descriptor

...

Size (S): The size field indicates the number of consecutive pages 
targeted

by this invalidation
request. If S field is zero, a single page at page address specified by
Address [63:12] is requested
to be invalidated. If S field is Set, the least significant bit in the
Address field with value 0b
indicates the invalidation address range. For example, if S field is 
Set and

Address[12] is Clear, it
indicates an 8KB invalidation address range with base address in 
Address

[63:13]. If S field and
Address[12] is Set and bit 13 is Clear, it indicates a 16KB 
invalidation

address range with base
address in Address [63:14], etc.

"

So if we receive an address whose [63] is 0 and the rest is all 1, 
it's then

a [0, ~0ULL] invalidation.
Yes.  I think invalidating the whole range is always fine.  It's 
still not

arbitrary, right?  E.g., we can't even invalidate (0x1000, 0x3000) with
device-iotlb because of the address mask, not to say sub-pages.



Yes.






How about just convert to use a range [start, end] for any 
notifier and move
the checks (e.g the assert) into the actual notifier implemented 
(vhost or

vfio)?
IOMMUTLBEntry itself is the abstraction layer of TLB entry.  
Hardware TLB entry
is definitely not arbitrary range either (because AFAICT the 
hardware should

only cache PFN rather than address, so at least PAGE_SIZE aligned).
Introducing this flag will already make this trickier just to 
avoid introducing
another similar struct to IOMMUTLBEntry, but I really don't want 
to make it a
default option...  Not to mention I probably have no reason to 
urge the rest
iommu notifier users (tcg, vfio) to change their existing good 
code to suite
any of the backend who can cooperate with arbitrary address 
ranges...

Ok, so it looks like we need a dedicated notifiers to device IOTLB.
Or we can also make a new flag for device iotlb just like current 
UNMAP? Then
we replace the vhost type from UNMAP to DEVICE_IOTLB.  But IMHO 
using the
ARBITRARY_LENGTH flag would work in a similar way. DEVICE_IOTLB 
flag could
also allow virtio/vhost to only receive one invalidation (now IIUC 
it'll
receive both iotlb and device-iotlb for unmapping a page when 
ats=on), but then
ats=on will be a must and it could break some old (misconfiged) 
qemu because
afaict previously virtio/vhost could even work with vIOMMU 
(accidentally) even

without ats=on.


That's a bug and I don't think we need to workaround 
mis-configurated qemu

:)

IMHO it depends on the strictness we want on the qemu cmdline API. :)

We should at least check libvirt to make sure it's using ats=on 
always, then I

agree maybe we can avoid considering the rest...

Thanks,



Cc libvirt list, but I think we should fix libvirt if they don't 
provide "ats=on".


Thanks



Libvirt looks fine, according to the domain  XML documentation[1]:

 QEMU's virtio devices have some attributes related to the virtio 
transport under the driver element: The iommu attribute enables the use 
of emulated IOMMU by the device. The attribute ats controls the Address 
Translation Service support for PCIe devices. This is needed to make use 
of IOTLB support (see IOMMU device). Possible values are on or off. 
Since 3.5.0


So I think we agree that a new notifier is needed?

Thanks

[1] https://libvirt.org/formatdomain.html










Re: [RFC v2 1/1] memory: Delete assertion in memory_region_unregister_iommu_notifier

2020-07-01 Thread Jason Wang



On 2020/6/30 下午11:39, Peter Xu wrote:

On Tue, Jun 30, 2020 at 10:41:10AM +0800, Jason Wang wrote:

  /* According to ATS spec table 2.4:
   * S = 0, bits 15:12 =  range size: 4K
   * S = 1, bits 15:12 = xxx0 range size: 8K
   * S = 1, bits 15:12 = xx01 range size: 16K
   * S = 1, bits 15:12 = x011 range size: 32K
   * S = 1, bits 15:12 = 0111 range size: 64K
   * ...
   */


Right, but the comment is probably misleading here, since it's for the PCI-E
transaction between IOMMU and device not for the device IOTLB invalidation
descriptor.

For device IOTLB invalidation descriptor, spec allows a [0, ~0ULL]
invalidation:

"

6.5.2.5 Device-TLB Invalidate Descriptor

...

Size (S): The size field indicates the number of consecutive pages targeted
by this invalidation
request. If S field is zero, a single page at page address specified by
Address [63:12] is requested
to be invalidated. If S field is Set, the least significant bit in the
Address field with value 0b
indicates the invalidation address range. For example, if S field is Set and
Address[12] is Clear, it
indicates an 8KB invalidation address range with base address in Address
[63:13]. If S field and
Address[12] is Set and bit 13 is Clear, it indicates a 16KB invalidation
address range with base
address in Address [63:14], etc.

"

So if we receive an address whose [63] is 0 and the rest is all 1, it's then
a [0, ~0ULL] invalidation.

Yes.  I think invalidating the whole range is always fine.  It's still not
arbitrary, right?  E.g., we can't even invalidate (0x1000, 0x3000) with
device-iotlb because of the address mask, not to say sub-pages.



Yes.







How about just convert to use a range [start, end] for any notifier and move
the checks (e.g the assert) into the actual notifier implemented (vhost or
vfio)?

IOMMUTLBEntry itself is the abstraction layer of TLB entry.  Hardware TLB entry
is definitely not arbitrary range either (because AFAICT the hardware should
only cache PFN rather than address, so at least PAGE_SIZE aligned).
Introducing this flag will already make this trickier just to avoid introducing
another similar struct to IOMMUTLBEntry, but I really don't want to make it a
default option...  Not to mention I probably have no reason to urge the rest
iommu notifier users (tcg, vfio) to change their existing good code to suite
any of the backend who can cooperate with arbitrary address ranges...

Ok, so it looks like we need a dedicated notifiers to device IOTLB.

Or we can also make a new flag for device iotlb just like current UNMAP? Then
we replace the vhost type from UNMAP to DEVICE_IOTLB.  But IMHO using the
ARBITRARY_LENGTH flag would work in a similar way.  DEVICE_IOTLB flag could
also allow virtio/vhost to only receive one invalidation (now IIUC it'll
receive both iotlb and device-iotlb for unmapping a page when ats=on), but then
ats=on will be a must and it could break some old (misconfiged) qemu because
afaict previously virtio/vhost could even work with vIOMMU (accidentally) even
without ats=on.


That's a bug and I don't think we need to workaround mis-configurated qemu
:)

IMHO it depends on the strictness we want on the qemu cmdline API. :)

We should at least check libvirt to make sure it's using ats=on always, then I
agree maybe we can avoid considering the rest...

Thanks,



Cc libvirt list, but I think we should fix libvirt if they don't provide 
"ats=on".


Thanks