Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region

2017-01-21 Thread Peter Xu
On Fri, Jan 20, 2017 at 10:14:01AM -0700, Alex Williamson wrote:
> On Fri, 20 Jan 2017 20:27:18 +0800
> Peter Xu  wrote:
> 
> > On Fri, Jan 20, 2017 at 11:43:28AM +0800, Peter Xu wrote:
> > 
> > [...]
> > 
> > > > What I don't want to see is for this API bug to leak out into the rest
> > > > of the QEMU code such that intel_iommu code, or iommu code in general
> > > > subtly avoids it by artificially using a smaller range.  VT-d hardware
> > > > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > > > you want to make the iommu address space match the device we're trying
> > > > to emulate, that's perfectly fine.  AIUI, AMD-Vi does actually have a
> > > > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > > > the simplest solution would be to track the mapped iova high water
> > > > mark per container in vfio and truncate unmaps to that high water end
> > > > address.  Realistically we're probably not going to see iovas at the end
> > > > of the 64-bit address space, but we can come up with some other
> > > > workaround in the vfio code or update the kernel API if we do.  Thanks, 
> > > >  
> > > 
> > > Agreed that a high watermark can be a good solution for VT-d. I'll use
> > > that instead of 2^63-1.
> > 
> > Okay, when I replied I didn't notice that this "watermark" may need
> > more than a few (even tens of) LOCs. :(
> > 
> > Considering that I see no further usage of this watermark, I'm
> > wondering whether it's okay to directly use (1ULL << VTD_MGAW) here as
> > the watermark - it's simple, efficient, and secure imho.
> 
> Avoiding the issue based on the virtual iommu hardware properties is a
> fine solution; my intention was only to discourage the introduction of
> artificial limitations in the surrounding code to avoid this vfio
> issue.  Thanks,

Yes.

I have posted a new version of the vfio series. Looking forward to
your further comments (or an ack, with luck :) on v4.

Thanks,

-- peterx



Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region

2017-01-20 Thread Alex Williamson
On Fri, 20 Jan 2017 20:27:18 +0800
Peter Xu  wrote:

> On Fri, Jan 20, 2017 at 11:43:28AM +0800, Peter Xu wrote:
> 
> [...]
> 
> > > What I don't want to see is for this API bug to leak out into the rest
> > > of the QEMU code such that intel_iommu code, or iommu code in general
> > > subtly avoids it by artificially using a smaller range.  VT-d hardware
> > > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > > you want to make the iommu address space match the device we're trying
> > > to emulate, that's perfectly fine.  AIUI, AMD-Vi does actually have a
> > > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > > the simplest solution would be to track the mapped iova high water
> > > mark per container in vfio and truncate unmaps to that high water end
> > > address.  Realistically we're probably not going to see iovas at the end
> > > of the 64-bit address space, but we can come up with some other
> > > workaround in the vfio code or update the kernel API if we do.  Thanks,  
> > 
> > Agreed that a high watermark can be a good solution for VT-d. I'll use
> > that instead of 2^63-1.
> 
> Okay, when I replied I didn't notice that this "watermark" may need
> more than a few (even tens of) LOCs. :(
> 
> Considering that I see no further usage of this watermark, I'm
> wondering whether it's okay to directly use (1ULL << VTD_MGAW) here as
> the watermark - it's simple, efficient, and secure imho.

Avoiding the issue based on the virtual iommu hardware properties is a
fine solution; my intention was only to discourage the introduction of
artificial limitations in the surrounding code to avoid this vfio
issue.  Thanks,

Alex



Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region

2017-01-20 Thread Peter Xu
On Fri, Jan 20, 2017 at 11:43:28AM +0800, Peter Xu wrote:

[...]

> > What I don't want to see is for this API bug to leak out into the rest
> > of the QEMU code such that intel_iommu code, or iommu code in general
> > subtly avoids it by artificially using a smaller range.  VT-d hardware
> > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > you want to make the iommu address space match the device we're trying
> > to emulate, that's perfectly fine.  AIUI, AMD-Vi does actually have a
> > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > the simplest solution would be to track the mapped iova high water
> > mark per container in vfio and truncate unmaps to that high water end
> > address.  Realistically we're probably not going to see iovas at the end
> > of the 64-bit address space, but we can come up with some other
> > workaround in the vfio code or update the kernel API if we do.  Thanks,
> 
> Agreed that a high watermark can be a good solution for VT-d. I'll use
> that instead of 2^63-1.

Okay, when I replied I didn't notice that this "watermark" may need
more than a few (even tens of) LOCs. :(

Considering that I see no further usage of this watermark, I'm
wondering whether it's okay to directly use (1ULL << VTD_MGAW) here as
the watermark - it's simple, efficient, and secure imho.
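
Roughly, applying that limit could look like the following (an untested
sketch; VTD_MGAW's value follows the intel_iommu code, but the helper
itself is hypothetical):

#include <stdint.h>

/* Clamp a guest-requested invalidation range to the address width the
 * emulated VT-d decodes, so vfio never sees an end address beyond what
 * the virtual hardware can generate. */
#define VTD_MGAW         39
#define VTD_ADDRESS_SIZE (1ULL << VTD_MGAW)

static uint64_t vtd_clamp_size(uint64_t base, uint64_t size)
{
    if (base >= VTD_ADDRESS_SIZE) {
        return 0;                        /* entirely above the AW */
    }
    if (size > VTD_ADDRESS_SIZE - base) {
        size = VTD_ADDRESS_SIZE - base;  /* truncate to the limit */
    }
    return size;
}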

Thanks,

-- peterx



Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region

2017-01-19 Thread Peter Xu
On Thu, Jan 19, 2017 at 09:21:10PM -0700, Alex Williamson wrote:
> On Fri, 20 Jan 2017 11:43:28 +0800
> Peter Xu  wrote:
> 
> > On Thu, Jan 19, 2017 at 10:54:37AM -0700, Alex Williamson wrote:
> > > On Thu, 19 Jan 2017 17:25:29 +0800
> > > Peter Xu  wrote:
> > >   
> > > > This requirement originates from the VT-d vfio series:
> > > > 
> > > >   https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html
> > > > 
> > > > The goal of this series is to allow the IOMMU to notify an unmap with a
> > > > very big IOTLB range, for example with base=0 and size=2^63-1 (to unmap
> > > > the whole address space).
> > > > 
> > > > The first patch is a good-to-have, for traces.
> > > > 
> > > > The second one is a cleanup of existing code, only.  
> > > 
> > > Sort of, it does add some overhead to the unmap path, but you remove
> > > that and more in the third patch.  
> > 
> > Hmm, yes, I tried to get the RAM pointer even for unmap. I should
> > remove the ending "only".
> > 
> > >
> > > > The third one moves the RAM translation and check into the map
> > > > operation logic, freeing the unmap path from that work.  
> > > > 
> > > > The series is marked as RFC since I am not sure whether this is a
> > > > workable approach. Anyway, please review and help confirm it.  
> > > 
> > > It seems reasonable to me,  
> > 
> > Good to know that you didn't disagree with this. :) Then let me take
> > these patches along with that series in the next post (since that
> > series depends on this one, I'll keep them together; maintainers,
> > please decide how you'd like to merge them).
> > 
> > > except for the example here of using 2^63-1,
> > > which I expect is to work around the vfio {un}map API bug as we
> > > discussed on irc.  
> > 
> > Actually it's not only for that one; I don't know whether we have a
> > problem here:
> > 
> > (I mentioned this in IRC, in case you missed that, I paste it here as
> >  well)
> > 
> > static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
> >                                       dma_addr_t start, size_t size)
> > {
> >     struct rb_node *node = iommu->dma_list.rb_node;
> > 
> >     while (node) {
> >         struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
> > 
> >         if (start + size <= dma->iova)
> >             node = node->rb_left;
> >         else if (start >= dma->iova + dma->size)
> >                           ^
> >                           Could it overflow here? <--
> >             node = node->rb_right;
> >         else
> >             return dma;
> >     }
> > 
> >     return NULL;
> > }
> > 
> > I used 2^63-1 to try to avoid both cases.
> 
> 
> Perhaps I'm not seeing the overflow you're trying to identify,
> but dma->iova + dma->size cannot overflow; these are existing DMA
> mappings, and the DMA mapping path does test for overflow.  I think we
> can consider those sanitized.  I am wondering if the test above it is
> suspect though, start + size doesn't seem to be checked for overflow on
> the unmap path.  It seems pretty easy to avoid from the user side
> though, but maybe we should add a test for it in the kernel.

Ah, yes. :-)

> 
>  
> > > For everyone, the root of the problem is that the
> > > ioctls use:
> > > 
> > > __u64   iova;   /* IO virtual address */
> > > __u64   size;   /* Size of mapping (bytes) */
> > > 
> > > So we can only express a size of 2^64-1 and we have an off-by-one error
> > > trying to perform a map/unmap of an entire 64-bit space.  Note when
> > > designing an API, use start/end rather than base/size to avoid this.  
> > 
> > Lesson learned.
> > 
> > > 
> > > What I don't want to see is for this API bug to leak out into the rest
> > > of the QEMU code such that intel_iommu code, or iommu code in general
> > > subtly avoids it by artificially using a smaller range.  VT-d hardware
> > > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > > you want to make the iommu address space match the device we're trying
> > > to emulate, that's perfectly fine.  AIUI, AMD-Vi does actually have a
> > > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > > the simplest solution would be to track the mapped iova high water
> > > mark per container in vfio and truncate unmaps to that high water end
> > > address.  Realistically we're probably not going to see iovas at the end
> > > of the 64-bit address space, but we can come up with some other
> > > workaround in the vfio code or update the kernel API if we do.  Thanks,  
> > 
> > Agreed that a high watermark can be a good solution for VT-d. I'll use
> > that instead of 2^63-1. Though for AMD, if it supports 64 bits, we may
> > still need to solve the existing issues in the future, since in that case
> > the high watermark can be (hwaddr)-1 as well if the guest specifies it.

Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region

2017-01-19 Thread Alex Williamson
On Fri, 20 Jan 2017 11:43:28 +0800
Peter Xu  wrote:

> On Thu, Jan 19, 2017 at 10:54:37AM -0700, Alex Williamson wrote:
> > On Thu, 19 Jan 2017 17:25:29 +0800
> > Peter Xu  wrote:
> >   
> > > This requirement originates from the VT-d vfio series:
> > > 
> > >   https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html
> > > 
> > > The goal of this series is to allow the IOMMU to notify an unmap with a
> > > very big IOTLB range, for example with base=0 and size=2^63-1 (to unmap
> > > the whole address space).
> > > 
> > > The first patch is a good-to-have, for traces.
> > > 
> > > The second one is a cleanup of existing code, only.  
> > 
> > Sort of, it does add some overhead to the unmap path, but you remove
> > that and more in the third patch.  
> 
> Hmm, yes, I tried to get the RAM pointer even for unmap. I should
> remove the ending "only".
> 
> >
> > > The third one moves the RAM translation and check into the map
> > > operation logic, freeing the unmap path from that work.
> > > 
> > > The series is marked as RFC since I am not sure whether this is a
> > > workable approach. Anyway, please review and help confirm it.  
> > 
> > It seems reasonable to me,  
> 
> Good to know that you didn't disagree with this. :) Then let me take
> these patches along with that series in the next post (since that
> series depends on this one, I'll keep them together; maintainers,
> please decide how you'd like to merge them).
> 
> > except for the example here of using 2^63-1,
> > which I expect is to work around the vfio {un}map API bug as we
> > discussed on irc.  
> 
> Actually it's not only for that one; I don't know whether we have a
> problem here:
> 
> (I mentioned this in IRC, in case you missed that, I paste it here as
>  well)
> 
> static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
>                                       dma_addr_t start, size_t size)
> {
>     struct rb_node *node = iommu->dma_list.rb_node;
> 
>     while (node) {
>         struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
> 
>         if (start + size <= dma->iova)
>             node = node->rb_left;
>         else if (start >= dma->iova + dma->size)
>                           ^
>                           Could it overflow here? <--
>             node = node->rb_right;
>         else
>             return dma;
>     }
> 
>     return NULL;
> }
> 
> I used 2^63-1 to try to avoid both cases.


Perhaps I'm not seeing the overflow you're trying to identify,
but dma->iova + dma->size cannot overflow; these are existing DMA
mappings, and the DMA mapping path does test for overflow.  I think we
can consider those sanitized.  I am wondering if the test above it is
suspect though, start + size doesn't seem to be checked for overflow on
the unmap path.  It seems pretty easy to avoid from the user side
though, but maybe we should add a test for it in the kernel.
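
Such a test could be as small as the sketch below (the field names
mirror struct vfio_iommu_type1_dma_unmap, but the helper and its
placement are hypothetical):

#include <errno.h>
#include <stdint.h>

/* Reject an unmap request whose start + size wraps around the 64-bit
 * space before it ever reaches vfio_find_dma(). */
static int vfio_check_unmap_range(uint64_t iova, uint64_t size)
{
    if (iova + size < iova) {   /* unsigned wrap-around */
        return -EINVAL;
    }
    return 0;
}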

 
> > For everyone, the root of the problem is that the
> > ioctls use:
> > 
> > __u64   iova;   /* IO virtual address */
> > __u64   size;   /* Size of mapping (bytes) */
> > 
> > So we can only express a size of 2^64-1 and we have an off-by-one error
> > trying to perform a map/unmap of an entire 64-bit space.  Note when
> > designing an API, use start/end rather than base/size to avoid this.  
> 
> Lesson learned.
> 
> > 
> > What I don't want to see is for this API bug to leak out into the rest
> > of the QEMU code such that intel_iommu code, or iommu code in general
> > subtly avoids it by artificially using a smaller range.  VT-d hardware
> > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > you want to make the iommu address space match the device we're trying
> > to emulate, that's perfectly fine.  AIUI, AMD-Vi does actually have a
> > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > the simplest solution would be to track the mapped iova high water
> > mark per container in vfio and truncate unmaps to that high water end
> > address.  Realistically we're probably not going to see iovas at the end
> > of the 64-bit address space, but we can come up with some other
> > workaround in the vfio code or update the kernel API if we do.  Thanks,  
> 
> Agreed that a high watermark can be a good solution for VT-d. I'll use
> that instead of 2^63-1. Though for AMD, if it supports 64 bits, we may
> still need to solve the existing issues in the future, since in that case
> the high watermark can be (hwaddr)-1 as well if the guest specifies it.
> 
> (Though anyway I'm not sure when AMD vIOMMUs will be ready for device
>  assignment, so I don't think that's a big issue, at least for now)

Even with a true 64-bit address space, it seems very unlikely we'd have
iovas at the very top of that address space.  If we 

Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region

2017-01-19 Thread Peter Xu
On Thu, Jan 19, 2017 at 10:54:37AM -0700, Alex Williamson wrote:
> On Thu, 19 Jan 2017 17:25:29 +0800
> Peter Xu  wrote:
> 
> > This requirement originates from the VT-d vfio series:
> > 
> >   https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html
> > 
> > The goal of this series is to allow the IOMMU to notify an unmap with a
> > very big IOTLB range, for example with base=0 and size=2^63-1 (to unmap
> > the whole address space).
> > 
> > The first patch is a good-to-have, for traces.
> > 
> > The second one is a cleanup of existing code, only.
> 
> Sort of, it does add some overhead to the unmap path, but you remove
> that and more in the third patch.

Hmm, yes, I tried to get the RAM pointer even for unmap. I should
remove the ending "only".

>  
> > The third one moves the RAM translation and check into the map
> > operation logic, freeing the unmap path from that work.
> > 
> > The series is marked as RFC since I am not sure whether this is a
> > workable approach. Anyway, please review and help confirm it.
> 
> It seems reasonable to me,

Good to know that you didn't disagree with this. :) Then let me take
these patches along with that series in the next post (since that
series depends on this one, I'll keep them together; maintainers,
please decide how you'd like to merge them).

> except for the example here of using 2^63-1,
> which I expect is to work around the vfio {un}map API bug as we
> discussed on irc.

Actually it's not only for that one; I don't know whether we have a
problem here:

(I mentioned this in IRC, in case you missed that, I paste it here as
 well)

static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
                                      dma_addr_t start, size_t size)
{
    struct rb_node *node = iommu->dma_list.rb_node;

    while (node) {
        struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);

        if (start + size <= dma->iova)
            node = node->rb_left;
        else if (start >= dma->iova + dma->size)
                          ^
                          Could it overflow here? <--
            node = node->rb_right;
        else
            return dma;
    }

    return NULL;
}

I used 2^63-1 to try to avoid both cases.
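
To make the worry concrete, a tiny standalone demonstration of how the
first comparison can wrap (the numbers are made up):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    uint64_t start = 0x1000;
    uint64_t size  = UINT64_MAX;    /* an "unmap everything" request */
    uint64_t iova  = 0x100000;      /* some existing mapping */

    /* start + size wraps to 0xfff, so "start + size <= dma->iova"
     * holds and the rb-tree search walks left, skipping a mapping the
     * requested range actually covers. */
    printf("start + size = 0x%" PRIx64 ", <= iova? %d\n",
           start + size, start + size <= iova);
    return 0;
}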

> For everyone, the root of the problem is that the
> ioctls use:
> 
> __u64   iova;   /* IO virtual address */
> __u64   size;   /* Size of mapping (bytes) */
> 
> So we can only express a size of 2^64-1 and we have an off-by-one error
> trying to perform a map/unmap of an entire 64-bit space.  Note when
> designing an API, use start/end rather than base/size to avoid this.

Lesson learned.

> 
> What I don't want to see is for this API bug to leak out into the rest
> of the QEMU code such that intel_iommu code, or iommu code in general
> subtly avoids it by artificially using a smaller range.  VT-d hardware
> has an actual physical address space of either 2^39 or 2^48 bits, so if
> you want to make the iommu address space match the device we're trying
> to emulate, that's perfectly fine.  AIUI, AMD-Vi does actually have a
> 64-bit address space on the IOMMU, so to handle that case I'd expect
> the simplest solution would be to track the mapped iova high water
> mark per container in vfio and truncate unmaps to that high water end
> address.  Realistically we're probably not going to see iovas at the end
> of the 64-bit address space, but we can come up with some other
> workaround in the vfio code or update the kernel API if we do.  Thanks,

Agreed that a high watermark can be a good solution for VT-d. I'll use
that instead of 2^63-1. Though for AMD, if it supports 64 bits, we may
still need to solve the existing issues in the future, since in that case
the high watermark can be (hwaddr)-1 as well if the guest specifies it.

(Though anyway I'm not sure when AMD vIOMMUs will be ready for device
 assignment, so I don't think that's a big issue, at least for now)
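
For reference, the per-container tracking Alex describes could be as
small as this sketch (invented names, untested):

#include <stdint.h>

/* Hypothetical per-container state: remember one past the highest iova
 * ever mapped, and truncate huge unmap requests to that point. */
struct vfio_container_hwm {
    uint64_t high_water;        /* one past the highest mapped iova */
};

/* Map path: iova + size is already overflow-checked there. */
static void hwm_record_map(struct vfio_container_hwm *c,
                           uint64_t iova, uint64_t size)
{
    if (iova + size > c->high_water) {
        c->high_water = iova + size;
    }
}

/* Unmap path: clamp the requested range to the high water end. */
static uint64_t hwm_clamp_unmap(struct vfio_container_hwm *c,
                                uint64_t iova, uint64_t size)
{
    if (iova >= c->high_water) {
        return 0;                       /* nothing mapped up there */
    }
    if (size > c->high_water - iova) {
        size = c->high_water - iova;    /* truncate to high water end */
    }
    return size;
}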

Thanks!

-- peterx



Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region

2017-01-19 Thread Alex Williamson
On Thu, 19 Jan 2017 17:25:29 +0800
Peter Xu  wrote:

> This requirement originates from the VT-d vfio series:
> 
>   https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html
> 
> The goal of this series is to allow the IOMMU to notify an unmap with a
> very big IOTLB range, for example with base=0 and size=2^63-1 (to unmap
> the whole address space).
> 
> The first patch is a good-to-have, for traces.
> 
> The second one is a cleanup of existing code, only.

Sort of, it does add some overhead to the unmap path, but you remove
that and more in the third patch.
 
> The third one moves the RAM translation and check into the map
> operation logic, freeing the unmap path from that work.
> 
> The series is marked as RFC since I am not sure whether this is a
> workable approach. Anyway, please review and help confirm it.

It seems reasonable to me, except for the example here of using 2^63-1,
which I expect is to work around the vfio {un}map API bug as we
discussed on irc.  For everyone, the root of the problem is that the
ioctls use:

__u64   iova;   /* IO virtual address */
__u64   size;   /* Size of mapping (bytes) */

So we can only express a size of 2^64-1 and we have an off-by-one error
trying to perform a map/unmap of an entire 64-bit space.  Note when
designing an API, use start/end rather than base/size to avoid this.
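
To spell out the difference (struct names invented for the
illustration):

#include <stdint.h>

struct range_by_size {
    uint64_t iova;      /* base */
    uint64_t size;      /* bytes; the max value is 2^64 - 1, so the full
                         * 64-bit space (base 0, size 2^64) is exactly
                         * one byte out of reach */
};

struct range_by_end {
    uint64_t start;     /* first byte, inclusive */
    uint64_t end;       /* last byte, inclusive; start = 0 and
                         * end = UINT64_MAX covers the entire 64-bit
                         * space with no off-by-one */
};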

What I don't want to see is for this API bug to leak out into the rest
of the QEMU code such that intel_iommu code, or iommu code in general
subtly avoids it by artificially using a smaller range.  VT-d hardware
has an actual physical address space of either 2^39 or 2^48 bits, so if
you want to make the iommu address space match the device we're trying
to emulate, that's perfectly fine.  AIUI, AMD-Vi does actually have a
64-bit address space on the IOMMU, so to handle that case I'd expect
the simplest solution would be to track the mapped iova high water
mark per container in vfio and truncate unmaps to that high water end
address.  Realistically we're probably not going to see iovas at the end
of the 64-bit address space, but we can come up with some other
workaround in the vfio code or update the kernel API if we do.  Thanks,

Alex



[Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region

2017-01-19 Thread Peter Xu
This requirement originates from the VT-d vfio series:

  https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html

The goal of this series is to allow the IOMMU to notify an unmap with a
very big IOTLB range, for example with base=0 and size=2^63-1 (to unmap
the whole address space).

The first patch is a good-to-have, for traces.

The second one is a cleanup of existing code, only.

The third one moves the RAM translation and check into the map
operation logic, freeing the unmap path from that work.
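
As a rough illustration of that split (simplified, with invented names;
the real changes live in hw/vfio/common.c):

#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the translation helper: only the map path needs it. */
static void *get_vaddr_stub(uint64_t translated_addr)
{
    (void)translated_addr;
    return (void *)0x1;         /* pretend the target is backed by RAM */
}

static void iommu_notify(uint64_t iova, uint64_t size,
                         bool is_map, uint64_t translated_addr)
{
    if (is_map) {
        void *vaddr = get_vaddr_stub(translated_addr);
        if (!vaddr) {
            return;             /* not RAM-backed: nothing to map */
        }
        printf("map   iova=0x%" PRIx64 " size=0x%" PRIx64 "\n", iova, size);
    } else {
        /* No translation needed, so an unmap can cover any range,
         * even the whole address space. */
        printf("unmap iova=0x%" PRIx64 " size=0x%" PRIx64 "\n", iova, size);
    }
}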

The series is marked as RFC since I am not sure whether this is a
workable approach. Anyway, please review and help confirm it.

Thanks.

Peter Xu (3):
  vfio: trace map/unmap for notify as well
  vfio: introduce vfio_get_vaddr()
  vfio: allow to notify unmap for very large region

 hw/vfio/common.c | 56 ++--
 hw/vfio/trace-events |  2 +-
 2 files changed, 38 insertions(+), 20 deletions(-)

-- 
2.7.4