Re: [RFC 0/3] iommu: Reserved regions for IOVAs beyond dma_mask and iommu aperture

2020-10-06 Thread Auger Eric
Hi all,

On 10/5/20 3:08 PM, Christoph Hellwig wrote:
> On Mon, Oct 05, 2020 at 11:44:10AM +0100, Lorenzo Pieralisi wrote:
>>> I see that there are both OF and ACPI hooks in pci_dma_configure() and
>>> both modify dev->dma_mask, which is what pci-sysfs is exposing here,
>>> but I'm not convinced this even does what it's intended to do.  The
>>> driver core calls this via the bus->dma_configure callback before
>>> probing a driver, but then what happens when the driver calls
>>> pci_set_dma_mask()?  This is just a wrapper for dma_set_mask() and I
>>> don't see anywhere that would take into account the existing
>>> dev->dma_mask.  It seems for example that pci_dma_configure() could
>>> produce a 42 bit mask as we have here, then the driver could override
>>> that with anything that the dma_ops.dma_supported() callback finds
>>> acceptable, and I don't see any instances where the current
>>> dev->dma_mask is considered.  Am I overlooking something? 
>>
>> I don't think so but Christoph and Robin can provide more input on
>> this - it is a long story.
>>
>> ACPI and OF bindings set a default dma_mask (and dev->bus_dma_limit),
>> this does not prevent a driver from overriding the dev->dma_mask but DMA
>> mapping code still takes into account the dev->bus_dma_limit.
>>
>> This may help:
>>
>> git log -p 03bfdc31176c

Thank you Lorenzo for the pointer.
> 
> This is at best a historic artefact.  Bus drivers have no business
> messing with the DMA mask, dev->bus_dma_limit is the way to communicate
> addressing limits on the bus (or another interconnect closer to the CPU).
> 
Could I then use dev->bus_dma_limit instead of dev->dma_mask?

Nevertheless, I would need a way to let userspace know that the
usable IOVA ranges reported by VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
take into account the addressing limits of the bus.
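FWIW, here is how I picture the userspace side consuming such a limit,
once it is known. This is only a sketch: struct iova_range mirrors
struct vfio_iova_range from the VFIO UAPI, and clip_iova_ranges() is a
made-up helper, not an existing QEMU or kernel function:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors struct vfio_iova_range from <linux/vfio.h>. */
struct iova_range {
	uint64_t start;
	uint64_t end;	/* inclusive */
};

/*
 * Clip the usable IOVA ranges reported by
 * VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE against a device/bus
 * addressing limit (e.g. a 42-bit mask). Rewrites the array in
 * place and returns the number of ranges kept.
 * Hypothetical helper for illustration only.
 */
static size_t clip_iova_ranges(struct iova_range *r, size_t n,
			       uint64_t limit)
{
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		if (r[i].start > limit)
			continue;	/* entirely above the limit */
		r[out] = r[i];
		if (r[out].end > limit)
			r[out].end = limit;
		out++;
	}
	return out;
}
```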

Thanks

Eric



Re: [RFC 0/3] iommu: Reserved regions for IOVAs beyond dma_mask and iommu aperture

2020-10-05 Thread Christoph Hellwig
On Mon, Oct 05, 2020 at 11:44:10AM +0100, Lorenzo Pieralisi wrote:
> > I see that there are both OF and ACPI hooks in pci_dma_configure() and
> > both modify dev->dma_mask, which is what pci-sysfs is exposing here,
> > but I'm not convinced this even does what it's intended to do.  The
> > driver core calls this via the bus->dma_configure callback before
> > probing a driver, but then what happens when the driver calls
> > pci_set_dma_mask()?  This is just a wrapper for dma_set_mask() and I
> > don't see anywhere that would take into account the existing
> > dev->dma_mask.  It seems for example that pci_dma_configure() could
> > produce a 42 bit mask as we have here, then the driver could override
> > that with anything that the dma_ops.dma_supported() callback finds
> > acceptable, and I don't see any instances where the current
> > dev->dma_mask is considered.  Am I overlooking something? 
> 
> I don't think so but Christoph and Robin can provide more input on
> this - it is a long story.
> 
> ACPI and OF bindings set a default dma_mask (and dev->bus_dma_limit),
> this does not prevent a driver from overriding the dev->dma_mask but DMA
> mapping code still takes into account the dev->bus_dma_limit.
> 
> This may help:
> 
> git log -p 03bfdc31176c

This is at best a historic artefact.  Bus drivers have no business
messing with the DMA mask, dev->bus_dma_limit is the way to communicate
addressing limits on the bus (or another interconnect closer to the CPU).
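To make the above concrete: the DMA mapping code effectively clamps the
driver-owned dma_mask with the firmware-described bus_dma_limit. A
simplified model of the min_not_zero() clamping done in
kernel/dma/direct.c (not the actual kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/*
 * Simplified model of how the DMA API derives the effective
 * addressing limit: the driver-owned dma_mask is combined with the
 * firmware-described bus_dma_limit, where a zero bus_dma_limit
 * means "no bus limit".
 */
static uint64_t effective_dma_limit(uint64_t dma_mask,
				    uint64_t bus_dma_limit)
{
	if (bus_dma_limit == 0)
		return dma_mask;
	return dma_mask < bus_dma_limit ? dma_mask : bus_dma_limit;
}
```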


Re: [RFC 0/3] iommu: Reserved regions for IOVAs beyond dma_mask and iommu aperture

2020-10-05 Thread Lorenzo Pieralisi
[+Christoph]

On Tue, Sep 29, 2020 at 12:18:49PM -0600, Alex Williamson wrote:
> On Tue, 29 Sep 2020 09:18:22 +0200
> Auger Eric  wrote:
> 
> > Hi all,
> > 
> > [also correcting some outdated email addresses + adding Lorenzo in cc]
> > 
> > On 9/29/20 12:42 AM, Alex Williamson wrote:
> > > On Mon, 28 Sep 2020 21:50:34 +0200
> > > Eric Auger  wrote:
> > >   
> > >> VFIO currently exposes the usable IOVA regions through the
> > >> VFIO_IOMMU_GET_INFO ioctl / VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
> > >> capability. However it fails to take into account the dma_mask
> > >> of the devices within the container. The top limit currently is
> > >> defined by the iommu aperture.  
> > > 
> > > I think that dma_mask is traditionally a DMA API interface for a device
> > > driver to indicate to the DMA layer which mappings are accessible to the
> > > device.  On the other hand, vfio makes use of the IOMMU API where the
> > > driver is in userspace.  That userspace driver has full control of the
> > > IOVA range of the device, therefore dma_mask is mostly irrelevant to
> > > vfio.  I think the issue you're trying to tackle is that the IORT code
> > > is making use of the dma_mask to try to describe a DMA address
> > > limitation imposed by the PCI root bus, living between the endpoint
> > > device and the IOMMU.  Therefore, if the IORT code is exposing a
> > > topology or system imposed device limitation, this seems much more akin
> > > to something like an MSI reserved range, where it's not necessarily the
> > > device or the IOMMU with the limitation, but something that sits
> > > between them.  
> > 
> > First I think I failed to explain the context. I worked on NVMe
> > passthrough on ARM. The QEMU NVMe backend uses VFIO to program the
> > physical device. The IOVA allocator there currently uses an IOVA range
> > within [0x10000, 1ULL << 39]. This IOVA layout is rather arbitrary if I
> > understand correctly.
> 
> 39 bits is the minimum available on some VT-d systems, so it was
> probably considered a reasonable minimum address width to consider.
> 
> > I noticed we rapidly get some VFIO MAP DMA
> > failures because the allocated IOVAs collide with the ARM MSI reserved
> > IOVA window [0x8000000, 0x8100000]. Since 9b77e5c79840 ("vfio/type1:
> > Check reserved region conflict and update iova list"), such VFIO MAP DMA
> > attempts to map IOVAs belonging to host reserved IOVA windows fail. So,
> > by using the VFIO_IOMMU_GET_INFO ioctl /
> > VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE I can change the IOVA allocator to
> > avoid allocating within this range and others. While working on this, I
> > tried to automatically compute the min/max IOVAs and change the
> > arbitrary [0x10000, 1ULL << 39]. My SMMUv2 supports up to 48b so
> > naturally the max IOVA was computed as 1ULL << 48. The QEMU NVMe backend
> > allocates at the bottom and at the top of the range. I noticed the use
> > case was not working as soon as the top IOVA was more than 1ULL << 42.
> > And then we noticed the dma_mask was set to 42 by using
> > cat /sys/bus/pci/devices/0005:01:00.0/dma_mask_bits. So my
> > interpretation is the dma_mask was somehow containing the info the
> > device couldn't handle IOVAs beyond a certain limit.
> 
> I see that there are both OF and ACPI hooks in pci_dma_configure() and
> both modify dev->dma_mask, which is what pci-sysfs is exposing here,
> but I'm not convinced this even does what it's intended to do.  The
> driver core calls this via the bus->dma_configure callback before
> probing a driver, but then what happens when the driver calls
> pci_set_dma_mask()?  This is just a wrapper for dma_set_mask() and I
> don't see anywhere that would take into account the existing
> dev->dma_mask.  It seems for example that pci_dma_configure() could
> produce a 42 bit mask as we have here, then the driver could override
> that with anything that the dma_ops.dma_supported() callback finds
> acceptable, and I don't see any instances where the current
> dev->dma_mask is considered.  Am I overlooking something? 

I don't think so but Christoph and Robin can provide more input on
this - it is a long story.

ACPI and OF bindings set a default dma_mask (and dev->bus_dma_limit),
this does not prevent a driver from overriding the dev->dma_mask but DMA
mapping code still takes into account the dev->bus_dma_limit.

This may help:

git log -p 03bfdc31176c

Thanks,
Lorenzo


Re: [RFC 0/3] iommu: Reserved regions for IOVAs beyond dma_mask and iommu aperture

2020-09-30 Thread Auger Eric
Hi Alex,

On 9/29/20 8:18 PM, Alex Williamson wrote:
> On Tue, 29 Sep 2020 09:18:22 +0200
> Auger Eric  wrote:
> 
>> Hi all,
>>
>> [also correcting some outdated email addresses + adding Lorenzo in cc]
>>
>> On 9/29/20 12:42 AM, Alex Williamson wrote:
>>> On Mon, 28 Sep 2020 21:50:34 +0200
>>> Eric Auger  wrote:
>>>   
 VFIO currently exposes the usable IOVA regions through the
 VFIO_IOMMU_GET_INFO ioctl / VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
 capability. However it fails to take into account the dma_mask
 of the devices within the container. The top limit currently is
 defined by the iommu aperture.  
>>>
>>> I think that dma_mask is traditionally a DMA API interface for a device
>>> driver to indicate to the DMA layer which mappings are accessible to the
>>> device.  On the other hand, vfio makes use of the IOMMU API where the
>>> driver is in userspace.  That userspace driver has full control of the
>>> IOVA range of the device, therefore dma_mask is mostly irrelevant to
>>> vfio.  I think the issue you're trying to tackle is that the IORT code
>>> is making use of the dma_mask to try to describe a DMA address
>>> limitation imposed by the PCI root bus, living between the endpoint
>>> device and the IOMMU.  Therefore, if the IORT code is exposing a
>>> topology or system imposed device limitation, this seems much more akin
>>> to something like an MSI reserved range, where it's not necessarily the
>>> device or the IOMMU with the limitation, but something that sits
>>> between them.  
>>
>> First I think I failed to explain the context. I worked on NVMe
>> passthrough on ARM. The QEMU NVMe backend uses VFIO to program the
>> physical device. The IOVA allocator there currently uses an IOVA range
>> within [0x10000, 1ULL << 39]. This IOVA layout is rather arbitrary if I
>> understand correctly.
> 
> 39 bits is the minimum available on some VT-d systems, so it was
> probably considered a reasonable minimum address width to consider.
OK
> 
>> I noticed we rapidly get some VFIO MAP DMA
>> failures because the allocated IOVAs collide with the ARM MSI reserved
>> IOVA window [0x8000000, 0x8100000]. Since 9b77e5c79840 ("vfio/type1:
>> Check reserved region conflict and update iova list"), such VFIO MAP DMA
>> attempts to map IOVAs belonging to host reserved IOVA windows fail. So,
>> by using the VFIO_IOMMU_GET_INFO ioctl /
>> VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE I can change the IOVA allocator to
>> avoid allocating within this range and others. While working on this, I
>> tried to automatically compute the min/max IOVAs and change the
>> arbitrary [0x10000, 1ULL << 39]. My SMMUv2 supports up to 48b so
>> naturally the max IOVA was computed as 1ULL << 48. The QEMU NVMe backend
>> allocates at the bottom and at the top of the range. I noticed the use
>> case was not working as soon as the top IOVA was more than 1ULL << 42.
>> And then we noticed the dma_mask was set to 42 by using
>> cat /sys/bus/pci/devices/0005:01:00.0/dma_mask_bits. So my
>> interpretation is the dma_mask was somehow containing the info the
>> device couldn't handle IOVAs beyond a certain limit.
> 
> I see that there are both OF and ACPI hooks in pci_dma_configure() and
> both modify dev->dma_mask, which is what pci-sysfs is exposing here,
> but I'm not convinced this even does what it's intended to do.  The
> driver core calls this via the bus->dma_configure callback before
> probing a driver, but then what happens when the driver calls
> pci_set_dma_mask()?  This is just a wrapper for dma_set_mask() and I
> don't see anywhere that would take into account the existing
> dev->dma_mask.  It seems for example that pci_dma_configure() could
> produce a 42 bit mask as we have here, then the driver could override
> that with anything that the dma_ops.dma_supported() callback finds
> acceptable, and I don't see any instances where the current
> dev->dma_mask is considered.  Am I overlooking something? 

I don't see it either. So the dma_mask set by the driver would never be
checked against the dma_mask limit found when parsing OF/ACPI?
>  
>> In my case the 42b limit is computed in iort_dma_setup() by
>> acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
>>
>> Referring to the comment, it does "Evaluate DMA regions and return
>> respectively DMA region start, offset and size in dma_addr, offset and
>> size on parsing success". This parses the ACPI table, looking for ACPI
>> companions with _DMA methods.
>>
>> But as Alex mentioned, the IORT also allows defining limits on "the
>> number of address bits, starting from the least significant bit that can
>> be generated by a device when it accesses memory". See the Named component
>> node's Device Memory Address Size Limit or the PCI root complex node's
>> Memory address size limit.
>>
>> ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
>> if (ret == -ENODEV)
>> ret = dev_is_pci(dev) ? rc_dma_get_range(dev, &size)
>>   : nc_dma_get_range(dev, &size);

Re: [RFC 0/3] iommu: Reserved regions for IOVAs beyond dma_mask and iommu aperture

2020-09-29 Thread Alex Williamson
On Tue, 29 Sep 2020 09:18:22 +0200
Auger Eric  wrote:

> Hi all,
> 
> [also correcting some outdated email addresses + adding Lorenzo in cc]
> 
> On 9/29/20 12:42 AM, Alex Williamson wrote:
> > On Mon, 28 Sep 2020 21:50:34 +0200
> > Eric Auger  wrote:
> >   
> >> VFIO currently exposes the usable IOVA regions through the
> >> VFIO_IOMMU_GET_INFO ioctl / VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
> >> capability. However it fails to take into account the dma_mask
> >> of the devices within the container. The top limit currently is
> >> defined by the iommu aperture.  
> > 
> > I think that dma_mask is traditionally a DMA API interface for a device
> > driver to indicate to the DMA layer which mappings are accessible to the
> > device.  On the other hand, vfio makes use of the IOMMU API where the
> > driver is in userspace.  That userspace driver has full control of the
> > IOVA range of the device, therefore dma_mask is mostly irrelevant to
> > vfio.  I think the issue you're trying to tackle is that the IORT code
> > is making use of the dma_mask to try to describe a DMA address
> > limitation imposed by the PCI root bus, living between the endpoint
> > device and the IOMMU.  Therefore, if the IORT code is exposing a
> > topology or system imposed device limitation, this seems much more akin
> > to something like an MSI reserved range, where it's not necessarily the
> > device or the IOMMU with the limitation, but something that sits
> > between them.  
> 
> First I think I failed to explain the context. I worked on NVMe
> passthrough on ARM. The QEMU NVMe backend uses VFIO to program the
> physical device. The IOVA allocator there currently uses an IOVA range
> within [0x10000, 1ULL << 39]. This IOVA layout is rather arbitrary if I
> understand correctly.

39 bits is the minimum available on some VT-d systems, so it was
probably considered a reasonable minimum address width to consider.

> I noticed we rapidly get some VFIO MAP DMA
> failures because the allocated IOVAs collide with the ARM MSI reserved
> IOVA window [0x8000000, 0x8100000]. Since 9b77e5c79840 ("vfio/type1:
> Check reserved region conflict and update iova list"), such VFIO MAP DMA
> attempts to map IOVAs belonging to host reserved IOVA windows fail. So,
> by using the VFIO_IOMMU_GET_INFO ioctl /
> VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE I can change the IOVA allocator to
> avoid allocating within this range and others. While working on this, I
> tried to automatically compute the min/max IOVAs and change the
> arbitrary [0x10000, 1ULL << 39]. My SMMUv2 supports up to 48b so
> naturally the max IOVA was computed as 1ULL << 48. The QEMU NVMe backend
> allocates at the bottom and at the top of the range. I noticed the use
> case was not working as soon as the top IOVA was more than 1ULL << 42.
> And then we noticed the dma_mask was set to 42 by using
> cat /sys/bus/pci/devices/0005:01:00.0/dma_mask_bits. So my
> interpretation is the dma_mask was somehow containing the info the
> device couldn't handle IOVAs beyond a certain limit.

I see that there are both OF and ACPI hooks in pci_dma_configure() and
both modify dev->dma_mask, which is what pci-sysfs is exposing here,
but I'm not convinced this even does what it's intended to do.  The
driver core calls this via the bus->dma_configure callback before
probing a driver, but then what happens when the driver calls
pci_set_dma_mask()?  This is just a wrapper for dma_set_mask() and I
don't see anywhere that would take into account the existing
dev->dma_mask.  It seems for example that pci_dma_configure() could
produce a 42 bit mask as we have here, then the driver could override
that with anything that the dma_ops.dma_supported() callback finds
acceptable, and I don't see any instances where the current
dev->dma_mask is considered.  Am I overlooking something? 
 
> In my case the 42b limit is computed in iort_dma_setup() by
> acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
> 
> Referring to the comment, it does "Evaluate DMA regions and return
> respectively DMA region start, offset and size in dma_addr, offset and
> size on parsing success". This parses the ACPI table, looking for ACPI
> companions with _DMA methods.
> 
> But as Alex mentioned, the IORT also allows defining limits on "the
> number of address bits, starting from the least significant bit that can
> be generated by a device when it accesses memory". See the Named component
> node's Device Memory Address Size Limit or the PCI root complex node's
> Memory address size limit.
> 
> ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
> if (ret == -ENODEV)
> ret = dev_is_pci(dev) ? rc_dma_get_range(dev, &size)
>   : nc_dma_get_range(dev, &size);
> 
> So eventually those info collected from the ACPI tables which do impact
> the usable IOVA range seem to be stored in the dma_mask, hence that
> proposal.

As above, it's not clear to me that anyone other than the driver and
the dma_s

Re: [RFC 0/3] iommu: Reserved regions for IOVAs beyond dma_mask and iommu aperture

2020-09-29 Thread Auger Eric
Hi all,

[also correcting some outdated email addresses + adding Lorenzo in cc]

On 9/29/20 12:42 AM, Alex Williamson wrote:
> On Mon, 28 Sep 2020 21:50:34 +0200
> Eric Auger  wrote:
> 
>> VFIO currently exposes the usable IOVA regions through the
>> VFIO_IOMMU_GET_INFO ioctl / VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
>> capability. However it fails to take into account the dma_mask
>> of the devices within the container. The top limit currently is
>> defined by the iommu aperture.
> 
> I think that dma_mask is traditionally a DMA API interface for a device
> driver to indicate to the DMA layer which mappings are accessible to the
> device.  On the other hand, vfio makes use of the IOMMU API where the
> driver is in userspace.  That userspace driver has full control of the
> IOVA range of the device, therefore dma_mask is mostly irrelevant to
> vfio.  I think the issue you're trying to tackle is that the IORT code
> is making use of the dma_mask to try to describe a DMA address
> limitation imposed by the PCI root bus, living between the endpoint
> device and the IOMMU.  Therefore, if the IORT code is exposing a
> topology or system imposed device limitation, this seems much more akin
> to something like an MSI reserved range, where it's not necessarily the
> device or the IOMMU with the limitation, but something that sits
> between them.

First I think I failed to explain the context. I worked on NVMe
passthrough on ARM. The QEMU NVMe backend uses VFIO to program the
physical device. The IOVA allocator there currently uses an IOVA range
within [0x10000, 1ULL << 39]. This IOVA layout is rather arbitrary if I
understand correctly. I noticed we rapidly get some VFIO MAP DMA
failures because the allocated IOVAs collide with the ARM MSI reserved
IOVA window [0x8000000, 0x8100000]. Since 9b77e5c79840 ("vfio/type1:
Check reserved region conflict and update iova list"), such VFIO MAP DMA
attempts to map IOVAs belonging to host reserved IOVA windows fail. So,
by using the VFIO_IOMMU_GET_INFO ioctl /
VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE I can change the IOVA allocator to
avoid allocating within this range and others. While working on this, I
tried to automatically compute the min/max IOVAs and change the
arbitrary [0x10000, 1ULL << 39]. My SMMUv2 supports up to 48b so
naturally the max IOVA was computed as 1ULL << 48. The QEMU NVMe backend
allocates at the bottom and at the top of the range. I noticed the use
case was not working as soon as the top IOVA was more than 1ULL << 42.
And then we noticed the dma_mask was set to 42 by using
cat /sys/bus/pci/devices/0005:01:00.0/dma_mask_bits. So my
interpretation is the dma_mask was somehow containing the info the
device couldn't handle IOVAs beyond a certain limit.

In my case the 42b limit is computed in iort_dma_setup() by
acpi_dma_get_range(dev, &dmaaddr, &offset, &size);

Referring to the comment, it does "Evaluate DMA regions and return
respectively DMA region start, offset and size in dma_addr, offset and
size on parsing success". This parses the ACPI table, looking for ACPI
companions with _DMA methods.

But as Alex mentioned, the IORT also allows defining limits on "the
number of address bits, starting from the least significant bit that can
be generated by a device when it accesses memory". See the Named component
node's Device Memory Address Size Limit or the PCI root complex node's
Memory address size limit.

ret = acpi_dma_get_range(dev, &dmaaddr, &offset, &size);
if (ret == -ENODEV)
ret = dev_is_pci(dev) ? rc_dma_get_range(dev, &size)
  : nc_dma_get_range(dev, &size);

So eventually those info collected from the ACPI tables which do impact
the usable IOVA range seem to be stored in the dma_mask, hence that
proposal.
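To illustrate how the parsed region ends up as a 42-bit mask: the mask
is sized to cover the highest reachable address of the region. A
simplified sketch modelled on iort_dma_setup() (no overflow handling,
valid for results below 64 bits):

```c
#include <assert.h>
#include <stdint.h>

#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Position of the highest set bit, i.e. ilog2() for nonzero values. */
static unsigned int ilog2_u64(uint64_t v)
{
	unsigned int n = 0;

	while (v >>= 1)
		n++;
	return n;
}

/*
 * Sketch of how iort_dma_setup() turns the region returned by
 * acpi_dma_get_range()/rc_dma_get_range() into a DMA mask: the mask
 * must cover the highest reachable address, dmaaddr + size - 1.
 */
static uint64_t iort_mask(uint64_t dmaaddr, uint64_t size)
{
	uint64_t end = dmaaddr + size - 1;

	return DMA_BIT_MASK(ilog2_u64(end) + 1);
}
```

With dmaaddr = 0 and size = 1ULL << 42 this yields the 42-bit mask
observed in dma_mask_bits above.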

> 
>> So, for instance, if the IOMMU supports up to 48 bits, it may give
>> the impression the max IOVA is 48b while a device may have a
>> dma_mask of 42b. So this API cannot really be used to compute
>> the max usable IOVA.
>>
>> This patch removes the IOVA region beyond the dma_mask's.
> 
> Rather it adds a reserved region accounting for the range above the
> device's dma_mask.

Yep. It adds new reserved regions in
/sys/kernel/iommu_groups/<n>/reserved_regions and removes those from the
usable regions exposed by VFIO GET_INFO.
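The pruning done on the VFIO side is essentially interval subtraction.
A sketch of the idea behind the type1 code that excludes reserved
windows when building the reported ranges (reduced to one usable range
and one reserved window; not the actual kernel code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range {
	uint64_t start, end;	/* inclusive */
};

/*
 * Subtract one reserved window from a usable IOVA range, as the vfio
 * type1 driver conceptually does when building the ranges reported
 * via VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE. Writes 0, 1 or 2
 * resulting ranges into out[] and returns how many were written.
 */
static size_t range_sub(struct range usable, struct range resv,
			struct range out[2])
{
	size_t n = 0;

	if (resv.end < usable.start || resv.start > usable.end) {
		out[n++] = usable;	/* no overlap */
		return n;
	}
	if (resv.start > usable.start)
		out[n++] = (struct range){ usable.start, resv.start - 1 };
	if (resv.end < usable.end)
		out[n++] = (struct range){ resv.end + 1, usable.end };
	return n;
}
```

With the ARM MSI window carved out of [0x10000, 1ULL << 39] this
produces the two usable ranges userspace sees.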

> I don't think the IOMMU API should be consuming
> dma_mask like this though.  For example, what happens in
> pci_dma_configure() when there are no OF or ACPI DMA restrictions?
My guess was that the dma_mask was set to the max range but I did not
test it.
> It
> appears to me that the dma_mask from whatever previous driver had the
> device carries over to the new driver.  That's generally ok for the DMA
> API because a driver is required to set the device's DMA mask.  It
> doesn't make sense however to blindly consume that dma_mask and export
> it via an IOMMU API.  For example I would expect to see different
> results depending on whether a host driver has been bound to a device.

Re: [RFC 0/3] iommu: Reserved regions for IOVAs beyond dma_mask and iommu aperture

2020-09-28 Thread Alex Williamson
On Mon, 28 Sep 2020 21:50:34 +0200
Eric Auger  wrote:

> VFIO currently exposes the usable IOVA regions through the
> VFIO_IOMMU_GET_INFO ioctl / VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
> capability. However it fails to take into account the dma_mask
> of the devices within the container. The top limit currently is
> defined by the iommu aperture.

I think that dma_mask is traditionally a DMA API interface for a device
driver to indicate to the DMA layer which mappings are accessible to the
device.  On the other hand, vfio makes use of the IOMMU API where the
driver is in userspace.  That userspace driver has full control of the
IOVA range of the device, therefore dma_mask is mostly irrelevant to
vfio.  I think the issue you're trying to tackle is that the IORT code
is making use of the dma_mask to try to describe a DMA address
limitation imposed by the PCI root bus, living between the endpoint
device and the IOMMU.  Therefore, if the IORT code is exposing a
topology or system imposed device limitation, this seems much more akin
to something like an MSI reserved range, where it's not necessarily the
device or the IOMMU with the limitation, but something that sits
between them.

> So, for instance, if the IOMMU supports up to 48 bits, it may give
> the impression the max IOVA is 48b while a device may have a
> dma_mask of 42b. So this API cannot really be used to compute
> the max usable IOVA.
> 
> This patch removes the IOVA region beyond the dma_mask's.

Rather it adds a reserved region accounting for the range above the
device's dma_mask.  I don't think the IOMMU API should be consuming
dma_mask like this though.  For example, what happens in
pci_dma_configure() when there are no OF or ACPI DMA restrictions?  It
appears to me that the dma_mask from whatever previous driver had the
device carries over to the new driver.  That's generally ok for the DMA
API because a driver is required to set the device's DMA mask.  It
doesn't make sense however to blindly consume that dma_mask and export
it via an IOMMU API.  For example I would expect to see different
results depending on whether a host driver has been bound to a device.
It seems the correct IOMMU API approach would be for the IORT code to
specifically register reserved ranges for the device.
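If the IORT code did register such a range itself, the region in
question is simply everything above the addressable span. A sketch of
the arithmetic only; the actual registration would go through
something like iommu_alloc_resv_region(), and the helper below is
hypothetical:

```c
#include <assert.h>
#include <stdint.h>

struct resv_region {
	uint64_t start;
	uint64_t length;
};

/*
 * Compute the reserved region covering everything a device with an
 * addr_bits-wide addressing limit cannot reach, i.e. the span
 * [1 << addr_bits, ~0ULL]. Valid for addr_bits < 64. Hypothetical
 * helper illustrating what the IORT code could register via
 * iommu_alloc_resv_region().
 */
static struct resv_region resv_above_limit(unsigned int addr_bits)
{
	uint64_t start = 1ULL << addr_bits;

	return (struct resv_region){
		.start = start,
		/* Runs to the top of the 64-bit address space. */
		.length = ~0ULL - start + 1,
	};
}
```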

> As we start to expose this reserved region in the sysfs file
> /sys/kernel/iommu_groups/<n>/reserved_regions, we also need to
> handle the IOVA range beyond the IOMMU aperture to handle the case
> where the dma_mask would have a higher number of bits than the iommu
> max input address.

Why?  The IOMMU geometry already describes this and vfio combines both
the IOMMU geometry and the device reserved regions when generating the
IOVA ranges?  Who is going to consume this information?  Additionally
it appears that reserved regions will report different information
depending on whether a device is attached to a domain.

> This is a change to the ABI as this reserved region was not yet
> exposed in sysfs /sys/kernel/iommu_groups/<n>/reserved_regions or
> through the VFIO ioctl. At VFIO level we increment the version of
> the VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE capability to advertise
> that change.

Is this really an ABI change?  The original entry for reserved regions
includes:

  Not necessarily all reserved regions are listed. This is typically
  used to output direct-mapped, MSI, non mappable regions.

I imagine the intention here was non-mappable relative to the IOMMU,
but non-mappable to the device is essentially what we're including
here.

I'm also concerned about bumping the vfio interface version for the
IOVA range.  We're not changing the interface, we're modifying the
result, and even then only for a fraction of users.  How many users are
potentially broken by that change?  Are we going to bump the version
for everyone any time the result changes on any platform?  Thanks,

Alex