Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-25 Thread Joao Martins



On 2/25/22 16:15, Alex Williamson wrote:
> On Fri, 25 Feb 2022 12:36:24 +
> Joao Martins  wrote:
> 
>> On 2/24/22 21:40, Alex Williamson wrote:
>>> On Thu, 24 Feb 2022 20:34:40 +
>>> Joao Martins  wrote:
 Of all those cases I would feel the machine-property is better,
 and more flexible than having VFIO/VDPA deal with a bad memory-layout and
 discovering late stage that the user is doing something wrong (and thus
 fail the DMA_MAP operation for those who do check invalid iovas)  
>>>
>>> The trouble is that anything we can glean from the host system where we
>>> instantiate the VM is mostly meaningless relative to data center
>>> orchestration.  We're relatively insulated from these sorts of issues
>>> on x86 (apparently aside from this case), AIUI ARM is even worse about
>>> having arbitrary reserved ranges within their IOVA space.
>>>   
>> In the multi-socket servers we have for ARM I haven't seen many
>> issues /yet/ with VFIO. I only have this reserved region:
>>
>> 0x0800 0x080f msi
>>
>> But of course ARM servers aren't very good representatives of the
>> shifting nature of other ARM machine models. ISTR some thread about GIC ITS
>> ranges being reserved by the IOMMU in some hardware. Perhaps that's what you
>> might be referring to:
>>
>> https://lore.kernel.org/qemu-devel/1510622154-17224-1-git-send-email-zhuyi...@huawei.com/
> 
> 
> Right, and notice there also that the msi range is different.  On x86
> the msi block is defined by the processor, not the platform and we have
> commonality between Intel and AMD on that range. 

Wasn't aware of that part, i.e. the ARM platform defining it as opposed to
it being architectural -- thanks for the insight.

> We emulate the same
> range in the guest, so for any x86 guest running on an x86 host, the
> msi range is a non-issue because they overlap due to the architectural
> standards.
> 
> How do you create an ARM guest that reserves a block at both 0x800
> for your host and 0xc600 for the host in the above link?  Whatever
> solution we develop to resolve that issue should equally apply to the
> AMD reserved block:
> 
> 0x00fd 0x00ff reserved
> 
>>> For a comprehensive solution, it's not a machine accelerator property
>>> or enable such-and-such functionality flag, it's the ability to specify
>>> a VM memory map on the QEMU command line and data center orchestration
>>> tools gaining insight across all their systems to specify a memory
>>> layout that can work regardless of how a VM might be migrated. 
>>> Maybe
>>> there's a "host" option to that memory map command line option that
>>> would take into account the common case of a static host or at least
>>> homogeneous set of hosts.  Overall, it's not unlike specifying CPU flags
>>> to generate a least common denominator set such that the VM is
>>> compatible to any host in the cluster.
>>>   
>>
>> I remember you iterated on such an idea over the initial RFC. I do like the
>> option of adjusting the memory map, should any new restrictions appear in
>> the IOVA space, as opposed to having to change the machine code every time
>> that happens.
>>
>>
>> I am trying to approach this iteratively, starting by fixing AMD 1T+ guests
>> with something that hopefully is less painful to bear and unbreaks users
>> doing multi-TB guests on kernels >= 5.4, while for < 5.4 it would no longer
>> wrongly DMA-map bad IOVAs that may lead to the guest's own spurious failures.
>> For the long term, qemu would need some sort of handling of a configurable
>> sparse map of all guest RAM, which currently does not exist (and it's stuffed
>> inside on a per-machine basis as you're aware). What I am unsure of is the
>> churn associated with it (compat, migration, mem-hotplug, nvdimms,
>> memory-backends) versus the benefit if it's "just" one class of x86 platforms
>> (Intel not affected) -- which is what I find attractive with the past 2
>> revisions via a smaller change.
>>
>>> On the device end, I really would prefer not to see device driver
>>> specific enables and we simply cannot hot-add a device of the given
>>> type without a pre-enabled VM.  Give the user visibility and
>>> configurability to the issue and simply fail the device add (ideally
>>> with a useful error message) if the device IOVA space cannot support
>>> the VM memory layout (this is what vfio already does afaik).
>>>
>>> When we have iommufd support common for vfio and vdpa, hopefully we'll
>>> also be able to recommend a common means for learning about system and
>>> IOMMU restrictions to IOVA spaces.   
>>
>> Perhaps even advertising platform-wide regions (without a domain allocated)
>> that are common to any protection domain (for example, on x86 this is one
>> such case, where the MSI/HT ranges are hardcoded on Intel/AMD).
>>
>>> For now we have reserved_regions
>>> reported in sysfs per IOMMU group:
>>>
>>>  $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-25 Thread Joao Martins
On 2/25/22 12:49, Michael S. Tsirkin wrote:
> On Fri, Feb 25, 2022 at 12:36:24PM +, Joao Martins wrote:
>> I am trying to approach this iteratively, starting by fixing AMD 1T+ guests
>> with something that hopefully is less painful to bear and unbreaks users
>> doing multi-TB guests on kernels >= 5.4, while for < 5.4 it would no longer
>> wrongly DMA-map bad IOVAs that may lead to the guest's own spurious failures.
>> For the long term, qemu would need some sort of handling of a configurable
>> sparse map of all guest RAM, which currently does not exist (and it's stuffed
>> inside on a per-machine basis as you're aware). What I am unsure of is the
>> churn associated with it (compat, migration, mem-hotplug, nvdimms,
>> memory-backends) versus the benefit if it's "just" one class of x86 platforms
>> (Intel not affected) -- which is what I find attractive with the past 2
>> revisions via a smaller change.
> 
> Right. I pondered this for a while and I wonder whether you considered
> making this depend on the guest cpu vendor and max phys bits. 

Hmmm, I am considering phys-bits already (or +host-phys-bits), but not
max_host_phys_bits; I am not sure the latter is relevant for this case.
phys-bits is what we need to gate on, as that's what's ultimately exposed to
the guest based on the various -cpu options. I can bring back what v2 and
prior did: when considering relocating, bail out if phys-bits aren't enough.

> Things
> are easier to debug if the memory map is the same whatever the host. The
> guest vendor typically matches the host cpu vendor after all, and there
> just could be guests avoiding the reserved memory ranges on principle.
> 
Regarding guest CPU vendor: gating on the guest CPU vendor alone actually
increases the span of guests it might affect, compared to just checking for
host AMD IOMMU existence. Checking the AMD IOMMU would exclude no-host-IOMMU
1T AMD guests, which do not need to consider this HT reserved range.

I can restrict this to the guest CPU being AMD solely (assuming
-cpu host is covered too), if folks have mixed feelings about checking
for the host AMD IOMMU.

To be clear, checking the guest CPU vendor alone would not capture the case
of using -cpu {Skylake,...} on an AMD host, so the failure would occur just
the same. I assume you're OK with that.

> We'll need a bunch of code comments explaining all this hackery, as well
> as machine type compat things, but that is par for the course.
> 
> Additionally, we could have a host check and then fail to init vdpa and
> vfio devices if the memory map will make some memory inaccessible.
> 
> Does this sound reasonable to others? Alex? Joao?
> 
The earlier part sounds reasonable.

Regarding the device-init failure logic, I think the only one that might need
checking is vDPA, as VFIO already validates that sort of thing (on >= 5.4).
Albeit, given how agnostic this is to the -device list (the memory map gets
adjusted to make it work for vfio/vdpa regardless, whether gated solely on
guest AMD vendor or on host AMD IOMMU existence) ... I am not sure this is
needed.



Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-25 Thread Alex Williamson
On Fri, 25 Feb 2022 12:36:24 +
Joao Martins  wrote:

> On 2/24/22 21:40, Alex Williamson wrote:
> > On Thu, 24 Feb 2022 20:34:40 +
> > Joao Martins  wrote:
> >> Of all those cases I would feel the machine-property is better,
> >> and more flexible than having VFIO/VDPA deal with a bad memory-layout and
> >> discovering late stage that the user is doing something wrong (and thus
> >> fail the DMA_MAP operation for those who do check invalid iovas)  
> > 
> > The trouble is that anything we can glean from the host system where we
> > instantiate the VM is mostly meaningless relative to data center
> > orchestration.  We're relatively insulated from these sorts of issues
> > on x86 (apparently aside from this case), AIUI ARM is even worse about
> > having arbitrary reserved ranges within their IOVA space.
> >   
> In the multi-socket servers we have for ARM I haven't seen many
> issues /yet/ with VFIO. I only have this reserved region:
> 
> 0x0800 0x080f msi
> 
> But of course ARM servers aren't very good representatives of the
> shifting nature of other ARM machine models. ISTR some thread about GIC ITS
> ranges being reserved by the IOMMU in some hardware. Perhaps that's what you
> might be referring to:
> 
> https://lore.kernel.org/qemu-devel/1510622154-17224-1-git-send-email-zhuyi...@huawei.com/


Right, and notice there also that the msi range is different.  On x86
the msi block is defined by the processor, not the platform and we have
commonality between Intel and AMD on that range.  We emulate the same
range in the guest, so for any x86 guest running on an x86 host, the
msi range is a non-issue because they overlap due to the architectural
standards.

How do you create an ARM guest that reserves a block at both 0x800
for your host and 0xc600 for the host in the above link?  Whatever
solution we develop to resolve that issue should equally apply to the
AMD reserved block:

0x00fd 0x00ff reserved

> > For a comprehensive solution, it's not a machine accelerator property
> > or enable such-and-such functionality flag, it's the ability to specify
> > a VM memory map on the QEMU command line and data center orchestration
> > tools gaining insight across all their systems to specify a memory
> > layout that can work regardless of how a VM might be migrated. 
> > Maybe
> > there's a "host" option to that memory map command line option that
> > would take into account the common case of a static host or at least
> > homogeneous set of hosts.  Overall, it's not unlike specifying CPU flags
> > to generate a least common denominator set such that the VM is
> > compatible to any host in the cluster.
> >   
> 
> I remember you iterated on such an idea over the initial RFC. I do like the
> option of adjusting the memory map, should any new restrictions appear in
> the IOVA space, as opposed to having to change the machine code every time
> that happens.
> 
> 
> I am trying to approach this iteratively, starting by fixing AMD 1T+ guests
> with something that hopefully is less painful to bear and unbreaks users
> doing multi-TB guests on kernels >= 5.4, while for < 5.4 it would no longer
> wrongly DMA-map bad IOVAs that may lead to the guest's own spurious failures.
> For the long term, qemu would need some sort of handling of a configurable
> sparse map of all guest RAM, which currently does not exist (and it's stuffed
> inside on a per-machine basis as you're aware). What I am unsure of is the
> churn associated with it (compat, migration, mem-hotplug, nvdimms,
> memory-backends) versus the benefit if it's "just" one class of x86 platforms
> (Intel not affected) -- which is what I find attractive with the past 2
> revisions via a smaller change.
> 
> > On the device end, I really would prefer not to see device driver
> > specific enables and we simply cannot hot-add a device of the given
> > type without a pre-enabled VM.  Give the user visibility and
> > configurability to the issue and simply fail the device add (ideally
> > with a useful error message) if the device IOVA space cannot support
> > the VM memory layout (this is what vfio already does afaik).
> > 
> > When we have iommufd support common for vfio and vdpa, hopefully we'll
> > also be able to recommend a common means for learning about system and
> > IOMMU restrictions to IOVA spaces.   
> 
> Perhaps even advertising platform-wide regions (without a domain allocated)
> that are common to any protection domain (for example, on x86 this is one
> such case, where the MSI/HT ranges are hardcoded on Intel/AMD).
> 
> > For now we have reserved_regions
> > reported in sysfs per IOMMU group:
> > 
> >  $ grep -h . /sys/kernel/iommu_groups/*/reserved_regions | sort -u | grep 
> > -v direct-relaxable  
> 
> And hopefully iommufd might not want to allow iommu_map() on those reserved
> IOVA regions as opposed to letting that go through. Essentially what VFIO
> does. Unless of course there's actually a

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-25 Thread Michael S. Tsirkin
On Fri, Feb 25, 2022 at 12:36:24PM +, Joao Martins wrote:
> I am trying to approach this iteratively, starting by fixing AMD 1T+ guests
> with something that hopefully is less painful to bear and unbreaks users
> doing multi-TB guests on kernels >= 5.4, while for < 5.4 it would no longer
> wrongly DMA-map bad IOVAs that may lead to the guest's own spurious failures.
> For the long term, qemu would need some sort of handling of a configurable
> sparse map of all guest RAM, which currently does not exist (and it's stuffed
> inside on a per-machine basis as you're aware). What I am unsure of is the
> churn associated with it (compat, migration, mem-hotplug, nvdimms,
> memory-backends) versus the benefit if it's "just" one class of x86 platforms
> (Intel not affected) -- which is what I find attractive with the past 2
> revisions via a smaller change.

Right. I pondered this for a while and I wonder whether you considered
making this depend on the guest cpu vendor and max phys bits.  Things
are easier to debug if the memory map is the same whatever the host. The
guest vendor typically matches the host cpu vendor after all, and there
just could be guests avoiding the reserved memory ranges on principle.

We'll need a bunch of code comments explaining all this hackery, as well
as machine type compat things, but that is par for the course.

Additionally, we could have a host check and then fail to init vdpa and
vfio devices if the memory map will make some memory inaccessible.

Does this sound reasonable to others? Alex? Joao?

-- 
MST




Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-25 Thread Joao Martins
On 2/25/22 05:22, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 08:34:40PM +, Joao Martins wrote:
>> On 2/24/22 20:12, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 08:04:48PM +, Joao Martins wrote:
 On 2/24/22 19:54, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 07:44:26PM +, Joao Martins wrote:
>> On 2/24/22 18:30, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 05:54:58PM +, Joao Martins wrote:
 On 2/24/22 17:23, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 04:07:22PM +, Joao Martins wrote:
>> On 2/23/22 23:35, Joao Martins wrote:
>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> +  uint64_t 
> pci_hole64_size)
> +{
> +X86MachineState *x86ms = X86_MACHINE(pcms);
> +uint32_t eax, vendor[3];
> +
> +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> +if (!IS_AMD_VENDOR(vendor)) {
> +return;
> +}

 Wait a sec, should this actually be tying things to the host CPU 
 ID?
 It's really about what we present to the guest though,
 isn't it?
>>>
>>> It was the easier catch all to use cpuid without going into
>>> Linux UAPI specifics. But it doesn't have to tie in there, it is 
>>> only
>>> for systems with an IOMMU present.
>>>
 Also, can't we tie this to whether the AMD IOMMU is present?

>>> I think so, I can add that. Something like a amd_iommu_exists() 
>>> helper
>>> in util/vfio-helpers.c which checks if there's any sysfs child 
>>> entries
>>> that start with ivhd in /sys/class/iommu/. Given that this HT 
>>> region is
>>> hardcoded in iommu reserved regions since >=4.11 (to latest) I 
>>> don't think it's
>>> even worth checking the range exists in:
>>>
>>> /sys/kernel/iommu_groups/0/reserved_regions
>>>
>>> (Also that sysfs ABI is >= 4.11 only)
>>
>> Here's what I have staged in local tree, to address your comment.
>>
>> Naturally the first chunk is what's affected by this patch; the rest is a
>> predecessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
>> all the tests and whatnot.
>>
>> I am not entirely sure this is the right place to put such a helper; open
>> to suggestions. wrt the naming of the helper, I tried to follow the rest
>> of the file's style.
>>

[snip]

>
> why are we checking whether an AMD IOMMU is present
> on the host? 
> Isn't what we care about whether it's
> present in the VM? What we are reserving after all
> is a range of GPAs, not actual host PA's ...
>
 Right.

 But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
 and so we need to not map that portion of address space entirely
 and use some other portion of the GPA-space. This has to
 do with the host IOMMU, which is where the DMA mapping of guest PAs takes
 place. So, if you do not have a host IOMMU, you can't
 service guest DMA/PCIe via VFIO through the host IOMMU, and therefore you
 don't have this problem.

 Whether one uses a guest IOMMU or not does not affect the result,
 it would be more of a case of mimicking real hardware not fixing
 the issue of this series.
>>>
>>>
>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
>>> And ideally move the logic reporting reserved ranges there too.
>>> However, I think vdpa has the same issue too.
>>> CC Jason for an opinion.
>>
>> It does have the same problem.
>>
>> This is not specific to VFIO, it's to anything that uses the iommu.
>
> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
> that we don't have a global "enable-vfio" flag since vfio devices
> can be hot-plugged. I think this is not the first time we wanted
> something like this, right Alex?
>
>> It's
>> just that VFIO doesn't let you shoot in the foot :)
>>
>> vDPA would need to validate iova ranges as well to prevent mapping on
>> reserved IOVAs similar to commit 9b77e5c7984 ("vfio/type1: check dma
>> map request is within a valid iova range"). Now you could argue that
>> it's an IOMMU core problem, maybe.
>>
>> But I see this as a separate problem,
>> because regardless of said validation, your guest provisioning
>> is still goi

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-25 Thread Joao Martins
On 2/24/22 21:40, Alex Williamson wrote:
> On Thu, 24 Feb 2022 20:34:40 +
> Joao Martins  wrote:
> 
>> On 2/24/22 20:12, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 08:04:48PM +, Joao Martins wrote:  
 On 2/24/22 19:54, Michael S. Tsirkin wrote:  
> On Thu, Feb 24, 2022 at 07:44:26PM +, Joao Martins wrote:  
>> On 2/24/22 18:30, Michael S. Tsirkin wrote:  
>>> On Thu, Feb 24, 2022 at 05:54:58PM +, Joao Martins wrote:  
 On 2/24/22 17:23, Michael S. Tsirkin wrote:  
> On Thu, Feb 24, 2022 at 04:07:22PM +, Joao Martins wrote:  
>> On 2/23/22 23:35, Joao Martins wrote:  
>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:  
> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> +  uint64_t 
> pci_hole64_size)
> +{
> +X86MachineState *x86ms = X86_MACHINE(pcms);
> +uint32_t eax, vendor[3];
> +
> +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> +if (!IS_AMD_VENDOR(vendor)) {
> +return;
> +}  

 Wait a sec, should this actually be tying things to the host CPU 
 ID?
 It's really about what we present to the guest though,
 isn't it?  
>>>
>>> It was the easier catch all to use cpuid without going into
>>> Linux UAPI specifics. But it doesn't have to tie in there, it is 
>>> only
>>> for systems with an IOMMU present.
>>>  
 Also, can't we tie this to whether the AMD IOMMU is present?
  
>>> I think so, I can add that. Something like a amd_iommu_exists() 
>>> helper
>>> in util/vfio-helpers.c which checks if there's any sysfs child 
>>> entries
>>> that start with ivhd in /sys/class/iommu/. Given that this HT 
>>> region is
>>> hardcoded in iommu reserved regions since >=4.11 (to latest) I 
>>> don't think it's
>>> even worth checking the range exists in:
>>>
>>> /sys/kernel/iommu_groups/0/reserved_regions
>>>
>>> (Also that sysfs ABI is >= 4.11 only)  
>>
>> Here's what I have staged in local tree, to address your comment.
>>
>> Naturally the first chunk is what's affected by this patch; the rest is a
>> predecessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
>> all the tests and whatnot.
>>
>> I am not entirely sure this is the right place to put such a helper; open
>> to suggestions. wrt the naming of the helper, I tried to follow the rest
>> of the file's style.
>>

[snip]

>
> why are we checking whether an AMD IOMMU is present
> on the host? 
> Isn't what we care about whether it's
> present in the VM? What we are reserving after all
> is a range of GPAs, not actual host PA's ...
>  
 Right.

 But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
 and so we need to not map that portion of address space entirely
 and use some other portion of the GPA-space. This has to
 do with the host IOMMU, which is where the DMA mapping of guest PAs takes
 place. So, if you do not have a host IOMMU, you can't
 service guest DMA/PCIe via VFIO through the host IOMMU, and therefore you
 don't have this problem.

 Whether one uses a guest IOMMU or not does not affect the result,
 it would be more of a case of mimicking real hardware not fixing
 the issue of this series.  
>>>
>>>
>>> Hmm okay ... my first reaction was to say let's put it under VFIO then.
>>> And ideally move the logic reporting reserved ranges there too.
>>> However, I think vdpa has the same issue too.
>>> CC Jason for an opinion.  
>>
>> It does have the same problem.
>>
>> This is not specific to VFIO, it's to anything that uses the iommu.  
>
> Right. Most VMs don't use the host IOMMU though ;) It's unfortunate
> that we don't have a global "enable-vfio" flag since vfio devices
> can be hot-plugged. I think this is not the first time we wanted
> something like this, right Alex?
> 
> 
> I would very much NOT like to see such a flag ever existing.
> 
>>> Also, some concerns
>>> - I think this changes memory layout for working VMs not using VFIO.
>>>   Better preserve the old layout for old machine types?
>>>  
>> Oh definitely, and I do that in this series. See the last commit, and
>> in the past it was also a machine-property exposed to userspace.
>> Otherwise I would be breaking (badly) compat/migratio

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-24 Thread Michael S. Tsirkin
On Thu, Feb 24, 2022 at 08:34:40PM +, Joao Martins wrote:
> On 2/24/22 20:12, Michael S. Tsirkin wrote:
> > On Thu, Feb 24, 2022 at 08:04:48PM +, Joao Martins wrote:
> >> On 2/24/22 19:54, Michael S. Tsirkin wrote:
> >>> On Thu, Feb 24, 2022 at 07:44:26PM +, Joao Martins wrote:
>  On 2/24/22 18:30, Michael S. Tsirkin wrote:
> > On Thu, Feb 24, 2022 at 05:54:58PM +, Joao Martins wrote:
> >> On 2/24/22 17:23, Michael S. Tsirkin wrote:
> >>> On Thu, Feb 24, 2022 at 04:07:22PM +, Joao Martins wrote:
>  On 2/23/22 23:35, Joao Martins wrote:
> > On 2/23/22 21:22, Michael S. Tsirkin wrote:
> >>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>> +  uint64_t 
> >>> pci_hole64_size)
> >>> +{
> >>> +X86MachineState *x86ms = X86_MACHINE(pcms);
> >>> +uint32_t eax, vendor[3];
> >>> +
> >>> +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>> +if (!IS_AMD_VENDOR(vendor)) {
> >>> +return;
> >>> +}
> >>
> >> Wait a sec, should this actually be tying things to the host CPU 
> >> ID?
> >> It's really about what we present to the guest though,
> >> isn't it?
> >
> > It was the easier catch all to use cpuid without going into
> > Linux UAPI specifics. But it doesn't have to tie in there, it is 
> > only
> > for systems with an IOMMU present.
> >
> >> Also, can't we tie this to whether the AMD IOMMU is present?
> >>
> > I think so, I can add that. Something like a amd_iommu_exists() 
> > helper
> > in util/vfio-helpers.c which checks if there's any sysfs child 
> > entries
> > that start with ivhd in /sys/class/iommu/. Given that this HT 
> > region is
> > hardcoded in iommu reserved regions since >=4.11 (to latest) I 
> > don't think it's
> > even worth checking the range exists in:
> >
> > /sys/kernel/iommu_groups/0/reserved_regions
> >
> > (Also that sysfs ABI is >= 4.11 only)
> 
>  Here's what I have staged in local tree, to address your comment.
> 
>  Naturally the first chunk is what's affected by this patch; the rest is a
>  predecessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
>  all the tests and whatnot.
> 
>  I am not entirely sure this is the right place to put such a helper; open
>  to suggestions. wrt the naming of the helper, I tried to follow the rest
>  of the file's style.
> 
>  diff --git a/hw/i386/pc.c b/hw/i386/pc.c
>  index a9be5d33a291..2ea4430d5dcc 100644
>  --- a/hw/i386/pc.c
>  +++ b/hw/i386/pc.c
>  @@ -868,10 +868,8 @@ static void 
>  x86_update_above_4g_mem_start(PCMachineState *pcms,
> uint64_t pci_hole64_size)
>   {
>   X86MachineState *x86ms = X86_MACHINE(pcms);
>  -uint32_t eax, vendor[3];
> 
>  -host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>  -if (!IS_AMD_VENDOR(vendor)) {
>  +if (!qemu_amd_iommu_is_present()) {
>   return;
>   }
> 
>  diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
>  index 7bcce3bceb0f..eb4ea071ecec 100644
>  --- a/include/qemu/osdep.h
>  +++ b/include/qemu/osdep.h
>  @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
>    */
>   size_t qemu_get_host_physmem(void);
> 
>  +/**
>  + * qemu_amd_iommu_is_present:
>  + *
>  + * Operating system agnostic way of querying if an AMD IOMMU
>  + * is present.
>  + *
>  + */
>  +bool qemu_amd_iommu_is_present(void);
>  +
>   /*
>    * Toggle write/execute on the pages marked MAP_JIT
>    * for the current thread.
>  diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>  index f2be7321c59f..54cef21217c4 100644
>  --- a/util/oslib-posix.c
>  +++ b/util/oslib-posix.c
>  @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
>   #endif
>   return 0;
>   }
>  +
>  +bool qemu_amd_iommu_is_present(void)
>  +{
>  +bool found = false;
>  +#ifdef CONFIG_LINUX
>  +struct dirent *entry;
>  +char *path;
>  +DIR *dir;
>  +
>  +path = g_strdup_printf("/sys/class/iommu");
>  +dir = opendir(path);
>  +if (!dir) {
>  +g_free(path);
>  + 

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-24 Thread Jason Wang
On Fri, Feb 25, 2022 at 2:30 AM Michael S. Tsirkin  wrote:
>
> On Thu, Feb 24, 2022 at 05:54:58PM +, Joao Martins wrote:
> > On 2/24/22 17:23, Michael S. Tsirkin wrote:
> > > On Thu, Feb 24, 2022 at 04:07:22PM +, Joao Martins wrote:
> > >> On 2/23/22 23:35, Joao Martins wrote:
> > >>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
> > > +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> > > +  uint64_t pci_hole64_size)
> > > +{
> > > +X86MachineState *x86ms = X86_MACHINE(pcms);
> > > +uint32_t eax, vendor[3];
> > > +
> > > +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> > > +if (!IS_AMD_VENDOR(vendor)) {
> > > +return;
> > > +}
> > 
> >  Wait a sec, should this actually be tying things to the host CPU ID?
> >  It's really about what we present to the guest though,
> >  isn't it?
> > >>>
> > >>> It was the easier catch all to use cpuid without going into
> > >>> Linux UAPI specifics. But it doesn't have to tie in there, it is only
> > >>> for systems with an IOMMU present.
> > >>>
> >  Also, can't we tie this to whether the AMD IOMMU is present?
> > 
> > >>> I think so, I can add that. Something like a amd_iommu_exists() helper
> > >>> in util/vfio-helpers.c which checks if there's any sysfs child entries
> > >>> that start with ivhd in /sys/class/iommu/. Given that this HT region is
> > >>> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't 
> > >>> think it's
> > >>> even worth checking the range exists in:
> > >>>
> > >>>   /sys/kernel/iommu_groups/0/reserved_regions
> > >>>
> > >>> (Also that sysfs ABI is >= 4.11 only)
> > >>
> > >> Here's what I have staged in local tree, to address your comment.
> > >>
> > >> Naturally the first chunk is what's affected by this patch; the rest is a
> > >> predecessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
> > >> all the tests and whatnot.
> > >>
> > >> I am not entirely sure this is the right place to put such a helper; open
> > >> to suggestions. wrt the naming of the helper, I tried to follow the rest
> > >> of the file's style.
> > >>
> > >> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> > >> index a9be5d33a291..2ea4430d5dcc 100644
> > >> --- a/hw/i386/pc.c
> > >> +++ b/hw/i386/pc.c
> > >> @@ -868,10 +868,8 @@ static void 
> > >> x86_update_above_4g_mem_start(PCMachineState *pcms,
> > >>uint64_t pci_hole64_size)
> > >>  {
> > >>  X86MachineState *x86ms = X86_MACHINE(pcms);
> > >> -uint32_t eax, vendor[3];
> > >>
> > >> -host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> > >> -if (!IS_AMD_VENDOR(vendor)) {
> > >> +if (!qemu_amd_iommu_is_present()) {
> > >>  return;
> > >>  }
> > >>
> > >> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> > >> index 7bcce3bceb0f..eb4ea071ecec 100644
> > >> --- a/include/qemu/osdep.h
> > >> +++ b/include/qemu/osdep.h
> > >> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
> > >>   */
> > >>  size_t qemu_get_host_physmem(void);
> > >>
> > >> +/**
> > >> + * qemu_amd_iommu_is_present:
> > >> + *
> > >> + * Operating system agnostic way of querying if an AMD IOMMU
> > >> + * is present.
> > >> + *
> > >> + */
> > >> +bool qemu_amd_iommu_is_present(void);
> > >> +
> > >>  /*
> > >>   * Toggle write/execute on the pages marked MAP_JIT
> > >>   * for the current thread.
> > >> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> > >> index f2be7321c59f..54cef21217c4 100644
> > >> --- a/util/oslib-posix.c
> > >> +++ b/util/oslib-posix.c
> > >> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
> > >>  #endif
> > >>  return 0;
> > >>  }
> > >> +
> > >> +bool qemu_amd_iommu_is_present(void)
> > >> +{
> > >> +bool found = false;
> > >> +#ifdef CONFIG_LINUX
> > >> +struct dirent *entry;
> > >> +char *path;
> > >> +DIR *dir;
> > >> +
> > >> +path = g_strdup_printf("/sys/class/iommu");
> > >> +dir = opendir(path);
> > >> +if (!dir) {
> > >> +g_free(path);
> > >> +return found;
> > >> +}
> > >> +
> > >> +do {
> > >> +entry = readdir(dir);
> > >> +if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
> > >> +found = true;
> > >> +break;
> > >> +}
> > >> +} while (entry);
> > >> +
> > >> +g_free(path);
> > >> +closedir(dir);
> > >> +#endif
> > >> +return found;
> > >> +}
> > >
> > > why are we checking whether an AMD IOMMU is present
> > > on the host?
> > > Isn't what we care about whether it's
> > > present in the VM? What we are reserving after all
> > > is a range of GPAs, not actual host PA's ...
> > >
> > Right.
> >
> > But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
> > and so we need to not map that portion of address space entirely
>> and use some other portion of the GPA-space.

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-24 Thread Alex Williamson
[truncated duplicate quoted text snipped]

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-24 Thread Joao Martins
[truncated duplicate quoted text snipped]

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-24 Thread Michael S. Tsirkin
[truncated duplicate quoted text snipped]

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-24 Thread Joao Martins
[truncated duplicate quoted text snipped]

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-24 Thread Michael S. Tsirkin
[truncated duplicate quoted text snipped]

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-24 Thread Joao Martins
On 2/24/22 18:30, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 05:54:58PM +, Joao Martins wrote:
>> On 2/24/22 17:23, Michael S. Tsirkin wrote:
>>> On Thu, Feb 24, 2022 at 04:07:22PM +, Joao Martins wrote:
 On 2/23/22 23:35, Joao Martins wrote:
> On 2/23/22 21:22, Michael S. Tsirkin wrote:
>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>> +  uint64_t pci_hole64_size)
>>> +{
>>> +X86MachineState *x86ms = X86_MACHINE(pcms);
>>> +uint32_t eax, vendor[3];
>>> +
>>> +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>> +if (!IS_AMD_VENDOR(vendor)) {
>>> +return;
>>> +}
>>
>> Wait a sec, should this actually be tying things to the host CPU ID?
>> It's really about what we present to the guest though,
>> isn't it?
>
> It was the easier catch all to use cpuid without going into
> Linux UAPI specifics. But it doesn't have to tie in there, it is only
> for systems with an IOMMU present.
>
>> Also, can't we tie this to whether the AMD IOMMU is present?
>>
> I think so, I can add that. Something like an amd_iommu_exists() helper
> in util/vfio-helpers.c which checks if there are any sysfs child entries
> that start with ivhd in /sys/class/iommu/. Given that this HT region is
> hardcoded in the IOMMU reserved regions since >=4.11 (to latest), I don't
> think it's even worth checking that the range exists in:
>
>   /sys/kernel/iommu_groups/0/reserved_regions
>
> (Also that sysfs ABI is >= 4.11 only)

 Here's what I have staged in my local tree, to address your comment.

 Naturally the first chunk is what's affected by this patch; the rest is a
 predecessor patch to introduce qemu_amd_iommu_is_present(). It seems to pass
 all the tests and whatnot.

 I am not entirely sure this is the right place to put such a helper; open
 to suggestions. Regarding the naming of the helper, I tried to follow the rest
 of the file's style.

 diff --git a/hw/i386/pc.c b/hw/i386/pc.c
 index a9be5d33a291..2ea4430d5dcc 100644
 --- a/hw/i386/pc.c
 +++ b/hw/i386/pc.c
 @@ -868,10 +868,8 @@ static void 
 x86_update_above_4g_mem_start(PCMachineState *pcms,
uint64_t pci_hole64_size)
  {
  X86MachineState *x86ms = X86_MACHINE(pcms);
 -uint32_t eax, vendor[3];

 -host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
 -if (!IS_AMD_VENDOR(vendor)) {
 +if (!qemu_amd_iommu_is_present()) {
  return;
  }

 diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
 index 7bcce3bceb0f..eb4ea071ecec 100644
 --- a/include/qemu/osdep.h
 +++ b/include/qemu/osdep.h
 @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
   */
  size_t qemu_get_host_physmem(void);

 +/**
 + * qemu_amd_iommu_is_present:
 + *
 + * Operating system agnostic way of querying if an AMD IOMMU
 + * is present.
 + *
 + */
 +bool qemu_amd_iommu_is_present(void);
 +
  /*
   * Toggle write/execute on the pages marked MAP_JIT
   * for the current thread.
 diff --git a/util/oslib-posix.c b/util/oslib-posix.c
 index f2be7321c59f..54cef21217c4 100644
 --- a/util/oslib-posix.c
 +++ b/util/oslib-posix.c
 @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
  #endif
  return 0;
  }
 +
 +bool qemu_amd_iommu_is_present(void)
 +{
 +bool found = false;
 +#ifdef CONFIG_LINUX
 +struct dirent *entry;
 +char *path;
 +DIR *dir;
 +
 +path = g_strdup_printf("/sys/class/iommu");
 +dir = opendir(path);
 +if (!dir) {
 +g_free(path);
 +return found;
 +}
 +
 +do {
 +entry = readdir(dir);
 +if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
 +found = true;
 +break;
 +}
 +} while (entry);
 +
 +g_free(path);
 +closedir(dir);
 +#endif
 +return found;
 +}
>>>
>>> why are we checking whether an AMD IOMMU is present
>>> on the host? 
>>> Isn't what we care about whether it's
>>> present in the VM? What we are reserving after all
>>> is a range of GPAs, not actual host PA's ...
>>>
>> Right.
>>
>> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
>> and so we need to not map that portion of address space entirely
>> and use some other portion of the GPA-space. This has to
>> do with host IOMMU which is where the DMA mapping of guest PA takes
>> place. So, if you do not have an host IOMMU, you can't
>> service guest DMA/PCIe services via VFIO through the host IOMMU, therefore 
>> you
>> don't need this problem.
>>
>

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-24 Thread Michael S. Tsirkin
On Thu, Feb 24, 2022 at 05:54:58PM +, Joao Martins wrote:
> On 2/24/22 17:23, Michael S. Tsirkin wrote:
> > On Thu, Feb 24, 2022 at 04:07:22PM +, Joao Martins wrote:
> >> On 2/23/22 23:35, Joao Martins wrote:
> >>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
> > [snip quoted context and staged patch]
> > 
> > why are we checking whether an AMD IOMMU is present
> > on the host? 
> > Isn't what we care about whether it's
> > present in the VM? What we are reserving after all
> > is a range of GPAs, not actual host PA's ...
> > 
> Right.
> 
> But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
> and so we need to not map that portion of address space entirely
> and use some other portion of the GPA-space. This has to
> do with host IOMMU which is where the DMA mapping of guest PA takes
> place. So, if you do not have an host IOMMU, you can't
> service guest DMA/PCIe services via VFIO through the host IOMMU, therefore you
> don't need this problem.
> 
> Whether one uses a guest IOMMU or not does not affect th

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-24 Thread Joao Martins
On 2/24/22 17:23, Michael S. Tsirkin wrote:
> On Thu, Feb 24, 2022 at 04:07:22PM +, Joao Martins wrote:
>> On 2/23/22 23:35, Joao Martins wrote:
>>> On 2/23/22 21:22, Michael S. Tsirkin wrote:
> [snip quoted context and staged patch]
> 
> why are we checking whether an AMD IOMMU is present
> on the host? 
> Isn't what we care about whether it's
> present in the VM? What we are reserving after all
> is a range of GPAs, not actual host PA's ...
> 
Right.

But any DMA map done by VFIO/IOMMU using those GPA ranges will fail,
so we need to avoid mapping that portion of the address space entirely
and use some other portion of the GPA space instead. This has to do
with the host IOMMU, which is where the DMA mapping of guest PAs takes
place. So, if you do not have a host IOMMU, you can't service guest
DMA/PCIe via VFIO through the host IOMMU, and therefore you don't have
this problem.

Whether one uses a guest IOMMU or not does not affect the result;
it would be more a case of mimicking real hardware than fixing the
issue this series addresses.



Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-24 Thread Michael S. Tsirkin
On Thu, Feb 24, 2022 at 04:07:22PM +, Joao Martins wrote:
> On 2/23/22 23:35, Joao Martins wrote:
> > On 2/23/22 21:22, Michael S. Tsirkin wrote:
> >>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> >>> +  uint64_t pci_hole64_size)
> >>> +{
> >>> +X86MachineState *x86ms = X86_MACHINE(pcms);
> >>> +uint32_t eax, vendor[3];
> >>> +
> >>> +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> >>> +if (!IS_AMD_VENDOR(vendor)) {
> >>> +return;
> >>> +}
> >>
> >> Wait a sec, should this actually be tying things to the host CPU ID?
> >> It's really about what we present to the guest though,
> >> isn't it?
> > 
> > It was the easier catch all to use cpuid without going into
> > Linux UAPI specifics. But it doesn't have to tie in there, it is only
> > for systems with an IOMMU present.
> > 
> >> Also, can't we tie this to whether the AMD IOMMU is present?
> >>
> > I think so, I can add that. Something like a amd_iommu_exists() helper
> > in util/vfio-helpers.c which checks if there's any sysfs child entries
> > that start with ivhd in /sys/class/iommu/. Given that this HT region is
> > hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think 
> > it's
> > even worth checking the range exists in:
> > 
> > /sys/kernel/iommu_groups/0/reserved_regions
> > 
> > (Also that sysfs ABI is >= 4.11 only)
> 
> Here's what I have staged in local tree, to address your comment.
> 
> Naturally the first chunk is what's affected by this patch the rest is a
> predecessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
> all the tests and what not.
> 
> I am not entirely sure this is the right place to put such a helper, open
> to suggestions. Wrt the naming of the helper, I tried to follow the rest
> of the file's style.
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index a9be5d33a291..2ea4430d5dcc 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState 
> *pcms,
>uint64_t pci_hole64_size)
>  {
>  X86MachineState *x86ms = X86_MACHINE(pcms);
> -uint32_t eax, vendor[3];
> 
> -host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
> -if (!IS_AMD_VENDOR(vendor)) {
> +if (!qemu_amd_iommu_is_present()) {
>  return;
>  }
> 
> diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
> index 7bcce3bceb0f..eb4ea071ecec 100644
> --- a/include/qemu/osdep.h
> +++ b/include/qemu/osdep.h
> @@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
>   */
>  size_t qemu_get_host_physmem(void);
> 
> +/**
> + * qemu_amd_iommu_is_present:
> + *
> + * Operating system agnostic way of querying if an AMD IOMMU
> + * is present.
> + *
> + */
> +bool qemu_amd_iommu_is_present(void);
> +
>  /*
>   * Toggle write/execute on the pages marked MAP_JIT
>   * for the current thread.
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index f2be7321c59f..54cef21217c4 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
>  #endif
>  return 0;
>  }
> +
> +bool qemu_amd_iommu_is_present(void)
> +{
> +bool found = false;
> +#ifdef CONFIG_LINUX
> +struct dirent *entry;
> +char *path;
> +DIR *dir;
> +
> +path = g_strdup_printf("/sys/class/iommu");
> +dir = opendir(path);
> +if (!dir) {
> +g_free(path);
> +return found;
> +}
> +
> +do {
> +entry = readdir(dir);
> +if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
> +found = true;
> +break;
> +}
> +} while (entry);
> +
> +g_free(path);
> +closedir(dir);
> +#endif
> +return found;
> +}

why are we checking whether an AMD IOMMU is present
on the host? Isn't what we care about whether it's
present in the VM? What we are reserving after all
is a range of GPAs, not actual host PA's ...



> diff --git a/util/oslib-win32.c b/util/oslib-win32.c
> index af559ef3398d..c08826e7e19b 100644
> --- a/util/oslib-win32.c
> +++ b/util/oslib-win32.c
> @@ -652,3 +652,8 @@ size_t qemu_get_host_physmem(void)
>  }
>  return 0;
>  }
> +
> +bool qemu_amd_iommu_is_present(void)
> +{
> +return false;
> +}




Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-24 Thread Joao Martins
On 2/23/22 23:35, Joao Martins wrote:
> On 2/23/22 21:22, Michael S. Tsirkin wrote:
>>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>>> +  uint64_t pci_hole64_size)
>>> +{
>>> +X86MachineState *x86ms = X86_MACHINE(pcms);
>>> +uint32_t eax, vendor[3];
>>> +
>>> +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>>> +if (!IS_AMD_VENDOR(vendor)) {
>>> +return;
>>> +}
>>
>> Wait a sec, should this actually be tying things to the host CPU ID?
>> It's really about what we present to the guest though,
>> isn't it?
> 
> It was the easier catch all to use cpuid without going into
> Linux UAPI specifics. But it doesn't have to tie in there, it is only
> for systems with an IOMMU present.
> 
>> Also, can't we tie this to whether the AMD IOMMU is present?
>>
> I think so, I can add that. Something like a amd_iommu_exists() helper
> in util/vfio-helpers.c which checks if there's any sysfs child entries
> that start with ivhd in /sys/class/iommu/. Given that this HT region is
> hardcoded in iommu reserved regions since >=4.11 (to latest) I don't think 
> it's
> even worth checking the range exists in:
> 
>   /sys/kernel/iommu_groups/0/reserved_regions
> 
> (Also that sysfs ABI is >= 4.11 only)

Here's what I have staged in local tree, to address your comment.

Naturally the first chunk is what's affected by this patch the rest is a
predecessor patch to introduce qemu_amd_iommu_is_present(). Seems to pass
all the tests and what not.

I am not entirely sure this is the right place to put such a helper, open
to suggestions. Wrt the naming of the helper, I tried to follow the rest
of the file's style.

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index a9be5d33a291..2ea4430d5dcc 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -868,10 +868,8 @@ static void x86_update_above_4g_mem_start(PCMachineState 
*pcms,
   uint64_t pci_hole64_size)
 {
 X86MachineState *x86ms = X86_MACHINE(pcms);
-uint32_t eax, vendor[3];

-host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
-if (!IS_AMD_VENDOR(vendor)) {
+if (!qemu_amd_iommu_is_present()) {
 return;
 }

diff --git a/include/qemu/osdep.h b/include/qemu/osdep.h
index 7bcce3bceb0f..eb4ea071ecec 100644
--- a/include/qemu/osdep.h
+++ b/include/qemu/osdep.h
@@ -637,6 +637,15 @@ char *qemu_get_host_name(Error **errp);
  */
 size_t qemu_get_host_physmem(void);

+/**
+ * qemu_amd_iommu_is_present:
+ *
+ * Operating system agnostic way of querying if an AMD IOMMU
+ * is present.
+ *
+ */
+bool qemu_amd_iommu_is_present(void);
+
 /*
  * Toggle write/execute on the pages marked MAP_JIT
  * for the current thread.
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index f2be7321c59f..54cef21217c4 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -982,3 +982,32 @@ size_t qemu_get_host_physmem(void)
 #endif
 return 0;
 }
+
+bool qemu_amd_iommu_is_present(void)
+{
+bool found = false;
+#ifdef CONFIG_LINUX
+struct dirent *entry;
+char *path;
+DIR *dir;
+
+path = g_strdup_printf("/sys/class/iommu");
+dir = opendir(path);
+if (!dir) {
+g_free(path);
+return found;
+}
+
+do {
+entry = readdir(dir);
+if (entry && !strncmp(entry->d_name, "ivhd", 4)) {
+found = true;
+break;
+}
+} while (entry);
+
+g_free(path);
+closedir(dir);
+#endif
+return found;
+}
diff --git a/util/oslib-win32.c b/util/oslib-win32.c
index af559ef3398d..c08826e7e19b 100644
--- a/util/oslib-win32.c
+++ b/util/oslib-win32.c
@@ -652,3 +652,8 @@ size_t qemu_get_host_physmem(void)
 }
 return 0;
 }
+
+bool qemu_amd_iommu_is_present(void)
+{
+return false;
+}



Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-24 Thread Joao Martins
On 2/23/22 18:44, Joao Martins wrote:
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 360f4e10001b..6e4f5c87a2e5 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -802,6 +802,78 @@ void xen_load_linux(PCMachineState *pcms)
>  #define PC_ROM_ALIGN   0x800
>  #define PC_ROM_SIZE(PC_ROM_MAX - PC_ROM_MIN_VGA)
>  
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD 
> machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START 0xfd00000000UL
> +#define AMD_HT_END   0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE  (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,
> +uint64_t pci_hole64_size)
> +{
> +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +X86MachineState *x86ms = X86_MACHINE(pcms);
> +MachineState *machine = MACHINE(pcms);
> +ram_addr_t device_mem_size = 0;
> +hwaddr base;
> +

I am adding this extra check for 32-bit coverage (phys-bits=32)

+    if (!x86ms->above_4g_mem_size) {
+        /*
+         * 32-bit pci hole goes from
+         * end-of-low-ram (@below_4g_mem_size) to IOAPIC.
+         */
+        return IO_APIC_DEFAULT_ADDRESS - 1;
+    }



Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-23 Thread Joao Martins
On 2/23/22 21:22, Michael S. Tsirkin wrote:
> On Wed, Feb 23, 2022 at 06:44:53PM +, Joao Martins wrote:
>> It is assumed that the whole GPA space is available to be DMA
>> addressable, within a given address space limit, except for a
>> tiny region before the 4G. Since Linux v5.4, VFIO validates
>> whether the selected GPA is indeed valid i.e. not reserved by
>> IOMMU on behalf of some specific devices or platform-defined
>> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>>  -EINVAL.
>>
>> AMD systems with an IOMMU are examples of such platforms and
>> particularly may only have these ranges as allowed:
>>
>>   0000000000000000 - 00000000fedfffff (0      .. 3.982G)
>>   00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>>   0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
>>
>> We already account for the 4G hole, albeit if the guest is big
>> enough we will fail to allocate a guest with  >1010G due to the
>> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).
> 
> Could you point me to which driver then reserves the
> other regions on Linux for AMD platforms?
> 
It's only two regions. The 4G hole, which is used the same way as on
AMD[0]/Intel[1], and part of which is the IOMMU MSI reserved range; and the
1T hole, which is reserved for HyperTransport[2]. This is hardware behaviour,
so drivers just mark them reserved and avoid using those addresses at all.

[0]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/amd/iommu.c#n2203
[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/intel/iommu.c#n5328
[2]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/amd/iommu.c#n2210

Now for the 256T on AMD: it isn't reserved anywhere, and the only code
reference I can give you is the KVM selftests, which had issues there
before[4], fixed by Paolo. The errata also gives a glimpse[3].

[3] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
[4]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c8cc43c1eae2910ac96daa4216e0fb3391ad0504

>> +/*
>> + * AMD systems with an IOMMU have an additional hole close to the
>> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
>> + * on kernel version, VFIO may or may not let you DMA map those ranges.
>> + * Starting Linux v5.4 we validate it, and can't create guests on AMD 
>> machines
>> + * with certain memory sizes. It's also wrong to use those IOVA ranges
>> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
>> + * The ranges reserved for Hyper-Transport are:
>> + *
>> + * FD_0000_0000h - FF_FFFF_FFFFh
>> + *
>> + * The ranges represent the following:
>> + *
>> + * Base Address   Top Address  Use
>> + *
>> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
>> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
>> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
>> + * FD_F910_0000h FD_F91F_FFFFh System Management
>> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
>> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
>> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
>> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
>> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
>> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
>> + *
>> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
>> + * Table 3: Special Address Controls (GPA) for more information.
>> + */
>> +#define AMD_HT_START 0xfd00000000UL
>> +#define AMD_HT_END   0xffffffffffUL
>> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
>> +#define AMD_HT_SIZE  (AMD_ABOVE_1TB_START - AMD_HT_START)
>> +
>> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,
>> +uint64_t pci_hole64_size)
>> +{
>> +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
>> +X86MachineState *x86ms = X86_MACHINE(pcms);
>> +MachineState *machine = MACHINE(pcms);
>> +ram_addr_t device_mem_size = 0;
>> +hwaddr base;
>> +
>> +if (pcmc->has_reserved_memory &&
>> +   (machine->ram_size < machine->maxram_size)) {
>> +device_mem_size = machine->maxram_size - machine->ram_size;
>> +}
>> +
>> +base = ROUND_UP(x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
>> +pcms->sgx_epc.size, 1 * GiB);
>> +
>> +return base + device_mem_size + pci_hole64_size;
>> +}
>> +
>> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
>> +  uint64_t pci_hole64_size)
>> +{
>> +X86MachineState *x86ms = X86_MACHINE(pcms);
>> +uint32_t eax, vendor[3];
>> +
>> +host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
>> +if (!IS_AMD_VENDOR(vendor)) {
>> +return;
>> +}
> 
> Wait a sec, should this actually be tying things to the host CPU ID?
> It's really about what we present to the guest though,
> isn't it?
> 

It was the easie

Re: [PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-23 Thread Michael S. Tsirkin
On Wed, Feb 23, 2022 at 06:44:53PM +, Joao Martins wrote:
> It is assumed that the whole GPA space is available to be DMA
> addressable, within a given address space limit, except for a
> tiny region before the 4G. Since Linux v5.4, VFIO validates
> whether the selected GPA is indeed valid i.e. not reserved by
> IOMMU on behalf of some specific devices or platform-defined
> restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
>  -EINVAL.
> 
> AMD systems with an IOMMU are examples of such platforms and
> particularly may only have these ranges as allowed:
> 
>   0000000000000000 - 00000000fedfffff (0      .. 3.982G)
>   00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
>   0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])
> 
> We already account for the 4G hole, albeit if the guest is big
> enough we will fail to allocate a guest with  >1010G due to the
> ~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).

Could you point me to which driver then reserves the
other regions on Linux for AMD platforms?

> 
> [*] there is another reserved region unrelated to HT that exists
> in the 256T boundary in Fam 17h according to Errata #1286,
> documented also in "Open-Source Register Reference for AMD Family
> 17h Processors (PUB)"
> 
> When creating the region above 4G, take into account that on AMD
> platforms the HyperTransport range is reserved and hence it
> cannot be used either as GPAs. On those cases rather than
> establishing the start of ram-above-4g to be 4G, relocate instead
> to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
> Topology", for more information on the underlying restriction of
> IOVAs.
> 
> After accounting for the 1Tb hole on AMD hosts, mtree should
> look like:
> 
> 0000000000000000-000000007fffffff (prio 0, i/o):
>    alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
> 0000010000000000-000001ff7fffffff (prio 0, i/o):
>    alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff
> 
> If the relocation is done, we also add the reserved HT
> e820 range as reserved.
> 
> Suggested-by: Igor Mammedov 
> Signed-off-by: Joao Martins 
> ---
>  hw/i386/pc.c  | 79 +++
>  target/i386/cpu.h |  4 +++
>  2 files changed, 83 insertions(+)
> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index 360f4e10001b..6e4f5c87a2e5 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -802,6 +802,78 @@ void xen_load_linux(PCMachineState *pcms)
>  #define PC_ROM_ALIGN   0x800
>  #define PC_ROM_SIZE(PC_ROM_MAX - PC_ROM_MIN_VGA)
>  
> +/*
> + * AMD systems with an IOMMU have an additional hole close to the
> + * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
> + * on kernel version, VFIO may or may not let you DMA map those ranges.
> + * Starting Linux v5.4 we validate it, and can't create guests on AMD 
> machines
> + * with certain memory sizes. It's also wrong to use those IOVA ranges
> + * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
> + * The ranges reserved for Hyper-Transport are:
> + *
> + * FD_0000_0000h - FF_FFFF_FFFFh
> + *
> + * The ranges represent the following:
> + *
> + * Base Address   Top Address  Use
> + *
> + * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
> + * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
> + * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
> + * FD_F910_0000h FD_F91F_FFFFh System Management
> + * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
> + * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
> + * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
> + * FD_FE00_0000h FD_FFFF_FFFFh Configuration
> + * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
> + * FE_2000_0000h FF_FFFF_FFFFh Reserved
> + *
> + * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
> + * Table 3: Special Address Controls (GPA) for more information.
> + */
> +#define AMD_HT_START 0xfd00000000UL
> +#define AMD_HT_END   0xffffffffffUL
> +#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
> +#define AMD_HT_SIZE  (AMD_ABOVE_1TB_START - AMD_HT_START)
> +
> +static hwaddr x86_max_phys_addr(PCMachineState *pcms,
> +uint64_t pci_hole64_size)
> +{
> +PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
> +X86MachineState *x86ms = X86_MACHINE(pcms);
> +MachineState *machine = MACHINE(pcms);
> +ram_addr_t device_mem_size = 0;
> +hwaddr base;
> +
> +if (pcmc->has_reserved_memory &&
> +   (machine->ram_size < machine->maxram_size)) {
> +device_mem_size = machine->maxram_size - machine->ram_size;
> +}
> +
> +base = ROUND_UP(x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
> +pcms->sgx_epc.size, 1 * GiB);
> +
> +return base + device_mem_size + pci_hole64_size;
> +}
> +
> +static void x86_update_above_4g_mem_start(PCMachineState *pcms,
> +  uint64_t pci_hole

[PATCH v3 4/6] i386/pc: relocate 4g start to 1T where applicable

2022-02-23 Thread Joao Martins
It is assumed that the whole GPA space is available to be DMA
addressable, within a given address space limit, except for a
tiny region before the 4G. Since Linux v5.4, VFIO validates
whether the selected GPA is indeed valid i.e. not reserved by
IOMMU on behalf of some specific devices or platform-defined
restrictions, and thus failing the ioctl(VFIO_DMA_MAP) with
 -EINVAL.

AMD systems with an IOMMU are examples of such platforms and
particularly may only have these ranges as allowed:

 0000000000000000 - 00000000fedfffff (0      .. 3.982G)
 00000000fef00000 - 000000fcffffffff (3.983G .. 1011.9G)
 0000010000000000 - ffffffffffffffff (1Tb    .. 16Pb[*])

We already account for the 4G hole, albeit if the guest is big
enough we will fail to allocate a guest with  >1010G due to the
~12G hole at the 1Tb boundary, reserved for HyperTransport (HT).

[*] there is another reserved region unrelated to HT that exists
in the 256T boundary in Fam 17h according to Errata #1286,
documented also in "Open-Source Register Reference for AMD Family
17h Processors (PUB)"

When creating the region above 4G, take into account that on AMD
platforms the HyperTransport range is reserved and hence it
cannot be used either as GPAs. On those cases rather than
establishing the start of ram-above-4g to be 4G, relocate instead
to 1Tb. See AMD IOMMU spec, section 2.1.2 "IOMMU Logical
Topology", for more information on the underlying restriction of
IOVAs.

After accounting for the 1Tb hole on AMD hosts, mtree should
look like:

0000000000000000-000000007fffffff (prio 0, i/o):
   alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
0000010000000000-000001ff7fffffff (prio 0, i/o):
   alias ram-above-4g @pc.ram 0000000080000000-000000ffffffffff

If the relocation is done, we also add the reserved HT
e820 range as reserved.

Suggested-by: Igor Mammedov 
Signed-off-by: Joao Martins 
---
 hw/i386/pc.c  | 79 +++
 target/i386/cpu.h |  4 +++
 2 files changed, 83 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 360f4e10001b..6e4f5c87a2e5 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -802,6 +802,78 @@ void xen_load_linux(PCMachineState *pcms)
 #define PC_ROM_ALIGN   0x800
 #define PC_ROM_SIZE(PC_ROM_MAX - PC_ROM_MIN_VGA)
 
+/*
+ * AMD systems with an IOMMU have an additional hole close to the
+ * 1Tb, which are special GPAs that cannot be DMA mapped. Depending
+ * on kernel version, VFIO may or may not let you DMA map those ranges.
+ * Starting Linux v5.4 we validate it, and can't create guests on AMD machines
+ * with certain memory sizes. It's also wrong to use those IOVA ranges
+ * in detriment of leading to IOMMU INVALID_DEVICE_REQUEST or worse.
+ * The ranges reserved for Hyper-Transport are:
+ *
+ * FD_0000_0000h - FF_FFFF_FFFFh
+ *
+ * The ranges represent the following:
+ *
+ * Base Address   Top Address  Use
+ *
+ * FD_0000_0000h FD_F7FF_FFFFh Reserved interrupt address space
+ * FD_F800_0000h FD_F8FF_FFFFh Interrupt/EOI IntCtl
+ * FD_F900_0000h FD_F90F_FFFFh Legacy PIC IACK
+ * FD_F910_0000h FD_F91F_FFFFh System Management
+ * FD_F920_0000h FD_FAFF_FFFFh Reserved Page Tables
+ * FD_FB00_0000h FD_FBFF_FFFFh Address Translation
+ * FD_FC00_0000h FD_FDFF_FFFFh I/O Space
+ * FD_FE00_0000h FD_FFFF_FFFFh Configuration
+ * FE_0000_0000h FE_1FFF_FFFFh Extended Configuration/Device Messages
+ * FE_2000_0000h FF_FFFF_FFFFh Reserved
+ *
+ * See AMD IOMMU spec, section 2.1.2 "IOMMU Logical Topology",
+ * Table 3: Special Address Controls (GPA) for more information.
+ */
+#define AMD_HT_START 0xfd00000000UL
+#define AMD_HT_END   0xffffffffffUL
+#define AMD_ABOVE_1TB_START  (AMD_HT_END + 1)
+#define AMD_HT_SIZE  (AMD_ABOVE_1TB_START - AMD_HT_START)
+
+static hwaddr x86_max_phys_addr(PCMachineState *pcms,
+uint64_t pci_hole64_size)
+{
+PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(pcms);
+X86MachineState *x86ms = X86_MACHINE(pcms);
+MachineState *machine = MACHINE(pcms);
+ram_addr_t device_mem_size = 0;
+hwaddr base;
+
+if (pcmc->has_reserved_memory &&
+   (machine->ram_size < machine->maxram_size)) {
+device_mem_size = machine->maxram_size - machine->ram_size;
+}
+
+base = ROUND_UP(x86ms->above_4g_mem_start + x86ms->above_4g_mem_size +
+pcms->sgx_epc.size, 1 * GiB);
+
+return base + device_mem_size + pci_hole64_size;
+}
+
+static void x86_update_above_4g_mem_start(PCMachineState *pcms,
+  uint64_t pci_hole64_size)
+{
+X86MachineState *x86ms = X86_MACHINE(pcms);
+uint32_t eax, vendor[3];
+
+host_cpuid(0x0, 0, &eax, &vendor[0], &vendor[2], &vendor[1]);
+if (!IS_AMD_VENDOR(vendor)) {
+return;
+}
+
+if (x86_max_phys_addr(pcms, pci_hole64_size) < AMD_HT_START) {
+return;
+}
+
+x86ms->above_4g_mem_start = AMD_ABOVE_1TB_START;
+}
+
 void