On Tue, 2014-01-14 at 18:18 +0200, Michael S. Tsirkin wrote:
> On Tue, Jan 14, 2014 at 09:15:14AM -0700, Alex Williamson wrote:
> > On Tue, 2014-01-14 at 18:03 +0200, Michael S. Tsirkin wrote:
> > > On Tue, Jan 14, 2014 at 08:57:58AM -0700, Alex Williamson wrote:
> > > > On Tue, 2014-01-14 at 14:07 +0200, Michael S. Tsirkin wrote:
> > > > > On Mon, Jan 13, 2014 at 03:48:11PM -0700, Alex Williamson wrote:
> > > > > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > > > > > > 
> > > > > > > > Am 13.01.2014 um 22:39 schrieb Alex Williamson 
> > > > > > > > <alex.william...@redhat.com>:
> > > > > > > > 
> > > > > > > >> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > > > > > > >>> On 12.01.2014, at 08:54, Michael S. Tsirkin <m...@redhat.com> 
> > > > > > > >>> wrote:
> > > > > > > >>> 
> > > > > > > >>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson 
> > > > > > > >>>> wrote:
> > > > > > > >>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > > > > > > >>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson 
> > > > > > > >>>>>> wrote:
> > > > > > > >>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin 
> > > > > > > >>>>>>> wrote:
> > > > > > > >>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex 
> > > > > > > >>>>>>>> Williamson wrote:
> > > > > > > >>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson 
> > > > > > > >>>>>>>>> wrote:
> > > > > > > >>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin 
> > > > > > > >>>>>>>>>> wrote:
> > > > > > > >>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex 
> > > > > > > >>>>>>>>>>> Williamson wrote:
> > > > > > > >>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. 
> > > > > > > >>>>>>>>>>>> Tsirkin wrote:
> > > > > > > >>>>>>>>>>>> From: Paolo Bonzini <pbonz...@redhat.com>
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit 
> > > > > > > >>>>>>>>>>>> system memory
> > > > > > > >>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 
> > > > > > > >>>>>>>>>>>> 64-bit wide.
> > > > > > > >>>>>>>>>>>> This eliminates problems with phys_page_find 
> > > > > > > >>>>>>>>>>>> ignoring bits above
> > > > > > > >>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and 
> > > > > > > >>>>>>>>>>>> address_space_translate_internal
> > > > > > > >>>>>>>>>>>> consequently messing up the computations.
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to 
> > > > > > > >>>>>>>>>>>> read from address
> > > > > > > >>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  
> > > > > > > >>>>>>>>>>>> The region it gets
> > > > > > > >>>>>>>>>>>> is the newly introduced master abort region, which 
> > > > > > > >>>>>>>>>>>> is as big as the PCI
> > > > > > > >>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo 
> > > > > > > >>>>>>>>>>>> that's only 2^63-1,
> > > > > > > >>>>>>>>>>>> not 2^64.  But we get it anyway because 
> > > > > > > >>>>>>>>>>>> phys_page_find ignores the upper
> > > > > > > >>>>>>>>>>>> bits of the physical address.  In 
> > > > > > > >>>>>>>>>>>> address_space_translate_internal then
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> > > > > > > >>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
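
(To make the above concrete, assuming the master abort region starts at
offset 0 so that addr remains the raw physical address:

    size = 2^63 - 1    = 0x7fffffffffffffff   <- the typo'd region size
    addr               = 0xffffffffffffffe6   <- gdb's read
    diff = size - addr = negative

int128_get64() then asserts, because a negative Int128 cannot be returned
as an unsigned 64-bit value.)
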
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> The size of the PCI address space region should be 
> > > > > > > >>>>>>>>>>>> fixed anyway.
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitul...@redhat.com>
> > > > > > > >>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> > > > > > > >>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > > > > > > >>>>>>>>>>>> ---
> > > > > > > >>>>>>>>>>>> exec.c | 8 ++------
> > > > > > > >>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > > > > > > >>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > > > > > > >>>>>>>>>>>> --- a/exec.c
> > > > > > > >>>>>>>>>>>> +++ b/exec.c
> > > > > > > >>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > > > > > > >>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> > > > > > > >>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > > > > > > >>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> #define P_L2_BITS 10
> > > > > > > >>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> > > > > > > >>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > > > > > > >>>>>>>>>>>> {
> > > > > > > >>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > > > > > > >>>>>>>>>>>> -
> > > > > > > >>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > > > > > > >>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > > > > > > >>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > > > > > > >>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > > > > > > >>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> > > > > > > >>>>>>>>>>>> 
> > > > > > > >>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> > > > > > > >>>>>>>>>>> 
> > > > > > > >>>>>>>>>>> This seems to have some unexpected consequences 
> > > > > > > >>>>>>>>>>> around sizing 64bit PCI
> > > > > > > >>>>>>>>>>> BARs that I'm not sure how to handle.
> > > > > > > >>>>>>>>>> 
> > > > > > > >>>>>>>>>> BARs are often disabled during sizing. Maybe you
> > > > > > > >>>>>>>>>> don't detect the BAR being disabled?
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>> See the trace below, the BARs are not disabled.  QEMU 
> > > > > > > >>>>>>>>> pci-core is doing
> > > > > > > >>>>>>>>> the sizing and memory region updates for the BARs, vfio 
> > > > > > > >>>>>>>>> is just a
> > > > > > > >>>>>>>>> pass-through here.
> > > > > > > >>>>>>>> 
> > > > > > > >>>>>>>> Sorry, not in the trace below, but yes the sizing seems 
> > > > > > > >>>>>>>> to be happening
> > > > > > > >>>>>>>> while I/O & memory are enabled in the command register.  
> > > > > > > >>>>>>>> Thanks,
> > > > > > > >>>>>>>> 
> > > > > > > >>>>>>>> Alex
> > > > > > > >>>>>>> 
> > > > > > > >>>>>>> OK then from QEMU POV this BAR value is not special at 
> > > > > > > >>>>>>> all.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Unfortunately
> > > > > > > >>>>>> 
> > > > > > > >>>>>>>>>>> After this patch I get vfio
> > > > > > > >>>>>>>>>>> traces like this:
> > > > > > > >>>>>>>>>>> 
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > > > > > > >>>>>>>>>>> (save lower 32bits of BAR)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > > > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > > > > > > >>>>>>>>>>> (read size mask)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > > > > > > >>>>>>>>>>> (restore BAR)
> > > > > > > >>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > > > > > > >>>>>>>>>>> (memory region re-mapped)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > > > > > > >>>>>>>>>>> (save upper 32bits of BAR)
> > > > > > > >>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > > > > > > >>>>>>>>>>> (write mask to BAR)
> > > > > > > >>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > > > > > > >>>>>>>>>>> (memory region gets unmapped)
> > > > > > > >>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > > > > > > >>>>>>>>>>> (memory region gets re-mapped with new address)
> > > > > > > >>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > > > > > > >>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
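
To spell out why the fffffffffebe0000 mapping appears, here is a rough
sketch of the 64-bit BAR sizing sequence the guest performs while memory
decode is still enabled.  pci_cfg_read()/pci_cfg_write() are made-up
config-space accessors, and the values in the comments come from the
trace above:

    #include <stdint.h>

    uint32_t pci_cfg_read(void *dev, int off);             /* assumed helpers */
    void pci_cfg_write(void *dev, int off, uint32_t val);

    static uint64_t size_64bit_bar(void *dev, int off)      /* off = 0x10 here */
    {
        uint32_t lo, hi, size_lo, size_hi;

        lo = pci_cfg_read(dev, off);                 /* 0xfebe0004 */
        pci_cfg_write(dev, off, 0xffffffff);
        size_lo = pci_cfg_read(dev, off);            /* 0xffffc004 */
        pci_cfg_write(dev, off, lo);                 /* restore low half */

        hi = pci_cfg_read(dev, off + 4);             /* 0x00000000 */
        pci_cfg_write(dev, off + 4, 0xffffffff);     /* BAR now decodes at
                                                      * 0xfffffffffebe0000: high
                                                      * half all ones, low half
                                                      * already restored, decode
                                                      * still enabled, hence the
                                                      * stray region_add above */
        size_hi = pci_cfg_read(dev, off + 4);
        pci_cfg_write(dev, off + 4, hi);             /* restore high half */

        /* mask off the type bits, invert, add one: 0x4000 for this device */
        return ~(((uint64_t)size_hi << 32) | (size_lo & ~0xfULL)) + 1;
    }

Because the two dwords cannot be written atomically, the intermediate
value is unavoidable whenever sizing is done with the memory space bit
left set in the command register.
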
> > > > > > > >>>>>>>>>> 
> > > > > > > >>>>>>>>>> Why are you trying to program BAR addresses for dma in 
> > > > > > > >>>>>>>>>> the iommu?
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>> Two reasons, first I can't tell the difference between 
> > > > > > > >>>>>>>>> RAM and MMIO.
> > > > > > > >>>>>>> 
> > > > > > > >>>>>>> Why can't you? Generally the memory core lets you find out 
> > > > > > > >>>>>>> easily.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> My MemoryListener is setup for &address_space_memory and I 
> > > > > > > >>>>>> then filter
> > > > > > > >>>>>> out anything that's not memory_region_is_ram().  This 
> > > > > > > >>>>>> still gets
> > > > > > > >>>>>> through, so how do I easily find out?
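
For context, the filter in question looks roughly like the sketch below
(not the actual vfio listener; vfio_dma_map_section() is a placeholder
for the real container mapping code).  The catch is that a BAR exported
with memory_region_init_ram_ptr() passes the memory_region_is_ram()
test, so the stray mapping still reaches the IOMMU:

    static void vfio_listener_region_add(MemoryListener *listener,
                                         MemoryRegionSection *section)
    {
        /* Skip anything that is not RAM-backed... */
        if (!memory_region_is_ram(section->mr)) {
            return;
        }
        /* ...but an mmap'd BAR registered via memory_region_init_ram_ptr()
         * counts as RAM by this test, so it is not filtered out here. */
        vfio_dma_map_section(section);
    }
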
> > > > > > > >>>>>> 
> > > > > > > >>>>>>> But in this case it's vfio device itself that is sized so 
> > > > > > > >>>>>>> for sure you
> > > > > > > >>>>>>> know it's MMIO.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> How so?  I have a MemoryListener as described above and 
> > > > > > > >>>>>> pass everything
> > > > > > > >>>>>> through to the IOMMU.  I suppose I could look through all 
> > > > > > > >>>>>> the
> > > > > > > >>>>>> VFIODevices and check if the MemoryRegion matches, but 
> > > > > > > >>>>>> that seems really
> > > > > > > >>>>>> ugly.
> > > > > > > >>>>>> 
> > > > > > > >>>>>>> Maybe you will have same issue if there's another device 
> > > > > > > >>>>>>> with a 64 bit
> > > > > > > >>>>>>> bar though, like ivshmem?
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Perhaps, I suspect I'll see anything that registers their 
> > > > > > > >>>>>> BAR
> > > > > > > >>>>>> MemoryRegion from memory_region_init_ram or 
> > > > > > > >>>>>> memory_region_init_ram_ptr.
> > > > > > > >>>>> 
> > > > > > > >>>>> Must be a 64 bit BAR to trigger the issue though.
> > > > > > > >>>>> 
> > > > > > > >>>>>>>>> Second, it enables peer-to-peer DMA between devices, 
> > > > > > > >>>>>>>>> which is something
> > > > > > > >>>>>>>>> that we might be able to take advantage of with GPU 
> > > > > > > >>>>>>>>> passthrough.
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>>>> Prior to this change, there was no re-map with the 
> > > > > > > >>>>>>>>>>> fffffffffebe0000
> > > > > > > >>>>>>>>>>> address, presumably because it was beyond the address 
> > > > > > > >>>>>>>>>>> space of the PCI
> > > > > > > >>>>>>>>>>> window.  This address is clearly not in a PCI MMIO 
> > > > > > > >>>>>>>>>>> space, so why are we
> > > > > > > >>>>>>>>>>> allowing it to be realized in the system address 
> > > > > > > >>>>>>>>>>> space at this location?
> > > > > > > >>>>>>>>>>> Thanks,
> > > > > > > >>>>>>>>>>> 
> > > > > > > >>>>>>>>>>> Alex
> > > > > > > >>>>>>>>>> 
> > > > > > > >>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > > > > > > >>>>>>>>>> True, CPU can't access this address but other pci 
> > > > > > > >>>>>>>>>> devices can.
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>> What happens on real hardware when an address like this 
> > > > > > > >>>>>>>>> is programmed to
> > > > > > > >>>>>>>>> a device?  The CPU doesn't have the physical bits to 
> > > > > > > >>>>>>>>> access it.  I have
> > > > > > > >>>>>>>>> serious doubts that another PCI device would be able to 
> > > > > > > >>>>>>>>> access it
> > > > > > > >>>>>>>>> either.  Maybe in some limited scenario where the 
> > > > > > > >>>>>>>>> devices are on the
> > > > > > > >>>>>>>>> same conventional PCI bus.  In the typical case, PCI 
> > > > > > > >>>>>>>>> addresses are
> > > > > > > >>>>>>>>> always limited by some kind of aperture, whether that's 
> > > > > > > >>>>>>>>> explicit in
> > > > > > > >>>>>>>>> bridge windows or implicit in hardware design (and 
> > > > > > > >>>>>>>>> perhaps made explicit
> > > > > > > >>>>>>>>> in ACPI).  Even if I wanted to filter these out as 
> > > > > > > >>>>>>>>> noise in vfio, how
> > > > > > > >>>>>>>>> would I do it in a way that still allows real 64bit 
> > > > > > > >>>>>>>>> MMIO to be
> > > > > > > >>>>>>>>> programmed?  PCI has this knowledge, I hope.  VFIO 
> > > > > > > >>>>>>>>> doesn't.  Thanks,
> > > > > > > >>>>>>>>> 
> > > > > > > >>>>>>>>> Alex
> > > > > > > >>>>>>> 
> > > > > > > >>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec 
> > > > > > > >>>>>>> is explicit that
> > > > > > > >>>>>>> full 64 bit addresses must be allowed and hardware 
> > > > > > > >>>>>>> validation
> > > > > > > >>>>>>> test suites normally check that it actually does work
> > > > > > > >>>>>>> if it happens.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Sure, PCI devices themselves, but the chipset typically 
> > > > > > > >>>>>> has defined
> > > > > > > >>>>>> routing, that's more what I'm referring to.  There are 
> > > > > > > >>>>>> generally only
> > > > > > > >>>>>> fixed address windows for RAM vs MMIO.
> > > > > > > >>>>> 
> > > > > > > >>>>> The physical chipset? Likely - in the presence of IOMMU.
> > > > > > > >>>>> Without that, devices can talk to each other without going
> > > > > > > >>>>> through chipset, and bridge spec is very explicit that
> > > > > > > >>>>> full 64 bit addressing must be supported.
> > > > > > > >>>>> 
> > > > > > > >>>>> So as long as we don't emulate an IOMMU,
> > > > > > > >>>>> guest will normally think it's okay to use any address.
> > > > > > > >>>>> 
> > > > > > > >>>>>>> Yes, if there's a bridge somewhere on the path, that 
> > > > > > > >>>>>>> bridge's
> > > > > > > >>>>>>> windows would protect you, but pci already does this 
> > > > > > > >>>>>>> filtering:
> > > > > > > >>>>>>> if you see this address in the memory map this means
> > > > > > > >>>>>>> your virtual device is on root bus.
> > > > > > > >>>>>>> 
> > > > > > > >>>>>>> So I think it's the other way around: if VFIO requires 
> > > > > > > >>>>>>> specific
> > > > > > > >>>>>>> address ranges to be assigned to devices, it should give 
> > > > > > > >>>>>>> this
> > > > > > > >>>>>>> info to qemu and qemu can give this to guest.
> > > > > > > >>>>>>> Then anything outside that range can be ignored by VFIO.
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Then we get into deficiencies in the IOMMU API and maybe 
> > > > > > > >>>>>> VFIO.  There's
> > > > > > > >>>>>> currently no way to find out the address width of the 
> > > > > > > >>>>>> IOMMU.  We've been
> > > > > > > >>>>>> getting by because it's safely close enough to the CPU 
> > > > > > > >>>>>> address width to
> > > > > > > >>>>>> not be a concern until we start exposing things at the top 
> > > > > > > >>>>>> of the 64bit
> > > > > > > >>>>>> address space.  Maybe I can safely ignore anything above
> > > > > > > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > > > > > > >>>>>> 
> > > > > > > >>>>>> Alex
> > > > > > > >>>>> 
> > > > > > > >>>>> I think it's not related to target CPU at all - it's a host 
> > > > > > > >>>>> limitation.
> > > > > > > >>>>> So just make up your own constant, maybe depending on host 
> > > > > > > >>>>> architecture.
> > > > > > > >>>>> Long term add an ioctl to query it.
> > > > > > > >>>> 
> > > > > > > >>>> It's a hardware limitation which I'd imagine has some loose 
> > > > > > > >>>> ties to the
> > > > > > > >>>> physical address bits of the CPU.
> > > > > > > >>>> 
> > > > > > > >>>>> Also, we can add a fwcfg interface to tell bios that it 
> > > > > > > >>>>> should avoid
> > > > > > > >>>>> placing BARs above some address.
> > > > > > > >>>> 
> > > > > > > >>>> That doesn't help this case; it's a spurious mapping caused 
> > > > > > > >>>> by sizing
> > > > > > > >>>> the BARs with them enabled.  We may still want such a thing 
> > > > > > > >>>> to feed into
> > > > > > > >>>> building ACPI tables though.
> > > > > > > >>> 
> > > > > > > >>> Well the point is that if you want BIOS to avoid
> > > > > > > >>> specific addresses, you need to tell it what to avoid.
> > > > > > > >>> But neither BIOS nor ACPI actually cover the range above
> > > > > > > >>> 2^48 ATM so it's not a high priority.
> > > > > > > >>> 
> > > > > > > >>>>> Since it's a vfio limitation I think it should be a vfio 
> > > > > > > >>>>> API, along the
> > > > > > > >>>>> lines of vfio_get_addr_space_bits(void).
> > > > > > > >>>>> (Is this true btw? legacy assignment doesn't have this 
> > > > > > > >>>>> problem?)
> > > > > > > >>>> 
> > > > > > > >>>> It's an IOMMU hardware limitation; legacy assignment has the 
> > > > > > > >>>> same
> > > > > > > >>>> problem.  It looks like legacy will abort() in QEMU for the 
> > > > > > > >>>> failed
> > > > > > > >>>> mapping and I'm planning to tighten vfio to also kill the VM 
> > > > > > > >>>> for failed
> > > > > > > >>>> mappings.  In the short term, I think I'll ignore any 
> > > > > > > >>>> mappings above
> > > > > > > >>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > > > > > > >>> 
> > > > > > > >>> That seems very wrong. It will still fail on an x86 host if 
> > > > > > > >>> we are
> > > > > > > >>> emulating a CPU with full 64 bit addressing. The limitation 
> > > > > > > >>> is on the
> > > > > > > >>> host side; there's no real reason to tie it to the target.
> > > > > > > > 
> > > > > > > > I doubt vfio would be the only thing broken in that case.
> > > > > > > > 
> > > > > > > >>>> long term vfio already has an IOMMU info
> > > > > > > >>>> ioctl that we could use to return this information, but 
> > > > > > > >>>> we'll need to
> > > > > > > >>>> figure out how to get it out of the IOMMU driver first.
> > > > > > > >>>> Thanks,
> > > > > > > >>>> 
> > > > > > > >>>> Alex
> > > > > > > >>> 
> > > > > > > >>> Short term, just assume 48 bits on x86.
> > > > > > > > 
> > > > > > > > I hate to pick an arbitrary value since we have a very specific 
> > > > > > > > mapping
> > > > > > > > we're trying to avoid.  Perhaps a better option is to skip 
> > > > > > > > anything
> > > > > > > > where:
> > > > > > > > 
> > > > > > > >        MemoryRegionSection.offset_within_address_space >
> > > > > > > >        ~MemoryRegionSection.offset_within_address_space
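
Spelled out as code (just a sketch of the proposed test, not something
that exists in vfio today), the condition is simply "the top bit of the
address is set", i.e. the section sits in the upper half of the 64-bit
address space:

    static bool section_is_stray(MemoryRegionSection *section)
    {
        hwaddr start = section->offset_within_address_space;

        /* start > ~start holds exactly when bit 63 of start is set */
        return start > ~start;
    }

That would skip the fffffffffebe0000 mapping while still passing
ordinary 64-bit MMIO below 2^63 through to the IOMMU.
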
> > > > > > > > 
> > > > > > > >>> We need to figure out what's the limitation on ppc and arm -
> > > > > > > >>> maybe there's none and it can address full 64 bit range.
> > > > > > > >> 
> > > > > > > >> IIUC on PPC and ARM you always have BAR windows where things 
> > > > > > > >> can get mapped into. Unlike x86, where the full physical 
> > > > > > > >> address range can be overlaid by BARs.
> > > > > > > >> 
> > > > > > > >> Or did I misunderstand the question?
> > > > > > > > 
> > > > > > > > Sounds right, if either BAR mappings outside the window will 
> > > > > > > > not be
> > > > > > > > realized in the memory space or the IOMMU has a full 64bit 
> > > > > > > > address
> > > > > > > > space, there's no problem.  Here we have an intermediate step 
> > > > > > > > in the BAR
> > > > > > > > sizing producing a stray mapping that the IOMMU hardware can't 
> > > > > > > > handle.
> > > > > > > > Even if we could handle it, it's not clear that we want to.  On 
> > > > > > > > AMD-Vi
> > > > > > > > the IOMMU page tables can grow to 6 levels deep.  A stray 
> > > > > > > > mapping like
> > > > > > > > this then causes space and time overhead until the tables are 
> > > > > > > > pruned
> > > > > > > > back down.  Thanks,
> > > > > > > 
> > > > > > > I thought sizing is hard-defined as a write of all ones
> > > > > > > (-1)? Can't we check for that one special case and treat it as "not 
> > > > > > > mapped, but tell the guest the size in config space"?
> > > > > > 
> > > > > > PCI doesn't want to handle this as anything special to 
> > > > > > differentiate a
> > > > > > sizing mask from a valid BAR address.  I agree though, I'd prefer to
> > > > > > never see a spurious address like this in my MemoryListener.
> > > > > 
> > > > > It's more a "can't" than a "doesn't want to": it's a 64 bit BAR, so it's not
> > > > > set to all ones atomically.
> > > > > 
> > > > > Also, while it doesn't address this fully (same issue can happen
> > > > > e.g. with ivshmem), do you think we should distinguish these BARs 
> > > > > mapped
> > > > > from vfio / device assignment in qemu somehow?
> > > > > 
> > > > > In particular, even when it has sane addresses:
> > > > > a device really cannot DMA into its own BAR; that's a spec violation,
> > > > > so in theory it can do anything, including crashing the system.
> > > > > I don't know what happens in practice, but
> > > > > if you are programming the IOMMU to forward transactions back to
> > > > > the device that originated them, you are not doing it any favors.
> > > > 
> > > > I might concede that peer-to-peer is more trouble than it's worth if I
> > > > had a convenient way to ignore MMIO mappings in my MemoryListener, but I
> > > > don't.
> > > 
> > > Well, for VFIO devices you are creating these mappings, so we surely
> > > can find a way for you to check that.
> > > Doesn't each segment point back at the memory region that created it?
> > > Then you can just check that.
> > 
> > It's a fairly heavy-weight search and it only avoids vfio devices, so it
> > feels like it's just delaying a real solution.
> 
> Well there are several problems.
> 
> That a device gets its own BAR programmed
> as a valid target in the IOMMU is in my opinion a separate bug,
> and for *that* it's a real solution.

Except the side-effect of that solution is that it also disables
peer-to-peer since we do not use separate IOMMU domains per device.  In
fact, we can't guarantee that it's possible to use separate IOMMU
domains per device.  So, the cure is worse than the disease.

> > > >  Self-DMA is really not the intent of doing the mapping, but
> > > > peer-to-peer does have merit.
> > > > 
> > > > > I also note that if someone tries zero copy transmit out of such an
> > > > > address, get_user_pages() will fail.
> > > > > I think this means tun zero copy transmit needs to fall back
> > > > > on a copy from user on get_user_pages() failure.
> > > > > 
> > > > > Jason, what's your thinking on this?
> > > > > 