On Tue, Jan 14, 2014 at 12:24:24PM +0200, Avi Kivity wrote:
> On 01/14/2014 12:48 AM, Alex Williamson wrote:
> >On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> >>>On 13.01.2014 at 22:39, Alex Williamson <alex.william...@redhat.com> wrote:
> >>>
> >>>>On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> >>>>>On 12.01.2014, at 08:54, Michael S. Tsirkin <m...@redhat.com> wrote:
> >>>>>
> >>>>>>On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >>>>>>>On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>>>>>>>On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>>>>>>>On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>>>>>>>On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>>>>>>>On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>>From: Paolo Bonzini <pbonz...@redhat.com>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>>>>>>size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>>>>>>This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>>>>>>TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>>>>>>consequently messing up the computations.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>>>>>>0xffffffffffffffe6 to 0xffffffffffffffff inclusive. The region it gets
> >>>>>>>>>>>>>>is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>>>>>>address space (see pci_bus_init). Due to a typo that's only 2^63-1,
> >>>>>>>>>>>>>>not 2^64. But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>>>>>>bits of the physical address. In address_space_translate_internal then
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>    diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>>>>>    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>Reported-by: Luiz Capitulino <lcapitul...@redhat.com>
> >>>>>>>>>>>>>>Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> >>>>>>>>>>>>>>Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> >>>>>>>>>>>>>>---
> >>>>>>>>>>>>>> exec.c | 8 ++------
> >>>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>diff --git a/exec.c b/exec.c
> >>>>>>>>>>>>>>index 7e5ce93..f907f5f 100644
> >>>>>>>>>>>>>>--- a/exec.c
> >>>>>>>>>>>>>>+++ b/exec.c
> >>>>>>>>>>>>>>@@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables. */
> >>>>>>>>>>>>>>-#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>>>>>>+#define ADDR_SPACE_BITS 64
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> #define P_L2_BITS 10
> >>>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>>>>>>@@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>     system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>-    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>>>>>>-
> >>>>>>>>>>>>>>-    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>>>>>>-                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>>>>>>-                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>>>>>>+    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>>>>>     address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>     system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>>>>>This seems to have some unexpected consequences around sizing
> >>>>>>>>>>>>>64bit PCI BARs that I'm not sure how to handle.
> >>>>>>>>>>>>BARs are often disabled during sizing. Maybe you
> >>>>>>>>>>>>don't detect BAR being disabled?
> >>>>>>>>>>>See the trace below, the BARs are not disabled. QEMU pci-core is doing
> >>>>>>>>>>>the sizing and memory region updates for the BARs, vfio is just a
> >>>>>>>>>>>pass-through here.
> >>>>>>>>>>Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>>>>>>while I/O & memory are enabled in the command register. Thanks,
> >>>>>>>>>>
> >>>>>>>>>>Alex
> >>>>>>>>>OK then from QEMU POV this BAR value is not special at all.
> >>>>>>>>Unfortunately
> >>>>>>>>
> >>>>>>>>>>>>>After this patch I get vfio traces like this:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>>>>>>(save lower 32bits of BAR)
> >>>>>>>>>>>>>vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>>>>>>(write mask to BAR)
> >>>>>>>>>>>>>vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>>(memory region gets unmapped)
> >>>>>>>>>>>>>vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>>>>>>(read size mask)
> >>>>>>>>>>>>>vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>>>>>>(restore BAR)
> >>>>>>>>>>>>>vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>>(memory region re-mapped)
> >>>>>>>>>>>>>vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>>>>>>(save upper 32bits of BAR)
> >>>>>>>>>>>>>vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>>>>>>(write mask to BAR)
> >>>>>>>>>>>>>vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>>(memory region gets unmapped)
> >>>>>>>>>>>>>vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>>(memory region gets re-mapped with new address)
> >>>>>>>>>>>>>qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000,
> >>>>>>>>>>>>>0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>>>>>>(iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>>>>>Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>>>>>Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>>>>>>Why can't you? Generally memory core lets you find out easily.
> >>>>>>>>My MemoryListener is setup for &address_space_memory and I then filter
> >>>>>>>>out anything that's not memory_region_is_ram(). This still gets
> >>>>>>>>through, so how do I easily find out?
> >>>>>>>>
> >>>>>>>>>But in this case it's vfio device itself that is sized so for sure
> >>>>>>>>>you know it's MMIO.
> >>>>>>>>How so? I have a MemoryListener as described above and pass everything
> >>>>>>>>through to the IOMMU. I suppose I could look through all the
> >>>>>>>>VFIODevices and check if the MemoryRegion matches, but that seems
> >>>>>>>>really ugly.
> >>>>>>>>
> >>>>>>>>>Maybe you will have same issue if there's another device with a 64
> >>>>>>>>>bit bar though, like ivshmem?
> >>>>>>>>Perhaps, I suspect I'll see anything that registers their BAR
> >>>>>>>>MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>>>>>>Must be a 64 bit BAR to trigger the issue though.
> >>>>>>>
> >>>>>>>>>>>Second, it enables peer-to-peer DMA between devices, which is
> >>>>>>>>>>>something that we might be able to take advantage of with GPU
> >>>>>>>>>>>passthrough.
> >>>>>>>>>>>
> >>>>>>>>>>>>>Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>>>>>>address, presumably because it was beyond the address space of
> >>>>>>>>>>>>>the PCI window. This address is clearly not in a PCI MMIO space,
> >>>>>>>>>>>>>so why are we allowing it to be realized in the system address
> >>>>>>>>>>>>>space at this location? Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>Alex
> >>>>>>>>>>>>Why do you think it is not in PCI MMIO space?
> >>>>>>>>>>>>True, CPU can't access this address but other pci devices can.
> >>>>>>>>>>>What happens on real hardware when an address like this is programmed
> >>>>>>>>>>>to a device? The CPU doesn't have the physical bits to access it.
> >>>>>>>>>>>I have serious doubts that another PCI device would be able to access
> >>>>>>>>>>>it either. Maybe in some limited scenario where the devices are on
> >>>>>>>>>>>the same conventional PCI bus. In the typical case, PCI addresses are
> >>>>>>>>>>>always limited by some kind of aperture, whether that's explicit in
> >>>>>>>>>>>bridge windows or implicit in hardware design (and perhaps made
> >>>>>>>>>>>explicit in ACPI). Even if I wanted to filter these out as noise in
> >>>>>>>>>>>vfio, how would I do it in a way that still allows real 64bit MMIO
> >>>>>>>>>>>to be programmed. PCI has this knowledge, I hope. VFIO doesn't.
> >>>>>>>>>>>Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>>Alex
> >>>>>>>>>AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit
> >>>>>>>>>that full 64 bit addresses must be allowed and hardware validation
> >>>>>>>>>test suites normally check that it actually does work if it happens.
> >>>>>>>>Sure, PCI devices themselves, but the chipset typically has defined
> >>>>>>>>routing, that's more what I'm referring to. There are generally only
> >>>>>>>>fixed address windows for RAM vs MMIO.
> >>>>>>>The physical chipset? Likely - in the presence of IOMMU.
> >>>>>>>Without that, devices can talk to each other without going
> >>>>>>>through chipset, and bridge spec is very explicit that
> >>>>>>>full 64 bit addressing must be supported.
> >>>>>>>
> >>>>>>>So as long as we don't emulate an IOMMU,
> >>>>>>>guest will normally think it's okay to use any address.
> >>>>>>>
> >>>>>>>>>Yes, if there's a bridge somewhere on the path that bridge's
> >>>>>>>>>windows would protect you, but pci already does this filtering:
> >>>>>>>>>if you see this address in the memory map this means
> >>>>>>>>>your virtual device is on root bus.
> >>>>>>>>>
> >>>>>>>>>So I think it's the other way around: if VFIO requires specific
> >>>>>>>>>address ranges to be assigned to devices, it should give this
> >>>>>>>>>info to qemu and qemu can give this to guest.
> >>>>>>>>>Then anything outside that range can be ignored by VFIO.
> >>>>>>>>Then we get into deficiencies in the IOMMU API and maybe VFIO. There's
> >>>>>>>>currently no way to find out the address width of the IOMMU. We've been
> >>>>>>>>getting by because it's safely close enough to the CPU address width to
> >>>>>>>>not be a concern until we start exposing things at the top of the 64bit
> >>>>>>>>address space. Maybe I can safely ignore anything above
> >>>>>>>>TARGET_PHYS_ADDR_SPACE_BITS for now. Thanks,
> >>>>>>>>
> >>>>>>>>Alex
> >>>>>>>I think it's not related to target CPU at all - it's a host limitation.
> >>>>>>>So just make up your own constant, maybe depending on host architecture.
> >>>>>>>Long term add an ioctl to query it.
> >>>>>>It's a hardware limitation which I'd imagine has some loose ties to the
> >>>>>>physical address bits of the CPU.
> >>>>>>
> >>>>>>>Also, we can add a fwcfg interface to tell bios that it should avoid
> >>>>>>>placing BARs above some address.
> >>>>>>That doesn't help this case, it's a spurious mapping caused by sizing
> >>>>>>the BARs with them enabled. We may still want such a thing to feed into
> >>>>>>building ACPI tables though.
> >>>>>Well the point is that if you want BIOS to avoid
> >>>>>specific addresses, you need to tell it what to avoid.
> >>>>>But neither BIOS nor ACPI actually cover the range above
> >>>>>2^48 ATM so it's not a high priority.
> >>>>>
> >>>>>>>Since it's a vfio limitation I think it should be a vfio API, along the
> >>>>>>>lines of vfio_get_addr_space_bits(void).
> >>>>>>>(Is this true btw? legacy assignment doesn't have this problem?)
> >>>>>>It's an IOMMU hardware limitation, legacy assignment has the same
> >>>>>>problem. It looks like legacy will abort() in QEMU for the failed
> >>>>>>mapping and I'm planning to tighten vfio to also kill the VM for failed
> >>>>>>mappings. In the short term, I think I'll ignore any mappings above
> >>>>>>TARGET_PHYS_ADDR_SPACE_BITS,
> >>>>>That seems very wrong. It will still fail on an x86 host if we are
> >>>>>emulating a CPU with full 64 bit addressing. The limitation is on the
> >>>>>host side; there's no real reason to tie it to the target.
> >>>I doubt vfio would be the only thing broken in that case.
> >>>
> >>>>>>long term vfio already has an IOMMU info
> >>>>>>ioctl that we could use to return this information, but we'll need to
> >>>>>>figure out how to get it out of the IOMMU driver first.
> >>>>>>Thanks,
> >>>>>>
> >>>>>>Alex
> >>>>>Short term, just assume 48 bits on x86.
> >>>I hate to pick an arbitrary value since we have a very specific mapping
> >>>we're trying to avoid. Perhaps a better option is to skip anything where:
> >>>
> >>>  MemoryRegionSection.offset_within_address_space >
> >>>  ~MemoryRegionSection.offset_within_address_space
> >>>
> >>>>>We need to figure out what's the limitation on ppc and arm -
> >>>>>maybe there's none and it can address full 64 bit range.
> >>>>IIUC on PPC and ARM you always have BAR windows where things can get
> >>>>mapped into. Unlike x86 where the full physical address range can be
> >>>>overlaid by BARs.
> >>>>
> >>>>Or did I misunderstand the question?
> >>>Sounds right, if either BAR mappings outside the window will not be
> >>>realized in the memory space or the IOMMU has a full 64bit address
> >>>space, there's no problem. Here we have an intermediate step in the BAR
> >>>sizing producing a stray mapping that the IOMMU hardware can't handle.
> >>>Even if we could handle it, it's not clear that we want to. On AMD-Vi
> >>>the IOMMU page tables can grow to 6 levels deep. A stray mapping like
> >>>this then causes space and time overhead until the tables are pruned
> >>>back down. Thanks,
> >>I thought sizing is hard-defined as a write of -1? Can't we check for
> >>that one special case and treat it as "not mapped, but tell the guest
> >>the size in config space"?
> >PCI doesn't want to handle this as anything special to differentiate a
> >sizing mask from a valid BAR address. I agree though, I'd prefer to
> >never see a spurious address like this in my MemoryListener.
> >
> >
>
> Can't you just ignore regions that cannot be mapped? Oh, and teach
> the bios and/or linux to disable memory access while sizing.
I know Linux won't disable memory access while sizing because there are
some broken devices where you can't re-enable it afterwards.
It should be harmless to set the BAR to any silly value as long as
you are careful not to access it.

--
MST
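For readers who want to see where the transient 0xfffffffffebe0000 address in
the trace comes from: it is the standard 64-bit BAR sizing sequence (save the
BAR, write all ones, read back the size mask, restore), performed while memory
decode is still enabled. The sketch below is purely illustrative; it is not
QEMU, vfio, or kernel code, and the fake config-space array, the
cfg_read/cfg_write helpers, and the BAR geometry (a 16 KiB 64-bit memory BAR
at 0xfebe0000) are invented here to mirror the values in the trace above.

/*
 * Illustrative sketch only (not QEMU, vfio, or kernel code).  It models the
 * classic 64-bit BAR sizing sequence performed while memory decode stays
 * enabled, which is what momentarily produces the 0xfffffffffebe0000 mapping
 * seen in the trace above.  The "config space" is a fake in-memory array
 * with a 16 KiB 64-bit memory BAR at 0xfebe0000; all names and values here
 * are made up for the example.
 */
#include <stdint.h>
#include <stdio.h>

static uint32_t cfg[0x40 / 4] = {
    [0x10 / 4] = 0xfebe0004,   /* BAR0: low half, 64-bit non-prefetchable */
    [0x14 / 4] = 0x00000000,   /* BAR1: high half */
};

static const uint64_t bar_size = 0x4000;   /* 16 KiB: bits 13:4 read as 0 */

static uint32_t cfg_read(unsigned off) { return cfg[off / 4]; }

static void cfg_write(unsigned off, uint32_t val)
{
    if (off == 0x10) {
        /* Bits covered by the size read back as 0; low 4 bits are hardwired. */
        cfg[off / 4] = (val & ~(uint32_t)(bar_size - 1)) | 0x4;
    } else if (off == 0x14) {
        cfg[off / 4] = val;    /* all 32 upper address bits are implemented */
    }
    printf("BAR now decodes at 0x%llx\n",
           (unsigned long long)(((uint64_t)cfg[0x14 / 4] << 32) |
                                (cfg[0x10 / 4] & ~0xfULL)));
}

int main(void)
{
    /* Size the low half: save, write all ones, read mask, restore. */
    uint32_t lo = cfg_read(0x10);
    cfg_write(0x10, 0xffffffff);
    uint32_t lo_mask = cfg_read(0x10) & ~0xfU;   /* 0xffffc000, cf. ffffc004 */
    cfg_write(0x10, lo);

    /* Size the high half.  Between the all-ones write and the restore the
     * BAR decodes at 0xfffffffffebe0000, the transient region_add address. */
    uint32_t hi = cfg_read(0x14);
    cfg_write(0x14, 0xffffffff);
    uint32_t hi_mask = cfg_read(0x14);
    cfg_write(0x14, hi);

    uint64_t size = ~(((uint64_t)hi_mask << 32) | lo_mask) + 1;
    printf("BAR size: 0x%llx\n", (unsigned long long)size);   /* 0x4000 */
    return 0;
}

Between the all-ones write to the upper half and the restore, the BAR really
does decode at 0xfffffffffebe0000, which is exactly the moment at which the
memory listener in the discussion sees the stray region and the IOMMU map
fails. That transient value is harmless, as noted above, only as long as
nothing accesses or tries to map it.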