On Tue, Jan 14, 2014 at 08:36:27AM -0700, Alex Williamson wrote:
> On Tue, 2014-01-14 at 12:24 +0200, Avi Kivity wrote:
> > On 01/14/2014 12:48 AM, Alex Williamson wrote:
> > > On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> > >>> Am 13.01.2014 um 22:39 schrieb Alex Williamson <alex.william...@redhat.com>:
> > >>>
> > >>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> > >>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <m...@redhat.com> wrote:
> > >>>>>
> > >>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> > >>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> > >>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> > >>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> > >>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> > >>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> > >>>>>>>>>>>>>> From: Paolo Bonzini <pbonz...@redhat.com>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> > >>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit wide.
> > >>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits above
> > >>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> > >>>>>>>>>>>>>> consequently messing up the computations.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from
> > >>>>>>>>>>>>>> address 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The
> > >>>>>>>>>>>>>> region it gets is the newly introduced master abort region, which
> > >>>>>>>>>>>>>> is as big as the PCI address space (see pci_bus_init).  Due to a
> > >>>>>>>>>>>>>> typo that's only 2^63-1, not 2^64.  But we get it anyway because
> > >>>>>>>>>>>>>> phys_page_find ignores the upper bits of the physical address.  In
> > >>>>>>>>>>>>>> address_space_translate_internal then
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>     diff = int128_sub(section->mr->size, int128_make64(addr));
> > >>>>>>>>>>>>>>     *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The size of the PCI address space region should be fixed anyway.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitul...@redhat.com>
> > >>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> > >>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> > >>>>>>>>>>>>>> ---
> > >>>>>>>>>>>>>>  exec.c | 8 ++------
> > >>>>>>>>>>>>>>  1 file changed, 2 insertions(+), 6 deletions(-)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
> > >>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> > >>>>>>>>>>>>>> --- a/exec.c
> > >>>>>>>>>>>>>> +++ b/exec.c
> > >>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> > >>>>>>>>>>>>>>  #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>  /* Size of the L2 (and L3, etc) page tables.  */
> > >>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> > >>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>  #define P_L2_BITS 10
> > >>>>>>>>>>>>>>  #define P_L2_SIZE (1 << P_L2_BITS)
> > >>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> > >>>>>>>>>>>>>>  {
> > >>>>>>>>>>>>>>      system_memory = g_malloc(sizeof(*system_memory));
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> > >>>>>>>>>>>>>> -
> > >>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> > >>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> > >>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> > >>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> > >>>>>>>>>>>>>>      address_space_init(&address_space_memory, system_memory, "memory");
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>      system_io = g_malloc(sizeof(*system_io));
> > >>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 64bit
> > >>>>>>>>>>>>> PCI BARs that I'm not sure how to handle.
> > >>>>>>>>>>>> BARs are often disabled during sizing.  Maybe you don't detect BAR
> > >>>>>>>>>>>> being disabled?
> > >>>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> > >>>>>>>>>>> the sizing and memory region updates for the BARs, vfio is just a
> > >>>>>>>>>>> pass-through here.
> > >>>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be happening
> > >>>>>>>>>> while I/O & memory are enabled in the command register.  Thanks,
> > >>>>>>>>>>
> > >>>>>>>>>> Alex
> > >>>>>>>>> OK then from QEMU POV this BAR value is not special at all.
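
(For reference, and not part of the original exchange: what QEMU's pci core is
doing here is the standard BAR probe - save the BAR, write all 1s, read back the
size mask, restore.  A minimal sketch of that probe for a 64-bit memory BAR
follows; pci_cfg_read32()/pci_cfg_write32() are hypothetical config-space
accessors, not QEMU functions.  Because the memory-decode bit in the command
register stays set throughout, the all-ones value written to the upper dword is
briefly a live BAR address, which is exactly the transient 0xfffffffffebe0000
mapping visible in the trace below.)

    #include <stdint.h>

    /* Hypothetical config-space accessors for one device. */
    extern uint32_t pci_cfg_read32(int devfn, int off);
    extern void pci_cfg_write32(int devfn, int off, uint32_t val);

    static uint64_t size_64bit_mem_bar(int devfn, int bar)
    {
        uint32_t lo = pci_cfg_read32(devfn, bar);         /* e.g. 0xfebe0004 */
        uint32_t hi = pci_cfg_read32(devfn, bar + 4);     /* e.g. 0x00000000 */

        pci_cfg_write32(devfn, bar, 0xffffffff);          /* mask lower dword */
        uint32_t lo_mask = pci_cfg_read32(devfn, bar);    /* e.g. 0xffffc004 */
        pci_cfg_write32(devfn, bar, lo);                  /* restore lower dword */

        pci_cfg_write32(devfn, bar + 4, 0xffffffff);      /* mask upper dword: with
                                                             decode still enabled the
                                                             BAR now decodes at
                                                             0xfffffffffebe0000 */
        uint32_t hi_mask = pci_cfg_read32(devfn, bar + 4);
        pci_cfg_write32(devfn, bar + 4, hi);              /* restore upper dword */

        /* Drop the low flag bits; size is the two's complement of the mask. */
        uint64_t mask = ((uint64_t)hi_mask << 32) | (lo_mask & ~0xfULL);
        return ~mask + 1;                                 /* 0x4000 for the device
                                                             traced below */
    }
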
> > >>>>>>>> Unfortunately
> > >>>>>>>>
> > >>>>>>>>>>>>> After this patch I get vfio traces like this:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> > >>>>>>>>>>>>> (save lower 32bits of BAR)
> > >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> > >>>>>>>>>>>>> (write mask to BAR)
> > >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> > >>>>>>>>>>>>> (read size mask)
> > >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> > >>>>>>>>>>>>> (restore BAR)
> > >>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> > >>>>>>>>>>>>> (memory region re-mapped)
> > >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> > >>>>>>>>>>>>> (save upper 32bits of BAR)
> > >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> > >>>>>>>>>>>>> (write mask to BAR)
> > >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> > >>>>>>>>>>>>> (memory region gets unmapped)
> > >>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> > >>>>>>>>>>>>> (memory region gets re-mapped with new address)
> > >>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> > >>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical addresses)
> > >>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> > >>>>>>>>>>> Two reasons, first I can't tell the difference between RAM and MMIO.
> > >>>>>>>>> Why can't you?  Generally the memory core lets you find out easily.
> > >>>>>>>> My MemoryListener is set up for &address_space_memory and I then filter
> > >>>>>>>> out anything that's not memory_region_is_ram().  This still gets
> > >>>>>>>> through, so how do I easily find out?
> > >>>>>>>>
> > >>>>>>>>> But in this case it's the vfio device itself that is sized, so for sure
> > >>>>>>>>> you know it's MMIO.
> > >>>>>>>> How so?  I have a MemoryListener as described above and pass everything
> > >>>>>>>> through to the IOMMU.  I suppose I could look through all the
> > >>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems really
> > >>>>>>>> ugly.
> > >>>>>>>>
> > >>>>>>>>> Maybe you will have the same issue if there's another device with a
> > >>>>>>>>> 64 bit bar though, like ivshmem?
> > >>>>>>>> Perhaps, I suspect I'll see anything that registers its BAR MemoryRegion
> > >>>>>>>> from memory_region_init_ram or memory_region_init_ram_ptr.
> > >>>>>>> Must be a 64 bit BAR to trigger the issue though.
> > >>>>>>>
> > >>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is something
> > >>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> > >>>>>>>>>>>
> > >>>>>>>>>>>>> Prior to this change, there was no re-map with the fffffffffebe0000
> > >>>>>>>>>>>>> address, presumably because it was beyond the address space of the
> > >>>>>>>>>>>>> PCI window.  This address is clearly not in a PCI MMIO space, so why
> > >>>>>>>>>>>>> are we allowing it to be realized in the system address space at
> > >>>>>>>>>>>>> this location?  Thanks,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Alex
> > >>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
> > >>>>>>>>>>>> True, the CPU can't access this address, but other PCI devices can.
> > >>>>>>>>>>> What happens on real hardware when an address like this is programmed
> > >>>>>>>>>>> to a device?  The CPU doesn't have the physical bits to access it.  I
> > >>>>>>>>>>> have serious doubts that another PCI device would be able to access it
> > >>>>>>>>>>> either.  Maybe in some limited scenario where the devices are on the
> > >>>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> > >>>>>>>>>>> always limited by some kind of aperture, whether that's explicit in
> > >>>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made
> > >>>>>>>>>>> explicit in ACPI).  Even if I wanted to filter these out as noise in
> > >>>>>>>>>>> vfio, how would I do it in a way that still allows real 64bit MMIO to
> > >>>>>>>>>>> be programmed?  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Alex
> > >>>>>>>>> AFAIK PCI doesn't have that knowledge as such.  The PCI spec is explicit
> > >>>>>>>>> that full 64 bit addresses must be allowed, and hardware validation test
> > >>>>>>>>> suites normally check that it actually does work if it happens.
> > >>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> > >>>>>>>> routing, that's more what I'm referring to.  There are generally only
> > >>>>>>>> fixed address windows for RAM vs MMIO.
> > >>>>>>> The physical chipset?  Likely - in the presence of an IOMMU.  Without
> > >>>>>>> that, devices can talk to each other without going through the chipset,
> > >>>>>>> and the bridge spec is very explicit that full 64 bit addressing must be
> > >>>>>>> supported.
> > >>>>>>>
> > >>>>>>> So as long as we don't emulate an IOMMU, the guest will normally think
> > >>>>>>> it's okay to use any address.
> > >>>>>>>
> > >>>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's windows
> > >>>>>>>>> would protect you, but PCI already does this filtering: if you see this
> > >>>>>>>>> address in the memory map this means your virtual device is on the root
> > >>>>>>>>> bus.
> > >>>>>>>>>
> > >>>>>>>>> So I think it's the other way around: if VFIO requires specific address
> > >>>>>>>>> ranges to be assigned to devices, it should give this info to qemu and
> > >>>>>>>>> qemu can give this to the guest.  Then anything outside that range can
> > >>>>>>>>> be ignored by VFIO.
> > >>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> > >>>>>>>> currently no way to find out the address width of the IOMMU.  We've been
> > >>>>>>>> getting by because it's safely close enough to the CPU address width to
> > >>>>>>>> not be a concern until we start exposing things at the top of the 64bit
> > >>>>>>>> address space.  Maybe I can safely ignore anything above
> > >>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> > >>>>>>>>
> > >>>>>>>> Alex
> > >>>>>>> I think it's not related to the target CPU at all - it's a host
> > >>>>>>> limitation.
> > >>>>>>> So just make up your own constant, maybe depending on host architecture.
> > >>>>>>> Long term, add an ioctl to query it.
> > >>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
> > >>>>>> physical address bits of the CPU.
> > >>>>>>
> > >>>>>>> Also, we can add a fwcfg interface to tell the BIOS that it should avoid
> > >>>>>>> placing BARs above some address.
> > >>>>>> That doesn't help this case, it's a spurious mapping caused by sizing the
> > >>>>>> BARs with them enabled.  We may still want such a thing to feed into
> > >>>>>> building ACPI tables though.
> > >>>>> Well the point is that if you want the BIOS to avoid specific addresses,
> > >>>>> you need to tell it what to avoid.  But neither BIOS nor ACPI actually
> > >>>>> cover the range above 2^48 ATM, so it's not a high priority.
> > >>>>>
> > >>>>>>> Since it's a vfio limitation I think it should be a vfio API, along the
> > >>>>>>> lines of vfio_get_addr_space_bits(void).
> > >>>>>>> (Is this true btw?  legacy assignment doesn't have this problem?)
> > >>>>>> It's an IOMMU hardware limitation, legacy assignment has the same problem.
> > >>>>>> It looks like legacy will abort() in QEMU for the failed mapping and I'm
> > >>>>>> planning to tighten vfio to also kill the VM for failed mappings.  In the
> > >>>>>> short term, I think I'll ignore any mappings above
> > >>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
> > >>>>> That seems very wrong.  It will still fail on an x86 host if we are
> > >>>>> emulating a CPU with full 64 bit addressing.  The limitation is on the
> > >>>>> host side; there's no real reason to tie it to the target.
> > >>> I doubt vfio would be the only thing broken in that case.
> > >>>
> > >>>>>> long term vfio already has an IOMMU info ioctl that we could use to
> > >>>>>> return this information, but we'll need to figure out how to get it out
> > >>>>>> of the IOMMU driver first.  Thanks,
> > >>>>>>
> > >>>>>> Alex
> > >>>>> Short term, just assume 48 bits on x86.
> > >>> I hate to pick an arbitrary value since we have a very specific mapping
> > >>> we're trying to avoid.  Perhaps a better option is to skip anything where:
> > >>>
> > >>>   MemoryRegionSection.offset_within_address_space >
> > >>>   ~MemoryRegionSection.offset_within_address_space
> > >>>
> > >>>>> We need to figure out what the limitation is on ppc and arm - maybe
> > >>>>> there's none and they can address the full 64 bit range.
> > >>>> IIUC on PPC and ARM you always have BAR windows where things can get
> > >>>> mapped into.  Unlike x86, where the full physical address range can be
> > >>>> overlaid by BARs.
> > >>>>
> > >>>> Or did I misunderstand the question?
> > >>> Sounds right, if either BAR mappings outside the window will not be
> > >>> realized in the memory space or the IOMMU has a full 64bit address space,
> > >>> there's no problem.  Here we have an intermediate step in the BAR sizing
> > >>> producing a stray mapping that the IOMMU hardware can't handle.  Even if we
> > >>> could handle it, it's not clear that we want to.  On AMD-Vi the IOMMU page
> > >>> tables can grow to 6 levels deep.  A stray mapping like this then causes
> > >>> space and time overhead until the tables are pruned back down.  Thanks,
> > >> I thought sizing is hard-defined as a write of -1?  Can't we check for that
> > >> one special case and treat it as "not mapped, but tell the guest the size
> > >> in config space"?
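
(Again as an illustration rather than existing vfio code: the two filtering
ideas above could be combined in a MemoryListener hook roughly as sketched
below.  The first test is Alex's offset > ~offset check on
MemoryRegionSection.offset_within_address_space; vfio_get_addr_space_bits() is
the hypothetical host-side query proposed above, short term just a
host-architecture constant such as 48 on x86.)

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical host-side query along the lines of the proposed
     * vfio_get_addr_space_bits(void); short term just a constant. */
    static unsigned int vfio_get_addr_space_bits(void)
    {
    #if defined(__x86_64__) || defined(__i386__)
        return 48;                      /* short-term assumption from the thread */
    #else
        return 64;
    #endif
    }

    /* iova is MemoryRegionSection.offset_within_address_space. */
    static bool vfio_skip_section(uint64_t iova)
    {
        /* Alex's check: top bit set, i.e. the section starts above 2^63. */
        if (iova > ~iova) {
            return true;
        }

        /* Host limit: anything the IOMMU cannot address gets skipped. */
        unsigned int bits = vfio_get_addr_space_bits();
        if (bits < 64 && (iova >> bits)) {
            return true;
        }

        return false;
    }
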
> > > PCI doesn't want to handle this as anything special to differentiate a
> > > sizing mask from a valid BAR address.  I agree though, I'd prefer to never
> > > see a spurious address like this in my MemoryListener.
> > >
> >
> > Can't you just ignore regions that cannot be mapped?  Oh, and teach the BIOS
> > and/or Linux to disable memory access while sizing.
>
> Actually I think we need to be more stringent about DMA mapping failures.
> If a chunk of guest RAM fails to map then we can lose data if the device
> attempts to DMA a packet into it.  How do we know which regions we can
> ignore and which we can't?  Whether or not the CPU can access it is a
> pretty good hint that we can ignore it.  Thanks,
>
> Alex
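
(A sketch of the policy Alex describes, with made-up helper names;
vfio_dma_map() below only stands in for the real mapping path: sections that
are not CPU-addressable guest RAM can be skipped safely, while a mapping
failure for real guest RAM is treated as fatal rather than risking silent data
loss.)

    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for the real DMA mapping call; returns 0 on success. */
    extern int vfio_dma_map(uint64_t iova, uint64_t size, void *vaddr);

    /* is_ram would come from a memory_region_is_ram()-style check,
     * cpu_addressable from the CPU's physical address width. */
    static void vfio_map_or_die(bool is_ram, bool cpu_addressable,
                                uint64_t iova, uint64_t size, void *vaddr)
    {
        if (!is_ram || !cpu_addressable) {
            /* Not CPU-visible guest RAM (e.g. the transient BAR mapping at
             * 0xfffffffffebe0000): skipping it cannot lose guest data. */
            return;
        }

        if (vfio_dma_map(iova, size, vaddr) != 0) {
            /* Guest RAM the device may DMA into failed to map; continuing
             * would risk silent data loss, so kill the VM instead. */
            fprintf(stderr, "vfio: failed to map RAM at 0x%" PRIx64 "\n", iova);
            abort();
        }
    }
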
Go ahead and use that as a hint if you prefer, but for targets whose CPU
address bits exceed what the host IOMMU supports, this might not be enough to
actually keep things from breaking.

--
MST