Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide

Michael S. Tsirkin Tue, 14 Jan 2014 02:30:23 -0800

On Tue, Jan 14, 2014 at 10:20:57AM +0100, Alexander Graf wrote:
> 
> On 14.01.2014, at 09:18, Michael S. Tsirkin <m...@redhat.com> wrote:
> 
> > On Mon, Jan 13, 2014 at 10:48:21PM +0100, Alexander Graf wrote:
> >> 
> >> 
> >>> Am 13.01.2014 um 22:39 schrieb Alex Williamson 
> >>> <alex.william...@redhat.com>:
> >>> 
> >>>> On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> >>>>> On 12.01.2014, at 08:54, Michael S. Tsirkin <m...@redhat.com> wrote:
> >>>>> 
> >>>>>> On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >>>>>>> On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>>>>>>> On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>>>>>>> On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>> On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>>>>>>> On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>>>>>>> On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>> On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>>>>>>> On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>> From: Paolo Bonzini <pbonz...@redhat.com>
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>>>>>> size, 2013-11-04) let's just make all address spaces 64-bit 
> >>>>>>>>>>>>>> wide.
> >>>>>>>>>>>>>> This eliminates problems with phys_page_find ignoring bits 
> >>>>>>>>>>>>>> above
> >>>>>>>>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS and 
> >>>>>>>>>>>>>> address_space_translate_internal
> >>>>>>>>>>>>>> consequently messing up the computations.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> In Luiz's reported crash, at startup gdb attempts to read from 
> >>>>>>>>>>>>>> address
> >>>>>>>>>>>>>> 0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The 
> >>>>>>>>>>>>>> region it gets
> >>>>>>>>>>>>>> is the newly introduced master abort region, which is as big 
> >>>>>>>>>>>>>> as the PCI
> >>>>>>>>>>>>>> address space (see pci_bus_init).  Due to a typo that's only 
> >>>>>>>>>>>>>> 2^63-1,
> >>>>>>>>>>>>>> not 2^64.  But we get it anyway because phys_page_find ignores 
> >>>>>>>>>>>>>> the upper
> >>>>>>>>>>>>>> bits of the physical address.  In 
> >>>>>>>>>>>>>> address_space_translate_internal then
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>  diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>>>>>  *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> The size of the PCI address space region should be fixed 
> >>>>>>>>>>>>>> anyway.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Reported-by: Luiz Capitulino <lcapitul...@redhat.com>
> >>>>>>>>>>>>>> Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
> >>>>>>>>>>>>>> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>> exec.c | 8 ++------
> >>>>>>>>>>>>>> 1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> diff --git a/exec.c b/exec.c
> >>>>>>>>>>>>>> index 7e5ce93..f907f5f 100644
> >>>>>>>>>>>>>> --- a/exec.c
> >>>>>>>>>>>>>> +++ b/exec.c
> >>>>>>>>>>>>>> @@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>>>>>> #define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> /* Size of the L2 (and L3, etc) page tables.  */
> >>>>>>>>>>>>>> -#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>>>>>> +#define ADDR_SPACE_BITS 64
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> #define P_L2_BITS 10
> >>>>>>>>>>>>>> #define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>>>>>> @@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>>>>>> {
> >>>>>>>>>>>>>>   system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> -    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>>>>>> -
> >>>>>>>>>>>>>> -    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>>>>>> -                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>>>>>> -                       UINT64_MAX : (0x1ULL << 
> >>>>>>>>>>>>>> ADDR_SPACE_BITS));
> >>>>>>>>>>>>>> +    memory_region_init(system_memory, NULL, "system", 
> >>>>>>>>>>>>>> UINT64_MAX);
> >>>>>>>>>>>>>>   address_space_init(&address_space_memory, system_memory, 
> >>>>>>>>>>>>>> "memory");
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>   system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> This seems to have some unexpected consequences around sizing 
> >>>>>>>>>>>>> 64bit PCI
> >>>>>>>>>>>>> BARs that I'm not sure how to handle.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> BARs are often disabled during sizing. Maybe you
> >>>>>>>>>>>> don't detect BAR being disabled?
> >>>>>>>>>>> 
> >>>>>>>>>>> See the trace below, the BARs are not disabled.  QEMU pci-core is 
> >>>>>>>>>>> doing
> >>>>>>>>>>> the sizing an memory region updates for the BARs, vfio is just a
> >>>>>>>>>>> pass-through here.
> >>>>>>>>>> 
> >>>>>>>>>> Sorry, not in the trace below, but yes the sizing seems to be 
> >>>>>>>>>> happening
> >>>>>>>>>> while I/O & memory are enabled int he command register.  Thanks,
> >>>>>>>>>> 
> >>>>>>>>>> Alex
> >>>>>>>>> 
> >>>>>>>>> OK then from QEMU POV this BAR value is not special at all.
> >>>>>>>> 
> >>>>>>>> Unfortunately
> >>>>>>>> 
> >>>>>>>>>>>>> After this patch I get vfio
> >>>>>>>>>>>>> traces like this:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) 
> >>>>>>>>>>>>> febe0004
> >>>>>>>>>>>>> (save lower 32bits of BAR)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, 
> >>>>>>>>>>>>> len=0x4)
> >>>>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) 
> >>>>>>>>>>>>> ffffc004
> >>>>>>>>>>>>> (read size mask)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, 
> >>>>>>>>>>>>> len=0x4)
> >>>>>>>>>>>>> (restore BAR)
> >>>>>>>>>>>>> vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>> (memory region re-mapped)
> >>>>>>>>>>>>> vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>>>>>> (save upper 32bits of BAR)
> >>>>>>>>>>>>> vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, 
> >>>>>>>>>>>>> len=0x4)
> >>>>>>>>>>>>> (write mask to BAR)
> >>>>>>>>>>>>> vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>> (memory region gets unmapped)
> >>>>>>>>>>>>> vfio: region_add fffffffffebe0000 - fffffffffebe3fff 
> >>>>>>>>>>>>> [0x7fcf3654d000]
> >>>>>>>>>>>>> (memory region gets re-mapped with new address)
> >>>>>>>>>>>>> qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 
> >>>>>>>>>>>>> 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>>>>>> (iommu barfs because it can only handle 48bit physical 
> >>>>>>>>>>>>> addresses)
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>>>>> 
> >>>>>>>>>>> Two reasons, first I can't tell the difference between RAM and 
> >>>>>>>>>>> MMIO.
> >>>>>>>>> 
> >>>>>>>>> Why can't you? Generally memory core let you find out easily.
> >>>>>>>> 
> >>>>>>>> My MemoryListener is setup for &address_space_memory and I then 
> >>>>>>>> filter
> >>>>>>>> out anything that's not memory_region_is_ram().  This still gets
> >>>>>>>> through, so how do I easily find out?
> >>>>>>>> 
> >>>>>>>>> But in this case it's vfio device itself that is sized so for sure 
> >>>>>>>>> you
> >>>>>>>>> know it's MMIO.
> >>>>>>>> 
> >>>>>>>> How so?  I have a MemoryListener as described above and pass 
> >>>>>>>> everything
> >>>>>>>> through to the IOMMU.  I suppose I could look through all the
> >>>>>>>> VFIODevices and check if the MemoryRegion matches, but that seems 
> >>>>>>>> really
> >>>>>>>> ugly.
> >>>>>>>> 
> >>>>>>>>> Maybe you will have same issue if there's another device with a 64 
> >>>>>>>>> bit
> >>>>>>>>> bar though, like ivshmem?
> >>>>>>>> 
> >>>>>>>> Perhaps, I suspect I'll see anything that registers their BAR
> >>>>>>>> MemoryRegion from memory_region_init_ram or 
> >>>>>>>> memory_region_init_ram_ptr.
> >>>>>>> 
> >>>>>>> Must be a 64 bit BAR to trigger the issue though.
> >>>>>>> 
> >>>>>>>>>>> Second, it enables peer-to-peer DMA between devices, which is 
> >>>>>>>>>>> something
> >>>>>>>>>>> that we might be able to take advantage of with GPU passthrough.
> >>>>>>>>>>> 
> >>>>>>>>>>>>> Prior to this change, there was no re-map with the 
> >>>>>>>>>>>>> fffffffffebe0000
> >>>>>>>>>>>>> address, presumably because it was beyond the address space of 
> >>>>>>>>>>>>> the PCI
> >>>>>>>>>>>>> window.  This address is clearly not in a PCI MMIO space, so 
> >>>>>>>>>>>>> why are we
> >>>>>>>>>>>>> allowing it to be realized in the system address space at this 
> >>>>>>>>>>>>> location?
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Alex
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Why do you think it is not in PCI MMIO space?
> >>>>>>>>>>>> True, CPU can't access this address but other pci devices can.
> >>>>>>>>>>> 
> >>>>>>>>>>> What happens on real hardware when an address like this is 
> >>>>>>>>>>> programmed to
> >>>>>>>>>>> a device?  The CPU doesn't have the physical bits to access it.  
> >>>>>>>>>>> I have
> >>>>>>>>>>> serious doubts that another PCI device would be able to access it
> >>>>>>>>>>> either.  Maybe in some limited scenario where the devices are on 
> >>>>>>>>>>> the
> >>>>>>>>>>> same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>>>>>> always limited by some kind of aperture, whether that's explicit 
> >>>>>>>>>>> in
> >>>>>>>>>>> bridge windows or implicit in hardware design (and perhaps made 
> >>>>>>>>>>> explicit
> >>>>>>>>>>> in ACPI).  Even if I wanted to filter these out as noise in vfio, 
> >>>>>>>>>>> how
> >>>>>>>>>>> would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>>>>>> programmed.  PCI has this knowledge, I hope.  VFIO doesn't.  
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> 
> >>>>>>>>>>> Alex
> >>>>>>>>> 
> >>>>>>>>> AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit 
> >>>>>>>>> that
> >>>>>>>>> full 64 bit addresses must be allowed and hardware validation
> >>>>>>>>> test suites normally check that it actually does work
> >>>>>>>>> if it happens.
> >>>>>>>> 
> >>>>>>>> Sure, PCI devices themselves, but the chipset typically has defined
> >>>>>>>> routing, that's more what I'm referring to.  There are generally only
> >>>>>>>> fixed address windows for RAM vs MMIO.
> >>>>>>> 
> >>>>>>> The physical chipset? Likely - in the presence of IOMMU.
> >>>>>>> Without that, devices can talk to each other without going
> >>>>>>> through chipset, and bridge spec is very explicit that
> >>>>>>> full 64 bit addressing must be supported.
> >>>>>>> 
> >>>>>>> So as long as we don't emulate an IOMMU,
> >>>>>>> guest will normally think it's okay to use any address.
> >>>>>>> 
> >>>>>>>>> Yes, if there's a bridge somewhere on the path that bridge's
> >>>>>>>>> windows would protect you, but pci already does this filtering:
> >>>>>>>>> if you see this address in the memory map this means
> >>>>>>>>> your virtual device is on root bus.
> >>>>>>>>> 
> >>>>>>>>> So I think it's the other way around: if VFIO requires specific
> >>>>>>>>> address ranges to be assigned to devices, it should give this
> >>>>>>>>> info to qemu and qemu can give this to guest.
> >>>>>>>>> Then anything outside that range can be ignored by VFIO.
> >>>>>>>> 
> >>>>>>>> Then we get into deficiencies in the IOMMU API and maybe VFIO.  
> >>>>>>>> There's
> >>>>>>>> currently no way to find out the address width of the IOMMU.  We've 
> >>>>>>>> been
> >>>>>>>> getting by because it's safely close enough to the CPU address width 
> >>>>>>>> to
> >>>>>>>> not be a concern until we start exposing things at the top of the 
> >>>>>>>> 64bit
> >>>>>>>> address space.  Maybe I can safely ignore anything above
> >>>>>>>> TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>>>>>> 
> >>>>>>>> Alex
> >>>>>>> 
> >>>>>>> I think it's not related to target CPU at all - it's a host 
> >>>>>>> limitation.
> >>>>>>> So just make up your own constant, maybe depending on host 
> >>>>>>> architecture.
> >>>>>>> Long term add an ioctl to query it.
> >>>>>> 
> >>>>>> It's a hardware limitation which I'd imagine has some loose ties to the
> >>>>>> physical address bits of the CPU.
> >>>>>> 
> >>>>>>> Also, we can add a fwcfg interface to tell bios that it should avoid
> >>>>>>> placing BARs above some address.
> >>>>>> 
> >>>>>> That doesn't help this case, it's a spurious mapping caused by sizing
> >>>>>> the BARs with them enabled.  We may still want such a thing to feed 
> >>>>>> into
> >>>>>> building ACPI tables though.
> >>>>> 
> >>>>> Well the point is that if you want BIOS to avoid
> >>>>> specific addresses, you need to tell it what to avoid.
> >>>>> But neither BIOS nor ACPI actually cover the range above
> >>>>> 2^48 ATM so it's not a high priority.
> >>>>> 
> >>>>>>> Since it's a vfio limitation I think it should be a vfio API, along 
> >>>>>>> the
> >>>>>>> lines of vfio_get_addr_space_bits(void).
> >>>>>>> (Is this true btw? legacy assignment doesn't have this problem?)
> >>>>>> 
> >>>>>> It's an IOMMU hardware limitation, legacy assignment has the same
> >>>>>> problem.  It looks like legacy will abort() in QEMU for the failed
> >>>>>> mapping and I'm planning to tighten vfio to also kill the VM for failed
> >>>>>> mappings.  In the short term, I think I'll ignore any mappings above
> >>>>>> TARGET_PHYS_ADDR_SPACE_BITS,
> >>>>> 
> >>>>> That seems very wrong. It will still fail on an x86 host if we are
> >>>>> emulating a CPU with full 64 bit addressing. The limitation is on the
> >>>>> host side there's no real reason to tie it to the target.
> >>> 
> >>> I doubt vfio would be the only thing broken in that case.
> >>> 
> >>>>>> long term vfio already has an IOMMU info
> >>>>>> ioctl that we could use to return this information, but we'll need to
> >>>>>> figure out how to get it out of the IOMMU driver first.
> >>>>>> Thanks,
> >>>>>> 
> >>>>>> Alex
> >>>>> 
> >>>>> Short term, just assume 48 bits on x86.
> >>> 
> >>> I hate to pick an arbitrary value since we have a very specific mapping
> >>> we're trying to avoid.  Perhaps a better option is to skip anything
> >>> where:
> >>> 
> >>>       MemoryRegionSection.offset_within_address_space >
> >>>       ~MemoryRegionSection.offset_within_address_space
> >>> 
> >>>>> We need to figure out what's the limitation on ppc and arm -
> >>>>> maybe there's none and it can address full 64 bit range.
> >>>> 
> >>>> IIUC on PPC and ARM you always have BAR windows where things can get 
> >>>> mapped into. Unlike x86 where the full phyiscal address range can be 
> >>>> overlayed by BARs.
> >>>> 
> >>>> Or did I misunderstand the question?
> >>> 
> >>> Sounds right, if either BAR mappings outside the window will not be
> >>> realized in the memory space or the IOMMU has a full 64bit address
> >>> space, there's no problem.  Here we have an intermediate step in the BAR
> >>> sizing producing a stray mapping that the IOMMU hardware can't handle.
> >>> Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> >>> the IOMMU pages tables can grow to 6-levels deep.  A stray mapping like
> >>> this then causes space and time overhead until the tables are pruned
> >>> back down.  Thanks,
> >> 
> >> I thought sizing is hard defined as a set to
> >> -1? Can't we check for that one special case and treat it as "not mapped, 
> >> but tell the guest the size in config space"?
> >> 
> >> Alex
> > 
> > We already have a work-around like this and it works for 32 bit BARs
> > or after software writes the full 64 register:
> >    if (last_addr <= new_addr || new_addr == 0 ||
> >        last_addr == PCI_BAR_UNMAPPED) {
> >        return PCI_BAR_UNMAPPED;
> >    }
> > 
> >    if  (!(type & PCI_BASE_ADDRESS_MEM_TYPE_64) && last_addr >= UINT32_MAX) {
> >        return PCI_BAR_UNMAPPED;
> >    }
> > 
> > 
> > But for 64 bit BARs this software writes all 1's
> > in the high 32 bit register before writing in the low register
> > (see trace above).
> > This makes it impossible to distinguish between
> > setting bar at fffffffffebe0000 and this intermediate sizing step.
> 
> Well, at least according to the AMD manual there's only support for 52 bytes 
> of physical address space:
> 
>       • Long Mode—This mode is unique to the AMD64 architecture. This mode 
> supports up to 4 petabytes of physical-address space using 52-bit physical 
> addresses.
> 
> Intel seems to agree:
> 
>       • CPUID.80000008H:EAX[7:0] reports the physical-address width supported 
> by the processor. (For processors that do not support CPUID function 
> 80000008H, the width is generally 36 if CPUID.01H:EDX.PAE [bit 6] = 1 and 32 
> otherwise.) This width is referred to as MAXPHYADDR. MAXPHYADDR is at most 52.
> 
> Of course there's potential for future extensions to allow for more bits in 
> the future, but at least the current generation x86_64 (and x86) 
> specification clearly only supports 52 bits of physical address space. And 
> non-x86(_64) don't care about bigger address spaces either because they use 
> BAR windows which are very unlikely to grow bigger than 52 bits ;).
> 
> 
> Alex


Yes but that's from CPU's point of view.
I think that devices can still access each other's BARs
using full 64 bit addresses.

-- 
MST

Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide

Reply via email to