> On Wed, Feb 13, 2013 at 11:34:52AM +0100, Jan Kiszka wrote:
>> On 2013-02-13 11:24, Michael S. Tsirkin wrote:
>>> On Wed, Feb 13, 2013 at 06:06:37PM +1300, Alexey Korolev wrote:
>>>> Some time ago I reported an issue about a guest OS hang when a 64bit BAR is
>>>> present.
>>>> http://lists.gnu.org/archive/html/qemu-devel/2012-01/msg03189.html
>>>> http://lists.gnu.org/archive/html/qemu-devel/2012-12/msg00413.html
>>>>
>>>> Some more investigation has been done, so in this post I'll try to explain 
>>>> why it happens and offer possible solutions:
>>>>
>>>> *When the issue happens*
>>>> The issue occurs on Linux guests with kernel versions < 2.6.36.
>>>> The guest OS hangs on boot when a 64bit PCI BAR is present in the system (if
>>>> we use the ivshmem driver, for example) and occupies a range within the first
>>>> 4 GB.
>>>>
>>>> *How to reproduce*
>>>> I used the following qemu command to reproduce the case:
>>>> /usr/local/bin/qemu-system-x86_64 -M pc-1.3 -enable-kvm -m 2000 -smp 
>>>> 1,sockets=1,cores=1,threads=1 -name Rh5332 -chardev
>>>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/Rh5332.monitor,server,nowait
>>>>  -mon chardev=charmonitor,id=monitor,mode=readline -rtc
>>>> base=utc -boot cd -drive 
>>>> file=/home/akorolev/rh5332.img,if=none,id=drive-ide0-0-0,format=raw -device
>>>> ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -chardev 
>>>> file,id=charserial0,path=/home/akorolev/serial.log -device
>>>> isa-serial,chardev=charserial0,id=serial0 -usb -vnc 127.0.0.1:0 -k en-us 
>>>> -vga cirrus -device ivshmem,shm,size=32M -device
>>>> virtio-balloon-pci,id=balloon0
>>>>
>>>> Tried different guests: Centos 5.8 64bit, RHEL 5.3 32bit, FC 12 64bit; on
>>>> all of them the hang occurs in 100% of cases.
>>>>
>>>> *Why it happens*
>>>> The issue basically comes from the Linux PCI enumeration code.
>>>>
>>>> The OS enumerates 64bit BARs when a device is enabled using the following
>>>> procedure:
>>>> 1. Write all FF's to lower half of 64bit BAR
>>>> 2. Write address back to lower half of 64bit BAR
>>>> 3. Write all FF's to higher half of 64bit BAR
>>>> 4. Write address back to higher half of 64bit BAR
>>>>
>>>> For qemu it means that pci_default_write_config() receives all FFs
>>>> for the lower part of the 64bit BAR.
>>>> Then it applies the mask and converts the value to "all FFs - size + 1"
>>>> (0xFE000000 if the size is 32MB).
>>>>
>>>> So for a short period of time the range [0xFE000000 - 0xFFFFFFFF] will be
>>>> occupied by the ivshmem resource.
>>>> For some reason this is lethal for the further boot process.
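
(For reference, a minimal C sketch of that sizing sequence as seen from the
guest side; cfg_read32()/cfg_write32() are hypothetical config-space helpers,
not actual Linux or qemu functions.)

#include <stdint.h>

/* Hypothetical config-space accessors, for illustration only. */
uint32_t cfg_read32(int bdf, int off);
void cfg_write32(int bdf, int off, uint32_t val);

static uint64_t size_64bit_mem_bar(int bdf, int bar)
{
    uint32_t lo = cfg_read32(bdf, bar);
    uint32_t hi = cfg_read32(bdf, bar + 4);

    /* Steps 1-2: size the lower half.  While all FFs are latched, qemu's
     * pci_default_write_config() has masked the value to "all FFs - size + 1",
     * i.e. 0xfe000000 for a 32MB BAR, so the BAR briefly decodes
     * [0xfe000000 - 0xffffffff]. */
    cfg_write32(bdf, bar, 0xffffffff);
    uint32_t lo_mask = cfg_read32(bdf, bar);
    cfg_write32(bdf, bar, lo);

    /* Steps 3-4: same for the upper half. */
    cfg_write32(bdf, bar + 4, 0xffffffff);
    uint32_t hi_mask = cfg_read32(bdf, bar + 4);
    cfg_write32(bdf, bar + 4, hi);

    /* The low 4 bits of a memory BAR are type flags, not address bits. */
    uint64_t mask = ((uint64_t)hi_mask << 32) | (lo_mask & ~0xfu);
    return ~mask + 1;   /* 0x02000000 == 32MB for the ivshmem example */
}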
>>>>
>>>> We have found that the boot process breaks completely if the kvm-apic-msi range
>>>> is overlapped even for a short period of time.  (We still don't
>>>> know why this happens; we hope that the qemu maintainers can answer?)
>>>>
>>>> If we look at the kvm-apic-msi memory region, it is a non-overlappable memory
>>>> region with the hardcoded address range [0xFEE00000 - 0xFEF00000].
>>> Thanks for looking into this!
>>>
>>>> Here is a log we collected from render_memory_regions:
>>>>
>>>>  system overlap 0 pri 0 [0x0 - 0x7fffffffffffffff]
>>>>      kvmvapic-rom overlap 1 pri 1000 [0xca000 - 0xcd000]
>>>>          pc.ram overlap 0 pri 0 [0xca000 - 0xcd000]
>>>>          ++ pc.ram [0xca000 - 0xcd000] is added to view
>>>>      ....................
>>>>      smram-region overlap 1 pri 1 [0xa0000 - 0xc0000]
>>>>          pci overlap 0 pri 0 [0xa0000 - 0xc0000]
>>>>              cirrus-lowmem-container overlap 1 pri 1 [0xa0000 - 0xc0000]
>>>>                  cirrus-low-memory overlap 0 pri 0 [0xa0000 - 0xc0000]
>>>>                 ++cirrus-low-memory [0xa0000 - 0xc0000] is added to view
>>>>      kvm-ioapic overlap 0 pri 0 [0xfec00000 - 0xfec01000]
>>>>     ++kvm-ioapic [0xfec00000 - 0xfec01000] is added to view
>>>>      pci-hole64 overlap 0 pri 0 [0x100000000 - 0x4000000100000000]
>>>>          pci overlap 0 pri 0 [0x100000000 - 0x4000000100000000]
>>>>      pci-hole overlap 0 pri 0 [0x7d000000 - 0x100000000]
>>> So we have ioapic and pci-hole, which should be non-overlapping, actually
>>> overlapping each other.
>>> Isn't this a problem?
>>>
>>>>          pci overlap 0 pri 0 [0x7d000000 - 0x100000000]
>>>>              ivshmem-bar2-container overlap 1 pri 1 [0xfe000000 - 0x100000000]
>>>>                  ivshmem.bar2 overlap 0 pri 0 [0xfe000000 - 0x100000000]
>>>>                 ++ivshmem.bar2 [0xfe000000 - 0xfec00000] is added to view
>>>>                 ++ivshmem.bar2  [0xfec01000 - 0x100000000] is added to view
>>>>              ivshmem-mmio overlap 1 pri 1 [0xfebf1000 - 0xfebf1100]
>>>>              e1000-mmio overlap 1 pri 1 [0xfeba0000 - 0xfebc0000]
>>>>              cirrus-mmio overlap 1 pri 1 [0xfebf0000 - 0xfebf1000]
>>>>              cirrus-pci-bar0 overlap 1 pri 1 [0xfa000000 - 0xfc000000]
>>>>                  vga.vram overlap 1 pri 1 [0xfa000000 - 0xfa800000]
>>>>                 ++vga.vram [0xfa000000 - 0xfa800000] is added to view
>>>>                  cirrus-bitblt-mmio overlap 0 pri 0 [0xfb000000 - 0xfb400000]
>>>>                 ++cirrus-bitblt-mmio [0xfb000000 - 0xfb400000] is added to view
>>>>                  cirrus-linear-io overlap 0 pri 0 [0xfa000000 - 0xfa800000]
>>>>              pc.bios overlap 0 pri 0 [0xfffe0000 - 0x100000000]
>>>>      ram-below-4g overlap 0 pri 0 [0x0 - 0x7d000000]
>>>>          pc.ram overlap 0 pri 0 [0x0 - 0x7d000000]
>>>>         ++pc.ram [0x0 - 0xa0000] is added to view
>>>>         ++pc.ram [0x100000 - 0x7d000000] is added to view
>>>>      kvm-apic-msi overlap 0 pri 0 [0xfee00000 - 0xfef00000]
>>>>
>>>> As you can see from the log, kvm-apic-msi is enumerated last, when the range
>>>> [0xfee00000 - 0xfef00000] is already occupied by ivshmem.bar2
>>>> [0xfec01000 - 0x100000000].
>>>>
>>>>
>>>> *Possible solutions*
>>>> Solution 1. Probably the best would be to add a rule that regions which
>>>> may not be overlapped are added to the view first (in other words,
>>>> regions which must not be overlapped have the highest priority).  Please
>>>> find the patch in the following message.
>>>>
>>>> Solution 2. Raise the priority of the kvm-apic-msi resource. This is a bit
>>>> misleading as a solution, since priority is only applicable to overlappable
>>>> regions, but this region must not be overlapped.
>>>>
>>>> Solution 3. Fix the issue at the PCI level. Track whether the resource is 64bit
>>>> and apply the changes only once both parts of the 64bit BAR are programmed. (It
>>>> appears that real PCI bus controllers are smart enough to track 64bit BAR
>>>> writes on a PC, so qemu could do the same? The drawbacks are that
>>>> tracking PCI writes is a bit cumbersome, and such tracking may appear to
>>>> somebody as a hack.)
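
(To illustrate what Solution 3 would mean in practice, here is a rough,
hand-written sketch, not a real patch and not existing qemu code: accumulate
the two halves of a 64bit BAR and only update the mapping once both config
writes have arrived, so the intermediate all-FFs value never reaches the
memory core.)

#include <stdbool.h>
#include <stdint.h>

typedef struct Bar64Latch {
    uint64_t pending;          /* value assembled from the two 32bit writes */
    bool lo_written;
    bool hi_written;
} Bar64Latch;

static void bar64_write_half(Bar64Latch *l, bool high, uint32_t val,
                             void (*remap)(void *opaque, uint64_t addr),
                             void *opaque)
{
    if (high) {
        l->pending = (l->pending & 0x00000000ffffffffULL) | ((uint64_t)val << 32);
        l->hi_written = true;
    } else {
        l->pending = (l->pending & 0xffffffff00000000ULL) | val;
        l->lo_written = true;
    }

    /* Remap only after the guest has programmed both halves. */
    if (l->lo_written && l->hi_written) {
        remap(opaque, l->pending);
        l->lo_written = l->hi_written = false;
    }
}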
>>>>
>>>>
>>>> Alexey
>>> I have to say I don't understand what the overlap attribute is
>>> supposed to do, exactly.
Here is what I've found in docs/memory.txt of the qemu tree:
....
Overlapping regions and priority
--------------------------------
Usually, regions may not overlap each other; a memory address decodes into
exactly one target.  In some cases it is useful to allow regions to overlap,
and sometimes to control which of the overlapping regions is visible to the
guest.  This is done with memory_region_add_subregion_overlap(), which
allows the region to overlap any other region in the same container, and
specifies a priority that allows the core to decide which of two regions at
the same address are visible (highest wins).
....
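
So, as I read it, the plain call is the non-overlappable one and only the
_overlap variant takes a priority. Roughly (the container/region names and
offsets below are just made up from the addresses in this thread, for
illustration):

/* Non-overlapping: exactly one region may claim this range,
 * there is no priority argument. */
memory_region_add_subregion(container, 0xfee00000, &apic_region);

/* Overlapping: may overlap siblings in the same container; the
 * priority (1000, as kvmvapic uses) decides which one is visible. */
memory_region_add_subregion_overlap(container, 0xca000, &rom_region, 1000);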

>>>
>>> In practice it currently seems to be ignored.
>>>
>>> How about we drop it and rely exclusively on priorities?
I thought about this. The only thing which stopped me from doing this: the
interface functions and documentation suggest that somebody had the
idea to make use of both the priority and the may_overlap flag in the qemu memory
manager and didn't complete it for some reason. At least an outside
code reader gets that distinct impression.

If this is not a problem, removing may_overlap would be a good idea. It would
make the code more concise, as it removes a mess of extra service
functions as well.
Does anyone object to this?

BTW, does anybody know what is special about kvm-apic-msi, such that even a
short overlap of the region hangs the guest?
>> Isn't it a guest bug if it maps a PCI resource over the APIC window? How
>> should the guest access that region then? With in-kernel APIC, it will
>> remain inaccessible, even if you raise the prio of that PCI region over
>> the APIC, as the kernel will prioritize the latter.
> Yes that would be fine (it's a temporary condition while sizing the
> BARs). The problem, I think, is that currently apic has default priority
> (0) so PCI regions can mask it during sizing.
>
>>> It's probably easier to just give the apic a high priority.
>>> There's precedent - kvmvapic does:
>>>
>>> memory_region_add_subregion_overlap(as, rom_paddr, &s->rom, 1000);
>>>
>>> Jan, could you please clarify where the value 1000 came from?
>> To ensure that the VAPIC region is always accessible.
>>
>>> Maybe we need some predefined priority values in memory.h
>>>
>> That makes some sense, sure.
>>
>> Jan
>>
>> -- 
>> Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
>> Corporate Competence Center Embedded Linux

