Some time ago I reported an issue about a guest OS hang when a 64bit BAR is present.
http://lists.gnu.org/archive/html/qemu-devel/2012-01/msg03189.html
http://lists.gnu.org/archive/html/qemu-devel/2012-12/msg00413.html

Some more investigation has been done, so in this post I'll try to explain why 
it happens and offer possible solutions:

*When the issue happens*
The issue occurs on Linux guests with kernel versions older than 2.6.36.
The guest OS hangs on boot when a 64bit PCI BAR is present in the system (for
example when the ivshmem device is used) and occupies a range within the first
4 GB.

*How to reproduce*
I used the following qemu command to reproduce the case:
/usr/local/bin/qemu-system-x86_64 -M pc-1.3 -enable-kvm -m 2000 \
    -smp 1,sockets=1,cores=1,threads=1 -name Rh5332 \
    -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/Rh5332.monitor,server,nowait \
    -mon chardev=charmonitor,id=monitor,mode=readline \
    -rtc base=utc -boot cd \
    -drive file=/home/akorolev/rh5332.img,if=none,id=drive-ide0-0-0,format=raw \
    -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 \
    -chardev file,id=charserial0,path=/home/akorolev/serial.log \
    -device isa-serial,chardev=charserial0,id=serial0 \
    -usb -vnc 127.0.0.1:0 -k en-us -vga cirrus \
    -device ivshmem,shm,size=32M \
    -device virtio-balloon-pci,id=balloon0

I tried different guests: CentOS 5.8 64bit, RHEL 5.3 32bit, FC 12 64bit; on all
of these machines the hang occurs in 100% of cases.

*Why it happens*
The issue basically comes from the Linux PCI enumeration code.

The OS sizes 64bit BARs when a device is enabled using the following procedure
(a minimal sketch in C follows the list):
1. Write all FF's to lower half of 64bit BAR
2. Write address back to lower half of 64bit BAR
3. Write all FF's to higher half of 64bit BAR
4. Write address back to higher half of 64bit BAR
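
To make the sequence concrete, here is a minimal C sketch of that sizing
procedure as a pre-2.6.36 guest performs it; pci_cfg_read32()/pci_cfg_write32()
are hypothetical stand-ins for the guest's config-space accessors, not real
kernel functions:

#include <stdint.h>

/* Hypothetical config-space accessors (real guests go through 0xCF8/0xCFC
 * or MMCONFIG); declared here only to make the sequence readable. */
uint32_t pci_cfg_read32(int bdf, int off);
void     pci_cfg_write32(int bdf, int off, uint32_t val);

/* Size a 64bit memory BAR whose lower half lives at config offset 'bar'. */
uint64_t size_64bit_mem_bar(int bdf, int bar)
{
    uint32_t lo = pci_cfg_read32(bdf, bar);
    uint32_t hi = pci_cfg_read32(bdf, bar + 4);
    uint64_t mask;

    pci_cfg_write32(bdf, bar, 0xFFFFFFFF);              /* 1. all FFs to lower half */
    mask = pci_cfg_read32(bdf, bar) & ~0xFULL;           /*    (strip BAR type bits) */
    pci_cfg_write32(bdf, bar, lo);                       /* 2. restore lower half;
                                                             the upper half is still
                                                             untouched at this point */
    pci_cfg_write32(bdf, bar + 4, 0xFFFFFFFF);           /* 3. all FFs to upper half */
    mask |= (uint64_t)pci_cfg_read32(bdf, bar + 4) << 32;
    pci_cfg_write32(bdf, bar + 4, hi);                   /* 4. restore upper half    */

    return ~mask + 1;                                    /* e.g. 0x2000000 for 32MB  */
}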

For qemu this means that pci_default_write_config() receives all FFs for the
lower half of the 64bit BAR. It then applies the mask and converts the value to
"all FFs - size + 1" (0xFE000000 if the size is 32MB).
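
Concretely, for a 32MB BAR the arithmetic works out as follows (an illustrative
snippet, not qemu code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t size    = 32u * 1024 * 1024;      /* 32MB BAR                          */
    uint32_t written = 0xFFFFFFFF;             /* value the guest writes            */
    uint32_t masked  = written & ~(size - 1);  /* same as "all FFs - size + 1"      */

    printf("%08x\n", masked);                  /* prints fe000000                   */
    return 0;
}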

So for a short period of time the range [0xFE000000 - 0xFFFFFFFF] is occupied
by the ivshmem resource. For some reason this is lethal for the rest of the
boot process.

We have found that the boot process breaks completely if the kvm-apic-msi range
is overlapped even for a short period of time. (We still don't know why this
happens; perhaps the qemu maintainers can answer?)

The kvm-apic-msi memory region is a non-overlappable memory region with the
hardcoded address range [0xFEE00000 - 0xFEF00000].

Here is a log we collected from render_memory_regions:

 system overlap 0 pri 0 [0x0 - 0x7fffffffffffffff]
     kvmvapic-rom overlap 1 pri 1000 [0xca000 - 0xcd000]
         pc.ram overlap 0 pri 0 [0xca000 - 0xcd000]
         ++ pc.ram [0xca000 - 0xcd000] is added to view
     ....................
     smram-region overlap 1 pri 1 [0xa0000 - 0xc0000]
         pci overlap 0 pri 0 [0xa0000 - 0xc0000]
             cirrus-lowmem-container overlap 1 pri 1 [0xa0000 - 0xc0000]
                 cirrus-low-memory overlap 0 pri 0 [0xa0000 - 0xc0000]
                ++cirrus-low-memory [0xa0000 - 0xc0000] is added to view
     kvm-ioapic overlap 0 pri 0 [0xfec00000 - 0xfec01000]
    ++kvm-ioapic [0xfec00000 - 0xfec01000] is added to view
     pci-hole64 overlap 0 pri 0 [0x100000000 - 0x4000000100000000]
         pci overlap 0 pri 0 [0x100000000 - 0x4000000100000000]
     pci-hole overlap 0 pri 0 [0x7d000000 - 0x100000000]
         pci overlap 0 pri 0 [0x7d000000 - 0x100000000]
             ivshmem-bar2-container overlap 1 pri 1 [0xfe000000 - 0x100000000]
                 ivshmem.bar2 overlap 0 pri 0 [0xfe000000 - 0x100000000]
                ++ivshmem.bar2 [0xfe000000 - 0xfec00000] is added to view
                ++ivshmem.bar2  [0xfec01000 - 0x100000000] is added to view
             ivshmem-mmio overlap 1 pri 1 [0xfebf1000 - 0xfebf1100]
             e1000-mmio overlap 1 pri 1 [0xfeba0000 - 0xfebc0000]
             cirrus-mmio overlap 1 pri 1 [0xfebf0000 - 0xfebf1000]
             cirrus-pci-bar0 overlap 1 pri 1 [0xfa000000 - 0xfc000000]
                 vga.vram overlap 1 pri 1 [0xfa000000 - 0xfa800000]
                ++vga.vram [0xfa000000 - 0xfa800000] is added to view
                 cirrus-bitblt-mmio overlap 0 pri 0 [0xfb000000 - 0xfb400000]
                ++cirrus-bitblt-mmio [0xfb000000 - 0xfb400000] is added to view
                 cirrus-linear-io overlap 0 pri 0 [0xfa000000 - 0xfa800000]
             pc.bios overlap 0 pri 0 [0xfffe0000 - 0x100000000]
     ram-below-4g overlap 0 pri 0 [0x0 - 0x7d000000]
         pc.ram overlap 0 pri 0 [0x0 - 0x7d000000]
        ++pc.ram [0x0 - 0xa0000] is added to view
        ++pc.ram [0x100000 - 0x7d000000] is added to view
     kvm-apic-msi overlap 0 pri 0 [0xfee00000 - 0xfef00000]

As you can see from the log, kvm-apic-msi is enumerated last, when the range
[0xfee00000 - 0xfef00000] is already occupied by ivshmem.bar2
[0xfec01000 - 0x100000000].


*Possible solutions*
Solution 1. Probably the best option would be to add a rule that regions which
must not be overlapped are added to the view first (in other words, such
regions have the highest priority). Please find a patch in the following
message.

Solution 2. Raise the priority of the kvm-apic-msi resource. This solution is a
bit misleading, as priority is only applicable to overlappable regions, while
this region must not be overlapped.

Solution 3. Fix the issue at the PCI level. Track whether the resource is 64bit
and apply changes only once both halves of the 64bit BAR have been programmed.
(It appears that real PCI bus controllers are smart enough to track 64bit BAR
writes on a PC, so qemu could do the same? The drawbacks are that tracking PCI
writes is a bit cumbersome, and such tracking may look like a hack to some.)
A rough sketch of this idea follows.
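
For illustration only, here is a rough C sketch of the idea behind Solution 3;
BARState and bar_update_mapping() are hypothetical names, not the actual qemu
API:

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t addr;           /* address currently programmed in config space   */
    bool     is_64bit;
    bool     lower_pending;  /* lower half written, upper half not yet updated */
} BARState;

void bar_update_mapping(BARState *bar);   /* hypothetical: remaps the region   */

/* Called for a 32bit config write to either half of the BAR.
 * half == 0 -> lower 32 bits, half == 1 -> upper 32 bits. */
void bar_config_write(BARState *bar, int half, uint32_t val)
{
    if (half == 0) {
        bar->addr = (bar->addr & ~0xFFFFFFFFULL) | val;
        if (bar->is_64bit) {
            bar->lower_pending = true;   /* defer remapping until the upper half
                                            of the 64bit BAR is also programmed */
            return;
        }
    } else {
        bar->addr = (bar->addr & 0xFFFFFFFFULL) | ((uint64_t)val << 32);
        bar->lower_pending = false;
    }
    bar_update_mapping(bar);             /* remap only on a complete update     */
}

With such deferral the lower-half-only value from step 1 of the sizing sequence
would not be mapped on its own, so the transient [0xFE000000 - 0xFFFFFFFF]
range should never appear in the first 4 GB.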


Alexey
