RE: [PATCH] KVM: emulate lapic tsc deadline timer for hvm
>> My question is, which kvm_get_msrs/kvm_put_msrs routine be used by
>> live migration, the routine in target-i386/kvm.c, or in
>> kvm/libkvm/libkvm-x86.c? They both have ioctl
>> KVM_GET_MSR_INDEX_LIST/ KVM_GET_MSRS/ KVM_SET_MSRS, but I'm not
>> clear their purpose/usage difference.
>
> kvm_get_msrs/kvm_put_msrs in target-i386/kvm.c. kvm/ directory is
> dead.

Thanks for making that clear. Added it to qemu like:

diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 935d08a..62ff73c 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -283,6 +283,7 @@
 #define MSR_IA32_APICBASE_BSP           (1<<8)
 #define MSR_IA32_APICBASE_ENABLE        (1<<11)
 #define MSR_IA32_APICBASE_BASE          (0xf<<12)
+#define MSR_IA32_TSCDEADLINE            0x6e0
 
 #define MSR_MTRRcap                     0xfe
 #define MSR_MTRRcap_VCNT                8
@@ -687,6 +688,7 @@ typedef struct CPUX86State {
     uint64_t async_pf_en_msr;
 
     uint64_t tsc;
+    uint64_t tsc_deadline;
 
     uint64_t mcg_status;
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index aa843f0..206fcad 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -59,6 +59,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
 
 static bool has_msr_star;
 static bool has_msr_hsave_pa;
+static bool has_msr_tsc_deadline;
 static bool has_msr_async_pf_en;
 static int lm_capable_kernel;
 
@@ -571,6 +572,10 @@ static int kvm_get_supported_msrs(KVMState *s)
                     has_msr_hsave_pa = true;
                     continue;
                 }
+                if (kvm_msr_list->indices[i] == MSR_IA32_TSCDEADLINE) {
+                    has_msr_tsc_deadline = true;
+                    continue;
+                }
             }
         }
 
@@ -899,6 +904,9 @@ static int kvm_put_msrs(CPUState *env, int level)
     if (has_msr_hsave_pa) {
         kvm_msr_entry_set(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
     }
+    if (has_msr_tsc_deadline) {
+        kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSCDEADLINE, env->tsc_deadline);
+    }
 #ifdef TARGET_X86_64
     if (lm_capable_kernel) {
         kvm_msr_entry_set(&msrs[n++], MSR_CSTAR, env->cstar);
@@ -1145,6 +1153,9 @@ static int kvm_get_msrs(CPUState *env)
     if (has_msr_hsave_pa) {
         msrs[n++].index = MSR_VM_HSAVE_PA;
     }
+    if (has_msr_tsc_deadline) {
+        msrs[n++].index = MSR_IA32_TSCDEADLINE;
+    }
 
     if (!env->tsc_valid) {
         msrs[n++].index = MSR_IA32_TSC;
@@ -1213,6 +1224,9 @@ static int kvm_get_msrs(CPUState *env)
         case MSR_IA32_TSC:
             env->tsc = msrs[i].data;
             break;
+        case MSR_IA32_TSCDEADLINE:
+            env->tsc_deadline = msrs[i].data;
+            break;
         case MSR_VM_HSAVE_PA:
             env->vm_hsave = msrs[i].data;
             break;

Thanks,
Jinsong
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
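The detection pattern used by the patch above — scan the index list returned by KVM_GET_MSR_INDEX_LIST once at init and remember which optional MSRs the kernel can transfer — can be sketched roughly as follows. This is a standalone illustration with hard-coded MSR numbers, not the actual qemu code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_TSCDEADLINE 0x6e0
#define MSR_VM_HSAVE_PA      0xc0010117

static bool has_msr_tsc_deadline;
static bool has_msr_hsave_pa;

/* Scan a supported-MSR index list (as KVM_GET_MSR_INDEX_LIST would
 * return it) and record which optional MSRs can be saved/restored. */
static void scan_msr_list(const uint32_t *indices, int n)
{
    for (int i = 0; i < n; i++) {
        if (indices[i] == MSR_VM_HSAVE_PA) {
            has_msr_hsave_pa = true;
        } else if (indices[i] == MSR_IA32_TSCDEADLINE) {
            has_msr_tsc_deadline = true;
        }
    }
}
```

With the flag set once at init, kvm_put_msrs()/kvm_get_msrs() can include the MSR in the transfer list only when the kernel actually supports it, so migration against older kernels keeps working.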
Re: [PATCH] KVM: emulate lapic tsc deadline timer for hvm
On Sat, Sep 10, 2011 at 02:11:36AM +0800, Liu, Jinsong wrote:
> Marcelo Tosatti wrote:
> > On Fri, Sep 09, 2011 at 01:12:51AM +0800, Liu, Jinsong wrote:
> > --- a/arch/x86/include/asm/msr-index.h
> > +++ b/arch/x86/include/asm/msr-index.h
> > @@ -229,6 +229,8 @@
> > #define MSR_IA32_APICBASE_ENABLE (1<<11)
> > #define MSR_IA32_APICBASE_BASE (0xf<<12)
> >
> > +#define MSR_IA32_TSCDEADLINE 0x06e0
> > +
> > #define MSR_IA32_UCODE_WRITE 0x0079
> > #define MSR_IA32_UCODE_REV 0x008b
> > Need to add to msrs_to_save so live migration works.
> >>>
> >>> MSR must be explicitly listed in qemu, also.
> >>>
> >>
> >> Marcelo, seems MSR don't need explicitly list in qemu?
> >> KVM side adding MSR_IA32_TSCDEADLINE to msrs_to_save is enough. Qemu
> >> will get it through KVM_GET_MSR_INDEX_LIST. Do I miss something?
> >
> > Notice in target-i386/kvm.c the KVM_GET_MSR_INDEX_LIST list is only
> > used for MSR_STAR/MSR_HSAVE_PA presence detection.
>
> Yes
>
> >
> > Do you do need to explicitly add MSR_IA32_TSCDEADLINE to
> > kvm_get_msrs/kvm_put_msrs routines.
>
> My question is, which kvm_get_msrs/kvm_put_msrs routine be used by live
> migration, the routine in target-i386/kvm.c, or in kvm/libkvm/libkvm-x86.c?
> They both have ioctl KVM_GET_MSR_INDEX_LIST/ KVM_GET_MSRS/ KVM_SET_MSRS, but
> I'm not clear their purpose/usage difference.

kvm_get_msrs/kvm_put_msrs in target-i386/kvm.c. kvm/ directory is dead.
Re: About hotplug multifunction
On Fri, Sep 09, 2011 at 10:05:01AM -0700, Alex Williamson wrote:
> On Fri, 2011-09-09 at 10:32 +0300, Michael S. Tsirkin wrote:
> > On Fri, Sep 09, 2011 at 03:08:21AM -0400, Amos Kong wrote:
> > > Hello all,
> > >
> > > I'm working on hotplug pci multifunction.
> > >
> > > 1. qemu cmdline:
> > > ./x86_64-softmmu/qemu-system-x86_64 -snapshot -m 2000
> > > /home/kvm_autotest_root/images/rhel61-64-virtio.qcow2 -vnc :0 -monitor
> > > unix:/tmp/a,server,nowait --enable-kvm -net none
> > >
> > > 2. script to add virtio-blk devices:
> > > for i in `seq 1 7` 0;do
> > > qemu-img create /tmp/resize$i.qcow2 1G -f qcow2
> > > echo drive_add 0x6.$i id=drv$i,if=none,file=/tmp/resize$i.qcow2 | nc -U /tmp/a
> > > echo device_add virtio-blk-pci,id=dev$i,drive=drv$i,addr=0x6.$i,multifunction=on | nc -U /tmp/a
> > > done
>
> I don't think it should work this way, there shouldn't be special
> intrinsic meaning that hotplugging func 0 makes the whole device appear.

Function 0 is mandatory. That's what the guest (at least the Linux
driver) searches for when a notification for the slot is received.

> Perhaps we need a notify= device option so we can add devices without
> notifying the guest, then maybe a different command to cause the pci
> slot notification.
>
> > > 3. script to add virtio-nic devices:
> > > for i in `seq 1 7` 0;do
> > > echo netdev_add tap,id=drv$i | nc -U /tmp/a
> > > echo device_add virtio-net-pci,id=dev$i,netdev=drv$i,addr=0x6.$i,multifunction=on | nc -U /tmp/a
> > > done
> > >
> > > 4. current qemu behaviors
> > > 4.1. add func 1~7 one by one, then add func 0
> > > virtio-nic : success, all funcs are added
> > > virtio-blk : success
> > >
> > > 4.2. add func 0~7 one by one
> > > virtio-nic : failed, only func 0 is added
> > > virtio-blk : success
> > >
> > > 4.3. removing any single func in monitor
> > > virtio-nic: func 0 is not found in 'lspci', func 1~7 also exist. eth1~eth7 also exist.
> > > virtio-blk: func 0 is not found in 'lspci', func 1~7 also exist. /dev/vda disappears,
> > > vdb,vdc,vde,vdf,vdg,vdh,vdi,vdj also exist. If I re-add 8 funcs to guest, they all work.
> > > # lspci (00:06.1 ~ 00:06.7 exist, 00:06.0 doesn't exist)
> > > 00:06.1 SCSI storage controller: Red Hat, Inc Virtio block device (rev ff)
>
> We shouldn't be able to remove single funcs of a multifunction device,
> imho.
>
> > something I noted when reading our acpi code:
> > we currently pass eject request for function 0 only:
> >    Name (_ADR, nr##)
> > We either need a device per function there (acpi 1.0),
> > send eject request for them all, or use
> > as function number (newer acpi, not sure which version).
> > Need to see which guests (windows,linux) can handle which form.
>
> I'd guess we need to change that to .

No need, only make sure function 0 is there and all other functions
should be removed automatically by the guest on eject notification.

> > > qemu sends an acpi event to guest, then guest will remove all funcs in the slot.
> > > linux-2.6/drivers/pci/hotplug/acpiphp_glue.c:
> > > static int disable_device(struct acpiphp_slot *slot) {
> > >     list_for_each_entry(func, &slot->funcs, sibling) {
> > >     ...
> > >
> > > Questions:
> > > 1. why func1~7 still can be found after hot-remove? is it same as real hardware?
>
> I think we want to behave the same as adding and removing a complete
> physical device, which means all the functions get added and removed
> together. Probably the only time individual functions disappear on real
> hardware is poking chipset registers to hide and expose sub devices. It
> may not be necessary to make them atomically [in]accessible, but we
> should treat them as a set.

ACPI PCI hotplug is based on slots, not on functions. It does not
support addition/removal of individual functions.

> > > 2. why the func 1~7 could not be added to guest (adding func 0~7 one by one)?
> > > 3. how about this interface to hotplug/hot-unplug multifunction:
> > >    1) Add func 1-7 by monitor, add func 0, then send an acpi event to notify guest
> > >    2) Remove func0, send an acpi event to guest. (all funcs can be removed)
>
> I think I'd prefer an explicit interface. Thanks,
>
> Alex

Function 0 must be present for the guest to detect the device. I do not
see the problem of specifying (and documenting) that the insert
notification is sent for function 0 only. An explicit interface is
going to break the current scheme where a single "device_add" command
also does the notification.
Re: KSM memory saving amount
On Fri, Sep 09, 2011 at 09:38:23AM +0300, Mihamina Rakotomandimby wrote:
> Hi all,
>
> Running an Ubuntu Natty where I launched 3 Xubuntu and 1 win 7, I can see:
> /sys/kernel/mm/ksm/full_scans: 34
> /sys/kernel/mm/ksm/pages_shared: 8020
> /sys/kernel/mm/ksm/pages_sharing: 36098
> /sys/kernel/mm/ksm/pages_to_scan: 100
> /sys/kernel/mm/ksm/pages_unshared: 247109
> /sys/kernel/mm/ksm/pages_volatile: 32716
> /sys/kernel/mm/ksm/run: 1
> /sys/kernel/mm/ksm/sleep_millisecs: 200
>
> I would like to evaluate the memory amount:
> How many bytes is a page?

4096.
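A back-of-the-envelope way to turn those counters into bytes saved: each of the pages_sharing mappings is backed by one of the pages_shared physical pages, so roughly (pages_sharing - pages_shared) pages' worth of copies were avoided. A hypothetical helper, not something from the thread:

```c
#include <assert.h>
#include <stdint.h>

/* Approximate KSM savings in bytes: pages_sharing mappings are backed
 * by only pages_shared physical pages, so the surplus mappings are the
 * copies that no longer need a page of their own. */
static uint64_t ksm_saved_bytes(uint64_t pages_shared,
                                uint64_t pages_sharing,
                                uint64_t page_size)
{
    if (pages_sharing <= pages_shared)
        return 0;
    return (pages_sharing - pages_shared) * page_size;
}
```

With the numbers quoted above (pages_shared=8020, pages_sharing=36098, 4096-byte pages) this comes to about 110 MiB saved.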
Re: [PATCH 1/2] qemu-kvm: pc: Factor out apic_next_timer
On Thu, Sep 08, 2011 at 12:51:31PM +0200, Jan Kiszka wrote:
> Factor out apic_next_timer from apic_timer_update. The former can then
> be used to update next_timer without actually starting the qemu timer.
> KVM's in-kernel APIC model will make use of it.
>
> Signed-off-by: Jan Kiszka

Applied both, thanks.
Re: [PATCH] qemu-kvm: Resolve PCI upstream diffs
On Thu, Sep 08, 2011 at 12:48:02PM +0200, Jan Kiszka wrote:
> Resolve all unneeded deviations from upstream code. No functional
> changes.
>
> Signed-off-by: Jan Kiszka
> ---
>  hw/pci.c | 11 +++
>  hw/pci.h |  5 -
>  2 files changed, 11 insertions(+), 5 deletions(-)

Applied, thanks.
RE: [PATCH] KVM: emulate lapic tsc deadline timer for hvm
Marcelo Tosatti wrote:
> On Fri, Sep 09, 2011 at 01:12:51AM +0800, Liu, Jinsong wrote:
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -229,6 +229,8 @@
> #define MSR_IA32_APICBASE_ENABLE (1<<11)
> #define MSR_IA32_APICBASE_BASE (0xf<<12)
>
> +#define MSR_IA32_TSCDEADLINE 0x06e0
> +
> #define MSR_IA32_UCODE_WRITE 0x0079
> #define MSR_IA32_UCODE_REV 0x008b
>>> Need to add to msrs_to_save so live migration works.
>>>
>>> MSR must be explicitly listed in qemu, also.
>>>
>>
>> Marcelo, seems MSR don't need explicitly list in qemu?
>> KVM side adding MSR_IA32_TSCDEADLINE to msrs_to_save is enough. Qemu
>> will get it through KVM_GET_MSR_INDEX_LIST. Do I miss something?
>
> Notice in target-i386/kvm.c the KVM_GET_MSR_INDEX_LIST list is only
> used for MSR_STAR/MSR_HSAVE_PA presence detection.

Yes, but in kvm/libkvm/libkvm-x86.c the KVM_GET_MSR_INDEX_LIST gets the
full list. That's what I want to make clear --> which one does live
migration use?

> Do you do need to explicitly add MSR_IA32_TSCDEADLINE to
> kvm_get_msrs/kvm_put_msrs routines.
Re: KSM memory saving amount
On Fri, Sep 9, 2011 at 12:52 PM, Thomas Treutner wrote:
> Am 09.09.2011 08:38, schrieb Mihamina Rakotomandimby:
> >
> > How many byte is a page?
>
> Typically, unless you use hugepages, a page is 4 KiB = 4096 Byte.

I wrote a bash script which calculates the KSM savings. I have posted
it here: https://gist.github.com/1206923

I can not guarantee its accuracy. It was the best I could do with the
knowledge I had. The results seem reasonable to me though.

Hope this is useful to someone.

Dan VerWeire
RE: [PATCH] KVM: emulate lapic tsc deadline timer for hvm
Marcelo Tosatti wrote:
> On Fri, Sep 09, 2011 at 01:12:51AM +0800, Liu, Jinsong wrote:
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -229,6 +229,8 @@
> #define MSR_IA32_APICBASE_ENABLE (1<<11)
> #define MSR_IA32_APICBASE_BASE (0xf<<12)
>
> +#define MSR_IA32_TSCDEADLINE 0x06e0
> +
> #define MSR_IA32_UCODE_WRITE 0x0079
> #define MSR_IA32_UCODE_REV 0x008b
>>> Need to add to msrs_to_save so live migration works.
>>>
>>> MSR must be explicitly listed in qemu, also.
>>>
>>
>> Marcelo, seems MSR don't need explicitly list in qemu?
>> KVM side adding MSR_IA32_TSCDEADLINE to msrs_to_save is enough. Qemu
>> will get it through KVM_GET_MSR_INDEX_LIST. Do I miss something?
>
> Notice in target-i386/kvm.c the KVM_GET_MSR_INDEX_LIST list is only
> used for MSR_STAR/MSR_HSAVE_PA presence detection.

Yes

> Do you do need to explicitly add MSR_IA32_TSCDEADLINE to
> kvm_get_msrs/kvm_put_msrs routines.

My question is, which kvm_get_msrs/kvm_put_msrs routine is used by live
migration, the routine in target-i386/kvm.c, or in kvm/libkvm/libkvm-x86.c?
They both have ioctl KVM_GET_MSR_INDEX_LIST/ KVM_GET_MSRS/ KVM_SET_MSRS, but
I'm not clear on their purpose/usage difference.

Thanks,
Jinsong
Re: About hotplug multifunction
pci/pcie hotplug needs cleanup for multifunction hotplug in the long
term. Only the single function device case works; the multifunction
case is somewhat broken. Especially, the current acpi based hotplug
should be replaced by the standardized hot plug controller in the long
term.

On Fri, Sep 09, 2011 at 03:08:21AM -0400, Amos Kong wrote:
> Hello all,
>
> I'm working on hotplug pci multifunction.
>
> 1. qemu cmdline:
> ./x86_64-softmmu/qemu-system-x86_64 -snapshot -m 2000
> /home/kvm_autotest_root/images/rhel61-64-virtio.qcow2 -vnc :0 -monitor
> unix:/tmp/a,server,nowait --enable-kvm -net none
>
> 2. script to add virtio-blk devices:
> for i in `seq 1 7` 0;do
> qemu-img create /tmp/resize$i.qcow2 1G -f qcow2
> echo drive_add 0x6.$i id=drv$i,if=none,file=/tmp/resize$i.qcow2 | nc -U /tmp/a
> echo device_add virtio-blk-pci,id=dev$i,drive=drv$i,addr=0x6.$i,multifunction=on | nc -U /tmp/a
> done
>
> 3. script to add virtio-nic devices:
> for i in `seq 1 7` 0;do
> echo netdev_add tap,id=drv$i | nc -U /tmp/a
> echo device_add virtio-net-pci,id=dev$i,netdev=drv$i,addr=0x6.$i,multifunction=on | nc -U /tmp/a
> done
>
> 4. current qemu behaviors
> 4.1. add func 1~7 one by one, then add func 0
> virtio-nic : success, all funcs are added
> virtio-blk : success
>
> 4.2. add func 0~7 one by one
> virtio-nic : failed, only func 0 is added
> virtio-blk : success
>
> 4.3. removing any single func in monitor
> virtio-nic: func 0 is not found in 'lspci', func 1~7 still exist. eth1~eth7 also exist.
> virtio-blk: func 0 is not found in 'lspci', func 1~7 still exist. /dev/vda disappears,
> vdb,vdc,vde,vdf,vdg,vdh,vdi,vdj still exist. If I re-add 8 funcs to guest, they all work.
> # lspci (00:06.1 ~ 00:06.7 exist, 00:06.0 doesn't exist)
> 00:06.1 SCSI storage controller: Red Hat, Inc Virtio block device (rev ff)
>
> qemu sends an acpi event to guest, then guest will remove all funcs in the slot.
> linux-2.6/drivers/pci/hotplug/acpiphp_glue.c:
> static int disable_device(struct acpiphp_slot *slot) {
>     list_for_each_entry(func, &slot->funcs, sibling) {
>     ...
>
> Questions:
> 1. why can func 1~7 still be found after hot-remove? is it same as real hardware?
> 2. why could the funcs 1~7 not be added to guest (adding func 0~7 one by one)?
> 3. how about this interface to hotplug/hot-unplug multifunction:
>    1) Add func 1-7 by monitor, add func 0, then send an acpi event to notify guest
>    2) Remove func0, send an acpi event to guest. (all funcs can be removed)
> 4. what does "revision 0xff" stand for?
>
> Thanks in advance,
> Amos

--
yamahata
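The slot-granular semantics described in this thread — acpiphp's disable_device() walks every function in the slot when an eject notification arrives — can be modelled as a toy sketch. This is a hypothetical illustration, not qemu or kernel code:

```c
#include <assert.h>
#include <stdbool.h>

#define FUNCS_PER_SLOT 8

struct slot {
    bool present[FUNCS_PER_SLOT];
};

/* ACPI PCI hotplug operates on whole slots: an eject notification
 * removes every populated function, not just the function named in
 * the request.  Returns the number of functions removed. */
static int eject_slot(struct slot *s)
{
    int removed = 0;
    for (int fn = 0; fn < FUNCS_PER_SLOT; fn++) {
        if (s->present[fn]) {
            s->present[fn] = false;
            removed++;
        }
    }
    return removed;
}
```

This mirrors why a single notification for function 0 is enough: the guest's slot-level handler tears down functions 1-7 on its own.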
Re: KSM memory saving amount
Am 09.09.2011 08:38, schrieb Mihamina Rakotomandimby:
> How many byte is a page?

Typically, unless you use hugepages, a page is 4 KiB = 4096 Byte.
Re: About hotplug multifunction
On Fri, 2011-09-09 at 10:32 +0300, Michael S. Tsirkin wrote:
> On Fri, Sep 09, 2011 at 03:08:21AM -0400, Amos Kong wrote:
> > Hello all,
> >
> > I'm working on hotplug pci multifunction.
> >
> > 1. qemu cmdline:
> > ./x86_64-softmmu/qemu-system-x86_64 -snapshot -m 2000
> > /home/kvm_autotest_root/images/rhel61-64-virtio.qcow2 -vnc :0 -monitor
> > unix:/tmp/a,server,nowait --enable-kvm -net none
> >
> > 2. script to add virtio-blk devices:
> > for i in `seq 1 7` 0;do
> > qemu-img create /tmp/resize$i.qcow2 1G -f qcow2
> > echo drive_add 0x6.$i id=drv$i,if=none,file=/tmp/resize$i.qcow2 | nc -U /tmp/a
> > echo device_add virtio-blk-pci,id=dev$i,drive=drv$i,addr=0x6.$i,multifunction=on | nc -U /tmp/a
> > done

I don't think it should work this way, there shouldn't be special
intrinsic meaning that hotplugging func 0 makes the whole device appear.
Perhaps we need a notify= device option so we can add devices without
notifying the guest, then maybe a different command to cause the pci
slot notification.

> > 3. script to add virtio-nic devices:
> > for i in `seq 1 7` 0;do
> > echo netdev_add tap,id=drv$i | nc -U /tmp/a
> > echo device_add virtio-net-pci,id=dev$i,netdev=drv$i,addr=0x6.$i,multifunction=on | nc -U /tmp/a
> > done
> >
> > 4. current qemu behaviors
> > 4.1. add func 1~7 one by one, then add func 0
> > virtio-nic : success, all funcs are added
> > virtio-blk : success
> >
> > 4.2. add func 0~7 one by one
> > virtio-nic : failed, only func 0 is added
> > virtio-blk : success
> >
> > 4.3. removing any single func in monitor
> > virtio-nic: func 0 is not found in 'lspci', func 1~7 still exist. eth1~eth7 also exist.
> > virtio-blk: func 0 is not found in 'lspci', func 1~7 still exist. /dev/vda disappears,
> > vdb,vdc,vde,vdf,vdg,vdh,vdi,vdj still exist. If I re-add 8 funcs to guest, they all work.
> > # lspci (00:06.1 ~ 00:06.7 exist, 00:06.0 doesn't exist)
> > 00:06.1 SCSI storage controller: Red Hat, Inc Virtio block device (rev ff)

We shouldn't be able to remove single funcs of a multifunction device,
imho.

> something I noted when reading our acpi code:
> we currently pass eject request for function 0 only:
>    Name (_ADR, nr##)
> We either need a device per function there (acpi 1.0),
> send eject request for them all, or use
> as function number (newer acpi, not sure which version).
> Need to see which guests (windows,linux) can handle which form.

I'd guess we need to change that to .

> > qemu sends an acpi event to guest, then guest will remove all funcs in the slot.
> > linux-2.6/drivers/pci/hotplug/acpiphp_glue.c:
> > static int disable_device(struct acpiphp_slot *slot) {
> >     list_for_each_entry(func, &slot->funcs, sibling) {
> >     ...
> >
> > Questions:
> > 1. why func1~7 still can be found after hot-remove? is it same as real hardware?

I think we want to behave the same as adding and removing a complete
physical device, which means all the functions get added and removed
together. Probably the only time individual functions disappear on real
hardware is poking chipset registers to hide and expose sub devices. It
may not be necessary to make them atomically [in]accessible, but we
should treat them as a set.

> > 2. why the func 1~7 could not be added to guest (adding func 0~7 one by one)?
> > 3. how about this interface to hotplug/hot-unplug multifunction:
> >    1) Add func 1-7 by monitor, add func 0, then send an acpi event to notify guest
> >    2) Remove func0, send an acpi event to guest. (all funcs can be removed)

I think I'd prefer an explicit interface. Thanks,

Alex

> We must make sure guest acked removal of all functions. Surprise
> removal would be bad.
>
> > 4. what does "revision 0xff" stand for?

> You get 0xff if no device responds to a configuration read.
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On 9/8/11 10:55 PM, "Michael S. Tsirkin" wrote:

> On Thu, Sep 08, 2011 at 07:53:11PM -0700, Roopa Prabhu wrote:
> > > > Phase 1: Goal: Enable hardware filtering for all macvlan modes
> > > > - In macvlan passthru mode the single guest virtio-nic connected
> > > >   will receive traffic that he requested for
> > > > - In macvlan non-passthru mode all guest virtio-nics sharing the
> > > >   physical nic will see all other guest traffic but the filtering
> > > >   at guest virtio-nic
> > >
> > > I don't think guests currently filter anything.
> > >
> > I was referring to Qemu-kvm virtio-net in
> > virtion_net_receive->receive_filter. I think It only passes pkts that the
> > guest OS is interested. It uses the filter table that I am passing to
> > macvtap in this patch.
>
> This happens after userspace thread gets woken up and data
> is copied there. So relying on filtering at that level is
> going to be very inefficient on a system with
> multiple active guests. Further, and for that reason, vhost-net
> doesn't do filtering at all, relying on the backends
> to pass it correct packets.

Ok thanks for the info. So in which case, phase 1 is best for PASSTHRU
mode and for non-PASSTHRU when there is a single guest connected to a
VF. For non-PASSTHRU multi guest sharing the same VF, phase 1 is
definitely better than putting the VF in promiscuous mode. But to
address the concern you mention above, in phase 2 when we have more
than one guest sharing the VF, we will have to add filter lookup in
macvlan to filter pkts for each guest. This will need some performance
tests too.

Will start investigating the netlink interface comments for phase 1
first.

Thanks!
-Roopa
1st level guest crashes on start of 2nd level guest in nested virtualisation
Hello Nadav Har'El,

We are using qemu-kvm with nested virtualisation (two guest levels).
The 1st-level guest sporadically crashes on start-up of the 2nd-level
guest (in the BIOS). We have now found out that it always crashes with
the parameter "no-kvm-irqchip" set.

Do you have any idea what the problem is? Any help?

Regards,
Steffen

Host System:
- Ubuntu 10.10
- kernel 2.6.35-30-generic #56-Ubuntu SMP Mon Jul 11 20:01:08 UTC 2011 x86_64 GNU/Linux
- QEMU emulator version 0.15.0 (qemu-kvm-0.15.0)
- kvm_kmod from git repository: commit c040fec91c95609c8a6b54ddd5ce952605a11850 (Sun Aug 21 21:23:11 2011 -0700)
- kvm_kmod submodule linux: commit 902c502f0b0efec3a784a8ef65057298025e5e11 (Fri Aug 26 08:04:19 2011 -0300)

Guest system (1st level):
- fresh installed Ubuntu 11.04
- QEMU emulator version 0.14.0 (qemu-kvm-0.14.0)
- kernel 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011 x86_64 GNU/Linux
- kvm modules from kernel above

Procedure to reproduce crash:
1. Activate nested virtualisation
2. Boot Qemu guest (1st level) system with given command

qemu-system-x86_64 \
  -no-kvm-irqchip \
  -enable-kvm \
  -m 1G \
  -smp 1 \
  -drive file=vda.qcow2,if=none,id=drive-virtio-disk0,format=qcow2 \
  -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 \
  -cpu host

3. Start 2nd-level Qemu guest: qemu-system-x86_64

Result: Qemu 1st-level guest crashes with:

KVM: entry failed, hardware error 0x8021
If you're runnning a guest on an Intel machine without unrestricted mode
support, the failure can be most likely due to the guest entering an invalid
state for Intel VT. For example, the guest maybe running in big real mode
which is not supported on less recent Intel processors.
RAX=008f RBX= RCX=6ea6 RDX=0100
RSI= RDI=0002000f RBP=8b00 RSP=88003c189d00
R8 = R9 = R10= R11=
R12= R13= R14= R15=
RIP=a0112621 RFL=00023002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0018 00c0f300 DPL=3 DS [-WA]
CS =0010 00a09b00 DPL=0 CS64 [-RA]
SS =0018 00c09300 DPL=0 DS [-WA]
DS =0018 00c0f300 DPL=3 DS [-WA]
FS = 7f9c9d390700
GS = 88003fc0 000f
LDT= 000f
TR =0040 88003fc11880 2087 8b00 DPL=0 TSS64-busy
GDT= 88003fc04000 007f
IDT= 81bae000 0fff
CR0=8005003b CR2= CR3=3bdeb000 CR4=26f0
DR0= DR1= DR2= DR3=
DR6=0ff0 DR7=0400
EFER=0d01
Code=01 00 00 48 8b 89 50 01 00 00 75 05 0f 01 c2 eb 03 0f 01 c3 <48> 87 0c 24 48 89 81 48 01 00 00 48 89 99 60 01 00 00 ff 34 24 8f 81 50 01 00 00 48 89 91

Tracing (trace-cmd record -b 2 -e kvm) gives the following patterns:

qemu-system-x86-15220 [001] 31572.340816: kvm_entry: vcpu 0
qemu-system-x86-15220 [001] 31572.340817: kvm_exit: reason VMREAD rip 0xa010d06e
qemu-system-x86-15220 [001] 31572.340818: kvm_entry: vcpu 0
qemu-system-x86-15220 [001] 31572.340821: kvm_exit: reason VMWRITE rip 0xa010da4f
qemu-system-x86-15220 [001] 31572.340821: kvm_entry: vcpu 0
qemu-system-x86-15220 [001] 31572.340822: kvm_exit: reason VMWRITE rip 0xa010da4f
qemu-system-x86-15220 [001] 31572.340822: kvm_entry: vcpu 0
qemu-system-x86-15220 [001] 31572.340822: kvm_exit: reason VMWRITE rip 0xa010da4f
qemu-system-x86-15220 [001] 31572.340823: kvm_entry: vcpu 0
qemu-system-x86-15220 [001] 31572.340823: kvm_exit: reason VMWRITE rip 0xa010da4f
qemu-system-x86-15220 [001] 31572.340823: kvm_entry: vcpu 0
qemu-system-x86-15220 [001] 31572.340824: kvm_exit: reason VMWRITE rip 0xa010da4f
qemu-system-x86-15220 [001] 31572.340824: kvm_entry: vcpu 0
qemu-system-x86-15220 [001] 31572.340825: kvm_exit: reason VMRESUME rip 0xa011261e
qemu-system-x86-15220 [001] 31572.340828: kvm_entry: vcpu 0
qemu-system-x86-15220 [001] 31572.340829: kvm_exit: reason EXTERNAL_INTERRUPT rip 0xfebb
qemu-system-x86-15220 [001] 31572.340831: kvm_userspace_exit: reason KVM_EXIT_INTR (10)
qemu-system-x86-15220 [001] 31572.340837: kvm_entry: vcpu 0
qemu-system-x86-15220 [001] 31572.340837: kvm_exit: reason UNKOWN rip 0xa0112621
qemu-system-x86-15220 [001] 31572.340838: kvm_userspace_exit: reason KVM_EXIT_FAIL_ENTRY (9)
Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode
On 9/8/11 9:25 PM, "Sridhar Samudrala" wrote:

> On 9/8/2011 8:00 PM, Roopa Prabhu wrote:
> >
> > On 9/8/11 12:33 PM, "Michael S. Tsirkin" wrote:
> >
> > > On Thu, Sep 08, 2011 at 12:23:56PM -0700, Roopa Prabhu wrote:
> > > > > I think the main usecase for passthru mode is to assign a SR-IOV VF
> > > > > to a single guest.
> > > > >
> > > > Yes and for the passthru usecase this patch should be enough to enable
> > > > filtering in hw (eventually like I indicated before I need to fix vlan
> > > > filtering too).
> > >
> > > So with filtering in hw, and in sriov VF case, VFs
> > > actually share a filtering table. How will that
> > > be partitioned?
> >
> > AFAIK, though it might maintain a single filter table space in hw, hw does
> > know which filter belongs to which VF. And the OS driver does not need to do
> > anything special. The VF driver exposes a VF netdev. And any uc/mc addresses
> > registered with a VF netdev are registered with the hw by the driver. And hw
> > will filter and send only pkts that the VF has expressed interest in.
>
> Does your NIC & driver support adding multiple mac addresses to a VF?
> I have tried a few other SR-IOV NICs sometime back and they didn't
> support this feature.

Yes our nic does. I thought Intel's also does (see ixgbevf_set_rx_mode).
Though I have not really tried using it on an Intel card. I think most
cards should at the least support multicast filters. If the lower dev
does not support unicast filtering, dev_uc_add(lowerdev,..) puts the
lower dev in promiscuous mode. Though.. I think I can check this
beforehand in macvlan_open and put the lowerdev in promiscuous mode if
it does not support filtering.

> Currently, we don't have an interface to add multiple mac addresses to a
> netdev other than an indirect way of creating a macvlan i/f on top of it.

Yes I think so. I have been using only macvlan to test.

Thanks,
Roopa
Re: [PATCH 0/6] Some emulator cleanups
On Wed, Sep 07, 2011 at 04:41:34PM +0300, Avi Kivity wrote:
> Some mindless emulator cleanups while waiting for autotest.
>
> Avi Kivity (6):
>   KVM: x86 emulator: simplify emulate_2op_SrcV()
>   KVM: x86 emulator: simplify emulate_2op_cl()
>   KVM: x86 emulator: simplify emulate_2op_cl()
>   KVM: x86 emulator: simplify emulate_1op()
>   KVM: x86 emulator: merge the two emulate_1op_rax_rdx implementations
>   KVM: x86 emulator: simplify emulate_1op_rax_rdx()
>
>  arch/x86/kvm/emulate.c | 225 +++-
>  1 files changed, 89 insertions(+), 136 deletions(-)

Applied, thanks.
Re: [PATCH v8 3/4] block: add block timer and throttling algorithm
On Thu, Sep 08, 2011 at 06:11:07PM +0800, Zhi Yong Wu wrote:
> Note:
> 1.) When bps/iops limits are specified to a small value such as 511
> bytes/s, this VM will hang up. We are considering how to handle this
> scenario.

You can increase the length of the slice, if the request is larger than
slice_time * bps_limit.

> 2.) When "dd" command is issued in guest, if its option bs is set to a
> large value such as "bs=1024K", the result speed will be slightly
> bigger than the limits. Why?

There is lots of debugging leftovers in the patch.
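The suggestion above — lengthen the slice when a single request exceeds slice_time * bps_limit — amounts to computing the minimum slice that can ever admit the request. A rough sketch with a hypothetical helper, not the actual qemu throttling code:

```c
#include <assert.h>
#include <stdint.h>

/* If a single request is larger than what the current slice allows
 * (slice_time * bps_limit bytes), the slice must be extended or the
 * request can never be dispatched and the guest hangs.  Returns the
 * slice length, in nanoseconds, needed to pass a request of `bytes`
 * at `bps_limit` bytes/s, but never less than the default slice. */
static uint64_t slice_ns_for_request(uint64_t bytes, uint64_t bps_limit,
                                     uint64_t default_slice_ns)
{
    const uint64_t NS_PER_SEC = 1000000000ULL;
    uint64_t needed = bytes * NS_PER_SEC / bps_limit;
    return needed > default_slice_ns ? needed : default_slice_ns;
}
```

For the 511 bytes/s case in the report, any request of a page or more needs a multi-second slice, which is why a fixed short slice makes the guest appear to hang.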
Re: CFQ I/O starvation problem triggered by RHEL6.0 KVM guests
On Fri, Sep 09, 2011 at 06:00:28PM +0900, Takuya Yoshikawa wrote:

[..]

> > - Even if there are close cooperators, these queues are merged and they
> >   are treated as single queue from slice point of view. So cooperating
> >   queues should be merged and get a single slice instead of starving
> >   other queues in the system.
>
> I understand that close cooperators' queues should be merged, but in our test
> case, when the 64KB request was issued from one aio thread, the other thread's
> queue was empty; because these queues are for the same stream, next request
> could not come until current request got finished.
>
> But this is complicated because it depends on the qemu block layer aio.
>
> I am not sure if cfq would try to merge the queues in such cases.

[CCing Jeff Moyer ]

I think even if these queues are alternating, they should have been merged
(if we considered them close cooperators). So in select queue we have:

    new_cfqq = cfq_close_cooperator(cfqd, cfqq);
    if (new_cfqq) {
        if (!cfqq->new_cfqq)
            cfq_setup_merge(cfqq, new_cfqq);
        goto expire;
    }

So if we selected a new queue because it is a close cooperator, we should
have called setup_merge() and next time when the IO happens, one of the
queues should merge into another queue.

    cfq_set_request() {
        if (cfqq->new_cfqq)
            cfqq = cfq_merge_cfqqs(cfqd, cic, cfqq);
    }

If merging is not happening and still we somehow continue to pick
close_cooperator() as the new queue and starve other queues in the
system, then there is a bug.

I think try to reproduce this with fio with upstream kernels and put
some more tracepoints and see what's happening.

Thanks
Vivek
Re: [Qemu-devel] CFQ I/O starvation problem triggered by RHEL6.0 KVM guests
On Fri, Sep 9, 2011 at 10:00 AM, Takuya Yoshikawa wrote: > Vivek Goyal wrote: > >> So you are using both RHEL 6.0 in both host and guest kernel? Can you >> reproduce the same issue with upstream kernels? How easily/frequently >> you can reproduce this with RHEL6.0 host. > > Guests were CentOS6.0. > > I have only RHEL6.0 and RHEL6.1 test results now. > I want to try similar tests with upstream kernels if I can get some time. > > With RHEL6.0 kernel, I heard that this issue was reproduced every time, 100%. > >> > On the host, we were running 3 linux guests to see if I/O from these guests >> > would be handled fairly by host; each guest did dd write with oflag=direct. >> > >> > Guest virtual disk: >> > We used a host local disk which had 3 partitions, and each guest was >> > allocated one of these as dd write target. >> > >> > So our test was for checking if cfq could keep fairness for the 3 guests >> > who shared the same disk. >> > >> > The result (strage starvation): >> > Sometimes, one guest dominated cfq for more than 10sec and requests from >> > other guests were not handled at all during that time. >> > >> > Below is the blktrace log which shows that a request to (8,27) in cfq2068S >> > (*1) >> > is not handled at all during cfq2095S and cfq2067S which hold requests to >> > (8,26) are being handled alternately. >> > >> > *1) WS 104920578 + 64 >> > >> > Question: >> > I guess that cfq_close_cooperator() was being called in an unusual >> > manner. >> > If so, do you think that cfq is responsible for keeping fairness for this >> > kind of unusual write requests? >> >> - If two guests are doing IO to separate partitions, they should really >> not be very close (until and unless partitions are really small). > > Sorry for my lack of explanation. > > The IO was issued from QEMU and the cooperative threads were both for the same > guest. In other words, QEMU was using two threads for one IO stream from the > guest. 
> > As my blktrace log snippet showed, cfq2095S and cfq2067S treated one > sequential > IO; cfq2095S did 64KB, then cfq2067S did next 64KB, and so on. > > These should be from the same guest because the target partition was same, > which was allocated to that guest. > > During the 10sec, this repetition continued without allowing others to > interrupt. > > I know it is unnatural but sometimes QEMU uses two aio threads for issuing one > IO stream. > >> >> - Even if there are close cooperators, these queues are merged and they >> are treated as single queue from slice point of view. So cooperating >> queues should be merged and get a single slice instead of starving >> other queues in the system. > > I understand that close cooperators' queues should be merged, but in our test > case, when the 64KB request was issued from one aio thread, the other thread's > queue was empty; because these queues are for the same stream, next request > could not come until current request got finished. > > But this is complicated because it depends on the qemu block layer aio. > > I am not sure if cfq would try to merge the queues in such cases.

Looking at posix-aio-compat.c, QEMU's threadpool for asynchronous I/O, this seems like a fairly generic issue. Other applications may suffer from this same I/O scheduler behavior. It would be nice to create a test case program which doesn't use QEMU at all.

QEMU has a queue of requests that need to be processed. There is a pool of threads that sleep until requests become available with pthread_cond_timedwait(3). When a request is added to the queue, pthread_cond_signal(3) is called in order to wake one sleeping thread. This bouncing pattern between two threads that you describe is probably a result of pthread_cond_timedwait(3) waking up each thread in alternating fashion. So we get this pattern:

    A   B   <-- threads
    1       <-- I/O requests
        2
    3
        4
    5
        6
    ...
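That thread-pool pattern can be demonstrated outside QEMU. The sketch below is hypothetical and is not posix-aio-compat.c: it uses plain pthread_cond_wait(3) rather than the timed variant, and the request "work" is just bookkeeping. One condvar signal is sent per queued request, so consecutive requests of one sequential stream may be serviced by different workers, which is what makes the I/O scheduler see two contexts for one stream:

```c
#include <pthread.h>
#include <stdio.h>

#define NREQ 6

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int pending;          /* requests queued, waiting for a worker */
static int done;             /* requests completed */
static int serviced[NREQ];   /* which worker handled request i */

static void *worker(void *arg)
{
    int id = (int)(long)arg;

    pthread_mutex_lock(&lock);
    for (;;) {
        while (pending == 0 && done < NREQ)
            pthread_cond_wait(&cond, &lock);  /* sleep until signalled */
        if (done == NREQ) {
            pthread_cond_broadcast(&cond);    /* let the peer exit too */
            break;
        }
        pending--;
        serviced[done++] = id;                /* "do the I/O" */
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Submit NREQ requests one at a time, waking one worker per request,
 * the way a guest issues a sequential stream.  Returns requests done. */
int run_pool(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, (void *)0L);
    pthread_create(&b, NULL, worker, (void *)1L);

    for (int i = 0; i < NREQ; i++) {
        pthread_mutex_lock(&lock);
        pending++;
        pthread_cond_signal(&cond);           /* wake one sleeping thread */
        pthread_mutex_unlock(&lock);
    }

    pthread_join(a, NULL);
    pthread_join(b, NULL);

    for (int i = 0; i < NREQ; i++)
        printf("request %d handled by worker %d\n", i, serviced[i]);
    return done;
}
```

Which worker picks up each request is unspecified by POSIX, so the alternation is probable rather than guaranteed; a real test case would issue direct I/O from the workers and watch blktrace, as discussed above.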
Stefan
Re: [PATCH 1/2] kvm tools: fix repeated io emulation
On Fri, 2011-09-09 at 10:26 +0800, Xiao Guangrong wrote: > On 08/18/2011 11:08 PM, Avi Kivity wrote: > > On 08/18/2011 12:35 AM, Sasha Levin wrote: > >> On Thu, 2011-08-18 at 09:13 +0300, Pekka Enberg wrote: > >> > Hi, > >> > > >> > On Thu, Aug 18, 2011 at 6:06 AM, Xiao Guangrong > >> >wrote: > >> > > When kvm emulates a repeated io read instruction, it can exit to > >> > user-space with > >> > > 'count' > 1, so we need to emulate the io access multiple times > >> > > > >> > > Signed-off-by: Xiao Guangrong > >> > > >> > The KVM tool is not actually maintained by Avi and Marcelo but by me > >> > and a few others. Our git repository is here: > >> > > >> > https://github.com/penberg/linux-kvm > >> > > >> > Ingo pulls that to -tip a few times a week or so. Sasha, can you please > >> > take a look at these patches and if you're OK with them, I'll apply > >> > them. > >> > >> Pekka, > >> > >> I can only assume they're right, 'count' isn't documented anywhere :) > >> > >> If any of the KVM maintainers could confirm it I'll add it into the docs. > >> > > > > Count is indeed the number of repetitions. > > > > Hi Pekka, > > Could you pick up this patchset please? > Xiao, It was merged a couple of weeks ago. -- Sasha.
Re: [PATCH 0/4] Avoid soft lockup message when KVM is stopped by host
On Thu, Sep 01, 2011 at 02:27:49PM -0600, emun...@mgebm.net wrote: > On Thu, 01 Sep 2011 14:24:12 -0500, Anthony Liguori wrote: > >On 08/30/2011 07:26 AM, Marcelo Tosatti wrote: > >>On Mon, Aug 29, 2011 at 05:27:11PM -0600, Eric B Munson wrote: > >>>Currently, when qemu stops a guest kernel that guest will > >>>issue a soft lockup > >>>message when it resumes. This set provides the ability for > >>>qemu to comminucate > >>>to the guest that it has been stopped. When the guest hits > >>>the watchdog on > >>>resume it will check if it was suspended before issuing the > >>>warning. > >>> > >>>Eric B Munson (4): > >>> Add flag to indicate that a vm was stopped by the host > >>> Add functions to check if the host has stopped the vm > >>> Add generic stubs for kvm stop check functions > >>> Add check for suspended vm in softlockup detector > >>> > >>> arch/x86/include/asm/pvclock-abi.h |1 + > >>> arch/x86/include/asm/pvclock.h |2 ++ > >>> arch/x86/kernel/kvmclock.c | 14 ++ > >>> include/asm-generic/pvclock.h | 14 ++ > >>> kernel/watchdog.c | 12 > >>> 5 files changed, 43 insertions(+), 0 deletions(-) > >>> create mode 100644 include/asm-generic/pvclock.h > >>> > >>>-- > >>>1.7.4.1 > >> > >>How is the host supposed to set this flag? > >> > >>As mentioned previously, if you save save/restore the offset > >>added to > >>kvmclock on stop/cont (and the TSC MSR, forgot to mention that), no > >>paravirt infrastructure is required. Which means the issue is > >>also fixed > >>for older guests. > > > >IIRC, the steal time patches have some logic that basically say: > > > >if there was steal time: > > kick soft lockup detector > > > >I wonder if that serves this purpose provided that time spent in stop > >is accounted as steal time. If it isn't, perhaps it should be? Accounting "in userspace (qemu)" as stolen time is problematic. Think irqchip in userspace (halt emulation), excessive time spent in device emulation that would trigger genuine watchdog warnings. 
The latest patches from Glauber dropped this logic. > > > >Regards, > > > >Anthony Liguori > > > > I could be missing it, but I don't see anywhere in the steal time > patches that kicks the watchdog. > > Accounting stopped time as stolen time opens a possible problem when > the accounting (CPU power modification) part is turned on. As a > process accumulates steal time its CPU power is increased. If we > account stopped time as stolen, it will do strange things with CPU > power for that process. I believe that it is for this reason that > patch 4 of Glauber's series explicitly states that halted time is > not stolen time. Stolen time is only accumulated when a vCPU > actually has work to do. > > Eric
Re: RFC [v2]: vfio / device assignment -- layout of device fd files
Meant to identify the changes in v2 of this proposal:

v2:
   -removed PCI_INFO record type
   -removed PCI_BAR_INFO record type
   -PCI_CONFIG_SPACE is now a sub-record/property of a REGION
   -removed physical address from region and made it a sub-record/property of a REGION
   -added PCI_BAR_INDEX sub-record type
   -updated magic numbers

Stuart

On Fri, Sep 9, 2011 at 8:11 AM, Stuart Yoder wrote: > [full proposal quoted -- see the original RFC below]
RFC [v2]: vfio / device assignment -- layout of device fd files
Based on the discussions over the last couple of weeks I have updated the device fd file layout proposal and tried to specify it a bit more formally.

===

1. Overview

This specification describes the layout of device files used in the context of vfio, which gives user space direct access to I/O devices that have been bound to vfio.

When a device fd is opened and read, offset 0x0 contains a fixed sized header followed by a number of variable length records that describe different characteristics of the device -- addressable regions, interrupts, etc.

   0x0 +-------------------+
       | magic             | u32  // identifies this as a vfio device file
       +-------------------+      //   and identifies the type of bus
       | version           | u32  // specifies the version of this
       +-------------------+
       | flags             | u32  // encodes any flags
       +-------------------+
       | dev info record 0 |
       |   type            | u32  // type of record
       |   rec_len         | u32  // length in bytes of record
       |                   |      //   (including record header)
       |   flags           | u32  // type specific flags
       |   ...content...   |      // record content, which could
       +-------------------+      //   include sub-records
       | dev info record 1 |
       +-------------------+
       | dev info record N |
       +-------------------+

The device info records following the file header may have the following record types, each with content encoded in a record specific way:

   Record type      | num | Description
   ------------------------------------------------------------------
   REGION           |  1  | describes an addressable address range
                    |     | for the device
   DTPATH           |  2  | describes the device tree path for the
                    |     | device
   DTINDEX          |  3  | describes the index into the related
                    |     | device tree property
                    |     | (reg,ranges,interrupts,interrupt-map)
   INTERRUPT        |  4  | describes an interrupt for the device
   PCI_CONFIG_SPACE |  5  | property identifying a region as PCI
                    |     | config space
   PCI_BAR_INDEX    |  6  | describes the BAR index for a PCI region
   PHYS_ADDR        |  7  | describes the physical address of the
                    |     | region

2. Header

The header is located at offset 0x0 in the device fd and has the following format:

   struct devfd_header {
        __u32 magic;
        __u32 version;
        __u32 flags;
   };

The 'magic' field contains a magic value that identifies the type of bus the device is on. Valid values are:

   0x70636900   // "pci" - PCI device
   0x6474       // "dt"  - device tree (system bus)

3. Region (REGION)

A REGION record describes an addressable address range for the device.

   struct devfd_region {
        __u32 type;        // must be 0x1
        __u32 record_len;
        __u32 flags;
        __u64 offset;      // seek offset to region from beginning
                           //   of file
        __u64 len;         // length of the region
   };

The 'flags' field supports one flag:

   IS_MMAPABLE

4. Device Tree Path (DTPATH)

A DTPATH record is a sub-record of a REGION and describes the path to a device tree node for the region.

   struct devfd_dtpath {
        __u32 type;        // must be 0x2
        __u32 record_len;
        char  path[];      // device tree path string
   };

5. Device Tree Index (DTINDEX)

A DTINDEX record is a sub-record of a REGION and specifies the index into the resource list encoded in the associated device tree property -- "reg", "ranges", "interrupts", or "interrupt-map".

   struct devfd_dtindex {
        __u32 type;        // must be 0x3
        __u32 record_len;
        __u32 prop_type;
        __u32 prop_index;  // index into the resource list
   };

   prop_type must have one of the following values:
        1   // "reg" property
        2   // "ranges" property
        3   // "interrupts" property
        4   // "interrupt-map" property

Note: prop_index is not the byte offset into the property, but the logical index.

6. Interrupts (INTERRUPT)

An INTERRUPT record describes one of a device's interrupts. The handle field is an argument to VFIO_DEVICE_GET_IRQ_FD which user space can use to receive device interrupts.

   struct devfd_interrupts {
        __u32 type;        // must be 0x4
        __u32 record_len;
        __u32 flags;
        __u32 handle;      // parameter to VFIO_DEVICE_GET_IRQ_FD
   };

7. PCI Config Space (PCI_CONFIG_SPACE)

A PCI_CONFIG_SPACE record is a sub-record of a REGION record and identifies the region as PCI configuration space.

   struct de
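Since rec_len includes the record header, a consumer can walk the record list without knowing every record type. A minimal sketch of such a walk over an in-memory copy of the file (struct layouts follow the proposal above; nothing here is an existing vfio API):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct devfd_header { uint32_t magic, version, flags; };
struct devfd_record { uint32_t type, record_len, flags; };

/* Count the device info records in 'buf' (a copy of the device fd
 * contents).  rec_len includes the record header, so it is also the
 * offset from one record to the next. */
static int count_records(const uint8_t *buf, size_t len)
{
    size_t off = sizeof(struct devfd_header);
    int n = 0;

    while (off + sizeof(struct devfd_record) <= len) {
        struct devfd_record r;
        memcpy(&r, buf + off, sizeof(r));   /* avoid alignment issues */
        if (r.record_len < sizeof(r))
            break;                          /* malformed record */
        n++;
        off += r.record_len;                /* skip content + sub-records */
    }
    return n;
}
```

A real consumer would dispatch on r.type inside the loop and descend into sub-records the same way.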
Re: [PATCH] KVM: emulate lapic tsc deadline timer for hvm
On Fri, Sep 09, 2011 at 01:12:51AM +0800, Liu, Jinsong wrote: > >>> --- a/arch/x86/include/asm/msr-index.h > >>> +++ b/arch/x86/include/asm/msr-index.h > >>> @@ -229,6 +229,8 @@ > >>> #define MSR_IA32_APICBASE_ENABLE (1<<11) > >>> #define MSR_IA32_APICBASE_BASE (0xf<<12) > >>> > >>> +#define MSR_IA32_TSCDEADLINE 0x06e0 > >>> + > >>> #define MSR_IA32_UCODE_WRITE 0x0079 > >>> #define MSR_IA32_UCODE_REV 0x008b > >> > >> Need to add to msrs_to_save so live migration works. > > > > MSR must be explicitly listed in qemu, also. > > > > Marcelo, it seems the MSR doesn't need to be explicitly listed in qemu? > Adding MSR_IA32_TSCDEADLINE to msrs_to_save on the KVM side is enough; qemu will get > it through KVM_GET_MSR_INDEX_LIST. > Do I miss something? Notice that in target-i386/kvm.c the KVM_GET_MSR_INDEX_LIST list is only used for MSR_STAR/MSR_HSAVE_PA presence detection. So you do need to explicitly add MSR_IA32_TSCDEADLINE to the kvm_get_msrs/kvm_put_msrs routines.
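The point is that KVM_GET_MSR_INDEX_LIST only feeds a presence-detection pass; it does not by itself make qemu save or restore anything. A simplified sketch of that detection pass (stand-in values, no real ioctl issued; the actual save/restore still needs the explicit kvm_get_msrs/kvm_put_msrs additions shown elsewhere in this thread):

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_TSCDEADLINE 0x6e0

static bool has_msr_tsc_deadline;

/* Scan the MSR indices the kernel advertises (as returned by
 * KVM_GET_MSR_INDEX_LIST) and latch whether the TSC deadline MSR
 * is supported, mirroring the has_msr_* pattern in target-i386/kvm.c. */
static void scan_msr_index_list(const uint32_t *indices, int n)
{
    for (int i = 0; i < n; i++) {
        if (indices[i] == MSR_IA32_TSCDEADLINE) {
            has_msr_tsc_deadline = true;
            return;
        }
    }
}
```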
Re: CFQ I/O starvation problem triggered by RHEL6.0 KVM guests
Vivek Goyal wrote: > So you are using both RHEL 6.0 in both host and guest kernel? Can you > reproduce the same issue with upstream kernels? How easily/frequently > you can reproduce this with RHEL6.0 host. Guests were CentOS6.0. I have only RHEL6.0 and RHEL6.1 test results now. I want to try similar tests with upstream kernels if I can get some time. With RHEL6.0 kernel, I heard that this issue was reproduced every time, 100%. > > On the host, we were running 3 linux guests to see if I/O from these guests > > would be handled fairly by host; each guest did dd write with oflag=direct. > > > > Guest virtual disk: > > We used a host local disk which had 3 partitions, and each guest was > > allocated one of these as dd write target. > > > > So our test was for checking if cfq could keep fairness for the 3 guests > > who shared the same disk. > > > > The result (strage starvation): > > Sometimes, one guest dominated cfq for more than 10sec and requests from > > other guests were not handled at all during that time. > > > > Below is the blktrace log which shows that a request to (8,27) in cfq2068S > > (*1) > > is not handled at all during cfq2095S and cfq2067S which hold requests to > > (8,26) are being handled alternately. > > > > *1) WS 104920578 + 64 > > > > Question: > > I guess that cfq_close_cooperator() was being called in an unusual manner. > > If so, do you think that cfq is responsible for keeping fairness for this > > kind of unusual write requests? > > - If two guests are doing IO to separate partitions, they should really > not be very close (until and unless partitions are really small). Sorry for my lack of explanation. The IO was issued from QEMU and the cooperative threads were both for the same guest. In other words, QEMU was using two threads for one IO stream from the guest. As my blktrace log snippet showed, cfq2095S and cfq2067S treated one sequential IO; cfq2095S did 64KB, then cfq2067S did next 64KB, and so on. 
These should be from the same guest because the target partition was the same, which was allocated to that guest. During the 10sec, this repetition continued without allowing others to interrupt. I know it is unnatural but sometimes QEMU uses two aio threads for issuing one IO stream. > > - Even if there are close cooperators, these queues are merged and they > are treated as single queue from slice point of view. So cooperating > queues should be merged and get a single slice instead of starving > other queues in the system. I understand that close cooperators' queues should be merged, but in our test case, when the 64KB request was issued from one aio thread, the other thread's queue was empty; because these queues are for the same stream, next request could not come until current request got finished. But this is complicated because it depends on the qemu block layer aio. I am not sure if cfq would try to merge the queues in such cases. > Can you upload the blktrace logs somewhere which shows what happened > during that 10 seconds. I have some restrictions here, so maybe, but I need to check later. > > Note: > > With RHEL6.1, this problem could not be triggered. But I guess that was due > > to > > QEMU's block layer updates. > You can try reproducing this with fio. Thank you, I want to do some tests by myself; the original report was not from my team. My feeling is that it may be possible to dominate IO if we create two threads and issue cooperative IO as our QEMU did; QEMU is just a process from the host view, and one QEMU process dominated IO, preventing other QEMUs' IO. Thanks, Takuya
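The two-thread pattern described above can be reproduced without QEMU. The following is a hypothetical standalone sketch, not QEMU code: two threads take strict turns issuing consecutive 64 KB writes of one sequential stream, the way the two aio threads did. For a faithful reproduction against cfq the file should be opened with O_DIRECT and the buffers allocated with posix_memalign(); plain buffered I/O is used here only so the sketch stays self-contained:

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK  (64 * 1024)
#define CHUNKS 8

static int fd;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t turn_cv = PTHREAD_COND_INITIALIZER;
static int next_chunk;   /* next offset index in the sequential stream */

static void *writer(void *arg)
{
    int parity = (int)(long)arg;   /* thread 0: even chunks, 1: odd */
    char *buf = malloc(CHUNK);
    memset(buf, 'x', CHUNK);

    pthread_mutex_lock(&lock);
    while (next_chunk < CHUNKS) {
        if (next_chunk % 2 != parity) {       /* not my turn */
            pthread_cond_wait(&turn_cv, &lock);
            continue;
        }
        off_t off = (off_t)next_chunk * CHUNK;
        pthread_mutex_unlock(&lock);
        pwrite(fd, buf, CHUNK, off);          /* one 64KB request */
        pthread_mutex_lock(&lock);
        next_chunk++;
        pthread_cond_signal(&turn_cv);        /* hand the stream over */
    }
    pthread_cond_signal(&turn_cv);            /* let the peer exit too */
    pthread_mutex_unlock(&lock);
    free(buf);
    return NULL;
}

/* Write one sequential stream with two alternating threads; returns
 * the resulting file size.  'path' is caller-chosen scratch space. */
long run_stream(const char *path)
{
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    pthread_t a, b;
    pthread_create(&a, NULL, writer, (void *)0L);
    pthread_create(&b, NULL, writer, (void *)1L);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    off_t end = lseek(fd, 0, SEEK_END);
    close(fd);
    return (long)end;
}
```

Run several instances of this against different partitions of one disk under cfq and compare per-instance throughput with blktrace, as in the original test.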
Re: About hotplug multifunction
On Fri, Sep 09, 2011 at 03:08:21AM -0400, Amos Kong wrote: > Hello all, > > I'm working on hotplug pci multifunction. > > 1. qemu cmdline: > ./x86_64-softmmu/qemu-system-x86_64 -snapshot -m 2000 > /home/kvm_autotest_root/images/rhel61-64-virtio.qcow2 -vnc :0 -monitor > unix:/tmp/a,server,nowait --enable-kvm -net none > > 2. script to add virtio-blk devices: > for i in `seq 1 7` 0;do > qemu-img create /tmp/resize$i.qcow2 1G -f qcow2 > echo drive_add 0x6.$i id=drv$i,if=none,file=/tmp/resize$i.qcow2 | nc -U /tmp/a > echo device_add > virtio-blk-pci,id=dev$i,drive=drv$i,addr=0x6.$i,multifunction=on | nc -U > /tmp/a > done > > 3. script to add virio-nic devices: > for i in `seq 1 7` 0;do > echo netdev_add tap,id=drv$i | nc -U /tmp/a > echo device_add > virtio-net-pci,id=dev$i,netdev=drv$i,addr=0x6.$i,multifunction=on | nc -U > /tmp/a > done > > 4. current qemu behaviors > 4.1. add func 1~7 one by one, then add func 0 > virtio-nic : success, all funcs are added > virtio-blk : success > > 4.2. add func 0~7 one by one > virtio-nic : failed, only func 0 is added > virtio-blk : success > > 4.3. removing any single func in monitor > virtio-nic: func 0 are not found in 'lspci', func 1~7 also exist. eth1~eth7 > also exist. > virtio-blk: func 0 are not found in 'lspci', func 1~7 also exist. the device. > /dev/vda disappears, > vdb,vdc,vde,vdf,vdg,vdh,vdi,vdj also exist. If I re-add 8 funcs > to guest, they all works. > # lspci (00:06.1 ~ 00:06.7 exist, 00:06.0 doesn't exit) > 00:06.1 SCSI storage controller: Red Hat, Inc Virtio block > device (rev ff) something I noted when readin our acpi code: we currently pass eject request for function 0 only: Name (_ADR, nr##) We either need a device per function there (acpi 1.0), send eject request for them all, or use as function number (newer acpi, not sure which version). Need to see which guests (windows,linux) can handle which form. > > qemu sends an acpi event to guest, then guest will remove all funcs in the > slot. 
> linux-2.6/drivers/pci/hotplug/acpiphp_glue.c: > static int disable_device(struct acpiphp_slot *slot) { > list_for_each_entry(func, &slot->funcs, sibling) { > ... > > Questions: > 1. why can funcs 1~7 still be found after hot-remove? is it the same as real > hardware? > 2. why could funcs 1~7 not be added to the guest (adding func 0~7 one by one)? > 3. how about this interface to hotplug/hot-unplug multifunction: >1) Add funcs 1-7 by monitor, add func 0, then send an acpi event to notify the > guest >2) Remove func 0, send an acpi event to the guest. (all funcs can be removed) We must make sure the guest acked removal of all functions. Surprise removal would be bad. > 4. what does "revision 0xff" stand for? You get that if no device responds to a configuration read (the read returns all ones). -- MST
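That explains the "(rev ff)" in the lspci output above: a configuration read of an absent function returns all ones, so every config byte, including the revision ID, reads as 0xff. A tiny sketch of the usual presence check (hypothetical helper, not kernel code; the check is conventionally done on the vendor/device ID dword):

```c
#include <stdbool.h>
#include <stdint.h>

/* An absent PCI function reads back as all ones; a real device never
 * has vendor ID 0xffff, so the combined dword 0xffffffff means "no
 * device responded to the configuration read". */
static bool pci_function_present(uint32_t vendor_device_dword)
{
    return vendor_device_dword != 0xffffffffu;
}
```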
About hotplug multifunction
Hello all, I'm working on hotplug pci multifunction.

1. qemu cmdline:
./x86_64-softmmu/qemu-system-x86_64 -snapshot -m 2000 /home/kvm_autotest_root/images/rhel61-64-virtio.qcow2 -vnc :0 -monitor unix:/tmp/a,server,nowait --enable-kvm -net none

2. script to add virtio-blk devices:
for i in `seq 1 7` 0;do
  qemu-img create /tmp/resize$i.qcow2 1G -f qcow2
  echo drive_add 0x6.$i id=drv$i,if=none,file=/tmp/resize$i.qcow2 | nc -U /tmp/a
  echo device_add virtio-blk-pci,id=dev$i,drive=drv$i,addr=0x6.$i,multifunction=on | nc -U /tmp/a
done

3. script to add virtio-nic devices:
for i in `seq 1 7` 0;do
  echo netdev_add tap,id=drv$i | nc -U /tmp/a
  echo device_add virtio-net-pci,id=dev$i,netdev=drv$i,addr=0x6.$i,multifunction=on | nc -U /tmp/a
done

4. current qemu behaviors
4.1. add func 1~7 one by one, then add func 0
  virtio-nic: success, all funcs are added
  virtio-blk: success
4.2. add func 0~7 one by one
  virtio-nic: failed, only func 0 is added
  virtio-blk: success
4.3. removing any single func in monitor
  virtio-nic: func 0 is not found in 'lspci', but funcs 1~7 still exist; eth1~eth7 also exist.
  virtio-blk: func 0 is not found in 'lspci', but funcs 1~7 still exist; the device /dev/vda disappears, while vdb,vdc,vde,vdf,vdg,vdh,vdi,vdj still exist. If I re-add 8 funcs to the guest, they all work.

  # lspci (00:06.1 ~ 00:06.7 exist, 00:06.0 doesn't exist)
  00:06.1 SCSI storage controller: Red Hat, Inc Virtio block device (rev ff)

qemu sends an acpi event to the guest, then the guest will remove all funcs in the slot.
linux-2.6/drivers/pci/hotplug/acpiphp_glue.c:
static int disable_device(struct acpiphp_slot *slot) {
        list_for_each_entry(func, &slot->funcs, sibling) {
        ...

Questions:
1. why can funcs 1~7 still be found after hot-remove? is it the same as real hardware?
2. why could funcs 1~7 not be added to the guest (adding func 0~7 one by one)?
3. how about this interface to hotplug/hot-unplug multifunction:
   1) Add funcs 1-7 by monitor, add func 0, then send an acpi event to notify the guest
   2) Remove func 0, send an acpi event to the guest. (all funcs can be removed)
4. what does "revision 0xff" stand for?

Thanks in advance, Amos