RE: [PATCH] KVM: emulate lapic tsc deadline timer for hvm

2011-09-09 Thread Liu, Jinsong
>> 
>> My question is, which kvm_get_msrs/kvm_put_msrs routine is used by
>> live migration, the routine in target-i386/kvm.c, or the one in
>> kvm/libkvm/libkvm-x86.c? They both have the ioctls
>> KVM_GET_MSR_INDEX_LIST/ KVM_GET_MSRS/ KVM_SET_MSRS, but I'm not
>> clear on their purpose/usage difference.
> 
> kvm_get_msrs/kvm_put_msrs in target-i386/kvm.c. kvm/ directory is
> dead. 


Thanks for clearing that up. I've added it to qemu like this:


diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index 935d08a..62ff73c 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -283,6 +283,7 @@
 #define MSR_IA32_APICBASE_BSP       (1<<8)
 #define MSR_IA32_APICBASE_ENABLE    (1<<11)
 #define MSR_IA32_APICBASE_BASE      (0xf<<12)
+#define MSR_IA32_TSCDEADLINE        0x6e0

 #define MSR_MTRRcap                 0xfe
 #define MSR_MTRRcap_VCNT            8
@@ -687,6 +688,7 @@ typedef struct CPUX86State {
     uint64_t async_pf_en_msr;

     uint64_t tsc;
+    uint64_t tsc_deadline;

     uint64_t mcg_status;

diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index aa843f0..206fcad 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -59,6 +59,7 @@ const KVMCapabilityInfo kvm_arch_required_capabilities[] = {

 static bool has_msr_star;
 static bool has_msr_hsave_pa;
+static bool has_msr_tsc_deadline;
 static bool has_msr_async_pf_en;
 static int lm_capable_kernel;

@@ -571,6 +572,10 @@ static int kvm_get_supported_msrs(KVMState *s)
                 has_msr_hsave_pa = true;
                 continue;
             }
+            if (kvm_msr_list->indices[i] == MSR_IA32_TSCDEADLINE) {
+                has_msr_tsc_deadline = true;
+                continue;
+            }
         }
     }

@@ -899,6 +904,9 @@ static int kvm_put_msrs(CPUState *env, int level)
     if (has_msr_hsave_pa) {
         kvm_msr_entry_set(&msrs[n++], MSR_VM_HSAVE_PA, env->vm_hsave);
     }
+    if (has_msr_tsc_deadline) {
+        kvm_msr_entry_set(&msrs[n++], MSR_IA32_TSCDEADLINE, env->tsc_deadline);
+    }
 #ifdef TARGET_X86_64
     if (lm_capable_kernel) {
         kvm_msr_entry_set(&msrs[n++], MSR_CSTAR, env->cstar);
@@ -1145,6 +1153,9 @@ static int kvm_get_msrs(CPUState *env)
     if (has_msr_hsave_pa) {
         msrs[n++].index = MSR_VM_HSAVE_PA;
     }
+    if (has_msr_tsc_deadline) {
+        msrs[n++].index = MSR_IA32_TSCDEADLINE;
+    }

     if (!env->tsc_valid) {
         msrs[n++].index = MSR_IA32_TSC;
@@ -1213,6 +1224,9 @@ static int kvm_get_msrs(CPUState *env)
         case MSR_IA32_TSC:
             env->tsc = msrs[i].data;
             break;
+        case MSR_IA32_TSCDEADLINE:
+            env->tsc_deadline = msrs[i].data;
+            break;
         case MSR_VM_HSAVE_PA:
             env->vm_hsave = msrs[i].data;
             break;

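For what it's worth, here is a minimal standalone probe (my own sketch, not
part of the patch) to verify that the kernel side reports
MSR_IA32_TSCDEADLINE through KVM_GET_MSR_INDEX_LIST:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define MSR_IA32_TSCDEADLINE 0x6e0

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR);
    if (kvm < 0) {
        perror("open /dev/kvm");
        return 1;
    }

    /* first call fails with E2BIG but fills in the required nmsrs */
    struct kvm_msr_list probe = { .nmsrs = 0 };
    ioctl(kvm, KVM_GET_MSR_INDEX_LIST, &probe);

    struct kvm_msr_list *list = calloc(1, sizeof(*list) +
                                       probe.nmsrs * sizeof(__u32));
    list->nmsrs = probe.nmsrs;
    if (ioctl(kvm, KVM_GET_MSR_INDEX_LIST, list) < 0) {
        perror("KVM_GET_MSR_INDEX_LIST");
        return 1;
    }

    for (unsigned i = 0; i < list->nmsrs; i++) {
        if (list->indices[i] == MSR_IA32_TSCDEADLINE) {
            printf("MSR_IA32_TSCDEADLINE is in msrs_to_save\n");
            return 0;
        }
    }
    printf("MSR_IA32_TSCDEADLINE is not reported by the kernel\n");
    return 1;
}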


Thanks,
Jinsong


Re: [PATCH] KVM: emulate lapic tsc deadline timer for hvm

2011-09-09 Thread Marcelo Tosatti
On Sat, Sep 10, 2011 at 02:11:36AM +0800, Liu, Jinsong wrote:
> Marcelo Tosatti wrote:
> > On Fri, Sep 09, 2011 at 01:12:51AM +0800, Liu, Jinsong wrote:
> > --- a/arch/x86/include/asm/msr-index.h
> > +++ b/arch/x86/include/asm/msr-index.h
> > @@ -229,6 +229,8 @@
> >  #define MSR_IA32_APICBASE_ENABLE   (1<<11)
> >  #define MSR_IA32_APICBASE_BASE (0xf<<12)
> > 
> > +#define MSR_IA32_TSCDEADLINE   0x06e0
> > +
> >  #define MSR_IA32_UCODE_WRITE   0x0079
> >  #define MSR_IA32_UCODE_REV 0x008b
>  
>  Need to add to msrs_to_save so live migration works.
> >>> 
> >>> MSR must be explicitly listed in qemu, also.
> >>> 
> >> 
> >> Marcelo, it seems the MSR doesn't need to be explicitly listed in qemu?
> >> On the KVM side, adding MSR_IA32_TSCDEADLINE to msrs_to_save is enough. Qemu
> >> will get it through KVM_GET_MSR_INDEX_LIST. Do I miss something?
> > 
> > Notice in target-i386/kvm.c the KVM_GET_MSR_INDEX_LIST list is only
> > used for MSR_STAR/MSR_HSAVE_PA presence detection.
> 
> Yes
> 
> > 
> > You do need to explicitly add MSR_IA32_TSCDEADLINE to the
> > kvm_get_msrs/kvm_put_msrs routines.
> 
> My question is, which kvm_get_msrs/kvm_put_msrs routine is used by live 
> migration, the routine in target-i386/kvm.c, or the one in kvm/libkvm/libkvm-x86.c? 
> They both have the ioctls KVM_GET_MSR_INDEX_LIST/ KVM_GET_MSRS/ KVM_SET_MSRS, but 
> I'm not clear on their purpose/usage difference.

kvm_get_msrs/kvm_put_msrs in target-i386/kvm.c. kvm/ directory is dead.




Re: About hotplug multifunction

2011-09-09 Thread Marcelo Tosatti
On Fri, Sep 09, 2011 at 10:05:01AM -0700, Alex Williamson wrote:
> On Fri, 2011-09-09 at 10:32 +0300, Michael S. Tsirkin wrote:
> > On Fri, Sep 09, 2011 at 03:08:21AM -0400, Amos Kong wrote:
> > > Hello all,
> > > 
> > > I'm working on hotplug pci multifunction. 
> > > 
> > > 1. qemu cmdline: 
> > > ./x86_64-softmmu/qemu-system-x86_64 -snapshot -m 2000 
> > > /home/kvm_autotest_root/images/rhel61-64-virtio.qcow2 -vnc :0 -monitor 
> > > unix:/tmp/a,server,nowait --enable-kvm -net none
> > > 
> > > 2. script to add virtio-blk devices:
> > > for i in `seq 1 7` 0;do
> > > qemu-img create /tmp/resize$i.qcow2 1G -f qcow2
> > > echo drive_add 0x6.$i id=drv$i,if=none,file=/tmp/resize$i.qcow2 | nc -U 
> > > /tmp/a
> > > echo device_add 
> > > virtio-blk-pci,id=dev$i,drive=drv$i,addr=0x6.$i,multifunction=on | nc -U 
> > > /tmp/a
> > > done
> 
> I don't think it should work this way; there shouldn't be a special
> intrinsic meaning that hotplugging func 0 makes the whole device appear.

Function 0 is mandatory. That's what the guest (at least the Linux
driver) searches for when a notification for the slot is received.

> Perhaps we need a notify= device option so we can add devices without
> notifying the guest, then maybe a different command to cause the pci
> slot notification.
>
> > > 3. script to add virio-nic devices:
> > > for i in `seq 1 7` 0;do
> > > echo netdev_add tap,id=drv$i | nc -U /tmp/a
> > > echo device_add 
> > > virtio-net-pci,id=dev$i,netdev=drv$i,addr=0x6.$i,multifunction=on | nc -U 
> > > /tmp/a
> > > done
> > > 
> > > 4. current qemu behaviors
> > > 4.1. add func 1~7 one by one, then add func 0
> > > virtio-nic : success, all funcs are added
> > > virtio-blk : success
> > > 
> > > 4.2. add func 0~7 one by one
> > > virtio-nic : failed, only func 0 is added
> > > virtio-blk : success
> > > 
> > > 4.3. removing any single func in monitor
> > > virtio-nic: func 0 is not found in 'lspci', funcs 1~7 still exist.
> > > eth1~eth7 also still exist.
> > > virtio-blk: func 0 is not found in 'lspci', funcs 1~7 still exist. The
> > > device /dev/vda disappears,
> > >   vdb,vdc,vde,vdf,vdg,vdh,vdi,vdj still exist. If I re-add 8
> > > funcs to the guest, they all work.
> > >   # lspci (00:06.1 ~ 00:06.7 exist, 00:06.0 doesn't exist)
> > >   00:06.1 SCSI storage controller: Red Hat, Inc Virtio block 
> > > device (rev ff)
> 
> We shouldn't be able to remove single funcs of a multifunction device,
> imho.
> 
> > something I noted when reading our acpi code:
> > we currently pass eject request for function 0 only:
> >Name (_ADR, nr##)
> > We either need a device per function there (acpi 1.0),
> > send eject request for them all, or use 0xFFFF
> > as the function number (newer acpi, not sure which version).
> > Need to see which guests (windows,linux) can handle which form.
> 
> I'd guess we need to change that to 0xFFFF.

No need; just make sure function 0 is there, and all other functions
should be removed automatically by the guest on eject notification.

> > > 
> > > qemu sends an acpi event to guest, then guest will remove all funcs in 
> > > the slot.
> > > linux-2.6/drivers/pci/hotplug/acpiphp_glue.c:
> > > static int disable_device(struct acpiphp_slot *slot) {
> > > list_for_each_entry(func, &slot->funcs, sibling) {
> > > ...
> > > 
> > > Questions:
> > > 1. why can funcs 1~7 still be found after hot-remove? is it the same as on real
> > > hardware?
> 
> I think we want to behave the same as adding and removing a complete
> physical device, which means all the functions get added and removed
> together.  Probably the only time individual functions disappear on real
> hardware is poking chipset registers to hide and expose sub devices.  It
> may not be necessary to make them atomically [in]accessible, but we
> should treat them as a set.

ACPI PCI hotplug is based on slots, not on functions. It does not
support addition/removal of individual functions.

> > > 2. why could funcs 1~7 not be added to the guest (adding func 0~7 one by
> > > one)?
> > > 3. how about this interface to hotplug/hot-unplug multifunction:
> > >1) Add funcs 1-7 by monitor, add func 0, then send an acpi event to
> > > notify the guest
> > >2) Remove func 0, send an acpi event to the guest. (all funcs can be
> > > removed)
> 
> I think I'd prefer an explicit interface.  Thanks,
> 
> Alex

Function 0 must be present for the guest to detect the device. I do
not see the problem of specifying (and documenting) that the insert
notification is sent for function 0 only.

An explicit interface is going to break the current scheme where a
single "device_add" command also does the notification.



Re: KSM memory saving amount

2011-09-09 Thread Marcelo Tosatti
On Fri, Sep 09, 2011 at 09:38:23AM +0300, Mihamina Rakotomandimby wrote:
> Hi all,
> 
> Running an Ubuntu Natty where I launched 3 Xubuntu guests and 1 Win 7 guest, I can see:
> /sys/kernel/mm/ksm/full_scans: 34
> /sys/kernel/mm/ksm/pages_shared: 8020
> /sys/kernel/mm/ksm/pages_sharing: 36098
> /sys/kernel/mm/ksm/pages_to_scan: 100
> /sys/kernel/mm/ksm/pages_unshared: 247109
> /sys/kernel/mm/ksm/pages_volatile: 32716
> /sys/kernel/mm/ksm/run: 1
> /sys/kernel/mm/ksm/sleep_millisecs: 200
> 
> I would like to evaluate the amount of memory saved:
> How many bytes is a page?

4096.
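For reference, the saving can be estimated as pages_sharing * page size
(Documentation/vm/ksm.txt describes pages_sharing as "how many more sites
are sharing them i.e. how much saved"). A minimal sketch of the arithmetic:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long sharing = 0;
    FILE *f = fopen("/sys/kernel/mm/ksm/pages_sharing", "r");
    if (!f || fscanf(f, "%ld", &sharing) != 1) {
        perror("pages_sharing");
        return 1;
    }
    fclose(f);

    long page = sysconf(_SC_PAGESIZE);   /* 4096 on x86 */
    printf("KSM saving: ~%ld MiB (%ld pages * %ld bytes)\n",
           sharing * page / (1024 * 1024), sharing, page);
    return 0;
}

With the numbers above, that gives roughly 36098 * 4096 bytes, about 141 MiB.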



Re: [PATCH 1/2] qemu-kvm: pc: Factor out apic_next_timer

2011-09-09 Thread Marcelo Tosatti
On Thu, Sep 08, 2011 at 12:51:31PM +0200, Jan Kiszka wrote:
> Factor out apic_next_timer from apic_timer_update. The former can then
> be used to update next_timer without actually starting the qemu timer.
> KVM's in-kernel APIC model will make use of it.
> 
> Signed-off-by: Jan Kiszka 

Applied both, thanks.



Re: [PATCH] qemu-kvm: Resolve PCI upstream diffs

2011-09-09 Thread Marcelo Tosatti
On Thu, Sep 08, 2011 at 12:48:02PM +0200, Jan Kiszka wrote:
> Resolve all unneeded deviations from upstream code. No functional
> changes.
> 
> Signed-off-by: Jan Kiszka 
> ---
>  hw/pci.c |   11 +++
>  hw/pci.h |5 -
>  2 files changed, 11 insertions(+), 5 deletions(-)

Applied, thanks.



RE: [PATCH] KVM: emulate lapic tsc deadline timer for hvm

2011-09-09 Thread Liu, Jinsong
Marcelo Tosatti wrote:
> On Fri, Sep 09, 2011 at 01:12:51AM +0800, Liu, Jinsong wrote:
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -229,6 +229,8 @@
>  #define MSR_IA32_APICBASE_ENABLE (1<<11)
>  #define MSR_IA32_APICBASE_BASE   (0xf<<12)
> 
> +#define MSR_IA32_TSCDEADLINE 0x06e0
> +
>  #define MSR_IA32_UCODE_WRITE 0x0079
>  #define MSR_IA32_UCODE_REV   0x008b
 
 Need to add to msrs_to_save so live migration works.
>>> 
>>> MSR must be explicitly listed in qemu, also.
>>> 
>> 
>> Marcelo, it seems the MSR doesn't need to be explicitly listed in qemu?
>> On the KVM side, adding MSR_IA32_TSCDEADLINE to msrs_to_save is enough. Qemu
>> will get it through KVM_GET_MSR_INDEX_LIST. Do I miss something?
> 
> Notice in target-i386/kvm.c the KVM_GET_MSR_INDEX_LIST list is only
> used for MSR_STAR/MSR_HSAVE_PA presence detection.

Yes, but in kvm/libkvm/libkvm-x86.c KVM_GET_MSR_INDEX_LIST gets the whole list.
That's what I want to make clear --> which one does live migration use?

> 
> You do need to explicitly add MSR_IA32_TSCDEADLINE to the
> kvm_get_msrs/kvm_put_msrs routines.



Re: KSM memory saving amount

2011-09-09 Thread Dan VerWeire
On Fri, Sep 9, 2011 at 12:52 PM, Thomas Treutner  wrote:
> On 09.09.2011 08:38, Mihamina Rakotomandimby wrote:
>>
>> How many bytes is a page?
>
> Typically, unless you use hugepages, a page is 4 KiB = 4096 bytes.

I wrote a bash script which calculates the KSM savings. I have posted it here:

 https://gist.github.com/1206923

I cannot guarantee its accuracy. It was the best I could do with the
knowledge I had. The results seem reasonable to me though.

Hope this is useful to someone.

Dan VerWeire


RE: [PATCH] KVM: emulate lapic tsc deadline timer for hvm

2011-09-09 Thread Liu, Jinsong
Marcelo Tosatti wrote:
> On Fri, Sep 09, 2011 at 01:12:51AM +0800, Liu, Jinsong wrote:
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -229,6 +229,8 @@
>  #define MSR_IA32_APICBASE_ENABLE (1<<11)
>  #define MSR_IA32_APICBASE_BASE   (0xf<<12)
> 
> +#define MSR_IA32_TSCDEADLINE 0x06e0
> +
>  #define MSR_IA32_UCODE_WRITE 0x0079
>  #define MSR_IA32_UCODE_REV   0x008b
 
 Need to add to msrs_to_save so live migration works.
>>> 
>>> MSR must be explicitly listed in qemu, also.
>>> 
>> 
>> Marcelo, it seems the MSR doesn't need to be explicitly listed in qemu?
>> On the KVM side, adding MSR_IA32_TSCDEADLINE to msrs_to_save is enough. Qemu
>> will get it through KVM_GET_MSR_INDEX_LIST. Do I miss something?
> 
> Notice in target-i386/kvm.c the KVM_GET_MSR_INDEX_LIST list is only
> used for MSR_STAR/MSR_HSAVE_PA presence detection.

Yes

> 
> You do need to explicitly add MSR_IA32_TSCDEADLINE to the
> kvm_get_msrs/kvm_put_msrs routines.

My question is, which kvm_get_msrs/kvm_put_msrs routine is used by live 
migration, the routine in target-i386/kvm.c, or the one in kvm/libkvm/libkvm-x86.c? 
They both have the ioctls KVM_GET_MSR_INDEX_LIST/ KVM_GET_MSRS/ KVM_SET_MSRS, but 
I'm not clear on their purpose/usage difference.

Thanks,
Jinsong


Re: About hotplug multifunction

2011-09-09 Thread Isaku Yamahata
pci/pcie hotplug needs cleanup for multifunction hotplug in the long term.
Only the single-function device case works; the multifunction case is somewhat broken.
In particular, the current acpi-based hotplug should be replaced by
the standardized hotplug controller in the long term.

On Fri, Sep 09, 2011 at 03:08:21AM -0400, Amos Kong wrote:
> Hello all,
> 
> I'm working on hotplug pci multifunction. 
> 
> 1. qemu cmdline: 
> ./x86_64-softmmu/qemu-system-x86_64 -snapshot -m 2000 
> /home/kvm_autotest_root/images/rhel61-64-virtio.qcow2 -vnc :0 -monitor 
> unix:/tmp/a,server,nowait --enable-kvm -net none
> 
> 2. script to add virtio-blk devices:
> for i in `seq 1 7` 0;do
> qemu-img create /tmp/resize$i.qcow2 1G -f qcow2
> echo drive_add 0x6.$i id=drv$i,if=none,file=/tmp/resize$i.qcow2 | nc -U /tmp/a
> echo device_add 
> virtio-blk-pci,id=dev$i,drive=drv$i,addr=0x6.$i,multifunction=on | nc -U 
> /tmp/a
> done
> 
> 3. script to add virio-nic devices:
> for i in `seq 1 7` 0;do
> echo netdev_add tap,id=drv$i | nc -U /tmp/a
> echo device_add 
> virtio-net-pci,id=dev$i,netdev=drv$i,addr=0x6.$i,multifunction=on | nc -U 
> /tmp/a
> done
> 
> 4. current qemu behaviors
> 4.1. add func 1~7 one by one, then add func 0
> virtio-nic : success, all funcs are added
> virtio-blk : success
> 
> 4.2. add func 0~7 one by one
> virtio-nic : failed, only func 0 is added
> virtio-blk : success
> 
> 4.3. removing any single func in monitor
> virtio-nic: func 0 is not found in 'lspci', funcs 1~7 still exist. eth1~eth7
> also still exist.
> virtio-blk: func 0 is not found in 'lspci', funcs 1~7 still exist. The device
> /dev/vda disappears,
>   vdb,vdc,vde,vdf,vdg,vdh,vdi,vdj still exist. If I re-add 8 funcs
> to the guest, they all work.
>   # lspci (00:06.1 ~ 00:06.7 exist, 00:06.0 doesn't exist)
>   00:06.1 SCSI storage controller: Red Hat, Inc Virtio block 
> device (rev ff)
> 
> qemu sends an acpi event to guest, then guest will remove all funcs in the 
> slot.
> linux-2.6/drivers/pci/hotplug/acpiphp_glue.c:
> static int disable_device(struct acpiphp_slot *slot) {
> list_for_each_entry(func, &slot->funcs, sibling) {
> ...
> 
> Questions:
> 1. why can funcs 1~7 still be found after hot-remove? is it the same as on real
> hardware?
> 2. why could funcs 1~7 not be added to the guest (adding func 0~7 one by one)?
> 3. how about this interface to hotplug/hot-unplug multifunction:
>1) Add funcs 1-7 by monitor, add func 0, then send an acpi event to notify the
> guest
>2) Remove func 0, send an acpi event to the guest. (all funcs can be removed)
> 4. what does "revision 0xff" stand for?
> 
> Thanks in advance,
> Amos
> 

-- 
yamahata


Re: KSM memory saving amount

2011-09-09 Thread Thomas Treutner

On 09.09.2011 08:38, Mihamina Rakotomandimby wrote:

How many bytes is a page?


Typically, unless you use hugepages, a page is 4 KiB = 4096 bytes.


Re: About hotplug multifunction

2011-09-09 Thread Alex Williamson
On Fri, 2011-09-09 at 10:32 +0300, Michael S. Tsirkin wrote:
> On Fri, Sep 09, 2011 at 03:08:21AM -0400, Amos Kong wrote:
> > Hello all,
> > 
> > I'm working on hotplug pci multifunction. 
> > 
> > 1. qemu cmdline: 
> > ./x86_64-softmmu/qemu-system-x86_64 -snapshot -m 2000 
> > /home/kvm_autotest_root/images/rhel61-64-virtio.qcow2 -vnc :0 -monitor 
> > unix:/tmp/a,server,nowait --enable-kvm -net none
> > 
> > 2. script to add virtio-blk devices:
> > for i in `seq 1 7` 0;do
> > qemu-img create /tmp/resize$i.qcow2 1G -f qcow2
> > echo drive_add 0x6.$i id=drv$i,if=none,file=/tmp/resize$i.qcow2 | nc -U 
> > /tmp/a
> > echo device_add 
> > virtio-blk-pci,id=dev$i,drive=drv$i,addr=0x6.$i,multifunction=on | nc -U 
> > /tmp/a
> > done

I don't think it should work this way; there shouldn't be a special
intrinsic meaning that hotplugging func 0 makes the whole device appear.
Perhaps we need a notify= device option so we can add devices without
notifying the guest, then maybe a different command to cause the pci
slot notification.
 
> > 3. script to add virio-nic devices:
> > for i in `seq 1 7` 0;do
> > echo netdev_add tap,id=drv$i | nc -U /tmp/a
> > echo device_add 
> > virtio-net-pci,id=dev$i,netdev=drv$i,addr=0x6.$i,multifunction=on | nc -U 
> > /tmp/a
> > done
> > 
> > 4. current qemu behaviors
> > 4.1. add func 1~7 one by one, then add func 0
> > virtio-nic : success, all funcs are added
> > virtio-blk : success
> > 
> > 4.2. add func 0~7 one by one
> > virtio-nic : failed, only func 0 is added
> > virtio-blk : success
> > 
> > 4.3. removing any single func in monitor
> > virtio-nic: func 0 is not found in 'lspci', funcs 1~7 still exist. eth1~eth7
> > also still exist.
> > virtio-blk: func 0 is not found in 'lspci', funcs 1~7 still exist. The
> > device /dev/vda disappears,
> >   vdb,vdc,vde,vdf,vdg,vdh,vdi,vdj still exist. If I re-add 8
> > funcs to the guest, they all work.
> >   # lspci (00:06.1 ~ 00:06.7 exist, 00:06.0 doesn't exist)
> >   00:06.1 SCSI storage controller: Red Hat, Inc Virtio block 
> > device (rev ff)

We shouldn't be able to remove single funcs of a multifunction device,
imho.

> something I noted when reading our acpi code:
> we currently pass eject request for function 0 only:
>Name (_ADR, nr##)
> We either need a device per function there (acpi 1.0),
> send eject request for them all, or use 0xFFFF
> as the function number (newer acpi, not sure which version).
> Need to see which guests (windows,linux) can handle which form.

I'd guess we need to change that to 0xFFFF.

> > 
> > qemu sends an acpi event to guest, then guest will remove all funcs in the 
> > slot.
> > linux-2.6/drivers/pci/hotplug/acpiphp_glue.c:
> > static int disable_device(struct acpiphp_slot *slot) {
> > list_for_each_entry(func, &slot->funcs, sibling) {
> > ...
> > 
> > Questions:
> > 1. why can funcs 1~7 still be found after hot-remove? is it the same as on real
> > hardware?

I think we want to behave the same as adding and removing a complete
physical device, which means all the functions get added and removed
together.  Probably the only time individual functions disappear on real
hardware is poking chipset registers to hide and expose sub devices.  It
may not be necessary to make them atomically [in]accessible, but we
should treat them as a set.

> > 2. why could funcs 1~7 not be added to the guest (adding func 0~7 one by one)?
> > 3. how about this interface to hotplug/hot-unplug multifunction:
> >1) Add funcs 1-7 by monitor, add func 0, then send an acpi event to
> > notify the guest
> >2) Remove func 0, send an acpi event to the guest. (all funcs can be removed)

I think I'd prefer an explicit interface.  Thanks,

Alex

> We must make sure guest acked removal of all functions. Surprise
> removal would be bad.
> 
> > 4. what does "revision 0xff" stand for?
> 
> You get 0xff if no device responds to a configuration read.
> 





Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-09 Thread Roopa Prabhu



On 9/8/11 10:55 PM, "Michael S. Tsirkin"  wrote:

> On Thu, Sep 08, 2011 at 07:53:11PM -0700, Roopa Prabhu wrote:
 Phase 1: Goal: Enable hardware filtering for all macvlan modes
 - In macvlan passthru mode the single guest virtio-nic connected will
   receive the traffic it requested
 - In macvlan non-passthru mode all guest virtio-nics sharing the
   physical nic will see all other guest traffic
   but the filtering at guest virtio-nic
>>> 
>>> I don't think guests currently filter anything.
>>> 
>> I was referring to Qemu-kvm virtio-net in
>> virtio_net_receive->receive_filter. I think it only passes pkts that the
>> guest OS is interested in. It uses the filter table that I am passing to
>> macvtap in this patch.
> 
> This happens after userspace thread gets woken up and data
> is copied there. So relying on filtering at that level is
> going to be very inefficient on a system with
> multiple active guests. Further, and for that reason, vhost-net
> doesn't do filtering at all, relying on the backends
> to pass it correct packets.

Ok thanks for the info. So in that case, phase 1 is best for PASSTHRU mode
and for non-PASSTHRU when there is a single guest connected to a VF.
For non-PASSTHRU with multiple guests sharing the same VF, Phase 1 is definitely
better than putting the VF in promiscuous mode.
But to address the concern you mention above, in phase 2 when we have more
than one guest sharing the VF, we will have to add filter lookup in macvlan
to filter pkts for each guest. This will need some performance tests too.

Will start investigating the netlink interface comments for phase 1 first.

Thanks!
-Roopa



1st level guest crashes on start of 2nd level guest in nested virtualisation

2011-09-09 Thread Steffen Evers (EXT)

Hello Nadav Har'El,

We are using qemu-kvm with nested virtualisation (two guest levels).
The 1st-level guest sporadically crashes on start-up of the 2nd-level
guest (in BIOS). We have now found out that it always crashes when the
parameter "no-kvm-irqchip" is set.

Do you have any idea what the problem is? Any help?

Regards, Steffen


Host System:
- Ubuntu 10.10
- kernel 2.6.35-30-generic #56-Ubuntu SMP Mon Jul 11 20:01:08 UTC 2011 x86_64 
GNU/Linux
- QEMU emulator version 0.15.0 (qemu-kvm-0.15.0)
- kvm_kmod from git repository: commit c040fec91c95609c8a6b54ddd5ce952605a11850
  (Sun Aug 21 21:23:11 2011 -0700)
- kvm_kmod submodule linux: commit 902c502f0b0efec3a784a8ef65057298025e5e11
  (Fri Aug 26 08:04:19 2011 -0300)

Guest system (1st level):
- fresh installed Ubuntu 11.04
- QEMU emulator version 0.14.0 (qemu-kvm-0.14.0)
- kernel 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011 x86_64 
GNU/Linux
- kvm modules from kernel above


Procedure to reproduce crash:

1. Activate nested virtualisation

2. Boot Qemu guest (1st level) system with given command

   qemu-system-x86_64 \
   -no-kvm-irqchip \
   -enable-kvm \
   -m 1G \
   -smp 1 \
   -drive file=vda.qcow2,if=none,id=drive-virtio-disk0,format=qcow2 \
   -device 
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 \
   -cpu host

   3. Start the 2nd-level Qemu guest:
   qemu-system-x86_64

Result:

Qemu 1st level guest crashes with:

KVM: entry failed, hardware error 0x8021

If you're running a guest on an Intel machine without unrestricted mode
support, the failure can be most likely due to the guest entering an invalid
state for Intel VT. For example, the guest maybe running in big real mode
which is not supported on less recent Intel processors.

RAX=008f RBX= RCX=6ea6 
RDX=0100
RSI= RDI=0002000f RBP=8b00 
RSP=88003c189d00
R8 = R9 = R10= 
R11=
R12= R13= R14= 
R15=
RIP=a0112621 RFL=00023002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0018   00c0f300 DPL=3 DS   [-WA]
CS =0010   00a09b00 DPL=0 CS64 [-RA]
SS =0018   00c09300 DPL=0 DS   [-WA]
DS =0018   00c0f300 DPL=3 DS   [-WA]
FS = 7f9c9d390700  
GS = 88003fc0 000f 
LDT=  000f 
TR =0040 88003fc11880 2087 8b00 DPL=0 TSS64-busy
GDT= 88003fc04000 007f
IDT= 81bae000 0fff
CR0=8005003b CR2= CR3=3bdeb000 CR4=26f0
DR0= DR1= DR2= 
DR3=
DR6=0ff0 DR7=0400
EFER=0d01
Code=01 00 00 48 8b 89 50 01 00 00 75 05 0f 01 c2 eb 03 0f 01 c3 <48> 87 0c 24 48 89 81 48 01 00 00 48 89 99 60 01 00 00 
ff 34 24 8f 81 50 01 00 00 48 89 91


Tracing (trace-cmd record -b 2 -e kvm) gives the following patterns:
   qemu-system-x86-15220 [001] 31572.340816: kvm_entry:vcpu 0
   qemu-system-x86-15220 [001] 31572.340817: kvm_exit: reason 
VMREAD rip 0xa010d06e
   qemu-system-x86-15220 [001] 31572.340818: kvm_entry:vcpu 0
   qemu-system-x86-15220 [001] 31572.340821: kvm_exit: reason 
VMWRITE rip 0xa010da4f
   qemu-system-x86-15220 [001] 31572.340821: kvm_entry:vcpu 0
   qemu-system-x86-15220 [001] 31572.340822: kvm_exit: reason 
VMWRITE rip 0xa010da4f
   qemu-system-x86-15220 [001] 31572.340822: kvm_entry:vcpu 0
   qemu-system-x86-15220 [001] 31572.340822: kvm_exit: reason 
VMWRITE rip 0xa010da4f
   qemu-system-x86-15220 [001] 31572.340823: kvm_entry:vcpu 0
   qemu-system-x86-15220 [001] 31572.340823: kvm_exit: reason 
VMWRITE rip 0xa010da4f
   qemu-system-x86-15220 [001] 31572.340823: kvm_entry:vcpu 0
   qemu-system-x86-15220 [001] 31572.340824: kvm_exit: reason 
VMWRITE rip 0xa010da4f
   qemu-system-x86-15220 [001] 31572.340824: kvm_entry:vcpu 0
   qemu-system-x86-15220 [001] 31572.340825: kvm_exit: reason 
VMRESUME rip 0xa011261e
   qemu-system-x86-15220 [001] 31572.340828: kvm_entry:vcpu 0
   qemu-system-x86-15220 [001] 31572.340829: kvm_exit: reason 
EXTERNAL_INTERRUPT rip 0xfebb
   qemu-system-x86-15220 [001] 31572.340831: kvm_userspace_exit:   reason 
KVM_EXIT_INTR (10)
   qemu-system-x86-15220 [001] 31572.340837: kvm_entry:vcpu 0
   qemu-system-x86-15220 [001] 31572.340837: kvm_exit: reason 
UNKOWN rip 0xa0112621
   qemu-system-x86-15220 [001] 31572.340838: kvm_userspace_exit:   reason 
KVM_EXIT_FAIL_ENTRY (9)

Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-09 Thread Roopa Prabhu



On 9/8/11 9:25 PM, "Sridhar Samudrala"  wrote:

> On 9/8/2011 8:00 PM, Roopa Prabhu wrote:
>> 
>> 
>> On 9/8/11 12:33 PM, "Michael S. Tsirkin"  wrote:
>> 
>>> On Thu, Sep 08, 2011 at 12:23:56PM -0700, Roopa Prabhu wrote:
> I think the main usecase for passthru mode is to assign a SR-IOV VF to
> a single guest.
> 
 Yes and for the passthru usecase this patch should be enough to enable
 filtering in hw (eventually like I indicated before I need to fix vlan
 filtering too).
>>> So with filtering in hw, and in sriov VF case, VFs
>>> actually share a filtering table. How will that
>>> be partitioned?
>> AFAIK, though it might maintain a single filter table space in hw, hw does
>> know which filter belongs to which VF. And the OS driver does not need to do
>> anything special. The VF driver exposes a VF netdev. And any uc/mc addresses
>> registered with a VF netdev are registered with the hw by the driver. And hw
>> will filter and send only pkts that the VF has expressed interest in.
> Does your NIC & driver support adding multiple mac addresses to a VF?
> I have tried a few other SR-IOV NICs sometime back and they didn't
> support this feature.

Yes our nic does. I thought Intel's also does (see ixgbevf_set_rx_mode).
Though I have not really tried using it on an Intel card. I think most cards
should at the least support multicast filters.

If the lower dev does not support unicast filtering, dev_uc_add(lowerdev,..)
puts the lower dev in promiscuous mode. Though... I think I can check this
beforehand in macvlan_open and put the lowerdev in promiscuous mode if it
does not support filtering.

> 
> Currently, we don't have an interface to add multiple mac addresses to a
> netdev other than an
> indirect way of creating a macvlan i/f on top of it.

Yes I think so. I have been using only macvlan to test.

Thanks,
Roopa



Re: [PATCH 0/6] Some emulator cleanups

2011-09-09 Thread Marcelo Tosatti
On Wed, Sep 07, 2011 at 04:41:34PM +0300, Avi Kivity wrote:
> Some mindless emulator cleanups while waiting for autotest.
> 
> Avi Kivity (6):
>   KVM: x86 emulator: simplify emulate_2op_SrcV()
>   KVM: x86 emulator: simplify emulate_2op_cl()
>   KVM: x86 emulator: simplify emulate_2op_cl()
>   KVM: x86 emulator: simplify emulate_1op()
>   KVM: x86 emulator: merge the two emulate_1op_rax_rdx implementations
>   KVM: x86 emulator: simplify emulate_1op_rax_rdx()
> 
>  arch/x86/kvm/emulate.c |  225 
> +++-
>  1 files changed, 89 insertions(+), 136 deletions(-)

Applied, thanks.



Re: [PATCH v8 3/4] block: add block timer and throttling algorithm

2011-09-09 Thread Marcelo Tosatti
On Thu, Sep 08, 2011 at 06:11:07PM +0800, Zhi Yong Wu wrote:
> Note:
>  1.) When bps/iops limits are set to a small value such as 511 
> bytes/s, this VM will hang up. We are considering how to handle this scenario.

You can increase the length of the slice, if the request is larger than
slice_time * bps_limit.
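In code form, that amounts to something like this (a rough sketch with
hypothetical names, not the code from the v8 patch):

#include <stdint.h>

/* If one request alone exceeds slice_time * bps_limit, it can never
 * complete inside a single slice, so stretch the slice to fit it. */
static int64_t adjust_slice_ns(int64_t request_bytes, double bps_limit,
                               int64_t slice_time_ns)
{
    double bytes_per_slice = bps_limit * slice_time_ns / 1e9;

    if (request_bytes > bytes_per_slice) {
        /* e.g. a 1 MiB request against a 511 bytes/s limit would
         * otherwise stall the guest forever */
        slice_time_ns = (int64_t)(request_bytes * 1e9 / bps_limit);
    }
    return slice_time_ns;
}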

>  2.) When the "dd" command is issued in the guest, if its option bs is set to a 
> large value such as "bs=1024K", the resulting speed will be slightly bigger than 
> the limit.

Why?

There are lots of debugging leftovers in the patch.



Re: CFQ I/O starvation problem triggered by RHEL6.0 KVM guests

2011-09-09 Thread Vivek Goyal
On Fri, Sep 09, 2011 at 06:00:28PM +0900, Takuya Yoshikawa wrote:

[..]
> 
> > 
> > - Even if there are close cooperators, these queues are merged and they
> >   are treated as single queue from slice point of view. So cooperating
> >   queues should be merged and get a single slice instead of starving
> >   other queues in the system.
> 
> I understand that close cooperators' queues should be merged, but in our test
> case, when the 64KB request was issued from one aio thread, the other thread's
> queue was empty; because these queues are for the same stream, next request
> could not come until current request got finished.
> 
>   But this is complicated because it depends on the qemu block layer aio.
> 
> I am not sure if cfq would try to merge the queues in such cases.

[CCing Jeff Moyer ]

I think even if these queues are alternating, they should have been merged
(if we considered them close cooperators).

So in cfq_select_queue() we have:

new_cfqq = cfq_close_cooperator(cfqd, cfqq);
if (new_cfqq) {
if (!cfqq->new_cfqq)
cfq_setup_merge(cfqq, new_cfqq);
goto expire;
}

So if we selected a new queue because it is a close cooperator, we should
have called cfq_setup_merge(), and the next time IO happens, one of the
queues should merge into the other.

cfq_set_request() {
if (cfqq->new_cfqq)
cfqq = cfq_merge_cfqqs(cfqd, cic, cfqq);
}

If merging is not happening and we still somehow continue to pick the
close cooperator as the new queue, starving other queues in the system,
then there is a bug.

I would try to reproduce this with fio on upstream kernels, put in
some more tracepoints, and see what's happening.

Thanks
Vivek


Re: [Qemu-devel] CFQ I/O starvation problem triggered by RHEL6.0 KVM guests

2011-09-09 Thread Stefan Hajnoczi
On Fri, Sep 9, 2011 at 10:00 AM, Takuya Yoshikawa
 wrote:
> Vivek Goyal  wrote:
>
>> So you are using both RHEL 6.0 in both host and guest kernel? Can you
>> reproduce the same issue with upstream kernels? How easily/frequently
>> you can reproduce this with RHEL6.0 host.
>
> Guests were CentOS6.0.
>
> I have only RHEL6.0 and RHEL6.1 test results now.
> I want to try similar tests with upstream kernels if I can get some time.
>
> With RHEL6.0 kernel, I heard that this issue was reproduced every time, 100%.
>
>> > On the host, we were running 3 linux guests to see if I/O from these guests
>> > would be handled fairly by host; each guest did dd write with oflag=direct.
>> >
>> > Guest virtual disk:
>> >   We used a host local disk which had 3 partitions, and each guest was
>> >   allocated one of these as dd write target.
>> >
>> > So our test was for checking if cfq could keep fairness for the 3 guests
>> > who shared the same disk.
>> >
>> > The result (strange starvation):
>> >   Sometimes, one guest dominated cfq for more than 10sec and requests from
>> >   other guests were not handled at all during that time.
>> >
>> > Below is the blktrace log which shows that a request to (8,27) in cfq2068S 
>> > (*1)
>> > is not handled at all during cfq2095S and cfq2067S which hold requests to
>> > (8,26) are being handled alternately.
>> >
>> > *1) WS 104920578 + 64
>> >
>> > Question:
>> >   I guess that cfq_close_cooperator() was being called in an unusual 
>> > manner.
>> >   If so, do you think that cfq is responsible for keeping fairness for this
>> >   kind of unusual write requests?
>>
>> - If two guests are doing IO to separate partitions, they should really
>>   not be very close (until and unless partitions are really small).
>
> Sorry for my lack of explanation.
>
> The IO was issued from QEMU and the cooperative threads were both for the same
> guest. In other words, QEMU was using two threads for one IO stream from the 
> guest.
>
> As my blktrace log snippet showed, cfq2095S and cfq2067S treated one 
> sequential
> IO; cfq2095S did 64KB, then cfq2067S did next 64KB, and so on.
>
>  These should be from the same guest because the target partition was same,
>  which was allocated to that guest.
>
> During the 10sec, this repetition continued without allowing others to 
> interrupt.
>
> I know it is unnatural but sometimes QEMU uses two aio threads for issuing one
> IO stream.
>
>>
>> - Even if there are close cooperators, these queues are merged and they
>>   are treated as single queue from slice point of view. So cooperating
>>   queues should be merged and get a single slice instead of starving
>>   other queues in the system.
>
> I understand that close cooperators' queues should be merged, but in our test
> case, when the 64KB request was issued from one aio thread, the other thread's
> queue was empty; because these queues are for the same stream, next request
> could not come until current request got finished.
>
>  But this is complicated because it depends on the qemu block layer aio.
>
> I am not sure if cfq would try to merge the queues in such cases.

Looking at posix-aio-compat.c, QEMU's threadpool for asynchronous I/O,
this seems like a fairly generic issue.  Other applications may suffer
from this same I/O scheduler behavior.  It would be nice to create a
test case program which doesn't use QEMU at all.

QEMU has a queue of requests that need to be processed.  There is a
pool of threads that sleep until requests become available with
pthread_cond_timedwait(3).  When a request is added to the queue,
pthread_cond_signal(3) is called in order to wake one sleeping thread.

This bouncing pattern between two threads that you describe is
probably a result of pthread_cond_timedwait(3) waking up each thread
in alternating fashion.  So we get this pattern:

A  B  <-- threads
1 <-- I/O requests
   2
3
   4
5
   6
...
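
A throwaway sketch of such a test case (my assumptions: two workers, one
queued request at a time) that should reproduce the alternating wakeups
without QEMU:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int pending;                     /* queued "requests" */

static void *worker(void *arg)
{
    for (;;) {
        pthread_mutex_lock(&lock);
        while (pending == 0)
            pthread_cond_wait(&cond, &lock);
        pending--;
        pthread_mutex_unlock(&lock);

        /* a real test would issue a 64KB O_DIRECT write here */
        printf("request handled by thread %ld\n", (long)arg);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, (void *)0L);
    pthread_create(&b, NULL, worker, (void *)1L);

    for (int i = 0; i < 20; i++) {      /* submit requests one by one */
        pthread_mutex_lock(&lock);
        pending++;
        pthread_cond_signal(&cond);     /* wakes *one* sleeping worker */
        pthread_mutex_unlock(&lock);
        usleep(2000);                   /* wait for completion, like the
                                           guest's sequential dd */
    }
    return 0;
}

If the signalled thread alternates as suspected, the output shows the same
A/B pattern as in the blktrace log.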

Stefan


Re: [PATCH 1/2] kvm tools: fix repeated io emulation

2011-09-09 Thread Sasha Levin
On Fri, 2011-09-09 at 10:26 +0800, Xiao Guangrong wrote:
> On 08/18/2011 11:08 PM, Avi Kivity wrote:
> > On 08/18/2011 12:35 AM, Sasha Levin wrote:
> >> On Thu, 2011-08-18 at 09:13 +0300, Pekka Enberg wrote:
> >> >  Hi,
> >> >
> >> >  On Thu, Aug 18, 2011 at 6:06 AM, Xiao Guangrong
> >> >wrote:
> >> >  >  When kvm emulates repeation io read instruction, it can exit to 
> >> > user-space with
> >> >  >  'count'>  1, we need to emulate io access for many times
> >> >  >
> >> >  >  Signed-off-by: Xiao Guangrong
> >> >
> >> >  The KVM tool is not actually maintained by Avi and Marcelo but by me
> >> >  and few others. Our git repository is here:
> >> >
> >> >  https://github.com/penberg/linux-kvm
> >> >
> >> >  Ingo pulls that to -tip few times a week or so. Sasha, can you please
> >> >  take a look at these patches and if you're OK with them, I'll apply
> >> >  them.
> >>
> >> Pekka,
> >>
> >> I can only assume they're right, 'count' isn't documented anywhere :)
> >>
> >> If any of KVM maintainers could confirm it I'll add it into the docs.
> >>
> > 
> > Count is indeed the number of repetitions.
> > 
> 
> Hi Pekka,
> 
> Could you pick up this patchset please?
> 

Xiao, it was merged a couple of weeks ago.

-- 

Sasha.



Re: [PATCH 0/4] Avoid soft lockup message when KVM is stopped by host

2011-09-09 Thread Marcelo Tosatti
On Thu, Sep 01, 2011 at 02:27:49PM -0600, emun...@mgebm.net wrote:
> On Thu, 01 Sep 2011 14:24:12 -0500, Anthony Liguori wrote:
> >On 08/30/2011 07:26 AM, Marcelo Tosatti wrote:
> >>On Mon, Aug 29, 2011 at 05:27:11PM -0600, Eric B Munson wrote:
> >>>Currently, when qemu stops a guest kernel, that guest will
> >>>issue a soft lockup
> >>>message when it resumes.  This set provides the ability for
> >>>qemu to comminucate
> >>>to the guest that it has been stopped.  When the guest hits
> >>>the watchdog on
> >>>resume it will check if it was suspended before issuing the
> >>>warning.
> >>>
> >>>Eric B Munson (4):
> >>>   Add flag to indicate that a vm was stopped by the host
> >>>   Add functions to check if the host has stopped the vm
> >>>   Add generic stubs for kvm stop check functions
> >>>   Add check for suspended vm in softlockup detector
> >>>
> >>>  arch/x86/include/asm/pvclock-abi.h |1 +
> >>>  arch/x86/include/asm/pvclock.h |2 ++
> >>>  arch/x86/kernel/kvmclock.c |   14 ++
> >>>  include/asm-generic/pvclock.h  |   14 ++
> >>>  kernel/watchdog.c  |   12 
> >>>  5 files changed, 43 insertions(+), 0 deletions(-)
> >>>  create mode 100644 include/asm-generic/pvclock.h
> >>>
> >>>--
> >>>1.7.4.1
> >>
> >>How is the host supposed to set this flag?
> >>
> >>As mentioned previously, if you save save/restore the offset
> >>added to
> >>kvmclock on stop/cont (and the TSC MSR, forgot to mention that), no
> >>paravirt infrastructure is required. Which means the issue is
> >>also fixed
> >>for older guests.
> >
> >IIRC, the steal time patches have some logic that basically say:
> >
> >if there was steal time:
> >   kick soft lockup detector
> >
> >I wonder if that serves this purpose provided that time spent in stop
> >is accounted as steal time.  If it isn't, perhaps it should be?

Accounting "in userspace (qemu)" as stolen time is problematic. Think
irqchip in userspace (halt emulation), excessive time spent in device
emulation that would trigger genuine watchdog warnings.

The latest patches from Glauber dropped this logic.


> >
> >Regards,
> >
> >Anthony Liguori
> >
> 
> I could be missing it, but I don't see anything in the steal time
> patches that kicks the watchdog.
> 
> Accounting stopped time as stolen time opens a possible problem when
> the accounting (CPU power modification) part is turned on.  As a
> process accumulates steal time its CPU power is increased.  If we
> account stopped time as stolen, it will do strange things with CPU
> power for that process.  I believe that it is for this reason that
> patch 4 of Glauber's series explicitly states that halted time is
> not stolen time.  Stolen time is only accumulated when a vCPU
> actually has work to do.
> 
> Eric


Re: RFC [v2]: vfio / device assignment -- layout of device fd files

2011-09-09 Thread Stuart Yoder
Meant to identify the changes in v2 of this proposal:

v2:
   -removed PCI_INFO record type
   -removed PCI_BAR_INFO record type
   -PCI_CONFIG_SPACE is now a sub-record/property of a REGION
   -removed physical address from region and made it
a subrecord/property of a REGION
   -added PCI_BAR_INDEX sub-record type
   -updated magic numbers

Stuart

On Fri, Sep 9, 2011 at 8:11 AM, Stuart Yoder  wrote:
> Based on the discussions over the last couple of weeks
> I have updated the device fd file layout proposal and
> tried to specify it a bit more formally.
>
> ===
>
> 1.  Overview
>
>  This specification describes the layout of device files
>  used in the context of vfio, which gives user space
>  direct access to I/O devices that have been bound to
>  vfio.
>
>  When a device fd is opened and read, offset 0x0 contains
>  a fixed sized header followed by a number of variable length
>  records that describe different characteristics
>  of the device-- addressable regions, interrupts, etc.
>
>  0x0  +-+-+
>       |         magic             | u32  // identifies this as a vfio device
>       +---+                              //   file and identifies the type of bus
>       |         version           | u32  // specifies the version of this
>       +---+
>       |         flags             | u32  // encodes any flags
>       +---+
>       |  dev info record 0        |
>       |    type                   | u32   // type of record
>       |    rec_len                | u32   // length in bytes of record
>       |                           |          (including record header)
>       |    flags                  | u32   // type specific flags
>       |    ...content...          |       // record content, which could
>       +---+       // include sub-records
>       |  dev info record 1        |
>       +---+
>       |  dev info record N        |
>       +---+
>
>  The device info records following the file header may have
>  the following record types each with content encoded in
>  a record specific way:
>
>  +---+--
>              |  type |
>   Record     |  num  | Description
>  ---
>  REGION           1    describes an addressable address range for the device
>  DTPATH           2    describes the device tree path for the device
>  DTINDEX          3    describes the index into the related device tree
>                          property (reg,ranges,interrupts,interrupt-map)
>  INTERRUPT        4    describes an interrupt for the device
>  PCI_CONFIG_SPACE 5    property identifying a region as PCI config space
>  PCI_BAR_INDEX    6    describes the BAR index for a PCI region
>  PHYS_ADDR        7    describes the physical address of the region
>  ---
>
> 2. Header
>
> The header is located at offset 0x0 in the device fd
> and has the following format:
>
>    struct devfd_header {
>        __u32 magic;
>        __u32 version;
>        __u32 flags;
>    };
>
>    The 'magic' field contains a magic value that will
>    identify the type of bus the device is on.  Valid values
>    are:
>
>        0x70636900   // "pci" - PCI device
>        0x6474   // "dt" - device tree (system bus)
>
> 3. Region
>
>  A REGION record describes an addressable address range for the device.
>
>    struct devfd_region {
>        __u32 type;   // must be 0x1
>        __u32 record_len;
>        __u32 flags;
>        __u64 offset; // seek offset to region from beginning
>                      // of file
>        __u64 len   ; // length of the region
>    };
>
>  The 'flags' field supports one flag:
>
>      IS_MMAPABLE
>
> 4. Device Tree Path (DTPATH)
>
>  A DTPATH record is a sub-record of a REGION and describes
>  the path to a device tree node for the region
>
>    struct devfd_dtpath {
>        __u32 type;   // must be 0x2
>        __u32 record_len;
>        char  path[];   // device tree path for the region
>    };
>
> 5. Device Tree Index (DTINDEX)
>
>  A DTINDEX record is a sub-record of a REGION and specifies
>  the index into the resource list encoded in the associated
>  device tree property-- "reg", "ranges", "interrupts", or
>  "interrupt-map".
>
>    struct devfd_dtindex {
>        __u32 type;   // must be 0x3
>        __u32 record_len;
>        __u32 prop_type;
>        __u32 prop_index;  // index into the resource list
>    };
>
>    prop_type must have one of the following values:
>       1   // "reg" property
>       2   // "ranges" property
>       3   // "interrupts" property
>       4   // "interrupt-map" property
>
>    Note: prop_index is not the byte offset into the property,
>    but the logical index.
>
> 6. Interrupts (INTERR

RFC [v2]: vfio / device assignment -- layout of device fd files

2011-09-09 Thread Stuart Yoder
Based on the discussions over the last couple of weeks
I have updated the device fd file layout proposal and
tried to specify it a bit more formally.

===

1.  Overview

  This specification describes the layout of device files
  used in the context of vfio, which gives user space
  direct access to I/O devices that have been bound to
  vfio.

  When a device fd is opened and read, offset 0x0 contains
  a fixed sized header followed by a number of variable length
  records that describe different characteristics
  of the device-- addressable regions, interrupts, etc.

  0x0  +---------------------------+
       |         magic             | u32  // identifies this as a vfio device
       +---------------------------+      //   file and identifies the type of bus
       |         version           | u32  // specifies the version of this
       +---------------------------+
       |         flags             | u32  // encodes any flags
       +---------------------------+
       |  dev info record 0        |
       |    type                   | u32  // type of record
       |    rec_len                | u32  // length in bytes of record
       |                           |      //   (including record header)
       |    flags                  | u32  // type specific flags
       |    ...content...          |      // record content, which could
       +---------------------------+      //   include sub-records
       |  dev info record 1        |
       +---------------------------+
       |  dev info record N        |
       +---------------------------+

  The device info records following the file header may have
  the following record types each with content encoded in
  a record specific way:

  -----------------+-------+------------------------------------------------
                   | type  |
   Record          |  num  | Description
  -----------------+-------+------------------------------------------------
  REGION               1     describes an addressable address range for the device
  DTPATH               2     describes the device tree path for the device
  DTINDEX              3     describes the index into the related device tree
                               property (reg,ranges,interrupts,interrupt-map)
  INTERRUPT            4     describes an interrupt for the device
  PCI_CONFIG_SPACE     5     property identifying a region as PCI config space
  PCI_BAR_INDEX        6     describes the BAR index for a PCI region
  PHYS_ADDR            7     describes the physical address of the region
  ------------------------------------------------------------------------

2. Header

The header is located at offset 0x0 in the device fd
and has the following format:

    struct devfd_header {
        __u32 magic;
        __u32 version;
        __u32 flags;
    };

The 'magic' field contains a magic value that will
identify the type of bus the device is on.  Valid values
are:

0x70636900   // "pci" - PCI device
0x6474   // "dt" - device tree (system bus)

3. Region

  A REGION record describes an addressable address range for the device.

    struct devfd_region {
        __u32 type;      // must be 0x1
        __u32 record_len;
        __u32 flags;
        __u64 offset;    // seek offset to region from beginning
                         // of file
        __u64 len;       // length of the region
    };

  The 'flags' field supports one flag:

  IS_MMAPABLE

4. Device Tree Path (DTPATH)

  A DTPATH record is a sub-record of a REGION and describes
  the path to a device tree node for the region

    struct devfd_dtpath {
        __u32 type;      // must be 0x2
        __u32 record_len;
        char  path[];    // device tree path for the region
    };

5. Device Tree Index (DTINDEX)

  A DTINDEX record is a sub-record of a REGION and specifies
  the index into the resource list encoded in the associated
  device tree property-- "reg", "ranges", "interrupts", or
  "interrupt-map".

    struct devfd_dtindex {
        __u32 type;      // must be 0x3
        __u32 record_len;
        __u32 prop_type;
        __u32 prop_index;  // index into the resource list
    };

    prop_type must have one of the following values:
       1   // "reg" property
       2   // "ranges" property
       3   // "interrupts" property
       4   // "interrupt-map" property

Note: prop_index is not the byte offset into the property,
but the logical index.

6. Interrupts (INTERRUPT)

  An INTERRUPT record describes one of a device's interrupts.
  The handle field is an argument to VFIO_DEVICE_GET_IRQ_FD
  which user space can use to receive device interrupts.

    struct devfd_interrupts {
        __u32 type;      // must be 0x4
        __u32 record_len;
        __u32 flags;
        __u32 handle;    // parameter to VFIO_DEVICE_GET_IRQ_FD
    };
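
For illustration, here is a minimal sketch (my code, not part of the
proposal) of how userspace might walk this record stream after reading the
device fd into a buffer; it relies only on the common type/rec_len prefix
from the diagram in section 1:

#include <stdint.h>
#include <stdio.h>

struct devfd_header  { uint32_t magic, version, flags; };
struct devfd_rec_hdr { uint32_t type, rec_len; };   /* common record prefix */

static void walk_records(const uint8_t *buf, size_t len)
{
    const struct devfd_header *hdr = (const struct devfd_header *)buf;
    size_t off = sizeof(*hdr);

    printf("magic 0x%x, version %u\n", hdr->magic, hdr->version);

    while (off + sizeof(struct devfd_rec_hdr) <= len) {
        const struct devfd_rec_hdr *rec =
            (const struct devfd_rec_hdr *)(buf + off);
        if (rec->rec_len < sizeof(*rec))
            break;                      /* malformed record */
        printf("  record type %u, %u bytes\n", rec->type, rec->rec_len);
        off += rec->rec_len;            /* rec_len includes the header */
    }
}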

7.  PCI Config Space (PCI_CONFIG_SPACE)

A PCI_CONFIG_SPACE record is a sub-record of a REGION record
and identifies the region as PCI configuration space.

struct de

Re: [PATCH] KVM: emulate lapic tsc deadline timer for hvm

2011-09-09 Thread Marcelo Tosatti
On Fri, Sep 09, 2011 at 01:12:51AM +0800, Liu, Jinsong wrote:
> >>> --- a/arch/x86/include/asm/msr-index.h
> >>> +++ b/arch/x86/include/asm/msr-index.h
> >>> @@ -229,6 +229,8 @@
> >>>  #define MSR_IA32_APICBASE_ENABLE (1<<11)
> >>>  #define MSR_IA32_APICBASE_BASE   (0xf<<12)
> >>> 
> >>> +#define MSR_IA32_TSCDEADLINE 0x06e0
> >>> +
> >>>  #define MSR_IA32_UCODE_WRITE 0x0079
> >>>  #define MSR_IA32_UCODE_REV   0x008b
> >> 
> >> Need to add to msrs_to_save so live migration works.
> > 
> > MSR must be explicitly listed in qemu, also.
> > 
> 
> Marcelo, it seems the MSR doesn't need to be explicitly listed in qemu?
> On the KVM side, adding MSR_IA32_TSCDEADLINE to msrs_to_save is enough. Qemu will get 
> it through KVM_GET_MSR_INDEX_LIST.
> Do I miss something?

Notice in target-i386/kvm.c the KVM_GET_MSR_INDEX_LIST list is only used
for MSR_STAR/MSR_HSAVE_PA presence detection.

You do need to explicitly add MSR_IA32_TSCDEADLINE to the
kvm_get_msrs/kvm_put_msrs routines.



Re: CFQ I/O starvation problem triggered by RHEL6.0 KVM guests

2011-09-09 Thread Takuya Yoshikawa
Vivek Goyal  wrote:

> So you are using both RHEL 6.0 in both host and guest kernel? Can you
> reproduce the same issue with upstream kernels? How easily/frequently
> you can reproduce this with RHEL6.0 host.

Guests were CentOS6.0.

I have only RHEL6.0 and RHEL6.1 test results now.
I want to try similar tests with upstream kernels if I can get some time.

With RHEL6.0 kernel, I heard that this issue was reproduced every time, 100%.

> > On the host, we were running 3 linux guests to see if I/O from these guests
> > would be handled fairly by host; each guest did dd write with oflag=direct.
> > 
> > Guest virtual disk:
> >   We used a host local disk which had 3 partitions, and each guest was
> >   allocated one of these as dd write target.
> > 
> > So our test was for checking if cfq could keep fairness for the 3 guests
> > who shared the same disk.
> > 
> > The result (strange starvation):
> >   Sometimes, one guest dominated cfq for more than 10sec and requests from
> >   other guests were not handled at all during that time.
> > 
> > Below is the blktrace log which shows that a request to (8,27) in cfq2068S 
> > (*1)
> > is not handled at all during cfq2095S and cfq2067S which hold requests to
> > (8,26) are being handled alternately.
> > 
> > *1) WS 104920578 + 64
> > 
> > Question:
> >   I guess that cfq_close_cooperator() was being called in an unusual manner.
> >   If so, do you think that cfq is responsible for keeping fairness for this
> >   kind of unusual write requests?
> 
> - If two guests are doing IO to separate partitions, they should really
>   not be very close (until and unless partitions are really small).

Sorry for my lack of explanation.

The IO was issued from QEMU and the cooperative threads were both for the same
guest. In other words, QEMU was using two threads for one IO stream from the 
guest.

As my blktrace log snippet showed, cfq2095S and cfq2067S treated one sequential
IO; cfq2095S did 64KB, then cfq2067S did next 64KB, and so on.

  These should be from the same guest because the target partition was same,
  which was allocated to that guest.

During the 10sec, this repetition continued without allowing others to 
interrupt.

I know it is unnatural but sometimes QEMU uses two aio threads for issuing one
IO stream.

> 
> - Even if there are close cooperators, these queues are merged and they
>   are treated as single queue from slice point of view. So cooperating
>   queues should be merged and get a single slice instead of starving
>   other queues in the system.

I understand that close cooperators' queues should be merged, but in our test
case, when the 64KB request was issued from one aio thread, the other thread's
queue was empty; because these queues are for the same stream, next request
could not come until current request got finished.

  But this is complicated because it depends on the qemu block layer aio.

I am not sure if cfq would try to merge the queues in such cases.

> Can you upload the blktrace logs somewhere which shows what happened 
> during that 10 seconds.

I have some restrictions here, so maybe, but I need to check later.

> > Note:
> >   With RHEL6.1, this problem could not triggered. But I guess that was due 
> > to
> >   QEMU's block layer updates.
> 
> You can try reproducing this with fio.

Thank you, I want to do some tests by myself; the original report was not from
my team.


My feeling is that it may be possible to dominate IO if we create two threads
and issue cooperative IO as our QEMU did; QEMU is just a process from the host's
view, and one QEMU process dominated IO, preventing the other QEMUs' IO.

Thanks,
Takuya


Re: About hotplug multifunction

2011-09-09 Thread Michael S. Tsirkin
On Fri, Sep 09, 2011 at 03:08:21AM -0400, Amos Kong wrote:
> Hello all,
> 
> I'm working on hotplug pci multifunction. 
> 
> 1. qemu cmdline: 
> ./x86_64-softmmu/qemu-system-x86_64 -snapshot -m 2000 
> /home/kvm_autotest_root/images/rhel61-64-virtio.qcow2 -vnc :0 -monitor 
> unix:/tmp/a,server,nowait --enable-kvm -net none
> 
> 2. script to add virtio-blk devices:
> for i in `seq 1 7` 0;do
> qemu-img create /tmp/resize$i.qcow2 1G -f qcow2
> echo drive_add 0x6.$i id=drv$i,if=none,file=/tmp/resize$i.qcow2 | nc -U /tmp/a
> echo device_add 
> virtio-blk-pci,id=dev$i,drive=drv$i,addr=0x6.$i,multifunction=on | nc -U 
> /tmp/a
> done
> 
> 3. script to add virio-nic devices:
> for i in `seq 1 7` 0;do
> echo netdev_add tap,id=drv$i | nc -U /tmp/a
> echo device_add 
> virtio-net-pci,id=dev$i,netdev=drv$i,addr=0x6.$i,multifunction=on | nc -U 
> /tmp/a
> done
> 
> 4. current qemu behaviors
> 4.1. add func 1~7 one by one, then add func 0
> virtio-nic : success, all funcs are added
> virtio-blk : success
> 
> 4.2. add func 0~7 one by one
> virtio-nic : failed, only func 0 is added
> virtio-blk : success
> 
> 4.3. removing any single func in monitor
> virtio-nic: func 0 is not found in 'lspci', funcs 1~7 still exist. eth1~eth7
> also still exist.
> virtio-blk: func 0 is not found in 'lspci', funcs 1~7 still exist. The device
> /dev/vda disappears,
>   vdb,vdc,vde,vdf,vdg,vdh,vdi,vdj still exist. If I re-add 8 funcs
> to the guest, they all work.
>   # lspci (00:06.1 ~ 00:06.7 exist, 00:06.0 doesn't exist)
>   00:06.1 SCSI storage controller: Red Hat, Inc Virtio block 
> device (rev ff)

something I noted when reading our acpi code:
we currently pass eject request for function 0 only:
   Name (_ADR, nr##)
We either need a device per function there (acpi 1.0),
send eject request for them all, or use 0xFFFF
as the function number (newer acpi, not sure which version).
Need to see which guests (windows,linux) can handle which form.

> 
> qemu sends an acpi event to guest, then guest will remove all funcs in the 
> slot.
> linux-2.6/drivers/pci/hotplug/acpiphp_glue.c:
> static int disable_device(struct acpiphp_slot *slot) {
> list_for_each_entry(func, &slot->funcs, sibling) {
> ...
> 
> Questions:
> 1. why can funcs 1~7 still be found after hot-remove? is it the same as on real
> hardware?
> 2. why could funcs 1~7 not be added to the guest (adding func 0~7 one by one)?
> 3. how about this interface to hotplug/hot-unplug multifunction:
>1) Add funcs 1-7 by monitor, add func 0, then send an acpi event to notify the
> guest
>2) Remove func 0, send an acpi event to the guest. (all funcs can be removed)

We must make sure guest acked removal of all functions. Surprise
removal would be bad.

> 4. what does "revision 0xff" stand for?

You get 0xff if no device responds to a configuration read.
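
This is easy to see from inside the guest (hypothetical device path; a
function that no longer responds reads back as all-ones):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* example path; adjust to the slot/function under test */
    const char *path = "/sys/bus/pci/devices/0000:00:06.1/config";
    unsigned char buf[4];

    int fd = open(path, O_RDONLY);
    if (fd < 0 || read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        perror(path);
        return 1;
    }
    printf("vendor/device: %02x%02x:%02x%02x%s\n",
           buf[1], buf[0], buf[3], buf[2],
           (buf[0] == 0xff && buf[1] == 0xff) ? " (nothing responding)" : "");
    return 0;
}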

-- 
MST


About hotplug multifunction

2011-09-09 Thread Amos Kong
Hello all,

I'm working on hotplug pci multifunction. 

1. qemu cmdline: 
./x86_64-softmmu/qemu-system-x86_64 -snapshot -m 2000 
/home/kvm_autotest_root/images/rhel61-64-virtio.qcow2 -vnc :0 -monitor 
unix:/tmp/a,server,nowait --enable-kvm -net none

2. script to add virtio-blk devices:
for i in `seq 1 7` 0;do
qemu-img create /tmp/resize$i.qcow2 1G -f qcow2
echo drive_add 0x6.$i id=drv$i,if=none,file=/tmp/resize$i.qcow2 | nc -U /tmp/a
echo device_add 
virtio-blk-pci,id=dev$i,drive=drv$i,addr=0x6.$i,multifunction=on | nc -U /tmp/a
done

3. script to add virio-nic devices:
for i in `seq 1 7` 0;do
echo netdev_add tap,id=drv$i | nc -U /tmp/a
echo device_add 
virtio-net-pci,id=dev$i,netdev=drv$i,addr=0x6.$i,multifunction=on | nc -U /tmp/a
done

4. current qemu behaviors
4.1. add func 1~7 one by one, then add func 0
virtio-nic : success, all funcs are added
virtio-blk : success

4.2. add func 0~7 one by one
virtio-nic : failed, only func 0 is added
virtio-blk : success

4.3. removing any single func in monitor
virtio-nic: func 0 is not found in 'lspci', funcs 1~7 still exist. eth1~eth7
also still exist.
virtio-blk: func 0 is not found in 'lspci', funcs 1~7 still exist. The device
/dev/vda disappears,
  vdb,vdc,vde,vdf,vdg,vdh,vdi,vdj still exist. If I re-add 8 funcs
to the guest, they all work.
  # lspci (00:06.1 ~ 00:06.7 exist, 00:06.0 doesn't exist)
  00:06.1 SCSI storage controller: Red Hat, Inc Virtio block device 
(rev ff)

qemu sends an acpi event to guest, then guest will remove all funcs in the slot.
linux-2.6/drivers/pci/hotplug/acpiphp_glue.c:
static int disable_device(struct acpiphp_slot *slot) {
list_for_each_entry(func, &slot->funcs, sibling) {
...

Questions:
1. why can funcs 1~7 still be found after hot-remove? is it the same as on real hardware?
2. why could funcs 1~7 not be added to the guest (adding func 0~7 one by one)?
3. how about this interface to hotplug/hot-unplug multifunction:
   1) Add funcs 1-7 by monitor, add func 0, then send an acpi event to notify the
guest
   2) Remove func 0, send an acpi event to the guest. (all funcs can be removed)
4. what does "revision 0xff" stand for?

Thanks in advance,
Amos