Re: [Qemu-devel] [PATCH 0/3] switch to seavgabios
Negative result here too: I don't see anything on screen at startup. From the log, with the latest qemu-kvm git version:

  Running option rom at c180:3d4e
  Running option rom at c180:3da2
  Running option rom at c180:3df6
  Running option rom at c580:0003
  qemu-system-x86_64: /root/download/qemu/git/qemu-kvm/exec.c:2641: register_subpage: Assertion `existing.mr->subpage || existing.mr == &io_mem_unassigned' failed.

Startup is done with the following command:

  /root/download/qemu/git/qemu-kvm/x86_64-softmmu/qemu-system-x86_64 \
    -drive file=1.img,media=disk,if=scsi,bus=0,unit=0 \
    -drive file=2.img,media=disk,if=scsi,bus=0,unit=1 \
    -drive file=3.img,media=disk,if=scsi,bus=0,unit=2 \
    -boot order=cad,menu=on -m 2048 -k de -vga vmware -vnc :0 \
    -bios /root/download/seabios/git/seabios/out/bios.bin \
    -chardev stdio,id=seabios -device isa-debugcon,iobase=0x402,chardev=seabios \
    -option-rom BIOS/V4.19/8xx_64.rom \
    -device rtl8139,mac=00:02:44:92:87:6a,vlan=0,romfile= \
    -net tap,ifname=tap0,script=no,downscript=no,vlan=0

The backtrace isn't valid. The old integrated BIOS is valid and works well.

Ciao,
Gerhard

On 17.04.2012 12:11, Gerd Hoffmann wrote:

Hi,

This patch series switches the vgabios binaries shipped by qemu from the lgpl'ed vgabios to the seabios version. There should be no guest-visible changes (especially no regressions) in theory. The only known (and intentional) exception is the vesa 2.0 protected mode interface, which is not implemented by the seavgabios.

There are no known issues with linux and windows guests. Testing with other, more exotic guests is very welcome.

Note #1: Just pull from git to test; I've zapped the big binary patches so you can't 'git am' them.

Note #2: This goes on top of the seabios-1.7.0 pull request just sent, so pulling this will pull the seabios update too.
please give it a spin,
  Gerd

The following changes since commit 6b034aa138716a515c88f7894940d5d0aff2f3ed:

  seabios: update to 1.7.0 (2012-04-17 10:51:41 +0200)

are available in the git repository at:

  git://git.kraxel.org/qemu seavgabios-1.7.0

Gerd Hoffmann (3):
      Add seavgabios build rules to roms/Makefile
      Update vgabios binaries
      Switch isa vga to seavgabios too.

 hw/vga-isa.c               |    2 +-
 hw/vga_int.h               |    1 -
 pc-bios/vgabios-cirrus.bin |  Bin 35840 -> 37888 bytes
 pc-bios/vgabios-isavga.bin |  Bin 0 -> 37888 bytes
 pc-bios/vgabios-qxl.bin    |  Bin 40448 -> 37888 bytes
 pc-bios/vgabios-stdvga.bin |  Bin 40448 -> 37888 bytes
 pc-bios/vgabios-vmware.bin |  Bin 40448 -> 37888 bytes
 roms/Makefile              |   13 +
 roms/config.vga.cirrus     |    3 +++
 roms/config.vga.isavga     |    3 +++
 roms/config.vga.qxl        |    6 ++
 roms/config.vga.stdvga     |    3 +++
 roms/config.vga.vmware     |    6 ++
 13 files changed, 35 insertions(+), 2 deletions(-)
 create mode 100644 pc-bios/vgabios-isavga.bin
 create mode 100644 roms/config.vga.cirrus
 create mode 100644 roms/config.vga.isavga
 create mode 100644 roms/config.vga.qxl
 create mode 100644 roms/config.vga.stdvga
 create mode 100644 roms/config.vga.vmware
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 14/16] KVM: MMU: fast path of handling guest page fault
On 04/18/2012 09:47 AM, Marcelo Tosatti wrote: > On Fri, Apr 13, 2012 at 06:16:33PM +0800, Xiao Guangrong wrote: >> If the the present bit of page fault error code is set, it indicates >> the shadow page is populated on all levels, it means what we do is >> only modify the access bit which can be done out of mmu-lock >> >> Currently, in order to simplify the code, we only fix the page fault >> caused by write-protect on the fast path >> >> Signed-off-by: Xiao Guangrong >> --- >> arch/x86/kvm/mmu.c | 205 >> >> arch/x86/kvm/paging_tmpl.h |3 + >> 2 files changed, 192 insertions(+), 16 deletions(-) >> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c >> index efa5d59..fc91667 100644 >> --- a/arch/x86/kvm/mmu.c >> +++ b/arch/x86/kvm/mmu.c >> @@ -446,6 +446,13 @@ static bool __check_direct_spte_mmio_pf(u64 spte) >> } >> #endif >> >> +static bool spte_wp_by_dirty_log(u64 spte) >> +{ >> +WARN_ON(is_writable_pte(spte)); >> + >> +return (spte & SPTE_ALLOW_WRITE) && !(spte & SPTE_WRITE_PROTECT); >> +} >> + >> static bool spte_has_volatile_bits(u64 spte) >> { >> if (!shadow_accessed_mask) >> @@ -454,9 +461,18 @@ static bool spte_has_volatile_bits(u64 spte) >> if (!is_shadow_present_pte(spte)) >> return false; >> >> -if ((spte & shadow_accessed_mask) && >> - (!is_writable_pte(spte) || (spte & shadow_dirty_mask))) >> -return false; >> +if (spte & shadow_accessed_mask) { >> +if (is_writable_pte(spte)) >> +return !(spte & shadow_dirty_mask); >> + >> +/* >> + * If the spte is write-protected by dirty-log, it may >> + * be marked writable on fast page fault path, then CPU >> + * can modify the Dirty bit. 
>> + */ >> +if (!spte_wp_by_dirty_log(spte)) >> +return false; >> +} >> >> return true; >> } >> @@ -1109,26 +1125,18 @@ static void drop_spte(struct kvm *kvm, u64 *sptep) >> rmap_remove(kvm, sptep); >> } >> >> -static bool spte_wp_by_dirty_log(u64 spte) >> -{ >> -WARN_ON(is_writable_pte(spte)); >> - >> -return (spte & SPTE_ALLOW_WRITE) && !(spte & SPTE_WRITE_PROTECT); >> -} >> - >> static void spte_write_protect(struct kvm *kvm, u64 *sptep, bool large, >> bool *flush, bool page_table_protect) >> { >> u64 spte = *sptep; >> >> if (is_writable_pte(spte)) { >> -*flush |= true; >> - >> if (large) { >> pgprintk("rmap_write_protect(large): spte %p %llx\n", >> spte, *spte); >> BUG_ON(!is_large_pte(spte)); >> >> +*flush |= true; >> drop_spte(kvm, sptep); >> --kvm->stat.lpages; >> return; >> @@ -1137,6 +1145,9 @@ static void spte_write_protect(struct kvm *kvm, u64 >> *sptep, bool large, >> goto reset_spte; >> } >> >> +/* We need flush tlbs in this case: the fast page fault path >> + * can mark the spte writable after we read the sptep. >> + */ >> if (page_table_protect && spte_wp_by_dirty_log(spte)) >> goto reset_spte; >> >> @@ -1144,6 +1155,8 @@ static void spte_write_protect(struct kvm *kvm, u64 >> *sptep, bool large, >> >> reset_spte: >> rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte); >> + >> +*flush |= true; >> spte = spte & ~PT_WRITABLE_MASK; >> if (page_table_protect) >> spte |= SPTE_WRITE_PROTECT; >> @@ -2767,18 +2780,172 @@ exit: >> return ret; >> } >> >> +static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, gfn_t gfn, >> + u32 error_code) >> +{ >> +unsigned long *rmap; >> + >> +/* >> + * #PF can be fast only if the shadow page table is present and it >> + * is caused by write-protect, that means we just need change the >> + * W bit of the spte which can be done out of mmu-lock. 
>> + */ >> +if (!(error_code & PFERR_PRESENT_MASK) || >> + !(error_code & PFERR_WRITE_MASK)) >> +return false; >> + >> +rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL); >> + >> +/* Quickly check the page can be writable. */ >> +if (test_bit(PTE_LIST_WP_BIT, ACCESS_ONCE(rmap))) >> +return false; >> + >> +return true; >> +} >> + >> +static bool >> +fast_pf_fix_indirect_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, >> + u64 *sptep, u64 spte, gfn_t gfn) >> +{ >> +pfn_t pfn; >> +bool ret = false; >> + >> +/* >> + * For the indirect spte, it is hard to get a stable gfn since >> + * we just use a cmpxchg to avoid all the races which is not >> + * enough to avoid the ABA problem: the host can arbitrarily >> + * change spte and the mapping from gfn to pfh. >> + * >> + * What w
Re: [PATCH v2 07/16] KVM: MMU: introduce for_each_pte_list_spte
On 04/17/2012 10:47 PM, Takuya Yoshikawa wrote:
> On Mon, 16 Apr 2012 11:36:25 +0800
> Xiao Guangrong wrote:
>
>> I tested it with kernbench, no regression is found.
>
> Because kernbench is not at all a good test for this.
>
>> It is not a problem since the iter and spte should be in the cache.
>
> I have checked dirty-log-perf myself with this patch [01-07].
>
> GET_DIRTY_LOG for 1GB dirty pages has become more than 15% slower.

Thanks for your test! Unbelievable; I will do more tests and check it more carefully. Could you please publish your tool, so that I can reproduce this and find the real reason?

> Note: if you had checked the worst case latency with this patch series,
> you should have noticed this regression by yourself.
>
> Auto-test and kernbench are not enough to see the effect of this work.

I will check whether your tool is better than kernbench/autotest after reviewing your tool.
[PATCH] kvm: lock slots_lock around device assignment
As pointed out by Jason Baron, when assigning a device to a guest we first set the iommu domain pointer, which enables mapping and unmapping of memory slots to the iommu. This leaves a window where this path is enabled, but we haven't synchronized the iommu mappings to the existing memory slots. Thus a slot being removed at that point could send us down unexpected code paths, removing non-existent pinnings and iommu mappings.

Take the slots_lock around creating the iommu domain and the initial mappings, as well as around iommu teardown, to avoid this race.

Signed-off-by: Alex Williamson
---
 virt/kvm/iommu.c |   23 +++
 1 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/iommu.c b/virt/kvm/iommu.c
index fec1723..e9fff98 100644
--- a/virt/kvm/iommu.c
+++ b/virt/kvm/iommu.c
@@ -240,9 +240,13 @@ int kvm_iommu_map_guest(struct kvm *kvm)
 		return -ENODEV;
 	}
 
+	mutex_lock(&kvm->slots_lock);
+
 	kvm->arch.iommu_domain = iommu_domain_alloc(&pci_bus_type);
-	if (!kvm->arch.iommu_domain)
-		return -ENOMEM;
+	if (!kvm->arch.iommu_domain) {
+		r = -ENOMEM;
+		goto out_unlock;
+	}
 
 	if (!allow_unsafe_assigned_interrupts &&
 	    !iommu_domain_has_cap(kvm->arch.iommu_domain,
@@ -253,17 +257,16 @@ int kvm_iommu_map_guest(struct kvm *kvm)
 		       " module option.\n", __func__);
 		iommu_domain_free(kvm->arch.iommu_domain);
 		kvm->arch.iommu_domain = NULL;
-		return -EPERM;
+		r = -EPERM;
+		goto out_unlock;
 	}
 
 	r = kvm_iommu_map_memslots(kvm);
 	if (r)
-		goto out_unmap;
-
-	return 0;
+		kvm_iommu_unmap_memslots(kvm);
 
-out_unmap:
-	kvm_iommu_unmap_memslots(kvm);
+out_unlock:
+	mutex_unlock(&kvm->slots_lock);
 	return r;
 }
 
@@ -340,7 +343,11 @@ int kvm_iommu_unmap_guest(struct kvm *kvm)
 	if (!domain)
 		return 0;
 
+	mutex_lock(&kvm->slots_lock);
 	kvm_iommu_unmap_memslots(kvm);
+	kvm->arch.iommu_domain = NULL;
+	mutex_unlock(&kvm->slots_lock);
+
 	iommu_domain_free(domain);
 	return 0;
 }
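The window being closed here can be sketched with a userspace analogue: the "domain" pointer only becomes visible while the same lock that serializes slot updates is held, so a concurrent slot remover can never observe a half-initialized domain. The struct and function names below are illustrative stand-ins (a pthread mutex plays the role of kvm->slots_lock), not kernel code:

```c
#include <pthread.h>
#include <stdlib.h>

struct vm {
    pthread_mutex_t slots_lock;
    void *iommu_domain;   /* non-NULL only once mappings are in sync */
    int mapped_slots;
};

static int map_guest(struct vm *vm)
{
    pthread_mutex_lock(&vm->slots_lock);

    vm->iommu_domain = malloc(1);   /* stand-in for iommu_domain_alloc() */
    if (!vm->iommu_domain) {
        pthread_mutex_unlock(&vm->slots_lock);
        return -1;
    }
    /* initial slot sync happens under the same lock, closing the race */
    vm->mapped_slots = 4;           /* stand-in for kvm_iommu_map_memslots() */

    pthread_mutex_unlock(&vm->slots_lock);
    return 0;
}

static void unmap_guest(struct vm *vm)
{
    pthread_mutex_lock(&vm->slots_lock);
    vm->mapped_slots = 0;           /* stand-in for kvm_iommu_unmap_memslots() */
    free(vm->iommu_domain);
    vm->iommu_domain = NULL;
    pthread_mutex_unlock(&vm->slots_lock);
}
```

The point of the pattern is that both the pointer publication and the state it guards change atomically with respect to anyone else holding the lock.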
Re: [PATCH v2 14/16] KVM: MMU: fast path of handling guest page fault
On Fri, Apr 13, 2012 at 06:16:33PM +0800, Xiao Guangrong wrote:
> If the present bit of the page fault error code is set, it indicates
> the shadow page is populated on all levels; it means what we do is
> only modify the access bit, which can be done out of mmu-lock.
>
> Currently, in order to simplify the code, we only fix the page fault
> caused by write-protect on the fast path.
>
> Signed-off-by: Xiao Guangrong
> ---
>  arch/x86/kvm/mmu.c         |  205
>  arch/x86/kvm/paging_tmpl.h |    3 +
>  2 files changed, 192 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index efa5d59..fc91667 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -446,6 +446,13 @@ static bool __check_direct_spte_mmio_pf(u64 spte)
>  }
>  #endif
>
> +static bool spte_wp_by_dirty_log(u64 spte)
> +{
> +	WARN_ON(is_writable_pte(spte));
> +
> +	return (spte & SPTE_ALLOW_WRITE) && !(spte & SPTE_WRITE_PROTECT);
> +}
> +
>  static bool spte_has_volatile_bits(u64 spte)
>  {
>  	if (!shadow_accessed_mask)
> @@ -454,9 +461,18 @@ static bool spte_has_volatile_bits(u64 spte)
>  	if (!is_shadow_present_pte(spte))
>  		return false;
>
> -	if ((spte & shadow_accessed_mask) &&
> -	      (!is_writable_pte(spte) || (spte & shadow_dirty_mask)))
> -		return false;
> +	if (spte & shadow_accessed_mask) {
> +		if (is_writable_pte(spte))
> +			return !(spte & shadow_dirty_mask);
> +
> +		/*
> +		 * If the spte is write-protected by dirty-log, it may
> +		 * be marked writable on the fast page fault path, then
> +		 * the CPU can modify the Dirty bit.
> +		 */
> +		if (!spte_wp_by_dirty_log(spte))
> +			return false;
> +	}
>
>  	return true;
>  }
> @@ -1109,26 +1125,18 @@ static void drop_spte(struct kvm *kvm, u64 *sptep)
>  		rmap_remove(kvm, sptep);
>  }
>
> -static bool spte_wp_by_dirty_log(u64 spte)
> -{
> -	WARN_ON(is_writable_pte(spte));
> -
> -	return (spte & SPTE_ALLOW_WRITE) && !(spte & SPTE_WRITE_PROTECT);
> -}
> -
>  static void spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
>  			       bool *flush, bool page_table_protect)
>  {
>  	u64 spte = *sptep;
>
>  	if (is_writable_pte(spte)) {
> -		*flush |= true;
> -
>  		if (large) {
>  			pgprintk("rmap_write_protect(large): spte %p %llx\n",
>  				 spte, *spte);
>  			BUG_ON(!is_large_pte(spte));
>
> +			*flush |= true;
>  			drop_spte(kvm, sptep);
>  			--kvm->stat.lpages;
>  			return;
> @@ -1137,6 +1145,9 @@ static void spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
>  		goto reset_spte;
>  	}
>
> +	/* We need to flush tlbs in this case: the fast page fault path
> +	 * can mark the spte writable after we read the sptep.
> +	 */
>  	if (page_table_protect && spte_wp_by_dirty_log(spte))
>  		goto reset_spte;
>
> @@ -1144,6 +1155,8 @@ static void spte_write_protect(struct kvm *kvm, u64 *sptep, bool large,
>
>  reset_spte:
>  	rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
> +
> +	*flush |= true;
>  	spte = spte & ~PT_WRITABLE_MASK;
>  	if (page_table_protect)
>  		spte |= SPTE_WRITE_PROTECT;
> @@ -2767,18 +2780,172 @@ exit:
>  	return ret;
>  }
>
> +static bool page_fault_can_be_fast(struct kvm_vcpu *vcpu, gfn_t gfn,
> +				   u32 error_code)
> +{
> +	unsigned long *rmap;
> +
> +	/*
> +	 * #PF can be fast only if the shadow page table is present and it
> +	 * is caused by write-protect; that means we just need to change
> +	 * the W bit of the spte, which can be done out of mmu-lock.
> +	 */
> +	if (!(error_code & PFERR_PRESENT_MASK) ||
> +	      !(error_code & PFERR_WRITE_MASK))
> +		return false;
> +
> +	rmap = gfn_to_rmap(vcpu->kvm, gfn, PT_PAGE_TABLE_LEVEL);
> +
> +	/* Quickly check whether the page can be writable. */
> +	if (test_bit(PTE_LIST_WP_BIT, ACCESS_ONCE(rmap)))
> +		return false;
> +
> +	return true;
> +}
> +
> +static bool
> +fast_pf_fix_indirect_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> +			  u64 *sptep, u64 spte, gfn_t gfn)
> +{
> +	pfn_t pfn;
> +	bool ret = false;
> +
> +	/*
> +	 * For the indirect spte, it is hard to get a stable gfn since
> +	 * we just use a cmpxchg to avoid all the races, which is not
> +	 * enough to avoid the ABA problem: the host can arbitrarily
> +	 * change the spte and the mapping from gfn to pfn.
> +	 *
> +	 * What w
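The lockless core of the fast path above is a single compare-and-swap on the spte: the W bit is only set if the spte has not changed since it was read, otherwise the code falls back to the slow path under mmu_lock. A userspace sketch with C11 atomics, where the bit positions and names are illustrative stand-ins for KVM's actual spte encoding:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PT_WRITABLE_MASK   (1ull << 1)    /* illustrative bit positions, */
#define SPTE_ALLOW_WRITE   (1ull << 60)   /* not KVM's real layout       */
#define SPTE_WRITE_PROTECT (1ull << 61)

/* Try to mark a dirty-log write-protected spte writable again with one
 * cmpxchg. Returns false if the spte is not a candidate, or if someone
 * modified it between the read and the update (the caller would then
 * retry under the lock). */
static bool fast_pf_fix_spte(_Atomic uint64_t *sptep, uint64_t old_spte)
{
    /* Only sptes write-protected purely by dirty logging qualify. */
    if (!(old_spte & SPTE_ALLOW_WRITE) || (old_spte & SPTE_WRITE_PROTECT))
        return false;

    uint64_t new_spte = old_spte | PT_WRITABLE_MASK;

    /* Fails if *sptep no longer equals the snapshot we based new_spte on. */
    return atomic_compare_exchange_strong(sptep, &old_spte, new_spte);
}
```

As the quoted comment notes, a bare cmpxchg is not enough for indirect sptes because of the ABA problem: the spte could be zapped and repopulated with the same bits while the gfn-to-pfn mapping changed underneath.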
[ANNOUNCE] qemu-kvm-1.0.1
qemu-kvm-1.0.1 is now available. This release is based on the upstream qemu 1.0.1, plus kvm-specific enhancements. Please see the original QEMU 1.0.1 release announcement [1] for details.

This release can be used with the kvm kernel modules provided by your distribution kernel, or by the modules in the kvm-kmod package, such as kvm-kmod-3.1.

http://www.linux-kvm.org

[1] http://lists.gnu.org/archive/html/qemu-stable/2012-02/msg1.html
Question about host CPU usage/allocation by KVM
Greetings everybody,

Can anybody please point me to code/documentation regarding the following questions I have:

- What does it actually mean to use the "-smp N" option, in terms of CPU sharing between the host and the guest?
- How are guest CPUs mapped to host CPUs (if at all)?
- Is it possible to hard-allocate one or more CPUs to a particular guest, preventing the host from using them?
- Is it possible to limit guest usage of host CPUs?
- (Do my questions make sense?)

I am running stock ubuntu-natty 2.6.38-8, with KVM 0.14.0.

Thanks!
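A partial answer in sketch form: with KVM, each guest vcpu created by "-smp N" is an ordinary host thread, scheduled like any other, so the usual Linux affinity tools apply — taskset(1), cgroup cpusets, or sched_setaffinity(2) from code. A minimal Linux-only example of pinning the calling thread to one CPU (the CPU number here is arbitrary, and "hard-allocating" a CPU away from the host additionally needs isolcpus/cpusets):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single host CPU via sched_setaffinity(2).
 * A management tool would do the same to a qemu vcpu thread's tid.
 * Returns 0 on success, -1 (with errno set) on failure. */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0 /* 0 == calling thread */, sizeof(set), &set);
}
```

Limiting (rather than pinning) guest CPU usage is typically done with the cgroup cpu controller instead.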
Re: [PATCHv4] device-assignment: don't touch pci command register
On 2012-04-17 14:10, Michael S. Tsirkin wrote: > Real command register is under kernel control: > it includes bits for triggering SERR, marking > BARs as invalid and such which are all under host > kernel control. > > While there's no known bug this triggers - since qemu does its > best to make guest state match device state - > it seems safer to avoid touching this register as much as > possible. > > With this patch, we don't touch any bits > except bus master which is ok to put under guest control > and intx mask which kvm interrupt sharing machinery > explicitly allows. > > Note: PCI_STATUS bears looking into as well. > > Tested-by: Alex Williamson > Signed-off-by: Michael S. Tsirkin > --- > > Changes from v3: > - use pci_word_test_and_clear_mask to clear bits in a clean way > Changes from v2: > - whitespace fix > Changes from v1: > - fix intx mask > > hw/device-assignment.c | 20 ++-- > 1 files changed, 6 insertions(+), 14 deletions(-) > > diff --git a/hw/device-assignment.c b/hw/device-assignment.c > index 89823f1..17a3a93 100644 > --- a/hw/device-assignment.c > +++ b/hw/device-assignment.c > @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, > uint16_t r_seg, > FILE *f; > unsigned long long start, end, size, flags; > uint16_t id; > -struct stat statbuf; > PCIRegion *rp; > PCIDevRegions *dev = &pci_dev->real_device; > > @@ -610,12 +609,8 @@ again: > pci_dev->dev.config[2] = id & 0xff; > pci_dev->dev.config[3] = (id & 0xff00) >> 8; > > -/* dealing with virtual function device */ > -snprintf(name, sizeof(name), "%sphysfn/", dir); > -if (!stat(name, &statbuf)) { > -/* always provide the written value on readout */ > -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); > -} > +pci_word_test_and_clear_mask(pci_dev->emulate_config_write + PCI_COMMAND, > + PCI_COMMAND_MASTER | > PCI_COMMAND_INTX_DISABLE); > > dev->region_number = r; > return 0; > @@ -782,13 +778,9 @@ static int assign_device(AssignedDevice *dev) > "cause host memory 
corruption if the device issues DMA write > " > "requests!\n"); > } > if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK && > kvm_has_intx_set_mask()) { > assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; > - > -/* hide host-side INTx masking from the guest */ > -dev->emulate_config_read[PCI_COMMAND + 1] |= > -PCI_COMMAND_INTX_DISABLE >> 8; > } > > r = kvm_assign_pci_device(kvm_state, &assigned_dev_data); > @@ -1631,10 +1624,10 @@ static void reset_assigned_device(DeviceState *dev) > } > > /* > - * When a 0 is written to the command register, the device is logically > + * When a 0 is written to the bus master register, the device is > logically > * disconnected from the PCI bus. This avoids further DMA transfers. > */ > -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); > +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1); > } > > static int assigned_initfn(struct PCIDevice *pci_dev) > @@ -1658,7 +1651,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev) > * device initialization. > */ > assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE); > -assigned_dev_direct_config_read(dev, PCI_COMMAND, 2); > assigned_dev_direct_config_read(dev, PCI_STATUS, 2); > assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1); > assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3); Acked-by: Jan Kiszka -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
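For reference, the pci_word_test_and_clear_mask() call the patch relies on boils down to a little-endian 16-bit read-modify-write on a byte array (here the per-byte emulation mask at the PCI_COMMAND offset). A standalone sketch of that behavior — a simplification of qemu's helper, not a verbatim copy:

```c
#include <stdint.h>

/* Read a little-endian 16-bit word from a config-space-style byte
 * buffer, clear the bits in mask, write the word back, and return
 * the bits that were previously set. */
static uint16_t word_test_and_clear_mask(uint8_t *config, uint16_t mask)
{
    uint16_t val = (uint16_t)(config[0] | (config[1] << 8));
    uint16_t was_set = val & mask;

    val &= (uint16_t)~mask;
    config[0] = (uint8_t)(val & 0xff);
    config[1] = (uint8_t)(val >> 8);
    return was_set;
}
```

In the patch, clearing PCI_COMMAND_MASTER (0x0004) and PCI_COMMAND_INTX_DISABLE (0x0400) in emulate_config_write hands exactly those two command bits to the guest while everything else stays under host kernel control.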
Re: [PATCHv4] device-assignment: don't touch pci command register
On Tue, 2012-04-17 at 15:10 +0300, Michael S. Tsirkin wrote: > Real command register is under kernel control: > it includes bits for triggering SERR, marking > BARs as invalid and such which are all under host > kernel control. > > While there's no known bug this triggers - since qemu does its > best to make guest state match device state - > it seems safer to avoid touching this register as much as > possible. > > With this patch, we don't touch any bits > except bus master which is ok to put under guest control > and intx mask which kvm interrupt sharing machinery > explicitly allows. > > Note: PCI_STATUS bears looking into as well. > > Tested-by: Alex Williamson > Signed-off-by: Michael S. Tsirkin Acked-by: Alex Williamson > --- > > Changes from v3: > - use pci_word_test_and_clear_mask to clear bits in a clean way > Changes from v2: > - whitespace fix > Changes from v1: > - fix intx mask > > hw/device-assignment.c | 20 ++-- > 1 files changed, 6 insertions(+), 14 deletions(-) > > diff --git a/hw/device-assignment.c b/hw/device-assignment.c > index 89823f1..17a3a93 100644 > --- a/hw/device-assignment.c > +++ b/hw/device-assignment.c > @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, > uint16_t r_seg, > FILE *f; > unsigned long long start, end, size, flags; > uint16_t id; > -struct stat statbuf; > PCIRegion *rp; > PCIDevRegions *dev = &pci_dev->real_device; > > @@ -610,12 +609,8 @@ again: > pci_dev->dev.config[2] = id & 0xff; > pci_dev->dev.config[3] = (id & 0xff00) >> 8; > > -/* dealing with virtual function device */ > -snprintf(name, sizeof(name), "%sphysfn/", dir); > -if (!stat(name, &statbuf)) { > -/* always provide the written value on readout */ > -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); > -} > +pci_word_test_and_clear_mask(pci_dev->emulate_config_write + PCI_COMMAND, > + PCI_COMMAND_MASTER | > PCI_COMMAND_INTX_DISABLE); > > dev->region_number = r; > return 0; > @@ -782,13 +778,9 @@ static int 
assign_device(AssignedDevice *dev) > "cause host memory corruption if the device issues DMA write > " > "requests!\n"); > } > if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK && > kvm_has_intx_set_mask()) { > assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; > - > -/* hide host-side INTx masking from the guest */ > -dev->emulate_config_read[PCI_COMMAND + 1] |= > -PCI_COMMAND_INTX_DISABLE >> 8; > } > > r = kvm_assign_pci_device(kvm_state, &assigned_dev_data); > @@ -1631,10 +1624,10 @@ static void reset_assigned_device(DeviceState *dev) > } > > /* > - * When a 0 is written to the command register, the device is logically > + * When a 0 is written to the bus master register, the device is > logically > * disconnected from the PCI bus. This avoids further DMA transfers. > */ > -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); > +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1); > } > > static int assigned_initfn(struct PCIDevice *pci_dev) > @@ -1658,7 +1651,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev) > * device initialization. > */ > assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE); > -assigned_dev_direct_config_read(dev, PCI_COMMAND, 2); > assigned_dev_direct_config_read(dev, PCI_STATUS, 2); > assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1); > assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump
On 04/17/2012 02:44 AM, Avi Kivity wrote: On 04/11/2012 04:39 AM, zhangyanfei wrote: This patch set exports offsets of VMCS fields as note information for kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve runtime state of guest machine image, such as registers, in host machine's crash dump as VMCS format. The problem is that VMCS internal is hidden by Intel in its specification. So, we reverse engineering it in the way implemented in this patch set. Please note that this processing never affects any existing kvm logic. The VMCSINFO is exported via sysfs to kexec-tools just like VMCOREINFO. Here is an example: Processor: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz $cat /sys/kernel/vmcsinfo 1cba8c0 2000 crash> rd -p 1cba8c0 1000 1cba8c0: 127b0009 53434d56 {...VMCS 1cba8d0: 4f464e49 4e4f495349564552 INFOREVISION 1cba8e0: 49460a643d44495f 5f4e495028444c45 _ID=d.FIELD(PIN_ 1cba8f0: 4d565f4445534142 4f435f434558455f BASED_VM_EXEC_CO 1cba900: 303d294c4f52544e 0a30383130343831 NTROL)=01840180. 1cba910: 504328444c454946 5f44455341425f55 FIELD(CPU_BASED_ 1cba920: 5f434558455f4d56 294c4f52544e4f43 VM_EXEC_CONTROL) 1cba930: 393130343931303d 28444c4549460a30 =01940190.FIELD( 1cba940: 5241444e4f434553 4558455f4d565f59 SECONDARY_VM_EXE 1cba950: 4f52544e4f435f43 30346566303d294c C_CONTROL)=0fe40 1cba960: 4c4549460a306566 4958455f4d562844 fe0.FIELD(VM_EXI 1cba970: 4f52544e4f435f54 346531303d29534c T_CONTROLS)=01e4 1cba980: 4549460a30653130 4e455f4d5628444c 01e0.FIELD(VM_EN 1cba990: 544e4f435f595254 33303d29534c4f52 TRY_CONTROLS)=03 1cba9a0: 460a303133303431 45554728444c4549 140310.FIELD(GUE 1cba9b0: 45535f53455f5453 3d29524f5443454c ST_ES_SELECTOR)= 1cba9c0: 4549460a30303530 545345554728444c 0500.FIELD(GUEST 1cba9d0: 454c45535f53435f 35303d29524f5443 _CS_SELECTOR)=05 .. TODO: 1. In kexec-tools, get VMCSINFO via sysfs and dump it as note information into vmcore. 2. Dump VMCS region of each guest vcpu and VMCSINFO into qemu-process core file. 
To do this, we will modify the kernel core dumper, gdb gcore and crash gcore.

3. Dump the guest image from the qemu-process core file into a vmcore.

Taking a step back, can you describe the problem scenario you're fixing here?

Note that assuming that VMCS fields map to offsets within memory is not guaranteed to work for future processors. There is no guarantee that fields won't be compressed or stored with inverted bit ordering.

Regards,

Anthony Liguori
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 04/17/2012 07:15 PM, Jan Kiszka wrote:
> On 2012-04-17 14:06, Avi Kivity wrote:
>> On 04/17/2012 03:03 PM, Gleb Natapov wrote:
>>>>
>>>> KVM_MAX_VCPUS.
>>>>
>>> Ah, so you are worried about a malicious guest configuring the pit to
>>> broadcast to all its vcpus.
>>
>> Yes - it can introduce huge amounts of latency this way, which is exactly
>> what Jan is trying to prevent.
>>
>> Though I'm not sure spin_lock_irq() in the realtime tree actually
>> disables irqs (but it's certainly not a good idea in mainline; it's
>> nasty even with just the spinlock).
>
> This depends on how you declare the spin lock type - raw or normal. The
> former will disable irqs, the latter not even preemption (but become a
> mutex).

Yes (and I see no reason to use raw spinlocks here). Still, for mainline, are we okay with 254 IPIs? Maybe it's not so bad and I'm overinflating the problem.

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 2012-04-17 14:06, Avi Kivity wrote:
> On 04/17/2012 03:03 PM, Gleb Natapov wrote:
>>>
>>> KVM_MAX_VCPUS.
>>>
>> Ah, so you are worried about malicious guest configuring pit to
>> broadcast to all its vcpus.
>
> Yes - it can introduce huge amounts of latency this way which is exactly
> what Jan is trying to prevent.
>
> Though I'm not sure spin_lock_irq() in the realtime tree actually
> disables irqs (but it's certainly not a good idea in mainline; it's
> nasty even with just the spinlock).

This depends on how you declare the spin lock type - raw or normal. The former will disable irqs, the latter not even preemption (but becomes a mutex).

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [PATCH 00/13] KVM: MMU: fast page fault
On 04/17/2012 05:54 PM, Takuya Yoshikawa wrote:
> On Tue, 17 Apr 2012 15:41:39 +0300
> Avi Kivity wrote:
>
>>> Since there are many known algorithms to predict hot memory pages,
>>> the userspace will be able to tune the frequency of GET_DIRTY_LOG for such
>>> parts not to get too many faults repeatedly, if we can restrict the range
>>> of pages to protect.
>>>
>>> This is the fine-grained control.
>>
>> Do you want per-page control, or just range-based?
>
> Difficult question.
>
> For live migration, range-based control may be enough due to the locality
> of WWS.

What's WWS?

I don't see a reason to assume locality in gpa space (or in gva space either, for some workloads).

-- 
error compiling committee.c: too many arguments to function
Re: [PATCH 00/13] KVM: MMU: fast page fault
On Tue, 17 Apr 2012 15:41:39 +0300
Avi Kivity wrote:

>> Since there are many known algorithms to predict hot memory pages,
>> the userspace will be able to tune the frequency of GET_DIRTY_LOG for such
>> parts not to get too many faults repeatedly, if we can restrict the range
>> of pages to protect.
>>
>> This is the fine-grained control.
>
> Do you want per-page control, or just range-based?

Difficult question.

For live migration, range-based control may be enough due to the locality of WWS.

Thanks,
	Takuya
Re: [PATCH v2 07/16] KVM: MMU: introduce for_each_pte_list_spte
On Mon, 16 Apr 2012 11:36:25 +0800
Xiao Guangrong wrote:

> I tested it with kernbench, no regression is found.

Because kernbench is not at all a good test for this.

> It is not a problem since the iter and spte should be in the cache.

I have checked dirty-log-perf myself with this patch [01-07].

GET_DIRTY_LOG for 1GB of dirty pages has become more than 15% slower.

This is not a pure cleanup; it introduces a significant regression to rmap handling. This is the reason why I asked to remove "cleanup patches" from this patch series. Had it been really trivial cleanup, I would not have asked that.

Note: if you had checked the worst-case latency with this patch series, you should have noticed this regression by yourself.

Auto-test and kernbench are not enough to see the effect of this work.

Thanks,
	Takuya
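For a sense of the scale behind the 15% figure: KVM_GET_DIRTY_LOG reports one bit per guest page, so 1GB of 4KB pages is 262,144 bits whose sptes must be found via the rmap and write-protected on every iteration — which is why per-spte overhead in the rmap walk shows up so clearly in a dirty-log benchmark. A trivial consumer-side bitmap scan, for illustration only (not KVM code):

```c
#include <stddef.h>
#include <stdint.h>

/* Count set bits in a dirty bitmap laid out as 64-bit words, as a
 * userspace consumer of KVM_GET_DIRTY_LOG would when deciding which
 * pages to resend during live migration. nbits is assumed to be a
 * multiple of 64 for simplicity. */
static size_t count_dirty_pages(const uint64_t *bitmap, size_t nbits)
{
    size_t n = 0;

    for (size_t i = 0; i < nbits / 64; i++)
        n += (size_t)__builtin_popcountll(bitmap[i]);
    return n;
}
```

The userspace scan is cheap; the contested cost in this thread is the kernel-side rmap walk that re-protects each of those pages.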
Re: KVM call agenda for April, Tuesday 17
Juan Quintela wrote:
> Hi
>
> Please send in any agenda items you are interested in covering.

As there are no topics, we cancel today's call.

Happy hacking, Juan.
Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
On 4/17/2012 2:49 AM, Gleb Natapov wrote: On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote: On 4/16/2012 5:18 AM, Gleb Natapov wrote: On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote: On 04/11/2012 01:21 PM, Chegu Vinod wrote: Hello, While running an AIM7 (workfile.high_systime) in a single 40-way (or a single 60-way KVM guest) I noticed pretty bad performance when the guest was booted with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220 (RHEL6.2) kernel. For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest kernel was nearly 12x better ! How many CPUs your host has? 80 Cores on the DL980. (i.e. 8 Westmere sockets). So you are not oversubscribing CPUs at all. Are those real cores or including HT? HT is off. Do you have other cpus hogs running on the host while testing the guest? Nope. Sometimes I do run the utilities like "perf" or "sar" or "mpstat" on the numa node 0 (where the guest is not running). I was using numactl to bind the qemu of the 40-way guests to numa nodes : 4-7 ( or for a 60-way guest binding them to nodes 2-7) /etc/qemu-ifup tap0 numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7 /usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -enable-kvm \ -m 65536 -smp 40 \ -name vm1 -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \ -drive file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci .0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \ -monitor stdio \ -net nic,macaddr=<..mac_addr..> \ -net tap,ifname=tap0,script=no,downscript=no \ -vnc :4 /etc/qemu-ifdown tap0 I knew that there will be a few additional temporary qemu worker threads created... i.e. 
some over subscription will be there. 4 nodes above have 40 real cores, yes? Yes . Other than the qemu's related threads and some of the generic per-cpu Linux kernel threads (e.g. migration etc) there isn't anything else running on these Numa nodes. Can you try to run upstream kernel without binding at all and check the performance? I shall re-run and get back to you with this info. Typically for the native runs... binding the workload results in better numbers. Hence I choose to do the binding for the guest too...i.e. on the same numa nodes as the native case for virt. vs. native comparison purposes. Having said that ...In the past I had seen a couple of cases where the non-binded guest performed better than the native case. Need to re-run and dig into this further... Will have to retry by doing some explicit pinning of the vcpus to native cores (without using virsh). Turned on function tracing and found that there appears to be more time being spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220 guest. Looks like you may be running into the ticket spinlock code. During the early RHEL 6 days, Gleb came up with a patch to automatically disable ticket spinlocks when running inside a KVM guest. IIRC that patch got rejected upstream at the time, with upstream developers preferring to wait for a "better solution". If such a better solution is not on its way upstream now (two years later), maybe we should just merge Gleb's patch upstream for the time being? I think the pv spinlock that is actively discussed currently should address the issue, but I am not sure someone tests it against non-ticket lock in a guest to see which one performs better. I did see that discussion...seems to have originated from the Xen context. Yes, The problem is the same for both hypervisors. -- Gleb. 
Thanks Vinod -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump
On 04/17/2012 02:25 PM, Wen Congyang wrote: > > > >> > >> For scenario 2, we also want the guest's registers values to be dumped into > >> qemu process's core file when qemu process crashes. This is the task of > >> TODO-list 2. > > > > Why? If qemu crashed it is because of an internal qemu fault. If any > > guest registers were involved, they would have been decoded by qemu > > previously and would be present in the stack trace (for example mmio > > address/data). > > Hmm, IIRC, if qemu meets some critical error, it will call abort() or > assert(). > The guest registers are stored in the kernel, and qemu does not call > cpu_synchronize_state() to get guest registers. So I do not understand > why the registers would be present in the stack trace... There are two cases. One case is where the problem was not caused directly by guest action, for example a segmentation fault in the block layer or the VNC server. In this case the guest registers are immaterial. The other case is where the problem was directly caused by guest action, for example an mmio write to a device register triggered an error. In this case kvm emulates the mmio instruction and returns KVM_EXIT_MMIO; it can be seen in the kvm_run page. The address/data pair is propagated by the qemu memory core all the way to the device callback. So the instruction and register contents are unneeded for debugging the crash. Is there a scenario where the guest registers help towards debugging a qemu crash? -- error compiling committee.c: too many arguments to function
Re: [PATCHv2] device-assignment: don't touch real command register
On Tue, Apr 17, 2012 at 01:24:24PM +0200, Jan Kiszka wrote: > > +/* Pass bus master writes to device. */ > > +pci_dev->emulate_config_write[PCI_COMMAND] &= ~PCI_COMMAND_MASTER; > > +pci_dev->emulate_config_write[PCI_COMMAND + 1] &= > > ~(PCI_COMMAND_INTX_DISABLE >> 8); > > Comment doesn't fully match the code. Will fix. > And I bet you aren't using > checkpatch... ;) I didn't run it but it doesn't complain.
Re: [PATCH 00/13] KVM: MMU: fast page fault
On 04/17/2012 03:37 PM, Takuya Yoshikawa wrote: > On Tue, 17 Apr 2012 10:51:40 +0300 > Avi Kivity wrote: > > > That's true with the write protect everything approach we use now. But > > it's not true with range-based write protection, where you issue > > GET_DIRTY_LOG on a range of pages and only need to re-write-protect them. > > > > (the motivation for that is to decrease the time between GET_DIRTY_LOG > > and sending the page; as the time increases, the chances that the page > > got re-dirtied go up). > > Thank you for explaining this. > > I was planning to give the userspace more freedom. > > Since there are many known algorithms to predict hot memory pages, > the userspace will be able to tune the frequency of GET_DIRTY_LOG for such > parts not to get too many faults repeatedly, if we can restrict the range > of pages to protect. > > This is the fine-grained control. Do you want per-page control, or just range-based? -- error compiling committee.c: too many arguments to function
Re: [PATCH 00/13] KVM: MMU: fast page fault
On Tue, 17 Apr 2012 10:51:40 +0300 Avi Kivity wrote: > That's true with the write protect everything approach we use now. But > it's not true with range-based write protection, where you issue > GET_DIRTY_LOG on a range of pages and only need to re-write-protect them. > > (the motivation for that is to decrease the time between GET_DIRTY_LOG > and sending the page; as the time increases, the chances that the page > got re-dirtied go up). Thank you for explaining this. I was planning to give the userspace more freedom. Since there are many known algorithms to predict hot memory pages, the userspace will be able to tune the frequency of GET_DIRTY_LOG for such parts not to get too many faults repeatedly, if we can restrict the range of pages to protect. This is the fine-grained control. Thanks, Takuya
Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump
At 04/17/2012 06:59 PM, Avi Kivity Wrote: > On 04/17/2012 01:51 PM, zhangyanfei wrote: >> 于 2012年04月17日 15:44, Avi Kivity 写道: >>> On 04/11/2012 04:39 AM, zhangyanfei wrote: This patch set exports offsets of VMCS fields as note information for kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve runtime state of guest machine image, such as registers, in host machine's crash dump as VMCS format. The problem is that VMCS internal is hidden by Intel in its specification. So, we reverse engineering it in the way implemented in this patch set. Please note that this processing never affects any existing kvm logic. The VMCSINFO is exported via sysfs to kexec-tools just like VMCOREINFO. Here is an example: Processor: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz $cat /sys/kernel/vmcsinfo 1cba8c0 2000 crash> rd -p 1cba8c0 1000 1cba8c0: 127b0009 53434d56 {...VMCS 1cba8d0: 4f464e49 4e4f495349564552 INFOREVISION 1cba8e0: 49460a643d44495f 5f4e495028444c45 _ID=d.FIELD(PIN_ 1cba8f0: 4d565f4445534142 4f435f434558455f BASED_VM_EXEC_CO 1cba900: 303d294c4f52544e 0a30383130343831 NTROL)=01840180. 1cba910: 504328444c454946 5f44455341425f55 FIELD(CPU_BASED_ 1cba920: 5f434558455f4d56 294c4f52544e4f43 VM_EXEC_CONTROL) 1cba930: 393130343931303d 28444c4549460a30 =01940190.FIELD( 1cba940: 5241444e4f434553 4558455f4d565f59 SECONDARY_VM_EXE 1cba950: 4f52544e4f435f43 30346566303d294c C_CONTROL)=0fe40 1cba960: 4c4549460a306566 4958455f4d562844 fe0.FIELD(VM_EXI 1cba970: 4f52544e4f435f54 346531303d29534c T_CONTROLS)=01e4 1cba980: 4549460a30653130 4e455f4d5628444c 01e0.FIELD(VM_EN 1cba990: 544e4f435f595254 33303d29534c4f52 TRY_CONTROLS)=03 1cba9a0: 460a303133303431 45554728444c4549 140310.FIELD(GUE 1cba9b0: 45535f53455f5453 3d29524f5443454c ST_ES_SELECTOR)= 1cba9c0: 4549460a30303530 545345554728444c 0500.FIELD(GUEST 1cba9d0: 454c45535f53435f 35303d29524f5443 _CS_SELECTOR)=05 .. TODO: 1. In kexec-tools, get VMCSINFO via sysfs and dump it as note information into vmcore. 2. 
Dump VMCS region of each guest vcpu and VMCSINFO into qemu-process core file. To do this, we will modify kernel core dumper, gdb gcore and crash gcore. 3. Dump guest image from the qemu-process core file into a vmcore. >>> >>> Taking a step back, can you describe the problem scenario you're fixing >>> here? >>> >> Considering two scenarios below: >> 1. Host panics, guests running on that host will also be dumped into >> host's vmcore. >> 2. Qemu process is core dumped (by gdb gcore or kernel core dumper), and >> its corresponding guest will be included in the core file. >> >> We want to create the guest machine's crash dump from host machine's vmcore >> or qemu process's core file. Unfortunately, we cannot get the guest's >> registers >> values in both scenarios. >> >> For scenario 1, some key registers (CR0, CR3...) of the guest machine are >> stored >> in VMCS region. But VMCS internal is hidden by Intel specification. So this >> patch set aims to get offsets of fields in VMCS region and export it as note >> information for kdump. > > Okay. Do you expect it to help in debugging the crash? Did you have > cases where it would help? > >> >> For scenario 2, we also want the guest's registers values to be dumped into >> qemu process's core file when qemu process crashes. This is the task of >> TODO-list 2. > > Why? If qemu crashed it is because of an internal qemu fault. If any > guest registers were involved, they would have been decoded by qemu > previously and would be present in the stack trace (for example mmio > address/data). Hmm, IIRC, if qemu meets some critical error, it will call abort() or assert(). The guest registers are stored in the kernel, and qemu does not call cpu_synchronize_state() to get guest registers. So I do not understand why the registers would be present in the stack trace... Thanks Wen Congyang > >> Is this what you want? >> > > Yes. I'm trying to understand if the feature would be useful in real life.
Re: [PATCH v2 10/16] KVM: MMU: fast check whether page is writable
On 04/17/2012 03:41 PM, Avi Kivity wrote: > On 04/17/2012 06:55 AM, Xiao Guangrong wrote: >> On 04/16/2012 07:47 PM, Avi Kivity wrote: >> >>> On 04/16/2012 01:20 PM, Xiao Guangrong wrote: >> >> It is used to avoid the unnecessary overload > > It's overloading me :( > Sorry. >>> >>> The trick is to send those in separate patchset so the maintainer >>> doesn't notice. >>> >> >> >> Thanks for your suggestion, I will pay more attention to it in the >> future. >> >> For this patch, what did you mean by "those"? You mean the whole >> rmap.PTE_LIST_WP_BIT (fast check for shadow page table write protection >> and host write protection) or just host_page_write_protect >> (for KSM only)? > > All of it. Let's start with just modifying sptes concurrently and only > later add reading bits from rmap concurrently, if it proves necessary. > Okay, I agree. >> >> If we do not have rmap.PTE_LIST_WP_BIT, there may be a regression on >> the shadow mmu. >> >> Hmm, do I need to implement rmap.PTE_LIST_WP_BIT before fast page fault? > > Let's try to measure the effect without rmap.PTE_LIST_WP_BIT. Usually > PTE chains for page tables are short so the effect would be small. Of > course we can't tell about all guests. > It is not about the rmap's sptes; it is about sp.sync write protection: if a synced shadow page is written, the fast page fault path is triggered even with no migration and no framebuffer. I ran kernbench 10 times and took the average value, without xwindow: keep rmap.PTE_LIST_WP_BIT: 53.494 comment out the rmap.PTE_LIST_WP_BIT check in page_fault_can_be_fast: 53.948 Anyway, for ease of review, let's merge fast page fault first and discuss this in a separate patchset later.
[PATCHv4] device-assignment: don't touch pci command register
Real command register is under kernel control: it includes bits for triggering SERR, marking BARs as invalid and such which are all under host kernel control. While there's no known bug this triggers - since qemu does its best to make guest state match device state - it seems safer to avoid touching this register as much as possible. With this patch, we don't touch any bits except bus master which is ok to put under guest control and intx mask which kvm interrupt sharing machinery explicitly allows. Note: PCI_STATUS bears looking into as well. Tested-by: Alex Williamson Signed-off-by: Michael S. Tsirkin --- Changes from v3: - use pci_word_test_and_clear_mask to clear bits in a clean way Changes from v2: - whitespace fix Changes from v1: - fix intx mask hw/device-assignment.c | 20 ++-- 1 files changed, 6 insertions(+), 14 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 89823f1..17a3a93 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, uint16_t r_seg, FILE *f; unsigned long long start, end, size, flags; uint16_t id; -struct stat statbuf; PCIRegion *rp; PCIDevRegions *dev = &pci_dev->real_device; @@ -610,12 +609,8 @@ again: pci_dev->dev.config[2] = id & 0xff; pci_dev->dev.config[3] = (id & 0xff00) >> 8; -/* dealing with virtual function device */ -snprintf(name, sizeof(name), "%sphysfn/", dir); -if (!stat(name, &statbuf)) { -/* always provide the written value on readout */ -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); -} +pci_word_test_and_clear_mask(pci_dev->emulate_config_write + PCI_COMMAND, + PCI_COMMAND_MASTER | PCI_COMMAND_INTX_DISABLE); dev->region_number = r; return 0; @@ -782,13 +778,9 @@ static int assign_device(AssignedDevice *dev) "cause host memory corruption if the device issues DMA write " "requests!\n"); } if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK && kvm_has_intx_set_mask()) { assigned_dev_data.flags |= 
KVM_DEV_ASSIGN_PCI_2_3; - -/* hide host-side INTx masking from the guest */ -dev->emulate_config_read[PCI_COMMAND + 1] |= -PCI_COMMAND_INTX_DISABLE >> 8; } r = kvm_assign_pci_device(kvm_state, &assigned_dev_data); @@ -1631,10 +1624,10 @@ static void reset_assigned_device(DeviceState *dev) } /* - * When a 0 is written to the command register, the device is logically + * When a 0 is written to the bus master register, the device is logically * disconnected from the PCI bus. This avoids further DMA transfers. */ -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1); } static int assigned_initfn(struct PCIDevice *pci_dev) @@ -1658,7 +1651,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev) * device initialization. */ assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE); -assigned_dev_direct_config_read(dev, PCI_COMMAND, 2); assigned_dev_direct_config_read(dev, PCI_STATUS, 2); assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1); assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3); -- 1.7.9.111.gf3fb0 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 04/17/2012 03:03 PM, Gleb Natapov wrote: > > > > KVM_MAX_VCPUS. > > > Ah, so you are worried about malicious guest configuring pit to > broadcast to all its vcpus. Yes - it can introduce huge amounts of latency this way which is exactly what Jan is trying to prevent. Though I'm not sure spin_lock_irq() in the realtime tree actually disables irqs (but it's certainly not a good idea in mainline; it's nasty even with just the spinlock). -- error compiling committee.c: too many arguments to function
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On Tue, Apr 17, 2012 at 03:00:10PM +0300, Avi Kivity wrote: > On 04/17/2012 02:05 PM, Gleb Natapov wrote: > > On Tue, Apr 17, 2012 at 01:42:31PM +0300, Avi Kivity wrote: > > > On 04/17/2012 01:31 PM, Gleb Natapov wrote: > > > > On Tue, Apr 17, 2012 at 01:29:04PM +0300, Avi Kivity wrote: > > > > > On 04/17/2012 01:26 PM, Gleb Natapov wrote: > > > > > > > It isn't, since you need to send an IPI. > > > > > > > > > > > > > That is exactly what I forget whether you can send IPI from there :) > > > > > > Anyway this is another reason. > > > > > > > > > > > > > > > > Actually I was wrong. You can't smp_call_function_single() from irq > > > > > context (deadlocks if two vcpus do that), but we send a reschedule > > > > > interrupt. So it should work. > > > > > > > > > Ah, good point. So if we'll use irqsave versions of spinlocks we can > > > > drop kthread? > > > > > > Do we want 254 IPIs to be issued from irq context? Could be slow. > > > > > Where this number is coming from? > > KVM_MAX_VCPUS. > Ah, so you are worried about malicious guest configuring pit to broadcast to all its vcpus. > > > We can make the unicast case run from irq context and defer the > > > multicast to a thread. > > > > > We do not know if gsi is multicasted in the pit code though. > > > > We don't have to do it in the pit code. The .set method can decide > whether it wants a direct path or go through a thread (which thread? > passed as a parameter?) > Agree. -- Gleb.
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 04/17/2012 02:05 PM, Gleb Natapov wrote: > On Tue, Apr 17, 2012 at 01:42:31PM +0300, Avi Kivity wrote: > > On 04/17/2012 01:31 PM, Gleb Natapov wrote: > > > On Tue, Apr 17, 2012 at 01:29:04PM +0300, Avi Kivity wrote: > > > > On 04/17/2012 01:26 PM, Gleb Natapov wrote: > > > > > > It isn't, since you need to send an IPI. > > > > > > > > > > > That is exactly what I forget whether you can send IPI from there :) > > > > > Anyway this is another reason. > > > > > > > > > > > > > Actually I was wrong. You can't smp_call_function_single() from irq > > > > context (deadlocks if two vcpus do that), but we send a reschedule > > > > interrupt. So it should work. > > > > > > > Ah, good point. So if we'll use irqsave versions of spinlocks we can > > > drop kthread? > > > > Do we want 254 IPIs to be issued from irq context? Could be slow. > > > Where this number is coming from? KVM_MAX_VCPUS. > > We can make the unicast case run from irq context and defer the > > multicast to a thread. > > > We do not know if gsi is multicasted in the pit code though. > We don't have to do it in the pit code. The .set method can decide whether it wants a direct path or go through a thread (which thread? passed as a parameter?) -- error compiling committee.c: too many arguments to function
Re: [PATCH] KVM: x86: Run PIT work in own kthread
On 04/17/2012 02:18 PM, Jan Kiszka wrote: > On 2012-04-17 12:27, Avi Kivity wrote: > > On 04/16/2012 09:26 PM, Jan Kiszka wrote: > >> We can't run PIT IRQ injection work in the interrupt context of the host > >> timer. This would allow the user to influence the handler complexity by > >> asking for a broadcast to a large number of VCPUs. Therefore, this work > >> was pushed into workqueue context in 9d244caf2e. However, this prevents > >> prioritizing the PIT injection over other task as workqueues share > >> kernel threads. > >> > >> This replaces the workqueue with a kthread worker and gives that thread > >> a name in the format "kvm-pit/". That allows to > >> identify and adjust the kthread priority according to the VM process > >> parameters. > > > > Is this a new ABI? > > Yep. Scripts will use it, maybe even QEMU. Do you want this to appear in > api.txt? You already know the answer. Please make it explicit that the thread is optional. This way, if we gain real-time workqueue support, or eliminate it through some other trick, userspace won't break. -- error compiling committee.c: too many arguments to function
Re: [PATCHv2] device-assignment: don't touch real command register
On 2012-04-17 13:24, Jan Kiszka wrote: > On 2012-04-16 21:53, Michael S. Tsirkin wrote: >> Real command register is under kernel control: >> it includes bits for triggering SERR, marking >> BARs as invalid and such which are under host >> kernel control. Don't touch any except bus master >> which is ok to put under guest control and intx >> mask which kvm interrupt sharing machinery >> explicitly allows. >> >> Signed-off-by: Michael S. Tsirkin >> >> --- >> Compiled only, don't have a setup for assignment >> testing ATM. Is anyone interested enough to test >> and report? >> >> Changes from v1: >> - fix intx mask handling >> >> hw/device-assignment.c | 24 >> 1 files changed, 8 insertions(+), 16 deletions(-) >> >> diff --git a/hw/device-assignment.c b/hw/device-assignment.c >> index 89823f1..6cdcc17 100644 >> --- a/hw/device-assignment.c >> +++ b/hw/device-assignment.c >> @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, >> uint16_t r_seg, >> FILE *f; >> unsigned long long start, end, size, flags; >> uint16_t id; >> -struct stat statbuf; >> PCIRegion *rp; >> PCIDevRegions *dev = &pci_dev->real_device; >> >> @@ -610,12 +609,9 @@ again: >> pci_dev->dev.config[2] = id & 0xff; >> pci_dev->dev.config[3] = (id & 0xff00) >> 8; >> >> -/* dealing with virtual function device */ >> -snprintf(name, sizeof(name), "%sphysfn/", dir); >> -if (!stat(name, &statbuf)) { >> -/* always provide the written value on readout */ >> -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); >> -} >> +/* Pass bus master writes to device. */ >> +pci_dev->emulate_config_write[PCI_COMMAND] &= ~PCI_COMMAND_MASTER; >> +pci_dev->emulate_config_write[PCI_COMMAND + 1] &= >> ~(PCI_COMMAND_INTX_DISABLE >> 8); > > Comment doesn't fully match the code. And I bet you aren't using > checkpatch... 
;) > >> >> dev->region_number = r; >> return 0; >> @@ -782,13 +778,10 @@ static int assign_device(AssignedDevice *dev) >> "cause host memory corruption if the device issues DMA >> write " >> "requests!\n"); >> } >> -if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK && >> -kvm_has_intx_set_mask()) { >> -assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; >> >> -/* hide host-side INTx masking from the guest */ >> -dev->emulate_config_read[PCI_COMMAND + 1] |= >> -PCI_COMMAND_INTX_DISABLE >> 8; >> +if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK && >> +kvm_has_intx_set_mask()) { >> +assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; >> } >> >> r = kvm_assign_pci_device(kvm_state, &assigned_dev_data); >> @@ -1631,10 +1624,10 @@ static void reset_assigned_device(DeviceState *dev) >> } >> >> /* >> - * When a 0 is written to the command register, the device is logically >> + * When a 0 is written to the bus master register, the device is >> logically >> * disconnected from the PCI bus. This avoids further DMA transfers. >> */ >> -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); >> +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1); >> } >> >> static int assigned_initfn(struct PCIDevice *pci_dev) >> @@ -1658,7 +1651,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev) >> * device initialization. >> */ >> assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE); >> -assigned_dev_direct_config_read(dev, PCI_COMMAND, 2); >> assigned_dev_direct_config_read(dev, PCI_STATUS, 2); >> assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1); >> assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3); > > Patch looks otherwise fine to me. Err, I mean v3 on which I actually wanted to comment. 
Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: [PATCHv2] device-assignment: don't touch real command register
On 2012-04-16 21:53, Michael S. Tsirkin wrote: > Real command register is under kernel control: > it includes bits for triggering SERR, marking > BARs as invalid and such which are under host > kernel control. Don't touch any except bus master > which is ok to put under guest control and intx > mask which kvm interrupt sharing machinery > explicitly allows. > > Signed-off-by: Michael S. Tsirkin > > --- > Compiled only, don't have a setup for assignment > testing ATM. Is anyone interested enough to test > and report? > > Changes from v1: > - fix intx mask handling > > hw/device-assignment.c | 24 > 1 files changed, 8 insertions(+), 16 deletions(-) > > diff --git a/hw/device-assignment.c b/hw/device-assignment.c > index 89823f1..6cdcc17 100644 > --- a/hw/device-assignment.c > +++ b/hw/device-assignment.c > @@ -501,7 +501,6 @@ static int get_real_device(AssignedDevice *pci_dev, > uint16_t r_seg, > FILE *f; > unsigned long long start, end, size, flags; > uint16_t id; > -struct stat statbuf; > PCIRegion *rp; > PCIDevRegions *dev = &pci_dev->real_device; > > @@ -610,12 +609,9 @@ again: > pci_dev->dev.config[2] = id & 0xff; > pci_dev->dev.config[3] = (id & 0xff00) >> 8; > > -/* dealing with virtual function device */ > -snprintf(name, sizeof(name), "%sphysfn/", dir); > -if (!stat(name, &statbuf)) { > -/* always provide the written value on readout */ > -assigned_dev_emulate_config_read(pci_dev, PCI_COMMAND, 2); > -} > +/* Pass bus master writes to device. */ > +pci_dev->emulate_config_write[PCI_COMMAND] &= ~PCI_COMMAND_MASTER; > +pci_dev->emulate_config_write[PCI_COMMAND + 1] &= > ~(PCI_COMMAND_INTX_DISABLE >> 8); Comment doesn't fully match the code. And I bet you aren't using checkpatch... 
;) > > dev->region_number = r; > return 0; > @@ -782,13 +778,10 @@ static int assign_device(AssignedDevice *dev) > "cause host memory corruption if the device issues DMA write > " > "requests!\n"); > } > -if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK && > -kvm_has_intx_set_mask()) { > -assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; > > -/* hide host-side INTx masking from the guest */ > -dev->emulate_config_read[PCI_COMMAND + 1] |= > -PCI_COMMAND_INTX_DISABLE >> 8; > +if (dev->features & ASSIGNED_DEVICE_SHARE_INTX_MASK && > + kvm_has_intx_set_mask()) { > + assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; > } > > r = kvm_assign_pci_device(kvm_state, &assigned_dev_data); > @@ -1631,10 +1624,10 @@ static void reset_assigned_device(DeviceState *dev) > } > > /* > - * When a 0 is written to the command register, the device is logically > + * When a 0 is written to the bus master register, the device is > logically > * disconnected from the PCI bus. This avoids further DMA transfers. > */ > -assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 2); > +assigned_dev_pci_write_config(pci_dev, PCI_COMMAND, 0, 1); > } > > static int assigned_initfn(struct PCIDevice *pci_dev) > @@ -1658,7 +1651,6 @@ static int assigned_initfn(struct PCIDevice *pci_dev) > * device initialization. > */ > assigned_dev_emulate_config_read(dev, 0, PCI_CONFIG_SPACE_SIZE); > -assigned_dev_direct_config_read(dev, PCI_COMMAND, 2); > assigned_dev_direct_config_read(dev, PCI_STATUS, 2); > assigned_dev_direct_config_read(dev, PCI_REVISION_ID, 1); > assigned_dev_direct_config_read(dev, PCI_CLASS_PROG, 3); Patch looks otherwise fine to me. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: x86: Run PIT work in own kthread
On 2012-04-17 12:27, Avi Kivity wrote: > On 04/16/2012 09:26 PM, Jan Kiszka wrote: >> We can't run PIT IRQ injection work in the interrupt context of the host >> timer. This would allow the user to influence the handler complexity by >> asking for a broadcast to a large number of VCPUs. Therefore, this work >> was pushed into workqueue context in 9d244caf2e. However, this prevents >> prioritizing the PIT injection over other task as workqueues share >> kernel threads. >> >> This replaces the workqueue with a kthread worker and gives that thread >> a name in the format "kvm-pit/". That allows to >> identify and adjust the kthread priority according to the VM process >> parameters. > > Is this a new ABI? Yep. Scripts will use it, maybe even QEMU. Do you want this to appear in api.txt? > > Patch looks reasonable. > Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On Tue, Apr 17, 2012 at 01:42:31PM +0300, Avi Kivity wrote: > On 04/17/2012 01:31 PM, Gleb Natapov wrote: > > On Tue, Apr 17, 2012 at 01:29:04PM +0300, Avi Kivity wrote: > > > On 04/17/2012 01:26 PM, Gleb Natapov wrote: > > > > > It isn't, since you need to send an IPI. > > > > > > > > > That is exactly what I forget whether you can send IPI from there :) > > > > Anyway this is another reason. > > > > > > > > > > Actually I was wrong. You can't smp_call_function_single() from irq > > > context (deadlocks if two vcpus do that), but we send a reschedule > > > interrupt. So it should work. > > > > > Ah, good point. So if we'll use irqsave versions of spinlocks we can > > drop kthread? > > Do we want 254 IPIs to be issued from irq context? Could be slow. > Where this number is coming from? > We can make the unicast case run from irq context and defer the > multicast to a thread. > We do not know if gsi is multicasted in the pit code though. -- Gleb.
Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump
On 04/17/2012 01:51 PM, zhangyanfei wrote: > 于 2012年04月17日 15:44, Avi Kivity 写道: > > On 04/11/2012 04:39 AM, zhangyanfei wrote: > >> This patch set exports offsets of VMCS fields as note information for > >> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve > >> runtime state of guest machine image, such as registers, in host > >> machine's crash dump as VMCS format. The problem is that VMCS > >> internal is hidden by Intel in its specification. So, we reverse > >> engineering it in the way implemented in this patch set. Please note > >> that this processing never affects any existing kvm logic. The > >> VMCSINFO is exported via sysfs to kexec-tools just like VMCOREINFO. > >> > >> Here is an example: > >> Processor: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz > >> > >> $cat /sys/kernel/vmcsinfo > >> 1cba8c0 2000 > >> > >> crash> rd -p 1cba8c0 1000 > >> 1cba8c0: 127b0009 53434d56 {...VMCS > >> 1cba8d0: 4f464e49 4e4f495349564552 INFOREVISION > >> 1cba8e0: 49460a643d44495f 5f4e495028444c45 _ID=d.FIELD(PIN_ > >> 1cba8f0: 4d565f4445534142 4f435f434558455f BASED_VM_EXEC_CO > >> 1cba900: 303d294c4f52544e 0a30383130343831 NTROL)=01840180. 
> >> 1cba910: 504328444c454946 5f44455341425f55 FIELD(CPU_BASED_ > >> 1cba920: 5f434558455f4d56 294c4f52544e4f43 VM_EXEC_CONTROL) > >> 1cba930: 393130343931303d 28444c4549460a30 =01940190.FIELD( > >> 1cba940: 5241444e4f434553 4558455f4d565f59 SECONDARY_VM_EXE > >> 1cba950: 4f52544e4f435f43 30346566303d294c C_CONTROL)=0fe40 > >> 1cba960: 4c4549460a306566 4958455f4d562844 fe0.FIELD(VM_EXI > >> 1cba970: 4f52544e4f435f54 346531303d29534c T_CONTROLS)=01e4 > >> 1cba980: 4549460a30653130 4e455f4d5628444c 01e0.FIELD(VM_EN > >> 1cba990: 544e4f435f595254 33303d29534c4f52 TRY_CONTROLS)=03 > >> 1cba9a0: 460a303133303431 45554728444c4549 140310.FIELD(GUE > >> 1cba9b0: 45535f53455f5453 3d29524f5443454c ST_ES_SELECTOR)= > >> 1cba9c0: 4549460a30303530 545345554728444c 0500.FIELD(GUEST > >> 1cba9d0: 454c45535f53435f 35303d29524f5443 _CS_SELECTOR)=05 > >> .. > >> > >> TODO: > >> 1. In kexec-tools, get VMCSINFO via sysfs and dump it as note information > >> into vmcore. > >> 2. Dump VMCS region of each guest vcpu and VMCSINFO into qemu-process > >> core file. To do this, we will modify kernel core dumper, gdb gcore > >> and crash gcore. > >> 3. Dump guest image from the qemu-process core file into a vmcore. > >> > > > > Taking a step back, can you describe the problem scenario you're fixing > > here? > > > Considering two scenarios below: > 1. Host panics, guests running on that host will also be dumped into >host's vmcore. > 2. Qemu process is core dumped (by gdb gcore or kernel core dumper), and >its coresponding guest will be included in the core file. > > We want to create the guest machine's crash dump from host machine's vmcore > or qemu process's core file. Unfortunately, we cannot get the guest's > registers > values in both scenarios. > > For scenario 1, some key registers (CR0, CR3...) of the guest machine are > stored > in VMCS region. But VMCS internal is hidden by Intel specification. 
So this > patch set aims to get offsets of fields in VMCS region and export it as note > information for kdump. Okay. Do you expect it to help in debugging the crash? Did you have cases where it would help? > > For scenario 2, we also want the guest's register values to be dumped into > qemu process's core file when qemu process crashes. This is the task of > TODO-list 2. Why? If qemu crashed it is because of an internal qemu fault. If any guest registers were involved, they would have been decoded by qemu previously and would be present in the stack trace (for example mmio address/data). > Is this what you want? > Yes. I'm trying to understand if the feature would be useful in real life. -- error compiling committee.c: too many arguments to function
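For anyone eyeballing the `crash> rd` dump quoted in this thread by hand: each 64-bit word stores eight ASCII bytes of the VMCSINFO text least-significant byte first, which is why `53434d56` reads back as "VMCS". A minimal sketch of the decoding (the helper name is mine, not part of the patch set):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode one 64-bit little-endian word from the crash hex dump into
 * the eight ASCII bytes it stores, lowest byte first. */
static void decode_word(uint64_t w, char out[9])
{
    for (int i = 0; i < 8; i++)
        out[i] = (char)((w >> (8 * i)) & 0xff);
    out[8] = '\0';
}
```

For example, the word `4e4f495349564552` at 1cba8d0 decodes to "REVISION".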
Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump
于 2012年04月17日 15:44, Avi Kivity 写道: > On 04/11/2012 04:39 AM, zhangyanfei wrote: >> This patch set exports offsets of VMCS fields as note information for >> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve >> runtime state of guest machine image, such as registers, in host >> machine's crash dump as VMCS format. The problem is that VMCS >> internal is hidden by Intel in its specification. So, we reverse >> engineering it in the way implemented in this patch set. Please note >> that this processing never affects any existing kvm logic. The >> VMCSINFO is exported via sysfs to kexec-tools just like VMCOREINFO. >> >> Here is an example: >> Processor: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz >> >> $cat /sys/kernel/vmcsinfo >> 1cba8c0 2000 >> >> crash> rd -p 1cba8c0 1000 >> 1cba8c0: 127b0009 53434d56 {...VMCS >> 1cba8d0: 4f464e49 4e4f495349564552 INFOREVISION >> 1cba8e0: 49460a643d44495f 5f4e495028444c45 _ID=d.FIELD(PIN_ >> 1cba8f0: 4d565f4445534142 4f435f434558455f BASED_VM_EXEC_CO >> 1cba900: 303d294c4f52544e 0a30383130343831 NTROL)=01840180. >> 1cba910: 504328444c454946 5f44455341425f55 FIELD(CPU_BASED_ >> 1cba920: 5f434558455f4d56 294c4f52544e4f43 VM_EXEC_CONTROL) >> 1cba930: 393130343931303d 28444c4549460a30 =01940190.FIELD( >> 1cba940: 5241444e4f434553 4558455f4d565f59 SECONDARY_VM_EXE >> 1cba950: 4f52544e4f435f43 30346566303d294c C_CONTROL)=0fe40 >> 1cba960: 4c4549460a306566 4958455f4d562844 fe0.FIELD(VM_EXI >> 1cba970: 4f52544e4f435f54 346531303d29534c T_CONTROLS)=01e4 >> 1cba980: 4549460a30653130 4e455f4d5628444c 01e0.FIELD(VM_EN >> 1cba990: 544e4f435f595254 33303d29534c4f52 TRY_CONTROLS)=03 >> 1cba9a0: 460a303133303431 45554728444c4549 140310.FIELD(GUE >> 1cba9b0: 45535f53455f5453 3d29524f5443454c ST_ES_SELECTOR)= >> 1cba9c0: 4549460a30303530 545345554728444c 0500.FIELD(GUEST >> 1cba9d0: 454c45535f53435f 35303d29524f5443 _CS_SELECTOR)=05 >> .. >> >> TODO: >> 1. 
In kexec-tools, get VMCSINFO via sysfs and dump it as note information >> into vmcore. >> 2. Dump VMCS region of each guest vcpu and VMCSINFO into qemu-process >> core file. To do this, we will modify kernel core dumper, gdb gcore >> and crash gcore. >> 3. Dump guest image from the qemu-process core file into a vmcore. >> > > Taking a step back, can you describe the problem scenario you're fixing > here? > Considering two scenarios below: 1. Host panics, guests running on that host will also be dumped into host's vmcore. 2. Qemu process is core dumped (by gdb gcore or kernel core dumper), and its corresponding guest will be included in the core file. We want to create the guest machine's crash dump from host machine's vmcore or qemu process's core file. Unfortunately, we cannot get the guest's register values in either scenario. For scenario 1, some key registers (CR0, CR3...) of the guest machine are stored in VMCS region. But VMCS internal is hidden by Intel specification. So this patch set aims to get offsets of fields in VMCS region and export it as note information for kdump. For scenario 2, we also want the guest's register values to be dumped into qemu process's core file when qemu process crashes. This is the task of TODO-list 2. Is this what you want? Thanks Zhang Yanfei
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 04/17/2012 01:42 PM, Avi Kivity wrote: > On 04/17/2012 01:31 PM, Gleb Natapov wrote: > > On Tue, Apr 17, 2012 at 01:29:04PM +0300, Avi Kivity wrote: > > > On 04/17/2012 01:26 PM, Gleb Natapov wrote: > > > > > It isn't, since you need to send an IPI. > > > > > > > > > That is exactly what I forget whether you can send IPI from there :) > > > > Anyway this is another reason. > > > > > > > > > > Actually I was wrong. You can't smp_call_function_single() from irq > > > context (deadlocks if two vcpus do that), but we send a reschedule > > > interrupt. So it should work. > > > > > Ah, good point. So if we'll use irqsave versions of spinlocks we can > > drop kthread? > > Do we want 254 IPIs to be issued from irq context? Could be slow. > > We can make the unicast case run from irq context and defer the > multicast to a thread. (that should help eventfd and device assignment) -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 04/17/2012 01:31 PM, Gleb Natapov wrote: > On Tue, Apr 17, 2012 at 01:29:04PM +0300, Avi Kivity wrote: > > On 04/17/2012 01:26 PM, Gleb Natapov wrote: > > > > It isn't, since you need to send an IPI. > > > > > > > That is exactly what I forget whether you can send IPI from there :) > > > Anyway this is another reason. > > > > > > > Actually I was wrong. You can't smp_call_function_single() from irq > > context (deadlocks if two vcpus do that), but we send a reschedule > > interrupt. So it should work. > > > Ah, good point. So if we'll use irqsave versions of spinlocks we can > drop kthread? Do we want 254 IPIs to be issued from irq context? Could be slow. We can make the unicast case run from irq context and defer the multicast to a thread. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On Tue, Apr 17, 2012 at 01:29:04PM +0300, Avi Kivity wrote: > On 04/17/2012 01:26 PM, Gleb Natapov wrote: > > > It isn't, since you need to send an IPI. > > > > > That is exactly what I forget whether you can send IPI from there :) > > Anyway this is another reason. > > > > Actually I was wrong. You can't smp_call_function_single() from irq > context (deadlocks if two vcpus do that), but we send a reschedule > interrupt. So it should work. > Ah, good point. So if we'll use irqsave versions of spinlocks we can drop kthread? -- Gleb.
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 04/17/2012 01:26 PM, Gleb Natapov wrote: > > It isn't, since you need to send an IPI. > > > That is exactly what I forget whether you can send IPI from there :) > Anyway this is another reason. > Actually I was wrong. You can't smp_call_function_single() from irq context (deadlocks if two vcpus do that), but we send a reschedule interrupt. So it should work. -- error compiling committee.c: too many arguments to function
Re: [PATCH] KVM: x86: Run PIT work in own kthread
On 04/16/2012 09:26 PM, Jan Kiszka wrote: > We can't run PIT IRQ injection work in the interrupt context of the host > timer. This would allow the user to influence the handler complexity by > asking for a broadcast to a large number of VCPUs. Therefore, this work > was pushed into workqueue context in 9d244caf2e. However, this prevents > prioritizing the PIT injection over other task as workqueues share > kernel threads. > > This replaces the workqueue with a kthread worker and gives that thread > a name in the format "kvm-pit/". That allows to > identify and adjust the kthread priority according to the VM process > parameters. Is this a new ABI? Patch looks reasonable. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
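For context, the kthread_worker conversion described in the changelog would look roughly like this. This is a kernel-only sketch against the ~3.x kthread_worker API, not the actual patch; the struct fields and the pid lookup are illustrative:

```c
/* Not compiled here: sketch of replacing the PIT workqueue with a
 * dedicated, nameable worker thread. */
struct kvm_pit {
	struct kthread_worker worker;     /* replaces struct workqueue_struct * */
	struct task_struct *worker_task;  /* handle for renicing / RT priority */
	struct kthread_work expired;      /* replaces struct work_struct */
	/* ... */
};

/* at PIT creation */
init_kthread_worker(&pit->worker);
pit->worker_task = kthread_run(kthread_worker_fn, &pit->worker,
			       "kvm-pit/%d", pid_nr(pid));
init_kthread_work(&pit->expired, pit_do_work);

/* from the hrtimer callback */
queue_kthread_work(&pit->worker, &pit->expired);
```

Because the worker has a stable name and a task_struct of its own, management tools can chrt/renice it per VM — which anonymous workqueue threads don't allow. That per-VM thread name is the "new ABI" question Avi raises.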
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On Tue, Apr 17, 2012 at 01:23:52PM +0300, Avi Kivity wrote: > On 04/17/2012 12:31 PM, Gleb Natapov wrote: > > On Mon, Apr 16, 2012 at 06:33:26PM +0200, Jan Kiszka wrote: > > > On 2010-06-16 23:11, Chris Lalancette wrote: > > > > We really want to "kvm_set_irq" during the hrtimer callback, > > > > but that is risky because that is during interrupt context. > > > > Instead, offload the work to a workqueue, which is a bit safer > > > > and should provide most of the same functionality. > > > > > > Unfortunately, workqueues do not have fixed kthread associations (and > > > "kvm-pit-wq" would be too unspecific when running multiple VMs). So I > > > just realized that this subtly breaks the ability to run KVM guests with > > > RT priority (boot managers with timeouts hang as the workqueue starves). > > > > > > Before throwing some kthread_worker at this, could someone help me > > > recalling what was "risky" here? That the PIT IRQ may have to be > > > broadcast to a large number of VCPUs? I would offer to include this > > > information in my changelog. ;) > > > > > ioapic and pic irq_set functions uses spinlocks, but not irqsave > > variants. Also I am always forget whether is is safe to kick vcpu from > > irq context. > > It isn't, since you need to send an IPI. > That is exactly what I forget whether you can send IPI from there :) Anyway this is another reason. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] KVM updates for the 3.4 merge window
On 04/17/2012 12:34 PM, Alexander Graf wrote: >>> tree but his tree is in linux-next as well. There's no reason not to do >>> that. >>> >>> That way, his next branch gets linux-next coverage whether it's in my >>> tree or not, and I pull when I put the final powerpc-next together, >>> which gives me a chance to do a quick vet "just in case" and sort out >>> any major conflict before it all goes to Linus. >>> >> Sure, that works too. > Or do like I do with Kumar for FSL stuff... his stuff gets pulled via my > > Sounds even easier to me. So how do I get my tree into linux-next? Send a note to Stephen Rothwell (copied). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On 04/17/2012 12:31 PM, Gleb Natapov wrote: > On Mon, Apr 16, 2012 at 06:33:26PM +0200, Jan Kiszka wrote: > > On 2010-06-16 23:11, Chris Lalancette wrote: > > > We really want to "kvm_set_irq" during the hrtimer callback, > > > but that is risky because that is during interrupt context. > > > Instead, offload the work to a workqueue, which is a bit safer > > > and should provide most of the same functionality. > > > > Unfortunately, workqueues do not have fixed kthread associations (and > > "kvm-pit-wq" would be too unspecific when running multiple VMs). So I > > just realized that this subtly breaks the ability to run KVM guests with > > RT priority (boot managers with timeouts hang as the workqueue starves). > > > > Before throwing some kthread_worker at this, could someone help me > > recalling what was "risky" here? That the PIT IRQ may have to be > > broadcast to a large number of VCPUs? I would offer to include this > > information in my changelog. ;) > > > ioapic and pic irq_set functions uses spinlocks, but not irqsave > variants. Also I am always forget whether is is safe to kick vcpu from > irq context. It isn't, since you need to send an IPI. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote: > On 4/16/2012 5:18 AM, Gleb Natapov wrote: > >On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote: > >>On 04/11/2012 01:21 PM, Chegu Vinod wrote: > >>>Hello, > >>> > >>>While running an AIM7 (workfile.high_systime) in a single 40-way (or a > >>>single > >>>60-way KVM guest) I noticed pretty bad performance when the guest was > >>>booted > >>>with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220 > >>>(RHEL6.2) kernel. > >>>For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better > >>>than > >>>the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older > >>>guest > >>>kernel was nearly 12x better ! > >How many CPUs your host has? > > 80 Cores on the DL980. (i.e. 8 Westmere sockets). > So you are not oversubscribing CPUs at all. Are those real cores or including HT? Do you have other cpus hogs running on the host while testing the guest? > I was using numactl to bind the qemu of the 40-way guests to numa > nodes : 4-7 ( or for a 60-way guest > binding them to nodes 2-7) > > /etc/qemu-ifup tap0 > > numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7 > /usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu > Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme > -enable-kvm \ > -m 65536 -smp 40 \ > -name vm1 -chardev > socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait > \ > -drive > file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none > -device virtio-blk-pci,scsi=off,bus=pci > .0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \ > -monitor stdio \ > -net nic,macaddr=<..mac_addr..> \ > -net tap,ifname=tap0,script=no,downscript=no \ > -vnc :4 > > /etc/qemu-ifdown tap0 > > > I knew that there will be a few additional temporary qemu worker > threads created... i.e. some over > subscription will be there. 
> 4 nodes above have 40 real cores, yes? Can you try to run upstream kernel without binding at all and check the performance? > > Will have to retry by doing some explicit pinning of the vcpus to > native cores (without using virsh). > > >>>Turned on function tracing and found that there appears to be more time > >>>being > >>>spent around the lock code in the 3.3.1 guest when compared to the > >>>2.6.32-220 > >>>guest. > >>Looks like you may be running into the ticket spinlock > >>code. During the early RHEL 6 days, Gleb came up with a > >>patch to automatically disable ticket spinlocks when > >>running inside a KVM guest. > >> > >>IIRC that patch got rejected upstream at the time, > >>with upstream developers preferring to wait for a > >>"better solution". > >> > >>If such a better solution is not on its way upstream > >>now (two years later), maybe we should just merge > >>Gleb's patch upstream for the time being? > >I think the pv spinlock that is actively discussed currently should > >address the issue, but I am not sure someone tests it against non-ticket > >lock in a guest to see which one performs better. > > I did see that discussion...seems to have originated from the Xen context. > Yes, The problem is the same for both hypervisors. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL] KVM updates for the 3.4 merge window
On 04/17/2012 09:20 AM, Avi Kivity wrote: On 04/17/2012 02:05 AM, Benjamin Herrenschmidt wrote: On Mon, 2012-04-16 at 15:53 +0300, Avi Kivity wrote: kvm.git next is exposed to linux-next, where they get tested quite a lot. Granted it's mostly build testing, and people are unlikely to test kvm there, but they will test the non-kvm bits that creep in there. The alternative would be that I don't have a -next tree, just collect patches and immediately send them to Avi. That way the main kvm tree would be broken more often, but at least we don't get these horrible synchronization latencies. That works too. Don't post immediately; 2-3 week batches would reduce noise. Or do like I do with Kumar for FSL stuff... his stuff gets pulled via my tree but his tree is in linux-next as well. There's no reason not to do that. That way, his next branch gets linux-next coverage whether it's in my tree or not, and I pull when I put the final powerpc-next together, which gives me a chance to do a quick vet "just in case" and sort out any major conflict before it all goes to Linus. Sure, that works too. Sounds even easier to me. So how do I get my tree into linux-next? Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 1/3] Introduce a workqueue to deliver PIT timer interrupts.
On Mon, Apr 16, 2012 at 06:33:26PM +0200, Jan Kiszka wrote: > On 2010-06-16 23:11, Chris Lalancette wrote: > > We really want to "kvm_set_irq" during the hrtimer callback, > > but that is risky because that is during interrupt context. > > Instead, offload the work to a workqueue, which is a bit safer > > and should provide most of the same functionality. > > Unfortunately, workqueues do not have fixed kthread associations (and > "kvm-pit-wq" would be too unspecific when running multiple VMs). So I > just realized that this subtly breaks the ability to run KVM guests with > RT priority (boot managers with timeouts hang as the workqueue starves). > > Before throwing some kthread_worker at this, could someone help me > recalling what was "risky" here? That the PIT IRQ may have to be > broadcast to a large number of VCPUs? I would offer to include this > information in my changelog. ;) > ioapic and pic irq_set functions use spinlocks, but not the irqsave variants. Also, I always forget whether it is safe to kick a vcpu from irq context. -- Gleb.
Re: [PATCHv1 dont apply] RFC: kvm eoi PV using shared memory
On 04/16/2012 03:18 PM, Michael S. Tsirkin wrote: > On Mon, Apr 16, 2012 at 02:24:46PM +0300, Gleb Natapov wrote: > > On Mon, Apr 16, 2012 at 02:09:20PM +0300, Michael S. Tsirkin wrote: > > > Thanks very much for the review. I'll address the comments. > > > Some questions on your comments below. > > > > > > On Mon, Apr 16, 2012 at 01:08:07PM +0300, Gleb Natapov wrote: > > > > > @@ -37,6 +38,8 @@ > > > > > #define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01 > > > > > #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 > > > > > #define MSR_KVM_STEAL_TIME 0x4b564d03 > > > > > +#define MSR_KVM_EOI_EN 0x4b564d04 > > > > > +#define MSR_KVM_EOI_DISABLED 0x0L > > > > This is a valid gpa. Follow the other MSRs' example, i.e. align the address to, > > > > let's say, a dword, and use the lsb as the enable bit. > > > > > > We only need a single byte, since this is per-CPU - > > > it's better to save the memory, so no alignment is required. > > > An explicit disable msr would also address this, right? > > > > > We do not have a shortage of memory. > > Better make all MSRs work the same > > way. > > I agree it's nice to have EOI and ASYNC_PF look similar > but wasting memory is also bad. I'll ponder this some more. Wasting three bytes? > > BTW have you added the new MSR to the msrs_to_save array? I forgot to > > check. > > I didn't yet. Trying to understand how that will affect > cross-version migration - any input? It will just work. -- error compiling committee.c: too many arguments to function
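The convention Gleb is pointing at — used by e.g. MSR_KVM_ASYNC_PF_EN — packs an aligned guest physical address and an enable bit into one MSR value, so no separate "disabled" sentinel is needed. A minimal userspace sketch with dword alignment as he suggests (helper names are mine, not KVM's):

```c
#include <assert.h>
#include <stdint.h>

#define KVM_MSR_ENABLED 1ULL        /* lsb of the MSR value */

/* Pack a dword-aligned gpa plus the enable bit into one MSR value. */
static uint64_t msr_pack(uint64_t gpa)
{
    return (gpa & ~0x3ULL) | KVM_MSR_ENABLED;
}

static int msr_enabled(uint64_t v)  { return (int)(v & KVM_MSR_ENABLED); }
static uint64_t msr_gpa(uint64_t v) { return v & ~0x3ULL; }
```

Since gpa 0 with the enable bit clear is the natural "disabled" state, the extra MSR_KVM_EOI_DISABLED define becomes unnecessary.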
Re: [PATCHv1 dont apply] RFC: kvm eoi PV using shared memory
On 04/16/2012 01:08 PM, Gleb Natapov wrote: > > } > > > > + > > + if (kvm_para_has_feature(KVM_FEATURE_EOI)) { > > + kvm_guest_native_apic_write = apic->write; > > + apic->write = kvm_guest_apic_write; > > + wrmsrl(MSR_KVM_EOI_EN, __pa(&__get_cpu_var(apic_eoi))); > > + } > Can happen on more than one cpu. After it happens on two, > kvm_guest_apic_write() > will call itself. It's also hacky. Try static_cpu_has() in the apic code (and add an eoi method). -- error compiling committee.c: too many arguments to function
Re: [PATCH v3]KVM: PPC: Use clockevent multiplier and shifter for decrementer
On 04/17/2012 05:55 AM, Bharat Bhushan wrote: Time for which the hrtimer is started for decrementer emulation is calculated using tb_ticks_per_usec. While hrtimer uses the clockevent for DEC reprogramming (if needed) and which calculate timebase ticks using the multiplier and shifter mechanism implemented within clockevent layer. It was observed that this conversion (timebase->time->timebase) are not correct because the mechanism are not consistent. In our setup it adds 2% jitter. With this patch clockevent multiplier and shifter mechanism are used when starting hrtimer for decrementer emulation. Now the jitter is< 0.5%. Signed-off-by: Bharat Bhushan --- v3: - decrementer_clockevent symbol exported. arch/powerpc/include/asm/time.h |1 + arch/powerpc/kernel/time.c |3 ++- arch/powerpc/kvm/emulate.c |5 +++-- 3 files changed, 6 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h index 7eb10fb..b3c7959 100644 --- a/arch/powerpc/include/asm/time.h +++ b/arch/powerpc/include/asm/time.h @@ -28,6 +28,7 @@ extern unsigned long tb_ticks_per_jiffy; extern unsigned long tb_ticks_per_usec; extern unsigned long tb_ticks_per_sec; +extern struct clock_event_device decrementer_clockevent; struct rtc_time; extern void to_tm(int tim, struct rtc_time * tm); diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 567dd7c..4f7a285 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -105,7 +105,7 @@ static int decrementer_set_next_event(unsigned long evt, static void decrementer_set_mode(enum clock_event_mode mode, struct clock_event_device *dev); -static struct clock_event_device decrementer_clockevent = { +struct clock_event_device decrementer_clockevent = { .name = "decrementer", .rating = 200, .irq= 0, @@ -113,6 +113,7 @@ static struct clock_event_device decrementer_clockevent = { .set_mode = decrementer_set_mode, .features = CLOCK_EVT_FEAT_ONESHOT, }; 
+EXPORT_SYMBOL(decrementer_clockevent); DEFINE_PER_CPU(u64, decrementers_next_tb); static DEFINE_PER_CPU(struct clock_event_device, decrementers); diff --git a/arch/powerpc/kvm/emulate.c b/arch/powerpc/kvm/emulate.c index afc9154..c8b5206 100644 --- a/arch/powerpc/kvm/emulate.c +++ b/arch/powerpc/kvm/emulate.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include @@ -104,8 +105,8 @@ void kvmppc_emulate_dec(struct kvm_vcpu *vcpu) */ dec_time = vcpu->arch.dec; - dec_time *= 1000; - do_div(dec_time, tb_ticks_per_usec); Sorry for the late nitpicking, but could you please add a comment here that states that we can use the host's decrementer calculations because the guest's timebase ticks at the same speed? Alex + dec_time = dec_time<< decrementer_clockevent.shift; + do_div(dec_time, decrementer_clockevent.mult); dec_nsec = do_div(dec_time, NSEC_PER_SEC); hrtimer_start(&vcpu->arch.dec_timer, ktime_set(dec_time, dec_nsec), HRTIMER_MODE_REL); -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
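The conversion the patch adopts is the exact inverse of the clockevent layer's ns-to-ticks path, so the two directions now cancel instead of accumulating jitter through the intermediate tb_ticks_per_usec rounding. A userspace sketch of the arithmetic with toy values (the kernel derives mult/shift via its clocksource/clockevent calibration helpers; the 500 MHz timebase below is made up):

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* ns -> device ticks, as the clockevent layer programs the decrementer. */
static uint64_t ns_to_ticks(uint64_t ns, uint32_t mult, uint32_t shift)
{
    return (ns * mult) >> shift;     /* fine for the small toy values here */
}

/* ticks -> ns, as the patched kvmppc_emulate_dec() now computes it:
 * dec_time = (dec << shift) / mult. */
static uint64_t ticks_to_ns(uint64_t ticks, uint32_t mult, uint32_t shift)
{
    return (ticks << shift) / mult;  /* dec is 32 bits, so no overflow */
}
```

Because both directions use the same mult/shift pair, ns_to_ticks(ticks_to_ns(d)) returns d, which is what removes the jitter.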
Re: [PATCHv1 dont apply] RFC: kvm eoi PV using shared memory
On Mon, Apr 16, 2012 at 10:01:29PM +0300, Michael S. Tsirkin wrote: > On Mon, Apr 16, 2012 at 08:51:16PM +0300, Gleb Natapov wrote: > > > > Only when eoi is pending. This is rare. > > > > > > This is exactly while guest handles an interrupt. > > > It's not all that rare at all: e.g. device > > > drivers cause an exit from interrupt > > > handler by doing io. > > So eoi will be coalesced with io that device driver does. Exactly what > > we want. > > It won't. While we handle interrupts eoi is still set. > So there will be a couple of tests + read from userspace > Wasted not a huge overhead but it's the principle of the thing. > Linux acks irq before calling device handler. Windows for some devices does IO to a device before acking. -- Gleb.
Re: EuroSec'12 Presentation (ASLR reduces effect of KSM)
From: Marcelo Tosatti Subject: Re: EuroSec'12 Presentation (ASLR reduces effect of KSM) Date: Mon, 16 Apr 2012 21:29:31 -0300 > On Mon, Apr 16, 2012 at 03:52:10PM +0900, Kuniyasu Suzaki wrote: > > > > Marcelo, > > > > From: Marcelo Tosatti > > Subject: Re: EuroSec'12 Presentation (ASLR reduces effect of KSM) > > Date: Fri, 13 Apr 2012 21:47:47 -0300 > > > > > On Thu, Apr 12, 2012 at 08:24:57PM +0900, Kuniyasu Suzaki wrote: > > > > > > > > Dear, > > > > > > > > I made a presentation which measures OS security functions (ASLR, > > > > Memory Sanitization, and Cache Page Flushing) on memory deduplication > > > > "KSM with VKM" at EuroSec 2012. > > > > > > > > The title is "Effects of Memory Randomization, Sanitization and Page > > > > Cache on Memory Deduplication". > > > > # This is one of the papers related to my memory deduplication research. > > > > > > > > The slide is downloadable. > > > > > > > > http://www.slideshare.net/suzaki/eurosec2012-effects-of-memory-randomization-sanitization-and-page-cache-on-memory-deduplication-by-ksuzaki > > > > The paper will be downloadable from the ACM Digital Library. > > > > > > > > The results show ASLR reduces the effect of memory deduplication. > > > > Please tell me if you have comments. Thank you. > > > > > > > > -- > > > > Kuniyasu Suzaki, National Institute of Advanced Industrial Science > > > > and Technology, > > > > http://staff.aist.go.jp/k.suzaki > > > > > > Very nice. ASLR is supposed to increase the number of unshared pages > > > because translation tables that contain addresses of symbols will > > > differ for every instance of an executable. > > > > Thank you for the good suggestion. > > Anyway, how large are the translation tables? > > One entry per symbol that is accessed outside of the object or > main executable, one table per shared object (GOT and PLT tables). See > the ELF documentation. > > > In our experience, ASLR on 4 GuestOS (Linux) increased the memory > > consumption more than 50MB.
> > Does the translation table in a linux take more than 10MB? > > Increased memory consumption is due to larger number of pagetables > (which is necessary to cover larger virtual address space). > Increased number of unshared pages can be explained by translation > tables. GOT and PLT tables include virtual addresses, which are randomized by ASLR. The addresses are shown in /proc/PID/task/PID/maps. The pagetables are the tables that translate from virtual addresses to physical addresses. The memory pages which include the GOT and PLT tables and pagetables are not identical (unshared), and their size is not small. -- suzaki > > > Can you share additional information about "HICAMP (hardware memory > > > deduplication)" ? > > > > The detail of HICAMP was presented at ASPLOS 2011. > > ASPLOS 2011 paper http://dl.acm.org/citation.cfm?id=2151007&preflayout=tabs > > > > -- > > suzaki >
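Suzaki's point — that GOT/PLT pages hold absolute, per-process randomized addresses and therefore defeat page-level deduplication even when the code pages are identical — can be illustrated with a toy model (the symbol offsets and base addresses below are made up):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy "GOT page": absolute addresses of the same three symbols.
 * Two different ASLR bases make the page bytes differ, so a
 * byte-comparing deduplicator like KSM cannot merge the pages. */
static void fill_got(uint64_t aslr_base, uint64_t got[3])
{
    const uint64_t sym_off[3] = { 0x1040, 0x20a0, 0x3f10 };
    for (int i = 0; i < 3; i++)
        got[i] = aslr_base + sym_off[i];
}
```

With ASLR disabled (same base in every instance) the pages come out byte-identical and merge; with randomized bases they never do, which matches the measured drop in KSM sharing.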
Re: [PATCHv1 dont apply] RFC: kvm eoi PV using shared memory
On Mon, Apr 16, 2012 at 09:56:02PM +0300, Michael S. Tsirkin wrote: > On Mon, Apr 16, 2012 at 08:37:37PM +0300, Gleb Natapov wrote: > > On Mon, Apr 16, 2012 at 08:24:16PM +0300, Michael S. Tsirkin wrote: > > > On Mon, Apr 16, 2012 at 06:10:11PM +0300, Gleb Natapov wrote: > > > > > > lapic changes should be minimal. > > > > > > > > > > Exactly my motivation. > > > > > > > > > My patch removes 13 lines more :) > > > > > > Haven't checked whether your patch is correct yet > > > but I see it's checking the eoi register on each exit. > > > > > Only if eoi.pending == true on exit. > > it's checking eoi.pending always and eoi register when > that is true. > We already have if there, so we can reuse it. Instead of checking vapic_addr in kvm_lapic_sync_from_vapic() let it check apic_attention bitmask. vapic_addr and eoi.pending will set bit there and if() will become if(!apic_attention) return. Zero overhead. > > > I think it's clear this would make code a bit shorter > > > (not necessarily clearer as we are relying on ) but > > > as I said even though the extra overhead is likely > > > negligible I have a feeling it's a wrong approach since > > > this won't scale as we add more features. > > > > > > Let's see what others think. > > > > > What I do not like about not calling eoi here is that we > > have to reason about all the paths into apic and check that > > we clear isr on all of them. > > Not at all. Instead, check each time before we read isr > or ppr. Or inject interrupt, or ... Luckily lapic code calls apic_update_ppr() a lot (just in case), so adding a check there likely covers any relevant entry into the apic, but this is exactly the reasoning I would like to avoid if possible :) I do not see a bug per se with your current scheme, otherwise I would point it out. > > > And that is not all. eoi handler > > set event request, is it OK to skip it? May be, or may be not. > > I think it's for reinjection of lower prio interrupt. Me too.
But skipping setting that bit should not be taken lightly. We need to know for sure. Current code tries to err on safe side and prefers to set event request when not needed instead of missing it when it is needed. Those rare cases are hard to debug. > Since eoi optimization is disabled in that case we don't need to set > event request. > > > > > > I also find the logic easier to follow as is - > > > > > it is contained in lapic.c without relying > > > > > on being called from x86.c as just the right moment. > > > > > > > > > See the patch. It change nothing outside of lapic.c. > > > > > > True but you rely on x86.c to call kvm_sync_lapic_from_vapic > > > at the right time before injecting interrupts. > > Not before injecting interrupts, but on vmexit. > > sync is called on entry, not on exit, no? No. Look at the code. Entry is to late. > > > > > I haven't checked whether that is always the case but > > > to me, this makes the code less clear and more fragile. > > > > > We call a lot of apic functions from x86.c. That's were interrupt > > injection happens. > > > > > Again, it appears to be a matter of taste. > > > > > > -- > > > MST > > > > -- > > Gleb. -- Gleb. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: writeback cache + h700 controller w/1gb nvcache, corruption on power loss
On Mon, Apr 16, 2012 at 9:51 AM, Ron Edison wrote:
> I would be very interested in how to ensure the guests are sending
> flushes. I'm unfamiliar with the example you gave, where is that
> configured?

"mount -o barrier=1 /dev/sda /mnt" is a mount option for ext3 and ext4
file systems. You probably don't want this actually, since you have a
battery-backed RAID controller. See below for more.

> Primarily the guests are CentOS 4, 5 or 6. I am also curious if it
> would be advisable to switch to writethrough cache on each guest
> virtual disk and leave writeback enabled on the controller, and if
> that would adversely affect performance of the guests.

The most conservative modes are cache=writethrough (uses the host page
cache) and cache=directsync (does not use the host page cache). They
both ensure that every single write is flushed to disk, so they carry a
performance penalty. cache=directsync minimizes stress on host memory
because it bypasses the page cache.

Since you have a non-volatile cache in your RAID controller you can
also use cache=none. This also bypasses the host page cache but does
not flush every single write. The guest may still send flushes, but
even if it does not, the writes are going to the RAID controller's
non-volatile cache.

> The disk corruption experienced was indeed lost data -- an fsck was
> necessary for 4 of the guests to boot at all in RW mode; they first
> came up read-only. In the case of one of the guests there was actually
> file data lost after fsck was manually run in single-user mode after
> reboot. In some cases these were config files, in others database
> indexes, etc. This one of the 4 guests, the one with the most severe
> corruption, was not usable and we had to revert to a backup and pull
> as much current data out of it as possible.

Since you used QEMU -drive cache=writeback, data loss is expected on
host power failure. cache=writeback uses the (volatile) host page
cache, and therefore data may not have made it to the RAID controller
before power was lost.

Guest file system recovery - either a quick journal replay or a painful
fsck - is also expected on host power failure. The file systems are
dirty since the guest stopped executing without cleanly unmounting its
file systems. If you use cache=none or cache=directsync then you should
get a quick journal replay, and the risk of a painful fsck should be
reduced (most/all of the data will have been preserved).

Stefan
Re: Virtio network performance on Debian
On 12.04.2012 11:42, Hans-Kristian Bakke wrote:
> Hi
>
> For some reason I am not able to get good network performance using
> virtio/vhost-net on a Debian KVM host (perhaps also valid for Ubuntu
> hosts then).

The issue has been identified, after Hans-Kristian gave me access to
his machine and I did a lot of testing. And as usual, the root cause
was very stupid... ;)

In the last release of the Debian qemu-kvm package I changed the way
the Debian package version string propagates to the build procedure --
before, it was a patch grabbing values from debian/version, but in the
last release I used the --with-pkgversion configure flag. This resulted
in the following change:

- QEMU emulator version 1.0 (qemu-kvm-1.0 Debian 1.0+dfsg-8), Copyright (c) 2003-2008 Fabrice Bellard
+ QEMU emulator version 1.0 (Debian qemu-kvm 1.0+dfsg-9), Copyright (c) 2003-2008 Fabrice Bellard

As it turns out, libvirt parses `qemu -version' output and looks for
the " (qemu-kvm-" string in there, and if it is found, libvirt enables
some "extra" features. Important for us was setting the IFF_VNET_HDR
flag for the tap device -- apparently this flag makes a HUGE difference
in networking speed, especially when using vhost_net.

Obviously this is a change unique to Debian, and I never thought about
such an.. "interesting" effect it may give us. This is actually a
libvirt bug, since support for vnet_hdr can be determined by other
means, and since upstream qemu now has almost everything from qemu-kvm,
it wants to be fast too. But since qemu[-kvm] has a long history of
changing features, it is difficult to blame libvirt that much... Oh
well.

Thanks!

/mjt
Re: [PATCH 00/13] KVM: MMU: fast page fault
On 04/17/2012 09:26 AM, Xiao Guangrong wrote:
> On 04/16/2012 11:49 PM, Takuya Yoshikawa wrote:
> > Although O(1) is actually O(1) for the GET_DIRTY_LOG thread, it adds
> > some overhead to page fault handling. We may need to hold mmu_lock
> > for properly handling O(1)'s write protection, and ~500 write
> > protections will not be so cheap. And there is no answer to the
> > question of how to achieve slot-wise write protection.
>
> Actually no.
>
> We do not increase the overhead of page faults for migration. The
> number of page faults with O(1) is the same as write-protecting all
> sptes.

That's true with the write-protect-everything approach we use now. But
it's not true with range-based write protection, where you issue
GET_DIRTY_LOG on a range of pages and only need to re-write-protect
them. (The motivation for that is to decrease the time between
GET_DIRTY_LOG and sending the page; as that time increases, the chances
that the page got re-dirtied go up.)

That doesn't mean O(1) is unusable for this, just that it requires more
thought. Especially with direct maps, we can write-enable pages very
quickly.

> And, we can also avoid holding mmu_lock to write-protect PML4s; we can
> use a generation number, and notify the mmu to update its page table
> when dirty logging is enabled.

Generation numbers are also useful for O(1) invalidation.

> Anyway, no performance data, no truth. Let me implement it first.

--
error compiling committee.c: too many arguments to function
Re: [GIT PULL] KVM updates for the 3.4 merge window
On 04/17/2012 02:05 AM, Benjamin Herrenschmidt wrote:
> On Mon, 2012-04-16 at 15:53 +0300, Avi Kivity wrote:
> > kvm.git next is exposed to linux-next, where they get tested quite a
> > lot. Granted it's mostly build testing, and people are unlikely to
> > test kvm there, but they will test the non-kvm bits that creep in
> > there.
> >
> > > The alternative would be that I don't have a -next tree, just
> > > collect patches and immediately send them to Avi. That way the main
> > > kvm tree would be broken more often, but at least we don't get these
> > > horrible synchronization latencies.
> >
> > That works too. Don't post immediately; 2-3 week batches would reduce
> > noise.
>
> Or do like I do with Kumar for FSL stuff... his stuff gets pulled via
> my tree but his tree is in linux-next as well. There's no reason not to
> do that.
>
> That way, his next branch gets linux-next coverage whether it's in my
> tree or not, and I pull when I put the final powerpc-next together,
> which gives me a chance to do a quick vet "just in case" and sort out
> any major conflict before it all goes to Linus.

Sure, that works too.

--
error compiling committee.c: too many arguments to function
Re: [PATCH 0/4] Export offsets of VMCS fields as note information for kdump
On 04/11/2012 04:39 AM, zhangyanfei wrote:
> This patch set exports offsets of VMCS fields as note information for
> kdump. We call it VMCSINFO. The purpose of VMCSINFO is to retrieve the
> runtime state of a guest machine image, such as registers, from the
> host machine's crash dump in VMCS format. The problem is that the VMCS
> internals are hidden by Intel in its specification. So we reverse
> engineer them in the way implemented in this patch set. Please note
> that this processing never affects any existing kvm logic. The
> VMCSINFO is exported via sysfs to kexec-tools just like VMCOREINFO.
>
> Here is an example:
> Processor: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz
>
> $ cat /sys/kernel/vmcsinfo
> 1cba8c0 2000
>
> crash> rd -p 1cba8c0 1000
> 1cba8c0:  127b0009 53434d56                    {...VMCS
> 1cba8d0:  4f464e49 4e4f495349564552            INFOREVISION
> 1cba8e0:  49460a643d44495f 5f4e495028444c45    _ID=d.FIELD(PIN_
> 1cba8f0:  4d565f4445534142 4f435f434558455f    BASED_VM_EXEC_CO
> 1cba900:  303d294c4f52544e 0a30383130343831    NTROL)=01840180.
> 1cba910:  504328444c454946 5f44455341425f55    FIELD(CPU_BASED_
> 1cba920:  5f434558455f4d56 294c4f52544e4f43    VM_EXEC_CONTROL)
> 1cba930:  393130343931303d 28444c4549460a30    =01940190.FIELD(
> 1cba940:  5241444e4f434553 4558455f4d565f59    SECONDARY_VM_EXE
> 1cba950:  4f52544e4f435f43 30346566303d294c    C_CONTROL)=0fe40
> 1cba960:  4c4549460a306566 4958455f4d562844    fe0.FIELD(VM_EXI
> 1cba970:  4f52544e4f435f54 346531303d29534c    T_CONTROLS)=01e4
> 1cba980:  4549460a30653130 4e455f4d5628444c    01e0.FIELD(VM_EN
> 1cba990:  544e4f435f595254 33303d29534c4f52    TRY_CONTROLS)=03
> 1cba9a0:  460a303133303431 45554728444c4549    140310.FIELD(GUE
> 1cba9b0:  45535f53455f5453 3d29524f5443454c    ST_ES_SELECTOR)=
> 1cba9c0:  4549460a30303530 545345554728444c    0500.FIELD(GUEST
> 1cba9d0:  454c45535f53435f 35303d29524f5443    _CS_SELECTOR)=05
> ..
>
> TODO:
> 1. In kexec-tools, get VMCSINFO via sysfs and dump it as note
>    information into vmcore.
> 2. Dump the VMCS region of each guest vcpu and VMCSINFO into the
>    qemu-process core file. To do this, we will modify the kernel core
>    dumper, gdb gcore and crash gcore.
> 3. Dump the guest image from the qemu-process core file into a vmcore.

Taking a step back, can you describe the problem scenario you're fixing
here?

--
error compiling committee.c: too many arguments to function
Re: [PATCH v2 10/16] KVM: MMU: fask check whether page is writable
On 04/17/2012 06:55 AM, Xiao Guangrong wrote:
> On 04/16/2012 07:47 PM, Avi Kivity wrote:
> > On 04/16/2012 01:20 PM, Xiao Guangrong wrote:
> > > > > It is used to avoid the unnecessary overload
> > > >
> > > > It's overloading me :(
> > >
> > > Sorry.
> >
> > The trick is to send those in a separate patchset so the maintainer
> > doesn't notice.
>
> Thanks for your suggestion, I will pay more attention to it in the
> future.
>
> For this patch, what did you mean by "those"? Do you mean the whole of
> rmap.PTE_LIST_WP_BIT (fast check for shadow page table write
> protection and host write protection) or just host_page_write_protect
> (for KSM only)?

All of it. Let's start with just modifying sptes concurrently, and only
later add reading bits from the rmap concurrently, if it proves
necessary.

> If we do not have rmap.PTE_LIST_WP_BIT, there may be a regression on
> the shadow mmu.
>
> Hmm, do I need to implement rmap.PTE_LIST_WP_BIT, then fast page
> fault?

Let's try to measure the effect without rmap.PTE_LIST_WP_BIT. Usually
PTE chains for page tables are short, so the effect would be small. Of
course we can't tell about all guests.

--
error compiling committee.c: too many arguments to function
Re: [Qemu-devel] DOS VM problem with QEMU-KVM and newer kernels
On 04/17/2012 09:57 AM, Gleb Natapov wrote:
> > Status is as follows:
> > 1.) Kernel 3.4.x DIDN'T fix the problem
> > 2.) Reverting f1c1da2bde712812a3e0f9a7a7ebe7a916a4b5f4 FIXED the
> >     problem.
> >
> > So the bug is still present in 3.2, 3.3 and 3.4-rc, and a possible
> > fix doesn't work. Should be fixed in the 3.4 release.
> >
> > How to proceed further?
> >
> Can you post an image of your VM somewhere?

If you can't, then those traces would be a start.

--
error compiling committee.c: too many arguments to function
Re: [PATCH RFC V5 2/6] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
On 04/12/2012 05:59 AM, Marcelo Tosatti wrote:
> On Wed, Apr 11, 2012 at 09:06:29PM -0300, Marcelo Tosatti wrote:
> > On Fri, Mar 23, 2012 at 01:37:04PM +0530, Raghavendra K T wrote:
> > > From: Srivatsa Vaddagiri
> > > [...]
> > > 	barrier();
> >
> > Is it always OK to erase the notification, even in case an unrelated
> > event such as an interrupt was the source of the wakeup? Note I am
> > only asking whether it is OK to lose a notification, not requesting
> > a change to an atomic test-and-clear.

Yes.. got your point. IMO, this is the only (safe) place where it can
clear the kicked (pv_unhalted) flag, since the vcpu is going to be
runnable. And you are also right in having a concern about an unwanted
clear of the flag, since that would eventually result in vcpu/VM hangs.
Hope I did not miss anything.

> It would be nice to have a comment explaining it.

Sure, will do that.
Re: [PATCH RFC V5 3/6] kvm : Add unhalt msr to aid (live) migration
On 04/12/2012 05:45 AM, Marcelo Tosatti wrote:
> On Fri, Mar 23, 2012 at 01:37:26PM +0530, Raghavendra K T wrote:
> > From: Raghavendra K T
> > [...]
>
> Unless there is a reason to use an MSR, this should use a normal ioctl
> such as KVM_{GET,SET}_MP_STATE.

I agree with you. In the current implementation, since we are not doing
any communication between host/guest (on this flag), I too felt an MSR
is overkill for this.

IMO, a patch like the one below should do the job, which I am planning
to include in the next version of the patch. Let me know if you foresee
any side effects.

---
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aa44292..5c81a66 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5691,7 +5691,9 @@ int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 
 int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
 				    struct kvm_mp_state *mp_state)
 {
-	mp_state->mp_state = vcpu->arch.mp_state;
+	mp_state->mp_state = (vcpu->arch.mp_state == KVM_MP_STATE_HALTED &&
+			      vcpu->pv_unhalted) ?
+			     KVM_MP_STATE_RUNNABLE : vcpu->arch.mp_state;
 	return 0;
 }