Re: using cache for virtio allocations?

2012-05-02 Thread Sasha Levin
On Thu, May 3, 2012 at 7:29 AM, Michael S. Tsirkin  wrote:
> Sasha, didn't you have a patch to allocate
> things using cache in virtio core?
> What happened to it?
>
> Thanks,
> MST

It got stuck due to several things, and I got sidetracked, sorry. Here
are the outstanding issues:

1. Since a descriptor can now be allocated either with kmalloc or from
the cache, we need a new flag in vring_desc to tell the free path which
allocator was used. That seems a bit too intrusive, and I couldn't think
of a better alternative.

2. Rusty has pointed out that no one is going to modify the default
value we set, and we don't really have a good default value to put
there (at least, we haven't agreed on a specific value). Also, you
have noted that it should be a per-device value, which complicates
this question further since we probably want a different value for
each device type.

While the first one can be solved easily with a blessing from the
maintainers, the second one will require testing on various platforms,
configurations and devices to select either the best "magic" value or
the best algorithm for adjusting the threshold.
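
For illustration only, issue 1 boils down to something like the sketch
below. The flag bit, the desc_cache field and the threshold are made-up
placeholders, not the actual patch; the point is that the free path has
no other way to know which allocator produced the table.

static struct vring_desc *alloc_indirect_desc(struct vring_virtqueue *vq,
					      unsigned int num, gfp_t gfp)
{
	struct vring_desc *desc;

	/* Small tables come from a per-virtqueue kmem_cache (assumed field
	 * vq->desc_cache), large ones fall back to kmalloc(). */
	if (num <= vq->desc_cache_thresh) {
		desc = kmem_cache_alloc(vq->desc_cache, gfp);
		if (desc)
			desc[0].flags |= VRING_DESC_F_FROM_CACHE; /* made-up flag */
		return desc;
	}
	return kmalloc(num * sizeof(*desc), gfp);
}

static void free_indirect_desc(struct vring_virtqueue *vq, struct vring_desc *desc)
{
	/* The allocation detail leaks into the ring descriptor itself. */
	if (desc[0].flags & VRING_DESC_F_FROM_CACHE)
		kmem_cache_free(vq->desc_cache, desc);
	else
		kfree(desc);
}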


using cache for virtio allocations?

2012-05-02 Thread Michael S. Tsirkin
Sasha, didn't you have a patch to allocate
things using cache in virtio core?
What happened to it?

Thanks,
MST


Re: [PATCH 1/2] virtio-blk: Fix hot-unplug race in remove method

2012-05-02 Thread Michael S. Tsirkin
On Thu, May 03, 2012 at 10:19:52AM +0800, Asias He wrote:
> If we reset the virtio-blk device before the requests already dispatched
> to the virtio-blk driver from the block layer are finished, we will get
> stuck in blk_cleanup_queue() and the remove will fail.
> 
> blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
> before the DEAD marking. However, it will never succeed if the device is
> already stopped. We'll have q->in_flight[] > 0, so the drain will not
> finish.
> 
> How to reproduce the race:
> 1. hot-plug a virtio-blk device
> 2. keep reading/writing the device in guest
> 3. hot-unplug while the device is busy serving I/O
> 
> Signed-off-by: Asias He 

We used to do similar tracking in -net but dropped it all by using the
tracking that virtio core does.  Can't blk do the same?  Isn't there
some way to use virtqueue_detach_unused_buf for this instead?
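
For reference, a minimal sketch of that direction (illustrative only, not
a tested patch): after the device has been reset it no longer touches the
ring, so the driver can ask virtio core to hand back every buffer that was
never completed, instead of keeping its own in-flight counter.

/* Assumes the queue is stopped and vdev->config->reset(vdev) has been
 * called; queue-lock handling is omitted. */
static void virtblk_fail_unused_reqs(struct virtio_blk *vblk)
{
	struct virtblk_req *vbr;

	while ((vbr = virtqueue_detach_unused_buf(vblk->vq)) != NULL) {
		__blk_end_request_all(vbr->req, -EIO);
		mempool_free(vbr, vblk->pool);
	}
}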

> ---
>  drivers/block/virtio_blk.c |   26 ++
>  1 file changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 72fe55d..72b818b 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -46,6 +46,9 @@ struct virtio_blk
>   /* Ida index - used to track minor number allocations. */
>   int index;
>  
> + /* Number of pending requests dispatched to driver. */
> + int req_in_flight;
> +
>   /* Scatterlist: can be too big for stack. */
>   struct scatterlist sg[/*sg_elems*/];
>  };
> @@ -95,6 +98,7 @@ static void blk_done(struct virtqueue *vq)
>   }
>  
>   __blk_end_request_all(vbr->req, error);
> + vblk->req_in_flight--;
>   mempool_free(vbr, vblk->pool);
>   }
>   /* In case queue is stopped waiting for more buffers. */
> @@ -190,6 +194,7 @@ static void do_virtblk_request(struct request_queue *q)
>  
>   while ((req = blk_peek_request(q)) != NULL) {
>   BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
> + vblk->req_in_flight++;
>  
>   /* If this request fails, stop queue and wait for something to
>  finish to restart it. */
> @@ -443,7 +448,7 @@ static int __devinit virtblk_probe(struct virtio_device 
> *vdev)
>   if (err)
>   goto out_free_vblk;
>  
> - vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
> + vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
>   if (!vblk->pool) {
>   err = -ENOMEM;
>   goto out_free_vq;
> @@ -466,6 +471,7 @@ static int __devinit virtblk_probe(struct virtio_device 
> *vdev)
>  
>   virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
>  
> + vblk->req_in_flight = 0;
>   vblk->disk->major = major;
>   vblk->disk->first_minor = index_to_minor(index);
>   vblk->disk->private_data = vblk;
> @@ -576,22 +582,34 @@ static void __devexit virtblk_remove(struct 
> virtio_device *vdev)
>  {
>   struct virtio_blk *vblk = vdev->priv;
>   int index = vblk->index;
> + unsigned long flags;
> + int req_in_flight;
>  
>   /* Prevent config work handler from accessing the device. */
>   mutex_lock(&vblk->config_lock);
>   vblk->config_enable = false;
>   mutex_unlock(&vblk->config_lock);
>  
> + /* Abort all request on the queue. */
> + blk_abort_queue(vblk->disk->queue);
> + del_gendisk(vblk->disk);
> +
>   /* Stop all the virtqueues. */
>   vdev->config->reset(vdev);
> -
> + vdev->config->del_vqs(vdev);
>   flush_work(&vblk->config_work);
>  
> - del_gendisk(vblk->disk);
> + /* Wait requests dispatched to device driver to finish. */
> + do {
> + spin_lock_irqsave(&vblk->lock, flags);
> + req_in_flight = vblk->req_in_flight;
> + spin_unlock_irqrestore(&vblk->lock, flags);
> + } while (req_in_flight != 0);
> +
>   blk_cleanup_queue(vblk->disk->queue);
>   put_disk(vblk->disk);
> +
>   mempool_destroy(vblk->pool);
> - vdev->config->del_vqs(vdev);
>   kfree(vblk);
>   ida_simple_remove(&vd_index_ida, index);
>  }
> -- 
> 1.7.10


Re: [PATCH 1/2] virtio-blk: Fix hot-unplug race in remove method

2012-05-02 Thread Sasha Levin
On Thu, May 3, 2012 at 4:19 AM, Asias He  wrote:
> @@ -190,6 +194,7 @@ static void do_virtblk_request(struct request_queue *q)
>
>        while ((req = blk_peek_request(q)) != NULL) {
>                BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
> +               vblk->req_in_flight++;
>
>                /* If this request fails, stop queue and wait for something to
>                   finish to restart it. */

The counter is incremented before we know whether the request will
actually be sent, so if do_req() fails afterwards, req_in_flight has
been incremented even though the request is never submitted.

Which means we will never be able to unplug the device.
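
One possible way out (purely illustrative, not necessarily the fix that
was eventually applied) is to count the request only after it has really
been handed to the ring:

	while ((req = blk_peek_request(q)) != NULL) {
		BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);

		/* If this request fails, stop queue and wait for something to
		   finish to restart it. */
		if (!do_req(q, vblk, req)) {
			blk_stop_queue(q);
			break;
		}
		blk_start_request(req);
		vblk->req_in_flight++;	/* counted only after a successful dispatch */
		issued++;
	}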


[PATCH 2/2] virtio: Use ida to allocate virtio index

2012-05-02 Thread Asias He
Current index allocation in virtio is based on a monotonically
increasing variable "index". This means we'll run out of numbers
after a while, e.g. if someone on the host side keeps doing this:

while(1) {
hot-plug a virtio device
hot-unplug the virtio device
}

Signed-off-by: Asias He 
---
 drivers/virtio/virtio.c |   11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 984c501..f355807 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -2,9 +2,10 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Unique numbering for virtio devices. */
-static unsigned int dev_index;
+static DEFINE_IDA(virtio_index_ida);
 
 static ssize_t device_show(struct device *_d,
   struct device_attribute *attr, char *buf)
@@ -193,7 +194,11 @@ int register_virtio_device(struct virtio_device *dev)
dev->dev.bus = &virtio_bus;
 
/* Assign a unique device index and hence name. */
-   dev->index = dev_index++;
+   err = ida_simple_get(&virtio_index_ida, 0, 0, GFP_KERNEL);
+   if (err < 0)
+   goto out;
+
+   dev->index = err;
dev_set_name(&dev->dev, "virtio%u", dev->index);
 
/* We always start by resetting the device, in case a previous
@@ -208,6 +213,7 @@ int register_virtio_device(struct virtio_device *dev)
/* device_register() causes the bus infrastructure to look for a
 * matching driver. */
err = device_register(&dev->dev);
+out:
if (err)
add_status(dev, VIRTIO_CONFIG_S_FAILED);
return err;
@@ -217,6 +223,7 @@ EXPORT_SYMBOL_GPL(register_virtio_device);
 void unregister_virtio_device(struct virtio_device *dev)
 {
device_unregister(&dev->dev);
+   ida_simple_remove(&virtio_index_ida, dev->index);
 }
 EXPORT_SYMBOL_GPL(unregister_virtio_device);
 
-- 
1.7.10



[PATCH 1/2] virtio-blk: Fix hot-unplug race in remove method

2012-05-02 Thread Asias He
If we reset the virtio-blk device before the requests already dispatched
to the virtio-blk driver from the block layer are finished, we will get
stuck in blk_cleanup_queue() and the remove will fail.

blk_cleanup_queue() calls blk_drain_queue() to drain all requests queued
before the DEAD marking. However, it will never succeed if the device is
already stopped. We'll have q->in_flight[] > 0, so the drain will not
finish.

How to reproduce the race:
1. hot-plug a virtio-blk device
2. keep reading/writing the device in guest
3. hot-unplug while the device is busy serving I/O

Signed-off-by: Asias He 
---
 drivers/block/virtio_blk.c |   26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 72fe55d..72b818b 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -46,6 +46,9 @@ struct virtio_blk
/* Ida index - used to track minor number allocations. */
int index;
 
+   /* Number of pending requests dispatched to driver. */
+   int req_in_flight;
+
/* Scatterlist: can be too big for stack. */
struct scatterlist sg[/*sg_elems*/];
 };
@@ -95,6 +98,7 @@ static void blk_done(struct virtqueue *vq)
}
 
__blk_end_request_all(vbr->req, error);
+   vblk->req_in_flight--;
mempool_free(vbr, vblk->pool);
}
/* In case queue is stopped waiting for more buffers. */
@@ -190,6 +194,7 @@ static void do_virtblk_request(struct request_queue *q)
 
while ((req = blk_peek_request(q)) != NULL) {
BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
+   vblk->req_in_flight++;
 
/* If this request fails, stop queue and wait for something to
   finish to restart it. */
@@ -443,7 +448,7 @@ static int __devinit virtblk_probe(struct virtio_device 
*vdev)
if (err)
goto out_free_vblk;
 
-   vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
+   vblk->pool = mempool_create_kmalloc_pool(1, sizeof(struct virtblk_req));
if (!vblk->pool) {
err = -ENOMEM;
goto out_free_vq;
@@ -466,6 +471,7 @@ static int __devinit virtblk_probe(struct virtio_device 
*vdev)
 
virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
 
+   vblk->req_in_flight = 0;
vblk->disk->major = major;
vblk->disk->first_minor = index_to_minor(index);
vblk->disk->private_data = vblk;
@@ -576,22 +582,34 @@ static void __devexit virtblk_remove(struct virtio_device 
*vdev)
 {
struct virtio_blk *vblk = vdev->priv;
int index = vblk->index;
+   unsigned long flags;
+   int req_in_flight;
 
/* Prevent config work handler from accessing the device. */
mutex_lock(&vblk->config_lock);
vblk->config_enable = false;
mutex_unlock(&vblk->config_lock);
 
+   /* Abort all request on the queue. */
+   blk_abort_queue(vblk->disk->queue);
+   del_gendisk(vblk->disk);
+
/* Stop all the virtqueues. */
vdev->config->reset(vdev);
-
+   vdev->config->del_vqs(vdev);
flush_work(&vblk->config_work);
 
-   del_gendisk(vblk->disk);
+   /* Wait requests dispatched to device driver to finish. */
+   do {
+   spin_lock_irqsave(&vblk->lock, flags);
+   req_in_flight = vblk->req_in_flight;
+   spin_unlock_irqrestore(&vblk->lock, flags);
+   } while (req_in_flight != 0);
+
blk_cleanup_queue(vblk->disk->queue);
put_disk(vblk->disk);
+
mempool_destroy(vblk->pool);
-   vdev->config->del_vqs(vdev);
kfree(vblk);
ida_simple_remove(&vd_index_ida, index);
 }
-- 
1.7.10



Re: Windows XP + Virtio

2012-05-02 Thread skennedy
> On Wednesday, May 02, 2012 05:54:50 PM Sean Kennedy wrote:
>> On May 2, 2012, at 6:56 AM, Vadim Rozenfeld wrote:
>> > On Wednesday, May 02, 2012 02:33:49 AM Sean Kennedy wrote:
>> >> I am getting crashes (BSoD) when using Virtio for the disk driver in
>> >> Windows XP.
>> >>
>> >> It boots fine, it seems to run okay most of the time, but whenever
>> the
>> >> disk begins to get taxed, 9 times out of 10 it will start locking up
>> >> then eventually crash with a BSoD about virtio.sys.
>> >
>> > Hi Sean,
>> > Can you tell me the bugcheck code and viostor version?
>> > Thank you,
>> > Vadim.
>>
>> I'm using virtio-win-0.1-22, it looks like viostor.sys is version
>> 02/13/2012,51.63.103.2200.
>>
>
> Could you please try the more recent one, available at
> http://people.redhat.com/vrozenfe/build-26/virtio-win-prewhql-0.1.zip ?

This version seems to be working just fine.  ICMP response times don't go
through the roof during high I/O, and so far no BSoDs.  Hopefully these
newer drivers are put on the KVM wiki soon ;)

Thanks!

>
>> The BSoD is telling me 'IRQL_NOT_LESS_OR_EQUAL'.
>>
>> >> Here is the environment:
>> >>
>> >> VM Host is a CentOS 6 server running qemu-kvm-0.12.1.2-2.209 with
>> Kernel
>> >> version 2.6.32-220.13.1.el6.x86_64.  It's a dual quad-core Xeon with
>> 24
>> >> gigs of ram.
>> >>
>> >> It's connected to backend storage via 2 gigabit ethernet connections.
>>  I
>> >> have created a raw 20gig LVM block device for this XP machine that is
>> >> exported over iSCSI.
>> >>
>> >> The VM Host is running device-mapper-multipath to utilize both
>> ethernet
>> >> connections to the SAN.
>> >>
>> >> When I run a disk benchmark tool on the XP machine, the ICMP
>> responses
>> >> from the box start going through the roof, and even drop off.  It
>> >> usually bluescreens during the test.
>> >>
>> >> I have eliminated multipathd and setup the XP virt machine to just
>> use
>> >> the iSCSI /dev/disk/by-id/ block directly, and it still behaves this
>> >> way.
>> >>
>> >> If I set the machine to use IDE instead of Virtio, it's certainly
>> >> slower, but the machine never crashes and when running I/O
>> benchmarks,
>> >> pings stay solid as they should, this is while still using multipathd
>> >> and iSCSI to the storage server.
>> >>
>> >> Have I setup virtio incorrectly?  How would you go about finding the
>> >> real issue?
>> >>
>> >> Here is the virt machine's XML (using IDE for disk currently):
>> >>
>> >> 
>> >>
>> >>  Apollo
>> >>  d32041b8-853e-e679-edce-2b1f3db55e8a
>> >>  4194304
>> >>  4194304
>> >>  2
>> >>  
>> >>
>> >>hvm
>> >>
>> >>
>> >>  
>> >>  
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>  
>> >>  
>> >>
>> >>
>> >>
>> >>  
>> >>  destroy
>> >>  restart
>> >>  restart
>> >>  
>> >>
>> >>/usr/libexec/qemu-kvm
>> >>
>> >>
>> >>  
>> >>  
>> >>  
>> >>  
>> >>  
>> >>
>> >>
>> >>
>> >>
>> >>  
>> >>  
>> >>  
>> >>  
>> >>  
>> >>
>> >>
>> >>
>> >>
>> >>  
>> >>  > >>
>> >> function='0x1'/> 
>> >>
>> >>
>> >>
>> >>  
>> >>  
>> >>  
>> >>  
>> >>  
>> >>  > >>
>> >> function='0x0'/> 
>> >>
>> >>
>> >>
>> >>  
>> >>  
>> >>  
>> >>
>> >>
>> >>
>> >>
>> >>  
>> >>  
>> >>  
>> >>
>> >>
>> >>
>> >>
>> >>  
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>  
>> >>  
>> >>  > >>
>> >> function='0x0'/> 
>> >>
>> >>
>> >>
>> >>  
>> >>  > >>
>> >> function='0x0'/> 
>> >>
>> >>  
>> >>
>> >> 
>> >>
>> >>
>> >> Thanks,
>> >> Sean--
>> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> >> the body of a message to majord...@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
>> > the body of a message to majord...@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



Re: [PATCH 1/1] kvm-s390x: implement KVM_CAP_NR/MAX_VCPUS

2012-05-02 Thread Marcelo Tosatti
On Wed, May 02, 2012 at 10:50:38AM +0200, Christian Borntraeger wrote:
> Let userspace know the number of max and supported cpus for kvm on s390.
> Return KVM_MAX_VCPUS (currently 64) for both values.
> 
> Signed-off-by: Christian Borntraeger 
> ---
>  arch/s390/kvm/kvm-s390.c |4 
>  1 file changed, 4 insertions(+)

Applied, thanks.



Re: [PATCH v4 06/10] KVM: MMU: fast path of handling guest page fault

2012-05-02 Thread Takuya Yoshikawa
On Wed, 02 May 2012 13:39:51 +0800
Xiao Guangrong  wrote:

> > Was the problem really mmu_lock contention?

> Takuya, i am so tired to argue the advantage of lockless write-protect
> and lockless O(1) dirty-log again and again.

You are missing my point.  Please do not take my comments as an objection
to your whole work: why do you feel that way?

I thought that your new fast-page-fault path was fast and optimized
the guest during dirty logging.

So in this v4, you might get a similar result even before dropping
mmu_lock (that is, without 07/10), if the problem Marcelo explained was not there.


Of course there is a problem of mmu_lock contention.  What I am suggesting
is to split that problem and do measurement separately so that part of
your work can be merged soon.

Your guest size and workload were too small to make get_dirty hold mmu_lock
for a long time.  If you want to demonstrate the real value of lock-less, you
need to do another measurement.


But this is your work and it's up to you.  Although I was thinking of helping
with your measurements, I cannot do that knowing that you would not
welcome my help.


> > Although I am not certain about what will be really needed in the
> > final form, if this kind of maybe-needed-overhead is going to be
> > added little by little, I worry about possible regression.

> Well, will you suggest Linus to reject all patches and stop
> all discussion for the "possible regression" reason?

My concern was for Marcelo's examples, not your current implementation.
If you can show explicitely what will be needed in the final form,
I do not have any concern.


Sorry for disturbing.

Thanks,
Takuya


Re: [PATCH] KVM: PPC: Book3S HV: Make the guest MMU hash table size configurable

2012-05-02 Thread Paul Mackerras
On Wed, May 02, 2012 at 02:52:14PM +0200, Alexander Graf wrote:

> >+/*
> >+ * If the user wants a different size from default,
> >+ * try first to allocate it from the kernel page allocator.
> >+ */
> >+hpt = 0;
> >+if (order != kvm_hpt_order) {
> > hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|
> >-   __GFP_NOWARN, HPT_ORDER - PAGE_SHIFT);
> >+   __GFP_NOWARN, order - PAGE_SHIFT);
> 
> I like the idea quite a lot, but shouldn't we set a max memory
> threshold through a kernel module option that a user space program
> could pin this way?

The page allocator will hand out at most 16MB with the default setting
of CONFIG_FORCE_MAX_ZONEORDER.  Any larger request will fail, and
userspace can only do one allocation per VM.

Of course, userspace can start hundreds of VMs and use up memory that
way - but that is exactly the same situation as we have already, where
we allocate 16MB per VM at VM creation time, so this patch makes
things no worse in that regard.

> >+long kvmppc_alloc_reset_hpt(struct kvm *kvm, u32 *htab_orderp)
> >+{
> >+long err = -EBUSY;
> >+long order;
> >+
> >+mutex_lock(&kvm->lock);
> >+if (kvm->arch.rma_setup_done) {
> >+kvm->arch.rma_setup_done = 0;
> >+/* order rma_setup_done vs. vcpus_running */
> >+smp_mb();
> >+if (atomic_read(&kvm->arch.vcpus_running)) {
> 
> Is this safe? What if we're running one vcpu thread and one hpt
> reset thread. The vcpu thread starts, is just before the vcpu_run
> function, the reset thread checks if anything is running, nothing is
> running, the vcpu thread goes on to do its thing and boom things
> break.

No - in the situation you describe, the vcpu thread will see that
kvm->arch.rma_setup_done is clear and call kvmppc_hv_setup_htab_rma(),
where the first thing it does is mutex_lock(&kvm->lock), and it sits
there until the reset thread has finished and unlocked the
mutex.  The vcpu thread has to see rma_setup_done clear since the
reset thread clears it and then has a barrier, before testing
vcpus_running.

If on the other hand the vcpu thread gets there first and sees that
kvm->arch.rma_setup_done is still set, then the reset thread must see
that kvm->arch.vcpus_running is non-zero, since the vcpu thread first
increments vcpus_running and then tests rma_setup_done, with a barrier
in between.
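
Schematically, the ordering being described looks like this (not the
literal kernel code):

	/* vcpu entry path                      reset (ioctl) path
	 *
	 * atomic_inc(&arch.vcpus_running);     arch.rma_setup_done = 0;
	 * smp_mb();                            smp_mb();
	 * if (!arch.rma_setup_done)            if (atomic_read(&arch.vcpus_running))
	 *         mutex_lock(&kvm->lock); ...          return -EBUSY;
	 */

Each side stores its own flag before loading the other's, so at least one
of the two loads must observe the other side's store: either the vcpu
blocks on kvm->lock, or the reset bails out.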

> But then again, what's the problem with having vcpus running while
> we're clearing the HPT? Stale TLB entries?

If a vcpu is running, it could possibly call kvmppc_h_enter() to set a
HPTE at the exact same time that we are clearing that HPTE, and we
could end up with inconsistent results.  I didn't want to have to add
extra locking in kvmppc_h_enter() to make sure that couldn't happen,
since kvmppc_h_enter() is quite a hot path.  Instead I added a little
bit of overhead on the vcpu entry/exit, since that happens much more
rarely.  In any case, if we're resetting the VM, there shouldn't be
any vcpus running.

Your comment has however reminded me that I need to arrange for a full
TLB flush to occur the first time we run each vcpu after the reset.
I'll do a new version that includes that.

Paul.


[PATCH] kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM

2012-05-02 Thread Alexander Graf
From: Benjamin Herrenschmidt 

There is nothing in the code for emulating TCE tables in the kernel
that prevents it from working on "PR" KVM... other than ifdef's and
location of the code.

This fixes up the ifdefs and moves the bulk of the code to a new file
called book3s_64_vio.c.

This speeds things up a bit on my G5.

Signed-off-by: Benjamin Herrenschmidt 
[agraf: fix for hv kvm, whitespace]
Signed-off-by: Alexander Graf 
---
 arch/powerpc/include/asm/kvm_host.h |4 +-
 arch/powerpc/include/asm/kvm_ppc.h  |2 +
 arch/powerpc/kvm/Makefile   |2 +
 arch/powerpc/kvm/book3s_64_vio.c|  150 +++
 arch/powerpc/kvm/book3s_64_vio_hv.c |3 +
 arch/powerpc/kvm/book3s_hv.c|  109 -
 arch/powerpc/kvm/book3s_pr.c|3 +
 arch/powerpc/kvm/book3s_pr_papr.c   |   18 
 arch/powerpc/kvm/powerpc.c  |8 ++-
 9 files changed, 187 insertions(+), 112 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_64_vio.c

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 42a527e..d848cdc 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -237,7 +237,6 @@ struct kvm_arch {
unsigned long vrma_slb_v;
int rma_setup_done;
int using_mmu_notifiers;
-   struct list_head spapr_tce_tables;
spinlock_t slot_phys_lock;
unsigned long *slot_phys[KVM_MEM_SLOTS_NUM];
int slot_npages[KVM_MEM_SLOTS_NUM];
@@ -245,6 +244,9 @@ struct kvm_arch {
struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
struct kvmppc_linear_info *hpt_li;
 #endif /* CONFIG_KVM_BOOK3S_64_HV */
+#ifdef CONFIG_PPC_BOOK3S_64
+   struct list_head spapr_tce_tables;
+#endif
 };
 
 /*
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 7f0a3da..c1069f6 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -126,6 +126,8 @@ extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
+extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
+unsigned long ioba, unsigned long tce);
 extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
struct kvm_allocate_rma *rma);
 extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 25225ae..c2a0863 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -54,6 +54,7 @@ kvm-book3s_64-objs-$(CONFIG_KVM_BOOK3S_64_PR) := \
book3s_paired_singles.o \
book3s_pr.o \
book3s_pr_papr.o \
+   book3s_64_vio_hv.o \
book3s_emulate.o \
book3s_interrupts.o \
book3s_mmu_hpte.o \
@@ -78,6 +79,7 @@ kvm-book3s_64-module-objs := \
powerpc.o \
emulate.o \
book3s.o \
+   book3s_64_vio.o \
$(kvm-book3s_64-objs-y)
 
 kvm-objs-$(CONFIG_KVM_BOOK3S_64) := $(kvm-book3s_64-module-objs)
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
new file mode 100644
index 000..72ffc89
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -0,0 +1,150 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ *
+ * Copyright 2010 Paul Mackerras, IBM Corp. 
+ * Copyright 2011 David Gibson, IBM Corporation 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
+
+static long kvmppc_stt_npages(unsigned long window_size)
+{
+   return ALIGN((window_size >> SPAPR_TCE_SHIFT)
+* sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
+}
+
+static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
+{
+   struct kvm *kvm = stt->kvm;
+   int i;
+
+   mutex_lock(&kvm->lock);
+   list_del(&stt->list);
+   for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
+   __free_page(stt->pages[i]);
+   kfree(stt);
+   mutex_unlock(&kvm->lock);
+
+   kvm_pu

Re: [PATCH v2] kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM

2012-05-02 Thread Benjamin Herrenschmidt
On Wed, 2012-05-02 at 23:21 +0200, Alexander Graf wrote:
> If you're too busy to fix it up, I can give it another quick try
> though...

That would be great... I'll have no time til at least next week.

Cheers,
Ben.




Re: [PATCH v2] kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM

2012-05-02 Thread Alexander Graf


On 02.05.2012, at 23:17, Benjamin Herrenschmidt  
wrote:

> On Wed, 2012-05-02 at 15:32 +0200, Alexander Graf wrote:
>> This function needs to be available in real mode on HV, but the ones 
>> below must be on module code when kvm is compiled as a module,
>> because 
>> they call kvm infrastructure which lives in module code.
> 
> Yes, I realized that, can you fix it in your tree ?

It's slightly more involved as it means that we need to split the 2 files you 
merged again. I removed the patch from my next queue for now :).

If you're too busy to fix it up, I can give it another quick try though...


Alex



Re: [PATCH v2] kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM

2012-05-02 Thread Benjamin Herrenschmidt
On Wed, 2012-05-02 at 15:32 +0200, Alexander Graf wrote:
> This function needs to be available in real mode on HV, but the ones 
> below must be on module code when kvm is compiled as a module,
> because 
> they call kvm infrastructure which lives in module code.

Yes, I realized that, can you fix it in your tree ?

Cheers,
Ben.




Re: [PATCH v4 06/10] KVM: MMU: fast path of handling guest page fault

2012-05-02 Thread Marcelo Tosatti
On Wed, May 02, 2012 at 01:39:51PM +0800, Xiao Guangrong wrote:
> On 04/29/2012 04:50 PM, Takuya Yoshikawa wrote:
> 
> > On Fri, 27 Apr 2012 11:52:13 -0300
> > Marcelo Tosatti  wrote:
> > 
> >> Yes but the objective you are aiming for is to read and write sptes
> >> without mmu_lock. That is, i am not talking about this patch. 
> >> Please read carefully the two examples i gave (separated by "example)").
> > 
> > The real objective is still not clear.
> > 
> > The ~10% improvement reported before was on macro benchmarks during live
> > migration.  At least, that optimization was the initial objective.
> > 
> > But at some point, the objective suddenly changed to "lock-less" without
> > understanding what introduced the original improvement.
> > 
> > Was the problem really mmu_lock contention?
> > 
> 
> 
> Takuya, i am so tired to argue the advantage of lockless write-protect
> and lockless O(1) dirty-log again and again.

His point is valid: there is a lack of understanding on the details of
the improvement.

Did you see the pahole output on struct kvm? Apparently mmu_lock is
sharing a cacheline with read-intensive memslots pointer. It would be 
interesting to see what are the effects of cacheline aligning mmu_lock.
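
For what it's worth, the experiment would be something like the sketch
below (field layout abbreviated, purely illustrative):

struct kvm {
	/* Give mmu_lock its own cacheline so it no longer shares one with
	 * the read-mostly memslots pointer. */
	spinlock_t mmu_lock ____cacheline_aligned_in_smp;
	struct kvm_memslots *memslots;
	/* remaining fields unchanged */
};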

> > If the path being introduced by this patch is really fast, isn't it
> > possible to achieve the same improvement still using mmu_lock?
> > 
> > 
> > Note: During live migration, the fact that the guest gets faulted is
> > itself a limitation.  We could easily see noticeable slowdown of a
> > program even if it runs only between two GET_DIRTY_LOGs.
> > 
> 
> 
> Obviously no.
> 
> It depends on what the guest is doing; from my autotest run it is very
> easy to see that the huge improvement is on bench-migration, not
> pure-migration.
> 
> > 
> >> The rules for code under mmu_lock should be:
> >>
> >> 1) Spte updates under mmu lock must always be atomic and 
> >> with locked instructions.
> >> 2) Spte values must be read once, and appropriate action
> >> must be taken when writing them back in case their value 
> >> has changed (remote TLB flush might be required).
> > 
> > Although I am not certain about what will be really needed in the
> > final form, if this kind of maybe-needed-overhead is going to be
> > added little by little, I worry about possible regression.
> 
> 
> Well, will you suggest Linus to reject all patches and stop
> all discussion for the "possible regression" reason?
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 06/10] KVM: MMU: fast path of handling guest page fault

2012-05-02 Thread Marcelo Tosatti
On Wed, May 02, 2012 at 01:28:39PM +0800, Xiao Guangrong wrote:
> On 05/01/2012 09:34 AM, Marcelo Tosatti wrote:
> 
> 
> > 
> > It is getting better, but not yet, there are still reads of sptep
> > scattered all over (as mentioned before, i think a pattern of read spte
> > once, work on top of that, atomically write and then deal with results
> > _everywhere_ (where mmu lock is held) is more consistent.
> > 
> 
> 
> But we only need to care about the paths which depend on is_writable_pte(), no?

Yes.

> So, where call is_writable_pte() are spte_has_volatile_bits(),
> spte_write_protect() and set_spte().
> 
> I have changed these functions:
> In spte_has_volatile_bits():
>  static bool spte_has_volatile_bits(u64 spte)
>  {
> + /*
> +  * Always atomicly update spte if it can be updated
> +  * out of mmu-lock.
> +  */
> + if (spte_can_lockless_update(spte))
> + return true;
> +
> 
> In spte_write_protect():
> 
> + spte = mmu_spte_update(sptep, spte);
> +
> + if (is_writable_pte(spte))
> + *flush |= true;
> +
> The 'spte' is from atomically read-write (xchg).
> 
> in set_spte():
>  set_pte:
> - mmu_spte_update(sptep, spte);
> + entry = mmu_spte_update(sptep, spte);
>   /*
>* If we overwrite a writable spte with a read-only one we
>* should flush remote TLBs. Otherwise rmap_write_protect
> The 'entry' is also the latest value.
> 
> > /*
> >  * If we overwrite a writable spte with a read-only one we
> >  * should flush remote TLBs. Otherwise rmap_write_protect
> >  * will find a read-only spte, even though the writable spte
> >  * might be cached on a CPU's TLB.
> >  */
> > if (is_writable_pte(entry) && !is_writable_pte(*sptep))
> > kvm_flush_remote_tlbs(vcpu->kvm);
> > 
> > This is inconsistent with the above obviously.
> > 
> 
> 
> 'entry' is not a problem since it is from atomically read-write as
> mentioned above, i need change this code to:
> 
>   /*
>* Optimization: for pte sync, if spte was writable the hash
>* lookup is unnecessary (and expensive). Write protection
>* is responsibility of mmu_get_page / kvm_sync_page.
>* Same reasoning can be applied to dirty page accounting.
>*/
>   if (!can_unsync && is_writable_pte(entry) /* Use 'entry' 
> instead of '*sptep'. */
>   goto set_pte
>..
> 
> 
>  if (is_writable_pte(entry) && !is_writable_pte(spte)) /* Use 'spte' 
> instead of '*sptep'. */
>  kvm_flush_remote_tlbs(vcpu->kvm);

What matters more than being able to verify that this or that particular
case is OK at the moment is to write the code in such a way that it is
easy to verify that it is correct.

Thus the suggestion above:

"scattered all over (as mentioned before, i think a pattern of read spte
once, work on top of that, atomically write and then deal with results
_everywhere_ (where mmu lock is held) is more consistent."
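
Spelled out, that pattern is roughly the following (schematic;
compute_new_spte() is a placeholder, not an existing helper):

	u64 old_spte, new_spte;

	old_spte = *sptep;				/* read the spte exactly once */
	new_spte = compute_new_spte(old_spte);		/* work only on that snapshot */
	old_spte = xchg(sptep, new_spte);		/* atomic write, returns old value */

	/* every decision uses the value xchg returned, never a fresh *sptep */
	if (is_writable_pte(old_spte) && !is_writable_pte(new_spte))
		kvm_flush_remote_tlbs(vcpu->kvm);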




Re: Windows XP + Virtio

2012-05-02 Thread Vadim Rozenfeld
On Wednesday, May 02, 2012 05:54:50 PM Sean Kennedy wrote:
> On May 2, 2012, at 6:56 AM, Vadim Rozenfeld wrote:
> > On Wednesday, May 02, 2012 02:33:49 AM Sean Kennedy wrote:
> >> I am getting crashes (BSoD) when using Virtio for the disk driver in
> >> Windows XP.
> >> 
> >> It boots fine, it seems to run okay most of the time, but whenever the
> >> disk begins to get taxed, 9 times out of 10 it will start locking up
> >> then eventually crash with a BSoD about virtio.sys.
> > 
> > Hi Sean,
> > Can you tell me the bugcheck code and viostor version?
> > Thank you,
> > Vadim.
> 
> I'm using virtio-win-0.1-22, it looks like viostor.sys is version
> 02/13/2012,51.63.103.2200.
> 

Could you please try the more recent one, available at 
http://people.redhat.com/vrozenfe/build-26/virtio-win-prewhql-0.1.zip ?

> The BSoD is telling me 'IRQL_NOT_LESS_OR_EQUAL'.
> 
> >> Here is the environment:
> >> 
> >> VM Host is a CentOS 6 server running qemu-kvm-0.12.1.2-2.209 with Kernel
> >> version 2.6.32-220.13.1.el6.x86_64.  It's a dual quad-core Xeon with 24
> >> gigs of ram.
> >> 
> >> It's connected to backend storage via 2 gigabit ethernet connections.  I
> >> have created a raw 20gig LVM block device for this XP machine that is
> >> exported over iSCSI.
> >> 
> >> The VM Host is running device-mapper-multipath to utilize both ethernet
> >> connections to the SAN.
> >> 
> >> When I run a disk benchmark tool on the XP machine, the ICMP responses
> >> from the box start going through the roof, and even drop off.  It
> >> usually bluescreens during the test.
> >> 
> >> I have eliminated multipathd and setup the XP virt machine to just use
> >> the iSCSI /dev/disk/by-id/ block directly, and it still behaves this
> >> way.
> >> 
> >> If I set the machine to use IDE instead of Virtio, it's certainly
> >> slower, but the machine never crashes and when running I/O benchmarks,
> >> pings stay solid as they should, this is while still using multipathd
> >> and iSCSI to the storage server.
> >> 
> >> Have I setup virtio incorrectly?  How would you go about finding the
> >> real issue?
> >> 
> >> Here is the virt machine's XML (using IDE for disk currently):
> >> 
> >> 
> >> 
> >>  Apollo
> >>  d32041b8-853e-e679-edce-2b1f3db55e8a
> >>  4194304
> >>  4194304
> >>  2
> >>  
> >>  
> >>hvm
> >>
> >>  
> >>  
> >>  
> >>  
> >>
> >>
> >>
> >>  
> >>  
> >>  
> >>  
> >>
> >>  
> >>  
> >>  destroy
> >>  restart
> >>  restart
> >>  
> >>  
> >>/usr/libexec/qemu-kvm
> >>
> >>
> >>  
> >>  
> >>  
> >>  
> >>  
> >>
> >>
> >>
> >>
> >>  
> >>  
> >>  
> >>  
> >>  
> >>
> >>
> >>
> >>
> >>  
> >>   >> 
> >> function='0x1'/> 
> >> 
> >>
> >>
> >>  
> >>  
> >>  
> >>  
> >>  
> >>   >> 
> >> function='0x0'/> 
> >> 
> >>
> >>
> >>  
> >>  
> >>  
> >>
> >>
> >>
> >>
> >>  
> >>  
> >>  
> >>
> >>
> >>
> >>
> >>  
> >>
> >>
> >>
> >>
> >>
> >>
> >>  
> >>  
> >>   >> 
> >> function='0x0'/> 
> >> 
> >>
> >>
> >>  
> >>   >> 
> >> function='0x0'/> 
> >> 
> >>  
> >> 
> >> 
> >> 
> >> 
> >> Thanks,
> >> Sean--
> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> the body of a message to majord...@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv2] kvm: fix cpuid eax

2012-05-02 Thread Michael S. Tsirkin
cpuid eax should return the max leaf so that
guests can find out the valid range.
This matches Xen et al.
Update documentation to match.

Tested with -cpu host.

Signed-off-by: Michael S. Tsirkin 
---

Changes from v1:
Updated documentation.

 Documentation/virtual/kvm/cpuid.txt |6 +-
 arch/x86/kvm/cpuid.c|2 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/cpuid.txt 
b/Documentation/virtual/kvm/cpuid.txt
index 8820685..83afe65 100644
--- a/Documentation/virtual/kvm/cpuid.txt
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -10,11 +10,15 @@ a guest.
 KVM cpuid functions are:
 
 function: KVM_CPUID_SIGNATURE (0x4000)
-returns : eax = 0,
+returns : eax = 0x4001,
   ebx = 0x4b4d564b,
   ecx = 0x564b4d56,
   edx = 0x4d.
 Note that this value in ebx, ecx and edx corresponds to the string "KVMKVMKVM".
+The value in eax corresponds to the maximum cpuid function present in this 
leaf,
+and will be updated if more functions are added in the future.
+Note also that old hosts set eax value to 0x0. This should
+be interpreted as if the value was 0x4001.
 This function queries the presence of KVM cpuid leafs.
 
 
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 9fed5be..eba8345 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -397,7 +397,7 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 
function,
case KVM_CPUID_SIGNATURE: {
char signature[12] = "KVMKVMKVM\0\0";
u32 *sigptr = (u32 *)signature;
-   entry->eax = 0;
+   entry->eax = KVM_CPUID_FEATURES;
entry->ebx = sigptr[0];
entry->ecx = sigptr[1];
entry->edx = sigptr[2];
-- 
MST


Re: Windows XP + Virtio

2012-05-02 Thread Sean Kennedy

On May 2, 2012, at 6:56 AM, Vadim Rozenfeld wrote:

> On Wednesday, May 02, 2012 02:33:49 AM Sean Kennedy wrote:
>> I am getting crashes (BSoD) when using Virtio for the disk driver in
>> Windows XP.
>> 
>> It boots fine, it seems to run okay most of the time, but whenever the disk
>> begins to get taxed, 9 times out of 10 it will start locking up then
>> eventually crash with a BSoD about virtio.sys.
> Hi Sean,
> Can you tell me the bugcheck code and viostor version?
> Thank you,
> Vadim.

I'm using virtio-win-0.1-22, it looks like viostor.sys is version 
02/13/2012,51.63.103.2200.

The BSoD is telling me 'IRQL_NOT_LESS_OR_EQUAL'.

>> 
>> Here is the environment:
>> 
>> VM Host is a CentOS 6 server running qemu-kvm-0.12.1.2-2.209 with Kernel
>> version 2.6.32-220.13.1.el6.x86_64.  It's a dual quad-core Xeon with 24
>> gigs of ram.
>> 
>> It's connected to backend storage via 2 gigabit ethernet connections.  I
>> have created a raw 20gig LVM block device for this XP machine that is
>> exported over iSCSI.
>> 
>> The VM Host is running device-mapper-multipath to utilize both ethernet
>> connections to the SAN.
>> 
>> When I run a disk benchmark tool on the XP machine, the ICMP responses from
>> the box start going through the roof, and even drop off.  It usually
>> bluescreens during the test.
>> 
>> I have eliminated multipathd and setup the XP virt machine to just use the
>> iSCSI /dev/disk/by-id/ block directly, and it still behaves this way.
>> 
>> If I set the machine to use IDE instead of Virtio, it's certainly slower,
>> but the machine never crashes and when running I/O benchmarks, pings stay
>> solid as they should, this is while still using multipathd and iSCSI to
>> the storage server.
>> 
>> Have I setup virtio incorrectly?  How would you go about finding the real
>> issue?
>> 
>> Here is the virt machine's XML (using IDE for disk currently):
>> 
>> 
>>  Apollo
>>  d32041b8-853e-e679-edce-2b1f3db55e8a
>>  4194304
>>  4194304
>>  2
>>  
>>hvm
>>
>>  
>>  
>>
>>
>>
>>  
>>  
>>
>>  
>>  destroy
>>  restart
>>  restart
>>  
>>/usr/libexec/qemu-kvm
>>
>>  
>>  
>>  
>>  
>>  
>>
>>
>>  
>>  
>>  
>>  
>>  
>>
>>
>>  
>>  > function='0x1'/> 
>>
>>  
>>  
>>  
>>  
>>  
>>  > function='0x0'/> 
>>
>>  
>>  
>>  
>>
>>
>>  
>>  
>>  
>>
>>
>>  
>>
>>
>>
>>
>>  
>>  
>>  > function='0x0'/> 
>>
>>  
>>  > function='0x0'/> 
>>  
>> 
>> 
>> 
>> Thanks,
>> Sean--
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: Heavy memory_region_get_dirty() -- Re: [PATCH 0/1 v2] KVM: Alleviate mmu_lock contention during dirty logging

2012-05-02 Thread Takuya Yoshikawa
On Wed, 02 May 2012 14:33:55 +0300
Avi Kivity  wrote:

> > =
> > perf top -t ${QEMU_TID}
> > =
> >  51.52%  qemu-system-x86_64   [.] memory_region_get_dirty
> >  16.73%  qemu-system-x86_64   [.] ram_save_remaining
> >
> 
> memory_region_get_dirty() is called from ram_save_remaining().  Looks
> like quadratic behaviour here: we send a few pages in
> ram_save_remaining(), then walk the entire dirty bitmap to calculate
> expected_time().
> 
> We should probably calculate expected_time once per iteration.

There seem to be many unnecessary/repeated calculations,
not restricted to this expected_time one.
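
As a sketch of one possible QEMU-side fix (illustrative only): keep a
running count of dirty pages, updated whenever bits are set or cleared,
so that ram_save_remaining() no longer walks the whole bitmap on every call.

static uint64_t migration_dirty_pages;	/* maintained incrementally */

static uint64_t ram_save_remaining(void)
{
	return migration_dirty_pages;	/* O(1), no memory_region_get_dirty() walk */
}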

Takuya


Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

2012-05-02 Thread Nikunj A Dadhania
On Wed, 02 May 2012 12:20:40 +0200, Peter Zijlstra  wrote:
> On Wed, 2012-05-02 at 14:21 +0530, Nikunj A Dadhania wrote:
> > [root@krm1 linux]# grep HAVE_RCU_TABLE .config
> > CONFIG_HAVE_RCU_TABLE_FREE=y
> > [root@krm1 linux]# make -j32  -s
> > mm/memory.c: In function ‘tlb_remove_table_one’:
> > mm/memory.c:315: error: implicit declaration of function 
> > ‘__tlb_remove_table’
> > 
> > I suppose we need to have __tlb_remove_table. Trying to understand what
> > needs to be done there. 
> 
> Argh, I really should get back to unifying all mmu-gather
> implementations :/
> 
> I think something like the below ought to sort it.
> 
Thanks a lot.

> Completely untested though..
> 

Tested-by: Nikunj A Dadhania 

Here is the comparison with the other version. 

         Gang    pv_spin_flush   pv_spin_flush_rcu
1VM      1.01    0.49            0.49
2VMs     7.07    4.04            4.06
4VMs     9.07    5.27            5.19
8VMs     9.99    7.65            7.80

Will test other use cases as well and report back.

Regards
Nikunj



Re: [PATCH v2] kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM

2012-05-02 Thread Alexander Graf

On 03/16/2012 08:58 AM, Benjamin Herrenschmidt wrote:

There is nothing in the code for emulating TCE tables in the kernel
that prevents it from working on "PR" KVM... other than ifdef's and
location of the code.

This renames book3s_64_vio_hv.c to book3s_64_vio.c and moves the
bulk of the code there.

This speeds things up a bit on my G5.
---

v2. Changed the ifdef as per discussion with Alex. I still didn't
manage to get git to figure out the rename but that's no big deal,
the old file had only one small function in it. There's no code
change, you can trust me on that one; it's really just moving
things around :-)

  arch/powerpc/include/asm/kvm_host.h |4 +-
  arch/powerpc/include/asm/kvm_ppc.h  |2 +
  arch/powerpc/kvm/Makefile   |3 +-
  arch/powerpc/kvm/book3s_64_vio.c|  187 +++
  arch/powerpc/kvm/book3s_64_vio_hv.c |   73 --
  arch/powerpc/kvm/book3s_hv.c|  109 
  arch/powerpc/kvm/book3s_pr.c|3 +
  arch/powerpc/kvm/book3s_pr_papr.c   |   18 
  arch/powerpc/kvm/powerpc.c  |8 +-
  9 files changed, 221 insertions(+), 186 deletions(-)
  create mode 100644 arch/powerpc/kvm/book3s_64_vio.c
  delete mode 100644 arch/powerpc/kvm/book3s_64_vio_hv.c

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 42a527e..d848cdc 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -237,7 +237,6 @@ struct kvm_arch {
unsigned long vrma_slb_v;
int rma_setup_done;
int using_mmu_notifiers;
-   struct list_head spapr_tce_tables;
spinlock_t slot_phys_lock;
unsigned long *slot_phys[KVM_MEM_SLOTS_NUM];
int slot_npages[KVM_MEM_SLOTS_NUM];
@@ -245,6 +244,9 @@ struct kvm_arch {
struct kvmppc_vcore *vcores[KVM_MAX_VCORES];
struct kvmppc_linear_info *hpt_li;
  #endif /* CONFIG_KVM_BOOK3S_64_HV */
+#ifdef CONFIG_PPC_BOOK3S_64
+   struct list_head spapr_tce_tables;
+#endif
  };

  /*
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 7f0a3da..c1069f6 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -126,6 +126,8 @@ extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
  extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
  extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
struct kvm_create_spapr_tce *args);
+extern long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
+unsigned long ioba, unsigned long tce);
  extern long kvm_vm_ioctl_allocate_rma(struct kvm *kvm,
struct kvm_allocate_rma *rma);
  extern struct kvmppc_linear_info *kvm_alloc_rma(void);
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 25225ae..8c95def 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -54,6 +54,7 @@ kvm-book3s_64-objs-$(CONFIG_KVM_BOOK3S_64_PR) := \
book3s_paired_singles.o \
book3s_pr.o \
book3s_pr_papr.o \
+   book3s_64_vio.o \
book3s_emulate.o \
book3s_interrupts.o \
book3s_mmu_hpte.o \
@@ -70,7 +71,7 @@ kvm-book3s_64-objs-$(CONFIG_KVM_BOOK3S_64_HV) := \
  kvm-book3s_64-builtin-objs-$(CONFIG_KVM_BOOK3S_64_HV) := \
book3s_hv_rmhandlers.o \
book3s_hv_rm_mmu.o \
-   book3s_64_vio_hv.o \
+   book3s_64_vio.o \
book3s_hv_builtin.o

  kvm-book3s_64-module-objs := \
diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
new file mode 100644
index 000..193ba68
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -0,0 +1,187 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+ *
+ * Copyright 2010 Paul Mackerras, IBM Corp.
+ * Copyright 2011 David Gibson, IBM Corporation
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define TCES_PER_PAGE  (PAGE_SIZE / sizeof(u64))
+
+/* WARNING: This will be called in real-mode on HV KVM and virtual
+ *  mode on PR KVM
+ */
+long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
+ unsigned long ioba, un

Re: Windows XP + Virtio

2012-05-02 Thread Vadim Rozenfeld
On Wednesday, May 02, 2012 02:33:49 AM Sean Kennedy wrote:
> I am getting crashes (BSoD) when using Virtio for the disk driver in
> Windows XP.
> 
> It boots fine, it seems to run okay most of the time, but whenever the disk
> begins to get taxed, 9 times out of 10 it will start locking up then
> eventually crash with a BSoD about virtio.sys.
Hi Sean,
Can you tell me the bugcheck code and viostor version?
Thank you,
Vadim.
> 
> Here is the environment:
> 
> VM Host is a CentOS 6 server running qemu-kvm-0.12.1.2-2.209 with Kernel
> version 2.6.32-220.13.1.el6.x86_64.  It's a dual quad-core Xeon with 24
> gigs of ram.
> 
> It's connected to backend storage via 2 gigabit ethernet connections.  I
> have created a raw 20gig LVM block device for this XP machine that is
> exported over iSCSI.
> 
> The VM Host is running device-mapper-multipath to utilize both ethernet
> connections to the SAN.
> 
> When I run a disk benchmark tool on the XP machine, the ICMP responses from
> the box start going through the roof, and even drop off.  It usually
> bluescreens during the test.
> 
> I have eliminated multipathd and setup the XP virt machine to just use the
> iSCSI /dev/disk/by-id/ block directly, and it still behaves this way.
> 
> If I set the machine to use IDE instead of Virtio, it's certainly slower,
> but the machine never crashes and when running I/O benchmarks, pings stay
> solid as they should, this is while still using multipathd and iSCSI to
> the storage server.
> 
> Have I setup virtio incorrectly?  How would you go about finding the real
> issue?
> 
> Here is the virt machine's XML (using IDE for disk currently):
> 
> 
>   Apollo
>   d32041b8-853e-e679-edce-2b1f3db55e8a
>   4194304
>   4194304
>   2
>   
> hvm
> 
>   
>   
> 
> 
> 
>   
>   
> 
>   
>   destroy
>   restart
>   restart
>   
> /usr/libexec/qemu-kvm
> 
>   
>   
>   
>   
>   
> 
> 
>   
>   
>   
>   
>   
> 
> 
>   
>function='0x1'/> 
> 
>   
>   
>   
>   
>   
>function='0x0'/> 
> 
>   
>   
>   
> 
> 
>   
>   
>   
> 
> 
>   
> 
> 
> 
> 
>   
>   
>function='0x0'/> 
> 
>   
>function='0x0'/> 
>   
> 
> 
> 
> Thanks,
> Sean--
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: PPC: Book3S HV: Make the guest MMU hash table size configurable

2012-05-02 Thread Alexander Graf

On 04/27/2012 05:55 AM, Paul Mackerras wrote:

At present, on powerpc with Book 3S HV KVM, the kernel allocates a
fixed-size MMU hashed page table (HPT) to store the hardware PTEs for
the guest.  The hash table is currently always 16MB in size, but this
is larger than necessary for small guests (i.e. those with less than
about 1GB of RAM) and too small for large guests.  Furthermore, there
is no way for userspace to clear it out when resetting the guest.

This adds a new ioctl to enable qemu to control the size of the guest
hash table, and to clear it out when resetting the guest.  The
KVM_PPC_ALLOCATE_HTAB ioctl is a VM ioctl and takes as its parameter a
pointer to a u32 containing the desired order of the HPT (log base 2
of the size in bytes), which is updated on successful return to the
actual order of the HPT which was allocated.

There must be no vcpus running at the time of this ioctl.  To enforce
this, we now keep a count of the number of vcpus running in
kvm->arch.vcpus_running.

If the ioctl is called when a HPT has already been allocated, we don't
reallocate the HPT but just clear it out.  We first clear the
kvm->arch.rma_setup_done flag, which has two effects: (a) since we hold
the kvm->lock mutex, it will prevent any vcpus from starting to run until
we're done, and (b) it means that the first vcpu to run after we're done
will re-establish the VRMA if necessary.

If userspace doesn't call this ioctl before running the first vcpu, the
kernel will allocate a default-sized HPT at that point.  We do it then
rather than when creating the VM, as the code did previously, so that
userspace has a chance to do the ioctl if it wants.

When allocating the HPT, we can allocate either from the kernel page
allocator, or from the preallocated pool.  If userspace is asking for
a different size from the preallocated HPTs, we first try to allocate
using the kernel page allocator.  Then we try to allocate from the
preallocated pool, and then if that fails, we try allocating decreasing
sizes from the kernel page allocator, down to the minimum size allowed
(256kB).
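
As a hedged illustration of how userspace might drive the new ioctl (not
part of the patch; error handling trimmed):

	__u32 htab_order = 24;	/* log2 of the HPT size in bytes: 24 => 16MB */

	if (ioctl(vm_fd, KVM_PPC_ALLOCATE_HTAB, &htab_order) < 0)
		perror("KVM_PPC_ALLOCATE_HTAB");
	else
		printf("HPT of order %u allocated/cleared\n", htab_order);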

Signed-off-by: Paul Mackerras
---
  Documentation/virtual/kvm/api.txt|   34 +
  arch/powerpc/include/asm/kvm_book3s_64.h |7 +-
  arch/powerpc/include/asm/kvm_host.h  |4 +
  arch/powerpc/include/asm/kvm_ppc.h   |3 +-
  arch/powerpc/kvm/book3s_64_mmu_hv.c  |  117 +++---
  arch/powerpc/kvm/book3s_hv.c |   40 +++---
  arch/powerpc/kvm/book3s_hv_builtin.c |4 +-
  arch/powerpc/kvm/book3s_hv_rm_mmu.c  |   15 ++--
  arch/powerpc/kvm/powerpc.c   |   18 +
  include/linux/kvm.h  |3 +
  10 files changed, 191 insertions(+), 54 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 81ff39f..3629d70 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1689,6 +1689,40 @@ where the guest will clear the flag: when the soft 
lockup watchdog timer resets
  itself or when a soft lockup is detected.  This ioctl can be called any time
  after pausing the vcpu, but before it is resumed.

+4.71 KVM_PPC_ALLOCATE_HTAB
+
+Capability: KVM_CAP_PPC_ALLOC_HTAB
+Architectures: powerpc
+Type: vm ioctl
+Parameters: Pointer to u32 containing hash table order (in/out)
+Returns: 0 on success, -1 on error
+
+This requests the host kernel to allocate an MMU hash table for a
+guest using the PAPR paravirtualization interface.  This only does
+anything if the kernel is configured to use the Book 3S HV style of
+virtualization.  Otherwise the capability doesn't exist and the ioctl
+returns an ENOTTY error.  The rest of this description assumes Book 3S
+HV.
+
+There must be no vcpus running when this ioctl is called; if there
+are, it will do nothing and return an EBUSY error.
+
+The parameter is a pointer to a 32-bit unsigned integer variable
+containing the order (log base 2) of the desired size of the hash
+table, which must be between 18 and 46.  On successful return from the
+ioctl, it will have been updated with the order of the hash table that
+was allocated.
+
+If no hash table has been allocated when any vcpu is asked to run
+(with the KVM_RUN ioctl), the host kernel will allocate a
+default-sized hash table (16 MB).
+
+If this ioctl is called when a hash table has already been allocated,
+the kernel will clear out the existing hash table (zero all HPTEs) and
+return the hash table order in the parameter.  (If the guest is using
+the virtualized real-mode area (VRMA) facility, the kernel will
+re-create the VRMA HPTEs on the next KVM_RUN of any vcpu.)
+
  5. The kvm_run structure

  Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index b0c08b1..0dd1d86 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -36,11 +

Re: [PATCH] kvm_para_available() should check hypervisor bit before accessing hypervisor cpuid leaf.

2012-05-02 Thread Gleb Natapov
On Tue, May 01, 2012 at 03:59:33PM +0300, Avi Kivity wrote:
> On 05/01/2012 03:55 PM, Gleb Natapov wrote:
> > On Tue, May 01, 2012 at 03:50:44PM +0300, Avi Kivity wrote:
> > > On 04/30/2012 02:45 PM, Gleb Natapov wrote:
> > > > This cpuid range does not exist on real HW and Intel spec says that
> > > > "Information returned for highest basic information leaf" will be
> > > > returned. Not very well defined. 
> > > >
> > > 
> > > Correct in principle.  But IIRC some old qemus supported the kvm cpuid
> > > range without setting the hypervisor bit.
> > > 
> > How old?
> 
> commit e19967a0e00c7af4efd2f2db12a7d5d7f0b387f7
> Author: Avi Kivity 
> Date:   Tue Apr 22 11:44:19 2008 +0300
> 
> Set "hypervisor present" cpuid bit
>
> This tells Microsoft guests that they are running under a hypervisor.
>
> Signed-off-by: Avi Kivity  
> $ git describe --contains e19967a0e00c
> kvm-80rc2~489
> 
Since the latest Windows versions will not run correctly on such an old
version (they may use TSC for QPC), maybe dropping support for it in
Linuxes past 3.4 is a possibility too?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: async pf: INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected

2012-05-02 Thread Gleb Natapov
On Thu, Apr 26, 2012 at 04:50:35PM +0200, Sasha Levin wrote:
> Hi all.
> 
> I got a "INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected" warning
> while running LTP inside a KVM guest using the recent -next kernel.
> 
> It seems that it was initially originated from rcu_torture_rea(), but I
> don't think that it's a problem with RCU itself (I'll cc. Paul just in
> case). RCU torture was indeed configured to run during the testing.
> 
> The output is attached since it's a bit long.
> 
The reason is mmdrop() called from irq context and in reality we do
not need to hold reference to mm at all, so the patch below drops it.
---

KVM: Do not take reference to mm during async #PF

It turned out to be totally unneeded. The reason the code was introduced is
so that KVM can prefault a swapped-in page, but the prefault can fail even
if the mm is pinned since the page table can change anyway. KVM handles this
situation correctly though and does not inject spurious page faults.

Signed-off-by: Gleb Natapov 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b8ba6e4..e554e5a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -79,7 +79,6 @@ struct kvm_task_sleep_node {
u32 token;
int cpu;
bool halted;
-   struct mm_struct *mm;
 };
 
 static struct kvm_task_sleep_head {
@@ -126,9 +125,7 @@ void kvm_async_pf_task_wait(u32 token)
 
n.token = token;
n.cpu = smp_processor_id();
-   n.mm = current->active_mm;
n.halted = idle || preempt_count() > 1;
-   atomic_inc(&n.mm->mm_count);
init_waitqueue_head(&n.wq);
hlist_add_head(&n.link, &b->list);
spin_unlock(&b->lock);
@@ -161,9 +158,6 @@ EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
 static void apf_task_wake_one(struct kvm_task_sleep_node *n)
 {
hlist_del_init(&n->link);
-   if (!n->mm)
-   return;
-   mmdrop(n->mm);
if (n->halted)
smp_send_reschedule(n->cpu);
else if (waitqueue_active(&n->wq))
@@ -207,7 +201,7 @@ again:
 * async PF was not yet handled.
 * Add dummy entry for the token.
 */
-   n = kmalloc(sizeof(*n), GFP_ATOMIC);
+   n = kzalloc(sizeof(*n), GFP_ATOMIC);
if (!n) {
/*
 * Allocation failed! Busy wait while other cpu
@@ -219,7 +213,6 @@ again:
}
n->token = token;
n->cpu = smp_processor_id();
-   n->mm = NULL;
init_waitqueue_head(&n->wq);
hlist_add_head(&n->link, &b->list);
} else

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Heavy memory_region_get_dirty() -- Re: [PATCH 0/1 v2] KVM: Alleviate mmu_lock contention during dirty logging

2012-05-02 Thread Avi Kivity
On 05/02/2012 02:24 PM, Takuya Yoshikawa wrote:
> While checking mmu_lock contention, I noticed that QEMU's
> memory_region_get_dirty() was using an unexpectedly large amount of CPU time.
>
> Thanks,
>   Takuya
>
> =
> perf top -t ${QEMU_TID}
> =
>  51.52%  qemu-system-x86_64   [.] memory_region_get_dirty
>  16.73%  qemu-system-x86_64   [.] ram_save_remaining
>

memory_region_get_dirty() is called from ram_save_remaining().  Looks
like quadratic behaviour here: we send a few pages in
ram_save_remaining(), then walk the entire dirty bitmap to calculate
expected_time().

We should probably calculate expected_time once per iteration.
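
A rough sketch of that idea -- keep a dirty-page counter that is updated
when the bitmap is synced, so ram_save_remaining() and the expected-time
calculation become O(1) per iteration instead of a bitmap walk per page
sent. (migration_dirty_pages, expected_downtime_ms() and the fallback
TARGET_PAGE_SIZE below are illustrative names, not necessarily the actual
QEMU symbols.)

#include <stdint.h>

#ifndef TARGET_PAGE_SIZE
#define TARGET_PAGE_SIZE 4096           /* illustrative fallback */
#endif

/* updated whenever the dirty log is synced or a dirty page is sent */
static uint64_t migration_dirty_pages;

static uint64_t ram_save_remaining(void)
{
        return migration_dirty_pages;   /* O(1): no bitmap walk */
}

static double expected_downtime_ms(uint64_t bwidth_bytes_per_ms)
{
        return (double)(migration_dirty_pages * TARGET_PAGE_SIZE)
                / bwidth_bytes_per_ms;
}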

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Heavy memory_region_get_dirty() -- Re: [PATCH 0/1 v2] KVM: Alleviate mmu_lock contention during dirty logging

2012-05-02 Thread Takuya Yoshikawa
While checking mmu_lock contention, I noticed that QEMU's
memory_region_get_dirty() was using an unexpectedly large amount of CPU time.

Thanks,
Takuya

=
perf top -t ${QEMU_TID}
=
 51.52%  qemu-system-x86_64   [.] memory_region_get_dirty
 16.73%  qemu-system-x86_64   [.] ram_save_remaining
  7.25%  qemu-system-x86_64   [.] cpu_physical_memory_reset_dirty
  3.49%  [kvm][k] __rmap_write_protect
  2.85%  [kvm][k] mmu_spte_update
  2.20%  [kernel] [k] copy_user_generic_string
  2.16%  libc-2.13.so [.] 0x874e9
  1.71%  qemu-system-x86_64   [.] memory_region_set_dirty
  1.20%  qemu-system-x86_64   [.] kvm_physical_sync_dirty_bitmap
  1.00%  [kernel] [k] __lock_acquire.isra.31
  0.66%  [kvm][k] rmap_get_next
  0.58%  [kvm][k] rmap_get_first
  0.54%  [kvm][k] kvm_mmu_write_protect_pt_masked
  0.54%  [kvm][k] spte_has_volatile_bits
  0.42%  [kernel] [k] lock_release
  0.37%  [kernel] [k] tcp_sendmsg
  0.33%  [kernel] [k] alloc_pages_current
  0.29%  [kernel] [k] native_read_tsc
  0.29%  qemu-system-x86_64   [.] ram_save_block
  0.25%  [kernel] [k] lock_is_held
  0.25%  [kernel] [k] __ticket_spin_trylock
  0.21%  [kernel] [k] lock_acquire


On Sat, 28 Apr 2012 19:05:44 +0900
Takuya Yoshikawa  wrote:

> 1. Problem
>   During live migration, if the guest tries to take mmu_lock at the same
>   time as GET_DIRTY_LOG, which is called periodically by QEMU, it may be
>   forced to wait long time; this is not restricted to page faults caused
>   by GET_DIRTY_LOG's write protection.
> 
> 2. Measurement
> - Server:
>   Xeon: 8 cores(2 CPUs), 24GB memory
> 
> - One VM was being migrated locally to the opposite numa node:
>   Source(active)   VM: binded to node 0
>   Target(incoming) VM: binded to node 1
> 
>   This binding was for reducing extra noise.
> 
> - The guest inside it:
>   3 VCPUs, 11GB memory
> 
> - Workload:
>   On VCPU 2 and 3, there were 3 threads and each of them was endlessly
>   writing to 3GB, in total 9GB, anonymous memory at its maximum speed.
> 
>   I had checked that GET_DIRTY_LOG was forced to write protect more than
>   2 million pages.  So the 9GB memory was almost always kept dirty to be
>   sent.
> 
>   In parallel, on VCPU 1, I checked memory write latency: how long it
>   takes to write to one byte of each page in 1GB anonymous memory.
> 
> - Result:
>   With the current KVM, I could see 1.5ms worst case latency: this
>   corresponds well with the expected mmu_lock hold time.
> 
>   Here, you may think that this is too small compared to the numbers I
>   reported before, using dirty-log-perf, but that was done on 32-bit
>   host on a core-i3 box which was much slower than server machines.
> 
> 
>   Although having 10GB dirty memory pages is a bit extreme for guests
>   with less than 16GB memory, much larger guests, e.g. 128GB guests, may
>   see latency longer than 1.5ms.
> 
> 3. Solution
>   GET_DIRTY_LOG time is very limited compared to other works in QEMU,
>   so we should focus on alleviating the worst case latency first.
> 
>   The solution is very simple and originally suggested by Marcelo:
> "Conditionally reschedule when there is a contention."
> 
>   By this rescheduling, see the following patch, the worst case latency
>   changed from 1.5ms to 800us for the same test.
> 
> 4. TODO
>   The patch treats kvm_vm_ioctl_get_dirty_log() only, so the write
>   protection by kvm_mmu_slot_remove_write_access(), which is called when
>   we enable dirty page logging, can cause the same problem.
> 
>   My plan is to replace it with rmap-based protection after this.
> 
> 
> Thanks,
>   Takuya
> 
> ---
> Takuya Yoshikawa (1):
>   KVM: Reduce mmu_lock contention during dirty logging by cond_resched()
> 
>  arch/x86/include/asm/kvm_host.h |6 +++---
>  arch/x86/kvm/mmu.c  |   12 +---
>  arch/x86/kvm/x86.c  |   22 +-
>  3 files changed, 29 insertions(+), 11 deletions(-)
> 
> -- 
> 1.7.5.4
> 
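
(For reference, a minimal sketch of the "conditionally reschedule when
there is contention" idea quoted above; the loop shape and the
nr_gfns_to_protect name are illustrative, the real change is in the patch
posted with that cover letter.)

        spin_lock(&kvm->mmu_lock);
        for (i = 0; i < nr_gfns_to_protect; i++) {
                if (need_resched() || spin_needbreak(&kvm->mmu_lock))
                        cond_resched_lock(&kvm->mmu_lock); /* drop, yield, re-take */
                /* write-protect the sptes for one gfn here */
        }
        spin_unlock(&kvm->mmu_lock);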


-- 
Takuya Yoshikawa 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] KVM: PPC: bookehv: Fix r8/r13 storing in level exception handler

2012-05-02 Thread Alexander Graf

On 04/26/2012 07:43 PM, Scott Wood wrote:

On 04/26/2012 06:36 AM, Alexander Graf wrote:

On 16.04.2012, at 16:08, Mihai Caraman wrote:


Guest r8 register is held in the scratch register and stored correctly,
so remove the instruction that clobbers it. Guest r13 was missing from vcpu,
store it there.

Signed-off-by: Mihai Caraman
---
arch/powerpc/kvm/bookehv_interrupts.S |2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/kvm/bookehv_interrupts.S 
b/arch/powerpc/kvm/bookehv_interrupts.S
index 0d04bc9..0ca987d 100644
--- a/arch/powerpc/kvm/bookehv_interrupts.S
+++ b/arch/powerpc/kvm/bookehv_interrupts.S
@@ -264,10 +264,10 @@ _GLOBAL(kvmppc_handler_\intno\()_\srr1)
mfspr   r6, \srr1
PPC_LL  r4, GPR11(r8)
PPC_STL r7, VCPU_GPR(r7)(r11)
-   PPC_STL r8, VCPU_GPR(r8)(r11)

I'm not sure I can follow you here. The code that leads up to this is:

#define NORMAL_EXCEPTION_PROLOG(intno)  
 \
 mtspr   SPRN_SPRG_WSCRATCH0, r10;   /* save one register */  \
 mfspr   r10, SPRN_SPRG_THREAD;   \
 stw r11, THREAD_NORMSAVE(0)(r10);\
 stw r13, THREAD_NORMSAVE(2)(r10);\
 mfcrr13;/* save CR in r13 for now  */\
 mfspr   r11, SPRN_SRR1;  \
 DO_KVM  BOOKE_INTERRUPT_##intno SPRN_SRR1;   \

See the subject line -- this is for level exceptions
(EXC_LEVEL_EXCEPTION_PROLOG), not normal exceptions.


Ah, my bad. Sorry.

Thanks, applied to kvm-ppc-next.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

2012-05-02 Thread Peter Zijlstra
On Wed, 2012-05-02 at 14:21 +0530, Nikunj A Dadhania wrote:
> [root@krm1 linux]# grep HAVE_RCU_TABLE .config
> CONFIG_HAVE_RCU_TABLE_FREE=y
> [root@krm1 linux]# make -j32  -s
> mm/memory.c: In function ‘tlb_remove_table_one’:
> mm/memory.c:315: error: implicit declaration of function ‘__tlb_remove_table’
> 
> I suppose we need to have __tlb_remove_table. Trying to understand what
> needs to be done there. 

Argh, I really should get back to unifying all mmu-gather
implementations :/

I think something like the below ought to sort it.

Completely untested though..

---
 arch/powerpc/include/asm/pgalloc.h  |1 +
 arch/s390/mm/pgtable.c  |1 +
 arch/sparc/include/asm/pgalloc_64.h |1 +
 arch/x86/mm/pgtable.c   |6 +++---
 include/asm-generic/tlb.h   |9 +
 mm/memory.c |7 +++
 6 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h 
b/arch/powerpc/include/asm/pgalloc.h
index bf301ac..c33ae79 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -49,6 +49,7 @@ static inline void __tlb_remove_table(void *_table)
 
pgtable_free(table, shift);
 }
+#define __tlb_remove_table __tlb_remove_table
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, 
unsigned shift)
 {
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 6e765bf..7029ed7 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -730,6 +730,7 @@ void __tlb_remove_table(void *_table)
else
free_pages((unsigned long) table, ALLOC_ORDER);
 }
+#define __tlb_remove_table __tlb_remove_table
 
 static void tlb_remove_table_smp_sync(void *arg)
 {
diff --git a/arch/sparc/include/asm/pgalloc_64.h 
b/arch/sparc/include/asm/pgalloc_64.h
index 40b2d7a..d10913a 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -106,6 +106,7 @@ static inline void __tlb_remove_table(void *_table)
is_page = true;
pgtable_free(table, is_page);
 }
+#define __tlb_remove_table __tlb_remove_table
 #else /* CONFIG_SMP */
 static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, bool 
is_page)
 {
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..34fa168 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -51,21 +51,21 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page 
*pte)
 {
pgtable_page_dtor(pte);
paravirt_release_pte(page_to_pfn(pte));
-   tlb_remove_page(tlb, pte);
+   tlb_remove_table(tlb, pte);
 }
 
 #if PAGETABLE_LEVELS > 2
 void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 {
paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
-   tlb_remove_page(tlb, virt_to_page(pmd));
+   tlb_remove_table(tlb, virt_to_page(pmd));
 }
 
 #if PAGETABLE_LEVELS > 3
 void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 {
paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
-   tlb_remove_page(tlb, virt_to_page(pud));
+   tlb_remove_table(tlb, virt_to_page(pud));
 }
 #endif /* PAGETABLE_LEVELS > 3 */
 #endif /* PAGETABLE_LEVELS > 2 */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f96a5b5..8ca33e9 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -19,6 +19,8 @@
 #include 
 #include 
 
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page);
+
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 /*
  * Semi RCU freeing of the page directories.
@@ -60,6 +62,13 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+#else
+
+static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
+{
+   tlb_remove_page(tlb, table);
+}
+
 #endif
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index bf8b403..6ca6561 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -297,6 +297,13 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page 
*page)
  * See the comment near struct mmu_table_batch.
  */
 
+#ifndef __tlb_remove_table
+static void __tlb_remove_table(void *table)
+{
+   free_page_and_swap_cache(table);
+}
+#endif
+
 static void tlb_remove_table_smp_sync(void *arg)
 {
/* Simply deliver the interrupt */

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC V8 0/17] Paravirtualized ticket spinlocks

2012-05-02 Thread Raghavendra K T

This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism. The series provides
implementation for both Xen and KVM.(targeted for 3.5 window)

Note: This needs the debugfs changes patch that should be in Xen / linux-next
   https://lkml.org/lkml/2012/3/30/687

Changes in V8:
 - Rebased patches to 3.4-rc4
 - Combined the KVM changes with ticketlock + Xen changes (Ingo)
 - Removed CAP_PV_UNHALT since it is redundant (Avi). But note that we
need newer qemu which uses KVM_GET_SUPPORTED_CPUID ioctl.
 - Rewrite GET_MP_STATE condition (Avi)
 - Make pv_unhalt = bool (Avi)
 - Move out reset pv_unhalt code to vcpu_run from vcpu_block (Gleb)
 - Documentation changes (Rob Landley)
 - Have a printk to recognize that paravirt spinlock is enabled (Nikunj)
 - Move out kick hypercall out of CONFIG_PARAVIRT_SPINLOCK now
   so that it can be used for other optimizations such as 
   flush_tlb_ipi_others etc. (Nikunj)

Ticket locks have an inherent problem in a virtualized case, because
the vCPUs are scheduled rather than running concurrently (ignoring
gang scheduled vCPUs).  This can result in catastrophic performance
collapses when the vCPU scheduler doesn't schedule the correct "next"
vCPU, and ends up scheduling a vCPU which burns its entire timeslice
spinning.  (Note that this is not the same problem as lock-holder
preemption, which this series also addresses; that's also a problem,
but not catastrophic).

(See Thomas Friebel's talk "Prevent Guests from Spinning Around"
http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)

Currently we deal with this by having PV spinlocks, which adds a layer
of indirection in front of all the spinlock functions, and defining a
completely new implementation for Xen (and for other pvops users, but
there are none at present).

PV ticketlocks keep the existing ticketlock implementation
(fastpath) as-is, but add a couple of pvops for the slow paths:

- If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
  iterations, then call out to the __ticket_lock_spinning() pvop,
  which allows a backend to block the vCPU rather than spinning.  This
  pvop can set the lock into "slowpath state".

- When releasing a lock, if it is in "slowpath state", the call
  __ticket_unlock_kick() to kick the next vCPU in line awake.  If the
  lock is no longer in contention, it also clears the slowpath flag.

The "slowpath state" is stored in the LSB of the within the lock tail
ticket.  This has the effect of reducing the max number of CPUs by
half (so, a "small ticket" can deal with 128 CPUs, and "large ticket"
32768).

For KVM, one hypercall is introduced in the hypervisor that allows a vcpu to
kick another vcpu out of halt state.
The blocking of the vcpu is done using halt() in the (lock_spinning) slowpath.
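
Roughly, the KVM guest side of this looks like the sketch below
(simplified from the kvm_lock_spinning()/kvm_unlock_kick() added later in
this series: interrupt handling, the slowpath flag and the waiting-cpu
mask are omitted here):

struct kvm_lock_waiting {
        struct arch_spinlock *lock;
        __ticket_t want;
};
static DEFINE_PER_CPU(struct kvm_lock_waiting, klock_waiting);

static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
{
        struct kvm_lock_waiting *w = &__get_cpu_var(klock_waiting);

        /* publish (lock, want) so an unlocker can find us */
        w->want = want;
        w->lock = lock;

        /* last-chance check before sleeping */
        if (ACCESS_ONCE(lock->tickets.head) != want)
                halt();                 /* sleep until kicked or interrupted */

        w->lock = NULL;
}

static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
{
        int cpu;

        for_each_online_cpu(cpu) {
                const struct kvm_lock_waiting *w = &per_cpu(klock_waiting, cpu);

                if (w->lock == lock && w->want == next) {
                        kvm_kick_cpu(cpu);      /* KVM_HC_KICK_CPU hypercall */
                        break;
                }
        }
}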

Overall, it results in a large reduction in code, it makes the native
and virtualized cases closer, and it removes a layer of indirection
around all the spinlock functions.

The fast path (taking an uncontended lock which isn't in "slowpath"
state) is optimal, identical to the non-paravirtualized case.

The inner part of ticket lock code becomes:
inc = xadd(&lock->tickets, inc);
inc.tail &= ~TICKET_SLOWPATH_FLAG;

if (likely(inc.head == inc.tail))
goto out;
for (;;) {
unsigned count = SPIN_THRESHOLD;
do {
if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
goto out;
cpu_relax();
} while (--count);
__ticket_lock_spinning(lock, inc.tail);
}
out:    barrier();

which results in:
push   %rbp
mov%rsp,%rbp

mov$0x200,%eax
lock xadd %ax,(%rdi)
movzbl %ah,%edx
cmp%al,%dl
jne1f   # Slowpath if lock in contention

pop%rbp
retq   

### SLOWPATH START
1:  and$-2,%edx
movzbl %dl,%esi

2:  mov$0x800,%eax
jmp4f

3:  pause  
sub$0x1,%eax
je 5f

4:  movzbl (%rdi),%ecx
cmp%cl,%dl
jne3b

pop%rbp
retq   

5:  callq  *__ticket_lock_spinning
jmp2b
### SLOWPATH END

With CONFIG_PARAVIRT_SPINLOCKS=n, the code has changed slightly, where
the fastpath case is straight through (taking the lock without
contention), and the spin loop is out of line:

push   %rbp
mov%rsp,%rbp

mov$0x100,%eax
lock xadd %ax,(%rdi)
movzbl %ah,%edx
cmp%al,%dl
jne1f

pop%rbp
retq   

### SLOWPATH START
1:  pause  
movzbl (%rdi),%eax
cmp%dl,%al
jne1b

pop%rbp
retq   
### SLOWPATH END

The unlock code is complicated by the need to both add to the lock's
"head" and fetch the slowpath flag from "tail".  Th

[PATCH RFC V8 9/17] Split out rate limiting from jump_label.h

2012-05-02 Thread Raghavendra K T
From: Andrew Jones 

Commit b202952075f62603bea9bfb6ebc6b0420db11949 introduced rate limiting
for jump label disabling. The changes were made in the jump label code
in order to be more widely available and to keep things tidier. This is
all fine, except now jump_label.h includes linux/workqueue.h, which
makes it impossible to include jump_label.h from anything that
workqueue.h needs. For example, it's now impossible to include
jump_label.h from asm/spinlock.h, which is done in proposed
pv-ticketlock patches. This patch splits out the rate limiting related
changes from jump_label.h into a new file, jump_label_ratelimit.h, to
resolve the issue.

Signed-off-by: Andrew Jones 
Signed-off-by: Raghavendra K T 
---
 include/linux/jump_label.h   |   26 +-
 include/linux/jump_label_ratelimit.h |   34 ++
 include/linux/perf_event.h   |1 +
 kernel/jump_label.c  |1 +
 4 files changed, 37 insertions(+), 25 deletions(-)
diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index c513a40..8195227 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -49,7 +49,6 @@
 
 #include 
 #include 
-#include 
 
 #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
 
@@ -62,12 +61,6 @@ struct static_key {
 #endif
 };
 
-struct static_key_deferred {
-   struct static_key key;
-   unsigned long timeout;
-   struct delayed_work work;
-};
-
 # include 
 # define HAVE_JUMP_LABEL
 #endif /* CC_HAVE_ASM_GOTO && CONFIG_JUMP_LABEL */
@@ -126,10 +119,7 @@ extern void arch_jump_label_transform_static(struct 
jump_entry *entry,
 extern int jump_label_text_reserved(void *start, void *end);
 extern void static_key_slow_inc(struct static_key *key);
 extern void static_key_slow_dec(struct static_key *key);
-extern void static_key_slow_dec_deferred(struct static_key_deferred *key);
 extern void jump_label_apply_nops(struct module *mod);
-extern void
-jump_label_rate_limit(struct static_key_deferred *key, unsigned long rl);
 
 #define STATIC_KEY_INIT_TRUE ((struct static_key) \
{ .enabled = ATOMIC_INIT(1), .entries = (void *)1 })
@@ -148,10 +138,6 @@ static __always_inline void jump_label_init(void)
 {
 }
 
-struct static_key_deferred {
-   struct static_key  key;
-};
-
 static __always_inline bool static_key_false(struct static_key *key)
 {
if (unlikely(atomic_read(&key->enabled)) > 0)
@@ -184,11 +170,6 @@ static inline void static_key_slow_dec(struct static_key 
*key)
atomic_dec(&key->enabled);
 }
 
-static inline void static_key_slow_dec_deferred(struct static_key_deferred 
*key)
-{
-   static_key_slow_dec(&key->key);
-}
-
 static inline int jump_label_text_reserved(void *start, void *end)
 {
return 0;
@@ -202,12 +183,6 @@ static inline int jump_label_apply_nops(struct module *mod)
return 0;
 }
 
-static inline void
-jump_label_rate_limit(struct static_key_deferred *key,
-   unsigned long rl)
-{
-}
-
 #define STATIC_KEY_INIT_TRUE ((struct static_key) \
{ .enabled = ATOMIC_INIT(1) })
 #define STATIC_KEY_INIT_FALSE ((struct static_key) \
@@ -218,6 +193,7 @@ jump_label_rate_limit(struct static_key_deferred *key,
 #define STATIC_KEY_INIT STATIC_KEY_INIT_FALSE
 #define jump_label_enabled static_key_enabled
 
+static inline int atomic_read(const atomic_t *v);
 static inline bool static_key_enabled(struct static_key *key)
 {
return (atomic_read(&key->enabled) > 0);
diff --git a/include/linux/jump_label_ratelimit.h 
b/include/linux/jump_label_ratelimit.h
new file mode 100644
index 000..1137883
--- /dev/null
+++ b/include/linux/jump_label_ratelimit.h
@@ -0,0 +1,34 @@
+#ifndef _LINUX_JUMP_LABEL_RATELIMIT_H
+#define _LINUX_JUMP_LABEL_RATELIMIT_H
+
+#include 
+#include 
+
+#if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
+struct static_key_deferred {
+   struct static_key key;
+   unsigned long timeout;
+   struct delayed_work work;
+};
+#endif
+
+#ifdef HAVE_JUMP_LABEL
+extern void static_key_slow_dec_deferred(struct static_key_deferred *key);
+extern void
+jump_label_rate_limit(struct static_key_deferred *key, unsigned long rl);
+
+#else  /* !HAVE_JUMP_LABEL */
+struct static_key_deferred {
+   struct static_key  key;
+};
+static inline void static_key_slow_dec_deferred(struct static_key_deferred 
*key)
+{
+   static_key_slow_dec(&key->key);
+}
+static inline void
+jump_label_rate_limit(struct static_key_deferred *key,
+   unsigned long rl)
+{
+}
+#endif /* HAVE_JUMP_LABEL */
+#endif /* _LINUX_JUMP_LABEL_RATELIMIT_H */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ddbb6a9..a0e6118 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -605,6 +605,7 @@ struct perf_guest_info_callbacks {
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 4304919..e

[PATCH RFC V8 12/17] xen: Enable PV ticketlocks on HVM Xen

2012-05-02 Thread Raghavendra K T
From: Stefano Stabellini 

Signed-off-by: Jeremy Fitzhardinge 
Reviewed-by: Konrad Rzeszutek Wilk 
Signed-off-by: Raghavendra K T 
---
 arch/x86/xen/smp.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 7dc400a..6a7a3da 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -589,4 +589,5 @@ void __init xen_hvm_smp_init(void)
smp_ops.cpu_die = xen_hvm_cpu_die;
smp_ops.send_call_func_ipi = xen_smp_send_call_function_ipi;
smp_ops.send_call_func_single_ipi = 
xen_smp_send_call_function_single_ipi;
+   xen_init_spinlocks();
 }

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC V8 10/17] x86/ticketlock: Add slowpath logic

2012-05-02 Thread Raghavendra K T
From: Jeremy Fitzhardinge 

Maintain a flag in the LSB of the ticket lock tail which indicates
whether anyone is in the lock slowpath and may need kicking when
the current holder unlocks.  The flags are set when the first locker
enters the slowpath, and cleared when unlocking to an empty queue (ie,
no contention).

In the specific implementation of lock_spinning(), make sure to set
the slowpath flags on the lock just before blocking.  We must do
this before the last-chance pickup test to prevent a deadlock
with the unlocker:

Unlocker                        Locker
                                test for lock pickup
                                        -> fail
unlock
test slowpath
        -> false
                                set slowpath flags
                                block

Whereas this works in any ordering:

Unlocker                        Locker
                                set slowpath flags
                                test for lock pickup
                                        -> fail
                                block
unlock
test slowpath
        -> true, kick

If the unlocker finds that the lock has the slowpath flag set but it is
actually uncontended (ie, head == tail, so nobody is waiting), then it
clears the slowpath flag.

The unlock code uses a locked add to update the head counter.  This also
acts as a full memory barrier so that it's safe to subsequently
read back the slowflag state, knowing that the updated lock is visible
to the other CPUs.  If it were an unlocked add, then the flag read may
just be forwarded from the store buffer before it was visible to the other
CPUs, which could result in a deadlock.

Unfortunately this means we need to do a locked instruction when
unlocking with PV ticketlocks.  However, if PV ticketlocks are not
enabled, then the old non-locked "add" is the only unlocking code.
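
Concretely, the unlock path described above ends up shaped roughly like
this (a sketch only; the real hunk is in the spinlock.h diff, which is
truncated in this mail -- add_smp() is a locked, fully-ordered add and
__ticket_unlock_slowpath() is the kick helper added by this patch):

static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
        if (TICKET_SLOWPATH_FLAG &&
            static_key_false(&paravirt_ticketlocks_enabled)) {
                arch_spinlock_t prev = *lock;

                add_smp(&lock->tickets.head, TICKET_LOCK_INC);

                /* add_smp() is a full mb(), so the flag read below is ordered */
                if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
                        __ticket_unlock_slowpath(lock, prev);
        } else
                __add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
}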

Note: this code relies on gcc making sure that unlikely() code is out of
line of the fastpath, which only happens when OPTIMIZE_SIZE=n.  If it
doesn't, the generated code isn't too bad, but it's definitely suboptimal.

Thanks to Srivatsa Vaddagiri for providing a bugfix to the original
version of this change, which has been folded in.
Thanks to Stephan Diestelhorst for commenting on some code which relied
on an inaccurate reading of the x86 memory ordering rules.

Signed-off-by: Jeremy Fitzhardinge 
Signed-off-by: Srivatsa Vaddagiri 
Reviewed-by: Konrad Rzeszutek Wilk 
Cc: Stephan Diestelhorst 
Signed-off-by: Raghavendra K T 
---
 arch/x86/include/asm/paravirt.h   |2 +-
 arch/x86/include/asm/spinlock.h   |   86 +++-
 arch/x86/include/asm/spinlock_types.h |2 +
 arch/x86/kernel/paravirt-spinlocks.c  |3 +
 arch/x86/xen/spinlock.c   |6 ++
 5 files changed, 74 insertions(+), 25 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 9769096..af49670 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -757,7 +757,7 @@ static __always_inline void __ticket_lock_spinning(struct 
arch_spinlock *lock,
PVOP_VCALLEE2(pv_lock_ops.lock_spinning, lock, ticket);
 }
 
-static __always_inline void ticket_unlock_kick(struct arch_spinlock *lock,
+static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
__ticket_t ticket)
 {
PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 60b7e83..e6881fd 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -1,11 +1,14 @@
 #ifndef _ASM_X86_SPINLOCK_H
 #define _ASM_X86_SPINLOCK_H
 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
+
 /*
  * Your basic SMP spinlocks, allowing only a single CPU anywhere
  *
@@ -40,32 +43,28 @@
 /* How long a lock should spin before we consider blocking */
 #define SPIN_THRESHOLD (1 << 11)
 
-#ifndef CONFIG_PARAVIRT_SPINLOCKS
+extern struct static_key paravirt_ticketlocks_enabled;
+static __always_inline bool static_key_false(struct static_key *key);
 
-static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
-   __ticket_t ticket)
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
 {
+   set_bit(0, (volatile unsigned long *)&lock->tickets.tail);
 }
 
-static __always_inline void ticket_unlock_kick(struct arch_spinlock *lock,
-__ticket_t ticket)
+#else  /* !CONFIG_PARAVIRT_SPINLOCKS */
+static __always_inline void __ticket_lock_spinning(arch_spinlock_t *lock,
+   __ticket_t ticket)
 {
 }
-
-#endif /* CONFIG_PARAVIRT_SPINLOCKS */
-
-
-/*
- * If a spinlock has someone waiting on it, then kick the appropriate
- * wait

[PATCH RFC V8 7/17] x86/pvticketlock: Use callee-save for lock_spinning

2012-05-02 Thread Raghavendra K T
From: Jeremy Fitzhardinge 

Although the lock_spinning calls in the spinlock code are on the
uncommon path, their presence can cause the compiler to generate many
more register save/restores in the function pre/postamble, which is in
the fast path.  To avoid this, convert it to using the pvops callee-save
calling convention, which defers all the save/restores until the actual
function is called, keeping the fastpath clean.

Signed-off-by: Jeremy Fitzhardinge 
Reviewed-by: Konrad Rzeszutek Wilk 
Tested-by: Attilio Rao  
Signed-off-by: Raghavendra K T 
---
 arch/x86/include/asm/paravirt.h   |2 +-
 arch/x86/include/asm/paravirt_types.h |2 +-
 arch/x86/kernel/paravirt-spinlocks.c  |2 +-
 arch/x86/xen/spinlock.c   |3 ++-
 4 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4bcd146..9769096 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -754,7 +754,7 @@ static inline void __set_fixmap(unsigned /* enum 
fixed_addresses */ idx,
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
__ticket_t ticket)
 {
-   PVOP_VCALL2(pv_lock_ops.lock_spinning, lock, ticket);
+   PVOP_VCALLEE2(pv_lock_ops.lock_spinning, lock, ticket);
 }
 
 static __always_inline void ticket_unlock_kick(struct arch_spinlock *lock,
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 005e24d..5e0c138 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -330,7 +330,7 @@ struct arch_spinlock;
 #include 
 
 struct pv_lock_ops {
-   void (*lock_spinning)(struct arch_spinlock *lock, __ticket_t ticket);
+   struct paravirt_callee_save lock_spinning;
void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
 };
 
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index c2e010e..4251c1d 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -9,7 +9,7 @@
 
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
-   .lock_spinning = paravirt_nop,
+   .lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop),
.unlock_kick = paravirt_nop,
 #endif
 };
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index c4886dc..c47a8d1 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -171,6 +171,7 @@ out:
local_irq_restore(flags);
spin_time_accum_blocked(start);
 }
+PV_CALLEE_SAVE_REGS_THUNK(xen_lock_spinning);
 
 static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
 {
@@ -230,7 +231,7 @@ void __init xen_init_spinlocks(void)
return;
}
 
-   pv_lock_ops.lock_spinning = xen_lock_spinning;
+   pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning);
pv_lock_ops.unlock_kick = xen_unlock_kick;
 }
 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC V8 2/17] x86/ticketlock: Don't inline _spin_unlock when using paravirt spinlocks

2012-05-02 Thread Raghavendra K T
From: Raghavendra K T  

The code size expands somewhat, and it's better to just call
a function rather than inline it.

Thanks to Jeremy for the original version of the ARCH_NOINLINE_SPIN_UNLOCK config patch,
which has been simplified here.

Suggested-by: Linus Torvalds 
Signed-off-by: Raghavendra K T 
---
 arch/x86/Kconfig |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1d14cc6..35eb2e4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -597,6 +597,7 @@ config PARAVIRT
 config PARAVIRT_SPINLOCKS
bool "Paravirtualization layer for spinlocks"
depends on PARAVIRT && SMP && EXPERIMENTAL
+   select UNINLINE_SPIN_UNLOCK
---help---
  Paravirtualized spinlocks allow a pvops backend to replace the
  spinlock implementation with something virtualization-friendly

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC V8 17/17] Documentation/kvm : Add documentation on Hypercalls and features used for PV spinlock

2012-05-02 Thread Raghavendra K T
From: Raghavendra K T  

The KVM_HC_KICK_CPU hypercall is added to wake up a halted vcpu in a
paravirtual-spinlock-enabled guest.

KVM_FEATURE_PV_UNHALT lets the guest check whether pv spinlocks can be enabled
in the guest.

Thanks to Alex for the KVM_HC_FEATURES inputs and to Vatsa for rewriting
KVM_HC_KICK_CPU.

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Raghavendra K T 
---
 Documentation/virtual/kvm/cpuid.txt  |4 ++
 Documentation/virtual/kvm/hypercalls.txt |   60 ++
 2 files changed, 64 insertions(+), 0 deletions(-)
diff --git a/Documentation/virtual/kvm/cpuid.txt 
b/Documentation/virtual/kvm/cpuid.txt
index 8820685..062dff9 100644
--- a/Documentation/virtual/kvm/cpuid.txt
+++ b/Documentation/virtual/kvm/cpuid.txt
@@ -39,6 +39,10 @@ KVM_FEATURE_CLOCKSOURCE2   || 3 || kvmclock 
available at msrs
 KVM_FEATURE_ASYNC_PF   || 4 || async pf can be enabled by
||   || writing to msr 0x4b564d02
 --
+KVM_FEATURE_PV_UNHALT  || 6 || guest checks this feature bit
+   ||   || before enabling paravirtualized
+   ||   || spinlock support.
+--
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||24 || host will warn if no guest-side
||   || per-cpu warps are expected in
||   || kvmclock.
diff --git a/Documentation/virtual/kvm/hypercalls.txt 
b/Documentation/virtual/kvm/hypercalls.txt
new file mode 100644
index 000..bc3f14a
--- /dev/null
+++ b/Documentation/virtual/kvm/hypercalls.txt
@@ -0,0 +1,60 @@
+KVM Hypercalls Documentation
+===
+The template for each hypercall is:
+1. Hypercall name, value.
+2. Architecture(s)
+3. Status (deprecated, obsolete, active)
+4. Purpose
+
+1. KVM_HC_VAPIC_POLL_IRQ
+
+Value: 1
+Architecture: x86
+Purpose: None
+
+2. KVM_HC_MMU_OP
+
+Value: 2
+Architecture: x86
+Status: deprecated.
+Purpose: Support MMU operations such as writing to PTE,
+flushing TLB, release PT.
+
+3. KVM_HC_FEATURES
+
+Value: 3
+Architecture: PPC
+Status: active
+Purpose: Expose hypercall availability to the guest. On x86 platforms, cpuid
+is used to enumerate which hypercalls are available. On PPC, either a device
+tree based lookup (which is also what ePAPR dictates) or a KVM-specific
+enumeration mechanism (which is this hypercall) can be used.
+
+4. KVM_HC_PPC_MAP_MAGIC_PAGE
+
+Value: 4
+Architecture: PPC
+Status: active
+Purpose: To enable communication between the hypervisor and guest there is a
+shared page that contains parts of supervisor visible register state.
+The guest can map this shared page to access its supervisor register through
+memory using this hypercall.
+
+5. KVM_HC_KICK_CPU
+
+Value: 5
+Architecture: x86
+Status: active
+Purpose: Hypercall used to wakeup a vcpu from HLT state
+
+Usage example: A vcpu of a paravirtualized guest that is busy-waiting in guest
+kernel mode for an event to occur (e.g. a spinlock to become available) can
+execute the HLT instruction once it has busy-waited for more than a threshold
+time interval. Execution of the HLT instruction causes the hypervisor to put
+the vcpu to sleep until an appropriate event occurs. Another vcpu of the
+same guest can wake up the sleeping vcpu by issuing the KVM_HC_KICK_CPU
+hypercall, specifying the APIC ID of the vcpu to be woken up.
+
+TODO:
+1. more information on input and output needed?
+2. Add more detail to purpose of hypercalls.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC V8 16/17] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor

2012-05-02 Thread Raghavendra K T
From: Srivatsa Vaddagiri 

During smp_boot_cpus, a paravirtualized KVM guest detects whether the hypervisor
has the required feature (KVM_FEATURE_PV_UNHALT) to support pv-ticketlocks. If so,
support for pv-ticketlocks is registered via pv_lock_ops.

Use the KVM_HC_KICK_CPU hypercall to wake up a waiting/halted vcpu.
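
The registration itself boils down to something like this sketch (the full
function is in the diff below, which is cut off in this mail;
kvm_para_available()/kvm_para_has_feature() are the existing guest-side
probes):

void __init kvm_spinlock_init(void)
{
        if (!kvm_para_available())
                return;
        /* does the host offer the kick/unhalt facility? */
        if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
                return;

        pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
        pv_lock_ops.unlock_kick = kvm_unlock_kick;

        printk(KERN_INFO "KVM setup paravirtual spinlock\n");
}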

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Suzuki Poulose 
Signed-off-by: Raghavendra K T 
---
 arch/x86/include/asm/kvm_para.h |   14 ++-
 arch/x86/kernel/kvm.c   |  256 +++
 2 files changed, 268 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 5b647ea..77266d3 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -195,10 +195,20 @@ void kvm_async_pf_task_wait(u32 token);
 void kvm_async_pf_task_wake(u32 token);
 u32 kvm_read_and_reset_pf_reason(void);
 extern void kvm_disable_steal_time(void);
-#else
-#define kvm_guest_init() do { } while (0)
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init kvm_spinlock_init(void);
+#else /* !CONFIG_PARAVIRT_SPINLOCKS */
+static inline void kvm_spinlock_init(void)
+{
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
+#else /* CONFIG_KVM_GUEST */
+#define kvm_guest_init() do {} while (0)
 #define kvm_async_pf_task_wait(T) do {} while(0)
 #define kvm_async_pf_task_wake(T) do {} while(0)
+
 static inline u32 kvm_read_and_reset_pf_reason(void)
 {
return 0;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b8ba6e4..7c46567 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -368,6 +369,7 @@ static void __init kvm_smp_prepare_boot_cpu(void)
 #endif
kvm_guest_cpu_init();
native_smp_prepare_boot_cpu();
+   kvm_spinlock_init();
 }
 
 static void __cpuinit kvm_guest_cpu_online(void *dummy)
@@ -450,3 +452,257 @@ static __init int activate_jump_labels(void)
return 0;
 }
 arch_initcall(activate_jump_labels);
+
+/* Kick a cpu by its apicid. Used to wake up a halted vcpu */
+void kvm_kick_cpu(int cpu)
+{
+   int apicid;
+
+   apicid = per_cpu(x86_cpu_to_apicid, cpu);
+   kvm_hypercall1(KVM_HC_KICK_CPU, apicid);
+}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+enum kvm_contention_stat {
+   TAKEN_SLOW,
+   TAKEN_SLOW_PICKUP,
+   RELEASED_SLOW,
+   RELEASED_SLOW_KICKED,
+   NR_CONTENTION_STATS
+};
+
+#ifdef CONFIG_KVM_DEBUG_FS
+#define HISTO_BUCKETS  30
+
+static struct kvm_spinlock_stats
+{
+   u32 contention_stats[NR_CONTENTION_STATS];
+   u32 histo_spin_blocked[HISTO_BUCKETS+1];
+   u64 time_blocked;
+} spinlock_stats;
+
+static u8 zero_stats;
+
+static inline void check_zero(void)
+{
+   u8 ret;
+   u8 old;
+
+   old = ACCESS_ONCE(zero_stats);
+   if (unlikely(old)) {
+   ret = cmpxchg(&zero_stats, old, 0);
+   /* This ensures only one fellow resets the stat */
+   if (ret == old)
+   memset(&spinlock_stats, 0, sizeof(spinlock_stats));
+   }
+}
+
+static inline void add_stats(enum kvm_contention_stat var, u32 val)
+{
+   check_zero();
+   spinlock_stats.contention_stats[var] += val;
+}
+
+
+static inline u64 spin_time_start(void)
+{
+   return sched_clock();
+}
+
+static void __spin_time_accum(u64 delta, u32 *array)
+{
+   unsigned index;
+
+   index = ilog2(delta);
+   check_zero();
+
+   if (index < HISTO_BUCKETS)
+   array[index]++;
+   else
+   array[HISTO_BUCKETS]++;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+   u32 delta;
+
+   delta = sched_clock() - start;
+   __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
+   spinlock_stats.time_blocked += delta;
+}
+
+static struct dentry *d_spin_debug;
+static struct dentry *d_kvm_debug;
+
+struct dentry *kvm_init_debugfs(void)
+{
+   d_kvm_debug = debugfs_create_dir("kvm", NULL);
+   if (!d_kvm_debug)
+   printk(KERN_WARNING "Could not create 'kvm' debugfs 
directory\n");
+
+   return d_kvm_debug;
+}
+
+static int __init kvm_spinlock_debugfs(void)
+{
+   struct dentry *d_kvm;
+
+   d_kvm = kvm_init_debugfs();
+   if (d_kvm == NULL)
+   return -ENOMEM;
+
+   d_spin_debug = debugfs_create_dir("spinlocks", d_kvm);
+
+   debugfs_create_u8("zero_stats", 0644, d_spin_debug, &zero_stats);
+
+   debugfs_create_u32("taken_slow", 0444, d_spin_debug,
+  &spinlock_stats.contention_stats[TAKEN_SLOW]);
+   debugfs_create_u32("taken_slow_pickup", 0444, d_spin_debug,
+  &spinlock_stats.contention_stats[TAKEN_SLOW_PICKUP]);
+
+   debugfs_create_u32("released_slow", 0444, d_spin_debug,
+  &spinlock_stats.contention_stats[RELEASED_SLOW]);
+   debugfs_create_u32("released_slow_kicked", 0444, d_spin_debug,
+  &spinlo

[PATCH RFC V8 15/17] kvm guest : Add configuration support to enable debug information for KVM Guests

2012-05-02 Thread Raghavendra K T
From: Srivatsa Vaddagiri 

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Suzuki Poulose 
Signed-off-by: Raghavendra K T 
---
 arch/x86/Kconfig |9 +
 1 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 35eb2e4..a9ec0da 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -584,6 +584,15 @@ config KVM_GUEST
  This option enables various optimizations for running under the KVM
  hypervisor.
 
+config KVM_DEBUG_FS
+   bool "Enable debug information for KVM Guests in debugfs"
+   depends on KVM_GUEST && DEBUG_FS
+   default n
+   ---help---
+ This option enables collection of various statistics for KVM guest.
+ Statistics are displayed in debugfs filesystem. Enabling this option
+ may incur significant overhead.
+
 source "arch/x86/lguest/Kconfig"
 
 config PARAVIRT

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC V8 14/17] kvm : Fold pv_unhalt flag into GET_MP_STATE ioctl to aid migration

2012-05-02 Thread Raghavendra K T
From: Raghavendra K T 

During migration, any vcpu that got kicked but did not become runnable
 (still in halted state) should be runnable after migration.

Signed-off-by: Raghavendra K T 
---
 arch/x86/kvm/x86.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f188cdc..5b09b67 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5691,7 +5691,12 @@ int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
struct kvm_mp_state *mp_state)
 {
-   mp_state->mp_state = vcpu->arch.mp_state;
+   if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED &&
+   vcpu->arch.pv.pv_unhalted)
+   mp_state->mp_state = KVM_MP_STATE_RUNNABLE;
+   else
+   mp_state->mp_state = vcpu->arch.mp_state;
+
return 0;
 }
 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC V8 13/17] kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks

2012-05-02 Thread Raghavendra K T
From: srivatsa vaddagiri 

KVM_HC_KICK_CPU allows the calling vcpu to kick another vcpu out of halt state.

The presence of this hypercall is indicated to the guest via
KVM_FEATURE_PV_UNHALT.

Signed-off-by: Srivatsa Vaddagiri 
Signed-off-by: Suzuki Poulose 
Signed-off-by: Raghavendra K T 
---
 arch/x86/include/asm/kvm_host.h |4 
 arch/x86/include/asm/kvm_para.h |2 ++
 arch/x86/kvm/cpuid.c|3 ++-
 arch/x86/kvm/x86.c  |   37 +
 include/linux/kvm_para.h|1 +
 5 files changed, 46 insertions(+), 1 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e216ba0..e187a9b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -481,6 +481,10 @@ struct kvm_vcpu_arch {
u64 length;
u64 status;
} osvw;
+   /* pv related host specific info */
+   struct {
+   bool pv_unhalted;
+   } pv;
 };
 
 struct kvm_lpage_info {
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 734c376..5b647ea 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -16,12 +16,14 @@
 #define KVM_FEATURE_CLOCKSOURCE0
 #define KVM_FEATURE_NOP_IO_DELAY   1
 #define KVM_FEATURE_MMU_OP 2
+
 /* This indicates that the new set of kvmclock msrs
  * are available. The use of 0x11 and 0x12 is deprecated
  */
 #define KVM_FEATURE_CLOCKSOURCE23
 #define KVM_FEATURE_ASYNC_PF   4
 #define KVM_FEATURE_STEAL_TIME 5
+#define KVM_FEATURE_PV_UNHALT  6
 
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 9fed5be..7c93806 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -408,7 +408,8 @@ static int do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 
function,
 (1 << KVM_FEATURE_NOP_IO_DELAY) |
 (1 << KVM_FEATURE_CLOCKSOURCE2) |
 (1 << KVM_FEATURE_ASYNC_PF) |
-(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
+(1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+(1 << KVM_FEATURE_PV_UNHALT);
 
if (sched_info_on())
entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 91a5e98..f188cdc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4993,6 +4993,36 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
return 1;
 }
 
+/*
+ * kvm_pv_kick_cpu_op:  Kick a vcpu.
+ *
+ * @apicid - apicid of vcpu to be kicked.
+ */
+static void kvm_pv_kick_cpu_op(struct kvm *kvm, int apicid)
+{
+   struct kvm_vcpu *vcpu = NULL;
+   int i;
+
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   if (!kvm_apic_present(vcpu))
+   continue;
+
+   if (kvm_apic_match_dest(vcpu, 0, 0, apicid, 0))
+   break;
+   }
+   if (vcpu) {
+   /*
+* Setting unhalt flag here can result in spurious runnable
+* state when unhalt reset does not happen in vcpu_block.
+* But that is harmless since that should soon result in halt.
+*/
+   vcpu->arch.pv.pv_unhalted = true;
+   /* We need everybody see unhalt before vcpu unblocks */
+   smp_wmb();
+   kvm_vcpu_kick(vcpu);
+   }
+}
+
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 {
unsigned long nr, a0, a1, a2, a3, ret;
@@ -5026,6 +5056,10 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
case KVM_HC_VAPIC_POLL_IRQ:
ret = 0;
break;
+   case KVM_HC_KICK_CPU:
+   kvm_pv_kick_cpu_op(vcpu->kvm, a0);
+   ret = 0;
+   break;
default:
ret = -KVM_ENOSYS;
break;
@@ -5409,6 +5443,7 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
{
switch(vcpu->arch.mp_state) {
case KVM_MP_STATE_HALTED:
+   vcpu->arch.pv.pv_unhalted = false;
vcpu->arch.mp_state =
KVM_MP_STATE_RUNNABLE;
case KVM_MP_STATE_RUNNABLE:
@@ -6128,6 +6163,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
BUG_ON(vcpu->kvm == NULL);
kvm = vcpu->kvm;
 
+   vcpu->arch.pv.pv_unhalted = false;
vcpu->arch.emulate_ctxt.ops = &emulate_ops;
if (!irqchip_in_kernel(kvm) || kvm_vcpu_is_bsp(vcpu))
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
@@ -6394,6 +6430

[PATCH RFC V8 8/17] x86/pvticketlock: When paravirtualizing ticket locks, increment by 2

2012-05-02 Thread Raghavendra K T
From: Jeremy Fitzhardinge 

Increment ticket head/tails by 2 rather than 1 to leave the LSB free
to store a "is in slowpath state" bit.  This halves the number
of possible CPUs for a given ticket size, but this shouldn't matter
in practice - kernels built for 32k+ CPU systems are probably
specially built for the hardware rather than a generic distro
kernel.

Signed-off-by: Jeremy Fitzhardinge 
Tested-by: Attilio Rao  
Signed-off-by: Raghavendra K T 
---
 arch/x86/include/asm/spinlock.h   |   10 +-
 arch/x86/include/asm/spinlock_types.h |   10 +-
 2 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index ee4bbd4..60b7e83 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -81,7 +81,7 @@ static __always_inline void __ticket_unlock_kick(struct 
arch_spinlock *lock,
  */
 static __always_inline void arch_spin_lock(struct arch_spinlock *lock)
 {
-   register struct __raw_tickets inc = { .tail = 1 };
+   register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC };
 
inc = xadd(&lock->tickets, inc);
 
@@ -107,7 +107,7 @@ static __always_inline int 
arch_spin_trylock(arch_spinlock_t *lock)
if (old.tickets.head != old.tickets.tail)
return 0;
 
-   new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
+   new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
 
/* cmpxchg is a full barrier, so nothing can move before it */
return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == 
old.head_tail;
@@ -115,9 +115,9 @@ static __always_inline int 
arch_spin_trylock(arch_spinlock_t *lock)
 
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
-   __ticket_t next = lock->tickets.head + 1;
+   __ticket_t next = lock->tickets.head + TICKET_LOCK_INC;
 
-   __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
+   __add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
__ticket_unlock_kick(lock, next);
 }
 
@@ -132,7 +132,7 @@ static inline int arch_spin_is_contended(arch_spinlock_t 
*lock)
 {
struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
 
-   return (__ticket_t)(tmp.tail - tmp.head) > 1;
+   return (__ticket_t)(tmp.tail - tmp.head) > TICKET_LOCK_INC;
 }
 #define arch_spin_is_contended arch_spin_is_contended
 
diff --git a/arch/x86/include/asm/spinlock_types.h 
b/arch/x86/include/asm/spinlock_types.h
index 83fd3c7..e96fcbd 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -3,7 +3,13 @@
 
 #include 
 
-#if (CONFIG_NR_CPUS < 256)
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define __TICKET_LOCK_INC  2
+#else
+#define __TICKET_LOCK_INC  1
+#endif
+
+#if (CONFIG_NR_CPUS < (256 / __TICKET_LOCK_INC))
 typedef u8  __ticket_t;
 typedef u16 __ticketpair_t;
 #else
@@ -11,6 +17,8 @@ typedef u16 __ticket_t;
 typedef u32 __ticketpair_t;
 #endif
 
+#define TICKET_LOCK_INC((__ticket_t)__TICKET_LOCK_INC)
+
 #define TICKET_SHIFT   (sizeof(__ticket_t) * 8)
 
 typedef struct arch_spinlock {

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC V8 11/17] xen/pvticketlock: Allow interrupts to be enabled while blocking

2012-05-02 Thread Raghavendra K T
From: Jeremy Fitzhardinge 

If interrupts were enabled when taking the spinlock, we can leave them
enabled while blocking to get the lock.

If we can enable interrupts while waiting for the lock to become
available, and we take an interrupt before entering the poll,
and the handler takes a spinlock which ends up going into
the slow state (invalidating the per-cpu "lock" and "want" values),
then when the interrupt handler returns the event channel will
remain pending so the poll will return immediately, causing it to
return out to the main spinlock loop.

Signed-off-by: Jeremy Fitzhardinge 
Reviewed-by: Konrad Rzeszutek Wilk 
Signed-off-by: Raghavendra K T 
---
 arch/x86/xen/spinlock.c |   46 --
 1 files changed, 40 insertions(+), 6 deletions(-)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d4abaf9..3bf93d5 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -140,7 +140,20 @@ static void xen_lock_spinning(struct arch_spinlock *lock, 
__ticket_t want)
 * partially setup state.
 */
local_irq_save(flags);
-
+   /*
+* We don't really care if we're overwriting some other
+* (lock,want) pair, as that would mean that we're currently
+* in an interrupt context, and the outer context had
+* interrupts enabled.  That has already kicked the VCPU out
+* of xen_poll_irq(), so it will just return spuriously and
+* retry with newly setup (lock,want).
+*
+* The ordering protocol on this is that the "lock" pointer
+* may only be set non-NULL if the "want" ticket is correct.
+* If we're updating "want", we must first clear "lock".
+*/
+   w->lock = NULL;
+   smp_wmb();
w->want = want;
smp_wmb();
w->lock = lock;
@@ -155,24 +168,43 @@ static void xen_lock_spinning(struct arch_spinlock *lock, 
__ticket_t want)
/* Only check lock once pending cleared */
barrier();
 
-   /* Mark entry to slowpath before doing the pickup test to make
-  sure we don't deadlock with an unlocker. */
+   /*
+* Mark entry to slowpath before doing the pickup test to make
+* sure we don't deadlock with an unlocker.
+*/
__ticket_enter_slowpath(lock);
 
-   /* check again make sure it didn't become free while
-  we weren't looking  */
+   /*
+* check again make sure it didn't become free while
+* we weren't looking
+*/
if (ACCESS_ONCE(lock->tickets.head) == want) {
add_stats(TAKEN_SLOW_PICKUP, 1);
goto out;
}
+
+   /* Allow interrupts while blocked */
+   local_irq_restore(flags);
+
+   /*
+* If an interrupt happens here, it will leave the wakeup irq
+* pending, which will cause xen_poll_irq() to return
+* immediately.
+*/
+
/* Block until irq becomes pending (or perhaps a spurious wakeup) */
xen_poll_irq(irq);
add_stats(TAKEN_SLOW_SPURIOUS, !xen_test_irq_pending(irq));
+
+   local_irq_save(flags);
+
kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq));
 out:
cpumask_clear_cpu(cpu, &waiting_cpus);
w->lock = NULL;
+
local_irq_restore(flags);
+
spin_time_accum_blocked(start);
 }
 PV_CALLEE_SAVE_REGS_THUNK(xen_lock_spinning);
@@ -186,7 +218,9 @@ static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
for_each_cpu(cpu, &waiting_cpus) {
const struct xen_lock_waiting *w = &per_cpu(lock_waiting, cpu);
 
-   if (w->lock == lock && w->want == next) {
+   /* Make sure we read lock before want */
+   if (ACCESS_ONCE(w->lock) == lock &&
+   ACCESS_ONCE(w->want) == next) {
add_stats(RELEASED_SLOW_KICKED, 1);
xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
break;

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC V8 6/17] xen/pvticketlocks: Add xen_nopvspin parameter to disable xen pv ticketlocks

2012-05-02 Thread Raghavendra K T
From: Jeremy Fitzhardinge 

Signed-off-by: Jeremy Fitzhardinge 
Reviewed-by: Konrad Rzeszutek Wilk 
Signed-off-by: Raghavendra K T 
---
 arch/x86/xen/spinlock.c |   14 ++
 1 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 4e98a07..c4886dc 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -221,12 +221,26 @@ void xen_uninit_lock_cpu(int cpu)
unbind_from_irqhandler(per_cpu(lock_kicker_irq, cpu), NULL);
 }
 
+static bool xen_pvspin __initdata = true;
+
 void __init xen_init_spinlocks(void)
 {
+   if (!xen_pvspin) {
+   printk(KERN_DEBUG "xen: PV spinlocks disabled\n");
+   return;
+   }
+
pv_lock_ops.lock_spinning = xen_lock_spinning;
pv_lock_ops.unlock_kick = xen_unlock_kick;
 }
 
+static __init int xen_parse_nopvspin(char *arg)
+{
+   xen_pvspin = false;
+   return 0;
+}
+early_param("xen_nopvspin", xen_parse_nopvspin);
+
 #ifdef CONFIG_XEN_DEBUG_FS
 
 static struct dentry *d_spin_debug;

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC V8 5/17] xen/pvticketlock: Xen implementation for PV ticket locks

2012-05-02 Thread Raghavendra K T
From: Jeremy Fitzhardinge 

Replace the old Xen implementation of PV spinlocks with an implementation
of xen_lock_spinning and xen_unlock_kick.

xen_lock_spinning simply registers the cpu in its entry in lock_waiting,
adds itself to the waiting_cpus set, and blocks on an event channel
until the channel becomes pending.

xen_unlock_kick searches the cpus in waiting_cpus for the one waiting
on this lock with the next ticket, if any.  If found, it kicks it by
making its event channel pending, which wakes it up.

We need to make sure interrupts are disabled while we're relying on the
contents of the per-cpu lock_waiting values, otherwise an interrupt
handler could come in, try to take some other lock, block, and overwrite
our values.

Raghu: use function + enum instead of macro, cmpxchg for zero status reset
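
Condensed, the per-cpu bookkeeping and the two operations reduce to
roughly the following (irq setup, irq disabling and stats omitted; see
the patch for the real code):

	struct xen_lock_waiting {
		struct arch_spinlock *lock;
		__ticket_t want;
	};
	static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
	static cpumask_t waiting_cpus;

	/* xen_lock_spinning: record what we are waiting for, then block
	 * on this cpu's event channel until the unlocker kicks us. */
	w->want = want;
	w->lock = lock;
	cpumask_set_cpu(cpu, &waiting_cpus);
	xen_poll_irq(irq);
	cpumask_clear_cpu(cpu, &waiting_cpus);
	w->lock = NULL;

	/* xen_unlock_kick: find the cpu waiting on this lock with ticket
	 * 'next' and make its event channel pending. */
	for_each_cpu(cpu, &waiting_cpus) {
		const struct xen_lock_waiting *w = &per_cpu(lock_waiting, cpu);
		if (w->lock == lock && w->want == next) {
			xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
			break;
		}
	}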

Signed-off-by: Jeremy Fitzhardinge 
Reviewed-by: Konrad Rzeszutek Wilk 
Signed-off-by: Raghavendra K T 
---
 arch/x86/xen/spinlock.c |  348 +++
 1 files changed, 78 insertions(+), 270 deletions(-)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index f1f4540..4e98a07 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -16,45 +16,44 @@
 #include "xen-ops.h"
 #include "debugfs.h"
 
-#ifdef CONFIG_XEN_DEBUG_FS
-static struct xen_spinlock_stats
-{
-   u64 taken;
-   u32 taken_slow;
-   u32 taken_slow_nested;
-   u32 taken_slow_pickup;
-   u32 taken_slow_spurious;
-   u32 taken_slow_irqenable;
+enum xen_contention_stat {
+   TAKEN_SLOW,
+   TAKEN_SLOW_PICKUP,
+   TAKEN_SLOW_SPURIOUS,
+   RELEASED_SLOW,
+   RELEASED_SLOW_KICKED,
+   NR_CONTENTION_STATS
+};
 
-   u64 released;
-   u32 released_slow;
-   u32 released_slow_kicked;
 
+#ifdef CONFIG_XEN_DEBUG_FS
 #define HISTO_BUCKETS  30
-   u32 histo_spin_total[HISTO_BUCKETS+1];
-   u32 histo_spin_spinning[HISTO_BUCKETS+1];
+static struct xen_spinlock_stats
+{
+   u32 contention_stats[NR_CONTENTION_STATS];
u32 histo_spin_blocked[HISTO_BUCKETS+1];
-
-   u64 time_total;
-   u64 time_spinning;
u64 time_blocked;
 } spinlock_stats;
 
 static u8 zero_stats;
 
-static unsigned lock_timeout = 1 << 10;
-#define TIMEOUT lock_timeout
-
 static inline void check_zero(void)
 {
-   if (unlikely(zero_stats)) {
-   memset(&spinlock_stats, 0, sizeof(spinlock_stats));
-   zero_stats = 0;
+   u8 ret;
+   u8 old = ACCESS_ONCE(zero_stats);
+   if (unlikely(old)) {
+   ret = cmpxchg(&zero_stats, old, 0);
+   /* This ensures only one fellow resets the stat */
+   if (ret == old)
+   memset(&spinlock_stats, 0, sizeof(spinlock_stats));
}
 }
 
-#define ADD_STATS(elem, val)   \
-   do { check_zero(); spinlock_stats.elem += (val); } while(0)
+static inline void add_stats(enum xen_contention_stat var, u32 val)
+{
+   check_zero();
+   spinlock_stats.contention_stats[var] += val;
+}
 
 static inline u64 spin_time_start(void)
 {
@@ -73,22 +72,6 @@ static void __spin_time_accum(u64 delta, u32 *array)
array[HISTO_BUCKETS]++;
 }
 
-static inline void spin_time_accum_spinning(u64 start)
-{
-   u32 delta = xen_clocksource_read() - start;
-
-   __spin_time_accum(delta, spinlock_stats.histo_spin_spinning);
-   spinlock_stats.time_spinning += delta;
-}
-
-static inline void spin_time_accum_total(u64 start)
-{
-   u32 delta = xen_clocksource_read() - start;
-
-   __spin_time_accum(delta, spinlock_stats.histo_spin_total);
-   spinlock_stats.time_total += delta;
-}
-
 static inline void spin_time_accum_blocked(u64 start)
 {
u32 delta = xen_clocksource_read() - start;
@@ -98,19 +81,15 @@ static inline void spin_time_accum_blocked(u64 start)
 }
 #else  /* !CONFIG_XEN_DEBUG_FS */
 #define TIMEOUT(1 << 10)
-#define ADD_STATS(elem, val)   do { (void)(val); } while(0)
+static inline void add_stats(enum xen_contention_stat var, u32 val)
+{
+}
 
 static inline u64 spin_time_start(void)
 {
return 0;
 }
 
-static inline void spin_time_accum_total(u64 start)
-{
-}
-static inline void spin_time_accum_spinning(u64 start)
-{
-}
 static inline void spin_time_accum_blocked(u64 start)
 {
 }
@@ -133,230 +112,83 @@ typedef u16 xen_spinners_t;
asm(LOCK_PREFIX " decw %0" : "+m" ((xl)->spinners) : : "memory");
 #endif
 
-struct xen_spinlock {
-   unsigned char lock; /* 0 -> free; 1 -> locked */
-   xen_spinners_t spinners;/* count of waiting cpus */
+struct xen_lock_waiting {
+   struct arch_spinlock *lock;
+   __ticket_t want;
 };
 
 static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
+static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
+static cpumask_t waiting_cpus;
 
-#if 0
-static int xen_spin_is_locked(struct arch_spinlock *lock)
-{
-   struct xen_spinlock *xl = (struct xen

[PATCH RFC V8 4/17] xen: Defer spinlock setup until boot CPU setup

2012-05-02 Thread Raghavendra K T
From: Jeremy Fitzhardinge 

There's no need to do it at very early init, and doing it there
makes it impossible to use the jump_label machinery.

Signed-off-by: Jeremy Fitzhardinge 
Reviewed-by: Konrad Rzeszutek Wilk 
Signed-off-by: Raghavendra K T 
---
 arch/x86/xen/smp.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 0503c0c..7dc400a 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -222,6 +222,7 @@ static void __init xen_smp_prepare_boot_cpu(void)
 
xen_filter_cpu_maps();
xen_setup_vcpu_info_placement();
+   xen_init_spinlocks();
 }
 
 static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
@@ -551,7 +552,6 @@ void __init xen_smp_init(void)
 {
smp_ops = xen_smp_ops;
xen_fill_possible_map();
-   xen_init_spinlocks();
 }
 
 static void __init xen_hvm_smp_prepare_cpus(unsigned int max_cpus)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC V8 3/17] x86/ticketlock: Collapse a layer of functions

2012-05-02 Thread Raghavendra K T
From: Jeremy Fitzhardinge 

Now that the paravirtualization layer doesn't exist at the spinlock
level any more, we can collapse the __ticket_ functions into the arch_
functions.

Signed-off-by: Jeremy Fitzhardinge 
Tested-by: Attilio Rao  
Signed-off-by: Raghavendra K T 
---
 arch/x86/include/asm/spinlock.h |   35 +--
 1 files changed, 5 insertions(+), 30 deletions(-)
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 3e47608..ee4bbd4 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -79,7 +79,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock,
  * in the high part, because a wide xadd increment of the low part would carry
  * up and contaminate the high part.
  */
-static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock)
+static __always_inline void arch_spin_lock(struct arch_spinlock *lock)
 {
register struct __raw_tickets inc = { .tail = 1 };
 
@@ -99,7 +99,7 @@ static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock)
 out:   barrier();  /* make sure nothing creeps before the lock is taken */
 }
 
-static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
+static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
 {
arch_spinlock_t old, new;
 
@@ -113,7 +113,7 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
 }
 
-static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
+static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
__ticket_t next = lock->tickets.head + 1;
 
@@ -121,46 +121,21 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
__ticket_unlock_kick(lock, next);
 }
 
-static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
+static inline int arch_spin_is_locked(arch_spinlock_t *lock)
 {
struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
 
return tmp.tail != tmp.head;
 }
 
-static inline int __ticket_spin_is_contended(arch_spinlock_t *lock)
+static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 {
struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
 
return (__ticket_t)(tmp.tail - tmp.head) > 1;
 }
-
-static inline int arch_spin_is_locked(arch_spinlock_t *lock)
-{
-   return __ticket_spin_is_locked(lock);
-}
-
-static inline int arch_spin_is_contended(arch_spinlock_t *lock)
-{
-   return __ticket_spin_is_contended(lock);
-}
 #define arch_spin_is_contended arch_spin_is_contended
 
-static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
-{
-   __ticket_spin_lock(lock);
-}
-
-static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
-{
-   return __ticket_spin_trylock(lock);
-}
-
-static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
-{
-   __ticket_spin_unlock(lock);
-}
-
 static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
  unsigned long flags)
 {

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC V8 1/17] x86/spinlock: Replace pv spinlocks with pv ticketlocks

2012-05-02 Thread Raghavendra K T
From: Jeremy Fitzhardinge 

Rather than outright replacing the entire spinlock implementation in
order to paravirtualize it, keep the ticket lock implementation but add
a couple of pvops hooks on the slow path (long spin on lock, unlocking
a contended lock).

Ticket locks have a number of nice properties, but they also have some
surprising behaviours in virtual environments.  They enforce a strict
FIFO ordering on cpus trying to take a lock; however, if the hypervisor
scheduler does not schedule the cpus in the correct order, the system can
waste a huge amount of time spinning until the next cpu can take the lock.

(See Thomas Friebel's talk "Prevent Guests from Spinning Around"
http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)

To address this, we add two hooks:
 - __ticket_spin_lock which is called after the cpu has been
   spinning on the lock for a significant number of iterations but has
   failed to take the lock (presumably because the cpu holding the lock
   has been descheduled).  The lock_spinning pvop is expected to block
   the cpu until it has been kicked by the current lock holder.
 - __ticket_spin_unlock, which, on releasing a contended lock
   (there are more cpus waiting with tail tickets), looks to see whether
   the next cpu is blocked and wakes it if so.

When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub
functions causes all the extra code to go away.
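
In condensed form (the spinlock.h hunk is cut off below, so this is a
sketch of the same idea rather than the hunk itself): the lock fast
path stays a plain ticket xadd, and only after SPIN_THRESHOLD failed
spins does it drop into the lock_spinning pvop; unlock bumps head and
calls the unlock_kick pvop with the next ticket.

	static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock)
	{
		register struct __raw_tickets inc = { .tail = 1 };

		inc = xadd(&lock->tickets, inc);
		for (;;) {
			unsigned count = SPIN_THRESHOLD;

			do {
				if (inc.head == inc.tail)
					goto out;	/* got the lock */
				cpu_relax();
				inc.head = ACCESS_ONCE(lock->tickets.head);
			} while (--count);
			__ticket_lock_spinning(lock, inc.tail);	/* pvop: block */
		}
	out:	barrier();	/* nothing creeps before the lock is taken */
	}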

Signed-off-by: Jeremy Fitzhardinge 
Reviewed-by: Konrad Rzeszutek Wilk 
Tested-by: Attilio Rao  
Signed-off-by: Raghavendra K T 
---
 arch/x86/include/asm/paravirt.h   |   32 
 arch/x86/include/asm/paravirt_types.h |   10 ++
 arch/x86/include/asm/spinlock.h   |   53 ++--
 arch/x86/include/asm/spinlock_types.h |4 --
 arch/x86/kernel/paravirt-spinlocks.c  |   15 +
 arch/x86/xen/spinlock.c   |8 -
 6 files changed, 61 insertions(+), 61 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index aa0f913..4bcd146 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -751,36 +751,16 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 
 #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
 
-static inline int arch_spin_is_locked(struct arch_spinlock *lock)
+static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
+   __ticket_t ticket)
 {
-   return PVOP_CALL1(int, pv_lock_ops.spin_is_locked, lock);
+   PVOP_VCALL2(pv_lock_ops.lock_spinning, lock, ticket);
 }
 
-static inline int arch_spin_is_contended(struct arch_spinlock *lock)
+static __always_inline void ticket_unlock_kick(struct arch_spinlock *lock,
+   __ticket_t ticket)
 {
-   return PVOP_CALL1(int, pv_lock_ops.spin_is_contended, lock);
-}
-#define arch_spin_is_contended arch_spin_is_contended
-
-static __always_inline void arch_spin_lock(struct arch_spinlock *lock)
-{
-   PVOP_VCALL1(pv_lock_ops.spin_lock, lock);
-}
-
-static __always_inline void arch_spin_lock_flags(struct arch_spinlock *lock,
- unsigned long flags)
-{
-   PVOP_VCALL2(pv_lock_ops.spin_lock_flags, lock, flags);
-}
-
-static __always_inline int arch_spin_trylock(struct arch_spinlock *lock)
-{
-   return PVOP_CALL1(int, pv_lock_ops.spin_trylock, lock);
-}
-
-static __always_inline void arch_spin_unlock(struct arch_spinlock *lock)
-{
-   PVOP_VCALL1(pv_lock_ops.spin_unlock, lock);
+   PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
 
 #endif
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8e8b9a4..005e24d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -327,13 +327,11 @@ struct pv_mmu_ops {
 };
 
 struct arch_spinlock;
+#include 
+
 struct pv_lock_ops {
-   int (*spin_is_locked)(struct arch_spinlock *lock);
-   int (*spin_is_contended)(struct arch_spinlock *lock);
-   void (*spin_lock)(struct arch_spinlock *lock);
-   void (*spin_lock_flags)(struct arch_spinlock *lock, unsigned long flags);
-   int (*spin_trylock)(struct arch_spinlock *lock);
-   void (*spin_unlock)(struct arch_spinlock *lock);
+   void (*lock_spinning)(struct arch_spinlock *lock, __ticket_t ticket);
+   void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 76bfa2c..3e47608 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -37,6 +37,35 @@
 # define UNLOCK_LOCK_PREFIX
 #endif
 
+/* How long a lock should spin before we consider blocking */
+#define SPIN_THRESHOLD (1 << 11)
+
+#ifndef CONFIG

Re: [Xen-devel] [PATCHv2] x86info: dump kvm cpuid's

2012-05-02 Thread Ian Campbell
On Wed, 2012-05-02 at 10:50 +0100, Michael S. Tsirkin wrote:
> On Wed, May 02, 2012 at 10:45:27AM +0100, Ian Campbell wrote:
> > On Tue, 2012-05-01 at 16:04 +0300, Gleb Natapov wrote:
> > > > BTW, according to arch/x86/include/asm/kvm_para.h unsurprisingly KVM has
> > > > a signature too 'KVMKVMKVM'.
> > > > 
> > > > >   cpu->stepping = eax & 0xf;
> > > > >   cpu->model = (eax >> 4) & 0xf;
> > > > >   cpu->family = (eax >> 8) & 0xf;
> > > > > @@ -29,6 +29,19 @@ void get_cpu_info_basics(struct cpudata *cpu)
> > > > >  
> > > > >   cpuid(cpu->number, 0xC0000000, &maxei, NULL, NULL, NULL);
> > > > >   cpu->maxei2 = maxei;
> > > > > + if (ecx & 0x80000000) {
> > > > > + cpuid(cpu->number, 0x40000000, &maxhv, NULL, NULL, NULL);
> > > > > + /*
> > > > > +  * KVM up to linux 3.4 reports 0 as the max hypervisor leaf,
> > > > > +  * where it really means 0x40000001.
> > > > 
> > > > This is something where I definitely think you want to check the
> > > > signature first.
> > > In theory yes, but in practice what will this break?
> > 
> > I've got no idea -- but what's the harm in checking?
> > 
> > Ian.
> 
> Users can set kvm signature to anything, if they do
> debugging will be a bit harder for them.

Ah, right, someone already mentioned that and I forgot, sorry.

And, just to complete my train of thought, cpuid just returns reserved
values for requests for non-existent leaves (rather than #GP for
example) so it's safe enough even if you do end up trying to read an
eax=0x40000001 when it doesn't exist.

Seems fine to me then.
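
For completeness, a sketch of the signature check being discussed; this
uses gcc's <cpuid.h> helpers rather than x86info's own cpuid() wrapper,
so treat it as an illustration only:

	#include <cpuid.h>
	#include <string.h>

	/* Returns 1 if the hypervisor leaf carries the KVM signature. */
	static int is_kvm(void)
	{
		unsigned int eax, ebx, ecx, edx;
		char sig[13];

		__cpuid(1, eax, ebx, ecx, edx);
		if (!(ecx & 0x80000000))	/* no hypervisor present */
			return 0;
		__cpuid(0x40000000, eax, ebx, ecx, edx);
		memcpy(sig + 0, &ebx, 4);
		memcpy(sig + 4, &ecx, 4);
		memcpy(sig + 8, &edx, 4);
		sig[12] = '\0';
		return strcmp(sig, "KVMKVMKVM") == 0;
	}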

Ian.

-- 
Ian Campbell
Current Noise: Hypocrisy - Buried

He's like a function -- he returns a value, in the form of his opinion.
It's up to you to cast it into a void or not.
-- Phil Lapsley

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: QEMU-KVM 'Random' Virtual MAC Address.

2012-05-02 Thread Veruca Salt

Thanks Markus.
We wrote a simple app this morning which generates a random address (obviously 
not a true GUID), which should be safe enough if a customer runs two physical 
boxes on the local side of their network.
We are currently running a single VM on each box with two NICs, one dedicated 
to host comms across IP and one for the 'world'. Reliable and safe; the 
MAC address is the last part of the jigsaw.
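
For illustration only (a minimal sketch, not the app itself): keep
QEMU's locally-administered 52:54:00 prefix and randomise the low three
octets, so addresses generated on different boxes are very unlikely to
collide.

	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>

	int main(void)
	{
		/* Seed from the clock; good enough for a handful of VMs. */
		srand((unsigned)time(NULL));
		printf("52:54:00:%02x:%02x:%02x\n",
		       rand() & 0xff, rand() & 0xff, rand() & 0xff);
		return 0;
	}
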
Bit of a step down, actually; with host kernel 2.6.32 we were able to pass 
through IOMMU devices, but with Sandy Bridge having shared IRQs on the South 
Bridge, the 2.6.38 kernel (one of the few we can use that 'fully' works with 
Sandy), and the move to 64 bit, IOMMU was disastrously unstable, hence the move 
back to virtual taps.
Simon


> From: arm...@redhat.com
> To: verucasal...@hotmail.co.uk
> CC: kvm@vger.kernel.org
> Subject: Re: QEMU-KVM 'Random' Virtual MAC Address.
> Date: Wed, 2 May 2012 08:58:06 +0200
>
> Veruca Salt  writes:
>
> > Hello.
> > When creating an NIC in qemu, I have found that the default address on XP 
> > virtual machines is the same for two VM's on different physical boxes.
> > Before we go off building an assignment programme to retrofit VM's upgraded 
> > to the fit we're planning, will differently imaged VM's, with different 
> > Computer Names and other data, produce different randomised VMAC's, or will 
> > they always set to the standard default?
> >
> > If they do, we are in a happier place already.
>
> The default MAC address is 52:54:00:12:34:N, where N is 0x56 for the
> first NIC, 0x57 for the second, and so forth. Best not to rely on the
> order of NIC initialization.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
  --
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] x86info: dump kvm cpuid's

2012-05-02 Thread Michael S. Tsirkin
On Wed, May 02, 2012 at 10:45:27AM +0100, Ian Campbell wrote:
> On Tue, 2012-05-01 at 16:04 +0300, Gleb Natapov wrote:
> > > BTW, according to arch/x86/include/asm/kvm_para.h unsurprisingly KVM has
> > > a signature too 'KVMKVMKVM'.
> > > 
> > > > cpu->stepping = eax & 0xf;
> > > > cpu->model = (eax >> 4) & 0xf;
> > > > cpu->family = (eax >> 8) & 0xf;
> > > > @@ -29,6 +29,19 @@ void get_cpu_info_basics(struct cpudata *cpu)
> > > >  
> > > > cpuid(cpu->number, 0xC0000000, &maxei, NULL, NULL, NULL);
> > > > cpu->maxei2 = maxei;
> > > > +   if (ecx & 0x80000000) {
> > > > +   cpuid(cpu->number, 0x40000000, &maxhv, NULL, NULL, NULL);
> > > > +   /*
> > > > +* KVM up to linux 3.4 reports 0 as the max hypervisor leaf,
> > > > +* where it really means 0x40000001.
> > > 
> > > This is something where I definitely think you want to check the
> > > signature first.
> > In theory yes, but in practice what will this break?
> 
> I've got no idea -- but what's the harm in checking?
> 
> Ian.

Users can set the kvm signature to anything; if they do, debugging
will be a bit harder for them.

> -- 
> Ian Campbell
> Current Noise: Hypocrisy - Roswell 47
> 
> Angels we have heard on High
> Tell us to go out and Buy.
>   -- Tom Lehrer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] x86info: dump kvm cpuid's

2012-05-02 Thread Ian Campbell
On Tue, 2012-05-01 at 16:04 +0300, Gleb Natapov wrote:
> > BTW, according to arch/x86/include/asm/kvm_para.h unsurprisingly KVM has
> > a signature too 'KVMKVMKVM'.
> > 
> > >   cpu->stepping = eax & 0xf;
> > >   cpu->model = (eax >> 4) & 0xf;
> > >   cpu->family = (eax >> 8) & 0xf;
> > > @@ -29,6 +29,19 @@ void get_cpu_info_basics(struct cpudata *cpu)
> > >  
> > >   cpuid(cpu->number, 0xC0000000, &maxei, NULL, NULL, NULL);
> > >   cpu->maxei2 = maxei;
> > > + if (ecx & 0x80000000) {
> > > + cpuid(cpu->number, 0x40000000, &maxhv, NULL, NULL, NULL);
> > > + /*
> > > +  * KVM up to linux 3.4 reports 0 as the max hypervisor leaf,
> > > +  * where it really means 0x40000001.
> > 
> > This is something where I definitely think you want to check the
> > signature first.
> In theory yes, but in practice what will this break?

I've got no idea -- but what's the harm in checking?

Ian.

-- 
Ian Campbell
Current Noise: Hypocrisy - Roswell 47

Angels we have heard on High
Tell us to go out and Buy.
-- Tom Lehrer

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v1 3/5] KVM: Add paravirt kvm_flush_tlb_others

2012-05-02 Thread Nikunj A Dadhania
On Tue, 01 May 2012 11:39:36 +0200, Peter Zijlstra  wrote:
> On Sun, 2012-04-29 at 15:23 +0300, Avi Kivity wrote:
> > On 04/27/2012 07:24 PM, Nikunj A. Dadhania wrote:
> > > flush_tlb_others_ipi depends on lot of statics in tlb.c.  Replicated
> > > the flush_tlb_others_ipi as kvm_flush_tlb_others to further adapt to
> > > paravirtualization.
> > >
> > > Use the vcpu state information inside the kvm_flush_tlb_others to
> > > avoid sending ipi to pre-empted vcpus.
> > >
> > > * Do not send ipi's to offline vcpus and set flush_on_enter flag
> > 
> > get_user_pages_fast() depends on the IPI to hold off page table teardown
> > while they are locklessly walked with interrupts disabled.  If a vcpu
> > were to be preempted while in this critical section, another vcpu
> > tearing down page tables would go ahead and destroy them.  when the
> > preempted vcpu resumes it then touches the freed pages.
> > 
> > We could try to teach kvm and get_user_pages_fast() about this, but this
> > is intrusive.  Another option is to replace the cpu_relax() loop with
> > something that sleeps and is then woken up by the TLB IPI handler if needed.
> 
> I think something like
> 
>   select HAVE_RCU_TABLE_FREE if PARAVIRT
> 
> or somesuch is just about all it takes.
>
[root@krm1 linux]# grep HAVE_RCU_TABLE .config
CONFIG_HAVE_RCU_TABLE_FREE=y
[root@krm1 linux]# make -j32  -s
mm/memory.c: In function ‘tlb_remove_table_one’:
mm/memory.c:315: error: implicit declaration of function ‘__tlb_remove_table’

I suppose we need to have __tlb_remove_table. Trying to understand what
needs to be done there.
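
Purely as a guess at the missing piece (modelled on simply freeing the
page that backed the table once the RCU grace period has passed;
whether x86 needs more than this is exactly the open question),
something like this in asm/tlb.h might satisfy the build:

	/* Hypothetical sketch only. */
	static inline void __tlb_remove_table(void *table)
	{
		free_page_and_swap_cache(table);
	}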

Regards
Nikunj

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/1] kvm-s390x: implement KVM_CAP_NR/MAX_VCPUS

2012-05-02 Thread Christian Borntraeger
Let userspace know the number of max and supported cpus for kvm on s390.
Return KVM_MAX_VCPUS (currently 64) for both values.
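
For reference, userspace reads these with KVM_CHECK_EXTENSION on the
/dev/kvm fd; a minimal sketch (not part of the patch):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	int main(void)
	{
		int kvm = open("/dev/kvm", O_RDWR);

		if (kvm < 0)
			return 1;
		/* Both report KVM_MAX_VCPUS (64) with this patch applied. */
		printf("recommended vcpus: %d, max vcpus: %d\n",
		       ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS),
		       ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS));
		return 0;
	}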

Signed-off-by: Christian Borntraeger 
---
 arch/s390/kvm/kvm-s390.c |4 
 1 file changed, 4 insertions(+)

diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index d30c835..54f75ff 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -135,6 +135,10 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_SYNC_REGS:
r = 1;
break;
+   case KVM_CAP_NR_VCPUS:
+   case KVM_CAP_MAX_VCPUS:
+   r = KVM_MAX_VCPUS;
+   break;
default:
r = 0;
}
-- 
1.7.9.6

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/1] Updated Patch for NR/MAX VCPU on s390

2012-05-02 Thread Christian Borntraeger
Marcelo,

here is an updated patch. We will use 64 for max and supported numbers
of vcpus.

Christian Borntraeger (1):
  kvm-s390x: implement KVM_CAP_NR/MAX_VCPUS

 arch/s390/kvm/kvm-s390.c |4 
 1 file changed, 4 insertions(+)

-- 
1.7.9.6

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html