[Bug 43328] VT-d/SR-IOV doesn't work in guest

2012-08-09 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=43328


Jay Ren  changed:

   What|Removed |Added

 Status|VERIFIED|CLOSED




--- Comment #7 from Jay Ren   2012-08-10 06:57:48 ---
Verified, so let's close it.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 43328] VT-d/SR-IOV doesn't work in guest

2012-08-09 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=43328


Jay Ren  changed:

   What|Removed |Added

 Status|RESOLVED|VERIFIED




--- Comment #6 from Jay Ren   2012-08-10 06:55:12 ---
I verified it based on Linux v3.5 tag. As comment #5 said, this bug has been
fixed.
Marked it as "verified".

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 3/3] KVM: PPC: booke: Added debug handler

2012-08-09 Thread Bhushan Bharat-R65777


> -Original Message-
> From: kvm-ppc-ow...@vger.kernel.org [mailto:kvm-ppc-ow...@vger.kernel.org] On
> Behalf Of Alexander Graf
> Sent: Wednesday, August 08, 2012 4:41 PM
> To: Bhushan Bharat-R65777
> Cc: Wood Scott-B07421; kvm-...@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [PATCH 3/3] KVM: PPC: booke: Added debug handler
> 
> 
> On 08.08.2012, at 03:02, Bhushan Bharat-R65777 wrote:
> 
> >
> >
> >> -Original Message-
> >> From: Wood Scott-B07421
> >> Sent: Wednesday, August 08, 2012 2:15 AM
> >> To: Alexander Graf
> >> Cc: Bhushan Bharat-R65777; kvm-...@vger.kernel.org;
> >> kvm@vger.kernel.org; Bhushan
> >> Bharat-R65777
> >> Subject: Re: [PATCH 3/3] KVM: PPC: booke: Added debug handler
> >>
> >> On 08/07/2012 05:47 AM, Alexander Graf wrote:
>  diff --git a/arch/powerpc/kvm/booke_interrupts.S
>  b/arch/powerpc/kvm/booke_interrupts.S
>  index 3539805..890673c 100644
>  --- a/arch/powerpc/kvm/booke_interrupts.S
>  +++ b/arch/powerpc/kvm/booke_interrupts.S
>  @@ -73,6 +73,51 @@ _GLOBAL(kvmppc_handler_\ivor_nr)
>   bctr
>  .endm
> 
>  +.macro KVM_DBG_HANDLER ivor_nr scratch srr0
> >>>
> >>> This is a lot of asm code. Any chance to share a good share of it
> >>> with the
> >> generic handler?
> >>
> >> That entire file could use an update to lok more like
> >> bookehv_interrupts.S and its use of asm macros.
> >
> > In booke there is assumption that size of KVM IVORs will not me more than 
> > host
> IVORs size so that only IVPR is changed.
> >
> > I tried to give it that shape of bookehv_interrupts.S  and found that size 
> > of
> some IVORs become more than host IVORs.
> 
> We can always jump off to another (more generic?) function and only have a 
> small
> stub in the IVOR referenced code.

What extra KVM_DBG_HANDLER have from KVM_HANDLER is the handing of debug single 
step (which is similar to host).
So do you want a jump in assembly for handling the debug single step? Or you 
really think of moving something from the KVM_HANDLER to more generic?

Thanks
-Bharat

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 43328] VT-d/SR-IOV doesn't work in guest

2012-08-09 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=43328


Jay Ren  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution||CODE_FIX




-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/15] memory: MemoryRegion topology must be stable when updating

2012-08-09 Thread liu ping fan
On Thu, Aug 9, 2012 at 4:24 PM, Avi Kivity  wrote:
> On 08/09/2012 10:28 AM, liu ping fan wrote:
>>>
>>> Seems to me that nothing in memory.c can susceptible to races.  It must
>>> already be called under the big qemu lock, and with the exception of
>>> mutators (memory_region_set_*), changes aren't directly visible.
>>>
>> Yes, what I want to do is "prepare unplug out of protection of global
>> lock".  When io-dispatch and mmio-dispatch are all out of big lock, we
>> will run into the following scene:
>> In vcpu context A, qdev_unplug_complete()-> delete subregion;
>> In context B, write pci bar --> pci mapping update-> add subregion
>
> Why do you want unlocked unplug?  Unplug is rare and complicated; there
> are no performance considerations on one hand, and difficulty of testing
> for lock correctness on the other.  I think it is better if it remains
> protected by the global lock.
>
Oh, yes! I deviate quite far from the origin aim, and introduce some
unnecessary complicate.

>>
>>> I think it's sufficient to take the mem_map_lock at the beginning of
>>> core_begin() and drop it at the end of core_commit().  That means all
>>> updates of volatile state, phys_map, are protected.
>>>
>> The mem_map_lock is to protect both address_space_io and
>> address_space_memory. When without the protection of big lock,
>> competing will raise among the updaters
>> (memory_region_{add,del}_subregion and the readers
>> generate_memory_topology()->render_memory_region().
>
> These should all run under the big qemu lock, for the same reasons.
> They are rare and not performance sensitive.  Only phys_map reads are
> performance sensitive.
>
OK, I see. Leave the big lock as it is, except for mmio, we will not
worry about it.
>>
>> If just in core_begin/commit, we will duplicate it for
>> xx_begin/commit, right?
>
> No.  Other listeners will be protected by the global lock.
>
Yes, if leave the big lock as it is.
>> And at the same time, mr->subregions is
>> exposed under SMP without big lock.
>>
>
> Who accesses it?
>
Again, I assume the updaters out of the protection of the big lock

> IMO locking should look like:
>
>   phys_map: mem_map_lock
>   dispatch callbacks: device specific lock (or big qemu lock for
> unconverted devices)
>   everything else: big qemu lock
>
I See. Thank you for the review. And I will eliminate the unnecessary
complicate and effort for the next version

Regards,
pingfan
>
>
> --
> error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 06/15] memory: use refcnt to manage MemoryRegion

2012-08-09 Thread liu ping fan
On Thu, Aug 9, 2012 at 4:38 PM, Avi Kivity  wrote:
> On 08/09/2012 10:27 AM, liu ping fan wrote:
>> On Wed, Aug 8, 2012 at 5:20 PM, Avi Kivity  wrote:
>>> On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
 From: Liu Ping Fan 

 Using refcnt for mr, so we can separate mr's life cycle management
 from refered object.
   When mr->ref 0->1, inc the refered object.
   When mr->ref 1->0, dec the refered object.

 The refered object can be DeviceStae, another mr, or other opaque.
>>>
>>> Please explain the motivation more fully.
>>>
>> Actually, the aim is to mange the reference of an object, used by mem view.
>> DeviceState can be referred by different system, when it comes to the
>> view of subsystem, we hold dev's ref. And any indirect reference will
>> just mr->ref++, not dev's.
>> This can help us avoid the down-walk through the referred chain, like
>> alias> mr ---> DeviceState.
>
> That is a lot of complexity, for no gain.  Manipulating memory regions
> is a slow path, and can be done under the bit qemu lock without any
> complications.
>
OK. I will discard this design.
>>
>> In the previous discussion, you have suggest add dev->ref++ in
>> core_region_add.  But I think, if we can move it to higher layer --
>> memory_region_{add,del}_subregion, so we can avoid to duplicate do
>> this in other xx_region_add.
>
> Why would other memory listeners be impacted?  They all operate under
> the big qemu lock.  If they start using devices outside the lock, then
> they need to take a reference.
>
Yes, if unplug path in the protection of big lock.
And just one extra question, for ram-unplug scene, how do we protect from:
  updater:  ram-unplug -->qemu free() --> brk() invalidate this vaddr interval
  reader:  vhost-thread copy data from the interval
I guess something like lock/ref used by them, but can not find such
mechanism in vhost_set_memory() to protect the scene against
vhost_worker()

Thanks and regards,
pingfan

>> As a payment for this, we need to handle alias which can be avoid at
>> core_region_add().  And mr's ref can help to avoid
>>  the down-walk.
>
> The payment is two systems of reference counts.
>
> --
> error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/15] qom: introduce reclaimer to release obj

2012-08-09 Thread liu ping fan
On Thu, Aug 9, 2012 at 4:18 PM, Avi Kivity  wrote:
> On 08/09/2012 10:49 AM, Paolo Bonzini wrote:
>> Il 09/08/2012 09:33, liu ping fan ha scritto:
>>> Yes, it is to defer destructors.
>>> See 0009-memory-prepare-flatview-and-radix-tree-for-rcu-style.patch
>>> When MemoryRegion is _del_subregion from mem in updater, it may be
>>> still in use by reader -- radix or flatview, so defer its destructors
>>> to the reclaimer --phys_map_release(PhysMap *map)
>>
>> How are you sure that the reader is already out of its critical section
>> by the time the reclaimer runs?
>>
>>> If we have rcu, it could be elegant to do this.
>>
>> Yeah, I think inventing primitives is dangerous and difficult to review;
>> and it may be difficult to replace it with proper call_rcu.
>>
>> You should probably make a proof-of-concept using liburcu.  Then we can
>> decide how to implement RCU in a way that is portable enough for QEMU's
>> needs.
>
> IMO we should start with a simple mutex (which will cover only the
> lookup and map rebuild).  This should reduce the contention to basically
> nothing (still leaving a cache line bounce).  If a profile shows the
> cache line bounce hurting us, or perhaps contention in ultralarge
> guests, then we should switch to rcu.
>
Agree, I think this will pin us on the major issue -- mmio perfermance

Regards,
pingfan
>
> --
> error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 15/15] e1000: using new interface--unmap to unplug

2012-08-09 Thread liu ping fan
On Thu, Aug 9, 2012 at 3:40 PM, Paolo Bonzini  wrote:
> Il 09/08/2012 09:28, liu ping fan ha scritto:
 >> +static void
 >> +pci_e1000_unmap(PCIDevice *p)
 >> +{
 >> +/* DO NOT FREE anything!until refcnt=0 */
 >> +/* isolate from memory view */
 >> +}
>>> >
>>> > At least you need to call the superclass method.
>>> >
>> Refer to  0013-hotplug-introduce-qdev_unplug_complete-to-remove-dev.patch,
>> we have the following sequence
>> qdev_unmap->pci_unmap_device->pci_e1000_unmap.  So pci_e1000_unmap
>> need not to do anything.
>
> But then this patch is unnecessary, isn't it?
>
Yes,  should remove from next version

Regards,
pingfan


> Paolo
>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 13/15] hotplug: introduce qdev_unplug_complete() to remove device from views

2012-08-09 Thread liu ping fan
On Thu, Aug 9, 2012 at 4:00 PM, Paolo Bonzini  wrote:
> Il 09/08/2012 09:28, liu ping fan ha scritto:
>>> > VCPU threadI/O thread
>>> > =
>>> > get MMIO request
>>> > rcu_read_lock()
>>> > walk memory map
>>> >qdev_unmap()
>>> >lock_devtree()
>>> >...
>>> >unlock_devtree
>>> >unref dev -> refcnt=0, free enqueued
>>> > ref()
>> No ref() for dev here, while we have ref to flatview+radix in my patches.
>> I use rcu to protect radix+flatview+mr refered. As to dev, its ref has
>> inc when it is added into mem view -- that is
>> memory_region_add_subregion -> memory_region_get() {
>> if(atomic_add_and_return()) dev->ref++  }.
>> So not until reclaimer of mem view, the dev's ref is hold by mem view.
>> In a short word, rcu protect mem view, while device is protected by refcnt.
>
> But the RCU critical section should not include the whole processing of
> MMIO, only the walk of the memory map.
>
Yes, you are right.  And I think cur_map_get() can be broken into the
style "lock,  ref++, phys_page_find(); unlock".  easily.

> And in general I think this is a bit too tricky... I understand not
> adding refcounting to all of bottom halves, timers, etc., but if you are
> using a device you should have explicit ref/unref pairs.
>
Actually, there are pairs -- when dev enter mem view, inc ref; and
when it leave, dec ref.
But as Avi has pointed out, the mr->refcnt introduce complicate and no
gain. So I will discard this design

Thanks and regards,
pingfan

> Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/3] KVM: PPC: booke: Allow multiple exception types

2012-08-09 Thread Bhushan Bharat-R65777


> -Original Message-
> From: kvm-ppc-ow...@vger.kernel.org [mailto:kvm-ppc-ow...@vger.kernel.org] On
> Behalf Of Alexander Graf
> Sent: Tuesday, August 07, 2012 4:16 PM
> To: Bhushan Bharat-R65777
> Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Bhushan Bharat-R65777
> Subject: Re: [PATCH 2/3] KVM: PPC: booke: Allow multiple exception types
> 
> 
> On 03.08.2012, at 09:08, Bharat Bhushan wrote:
> 
> > Current kvmppc_booke_handlers uses the same macro (KVM_HANDLER) and
> > all handlers are considered to be the same size. This will not be
> > the case if we want to use different macros for different handlers.
> >
> > This patch improves the kvmppc_booke_handler so that it can
> > support different macros for different handlers.
> >
> > Signed-off-by: Liu Yu 
> > [bharat.bhus...@freescale.com: Substantial changes]
> > Signed-off-by: Bharat Bhushan 
> > ---
> > arch/powerpc/include/asm/kvm_ppc.h  |2 -
> > arch/powerpc/kvm/booke.c|9 ---
> > arch/powerpc/kvm/booke.h|1 +
> > arch/powerpc/kvm/booke_interrupts.S |   37 
> > --
> > arch/powerpc/kvm/e500.c |   13 +++
> > 5 files changed, 48 insertions(+), 14 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/kvm_ppc.h
> b/arch/powerpc/include/asm/kvm_ppc.h
> > index 1438a5e..deb55bd 100644
> > --- a/arch/powerpc/include/asm/kvm_ppc.h
> > +++ b/arch/powerpc/include/asm/kvm_ppc.h
> > @@ -47,8 +47,6 @@ enum emulation_result {
> >
> > extern int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu);
> > extern int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu 
> > *vcpu);
> > -extern char kvmppc_handlers_start[];
> > -extern unsigned long kvmppc_handler_len;
> > extern void kvmppc_handler_highmem(void);
> >
> > extern void kvmppc_dump_vcpu(struct kvm_vcpu *vcpu);
> > diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> > index aebbb8b..1338233 100644
> > --- a/arch/powerpc/kvm/booke.c
> > +++ b/arch/powerpc/kvm/booke.c
> > @@ -1541,6 +1541,7 @@ int __init kvmppc_booke_init(void)
> > {
> > #ifndef CONFIG_KVM_BOOKE_HV
> > unsigned long ivor[16];
> > +   unsigned long *handler = kvmppc_booke_handler_addr;
> > unsigned long max_ivor = 0;
> > int i;
> >
> > @@ -1574,14 +1575,14 @@ int __init kvmppc_booke_init(void)
> >
> > for (i = 0; i < 16; i++) {
> > if (ivor[i] > max_ivor)
> > -   max_ivor = ivor[i];
> > +   max_ivor = i;
> >
> > memcpy((void *)kvmppc_booke_handlers + ivor[i],
> > -  kvmppc_handlers_start + i * kvmppc_handler_len,
> > -  kvmppc_handler_len);
> > +  (void *)handler[i], handler[i + 1] - handler[i]);
> > }
> > flush_icache_range(kvmppc_booke_handlers,
> > -  kvmppc_booke_handlers + max_ivor + 
> > kvmppc_handler_len);
> > +  kvmppc_booke_handlers + ivor[max_ivor] +
> > +   handler[max_ivor + 1] - handler[max_ivor]);
> > #endif /* !BOOKE_HV */
> > return 0;
> > }
> > diff --git a/arch/powerpc/kvm/booke.h b/arch/powerpc/kvm/booke.h
> > index ba61974..de9e526 100644
> > --- a/arch/powerpc/kvm/booke.h
> > +++ b/arch/powerpc/kvm/booke.h
> > @@ -65,6 +65,7 @@
> >   (1 << BOOKE_IRQPRIO_CRITICAL))
> >
> > extern unsigned long kvmppc_booke_handlers;
> > +extern unsigned long kvmppc_booke_handler_addr[];
> >
> > void kvmppc_set_msr(struct kvm_vcpu *vcpu, u32 new_msr);
> > void kvmppc_mmu_msr_notify(struct kvm_vcpu *vcpu, u32 old_msr);
> > diff --git a/arch/powerpc/kvm/booke_interrupts.S
> b/arch/powerpc/kvm/booke_interrupts.S
> > index bb46b32..3539805 100644
> > --- a/arch/powerpc/kvm/booke_interrupts.S
> > +++ b/arch/powerpc/kvm/booke_interrupts.S
> > @@ -73,6 +73,14 @@ _GLOBAL(kvmppc_handler_\ivor_nr)
> > bctr
> > .endm
> >
> > +.macro KVM_HANDLER_ADDR ivor_nr
> > +   .long   kvmppc_handler_\ivor_nr
> > +.endm
> > +
> > +.macro KVM_HANDLER_END
> > +   .long   kvmppc_handlers_end
> > +.endm
> > +
> > _GLOBAL(kvmppc_handlers_start)
> > KVM_HANDLER BOOKE_INTERRUPT_CRITICAL SPRN_SPRG_RSCRATCH_CRIT SPRN_CSRR0
> > KVM_HANDLER BOOKE_INTERRUPT_MACHINE_CHECK  SPRN_SPRG_RSCRATCH_MC SPRN_MCSRR0
> > @@ -93,9 +101,7 @@ KVM_HANDLER BOOKE_INTERRUPT_DEBUG SPRN_SPRG_RSCRATCH_CRIT
> SPRN_CSRR0
> > KVM_HANDLER BOOKE_INTERRUPT_SPE_UNAVAIL SPRN_SPRG_RSCRATCH0 SPRN_SRR0
> > KVM_HANDLER BOOKE_INTERRUPT_SPE_FP_DATA SPRN_SPRG_RSCRATCH0 SPRN_SRR0
> > KVM_HANDLER BOOKE_INTERRUPT_SPE_FP_ROUND SPRN_SPRG_RSCRATCH0 SPRN_SRR0
> > -
> > -_GLOBAL(kvmppc_handler_len)
> > -   .long kvmppc_handler_1 - kvmppc_handler_0
> > +_GLOBAL(kvmppc_handlers_end)
> >
> > /* Registers:
> >  *  SPRG_SCRATCH0: guest r4
> > @@ -463,6 +469,31 @@ lightweight_exit:
> > lwz r4, VCPU_GPR(R4)(r4)
> > rfi
> >
> > +   .data
> > +   .align  4
> > +   .globl  kvmppc_booke_handler_addr
> > +kvmppc_booke_handler_addr:
> > +KVM_HANDLER_ADDR BOOKE_INTE

[PATCH 2/2] kvm tools: Check for unknown parameters in network option handling

2012-08-09 Thread Michael Ellerman
Currently the parsing of network command line parameters doesn't check
for unknown parameters.

For example "-n mod=none" will cause kvmtool to setup TAP networking.

Save users from their typos by checking for unknown parameters and bailing.

Signed-off-by: Michael Ellerman 
---
 tools/kvm/builtin-run.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/kvm/builtin-run.c b/tools/kvm/builtin-run.c
index a36bd00..9e5c1d4 100644
--- a/tools/kvm/builtin-run.c
+++ b/tools/kvm/builtin-run.c
@@ -257,7 +257,8 @@ static int set_net_param(struct virtio_net_params *p, const 
char *param,
p->vhost = atoi(val);
} else if (strcmp(param, "fd") == 0) {
p->fd = atoi(val);
-   }
+   } else
+   die("Unknown network parameter %s", param);
 
return 0;
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] kvm tools: Fix formatting of error message in TAP handling

2012-08-09 Thread Michael Ellerman
This error message is missing a space, and has a redundant ":" at the end,
currently it produces:

  You have requested a TAP device, but creation of one hasfailed because:: No 
such file or directory

Add a space to "hasfailed" and remove the extra ":".

Don't split the line to improve grepability.

Signed-off-by: Michael Ellerman 
---
 tools/kvm/virtio/net.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/kvm/virtio/net.c b/tools/kvm/virtio/net.c
index 10420ae..8f3735b 100644
--- a/tools/kvm/virtio/net.c
+++ b/tools/kvm/virtio/net.c
@@ -526,8 +526,7 @@ void virtio_net__init(const struct virtio_net_params 
*params)
ndev->mode = params->mode;
if (ndev->mode == NET_MODE_TAP) {
if (!virtio_net__tap_init(params, ndev))
-   die_perror("You have requested a TAP device, but 
creation of one has"
-   "failed because:");
+   die_perror("You have requested a TAP device, but 
creation of one has failed because");
ndev->ops = &tap_ops;
} else {
ndev->info.host_ip  = 
ntohl(inet_addr(params->host_ip));
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm tools: Fix crash when /etc/resolv.conf doesn't exist

2012-08-09 Thread Michael Ellerman
In uip_dhcp_get_dns() we try to open /etc/resolv.conf. If we fail to
open it we then SEGV trying to fclose() it.

Fix the code to just return directly if we can't open it.

Signed-off-by: Michael Ellerman 
---
 tools/kvm/net/uip/dhcp.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tools/kvm/net/uip/dhcp.c b/tools/kvm/net/uip/dhcp.c
index e91a7c7..b17d352 100644
--- a/tools/kvm/net/uip/dhcp.c
+++ b/tools/kvm/net/uip/dhcp.c
@@ -45,7 +45,7 @@ int uip_dhcp_get_dns(struct uip_info *info)
 
fp = fopen("/etc/resolv.conf", "r");
if (!fp)
-   goto out;
+   return ret;
 
while (!feof(fp)) {
if (fscanf(fp, "%s %s\n", key, val) != 2)
@@ -62,7 +62,6 @@ int uip_dhcp_get_dns(struct uip_info *info)
}
}
 
-out:
fclose(fp);
return ret;
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 03/12] KVM: introduce gfn_to_pfn_memslot_atomic

2012-08-09 Thread Xiao Guangrong
On 08/10/2012 02:50 AM, Marcelo Tosatti wrote:
> On Tue, Aug 07, 2012 at 05:49:36PM +0800, Xiao Guangrong wrote:
>> It can instead of hva_to_pfn_atomic
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  arch/x86/kvm/mmu.c   |5 +
>>  include/linux/kvm_host.h |3 ++-
>>  virt/kvm/kvm_main.c  |   14 --
>>  3 files changed, 11 insertions(+), 11 deletions(-)
> 
> What if someone needs gfn_to_hva_memslot in the future, which is not
> unlikely?

Marcelo,

We have already had the function gfn_to_hva_memslot in kvm_host.h which
is not directly used on x86. In this patchset, gfn_to_hva_memslot does
not check the slot->flags.

I think we can let it be based on gfn_to_hva_many:
static inline unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot,
   gfn_t gfn)
{
return gfn_to_hva_many(slot, gfn, true);
}

Then all gfn_to_*() functions are based on gfn_to_hva_many in which we
will check slot->flags properly.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6 3/3] KVM: perf kvm events analysis tool

2012-08-09 Thread Dong Hao
From: Xiao Guangrong 

Add 'perf kvm stat' support to analyze kvm vmexit/mmio/ioport smartly.

Usage:
- kvm stat
  run a command and gather performance counter statistics, it is the alias of
  perf stat

- trace kvm events:
  perf kvm stat record, or, if other tracepoints are interesting as well, we
  can append the events like this:
  perf kvm stat record -e timer:*
  If many guests are running, we can track the specified guest by using -p or
  --pid

- show the result:
  perf kvm stat report

The output example is following:
# pgrep qemu-kvm
27841
27888
27936

total 3 guests are running on the host

Then, track the guest whose pid is 27936:
# ./perf kvm stat record -p 27936
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.177 MB perf.data.guest (~7742 samples) ]

See the vmexit events:
# ./perf kvm stat report --event=vmexit


Analyze events for all VCPUs:

 VM-EXITSamples  Samples% Time% Avg time

 APIC_ACCESS23780.34% 0.01% 20.86us ( +-   8.03% )
 HLT 4515.25%99.98% 790874.75us ( +-  16.93% )
  EXTERNAL_INTERRUPT 11 3.73% 0.00% 47.70us ( +-  17.45% )
   EXCEPTION_NMI  1 0.34% 0.00%  5.11us ( +-   -nan% )
   CR_ACCESS  1 0.34% 0.00%  6.33us ( +-   -nan% )

Total Samples:295, Total events handled time:35594844.34us.

See the mmio events:
# ./perf kvm stat report --event=mmio


Analyze events for all VCPUs:

 MMIO AccessSamples  Samples% Time% Avg time

0xfee00380:W15182.07%76.19%  7.72us ( +-  11.96% )
0xfee00300:W 11 5.98%18.86% 26.23us ( +-  27.33% )
0xfee00300:R 11 5.98% 2.16%  3.00us ( +-   9.17% )
0xfee00310:W 11 5.98% 2.79%  3.88us ( +-   9.15% )

Total Samples:184, Total events handled time:1529.42us.

See the ioport event:
# ./perf kvm stat report --event=ioport


Analyze events for all VCPUs:

  IO Port AccessSamples  Samples% Time% Avg time

  0x5658:PIN   335755.22%92.13%162.31us ( +-   0.87% )
 0xc090:POUT   122120.09% 2.10% 10.18us ( +-   4.97% )
0x60:PIN74812.30% 3.18% 25.17us ( +-   5.01% )
0x64:PIN74812.30% 2.57% 20.35us ( +-  11.81% )
 0xc050:POUT  5 0.08% 0.01%  8.79us ( +-  12.88% )

Total Samples:6079, Total events handled time:591405.43us.

And, --vcpu is used to track the specified vcpu and --key is used to sort the
result:
# ./perf kvm stat report --event=ioport --vcpu=0 --key=time


Analyze events for VCPU 0:

  IO Port AccessSamples  Samples% Time% Avg time

  0x5658:PIN14594.77%99.67%133.00us ( +-   2.96% )
 0xc090:POUT  8 5.23% 0.33%  7.99us ( +-  16.85% )

Total Samples:153, Total events handled time:19348.87us.


[ Dong Hao :
 - rebase it on current tip tree
 - fix the compiling-error on i386
]
Signed-off-by: Xiao Guangrong 
Signed-off-by: Dong Hao 
---
 tools/perf/Documentation/perf-kvm.txt |   30 ++-
 tools/perf/MANIFEST   |3 +
 tools/perf/builtin-kvm.c  |  858 -
 tools/perf/util/header.c  |   55 ++-
 tools/perf/util/header.h  |1 +
 tools/perf/util/thread.h  |2 +
 6 files changed, 943 insertions(+), 6 deletions(-)

diff --git a/tools/perf/Documentation/perf-kvm.txt 
b/tools/perf/Documentation/perf-kvm.txt
index dd84cb2..d52feef 100644
--- a/tools/perf/Documentation/perf-kvm.txt
+++ b/tools/perf/Documentation/perf-kvm.txt
@@ -12,7 +12,7 @@ SYNOPSIS
[--guestkallsyms= --guestmodules= | --guestvmlinux=]]
{top|record|report|diff|buildid-list}
 'perf kvm' [--host] [--guest] [--guestkallsyms= --guestmodules=
-   | --guestvmlinux=] {top|record|report|diff|buildid-list}
+   | --guestvmlinux=] {top|record|report|diff|buildid-list|stat}
 
 DESCRIPTION
 ---
@@ -38,6 +38,18 @@ There are a couple of variants of perf kvm:
   so that other tools can be used to fetch packages with matching symbol tables
   for use by perf report.
 
+  'perf kvm stat ' to run a command and gather performance counter
+   statistics.
+  Especially, perf 'kvm stat record/report' generates a statistical analysis
+  of KVM events. Currently, vmexit, mmio and ioport events are supported.
+'perf kvm stat record ' records kvm events and the events between
+start and end .
+And this command produces a file which contains tracing results of kvm
+events.
+
+'perf kvm stat report' reports statistical data which includes events
+handled time, samples, and so on.
+
 OPTIONS
 ---
 -i::
@@ -68,7 +80,21 @@ OPTIONS
 --guestvmlinux=::
Guest os kernel vmlinux.
 
+STAT REPORT OPTIONS
+-

[PATCH v6 1/3] KVM: x86: export svm/vmx exit code and vector code to userspace

2012-08-09 Thread Dong Hao
From: Xiao Guangrong 

They will be needed by 'perf kvm stat'

[ Dong Hao : rebase it on current kvm/tip tree]
Signed-off-by: Xiao Guangrong
Signed-off-by: Dong Hao 
---
 arch/x86/include/asm/kvm_host.h |   36 ---
 arch/x86/include/asm/svm.h  |  205 +--
 arch/x86/include/asm/vmx.h  |  126 
 arch/x86/kvm/trace.h|   89 -
 4 files changed, 234 insertions(+), 222 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1309e69..df23d75 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -11,6 +11,24 @@
 #ifndef _ASM_X86_KVM_HOST_H
 #define _ASM_X86_KVM_HOST_H
 
+#define DE_VECTOR 0
+#define DB_VECTOR 1
+#define BP_VECTOR 3
+#define OF_VECTOR 4
+#define BR_VECTOR 5
+#define UD_VECTOR 6
+#define NM_VECTOR 7
+#define DF_VECTOR 8
+#define TS_VECTOR 10
+#define NP_VECTOR 11
+#define SS_VECTOR 12
+#define GP_VECTOR 13
+#define PF_VECTOR 14
+#define MF_VECTOR 16
+#define MC_VECTOR 18
+
+#ifdef __KERNEL__
+
 #include 
 #include 
 #include 
@@ -75,22 +93,6 @@
 #define KVM_HPAGE_MASK(x)  (~(KVM_HPAGE_SIZE(x) - 1))
 #define KVM_PAGES_PER_HPAGE(x) (KVM_HPAGE_SIZE(x) / PAGE_SIZE)
 
-#define DE_VECTOR 0
-#define DB_VECTOR 1
-#define BP_VECTOR 3
-#define OF_VECTOR 4
-#define BR_VECTOR 5
-#define UD_VECTOR 6
-#define NM_VECTOR 7
-#define DF_VECTOR 8
-#define TS_VECTOR 10
-#define NP_VECTOR 11
-#define SS_VECTOR 12
-#define GP_VECTOR 13
-#define PF_VECTOR 14
-#define MF_VECTOR 16
-#define MC_VECTOR 18
-
 #define SELECTOR_TI_MASK (1 << 2)
 #define SELECTOR_RPL_MASK 0x03
 
@@ -995,4 +997,6 @@ int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, 
u64 *data);
 void kvm_handle_pmu_event(struct kvm_vcpu *vcpu);
 void kvm_deliver_pmi(struct kvm_vcpu *vcpu);
 
+#endif
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index f2b83bc..cdf5674 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -1,6 +1,135 @@
 #ifndef __SVM_H
 #define __SVM_H
 
+#define SVM_EXIT_READ_CR0  0x000
+#define SVM_EXIT_READ_CR3  0x003
+#define SVM_EXIT_READ_CR4  0x004
+#define SVM_EXIT_READ_CR8  0x008
+#define SVM_EXIT_WRITE_CR0 0x010
+#define SVM_EXIT_WRITE_CR3 0x013
+#define SVM_EXIT_WRITE_CR4 0x014
+#define SVM_EXIT_WRITE_CR8 0x018
+#define SVM_EXIT_READ_DR0  0x020
+#define SVM_EXIT_READ_DR1  0x021
+#define SVM_EXIT_READ_DR2  0x022
+#define SVM_EXIT_READ_DR3  0x023
+#define SVM_EXIT_READ_DR4  0x024
+#define SVM_EXIT_READ_DR5  0x025
+#define SVM_EXIT_READ_DR6  0x026
+#define SVM_EXIT_READ_DR7  0x027
+#define SVM_EXIT_WRITE_DR0 0x030
+#define SVM_EXIT_WRITE_DR1 0x031
+#define SVM_EXIT_WRITE_DR2 0x032
+#define SVM_EXIT_WRITE_DR3 0x033
+#define SVM_EXIT_WRITE_DR4 0x034
+#define SVM_EXIT_WRITE_DR5 0x035
+#define SVM_EXIT_WRITE_DR6 0x036
+#define SVM_EXIT_WRITE_DR7 0x037
+#define SVM_EXIT_EXCP_BASE 0x040
+#define SVM_EXIT_INTR  0x060
+#define SVM_EXIT_NMI   0x061
+#define SVM_EXIT_SMI   0x062
+#define SVM_EXIT_INIT  0x063
+#define SVM_EXIT_VINTR 0x064
+#define SVM_EXIT_CR0_SEL_WRITE 0x065
+#define SVM_EXIT_IDTR_READ 0x066
+#define SVM_EXIT_GDTR_READ 0x067
+#define SVM_EXIT_LDTR_READ 0x068
+#define SVM_EXIT_TR_READ   0x069
+#define SVM_EXIT_IDTR_WRITE0x06a
+#define SVM_EXIT_GDTR_WRITE0x06b
+#define SVM_EXIT_LDTR_WRITE0x06c
+#define SVM_EXIT_TR_WRITE  0x06d
+#define SVM_EXIT_RDTSC 0x06e
+#define SVM_EXIT_RDPMC 0x06f
+#define SVM_EXIT_PUSHF 0x070
+#define SVM_EXIT_POPF  0x071
+#define SVM_EXIT_CPUID 0x072
+#define SVM_EXIT_RSM   0x073
+#define SVM_EXIT_IRET  0x074
+#define SVM_EXIT_SWINT 0x075
+#define SVM_EXIT_INVD  0x076
+#define SVM_EXIT_PAUSE 0x077
+#define SVM_EXIT_HLT   0x078
+#define SVM_EXIT_INVLPG0x079
+#define SVM_EXIT_INVLPGA   0x07a
+#define SVM_EXIT_IOIO  0x07b
+#define SVM_EXIT_MSR   0x07c
+#define SVM_EXIT_TASK_SWITCH   0x07d
+#define SVM_EXIT_FERR_FREEZE   0x07e
+#define SVM_EXIT_SHUTDOWN  0x07f
+#define SVM_EXIT_VMRUN 0x080
+#define SVM_EXIT_VMMCALL   0x081
+#define SVM_EXIT_VMLOAD0x082
+#define SVM_EXIT_VMSAVE0x083
+#define SVM_EXIT_STGI  0x084
+#define SVM_EXIT_CLGI  0x085
+#define SVM_EXIT_SKINIT0x086
+#define SVM_EXIT_RDTSCP0x087
+#define SVM_EXIT_ICEBP 0x088
+#define SVM_EXIT_WBINVD0x089
+#define SVM_EXIT_MONITOR   0x08a
+#define SVM_EXIT_MWAIT 0x08b
+#define SVM_EXIT_MWAIT_COND0x08c
+#define SVM_EXIT_XSETBV0x08d
+#define SVM_EXIT_NPF   0x400
+
+#define SVM_EXIT_ERR   -1
+
+#define SVM_EXIT_REASONS \
+   { SVM_EXIT_READ_CR0,"read_cr0" }, \
+   { SVM_EXIT_READ_CR3,"read_cr

[PATCH v6 2/3] KVM: x86: tracemmio begin and complete

2012-08-09 Thread Dong Hao
From: Xiao Guangrong 

'perf kvm stat record/report' will use kvm_exit and kvm_mmio(read...) to
calculate mmio read emulated time for the old kernel, in order to trace
mmio read event more exactly, we add kvm_mmio_begin to trace the time when
mmio read begins, also, add kvm_io_done to trace the time when mmio/pio is
completed

[ Dong Hao : rebase it on current kvm tree ]
Signed-off-by: Xiao Guangrong 
Signed-off-by: Dong Hao 
---
 arch/x86/kvm/x86.c |   32 
 include/trace/events/kvm.h |   37 +
 2 files changed, 57 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8ebf65c..164ed9d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3807,9 +3807,12 @@ mmio:
/*
 * Is this MMIO handled locally?
 */
+   trace_kvm_mmio_begin(vcpu->vcpu_id, write, gpa);
handled = ops->read_write_mmio(vcpu, gpa, bytes, val);
-   if (handled == bytes)
+   if (handled == bytes) {
+   trace_kvm_io_done(vcpu->vcpu_id);
return X86EMUL_CONTINUE;
+   }
 
gpa += handled;
bytes -= handled;
@@ -4002,6 +4005,7 @@ static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int 
size,
vcpu->arch.pio.size = size;
 
if (!kernel_pio(vcpu, vcpu->arch.pio_data)) {
+   trace_kvm_io_done(vcpu->vcpu_id);
vcpu->arch.pio.count = 0;
return 1;
}
@@ -4602,9 +4606,7 @@ restart:
inject_emulated_exception(vcpu);
r = EMULATE_DONE;
} else if (vcpu->arch.pio.count) {
-   if (!vcpu->arch.pio.in)
-   vcpu->arch.pio.count = 0;
-   else
+   if (vcpu->arch.pio.in)
writeback = false;
r = EMULATE_DO_MMIO;
} else if (vcpu->mmio_needed) {
@@ -4635,8 +4637,6 @@ int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size, 
unsigned short port)
unsigned long val = kvm_register_read(vcpu, VCPU_REGS_RAX);
int ret = emulator_pio_out_emulated(&vcpu->arch.emulate_ctxt,
size, port, &val, 1);
-   /* do not return to emulator after return from userspace */
-   vcpu->arch.pio.count = 0;
return ret;
 }
 EXPORT_SYMBOL_GPL(kvm_fast_pio_out);
@@ -5487,11 +5487,16 @@ static int complete_mmio(struct kvm_vcpu *vcpu)
 {
struct kvm_run *run = vcpu->run;
struct kvm_mmio_fragment *frag;
-   int r;
+   int r = 1;
 
if (!(vcpu->arch.pio.count || vcpu->mmio_needed))
return 1;
 
+   if (vcpu->arch.pio.count && !vcpu->arch.pio.in) {
+   vcpu->arch.pio.count = 0;
+   goto exit;
+   }
+
if (vcpu->mmio_needed) {
/* Complete previous fragment */
frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment++];
@@ -5499,8 +5504,10 @@ static int complete_mmio(struct kvm_vcpu *vcpu)
memcpy(frag->data, run->mmio.data, frag->len);
if (vcpu->mmio_cur_fragment == vcpu->mmio_nr_fragments) {
vcpu->mmio_needed = 0;
+
if (vcpu->mmio_is_write)
-   return 1;
+   goto exit;
+
vcpu->mmio_read_completed = 1;
goto done;
}
@@ -5517,11 +5524,12 @@ static int complete_mmio(struct kvm_vcpu *vcpu)
}
 done:
vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
-   r = emulate_instruction(vcpu, EMULTYPE_NO_DECODE);
+   r = emulate_instruction(vcpu, EMULTYPE_NO_DECODE) == EMULATE_DONE;
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
-   if (r != EMULATE_DONE)
-   return 0;
-   return 1;
+
+exit:
+   trace_kvm_io_done(vcpu->vcpu_id);
+   return r;
 }
 
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 7ef9e75..5be5ad3 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -177,6 +177,43 @@ TRACE_EVENT(kvm_mmio,
  __entry->len, __entry->gpa, __entry->val)
 );
 
+TRACE_EVENT(kvm_mmio_begin,
+   TP_PROTO(unsigned int vcpu_id, bool rw, u64 gpa),
+   TP_ARGS(vcpu_id, rw, gpa),
+
+   TP_STRUCT__entry(
+   __field(unsigned int, vcpu_id)
+   __field(int, type)
+   __field(u64, gpa)
+   ),
+
+   TP_fast_assign(
+   __entry->vcpu_id = vcpu_id;
+   __entry->type = rw ? KVM_TRACE_MMIO_WRITE :
+   KVM_TRACE_MMIO_READ;
+   __entry->gpa = gpa;
+   ),
+
+   TP_printk("vcpu %u mmio %s gpa 0x%llx", __entry->vcpu_id,
+   __print_symbolic(__entry->type, kvm_trace_symbol_mmio),
+   _

[PATCH v6 0/3] KVM: perf: kvm events analysis tool

2012-08-09 Thread Dong Hao
From: Xiao Guangrong

This patchset introduces a perf-based tool (perf kvm stat record/report)
which can analysis kvm events more smartly. This is a presentation on
2012 Japan LinuxCon:
http://events.linuxfoundation.org/images/stories/pdf/lcjp2012_guangrong.pdf
You can get more detail from it. Any question/comment please feel free let
us know.

Patch 1 and patch 3 can be applied to either tip tree or kvm tree, but patch 2
can only be applied to kvm tree. Fortunately, patch 2 is just doing the
"improvement" work, and it can be picked up independently.

Usage:
- kvm stat
  run a command and gather performance counter statistics, it is the alias of
  perf stat

- trace kvm events:
  perf kvm stat record, or, if other tracepoints are interesting as well, we
  can append the events like this:
  perf kvm stat record -e timer:*
  If many guests are running, we can track the specified guest by using -p or
  --pid

- show the result:
  perf kvm stat report

The output example is following:
# pgrep qemu-kvm
27841
27888
27936

total 3 guests are running on the host

Then, track the guest whose pid is 27936:
# ./perf kvm stat record -p 27936
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.177 MB perf.data.guest (~7742 samples) ]

See the vmexit events:
# ./perf kvm stat report --event=vmexit


Analyze events for all VCPUs:

 VM-EXITSamples  Samples% Time% Avg time

 APIC_ACCESS23780.34% 0.01% 20.86us ( +-   8.03% )
 HLT 4515.25%99.98% 790874.75us ( +-  16.93% )
  EXTERNAL_INTERRUPT 11 3.73% 0.00% 47.70us ( +-  17.45% )
   EXCEPTION_NMI  1 0.34% 0.00%  5.11us ( +-   -nan% )
   CR_ACCESS  1 0.34% 0.00%  6.33us ( +-   -nan% )

Total Samples:295, Total events handled time:35594844.34us.

See the mmio events:
# ./perf kvm stat report --event=mmio


Analyze events for all VCPUs:

 MMIO AccessSamples  Samples% Time% Avg time

0xfee00380:W15182.07%76.19%  7.72us ( +-  11.96% )
0xfee00300:W 11 5.98%18.86% 26.23us ( +-  27.33% )
0xfee00300:R 11 5.98% 2.16%  3.00us ( +-   9.17% )
0xfee00310:W 11 5.98% 2.79%  3.88us ( +-   9.15% )

Total Samples:184, Total events handled time:1529.42us.

See the ioport event:
# ./perf kvm stat report --event=ioport


Analyze events for all VCPUs:

  IO Port AccessSamples  Samples% Time% Avg time

  0x5658:PIN   335755.22%92.13%162.31us ( +-   0.87% )
 0xc090:POUT   122120.09% 2.10% 10.18us ( +-   4.97% )
0x60:PIN74812.30% 3.18% 25.17us ( +-   5.01% )
0x64:PIN74812.30% 2.57% 20.35us ( +-  11.81% )
 0xc050:POUT  5 0.08% 0.01%  8.79us ( +-  12.88% )

Total Samples:6079, Total events handled time:591405.43us.

And, --vcpu is used to track the specified vcpu and --key is used to sort the
result:
# ./perf kvm stat report --event=ioport --vcpu=0 --key=time


Analyze events for VCPU 0:

  IO Port AccessSamples  Samples% Time% Avg time

  0x5658:PIN14594.77%99.67%133.00us ( +-   2.96% )
 0xc090:POUT  8 5.23% 0.33%  7.99us ( +-  16.85% )

Total Samples:153, Total events handled time:19348.87us.

Changelog:
- merge "perf kvm-events" into perf "perf kvm stat" as Ingo's suggestion
- track kvm events for the specified guest
- rename kvm_mmio_done to kvm_io_done
- fix compiling-error on i386


Dong Hao (3):
  KVM: x86: export svm/vmx exit code and vector code to userspace
  KVM: x86: tracemmio begin and complete
  KVM: perf kvm events analysis tool

 arch/x86/include/asm/kvm_host.h   |   36 +-
 arch/x86/include/asm/svm.h|  205 +---
 arch/x86/include/asm/vmx.h|  126 --
 arch/x86/kvm/trace.h  |   89 
 arch/x86/kvm/x86.c|   32 +-
 include/trace/events/kvm.h|   37 ++
 tools/perf/Documentation/perf-kvm.txt |   30 ++-
 tools/perf/MANIFEST   |3 +
 tools/perf/builtin-kvm.c  |  858 -
 tools/perf/util/header.c  |   55 ++-
 tools/perf/util/header.h  |1 +
 tools/perf/util/thread.h  |2 +
 12 files changed, 1234 insertions(+), 240 deletions(-)

-- 
1.7.2.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 02/12] KVM: hide KVM_MEMSLOT_INVALID from userspace

2012-08-09 Thread Xiao Guangrong
On 08/10/2012 02:48 AM, Marcelo Tosatti wrote:

>> +#define KVM_MEMSLOT_INVALID (1UL << 31)
>> +
>>  /*
>>   * If we support unaligned MMIO, at most one fragment will be split into 
>> two:
>>   */
> 
> Please document which range is for userspace visible flags, which range
> is reserved. Mention that in both headers, point to each other
> ("userspace definitions are there" / "internal definitions are there").
> 
> 16/16 bits for each should be fine.

Okay, good to me, will do it in the next version, thank you, Marcelo!


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly

2012-08-09 Thread Takuya Yoshikawa
On Thu, 9 Aug 2012 22:25:32 -0300
Marcelo Tosatti  wrote:

> I'll send a patch to flush per memslot in the next days, you can work
> out the PPC details in the meantime.

Are you going to implement that using slot_bitmap?

Since I'm now converting kvm_mmu_slot_remove_write_access() to
rmap based protection, I'd like to hear your plan.

If you can use the stale memslot to restrict the flush, that
should live with rmap based protection.

Thanks,
Takuya
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly

2012-08-09 Thread Marcelo Tosatti
On Thu, Aug 09, 2012 at 10:25:32PM -0300, Marcelo Tosatti wrote:
> On Fri, Aug 10, 2012 at 10:34:39AM +1000, Paul Mackerras wrote:
> > On Thu, Aug 09, 2012 at 03:16:12PM -0300, Marcelo Tosatti wrote:
> > 
> > > The !memslot->npages case is handled in __kvm_set_memory_region
> > > (please read that part, before kvm_arch_prepare_memory_region() call).
> > > 
> > > kvm_arch_flush_shadow should be implemented.
> > 
> > Book3S HV doesn't have shadow page tables per se, rather the hardware
> > page table is under the control of the hypervisor (i.e. KVM), and
> > entries are added and removed by the guest using hypercalls.  On
> > recent machines (POWER7) the hypervisor can choose whether or not to
> > have the hardware PTE point to a real page of memory; if it doesn't,
> > access by the guest will trap to the hypervisor.  On older machines
> > (PPC970) we don't have that flexibility, and we have to provide a real
> > page of memory (i.e. RAM or I/O) behind every hardware PTE.  (This is
> > because PPC970 provides no way for page faults in the guest to go to
> > the hypervisor.)
> > 
> > I could implement kvm_arch_flush_shadow to remove the backing pages
> > behind every hardware PTE, but that would be very slow and inefficient
> > on POWER7, and would break the guest on PPC970, particularly in the
> > case where userspace is removing a small memory slot containing some
> > I/O device and leaving the memory slot for system RAM untouched.
> > 
> > So the reason for unmapping the hardware PTEs in
> > kvm_arch_prepare_memory_region rather than kvm_arch_flush_shadow is
> > that that way we know which memslot is going away.
> > 
> > What exactly are the semantics of kvm_arch_flush_shadow?  
> 
> It removes all translations mapped via memslots. Its used in cases where
> the translations become stale, or during shutdown.
> 
> > I presume that on x86 with NPT/EPT it basically does nothing - is that 
> > right?
> 
> It does, it removes all NPT/EPT ptes (named "sptes" in arch/x86/kvm/).
> The translations are rebuilt on demand (when accesses by the guest fault
> into the HV).
> 
> > > > +   if (old->npages) {
> > > > +   /* modifying guest_phys or flags */
> > > > +   if (old->base_gfn != memslot->base_gfn)
> > > > +   kvmppc_unmap_memslot(kvm, old);
> > > 
> > > This case is also handled generically by the last kvm_arch_flush_shadow
> > > call in __kvm_set_memory_region.
> > 
> > Again, to use this we would need to know which memslot we're
> > flushing.  If we could change __kvm_set_memory_region to pass the
> > memslot for these kvm_arch_flush_shadow calls, then I could do as you
> > suggest.  (Though I would need to think carefully about what would
> > happen with guest invalidations of hardware PTEs in the interval
> > between the rcu_assign_pointer(kvm->memslots, slots) and the
> > kvm_arch_flush_shadow, and whether the invalidation would find the
> > correct location in the rmap array, given that we have updated the
> > base_gfn in the memslot without first getting rid of any references to
> > those pages in the hardware page table.)
> 
> That can be done.
> 
> I'll send a patch to flush per memslot in the next days, you can work
> out the PPC details in the meantime.
> 
> To be clear: this is necessary to have consistent behaviour across
> arches in the kvm_set_memory codepath which is tricky (not nitpicking).
> 
> Alternatively, kvm_arch_flush_shadow can be split into two methods (but
> thats not necessary if memslot information is sufficient for PPC).

What kvm_set_memory requires is that there are no stale translations. On
x86, it is cheaper to nuke all translation entries and let them be rebuilt on
pagefaults.

But, if necessary we can introduce a new callback
"kvm_arch_sync_shadow", which can be used by PPC to verify translations 
are valid, if it makes sense.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly

2012-08-09 Thread Marcelo Tosatti
On Fri, Aug 10, 2012 at 10:34:39AM +1000, Paul Mackerras wrote:
> On Thu, Aug 09, 2012 at 03:16:12PM -0300, Marcelo Tosatti wrote:
> 
> > The !memslot->npages case is handled in __kvm_set_memory_region
> > (please read that part, before kvm_arch_prepare_memory_region() call).
> > 
> > kvm_arch_flush_shadow should be implemented.
> 
> Book3S HV doesn't have shadow page tables per se, rather the hardware
> page table is under the control of the hypervisor (i.e. KVM), and
> entries are added and removed by the guest using hypercalls.  On
> recent machines (POWER7) the hypervisor can choose whether or not to
> have the hardware PTE point to a real page of memory; if it doesn't,
> access by the guest will trap to the hypervisor.  On older machines
> (PPC970) we don't have that flexibility, and we have to provide a real
> page of memory (i.e. RAM or I/O) behind every hardware PTE.  (This is
> because PPC970 provides no way for page faults in the guest to go to
> the hypervisor.)
> 
> I could implement kvm_arch_flush_shadow to remove the backing pages
> behind every hardware PTE, but that would be very slow and inefficient
> on POWER7, and would break the guest on PPC970, particularly in the
> case where userspace is removing a small memory slot containing some
> I/O device and leaving the memory slot for system RAM untouched.
> 
> So the reason for unmapping the hardware PTEs in
> kvm_arch_prepare_memory_region rather than kvm_arch_flush_shadow is
> that that way we know which memslot is going away.
> 
> What exactly are the semantics of kvm_arch_flush_shadow?  

It removes all translations mapped via memslots. Its used in cases where
the translations become stale, or during shutdown.

> I presume that on x86 with NPT/EPT it basically does nothing - is that right?

It does, it removes all NPT/EPT ptes (named "sptes" in arch/x86/kvm/).
The translations are rebuilt on demand (when accesses by the guest fault
into the HV).

> > > + if (old->npages) {
> > > + /* modifying guest_phys or flags */
> > > + if (old->base_gfn != memslot->base_gfn)
> > > + kvmppc_unmap_memslot(kvm, old);
> > 
> > This case is also handled generically by the last kvm_arch_flush_shadow
> > call in __kvm_set_memory_region.
> 
> Again, to use this we would need to know which memslot we're
> flushing.  If we could change __kvm_set_memory_region to pass the
> memslot for these kvm_arch_flush_shadow calls, then I could do as you
> suggest.  (Though I would need to think carefully about what would
> happen with guest invalidations of hardware PTEs in the interval
> between the rcu_assign_pointer(kvm->memslots, slots) and the
> kvm_arch_flush_shadow, and whether the invalidation would find the
> correct location in the rmap array, given that we have updated the
> base_gfn in the memslot without first getting rid of any references to
> those pages in the hardware page table.)

That can be done.

I'll send a patch to flush per memslot in the next days, you can work
out the PPC details in the meantime.

To be clear: this is necessary to have consistent behaviour across
arches in the kvm_set_memory codepath which is tricky (not nitpicking).

Alternatively, kvm_arch_flush_shadow can be split into two methods (but
thats not necessary if memslot information is sufficient for PPC).

> > > + if (memslot->dirty_bitmap &&
> > > + old->dirty_bitmap != memslot->dirty_bitmap)
> > > + kvmppc_hv_get_dirty_log(kvm, old);
> > > + return 0;
> > > + }
> > 
> > Better clear dirty log unconditionally on kvm_arch_commit_memory_region,
> > similarly to x86 (just so its consistent).
> 
> OK.
> 
> > > --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> > > +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> > > @@ -81,7 +81,7 @@ static void remove_revmap_chain(struct kvm *kvm, long 
> > > pte_index,
> > >   ptel = rev->guest_rpte |= rcbits;
> > >   gfn = hpte_rpn(ptel, hpte_page_size(hpte_v, ptel));
> > >   memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
> > > - if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
> > > + if (!memslot)
> > >   return;
> > 
> > Why remove this check? (i don't know why it was there in the first
> > place, just checking).
> 
> This is where we are removing the page backing a hardware PTE and thus
> removing the hardware PTE from the reverse-mapping list for the page.
> We want to be able to do that properly even if the memslot is in the
> process of going away.  I had the flags check in there originally
> because other places that used a memslot had that check, but when I
> read __kvm_set_memory_region more carefully I realized that the
> KVM_MEMSLOT_INVALID flag indicates that we should not create any more
> references to pages in the memslot, but we do still need to be able to
> handle references going away, i.e. pages in the memslot getting
> unmapped.
> 
> Paul.

Yes, thats it. kvm_arch_flush_shadow requires functional memslot lookup,
for example.

--
To unsubscribe

Re: [PATCH 3/5] KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly

2012-08-09 Thread Paul Mackerras
On Thu, Aug 09, 2012 at 03:16:12PM -0300, Marcelo Tosatti wrote:

> The !memslot->npages case is handled in __kvm_set_memory_region
> (please read that part, before kvm_arch_prepare_memory_region() call).
> 
> kvm_arch_flush_shadow should be implemented.

Book3S HV doesn't have shadow page tables per se, rather the hardware
page table is under the control of the hypervisor (i.e. KVM), and
entries are added and removed by the guest using hypercalls.  On
recent machines (POWER7) the hypervisor can choose whether or not to
have the hardware PTE point to a real page of memory; if it doesn't,
access by the guest will trap to the hypervisor.  On older machines
(PPC970) we don't have that flexibility, and we have to provide a real
page of memory (i.e. RAM or I/O) behind every hardware PTE.  (This is
because PPC970 provides no way for page faults in the guest to go to
the hypervisor.)

I could implement kvm_arch_flush_shadow to remove the backing pages
behind every hardware PTE, but that would be very slow and inefficient
on POWER7, and would break the guest on PPC970, particularly in the
case where userspace is removing a small memory slot containing some
I/O device and leaving the memory slot for system RAM untouched.

So the reason for unmapping the hardware PTEs in
kvm_arch_prepare_memory_region rather than kvm_arch_flush_shadow is
that that way we know which memslot is going away.

What exactly are the semantics of kvm_arch_flush_shadow?  I presume
that on x86 with NPT/EPT it basically does nothing - is that right?

> > +   if (old->npages) {
> > +   /* modifying guest_phys or flags */
> > +   if (old->base_gfn != memslot->base_gfn)
> > +   kvmppc_unmap_memslot(kvm, old);
> 
> This case is also handled generically by the last kvm_arch_flush_shadow
> call in __kvm_set_memory_region.

Again, to use this we would need to know which memslot we're
flushing.  If we could change __kvm_set_memory_region to pass the
memslot for these kvm_arch_flush_shadow calls, then I could do as you
suggest.  (Though I would need to think carefully about what would
happen with guest invalidations of hardware PTEs in the interval
between the rcu_assign_pointer(kvm->memslots, slots) and the
kvm_arch_flush_shadow, and whether the invalidation would find the
correct location in the rmap array, given that we have updated the
base_gfn in the memslot without first getting rid of any references to
those pages in the hardware page table.)

> > +   if (memslot->dirty_bitmap &&
> > +   old->dirty_bitmap != memslot->dirty_bitmap)
> > +   kvmppc_hv_get_dirty_log(kvm, old);
> > +   return 0;
> > +   }
> 
> Better clear dirty log unconditionally on kvm_arch_commit_memory_region,
> similarly to x86 (just so its consistent).

OK.

> > --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> > +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
> > @@ -81,7 +81,7 @@ static void remove_revmap_chain(struct kvm *kvm, long 
> > pte_index,
> > ptel = rev->guest_rpte |= rcbits;
> > gfn = hpte_rpn(ptel, hpte_page_size(hpte_v, ptel));
> > memslot = __gfn_to_memslot(kvm_memslots(kvm), gfn);
> > -   if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
> > +   if (!memslot)
> > return;
> 
> Why remove this check? (i don't know why it was there in the first
> place, just checking).

This is where we are removing the page backing a hardware PTE and thus
removing the hardware PTE from the reverse-mapping list for the page.
We want to be able to do that properly even if the memslot is in the
process of going away.  I had the flags check in there originally
because other places that used a memslot had that check, but when I
read __kvm_set_memory_region more carefully I realized that the
KVM_MEMSLOT_INVALID flag indicates that we should not create any more
references to pages in the memslot, but we do still need to be able to
handle references going away, i.e. pages in the memslot getting
unmapped.

Paul.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] KVM: PPC: Book3S HV: Take the SRCU read lock before looking up memslots

2012-08-09 Thread Paul Mackerras
On Thu, Aug 09, 2012 at 03:22:38PM -0300, Marcelo Tosatti wrote:
> On Mon, Aug 06, 2012 at 08:06:55PM +1000, Paul Mackerras wrote:
> > The generic KVM code uses SRCU (sleeping RCU) to protect accesses
> > to the memslots data structures against updates due to userspace
> > adding, modifying or removing memory slots.  We need to do that too,
> > both to avoid accessing stale copies of the memslots and to avoid
> > lockdep warnings.  This therefore adds srcu_read_lock/unlock pairs
> > around code that accesses and uses memslots.
> > 
> > Since the real-mode handlers for H_ENTER, H_REMOVE and H_BULK_REMOVE
> > need to access the memslots, and we don't want to call the SRCU code
> > in real mode (since we have no assurance that it would only access
> > the linear mapping), we hold the SRCU read lock for the VM while
> > in the guest.  This does mean that adding or removing memory slots
> > while some vcpus are executing in the guest will block for up to
> > two jiffies.  This tradeoff is acceptable since adding/removing
> > memory slots only happens rarely, while H_ENTER/H_REMOVE/H_BULK_REMOVE
> > are performance-critical hot paths.
> 
> I would avoid doing this. What prevents the guest entry loop in the kernel to
> simply reenter without ever unlocking SRCU?

I take and release the SRCU lock inside the guest entry loop, so in
fact we will release the SRCU lock every time we pop out of the guest.

As for holding the SRCU read lock while we're in the guest - that is
what enables us to do H_ENTER etc. (the hypercalls that the guest uses
to update the hardware page table) in real mode, i.e. without
switching the MMU over to the kernel context.  In real mode we can
access the linear mapping but not vmalloc, ioremap or user space, so I
am nervous about doing srcu_read_lock/unlock in real mode.  I really
don't want to switch to kernel mode for those hypercalls because that
switch is relatively slow and requires pulling all of the SMT threads
in the core out of the guest (because of hardware restrictions).

So, holding the SRCU read lock while in the guest is the least ugly
approach AFAICS.  Yes it makes memslot addition/removal a bit slower,
but only for this VM, and it doesn't delay grace periods for other
forms of RCU, so its impact is pretty limited.

Paul.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 5/5] KVM: PPC: Take the SRCU lock around memslot use

2012-08-09 Thread Paul Mackerras
On Thu, Aug 09, 2012 at 03:27:17PM -0300, Marcelo Tosatti wrote:
> On Mon, Aug 06, 2012 at 08:08:16PM +1000, Paul Mackerras wrote:
> > The generic KVM code uses SRCU (sleeping RCU) to protect accesses
> > to the memslots data structures against updates due to userspace
> > adding, modifying or removing memory slots.  We need to do that too,
> > both to avoid accessing stale copies of the memslots and to avoid
> > lockdep warnings.  This therefore adds srcu_read_lock/unlock pairs
> > around code that accesses and uses memslots in the Book 3S PR code
> > and the Book E (44x and e500) code.
> > 
> > Signed-off-by: Paul Mackerras 
> > ---
> > Compile-tested only.
> > 
> >  arch/powerpc/kvm/44x_tlb.c   |6 ++
> >  arch/powerpc/kvm/book3s_pr.c |6 ++
> >  arch/powerpc/kvm/e500_tlb.c  |6 ++
> >  3 files changed, 18 insertions(+)
> 
> On top of the previous comment:
> 
> x86 calls srcu_read_lock at the beginning of the KVM_RUN ioctl handler
> (__vcpu_run in arch/x86/kvm/x86.c), unlocks srcu on guest entry, locks
> on guest exit before any potential use of memslots, and unlocks on
> exit to userspace.
> 
> This has the advantage of not sprinkling srcu lock/unlock calls all over
> (except from other ioctls, of course). Its low maintenance.
> 
> Perhaps doing the same on PPC is not a bad idea.

Perhaps... these changes are to areas of the PPC KVM code that I don't
use or maintain, so they're really more a suggestion of one way to fix
the problem than anything else.  That's why I put RFC in the subject
line.  It would be up to Alex whether he wants to fix it like this or
by taking the SRCU lock at a higher level.

Paul.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] s390/kvm: Handle hosts not supporting s390-virtio.

2012-08-09 Thread Alexander Graf

On 07.08.2012, at 16:52, Cornelia Huck wrote:

> Running under a kvm host does not necessarily imply the presence of
> a page mapped above the main memory with the virtio information;
> however, the code includes a hard coded access to that page.
> 
> Instead, check for the presence of the page and exit gracefully
> before we hit an addressing exception if it does not exist.
> 
> Signed-off-by: Cornelia Huck 
> ---
> drivers/s390/kvm/kvm_virtio.c | 21 +
> 1 file changed, 21 insertions(+)
> 
> diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
> index 47cccd5..408447c 100644
> --- a/drivers/s390/kvm/kvm_virtio.c
> +++ b/drivers/s390/kvm/kvm_virtio.c
> @@ -418,6 +418,21 @@ static void kvm_extint_handler(struct ext_code ext_code,
>   }
> }
> 
> +static int __init test_devices_support(void)

Today we know what this function does, but the next person running into it has 
no clue. Maybe add a nice description here that you're trying to access memory 
above ram? (is this even what you're doing here? my s390 asm is slightly rusty) 
:)

Alex

> +{
> + int ccode = -EIO;
> +
> + asm volatile(
> + "0: cli 0(%1),0\n"
> + "   ipm %0\n"
> + "   srl %0,28\n"
> + "1:\n"
> + EX_TABLE(0b,1b)
> + : "+d" (ccode)
> + : "a" (kvm_devices)
> + : "cc");
> + return ccode;
> +}
> /*
>  * Init function for virtio
>  * devices are in a single page above top of "normal" mem
> @@ -443,6 +458,12 @@ static int __init kvm_devices_init(void)
>   }
> 
>   kvm_devices = (void *) real_memory_size;
> + if (test_devices_support() < 0) {
> + vmem_remove_mapping(real_memory_size, PAGE_SIZE);
> + root_device_unregister(kvm_root);
> + /* No error. */
> + return 0;
> + }
> 
>   INIT_WORK(&hotplug_work, hotplug_devices);
> 
> -- 
> 1.7.11.4
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/5] KVM: PPC: BookE: Add some more trace points

2012-08-09 Thread Alexander Graf
Without trace points, debugging what exactly is going on inside guest
code can be very tricky. Add a few more trace points at places that
hopefully tell us more when things go wrong.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/booke.c|3 ++
 arch/powerpc/kvm/e500_tlb.c |3 ++
 arch/powerpc/kvm/trace.h|   71 +++
 3 files changed, 77 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 52f6cbb..00bcc57 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -143,6 +143,7 @@ void kvmppc_set_msr(struct kvm_vcpu *vcpu, u32 new_msr)
 static void kvmppc_booke_queue_irqprio(struct kvm_vcpu *vcpu,
unsigned int priority)
 {
+   trace_kvm_booke_queue_irqprio(vcpu, priority);
set_bit(priority, &vcpu->arch.pending_exceptions);
 }
 
@@ -457,6 +458,8 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 static void kvmppc_check_requests(struct kvm_vcpu *vcpu)
 {
if (vcpu->requests) {
+   trace_kvm_check_requests(vcpu);
+
if (kvm_check_request(KVM_REQ_PENDING_TIMER, vcpu))
update_timer_ints(vcpu);
 #if defined(CONFIG_KVM_E500V2) || defined(CONFIG_KVM_E500MC)
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index 56bc92c..f5e531f 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -312,6 +312,7 @@ static inline void kvmppc_e500_ref_setup(struct tlbe_ref 
*ref,
 static inline void kvmppc_e500_ref_release(struct tlbe_ref *ref)
 {
if (ref->flags & E500_TLB_VALID) {
+   trace_kvm_booke206_ref_release(ref->pfn, ref->flags);
ref->flags = 0;
}
 }
@@ -1076,6 +1077,8 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, 
gpa_t gpaddr,
 
 int kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
 {
+   trace_kvm_unmap_hva(hva);
+
/*
 * Flush all shadow tlb entries everywhere. This is slow, but
 * we are 100% sure that we catch the to be unmapped page
diff --git a/arch/powerpc/kvm/trace.h b/arch/powerpc/kvm/trace.h
index 9fab6ed..cb2780a 100644
--- a/arch/powerpc/kvm/trace.h
+++ b/arch/powerpc/kvm/trace.h
@@ -82,6 +82,21 @@ TRACE_EVENT(kvm_exit,
)
 );
 
+TRACE_EVENT(kvm_unmap_hva,
+   TP_PROTO(unsigned long hva),
+   TP_ARGS(hva),
+
+   TP_STRUCT__entry(
+   __field(unsigned long,  hva )
+   ),
+
+   TP_fast_assign(
+   __entry->hva= hva;
+   ),
+
+   TP_printk("unmap hva 0x%lx\n", __entry->hva)
+);
+
 TRACE_EVENT(kvm_stlb_inval,
TP_PROTO(unsigned int stlb_index),
TP_ARGS(stlb_index),
@@ -149,6 +164,24 @@ TRACE_EVENT(kvm_gtlb_write,
__entry->word1, __entry->word2)
 );
 
+TRACE_EVENT(kvm_check_requests,
+   TP_PROTO(struct kvm_vcpu *vcpu),
+   TP_ARGS(vcpu),
+
+   TP_STRUCT__entry(
+   __field(__u32,  cpu_nr  )
+   __field(__u32,  requests)
+   ),
+
+   TP_fast_assign(
+   __entry->cpu_nr = vcpu->vcpu_id;
+   __entry->requests   = vcpu->requests;
+   ),
+
+   TP_printk("vcpu=%x requests=%x",
+   __entry->cpu_nr, __entry->requests)
+);
+
 
 /*
  * Book3S trace points   *
@@ -418,6 +451,44 @@ TRACE_EVENT(kvm_booke206_gtlb_write,
__entry->mas2, __entry->mas7_3)
 );
 
+TRACE_EVENT(kvm_booke206_ref_release,
+   TP_PROTO(__u64 pfn, __u32 flags),
+   TP_ARGS(pfn, flags),
+
+   TP_STRUCT__entry(
+   __field(__u64,  pfn )
+   __field(__u32,  flags   )
+   ),
+
+   TP_fast_assign(
+   __entry->pfn= pfn;
+   __entry->flags  = flags;
+   ),
+
+   TP_printk("pfn=%llx flags=%x",
+   __entry->pfn, __entry->flags)
+);
+
+TRACE_EVENT(kvm_booke_queue_irqprio,
+   TP_PROTO(struct kvm_vcpu *vcpu, unsigned int priority),
+   TP_ARGS(vcpu, priority),
+
+   TP_STRUCT__entry(
+   __field(__u32,  cpu_nr  )
+   __field(__u32,  priority)
+   __field(unsigned long,  pending )
+   ),
+
+   TP_fast_assign(
+   __entry->cpu_nr = vcpu->vcpu_id;
+   __entry->priority   = priority;
+   __entry->pending= vcpu->arch.pending_exceptions;
+   ),
+
+   TP_printk("vcpu=%x prio=%x pending=%lx",
+   __entry->cpu_nr, __entry->priority, __entry->pending)
+);
+
 #endif
 
 #endif /* _TRACE_KVM_H */
-- 
1.6.0.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to 

[PATCH 4/5] KVM: PPC: Add cache flush on page map

2012-08-09 Thread Alexander Graf
When we map a page that wasn't icache cleared before, do so when first
mapping it in KVM using the same information bits as the Linux mapping
logic. That way we are 100% sure that any page we map does not have stale
entries in the icache.

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - move code to arch/powerpc
---
 arch/powerpc/include/asm/kvm_host.h   |1 +
 arch/powerpc/include/asm/kvm_ppc.h|   12 
 arch/powerpc/kvm/book3s_32_mmu_host.c |3 +++
 arch/powerpc/kvm/book3s_64_mmu_host.c |2 ++
 arch/powerpc/kvm/e500_tlb.c   |3 +++
 arch/powerpc/mm/mem.c |1 +
 6 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index ed75bc9..f5c289b 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define KVM_MAX_VCPUS  NR_CPUS
 #define KVM_MAX_VCORES NR_CPUS
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index c38e824..88de314 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -220,4 +220,16 @@ void kvmppc_claim_lpid(long lpid);
 void kvmppc_free_lpid(long lpid);
 void kvmppc_init_lpid(unsigned long nr_lpids);
 
+static inline void kvmppc_mmu_flush_icache(pfn_t pfn)
+{
+   /* Clear i-cache for new pages */
+   struct page *page;
+   page = pfn_to_page(pfn);
+   if (!test_bit(PG_arch_1, &page->flags)) {
+   flush_dcache_icache_page(page);
+   set_bit(PG_arch_1, &page->flags);
+   }
+}
+
+
 #endif /* __POWERPC_KVM_PPC_H__ */
diff --git a/arch/powerpc/kvm/book3s_32_mmu_host.c 
b/arch/powerpc/kvm/book3s_32_mmu_host.c
index f922c29..837f13e 100644
--- a/arch/powerpc/kvm/book3s_32_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_32_mmu_host.c
@@ -211,6 +211,9 @@ next_pteg:
pteg1 |= PP_RWRX;
}
 
+   if (orig_pte->may_execute)
+   kvmppc_mmu_flush_icache(hpaddr >> PAGE_SHIFT);
+
local_irq_disable();
 
if (pteg[rr]) {
diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c 
b/arch/powerpc/kvm/book3s_64_mmu_host.c
index 10fc8ec..0688b6b 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_host.c
@@ -126,6 +126,8 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct 
kvmppc_pte *orig_pte)
 
if (!orig_pte->may_execute)
rflags |= HPTE_R_N;
+   else
+   kvmppc_mmu_flush_icache(hpaddr >> PAGE_SHIFT);
 
hash = hpt_hash(va, PTE_SIZE, MMU_SEGSIZE_256M);
 
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index c439754..56bc92c 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -544,6 +544,9 @@ static inline void kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 *vcpu_e500,
kvmppc_e500_setup_stlbe(&vcpu_e500->vcpu, gtlbe, tsize,
ref, gvaddr, stlbe);
 
+   /* Clear i-cache for new pages */
+   kvmppc_mmu_flush_icache(pfn);
+
/* Drop refcount on page, so that mmu notifiers can clear it */
kvm_release_pfn_clean(pfn);
 }
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index baaafde..fbdad0e 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -469,6 +469,7 @@ void flush_dcache_icache_page(struct page *page)
__flush_dcache_icache_phys(page_to_pfn(page) << PAGE_SHIFT);
 #endif
 }
+EXPORT_SYMBOL(flush_dcache_icache_page);
 
 void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
 {
-- 
1.6.0.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/5] KVM: PPC: E500: Implement MMU notifiers

2012-08-09 Thread Alexander Graf
The e500 target has lived without mmu notifiers ever since it got
introduced, but fails for the user space check on them with hugetlbfs.

So in order to get that one working, implement mmu notifiers in a
reasonably dumb fashion and be happy. On embedded hardware, we almost
never end up with mmu notifier calls, since most people don't overcommit.

Signed-off-by: Alexander Graf 

---

v1 -> v2:

  - drop refcount for in-use pages
  - mark pages dirty at mapping time, not unmapping time
  - drop hva_to_memslot usage
---
 arch/powerpc/include/asm/kvm_host.h |3 +-
 arch/powerpc/include/asm/kvm_ppc.h  |1 +
 arch/powerpc/kvm/Kconfig|2 +
 arch/powerpc/kvm/booke.c|6 +++
 arch/powerpc/kvm/e500_tlb.c |   60 +++---
 5 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 572ad01..ed75bc9 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -45,7 +45,8 @@
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
 #endif
 
-#ifdef CONFIG_KVM_BOOK3S_64_HV
+#if defined(CONFIG_KVM_BOOK3S_64_HV) || defined(CONFIG_KVM_E500V2) || \
+defined(CONFIG_KVM_E500MC)
 #include 
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 0124937..c38e824 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -104,6 +104,7 @@ extern void kvmppc_core_queue_external(struct kvm_vcpu 
*vcpu,
struct kvm_interrupt *irq);
 extern void kvmppc_core_dequeue_external(struct kvm_vcpu *vcpu,
  struct kvm_interrupt *irq);
+extern void kvmppc_core_flush_tlb(struct kvm_vcpu *vcpu);
 
 extern int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu *vcpu,
   unsigned int op, int *advance);
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index f4dacb9..40cad8c 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -123,6 +123,7 @@ config KVM_E500V2
depends on EXPERIMENTAL && E500 && !PPC_E500MC
select KVM
select KVM_MMIO
+   select MMU_NOTIFIER
---help---
  Support running unmodified E500 guest kernels in virtual machines on
  E500v2 host processors.
@@ -138,6 +139,7 @@ config KVM_E500MC
select KVM
select KVM_MMIO
select KVM_BOOKE_HV
+   select MMU_NOTIFIER
---help---
  Support running unmodified E500MC/E5500 (32-bit) guest kernels in
  virtual machines on E500MC/E5500 host processors.
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 70a86c0..52f6cbb 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -459,6 +459,10 @@ static void kvmppc_check_requests(struct kvm_vcpu *vcpu)
if (vcpu->requests) {
if (kvm_check_request(KVM_REQ_PENDING_TIMER, vcpu))
update_timer_ints(vcpu);
+#if defined(CONFIG_KVM_E500V2) || defined(CONFIG_KVM_E500MC)
+   if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
+   kvmppc_core_flush_tlb(vcpu);
+#endif
}
 }
 
@@ -579,6 +583,8 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
 #endif
 
kvm_guest_exit();
+   vcpu->mode = OUTSIDE_GUEST_MODE;
+   smp_wmb();
 
 out:
vcpu->mode = OUTSIDE_GUEST_MODE;
diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index d26e705..c439754 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -303,18 +303,15 @@ static inline void kvmppc_e500_ref_setup(struct tlbe_ref 
*ref,
ref->pfn = pfn;
ref->flags = E500_TLB_VALID;
 
-   if (tlbe_is_writable(gtlbe))
+   if (tlbe_is_writable(gtlbe)) {
ref->flags |= E500_TLB_DIRTY;
+   kvm_set_pfn_dirty(pfn);
+   }
 }
 
 static inline void kvmppc_e500_ref_release(struct tlbe_ref *ref)
 {
if (ref->flags & E500_TLB_VALID) {
-   if (ref->flags & E500_TLB_DIRTY)
-   kvm_release_pfn_dirty(ref->pfn);
-   else
-   kvm_release_pfn_clean(ref->pfn);
-
ref->flags = 0;
}
 }
@@ -357,6 +354,13 @@ static void clear_tlb_refs(struct kvmppc_vcpu_e500 
*vcpu_e500)
clear_tlb_privs(vcpu_e500);
 }
 
+void kvmppc_core_flush_tlb(struct kvm_vcpu *vcpu)
+{
+   struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
+   clear_tlb_refs(vcpu_e500);
+   clear_tlb1_bitmap(vcpu_e500);
+}
+
 static inline void kvmppc_e500_deliver_tlb_miss(struct kvm_vcpu *vcpu,
unsigned int eaddr, int as)
 {
@@ -539,6 +543,9 @@ static inline void kvmppc_e500_shadow_map(struct 
kvmppc_vcpu_e500 *vcpu_e500,
 
kvmppc_e500_setup_stlbe(&vcpu_e500->vcpu, gtlbe, tsize,
  

[PATCH 1/5] KVM: PPC: BookE: Add check_requests helper function

2012-08-09 Thread Alexander Graf
We need a central place to check for pending requests in. Add one that
only does the timer check we already do in a different place.

Later, this central function can be extended by more checks.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/booke.c |   24 +---
 1 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 1d4ce9a..bcf87fe 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -419,13 +419,6 @@ static void kvmppc_core_check_exceptions(struct kvm_vcpu 
*vcpu)
unsigned long *pending = &vcpu->arch.pending_exceptions;
unsigned int priority;
 
-   if (vcpu->requests) {
-   if (kvm_check_request(KVM_REQ_PENDING_TIMER, vcpu)) {
-   smp_mb();
-   update_timer_ints(vcpu);
-   }
-   }
-
priority = __ffs(*pending);
while (priority < BOOKE_IRQPRIO_MAX) {
if (kvmppc_booke_irqprio_deliver(vcpu, priority))
@@ -461,6 +454,14 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
return r;
 }
 
+static void kvmppc_check_requests(struct kvm_vcpu *vcpu)
+{
+   if (vcpu->requests) {
+   if (kvm_check_request(KVM_REQ_PENDING_TIMER, vcpu))
+   update_timer_ints(vcpu);
+   }
+}
+
 /*
  * Common checks before entering the guest world.  Call with interrupts
  * disabled.
@@ -485,6 +486,15 @@ static int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
break;
}
 
+   smp_mb();
+   if (vcpu->requests) {
+   /* Make sure we process requests preemptable */
+   local_irq_enable();
+   kvmppc_check_requests(vcpu);
+   local_irq_disable();
+   continue;
+   }
+
if (kvmppc_core_prepare_to_enter(vcpu)) {
/* interrupts got enabled in between, so we
   are back at square 1 */
-- 
1.6.0.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/5] KVM: PPC: BookE: Add support for vcpu->mode

2012-08-09 Thread Alexander Graf
Generic KVM code might want to know whether we are inside guest context
or outside. It also wants to be able to push us out of guest context.

Add support to the BookE code for the generic vcpu->mode field that describes
the above states.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/booke.c |   11 +++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index bcf87fe..70a86c0 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -501,6 +501,15 @@ static int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
continue;
}
 
+   if (vcpu->mode == EXITING_GUEST_MODE) {
+   r = 1;
+   break;
+   }
+
+   /* Going into guest context! Yay! */
+   vcpu->mode = IN_GUEST_MODE;
+   smp_wmb();
+
break;
}
 
@@ -572,6 +581,8 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
kvm_vcpu *vcpu)
kvm_guest_exit();
 
 out:
+   vcpu->mode = OUTSIDE_GUEST_MODE;
+   smp_wmb();
local_irq_enable();
return ret;
 }
-- 
1.6.0.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/5] KVM: PPC: E500: Implement MMU Notifiers

2012-08-09 Thread Alexander Graf
This patch set adds a very simple implementation of MMU notifiers for the
e500 target. Along the way, I stumbled over a few other things that I've
put into the same patch set, namely:

  * support remote vcpu requests
  * hva_to_memslot helper function
  * icache flushing on page map
  * additional trace points

v1 -> v2:

  - remove patches that are already merged into kvm-ppc-next
  - split up mmu notifier patch
  - remove hva_to_memslot
  - add trace points
  - change icache flush method to be in arch/powerpc
  - make request checks preemptible

Alexander Graf (5):
  KVM: PPC: BookE: Add check_requests helper function
  KVM: PPC: BookE: Add support for vcpu->mode
  KVM: PPC: E500: Implement MMU notifiers
  KVM: PPC: Add cache flush on page map
  KVM: PPC: BookE: Add some more trace points

 arch/powerpc/include/asm/kvm_host.h   |4 +-
 arch/powerpc/include/asm/kvm_ppc.h|   13 ++
 arch/powerpc/kvm/Kconfig  |2 +
 arch/powerpc/kvm/book3s_32_mmu_host.c |3 +
 arch/powerpc/kvm/book3s_64_mmu_host.c |2 +
 arch/powerpc/kvm/booke.c  |   44 +---
 arch/powerpc/kvm/e500_tlb.c   |   66 +++---
 arch/powerpc/kvm/trace.h  |   71 +
 arch/powerpc/mm/mem.c |1 +
 9 files changed, 192 insertions(+), 14 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TSC in qem[-kvm] 1.1+ and in-kernel irqchip

2012-08-09 Thread Michael Tokarev
On 10.08.2012 00:47, Marcelo Tosatti wrote:
[]
>>> calibrate_tsc (void)
>>> {
>>>   /* First calibrate the TSC rate (relative, not absolute time). */
>>>   grub_uint64_t start_tsc;
>>>   grub_uint64_t end_tsc;
>>>
>>>   start_tsc = grub_get_tsc ();
>>>   grub_pit_wait (0x);
>>>   end_tsc = grub_get_tsc ();
>>>
>>>   tsc_ticks_per_ms = grub_divmod64 (end_tsc - start_tsc, 55, 0);
>>> }
>>
>> Emulation of grub_pit_wait sequence by in-kernel PIT is probably broken.

This is grub_pit_wait():

#define TIMER2_REG_CONTROL  0x42
#define TIMER_REG_COMMAND   0x43
#define TIMER2_REG_LATCH0x61

#define TIMER2_SELECT   0x80
#define TIMER_ENABLE_LSB0x20
#define TIMER_ENABLE_MSB0x10
#define TIMER2_LATCH0x20
#define TIMER2_SPEAKER  0x02
#define TIMER2_GATE 0x01

void
grub_pit_wait (grub_uint16_t tics)
{
  /* Disable timer2 gate and speaker.  */
  grub_outb (grub_inb (TIMER2_REG_LATCH) & ~ (TIMER2_SPEAKER | TIMER2_GATE),
 TIMER2_REG_LATCH);

  /* Set tics.  */
  grub_outb (TIMER2_SELECT | TIMER_ENABLE_LSB | TIMER_ENABLE_MSB, 
TIMER_REG_COMMAND);
  grub_outb (tics & 0xff, TIMER2_REG_CONTROL);
  grub_outb (tics >> 8, TIMER2_REG_CONTROL);

  /* Enable timer2 gate, keep speaker disabled.  */
  grub_outb ((grub_inb (TIMER2_REG_LATCH) & ~ TIMER2_SPEAKER) | TIMER2_GATE,
 TIMER2_REG_LATCH);

  /* Wait.  */
  while ((grub_inb (TIMER2_REG_LATCH) & TIMER2_LATCH) == 0x00);

  /* Disable timer2 gate and speaker.  */
  grub_outb (grub_inb (TIMER2_REG_LATCH) & ~ (TIMER2_SPEAKER | TIMER2_GATE),
 TIMER2_REG_LATCH);
}


>> QEMU PIT emulation is also affected by miscalibration.
>>
>> Please provide steps to reproduce.
> 
> I mean verbose on the steps (does it happen always when setting timeout=1,
> how to set timeout=1, etc).

untested:

mkdir /tmp/grub
cd /tmp/grub
mkdir boot boot/grub
cat > boot/grub/grub.cfg 

FW: cpu cache size

2012-08-09 Thread Benjamin Rizkowsky

Currently we as a company are evaluating KVM over XEN.

So far KVM has been winning as far as performance particularly in disk I/O 
since our infrastructure will include iscsi/iser over infiniband.  One problem 
we have  run into is the cpu cache size is reported as being 
4096 KB

When in fact it should be reporting 
15360 KB

We have concluded that this is adversely affecting the performance on one of 
our J2EE applications that use lots of cpu.  Identical Testing on XEN which 
does correct cpu passthrough show increased performance of the cpu compared to 
KVM.  Is there any workaround or patch for this feature or possible something 
I'm missing when starting the vm to enable the additional cache?


#this command is supposed to do direct cpu passthrough.
qemu-kvm -S -M rhel6.3.0 -cpu host -enable-kvm -m 20480 -smp 
8,sockets=8,cores=1,threads=1


#results from cat /proc/cpuinfo on the host
processor   : 23
vendor_id   : GenuineIntel
cpu family  : 6
model   : 45
model name  : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
stepping    : 7
cpu MHz : 2294.231
cache size  : 15360 KB
physical id : 1
siblings    : 12
core id : 5
cpu cores   : 6
apicid  : 43
initial apicid  : 43
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat 
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp 
lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf 
pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca 
sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm arat xsaveopt pln pts dts 
tpr_shadow vnmi flexpriority ept vpid
bogomips    : 4588.30
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:



# results from cat /proc/cpuinfo on the  guest
processor   : 7
vendor_id   : GenuineIntel
cpu family  : 6
model   : 45
model name  : Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
stepping    : 7
cpu MHz : 2294.470
cache size  : 4096 KB
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat 
pse36 clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc arch_perfmon 
unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt aes xsave 
avx hypervisor lahf_lm xsaveopt
bogomips    : 4588.94
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:




Regards,
Benjamin Rizkowsky
b...@enterprisewizard.com

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hang on reboot with 3.6-rc1

2012-08-09 Thread Alex Williamson
On Thu, 2012-08-09 at 13:35 -0600, David Ahern wrote:
> On 8/9/12 1:03 PM, Alex Williamson wrote:
> 
> > I think you should be ok with just what is in Linus' tree now.  The
> > change in pci_acs_path_enabled is the same kind of bug, but I don't
> > think you'll hit it with the fixed iommu code.  Please let me know if
> > that's not the case.  Thanks,
> 
> Indeed, latest tree is clean. no oops, VF shows fine and passthrough works.

Great, thanks!

Alex


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hang on reboot with 3.6-rc1

2012-08-09 Thread David Ahern

On 8/9/12 1:03 PM, Alex Williamson wrote:


I think you should be ok with just what is in Linus' tree now.  The
change in pci_acs_path_enabled is the same kind of bug, but I don't
think you'll hit it with the fixed iommu code.  Please let me know if
that's not the case.  Thanks,


Indeed, latest tree is clean. no oops, VF shows fine and passthrough works.

David

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs

2012-08-09 Thread Alex Williamson
On Mon, 2012-08-06 at 13:40 +0300, Avi Kivity wrote:
> On 08/06/2012 01:38 PM, Avi Kivity wrote:
> 
> > Regarding the implementation, instead of a linked list, would an array
> > of counters parallel to the bitmap make it simpler?
> 
> Or even, replace the bitmap with an array of counters.

I'm not sure a counter array is what we're really after.  That gives us
reference counting for the irq source IDs, but not the key->gsi lookup.
It also highlights another issue, that we have a limited set of source
IDs.  Looks like we have BITS_PER_LONG IDs, with two already used, one
for the shared userspace ID and another for the PIT.  How happy are we
going to be with a limit of 62 level interrupts in use at one time?

It's arguably a reasonable number since the most virtualization friendly
devices (sr-iov VFs) don't even support this kind of interrupt.  It's
also very wasteful allocating an entire source ID for a single GSI
within that source ID.  PCI supports interrupts A, B, C, and D, which,
in the most optimal config, each go to different GSIs.  So we could
theoretically be more efficient in our use and allocation of irq source
IDs if we tracked use by the source ID, gsi pair.

That probably makes it less practical to replace anything at the top
level with a counter array.  The key that we pass back is currently the
actual source ID, but we don't specify what it is, so we could split it
and have it encode a 16bit source ID plus 16 bit GSI.  It could also be
an idr entry.

Michael, would the interface be more acceptable to you if we added
separate ioctls to allocate and free some representation of an irq
source ID, gsi pair?  For instance, an ioctl might return an idr entry
for an irq source ID/gsi object which would then be passed as a
parameter in struct kvm_irqfd and struct kvm_eoifd so that the object
representing the source id/gsi isn't magically freed on it's own.  This
would also allow us to deassign/close one end and reconfigure it later.
Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 03/12] KVM: introduce gfn_to_pfn_memslot_atomic

2012-08-09 Thread Marcelo Tosatti
On Tue, Aug 07, 2012 at 05:49:36PM +0800, Xiao Guangrong wrote:
> It can instead of hva_to_pfn_atomic
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c   |5 +
>  include/linux/kvm_host.h |3 ++-
>  virt/kvm/kvm_main.c  |   14 --
>  3 files changed, 11 insertions(+), 11 deletions(-)

What if someone needs gfn_to_hva_memslot in the future, which is not
unlikely?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5 02/12] KVM: hide KVM_MEMSLOT_INVALID from userspace

2012-08-09 Thread Marcelo Tosatti
On Tue, Aug 07, 2012 at 05:48:49PM +0800, Xiao Guangrong wrote:
> Quote Avi's comment:
> | KVM_MEMSLOT_INVALID is actually an internal symbol, not used by
> | userspace.  Please move it to kvm_host.h.
> 
> Also, move KVM_MEMSLOT_INVALID to the highest bit
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  include/linux/kvm.h  |1 -
>  include/linux/kvm_host.h |2 ++
>  2 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index 2ce09aa..dc3aa2a 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -103,7 +103,6 @@ struct kvm_userspace_memory_region {
> 
>  /* for kvm_memory_region::flags */
>  #define KVM_MEM_LOG_DIRTY_PAGES  1UL
> -#define KVM_MEMSLOT_INVALID  (1UL << 1)
> 
>  /* for KVM_IRQ_LINE */
>  struct kvm_irq_level {
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d2b897e..4c2da5a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -35,6 +35,8 @@
>  #define KVM_MMIO_SIZE 8
>  #endif
> 
> +#define KVM_MEMSLOT_INVALID  (1UL << 31)
> +
>  /*
>   * If we support unaligned MMIO, at most one fragment will be split into two:
>   */

Please document which range is for userspace visible flags, which range
is reserved. Mention that in both headers, point to each other
("userspace definitions are there" / "internal definitions are there").

16/16 bits for each should be fine.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tsc: use kvmclock for calibration

2012-08-09 Thread Marcelo Tosatti
On Thu, Aug 09, 2012 at 03:53:24PM +0300, Avi Kivity wrote:
> On 08/09/2012 02:57 PM, Gerd Hoffmann wrote:
> > Use kvmclock for tsc calibration when running on kvm.  Without this the
> > tsc frequency calibrated by seabios can be *way* off in case the virtual
> > machine is booted on a loaded host.  I've seen seabios calibrating 27
> > instead of ca. 2800 MHz, resulting in timeouts being to short by factor
> > 100.  Which in turn leads to disk I/O errors due to timeouts, especially
> > as I/O requests tend to take a bit longer than usual on a loaded box ...
> 
> > +
> > +struct pvclock_vcpu_time_info {
> > +   u32   version;
> > +   u32   pad0;
> > +   u64   tsc_timestamp;
> > +   u64   system_time;
> > +   u32   tsc_to_system_mul;
> > +   s8tsc_shift;
> > +   u8flags;
> > +   u8pad[2];
> > +} PACKED;
> > +
> > +
> > +u64 kvm_tsc_khz(void)
> > +{
> > +u32 eax, ebx, ecx, edx, msr;
> > +struct pvclock_vcpu_time_info time;
> > +u32 addr = (u32)(&time);
> > +u64 khz;
> > +
> > +/* check presence and figure msr number */
> > +cpuid(KVM_CPUID_FEATURES, &eax, &ebx, &ecx, &edx);
> > +if (eax & KVM_FEATURE_CLOCKSOURCE2) {
> > +msr = MSR_KVM_SYSTEM_TIME_NEW;
> > +} else if (eax & KVM_FEATURE_CLOCKSOURCE) {
> > +msr = MSR_KVM_SYSTEM_TIME;
> > +} else {
> > +return 0;
> > +}
> > +
> > +/* ask kvm hypervisor to fill struct */
> > +memset(&time, 0, sizeof(time));
> > +wrmsr(msr, addr | 1);
> 
> How can this work?  There is a 64-byte alignment requirement.
> 
> > +wrmsr(msr, 0);
> > +if (time.version < 2 || time.tsc_to_system_mul == 0)
> > +return 0;
> > +
> > +/* go figure tsc frequency */
> > +khz = pvclock_tsc_khz(&time);
> > +dprintf(1, "Using kvmclock, msr 0x%x, tsc %d MHz\n",
> > +msr, (u32)khz / 1000);
> > +return khz;
> 
> That's a meaningless number.  You can be migrated to a cpu or a machine
> with very different tsc.

Thats why there exists hardware tsc frequency scaling and the software
equivalent for that on kvm.

> You want accurate time on kvm, don't use the tsc.
> 
> 
> -- 
> error compiling committee.c: too many arguments to function
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tsc: use kvmclock for calibration

2012-08-09 Thread Marcelo Tosatti
On Thu, Aug 09, 2012 at 05:01:34PM +0300, Avi Kivity wrote:
> On 08/09/2012 04:57 PM, Gerd Hoffmann wrote:
> >   Hi,
> > 
> >>> +u64 kvm_tsc_khz(void)
> >>> +{
> >>> +u32 eax, ebx, ecx, edx, msr;
> >>> +struct pvclock_vcpu_time_info time;
> >>> +u32 addr = (u32)(&time);
> >>> +u64 khz;
> >>> +
> >>> +/* check presence and figure msr number */
> >>> +cpuid(KVM_CPUID_FEATURES, &eax, &ebx, &ecx, &edx);
> >>> +if (eax & KVM_FEATURE_CLOCKSOURCE2) {
> >>> +msr = MSR_KVM_SYSTEM_TIME_NEW;
> >>> +} else if (eax & KVM_FEATURE_CLOCKSOURCE) {
> >>> +msr = MSR_KVM_SYSTEM_TIME;
> >>> +} else {
> >>> +return 0;
> >>> +}
> >>> +
> >>> +/* ask kvm hypervisor to fill struct */
> >>> +memset(&time, 0, sizeof(time));
> >>> +wrmsr(msr, addr | 1);
> >> 
> >> How can this work?
> > 
> > It did in my testing, although maybe by pure luck ...
> > 
> >> There is a 64-byte alignment requirement.
> > 
> > 64 bytes?  Sure?  The whole struct is only 32 bytes in size ...
> 
> er, the documentation says 4 bytes (so stack alignment works).  I
> distinctly remember having a large alignment requirement so we don't
> cross a page or slot boundary... something's wrong here.
> 
> > 
> > Easily fixable though, just need to grab some memory with memalign
> > instead of using the stack.
> 
> > 
> >>> +wrmsr(msr, 0);
> >>> +if (time.version < 2 || time.tsc_to_system_mul == 0)
> >>> +return 0;
> >>> +
> >>> +/* go figure tsc frequency */
> >>> +khz = pvclock_tsc_khz(&time);
> >>> +dprintf(1, "Using kvmclock, msr 0x%x, tsc %d MHz\n",
> >>> +msr, (u32)khz / 1000);
> >>> +return khz;
> >> 
> >> That's a meaningless number.  You can be migrated to a cpu or a machine
> >> with very different tsc.
> > 
> >> You want accurate time on kvm, don't use the tsc.
> > 
> > seabios uses the tsc for timeout calculations only, so it doesn't need
> > to be 100% accurate.  The order of magnitude should be correct though.
> > The Linux kernel uses the value for delay loops too, so using it for the
> > given purpose can't be *that* horrible after all ...
> > 
> > It is certainly an improvement over the current code which tries to
> > calibrate the tsc and gets totally broken results in case the busy host
> > happens to schedule the guest in the middle of calibration.
> > 
> > So what do you suggest?  The options I see are:
> > 
> >   (1) Use this patch (with alignment issue fixed of course).
> >   (2) Do a full kvmclock implementation.  Feels a bit like overkill.
> >   (3) SeaBIOS can fallback to the PIT for timing on machines which
> >   have no TSC.  We could do that too in case we detect kvm ...
> 
> What sort of timeouts are these?  If seconds, maybe the rtc would be best.

I vote for 3 so nobody has to maintain kvmclock code in SeaBIOS and Gerd
can fix the in-kernel PIT issues with GRUB (see Michaels message) while testing.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Documentation : hypercalls

2012-08-09 Thread Marcelo Tosatti
On Tue, Aug 07, 2012 at 01:09:46PM +0530, Raghavendra K T wrote:
> This is the hypercall documentation patch series
> First patch covers KVM specific hypercall information.
> 
> second patch is has typo fix for vmcall instruction
> comment in kvm_para.h
> 
> Third patch includes a veryful documentation on PowerPC
> hypercalls originally written by Alex
> 
> Changes since initial post:
>  - Added hypercall ABI (Peter)
>  - made KVM_HC_VAPIC_POLL_IRQ active explicitly (Randy)
>  - Changed vmrun/vmmrun ==> vmcall/vmmcall (Marcelo)
>  - use Linux KVM hypercall instead of ABI  (Marcelo)
>  - correct PowerPC typo (Alex)
>  - Remove value field (Alex)
> 
> Raghavendra K T (2):
>  Documentation/kvm : Add documentation on hypercalls
>  Documentation/header : Correct vmrun to vmcall typo
> 
> Alexander Graf (1) :
>  Documentation/ppc: Add ppc hypercall documentation
> 
>  Documentation/virtual/kvm/hypercalls.txt |   66 
> ++
>  Documentation/virtual/kvm/ppc-pv.txt |   22 ++
>  arch/x86/include/asm/kvm_para.h  |2 +-
>  3 files changed, 89 insertions(+), 1 deletions(-)

ACK

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TSC in qem[-kvm] 1.1+ and in-kernel irqchip

2012-08-09 Thread Marcelo Tosatti
On Thu, Aug 09, 2012 at 10:27:43PM +0400, Michael Tokarev wrote:
> As a follow-up to the patch "tsc: use kvmclock for
> calibration".
> 
> There's another problem reported by several users.
> The sympthom is that grub does not show boot menu,
> it boots default entry right away without any pause.
> 
> After quite some debugging it turned out to be
> TSC issue.  Grub uses tsc for its timeout handling.
> When setting timeout to some very large value
> (1), I can see the counter is ticking backwards
> at a very high speed, ticking from 1 to 0 in
> about 5 seconds.
> 
> Running kvm -cpu host,-tsc forces grub to use
> rtc clocksource, and the problem goes away.
> 
> The most interesting thing is that this is a
> problem new for qemu-kvm 1.1 (and is still
> present in current git), 1.0 version had no
> such issue.  And it only happens when in-kernel
> irqchip is enabled -- running with -no-kvm-irqchip
> also fixes the grub problem, so that tsc starts
> counting "correctly" for grub again.
> 
> Gerd mentioned mis-calibration of bios timer
> when host is heavily loaded.  I tested grub on
> my workstation today which was completely idle,
> no other processes running.
> 
> It smells like a bug in kvm somewhere.  And it
> happens when I explicitly pin kvm to a single
> core, so tsc should tick correctly even if its
> syncronization is broken between cores.
> 
> Current qemu also has this issue (since 1.1),
> since it also has in-kernel irqchip support now.
> 
> FWIW, here's the TSC calibration routine from
> grub:
> 
> /* Calibrate the TSC based on the RTC.  */
> static void
> calibrate_tsc (void)
> {
>   /* First calibrate the TSC rate (relative, not absolute time). */
>   grub_uint64_t start_tsc;
>   grub_uint64_t end_tsc;
> 
>   start_tsc = grub_get_tsc ();
>   grub_pit_wait (0x);
>   end_tsc = grub_get_tsc ();
> 
>   tsc_ticks_per_ms = grub_divmod64 (end_tsc - start_tsc, 55, 0);
> }

Emulation of grub_pit_wait sequence by in-kernel PIT is probably broken.

QEMU PIT emulation is also affected by miscalibration.

Please provide steps to reproduce.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tsc: use kvmclock for calibration

2012-08-09 Thread Marcelo Tosatti
On Thu, Aug 09, 2012 at 05:20:11PM +0300, Avi Kivity wrote:
> On 08/09/2012 05:18 PM, Gerd Hoffmann wrote:
> >   Hi,
> > 
> >>> So what do you suggest?  The options I see are:
> >>>
> >>>   (1) Use this patch (with alignment issue fixed of course).
> >>>   (2) Do a full kvmclock implementation.  Feels a bit like overkill.
> >>>   (3) SeaBIOS can fallback to the PIT for timing on machines which
> >>>   have no TSC.  We could do that too in case we detect kvm ...
> >> 
> >> What sort of timeouts are these?  If seconds, maybe the rtc would be best.
> > 
> > All sorts of timeouts, from a few miliseconds to seconds.
> > 
> > The problematic ones are the longer timeouts, which wait for I/O stuff
> > like disk reads complete.  The stuff with smaller timeouts (like waiting
> > for AHCI link become ready) tend to finish instantly in kvm.
> 
> That's not guaranteed.  The AHCI adapter might be real hardware.  Or the
> emulation may change.
> 
> What's wrong with having a full kvmclock implementation?  Instead of
> issuing rdtsc call a function pointer.

Its not necessary (someone is going to maintain the kvmclock frequency
retrieve, which patch is already here, versus maintainance of 
full kvmclock).

Frequency scaling (or the software equivalent: TSC trapping) are
required for other reasons anyway.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hang on reboot with 3.6-rc1

2012-08-09 Thread Alex Williamson
On Thu, 2012-08-09 at 12:17 -0600, David Ahern wrote:
> On 8/9/12 11:26 AM, Alex Williamson wrote:
> > On Thu, 2012-08-09 at 11:10 -0600, David Ahern wrote:
> >> On 8/9/12 3:42 AM, Avi Kivity wrote:
> >>> On 08/08/2012 07:27 PM, David Ahern wrote:
>  Not sure if KVM is the culprit, but it is the last line shown on the
>  console. I have to power cycle the server to reboot.
> >>>
> >>> Have you tried rmmoding the kvm modules before reboot?
> >>>
> >>> Were any guests running during this?
> >>
> >> hmm seems to be a side effect of https://lkml.org/lkml/2012/8/3/370.
> >
> > Is it fixed with the patches for that?  We're operating on a NULL
> > pointer with that bug on your system, so you probably don't want to be
> > assuming the system is entirely healthy after you hit it.  Thanks,
> 
> 
> Agreed and yes.
> 
> Latest Linus tree has the iommu patch. Still need to get the other one 
> in (https://lkml.org/lkml/2012/8/4/141) or whatever alternative is 
> deemed better.

I think you should be ok with just what is in Linus' tree now.  The
change in pci_acs_path_enabled is the same kind of bug, but I don't
think you'll hit it with the fixed iommu code.  Please let me know if
that's not the case.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


TSC in qem[-kvm] 1.1+ and in-kernel irqchip

2012-08-09 Thread Michael Tokarev
As a follow-up to the patch "tsc: use kvmclock for
calibration".

There's another problem reported by several users.
The sympthom is that grub does not show boot menu,
it boots default entry right away without any pause.

After quite some debugging it turned out to be
TSC issue.  Grub uses tsc for its timeout handling.
When setting timeout to some very large value
(1), I can see the counter is ticking backwards
at a very high speed, ticking from 1 to 0 in
about 5 seconds.

Running kvm -cpu host,-tsc forces grub to use
rtc clocksource, and the problem goes away.

The most interesting thing is that this is a
problem new for qemu-kvm 1.1 (and is still
present in current git), 1.0 version had no
such issue.  And it only happens when in-kernel
irqchip is enabled -- running with -no-kvm-irqchip
also fixes the grub problem, so that tsc starts
counting "correctly" for grub again.

Gerd mentioned mis-calibration of bios timer
when host is heavily loaded.  I tested grub on
my workstation today which was completely idle,
no other processes running.

It smells like a bug in kvm somewhere.  And it
happens when I explicitly pin kvm to a single
core, so tsc should tick correctly even if its
syncronization is broken between cores.

Current qemu also has this issue (since 1.1),
since it also has in-kernel irqchip support now.

FWIW, here's the TSC calibration routine from
grub:

/* Calibrate the TSC based on the RTC.  */
static void
calibrate_tsc (void)
{
  /* First calibrate the TSC rate (relative, not absolute time). */
  grub_uint64_t start_tsc;
  grub_uint64_t end_tsc;

  start_tsc = grub_get_tsc ();
  grub_pit_wait (0x);
  end_tsc = grub_get_tsc ();

  tsc_ticks_per_ms = grub_divmod64 (end_tsc - start_tsc, 55, 0);
}

Thanks,

/mjt
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] KVM: PPC: Book3S HV: Take the SRCU read lock before looking up memslots

2012-08-09 Thread Marcelo Tosatti
On Mon, Aug 06, 2012 at 08:06:55PM +1000, Paul Mackerras wrote:
> The generic KVM code uses SRCU (sleeping RCU) to protect accesses
> to the memslots data structures against updates due to userspace
> adding, modifying or removing memory slots.  We need to do that too,
> both to avoid accessing stale copies of the memslots and to avoid
> lockdep warnings.  This therefore adds srcu_read_lock/unlock pairs
> around code that accesses and uses memslots.
> 
> Since the real-mode handlers for H_ENTER, H_REMOVE and H_BULK_REMOVE
> need to access the memslots, and we don't want to call the SRCU code
> in real mode (since we have no assurance that it would only access
> the linear mapping), we hold the SRCU read lock for the VM while
> in the guest.  This does mean that adding or removing memory slots
> while some vcpus are executing in the guest will block for up to
> two jiffies.  This tradeoff is acceptable since adding/removing
> memory slots only happens rarely, while H_ENTER/H_REMOVE/H_BULK_REMOVE
> are performance-critical hot paths.

I would avoid doing this. What prevents the guest entry loop in the kernel to
simply reenter without ever unlocking SRCU?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly

2012-08-09 Thread Marcelo Tosatti
On Mon, Aug 06, 2012 at 08:06:03PM +1000, Paul Mackerras wrote:
> >From 44067a8ee15021583636bea4cc1d47e5370b8397 Mon Sep 17 00:00:00 2001
> From: Paul Mackerras 
> Date: Mon, 30 Jul 2012 16:40:54 +1000
> Subject: 
> 
> At present the HV style of KVM doesn't handle deletion or modification
> of memory slots correctly.  Deletion occurs when userspace does the
> KVM_SET_USER_MEMORY_REGION with arguments specifying a new length of
> zero for a slot that contains memory.  Modification occurs when the
> arguments specify a new guest_phys_addr or flags.
> 
> To allow the HV code to know which operation (creation, deletion or
> modification) is being requested, it needs to see the old and new
> contents of the memslot.  kvm_arch_prepare_memory_region has this
> information, so we modify it to pass it down to
> kvmppc_core_prepare_memory_region.  We then modify the HV version
> of that to check which operation is being performed.  If it is a
> deletion, we call a new function kvmppc_unmap_memslot to remove any
> HPT (hashed page table) entries referencing the memory being removed.
> Similarly, if we are modifying the guest physical address we also
> remove any HPT entries.  If the KVM_MEM_LOG_DIRTY_PAGES flag is now
> set and wasn't before, we call kvmppc_hv_get_dirty_log to clear any
> dirty bits so we can detect all modifications from now on.
> 
> Signed-off-by: Paul Mackerras 
> ---
>  arch/powerpc/include/asm/kvm_ppc.h  |4 +++
>  arch/powerpc/kvm/book3s_64_mmu_hv.c |   27 ++--
>  arch/powerpc/kvm/book3s_hv.c|   61 
> +++
>  arch/powerpc/kvm/book3s_hv_rm_mmu.c |2 +-
>  arch/powerpc/kvm/book3s_pr.c|2 ++
>  arch/powerpc/kvm/booke.c|2 ++
>  arch/powerpc/kvm/powerpc.c  |2 +-
>  7 files changed, 76 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
> b/arch/powerpc/include/asm/kvm_ppc.h
> index 0124937..044c921 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -140,11 +140,15 @@ extern void kvm_release_hpt(struct kvmppc_linear_info 
> *li);
>  extern int kvmppc_core_init_vm(struct kvm *kvm);
>  extern void kvmppc_core_destroy_vm(struct kvm *kvm);
>  extern int kvmppc_core_prepare_memory_region(struct kvm *kvm,
> +struct kvm_memory_slot *memslot,
> +struct kvm_memory_slot *old,
>   struct kvm_userspace_memory_region *mem);
>  extern void kvmppc_core_commit_memory_region(struct kvm *kvm,
>   struct kvm_userspace_memory_region *mem);
>  extern int kvm_vm_ioctl_get_smmu_info(struct kvm *kvm,
> struct kvm_ppc_smmu_info *info);
> +extern void kvmppc_unmap_memslot(struct kvm *kvm,
> +  struct kvm_memory_slot *memslot);
>  
>  extern int kvmppc_bookehv_init(void);
>  extern void kvmppc_bookehv_exit(void);
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
> b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 3c635c0..87735a7 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -850,7 +850,8 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long 
> *rmapp,
>   psize = hpte_page_size(hptep[0], ptel);
>   if ((hptep[0] & HPTE_V_VALID) &&
>   hpte_rpn(ptel, psize) == gfn) {
> - hptep[0] |= HPTE_V_ABSENT;
> + if (kvm->arch.using_mmu_notifiers)
> + hptep[0] |= HPTE_V_ABSENT;
>   kvmppc_invalidate_hpte(kvm, hptep, i);
>   /* Harvest R and C */
>   rcbits = hptep[1] & (HPTE_R_R | HPTE_R_C);
> @@ -877,6 +878,28 @@ int kvm_unmap_hva_range(struct kvm *kvm, unsigned long 
> start, unsigned long end)
>   return 0;
>  }
>  
> +void kvmppc_unmap_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot)
> +{
> + unsigned long *rmapp;
> + unsigned long gfn;
> + unsigned long n;
> +
> + rmapp = memslot->rmap;
> + gfn = memslot->base_gfn;
> + for (n = memslot->npages; n; --n) {
> + /*
> +  * Testing the present bit without locking is OK because
> +  * the memslot has been marked invalid already, and hence
> +  * no new HPTEs referencing this page can be created,
> +  * thus the present bit can't go from 0 to 1.
> +  */
> + if (*rmapp & KVMPPC_RMAP_PRESENT)
> + kvm_unmap_rmapp(kvm, rmapp, gfn);
> + ++rmapp;
> + ++gfn;
> + }
> +}
> +
>  static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
>unsigned long gfn)
>  {
> @@ -1039,7 +1062,7 @@ long kvmppc_hv_get_dirty_log(struct kvm *kvm, struct 
> kvm_memory_slot *memslot)
>   rmapp = memslot->rmap;
>   map = memslot->dirty_bitmap;

Re: [RFC PATCH 5/5] KVM: PPC: Take the SRCU lock around memslot use

2012-08-09 Thread Marcelo Tosatti
On Mon, Aug 06, 2012 at 08:08:16PM +1000, Paul Mackerras wrote:
> The generic KVM code uses SRCU (sleeping RCU) to protect accesses
> to the memslots data structures against updates due to userspace
> adding, modifying or removing memory slots.  We need to do that too,
> both to avoid accessing stale copies of the memslots and to avoid
> lockdep warnings.  This therefore adds srcu_read_lock/unlock pairs
> around code that accesses and uses memslots in the Book 3S PR code
> and the Book E (44x and e500) code.
> 
> Signed-off-by: Paul Mackerras 
> ---
> Compile-tested only.
> 
>  arch/powerpc/kvm/44x_tlb.c   |6 ++
>  arch/powerpc/kvm/book3s_pr.c |6 ++
>  arch/powerpc/kvm/e500_tlb.c  |6 ++
>  3 files changed, 18 insertions(+)

On top of the previous comment:

x86 calls srcu_read_lock at the beginning of the KVM_RUN ioctl handler
(__vcpu_run in arch/x86/kvm/x86.c), unlocks srcu on guest entry, locks
on guest exit before any potential use of memslots, and unlocks on
exit to userspace.

This has the advantage of not sprinkling srcu lock/unlock calls all over
(except from other ioctls, of course). Its low maintenance.

Perhaps doing the same on PPC is not a bad idea.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hang on reboot with 3.6-rc1

2012-08-09 Thread David Ahern

On 8/9/12 11:26 AM, Alex Williamson wrote:

On Thu, 2012-08-09 at 11:10 -0600, David Ahern wrote:

On 8/9/12 3:42 AM, Avi Kivity wrote:

On 08/08/2012 07:27 PM, David Ahern wrote:

Not sure if KVM is the culprit, but it is the last line shown on the
console. I have to power cycle the server to reboot.


Have you tried rmmoding the kvm modules before reboot?

Were any guests running during this?


hmm seems to be a side effect of https://lkml.org/lkml/2012/8/3/370.


Is it fixed with the patches for that?  We're operating on a NULL
pointer with that bug on your system, so you probably don't want to be
assuming the system is entirely healthy after you hit it.  Thanks,



Agreed and yes.

Latest Linus tree has the iommu patch. Still need to get the other one 
in (https://lkml.org/lkml/2012/8/4/141) or whatever alternative is 
deemed better.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hang on reboot with 3.6-rc1

2012-08-09 Thread Alex Williamson
On Thu, 2012-08-09 at 11:10 -0600, David Ahern wrote:
> On 8/9/12 3:42 AM, Avi Kivity wrote:
> > On 08/08/2012 07:27 PM, David Ahern wrote:
> >> Not sure if KVM is the culprit, but it is the last line shown on the
> >> console. I have to power cycle the server to reboot.
> >
> > Have you tried rmmoding the kvm modules before reboot?
> >
> > Were any guests running during this?
> 
> hmm seems to be a side effect of https://lkml.org/lkml/2012/8/3/370.

Is it fixed with the patches for that?  We're operating on a NULL
pointer with that bug on your system, so you probably don't want to be
assuming the system is entirely healthy after you hit it.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hang on reboot with 3.6-rc1

2012-08-09 Thread David Ahern

On 8/9/12 3:42 AM, Avi Kivity wrote:

On 08/08/2012 07:27 PM, David Ahern wrote:

Not sure if KVM is the culprit, but it is the last line shown on the
console. I have to power cycle the server to reboot.


Have you tried rmmoding the kvm modules before reboot?

Were any guests running during this?


hmm seems to be a side effect of https://lkml.org/lkml/2012/8/3/370.

David
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/15] memory: MemoryRegion topology must be stable when updating

2012-08-09 Thread Blue Swirl
On Thu, Aug 9, 2012 at 7:28 AM, liu ping fan  wrote:
> On Thu, Aug 9, 2012 at 3:17 AM, Blue Swirl  wrote:
>> On Wed, Aug 8, 2012 at 6:25 AM, Liu Ping Fan  wrote:
>>> From: Liu Ping Fan 
>>>
>>> Using mem_map_lock to protect among updaters. So we can get the intact
>>> snapshot of mem topology -- FlatView & radix-tree.
>>>
>>> Signed-off-by: Liu Ping Fan 
>>> ---
>>>  exec.c   |3 +++
>>>  memory.c |   22 ++
>>>  memory.h |2 ++
>>>  3 files changed, 27 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/exec.c b/exec.c
>>> index 8244d54..0e29ef9 100644
>>> --- a/exec.c
>>> +++ b/exec.c
>>> @@ -210,6 +210,8 @@ static unsigned phys_map_nodes_nb, 
>>> phys_map_nodes_nb_alloc;
>>> The bottom level has pointers to MemoryRegionSections.  */
>>>  static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };
>>>
>>> +QemuMutex mem_map_lock;
>>> +
>>>  static void io_mem_init(void);
>>>  static void memory_map_init(void);
>>>
>>> @@ -637,6 +639,7 @@ void cpu_exec_init_all(void)
>>>  #if !defined(CONFIG_USER_ONLY)
>>>  memory_map_init();
>>>  io_mem_init();
>>> +qemu_mutex_init(&mem_map_lock);
>>
>> I'd move this and the mutex to memory.c since there are no other uses.
>> The mutex could be static then.
>>
> But the init entry is in exec.c, not memory.c.

Memory subsystem does not have an init function of its own, this can
be the start of it.

>
> Regards,
> pingfan
>
>>>  #endif
>>>  }
>>>
>>> diff --git a/memory.c b/memory.c
>>> index aab4a31..5986532 100644
>>> --- a/memory.c
>>> +++ b/memory.c
>>> @@ -761,7 +761,9 @@ void memory_region_transaction_commit(void)
>>>  assert(memory_region_transaction_depth);
>>>  --memory_region_transaction_depth;
>>>  if (!memory_region_transaction_depth && memory_region_update_pending) {
>>> +qemu_mutex_lock(&mem_map_lock);
>>>  memory_region_update_topology(NULL);
>>> +qemu_mutex_unlock(&mem_map_lock);
>>>  }
>>>  }
>>>
>>> @@ -1069,8 +1071,10 @@ void memory_region_set_log(MemoryRegion *mr, bool 
>>> log, unsigned client)
>>>  {
>>>  uint8_t mask = 1 << client;
>>>
>>> +qemu_mutex_lock(&mem_map_lock);
>>>  mr->dirty_log_mask = (mr->dirty_log_mask & ~mask) | (log * mask);
>>>  memory_region_update_topology(mr);
>>> +qemu_mutex_unlock(&mem_map_lock);
>>>  }
>>>
>>>  bool memory_region_get_dirty(MemoryRegion *mr, target_phys_addr_t addr,
>>> @@ -1103,8 +1107,10 @@ void memory_region_sync_dirty_bitmap(MemoryRegion 
>>> *mr)
>>>  void memory_region_set_readonly(MemoryRegion *mr, bool readonly)
>>>  {
>>>  if (mr->readonly != readonly) {
>>> +qemu_mutex_lock(&mem_map_lock);
>>>  mr->readonly = readonly;
>>>  memory_region_update_topology(mr);
>>> +qemu_mutex_unlock(&mem_map_lock);
>>>  }
>>>  }
>>>
>>> @@ -1112,7 +1118,9 @@ void 
>>> memory_region_rom_device_set_readable(MemoryRegion *mr, bool readable)
>>>  {
>>>  if (mr->readable != readable) {
>>>  mr->readable = readable;
>>> +qemu_mutex_lock(&mem_map_lock);
>>>  memory_region_update_topology(mr);
>>> +qemu_mutex_unlock(&mem_map_lock);
>>>  }
>>>  }
>>>
>>> @@ -1206,6 +1214,7 @@ void memory_region_add_eventfd(MemoryRegion *mr,
>>>  };
>>>  unsigned i;
>>>
>>> +qemu_mutex_lock(&mem_map_lock);
>>>  for (i = 0; i < mr->ioeventfd_nb; ++i) {
>>>  if (memory_region_ioeventfd_before(mrfd, mr->ioeventfds[i])) {
>>>  break;
>>> @@ -1218,6 +1227,7 @@ void memory_region_add_eventfd(MemoryRegion *mr,
>>>  sizeof(*mr->ioeventfds) * (mr->ioeventfd_nb-1 - i));
>>>  mr->ioeventfds[i] = mrfd;
>>>  memory_region_update_topology(mr);
>>> +qemu_mutex_unlock(&mem_map_lock);
>>>  }
>>>
>>>  void memory_region_del_eventfd(MemoryRegion *mr,
>>> @@ -1236,6 +1246,7 @@ void memory_region_del_eventfd(MemoryRegion *mr,
>>>  };
>>>  unsigned i;
>>>
>>> +qemu_mutex_lock(&mem_map_lock);
>>>  for (i = 0; i < mr->ioeventfd_nb; ++i) {
>>>  if (memory_region_ioeventfd_equal(mrfd, mr->ioeventfds[i])) {
>>>  break;
>>> @@ -1248,6 +1259,7 @@ void memory_region_del_eventfd(MemoryRegion *mr,
>>>  mr->ioeventfds = g_realloc(mr->ioeventfds,
>>>sizeof(*mr->ioeventfds)*mr->ioeventfd_nb 
>>> + 1);
>>>  memory_region_update_topology(mr);
>>> +qemu_mutex_unlock(&mem_map_lock);
>>>  }
>>>
>>>  static void memory_region_add_subregion_common(MemoryRegion *mr,
>>> @@ -1259,6 +1271,8 @@ static void 
>>> memory_region_add_subregion_common(MemoryRegion *mr,
>>>  assert(!subregion->parent);
>>>  subregion->parent = mr;
>>>  subregion->addr = offset;
>>> +
>>> +qemu_mutex_lock(&mem_map_lock);
>>>  QTAILQ_FOREACH(other, &mr->subregions, subregions_link) {
>>>  if (subregion->may_overlap || other->may_overlap) {
>>>  continue;
>>> @@ -1289,6 +1303,7 @@ static void 
>>> memory_region_add_subregion_common(MemoryRegion 

Re: [PATCH 5/8] KVM: Add hva_to_memslot

2012-08-09 Thread Alexander Graf


On 09.08.2012, at 12:36, Avi Kivity  wrote:

> On 08/09/2012 01:34 PM, Takuya Yoshikawa wrote:
>> On Tue,  7 Aug 2012 12:57:13 +0200
>> Alexander Graf  wrote:
>> 
>>> +struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm, hva_t hva)
>>> +{
>>> +struct kvm_memslots *slots = kvm_memslots(kvm);
>>> +struct kvm_memory_slot *memslot;
>>> +
>>> +kvm_for_each_memslot(memslot, slots)
>>> +if (hva >= memslot->userspace_addr &&
>>> +  hva < memslot->userspace_addr + memslot->npages)
>>> +return memslot;
>>> +
>>> +return NULL;
>>> +}
>> 
>> Can't we have two memory slots which contain that hva?
>> I thought that's why hva handler had to check all slots.
> 
> We can and do.  Good catch.
> 

Hrm. So I guess we can only do an hva_is_guest_memory() helper? That's all I 
really need anyways :)

Alex


> 
> -- 
> error compiling committee.c: too many arguments to function
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 4/5] hw/kvm/arm_gic: Implement support for KVM in-kernel ARM GIC

2012-08-09 Thread Peter Maydell
Implement support for using the KVM in-kernel GIC for ARM.

Signed-off-by: Peter Maydell 
---
 hw/a15mpcore.c   |   11 +++-
 hw/arm/Makefile.objs |1 +
 hw/kvm/arm_gic.c |  153 ++
 3 files changed, 164 insertions(+), 1 deletion(-)
 create mode 100644 hw/kvm/arm_gic.c

diff --git a/hw/a15mpcore.c b/hw/a15mpcore.c
index fc0a02a..e075849 100644
--- a/hw/a15mpcore.c
+++ b/hw/a15mpcore.c
@@ -19,6 +19,7 @@
  */
 
 #include "sysbus.h"
+#include "kvm.h"
 
 /* A15MP private memory region.  */
 
@@ -41,7 +42,15 @@ static int a15mp_priv_init(SysBusDevice *dev)
 A15MPPrivState *s = FROM_SYSBUS(A15MPPrivState, dev);
 SysBusDevice *busdev;
 
-s->gic = qdev_create(NULL, "arm_gic");
+/* TODO when the VGIC patches make it to Christoffer's kernel
+ * tree we can make !kvm_irqchip_in_kernel() a fatal error.
+ */
+if (kvm_irqchip_in_kernel()) {
+s->gic = qdev_create(NULL, "kvm-arm_gic");
+} else {
+s->gic = qdev_create(NULL, "arm_gic");
+}
+
 qdev_prop_set_uint32(s->gic, "num-cpu", s->num_cpu);
 qdev_prop_set_uint32(s->gic, "num-irq", s->num_irq);
 qdev_prop_set_uint32(s->gic, "revision", 2);
diff --git a/hw/arm/Makefile.objs b/hw/arm/Makefile.objs
index c413780..881fffd 100644
--- a/hw/arm/Makefile.objs
+++ b/hw/arm/Makefile.objs
@@ -39,5 +39,6 @@ obj-y += imx_serial.o imx_ccm.o imx_timer.o imx_avic.o
 obj-y += kzm.o
 obj-y += pl041.o lm4549.o
 obj-$(CONFIG_FDT) += ../device_tree.o
+obj-$(CONFIG_KVM) += kvm/arm_gic.o
 
 obj-y := $(addprefix ../,$(obj-y))
diff --git a/hw/kvm/arm_gic.c b/hw/kvm/arm_gic.c
new file mode 100644
index 000..4535f90
--- /dev/null
+++ b/hw/kvm/arm_gic.c
@@ -0,0 +1,153 @@
+/*
+ * ARM Generic Interrupt Controller using KVM in-kernel support
+ *
+ * Copyright (c) 2012 Linaro Limited
+ * Written by Peter Maydell
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation, either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see .
+ */
+
+#include "hw/sysbus.h"
+#include "kvm.h"
+#include "hw/arm_gic_internal.h"
+
+#define TYPE_KVM_ARM_GIC "kvm-arm_gic"
+#define KVM_ARM_GIC(obj) \
+ OBJECT_CHECK(gic_state, (obj), TYPE_KVM_ARM_GIC)
+#define KVM_ARM_GIC_CLASS(klass) \
+ OBJECT_CLASS_CHECK(KVMARMGICClass, (klass), TYPE_KVM_ARM_GIC)
+#define KVM_ARM_GIC_GET_CLASS(obj) \
+ OBJECT_GET_CLASS(KVMARMGICClass, (obj), TYPE_KVM_ARM_GIC)
+
+typedef struct KVMARMGICClass {
+ARMGICCommonClass parent_class;
+int (*parent_init)(SysBusDevice *dev);
+void (*parent_reset)(DeviceState *dev);
+} KVMARMGICClass;
+
+static void kvm_arm_gic_set_irq(void *opaque, int irq, int level)
+{
+/* Meaning of the 'irq' parameter:
+ *  [0..N-1] : external interrupts
+ *  [N..N+31] : PPI (internal) interrupts for CPU 0
+ *  [N+32..N+63] : PPI (internal interrupts for CPU 1
+ *  ...
+ */
+gic_state *s = (gic_state *)opaque;
+
+
+if (irq < (s->num_irq - GIC_INTERNAL)) {
+/* External interrupt number 'irq' */
+kvm_set_irq(kvm_state, irq + GIC_INTERNAL, !!level);
+} else {
+struct kvm_irq_level irq_level;
+int cpu;
+irq -= (s->num_irq - GIC_INTERNAL);
+cpu = irq / GIC_INTERNAL;
+irq %= GIC_INTERNAL;
+/* Internal interrupt 'irq' for CPU 'cpu' */
+irq_level.irq = irq;
+irq_level.level = !!level;
+kvm_vcpu_ioctl(qemu_get_cpu(cpu), KVM_IRQ_LINE, &irq_level);
+}
+}
+
+static void kvm_arm_gic_put(gic_state *s)
+{
+/* TODO: there isn't currently a kernel interface to set the GIC state */
+}
+
+static void kvm_arm_gic_get(gic_state *s)
+{
+/* TODO: there isn't currently a kernel interface to get the GIC state */
+}
+
+static void kvm_arm_gic_reset(DeviceState *dev)
+{
+gic_state *s = ARM_GIC_COMMON(dev);
+KVMARMGICClass *kgc = KVM_ARM_GIC_GET_CLASS(s);
+kgc->parent_reset(dev);
+kvm_arm_gic_put(s);
+}
+
+static int kvm_arm_gic_init(SysBusDevice *dev)
+{
+/* Device instance init function for the GIC sysbus device */
+int i;
+gic_state *s = FROM_SYSBUS(gic_state, dev);
+KVMARMGICClass *kgc = KVM_ARM_GIC_GET_CLASS(s);
+
+kgc->parent_init(dev);
+
+i = s->num_irq - GIC_INTERNAL;
+/* For the GIC, also expose incoming GPIO lines for PPIs for each CPU.
+ * GPIO array layout is thus:
+ *  [0..N-1] SPIs
+ *  [N..N+31] PPIs for CPU 0
+ *  [N+32..N+63] PPIs for CPU 1
+ *   ...
+ */
+

[RFC 0/5] Support KVM on ARM

2012-08-09 Thread Peter Maydell
This patch series is an early RFC for the QEMU patches adding
support for KVM on ARM Cortex-A15 hardware. It's intended for
use with the kernel tree at
 git://github.com/virtualopensystems/linux-kvm-arm.git kvm-a15-v10-stage

There are two aims here:
 * early review for qemu-devel folk
 * resend a complete set of patches to kvmarm, since I've done a
   number of incremental changes and tweaks since Christoffer's
   original QEMU code

These patches depend on various cleanups to KVM and configure
which I've posted in the last couple of weeks:
 configure: Don't implicitly hardcode list of KVM architectures
 update-linux-headers.sh: Pull in asm-generic/kvm_para.h
 update-linux-headers.sh: Don't hard code list of architectures
 kvm-all.c: Move init of irqchip_inject_ioctl out of kvm_irqchip_create()
 kvm: Add documentation comment for kvm_irqchip_in_kernel()
 kvm: Decouple 'GSI routing' from 'kernel irqchip'
 kvm: Decouple 'MSI routing via irqfds' from 'kernel irqchip'
 kvm: Decouple 'irqfds usable' from 'kernel irqchip'
 kvm: Move kvm_allows_irq0_override() to target-i386, fix return type
 kvm: Rename kvm_irqchip_set_irq() to kvm_set_irq()
 kvm: Decouple 'async interrupt delivery' from 'kernel irqchip'

A git branch consisting of qemu master + these preliminary fixes
 + the ARM patches is available here:
 git://git.linaro.org/people/pmaydell/qemu-arm.git kvm-arm

with pointy-clicky version here:
 
http://git.linaro.org/gitweb?p=people/pmaydell/qemu-arm.git;a=shortlog;h=refs/heads/kvm-arm

There are still a number of TODOs scattered through the code;
a quick summary:
 * a15mpcore should enforce vgic use (currently not done pending the
   VGIC patches landing in Christoffer's kernel tree)
 * the makefile change for the hw/kvm/arm_gic object is not right
   (see discussions on qemu-devel in the past about how to handle
   "only if architecture foo and KVM" object files)
 * kvm_arch_put/get_registers should drive register set/get from
   the cp15 hashtable
 * we should use an accessor function for c2_mask/base/control
 * breakpoint support is unimplemented
 * vgic register save/load from kernel is unimplemented (no kernel ABI)
 * fpu save/load unimplemented (no kernel ABI yet)
 * tell kernel the A15 peripheral base address (no kernel ABI)
 * the kernel ABI for sending per-CPU interrupts for VGIC vs non-VGIC
   is inconsistent (the former uses a vcpu ioctl, the latter encodes
   cpu number in the irq number), and we should standardise on one
   approach or the other

Christoffer Dall (1):
  ARM: KVM: Add support for KVM on ARM architecture

Peter Maydell (4):
  linux-headers: Add ARM KVM headers (not for upstream)
  hw/arm_gic: Add presave/postload hooks
  hw/kvm/arm_gic: Implement support for KVM in-kernel ARM GIC
  configure: Enable KVM on ARM

 configure|2 +-
 hw/a15mpcore.c   |   11 +-
 hw/arm/Makefile.objs |1 +
 hw/arm_gic_common.c  |   10 ++
 hw/arm_gic_internal.h|2 +
 hw/arm_pic.c |   28 
 hw/kvm/arm_gic.c |  153 +++
 linux-headers/asm-arm/kvm.h  |  119 +++
 linux-headers/asm-arm/kvm_para.h |1 +
 linux-headers/asm-generic/kvm_para.h |5 +
 linux-headers/linux/kvm.h|3 +
 target-arm/Makefile.objs |1 +
 target-arm/cpu.h |1 +
 target-arm/helper.c  |2 +-
 target-arm/kvm.c |  274 ++
 15 files changed, 610 insertions(+), 3 deletions(-)
 create mode 100644 hw/kvm/arm_gic.c
 create mode 100644 linux-headers/asm-arm/kvm.h
 create mode 100644 linux-headers/asm-arm/kvm_para.h
 create mode 100644 linux-headers/asm-generic/kvm_para.h
 create mode 100644 target-arm/kvm.c

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 5/5] configure: Enable KVM on ARM

2012-08-09 Thread Peter Maydell
Enable KVM on ARM hosts, now that all the necessary components
for it exist.

Signed-off-by: Peter Maydell 
---
 configure |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/configure b/configure
index b9a0b27..f7a825a 100755
--- a/configure
+++ b/configure
@@ -3797,7 +3797,7 @@ case "$target_arch2" in
 echo "CONFIG_NO_XEN=y" >> $config_target_mak
 esac
 case "$target_arch2" in
-  i386|x86_64|ppcemb|ppc|ppc64|s390x)
+  arm|i386|x86_64|ppcemb|ppc|ppc64|s390x)
 # Make sure the target and host cpus are compatible
 if test "$kvm" = "yes" -a "$target_softmmu" = "yes" -a \
   \( "$target_arch2" = "$cpu" -o \
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 2/5] ARM: KVM: Add support for KVM on ARM architecture

2012-08-09 Thread Peter Maydell
From: Christoffer Dall 

Add basic support for KVM on ARM architecture.

Signed-off-by: Christoffer Dall 
[Rusty: updates to use KVM_ARM_VCPU_INIT, KVM_GET/SET_MSRS]
Signed-off-by: Rusty Russell 
[PMM: Minor tweaks and code cleanup]
Signed-off-by: Peter Maydell 
---
 hw/arm_pic.c |   28 +
 target-arm/Makefile.objs |1 +
 target-arm/cpu.h |1 +
 target-arm/helper.c  |2 +-
 target-arm/kvm.c |  274 ++
 5 files changed, 305 insertions(+), 1 deletion(-)
 create mode 100644 target-arm/kvm.c

diff --git a/hw/arm_pic.c b/hw/arm_pic.c
index ffb4d41..950e6c4 100644
--- a/hw/arm_pic.c
+++ b/hw/arm_pic.c
@@ -9,6 +9,7 @@
 
 #include "hw.h"
 #include "arm-misc.h"
+#include "kvm.h"
 
 /* Input 0 is IRQ and input 1 is FIQ.  */
 static void arm_pic_cpu_handler(void *opaque, int irq, int level)
@@ -34,7 +35,34 @@ static void arm_pic_cpu_handler(void *opaque, int irq, int 
level)
 }
 }
 
+#ifdef CONFIG_KVM
+static void kvm_arm_pic_cpu_handler(void *opaque, int irq, int level)
+{
+ARMCPU *cpu = opaque;
+CPUARMState *env = &cpu->env;
+int kvm_irq;
+
+switch (irq) {
+case ARM_PIC_CPU_IRQ:
+kvm_irq = KVM_ARM_IRQ_LINE;
+break;
+case ARM_PIC_CPU_FIQ:
+kvm_irq = KVM_ARM_FIQ_LINE;
+break;
+default:
+hw_error("kvm_arm_pic_cpu_handler: Bad interrupt line %d\n", irq);
+}
+kvm_irq |= (env->cpu_index << 1);
+kvm_set_irq(kvm_state, kvm_irq, level ? 1 : 0);
+}
+#endif
+
 qemu_irq *arm_pic_init_cpu(ARMCPU *cpu)
 {
+#ifdef CONFIG_KVM
+if (kvm_enabled()) {
+return qemu_allocate_irqs(kvm_arm_pic_cpu_handler, cpu, 2);
+}
+#endif
 return qemu_allocate_irqs(arm_pic_cpu_handler, cpu, 2);
 }
diff --git a/target-arm/Makefile.objs b/target-arm/Makefile.objs
index f447c4f..f906d20 100644
--- a/target-arm/Makefile.objs
+++ b/target-arm/Makefile.objs
@@ -1,5 +1,6 @@
 obj-y += arm-semi.o
 obj-$(CONFIG_SOFTMMU) += machine.o
+obj-$(CONFIG_KVM) += kvm.o
 obj-y += translate.o op_helper.o helper.o cpu.o
 obj-y += neon_helper.o iwmmxt_helper.o
 
diff --git a/target-arm/cpu.h b/target-arm/cpu.h
index 191895c..5194bad 100644
--- a/target-arm/cpu.h
+++ b/target-arm/cpu.h
@@ -236,6 +236,7 @@ ARMCPU *cpu_arm_init(const char *cpu_model);
 void arm_translate_init(void);
 int cpu_arm_exec(CPUARMState *s);
 void do_interrupt(CPUARMState *);
+int bank_number(CPUARMState *env, int mode);
 void switch_mode(CPUARMState *, int);
 uint32_t do_arm_semihosting(CPUARMState *env);
 
diff --git a/target-arm/helper.c b/target-arm/helper.c
index 5727da2..632a366 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -1619,7 +1619,7 @@ uint32_t HELPER(get_r13_banked)(CPUARMState *env, 
uint32_t mode)
 #else
 
 /* Map CPU modes onto saved register banks.  */
-static inline int bank_number(CPUARMState *env, int mode)
+int bank_number(CPUARMState *env, int mode)
 {
 switch (mode) {
 case ARM_CPU_MODE_USR:
diff --git a/target-arm/kvm.c b/target-arm/kvm.c
new file mode 100644
index 000..b9abee0
--- /dev/null
+++ b/target-arm/kvm.c
@@ -0,0 +1,274 @@
+/*
+ * ARM implementation of KVM hooks
+ *
+ * Copyright Christoffer Dall 2009-2010
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "qemu-common.h"
+#include "qemu-timer.h"
+#include "sysemu.h"
+#include "kvm.h"
+#include "cpu.h"
+#include "device_tree.h"
+#include "hw/arm-misc.h"
+
+const KVMCapabilityInfo kvm_arch_required_capabilities[] = {
+KVM_CAP_LAST_INFO
+};
+
+int kvm_arch_init(KVMState *s)
+{
+/* For ARM interrupt delivery is always asynchronous,
+ * whether we are using an in-kernel VGIC or not.
+ */
+kvm_async_interrupts_allowed = true;
+return 0;
+}
+
+int kvm_arch_init_vcpu(CPUARMState *env)
+{
+struct kvm_vcpu_init init;
+
+init.target = KVM_ARM_TARGET_CORTEX_A15;
+memset(init.features, 0, sizeof(init.features));
+return kvm_vcpu_ioctl(env, KVM_ARM_VCPU_INIT, &init);
+}
+
+#define MSR32_INDEX_OF(coproc, crn, opc1, crm, opc2) \
+(((coproc)<<16) | ((opc1)<<11) | ((crn)<<7) | ((opc2)<<4) | (crm))
+
+int kvm_arch_put_registers(CPUARMState *env, int level)
+{
+struct kvm_regs regs;
+int mode, bn;
+struct cp15 {
+struct kvm_msrs hdr;
+struct kvm_msr_entry e[2];
+} cp15;
+int ret;
+
+ret = kvm_vcpu_ioctl(env, KVM_GET_REGS, ®s);
+if (ret < 0) {
+return ret;
+}
+
+/* We make sure the banked regs are properly set */
+mode = env->uncached_cpsr & CPSR_M;
+bn = bank_number(env, mode);
+if (mode == ARM_CPU_MODE_FIQ) {
+memcpy(env->fiq_regs, env->regs + 8, 5 * sizeof(uint32_t));
+} else {
+memcpy(env->usr_regs, env->regs + 8, 5 * sizeof(uint32_t));
+}
+env->banked_r13[bn] = env->regs[13];
+env->banked_r14[bn] = env->regs[14]

[RFC 3/5] hw/arm_gic: Add presave/postload hooks

2012-08-09 Thread Peter Maydell
Add presave/postload hooks to the ARM GIC common base class.
These will be used by the KVM in-kernel GIC subclass to sync
state between kernel and userspace when migrating.

Signed-off-by: Peter Maydell 
---
 hw/arm_gic_common.c   |   10 ++
 hw/arm_gic_internal.h |2 ++
 2 files changed, 12 insertions(+)

diff --git a/hw/arm_gic_common.c b/hw/arm_gic_common.c
index 360e782..d972755 100644
--- a/hw/arm_gic_common.c
+++ b/hw/arm_gic_common.c
@@ -23,9 +23,14 @@
 static void gic_save(QEMUFile *f, void *opaque)
 {
 gic_state *s = (gic_state *)opaque;
+ARMGICCommonClass *c = ARM_GIC_COMMON_GET_CLASS(s);
 int i;
 int j;
 
+if (c->pre_save) {
+c->pre_save(s);
+}
+
 qemu_put_be32(f, s->enabled);
 for (i = 0; i < s->num_cpu; i++) {
 qemu_put_be32(f, s->cpu_enabled[i]);
@@ -57,6 +62,7 @@ static void gic_save(QEMUFile *f, void *opaque)
 static int gic_load(QEMUFile *f, void *opaque, int version_id)
 {
 gic_state *s = (gic_state *)opaque;
+ARMGICCommonClass *c = ARM_GIC_COMMON_GET_CLASS(s);
 int i;
 int j;
 
@@ -91,6 +97,10 @@ static int gic_load(QEMUFile *f, void *opaque, int 
version_id)
 s->irq_state[i].trigger = qemu_get_byte(f);
 }
 
+if (c->post_load) {
+c->post_load(s);
+}
+
 return 0;
 }
 
diff --git a/hw/arm_gic_internal.h b/hw/arm_gic_internal.h
index db4fad5..183dca6 100644
--- a/hw/arm_gic_internal.h
+++ b/hw/arm_gic_internal.h
@@ -118,6 +118,8 @@ void gic_init_irqs_and_distributor(gic_state *s, int 
num_irq);
 
 typedef struct ARMGICCommonClass {
 SysBusDeviceClass parent_class;
+void (*pre_save)(gic_state *s);
+void (*post_load)(gic_state *s);
 } ARMGICCommonClass;
 
 #define TYPE_ARM_GIC "arm_gic"
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next RFC V5 3/5] virtio: intorduce an API to set affinity for a virtqueue

2012-08-09 Thread Avi Kivity
On 08/09/2012 06:13 PM, Paolo Bonzini wrote:
> Il 05/07/2012 12:29, Jason Wang ha scritto:
>> Sometimes, virtio device need to configure irq affiniry hint to maximize the
>> performance. Instead of just exposing the irq of a virtqueue, this patch
>> introduce an API to set the affinity for a virtqueue.
>> 
>> The api is best-effort, the affinity hint may not be set as expected due to
>> platform support, irq sharing or irq type. Currently, only pci method were
>> implemented and we set the affinity according to:
>> 
>> - if device uses INTX, we just ignore the request
>> - if device has per vq vector, we force the affinity hint
>> - if the virtqueues share MSI, make the affinity OR over all affinities
>>  requested
>> 
>> Signed-off-by: Jason Wang 
> 
> It looks like both I and Jason will need these patches during the 3.7
> merge window, and from different trees (net-next vs. scsi).  How do we
> synchronize?

Get one of them to promise not to rebase, merge it, and base your
patches on top of the merge.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 16455] Host freezes when resuming after having been suspended (S3) with a VM up

2012-08-09 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=16455


Alan  changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|WILL_NOT_FIX|




-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tsc: use kvmclock for calibration

2012-08-09 Thread Avi Kivity
On 08/09/2012 05:18 PM, Gerd Hoffmann wrote:
>   Hi,
> 
>>> So what do you suggest?  The options I see are:
>>>
>>>   (1) Use this patch (with alignment issue fixed of course).
>>>   (2) Do a full kvmclock implementation.  Feels a bit like overkill.
>>>   (3) SeaBIOS can fallback to the PIT for timing on machines which
>>>   have no TSC.  We could do that too in case we detect kvm ...
>> 
>> What sort of timeouts are these?  If seconds, maybe the rtc would be best.
> 
> All sorts of timeouts, from a few miliseconds to seconds.
> 
> The problematic ones are the longer timeouts, which wait for I/O stuff
> like disk reads complete.  The stuff with smaller timeouts (like waiting
> for AHCI link become ready) tend to finish instantly in kvm.

That's not guaranteed.  The AHCI adapter might be real hardware.  Or the
emulation may change.

What's wrong with having a full kvmclock implementation?  Instead of
issuing rdtsc call a function pointer.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next RFC V5 3/5] virtio: intorduce an API to set affinity for a virtqueue

2012-08-09 Thread Paolo Bonzini
Il 30/07/2012 08:27, Paolo Bonzini ha scritto:
 >> > Did you set the affinity manually in your experiments, or perhaps 
 >> > there
 >> > is a difference between scsi and networking... (interrupt mitigation?)
>> > 
>> > You need to run irqbalancer in guest to make it actually work. Do you?
> Yes, of course, now on to debugging that one.  I just wanted to ask
> before the weekend if I was missing something as obvious as that.

It was indeed an irqbalance bug, it is fixed now upstream.

Paolo

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next RFC V5 3/5] virtio: intorduce an API to set affinity for a virtqueue

2012-08-09 Thread Paolo Bonzini
Il 05/07/2012 12:29, Jason Wang ha scritto:
> Sometimes, virtio device need to configure irq affiniry hint to maximize the
> performance. Instead of just exposing the irq of a virtqueue, this patch
> introduce an API to set the affinity for a virtqueue.
> 
> The api is best-effort, the affinity hint may not be set as expected due to
> platform support, irq sharing or irq type. Currently, only pci method were
> implemented and we set the affinity according to:
> 
> - if device uses INTX, we just ignore the request
> - if device has per vq vector, we force the affinity hint
> - if the virtqueues share MSI, make the affinity OR over all affinities
>  requested
> 
> Signed-off-by: Jason Wang 

It looks like both I and Jason will need these patches during the 3.7
merge window, and from different trees (net-next vs. scsi).  How do we
synchronize?

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 16455] Host freezes when resuming after having been suspended (S3) with a VM up

2012-08-09 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=16455


Avi Kivity  changed:

   What|Removed |Added

 CC||a...@redhat.com




--- Comment #2 from Avi Kivity   2012-08-09 15:09:00 ---
Actually we can re-enable vmx if the bios disabled it.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 16455] Host freezes when resuming after having been suspended (S3) with a VM up

2012-08-09 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=16455


Alan  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 CC||a...@lxorguk.ukuu.org.uk
 Resolution||WILL_NOT_FIX




--- Comment #1 from Alan   2012-08-09 14:59:30 ---
Nothing we can do if a BIOS does this

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tsc: use kvmclock for calibration

2012-08-09 Thread Avi Kivity
On 08/09/2012 04:57 PM, Gerd Hoffmann wrote:
>   Hi,
> 
>>> +u64 kvm_tsc_khz(void)
>>> +{
>>> +u32 eax, ebx, ecx, edx, msr;
>>> +struct pvclock_vcpu_time_info time;
>>> +u32 addr = (u32)(&time);
>>> +u64 khz;
>>> +
>>> +/* check presence and figure msr number */
>>> +cpuid(KVM_CPUID_FEATURES, &eax, &ebx, &ecx, &edx);
>>> +if (eax & KVM_FEATURE_CLOCKSOURCE2) {
>>> +msr = MSR_KVM_SYSTEM_TIME_NEW;
>>> +} else if (eax & KVM_FEATURE_CLOCKSOURCE) {
>>> +msr = MSR_KVM_SYSTEM_TIME;
>>> +} else {
>>> +return 0;
>>> +}
>>> +
>>> +/* ask kvm hypervisor to fill struct */
>>> +memset(&time, 0, sizeof(time));
>>> +wrmsr(msr, addr | 1);
>> 
>> How can this work?
> 
> It did in my testing, although maybe by pure luck ...
> 
>> There is a 64-byte alignment requirement.
> 
> 64 bytes?  Sure?  The whole struct is only 32 bytes in size ...

er, the documentation says 4 bytes (so stack alignment works).  I
distinctly remember having a large alignment requirement so we don't
cross a page or slot boundary... something's wrong here.

> 
> Easily fixable though, just need to grab some memory with memalign
> instead of using the stack.

> 
>>> +wrmsr(msr, 0);
>>> +if (time.version < 2 || time.tsc_to_system_mul == 0)
>>> +return 0;
>>> +
>>> +/* go figure tsc frequency */
>>> +khz = pvclock_tsc_khz(&time);
>>> +dprintf(1, "Using kvmclock, msr 0x%x, tsc %d MHz\n",
>>> +msr, (u32)khz / 1000);
>>> +return khz;
>> 
>> That's a meaningless number.  You can be migrated to a cpu or a machine
>> with very different tsc.
> 
>> You want accurate time on kvm, don't use the tsc.
> 
> seabios uses the tsc for timeout calculations only, so it doesn't need
> to be 100% accurate.  The order of magnitude should be correct though.
> The Linux kernel uses the value for delay loops too, so using it for the
> given purpose can't be *that* horrible after all ...
> 
> It is certainly an improvement over the current code which tries to
> calibrate the tsc and gets totally broken results in case the busy host
> happens to schedule the guest in the middle of calibration.
> 
> So what do you suggest?  The options I see are:
> 
>   (1) Use this patch (with alignment issue fixed of course).
>   (2) Do a full kvmclock implementation.  Feels a bit like overkill.
>   (3) SeaBIOS can fallback to the PIT for timing on machines which
>   have no TSC.  We could do that too in case we detect kvm ...

What sort of timeouts are these?  If seconds, maybe the rtc would be best.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tsc: use kvmclock for calibration

2012-08-09 Thread Gerd Hoffmann
  Hi,

>> So what do you suggest?  The options I see are:
>>
>>   (1) Use this patch (with alignment issue fixed of course).
>>   (2) Do a full kvmclock implementation.  Feels a bit like overkill.
>>   (3) SeaBIOS can fallback to the PIT for timing on machines which
>>   have no TSC.  We could do that too in case we detect kvm ...
> 
> What sort of timeouts are these?  If seconds, maybe the rtc would be best.

All sorts of timeouts, from a few miliseconds to seconds.

The problematic ones are the longer timeouts, which wait for I/O stuff
like disk reads complete.  The stuff with smaller timeouts (like waiting
for AHCI link become ready) tend to finish instantly in kvm.

cheers,
  Gerd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tsc: use kvmclock for calibration

2012-08-09 Thread Avi Kivity
On 08/09/2012 05:12 PM, Gerd Hoffmann wrote:
>   Hi,
> 
>>> er, the documentation says 4 bytes (so stack alignment works).  I
>>> distinctly remember having a large alignment requirement so we don't
>>> cross a page or slot boundary... something's wrong here.
>> 
>>  case MSR_KVM_SYSTEM_TIME: {
> [ ... ]
> 
>> So your tests worked by pure luck, but the bug is in kvm.  We need to
>> grab two pages here.
> 
> Ok, so better use memalign(32,32) to make sure the struct doesn't cross
> a page border ...

No, we need to fix kvm, no need to complicate the guest for that.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tsc: use kvmclock for calibration

2012-08-09 Thread Gerd Hoffmann
  Hi,

>> er, the documentation says 4 bytes (so stack alignment works).  I
>> distinctly remember having a large alignment requirement so we don't
>> cross a page or slot boundary... something's wrong here.
> 
>   case MSR_KVM_SYSTEM_TIME: {
[ ... ]

> So your tests worked by pure luck, but the bug is in kvm.  We need to
> grab two pages here.

Ok, so better use memalign(32,32) to make sure the struct doesn't cross
a page border ...

cheers,
  Gerd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tsc: use kvmclock for calibration

2012-08-09 Thread Avi Kivity
On 08/09/2012 05:01 PM, Avi Kivity wrote:
> On 08/09/2012 04:57 PM, Gerd Hoffmann wrote:
>>   Hi,
>> 
 +u64 kvm_tsc_khz(void)
 +{
 +u32 eax, ebx, ecx, edx, msr;
 +struct pvclock_vcpu_time_info time;
 +u32 addr = (u32)(&time);
 +u64 khz;
 +
 +/* check presence and figure msr number */
 +cpuid(KVM_CPUID_FEATURES, &eax, &ebx, &ecx, &edx);
 +if (eax & KVM_FEATURE_CLOCKSOURCE2) {
 +msr = MSR_KVM_SYSTEM_TIME_NEW;
 +} else if (eax & KVM_FEATURE_CLOCKSOURCE) {
 +msr = MSR_KVM_SYSTEM_TIME;
 +} else {
 +return 0;
 +}
 +
 +/* ask kvm hypervisor to fill struct */
 +memset(&time, 0, sizeof(time));
 +wrmsr(msr, addr | 1);
>>> 
>>> How can this work?
>> 
>> It did in my testing, although maybe by pure luck ...
>> 
>>> There is a 64-byte alignment requirement.
>> 
>> 64 bytes?  Sure?  The whole struct is only 32 bytes in size ...
> 
> er, the documentation says 4 bytes (so stack alignment works).  I
> distinctly remember having a large alignment requirement so we don't
> cross a page or slot boundary... something's wrong here.

case MSR_KVM_SYSTEM_TIME: {
kvmclock_reset(vcpu);

vcpu->arch.time = data;
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

/* we verify if the enable bit is set... */
if (!(data & 1))
break;

/* ...but clean it before doing the actual write */
vcpu->arch.time_offset = data & ~(PAGE_MASK | 1);

vcpu->arch.time_page =
gfn_to_page(vcpu->kvm, data >> PAGE_SHIFT);

if (is_error_page(vcpu->arch.time_page))
vcpu->arch.time_page = NULL;

break;

So your tests worked by pure luck, but the bug is in kvm.  We need to
grab two pages here.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tsc: use kvmclock for calibration

2012-08-09 Thread Gerd Hoffmann
  Hi,

>> +u64 kvm_tsc_khz(void)
>> +{
>> +u32 eax, ebx, ecx, edx, msr;
>> +struct pvclock_vcpu_time_info time;
>> +u32 addr = (u32)(&time);
>> +u64 khz;
>> +
>> +/* check presence and figure msr number */
>> +cpuid(KVM_CPUID_FEATURES, &eax, &ebx, &ecx, &edx);
>> +if (eax & KVM_FEATURE_CLOCKSOURCE2) {
>> +msr = MSR_KVM_SYSTEM_TIME_NEW;
>> +} else if (eax & KVM_FEATURE_CLOCKSOURCE) {
>> +msr = MSR_KVM_SYSTEM_TIME;
>> +} else {
>> +return 0;
>> +}
>> +
>> +/* ask kvm hypervisor to fill struct */
>> +memset(&time, 0, sizeof(time));
>> +wrmsr(msr, addr | 1);
> 
> How can this work?

It did in my testing, although maybe by pure luck ...

> There is a 64-byte alignment requirement.

64 bytes?  Sure?  The whole struct is only 32 bytes in size ...

Easily fixable though, just need to grab some memory with memalign
instead of using the stack.

>> +wrmsr(msr, 0);
>> +if (time.version < 2 || time.tsc_to_system_mul == 0)
>> +return 0;
>> +
>> +/* go figure tsc frequency */
>> +khz = pvclock_tsc_khz(&time);
>> +dprintf(1, "Using kvmclock, msr 0x%x, tsc %d MHz\n",
>> +msr, (u32)khz / 1000);
>> +return khz;
> 
> That's a meaningless number.  You can be migrated to a cpu or a machine
> with very different tsc.

> You want accurate time on kvm, don't use the tsc.

seabios uses the tsc for timeout calculations only, so it doesn't need
to be 100% accurate.  The order of magnitude should be correct though.
The Linux kernel uses the value for delay loops too, so using it for the
given purpose can't be *that* horrible after all ...

It is certainly an improvement over the current code which tries to
calibrate the tsc and gets totally broken results in case the busy host
happens to schedule the guest in the middle of calibration.

So what do you suggest?  The options I see are:

  (1) Use this patch (with alignment issue fixed of course).
  (2) Do a full kvmclock implementation.  Feels a bit like overkill.
  (3) SeaBIOS can fallback to the PIT for timing on machines which
  have no TSC.  We could do that too in case we detect kvm ...

cheers,
  Gerd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [SeaBIOS] [PATCH] tsc: use kvmclock for calibration

2012-08-09 Thread Fred .
It should be kHz not khz.

-msr, (u32)khz / 1000);
+msr, (u32)kHz / 1000);

On Thu, Aug 9, 2012 at 2:53 PM, Avi Kivity  wrote:
> On 08/09/2012 02:57 PM, Gerd Hoffmann wrote:
>> Use kvmclock for tsc calibration when running on kvm.  Without this the
>> tsc frequency calibrated by seabios can be *way* off in case the virtual
>> machine is booted on a loaded host.  I've seen seabios calibrating 27
>> instead of ca. 2800 MHz, resulting in timeouts being to short by factor
>> 100.  Which in turn leads to disk I/O errors due to timeouts, especially
>> as I/O requests tend to take a bit longer than usual on a loaded box ...
>
>> +
>> +struct pvclock_vcpu_time_info {
>> + u32   version;
>> + u32   pad0;
>> + u64   tsc_timestamp;
>> + u64   system_time;
>> + u32   tsc_to_system_mul;
>> + s8tsc_shift;
>> + u8flags;
>> + u8pad[2];
>> +} PACKED;
>> +
>> +
>> +u64 kvm_tsc_khz(void)
>> +{
>> +u32 eax, ebx, ecx, edx, msr;
>> +struct pvclock_vcpu_time_info time;
>> +u32 addr = (u32)(&time);
>> +u64 khz;
>> +
>> +/* check presence and figure msr number */
>> +cpuid(KVM_CPUID_FEATURES, &eax, &ebx, &ecx, &edx);
>> +if (eax & KVM_FEATURE_CLOCKSOURCE2) {
>> +msr = MSR_KVM_SYSTEM_TIME_NEW;
>> +} else if (eax & KVM_FEATURE_CLOCKSOURCE) {
>> +msr = MSR_KVM_SYSTEM_TIME;
>> +} else {
>> +return 0;
>> +}
>> +
>> +/* ask kvm hypervisor to fill struct */
>> +memset(&time, 0, sizeof(time));
>> +wrmsr(msr, addr | 1);
>
> How can this work?  There is a 64-byte alignment requirement.
>
>> +wrmsr(msr, 0);
>> +if (time.version < 2 || time.tsc_to_system_mul == 0)
>> +return 0;
>> +
>> +/* go figure tsc frequency */
>> +khz = pvclock_tsc_khz(&time);
>> +dprintf(1, "Using kvmclock, msr 0x%x, tsc %d MHz\n",
>> +msr, (u32)khz / 1000);
>> +return khz;
>
> That's a meaningless number.  You can be migrated to a cpu or a machine
> with very different tsc.
>
> You want accurate time on kvm, don't use the tsc.
>
>
> --
> error compiling committee.c: too many arguments to function
>
> ___
> SeaBIOS mailing list
> seab...@seabios.org
> http://www.seabios.org/mailman/listinfo/seabios
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/7] split out uses of kvm_irqchip_in_kernel()

2012-08-09 Thread Avi Kivity
On 08/07/2012 06:11 PM, Peter Maydell wrote:
> On 2 August 2012 10:14, Jan Kiszka  wrote:
>> On 2012-07-26 16:35, Peter Maydell wrote:
>>> This patch series removes all uses of kvm_irqchip_in_kernel()
>>> from architecture-independent code, by creating a set of more
>>> specific functions instead to test for the particular aspects
>>> of behaviour that the calling code is actually interested in.
>>>
>>> The uses in x86-specific code could in theory be further broken
>>> down into kvm_ioapic(), kvm_pit(), etc, but I leave that for
>>> one of the x86 maintainers if they think it's worthwhile.
>>
>> For the whole series:
>>
>> Acked-by: Jan Kiszka 
> 
> Just a ping to check this will get into qemu before the
> hardfreeze next week...

Thanks for the ping; all applied.  Will push after autotesting.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] handle device help before accelerator set up

2012-08-09 Thread Avi Kivity
On 08/08/2012 09:40 PM, Bruce Rogers wrote:
> A command line device probe using just -device "?" gets processed
> after qemu-kvm initializes the accelerator. If /dev/kvm is not
> present, the accelerator check will fail (kvm is defaulted to on),
> which causes libvirt to not be set up to handle qemu guests.
> 
> Moving the device help handling before the accelerator set up allows
> the device probe to work in this configuration and libvirt succeeds
> in setting up for a qemu hypervisor mode.

Please post this to the qemu mailing list.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tsc: use kvmclock for calibration

2012-08-09 Thread Avi Kivity
On 08/09/2012 02:57 PM, Gerd Hoffmann wrote:
> Use kvmclock for tsc calibration when running on kvm.  Without this the
> tsc frequency calibrated by seabios can be *way* off in case the virtual
> machine is booted on a loaded host.  I've seen seabios calibrating 27
> instead of ca. 2800 MHz, resulting in timeouts being to short by factor
> 100.  Which in turn leads to disk I/O errors due to timeouts, especially
> as I/O requests tend to take a bit longer than usual on a loaded box ...

> +
> +struct pvclock_vcpu_time_info {
> + u32   version;
> + u32   pad0;
> + u64   tsc_timestamp;
> + u64   system_time;
> + u32   tsc_to_system_mul;
> + s8tsc_shift;
> + u8flags;
> + u8pad[2];
> +} PACKED;
> +
> +
> +u64 kvm_tsc_khz(void)
> +{
> +u32 eax, ebx, ecx, edx, msr;
> +struct pvclock_vcpu_time_info time;
> +u32 addr = (u32)(&time);
> +u64 khz;
> +
> +/* check presence and figure msr number */
> +cpuid(KVM_CPUID_FEATURES, &eax, &ebx, &ecx, &edx);
> +if (eax & KVM_FEATURE_CLOCKSOURCE2) {
> +msr = MSR_KVM_SYSTEM_TIME_NEW;
> +} else if (eax & KVM_FEATURE_CLOCKSOURCE) {
> +msr = MSR_KVM_SYSTEM_TIME;
> +} else {
> +return 0;
> +}
> +
> +/* ask kvm hypervisor to fill struct */
> +memset(&time, 0, sizeof(time));
> +wrmsr(msr, addr | 1);

How can this work?  There is a 64-byte alignment requirement.

> +wrmsr(msr, 0);
> +if (time.version < 2 || time.tsc_to_system_mul == 0)
> +return 0;
> +
> +/* go figure tsc frequency */
> +khz = pvclock_tsc_khz(&time);
> +dprintf(1, "Using kvmclock, msr 0x%x, tsc %d MHz\n",
> +msr, (u32)khz / 1000);
> +return khz;

That's a meaningless number.  You can be migrated to a cpu or a machine
with very different tsc.

You want accurate time on kvm, don't use the tsc.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] s390: Add new channel I/O based virtio transport.

2012-08-09 Thread Stefan Hajnoczi
On Thu, Aug 9, 2012 at 1:12 PM, Cornelia Huck  wrote:
> On Thu, 9 Aug 2012 12:34:04 +0100
> Stefan Hajnoczi  wrote:
>
>> On Tue, Aug 7, 2012 at 3:52 PM, Cornelia Huck  
>> wrote:
>> > Add a new virtio transport that uses channel commands to perform
>> > virtio operations.
>> >
>> > Add a new machine type s390-ccw that uses this virtio-ccw transport
>> > and make it the default machine for s390.
>> >
>> > Signed-off-by: Cornelia Huck 
>> > ---
>> >  hw/qdev-monitor.c  |   5 +
>> >  hw/s390-virtio.c   | 268 ++
>> >  hw/s390x/Makefile.objs |   1 +
>> >  hw/s390x/virtio-ccw.c  | 962 
>> > +
>> >  hw/s390x/virtio-ccw.h  |  77 
>> >  vl.c   |   1 +
>> >  6 files changed, 1243 insertions(+), 71 deletions(-)
>> >  create mode 100644 hw/s390x/virtio-ccw.c
>> >  create mode 100644 hw/s390x/virtio-ccw.h
>>
>> Is the virtqueue still using vring and assuming the hypervisor reaches
>> into guest memory?
>
> The virtqueues are guest-allocated and their location is transmitted
> via a control-type ccw to the host, which can then use it until
> notified otherwise.
>
>>
>> Can existing ccw device types access memory directly (for some reason
>> I assumed ccw always copies or send messages)?
>
> Not sure if I understand your question correctly, but read or write
> type ccws specify a memory area where the hardware/hypervisor may write
> to or read from. These accesses happen while the channel program is
> running (any time between the ssch/rsch and ending status present at
> the subchannel). The "specify an area that can be used by hardware and
> os" approach exists as well; the closed thing to the virtio-ccw
> approach is probably qdio.

Okay, thanks!

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] s390: Add new channel I/O based virtio transport.

2012-08-09 Thread Cornelia Huck
On Thu, 9 Aug 2012 12:34:04 +0100
Stefan Hajnoczi  wrote:

> On Tue, Aug 7, 2012 at 3:52 PM, Cornelia Huck  
> wrote:
> > Add a new virtio transport that uses channel commands to perform
> > virtio operations.
> >
> > Add a new machine type s390-ccw that uses this virtio-ccw transport
> > and make it the default machine for s390.
> >
> > Signed-off-by: Cornelia Huck 
> > ---
> >  hw/qdev-monitor.c  |   5 +
> >  hw/s390-virtio.c   | 268 ++
> >  hw/s390x/Makefile.objs |   1 +
> >  hw/s390x/virtio-ccw.c  | 962 
> > +
> >  hw/s390x/virtio-ccw.h  |  77 
> >  vl.c   |   1 +
> >  6 files changed, 1243 insertions(+), 71 deletions(-)
> >  create mode 100644 hw/s390x/virtio-ccw.c
> >  create mode 100644 hw/s390x/virtio-ccw.h
> 
> Is the virtqueue still using vring and assuming the hypervisor reaches
> into guest memory?

The virtqueues are guest-allocated and their location is transmitted
via a control-type ccw to the host, which can then use it until
notified otherwise.

> 
> Can existing ccw device types access memory directly (for some reason
> I assumed ccw always copies or send messages)?

Not sure if I understand your question correctly, but read or write
type ccws specify a memory area where the hardware/hypervisor may write
to or read from. These accesses happen while the channel program is
running (any time between the ssch/rsch and ending status present at
the subchannel). The "specify an area that can be used by hardware and
os" approach exists as well; the closed thing to the virtio-ccw
approach is probably qdio.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] tsc: use kvmclock for calibration

2012-08-09 Thread Gerd Hoffmann
Use kvmclock for tsc calibration when running on kvm.  Without this the
tsc frequency calibrated by seabios can be *way* off in case the virtual
machine is booted on a loaded host.  I've seen seabios calibrating 27
instead of ca. 2800 MHz, resulting in timeouts being to short by factor
100.  Which in turn leads to disk I/O errors due to timeouts, especially
as I/O requests tend to take a bit longer than usual on a loaded box ...

Signed-off-by: Gerd Hoffmann 
---
 src/clock.c|9 +
 src/paravirt.c |   90 
 src/paravirt.h |1 +
 3 files changed, 100 insertions(+), 0 deletions(-)

diff --git a/src/clock.c b/src/clock.c
index 69e9f17..5883b1a 100644
--- a/src/clock.c
+++ b/src/clock.c
@@ -13,6 +13,7 @@
 #include "bregs.h" // struct bregs
 #include "biosvar.h" // GET_GLOBAL
 #include "usb-hid.h" // usb_check_event
+#include "paravirt.h" // kvm clock
 
 // RTC register flags
 #define RTC_A_UIP 0x80
@@ -80,6 +81,14 @@ calibrate_tsc(void)
 return;
 }
 
+if (kvm_para_available()) {
+u32 khz = kvm_tsc_khz();
+if (khz != 0) {
+SET_GLOBAL(cpu_khz, khz);
+return;
+}
+}
+
 // Setup "timer2"
 u8 orig = inb(PORT_PS2_CTRLB);
 outb((orig & ~PPCB_SPKR) | PPCB_T2GATE, PORT_PS2_CTRLB);
diff --git a/src/paravirt.c b/src/paravirt.c
index 2a98d53..942ce11 100644
--- a/src/paravirt.c
+++ b/src/paravirt.c
@@ -12,6 +12,7 @@
 #include "ioport.h" // outw
 #include "paravirt.h" // qemu_cfg_port_probe
 #include "smbios.h" // struct smbios_structure_header
+#include "biosvar.h" // GET_GLOBAL
 
 int qemu_cfg_present;
 
@@ -346,3 +347,92 @@ void qemu_cfg_romfile_setup(void)
 dprintf(3, "Found fw_cfg file: %s (size=%d)\n", file->name, 
file->size);
 }
 }
+
+#define KVM_CPUID_SIGNATURE   0x4000
+#define KVM_CPUID_FEATURES0x4001
+#define KVM_FEATURE_CLOCKSOURCE0
+#define KVM_FEATURE_CLOCKSOURCE2   3
+#define MSR_KVM_SYSTEM_TIME 0x12
+#define MSR_KVM_SYSTEM_TIME_NEW   0x4b564d01
+
+struct pvclock_vcpu_time_info {
+   u32   version;
+   u32   pad0;
+   u64   tsc_timestamp;
+   u64   system_time;
+   u32   tsc_to_system_mul;
+   s8tsc_shift;
+   u8flags;
+   u8pad[2];
+} PACKED;
+
+/*
+ * do_div() is NOT a C function. It wants to return
+ * two values (the quotient and the remainder), but
+ * since that doesn't work very well in C, what it
+ * does is:
+ *
+ * - modifies the 64-bit dividend _in_place_
+ * - returns the 32-bit remainder
+ *
+ * This ends up being the most efficient "calling
+ * convention" on x86.
+ */
+#define do_div(n, base) \
+({  \
+unsigned long __upper, __low, __high, __mod, __base;\
+__base = (base);\
+asm("" : "=a" (__low), "=d" (__high) : "A" (n));\
+__upper = __high;   \
+if (__high) {   \
+__upper = __high % (__base);\
+__high = __high / (__base); \
+}   \
+asm("divl %2" : "=a" (__low), "=d" (__mod)  \
+: "rm" (__base), "0" (__low), "1" (__upper));   \
+asm("" : "=A" (n) : "a" (__low), "d" (__high)); \
+__mod;  \
+})
+
+static u64 pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
+{
+u64 pv_tsc_khz = 100ULL << 32;
+
+do_div(pv_tsc_khz, src->tsc_to_system_mul);
+if (src->tsc_shift < 0)
+pv_tsc_khz <<= -src->tsc_shift;
+else
+pv_tsc_khz >>= src->tsc_shift;
+return pv_tsc_khz;
+}
+
+u64 kvm_tsc_khz(void)
+{
+u32 eax, ebx, ecx, edx, msr;
+struct pvclock_vcpu_time_info time;
+u32 addr = (u32)(&time);
+u64 khz;
+
+/* check presence and figure msr number */
+cpuid(KVM_CPUID_FEATURES, &eax, &ebx, &ecx, &edx);
+if (eax & KVM_FEATURE_CLOCKSOURCE2) {
+msr = MSR_KVM_SYSTEM_TIME_NEW;
+} else if (eax & KVM_FEATURE_CLOCKSOURCE) {
+msr = MSR_KVM_SYSTEM_TIME;
+} else {
+return 0;
+}
+
+/* ask kvm hypervisor to fill struct */
+memset(&time, 0, sizeof(time));
+wrmsr(msr, addr | 1);
+wrmsr(msr, 0);
+if (time.version < 2 || time.tsc_to_system_mul == 0)
+return 0;
+
+/* go figure tsc frequency */
+khz = pvclock_tsc_khz(&time);
+dprintf(1, "Using kvmclock, msr 0x%x, tsc %d MHz\n",
+msr, (u32)khz / 1000);
+return khz;
+}
diff --git a/src/paravirt.h b/src/paravirt.h
index a284c41..eedfcc3 100644
--- a/src/paravirt.h
+++

Re: [PATCH 3/5] s390: Add new channel I/O based virtio transport.

2012-08-09 Thread Stefan Hajnoczi
On Tue, Aug 7, 2012 at 3:52 PM, Cornelia Huck  wrote:
> Add a new virtio transport that uses channel commands to perform
> virtio operations.
>
> Add a new machine type s390-ccw that uses this virtio-ccw transport
> and make it the default machine for s390.
>
> Signed-off-by: Cornelia Huck 
> ---
>  hw/qdev-monitor.c  |   5 +
>  hw/s390-virtio.c   | 268 ++
>  hw/s390x/Makefile.objs |   1 +
>  hw/s390x/virtio-ccw.c  | 962 
> +
>  hw/s390x/virtio-ccw.h  |  77 
>  vl.c   |   1 +
>  6 files changed, 1243 insertions(+), 71 deletions(-)
>  create mode 100644 hw/s390x/virtio-ccw.c
>  create mode 100644 hw/s390x/virtio-ccw.h

Is the virtqueue still using vring and assuming the hypervisor reaches
into guest memory?

Can existing ccw device types access memory directly (for some reason
I assumed ccw always copies or send messages)?

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] s390/kvm: Handle hosts not supporting s390-virtio.

2012-08-09 Thread Cornelia Huck
On Thu, 09 Aug 2012 13:03:57 +0300
Avi Kivity  wrote:

> On 08/07/2012 05:52 PM, Cornelia Huck wrote:
> > Running under a kvm host does not necessarily imply the presence of
> > a page mapped above the main memory with the virtio information;
> > however, the code includes a hard coded access to that page.
> > 
> > Instead, check for the presence of the page and exit gracefully
> > before we hit an addressing exception if it does not exist.
> > 
> >  /*
> >   * Init function for virtio
> >   * devices are in a single page above top of "normal" mem
> > @@ -443,6 +458,12 @@ static int __init kvm_devices_init(void)
> > }
> >  
> > kvm_devices = (void *) real_memory_size;
> > +   if (test_devices_support() < 0) {
> > +   vmem_remove_mapping(real_memory_size, PAGE_SIZE);
> > +   root_device_unregister(kvm_root);
> > +   /* No error. */
> > +   return 0;
> > +   }
> >  
> 
> Cleaner to defer root_device_register() until after the mapping has been
> verified.

OK, will reorder.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/8] KVM: Add hva_to_memslot

2012-08-09 Thread Avi Kivity
On 08/09/2012 01:34 PM, Takuya Yoshikawa wrote:
> On Tue,  7 Aug 2012 12:57:13 +0200
> Alexander Graf  wrote:
> 
>> +struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm, hva_t hva)
>> +{
>> +struct kvm_memslots *slots = kvm_memslots(kvm);
>> +struct kvm_memory_slot *memslot;
>> +
>> +kvm_for_each_memslot(memslot, slots)
>> +if (hva >= memslot->userspace_addr &&
>> +  hva < memslot->userspace_addr + memslot->npages)
>> +return memslot;
>> +
>> +return NULL;
>> +}
> 
> Can't we have two memory slots which contain that hva?
> I thought that's why hva handler had to check all slots.

We can and do.  Good catch.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/8] KVM: Add hva_to_memslot

2012-08-09 Thread Takuya Yoshikawa
On Tue,  7 Aug 2012 12:57:13 +0200
Alexander Graf  wrote:

> +struct kvm_memory_slot *hva_to_memslot(struct kvm *kvm, hva_t hva)
> +{
> + struct kvm_memslots *slots = kvm_memslots(kvm);
> + struct kvm_memory_slot *memslot;
> +
> + kvm_for_each_memslot(memslot, slots)
> + if (hva >= memslot->userspace_addr &&
> +   hva < memslot->userspace_addr + memslot->npages)
> + return memslot;
> +
> + return NULL;
> +}

Can't we have two memory slots which contain that hva?
I thought that's why hva handler had to check all slots.

Thanks,
Takuya
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/4] s390/kvm: Handle hosts not supporting s390-virtio.

2012-08-09 Thread Avi Kivity
On 08/07/2012 05:52 PM, Cornelia Huck wrote:
> Running under a kvm host does not necessarily imply the presence of
> a page mapped above the main memory with the virtio information;
> however, the code includes a hard coded access to that page.
> 
> Instead, check for the presence of the page and exit gracefully
> before we hit an addressing exception if it does not exist.
> 
>  /*
>   * Init function for virtio
>   * devices are in a single page above top of "normal" mem
> @@ -443,6 +458,12 @@ static int __init kvm_devices_init(void)
>   }
>  
>   kvm_devices = (void *) real_memory_size;
> + if (test_devices_support() < 0) {
> + vmem_remove_mapping(real_memory_size, PAGE_SIZE);
> + root_device_unregister(kvm_root);
> + /* No error. */
> + return 0;
> + }
>  

Cleaner to defer root_device_register() until after the mapping has been
verified.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: correctly detect APIC SW state in kvm_apic_post_state_restore()

2012-08-09 Thread Avi Kivity
On 08/08/2012 03:24 PM, Gleb Natapov wrote:
> For apic_set_spiv() to track APIC SW state correctly it needs to see
> previous and next values of the spurious vector register, but currently
> memset() overwrite the old value before apic_set_spiv() get a chance to
> do tracking. Fix it by calling apic_set_spiv() before overwriting old
> value.
> 

Applied, thanks.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hang on reboot with 3.6-rc1

2012-08-09 Thread Avi Kivity
On 08/08/2012 07:27 PM, David Ahern wrote:
> Not sure if KVM is the culprit, but it is the last line shown on the
> console. I have to power cycle the server to reboot.

Have you tried rmmoding the kvm modules before reboot?

Were any guests running during this?


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Nested kvm_intel broken on pre 3.3 hosts

2012-08-09 Thread Avi Kivity
On 08/09/2012 10:13 AM, Stefan Bader wrote:

> Avi, was the last version of the patch (only adding the flag to the nested 
> MSRs)
> good for submitting to stable from your point of view?
> 

Yes, it is correct.  I forwarded it to stable, thanks.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] perf/x86: disable PEBS on a guest entry.

2012-08-09 Thread Gleb Natapov
If PMU counter has PEBS enabled it is not enough to disable counter
on a guest entry since PEBS memory write can overshoot guest entry
and corrupt guest memory. Disabling PEBS during guest entry solves
the problem.

Tested-by: David Ahern 
Signed-off-by: Gleb Natapov 
diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
b/arch/x86/kernel/cpu/perf_event_intel.c
index 7a8b9d0..7268fd8 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -1444,8 +1444,16 @@ static struct perf_guest_switch_msr 
*intel_guest_get_msrs(int *nr)
arr[0].msr = MSR_CORE_PERF_GLOBAL_CTRL;
arr[0].host = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask;
arr[0].guest = x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_host_mask;
+   /*
+* If PMU counter has PEBS enabled it is not enough to disable counter
+* on a guest entry since PEBS memory write can overshoot guest entry
+* and corrupt guest memory. Disabling PEBS solves the problem.
+*/
+   arr[1].msr = MSR_IA32_PEBS_ENABLE;
+   arr[1].host = cpuc->pebs_enabled;
+   arr[1].guest = 0;
 
-   *nr = 1;
+   *nr = 2;
return arr;
 }
 
--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 06/15] memory: use refcnt to manage MemoryRegion

2012-08-09 Thread Avi Kivity
On 08/09/2012 10:27 AM, liu ping fan wrote:
> On Wed, Aug 8, 2012 at 5:20 PM, Avi Kivity  wrote:
>> On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
>>> From: Liu Ping Fan 
>>>
>>> Using refcnt for mr, so we can separate mr's life cycle management
>>> from refered object.
>>>   When mr->ref 0->1, inc the refered object.
>>>   When mr->ref 1->0, dec the refered object.
>>>
>>> The refered object can be DeviceStae, another mr, or other opaque.
>>
>> Please explain the motivation more fully.
>>
> Actually, the aim is to mange the reference of an object, used by mem view.
> DeviceState can be referred by different system, when it comes to the
> view of subsystem, we hold dev's ref. And any indirect reference will
> just mr->ref++, not dev's.
> This can help us avoid the down-walk through the referred chain, like
> alias> mr ---> DeviceState.

That is a lot of complexity, for no gain.  Manipulating memory regions
is a slow path, and can be done under the bit qemu lock without any
complications.

> 
> In the previous discussion, you have suggest add dev->ref++ in
> core_region_add.  But I think, if we can move it to higher layer --
> memory_region_{add,del}_subregion, so we can avoid to duplicate do
> this in other xx_region_add.

Why would other memory listeners be impacted?  They all operate under
the big qemu lock.  If they start using devices outside the lock, then
they need to take a reference.

> As a payment for this, we need to handle alias which can be avoid at
> core_region_add().  And mr's ref can help to avoid
>  the down-walk.

The payment is two systems of reference counts.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 11/15] lock: introduce global lock for device tree

2012-08-09 Thread Avi Kivity
On 08/09/2012 10:27 AM, liu ping fan wrote:
> On Wed, Aug 8, 2012 at 5:42 PM, Avi Kivity  wrote:
>> On 08/08/2012 09:25 AM, Liu Ping Fan wrote:
>>> From: Liu Ping Fan 
>>>
>>
>> Please explain the motivation.  AFAICT, the big qemu lock is sufficient.
>>
> Oh, this is one of the series locks for the removal of big qemu lock.

Why do you want to remove the big qemu lock?

Even now it is not heavily contended.  We should focus on fixing the
cases where is it contended, instead of removing it completely, which is
sure to make further development harder and is likely to introduce
locking bugs.

> The degradation of big lock will take several steps, including to
> introduce device's private lock. Till then, when the device add path
> from iothread and the remove path in io-dispatch is out of the big
> qemu lock.  We need this extra lock.
> 
> These series is too big, so I send out the 1st phase for review.

Even the first phase is too big.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/15] memory: MemoryRegion topology must be stable when updating

2012-08-09 Thread Avi Kivity
On 08/09/2012 10:28 AM, liu ping fan wrote:
>>
>> Seems to me that nothing in memory.c can susceptible to races.  It must
>> already be called under the big qemu lock, and with the exception of
>> mutators (memory_region_set_*), changes aren't directly visible.
>>
> Yes, what I want to do is "prepare unplug out of protection of global
> lock".  When io-dispatch and mmio-dispatch are all out of big lock, we
> will run into the following scene:
> In vcpu context A, qdev_unplug_complete()-> delete subregion;
> In context B, write pci bar --> pci mapping update-> add subregion

Why do you want unlocked unplug?  Unplug is rare and complicated; there
are no performance considerations on one hand, and difficulty of testing
for lock correctness on the other.  I think it is better if it remains
protected by the global lock.

> 
>> I think it's sufficient to take the mem_map_lock at the beginning of
>> core_begin() and drop it at the end of core_commit().  That means all
>> updates of volatile state, phys_map, are protected.
>>
> The mem_map_lock is to protect both address_space_io and
> address_space_memory. When without the protection of big lock,
> competing will raise among the updaters
> (memory_region_{add,del}_subregion and the readers
> generate_memory_topology()->render_memory_region().

These should all run under the big qemu lock, for the same reasons.
They are rare and not performance sensitive.  Only phys_map reads are
performance sensitive.

> 
> If just in core_begin/commit, we will duplicate it for
> xx_begin/commit, right?  

No.  Other listeners will be protected by the global lock.

> And at the same time, mr->subregions is
> exposed under SMP without big lock.
> 

Who accesses it?

IMO locking should look like:

  phys_map: mem_map_lock
  dispatch callbacks: device specific lock (or big qemu lock for
unconverted devices)
  everything else: big qemu lock



-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/15] qom: introduce reclaimer to release obj

2012-08-09 Thread Avi Kivity
On 08/09/2012 10:49 AM, Paolo Bonzini wrote:
> Il 09/08/2012 09:33, liu ping fan ha scritto:
>> Yes, it is to defer destructors.
>> See 0009-memory-prepare-flatview-and-radix-tree-for-rcu-style.patch
>> When MemoryRegion is _del_subregion from mem in updater, it may be
>> still in use by reader -- radix or flatview, so defer its destructors
>> to the reclaimer --phys_map_release(PhysMap *map)
> 
> How are you sure that the reader is already out of its critical section
> by the time the reclaimer runs?
> 
>> If we have rcu, it could be elegant to do this.
> 
> Yeah, I think inventing primitives is dangerous and difficult to review;
> and it may be difficult to replace it with proper call_rcu.
> 
> You should probably make a proof-of-concept using liburcu.  Then we can
> decide how to implement RCU in a way that is portable enough for QEMU's
> needs.

IMO we should start with a simple mutex (which will cover only the
lookup and map rebuild).  This should reduce the contention to basically
nothing (still leaving a cache line bounce).  If a profile shows the
cache line bounce hurting us, or perhaps contention in ultralarge
guests, then we should switch to rcu.


-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 13/15] hotplug: introduce qdev_unplug_complete() to remove device from views

2012-08-09 Thread Paolo Bonzini
Il 09/08/2012 09:28, liu ping fan ha scritto:
>> > VCPU threadI/O thread
>> > =
>> > get MMIO request
>> > rcu_read_lock()
>> > walk memory map
>> >qdev_unmap()
>> >lock_devtree()
>> >...
>> >unlock_devtree
>> >unref dev -> refcnt=0, free enqueued
>> > ref()
> No ref() for dev here, while we have ref to flatview+radix in my patches.
> I use rcu to protect radix+flatview+mr refered. As to dev, its ref has
> inc when it is added into mem view -- that is
> memory_region_add_subregion -> memory_region_get() {
> if(atomic_add_and_return()) dev->ref++  }.
> So not until reclaimer of mem view, the dev's ref is hold by mem view.
> In a short word, rcu protect mem view, while device is protected by refcnt.

But the RCU critical section should not include the whole processing of
MMIO, only the walk of the memory map.

And in general I think this is a bit too tricky... I understand not
adding refcounting to all of bottom halves, timers, etc., but if you are
using a device you should have explicit ref/unref pairs.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] KVM call agenda for tuesday 31

2012-08-09 Thread Igor Mitsyanko

On 08/08/2012 08:25 PM, Andreas Färber wrote:

Am 31.01.2012 15:01, schrieb Mitsyanko Igor:

On 01/31/2012 05:15 PM, Andreas Färber wrote:

Am 31.01.2012 00:53, schrieb Anthony Liguori:

On 01/30/2012 05:41 PM, Andreas Färber wrote:

Am 30.01.2012 19:55, schrieb Juan Quintela:

Please send in any agenda items you are interested in covering.

VMState:
Anthony specifically said that VMState were not affected by QOM and
that
patches should not be deferred until the merge. Yet there's no review
and/or decision-making for a month now. Ping^2 for AHCI+SDHC.

Do you have pointers (to pending VMState patches)?

http://patchwork.ozlabs.org/patch/137732/ (PATCH v4)

It's basically about how to deal with variable-sized arrays. (Alex
mentioned it on one call around November.) I found ways to deal with
subsets of arrays embedded within the struct and variable-sized list of
pointers to structs but no solution for a malloc()'ed array of structs.
Maybe I'm just too stupid to see. Anyway, no one commented since Xmas.

Igor posted (and refined for v2) a patch with a callback-based approach
that I find promising. From my view, unofficially Juan is the VMState
guy, he's been cc'ed. Are we lacking an official maintainer that cares?
Or is Juan the official, undocumented maintainer but simply busy?

SUSE's interest is making AHCI migratable, and my VMState workaround for
that is simply ugly:

http://patchwork.ozlabs.org/patch/133066/ (RFC)


If I'm not mistaken, if you change AHCIState's ".ports" type to uint32_t
you can use existing VMSTATE_BUFFER_MULTIPLY macro like this:

VMSTATE_BUFFER_MULTIPLY(dev, AHCIState, 0, NULL, 0, ports,
sizeof(AHCIDevice))

Igor, I finally got around to rebasing and trying this: Am I seeing
correctly that this tries to serialize the whole of AHCIDevice as an
opaque buffer? The difficulty here was that we were looking for a way to
serialize a variable number of structured elements with their own
&vmstate_ahci_device specifying what fields to serialize.

Juan, how should we proceed there? Do as Igor suggested? Some VMSTATE_
macro I'm overlooking? Or do we need some new macro for this use case?

Regards,
Andreas


Hi Andreas, that was a bad suggestion, sorry for that, we can't treat 
structures as opaque buffers because that wouldn't work when migrating 
between hosts with different alignment/endianess.
Perhaps you can use VMSTATE_STRUCT_VARRAY_UINT32 for this, if I'm not 
again mistaken.



VMSTATE_BUFFER_MULTIPLY currently lacks VMS_POINTER flag and therefore
doesn't make any use of _start field (you don't need it anyway)

Nevertheless, VMSTATE_BUFFER_MULTIPLY is just a partial solution to a
bigger set of possible problems. SD card's vmstate implementation
requires shift operation, SDHC gets size from switch {} statement,
something else later may require division or addition e.t.c.,
get_bufsize callback will cover all possible cases.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/15] qom: introduce reclaimer to release obj

2012-08-09 Thread Paolo Bonzini
Il 09/08/2012 09:33, liu ping fan ha scritto:
> Yes, it is to defer destructors.
> See 0009-memory-prepare-flatview-and-radix-tree-for-rcu-style.patch
> When MemoryRegion is _del_subregion from mem in updater, it may be
> still in use by reader -- radix or flatview, so defer its destructors
> to the reclaimer --phys_map_release(PhysMap *map)

How are you sure that the reader is already out of its critical section
by the time the reclaimer runs?

> If we have rcu, it could be elegant to do this.

Yeah, I think inventing primitives is dangerous and difficult to review;
and it may be difficult to replace it with proper call_rcu.

You should probably make a proof-of-concept using liburcu.  Then we can
decide how to implement RCU in a way that is portable enough for QEMU's
needs.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 11/15] lock: introduce global lock for device tree

2012-08-09 Thread Paolo Bonzini
Il 09/08/2012 09:28, liu ping fan ha scritto:
> On Wed, Aug 8, 2012 at 5:41 PM, Paolo Bonzini  wrote:
>> Il 08/08/2012 08:25, Liu Ping Fan ha scritto:
>>> From: Liu Ping Fan 
>>>
>>> Signed-off-by: Liu Ping Fan 
>>> ---
>>>  cpus.c  |   12 
>>>  main-loop.h |3 +++
>>>  2 files changed, 15 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/cpus.c b/cpus.c
>>> index b182b3d..a734b36 100644
>>> --- a/cpus.c
>>> +++ b/cpus.c
>>> @@ -611,6 +611,7 @@ static void qemu_tcg_init_cpu_signals(void)
>>>  }
>>>  #endif /* _WIN32 */
>>>
>>> +QemuMutex qemu_device_tree_mutex;
>>>  QemuMutex qemu_global_mutex;
>>>  static QemuCond qemu_io_proceeded_cond;
>>>  static bool iothread_requesting_mutex;
>>> @@ -634,6 +635,7 @@ void qemu_init_cpu_loop(void)
>>>  qemu_cond_init(&qemu_work_cond);
>>>  qemu_cond_init(&qemu_io_proceeded_cond);
>>>  qemu_mutex_init(&qemu_global_mutex);
>>> +qemu_mutex_init(&qemu_device_tree_mutex);
>>>
>>>  qemu_thread_get_self(&io_thread);
>>>  }
>>> @@ -911,6 +913,16 @@ void qemu_mutex_unlock_iothread(void)
>>>  qemu_mutex_unlock(&qemu_global_mutex);
>>>  }
>>>
>>> +void qemu_lock_devtree(void)
>>> +{
>>> +qemu_mutex_lock(&qemu_device_tree_mutex);
>>> +}
>>> +
>>> +void qemu_unlock_devtree(void)
>>> +{
>>> +qemu_mutex_unlock(&qemu_device_tree_mutex);
>>> +}
>>
>> We don't need the wrappers.  They are there for the big lock just
>> because TCG needs extra work for iothread_requesting_mutex.
>>
> Sorry, could you give more detail about TCG, what is extra work.

void qemu_mutex_lock_iothread(void)
{
if (!tcg_enabled()) {
qemu_mutex_lock(&qemu_global_mutex);
} else {
iothread_requesting_mutex = true;
if (qemu_mutex_trylock(&qemu_global_mutex)) {
qemu_cpu_kick_thread(first_cpu);
qemu_mutex_lock(&qemu_global_mutex);
}
iothread_requesting_mutex = false;
qemu_cond_broadcast(&qemu_io_proceeded_cond);
}
}

You do not need any of the code in the "else" branch for the device tree
mutex, so you do not need wrappers.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 15/15] e1000: using new interface--unmap to unplug

2012-08-09 Thread Paolo Bonzini
Il 09/08/2012 09:28, liu ping fan ha scritto:
>>> >> +static void
>>> >> +pci_e1000_unmap(PCIDevice *p)
>>> >> +{
>>> >> +/* DO NOT FREE anything!until refcnt=0 */
>>> >> +/* isolate from memory view */
>>> >> +}
>> >
>> > At least you need to call the superclass method.
>> >
> Refer to  0013-hotplug-introduce-qdev_unplug_complete-to-remove-dev.patch,
> we have the following sequence
> qdev_unmap->pci_unmap_device->pci_e1000_unmap.  So pci_e1000_unmap
> need not to do anything.

But then this patch is unnecessary, isn't it?

Paolo

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/15] qom: introduce reclaimer to release obj

2012-08-09 Thread liu ping fan
On Wed, Aug 8, 2012 at 5:35 PM, Paolo Bonzini  wrote:
> Il 08/08/2012 08:25, Liu Ping Fan ha scritto:
>> From: Liu Ping Fan 
>>
>> Collect unused object and release them at caller demand.
>>
>> Signed-off-by: Liu Ping Fan 
>> ---
>>  include/qemu/reclaimer.h |   28 ++
>>  main-loop.c  |5 
>>  qemu-tool.c  |5 
>>  qom/Makefile.objs|2 +-
>>  qom/reclaimer.c  |   58 
>> ++
>>  5 files changed, 97 insertions(+), 1 deletions(-)
>>  create mode 100644 include/qemu/reclaimer.h
>>  create mode 100644 qom/reclaimer.c
>>
>> diff --git a/include/qemu/reclaimer.h b/include/qemu/reclaimer.h
>> new file mode 100644
>> index 000..9307e93
>> --- /dev/null
>> +++ b/include/qemu/reclaimer.h
>> @@ -0,0 +1,28 @@
>> +/*
>> + * QEMU reclaimer
>> + *
>> + * Copyright IBM, Corp. 2012
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef QEMU_RECLAIMER
>> +#define QEMU_RECLAIMER
>> +
>> +typedef void ReleaseHandler(void *opaque);
>> +typedef struct Chunk {
>> +QLIST_ENTRY(Chunk) list;
>> +void *opaque;
>> +ReleaseHandler *release;
>> +} Chunk;
>> +
>> +typedef struct ChunkHead {
>> +struct Chunk *lh_first;
>> +} ChunkHead;
>> +
>> +void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler 
>> *release);
>> +void reclaimer_worker(ChunkHead *head);
>> +void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release);
>> +void qemu_reclaimer(void);
>
> So "enqueue" is call_rcu and qemu_reclaimer marks a quiescent state +
> empties the pending call_rcu.
>
> But what's the difference between the two pairs of APIs?
>
I add the new one for V2 to reclaim the resource for mem view. And
yes, as you point out, I will delete the 2nd API, for it can be
substituted by the 1st one easily.

>> +#endif
>> diff --git a/main-loop.c b/main-loop.c
>> index eb3b6e6..be9d095 100644
>> --- a/main-loop.c
>> +++ b/main-loop.c
>> @@ -26,6 +26,7 @@
>>  #include "qemu-timer.h"
>>  #include "slirp/slirp.h"
>>  #include "main-loop.h"
>> +#include "qemu/reclaimer.h"
>>
>>  #ifndef _WIN32
>>
>> @@ -505,5 +506,9 @@ int main_loop_wait(int nonblocking)
>> them.  */
>>  qemu_bh_poll();
>>
>> +/* ref to device from iohandler/bh/timer do not obey the rules, so delay
>> + * reclaiming until now.
>> + */
>> +qemu_reclaimer();
>>  return ret;
>>  }
>> diff --git a/qemu-tool.c b/qemu-tool.c
>> index 318c5fc..f5fe319 100644
>> --- a/qemu-tool.c
>> +++ b/qemu-tool.c
>> @@ -21,6 +21,7 @@
>>  #include "main-loop.h"
>>  #include "qemu_socket.h"
>>  #include "slirp/libslirp.h"
>> +#include "qemu/reclaimer.h"
>>
>>  #include 
>>
>> @@ -75,6 +76,10 @@ void qemu_mutex_unlock_iothread(void)
>>  {
>>  }
>>
>> +void qemu_reclaimer(void)
>> +{
>> +}
>> +
>>  int use_icount;
>>
>>  void qemu_clock_warp(QEMUClock *clock)
>> diff --git a/qom/Makefile.objs b/qom/Makefile.objs
>> index 5ef060a..a579261 100644
>> --- a/qom/Makefile.objs
>> +++ b/qom/Makefile.objs
>> @@ -1,4 +1,4 @@
>> -qom-obj-y = object.o container.o qom-qobject.o
>> +qom-obj-y = object.o container.o qom-qobject.o reclaimer.o
>>  qom-obj-twice-y = cpu.o
>>  common-obj-y = $(qom-obj-twice-y)
>>  user-obj-y = $(qom-obj-twice-y)
>> diff --git a/qom/reclaimer.c b/qom/reclaimer.c
>> new file mode 100644
>> index 000..6cb53e3
>> --- /dev/null
>> +++ b/qom/reclaimer.c
>> @@ -0,0 +1,58 @@
>> +/*
>> + * QEMU reclaimer
>> + *
>> + * Copyright IBM, Corp. 2012
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
>> + * See the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu-common.h"
>> +#include "qemu-thread.h"
>> +#include "main-loop.h"
>> +#include "qemu-queue.h"
>> +#include "qemu/reclaimer.h"
>> +
>> +static struct QemuMutex reclaimer_lock;
>> +static QLIST_HEAD(rcl, Chunk) reclaimer_list;
>> +
>> +void reclaimer_enqueue(ChunkHead *head, void *opaque, ReleaseHandler 
>> *release)
>> +{
>> +Chunk *r = g_malloc0(sizeof(Chunk));
>> +r->opaque = opaque;
>> +r->release = release;
>> +QLIST_INSERT_HEAD_RCU(head, r, list);
>> +}
>
> No lock?
Yes, need!  I will think it more closely.

Thanks and regards,
pingfan
>
>> +void reclaimer_worker(ChunkHead *head)
>> +{
>> +Chunk *cur, *next;
>> +
>> +QLIST_FOREACH_SAFE(cur, head, list, next) {
>> +QLIST_REMOVE(cur, list);
>> +cur->release(cur->opaque);
>> +g_free(cur);
>> +}
>
> QLIST_REMOVE needs a lock too, so using the lockless
> QLIST_INSERT_HEAD_RCU is not necessary.
>
>> +}
>> +
>> +void qemu_reclaimer_enqueue(void *opaque, ReleaseHandler *release)
>> +{
>> +Chunk *r = g_malloc0(sizeof(Chunk));
>> +r->opaque = opaque;
>> +r->release = release;
>> +qemu_mutex_lock(&reclaimer_lock);
>> +QLIST_INSERT_HEAD_RCU(&reclaimer_list, r, list);
>> +qemu_mutex_un

Re: [PATCH 03/15] qom: introduce reclaimer to release obj

2012-08-09 Thread liu ping fan
On Wed, Aug 8, 2012 at 5:15 PM, Avi Kivity  wrote:
> On 08/08/2012 12:07 PM, Paolo Bonzini wrote:
>> Il 08/08/2012 11:05, Avi Kivity ha scritto:
 > From: Liu Ping Fan 
 >
 > Collect unused object and release them at caller demand.
 >
>>> Please explain the motivation for this patch.
>>
>> It's poor man RCU, I think?
>
> I thought that it was to defer destructors (finalizers) to a more
> suitable context.  But why is the unref context unsuitable?
>
Yes, it is to defer destructors.
See 0009-memory-prepare-flatview-and-radix-tree-for-rcu-style.patch
When MemoryRegion is _del_subregion from mem in updater, it may be
still in use by reader -- radix or flatview, so defer its destructors
to the reclaimer --phys_map_release(PhysMap *map)
If we have rcu, it could be elegant to do this.

I think, I should write the commit comment here too,  not until the
followed patch.

Regards, pingfan
> I don't see how it relates to RCU, where is the C and the U?
>
> Anyway the list eagerly awaits the explanation.
>
> --
> error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/15] memory: prepare flatview and radix-tree for rcu style access

2012-08-09 Thread liu ping fan
On Thu, Aug 9, 2012 at 3:23 AM, Blue Swirl  wrote:
> On Wed, Aug 8, 2012 at 6:25 AM, Liu Ping Fan  wrote:
>> From: Liu Ping Fan 
>>
>> Flatview and radix view are all under the protection of pointer.
>> And this make sure the change of them seem to be atomic!
>>
>> The mr accessed by radix-tree leaf or flatview will be reclaimed
>> after the prev PhysMap not in use any longer
>>
>> Signed-off-by: Liu Ping Fan 
>> ---
>>  exec.c  |  303 
>> +++---
>>  hw/vhost.c  |2 +-
>>  hw/xen_pt.c |2 +-
>>  kvm-all.c   |2 +-
>>  memory.c|   92 ++-
>>  memory.h|9 ++-
>>  vl.c|1 +
>>  xen-all.c   |2 +-
>>  8 files changed, 286 insertions(+), 127 deletions(-)
>>
>> diff --git a/exec.c b/exec.c
>> index 01b91b0..97addb9 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -24,6 +24,7 @@
>>  #include 
>>  #endif
>>
>> +#include "qemu/atomic.h"
>>  #include "qemu-common.h"
>>  #include "cpu.h"
>>  #include "tcg.h"
>> @@ -35,6 +36,8 @@
>>  #include "qemu-timer.h"
>>  #include "memory.h"
>>  #include "exec-memory.h"
>> +#include "qemu-thread.h"
>> +#include "qemu/reclaimer.h"
>>  #if defined(CONFIG_USER_ONLY)
>>  #include 
>>  #if defined(__FreeBSD__) || defined(__FreeBSD_kernel__)
>> @@ -184,25 +187,17 @@ static void *l1_map[V_L1_SIZE];
>>
>>  #if !defined(CONFIG_USER_ONLY)
>>
>> -static MemoryRegionSection *phys_sections;
>> -static unsigned phys_sections_nb, phys_sections_nb_alloc;
>>  static uint16_t phys_section_unassigned;
>>  static uint16_t phys_section_notdirty;
>>  static uint16_t phys_section_rom;
>>  static uint16_t phys_section_watch;
>>
>> -
>> -/* Simple allocator for PhysPageEntry nodes */
>> -static PhysPageEntry (*phys_map_nodes)[L2_SIZE];
>> -static unsigned phys_map_nodes_nb, phys_map_nodes_nb_alloc;
>> -
>>  #define PHYS_MAP_NODE_NIL (((uint16_t)~0) >> 1)
>>
>> -/* This is a multi-level map on the physical address space.
>> -   The bottom level has pointers to MemoryRegionSections.  */
>> -static PhysPageEntry phys_map = { .ptr = PHYS_MAP_NODE_NIL, .is_leaf = 0 };
>> -
>> +static QemuMutex cur_map_lock;
>> +static PhysMap *cur_map;
>>  QemuMutex mem_map_lock;
>> +static PhysMap *next_map;
>>
>>  static void io_mem_init(void);
>>  static void memory_map_init(void);
>> @@ -383,41 +378,38 @@ static inline PageDesc *page_find(tb_page_addr_t index)
>>
>>  #if !defined(CONFIG_USER_ONLY)
>>
>> -static void phys_map_node_reserve(unsigned nodes)
>> +static void phys_map_node_reserve(PhysMap *map, unsigned nodes)
>>  {
>> -if (phys_map_nodes_nb + nodes > phys_map_nodes_nb_alloc) {
>> +if (map->phys_map_nodes_nb + nodes > map->phys_map_nodes_nb_alloc) {
>>  typedef PhysPageEntry Node[L2_SIZE];
>> -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc * 2, 16);
>> -phys_map_nodes_nb_alloc = MAX(phys_map_nodes_nb_alloc,
>> -  phys_map_nodes_nb + nodes);
>> -phys_map_nodes = g_renew(Node, phys_map_nodes,
>> - phys_map_nodes_nb_alloc);
>> +map->phys_map_nodes_nb_alloc = MAX(map->phys_map_nodes_nb_alloc * 2,
>> +16);
>> +map->phys_map_nodes_nb_alloc = MAX(map->phys_map_nodes_nb_alloc,
>> +  map->phys_map_nodes_nb + nodes);
>> +map->phys_map_nodes = g_renew(Node, map->phys_map_nodes,
>> + map->phys_map_nodes_nb_alloc);
>>  }
>>  }
>>
>> -static uint16_t phys_map_node_alloc(void)
>> +static uint16_t phys_map_node_alloc(PhysMap *map)
>>  {
>>  unsigned i;
>>  uint16_t ret;
>>
>> -ret = phys_map_nodes_nb++;
>> +ret = map->phys_map_nodes_nb++;
>>  assert(ret != PHYS_MAP_NODE_NIL);
>> -assert(ret != phys_map_nodes_nb_alloc);
>> +assert(ret != map->phys_map_nodes_nb_alloc);
>>  for (i = 0; i < L2_SIZE; ++i) {
>> -phys_map_nodes[ret][i].is_leaf = 0;
>> -phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
>> +map->phys_map_nodes[ret][i].is_leaf = 0;
>> +map->phys_map_nodes[ret][i].ptr = PHYS_MAP_NODE_NIL;
>>  }
>>  return ret;
>>  }
>>
>> -static void phys_map_nodes_reset(void)
>> -{
>> -phys_map_nodes_nb = 0;
>> -}
>> -
>> -
>> -static void phys_page_set_level(PhysPageEntry *lp, target_phys_addr_t 
>> *index,
>> -target_phys_addr_t *nb, uint16_t leaf,
>> +static void phys_page_set_level(PhysMap *map, PhysPageEntry *lp,
>> +target_phys_addr_t *index,
>> +target_phys_addr_t *nb,
>> +uint16_t leaf,
>>  int level)
>>  {
>>  PhysPageEntry *p;
>> @@ -425,8 +417,8 @@ static void phys_page_set_level(PhysPageEntry *lp, 
>> target_phys_addr_t *index,
>>  target_phys_addr_t step = (target_phys_addr_t)1 << (level * L2_BITS);
>>
>>

  1   2   >