Re: mmu_notifers, pte_dirty questions

2010-06-06 Thread Avi Kivity

On 06/06/2010 09:36 PM, Andrea Arcangeli wrote:

On Sun, Jun 06, 2010 at 03:07:27PM +0300, Avi Kivity wrote:
   

Why no notifer when testing and clearing the dirty bit?

(*clear_flush_dirty)(...).

 

static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 unsigned long address)
{
 struct mm_struct *mm = vma->vm_mm;
 pte_t *pte;
 spinlock_t *ptl;
 int ret = 0;

 pte = page_check_address(page, mm, address,&ptl, 1);
 if (!pte)
 goto out;

 if (pte_dirty(*pte) || pte_write(*pte)) {
 pte_t entry;

 flush_cache_page(vma, address, pte_pfn(*pte));
 entry = ptep_clear_flush_notify(vma, address, pte);
 entry = pte_wrprotect(entry);
 entry = pte_mkclean(entry);
 set_pte_at(mm, address, pte, entry);
   

set_pte_at_notify()?  without this (or clear_flush_dirty) Linux will
assume all ptes are now clean; if the guest writes to a page nothing
will catch it.

->  with set_pte_at_notify(), we can drop the spte and mark the page as
dirty, so the next write will re-instantiate the spte
->  with ->clear_flush_dirty(), we can track the dirty state without
dropping the spte.

 

 ret = 1;
 }

 pte_unmap_unlock(pte, ptl);
out:
 return ret;
   

I'm probably missing something big as I can't see how this works.
 

Under the PT lock it's safe to keep the PTE zero, just the pte must be
non zero again before pte_unmap_unlock.

The sptes are all gone by the time ptep_clear_flush_notify returns
(also gup-fast is stopped with the IPI of the flush) and no spte can
be established again before pte_unmap_unlock runs, so it's all safe as
far as I can tell.

   


Somehow  I missed the ptep_clear_flush_notify()...  so all should be fine.


set_pte_at_notify might prevent a minor fault though.
   


I'm thinking of how to implement speculative write access for kvm: 
consider a read fault for a writeable page.  We could install a 
writeable spte with the dirty bit clear, and examine the dirty bit at 
pte_clear_flush_notify() time and transfer it to the page flags.  
However I can't see where the mm code checks the pte dirty bit for 
anonymous pages?  Does it assume anonymous pages are always dirty? (they 
could have a clean copy in swap, no?)


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] kvm/powerpc: fix a build error in e500_tlb.c

2010-06-06 Thread Liu Yu-B13201
 

> -Original Message-
> From: kvm-ow...@vger.kernel.org 
> [mailto:kvm-ow...@vger.kernel.org] On Behalf Of Alexander Graf
> Sent: Friday, June 04, 2010 10:08 PM
> To: Kevin Hao
> Cc: Marcelo Tosatti; Kumar Gala; Avi Kivity; 
> kvm@vger.kernel.org list; linuxppc-...@ozlabs.org; Liu Yu-B13201
> Subject: Re: [PATCH] kvm/powerpc: fix a build error in e500_tlb.c
> 
> 
> On 03.06.2010, at 07:52, Kevin Hao wrote:
> 
> > We use the wrong number arguments when invoking 
> trace_kvm_stlb_inval,
> > and cause the following build error.
> > arch/powerpc/kvm/e500_tlb.c: In function 
> 'kvmppc_e500_stlbe_invalidate':
> > arch/powerpc/kvm/e500_tlb.c:230: error: too many arguments 
> to function 'trace_kvm_stlb_inval'
> 
> Liu, I'd like to get an ack from you here.
> 

Seems commit e43f2f741a49483034bf968841275cfa553a4cb3 has solved this.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: Reboot Problem

2010-06-06 Thread Chaitra Gorantla
Hi,

Yes, Insipte of adding
kernel.panic = 10
in sysctl.conf,
the kernel does not reboot when kvm-i9ntel.ko is been inserted.

Do you have any idea , because, when the kvm-intel.ko is not inserted, kernel 
reboots.


-Original Message-
From: Sterling Windmill [mailto:sterl...@ampx.net]
Sent: Monday, June 07, 2010 8:49 AM
To: Chaitra Gorantla
Cc: kvm@vger.kernel.org
Subject: Re: Reboot Problem

The Linux kernel does not reboot on panic by default.

Read here: 
http://www.cyberciti.biz/tips/reboot-linux-box-after-a-kernel-panic.html

- Original Message -
From: "Chaitra Gorantla" 
To: kvm@vger.kernel.org
Sent: Sunday, June 6, 2010 10:59:11 PM
Subject: Reboot Problem

Hi all,

We know that kernel reboots after any panic or crash.

I am using 2.6.21 kernel (HOST machine) and KVM-88 package.

After inserting kvm-intel.ko , if any kernel crash occurs, the Host
halts= .
The kernel never reboots.

Please suggest regarding this.

Thanks in advance,
Chaitra


This Email may contain confidential or privileged information for the
intended recipient (s) If you are not the intended recipient, please do
not use or disseminate the information, notify the sender and delete it
from your system.

__
-- To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

__

This Email may contain confidential or privileged information for the intended 
recipient (s) If you are not the intended recipient, please do not use or 
disseminate the information, notify the sender and delete it from your system.

__
N�r��yb�X��ǧv�^�)޺{.n�+h����ܨ}���Ơz�&j:+v���zZ+��+zf���h���~i���z��w���?�&�)ߢf

Re: Reboot Problem

2010-06-06 Thread Sterling Windmill
The Linux kernel does not reboot on panic by default.

Read here: 
http://www.cyberciti.biz/tips/reboot-linux-box-after-a-kernel-panic.html

- Original Message -
From: "Chaitra Gorantla" 
To: kvm@vger.kernel.org
Sent: Sunday, June 6, 2010 10:59:11 PM
Subject: Reboot Problem

Hi all,

We know that kernel reboots after any panic or crash.

I am using 2.6.21 kernel (HOST machine) and KVM-88 package.

After inserting kvm-intel.ko , if any kernel crash occurs, the Host
halts= .
The kernel never reboots.

Please suggest regarding this.

Thanks in advance,
Chaitra


This Email may contain confidential or privileged information for the
intended recipient (s) If you are not the intended recipient, please do
not use or disseminate the information, notify the sender and delete it
from your system.

__
-- To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Reboot Problem

2010-06-06 Thread Chaitra Gorantla
Hi all,

We know that kernel reboots after any panic or crash.

I am using 2.6.21 kernel (HOST machine) and KVM-88 package.

After inserting kvm-intel.ko , if any kernel crash occurs,  the Host halts= .
The kernel never reboots.

Please suggest regarding this.

Thanks in advance,
Chaitra


This Email may contain confidential or privileged information for the intended 
recipient (s) If you are not the intended recipient, please do not use or 
disseminate the information, notify the sender and delete it from your system.

__
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] KVM VMX: rename vpid_sync_vcpu_all() to vpid_sync_vcpu_single()

2010-06-06 Thread Gui Jianfeng
The name "pid_sync_vcpu_all" isn't appropriate since it just affect
a single vpid, so rename it to vpid_sync_vcpu_single().

Signed-off-by: Gui Jianfeng 
---
 arch/x86/kvm/vmx.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index aaf8a28..4d893ac 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -504,7 +504,7 @@ static void vcpu_clear(struct vcpu_vmx *vmx)
smp_call_function_single(vmx->vcpu.cpu, __vcpu_clear, vmx, 1);
 }
 
-static inline void vpid_sync_vcpu_all(struct vcpu_vmx *vmx)
+static inline void vpid_sync_vcpu_single(struct vcpu_vmx *vmx)
 {
if (vmx->vpid == 0)
return;
@@ -522,7 +522,7 @@ static inline void vpid_sync_vcpu_global(void)
 static inline void vpid_sync_context(struct vcpu_vmx *vmx)
 {
if (cpu_has_vmx_invvpid_single())
-   vpid_sync_vcpu_all(vmx);
+   vpid_sync_vcpu_single(vmx);
else
vpid_sync_vcpu_global();
 }
-- 
1.6.5.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] KVM VMX: Add all-context INVVPID type support

2010-06-06 Thread Gui Jianfeng
Add all-context INVVPID type support.

Signed-off-by: Gui Jianfeng 
---
 arch/x86/include/asm/vmx.h |1 +
 arch/x86/kvm/vmx.c |   23 +--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index b4e2840..96a5886 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -377,6 +377,7 @@ enum vmcs_field {
 #define VMX_EPT_EXTENT_GLOBAL_BIT  (1ull << 26)
 
 #define VMX_VPID_EXTENT_SINGLE_CONTEXT_BIT  (1ull << 9) /* (41 - 32) */
+#define VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT  (1ull << 10) /* (42 - 32) */
 
 #define VMX_EPT_DEFAULT_GAW3
 #define VMX_EPT_MAX_GAW0x4
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 240a407..aaf8a28 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -360,6 +360,11 @@ static inline bool cpu_has_vmx_invvpid_single(void)
return vmx_capability.vpid & VMX_VPID_EXTENT_SINGLE_CONTEXT_BIT;
 }
 
+static inline bool cpu_has_vmx_invvpid_global(void)
+{
+   return vmx_capability.vpid & VMX_VPID_EXTENT_GLOBAL_CONTEXT_BIT;
+}
+
 static inline bool cpu_has_vmx_ept(void)
 {
return vmcs_config.cpu_based_2nd_exec_ctrl &
@@ -508,6 +513,20 @@ static inline void vpid_sync_vcpu_all(struct vcpu_vmx *vmx)
__invvpid(VMX_VPID_EXTENT_SINGLE_CONTEXT, vmx->vpid, 0);
 }
 
+static inline void vpid_sync_vcpu_global(void)
+{
+   if (cpu_has_vmx_invvpid_global())
+   __invvpid(VMX_VPID_EXTENT_ALL_CONTEXT, 0, 0);
+}
+
+static inline void vpid_sync_context(struct vcpu_vmx *vmx)
+{
+   if (cpu_has_vmx_invvpid_single())
+   vpid_sync_vcpu_all(vmx);
+   else
+   vpid_sync_vcpu_global();
+}
+
 static inline void ept_sync_global(void)
 {
if (cpu_has_vmx_invept_global())
@@ -1797,7 +1816,7 @@ static void exit_lmode(struct kvm_vcpu *vcpu)
 
 static void vmx_flush_tlb(struct kvm_vcpu *vcpu)
 {
-   vpid_sync_vcpu_all(to_vmx(vcpu));
+   vpid_sync_context(to_vmx(vcpu));
if (enable_ept)
ept_sync_context(construct_eptp(vcpu->arch.mmu.root_hpa));
 }
@@ -2753,7 +2772,7 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
vmx_fpu_activate(&vmx->vcpu);
update_exception_bitmap(&vmx->vcpu);
 
-   vpid_sync_vcpu_all(vmx);
+   vpid_sync_context(vmx);
 
ret = 0;
 
-- 
1.6.5.2


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH qemu-kvm] kvm: rename kvm/test/test/powerpc to kvm/test/powerpc

2010-06-06 Thread Ken CC
On Sun, Jun 06, 2010 at 04:22:45PM +0300, Avi Kivity wrote:
> On 06/06/2010 04:13 PM, Ken CC wrote:
> >To make the directory architecture match the file config file
> >kvm/test/config-powerpc.mk.
> >
> 
> Can you please resent with 'diff.renames' set in your git
> configuration?  That will make the patch reviewable, as it will
> convert it to a pattern of moves instead of creates/deletes.
> 

Hi, Avi,

I have resent the patch, please check.

> Also, isn't a Makefile change needed?
> 

the content of config-powerpc.mak shares the same directory structure
with that of config-x86-common.mak, so i think it is not needed to
change Makefile while moving "powerpc" out of the "test" dir.


Regards,

Ken CC

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH QEMU-KVM v2 diff.renames] kvm: rename kvm/test/test/powerpc to kvm/test/powerpc

2010-06-06 Thread Ken CC
To make the directory architecture match the make config file
kvm/test/config-powerpc.mk.

Signed-off-by: Ken CC 
---
 kvm/test/{ => }/test/powerpc/44x/tlbsx.S  |0
 kvm/test/{ => }/test/powerpc/44x/tlbwe.S  |0
 kvm/test/{ => }/test/powerpc/44x/tlbwe_16KB.S |0
 kvm/test/{ => }/test/powerpc/44x/tlbwe_hole.S |0
 kvm/test/{ => }/test/powerpc/cstart.S |0
 kvm/test/{ => }/test/powerpc/exit.c   |0
 kvm/test/{ => }/test/powerpc/helloworld.c |0
 kvm/test/{ => }/test/powerpc/io.S |0
 kvm/test/{ => }/test/powerpc/spin.S   |0
 kvm/test/{ => }/test/powerpc/sprg.S   |0
 10 files changed, 0 insertions(+), 0 deletions(-)
 rename kvm/test/{ => }/test/powerpc/44x/tlbsx.S (100%)
 rename kvm/test/{ => }/test/powerpc/44x/tlbwe.S (100%)
 rename kvm/test/{ => }/test/powerpc/44x/tlbwe_16KB.S (100%)
 rename kvm/test/{ => }/test/powerpc/44x/tlbwe_hole.S (100%)
 rename kvm/test/{ => }/test/powerpc/cstart.S (100%)
 rename kvm/test/{ => }/test/powerpc/exit.c (100%)
 rename kvm/test/{ => }/test/powerpc/helloworld.c (100%)
 rename kvm/test/{ => }/test/powerpc/io.S (100%)
 rename kvm/test/{ => }/test/powerpc/spin.S (100%)
 rename kvm/test/{ => }/test/powerpc/sprg.S (100%)

diff --git a/kvm/test/test/powerpc/44x/tlbsx.S b/kvm/test/powerpc/44x/tlbsx.S
similarity index 100%
rename from kvm/test/test/powerpc/44x/tlbsx.S
rename to kvm/test/powerpc/44x/tlbsx.S
diff --git a/kvm/test/test/powerpc/44x/tlbwe.S b/kvm/test/powerpc/44x/tlbwe.S
similarity index 100%
rename from kvm/test/test/powerpc/44x/tlbwe.S
rename to kvm/test/powerpc/44x/tlbwe.S
diff --git a/kvm/test/test/powerpc/44x/tlbwe_16KB.S 
b/kvm/test/powerpc/44x/tlbwe_16KB.S
similarity index 100%
rename from kvm/test/test/powerpc/44x/tlbwe_16KB.S
rename to kvm/test/powerpc/44x/tlbwe_16KB.S
diff --git a/kvm/test/test/powerpc/44x/tlbwe_hole.S 
b/kvm/test/powerpc/44x/tlbwe_hole.S
similarity index 100%
rename from kvm/test/test/powerpc/44x/tlbwe_hole.S
rename to kvm/test/powerpc/44x/tlbwe_hole.S
diff --git a/kvm/test/test/powerpc/cstart.S b/kvm/test/powerpc/cstart.S
similarity index 100%
rename from kvm/test/test/powerpc/cstart.S
rename to kvm/test/powerpc/cstart.S
diff --git a/kvm/test/test/powerpc/exit.c b/kvm/test/powerpc/exit.c
similarity index 100%
rename from kvm/test/test/powerpc/exit.c
rename to kvm/test/powerpc/exit.c
diff --git a/kvm/test/test/powerpc/helloworld.c b/kvm/test/powerpc/helloworld.c
similarity index 100%
rename from kvm/test/test/powerpc/helloworld.c
rename to kvm/test/powerpc/helloworld.c
diff --git a/kvm/test/test/powerpc/io.S b/kvm/test/powerpc/io.S
similarity index 100%
rename from kvm/test/test/powerpc/io.S
rename to kvm/test/powerpc/io.S
diff --git a/kvm/test/test/powerpc/spin.S b/kvm/test/powerpc/spin.S
similarity index 100%
rename from kvm/test/test/powerpc/spin.S
rename to kvm/test/powerpc/spin.S
diff --git a/kvm/test/test/powerpc/sprg.S b/kvm/test/powerpc/sprg.S
similarity index 100%
rename from kvm/test/test/powerpc/sprg.S
rename to kvm/test/powerpc/sprg.S
-- 
1.6.5.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

2010-06-06 Thread Stephen Hemminger
Still not sure this is a good idea for a couple of reasons:

1. We already have lots of special cases with skb's (frags and fraglist),
   and skb's travel through a lot of different parts of the kernel.  So any
   new change like this creates lots of exposed points for new bugs. Look
   at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
   and ppp and ...

2. SKB's can have infinite lifetime in the kernel. If these buffers come from
   a fixed size pool in an external device, they can easily all get tied up
   if you have a slow listener. What happens then?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [COMMIT master] Merge branch 'master' of ssh://master.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into next

2010-06-06 Thread Alexander Graf

On 06.06.2010, at 18:13, Avi Kivity wrote:

> On 06/06/2010 07:09 PM, Avi Kivity wrote:
>> From: Avi Kivity
>> 
>> * 'master' of 
>> ssh://master.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6: (1443 
>> commits)
>>   Linux 2.6.35-rc2
>>   drm/i915: Move non-phys cursors into the GTT
>>   ext4: Fix remaining racy updates of EXT4_I(inode)->i_flags
>>   module: fix bne2 "gave up waiting for init of module libcrc32c"
>>   module: verify_export_symbols under the lock
>>   module: move find_module check to end
>>   module: make locking more fine-grained.
>>   module: Make module sysfs functions private.
>>   module: move sysfs exposure to end of load_module
>>   module: fix kdb's illicit use of struct module_use.
>>   module: Make the 'usage' lists be two-way
>>   X25: remove duplicated #include
>>   tcp: use correct net ns in cookie_v4_check()
>>   rps: tcp: fix rps_sock_flow_table table updates
>>   ppp_generic: fix multilink fragment sizes
>>   syncookies: remove Kconfig text line about disabled-by-default
>>   ixgbe: only check pfc bits in hang logic if pfc is enabled
>>   net: check for refcount if pop a stacked dst_entry
>>   omap: remove BUG_ON for disabled interrupts
>>   vmscan: fix do_try_to_free_pages() return value when priority==0 reclaim 
>> failure
>>   ...
>> 
>> Conflicts:
>>  arch/powerpc/kernel/ppc_ksyms.c
>>   
> 
> Alex, can you verify that I merged this correctly (dropping the export)?

Yes, dropping was the right choice :). Thank you!

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mmu_notifers, pte_dirty questions

2010-06-06 Thread Andrea Arcangeli
On Sun, Jun 06, 2010 at 03:07:27PM +0300, Avi Kivity wrote:
> Why no notifer when testing and clearing the dirty bit?
> 
> (*clear_flush_dirty)(...).
> 
> > static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > unsigned long address)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > pte_t *pte;
> > spinlock_t *ptl;
> > int ret = 0;
> >
> > pte = page_check_address(page, mm, address, &ptl, 1);
> > if (!pte)
> > goto out;
> >
> > if (pte_dirty(*pte) || pte_write(*pte)) {
> > pte_t entry;
> >
> > flush_cache_page(vma, address, pte_pfn(*pte));
> > entry = ptep_clear_flush_notify(vma, address, pte);
> > entry = pte_wrprotect(entry);
> > entry = pte_mkclean(entry);
> > set_pte_at(mm, address, pte, entry);
> 
> set_pte_at_notify()?  without this (or clear_flush_dirty) Linux will 
> assume all ptes are now clean; if the guest writes to a page nothing 
> will catch it.
> 
> -> with set_pte_at_notify(), we can drop the spte and mark the page as 
> dirty, so the next write will re-instantiate the spte
> -> with ->clear_flush_dirty(), we can track the dirty state without 
> dropping the spte.
> 
> > ret = 1;
> > }
> >
> > pte_unmap_unlock(pte, ptl);
> > out:
> > return ret;
> 
> I'm probably missing something big as I can't see how this works.

Under the PT lock it's safe to keep the PTE zero, just the pte must be
non zero again before pte_unmap_unlock.

The sptes are all gone by the time ptep_clear_flush_notify returns
(also gup-fast is stopped with the IPI of the flush) and no spte can
be established again before pte_unmap_unlock runs, so it's all safe as
far as I can tell.

set_pte_at_notify might prevent a minor fault though.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v4 3/3] block: add sheepdog driver for distributed storage support

2010-06-06 Thread MORITA Kazutaka
At Fri, 04 Jun 2010 13:04:00 +0200,
Kevin Wolf wrote:
> 
> Am 03.06.2010 18:23, schrieb MORITA Kazutaka:
> >>> +static void sd_aio_cancel(BlockDriverAIOCB *blockacb)
> >>> +{
> >>> + SheepdogAIOCB *acb = (SheepdogAIOCB *)blockacb;
> >>> +
> >>> + acb->canceled = 1;
> >>> +}
> >>
> >> Does this provide the right semantics? You haven't really cancelled the
> >> request, but you pretend to. So you actually complete the request in the
> >> background and then throw the return code away.
> >>
> >> I seem to remember that posix-aio-compat.c waits at this point for
> >> completion of the requests, calls the callbacks and only afterwards
> >> returns from aio_cancel when no more requests are in flight.
> >>
> >> Or if you can really cancel requests, it would be the best option, of
> >> course.
> >>
> > 
> > Sheepdog cannot cancel the requests which are already sent to the
> > servers.  So, as you say, we pretend to cancel the requests without
> > waiting for completion of them.  However, are there any situation
> > where pretending to cancel causes problems in practice?
> 
> I'm not sure how often it would happen in practice, but if the guest OS
> thinks the old value is on disk when in fact the new one is, this could
> lead to corruption. I think if it can happen, even without evidence that
> it actually does, it's already relevant enough.
> 

I agree.

> > To wait for completion of the requests here, we may need to create
> > another thread for processing I/O like posix-aio-compat.c.
> 
> I don't think you need a thread to get the same behaviour, you just need
> to call the fd handlers like in the main loop. It would probably be the
> first driver doing this, though, and it's not an often used code path,
> so it might be a bad idea.
> 
> Maybe it's reasonable to just complete the request with -EIO? This way
> the guest couldn't make any assumption about the data written. On the
> other hand, it could be unhappy about failed requests, but that's
> probably better than corruption.
> 

Completing with -EIO looks good to me.  Thanks for the advice.
I'll send an updated patch tomorrow.

Regards,

Kazutaka
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: porting fixes regarding kvm-clock and lost irqs to stable qemu-kvm 0.12.4

2010-06-06 Thread Peter Lieven

Am 06.06.2010 um 16:53 schrieb Faidon Liambotis:

> Peter Lieven wrote:
 i was looking for a generic way to disable it in the hypervisor
 without the need to touch every guests kernel commandline.
 this would be easy revertible once its all working as expected.
>>> This is a huge hack but for the same reason I've made an LD_PRELOAD
>>> wrapper that wraps ioctl masking the hsot kernel's kvmclock feature from
>>> userspace kvm. I'll be happy to share it if you're interested.
>> 
>> i would be grateful if you can sent it to me offlist.
> http://people.debian.org/~paravoid/kvm-noclock-3.tar.gz
> 
>> however, i came up with this hack,
>> that seems to work also:
> 
> 
> Yes, that's equivalent. We didn't use that since we wanted to use our
> distribution's binaries and be able to upgrade without modifying the source.

thank you. 

i want to share this patch for stable until -cpu xxx,-kvmclock works in the 
stable
tree.

diff -ur qemu-kvm-0.12.4.orig/qemu-kvm.c qemu-kvm-0.12.4/qemu-kvm.c
--- qemu-kvm-0.12.4.orig/qemu-kvm.c 2010-05-09 13:05:19.0 +0200
+++ qemu-kvm-0.12.4/qemu-kvm.c  2010-06-06 16:58:12.0 +0200
@@ -54,7 +54,7 @@
 int kvm_pit = 1;
 int kvm_pit_reinject = 1;
 int kvm_nested = 0;
-
+int kvm_clock = 1;
 
 KVMState *kvm_state;
 kvm_context_t kvm_context;
diff -ur qemu-kvm-0.12.4.orig/qemu-kvm.h qemu-kvm-0.12.4/qemu-kvm.h
--- qemu-kvm-0.12.4.orig/qemu-kvm.h 2010-05-09 13:05:19.0 +0200
+++ qemu-kvm-0.12.4/qemu-kvm.h  2010-06-06 17:04:02.0 +0200
@@ -1035,6 +1035,7 @@
 extern int kvm_pit;
 extern int kvm_pit_reinject;
 extern int kvm_nested;
+extern int kvm_clock;
 extern kvm_context_t kvm_context;
 
 struct ioperm_data {
diff -ur qemu-kvm-0.12.4.orig/qemu-kvm-x86.c qemu-kvm-0.12.4/qemu-kvm-x86.c
--- qemu-kvm-0.12.4.orig/qemu-kvm-x86.c 2010-05-09 13:05:19.0 +0200
+++ qemu-kvm-0.12.4/qemu-kvm-x86.c  2010-06-06 16:46:08.0 +0200
@@ -1259,6 +1259,7 @@
int i, features = 0;
 
for (i = 0; i < ARRAY_SIZE(para_features)-1; i++) {
+if (para_features[i].cap == KVM_CAP_CLOCKSOURCE && !kvm_clock) 
continue;
if (kvm_check_extension(kvm_state, para_features[i].cap))
features |= (1 << para_features[i].feature);
}
diff -ur qemu-kvm-0.12.4.orig/qemu-options.hx qemu-kvm-0.12.4/qemu-options.hx
--- qemu-kvm-0.12.4.orig/qemu-options.hx2010-05-09 13:05:19.0 
+0200
+++ qemu-kvm-0.12.4/qemu-options.hx 2010-06-06 16:38:46.0 +0200
@@ -1959,6 +1959,9 @@
 "-no-kvm-pit disable KVM kernel mode PIT\n")
 DEF("no-kvm-pit-reinjection", 0, QEMU_OPTION_no_kvm_pit_reinjection,
 "-no-kvm-pit-reinjection disable KVM kernel mode PIT interrupt 
reinjection\n")
+DEF("no-kvm-clock", 0, QEMU_OPTION_no_kvm_clock,
+"-no-kvm-clock   disable KVM pvclock\n")
+
 #if defined(TARGET_I386) || defined(TARGET_X86_64) || defined(TARGET_IA64) || 
defined(__linux__)
 DEF("pcidevice", HAS_ARG, QEMU_OPTION_pcidevice,
 "-pcidevice host=bus:dev.func[,dma=none][,name=string]\n"
diff -ur qemu-kvm-0.12.4.orig/vl.c qemu-kvm-0.12.4/vl.c
--- qemu-kvm-0.12.4.orig/vl.c   2010-05-09 13:05:19.0 +0200
+++ qemu-kvm-0.12.4/vl.c2010-06-06 17:05:53.0 +0200
@@ -5530,6 +5530,10 @@
 kvm_pit_reinject = 0;
 break;
 }
+case QEMU_OPTION_no_kvm_clock: {
+kvm_clock = 0;
+break;
+}
case QEMU_OPTION_enable_nesting: {
kvm_nested = 1;
break;


> 
> Regards,
> Faidon
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Perf trace event parse errors for KVM events

2010-06-06 Thread Avi Kivity

On 06/06/2010 06:02 PM, Steven Rostedt wrote:

On Sun, 2010-06-06 at 11:08 +0300, Avi Kivity wrote:

   

Also, I kicked this off in kernelshark, and it made no difference that I
can see. This is because kernelshark only evaluates the viewable area of
the screen.

   

Neat.  Can it also search?  Where can I find it?

 

It's in the same repo as trace-cmd:

git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git

I'm still working on it. The latest is in the branch kernelshark-devel.

And yes, it does searches.
   


I'll be tracking it.

In case you're interested in wishlists, my #1 would be to be able to 
initiate traces from the GUI (so it's easy to explore the available 
trace events and enable them interactively).


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [COMMIT master] Merge branch 'master' of ssh://master.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6 into next

2010-06-06 Thread Avi Kivity

On 06/06/2010 07:09 PM, Avi Kivity wrote:

From: Avi Kivity

* 'master' of 
ssh://master.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6: (1443 
commits)
   Linux 2.6.35-rc2
   drm/i915: Move non-phys cursors into the GTT
   ext4: Fix remaining racy updates of EXT4_I(inode)->i_flags
   module: fix bne2 "gave up waiting for init of module libcrc32c"
   module: verify_export_symbols under the lock
   module: move find_module check to end
   module: make locking more fine-grained.
   module: Make module sysfs functions private.
   module: move sysfs exposure to end of load_module
   module: fix kdb's illicit use of struct module_use.
   module: Make the 'usage' lists be two-way
   X25: remove duplicated #include
   tcp: use correct net ns in cookie_v4_check()
   rps: tcp: fix rps_sock_flow_table table updates
   ppp_generic: fix multilink fragment sizes
   syncookies: remove Kconfig text line about disabled-by-default
   ixgbe: only check pfc bits in hang logic if pfc is enabled
   net: check for refcount if pop a stacked dst_entry
   omap: remove BUG_ON for disabled interrupts
   vmscan: fix do_try_to_free_pages() return value when priority==0 reclaim 
failure
   ...

Conflicts:
arch/powerpc/kernel/ppc_ksyms.c
   


Alex, can you verify that I merged this correctly (dropping the export)?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] make-release: make mtime, owner, group consistent

2010-06-06 Thread Michael S. Tsirkin
On Fri, Jun 04, 2010 at 01:46:25PM -0300, Marcelo Tosatti wrote:
> On Wed, Jun 02, 2010 at 06:27:20PM +0300, Michael S. Tsirkin wrote:
> > Files from git have modification time set to one
> > of commit, and owner/group to root.
> > Making it so for generated files as well makes
> > it easier to generate an identical tarball from git.
> > 
> > Setting owner/group to root is especially important because
> > otherwise you must have a user/group with same name
> > to generate an identical tarball.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >  kvm/scripts/make-release |9 +++--
> >  1 files changed, 7 insertions(+), 2 deletions(-)
> 
> Applied, thanks.

Could you re-generate and upload the qemu-kvm-0.12.4 release
tarball with this applied?

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: rework remove-write-access for a slot

2010-06-06 Thread Avi Kivity

On 06/04/2010 11:14 AM, Lai Jiangshan wrote:




- I thought of a different approach to write protection: write protect
the L4 sptes, on write fault add write permission to the L4 spte and
write protect the L3 sptes that it points to, etc.  This method can use
the slot bitmap to reduce the number of write faults.  However we can
reintroduce the slot bitmap if/when we use the method, this shouldn't
block the patch.
 

It is very a good approach and it is blazing fast.

I have no time to implement it currently,
could you update it into the TODO list?
   


Done.


+static void rmapp_remove_write_access(struct kvm *kvm, unsigned long
*rmapp)
+{
+u64 *spte = rmap_next(kvm, rmapp, NULL);
+
+while (spte) {
+/* avoid RMW */
+if (is_writable_pte(*spte))
+*spte&= ~PT_WRITABLE_MASK;
   

Must use an atomic operation here to avoid losing dirty or accessed bit.

 

Atomic operation is too expensive, I retained the comment "/* avoid RMW */"
and wait someone take a good approach for it.
   


You are right, it is an existing problem.  I just posted a patchset 
which fixes the problem, when it's merged please rebase on top.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: porting fixes regarding kvm-clock and lost irqs to stable qemu-kvm 0.12.4

2010-06-06 Thread Faidon Liambotis
Peter Lieven wrote:
>>> i was looking for a generic way to disable it in the hypervisor
>>> without the need to touch every guests kernel commandline.
>>> this would be easy revertible once its all working as expected.
>> This is a huge hack but for the same reason I've made an LD_PRELOAD
>> wrapper that wraps ioctl masking the hsot kernel's kvmclock feature from
>> userspace kvm. I'll be happy to share it if you're interested.
> 
> i would be grateful if you can sent it to me offlist.
http://people.debian.org/~paravoid/kvm-noclock-3.tar.gz

> however, i came up with this hack,
> that seems to work also:


Yes, that's equivalent. We didn't use that since we wanted to use our
distribution's binaries and be able to upgrade without modifying the source.

Regards,
Faidon
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v6 5/6] Inter-VM shared memory PCI device

2010-06-06 Thread Avi Kivity

On 06/05/2010 12:44 PM, Blue Swirl wrote:

On Fri, Jun 4, 2010 at 9:45 PM, Cam Macdonell  wrote:
   

Support an inter-vm shared memory device that maps a shared-memory object as a
PCI device in the guest.  This patch also supports interrupts between guest by
communicating over a unix domain socket.  This patch applies to the qemu-kvm
repository.

-device ivshmem,size=[,shm=]

Interrupts are supported between multiple VMs by using a shared memory server
by using a chardev socket.

-device ivshmem,size=[,shm=]
   [,chardev=][,msi=on][,irqfd=on][,vectors=n][,role=peer|master]
-chardev socket,path=,id=

(shared memory server is qemu.git/contrib/ivshmem-server)

Sample programs and init scripts are in a git repo here:

 

Why is this KVM specific BTW, Posix SHM is available on many
platforms? What would happen if kvm_set_foobar functions were not
called when KVM is not being used? Is host eventfd support essential?
   


It's not kvm specific, it's 
atomic-ops-on-shared-memory-are-visible-as-atomic-ops specific, which is 
currently only available with kvm.  When tcg gains true smp support (and 
not just against other tcg threads) this can work with tcg as well.


I guess that needs a host with at least 32/64 bit CAS for 32/64 bit 
targets respectively, and double that if the target has DCAS.  Not sure 
how targets with ll/sc can be implemented, especially if there are 
limits as to what can go in between.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Perf trace event parse errors for KVM events

2010-06-06 Thread Steven Rostedt
On Sun, 2010-06-06 at 11:08 +0300, Avi Kivity wrote:

> > Also, I kicked this off in kernelshark, and it made no difference that I
> > can see. This is because kernelshark only evaluates the viewable area of
> > the screen.
> >
> 
> Neat.  Can it also search?  Where can I find it? 
> 

It's in the same repo as trace-cmd:

git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git

I'm still working on it. The latest is in the branch kernelshark-devel.

And yes, it does searches.

-- Steve


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[ kvm-Bugs-2933400 ] virtio-blk io errors / data corruption on raw drives > 1 TB

2010-06-06 Thread SourceForge.net
Bugs item #2933400, was opened at 2010-01-16 17:35
Message generated for change (Comment added) made by nevenchannyy
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2933400&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 9
Private: No
Submitted By: MaSc82 (masc82)
Assigned to: Nobody/Anonymous (nobody)
Summary: virtio-blk io errors / data corruption on raw drives > 1 TB

Initial Comment:
When attaching raw drives > 1 TB, buffer io errors will most likely occur, 
filesystems get corrupted. Processes (like mkfs.ext4) might freeze completely 
when filesystems are created on the guest.

Here's a typical log excerpt when using mkfs.ext4 on a 1.5 TB drive on a Ubuntu 
9.10 guest:
(kern.log)
Jan 15 20:40:44 q kernel: [  677.076602] Buffer I/O error on device vde, 
logical block 366283764
Jan 15 20:40:44 q kernel: [  677.076607] Buffer I/O error on device vde, 
logical block 366283765
Jan 15 20:40:44 q kernel: [  677.076611] Buffer I/O error on device vde, 
logical block 366283766
Jan 15 20:40:44 q kernel: [  677.076616] Buffer I/O error on device vde, 
logical block 366283767
Jan 15 20:40:44 q kernel: [  677.076621] Buffer I/O error on device vde, 
logical block 366283768
Jan 15 20:40:44 q kernel: [  677.076626] Buffer I/O error on device vde, 
logical block 366283769
(messages)
Jan 15 20:40:44 q kernel: [  677.076534] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076541] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076546] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076599] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076604] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076609] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076613] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076618] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076623] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076628] lost page write due to I/O error on vde
Jan 15 20:45:55 q Backgrounding to notify hosts...
(The following entries will repeat infinitely, mkfs.ext4 will not exit and 
cannot be killed)
Jan 15 20:49:27 q kernel: [ 1200.520096] mkfs.ext4 D  0 
 1839   1709 0x
Jan 15 20:49:27 q kernel: [ 1200.520101]  88004e157cb8 0082 
88004e157c58 00015880
Jan 15 20:49:27 q kernel: [ 1200.520115]  88004ef6c7c0 00015880 
00015880 00015880
Jan 15 20:49:27 q kernel: [ 1200.520118]  00015880 88004ef6c7c0 
00015880 00015880
Jan 15 20:49:27 q kernel: [ 1200.520123] Call Trace:
Jan 15 20:49:27 q kernel: [ 1200.520157]  [] ? 
sync_page+0x0/0x50
Jan 15 20:49:27 q kernel: [ 1200.520178]  [] 
io_schedule+0x28/0x40
Jan 15 20:49:27 q kernel: [ 1200.520182]  [] 
sync_page+0x3d/0x50
Jan 15 20:49:27 q kernel: [ 1200.520185]  [] 
__wait_on_bit+0x57/0x80
Jan 15 20:49:27 q kernel: [ 1200.520192]  [] 
wait_on_page_bit+0x6e/0x80
Jan 15 20:49:27 q kernel: [ 1200.520205]  [] ? 
wake_bit_function+0x0/0x40
Jan 15 20:49:27 q kernel: [ 1200.520210]  [] ? 
pagevec_lookup_tag+0x20/0x30
Jan 15 20:49:27 q kernel: [ 1200.520213]  [] 
wait_on_page_writeback_range+0xf5/0x190
Jan 15 20:49:27 q kernel: [ 1200.520217]  [] 
filemap_fdatawait+0x27/0x30
Jan 15 20:49:27 q kernel: [ 1200.520220]  [] 
filemap_write_and_wait+0x44/0x50
Jan 15 20:49:27 q kernel: [ 1200.520235]  [] 
__sync_blockdev+0x1f/0x40
Jan 15 20:49:27 q kernel: [ 1200.520239]  [] 
sync_blockdev+0xe/0x10
Jan 15 20:49:27 q kernel: [ 1200.520241]  [] 
block_fsync+0x1a/0x20
Jan 15 20:49:27 q kernel: [ 1200.520249]  [] 
vfs_fsync+0x86/0xf0
Jan 15 20:49:27 q kernel: [ 1200.520252]  [] 
do_fsync+0x39/0x60
Jan 15 20:49:27 q kernel: [ 1200.520255]  [] 
sys_fsync+0xb/0x10
Jan 15 20:49:27 q kernel: [ 1200.520271]  [] 
system_call_fastpath+0x16/0x1b

In my case I was switching to virtio at one point, but the behaviour didn't 
show until there was > 1 TB data on the filesystem. very dangerous.

Tested using 2 different SATA controllers, 1.5 TB lvm/mdraid, single 1.5 TB 
drive and 2 TB lvm/mdraid.
The behaviour does not occur with if=scsi or if=ide.

#2914397 might be related: 
https://sourceforge.net/tracker/?func=detail&aid=2914397&group_id=180599&atid=893831
This blog post might also relate: 
http://www.neuhalfen.name/2009/08/05/OpenSolaris_KVM_and_large_IDE_drives/

CPU: Intel Xeon E5430
KVM: qemu-kvm-0.12.1.2
Kernel:  2.6.32.2, x86_64
Guest OS: Verified to occur on guests Ubuntu Linux 9.10 (64-bit) and Gentoo 
Linux (64-bit)
Commandline (atm using ide instead of virtio for large drives as a workaround): 
 qemu-system-x86_64 -S -M p

Re: [PATCH 1/4] KVM: MMU: Introduce drop_spte()

2010-06-06 Thread Avi Kivity

On 06/06/2010 04:06 PM, Avi Kivity wrote:

When we call rmap_remove(), we (almost) always immediately follow it by
an __set_spte() to a nonpresent pte.  Since we need to perform the two
operations atomically, to avoid losing the dirty and accessed bits, introduce
a helper drop_spte() and convert all call sites.

The operation is still nonatomic at this point.


@@ -1498,13 +1502,14 @@ static void kvm_mmu_page_unlink_children(struct kvm 
*kvm,
ent&= PT64_BASE_ADDR_MASK;
mmu_page_remove_parent_pte(page_header(ent),
&pt[i]);
+   pt[i] = shadow_trap_nonpresent_pte;
} else {
if (is_large_pte(ent))
--kvm->stat.lpages;
-   rmap_remove(kvm,&pt[i]);
+   drop_spte(kvm,&pt[i],
+ shadow_trap_nonpresent_pte);
}
}
-   pt[i] = shadow_trap_nonpresent_pte;
}
  }
   


Autotest points out that this transformation (and an identical one in 
zap_pte) does not preserve the semantics; if the outer if () fails, the 
new code does not update pt[i].


With the original line after the if () retained, autotest is happier.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-06 Thread Avi Kivity

On 06/02/2010 07:53 PM, Chris Wright wrote:

* Avi Kivity (a...@redhat.com) wrote:
   

The interface would only work for clients which support it: kvm,
vhost, and iommu/devices with restartable dma.
 

BTW, there is no such thing as restartable dma.  There is a provision in
new specs (read: no real hardware) that allows a device to request pages
before using them.  So it's akin to demand paging, but the demand is an
explicit request rather than a page fault.
   


Thanks for the correction.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH qemu-kvm] kvm: rename kvm/test/test/powerpc to kvm/test/powerpc

2010-06-06 Thread Avi Kivity

On 06/06/2010 04:13 PM, Ken CC wrote:

To make the directory architecture match the file config file
kvm/test/config-powerpc.mk.

   


Can you please resent with 'diff.renames' set in your git 
configuration?  That will make the patch reviewable, as it will convert 
it to a pattern of moves instead of creates/deletes.


Also, isn't a Makefile change needed?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: mmu: Remove unused local variable

2010-06-06 Thread Avi Kivity

On 06/03/2010 10:08 AM, Jan Kiszka wrote:

From: Jan Kiszka

Signed-off-by: Jan Kiszka
---

No one else checks for new build warnings?

   


My fault, sorry, I updated the patch in place to remove a dangling "r = 
something;" and introduced two new warnings.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3][STABLE] KVM: Various issues in virtio_net

2010-06-06 Thread Avi Kivity

On 06/03/2010 10:31 PM, Bruce Rogers wrote:

These are patches which we have found useful for our 2.6.32 based SLES 11 SP1 
release.

The first patch is already upstream, but should be included in stable.

The second patch is a subset of another upstream patch. Again, stable material.

The third patch solves the last remaining issue we saw when testing kvm 
configurations with the SUSE certification test suite. Under heavy load, we 
observed rx stalls (first two patches applied), and this third patch was 
crafted to address the issue. Please apply to stable.
I assume this last problem also exists in more recent kernels than 2.6.32, but 
I haven't validated that.

With these 3 patches applied we no longer see any issues with virito networking 
using our certification test suite.

   



Please send virtio patches, including patches for the stable branches, 
to the virtio maintainer (Rusty).  Copying Michael (m...@redhat.com) is 
also a good idea.


Also:

- use git format-patch/git sent-email or equivalents; these ensure the 
correct formatting and group the patches together for threaded mail clients
- indicate the upstream commit (often it's enough to ask the maintainers 
to backport a commit)


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH qemu-kvm] kvm: rename kvm/test/test/powerpc to kvm/test/powerpc

2010-06-06 Thread Ken CC
To make the directory architecture match the file config file
kvm/test/config-powerpc.mk.

Signed-off-by: Ken CC 
---
 kvm/test/powerpc/44x/tlbsx.S   |   33 
 kvm/test/powerpc/44x/tlbwe.S   |   27 +++
 kvm/test/powerpc/44x/tlbwe_16KB.S  |   35 +
 kvm/test/powerpc/44x/tlbwe_hole.S  |   27 +++
 kvm/test/powerpc/cstart.S  |   38 
 kvm/test/powerpc/exit.c|   23 +++
 kvm/test/powerpc/helloworld.c  |   27 +++
 kvm/test/powerpc/io.S  |   32 +++
 kvm/test/powerpc/spin.S|4 +++
 kvm/test/powerpc/sprg.S|7 ++
 kvm/test/test/powerpc/44x/tlbsx.S  |   33 
 kvm/test/test/powerpc/44x/tlbwe.S  |   27 ---
 kvm/test/test/powerpc/44x/tlbwe_16KB.S |   35 -
 kvm/test/test/powerpc/44x/tlbwe_hole.S |   27 ---
 kvm/test/test/powerpc/cstart.S |   38 
 kvm/test/test/powerpc/exit.c   |   23 ---
 kvm/test/test/powerpc/helloworld.c |   27 ---
 kvm/test/test/powerpc/io.S |   32 ---
 kvm/test/test/powerpc/spin.S   |4 ---
 kvm/test/test/powerpc/sprg.S   |7 --
 20 files changed, 253 insertions(+), 253 deletions(-)
 create mode 100644 kvm/test/powerpc/44x/tlbsx.S
 create mode 100644 kvm/test/powerpc/44x/tlbwe.S
 create mode 100644 kvm/test/powerpc/44x/tlbwe_16KB.S
 create mode 100644 kvm/test/powerpc/44x/tlbwe_hole.S
 create mode 100644 kvm/test/powerpc/cstart.S
 create mode 100644 kvm/test/powerpc/exit.c
 create mode 100644 kvm/test/powerpc/helloworld.c
 create mode 100644 kvm/test/powerpc/io.S
 create mode 100644 kvm/test/powerpc/spin.S
 create mode 100644 kvm/test/powerpc/sprg.S
 delete mode 100644 kvm/test/test/powerpc/44x/tlbsx.S
 delete mode 100644 kvm/test/test/powerpc/44x/tlbwe.S
 delete mode 100644 kvm/test/test/powerpc/44x/tlbwe_16KB.S
 delete mode 100644 kvm/test/test/powerpc/44x/tlbwe_hole.S
 delete mode 100644 kvm/test/test/powerpc/cstart.S
 delete mode 100644 kvm/test/test/powerpc/exit.c
 delete mode 100644 kvm/test/test/powerpc/helloworld.c
 delete mode 100644 kvm/test/test/powerpc/io.S
 delete mode 100644 kvm/test/test/powerpc/spin.S
 delete mode 100644 kvm/test/test/powerpc/sprg.S

diff --git a/kvm/test/powerpc/44x/tlbsx.S b/kvm/test/powerpc/44x/tlbsx.S
new file mode 100644
index 000..b15874b
--- /dev/null
+++ b/kvm/test/powerpc/44x/tlbsx.S
@@ -0,0 +1,33 @@
+#define SPRN_MMUCR 0x3b2
+
+#define TLBWORD0 0x1210
+#define TLBWORD1 0x1000
+#define TLBWORD2 0x0003
+
+.global _start
+_start:
+   li  r4, 0
+   mtspr   SPRN_MMUCR, r4
+
+   li  r3, 23
+
+   lis r4, tlbwo...@h
+   ori r4, r4, tlbwo...@l
+   tlbwe   r4, r3, 0
+
+   lis r4, tlbwo...@h
+   ori r4, r4, tlbwo...@l
+   tlbwe   r4, r3, 1
+
+   lis r4, tlbwo...@h
+   ori r4, r4, tlbwo...@l
+   tlbwe   r4, r3, 2
+
+   lis r4, 0x1000
+   tlbsx   r5, r4, r0
+   cmpwi   r5, 23
+   beq good
+   trap
+
+good:
+   b   .
diff --git a/kvm/test/powerpc/44x/tlbwe.S b/kvm/test/powerpc/44x/tlbwe.S
new file mode 100644
index 000..ec6ef5c
--- /dev/null
+++ b/kvm/test/powerpc/44x/tlbwe.S
@@ -0,0 +1,27 @@
+#define SPRN_MMUCR 0x3b2
+
+/* Create a mapping at 4MB */
+#define TLBWORD0 0x00400210
+#define TLBWORD1 0x0040
+#define TLBWORD2 0x0003
+
+.global _start
+_start:
+   li  r4, 0
+   mtspr   SPRN_MMUCR, r4
+
+   li  r3, 23
+
+   lis r4, tlbwo...@h
+   ori r4, r4, tlbwo...@l
+   tlbwe   r4, r3, 0
+
+   lis r4, tlbwo...@h
+   ori r4, r4, tlbwo...@l
+   tlbwe   r4, r3, 1
+
+   lis r4, tlbwo...@h
+   ori r4, r4, tlbwo...@l
+   tlbwe   r4, r3, 2
+
+   b   .
diff --git a/kvm/test/powerpc/44x/tlbwe_16KB.S 
b/kvm/test/powerpc/44x/tlbwe_16KB.S
new file mode 100644
index 000..1bd10bf
--- /dev/null
+++ b/kvm/test/powerpc/44x/tlbwe_16KB.S
@@ -0,0 +1,35 @@
+#define SPRN_MMUCR 0x3b2
+
+/* 16KB mapping at 4MB */
+#define TLBWORD0 0x00400220
+#define TLBWORD1 0x0040
+#define TLBWORD2 0x0003
+
+.global _start
+_start:
+   li  r4, 0
+   mtspr   SPRN_MMUCR, r4
+
+   li  r3, 5
+
+   lis r4, tlbwo...@h
+   ori r4, r4, tlbwo...@l
+   tlbwe   r4, r3, 0
+
+   lis r4, tlbwo...@h
+   ori r4, r4, tlbwo...@l
+   tlbwe   r4, r3, 1
+
+   lis r4, tlbwo...@h
+   ori r4, r4, tlbwo...@l
+   tlbwe   r4, r3, 2
+
+   /* load from 4MB */
+   lis r3, 0x0040
+   lwz r4, 0(r3)
+
+   /* load from 4MB+8KB */
+   ori r3, r3, 0x2000
+

[PATCH 1/4] KVM: MMU: Introduce drop_spte()

2010-06-06 Thread Avi Kivity
When we call rmap_remove(), we (almost) always immediately follow it by
an __set_spte() to a nonpresent pte.  Since we need to perform the two
operations atomically, to avoid losing the dirty and accessed bits, introduce
a helper drop_spte() and convert all call sites.

The operation is still nonatomic at this point.

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/mmu.c |   33 +++--
 arch/x86/kvm/paging_tmpl.h |   13 ++---
 2 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6b2c644..17331c2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -666,6 +666,12 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
}
 }
 
+static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte)
+{
+   rmap_remove(kvm, sptep);
+   __set_spte(sptep, new_spte);
+}
+
 static u64 *rmap_next(struct kvm *kvm, unsigned long *rmapp, u64 *spte)
 {
struct kvm_rmap_desc *desc;
@@ -731,9 +737,9 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
BUG_ON((*spte & (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) != 
(PT_PAGE_SIZE_MASK|PT_PRESENT_MASK));
pgprintk("rmap_write_protect(large): spte %p %llx 
%lld\n", spte, *spte, gfn);
if (is_writable_pte(*spte)) {
-   rmap_remove(kvm, spte);
+   drop_spte(kvm, spte,
+ shadow_trap_nonpresent_pte);
--kvm->stat.lpages;
-   __set_spte(spte, shadow_trap_nonpresent_pte);
spte = NULL;
write_protected = 1;
}
@@ -753,8 +759,7 @@ static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long 
*rmapp,
while ((spte = rmap_next(kvm, rmapp, NULL))) {
BUG_ON(!(*spte & PT_PRESENT_MASK));
rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
-   rmap_remove(kvm, spte);
-   __set_spte(spte, shadow_trap_nonpresent_pte);
+   drop_spte(kvm, spte, shadow_trap_nonpresent_pte);
need_tlb_flush = 1;
}
return need_tlb_flush;
@@ -776,8 +781,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long 
*rmapp,
rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", spte, *spte);
need_flush = 1;
if (pte_write(*ptep)) {
-   rmap_remove(kvm, spte);
-   __set_spte(spte, shadow_trap_nonpresent_pte);
+   drop_spte(kvm, spte, shadow_trap_nonpresent_pte);
spte = rmap_next(kvm, rmapp, NULL);
} else {
new_spte = *spte &~ (PT64_BASE_ADDR_MASK);
@@ -1498,13 +1502,14 @@ static void kvm_mmu_page_unlink_children(struct kvm 
*kvm,
ent &= PT64_BASE_ADDR_MASK;
mmu_page_remove_parent_pte(page_header(ent),
   &pt[i]);
+   pt[i] = shadow_trap_nonpresent_pte;
} else {
if (is_large_pte(ent))
--kvm->stat.lpages;
-   rmap_remove(kvm, &pt[i]);
+   drop_spte(kvm, &pt[i],
+ shadow_trap_nonpresent_pte);
}
}
-   pt[i] = shadow_trap_nonpresent_pte;
}
 }
 
@@ -1902,9 +1907,8 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
if (level > PT_PAGE_TABLE_LEVEL &&
has_wrprotected_page(vcpu->kvm, gfn, level)) {
ret = 1;
-   rmap_remove(vcpu->kvm, sptep);
-   spte = shadow_trap_nonpresent_pte;
-   goto set_pte;
+   drop_spte(vcpu->kvm, sptep, shadow_trap_nonpresent_pte);
+   goto done;
}
 
spte |= PT_WRITABLE_MASK;
@@ -1936,6 +1940,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 
 set_pte:
__set_spte(sptep, spte);
+done:
return ret;
 }
 
@@ -1972,7 +1977,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*sptep,
} else if (pfn != spte_to_pfn(*sptep)) {
pgprintk("hfn old %lx new %lx\n",
 spte_to_pfn(*sptep), pfn);
-   rmap_remove(vcpu->kvm, sptep);
+   drop_spte(vcpu->kvm, sptep, shadow_trap_nonpresent_pte);
} else
was_rmapped = 1;
}
@@ -2623,13 +2628,13 @@ static void mmu_pte_write_zap_pte(struct kvm_vcpu *vcpu,
pte = *spte;
if (is_shadow_present_

[PATCH 3/4] KVM: MMU: Atomically check for accessed bit when dropping an spte

2010-06-06 Thread Avi Kivity
Currently, in the window between the check for the accessed bit, and actually
dropping the spte, a vcpu can access the page through the spte and set the bit,
which will be ignored by the mmu.

Fix by using an exchange operation to atmoically fetch the spte and drop it.

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/mmu.c |   26 +-
 1 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f93948d..b565a14 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -290,6 +290,21 @@ static void __set_spte(u64 *sptep, u64 spte)
 #endif
 }
 
+static u64 __xchg_spte(u64 *sptep, u64 new_spte)
+{
+#ifdef CONFIG_X86_64
+   return xchg(sptep, new_spte);
+#else
+   u64 old_spte;
+
+   do {
+   old_spte = *sptep;
+   } while (cmpxchg64(sptep, old_spte, new_spte) != old_spte);
+
+   return old;
+#endif
+}
+
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
  struct kmem_cache *base_cache, int min)
 {
@@ -661,16 +676,17 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
 static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte)
 {
pfn_t pfn;
+   u64 old_spte;
 
-   if (!is_rmap_spte(*sptep))
+   old_spte = __xchg_spte(sptep, new_spte);
+   if (!is_rmap_spte(old_spte))
return;
-   pfn = spte_to_pfn(*sptep);
-   if (*sptep & shadow_accessed_mask)
+   pfn = spte_to_pfn(old_spte);
+   if (old_spte & shadow_accessed_mask)
kvm_set_pfn_accessed(pfn);
-   if (is_writable_pte(*sptep))
+   if (is_writable_pte(old_spte))
kvm_set_pfn_dirty(pfn);
rmap_remove(kvm, sptep);
-   __set_spte(sptep, new_spte);
 }
 
 static u64 *rmap_next(struct kvm *kvm, unsigned long *rmapp, u64 *spte)
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] KVM: MMU: Don't drop accessed bit while updating an spte

2010-06-06 Thread Avi Kivity
__set_spte() will happily replace an spte with the accessed bit set with
one that has the accessed bit clear.  Add a helper update_spte() which checks
for this condition and updates the page flag if needed.

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/mmu.c |   25 +
 1 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index b565a14..4c98726 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -305,6 +305,19 @@ static u64 __xchg_spte(u64 *sptep, u64 new_spte)
 #endif
 }
 
+static void update_spte(u64 *sptep, u64 new_spte)
+{
+   u64 old_spte;
+
+   if (!shadow_accessed_mask || (new_spte & shadow_accessed_mask)) {
+   __set_spte(sptep, new_spte);
+   } else {
+   old_spte = __xchg_spte(sptep, new_spte);
+   if (old_spte & shadow_accessed_mask)
+   mark_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
+   }
+}
+
 static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
  struct kmem_cache *base_cache, int min)
 {
@@ -730,7 +743,7 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
BUG_ON(!(*spte & PT_PRESENT_MASK));
rmap_printk("rmap_write_protect: spte %p %llx\n", spte, *spte);
if (is_writable_pte(*spte)) {
-   __set_spte(spte, *spte & ~PT_WRITABLE_MASK);
+   update_spte(spte, *spte & ~PT_WRITABLE_MASK);
write_protected = 1;
}
spte = rmap_next(kvm, rmapp, spte);
@@ -786,7 +799,7 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long 
*rmapp,
 unsigned long data)
 {
int need_flush = 0;
-   u64 *spte, new_spte;
+   u64 *spte, new_spte, old_spte;
pte_t *ptep = (pte_t *)data;
pfn_t new_pfn;
 
@@ -806,9 +819,13 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned 
long *rmapp,
 
new_spte &= ~PT_WRITABLE_MASK;
new_spte &= ~SPTE_HOST_WRITEABLE;
+   new_spte &= ~shadow_accessed_mask;
if (is_writable_pte(*spte))
kvm_set_pfn_dirty(spte_to_pfn(*spte));
-   __set_spte(spte, new_spte);
+   old_spte = __xchg_spte(spte, new_spte);
+   if (is_shadow_present_pte(old_spte)
+   && (old_spte & shadow_accessed_mask))
+   
mark_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
spte = rmap_next(kvm, rmapp, spte);
}
}
@@ -1956,7 +1973,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
mark_page_dirty(vcpu->kvm, gfn);
 
 set_pte:
-   __set_spte(sptep, spte);
+   update_spte(sptep, spte);
 done:
return ret;
 }
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/4] Fix accessed bit tracking

2010-06-06 Thread Avi Kivity
The kvm mmu synchronizes shadow ptes using the mmu lock, however the cpu
will happily ignore the lock when setting the accessed bit.  This can cause
the accessed bit to be lost.  Luckily this only results in incorrect page
selection for swap.

This patchset fixes the problem by atomically updating the spte when
needed while taking care of the accessed bit.

Avi Kivity (4):
  KVM: MMU: Introduce drop_spte()
  KVM: MMU: Move accessed/dirty bit checks from rmap_remove() to
drop_spte()
  KVM: MMU: Atomically check for accessed bit when dropping an spte
  KVM: MMU: Don't drop accessed bit while updating an spte

 arch/x86/kvm/mmu.c |   91 +++
 arch/x86/kvm/paging_tmpl.h |   13 +++---
 2 files changed, 71 insertions(+), 33 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4] KVM: MMU: Move accessed/dirty bit checks from rmap_remove() to drop_spte()

2010-06-06 Thread Avi Kivity
Since we need to make the check atomic, move it to the place that will
set the new spte.

Signed-off-by: Avi Kivity 
---
 arch/x86/kvm/mmu.c |   17 +
 1 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 17331c2..f93948d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -620,19 +620,11 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
struct kvm_rmap_desc *desc;
struct kvm_rmap_desc *prev_desc;
struct kvm_mmu_page *sp;
-   pfn_t pfn;
gfn_t gfn;
unsigned long *rmapp;
int i;
 
-   if (!is_rmap_spte(*spte))
-   return;
sp = page_header(__pa(spte));
-   pfn = spte_to_pfn(*spte);
-   if (*spte & shadow_accessed_mask)
-   kvm_set_pfn_accessed(pfn);
-   if (is_writable_pte(*spte))
-   kvm_set_pfn_dirty(pfn);
gfn = kvm_mmu_page_get_gfn(sp, spte - sp->spt);
rmapp = gfn_to_rmap(kvm, gfn, sp->role.level);
if (!*rmapp) {
@@ -668,6 +660,15 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
 
 static void drop_spte(struct kvm *kvm, u64 *sptep, u64 new_spte)
 {
+   pfn_t pfn;
+
+   if (!is_rmap_spte(*sptep))
+   return;
+   pfn = spte_to_pfn(*sptep);
+   if (*sptep & shadow_accessed_mask)
+   kvm_set_pfn_accessed(pfn);
+   if (is_writable_pte(*sptep))
+   kvm_set_pfn_dirty(pfn);
rmap_remove(kvm, sptep);
__set_spte(sptep, new_spte);
 }
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


mmu_notifers, pte_dirty questions

2010-06-06 Thread Avi Kivity

Why no notifer when testing and clearing the dirty bit?

(*clear_flush_dirty)(...).


static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
unsigned long address)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *pte;
spinlock_t *ptl;
int ret = 0;

pte = page_check_address(page, mm, address, &ptl, 1);
if (!pte)
goto out;

if (pte_dirty(*pte) || pte_write(*pte)) {
pte_t entry;

flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush_notify(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);


set_pte_at_notify()?  without this (or clear_flush_dirty) Linux will 
assume all ptes are now clean; if the guest writes to a page nothing 
will catch it.


-> with set_pte_at_notify(), we can drop the spte and mark the page as 
dirty, so the next write will re-instantiate the spte
-> with ->clear_flush_dirty(), we can track the dirty state without 
dropping the spte.



ret = 1;
}

pte_unmap_unlock(pte, ptl);
out:
return ret;


I'm probably missing something big as I can't see how this works.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Print a user-friendly message on failed vmentry

2010-06-06 Thread Mohammed Gamal
This patch address bug report in https://bugs.launchpad.net/qemu/+bug/530077.

Failed vmentries were handled with handle_unhandled() which prints a rather
unfriendly message to the user. This patch separates handling vmentry failures
from unknown exit reasons and prints a friendly message to the user.

Signed-off-by: Mohammed Gamal 
---
 qemu-kvm.c |   20 +++-
 1 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/qemu-kvm.c b/qemu-kvm.c
index 9534b31..1a66f55 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -106,6 +106,24 @@ static int handle_unhandled(uint64_t reason)
 return -EINVAL;
 }
 
+#define VMX_INVALID_GUEST_STATE 0x8021
+
+static int handle_failed_vmentry(uint64_t reason)
+{
+fprintf(stderr, "kvm: vm entry failed with error 0x%" PRIx64 "\n\n", 
reason);
+
+/* Perhaps we will need to check if this machine is intel since exit 
reason 0x21 
+   has a different interpretation on SVM */
+if (reason == VMX_INVALID_GUEST_STATE) {
+fprintf(stderr, "If you're runnning a guest on an Intel machine 
without\n");
+fprintf(stderr, "unrestricted mode support, the failure can be most 
likely\n");
+fprintf(stderr, "due to the guest entering an invalid state for Intel 
VT.\n");
+fprintf(stderr, "For example, the guest maybe running in big real 
mode\n");
+fprintf(stderr, "which is not supported on less recent Intel 
processors.\n\n");
+}
+
+return -EINVAL;
+}
 
 static inline void set_gsi(kvm_context_t kvm, unsigned int gsi)
 {
@@ -586,7 +604,7 @@ int kvm_run(CPUState *env)
 r = handle_unhandled(run->hw.hardware_exit_reason);
 break;
 case KVM_EXIT_FAIL_ENTRY:
-r = 
handle_unhandled(run->fail_entry.hardware_entry_failure_reason);
+r = 
handle_failed_vmentry(run->fail_entry.hardware_entry_failure_reason);
 break;
 case KVM_EXIT_EXCEPTION:
 fprintf(stderr, "exception %d (%x)\n", run->ex.exception,
-- 
1.7.0.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/7] KVM: MMU: skip invalid sp when unprotect page

2010-06-06 Thread Avi Kivity

On 06/04/2010 04:52 PM, Xiao Guangrong wrote:

In kvm_mmu_unprotect_page(), the invalid sp can be skipped
   


Applied all, thanks.

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-06 Thread Michael S. Tsirkin
On Thu, Jun 03, 2010 at 02:41:38PM -0700, Tom Lyon wrote:
> OK, in the interest of making progress, I am about to embark on the following:
> 
> 1. Create a user-iommu-domain driver - opening it will give a new empty 
> domain.
> Ultimately this can also populate sysfs with the state of its world, 
> which would
> also be a good addition to the base iommu stuff.
> If someone closes the fd while in use, the domain stays valid anyway 
> until users
> drop off.
> 
> 2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver.  Require that
>a domain be set before using the VFIO_DMA_MAP_IOVA ioctl

Require domain to be set before you allow any access to the device:
mmap, write, read.  IMO this is the only safe way to make sure userspace
does not corrupt memory, and this removes the need to special-case
MSI memory, play with bus master enable and hope it can be cleared without
reset, etc.

> (this is the one
>that KVM wants).

Not sure I understand. I think that MAP should be done on the domain,
not the device, this handles pinning pages correctly and
this way you don't need any special checks.

>However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
>which uses the dma_sg interface which has no expicit control of domains. I
>intend to keep it the way it is, but expect only non-hypervisor programs 
> would
>want to use it.

If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't
non-hypervisors just pick an address?

> 3. Clean up the docs and other nits that folks have found.
> 
> Comments? 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv3 1/2] virtio: support layout with avail ring before idx

2010-06-06 Thread Michael S. Tsirkin
On Sat, Jun 05, 2010 at 01:40:26PM +0930, Rusty Russell wrote:
> On Fri, 4 Jun 2010 09:12:05 pm Michael S. Tsirkin wrote:
> > On Fri, Jun 04, 2010 at 08:46:49PM +0930, Rusty Russell wrote:
> > > I'm uncomfortable with moving a field.
> > > 
> > > We haven't done that before and I wonder what will break with old code.
> > 
> > With e.g. my patch, We only do this conditionally when bit is negotitated.
> 
> Of course, but see this change:
> 
> commit ef688e151c00e5d529703be9a04fd506df8bc54e
> Author: Rusty Russell 
> Date:   Fri Jun 12 22:16:35 2009 -0600
> 
> virtio: meet virtio spec by finalizing features before using device
> 
> Virtio devices are supposed to negotiate features before they start using
> the device, but the current code doesn't do this.  This is because the
> driver's probe() function invariably has to add buffers to a virtqueue,
> or probe the disk (virtio_blk).
> 
> This currently doesn't matter since no existing backend is strict about
> the feature negotiation.  But it's possible to imagine a future feature
> which completely changes how a device operates: in this case, we'd need
> to acknowledge it before using the device.
> 
> Signed-off-by: Rusty Russell 
> 
> Now, this isn't impossible to overcome: we know that if they use the ring
> before completing feature negotiation then they don't understand the new
> format.
> 
> But we have to be aware of that on the qemu side.  Are we?

I think we are ok. virtqueue_init which sets the avail/ysed pointers is
called when we write the base address.  So we only need to be careful
and not change this feature bit after creating the rings.


> > > Should we instead just abandon the flags field and use last_used only?
> > > Or, more radically, put flags == last_used when the feature is on?
> > > 
> > > Thoughts?
> > > Rusty.
> > 
> > Hmm, e.g. with TX and virtio net, we almost never want interrupts,
> > whatever the index value.
> 
> Good point.  OK, I give in, I'll take your patch which moves the fields
> to the end.  Is that your preference?

Yes, I think so.
You mean PATCHv3 unchanged with 254 byte padding?

> Please be careful with the qemu side though...
> 
> It's not inconceivable that I'll write that virtio cacheline simulator this
> (coming) week, too...
> 
> Thanks.
> Rusty.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Perf trace event parse errors for KVM events

2010-06-06 Thread Avi Kivity

On 06/04/2010 12:57 AM, Steven Rostedt wrote:

On Tue, 2010-06-01 at 15:39 +0300, Avi Kivity wrote:
   

On 06/01/2010 02:59 PM, Steven Rostedt wrote:
 
   

I meant that viewing would be slowed down.  It's an important part of
using ftrace!

How long does the Python formatter take to process 100k or 1M events?

 

I finally got around to testing this.

I ran a trace on lock_acquire, and traced 1,253,296 events.

I then created a python plugin to analyze the trace:


def lock_acquire(trace_seq, event):
 t = ''
 r = ''
 if int(event['flags'])&  1:
 t = 'try'
 if int(event['flags'])&  2:
 r = 'read'
 trace_seq.puts('t %x %s%s%s' % (
 event['lockdep_addr'], t, r,
 event['name']))

def register(pevent):
 pevent.register_event_handler("lock", "lock_acquire", lock_acquire)


Disclaimer, I'm not a python expert, and I'm sure the above python code
sucks.
   


I think Python more or less guarantees your code will suck no matter 
what you do.



[r...@ixf9 trace-cmd.git]# time ./trace-cmd report -N>/dev/null 2>&1

real0m4.653s
user0m4.234s
sys 0m0.419s

* -N keeps trace-cmd from loading any plugins.


[r...@ixf9 trace-cmd.git]# time PYTHONPATH=`pwd` ./trace-cmd report>/dev/null 
2>&1

real0m53.916s
user0m53.047s
sys 0m0.859s


Yes, running a python interpreter is a bit more expensive. It took 4
seconds to read the million events with plain C, but 53 seconds to read
it in python.

That said...  This would only affect you if you were writing this to a
file. I doubt that you would notice this if you were scanning the trace
with less.
   


I'm more worried about searching with less.  But a minute for a million 
events isn't that bad.



Also, I kicked this off in kernelshark, and it made no difference that I
can see. This is because kernelshark only evaluates the viewable area of
the screen.
   


Neat.  Can it also search?  Where can I find it? 

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH]: kdump on KVM

2010-06-06 Thread Avi Kivity

On 06/04/2010 11:11 PM, Chris Lalancette wrote:

Hello,
  I've gone back to look at the problem with kdump while running as a KVM
guest.  I'll preface this discussion by noting that all of the problems I
describe below only happen when the HPET emulation is shutoff.  However, since
there still seem to be some lingering issues with the HPET emulation, running
without the HPET is a reasonable mode of operation, and one in which it would
be nice to have kexec/kdump working.
  The basic problem with getting kdump working is that currently KVM
delivers the timer interrupts only to vcpu 0, so kexecing onto another
processor will never see timer interrupts.  In the KVM code, there are a
number of places where we make incorrect assumptions regarding the timer and
which vcpu to deliver it to:

1)  In the i8254.c code when the hrtimer representing the PIT expires.  In
this case, when we get the callback, we kick only the BSP.

2)  In the i8254.c code when a vcpu is migrated from one processor to another.
In this case we only migrate the PIT timer if the vcpu to be migrated is the
BSP.

3)  In the lapic code when deciding whether to accept a PIC interrupt, we only
accept interrupts on the BSP.

4)  In the irq_comm.c code when calling kvm_irq_delivery_to_apic().  The
problem here is that we don't take into account the fact that an LAPIC might
be disabled when trying to deliver an interrupt in DM_LOWEST mode.  Further,
on a kexec, the processor that we are kexec'ing *to* gets it's APIC ID
re-written to the BSP APIC ID.  What it means in the end is that we are
currently still matching against the BSP even though vcpu 1 (where the kexec
is happening) would match if we let it.

The attached patch takes care of 1), 3), and 4), and works in my testing (it
needs to be cleaned up a bit to not be so inefficient, but it should work).
However, problem 2) is pretty sticky.  The reason we are currently migrating
the PIT timer around with the BSP is pretty well explained in commit
2f5997140f22f68f6390c49941150d3fa8a95cb7.  With my new patch, though, we are
no longer guaranteeing that we are going to inject onto CPU 0.  I think we
can do something where when the hrtimer expires, we figure out which
vcpu will get the timer interrupt and IPI to that physical processor to cause
the VMEXIT.  Unfortunately it's racy because the expiration of the hrtimer is
de-coupled from the setting of the IRQ for the interrupt.


It's only racy because of (perhaps not immature) optimization. 
Logically, the flow of events is as follows:


1. the httimer fires
2. irq0 is toggled (perhaps via a workqueue since the hrtimer is in 
interrupt context)
3. the PIC and IOAPIC state machines run and decide which local APIC(s), 
if any, need to be notified

4. the local APICs are IPIed.

Perhaps we should start by deoptimizing the code to work like the above.

In light of the below, I'm not sure the race actually exists.


  That means that
the hrtimer could  expire, we could choose vcpu 2 (say), IPI to cause a
VMEXIT, but by the time it goes to VMENTER the guest has changed something in
the (IO)APIC(s) so that the set_irq logic chooses vcpu 3 to do the injection.
This would result in a delayed injection that 2f5997 is trying to avoid.
   


Not sure that's the race here.  Between the vmexit and vmenter we have 
the following trace:


  kvm_inject_pending_timer_irqs()
  kvm_inject_pit_timer_irqs()
  __inject_pit_timer_intr()
  kvm_set_irq()

So, even if we end up on the wrong vcpu, we will IPI the right one later.


Taking a step back, it seems to me that something along the lines of my
previous patchset (where we do set_irq directly from the hrtimer callback) is
the right way to go.  We would still need to IPI to the appropriate physical
cpu to cause a VMEXIT on the cpu we care about, but we would avoid the race I
describe in the previous paragraph.


I agree.  But we also want the vcpu that accepts the interrupt to pull 
the hrtimer after it (like it does now) to minimize cross-cpu talk.



Unfortunately that patchset is also much
more risky.
   


I'd like a minimal, backportable patchset first, then bigger changes 
later.  We have at least two choices, irq-safe pit and ioapic, and using 
a workqueue for to match impedances.  The latter is a lot safer since we 
already do it for assigned devices.



diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 3c4f8f6..bc40143 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -231,7 +231,13 @@ int pit_has_pending_timer(struct kvm_vcpu *vcpu)
  {
struct kvm_pit *pit = vcpu->kvm->arch.vpit;

-   if (pit&&  kvm_vcpu_is_bsp(vcpu)&&  pit->pit_state.irq_ack)
+   /* We only want to inject a single IRQ for each PIT timer that goes
+* off.  If 2 vcpus happen to come in here at the same time, we may
+* erroneously tell one of them that it has a timer pending.  That's
+* OK, though; in kvm_inject_pit_timer_irqs we will make sure to only
+* let one of