Re: [PATCH v4 09/22] powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.

2020-05-28 Thread Paul Mackerras
On Thu, May 28, 2020 at 11:31:04AM +0530, Aneesh Kumar K.V wrote:
> On 5/28/20 7:13 AM, Paul Mackerras wrote:
> > On Tue, May 05, 2020 at 12:47:16PM +0530, Aneesh Kumar K.V wrote:
> > > The locking rules for walking a partition scoped table are different
> > > from those for a process scoped table. Hence add a helper for the
> > > secondary linux page table walk and also add a check that we are
> > > holding the right locks.
> > 
> > This patch is causing new warnings to appear when testing migration,
> > like this:
> > 
> > [  142.090159] [ cut here ]
> > [  142.090160] find_kvm_secondary_pte called with kvm mmu_lock not held
> > [  142.090176] WARNING: CPU: 23 PID: 5341 at 
> > arch/powerpc/include/asm/kvm_book3s_64.h:644 
> > kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
> > [  142.090177] Modules linked in: xt_conntrack nf_conntrack nf_defrag_ipv6 
> > nf_defrag_ipv4 libcrc32c bridge stp llc ebtable_filter ebtables 
> > ip6table_filter ip6_tables bpfilter overlay binfmt_misc input_leds 
> > raid_class scsi_transport_sas sch_fq_codel sunrpc kvm_hv kvm
> > [  142.090188] CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: GW  
> >5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
> > [  142.090189] NIP:  c00800fe848c LR: c00800fe8488 CTR: 
> > 
> > [  142.090190] REGS: c01e19f077e0 TRAP: 0700   Tainted: GW  
> > (5.7.0-rc5-kvm-00211-g9ccf10d6d088)
> > [  142.090191] MSR:  90029033   CR: 
> > 4422  XER: 2004
> > [  142.090196] CFAR: c012f5ac IRQMASK: 0
> > GPR00: c00800fe8488 c01e19f07a70 c00800ffe200 
> > 0039
> > GPR04: 0001 c01ffc8b4900 00018840 
> > 0007
> > GPR08: 0003 0001 0007 
> > 0001
> > GPR12: 2000 c01fff6d9400 00011f884678 
> > 7fff70b7
> > GPR16: 7fff7137cb90 7fff7dcb4410 0001 
> > 
> > GPR20: 0ffe  0001 
> > 
> > GPR24: 8000 0001 c01e1f67e600 
> > c01e1fd82410
> > GPR28: 1000 c01e2e41 0fff 
> > 0ffe
> > [  142.090217] NIP [c00800fe848c] 
> > kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
> > [  142.090223] LR [c00800fe8488] 
> > kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv]
> > [  142.090224] Call Trace:
> > [  142.090230] [c01e19f07a70] [c00800fe8488] 
> > kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
> > [  142.090236] [c01e19f07b50] [c00800fd42e4] 
> > kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
> > [  142.090292] [c01e19f07be0] [c00800eea878] 
> > kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
> > [  142.090300] [c01e19f07c00] [c00800edc818] 
> > kvm_vm_ioctl+0x2b0/0xc00 [kvm]
> > [  142.090302] [c01e19f07d50] [c046e148] ksys_ioctl+0xf8/0x150
> > [  142.090305] [c01e19f07da0] [c046e1c8] sys_ioctl+0x28/0x80
> > [  142.090307] [c01e19f07dc0] [c003652c] 
> > system_call_exception+0x16c/0x240
> > [  142.090309] [c01e19f07e20] [c000d070] 
> > system_call_common+0xf0/0x278
> > [  142.090310] Instruction dump:
> > [  142.090312] 7d3a512a 4200ffd0 7ffefb78 4bfffdc4 6000 3c82 
> > e8848468 3c62
> > [  142.090317] e86384a8 38840010 4800673d e8410018 <0fe0> 4bfffdd4 
> > 6000 6000
> > [  142.090322] ---[ end trace 619d45057b6919e0 ]---
> > 
> > Indeed, kvm_radix_test_clear_dirty() tests the PTE dirty bit
> > locklessly, and only takes the kvm->mmu_lock once it finds a dirty
> > PTE.  I think that is important for performance, since on any given
> > scan of the guest real address space we may only find a small
> > proportion of the guest pages to be dirty.
> > 
> > Are you now relying on the kvm->mmu_lock to protect the existence of
> > the PTEs, or just their content?
> > 
> 
> The patch series should not change any rules w.r.t. the kvm partition
> scoped page table walk. We only added helpers to make it explicit that
> this is different from a regular linux page table walk. And kvm->mmu_lock
> is what was used to protect the partition scoped table walk earlier.
> 
> In this specific case, what we probably need is an open coded kvm
> partition scoped walk with a comment explaining why it is OK to walk the
> partition scoped table without taking kvm->mmu_lock.
> 
> What happens when a parallel invalidate hits the QEMU address space?
> Since we are not holding kvm->mmu_lock, the mmu notifier will complete
> and the unmap of the partition scoped table will go ahead while it is
> still being walked.
> 
> Do we need a change like below?
> 
> @@ -1040,7 +1040,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
>  {
>   unsigned long gfn = memslot->base_gfn + pagenum;
>   

Re: [PATCH v4 09/22] powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.

2020-05-28 Thread Aneesh Kumar K.V

On 5/28/20 7:13 AM, Paul Mackerras wrote:

On Tue, May 05, 2020 at 12:47:16PM +0530, Aneesh Kumar K.V wrote:

The locking rules for walking a partition scoped table are different from
those for a process scoped table. Hence add a helper for the secondary
linux page table walk and also add a check that we are holding the right
locks.


This patch is causing new warnings to appear when testing migration,
like this:

[ warning trace trimmed; identical to the one quoted earlier in the thread ]

Indeed, kvm_radix_test_clear_dirty() tests the PTE dirty bit
locklessly, and only takes the kvm->mmu_lock once it finds a dirty
PTE.  I think that is important for performance, since on any given
scan of the guest real address space we may only find a small
proportion of the guest pages to be dirty.

Are you now relying on the kvm->mmu_lock to protect the existence of
the PTEs, or just their content?



The patch series should not change any rules w.r.t. the kvm partition 
scoped page table walk. We only added helpers to make it explicit that 
this is different from a regular linux page table walk. And 
kvm->mmu_lock is what was used to protect the partition scoped table 
walk earlier.


In this specific case, what we probably need is an open coded kvm 
partition scoped walk with a comment explaining why it is OK to walk 
the partition scoped table without taking kvm->mmu_lock.


What happens when a parallel invalidate hits the QEMU address space? 
Since we are not holding kvm->mmu_lock, the mmu notifier will complete 
and the unmap of the partition scoped table will go ahead while it is 
still being walked.


Do we need a change like below?

@@ -1040,7 +1040,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
 {
unsigned long gfn = memslot->base_gfn + pagenum;
unsigned long gpa = gfn << PAGE_SHIFT;
-   pte_t *ptep;
+   pte_t *ptep, pte;
unsigned int shift;
int ret = 0;
unsigned long old, *rmapp;
@@ -1049,11 +1049,23 @@ static int kvm_radix_test_clear_dirty(struct kvm 
*kvm,

return ret;

ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
-   if (ptep && pte_present(*ptep) && pte_dirty(*ptep)) {
+   if (!ptep)
+  

Re: [PATCH v4 09/22] powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.

2020-05-27 Thread Paul Mackerras
On Tue, May 05, 2020 at 12:47:16PM +0530, Aneesh Kumar K.V wrote:
> The locking rules for walking a partition scoped table are different from
> those for a process scoped table. Hence add a helper for the secondary
> linux page table walk and also add a check that we are holding the right
> locks.

This patch is causing new warnings to appear when testing migration,
like this:

[ warning trace trimmed; identical to the one quoted earlier in the thread ]

Indeed, kvm_radix_test_clear_dirty() tests the PTE dirty bit
locklessly, and only takes the kvm->mmu_lock once it finds a dirty
PTE.  I think that is important for performance, since on any given
scan of the guest real address space we may only find a small
proportion of the guest pages to be dirty.

Are you now relying on the kvm->mmu_lock to protect the existence of
the PTEs, or just their content?

Paul.


[PATCH v4 09/22] powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.

2020-05-05 Thread Aneesh Kumar K.V
The locking rules for walking a partition scoped table are different from
those for a process scoped table. Hence add a helper for the secondary
linux page table walk and also add a check that we are holding the right
locks.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/kvm_book3s_64.h | 13 +
 arch/powerpc/kvm/book3s_64_mmu_radix.c   | 12 ++--
 arch/powerpc/kvm/book3s_hv_nested.c  |  2 +-
 3 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 04b2b927bb5a..2c2635967d6e 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include <asm/pte-walk.h>
 
 #ifdef CONFIG_PPC_PSERIES
 static inline bool kvmhv_on_pseries(void)
@@ -634,6 +635,18 @@ extern void kvmhv_remove_nest_rmap_range(struct kvm *kvm,
unsigned long gpa, unsigned long hpa,
unsigned long nbytes);
 
+static inline pte_t *find_kvm_secondary_pte(struct kvm *kvm, unsigned long ea,
+   unsigned *hshift)
+{
+   pte_t *pte;
+
+   VM_WARN(!spin_is_locked(&kvm->mmu_lock),
+   "%s called with kvm mmu_lock not held \n", __func__);
+   pte = __find_linux_pte(kvm->arch.pgtable, ea, NULL, hshift);
+
+   return pte;
+}
+
 #endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */
 
 #endif /* __ASM_KVM_BOOK3S_64_H__ */
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index aa12cd4078b3..c92d413eeaaf 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -981,11 +981,11 @@ int kvm_unmap_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
return 0;
}
 
-   ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
+   ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
 kvm->arch.lpid);
-   return 0;   
+   return 0;
 }
 
 /* Called with kvm->mmu_lock held */
@@ -1001,7 +1001,7 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot 
*memslot,
if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
return ref;
 
-   ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
+   ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
  gpa, shift);
@@ -1028,7 +1028,7 @@ int kvm_test_age_radix(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
return ref;
 
-   ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
+   ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep))
ref = 1;
return ref;
@@ -1048,7 +1048,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
return ret;
 
-   ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
+   ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
if (ptep && pte_present(*ptep) && pte_dirty(*ptep)) {
ret = 1;
if (shift)
@@ -1109,7 +1109,7 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
gpa = memslot->base_gfn << PAGE_SHIFT;
spin_lock(&kvm->mmu_lock);
for (n = memslot->npages; n; --n) {
-   ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
+   ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
 kvm->arch.lpid);
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
b/arch/powerpc/kvm/book3s_hv_nested.c
index dc97e5be76f6..7f1fc5db13ea 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -1362,7 +1362,7 @@ static long int __kvmhv_nested_page_fault(struct kvm_run 
*run,
/* See if can find translation in our partition scoped tables for L1 */
pte = __pte(0);
spin_lock(&kvm->mmu_lock);
-   pte_p = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
+   pte_p = find_kvm_secondary_pte(kvm, gpa, &shift);
if (!shift)
shift = PAGE_SHIFT;
if (pte_p)
-- 
2.26.2