Re: [PATCH 1/4] powerpc/64s: Clear on-stack exception marker upon exception return

2019-02-01 Thread Balbir Singh
On Sat, Feb 2, 2019 at 12:14 PM Balbir Singh  wrote:
>
> On Tue, Jan 22, 2019 at 10:57:21AM -0500, Joe Lawrence wrote:
> > From: Nicolai Stange 
> >
> > The ppc64 specific implementation of the reliable stacktracer,
> > save_stack_trace_tsk_reliable(), bails out and reports an "unreliable
> > trace" whenever it finds an exception frame on the stack. Stack frames
> > are classified as exception frames if the STACK_FRAME_REGS_MARKER magic,
> > as written by exception prologues, is found at a particular location.
> >
> > However, as observed by Joe Lawrence, it is possible in practice that
> > non-exception stack frames can alias with prior exception frames and thus,
> > that the reliable stacktracer can find a stale STACK_FRAME_REGS_MARKER on
> > the stack. This in turn falsely reports an unreliable stacktrace and blocks
> > any live patching transition from finishing. Said condition lasts until
> > the stack frame is overwritten/initialized by a function call or other means.
> >
> > In principle, we could mitigate this by making the exception frame
> > classification condition in save_stack_trace_tsk_reliable() stronger:
> > in addition to testing for STACK_FRAME_REGS_MARKER, we could also take into
> > account that for all exceptions executing on the kernel stack
> > - their stack frames' backlink pointers always match what is saved
> >   in their pt_regs instance's ->gpr[1] slot and that
> > - their exception frame size equals STACK_INT_FRAME_SIZE, a value
> >   uncommonly large for non-exception frames.
> >
> > However, while these are currently true, relying on them would make the
> > reliable stacktrace implementation more sensitive towards future changes in
> > the exception entry code. Note that false negatives, i.e. not detecting
> > exception frames, would silently break the live patching consistency model.
> >
> > Furthermore, certain other places (diagnostic stacktraces, perf, xmon)
> > rely on STACK_FRAME_REGS_MARKER as well.
> >
> > Make the exception exit code clear the on-stack STACK_FRAME_REGS_MARKER
> > for those exceptions running on the "normal" kernel stack and returning
> > to kernelspace: because the topmost frame is ignored by the reliable stack
> > tracer anyway, returns to userspace don't need to take care of clearing
> > the marker.
> >
> > Furthermore, as I don't have the ability to test this on Book 3E or
> > 32 bits, limit the change to Book 3S and 64 bits.
> >
> > Finally, make the HAVE_RELIABLE_STACKTRACE Kconfig option depend on
> > PPC_BOOK3S_64 for documentation purposes. Before this patch, it depended
> > on PPC64 && CPU_LITTLE_ENDIAN and because CPU_LITTLE_ENDIAN implies
> > PPC_BOOK3S_64, there's no functional change here.
> >
> > Fixes: df78d3f61480 ("powerpc/livepatch: Implement reliable stack tracing for the consistency model")
> > Reported-by: Joe Lawrence 
> > Signed-off-by: Nicolai Stange 
> > Signed-off-by: Joe Lawrence 
> > ---
> >  arch/powerpc/Kconfig   | 2 +-
> >  arch/powerpc/kernel/entry_64.S | 7 +++
> >  2 files changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> > index 2890d36eb531..73bf87b1d274 100644
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -220,7 +220,7 @@ config PPC
> >   select HAVE_PERF_USER_STACK_DUMP
> >   select HAVE_RCU_TABLE_FREE  if SMP
> >   select HAVE_REGS_AND_STACK_ACCESS_API
> > - select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
> > + select HAVE_RELIABLE_STACKTRACE if PPC_BOOK3S_64 && CPU_LITTLE_ENDIAN
> >   select HAVE_SYSCALL_TRACEPOINTS
> >   select HAVE_VIRT_CPU_ACCOUNTING
> >   select HAVE_IRQ_TIME_ACCOUNTING
> > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> > index 435927f549c4..a2c168b395d2 100644
> > --- a/arch/powerpc/kernel/entry_64.S
> > +++ b/arch/powerpc/kernel/entry_64.S
> > @@ -1002,6 +1002,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
> >   ld  r2,_NIP(r1)
> >   mtspr   SPRN_SRR0,r2
> >
> > + /*
> > +  * Leaving a stale exception_marker on the stack can confuse
> > +  * the reliable stack unwinder later on. Clear it.
> > +  */
> > + li  r2,0
> > + std r2,STACK_FRAME_OVERHEAD-16(r1)
> > +
>
> Could you please double check, r4 is already 0 at this point
> IIUC. So the change might be a simple
>
> std r4,STACK_FRAME_OVERHEAD-16(r1)
>

r4 is not 0, sorry for the noise

Balbir
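
For context, the classification that a stale marker defeats is roughly the
following (a simplified sketch of save_stack_trace_tsk_reliable(), not the
exact upstream code):

	unsigned long *stack = (unsigned long *)sp;

	/*
	 * stack[STACK_FRAME_MARKER] is the slot at STACK_FRAME_OVERHEAD - 16
	 * that the exception prologue fills with STACK_FRAME_REGS_MARKER
	 * ("regshere") -- the same slot the patch above clears on exception
	 * exit.
	 */
	if (stack[STACK_FRAME_MARKER] == STACK_FRAME_REGS_MARKER)
		return 1;	/* exception frame found: trace is unreliable */

A frame that merely aliases an old exception frame still has the magic value
in that slot, which is why clearing it on exception exit is enough.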


[PATCH V2 9/10] KVM: Add flush parameter for kvm_age_hva()

2019-02-01 Thread lantianyu1986
From: Lan Tianyu 

Add a flush parameter to kvm_age_hva() and move the TLB flush from
kvm_mmu_notifier_clear_flush_young() into kvm_age_hva(). kvm_age_hva()
can then check whether a TLB flush is necessary, i.e. whether the flush
parameter is set and the returned young count is greater than 0, and
flush only if both conditions are met.
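
For illustration, the generic notifier side described above would then look
roughly like this (a sketch based on the description, not the actual hunk
from virt/kvm/kvm_main.c):

	static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn,
						      struct mm_struct *mm,
						      unsigned long start,
						      unsigned long end)
	{
		struct kvm *kvm = mmu_notifier_to_kvm(mn);
		int young, idx;

		idx = srcu_read_lock(&kvm->srcu);
		spin_lock(&kvm->mmu_lock);

		/* Let the arch code both age the range and flush if needed. */
		young = kvm_age_hva(kvm, start, end, true);

		spin_unlock(&kvm->mmu_lock);
		srcu_read_unlock(&kvm->srcu, idx);

		return young;
	}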

Signed-off-by: Lan Tianyu 
---
 arch/arm/include/asm/kvm_host.h |  3 ++-
 arch/arm64/include/asm/kvm_host.h   |  3 ++-
 arch/mips/include/asm/kvm_host.h|  3 ++-
 arch/mips/kvm/mmu.c | 11 +--
 arch/powerpc/include/asm/kvm_host.h |  3 ++-
 arch/powerpc/kvm/book3s.c   | 10 --
 arch/powerpc/kvm/e500_mmu_host.c|  3 ++-
 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/mmu.c  | 10 --
 virt/kvm/arm/mmu.c  | 13 +++--
 virt/kvm/kvm_main.c |  6 ++
 11 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index ca56537b61bc..b3c6a6db8173 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -229,7 +229,8 @@ int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, 
pte_t pte);
 
 unsigned long kvm_arm_num_regs(struct kvm_vcpu *vcpu);
 int kvm_arm_copy_reg_indices(struct kvm_vcpu *vcpu, u64 __user *indices);
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end,
+   bool flush);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
 struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 7732d0ba4e60..182bbb2de60a 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -361,7 +361,8 @@ int __kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
 int kvm_unmap_hva_range(struct kvm *kvm,
unsigned long start, unsigned long end);
 int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end,
+   bool flush);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
 struct kvm_vcpu *kvm_arm_get_running_vcpu(void);
diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index d2abd98471e8..e055f49532c8 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -937,7 +937,8 @@ enum kvm_mips_fault_result kvm_trap_emul_gva_fault(struct 
kvm_vcpu *vcpu,
 int kvm_unmap_hva_range(struct kvm *kvm,
unsigned long start, unsigned long end);
 int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end);
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end,
+   bool flush);
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 
 /* Emulation */
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index 97e538a8c1be..288a22d70cf8 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -579,9 +579,16 @@ static int kvm_test_age_hva_handler(struct kvm *kvm, gfn_t 
gfn, gfn_t gfn_end,
return pte_young(*gpa_pte);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end)
+int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end,
+   bool flush)
 {
-   return handle_hva_to_gpa(kvm, start, end, kvm_age_hva_handler, NULL);
+   int young = handle_hva_to_gpa(kvm, start, end,
+   kvm_age_hva_handler, NULL);
+
+   if (flush && young > 0)
+   kvm_flush_remote_tlbs(kvm);
+
+   return young;
 }
 
 int kvm_test_age_hva(struct kvm *kvm, unsigned long hva)
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index 0f98f00da2ea..d160e6b8ccfb 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -70,7 +70,8 @@
 
 extern int kvm_unmap_hva_range(struct kvm *kvm,
   unsigned long start, unsigned long end);
-extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long 
end);
+extern int kvm_age_hva(struct kvm *kvm, unsigned long start, unsigned long end,
+  bool flush);
 extern int kvm_test_age_hva(struct kvm *kvm, unsigned long hva);
 extern int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);
 
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index bd1a677dd9e4..09a67ebbde8a 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -841,9 +841,15 @@ int kvm_unmap_hva_range(struct kvm *kvm, unsigned long 
start, unsigned long end)
return kvm->arch.kvm_ops->unmap_hva_range(kvm, start, end);
 }
 
-int kvm_age_hva(struct kvm *kvm, unsigned 

[PATCH V2 8/10] KVM: Use tlb range flush in the kvm_vm_ioctl_get/clear_dirty_log()

2019-02-01 Thread lantianyu1986
From: Lan Tianyu 

Use a TLB range flush to flush only the memslot's range in
kvm_vm_ioctl_get/clear_dirty_log(), instead of flushing the TLBs of the
entire EPT page table, when range flush is available.

Signed-off-by: Lan Tianyu 
---
 arch/x86/kvm/mmu.c |  8 +---
 arch/x86/kvm/mmu.h |  7 +++
 arch/x86/kvm/x86.c | 16 
 3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6b5e9bed6665..63b3e36530e3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -264,12 +264,6 @@ static void mmu_spte_set(u64 *sptep, u64 spte);
 static union kvm_mmu_page_role
 kvm_mmu_calc_root_page_role(struct kvm_vcpu *vcpu);
 
-
-static inline bool kvm_available_flush_tlb_with_range(void)
-{
-   return kvm_x86_ops->tlb_remote_flush_with_range;
-}
-
 static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
struct kvm_tlb_range *range)
 {
@@ -282,7 +276,7 @@ static void kvm_flush_remote_tlbs_with_range(struct kvm 
*kvm,
kvm_flush_remote_tlbs(kvm);
 }
 
-static void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
+void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
u64 start_gfn, u64 pages)
 {
struct kvm_tlb_range range;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index c7b333147c4a..dddab78d8ed8 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -63,6 +63,13 @@ void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool 
execonly,
 bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu);
 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
u64 fault_address, char *insn, int insn_len);
+void kvm_flush_remote_tlbs_with_address(struct kvm *kvm,
+   u64 start_gfn, u64 pages);
+
+static inline bool kvm_available_flush_tlb_with_range(void)
+{
+   return kvm_x86_ops->tlb_remote_flush_with_range;
+}
 
 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3d32b8f5728d..0f70e07abfa1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4445,9 +4445,13 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
kvm_dirty_log *log)
 * kvm_mmu_slot_remove_write_access().
 */
lockdep_assert_held(&kvm->slots_lock);
-   if (flush)
-   kvm_flush_remote_tlbs(kvm);
+   if (flush) {
+   struct kvm_memory_slot *memslot = kvm_get_memslot(kvm,
+   log->slot);
 
+   kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn,
+   memslot->npages);
+   }
mutex_unlock(&kvm->slots_lock);
return r;
 }
@@ -4472,9 +4476,13 @@ int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm, struct 
kvm_clear_dirty_log *lo
 * kvm_mmu_slot_remove_write_access().
 */
lockdep_assert_held(&kvm->slots_lock);
-   if (flush)
-   kvm_flush_remote_tlbs(kvm);
+   if (flush) {
+   struct kvm_memory_slot *memslot = kvm_get_memslot(kvm,
+   log->slot);
 
+   kvm_flush_remote_tlbs_with_address(kvm, memslot->base_gfn,
+   memslot->npages);
+   }
mutex_unlock(&kvm->slots_lock);
return r;
 }
-- 
2.14.4



[PATCH V2 7/10] KVM: Add kvm_get_memslot() to get memslot via slot id

2019-02-01 Thread lantianyu1986
From: Lan Tianyu 

Add kvm_get_memslot() to look up a struct kvm_memory_slot via its slot id
and remove the now-redundant code. The function will also be used
in the following changes.

Signed-off-by: Lan Tianyu 
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c  | 45 +++--
 2 files changed, 20 insertions(+), 26 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c38cc5eb7e73..aaa2b57eeb19 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -758,6 +758,7 @@ int kvm_get_dirty_log_protect(struct kvm *kvm,
  struct kvm_dirty_log *log, bool *flush);
 int kvm_clear_dirty_log_protect(struct kvm *kvm,
struct kvm_clear_dirty_log *log, bool *flush);
+struct kvm_memory_slot *kvm_get_memslot(struct kvm *kvm, u32 slot);
 
 void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
struct kvm_memory_slot *slot,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7ebe36a13045..b2097fa4b618 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1095,22 +1095,30 @@ static int kvm_vm_ioctl_set_memory_region(struct kvm 
*kvm,
return kvm_set_memory_region(kvm, mem);
 }
 
+struct kvm_memory_slot *kvm_get_memslot(struct kvm *kvm, u32 slot)
+{
+   struct kvm_memslots *slots;
+   int as_id, id;
+
+   as_id = slot >> 16;
+   id = (u16)slot;
+   if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
+   return NULL;
+
+   slots = __kvm_memslots(kvm, as_id);
+   return id_to_memslot(slots, id);
+}
+
 int kvm_get_dirty_log(struct kvm *kvm,
struct kvm_dirty_log *log, int *is_dirty)
 {
-   struct kvm_memslots *slots;
struct kvm_memory_slot *memslot;
-   int i, as_id, id;
unsigned long n;
unsigned long any = 0;
+   int i;
 
-   as_id = log->slot >> 16;
-   id = (u16)log->slot;
-   if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
-   return -EINVAL;
+   memslot = kvm_get_memslot(kvm, log->slot);
 
-   slots = __kvm_memslots(kvm, as_id);
-   memslot = id_to_memslot(slots, id);
if (!memslot->dirty_bitmap)
return -ENOENT;
 
@@ -1154,20 +1162,13 @@ EXPORT_SYMBOL_GPL(kvm_get_dirty_log);
 int kvm_get_dirty_log_protect(struct kvm *kvm,
struct kvm_dirty_log *log, bool *flush)
 {
-   struct kvm_memslots *slots;
struct kvm_memory_slot *memslot;
-   int i, as_id, id;
unsigned long n;
unsigned long *dirty_bitmap;
unsigned long *dirty_bitmap_buffer;
+   int i;
 
-   as_id = log->slot >> 16;
-   id = (u16)log->slot;
-   if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
-   return -EINVAL;
-
-   slots = __kvm_memslots(kvm, as_id);
-   memslot = id_to_memslot(slots, id);
+   memslot = kvm_get_memslot(kvm, log->slot);
 
dirty_bitmap = memslot->dirty_bitmap;
if (!dirty_bitmap)
@@ -1225,24 +1226,16 @@ EXPORT_SYMBOL_GPL(kvm_get_dirty_log_protect);
 int kvm_clear_dirty_log_protect(struct kvm *kvm,
struct kvm_clear_dirty_log *log, bool *flush)
 {
-   struct kvm_memslots *slots;
struct kvm_memory_slot *memslot;
-   int as_id, id;
gfn_t offset;
unsigned long i, n;
unsigned long *dirty_bitmap;
unsigned long *dirty_bitmap_buffer;
 
-   as_id = log->slot >> 16;
-   id = (u16)log->slot;
-   if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_USER_MEM_SLOTS)
-   return -EINVAL;
-
if ((log->first_page & 63) || (log->num_pages & 63))
return -EINVAL;
 
-   slots = __kvm_memslots(kvm, as_id);
-   memslot = id_to_memslot(slots, id);
+   memslot = kvm_get_memslot(kvm, log->slot);
 
dirty_bitmap = memslot->dirty_bitmap;
if (!dirty_bitmap)
-- 
2.14.4



[PATCH V2 6/10] KVM/MMU: Flush tlb directly in the kvm_mmu_slot_gfn_write_protect()

2019-02-01 Thread lantianyu1986
From: Lan Tianyu 

Flush the TLB directly in kvm_mmu_slot_gfn_write_protect()
when range flush is available.

Signed-off-by: Lan Tianyu 
---
 arch/x86/kvm/mmu.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d57574b49823..6b5e9bed6665 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1718,6 +1718,11 @@ bool kvm_mmu_slot_gfn_write_protect(struct kvm *kvm,
write_protected |= __rmap_write_protect(kvm, rmap_head, true);
}
 
+   if (write_protected && kvm_available_flush_tlb_with_range()) {
+   kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
+   write_protected = false;
+   }
+
return write_protected;
 }
 
-- 
2.14.4



[PATCH V2 5/10] KVM/MMU: Flush tlb with range list in sync_page()

2019-02-01 Thread lantianyu1986
From: Lan Tianyu 

Flush the TLB via the flush list function. A page is put on the
flush list when the return value of set_spte() includes the flag
SET_SPTE_NEED_REMOTE_TLB_FLUSH. kvm_flush_remote_tlbs_with_list()
checks whether the flush list is empty, and also whether range TLB
flush is available, falling back to the traditional flush if not.

Signed-off-by: Lan Tianyu 
---
Change since v1:
   Use check of list_empty in the kvm_flush_remote_tlbs_with_list()
   to determine flush or not instead of checking set_spte_ret.
 
arch/x86/kvm/paging_tmpl.h | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 6bdca39829bc..d84486e75345 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -970,7 +970,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp)
int i, nr_present = 0;
bool host_writable;
gpa_t first_pte_gpa;
-   int set_spte_ret = 0;
+   HLIST_HEAD(flush_list);
 
/* direct kvm_mmu_page can not be unsync. */
BUG_ON(sp->role.direct);
@@ -978,6 +978,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct 
kvm_mmu_page *sp)
first_pte_gpa = FNAME(get_level1_sp_gpa)(sp);
 
for (i = 0; i < PT64_ENT_PER_PAGE; i++) {
+   int set_spte_ret = 0;
unsigned pte_access;
pt_element_t gpte;
gpa_t pte_gpa;
@@ -1027,14 +1028,20 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, 
struct kvm_mmu_page *sp)
 
host_writable = sp->spt[i] & SPTE_HOST_WRITEABLE;
 
-   set_spte_ret |= set_spte(vcpu, &sp->spt[i],
+   set_spte_ret = set_spte(vcpu, &sp->spt[i],
 pte_access, PT_PAGE_TABLE_LEVEL,
 gfn, spte_to_pfn(sp->spt[i]),
 true, false, host_writable);
+
+   if (set_spte_ret & SET_SPTE_NEED_REMOTE_TLB_FLUSH) {
+   struct kvm_mmu_page *leaf_sp = page_header(sp->spt[i]
+   & PT64_BASE_ADDR_MASK);
+   hlist_add_head(&leaf_sp->flush_link, &flush_list);
+   }
+
}
 
-   if (set_spte_ret & SET_SPTE_NEED_REMOTE_TLB_FLUSH)
-   kvm_flush_remote_tlbs(vcpu->kvm);
+   kvm_flush_remote_tlbs_with_list(vcpu->kvm, &flush_list);
 
return nr_present;
 }
-- 
2.14.4



[PATCH V2 4/10] KVM/MMU: Introduce tlb flush with range list

2019-02-01 Thread lantianyu1986
From: Lan Tianyu 

Introduce a TLB flush with range list interface, using
struct kvm_mmu_page as the list entry, and use the flush list function in
kvm_mmu_commit_zap_page().

Signed-off-by: Lan Tianyu 
---
 arch/x86/kvm/mmu.c | 25 -
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 70cafd3f95ab..d57574b49823 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -289,6 +289,20 @@ static void kvm_flush_remote_tlbs_with_address(struct kvm 
*kvm,
 
range.start_gfn = start_gfn;
range.pages = pages;
+   range.flush_list = NULL;
+
+   kvm_flush_remote_tlbs_with_range(kvm, &range);
+}
+
+static void kvm_flush_remote_tlbs_with_list(struct kvm *kvm,
+   struct hlist_head *flush_list)
+{
+   struct kvm_tlb_range range;
+
+   if (hlist_empty(flush_list))
+   return;
+
+   range.flush_list = flush_list;
 
kvm_flush_remote_tlbs_with_range(kvm, &range);
 }
@@ -2708,6 +2722,7 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
struct list_head *invalid_list)
 {
struct kvm_mmu_page *sp, *nsp;
+   HLIST_HEAD(flush_list);
 
if (list_empty(invalid_list))
return;
@@ -2721,7 +2736,15 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
 * In addition, kvm_flush_remote_tlbs waits for all vcpus to exit
 * guest mode and/or lockless shadow page table walks.
 */
-   kvm_flush_remote_tlbs(kvm);
+   if (kvm_available_flush_tlb_with_range()) {
+   list_for_each_entry(sp, invalid_list, link)
+   if (sp->last_level)
+   hlist_add_head(&sp->flush_link, &flush_list);
+
+   kvm_flush_remote_tlbs_with_list(kvm, &flush_list);
+   } else {
+   kvm_flush_remote_tlbs(kvm);
+   }
 
list_for_each_entry_safe(sp, nsp, invalid_list, link) {
WARN_ON(!sp->role.invalid || sp->root_count);
-- 
2.14.4



[PATCH V2 3/10] KVM/MMU: Add last_level in the struct mmu_spte_page

2019-02-01 Thread lantianyu1986
From: Lan Tianyu 

Add last_level to struct kvm_mmu_page. When building the TLB range
flush list, last_level will be used to identify whether the
page should be added to the list.

Signed-off-by: Lan Tianyu 
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/mmu.c  | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4a3d3e58fe0a..9d858d68c17a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -325,6 +325,7 @@ struct kvm_mmu_page {
struct hlist_node flush_link;
struct hlist_node hash_link;
bool unsync;
+   bool last_level;
 
/*
 * The following two entries are used to key the shadow page in the
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ce770b446238..70cafd3f95ab 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2918,6 +2918,9 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 
if (level > PT_PAGE_TABLE_LEVEL)
spte |= PT_PAGE_SIZE_MASK;
+
+   sp->last_level = is_last_spte(spte, level);
+
if (tdp_enabled)
spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
kvm_is_mmio_pfn(pfn));
-- 
2.14.4



[PATCH V2 2/10] KVM/VMX: Fill range list in kvm_fill_hv_flush_list_func()

2019-02-01 Thread lantianyu1986
From: Lan Tianyu 

Populate the ranges on the flush list into struct hv_guest_mapping_flush_list
when a flush list is available in struct kvm_tlb_range.

Signed-off-by: Lan Tianyu 
---
Change since v1:
   Make flush list as a "hlist" instead of a "list" in order to 
   keep struct kvm_mmu_page size.

arch/x86/include/asm/kvm_host.h |  7 +++
 arch/x86/kvm/vmx/vmx.c  | 18 --
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 49f449f56434..4a3d3e58fe0a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -317,6 +317,12 @@ struct kvm_rmap_head {
 
 struct kvm_mmu_page {
struct list_head link;
+
+   /*
+* Tlb flush with range list uses struct kvm_mmu_page as list entry
+* and all list operations should be under protection of mmu_lock.
+*/
+   struct hlist_node flush_link;
struct hlist_node hash_link;
bool unsync;
 
@@ -443,6 +449,7 @@ struct kvm_mmu {
 struct kvm_tlb_range {
u64 start_gfn;
u64 pages;
+   struct hlist_head *flush_list;
 };
 
 enum pmc_type {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9d954b4adce3..6452d0efd2cc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -427,9 +427,23 @@ int kvm_fill_hv_flush_list_func(struct 
hv_guest_mapping_flush_list *flush,
void *data)
 {
struct kvm_tlb_range *range = data;
+   struct kvm_mmu_page *sp;
 
-   return hyperv_fill_flush_guest_mapping_list(flush, 0, range->start_gfn,
-   range->pages);
+   if (!range->flush_list) {
+   return hyperv_fill_flush_guest_mapping_list(flush,
+   0, range->start_gfn, range->pages);
+   } else {
+   int offset = 0;
+
+   hlist_for_each_entry(sp, range->flush_list, flush_link) {
+   int pages = KVM_PAGES_PER_HPAGE(sp->role.level);
+
+   offset = hyperv_fill_flush_guest_mapping_list(flush,
+   offset, sp->gfn, pages);
+   }
+
+   return offset;
+   }
 }
 
 static inline int __hv_remote_flush_tlb_with_range(struct kvm *kvm,
-- 
2.14.4



[PATCH V2 1/10] X86/Hyper-V: Add parameter offset for hyperv_fill_flush_guest_mapping_list()

2019-02-01 Thread lantianyu1986
From: Lan Tianyu 

Add an offset parameter to specify the start position at which to add flush
ranges in the guest address list of struct hv_guest_mapping_flush_list.

Signed-off-by: Lan Tianyu 
---
arch/x86/hyperv/nested.c| 4 ++--
 arch/x86/include/asm/mshyperv.h | 2 +-
 arch/x86/kvm/vmx/vmx.c  | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/hyperv/nested.c b/arch/x86/hyperv/nested.c
index dd0a843f766d..96f8bac7476d 100644
--- a/arch/x86/hyperv/nested.c
+++ b/arch/x86/hyperv/nested.c
@@ -58,11 +58,11 @@ EXPORT_SYMBOL_GPL(hyperv_flush_guest_mapping);
 
 int hyperv_fill_flush_guest_mapping_list(
struct hv_guest_mapping_flush_list *flush,
-   u64 start_gfn, u64 pages)
+   int offset, u64 start_gfn, u64 pages)
 {
u64 cur = start_gfn;
u64 additional_pages;
-   int gpa_n = 0;
+   int gpa_n = offset;
 
do {
/*
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index cc60e617931c..d6be685ab6b0 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -357,7 +357,7 @@ int hyperv_flush_guest_mapping_range(u64 as,
hyperv_fill_flush_list_func fill_func, void *data);
 int hyperv_fill_flush_guest_mapping_list(
struct hv_guest_mapping_flush_list *flush,
-   u64 start_gfn, u64 end_gfn);
+   int offset, u64 start_gfn, u64 end_gfn);
 
 #ifdef CONFIG_X86_64
 void hv_apic_init(void);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f6915f10e584..9d954b4adce3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -428,7 +428,7 @@ int kvm_fill_hv_flush_list_func(struct 
hv_guest_mapping_flush_list *flush,
 {
struct kvm_tlb_range *range = data;
 
-   return hyperv_fill_flush_guest_mapping_list(flush, range->start_gfn,
+   return hyperv_fill_flush_guest_mapping_list(flush, 0, range->start_gfn,
range->pages);
 }
 
-- 
2.14.4



[PATCH V2 00/10] X86/KVM/Hyper-V: Add HV ept tlb range list flush support in KVM

2019-02-01 Thread lantianyu1986
From: Lan Tianyu 

This patchset introduces Hyper-V EPT TLB range list flush function
support in the KVM MMU component. Flushing EPT TLBs for several address
ranges can be done via a single hypercall, and the new list flush function is
used in kvm_mmu_commit_zap_page() and FNAME(sync_page). This patchset
also adds Hyper-V EPT TLB range flush support to more KVM MMU functions.
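
At a high level, the flow introduced by the series looks roughly like this
(an illustrative sketch only, simplified from the patches below):

	/* Collect the shadow pages whose mappings need flushing ... */
	HLIST_HEAD(flush_list);
	hlist_add_head(&sp->flush_link, &flush_list);

	/* ... then issue one range-list flush for all of them ... */
	struct kvm_tlb_range range = { .flush_list = &flush_list };
	kvm_flush_remote_tlbs_with_range(kvm, &range);

	/*
	 * ... which the VMX/Hyper-V backend turns into a single guest
	 * mapping flush hypercall covering every entry on the list.
	 */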

Change since v1:
   1) Make flush list as a hlist instead of list in order to 
   keep struct kvm_mmu_page size.
   2) Add last_level flag in the struct kvm_mmu_page instead
   of spte pointer
   3) Move tlb flush from kvm_mmu_notifier_clear_flush_young() to 
kvm_age_hva()
   4) Use range flush in the kvm_vm_ioctl_get/clear_dirty_log()

Lan Tianyu (10):
  X86/Hyper-V: Add parameter offset for
hyperv_fill_flush_guest_mapping_list()
  KVM/VMX: Fill range list in kvm_fill_hv_flush_list_func()
  KVM/MMU: Add last_level in the struct mmu_spte_page
  KVM/MMU: Introduce tlb flush with range list
  KVM/MMU: Flush tlb with range list in sync_page()
  KVM/MMU: Flush tlb directly in the kvm_mmu_slot_gfn_write_protect()
  KVM: Add kvm_get_memslot() to get memslot via slot id
  KVM: Use tlb range flush in the kvm_vm_ioctl_get/clear_dirty_log()
  KVM: Add flush parameter for kvm_age_hva()
  KVM/MMU: Use tlb range flush  in the kvm_age_hva()

 arch/arm/include/asm/kvm_host.h |  3 ++-
 arch/arm64/include/asm/kvm_host.h   |  3 ++-
 arch/mips/include/asm/kvm_host.h|  3 ++-
 arch/mips/kvm/mmu.c | 11 ++--
 arch/powerpc/include/asm/kvm_host.h |  3 ++-
 arch/powerpc/kvm/book3s.c   | 10 ++--
 arch/powerpc/kvm/e500_mmu_host.c|  3 ++-
 arch/x86/hyperv/nested.c|  4 +--
 arch/x86/include/asm/kvm_host.h | 11 +++-
 arch/x86/include/asm/mshyperv.h |  2 +-
 arch/x86/kvm/mmu.c  | 51 +
 arch/x86/kvm/mmu.h  |  7 +
 arch/x86/kvm/paging_tmpl.h  | 15 ---
 arch/x86/kvm/vmx/vmx.c  | 18 +++--
 arch/x86/kvm/x86.c  | 16 +---
 include/linux/kvm_host.h|  1 +
 virt/kvm/arm/mmu.c  | 13 --
 virt/kvm/kvm_main.c | 51 +++--
 18 files changed, 160 insertions(+), 65 deletions(-)

-- 
2.14.4



Re: [PATCH 1/4] powerpc/64s: Clear on-stack exception marker upon exception return

2019-02-01 Thread Balbir Singh
On Tue, Jan 22, 2019 at 10:57:21AM -0500, Joe Lawrence wrote:
> From: Nicolai Stange 
> 
> The ppc64 specific implementation of the reliable stacktracer,
> save_stack_trace_tsk_reliable(), bails out and reports an "unreliable
> trace" whenever it finds an exception frame on the stack. Stack frames
> are classified as exception frames if the STACK_FRAME_REGS_MARKER magic,
> as written by exception prologues, is found at a particular location.
> 
> However, as observed by Joe Lawrence, it is possible in practice that
> non-exception stack frames can alias with prior exception frames and thus,
> that the reliable stacktracer can find a stale STACK_FRAME_REGS_MARKER on
> the stack. This in turn falsely reports an unreliable stacktrace and blocks
> any live patching transition from finishing. Said condition lasts until
> the stack frame is overwritten/initialized by a function call or other means.
> 
> In principle, we could mitigate this by making the exception frame
> classification condition in save_stack_trace_tsk_reliable() stronger:
> in addition to testing for STACK_FRAME_REGS_MARKER, we could also take into
> account that for all exceptions executing on the kernel stack
> - their stack frames' backlink pointers always match what is saved
>   in their pt_regs instance's ->gpr[1] slot and that
> - their exception frame size equals STACK_INT_FRAME_SIZE, a value
>   uncommonly large for non-exception frames.
> 
> However, while these are currently true, relying on them would make the
> reliable stacktrace implementation more sensitive towards future changes in
> the exception entry code. Note that false negatives, i.e. not detecting
> exception frames, would silently break the live patching consistency model.
> 
> Furthermore, certain other places (diagnostic stacktraces, perf, xmon)
> rely on STACK_FRAME_REGS_MARKER as well.
> 
> Make the exception exit code clear the on-stack STACK_FRAME_REGS_MARKER
> for those exceptions running on the "normal" kernel stack and returning
> to kernelspace: because the topmost frame is ignored by the reliable stack
> tracer anyway, returns to userspace don't need to take care of clearing
> the marker.
> 
> Furthermore, as I don't have the ability to test this on Book 3E or
> 32 bits, limit the change to Book 3S and 64 bits.
> 
> Finally, make the HAVE_RELIABLE_STACKTRACE Kconfig option depend on
> PPC_BOOK3S_64 for documentation purposes. Before this patch, it depended
> on PPC64 && CPU_LITTLE_ENDIAN and because CPU_LITTLE_ENDIAN implies
> PPC_BOOK3S_64, there's no functional change here.
> 
> Fixes: df78d3f61480 ("powerpc/livepatch: Implement reliable stack tracing for the consistency model")
> Reported-by: Joe Lawrence 
> Signed-off-by: Nicolai Stange 
> Signed-off-by: Joe Lawrence 
> ---
>  arch/powerpc/Kconfig   | 2 +-
>  arch/powerpc/kernel/entry_64.S | 7 +++
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 2890d36eb531..73bf87b1d274 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -220,7 +220,7 @@ config PPC
>   select HAVE_PERF_USER_STACK_DUMP
>   select HAVE_RCU_TABLE_FREE  if SMP
>   select HAVE_REGS_AND_STACK_ACCESS_API
> - select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
> + select HAVE_RELIABLE_STACKTRACE if PPC_BOOK3S_64 && CPU_LITTLE_ENDIAN
>   select HAVE_SYSCALL_TRACEPOINTS
>   select HAVE_VIRT_CPU_ACCOUNTING
>   select HAVE_IRQ_TIME_ACCOUNTING
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 435927f549c4..a2c168b395d2 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -1002,6 +1002,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
>   ld  r2,_NIP(r1)
>   mtspr   SPRN_SRR0,r2
>  
> + /*
> +  * Leaving a stale exception_marker on the stack can confuse
> +  * the reliable stack unwinder later on. Clear it.
> +  */
> + li  r2,0
> + std r2,STACK_FRAME_OVERHEAD-16(r1)
> +

Could you please double check, r4 is already 0 at this point
IIUC. So the change might be a simple

std r4,STACK_FRAME_OVERHEAD-16(r1)

Balbir


Re: [PATCH 4/4] powerpc/livepatch: return -ERRNO values in save_stack_trace_tsk_reliable()

2019-02-01 Thread Balbir Singh
On Tue, Jan 22, 2019 at 10:57:24AM -0500, Joe Lawrence wrote:
> To match its x86 counterpart, save_stack_trace_tsk_reliable() should
> return -EINVAL in cases that it is currently returning 1.  No caller is
> currently differentiating non-zero error codes, but let's keep the
> arch-specific implementations consistent.
> 
> Signed-off-by: Joe Lawrence 

Seems straightforward

Acked-by: Balbir Singh 


Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode

2019-02-01 Thread Cédric Le Goater
On 1/31/19 4:01 AM, Paul Mackerras wrote:
> On Wed, Jan 30, 2019 at 08:01:22AM +0100, Cédric Le Goater wrote:
>> On 1/30/19 5:29 AM, Paul Mackerras wrote:
>>> On Mon, Jan 28, 2019 at 06:35:34PM +0100, Cédric Le Goater wrote:
 On 1/22/19 6:05 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
>> This is the basic framework for the new KVM device supporting the XIVE
>> native exploitation mode. The user interface exposes a new capability
>> and a new KVM device to be used by QEMU.
>
> [snip]
>> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>>  #ifdef CONFIG_KVM_XIVE
>>  if (xive_enabled()) {
>>  kvmppc_xive_init_module();
>> +kvmppc_xive_native_init_module();
>>  kvm_register_device_ops(&kvm_xive_ops, 
>> KVM_DEV_TYPE_XICS);
>> +kvm_register_device_ops(&kvm_xive_native_ops,
>> +KVM_DEV_TYPE_XIVE);
>
> I think we want tighter conditions on initializing the xive_native
> stuff and creating the xive device class.  We could have
> xive_enabled() returning true in a guest, and this code will get
> called both by PR KVM and HV KVM (and HV KVM no longer implies that we
> are running bare metal).

 So yes, I gave nested a try with kernel_irqchip=on and the nested 
 hypervisor 
 (L1) obviously crashes trying to call OPAL. I have tightened the test with:

if (xive_enabled() && !kvmhv_on_pseries()) {

 for now.

 As this is a problem today in 5.0.x, I will send a patch for it if you 
 think
>>>
>>> How do you mean this is a problem today in 5.0?  I just tried 5.0-rc1
>>> with kernel_irqchip=on in a nested guest and it works just fine.  What
>>> exactly did you test?
>>
>> L0: Linux 5.0.0-rc3 (+ KVM HV)
>> L1: QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3 (+ KVM HV)
>> L2:  QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3
>>
>> L1 crashes when L2 starts and tries to initialize the KVM IRQ device as 
>> it does an OPAL call and its running under SLOF. See below.
> 
> OK, you must have a QEMU that advertises XIVE to the guest (L1). 

XIVE is not advertised if QEMU is started with 'ic-mode=xics' 

> In
> that case I can see that L1 would try to do XICS-on-XIVE, which won't
> work.  We need to fix that.  Unfortunately the XICS-on-XICS emulation
> won't work as is in L1 either, but I think we can fix that by
> disabling the real-mode XICS hcall handling.

I have added some tests on kvm-hv, using kvmhv_on_pseries(), to disable 
the KVM XICS-on-XIVE device in a L1 guest running as hypervisor and 
to instead register the old KVM XICS device. 

If the L1 is started in KVM XICS mode, L2 can now run with KVM XICS.
All seem fine. I booted two guests with disk and network. 

But I am still "a bit" confused with what is being done at each 
hypervisor level. It's not obvious to follow at all even with traces.
 
>> I don't understand how L2 can work with kernel_irqchip=on. Could you
>> please explain ? 
> 
> If QEMU decides to advertise XIVE to the L2 guest and the L2 guest can
> do XIVE, then the only possibility is to use the XIVE software
> emulation in QEMU, and if kernel_irqchip=on has been specified
> explicitly, maybe QEMU decides to terminate the guest rather than
> implicitly turning off kernel_irqchip.

we can do that by disabling the KVM XIVE device when under kvmhv_on_pseries().

> If QEMU decides not to advertise XIVE to the L2 guest, or the L2 guest
> can't do XIVE, then we could use the XICS-on-XICS emulation in L1 as
> long as either (a) L1 is not using XIVE, or (b) we modify the
> XICS-on-XICS code to avoid using any XICS or XIVE access (i.e. just
> using calls to generic kernel facilities).

(a) is what I did above I think

Maybe we should consider having nested versions of the KVM devices
when under kvmhv_on_pseries(), with some sort of backend ops to
modify the relation with the parent hypervisor: PowerNV/Linux or
pseries/Linux.

> Ultimately, if the spapr xive backend code in the kernel could be
> extended to provide all the low-level functions that the XICS-on-XIVE
> code needs, then we could do XICS-on-XIVE in a guest.

What about a XIVE on XIVE ? 

Propagating the ESB pages to a nested guest seems feasible if not 
already done. The hcalls could be forwarded to the L1 QEMU ? The 
problematic part is handling the XIVE VP block.

C.



Re: use generic DMA mapping code in powerpc V4

2019-02-01 Thread Christian Zigotzky
Hi Christoph,

I will try it at the weekend.

Thanks,
Christian

Sent from my iPhone

> On 1. Feb 2019, at 09:04, Christoph Hellwig  wrote:
> 
>> On Thu, Jan 31, 2019 at 01:48:26PM +0100, Christian Zigotzky wrote:
>> Hi Christoph,
>> 
>> I compiled kernels for the X5000 and X1000 from your branch 'powerpc-dma.6' 
>> today.
>> 
>> Gitweb: 
>> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/powerpc-dma.6
>> 
>> git clone git://git.infradead.org/users/hch/misc.git -b powerpc-dma.6 a
>> 
>> The X1000 and X5000 boot but unfortunately the P.A. Semi Ethernet doesn't 
>> work.
> 
> Oh.  Can you try with just the next one and then two patches applied
> over the working setup?  That is first:
> 
> http://git.infradead.org/users/hch/misc.git/commitdiff/b50f42f0fe12965ead395c76bcb6a14f00cdf65b
> 
> then also with:
> 
> http://git.infradead.org/users/hch/misc.git/commitdiff/21fe52470a483afbb1726741118abef8602dde4d


[PATCH net-next v4 12/12] sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW

2019-02-01 Thread Deepa Dinamani
Add new socket timeout options that are y2038 safe.
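
For illustration, a userspace consumer of the new options might look like the
sketch below (assuming, as elsewhere in this series, that the _NEW options use
the 64-bit struct __kernel_sock_timeval layout):

	struct __kernel_sock_timeval tv = {
		.tv_sec  = 5,	/* 64 bits wide on every architecture */
		.tv_usec = 0,
	};

	/*
	 * On a libc with 64-bit time_t, SO_RCVTIMEO resolves to
	 * SO_RCVTIMEO_NEW via the sizeof(time_t) checks below.
	 */
	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0)
		perror("setsockopt");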

Signed-off-by: Deepa Dinamani 
Cc: ccaul...@redhat.com
Cc: da...@davemloft.net
Cc: del...@gmx.de
Cc: pau...@samba.org
Cc: r...@linux-mips.org
Cc: r...@twiddle.net
Cc: cluster-de...@redhat.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-al...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
---
 arch/alpha/include/uapi/asm/socket.h  | 12 +--
 arch/mips/include/uapi/asm/socket.h   | 11 --
 arch/parisc/include/uapi/asm/socket.h | 10 --
 arch/sparc/include/uapi/asm/socket.h  | 11 --
 include/uapi/asm-generic/socket.h | 11 --
 net/core/sock.c   | 49 ---
 6 files changed, 81 insertions(+), 23 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index 9826d1db71d0..0d0fddb7e738 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -119,19 +119,25 @@
 #define SO_TIMESTAMPNS_NEW  64
 #define SO_TIMESTAMPING_NEW 65
 
-#if !defined(__KERNEL__)
+#define SO_RCVTIMEO_NEW 66
+#define SO_SNDTIMEO_NEW 67
 
-#defineSO_RCVTIMEO SO_RCVTIMEO_OLD
-#defineSO_SNDTIMEO SO_SNDTIMEO_OLD
+#if !defined(__KERNEL__)
 
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP   SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
 #define SO_TIMESTAMPING SO_TIMESTAMPING_OLD
+
+#define SO_RCVTIMEOSO_RCVTIMEO_OLD
+#define SO_SNDTIMEOSO_SNDTIMEO_OLD
 #else
 #define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
 #define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
 #define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW)
+
+#define SO_RCVTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_RCVTIMEO_OLD : SO_RCVTIMEO_NEW)
+#define SO_SNDTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_SNDTIMEO_OLD : SO_SNDTIMEO_NEW)
 #endif
 
 #define SCM_TIMESTAMP   SO_TIMESTAMP
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index 96cc0e907f12..eb9f33f8a8b3 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -130,18 +130,25 @@
 #define SO_TIMESTAMPNS_NEW  64
 #define SO_TIMESTAMPING_NEW 65
 
+#define SO_RCVTIMEO_NEW 66
+#define SO_SNDTIMEO_NEW 67
+
 #if !defined(__KERNEL__)
 
-#defineSO_RCVTIMEO SO_RCVTIMEO_OLD
-#defineSO_SNDTIMEO SO_SNDTIMEO_OLD
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP   SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
 #define SO_TIMESTAMPINGSO_TIMESTAMPING_OLD
+
+#define SO_RCVTIMEO SO_RCVTIMEO_OLD
+#define SO_SNDTIMEO SO_SNDTIMEO_OLD
 #else
 #define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
 #define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
 #define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW)
+
+#define SO_RCVTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_RCVTIMEO_OLD : SO_RCVTIMEO_NEW)
+#define SO_SNDTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_SNDTIMEO_OLD : SO_SNDTIMEO_NEW)
 #endif
 
 #define SCM_TIMESTAMP   SO_TIMESTAMP
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index 046f0cd9cce4..16e428f03526 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -111,18 +111,24 @@
 #define SO_TIMESTAMPNS_NEW  0x4039
 #define SO_TIMESTAMPING_NEW 0x403A
 
+#define SO_RCVTIMEO_NEW 0x4040
+#define SO_SNDTIMEO_NEW 0x4041
+
 #if !defined(__KERNEL__)
 
-#defineSO_RCVTIMEO SO_RCVTIMEO_OLD
-#defineSO_SNDTIMEO SO_SNDTIMEO_OLD
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP   SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
 #define SO_TIMESTAMPING SO_TIMESTAMPING_OLD
+#define SO_RCVTIMEOSO_RCVTIMEO_OLD
+#define SO_SNDTIMEOSO_SNDTIMEO_OLD
 #else
 #define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
 #define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
 #define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW)
+
+#define SO_RCVTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_RCVTIMEO_OLD : SO_RCVTIMEO_NEW)
+#define SO_SNDTIMEO (sizeof(time_t) == sizeof(__kernel_long_t) ? 
SO_SNDTIMEO_OLD : SO_SNDTIMEO_NEW)
 #endif
 
 #define SCM_TIMESTAMP   

[PATCH net-next v4 11/12] socket: Rename SO_RCVTIMEO/ SO_SNDTIMEO with _OLD suffixes

2019-02-01 Thread Deepa Dinamani
SO_RCVTIMEO and SO_SNDTIMEO socket options use struct timeval
as the time format. struct timeval is not y2038 safe.
The subsequent patches in the series add support for new socket
timeout options with _NEW suffix that will use y2038 safe
data structures. Although the existing struct timeval layout
is sufficiently wide to represent timeouts, because of the way
libc will interpret time_t based on user defined flag, these
new flags provide a way of having a structure that is the same
for all architectures consistently.
Rename the existing options with _OLD suffix forms so that the
right option is enabled for userspace applications according
to the architecture and time_t definition of libc.

Signed-off-by: Deepa Dinamani 
Cc: ccaul...@redhat.com
Cc: del...@gmx.de
Cc: pau...@samba.org
Cc: r...@linux-mips.org
Cc: r...@twiddle.net
Cc: cluster-de...@redhat.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-al...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
---
 arch/alpha/include/uapi/asm/socket.h   | 7 +--
 arch/mips/include/uapi/asm/socket.h| 6 --
 arch/parisc/include/uapi/asm/socket.h  | 6 --
 arch/powerpc/include/uapi/asm/socket.h | 4 ++--
 arch/sparc/include/uapi/asm/socket.h   | 7 +--
 fs/dlm/lowcomms.c  | 4 ++--
 include/uapi/asm-generic/socket.h  | 6 --
 net/core/sock.c| 8 
 8 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/socket.h 
b/arch/alpha/include/uapi/asm/socket.h
index 934ea6268f1a..9826d1db71d0 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -31,8 +31,8 @@
 #define SO_RCVBUFFORCE 0x100b
 #defineSO_RCVLOWAT 0x1010
 #defineSO_SNDLOWAT 0x1011
-#defineSO_RCVTIMEO 0x1012
-#defineSO_SNDTIMEO 0x1013
+#defineSO_RCVTIMEO_OLD 0x1012
+#defineSO_SNDTIMEO_OLD 0x1013
 #define SO_ACCEPTCONN  0x1014
 #define SO_PROTOCOL0x1028
 #define SO_DOMAIN  0x1029
@@ -121,6 +121,9 @@
 
 #if !defined(__KERNEL__)
 
+#defineSO_RCVTIMEO SO_RCVTIMEO_OLD
+#defineSO_SNDTIMEO SO_SNDTIMEO_OLD
+
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP   SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
diff --git a/arch/mips/include/uapi/asm/socket.h 
b/arch/mips/include/uapi/asm/socket.h
index 110f9506d64f..96cc0e907f12 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -39,8 +39,8 @@
 #define SO_RCVBUF  0x1002  /* Receive buffer. */
 #define SO_SNDLOWAT0x1003  /* send low-water mark */
 #define SO_RCVLOWAT0x1004  /* receive low-water mark */
-#define SO_SNDTIMEO0x1005  /* send timeout */
-#define SO_RCVTIMEO0x1006  /* receive timeout */
+#define SO_SNDTIMEO_OLD0x1005  /* send timeout */
+#define SO_RCVTIMEO_OLD0x1006  /* receive timeout */
 #define SO_ACCEPTCONN  0x1009
 #define SO_PROTOCOL0x1028  /* protocol type */
 #define SO_DOMAIN  0x1029  /* domain/socket family */
@@ -132,6 +132,8 @@
 
 #if !defined(__KERNEL__)
 
+#defineSO_RCVTIMEO SO_RCVTIMEO_OLD
+#defineSO_SNDTIMEO SO_SNDTIMEO_OLD
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP   SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
diff --git a/arch/parisc/include/uapi/asm/socket.h 
b/arch/parisc/include/uapi/asm/socket.h
index bee2a9dde656..046f0cd9cce4 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -22,8 +22,8 @@
 #define SO_RCVBUFFORCE 0x100b
 #define SO_SNDLOWAT0x1003
 #define SO_RCVLOWAT0x1004
-#define SO_SNDTIMEO0x1005
-#define SO_RCVTIMEO0x1006
+#define SO_SNDTIMEO_OLD0x1005
+#define SO_RCVTIMEO_OLD0x1006
 #define SO_ERROR   0x1007
 #define SO_TYPE0x1008
 #define SO_PROTOCOL0x1028
@@ -113,6 +113,8 @@
 
 #if !defined(__KERNEL__)
 
+#defineSO_RCVTIMEO SO_RCVTIMEO_OLD
+#defineSO_SNDTIMEO SO_SNDTIMEO_OLD
 #if __BITS_PER_LONG == 64
 #define SO_TIMESTAMP   SO_TIMESTAMP_OLD
 #define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
diff --git a/arch/powerpc/include/uapi/asm/socket.h 
b/arch/powerpc/include/uapi/asm/socket.h
index 94de465e0920..12aa0c43e775 100644
--- a/arch/powerpc/include/uapi/asm/socket.h
+++ b/arch/powerpc/include/uapi/asm/socket.h
@@ -11,8 +11,8 @@
 
 #define SO_RCVLOWAT16
 #define SO_SNDLOWAT17
-#define SO_RCVTIMEO18
-#define SO_SNDTIMEO19
+#define SO_RCVTIMEO_OLD18
+#define SO_SNDTIMEO_OLD19
 #define SO_PASSCRED20
 #define SO_PEERCRED21
 
diff --git a/arch/sparc/include/uapi/asm/socket.h 
b/arch/sparc/include/uapi/asm/socket.h
index 2b38dda51426..342ffdc3b424 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -21,8 +21,8 @@
 #define SO_BSDCOMPAT0x0400


Re: [PATCH v02] powerpc/pseries: Check for ceded CPU's during LPAR migration

2019-02-01 Thread Michael Bringmann
See below.

On 1/31/19 3:53 PM, Michael Bringmann wrote:
> On 1/30/19 11:38 PM, Michael Ellerman wrote:
>> Michael Bringmann  writes:
>>> This patch is to check for ceded CPUs during LPM.  Some extreme
>>> tests encountered a problem where Linux has put some threads to
>>> sleep (possibly to save energy or something), LPM was attempted,
>>> and the Linux kernel didn't awaken the sleeping threads, but issued
>>> the H_JOIN for the active threads.  Since the sleeping threads
>>> are not awake, they cannot issue the expected H_JOIN, and the
>>> partition would never suspend.  This patch wakes the sleeping
>>> threads back up.
>>
>> I'm don't think this is the right solution.
>>
>> Just after your for loop we do an on_each_cpu() call, which sends an IPI
>> to every CPU, and that should wake all CPUs up from CEDE.
>>
>> If that's not happening then there is a bug somewhere, and we need to
>> work out where.

From Pete Heyrman:
Both sending an IPI and H_PROD will awaken a logical processor that has ceded.
When you have one logical proc doing cede and one logical proc doing prod or IPI,
you have a race condition in which the prod/IPI can precede the cede request.
If you use prod, the hypervisor takes care of the synchronization by ignoring
a cede request if it was preceded by a prod.  With IPI the interrupt is
delivered, which could then be followed by a cede, so the OS would need to
provide the synchronization.

Shouldn't this answer your concerns about race conditions and the suitability
of using H_PROD?
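
Concretely, the approach being discussed amounts to doing something like the
following before issuing H_JOIN (a sketch only; cpu_is_ceded() comes from the
proposed patch, not from upstream):

	for_each_present_cpu(cpu) {
		if (cpu_is_ceded(cpu))
			plpar_hcall_norets(H_PROD,
					   get_hard_smp_processor_id(cpu));
	}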

Michael

> 
> Let me explain the scenario of the LPM case that Pete Heyrman found, and
> that Nathan F. was working upon, previously.
> 
> In the scenario, the partition has 5 dedicated processors each with 8 threads
> running.
> 
> From the PHYP data we can see that on VP 0, threads 3, 4, 5, 6 and 7 issued
> a H_CEDE requesting to save energy by putting the requesting thread into
> sleep mode.  In this state, the thread will only be awakened by H_PROD from
> another running thread or from an external user action (power off, reboot
> and such).  Timers and external interrupts are disabled in this mode.
> 
> About 3 seconds later, as part of the LPM operation, the other 35 threads
> have all issued a H_JOIN request.  Join is part of the LPM process where
> the threads suspend themselves as part of the LPM operation so the partition
> can be migrated to the target server.
> 
> So, the current state is that the OS has suspended the execution of all the
> threads in the partition without successfully suspending all threads as part
> of LPM.
> 
> Net, OS has an issue where they suspended every processor thread so nothing
> can run.
> 
> This appears to be slightly different than the previous LPM stalls we have
> seen where the migration stalls because of cpus being taken offline and not
> making the H_JOIN call.
> 
> In this scenario we appear to have CPUs that have done an H_CEDE prior to
> the LPM. For these CPUs we would need to do a H_PROD to wake them back up
> so they can do a H_JOIN and allow the LPM to continue.
> 
> The problem is that Linux has some threads that they put to sleep (probably
> to save energy or something), LPM was attempted, Linux didn't awaken the
> sleeping threads but issued the H_JOIN for the active threads.  Since the
> sleeping threads don't issue the H_JOIN the partition will never suspend.
> 
> I am checking again with Pete regarding your concerns.
> 
> Thanks.
> 
>>
>>
>>> diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
>>> b/arch/powerpc/include/asm/plpar_wrappers.h
>>> index cff5a41..8292eff 100644
>>> --- a/arch/powerpc/include/asm/plpar_wrappers.h
>>> +++ b/arch/powerpc/include/asm/plpar_wrappers.h
>>> @@ -26,10 +26,8 @@ static inline void set_cede_latency_hint(u8 latency_hint)
>>> get_lppaca()->cede_latency_hint = latency_hint;
>>>  }
>>>  
>>> -static inline long cede_processor(void)
>>> -{
>>> -   return plpar_hcall_norets(H_CEDE);
>>> -}
>>> +int cpu_is_ceded(int cpu);
>>> +long cede_processor(void);
>>>  
>>>  static inline long extended_cede_processor(unsigned long latency_hint)
>>>  {
>>> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
>>> index de35bd8f..fea3d21 100644
>>> --- a/arch/powerpc/kernel/rtas.c
>>> +++ b/arch/powerpc/kernel/rtas.c
>>> @@ -44,6 +44,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  
>>>  /* This is here deliberately so it's only used in this file */
>>>  void enter_rtas(unsigned long);
>>> @@ -942,7 +943,7 @@ int rtas_ibm_suspend_me(u64 handle)
>>> struct rtas_suspend_me_data data;
>>> DECLARE_COMPLETION_ONSTACK(done);
>>> cpumask_var_t offline_mask;
>>> -   int cpuret;
>>> +   int cpuret, cpu;
>>>  
>>> if (!rtas_service_present("ibm,suspend-me"))
>>> return -ENOSYS;
>>> @@ -991,6 +992,11 @@ int rtas_ibm_suspend_me(u64 handle)
>>> goto out_hotplug_enable;
>>> }
>>>  
>>> +   for_each_present_cpu(cpu) {
>>> +   if 

Re: [RFC PATCH] powerpc/6xx: Don't set back MSR_RI before reenabling MMU

2019-02-01 Thread Christophe Leroy




Le 01/02/2019 à 12:10, Michael Ellerman a écrit :

Christophe Leroy  writes:


By delaying the setting of MSR_RI, a 1% improvement is obtained on
null_syscall selftest on an mpc8321.

Without this patch:

root@vgoippro:~# ./null_syscall
1134.33 ns 378.11 cycles

With this patch:

root@vgoippro:~# ./null_syscall
1121.85 ns 373.95 cycles

The drawback is that a machine check during that period
would be unrecoverable, but as only main memory is accessed
during that period, it shouldn't be a concern.


On 64-bit server CPUs accessing main memory can cause a UE
(Uncorrectable Error) which can trigger a machine check.

So it may still be a concern, it depends how paranoid you are.


diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 146385b1c2da..ea28a6ab56ec 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -282,8 +282,6 @@ __secondary_hold_acknowledge:
stw r1,GPR1(r11);   \
stw r1,0(r11);  \
tovirt(r1,r11); /* set new kernel sp */ \
-   li  r10,MSR_KERNEL & ~(MSR_IR|MSR_DR); /* can take exceptions */ \
-   MTMSRD(r10);/* (except for mach check in rtas) */ \
stw r0,GPR0(r11);   \
lis r10,STACK_FRAME_REGS_MARKER@ha; /* exception frame marker */ \
addir10,r10,STACK_FRAME_REGS_MARKER@l; \


Where does RI get enabled? I don't see it anywhere obvious.


MSR_RI is part of MSR_KERNEL; it then gets enabled when the MMU is
re-enabled to call the exception handler.


#define EXC_XFER_TEMPLATE(n, hdlr, trap, copyee, tfer, ret) \
li  r10,trap;   \
stw r10,_TRAP(r11); \
li  r10,MSR_KERNEL; \
copyee(r10, r9);\
bl  tfer;   \
i##n:   \
.long   hdlr;   \
.long   ret

where tfer = transfer_to_handler.

In transfer_to_handler (kernel/entry_32.S) you have:

transfer_to_handler_cont:
3:
mflrr9
lwz r11,0(r9)   /* virtual address of handler */
lwz r9,4(r9)/* where to go when done */
[...]
mtspr   SPRN_SRR0,r11
mtspr   SPRN_SRR1,r10
mtlr	r9
SYNC
RFI /* jump to handler, enable MMU */

So MSR_RI is restored above as r10 contains MSR_KERNEL [ | MSR_EE ]
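
For reference, a minimal sketch of the relevant MSR bits -- these are roughly
the 32-bit classic (6xx) definitions from asm/reg.h, shown here only for
illustration and not authoritative for every sub-arch:

/*
 * Rough sketch of the 32-bit MSR bit masks (check asm/reg.h in your tree).
 * The point is only that MSR_RI is one of the bits in MSR_KERNEL, so
 * writing MSR_KERNEL to SRR1 and executing RFI re-enables RI at the same
 * time as the MMU (IR/DR).
 */
#define MSR_RI      (1 << 1)    /* Recoverable Interrupt */
#define MSR_DR      (1 << 4)    /* Data Relocate (data MMU on) */
#define MSR_IR      (1 << 5)    /* Instruction Relocate (instruction MMU on) */
#define MSR_ME      (1 << 12)   /* Machine check Enable */
#define MSR_EE      (1 << 15)   /* External interrupt Enable */

#define MSR_KERNEL  (MSR_ME | MSR_RI | MSR_IR | MSR_DR)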

Christophe





cheers



Re: BUG: memcmp(): Accessing invalid memory location

2019-02-01 Thread Michael Ellerman
Michael Ellerman  writes:

> Adding Simon who wrote the code.
>
> Chandan Rajendra  writes:
>> When executing fstests' generic/026 test, I hit the following call trace,
>>
>> [  417.061038] BUG: Unable to handle kernel data access at 0xc0062ac4
>> [  417.062172] Faulting instruction address: 0xc0092240
>> [  417.062242] Oops: Kernel access of bad area, sig: 11 [#1]
>> [  417.062299] LE SMP NR_CPUS=2048 DEBUG_PAGEALLOC NUMA pSeries
>> [  417.062366] Modules linked in:
>> [  417.062401] CPU: 0 PID: 27828 Comm: chacl Not tainted 
>> 5.0.0-rc2-next-20190115-1-g6de6dba64dda #1
>> [  417.062495] NIP:  c0092240 LR: c066a55c CTR: 
>> 
>> [  417.062567] REGS: c0062c0c3430 TRAP: 0300   Not tainted  
>> (5.0.0-rc2-next-20190115-1-g6de6dba64dda)
>> [  417.062660] MSR:  82009033   CR: 
>> 44000842  XER: 2000
>> [  417.062750] CFAR: 7fff7f3108ac DAR: c0062ac4 DSISR: 4000 
>> IRQMASK: 0
>>GPR00:  c0062c0c36c0 c17f4c00 
>> c121a660
>>GPR04: c0062ac3fff9 0004 0020 
>> 275b19c4
>>GPR08: 000c 46494c45 5347495f41434c5f 
>> c26073a0
>>GPR12:  c27a  
>> 
>>GPR16:    
>> 
>>GPR20: c0062ea70020 c0062c0c38d0 0002 
>> 0002
>>GPR24: c0062ac3ffe8 275b19c4 0001 
>> c0062ac3
>>GPR28: c0062c0c38d0 c0062ac30050 c0062ac30058 
>> 
>> [  417.063563] NIP [c0092240] memcmp+0x120/0x690
>> [  417.063635] LR [c066a55c] xfs_attr3_leaf_lookup_int+0x53c/0x5b0
>> [  417.063709] Call Trace:
>> [  417.063744] [c0062c0c36c0] [c066a098] 
>> xfs_attr3_leaf_lookup_int+0x78/0x5b0 (unreliable)
>> [  417.063851] [c0062c0c3760] [c0693f8c] 
>> xfs_da3_node_lookup_int+0x32c/0x5a0
>> [  417.063944] [c0062c0c3820] [c06634a0] 
>> xfs_attr_node_addname+0x170/0x6b0
>> [  417.064034] [c0062c0c38b0] [c0664ffc] xfs_attr_set+0x2ac/0x340
>> [  417.064118] [c0062c0c39a0] [c0758d40] __xfs_set_acl+0xf0/0x230
>> [  417.064190] [c0062c0c3a00] [c0758f50] xfs_set_acl+0xd0/0x160
>> [  417.064268] [c0062c0c3aa0] [c04b69b0] set_posix_acl+0xc0/0x130
>> [  417.064339] [c0062c0c3ae0] [c04b6a88] 
>> posix_acl_xattr_set+0x68/0x110
>> [  417.064412] [c0062c0c3b20] [c04532d4] 
>> __vfs_setxattr+0xa4/0x110
>> [  417.064485] [c0062c0c3b80] [c0454c2c] 
>> __vfs_setxattr_noperm+0xac/0x240
>> [  417.064566] [c0062c0c3bd0] [c0454ee8] vfs_setxattr+0x128/0x130
>> [  417.064638] [c0062c0c3c30] [c0455138] setxattr+0x248/0x600
>> [  417.064710] [c0062c0c3d90] [c0455738] 
>> path_setxattr+0x108/0x120
>> [  417.064785] [c0062c0c3e00] [c0455778] sys_setxattr+0x28/0x40
>> [  417.064858] [c0062c0c3e20] [c000bae4] system_call+0x5c/0x70
>> [  417.064930] Instruction dump:
>> [  417.064964] 7d201c28 7d402428 7c295040 38630008 38840008 408201f0 
>> 4200ffe8 2c05
>> [  417.065051] 4182ff6c 20c50008 54c61838 7d201c28 <7d402428> 7d293436 
>> 7d4a3436 7c295040
>> [  417.065150] ---[ end trace 0d060411b5e3741b ]---
>>
>>
>> Both memory locations passed to memcmp() contained "SGI_ACL_FILE", and the len
>> argument of memcmp() was set to 12. The s1 argument of memcmp() had the value
>> 0xf4af0485, while the s2 argument had the value 0xce9e316f.
>>
>> The following is the code path within memcmp() that gets executed for the
>> above mentioned values,
>>
>> - Since len (i.e. 12) is greater than 7, we branch to .Lno_short.
>> - We then prefetch the contents of r3 & r4 and branch to
>>   .Ldiffoffset_8bytes_make_align_start.
>> - Under .Ldiffoffset_novmx_cmp, since r3 is unaligned, we end up comparing
>>   "SGI" part of the string. r3's value is then aligned. r4's value is
>>   incremented by 3. For comparing the remaining 9 bytes, we jump to
>>   .Lcmp_lt32bytes.
>> - Here, 8 bytes of the remaining 9 bytes are compared and execution moves to
>>   .Lcmp_rest_lt8bytes.
>> - Here we execute "LD rB,0,r4". In the case of this bug, r4 has an unaligned
>>   value and hence ends up accessing the "next" double word. The "next" double
>>   word happens to occur after the last page mapped into the kernel's address
>>   space and hence this leads to the previously listed oops.
>
> Thanks for the analysis.
>
> This is just a bug, we can't read past the end of the source or dest.
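
For illustration, a minimal user-space sketch of the failure mode (hypothetical
buffer placement, not the actual xfs attribute buffers, and plain C rather than
the ppc64 assembly): a 12-byte string ends exactly at a page boundary, and an
8-byte load used for the tail comparison would run into the unmapped page.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);

	/* One accessible page followed by a PROT_NONE guard page. */
	char *buf = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	mprotect(buf + page, page, PROT_NONE);

	/* 12 bytes ending exactly at the guard page, like the xattr name. */
	char *s = buf + page - 12;
	memcpy(s, "SGI_ACL_FILE", 12);

	/* A byte-wise comparison stays within the buffer and is fine. */
	printf("memcmp: %d\n", memcmp(s, "SGI_ACL_FILE", 12));

	/*
	 * What the buggy tail handling effectively did: load a full 8-byte
	 * word starting inside the last, partial chunk.  Bytes s[8..15]
	 * cross into the guard page and fault.  (Commented out so the
	 * sketch runs.)
	 */
	/* unsigned long v = *(volatile unsigned long *)(s + 8); */

	return 0;
}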

How about this? It works for me.

cheers

diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 844d8e774492..2a302158cb53 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -215,20 

Re: [RFC PATCH] powerpc/6xx: Don't set back MSR_RI before reenabling MMU

2019-02-01 Thread Michael Ellerman
Christophe Leroy  writes:

> By delaying the setting of MSR_RI, a 1% improvement is obtained on
> null_syscall selftest on an mpc8321.
>
> Without this patch:
>
> root@vgoippro:~# ./null_syscall
>1134.33 ns 378.11 cycles
>
> With this patch:
>
> root@vgoippro:~# ./null_syscall
>1121.85 ns 373.95 cycles
>
> The drawback is that a machine check during that period
> would be unrecoverable, but as only main memory is accessed
> during that period, it shouldn't be a concern.

On 64-bit server CPUs accessing main memory can cause a UE
(Uncorrectable Error) which can trigger a machine check.

So it may still be a concern; it depends how paranoid you are.

> diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
> index 146385b1c2da..ea28a6ab56ec 100644
> --- a/arch/powerpc/kernel/head_32.S
> +++ b/arch/powerpc/kernel/head_32.S
> @@ -282,8 +282,6 @@ __secondary_hold_acknowledge:
>   stw r1,GPR1(r11);   \
>   stw r1,0(r11);  \
>   tovirt(r1,r11); /* set new kernel sp */ \
> - li  r10,MSR_KERNEL & ~(MSR_IR|MSR_DR); /* can take exceptions */ \
> - MTMSRD(r10);/* (except for mach check in rtas) */ \
>   stw r0,GPR0(r11);   \
>   lis r10,STACK_FRAME_REGS_MARKER@ha; /* exception frame marker */ \
> 	addi	r10,r10,STACK_FRAME_REGS_MARKER@l; \

Where does RI get enabled? I don't see it anywhere obvious.

cheers


[PATCH v2] powerpc: drop page_is_ram() and walk_system_ram_range()

2019-02-01 Thread Christophe Leroy
Since commit c40dd2f76644 ("powerpc: Add System RAM to /proc/iomem")
it is possible to use the generic walk_system_ram_range() and
the generic page_is_ram().

To enable the use of walk_system_ram_range() by the IBM EHEA
ethernet driver, the generic function has to be exported.

As powerpc was the only (last?) user of CONFIG_ARCH_HAS_WALK_MEMORY,
the #ifdef around the generic walk_system_ram_range() has become
useless and can be dropped.
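
A minimal sketch of the callback-based walk a driver such as ehea performs
with the now-exported generic helper (hypothetical module, not the actual
ehea code; the prototype matches the generic walk_system_ram_range() touched
by this patch):

#include <linux/ioport.h>
#include <linux/module.h>

/* Callback invoked once per contiguous range of System RAM. */
static int count_ram_pages(unsigned long start_pfn, unsigned long nr_pages,
			   void *arg)
{
	unsigned long *total = arg;

	*total += nr_pages;
	return 0;		/* returning non-zero stops the walk */
}

static int __init ram_walk_init(void)
{
	unsigned long total = 0;

	/* Walk the first 0x100000 pfns; only System RAM ranges are reported. */
	walk_system_ram_range(0, 0x100000, &total, count_ram_pages);
	pr_info("ram_walk: %lu pages of System RAM\n", total);
	return 0;
}

static void __exit ram_walk_exit(void)
{
}

module_init(ram_walk_init);
module_exit(ram_walk_exit);
MODULE_LICENSE("GPL");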

Fixes: c40dd2f76644 ("powerpc: Add System RAM to /proc/iomem")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig|  3 ---
 arch/powerpc/include/asm/page.h |  1 -
 arch/powerpc/mm/mem.c   | 33 -
 kernel/resource.c   |  5 +
 4 files changed, 1 insertion(+), 41 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2890d36eb531..f92e6754edf1 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -478,9 +478,6 @@ config ARCH_CPU_PROBE_RELEASE
 config ARCH_ENABLE_MEMORY_HOTPLUG
def_bool y
 
-config ARCH_HAS_WALK_MEMORY
-   def_bool y
-
 config ARCH_ENABLE_MEMORY_HOTREMOVE
def_bool y
 
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 5c5ea2413413..aa4497175bd3 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -326,7 +326,6 @@ struct page;
 extern void clear_user_page(void *page, unsigned long vaddr, struct page *pg);
 extern void copy_user_page(void *to, void *from, unsigned long vaddr,
struct page *p);
-extern int page_is_ram(unsigned long pfn);
 extern int devmem_is_allowed(unsigned long pfn);
 
 #ifdef CONFIG_PPC_SMLPAR
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 33cc6f676fa6..fa9916c2c662 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -80,11 +80,6 @@ static inline pte_t *virt_to_kpte(unsigned long vaddr)
 #define TOP_ZONE ZONE_NORMAL
 #endif
 
-int page_is_ram(unsigned long pfn)
-{
-   return memblock_is_memory(__pfn_to_phys(pfn));
-}
-
 pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
  unsigned long size, pgprot_t vma_prot)
 {
@@ -176,34 +171,6 @@ int __meminit arch_remove_memory(int nid, u64 start, u64 
size,
 #endif
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
-/*
- * walk_memory_resource() needs to make sure there is no holes in a given
- * memory range.  PPC64 does not maintain the memory layout in /proc/iomem.
- * Instead it maintains it in memblock.memory structures.  Walk through the
- * memory regions, find holes and callback for contiguous regions.
- */
-int
-walk_system_ram_range(unsigned long start_pfn, unsigned long nr_pages,
-   void *arg, int (*func)(unsigned long, unsigned long, void *))
-{
-   struct memblock_region *reg;
-   unsigned long end_pfn = start_pfn + nr_pages;
-   unsigned long tstart, tend;
-   int ret = -1;
-
-   for_each_memblock(memory, reg) {
-   tstart = max(start_pfn, memblock_region_memory_base_pfn(reg));
-   tend = min(end_pfn, memblock_region_memory_end_pfn(reg));
-   if (tstart >= tend)
-   continue;
-   ret = (*func)(tstart, tend - tstart, arg);
-   if (ret)
-   break;
-   }
-   return ret;
-}
-EXPORT_SYMBOL_GPL(walk_system_ram_range);
-
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 void __init mem_topology_setup(void)
 {
diff --git a/kernel/resource.c b/kernel/resource.c
index 915c02e8e5dd..2e1636041508 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -448,8 +448,6 @@ int walk_mem_res(u64 start, u64 end, void *arg,
 arg, func);
 }
 
-#if !defined(CONFIG_ARCH_HAS_WALK_MEMORY)
-
 /*
  * This function calls the @func callback against all memory ranges of type
  * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOUCE_BUSY.
@@ -480,8 +478,7 @@ int walk_system_ram_range(unsigned long start_pfn, unsigned 
long nr_pages,
}
return ret;
 }
-
-#endif
+EXPORT_SYMBOL_GPL(walk_system_ram_range);
 
 static int __is_ram(unsigned long pfn, unsigned long nr_pages, void *arg)
 {
-- 
2.13.3



Re: fix a layering violation in videobuf2 and improve dma_map_resource v2

2019-02-01 Thread Christoph Hellwig
On Fri, Feb 01, 2019 at 08:05:21AM +0100, Marek Szyprowski wrote:
> Works fine on older Exynos based boards with IOMMU and CMA disabled.
> 
> Tested-by: Marek Szyprowski 

Thanks.  I've merged the series into the dma-mapping tree, and I've
also made a stable branch available at:

git://git.infradead.org/users/hch/dma-mapping.git videobuf-map-resource

in case it needs to be pulled into the media tree.


Re: [PATCH] powerpc: drop page_is_ram() and walk_system_ram_range()

2019-02-01 Thread Christoph Hellwig
On Thu, Jan 31, 2019 at 07:28:49PM +, Christophe Leroy wrote:
> Since commit c40dd2f76644 ("powerpc: Add System RAM to /proc/iomem")
> it is possible to use the generic walk_system_ram_range() and
> the generic page_is_ram().
> 
> Fixes: c40dd2f76644 ("powerpc: Add System RAM to /proc/iomem")
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/Kconfig|  3 ---
>  arch/powerpc/include/asm/page.h |  1 -
>  arch/powerpc/mm/mem.c   | 33 -
>  3 files changed, 37 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 0a26e0075ce5..0006ca6a7664 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -479,9 +479,6 @@ config ARCH_CPU_PROBE_RELEASE
>  config ARCH_ENABLE_MEMORY_HOTPLUG
>   def_bool y
>  
> -config ARCH_HAS_WALK_MEMORY
> - def_bool y

powerpc was the last architecture to define ARCH_HAS_WALK_MEMORY,
so the symbol can be removed now.


Re: use generic DMA mapping code in powerpc V4

2019-02-01 Thread Christoph Hellwig
On Thu, Jan 31, 2019 at 01:48:26PM +0100, Christian Zigotzky wrote:
> Hi Christoph,
>
> I compiled kernels for the X5000 and X1000 from your branch 'powerpc-dma.6' 
> today.
>
> Gitweb: 
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/powerpc-dma.6
>
> git clone git://git.infradead.org/users/hch/misc.git -b powerpc-dma.6 a
>
> The X1000 and X5000 boot but unfortunately the P.A. Semi Ethernet doesn't 
> work.

Oh.  Can you try with just the next patch applied over the working setup,
and then with the next two?  That is, first:

http://git.infradead.org/users/hch/misc.git/commitdiff/b50f42f0fe12965ead395c76bcb6a14f00cdf65b

then also with:

http://git.infradead.org/users/hch/misc.git/commitdiff/21fe52470a483afbb1726741118abef8602dde4d