Re: KVM guests freeze under upstream kernel

2017-07-26 Thread Suraj Jitindar Singh
On Thu, 2017-07-27 at 13:14 +1000, Michael Ellerman wrote:
> jos...@linux.vnet.ibm.com writes:
> > On Thu, Jul 20, 2017 at 10:18:18PM -0300, jos...@linux.vnet.ibm.com
> >  wrote:
> > > On Thu, Jul 20, 2017 at 03:21:59PM +1000, Paul Mackerras wrote:
> > > > 
> > > > Did you check the host kernel logs for any oops messages?
> > > 
> > > dmesg was clean but after sometime waiting (I forgot QEMU running
> > > in
> > > another terminal) I got the oops below (after rebooting the host
> > > I 
> > > couldn't reproduce it again).
> > > 
> > > Another test that I did was:
> > > Compile with transparent huge pages disabled: KVM works fine
> > > Compile with transparent huge pages enabled: doesn't work
> > >   + disabling it in /sys/kernel/mm/transparent_hugepage: doesn't
> > > work
> > > 
> > > Just out of my own curiosity I made this small change:
> > > 
> > > diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h
> > > b/arch/powerpc/include
> > > index c0737c8..f94a3b6 100644
> > > --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> > > +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> > > @@ -80,7 +80,7 @@
> > >  
> > >   #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software
> > > dirty
> > >   tracking 
> > >    #define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special
> > > page */
> > >    -#define _PAGE_DEVMAP   _RPAGE_SW1 /* software:
> > > ZONE_DEVICE page */
> > >    +#define _PAGE_DEVMAP   _RPAGE_RSV3
> > > #define __HAVE_ARCH_PTE_DEVMAP
> > > 
> > > and it works. I chose _RPAGE_RSV3 because it uses the same value
> > > that
> > > x86 uses (0x0400UL) but I don't if it could have any
> > > side
> > > effect
> > > 
> > 
> > Does this change make any sense to you people?
> 
> No :)
> 
> I think it's just hiding the bug somehow. Presumably we have some
> code
> somewhere that is getting confused by _RPAGE_SW1 being set, or
> setting
> that bit incorrectly.

kernel BUG at 
/scratch/surajjs/linux/arch/powerpc/include/asm/book3s/64/radix.h:260!
Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=2048 
NUMA 
PowerNV
Modules linked in:
CPU: 3 PID: 2050 Comm: qemu-system-ppc Not tainted 
4.13.0-rc2-1-g2f3013c-dirty #1
task: c00f1ebc task.stack: c00f1ec0
NIP: c0070fd4 LR: c00e2120 CTR: c00e20d0
REGS: c00f1ec036b0 TRAP: 0700   Not tainted  
(4.13.0-rc2-1-g2f3013c-dirty)
MSR: 9282b033 
  CR: 22244824  XER: 
CFAR: c0070e74 SOFTE: 1 
GPR00: 0009 c00f1ec03930 c1067400 19cf0a05 
GPR04: c000 050acf190f80 0005 0800 
GPR08: 0015 800f19cf0a05 c00f1eb64368 0009 
GPR12: 0009 cfd80f00 c00f1eca7a30 4000 
GPR16: 5f9f1780 40002000 7fff5fff 7fff879700a6 
GPR20: 8108 c110bce0 0f61 c00e20d0 
GPR24:  c00f1c7a6008 7fff6f60 7fff5fff 
GPR28: c00f19fd 0da0  c00f1ec03990 
NIP [c0070fd4] __find_linux_pte_or_hugepte+0x1d4/0x350
LR [c00e2120] kvm_unmap_radix+0x50/0x1d0
Call Trace:
[c00f1ec03930] [c00b2554] mark_page_dirty+0x34/0xa0 (unreliable)
[c00f1ec03970] [c00e2120] kvm_unmap_radix+0x50/0x1d0
[c00f1ec039c0] [c00dbea0] kvm_handle_hva_range+0x100/0x170
[c00f1ec03a30] [c00df43c] kvm_unmap_hva_range_hv+0x6c/0x80
[c00f1ec03a70] [c00c7588] kvm_unmap_hva_range+0x48/0x60
[c00f1ec03ab0] [c00bb77c] 
kvm_mmu_notifier_invalidate_range_start+0x8c/0x130
[c00f1ec03b10] [c0316f10] 
__mmu_notifier_invalidate_range_start+0xa0/0xf0
[c00f1ec03b60] [c02e95f0] change_protection+0x840/0xe20
[c00f1ec03cb0] [c0313050] change_prot_numa+0x50/0xd0
[c00f1ec03d00] [c0143f24] task_numa_work+0x2b4/0x3b0
[c00f1ec03dc0] [c0128738] task_work_run+0xf8/0x160
[c00f1ec03e00] [c001db94] do_notify_resume+0xe4/0xf0
[c00f1ec03e30] [c000b744] ret_from_except_lite+0x70/0x74
Instruction dump:
419e00ec 6000 78a70022 54a9403e 50a9c00e 54e3403e 50a9c42e 50e3c00e 
50e3c42e 792907c6 7d291b78 55270528 <0b07> 3ce04000 3c804000 78e707c6 
---[ end trace aecf406c356566bb ]---


The bug on added was:

arch/powerpc/include/asm/book3s/64/radix.h:260:
258 static inline int radix__pmd_trans_huge(pmd_t pmd)
259 {
260 BUG_ON(pmd_val(pmd) & _PAGE_DEVMAP);
261 return (pmd_val(pmd) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE;
262 }

> 
> cheers


[PATCH v3 3/3] powerpc/mm/cxl: Add the fault handling cpu to mm cpumask

2017-07-26 Thread Aneesh Kumar K.V
We use mm cpumask for serializing against lockless page table walk. Anybody
who is doing a lockless page table walk is expected to disable irq and only
cpus in mm cpumask is expected do the lockless walk. This ensure that
a THP split can send IPI to only cpus in the mm cpumask, to make sure there
are no parallel lockless page table walk.

Add the CAPI fault handling cpu to the mm cpumask so that we can do the lockless
page table walk while inserting hash page table entries.

Reviewed-by: Frederic Barrat 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/pgtable-book3s64.c | 10 +-
 drivers/misc/cxl/fault.c   |  6 ++
 2 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-book3s64.c 
b/arch/powerpc/mm/pgtable-book3s64.c
index 57b947cde2bf..3b65917785a5 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -83,15 +83,7 @@ static void do_nothing(void *unused)
 void serialize_against_pte_lookup(struct mm_struct *mm)
 {
smp_mb();
-   /*
-* Cxl fault handling requires us to do a lockless page table
-* walk while inserting hash page table entry with mm tracked
-* in cxl context. Hence we need to do a global flush.
-*/
-   if (cxl_ctx_in_use())
-   smp_call_function(do_nothing, NULL, 1);
-   else
-   smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
+   smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
 }
 
 /*
diff --git a/drivers/misc/cxl/fault.c b/drivers/misc/cxl/fault.c
index 6eed7d03e2b5..ab507e4ed69b 100644
--- a/drivers/misc/cxl/fault.c
+++ b/drivers/misc/cxl/fault.c
@@ -138,6 +138,12 @@ int cxl_handle_mm_fault(struct mm_struct *mm, u64 dsisr, 
u64 dar)
int result;
unsigned long access, flags, inv_flags = 0;
 
+   /*
+* Add the fault handling cpu to task mm cpumask so that we
+* can do a safe lockless page table walk when inserting the
+* hash page table entry.
+*/
+   cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
if ((result = copro_handle_mm_fault(mm, dar, dsisr, &flt))) {
pr_devel("copro_handle_mm_fault failed: %#x\n", result);
return result;
-- 
2.13.3



[PATCH v3 2/3] powerpc/mm: Don't send IPI to all cpus on THP updates

2017-07-26 Thread Aneesh Kumar K.V
Now that we made sure that lockless walk of linux page table is mostly limitted
to current task(current->mm->pgdir) we can update the THP update sequence to
only send IPI to cpus on which this task has run. This helps in reducing the IPI
overload on systems with large number of CPUs.

W.r.t kvm even though kvm is walking page table with vpc->arch.pgdir, it is
done only on secondary cpus and in that case we have primary cpu added to
task's mm cpumask. Sending an IPI to primary will force the secondary to do
a vm exit and hence this mm cpumask usage is safe here.

W.r.t CAPI, we still end up walking linux page table with capi context MM. For
now the pte lookup serialization sends an IPI to all cpus in CPI is in use. We
can further improve this by adding the CAPI interrupt handling cpu to task
mm cpumask. That will be done in a later patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h |  1 +
 arch/powerpc/mm/pgtable-book3s64.c   | 32 +++-
 arch/powerpc/mm/pgtable-hash64.c |  8 +++
 arch/powerpc/mm/pgtable-radix.c  |  8 +++
 4 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index d1da415e283c..f349f5388af6 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1159,6 +1159,7 @@ static inline bool arch_needs_pgtable_deposit(void)
return false;
return true;
 }
+extern void serialize_against_pte_lookup(struct mm_struct *mm);
 
 
 static inline pmd_t pmd_mkdevmap(pmd_t pmd)
diff --git a/arch/powerpc/mm/pgtable-book3s64.c 
b/arch/powerpc/mm/pgtable-book3s64.c
index 31eed8fa8e99..57b947cde2bf 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -9,6 +9,7 @@
 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -64,6 +65,35 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
trace_hugepage_set_pmd(addr, pmd_val(pmd));
return set_pte_at(mm, addr, pmdp_ptep(pmdp), pmd_pte(pmd));
 }
+
+static void do_nothing(void *unused)
+{
+
+}
+/*
+ * Serialize against find_current_mm_pte which does lock-less
+ * lookup in page tables with local interrupts disabled. For huge pages
+ * it casts pmd_t to pte_t. Since format of pte_t is different from
+ * pmd_t we want to prevent transit from pmd pointing to page table
+ * to pmd pointing to huge page (and back) while interrupts are disabled.
+ * We clear pmd to possibly replace it with page table pointer in
+ * different code paths. So make sure we wait for the parallel
+ * find_current_mm_pte to finish.
+ */
+void serialize_against_pte_lookup(struct mm_struct *mm)
+{
+   smp_mb();
+   /*
+* Cxl fault handling requires us to do a lockless page table
+* walk while inserting hash page table entry with mm tracked
+* in cxl context. Hence we need to do a global flush.
+*/
+   if (cxl_ctx_in_use())
+   smp_call_function(do_nothing, NULL, 1);
+   else
+   smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
+}
+
 /*
  * We use this to invalidate a pmdp entry before switching from a
  * hugepte to regular pmd entry.
@@ -77,7 +107,7 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned 
long address,
 * This ensures that generic code that rely on IRQ disabling
 * to prevent a parallel THP split work as expected.
 */
-   kick_all_cpus_sync();
+   serialize_against_pte_lookup(vma->vm_mm);
 }
 
 static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot)
diff --git a/arch/powerpc/mm/pgtable-hash64.c b/arch/powerpc/mm/pgtable-hash64.c
index 443a2c66a304..c0a7372bdaa6 100644
--- a/arch/powerpc/mm/pgtable-hash64.c
+++ b/arch/powerpc/mm/pgtable-hash64.c
@@ -239,7 +239,7 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, 
unsigned long addres
 * by sending an IPI to all the cpus and executing a dummy
 * function there.
 */
-   kick_all_cpus_sync();
+   serialize_against_pte_lookup(vma->vm_mm);
/*
 * Now invalidate the hpte entries in the range
 * covered by pmd. This make sure we take a
@@ -380,16 +380,16 @@ pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
 */
memset(pgtable, 0, PTE_FRAG_SIZE);
/*
-* Serialize against find_linux_pte_or_hugepte which does lock-less
+* Serialize against find_current_mm_pte variants which does lock-less
 * lookup in page tables with local interrupts disabled. For huge pages
 * it casts pmd_t to pte_t. Since format of pte_t is different from
 * pmd_t we want to prevent transit from pmd pointing to page table
 * to pmd pointing to huge page (and back) while interrupts are 
disabled.
 * We clear pmd to possibly replace it with page table pointer in

[PATCH v3 1/3] powerpc/mm: Rename find_linux_pte_or_hugepte

2017-07-26 Thread Aneesh Kumar K.V
Add newer helpers to make the function usage simpler. It is always recommended
to use find_current_mm_pte() for walking the page table. If we cannot use
find_current_mm_pte(), it should be documented why the said usage of
__find_linux_pte() is safe against a parallel THP split.

For now we have KVM code using __find_linux_pte(). This is because kvm code ends
up calling __find_linux_pte() in real mode with MSR_EE=0 but with PACA 
soft_enabled =
1. We may want to fix that later and make sure we keep the MSR_EE and  PACA
soft_enabled in sync. When we do that we can switch kvm to use find_linux_pte().

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/pgtable.h | 10 +
 arch/powerpc/include/asm/pte-walk.h| 38 ++
 arch/powerpc/kernel/eeh.c  |  4 ++--
 arch/powerpc/kernel/io-workarounds.c   |  5 +++--
 arch/powerpc/kvm/book3s_64_mmu_hv.c|  5 +++--
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 28 -
 arch/powerpc/kvm/book3s_64_vio_hv.c| 12 ++-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c| 18 
 arch/powerpc/kvm/e500_mmu_host.c   |  3 ++-
 arch/powerpc/mm/hash_utils_64.c|  5 +++--
 arch/powerpc/mm/hugetlbpage.c  | 24 -
 arch/powerpc/mm/tlb_hash64.c   |  6 --
 arch/powerpc/perf/callchain.c  |  3 ++-
 13 files changed, 106 insertions(+), 55 deletions(-)
 create mode 100644 arch/powerpc/include/asm/pte-walk.h

diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index afae9a336136..eb9d57defb75 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -66,16 +66,8 @@ extern int gup_hugepte(pte_t *ptep, unsigned long sz, 
unsigned long addr,
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
 #define pmd_large(pmd) 0
 #endif
-pte_t *__find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
-  bool *is_thp, unsigned *shift);
-static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
-  bool *is_thp, unsigned *shift)
-{
-   VM_WARN(!arch_irqs_disabled(),
-   "%s called with irq enabled\n", __func__);
-   return __find_linux_pte_or_hugepte(pgdir, ea, is_thp, shift);
-}
 
+/* can we use this in kvm */
 unsigned long vmalloc_to_phys(void *vmalloc_addr);
 
 void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
diff --git a/arch/powerpc/include/asm/pte-walk.h 
b/arch/powerpc/include/asm/pte-walk.h
new file mode 100644
index ..3a5a391a4c6d
--- /dev/null
+++ b/arch/powerpc/include/asm/pte-walk.h
@@ -0,0 +1,38 @@
+#ifndef _ASM_POWERPC_PTE_WALK_H
+#define _ASM_POWERPC_PTE_WALK_H
+
+#ifndef __ASSEMBLY__
+#include 
+
+/* Don't use this directly */
+extern pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
+  bool *is_thp, unsigned *hshift);
+
+static inline pte_t *find_linux_pte(pgd_t *pgdir, unsigned long ea,
+   bool *is_thp, unsigned *hshift)
+{
+   VM_WARN(!arch_irqs_disabled(),
+   "%s called with irq enabled\n", __func__);
+   return __find_linux_pte(pgdir, ea, is_thp, hshift);
+}
+
+static inline pte_t *find_init_mm_pte(unsigned long ea, unsigned *hshift)
+{
+   pgd_t *pgdir = init_mm.pgd;
+   return __find_linux_pte(pgdir, ea, NULL, hshift);
+}
+/*
+ * This is what we should always use. Any other lockless page table lookup 
needs
+ * careful audit against THP split.
+ */
+static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
+bool *is_thp, unsigned *hshift)
+{
+   VM_WARN(!arch_irqs_disabled(),
+   "%s called with irq enabled\n", __func__);
+   VM_WARN(pgdir != current->mm->pgd,
+   "%s lock less page table lookup called on wrong mm\n", 
__func__);
+   return __find_linux_pte(pgdir, ea, is_thp, hshift);
+}
+#endif
+#endif
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 63992b2d8e15..5e6887c40528 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 
 /** Overview:
@@ -352,8 +353,7 @@ static inline unsigned long eeh_token_to_phys(unsigned long 
token)
 * worried about _PAGE_SPLITTING/collapse. Also we will not hit
 * page table free, because of init_mm.
 */
-   ptep = __find_linux_pte_or_hugepte(init_mm.pgd, token,
-  NULL, &hugepage_shift);
+   ptep = find_init_mm_pte(token, &hugepage_shift);
if (!ptep)
return token;
WARN_ON(hugepage_shift);
diff --git a/arch/powerpc/kernel/io-workarounds.c 
b/arch/powerpc/kernel/io-workarounds.c
index a582e0d42525..bbe85f5aea71 100644
--- a/arch/powerpc/kernel/io-workarounds.c
+++ b/arch/powerpc/kernel/io-workarounds.c
@@ -19,

[PATCH v3 3/3] powerpc/mm/hugetlb: Allow runtime allocation of 16G.

2017-07-26 Thread Aneesh Kumar K.V
We now have GIGANTIC_PAGE on powerpc. Currently this is enabled only on
radix with 1G as gigantic hugepage size. Enable this with hash translation mode
too (ie, with 16G hugepage size). Depending on the total system memory we may
be able to allocate 16G hugepage size. This bring parity between radix and hash
translation mode. Also reduce the confusion of the user with respect to
hugetlbfs usage.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hugetlb.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hugetlb.h 
b/arch/powerpc/include/asm/book3s/64/hugetlb.h
index 5c28bd6f2ae1..2d1ca488ca44 100644
--- a/arch/powerpc/include/asm/book3s/64/hugetlb.h
+++ b/arch/powerpc/include/asm/book3s/64/hugetlb.h
@@ -54,9 +54,7 @@ static inline pte_t arch_make_huge_pte(pte_t entry, struct 
vm_area_struct *vma,
 #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
 static inline bool gigantic_page_supported(void)
 {
-   if (radix_enabled())
-   return true;
-   return false;
+   return true;
 }
 #endif
 
-- 
2.13.3



[PATCH v3 2/3] powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel command line

2017-07-26 Thread Aneesh Kumar K.V
With commit aa888a74977a8 ("hugetlb: support larger than MAX_ORDER") we added
support for allocating gigantic hugepages via kernel command line. Switch
ppc64 arch specific code to use that.

W.r.t FSL support, we now limit our allocation range using 
BOOTMEM_ALLOC_ACCESSIBLE.

We use the kernel command line to do reservation of hugetlb pages on powernv
platforms. On pseries hash mmu mode the supported gigantic huge page size is
16GB and that can only be allocated with hypervisor assist. For pseries the
command line option doesn't do the allocation. Instead pseries does gigantic
hugepage allocation based on hypervisor hint that is specified via
"ibm,expected#pages" property of the memory node.

Cc: Scott Wood 
Cc: Christophe Leroy 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |   2 +-
 arch/powerpc/include/asm/hugetlb.h|  14 --
 arch/powerpc/kernel/setup-common.c|   7 -
 arch/powerpc/mm/hash_utils_64.c   |   2 +-
 arch/powerpc/mm/hugetlbpage.c | 177 +++---
 arch/powerpc/mm/init_32.c |   2 -
 6 files changed, 22 insertions(+), 182 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 6981a52b3887..f28d21c69f79 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -468,7 +468,7 @@ extern int htab_bolt_mapping(unsigned long vstart, unsigned 
long vend,
 int psize, int ssize);
 int htab_remove_mapping(unsigned long vstart, unsigned long vend,
int psize, int ssize);
-extern void add_gpage(u64 addr, u64 page_size, unsigned long number_of_pages);
+extern void pseries_add_gpage(u64 addr, u64 page_size, unsigned long 
number_of_pages);
 extern void demote_segment_4k(struct mm_struct *mm, unsigned long addr);
 
 #ifdef CONFIG_PPC_PSERIES
diff --git a/arch/powerpc/include/asm/hugetlb.h 
b/arch/powerpc/include/asm/hugetlb.h
index 7f4025a6c69e..b8a0fb442c64 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -218,18 +218,4 @@ static inline pte_t *hugepte_offset(hugepd_t hpd, unsigned 
long addr,
 }
 #endif /* CONFIG_HUGETLB_PAGE */
 
-/*
- * FSL Book3E platforms require special gpage handling - the gpages
- * are reserved early in the boot process by memblock instead of via
- * the .dts as on IBM platforms.
- */
-#if defined(CONFIG_HUGETLB_PAGE) && (defined(CONFIG_PPC_FSL_BOOK3E) || \
-defined(CONFIG_PPC_8xx))
-extern void __init reserve_hugetlb_gpages(void);
-#else
-static inline void reserve_hugetlb_gpages(void)
-{
-}
-#endif
-
 #endif /* _ASM_POWERPC_HUGETLB_H */
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 94a948207cd2..0f896f17d5ab 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -916,13 +916,6 @@ void __init setup_arch(char **cmdline_p)
/* Reserve large chunks of memory for use by CMA for KVM. */
kvm_cma_reserve();
 
-   /*
-* Reserve any gigantic pages requested on the command line.
-* memblock needs to have been initialized by the time this is
-* called since this will reserve memory.
-*/
-   reserve_hugetlb_gpages();
-
klp_init_thread_info(&init_thread_info);
 
init_mm.start_code = (unsigned long)_stext;
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 7a20669c19e7..2f1f6bc04012 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -509,7 +509,7 @@ static int __init htab_dt_scan_hugepage_blocks(unsigned 
long node,
phys_addr, block_size, expected_pages);
if (phys_addr + (16 * GB) <= memblock_end_of_DRAM()) {
memblock_reserve(phys_addr, block_size * expected_pages);
-   add_gpage(phys_addr, block_size, expected_pages);
+   pseries_add_gpage(phys_addr, block_size, expected_pages);
}
return 0;
 }
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index e1bf5ca397fe..a0271d738a30 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -36,26 +36,6 @@
 unsigned int HPAGE_SHIFT;
 EXPORT_SYMBOL(HPAGE_SHIFT);
 
-/*
- * Tracks gpages after the device tree is scanned and before the
- * huge_boot_pages list is ready.  On non-Freescale implementations, this is
- * just used to track 16G pages and so is a single array.  FSL-based
- * implementations may have more than one gpage size, so we need multiple
- * arrays
- */
-#if defined(CONFIG_PPC_FSL_BOOK3E) || defined(CONFIG_PPC_8xx)
-#define MAX_NUMBER_GPAGES  128
-struct psize_gpages {
-   u64 gpage_list[MAX_NUMBER_GPAGES];
-   unsigned int nr_gpages;
-};
-static struct psize_gpages gpage_freearray[MMU_PAGE_COUNT];
-#else
-#define MAX_NUMBER_GPAGES

[PATCH v3 1/3] mm/hugetlb: Allow arch to override and call the weak function

2017-07-26 Thread Aneesh Kumar K.V
For ppc64, we want to call this function when we are not running as guest.
Also, if we failed to allocate hugepages, let the user know.

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/hugetlb.h | 1 +
 mm/hugetlb.c| 5 -
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 0ed8e41aaf11..8bbbd37ab105 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -358,6 +358,7 @@ int huge_add_to_page_cache(struct page *page, struct 
address_space *mapping,
pgoff_t idx);
 
 /* arch callback */
+int __init __alloc_bootmem_huge_page(struct hstate *h);
 int __init alloc_bootmem_huge_page(struct hstate *h);
 
 void __init hugetlb_bad_size(void);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc48ee783dd9..a3a7a7e6339e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2083,7 +2083,9 @@ struct page *alloc_huge_page_noerr(struct vm_area_struct 
*vma,
return page;
 }
 
-int __weak alloc_bootmem_huge_page(struct hstate *h)
+int alloc_bootmem_huge_page(struct hstate *h)
+   __attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
+int __alloc_bootmem_huge_page(struct hstate *h)
 {
struct huge_bootmem_page *m;
int nr_nodes, node;
@@ -2104,6 +2106,7 @@ int __weak alloc_bootmem_huge_page(struct hstate *h)
goto found;
}
}
+   pr_info("Failed to allocate hugepage of size %ld\n", huge_page_size(h));
return 0;
 
 found:
-- 
2.13.3



blk_mq_sched_insert_request: inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage

2017-07-26 Thread Michael Ellerman
Hi Jens,

I'm seeing the lockdep warning below on shutdown on a Power8 machine
using IPR.

If I'm reading it right it looks like the spin_lock() (non-irq) in
blk_mq_sched_insert_request() is the immediate cause.

Looking at blk_mq_requeue_work() (the caller), it is doing
spin_lock_irqsave(). So is switching blk_mq_sched_insert_request() to
spin_lock_irqsave() the right fix?

cheers


ipr 0001:08:00.0: shutdown


WARNING: inconsistent lock state
4.13.0-rc2-gcc6x-gf74c89b #1 Not tainted

inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/28/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
 (&(&hctx->lock)->rlock){+.?...}, at: [] 
blk_mq_sched_dispatch_requests+0xa4/0x2a0
{SOFTIRQ-ON-W} state was registered at:
  lock_acquire+0xec/0x2e0
  _raw_spin_lock+0x44/0x70
  blk_mq_sched_insert_request+0x88/0x1f0
  blk_mq_requeue_work+0x108/0x180
  process_one_work+0x310/0x800
  worker_thread+0x88/0x520
  kthread+0x164/0x1b0
  ret_from_kernel_thread+0x5c/0x74
irq event stamp: 3572314
hardirqs last  enabled at (3572314): [] 
_raw_spin_unlock_irqrestore+0x58/0xb0
hardirqs last disabled at (3572313): [] 
_raw_spin_lock_irqsave+0x3c/0x90
softirqs last  enabled at (3572302): [] irq_enter+0x9c/0xe0
softirqs last disabled at (3572303): [] irq_exit+0x108/0x150

other info that might help us debug this:
 Possible unsafe locking scenario:

   CPU0
   
  lock(&(&hctx->lock)->rlock);
  
lock(&(&hctx->lock)->rlock);

 *** DEADLOCK ***

2 locks held by swapper/28/0:
 #0:  ((&ipr_cmd->timer)){+.-...}, at: [] 
call_timer_fn+0x10/0x4b0
 #1:  (rcu_read_lock){..}, at: [] 
__blk_mq_run_hw_queue+0xa0/0x2c0

stack backtrace:
CPU: 28 PID: 0 Comm: swapper/28 Not tainted 4.13.0-rc2-gcc6x-gf74c89b #1
Call Trace:
[c01fffe97550] [c0b50818] dump_stack+0xe8/0x160 (unreliable)
[c01fffe97590] [c01586d0] print_usage_bug+0x2d0/0x390
[c01fffe97640] [c0158f34] mark_lock+0x7a4/0x8e0
[c01fffe976f0] [c015a000] __lock_acquire+0x6a0/0x1a70
[c01fffe97860] [c015befc] lock_acquire+0xec/0x2e0
[c01fffe97930] [c0b71514] _raw_spin_lock+0x44/0x70
[c01fffe97960] [c05b60f4] blk_mq_sched_dispatch_requests+0xa4/0x2a0
[c01fffe979c0] [c05acac0] __blk_mq_run_hw_queue+0x100/0x2c0
[c01fffe97a00] [c05ad478] __blk_mq_delay_run_hw_queue+0x118/0x130
[c01fffe97a40] [c05ad61c] blk_mq_start_hw_queues+0x6c/0xa0
[c01fffe97a80] [c0797aac] scsi_kick_queue+0x2c/0x60
[c01fffe97aa0] [c0797cf0] scsi_run_queue+0x210/0x360
[c01fffe97b10] [c079b888] scsi_run_host_queues+0x48/0x80
[c01fffe97b40] [c07b6090] ipr_ioa_bringdown_done+0x70/0x1e0
[c01fffe97bc0] [c07bc860] ipr_reset_ioa_job+0x80/0xf0
[c01fffe97bf0] [c07b4d50] ipr_reset_timer_done+0xd0/0x100
[c01fffe97c30] [c01937bc] call_timer_fn+0xdc/0x4b0
[c01fffe97cf0] [c0193d08] expire_timers+0x178/0x330
[c01fffe97d60] [c01940c8] run_timer_softirq+0xb8/0x120
[c01fffe97de0] [c0b726a8] __do_softirq+0x168/0x6d8
[c01fffe97ef0] [c00df2c8] irq_exit+0x108/0x150
[c01fffe97f10] [c0017bf4] __do_irq+0x2a4/0x4a0
[c01fffe97f90] [c002da50] call_do_irq+0x14/0x24
[c007fad93aa0] [c0017e8c] do_IRQ+0x9c/0x140
[c007fad93af0] [c0008b98] hardware_interrupt_common+0x138/0x140
--- interrupt: 501 at .L1.42+0x0/0x4
LR = arch_local_irq_restore.part.4+0x84/0xb0
[c007fad93de0] [c007ffc1f7d8] 0xc007ffc1f7d8 (unreliable)
[c007fad93e00] [c0988d3c] cpuidle_enter_state+0x1bc/0x530
[c007fad93e60] [c01457cc] call_cpuidle+0x4c/0x90
[c007fad93e80] [c0145b28] do_idle+0x208/0x2f0
[c007fad93ef0] [c0145f8c] cpu_startup_entry+0x3c/0x50
[c007fad93f20] [c0042bc0] start_secondary+0x3b0/0x4b0
[c007fad93f90] [c000ac6c] start_secondary_prolog+0x10/0x14


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Nicholas Piggin
On Wed, 26 Jul 2017 18:42:14 -0700
"Paul E. McKenney"  wrote:

> On Wed, Jul 26, 2017 at 04:22:00PM -0700, David Miller wrote:

> > Indeed, that really wouldn't explain how we end up with a RCU stall
> > dump listing almost all of the cpus as having missed a grace period.  
> 
> I have seen stranger things, but admittedly not often.

So the backtraces show the RCU gp thread in schedule_timeout.

Are you sure that it's timeout has expired and it's not being scheduled,
or could it be a bad (large) timeout (looks unlikely) or that it's being
scheduled but not correctly noting gps on other CPUs?

It's not in R state, so if it's not being scheduled at all, then it's
because the timer has not fired:

[ 1984.628602] rcu_preempt kthread starved for 5663 jiffies! g1566 c1565 f0x0 
RCU_GP_WAIT_FQS(3) ->state=0x1
[ 1984.638153] rcu_preempt S0 9  2 0x
[ 1984.643626] Call trace:
[ 1984.646059] [] __switch_to+0x90/0xa8
[ 1984.651189] [] __schedule+0x19c/0x5d8
[ 1984.656400] [] schedule+0x38/0xa0
[ 1984.661266] [] schedule_timeout+0x124/0x218
[ 1984.667002] [] rcu_gp_kthread+0x4fc/0x748
[ 1984.672564] [] kthread+0xfc/0x128
[ 1984.677429] [] ret_from_fork+0x10/0x50


Re: KVM guests freeze under upstream kernel

2017-07-26 Thread Michael Ellerman
jos...@linux.vnet.ibm.com writes:
> On Thu, Jul 20, 2017 at 10:18:18PM -0300, jos...@linux.vnet.ibm.com wrote:
>> On Thu, Jul 20, 2017 at 03:21:59PM +1000, Paul Mackerras wrote:
>> > 
>> > Did you check the host kernel logs for any oops messages?
>> 
>> dmesg was clean but after sometime waiting (I forgot QEMU running in
>> another terminal) I got the oops below (after rebooting the host I 
>> couldn't reproduce it again).
>> 
>> Another test that I did was:
>> Compile with transparent huge pages disabled: KVM works fine
>> Compile with transparent huge pages enabled: doesn't work
>>   + disabling it in /sys/kernel/mm/transparent_hugepage: doesn't work
>> 
>> Just out of my own curiosity I made this small change:
>> 
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> b/arch/powerpc/include
>> index c0737c8..f94a3b6 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -80,7 +80,7 @@
>>  
>>   #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software dirty
>>   tracking 
>>#define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
>>-#define _PAGE_DEVMAP   _RPAGE_SW1 /* software: ZONE_DEVICE page 
>> */
>>+#define _PAGE_DEVMAP   _RPAGE_RSV3
>> #define __HAVE_ARCH_PTE_DEVMAP
>> 
>> and it works. I chose _RPAGE_RSV3 because it uses the same value that
>> x86 uses (0x0400UL) but I don't if it could have any side
>> effect
>> 
>
> Does this change make any sense to you people?

No :)

I think it's just hiding the bug somehow. Presumably we have some code
somewhere that is getting confused by _RPAGE_SW1 being set, or setting
that bit incorrectly.

cheers


Re: [PATCH 05/11] powerpc/topology: Remove the unused parent_node() macro

2017-07-26 Thread Dou Liyang

Hi Michael,

At 07/27/2017 10:21 AM, Michael Ellerman wrote:

Dou Liyang  writes:


Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro in POWERPC platform is unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman 
Signed-off-by: Dou Liyang 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/topology.h | 2 --
 1 file changed, 2 deletions(-)


Thanks for doing this series.



It's my pleasure. :)

Sincerely,

dou


Acked-by: Michael Ellerman 

cheers








Re: [PATCH 05/11] powerpc/topology: Remove the unused parent_node() macro

2017-07-26 Thread Michael Ellerman
Dou Liyang  writes:

> Commit a7be6e5a7f8d ("mm: drop useless local parameters of
> __register_one_node()") removes the last user of parent_node().
>
> The parent_node() macro in POWERPC platform is unnecessary.
>
> Remove it for cleanup.
>
> Reported-by: Michael Ellerman 
> Signed-off-by: Dou Liyang 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: linuxppc-dev@lists.ozlabs.org
> ---
>  arch/powerpc/include/asm/topology.h | 2 --
>  1 file changed, 2 deletions(-)

Thanks for doing this series.

Acked-by: Michael Ellerman 

cheers


Re: [PATCH 1/6] powerpc: Free up four 64K PTE bits in 4K backed HPTE pages

2017-07-26 Thread Aneesh Kumar K.V



On 07/26/2017 09:36 PM, Ram Pai wrote:

On Wed, Jul 26, 2017 at 04:05:48PM +0530, Aneesh Kumar K.V wrote:

Ram Pai  writes:




diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 9732837..62e580c 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -12,18 +12,14 @@
   */
  #define H_PAGE_COMBO  _RPAGE_RPN0 /* this is a combo 4k page */
  #define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */
+#define H_PAGE_BUSY_RPAGE_RPN42 /* software: PTE & hash are busy */



Why are we moving H_PAGE_BUSY. Right now 4k and 64k linux page table
format looks similar.


The goal is to clear off all the _RPAGE_RSV* bits so that they can be
used for protection keys.  the aim is to keep the protection-bits in the
_RPAGE_RSV* bits, so that they will work as-is whenever radix MMU enables
protection keys.

Yes this makes the PTE format differ from 4k PTE. Hopefully it is a
small inconvenience. The PTE format for 4K is anyway not exactly the
same compared to 64K PTE format. For example, higher RPN bits are
used on 4K but not on 64k. lower RPN bits are used on 64k but not
on 4k.

I was wondering why in this patch ? You do in the next patch

--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -12,7 +12,7 @@
  */
 #define H_PAGE_COMBO   _RPAGE_RPN0 /* this is a combo 4k page */
 #define H_PAGE_4K_PFN  _RPAGE_RPN1 /* PFN is for a single 4k page */
-#define H_PAGE_BUSY_RPAGE_RPN42 /* software: PTE & hash are busy */
+#define H_PAGE_BUSY_RPAGE_RPN44 /* software: PTE & hash are busy */



-aneesh



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Paul E. McKenney
On Wed, Jul 26, 2017 at 04:22:00PM -0700, David Miller wrote:
> From: "Paul E. McKenney" 
> Date: Wed, 26 Jul 2017 16:15:05 -0700
> 
> > On Wed, Jul 26, 2017 at 03:45:40PM -0700, David Miller wrote:
> >> Just out of curiousity, what x86 idle method is your machine using?
> >> The mwait one or the one which simply uses 'halt'?  The mwait variant
> >> might mask this bug, and halt would be a lot closer to how sparc64 and
> >> Jonathan's system operates.
> > 
> > My kernel builds with CONFIG_INTEL_IDLE=n, which I believe means that
> > I am not using the mwait one.  Here is a grep for IDLE in my .config:
> > 
> > CONFIG_NO_HZ_IDLE=y
> > CONFIG_GENERIC_SMP_IDLE_THREAD=y
> > # CONFIG_IDLE_PAGE_TRACKING is not set
> > CONFIG_ACPI_PROCESSOR_IDLE=y
> > CONFIG_CPU_IDLE=y
> > # CONFIG_CPU_IDLE_GOV_LADDER is not set
> > CONFIG_CPU_IDLE_GOV_MENU=y
> > # CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
> > # CONFIG_INTEL_IDLE is not set
> 
> No, that doesn't influence it.  It is determined by cpu features at
> run time.
> 
> If you are using mwait, it'll say so in your kernel log like this:
> 
>   using mwait in idle threads

Thank you for the hint!

And vim says:

"E486: Pattern not found: using mwait in idle threads"

> >> On sparc64 the cpu yield we do in the idle loop sleeps the cpu.  It's
> >> local TICK register keeps advancing, and the local timer therefore
> >> will still trigger.  Also, any externally generated interrupts
> >> (including cross calls) will wake up the cpu as well.
> >> 
> >> The tick-sched code is really tricky wrt. NO_HZ even in the NO_HZ_IDLE
> >> case.  One of my running theories is that we miss scheduling a tick
> >> due to a race.  That would be consistent with the behavior we see
> >> in the RCU dumps, I think.
> > 
> > But wouldn't you have to miss a -lot- of ticks to get an RCU CPU stall
> > warning?  By default, your grace period needs to extend for more than
> > 21 seconds (more than one-third of a -minute-) to get one.  Or do
> > you mean that the ticks get shut off now and forever, as opposed to
> > just losing one of them?
> 
> Hmmm, good point.  And I was talking about simply missing one tick.
> 
> Indeed, that really wouldn't explain how we end up with a RCU stall
> dump listing almost all of the cpus as having missed a grace period.

I have seen stranger things, but admittedly not often.

Thanx, Paul



Re: [PATCH v3 4/5] powerpc/lib/sstep: Add prty instruction emulation

2017-07-26 Thread Michael Ellerman
Segher Boessenkool  writes:

> On Wed, Jul 26, 2017 at 08:03:30PM +1000, Michael Ellerman wrote:
>> Segher Boessenkool  writes:
>> > A general question about these patches: some things are inside #ifdef
>> > __powerpc64__, some are not.  It seems it is the wrong macro, and it
>> > should be used (or not used) consistently?
>> 
>> Why is it the wrong macro? Because we tend to use CONFIG_PPC64 you mean?
>
> Yeah.  But I see sstep.c already mixes those two at will (or if there
> is a distinction, I'm not seeing it :-) )

Yeah OK. In practice they're equivalent, if CONFIG_PPC64=y then the
kernel is built 64-bit and therefore __powerpc64__ is defined.

But I agree it's a mess, we should use CONFIG_PPC64 exclusively unless
there's some reason not to (which I don't think there ever is).

>> I thought the reason some are #ifdef'ed is that some are 64-bit only.
>> ie. bpermd is 64-bit only ?
>
> 64-bit only, in what way?  It's not clear what the rules are.

Instructions that have "d" in the name? :P

> It's not instructions that can only run in 64-bit mode.
> It's not instructions that only give a usable result with 64-bit regs
> implemented.
> It's not instructions only implemented on 64-bit CPUs.

I think it's trying to be that ^

If you build a 32-bit kernel then instructions that are only defined on
64-bit CPUs should be treated as illegal, so the easiest way to achieve
that is to #ifdef off the code for those instructions.

cheers


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread David Miller
From: "Paul E. McKenney" 
Date: Wed, 26 Jul 2017 16:15:05 -0700

> On Wed, Jul 26, 2017 at 03:45:40PM -0700, David Miller wrote:
>> Just out of curiousity, what x86 idle method is your machine using?
>> The mwait one or the one which simply uses 'halt'?  The mwait variant
>> might mask this bug, and halt would be a lot closer to how sparc64 and
>> Jonathan's system operates.
> 
> My kernel builds with CONFIG_INTEL_IDLE=n, which I believe means that
> I am not using the mwait one.  Here is a grep for IDLE in my .config:
> 
>   CONFIG_NO_HZ_IDLE=y
>   CONFIG_GENERIC_SMP_IDLE_THREAD=y
>   # CONFIG_IDLE_PAGE_TRACKING is not set
>   CONFIG_ACPI_PROCESSOR_IDLE=y
>   CONFIG_CPU_IDLE=y
>   # CONFIG_CPU_IDLE_GOV_LADDER is not set
>   CONFIG_CPU_IDLE_GOV_MENU=y
>   # CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
>   # CONFIG_INTEL_IDLE is not set

No, that doesn't influence it.  It is determined by cpu features at
run time.

If you are using mwait, it'll say so in your kernel log like this:

using mwait in idle threads

>> On sparc64 the cpu yield we do in the idle loop sleeps the cpu.  It's
>> local TICK register keeps advancing, and the local timer therefore
>> will still trigger.  Also, any externally generated interrupts
>> (including cross calls) will wake up the cpu as well.
>> 
>> The tick-sched code is really tricky wrt. NO_HZ even in the NO_HZ_IDLE
>> case.  One of my running theories is that we miss scheduling a tick
>> due to a race.  That would be consistent with the behavior we see
>> in the RCU dumps, I think.
> 
> But wouldn't you have to miss a -lot- of ticks to get an RCU CPU stall
> warning?  By default, your grace period needs to extend for more than
> 21 seconds (more than one-third of a -minute-) to get one.  Or do
> you mean that the ticks get shut off now and forever, as opposed to
> just losing one of them?

Hmmm, good point.  And I was talking about simply missing one tick.

Indeed, that really wouldn't explain how we end up with a RCU stall
dump listing almost all of the cpus as having missed a grace period.


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Paul E. McKenney
On Wed, Jul 26, 2017 at 03:45:40PM -0700, David Miller wrote:
> From: "Paul E. McKenney" 
> Date: Wed, 26 Jul 2017 15:36:58 -0700
> 
> > And without CONFIG_SOFTLOCKUP_DETECTOR, I see five runs of 24 with RCU
> > CPU stall warnings.  So it seems likely that CONFIG_SOFTLOCKUP_DETECTOR
> > really is having an effect.
> 
> Thanks for all of the info Paul, I'll digest this and scan over the
> code myself.
> 
> Just out of curiousity, what x86 idle method is your machine using?
> The mwait one or the one which simply uses 'halt'?  The mwait variant
> might mask this bug, and halt would be a lot closer to how sparc64 and
> Jonathan's system operates.

My kernel builds with CONFIG_INTEL_IDLE=n, which I believe means that
I am not using the mwait one.  Here is a grep for IDLE in my .config:

CONFIG_NO_HZ_IDLE=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
# CONFIG_IDLE_PAGE_TRACKING is not set
CONFIG_ACPI_PROCESSOR_IDLE=y
CONFIG_CPU_IDLE=y
# CONFIG_CPU_IDLE_GOV_LADDER is not set
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
# CONFIG_INTEL_IDLE is not set

> On sparc64 the cpu yield we do in the idle loop sleeps the cpu.  It's
> local TICK register keeps advancing, and the local timer therefore
> will still trigger.  Also, any externally generated interrupts
> (including cross calls) will wake up the cpu as well.
> 
> The tick-sched code is really tricky wrt. NO_HZ even in the NO_HZ_IDLE
> case.  One of my running theories is that we miss scheduling a tick
> due to a race.  That would be consistent with the behavior we see
> in the RCU dumps, I think.

But wouldn't you have to miss a -lot- of ticks to get an RCU CPU stall
warning?  By default, your grace period needs to extend for more than
21 seconds (more than one-third of a -minute-) to get one.  Or do
you mean that the ticks get shut off now and forever, as opposed to
just losing one of them?

> Anyways, just a theory, and that's why I keep mentioning that commit
> about the revert of the revert (specifically
> 411fe24e6b7c283c3a1911450cdba6dd3aaea56e).
> 
> :-)

I am running an overnight test in preparation for attempting to push
some fixes for regressions into 4.12, but will try reverting this
and enabling CONFIG_HZ_PERIODIC tomorrow.

Jonathan, might the commit that Dave points out above be what reduces
the probability of occurrence as you test older releases?

Thanx, Paul



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread David Miller
From: "Paul E. McKenney" 
Date: Wed, 26 Jul 2017 15:36:58 -0700

> And without CONFIG_SOFTLOCKUP_DETECTOR, I see five runs of 24 with RCU
> CPU stall warnings.  So it seems likely that CONFIG_SOFTLOCKUP_DETECTOR
> really is having an effect.

Thanks for all of the info Paul, I'll digest this and scan over the
code myself.

Just out of curiousity, what x86 idle method is your machine using?
The mwait one or the one which simply uses 'halt'?  The mwait variant
might mask this bug, and halt would be a lot closer to how sparc64 and
Jonathan's system operates.

On sparc64 the cpu yield we do in the idle loop sleeps the cpu.  It's
local TICK register keeps advancing, and the local timer therefore
will still trigger.  Also, any externally generated interrupts
(including cross calls) will wake up the cpu as well.

The tick-sched code is really tricky wrt. NO_HZ even in the NO_HZ_IDLE
case.  One of my running theories is that we miss scheduling a tick
due to a race.  That would be consistent with the behavior we see
in the RCU dumps, I think.

Anyways, just a theory, and that's why I keep mentioning that commit
about the revert of the revert (specifically
411fe24e6b7c283c3a1911450cdba6dd3aaea56e).

:-)


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Paul E. McKenney
On Wed, Jul 26, 2017 at 10:50:13AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 26, 2017 at 09:54:32AM -0700, David Miller wrote:
> > From: "Paul E. McKenney" 
> > Date: Wed, 26 Jul 2017 08:49:00 -0700
> > 
> > > On Wed, Jul 26, 2017 at 04:33:40PM +0100, Jonathan Cameron wrote:
> > >> Didn't leave it long enough. Still bad on 4.10-rc7 just took over
> > >> an hour to occur.
> > > 
> > > And it is quite possible that SOFTLOCKUP_DETECTOR=y and HZ_PERIODIC=y
> > > are just greatly reducing the probability of the problem rather than
> > > completely preventing it.
> > > 
> > > Still, hopefully useful information, thank you for the testing!
> > 
> > I guess that invalidates my idea to test reverting recent changes to
> > the tick-sched.c code... :-/
> > 
> > In NO_HZ_IDLE mode, what is really supposed to happen on a completely
> > idle system?
> > 
> > All the cpus enter the idle loop, have no timers programmed, and they
> > all just go to sleep until an external event happens.
> > 
> > What ensures that grace periods get processed in this regime?
> 
> There are several different situations with different mechanisms:
> 
> 1.No grace period is in progress and no RCU callbacks are pending
>   anywhere in the system.  In this case, some other event would
>   need to start a grace period, so RCU just stays idle until that
>   happens, possibly indefinitely.  According to the battery-powered
>   embedded guys, this is a feature, not a bug.  ;-)
> 
> 2.No grace period is in progress, but there is at least one RCU
>   callback somewhere in the system.  In this case, the mechanism
>   depends on CONFIG_RCU_FAST_NO_HZ:
> 
>   CONFIG_RCU_FAST_NO_HZ=n:  The CPU on which the callback is
>   queued will return "true" in response to the call to
>   rcu_needs_cpu() that is made shortly before that CPU
>   enters idle.  This will cause the scheduling-clock
>   interrupt to remain on, despite the CPU being idle,
>   which will in turn allow RCU's state machine to continue
>   running out of softirq, triggered by the scheduling-clock
>   interrupts.
> 
>   CONFIG_RCU_FAST_NO_HZ=y:  The CPU on which the callback is queued
>   will return "false" in response to the call to
>   rcu_needs_cpu() that is made shortly before that CPU
>   enters idle.  However, it will also request a next event
>   about six seconds in the future if all callbacks do
>   nothing but free memory (kfree_rcu()), or about four
>   jiffies in the future if at least one callback does
>   something more than just free memory.
> 
>   There is also a rcu_prepare_for_idle() function that
>   is invoked later in the idle-entry process in this case
>   which will wake up the grace-period kthread if need be.
> 
> 3.A grace period is in progress.  In this case the grace-period
>   kthread is either currently running (in which case there will be
>   at least one non-idle CPU) or is in a timed wait for its next
>   scan for idle/offline CPUs (such CPUs need the grace-period
>   kthread to report quiescent states on their behalf).  In this
>   latter case, the timer subsystem will post a next event that
>   will be the wakeup time for the grace-period kthread, or some
>   earlier event.
> 
>   This is where we have been seeing trouble, if for no other
>   reason because RCU CPU stall warnings only happen when there
>   is a grace period in progress.
> 
> That is the theory, anyway...
> 
> And when I enabled CONFIG_SOFTLOCKUP_DETECTOR, I still see failures.
> I did 24 half-hour rcutorture runs on the TREE01 scenario, and two of them
> saw RCU CPU stall warnings with starvation of the grace-period kthread.
> I just now started another test but without CONFIG_SOFTLOCKUP_DETECTOR
> to see if it makes a significance difference for my testing.  I do have
> CONFIG_RCU_FAST_NO_HZ=y in my runs.

And without CONFIG_SOFTLOCKUP_DETECTOR, I see five runs of 24 with RCU
CPU stall warnings.  So it seems likely that CONFIG_SOFTLOCKUP_DETECTOR
really is having an effect.

Thanx, Paul



[PATCH 05/11] powerpc/topology: Remove the unused parent_node() macro

2017-07-26 Thread Dou Liyang
Commit a7be6e5a7f8d ("mm: drop useless local parameters of
__register_one_node()") removes the last user of parent_node().

The parent_node() macro in POWERPC platform is unnecessary.

Remove it for cleanup.

Reported-by: Michael Ellerman 
Signed-off-by: Dou Liyang 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/topology.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index dc4e159..2d84bca 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -16,8 +16,6 @@ struct device_node;
 
 #include 
 
-#define parent_node(node)  (node)
-
 #define cpumask_of_node(node) ((node) == -1 ?  \
   cpu_all_mask :   \
   node_to_cpumask_map[node])
-- 
2.5.5





Re: [RFC Part1 PATCH v3 13/17] x86/io: Unroll string I/O when SEV is active

2017-07-26 Thread Brijesh Singh



On 07/26/2017 02:26 PM, H. Peter Anvin wrote:


   \

   static inline void outs##bwl(int port, const void *addr, unsigned

long count) \

   {


This will clash with a fix I did to add a "memory" clobber
for the traditional implementation, see
https://patchwork.kernel.org/patch/9854573/


Is it even worth leaving these as inline functions?
Given the speed of IO cycles it is unlikely that the cost of calling

a real

function will be significant.
The code bloat reduction will be significant.


I think the smallest code would be the original "rep insb" etc, which
should be smaller than a function call, unlike the loop. Then again,
there is a rather small number of affected device drivers, almost all
of them for ancient hardware that you won't even build in a 64-bit
x86 kernel, see the list below. The only user I found that is

actually

still relevant is drivers/tty/hvc/hvc_xen.c, which uses it for the

early

console.



There are some indirect user of string I/O functions. The following
functions
defined in lib/iomap.c calls rep version of ins and outs.

- ioread8_rep, ioread16_rep, ioread32_rep
- iowrite8_rep, iowrite16_rep, iowrite32_rep

I found that several drivers use above functions.

Here is one approach to convert it into non-inline functions. In this
approach,
I have added a new file arch/x86/kernel/io.c which provides non rep
version of
string I/O routines. The file gets built and used only when
AMD_MEM_ENCRYPT is
enabled. On positive side, if we don't build kernel with
AMD_MEM_ENCRYPT support
then we use inline routines, when AMD_MEM_ENCRYPT is built then we make
a function
call. Inside the function we unroll only when SEV is active.

Do you see any issue with this approach ? thanks

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index e080a39..104927d 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -323,8 +323,9 @@ static inline unsigned type in##bwl##_p(int port)
\
  unsigned type value = in##bwl(port);\
  slow_down_io(); \
  return value;   \
-}
\
-
\
+}
+
+#define BUILDIO_REP(bwl, bw, type)
\
static inline void outs##bwl(int port, const void *addr, unsigned long
count) \
{
\
  asm volatile("rep; outs" #bwl   \
@@ -335,12 +336,31 @@ static inline void ins##bwl(int port, void *addr,
unsigned long count)\
{
\
  asm volatile("rep; ins" #bwl\
   : "+D"(addr), "+c"(count) : "d"(port));\
-}
+}
\
  
  BUILDIO(b, b, char)

  BUILDIO(w, w, short)
  BUILDIO(l, , int)
  
+#ifdef CONFIG_AMD_MEM_ENCRYPT

+extern void outsb_try_rep(int port, const void *addr, unsigned long
count);
+extern void insb_try_rep(int port, void *addr, unsigned long count);
+extern void outsw_try_rep(int port, const void *addr, unsigned long
count);
+extern void insw_try_rep(int port, void *addr, unsigned long count);
+extern void outsl_try_rep(int port, const void *addr, unsigned long
count);
+extern void insl_try_rep(int port, void *addr, unsigned long count);
+#define outsb  outsb_try_rep
+#define insb   insb_try_rep
+#define outsw  outsw_try_rep
+#define insw   insw_try_rep
+#define outsl  outsl_try_rep
+#define insl   insl_try_rep
+#else
+BUILDIO_REP(b, b, char)
+BUILDIO_REP(w, w, short)
+BUILDIO_REP(l, , int)
+#endif
+
  extern void *xlate_dev_mem_ptr(phys_addr_t phys);
  extern void unxlate_dev_mem_ptr(phys_addr_t phys, void *addr);

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index a01892b..3b6e2a3 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -42,6 +42,7 @@ CFLAGS_irq.o := -I$(src)/../include/asm/trace
  
  obj-y  := process_$(BITS).o signal.o

  obj-$(CONFIG_COMPAT)   += signal_compat.o
+obj-$(CONFIG_AMD_MEM_ENCRYPT) += io.o
obj-y  += traps.o irq.o irq_$(BITS).o
dumpstack_$(BITS).o
  obj-y  += time.o ioport.o dumpstack.o nmi.o
  obj-$(CONFIG_MODIFY_LDT_SYSCALL)   += ldt.o
diff --git a/arch/x86/kernel/io.c b/arch/x86/kernel/io.c
new file mode 100644
index 000..f58afa9
--- /dev/null
+++ b/arch/x86/kernel/io.c
@@ -0,0 +1,87 @@
+#include 
+#include 
+#include 
+
+void outsb_try_rep(int port, const void *addr, unsigned long count)
+{
+   if (sev_active()) {
+   unsigned char *value = (unsigned char *)addr;
+   while (count) {
+   outb(*value, port);
+   value++;
+   count--;
+   }
+   } else {
+   asm volatile("rep; outsb" : "+S"(addr), "+c"(count) :
"d"(port));
+   }
+}
+
+void insb_try_rep(int port, void *addr, unsigned long count)
+{
+   if (sev_active()) {
+   unsigned char *value = (unsigned char *)addr;
+  

Re: [RFC Part1 PATCH v3 13/17] x86/io: Unroll string I/O when SEV is active

2017-07-26 Thread H. Peter Anvin
,Eric Biederman ,Tejun Heo 
,Paolo Bonzini ,Andrew Morton 
,"Kirill A . Shutemov" 
,Lu Baolu 
From: h...@zytor.com
Message-ID: 

On July 26, 2017 9:24:45 PM GMT+02:00, Brijesh Singh  
wrote:
>
>Hi Arnd and David,
>
>On 07/26/2017 05:45 AM, Arnd Bergmann wrote:
>> On Tue, Jul 25, 2017 at 11:51 AM, David Laight
> wrote:
>>> From: Brijesh Singh
 Sent: 24 July 2017 20:08
 From: Tom Lendacky 

 Secure Encrypted Virtualization (SEV) does not support string I/O,
>so
 unroll the string I/O operation into a loop operating on one
>element at
 a time.

 Signed-off-by: Tom Lendacky 
 Signed-off-by: Brijesh Singh 
 ---
   arch/x86/include/asm/io.h | 26 ++
   1 file changed, 22 insertions(+), 4 deletions(-)

 diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
 index e080a39..2f3c002 100644
 --- a/arch/x86/include/asm/io.h
 +++ b/arch/x86/include/asm/io.h
 @@ -327,14 +327,32 @@ static inline unsigned type in##bwl##_p(int
>port)   \

>   \
   static inline void outs##bwl(int port, const void *addr, unsigned
>long count) \
   {
>> 
>> This will clash with a fix I did to add a "memory" clobber
>> for the traditional implementation, see
>> https://patchwork.kernel.org/patch/9854573/
>> 
>>> Is it even worth leaving these as inline functions?
>>> Given the speed of IO cycles it is unlikely that the cost of calling
>a real
>>> function will be significant.
>>> The code bloat reduction will be significant.
>> 
>> I think the smallest code would be the original "rep insb" etc, which
>> should be smaller than a function call, unlike the loop. Then again,
>> there is a rather small number of affected device drivers, almost all
>> of them for ancient hardware that you won't even build in a 64-bit
>> x86 kernel, see the list below. The only user I found that is
>actually
>> still relevant is drivers/tty/hvc/hvc_xen.c, which uses it for the
>early
>> console.
>
>
>There are some indirect user of string I/O functions. The following
>functions
>defined in lib/iomap.c calls rep version of ins and outs.
>
>- ioread8_rep, ioread16_rep, ioread32_rep
>- iowrite8_rep, iowrite16_rep, iowrite32_rep
>
>I found that several drivers use above functions.
>
>Here is one approach to convert it into non-inline functions. In this
>approach,
>I have added a new file arch/x86/kernel/io.c which provides non rep
>version of
>string I/O routines. The file gets built and used only when
>AMD_MEM_ENCRYPT is
>enabled. On positive side, if we don't build kernel with
>AMD_MEM_ENCRYPT support
>then we use inline routines, when AMD_MEM_ENCRYPT is built then we make
>a function
>call. Inside the function we unroll only when SEV is active.
>
>Do you see any issue with this approach ? thanks
>
>diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
>index e080a39..104927d 100644
>--- a/arch/x86/include/asm/io.h
>+++ b/arch/x86/include/asm/io.h
>@@ -323,8 +323,9 @@ static inline unsigned type in##bwl##_p(int port)  
>\
>  unsigned type value = in##bwl(port);\
>  slow_down_io(); \
>  return value;   \
>-} 
>\
>-  
>\
>+}
>+
>+#define BUILDIO_REP(bwl, bw, type)
>\
>static inline void outs##bwl(int port, const void *addr, unsigned long
>count) \
>{ 
>\
>  asm volatile("rep; outs" #bwl   \
>@@ -335,12 +336,31 @@ static inline void ins##bwl(int port, void *addr,
>unsigned long count)\
>{ 
>\
>  asm volatile("rep; ins" #bwl\
>   : "+D"(addr), "+c"(count) : "d"(port));\
>-}
>+} 
>\
>  
>  BUILDIO(b, b, char)
>  BUILDIO(w, w, short)
>  BUILDIO(l, , int)
>  
>+#ifdef CONFIG_AMD_MEM_ENCRYPT
>+extern void outsb_try_rep(int port, const void *addr, unsigned long
>count);
>+extern void insb_try_rep(int port, void *addr, unsigned long count);
>+extern void outsw_try_rep(int port, const void *addr, unsigned long
>count);
>+extern void insw_try_rep(int port, void *addr, unsigned long count);
>+extern void outsl_try_rep(int port, const void *addr, unsigned long
>count);
>+extern void insl_try_rep(int port, void *addr, unsigned long count);
>+#define outsb  outsb_try_rep
>+#define insb   insb_try_rep
>+#define outsw  outsw_try_rep
>+#define insw   insw_try_rep
>+#define outsl  outsl_try_rep
>+#define insl   insl_try_rep
>+#else
>+BUILDIO_REP(b

Re: [PATCH v2] include/linux/vfio.h: Guard powerpc-specific functions with CONFIG_VFIO_SPAPR_EEH

2017-07-26 Thread Alex Williamson
On Tue, 18 Jul 2017 14:22:20 -0300
Murilo Opsfelder Araujo  wrote:

> When CONFIG_EEH=y and CONFIG_VFIO_SPAPR_EEH=n, build fails with the
> following:
> 
> drivers/vfio/pci/vfio_pci.o: In function `.vfio_pci_release':
> vfio_pci.c:(.text+0xa98): undefined reference to 
> `.vfio_spapr_pci_eeh_release'
> drivers/vfio/pci/vfio_pci.o: In function `.vfio_pci_open':
> vfio_pci.c:(.text+0x1420): undefined reference to 
> `.vfio_spapr_pci_eeh_open'
> 
> In this case, vfio_pci.c should use the empty definitions of
> vfio_spapr_pci_eeh_open and vfio_spapr_pci_eeh_release functions.
> 
> This patch fixes it by guarding these function definitions with
> CONFIG_VFIO_SPAPR_EEH, the symbol that controls whether vfio_spapr_eeh.c is
> built, which is where the non-empty versions of these functions are. We need 
> to
> make use of IS_ENABLED() macro because CONFIG_VFIO_SPAPR_EEH is a tristate
> option.
> 
> This issue was found during a randconfig build. Logs are here:
> 
> http://kisskb.ellerman.id.au/kisskb/buildresult/12982362/
> 
> Signed-off-by: Murilo Opsfelder Araujo 
> ---

Applied to my for-linus branch with David and Alexey's R-b for v4.13.
Thanks,

Alex

> 
> Changes from v1:
> - Rebased on top of next-20170718.
> 
>  include/linux/vfio.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 586809a..a47b985 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -152,7 +152,7 @@ extern int vfio_set_irqs_validate_and_prepare(struct 
> vfio_irq_set *hdr,
> size_t *data_size);
> 
>  struct pci_dev;
> -#ifdef CONFIG_EEH
> +#if IS_ENABLED(CONFIG_VFIO_SPAPR_EEH)
>  extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);
>  extern void vfio_spapr_pci_eeh_release(struct pci_dev *pdev);
>  extern long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
> @@ -173,7 +173,7 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct 
> iommu_group *group,
>  {
>   return -ENOTTY;
>  }
> -#endif /* CONFIG_EEH */
> +#endif /* CONFIG_VFIO_SPAPR_EEH */
> 
>  /*
>   * IRQfd - generic
> --
> 2.9.4



Re: [RFC Part1 PATCH v3 13/17] x86/io: Unroll string I/O when SEV is active

2017-07-26 Thread Brijesh Singh


Hi Arnd and David,

On 07/26/2017 05:45 AM, Arnd Bergmann wrote:

On Tue, Jul 25, 2017 at 11:51 AM, David Laight  wrote:

From: Brijesh Singh

Sent: 24 July 2017 20:08
From: Tom Lendacky 

Secure Encrypted Virtualization (SEV) does not support string I/O, so
unroll the string I/O operation into a loop operating on one element at
a time.

Signed-off-by: Tom Lendacky 
Signed-off-by: Brijesh Singh 
---
  arch/x86/include/asm/io.h | 26 ++
  1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index e080a39..2f3c002 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -327,14 +327,32 @@ static inline unsigned type in##bwl##_p(int port) 
  \
   \
  static inline void outs##bwl(int port, const void *addr, unsigned long count) 
\
  {


This will clash with a fix I did to add a "memory" clobber
for the traditional implementation, see
https://patchwork.kernel.org/patch/9854573/


Is it even worth leaving these as inline functions?
Given the speed of IO cycles it is unlikely that the cost of calling a real
function will be significant.
The code bloat reduction will be significant.


I think the smallest code would be the original "rep insb" etc, which
should be smaller than a function call, unlike the loop. Then again,
there is a rather small number of affected device drivers, almost all
of them for ancient hardware that you won't even build in a 64-bit
x86 kernel, see the list below. The only user I found that is actually
still relevant is drivers/tty/hvc/hvc_xen.c, which uses it for the early
console.



There are some indirect user of string I/O functions. The following functions
defined in lib/iomap.c calls rep version of ins and outs.

- ioread8_rep, ioread16_rep, ioread32_rep
- iowrite8_rep, iowrite16_rep, iowrite32_rep

I found that several drivers use above functions.

Here is one approach to convert it into non-inline functions. In this approach,
I have added a new file arch/x86/kernel/io.c which provides non rep version of
string I/O routines. The file gets built and used only when AMD_MEM_ENCRYPT is
enabled. On positive side, if we don't build kernel with AMD_MEM_ENCRYPT support
then we use inline routines, when AMD_MEM_ENCRYPT is built then we make a 
function
call. Inside the function we unroll only when SEV is active.

Do you see any issue with this approach ? thanks

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index e080a39..104927d 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -323,8 +323,9 @@ static inline unsigned type in##bwl##_p(int port)   
\
unsigned type value = in##bwl(port);\
slow_down_io(); \
return value;   \
-}  \
-   \
+}
+
+#define BUILDIO_REP(bwl, bw, type) \
 static inline void outs##bwl(int port, const void *addr, unsigned long count) \
 {  \
asm volatile("rep; outs" #bwl   \
@@ -335,12 +336,31 @@ static inline void ins##bwl(int port, void *addr, 
unsigned long count)\
 {  \
asm volatile("rep; ins" #bwl\
 : "+D"(addr), "+c"(count) : "d"(port));\
-}
+}  \
 
 BUILDIO(b, b, char)

 BUILDIO(w, w, short)
 BUILDIO(l, , int)
 
+#ifdef CONFIG_AMD_MEM_ENCRYPT

+extern void outsb_try_rep(int port, const void *addr, unsigned long count);
+extern void insb_try_rep(int port, void *addr, unsigned long count);
+extern void outsw_try_rep(int port, const void *addr, unsigned long count);
+extern void insw_try_rep(int port, void *addr, unsigned long count);
+extern void outsl_try_rep(int port, const void *addr, unsigned long count);
+extern void insl_try_rep(int port, void *addr, unsigned long count);
+#define outsb  outsb_try_rep
+#define insb   insb_try_rep
+#define outsw  outsw_try_rep
+#define insw   insw_try_rep
+#define outsl  outsl_try_rep
+#define insl   insl_try_rep
+#else
+BUILDIO_REP(b, b, char)
+BUILDIO_REP(w, w, short)
+BUILDIO_REP(l, , int)
+#endif
+
 extern void *xlate_dev_mem_ptr(phys_addr_t phys);
 extern void unxlate_dev_mem_ptr(phys_addr_t phys, void *addr);

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index a01892b..3b6e2a3 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -42,6 +42,7 @@ CFLAGS_irq.o := -I$(src)/../include/asm/trace
 
 obj-y   

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Paul E. McKenney
On Wed, Jul 26, 2017 at 09:54:32AM -0700, David Miller wrote:
> From: "Paul E. McKenney" 
> Date: Wed, 26 Jul 2017 08:49:00 -0700
> 
> > On Wed, Jul 26, 2017 at 04:33:40PM +0100, Jonathan Cameron wrote:
> >> Didn't leave it long enough. Still bad on 4.10-rc7 just took over
> >> an hour to occur.
> > 
> > And it is quite possible that SOFTLOCKUP_DETECTOR=y and HZ_PERIODIC=y
> > are just greatly reducing the probability of the problem rather than
> > completely preventing it.
> > 
> > Still, hopefully useful information, thank you for the testing!
> 
> I guess that invalidates my idea to test reverting recent changes to
> the tick-sched.c code... :-/
> 
> In NO_HZ_IDLE mode, what is really supposed to happen on a completely
> idle system?
> 
> All the cpus enter the idle loop, have no timers programmed, and they
> all just go to sleep until an external event happens.
> 
> What ensures that grace periods get processed in this regime?

There are several different situations with different mechanisms:

1.  No grace period is in progress and no RCU callbacks are pending
anywhere in the system.  In this case, some other event would
need to start a grace period, so RCU just stays idle until that
happens, possibly indefinitely.  According to the battery-powered
embedded guys, this is a feature, not a bug.  ;-)

2.  No grace period is in progress, but there is at least one RCU
callback somewhere in the system.  In this case, the mechanism
depends on CONFIG_RCU_FAST_NO_HZ:

CONFIG_RCU_FAST_NO_HZ=n:  The CPU on which the callback is
queued will return "true" in response to the call to
rcu_needs_cpu() that is made shortly before that CPU
enters idle.  This will cause the scheduling-clock
interrupt to remain on, despite the CPU being idle,
which will in turn allow RCU's state machine to continue
running out of softirq, triggered by the scheduling-clock
interrupts.

CONFIG_RCU_FAST_NO_HZ=y:  The CPU on which the callback is queued
will return "false" in response to the call to
rcu_needs_cpu() that is made shortly before that CPU
enters idle.  However, it will also request a next event
about six seconds in the future if all callbacks do
nothing but free memory (kfree_rcu()), or about four
jiffies in the future if at least one callback does
something more than just free memory.

There is also a rcu_prepare_for_idle() function that
is invoked later in the idle-entry process in this case
which will wake up the grace-period kthread if need be.

3.  A grace period is in progress.  In this case the grace-period
kthread is either currently running (in which case there will be
at least one non-idle CPU) or is in a timed wait for its next
scan for idle/offline CPUs (such CPUs need the grace-period
kthread to report quiescent states on their behalf).  In this
latter case, the timer subsystem will post a next event that
will be the wakeup time for the grace-period kthread, or some
earlier event.

This is where we have been seeing trouble, if for no other
reason because RCU CPU stall warnings only happen when there
is a grace period in progress.

That is the theory, anyway...

And when I enabled CONFIG_SOFTLOCKUP_DETECTOR, I still see failures.
I did 24 half-hour rcutorture runs on the TREE01 scenario, and two of them
saw RCU CPU stall warnings with starvation of the grace-period kthread.
I just now started another test but without CONFIG_SOFTLOCKUP_DETECTOR
to see if it makes a significance difference for my testing.  I do have
CONFIG_RCU_FAST_NO_HZ=y in my runs.

Thanx, Paul



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Wed, 26 Jul 2017 09:54:32 -0700
David Miller  wrote:

> From: "Paul E. McKenney" 
> Date: Wed, 26 Jul 2017 08:49:00 -0700
> 
> > On Wed, Jul 26, 2017 at 04:33:40PM +0100, Jonathan Cameron wrote:  
> >> Didn't leave it long enough. Still bad on 4.10-rc7 just took over
> >> an hour to occur.  
> > 
> > And it is quite possible that SOFTLOCKUP_DETECTOR=y and HZ_PERIODIC=y
> > are just greatly reducing the probability of the problem rather than
> > completely preventing it.
> > 
> > Still, hopefully useful information, thank you for the testing!  

Not sure it actually gives us much information, but no issues yet
with a simple program running every cpu that wakes up every 3 seconds.

Will leave it running overnight and report back in the morning.

> 
> I guess that invalidates my idea to test reverting recent changes to
> the tick-sched.c code... :-/
> 
> In NO_HZ_IDLE mode, what is really supposed to happen on a completely
> idle system?
> 
> All the cpus enter the idle loop, have no timers programmed, and they
> all just go to sleep until an external event happens.
> 
> What ensures that grace periods get processed in this regime?


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread David Miller
From: "Paul E. McKenney" 
Date: Wed, 26 Jul 2017 08:49:00 -0700

> On Wed, Jul 26, 2017 at 04:33:40PM +0100, Jonathan Cameron wrote:
>> Didn't leave it long enough. Still bad on 4.10-rc7 just took over
>> an hour to occur.
> 
> And it is quite possible that SOFTLOCKUP_DETECTOR=y and HZ_PERIODIC=y
> are just greatly reducing the probability of the problem rather than
> completely preventing it.
> 
> Still, hopefully useful information, thank you for the testing!

I guess that invalidates my idea to test reverting recent changes to
the tick-sched.c code... :-/

In NO_HZ_IDLE mode, what is really supposed to happen on a completely
idle system?

All the cpus enter the idle loop, have no timers programmed, and they
all just go to sleep until an external event happens.

What ensures that grace periods get processed in this regime?


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread David Miller
From: "Paul E. McKenney" 
Date: Wed, 26 Jul 2017 07:14:17 -0700

> Dave, any other ideas on what might be causing this or what might be
> tested?

I was going to go through the changes that happened between v4.12
and now to kernel/timer/tick-sched.c and see if reverting any of
those help.

But I don't know when I would get to that.

Commit 411fe24e6b7c283c3a1911450cdba6dd3aaea56e, looks particularly juicy.
:-)


Re: [RFC Part1 PATCH v3 03/17] x86/mm: Secure Encrypted Virtualization (SEV) support

2017-07-26 Thread Tom Lendacky

On 7/25/2017 11:28 PM, Borislav Petkov wrote:

On Mon, Jul 24, 2017 at 02:07:43PM -0500, Brijesh Singh wrote:

From: Tom Lendacky 

Provide support for Secure Encyrpted Virtualization (SEV). This initial


Your subject misses a verb and patch subjects should have an active verb
denoting what the patch does. The sentence above is a good example.


Yup, will update.




support defines a flag that is used by the kernel to determine if it is
running with SEV active.

Signed-off-by: Tom Lendacky 
Signed-off-by: Brijesh Singh 
---
  arch/x86/include/asm/mem_encrypt.h | 2 ++
  arch/x86/mm/mem_encrypt.c  | 3 +++
  include/linux/mem_encrypt.h| 8 +++-
  3 files changed, 12 insertions(+), 1 deletion(-)


...


diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
index 0fbd092..1e4643e 100644
--- a/arch/x86/mm/mem_encrypt.c
+++ b/arch/x86/mm/mem_encrypt.c
@@ -40,6 +40,9 @@ static char sme_cmdline_off[] __initdata = "off";
  unsigned long sme_me_mask __section(.data) = 0;
  EXPORT_SYMBOL_GPL(sme_me_mask);
  
+unsigned int sev_enabled __section(.data) = 0;

+EXPORT_SYMBOL_GPL(sev_enabled);


So sev_enabled is a pure bool used only in bool context, not like
sme_me_mask whose value is read too. Which means, you can make the
former static and query it only through accessor functions.


If it's made static then the sme_active()/sev_active() inline functions
would need to be turned into functions within the mem_encrypt.c file. So
there's a trade-off to do that, which is the better one?




  /* Buffer used for early in-place encryption by BSP, no locking needed */
  static char sme_early_buffer[PAGE_SIZE] __aligned(PAGE_SIZE);
  
diff --git a/include/linux/mem_encrypt.h b/include/linux/mem_encrypt.h

index 1255f09..ea0831a 100644
--- a/include/linux/mem_encrypt.h
+++ b/include/linux/mem_encrypt.h
@@ -22,12 +22,18 @@
  #else /* !CONFIG_ARCH_HAS_MEM_ENCRYPT */
  
  #define sme_me_mask	0UL

+#define sev_enabled0
  
  #endif	/* CONFIG_ARCH_HAS_MEM_ENCRYPT */
  
  static inline bool sme_active(void)

  {
-   return !!sme_me_mask;
+   return (sme_me_mask && !sev_enabled);


You don't need the brackets. Below too.


Ok.




+}
+
+static inline bool sev_active(void)
+{
+   return (sme_me_mask && sev_enabled);
  }


So this is confusing, TBH. SME and SEV are not mutually exclusive and
yet the logic here says so. Why?

I mean, in the hypervisor context, sme_active() is still true.

/me is confused.


The kernel needs to distinguish between running under SME and running
under SEV. SME and SEV are similar but not the same. The trampoline code
is a good example.  Before paging is activated, SME will access all
memory as decrypted, but SEV will access all memory as encrypted.  So
when APs are being brought up under SME the trampoline area cannot be
encrypted, whereas under SEV the trampoline area must be encrypted.

Thanks,
Tom





Re: [PATCH 1/6] powerpc: Free up four 64K PTE bits in 4K backed HPTE pages

2017-07-26 Thread Ram Pai
On Wed, Jul 26, 2017 at 04:05:48PM +0530, Aneesh Kumar K.V wrote:
> Ram Pai  writes:
> 
> > Rearrange 64K PTE bits to  free  up  bits 3, 4, 5  and  6,
> > in the 4K backed HPTE pages.These bits continue to be used
> > for 64K backed HPTE pages in this patch, but will be freed
> > up in the next patch. The  bit  numbers are big-endian  as
> > defined in the ISA3.0
> >
> > The patch does the following change to the 4k htpe backed
> > 64K PTE's format.
> >
> > H_PAGE_BUSY moves from bit 3 to bit 9 (B bit in the figure
> > below)
> > V0 which occupied bit 4 is not used anymore.
> > V1 which occupied bit 5 is not used anymore.
> > V2 which occupied bit 6 is not used anymore.
> > V3 which occupied bit 7 is not used anymore.
> >
> > Before the patch, the 4k backed 64k PTE format was as follows
> >
> >  0 1 2 3 4  5  6  7  8 9 10...63
> >  : : : : :  :  :  :  : : ::
> >  v v v v v  v  v  v  v v vv
> >
> > ,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-,
> > |x|x|x|B|V0|V1|V2|V3|x| | |x|x||x|x|x|x| <- primary pte
> > '_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_'
> > |S|G|I|X|S |G |I |X |S|G|I|X|..|S|G|I|X| <- secondary pte
> > '_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_'
> >
> > After the patch, the 4k backed 64k PTE format is as follows
> >
> >  0 1 2 3 4  5  6  7  8 9 10...63
> >  : : : : :  :  :  :  : : ::
> >  v v v v v  v  v  v  v v vv
> >
> > ,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-,
> > |x|x|x| |  |  |  |  |x|B| |x|x||.|.|.|.| <- primary pte
> > '_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_'
> > |S|G|I|X|S |G |I |X |S|G|I|X|..|S|G|I|X| <- secondary pte
> > '_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_'
> >
> > the four  bits S,G,I,X (one quadruplet per 4k HPTE) that
> > cache  the  hash-bucket  slot  value, is initialized  to
> > 1,1,1,1 indicating -- an invalid slot.   If  a HPTE gets
> > cached in a   slot(i.e 7th  slot  of  secondary hash
> > bucket), it is  released  immediately. In  other  words,
> > even  though    is   a valid slot  value in the hash
> > bucket, we consider it invalid and  release the slot and
> > the HPTE.  This  gives  us  the opportunity to determine
> > the validity of S,G,I,X  bits  based on its contents and
> > not on any of the bits V0,V1,V2 or V3 in the primary PTE
> >
> > When   we  release  aHPTEcached in the  slot
> > we alsorelease  a  legitimate   slot  in the primary
> > hash bucket  and  unmap  its  corresponding  HPTE.  This
> > is  to  ensure   that  we do get a HPTE cached in a slot
> > of the primary hash bucket, the next time we retry.
> >
> > Though  treating    slot  as  invalid,  reduces  the
> > number of  available  slots  in the hash bucket and  may
> > have  an  effect   on the performance, the probabilty of
> > hitting a  slot is extermely low.
> >
> > Compared  to  the  current   scheme, the above described
> > scheme  reduces  the  number of false hash table updates
> > significantly   andhas  the   added   advantage   of
> > releasing  four  valuable  PTE bits for other purpose.
> >
> > NOTE:even though bits 3, 4, 5, 6, 7 are  not  used  when
> > the  64K  PTE is backed by 4k HPTE,  they continue to be
> > used  if  the  PTE  gets  backed  by 64k HPTE.  The next
> > patch will decouple that aswell, and truely  release the
> > bits.
> >
> > This idea was jointly developed by Paul Mackerras,
> > Aneesh, Michael Ellermen and myself.
> >
> > 4K PTE format remains unchanged currently.
> >
> > The patch does the following code changes
> > a) PTE flags are split between 64k and 4k  header files.
> > b) __hash_page_4K()  is  reimplemented   to reflect the
> >above logic.
> >
> > Reviewed-by: Aneesh Kumar K.V 
> > Signed-off-by: Ram Pai 
> > ---
> >  arch/powerpc/include/asm/book3s/64/hash-4k.h  |2 +
> >  arch/powerpc/include/asm/book3s/64/hash-64k.h |8 +--
> >  arch/powerpc/include/asm/book3s/64/hash.h |1 -
> >  arch/powerpc/mm/hash64_64k.c  |   74 
> > -
> >  arch/powerpc/mm/hash_utils_64.c   |4 +-
> >  5 files changed, 55 insertions(+), 34 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
> > b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > index 0c4e470..f959c00 100644
> > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> > @@ -16,6 +16,8 @@
> >  #define H_PUD_TABLE_SIZE   (sizeof(pud_t) << H_PUD_INDEX_SIZE)
> >  #define H_PGD_TABLE_SIZE   (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
> >
> > +#define H_PAGE_BUSY_RPAGE_RSV1 /* software: PTE & hash are 
> > busy */
> > +
> >  /* PTE flags to conserve for HPTE identification */
> >  #define _PAGE_HPTEFLAGS (H_PAGE_

Re: [RFC Part1 PATCH v3 05/17] x86, realmode: Don't decrypt trampoline area under SEV

2017-07-26 Thread Borislav Petkov
 Subject: x86/realmode: ...

On Mon, Jul 24, 2017 at 02:07:45PM -0500, Brijesh Singh wrote:
> From: Tom Lendacky 
> 
> When SEV is active the trampoline area will need to be in encrypted
> memory so only mark the area decrypted if SME is active.
> 
> Signed-off-by: Tom Lendacky 
> Signed-off-by: Brijesh Singh 
> ---
>  arch/x86/realmode/init.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
> index 1f71980..c7eeca7 100644
> --- a/arch/x86/realmode/init.c
> +++ b/arch/x86/realmode/init.c
> @@ -63,9 +63,11 @@ static void __init setup_real_mode(void)
>   /*
>* If SME is active, the trampoline area will need to be in
>* decrypted memory in order to bring up other processors
> -  * successfully.
> +  * successfully. For SEV the trampoline area needs to be in
> +  * encrypted memory, so only do this for SME.

Or simply say:

"It is not needed for SEV."

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 


Re: [PATCH v3 4/5] powerpc/lib/sstep: Add prty instruction emulation

2017-07-26 Thread Segher Boessenkool
On Wed, Jul 26, 2017 at 08:03:30PM +1000, Michael Ellerman wrote:
> Segher Boessenkool  writes:
> > A general question about these patches: some things are inside #ifdef
> > __powerpc64__, some are not.  It seems it is the wrong macro, and it
> > should be used (or not used) consistently?
> 
> Why is it the wrong macro? Because we tend to use CONFIG_PPC64 you mean?

Yeah.  But I see sstep.c already mixes those two at will (or if there
is a distinction, I'm not seeing it :-) )

> I thought the reason some are #ifdef'ed is that some are 64-bit only.
> ie. bpermd is 64-bit only ?

64-bit only, in what way?  It's not clear what the rules are.

It's not instructions that can only run in 64-bit mode.
It's not instructions that only give a usable result with 64-bit regs
implemented.
It's not instructions only implemented on 64-bit CPUs.
It's not even "all instructions that would not give a correct result
in the low 32 bits of GPRs if the high 32 bits are not implemented".


Segher


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Paul E. McKenney
On Wed, Jul 26, 2017 at 04:33:40PM +0100, Jonathan Cameron wrote:
> On Wed, 26 Jul 2017 15:23:15 +0100
> Jonathan Cameron  wrote:
> 
> > On Wed, 26 Jul 2017 07:14:17 -0700
> > "Paul E. McKenney"  wrote:
> > 
> > > On Wed, Jul 26, 2017 at 01:28:01PM +0100, Jonathan Cameron wrote:  
> > > > On Wed, 26 Jul 2017 10:32:32 +0100
> > > > Jonathan Cameron  wrote:
> > > > 
> > > > > On Wed, 26 Jul 2017 09:16:23 +0100
> > > > > Jonathan Cameron  wrote:
> > > > > 
> > > > > > On Tue, 25 Jul 2017 21:12:17 -0700
> > > > > > "Paul E. McKenney"  wrote:
> > > > > >   
> > > > > > > On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote: 
> > > > > > >
> > > > > > > > From: "Paul E. McKenney" 
> > > > > > > > Date: Tue, 25 Jul 2017 20:55:45 -0700
> > > > > > > >   
> > > > > > > > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote: 
> > > > > > > > >  
> > > > > > > > >> Just to report, turning softlockup back on fixes things for 
> > > > > > > > >> me on
> > > > > > > > >> sparc64 too.  
> > > > > > > > > 
> > > > > > > > > Very good!
> > > > > > > > >   
> > > > > > > > >> The thing about softlockup is it runs an hrtimer, which 
> > > > > > > > >> seems to run
> > > > > > > > >> about every 4 seconds.  
> > > > > > > > > 
> > > > > > > > > I could see where that could shake things loose, but I am 
> > > > > > > > > surprised that
> > > > > > > > > it would be needed.  I ran a short run with 
> > > > > > > > > CONFIG_SOFTLOCKUP_DETECTOR=y
> > > > > > > > > with no trouble, but I will be running a longer test later on.
> > > > > > > > >   
> > > > > > > > >> So I wonder if this is a NO_HZ problem.  
> > > > > > > > > 
> > > > > > > > > Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  
> > > > > > > > > What are
> > > > > > > > > you running?  (Again, my symptoms are slightly different, so 
> > > > > > > > > I might
> > > > > > > > > be seeing a different bug.)  
> > > > > > > > 
> > > > > > > > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.
> > > > > > > > 
> > > > > > > > To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR 
> > > > > > > > disabled.  
> > > > > > > 
> > > > > > > Same here -- but my failure case happens fairly rarely, so it 
> > > > > > > will take
> > > > > > > some time to gain reasonable confidence that enabling 
> > > > > > > SOFTLOCKUP_DETECTOR
> > > > > > > had effect.
> > > > > > > 
> > > > > > > But you are right, might be interesting to try NO_HZ_PERIODIC=y
> > > > > > > or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)
> > > > > > > 
> > > > > > >   Thanx, Paul
> > > > > > > 
> > > > > > I'll be the headless chicken running around and trying as many tests
> > > > > > as I can fit in.  Typical time to see the failure for us is sub 10
> > > > > > minutes so we'll see how far we get.
> > > > > > 
> > > > > > Make me a list to run if you like ;)
> > > > > > 
> > > > > > NO_HZ_PERIODIC=y running now.  
> > > > > By which I mean CONFIG_HZ_PERIODIC=y
> > > 
> > > I did get that messed up, didn't I?  Sorry for my confusion!
> > >   
> > > > > Anyhow, run for 40 minutes with out seeing a splat but my sanity check
> > > > > on the NO_FULL_HZ=n and NO_HZ_IDLE=y this morning took 20 minutes so
> > > > > I won't have much confidence until we are a few hours in on this.
> > > > > 
> > > > > Anyhow, certainly looking like a promising direction for 
> > > > > investigation!
> > > > > 
> > > > Well it's done over 3 hours without a splat so I think it is fine with
> > > > CONFIG_HZ_PERIODIC=y
> > > 
> > > Thank you!
> > > 
> > > If you run with SOFTLOCKUP_DETECTOR=n and NO_HZ_IDLE=y, but have a normal
> > > user task waking up every few seconds on each CPU, does the problem occur?
> > > (The question is whether any disturbance gets things going, or whether 
> > > there
> > > is something special about SOFTLOCKUP_DETECTOR=y and HZ_PERIODIC=y.
> > > 
> > > Dave, any other ideas on what might be causing this or what might be
> > > tested?
> > > 
> > >   Thanx, Paul
> > >   
> > 
> > Although it's still early days (40 mins in) it looks like the issue first
> > occurred between 4.10-rc7 and 4.11-rc1 (don't ask why those particular RCs)
> > 
> > Bad as with current kernel on 4.11-rc1 and good on 4.10-rc7.
> 
> Didn't leave it long enough. Still bad on 4.10-rc7 just took over
> an hour to occur.

And it is quite possible that SOFTLOCKUP_DETECTOR=y and HZ_PERIODIC=y
are just greatly reducing the probability of the problem rather than
completely preventing it.

Still, hopefully useful information, thank you for the testing!

Thanx, Paul

> > Could be something different was hiding it in 4.10 though.  We have a fair
> > delta from mainline back then unfortunately so bisecting will be
> > 'interesti

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Wed, 26 Jul 2017 15:23:15 +0100
Jonathan Cameron  wrote:

> On Wed, 26 Jul 2017 07:14:17 -0700
> "Paul E. McKenney"  wrote:
> 
> > On Wed, Jul 26, 2017 at 01:28:01PM +0100, Jonathan Cameron wrote:  
> > > On Wed, 26 Jul 2017 10:32:32 +0100
> > > Jonathan Cameron  wrote:
> > > 
> > > > On Wed, 26 Jul 2017 09:16:23 +0100
> > > > Jonathan Cameron  wrote:
> > > > 
> > > > > On Tue, 25 Jul 2017 21:12:17 -0700
> > > > > "Paul E. McKenney"  wrote:
> > > > >   
> > > > > > On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:   
> > > > > >  
> > > > > > > From: "Paul E. McKenney" 
> > > > > > > Date: Tue, 25 Jul 2017 20:55:45 -0700
> > > > > > >   
> > > > > > > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote:   
> > > > > > > >
> > > > > > > >> Just to report, turning softlockup back on fixes things for me 
> > > > > > > >> on
> > > > > > > >> sparc64 too.  
> > > > > > > > 
> > > > > > > > Very good!
> > > > > > > >   
> > > > > > > >> The thing about softlockup is it runs an hrtimer, which seems 
> > > > > > > >> to run
> > > > > > > >> about every 4 seconds.  
> > > > > > > > 
> > > > > > > > I could see where that could shake things loose, but I am 
> > > > > > > > surprised that
> > > > > > > > it would be needed.  I ran a short run with 
> > > > > > > > CONFIG_SOFTLOCKUP_DETECTOR=y
> > > > > > > > with no trouble, but I will be running a longer test later on.
> > > > > > > >   
> > > > > > > >> So I wonder if this is a NO_HZ problem.  
> > > > > > > > 
> > > > > > > > Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  
> > > > > > > > What are
> > > > > > > > you running?  (Again, my symptoms are slightly different, so I 
> > > > > > > > might
> > > > > > > > be seeing a different bug.)  
> > > > > > > 
> > > > > > > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.
> > > > > > > 
> > > > > > > To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR 
> > > > > > > disabled.  
> > > > > > 
> > > > > > Same here -- but my failure case happens fairly rarely, so it will 
> > > > > > take
> > > > > > some time to gain reasonable confidence that enabling 
> > > > > > SOFTLOCKUP_DETECTOR
> > > > > > had effect.
> > > > > > 
> > > > > > But you are right, might be interesting to try NO_HZ_PERIODIC=y
> > > > > > or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)
> > > > > > 
> > > > > > Thanx, Paul
> > > > > > 
> > > > > I'll be the headless chicken running around and trying as many tests
> > > > > as I can fit in.  Typical time to see the failure for us is sub 10
> > > > > minutes so we'll see how far we get.
> > > > > 
> > > > > Make me a list to run if you like ;)
> > > > > 
> > > > > NO_HZ_PERIODIC=y running now.  
> > > > By which I mean CONFIG_HZ_PERIODIC=y
> > 
> > I did get that messed up, didn't I?  Sorry for my confusion!
> >   
> > > > Anyhow, run for 40 minutes with out seeing a splat but my sanity check
> > > > on the NO_FULL_HZ=n and NO_HZ_IDLE=y this morning took 20 minutes so
> > > > I won't have much confidence until we are a few hours in on this.
> > > > 
> > > > Anyhow, certainly looking like a promising direction for investigation!
> > > > 
> > > Well it's done over 3 hours without a splat so I think it is fine with
> > > CONFIG_HZ_PERIODIC=y
> > 
> > Thank you!
> > 
> > If you run with SOFTLOCKUP_DETECTOR=n and NO_HZ_IDLE=y, but have a normal
> > user task waking up every few seconds on each CPU, does the problem occur?
> > (The question is whether any disturbance gets things going, or whether there
> > is something special about SOFTLOCKUP_DETECTOR=y and HZ_PERIODIC=y.
> > 
> > Dave, any other ideas on what might be causing this or what might be
> > tested?
> > 
> > Thanx, Paul
> >   
> 
> Although it's still early days (40 mins in) it looks like the issue first
> occurred between 4.10-rc7 and 4.11-rc1 (don't ask why those particular RCs)
> 
> Bad as with current kernel on 4.11-rc1 and good on 4.10-rc7.
Didn't leave it long enough. Still bad on 4.10-rc7 just took over
an hour to occur.
> 
> Could be something different was hiding it in 4.10 though.  We have a fair
> delta from mainline back then unfortunately so bisecting will be
> 'interesting'.
> 
> I'll see if I can get the test you suggest running.
> 
> Jonathan
> ___
> linuxarm mailing list
> linux...@huawei.com
> http://rnd-openeuler.huawei.com/mailman/listinfo/linuxarm



Re: [RFC Part1 PATCH v3 04/17] x86/mm: Don't attempt to encrypt initrd under SEV

2017-07-26 Thread Borislav Petkov
On Mon, Jul 24, 2017 at 02:07:44PM -0500, Brijesh Singh wrote:
> From: Tom Lendacky 
> 
> When SEV is active the initrd/initramfs will already have already been
> placed in memory encyrpted so do not try to encrypt it.
> 
> Signed-off-by: Tom Lendacky 
> Signed-off-by: Brijesh Singh 
> ---
>  arch/x86/kernel/setup.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 0bfe0c1..01d56a1 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -379,9 +379,11 @@ static void __init reserve_initrd(void)
>* If SME is active, this memory will be marked encrypted by the
>* kernel when it is accessed (including relocation). However, the
>* ramdisk image was loaded decrypted by the bootloader, so make
> -  * sure that it is encrypted before accessing it.
> +  * sure that it is encrypted before accessing it. For SEV the
> +  * ramdisk will already be encyrpted, so only do this for SME.
   ^
Please introduce a spellchecker into your patch creation workflow.

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 


Re: [PATCH 3/4] powerpc: pseries: only store the device node basename in full_name

2017-07-26 Thread Rob Herring
On Wed, Jul 26, 2017 at 5:07 AM, David Laight  wrote:
> From: Rob Herring
>> Sent: 25 July 2017 22:44
>> With dependencies on full_name containing the entire device node path
>> removed, stop storing the full_name in nodes created by
>> dlpar_configure_connector() and pSeries_reconfig_add_node().
> ...
>>   dn = kzalloc(sizeof(*dn), GFP_KERNEL);
>>   if (!dn)
>>   return NULL;
>>
>>   name = (char *)ccwa + be32_to_cpu(ccwa->name_offset);
>> - dn->full_name = kasprintf(GFP_KERNEL, "%s/%s", path, name);
>> + dn->full_name = kasprintf(GFP_KERNEL, "%s", name);
>
> Isn't this just strdup()?

Yes, it can be simplified to that now.

> Perhaps you should rename full_name to name - since it is no longer 'full'?

Ideally, yes. However, we still have users in other places tree wide
(which should still work with the change) and I don't think it is
worth the additional churn. Also, we already have "name" as that is
the node name without the unit-address. I'd like to get rid of that,
but name is special in that it is exposed as a property too. Finally,
full_name is still the full path on Sparc.

> Maybe you could do a single malloc() for both 'dn' and the name?

Sure.

Rob


Re: [PATCH 0/4] Removing full paths from DT full_name

2017-07-26 Thread Rob Herring
On Wed, Jul 26, 2017 at 9:20 AM, Frank Rowand  wrote:
> Hi Rob,
>
> On 07/25/17 14:44, Rob Herring wrote:
>> This series is the last steps to remove storing the full path for every
>> DT node. Instead, we can create full path strings dynamically as needed
>> with printf %pOF specifiers (commit ce4fecf1fe15). There are a number of
>> remaining direct users of full_name after this series. I don't believe
>> there should be any functional impact for those users with the change to
>> only the node name (+unit-address). The majority are for struct
>> resource.name. This should only affect /proc/iomem display.
>
> I added a new dependency on full_name in:
>
>   [PATCH v4 3/3] of: overlay: add overlay symbols to live device tree
>
> You don't need to fix that -- I knew the removal of full_name was coming
> and expected to adapt to that change when it came, so I'll take care of
> it.  It will probably take me two or three weeks to get to it, but that
> shouldn't be a big deal since the affected code is a new feature that
> is not yet used.

Indeed, I had fixed up the print statements, but missed that. Thanks
for the heads up.

Rob


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Wed, 26 Jul 2017 07:14:17 -0700
"Paul E. McKenney"  wrote:

> On Wed, Jul 26, 2017 at 01:28:01PM +0100, Jonathan Cameron wrote:
> > On Wed, 26 Jul 2017 10:32:32 +0100
> > Jonathan Cameron  wrote:
> >   
> > > On Wed, 26 Jul 2017 09:16:23 +0100
> > > Jonathan Cameron  wrote:
> > >   
> > > > On Tue, 25 Jul 2017 21:12:17 -0700
> > > > "Paul E. McKenney"  wrote:
> > > > 
> > > > > On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:  
> > > > > > From: "Paul E. McKenney" 
> > > > > > Date: Tue, 25 Jul 2017 20:55:45 -0700
> > > > > > 
> > > > > > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote: 
> > > > > > >
> > > > > > >> Just to report, turning softlockup back on fixes things for me on
> > > > > > >> sparc64 too.
> > > > > > > 
> > > > > > > Very good!
> > > > > > > 
> > > > > > >> The thing about softlockup is it runs an hrtimer, which seems to 
> > > > > > >> run
> > > > > > >> about every 4 seconds.
> > > > > > > 
> > > > > > > I could see where that could shake things loose, but I am 
> > > > > > > surprised that
> > > > > > > it would be needed.  I ran a short run with 
> > > > > > > CONFIG_SOFTLOCKUP_DETECTOR=y
> > > > > > > with no trouble, but I will be running a longer test later on.
> > > > > > > 
> > > > > > >> So I wonder if this is a NO_HZ problem.
> > > > > > > 
> > > > > > > Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  What 
> > > > > > > are
> > > > > > > you running?  (Again, my symptoms are slightly different, so I 
> > > > > > > might
> > > > > > > be seeing a different bug.)
> > > > > > 
> > > > > > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.
> > > > > > 
> > > > > > To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR disabled. 
> > > > > >
> > > > > 
> > > > > Same here -- but my failure case happens fairly rarely, so it will 
> > > > > take
> > > > > some time to gain reasonable confidence that enabling 
> > > > > SOFTLOCKUP_DETECTOR
> > > > > had effect.
> > > > > 
> > > > > But you are right, might be interesting to try NO_HZ_PERIODIC=y
> > > > > or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)
> > > > > 
> > > > >   Thanx, Paul
> > > > >   
> > > > I'll be the headless chicken running around and trying as many tests
> > > > as I can fit in.  Typical time to see the failure for us is sub 10
> > > > minutes so we'll see how far we get.
> > > > 
> > > > Make me a list to run if you like ;)
> > > > 
> > > > NO_HZ_PERIODIC=y running now.
> > > By which I mean CONFIG_HZ_PERIODIC=y  
> 
> I did get that messed up, didn't I?  Sorry for my confusion!
> 
> > > Anyhow, run for 40 minutes with out seeing a splat but my sanity check
> > > on the NO_FULL_HZ=n and NO_HZ_IDLE=y this morning took 20 minutes so
> > > I won't have much confidence until we are a few hours in on this.
> > > 
> > > Anyhow, certainly looking like a promising direction for investigation!
> > >   
> > Well it's done over 3 hours without a splat so I think it is fine with
> > CONFIG_HZ_PERIODIC=y  
> 
> Thank you!
> 
> If you run with SOFTLOCKUP_DETECTOR=n and NO_HZ_IDLE=y, but have a normal
> user task waking up every few seconds on each CPU, does the problem occur?
> (The question is whether any disturbance gets things going, or whether there
> is something special about SOFTLOCKUP_DETECTOR=y and HZ_PERIODIC=y.
> 
> Dave, any other ideas on what might be causing this or what might be
> tested?
> 
>   Thanx, Paul
> 

Although it's still early days (40 mins in) it looks like the issue first
occurred between 4.10-rc7 and 4.11-rc1 (don't ask why those particular RCs)

Bad as with current kernel on 4.11-rc1 and good on 4.10-rc7.

Could be something different was hiding it in 4.10 though.  We have a fair
delta from mainline back then unfortunately so bisecting will be
'interesting'.

I'll see if I can get the test you suggest running.

Jonathan


Re: [PATCH 0/4] Removing full paths from DT full_name

2017-07-26 Thread Frank Rowand
Hi Rob,

On 07/25/17 14:44, Rob Herring wrote:
> This series is the last steps to remove storing the full path for every 
> DT node. Instead, we can create full path strings dynamically as needed 
> with printf %pOF specifiers (commit ce4fecf1fe15). There are a number of 
> remaining direct users of full_name after this series. I don't believe 
> there should be any functional impact for those users with the change to 
> only the node name (+unit-address). The majority are for struct 
> resource.name. This should only affect /proc/iomem display.

I added a new dependency on full_name in:

  [PATCH v4 3/3] of: overlay: add overlay symbols to live device tree

You don't need to fix that -- I knew the removal of full_name was coming
and expected to adapt to that change when it came, so I'll take care of
it.  It will probably take me two or three weeks to get to it, but that
shouldn't be a big deal since the affected code is a new feature that
is not yet used.

-Frank

> 
> Patches 1 and 2 can be applied now for 4.14. For patches 3 and 4, my 
> target is 4.15 after all the dependencies have been merged.
> 
> PPC folks, Please test! The PPC parts are untested. A git branch with 
> all the dependencies is here[1].
> 
> Rob
> 
> [1] git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git dt-printf
> 
> Rob Herring (4):
>   powerpc: pseries: vio: match parent nodes with of_find_node_by_path
>   powerpc: pseries: remove dlpar_attach_node dependency on full path
>   powerpc: pseries: only store the device node basename in full_name
>   of/fdt: only store the device node basename in full_name
> 
>  arch/powerpc/platforms/pseries/dlpar.c   | 26 +++
>  arch/powerpc/platforms/pseries/hotplug-cpu.c |  2 +-
>  arch/powerpc/platforms/pseries/mobility.c|  2 +-
>  arch/powerpc/platforms/pseries/pseries.h |  2 +-
>  arch/powerpc/platforms/pseries/reconfig.c|  2 +-
>  arch/powerpc/platforms/pseries/vio.c |  4 +-
>  drivers/of/fdt.c | 69 
> +---
>  7 files changed, 23 insertions(+), 84 deletions(-)
> 



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Paul E. McKenney
On Wed, Jul 26, 2017 at 01:28:01PM +0100, Jonathan Cameron wrote:
> On Wed, 26 Jul 2017 10:32:32 +0100
> Jonathan Cameron  wrote:
> 
> > On Wed, 26 Jul 2017 09:16:23 +0100
> > Jonathan Cameron  wrote:
> > 
> > > On Tue, 25 Jul 2017 21:12:17 -0700
> > > "Paul E. McKenney"  wrote:
> > >   
> > > > On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:
> > > > > From: "Paul E. McKenney" 
> > > > > Date: Tue, 25 Jul 2017 20:55:45 -0700
> > > > >   
> > > > > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote:  
> > > > > >> Just to report, turning softlockup back on fixes things for me on
> > > > > >> sparc64 too.  
> > > > > > 
> > > > > > Very good!
> > > > > >   
> > > > > >> The thing about softlockup is it runs an hrtimer, which seems to 
> > > > > >> run
> > > > > >> about every 4 seconds.  
> > > > > > 
> > > > > > I could see where that could shake things loose, but I am surprised 
> > > > > > that
> > > > > > it would be needed.  I ran a short run with 
> > > > > > CONFIG_SOFTLOCKUP_DETECTOR=y
> > > > > > with no trouble, but I will be running a longer test later on.
> > > > > >   
> > > > > >> So I wonder if this is a NO_HZ problem.  
> > > > > > 
> > > > > > Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  What 
> > > > > > are
> > > > > > you running?  (Again, my symptoms are slightly different, so I might
> > > > > > be seeing a different bug.)  
> > > > > 
> > > > > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.
> > > > > 
> > > > > To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR disabled.   
> > > > >
> > > > 
> > > > Same here -- but my failure case happens fairly rarely, so it will take
> > > > some time to gain reasonable confidence that enabling 
> > > > SOFTLOCKUP_DETECTOR
> > > > had effect.
> > > > 
> > > > But you are right, might be interesting to try NO_HZ_PERIODIC=y
> > > > or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)
> > > > 
> > > > Thanx, Paul
> > > > 
> > > I'll be the headless chicken running around and trying as many tests
> > > as I can fit in.  Typical time to see the failure for us is sub 10
> > > minutes so we'll see how far we get.
> > > 
> > > Make me a list to run if you like ;)
> > > 
> > > NO_HZ_PERIODIC=y running now.  
> > By which I mean CONFIG_HZ_PERIODIC=y

I did get that messed up, didn't I?  Sorry for my confusion!

> > Anyhow, run for 40 minutes with out seeing a splat but my sanity check
> > on the NO_FULL_HZ=n and NO_HZ_IDLE=y this morning took 20 minutes so
> > I won't have much confidence until we are a few hours in on this.
> > 
> > Anyhow, certainly looking like a promising direction for investigation!
> > 
> Well it's done over 3 hours without a splat so I think it is fine with
> CONFIG_HZ_PERIODIC=y

Thank you!

If you run with SOFTLOCKUP_DETECTOR=n and NO_HZ_IDLE=y, but have a normal
user task waking up every few seconds on each CPU, does the problem occur?
(The question is whether any disturbance gets things going, or whether there
is something special about SOFTLOCKUP_DETECTOR=y and HZ_PERIODIC=y.

Dave, any other ideas on what might be causing this or what might be
tested?

Thanx, Paul



Re: [PATCH 4/4] of/fdt: only store the device node basename in full_name

2017-07-26 Thread Rob Herring
On Wed, Jul 26, 2017 at 5:26 AM, Michael Ellerman  wrote:
> Rob Herring  writes:
>
>> With dependencies on a statically allocated full path name converted to
>> use %pOF format specifier, we can store just the basename of node, and
>> the unflattening of the FDT can be simplified.
>>
>> This commit will affect the remaining users of full_name. After
>> analyzing these users, the remaining cases should only change some print
>> messages. The main users of full_name are providing a name for struct
>> resource. The resource names shouldn't be important other than providing
>> /proc/iomem names.
>>
>> We no longer distinguish between pre and post 0x10 dtb formats as either
>> a full path or basename will work. However, less than 0x10 formats have
>> been broken since the conversion to use libfdt (and no one has cared).
>
> For the record - yes we did care. It broke booting with old versions of
> kexec, and it was a royal P.I.T.# to debug :D

Sorry, I forgot about that one. I'll drop the statement. I had gone
back and looked and only found the issue on mpc8323 booting[1] which
was an issue with libfdt having more checks on the fdt. I proposed
some fixes, but never heard back on that.

Rob

[1] https://lkml.org/lkml/2015/6/10/820


Re: Possible circular locking dependency detected between cpu_hotplug_lock.rw_sem and wfc.work

2017-07-26 Thread Thomas Gleixner
On Wed, 26 Jul 2017, Michael Ellerman wrote:

> Hi Thomas,
> 
> I'm seeing the lockdep barf below on some bare metal Power8 machines.
> 
> This seems to be caused by our smp_cpus_done(), which does:
> 
>   void __init smp_cpus_done(unsigned int max_cpus)
>   {
>   /*
>* We want the setup_cpu() here to be called on the boot CPU, but
>* init might run on any CPU, so make sure it's invoked on the boot
>* CPU.
>*/
>   if (smp_ops && smp_ops->setup_cpu)
>   work_on_cpu_safe(boot_cpuid, smp_setup_cpu_workfn, NULL);
> 
> 
> I don't think CPU hotplug can happen at this point, so I don't think
> there's really a bug.
> 
> But it looks like the work_on_cpu_safe() call could just go away, since
> you pinned init to the boot CPU in 8fb12156b8db ("init: Pin init task to
> the boot CPU, initially"). Though I can't see where init is unpinned, so
> maybe we do still need to do it?

It's undone in sched_init_smp(). So it looks safe. The call order is:

 smp_init()
...
smp_cpus_done()

 sched_init_smp()

Thanks,

tglx


[PATCH] powerpc/boot: Fix 64-bit boot wrapper build with non-biarch compiler

2017-07-26 Thread Michael Ellerman
Historically the boot wrapper was always built 32-bit big endian, even
for 64-bit kernels. That was because old firmwares didn't necessarily
support booting a 64-bit image. Because of that arch/powerpc/boot/Makefile
uses CROSS32CC for compilation.

However when we added 64-bit little endian support, we also added
support for building the boot wrapper 64-bit. However we kept using
CROSS32CC, because in most cases it is just CC and everything works.

However if the user doesn't specify CROSS32_COMPILE (which no one ever
does AFAIK), and CC is *not* biarch (32/64-bit capable), then CROSS32CC
becomes just "gcc". On native systems that is probably OK, but if we're
cross building it definitely isn't, leading to eg:

  gcc ... -m64 -mlittle-endian -mabi=elfv2 ... arch/powerpc/boot/cpm-serial.c
  gcc: error: unrecognized argument in option ‘-mabi=elfv2’
  gcc: error: unrecognized command line option ‘-mlittle-endian’
  make: *** [zImage] Error 2

To fix it, stop using CROSS32CC, because we may or may not be building
32-bit. Instead setup a BOOTCC, which defaults to CC, and only use
CROSS32_COMPILE if it's set and we're building for 32-bit.

Fixes: 147c05168fc8 ("powerpc/boot: Add support for 64bit little endian 
wrapper")
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/boot/Makefile | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index a7814a7b1523..6f952fe1f084 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -25,12 +25,20 @@ compress-$(CONFIG_KERNEL_XZ)   := CONFIG_KERNEL_XZ
 BOOTCFLAGS:= -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
 -fno-strict-aliasing -Os -msoft-float -pipe \
 -fomit-frame-pointer -fno-builtin -fPIC -nostdinc \
--isystem $(shell $(CROSS32CC) -print-file-name=include) \
 -D$(compress-y)
 
+BOOTCC := $(CC)
 ifdef CONFIG_PPC64_BOOT_WRAPPER
 BOOTCFLAGS += -m64
+else
+BOOTCFLAGS += -m32
+ifdef CROSS32_COMPILE
+BOOTCC := $(CROSS32_COMPILE)gcc
+endif
 endif
+
+BOOTCFLAGS += -isystem $(shell $(BOOTCC) -print-file-name=include)
+
 ifdef CONFIG_CPU_BIG_ENDIAN
 BOOTCFLAGS += -mbig-endian
 else
@@ -183,10 +191,10 @@ clean-files := $(zlib-) $(zlibheader-) 
$(zliblinuxheader-) \
empty.c zImage.coff.lds zImage.ps3.lds zImage.lds
 
 quiet_cmd_bootcc = BOOTCC  $@
-  cmd_bootcc = $(CROSS32CC) -Wp,-MD,$(depfile) $(BOOTCFLAGS) -c -o $@ $<
+  cmd_bootcc = $(BOOTCC) -Wp,-MD,$(depfile) $(BOOTCFLAGS) -c -o $@ $<
 
 quiet_cmd_bootas = BOOTAS  $@
-  cmd_bootas = $(CROSS32CC) -Wp,-MD,$(depfile) $(BOOTAFLAGS) -c -o $@ $<
+  cmd_bootas = $(BOOTCC) -Wp,-MD,$(depfile) $(BOOTAFLAGS) -c -o $@ $<
 
 quiet_cmd_bootar = BOOTAR  $@
   cmd_bootar = $(CROSS32AR) -cr$(KBUILD_ARFLAGS) $@. $(filter-out 
FORCE,$^); mv $@. $@
-- 
2.7.4



Re: KVM guests freeze under upstream kernel

2017-07-26 Thread joserz
On Thu, Jul 20, 2017 at 10:18:18PM -0300, jos...@linux.vnet.ibm.com wrote:
> On Thu, Jul 20, 2017 at 03:21:59PM +1000, Paul Mackerras wrote:
> > On Thu, Jul 20, 2017 at 12:02:23AM -0300, jos...@linux.vnet.ibm.com wrote:
> > > On Thu, Jul 20, 2017 at 09:42:50AM +1000, Benjamin Herrenschmidt wrote:
> > > > On Wed, 2017-07-19 at 16:46 -0300, jos...@linux.vnet.ibm.com wrote:
> > > > > Hello!
> > > > > 
> > > > > We're not able to boot any KVM guest using upstream kernel 
> > > > > (cb8c65ccff7f77d0285f1b126c72d37b2572c865 - 4.13.0-rc1+).
> > > > > After reaching the SLOF initial counting, the guest simply freezes:
> > > > 
> > > > Can you send our .config ?
> > > 
> > > Sure,
> > > 
> > > Answering Michael as well:
> > > 
> > > It's a P9 with RHEL kernel 4.11.0-10.el7a.ppc64le installed. The problem
> > > was noticed with kernel > 4.13 (I'm currently running 4.13.0-rc1+).
> > > 
> > > QEMU is https://github.com/dgibson/qemu (ppc-for-2.10) but I gave the
> > > default packaged Qemu a try.
> > > 
> > > For the guest, I tried both a vanilla Ubuntu 17.04 and the host kernel.
> > > But they had never a chance to run since the freezing happened in SLOF.
> > > 
> > > Note that using the 4.11.0-10.el7a.ppc64le kernel it works fine
> > > (for any of these Qemu/Guest setup). With 4.13.0-rc1 I have it run after
> > > reverting that referred commit.
> > 
> > Is the host kernel running in radix mode?
> 
> yes
> 
> > 
> > Did you check the host kernel logs for any oops messages?
> 
> dmesg was clean but after sometime waiting (I forgot QEMU running in
> another terminal) I got the oops below (after rebooting the host I 
> couldn't reproduce it again).
> 
> Another test that I did was:
> Compile with transparent huge pages disabled: KVM works fine
> Compile with transparent huge pages enabled: doesn't work
>   + disabling it in /sys/kernel/mm/transparent_hugepage: doesn't work
> 
> Just out of my own curiosity I made this small change:
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h
> b/arch/powerpc/include
> index c0737c8..f94a3b6 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -80,7 +80,7 @@
>  
>   #define _PAGE_SOFT_DIRTY   _RPAGE_SW3 /* software: software dirty
>   tracking 
>#define _PAGE_SPECIAL  _RPAGE_SW2 /* software: special page */
>-#define _PAGE_DEVMAP   _RPAGE_SW1 /* software: ZONE_DEVICE page */
>+#define _PAGE_DEVMAP   _RPAGE_RSV3
> #define __HAVE_ARCH_PTE_DEVMAP
> 
> and it works. I chose _RPAGE_RSV3 because it uses the same value that
> x86 uses (0x0400UL) but I don't if it could have any side
> effect
> 

Does this change make any sense to you people?
I didn't see any side effect expect that devices backed memory will have
a bigger address space in transparent huge pages IF I understand that
correctly.

If so I can send a patch with this change.

Thank you!!



[PATCH] powerpc/Makefile: Fix ld version check with 64-bit LE-only toolchain

2017-07-26 Thread Michael Ellerman
In commit efe0160cfd40 ("powerpc/64: Linker on-demand sfpr functions
for modules"), we added an ld version check early in the powerpc
top-level Makefile.

Because the Makefile runs before the kernel config is setup, the
checks for CONFIG_CPU_LITTLE_ENDIAN etc. all take the default case. So
we end up configuring ld for 32-bit big endian.

That would be OK, except that for historical (or perhaps no) reason,
we use 'override LD' to add the endian flags to the LD variable
itself, rather than the normal approach of adding them to LDFLAGS.

The end result is that when we check the ld version we run it as:

  $(CROSS_COMPILE)ld -EB -m elf32ppc --version

This often works, unless you are using a 64-bit only and/or little
endian only, toolchain. In which case you see something like:

  $ make defconfig
  powerpc64le-linux-ld: unrecognised emulation mode: elf32ppc
  Supported emulations: elf64lppc elf32lppc elf32lppclinux elf32lppcsim
  /bin/sh: 1: [: -ge: unexpected operator

The proper fix is to stop using 'override LD', but that will require a
fair bit of testing. Instead we can fix it for now just by reordering
the Makefile to do the version check earlier.

Fixes: efe0160cfd40 ("powerpc/64: Linker on-demand sfpr functions for modules")
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/Makefile | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 8d4ed73d5490..e2b3e7a00c9e 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -59,6 +59,19 @@ machine-$(CONFIG_PPC64) += 64
 machine-$(CONFIG_CPU_LITTLE_ENDIAN) += le
 UTS_MACHINE := $(subst $(space),,$(machine-y))
 
+# XXX This needs to be before we override LD below
+ifdef CONFIG_PPC32
+KBUILD_LDFLAGS_MODULE += arch/powerpc/lib/crtsavres.o
+else
+ifeq ($(call ld-ifversion, -ge, 22500, y),y)
+# Have the linker provide sfpr if possible.
+# There is a corresponding test in arch/powerpc/lib/Makefile
+KBUILD_LDFLAGS_MODULE += --save-restore-funcs
+else
+KBUILD_LDFLAGS_MODULE += arch/powerpc/lib/crtsavres.o
+endif
+endif
+
 ifeq ($(CONFIG_CPU_LITTLE_ENDIAN),y)
 override LD+= -EL
 LDEMULATION:= lppc
@@ -190,18 +203,6 @@ else
 CHECKFLAGS += -D__LITTLE_ENDIAN__
 endif
 
-ifdef CONFIG_PPC32
-KBUILD_LDFLAGS_MODULE += arch/powerpc/lib/crtsavres.o
-else
-ifeq ($(call ld-ifversion, -ge, 22500, y),y)
-# Have the linker provide sfpr if possible.
-# There is a corresponding test in arch/powerpc/lib/Makefile
-KBUILD_LDFLAGS_MODULE += --save-restore-funcs
-else
-KBUILD_LDFLAGS_MODULE += arch/powerpc/lib/crtsavres.o
-endif
-endif
-
 ifeq ($(CONFIG_476FPE_ERR46),y)
KBUILD_LDFLAGS_MODULE += --ppc476-workaround \
-T $(srctree)/arch/powerpc/platforms/44x/ppc476_modules.lds
-- 
2.7.4



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Wed, 26 Jul 2017 13:28:01 +0100
Jonathan Cameron  wrote:

> On Wed, 26 Jul 2017 10:32:32 +0100
> Jonathan Cameron  wrote:
> 
> > On Wed, 26 Jul 2017 09:16:23 +0100
> > Jonathan Cameron  wrote:
> >   
> > > On Tue, 25 Jul 2017 21:12:17 -0700
> > > "Paul E. McKenney"  wrote:
> > > 
> > > > On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:  
> > > > > From: "Paul E. McKenney" 
> > > > > Date: Tue, 25 Jul 2017 20:55:45 -0700
> > > > > 
> > > > > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote:   
> > > > > >  
> > > > > >> Just to report, turning softlockup back on fixes things for me on
> > > > > >> sparc64 too.
> > > > > > 
> > > > > > Very good!
> > > > > > 
> > > > > >> The thing about softlockup is it runs an hrtimer, which seems to 
> > > > > >> run
> > > > > >> about every 4 seconds.
> > > > > > 
> > > > > > I could see where that could shake things loose, but I am surprised 
> > > > > > that
> > > > > > it would be needed.  I ran a short run with 
> > > > > > CONFIG_SOFTLOCKUP_DETECTOR=y
> > > > > > with no trouble, but I will be running a longer test later on.
> > > > > > 
> > > > > >> So I wonder if this is a NO_HZ problem.
> > > > > > 
> > > > > > Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  What 
> > > > > > are
> > > > > > you running?  (Again, my symptoms are slightly different, so I might
> > > > > > be seeing a different bug.)
> > > > > 
> > > > > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.
> > > > > 
> > > > > To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR disabled.   
> > > > >  
> > > > 
> > > > Same here -- but my failure case happens fairly rarely, so it will take
> > > > some time to gain reasonable confidence that enabling 
> > > > SOFTLOCKUP_DETECTOR
> > > > had effect.
> > > > 
> > > > But you are right, might be interesting to try NO_HZ_PERIODIC=y
> > > > or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)
> > > > 
> > > > Thanx, Paul
> > > >   
> > > I'll be the headless chicken running around and trying as many tests
> > > as I can fit in.  Typical time to see the failure for us is sub 10
> > > minutes so we'll see how far we get.
> > > 
> > > Make me a list to run if you like ;)
> > > 
> > > NO_HZ_PERIODIC=y running now.
> > By which I mean CONFIG_HZ_PERIODIC=y
> > 
> > Anyhow, run for 40 minutes with out seeing a splat but my sanity check
> > on the NO_FULL_HZ=n and NO_HZ_IDLE=y this morning took 20 minutes so
> > I won't have much confidence until we are a few hours in on this.
> > 
> > Anyhow, certainly looking like a promising direction for investigation!
> >   
> Well it's done over 3 hours without a splat so I think it is fine with
> CONFIG_HZ_PERIODIC=y
> 
As I think we expected, the problem occurs with NO_HZ_FULL.
Happened pretty quickly but given the somewhat random nature,
might just be coincidence.

Jonathan
> 
> > Jonathan
> >   
> > > 
> > > Jonathan
> > > 
> > > ___
> > > linuxarm mailing list
> > > linux...@huawei.com
> > > http://rnd-openeuler.huawei.com/mailman/listinfo/linuxarm
> > 
> > 
> > ___
> > linuxarm mailing list
> > linux...@huawei.com
> > http://rnd-openeuler.huawei.com/mailman/listinfo/linuxarm  
> 
> 
> ___
> linuxarm mailing list
> linux...@huawei.com
> http://rnd-openeuler.huawei.com/mailman/listinfo/linuxarm




Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Wed, 26 Jul 2017 10:32:32 +0100
Jonathan Cameron  wrote:

> On Wed, 26 Jul 2017 09:16:23 +0100
> Jonathan Cameron  wrote:
> 
> > On Tue, 25 Jul 2017 21:12:17 -0700
> > "Paul E. McKenney"  wrote:
> >   
> > > On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:
> > > > From: "Paul E. McKenney" 
> > > > Date: Tue, 25 Jul 2017 20:55:45 -0700
> > > >   
> > > > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote:  
> > > > >> Just to report, turning softlockup back on fixes things for me on
> > > > >> sparc64 too.  
> > > > > 
> > > > > Very good!
> > > > >   
> > > > >> The thing about softlockup is it runs an hrtimer, which seems to run
> > > > >> about every 4 seconds.  
> > > > > 
> > > > > I could see where that could shake things loose, but I am surprised 
> > > > > that
> > > > > it would be needed.  I ran a short run with 
> > > > > CONFIG_SOFTLOCKUP_DETECTOR=y
> > > > > with no trouble, but I will be running a longer test later on.
> > > > >   
> > > > >> So I wonder if this is a NO_HZ problem.  
> > > > > 
> > > > > Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  What are
> > > > > you running?  (Again, my symptoms are slightly different, so I might
> > > > > be seeing a different bug.)  
> > > > 
> > > > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.
> > > > 
> > > > To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR disabled. 
> > > >  
> > > 
> > > Same here -- but my failure case happens fairly rarely, so it will take
> > > some time to gain reasonable confidence that enabling SOFTLOCKUP_DETECTOR
> > > had effect.
> > > 
> > > But you are right, might be interesting to try NO_HZ_PERIODIC=y
> > > or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)
> > > 
> > >   Thanx, Paul
> > > 
> > I'll be the headless chicken running around and trying as many tests
> > as I can fit in.  Typical time to see the failure for us is sub 10
> > minutes so we'll see how far we get.
> > 
> > Make me a list to run if you like ;)
> > 
> > NO_HZ_PERIODIC=y running now.  
> By which I mean CONFIG_HZ_PERIODIC=y
> 
> Anyhow, run for 40 minutes with out seeing a splat but my sanity check
> on the NO_FULL_HZ=n and NO_HZ_IDLE=y this morning took 20 minutes so
> I won't have much confidence until we are a few hours in on this.
> 
> Anyhow, certainly looking like a promising direction for investigation!
> 
Well it's done over 3 hours without a splat so I think it is fine with
CONFIG_HZ_PERIODIC=y


> Jonathan
> 
> > 
> > Jonathan
> > 
> > ___
> > linuxarm mailing list
> > linux...@huawei.com
> > http://rnd-openeuler.huawei.com/mailman/listinfo/linuxarm  
> 
> 
> ___
> linuxarm mailing list
> linux...@huawei.com
> http://rnd-openeuler.huawei.com/mailman/listinfo/linuxarm




Possible circular locking dependency detected between cpu_hotplug_lock.rw_sem and wfc.work

2017-07-26 Thread Michael Ellerman
Hi Thomas,

I'm seeing the lockdep barf below on some bare metal Power8 machines.

This seems to be caused by our smp_cpus_done(), which does:

  void __init smp_cpus_done(unsigned int max_cpus)
  {
/*
 * We want the setup_cpu() here to be called on the boot CPU, but
 * init might run on any CPU, so make sure it's invoked on the boot
 * CPU.
 */
if (smp_ops && smp_ops->setup_cpu)
work_on_cpu_safe(boot_cpuid, smp_setup_cpu_workfn, NULL);


I don't think CPU hotplug can happen at this point, so I don't think
there's really a bug.

But it looks like the work_on_cpu_safe() call could just go away, since
you pinned init to the boot CPU in 8fb12156b8db ("init: Pin init task to
the boot CPU, initially"). Though I can't see where init is unpinned, so
maybe we do still need to do it?

cheers


[1.468911] ipr: IBM Power RAID SCSI Device Driver version: 2.6.4 (March 14, 
2017)
[1.469032] ipr 0001:08:00.0: Found IOA with IRQ: 0
[1.469167] 
[1.469193] ==
[1.469250] WARNING: possible circular locking dependency detected
[1.469309] 4.12.0-gcc6-gb40b238 #1 Not tainted
[1.469355] --
[1.469413] kworker/0:1/971 is trying to acquire lock:
[1.469459]  (cpu_hotplug_lock.rw_sem){++}, at: [] 
apply_workqueue_attrs+0x34/0xa0
[1.469544] 
[1.469544] but task is already holding lock:
[1.469602]  ((&wfc.work)){+.+.+.}, at: [] 
process_one_work+0x25c/0x800
[1.469674] 
[1.469674] which lock already depends on the new lock.
[1.469674] 
[1.469744] 
[1.469744] the existing dependency chain (in reverse order) is:
[1.469813] 
[1.469813] -> #1 ((&wfc.work)){+.+.+.}:
[1.469862]flush_work+0x74/0x340
[1.469898]work_on_cpu+0xb8/0xe0
[1.469934]work_on_cpu_safe+0x80/0xd0
[1.469982]smp_cpus_done+0x54/0xb0
[1.470017]smp_init+0x1a0/0x1cc
[1.470053]kernel_init_freeable+0x1f4/0x3b0
[1.470100]kernel_init+0x24/0x160
[1.470137]ret_from_kernel_thread+0x5c/0x74
[1.470183] 
[1.470183] -> #0 (cpu_hotplug_lock.rw_sem){++}:
[1.470244]lock_acquire+0xec/0x2e0
[1.470279]cpus_read_lock+0x4c/0xd0
[1.470315]apply_workqueue_attrs+0x34/0xa0
[1.470362]__alloc_workqueue_key+0x1b4/0x5b4
[1.470410]scsi_host_alloc+0x328/0x4b0
[1.470457]ipr_probe+0xb4/0x1f00
[1.470493]local_pci_probe+0x6c/0x140
[1.470541]work_for_cpu_fn+0x38/0x60
[1.470588]process_one_work+0x310/0x800
[1.470635]worker_thread+0x2b0/0x520
[1.470682]kthread+0x164/0x1b0
[1.470718]ret_from_kernel_thread+0x5c/0x74
[1.470764] 
[1.470764] other info that might help us debug this:
[1.470764] 
[1.470834]  Possible unsafe locking scenario:
[1.470834] 
[1.470891]CPU0CPU1
[1.470937]
[1.470984]   lock((&wfc.work));
[1.471020]lock(cpu_hotplug_lock.rw_sem);
[1.471078]lock((&wfc.work));
[1.471136]   lock(cpu_hotplug_lock.rw_sem);
[1.471183] 
[1.471183]  *** DEADLOCK ***
[1.471183] 
[1.471242] 2 locks held by kworker/0:1/971:
[1.471289]  #0:  ("events"){.+.+.+}, at: [] 
process_one_work+0x25c/0x800
[1.471360]  #1:  ((&wfc.work)){+.+.+.}, at: [] 
process_one_work+0x25c/0x800
[1.471488] 
[1.471488] stack backtrace:
[1.471582] CPU: 0 PID: 971 Comm: kworker/0:1 Not tainted 
4.12.0-gcc6-gb40b238 #1
[1.471721] Workqueue: events work_for_cpu_fn
[1.471813] Call Trace:
[1.471859] [c007f877f560] [c0b491b8] dump_stack+0xe8/0x160 
(unreliable)
[1.471996] [c007f877f5a0] [c0153e68] 
print_circular_bug+0x288/0x3d0
[1.472133] [c007f877f640] [c0157ec8] 
__lock_acquire+0x1858/0x1a20
[1.472271] [c007f877f7b0] [c0158bfc] lock_acquire+0xec/0x2e0
[1.472386] [c007f877f880] [c00d4f8c] cpus_read_lock+0x4c/0xd0
[1.472501] [c007f877f8b0] [c0100974] 
apply_workqueue_attrs+0x34/0xa0
[1.472639] [c007f877f8f0] [c0102ef4] 
__alloc_workqueue_key+0x1b4/0x5b4
[1.472776] [c007f877f9b0] [c078cd58] scsi_host_alloc+0x328/0x4b0
[1.472913] [c007f877fa50] [c07bd414] ipr_probe+0xb4/0x1f00
[1.473028] [c007f877fbb0] [c06157fc] local_pci_probe+0x6c/0x140
[1.473166] [c007f877fc40] [c00f8298] work_for_cpu_fn+0x38/0x60
[1.473281] [c007f877fc70] [c00fdbe0] 
process_one_work+0x310/0x800
[1.473418] [c007f877fd30] [c00fe380] worker_thread+0x2b0/0x520
[1.473534] [c007f877fdc0] [c0107c84] kthread+0x164/0x1b0
[1.473649] [c007f877fe30] [c000b6e8

Re: [PATCH 0/4] Allow non-legacy cards to be vgaarb default

2017-07-26 Thread Laszlo Ersek
On 07/25/17 17:56, Gabriele Paoloni wrote:
> Hi Laszlo
> 
> [...]
> 
>>
>> Having practically zero background in gfx development (either kernel or
>> Xorg), I think the problem is that vga_default_device() /
>> vga_set_default_device(), which -- apparently -- "boot_vga" is based
>> upon, come from "drivers/gpu/vga/vgaarb.c". Namely, the concept of
>> "primary / boot display device" is tied to the VGA arbiter, plus only a
>> PCI device can currently be marked as primary/boot display device.
>>
>> Can these concepts be split from each other? (I can fully imagine that
>> this would result in a userspace visible interface change (or
>> addition),
>> so that e.g. "/sys/devices/**/boot_gpu" would have to be consulted by
>> display servers.)
>>
>> (Sorry if I'm totally wrong.)
>>
>> ... Hm, reading the thread starter at
>> > d...@lists.ozlabs.org/msg120851.html>,
>> and the references within... It looks like this work is motivated by
>> hardware that is supposed to be PCI, but actually breaks the specs. Is
>> that correct? If so, then I don't think I can suggest anything useful.
> 
> My understanding is that the current PCIe HW is specs compliant but the
> vgaarb, in order to make a VGA device the default one, requires all the
> bridges on top of such device to have the "VGA Enable" bit set (optional
> bit in the PCI Express™ to PCI/PCI-X Bridge Spec). I.e. all the bridges
> on top have to support legacy VGA devices; and this is not mandatory
> from the specs...right?

Sounds very plausible to me. And, I guess if the "boot GPU" concept were
split from the VGA arbiter, then the VGA arbiter's above requirement
(a) would not have to be disturbed, and
(b) would no longer interfere with the kind of hardware that's being
discussed.

Thanks
Laszlo

> BTW my VGA experience is limited too...this is just my understanding...
> 
> Gab
> 
>> Specs exist so that hardware vendors and software authors follow them.
>> If hardware does not conform, then software should either refuse to
>> work
>> with it, or handle it with quirks, on a case-by-case basis. I guess
>> this
>> means that I don't agree with the
>>
>>   broad[] suggest[ion] that a more generic solution would be better
>>
>> which seems to disqualify me from the discussion, as it must have been
>> suggested by people with incomparably more experience than what I have
>> :)
>>
>> Thanks
>> Laszlo



Re: [RFC Part1 PATCH v3 13/17] x86/io: Unroll string I/O when SEV is active

2017-07-26 Thread Arnd Bergmann
On Tue, Jul 25, 2017 at 11:51 AM, David Laight  wrote:
> From: Brijesh Singh
>> Sent: 24 July 2017 20:08
>> From: Tom Lendacky 
>>
>> Secure Encrypted Virtualization (SEV) does not support string I/O, so
>> unroll the string I/O operation into a loop operating on one element at
>> a time.
>>
>> Signed-off-by: Tom Lendacky 
>> Signed-off-by: Brijesh Singh 
>> ---
>>  arch/x86/include/asm/io.h | 26 ++
>>  1 file changed, 22 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
>> index e080a39..2f3c002 100644
>> --- a/arch/x86/include/asm/io.h
>> +++ b/arch/x86/include/asm/io.h
>> @@ -327,14 +327,32 @@ static inline unsigned type in##bwl##_p(int port)  
>>  \
>>   \
>>  static inline void outs##bwl(int port, const void *addr, unsigned long 
>> count) \
>>  {

This will clash with a fix I did to add a "memory" clobber
for the traditional implementation, see
https://patchwork.kernel.org/patch/9854573/

> Is it even worth leaving these as inline functions?
> Given the speed of IO cycles it is unlikely that the cost of calling a real
> function will be significant.
> The code bloat reduction will be significant.

I think the smallest code would be the original "rep insb" etc, which
should be smaller than a function call, unlike the loop. Then again,
there is a rather small number of affected device drivers, almost all
of them for ancient hardware that you won't even build in a 64-bit
x86 kernel, see the list below. The only user I found that is actually
still relevant is drivers/tty/hvc/hvc_xen.c, which uses it for the early
console.

   Arnd

---
Full list for reference

$ git grep -wl '\(__ide_\|\)\(in\|out\)s\(b\|w\|l\)' drivers sound/
drivers/atm/horizon.c
drivers/cdrom/gdrom.c
drivers/gpu/drm/vmwgfx/vmwgfx_msg.h
drivers/ide/ide-io-std.c
drivers/isdn/hardware/avm/avmcard.h
drivers/isdn/hardware/eicon/divasmain.c
drivers/isdn/hardware/mISDN/avmfritz.c
drivers/isdn/hardware/mISDN/iohelper.h
drivers/isdn/hardware/mISDN/netjet.c
drivers/isdn/hardware/mISDN/w6692.c
drivers/isdn/hisax/asuscom.c
drivers/isdn/hisax/avm_a1.c
drivers/isdn/hisax/avm_a1p.c
drivers/isdn/hisax/avm_pci.c
drivers/isdn/hisax/diva.c
drivers/isdn/hisax/elsa.c
drivers/isdn/hisax/gazel.c
drivers/isdn/hisax/hisax_fcpcipnp.c
drivers/isdn/hisax/ix1_micro.c
drivers/isdn/hisax/mic.c
drivers/isdn/hisax/netjet.c
drivers/isdn/hisax/niccy.c
drivers/isdn/hisax/saphir.c
drivers/isdn/hisax/sedlbauer.c
drivers/isdn/hisax/sportster.c
drivers/isdn/hisax/teles3.c
drivers/isdn/hisax/w6692.c
drivers/isdn/hisax/w6692.h
drivers/media/rc/winbond-cir.c
drivers/mtd/spi-nor/aspeed-smc.c
drivers/net/appletalk/cops.c
drivers/net/arcnet/arcdevice.h
drivers/net/arcnet/com20020.c
drivers/net/ethernet/3com/3c509.c
drivers/net/ethernet/3com/3c515.c
drivers/net/ethernet/3com/3c574_cs.c
drivers/net/ethernet/3com/3c589_cs.c
drivers/net/ethernet/8390/axnet_cs.c
drivers/net/ethernet/8390/mcf8390.c
drivers/net/ethernet/8390/ne.c
drivers/net/ethernet/8390/ne2k-pci.c
drivers/net/ethernet/8390/pcnet_cs.c
drivers/net/ethernet/8390/smc-ultra.c
drivers/net/ethernet/amd/nmclan_cs.c
drivers/net/ethernet/fujitsu/fmvj18x_cs.c
drivers/net/ethernet/hp/hp100.c
drivers/net/ethernet/smsc/smc9194.c
drivers/net/ethernet/smsc/smc91c92_cs.c
drivers/net/ethernet/smsc/smc91x.h
drivers/net/ethernet/xircom/xirc2ps_cs.c
drivers/net/sb1000.c
drivers/net/wan/sbni.c
drivers/net/wireless/cisco/airo.c
drivers/net/wireless/intersil/hostap/hostap_cs.c
drivers/net/wireless/intersil/hostap/hostap_plx.c
drivers/net/wireless/marvell/libertas/if_cs.c
drivers/net/wireless/wl3501_cs.c
drivers/parport/parport_pc.c
drivers/pcmcia/m32r_cfc.c
drivers/scsi/NCR53c406a.c
drivers/scsi/aha152x.c
drivers/scsi/eata_pio.c
drivers/scsi/fdomain.c
drivers/scsi/g_NCR5380.c
drivers/scsi/imm.c
drivers/scsi/nsp32_io.h
drivers/scsi/pcmcia/nsp_io.h
drivers/scsi/pcmcia/sym53c500_cs.c
drivers/scsi/ppa.c
drivers/scsi/qlogicfas408.c
drivers/scsi/sym53c416.c
drivers/staging/comedi/drivers/adl_pci9111.c
drivers/staging/comedi/drivers/cb_pcidas.c
drivers/staging/comedi/drivers/das16m1.c
drivers/staging/comedi/drivers/das1800.c
drivers/staging/iio/adc/ad7606_par.c
drivers/tty/hvc/hvc_xen.c
drivers/tty/isicom.c
drivers/tty/rocket_int.h
drivers/usb/host/isp1362.h
drivers/usb/musb/blackfin.c
sound/drivers/opl4/opl4_lib.c
sound/isa/gus/gus_dram.c
sound/isa/gus/gus_pcm.c


Re: [RESEND] [v3 PATCH 2/2] powernv/powerpc: Clear PECE1 in LPCR via stop-api only on Hotplug

2017-07-26 Thread Nicholas Piggin
On Fri, 21 Jul 2017 16:31:34 +0530
"Gautham R. Shenoy"  wrote:

> From: "Gautham R. Shenoy" 
> 
> Currently we use the stop-api provided by the firmware to program the
> SLW engine to restore the values of hypervisor resources that get lost
> on deeper idle states (such as winkle). Since the deep states were
> only used for CPU-Hotplug on POWER8 systems, we would program the LPCR
> to have the PECE1 bit since Hotplugged CPUs shouldn't be spuriously
> woken up by decrementer.
> 
> On POWER9, some of the deep platform idle states such as stop4 can be
> used in cpuidle as well. In this case, we want the CPU in stop4 to be
> woken up by the decrementer when some timer on the CPU expires.
> 
> In this patch, we program the stop-api for LPCR with PECE1
> bit cleared only when we are offlining the CPU and set it
> back once the CPU is online.

Looks pretty good to me, thanks!

Reviewed-by: Nicholas Piggin 


> 
> Signed-off-by: Gautham R. Shenoy 
> ---
> v2 --> v3:
> - Program the LPCR during platform idle entry/exit on both POWER8 and
>   POWER9
> 
> v1 --> v2:
> - Move the LPCR manipulations for CPU-Hotplug into idle.c
> 
>  arch/powerpc/platforms/powernv/idle.c | 34 +-
>  arch/powerpc/platforms/powernv/smp.c  |  8 
>  2 files changed, 33 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/idle.c 
> b/arch/powerpc/platforms/powernv/idle.c
> index 2abee07..a1296e7 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -68,7 +68,7 @@ static int pnv_save_sprs_for_deep_states(void)
>* all cpus at boot. Get these reg values of current cpu and use the
>* same across all cpus.
>*/
> - uint64_t lpcr_val = mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1;
> + uint64_t lpcr_val = mfspr(SPRN_LPCR);
>   uint64_t hid0_val = mfspr(SPRN_HID0);
>   uint64_t hid1_val = mfspr(SPRN_HID1);
>   uint64_t hid4_val = mfspr(SPRN_HID4);
> @@ -355,6 +355,14 @@ void power9_idle(void)
>  }
>  
>  #ifdef CONFIG_HOTPLUG_CPU
> +static void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val)
> +{
> + u64 pir = get_hard_smp_processor_id(cpu);
> +
> + mtspr(SPRN_LPCR, lpcr_val);
> + opal_slw_set_reg(pir, SPRN_LPCR, lpcr_val);
> +}
> +
>  /*
>   * pnv_cpu_offline: A function that puts the CPU into the deepest
>   * available platform idle state on a CPU-Offline.
> @@ -364,6 +372,20 @@ unsigned long pnv_cpu_offline(unsigned int cpu)
>  {
>   unsigned long srr1;
>   u32 idle_states = pnv_get_supported_cpuidle_states();
> + u64 lpcr_val;
> +
> + /*
> +  * We don't want to take decrementer interrupts while we are
> +  * offline, so clear LPCR:PECE1. We keep PECE2 (and
> +  * LPCR_PECE_HVEE on P9) enabled as to let IPIs in.
> +  *
> +  * If the CPU gets woken up by a special wakeup, ensure that
> +  * the SLW engine sets LPCR with decrementer bit cleared, else
> +  * the CPU will come back to the kernel due to a spurious
> +  * wakeup.
> +  */
> + lpcr_val = mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1;
> + pnv_program_cpu_hotplug_lpcr(cpu, lpcr_val);
>  
>   __ppc64_runlatch_off();
>  
> @@ -394,6 +416,16 @@ unsigned long pnv_cpu_offline(unsigned int cpu)
>  
>   __ppc64_runlatch_on();
>  
> + /*
> +  * Re-enable decrementer interrupts in LPCR.
> +  *
> +  * Further, we want stop states to be woken up by decrementer
> +  * for non-hotplug cases. So program the LPCR via stop api as
> +  * well.
> +  */
> + lpcr_val = mfspr(SPRN_LPCR) | (u64)LPCR_PECE1;
> + pnv_program_cpu_hotplug_lpcr(cpu, lpcr_val);
> +
>   return srr1;
>  }
>  #endif
> diff --git a/arch/powerpc/platforms/powernv/smp.c 
> b/arch/powerpc/platforms/powernv/smp.c
> index 40dae96..536b07b 100644
> --- a/arch/powerpc/platforms/powernv/smp.c
> +++ b/arch/powerpc/platforms/powernv/smp.c
> @@ -164,12 +164,6 @@ static void pnv_smp_cpu_kill_self(void)
>   if (cpu_has_feature(CPU_FTR_ARCH_207S))
>   wmask = SRR1_WAKEMASK_P8;
>  
> - /* We don't want to take decrementer interrupts while we are offline,
> -  * so clear LPCR:PECE1. We keep PECE2 (and LPCR_PECE_HVEE on P9)
> -  * enabled as to let IPIs in.
> -  */
> - mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) & ~(u64)LPCR_PECE1);
> -
>   while (!generic_check_cpu_restart(cpu)) {
>   /*
>* Clear IPI flag, since we don't handle IPIs while
> @@ -219,8 +213,6 @@ static void pnv_smp_cpu_kill_self(void)
>  
>   }
>  
> - /* Re-enable decrementer interrupts */
> - mtspr(SPRN_LPCR, mfspr(SPRN_LPCR) | LPCR_PECE1);
>   DBG("CPU%d coming online...\n", cpu);
>  }
>  



Re: [v3 PATCH 1/2] powernv/powerpc:Save/Restore additional SPRs for stop4 cpuidle

2017-07-26 Thread Nicholas Piggin
On Fri, 21 Jul 2017 16:11:37 +0530
"Gautham R. Shenoy"  wrote:

> From: "Gautham R. Shenoy" 
> 
> The stop4 idle state on POWER9 is a deep idle state which loses
> hypervisor resources, but whose latency is low enough that it can be
> exposed via cpuidle.
> 
> Until now, the deep idle states which lose hypervisor resources (eg:
> winkle) were only exposed via CPU-Hotplug.  Hence currently on wakeup
> from such states, barring a few SPRs which need to be restored to
> their older value, rest of the SPRS are reinitialized to their values
> corresponding to that at boot time.
> 
> When stop4 is used in the context of cpuidle, we want these additional
> SPRs to be restored to their older value, to ensure that the context
> on the CPU coming back from idle is same as it was before going idle.
> 
> In this patch, we define a SPR save area in PACA (since we have used
> up the volatile register space in the stack) and on POWER9, we restore
> SPRN_PID, SPRN_LDBAR, SPRN_FSCR, SPRN_HFSCR, SPRN_MMCRA, SPRN_MMCR1,
> SPRN_MMCR2 to the values they had before entering stop.
> 
> Signed-off-by: Gautham R. Shenoy 

Looks good to me. Keeping in mind we need to tidy up and unify
all this SPR handling and save/restore etc. in the longer term.

Reviewed-by: Nicholas Piggin 

> ---
> v2-->v3:
> - Use a structure instead of an array for the stop sprs save area.
> - Name the offsets into the paca->stop_sprs as STOP_XXX instead of
>   PACA_XXX.
> - Add comments in the assembly code explaining why saving/restoring
>   is not needed on POWER8.
> v1-->v2:
> No change
> 
>  arch/powerpc/include/asm/cpuidle.h | 11 +++
>  arch/powerpc/include/asm/paca.h|  7 
>  arch/powerpc/kernel/asm-offsets.c  |  8 +
>  arch/powerpc/kernel/idle_book3s.S  | 65 
> --
>  4 files changed, 89 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/cpuidle.h 
> b/arch/powerpc/include/asm/cpuidle.h
> index 52586f9..8a174cb 100644
> --- a/arch/powerpc/include/asm/cpuidle.h
> +++ b/arch/powerpc/include/asm/cpuidle.h
> @@ -67,6 +67,17 @@
>  #define ERR_DEEP_STATE_ESL_MISMATCH  -2
>  
>  #ifndef __ASSEMBLY__
> +/* Additional SPRs that need to be saved/restored during stop */
> +struct stop_sprs {
> + u64 pid;
> + u64 ldbar;
> + u64 fscr;
> + u64 hfscr;
> + u64 mmcr1;
> + u64 mmcr2;
> + u64 mmcra;
> +};
> +
>  extern u32 pnv_fastsleep_workaround_at_entry[];
>  extern u32 pnv_fastsleep_workaround_at_exit[];
>  
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index dc88a31..04b60af 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -31,6 +31,7 @@
>  #endif
>  #include 
>  #include 
> +#include 
>  
>  register struct paca_struct *local_paca asm("r13");
>  
> @@ -183,6 +184,12 @@ struct paca_struct {
>   struct paca_struct **thread_sibling_pacas;
>   /* The PSSCR value that the kernel requested before going to stop */
>   u64 requested_psscr;
> +
> + /*
> +  * Save area for additional SPRs that need to be
> +  * saved/restored during cpuidle stop.
> +  */
> + struct stop_sprs stop_sprs;
>  #endif
>  
>  #ifdef CONFIG_PPC_STD_MMU_64
> diff --git a/arch/powerpc/kernel/asm-offsets.c 
> b/arch/powerpc/kernel/asm-offsets.c
> index a7b5af3..e2a48df 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -743,6 +743,14 @@ int main(void)
>   OFFSET(PACA_SUBCORE_SIBLING_MASK, paca_struct, subcore_sibling_mask);
>   OFFSET(PACA_SIBLING_PACA_PTRS, paca_struct, thread_sibling_pacas);
>   OFFSET(PACA_REQ_PSSCR, paca_struct, requested_psscr);
> +#define STOP_SPR(x, f)   OFFSET(x, paca_struct, stop_sprs.f)
> + STOP_SPR(STOP_PID, pid);
> + STOP_SPR(STOP_LDBAR, ldbar);
> + STOP_SPR(STOP_FSCR, fscr);
> + STOP_SPR(STOP_HFSCR, hfscr);
> + STOP_SPR(STOP_MMCR1, mmcr1);
> + STOP_SPR(STOP_MMCR2, mmcr2);
> + STOP_SPR(STOP_MMCRA, mmcra);
>  #endif
>  
>   DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
> diff --git a/arch/powerpc/kernel/idle_book3s.S 
> b/arch/powerpc/kernel/idle_book3s.S
> index 5adb390e..5e6af97 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -84,7 +84,61 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
>   std r3,_WORT(r1)
>   mfspr   r3,SPRN_WORC
>   std r3,_WORC(r1)
> +/*
> + * On POWER9, there are idle states such as stop4, invoked via cpuidle,
> + * that lose hypervisor resources. In such cases, we need to save
> + * additional SPRs before entering those idle states so that they can
> + * be restored to their older values on wakeup from the idle state.
> + *
> + * On POWER8, the only such deep idle state is winkle which is used
> + * only in the context of CPU-Hotplug, where these additional SPRs are
> + * reinitiazed to a sane value. Hence there is no need to save/restore
> + * these SPRs.
> + */
> +BEGIN_FTR_SECTION
> 

Re: [PATCH 1/6] powerpc: Free up four 64K PTE bits in 4K backed HPTE pages

2017-07-26 Thread Aneesh Kumar K.V
Ram Pai  writes:

> Rearrange 64K PTE bits to  free  up  bits 3, 4, 5  and  6,
> in the 4K backed HPTE pages.These bits continue to be used
> for 64K backed HPTE pages in this patch, but will be freed
> up in the next patch. The  bit  numbers are big-endian  as
> defined in the ISA3.0
>
> The patch does the following change to the 4k htpe backed
> 64K PTE's format.
>
> H_PAGE_BUSY moves from bit 3 to bit 9 (B bit in the figure
>   below)
> V0 which occupied bit 4 is not used anymore.
> V1 which occupied bit 5 is not used anymore.
> V2 which occupied bit 6 is not used anymore.
> V3 which occupied bit 7 is not used anymore.
>
> Before the patch, the 4k backed 64k PTE format was as follows
>
>  0 1 2 3 4  5  6  7  8 9 10...63
>  : : : : :  :  :  :  : : ::
>  v v v v v  v  v  v  v v vv
>
> ,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-,
> |x|x|x|B|V0|V1|V2|V3|x| | |x|x||x|x|x|x| <- primary pte
> '_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_'
> |S|G|I|X|S |G |I |X |S|G|I|X|..|S|G|I|X| <- secondary pte
> '_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_'
>
> After the patch, the 4k backed 64k PTE format is as follows
>
>  0 1 2 3 4  5  6  7  8 9 10...63
>  : : : : :  :  :  :  : : ::
>  v v v v v  v  v  v  v v vv
>
> ,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-,
> |x|x|x| |  |  |  |  |x|B| |x|x||.|.|.|.| <- primary pte
> '_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_'
> |S|G|I|X|S |G |I |X |S|G|I|X|..|S|G|I|X| <- secondary pte
> '_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_'
>
> the four  bits S,G,I,X (one quadruplet per 4k HPTE) that
> cache  the  hash-bucket  slot  value, is initialized  to
> 1,1,1,1 indicating -- an invalid slot.   If  a HPTE gets
> cached in a   slot(i.e 7th  slot  of  secondary hash
> bucket), it is  released  immediately. In  other  words,
> even  though    is   a valid slot  value in the hash
> bucket, we consider it invalid and  release the slot and
> the HPTE.  This  gives  us  the opportunity to determine
> the validity of S,G,I,X  bits  based on its contents and
> not on any of the bits V0,V1,V2 or V3 in the primary PTE
>
> When   we  release  aHPTEcached in the  slot
> we alsorelease  a  legitimate   slot  in the primary
> hash bucket  and  unmap  its  corresponding  HPTE.  This
> is  to  ensure   that  we do get a HPTE cached in a slot
> of the primary hash bucket, the next time we retry.
>
> Though  treating    slot  as  invalid,  reduces  the
> number of  available  slots  in the hash bucket and  may
> have  an  effect   on the performance, the probabilty of
> hitting a  slot is extermely low.
>
> Compared  to  the  current   scheme, the above described
> scheme  reduces  the  number of false hash table updates
> significantly   andhas  the   added   advantage   of
> releasing  four  valuable  PTE bits for other purpose.
>
> NOTE:even though bits 3, 4, 5, 6, 7 are  not  used  when
> the  64K  PTE is backed by 4k HPTE,  they continue to be
> used  if  the  PTE  gets  backed  by 64k HPTE.  The next
> patch will decouple that aswell, and truely  release the
> bits.
>
> This idea was jointly developed by Paul Mackerras,
> Aneesh, Michael Ellermen and myself.
>
> 4K PTE format remains unchanged currently.
>
> The patch does the following code changes
> a) PTE flags are split between 64k and 4k  header files.
> b) __hash_page_4K()  is  reimplemented   to reflect the
>above logic.
>
> Reviewed-by: Aneesh Kumar K.V 
> Signed-off-by: Ram Pai 
> ---
>  arch/powerpc/include/asm/book3s/64/hash-4k.h  |2 +
>  arch/powerpc/include/asm/book3s/64/hash-64k.h |8 +--
>  arch/powerpc/include/asm/book3s/64/hash.h |1 -
>  arch/powerpc/mm/hash64_64k.c  |   74 
> -
>  arch/powerpc/mm/hash_utils_64.c   |4 +-
>  5 files changed, 55 insertions(+), 34 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
> b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> index 0c4e470..f959c00 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> @@ -16,6 +16,8 @@
>  #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE)
>  #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE)
>
> +#define H_PAGE_BUSY  _RPAGE_RSV1 /* software: PTE & hash are busy */
> +
>  /* PTE flags to conserve for HPTE identification */
>  #define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | \
>H_PAGE_F_SECOND | H_PAGE_F_GIX)
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
> b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> index 9732837..62e580c 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> +++ b/

Re: [PATCH 4/4] of/fdt: only store the device node basename in full_name

2017-07-26 Thread Michael Ellerman
Rob Herring  writes:

> With dependencies on a statically allocated full path name converted to
> use %pOF format specifier, we can store just the basename of node, and
> the unflattening of the FDT can be simplified.
>
> This commit will affect the remaining users of full_name. After
> analyzing these users, the remaining cases should only change some print
> messages. The main users of full_name are providing a name for struct
> resource. The resource names shouldn't be important other than providing
> /proc/iomem names.
>
> We no longer distinguish between pre and post 0x10 dtb formats as either
> a full path or basename will work. However, less than 0x10 formats have
> been broken since the conversion to use libfdt (and no one has cared).

For the record - yes we did care. It broke booting with old versions of
kexec, and it was a royal P.I.T.# to debug :D

cheers


RE: [PATCH 3/4] powerpc: pseries: only store the device node basename in full_name

2017-07-26 Thread David Laight
From: Rob Herring
> Sent: 25 July 2017 22:44
> With dependencies on full_name containing the entire device node path
> removed, stop storing the full_name in nodes created by
> dlpar_configure_connector() and pSeries_reconfig_add_node().
...
>   dn = kzalloc(sizeof(*dn), GFP_KERNEL);
>   if (!dn)
>   return NULL;
> 
>   name = (char *)ccwa + be32_to_cpu(ccwa->name_offset);
> - dn->full_name = kasprintf(GFP_KERNEL, "%s/%s", path, name);
> + dn->full_name = kasprintf(GFP_KERNEL, "%s", name);

Isn't this just strdup()?
Perhaps you should rename full_name to name - since it is no longer 'full'?

Maybe you could do a single malloc() for both 'dn' and the name?

David



Re: [PATCH v3 4/5] powerpc/lib/sstep: Add prty instruction emulation

2017-07-26 Thread Michael Ellerman
Segher Boessenkool  writes:

> On Tue, Jul 25, 2017 at 01:33:19PM +1000, Matt Brown wrote:
>> +static nokprobe_inline void do_prty(struct pt_regs *regs, unsigned long v,
>> +int size, int ra)
>> +{
>> +unsigned long long res = v;
>> +
>> +res = (0x0001000100010001 & res) + (0x0001000100010001 & (res >> 8));
>> +res = (0x00070007 & res) + (0x00070007 & (res >> 16));
>> +if (size == 32) {   /* prtyw */
>> +regs->gpr[ra] = (0x00010001 & res);
>> +return;
>> +}
>> +
>> +res = (0x000f & res) + (0x000f & (res >> 32));
>> +regs->gpr[ra] = res & 1;/*prtyd */
>> +}
>
> Does 7's and 0xf look strange (since the top bit in the values there
> is always 0).  You always "and" the values with 1 later, you could do
> that immediately (and change + to ^).
>
> A general question about these patches: some things are inside #ifdef
> __powerpc64__, some are not.  It seems it is the wrong macro, and it
> should be used (or not used) consistently?

Why is it the wrong macro? Because we tend to use CONFIG_PPC64 you mean?

I thought the reason some are #ifdef'ed is that some are 64-bit only.
ie. bpermd is 64-bit only ?

cheers


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Wed, 26 Jul 2017 09:16:23 +0100
Jonathan Cameron  wrote:

> On Tue, 25 Jul 2017 21:12:17 -0700
> "Paul E. McKenney"  wrote:
> 
> > On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:  
> > > From: "Paul E. McKenney" 
> > > Date: Tue, 25 Jul 2017 20:55:45 -0700
> > > 
> > > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote:
> > > >> Just to report, turning softlockup back on fixes things for me on
> > > >> sparc64 too.
> > > > 
> > > > Very good!
> > > > 
> > > >> The thing about softlockup is it runs an hrtimer, which seems to run
> > > >> about every 4 seconds.
> > > > 
> > > > I could see where that could shake things loose, but I am surprised that
> > > > it would be needed.  I ran a short run with CONFIG_SOFTLOCKUP_DETECTOR=y
> > > > with no trouble, but I will be running a longer test later on.
> > > > 
> > > >> So I wonder if this is a NO_HZ problem.
> > > > 
> > > > Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  What are
> > > > you running?  (Again, my symptoms are slightly different, so I might
> > > > be seeing a different bug.)
> > > 
> > > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.
> > > 
> > > To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR disabled.
> > 
> > Same here -- but my failure case happens fairly rarely, so it will take
> > some time to gain reasonable confidence that enabling SOFTLOCKUP_DETECTOR
> > had effect.
> > 
> > But you are right, might be interesting to try NO_HZ_PERIODIC=y
> > or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)
> > 
> > Thanx, Paul
> >   
> I'll be the headless chicken running around and trying as many tests
> as I can fit in.  Typical time to see the failure for us is sub 10
> minutes so we'll see how far we get.
> 
> Make me a list to run if you like ;)
> 
> NO_HZ_PERIODIC=y running now.
By which I mean CONFIG_HZ_PERIODIC=y

Anyhow, run for 40 minutes with out seeing a splat but my sanity check
on the NO_FULL_HZ=n and NO_HZ_IDLE=y this morning took 20 minutes so
I won't have much confidence until we are a few hours in on this.

Anyhow, certainly looking like a promising direction for investigation!

Jonathan

> 
> Jonathan
> 
> ___
> linuxarm mailing list
> linux...@huawei.com
> http://rnd-openeuler.huawei.com/mailman/listinfo/linuxarm




Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Tue, 25 Jul 2017 21:12:17 -0700
"Paul E. McKenney"  wrote:

> On Tue, Jul 25, 2017 at 09:02:33PM -0700, David Miller wrote:
> > From: "Paul E. McKenney" 
> > Date: Tue, 25 Jul 2017 20:55:45 -0700
> >   
> > > On Tue, Jul 25, 2017 at 02:10:29PM -0700, David Miller wrote:  
> > >> Just to report, turning softlockup back on fixes things for me on
> > >> sparc64 too.  
> > > 
> > > Very good!
> > >   
> > >> The thing about softlockup is it runs an hrtimer, which seems to run
> > >> about every 4 seconds.  
> > > 
> > > I could see where that could shake things loose, but I am surprised that
> > > it would be needed.  I ran a short run with CONFIG_SOFTLOCKUP_DETECTOR=y
> > > with no trouble, but I will be running a longer test later on.
> > >   
> > >> So I wonder if this is a NO_HZ problem.  
> > > 
> > > Might be.  My tests run with NO_HZ_FULL=n and NO_HZ_IDLE=y.  What are
> > > you running?  (Again, my symptoms are slightly different, so I might
> > > be seeing a different bug.)  
> > 
> > I run with NO_HZ_FULL=n and NO_HZ_IDLE=y, just like you.
> > 
> > To clarify, the symptoms show up with SOFTLOCKUP_DETECTOR disabled.  
> 
> Same here -- but my failure case happens fairly rarely, so it will take
> some time to gain reasonable confidence that enabling SOFTLOCKUP_DETECTOR
> had effect.
> 
> But you are right, might be interesting to try NO_HZ_PERIODIC=y
> or NO_HZ_FULL=y.  So many possible tests, and so little time.  ;-)
> 
>   Thanx, Paul
> 
I'll be the headless chicken running around and trying as many tests
as I can fit in.  Typical time to see the failure for us is sub 10
minutes so we'll see how far we get.

Make me a list to run if you like ;)

NO_HZ_PERIODIC=y running now.

Jonathan



Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-07-26 Thread Jonathan Cameron
On Tue, 25 Jul 2017 20:53:06 -0700
"Paul E. McKenney"  wrote:

> On Wed, Jul 26, 2017 at 12:52:07AM +0800, Jonathan Cameron wrote:
> > On Tue, 25 Jul 2017 08:12:45 -0700
> > "Paul E. McKenney"  wrote:
> >   
> > > On Tue, Jul 25, 2017 at 10:42:45PM +0800, Jonathan Cameron wrote:  
> > > > On Tue, 25 Jul 2017 06:46:26 -0700
> > > > "Paul E. McKenney"  wrote:
> > > > 
> > > > > On Tue, Jul 25, 2017 at 10:26:54PM +1000, Nicholas Piggin wrote:
> > > > > > On Tue, 25 Jul 2017 19:32:10 +0800
> > > > > > Jonathan Cameron  wrote:
> > > > > >   
> > > > > > > Hi All,
> > > > > > > 
> > > > > > > We observed a regression on our d05 boards (but curiously not
> > > > > > > the fairly similar but single socket / smaller core count
> > > > > > > d03), initially seen with linux-next prior to the merge window
> > > > > > > and still present in v4.13-rc2.
> > > > > > > 
> > > > > > > The symptom is:  
> > > > > 
> > > > > Adding Dave Miller and the sparcli...@vger.kernel.org email on CC, as
> > > > > they have been seeing something similar, and you might well have saved
> > > > > them the trouble of bisecting.
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > > [ 1984.628602] rcu_preempt kthread starved for 5663 jiffies! 
> > > > > > > g1566 c1565 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1  
> > > > > 
> > > > > This is the cause from an RCU perspective.  You had a lot of idle 
> > > > > CPUs,
> > > > > and RCU is not permitted to disturb them -- the battery-powered 
> > > > > embedded
> > > > > guys get very annoyed by that sort of thing.  What happens instead is
> > > > > that each CPU updates a per-CPU state variable when entering or 
> > > > > exiting
> > > > > idle, and the grace-period kthread ("rcu_preempt kthread" in the above
> > > > > message) checks these state variables, and if when sees an idle CPU,
> > > > > it reports a quiescent state on that CPU's behalf.
> > > > > 
> > > > > But the grace-period kthread can only do this work if it gets a chance
> > > > > to run.  And the message above says that this kthread hasn't had a 
> > > > > chance
> > > > > to run for a full 5,663 jiffies.  For completeness, the "g1566 c1565"
> > > > > says that grace period #1566 is in progress, the "f0x0" says that no 
> > > > > one
> > > > > is needing another grace period #1567.  The "RCU_GP_WAIT_FQS(3)" says
> > > > > that the grace-period kthread has fully initialized the current grace
> > > > > period and is sleeping for a few jiffies waiting to scan for idle 
> > > > > tasks.
> > > > > Finally, the "->state=0x1" says that the grace-period kthread is in
> > > > > TASK_INTERRUPTIBLE state, in other words, still sleeping.
> > > > 
> > > > Thanks for the explanation!
> > > > > 
> > > > > So my first question is "What did commit 05a4a9527 (kernel/watchdog:
> > > > > split up config options) do to prevent the grace-period kthread from
> > > > > getting a chance to run?" 
> > > > 
> > > > As far as we can tell it was a side effect of that patch.
> > > > 
> > > > The real cause is that patch changed the result of defconfigs to stop 
> > > > running
> > > > the softlockup detector - now CONFIG_SOFTLOCKUP_DETECTOR
> > > > 
> > > > Enabling that on 4.13-rc2 (and presumably everything in between)
> > > > means we don't see the problem any more.
> > > > 
> > > > > I must confess that I don't see anything
> > > > > obvious in that commit, so my second question is "Are we sure that
> > > > > reverting this commit makes the problem go away?"
> > > > 
> > > > Simply enabling CONFIG_SOFTLOCKUP_DETECTOR seems to make it go away.
> > > > That detector fires up a thread on every cpu, which may be relevant.
> > > 
> > > Interesting...  Why should it be necessary to fire up a thread on every
> > > CPU in order to make sure that RCU's grace-period kthreads get some
> > > CPU time?  Especially give how many idle CPUs you had on your system.
> > > 
> > > So I have to ask if there is some other bug that the softlockup detector
> > > is masking.  
> > I am thinking the same.  We can try going back further than 4.12 tomorrow
> > (we think we can realistically go back to 4.8 and possibly 4.6
> > with this board)  
> 
> Looking forward to seeing results!
> 
> > > > > and my third is "Is
> > > > > this an intermittent problem that led to a false bisection?"
> > > > 
> > > > Whilst it is a bit slow to occur, we verified with long runs on either
> > > > side of that patch and since with the option enabled on latest mainline.
> > > > 
> > > > Also can cause the issue before that patch by disabling the previous
> > > > relevant option on 4.12.
> > > 
> > > OK, thank you -- hard to argue with that!  ;-)  
> > 
> > We thought it was a pretty unlikely a bisection result
> > hence hammered it thoroughly ;)  
> 
> Glad that I am not the only paranoid one out here.  ;-)
> 
> > > Except that I am still puzzled as to why per-CPU softlockup threads
> > > are needed for RCU's kthreads to get their wakeups.  We really

Re: [PATCH v3 4/5] powerpc/lib/sstep: Add prty instruction emulation

2017-07-26 Thread Gabriel Paubert
On Tue, Jul 25, 2017 at 06:08:05PM +1000, Balbir Singh wrote:
> On Tue, 2017-07-25 at 13:33 +1000, Matt Brown wrote:
> > This adds emulation for the prtyw and prtyd instructions.
> > Tested for logical correctness against the prtyw and prtyd instructions
> > on ppc64le.
> > 
> > Signed-off-by: Matt Brown 
> > ---
> > v3:
> > - optimised using the Giles-Miller method of side-ways addition
> > v2:
> > - fixed opcodes
> > - fixed bitshifting and typecast errors
> > - merged do_prtyw and do_prtyd into single function
> > ---
> >  arch/powerpc/lib/sstep.c | 27 +++
> >  1 file changed, 27 insertions(+)
> > 
> > diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> > index 6a79618..0bcf631 100644
> > --- a/arch/powerpc/lib/sstep.c
> > +++ b/arch/powerpc/lib/sstep.c
> > @@ -655,6 +655,25 @@ static nokprobe_inline void do_bpermd(struct pt_regs 
> > *regs, unsigned long v1,
> > regs->gpr[ra] = perm;
> >  }
> >  #endif /* __powerpc64__ */
> > +/*
> > + * The size parameter adjusts the equivalent prty instruction.
> > + * prtyw = 32, prtyd = 64
> > + */
> > +static nokprobe_inline void do_prty(struct pt_regs *regs, unsigned long v,
> > +   int size, int ra)
> > +{
> > +   unsigned long long res = v;
> > +
> > +   res = (0x0001000100010001 & res) + (0x0001000100010001 & (res >> 8));
> > +   res = (0x00070007 & res) + (0x00070007 & (res >> 16));
> > +   if (size == 32) {   /* prtyw */
> > +   regs->gpr[ra] = (0x00010001 & res);
> > +   return;
> > +   }
> > +
> > +   res = (0x000f & res) + (0x000f & (res >> 32));
> > +   regs->gpr[ra] = res & 1;/*prtyd */
> 
> Looks good, you can also xor instead of adding, but the masks would need
> to change in that case.

Actually, I'd prefer to use xor for this case, since you hardly ever
need any mask, except at the end. Most masks are only needed because of
carry propagation, which the xor completely avoid.

In other words:

unsigned long long res = v ^ (v >> 8);
res ^= res >> 16; 
if (size == 32) {
regs->gpr[ra] = res & 0x00010001;
return;
}
res ^= res >> 32;
regs-> gpr[ra] = res & 1;

should work, but I've not even compiled it.

Gabriel


Re: [PATCH v3 2/5] powerpc/lib/sstep: Add popcnt instruction emulation

2017-07-26 Thread Gabriel Paubert
On Tue, Jul 25, 2017 at 01:33:17PM +1000, Matt Brown wrote:
> This adds emulations for the popcntb, popcntw, and popcntd instructions.
> Tested for correctness against the popcnt{b,w,d} instructions on ppc64le.
> 
> Signed-off-by: Matt Brown 
> ---
> v3:
>   - optimised using the Giles-Miller method of side-ways addition
> v2:
>   - fixed opcodes
>   - fixed typecasting
>   - fixed bitshifting error for both 32 and 64bit arch
> ---
>  arch/powerpc/lib/sstep.c | 40 +++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> index 87d277f..c1f9cdb 100644
> --- a/arch/powerpc/lib/sstep.c
> +++ b/arch/powerpc/lib/sstep.c
> @@ -612,6 +612,32 @@ static nokprobe_inline void do_cmpb(struct pt_regs 
> *regs, unsigned long v1,
>   regs->gpr[rd] = out_val;
>  }
>  
> +/*
> + * The size parameter is used to adjust the equivalent popcnt instruction.
> + * popcntb = 8, popcntw = 32, popcntd = 64
> + */
> +static nokprobe_inline void do_popcnt(struct pt_regs *regs, unsigned long v1,
> + int size, int ra)
> +{
> + unsigned long long out = v1;
> +
> + out = (0x & out) + (0x & (out >> 1));

This can be simplified in a less obvious way as:
out -= (out >> 1) & 0x;

which maps each pair of bits according to the following:
00 -> 00
01 -> 01
10 -> 01
11 -> 10

This should save one instruction.

> + out = (0x & out) + (0x & (out >> 2));

Ok, but now each nibble is between 0 and 4, so the addition of two
nibbles can't overflow or generate carry into the higher one.

> + out = (0x0f0f0f0f0f0f0f0f & out) + (0x0f0f0f0f0f0f0f0f & (out >> 4));

out += out >> 4;
out &= 0x0f0f0f0f0f0f0f0f;

which should also save one instruction

> + if (size == 8) {/* popcntb */
> + regs->gpr[ra] = out;
> + return;
> + }

At this point each count occupies at least one byte and can no more
overflow, so masking is only needed before returning.

> + out = (0x001f001f001f001f & out) + (0x001f001f001f001f & (out >> 8));
out += out >> 8;

> + out = (0x003f003f & out) + (0x003f003f & (out >> 16));

out += out >> 16;

> + if (size == 32) {   /* popcntw */
> + regs->gpr[ra] = out;
regs->gpr[ra] = out & 0x003f003f;

> + return;
> + }
> + out = (0x007f & out) + (0x007f & (out >> 32));
out = (out + (out >> 32)) & 0x7f;


Gabriel

> + regs->gpr[ra] = out;/* popcntd */
> +}
> +
>  static nokprobe_inline int trap_compare(long v1, long v2)
>  {
>   int ret = 0;
> @@ -1194,6 +1220,10 @@ int analyse_instr(struct instruction_op *op, struct 
> pt_regs *regs,
>   regs->gpr[ra] = regs->gpr[rd] & ~regs->gpr[rb];
>   goto logical_done;
>  
> + case 122:   /* popcntb */
> + do_popcnt(regs, regs->gpr[rd], 8, ra);
> + goto logical_done;
> +
>   case 124:   /* nor */
>   regs->gpr[ra] = ~(regs->gpr[rd] | regs->gpr[rb]);
>   goto logical_done;
> @@ -1206,6 +1236,10 @@ int analyse_instr(struct instruction_op *op, struct 
> pt_regs *regs,
>   regs->gpr[ra] = regs->gpr[rd] ^ regs->gpr[rb];
>   goto logical_done;
>  
> + case 378:   /* popcntw */
> + do_popcnt(regs, regs->gpr[rd], 32, ra);
> + goto logical_done;
> +
>   case 412:   /* orc */
>   regs->gpr[ra] = regs->gpr[rd] | ~regs->gpr[rb];
>   goto logical_done;
> @@ -1217,7 +1251,11 @@ int analyse_instr(struct instruction_op *op, struct 
> pt_regs *regs,
>   case 476:   /* nand */
>   regs->gpr[ra] = ~(regs->gpr[rd] & regs->gpr[rb]);
>   goto logical_done;
> -
> +#ifdef __powerpc64__
> + case 506:   /* popcntd */
> + do_popcnt(regs, regs->gpr[rd], 64, ra);
> + goto logical_done;
> +#endif
>   case 922:   /* extsh */
>   regs->gpr[ra] = (signed short) regs->gpr[rd];
>   goto logical_done;
> -- 
> 2.9.3