[PATCH 2/2] powerpc: Remove STRICT_KERNEL_RWX incompatibility with RELOCATABLE

2019-12-23 Thread Russell Currey
I have tested this with the Radix MMU and everything seems to work, and
the previous patch for Hash seems to fix everything too.
STRICT_KERNEL_RWX should still be disabled by default for now.

Please test STRICT_KERNEL_RWX + RELOCATABLE!

Signed-off-by: Russell Currey 
---
 arch/powerpc/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1ec34e16ed65..6093c48976bf 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -133,7 +133,7 @@ config PPC
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
&& PPC_BOOK3S_64
-   select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
!RELOCATABLE && !HIBERNATION)
+   select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
!HIBERNATION)
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UACCESS_FLUSHCACHE
select ARCH_HAS_UACCESS_MCSAFE  if PPC64
-- 
2.24.1



[PATCH 1/2] powerpc/book3s64/hash: Disable 16M linear mapping size if not aligned

2019-12-23 Thread Russell Currey
With STRICT_KERNEL_RWX on in a relocatable kernel under the hash MMU, if
the position the kernel is loaded at is not 16M aligned, the kernel
miscalculates its ALIGN*()s and things go horribly wrong.

We can easily avoid this when selecting the linear mapping size, so do
so and print a warning.  I tested this for various alignments and as
long as the position is 64K aligned it's fine (the base requirement for
powerpc).

Signed-off-by: Russell Currey 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index b30435c7d804..523d4d39d11e 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -652,6 +652,7 @@ static void init_hpte_page_sizes(void)
 
 static void __init htab_init_page_sizes(void)
 {
+   bool aligned = true;
init_hpte_page_sizes();
 
if (!debug_pagealloc_enabled()) {
@@ -659,7 +660,15 @@ static void __init htab_init_page_sizes(void)
 * Pick a size for the linear mapping. Currently, we only
 * support 16M, 1M and 4K which is the default
 */
-   if (mmu_psize_defs[MMU_PAGE_16M].shift)
+   if (IS_ENABLED(CONFIG_STRICT_KERNEL_RWX) &&
+   (unsigned long)_stext % 0x1000000) {
+   if (mmu_psize_defs[MMU_PAGE_16M].shift)
+   pr_warn("Kernel not 16M aligned, "
+   "disabling 16M linear map alignment");
+   aligned = false;
+   }
+
+   if (mmu_psize_defs[MMU_PAGE_16M].shift && aligned)
mmu_linear_psize = MMU_PAGE_16M;
else if (mmu_psize_defs[MMU_PAGE_1M].shift)
mmu_linear_psize = MMU_PAGE_1M;
-- 
2.24.1



[PATCH v6 0/5] Implement STRICT_MODULE_RWX for powerpc

2019-12-23 Thread Russell Currey
v5 cover letter: 
https://lore.kernel.org/kernel-hardening/20191030073111.140493-1-rus...@russell.cc/
v4 cover letter: 
https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198268.html
v3 cover letter: 
https://lists.ozlabs.org/pipermail/linuxppc-dev/2019-October/198023.html

Changes since v5:
[1/5]: Addressed review comments from Christophe Leroy (thanks!)
[2/5]: Use patch_instruction() instead of memcpy() thanks to mpe

Thanks for the feedback, hopefully this is the final iteration.  I have a patch
to remove the STRICT_KERNEL_RWX incompatibility with RELOCATABLE for book3s64
coming soon, so with that we should have a great basis for powerpc RWX going
forward.

Russell Currey (5):
  powerpc/mm: Implement set_memory() routines
  powerpc/kprobes: Mark newly allocated probes as RO
  powerpc/mm/ptdump: debugfs handler for W+X checks at runtime
  powerpc: Set ARCH_HAS_STRICT_MODULE_RWX
  powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig

 arch/powerpc/Kconfig   |  2 +
 arch/powerpc/Kconfig.debug |  6 +-
 arch/powerpc/configs/skiroot_defconfig |  1 +
 arch/powerpc/include/asm/set_memory.h  | 32 ++
 arch/powerpc/kernel/kprobes.c  |  6 +-
 arch/powerpc/mm/Makefile   |  1 +
 arch/powerpc/mm/pageattr.c | 83 ++
 arch/powerpc/mm/ptdump/ptdump.c| 21 ++-
 8 files changed, 147 insertions(+), 5 deletions(-)
 create mode 100644 arch/powerpc/include/asm/set_memory.h
 create mode 100644 arch/powerpc/mm/pageattr.c

-- 
2.24.1



[PATCH v6 5/5] powerpc/configs: Enable STRICT_MODULE_RWX in skiroot_defconfig

2019-12-23 Thread Russell Currey
skiroot_defconfig is the only powerpc defconfig with STRICT_KERNEL_RWX
enabled, and if you want memory protection for kernel text you'd want it
for modules too, so enable STRICT_MODULE_RWX there.

Acked-by: Joel Stanley 
Signed-off-by: Russell Currey 
---
 arch/powerpc/configs/skiroot_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/skiroot_defconfig 
b/arch/powerpc/configs/skiroot_defconfig
index 069f67f12731..b74358c3ede8 100644
--- a/arch/powerpc/configs/skiroot_defconfig
+++ b/arch/powerpc/configs/skiroot_defconfig
@@ -31,6 +31,7 @@ CONFIG_PERF_EVENTS=y
 CONFIG_SLAB_FREELIST_HARDENED=y
 CONFIG_JUMP_LABEL=y
 CONFIG_STRICT_KERNEL_RWX=y
+CONFIG_STRICT_MODULE_RWX=y
 CONFIG_MODULES=y
 CONFIG_MODULE_UNLOAD=y
 CONFIG_MODULE_SIG=y
-- 
2.24.1



[PATCH v6 4/5] powerpc: Set ARCH_HAS_STRICT_MODULE_RWX

2019-12-23 Thread Russell Currey
To enable strict module RWX on powerpc, set:

CONFIG_STRICT_MODULE_RWX=y

You should also have CONFIG_STRICT_KERNEL_RWX=y set to have any real
security benefit.

ARCH_HAS_STRICT_MODULE_RWX is set to require ARCH_HAS_STRICT_KERNEL_RWX.
This is due to a quirk in arch/Kconfig and arch/powerpc/Kconfig that
makes STRICT_MODULE_RWX *on by default* in configurations where
STRICT_KERNEL_RWX is *unavailable*.

Since that default doesn't make much sense, and module RWX without kernel
RWX isn't useful anyway, having the same dependencies as kernel RWX
works around this problem.

Signed-off-by: Russell Currey 
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f0b9b47b5353..97ea012fdff9 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -135,6 +135,7 @@ config PPC
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
&& PPC_BOOK3S_64
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
!RELOCATABLE && !HIBERNATION)
+   select ARCH_HAS_STRICT_MODULE_RWX   if ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UACCESS_FLUSHCACHE
select ARCH_HAS_UACCESS_MCSAFE  if PPC64
-- 
2.24.1



[PATCH v6 3/5] powerpc/mm/ptdump: debugfs handler for W+X checks at runtime

2019-12-23 Thread Russell Currey
Very rudimentary, just

echo 1 > [debugfs]/check_wx_pages

and check the kernel log.  Useful for testing strict module RWX.

Updated the Kconfig entry to reflect this.

Also fixed a typo.

Signed-off-by: Russell Currey 
---
 arch/powerpc/Kconfig.debug  |  6 --
 arch/powerpc/mm/ptdump/ptdump.c | 21 -
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 4e1d39847462..7c14c9728bc0 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -370,7 +370,7 @@ config PPC_PTDUMP
  If you are unsure, say N.
 
 config PPC_DEBUG_WX
-   bool "Warn on W+X mappings at boot"
+   bool "Warn on W+X mappings at boot & enable manual checks at runtime"
depends on PPC_PTDUMP
help
  Generate a warning if any W+X mappings are found at boot.
@@ -384,7 +384,9 @@ config PPC_DEBUG_WX
  of other unfixed kernel bugs easier.
 
  There is no runtime or memory usage effect of this option
- once the kernel has booted up - it's a one time check.
+ once the kernel has booted up, it only automatically checks once.
+
+ Enables the "check_wx_pages" debugfs entry for checking at runtime.
 
  If in doubt, say "Y".
 
diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c
index 2f9ddc29c535..b6cba29ae4a0 100644
--- a/arch/powerpc/mm/ptdump/ptdump.c
+++ b/arch/powerpc/mm/ptdump/ptdump.c
@@ -4,7 +4,7 @@
  *
  * This traverses the kernel pagetables and dumps the
  * information about the used sections of memory to
- * /sys/kernel/debug/kernel_pagetables.
+ * /sys/kernel/debug/kernel_page_tables.
  *
  * Derived from the arm64 implementation:
  * Copyright (c) 2014, The Linux Foundation, Laura Abbott.
@@ -409,6 +409,25 @@ void ptdump_check_wx(void)
else
pr_info("Checked W+X mappings: passed, no W+X pages found\n");
 }
+
+static int check_wx_debugfs_set(void *data, u64 val)
+{
+   if (val != 1ULL)
+   return -EINVAL;
+
+   ptdump_check_wx();
+
+   return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(check_wx_fops, NULL, check_wx_debugfs_set, "%llu\n");
+
+static int ptdump_check_wx_init(void)
+{
+   return debugfs_create_file("check_wx_pages", 0200, NULL,
+  NULL, &check_wx_fops) ? 0 : -ENOMEM;
+}
+device_initcall(ptdump_check_wx_init);
 #endif
 
 static int ptdump_init(void)
-- 
2.24.1



[PATCH v6 1/5] powerpc/mm: Implement set_memory() routines

2019-12-23 Thread Russell Currey
The set_memory_{ro/rw/nx/x}() functions are required for STRICT_MODULE_RWX,
and are generally useful primitives to have.  This implementation is
designed to be completely generic across powerpc's many MMUs.

It's possible that this could be optimised to be faster for specific
MMUs, but the focus is on having a generic and safe implementation for
now.

This implementation does not handle cases where the caller is attempting
to change the mapping of the page it is executing from, or if another
CPU is concurrently using the page being altered.  These cases likely
shouldn't happen, but a more complex implementation with MMU-specific code
could safely handle them, so that is left as a TODO for now.
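
A minimal usage sketch (hypothetical caller, not part of this patch),
making one freshly written page read-only and executable:

	#include <asm/set_memory.h>

	static void example_protect_page(void *addr)	/* page-aligned kernel address */
	{
		set_memory_ro((unsigned long)addr, 1);	/* 1 page */
		set_memory_x((unsigned long)addr, 1);
	}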

Signed-off-by: Russell Currey 
---
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/set_memory.h | 32 +++
 arch/powerpc/mm/Makefile  |  1 +
 arch/powerpc/mm/pageattr.c| 83 +++
 4 files changed, 117 insertions(+)
 create mode 100644 arch/powerpc/include/asm/set_memory.h
 create mode 100644 arch/powerpc/mm/pageattr.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1ec34e16ed65..f0b9b47b5353 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -133,6 +133,7 @@ config PPC
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
&& PPC_BOOK3S_64
+   select ARCH_HAS_SET_MEMORY
select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
!RELOCATABLE && !HIBERNATION)
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
select ARCH_HAS_UACCESS_FLUSHCACHE
diff --git a/arch/powerpc/include/asm/set_memory.h 
b/arch/powerpc/include/asm/set_memory.h
new file mode 100644
index ..5230ddb2fefd
--- /dev/null
+++ b/arch/powerpc/include/asm/set_memory.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_SET_MEMORY_H
+#define _ASM_POWERPC_SET_MEMORY_H
+
+#define SET_MEMORY_RO  1
+#define SET_MEMORY_RW  2
+#define SET_MEMORY_NX  3
+#define SET_MEMORY_X   4
+
+int change_memory_attr(unsigned long addr, int numpages, int action);
+
+static inline int set_memory_ro(unsigned long addr, int numpages)
+{
+   return change_memory_attr(addr, numpages, SET_MEMORY_RO);
+}
+
+static inline int set_memory_rw(unsigned long addr, int numpages)
+{
+   return change_memory_attr(addr, numpages, SET_MEMORY_RW);
+}
+
+static inline int set_memory_nx(unsigned long addr, int numpages)
+{
+   return change_memory_attr(addr, numpages, SET_MEMORY_NX);
+}
+
+static inline int set_memory_x(unsigned long addr, int numpages)
+{
+   return change_memory_attr(addr, numpages, SET_MEMORY_X);
+}
+
+#endif
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 5e147986400d..d0a0bcbc9289 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -20,3 +20,4 @@ obj-$(CONFIG_HIGHMEM) += highmem.o
 obj-$(CONFIG_PPC_COPRO_BASE)   += copro_fault.o
 obj-$(CONFIG_PPC_PTDUMP)   += ptdump/
 obj-$(CONFIG_KASAN)+= kasan/
+obj-$(CONFIG_ARCH_HAS_SET_MEMORY) += pageattr.o
diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
new file mode 100644
index ..15d5fb04f531
--- /dev/null
+++ b/arch/powerpc/mm/pageattr.c
@@ -0,0 +1,83 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * MMU-generic set_memory implementation for powerpc
+ *
+ * Copyright 2019, IBM Corporation.
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+
+/*
+ * Updates the attributes of a page in three steps:
+ *
+ * 1. invalidate the page table entry
+ * 2. flush the TLB
+ * 3. install the new entry with the updated attributes
+ *
+ * This is unsafe if the caller is attempting to change the mapping of the
+ * page it is executing from, or if another CPU is concurrently using the
+ * page being altered.
+ *
+ * TODO make the implementation resistant to this.
+ */
+static int __change_page_attr(pte_t *ptep, unsigned long addr, void *data)
+{
+   int action = *((int *)data);
+   pte_t pte_val;
+
+   // invalidate the PTE so it's safe to modify
+   pte_val = ptep_get_and_clear(&init_mm, addr, ptep);
+   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+   // modify the PTE bits as desired, then apply
+   switch (action) {
+   case SET_MEMORY_RO:
+   pte_val = pte_wrprotect(pte_val);
+   break;
+   case SET_MEMORY_RW:
+   pte_val = pte_mkwrite(pte_val);
+   break;
+   case SET_MEMORY_NX:
+   pte_val = pte_exprotect(pte_val);
+   break;
+   case SET_MEMORY_X:
+   pte_val = pte_mkexec(pte_val);
+   break;
+   default:
+   WARN_ON(true);
+   return -EINVAL;
+   }
+
+   set_pte_at(&init_mm, addr, ptep, pte_val);
+
+   return 0;
+}
+
+static int 

[PATCH v6 2/5] powerpc/kprobes: Mark newly allocated probes as RO

2019-12-23 Thread Russell Currey
With CONFIG_STRICT_KERNEL_RWX=y and CONFIG_KPROBES=y, there will be one
W+X page at boot by default.  This can be tested with
CONFIG_PPC_PTDUMP=y and CONFIG_PPC_DEBUG_WX=y set, and checking the
kernel log during boot.

powerpc doesn't implement its own alloc() for kprobes like other
architectures do, but we couldn't immediately mark RO anyway since we do
a memcpy to the page we allocate later.  After that, nothing should be
allowed to modify the page, and write permissions are removed well
before the kprobe is armed.

The memcpy() would fail once the page is marked RO and a second probe is
allocated from it, so use patch_instruction() instead, which is safe for
RO pages.

Reviewed-by: Daniel Axtens 
Signed-off-by: Russell Currey 
---
 arch/powerpc/kernel/kprobes.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 2d27ec4feee4..b72761f0c9e3 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 DEFINE_PER_CPU(struct kprobe *, current_kprobe) = NULL;
 DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
@@ -124,13 +125,14 @@ int arch_prepare_kprobe(struct kprobe *p)
}
 
if (!ret) {
-   memcpy(p->ainsn.insn, p->addr,
-   MAX_INSN_SIZE * sizeof(kprobe_opcode_t));
+   patch_instruction(p->ainsn.insn, *p->addr);
p->opcode = *p->addr;
flush_icache_range((unsigned long)p->ainsn.insn,
(unsigned long)p->ainsn.insn + sizeof(kprobe_opcode_t));
}
 
+   set_memory_ro((unsigned long)p->ainsn.insn, 1);
+
p->ainsn.boostable = 0;
return ret;
 }
-- 
2.24.1



[PATCH V11 RESEND] mm/debug: Add tests validating architecture page table helpers

2019-12-23 Thread Anshuman Khandual
This adds tests which will validate architecture page table helpers and
other accessors in their compliance with expected generic MM semantics.
This will help various architectures in validating changes to existing
page table helpers or addition of new ones.

This test covers basic page table entry transformations including but not
limited to old, young, dirty, clean, write, write protect etc. at various
levels, along with populating intermediate entries with the next page table
page and validating them.
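
As a rough sketch of the kind of PTE-level check this performs (the helper
names are the standard page table accessors; the real code covers many more
cases):

	#include <linux/mm.h>

	static void __init pte_basic_tests_sketch(unsigned long pfn, pgprot_t prot)
	{
		pte_t pte = pfn_pte(pfn, prot);

		WARN_ON(!pte_same(pte, pte));
		WARN_ON(!pte_young(pte_mkyoung(pte)));
		WARN_ON(!pte_dirty(pte_mkdirty(pte)));
		WARN_ON(!pte_write(pte_mkwrite(pte)));
		WARN_ON(pte_young(pte_mkold(pte)));
		WARN_ON(pte_dirty(pte_mkclean(pte)));
		WARN_ON(pte_write(pte_wrprotect(pte)));
	}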

Test page table pages are allocated from system memory with required size
and alignments. The mapped pfns at page table levels are derived from a
real pfn representing a valid kernel text symbol. This test gets called
right after page_alloc_init_late().

This gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
CONFIG_DEBUG_VM. Architectures willing to subscribe to this test also need to
select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE, which for now is limited to x86 and
arm64. Going forward, other architectures too can enable this after fixing
build or runtime problems (if any) with their page table helpers.

Folks interested in making sure that a given platform's page table helpers
conform to expected generic MM semantics should enable the above config
which will just trigger this test during boot. Any non-conformity here will
be reported as a warning, which would need to be fixed. This test will help
catch any changes to the agreed-upon semantics expected from generic MM and
enable platforms to accommodate them thereafter.

Cc: Andrew Morton 
Cc: Vlastimil Babka 
Cc: Greg Kroah-Hartman 
Cc: Thomas Gleixner 
Cc: Mike Rapoport 
Cc: Jason Gunthorpe 
Cc: Dan Williams 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Mark Rutland 
Cc: Mark Brown 
Cc: Steven Price 
Cc: Ard Biesheuvel 
Cc: Masahiro Yamada 
Cc: Kees Cook 
Cc: Tetsuo Handa 
Cc: Matthew Wilcox 
Cc: Sri Krishna chowdary 
Cc: Dave Hansen 
Cc: Russell King - ARM Linux 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Martin Schwidefsky 
Cc: Heiko Carstens 
Cc: "David S. Miller" 
Cc: Vineet Gupta 
Cc: James Hogan 
Cc: Paul Burton 
Cc: Ralf Baechle 
Cc: Kirill A. Shutemov 
Cc: Gerald Schaefer 
Cc: Christophe Leroy 
Cc: Ingo Molnar 
Cc: linux-snps-...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: x...@kernel.org
Cc: linux-ker...@vger.kernel.org

Tested-by: Christophe Leroy  #PPC32
Reviewed-by: Ingo Molnar 
Suggested-by: Catalin Marinas 
Signed-off-by: Andrew Morton 
Signed-off-by: Christophe Leroy 
Signed-off-by: Anshuman Khandual 
---
This adds a test validation for architecture exported page table helpers.
Patch adds basic transformation tests at various levels of the page table.

This test was originally suggested by Catalin during arm64 THP migration
RFC discussion earlier. Going forward it can include more specific tests
with respect to various generic MM functions like THP, HugeTLB etc and
platform specific tests.

https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/

Needs to be applied on linux 5.5-rc2

Changes in V11:

- Rebased the patch on 5.5-rc2

Changes in V10: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=205529)

- Always enable DEBUG_VM_PGTABLE when DEBUG_VM is enabled per Ingo
- Added tags from Ingo

Changes in V9: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=201429)

- Changed feature support enumeration for powerpc platforms per Christophe
- Changed config wrapper for basic_[pmd|pud]_tests() to enable ARC platform
- Enabled the test on ARC platform

Changes in V8: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=194297)

- Enabled ARCH_HAS_DEBUG_VM_PGTABLE on PPC32 platform per Christophe
- Updated feature documentation as DEBUG_VM_PGTABLE is now enabled on PPC32 
platform
- Moved ARCH_HAS_DEBUG_VM_PGTABLE earlier to indent it with DEBUG_VM per 
Christophe
- Added an information message in debug_vm_pgtable() per Christophe
- Dropped random_vaddr boundary condition checks per Christophe and Qian
- Replaced virt_addr_valid() check with pfn_valid() check in debug_vm_pgtable()
- Slightly changed pr_fmt(fmt) information

Changes in V7: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=193051)

- Memory allocation and free routines for mapped pages have been dropped
- Mapped pfns are derived from standard kernel text symbol per Matthew
- Moved debug_vm_pgtable() after page_alloc_init_late() per Michal and Qian 
- Updated the commit message per Michal
- Updated W=1 GCC warning problem on x86 per Qian Cai
- Addition of new alloc_contig_pages() helper has been submitted separately

Changes in V6: 
(https://patchwork.kernel.org/project/linux-mm/list/?series=187589)

- Moved alloc_gigantic_page_order() into mm/page_alloc.c per Michal
- Moved 

Re: [RFC PATCH v2 05/10] lib: vdso: inline do_hres()

2019-12-23 Thread Andy Lutomirski
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
 wrote:
>
> do_hres() is called from several places, so GCC doesn't inline
> it at first.
>
> do_hres() takes a struct __kernel_timespec * parameter for
> passing the result. In the 32-bit case, this parameter corresponds
> to a local var in the caller. In order to provide a pointer
> to this structure, the caller has to put it on its stack and
> do_hres() has to write the result to the stack. This is suboptimal,
> especially on a RISC processor like powerpc.
>
> By making GCC inline the function, the struct __kernel_timespec
> remains a local var using registers, avoiding the need to write and
> read stack.
>
> The improvement is significant on powerpc.

I'm okay with it, mainly because I don't expect many workloads to have
more than one copy of the code hot at the same time.


Re: [RFC PATCH v2 04/10] lib: vdso: get pointer to vdso data from the arch

2019-12-23 Thread Andy Lutomirski
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
 wrote:
>
> On powerpc, __arch_get_vdso_data() clobbers the link register,
> requiring the caller to set a stack frame in order to save it.
>
> As the parent function already has to set a stack frame and save
> the link register to call the C vdso function, retrieving the
> vdso data pointer there is lighter.

I'm confused.  Can't you inline __arch_get_vdso_data()?  Or is the
issue that you can't retrieve the program counter on power without
clobbering the link register?

I would imagine that this patch generates worse code on any
architecture with PC-relative addressing modes (which includes at
least x86_64, and I would guess includes most modern architectures).

--Andy


Re: [RFC PATCH v2 02/10] lib: vdso: move call to fallback out of common code.

2019-12-23 Thread Andy Lutomirski
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
 wrote:
>
> On powerpc, VDSO functions and syscalls cannot be implemented in C
> because the Linux kernel ABI requires that CR[SO] bit is set in case
> of error and cleared when no error.
>
> As this cannot be done in C, C VDSO functions and syscall'based
> fallback need a trampoline in ASM.
>
> By moving the fallback calls out of the common code, arches like
> powerpc can implement both the call to C VDSO and the fallback call
> in a single trampoline function.

Maybe the issue is that I'm not a powerpc person, but I don't
understand this.  The common vDSO code is in C.  Presumably this means
that you need an asm trampoline no matter what to call the C code.  Is
the improvement that, with this change, you can have the asm
trampoline do a single branch, so it's logically:

ret = [call the C code];
if (ret == 0) {
 set success bit;
} else {
 ret = fallback;
 if (ret == 0)
  set success bit;
else
  set failure bit;
}

return ret;

instead of:

ret = [call the C code, which includes the fallback];
if (ret == 0)
  set success bit;
else
  set failure bit;

It's not obvious to me that the former ought to be faster.

>
> The two advantages are:
> - No need play back and forth with CR[SO] and negative return value.
> - No stack frame is required in VDSO C functions for the fallbacks.

How is no stack frame required?  Do you mean that the presence of the
fallback causes worse code generation?  Can you improve the fallback
instead?


Re: [RFC PATCH v2 01/10] lib: vdso: ensure all arches have 32bit fallback

2019-12-23 Thread Andy Lutomirski
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
 wrote:
>
> In order to simplify next step which moves fallback call at arch
> level, ensure all arches have a 32bit fallback instead of handling
> the lack of 32bit fallback in the common code based
> on VDSO_HAS_32BIT_FALLBACK

I don't like this.  You've implemented what appear to be nonsensical
fallbacks (the 32-bit fallback for a 64-bit vDSO build?  There's no
such thing).

How exactly does this simplify patch 2?

--Andy


Re: [RFC PATCH v2 08/10] lib: vdso: Avoid duplication in __cvdso_clock_getres()

2019-12-23 Thread Andy Lutomirski
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
 wrote:
>
> VDSO_HRES and VDSO_RAW clocks are handled the same way.
>
> Don't duplicate code.
>
> Signed-off-by: Christophe Leroy 

Reviewed-by: Andy Lutomirski 


Re: [RFC PATCH v2 07/10] lib: vdso: don't use READ_ONCE() in __c_kernel_time()

2019-12-23 Thread Andy Lutomirski
On Mon, Dec 23, 2019 at 6:31 AM Christophe Leroy
 wrote:
>
> READ_ONCE() forces the read of the 64 bit value of
> vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec although
> only the lower part is needed.

Seems reasonable and very unlikely to be harmful.  That being said,
this function really ought to be considered deprecated -- 32-bit
time_t is insufficient.

Do you get even better code if you move the read into the if statement?

Reviewed-by: Andy Lutomirski 

--Andy


Re: [PATCH kernel v3] powerpc/book3s64: Fix error handling in mm_iommu_do_alloc()

2019-12-23 Thread Alexey Kardashevskiy



On 23/12/2019 22:18, Michael Ellerman wrote:
> Alexey Kardashevskiy  writes:
> 
>> The last jump to free_exit in mm_iommu_do_alloc() happens after page
>> pointers in struct mm_iommu_table_group_mem_t were already converted to
>> physical addresses. Thus calling put_page() on these physical addresses
>> will likely crash.
>>
>> This moves the loop which calculates the pageshift and converts page
>> struct pointers to physical addresses later after the point when
>> we cannot fail; thus eliminating the need to convert pointers back.
>>
>> Fixes: eb9d7a62c386 ("powerpc/mm_iommu: Fix potential deadlock")
>> Reported-by: Jan Kara 
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>> Changes:
>> v3:
>> * move pointers conversion after the last possible failure point
>> ---
>>  arch/powerpc/mm/book3s64/iommu_api.c | 39 +++-
>>  1 file changed, 21 insertions(+), 18 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
>> b/arch/powerpc/mm/book3s64/iommu_api.c
>> index 56cc84520577..ef164851738b 100644
>> --- a/arch/powerpc/mm/book3s64/iommu_api.c
>> +++ b/arch/powerpc/mm/book3s64/iommu_api.c
>> @@ -121,24 +121,6 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
>> unsigned long ua,
>>  goto free_exit;
>>  }
>>  
>> -pageshift = PAGE_SHIFT;
>> -for (i = 0; i < entries; ++i) {
>> -struct page *page = mem->hpages[i];
>> -
>> -/*
>> - * Allow to use larger than 64k IOMMU pages. Only do that
>> - * if we are backed by hugetlb.
>> - */
>> -if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page))
>> -pageshift = page_shift(compound_head(page));
>> -mem->pageshift = min(mem->pageshift, pageshift);
>> -/*
>> - * We don't need struct page reference any more, switch
>> - * to physical address.
>> - */
>> -mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
>> -}
>> -
>>  good_exit:
>>  atomic64_set(&mem->mapped, 1);
>>  mem->used = 1;
>> @@ -158,6 +140,27 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
>> unsigned long ua,
>>  }
>>  }
>>  
>> +if (mem->dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {
> 
> Couldn't you avoid testing this again ...
> 
>> +/*
>> + * Allow to use larger than 64k IOMMU pages. Only do that
>> + * if we are backed by hugetlb. Skip device memory as it is not
>> + * backed with page structs.
>> + */
>> +pageshift = PAGE_SHIFT;
>> +for (i = 0; i < entries; ++i) {
> 
> ... by making this loop up to `pinned`.
> 
> `pinned` is only incremented in the loop that does the GUP, and there's
> a check that pinned == entries after that loop.
> 
> So when we get here we know pinned == entries, and if pinned is zero
> it's because we took the (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) case at
> the start of the function to get here.
> 
> Or do you think that's too subtle to rely on?


I had 4 choices:

1. for (;i < pinned;)

2. if (dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) (dev_hpa is a function
parameter)

3. if (mem->dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)

4. if (mem->hpages)


The function is already ugly. 3) seemed the most obvious way of
telling what is going on here: "we have just initialized @mem and it is
not for device memory, let's finish the initialization".

I could rearrange the code even more but since there is no NVLink3
coming ever, I'd avoid changing it more than necessary. Thanks,


> 
> cheers
> 
>> +struct page *page = mem->hpages[i];
>> +
>> +if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page))
>> +pageshift = page_shift(compound_head(page));
>> +mem->pageshift = min(mem->pageshift, pageshift);
>> +/*
>> + * We don't need struct page reference any more, switch
>> + * to physical address.
>> + */
>> +mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
>> +}
>> +}
>> +
>>  list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
>>  
>>  mutex_unlock(&mem_list_mutex);
>> -- 
>> 2.17.1

-- 
Alexey


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-23 Thread Jason Gunthorpe
On Fri, Dec 20, 2019 at 04:32:13PM -0800, Dan Williams wrote:

> > > There's already a limit, it's just a much larger one. :) What does "no 
> > > limit"
> > > really mean, numerically, to you in this case?
> >
> > I guess I mean 'hidden limit' - hitting the limit and failing would
> > be managable.
> >
> > I think 7 is probably too low though, but we are not using 1GB huge
> > pages, only 2M..
> 
> What about RDMA to 1GB-hugetlbfs and 1GB-device-dax mappings?

I don't think the failing testing is doing that.

It is also less likely that 1GB regions will need multi-mapping, IMHO.

Jason


[RFC PATCH 8/8] powerpc/irq: drop softirq stack

2019-12-23 Thread Christophe Leroy
There are two IRQ stacks: softirq_ctx and hardirq_ctx

do_softirq_own_stack() switches stack to softirq_ctx
do_IRQ() switches stack to hardirq_ctx

However, when soft and hard IRQs are nested, only one of the two
stacks is used:
- When on softirq stack, do_IRQ() doesn't switch to hardirq stack.
- irq_exit() runs softirqs on hardirq stack.

There is no added value in having two IRQ stacks as only one is
used when hard and soft irqs are nested. Remove softirq_ctx and
use hardirq_ctx for both hard and soft IRQs.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/irq.h | 1 -
 arch/powerpc/kernel/irq.c  | 8 +++-
 arch/powerpc/kernel/process.c  | 4 
 arch/powerpc/kernel/setup_32.c | 4 +---
 arch/powerpc/kernel/setup_64.c | 4 +---
 5 files changed, 5 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index e4a92f0b4ad4..7cb2c76aa3ed 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -54,7 +54,6 @@ extern void *mcheckirq_ctx[NR_CPUS];
  * Per-cpu stacks for handling hard and soft interrupts.
  */
 extern void *hardirq_ctx[NR_CPUS];
-extern void *softirq_ctx[NR_CPUS];
 
 #ifdef CONFIG_PPC64
 void call_do_softirq(void *sp);
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index a1122ef4a16c..3af0d1897354 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -680,15 +680,14 @@ void __do_irq(struct pt_regs *regs)
 
 void do_IRQ(struct pt_regs *regs)
 {
-   void *cursp, *irqsp, *sirqsp;
+   void *cursp, *irqsp;
 
/* Switch to the irq stack to handle this */
cursp = (void *)(stack_pointer() & ~(THREAD_SIZE - 1));
irqsp = hardirq_ctx[raw_smp_processor_id()];
-   sirqsp = softirq_ctx[raw_smp_processor_id()];
 
/* Already there ? Otherwise switch stack and call */
-   if (unlikely(cursp == irqsp || cursp == sirqsp))
+   if (unlikely(cursp == irqsp))
__do_irq(regs);
else
call_do_irq(regs, irqsp);
@@ -706,12 +705,11 @@ void*dbgirq_ctx[NR_CPUS] __read_mostly;
 void *mcheckirq_ctx[NR_CPUS] __read_mostly;
 #endif
 
-void *softirq_ctx[NR_CPUS] __read_mostly;
 void *hardirq_ctx[NR_CPUS] __read_mostly;
 
 void do_softirq_own_stack(void)
 {
-   call_do_softirq(softirq_ctx[smp_processor_id()]);
+   call_do_softirq(hardirq_ctx[smp_processor_id()]);
 }
 
 irq_hw_number_t virq_to_hw(unsigned int virq)
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 49d0ebf28ab9..be3e64cf28b4 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1963,10 +1963,6 @@ static inline int valid_irq_stack(unsigned long sp, 
struct task_struct *p,
if (sp >= stack_page && sp <= stack_page + THREAD_SIZE - nbytes)
return 1;
 
-   stack_page = (unsigned long)softirq_ctx[cpu];
-   if (sp >= stack_page && sp <= stack_page + THREAD_SIZE - nbytes)
-   return 1;
-
return 0;
 }
 
diff --git a/arch/powerpc/kernel/setup_32.c b/arch/powerpc/kernel/setup_32.c
index dcffe927f5b9..8752aae06177 100644
--- a/arch/powerpc/kernel/setup_32.c
+++ b/arch/powerpc/kernel/setup_32.c
@@ -155,10 +155,8 @@ void __init irqstack_early_init(void)
 
/* interrupt stacks must be in lowmem, we get that for free on ppc32
 * as the memblock is limited to lowmem by default */
-   for_each_possible_cpu(i) {
-   softirq_ctx[i] = alloc_stack();
+   for_each_possible_cpu(i)
hardirq_ctx[i] = alloc_stack();
-   }
 }
 
 #if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 6104917a282d..96ee7627eda6 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -652,10 +652,8 @@ void __init irqstack_early_init(void)
 * cannot afford to take SLB misses on them. They are not
 * accessed in realmode.
 */
-   for_each_possible_cpu(i) {
-   softirq_ctx[i] = alloc_stack(limit, i);
+   for_each_possible_cpu(i)
hardirq_ctx[i] = alloc_stack(limit, i);
-   }
 }
 
 #ifdef CONFIG_PPC_BOOK3E
-- 
2.13.3



[RFC PATCH 7/8] powerpc/32: use IRQ stack immediately on IRQ exception

2019-12-23 Thread Christophe Leroy
Exception entries run on the kernel thread stack, then do_IRQ()
switches to the IRQ stack.

Instead of taking that first step on the thread stack, which increases the
risk of stack overflow and spends time switching stacks twice when
coming from userspace, set the stack to the IRQ stack immediately in the
EXCEPTION entry.

In the same way as ARM64, consider that when the stack pointer is not
within the kernel thread stack, it is already on the IRQ stack.
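
A hedged C rendering of that check (the real test is done in asm in
EXCEPTION_PROLOG_1 below, using the stack base loaded from
SPRN_SPRG_THREAD/TASK_STACK):

	#include <linux/sched/task_stack.h>

	static inline bool on_task_stack(struct task_struct *tsk, unsigned long sp)
	{
		unsigned long base = (unsigned long)task_stack_page(tsk);

		/* in the same THREAD_SIZE-aligned block as the task's kernel stack? */
		return (sp ^ base) < THREAD_SIZE;
	}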

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/head_32.S  |  2 +-
 arch/powerpc/kernel/head_32.h  | 32 +---
 arch/powerpc/kernel/head_40x.S |  2 +-
 arch/powerpc/kernel/head_8xx.S |  2 +-
 4 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 4a24f8f026c7..0c36fba5b861 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -332,7 +332,7 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_HPTE_TABLE)
EXC_XFER_LITE(0x400, handle_page_fault)
 
 /* External interrupt */
-   EXCEPTION(0x500, HardwareInterrupt, do_IRQ, EXC_XFER_LITE)
+   EXCEPTION_IRQ(0x500, HardwareInterrupt, __do_irq, EXC_XFER_LITE)
 
 /* Alignment exception */
. = 0x600
diff --git a/arch/powerpc/kernel/head_32.h b/arch/powerpc/kernel/head_32.h
index 8abc7783dbe5..f9e77e51723e 100644
--- a/arch/powerpc/kernel/head_32.h
+++ b/arch/powerpc/kernel/head_32.h
@@ -11,21 +11,41 @@
  * task's thread_struct.
  */
 
-.macro EXCEPTION_PROLOG
+.macro EXCEPTION_PROLOG is_irq=0
mtspr   SPRN_SPRG_SCRATCH0,r10
mtspr   SPRN_SPRG_SCRATCH1,r11
mfcr    r10
-   EXCEPTION_PROLOG_1
+   EXCEPTION_PROLOG_1 is_irq=\is_irq
EXCEPTION_PROLOG_2
 .endm
 
-.macro EXCEPTION_PROLOG_1
+.macro EXCEPTION_PROLOG_1 is_irq=0
mfspr   r11,SPRN_SRR1   /* check whether user or kernel */
andi.   r11,r11,MSR_PR
+   .if \is_irq
+   bne 2f
+   mfspr   r11, SPRN_SPRG_THREAD
+   lwz r11, TASK_STACK - THREAD(r11)
+   xor r11, r11, r1
+   cmplwi  cr7, r11, THREAD_SIZE - 1
+   tophys(r11, r1) /* use tophys(r1) if not thread stack */
+   bgt cr7, 1f
+2:
+#ifdef CONFIG_SMP
+   mfspr   r11, SPRN_SPRG_THREAD
+   lwz r11, TASK_CPU - THREAD(r11)
+   slwi    r11, r11, 3
+   addis   r11, r11, (hardirq_ctx - PAGE_OFFSET)@ha
+#else
+   lis r11, (hardirq_ctx - PAGE_OFFSET)@ha
+#endif
+   lwz r11, (hardirq_ctx - PAGE_OFFSET)@l(r11)
+   .else
tophys(r11,r1)  /* use tophys(r1) if kernel */
beq 1f
mfspr   r11,SPRN_SPRG_THREAD
lwz r11,TASK_STACK-THREAD(r11)
+   .endif
addi    r11,r11,THREAD_SIZE
tophys(r11,r11)
 1: subi    r11,r11,INT_FRAME_SIZE  /* alloc exc. frame */
@@ -171,6 +191,12 @@
addi    r3,r1,STACK_FRAME_OVERHEAD; \
xfer(n, hdlr)
 
+#define EXCEPTION_IRQ(n, label, hdlr, xfer)\
+   START_EXCEPTION(n, label)   \
+   EXCEPTION_PROLOG is_irq=1;  \
+   addi    r3,r1,STACK_FRAME_OVERHEAD; \
+   xfer(n, hdlr)
+
 #define EXC_XFER_TEMPLATE(hdlr, trap, msr, tfer, ret)  \
li  r10,trap;   \
stw r10,_TRAP(r11); \
diff --git a/arch/powerpc/kernel/head_40x.S b/arch/powerpc/kernel/head_40x.S
index 4511fc1549f7..dd236f596c0b 100644
--- a/arch/powerpc/kernel/head_40x.S
+++ b/arch/powerpc/kernel/head_40x.S
@@ -315,7 +315,7 @@ _ENTRY(crit_srr1)
EXC_XFER_LITE(0x400, handle_page_fault)
 
 /* 0x0500 - External Interrupt Exception */
-   EXCEPTION(0x0500, HardwareInterrupt, do_IRQ, EXC_XFER_LITE)
+   EXCEPTION_IRQ(0x0500, HardwareInterrupt, __do_irq, EXC_XFER_LITE)
 
 /* 0x0600 - Alignment Exception */
START_EXCEPTION(0x0600, Alignment)
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 19f583e18402..5a6cdbc89e26 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -150,7 +150,7 @@ DataAccess:
 InstructionAccess:
 
 /* External interrupt */
-   EXCEPTION(0x500, HardwareInterrupt, do_IRQ, EXC_XFER_LITE)
+   EXCEPTION_IRQ(0x500, HardwareInterrupt, __do_irq, EXC_XFER_LITE)
 
 /* Alignment exception */
. = 0x600
-- 
2.13.3



[RFC PATCH 5/8] powerpc/irq: move stack overflow verification

2019-12-23 Thread Christophe Leroy
As we are going to switch to the IRQ stack immediately in the exception
handler, it won't be possible anymore to check for stack overflow by
reading the stack pointer.

Do the verification on regs->gpr[1], which contains the stack pointer
at the time the IRQ happened, and move it to __do_irq() so that the
verification is also done when calling __do_irq() directly once the
exception entry does the stack switch.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/irq.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 28414c6665cc..4df49f6e9987 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -596,15 +596,16 @@ u64 arch_irq_stat_cpu(unsigned int cpu)
return sum;
 }
 
-static inline void check_stack_overflow(void)
+static inline void check_stack_overflow(struct pt_regs *regs)
 {
 #ifdef CONFIG_DEBUG_STACKOVERFLOW
+   bool is_user = user_mode(regs);
long sp;
 
-   sp = current_stack_pointer() & (THREAD_SIZE-1);
+   sp = regs->gpr[1] & (THREAD_SIZE - 1);
 
/* check for stack overflow: is there less than 2KB free? */
-   if (unlikely(sp < 2048)) {
+   if (unlikely(!is_user && sp < 2048)) {
pr_err("do_IRQ: stack overflow: %ld\n", sp);
dump_stack();
}
@@ -654,6 +655,8 @@ void __do_irq(struct pt_regs *regs)
 
trace_irq_entry(regs);
 
+   check_stack_overflow(regs);
+
/*
 * Query the platform PIC for the interrupt & ack it.
 *
@@ -685,8 +688,6 @@ void do_IRQ(struct pt_regs *regs)
irqsp = hardirq_ctx[raw_smp_processor_id()];
sirqsp = softirq_ctx[raw_smp_processor_id()];
 
-   check_stack_overflow();
-
/* Already there ? Otherwise switch stack and call */
if (unlikely(cursp == irqsp || cursp == sirqsp))
__do_irq(regs);
-- 
2.13.3



[RFC PATCH 6/8] powerpc/irq: cleanup check_stack_overflow() a bit

2019-12-23 Thread Christophe Leroy
Instead of #ifdef, use IS_ENABLED(CONFIG_DEBUG_STACKOVERFLOW).
This enables GCC to check the code for validity even when the option
is not selected.

The function is not using current_stack_pointer() anymore, so there is
no need to declare it inline; let GCC decide.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/irq.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 4df49f6e9987..a1122ef4a16c 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -596,20 +596,19 @@ u64 arch_irq_stat_cpu(unsigned int cpu)
return sum;
 }
 
-static inline void check_stack_overflow(struct pt_regs *regs)
+static void check_stack_overflow(struct pt_regs *regs)
 {
-#ifdef CONFIG_DEBUG_STACKOVERFLOW
bool is_user = user_mode(regs);
-   long sp;
+   long sp = regs->gpr[1] & (THREAD_SIZE - 1);
 
-   sp = regs->gpr[1] & (THREAD_SIZE - 1);
+   if (!IS_ENABLED(CONFIG_DEBUG_STACKOVERFLOW))
+   return;
 
/* check for stack overflow: is there less than 2KB free? */
if (unlikely(!is_user && sp < 2048)) {
pr_err("do_IRQ: stack overflow: %ld\n", sp);
dump_stack();
}
-#endif
 }
 
 #ifdef CONFIG_PPC32
-- 
2.13.3



[RFC PATCH 4/8] powerpc/irq: move set_irq_regs() closer to irq_enter/exit()

2019-12-23 Thread Christophe Leroy
set_irq_regs() is called by do_IRQ() while irq_enter() and irq_exit()
are called by __do_irq().

Move set_irq_regs() into __do_irq().

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/irq.c | 16 ++--
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 410accba865d..28414c6665cc 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -647,6 +647,7 @@ static inline void call_do_irq(struct pt_regs *regs, void 
*sp)
 
 void __do_irq(struct pt_regs *regs)
 {
+   struct pt_regs *old_regs = set_irq_regs(regs);
unsigned int irq;
 
irq_enter();
@@ -672,11 +673,11 @@ void __do_irq(struct pt_regs *regs)
trace_irq_exit(regs);
 
irq_exit();
+   set_irq_regs(old_regs);
 }
 
 void do_IRQ(struct pt_regs *regs)
 {
-   struct pt_regs *old_regs = set_irq_regs(regs);
void *cursp, *irqsp, *sirqsp;
 
/* Switch to the irq stack to handle this */
@@ -686,16 +687,11 @@ void do_IRQ(struct pt_regs *regs)
 
check_stack_overflow();
 
-   /* Already there ? */
-   if (unlikely(cursp == irqsp || cursp == sirqsp)) {
+   /* Already there ? Otherwise switch stack and call */
+   if (unlikely(cursp == irqsp || cursp == sirqsp))
__do_irq(regs);
-   set_irq_regs(old_regs);
-   return;
-   }
-   /* Switch stack and call */
-   call_do_irq(regs, irqsp);
-
-   set_irq_regs(old_regs);
+   else
+   call_do_irq(regs, irqsp);
 }
 
 void __init init_IRQ(void)
-- 
2.13.3



[RFC PATCH 3/8] powerpc/irq: don't use current_stack_pointer() in do_IRQ()

2019-12-23 Thread Christophe Leroy
Before commit 7306e83ccf5c ("powerpc: Don't use CURRENT_THREAD_INFO to
find the stack"), the current stack base address was obtained by
calling current_thread_info(). That inline function was simply masking
out the value of r1.

In that commit, it was changed to using current_stack_pointer(), which
is a heavier function, as it is an out-of-line assembly function which
cannot be inlined and which reads the content of the stack at 0(r1).

Create a stack_pointer() function which returns the value of r1 and use
it instead.

Signed-off-by: Christophe Leroy 
Fixes: 7306e83ccf5c ("powerpc: Don't use CURRENT_THREAD_INFO to find the stack")
---
 arch/powerpc/include/asm/reg.h | 8 
 arch/powerpc/kernel/irq.c  | 2 +-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 1aa46dff0957..bc14fca9b13b 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1466,6 +1466,14 @@ static inline void update_power8_hid0(unsigned long hid0)
 */
asm volatile("sync; mtspr %0,%1; isync":: "i"(SPRN_HID0), "r"(hid0));
 }
+
+static __always_inline unsigned long stack_pointer(void)
+{
+   register unsigned long r1 asm("r1");
+
+   return r1;
+}
+
 #endif /* __ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_REG_H */
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 4690e5270806..410accba865d 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -680,7 +680,7 @@ void do_IRQ(struct pt_regs *regs)
void *cursp, *irqsp, *sirqsp;
 
/* Switch to the irq stack to handle this */
-   cursp = (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1));
+   cursp = (void *)(stack_pointer() & ~(THREAD_SIZE - 1));
irqsp = hardirq_ctx[raw_smp_processor_id()];
sirqsp = softirq_ctx[raw_smp_processor_id()];
 
-- 
2.13.3



[RFC PATCH 2/8] powerpc/irq: inline call_do_irq() and call_do_softirq() on PPC32

2019-12-23 Thread Christophe Leroy
call_do_irq() and call_do_softirq() are simple enough to be
worth inlining.

Inlining them avoids an mflr/mtlr pair plus a save/reload on the stack.
It also allows GCC to keep the saved ksp_limit in a nonvolatile reg.

This is inspired by the s390 arch. Several other arches do more or
less the same. The way the sparc arch does it seems odd though.

Signed-off-by: Christophe Leroy 
Reviewed-by: Segher Boessenkool 

---
v2: no change.
v3: no change.
v4:
- comment reminding the purpose of the inline asm block.
- added r2 as clobbered reg
v5:
- Limiting the change to PPC32 for now.
- removed r2 from the clobbered regs list (on PPC32 r2 points to current all 
the time)
- Removed patch 1 and merged ksp_limit handling in here.
v6:
- rebased after removal of ksp_limit
---
 arch/powerpc/include/asm/irq.h |  2 ++
 arch/powerpc/kernel/irq.c  | 34 ++
 arch/powerpc/kernel/misc_32.S  | 25 -
 3 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index 814dfab7e392..e4a92f0b4ad4 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -56,8 +56,10 @@ extern void *mcheckirq_ctx[NR_CPUS];
 extern void *hardirq_ctx[NR_CPUS];
 extern void *softirq_ctx[NR_CPUS];
 
+#ifdef CONFIG_PPC64
 void call_do_softirq(void *sp);
 void call_do_irq(struct pt_regs *regs, void *sp);
+#endif
 extern void do_IRQ(struct pt_regs *regs);
 extern void __init init_IRQ(void);
 extern void __do_irq(struct pt_regs *regs);
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index add67498c126..4690e5270806 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -611,6 +611,40 @@ static inline void check_stack_overflow(void)
 #endif
 }
 
+#ifdef CONFIG_PPC32
+static inline void call_do_softirq(const void *sp)
+{
+   register unsigned long ret asm("r3");
+
+   /* Temporarily switch r1 to sp, call __do_softirq() then restore r1. */
+   asm volatile(
+   "   "PPC_STLU"  1, %2(%1);\n"
+   "   mr  1, %1;\n"
+   "   bl  %3;\n"
+   "   "PPC_LL"1, 0(1);\n" :
+   "=r"(ret) :
+   "b"(sp), "i"(THREAD_SIZE - STACK_FRAME_OVERHEAD), 
"i"(__do_softirq) :
+   "lr", "xer", "ctr", "memory", "cr0", "cr1", "cr5", "cr6", "cr7",
+   "r0", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12");
+}
+
+static inline void call_do_irq(struct pt_regs *regs, void *sp)
+{
+   register unsigned long r3 asm("r3") = (unsigned long)regs;
+
+   /* Temporarily switch r1 to sp, call __do_irq() then restore r1 */
+   asm volatile(
+   "   "PPC_STLU"  1, %2(%1);\n"
+   "   mr  1, %1;\n"
+   "   bl  %3;\n"
+   "   "PPC_LL"1, 0(1);\n" :
+   "+r"(r3) :
+   "b"(sp), "i"(THREAD_SIZE - STACK_FRAME_OVERHEAD), "i"(__do_irq) 
:
+   "lr", "xer", "ctr", "memory", "cr0", "cr1", "cr5", "cr6", "cr7",
+   "r0", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12");
+}
+#endif
+
 void __do_irq(struct pt_regs *regs)
 {
unsigned int irq;
diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index bb5995fa6884..341a3cd199cb 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -27,31 +27,6 @@
 
.text
 
-_GLOBAL(call_do_softirq)
-   mflr    r0
-   stw r0,4(r1)
-   stwu    r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r3)
-   mr  r1,r3
-   bl  __do_softirq
-   lwz r1,0(r1)
-   lwz r0,4(r1)
-   mtlr    r0
-   blr
-
-/*
- * void call_do_irq(struct pt_regs *regs, void *sp);
- */
-_GLOBAL(call_do_irq)
-   mflr    r0
-   stw r0,4(r1)
-   stwu    r1,THREAD_SIZE-STACK_FRAME_OVERHEAD(r4)
-   mr  r1,r4
-   bl  __do_irq
-   lwz r1,0(r1)
-   lwz r0,4(r1)
-   mtlr    r0
-   blr
-
 /*
  * This returns the high 64 bits of the product of two 64-bit numbers.
  */
-- 
2.13.3



[RFC PATCH 1/8] powerpc/32: drop ksp_limit based stack overflow detection

2019-12-23 Thread Christophe Leroy
PPC32 implements a specific early stack overflow detection.

This detection is inherited from the ppc arch (before the merge of
ppc and ppc64 into powerpc). At that time, there were no irqstacks
and the verification was simply to check that the stack pointer
was still above the stack base. But when irqstacks were implemented,
it was not possible to perform a simple check anymore, so a
thread-specific value called ksp_limit was introduced in the
task_struct and is updated at every stack switch in order to
keep track of the limit and perform the verification.

ppc64 didn't have this but had a verification during IRQs. This
verification was then extended to PPC32 and can be selected through
CONFIG_DEBUG_STACKOVERFLOW.

In the meantime, thread_info has moved away from the stack, reducing
the impact of a stack overflow.

In addition, there is CONFIG_SCHED_STACK_END_CHECK which can be used
to check that the magic stored at the stack base has not been overwritten.

Remove this PPC32-specific stack overflow mechanism in order to
simplify ongoing work which also aims at further reducing the risk of
stack overflow:
- Switch to irqstack in IRQ exception entry in ASM
- VMAP stack

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/asm-prototypes.h |  1 -
 arch/powerpc/include/asm/processor.h  |  3 --
 arch/powerpc/kernel/asm-offsets.c |  2 --
 arch/powerpc/kernel/entry_32.S| 57 ---
 arch/powerpc/kernel/head_40x.S|  2 --
 arch/powerpc/kernel/head_booke.h  |  1 -
 arch/powerpc/kernel/misc_32.S | 14 
 arch/powerpc/kernel/process.c |  3 --
 arch/powerpc/kernel/traps.c   |  9 -
 arch/powerpc/lib/sstep.c  |  9 -
 10 files changed, 101 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index 983c0084fb3f..90e9c6e415af 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -66,7 +66,6 @@ void RunModeException(struct pt_regs *regs);
 void single_step_exception(struct pt_regs *regs);
 void program_check_exception(struct pt_regs *regs);
 void alignment_exception(struct pt_regs *regs);
-void StackOverflow(struct pt_regs *regs);
 void kernel_fp_unavailable_exception(struct pt_regs *regs);
 void altivec_unavailable_exception(struct pt_regs *regs);
 void vsx_unavailable_exception(struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index a9993e7a443b..a9552048c20b 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -155,7 +155,6 @@ struct thread_struct {
 #endif
 #ifdef CONFIG_PPC32
void*pgdir; /* root of page-table tree */
-   unsigned long   ksp_limit;  /* if ksp <= ksp_limit stack overflow */
 #ifdef CONFIG_PPC_RTAS
unsigned long   rtas_sp;/* stack pointer for when in RTAS */
 #endif
@@ -269,7 +268,6 @@ struct thread_struct {
 #define ARCH_MIN_TASKALIGN 16
 
 #define INIT_SP		(sizeof(init_stack) + (unsigned long) &init_stack)
-#define INIT_SP_LIMIT	((unsigned long)&init_stack)
 
 #ifdef CONFIG_SPE
 #define SPEFSCR_INIT \
@@ -282,7 +280,6 @@ struct thread_struct {
 #ifdef CONFIG_PPC32
 #define INIT_THREAD { \
.ksp = INIT_SP, \
-   .ksp_limit = INIT_SP_LIMIT, \
.addr_limit = KERNEL_DS, \
.pgdir = swapper_pg_dir, \
.fpexc_mode = MSR_FE0 | MSR_FE1, \
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 3d47aec7becf..d936db6b702f 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -88,7 +88,6 @@ int main(void)
DEFINE(SIGSEGV, SIGSEGV);
DEFINE(NMI_MASK, NMI_MASK);
 #else
-   OFFSET(KSP_LIMIT, thread_struct, ksp_limit);
 #ifdef CONFIG_PPC_RTAS
OFFSET(RTAS_SP, thread_struct, rtas_sp);
 #endif
@@ -353,7 +352,6 @@ int main(void)
DEFINE(_CSRR1, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, 
csrr1));
DEFINE(_DSRR0, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, 
dsrr0));
DEFINE(_DSRR1, STACK_INT_FRAME_SIZE+offsetof(struct exception_regs, 
dsrr1));
-   DEFINE(SAVED_KSP_LIMIT, STACK_INT_FRAME_SIZE+offsetof(struct 
exception_regs, saved_ksp_limit));
 #endif
 #endif
 
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index d60908ea37fb..bf11b464a17b 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -86,13 +86,6 @@ crit_transfer_to_handler:
stw r0,_SRR0(r11)
mfspr   r0,SPRN_SRR1
stw r0,_SRR1(r11)
-
-   /* set the stack limit to the current stack */
-   mfspr   r8,SPRN_SPRG_THREAD
-   lwz r0,KSP_LIMIT(r8)
-   stw r0,SAVED_KSP_LIMIT(r11)
-   rlwinm  r0,r1,0,0,(31 - THREAD_SHIFT)
-   stw r0,KSP_LIMIT(r8)
/* fall through */
 #endif
 

[RFC PATCH 0/8] Accelerate IRQ entry

2019-12-23 Thread Christophe Leroy
The purpose of this series is to accelerate IRQ entry by
avoiding unnecessary trampoline functions like call_do_irq()
and call_do_softirq(), and by switching to the IRQ stack
immediately in the exception handler.

For now, it is an RFC as it is still a bit messy.

Please provide feedback and I'll improve next year

Christophe Leroy (8):
  powerpc/32: drop ksp_limit based stack overflow detection
  powerpc/irq: inline call_do_irq() and call_do_softirq() on PPC32
  powerpc/irq: don't use current_stack_pointer() in do_IRQ()
  powerpc/irq: move set_irq_regs() closer to irq_enter/exit()
  powerpc/irq: move stack overflow verification
  powerpc/irq: cleanup check_stack_overflow() a bit
  powerpc/32: use IRQ stack immediately on IRQ exception
  powerpc/irq: drop softirq stack

 arch/powerpc/include/asm/asm-prototypes.h |  1 -
 arch/powerpc/include/asm/irq.h|  3 +-
 arch/powerpc/include/asm/processor.h  |  3 --
 arch/powerpc/include/asm/reg.h|  8 
 arch/powerpc/kernel/asm-offsets.c |  2 -
 arch/powerpc/kernel/entry_32.S| 57 
 arch/powerpc/kernel/head_32.S |  2 +-
 arch/powerpc/kernel/head_32.h | 32 +++--
 arch/powerpc/kernel/head_40x.S|  4 +-
 arch/powerpc/kernel/head_8xx.S|  2 +-
 arch/powerpc/kernel/head_booke.h  |  1 -
 arch/powerpc/kernel/irq.c | 74 +--
 arch/powerpc/kernel/misc_32.S | 39 
 arch/powerpc/kernel/process.c |  7 ---
 arch/powerpc/kernel/setup_32.c|  4 +-
 arch/powerpc/kernel/setup_64.c|  4 +-
 arch/powerpc/kernel/traps.c   |  9 
 arch/powerpc/lib/sstep.c  |  9 
 18 files changed, 95 insertions(+), 166 deletions(-)

-- 
2.13.3



[RFC PATCH v2 10/10] powerpc/32: Switch VDSO to C implementation.

2019-12-23 Thread Christophe Leroy
This is a tentative switch of the powerpc/32 VDSO to the
generic C implementation.

It will likely not work on 64 bits, or even build properly,
at the moment, hence the RFC status.

powerpc is a bit special for the VDSO, as well as for system calls, in
that it requires setting the CR SO bit, which cannot be done in C.
Therefore, entry/exit and the fallback need to be performed in ASM.

On powerpc 8xx, performance is degraded by 30-40% for gettime and
by 15-20% for getres

On a powerpc885 at 132MHz:
With current powerpc/32 ASM VDSO:

gettimeofday:vdso: 737 nsec/call
clock-getres-realtime-coarse:vdso: 3081 nsec/call
clock-gettime-realtime-coarse:vdso: 2861 nsec/call
clock-getres-realtime:vdso: 475 nsec/call
clock-gettime-realtime:vdso: 892 nsec/call
clock-getres-boottime:vdso: 2621 nsec/call
clock-gettime-boottime:vdso: 3857 nsec/call
clock-getres-tai:vdso: 2620 nsec/call
clock-gettime-tai:vdso: 3854 nsec/call
clock-getres-monotonic-raw:vdso: 2621 nsec/call
clock-gettime-monotonic-raw:vdso: 3499 nsec/call
clock-getres-monotonic-coarse:vdso: 3083 nsec/call
clock-gettime-monotonic-coarse:vdso: 3082 nsec/call
clock-getres-monotonic:vdso: 475 nsec/call
clock-gettime-monotonic:vdso: 1014 nsec/call

Once switched to C implementation:

gettimeofday:vdso: 1016 nsec/call
clock-getres-realtime-coarse:vdso: 614 nsec/call
clock-gettime-realtime-coarse:vdso: 760 nsec/call
clock-getres-realtime:vdso: 560 nsec/call
clock-gettime-realtime:vdso: 1192 nsec/call
clock-getres-boottime:vdso: 560 nsec/call
clock-gettime-boottime:vdso: 1194 nsec/call
clock-getres-tai:vdso: 560 nsec/call
clock-gettime-tai:vdso: 1192 nsec/call
clock-getres-monotonic-raw:vdso: 560 nsec/call
clock-gettime-monotonic-raw:vdso: 1248 nsec/call
clock-getres-monotonic-coarse:vdso: 614 nsec/call
clock-gettime-monotonic-coarse:vdso: 760 nsec/call
clock-getres-monotonic:vdso: 560 nsec/call
clock-gettime-monotonic:vdso: 1192 nsec/call

On a powerpc 8321 running at 333MHz
With current powerpc/32 ASM VDSO:

gettimeofday:vdso: 190 nsec/call
clock-getres-realtime-coarse:vdso: 1449 nsec/call
clock-gettime-realtime-coarse:vdso: 1352 nsec/call
clock-getres-realtime:vdso: 135 nsec/call
clock-gettime-realtime:vdso: 244 nsec/call
clock-getres-boottime:vdso: 1313 nsec/call
clock-gettime-boottime:vdso: 1701 nsec/call
clock-getres-tai:vdso: 1268 nsec/call
clock-gettime-tai:vdso: 1742 nsec/call
clock-getres-monotonic-raw:vdso: 1310 nsec/call
clock-gettime-monotonic-raw:vdso: 1584 nsec/call
clock-getres-monotonic-coarse:vdso: 1488 nsec/call
clock-gettime-monotonic-coarse:vdso: 1503 nsec/call
clock-getres-monotonic:vdso: 135 nsec/call
clock-gettime-monotonic:vdso: 283 nsec/call

Once switched to C implementation:

gettimeofday:vdso: 347 nsec/call
clock-getres-realtime-coarse:vdso: 169 nsec/call
clock-gettime-realtime-coarse:vdso: 271 nsec/call
clock-getres-realtime:vdso: 150 nsec/call
clock-gettime-realtime:vdso: 383 nsec/call
clock-getres-boottime:vdso: 157 nsec/call
clock-gettime-boottime:vdso: 377 nsec/call
clock-getres-tai:vdso: 150 nsec/call
clock-gettime-tai:vdso: 380 nsec/call
clock-getres-monotonic-raw:vdso: 153 nsec/call
clock-gettime-monotonic-raw:vdso: 407 nsec/call
clock-getres-monotonic-coarse:vdso: 169 nsec/call
clock-gettime-monotonic-coarse:vdso: 271 nsec/call
clock-getres-monotonic:vdso: 153 nsec/call
clock-gettime-monotonic:vdso: 377 nsec/call
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig |   2 +
 arch/powerpc/include/asm/vdso/gettimeofday.h |  45 +
 arch/powerpc/include/asm/vdso/vsyscall.h |  27 +++
 arch/powerpc/include/asm/vdso_datapage.h |  18 +-
 arch/powerpc/kernel/asm-offsets.c|  23 +--
 arch/powerpc/kernel/time.c   |  92 +-
 arch/powerpc/kernel/vdso.c   |  19 +-
 arch/powerpc/kernel/vdso32/Makefile  |  19 +-
 arch/powerpc/kernel/vdso32/gettimeofday.S| 261 ---
 arch/powerpc/kernel/vdso32/vgettimeofday.c   |  32 
 10 files changed, 178 insertions(+), 360 deletions(-)
 create mode 100644 arch/powerpc/include/asm/vdso/gettimeofday.h
 create mode 100644 arch/powerpc/include/asm/vdso/vsyscall.h
 create mode 100644 arch/powerpc/kernel/vdso32/vgettimeofday.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1ec34e16ed65..bd04c68baf91 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -169,6 +169,7 @@ config PPC
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
select GENERIC_TIME_VSYSCALL
+   select GENERIC_GETTIMEOFDAY
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_HUGE_VMAP  if PPC_BOOK3S_64 && 
PPC_RADIX_MMU
select HAVE_ARCH_JUMP_LABEL
@@ -198,6 +199,7 @@ config PPC
select 

[RFC PATCH v2 09/10] powerpc/vdso32: inline __get_datapage()

2019-12-23 Thread Christophe Leroy
__get_datapage() is only a few instructions; it retrieves the
address of the page where the kernel stores data for the VDSO.

By inlining this function into its users, a bl/blr pair and
a mflr/mtlr pair are avoided, plus a few register moves.

The improvement is noticeable (about 55 nsec/call on an 8xx).

vdsotest before the patch:
gettimeofday:vdso: 731 nsec/call
clock-gettime-realtime-coarse:vdso: 668 nsec/call
clock-gettime-monotonic-coarse:vdso: 745 nsec/call

vdsotest after the patch:
gettimeofday:vdso: 677 nsec/call
clock-gettime-realtime-coarse:vdso: 613 nsec/call
clock-gettime-monotonic-coarse:vdso: 690 nsec/call

Signed-off-by: Christophe Leroy 

---
v3: define get_datapage macro in asm/vdso_datapage.h
v4: fixed build failure with old binutils
---
 arch/powerpc/include/asm/vdso_datapage.h  | 10 ++
 arch/powerpc/kernel/vdso32/cacheflush.S   |  9 -
 arch/powerpc/kernel/vdso32/datapage.S | 28 +++-
 arch/powerpc/kernel/vdso32/gettimeofday.S | 12 +---
 4 files changed, 22 insertions(+), 37 deletions(-)

diff --git a/arch/powerpc/include/asm/vdso_datapage.h 
b/arch/powerpc/include/asm/vdso_datapage.h
index 40f13f3626d3..ee5319a6f4e3 100644
--- a/arch/powerpc/include/asm/vdso_datapage.h
+++ b/arch/powerpc/include/asm/vdso_datapage.h
@@ -118,6 +118,16 @@ struct vdso_data {
 
 extern struct vdso_data *vdso_data;
 
+#else /* __ASSEMBLY__ */
+
+.macro get_datapage ptr, tmp
+   bcl 20, 31, .+4
+   mflr \ptr
+   addi \ptr, \ptr, (__kernel_datapage_offset - (.-4))@l
+   lwz \tmp, 0(\ptr)
+   add \ptr, \tmp, \ptr
+.endm
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/kernel/vdso32/cacheflush.S 
b/arch/powerpc/kernel/vdso32/cacheflush.S
index 7f882e7b9f43..d178ec8c279d 100644
--- a/arch/powerpc/kernel/vdso32/cacheflush.S
+++ b/arch/powerpc/kernel/vdso32/cacheflush.S
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include <asm/vdso_datapage.h>
 #include 
 
.text
@@ -24,14 +25,12 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
   .cfi_startproc
mflr r12
   .cfi_register lr,r12
-   mr  r11,r3
-   bl  __get_datapage@local
+   get_datapage r10, r0
mtlr r12
-   mr  r10,r3
 
lwz r7,CFG_DCACHE_BLOCKSZ(r10)
addi r5,r7,-1
-   andc r6,r11,r5   /* round low to line bdy */
+   andc r6,r3,r5    /* round low to line bdy */
subf r8,r6,r4    /* compute length */
add r8,r8,r5     /* ensure we get enough */
lwz r9,CFG_DCACHE_LOGBLOCKSZ(r10)
@@ -48,7 +47,7 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
 
lwz r7,CFG_ICACHE_BLOCKSZ(r10)
addi r5,r7,-1
-   andc r6,r11,r5   /* round low to line bdy */
+   andc r6,r3,r5    /* round low to line bdy */
subf r8,r6,r4    /* compute length */
add r8,r8,r5
lwz r9,CFG_ICACHE_LOGBLOCKSZ(r10)
diff --git a/arch/powerpc/kernel/vdso32/datapage.S 
b/arch/powerpc/kernel/vdso32/datapage.S
index 6c7401bd284e..1095d818f94a 100644
--- a/arch/powerpc/kernel/vdso32/datapage.S
+++ b/arch/powerpc/kernel/vdso32/datapage.S
@@ -10,35 +10,13 @@
 #include 
 #include 
 #include 
+#include <asm/vdso_datapage.h>
 
.text
.global __kernel_datapage_offset;
 __kernel_datapage_offset:
.long   0
 
-V_FUNCTION_BEGIN(__get_datapage)
-  .cfi_startproc
-   /* We don't want that exposed or overridable as we want other objects
-* to be able to bl directly to here
-*/
-   .protected __get_datapage
-   .hidden __get_datapage
-
-   mflr r0
-  .cfi_register lr,r0
-
-   bcl 20,31,data_page_branch
-data_page_branch:
-   mflr r3
-   mtlr r0
-   addi r3, r3, __kernel_datapage_offset-data_page_branch
-   lwz r0,0(r3)
-  .cfi_restore lr
-   add r3,r0,r3
-   blr
-  .cfi_endproc
-V_FUNCTION_END(__get_datapage)
-
 /*
  * void *__kernel_get_syscall_map(unsigned int *syscall_count) ;
  *
@@ -53,7 +31,7 @@ V_FUNCTION_BEGIN(__kernel_get_syscall_map)
mflr r12
   .cfi_register lr,r12
mr  r4,r3
-   bl  __get_datapage@local
+   get_datapage r3, r0
mtlr r12
addi r3,r3,CFG_SYSCALL_MAP32
cmpli   cr0,r4,0
@@ -75,7 +53,7 @@ V_FUNCTION_BEGIN(__kernel_get_tbfreq)
   .cfi_startproc
mflr r12
   .cfi_register lr,r12
-   bl  __get_datapage@local
+   get_datapage r3, r0
lwz r4,(CFG_TB_TICKS_PER_SEC + 4)(r3)
lwz r3,CFG_TB_TICKS_PER_SEC(r3)
mtlr r12
diff --git a/arch/powerpc/kernel/vdso32/gettimeofday.S 
b/arch/powerpc/kernel/vdso32/gettimeofday.S
index 3306672f57a9..d6c1d331e8cb 100644
--- a/arch/powerpc/kernel/vdso32/gettimeofday.S
+++ b/arch/powerpc/kernel/vdso32/gettimeofday.S
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include <asm/vdso_datapage.h>
 #include 
 #include 

[RFC PATCH v2 07/10] lib: vdso: don't use READ_ONCE() in __c_kernel_time()

2019-12-23 Thread Christophe Leroy
READ_ONCE() forces the read of the 64-bit value of
vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec although
only the lower part is needed.

This results in suboptimal code:

0af4 <__c_kernel_time>:
 af4:   2c 03 00 00 cmpwi   r3,0
 af8:   81 44 00 20 lwz r10,32(r4)
 afc:   81 64 00 24 lwz r11,36(r4)
 b00:   41 82 00 08 beq b08 <__c_kernel_time+0x14>
 b04:   91 63 00 00 stw r11,0(r3)
 b08:   7d 63 5b 78 mr  r3,r11
 b0c:   4e 80 00 20 blr

By removing the READ_ONCE(), only the lower part is read from
memory, and the code is cleaner:

0af4 <__c_kernel_time>:
 af4:   7c 69 1b 79 mr. r9,r3
 af8:   80 64 00 24 lwz r3,36(r4)
 afc:   4d 82 00 20 beqlr
 b00:   90 69 00 00 stw r3,0(r9)
 b04:   4e 80 00 20 blr

Signed-off-by: Christophe Leroy 
---
 lib/vdso/gettimeofday.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 17b4cff6e5f0..5a17a9d2e6cd 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -144,7 +144,7 @@ __cvdso_gettimeofday(const struct vdso_data *vd, struct 
__kernel_old_timeval *tv
 static __maybe_unused __kernel_old_time_t
 __cvdso_time(const struct vdso_data *vd, __kernel_old_time_t *time)
 {
-   __kernel_old_time_t t = 
READ_ONCE(vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec);
+   __kernel_old_time_t t = vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec;
 
if (time)
*time = t;
-- 
2.13.3



[RFC PATCH v2 06/10] lib: vdso: make do_coarse() return 0

2019-12-23 Thread Christophe Leroy
do_coarse() is similar to do_hres() except that it never
fails.

Change its return type from void to int and make it return 0
in all cases. This cleans up the code a bit.

Signed-off-by: Christophe Leroy 
---
 lib/vdso/gettimeofday.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 86d5b1c8796b..17b4cff6e5f0 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -64,7 +64,7 @@ static inline int do_hres(const struct vdso_data *vd, 
clockid_t clk,
return 0;
 }
 
-static void do_coarse(const struct vdso_data *vd, clockid_t clk,
+static int do_coarse(const struct vdso_data *vd, clockid_t clk,
  struct __kernel_timespec *ts)
 {
	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
@@ -75,6 +75,8 @@ static void do_coarse(const struct vdso_data *vd, clockid_t 
clk,
ts->tv_sec = vdso_ts->sec;
ts->tv_nsec = vdso_ts->nsec;
} while (unlikely(vdso_read_retry(vd, seq)));
+
+   return 0;
 }
 
 static __maybe_unused int
@@ -92,14 +94,13 @@ __cvdso_clock_gettime(const struct vdso_data *vd, clockid_t 
clock,
 * clocks are handled in the VDSO directly.
 */
msk = 1U << clock;
-   if (likely(msk & VDSO_HRES)) {
+   if (likely(msk & VDSO_HRES))
return do_hres(&vd[CS_HRES_COARSE], clock, ts);
-   } else if (msk & VDSO_COARSE) {
-   do_coarse(&vd[CS_HRES_COARSE], clock, ts);
-   return 0;
-   } else if (msk & VDSO_RAW) {
+   else if (msk & VDSO_COARSE)
+   return do_coarse(&vd[CS_HRES_COARSE], clock, ts);
+   else if (msk & VDSO_RAW)
return do_hres(&vd[CS_RAW], clock, ts);
-   }
+
return -1;
 }
 
-- 
2.13.3



[RFC PATCH v2 05/10] lib: vdso: inline do_hres()

2019-12-23 Thread Christophe Leroy
do_hres() is called from several places, so GCC doesn't inline
it on its own.

do_hres() takes a struct __kernel_timespec * parameter for
passing back the result. In the 32-bit case, this parameter corresponds
to a local variable in the caller. In order to provide a pointer
to this structure, the caller has to place it on its stack, and
do_hres() has to write the result to the stack. This is suboptimal,
especially on RISC processors like powerpc.

By making GCC inline the function, the struct __kernel_timespec
remains a local variable held in registers, avoiding the need to
write to and read from the stack.

The improvement is significant on powerpc.

Signed-off-by: Christophe Leroy 
---
 lib/vdso/gettimeofday.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 24e1ba838260..86d5b1c8796b 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -34,8 +34,8 @@ u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
 }
 #endif
 
-static int do_hres(const struct vdso_data *vd, clockid_t clk,
-  struct __kernel_timespec *ts)
+static inline int do_hres(const struct vdso_data *vd, clockid_t clk,
+ struct __kernel_timespec *ts)
 {
	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
u64 cycles, last, sec, ns;
-- 
2.13.3



[RFC PATCH v2 08/10] lib: vdso: Avoid duplication in __cvdso_clock_getres()

2019-12-23 Thread Christophe Leroy
VDSO_HRES and VDSO_RAW clocks are handled the same way.

Don't duplicate code.

Signed-off-by: Christophe Leroy 
---
 lib/vdso/gettimeofday.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 5a17a9d2e6cd..aa4a167bf1e0 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -172,7 +172,7 @@ int __cvdso_clock_getres(const struct vdso_data *vd, 
clockid_t clock,
 * clocks are handled in the VDSO directly.
 */
msk = 1U << clock;
-   if (msk & VDSO_HRES) {
+   if (msk & (VDSO_HRES | VDSO_RAW)) {
/*
 * Preserves the behaviour of posix_get_hrtimer_res().
 */
@@ -182,11 +182,6 @@ int __cvdso_clock_getres(const struct vdso_data *vd, 
clockid_t clock,
 * Preserves the behaviour of posix_get_coarse_res().
 */
ns = LOW_RES_NSEC;
-   } else if (msk & VDSO_RAW) {
-   /*
-* Preserves the behaviour of posix_get_hrtimer_res().
-*/
-   ns = hrtimer_res;
} else {
return -1;
}
-- 
2.13.3



[RFC PATCH v2 04/10] lib: vdso: get pointer to vdso data from the arch

2019-12-23 Thread Christophe Leroy
On powerpc, __arch_get_vdso_data() clobbers the link register,
requiring the caller to set up a stack frame in order to save it.

As the parent function already has to set up a stack frame and save
the link register in order to call the C vdso function, retrieving the
vdso data pointer there is cheaper.

Give arches the opportunity to hand the vdso data pointer
to C vdso functions.
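
For reference, a sketch of what the lib/vdso side of this change looks
like (the corresponding hunk is truncated below): the helpers take the
vdso data pointer from the arch wrapper instead of calling
__arch_get_vdso_data() themselves, e.g. for __cvdso_time():

/* Sketch only, based on the helpers shown elsewhere in this series. */
static __maybe_unused __kernel_old_time_t
__cvdso_time(const struct vdso_data *vd, __kernel_old_time_t *time)
{
	__kernel_old_time_t t =
		READ_ONCE(vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec);

	if (time)
		*time = t;

	return t;
}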

Signed-off-by: Christophe Leroy 
---
 arch/arm/vdso/vgettimeofday.c| 12 
 arch/arm64/kernel/vdso/vgettimeofday.c   |  9 ++---
 arch/arm64/kernel/vdso32/vgettimeofday.c | 12 
 arch/mips/vdso/vgettimeofday.c   | 21 ++---
 arch/x86/entry/vdso/vclock_gettime.c | 22 +++---
 lib/vdso/gettimeofday.c  | 28 ++--
 6 files changed, 65 insertions(+), 39 deletions(-)

diff --git a/arch/arm/vdso/vgettimeofday.c b/arch/arm/vdso/vgettimeofday.c
index 5451afb715e6..efad7d508d06 100644
--- a/arch/arm/vdso/vgettimeofday.c
+++ b/arch/arm/vdso/vgettimeofday.c
@@ -10,7 +10,8 @@
 int __vdso_clock_gettime(clockid_t clock,
 struct old_timespec32 *ts)
 {
-   int ret = __cvdso_clock_gettime32(clock, ts);
+   const struct vdso_data *vd = __arch_get_vdso_data();
+   int ret = __cvdso_clock_gettime32(vd, clock, ts);
 
if (likely(!ret))
return ret;
@@ -21,7 +22,8 @@ int __vdso_clock_gettime(clockid_t clock,
 int __vdso_clock_gettime64(clockid_t clock,
   struct __kernel_timespec *ts)
 {
-   int ret = __cvdso_clock_gettime(clock, ts);
+   const struct vdso_data *vd = __arch_get_vdso_data();
+   int ret = __cvdso_clock_gettime(vd, clock, ts);
 
if (likely(!ret))
return ret;
@@ -32,7 +34,8 @@ int __vdso_clock_gettime64(clockid_t clock,
 int __vdso_gettimeofday(struct __kernel_old_timeval *tv,
struct timezone *tz)
 {
-   int ret = __cvdso_gettimeofday(tv, tz);
+   const struct vdso_data *vd = __arch_get_vdso_data();
+   int ret = __cvdso_gettimeofday(vd, tv, tz);
 
if (likely(!ret))
return ret;
@@ -43,7 +46,8 @@ int __vdso_gettimeofday(struct __kernel_old_timeval *tv,
 int __vdso_clock_getres(clockid_t clock_id,
struct old_timespec32 *res)
 {
-   int ret = __cvdso_clock_getres_time32(clock_id, res);
+   const struct vdso_data *vd = __arch_get_vdso_data();
+   int ret = __cvdso_clock_getres_time32(vd, clock_id, res);
 
if (likely(!ret))
return ret;
diff --git a/arch/arm64/kernel/vdso/vgettimeofday.c 
b/arch/arm64/kernel/vdso/vgettimeofday.c
index 62694876b216..9a7122ec6d17 100644
--- a/arch/arm64/kernel/vdso/vgettimeofday.c
+++ b/arch/arm64/kernel/vdso/vgettimeofday.c
@@ -11,7 +11,8 @@
 int __kernel_clock_gettime(clockid_t clock,
   struct __kernel_timespec *ts)
 {
-   int ret = __cvdso_clock_gettime(clock, ts);
+   const struct vdso_data *vd = __arch_get_vdso_data();
+   int ret = __cvdso_clock_gettime(vd, clock, ts);
 
if (likely(!ret))
return ret;
@@ -22,7 +23,8 @@ int __kernel_clock_gettime(clockid_t clock,
 int __kernel_gettimeofday(struct __kernel_old_timeval *tv,
  struct timezone *tz)
 {
-   int ret = __cvdso_gettimeofday(tv, tz);
+   const struct vdso_data *vd = __arch_get_vdso_data();
+   int ret = __cvdso_gettimeofday(vd, tv, tz);
 
if (likely(!ret))
return ret;
@@ -33,7 +35,8 @@ int __kernel_gettimeofday(struct __kernel_old_timeval *tv,
 int __kernel_clock_getres(clockid_t clock_id,
  struct __kernel_timespec *res)
 {
-   int ret =  __cvdso_clock_getres(clock_id, res);
+   const struct vdso_data *vd = __arch_get_vdso_data();
+   int ret =  __cvdso_clock_getres(vd, clock_id, res);
 
if (likely(!ret))
return ret;
diff --git a/arch/arm64/kernel/vdso32/vgettimeofday.c 
b/arch/arm64/kernel/vdso32/vgettimeofday.c
index 6770d2bedd1f..3eb6a82c1c25 100644
--- a/arch/arm64/kernel/vdso32/vgettimeofday.c
+++ b/arch/arm64/kernel/vdso32/vgettimeofday.c
@@ -11,13 +11,14 @@
 int __vdso_clock_gettime(clockid_t clock,
 struct old_timespec32 *ts)
 {
+   const struct vdso_data *vd = __arch_get_vdso_data();
int ret;
 
/* The checks below are required for ABI consistency with arm */
if ((u32)ts >= TASK_SIZE_32)
return -EFAULT;
 
-   ret = __cvdso_clock_gettime32(clock, ts);
+   ret = __cvdso_clock_gettime32(vd, clock, ts);
 
if (likely(!ret))
return ret;
@@ -28,13 +29,14 @@ int __vdso_clock_gettime(clockid_t clock,
 int __vdso_clock_gettime64(clockid_t clock,
   struct __kernel_timespec *ts)
 {
+   const struct vdso_data *vd = __arch_get_vdso_data();
int ret;
 
/* The checks below are required for ABI consistency 

[RFC PATCH v2 02/10] lib: vdso: move call to fallback out of common code.

2019-12-23 Thread Christophe Leroy
On powerpc, VDSO functions and syscalls cannot be implemented entirely in C
because the Linux kernel ABI requires that the CR[SO] bit is set in case
of error and cleared when there is no error.

As this cannot be done in C, the C VDSO functions and the syscall-based
fallbacks need a trampoline in ASM.

By moving the fallback calls out of the common code, arches like
powerpc can implement both the call to the C VDSO and the fallback call
in a single trampoline function.

The two advantages are:
- No need to juggle back and forth between CR[SO] and negative return values.
- No stack frame is required in the VDSO C functions for the fallbacks.

The performance improvement is significant on powerpc.

Signed-off-by: Christophe Leroy 
---
 arch/arm/vdso/vgettimeofday.c| 28 +++---
 arch/arm64/kernel/vdso/vgettimeofday.c   | 21 --
 arch/arm64/kernel/vdso32/vgettimeofday.c | 35 ---
 arch/mips/vdso/vgettimeofday.c   | 49 +++-
 arch/x86/entry/vdso/vclock_gettime.c | 42 +++
 lib/vdso/gettimeofday.c  | 31 
 6 files changed, 156 insertions(+), 50 deletions(-)

diff --git a/arch/arm/vdso/vgettimeofday.c b/arch/arm/vdso/vgettimeofday.c
index 1976c6f325a4..5451afb715e6 100644
--- a/arch/arm/vdso/vgettimeofday.c
+++ b/arch/arm/vdso/vgettimeofday.c
@@ -10,25 +10,45 @@
 int __vdso_clock_gettime(clockid_t clock,
 struct old_timespec32 *ts)
 {
-   return __cvdso_clock_gettime32(clock, ts);
+   int ret = __cvdso_clock_gettime32(clock, ts);
+
+   if (likely(!ret))
+   return ret;
+
+   return clock_gettime32_fallback(clock, ts);
 }
 
 int __vdso_clock_gettime64(clockid_t clock,
   struct __kernel_timespec *ts)
 {
-   return __cvdso_clock_gettime(clock, ts);
+   int ret = __cvdso_clock_gettime(clock, ts);
+
+   if (likely(!ret))
+   return ret;
+
+   return clock_gettime_fallback(clock, ts);
 }
 
 int __vdso_gettimeofday(struct __kernel_old_timeval *tv,
struct timezone *tz)
 {
-   return __cvdso_gettimeofday(tv, tz);
+   int ret = __cvdso_gettimeofday(tv, tz);
+
+   if (likely(!ret))
+   return ret;
+
+   return gettimeofday_fallback(tv, tz);
 }
 
 int __vdso_clock_getres(clockid_t clock_id,
struct old_timespec32 *res)
 {
-   return __cvdso_clock_getres_time32(clock_id, res);
+   int ret = __cvdso_clock_getres_time32(clock_id, res);
+
+   if (likely(!ret))
+   return ret;
+
+   return clock_getres32_fallback(clock_id, res);
 }
 
 /* Avoid unresolved references emitted by GCC */
diff --git a/arch/arm64/kernel/vdso/vgettimeofday.c 
b/arch/arm64/kernel/vdso/vgettimeofday.c
index 747635501a14..62694876b216 100644
--- a/arch/arm64/kernel/vdso/vgettimeofday.c
+++ b/arch/arm64/kernel/vdso/vgettimeofday.c
@@ -11,17 +11,32 @@
 int __kernel_clock_gettime(clockid_t clock,
   struct __kernel_timespec *ts)
 {
-   return __cvdso_clock_gettime(clock, ts);
+   int ret = __cvdso_clock_gettime(clock, ts);
+
+   if (likely(!ret))
+   return ret;
+
+   return clock_gettime_fallback(clock, ts);
 }
 
 int __kernel_gettimeofday(struct __kernel_old_timeval *tv,
  struct timezone *tz)
 {
-   return __cvdso_gettimeofday(tv, tz);
+   int ret = __cvdso_gettimeofday(tv, tz);
+
+   if (likely(!ret))
+   return ret;
+
+   return gettimeofday_fallback(tv, tz);
 }
 
 int __kernel_clock_getres(clockid_t clock_id,
  struct __kernel_timespec *res)
 {
-   return __cvdso_clock_getres(clock_id, res);
+   int ret =  __cvdso_clock_getres(clock_id, res);
+
+   if (likely(!ret))
+   return ret;
+
+   return clock_getres_fallback(clock_id, res);
 }
diff --git a/arch/arm64/kernel/vdso32/vgettimeofday.c 
b/arch/arm64/kernel/vdso32/vgettimeofday.c
index 54fc1c2ce93f..6770d2bedd1f 100644
--- a/arch/arm64/kernel/vdso32/vgettimeofday.c
+++ b/arch/arm64/kernel/vdso32/vgettimeofday.c
@@ -11,37 +11,64 @@
 int __vdso_clock_gettime(clockid_t clock,
 struct old_timespec32 *ts)
 {
+   int ret;
+
/* The checks below are required for ABI consistency with arm */
if ((u32)ts >= TASK_SIZE_32)
return -EFAULT;
 
-   return __cvdso_clock_gettime32(clock, ts);
+   ret = __cvdso_clock_gettime32(clock, ts);
+
+   if (likely(!ret))
+   return ret;
+
+   return clock_gettime32_fallback(clock, ts);
 }
 
 int __vdso_clock_gettime64(clockid_t clock,
   struct __kernel_timespec *ts)
 {
+   int ret;
+
/* The checks below are required for ABI consistency with arm */
if ((u32)ts >= TASK_SIZE_32)
return -EFAULT;
 
-   return __cvdso_clock_gettime(clock, ts);
+   ret = __cvdso_clock_gettime(clock, 

[RFC PATCH v2 03/10] lib: vdso: Change __cvdso_clock_gettime/getres_common() to __cvdso_clock_gettime/getres()

2019-12-23 Thread Christophe Leroy
__cvdso_clock_getres() just calls __cvdso_clock_getres_common().
__cvdso_clock_gettime() just calls __cvdso_clock_gettime_common().

Drop __cvdso_clock_getres() and __cvdso_clock_gettime().
Rename __cvdso_clock_gettime_common() to __cvdso_clock_gettime().
Rename __cvdso_clock_getres_common() to __cvdso_clock_getres().

Signed-off-by: Christophe Leroy 
---
 lib/vdso/gettimeofday.c | 19 ---
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 4618e274f1d5..c6eeeb47f446 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -79,7 +79,7 @@ static void do_coarse(const struct vdso_data *vd, clockid_t 
clk,
 }
 
 static __maybe_unused int
-__cvdso_clock_gettime_common(clockid_t clock, struct __kernel_timespec *ts)
+__cvdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
 {
const struct vdso_data *vd = __arch_get_vdso_data();
u32 msk;
@@ -105,16 +105,10 @@ __cvdso_clock_gettime_common(clockid_t clock, struct 
__kernel_timespec *ts)
 }
 
 static __maybe_unused int
-__cvdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
-{
-   return __cvdso_clock_gettime_common(clock, ts);
-}
-
-static __maybe_unused int
 __cvdso_clock_gettime32(clockid_t clock, struct old_timespec32 *res)
 {
struct __kernel_timespec ts;
-   int ret = __cvdso_clock_gettime_common(clock, &ts);
+   int ret = __cvdso_clock_gettime(clock, &ts);
 
if (likely(!ret)) {
res->tv_sec = ts.tv_sec;
@@ -161,7 +155,7 @@ static __maybe_unused __kernel_old_time_t 
__cvdso_time(__kernel_old_time_t *time
 
 #ifdef VDSO_HAS_CLOCK_GETRES
 static __maybe_unused
-int __cvdso_clock_getres_common(clockid_t clock, struct __kernel_timespec *res)
+int __cvdso_clock_getres(clockid_t clock, struct __kernel_timespec *res)
 {
const struct vdso_data *vd = __arch_get_vdso_data();
u64 hrtimer_res;
@@ -204,16 +198,11 @@ int __cvdso_clock_getres_common(clockid_t clock, struct 
__kernel_timespec *res)
return 0;
 }
 
-int __cvdso_clock_getres(clockid_t clock, struct __kernel_timespec *res)
-{
-   return __cvdso_clock_getres_common(clock, res);
-}
-
 static __maybe_unused int
 __cvdso_clock_getres_time32(clockid_t clock, struct old_timespec32 *res)
 {
struct __kernel_timespec ts;
-   int ret = __cvdso_clock_getres_common(clock, &ts);
+   int ret = __cvdso_clock_getres(clock, &ts);
 
if (likely(!ret && res)) {
res->tv_sec = ts.tv_sec;
-- 
2.13.3



[RFC PATCH v2 01/10] lib: vdso: ensure all arches have 32bit fallback

2019-12-23 Thread Christophe Leroy
In order to simplify the next step, which moves the fallback calls to the
arch level, ensure all arches have a 32-bit fallback instead of handling
the lack of a 32-bit fallback in the common code based
on VDSO_HAS_32BIT_FALLBACK.

Signed-off-by: Christophe Leroy 
---
 arch/arm/include/asm/vdso/gettimeofday.h  | 26 +
 arch/arm64/include/asm/vdso/compat_gettimeofday.h |  2 --
 arch/arm64/include/asm/vdso/gettimeofday.h| 26 +
 arch/mips/include/asm/vdso/gettimeofday.h | 28 +--
 arch/x86/include/asm/vdso/gettimeofday.h  | 28 +--
 lib/vdso/gettimeofday.c   | 10 
 6 files changed, 104 insertions(+), 16 deletions(-)

diff --git a/arch/arm/include/asm/vdso/gettimeofday.h 
b/arch/arm/include/asm/vdso/gettimeofday.h
index 0ad2429c324f..55f8ad6e 100644
--- a/arch/arm/include/asm/vdso/gettimeofday.h
+++ b/arch/arm/include/asm/vdso/gettimeofday.h
@@ -70,6 +70,32 @@ static __always_inline int clock_getres_fallback(
return ret;
 }
 
+static __always_inline
+long clock_gettime32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
+{
+   struct __kernel_timespec ts;
+   int ret = clock_gettime_fallback(_clkid, &ts);
+
+   if (likely(!ret)) {
+   _ts->tv_sec = ts.tv_sec;
+   _ts->tv_nsec = ts.tv_nsec;
+   }
+   return ret;
+}
+
+static __always_inline
+long clock_getres32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
+{
+   struct __kernel_timespec ts;
+   int ret = clock_getres_fallback(_clkid, &ts);
+
+   if (likely(!ret && _ts)) {
+   _ts->tv_sec = ts.tv_sec;
+   _ts->tv_nsec = ts.tv_nsec;
+   }
+   return ret;
+}
+
 static __always_inline u64 __arch_get_hw_counter(int clock_mode)
 {
 #ifdef CONFIG_ARM_ARCH_TIMER
diff --git a/arch/arm64/include/asm/vdso/compat_gettimeofday.h 
b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
index c50ee1b7d5cd..bab700e37a03 100644
--- a/arch/arm64/include/asm/vdso/compat_gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/compat_gettimeofday.h
@@ -16,8 +16,6 @@
 
 #define VDSO_HAS_CLOCK_GETRES  1
 
-#define VDSO_HAS_32BIT_FALLBACK1
-
 static __always_inline
 int gettimeofday_fallback(struct __kernel_old_timeval *_tv,
  struct timezone *_tz)
diff --git a/arch/arm64/include/asm/vdso/gettimeofday.h 
b/arch/arm64/include/asm/vdso/gettimeofday.h
index b08f476b72b4..c41c86a07423 100644
--- a/arch/arm64/include/asm/vdso/gettimeofday.h
+++ b/arch/arm64/include/asm/vdso/gettimeofday.h
@@ -66,6 +66,32 @@ int clock_getres_fallback(clockid_t _clkid, struct 
__kernel_timespec *_ts)
return ret;
 }
 
+static __always_inline
+long clock_gettime32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
+{
+   struct __kernel_timespec ts;
+   int ret = clock_gettime_fallback(_clkid, &ts);
+
+   if (likely(!ret)) {
+   _ts->tv_sec = ts.tv_sec;
+   _ts->tv_nsec = ts.tv_nsec;
+   }
+   return ret;
+}
+
+static __always_inline
+long clock_getres32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
+{
+   struct __kernel_timespec ts;
+   int ret = clock_getres_fallback(_clkid, &ts);
+
+   if (likely(!ret && _ts)) {
+   _ts->tv_sec = ts.tv_sec;
+   _ts->tv_nsec = ts.tv_nsec;
+   }
+   return ret;
+}
+
 static __always_inline u64 __arch_get_hw_counter(s32 clock_mode)
 {
u64 res;
diff --git a/arch/mips/include/asm/vdso/gettimeofday.h 
b/arch/mips/include/asm/vdso/gettimeofday.h
index b08825531e9f..60608e930a5c 100644
--- a/arch/mips/include/asm/vdso/gettimeofday.h
+++ b/arch/mips/include/asm/vdso/gettimeofday.h
@@ -109,8 +109,6 @@ static __always_inline int clock_getres_fallback(
 
 #if _MIPS_SIM != _MIPS_SIM_ABI64
 
-#define VDSO_HAS_32BIT_FALLBACK1
-
 static __always_inline long clock_gettime32_fallback(
clockid_t _clkid,
struct old_timespec32 *_ts)
@@ -150,6 +148,32 @@ static __always_inline int clock_getres32_fallback(
 
return error ? -ret : ret;
 }
+#else
+static __always_inline
+long clock_gettime32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
+{
+   struct __kernel_timespec ts;
+   int ret = clock_gettime_fallback(_clkid, &ts);
+
+   if (likely(!ret)) {
+   _ts->tv_sec = ts.tv_sec;
+   _ts->tv_nsec = ts.tv_nsec;
+   }
+   return ret;
+}
+
+static __always_inline
+long clock_getres32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
+{
+   struct __kernel_timespec ts;
+   int ret = clock_getres_fallback(_clkid, &ts);
+
+   if (likely(!ret && _ts)) {
+   _ts->tv_sec = ts.tv_sec;
+   _ts->tv_nsec = ts.tv_nsec;
+   }
+   return ret;
+}
 #endif
 
 #ifdef CONFIG_CSRC_R4K
diff --git a/arch/x86/include/asm/vdso/gettimeofday.h 

[RFC PATCH v2 00/10] powerpc/32: switch VDSO to C implementation.

2019-12-23 Thread Christophe Leroy
This is a second attempt at switching the powerpc/32 VDSO to the generic C
implementation.

It will likely not work on 64-bit or even build properly at the moment.

powerpc is a bit special for the VDSO, as it is for system calls, in
that it requires setting the CR[SO] bit, which cannot be done in C.
Therefore, entry/exit and the fallbacks need to be performed in ASM.

To allow that, the fallback calls are moved out of the common code
and left to the arches.

A few other changes in the common code have allowed performance improvements.

Performance has improved since the first RFC, but it is still lower than
with the current assembly VDSO.

On a powerpc8xx, with current powerpc/32 ASM VDSO:

gettimeofday:vdso: 737 nsec/call
clock-getres-realtime:vdso: 475 nsec/call
clock-gettime-realtime:vdso: 892 nsec/call
clock-getres-monotonic:vdso: 475 nsec/call
clock-gettime-monotonic:vdso: 1014 nsec/call

First try of C implementation:

gettimeofday:vdso: 1533 nsec/call
clock-getres-realtime:vdso: 853 nsec/call
clock-gettime-realtime:vdso: 1570 nsec/call
clock-getres-monotonic:vdso: 835 nsec/call
clock-gettime-monotonic:vdso: 1605 nsec/call

With this series:

gettimeofday:vdso: 1016 nsec/call
clock-getres-realtime:vdso: 560 nsec/call
clock-gettime-realtime:vdso: 1192 nsec/call
clock-getres-monotonic:vdso: 560 nsec/call
clock-gettime-monotonic:vdso: 1192 nsec/call


Changes made to other arches are untested, not even compiled.


Christophe Leroy (10):
  lib: vdso: ensure all arches have 32bit fallback
  lib: vdso: move call to fallback out of common code.
  lib: vdso: Change __cvdso_clock_gettime/getres_common() to
__cvdso_clock_gettime/getres()
  lib: vdso: get pointer to vdso data from the arch
  lib: vdso: inline do_hres()
  lib: vdso: make do_coarse() return 0
  lib: vdso: don't use READ_ONCE() in __c_kernel_time()
  lib: vdso: Avoid duplication in __cvdso_clock_getres()
  powerpc/vdso32: inline __get_datapage()
  powerpc/32: Switch VDSO to C implementation.

 arch/arm/include/asm/vdso/gettimeofday.h  |  26 +++
 arch/arm/vdso/vgettimeofday.c |  32 ++-
 arch/arm64/include/asm/vdso/compat_gettimeofday.h |   2 -
 arch/arm64/include/asm/vdso/gettimeofday.h|  26 +++
 arch/arm64/kernel/vdso/vgettimeofday.c|  24 +-
 arch/arm64/kernel/vdso32/vgettimeofday.c  |  39 +++-
 arch/mips/include/asm/vdso/gettimeofday.h |  28 ++-
 arch/mips/vdso/vgettimeofday.c|  56 -
 arch/powerpc/Kconfig  |   2 +
 arch/powerpc/include/asm/vdso/gettimeofday.h  |  45 
 arch/powerpc/include/asm/vdso/vsyscall.h  |  27 +++
 arch/powerpc/include/asm/vdso_datapage.h  |  28 +--
 arch/powerpc/kernel/asm-offsets.c |  23 +-
 arch/powerpc/kernel/time.c|  92 +---
 arch/powerpc/kernel/vdso.c|  19 +-
 arch/powerpc/kernel/vdso32/Makefile   |  19 +-
 arch/powerpc/kernel/vdso32/cacheflush.S   |   9 +-
 arch/powerpc/kernel/vdso32/datapage.S |  28 +--
 arch/powerpc/kernel/vdso32/gettimeofday.S | 265 +++---
 arch/powerpc/kernel/vdso32/vgettimeofday.c|  32 +++
 arch/x86/entry/vdso/vclock_gettime.c  |  52 -
 arch/x86/include/asm/vdso/gettimeofday.h  |  28 ++-
 lib/vdso/gettimeofday.c   | 100 +++-
 23 files changed, 505 insertions(+), 497 deletions(-)
 create mode 100644 arch/powerpc/include/asm/vdso/gettimeofday.h
 create mode 100644 arch/powerpc/include/asm/vdso/vsyscall.h
 create mode 100644 arch/powerpc/kernel/vdso32/vgettimeofday.c

-- 
2.13.3



[PATCH] powerpc/shared: include correct header for static key

2019-12-23 Thread Jason A. Donenfeld
Recently, the spinlock implementation grew a static key optimization,
but the jump_label.h header include was left out, leading to build
errors:

linux/arch/powerpc/include/asm/spinlock.h:44:7: error: implicit declaration of 
function ‘static_branch_unlikely’ [-Werror=implicit-function-declaration]
   44 |  if (!static_branch_unlikely(&shared_processor))

This commit adds the missing header.
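
For context, a reduced sketch of the kind of code involved (simplified,
not the exact spinlock.h contents):

#include <linux/jump_label.h>	/* declares static_branch_unlikely() */

DECLARE_STATIC_KEY_FALSE(shared_processor);

static inline bool is_shared_processor(void)
{
	/* Without the include above, this is an implicit declaration. */
	return static_branch_unlikely(&shared_processor);
}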

Fixes: 656c21d6af5d ("powerpc/shared: Use static key to detect shared 
processor")
Cc: Srikar Dronamraju 
Signed-off-by: Jason A. Donenfeld 
---
 arch/powerpc/include/asm/spinlock.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 1b55fc08f853..860228e917dc 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -15,6 +15,7 @@
  *
  * (the type definitions are in asm/spinlock_types.h)
  */
+#include <linux/jump_label.h>
 #include 
 #ifdef CONFIG_PPC64
 #include 
-- 
2.24.1



Re: [PATCH kernel v3] powerpc/book3s64: Fix error handling in mm_iommu_do_alloc()

2019-12-23 Thread Michael Ellerman
Alexey Kardashevskiy  writes:

> The last jump to free_exit in mm_iommu_do_alloc() happens after page
> pointers in struct mm_iommu_table_group_mem_t were already converted to
> physical addresses. Thus calling put_page() on these physical addresses
> will likely crash.
>
> This moves the loop which calculates the pageshift and converts page
> struct pointers to physical addresses later after the point when
> we cannot fail; thus eliminating the need to convert pointers back.
>
> Fixes: eb9d7a62c386 ("powerpc/mm_iommu: Fix potential deadlock")
> Reported-by: Jan Kara 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v3:
> * move pointers conversion after the last possible failure point
> ---
>  arch/powerpc/mm/book3s64/iommu_api.c | 39 +++-
>  1 file changed, 21 insertions(+), 18 deletions(-)
>
> diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
> b/arch/powerpc/mm/book3s64/iommu_api.c
> index 56cc84520577..ef164851738b 100644
> --- a/arch/powerpc/mm/book3s64/iommu_api.c
> +++ b/arch/powerpc/mm/book3s64/iommu_api.c
> @@ -121,24 +121,6 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> unsigned long ua,
>   goto free_exit;
>   }
>  
> - pageshift = PAGE_SHIFT;
> - for (i = 0; i < entries; ++i) {
> - struct page *page = mem->hpages[i];
> -
> - /*
> -  * Allow to use larger than 64k IOMMU pages. Only do that
> -  * if we are backed by hugetlb.
> -  */
> - if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page))
> - pageshift = page_shift(compound_head(page));
> - mem->pageshift = min(mem->pageshift, pageshift);
> - /*
> -  * We don't need struct page reference any more, switch
> -  * to physical address.
> -  */
> - mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
> - }
> -
>  good_exit:
>   atomic64_set(&mem->mapped, 1);
>   mem->used = 1;
> @@ -158,6 +140,27 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> unsigned long ua,
>   }
>   }
>  
> + if (mem->dev_hpa == MM_IOMMU_TABLE_INVALID_HPA) {

Couldn't you avoid testing this again ...

> + /*
> +  * Allow to use larger than 64k IOMMU pages. Only do that
> +  * if we are backed by hugetlb. Skip device memory as it is not
> +  * backed with page structs.
> +  */
> + pageshift = PAGE_SHIFT;
> + for (i = 0; i < entries; ++i) {

... by making this loop up to `pinned`.

`pinned` is only incremented in the loop that does the GUP, and there's
a check that pinned == entries after that loop.

So when we get here we know pinned == entries, and if pinned is zero
it's because we took the (dev_hpa != MM_IOMMU_TABLE_INVALID_HPA) case at
the start of the function to get here.

Or do you think that's too subtle to rely on?
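
Roughly, the suggestion amounts to (sketch only, untested):

	pageshift = PAGE_SHIFT;
	for (i = 0; i < pinned; ++i) {	/* pinned == entries, or 0 for dev mem */
		struct page *page = mem->hpages[i];

		if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page))
			pageshift = page_shift(compound_head(page));
		mem->pageshift = min(mem->pageshift, pageshift);
		mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
	}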

cheers

> + struct page *page = mem->hpages[i];
> +
> + if ((mem->pageshift > PAGE_SHIFT) && PageHuge(page))
> + pageshift = page_shift(compound_head(page));
> + mem->pageshift = min(mem->pageshift, pageshift);
> + /*
> +  * We don't need struct page reference any more, switch
> +  * to physical address.
> +  */
> + mem->hpas[i] = page_to_pfn(page) << PAGE_SHIFT;
> + }
> + }
> +
>   list_add_rcu(&mem->next, &mm->context.iommu_group_mem_list);
>  
>   mutex_unlock(&mem_list_mutex);
> -- 
> 2.17.1