date:20190719

[PATCH kernel RFC 1/2] powerpc/pseries: Call RTAS directly

2019-07-19 Thread Alexey Kardashevskiy

The pseries guests call RTAS via a RTAS entry point which is a firmware
image under powernv and simple HCALL wrapper under QEMU. For the latter,
we can skip the binary image and do HCALL directly, eliminating the need
in the RTAS blob entirely.

This checks the DT whether the new method is supported and use it if it is.
This removes few checks as QEMU might decide not to keen RTAS around at
all (might be the case for secure VMs).

Note that kexec still checks for the linux,rtas-xxx properties which has
to be fixed.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/rtas.h |  1 +
 arch/powerpc/kernel/rtas.c  | 47 +++--
 2 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 3c1887351c71..60cd528806c1 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -59,6 +59,7 @@ struct rtas_t {
arch_spinlock_t lock;
struct rtas_args args;
struct device_node *dev;/* virtual address pointer */
+   bool hcall;
 };
 
 struct rtas_suspend_me_data {
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 5faf0a64c92b..0651291ab5ff 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -49,6 +49,14 @@ struct rtas_t rtas = {
 };
 EXPORT_SYMBOL(rtas);
 
+static void __enter_rtas(unsigned long pa)
+{
+   if (rtas.hcall)
+   plpar_hcall_norets(H_RTAS, pa);
+   else
+   enter_rtas(pa);
+}
+
 DEFINE_SPINLOCK(rtas_data_buf_lock);
 EXPORT_SYMBOL(rtas_data_buf_lock);
 
@@ -95,9 +103,6 @@ static void call_rtas_display_status(unsigned char c)
 {
unsigned long s;
 
-   if (!rtas.base)
-   return;
-
s = lock_rtas();
rtas_call_unlocked(&rtas.args, 10, 1, 1, NULL, c);
unlock_rtas(s);
@@ -145,9 +150,6 @@ static void udbg_rtascon_putc(char c)
 {
int tries;
 
-   if (!rtas.base)
-   return;
-
/* Add CRs before LFs */
if (c == '\n')
udbg_rtascon_putc('\r');
@@ -164,9 +166,6 @@ static int udbg_rtascon_getc_poll(void)
 {
int c;
 
-   if (!rtas.base)
-   return -1;
-
if (rtas_call(rtas_getchar_token, 0, 2, &c))
return -1;
 
@@ -205,9 +204,6 @@ void rtas_progress(char *s, unsigned short hex)
static int current_line;
static int pending_newline = 0;  /* did last write end with unprinted 
newline? */
 
-   if (!rtas.base)
-   return;
-
if (display_width == 0) {
display_width = 0x10;
if ((root = of_find_node_by_path("/rtas"))) {
@@ -382,7 +378,7 @@ static char *__fetch_rtas_last_error(char *altbuf)
save_args = rtas.args;
rtas.args = err_args;
 
-   enter_rtas(__pa(&rtas.args));
+   __enter_rtas(__pa(&rtas.args));
 
err_args = rtas.args;
rtas.args = save_args;
@@ -428,7 +424,7 @@ va_rtas_call_unlocked(struct rtas_args *args, int token, 
int nargs, int nret,
for (i = 0; i < nret; ++i)
args->rets[i] = 0;
 
-   enter_rtas(__pa(args));
+   __enter_rtas(__pa(args));
 }
 
 void rtas_call_unlocked(struct rtas_args *args, int token, int nargs, int 
nret, ...)
@@ -449,7 +445,7 @@ int rtas_call(int token, int nargs, int nret, int *outputs, 
...)
char *buff_copy = NULL;
int ret;
 
-   if (!rtas.entry || token == RTAS_UNKNOWN_SERVICE)
+   if (token == RTAS_UNKNOWN_SERVICE)
return -1;
 
s = lock_rtas();
@@ -1064,9 +1060,6 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
 
-   if (!rtas.entry)
-   return -EINVAL;
-
if (copy_from_user(&args, uargs, 3 * sizeof(u32)) != 0)
return -EFAULT;
 
@@ -1115,7 +1108,7 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
flags = lock_rtas();
 
rtas.args = args;
-   enter_rtas(__pa(&rtas.args));
+   __enter_rtas(__pa(&rtas.args));
args = rtas.args;
 
/* A -1 return code indicates that the last command couldn't
@@ -1149,8 +1142,8 @@ SYSCALL_DEFINE1(rtas, struct rtas_args __user *, uargs)
 void __init rtas_initialize(void)
 {
unsigned long rtas_region = RTAS_INSTANTIATE_MAX;
-   u32 base, size, entry;
-   int no_base, no_size, no_entry;
+   u32 base = 0, size = 0, entry = 0, do_h_rtas = 0;
+   int no_base, no_size, no_entry, ret;
 
/* Get RTAS dev node and fill up our "rtas" structure with infos
 * about it.
@@ -1161,17 +1154,15 @@ void __init rtas_initialize(void)
 
no_base = of_property_read_u32(rtas.dev, "linux,rtas-base", &base);
no_size = of_property_read_u32(rtas.dev, "rtas-size", &size);
-   if (no_base || no_size) {
-   of_node_put(rtas.dev);
-   rtas.dev = NULL;
-

[PATCH kernel RFC 0/2] powerpc/pseries: Kexec style boot

2019-07-19 Thread Alexey Kardashevskiy

There is a funny excercise to run a guest under QEMU without
the SLOF firmware and boot into a kernel directly to use petitboot as
a boot loader (a more power boot loader than grub and yum),
the patchset is posted as "spapr: Kexec style boot".

Since there is no SLOF, i.e. no client interface and no RTAS blob,
we need to avoid the former and call the latter directly. Also,
this implements "client-architecture-support" substiitute for
the new environment.

This is based on sha1
a2b6f26c264e Christophe Leroy "powerpc/module64: Use symbolic instructions 
names.".

Please comment. Thanks.



Alexey Kardashevskiy (2):
  powerpc/pseries: Call RTAS directly
  powerpc/pseries: Kexec style ibm,client-architecture-support support

 arch/powerpc/include/asm/rtas.h |  1 +
 arch/powerpc/kernel/prom_init.c | 12 ++---
 arch/powerpc/kernel/rtas.c  | 47 +++--
 arch/powerpc/kernel/setup_64.c  | 41 
 4 files changed, 71 insertions(+), 30 deletions(-)

-- 
2.17.1

[PATCH kernel RFC 2/2] powerpc/pseries: Kexec style ibm, client-architecture-support support

2019-07-19 Thread Alexey Kardashevskiy

This checks the FDT for "/chosen/qemu,h_cas" and calls H_CAS when present.
The H_CAS hcall is implemented in QEMU for ages and currently returns
an FDT with a diff to the initial FDT which SLOF updates and returns to
the OS. For this patch to work, QEMU needs to provide the full tree
instead, so when QEMU is run with the "-machine pseries,bios=no",
it reads the existing FDT from the OS, updats it and writes back on top.

This changes prom_check_platform_support() not to call the client
interface's prom_getproplen() as the kexec-style boot does not provide
the client interface services.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/prom_init.c | 12 ++
 arch/powerpc/kernel/setup_64.c  | 41 +
 2 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 514707ef6779..6d8e35cb3c57 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -1261,7 +1261,7 @@ static void __init prom_parse_platform_support(u8 index, 
u8 val,
}
 }
 
-static void __init prom_check_platform_support(void)
+struct ibm_arch_vec __init *prom_check_platform_support(int prop_len)
 {
struct platform_support supported = {
.hash_mmu = false,
@@ -1269,8 +1269,6 @@ static void __init prom_check_platform_support(void)
.radix_gtse = false,
.xive = false
};
-   int prop_len = prom_getproplen(prom.chosen,
-  "ibm,arch-vec-5-platform-support");
 
/*
 * First copy the architecture vec template
@@ -1319,7 +1317,12 @@ static void __init prom_check_platform_support(void)
prom_debug("Asking for XIVE\n");
ibm_architecture_vec.vec5.intarch = OV5_FEAT(OV5_XIVE_EXPLOIT);
}
+
+   ibm_architecture_vec.vec5.max_cpus = cpu_to_be32(NR_CPUS);
+
+   return &ibm_architecture_vec;
 }
+EXPORT_SYMBOL_GPL(prom_check_platform_support);
 
 static void __init prom_send_capabilities(void)
 {
@@ -1328,7 +1331,8 @@ static void __init prom_send_capabilities(void)
u32 cores;
 
/* Check ibm,arch-vec-5-platform-support and fixup vec5 if required */
-   prom_check_platform_support();
+   prom_check_platform_support(prom_getproplen(prom.chosen,
+   "ibm,arch-vec-5-platform-support"));
 
root = call_prom("open", 1, 1, ADDR("/"));
if (root != 0) {
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 44b4c432a273..6fa384278180 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -284,12 +284,53 @@ void __init record_spr_defaults(void)
  * device-tree is not accessible via normal means at this point.
  */
 
+/*
+ * The architecture vector has an array of PVR mask/value pairs,
+ * followed by # option vectors - 1, followed by the option vectors.
+ *
+ * See prom.h for the definition of the bits specified in the
+ * architecture vector.
+ */
+
+extern struct ibm_arch_vec __init *prom_check_platform_support(
+   int vec5_prop_len);
+
+int __init early_init_dt_scan_chosen_h_cas(unsigned long node,
+   const char *uname, int depth, void *data)
+{
+   int l;
+   const char *p;
+
+   if (depth != 1 || !data ||
+   (strcmp(uname, "chosen") != 0 && strcmp(uname, "chosen@0") != 0))
+   return 0;
+   p = of_get_flat_dt_prop(node, "qemu,h_cas", &l);
+   if (p != NULL && l > 0)
+   *(bool *) data = be32_to_cpu(*(uint32_t *) p) != 0;
+
+   return 1;
+}
+
 void __init early_setup(unsigned long dt_ptr)
 {
static __initdata struct paca_struct boot_paca;
+   struct ibm_arch_vec *vec = prom_check_platform_support(0);
 
/*  printk is _NOT_ safe to use here ! --- */
 
+   /* ibm,client-architecture-support support */
+#define KVMPPC_HCALL_BASE   0xf000
+#define KVMPPC_H_CAS(KVMPPC_HCALL_BASE + 0x2)
+#define FDT_MAX_SIZE 0x10
+   bool do_h_cas = false;
+
+   if (early_init_dt_verify(__va(dt_ptr))) {
+   of_scan_flat_dt(early_init_dt_scan_chosen_h_cas, &do_h_cas);
+   if (do_h_cas)
+   plpar_hcall_norets(KVMPPC_H_CAS, vec, dt_ptr,
+   FDT_MAX_SIZE);
+   }
+
/* Try new device tree based feature discovery ... */
if (!dt_cpu_ftrs_init(__va(dt_ptr)))
/* Otherwise use the old style CPU table */
-- 
2.17.1

Re: [PATCH 2/3] DMA mapping: Move SME handling to x86-specific files

2019-07-19 Thread Thiago Jung Bauermann



kbuild test robot  writes:

> Hi Thiago,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on linus/master]
> [cannot apply to v5.2 next-20190718]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
>
> url:
> https://github.com/0day-ci/linux/commits/Thiago-Jung-Bauermann/Remove-x86-specific-code-from-generic-headers/20190715-063006
> config: s390-allnoconfig (attached as .config)
> compiler: s390-linux-gcc (GCC) 7.4.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=7.4.0 make.cross ARCH=s390
>
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot 
>
> All errors (new ones prefixed by >>):
>
>kernel/dma/swiotlb.c: In function 'swiotlb_tbl_map_single':
>>> kernel/dma/swiotlb.c:461:6: error: implicit declaration of function 
>>> 'mem_encrypt_active'; did you mean 'set_cpu_active'? 
>>> [-Werror=implicit-function-declaration]
>  if (mem_encrypt_active())
>  ^~
>  set_cpu_active
>cc1: some warnings being treated as errors

This error was reported for v1 of the patch series. I wasn't able to
reproduce this problem on v1 but found a similar issue on v2.

I just did a build test of each patch of the latest version (v3) with an
s390 cross-toolchain and the config file from this report and didn't
find any build issues, so I believe this problem is solved.

--
Thiago Jung Bauermann
IBM Linux Technology Center

Re: [PATCH v10 8/9] kselftest: save-and-restore errno to allow for %m formatting

2019-07-19 Thread Aleksa Sarai

On 2019-07-19, shuah  wrote:
> On 7/19/19 10:42 AM, Aleksa Sarai wrote:
> > Previously, using "%m" in a ksft_* format string can result in strange
> > output because the errno value wasn't saved before calling other libc
> > functions. The solution is to simply save and restore the errno before
> > we format the user-supplied format string.
> > 
> > Signed-off-by: Aleksa Sarai 
> [...]
> Hi Aleksa,
> 
> Can you send this patch separate from the patch series. I will apply
> this as bug fix to 5.3-rc2 or rc3.
> 
> This isn't part of this series anyway and I would like to get this in
> right away.

Done, and I'll drop it in v11 after the rest gets reviewed.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature

Re: [PATCH v10 8/9] kselftest: save-and-restore errno to allow for %m formatting

2019-07-19 Thread shuah


On 7/19/19 10:42 AM, Aleksa Sarai wrote:

Previously, using "%m" in a ksft_* format string can result in strange
output because the errno value wasn't saved before calling other libc
functions. The solution is to simply save and restore the errno before
we format the user-supplied format string.

Signed-off-by: Aleksa Sarai 
---
  tools/testing/selftests/kselftest.h | 15 +++
  1 file changed, 15 insertions(+)

diff --git a/tools/testing/selftests/kselftest.h 
b/tools/testing/selftests/kselftest.h
index ec15c4f6af55..0ac49d91a260 100644
--- a/tools/testing/selftests/kselftest.h
+++ b/tools/testing/selftests/kselftest.h
@@ -10,6 +10,7 @@
  #ifndef __KSELFTEST_H
  #define __KSELFTEST_H
  
+#include 

  #include 
  #include 
  #include 
@@ -81,58 +82,68 @@ static inline void ksft_print_cnts(void)
  
  static inline void ksft_print_msg(const char *msg, ...)

  {
+   int saved_errno = errno;
va_list args;
  
  	va_start(args, msg);

printf("# ");
+   errno = saved_errno;
vprintf(msg, args);
va_end(args);
  }
  
  static inline void ksft_test_result_pass(const char *msg, ...)

  {
+   int saved_errno = errno;
va_list args;
  
  	ksft_cnt.ksft_pass++;
  
  	va_start(args, msg);

printf("ok %d ", ksft_test_num());
+   errno = saved_errno;
vprintf(msg, args);
va_end(args);
  }
  
  static inline void ksft_test_result_fail(const char *msg, ...)

  {
+   int saved_errno = errno;
va_list args;
  
  	ksft_cnt.ksft_fail++;
  
  	va_start(args, msg);

printf("not ok %d ", ksft_test_num());
+   errno = saved_errno;
vprintf(msg, args);
va_end(args);
  }
  
  static inline void ksft_test_result_skip(const char *msg, ...)

  {
+   int saved_errno = errno;
va_list args;
  
  	ksft_cnt.ksft_xskip++;
  
  	va_start(args, msg);

printf("not ok %d # SKIP ", ksft_test_num());
+   errno = saved_errno;
vprintf(msg, args);
va_end(args);
  }
  
  static inline void ksft_test_result_error(const char *msg, ...)

  {
+   int saved_errno = errno;
va_list args;
  
  	ksft_cnt.ksft_error++;
  
  	va_start(args, msg);

printf("not ok %d # error ", ksft_test_num());
+   errno = saved_errno;
vprintf(msg, args);
va_end(args);
  }
@@ -152,10 +163,12 @@ static inline int ksft_exit_fail(void)
  
  static inline int ksft_exit_fail_msg(const char *msg, ...)

  {
+   int saved_errno = errno;
va_list args;
  
  	va_start(args, msg);

printf("Bail out! ");
+   errno = saved_errno;
vprintf(msg, args);
va_end(args);
  
@@ -178,10 +191,12 @@ static inline int ksft_exit_xpass(void)

  static inline int ksft_exit_skip(const char *msg, ...)
  {
if (msg) {
+   int saved_errno = errno;
va_list args;
  
  		va_start(args, msg);

printf("not ok %d # SKIP ", 1 + ksft_test_num());
+   errno = saved_errno;
vprintf(msg, args);
va_end(args);
} else {



Hi Aleksa,

Can you send this patch separate from the patch series. I will apply
this as bug fix to 5.3-rc2 or rc3.

This isn't part of this series anyway and I would like to get this in
right away.

thanks,
-- Shuah

[PATCH v10 9/9] selftests: add openat2(2) selftests

2019-07-19 Thread Aleksa Sarai

Test all of the various openat2(2) flags, as well as how file
descriptor re-opening works. A small stress-test of a symlink-rename
attack is included to show that the protections against ".."-based
attacks are sufficient.

In addition, the memfd selftest is fixed to no longer depend on the
now-disallowed functionality of upgrading an O_RDONLY descriptor to
O_RDWR.

Signed-off-by: Aleksa Sarai 
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/memfd/memfd_test.c|   7 +-
 tools/testing/selftests/openat2/.gitignore|   1 +
 tools/testing/selftests/openat2/Makefile  |   8 +
 tools/testing/selftests/openat2/helpers.c | 162 +++
 tools/testing/selftests/openat2/helpers.h | 114 +
 .../testing/selftests/openat2/linkmode_test.c | 326 ++
 .../selftests/openat2/rename_attack_test.c| 124 ++
 .../testing/selftests/openat2/resolve_test.c  | 397 ++
 9 files changed, 1138 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/openat2/.gitignore
 create mode 100644 tools/testing/selftests/openat2/Makefile
 create mode 100644 tools/testing/selftests/openat2/helpers.c
 create mode 100644 tools/testing/selftests/openat2/helpers.h
 create mode 100644 tools/testing/selftests/openat2/linkmode_test.c
 create mode 100644 tools/testing/selftests/openat2/rename_attack_test.c
 create mode 100644 tools/testing/selftests/openat2/resolve_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 9781ca79794a..42a27d029c10 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -37,6 +37,7 @@ TARGETS += powerpc
 TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
+TARGETS += openat2
 TARGETS += rseq
 TARGETS += rtc
 TARGETS += seccomp
diff --git a/tools/testing/selftests/memfd/memfd_test.c 
b/tools/testing/selftests/memfd/memfd_test.c
index c67d32eeb668..e71df3d3e55d 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -925,7 +925,7 @@ static void test_share_mmap(char *banner, char *b_suffix)
  */
 static void test_share_open(char *banner, char *b_suffix)
 {
-   int fd, fd2;
+   int procfd, fd, fd2;
 
printf("%s %s %s\n", memfd_str, banner, b_suffix);
 
@@ -950,13 +950,16 @@ static void test_share_open(char *banner, char *b_suffix)
mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
 
+   /* We cannot do a MAY_WRITE re-open of an O_RDONLY fd. */
+   procfd = mfd_assert_open(fd2, O_PATH, 0);
close(fd2);
-   fd2 = mfd_assert_open(fd, O_RDWR, 0);
+   fd2 = mfd_assert_open(procfd, O_WRONLY, 0);
 
mfd_assert_add_seals(fd2, F_SEAL_SEAL);
mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
 
+   close(procfd);
close(fd2);
close(fd);
 }
diff --git a/tools/testing/selftests/openat2/.gitignore 
b/tools/testing/selftests/openat2/.gitignore
new file mode 100644
index ..bd68f6c3fd07
--- /dev/null
+++ b/tools/testing/selftests/openat2/.gitignore
@@ -0,0 +1 @@
+/*_test
diff --git a/tools/testing/selftests/openat2/Makefile 
b/tools/testing/selftests/openat2/Makefile
new file mode 100644
index ..a0c1b53fd268
--- /dev/null
+++ b/tools/testing/selftests/openat2/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS += -Wall -O2 -g
+TEST_GEN_PROGS := linkmode_test resolve_test rename_attack_test
+
+include ../lib.mk
+
+$(TEST_GEN_PROGS): helpers.c
diff --git a/tools/testing/selftests/openat2/helpers.c 
b/tools/testing/selftests/openat2/helpers.c
new file mode 100644
index ..c16213ff1946
--- /dev/null
+++ b/tools/testing/selftests/openat2/helpers.c
@@ -0,0 +1,162 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai 
+ * Copyright (C) 2018-2019 SUSE LLC.
+ */
+
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "helpers.h"
+
+int sys_openat2(int dfd, const char *path, const struct open_how *how)
+{
+   int ret = syscall(__NR_openat2, dfd, path, how);
+   return ret >= 0 ? ret : -errno;
+}
+
+int sys_openat(int dfd, const char *path, const struct open_how *how)
+{
+   int ret = openat(dfd, path, how->flags, how->mode);
+   return ret >= 0 ? ret : -errno;
+}
+
+int sys_renameat2(int olddirfd, const char *oldpath,
+ int newdirfd, const char *newpath, unsigned int flags)
+{
+   int ret = syscall(__NR_renameat2, olddirfd, oldpath,
+ newdirfd, newpath, flags);
+   return ret >= 0 ? ret : -errno;
+}
+
+char *openat_flags(unsigned int flags)
+{
+   char *flagset, *accmode = "(none)";
+
+   switch (flags & 0x03) {
+   case O_RDWR:
+   accmode = "O

[PATCH v10 8/9] kselftest: save-and-restore errno to allow for %m formatting

2019-07-19 Thread Aleksa Sarai

Previously, using "%m" in a ksft_* format string can result in strange
output because the errno value wasn't saved before calling other libc
functions. The solution is to simply save and restore the errno before
we format the user-supplied format string.

Signed-off-by: Aleksa Sarai 
---
 tools/testing/selftests/kselftest.h | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/tools/testing/selftests/kselftest.h 
b/tools/testing/selftests/kselftest.h
index ec15c4f6af55..0ac49d91a260 100644
--- a/tools/testing/selftests/kselftest.h
+++ b/tools/testing/selftests/kselftest.h
@@ -10,6 +10,7 @@
 #ifndef __KSELFTEST_H
 #define __KSELFTEST_H
 
+#include 
 #include 
 #include 
 #include 
@@ -81,58 +82,68 @@ static inline void ksft_print_cnts(void)
 
 static inline void ksft_print_msg(const char *msg, ...)
 {
+   int saved_errno = errno;
va_list args;
 
va_start(args, msg);
printf("# ");
+   errno = saved_errno;
vprintf(msg, args);
va_end(args);
 }
 
 static inline void ksft_test_result_pass(const char *msg, ...)
 {
+   int saved_errno = errno;
va_list args;
 
ksft_cnt.ksft_pass++;
 
va_start(args, msg);
printf("ok %d ", ksft_test_num());
+   errno = saved_errno;
vprintf(msg, args);
va_end(args);
 }
 
 static inline void ksft_test_result_fail(const char *msg, ...)
 {
+   int saved_errno = errno;
va_list args;
 
ksft_cnt.ksft_fail++;
 
va_start(args, msg);
printf("not ok %d ", ksft_test_num());
+   errno = saved_errno;
vprintf(msg, args);
va_end(args);
 }
 
 static inline void ksft_test_result_skip(const char *msg, ...)
 {
+   int saved_errno = errno;
va_list args;
 
ksft_cnt.ksft_xskip++;
 
va_start(args, msg);
printf("not ok %d # SKIP ", ksft_test_num());
+   errno = saved_errno;
vprintf(msg, args);
va_end(args);
 }
 
 static inline void ksft_test_result_error(const char *msg, ...)
 {
+   int saved_errno = errno;
va_list args;
 
ksft_cnt.ksft_error++;
 
va_start(args, msg);
printf("not ok %d # error ", ksft_test_num());
+   errno = saved_errno;
vprintf(msg, args);
va_end(args);
 }
@@ -152,10 +163,12 @@ static inline int ksft_exit_fail(void)
 
 static inline int ksft_exit_fail_msg(const char *msg, ...)
 {
+   int saved_errno = errno;
va_list args;
 
va_start(args, msg);
printf("Bail out! ");
+   errno = saved_errno;
vprintf(msg, args);
va_end(args);
 
@@ -178,10 +191,12 @@ static inline int ksft_exit_xpass(void)
 static inline int ksft_exit_skip(const char *msg, ...)
 {
if (msg) {
+   int saved_errno = errno;
va_list args;
 
va_start(args, msg);
printf("not ok %d # SKIP ", 1 + ksft_test_num());
+   errno = saved_errno;
vprintf(msg, args);
va_end(args);
} else {
-- 
2.22.0

[PATCH v10 7/9] open: openat2(2) syscall

2019-07-19 Thread Aleksa Sarai

The most obvious syscall to add support for the new LOOKUP_* scoping
flags would be openat(2). However, there are a few reasons why this is
not the best course of action:

 * The new LOOKUP_* flags are intended to be security features, and
   openat(2) will silently ignore all unknown flags. This means that
   users would need to avoid foot-gunning themselves constantly when
   using this interface if it were part of openat(2). This can be fixed
   by having userspace libraries handle this for users[1], but should be
   avoided if possible.

 * Resolution scoping feels like a different operation to the existing
   O_* flags. And since openat(2) has limited flag space, it seems to be
   quite wasteful to clutter it with 5 flags that are all
   resolution-related. Arguably O_NOFOLLOW is also a resolution flag but
   its entire purpose is to error out if you encounter a trailing
   symlink -- not to scope resolution.

 * Other systems would be able to reimplement this syscall allowing for
   cross-OS standardisation rather than being hidden amongst O_* flags
   which may result in it not being used by all the parties that might
   want to use it (file servers, web servers, container runtimes, etc).

 * It gives us the opportunity to iterate on the O_PATH interface. In
   particular, the new @how->upgrade_mask field for fd re-opening is
   only possible because we have a clean slate without needing to re-use
   the ACC_MODE flag design nor the existing openat(2) @mode semantics.

To this end, we introduce the openat2(2) syscall. It provides all of the
features of openat(2) through the @how->flags argument, but also
also provides a new @how->resolve argument which exposes RESOLVE_* flags
that map to our new LOOKUP_* flags. It also eliminates the long-standing
ugliness of variadic-open(2) by embedding it in a struct.

In order to allow for userspace to lock down their usage of file
descriptor re-opening, openat2(2) has the ability for users to disallow
certain re-opening modes through @how->upgrade_mask. At the moment,
there is no UPGRADE_NOEXEC. The open_how struct is padded to 64 bytes
for future extensions (all of the reserved bits must be zeroed).

[1]: https://github.com/openSUSE/libpathrs

Co-developed-by: Christian Brauner 
Signed-off-by: Aleksa Sarai 
---
 arch/alpha/kernel/syscalls/syscall.tbl  |   1 +
 arch/arm/tools/syscall.tbl  |   1 +
 arch/arm64/include/asm/unistd.h |   2 +-
 arch/arm64/include/asm/unistd32.h   |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl   |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl   |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl|   1 +
 arch/s390/kernel/syscalls/syscall.tbl   |   1 +
 arch/sh/kernel/syscalls/syscall.tbl |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl  |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl  |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl |   1 +
 fs/open.c   | 106 
 include/linux/fcntl.h   |  15 ++-
 include/linux/fs.h  |   4 +-
 include/linux/syscalls.h|  17 +++-
 include/uapi/asm-generic/unistd.h   |   4 +-
 include/uapi/linux/fcntl.h  |  42 
 24 files changed, 178 insertions(+), 30 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl 
b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..cebe813a947f 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
 541common  fsconfigsys_fsconfig
 542common  fsmount sys_fsmount
 543common  fspick  sys_fspick
+547common  openat2 sys_openat2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..2a0b94e595e7 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
 431common  fsconfigsys_fsconfig
 432common  fsmount sys_fsmount
 433common  fspick  sys_fspick
+437common  openat2 sys_openat2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index c9f8dd421c5f..40b8fec7ba55 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -33,7 +33,7 @@
 #define __ARM_NR_compat_set_tls(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls   434
+#define __NR_co

[PATCH v10 6/9] namei: aggressively check for nd->root escape on ".." resolution

2019-07-19 Thread Aleksa Sarai

This patch allows for LOOKUP_BENEATH and LOOKUP_IN_ROOT to safely permit
".." resolution (in the case of LOOKUP_BENEATH the resolution will still
fail if ".." resolution would resolve a path outside of the root --
while LOOKUP_IN_ROOT will chroot(2)-style scope it). Magic-link jumps
are still disallowed entirely because now they could result in
inconsistent behaviour if resolution encounters a subsequent ".."[*].

The need for this patch is explained by observing there is a fairly
easy-to-exploit race condition with chroot(2) (and thus by extension
LOOKUP_IN_ROOT and LOOKUP_BENEATH if ".." is allowed) where a rename(2)
of a path can be used to "skip over" nd->root and thus escape to the
filesystem above nd->root.

  thread1 [attacker]:
for (;;)
  renameat2(AT_FDCWD, "/a/b/c", AT_FDCWD, "/a/d", RENAME_EXCHANGE);
  thread2 [victim]:
for (;;)
  resolveat(dirb, "b/c/../../etc/shadow", RESOLVE_IN_ROOT);

With fairly significant regularity, thread2 will resolve to
"/etc/shadow" rather than "/a/b/etc/shadow". There is also a similar
(though somewhat more privileged) attack using MS_MOVE.

With this patch, such cases will be detected *during* ".." resolution
(which is the weak point of chroot(2) -- since walking *into* a
subdirectory tautologically cannot result in you walking *outside*
nd->root -- except through a bind-mount or magic-link). By detecting
this at ".." resolution (rather than checking only at the end of the
entire resolution) we can both correct escapes by jumping back to the
root (in the case of LOOKUP_IN_ROOT), as well as avoid revealing to
attackers the structure of the filesystem outside of the root (through
timing attacks for instance).

In order to avoid a quadratic lookup with each ".." entry, we only
activate the slow path if a write through &rename_lock or &mount_lock
has occurred during path resolution (&rename_lock and &mount_lock are
re-taken to further optimise the lookup). Since the primary attack being
protected against is MS_MOVE or rename(2), not doing additional checks
unless a mount or rename have occurred avoids making the common case
slow.

The use of path_is_under() here might seem suspect, but on further
inspection of the most important race (a path was *inside* the root but
is now *outside*), there appears to be no attack potential:

  * If path_is_under() occurs before the rename, then the path will be
resolved -- however the path was originally inside the root and thus
there is no escape (and to userspace it'd look like the rename
occurred after the path was resolved). If path_is_under() occurs
afterwards, the resolution is blocked.

  * Subsequent ".." jumps are guaranteed to check path_is_under() -- by
construction, &rename_lock or &mount_lock must have been taken by
the attacker after path_is_under() returned in the victim. Thus ".."
will not be able to escape from the previously-inside-root path.

  * Walking down in the moved path is still safe since the entire
subtree was moved (either by rename(2) or MS_MOVE) and because (as
discussed above) walking down is safe.

A variant of the above attack is included in the selftests for
openat2(2) later in this patch series. I've run this test on several
machines for several days and no instances of a breakout were detected.
While this is not concrete proof that this is safe, when combined with
the above argument it should lend some trustworthiness to this
construction.

[*] It may be acceptable in the future to do a path_is_under() check
after resolving a magic-link and permit resolution if the
nd_jump_link() result is still within the dirfd. However this seems
unlikely to be a feature that people *really* need* -- it can be
added later if it turns out a lot of people want it.

Cc: Al Viro 
Cc: Jann Horn 
Cc: Kees Cook 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c | 45 +++--
 1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 617fc7d55977..b3abf5754ddd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -491,7 +491,7 @@ struct nameidata {
struct path root;
struct inode*inode; /* path.dentry.d_inode */
unsigned intflags;
-   unsignedseq, m_seq;
+   unsignedseq, m_seq, r_seq;
int last_type;
unsigneddepth;
int total_link_count;
@@ -1758,22 +1758,36 @@ static inline int handle_dots(struct nameidata *nd, int 
type)
if (type == LAST_DOTDOT) {
int error = 0;
 
-   /*
-* LOOKUP_BENEATH resolving ".." is not currently safe -- races 
can
-* cause our parent to have moved outside of the root and us to 
skip
-* over it.
-*/
-   if (unlikely(nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT)))
-   return -EXDEV;
if (!nd->root.mnt) {

[PATCH v10 5/9] namei: LOOKUP_IN_ROOT: chroot-like path resolution

2019-07-19 Thread Aleksa Sarai

The primary motivation for the need for this flag is container runtimes
which have to interact with malicious root filesystems in the host
namespaces. One of the first requirements for a container runtime to be
secure against a malicious rootfs is that they correctly scope symlinks
(that is, they should be scoped as though they are chroot(2)ed into the
container's rootfs) and ".."-style paths[*]. The already-existing
LOOKUP_NO_XDEV and LOOKUP_NO_MAGICLINKS help defend against other
potential attacks in a malicious rootfs scenario.

Currently most container runtimes try to do this resolution in
userspace[1], causing many potential race conditions. In addition, the
"obvious" alternative (actually performing a {ch,pivot_}root(2))
requires a fork+exec (for some runtimes) which is *very* costly if
necessary for every filesystem operation involving a container.

[*] At the moment, ".." and magic-link jumping are disallowed for the
same reason it is disabled for LOOKUP_BENEATH -- currently it is not
safe to allow it. Future patches may enable it unconditionally once
we have resolved the possible races (for "..") and semantics (for
magic-link jumping).

The most significant *at(2) semantic change with LOOKUP_IN_ROOT is that
absolute pathnames no longer cause the dirfd to be ignored completely.

The rationale is that LOOKUP_IN_ROOT must necessarily chroot-scope
symlinks with absolute paths to dirfd, and so doing it for the base path
seems to be the most consistent behaviour (and also avoids foot-gunning
users who want to scope paths that are absolute).

[1]: https://github.com/cyphar/filepath-securejoin

Co-developed-by: Christian Brauner 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c| 41 +++--
 include/linux/namei.h |  1 +
 2 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 9df4aa35aedc..617fc7d55977 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -676,7 +676,7 @@ static int unlazy_walk(struct nameidata *nd)
goto out1;
if (!nd->root.mnt) {
/* Restart from path_init() if nd->root was cleared. */
-   if (nd->flags & LOOKUP_BENEATH)
+   if (nd->flags & (LOOKUP_BENEATH | LOOKUP_IN_ROOT))
goto out;
} else if (!(nd->flags & LOOKUP_ROOT)) {
if (unlikely(!legitimize_path(nd, &nd->root, nd->root_seq)))
@@ -809,10 +809,18 @@ static int complete_walk(struct nameidata *nd)
return status;
 }
 
-static void set_root(struct nameidata *nd)
+static int set_root(struct nameidata *nd)
 {
struct fs_struct *fs = current->fs;
 
+   /*
+* Jumping to the real root as part of LOOKUP_IN_ROOT is a BUG in namei,
+* but we still have to ensure it doesn't happen because it will cause a
+* breakout from the dirfd.
+*/
+   if (WARN_ON(nd->flags & LOOKUP_IN_ROOT))
+   return -ENOTRECOVERABLE;
+
if (nd->flags & LOOKUP_RCU) {
unsigned seq;
 
@@ -824,6 +832,7 @@ static void set_root(struct nameidata *nd)
} else {
get_fs_root(fs, &nd->root);
}
+   return 0;
 }
 
 static void path_put_conditional(struct path *path, struct nameidata *nd)
@@ -854,6 +863,11 @@ static int nd_jump_root(struct nameidata *nd)
if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
return -EXDEV;
}
+   if (!nd->root.mnt) {
+   int error = set_root(nd);
+   if (error)
+   return error;
+   }
if (nd->flags & LOOKUP_RCU) {
struct dentry *d;
nd->path = nd->root;
@@ -1100,15 +1114,13 @@ const char *get_link(struct nameidata *nd)
if (unlikely(nd->flags & LOOKUP_NO_MAGICLINKS))
return ERR_PTR(-ELOOP);
/* Not currently safe. */
-   if (unlikely(nd->flags & LOOKUP_BENEATH))
+   if (unlikely(nd->flags & (LOOKUP_BENEATH | 
LOOKUP_IN_ROOT)))
return ERR_PTR(-EXDEV);
}
if (IS_ERR_OR_NULL(res))
return res;
}
if (*res == '/') {
-   if (!nd->root.mnt)
-   set_root(nd);
error = nd_jump_root(nd);
if (unlikely(error))
return ERR_PTR(error);
@@ -1744,15 +1756,20 @@ static inline int may_lookup(struct nameidata *nd)
 static inline int handle_dots(struct nameidata *nd, int type)
 {
if (type == LAST_DOTDOT) {
+   int error = 0;
+
/*
 * LOOKUP_BENEATH resolving ".." is not currently safe -- races 
can
 * cause our parent to have moved outside of the root and us to 
skip
 * over it.
 */
-   if (unlikely(nd->fl

[PATCH v10 4/9] namei: O_BENEATH-style path resolution flags

2019-07-19 Thread Aleksa Sarai

Add the following flags to allow various restrictions on path resolution
(these affect the *entire* resolution, rather than just the final path
component -- as is the case with LOOKUP_FOLLOW).

The primary justification for these flags is to allow for programs to be
far more strict about how they want path resolution to handle symlinks,
mountpoint crossings, and paths that escape the dirfd (through an
absolute path or ".." shenanigans).

This is of particular concern to container runtimes that want to be very
careful about malicious root filesystems that a container's init might
have screwed around with (and there is no real way to protect against
this in userspace if you consider potential races against a malicious
container's init). More classical applications (which have their own
potentially buggy userspace path sanitisation code) include web servers,
archive extraction tools, network file servers, and so on.

These flags are exposed to userspace through openat2(2) in a later
patchset.

* LOOKUP_NO_XDEV: Disallow mount-point crossing (both *down* into one,
  or *up* from one). Both bind-mounts and cross-filesystem mounts are
  blocked by this flag. The naming is based on "find -xdev" as well as
  -EXDEV (though find(1) doesn't walk upwards, the semantics seem
  obvious).

* LOOKUP_NO_MAGICLINKS: Disallows ->get_link "symlink" (or rather,
  magic-link) jumping. This is a very specific restriction, and it
  exists because /proc/$pid/fd/... "symlinks" allow for access outside
  nd->root and pose risk to container runtimes that don't want to be
  tricked into accessing a host path (but do want to allow
  no-funny-business symlink resolution).

* LOOKUP_NO_SYMLINKS: Disallows resolution through symlinks of any kind
  (including magic-links).

* LOOKUP_BENEATH: Disallow "escapes" from the starting point of the
  filesystem tree during resolution (you must stay "beneath" the
  starting point at all times). Currently this is done by disallowing
  ".." and absolute paths (either in the given path or found during
  symlink resolution) entirely, as well as all magic-link jumping.

  The wholesale banning of ".." is because it is currently not safe to
  allow ".." resolution (races can cause the path to be moved outside of
  the root -- this is conceptually similar to historical chroot(2)
  escape attacks). Future patches in this series will address this, and
  will re-enable ".." resolution once it is safe. With those patches,
  ".." resolution will only be allowed if it remains in the root
  throughout resolution (such as "a/../b" not "a/../../outside/b").

  The banning of magic-link jumping is done because it is not clear
  whether semantically they should be allowed -- while some magic-links
  are safe there are many that can cause escapes (and once a
  resolution is outside of the root, O_BENEATH will no longer detect
  it). Future patches may re-enable magic-link jumping when such jumps
  would remain inside the root.

The LOOKUP_NO_*LINK flags return -ELOOP if path resolution would
violates their requirement, while the others all return -EXDEV.

This is a refresh of Al's AT_NO_JUMPS patchset[1] (which was a variation
on David Drysdale's O_BENEATH patchset[2], which in turn was based on
the Capsicum project[3]). Input from Linus and Andy in the AT_NO_JUMPS
thread[4] determined most of the API changes made in this refresh.

[1]: https://lwn.net/Articles/721443/
[2]: https://lwn.net/Articles/619151/
[3]: https://lwn.net/Articles/603929/
[4]: https://lwn.net/Articles/723057/

Cc: Christian Brauner 
Suggested-by: David Drysdale 
Suggested-by: Al Viro 
Suggested-by: Andy Lutomirski 
Suggested-by: Linus Torvalds 
Signed-off-by: Aleksa Sarai 
---
 fs/namei.c| 85 ---
 include/linux/namei.h |  7 
 2 files changed, 78 insertions(+), 14 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 165861d621c2..9df4aa35aedc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -674,7 +674,11 @@ static int unlazy_walk(struct nameidata *nd)
goto out2;
if (unlikely(!legitimize_path(nd, &nd->path, nd->seq)))
goto out1;
-   if (nd->root.mnt && !(nd->flags & LOOKUP_ROOT)) {
+   if (!nd->root.mnt) {
+   /* Restart from path_init() if nd->root was cleared. */
+   if (nd->flags & LOOKUP_BENEATH)
+   goto out;
+   } else if (!(nd->flags & LOOKUP_ROOT)) {
if (unlikely(!legitimize_path(nd, &nd->root, nd->root_seq)))
goto out;
}
@@ -843,6 +847,13 @@ static inline void path_to_nameidata(const struct path 
*path,
 
 static int nd_jump_root(struct nameidata *nd)
 {
+   if (unlikely(nd->flags & LOOKUP_BENEATH))
+   return -EXDEV;
+   if (unlikely(nd->flags & LOOKUP_NO_XDEV)) {
+   /* Absolute path arguments to path_init() are allowed. */
+   if (nd->path.mnt != NULL && nd->path.mnt != nd->root.mnt)
+

[PATCH v10 3/9] open: O_EMPTYPATH: procfs-less file descriptor re-opening

2019-07-19 Thread Aleksa Sarai

Userspace has made use of /proc/self/fd very liberally to allow for
descriptors to be re-opened. There are a wide variety of uses for this
feature, but it has always required constructing a pathname and could
not be done without procfs mounted. The obvious solution for this is to
extend openat(2) to have an AT_EMPTY_PATH-equivalent -- O_EMPTYPATH.

Now that descriptor re-opening has been made safe through the new
magic-link resolution restrictions, we can replicate these restrictions
for O_EMPTYPATH. In particular, we only allow "upgrading" the file
descriptor if the corresponding FMODE_PATH_* bit is set (or the
FMODE_{READ,WRITE} cases for non-O_PATH file descriptors).

When doing openat(O_EMPTYPATH|O_PATH), O_PATH takes precedence and
O_EMPTYPATH is ignored. Very few users ever have a need to O_PATH
re-open an existing file descriptor, and so accommodating them at the
expense of further complicating O_PATH makes little sense. Ultimately,
if users ask for this we can always add RESOLVE_EMPTY_PATH to
resolveat(2) in the future.

Signed-off-by: Aleksa Sarai 
---
 arch/alpha/include/uapi/asm/fcntl.h  |  1 +
 arch/parisc/include/uapi/asm/fcntl.h | 39 ++--
 arch/sparc/include/uapi/asm/fcntl.h  |  1 +
 fs/fcntl.c   |  2 +-
 fs/namei.c   | 20 ++
 fs/open.c|  7 -
 include/linux/fcntl.h|  2 +-
 include/uapi/asm-generic/fcntl.h |  4 +++
 8 files changed, 54 insertions(+), 22 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/fcntl.h 
b/arch/alpha/include/uapi/asm/fcntl.h
index 50bdc8e8a271..1f879bade68b 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -34,6 +34,7 @@
 
 #define O_PATH 04000
 #define __O_TMPFILE01
+#define O_EMPTYPATH02
 
 #define F_GETLK7
 #define F_SETLK8
diff --git a/arch/parisc/include/uapi/asm/fcntl.h 
b/arch/parisc/include/uapi/asm/fcntl.h
index 03ce20e5ad7d..5d709058a76f 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -2,26 +2,27 @@
 #ifndef _PARISC_FCNTL_H
 #define _PARISC_FCNTL_H
 
-#define O_APPEND   00010
-#define O_BLKSEEK  00100 /* HPUX only */
-#define O_CREAT00400 /* not fcntl */
-#define O_EXCL 02000 /* not fcntl */
-#define O_LARGEFILE04000
-#define __O_SYNC   00010
+#define O_APPEND   10
+#define O_BLKSEEK  000100 /* HPUX only */
+#define O_CREAT000400 /* not fcntl */
+#define O_EXCL 002000 /* not fcntl */
+#define O_LARGEFILE004000
+#define __O_SYNC   10
 #define O_SYNC (__O_SYNC|O_DSYNC)
-#define O_NONBLOCK 00024 /* HPUX has separate NDELAY & NONBLOCK */
-#define O_NOCTTY   00040 /* not fcntl */
-#define O_DSYNC00100 /* HPUX only */
-#define O_RSYNC00200 /* HPUX only */
-#define O_NOATIME  00400
-#define O_CLOEXEC  01000 /* set close_on_exec */
-
-#define O_DIRECTORY1 /* must be a directory */
-#define O_NOFOLLOW 00200 /* don't follow links */
-#define O_INVISIBLE00400 /* invisible I/O, for DMAPI/XDSM */
-
-#define O_PATH 02000
-#define __O_TMPFILE04000
+#define O_NONBLOCK 24 /* HPUX has separate NDELAY & NONBLOCK */
+#define O_NOCTTY   40 /* not fcntl */
+#define O_DSYNC000100 /* HPUX only */
+#define O_RSYNC000200 /* HPUX only */
+#define O_NOATIME  000400
+#define O_CLOEXEC  001000 /* set close_on_exec */
+
+#define O_DIRECTORY01 /* must be a directory */
+#define O_NOFOLLOW 000200 /* don't follow links */
+#define O_INVISIBLE000400 /* invisible I/O, for DMAPI/XDSM */
+
+#define O_PATH 002000
+#define __O_TMPFILE004000
+#define O_EMPTYPATH01
 
 #define F_GETLK64  8
 #define F_SETLK64  9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h 
b/arch/sparc/include/uapi/asm/fcntl.h
index 67dae75e5274..dc86c9eaf950 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -37,6 +37,7 @@
 
 #define O_PATH 0x100
 #define __O_TMPFILE0x200
+#define O_EMPTYPATH0x400
 
 #define F_GETOWN   5   /*  for sockets. */
 #define F_SETOWN   6   /*  for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 3d40771e8e7c..4cf05a2fd162 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1031,7 +1031,7 @@ static int __init fcntl_init(void)
 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
 * is defined as O_NONBLOCK on some platforms and not on others.
 */
-   BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+   BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(

[PATCH v10 2/9] procfs: switch magic-link modes to be more sane

2019-07-19 Thread Aleksa Sarai

Now that magic-link modes are obeyed for file re-opening purposes, some
of the pre-existing magic-link modes need to be adjusted to be more
semantically correct.

The most blatant example of this is /proc/self/exe, which had a mode of
a+rwx even though tautologically the file could never be opened for
writing (because it is the current->mm of a live process).

With the new O_PATH restrictions, changing the default mode of these
magic-links allows us to avoid delayed-access attacks such as we saw in
CVE-2019-5736.

Signed-off-by: Aleksa Sarai 
---
 fs/proc/base.c   | 20 ++--
 fs/proc/namespaces.c |  2 +-
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 255f6754c70d..82c06c21e69d 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -133,9 +133,9 @@ struct pid_entry {
 
 #define DIR(NAME, MODE, iops, fops)\
NOD(NAME, (S_IFDIR|(MODE)), &iops, &fops, {} )
-#define LNK(NAME, get_link)\
-   NOD(NAME, (S_IFLNK|S_IRWXUGO),  \
-   &proc_pid_link_inode_operations, NULL,  \
+#define LNK(NAME, MODE, get_link)  \
+   NOD(NAME, (S_IFLNK|(MODE)), \
+   &proc_pid_link_inode_operations, NULL,  \
{ .proc_get_link = get_link } )
 #define REG(NAME, MODE, fops)  \
NOD(NAME, (S_IFREG|(MODE)), NULL, &fops, {})
@@ -2995,9 +2995,9 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("numa_maps",  S_IRUGO, proc_pid_numa_maps_operations),
 #endif
REG("mem",S_IRUSR|S_IWUSR, proc_mem_operations),
-   LNK("cwd",proc_cwd_link),
-   LNK("root",   proc_root_link),
-   LNK("exe",proc_exe_link),
+   LNK("cwd",S_IRWXUGO, proc_cwd_link),
+   LNK("root",   S_IRWXUGO, proc_root_link),
+   LNK("exe",S_IRUGO|S_IXUGO, proc_exe_link),
REG("mounts", S_IRUGO, proc_mounts_operations),
REG("mountinfo",  S_IRUGO, proc_mountinfo_operations),
REG("mountstats", S_IRUSR, proc_mountstats_operations),
@@ -3393,11 +3393,11 @@ static const struct pid_entry tid_base_stuff[] = {
REG("numa_maps", S_IRUGO, proc_pid_numa_maps_operations),
 #endif
REG("mem",   S_IRUSR|S_IWUSR, proc_mem_operations),
-   LNK("cwd",   proc_cwd_link),
-   LNK("root",  proc_root_link),
-   LNK("exe",   proc_exe_link),
+   LNK("cwd",   S_IRWXUGO, proc_cwd_link),
+   LNK("root",  S_IRWXUGO, proc_root_link),
+   LNK("exe",   S_IRUGO|S_IXUGO, proc_exe_link),
REG("mounts",S_IRUGO, proc_mounts_operations),
-   REG("mountinfo",  S_IRUGO, proc_mountinfo_operations),
+   REG("mountinfo", S_IRUGO, proc_mountinfo_operations),
 #ifdef CONFIG_PROC_PAGE_MONITOR
REG("clear_refs", S_IWUSR, proc_clear_refs_operations),
REG("smaps", S_IRUGO, proc_pid_smaps_operations),
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..cd1e130913f7 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -94,7 +94,7 @@ static struct dentry *proc_ns_instantiate(struct dentry 
*dentry,
struct inode *inode;
struct proc_inode *ei;
 
-   inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRWXUGO);
+   inode = proc_pid_make_inode(dentry->d_sb, task, S_IFLNK | S_IRUGO);
if (!inode)
return ERR_PTR(-ENOENT);
 
-- 
2.22.0

[PATCH v10 1/9] namei: obey trailing magic-link DAC permissions

2019-07-19 Thread Aleksa Sarai

The ability for userspace to "re-open" file descriptors through
/proc/self/fd has been a very useful tool for all sorts of usecases
(container runtimes are one common example). However, the current
interface for doing this has resulted in some pretty subtle security
holes. Userspace can re-open a file descriptor with more permissions
than the original, which can result in cases such as /proc/$pid/exe
being re-opened O_RDWR at a later date even though (by definition)
/proc/$pid/exe cannot be opened for writing. When combined with O_PATH
the results can get even more confusing.

We cannot block this outright. Aside from userspace already depending on
it, it's a useful feature which can actually increase the security of
userspace. For instance, LXC keeps an O_PATH of the container's
/dev/pts/ptmx that gets re-opened to create new ptys and then uses
TIOCGPTPEER to get the slave end. This allows for pty allocation without
resolving paths inside an (untrusted) container's rootfs. There isn't a
trivial way of doing this that is as straight-forward and safe as O_PATH
re-opening.

Instead we have to restrict it in such a way that it doesn't break
(good) users but does block potential attackers. The solution applied in
this patch is to restrict *re-opening* (not resolution through)
magic-links by requiring that mode of the link be obeyed. Normal
symlinks have modes of a+rwx but magic-links have other modes. These
magic-link modes were historically ignored during path resolution, but
they've now been re-purposed for more useful ends.

It is also necessary to define semantics for the mode of an O_PATH
descriptor, since re-opening a magic-link through an O_PATH needs to be
just as restricted as the corresponding magic-link -- otherwise the
above protection can be bypassed. There are two distinct cases:

 1. The target is a regular file (not a magic-link). Userspace depends
on being able to re-open the O_PATH of a regular file, so we must
define the mode to be a+rwx.

 2. The target is a magic-link. In this case, we simply copy the mode of
the magic-link. This results in an O_PATH of a magic-link
effectively acting as a no-op in terms of how much re-opening
privileges a process has.

CAP_DAC_OVERRIDE can be used to override all of these restrictions, but
we only permit &init_userns's capabilities to affect these semantics.
The reason for this is that there isn't a clear way to track what
user_ns is the original owner of a given O_PATH chain -- thus an
unprivileged user could create a new userns and O_PATH the file
descriptor, owning it. All signs would indicate that the user really
does have CAP_DAC_OVERRIDE over the new descriptor and the protection
would be bypassed. We thus opt for the more conservative approach.

I have run this patch on several machines for several days. So far, the
only processes which have hit this case ("loadkeys" and "kbd_mode" from
the kbd package[1]) gracefully handle the permission error and do not
cause any user-visible problems. In order to give users a heads-up, a
warning is output to dmesg whenever may_open_magiclink() refuses access.

[1]: http://git.altlinux.org/people/legion/packages/kbd.git

Co-developed-by: Andy Lutomirski 
Co-developed-by: Christian Brauner 
Signed-off-by: Aleksa Sarai 
---
 Documentation/filesystems/path-lookup.rst |  12 +--
 fs/internal.h |   1 +
 fs/namei.c| 105 +++---
 fs/open.c |   3 +-
 fs/proc/fd.c  |  23 -
 include/linux/fs.h|   4 +
 include/linux/namei.h |   1 +
 7 files changed, 130 insertions(+), 19 deletions(-)

diff --git a/Documentation/filesystems/path-lookup.rst 
b/Documentation/filesystems/path-lookup.rst
index 434a07b0002b..a57d78ec8bee 100644
--- a/Documentation/filesystems/path-lookup.rst
+++ b/Documentation/filesystems/path-lookup.rst
@@ -1310,12 +1310,14 @@ longer needed.
 ``LOOKUP_JUMPED`` means that the current dentry was chosen not because
 it had the right name but for some other reason.  This happens when
 following "``..``", following a symlink to ``/``, crossing a mount point
-or accessing a "``/proc/$PID/fd/$FD``" symlink.  In this case the
-filesystem has not been asked to revalidate the name (with
-``d_revalidate()``).  In such cases the inode may still need to be
-revalidated, so ``d_op->d_weak_revalidate()`` is called if
+or accessing a "``/proc/$PID/fd/$FD``" symlink (also known as a "magic
+link"). In this case the filesystem has not been asked to revalidate the
+name (with ``d_revalidate()``).  In such cases the inode may still need
+to be revalidated, so ``d_op->d_weak_revalidate()`` is called if
 ``LOOKUP_JUMPED`` is set when the look completes - which may be at the
-final component or, when creating, unlinking, or renaming, at the penultimate 
component.
+final component or, when creating, unlinking, or renaming, at the
+penu

[PATCH v10 0/9] namei: openat2(2) path resolution restrictions

2019-07-19 Thread Aleksa Sarai

This patch is being developed here (with snapshots of each series
version being stashed in separate branches with names of the form
"resolveat/vX-summary"):


Patch changelog:
 v10:
* Ensure that unlazy_walk() will fail if we are in a scoped walk and
  the caller has zeroed nd->root (this happens in a few places, I'm
  not sure why because unlazy_walk() does legitimize_path()
  already). In this case we need to go through path_init() again to
  reset it (otherwise we will have a breakout because set_root()
  will breakout).
  * Also add a WARN_ON (and return -ENOTRECOVERABLE) if
LOOKUP_IN_ROOT is set and we are in set_root() -- which should
never happen and will cause a breakout.
* Make changes suggested by Al Viro:
  * Remove nd->{opath_mask,acc_mode} by moving all of the magic-link
permission logic be done after trailing_symlink() (with
trailing_magiclink()) only within path_openat().
  * Introduce LOOKUP_MAGICLINK_JUMPED to be able to detect
magic-link jumps done with nd_jump_link() (so we don't end up
blocking other LOOKUP_JUMPED cases).
  * Simplify all of the path_init() changes to make the code far
less confusing. dirfd_path_init() turns out to be un-necessary.
* Make openat2(2) also -EINVAL on unknown how->flags.
  [Dmitry V. Levin]
* Clean up bad definitions of O_EMPTYPATH on architectures where O_*
  flags are subtly different to .
* Switch away from passing a struct to build_open_flags() and
  instead just copy the one field we need to temporarily modify
  (how->flags). Also fix a bug in OPENHOW_MODE. [Rasmus Villemoes]
* Fix syscall linkages and switch to 437. [Arnd Bergmann]
* Clean up text in commit messages and the cover-letter.
  [Rolf Eike Beer]
* Fix openat2 selftest makefile. [Michael Ellerman]

The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1,2] patchset (which was a variant
of David Drysdale's O_BENEATH patchset[3] which was a spin-off of the
Capsicum project[4]) with a few additions and changes made based on the
previous discussion within [5] as well as others I felt were useful.

In line with the conclusions of the original discussion of AT_NO_JUMPS,
the flag has been split up into separate flags. However, instead of
being an openat(2) flag it is provided through a new syscall openat2(2)
which provides an alternative way to get an O_PATH file descriptor (the
reasoning for doing this is included in the patch description). The
following new LOOKUP_* flags are added:

  * LOOKUP_NO_XDEV blocks all mountpoint crossings (upwards, downwards,
or through absolute links). Absolute pathnames alone in openat(2) do
not trigger this.

  * LOOKUP_NO_MAGICLINKS blocks resolution through /proc/$pid/fd-style
links. This is done by blocking the usage of nd_jump_link() during
resolution in a filesystem. The term "magic-links" is used to match
with the only reference to these links in Documentation/, but I'm
happy to change the name.

It should be noted that this is different to the scope of
~LOOKUP_FOLLOW in that it applies to all path components. However,
you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
will *not* fail (assuming that no parent component was a
magic-link), and you will have an fd for the magic-link.

  * LOOKUP_BENEATH disallows escapes to outside the starting dirfd's
tree, using techniques such as ".." or absolute links. Absolute
paths in openat(2) are also disallowed. Conceptually this flag is to
ensure you "stay below" a certain point in the filesystem tree --
but this requires some additional to protect against various races
that would allow escape using "..".

Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
can trivially beam you around the filesystem (breaking the
protection). In future, there might be similar safety checks done as
in LOOKUP_IN_ROOT, but that requires more discussion.

In addition, two new flags are added that expand on the above ideas:

  * LOOKUP_NO_SYMLINKS does what it says on the tin. No symlink
resolution is allowed at all, including magic-links. Just as with
LOOKUP_NO_MAGICLINKS this can still be used with NOFOLLOW to open an
fd for the symlink as long as no parent path had a symlink
component.

  * LOOKUP_IN_ROOT is an extension of LOOKUP_BENEATH that, rather than
blocking attempts to move past the root, forces all such movements
to be scoped to the starting point. This provides chroot(2)-like
protection but without the cost of a chroot(2) for each filesystem
operation, as well as being safe against race attacks t

Re: [PATCH v2] powerpc: slightly improve cache helpers

2019-07-19 Thread Nathan Chancellor

On Fri, Jul 19, 2019 at 10:23:03AM -0500, Segher Boessenkool wrote:
> On Thu, Jul 18, 2019 at 08:24:56PM -0700, Nathan Chancellor wrote:
> > On Mon, Jul 08, 2019 at 11:49:52PM -0700, Nathan Chancellor wrote:
> > > On Tue, Jul 09, 2019 at 07:04:43AM +0200, Christophe Leroy wrote:
> > > > Is that a Clang bug ?
> > > 
> > > No idea, it happens with clang-8 and clang-9 though (pretty sure there
> > > were fixes for PowerPC in clang-8 so something before it probably won't
> > > work but I haven't tried).
> > > 
> > > > 
> > > > Do you have a disassembly of the code both with and without this patch 
> > > > in
> > > > order to compare ?
> > > 
> > > I can give you whatever disassembly you want (or I can upload the raw
> > > files if that is easier).
> > 
> > What disassembly/files did you need to start taking a look at this? I
> > can upload/send whatever you need.
> 
> A before and after of *only this patch*.  And then look at what changed;
> it maybe be obvious what is the problem to you, as well, so look at it
> yourself first?
> 
> 
> Segher

Thanks, I will go ahead and disassemble the full kernel given that those
helpers are everywhere and see what I can find. I'll reach out again if
I can't come up with anything.

Thanks for the advice!
Nathan

Re: [PATCH v3 5/6] fs/core/vmcore: Move sev_active() reference to x86 arch code

2019-07-19 Thread Thiago Jung Bauermann



Hello Lianbo,

lijiang  writes:

> 在 2019年07月19日 01:47, Lendacky, Thomas 写道:
>> On 7/17/19 10:28 PM, Thiago Jung Bauermann wrote:
>>> Secure Encrypted Virtualization is an x86-specific feature, so it shouldn't
>>> appear in generic kernel code because it forces non-x86 architectures to
>>> define the sev_active() function, which doesn't make a lot of sense.
>>>
>>> To solve this problem, add an x86 elfcorehdr_read() function to override
>>> the generic weak implementation. To do that, it's necessary to make
>>> read_from_oldmem() public so that it can be used outside of vmcore.c.
>>>
>>> Also, remove the export for sev_active() since it's only used in files that
>>> won't be built as modules.
>>>
>>> Signed-off-by: Thiago Jung Bauermann 
>> 
>> Adding Lianbo and Baoquan, who recently worked on this, for their review.
>> 
>
> This change looks good to me.
>
> Reviewed-by: Lianbo Jiang 

Thanks for your review!

-- 
Thiago Jung Bauermann
IBM Linux Technology Center

Re: [PATCH v3 0/6] Remove x86-specific code from generic headers

2019-07-19 Thread Thiago Jung Bauermann

Lendacky, Thomas  writes:

> On 7/18/19 2:44 PM, Thiago Jung Bauermann wrote:
>> 
>> Lendacky, Thomas  writes:
>> 
>>> On 7/17/19 10:28 PM, Thiago Jung Bauermann wrote:
 Hello,

 This version is mostly about splitting up patch 2/3 into three separate
 patches, as suggested by Christoph Hellwig. Two other changes are a fix in
 patch 1 which wasn't selecting ARCH_HAS_MEM_ENCRYPT for s390 spotted by
 Janani and removal of sme_active and sev_active symbol exports as suggested
 by Christoph Hellwig.

 These patches are applied on top of today's dma-mapping/for-next.

 I don't have a way to test SME, SEV, nor s390's PEF so the patches have 
 only
 been build tested.
>>>
>>> I'll try and get this tested quickly to be sure everything works for SME
>>> and SEV.
>
> Built and tested both SME and SEV and everything appears to be working
> well (not extensive testing, but should be good enough).

Great news. Thanks for testing!

-- 
Thiago Jung Bauermann
IBM Linux Technology Center

Re: [PATCH v2] powerpc: slightly improve cache helpers

2019-07-19 Thread Segher Boessenkool

On Thu, Jul 18, 2019 at 08:24:56PM -0700, Nathan Chancellor wrote:
> On Mon, Jul 08, 2019 at 11:49:52PM -0700, Nathan Chancellor wrote:
> > On Tue, Jul 09, 2019 at 07:04:43AM +0200, Christophe Leroy wrote:
> > > Is that a Clang bug ?
> > 
> > No idea, it happens with clang-8 and clang-9 though (pretty sure there
> > were fixes for PowerPC in clang-8 so something before it probably won't
> > work but I haven't tried).
> > 
> > > 
> > > Do you have a disassembly of the code both with and without this patch in
> > > order to compare ?
> > 
> > I can give you whatever disassembly you want (or I can upload the raw
> > files if that is easier).
> 
> What disassembly/files did you need to start taking a look at this? I
> can upload/send whatever you need.

A before and after of *only this patch*.  And then look at what changed;
it maybe be obvious what is the problem to you, as well, so look at it
yourself first?


Segher

Re: [PATCH v3 5/7] kexec_elf: remove elf_addr_to_cpu macro

2019-07-19 Thread Michael Ellerman

Sven Schnelle  writes:
> Hi Michael,
>
> On Thu, Jul 11, 2019 at 09:08:51PM +1000, Michael Ellerman wrote:
>> Sven Schnelle  writes:
>> > On Wed, Jul 10, 2019 at 05:09:29PM +0200, Christophe Leroy wrote:
>> >> Le 10/07/2019 à 16:29, Sven Schnelle a écrit :
>> >> > It had only one definition, so just use the function directly.
>> >> 
>> >> It had only one definition because it was for ppc64 only.
>> >> But as far as I understand (at least from the name of the new file), you
>> >> want it to be generic, don't you ? Therefore I get on 32 bits it would be
>> >> elf32_to_cpu().
>> >
>> > That brings up the question whether we need those endianess conversions. I 
>> > would
>> > assume that the ELF file has always the same endianess as the running 
>> > kernel. So
>> > i think we could just drop them. What do you think?
>> 
>> We should be able to kexec from big to little endian or vice versa, so
>> they are necessary.
>
> I'll update the patch to check for a needed 32/64 bit conversion during 
> runtime,
> so we can also kexec from 32 to 64 bit kernels and vice versa. Don't know
> whether that's possible on powerpc, but at least on parisc it is.

On some of the Freescale (NXP) machines that should actually be
possible, the hardware can run a 64 or 32-bit kernel, but I'm not sure
if anyone has actually tested kexec'ing from one to the other.

cheers

Re: [PATCH v3 0/6] Remove x86-specific code from generic headers

2019-07-19 Thread Lendacky, Thomas

On 7/18/19 2:44 PM, Thiago Jung Bauermann wrote:
> 
> Lendacky, Thomas  writes:
> 
>> On 7/17/19 10:28 PM, Thiago Jung Bauermann wrote:
>>> Hello,
>>>
>>> This version is mostly about splitting up patch 2/3 into three separate
>>> patches, as suggested by Christoph Hellwig. Two other changes are a fix in
>>> patch 1 which wasn't selecting ARCH_HAS_MEM_ENCRYPT for s390 spotted by
>>> Janani and removal of sme_active and sev_active symbol exports as suggested
>>> by Christoph Hellwig.
>>>
>>> These patches are applied on top of today's dma-mapping/for-next.
>>>
>>> I don't have a way to test SME, SEV, nor s390's PEF so the patches have only
>>> been build tested.
>>
>> I'll try and get this tested quickly to be sure everything works for SME
>> and SEV.

Built and tested both SME and SEV and everything appears to be working
well (not extensive testing, but should be good enough).

Thanks,
Tom

> 
> Thanks! And thanks for reviewing the patches.
>

Re: question on "powerpc/pseries/dma: Allow SWIOTLB"

2019-07-19 Thread Christoph Hellwig

On Fri, Jul 19, 2019 at 06:23:59PM +1000, Alexey Kardashevskiy wrote:
> It is getting there and I still do not see why "swiotlb=force" should not
> work if chosed in the cmdline.

Ok, makes sense.  But that means we also have the issue in a few
other places..

Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-19 Thread Christian Brauner

On Fri, Jul 19, 2019 at 09:13:16PM +1000, Michael Ellerman wrote:
> Christian Brauner  writes:
> > On Fri, Jul 19, 2019 at 08:18:02PM +1000, Michael Ellerman wrote:
> >> Christian Brauner  writes:
> >> > On Mon, Jul 15, 2019 at 03:56:04PM +0200, Christian Borntraeger wrote:
> >> >> I think Vasily already has a clone3 patch for s390x with 435. 
> >> >
> >> > A quick follow-up on this. Helge and Michael have asked whether there
> >> > are any tests for clone3. Yes, there will be and I try to have them
> >> > ready by the end of the this or next week for review. In the meantime I
> >> > hope the following minimalistic test program that just verifies very
> >> > very basic functionality (It's not pretty.) will help you test:
> >> 
> >> Hi Christian,
> >> 
> >> Thanks for the test.
> >> 
> >> This actually oopses on powerpc, it hits the BUG_ON in CHECK_FULL_REGS
> >> in process.c around line 1633:
> >> 
> >>} else {
> >>/* user thread */
> >>struct pt_regs *regs = current_pt_regs();
> >>CHECK_FULL_REGS(regs);
> >>*childregs = *regs;
> >>if (usp)
> >> 
> >> 
> >> So I'll have to dig into how we fix that before we wire up clone3.
> >> 
> >> Turns out testing is good! :)
> >
> > Indeed. I have a test-suite for clone3 in mind and I hope to have it
> > ready by the end of next week. It's just always the finding the time
> > part that is annoying. :)
> 
> I know the feeling!
> 
> > Thanks for digging into this, Michael!
> 
> No worries, happy to help where I can.
> 
> In the intervening five minutes I remembered how we handle this, we just
> need a little wrapper to save the non-volatile regs:
> 
> _GLOBAL(ppc_clone3)
>   bl  save_nvgprs
>   bl  sys_clone3
>   b   .Lsyscall_exit

Sounds good.

> 
> 
> A while back I meant to make it generate those automatically based on a
> flag in the syscall.tbl but of course haven't got around to it :)
> 
> So with the above it seems all good:
> 
> $ ./clone3 ; echo $?
> Parent process received child's pid 4204 as return value
> Parent process received child's pidfd 3
> Parent process received child's pid 4204 as return argument
> Child process with pid 4204
> 0
> 
> I'll send a patch to wire it up on Monday.

Excellent! Thank you!
Christian

Re: Crash in kvmppc_xive_release()

2019-07-19 Thread Cédric Le Goater

On 19/07/2019 13:20, Michael Ellerman wrote:
> Cédric Le Goater  writes:
>> On 18/07/2019 14:49, Michael Ellerman wrote:
>>> Anyone else seen this?
>>>
>>> This is running ~176 VMs on a Power9 (1 per thread), host crashes:
>>
>> This is beyond the underlying limits of XIVE. 
>>
>> As we allocate 2K vCPUs per VM, that is 16K EQs for interrupt events. The 
>> overall
>> EQ count is 1M. I let you calculate what is our max number of VMs ...
> 
> We need to fix it somehow, people will expect to be able to run a VM per
> thread.

we are limited by two spaces : VP space (1 << 19) system overall and 
EQ space (1 << 20 per chip, this one we could increase). But one of 
the big issue is the way we allocate the XIVE VPs in the XIVE devices.
As we have no idea of how much vCPUs we should provision for, we take 
the max: 2048 ...

If we had the maxcpu of the VM (from QEMU) or at least some hints on 
a rough figure, lets say a power of 2 [ 32 - 4096 ] CPUs, we would 
fragment less our VP space and increase a lot our #VMs per system.

It could be a kernel global, sysfs or what ever, a new KVM PPC control
on the VM to tune maxcpu, or a KVM device creation parameter. we 
could also register multiple KVM devices each having its maximum :
tiny (5), small (6), normal (8), big (11, default legacy), huge (12),
and create from QEMU the one we think fits the best.

I have to think this over. 

Nevertheless, I am trying to increase by 2 or 4 the XIVE spaces for
POWER10. 

C.

Re: Crash in kvmppc_xive_release()

2019-07-19 Thread Michael Ellerman

Cédric Le Goater  writes:
> On 18/07/2019 14:49, Michael Ellerman wrote:
>> Anyone else seen this?
>> 
>> This is running ~176 VMs on a Power9 (1 per thread), host crashes:
>
> This is beyond the underlying limits of XIVE. 
>
> As we allocate 2K vCPUs per VM, that is 16K EQs for interrupt events. The 
> overall
> EQ count is 1M. I let you calculate what is our max number of VMs ...

We need to fix it somehow, people will expect to be able to run a VM per
thread.

cheers

Re: [PATCH] powerpc/dma: Fix invalid DMA mmap behavior

2019-07-19 Thread Arnd Bergmann

On Thu, Jul 18, 2019 at 11:52 AM Christoph Hellwig  wrote:
>
> On Thu, Jul 18, 2019 at 10:49:34AM +0200, Christoph Hellwig wrote:
> > On Thu, Jul 18, 2019 at 01:45:16PM +1000, Oliver O'Halloran wrote:
> > > > Other than m68k, mips, and arm64, everybody else that doesn't have
> > > > ARCH_NO_COHERENT_DMA_MMAP set uses this default implementation, so
> > > > I assume this behavior is acceptable on those architectures.
> > >
> > > It might be acceptable, but there's no reason to use pgport_noncached
> > > if the platform supports cache-coherent DMA.
> > >
> > > Christoph (+cc) made the change so maybe he saw something we're missing.
> >
> > I always found the forcing of noncached access even for coherent
> > devices a little odd, but this was inherited from the previous
> > implementation, which surprised me a bit as the different attributes
> > are usually problematic even on x86.  Let me dig into the history a
> > bit more, but I suspect the righ fix is to default to cached mappings
> > for coherent devices.
>
> Ok, some history:
>
> The generic dma mmap implementation, which we are effectively still
> using today was added by:
>
> commit 64ccc9c033c6089b2d426dad3c56477ab066c999
> Author: Marek Szyprowski 
> Date:   Thu Jun 14 13:03:04 2012 +0200
>
> common: dma-mapping: add support for generic dma_mmap_* calls
>
> and unconditionally uses pgprot_noncached in dma_common_mmap, which is
> then used as the fallback by dma_mmap_attrs if no ->mmap method is
> present.  At that point we already had the powerpc implementation
> that only uses pgprot_noncached for non-coherent mappings, and
> the arm one, which uses pgprot_writecombine if DMA_ATTR_WRITE_COMBINE
> is set and otherwise pgprot_dmacoherent, which seems to be uncached.
> Arm did support coherent platforms at that time, but they might have
> been an afterthought and not handled properly.

Cache-coherent devices are still very rare on 32-bit ARM.

Among the callers of dma_mmap_coherent(), almost all are in platform
specific device drivers that only ever run on noncoherent ARM SoCs,
which explains why nobody would have noticed problems.

There is also a difference in behavior between ARM and PowerPC
when dealing with mismatched cacheability attributes: If the same
page is mapped as both cached and uncached to, this may
cause silent undefined behavior on ARM, while PowerPC should
enter a checkstop as soon as it notices.

  Arnd

Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-19 Thread Michael Ellerman

Christian Brauner  writes:
> On Fri, Jul 19, 2019 at 08:18:02PM +1000, Michael Ellerman wrote:
>> Christian Brauner  writes:
>> > On Mon, Jul 15, 2019 at 03:56:04PM +0200, Christian Borntraeger wrote:
>> >> I think Vasily already has a clone3 patch for s390x with 435. 
>> >
>> > A quick follow-up on this. Helge and Michael have asked whether there
>> > are any tests for clone3. Yes, there will be and I try to have them
>> > ready by the end of the this or next week for review. In the meantime I
>> > hope the following minimalistic test program that just verifies very
>> > very basic functionality (It's not pretty.) will help you test:
>> 
>> Hi Christian,
>> 
>> Thanks for the test.
>> 
>> This actually oopses on powerpc, it hits the BUG_ON in CHECK_FULL_REGS
>> in process.c around line 1633:
>> 
>>  } else {
>>  /* user thread */
>>  struct pt_regs *regs = current_pt_regs();
>>  CHECK_FULL_REGS(regs);
>>  *childregs = *regs;
>>  if (usp)
>> 
>> 
>> So I'll have to dig into how we fix that before we wire up clone3.
>> 
>> Turns out testing is good! :)
>
> Indeed. I have a test-suite for clone3 in mind and I hope to have it
> ready by the end of next week. It's just always the finding the time
> part that is annoying. :)

I know the feeling!

> Thanks for digging into this, Michael!

No worries, happy to help where I can.

In the intervening five minutes I remembered how we handle this, we just
need a little wrapper to save the non-volatile regs:

_GLOBAL(ppc_clone3)
bl  save_nvgprs
bl  sys_clone3
b   .Lsyscall_exit


A while back I meant to make it generate those automatically based on a
flag in the syscall.tbl but of course haven't got around to it :)

So with the above it seems all good:

$ ./clone3 ; echo $?
Parent process received child's pid 4204 as return value
Parent process received child's pidfd 3
Parent process received child's pid 4204 as return argument
Child process with pid 4204
0

I'll send a patch to wire it up on Monday.

cheers

Re: [PATCH v9 08/10] open: openat2(2) syscall

2019-07-19 Thread Christian Brauner

On Fri, Jul 19, 2019 at 05:12:18AM +0300, Dmitry V. Levin wrote:
> On Thu, Jul 18, 2019 at 11:29:50PM +0200, Arnd Bergmann wrote:
> [...]
> > 5. you get the same problem with seccomp and strace that
> >clone3() has -- these and others only track the register
> >arguments by default.
> 
> Just for the record, this is definitely not the case for strace:
> it decodes arrays, structures, netlink messages, and so on by default.

There sure is value in trying to design syscalls that can be handled
nicely by seccomp but that shouldn't become a burden on designing
extensible syscalls.
I suggested a session for Ksummit where we can discuss if and how we can
make seccomp more compatible with pointer-args in syscalls.

Christian

Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-19 Thread Christian Brauner

On Fri, Jul 19, 2019 at 08:18:02PM +1000, Michael Ellerman wrote:
> Christian Brauner  writes:
> > On Mon, Jul 15, 2019 at 03:56:04PM +0200, Christian Borntraeger wrote:
> >> I think Vasily already has a clone3 patch for s390x with 435. 
> >
> > A quick follow-up on this. Helge and Michael have asked whether there
> > are any tests for clone3. Yes, there will be and I try to have them
> > ready by the end of the this or next week for review. In the meantime I
> > hope the following minimalistic test program that just verifies very
> > very basic functionality (It's not pretty.) will help you test:
> 
> Hi Christian,
> 
> Thanks for the test.
> 
> This actually oopses on powerpc, it hits the BUG_ON in CHECK_FULL_REGS
> in process.c around line 1633:
> 
>   } else {
>   /* user thread */
>   struct pt_regs *regs = current_pt_regs();
>   CHECK_FULL_REGS(regs);
>   *childregs = *regs;
>   if (usp)
> 
> 
> So I'll have to dig into how we fix that before we wire up clone3.
> 
> Turns out testing is good! :)

Indeed. I have a test-suite for clone3 in mind and I hope to have it
ready by the end of next week. It's just always the finding the time
part that is annoying. :)

Thanks for digging into this, Michael!
Christian

Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-19 Thread Michael Ellerman

Christian Brauner  writes:
> On Mon, Jul 15, 2019 at 03:56:04PM +0200, Christian Borntraeger wrote:
>> I think Vasily already has a clone3 patch for s390x with 435. 
>
> A quick follow-up on this. Helge and Michael have asked whether there
> are any tests for clone3. Yes, there will be and I try to have them
> ready by the end of the this or next week for review. In the meantime I
> hope the following minimalistic test program that just verifies very
> very basic functionality (It's not pretty.) will help you test:

Hi Christian,

Thanks for the test.

This actually oopses on powerpc, it hits the BUG_ON in CHECK_FULL_REGS
in process.c around line 1633:

} else {
/* user thread */
struct pt_regs *regs = current_pt_regs();
CHECK_FULL_REGS(regs);
*childregs = *regs;
if (usp)


So I'll have to dig into how we fix that before we wire up clone3.

Turns out testing is good! :)

cheers

Re: [PATCH 2/3] DMA mapping: Move SME handling to x86-specific files

2019-07-19 Thread kbuild test robot

Hi Thiago,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[cannot apply to v5.2 next-20190718]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Thiago-Jung-Bauermann/Remove-x86-specific-code-from-generic-headers/20190715-063006
config: s390-allnoconfig (attached as .config)
compiler: s390-linux-gcc (GCC) 7.4.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.4.0 make.cross ARCH=s390 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot 

All errors (new ones prefixed by >>):

   kernel/dma/swiotlb.c: In function 'swiotlb_tbl_map_single':
>> kernel/dma/swiotlb.c:461:6: error: implicit declaration of function 
>> 'mem_encrypt_active'; did you mean 'set_cpu_active'? 
>> [-Werror=implicit-function-declaration]
 if (mem_encrypt_active())
 ^~
 set_cpu_active
   cc1: some warnings being treated as errors

vim +461 kernel/dma/swiotlb.c

1b548f667c1487d lib/swiotlb.c   Jeremy Fitzhardinge   2008-12-16  442  
e05ed4d1fad9e73 lib/swiotlb.c   Alexander Duyck   2012-10-15  443  
phys_addr_t swiotlb_tbl_map_single(struct device *hwdev,
e05ed4d1fad9e73 lib/swiotlb.c   Alexander Duyck   2012-10-15  444   
   dma_addr_t tbl_dma_addr,
e05ed4d1fad9e73 lib/swiotlb.c   Alexander Duyck   2012-10-15  445   
   phys_addr_t orig_addr, size_t size,
0443fa003fa199f lib/swiotlb.c   Alexander Duyck   2016-11-02  446   
   enum dma_data_direction dir,
0443fa003fa199f lib/swiotlb.c   Alexander Duyck   2016-11-02  447   
   unsigned long attrs)
^1da177e4c3f415 arch/ia64/lib/swiotlb.c Linus Torvalds2005-04-16  448  {
^1da177e4c3f415 arch/ia64/lib/swiotlb.c Linus Torvalds2005-04-16  449   
unsigned long flags;
e05ed4d1fad9e73 lib/swiotlb.c   Alexander Duyck   2012-10-15  450   
phys_addr_t tlb_addr;
^1da177e4c3f415 arch/ia64/lib/swiotlb.c Linus Torvalds2005-04-16  451   
unsigned int nslots, stride, index, wrap;
^1da177e4c3f415 arch/ia64/lib/swiotlb.c Linus Torvalds2005-04-16  452   
int i;
681cc5cd3efbeaf lib/swiotlb.c   FUJITA Tomonori   2008-02-04  453   
unsigned long mask;
681cc5cd3efbeaf lib/swiotlb.c   FUJITA Tomonori   2008-02-04  454   
unsigned long offset_slots;
681cc5cd3efbeaf lib/swiotlb.c   FUJITA Tomonori   2008-02-04  455   
unsigned long max_slots;
53b29c336830db4 kernel/dma/swiotlb.cDongli Zhang  2019-04-12  456   
unsigned long tmp_io_tlb_used;
681cc5cd3efbeaf lib/swiotlb.c   FUJITA Tomonori   2008-02-04  457  
ac2cbab21f318e1 lib/swiotlb.c   Yinghai Lu2013-01-24  458   
if (no_iotlb_memory)
ac2cbab21f318e1 lib/swiotlb.c   Yinghai Lu2013-01-24  459   
panic("Can not allocate SWIOTLB buffer earlier and can't now provide 
you with the DMA bounce buffer");
ac2cbab21f318e1 lib/swiotlb.c   Yinghai Lu2013-01-24  460  
d7b417fa08d1187 lib/swiotlb.c   Tom Lendacky  2017-10-20 @461   
if (mem_encrypt_active())
aa4d0dc3e029b79 kernel/dma/swiotlb.cThiago Jung Bauermann 2019-07-12  462   
pr_warn_once("Memory encryption is active and system is using DMA 
bounce buffers\n");
648babb7078c631 lib/swiotlb.c   Tom Lendacky  2017-07-17  463  
681cc5cd3efbeaf lib/swiotlb.c   FUJITA Tomonori   2008-02-04  464   
mask = dma_get_seg_boundary(hwdev);
681cc5cd3efbeaf lib/swiotlb.c   FUJITA Tomonori   2008-02-04  465  
eb605a5754d050a lib/swiotlb.c   FUJITA Tomonori   2010-05-10  466   
tbl_dma_addr &= mask;
eb605a5754d050a lib/swiotlb.c   FUJITA Tomonori   2010-05-10  467  
eb605a5754d050a lib/swiotlb.c   FUJITA Tomonori   2010-05-10  468   
offset_slots = ALIGN(tbl_dma_addr, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
a5ddde4a558b3bd lib/swiotlb.c   Ian Campbell  2008-12-16  469  
a5ddde4a558b3bd lib/swiotlb.c   Ian Campbell  2008-12-16  470   
/*
a5ddde4a558b3bd lib/swiotlb.c   Ian Campbell  2008-12-16  471   
 * Carefully handle integer overflow which can occur when mask == ~0UL.
a5ddde4a558b3bd lib/swiotlb.c   Ian Campbell  2008-12-16  472   
 */
b15a3891c916f32 lib/swiotlb.c   Jan Beulich   2008-03-13  473   
max_slots = mask + 1
b15a3891c916f32 lib/swiotlb.c   Jan Beulich   2008-03-13  474   
? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
b15a3891c916f32 lib/swiotlb.c   Jan Beulich   200

Re: question on "powerpc/pseries/dma: Allow SWIOTLB"

2019-07-19 Thread Alexey Kardashevskiy





On 19/07/2019 18:05, Christoph Hellwig wrote:

On Fri, Jul 19, 2019 at 06:00:25PM +1000, Alexey Kardashevskiy wrote:

But shouldn't we force usage of the direct ops in that case as the
IOMMU is not neededed at all?


We do, for mappings, but not unmappings and syncing.


Well, I mean as in literally not setting a dma_ops so that the
dma_direct code is used without the indirection through the iommu ops.
This is not only more obvious, but also faster as you avoid the
indirect call (although that probably doesn't matter much if you
are bounce buffering anyway).


I was not precise. We cannot avoid IOMMU in the guest for passed through 
devices.




Also isn't that support non-upstream so far?


How is this relevant? I expect the existing "swiotlb=force" just work.


I though the whole secure VM support was still not upstream.


It is getting there and I still do not see why "swiotlb=force" should 
not work if chosed in the cmdline.



--
Alexey

Re: [PATCH v5 1/7] kvmppc: HMM backend driver to manage pages of secure guest

2019-07-19 Thread Bharata B Rao

On Thu, Jul 18, 2019 at 11:46:41PM -0700, Christoph Hellwig wrote:
> On Thu, Jul 11, 2019 at 10:38:48AM +0530, Bharata B Rao wrote:
> > Hmmm... I still find it in upstream, guess it will be removed soon?
> > 
> > I find the below commit in mmotm.
> 
> Please take a look at the latest hmm code in mainline, there have
> also been other significant changes as well.

Yes, my next version of this patchset will be based on those recent
HMM related changes.

Regards,
Bharata.

Re: question on "powerpc/pseries/dma: Allow SWIOTLB"

2019-07-19 Thread Christoph Hellwig

On Fri, Jul 19, 2019 at 06:00:25PM +1000, Alexey Kardashevskiy wrote:
> > But shouldn't we force usage of the direct ops in that case as the
> > IOMMU is not neededed at all?
> 
> We do, for mappings, but not unmappings and syncing.

Well, I mean as in literally not setting a dma_ops so that the
dma_direct code is used without the indirection through the iommu ops.
This is not only more obvious, but also faster as you avoid the
indirect call (although that probably doesn't matter much if you
are bounce buffering anyway).

> > Also isn't that support non-upstream so far?
> 
> How is this relevant? I expect the existing "swiotlb=force" just work.

I though the whole secure VM support was still not upstream.

Re: question on "powerpc/pseries/dma: Allow SWIOTLB"

2019-07-19 Thread Alexey Kardashevskiy





On 19/07/2019 17:53, Christoph Hellwig wrote:

On Fri, Jul 19, 2019 at 05:52:37PM +1000, Alexey Kardashevskiy wrote:



On 19/07/2019 17:10, Christoph Hellwig wrote:

Hey Alexey,

what is the use case for the above commit?  Shouldn't we handle all
addressing limits using the iommu?


Our secure VMs is the use case, when only a fraction of system memory is
available for DMA.


But shouldn't we force usage of the direct ops in that case as the
IOMMU is not neededed at all?


We do, for mappings, but not unmappings and syncing.


Also isn't that support non-upstream so far?


How is this relevant? I expect the existing "swiotlb=force" just work.


--
Alexey

Re: question on "powerpc/pseries/dma: Allow SWIOTLB"

2019-07-19 Thread Christoph Hellwig

On Fri, Jul 19, 2019 at 05:52:37PM +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 19/07/2019 17:10, Christoph Hellwig wrote:
> > Hey Alexey,
> > 
> > what is the use case for the above commit?  Shouldn't we handle all
> > addressing limits using the iommu?
> 
> Our secure VMs is the use case, when only a fraction of system memory is
> available for DMA.

But shouldn't we force usage of the direct ops in that case as the
IOMMU is not neededed at all?  Also isn't that support non-upstream so
far?

Re: question on "powerpc/pseries/dma: Allow SWIOTLB"

2019-07-19 Thread Alexey Kardashevskiy





On 19/07/2019 17:10, Christoph Hellwig wrote:

Hey Alexey,

what is the use case for the above commit?  Shouldn't we handle all
addressing limits using the iommu?


Our secure VMs is the use case, when only a fraction of system memory is 
available for DMA.



--
Alexey

Re: [PATCH] powerpc/dma: Fix invalid DMA mmap behavior

2019-07-19 Thread Shawn Anastasio

On 7/19/19 2:06 AM, Christoph Hellwig wrote:
> What is inherently architecture specific here over the fact that
> the pgprot_* expand to architecture specific bits?

What I meant is that different architectures seem to have different
criteria for setting the different pgprot_ bits. i.e. ppc checks
for cache coherency, arm64 checks for cache coherency and
writecombine, mips just checks for writecombine, etc.

That being said, I'm no expert here and there is probably some behavior
here that would make for a much more sane default.

> I'd rather not create boilerplate code where we don't have to it. Note
> that arch_dma_mmap_pgprot is a little misnamed now, as we also use it
> for the in-kernel remapping in kernel/dma/remap.c, which I'm slowly
> switching a lot of architectures to.  powerpc will follow soon once
> I get the ppc44x that was given to me to actually boot with a recent
> kernel (not that I've tried much so far).

Fair enough. I didn't realize that most of the other architectures
don't use the common code anyways as you mention below.

> Every arch except for arm32 now uses dma direct for the direct mapping,
> and thus the common code without the indeed odd default.  I also have
> a series to remove the default fallback, which is inherently a bad idea:
>
> 
http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-no-defaults

Awesome. This is great to hear.

> I think your patch that started this thread is fine for 5.3 and -stable:
>
> Reviewed-by: Christoph Hellwig 

Thanks!

> But going forward I'd rather have a sane default.

Agreed.

List etiquette

2019-07-19 Thread Stephen Rothwell

Hi all,

Just a short note to mention a couple of things:

- please do *not* post html formatted emails.  I may start just
rejecting them soon.

- please cut down the quoting in replies to what is needed for
context.

Thanks
-- 
Cheers,
Stephen Rothwell


pgpOWZXb9MRd1.pgp
Description: OpenPGP digital signature

Re: [PATCH v4 00/25] Add FADump support on PowerNV platform

2019-07-19 Thread Hari Bathini

Sorry, I missed mentioning that this patchset is based on top of upstream kernel
plus the below patches:

   https://patchwork.ozlabs.org/patch/1123582/
    https://patchwork.ozlabs.org/patch/1123583/ 


On 16/07/19 5:01 PM, Hari Bathini wrote:
> Firmware-Assisted Dump (FADump) is currently supported only on pSeries
> platform. This patch series adds support for PowerNV platform too.
>
> The first few patches refactor the FADump code to make use of common
> code across multiple platforms. Then basic FADump support is added for
> PowerNV platform. Followed by patches to honour reserved-ranges DT node
> while reserving/releasing memory used by FADump. The subsequent patch
> processes CPU state data provided by firmware to create and append core
> notes to the ELF core file and the next patch adds support to preserve
> crash data for subsequent boots (useful in cases like petitboot). The
> subsequent patches add support to export opalcore. opalcore makes
> debugging of failures in OPAL code easier. Firmware-Assisted Dump
> documentation is also updated appropriately.
>
> The patch series is tested with the latest firmware plus the below skiboot
> changes for MPIPL support:
>
> https://patchwork.ozlabs.org/project/skiboot/list/?series=119169
> ("MPIPL support")
>
>
> Changes in v4:
>   * Split the patches.
>   * Rebased to latest upstream kernel version.
>   * Updated according to latest OPAL changes.
>
> ---
>
> Hari Bathini (25):
>   powerpc/fadump: move internal macros/definitions to a new header
>   powerpc/fadump: move internal code to a new file
>   powerpc/fadump: Improve fadump documentation
>   pseries/fadump: move rtas specific definitions to platform code
>   pseries/fadump: introduce callbacks for platform specific operations
>   pseries/fadump: define register/un-register callback functions
>   pseries/fadump: move out platform specific support from generic code
>   powerpc/fadump: use FADump instead of fadump for how it is pronounced
>   opal: add MPIPL interface definitions
>   powernv/fadump: add fadump support on powernv
>   powernv/fadump: register kernel metadata address with opal
>   powernv/fadump: define register/un-register callback functions
>   powernv/fadump: support copying multiple kernel memory regions
>   powernv/fadump: process the crashdump by exporting it as /proc/vmcore
>   powerpc/fadump: Update documentation about OPAL platform support
>   powerpc/fadump: consider reserved ranges while reserving memory
>   powerpc/fadump: consider reserved ranges while releasing memory
>   powernv/fadump: process architected register state data provided by 
> firmware
>   powernv/fadump: add support to preserve crash data on FADUMP disabled 
> kernel
>   powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP
>   powernv/opalcore: export /sys/firmware/opal/core for analysing opal 
> crashes
>   powernv/fadump: Warn before processing partial crashdump
>   powernv/opalcore: provide an option to invalidate 
> /sys/firmware/opal/core file
>   powernv/fadump: consider f/w load area
>   powernv/fadump: update documentation about option to release opalcore
>
>
>  Documentation/powerpc/firmware-assisted-dump.txt |  224 +++-
>  arch/powerpc/Kconfig |   23 
>  arch/powerpc/include/asm/fadump.h|  190 
>  arch/powerpc/include/asm/opal-api.h  |   50 +
>  arch/powerpc/include/asm/opal.h  |6 
>  arch/powerpc/kernel/Makefile |6 
>  arch/powerpc/kernel/fadump-common.c  |  153 +++
>  arch/powerpc/kernel/fadump-common.h  |  203 
>  arch/powerpc/kernel/fadump.c | 1181 
> --
>  arch/powerpc/kernel/prom.c   |4 
>  arch/powerpc/platforms/powernv/Makefile  |3 
>  arch/powerpc/platforms/powernv/opal-call.c   |3 
>  arch/powerpc/platforms/powernv/opal-core.c   |  637 
>  arch/powerpc/platforms/powernv/opal-fadump.c |  671 
>  arch/powerpc/platforms/powernv/opal-fadump.h |  154 +++
>  arch/powerpc/platforms/pseries/Makefile  |1 
>  arch/powerpc/platforms/pseries/rtas-fadump.c |  595 +++
>  arch/powerpc/platforms/pseries/rtas-fadump.h |  123 ++
>  18 files changed, 3231 insertions(+), 996 deletions(-)
>  create mode 100644 arch/powerpc/kernel/fadump-common.c
>  create mode 100644 arch/powerpc/kernel/fadump-common.h
>  create mode 100644 arch/powerpc/platforms/powernv/opal-core.c
>  create mode 100644 arch/powerpc/platforms/powernv/opal-fadump.c
>  create mode 100644 arch/powerpc/platforms/powernv/opal-fadump.h
>  create mode 100644 arch/powerpc/platforms/pseries/rtas-fadump.c
>  create mode 100644 arch/powerpc/platforms/pseries/rtas-fadump.h
>

Re: [PATCH v3 5/6] fs/core/vmcore: Move sev_active() reference to x86 arch code

2019-07-19 Thread lijiang

在 2019年07月19日 01:47, Lendacky, Thomas 写道:
> On 7/17/19 10:28 PM, Thiago Jung Bauermann wrote:
>> Secure Encrypted Virtualization is an x86-specific feature, so it shouldn't
>> appear in generic kernel code because it forces non-x86 architectures to
>> define the sev_active() function, which doesn't make a lot of sense.
>>
>> To solve this problem, add an x86 elfcorehdr_read() function to override
>> the generic weak implementation. To do that, it's necessary to make
>> read_from_oldmem() public so that it can be used outside of vmcore.c.
>>
>> Also, remove the export for sev_active() since it's only used in files that
>> won't be built as modules.
>>
>> Signed-off-by: Thiago Jung Bauermann 
> 
> Adding Lianbo and Baoquan, who recently worked on this, for their review.
> 

This change looks good to me.

Reviewed-by: Lianbo Jiang 

Thanks.
Lianbo

> Thanks,
> Tom
> 
>> ---
>>  arch/x86/kernel/crash_dump_64.c |  5 +
>>  arch/x86/mm/mem_encrypt.c   |  1 -
>>  fs/proc/vmcore.c|  8 
>>  include/linux/crash_dump.h  | 14 ++
>>  include/linux/mem_encrypt.h |  1 -
>>  5 files changed, 23 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kernel/crash_dump_64.c 
>> b/arch/x86/kernel/crash_dump_64.c
>> index 22369dd5de3b..045e82e8945b 100644
>> --- a/arch/x86/kernel/crash_dump_64.c
>> +++ b/arch/x86/kernel/crash_dump_64.c
>> @@ -70,3 +70,8 @@ ssize_t copy_oldmem_page_encrypted(unsigned long pfn, char 
>> *buf, size_t csize,
>>  {
>>  return __copy_oldmem_page(pfn, buf, csize, offset, userbuf, true);
>>  }
>> +
>> +ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
>> +{
>> +return read_from_oldmem(buf, count, ppos, 0, sev_active());
>> +}
>> diff --git a/arch/x86/mm/mem_encrypt.c b/arch/x86/mm/mem_encrypt.c
>> index 7139f2f43955..b1e823441093 100644
>> --- a/arch/x86/mm/mem_encrypt.c
>> +++ b/arch/x86/mm/mem_encrypt.c
>> @@ -349,7 +349,6 @@ bool sev_active(void)
>>  {
>>  return sme_me_mask && sev_enabled;
>>  }
>> -EXPORT_SYMBOL(sev_active);
>>  
>>  /* Override for DMA direct allocation check - 
>> ARCH_HAS_FORCE_DMA_UNENCRYPTED */
>>  bool force_dma_unencrypted(struct device *dev)
>> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
>> index 57957c91c6df..ca1f20bedd8c 100644
>> --- a/fs/proc/vmcore.c
>> +++ b/fs/proc/vmcore.c
>> @@ -100,9 +100,9 @@ static int pfn_is_ram(unsigned long pfn)
>>  }
>>  
>>  /* Reads a page from the oldmem device from given offset. */
>> -static ssize_t read_from_oldmem(char *buf, size_t count,
>> -u64 *ppos, int userbuf,
>> -bool encrypted)
>> +ssize_t read_from_oldmem(char *buf, size_t count,
>> + u64 *ppos, int userbuf,
>> + bool encrypted)
>>  {
>>  unsigned long pfn, offset;
>>  size_t nr_bytes;
>> @@ -166,7 +166,7 @@ void __weak elfcorehdr_free(unsigned long long addr)
>>   */
>>  ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos)
>>  {
>> -return read_from_oldmem(buf, count, ppos, 0, sev_active());
>> +return read_from_oldmem(buf, count, ppos, 0, false);
>>  }
>>  
>>  /*
>> diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
>> index f774c5eb9e3c..4664fc1871de 100644
>> --- a/include/linux/crash_dump.h
>> +++ b/include/linux/crash_dump.h
>> @@ -115,4 +115,18 @@ static inline int vmcore_add_device_dump(struct 
>> vmcoredd_data *data)
>>  return -EOPNOTSUPP;
>>  }
>>  #endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
>> +
>> +#ifdef CONFIG_PROC_VMCORE
>> +ssize_t read_from_oldmem(char *buf, size_t count,
>> + u64 *ppos, int userbuf,
>> + bool encrypted);
>> +#else
>> +static inline ssize_t read_from_oldmem(char *buf, size_t count,
>> +   u64 *ppos, int userbuf,
>> +   bool encrypted)
>> +{
>> +return -EOPNOTSUPP;
>> +}
>> +#endif /* CONFIG_PROC_VMCORE */
>> +
>>  #endif /* LINUX_CRASHDUMP_H */
>> diff --git a/include/linux/mem_encrypt.h b/include/linux/mem_encrypt.h
>> index 0c5b0ff9eb29..5c4a18a91f89 100644
>> --- a/include/linux/mem_encrypt.h
>> +++ b/include/linux/mem_encrypt.h
>> @@ -19,7 +19,6 @@
>>  #else   /* !CONFIG_ARCH_HAS_MEM_ENCRYPT */
>>  
>>  static inline bool mem_encrypt_active(void) { return false; }
>> -static inline bool sev_active(void) { return false; }
>>  
>>  #endif  /* CONFIG_ARCH_HAS_MEM_ENCRYPT */
>>  
>>

question on "powerpc/pseries/dma: Allow SWIOTLB"

2019-07-19 Thread Christoph Hellwig

Hey Alexey,

what is the use case for the above commit?  Shouldn't we handle all
addressing limits using the iommu?

Re: Crash in kvmppc_xive_release()

2019-07-19 Thread Michael Ellerman

Cédric Le Goater  writes:
> On 18/07/2019 15:14, Cédric Le Goater wrote:
>> On 18/07/2019 14:49, Michael Ellerman wrote:
>>> Anyone else seen this?
>>>
>>> This is running ~176 VMs on a Power9 (1 per thread), host crashes:
>> 
>> This is beyond the underlying limits of XIVE. 
>> 
>> As we allocate 2K vCPUs per VM, that is 16K EQs for interrupt events. The 
>> overall
>> EQ count is 1M. I let you calculate what is our max number of VMs ...
>> 
>>>   [   66.403750][ T6423] xive: OPAL failed to allocate VCPUs order 11, err 
>>> -10
>> 
>> Hence, the OPAL XIVE driver fails which is good but ...
>> 
>>>   [188523.080935670,4] Spent 1783 msecs in OPAL call 135!
>>>   [   66.484965][ T6250] BUG: Kernel NULL pointer dereference at 0x42e8
>>>   [   66.485558][ T6250] Faulting instruction address: 0xc00811a33fcc
>>>   [   66.485990][ T6250] Oops: Kernel access of bad area, sig: 7 [#1]
>>>   [   66.486405][ T6250] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP 
>>> NR_CPUS=2048 NUMA PowerNV
>>>   [   66.486967][ T6250] Modules linked in: kvm_hv kvm
>>>   [   66.487275][ T6250] CPU: 107 PID: 6250 Comm: qemu-system-ppc Not 
>>> tainted 5.2.0-rc2-gcc9x-gf5a9e488d623 #1
>>>   [   66.487902][ T6250] NIP:  c00811a33fcc LR: c00811a33fc4 CTR: 
>>> c05d5970
>>>   [   66.488383][ T6250] REGS: c01fabebb900 TRAP: 0300   Not tainted  
>>> (5.2.0-rc2-gcc9x-gf5a9e488d623)
>>>   [   66.488933][ T6250] MSR:  9280b033 
>>>   CR: 24028224  XER: 
>>>   [   66.489724][ T6250] CFAR: c05d6a4c DAR: 42e8 
>>> DSISR: 0008 IRQMASK: 0 
>>>   [   66.489724][ T6250] GPR00: c00811a33fc4 c01fabebbb90 
>>> c00811a5a200 c1399928 
>>>   [   66.489724][ T6250] GPR04: 0001 c047b8d0 
>>>  0001 
>>>   [   66.489724][ T6250] GPR08:   
>>> c01fa8c42f00 c00811a3af20 
>>>   [   66.489724][ T6250] GPR12: 8000 c0002023ff65a880 
>>> 00013a1b4000 0002 
>>>   [   66.489724][ T6250] GPR16: 1000 0002 
>>> 0001 00012b194cc0 
>>>   [   66.489724][ T6250] GPR20: 7fffb1645250 0001 
>>> 0031  
>>>   [   66.489724][ T6250] GPR24: 7fffb16408d8 c01ffafb62e0 
>>> c01f78699360 c01ff35d0620 
>>>   [   66.489724][ T6250] GPR28: c01ed0ed c01ecd90 
>>>  c01ed0ed 
>>>   [   66.495211][ T6250] NIP [c00811a33fcc] 
>>> kvmppc_xive_release+0x54/0x1b0 [kvm]
>>>   [   66.495642][ T6250] LR [c00811a33fc4] 
>>> kvmppc_xive_release+0x4c/0x1b0 [kvm]
>>>   [   66.496101][ T6250] Call Trace:
>>>   [   66.496314][ T6250] [c01fabebbb90] [c00811a33fc4] 
>>> kvmppc_xive_release+0x4c/0x1b0 [kvm] (unreliable)
>>>   [   66.496893][ T6250] [c01fabebbbf0] [c00811a18d54] 
>>> kvm_device_release+0xac/0xf0 [kvm]
>>>   [   66.497399][ T6250] [c01fabebbc30] [c0442f8c] 
>>> __fput+0xec/0x310
>>>   [   66.497815][ T6250] [c01fabebbc90] [c0145f94] 
>>> task_work_run+0x114/0x170
>>>   [   66.498296][ T6250] [c01fabebbce0] [c0115274] 
>>> do_exit+0x454/0xee0
>>>   [   66.498743][ T6250] [c01fabebbdc0] [c0115dd0] 
>>> do_group_exit+0x60/0xe0
>>>   [   66.499201][ T6250] [c01fabebbe00] [c0115e74] 
>>> sys_exit_group+0x24/0x40
>>>   [   66.499747][ T6250] [c01fabebbe20] [c000b83c] 
>>> system_call+0x5c/0x70
>>>   [   66.500261][ T6250] Instruction dump:
>>>   [   66.500484][ T6250] fbe1fff8 fba1ffe8 fbc1fff0 7c7c1b78 f8010010 
>>> f821ffa1 eba30010 e87d0010 
>>>   [   66.501006][ T6250] ebdd 48006f61 e8410018 3920  
>>> 913e42e8 48007f3d e8410018 
>>>   [   66.501529][ T6250] ---[ end trace c021a6ca03594ec3 ]---
>>>   [   66.513119][ T6150] xive: OPAL failed to allocate VCPUs order 11, err 
>>> -10
>> 
>> 
>> ... the rollback code in case of such error must be bogus. It was never 
>> tested 
>> clearly :/
>
> Here is a fix. Could you give it a try on your system  ?

Yeah that fixes it, thanks.

Will apply it to my fixes branch.

cheers


> From b6f728ca19a9540c8bf4f5a56991c4e3dab4cf56 Mon Sep 17 00:00:00 2001
> From: =?UTF-8?q?C=C3=A9dric=20Le=20Goater?= 
> Date: Thu, 18 Jul 2019 22:15:31 +0200
> Subject: [PATCH] KVM: PPC: Book3S HV: XIVE: fix rollback when
>  kvmppc_xive_create fails
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> The XIVE device structure is now allocated in kvmppc_xive_get_device()
> and kfree'd in kvmppc_core_destroy_vm(). In case of an OPAL error when
> allocating the XIVE VPs, the kfree() call in kvmppc_xive_*create()
> will result in a double free and corrupt the host memory.
>
> Fixes: 5422e95103cf ("KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method 
> by a 'release' method")
> Signed-off-by: Cédric Le Goater 
> ---
>  arch/powerpc/kvm/book3s_xive.c| 4 +---
>  arch/powerpc/kvm/book3s_xive_nat

Re: [PATCH] powerpc/dma: Fix invalid DMA mmap behavior

2019-07-19 Thread Christoph Hellwig

On Thu, Jul 18, 2019 at 02:46:00PM -0500, Shawn Anastasio wrote:
> Personally, I'm not a huge fan of an implicit default for something
> inherently architecture-dependent like this at all.

What is inherently architecture specific here over the fact that
the pgprot_* expand to architecture specific bits?

> What I'd like to
> see is a mechanism that forces architecture code to explicitly
> opt in to the default pgprot settings if they don't provide an
> implementation of arch_dma_mmap_pgprot. This could perhaps be done
> by reversing ARCH_HAS_DMA_MMAP_PGPROT to something like
> ARCH_USE_DEFAULT_DMA_MMAP_PGPROT.

I'd rather not create boilerplate code where we don't have to it.  Note
that arch_dma_mmap_pgprot is a little misnamed now, as we also use it
for the in-kernel remapping in kernel/dma/remap.c, which I'm slowly
switching a lot of architectures to.  powerpc will follow soon once
I get the ppc44x that was given to me to actually boot with a recent
kernel (not that I've tried much so far).

>
> This way as more systems are moved to use the common mmap code instead
> of their ops->mmap, the people doing the refactoring have to make an
> explicit decision about the pgprot settings to use. Such a configuration
> would have likely prevented this situation with powerpc from happening.

Every arch except for arm32 now uses dma direct for the direct mapping,
and thus the common code without the indeed odd default.  I also have
a series to remove the default fallback, which is inherently a bad idea:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-no-defaults

> That being said, if the default behavior doesn't make sense in the
> general case it should probably be fixed as well.
>
> Curious to hear some thoughts on this.

I think your patch that started this thread is fine for 5.3 and -stable:

Reviewed-by: Christoph Hellwig 

But going forward I'd rather have a sane default.

45 matches

Mail list logo