[PATCH] powerpc/sysdev: fix refcount leak in icp_opal_init()

2022-04-01 Thread cgel . zte
From: Lv Ruyi 

The of_find_compatible_node() function returns a node pointer with
its refcount incremented, so call of_node_put() on it when done.
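
For illustration only (not part of the patch), the pattern being enforced
looks roughly like the sketch below; the compatible string and function
name are made up, only of_find_compatible_node()/of_node_put() are the
real kernel APIs:

    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/of.h>

    static int __init example_init(void)
    {
            struct device_node *np;

            /* of_find_compatible_node() returns the node with its refcount held */
            np = of_find_compatible_node(NULL, NULL, "vendor,example");
            if (!np)
                    return -ENODEV;

            /* ... use np ... */

            of_node_put(np);        /* drop the reference when done */
            return 0;
    }
    module_init(example_init);
    MODULE_LICENSE("GPL");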

Reported-by: Zeal Robot 
Signed-off-by: Lv Ruyi 
---
 arch/powerpc/sysdev/xics/icp-opal.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/sysdev/xics/icp-opal.c b/arch/powerpc/sysdev/xics/icp-opal.c
index bda4c32582d9..4dae624b9f2f 100644
--- a/arch/powerpc/sysdev/xics/icp-opal.c
+++ b/arch/powerpc/sysdev/xics/icp-opal.c
@@ -196,6 +196,7 @@ int __init icp_opal_init(void)
 
printk("XICS: Using OPAL ICP fallbacks\n");
 
+   of_node_put(np);
return 0;
 }
 
-- 
2.25.1



Re: [PATCH] powerpc/85xx: Remove fsl,85... bindings

2022-04-01 Thread Scott Wood
On Thu, 2022-03-31 at 12:13 +0200, Christophe Leroy wrote:
> Since commit 8a4ab218ef70 ("powerpc/85xx: Change deprecated binding
> for 85xx-based boards"), those bindings are not used anymore.
> 
> A comment in drivers/edac/mpc85xx_edac.c says they are to be removed
> with kernel 2.6.30.
> 
> Remove them now.
> 
> Signed-off-by: Christophe Leroy 
> ---
>  .../bindings/memory-controllers/fsl/fsl,ddr.yaml   |  6 --
>  .../devicetree/bindings/powerpc/fsl/l2cache.txt    |  6 --
>  drivers/edac/mpc85xx_edac.c    | 14 --
>  3 files changed, 26 deletions(-)

Acked-by: Scott Wood 

-Scott




[PATCH 4/4] tools/perf: Fix perf bench numa testcase to check if CPU used to bind task is online

2022-04-01 Thread Athira Rajeev
Perf numa bench test fails with error:

Testcase:
./perf bench numa mem -p 2 -t 1 -P 1024 -C 0,8 -M 1,0 -s 20 -zZq
--thp  1 --no-data_rand_walk

Failure snippet:
<<>>
 Running 'numa/mem' benchmark:

 # Running main, "perf bench numa numa-mem -p 2 -t 1 -P 1024 -C 0,8
-M 1,0 -s 20 -zZq --thp 1 --no-data_rand_walk"

perf: bench/numa.c:333: bind_to_cpumask: Assertion `!(ret)' failed.
<<>>

The testcase uses CPUs 0 and 8. In the "parse_setup_cpu_list" function
there is a check that the CPU number is not greater than the maximum
number of CPUs possible in the system, ie "if (bind_cpu_0 >= g->p.nr_cpus ||
bind_cpu_1 >= g->p.nr_cpus) {". But it can happen that the system has,
say, 48 CPUs while only CPUs 0-7 are online and the rest are offline.
Since "g->p.nr_cpus" is 48, the function goes ahead and also sets the
bit for CPU 8 in the cpumask (td->bind_cpumask).

The bind_to_cpumask function is then called to set the affinity using
sched_setaffinity and that cpumask. Since CPU 8 is not present, the
setaffinity call fails with EINVAL. Fix this issue by adding a check
that the CPUs provided in the input arguments are online before
proceeding further, and skip the test otherwise. For this, add a new
helper function "is_cpu_online" in "tools/perf/util/header.c".

Since the "BIT(x)" definition is now pulled in via header.h, remove it
from bench/numa.c.
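
As an illustration only (not part of the patch), the sysfs check the new
helper performs boils down to something like this standalone sketch; the
"no online file means the CPU cannot be offlined, so treat it as online"
rule is the one described above:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>

    /* 1 = online, 0 = offline, -1 = error (simplified sketch). */
    static int cpu_is_online(unsigned int cpu)
    {
            char path[64], buf[2] = "1";
            struct stat st;
            int fd;

            snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%u", cpu);
            if (stat(path, &st) != 0)
                    return 0;       /* no such CPU directory */

            snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%u/online", cpu);
            fd = open(path, O_RDONLY);
            if (fd < 0)
                    return 1;       /* no hotplug control file: always online */

            if (read(fd, buf, 1) < 0) {
                    close(fd);
                    return -1;
            }
            close(fd);
            return buf[0] == '1';
    }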

Reported-by: Nageswara R Sastry 
Signed-off-by: Athira Rajeev 
---
 tools/perf/bench/numa.c  |  8 ++--
 tools/perf/util/header.c | 43 
 tools/perf/util/header.h |  1 +
 3 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/tools/perf/bench/numa.c b/tools/perf/bench/numa.c
index 333896907e45..42e2b2ed30c3 100644
--- a/tools/perf/bench/numa.c
+++ b/tools/perf/bench/numa.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 
+#include "../util/header.h"
 #include 
 #include 
 
@@ -624,6 +625,11 @@ static int parse_setup_cpu_list(void)
return -1;
}
 
+   if (is_cpu_online(bind_cpu_0) != 1 || is_cpu_online(bind_cpu_1) != 1) {
+   printf("\nTest not applicable, bind_cpu_0 or bind_cpu_1 is offline\n");
+   return -1;
+   }
+
BUG_ON(bind_cpu_0 < 0 || bind_cpu_1 < 0);
BUG_ON(bind_cpu_0 > bind_cpu_1);
 
@@ -794,8 +800,6 @@ static int parse_nodes_opt(const struct option *opt __maybe_unused,
return parse_node_list(arg);
 }
 
-#define BIT(x) (1ul << x)
-
 static inline uint32_t lfsr_32(uint32_t lfsr)
 {
const uint32_t taps = BIT(1) | BIT(5) | BIT(6) | BIT(31);
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 6da12e522edc..3f5fcf5d4b3f 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -983,6 +983,49 @@ static int write_dir_format(struct feat_fd *ff,
	return do_write(ff, &data->dir.version, sizeof(data->dir.version));
 }
 
+#define SYSFS "/sys/devices/system/cpu/"
+
+/*
+ * Check whether a CPU is online
+ *
+ * Returns:
+ * 1 -> if CPU is online
+ * 0 -> if CPU is offline
+ *-1 -> error case
+ */
+int is_cpu_online(unsigned int cpu)
+{
+   char sysfs_cpu[255];
+   char buf[255];
+   struct stat statbuf;
+   size_t len;
+   int fd;
+
+   snprintf(sysfs_cpu, sizeof(sysfs_cpu), SYSFS "cpu%u", cpu);
+
+   if (stat(sysfs_cpu, &statbuf) != 0)
+   return 0;
+
+   /*
+* Check if /sys/devices/system/cpu/cpux/online file
+* exists. In kernels without CONFIG_HOTPLUG_CPU, this
+* file won't exist.
+*/
+   snprintf(sysfs_cpu, sizeof(sysfs_cpu), SYSFS "cpu%u/online", cpu);
+   if (stat(sysfs_cpu, &statbuf) != 0)
+   return 1;
+
+   fd = open(sysfs_cpu, O_RDONLY);
+   if (fd == -1)
+   return -1;
+
+   len = read(fd, buf, sizeof(buf) - 1);
+   buf[len] = '\0';
+   close(fd);
+
+   return strtoul(buf, NULL, 16);
+}
+
 #ifdef HAVE_LIBBPF_SUPPORT
 static int write_bpf_prog_info(struct feat_fd *ff,
   struct evlist *evlist __maybe_unused)
diff --git a/tools/perf/util/header.h b/tools/perf/util/header.h
index c9e3265832d9..0eb4bc29a5a4 100644
--- a/tools/perf/util/header.h
+++ b/tools/perf/util/header.h
@@ -158,6 +158,7 @@ int do_write(struct feat_fd *fd, const void *buf, size_t size);
 int write_padded(struct feat_fd *fd, const void *bf,
 size_t count, size_t count_aligned);
 
+int is_cpu_online(unsigned int cpu);
 /*
  * arch specific callback
  */
-- 
2.35.1



[PATCH 3/4] tools/perf: Fix perf numa bench to fix usage of affinity for machines with #CPUs > 1K

2022-04-01 Thread Athira Rajeev
perf bench numa testcase fails on systems with more than 1K CPUs.

Testcase: perf bench numa mem -p 1 -t 3 -P 512 -s 100 -zZ0qcm --thp  1
Snippet of code:
<<>>
perf: bench/numa.c:302: bind_to_node: Assertion `!(ret)' failed.
Aborted (core dumped)
<<>>

The bind_to_node function uses "sched_getaffinity" to save the original
cpumask, and this call returns EINVAL (invalid argument). This happens
because the default mask size in glibc is 1024. To overcome this 1024
CPUs mask size limitation of cpu_set_t, change the mask size using the
CPU_*_S macros, ie use CPU_ALLOC to allocate the cpumask and
CPU_ALLOC_SIZE for its size. Apart from fixing this for "orig_mask",
apply the same logic to "mask" as well, which is used for setaffinity,
so that the mask size is large enough to represent the number of
possible CPUs in the system.

sched_getaffinity is used in one more place in perf numa bench, in the
"bind_to_cpu" function. Apply the same logic there as well. Though no
failure is currently reported from there, it is better to make
getaffinity work with system configurations having more CPUs than the
default mask size supported by glibc.

Also fix "sched_setaffinity" to use a mask size which is large enough
to represent the number of possible CPUs in the system.

Fix all places where "bind_cpumask", which is part of "struct
thread_data", is used, so that bind_cpumask works in all configurations.
A minimal sketch of the dynamically sized mask handling follows.
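
For illustration only (not part of the patch), a minimal standalone
sketch of the dynamically sized cpu_set_t handling described above; the
CPU count is made up, only the glibc CPU_ALLOC/CPU_*_S macros and
sched_getaffinity/sched_setaffinity are the real APIs:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            int nrcpus = 2048;                      /* e.g. numa_num_possible_cpus() */
            size_t size = CPU_ALLOC_SIZE(nrcpus);
            cpu_set_t *orig_mask = CPU_ALLOC(nrcpus);
            cpu_set_t *mask = CPU_ALLOC(nrcpus);

            if (!orig_mask || !mask)
                    return 1;

            CPU_ZERO_S(size, orig_mask);
            /* pass the allocated size, not sizeof(cpu_set_t) */
            if (sched_getaffinity(0, size, orig_mask))
                    perror("sched_getaffinity");

            CPU_ZERO_S(size, mask);
            CPU_SET_S(0, size, mask);               /* bind to CPU 0 as an example */
            if (sched_setaffinity(0, size, mask))
                    perror("sched_setaffinity");

            CPU_FREE(orig_mask);
            CPU_FREE(mask);
            return 0;
    }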

Reported-by: Disha Goel 
Signed-off-by: Athira Rajeev 
---
 tools/perf/bench/numa.c | 109 +---
 1 file changed, 81 insertions(+), 28 deletions(-)

diff --git a/tools/perf/bench/numa.c b/tools/perf/bench/numa.c
index f2640179ada9..333896907e45 100644
--- a/tools/perf/bench/numa.c
+++ b/tools/perf/bench/numa.c
@@ -54,7 +54,7 @@
 
 struct thread_data {
int curr_cpu;
-   cpu_set_t   bind_cpumask;
+   cpu_set_t   *bind_cpumask;
int bind_node;
u8  *process_data;
int process_nr;
@@ -266,46 +266,75 @@ static bool node_has_cpus(int node)
return ret;
 }
 
-static cpu_set_t bind_to_cpu(int target_cpu)
+static cpu_set_t *bind_to_cpu(int target_cpu)
 {
-   cpu_set_t orig_mask, mask;
+   int nrcpus = numa_num_possible_cpus();
+   cpu_set_t *orig_mask, *mask;
+   size_t size;
int ret;
 
-   ret = sched_getaffinity(0, sizeof(orig_mask), &orig_mask);
-   BUG_ON(ret);
+   orig_mask = CPU_ALLOC(nrcpus);
+   BUG_ON(!orig_mask);
+   size = CPU_ALLOC_SIZE(nrcpus);
+   CPU_ZERO_S(size, orig_mask);
+
+   ret = sched_getaffinity(0, size, orig_mask);
+   if (ret) {
+   CPU_FREE(orig_mask);
+   BUG_ON(ret);
+   }
 
-   CPU_ZERO(&mask);
+   mask = CPU_ALLOC(nrcpus);
+   BUG_ON(!mask);
+   CPU_ZERO_S(size, mask);
 
if (target_cpu == -1) {
int cpu;
 
for (cpu = 0; cpu < g->p.nr_cpus; cpu++)
-   CPU_SET(cpu, &mask);
+   CPU_SET_S(cpu, size, mask);
} else {
BUG_ON(target_cpu < 0 || target_cpu >= g->p.nr_cpus);
-   CPU_SET(target_cpu, &mask);
+   CPU_SET_S(target_cpu, size, mask);
}
 
-   ret = sched_setaffinity(0, sizeof(mask), &mask);
-   BUG_ON(ret);
+   ret = sched_setaffinity(0, size, mask);
+   if (ret) {
+   CPU_FREE(mask);
+   BUG_ON(ret);
+   }
+
+   CPU_FREE(mask);
 
return orig_mask;
 }
 
-static cpu_set_t bind_to_node(int target_node)
+static cpu_set_t *bind_to_node(int target_node)
 {
-   cpu_set_t orig_mask, mask;
+   int nrcpus = numa_num_possible_cpus();
+   cpu_set_t *orig_mask, *mask;
+   size_t size;
int cpu;
int ret;
 
-   ret = sched_getaffinity(0, sizeof(orig_mask), &orig_mask);
-   BUG_ON(ret);
+   orig_mask = CPU_ALLOC(nrcpus);
+   BUG_ON(!orig_mask);
+   size = CPU_ALLOC_SIZE(nrcpus);
+   CPU_ZERO_S(size, orig_mask);
+
+   ret = sched_getaffinity(0, size, orig_mask);
+   if (ret) {
+   CPU_FREE(orig_mask);
+   BUG_ON(ret);
+   }
 
-   CPU_ZERO(&mask);
+   mask = CPU_ALLOC(nrcpus);
+   BUG_ON(!mask);
+   CPU_ZERO_S(size, mask);
 
if (target_node == NUMA_NO_NODE) {
for (cpu = 0; cpu < g->p.nr_cpus; cpu++)
-   CPU_SET(cpu, &mask);
+   CPU_SET_S(cpu, size, mask);
} else {
struct bitmask *cpumask = numa_allocate_cpumask();
 
@@ -313,24 +342,33 @@ static cpu_set_t bind_to_node(int target_node)
if (!numa_node_to_cpus(target_node, cpumask)) {
for (cpu = 0; cpu < (int)cpumask->size; cpu++) {
if (numa_bitmask_isbitset(cpumask, cpu))
-   CPU_SET(cpu, &mask);
+   CPU_SET_S(cpu, size, mask);

[PATCH 2/4] tools/perf: Fix perf bench epoll to correct usage of affinity for machines with #CPUs > 1K

2022-04-01 Thread Athira Rajeev
perf bench epoll testcase fails on systems with more than 1K CPUs.

Testcase: perf bench epoll all
Result snippet:
<<>>
Run summary [PID 106497]: 1399 threads monitoring on 64 file-descriptors for 8 secs.

perf: pthread_create: No such file or directory
<<>>

In the epoll benchmarks (ctl, wait) pthread_create is invoked in
do_threads from the respective bench_epoll_* function. Though the log
shows a direct failure from pthread_create, the actual failure is from
"sched_setaffinity" returning EINVAL (invalid argument). This happens
because the default mask size in glibc is 1024. To overcome this 1024
CPUs mask size limitation of cpu_set_t, change the mask size using the
CPU_*_S macros.

The patch addresses this by fixing all the epoll benchmarks to use
CPU_ALLOC to allocate the cpumask, CPU_ALLOC_SIZE for its size, and
CPU_SET_S to set the mask.

Reported-by: Disha Goel 
Signed-off-by: Athira Rajeev 
---
 tools/perf/bench/epoll-ctl.c  | 25 +++--
 tools/perf/bench/epoll-wait.c | 25 +++--
 2 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/tools/perf/bench/epoll-ctl.c b/tools/perf/bench/epoll-ctl.c
index 1a17ec83d3c4..91c53f6c6d87 100644
--- a/tools/perf/bench/epoll-ctl.c
+++ b/tools/perf/bench/epoll-ctl.c
@@ -222,13 +222,20 @@ static void init_fdmaps(struct worker *w, int pct)
 static int do_threads(struct worker *worker, struct perf_cpu_map *cpu)
 {
pthread_attr_t thread_attr, *attrp = NULL;
-   cpu_set_t cpuset;
+   cpu_set_t *cpuset;
unsigned int i, j;
int ret = 0;
+   int nrcpus;
+   size_t size;
 
if (!noaffinity)
pthread_attr_init(_attr);
 
+   nrcpus = perf_cpu_map__nr(cpu);
+   cpuset = CPU_ALLOC(nrcpus);
+   BUG_ON(!cpuset);
+   size = CPU_ALLOC_SIZE(nrcpus);
+
for (i = 0; i < nthreads; i++) {
		struct worker *w = &worker[i];
 
@@ -252,22 +259,28 @@ static int do_threads(struct worker *worker, struct perf_cpu_map *cpu)
init_fdmaps(w, 50);
 
if (!noaffinity) {
-   CPU_ZERO(&cpuset);
-   CPU_SET(perf_cpu_map__cpu(cpu, i % perf_cpu_map__nr(cpu)).cpu, &cpuset);
+   CPU_ZERO_S(size, cpuset);
+   CPU_SET_S(perf_cpu_map__cpu(cpu, i % perf_cpu_map__nr(cpu)).cpu,
+   size, cpuset);
 
-   ret = pthread_attr_setaffinity_np(&thread_attr, sizeof(cpu_set_t), &cpuset);
-   if (ret)
+   ret = pthread_attr_setaffinity_np(&thread_attr, size, cpuset);
+   if (ret) {
+   CPU_FREE(cpuset);
			err(EXIT_FAILURE, "pthread_attr_setaffinity_np");
+   }
 
attrp = _attr;
}
 
		ret = pthread_create(&w->thread, attrp, workerfn,
 (void *)(struct worker *) w);
-   if (ret)
+   if (ret) {
+   CPU_FREE(cpuset);
err(EXIT_FAILURE, "pthread_create");
+   }
}
 
+   CPU_FREE(cpuset);
if (!noaffinity)
		pthread_attr_destroy(&thread_attr);
 
diff --git a/tools/perf/bench/epoll-wait.c b/tools/perf/bench/epoll-wait.c
index 0d1dd8879197..9469a53ffab9 100644
--- a/tools/perf/bench/epoll-wait.c
+++ b/tools/perf/bench/epoll-wait.c
@@ -291,9 +291,11 @@ static void print_summary(void)
 static int do_threads(struct worker *worker, struct perf_cpu_map *cpu)
 {
pthread_attr_t thread_attr, *attrp = NULL;
-   cpu_set_t cpuset;
+   cpu_set_t *cpuset;
unsigned int i, j;
int ret = 0, events = EPOLLIN;
+   int nrcpus;
+   size_t size;
 
if (oneshot)
events |= EPOLLONESHOT;
@@ -306,6 +308,11 @@ static int do_threads(struct worker *worker, struct perf_cpu_map *cpu)
if (!noaffinity)
pthread_attr_init(_attr);
 
+   nrcpus = perf_cpu_map__nr(cpu);
+   cpuset = CPU_ALLOC(nrcpus);
+   BUG_ON(!cpuset);
+   size = CPU_ALLOC_SIZE(nrcpus);
+
for (i = 0; i < nthreads; i++) {
		struct worker *w = &worker[i];
 
@@ -341,22 +348,28 @@ static int do_threads(struct worker *worker, struct perf_cpu_map *cpu)
}
 
if (!noaffinity) {
-   CPU_ZERO(&cpuset);
-   CPU_SET(perf_cpu_map__cpu(cpu, i % perf_cpu_map__nr(cpu)).cpu, &cpuset);
+   CPU_ZERO_S(size, cpuset);
+   CPU_SET_S(perf_cpu_map__cpu(cpu, i % perf_cpu_map__nr(cpu)).cpu,
+   size, cpuset);
 
-   ret = pthread_attr_setaffinity_np(&thread_attr, sizeof(cpu_set_t), &cpuset);
-   if (ret)
+   ret = pthread_attr_setaffinity_np(&thread_attr, size, cpuset);
+   if (ret) {
+   

[PATCH 1/4] tools/perf: Fix perf bench futex to correct usage of affinity for machines with #CPUs > 1K

2022-04-01 Thread Athira Rajeev
perf bench futex testcase fails on systems with more than 1K CPUs.

Testcase: perf bench futex all
Failure snippet:
<<>>Running futex/hash benchmark...

perf: pthread_create: No such file or directory
<<>>

In all the futex benchmarks (ie hash, lock-api, requeue, wake,
wake-parallel), pthread_create is invoked in the respective
bench_futex_* function. Though the log shows a direct failure from
pthread_create, strace logs showed that the actual failure is from
"sched_setaffinity" returning EINVAL (invalid argument). This happens
because the default mask size in glibc is 1024. To overcome this 1024
CPUs mask size limitation of cpu_set_t, change the mask size using the
CPU_*_S macros.

The patch addresses this by fixing all the futex benchmarks to use
CPU_ALLOC to allocate the cpumask, CPU_ALLOC_SIZE for its size, and
CPU_SET_S to set the mask; a minimal sketch of that pattern follows.
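
For illustration only (not part of the patch), a minimal sketch of
creating a thread pinned to one CPU with a dynamically sized cpu_set_t;
the thread function and CPU count are made up, while the CPU_*_S macros
and pthread_attr_setaffinity_np() are the real glibc/pthread APIs:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *workerfn(void *arg) { (void)arg; return NULL; }

    int main(void)
    {
            int nrcpus = 2048;                      /* e.g. perf_cpu_map__nr(cpu) */
            size_t size = CPU_ALLOC_SIZE(nrcpus);
            cpu_set_t *cpuset = CPU_ALLOC(nrcpus);
            pthread_attr_t attr;
            pthread_t tid;

            if (!cpuset)
                    return 1;

            pthread_attr_init(&attr);
            CPU_ZERO_S(size, cpuset);
            CPU_SET_S(0, size, cpuset);             /* bind the thread to CPU 0 */

            /* pass the allocated size, not sizeof(cpu_set_t) */
            if (pthread_attr_setaffinity_np(&attr, size, cpuset))
                    fprintf(stderr, "pthread_attr_setaffinity_np failed\n");
            if (pthread_create(&tid, &attr, workerfn, NULL))
                    fprintf(stderr, "pthread_create failed\n");
            else
                    pthread_join(tid, NULL);

            pthread_attr_destroy(&attr);
            CPU_FREE(cpuset);
            return 0;
    }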

Reported-by: Disha Goel 
Signed-off-by: Athira Rajeev 
---
 tools/perf/bench/futex-hash.c  | 26 +++---
 tools/perf/bench/futex-lock-pi.c   | 21 -
 tools/perf/bench/futex-requeue.c   | 21 -
 tools/perf/bench/futex-wake-parallel.c | 21 -
 tools/perf/bench/futex-wake.c  | 22 --
 5 files changed, 83 insertions(+), 28 deletions(-)

diff --git a/tools/perf/bench/futex-hash.c b/tools/perf/bench/futex-hash.c
index 9627b6ab8670..dfce64e551e2 100644
--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -122,12 +122,14 @@ static void print_summary(void)
 int bench_futex_hash(int argc, const char **argv)
 {
int ret = 0;
-   cpu_set_t cpuset;
+   cpu_set_t *cpuset;
struct sigaction act;
unsigned int i;
pthread_attr_t thread_attr;
struct worker *worker = NULL;
struct perf_cpu_map *cpu;
+   int nrcpus;
+   size_t size;
 
argc = parse_options(argc, argv, options, bench_futex_hash_usage, 0);
if (argc) {
@@ -170,25 +172,35 @@ int bench_futex_hash(int argc, const char **argv)
threads_starting = params.nthreads;
pthread_attr_init(_attr);
gettimeofday(__start, NULL);
+
+   nrcpus = perf_cpu_map__nr(cpu);
+   cpuset = CPU_ALLOC(nrcpus);
+   BUG_ON(!cpuset);
+   size = CPU_ALLOC_SIZE(nrcpus);
+
for (i = 0; i < params.nthreads; i++) {
worker[i].tid = i;
		worker[i].futex = calloc(params.nfutexes, sizeof(*worker[i].futex));
if (!worker[i].futex)
goto errmem;
 
-   CPU_ZERO(&cpuset);
-   CPU_SET(perf_cpu_map__cpu(cpu, i % perf_cpu_map__nr(cpu)).cpu, &cpuset);
+   CPU_ZERO_S(size, cpuset);
 
-   ret = pthread_attr_setaffinity_np(&thread_attr, sizeof(cpu_set_t), &cpuset);
-   if (ret)
+   CPU_SET_S(perf_cpu_map__cpu(cpu, i % perf_cpu_map__nr(cpu)).cpu, size, cpuset);
+   ret = pthread_attr_setaffinity_np(&thread_attr, size, cpuset);
+   if (ret) {
+   CPU_FREE(cpuset);
err(EXIT_FAILURE, "pthread_attr_setaffinity_np");
-
+   }
		ret = pthread_create(&worker[i].thread, &thread_attr, workerfn,
				     (void *)(struct worker *) &worker[i]);
-   if (ret)
+   if (ret) {
+   CPU_FREE(cpuset);
err(EXIT_FAILURE, "pthread_create");
+   }
 
}
+   CPU_FREE(cpuset);
	pthread_attr_destroy(&thread_attr);
 
	pthread_mutex_lock(&thread_lock);
diff --git a/tools/perf/bench/futex-lock-pi.c b/tools/perf/bench/futex-lock-pi.c
index a512a320df74..61c3bb80d4cf 100644
--- a/tools/perf/bench/futex-lock-pi.c
+++ b/tools/perf/bench/futex-lock-pi.c
@@ -120,11 +120,17 @@ static void *workerfn(void *arg)
 static void create_threads(struct worker *w, pthread_attr_t thread_attr,
   struct perf_cpu_map *cpu)
 {
-   cpu_set_t cpuset;
+   cpu_set_t *cpuset;
unsigned int i;
+   int nrcpus =  perf_cpu_map__nr(cpu);
+   size_t size;
 
threads_starting = params.nthreads;
 
+   cpuset = CPU_ALLOC(nrcpus);
+   BUG_ON(!cpuset);
+   size = CPU_ALLOC_SIZE(nrcpus);
+
for (i = 0; i < params.nthreads; i++) {
worker[i].tid = i;
 
@@ -135,15 +141,20 @@ static void create_threads(struct worker *w, 
pthread_attr_t thread_attr,
} else
			worker[i].futex = &global_futex;
 
-   CPU_ZERO(&cpuset);
-   CPU_SET(perf_cpu_map__cpu(cpu, i % perf_cpu_map__nr(cpu)).cpu, &cpuset);
+   CPU_ZERO_S(size, cpuset);
+   CPU_SET_S(perf_cpu_map__cpu(cpu, i % perf_cpu_map__nr(cpu)).cpu, size, cpuset);
 
-   if (pthread_attr_setaffinity_np(&thread_attr, sizeof(cpu_set_t), &cpuset))
+   if (pthread_attr_setaffinity_np(&thread_attr, size, cpuset)) {
+   CPU_FREE(cpuset);
err(EXIT_FAILURE, 

[PATCH 0/4] tools/perf: Fix perf bench numa, futex and epoll to work with machines having #CPUs > 1K

2022-04-01 Thread Athira Rajeev
The perf benchmark collections numa, futex and epoll hit failures on
system configurations with more than 1024 CPUs. These benchmarks use
"sched_getaffinity" and "sched_setaffinity" in the code to work with
affinity.

Example snippet from numa benchmark:
<<>>
perf: bench/numa.c:302: bind_to_node: Assertion `!(ret)' failed.
Aborted (core dumped)
<<>>

The bind_to_node function uses "sched_getaffinity" to save the cpumask.
This fails with EINVAL because the default mask size in glibc is 1024.

Similarly, the futex and epoll benchmarks use setaffinity during
pthread_create, and since it returns EINVAL in such system
configurations, the benchmarks don't run.

To overcome this 1024 CPUs mask size limitation of cpu_set_t,
change the mask size using the CPU_*_S macros ie, use CPU_ALLOC to
allocate cpumask, CPU_ALLOC_SIZE for size, CPU_SET_S to set mask bit.

Fix all the relevant places in the code to use mask size which is large
enough to represent number of possible CPU's in the system.

Fix the parse_setup_cpu_list function in the numa bench to check that an
input CPU is online before binding a task to it. This fixes failures
where, even though the CPU number is within the maximum CPU count, that
CPU is offline; sched_setaffinity then fails when the cpumask has the
bit for that CPU set.

Patches 1 and 2 address the fix for the perf bench futex and perf bench
epoll benchmarks. Patches 3 and 4 address the fix in the perf bench numa
benchmark.

Athira Rajeev (4):
  tools/perf: Fix perf bench futex to correct usage of affinity for
machines with #CPUs > 1K
  tools/perf: Fix perf bench epoll to correct usage of affinity for
machines with #CPUs > 1K
  tools/perf: Fix perf numa bench to fix usage of affinity for machines
with #CPUs > 1K
  tools/perf: Fix perf bench numa testcase to check if CPU used to bind
task is online

 tools/perf/bench/epoll-ctl.c   |  25 --
 tools/perf/bench/epoll-wait.c  |  25 --
 tools/perf/bench/futex-hash.c  |  26 --
 tools/perf/bench/futex-lock-pi.c   |  21 +++--
 tools/perf/bench/futex-requeue.c   |  21 +++--
 tools/perf/bench/futex-wake-parallel.c |  21 +++--
 tools/perf/bench/futex-wake.c  |  22 +++--
 tools/perf/bench/numa.c| 117 ++---
 tools/perf/util/header.c   |  43 +
 tools/perf/util/header.h   |   1 +
 10 files changed, 252 insertions(+), 70 deletions(-)

-- 
2.35.1



[PATCH net-next] net: ethernet: Prepare cleanup of powerpc's asm/prom.h

2022-04-01 Thread Christophe Leroy
powerpc's asm/prom.h brings some headers that it doesn't
need itself.

In order to clean it up, first add missing headers in
users of asm/prom.h

Signed-off-by: Christophe Leroy 
---
 drivers/net/ethernet/apple/bmac.c| 2 +-
 drivers/net/ethernet/apple/mace.c| 2 +-
 drivers/net/ethernet/freescale/fec_mpc52xx.c | 2 ++
 drivers/net/ethernet/freescale/fec_mpc52xx_phy.c | 1 +
 drivers/net/ethernet/ibm/ehea/ehea.h | 1 +
 drivers/net/ethernet/ibm/ehea/ehea_main.c| 2 ++
 drivers/net/ethernet/ibm/ibmvnic.c   | 1 +
 drivers/net/ethernet/sun/sungem.c| 1 -
 drivers/net/ethernet/toshiba/spider_net.c| 1 +
 9 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/apple/bmac.c 
b/drivers/net/ethernet/apple/bmac.c
index 4d2ba30c2fbd..3843e8fdcdde 100644
--- a/drivers/net/ethernet/apple/bmac.c
+++ b/drivers/net/ethernet/apple/bmac.c
@@ -25,7 +25,7 @@
 #include 
 #include 
 #include 
-#include <asm/prom.h>
+
 #include 
 #include 
 #include 
diff --git a/drivers/net/ethernet/apple/mace.c 
b/drivers/net/ethernet/apple/mace.c
index 6f8c91eb1263..97f96d30d9b3 100644
--- a/drivers/net/ethernet/apple/mace.c
+++ b/drivers/net/ethernet/apple/mace.c
@@ -20,7 +20,7 @@
 #include 
 #include 
 #include 
-#include <asm/prom.h>
+
 #include 
 #include 
 #include 
diff --git a/drivers/net/ethernet/freescale/fec_mpc52xx.c 
b/drivers/net/ethernet/freescale/fec_mpc52xx.c
index be0bd4b44926..5ddb769bdfb4 100644
--- a/drivers/net/ethernet/freescale/fec_mpc52xx.c
+++ b/drivers/net/ethernet/freescale/fec_mpc52xx.c
@@ -29,7 +29,9 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/net/ethernet/freescale/fec_mpc52xx_phy.c 
b/drivers/net/ethernet/freescale/fec_mpc52xx_phy.c
index b5497e308302..f85b5e81dfc1 100644
--- a/drivers/net/ethernet/freescale/fec_mpc52xx_phy.c
+++ b/drivers/net/ethernet/freescale/fec_mpc52xx_phy.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/net/ethernet/ibm/ehea/ehea.h 
b/drivers/net/ethernet/ibm/ehea/ehea.h
index b140835d4c23..208c440a602b 100644
--- a/drivers/net/ethernet/ibm/ehea/ehea.h
+++ b/drivers/net/ethernet/ibm/ehea/ehea.h
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
diff --git a/drivers/net/ethernet/ibm/ehea/ehea_main.c 
b/drivers/net/ethernet/ibm/ehea/ehea_main.c
index bad94e4d50f4..8ce3348edf08 100644
--- a/drivers/net/ethernet/ibm/ehea/ehea_main.c
+++ b/drivers/net/ethernet/ibm/ehea/ehea_main.c
@@ -29,6 +29,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 77683909ca3d..309d97d28fb1 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -53,6 +53,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/drivers/net/ethernet/sun/sungem.c 
b/drivers/net/ethernet/sun/sungem.c
index 036856102c50..45bd89153de2 100644
--- a/drivers/net/ethernet/sun/sungem.c
+++ b/drivers/net/ethernet/sun/sungem.c
@@ -52,7 +52,6 @@
 #endif
 
 #ifdef CONFIG_PPC_PMAC
-#include <asm/prom.h>
 #include 
 #include 
 #endif
diff --git a/drivers/net/ethernet/toshiba/spider_net.c 
b/drivers/net/ethernet/toshiba/spider_net.c
index f47b8358669d..4f7ae444 100644
--- a/drivers/net/ethernet/toshiba/spider_net.c
+++ b/drivers/net/ethernet/toshiba/spider_net.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "spider_net.h"
-- 
2.35.1



[PATCH] cpufreq: Prepare cleanup of powerpc's asm/prom.h

2022-04-01 Thread Christophe Leroy
powerpc's asm/prom.h brings some headers that it doesn't
need itself.

In order to clean it up, first add missing headers in
users of asm/prom.h

Signed-off-by: Christophe Leroy 
---
 drivers/cpufreq/pasemi-cpufreq.c  | 1 -
 drivers/cpufreq/pmac32-cpufreq.c  | 2 +-
 drivers/cpufreq/pmac64-cpufreq.c  | 2 +-
 drivers/cpufreq/ppc_cbe_cpufreq.c | 1 -
 drivers/cpufreq/ppc_cbe_cpufreq_pmi.c | 2 +-
 5 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/pasemi-cpufreq.c b/drivers/cpufreq/pasemi-cpufreq.c
index 815645170c4d..039a66bbe1be 100644
--- a/drivers/cpufreq/pasemi-cpufreq.c
+++ b/drivers/cpufreq/pasemi-cpufreq.c
@@ -18,7 +18,6 @@
 
 #include 
 #include 
-#include <asm/prom.h>
 #include 
 #include 
 
diff --git a/drivers/cpufreq/pmac32-cpufreq.c b/drivers/cpufreq/pmac32-cpufreq.c
index 4f20c6a9108d..20f64a8b0a35 100644
--- a/drivers/cpufreq/pmac32-cpufreq.c
+++ b/drivers/cpufreq/pmac32-cpufreq.c
@@ -24,7 +24,7 @@
 #include 
 #include 
 #include 
-#include <asm/prom.h>
+
 #include 
 #include 
 #include 
diff --git a/drivers/cpufreq/pmac64-cpufreq.c b/drivers/cpufreq/pmac64-cpufreq.c
index d7542a106e6b..ba9c31d98bd6 100644
--- a/drivers/cpufreq/pmac64-cpufreq.c
+++ b/drivers/cpufreq/pmac64-cpufreq.c
@@ -22,7 +22,7 @@
 #include 
 #include 
 #include 
-#include <asm/prom.h>
+
 #include 
 #include 
 #include 
diff --git a/drivers/cpufreq/ppc_cbe_cpufreq.c 
b/drivers/cpufreq/ppc_cbe_cpufreq.c
index c58abb4cca3a..e3313ce63b38 100644
--- a/drivers/cpufreq/ppc_cbe_cpufreq.c
+++ b/drivers/cpufreq/ppc_cbe_cpufreq.c
@@ -12,7 +12,6 @@
 #include 
 
 #include 
-#include <asm/prom.h>
 #include 
 
 #include "ppc_cbe_cpufreq.h"
diff --git a/drivers/cpufreq/ppc_cbe_cpufreq_pmi.c 
b/drivers/cpufreq/ppc_cbe_cpufreq_pmi.c
index 037fe23bc6ed..4fba3637b115 100644
--- a/drivers/cpufreq/ppc_cbe_cpufreq_pmi.c
+++ b/drivers/cpufreq/ppc_cbe_cpufreq_pmi.c
@@ -13,9 +13,9 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
-#include <asm/prom.h>
 #include 
 #include 
 
-- 
2.35.1



[PATCH] macintosh: Prepare cleanup of powerpc's asm/prom.h

2022-04-01 Thread Christophe Leroy
powerpc's asm/prom.h brings some headers that it doesn't
need itself.

In order to clean it up, first add missing headers in
users of asm/prom.h

Signed-off-by: Christophe Leroy 
---
 drivers/macintosh/adb.c | 2 +-
 drivers/macintosh/ans-lcd.c | 2 +-
 drivers/macintosh/macio-adb.c   | 5 -
 drivers/macintosh/macio_asic.c  | 3 ++-
 drivers/macintosh/macio_sysfs.c | 2 ++
 drivers/macintosh/mediabay.c| 2 +-
 drivers/macintosh/rack-meter.c  | 1 -
 drivers/macintosh/smu.c | 1 -
 drivers/macintosh/therm_adt746x.c   | 1 -
 drivers/macintosh/therm_windtunnel.c| 1 -
 drivers/macintosh/via-cuda.c| 4 +++-
 drivers/macintosh/via-pmu-backlight.c   | 1 -
 drivers/macintosh/via-pmu-led.c | 2 +-
 drivers/macintosh/via-pmu.c | 1 -
 drivers/macintosh/windfarm_ad7417_sensor.c  | 2 +-
 drivers/macintosh/windfarm_core.c   | 2 --
 drivers/macintosh/windfarm_cpufreq_clamp.c  | 2 --
 drivers/macintosh/windfarm_fcu_controls.c   | 2 +-
 drivers/macintosh/windfarm_lm75_sensor.c| 1 -
 drivers/macintosh/windfarm_lm87_sensor.c| 2 +-
 drivers/macintosh/windfarm_max6690_sensor.c | 2 +-
 drivers/macintosh/windfarm_mpu.h| 2 ++
 drivers/macintosh/windfarm_pm112.c  | 4 +++-
 drivers/macintosh/windfarm_pm121.c  | 3 ++-
 drivers/macintosh/windfarm_pm72.c   | 2 +-
 drivers/macintosh/windfarm_pm81.c   | 3 ++-
 drivers/macintosh/windfarm_pm91.c   | 3 ++-
 drivers/macintosh/windfarm_rm31.c   | 2 +-
 drivers/macintosh/windfarm_smu_controls.c   | 3 ++-
 drivers/macintosh/windfarm_smu_sat.c| 2 +-
 drivers/macintosh/windfarm_smu_sensors.c| 3 ++-
 31 files changed, 37 insertions(+), 31 deletions(-)

diff --git a/drivers/macintosh/adb.c b/drivers/macintosh/adb.c
index 73b396189039..439fab4eaa85 100644
--- a/drivers/macintosh/adb.c
+++ b/drivers/macintosh/adb.c
@@ -38,10 +38,10 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #ifdef CONFIG_PPC
-#include <asm/prom.h>
 #include 
 #endif
 
diff --git a/drivers/macintosh/ans-lcd.c b/drivers/macintosh/ans-lcd.c
index b4821c751d04..fa904b24a600 100644
--- a/drivers/macintosh/ans-lcd.c
+++ b/drivers/macintosh/ans-lcd.c
@@ -11,10 +11,10 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
-#include <asm/prom.h>
 #include 
 
 #include "ans-lcd.h"
diff --git a/drivers/macintosh/macio-adb.c b/drivers/macintosh/macio-adb.c
index dc634c2932fd..9b63bd2551c6 100644
--- a/drivers/macintosh/macio-adb.c
+++ b/drivers/macintosh/macio-adb.c
@@ -9,8 +9,11 @@
 #include 
 #include 
 #include 
-#include <asm/prom.h>
+#include 
+#include 
+#include 
 #include 
+
 #include 
 #include 
 #include 
diff --git a/drivers/macintosh/macio_asic.c b/drivers/macintosh/macio_asic.c
index 1943a007e2d5..3f519f573a63 100644
--- a/drivers/macintosh/macio_asic.c
+++ b/drivers/macintosh/macio_asic.c
@@ -20,13 +20,14 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
 #include 
 
 #include 
 #include 
 #include 
-#include <asm/prom.h>
 
 #undef DEBUG
 
diff --git a/drivers/macintosh/macio_sysfs.c b/drivers/macintosh/macio_sysfs.c
index 27f5eefc508f..2bbe359b26d9 100644
--- a/drivers/macintosh/macio_sysfs.c
+++ b/drivers/macintosh/macio_sysfs.c
@@ -1,5 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include 
+#include 
+#include 
 #include 
 #include 
 
diff --git a/drivers/macintosh/mediabay.c b/drivers/macintosh/mediabay.c
index b17660c022eb..36070c6586d1 100644
--- a/drivers/macintosh/mediabay.c
+++ b/drivers/macintosh/mediabay.c
@@ -17,7 +17,7 @@
 #include 
 #include 
 #include 
-#include <asm/prom.h>
+
 #include 
 #include 
 #include 
diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index 60311e8d6240..c28893e41a8b 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -27,7 +27,6 @@
 #include 
 
 #include 
-#include <asm/prom.h>
 #include 
 #include 
 #include 
diff --git a/drivers/macintosh/smu.c b/drivers/macintosh/smu.c
index a4fbc3fc713d..4350dabd9e6e 100644
--- a/drivers/macintosh/smu.c
+++ b/drivers/macintosh/smu.c
@@ -41,7 +41,6 @@
 
 #include 
 #include 
-#include <asm/prom.h>
 #include 
 #include 
 #include 
diff --git a/drivers/macintosh/therm_adt746x.c 
b/drivers/macintosh/therm_adt746x.c
index 7e218437730c..e604cbc91763 100644
--- a/drivers/macintosh/therm_adt746x.c
+++ b/drivers/macintosh/therm_adt746x.c
@@ -27,7 +27,6 @@
 #include 
 #include 
 
-#include <asm/prom.h>
 #include 
 #include 
 #include 
diff --git a/drivers/macintosh/therm_windtunnel.c 
b/drivers/macintosh/therm_windtunnel.c
index f55f6adf5e5f..9226b74fa08f 100644
--- a/drivers/macintosh/therm_windtunnel.c
+++ b/drivers/macintosh/therm_windtunnel.c
@@ -38,7 +38,6 @@
 #include 
 #include 
 
-#include <asm/prom.h>
 #include 
 #include 
 #include 
diff --git a/drivers/macintosh/via-cuda.c b/drivers/macintosh/via-cuda.c
index 3d0d0b9d471d..3838eb459ab1 100644
--- a/drivers/macintosh/via-cuda.c
+++ 

[PATCH AUTOSEL 4.9 03/16] powerpc: dts: t104xrdb: fix phy type for FMAN 4/5

2022-04-01 Thread Sasha Levin
From: Maxim Kiselev 

[ Upstream commit 17846485dff91acce1ad47b508b633dffc32e838 ]

T1040RDB has two RTL8211E-VB PHYs which require internal delays to be
set in order to work correctly.

Changing the phy-connection-type property to `rgmii-id`
will fix this issue.

Signed-off-by: Maxim Kiselev 
Reviewed-by: Maxim Kochetkov 
Reviewed-by: Vladimir Oltean 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20211230151123.1258321-1-biguncle...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/boot/dts/fsl/t104xrdb.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi 
b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
index 5fdddbd2a62b..b0a9beab1c26 100644
--- a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
@@ -139,12 +139,12 @@ pca9546@77 {
fman@40 {
ethernet@e6000 {
phy-handle = <_rgmii_0>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
ethernet@e8000 {
phy-handle = <_rgmii_1>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
mdio0: mdio@fc000 {
-- 
2.34.1



[PATCH AUTOSEL 4.14 17/22] powerpc/code-patching: Pre-map patch area

2022-04-01 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 591b4b268435f00d2f0b81f786c2c7bd5ef66416 ]

Paul reported a warning with DEBUG_ATOMIC_SLEEP=y:

  BUG: sleeping function called from invalid context at 
include/linux/sched/mm.h:256
  in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
  preempt_count: 0, expected: 0
  ...
  Call Trace:
dump_stack_lvl+0xa0/0xec (unreliable)
__might_resched+0x2f4/0x310
kmem_cache_alloc+0x220/0x4b0
__pud_alloc+0x74/0x1d0
hash__map_kernel_page+0x2cc/0x390
do_patch_instruction+0x134/0x4a0
arch_jump_label_transform+0x64/0x78
__jump_label_update+0x148/0x180
static_key_enable_cpuslocked+0xd0/0x120
static_key_enable+0x30/0x50
check_kvm_guest+0x60/0x88
pSeries_smp_probe+0x54/0xb0
smp_prepare_cpus+0x3e0/0x430
kernel_init_freeable+0x20c/0x43c
kernel_init+0x30/0x1a0
ret_from_kernel_thread+0x5c/0x64

Peter pointed out that this is because do_patch_instruction() has
disabled interrupts, but then map_patch_area() calls map_kernel_page()
then hash__map_kernel_page() which does a sleeping memory allocation.

We only see the warning in KVM guests with SMT enabled, which is not
particularly common, or on other platforms if CONFIG_KPROBES is
disabled, also not common. The reason we don't see it in most
configurations is that another path that happens to have interrupts
enabled has allocated the required page tables for us, eg. there's a
path in kprobes init that does that. That's just pure luck though.

As Christophe suggested, the simplest solution is to do a dummy
map/unmap when we initialise the patching, so that any required page
table levels are pre-allocated before the first call to
do_patch_instruction(). This works because the unmap doesn't free any
page tables that were allocated by the map, it just clears the PTE,
leaving the page table levels there for the next map.

Reported-by: Paul Menzel 
Debugged-by: Peter Zijlstra 
Suggested-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220223015821.473097-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/lib/code-patching.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 85f84b45d3a0..c58a619a68b3 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -47,9 +47,14 @@ int raw_patch_instruction(unsigned int *addr, unsigned int 
instr)
 #ifdef CONFIG_STRICT_KERNEL_RWX
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
+static int map_patch_area(void *addr, unsigned long text_poke_addr);
+static void unmap_patch_area(unsigned long addr);
+
 static int text_area_cpu_up(unsigned int cpu)
 {
struct vm_struct *area;
+   unsigned long addr;
+   int err;
 
area = get_vm_area(PAGE_SIZE, VM_ALLOC);
if (!area) {
@@ -57,6 +62,15 @@ static int text_area_cpu_up(unsigned int cpu)
cpu);
return -1;
}
+
+   // Map/unmap the area to ensure all page tables are pre-allocated
+   addr = (unsigned long)area->addr;
+   err = map_patch_area(empty_zero_page, addr);
+   if (err)
+   return err;
+
+   unmap_patch_area(addr);
+
this_cpu_write(text_poke_area, area);
 
return 0;
-- 
2.34.1



[PATCH AUTOSEL 4.14 07/22] powerpc: Set crashkernel offset to mid of RMA region

2022-04-01 Thread Sasha Levin
From: Sourabh Jain 

[ Upstream commit 7c5ed82b800d8615cdda00729e7b62e5899f0b13 ]

On large config LPARs (having 192 and more cores), Linux fails to boot
due to insufficient memory in the first memblock. It is due to the
memory reservation for the crash kernel which starts at 128MB offset of
the first memblock. This memory reservation for the crash kernel doesn't
leave enough space in the first memblock to accommodate other essential
system resources.

The crash kernel start address was set to a 128MB offset by default to
ensure that the crash kernel gets some memory below the RMA region,
which used to be 256MB in size. But given that the RMA region size can
be 512MB or more, setting the crash kernel offset to the middle of the RMA size will
leave enough space for the kernel to allocate memory for other system
resources.

Since the above crash kernel offset change is only applicable to the LPAR
platform, the LPAR feature detection is pushed before the crash kernel
reservation. The rest of LPAR specific initialization will still
be done during pseries_probe_fw_features as usual.

This patch depends on changes to paca allocation for the boot CPU. It
expects the boot CPU to discover 1T segment support, which is introduced by
the patch posted here:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-January/239175.html

Reported-by: Abdul haleem 
Signed-off-by: Sourabh Jain 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20220204085601.107257-1-sourabhj...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/machine_kexec.c | 15 +++
 arch/powerpc/kernel/rtas.c  |  6 ++
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/machine_kexec.c 
b/arch/powerpc/kernel/machine_kexec.c
index cb4d6cd949fc..101b0fb7a80e 100644
--- a/arch/powerpc/kernel/machine_kexec.c
+++ b/arch/powerpc/kernel/machine_kexec.c
@@ -145,11 +145,18 @@ void __init reserve_crashkernel(void)
if (!crashk_res.start) {
 #ifdef CONFIG_PPC64
/*
-* On 64bit we split the RMO in half but cap it at half of
-* a small SLB (128MB) since the crash kernel needs to place
-* itself and some stacks to be in the first segment.
+* On the LPAR platform place the crash kernel to mid of
+* RMA size (512MB or more) to ensure the crash kernel
+* gets enough space to place itself and some stack to be
+* in the first segment. At the same time normal kernel
+* also get enough space to allocate memory for essential
+* system resource in the first segment. Keep the crash
+* kernel starts at 128MB offset on other platforms.
 */
-   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   crashk_res.start = ppc64_rma_size / 2;
+   else
+   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
 #else
crashk_res.start = KDUMP_KERNELBASE;
 #endif
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 55b266d7afe1..912e7f69266e 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1356,6 +1356,12 @@ int __init early_init_dt_scan_rtas(unsigned long node,
entryp = of_get_flat_dt_prop(node, "linux,rtas-entry", NULL);
sizep  = of_get_flat_dt_prop(node, "rtas-size", NULL);
 
+#ifdef CONFIG_PPC64
+   /* need this feature to decide the crashkernel offset */
+   if (of_get_flat_dt_prop(node, "ibm,hypertas-functions", NULL))
+   powerpc_firmware_features |= FW_FEATURE_LPAR;
+#endif
+
if (basep && entryp && sizep) {
rtas.base = *basep;
rtas.entry = *entryp;
-- 
2.34.1



[PATCH AUTOSEL 4.14 03/22] powerpc: dts: t104xrdb: fix phy type for FMAN 4/5

2022-04-01 Thread Sasha Levin
From: Maxim Kiselev 

[ Upstream commit 17846485dff91acce1ad47b508b633dffc32e838 ]

T1040RDB has two RTL8211E-VB PHYs which require internal delays to be
set in order to work correctly.

Changing the phy-connection-type property to `rgmii-id`
will fix this issue.

Signed-off-by: Maxim Kiselev 
Reviewed-by: Maxim Kochetkov 
Reviewed-by: Vladimir Oltean 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20211230151123.1258321-1-biguncle...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/boot/dts/fsl/t104xrdb.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi 
b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
index 5fdddbd2a62b..b0a9beab1c26 100644
--- a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
@@ -139,12 +139,12 @@ pca9546@77 {
fman@40 {
ethernet@e6000 {
phy-handle = <_rgmii_0>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
ethernet@e8000 {
phy-handle = <_rgmii_1>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
mdio0: mdio@fc000 {
-- 
2.34.1



[PATCH AUTOSEL 4.19 23/29] powerpc/code-patching: Pre-map patch area

2022-04-01 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 591b4b268435f00d2f0b81f786c2c7bd5ef66416 ]

Paul reported a warning with DEBUG_ATOMIC_SLEEP=y:

  BUG: sleeping function called from invalid context at 
include/linux/sched/mm.h:256
  in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
  preempt_count: 0, expected: 0
  ...
  Call Trace:
dump_stack_lvl+0xa0/0xec (unreliable)
__might_resched+0x2f4/0x310
kmem_cache_alloc+0x220/0x4b0
__pud_alloc+0x74/0x1d0
hash__map_kernel_page+0x2cc/0x390
do_patch_instruction+0x134/0x4a0
arch_jump_label_transform+0x64/0x78
__jump_label_update+0x148/0x180
static_key_enable_cpuslocked+0xd0/0x120
static_key_enable+0x30/0x50
check_kvm_guest+0x60/0x88
pSeries_smp_probe+0x54/0xb0
smp_prepare_cpus+0x3e0/0x430
kernel_init_freeable+0x20c/0x43c
kernel_init+0x30/0x1a0
ret_from_kernel_thread+0x5c/0x64

Peter pointed out that this is because do_patch_instruction() has
disabled interrupts, but then map_patch_area() calls map_kernel_page()
then hash__map_kernel_page() which does a sleeping memory allocation.

We only see the warning in KVM guests with SMT enabled, which is not
particularly common, or on other platforms if CONFIG_KPROBES is
disabled, also not common. The reason we don't see it in most
configurations is that another path that happens to have interrupts
enabled has allocated the required page tables for us, eg. there's a
path in kprobes init that does that. That's just pure luck though.

As Christophe suggested, the simplest solution is to do a dummy
map/unmap when we initialise the patching, so that any required page
table levels are pre-allocated before the first call to
do_patch_instruction(). This works because the unmap doesn't free any
page tables that were allocated by the map, it just clears the PTE,
leaving the page table levels there for the next map.

Reported-by: Paul Menzel 
Debugged-by: Peter Zijlstra 
Suggested-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220223015821.473097-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/lib/code-patching.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index bb245dbf6c57..2b9a92ea2d89 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -46,9 +46,14 @@ int raw_patch_instruction(unsigned int *addr, unsigned int 
instr)
 #ifdef CONFIG_STRICT_KERNEL_RWX
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
+static int map_patch_area(void *addr, unsigned long text_poke_addr);
+static void unmap_patch_area(unsigned long addr);
+
 static int text_area_cpu_up(unsigned int cpu)
 {
struct vm_struct *area;
+   unsigned long addr;
+   int err;
 
area = get_vm_area(PAGE_SIZE, VM_ALLOC);
if (!area) {
@@ -56,6 +61,15 @@ static int text_area_cpu_up(unsigned int cpu)
cpu);
return -1;
}
+
+   // Map/unmap the area to ensure all page tables are pre-allocated
+   addr = (unsigned long)area->addr;
+   err = map_patch_area(empty_zero_page, addr);
+   if (err)
+   return err;
+
+   unmap_patch_area(addr);
+
this_cpu_write(text_poke_area, area);
 
return 0;
-- 
2.34.1



[PATCH AUTOSEL 4.19 09/29] powerpc: Set crashkernel offset to mid of RMA region

2022-04-01 Thread Sasha Levin
From: Sourabh Jain 

[ Upstream commit 7c5ed82b800d8615cdda00729e7b62e5899f0b13 ]

On large config LPARs (having 192 and more cores), Linux fails to boot
due to insufficient memory in the first memblock. It is due to the
memory reservation for the crash kernel which starts at 128MB offset of
the first memblock. This memory reservation for the crash kernel doesn't
leave enough space in the first memblock to accommodate other essential
system resources.

The crash kernel start address was set to a 128MB offset by default to
ensure that the crash kernel gets some memory below the RMA region,
which used to be 256MB in size. But given that the RMA region size can
be 512MB or more, setting the crash kernel offset to the middle of the RMA size will
leave enough space for the kernel to allocate memory for other system
resources.

Since the above crash kernel offset change is only applicable to the LPAR
platform, the LPAR feature detection is pushed before the crash kernel
reservation. The rest of LPAR specific initialization will still
be done during pseries_probe_fw_features as usual.

This patch depends on changes to paca allocation for the boot CPU. It
expects the boot CPU to discover 1T segment support, which is introduced by
the patch posted here:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-January/239175.html

Reported-by: Abdul haleem 
Signed-off-by: Sourabh Jain 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20220204085601.107257-1-sourabhj...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/machine_kexec.c | 15 +++
 arch/powerpc/kernel/rtas.c  |  6 ++
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/machine_kexec.c 
b/arch/powerpc/kernel/machine_kexec.c
index 094c37fb07a9..437c50bfe4e6 100644
--- a/arch/powerpc/kernel/machine_kexec.c
+++ b/arch/powerpc/kernel/machine_kexec.c
@@ -148,11 +148,18 @@ void __init reserve_crashkernel(void)
if (!crashk_res.start) {
 #ifdef CONFIG_PPC64
/*
-* On 64bit we split the RMO in half but cap it at half of
-* a small SLB (128MB) since the crash kernel needs to place
-* itself and some stacks to be in the first segment.
+* On the LPAR platform place the crash kernel to mid of
+* RMA size (512MB or more) to ensure the crash kernel
+* gets enough space to place itself and some stack to be
+* in the first segment. At the same time normal kernel
+* also get enough space to allocate memory for essential
+* system resource in the first segment. Keep the crash
+* kernel starts at 128MB offset on other platforms.
 */
-   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   crashk_res.start = ppc64_rma_size / 2;
+   else
+   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
 #else
crashk_res.start = KDUMP_KERNELBASE;
 #endif
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index b3aa0cea6283..362c20c8c22f 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1357,6 +1357,12 @@ int __init early_init_dt_scan_rtas(unsigned long node,
entryp = of_get_flat_dt_prop(node, "linux,rtas-entry", NULL);
sizep  = of_get_flat_dt_prop(node, "rtas-size", NULL);
 
+#ifdef CONFIG_PPC64
+   /* need this feature to decide the crashkernel offset */
+   if (of_get_flat_dt_prop(node, "ibm,hypertas-functions", NULL))
+   powerpc_firmware_features |= FW_FEATURE_LPAR;
+#endif
+
if (basep && entryp && sizep) {
rtas.base = *basep;
rtas.entry = *entryp;
-- 
2.34.1



[PATCH AUTOSEL 4.19 05/29] powerpc: dts: t104xrdb: fix phy type for FMAN 4/5

2022-04-01 Thread Sasha Levin
From: Maxim Kiselev 

[ Upstream commit 17846485dff91acce1ad47b508b633dffc32e838 ]

T1040RDB has two RTL8211E-VB PHYs which require internal delays to be
set in order to work correctly.

Changing the phy-connection-type property to `rgmii-id`
will fix this issue.

Signed-off-by: Maxim Kiselev 
Reviewed-by: Maxim Kochetkov 
Reviewed-by: Vladimir Oltean 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20211230151123.1258321-1-biguncle...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/boot/dts/fsl/t104xrdb.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi 
b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
index 099a598c74c0..bfe1ed5be337 100644
--- a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
@@ -139,12 +139,12 @@ pca9546@77 {
fman@40 {
ethernet@e6000 {
phy-handle = <_rgmii_0>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
ethernet@e8000 {
phy-handle = <_rgmii_1>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
mdio0: mdio@fc000 {
-- 
2.34.1



[PATCH AUTOSEL 5.4 30/37] powerpc/code-patching: Pre-map patch area

2022-04-01 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 591b4b268435f00d2f0b81f786c2c7bd5ef66416 ]

Paul reported a warning with DEBUG_ATOMIC_SLEEP=y:

  BUG: sleeping function called from invalid context at 
include/linux/sched/mm.h:256
  in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
  preempt_count: 0, expected: 0
  ...
  Call Trace:
dump_stack_lvl+0xa0/0xec (unreliable)
__might_resched+0x2f4/0x310
kmem_cache_alloc+0x220/0x4b0
__pud_alloc+0x74/0x1d0
hash__map_kernel_page+0x2cc/0x390
do_patch_instruction+0x134/0x4a0
arch_jump_label_transform+0x64/0x78
__jump_label_update+0x148/0x180
static_key_enable_cpuslocked+0xd0/0x120
static_key_enable+0x30/0x50
check_kvm_guest+0x60/0x88
pSeries_smp_probe+0x54/0xb0
smp_prepare_cpus+0x3e0/0x430
kernel_init_freeable+0x20c/0x43c
kernel_init+0x30/0x1a0
ret_from_kernel_thread+0x5c/0x64

Peter pointed out that this is because do_patch_instruction() has
disabled interrupts, but then map_patch_area() calls map_kernel_page()
then hash__map_kernel_page() which does a sleeping memory allocation.

We only see the warning in KVM guests with SMT enabled, which is not
particularly common, or on other platforms if CONFIG_KPROBES is
disabled, also not common. The reason we don't see it in most
configurations is that another path that happens to have interrupts
enabled has allocated the required page tables for us, eg. there's a
path in kprobes init that does that. That's just pure luck though.

As Christophe suggested, the simplest solution is to do a dummy
map/unmap when we initialise the patching, so that any required page
table levels are pre-allocated before the first call to
do_patch_instruction(). This works because the unmap doesn't free any
page tables that were allocated by the map, it just clears the PTE,
leaving the page table levels there for the next map.

Reported-by: Paul Menzel 
Debugged-by: Peter Zijlstra 
Suggested-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220223015821.473097-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/lib/code-patching.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index a05f289e613e..e417d4470397 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -41,9 +41,14 @@ int raw_patch_instruction(unsigned int *addr, unsigned int 
instr)
 #ifdef CONFIG_STRICT_KERNEL_RWX
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
+static int map_patch_area(void *addr, unsigned long text_poke_addr);
+static void unmap_patch_area(unsigned long addr);
+
 static int text_area_cpu_up(unsigned int cpu)
 {
struct vm_struct *area;
+   unsigned long addr;
+   int err;
 
area = get_vm_area(PAGE_SIZE, VM_ALLOC);
if (!area) {
@@ -51,6 +56,15 @@ static int text_area_cpu_up(unsigned int cpu)
cpu);
return -1;
}
+
+   // Map/unmap the area to ensure all page tables are pre-allocated
+   addr = (unsigned long)area->addr;
+   err = map_patch_area(empty_zero_page, addr);
+   if (err)
+   return err;
+
+   unmap_patch_area(addr);
+
this_cpu_write(text_poke_area, area);
 
return 0;
-- 
2.34.1



[PATCH AUTOSEL 5.4 11/37] powerpc: Set crashkernel offset to mid of RMA region

2022-04-01 Thread Sasha Levin
From: Sourabh Jain 

[ Upstream commit 7c5ed82b800d8615cdda00729e7b62e5899f0b13 ]

On large config LPARs (having 192 and more cores), Linux fails to boot
due to insufficient memory in the first memblock. It is due to the
memory reservation for the crash kernel which starts at 128MB offset of
the first memblock. This memory reservation for the crash kernel doesn't
leave enough space in the first memblock to accommodate other essential
system resources.

The crash kernel start address was set to a 128MB offset by default to
ensure that the crash kernel gets some memory below the RMA region,
which used to be 256MB in size. But given that the RMA region size can
be 512MB or more, setting the crash kernel offset to the middle of the RMA size will
leave enough space for the kernel to allocate memory for other system
resources.

Since the above crash kernel offset change is only applicable to the LPAR
platform, the LPAR feature detection is pushed before the crash kernel
reservation. The rest of LPAR specific initialization will still
be done during pseries_probe_fw_features as usual.

This patch depends on changes to paca allocation for the boot CPU. It
expects the boot CPU to discover 1T segment support, which is introduced by
the patch posted here:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-January/239175.html

Reported-by: Abdul haleem 
Signed-off-by: Sourabh Jain 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20220204085601.107257-1-sourabhj...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/machine_kexec.c | 15 +++
 arch/powerpc/kernel/rtas.c  |  6 ++
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/machine_kexec.c 
b/arch/powerpc/kernel/machine_kexec.c
index 7a1c11a7cba5..716f8bb17461 100644
--- a/arch/powerpc/kernel/machine_kexec.c
+++ b/arch/powerpc/kernel/machine_kexec.c
@@ -146,11 +146,18 @@ void __init reserve_crashkernel(void)
if (!crashk_res.start) {
 #ifdef CONFIG_PPC64
/*
-* On 64bit we split the RMO in half but cap it at half of
-* a small SLB (128MB) since the crash kernel needs to place
-* itself and some stacks to be in the first segment.
+* On the LPAR platform place the crash kernel to mid of
+* RMA size (512MB or more) to ensure the crash kernel
+* gets enough space to place itself and some stack to be
+* in the first segment. At the same time normal kernel
+* also get enough space to allocate memory for essential
+* system resource in the first segment. Keep the crash
+* kernel starts at 128MB offset on other platforms.
 */
-   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   crashk_res.start = ppc64_rma_size / 2;
+   else
+   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
 #else
crashk_res.start = KDUMP_KERNELBASE;
 #endif
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index c1e2e351ebff..9392661ac8a8 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1244,6 +1244,12 @@ int __init early_init_dt_scan_rtas(unsigned long node,
entryp = of_get_flat_dt_prop(node, "linux,rtas-entry", NULL);
sizep  = of_get_flat_dt_prop(node, "rtas-size", NULL);
 
+#ifdef CONFIG_PPC64
+   /* need this feature to decide the crashkernel offset */
+   if (of_get_flat_dt_prop(node, "ibm,hypertas-functions", NULL))
+   powerpc_firmware_features |= FW_FEATURE_LPAR;
+#endif
+
if (basep && entryp && sizep) {
rtas.base = *basep;
rtas.entry = *entryp;
-- 
2.34.1



[PATCH AUTOSEL 5.4 05/37] powerpc: dts: t104xrdb: fix phy type for FMAN 4/5

2022-04-01 Thread Sasha Levin
From: Maxim Kiselev 

[ Upstream commit 17846485dff91acce1ad47b508b633dffc32e838 ]

T1040RDB has two RTL8211E-VB PHYs which require internal delays to be
set in order to work correctly.

Changing the phy-connection-type property to `rgmii-id`
will fix this issue.

Signed-off-by: Maxim Kiselev 
Reviewed-by: Maxim Kochetkov 
Reviewed-by: Vladimir Oltean 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20211230151123.1258321-1-biguncle...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/boot/dts/fsl/t104xrdb.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi 
b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
index 099a598c74c0..bfe1ed5be337 100644
--- a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
@@ -139,12 +139,12 @@ pca9546@77 {
fman@40 {
ethernet@e6000 {
phy-handle = <_rgmii_0>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
ethernet@e8000 {
phy-handle = <_rgmii_1>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
mdio0: mdio@fc000 {
-- 
2.34.1
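
For context, a hedged sketch of how a MAC driver typically consumes this
property in recent kernels (illustrative only; the wrapper function below is
made up, but of_get_phy_mode() and the phy_interface_t values are the
standard kernel API):

#include <linux/of.h>
#include <linux/of_net.h>
#include <linux/phy.h>

/* Sketch: read phy-connection-type / phy-mode from the device tree.
 * "rgmii-id" maps to PHY_INTERFACE_MODE_RGMII_ID, which asks the PHY
 * (the RTL8211E here) to add internal delays on both RX and TX. */
static int example_read_phy_mode(struct device_node *np, phy_interface_t *iface)
{
	int err = of_get_phy_mode(np, iface);

	if (err)
		return err;

	/* With plain "rgmii" neither side adds delay, and the RGMII clock
	 * skew this board needs would be missing. */
	return 0;
}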



[PATCH AUTOSEL 5.10 52/65] powerpc/secvar: fix refcount leak in format_show()

2022-04-01 Thread Sasha Levin
From: Hangyu Hua 

[ Upstream commit d601fd24e6964967f115f036a840f4f28488f63f ]

A refcount leak will happen when format_show() returns failure in multiple
cases. Unified management of of_node_put() fixes this problem.

Signed-off-by: Hangyu Hua 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220302021959.10959-1-hbh...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/secvar-sysfs.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
b/arch/powerpc/kernel/secvar-sysfs.c
index a0a78aba2083..1ee4640a2641 100644
--- a/arch/powerpc/kernel/secvar-sysfs.c
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -26,15 +26,18 @@ static ssize_t format_show(struct kobject *kobj, struct 
kobj_attribute *attr,
const char *format;
 
node = of_find_compatible_node(NULL, NULL, "ibm,secvar-backend");
-   if (!of_device_is_available(node))
-   return -ENODEV;
+   if (!of_device_is_available(node)) {
+   rc = -ENODEV;
+   goto out;
+   }
 
rc = of_property_read_string(node, "format", &format);
if (rc)
-   return rc;
+   goto out;
 
rc = sprintf(buf, "%s\n", format);
 
+out:
of_node_put(node);
 
return rc;
-- 
2.34.1
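
The same single-exit pattern, pulled out of the secvar specifics into a
hedged minimal sketch (the compatible string and property name below are
placeholders, not the real secvar ones):

#include <linux/errno.h>
#include <linux/of.h>

/* Sketch: of_find_compatible_node() takes a reference on the node it
 * returns, so every exit path must drop it exactly once. Funnelling all
 * returns through one label keeps that true as error paths are added. */
static int read_example_property(const char **out)
{
	struct device_node *node;
	int rc = 0;

	node = of_find_compatible_node(NULL, NULL, "example,compatible");
	if (!of_device_is_available(node)) {
		rc = -ENODEV;
		goto out;
	}

	rc = of_property_read_string(node, "example-prop", out);
out:
	of_node_put(node);	/* NULL-safe */
	return rc;
}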



[PATCH AUTOSEL 5.10 51/65] powerpc/code-patching: Pre-map patch area

2022-04-01 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 591b4b268435f00d2f0b81f786c2c7bd5ef66416 ]

Paul reported a warning with DEBUG_ATOMIC_SLEEP=y:

  BUG: sleeping function called from invalid context at 
include/linux/sched/mm.h:256
  in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
  preempt_count: 0, expected: 0
  ...
  Call Trace:
dump_stack_lvl+0xa0/0xec (unreliable)
__might_resched+0x2f4/0x310
kmem_cache_alloc+0x220/0x4b0
__pud_alloc+0x74/0x1d0
hash__map_kernel_page+0x2cc/0x390
do_patch_instruction+0x134/0x4a0
arch_jump_label_transform+0x64/0x78
__jump_label_update+0x148/0x180
static_key_enable_cpuslocked+0xd0/0x120
static_key_enable+0x30/0x50
check_kvm_guest+0x60/0x88
pSeries_smp_probe+0x54/0xb0
smp_prepare_cpus+0x3e0/0x430
kernel_init_freeable+0x20c/0x43c
kernel_init+0x30/0x1a0
ret_from_kernel_thread+0x5c/0x64

Peter pointed out that this is because do_patch_instruction() has
disabled interrupts, but then map_patch_area() calls map_kernel_page()
then hash__map_kernel_page() which does a sleeping memory allocation.

We only see the warning in KVM guests with SMT enabled, which is not
particularly common, or on other platforms if CONFIG_KPROBES is
disabled, also not common. The reason we don't see it in most
configurations is that another path that happens to have interrupts
enabled has allocated the required page tables for us, eg. there's a
path in kprobes init that does that. That's just pure luck though.

As Christophe suggested, the simplest solution is to do a dummy
map/unmap when we initialise the patching, so that any required page
table levels are pre-allocated before the first call to
do_patch_instruction(). This works because the unmap doesn't free any
page tables that were allocated by the map, it just clears the PTE,
leaving the page table levels there for the next map.

Reported-by: Paul Menzel 
Debugged-by: Peter Zijlstra 
Suggested-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220223015821.473097-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/lib/code-patching.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index a2e4f864b63d..4318aee65a39 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -43,9 +43,14 @@ int raw_patch_instruction(struct ppc_inst *addr, struct 
ppc_inst instr)
 #ifdef CONFIG_STRICT_KERNEL_RWX
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
+static int map_patch_area(void *addr, unsigned long text_poke_addr);
+static void unmap_patch_area(unsigned long addr);
+
 static int text_area_cpu_up(unsigned int cpu)
 {
struct vm_struct *area;
+   unsigned long addr;
+   int err;
 
area = get_vm_area(PAGE_SIZE, VM_ALLOC);
if (!area) {
@@ -53,6 +58,15 @@ static int text_area_cpu_up(unsigned int cpu)
cpu);
return -1;
}
+
+   // Map/unmap the area to ensure all page tables are pre-allocated
+   addr = (unsigned long)area->addr;
+   err = map_patch_area(empty_zero_page, addr);
+   if (err)
+   return err;
+
+   unmap_patch_area(addr);
+
this_cpu_write(text_poke_area, area);
 
return 0;
-- 
2.34.1



[PATCH AUTOSEL 5.10 19/65] powerpc: Set crashkernel offset to mid of RMA region

2022-04-01 Thread Sasha Levin
From: Sourabh Jain 

[ Upstream commit 7c5ed82b800d8615cdda00729e7b62e5899f0b13 ]

On large config LPARs (having 192 and more cores), Linux fails to boot
due to insufficient memory in the first memblock. It is due to the
memory reservation for the crash kernel which starts at 128MB offset of
the first memblock. This memory reservation for the crash kernel doesn't
leave enough space in the first memblock to accommodate other essential
system resources.

The crash kernel start address was set to a 128MB offset by default to
ensure that the crash kernel gets some memory below the RMA region, which
used to be 256MB in size. But given that the RMA region size can now be
512MB or more, setting the crash kernel offset to the middle of the RMA
leaves enough space for the kernel to allocate memory for other system
resources.

Since the above crash kernel offset change is only applicable to the LPAR
platform, the LPAR feature detection is pushed before the crash kernel
reservation. The rest of LPAR specific initialization will still
be done during pseries_probe_fw_features as usual.

This patch depends on changes to paca allocation for the boot CPU. It
expects the boot CPU to discover 1T segment support, which is introduced by
the patch posted here:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-January/239175.html

Reported-by: Abdul haleem 
Signed-off-by: Sourabh Jain 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20220204085601.107257-1-sourabhj...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/rtas.c |  6 ++
 arch/powerpc/kexec/core.c  | 15 +++
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index cccb32cf0e08..cf421eb7f90d 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1296,6 +1296,12 @@ int __init early_init_dt_scan_rtas(unsigned long node,
entryp = of_get_flat_dt_prop(node, "linux,rtas-entry", NULL);
sizep  = of_get_flat_dt_prop(node, "rtas-size", NULL);
 
+#ifdef CONFIG_PPC64
+   /* need this feature to decide the crashkernel offset */
+   if (of_get_flat_dt_prop(node, "ibm,hypertas-functions", NULL))
+   powerpc_firmware_features |= FW_FEATURE_LPAR;
+#endif
+
if (basep && entryp && sizep) {
rtas.base = *basep;
rtas.entry = *entryp;
diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 56da5eb2b923..80c79cb5010c 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -147,11 +147,18 @@ void __init reserve_crashkernel(void)
if (!crashk_res.start) {
 #ifdef CONFIG_PPC64
/*
-* On 64bit we split the RMO in half but cap it at half of
-* a small SLB (128MB) since the crash kernel needs to place
-* itself and some stacks to be in the first segment.
+* On the LPAR platform place the crash kernel to mid of
+* RMA size (512MB or more) to ensure the crash kernel
+* gets enough space to place itself and some stack to be
+* in the first segment. At the same time normal kernel
+* also get enough space to allocate memory for essential
+* system resource in the first segment. Keep the crash
+* kernel starts at 128MB offset on other platforms.
 */
-   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   crashk_res.start = ppc64_rma_size / 2;
+   else
+   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
 #else
crashk_res.start = KDUMP_KERNELBASE;
 #endif
-- 
2.34.1



[PATCH AUTOSEL 5.10 08/65] powerpc: dts: t104xrdb: fix phy type for FMAN 4/5

2022-04-01 Thread Sasha Levin
From: Maxim Kiselev 

[ Upstream commit 17846485dff91acce1ad47b508b633dffc32e838 ]

T1040RDB has two RTL8211E-VB PHYs which require internal
delays to be set in order to work correctly.

Changing the phy-connection-type property to `rgmii-id`
will fix this issue.

Signed-off-by: Maxim Kiselev 
Reviewed-by: Maxim Kochetkov 
Reviewed-by: Vladimir Oltean 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20211230151123.1258321-1-biguncle...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/boot/dts/fsl/t104xrdb.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi 
b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
index 099a598c74c0..bfe1ed5be337 100644
--- a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
@@ -139,12 +139,12 @@ pca9546@77 {
fman@400000 {
ethernet@e6000 {
phy-handle = <&phy_rgmii_0>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
ethernet@e8000 {
phy-handle = <&phy_rgmii_1>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
mdio0: mdio@fc000 {
-- 
2.34.1



[PATCH AUTOSEL 5.15 77/98] powerpc/secvar: fix refcount leak in format_show()

2022-04-01 Thread Sasha Levin
From: Hangyu Hua 

[ Upstream commit d601fd24e6964967f115f036a840f4f28488f63f ]

A refcount leak will happen when format_show() returns failure in multiple
cases. Unified management of of_node_put() fixes this problem.

Signed-off-by: Hangyu Hua 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220302021959.10959-1-hbh...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/secvar-sysfs.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
b/arch/powerpc/kernel/secvar-sysfs.c
index a0a78aba2083..1ee4640a2641 100644
--- a/arch/powerpc/kernel/secvar-sysfs.c
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -26,15 +26,18 @@ static ssize_t format_show(struct kobject *kobj, struct 
kobj_attribute *attr,
const char *format;
 
node = of_find_compatible_node(NULL, NULL, "ibm,secvar-backend");
-   if (!of_device_is_available(node))
-   return -ENODEV;
+   if (!of_device_is_available(node)) {
+   rc = -ENODEV;
+   goto out;
+   }
 
rc = of_property_read_string(node, "format", &format);
if (rc)
-   return rc;
+   goto out;
 
rc = sprintf(buf, "%s\n", format);
 
+out:
of_node_put(node);
 
return rc;
-- 
2.34.1



[PATCH AUTOSEL 5.15 76/98] powerpc/64e: Tie PPC_BOOK3E_64 to PPC_FSL_BOOK3E

2022-04-01 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 1a76e520ee1831a81dabf8a9a58c6453f700026e ]

Since the IBM A2 CPU support was removed, see commit
fb5a515704d7 ("powerpc: Remove platforms/wsp and associated pieces"),
the only 64-bit Book3E CPUs we support are Freescale (NXP) ones.

However our Kconfig still allows configuring a kernel that has 64-bit
Book3E support, but no Freescale CPU support enabled. Such a kernel
would never boot; it doesn't know about any CPUs.

It also causes build errors, as reported by lkp, because
PPC_BARRIER_NOSPEC is not enabled in such a configuration:

  powerpc64-linux-ld: arch/powerpc/net/bpf_jit_comp64.o:(.toc+0x0):
  undefined reference to `powerpc_security_features'

To fix this, force PPC_FSL_BOOK3E to be selected whenever we are
building a 64-bit Book3E kernel.

Reported-by: kernel test robot 
Reported-by: Naveen N. Rao 
Suggested-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220304061222.2478720-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/Kconfig.cputype | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index a208997ade88..87a95cbff2f3 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -111,6 +111,7 @@ config PPC_BOOK3S_64
 
 config PPC_BOOK3E_64
bool "Embedded processors"
+   select PPC_FSL_BOOK3E
select PPC_FPU # Make it a choice ?
select PPC_SMP_MUXED_IPI
select PPC_DOORBELL
@@ -287,7 +288,7 @@ config FSL_BOOKE
 config PPC_FSL_BOOK3E
bool
select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
-   select FSL_EMB_PERFMON
+   imply FSL_EMB_PERFMON
select PPC_SMP_MUXED_IPI
select PPC_DOORBELL
default y if FSL_BOOKE
-- 
2.34.1



[PATCH AUTOSEL 5.15 75/98] powerpc/code-patching: Pre-map patch area

2022-04-01 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 591b4b268435f00d2f0b81f786c2c7bd5ef66416 ]

Paul reported a warning with DEBUG_ATOMIC_SLEEP=y:

  BUG: sleeping function called from invalid context at 
include/linux/sched/mm.h:256
  in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
  preempt_count: 0, expected: 0
  ...
  Call Trace:
dump_stack_lvl+0xa0/0xec (unreliable)
__might_resched+0x2f4/0x310
kmem_cache_alloc+0x220/0x4b0
__pud_alloc+0x74/0x1d0
hash__map_kernel_page+0x2cc/0x390
do_patch_instruction+0x134/0x4a0
arch_jump_label_transform+0x64/0x78
__jump_label_update+0x148/0x180
static_key_enable_cpuslocked+0xd0/0x120
static_key_enable+0x30/0x50
check_kvm_guest+0x60/0x88
pSeries_smp_probe+0x54/0xb0
smp_prepare_cpus+0x3e0/0x430
kernel_init_freeable+0x20c/0x43c
kernel_init+0x30/0x1a0
ret_from_kernel_thread+0x5c/0x64

Peter pointed out that this is because do_patch_instruction() has
disabled interrupts, but then map_patch_area() calls map_kernel_page()
then hash__map_kernel_page() which does a sleeping memory allocation.

We only see the warning in KVM guests with SMT enabled, which is not
particularly common, or on other platforms if CONFIG_KPROBES is
disabled, also not common. The reason we don't see it in most
configurations is that another path that happens to have interrupts
enabled has allocated the required page tables for us, eg. there's a
path in kprobes init that does that. That's just pure luck though.

As Christophe suggested, the simplest solution is to do a dummy
map/unmap when we initialise the patching, so that any required page
table levels are pre-allocated before the first call to
do_patch_instruction(). This works because the unmap doesn't free any
page tables that were allocated by the map, it just clears the PTE,
leaving the page table levels there for the next map.

Reported-by: Paul Menzel 
Debugged-by: Peter Zijlstra 
Suggested-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220223015821.473097-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/lib/code-patching.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index c5ed98823835..b76b31196be1 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -47,9 +47,14 @@ int raw_patch_instruction(u32 *addr, struct ppc_inst instr)
 #ifdef CONFIG_STRICT_KERNEL_RWX
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
+static int map_patch_area(void *addr, unsigned long text_poke_addr);
+static void unmap_patch_area(unsigned long addr);
+
 static int text_area_cpu_up(unsigned int cpu)
 {
struct vm_struct *area;
+   unsigned long addr;
+   int err;
 
area = get_vm_area(PAGE_SIZE, VM_ALLOC);
if (!area) {
@@ -57,6 +62,15 @@ static int text_area_cpu_up(unsigned int cpu)
cpu);
return -1;
}
+
+   // Map/unmap the area to ensure all page tables are pre-allocated
+   addr = (unsigned long)area->addr;
+   err = map_patch_area(empty_zero_page, addr);
+   if (err)
+   return err;
+
+   unmap_patch_area(addr);
+
this_cpu_write(text_poke_area, area);
 
return 0;
-- 
2.34.1



[PATCH AUTOSEL 5.15 60/98] powerpc/64s/hash: Make hash faults work in NMI context

2022-04-01 Thread Sasha Levin
From: Nicholas Piggin 

[ Upstream commit 8b91cee5eadd2021f55e6775f2d50bd56d00c217 ]

Hash faults are not resolved in NMI context, instead causing the access
to fail. This is done because perf interrupts can get backtraces
including walking the user stack, and taking a hash fault on those could
deadlock on the HPTE lock if the perf interrupt hits while the same HPTE
lock is being held by the hash fault code. The user access for the stack
walking will notice that the access failed and deal with it in the perf
code.

The reason for allowing perf interrupts in is to get better profiling of
hash faults.

The problem with this is that any hash fault on a kernel access that happens
in NMI context will crash, because kernel accesses must not fail.

Hard lockups, system reset, and machine checks that access vmalloc space
(including modules, stack backtracing and symbol lookup in modules,
per-cpu data, etc.) could all run into this problem.

Fix this by disallowing perf interrupts in the hash fault code (the
direct hash fault is covered by MSR[EE]=0 so the PMI disable just needs
to extend to the preload case). This simplifies the tricky logic in hash
faults and perf, at the cost of reduced profiling of hash faults.

perf can still latch addresses when interrupts are disabled, it just
won't get the stack trace at that point, so it would still find hot
spots, just sometimes with confusing stack chains.

An alternative could be to allow perf interrupts here but always do the
slowpath stack walk if we are in nmi context, but that slows down all
perf interrupt stack walking on hash though and it does not remove as
much tricky code.

Reported-by: Laurent Dufour 
Signed-off-by: Nicholas Piggin 
Tested-by: Laurent Dufour 
Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220204035348.545435-1-npig...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/include/asm/interrupt.h  |  2 +-
 arch/powerpc/mm/book3s64/hash_utils.c | 54 ---
 arch/powerpc/perf/callchain.h |  9 +
 arch/powerpc/perf/callchain_64.c  | 27 --
 4 files changed, 10 insertions(+), 82 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index a1d238255f07..a07960066b5f 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -567,7 +567,7 @@ DECLARE_INTERRUPT_HANDLER_RAW(do_slb_fault);
 DECLARE_INTERRUPT_HANDLER(do_bad_slb_fault);
 
 /* hash_utils.c */
-DECLARE_INTERRUPT_HANDLER_RAW(do_hash_fault);
+DECLARE_INTERRUPT_HANDLER(do_hash_fault);
 
 /* fault.c */
 DECLARE_INTERRUPT_HANDLER(do_page_fault);
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index c145776d3ae5..7bfd88c4b547 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1522,8 +1522,7 @@ int hash_page(unsigned long ea, unsigned long access, 
unsigned long trap,
 }
 EXPORT_SYMBOL_GPL(hash_page);
 
-DECLARE_INTERRUPT_HANDLER(__do_hash_fault);
-DEFINE_INTERRUPT_HANDLER(__do_hash_fault)
+DEFINE_INTERRUPT_HANDLER(do_hash_fault)
 {
unsigned long ea = regs->dar;
unsigned long dsisr = regs->dsisr;
@@ -1582,35 +1581,6 @@ DEFINE_INTERRUPT_HANDLER(__do_hash_fault)
}
 }
 
-/*
- * The _RAW interrupt entry checks for the in_nmi() case before
- * running the full handler.
- */
-DEFINE_INTERRUPT_HANDLER_RAW(do_hash_fault)
-{
-   /*
-* If we are in an "NMI" (e.g., an interrupt when soft-disabled), then
-* don't call hash_page, just fail the fault. This is required to
-* prevent re-entrancy problems in the hash code, namely perf
-* interrupts hitting while something holds H_PAGE_BUSY, and taking a
-* hash fault. See the comment in hash_preload().
-*
-* We come here as a result of a DSI at a point where we don't want
-* to call hash_page, such as when we are accessing memory (possibly
-* user memory) inside a PMU interrupt that occurred while interrupts
-* were soft-disabled.  We want to invoke the exception handler for
-* the access, or panic if there isn't a handler.
-*/
-   if (unlikely(in_nmi())) {
-   do_bad_page_fault_segv(regs);
-   return 0;
-   }
-
-   __do_hash_fault(regs);
-
-   return 0;
-}
-
 #ifdef CONFIG_PPC_MM_SLICES
 static bool should_hash_preload(struct mm_struct *mm, unsigned long ea)
 {
@@ -1677,26 +1647,18 @@ static void hash_preload(struct mm_struct *mm, pte_t 
*ptep, unsigned long ea,
 #endif /* CONFIG_PPC_64K_PAGES */
 
/*
-* __hash_page_* must run with interrupts off, as it sets the
-* H_PAGE_BUSY bit. It's possible for perf interrupts to hit at any
-* time and may take a hash fault reading the user stack, see
-* read_user_stack_slow() in the powerpc/perf code.
-*
-* If that takes a hash fault on the same page 
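
A hedged sketch of the shape of the change described above for the preload
path (not the literal hunk; powerpc_local_irq_pmu_save/restore are assumed
here as the existing powerpc helpers that also mask performance-monitor
interrupts):

#include <asm/hw_irq.h>

/* Sketch: keep PMIs out while H_PAGE_BUSY may be held, so a perf
 * interrupt can never re-enter the hash code and deadlock on the
 * HPTE lock. The direct hash fault path already runs with MSR[EE]=0. */
static void example_preload_section(void)
{
	unsigned long flags;

	powerpc_local_irq_pmu_save(flags);	/* instead of local_irq_save() */
	/* ... __hash_page_*() work that may take the HPTE lock ... */
	powerpc_local_irq_pmu_restore(flags);
}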

[PATCH AUTOSEL 5.15 40/98] powerpc/set_memory: Avoid spinlock recursion in change_page_attr()

2022-04-01 Thread Sasha Levin
From: Christophe Leroy 

[ Upstream commit a4c182ecf33584b9b2d1aa9dad073014a504c01f ]

Commit 1f9ad21c3b38 ("powerpc/mm: Implement set_memory() routines")
included a spin_lock() to change_page_attr() in order to
safely perform the three-step operation. But then
commit 9f7853d7609d ("powerpc/mm: Fix set_memory_*() against
concurrent accesses") modified it to use pte_update() and do
the operation safely against concurrent access.

In the meantime, Maxime reported some spinlock recursion.

[   15.351649] BUG: spinlock recursion on CPU#0, kworker/0:2/217
[   15.357540]  lock: init_mm+0x3c/0x420, .magic: dead4ead, .owner: 
kworker/0:2/217, .owner_cpu: 0
[   15.366563] CPU: 0 PID: 217 Comm: kworker/0:2 Not tainted 5.15.0+ #523
[   15.373350] Workqueue: events do_free_init
[   15.377615] Call Trace:
[   15.380232] [e4105ac0] [800946a4] do_raw_spin_lock+0xf8/0x120 (unreliable)
[   15.387340] [e4105ae0] [8001f4ec] change_page_attr+0x40/0x1d4
[   15.393413] [e4105b10] [801424e0] __apply_to_page_range+0x164/0x310
[   15.49] [e4105b60] [80169620] free_pcp_prepare+0x1e4/0x4a0
[   15.406045] [e4105ba0] [8016c5a0] free_unref_page+0x40/0x2b8
[   15.411979] [e4105be0] [8018724c] kasan_depopulate_vmalloc_pte+0x6c/0x94
[   15.418989] [e4105c00] [801424e0] __apply_to_page_range+0x164/0x310
[   15.425451] [e4105c50] [80187834] kasan_release_vmalloc+0xbc/0x134
[   15.431898] [e4105c70] [8015f7a8] __purge_vmap_area_lazy+0x4e4/0xdd8
[   15.438560] [e4105d30] [80160d10] _vm_unmap_aliases.part.0+0x17c/0x24c
[   15.445283] [e4105d60] [801642d0] __vunmap+0x2f0/0x5c8
[   15.450684] [e4105db0] [800e32d0] do_free_init+0x68/0x94
[   15.456181] [e4105dd0] [8005d094] process_one_work+0x4bc/0x7b8
[   15.462283] [e4105e90] [8005d614] worker_thread+0x284/0x6e8
[   15.468227] [e4105f00] [8006aaec] kthread+0x1f0/0x210
[   15.473489] [e4105f40] [80017148] ret_from_kernel_thread+0x14/0x1c

Remove the read / modify / write sequence to make the operation atomic
and remove the spin_lock() in change_page_attr().

To do the operation atomically, we can't use pte modification helpers
anymore. Because all platforms have different combinations of bits, it
is not easy to use those bits directly. But all have the
_PAGE_KERNEL_{RO/ROX/RW/RWX} set of flags. All we need is to compare
two sets to know which bits are set or cleared.

For instance, by comparing _PAGE_KERNEL_ROX and _PAGE_KERNEL_RO you
know which bit gets cleared and which bit gets set when changing exec
permission.

Reported-by: Maxime Bizon 
Signed-off-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/all/20211212112152.GA27070@sakura/
Link: 
https://lore.kernel.org/r/43c3c76a1175ae6dc1a3d3b5c3f7ecb48f683eea.1640344012.git.christophe.le...@csgroup.eu
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/pageattr.c | 32 +---
 1 file changed, 13 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
index edea388e9d3f..8812454e70ff 100644
--- a/arch/powerpc/mm/pageattr.c
+++ b/arch/powerpc/mm/pageattr.c
@@ -15,12 +15,14 @@
 #include 
 
 
+static pte_basic_t pte_update_delta(pte_t *ptep, unsigned long addr,
+   unsigned long old, unsigned long new)
+{
+   return pte_update(&init_mm, addr, ptep, old & ~new, new & ~old, 0);
+}
+
 /*
- * Updates the attributes of a page in three steps:
- *
- * 1. take the page_table_lock
- * 2. install the new entry with the updated attributes
- * 3. flush the TLB
+ * Updates the attributes of a page atomically.
  *
  * This sequence is safe against concurrent updates, and also allows updating 
the
  * attributes of a page currently being executed or accessed.
@@ -28,41 +30,33 @@
 static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
 {
long action = (long)data;
-   pte_t pte;
-
-   spin_lock(&init_mm.page_table_lock);
-
-   pte = ptep_get(ptep);
 
-   /* modify the PTE bits as desired, then apply */
+   /* modify the PTE bits as desired */
switch (action) {
case SET_MEMORY_RO:
-   pte = pte_wrprotect(pte);
+   /* Don't clear DIRTY bit */
+   pte_update_delta(ptep, addr, _PAGE_KERNEL_RW & ~_PAGE_DIRTY, 
_PAGE_KERNEL_RO);
break;
case SET_MEMORY_RW:
-   pte = pte_mkwrite(pte_mkdirty(pte));
+   pte_update_delta(ptep, addr, _PAGE_KERNEL_RO, _PAGE_KERNEL_RW);
break;
case SET_MEMORY_NX:
-   pte = pte_exprotect(pte);
+   pte_update_delta(ptep, addr, _PAGE_KERNEL_ROX, _PAGE_KERNEL_RO);
break;
case SET_MEMORY_X:
-   pte = pte_mkexec(pte);
+   pte_update_delta(ptep, addr, _PAGE_KERNEL_RO, _PAGE_KERNEL_ROX);
break;
default:
WARN_ON_ONCE(1);
break;
}
 
-   pte_update(&init_mm, addr, ptep, ~0UL, pte_val(pte), 0);
-
/* See ptesync comment in 
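
The clr/set derivation described in the message above is just set difference
on the flag masks. A hedged standalone illustration with made-up bit values
(the real _PAGE_KERNEL_* masks differ per platform, which is exactly why the
code derives the masks rather than hard-coding bits):

#include <stdint.h>
#include <stdio.h>

/* Made-up example flag sets; only the relationship between them matters. */
#define EX_KERNEL_RO	0x001u		/* say: READ */
#define EX_KERNEL_ROX	0x005u		/* say: READ | EXEC */

int main(void)
{
	uint32_t old = EX_KERNEL_ROX, new = EX_KERNEL_RO;	/* SET_MEMORY_NX */

	/* Same derivation as pte_update_delta(): bits only in 'old' are
	 * cleared, bits only in 'new' are set, in one atomic pte_update(). */
	uint32_t clr = old & ~new;	/* 0x004: the EXEC bit is cleared */
	uint32_t set = new & ~old;	/* 0x000: nothing needs setting   */

	printf("clr=%#x set=%#x\n", clr, set);
	return 0;
}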

[PATCH AUTOSEL 5.15 29/98] powerpc: Set crashkernel offset to mid of RMA region

2022-04-01 Thread Sasha Levin
From: Sourabh Jain 

[ Upstream commit 7c5ed82b800d8615cdda00729e7b62e5899f0b13 ]

On large config LPARs (having 192 and more cores), Linux fails to boot
due to insufficient memory in the first memblock. It is due to the
memory reservation for the crash kernel which starts at 128MB offset of
the first memblock. This memory reservation for the crash kernel doesn't
leave enough space in the first memblock to accommodate other essential
system resources.

The crash kernel start address was set to a 128MB offset by default to
ensure that the crash kernel gets some memory below the RMA region, which
used to be 256MB in size. But given that the RMA region size can now be
512MB or more, setting the crash kernel offset to the middle of the RMA
leaves enough space for the kernel to allocate memory for other system
resources.

Since the above crash kernel offset change is only applicable to the LPAR
platform, the LPAR feature detection is pushed before the crash kernel
reservation. The rest of LPAR specific initialization will still
be done during pseries_probe_fw_features as usual.

This patch depends on changes to paca allocation for the boot CPU. It
expects the boot CPU to discover 1T segment support, which is introduced by
the patch posted here:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-January/239175.html

Reported-by: Abdul haleem 
Signed-off-by: Sourabh Jain 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20220204085601.107257-1-sourabhj...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/rtas.c |  6 ++
 arch/powerpc/kexec/core.c  | 15 +++
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index ff80bbad22a5..e18a725a8e5d 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1235,6 +1235,12 @@ int __init early_init_dt_scan_rtas(unsigned long node,
entryp = of_get_flat_dt_prop(node, "linux,rtas-entry", NULL);
sizep  = of_get_flat_dt_prop(node, "rtas-size", NULL);
 
+#ifdef CONFIG_PPC64
+   /* need this feature to decide the crashkernel offset */
+   if (of_get_flat_dt_prop(node, "ibm,hypertas-functions", NULL))
+   powerpc_firmware_features |= FW_FEATURE_LPAR;
+#endif
+
if (basep && entryp && sizep) {
rtas.base = *basep;
rtas.entry = *entryp;
diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 48525e8b5730..71b1bfdadd76 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -147,11 +147,18 @@ void __init reserve_crashkernel(void)
if (!crashk_res.start) {
 #ifdef CONFIG_PPC64
/*
-* On 64bit we split the RMO in half but cap it at half of
-* a small SLB (128MB) since the crash kernel needs to place
-* itself and some stacks to be in the first segment.
+* On the LPAR platform place the crash kernel to mid of
+* RMA size (512MB or more) to ensure the crash kernel
+* gets enough space to place itself and some stack to be
+* in the first segment. At the same time normal kernel
+* also get enough space to allocate memory for essential
+* system resource in the first segment. Keep the crash
+* kernel starts at 128MB offset on other platforms.
 */
-   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   crashk_res.start = ppc64_rma_size / 2;
+   else
+   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
 #else
crashk_res.start = KDUMP_KERNELBASE;
 #endif
-- 
2.34.1



[PATCH AUTOSEL 5.15 12/98] powerpc: dts: t104xrdb: fix phy type for FMAN 4/5

2022-04-01 Thread Sasha Levin
From: Maxim Kiselev 

[ Upstream commit 17846485dff91acce1ad47b508b633dffc32e838 ]

T1040RDB has two RTL8211E-VB PHYs which require internal
delays to be set in order to work correctly.

Changing the phy-connection-type property to `rgmii-id`
will fix this issue.

Signed-off-by: Maxim Kiselev 
Reviewed-by: Maxim Kochetkov 
Reviewed-by: Vladimir Oltean 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20211230151123.1258321-1-biguncle...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/boot/dts/fsl/t104xrdb.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi 
b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
index 099a598c74c0..bfe1ed5be337 100644
--- a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
@@ -139,12 +139,12 @@ pca9546@77 {
fman@400000 {
ethernet@e6000 {
phy-handle = <&phy_rgmii_0>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
ethernet@e8000 {
phy-handle = <&phy_rgmii_1>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
mdio0: mdio@fc000 {
-- 
2.34.1



[PATCH AUTOSEL 5.16 087/109] powerpc/secvar: fix refcount leak in format_show()

2022-04-01 Thread Sasha Levin
From: Hangyu Hua 

[ Upstream commit d601fd24e6964967f115f036a840f4f28488f63f ]

A refcount leak will happen when format_show() returns failure in multiple
cases. Unified management of of_node_put() fixes this problem.

Signed-off-by: Hangyu Hua 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220302021959.10959-1-hbh...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/secvar-sysfs.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
b/arch/powerpc/kernel/secvar-sysfs.c
index a0a78aba2083..1ee4640a2641 100644
--- a/arch/powerpc/kernel/secvar-sysfs.c
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -26,15 +26,18 @@ static ssize_t format_show(struct kobject *kobj, struct 
kobj_attribute *attr,
const char *format;
 
node = of_find_compatible_node(NULL, NULL, "ibm,secvar-backend");
-   if (!of_device_is_available(node))
-   return -ENODEV;
+   if (!of_device_is_available(node)) {
+   rc = -ENODEV;
+   goto out;
+   }
 
rc = of_property_read_string(node, "format", &format);
if (rc)
-   return rc;
+   goto out;
 
rc = sprintf(buf, "%s\n", format);
 
+out:
of_node_put(node);
 
return rc;
-- 
2.34.1



[PATCH AUTOSEL 5.16 086/109] powerpc/64e: Tie PPC_BOOK3E_64 to PPC_FSL_BOOK3E

2022-04-01 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 1a76e520ee1831a81dabf8a9a58c6453f700026e ]

Since the IBM A2 CPU support was removed, see commit
fb5a515704d7 ("powerpc: Remove platforms/wsp and associated pieces"),
the only 64-bit Book3E CPUs we support are Freescale (NXP) ones.

However our Kconfig still allows configuring a kernel that has 64-bit
Book3E support, but no Freescale CPU support enabled. Such a kernel
would never boot; it doesn't know about any CPUs.

It also causes build errors, as reported by lkp, because
PPC_BARRIER_NOSPEC is not enabled in such a configuration:

  powerpc64-linux-ld: arch/powerpc/net/bpf_jit_comp64.o:(.toc+0x0):
  undefined reference to `powerpc_security_features'

To fix this, force PPC_FSL_BOOK3E to be selected whenever we are
building a 64-bit Book3E kernel.

Reported-by: kernel test robot 
Reported-by: Naveen N. Rao 
Suggested-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220304061222.2478720-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/Kconfig.cputype | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index a208997ade88..87a95cbff2f3 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -111,6 +111,7 @@ config PPC_BOOK3S_64
 
 config PPC_BOOK3E_64
bool "Embedded processors"
+   select PPC_FSL_BOOK3E
select PPC_FPU # Make it a choice ?
select PPC_SMP_MUXED_IPI
select PPC_DOORBELL
@@ -287,7 +288,7 @@ config FSL_BOOKE
 config PPC_FSL_BOOK3E
bool
select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
-   select FSL_EMB_PERFMON
+   imply FSL_EMB_PERFMON
select PPC_SMP_MUXED_IPI
select PPC_DOORBELL
default y if FSL_BOOKE
-- 
2.34.1



[PATCH AUTOSEL 5.16 071/109] powerpc/64s/hash: Make hash faults work in NMI context

2022-04-01 Thread Sasha Levin
From: Nicholas Piggin 

[ Upstream commit 8b91cee5eadd2021f55e6775f2d50bd56d00c217 ]

Hash faults are not resolved in NMI context, instead causing the access
to fail. This is done because perf interrupts can get backtraces
including walking the user stack, and taking a hash fault on those could
deadlock on the HPTE lock if the perf interrupt hits while the same HPTE
lock is being held by the hash fault code. The user access for the stack
walking will notice that the access failed and deal with it in the perf
code.

The reason for allowing perf interrupts in is to get better profiling of
hash faults.

The problem with this is that any hash fault on a kernel access that happens
in NMI context will crash, because kernel accesses must not fail.

Hard lockups, system reset, and machine checks that access vmalloc space
(including modules, stack backtracing and symbol lookup in modules,
per-cpu data, etc.) could all run into this problem.

Fix this by disallowing perf interrupts in the hash fault code (the
direct hash fault is covered by MSR[EE]=0 so the PMI disable just needs
to extend to the preload case). This simplifies the tricky logic in hash
faults and perf, at the cost of reduced profiling of hash faults.

perf can still latch addresses when interrupts are disabled, it just
won't get the stack trace at that point, so it would still find hot
spots, just sometimes with confusing stack chains.

An alternative could be to allow perf interrupts here but always do the
slowpath stack walk if we are in nmi context, but that slows down all
perf interrupt stack walking on hash though and it does not remove as
much tricky code.

Reported-by: Laurent Dufour 
Signed-off-by: Nicholas Piggin 
Tested-by: Laurent Dufour 
Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220204035348.545435-1-npig...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/include/asm/interrupt.h  |  2 +-
 arch/powerpc/mm/book3s64/hash_utils.c | 54 ---
 arch/powerpc/perf/callchain.h |  9 +
 arch/powerpc/perf/callchain_64.c  | 27 --
 4 files changed, 10 insertions(+), 82 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index a1d238255f07..a07960066b5f 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -567,7 +567,7 @@ DECLARE_INTERRUPT_HANDLER_RAW(do_slb_fault);
 DECLARE_INTERRUPT_HANDLER(do_bad_slb_fault);
 
 /* hash_utils.c */
-DECLARE_INTERRUPT_HANDLER_RAW(do_hash_fault);
+DECLARE_INTERRUPT_HANDLER(do_hash_fault);
 
 /* fault.c */
 DECLARE_INTERRUPT_HANDLER(do_page_fault);
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index cfd45245d009..f77fd4428db3 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1522,8 +1522,7 @@ int hash_page(unsigned long ea, unsigned long access, 
unsigned long trap,
 }
 EXPORT_SYMBOL_GPL(hash_page);
 
-DECLARE_INTERRUPT_HANDLER(__do_hash_fault);
-DEFINE_INTERRUPT_HANDLER(__do_hash_fault)
+DEFINE_INTERRUPT_HANDLER(do_hash_fault)
 {
unsigned long ea = regs->dar;
unsigned long dsisr = regs->dsisr;
@@ -1582,35 +1581,6 @@ DEFINE_INTERRUPT_HANDLER(__do_hash_fault)
}
 }
 
-/*
- * The _RAW interrupt entry checks for the in_nmi() case before
- * running the full handler.
- */
-DEFINE_INTERRUPT_HANDLER_RAW(do_hash_fault)
-{
-   /*
-* If we are in an "NMI" (e.g., an interrupt when soft-disabled), then
-* don't call hash_page, just fail the fault. This is required to
-* prevent re-entrancy problems in the hash code, namely perf
-* interrupts hitting while something holds H_PAGE_BUSY, and taking a
-* hash fault. See the comment in hash_preload().
-*
-* We come here as a result of a DSI at a point where we don't want
-* to call hash_page, such as when we are accessing memory (possibly
-* user memory) inside a PMU interrupt that occurred while interrupts
-* were soft-disabled.  We want to invoke the exception handler for
-* the access, or panic if there isn't a handler.
-*/
-   if (unlikely(in_nmi())) {
-   do_bad_page_fault_segv(regs);
-   return 0;
-   }
-
-   __do_hash_fault(regs);
-
-   return 0;
-}
-
 #ifdef CONFIG_PPC_MM_SLICES
 static bool should_hash_preload(struct mm_struct *mm, unsigned long ea)
 {
@@ -1677,26 +1647,18 @@ static void hash_preload(struct mm_struct *mm, pte_t 
*ptep, unsigned long ea,
 #endif /* CONFIG_PPC_64K_PAGES */
 
/*
-* __hash_page_* must run with interrupts off, as it sets the
-* H_PAGE_BUSY bit. It's possible for perf interrupts to hit at any
-* time and may take a hash fault reading the user stack, see
-* read_user_stack_slow() in the powerpc/perf code.
-*
-* If that takes a hash fault on the same page 

[PATCH AUTOSEL 5.16 048/109] powerpc/set_memory: Avoid spinlock recursion in change_page_attr()

2022-04-01 Thread Sasha Levin
From: Christophe Leroy 

[ Upstream commit a4c182ecf33584b9b2d1aa9dad073014a504c01f ]

Commit 1f9ad21c3b38 ("powerpc/mm: Implement set_memory() routines")
included a spin_lock() to change_page_attr() in order to
safely perform the three-step operation. But then
commit 9f7853d7609d ("powerpc/mm: Fix set_memory_*() against
concurrent accesses") modified it to use pte_update() and do
the operation safely against concurrent access.

In the meantime, Maxime reported some spinlock recursion.

[   15.351649] BUG: spinlock recursion on CPU#0, kworker/0:2/217
[   15.357540]  lock: init_mm+0x3c/0x420, .magic: dead4ead, .owner: 
kworker/0:2/217, .owner_cpu: 0
[   15.366563] CPU: 0 PID: 217 Comm: kworker/0:2 Not tainted 5.15.0+ #523
[   15.373350] Workqueue: events do_free_init
[   15.377615] Call Trace:
[   15.380232] [e4105ac0] [800946a4] do_raw_spin_lock+0xf8/0x120 (unreliable)
[   15.387340] [e4105ae0] [8001f4ec] change_page_attr+0x40/0x1d4
[   15.393413] [e4105b10] [801424e0] __apply_to_page_range+0x164/0x310
[   15.49] [e4105b60] [80169620] free_pcp_prepare+0x1e4/0x4a0
[   15.406045] [e4105ba0] [8016c5a0] free_unref_page+0x40/0x2b8
[   15.411979] [e4105be0] [8018724c] kasan_depopulate_vmalloc_pte+0x6c/0x94
[   15.418989] [e4105c00] [801424e0] __apply_to_page_range+0x164/0x310
[   15.425451] [e4105c50] [80187834] kasan_release_vmalloc+0xbc/0x134
[   15.431898] [e4105c70] [8015f7a8] __purge_vmap_area_lazy+0x4e4/0xdd8
[   15.438560] [e4105d30] [80160d10] _vm_unmap_aliases.part.0+0x17c/0x24c
[   15.445283] [e4105d60] [801642d0] __vunmap+0x2f0/0x5c8
[   15.450684] [e4105db0] [800e32d0] do_free_init+0x68/0x94
[   15.456181] [e4105dd0] [8005d094] process_one_work+0x4bc/0x7b8
[   15.462283] [e4105e90] [8005d614] worker_thread+0x284/0x6e8
[   15.468227] [e4105f00] [8006aaec] kthread+0x1f0/0x210
[   15.473489] [e4105f40] [80017148] ret_from_kernel_thread+0x14/0x1c

Remove the read / modify / write sequence to make the operation atomic
and remove the spin_lock() in change_page_attr().

To do the operation atomically, we can't use pte modification helpers
anymore. Because all platforms have different combinations of bits, it
is not easy to use those bits directly. But all have the
_PAGE_KERNEL_{RO/ROX/RW/RWX} set of flags. All we need is to compare
two sets to know which bits are set or cleared.

For instance, by comparing _PAGE_KERNEL_ROX and _PAGE_KERNEL_RO you
know which bit gets cleared and which bit gets set when changing exec
permission.

Reported-by: Maxime Bizon 
Signed-off-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/all/20211212112152.GA27070@sakura/
Link: 
https://lore.kernel.org/r/43c3c76a1175ae6dc1a3d3b5c3f7ecb48f683eea.1640344012.git.christophe.le...@csgroup.eu
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/pageattr.c | 32 +---
 1 file changed, 13 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
index edea388e9d3f..8812454e70ff 100644
--- a/arch/powerpc/mm/pageattr.c
+++ b/arch/powerpc/mm/pageattr.c
@@ -15,12 +15,14 @@
 #include 
 
 
+static pte_basic_t pte_update_delta(pte_t *ptep, unsigned long addr,
+   unsigned long old, unsigned long new)
+{
+   return pte_update(&init_mm, addr, ptep, old & ~new, new & ~old, 0);
+}
+
 /*
- * Updates the attributes of a page in three steps:
- *
- * 1. take the page_table_lock
- * 2. install the new entry with the updated attributes
- * 3. flush the TLB
+ * Updates the attributes of a page atomically.
  *
  * This sequence is safe against concurrent updates, and also allows updating 
the
  * attributes of a page currently being executed or accessed.
@@ -28,41 +30,33 @@
 static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
 {
long action = (long)data;
-   pte_t pte;
-
-   spin_lock(&init_mm.page_table_lock);
-
-   pte = ptep_get(ptep);
 
-   /* modify the PTE bits as desired, then apply */
+   /* modify the PTE bits as desired */
switch (action) {
case SET_MEMORY_RO:
-   pte = pte_wrprotect(pte);
+   /* Don't clear DIRTY bit */
+   pte_update_delta(ptep, addr, _PAGE_KERNEL_RW & ~_PAGE_DIRTY, 
_PAGE_KERNEL_RO);
break;
case SET_MEMORY_RW:
-   pte = pte_mkwrite(pte_mkdirty(pte));
+   pte_update_delta(ptep, addr, _PAGE_KERNEL_RO, _PAGE_KERNEL_RW);
break;
case SET_MEMORY_NX:
-   pte = pte_exprotect(pte);
+   pte_update_delta(ptep, addr, _PAGE_KERNEL_ROX, _PAGE_KERNEL_RO);
break;
case SET_MEMORY_X:
-   pte = pte_mkexec(pte);
+   pte_update_delta(ptep, addr, _PAGE_KERNEL_RO, _PAGE_KERNEL_ROX);
break;
default:
WARN_ON_ONCE(1);
break;
}
 
-   pte_update(&init_mm, addr, ptep, ~0UL, pte_val(pte), 0);
-
/* See ptesync comment in 

[PATCH AUTOSEL 5.16 035/109] powerpc: Set crashkernel offset to mid of RMA region

2022-04-01 Thread Sasha Levin
From: Sourabh Jain 

[ Upstream commit 7c5ed82b800d8615cdda00729e7b62e5899f0b13 ]

On large config LPARs (having 192 and more cores), Linux fails to boot
due to insufficient memory in the first memblock. It is due to the
memory reservation for the crash kernel which starts at 128MB offset of
the first memblock. This memory reservation for the crash kernel doesn't
leave enough space in the first memblock to accommodate other essential
system resources.

The crash kernel start address was set to a 128MB offset by default to
ensure that the crash kernel gets some memory below the RMA region, which
used to be 256MB in size. But given that the RMA region size can now be
512MB or more, setting the crash kernel offset to the middle of the RMA
leaves enough space for the kernel to allocate memory for other system
resources.

Since the above crash kernel offset change is only applicable to the LPAR
platform, the LPAR feature detection is pushed before the crash kernel
reservation. The rest of LPAR specific initialization will still
be done during pseries_probe_fw_features as usual.

This patch depends on changes to paca allocation for the boot CPU. It
expects the boot CPU to discover 1T segment support, which is introduced by
the patch posted here:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-January/239175.html

Reported-by: Abdul haleem 
Signed-off-by: Sourabh Jain 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20220204085601.107257-1-sourabhj...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/rtas.c |  6 ++
 arch/powerpc/kexec/core.c  | 15 +++
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index ff80bbad22a5..e18a725a8e5d 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1235,6 +1235,12 @@ int __init early_init_dt_scan_rtas(unsigned long node,
entryp = of_get_flat_dt_prop(node, "linux,rtas-entry", NULL);
sizep  = of_get_flat_dt_prop(node, "rtas-size", NULL);
 
+#ifdef CONFIG_PPC64
+   /* need this feature to decide the crashkernel offset */
+   if (of_get_flat_dt_prop(node, "ibm,hypertas-functions", NULL))
+   powerpc_firmware_features |= FW_FEATURE_LPAR;
+#endif
+
if (basep && entryp && sizep) {
rtas.base = *basep;
rtas.entry = *entryp;
diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index a2242017e55f..11b327fa135c 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -134,11 +134,18 @@ void __init reserve_crashkernel(void)
if (!crashk_res.start) {
 #ifdef CONFIG_PPC64
/*
-* On 64bit we split the RMO in half but cap it at half of
-* a small SLB (128MB) since the crash kernel needs to place
-* itself and some stacks to be in the first segment.
+* On the LPAR platform place the crash kernel to mid of
+* RMA size (512MB or more) to ensure the crash kernel
+* gets enough space to place itself and some stack to be
+* in the first segment. At the same time normal kernel
+* also get enough space to allocate memory for essential
+* system resource in the first segment. Keep the crash
+* kernel starts at 128MB offset on other platforms.
 */
-   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   crashk_res.start = ppc64_rma_size / 2;
+   else
+   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
 #else
crashk_res.start = KDUMP_KERNELBASE;
 #endif
-- 
2.34.1



[PATCH AUTOSEL 5.16 018/109] powerpc: dts: t104xrdb: fix phy type for FMAN 4/5

2022-04-01 Thread Sasha Levin
From: Maxim Kiselev 

[ Upstream commit 17846485dff91acce1ad47b508b633dffc32e838 ]

T1040RDB has two RTL8211E-VB PHYs which require internal
delays to be set in order to work correctly.

Changing the phy-connection-type property to `rgmii-id`
will fix this issue.

Signed-off-by: Maxim Kiselev 
Reviewed-by: Maxim Kochetkov 
Reviewed-by: Vladimir Oltean 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20211230151123.1258321-1-biguncle...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/boot/dts/fsl/t104xrdb.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi 
b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
index 099a598c74c0..bfe1ed5be337 100644
--- a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
@@ -139,12 +139,12 @@ pca9546@77 {
fman@400000 {
ethernet@e6000 {
phy-handle = <&phy_rgmii_0>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
ethernet@e8000 {
phy-handle = <&phy_rgmii_1>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
mdio0: mdio@fc000 {
-- 
2.34.1



[PATCH AUTOSEL 5.17 124/149] powerpc/secvar: fix refcount leak in format_show()

2022-04-01 Thread Sasha Levin
From: Hangyu Hua 

[ Upstream commit d601fd24e6964967f115f036a840f4f28488f63f ]

A refcount leak will happen when format_show() returns failure in multiple
cases. Unified management of of_node_put() fixes this problem.

Signed-off-by: Hangyu Hua 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220302021959.10959-1-hbh...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/secvar-sysfs.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/secvar-sysfs.c 
b/arch/powerpc/kernel/secvar-sysfs.c
index a0a78aba2083..1ee4640a2641 100644
--- a/arch/powerpc/kernel/secvar-sysfs.c
+++ b/arch/powerpc/kernel/secvar-sysfs.c
@@ -26,15 +26,18 @@ static ssize_t format_show(struct kobject *kobj, struct 
kobj_attribute *attr,
const char *format;
 
node = of_find_compatible_node(NULL, NULL, "ibm,secvar-backend");
-   if (!of_device_is_available(node))
-   return -ENODEV;
+   if (!of_device_is_available(node)) {
+   rc = -ENODEV;
+   goto out;
+   }
 
rc = of_property_read_string(node, "format", &format);
if (rc)
-   return rc;
+   goto out;
 
rc = sprintf(buf, "%s\n", format);
 
+out:
of_node_put(node);
 
return rc;
-- 
2.34.1



[PATCH AUTOSEL 5.17 123/149] powerpc/64e: Tie PPC_BOOK3E_64 to PPC_FSL_BOOK3E

2022-04-01 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 1a76e520ee1831a81dabf8a9a58c6453f700026e ]

Since the IBM A2 CPU support was removed, see commit
fb5a515704d7 ("powerpc: Remove platforms/wsp and associated pieces"),
the only 64-bit Book3E CPUs we support are Freescale (NXP) ones.

However our Kconfig still allows configuring a kernel that has 64-bit
Book3E support, but no Freescale CPU support enabled. Such a kernel
would never boot; it doesn't know about any CPUs.

It also causes build errors, as reported by lkp, because
PPC_BARRIER_NOSPEC is not enabled in such a configuration:

  powerpc64-linux-ld: arch/powerpc/net/bpf_jit_comp64.o:(.toc+0x0):
  undefined reference to `powerpc_security_features'

To fix this, force PPC_FSL_BOOK3E to be selected whenever we are
building a 64-bit Book3E kernel.

Reported-by: kernel test robot 
Reported-by: Naveen N. Rao 
Suggested-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220304061222.2478720-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/platforms/Kconfig.cputype | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index 87bc1929ee5a..e2e1fec91c6e 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -107,6 +107,7 @@ config PPC_BOOK3S_64
 
 config PPC_BOOK3E_64
bool "Embedded processors"
+   select PPC_FSL_BOOK3E
select PPC_FPU # Make it a choice ?
select PPC_SMP_MUXED_IPI
select PPC_DOORBELL
@@ -295,7 +296,7 @@ config FSL_BOOKE
 config PPC_FSL_BOOK3E
bool
select ARCH_SUPPORTS_HUGETLBFS if PHYS_64BIT || PPC64
-   select FSL_EMB_PERFMON
+   imply FSL_EMB_PERFMON
select PPC_SMP_MUXED_IPI
select PPC_DOORBELL
select PPC_KUEP
-- 
2.34.1



[PATCH AUTOSEL 5.17 122/149] powerpc/code-patching: Pre-map patch area

2022-04-01 Thread Sasha Levin
From: Michael Ellerman 

[ Upstream commit 591b4b268435f00d2f0b81f786c2c7bd5ef66416 ]

Paul reported a warning with DEBUG_ATOMIC_SLEEP=y:

  BUG: sleeping function called from invalid context at 
include/linux/sched/mm.h:256
  in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
  preempt_count: 0, expected: 0
  ...
  Call Trace:
dump_stack_lvl+0xa0/0xec (unreliable)
__might_resched+0x2f4/0x310
kmem_cache_alloc+0x220/0x4b0
__pud_alloc+0x74/0x1d0
hash__map_kernel_page+0x2cc/0x390
do_patch_instruction+0x134/0x4a0
arch_jump_label_transform+0x64/0x78
__jump_label_update+0x148/0x180
static_key_enable_cpuslocked+0xd0/0x120
static_key_enable+0x30/0x50
check_kvm_guest+0x60/0x88
pSeries_smp_probe+0x54/0xb0
smp_prepare_cpus+0x3e0/0x430
kernel_init_freeable+0x20c/0x43c
kernel_init+0x30/0x1a0
ret_from_kernel_thread+0x5c/0x64

Peter pointed out that this is because do_patch_instruction() has
disabled interrupts, but then map_patch_area() calls map_kernel_page()
then hash__map_kernel_page() which does a sleeping memory allocation.

We only see the warning in KVM guests with SMT enabled, which is not
particularly common, or on other platforms if CONFIG_KPROBES is
disabled, also not common. The reason we don't see it in most
configurations is that another path that happens to have interrupts
enabled has allocated the required page tables for us, eg. there's a
path in kprobes init that does that. That's just pure luck though.

As Christophe suggested, the simplest solution is to do a dummy
map/unmap when we initialise the patching, so that any required page
table levels are pre-allocated before the first call to
do_patch_instruction(). This works because the unmap doesn't free any
page tables that were allocated by the map, it just clears the PTE,
leaving the page table levels there for the next map.

Reported-by: Paul Menzel 
Debugged-by: Peter Zijlstra 
Suggested-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220223015821.473097-1-...@ellerman.id.au
Signed-off-by: Sasha Levin 
---
 arch/powerpc/lib/code-patching.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 906d43463366..00c68e7fb11e 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -43,9 +43,14 @@ int raw_patch_instruction(u32 *addr, ppc_inst_t instr)
 #ifdef CONFIG_STRICT_KERNEL_RWX
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
+static int map_patch_area(void *addr, unsigned long text_poke_addr);
+static void unmap_patch_area(unsigned long addr);
+
 static int text_area_cpu_up(unsigned int cpu)
 {
struct vm_struct *area;
+   unsigned long addr;
+   int err;
 
area = get_vm_area(PAGE_SIZE, VM_ALLOC);
if (!area) {
@@ -53,6 +58,15 @@ static int text_area_cpu_up(unsigned int cpu)
cpu);
return -1;
}
+
+   // Map/unmap the area to ensure all page tables are pre-allocated
+   addr = (unsigned long)area->addr;
+   err = map_patch_area(empty_zero_page, addr);
+   if (err)
+   return err;
+
+   unmap_patch_area(addr);
+
this_cpu_write(text_poke_area, area);
 
return 0;
-- 
2.34.1



[PATCH AUTOSEL 5.17 102/149] powerpc/64s/hash: Make hash faults work in NMI context

2022-04-01 Thread Sasha Levin
From: Nicholas Piggin 

[ Upstream commit 8b91cee5eadd2021f55e6775f2d50bd56d00c217 ]

Hash faults are not resolved in NMI context, instead causing the access
to fail. This is done because perf interrupts can get backtraces
including walking the user stack, and taking a hash fault on those could
deadlock on the HPTE lock if the perf interrupt hits while the same HPTE
lock is being held by the hash fault code. The user access for the stack
walking will notice that the access failed and deal with it in the perf
code.

The reason for allowing perf interrupts in is to get better profiling of
hash faults.

The problem with this is that any hash fault on a kernel access that happens
in NMI context will crash, because kernel accesses must not fail.

Hard lockups, system reset, and machine checks that access vmalloc space
(including modules, stack backtracing and symbol lookup in modules,
per-cpu data, etc.) could all run into this problem.

Fix this by disallowing perf interrupts in the hash fault code (the
direct hash fault is covered by MSR[EE]=0 so the PMI disable just needs
to extend to the preload case). This simplifies the tricky logic in hash
faults and perf, at the cost of reduced profiling of hash faults.

perf can still latch addresses when interrupts are disabled, it just
won't get the stack trace at that point, so it would still find hot
spots, just sometimes with confusing stack chains.

An alternative could be to allow perf interrupts here but always do the
slowpath stack walk if we are in nmi context, but that slows down all
perf interrupt stack walking on hash though and it does not remove as
much tricky code.

Reported-by: Laurent Dufour 
Signed-off-by: Nicholas Piggin 
Tested-by: Laurent Dufour 
Reviewed-by: Aneesh Kumar K.V 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220204035348.545435-1-npig...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/include/asm/interrupt.h  |  2 +-
 arch/powerpc/mm/book3s64/hash_utils.c | 54 ---
 arch/powerpc/perf/callchain.h |  9 +
 arch/powerpc/perf/callchain_64.c  | 27 --
 4 files changed, 10 insertions(+), 82 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index fc28f46d2f9d..5404f7abbcf8 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -612,7 +612,7 @@ DECLARE_INTERRUPT_HANDLER_RAW(do_slb_fault);
 DECLARE_INTERRUPT_HANDLER(do_bad_segment_interrupt);
 
 /* hash_utils.c */
-DECLARE_INTERRUPT_HANDLER_RAW(do_hash_fault);
+DECLARE_INTERRUPT_HANDLER(do_hash_fault);
 
 /* fault.c */
 DECLARE_INTERRUPT_HANDLER(do_page_fault);
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 7abf82a698d3..985cabdd7f67 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1621,8 +1621,7 @@ int hash_page(unsigned long ea, unsigned long access, 
unsigned long trap,
 }
 EXPORT_SYMBOL_GPL(hash_page);
 
-DECLARE_INTERRUPT_HANDLER(__do_hash_fault);
-DEFINE_INTERRUPT_HANDLER(__do_hash_fault)
+DEFINE_INTERRUPT_HANDLER(do_hash_fault)
 {
unsigned long ea = regs->dar;
unsigned long dsisr = regs->dsisr;
@@ -1681,35 +1680,6 @@ DEFINE_INTERRUPT_HANDLER(__do_hash_fault)
}
 }
 
-/*
- * The _RAW interrupt entry checks for the in_nmi() case before
- * running the full handler.
- */
-DEFINE_INTERRUPT_HANDLER_RAW(do_hash_fault)
-{
-   /*
-* If we are in an "NMI" (e.g., an interrupt when soft-disabled), then
-* don't call hash_page, just fail the fault. This is required to
-* prevent re-entrancy problems in the hash code, namely perf
-* interrupts hitting while something holds H_PAGE_BUSY, and taking a
-* hash fault. See the comment in hash_preload().
-*
-* We come here as a result of a DSI at a point where we don't want
-* to call hash_page, such as when we are accessing memory (possibly
-* user memory) inside a PMU interrupt that occurred while interrupts
-* were soft-disabled.  We want to invoke the exception handler for
-* the access, or panic if there isn't a handler.
-*/
-   if (unlikely(in_nmi())) {
-   do_bad_page_fault_segv(regs);
-   return 0;
-   }
-
-   __do_hash_fault(regs);
-
-   return 0;
-}
-
 #ifdef CONFIG_PPC_MM_SLICES
 static bool should_hash_preload(struct mm_struct *mm, unsigned long ea)
 {
@@ -1776,26 +1746,18 @@ static void hash_preload(struct mm_struct *mm, pte_t 
*ptep, unsigned long ea,
 #endif /* CONFIG_PPC_64K_PAGES */
 
/*
-* __hash_page_* must run with interrupts off, as it sets the
-* H_PAGE_BUSY bit. It's possible for perf interrupts to hit at any
-* time and may take a hash fault reading the user stack, see
-* read_user_stack_slow() in the powerpc/perf code.
-*
-* If that takes a hash fault on the 

[PATCH AUTOSEL 5.17 069/149] powerpc/set_memory: Avoid spinlock recursion in change_page_attr()

2022-04-01 Thread Sasha Levin
From: Christophe Leroy 

[ Upstream commit a4c182ecf33584b9b2d1aa9dad073014a504c01f ]

Commit 1f9ad21c3b38 ("powerpc/mm: Implement set_memory() routines")
included a spin_lock() in change_page_attr() in order to
safely perform the three-step operation. But then
commit 9f7853d7609d ("powerpc/mm: Fix set_memory_*() against
concurrent accesses") modified it to use pte_update() and do
the operation safely against concurrent accesses.

In the meantime, Maxime reported some spinlock recursion.

[   15.351649] BUG: spinlock recursion on CPU#0, kworker/0:2/217
[   15.357540]  lock: init_mm+0x3c/0x420, .magic: dead4ead, .owner: 
kworker/0:2/217, .owner_cpu: 0
[   15.366563] CPU: 0 PID: 217 Comm: kworker/0:2 Not tainted 5.15.0+ #523
[   15.373350] Workqueue: events do_free_init
[   15.377615] Call Trace:
[   15.380232] [e4105ac0] [800946a4] do_raw_spin_lock+0xf8/0x120 (unreliable)
[   15.387340] [e4105ae0] [8001f4ec] change_page_attr+0x40/0x1d4
[   15.393413] [e4105b10] [801424e0] __apply_to_page_range+0x164/0x310
[   15.49] [e4105b60] [80169620] free_pcp_prepare+0x1e4/0x4a0
[   15.406045] [e4105ba0] [8016c5a0] free_unref_page+0x40/0x2b8
[   15.411979] [e4105be0] [8018724c] kasan_depopulate_vmalloc_pte+0x6c/0x94
[   15.418989] [e4105c00] [801424e0] __apply_to_page_range+0x164/0x310
[   15.425451] [e4105c50] [80187834] kasan_release_vmalloc+0xbc/0x134
[   15.431898] [e4105c70] [8015f7a8] __purge_vmap_area_lazy+0x4e4/0xdd8
[   15.438560] [e4105d30] [80160d10] _vm_unmap_aliases.part.0+0x17c/0x24c
[   15.445283] [e4105d60] [801642d0] __vunmap+0x2f0/0x5c8
[   15.450684] [e4105db0] [800e32d0] do_free_init+0x68/0x94
[   15.456181] [e4105dd0] [8005d094] process_one_work+0x4bc/0x7b8
[   15.462283] [e4105e90] [8005d614] worker_thread+0x284/0x6e8
[   15.468227] [e4105f00] [8006aaec] kthread+0x1f0/0x210
[   15.473489] [e4105f40] [80017148] ret_from_kernel_thread+0x14/0x1c

Remove the read / modify / write sequence to make the operation atomic
and remove the spin_lock() in change_page_attr().

To do the operation atomically, we can't use pte modification helpers
anymore. Because all platforms have different combinations of bits, it
is not easy to use those bits directly. But all have the
_PAGE_KERNEL_{RO/ROX/RW/RWX} set of flags. All we need is to compare
two sets to know which bits are set or cleared.

For instance, by comparing _PAGE_KERNEL_ROX and _PAGE_KERNEL_RO you
know which bit gets cleared and which bit gets set when changing exec
permission.
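
A tiny self-contained illustration of that comparison (the flag values
here are made up just to show how the masks fall out; the real helper is
pte_update_delta() in the diff below):

#include <stdio.h>

/* made-up flag values, only to show how the clear/set masks fall out */
#define _PAGE_RW	0x001UL
#define _PAGE_KERNEL_RO	0x100UL
#define _PAGE_KERNEL_RW	(_PAGE_KERNEL_RO | _PAGE_RW)

int main(void)
{
	unsigned long old = _PAGE_KERNEL_RW, new = _PAGE_KERNEL_RO;

	/* bits in 'old' but not in 'new' get cleared, and vice versa */
	printf("clear: %#lx  set: %#lx\n", old & ~new, new & ~old);
	return 0;
}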

Reported-by: Maxime Bizon 
Signed-off-by: Christophe Leroy 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/all/20211212112152.GA27070@sakura/
Link: 
https://lore.kernel.org/r/43c3c76a1175ae6dc1a3d3b5c3f7ecb48f683eea.1640344012.git.christophe.le...@csgroup.eu
Signed-off-by: Sasha Levin 
---
 arch/powerpc/mm/pageattr.c | 32 +---
 1 file changed, 13 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
index edea388e9d3f..8812454e70ff 100644
--- a/arch/powerpc/mm/pageattr.c
+++ b/arch/powerpc/mm/pageattr.c
@@ -15,12 +15,14 @@
 #include 
 
 
+static pte_basic_t pte_update_delta(pte_t *ptep, unsigned long addr,
+   unsigned long old, unsigned long new)
+{
+   return pte_update(&init_mm, addr, ptep, old & ~new, new & ~old, 0);
+}
+
 /*
- * Updates the attributes of a page in three steps:
- *
- * 1. take the page_table_lock
- * 2. install the new entry with the updated attributes
- * 3. flush the TLB
+ * Updates the attributes of a page atomically.
  *
  * This sequence is safe against concurrent updates, and also allows updating the
  * attributes of a page currently being executed or accessed.
@@ -28,41 +30,33 @@
 static int change_page_attr(pte_t *ptep, unsigned long addr, void *data)
 {
long action = (long)data;
-   pte_t pte;
-
-   spin_lock(&init_mm.page_table_lock);
-
-   pte = ptep_get(ptep);
 
-   /* modify the PTE bits as desired, then apply */
+   /* modify the PTE bits as desired */
switch (action) {
case SET_MEMORY_RO:
-   pte = pte_wrprotect(pte);
+   /* Don't clear DIRTY bit */
+   pte_update_delta(ptep, addr, _PAGE_KERNEL_RW & ~_PAGE_DIRTY, _PAGE_KERNEL_RO);
break;
case SET_MEMORY_RW:
-   pte = pte_mkwrite(pte_mkdirty(pte));
+   pte_update_delta(ptep, addr, _PAGE_KERNEL_RO, _PAGE_KERNEL_RW);
break;
case SET_MEMORY_NX:
-   pte = pte_exprotect(pte);
+   pte_update_delta(ptep, addr, _PAGE_KERNEL_ROX, _PAGE_KERNEL_RO);
break;
case SET_MEMORY_X:
-   pte = pte_mkexec(pte);
+   pte_update_delta(ptep, addr, _PAGE_KERNEL_RO, _PAGE_KERNEL_ROX);
break;
default:
WARN_ON_ONCE(1);
break;
}
 
-   pte_update(&init_mm, addr, ptep, ~0UL, pte_val(pte), 0);
-
/* See ptesync comment in 

[PATCH AUTOSEL 5.17 047/149] powerpc: Set crashkernel offset to mid of RMA region

2022-04-01 Thread Sasha Levin
From: Sourabh Jain 

[ Upstream commit 7c5ed82b800d8615cdda00729e7b62e5899f0b13 ]

On large config LPARs (having 192 and more cores), Linux fails to boot
due to insufficient memory in the first memblock. It is due to the
memory reservation for the crash kernel which starts at 128MB offset of
the first memblock. This memory reservation for the crash kernel doesn't
leave enough space in the first memblock to accommodate other essential
system resources.

The crash kernel start address was set to a 128MB offset by default to
ensure that the crash kernel gets some memory below the RMA region,
which used to be of size 256MB. But given that the RMA region size can
be 512MB or more, setting the crash kernel offset to the middle of the
RMA size will leave enough space for the kernel to allocate memory for
other system resources.
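
With a 512MB RMA this moves the crash kernel from the 128MB offset up to
256MB on LPAR. A minimal sketch of the resulting policy (the actual
change is in reserve_crashkernel() in the diff below):

	/* sketch: pick the crash kernel base from the RMA size */
	if (firmware_has_feature(FW_FEATURE_LPAR))
		crashk_res.start = ppc64_rma_size / 2;	/* e.g. 256MB for a 512MB RMA */
	else
		crashk_res.start = min(0x8000000ULL, ppc64_rma_size / 2);	/* capped at 128MB */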

Since the above crash kernel offset change is only applicable to the LPAR
platform, the LPAR feature detection is pushed before the crash kernel
reservation. The rest of LPAR specific initialization will still
be done during pseries_probe_fw_features as usual.

This patch is dependent on changes to paca allocation for boot CPU. It
expects the boot CPU to discover 1T segment support, which is introduced by
the patch posted here:
https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-January/239175.html

Reported-by: Abdul haleem 
Signed-off-by: Sourabh Jain 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/20220204085601.107257-1-sourabhj...@linux.ibm.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/kernel/rtas.c |  6 ++
 arch/powerpc/kexec/core.c  | 15 +++
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 733e6ef36758..1f42aabbbab3 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -1313,6 +1313,12 @@ int __init early_init_dt_scan_rtas(unsigned long node,
entryp = of_get_flat_dt_prop(node, "linux,rtas-entry", NULL);
sizep  = of_get_flat_dt_prop(node, "rtas-size", NULL);
 
+#ifdef CONFIG_PPC64
+   /* need this feature to decide the crashkernel offset */
+   if (of_get_flat_dt_prop(node, "ibm,hypertas-functions", NULL))
+   powerpc_firmware_features |= FW_FEATURE_LPAR;
+#endif
+
if (basep && entryp && sizep) {
rtas.base = *basep;
rtas.entry = *entryp;
diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 8b68d9f91a03..abf5897ae88c 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -134,11 +134,18 @@ void __init reserve_crashkernel(void)
if (!crashk_res.start) {
 #ifdef CONFIG_PPC64
/*
-* On 64bit we split the RMO in half but cap it at half of
-* a small SLB (128MB) since the crash kernel needs to place
-* itself and some stacks to be in the first segment.
+* On the LPAR platform place the crash kernel to mid of
+* RMA size (512MB or more) to ensure the crash kernel
+* gets enough space to place itself and some stack to be
+* in the first segment. At the same time normal kernel
+* also get enough space to allocate memory for essential
+* system resource in the first segment. Keep the crash
+* kernel starts at 128MB offset on other platforms.
 */
-   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
+   if (firmware_has_feature(FW_FEATURE_LPAR))
+   crashk_res.start = ppc64_rma_size / 2;
+   else
+   crashk_res.start = min(0x8000000ULL, (ppc64_rma_size / 2));
 #else
crashk_res.start = KDUMP_KERNELBASE;
 #endif
-- 
2.34.1



[PATCH AUTOSEL 5.17 028/149] powerpc: dts: t104xrdb: fix phy type for FMAN 4/5

2022-04-01 Thread Sasha Levin
From: Maxim Kiselev 

[ Upstream commit 17846485dff91acce1ad47b508b633dffc32e838 ]

T1040RDB has two RTL8211E-VB phys, which require setting
internal delays for correct operation.

Changing the phy-connection-type property to `rgmii-id`
will fix this issue.

Signed-off-by: Maxim Kiselev 
Reviewed-by: Maxim Kochetkov 
Reviewed-by: Vladimir Oltean 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20211230151123.1258321-1-biguncle...@gmail.com
Signed-off-by: Sasha Levin 
---
 arch/powerpc/boot/dts/fsl/t104xrdb.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi 
b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
index 099a598c74c0..bfe1ed5be337 100644
--- a/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
+++ b/arch/powerpc/boot/dts/fsl/t104xrdb.dtsi
@@ -139,12 +139,12 @@ pca9546@77 {
fman@40 {
ethernet@e6000 {
phy-handle = <&phy_rgmii_0>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
ethernet@e8000 {
phy-handle = <&phy_rgmii_1>;
-   phy-connection-type = "rgmii";
+   phy-connection-type = "rgmii-id";
};
 
mdio0: mdio@fc000 {
-- 
2.34.1



[PATCH AUTOSEL 5.17 027/149] powerpc/xive: Export XIVE IPI information for online-only processors.

2022-04-01 Thread Sasha Levin
From: Sachin Sant 

[ Upstream commit 279d1a72c0f8021520f68ddb0a1346ff9ba1ea8c ]

Cédric pointed out that XIVE IPI information exported via sysfs
(debug/powerpc/xive) displays empty lines for processors which are
not online.

Switch to using for_each_online_cpu() so that information is
displayed only for online processors.

Reported-by: Cédric Le Goater 
Signed-off-by: Sachin Sant 
Reviewed-by: Cédric Le Goater 
Signed-off-by: Michael Ellerman 
Link: 
https://lore.kernel.org/r/16414670.19039.10920919226094771665.sendpatchset@MacBook-Pro.local
Signed-off-by: Sasha Levin 
---
 arch/powerpc/sysdev/xive/common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 1ca5564bda9d..32863b4daf72 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1791,7 +1791,7 @@ static int xive_ipi_debug_show(struct seq_file *m, void 
*private)
if (xive_ops->debug_show)
xive_ops->debug_show(m, private);
 
-   for_each_possible_cpu(cpu)
+   for_each_online_cpu(cpu)
xive_debug_show_ipi(m, cpu);
return 0;
 }
-- 
2.34.1



[PATCH v2] powerpc/rtas: Keep MSR[RI] set when calling RTAS

2022-04-01 Thread Laurent Dufour
RTAS runs in real mode (MSR[DR] and MSR[IR] unset) and in 32-bit
mode (MSR[SF] unset).

The change in MSR is done in enter_rtas() in a relatively complex way,
since the MSR value could be hardcoded.
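
A minimal sketch of what a hardcoded value would look like (illustrative
only; MSR_ME and MSR_RI are the existing MSR bit macros, and the exact
set of bits to use is precisely what this patch is about):

	/* sketch: enter RTAS with a fixed MSR, keeping MSR[RI] set */
	unsigned long rtas_msr = MSR_ME | MSR_RI;	/* IR/DR/SF stay clear: real, 32-bit mode */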

Furthermore, a panic has been reported when hitting the watchdog interrupt
while running in RTAS; this leads to the following stack trace:

[69244.027433][   C24] watchdog: CPU 24 Hard LOCKUP
[69244.027442][   C24] watchdog: CPU 24 TB:997512652051031, last heartbeat 
TB:997504470175378 (15980ms ago)
[69244.027451][   C24] Modules linked in: chacha_generic(E) libchacha(E) 
xxhash_generic(E) wp512(E) sha3_generic(E) rmd160(E) poly1305_generic(E) 
libpoly1305(E) michael_mic(E) md4(E) crc32_generic(E) cmac(E) ccm(E) 
algif_rng(E) twofish_generic(E) twofish_common(E) serpent_generic(E) fcrypt(E) 
des_generic(E) libdes(E) cast6_generic(E) cast5_generic(E) cast_common(E) 
camellia_generic(E) blowfish_generic(E) blowfish_common(E) algif_skcipher(E) 
algif_hash(E) gcm(E) algif_aead(E) af_alg(E) tun(E) rpcsec_gss_krb5(E) 
auth_rpcgss(E)
nfsv4(E) dns_resolver(E) rpadlpar_io(EX) rpaphp(EX) xsk_diag(E) tcp_diag(E) 
udp_diag(E) raw_diag(E) inet_diag(E) unix_diag(E) af_packet_diag(E) 
netlink_diag(E) nfsv3(E) nfs_acl(E) nfs(E) lockd(E) grace(E) sunrpc(E) 
fscache(E) netfs(E) af_packet(E) rfkill(E) bonding(E) tls(E) ibmveth(EX) 
crct10dif_vpmsum(E) rtc_generic(E) drm(E) drm_panel_orientation_quirks(E) 
fuse(E) configfs(E) backlight(E) ip_tables(E) x_tables(E) dm_service_time(E) 
sd_mod(E) t10_pi(E)
[69244.027555][   C24]  ibmvfc(EX) scsi_transport_fc(E) vmx_crypto(E) 
gf128mul(E) btrfs(E) blake2b_generic(E) libcrc32c(E) crc32c_vpmsum(E) xor(E) 
raid6_pq(E) dm_mirror(E) dm_region_hash(E) dm_log(E) sg(E) dm_multipath(E) 
dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E)
[69244.027587][   C24] Supported: No, Unreleased kernel
[69244.027600][   C24] CPU: 24 PID: 87504 Comm: drmgr Kdump: loaded Tainted: G  
  E  X5.14.21-150400.71.1.bz196362_2-default #1 SLE15-SP4 
(unreleased) 0d821077ef4faa8dfaf370efb5fdca1fa35f4e2c
[69244.027609][   C24] NIP:  1fb41050 LR: 1fb4104c CTR: 

[69244.027612][   C24] REGS: cfc33d60 TRAP: 0100   Tainted: G   
 E  X (5.14.21-150400.71.1.bz196362_2-default)
[69244.027615][   C24] MSR:  82981000   CR: 4882  
XER: 20040020
[69244.027625][   C24] CFAR: 011c IRQMASK: 1
[69244.027625][   C24] GPR00: 0003  
0001 50dc
[69244.027625][   C24] GPR04: 1ffb6100 0020 
0001 1fb09010
[69244.027625][   C24] GPR08: 2000  
 
[69244.027625][   C24] GPR12: 8004072a40a8 cff8b680 
0007 0034
[69244.027625][   C24] GPR16: 1fbf6e94 1fbf6d84 
1fbd1db0 1fb3f008
[69244.027625][   C24] GPR20: 1fb41018  
017f f68f
[69244.027625][   C24] GPR24: 1fb18fe8 1fb3e000 
1fb1adc0 1fb1cf40
[69244.027625][   C24] GPR28: 1fb26000 1fb460f0 
1fb17f18 1fb17000
[69244.027663][   C24] NIP [1fb41050] 0x1fb41050
[69244.027696][   C24] LR [1fb4104c] 0x1fb4104c
[69244.027699][   C24] Call Trace:
[69244.027701][   C24] Instruction dump:
[69244.027723][   C24]       
 
[69244.027728][   C24]       
 
[69244.027762][T87504] Oops: Unrecoverable System Reset, sig: 6 [#1]
[69244.028044][T87504] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
[69244.028089][T87504] Modules linked in: chacha_generic(E) libchacha(E) 
xxhash_generic(E) wp512(E) sha3_generic(E) rmd160(E) poly1305_generic(E) 
libpoly1305(E) michael_mic(E) md4(E) crc32_generic(E) cmac(E) ccm(E) 
algif_rng(E) twofish_generic(E) twofish_common(E) serpent_generic(E) fcrypt(E) 
des_generic(E) libdes(E) cast6_generic(E) cast5_generic(E) cast_common(E) 
camellia_generic(E) blowfish_generic(E) blowfish_common(E) algif_skcipher(E) 
algif_hash(E) gcm(E) algif_aead(E) af_alg(E) tun(E) rpcsec_gss_krb5(E) 
auth_rpcgss(E)
nfsv4(E) dns_resolver(E) rpadlpar_io(EX) rpaphp(EX) xsk_diag(E) tcp_diag(E) 
udp_diag(E) raw_diag(E) inet_diag(E) unix_diag(E) af_packet_diag(E) 
netlink_diag(E) nfsv3(E) nfs_acl(E) nfs(E) lockd(E) grace(E) sunrpc(E) 
fscache(E) netfs(E) af_packet(E) rfkill(E) bonding(E) tls(E) ibmveth(EX) 
crct10dif_vpmsum(E) rtc_generic(E) drm(E) drm_panel_orientation_quirks(E) 
fuse(E) configfs(E) backlight(E) ip_tables(E) x_tables(E) dm_service_time(E) 
sd_mod(E) t10_pi(E)
[69244.028171][T87504]  ibmvfc(EX) scsi_transport_fc(E) vmx_crypto(E) 
gf128mul(E) btrfs(E) blake2b_generic(E) libcrc32c(E) crc32c_vpmsum(E) xor(E) 
raid6_pq(E) dm_mirror(E) dm_region_hash(E) dm_log(E) sg(E) dm_multipath(E) 
dm_mod(E) scsi_dh_rdac(E) 

Re: [PATCH v4 1/2] Revert "powerpc: Set max_mapnr correctly"

2022-04-01 Thread Christophe Leroy


On 01/04/2022 at 13:23, Michael Ellerman wrote:
> Christophe Leroy  writes:
>> On 28/03/2022 at 12:37, Michael Ellerman wrote:
>>> Kefeng Wang  writes:
>>>> Hi maintainers,
>>>>
>>>> I saw the patches have been reviewed[1], could they be merged?
>>>
>>> Maybe I'm just misreading the change log, but it seems wrong that we
>>> need to add extra checks. pfn_valid() shouldn't return true for vmalloc
>>> addresses in the first place, shouldn't we fix that instead? Who knows
>>> what else that might be broken because of that.
>>
>> pfn_valid() doesn't take an address but a PFN
> 
> Yeah sorry that was unclear wording on my part.
> 
> What I mean is that pfn_valid(virt_to_pfn(some_vmalloc_addr)) should be
> false, because virt_to_pfn(vmalloc_addr) should fail.

Yes that's probably how it should be but none of the main architectures 
do that.

The best we find is some architectures that WARN_ON(some vmalloc addr) in
virt_to_pfn(). That wouldn't help in our case, as it would then WARN_ON()


> 
> The right way to convert a vmalloc address to a pfn is with
> vmalloc_to_pfn(), which walks the page tables to find the actual pfn
> backing that vmalloc addr.
> 
>> If you have 1Gbyte of memory you have 256k PFNs.
>>
>> In a generic config the kernel will map 768 Mbytes of memory (From
>> 0xc000 to 0xe000) and will use 0xf000-0x for
>> everything else including vmalloc.
>>
>> If you take a page above that 768 Mbytes limit, and try to linearly
>> convert its PFN to a va, you'll hit vmalloc space. Anyway that PFN is
>> valid.
> 
> That's true, but it's just some random page in vmalloc space, there's no
> guarantee that it's the same page as the PFN you started with.

Yes sure, what I meant however is that pfn_valid(some_valid_pfn) should 
return true, even if virt_to_pfn(some_vmalloc_address) provides a valid PFN.

> 
> Note it's not true on 64-bit Book3S. There if you take a valid PFN (ie.
> backed by RAM) and convert it to a virtual address (with __va()), you
> will never get a vmalloc address.
> 
>> So the check really needs to be done in virt_addr_valid().
> 
> I don't think it has to, but with the way our virt_to_pfn()/__pa() works
> I guess for now it's the easiest solution.
> 

At least other architectures do it that way. See for instance how ARM 
does it:

#define virt_addr_valid(kaddr)	(((unsigned long)(kaddr) >= PAGE_OFFSET && \
				  (unsigned long)(kaddr) < (unsigned long)high_memory) \
				 && pfn_valid(virt_to_pfn(kaddr)))



high_memory being the top of linear RAM mapping
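
A minimal sketch of what the equivalent powerpc check would look like,
assuming the same approach as ARM (illustrative only, not the exact
patch under discussion in this thread):

/* sketch: only linear-map addresses are considered valid */
#define virt_addr_valid(vaddr)						\
	((unsigned long)(vaddr) >= PAGE_OFFSET &&			\
	 (unsigned long)(vaddr) < (unsigned long)high_memory &&	\
	 pfn_valid(virt_to_pfn(vaddr)))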

Christophe

Re: [PATCH v4 1/2] Revert "powerpc: Set max_mapnr correctly"

2022-04-01 Thread Michael Ellerman
Christophe Leroy  writes:
> On 28/03/2022 at 12:37, Michael Ellerman wrote:
>> Kefeng Wang  writes:
>>> Hi maintainers,
>>>
>>> I saw the patches have been reviewed[1], could they be merged?
>>
>> Maybe I'm just misreading the change log, but it seems wrong that we
>> need to add extra checks. pfn_valid() shouldn't return true for vmalloc
>> addresses in the first place, shouldn't we fix that instead? Who knows
>> what else that might be broken because of that.
>
> pfn_valid() doesn't take an address but a PFN

Yeah sorry that was unclear wording on my part.

What I mean is that pfn_valid(virt_to_pfn(some_vmalloc_addr)) should be
false, because virt_to_pfn(vmalloc_addr) should fail.

The right way to convert a vmalloc address to a pfn is with
vmalloc_to_pfn(), which walks the page tables to find the actual pfn
backing that vmalloc addr.
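
For illustration, assuming 'vaddr' is a vmalloc address (the helpers
below are the generic kernel ones):

	/* correct: walk the page tables backing the vmalloc mapping */
	unsigned long pfn = vmalloc_to_pfn(vaddr);

	/* wrong for vmalloc: assumes a linear virtual->physical relation */
	unsigned long bogus_pfn = virt_to_pfn(vaddr);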

> If you have 1Gbyte of memory you have 256k PFNs.
>
> In a generic config the kernel will map 768 Mbytes of memory (From
> 0xc000 to 0xe000) and will use 0xf000-0x for
> everything else including vmalloc.
>
> If you take a page above that 768 Mbytes limit, and try to linearly
> convert its PFN to a va, you'll hit vmalloc space. Anyway that PFN is
> valid.

That's true, but it's just some random page in vmalloc space, there's no
guarantee that it's the same page as the PFN you started with.

Note it's not true on 64-bit Book3S. There if you take a valid PFN (ie.
backed by RAM) and convert it to a virtual address (with __va()), you
will never get a vmalloc address.

> So the check really needs to be done in virt_addr_valid().

I don't think it has to, but with the way our virt_to_pfn()/__pa() works
I guess for now it's the easiest solution.

> There is another thing however that would be worth fixing (in another
> patch): that's the virt_to_pfn() in PPC64:
>
> #define virt_to_pfn(kaddr)	(__pa(kaddr) >> PAGE_SHIFT)
>
> #define __pa(x)						\
> ({								\
>	VIRTUAL_BUG_ON((unsigned long)(x) < PAGE_OFFSET);	\
>	(unsigned long)(x) & 0x0fffffffffffffffUL;		\
> })
>
>
> So 0xc000 and 0xd000 have the same PFN. That's
> wrong.

Yes it was wrong. But we don't use 0xd000 anymore, so it
shouldn't be an issue in practice.

See 0034d395f89d ("powerpc/mm/hash64: Map all the kernel regions in the same 
0xc range").

I guess it is still a problem for 64-bit Book3E, because they use 0xc
and 0x8.

I looked at fixing __pa()/__va() to use +/- a few years back, but from
memory it still didn't work and/or generated bad code. There's a comment
about it working around a GCC miscompile.

The other thing that makes me reluctant to change it is that I worry we
have places that inadvertently use __pa() on addresses that are already
physical addresses. If we changed __pa() to use subtraction that would
break, ie. we'd end up with a negative address.
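
A toy illustration of that worry, with made-up addresses (plain
userspace C, just to show the arithmetic):

#include <stdio.h>

#define FAKE_PAGE_OFFSET	0xc000000000000000ULL

/* mask-based __pa(), roughly what 64-bit Book3S uses today */
#define pa_mask(x)	((unsigned long long)(x) & 0x0fffffffffffffffULL)
/* subtraction-based __pa(), the "+/-" variant discussed above */
#define pa_sub(x)	((unsigned long long)(x) - FAKE_PAGE_OFFSET)

int main(void)
{
	/* a caller mistakenly passes an address that is already physical */
	unsigned long long already_phys = 0x2000;

	printf("mask: %#llx  sub: %#llx\n",
	       pa_mask(already_phys), pa_sub(already_phys));
	/* mask leaves it at 0x2000; sub wraps to the bogus 0x4000000000002000 */
	return 0;
}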

See eg. a6e2c226c3d5 ("powerpc: Fix kernel crash in show_instructions() 
w/DEBUG_VIRTUAL")

So to fix it we'd have to 1) verify that the compiler bug is no longer
an issue and 2) audit uses of __pa()/__va().

cheers


[RFC PATCH 4.19 2/2] powerpc/64s: Unmerge EX_LR and EX_DAR

2022-04-01 Thread Michael Ellerman
The SLB miss handler is not fully re-entrant; it is able to work because
we ensure that the SLB entries for the kernel text and data segment, as
well as the kernel stack, are pinned in the SLB. Accesses to kernel data
outside of those areas have to be carefully managed and can only occur in
certain parts of the code. One way we deal with that is by storing some
values in temporary slots in the paca.

In v4.13 in commit dbeea1d6b4bd ("powerpc/64s/paca: EX_LR can be merged
with EX_DAR") we merged the storage for two temporary slots for register
storage during SLB miss handling. That was safe at the time because the
two slots were never used at the same time.

Unfortunately in v4.17 in commit c2b4d8b7417a ("powerpc/mm/hash64:
Increase the VA range") we broke that condition, and introduced a case
where the two slots could be in use at the same time, leading to one
being corrupted.

Specifically in slb_miss_common() when we detect that we're handling a
fault for a large virtual address (> 512TB) we go to the "8" label;
there we store the original fault address into paca->exslb[EX_DAR],
before jumping to large_addr_slb() (using rfid).

We then use the EXCEPTION_PROLOG_COMMON and RECONCILE_IRQ_STATE macros
to do exception setup, before reloading the fault address from
paca->exslb[EX_DAR] and storing it into pt_regs->dar (Data Address
Register).

However the code generated by those macros can cause a recursive SLB
miss on a kernel address in three places.

Firstly is the saving of the PPR (Program Priority Register), which
happens on all CPUs since Power7, the PPR is saved to the thread struct
which can be anywhere in memory. There is also the call to
accumulate_stolen_time() if CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y and
CONFIG_PPC_SPLPAR=y, and also the call to trace_hardirqs_off() if
CONFIG_TRACE_IRQFLAGS=y. The latter two call into generic C code and can
lead to accesses anywhere in memory.

On modern 64-bit CPUs we have 1TB segments, so for any of those accesses
to cause an SLB fault they must access memory more than 1TB away from
the kernel text, data and kernel stack. That typically only happens on
machines with more than 1TB of RAM. However it is possible on multi-node
Power9 systems, because memory on the 2nd node begins at 32TB in the
linear mapping.

If we take a recursive SLB fault then we will corrupt the original fault
address with the LR (Link Register) value, because the EX_DAR and EX_LR
slots share storage. Subsequently we will think we're trying to fault
that LR address, which is the wrong address, and will also most likely
lead to a segfault because the LR address will be < 512TB and so will be
rejected by slb_miss_large_addr().

This appears as a spurious segfault to userspace, and if
show_unhandled_signals is enabled you will see a fault reported in dmesg
with the LR address, not the expected fault address, eg:

  prog[123]: segfault (11) at 128a61808 nip 128a618cc lr 128a61808 code 3 in 
prog[128a6+1]
  prog[123]: code: 4ba4 39200040 3ce4 7d2903a6 3c000200 78e707c6 
780083e4 7d3b4b78
  prog[123]: code: 7d455378 7d7d5b78 7d9f6378 7da46b78  7d3a4b78 
7d465378 7d7c5b78

Notice that the fault address == the LR, and the faulting instruction is
a simple store that should never use LR.

In upstream this was fixed in v4.20 in commit
48e7b7695745 ("powerpc/64s/hash: Convert SLB miss handlers to C"),
however that is a huge rewrite and not backportable.

The minimal fix for stable is to just unmerge the EX_LR and EX_DAR slots
again, avoiding the corruption of the DAR value. This uses an extra 8
bytes per CPU, which is negligible.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/exception-64s.h | 15 ---
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index f0424c6fdeca..4fdae1c182df 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -48,11 +48,12 @@
 #define EX_CCR 52
 #define EX_CFAR56
 #define EX_PPR 64
+#define EX_LR  72
 #if defined(CONFIG_RELOCATABLE)
-#define EX_CTR 72
-#define EX_SIZE10  /* size in u64 units */
+#define EX_CTR 80
+#define EX_SIZE11  /* size in u64 units */
 #else
-#define EX_SIZE9   /* size in u64 units */
+#define EX_SIZE10  /* size in u64 units */
 #endif
 
 /*
@@ -60,14 +61,6 @@
  */
 #define MAX_MCE_DEPTH  4
 
-/*
- * EX_LR is only used in EXSLB and where it does not overlap with EX_DAR
- * EX_CCR similarly with DSISR, but being 4 byte registers there is a hole
- * in the save area so it's not necessary to overlap them. Could be used
- * for future savings though if another 4 byte register was to be saved.
- */
-#define EX_LR  EX_DAR
-
 /*
  * EX_R3 is only used by the bad_stack handler. bad_stack reloads and
  * saves DAR from SPRN_DAR, and 

[RFC PATCH 4.19 1/2] powerpc/64/interrupt: Temporarily save PPR on stack to fix register corruption due to SLB miss

2022-04-01 Thread Michael Ellerman
From: Nicholas Piggin 

This is a minimal stable kernel fix for the problem solved by
4c2de74cc869 ("powerpc/64: Interrupts save PPR on stack rather than
thread_struct").

Upstream kernels between 4.17-4.20 have this bug, so I propose this
patch for 4.19 stable.

Longer description from mpe:

In commit f384796c4 ("powerpc/mm: Add support for handling > 512TB
address in SLB miss") we added support for using multiple context ids
per process. Previously accessing past the first context id was a fatal
error for the process. With the new support it became non-fatal, and so
the previous "bad_addr_slb" handler was changed to be the
"large_addr_slb" handler.

That handler uses the EXCEPTION_PROLOG_COMMON() macro, which in-turn
calls the SAVE_PPR() macro. At the point where SAVE_PPR() is used, the
r9-13 register values from the original user fault are saved in
paca->exslb. It's not until later in EXCEPTION_PROLOG_COMMON_2() that
they are saved from paca->exslb onto the kernel stack.

The PPR is saved into current->thread.ppr, which is notably not on the
kernel stack the way pt_regs are. This means we can take an SLB miss on
current->thread.ppr. If that happens in the "large_addr_slb" case we
will clobber the saved user r9-r13 in paca->exslb with kernel values.
Later we will save those clobbered values into the pt_regs on the stack,
and when we return to userspace those kernel values will be restored.

Typically this appears as some sort of segfault in userspace, with an
address that looks like a kernel address. In dmesg it can appear as:

  [19117.440331] some_program[1869625]: unhandled signal 11 at cf6bda10 
nip 7fff780d559c lr 7fff781ae56c code 30001

The upstream fix for this issue was to move PPR into pt_regs, on the
kernel stack, avoiding the possibility of an SLB fault when saving it.

However changing the size of pt_regs is an intrusive change, and has
side effects in other parts of the kernel. A minimal fix is to
temporarily save the PPR in an unused part of pt_regs, then save the
user register values from paca->exslb into pt_regs, and then move the
saved PPR into thread.ppr.
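
In rough C-like pseudo code, the ordering the minimal fix relies on is
(illustrative only; the real code is the asm macros in the diff, and the
helper names here are made up):

	/* 1. read PPR from the paca save area, park it in an unused pt_regs slot */
	regs->result = read_paca_exslb_ppr();		/* made-up helper */

	/* 2. copy the user r9-r13 from paca->exslb onto the kernel stack (pt_regs) */
	copy_exslb_gprs_to_regs(regs);			/* made-up helper */

	/* 3. only now touch the thread struct, which itself may take an SLB fault */
	current->thread.ppr = regs->result;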

Fixes: f384796c40dc ("powerpc/mm: Add support for handling > 512TB address in 
SLB miss")
Signed-off-by: Nicholas Piggin 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220316033235.903657-1-npig...@gmail.com
---
 arch/powerpc/include/asm/exception-64s.h | 22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index 35fb5b11955a..f0424c6fdeca 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -243,10 +243,22 @@
  * PPR save/restore macros used in exceptions_64s.S  
  * Used for P7 or later processors
  */
-#define SAVE_PPR(area, ra, rb) \
+#define SAVE_PPR(area, ra) \
+BEGIN_FTR_SECTION_NESTED(940)  \
+   ld  ra,area+EX_PPR(r13);/* Read PPR from paca */\
+   std ra,RESULT(r1);  /* Store PPR in RESULT for now */ \
+END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,940)
+
+/*
+ * This is called after we are finished accessing 'area', so we can now take
+ * SLB faults accessing the thread struct, which will use PACA_EXSLB area.
+ * This is required because the large_addr_slb handler uses EXSLB and it also
+ * uses the common exception macros including this PPR saving.
+ */
+#define MOVE_PPR_TO_THREAD(ra, rb) \
 BEGIN_FTR_SECTION_NESTED(940)  \
ld  ra,PACACURRENT(r13);\
-   ld  rb,area+EX_PPR(r13);/* Read PPR from paca */\
+   ld  rb,RESULT(r1);  /* Read PPR from stack */   \
std rb,TASKTHREADPPR(ra);   \
 END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,940)
 
@@ -515,9 +527,11 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 3: EXCEPTION_PROLOG_COMMON_1();   \
beq 4f; /* if from kernel mode  */ \
ACCOUNT_CPU_USER_ENTRY(r13, r9, r10);  \
-   SAVE_PPR(area, r9, r10);   \
+   SAVE_PPR(area, r9);\
 4: EXCEPTION_PROLOG_COMMON_2(area)\
-   EXCEPTION_PROLOG_COMMON_3(n)   \
+   beq 5f; /* if from kernel mode  */ \
+   MOVE_PPR_TO_THREAD(r9, r10);   \
+5: EXCEPTION_PROLOG_COMMON_3(n)   \
ACCOUNT_STOLEN_TIME
 
 /* Save original regs values from save area to stack 

[PATCH] KVM: PPC: Book3S HV: fix the return value of kvm_age_rmapp()

2022-04-01 Thread Bo Liu
The return type declared for kvm_age_rmapp() is "bool", but the local
variable used for its return value in the implementation is an "int".

Change the return value type to "bool".

Signed-off-by: Bo Liu 
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 0aeb51738ca9..9a83894ca743 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -879,7 +879,7 @@ static bool kvm_age_rmapp(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
struct revmap_entry *rev = kvm->arch.hpt.rev;
unsigned long head, i, j;
__be64 *hptep;
-   int ret = 0;
+   bool ret = false;
unsigned long *rmapp;
 
rmapp = &kvm->arch.rmap[gfn - memslot->base_gfn];
@@ -887,7 +887,7 @@ static bool kvm_age_rmapp(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
lock_rmap(rmapp);
if (*rmapp & KVMPPC_RMAP_REFERENCED) {
*rmapp &= ~KVMPPC_RMAP_REFERENCED;
-   ret = 1;
+   ret = true;
}
if (!(*rmapp & KVMPPC_RMAP_PRESENT)) {
unlock_rmap(rmapp);
@@ -919,7 +919,7 @@ static bool kvm_age_rmapp(struct kvm *kvm, struct 
kvm_memory_slot *memslot,
rev[i].guest_rpte |= HPTE_R_R;
note_hpte_modification(kvm, &rev[i]);
}
-   ret = 1;
+   ret = true;
}
__unlock_hpte(hptep, be64_to_cpu(hptep[0]));
} while ((i = j) != head);
-- 
2.27.0



Re: [PATCH] powerpc/85xx: Remove fsl,85... bindings

2022-04-01 Thread Krzysztof Kozlowski
On 31/03/2022 12:13, Christophe Leroy wrote:
> Since commit 8a4ab218ef70 ("powerpc/85xx: Change deprecated binding
> for 85xx-based boards"), those bindings are not used anymore.
> 
> A comment in drivers/edac/mpc85xx_edac.c say they are to be removed
> with kernel 2.6.30.
> 
> Remove them now.
> 
> Signed-off-by: Christophe Leroy 
> ---
>  .../bindings/memory-controllers/fsl/fsl,ddr.yaml   |  6 --
>  .../devicetree/bindings/powerpc/fsl/l2cache.txt|  6 --
>  drivers/edac/mpc85xx_edac.c| 14 --
>  3 files changed, 26 deletions(-)
> 

Acked-by: Krzysztof Kozlowski 


Best regards,
Krzysztof


Re: [PATCH 05/22] acpica: Replace comments with C99 initializers

2022-04-01 Thread Christoph Hellwig
On Sun, Mar 27, 2022 at 10:59:54PM +0300, Andy Shevchenko wrote:
> On Sat, Mar 26, 2022 at 7:39 PM Benjamin Stürz  wrote:
> >
> > This replaces comments with C99's designated
> > initializers because the kernel supports them now.
> 
> Does it follow the conventions which are accepted in the ACPI CA project?

Why would ACPI CA be allowed to make up its own conventions?  And, as
you might imply, not allow using a very useful and more than 20 year old
C feature?  This kind of BS needs to stop.


Re: [PATCH 05/22] acpica: Replace comments with C99 initializers

2022-04-01 Thread Christoph Hellwig


Please fix your mailer.  This mail is completely unreadable.

