Re: [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain

2020-07-29 Thread Gautham R Shenoy
On Mon, Jul 27, 2020 at 11:02:26AM +0530, Srikar Dronamraju wrote:
> Currently the "CACHE" domain happens to be the 2nd sched domain as per
> powerpc_topology. This domain will collapse if the cpumask of the
> l2-cache is the same as that of the SMT domain. However we could
> generalize this domain such that it could mean either a "CACHE" domain
> or a "BIGCORE" domain.
> 
> While setting up the "CACHE" domain, check if shared_cache is already
> set.
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Anton Blanchard 
> Cc: Oliver O'Halloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Gautham R Shenoy 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 

Reviewed-by: Gautham R. Shenoy 

> ---
> Changelog v1 -> v2:
>   Moved shared_cache topology fixup to fixup_topology (Gautham)
> 
>  arch/powerpc/kernel/smp.c | 48 +++
>  1 file changed, 34 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index d997c7411664..3c5ccf6d2b1c 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_core_map);
>  EXPORT_SYMBOL_GPL(has_big_cores);
> 
> +enum {
> +#ifdef CONFIG_SCHED_SMT
> + smt_idx,
> +#endif
> + bigcore_idx,
> + die_idx,
> +};
> +
>  #define MAX_THREAD_LIST_SIZE 8
>  #define THREAD_GROUP_SHARE_L1   1
>  struct thread_groups {
> @@ -851,13 +859,7 @@ static int powerpc_shared_cache_flags(void)
>   */
>  static const struct cpumask *shared_cache_mask(int cpu)
>  {
> - if (shared_caches)
> - return cpu_l2_cache_mask(cpu);
> -
> - if (has_big_cores)
> - return cpu_smallcore_mask(cpu);
> -
> - return per_cpu(cpu_sibling_map, cpu);
> + return per_cpu(cpu_l2_cache_map, cpu);
>  }
> 
>  #ifdef CONFIG_SCHED_SMT
> @@ -867,11 +869,16 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
>  }
>  #endif
> 
> +static const struct cpumask *cpu_bigcore_mask(int cpu)
> +{
> + return per_cpu(cpu_sibling_map, cpu);
> +}
> +
>  static struct sched_domain_topology_level powerpc_topology[] = {
>  #ifdef CONFIG_SCHED_SMT
>   { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
> - { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
> + { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
>   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>   { NULL, },
>  };
> @@ -1311,7 +1318,6 @@ static void add_cpu_to_masks(int cpu)
>  void start_secondary(void *unused)
>  {
>   unsigned int cpu = smp_processor_id();
> - struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> 
>   mmgrab(&init_mm);
>   current->active_mm = &init_mm;
> @@ -1337,14 +1343,20 @@ void start_secondary(void *unused)
>   /* Update topology CPU masks */
>   add_cpu_to_masks(cpu);
> 
> - if (has_big_cores)
> - sibling_mask = cpu_smallcore_mask;
>   /*
>* Check for any shared caches. Note that this must be done on a
>* per-core basis because one core in the pair might be disabled.
>*/
> - if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> - shared_caches = true;
> + if (!shared_caches) {
> + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> + struct cpumask *mask = cpu_l2_cache_mask(cpu);
> +
> + if (has_big_cores)
> + sibling_mask = cpu_smallcore_mask;
> +
> + if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> + shared_caches = true;
> + }
> 
>   set_numa_node(numa_cpu_lookup_table[cpu]);
>   set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
> @@ -1375,9 +1387,17 @@ static void fixup_topology(void)
>  #ifdef CONFIG_SCHED_SMT
>   if (has_big_cores) {
>   pr_info("Big cores detected but using small core scheduling\n");
> - powerpc_topology[0].mask = smallcore_smt_mask;
> + powerpc_topology[smt_idx].mask = smallcore_smt_mask;
>   }
>  #endif
> + if (shared_caches) {
> + pr_info("Using shared cache scheduler topology\n");
> + powerpc_topology[bigcore_idx].mask = shared_cache_mask;
> + powerpc_topology[bigcore_idx].sd_flags = powerpc_shared_cache_flags;
> +#ifdef CONFIG_SCHED_DEBUG
> + powerpc_topology[bigcore_idx].name = "CACHE";
> +#endif
> + }
>  }
> 
>  void __init smp_cpus_done(unsigned int max_cpus)
> -- 
> 2.17.1
> 


[PATCH v3 3/3] cpuidle-pseries : Fixup exit latency for CEDE(0)

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

We are currently assuming that CEDE(0) has exit latency 10us, since
there is no way for us to query from the platform.  However, if the
wakeup latency of an Extended CEDE state is smaller than 10us, then we
can be sure that the exit latency of CEDE(0) cannot be more than that.

In this patch, we fix the exit latency of CEDE(0) if we discover an
Extended CEDE state with wakeup latency smaller than 10us.

Benchmark results:

On POWER8, this patch does not have any impact since the advertised
latency of Extended CEDE (1) is 30us, which is higher than the default
latency of CEDE (0), which is 10us.

On POWER9 we see an improvement in the single-threaded performance of
ebizzy, and no regression in the wakeup latency or the number of
context-switches.

ebizzy:
2 ebizzy threads bound to the same big-core. 25% improvement in the
avg records/s with patch.
x without_patch
* with_patch
    N      Min      Max   Median       Avg    Stddev
x  10  2491089  5834307  5398375   4244335 1596244.9
*  10  2893813  5834474  5832448 5327281.3 1055941.4

context_switch2 :
There is no major regression observed with this patch as seen from the
context_switch2 benchmark.

context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different
small cores). We observe a minor 0.14% regression in the number of
context-switches (higher is better).
x without_patch
* with_patch
    N     Min     Max  Median       Avg    Stddev
x 500  348872  362236  354712 354745.69  2711.827
* 500  349422  361452  353942  354215.4 2576.9258
Difference at 99.0% confidence
-530.288 +/- 430.963
-0.149484% +/- 0.121485%
(Student's t, pooled s = 2645.24)
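
For reference, the "pooled s" in the ministat output is the standard
two-sample estimate; with n1 = n2 = 500 runs:

    s_p = \sqrt{ \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} }
        = \sqrt{ \frac{499 \cdot 2711.827^2 + 499 \cdot 2576.9258^2}{998} }
        \approx 2645.24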

context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37%
improvement in the number of context-switches (higher is better).
x without_patch
* with_patch
    N     Min     Max  Median       Avg    Stddev
x 500  287956  294940  288896 288977.23 646.59295
* 500  288300  294646  289582 290064.76 1161.9992
Difference at 99.0% confidence
1087.53 +/- 153.194
0.376337% +/- 0.0530125%
(Student's t, pooled s = 940.299)

schbench:
No major difference could be seen until the 99.9th percentile.

Without-patch
Latency percentiles (usec)
50.0th: 29
75.0th: 39
90.0th: 49
95.0th: 59
*99.0th: 13104
99.5th: 14672
99.9th: 15824
min=0, max=17993

With-patch:
Latency percentiles (usec)
50.0th: 29
75.0th: 40
90.0th: 50
95.0th: 61
*99.0th: 13648
99.5th: 14768
99.9th: 15664
min=0, max=29812

Signed-off-by: Gautham R. Shenoy 
---
v2-->v3 : Made notation consistent with first two patches.
 drivers/cpuidle/cpuidle-pseries.c | 41 +--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index f528da7..8d19820 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -350,13 +350,50 @@ static int pseries_cpuidle_driver_init(void)
return 0;
 }
 
-static void __init parse_xcede_idle_states(void)
+static void __init fixup_cede0_latency(void)
 {
+   int i;
+   u64 min_latency_us = dedicated_states[1].exit_latency; /* CEDE latency */
+   struct xcede_latency_payload *payload;
+
if (parse_cede_parameters())
return;
 
pr_info("cpuidle : Skipping the %d Extended CEDE idle states\n",
nr_xcede_records);
+
+   payload = &xcede_latency_parameter.payload;
+   for (i = 0; i < nr_xcede_records; i++) {
+   struct xcede_latency_record *record = &payload->records[i];
+   u64 latency_tb = be64_to_cpu(record->latency_ticks);
+   u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
+
+   if (latency_us < min_latency_us)
+   min_latency_us = latency_us;
+   }
+
+   /*
+* By default, we assume that CEDE(0) has exit latency 10us,
+* since there is no way for us to query from the platform.
+*
+* However, if the wakeup latency of an Extended CEDE state is
+* smaller than 10us, then we can be sure that CEDE(0)
+* requires no more than that.
+*
+* Perform the fix-up.
+*/
+   if (min_latency_us < dedicated_states[1].exit_latency) {
+   u64 cede0_latency = min_latency_us - 1;
+
+   if (cede0_latency <= 0)
+   cede0_latency = min_latency_us;
+
+   dedicated_states[1].exit_latency = cede0_latency;
+   dedicated_states[1].target_residency = 10 * (cede0_latency);
+   pr_info("cpuidle : Fixed up CEDE exit latency 

Re: [PATCH v2] powerpc/vio: drop bus_type from parent device

2020-07-29 Thread Greg KH
On Thu, Jul 30, 2020 at 11:28:38AM +1000, Michael Ellerman wrote:
> [ Added Peter & Greg to Cc ]
> 
> Thadeu Lima de Souza Cascardo  writes:
> > Commit df44b479654f62b478c18ee4d8bc4e9f897a9844 ("kobject: return error
> > code if writing /sys/.../uevent fails") started returning failure when
> > writing to /sys/devices/vio/uevent.
> >
> > This causes an early udevadm trigger to fail. On some installer versions of
> > Ubuntu, this will cause init to exit, thus panicking the system very early
> > during boot.
> >
> > Removing the bus_type from the parent device will remove some of the extra
> > empty files from /sys/devices/vio/, but will keep the rest of the layout
> > for vio devices, keeping them under /sys/devices/vio/.
> 
> What exactly does it change?
> 
> I'm finding it hard to evaluate if this change is going to cause a
> regression somehow.
> 
> I'm also not clear on why removing the bus type is correct, apart from
> whether it fixes the bug you're seeing.
> 
> > It has been tested that uevents for vio devices don't change after this
> > fix, they still contain MODALIAS.
> >
> > Signed-off-by: Thadeu Lima de Souza Cascardo 
> > Fixes: df44b479654f ("kobject: return error code if writing /sys/.../uevent fails")
> 
> AFAICS there haven't been any other fixes for that commit. Do we know
> why it is only vio that was affected? (possibly because it's a fake bus
> to begin with?)

So there was an error previously, the core was ignoring it, and now it
isn't and to fix that you want to remove describing what bus a device is
on?

Huh???

> 
> cheers
> 
> > diff --git a/arch/powerpc/platforms/pseries/vio.c 
> > b/arch/powerpc/platforms/pseries/vio.c
> > index 37f1f25ba804..a94dab3972a0 100644
> > --- a/arch/powerpc/platforms/pseries/vio.c
> > +++ b/arch/powerpc/platforms/pseries/vio.c
> > @@ -36,7 +36,6 @@ static struct vio_dev vio_bus_device  = { /* fake "parent" device */
> > .name = "vio",
> > .type = "",
> > .dev.init_name = "vio",
> > -   .dev.bus = &vio_bus_type,
> >  };

Wait, a static 'struct device'?  You all are playing with fire there.
That's a reference counted object, and should never be declared like
that at all.

I see you register it, but never unregister it, why?  Why is it even
needed?

And if you remove the bus type of it, it will show up in a different
part of sysfs, so I think this patch will show a user-visible change,
right?
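
For illustration, a minimal sketch of the dynamically allocated pattern
the driver core expects (hypothetical code, not the vio patch itself;
the release() callback is the piece that makes the refcounting safe):

#include <linux/device.h>
#include <linux/slab.h>

static struct device *vio_parent; /* hypothetical stand-in for the static object */

static void vio_parent_release(struct device *dev)
{
	kfree(dev); /* the final put_device() lands here */
}

static int __init vio_parent_init(void)
{
	int err;

	vio_parent = kzalloc(sizeof(*vio_parent), GFP_KERNEL);
	if (!vio_parent)
		return -ENOMEM;

	device_initialize(vio_parent);
	vio_parent->release = vio_parent_release;

	err = dev_set_name(vio_parent, "vio");
	if (err)
		goto err_put;

	err = device_add(vio_parent);
	if (err)
		goto err_put;

	return 0;

err_put:
	put_device(vio_parent); /* drops the ref and invokes release() */
	return err;
}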

thanks,

greg k-h


[PATCH v3 2/3] cpuidle-pseries: Add function to parse extended CEDE records

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently we use CEDE with latency-hint 0 as the only other idle state
on a dedicated LPAR apart from the polling "snooze" state.

The platform might support additional extended CEDE idle states, which
can be discovered through the "ibm,get-system-parameter" rtas-call
made with CEDE_LATENCY_TOKEN.

This patch adds a function to obtain information about the extended
CEDE idle states from the platform and parse the contents to populate
an array of extended CEDE states. These idle states thus discovered
will be added to the cpuidle framework in the next patch.

dmesg on a POWER8 and POWER9 LPAR, demonstrating the output of parsing
the extended CEDE latency parameters are as follows

POWER8
[   10.093279] xcede : xcede_record_size = 10
[   10.093285] xcede : Record 0 : hint = 1, latency = 0x3c00 tb ticks, Wake-on-irq = 1
[   10.093291] xcede : Record 1 : hint = 2, latency = 0x4e2000 tb ticks, Wake-on-irq = 0
[   10.093297] cpuidle : Skipping the 2 Extended CEDE idle states

POWER9
[5.913180] xcede : xcede_record_size = 10
[5.913183] xcede : Record 0 : hint = 1, latency = 0x400 tb ticks, Wake-on-irq = 1
[5.913188] xcede : Record 1 : hint = 2, latency = 0x3e8000 tb ticks, Wake-on-irq = 0
[5.913193] cpuidle : Skipping the 2 Extended CEDE idle states
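
For reference, assuming the usual 512 MHz POWER timebase, those tick
counts convert to wakeup latencies as t_us = ticks / 512:

POWER8 Record 0: 0x3c00   = 15360 ticks   -> 15360 / 512   = 30 us
POWER8 Record 1: 0x4e2000 = 5120000 ticks -> 5120000 / 512 = 10000 us
POWER9 Record 0: 0x400    = 1024 ticks    -> 1024 / 512    = 2 us
POWER9 Record 1: 0x3e8000 = 4096000 ticks -> 4096000 / 512 = 8000 us

The POWER8 Record 0 value matches the 30us CEDE(1) latency quoted
elsewhere in this series.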

Signed-off-by: Gautham R. Shenoy 
---
v2-->v3 : Cleaned up parse_cede_parameters(). Silenced some sparse warnings.
drivers/cpuidle/cpuidle-pseries.c | 142 ++
 1 file changed, 142 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index f5865a2..f528da7 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct cpuidle_driver pseries_idle_driver = {
.name = "pseries_idle",
@@ -87,6 +88,137 @@ static void check_and_cede_processor(void)
 }
 
 #define NR_DEDICATED_STATES2 /* snooze, CEDE */
+/*
+ * XCEDE : Extended CEDE states discovered through the
+ * "ibm,get-system-parameter" rtas-call with the token
+ * CEDE_LATENCY_TOKEN
+ */
+#define MAX_XCEDE_STATES   4
+#define XCEDE_LATENCY_RECORD_SIZE  10
+#define XCEDE_LATENCY_PARAM_MAX_LENGTH (2 + 2 + \
+   (MAX_XCEDE_STATES * XCEDE_LATENCY_RECORD_SIZE))
+
+/*
+ * Section 7.3.16 System Parameters Option of PAPR version 2.8.1 has a
+ * table with all the parameters to ibm,get-system-parameter.
+ * CEDE_LATENCY_TOKEN corresponds to the token value for Cede Latency
+ * Settings Information.
+ */
+#define CEDE_LATENCY_TOKEN 45
+
+/*
+ * If the platform supports the cede latency settings
+ * information system parameter it must provide the following
+ * information in the NULL terminated parameter string:
+ *
+ * a. The first byte is the length “N” of each cede
+ *latency setting record minus one (zero indicates a length
+ *of 1 byte).
+ *
+ * b. For each supported cede latency setting a cede latency
+ *setting record consisting of the first “N” bytes as per
+ *the following table.
+ *
+ * -
+ * | Field   | Field  |
+ * | Name| Length |
+ * -
+ * | Cede Latency| 1 Byte |
+ * | Specifier Value ||
+ * -
+ * | Maximum wakeup  ||
+ * | latency in  | 8 Bytes|
+ * | tb-ticks||
+ * -
+ * | Responsive to   ||
+ * | external| 1 Byte |
+ * | interrupts  ||
+ * -
+ *
+ * This version has cede latency record size = 10.
+ *
+ * The structure xcede_latency_payload represents a) and b) with
+ * xcede_latency_record representing the table in b).
+ *
+ * xcede_latency_parameter is what gets returned by
+ * ibm,get-system-parameter rtas-call when made with
+ * CEDE_LATENCY_TOKEN.
+ *
+ * These structures are only used to represent the data obtained
+ * by the rtas-call. The data is in Big-Endian.
+ */
+struct xcede_latency_record {
+   u8  hint;
+   __be64  latency_ticks;
+   u8  wake_on_irqs;
+} __packed;
+
+struct xcede_latency_payload {
+   u8 record_size;
+   struct xcede_latency_record records[MAX_XCEDE_STATES];
+} __packed;
+
+struct xcede_latency_parameter {
+   __be16  payload_size;
+   struct xcede_latency_payload payload;
+   u8 null_char;
+} __packed;
+
+static unsigned int nr_xcede_records;
+static struct xcede_latency_parameter xcede_latency_parameter __initdata;
+
+static int __init parse_cede_parameters(void)
+{
+   int ret, i;
+   u16 payload_size;
+   u8 xcede_record_size;
+   u32 total_xcede_records_size;
+   struct xcede_latency_payload *payload;
+
+   memset(&xcede_latency_parameter, 0, 

[PATCH v3 0/3] cpuidle-pseries: Parse extended CEDE information for idle.

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

This is a v3 of the patch series to parse the extended CEDE
information in the pseries-cpuidle driver.

The previous two versions of the patches can be found here:

v2: 
https://lore.kernel.org/lkml/1596005254-25753-1-git-send-email-...@linux.vnet.ibm.com/

v1: 
https://lore.kernel.org/linuxppc-dev/1594120299-31389-1-git-send-email-...@linux.vnet.ibm.com/

The changes from v2 --> v3:

 * Patch 1: Got rid of some #define-s which were needed mainly for Patches 4
   and 5 of v1, but were retained in v2.

 * Patch 2:

* Based on feedback from Michael Ellerman, rewrote the
  function to parse the extended idle states by explicitly
  defining the structure of the object that is returned by
  ibm,get-system-parameter(CEDE_LATENCY_TOKEN) rtas-call. In
  the previous versions we were passing a character array and
  subsequently parsing the individual elements which can be
  bug-prone. This also gets rid of the excessive (cast *)ing
  that was in the previous versions.

  * Marked some of the functions static and annotated some of
  the functions with __init and data with __initdata. This
  makes Sparse happy.

  * Added comments for CEDE_LATENCY_TOKEN.

  * Renamed add_pseries_idle_states() to
parse_xcede_idle_states(). Again, this is because Patch 4
and 5 from v1 are no longer there.

 * Patch 3: No functional changes, but minor changes to be consistent
   with Patch 1 and 2 of this series.
 
I have additionally tested the code on a POWER8 dedicated LPAR and found
that it has no impact, since the wakeup latency of CEDE(1) is 30us,
which is greater than the default latency that we are assuming for
CEDE(0). So we do not need to fix up the CEDE(0) latency on POWER8.

Vaidy, I have removed your Reviewed-by for v1, since the code has
changed a little bit.

Gautham R. Shenoy (3):
  cpuidle-pseries: Set the latency-hint before entering CEDE
  cpuidle-pseries: Add function to parse extended CEDE records
  cpuidle-pseries : Fixup exit latency for CEDE(0)

 drivers/cpuidle/cpuidle-pseries.c | 190 +-
 1 file changed, 188 insertions(+), 2 deletions(-)

-- 
1.9.4



[PATCH v3 1/3] cpuidle-pseries: Set the latency-hint before entering CEDE

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

As per the PAPR, each H_CEDE call is associated with a latency-hint to
be passed in the VPA field "cede_latency_hint". The CEDE state that we
have been implicitly entering so far is CEDE with latency-hint = 0.

This patch explicitly sets the latency hint corresponding to the CEDE
state that we are currently entering. While at it, we save the
previous hint, to be restored once we wakeup from CEDE. This will be
required in the future when we expose extended-cede states through the
cpuidle framework, where each of them will have a different
cede-latency hint.

Signed-off-by: Gautham R. Shenoy 
---
v2-->v3 : Got rid of the unused NR_CEDE_STATES definition

 drivers/cpuidle/cpuidle-pseries.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index 3e058ad2..f5865a2 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -86,19 +86,26 @@ static void check_and_cede_processor(void)
}
 }
 
+#define NR_DEDICATED_STATES2 /* snooze, CEDE */
+
+u8 cede_latency_hint[NR_DEDICATED_STATES];
 static int dedicated_cede_loop(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index)
 {
+   u8 old_latency_hint;
 
pseries_idle_prolog();
get_lppaca()->donate_dedicated_cpu = 1;
+   old_latency_hint = get_lppaca()->cede_latency_hint;
+   get_lppaca()->cede_latency_hint = cede_latency_hint[index];
 
HMT_medium();
check_and_cede_processor();
 
local_irq_disable();
get_lppaca()->donate_dedicated_cpu = 0;
+   get_lppaca()->cede_latency_hint = old_latency_hint;
 
pseries_idle_epilog();
 
@@ -130,7 +137,7 @@ static int shared_cede_loop(struct cpuidle_device *dev,
 /*
  * States for dedicated partition case.
  */
-static struct cpuidle_state dedicated_states[] = {
+static struct cpuidle_state dedicated_states[NR_DEDICATED_STATES] = {
{ /* Snooze */
.name = "snooze",
.desc = "snooze",
@@ -233,7 +240,7 @@ static int pseries_idle_probe(void)
max_idle_state = ARRAY_SIZE(shared_states);
} else {
cpuidle_state_table = dedicated_states;
-   max_idle_state = ARRAY_SIZE(dedicated_states);
+   max_idle_state = NR_DEDICATED_STATES;
}
} else
return -ENODEV;
-- 
1.9.4



[PATCH v3] selftests: powerpc: Fix online CPU selection

2020-07-29 Thread Sandipan Das
The size of the CPU affinity mask must be large enough for
systems with a very large number of CPUs. Otherwise, tests
which try to determine the first online CPU by calling
sched_getaffinity() will fail. This makes sure that the size
of the allocated affinity mask is dependent on the number of
CPUs as reported by get_nprocs_conf().

Fixes: 3752e453f6ba ("selftests/powerpc: Add tests of PMU EBBs")
Reported-by: Shirisha Ganta 
Signed-off-by: Sandipan Das 
Reviewed-by: Kamalesh Babulal 
---
Previous versions can be found at:
v2: 
https://lore.kernel.org/linuxppc-dev/20200609073733.997643-1-sandi...@linux.ibm.com/
v1: 
https://lore.kernel.org/linuxppc-dev/20200608144212.985144-1-sandi...@linux.ibm.com/

Changes in v3:
- Use get_nprocs_conf() instead of get_nprocs() as suggested by
  Srikar to cope with test failures that Michael reported for
  sparse SMT setups.

Changes in v2:
- Added NULL check for the affinity mask as suggested by Kamalesh.
- Changed "cpu set" to "CPU affinity mask" in the commit message.

---
 tools/testing/selftests/powerpc/utils.c | 37 +
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/powerpc/utils.c b/tools/testing/selftests/powerpc/utils.c
index 933678f1ed0a0..18b6a773d5c73 100644
--- a/tools/testing/selftests/powerpc/utils.c
+++ b/tools/testing/selftests/powerpc/utils.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -88,28 +89,40 @@ void *get_auxv_entry(int type)
 
 int pick_online_cpu(void)
 {
-   cpu_set_t mask;
-   int cpu;
+   int ncpus, cpu = -1;
+   cpu_set_t *mask;
+   size_t size;
+
+   ncpus = get_nprocs_conf();
+   size = CPU_ALLOC_SIZE(ncpus);
+   mask = CPU_ALLOC(ncpus);
+   if (!mask) {
+   perror("malloc");
+   return -1;
+   }
 
-   CPU_ZERO(&mask);
+   CPU_ZERO_S(size, mask);
 
-   if (sched_getaffinity(0, sizeof(mask), &mask)) {
+   if (sched_getaffinity(0, size, mask)) {
perror("sched_getaffinity");
-   return -1;
+   goto done;
}
 
/* We prefer a primary thread, but skip 0 */
-   for (cpu = 8; cpu < CPU_SETSIZE; cpu += 8)
-   if (CPU_ISSET(cpu, &mask))
-   return cpu;
+   for (cpu = 8; cpu < ncpus; cpu += 8)
+   if (CPU_ISSET_S(cpu, size, mask))
+   goto done;
 
/* Search for anything, but in reverse */
-   for (cpu = CPU_SETSIZE - 1; cpu >= 0; cpu--)
-   if (CPU_ISSET(cpu, &mask))
-   return cpu;
+   for (cpu = ncpus - 1; cpu >= 0; cpu--)
+   if (CPU_ISSET_S(cpu, size, mask))
+   goto done;
 
printf("No cpus in affinity mask?!\n");
-   return -1;
+
+done:
+   CPU_FREE(mask);
+   return cpu;
 }
 
 bool is_ppc64le(void)
-- 
2.25.1



Re: [PATCH v2] selftests: powerpc: Fix online CPU selection

2020-07-29 Thread Sandipan Das
Hi Srikar, Michael,

On 29/07/20 9:33 pm, Srikar Dronamraju wrote:
> * Sandipan Das  [2020-06-09 13:07:33]:
> 
>> The size of the CPU affinity mask must be large enough for
>> systems with a very large number of CPUs. Otherwise, tests
>> which try to determine the first online CPU by calling
>> sched_getaffinity() will fail. This makes sure that the size
>> of the allocated affinity mask is dependent on the number of
>> CPUs as reported by get_nprocs().
>>
>> Fixes: 3752e453f6ba ("selftests/powerpc: Add tests of PMU EBBs")
>> Reported-by: Shirisha Ganta 
>> Signed-off-by: Sandipan Das 
>> Reviewed-by: Kamalesh Babulal 
>> ---
>> Previous versions can be found at:
>> v1: 
>> https://lore.kernel.org/linuxppc-dev/20200608144212.985144-1-sandi...@linux.ibm.com/
>>
>> @@ -88,28 +89,40 @@ void *get_auxv_entry(int type)
>>
>>  int pick_online_cpu(void)
>>  {
>> -cpu_set_t mask;
>> -int cpu;
>> +int ncpus, cpu = -1;
>> +cpu_set_t *mask;
>> +size_t size;
>> +
>> +ncpus = get_nprocs();
> 
> Please use get_nprocs_conf or sysconf(_SC_NPROCESSORS_CONF). The manpage
> seems to suggest the latter. Not sure how accurate the manpage is.
> 
> get_nprocs is returning online cpus and when smt is off, the cpu numbers
> would be sparse and hence the result from get_nprocs wouldn't be ideal for
> allocating cpumask. However get_nprocs_conf would return the max configured
> cpus and would be able to handle it. 
> 
> I think this was the same situation hit by Michael Ellerman.
> 

Yes, that seems to be the case. Thanks for testing this.
Will fix this in v3.

- Sandipan


Re: [PATCH 11/15] memblock: reduce number of parameters in for_each_mem_range()

2020-07-29 Thread Baoquan He
On 07/28/20 at 08:11am, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> Currently for_each_mem_range() iterator is the most generic way to traverse
> memblock regions. As such, it has 8 parameters and it is hardly convenient
> to users. Most users choose to utilize one of its wrappers and the only
> user that actually needs most of the parameters outside memblock is s390
> crash dump implementation.
> 
> To avoid yet another naming for memblock iterators, rename the existing
> for_each_mem_range() to __for_each_mem_range() and add a new
> for_each_mem_range() wrapper with only index, start and end parameters.
> 
> The new wrapper nicely fits into init_unavailable_mem() and will be used in
> upcoming changes to simplify memblock traversals.
> 
> Signed-off-by: Mike Rapoport 
> ---
>  .clang-format  |  1 +
>  arch/arm64/kernel/machine_kexec_file.c |  6 ++
>  arch/s390/kernel/crash_dump.c  |  8 
>  include/linux/memblock.h   | 18 ++
>  mm/page_alloc.c|  3 +--
>  5 files changed, 22 insertions(+), 14 deletions(-)

Reviewed-by: Baoquan He 

> 
> diff --git a/.clang-format b/.clang-format
> index a0a96088c74f..52ededab25ce 100644
> --- a/.clang-format
> +++ b/.clang-format
> @@ -205,6 +205,7 @@ ForEachMacros:
>- 'for_each_memblock_type'
>- 'for_each_memcg_cache_index'
>- 'for_each_mem_pfn_range'
> +  - '__for_each_mem_range'
>- 'for_each_mem_range'
>- 'for_each_mem_range_rev'
>- 'for_each_migratetype_order'
> diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c
> index 361a1143e09e..5b0e67b93cdc 100644
> --- a/arch/arm64/kernel/machine_kexec_file.c
> +++ b/arch/arm64/kernel/machine_kexec_file.c
> @@ -215,8 +215,7 @@ static int prepare_elf_headers(void **addr, unsigned long *sz)
>   phys_addr_t start, end;
>  
>   nr_ranges = 1; /* for exclusion of crashkernel region */
> - for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
> - MEMBLOCK_NONE, &start, &end, NULL)
> + for_each_mem_range(i, &start, &end)
>   nr_ranges++;
>  
>   cmem = kmalloc(struct_size(cmem, ranges, nr_ranges), GFP_KERNEL);
> @@ -225,8 +224,7 @@ static int prepare_elf_headers(void **addr, unsigned long *sz)
>  
>   cmem->max_nr_ranges = nr_ranges;
>   cmem->nr_ranges = 0;
> - for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
> - MEMBLOCK_NONE, &start, &end, NULL) {
> + for_each_mem_range(i, &start, &end) {
>   cmem->ranges[cmem->nr_ranges].start = start;
>   cmem->ranges[cmem->nr_ranges].end = end - 1;
>   cmem->nr_ranges++;
> diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
> index f96a5857bbfd..e28085c725ff 100644
> --- a/arch/s390/kernel/crash_dump.c
> +++ b/arch/s390/kernel/crash_dump.c
> @@ -549,8 +549,8 @@ static int get_mem_chunk_cnt(void)
>   int cnt = 0;
>   u64 idx;
>  
> - for_each_mem_range(idx, &memblock.physmem, &oldmem_type, NUMA_NO_NODE,
> -MEMBLOCK_NONE, NULL, NULL, NULL)
> + __for_each_mem_range(idx, &memblock.physmem, &oldmem_type, NUMA_NO_NODE,
> +  MEMBLOCK_NONE, NULL, NULL, NULL)
>   cnt++;
>   return cnt;
>  }
> @@ -563,8 +563,8 @@ static void loads_init(Elf64_Phdr *phdr, u64 loads_offset)
>   phys_addr_t start, end;
>   u64 idx;
>  
> - for_each_mem_range(idx, &memblock.physmem, &oldmem_type, NUMA_NO_NODE,
> -MEMBLOCK_NONE, &start, &end, NULL) {
> + __for_each_mem_range(idx, &memblock.physmem, &oldmem_type, NUMA_NO_NODE,
> +  MEMBLOCK_NONE, &start, &end, NULL) {
>   phdr->p_filesz = end - start;
>   phdr->p_type = PT_LOAD;
>   phdr->p_offset = start;
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index e6a23b3db696..d70c2835e913 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -142,7 +142,7 @@ void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
>  void __memblock_free_late(phys_addr_t base, phys_addr_t size);
>  
>  /**
> - * for_each_mem_range - iterate through memblock areas from type_a and not
> + * __for_each_mem_range - iterate through memblock areas from type_a and not
>   * included in type_b. Or just type_a if type_b is NULL.
>   * @i: u64 used as loop variable
>   * @type_a: ptr to memblock_type to iterate
> @@ -153,7 +153,7 @@ void __memblock_free_late(phys_addr_t base, phys_addr_t size);
>   * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
>   * @p_nid: ptr to int for nid of the range, can be %NULL
>   */
> -#define for_each_mem_range(i, type_a, type_b, nid, flags,\
> +#define __for_each_mem_range(i, type_a, type_b, nid, flags,  \
>  p_start, p_end, p_nid)   \
>   for (i = 0, __next_mem_range(&i, nid, flags, type_a, type_b,\
>p_start, p_end, p_nid);   

Re: [PATCH 10/15] memblock: make memblock_debug and related functionality private

2020-07-29 Thread Baoquan He
On 07/28/20 at 08:11am, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> The only user of memblock_dbg() outside memblock was s390 setup code and it
> is converted to use pr_debug() instead.
> This allows to stop exposing memblock_debug and memblock_dbg() to the rest
> of the kernel.
> 
> Signed-off-by: Mike Rapoport 
> ---
>  arch/s390/kernel/setup.c |  4 ++--
>  include/linux/memblock.h | 12 +---
>  mm/memblock.c| 13 +++--
>  3 files changed, 14 insertions(+), 15 deletions(-)

Nice clean up.

Reviewed-by: Baoquan He 

> 
> diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
> index 07aa15ba43b3..8b284cf6e199 100644
> --- a/arch/s390/kernel/setup.c
> +++ b/arch/s390/kernel/setup.c
> @@ -776,8 +776,8 @@ static void __init memblock_add_mem_detect_info(void)
>   unsigned long start, end;
>   int i;
>  
> - memblock_dbg("physmem info source: %s (%hhd)\n",
> -  get_mem_info_source(), mem_detect.info_source);
> + pr_debug("physmem info source: %s (%hhd)\n",
> +  get_mem_info_source(), mem_detect.info_source);
>   /* keep memblock lists close to the kernel */
>   memblock_set_bottom_up(true);
>   for_each_mem_detect_block(i, &start, &end) {
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 220b5f0dad42..e6a23b3db696 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -90,7 +90,6 @@ struct memblock {
>  };
>  
>  extern struct memblock memblock;
> -extern int memblock_debug;
>  
>  #ifndef CONFIG_ARCH_KEEP_MEMBLOCK
>  #define __init_memblock __meminit
> @@ -102,9 +101,6 @@ void memblock_discard(void);
>  static inline void memblock_discard(void) {}
>  #endif
>  
> -#define memblock_dbg(fmt, ...) \
> - if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
> -
>  phys_addr_t memblock_find_in_range(phys_addr_t start, phys_addr_t end,
>  phys_addr_t size, phys_addr_t align);
>  void memblock_allow_resize(void);
> @@ -456,13 +452,7 @@ bool memblock_is_region_memory(phys_addr_t base, phys_addr_t size);
>  bool memblock_is_reserved(phys_addr_t addr);
>  bool memblock_is_region_reserved(phys_addr_t base, phys_addr_t size);
>  
> -extern void __memblock_dump_all(void);
> -
> -static inline void memblock_dump_all(void)
> -{
> - if (memblock_debug)
> - __memblock_dump_all();
> -}
> +void memblock_dump_all(void);
>  
>  /**
>   * memblock_set_current_limit - Set the current allocation limit to allow
> diff --git a/mm/memblock.c b/mm/memblock.c
> index a5b9b3df81fc..824938849f6d 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -134,7 +134,10 @@ struct memblock memblock __initdata_memblock = {
>i < memblock_type->cnt;\
>i++, rgn = &memblock_type->regions[i])
>  
> -int memblock_debug __initdata_memblock;
> +#define memblock_dbg(fmt, ...) \
> + if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
> +
> +static int memblock_debug __initdata_memblock;
>  static bool system_has_some_mirror __initdata_memblock = false;
>  static int memblock_can_resize __initdata_memblock;
>  static int memblock_memory_in_slab __initdata_memblock = 0;
> @@ -1919,7 +1922,7 @@ static void __init_memblock memblock_dump(struct memblock_type *type)
>   }
>  }
>  
> -void __init_memblock __memblock_dump_all(void)
> +static void __init_memblock __memblock_dump_all(void)
>  {
>   pr_info("MEMBLOCK configuration:\n");
>   pr_info(" memory size = %pa reserved size = %pa\n",
> @@ -1933,6 +1936,12 @@ void __init_memblock __memblock_dump_all(void)
>  #endif
>  }
>  
> +void __init_memblock memblock_dump_all(void)
> +{
> + if (memblock_debug)
> + __memblock_dump_all();
> +}
> +
>  void __init memblock_allow_resize(void)
>  {
>   memblock_can_resize = 1;
> -- 
> 2.26.2
> 
> 



Re: [PATCH 09/15] memblock: make for_each_memblock_type() iterator private

2020-07-29 Thread Baoquan He
On 07/28/20 at 08:11am, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> for_each_memblock_type() is not used outside mm/memblock.c, move it there
> from include/linux/memblock.h
> 
> ---
>  include/linux/memblock.h | 5 -
>  mm/memblock.c| 5 +
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 017fae833d4a..220b5f0dad42 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -532,11 +532,6 @@ static inline unsigned long memblock_region_reserved_end_pfn(const struct memblo
>region < (memblock.memblock_type.regions + memblock.memblock_type.cnt);\
>region++)
>  
> -#define for_each_memblock_type(i, memblock_type, rgn)\
> - for (i = 0, rgn = &memblock_type->regions[0];   \
> -  i < memblock_type->cnt;\
> -  i++, rgn = &memblock_type->regions[i])
> -
>  extern void *alloc_large_system_hash(const char *tablename,
>unsigned long bucketsize,
>unsigned long numentries,
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 39aceafc57f6..a5b9b3df81fc 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -129,6 +129,11 @@ struct memblock memblock __initdata_memblock = {
>   .current_limit  = MEMBLOCK_ALLOC_ANYWHERE,
>  };
>  
> +#define for_each_memblock_type(i, memblock_type, rgn)\
> + for (i = 0, rgn = &memblock_type->regions[0];   \
> +  i < memblock_type->cnt;\
> +  i++, rgn = &memblock_type->regions[i])
> +

Reviewed-by: Baoquan He 



Re: [PATCH v2] powerpc/vio: drop bus_type from parent device

2020-07-29 Thread Michael Ellerman
[ Added Peter & Greg to Cc ]

Thadeu Lima de Souza Cascardo  writes:
> Commit df44b479654f62b478c18ee4d8bc4e9f897a9844 ("kobject: return error
> code if writing /sys/.../uevent fails") started returning failure when
> writing to /sys/devices/vio/uevent.
>
> This causes an early udevadm trigger to fail. On some installer versions of
> Ubuntu, this will cause init to exit, thus panicking the system very early
> during boot.
>
> Removing the bus_type from the parent device will remove some of the extra
> empty files from /sys/devices/vio/, but will keep the rest of the layout
> for vio devices, keeping them under /sys/devices/vio/.

What exactly does it change?

I'm finding it hard to evaluate if this change is going to cause a
regression somehow.

I'm also not clear on why removing the bus type is correct, apart from
whether it fixes the bug you're seeing.

> It has been tested that uevents for vio devices don't change after this
> fix, they still contain MODALIAS.
>
> Signed-off-by: Thadeu Lima de Souza Cascardo 
> Fixes: df44b479654f ("kobject: return error code if writing /sys/.../uevent fails")

AFAICS there haven't been any other fixes for that commit. Do we know
why it is only vio that was affected? (possibly because it's a fake bus
to begin with?)

cheers

> diff --git a/arch/powerpc/platforms/pseries/vio.c 
> b/arch/powerpc/platforms/pseries/vio.c
> index 37f1f25ba804..a94dab3972a0 100644
> --- a/arch/powerpc/platforms/pseries/vio.c
> +++ b/arch/powerpc/platforms/pseries/vio.c
> @@ -36,7 +36,6 @@ static struct vio_dev vio_bus_device  = { /* fake "parent" device */
>   .name = "vio",
>   .type = "",
>   .dev.init_name = "vio",
> - .dev.bus = &vio_bus_type,
>  };
>  
>  #ifdef CONFIG_PPC_SMLPAR
> -- 
> 2.20.1


Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal

2020-07-29 Thread Michael Ellerman
Nathan Lynch  writes:
> Laurent Dufour  writes:
>> Le 28/07/2020 à 19:37, Nathan Lynch a écrit :
>>> The drmem lmb list can have hundreds of thousands of entries, and
>>> unfortunately lookups take the form of linear searches. As long as
>>> this is the case, traversals have the potential to monopolize the CPU
>>> and provoke lockup reports, workqueue stalls, and the like unless
>>> they explicitly yield.
>>> 
>>> Rather than placing cond_resched() calls within various
>>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
>>> expression of the loop macro itself so users can't omit it.
>>
>> Is that not too much to call cond_resched() on every LMB?
>>
>> Could that be less frequent, every 10, or 100, I don't really know ?
>
> Everything done within for_each_drmem_lmb is relatively heavyweight
> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens
> of milliseconds. I don't think cond_resched() is an expensive check in
> this context.

Hmm, mostly.

But there are quite a few cases like drmem_update_dt_v1():

for_each_drmem_lmb(lmb) {
dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));

dr_cell++;
}

Which will compile to a pretty tight loop at the moment.

Or drmem_update_dt_v2() which has two loops over all lmbs.

And although the actual TIF check is cheap the function call to do it is
not free.

So I worry this is going to make some of those long loops take even longer.

At the same time I don't see an easy way to batch the calls to
cond_resched() without more intrusive changes.
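
One (untested) way to amortize the check would be to fold a counter into
the iterator itself, e.g. (sketch only; drmem_lmb_next() is a
hypothetical helper and 16 an arbitrary interval):

static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb,
					       const struct drmem_lmb *start)
{
	/* Yield only every 16th element rather than on each iteration. */
	if ((++lmb - start) % 16 == 0)
		cond_resched();
	return lmb;
}

#define for_each_drmem_lmb(lmb)						\
	for ((lmb) = &drmem_info->lmbs[0];				\
	     (lmb) < &drmem_info->lmbs[drmem_info->n_lmbs];		\
	     (lmb) = drmem_lmb_next((lmb), &drmem_info->lmbs[0]))

The helper is static inline, so the per-element cost reduces to the
increment plus the modulo check.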

cheers


Re: [PATCH] powerpc: fix function annotations to avoid section mismatch warnings with gcc-10

2020-07-29 Thread Segher Boessenkool
On Wed, Jul 29, 2020 at 03:44:56PM -0400, Vladis Dronov wrote:
> > > Certain warnings are emitted for powerpc code when building with a gcc-10
> > > toolset:
> > > 
> > > WARNING: modpost: vmlinux.o(.text.unlikely+0x377c): Section mismatch 
> > > in
> > > reference from the function remove_pmd_table() to the function
> > > .meminit.text:split_kernel_mapping()
> > > The function remove_pmd_table() references
> > > the function __meminit split_kernel_mapping().
> > > This is often because remove_pmd_table lacks a __meminit
> > > annotation or the annotation of split_kernel_mapping is wrong.
> > > 
> > > Add the appropriate __init and __meminit annotations to make modpost not
> > > complain. In all the cases there is just a single callsite from another
> > > __init or __meminit function:
> > > 
> > > __meminit remove_pagetable() -> remove_pud_table() -> remove_pmd_table()
> > > __init prom_init() -> setup_secure_guest()
> > > __init xive_spapr_init() -> xive_spapr_disabled()
> > 
> > So what changed?  These functions were inlined with older compilers, but
> > not anymore?
> 
> Yes, exactly. Gcc-10 does not inline them anymore. If this is because of my
> build system, this can happen to others also.
> 
> The same thing was fixed by Linus in e99332e7b4cd ("gcc-10: mark more 
> functions
> __init to avoid section mismatch warnings").

It sounds like this is part of "-finline-functions was retuned" on
https://gcc.gnu.org/gcc-10/changes.html ?  So everyone should see it
(no matter what config or build system), and it is a good thing too :-)

Thanks for the confirmation,


Segher


Re: [PATCH net] ibmvnic: Fix IRQ mapping disposal in error path

2020-07-29 Thread Thomas Falcon



On 7/29/20 5:28 PM, Jakub Kicinski wrote:

On Wed, 29 Jul 2020 16:36:32 -0500 Thomas Falcon wrote:

RX queue IRQ mappings are disposed in both the TX IRQ and RX IRQ
error paths. Fix this and dispose of TX IRQ mappings correctly in
case of an error.

Signed-off-by: Thomas Falcon 

Thomas, please remember about Fixes tags (for networking patches,
at least):

Fixes: ea22d51a7831 ("ibmvnic: simplify and improve driver probe function")


Sorry, Jakub, I will try not to forget next time. Thanks.



Re: [PATCH net] ibmvnic: Fix IRQ mapping disposal in error path

2020-07-29 Thread David Miller
From: Thomas Falcon 
Date: Wed, 29 Jul 2020 16:36:32 -0500

> RX queue IRQ mappings are disposed in both the TX IRQ and RX IRQ
> error paths. Fix this and dispose of TX IRQ mappings correctly in
> case of an error.
> 
> Signed-off-by: Thomas Falcon 

Applied with Fixes: tag added and queued up for -stable.


Re: [PATCH net] ibmvnic: Fix IRQ mapping disposal in error path

2020-07-29 Thread Jakub Kicinski
On Wed, 29 Jul 2020 16:36:32 -0500 Thomas Falcon wrote:
> RX queue IRQ mappings are disposed in both the TX IRQ and RX IRQ
> error paths. Fix this and dispose of TX IRQ mappings correctly in
> case of an error.
> 
> Signed-off-by: Thomas Falcon 

Thomas, please remember about Fixes tags (for networking patches, 
at least):

Fixes: ea22d51a7831 ("ibmvnic: simplify and improve driver probe function")


Re: [PATCH v2] pseries/drmem: don't cache node id in drmem_lmb struct

2020-07-29 Thread Laurent Dufour

Hi Scott,

Le 28/07/2020 à 18:53, Scott Cheloha a écrit :

At memory hot-remove time we can retrieve an LMB's nid from its
corresponding memory_block.  There is no need to store the nid
in multiple locations.

Note that lmb_to_memblock() uses find_memory_block() to get the
corresponding memory_block.  As find_memory_block() runs in sub-linear
time this approach is negligibly slower than what we do at present.

In exchange for this lookup at hot-remove time we no longer need to
call memory_add_physaddr_to_nid() during drmem_init() for each LMB.
On powerpc, memory_add_physaddr_to_nid() is a linear search, so this
spares us an O(n^2) initialization during boot.

On systems with many LMBs that initialization overhead is palpable and
disruptive.  For example, on a box with 249854 LMBs we're seeing
drmem_init() take upwards of 30 seconds to complete:

[   53.721639] drmem: initializing drmem v2
[   80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1]
[   80.604377] Modules linked in:
[   80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4
[   80.604397] NIP:  c00a4980 LR: c00a4940 CTR: 
[   80.604407] REGS: c0002dbff8493830 TRAP: 0901   Not tainted  (5.6.0-rc2+)
[   80.604412] MSR:  82009033   CR: 44000248  XER: 000d
[   80.604431] CFAR: c00a4a38 IRQMASK: 0
[   80.604431] GPR00: c00a4940 c0002dbff8493ac0 c1904400 c0003cfede30
[   80.604431] GPR04:  c0f4095a 002f 1000
[   80.604431] GPR08: cbf7ecdb7fb8 cbf7ecc2d3c8 0008 c00c0002fdfb2001
[   80.604431] GPR12:  c0001e8ec200
[   80.604477] NIP [c00a4980] hot_add_scn_to_nid+0xa0/0x3e0
[   80.604486] LR [c00a4940] hot_add_scn_to_nid+0x60/0x3e0
[   80.604492] Call Trace:
[   80.604498] [c0002dbff8493ac0] [c00a4940] hot_add_scn_to_nid+0x60/0x3e0 (unreliable)
[   80.604509] [c0002dbff8493b20] [c0087c10] memory_add_physaddr_to_nid+0x20/0x60
[   80.604521] [c0002dbff8493b40] [c10d4880] drmem_init+0x25c/0x2f0
[   80.604530] [c0002dbff8493c10] [c0010154] do_one_initcall+0x64/0x2c0
[   80.604540] [c0002dbff8493ce0] [c10c4aa0] kernel_init_freeable+0x2d8/0x3a0
[   80.604550] [c0002dbff8493db0] [c0010824] kernel_init+0x2c/0x148
[   80.604560] [c0002dbff8493e20] [c000b648] ret_from_kernel_thread+0x5c/0x74
[   80.604567] Instruction dump:
[   80.604574] 392918e8 e949 e90a000a e92a 80ea000c 1d080018 3908ffe8 7d094214
[   80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e949 7fbe5040
[   89.047390] drmem: 249854 LMB(s)

With a patched kernel on the same machine we're no longer seeing the
soft lockup.  drmem_init() now completes in negligible time, even when
the LMB count is large.

Signed-off-by: Scott Cheloha 
---
  arch/powerpc/include/asm/drmem.h  | 21 ---
  arch/powerpc/mm/drmem.c   |  6 +-
  .../platforms/pseries/hotplug-memory.c| 19 ++---
  3 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 414d209f45bb..34e4e9b257f5 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -13,9 +13,6 @@ struct drmem_lmb {
u32 drc_index;
u32 aa_index;
u32 flags;
-#ifdef CONFIG_MEMORY_HOTPLUG
-   int nid;
-#endif
  };

  struct drmem_lmb_info {
@@ -104,22 +101,4 @@ static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
lmb->aa_index = 0x;
  }

-#ifdef CONFIG_MEMORY_HOTPLUG
-static inline void lmb_set_nid(struct drmem_lmb *lmb)
-{
-   lmb->nid = memory_add_physaddr_to_nid(lmb->base_addr);
-}
-static inline void lmb_clear_nid(struct drmem_lmb *lmb)
-{
-   lmb->nid = -1;
-}
-#else
-static inline void lmb_set_nid(struct drmem_lmb *lmb)
-{
-}
-static inline void lmb_clear_nid(struct drmem_lmb *lmb)
-{
-}
-#endif
-
  #endif /* _ASM_POWERPC_LMB_H */
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 59327cefbc6a..873fcfc7b875 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -362,10 +362,8 @@ static void __init init_drmem_v1_lmbs(const __be32 *prop)
if (!drmem_info->lmbs)
return;

-   for_each_drmem_lmb(lmb) {
+   for_each_drmem_lmb(lmb)
read_drconf_v1_cell(lmb, &prop);
-   lmb_set_nid(lmb);
-   }
  }

  static void __init init_drmem_v2_lmbs(const __be32 *prop)
@@ -410,8 +408,6 @@ static void __init init_drmem_v2_lmbs(const __be32 *prop)

lmb->aa_index = dr_cell.aa_index;
lmb->flags = dr_cell.flags;
-
-   lmb_set_nid(lmb);
}
}
  }
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c

linux-next: Fixes tag needs some work in the powerpc tree

2020-07-29 Thread Stephen Rothwell
Hi all,

In commit

  443359aebce0 ("powerpc/perf: Fix MMCRA_BHRB_DISABLE define for binutils < 2.28")

Fixes tag

  Fixes: 9908c826d5ed ("Add Power10 PMU feature to DT CPU features")

has these problem(s):

  - Subject does not match target commit subject
Just use
git log -1 --format='Fixes: %h ("%s")'

Just a hint for the future.

-- 
Cheers,
Stephen Rothwell




[PATCH 2/2] spi: mpc512x-psc: Convert to use GPIO descriptors

2020-07-29 Thread Linus Walleij
This driver is already relying on the core to provide
valid GPIO numbers in spi->cs_gpio through
of_spi_get_gpio_numbers(), so we can switch to letting
the core use GPIO descriptors instead.

Make sure that the .set_cs() is always called, as some controller
set-up is also needed.

Cc: Uwe Kleine-König 
Cc: Anatolij Gustschin 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Linus Walleij 
---
 drivers/spi/spi-mpc512x-psc.c | 33 +++--
 1 file changed, 7 insertions(+), 26 deletions(-)

diff --git a/drivers/spi/spi-mpc512x-psc.c b/drivers/spi/spi-mpc512x-psc.c
index 35313a77f977..dd8bba408301 100644
--- a/drivers/spi/spi-mpc512x-psc.c
+++ b/drivers/spi/spi-mpc512x-psc.c
@@ -23,7 +23,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 enum {
@@ -99,7 +98,7 @@ static void mpc512x_psc_spi_set_cs(struct spi_device *spi, bool enable)
u16 bclkdiv;
 
if (!enable) {
-   if (mps->cs_control && gpio_is_valid(spi->cs_gpio))
+   if (mps->cs_control && spi->cs_gpiod)
mps->cs_control(spi, (spi->mode & SPI_CS_HIGH) ? 0 : 1);
return;
}
@@ -134,7 +133,7 @@ static void mpc512x_psc_spi_set_cs(struct spi_device *spi, bool enable)
out_be32(psc_addr(mps, ccr), ccr);
mps->bits_per_word = cs->bits_per_word;
 
-   if (mps->cs_control && gpio_is_valid(spi->cs_gpio))
+   if (mps->cs_control && spi->cs_gpiod)
mps->cs_control(spi, (spi->mode & SPI_CS_HIGH) ? 1 : 0);
 }
 
@@ -358,18 +357,6 @@ static int mpc512x_psc_spi_setup(struct spi_device *spi)
if (!cs)
return -ENOMEM;
 
-   if (gpio_is_valid(spi->cs_gpio)) {
-   ret = gpio_request(spi->cs_gpio, dev_name(>dev));
-   if (ret) {
-   dev_err(>dev, "can't get CS gpio: %d\n",
-   ret);
-   kfree(cs);
-   return ret;
-   }
-   gpio_direction_output(spi->cs_gpio,
-   spi->mode & SPI_CS_HIGH ? 0 : 1);
-   }
-
spi->controller_state = cs;
}
 
@@ -381,8 +368,6 @@ static int mpc512x_psc_spi_setup(struct spi_device *spi)
 
 static void mpc512x_psc_spi_cleanup(struct spi_device *spi)
 {
-   if (gpio_is_valid(spi->cs_gpio))
-   gpio_free(spi->cs_gpio);
kfree(spi->controller_state);
 }
 
@@ -461,11 +446,6 @@ static irqreturn_t mpc512x_psc_spi_isr(int irq, void *dev_id)
return IRQ_NONE;
 }
 
-static void mpc512x_spi_cs_control(struct spi_device *spi, bool onoff)
-{
-   gpio_set_value(spi->cs_gpio, onoff);
-}
-
 static int mpc512x_psc_spi_do_probe(struct device *dev, u32 regaddr,
  u32 size, unsigned int irq)
 {
@@ -485,10 +465,8 @@ static int mpc512x_psc_spi_do_probe(struct device *dev, u32 regaddr,
mps->type = (int)of_device_get_match_data(dev);
mps->irq = irq;
 
-   if (pdata == NULL) {
-   mps->cs_control = mpc512x_spi_cs_control;
-   } else {
-   mps->cs_control = pdata->cs_control;
+   if (pdata) {
+   mps->cs_control = pdata->cs_control;
master->bus_num = pdata->bus_num;
master->num_chipselect = pdata->max_chipselect;
}
@@ -499,6 +477,9 @@ static int mpc512x_psc_spi_do_probe(struct device *dev, u32 regaddr,
master->transfer_one_message = mpc512x_psc_spi_msg_xfer;
master->unprepare_transfer_hardware = mpc512x_psc_spi_unprep_xfer_hw;
master->set_cs = mpc512x_psc_spi_set_cs;
+   /* This makes sure our custom .set_cs() is always called */
+   master->flags = SPI_MASTER_GPIO_SS;
+   master->use_gpio_descriptors = true;
master->cleanup = mpc512x_psc_spi_cleanup;
master->dev.of_node = dev->of_node;
 
-- 
2.26.2



[PATCH 1/2] spi: mpc512x-psc: Use the framework .set_cs()

2020-07-29 Thread Linus Walleij
The mpc512x-psc is rolling its own chip select control code,
but the SPI master framework can handle this. It was also
evaluating the CS status for each transfer but the CS change
should be per-message not per-transfer.

Switch to use the core .set_cs() to control the chip select.

Cc: Uwe Kleine-König 
Cc: Anatolij Gustschin 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Linus Walleij 
---
 drivers/spi/spi-mpc512x-psc.c | 30 --
 1 file changed, 8 insertions(+), 22 deletions(-)

diff --git a/drivers/spi/spi-mpc512x-psc.c b/drivers/spi/spi-mpc512x-psc.c
index ea1b07953d38..35313a77f977 100644
--- a/drivers/spi/spi-mpc512x-psc.c
+++ b/drivers/spi/spi-mpc512x-psc.c
@@ -89,7 +89,7 @@ static int mpc512x_psc_spi_transfer_setup(struct spi_device *spi,
return 0;
 }
 
-static void mpc512x_psc_spi_activate_cs(struct spi_device *spi)
+static void mpc512x_psc_spi_set_cs(struct spi_device *spi, bool enable)
 {
struct mpc512x_psc_spi_cs *cs = spi->controller_state;
struct mpc512x_psc_spi *mps = spi_master_get_devdata(spi->master);
@@ -98,6 +98,12 @@ static void mpc512x_psc_spi_activate_cs(struct spi_device *spi)
int speed;
u16 bclkdiv;
 
+   if (!enable) {
+   if (mps->cs_control && gpio_is_valid(spi->cs_gpio))
+   mps->cs_control(spi, (spi->mode & SPI_CS_HIGH) ? 0 : 1);
+   return;
+   }
+
sicr = in_be32(psc_addr(mps, sicr));
 
/* Set clock phase and polarity */
@@ -132,15 +138,6 @@ static void mpc512x_psc_spi_activate_cs(struct spi_device *spi)
mps->cs_control(spi, (spi->mode & SPI_CS_HIGH) ? 1 : 0);
 }
 
-static void mpc512x_psc_spi_deactivate_cs(struct spi_device *spi)
-{
-   struct mpc512x_psc_spi *mps = spi_master_get_devdata(spi->master);
-
-   if (mps->cs_control && gpio_is_valid(spi->cs_gpio))
-   mps->cs_control(spi, (spi->mode & SPI_CS_HIGH) ? 0 : 1);
-
-}
-
 /* extract and scale size field in txsz or rxsz */
 #define MPC512x_PSC_FIFO_SZ(sz) ((sz & 0x7ff) << 2);
 
@@ -290,40 +287,28 @@ static int mpc512x_psc_spi_msg_xfer(struct spi_master *master,
struct spi_message *m)
 {
struct spi_device *spi;
-   unsigned cs_change;
int status;
struct spi_transfer *t;
 
spi = m->spi;
-   cs_change = 1;
status = 0;
list_for_each_entry(t, >transfers, transfer_list) {
status = mpc512x_psc_spi_transfer_setup(spi, t);
if (status < 0)
break;
 
-   if (cs_change)
-   mpc512x_psc_spi_activate_cs(spi);
-   cs_change = t->cs_change;
-
status = mpc512x_psc_spi_transfer_rxtx(spi, t);
if (status)
break;
m->actual_length += t->len;
 
spi_transfer_delay_exec(t);
-
-   if (cs_change)
-   mpc512x_psc_spi_deactivate_cs(spi);
}
 
m->status = status;
if (m->complete)
m->complete(m->context);
 
-   if (status || !cs_change)
-   mpc512x_psc_spi_deactivate_cs(spi);
-
mpc512x_psc_spi_transfer_setup(spi, NULL);
 
spi_finalize_current_message(master);
@@ -513,6 +498,7 @@ static int mpc512x_psc_spi_do_probe(struct device *dev, u32 regaddr,
master->prepare_transfer_hardware = mpc512x_psc_spi_prep_xfer_hw;
master->transfer_one_message = mpc512x_psc_spi_msg_xfer;
master->unprepare_transfer_hardware = mpc512x_psc_spi_unprep_xfer_hw;
+   master->set_cs = mpc512x_psc_spi_set_cs;
master->cleanup = mpc512x_psc_spi_cleanup;
master->dev.of_node = dev->of_node;
 
-- 
2.26.2



[PATCH net] ibmvnic: Fix IRQ mapping disposal in error path

2020-07-29 Thread Thomas Falcon
RX queue IRQ mappings are disposed in both the TX IRQ and RX IRQ
error paths. Fix this and dispose of TX IRQ mappings correctly in
case of an error.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 0fd7eae..5afb3c9 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3206,7 +3206,7 @@ static int init_sub_crq_irqs(struct ibmvnic_adapter *adapter)
 req_tx_irq_failed:
for (j = 0; j < i; j++) {
free_irq(adapter->tx_scrq[j]->irq, adapter->tx_scrq[j]);
-   irq_dispose_mapping(adapter->rx_scrq[j]->irq);
+   irq_dispose_mapping(adapter->tx_scrq[j]->irq);
}
release_sub_crqs(adapter, 1);
return rc;
-- 
1.8.3.1



Re: [PATCH] powerpc: fix function annotations to avoid section mismatch warnings with gcc-10

2020-07-29 Thread Vladis Dronov
Hello,

- Original Message -
> From: "Segher Boessenkool" 
> To: "Vladis Dronov" 
> Cc: linuxppc-dev@lists.ozlabs.org, "Aneesh Kumar K . V" 
> , linux-ker...@vger.kernel.org,
> "Paul Mackerras" 
> Sent: Wednesday, July 29, 2020 4:49:49 PM
> Subject: Re: [PATCH] powerpc: fix function annotations to avoid section 
> mismatch warnings with gcc-10
> 
> On Wed, Jul 29, 2020 at 03:37:41PM +0200, Vladis Dronov wrote:
> > Certain warnings are emitted for powerpc code when building with a gcc-10
> > toolset:
> > 
> > WARNING: modpost: vmlinux.o(.text.unlikely+0x377c): Section mismatch in
> > reference from the function remove_pmd_table() to the function
> > .meminit.text:split_kernel_mapping()
> > The function remove_pmd_table() references
> > the function __meminit split_kernel_mapping().
> > This is often because remove_pmd_table lacks a __meminit
> > annotation or the annotation of split_kernel_mapping is wrong.
> > 
> > Add the appropriate __init and __meminit annotations to make modpost not
> > complain. In all the cases there is just a single callsite from another
> > __init or __meminit function:
> > 
> > __meminit remove_pagetable() -> remove_pud_table() -> remove_pmd_table()
> > __init prom_init() -> setup_secure_guest()
> > __init xive_spapr_init() -> xive_spapr_disabled()
> 
> So what changed?  These functions were inlined with older compilers, but
> not anymore?

Yes, exactly. Gcc-10 does not inline them anymore. If this is because of my
build system, this can happen to others also.

The same thing was fixed by Linus in e99332e7b4cd ("gcc-10: mark more functions
__init to avoid section mismatch warnings").

> 
> Segher

Best regards,
Vladis Dronov | Red Hat, Inc. | The Core Kernel | Senior Software Engineer



Re: [PATCH v2] selftests: powerpc: Fix online CPU selection

2020-07-29 Thread Srikar Dronamraju
* Sandipan Das  [2020-06-09 13:07:33]:

> The size of the CPU affinity mask must be large enough for
> systems with a very large number of CPUs. Otherwise, tests
> which try to determine the first online CPU by calling
> sched_getaffinity() will fail. This makes sure that the size
> of the allocated affinity mask is dependent on the number of
> CPUs as reported by get_nprocs().
> 
> Fixes: 3752e453f6ba ("selftests/powerpc: Add tests of PMU EBBs")
> Reported-by: Shirisha Ganta 
> Signed-off-by: Sandipan Das 
> Reviewed-by: Kamalesh Babulal 
> ---
> Previous versions can be found at:
> v1: 
> https://lore.kernel.org/linuxppc-dev/20200608144212.985144-1-sandi...@linux.ibm.com/
> 
> @@ -88,28 +89,40 @@ void *get_auxv_entry(int type)
> 
>  int pick_online_cpu(void)
>  {
> - cpu_set_t mask;
> - int cpu;
> + int ncpus, cpu = -1;
> + cpu_set_t *mask;
> + size_t size;
> +
> + ncpus = get_nprocs();

Please use get_nprocs_conf() or sysconf(_SC_NPROCESSORS_CONF). The manpage
seems to suggest the latter; I'm not sure how accurate the manpage is.

get_nprocs() returns the number of online CPUs. When SMT is off, the CPU
numbering can be sparse, so the result of get_nprocs() isn't ideal for
sizing the cpumask. get_nprocs_conf(), on the other hand, returns the
maximum number of configured CPUs and handles the sparse case.

I think this was the same situation hit by Michael Ellerman.

> + size = CPU_ALLOC_SIZE(ncpus);
> + mask = CPU_ALLOC(ncpus);
> + if (!mask) {
> + perror("malloc");
> + return -1;
> + }
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH] powerpc/fadump: Fix build error with CONFIG_PRESERVE_FA_DUMP=y

2020-07-29 Thread Gustavo Romero

On 7/27/20 4:03 AM, Michael Ellerman wrote:

skiroot_defconfig fails:

arch/powerpc/kernel/fadump.c:48:17: error: ‘cpus_in_fadump’ defined but not used
48 | static atomic_t cpus_in_fadump;

Fix it by moving the definition into the #ifdef where it's used.

Fixes: ba608c4fa12c ("powerpc/fadump: fix race between pstore write and fadump crash trigger")
Signed-off-by: Michael Ellerman 
---
  arch/powerpc/kernel/fadump.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 1858896d6809..10ebb4bf71ad 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -45,10 +45,12 @@ static struct fw_dump fw_dump;
  static void __init fadump_reserve_crash_area(u64 base);

  struct kobject *fadump_kobj;
-static atomic_t cpus_in_fadump;

  #ifndef CONFIG_PRESERVE_FA_DUMP
+
+static atomic_t cpus_in_fadump;
  static DEFINE_MUTEX(fadump_mutex);
+
  struct fadump_mrange_info crash_mrange_info = { "crash", NULL, 0, 0, 0, false };

  #define RESERVED_RNGS_SZ  16384 /* 16K - 128 entries */



Tested-by: Gustavo Romero 


Thanks,
Gustavo


Re: [PATCH] powerpc: fix function annotations to avoid section mismatch warnings with gcc-10

2020-07-29 Thread Segher Boessenkool
On Wed, Jul 29, 2020 at 03:37:41PM +0200, Vladis Dronov wrote:
> Certain warnings are emitted for powerpc code when building with a gcc-10
> toolset:
> 
> WARNING: modpost: vmlinux.o(.text.unlikely+0x377c): Section mismatch in
> reference from the function remove_pmd_table() to the function
> .meminit.text:split_kernel_mapping()
> The function remove_pmd_table() references
> the function __meminit split_kernel_mapping().
> This is often because remove_pmd_table lacks a __meminit
> annotation or the annotation of split_kernel_mapping is wrong.
> 
> Add the appropriate __init and __meminit annotations to make modpost not
> complain. In all the cases there are just a single callsite from another
> __init or __meminit function:
> 
> __meminit remove_pagetable() -> remove_pud_table() -> remove_pmd_table()
> __init prom_init() -> setup_secure_guest()
> __init xive_spapr_init() -> xive_spapr_disabled()

So what changed?  These functions were inlined with older compilers, but
not anymore?


Segher


[PATCH] powerpc: fix function annotations to avoid section mismatch warnings with gcc-10

2020-07-29 Thread Vladis Dronov
Certain warnings are emitted for powerpc code when building with a gcc-10
toolset:

WARNING: modpost: vmlinux.o(.text.unlikely+0x377c): Section mismatch in
reference from the function remove_pmd_table() to the function
.meminit.text:split_kernel_mapping()
The function remove_pmd_table() references
the function __meminit split_kernel_mapping().
This is often because remove_pmd_table lacks a __meminit
annotation or the annotation of split_kernel_mapping is wrong.

Add the appropriate __init and __meminit annotations to make modpost not
complain. In all the cases there are just a single callsite from another
__init or __meminit function:

__meminit remove_pagetable() -> remove_pud_table() -> remove_pmd_table()
__init prom_init() -> setup_secure_guest()
__init xive_spapr_init() -> xive_spapr_disabled()

Signed-off-by: Vladis Dronov 
---
 arch/powerpc/kernel/prom_init.c  | 4 ++--
 arch/powerpc/mm/book3s64/radix_pgtable.c | 4 ++--
 arch/powerpc/sysdev/xive/spapr.c | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 90c604d00b7d..f6ca7f450361 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -3262,7 +3262,7 @@ static int enter_secure_mode(unsigned long kbase, unsigned long fdt)
 /*
  * Call the Ultravisor to transfer us to secure memory if we have an ESM blob.
  */
-static void setup_secure_guest(unsigned long kbase, unsigned long fdt)
+static void __init setup_secure_guest(unsigned long kbase, unsigned long fdt)
 {
int ret;
 
@@ -3292,7 +3292,7 @@ static void setup_secure_guest(unsigned long kbase, unsigned long fdt)
}
 }
 #else
-static void setup_secure_guest(unsigned long kbase, unsigned long fdt)
+static void __init setup_secure_guest(unsigned long kbase, unsigned long fdt)
 {
 }
 #endif /* CONFIG_PPC_SVM */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index bb00e0cba119..b868c07110e3 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -800,7 +800,7 @@ static void __meminit split_kernel_mapping(unsigned long addr, unsigned long end
pte_clear(_mm, addr, pte);
 }
 
-static void remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
+static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
 unsigned long end)
 {
unsigned long next;
@@ -825,7 +825,7 @@ static void remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
}
 }
 
-static void remove_pud_table(pud_t *pud_start, unsigned long addr,
+static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
 unsigned long end)
 {
unsigned long next;
diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
index f0551a2be9df..1e3674d7ea7b 100644
--- a/arch/powerpc/sysdev/xive/spapr.c
+++ b/arch/powerpc/sysdev/xive/spapr.c
@@ -768,7 +768,7 @@ static const u8 *get_vec5_feature(unsigned int index)
return vec5 + index;
 }
 
-static bool xive_spapr_disabled(void)
+static bool __init xive_spapr_disabled(void)
 {
const u8 *vec5_xive;
 
-- 
2.26.2



Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault

2020-07-29 Thread Michael Ellerman
Nicholas Piggin  writes:
> Excerpts from Linus Torvalds's message of July 29, 2020 5:02 am:
>> On Tue, Jul 28, 2020 at 3:53 AM Nicholas Piggin  wrote:
>>>
>>> The quirk is a problem with coprocessor where it's supposed to
>>> invalidate the translation after a fault but it doesn't, so we can get a
>>> read-only TLB stuck after something else does a RO->RW upgrade on the
>>> TLB. Something like that IIRC.  Coprocessors have their own MMU which
>>> lives in the nest not the core, so you need a global TLB flush to
>>> invalidate that thing.
>> 
>> So I assumed, but it does seem confused.
>> 
>> Why? Because if there are stale translations on the co-processor,
>> there's no guarantee that one of the CPU's will have them and take a
>> fault.
>> 
>> So I'm not seeing why a core CPU doing spurious TLB invalidation would
>> follow from "stale TLB in the Nest".
>
> If the nest MMU access faults, it sends an interrupt to the CPU and
> the driver tries to handle the page fault for it (I think that's how
> it works).

Yeah AFAIK. I think they all end up calling copro_handle_mm_fault().

Except for NX/vas, where the model is a fault just causes the request to
be dropped and sent back to userspace to fix things up.

cheers


Re: [PATCH v2] selftests: powerpc: Fix online CPU selection

2020-07-29 Thread Michael Ellerman
Sandipan Das  writes:
> The size of the CPU affinity mask must be large enough for
> systems with a very large number of CPUs. Otherwise, tests
> which try to determine the first online CPU by calling
> sched_getaffinity() will fail. This makes sure that the size
> of the allocated affinity mask is dependent on the number of
> CPUs as reported by get_nprocs().
>
> Fixes: 3752e453f6ba ("selftests/powerpc: Add tests of PMU EBBs")
> Reported-by: Shirisha Ganta 
> Signed-off-by: Sandipan Das 
> Reviewed-by: Kamalesh Babulal 
> ---
> Previous versions can be found at:
> v1: 
> https://lore.kernel.org/linuxppc-dev/20200608144212.985144-1-sandi...@linux.ibm.com/
>
> Changes in v2:
> - Added NULL check for the affinity mask as suggested by Kamalesh.
> - Changed "cpu set" to "CPU affinity mask" in the commit message.

This sometimes breaks, eg:

  # ./count_instructions 
  test: count_instructions
  tags: git_version:v5.8-rc2-327-g9a1d992a7eb7
  sched_getaffinity: Invalid argument
  [FAIL] Test FAILED on line 123
  failure: count_instructions


This system has a messed up SMT setup, but the old code was able to cope
with it:

  # ppc64_cpu --info
  Core   0:    0*    1*    2     3     4     5     6     7
  Core   1:    8     9    10*   11*   12    13    14    15
  Core   2:   16    17    18    19    20    21    22    23
  Core   3:   24    25    26    27    28    29    30    31
  Core   4:   32    33    34    35    36    37    38    39
  Core   5:   40    41    42    43    44    45    46    47
  Core   6:   48    49    50    51    52    53    54    55
  Core   7:   56    57    58    59    60    61    62    63
  Core   8:   64    65    66    67    68    69    70    71
  Core   9:   72    73    74    75    76    77    78    79
  Core  10:   80    81    82    83    84    85    86    87
  Core  11:   88    89    90    91    92    93    94    95
  Core  12:   96    97    98    99   100*  101*  102*  103*
  Core  13:  104*  105*  106*  107*  108*  109*  110*  111*
  Core  14:  112*  113*  114*  115*  116*  117*  118*  119*
  Core  15:  120   121   122   123   124   125   126   127
  Core  16:  128   129   130   131   132   133   134   135
  Core  17:  136   137   138   139   140   141   142   143
  Core  18:  144   145   146   147   148   149   150   151
  Core  19:  152   153   154   155   156   157   158   159

cheers


Re: ASMedia ASM2142 USB host controller tries to DMA to address zero when doing bulk reads from multiple devices

2020-07-29 Thread Oliver O'Halloran
On Tue, Jul 21, 2020 at 3:51 PM Forest Crossman  wrote:
>
> Hello, again!
>
> After fixing the issue in my previous thread using this patch[1], I
> decided to do some stress-testing of the controller to make sure it
> could handle my intended workloads and that there were no further DMA
> address issues that would need to be fixed. Unfortunately, it looks
> like there's still more work to be done: when I try to do long bulk
> reads from multiple devices simultaneously, eventually the host
> controller sends a DMA write to address zero, which then triggers EEH
> in my POWER9 system, causing the controller card to get hotplug-reset,
> which of course kills the disk-reading processes. For more details on
> the EEH errors, you can see my kernel's EEH message log[2].

Take the logged address with a grain of salt. If an error occurs while
translating the DMA address, the PHB logs all zeros as the "DMA
Address" because it only keeps around the bits that it needs to fetch
the next level of the TCE table. The EEH dump says the error is due to
a TCE permission mismatch, so odds are the ASMedia controller is writing
to an address that's already been DMA unmapped, hence the logged
address being zeros.

Sorry, I probably should have mentioned that quirk in the last mail.

> The results of the various tests I performed are listed below.
>
> Test results (all failures are due to DMA writes to address zero, all
> hubs are USB 3.0/3.1 Gen1 only, and all disks are accessed via the
> usb-storage driver):
> - Reading simultaneously from two or more disks behind a hub connected
> to one port on the host controller:
>   - FAIL after 20-50 GB of data transferred for each device.
> - Reading simultaneously from two disks, each connected directly to
> one port on the host controller:
>   - FAIL after about 800 GB of data transferred for each device.
> - Reading from one disk behind a hub connected to one port on the host
> controller:
>   - OK for at least 2.7 TB of data transferred (I didn't test the
> whole 8 TB disk).
> - Writing simultaneously to two FL2000 dongles (using osmo-fl2k's
> "fl2k_test"), each connected directly to one port on the host
> controller:
>   - OK, was able to write several dozen terabytes to each device over
> the course of a little over 21 hours.
>
> Seeing how simultaneous writes to multiple devices and reads from
> single devices both seem to work fine, I assume that means this is
> being caused by some race condition in the host controller firmware
> when it responds to multiple read requests.

Most likely. It's possible it's a platform specific race with DMA
map/unmap too, but I think we would be seeing similar issues with
other devices if it was.

> I also assume we're not
> going to be able to convince ASMedia to both fix the bug in their
> firmware and release the details on how to flash it from Linux, so I
> guess we'll just have to figure out how to make the driver talk to the
> controller in a way that avoids triggering the bad DMA write. As
> before, I decided to try a little kernel hacking of my own before
> sending this email, and tried separately enabling the
> XHCI_BROKEN_STREAMS and XHCI_ASMEDIA_MODIFY_FLOWCONTROL quirks in an
> attempt to fix this. As you might expect since you're reading this
> message, neither of those quirks fixed the issue, nor did they even
> make the transfers last any longer before failing.
>
> So now I've reached the limits of my understanding, and I need some
> help devising a fix. If anyone has any comments to that effect, or any
> questions about my hardware configuration, testing methodology, etc.,
> please don't hesitate to air them. Also, if anyone needs me to perform
> additional tests, or collect more log information, I'd be happy to do
> that as well.

I started writing a tool a while ago to use the internal trace bus to
log incoming TLPs. Something like that might allow you to get a better
idea what the faulting access pattern is, but you would still need to
find a way to mitigate the issue. I'm not all that familiar with USB3
so I'm not much help on that front.


Re: [PATCH 0/7] Optimization to improve cpu online/offline on Powerpc

2020-07-29 Thread Satheesh Rajendran
On Mon, Jul 27, 2020 at 01:25:25PM +0530, Srikar Dronamraju wrote:
> Anton reported that his 4096 cpu (1024 cores in a socket) was taking too
> long to boot. He also analyzed that most of the time was being spent on
> updating cpu_core_mask.
> 
> Here are some optimizations and fixes to make ppc64_cpu --smt=8/ppc64_cpu
> --smt=1 run faster and hence boot the kernel also faster.
> 
> Its based on top of my v4 coregroup support patchset.
> http://lore.kernel.org/lkml/20200727053230.19753-1-sri...@linux.vnet.ibm.com/t/#u
> 
> The first two patches should solve Anton's immediate problem.
> With the unofficial patches, Anton reported that the boot time came down
> from 30 mins to 6 seconds (basically a high core count in a single socket
> configuration). Satheesh also reported similar numbers.
> 
> The rest are simple cleanups/optimizations.
> 
> Since cpu_core_mask has been an exported symbol for a long time, let's
> retain it as a snapshot of cpumask_of_node.

boot tested on P9 KVM guest.

without this series:
# dmesg|grep smp
[0.066624] smp: Bringing up secondary CPUs ...
[  347.521264] smp: Brought up 1 node, 2048 CPUs

with this series:
# dmesg|grep smp
[0.067744] smp: Bringing up secondary CPUs ...
[5.416910] smp: Brought up 1 node, 2048 CPUs

Tested-by: Satheesh Rajendran 

Regards,
-Satheesh
> 
> Architecture:ppc64le
> Byte Order:  Little Endian
> CPU(s):  160
> On-line CPU(s) list: 0-159
> Thread(s) per core:  4
> Core(s) per socket:  20
> Socket(s):   2
> NUMA node(s):2
> Model:   2.2 (pvr 004e 1202)
> Model name:  POWER9, altivec supported
> CPU max MHz: 3800.
> CPU min MHz: 2166.
> L1d cache:   32K
> L1i cache:   32K
> L2 cache:   512K
> L3 cache:   10240K
> NUMA node0 CPU(s):   0-79
> NUMA node8 CPU(s):   80-159
> 
> without patch (powerpc/next)
> [0.099347] smp: Bringing up secondary CPUs ...
> [0.832513] smp: Brought up 2 nodes, 160 CPUs
> 
> with powerpc/next + coregroup support patchset
> [0.099241] smp: Bringing up secondary CPUs ...
> [0.835627] smp: Brought up 2 nodes, 160 CPUs
> 
> with powerpc/next + coregroup + this patchset
> [0.097232] smp: Bringing up secondary CPUs ...
> [0.528457] smp: Brought up 2 nodes, 160 CPUs
> 
> x ppc64_cpu --smt=1
> + ppc64_cpu --smt=4
> 
> without patch
> N    Min    Max    Median    Avg        Stddev
> x 100 11.82 17.06 14.01 14.05 1.2665247
> + 100 12.25 16.59 13.86   14.1143  1.164293
> 
> with patch
> N    Min    Max    Median    Avg        Stddev
> x 100 12.68 16.15  14.24     14.238     0.75489246
> + 100 12.93 15.85  14.35     14.2897    0.60041813
> 
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Anton Blanchard 
> Cc: Oliver O'Halloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Gautham R Shenoy 
> Cc: Satheesh Rajendran 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> 
> Srikar Dronamraju (7):
>   powerpc/topology: Update topology_core_cpumask
>   powerpc/smp: Stop updating cpu_core_mask
>   powerpc/smp: Remove get_physical_package_id
>   powerpc/smp: Optimize remove_cpu_from_masks
>   powerpc/smp: Limit cpus traversed to within a node.
>   powerpc/smp: Stop passing mask to update_mask_by_l2
>   powerpc/smp: Depend on cpu_l1_cache_map when adding cpus
> 
>  arch/powerpc/include/asm/smp.h  |  5 --
>  arch/powerpc/include/asm/topology.h |  7 +--
>  arch/powerpc/kernel/smp.c   | 79 +
>  3 files changed, 24 insertions(+), 67 deletions(-)
> 
> -- 
> 2.17.1
> 


Re: [PATCH] powerpc/powernv/pci: Fix build of pci-ioda.o

2020-07-29 Thread Gustavo Romero

Hi Oliver,

On 7/28/20 7:50 PM, Oliver O'Halloran wrote:

On Wed, Jul 29, 2020 at 8:35 AM Gustavo Romero  wrote:


Currently pnv_ioda_setup_bus_dma() sits outside of a CONFIG_IOMMU_API guard,
so with CONFIG_IOMMU_API=n the build can fail when the compiler sets
-Werror=unused-function, because pnv_ioda_setup_bus_dma() is only used in
functions guarded by CONFIG_IOMMU_API.

That issue can be easily reproduced using skiroot_defconfig. For other
configs, like powernv_defconfig, the issue is hidden by the fact that if
CONFIG_IOMMU_SUPPORT is enabled along with other common IOMMU options, like
CONFIG_OF_IOMMU, CONFIG_IOMMU_API is enabled by default as well. Hence, for
powernv_defconfig, it's necessary to set CONFIG_IOMMU_SUPPORT=n to make the
build fail: CONFIG_PCI=y keeps pci-ioda.c in the build, but with
CONFIG_IOMMU_SUPPORT=n, CONFIG_IOMMU_API is disabled, breaking the build.

This commit fixes that build issue by moving pnv_ioda_setup_bus_dma()
inside a CONFIG_IOMMU_API guard, so that when CONFIG_IOMMU_API is disabled
the function is not defined.


I think a fix for this is already in -next.


Indeed.

For the records, it's fixed in -next by:

commit e3417faec526cbf97773dca691dcd743f5bfeb64
Author: Oliver O'Halloran 
Date:   Sun Jul 5 23:35:57 2020 +1000

powerpc/powernv: Move pnv_ioda_setup_bus_dma under CONFIG_IOMMU_API

pnv_ioda_setup_bus_dma() is only used when a passed through PE is
returned to the host. If the kernel is built without IOMMU support
this is dead code. Move it under the #ifdef with the rest of the
IOMMU API support.

Reported-by: kernel test robot 
Signed-off-by: Oliver O'Halloran 
Reviewed-by: Alexey Kardashevskiy 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20200705133557.443607-2-ooh...@gmail.com


Thanks.


Cheers,
Gustavo


[PATCH v6 11/11] ppc64/kexec_file: enable early kernel's OPAL calls

2020-07-29 Thread Hari Bathini
A kernel built with CONFIG_PPC_EARLY_DEBUG_OPAL enabled expects r8 & r9
to be filled with the OPAL base & entry addresses respectively. Setting
these registers allows the kernel to perform OPAL calls before the
device tree is parsed.

Signed-off-by: Hari Bathini 
Reviewed-by: Thiago Jung Bauermann 
---

v5 -> v6:
* Added Reviewed-by tag from Thiago.
* Moved the patch to end of the series for mpe to take a call on whether
  to have it or not.

v4 -> v5:
* New patch. Updated opal_base & opal_entry values in r8 & r9 respectively.
  This change was part of the below dropped patch in v4:
- https://lore.kernel.org/patchwork/patch/1275667/


 arch/powerpc/kexec/file_load_64.c  |   20 
 arch/powerpc/purgatory/trampoline_64.S |   16 
 2 files changed, 36 insertions(+)

diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index c6a37ad5a0a4..53bb71e3a2e1 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -876,6 +876,7 @@ int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
  const void *fdt, unsigned long kernel_load_addr,
  unsigned long fdt_load_addr)
 {
+   struct device_node *dn = NULL;
int ret;
 
ret = setup_purgatory(image, slave_code, fdt, kernel_load_addr,
@@ -903,9 +904,28 @@ int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
 &image->arch.backup_start,
 sizeof(image->arch.backup_start),
 false);
+   if (ret)
+   goto out;
+
+   /* Setup OPAL base & entry values */
+   dn = of_find_node_by_path("/ibm,opal");
+   if (dn) {
+   u64 val;
+
+   of_property_read_u64(dn, "opal-base-address", &val);
+   ret = kexec_purgatory_get_set_symbol(image, "opal_base", &val,
+sizeof(val), false);
+   if (ret)
+   goto out;
+
+   of_property_read_u64(dn, "opal-entry-address", &val);
+   ret = kexec_purgatory_get_set_symbol(image, "opal_entry", &val,
+sizeof(val), false);
+   }
 out:
if (ret)
pr_err("Failed to setup purgatory symbols");
+   of_node_put(dn);
return ret;
 }
 
diff --git a/arch/powerpc/purgatory/trampoline_64.S b/arch/powerpc/purgatory/trampoline_64.S
index e79077ff1355..e6a6e7e6dfe4 100644
--- a/arch/powerpc/purgatory/trampoline_64.S
+++ b/arch/powerpc/purgatory/trampoline_64.S
@@ -87,6 +87,10 @@ master:
li  %r4,28
	STWX_BE	%r17,%r3,%r4	/* Store my cpu as __be32 at byte 28 */
 1:
+   /* Load opal base and entry values in r8 & r9 respectively */
+   ld  %r8,(opal_base - 0b)(%r18)
+   ld  %r9,(opal_entry - 0b)(%r18)
+
/* load the kernel address */
ld  %r4,(kernel - 0b)(%r18)
 
@@ -133,6 +137,18 @@ backup_start:
.8byte  0x0
.size backup_start, . - backup_start
 
+   .balign 8
+   .globl opal_base
+opal_base:
+   .8byte  0x0
+   .size opal_base, . - opal_base
+
+   .balign 8
+   .globl opal_entry
+opal_entry:
+   .8byte  0x0
+   .size opal_entry, . - opal_entry
+
.data
.balign 8
 .globl purgatory_sha256_digest




[PATCH v6 10/11] ppc64/kexec_file: fix kexec load failure with lack of memory hole

2020-07-29 Thread Hari Bathini
The kexec purgatory has to run in real mode. Only the first memory
block may be accessible in real mode. And, unlike the case with the
panic kernel, no memory is set aside for a regular kexec load. Another
thing to note is that the memory for crashkernel is reserved at an
offset of 128MB. So, when crashkernel memory is reserved, the memory
ranges available to load kexec segments shrink further, as the generic
code only looks for memblock free memory ranges and in all likelihood
only a tiny bit of memory from 0 to 128MB would be available to load
kexec segments.

With kdump being used by default in general, kexec file load is likely
to fail almost always. This can be fixed by changing the memory hole
lookup logic for regular kexec to use the same method as kdump. This
would mean that most kexec segments will overlap with the crashkernel
memory region. That should still be ok, as the pages whose destination
address isn't available while loading are placed in an intermediate
location till a flush to the actual destination address happens during
the kexec boot sequence.
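
(For readability, the clamping added by the hunk below is just a min/max;
an equivalent formulation would be:)

	if (kbuf->image->type == KEXEC_TYPE_CRASH) {
		buf_min = max_t(u64, kbuf->buf_min, crashk_res.start);
		buf_max = min_t(u64, kbuf->buf_max, crashk_res.end);
	}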

Signed-off-by: Hari Bathini 
Tested-by: Pingfan Liu 
Reviewed-by: Thiago Jung Bauermann 
---

v5 -> v6:
* Unchanged.

v4 -> v5:
* Unchanged.

v3 -> v4:
* Unchanged. Added Reviewed-by tag from Thiago.

v2 -> v3:
* Unchanged. Added Tested-by tag from Pingfan.

v1 -> v2:
* New patch to fix locating memory hole for kexec_file_load (kexec -s -l)
  when memory is reserved for crashkernel.


 arch/powerpc/kexec/file_load_64.c |   33 ++---
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index f13c5b8399e1..c6a37ad5a0a4 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -1012,13 +1012,6 @@ int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
u64 buf_min, buf_max;
int ret;
 
-   /*
-* Use the generic kexec_locate_mem_hole for regular
-* kexec_file_load syscall
-*/
-   if (kbuf->image->type != KEXEC_TYPE_CRASH)
-   return kexec_locate_mem_hole(kbuf);
-
/* Look up the exclude ranges list while locating the memory hole */
emem = &(kbuf->image->arch.exclude_ranges);
if (!(*emem) || ((*emem)->nr_ranges == 0)) {
@@ -1026,11 +1019,15 @@ int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
return kexec_locate_mem_hole(kbuf);
}
 
+   buf_min = kbuf->buf_min;
+   buf_max = kbuf->buf_max;
/* Segments for kdump kernel should be within crashkernel region */
-   buf_min = (kbuf->buf_min < crashk_res.start ?
-  crashk_res.start : kbuf->buf_min);
-   buf_max = (kbuf->buf_max > crashk_res.end ?
-  crashk_res.end : kbuf->buf_max);
+   if (kbuf->image->type == KEXEC_TYPE_CRASH) {
+   buf_min = (buf_min < crashk_res.start ?
+  crashk_res.start : buf_min);
+   buf_max = (buf_max > crashk_res.end ?
+  crashk_res.end : buf_max);
+   }
 
if (buf_min > buf_max) {
pr_err("Invalid buffer min and/or max values\n");
@@ -1067,15 +1064,13 @@ int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
 int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
  unsigned long buf_len)
 {
-   if (image->type == KEXEC_TYPE_CRASH) {
-   int ret;
+   int ret;
 
-   /* Get exclude memory ranges needed for setting up kdump segments */
-   ret = get_exclude_memory_ranges(&(image->arch.exclude_ranges));
-   if (ret) {
-   pr_err("Failed to setup exclude memory ranges for 
buffer lookup\n");
-   return ret;
-   }
+   /* Get exclude memory ranges needed for setting up kexec segments */
+   ret = get_exclude_memory_ranges(&(image->arch.exclude_ranges));
+   if (ret) {
+   pr_err("Failed to setup exclude memory ranges for buffer 
lookup\n");
+   return ret;
}
 
return kexec_image_probe_default(image, buf, buf_len);




[PATCH v6 09/11] ppc64/kexec_file: add appropriate regions for memory reserve map

2020-07-29 Thread Hari Bathini
While the initrd, elfcorehdr and backup regions are already added to the
reserve map, there are a few regions still missing from the memory
reserve map. Add them here. And now that all the changes to load the
panic kernel are in place, declare support for it.

Signed-off-by: Hari Bathini 
Tested-by: Pingfan Liu 
Reviewed-by: Thiago Jung Bauermann 
---

v5 -> v6:
* Unchanged.

v4 -> v5:
* Unchanged.

v3 -> v4:
* Fixed a spellcheck and added Reviewed-by tag from Thiago.

v2 -> v3:
* Unchanged. Added Tested-by tag from Pingfan.

v1 -> v2:
* Updated add_rtas_mem_range() & add_opal_mem_range() callsites based on
  the new prototype for these functions.


 arch/powerpc/kexec/file_load_64.c |   58 ++---
 1 file changed, 53 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index 0d280d097cd6..f13c5b8399e1 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -205,6 +205,34 @@ static int get_crash_memory_ranges(struct crash_mem **mem_ranges)
return ret;
 }
 
+/**
+ * get_reserved_memory_ranges - Get reserve memory ranges. This list includes
+ *  memory regions that should be added to the
+ *  memory reserve map to ensure the region is
+ *  protected from any mischief.
+ * @mem_ranges: Range list to add the memory ranges to.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int get_reserved_memory_ranges(struct crash_mem **mem_ranges)
+{
+   int ret;
+
+   ret = add_rtas_mem_range(mem_ranges);
+   if (ret)
+   goto out;
+
+   ret = add_tce_mem_ranges(mem_ranges);
+   if (ret)
+   goto out;
+
+   ret = add_reserved_mem_ranges(mem_ranges);
+out:
+   if (ret)
+   pr_err("Failed to setup reserved memory ranges\n");
+   return ret;
+}
+
 /**
  * __locate_mem_hole_top_down - Looks top down for a large enough memory hole
  *  in the memory regions between buf_min & buf_max
@@ -897,8 +925,8 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
unsigned long initrd_load_addr,
unsigned long initrd_len, const char *cmdline)
 {
-   struct crash_mem *umem = NULL;
-   int ret;
+   struct crash_mem *umem = NULL, *rmem = NULL;
+   int i, nr_ranges, ret;
 
ret = setup_new_fdt(image, fdt, initrd_load_addr, initrd_len, cmdline);
if (ret)
@@ -941,7 +969,27 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
}
}
 
+   /* Update memory reserve map */
+   ret = get_reserved_memory_ranges(&rmem);
+   if (ret)
+   goto out;
+
+   nr_ranges = rmem ? rmem->nr_ranges : 0;
+   for (i = 0; i < nr_ranges; i++) {
+   u64 base, size;
+
+   base = rmem->ranges[i].start;
+   size = rmem->ranges[i].end - base + 1;
+   ret = fdt_add_mem_rsv(fdt, base, size);
+   if (ret) {
+   pr_err("Error updating memory reserve map: %s\n",
+  fdt_strerror(ret));
+   goto out;
+   }
+   }
+
 out:
+   kfree(rmem);
kfree(umem);
return ret;
 }
@@ -1024,10 +1072,10 @@ int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
 
 	/* Get exclude memory ranges needed for setting up kdump segments */
ret = get_exclude_memory_ranges(&(image->arch.exclude_ranges));
-   if (ret)
+   if (ret) {
pr_err("Failed to setup exclude memory ranges for 
buffer lookup\n");
-   /* Return this until all changes for panic kernel are in */
-   return -EOPNOTSUPP;
+   return ret;
+   }
}
 
return kexec_image_probe_default(image, buf, buf_len);




[PATCH v6 08/11] ppc64/kexec_file: prepare elfcore header for crashing kernel

2020-07-29 Thread Hari Bathini
Prepare ELF headers for the crashing kernel's core file using
crash_prepare_elf64_headers() and pass this info on to the kdump
kernel by updating its command line with the elfcorehdr parameter.
Also, add the elfcorehdr location to the reserve map to prevent it
from being stomped on while booting.

Signed-off-by: Hari Bathini 
Tested-by: Pingfan Liu 
Reviewed-by: Thiago Jung Bauermann 
---

v5 -> v6:
* Unchanged.

v4 -> v5:
* Unchanged. Added Reviewed-by tag from Thiago.

v3 -> v4:
* Added a FIXME tag to indicate issue in adding opal/rtas regions to
  core image.
* Folded prepare_elf_headers() function into load_elfcorehdr_segment().

v2 -> v3:
* Unchanged. Added Tested-by tag from Pingfan.

v1 -> v2:
* Tried merging adjacent memory ranges on hitting maximum ranges limit
  to reduce reallocations for memory ranges and also, minimize PT_LOAD
  segments for elfcore.
* Updated add_rtas_mem_range() & add_opal_mem_range() callsites based on
  the new prototype for these functions.


 arch/powerpc/include/asm/kexec.h  |6 +
 arch/powerpc/kexec/elf_64.c   |   12 +++
 arch/powerpc/kexec/file_load.c|   49 +++
 arch/powerpc/kexec/file_load_64.c |  165 +
 4 files changed, 232 insertions(+)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index f9514ebeffaa..fe885bc3127e 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -108,12 +108,18 @@ struct kimage_arch {
unsigned long backup_start;
void *backup_buf;
 
+   unsigned long elfcorehdr_addr;
+   unsigned long elf_headers_sz;
+   void *elf_headers;
+
 #ifdef CONFIG_IMA_KEXEC
phys_addr_t ima_buffer_addr;
size_t ima_buffer_size;
 #endif
 };
 
+char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
+ unsigned long cmdline_len);
 int setup_purgatory(struct kimage *image, const void *slave_code,
const void *fdt, unsigned long kernel_load_addr,
unsigned long fdt_load_addr);
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index 76e2fc7e6dc3..d0e459bb2f05 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -35,6 +35,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
void *fdt;
const void *slave_code;
struct elfhdr ehdr;
+   char *modified_cmdline = NULL;
struct kexec_elf_info elf_info;
struct kexec_buf kbuf = { .image = image, .buf_min = 0,
  .buf_max = ppc64_rma_size };
@@ -75,6 +76,16 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
pr_err("Failed to load kdump kernel segments\n");
goto out;
}
+
+   /* Setup cmdline for kdump kernel case */
+   modified_cmdline = setup_kdump_cmdline(image, cmdline,
+  cmdline_len);
+   if (!modified_cmdline) {
+   pr_err("Setting up cmdline for kdump kernel failed\n");
+   ret = -EINVAL;
+   goto out;
+   }
+   cmdline = modified_cmdline;
}
 
if (initrd != NULL) {
@@ -131,6 +142,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
pr_err("Error setting up the purgatory.\n");
 
 out:
+   kfree(modified_cmdline);
 	kexec_free_elf_info(&elf_info);
 
/* Make kimage_file_post_load_cleanup free the fdt buffer for us. */
diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c
index 38439aba27d7..d52c09729edd 100644
--- a/arch/powerpc/kexec/file_load.c
+++ b/arch/powerpc/kexec/file_load.c
@@ -18,10 +18,45 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define SLAVE_CODE_SIZE	256	/* First 0x100 bytes */
 
+/**
+ * setup_kdump_cmdline - Prepend "elfcorehdr= " to command line
+ *   of kdump kernel for exporting the core.
+ * @image:   Kexec image
+ * @cmdline: Command line parameters to update.
+ * @cmdline_len: Length of the cmdline parameters.
+ *
+ * kdump segment must be setup before calling this function.
+ *
+ * Returns new cmdline buffer for kdump kernel on success, NULL otherwise.
+ */
+char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
+ unsigned long cmdline_len)
+{
+   int elfcorehdr_strlen;
+   char *cmdline_ptr;
+
+   cmdline_ptr = kzalloc(COMMAND_LINE_SIZE, GFP_KERNEL);
+   if (!cmdline_ptr)
+   return NULL;
+
+   elfcorehdr_strlen = sprintf(cmdline_ptr, "elfcorehdr=0x%lx ",
+   image->arch.elfcorehdr_addr);
+
+   if (elfcorehdr_strlen + cmdline_len > COMMAND_LINE_SIZE) {
+   pr_err("Appending elfcorehdr= exceeds cmdline size\n");
+   

[PATCH v6 07/11] ppc64/kexec_file: setup backup region for kdump kernel

2020-07-29 Thread Hari Bathini
Though the kdump kernel boots from the loaded address, the first 64KB
of it is copied down to real 0. So, set up a backup region and let
purgatory copy the first 64KB of the crashed kernel into this backup
region before booting into the kdump kernel. Update the reserve map
with the backup region and the crashed kernel's memory to prevent the
kdump kernel from accidentally using that memory.
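
(For reference, the copy done by purgatory boils down to the following C
equivalent; this is a sketch only, the real implementation below is in
assembler in trampoline_64.S:)

	/* backup_start is the destination filled in at kexec load time */
	memcpy((void *)backup_start, (void *)BACKUP_SRC_START, BACKUP_SRC_SIZE);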

Signed-off-by: Hari Bathini 
Reviewed-by: Thiago Jung Bauermann 
---

v5 -> v6:
* Added Reviewed-by tag from Thiago.
* The comment explaining why a source buffer is needed for backup segment
  is moved to appropriate place.
* Used the special branching instruction mpe suggested instead of "bl 0f"
* Added local labels & space between arguments in assembler code.

v4 -> v5:
* Did not add Reviewed-by tag from Thiago yet as he might want to reconsider
  it with the changes in this patch.
* Wrote backup region copy code in assembler. Also, dropped the patch that
  applies RELA relocations & the patch that sets up stack as they are no
  longer needed.
* For correctness, updated fdt_add_mem_rsv() to take "BACKUP_SRC_END + 1"
  as start address instead of BACKUP_SRC_SIZE.

v3 -> v4:
* Moved fdt_add_mem_rsv() for backup region under kdump flag, on Thiago's
  suggestion, as it is only relevant for kdump.

v2 -> v3:
* Dropped check for backup_start in trampoline_64.S as purgatory() takes
  care of it anyway.

v1 -> v2:
* Check if backup region is available before branching out. This is
  to keep `kexec -l -s` flow as before as much as possible. This would
  eventually change with more testing and addition of sha256 digest
  verification support.
* Fixed missing prototype for purgatory() as reported by lkp.
  lkp report for reference:
- https://lore.kernel.org/patchwork/patch/1264423/


 arch/powerpc/include/asm/crashdump-ppc64.h |   19 ++
 arch/powerpc/include/asm/kexec.h   |7 ++
 arch/powerpc/kexec/elf_64.c|9 +++
 arch/powerpc/kexec/file_load_64.c  |   93 +++-
 arch/powerpc/purgatory/trampoline_64.S |   38 ++-
 5 files changed, 159 insertions(+), 7 deletions(-)
 create mode 100644 arch/powerpc/include/asm/crashdump-ppc64.h

diff --git a/arch/powerpc/include/asm/crashdump-ppc64.h b/arch/powerpc/include/asm/crashdump-ppc64.h
new file mode 100644
index ..68d9717cc5ee
--- /dev/null
+++ b/arch/powerpc/include/asm/crashdump-ppc64.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_POWERPC_CRASHDUMP_PPC64_H
+#define _ASM_POWERPC_CRASHDUMP_PPC64_H
+
+/*
+ * Backup region - first 64KB of System RAM
+ *
+ * If ever the below macros are to be changed, please be judicious.
+ * The implicit assumptions are:
+ * - start, end & size are less than UINT32_MAX.
+ * - start & size are at least 8 byte aligned.
+ *
+ * For implementation details: arch/powerpc/purgatory/trampoline_64.S
+ */
+#define BACKUP_SRC_START   0
+#define BACKUP_SRC_END 0xffff
+#define BACKUP_SRC_SIZE(BACKUP_SRC_END - BACKUP_SRC_START + 1)
+
+#endif /* __ASM_POWERPC_CRASHDUMP_PPC64_H */
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 835dc92e091c..f9514ebeffaa 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -105,6 +105,9 @@ extern const struct kexec_file_ops kexec_elf64_ops;
 struct kimage_arch {
struct crash_mem *exclude_ranges;
 
+   unsigned long backup_start;
+   void *backup_buf;
+
 #ifdef CONFIG_IMA_KEXEC
phys_addr_t ima_buffer_addr;
size_t ima_buffer_size;
@@ -120,6 +123,10 @@ int setup_new_fdt(const struct kimage *image, void *fdt,
 int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
 
 #ifdef CONFIG_PPC64
+struct kexec_buf;
+
+int load_crashdump_segments_ppc64(struct kimage *image,
+ struct kexec_buf *kbuf);
 int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
  const void *fdt, unsigned long kernel_load_addr,
  unsigned long fdt_load_addr);
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index 64c15a5a280b..76e2fc7e6dc3 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -68,6 +68,15 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
 
pr_debug("Loaded purgatory at 0x%lx\n", pbuf.mem);
 
+   /* Load additional segments needed for panic kernel */
+   if (image->type == KEXEC_TYPE_CRASH) {
+   ret = load_crashdump_segments_ppc64(image, &kbuf);
+   if (ret) {
+   pr_err("Failed to load kdump kernel segments\n");
+   goto out;
+   }
+   }
+
if (initrd != NULL) {
kbuf.buffer = initrd;
kbuf.bufsz = kbuf.memsz = initrd_len;
diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index 

[PATCH v6 06/11] ppc64/kexec_file: restrict memory usage of kdump kernel

2020-07-29 Thread Hari Bathini
The kdump kernel, used for capturing the kernel core image, is supposed
to use only specific memory regions to avoid corrupting the image to
be captured. Those regions are the crashkernel range (the memory
reserved explicitly for the kdump kernel), the memory used for the
tce-table, and the OPAL and RTAS regions as applicable. Restrict kdump
kernel memory to use only these regions by setting up the usable-memory
DT property. Also, tell the kdump kernel to run at the loaded address
by setting the magic word at 0x5c (the kernel's __run_at_load flag,
which lives in the first 256 bytes of the image).

Signed-off-by: Hari Bathini 
Tested-by: Pingfan Liu 
Reviewed-by: Thiago Jung Bauermann 
---

v5 -> v6:
* Added Reviewed-by tag from Thiago.
* Avoided pass by reference count parameter in add_usable_mem() function
  by calculating the range count added from index value before & after it.
* Instead of trying to reinvent the wheel with get_node_path() &
  get_node_path_size() functions, used %pOF format as suggested by mpe.
* Used kernel types instead of uint32_t/uint64_t.
* Dropped the 'struct crash_mem *' member & added 'struct crash_mem_range *',
  nr_ranges & max_entries fields to 'struct umem_info' to avoid a bit of
  clutter in check_realloc_usable_mem() & add_usable_mem() functions.
* Updated the comment explaining why memory from 0 till crashk_res.start
  needs to be added to the usable memory ranges. Note that kexec-tools
  has been doing the same thing.

v4 -> v5:
* Renamed get_node_pathlen() function to get_node_path_size() and
  handled root node separately to avoid off-by-one error in
  calculating string size.
* Updated get_node_path() in line with change in get_node_path_size().

v3 -> v4:
* Updated get_node_path() to be an iterative function instead of a
  recursive one.
* Added comment explaining why low memory is added to kdump kernel's
  usable memory ranges though it doesn't fall in crashkernel region.
* For correctness, added fdt_add_mem_rsv() for the low memory being
  added to kdump kernel's usable memory ranges.
* Fixed prop pointer update in add_usable_mem_property() and changed
  duple to tuple as suggested by Thiago.

v2 -> v3:
* Unchanged. Added Tested-by tag from Pingfan.

v1 -> v2:
* Fixed off-by-one error while setting up usable-memory properties.
* Updated add_rtas_mem_range() & add_opal_mem_range() callsites based on
  the new prototype for these functions.


 arch/powerpc/kexec/file_load_64.c |  386 +
 1 file changed, 385 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index d09c7724efa8..f94660874765 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -17,9 +17,23 @@
 #include 
 #include 
 #include 
+#include 
 #include 
+#include 
+#include 
 #include 
 
+struct umem_info {
+   u64 *buf;   /* data buffer for usable-memory property */
+   u32 size;   /* size allocated for the data buffer */
+   u32 max_entries;/* maximum no. of entries */
+   u32 idx;/* index of current entry */
+
+   /* usable memory ranges to look up */
+   unsigned int nr_ranges;
+   const struct crash_mem_range *ranges;
+};
+
 const struct kexec_file_ops * const kexec_file_loaders[] = {
 	&kexec_elf64_ops,
NULL
@@ -74,6 +88,44 @@ static int get_exclude_memory_ranges(struct crash_mem **mem_ranges)
return ret;
 }
 
+/**
+ * get_usable_memory_ranges - Get usable memory ranges. This list includes
+ *regions like crashkernel, opal/rtas & tce-table,
+ *that kdump kernel could use.
+ * @mem_ranges:   Range list to add the memory ranges to.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int get_usable_memory_ranges(struct crash_mem **mem_ranges)
+{
+   int ret;
+
+   /*
+* Early boot failure observed on guests when low memory (first memory
+* block?) is not added to usable memory. So, add [0, crashk_res.end]
+* instead of [crashk_res.start, crashk_res.end] to workaround it.
+* Also, crashed kernel's memory must be added to reserve map to
+* avoid kdump kernel from using it.
+*/
+   ret = add_mem_range(mem_ranges, 0, crashk_res.end + 1);
+   if (ret)
+   goto out;
+
+   ret = add_rtas_mem_range(mem_ranges);
+   if (ret)
+   goto out;
+
+   ret = add_opal_mem_range(mem_ranges);
+   if (ret)
+   goto out;
+
+   ret = add_tce_mem_ranges(mem_ranges);
+out:
+   if (ret)
+   pr_err("Failed to setup usable memory ranges\n");
+   return ret;
+}
+
 /**
  * __locate_mem_hole_top_down - Looks top down for a large enough memory hole
  *  in the memory regions between buf_min & buf_max
@@ -273,6 +325,286 @@ static int locate_mem_hole_bottom_up_ppc64(struct kexec_buf *kbuf,
return ret;
 }
 
+/**
+ * check_realloc_usable_mem - Reallocate buffer if it 

Re: [PATCH 05/15] h8300, nds32, openrisc: simplify detection of memory extents

2020-07-29 Thread Stafford Horne
On Tue, Jul 28, 2020 at 08:11:43AM +0300, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> Instead of traversing memblock.memory regions to find memory_start and
> memory_end, simply query memblock_{start,end}_of_DRAM().
> 
> Signed-off-by: Mike Rapoport 
> ---
>  arch/h8300/kernel/setup.c| 8 +++-
>  arch/nds32/kernel/setup.c| 8 ++--
>  arch/openrisc/kernel/setup.c | 9 ++---
>  3 files changed, 7 insertions(+), 18 deletions(-)

Hi Mike,

For the openrisc part:

Acked-by: Stafford Horne 

> --- a/arch/openrisc/kernel/setup.c
> +++ b/arch/openrisc/kernel/setup.c
> @@ -48,17 +48,12 @@ static void __init setup_memory(void)
>   unsigned long ram_start_pfn;
>   unsigned long ram_end_pfn;
>   phys_addr_t memory_start, memory_end;
> - struct memblock_region *region;
>  
>   memory_end = memory_start = 0;
>  
>   /* Find main memory where is the kernel, we assume its the only one */
> - for_each_memblock(memory, region) {
> - memory_start = region->base;
> - memory_end = region->base + region->size;
> - printk(KERN_INFO "%s: Memory: 0x%x-0x%x\n", __func__,
> -memory_start, memory_end);
> - }
> + memory_start = memblock_start_of_DRAM();
> + memory_end = memblock_end_of_DRAM();
>  
>   if (!memory_end) {
>   panic("No memory!");
> -- 
> 2.26.2
> 


[PATCH v6 05/11] powerpc/drmem: make lmb walk a bit more flexible

2020-07-29 Thread Hari Bathini
Currently, numa & prom are the users of the drmem LMB walk code. Loading
kdump with kexec_file also needs to walk the drmem LMBs to set up the
usable memory ranges for the kdump kernel. But there are a couple of
issues in using the code as is. One, walk_drmem_lmb() is currently
built into the .init section, while kexec_file needs it later. Two,
there is no scope to pass data to the callback function for processing
and/or erroring out on certain conditions.

Fix that by moving the drmem LMB walk code out of the .init section,
adding scope to pass data to the callback function, and bailing out
when an error is encountered in the callback function.
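
As an illustration of the new interface, a hypothetical caller (names
invented here) can now pass private data to the callback and abort the
walk by returning non-zero:

	static int count_assigned_lmbs(struct drmem_lmb *lmb,
				       const __be32 **usm, void *data)
	{
		unsigned int *count = data;

		if (lmb->flags & DRCONF_MEM_ASSIGNED)
			(*count)++;

		return 0;	/* a non-zero return stops the walk */
	}

	...
		unsigned int count = 0;

		ret = walk_drmem_lmbs(dn, &count, count_assigned_lmbs);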

Signed-off-by: Hari Bathini 
Tested-by: Pingfan Liu 
Reviewed-by: Thiago Jung Bauermann 
---

v5 -> v6:
* Unchanged.

v4 -> v5:
* Unchanged.

v3 -> v4:
* Unchanged. Added Reviewed-by tag from Thiago.

v2 -> v3:
* Unchanged. Added Tested-by tag from Pingfan.

v1 -> v2:
* No changes.


 arch/powerpc/include/asm/drmem.h |9 ++--
 arch/powerpc/kernel/prom.c   |   13 +++---
 arch/powerpc/mm/drmem.c  |   87 +-
 arch/powerpc/mm/numa.c   |   13 +++---
 4 files changed, 78 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/include/asm/drmem.h b/arch/powerpc/include/asm/drmem.h
index 414d209f45bb..17ccc6474ab6 100644
--- a/arch/powerpc/include/asm/drmem.h
+++ b/arch/powerpc/include/asm/drmem.h
@@ -90,13 +90,14 @@ static inline bool drmem_lmb_reserved(struct drmem_lmb *lmb)
 }
 
 u64 drmem_lmb_memory_max(void);
-void __init walk_drmem_lmbs(struct device_node *dn,
-   void (*func)(struct drmem_lmb *, const __be32 **));
+int walk_drmem_lmbs(struct device_node *dn, void *data,
+   int (*func)(struct drmem_lmb *, const __be32 **, void *));
 int drmem_update_dt(void);
 
 #ifdef CONFIG_PPC_PSERIES
-void __init walk_drmem_lmbs_early(unsigned long node,
-   void (*func)(struct drmem_lmb *, const __be32 **));
+int __init
+walk_drmem_lmbs_early(unsigned long node, void *data,
+ int (*func)(struct drmem_lmb *, const __be32 **, void *));
 #endif
 
 static inline void invalidate_lmb_associativity_index(struct drmem_lmb *lmb)
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 9cc49f265c86..7df78de378b0 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -468,8 +468,9 @@ static bool validate_mem_limit(u64 base, u64 *size)
  * This contains a list of memory blocks along with NUMA affinity
  * information.
  */
-static void __init early_init_drmem_lmb(struct drmem_lmb *lmb,
-   const __be32 **usm)
+static int  __init early_init_drmem_lmb(struct drmem_lmb *lmb,
+   const __be32 **usm,
+   void *data)
 {
u64 base, size;
int is_kexec_kdump = 0, rngs;
@@ -484,7 +485,7 @@ static void __init early_init_drmem_lmb(struct drmem_lmb *lmb,
 */
if ((lmb->flags & DRCONF_MEM_RESERVED) ||
!(lmb->flags & DRCONF_MEM_ASSIGNED))
-   return;
+   return 0;
 
if (*usm)
is_kexec_kdump = 1;
@@ -499,7 +500,7 @@ static void __init early_init_drmem_lmb(struct drmem_lmb *lmb,
 */
rngs = dt_mem_next_cell(dt_root_size_cells, usm);
if (!rngs) /* there are no (base, size) duple */
-   return;
+   return 0;
}
 
do {
@@ -524,6 +525,8 @@ static void __init early_init_drmem_lmb(struct drmem_lmb *lmb,
if (lmb->flags & DRCONF_MEM_HOTREMOVABLE)
memblock_mark_hotplug(base, size);
} while (--rngs);
+
+   return 0;
 }
 #endif /* CONFIG_PPC_PSERIES */
 
@@ -534,7 +537,7 @@ static int __init early_init_dt_scan_memory_ppc(unsigned long node,
 #ifdef CONFIG_PPC_PSERIES
if (depth == 1 &&
strcmp(uname, "ibm,dynamic-reconfiguration-memory") == 0) {
-   walk_drmem_lmbs_early(node, early_init_drmem_lmb);
+   walk_drmem_lmbs_early(node, NULL, early_init_drmem_lmb);
return 0;
}
 #endif
diff --git a/arch/powerpc/mm/drmem.c b/arch/powerpc/mm/drmem.c
index 59327cefbc6a..b2eeea39684c 100644
--- a/arch/powerpc/mm/drmem.c
+++ b/arch/powerpc/mm/drmem.c
@@ -14,6 +14,8 @@
 #include 
 #include 
 
+static int n_root_addr_cells, n_root_size_cells;
+
 static struct drmem_lmb_info __drmem_info;
 struct drmem_lmb_info *drmem_info = &__drmem_info;
 
@@ -189,12 +191,13 @@ int drmem_update_dt(void)
return rc;
 }
 
-static void __init read_drconf_v1_cell(struct drmem_lmb *lmb,
+static void read_drconf_v1_cell(struct drmem_lmb *lmb,
   const __be32 **prop)
 {
const __be32 *p = *prop;
 
-   lmb->base_addr = dt_mem_next_cell(dt_root_addr_cells, &p);
+   lmb->base_addr = of_read_number(p, n_root_addr_cells);
+  

[PATCH v6 04/11] ppc64/kexec_file: avoid stomping memory used by special regions

2020-07-29 Thread Hari Bathini
The crashkernel region could overlap with special memory regions like
opal, rtas, tce-table & such. These regions are referred to as exclude
memory ranges. Set up these ranges during image probe in order to avoid
them while finding the buffer for different kdump segments. Override
arch_kexec_locate_mem_hole() to locate a memory hole taking these
ranges into account.

Signed-off-by: Hari Bathini 
Reviewed-by: Thiago Jung Bauermann 
---

v5 -> v6:
* Implemented all the add_foo_mem_ranges() functions that get used while
  setting up exclude memory ranges.

v4 -> v5:
* Unchanged. Added Reviewed-by tag from Thiago.

v3 -> v4:
* Dropped KDUMP_BUF_MIN & KDUMP_BUF_MAX macros and fixed off-by-one error
  in arch_locate_mem_hole() helper routines.

v2 -> v3:
* If there are no exclude ranges, the right thing to do is fallbacking
  back to default kexec_locate_mem_hole() implementation instead of
  returning 0. Fixed that.

v1 -> v2:
* Did arch_kexec_locate_mem_hole() override to handle special regions.
* Ensured holes in the memory are accounted for while locating mem hole.
* Updated add_rtas_mem_range() & add_opal_mem_range() callsites based on
  the new prototype for these functions.


 arch/powerpc/include/asm/kexec.h|7 -
 arch/powerpc/include/asm/kexec_ranges.h |   14 +
 arch/powerpc/kexec/elf_64.c |8 +
 arch/powerpc/kexec/file_load_64.c   |  337 +++
 arch/powerpc/kexec/ranges.c |  177 
 5 files changed, 539 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index ac8fd4839171..835dc92e091c 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -100,14 +100,16 @@ void relocate_new_kernel(unsigned long indirection_page, unsigned long reboot_co
 #ifdef CONFIG_KEXEC_FILE
 extern const struct kexec_file_ops kexec_elf64_ops;
 
-#ifdef CONFIG_IMA_KEXEC
 #define ARCH_HAS_KIMAGE_ARCH
 
 struct kimage_arch {
+   struct crash_mem *exclude_ranges;
+
+#ifdef CONFIG_IMA_KEXEC
phys_addr_t ima_buffer_addr;
size_t ima_buffer_size;
-};
 #endif
+};
 
 int setup_purgatory(struct kimage *image, const void *slave_code,
const void *fdt, unsigned long kernel_load_addr,
@@ -125,6 +127,7 @@ int setup_new_fdt_ppc64(const struct kimage *image, void 
*fdt,
unsigned long initrd_load_addr,
unsigned long initrd_len, const char *cmdline);
 #endif /* CONFIG_PPC64 */
+
 #endif /* CONFIG_KEXEC_FILE */
 
 #else /* !CONFIG_KEXEC_CORE */
diff --git a/arch/powerpc/include/asm/kexec_ranges.h b/arch/powerpc/include/asm/kexec_ranges.h
index 35ae31a7a4de..7a9f8d15 100644
--- a/arch/powerpc/include/asm/kexec_ranges.h
+++ b/arch/powerpc/include/asm/kexec_ranges.h
@@ -7,5 +7,19 @@
 void sort_memory_ranges(struct crash_mem *mrngs, bool merge);
 struct crash_mem *realloc_mem_ranges(struct crash_mem **mem_ranges);
 int add_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size);
+int add_tce_mem_ranges(struct crash_mem **mem_ranges);
+int add_initrd_mem_range(struct crash_mem **mem_ranges);
+#ifdef CONFIG_PPC_BOOK3S_64
+int add_htab_mem_range(struct crash_mem **mem_ranges);
+#else
+static inline int add_htab_mem_range(struct crash_mem **mem_ranges)
+{
+   return 0;
+}
+#endif
+int add_kernel_mem_range(struct crash_mem **mem_ranges);
+int add_rtas_mem_range(struct crash_mem **mem_ranges);
+int add_opal_mem_range(struct crash_mem **mem_ranges);
+int add_reserved_mem_ranges(struct crash_mem **mem_ranges);
 
 #endif /* _ASM_POWERPC_KEXEC_RANGES_H */
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index 23ad04ccaf8e..64c15a5a280b 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -46,6 +46,14 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
if (ret)
goto out;
 
+   if (image->type == KEXEC_TYPE_CRASH) {
+   /* min & max buffer values for kdump case */
+   kbuf.buf_min = pbuf.buf_min = crashk_res.start;
+   kbuf.buf_max = pbuf.buf_max =
+   ((crashk_res.end < ppc64_rma_size) ?
+crashk_res.end : (ppc64_rma_size - 1));
+   }
+
 	ret = kexec_elf_load(image, &ehdr, &elf_info, &kbuf, &kernel_load_addr);
if (ret)
goto out;
diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index 3e9ac5f216b0..d09c7724efa8 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -17,12 +17,262 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 const struct kexec_file_ops * const kexec_file_loaders[] = {
 	&kexec_elf64_ops,
NULL
 };
 
+/**
+ * get_exclude_memory_ranges - Get exclude memory ranges. This list includes
+ * regions like opal/rtas, tce-table, initrd,
+ * 

[PATCH v6 03/11] powerpc/kexec_file: add helper functions for getting memory ranges

2020-07-29 Thread Hari Bathini
In kexec case, the kernel to be loaded uses the same memory layout as
the running kernel. So, passing on the DT of the running kernel would
be good enough.

But in case of kdump, different memory ranges are needed to manage
loading the kdump kernel, booting into it and exporting the elfcore
of the crashing kernel. The ranges are exclude memory ranges, usable
memory ranges, reserved memory ranges and crash memory ranges.

Exclude memory ranges specify the list of memory ranges to avoid while
loading kdump segments. Usable memory ranges list the memory ranges
that could be used for booting kdump kernel. Reserved memory ranges
list the memory regions for the loading kernel's reserve map. Crash
memory ranges list the memory ranges to be exported as the crashing
kernel's elfcore.

Add helper functions for setting up the above mentioned memory ranges.
These helpers make the subsequent changes easier to follow and make it
easy to set up the different memory ranges listed above as and when
appropriate.
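
For illustration, the intended usage pattern of the helpers (a hypothetical
sketch matching the prototypes added below; base and size are placeholders):

	struct crash_mem *mem_rngs = NULL;
	int ret;

	/* build the list incrementally; the helper (re)allocates as needed */
	ret = add_mem_range(&mem_rngs, base, size);
	if (ret)
		return ret;

	/* sort, and merge adjacent/overlapping ranges when merge is true */
	sort_memory_ranges(mem_rngs, true);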

Signed-off-by: Hari Bathini 
Tested-by: Pingfan Liu 
Reviewed-by: Thiago Jung Bauermann 
---

v5 -> v6:
* Dropped email address from copyright header of the new file being
  added: arch/powerpc/kexec/ranges.c
* Changed mrngs to mem_rngs. Using the convention mem_ranges for
  'struct crash_mem **' types & mem_rngs for 'struct crash_mem *'
  for easy readability.
* Updated add_opal_mem_range() & add_rtas_mem_range() functions without
  goto statements.
* Moved implementation of all add_foo_mem_range(s)() functions to
  patch 04/11, where they are used.
* Fixed reference count leak in add_tce_mem_ranges() function and also
  updated error handling in reading tce table base & sizes.

v4 -> v5:
* Added Reviewed-by tag from Thiago.
* Added the missing "#ifdef CONFIG_PPC_BOOK3S_64" around add_htab_mem_range()
  function in arch/powerpc/kexec/ranges.c file.
* add_tce_mem_ranges() function returned error when tce table is not found
  in a pci node. This is wrong as pci nodes may not always have tce tables
  (KVM guests, for example). Fixed it by ignoring error in reading tce
  table base/size while returning from the function.

v3 -> v4:
* Updated sort_memory_ranges() function to reuse sort() from lib/sort.c
  and addressed other review comments from Thiago.

v2 -> v3:
* Unchanged. Added Tested-by tag from Pingfan.

v1 -> v2:
* Added an option to merge ranges while sorting to minimize reallocations
  for memory ranges list.
* Dropped within_crashkernel option for add_opal_mem_range() &
  add_rtas_mem_range() as it is not really needed.


 arch/powerpc/include/asm/kexec_ranges.h |   11 +
 arch/powerpc/kexec/Makefile |2 
 arch/powerpc/kexec/ranges.c |  235 +++
 3 files changed, 247 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/include/asm/kexec_ranges.h
 create mode 100644 arch/powerpc/kexec/ranges.c

diff --git a/arch/powerpc/include/asm/kexec_ranges.h b/arch/powerpc/include/asm/kexec_ranges.h
new file mode 100644
index ..35ae31a7a4de
--- /dev/null
+++ b/arch/powerpc/include/asm/kexec_ranges.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _ASM_POWERPC_KEXEC_RANGES_H
+#define _ASM_POWERPC_KEXEC_RANGES_H
+
+#define MEM_RANGE_CHUNK_SZ	2048	/* Memory ranges size chunk */
+
+void sort_memory_ranges(struct crash_mem *mrngs, bool merge);
+struct crash_mem *realloc_mem_ranges(struct crash_mem **mem_ranges);
+int add_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size);
+
+#endif /* _ASM_POWERPC_KEXEC_RANGES_H */
diff --git a/arch/powerpc/kexec/Makefile b/arch/powerpc/kexec/Makefile
index 67c355329457..4aff6846c772 100644
--- a/arch/powerpc/kexec/Makefile
+++ b/arch/powerpc/kexec/Makefile
@@ -7,7 +7,7 @@ obj-y   += core.o crash.o core_$(BITS).o
 
 obj-$(CONFIG_PPC32)	+= relocate_32.o
 
-obj-$(CONFIG_KEXEC_FILE)   += file_load.o file_load_$(BITS).o elf_$(BITS).o
+obj-$(CONFIG_KEXEC_FILE)   += file_load.o ranges.o file_load_$(BITS).o elf_$(BITS).o
 
 ifdef CONFIG_HAVE_IMA_KEXEC
 ifdef CONFIG_IMA
diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c
new file mode 100644
index ..dc3ce036f416
--- /dev/null
+++ b/arch/powerpc/kexec/ranges.c
@@ -0,0 +1,235 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * powerpc code to implement the kexec_file_load syscall
+ *
+ * Copyright (C) 2004  Adam Litke (a...@us.ibm.com)
+ * Copyright (C) 2004  IBM Corp.
+ * Copyright (C) 2004,2005  Milton D Miller II, IBM Corporation
+ * Copyright (C) 2005  R Sharada (shar...@in.ibm.com)
+ * Copyright (C) 2006  Mohan Kumar M (mo...@in.ibm.com)
+ * Copyright (C) 2020  IBM Corporation
+ *
+ * Based on kexec-tools' kexec-ppc64.c, fs2dt.c.
+ * Heavily modified for the kernel by
+ * Hari Bathini, IBM Corporation.
+ */
+
+#define pr_fmt(fmt) "kexec ranges: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/**
+ * 

[PATCH v6 02/11] powerpc/kexec_file: mark PPC64 specific code

2020-07-29 Thread Hari Bathini
Some of the kexec_file_load code isn't PPC64 specific. Move PPC64
specific code from kexec/file_load.c to kexec/file_load_64.c. Also,
rename purgatory/trampoline.S to purgatory/trampoline_64.S in the
same spirit. No functional changes.

Signed-off-by: Hari Bathini 
Tested-by: Pingfan Liu 
Reviewed-by: Laurent Dufour 
Reviewed-by: Thiago Jung Bauermann 
---

v5 -> v6:
* Dropped email address from copyright header of the new file being
  added: arch/powerpc/kexec/file_load_64.c

v4 -> v5:
* Unchanged.

v3 -> v4:
* Moved common code back to setup_new_fdt() from setup_new_fdt_ppc64()
  function. Added Reviewed-by tags from Laurent & Thiago.

v2 -> v3:
* Unchanged. Added Tested-by tag from Pingfan.

v1 -> v2:
* No changes.


 arch/powerpc/include/asm/kexec.h       |    9 ++
 arch/powerpc/kexec/Makefile            |    2 +-
 arch/powerpc/kexec/elf_64.c            |    7 +-
 arch/powerpc/kexec/file_load.c         |   19 +
 arch/powerpc/kexec/file_load_64.c      |   87 
 arch/powerpc/purgatory/Makefile        |    4 +
 arch/powerpc/purgatory/trampoline.S    |  117 
 arch/powerpc/purgatory/trampoline_64.S |  117 
 8 files changed, 222 insertions(+), 140 deletions(-)
 create mode 100644 arch/powerpc/kexec/file_load_64.c
 delete mode 100644 arch/powerpc/purgatory/trampoline.S
 create mode 100644 arch/powerpc/purgatory/trampoline_64.S

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index c68476818753..ac8fd4839171 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -116,6 +116,15 @@ int setup_new_fdt(const struct kimage *image, void *fdt,
  unsigned long initrd_load_addr, unsigned long initrd_len,
  const char *cmdline);
 int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
+
+#ifdef CONFIG_PPC64
+int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
+ const void *fdt, unsigned long kernel_load_addr,
+ unsigned long fdt_load_addr);
+int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
+   unsigned long initrd_load_addr,
+   unsigned long initrd_len, const char *cmdline);
+#endif /* CONFIG_PPC64 */
 #endif /* CONFIG_KEXEC_FILE */
 
 #else /* !CONFIG_KEXEC_CORE */
diff --git a/arch/powerpc/kexec/Makefile b/arch/powerpc/kexec/Makefile
index 86380c69f5ce..67c355329457 100644
--- a/arch/powerpc/kexec/Makefile
+++ b/arch/powerpc/kexec/Makefile
@@ -7,7 +7,7 @@ obj-y   += core.o crash.o core_$(BITS).o
 
 obj-$(CONFIG_PPC32)	+= relocate_32.o
 
-obj-$(CONFIG_KEXEC_FILE)   += file_load.o elf_$(BITS).o
+obj-$(CONFIG_KEXEC_FILE)   += file_load.o file_load_$(BITS).o elf_$(BITS).o
 
 ifdef CONFIG_HAVE_IMA_KEXEC
 ifdef CONFIG_IMA
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index 3072fd6dbe94..23ad04ccaf8e 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -88,7 +88,8 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
goto out;
}
 
-   ret = setup_new_fdt(image, fdt, initrd_load_addr, initrd_len, cmdline);
+   ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
+ initrd_len, cmdline);
if (ret)
goto out;
 
@@ -107,8 +108,8 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
pr_debug("Loaded device tree at 0x%lx\n", fdt_load_addr);
 
slave_code = elf_info.buffer + elf_info.proghdrs[0].p_offset;
-   ret = setup_purgatory(image, slave_code, fdt, kernel_load_addr,
- fdt_load_addr);
+   ret = setup_purgatory_ppc64(image, slave_code, fdt, kernel_load_addr,
+   fdt_load_addr);
if (ret)
pr_err("Error setting up the purgatory.\n");
 
diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c
index 143c91724617..38439aba27d7 100644
--- a/arch/powerpc/kexec/file_load.c
+++ b/arch/powerpc/kexec/file_load.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 /*
- * ppc64 code to implement the kexec_file_load syscall
+ * powerpc code to implement the kexec_file_load syscall
  *
  * Copyright (C) 2004  Adam Litke (a...@us.ibm.com)
  * Copyright (C) 2004  IBM Corp.
@@ -20,22 +20,7 @@
 #include 
 #include 
 
-#define SLAVE_CODE_SIZE	256
-
-const struct kexec_file_ops * const kexec_file_loaders[] = {
-	&kexec_elf64_ops,
-   NULL
-};
-
-int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
- unsigned long buf_len)
-{
-   /* We don't support crash kernels yet. */
-   if (image->type == KEXEC_TYPE_CRASH)
-   return -EOPNOTSUPP;
-
-   return kexec_image_probe_default(image, buf, buf_len);
-}

[PATCH v6 01/11] kexec_file: allow archs to handle special regions while locating memory hole

2020-07-29 Thread Hari Bathini
Some architectures may have special memory regions, within the given
memory range, which can't be used for the buffer in a kexec segment.
Implement a weak arch_kexec_locate_mem_hole() definition which arch
code may override to take care of special regions while trying to
locate a memory hole.
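
As a sketch of the intended override pattern (hypothetical arch code,
not part of this patch; the two helper names are made up for
illustration):

	int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
	{
		/* No special regions to worry about: use the generic walk,
		 * just like the weak default definition does. */
		if (!arch_has_special_regions(kbuf))	/* hypothetical check */
			return kexec_locate_mem_hole(kbuf);

		/* Otherwise walk memory while skipping the special regions,
		 * setting kbuf->mem on success (hypothetical helper). */
		return locate_hole_avoiding_special_regions(kbuf);
	}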

Also, add the missing declarations for arch overridable functions and
drop the __weak descriptors in the declarations to avoid non-weak
definitions from becoming weak.
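
For context, the reason __weak must not appear on the declarations
(illustrative, not from the patch): the attribute on a declaration
carries over to a later definition in any file that includes the
header, so an arch's intended strong override can itself end up weak
and lose to the generic weak definition depending on link order.
Roughly:

	/* header, before this change: */
	int __weak arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
						 unsigned long buf_len);

	/* arch file including that header: this definition, meant to be a
	 * strong override, inherits the weak attribute from the declaration
	 * above, making the linker's choice link-order dependent. */
	int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
					  unsigned long buf_len)
	{
		/* ... arch-specific checks ... */
		return kexec_image_probe_default(image, buf, buf_len);
	}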

Reported-by: kernel test robot 
[lkp: In v1, arch_kimage_file_post_load_cleanup() declaration was missing]
Signed-off-by: Hari Bathini 
Tested-by: Pingfan Liu 
Acked-by: Dave Young 
Reviewed-by: Thiago Jung Bauermann 
---

v5 -> v6:
* Unchanged.

v4 -> v5:
* Unchanged.

v3 -> v4:
* Unchanged. Added Reviewed-by tag from Thiago.

v2 -> v3:
* Unchanged. Added Acked-by & Tested-by tags from Dave & Pingfan.

v1 -> v2:
* Introduced arch_kexec_locate_mem_hole() for override and dropped
  weak arch_kexec_add_buffer().
* Dropped __weak identifier for arch overridable functions.
* Fixed the missing declaration for arch_kimage_file_post_load_cleanup()
  reported by lkp. lkp report for reference:
- https://lore.kernel.org/patchwork/patch/1264418/


 include/linux/kexec.h |   29 ++---
 kernel/kexec_file.c   |   16 ++--
 2 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index ea67910ae6b7..9e93bef52968 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -183,17 +183,24 @@ int kexec_purgatory_get_set_symbol(struct kimage *image, const char *name,
   bool get_value);
 void *kexec_purgatory_get_symbol_addr(struct kimage *image, const char *name);
 
-int __weak arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
-unsigned long buf_len);
-void * __weak arch_kexec_kernel_image_load(struct kimage *image);
-int __weak arch_kexec_apply_relocations_add(struct purgatory_info *pi,
-   Elf_Shdr *section,
-   const Elf_Shdr *relsec,
-   const Elf_Shdr *symtab);
-int __weak arch_kexec_apply_relocations(struct purgatory_info *pi,
-   Elf_Shdr *section,
-   const Elf_Shdr *relsec,
-   const Elf_Shdr *symtab);
+/* Architectures may override the below functions */
+int arch_kexec_kernel_image_probe(struct kimage *image, void *buf,
+ unsigned long buf_len);
+void *arch_kexec_kernel_image_load(struct kimage *image);
+int arch_kexec_apply_relocations_add(struct purgatory_info *pi,
+Elf_Shdr *section,
+const Elf_Shdr *relsec,
+const Elf_Shdr *symtab);
+int arch_kexec_apply_relocations(struct purgatory_info *pi,
+Elf_Shdr *section,
+const Elf_Shdr *relsec,
+const Elf_Shdr *symtab);
+int arch_kimage_file_post_load_cleanup(struct kimage *image);
+#ifdef CONFIG_KEXEC_SIG
+int arch_kexec_kernel_verify_sig(struct kimage *image, void *buf,
+unsigned long buf_len);
+#endif
+int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf);
 
 extern int kexec_add_buffer(struct kexec_buf *kbuf);
 int kexec_locate_mem_hole(struct kexec_buf *kbuf);
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 09cc78df53c6..e89912d33a27 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -635,6 +635,19 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
return ret == 1 ? 0 : -EADDRNOTAVAIL;
 }
 
+/**
+ * arch_kexec_locate_mem_hole - Find free memory to place the segments.
+ * @kbuf:   Parameters for the memory search.
+ *
+ * On success, kbuf->mem will have the start address of the memory region found.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+int __weak arch_kexec_locate_mem_hole(struct kexec_buf *kbuf)
+{
+   return kexec_locate_mem_hole(kbuf);
+}
+
 /**
  * kexec_add_buffer - place a buffer in a kexec segment
  * @kbuf:  Buffer contents and memory parameters.
@@ -647,7 +660,6 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
  */
 int kexec_add_buffer(struct kexec_buf *kbuf)
 {
-
struct kexec_segment *ksegment;
int ret;
 
@@ -675,7 +687,7 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
kbuf->buf_align = max(kbuf->buf_align, PAGE_SIZE);
 
/* Walk the RAM ranges and allocate a suitable range for the buffer */
-   ret = kexec_locate_mem_hole(kbuf);
+   ret = arch_kexec_locate_mem_hole(kbuf);
if (ret)
return ret;
 




[PATCH v6 00/11] ppc64: enable kdump support for kexec_file_load syscall

2020-07-29 Thread Hari Bathini
Sorry! There was a gateway issue on my system while posting v5, due to
which some patches did not make it through. Resending...

This patch series enables kdump support for kexec_file_load system
call (kexec -s -p) on PPC64. The changes are inspired from kexec-tools
code but heavily modified for kernel consumption.

The first patch adds a weak arch_kexec_locate_mem_hole() function that
architectures can override to suit their memory hole lookup needs.
There are some special regions in ppc64 which should be avoided while
loading the buffer, and there are multiple callers to kexec_add_buffer,
making it complicated to maintain range sanity while using the generic
lookup at the same time.

The second patch marks ppc64 specific code within arch/powerpc/kexec
and arch/powerpc/purgatory to make the subsequent code changes easy
to understand.

The next patch adds helper functions to set up the different memory
ranges needed for loading the kdump kernel, booting into it and
exporting the crashing kernel's elfcore.

The fourth patch overrides arch_kexec_locate_mem_hole() function to
locate memory hole for kdump segments by accounting for the special
memory regions, referred to as excluded memory ranges, and sets
kbuf->mem when a suitable memory region is found.

The fifth patch moves walk_drmem_lmbs() out of the .init section with
a few changes to reuse it for setting up the kdump kernel's usable memory
ranges. The next patch uses walk_drmem_lmbs() to look up the LMBs
and set linux,drconf-usable-memory & linux,usable-memory properties
in order to restrict kdump kernel's memory usage.

The next patch sets up the backup region as a kexec segment while
loading the kdump kernel and teaches purgatory to copy data from source
to destination.

Patch 09 builds the elfcore header for the running kernel & passes
the info to the kdump kernel via the "elfcorehdr=" parameter to export
it as the /proc/vmcore file. The next patch sets up the memory reserve
map for the kexec kernel and also claims kdump support, as all the
necessary changes are now in place.

The next patch fixes a lookup issue for the `kexec -l -s` case when
memory is reserved for crashkernel.

The last patch updates purgatory to set up r8 & r9 with opal base
and opal entry addresses respectively to aid kernels built with
CONFIG_PPC_EARLY_DEBUG_OPAL enabled.

Tested the changes successfully on P8 and P9 LPARs, a couple of OpenPOWER
boxes, one with secureboot enabled, a KVM guest and a simulator.

v5 -> v6:
* Fixed reference count leak in add_tce_mem_ranges() function and also
  updated error handling in reading tce table base & sizes.
* Instead of trying to reinvent the wheel with get_node_path() &
  get_node_path_size() functions, used %pOF format as suggested by mpe.
* Moved patch 07/11 to end of the series for mpe to take a call on
  whether to have it or not.

v4 -> v5:
* Dropped patches 07/12 & 08/12 and updated purgatory to do everything
  in assembly.
* Added a new patch (which was part of patch 08/12 in v4) to update
  r8 & r9 registers with opal base & opal entry addresses as it is
  expected on kernels built with CONFIG_PPC_EARLY_DEBUG_OPAL enabled.
* Fixed kexec load issue on KVM guest.

v3 -> v4:
* Updated get_node_path() function to be iterative instead of a recursive one.
* Added comment explaining why low memory is added to kdump kernel's usable
  memory ranges though it doesn't fall in crashkernel region.
* Fixed stack_buf to be quadword aligned in accordance with ABI.
* Added missing of_node_put() in setup_purgatory_ppc64().
* Added a FIXME tag to indicate issue in adding opal/rtas regions to
  core image.

v2 -> v3:
* Fixed TOC pointer calculation for purgatory by using section info
  that has relocations applied.
* Fixed arch_kexec_locate_mem_hole() function to fallback to generic
  kexec_locate_mem_hole() lookup if exclude ranges list is empty.
* Dropped check for backup_start in trampoline_64.S as purgatory()
  function takes care of it anyway.

v1 -> v2:
* Introduced arch_kexec_locate_mem_hole() for override and dropped
  weak arch_kexec_add_buffer().
* Addressed warnings reported by lkp.
* Added patch to address kexec load issue when memory is reserved
  for crashkernel.
* Used the appropriate license header for the new files added.
* Added an option to merge ranges to minimize reallocations while
  adding memory ranges.
* Dropped within_crashkernel parameter for add_opal_mem_range() &
  add_rtas_mem_range() functions as it is not really needed.

---

Hari Bathini (11):
  kexec_file: allow archs to handle special regions while locating memory hole
  powerpc/kexec_file: mark PPC64 specific code
  powerpc/kexec_file: add helper functions for getting memory ranges
  ppc64/kexec_file: avoid stomping memory used by special regions
  powerpc/drmem: make lmb walk a bit more flexible
  ppc64/kexec_file: restrict memory usage of kdump kernel
  ppc64/kexec_file: setup backup region for kdump kernel
  ppc64/kexec_file: prepare elfcore header for crashing kernel
  ppc64/kexec_file: add 

Re: [PATCH] powerpc: Fix MMCRA_BHRB_DISABLE define to work with binutils version < 2.28

2020-07-29 Thread Madhavan Srinivasan




On 7/29/20 9:46 AM, Athira Rajeev wrote:

commit 9908c826d5ed ("powerpc/perf: Add Power10 PMU feature to
DT CPU features") defines MMCRA_BHRB_DISABLE as `0x20UL`.
Binutils version less than 2.28 doesn't support UL suffix.

linux-ppc/arch/powerpc/kernel/cpu_setup_power.S: Assembler messages:
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: junk at end of line, first unrecognized character is `L'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: junk at end of line, first unrecognized character is `L'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: found 'L', expected: ')'
linux-ppc/arch/powerpc/kernel/cpu_setup_power.S:250: Error: operand out of range (0x0000002000000000 is not between 0xffffffffffff8000 and 0x000000000000ffff)

Fix this by wrapping it with the `_UL()` macro.
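
For reference, `_UL()` comes from include/uapi/linux/const.h (abridged
below): for assembly it drops the suffix entirely and emits the bare
constant, while for C it token-pastes the UL suffix back on, which is
why it sidesteps the old-binutils limitation:

	#ifdef __ASSEMBLY__
	#define _AC(X,Y)	X		/* assembler sees the bare constant */
	#else
	#define __AC(X,Y)	(X##Y)		/* C sees the suffixed constant */
	#define _AC(X,Y)	__AC(X,Y)
	#endif

	#define _UL(x)		(_AC(x, UL))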

Looks fine to me.

Reviewed-by: Madhavan Srinivasan 




Fixes: 9908c826d5ed ("powerpc/perf: Add Power10 PMU feature to DT CPU features")
Signed-off-by: Athira Rajeev 
Suggested-by: Michael Ellerman 
---
  arch/powerpc/include/asm/reg.h | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index ae71027..41419f1 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -12,6 +12,7 @@
  #ifdef __KERNEL__

  #include 
+#include 
  #include 
  #include 
  #include 
@@ -888,7 +889,7 @@
  #define   MMCRA_SLOT	0x07000000UL /* SLOT bits (37-39) */
  #define   MMCRA_SLOT_SHIFT	24
  #define   MMCRA_SAMPLE_ENABLE 0x00000001UL /* enable sampling */
 -#define   MMCRA_BHRB_DISABLE  0x2000000000UL // BHRB disable bit for ISA v3.1
 +#define   MMCRA_BHRB_DISABLE  _UL(0x2000000000) // BHRB disable bit for ISA v3.1
  #define   POWER6_MMCRA_SDSYNC 0x0000080000000000ULL	/* SDAR/SIAR synced */
  #define   POWER6_MMCRA_SIHV   0x0000040000000000ULL
  #define   POWER6_MMCRA_SIPR   0x0000020000000000ULL




Re: [PATCH 04/15] arm64: numa: simplify dummy_numa_init()

2020-07-29 Thread Jonathan Cameron
On Tue, 28 Jul 2020 08:11:42 +0300
Mike Rapoport  wrote:

> From: Mike Rapoport 
> 
> dummy_numa_init() loops over memblock.memory and passes nid=0 to
> numa_add_memblk() which essentially wraps memblock_set_node(). However,
> memblock_set_node() can cope with entire memory span itself, so the loop
> over memblock.memory regions is redundant.
> 
> Replace the loop with a single call to memblock_set_node() to the entire
> memory.

Hi Mike,

I had a similar patch I was going to post shortly so can add a bit more
on the advantages of this one.

Beyond cleaning up, it also fixes an issue with a buggy ACPI firmware in
which the SRAT table covers some but not all of the memory in the EFI
memory map.  Stealing bits from the draft cover letter I had for that...

> This issue can be easily triggered by having an SRAT table which fails
> to cover all elements of the EFI memory map.
> 
> This firmware error is detected and a warning printed. e.g.
> "NUMA: Warning: invalid memblk node 64 [mem 0x24000-0x27fff]"
> At that point we fall back to dummy_numa_init().
> 
> However, the failed ACPI init has left us with our memblocks all broken
> up as we split them when trying to assign them to NUMA nodes.
> 
> We then iterate over the memblocks and add them to node 0.
> 
> for_each_memblock(memory, mblk) {
>   ret = numa_add_memblk(0, mblk->base, mblk->base + mblk->size);
>   if (!ret)
>   continue;
>   pr_err("NUMA init failed\n");
>   return ret;
> }
> 
> numa_add_memblk() calls memblock_set_node() which merges regions that
> were previously split up during the earlier attempt to add them to different
> nodes during parsing of SRAT.
> 
> This means elements are moved in the memblock array and we can end up
> in a different memblock after the call to numa_add_memblk().
> Result is:
> 
> Unable to handle kernel paging request at virtual address 3a40
> Mem abort info:
>   ESR = 0x9604
>   EC = 0x25: DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
> Data abort info:
>   ISV = 0, ISS = 0x0004
>   CM = 0, WnR = 0
> [3a40] user address but active_mm is swapper
> Internal error: Oops: 9604 [#1] PREEMPT SMP
> 
> ...
> 
> Call trace:
>   sparse_init_nid+0x5c/0x2b0
>   sparse_init+0x138/0x170
>   bootmem_init+0x80/0xe0
>   setup_arch+0x2a0/0x5fc
>   start_kernel+0x8c/0x648
> 
> As an illustrative example:
> EFI table has one block of memory.
> memblks[0] = [0...0x2f]  so we start with a single memblock.
> 
> SRAT has
> [0x00...0x0f] in node 0
> [0x10...0x1f] in node 1
> but no entry covering [0x20...0x2f].
> 
> Whilst parsing SRAT the single memblock is broken into 3.
> memblks[0] = [0x00...0x0f] in node 0
> memblks[1] = [0x10...0x1f] in node 1
> memblks[2] = [0x20...0x2f] in node MAX_NUM_NODES (invalid value)
> 
> A sanity check parse then detects the invalid section and acpi_numa_init
> fails.  We then fall back to the dummy path.
> 
> That iterates over the memblocks.  We'll use i as an index in the array of 
> memblocks
> 
> i = 0;
> memblks[0] = [0x00...0x0f] set to node0.
>merge doesn't do anything because the neighbouring memblock is still in 
> node1.
> 
> i = 1
> memblks[1] = [0x10...0x1f] set to node 0.
>merge combines memblock 0 and 1 to give a new set of memblocks.
> 
> memblks[0] = [0x00..0x1f] in node 0
> memblks[1] = [0x20..0x2f] in node MAX_NUM_NODES.
> 
> i = 2 off the end of the now reduced array of memblocks, so exit the loop.
> (if we restart the loop here everything will be fine).
> 
> Later sparse_init_nid tries to use the node of the second memblock to index
> something, and boom.


> 
> Signed-off-by: Mike Rapoport 

Acked-by: Jonathan Cameron 

> ---
>  arch/arm64/mm/numa.c | 13 +
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index aafcee3e3f7e..0cbdbcc885fb 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -423,19 +423,16 @@ static int __init numa_init(int (*init_func)(void))
>   */
>  static int __init dummy_numa_init(void)
>  {
> + phys_addr_t start = memblock_start_of_DRAM();
> + phys_addr_t end = memblock_end_of_DRAM();
>   int ret;
> - struct memblock_region *mblk;
>  
>   if (numa_off)
>   pr_info("NUMA disabled\n"); /* Forced off on command line. */
> - pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n",
> - memblock_start_of_DRAM(), memblock_end_of_DRAM() - 1);
> -
> - for_each_memblock(memory, mblk) {
> - ret = numa_add_memblk(0, mblk->base, mblk->base + mblk->size);
> - if (!ret)
> - continue;
> + pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n", start, end - 1);
>  
> + ret = numa_add_memblk(0, start, end);
> + if (ret) {
>   pr_err("NUMA init failed\n");
>   return ret;
>   }




[PATCH v2 3/3] cpuidle-pseries : Fixup exit latency for CEDE(0)

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

We are currently assuming that CEDE(0) has an exit latency of 10us,
since there is no way for us to query it from the platform.  However,
if the wakeup latency of an Extended CEDE state is smaller than 10us,
then we can be sure that the exit latency of CEDE(0) cannot be more
than that.

In this patch, we fix the exit latency of CEDE(0) if we discover an
Extended CEDE state with wakeup latency smaller than 10us.
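
As a worked example of the fix-up (assuming the 512 MHz POWER timebase,
and the Extended CEDE(1) record from the previous patch, which
advertises a wakeup latency of 0x400 tb-ticks):

	u64 latency_tb = 0x400;	/* 1024 timebase ticks */
	u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
	/* 1024 ticks / 512 MHz = 2000 ns, so latency_us = 2 */

	/* 2 < 10, so CEDE(0) exit latency is fixed up to 2 - 1 = 1 us
	 * and its target_residency becomes 10 * 1 = 10 us. */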

Benchmark results:

ebizzy:
2 ebizzy threads bound to the same big-core. 25% improvement in the
avg records/s with patch.
x without_patch
+ with_patch
N         Min       Max       Median     Avg        Stddev
x  10   2491089   5834307   5398375   4244335 1596244.9
+  10   2893813   5834474   5832448 5327281.3 1055941.4

context_switch2 :
There is no major regression observed with this patch as seen from the
context_switch2 benchmark.

context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different
small cores). We observe a minor 0.14% regression in the number of
context-switches (higher is better).
x without_patch
+ with_patch
N         Min       Max       Median     Avg        Stddev
x 500     348872    362236    354712     354745.69  2711.827
+ 500     349422    361452    353942     354215.4   2576.9258
Difference at 99.0% confidence
-530.288 +/- 430.963
-0.149484% +/- 0.121485%
(Student's t, pooled s = 2645.24)

context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37%
improvement in the number of context-switches (higher is better).
x without_patch
+ with_patch
N         Min       Max       Median     Avg        Stddev
x 500     287956    294940    288896     288977.23  646.59295
+ 500     288300    294646    289582     290064.76  1161.9992
Difference at 99.0% confidence
1087.53 +/- 153.194
0.376337% +/- 0.0530125%
(Student's t, pooled s = 940.299)

schbench:
No major difference could be seen until the 99.9th percentile.

Without-patch
Latency percentiles (usec)
50.0th: 29
75.0th: 39
90.0th: 49
95.0th: 59
*99.0th: 13104
99.5th: 14672
99.9th: 15824
min=0, max=17993

With-patch:
Latency percentiles (usec)
50.0th: 29
75.0th: 40
90.0th: 50
95.0th: 61
*99.0th: 13648
99.5th: 14768
99.9th: 15664
min=0, max=29812

Reviewed-by: Vaidyanathan Srinivasan 
Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-pseries.c | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index b1dc24d..0b2f115 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -334,12 +334,42 @@ static int pseries_cpuidle_driver_init(void)
 static int add_pseries_idle_states(void)
 {
int nr_states = 2; /* By default we have snooze, CEDE */
+   int i;
+	u64 min_latency_us = dedicated_states[1].exit_latency; /* CEDE latency */
 
if (parse_cede_parameters())
return nr_states;
 
-   pr_info("cpuidle : Skipping the %d Extended CEDE idle states\n",
-   nr_xcede_records);
+   for (i = 0; i < nr_xcede_records; i++) {
+   u64 latency_tb = xcede_records[i].wakeup_latency_tb_ticks;
+   u64 latency_us = tb_to_ns(latency_tb) / NSEC_PER_USEC;
+
+   if (latency_us < min_latency_us)
+   min_latency_us = latency_us;
+   }
+
+   /*
+* We are currently assuming that CEDE(0) has exit latency
+* 10us, since there is no way for us to query from the
+* platform.
+*
+* However, if the wakeup latency of an Extended CEDE state is
+* smaller than 10us, then we can be sure that CEDE(0)
+* requires no more than that.
+*
+* Perform the fix-up.
+*/
+   if (min_latency_us < dedicated_states[1].exit_latency) {
+   u64 cede0_latency = min_latency_us - 1;
+
+   if (cede0_latency <= 0)
+   cede0_latency = min_latency_us;
+
+   dedicated_states[1].exit_latency = cede0_latency;
+   dedicated_states[1].target_residency = 10 * (cede0_latency);
+   pr_info("cpuidle : Fixed up CEDE exit latency to %llu us\n",
+   cede0_latency);
+   }
 
return nr_states;
 }
-- 
1.9.4



[PATCH v2 0/3] cpuidle-pseries: Parse extended CEDE information for idle.

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Hi,

This is a v2 of the patch series to parse the extended CEDE
information in the pseries-cpuidle driver.

The v1 of this patchset can be found here :
https://lore.kernel.org/linuxppc-dev/1594120299-31389-1-git-send-email-...@linux.vnet.ibm.com/

The changes from v1 --> v2:

 * Dropped Patches 4 and 5, which would expose extended idle-states
   that wake up on external interrupts, to the cpuidle framework.  These
   were RFC patches in v1. Dropped them because currently the only
   extended CEDE state that wakes up on external interrupts is CEDE(1),
   which adds no significant value over CEDE(0).
   
 * Rebased the patches onto powerpc/merge.
 
 * No changes in code for Patches 1-3.

Motivation:
===
On pseries Dedicated Linux LPARs, apart from the polling snooze idle
state, we currently have the CEDE idle state which cedes the CPU to
the hypervisor with latency-hint = 0.

However, the PowerVM hypervisor supports additional extended CEDE
states, which can be queried through the "ibm,get-systems-parameter"
rtas-call with the CEDE_LATENCY_TOKEN. The hypervisor maps these
extended CEDE states to appropriate platform idle-states in order to
provide energy-savings as well as shifting power to the active
units. On existing pseries LPARs today we have extended CEDE with
latency-hints {1,2} supported.

The patches in this patchset, adds code to parse the CEDE latency
records provided by the hypervisor. We use this information to
determine the wakeup latency of the regular CEDE (which we have so far
been hardcoding to 10us, while experimentally it is much less, ~1us),
by looking at the wakeup latency provided by the hypervisor for
Extended CEDE states. Since the platform currently advertises Extended
CEDE 1 to have wakeup latency of 2us, we can be sure that the wakeup
latency of the regular CEDE is no more than this.

With Patches 1-3, we see an improvement in the single-threaded
performance on ebizzy.

2 ebizzy threads bound to the same big-core. 25% improvement in the
avg records/s (higher the better) with patches 1-3.
x without_patches
* with_patches
N         Min       Max       Median     Avg        Stddev
x  10   2491089   5834307   5398375   4244335 1596244.9
*  10   2893813   5834474   5832448 5327281.3 1055941.4

We do not observe any major regression in either the context_switch2
benchmark or the schbench benchmark

context_switch2 across CPU0 CPU1 (Both belong to same big-core, but different
small cores). We observe a minor 0.14% regression in the number of
context-switches (higher is better).
x without_patch
* with_patch
N         Min       Max       Median     Avg        Stddev
x 500     348872    362236    354712     354745.69  2711.827
* 500     349422    361452    353942     354215.4   2576.9258

context_switch2 across CPU0 CPU8 (Different big-cores). We observe a 0.37%
improvement in the number of context-switches (higher is better).
x without_patch
* with_patch
N         Min       Max       Median     Avg        Stddev
x 500     287956    294940    288896     288977.23  646.59295
* 500     288300    294646    289582     290064.76  1161.9992

schbench:
No major difference could be seen until the 99.9th percentile.

Without-patch
Latency percentiles (usec)
50.0th: 29
75.0th: 39
90.0th: 49
95.0th: 59
*99.0th: 13104
99.5th: 14672
99.9th: 15824
min=0, max=17993

With-patch:
Latency percentiles (usec)
50.0th: 29
75.0th: 40
90.0th: 50
95.0th: 61
*99.0th: 13648
99.5th: 14768
99.9th: 15664
min=0, max=29812

Gautham R. Shenoy (3):
  cpuidle-pseries: Set the latency-hint before entering CEDE
  cpuidle-pseries: Add function to parse extended CEDE records
  cpuidle-pseries : Fixup exit latency for CEDE(0)

 drivers/cpuidle/cpuidle-pseries.c | 167 +-
 1 file changed, 165 insertions(+), 2 deletions(-)

-- 
1.9.4



[PATCH v2 1/3] cpuidle-pseries: Set the latency-hint before entering CEDE

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

As per the PAPR, each H_CEDE call is associated with a latency-hint to
be passed in the VPA field "cede_latency_hint". The CEDE state that
we have been implicitly entering so far is CEDE with latency-hint = 0.

This patch explicitly sets the latency hint corresponding to the CEDE
state that we are currently entering. While at it, we save the
previous hint, to be restored once we wakeup from CEDE. This will be
required in the future when we expose extended-cede states through the
cpuidle framework, where each of them will have a different
cede-latency hint.

Reviewed-by: Vaidyanathan Srinivasan 
Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-pseries.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index 3e058ad2..88e71c3 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -86,19 +86,27 @@ static void check_and_cede_processor(void)
}
 }
 
+#define NR_CEDE_STATES 1  /* CEDE with latency-hint 0 */
+#define NR_DEDICATED_STATES	(NR_CEDE_STATES + 1) /* Includes snooze */
+
+u8 cede_latency_hint[NR_DEDICATED_STATES];
 static int dedicated_cede_loop(struct cpuidle_device *dev,
struct cpuidle_driver *drv,
int index)
 {
+   u8 old_latency_hint;
 
pseries_idle_prolog();
get_lppaca()->donate_dedicated_cpu = 1;
+   old_latency_hint = get_lppaca()->cede_latency_hint;
+   get_lppaca()->cede_latency_hint = cede_latency_hint[index];
 
HMT_medium();
check_and_cede_processor();
 
local_irq_disable();
get_lppaca()->donate_dedicated_cpu = 0;
+   get_lppaca()->cede_latency_hint = old_latency_hint;
 
pseries_idle_epilog();
 
@@ -130,7 +138,7 @@ static int shared_cede_loop(struct cpuidle_device *dev,
 /*
  * States for dedicated partition case.
  */
-static struct cpuidle_state dedicated_states[] = {
+static struct cpuidle_state dedicated_states[NR_DEDICATED_STATES] = {
{ /* Snooze */
.name = "snooze",
.desc = "snooze",
-- 
1.9.4



[PATCH v2 2/3] cpuidle-pseries: Add function to parse extended CEDE records

2020-07-29 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently we use CEDE with latency-hint 0 as the only other idle state
on a dedicated LPAR apart from the polling "snooze" state.

The platform might support additional extended CEDE idle states, which
can be discovered through the "ibm,get-system-parameter" rtas-call
made with CEDE_LATENCY_TOKEN.

This patch adds a function to obtain information about the extended
CEDE idle states from the platform and parse the contents to populate
an array of extended CEDE states. These idle states thus discovered
will be added to the cpuidle framework in the next patch.

dmesg on a POWER9 LPAR, demonstrating the output of parsing the
extended CEDE latency parameters.

[5.913180] xcede : xcede_record_size = 10
[5.913183] xcede : Record 0 : hint = 1, latency =0x400 tb-ticks, Wake-on-irq = 1
[5.913188] xcede : Record 1 : hint = 2, latency =0x3e8000 tb-ticks, Wake-on-irq = 0
[5.913193] cpuidle : Skipping the 2 Extended CEDE idle states

Reviewed-by: Vaidyanathan Srinivasan 
Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-pseries.c | 129 +-
 1 file changed, 127 insertions(+), 2 deletions(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index 88e71c3..b1dc24d 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct cpuidle_driver pseries_idle_driver = {
.name = "pseries_idle",
@@ -86,9 +87,120 @@ static void check_and_cede_processor(void)
}
 }
 
-#define NR_CEDE_STATES 1  /* CEDE with latency-hint 0 */
+struct xcede_latency_records {
+   u8  latency_hint;
+   u64 wakeup_latency_tb_ticks;
+   u8  responsive_to_irqs;
+};
+
+/*
+ * XCEDE : Extended CEDE states discovered through the
+ * "ibm,get-systems-parameter" rtas-call with the token
+ * CEDE_LATENCY_TOKEN
+ */
+#define MAX_XCEDE_STATES   4
+#define	XCEDE_LATENCY_RECORD_SIZE	10
+#define XCEDE_LATENCY_PARAM_MAX_LENGTH	(2 + 2 + \
+			(MAX_XCEDE_STATES * XCEDE_LATENCY_RECORD_SIZE))
+
+#define CEDE_LATENCY_TOKEN 45
+
+#define NR_CEDE_STATES	(MAX_XCEDE_STATES + 1) /* CEDE with latency-hint 0 */
 #define NR_DEDICATED_STATES	(NR_CEDE_STATES + 1) /* Includes snooze */
 
+struct xcede_latency_records xcede_records[MAX_XCEDE_STATES];
+unsigned int nr_xcede_records;
+char xcede_parameters[XCEDE_LATENCY_PARAM_MAX_LENGTH];
+
+static int parse_cede_parameters(void)
+{
+   int ret = -1, i;
+   u16 payload_length;
+   u8 xcede_record_size;
+   u32 total_xcede_records_size;
+   char *payload;
+
+   memset(xcede_parameters, 0, XCEDE_LATENCY_PARAM_MAX_LENGTH);
+
+   ret = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
+   NULL, CEDE_LATENCY_TOKEN, __pa(xcede_parameters),
+   XCEDE_LATENCY_PARAM_MAX_LENGTH);
+
+   if (ret) {
+   pr_err("xcede: Error parsing CEDE_LATENCY_TOKEN\n");
+   return ret;
+   }
+
+	payload_length = be16_to_cpu(*(__be16 *)(&xcede_parameters[0]));
+	payload = &xcede_parameters[2];
+
+   /*
+* If the platform supports the cede latency settings
+* information system parameter it must provide the following
+* information in the NULL terminated parameter string:
+*
+* a. The first byte is the length “N” of each cede
+*latency setting record minus one (zero indicates a length
+*of 1 byte).
+*
+* b. For each supported cede latency setting a cede latency
+*setting record consisting of the first “N” bytes as per
+*the following table.
+*
+*  -
+*  | Field   | Field  |
+*  | Name| Length |
+*  -
+*  | Cede Latency| 1 Byte |
+*  | Specifier Value ||
+*  -
+*  | Maximum wakeup  ||
+*  | latency in  | 8 Bytes|
+*  | tb-ticks||
+*  -
+*  | Responsive to   ||
+*  | external| 1 Byte |
+*  | interrupts  ||
+*  -
+*
+* This version has cede latency record size = 10.
+*/
+   xcede_record_size = (u8)payload[0] + 1;
+
+   if (xcede_record_size != XCEDE_LATENCY_RECORD_SIZE) {
+   pr_err("xcede : Expected record-size %d. Observed size %d.\n",
+  XCEDE_LATENCY_RECORD_SIZE, xcede_record_size);
+   return -EINVAL;
+   }
+
+   pr_info("xcede : xcede_record_size = %d\n", xcede_record_size);
+
+   /*
+ 
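
(The remainder of this message is truncated in the archive. For
completeness, here is a minimal sketch of how the remaining records
could be decoded given the layout documented above; this is
illustrative only, not the author's code, and the record offsets are
assumptions.)

	/* assumed: parameter length minus size byte and NUL terminator */
	total_xcede_records_size = payload_length - 2;
	nr_xcede_records = total_xcede_records_size / xcede_record_size;

	for (i = 0; i < nr_xcede_records; i++) {
		/* Assumes records follow the record-size byte */
		char *record = &payload[1 + i * xcede_record_size];
		struct xcede_latency_records *xr = &xcede_records[i];

		xr->latency_hint = record[0];
		xr->wakeup_latency_tb_ticks = be64_to_cpu(*(__be64 *)&record[1]);
		xr->responsive_to_irqs = record[9];

		pr_info("xcede : Record %d : hint = %u, latency = 0x%llx tb-ticks, Wake-on-irq = %u\n",
			i, xr->latency_hint, xr->wakeup_latency_tb_ticks,
			xr->responsive_to_irqs);
	}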

Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain

2020-07-29 Thread Srikar Dronamraju
* Valentin Schneider  [2020-07-28 16:03:11]:

Hi Valentin,

Thanks for looking into the patches.

> On 27/07/20 06:32, Srikar Dronamraju wrote:
> > Add percpu coregroup maps and masks to create coregroup domain.
> > If a coregroup doesn't exist, the coregroup domain will be degenerated
> > in favour of SMT/CACHE domain.
> >
> 
> So there's at least one arm64 platform out there with the same "pairs of
> cores share L2" thing (Ampere eMAG), and that lives quite happily with the
> default scheduler topology (SMT/MC/DIE). Each pair of core gets its MC
> domain, and the whole system is covered by DIE.
> 
> Now arguably it's not a perfect representation; DIE doesn't have
> SD_SHARE_PKG_RESOURCES so the highest level sd_llc can point to is MC. That
> will impact all callsites using cpus_share_cache(): in the eMAG case, only
> pairs of cores will be seen as sharing cache, even though *all* cores share
> the same L3.
> 

Okay, it's good to know that we have a chip which is similar to P9 in
topology.

> I'm trying to paint a picture of what the P9 topology looks like (the one
> you showcase in your cover letter) to see if there are any similarities;
> from what I gather in [1], wikichips and your cover letter, with P9 you can
> have something like this in a single DIE (somewhat unsure about L3 setup;
> it looks to be distributed?)
> 
>  +-+
>  |  L3 |
>  +---+-+---+-+---+-+---+
>  |   L2  | |   L2  | |   L2  | |   L2  |
>  +--+-+--+ +--+-+--+ +--+-+--+ +--+-+--+
>  |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  |
>  +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+
>  |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs|
>  +--+ +--+ +--+ +--+ +--+ +--+ +--+ +--+
> 
> Which would lead to (ignoring the whole SMT CPU numbering shenanigans)
> 
> NUMA [   ...
> DIE  [ ]
> MC   [ ] [ ] [ ] [ ]
> BIGCORE  [ ] [ ] [ ] [ ]
> SMT  [   ] [   ] [   ] [   ] [   ] [   ] [   ] [   ]
>  00-03 04-07 08-11 12-15 16-19 20-23 24-27 28-31  
> 

What you have summed up is perfectly what a P9 topology looks like. I dont
think I could have explained it better than this.

> This however has MC == BIGCORE; what makes it so that you can have different spans
> for these two domains? If it's not too much to ask, I'd love to have a P9
> topology diagram.
> 
> [1]: 20200722081822.gg9...@linux.vnet.ibm.com

At this time the current topology would be good enough, i.e. BIGCORE would
always be equal to MC. However, in future we could have chips that have a
smaller/larger number of CPUs in the llc than in a BIGCORE, or we could have
granular or split L3 caches within a DIE. In such a case BIGCORE != MC.

Also in the current P9 itself, two neighbouring core-pairs form a quad.
Cache latency within a quad is better than the latency to a distant core-pair.
Cache latency within a core-pair is way better than the latency within a quad.
So if we have only 4 threads running on a DIE, all of them accessing the same
cache-lines, then we could probably benefit if all the tasks were to run
within the quad aka MC/Coregroup.

I have found some latency-sensitive benchmarks that benefit from grouping
at the quad level (using kernel hacks and not backed by firmware changes).
Gautham also found similar results in his experiments, but he only used
binding within the stock kernel.

I am not setting SD_SHARE_PKG_RESOURCES in the MC/Coregroup sd_flags, as the
MC domain need not be the LLC domain for Power.

-- 
Thanks and Regards
Srikar Dronamraju