Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id

2020-07-31 Thread Srikar Dronamraju
* Aneesh Kumar K.V  [2020-07-31 16:49:14]:

> We use ibm,associativity and ibm,associativity-lookup-arrays to derive the numa
> node numbers. These device tree properties are firmware indicated grouping of
> resources based on their hierarchy in the platform. These numbers (group id) are
> not sequential and hypervisor/firmware can follow different numbering schemes.
> For ex: on powernv platforms, we group them in the below order.
> 
>  * - CCM node ID
>  * - HW card ID
>  * - HW module ID
>  * - Chip ID
>  * - Core ID
> 
> Based on ibm,associativity-reference-points we use one of the above group ids as
> Linux NUMA node id. (On PowerNV platform Chip ID is used). This results
> in Linux reporting non-linear NUMA node ids, which also results in Linux
> reporting empty node 0 NUMA nodes.
> 

If it's just to eliminate node 0, then we have 2 other probably better
solutions.
1. Don't mark node 0 as special (currently still in mm-tree and as a result
in linux-next)
2. powerpc specific: explicitly clear node 0 during NUMA bringup.

> This can be resolved by mapping the firmware provided group id to a logical Linux
> NUMA id. In this patch, we do this only for pseries platforms considering the

On PowerVM, as you would know, the nid is already a logical or a flattened
chip-id and not the actual hardware chip-id.

> firmware group id is a virtualized entity and users would not have drawn any
> conclusion based on the Linux Numa Node id.
> 
> On PowerNV platform since we have historically mapped Chip ID as Linux NUMA node
> id, we keep the existing Linux NUMA node id numbering.
> 
> Before Fix:
>  # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
> 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
> 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> node 1 size: 50912 MB
> node 1 free: 45248 MB
> node distances:
> node   0   1
>   0:  10  40
>   1:  40  10
> 
> After fix:
>  # numactl  -H
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
> 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
> 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> node 0 size: 50912 MB
> node 0 free: 49724 MB
> node distances:
> node   0
>   0:  10
> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/topology.h |  1 +
>  arch/powerpc/mm/numa.c  | 49 ++---
>  2 files changed, 39 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index f0b6300e7dd3..15b0424a27a8 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -118,5 +118,6 @@ int get_physical_package_id(int cpu);
>  #endif
>  #endif
> 
> +int firmware_group_id_to_nid(int firmware_gid);
>  #endif /* __KERNEL__ */
>  #endif   /* _ASM_POWERPC_TOPOLOGY_H */
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index e437a9ac4956..6c659aada55b 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
>   }
>  }
> 
> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
> +
> +int firmware_group_id_to_nid(int firmware_gid)
> +{
> + static int last_nid = 0;
> +
> + /*
> +  * For PowerNV we don't change the node id. This helps to avoid
> +  * confusion w.r.t the expected node ids. On pseries, node numbers
> +  * are virtualized. Hence do logical node id for pseries.
> +  */
> + if (!firmware_has_feature(FW_FEATURE_LPAR))
> + return firmware_gid;
> +
> + if (firmware_gid == -1)
> + return NUMA_NO_NODE;
> +
> + if (nid_map[firmware_gid] == NUMA_NO_NODE)
> + nid_map[firmware_gid] = last_nid++;

How do we ensure that 2 simultaneous firmware_group_id_to_nid() calls don't
end up at this place in parallel?
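One way to close that window could be a lock around the map update; a
minimal sketch, assuming a new, hypothetical nid_map_lock (not part of the
posted patch):

	static DEFINE_SPINLOCK(nid_map_lock);	/* hypothetical */
	...
	spin_lock(&nid_map_lock);
	if (nid_map[firmware_gid] == NUMA_NO_NODE)
		nid_map[firmware_gid] = last_nid++;
	nid = nid_map[firmware_gid];
	spin_unlock(&nid_map_lock);
	return nid;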

> +
> + return nid_map[firmware_gid];
> +}
> +
>  /* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>   * info is found.
>   */
>  static int associativity_to_nid(const __be32 *associativity)
>  {
>   int nid = NUMA_NO_NODE;
> + int firmware_gid = -1;
> 
>   if (!numa_enabled)
>   goto out;
> 
>   if (of_read_number(associativity, 1) >= min_common_depth)
> - nid = of_read_number(&associativity[min_common_depth], 1);
> + firmware_gid = of_read_number(&associativity[min_common_depth], 1);
> 
>   /* POWER4 LPAR uses 0xffff as invalid node */
> - if (nid == 0xffff || nid >= MAX_NUMNODES)
> - nid = NUMA_NO_NODE;
> + if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
> + firmware_gid = -1;

Let's assume two or more invocations of 

Re: [PATCH V5 0/4] powerpc/perf: Add support for perf extended regs in powerpc

2020-07-31 Thread Athira Rajeev



> On 31-Jul-2020, at 1:20 AM, Jiri Olsa  wrote:
> 
> On Thu, Jul 30, 2020 at 01:24:40PM +0530, Athira Rajeev wrote:
>> 
>> 
>>> On 27-Jul-2020, at 10:46 PM, Athira Rajeev  
>>> wrote:
>>> 
>>> Patch set to add support for perf extended register capability in
>>> powerpc. The capability flag PERF_PMU_CAP_EXTENDED_REGS, is used to
>>> indicate the PMU which support extended registers. The generic code
>>> define the mask of extended registers as 0 for non supported architectures.
>>> 
>>> Patches 1 and 2 are the kernel side changes needed to include
>>> base support for extended regs in powerpc and in power10.
>>> Patches 3 and 4 are the perf tools side changes needed to support the
>>> extended registers.
>>> 
>> 
>> Hi Arnaldo, Jiri
>> 
>> please let me know if you have any comments/suggestions on this patch series 
>> to add support for perf extended regs.
> 
> hi,
> can't really tell for powerpc, but in general
> perf tool changes look ok
> 

Hi Jiri,
Thanks for checking the patchset.

Athira.

> jirka



Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.8-8 tag

2020-07-31 Thread pr-tracker-bot
The pull request you sent on Fri, 31 Jul 2020 23:05:17 +1000:

> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-5.8-8

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/deacdb3e3979979016fcd0ffd518c320a62ad166

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker


Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal

2020-07-31 Thread Nathan Lynch
Michael Ellerman  writes:

> Nathan Lynch  writes:
>> Michael Ellerman  writes:
>>> Nathan Lynch  writes:
 Laurent Dufour  writes:
> On 28/07/2020 at 19:37, Nathan Lynch wrote:
>> The drmem lmb list can have hundreds of thousands of entries, and
>> unfortunately lookups take the form of linear searches. As long as
>> this is the case, traversals have the potential to monopolize the CPU
>> and provoke lockup reports, workqueue stalls, and the like unless
>> they explicitly yield.
>> 
>> Rather than placing cond_resched() calls within various
>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
>> expression of the loop macro itself so users can't omit it.
>
> Is that not too much to call cond_resched() on every LMB?
>
> Could that be less frequent, every 10, or 100, I don't really know ?

 Everything done within for_each_drmem_lmb is relatively heavyweight
 already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens
 of milliseconds. I don't think cond_resched() is an expensive check in
 this context.
>>>
>>> Hmm, mostly.
>>>
>>> But there are quite a few cases like drmem_update_dt_v1():
>>>
>>> for_each_drmem_lmb(lmb) {
>>> dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
>>> dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
>>> dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
>>> dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
>>>
>>> dr_cell++;
>>> }
>>>
>>> Which will compile to a pretty tight loop at the moment.
>>>
>>> Or drmem_update_dt_v2() which has two loops over all lmbs.
>>>
>>> And although the actual TIF check is cheap the function call to do it is
>>> not free.
>>>
>>> So I worry this is going to make some of those long loops take even
>>> longer.
>>
>> That's fair, and I was wrong - some of the loop bodies are relatively
>> simple, not doing allocations or taking locks, etc.
>>
>> One way to deal is to keep for_each_drmem_lmb() as-is and add a new
>> iterator that can reschedule, e.g. for_each_drmem_lmb_slow().
>
> If we did that, how many call-sites would need converting?
> Is it ~2 or ~20 or ~200?

At a glance I would convert 15-20 out of the 24 users in the tree I'm
looking at. Let me know if I should do a v2 with that approach.
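
For reference, Laurent's "every 10, or 100" idea could also be done on the
caller side without touching the iterator; a rough sketch, with the batch
size of 64 being an arbitrary assumption:

	unsigned long n = 0;
	struct drmem_lmb *lmb;

	for_each_drmem_lmb(lmb) {
		/* per-LMB work here */
		if (++n % 64 == 0)	/* yield only once per batch */
			cond_resched();
	}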


Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal

2020-07-31 Thread Michael Ellerman
Nathan Lynch  writes:
> Michael Ellerman  writes:
>> Nathan Lynch  writes:
>>> Laurent Dufour  writes:
 On 28/07/2020 at 19:37, Nathan Lynch wrote:
> The drmem lmb list can have hundreds of thousands of entries, and
> unfortunately lookups take the form of linear searches. As long as
> this is the case, traversals have the potential to monopolize the CPU
> and provoke lockup reports, workqueue stalls, and the like unless
> they explicitly yield.
> 
> Rather than placing cond_resched() calls within various
> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
> expression of the loop macro itself so users can't omit it.

 Is that not too much to call cond_resched() on every LMB?

 Could that be less frequent, every 10, or 100, I don't really know ?
>>>
>>> Everything done within for_each_drmem_lmb is relatively heavyweight
>>> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens
>>> of milliseconds. I don't think cond_resched() is an expensive check in
>>> this context.
>>
>> Hmm, mostly.
>>
>> But there are quite a few cases like drmem_update_dt_v1():
>>
>>  for_each_drmem_lmb(lmb) {
>>  dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
>>  dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
>>  dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
>>  dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
>>
>>  dr_cell++;
>>  }
>>
>> Which will compile to a pretty tight loop at the moment.
>>
>> Or drmem_update_dt_v2() which has two loops over all lmbs.
>>
>> And although the actual TIF check is cheap the function call to do it is
>> not free.
>>
>> So I worry this is going to make some of those long loops take even
>> longer.
>
> That's fair, and I was wrong - some of the loop bodies are relatively
> simple, not doing allocations or taking locks, etc.
>
> One way to deal is to keep for_each_drmem_lmb() as-is and add a new
> iterator that can reschedule, e.g. for_each_drmem_lmb_slow().

If we did that, how many call-sites would need converting?
Is it ~2 or ~20 or ~200?

> On the other hand... it's probably not too strong to say that the
> drmem/hotplug code is in crisis with respect to correctness and
> algorithmic complexity, so those are my overriding concerns right
> now. Yes, this change will pessimize loops that are reinitializing the
> entire drmem_lmb array on every DLPAR operation, but:
>
> 1. it doesn't make any user of for_each_drmem_lmb() less correct;
> 2. why is this code doing that in the first place, other than to
>accommodate a poor data structure choice?
>
> The duration of the system calls where this code runs are measured in
> minutes or hours on large configurations because of all the behaviors
> that are at best O(n) with the amount of memory assigned to the
> partition. For simplicity's sake I'd rather defer lower-level
> performance considerations like this until the drmem data structures'
> awful lookup properties are fixed -- hopefully in the 5.10 timeframe.

Yeah fair enough.

cheers


[GIT PULL] Please pull powerpc/linux.git powerpc-5.8-8 tag

2020-07-31 Thread Michael Ellerman
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi Linus,

Please pull one more powerpc fix for 5.8:

The following changes since commit f0479c4bcbd92d1a457d4a43bcab79f29d11334a:

  selftests/powerpc: Use proper error code to check fault address (2020-07-15 23:10:17 +1000)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-5.8-8

for you to fetch changes up to 909adfc66b9a1db21b5e8733e9ebfa6cd5135d74:

  powerpc/64s/hash: Fix hash_preload running with interrupts enabled (2020-07-27 17:02:09 +1000)

- --
powerpc fixes for 5.8 #8

Fix a bug introduced by the changes we made to lockless page table walking this
cycle. When using the hash MMU, and perf with callchain recording, we can
deadlock if the PMI interrupts a hash fault, and the callchain recording then
takes a hash fault on the same page.

Thanks to:
  Nicholas Piggin, Aneesh Kumar K.V, Anton Blanchard, Athira Rajeev.

- --
Nicholas Piggin (1):
  powerpc/64s/hash: Fix hash_preload running with interrupts enabled


 arch/powerpc/kernel/exceptions-64s.S  | 14 ---
 arch/powerpc/mm/book3s64/hash_utils.c | 25 
 arch/powerpc/perf/core-book3s.c   |  6 +
 3 files changed, 42 insertions(+), 3 deletions(-)
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl8kFsQACgkQUevqPMjh
pYAkbg/9HzJB8KbjAVJpoXHOf9+aHylhZKU2pcVtdoYE07VV6JabWnlI5flvTSEY
3Dvr3h/X0k2Z3n52+s1kYlW27ErU+o9Hbz1j4O41Ardiy14GmQ02FSi5iXjhKxuk
AjX2MN76n1Te+1ee2Ib/Ubu8KOrbxLIya223tJ0TNu7RWuYVaCO/lNXPPTwm2JWI
s9lryRAHNc4qkPMytsWi/gzNml3IP6cqnGVt1H5fmnOwQXUYbSjdmhBa7AwjCsA1
xJBMkxBxWOnUSknRCxaCCUTQ3sD+sJMzp6verlwEIb+HRh08iJyVzuWCpT9m0ZOG
5F3CvhIxwLatDsJ9N5EeGI2V4qBstBlHEUJJAjkwKgZxzmEm7j9H+ItAc421eo2A
t/SYJTK4JltllCDyB22jCxdEgJh+opv8rki7U+INI01I8gHzqsr/0CHleGtMN5Cq
fUBf25LOMUCpjeAO207j4DzeY6rTXz9H07XWrR4su//4S9bNkYUYpFfeE/3Kwwdq
vZCxjm+fxxLPyAW+E2D2EGsyzIJootrAzly5T7gun8qwgCSgoCxI+YV9/DLmYaGf
OjcG6+X58sfn1FTc8cD6udKCIMKh0JmSccZRqw1Ijp4EiGMnDZUbP9jg/aSciD8a
+oco9QGvvxYQTMut4XUkXtqqZqaTISgWFD/QB9f4KGJ1Rs8tW5Y=
=oaTB
-----END PGP SIGNATURE-----


[PATCH v4 12/12] PCI: Remove '*val = 0' from pcie_capability_read_*()

2020-07-31 Thread Saheed O. Bolarinwa
There are several reasons why a PCI capability read may fail whether the
device is present or not. If this happens, pcie_capability_read_*() will
return -EINVAL/PCIBIOS_BAD_REGISTER_NUMBER or PCIBIOS_DEVICE_NOT_FOUND
and *val is set to 0.

This behaviour is further ensured by this code inside
pcie_capability_read_*()

 ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
 /*
  * Reset *val to 0 if pci_read_config_dword() fails, it may
  * have been written as 0xFFFFFFFF if hardware error happens
  * during pci_read_config_dword().
  */
 if (ret)
 *val = 0;
 return ret;

a) Since all pci_generic_config_read() does is read a register value,
it may return success after reading a ~0 which *may* have been fabricated
by the PCI host bridge due to a read timeout. Hence pci_read_config_*()
will return success with a fabricated ~0 in *val, indicating a problem.
In this case, the assumed behaviour of pcie_capability_read_*() will be
wrong. To avoid errors slipping through, more checks are necessary.

b) pci_read_config_*() will return PCIBIOS_DEVICE_NOT_FOUND only if
dev->error_state = pci_channel_io_perm_failure (i.e.
pci_dev_is_disconnected()) or if pci_generic_config_read() can't find the
device. In both cases *val is initially set to ~0 but as shown in the code
above pcie_capability_read_*() resets it back to 0. Even with this effort,
drivers still have to perform validation checks, more so if 0 is a valid
value.

Most drivers only consider the case (b) and in some cases, there is the
expectation that on timeout *val has a fabricated value of ~0, which *may*
not always be true as explained in (a).

In any case, checks need to be done to validate the value read and maybe
confirm which error has occurred. It is better left to the drivers to do.
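
For illustration, the kind of validation a driver could then do (a rough
sketch; the error codes chosen here are illustrative, not mandated):

	u32 val;
	int ret;

	ret = pcie_capability_read_dword(dev, pos, &val);
	if (ret)		/* accessor reported an explicit failure */
		return ret;
	if (val == (u32)~0)	/* possibly fabricated by the host bridge */
		return -ENODEV;	/* e.g. treat as device gone or read timeout */
	/* val looks plausible; use it */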

Remove the reset of *val to 0 when pci_read_config_*() fails.

Suggested-by: Bjorn Helgaas 
Signed-off-by: Saheed O. Bolarinwa 
---
 drivers/pci/access.c | 14 --
 1 file changed, 14 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 79c4a2ef269a..ec95edbb1ac8 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -413,13 +413,6 @@ int pcie_capability_read_word(struct pci_dev *dev, int pos, u16 *val)
 
if (pcie_capability_reg_implemented(dev, pos)) {
ret = pci_read_config_word(dev, pci_pcie_cap(dev) + pos, val);
-   /*
-* Reset *val to 0 if pci_read_config_word() fails, it may
-* have been written as 0xFFFF if hardware error happens
-* during pci_read_config_word().
-*/
-   if (ret)
-   *val = 0;
return ret;
}
 
@@ -448,13 +441,6 @@ int pcie_capability_read_dword(struct pci_dev *dev, int pos, u32 *val)
 
if (pcie_capability_reg_implemented(dev, pos)) {
ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
-   /*
-* Reset *val to 0 if pci_read_config_dword() fails, it may
-* have been written as 0xFFFFFFFF if hardware error happens
-* during pci_read_config_dword().
-*/
-   if (ret)
-   *val = 0;
return ret;
}
 
-- 
2.18.4



[PATCH v4 10/12] PCI/AER: Check if pcie_capability_read_*() reads ~0

2020-07-31 Thread Saheed O. Bolarinwa
On failure pcie_capability_read_*() sets its last parameter, val,
to 0. However, with Patch 12/12, it is possible that val is set
to ~0 on failure. This would introduce a bug because
(x & x) == (~0 & x).

Since ~0 is an invalid value here, add an extra check for ~0 to
the if condition to confirm failure.

Suggested-by: Bjorn Helgaas 
Signed-off-by: Saheed O. Bolarinwa 
---
 drivers/pci/pcie/aer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 3acf56683915..dbeabc370efc 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -829,7 +829,7 @@ static bool is_error_source(struct pci_dev *dev, struct aer_err_info *e_info)
 
/* Check if AER is enabled */
	pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &reg16);
-   if (!(reg16 & PCI_EXP_AER_FLAGS))
+   if ((reg16 == (u16)~0) || !(reg16 & PCI_EXP_AER_FLAGS))
return false;
 
if (!aer)
-- 
2.18.4



Re: [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain

2020-07-31 Thread Michael Ellerman
Srikar Dronamraju  writes:
> * Michael Ellerman  [2020-07-31 17:45:37]:
>
>> Srikar Dronamraju  writes:
>> > Currently "CACHE" domain happens to be the 2nd sched domain as per
>> > powerpc_topology. This domain will collapse if cpumask of l2-cache is
>> > same as SMT domain. However we could generalize this domain such that it
>> > could either be a "CACHE" domain or a "BIGCORE" domain.
>> >
>> > While setting up the "CACHE" domain, check if shared_cache is already
>> > set.
>> 
>> PeterZ asked for some overview of what you're doing and why, you
>> responded to his mail, but I was expecting to see that text incorporated
>> here somewhere.
>> 
>
> Okay, do you want that as part of the code or documentation dir or the
> changelog?

I guess a comment is best, as that's most likely to be seen by people
looking at the code in future.

A little bit of overview in the change log is also good.

>> He also asked for some comments, which I would also like to see.
>> 
>> 
>> I'm also not clear why we want to rename it to "bigcore", that's not a
>> commonly understood term, I don't think it's clear to new readers what
>> it means.
>> 
>> Leaving it as the shared cache domain, and having a comment mentioning
>> that "bigcores" share a cache, would be clearer I think.
>> 
>
> Today, Shared cache is equal to Big Core. However in the not too distant future,
> Shared cache domain and Big Core may not be the same. For example let's
> assume that the L2 cache were to shrink per small core with the firmware
> exposing the core as a bigcore. Then with the current design, we have SMT
> == SHARED CACHE, and a DIE. We would not have any domain at the publicised 8
> thread level. Keeping the Bigcore as a domain and mapping the shared
> cache (I am resetting the domain name to CACHE if BIGCORE==SHARED_CACHE)
> helps us in this scenario.

Yeah OK.

In that scenario it's not really clear what the 8 thread level domain
expresses, if the two "halves" of the bigcore have separate L2s. But
presumably there is still some benefit to exposing it.

cheers


Re: [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group

2020-07-31 Thread Michael Ellerman
Srikar Dronamraju  writes:
> * Michael Ellerman  [2020-07-31 17:52:15]:
>
>> Srikar Dronamraju  writes:
>> > If allocated earlier and the search fails, then the cpumask needs to be
>> > freed. However cpu_l1_cache_map can be allocated after we search thread
>> > group.
>> 
>> It's not freed anywhere AFAICS?
>
> Yes, it's never freed. In fact we are never checking if
> zalloc_cpumask_var_node fails. It's not just this cpumask, but historically
> all the other existing cpumasks in arch/powerpc/kernel/smp.c are never
> freed/checked. I did dig into this a bit and it appears that ..
> (Please do correct me if I am wrong!! )

That's correct.

> Powerpc is using cpumask_var_t for all of the percpu variables. And it doesn't seem
> to enable CONFIG_CPUMASK_OFFSTACK even from the MAXSMP config.

I remember Rusty adding that code, but I don't know if we ever
considered enabling CPUMASK_OFFSTACK.

Probably we meant to but never got around to doing it.

> So from include/linux/cpumask.h
>
> typedef struct cpumask cpumask_var_t[1];
> and
> zalloc_cpumask_var_node ends up being cpumask_clear
>
> So I think historically we seem to assume we are always
> !CPUMASK_OFFSTACK and hence we don't need to check the return value or
> free anything.

Right.

> I would look forward to your comments on how we should handle this going
> forward. But I would keep this the same for this patchset.

Agreed, just clarify in the change log that it's not freed at the moment
because of CONFIG_CPUMASK_OFFSTACK=n

> One of the questions that I have is if we most likely are to be in
> !CONFIG_CPUMASK_OFFSTACK, then should we migrate to cpumask_t for percpu
> variables. 

I don't think so, cpumask_t is semi-deprecated AIUI.
  
> The reason being we end up using a NR_CPUS cpumask for each percpu cpumask
> variable instead of a cpumask_t pointer.

Our current defconfigs have NR_CPUS=2048, which is probably just small
enough to continue using OFFSTACK=n.

But we allow configuring NR_CPUS up to 8192, which surely would need
OFFSTACK=y in order to work.

So I think we need to stick with cpumask_var_t, but we should test with
OFFSTACK=y, and should probably be a bit more careful with checking the
allocations succeed.

And then we should select OFFSTACK=y for NR_CPUS above some threshold.
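
For example, a sketch of the checked-allocation pattern that would only
start to matter once CPUMASK_OFFSTACK=y makes the allocation real:

	if (!zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
				     GFP_KERNEL, cpu_to_node(cpu)))
		return -ENOMEM;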

cheers


[PATCH v4 00/12] PCI: Remove '*val = 0' from pcie_capability_read_*()

2020-07-31 Thread Saheed O. Bolarinwa
v4 CHANGES:
- Drop uses of pcie_capability_read_*() return value. This relates to
  [1], which points towards making the accessors return void.
- Remove patches found to be unnecessary
- Reword some commit messages

v3 CHANGES:
- Split previous PATCH 6/13 into two : PATCH 6/14 and PATCH 7/14
- Fix commit message of PATCH 5/14
- Update Patch numbering and Commit messages
- Add 'Acked by Greg KH' to PATCH 2/14
- Add PATCH version

v2 CHANGES:
- Fix missing comma, causing the email cc error
- Fix typos and numbering errors in commit messages
- Add commit message to 13/13
- Add two more patches: PATCH 3/13 and PATCH 4/13

MERGING:
- Patch 6/12 depends on Patch 5/12. However Patch 5/12 has no dependency.
  Please, merge PATCH 6/12 only after Patch 5/12.
- Patch 12/12 depends on all preceding patches. Please merge Patch 12/12
  only after other patches in this series have been merged.
- All other patches have no dependencies besides those mentioned above and
  can be merged individually.

PATCH 5/12:
Set the default case in the switch-statement to set status
to "Power On".

PATCH 1/12 to 11/12:
Use the value read by pcie_capability_read_*() to determine success or
failure. This is done by checking if it is ~0, while maintaining the
functions' behaviour. This ensures that the changes in PATCH 12/12 do
not introduce any bugs.

PATCH 12/12:
There are several reasons why a PCI capability read may fail whether the
device is present or not. If this happens, pcie_capability_read_*() will
return -EINVAL/PCIBIOS_BAD_REGISTER_NUMBER or PCIBIOS_DEVICE_NOT_FOUND
and *val is set to 0.

This behaviour is further ensured by this code inside
pcie_capability_read_*()

 ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
 /*
  * Reset *val to 0 if pci_read_config_dword() fails, it may
  * have been written as 0xFFFFFFFF if hardware error happens
  * during pci_read_config_dword().
  */
 if (ret)
 *val = 0;
 return ret;

a) Since all pci_generic_config_read() does is read a register value,
it may return success after reading a ~0 which *may* have been fabricated
by the PCI host bridge due to a read timeout. Hence pci_read_config_*() 
will return success with a fabricated ~0 in *val, indicating a problem.
In this case, the assumed behaviour of pcie_capability_read_*() will be
wrong. To avoid errors slipping through, more checks are necessary.

b) pci_read_config_*() will return PCIBIOS_DEVICE_NOT_FOUND only if 
dev->error_state = pci_channel_io_perm_failure (i.e. 
pci_dev_is_disconnected()) or if pci_generic_config_read() can't find the
device. In both cases *val is initially set to ~0 but as shown in the code
above pcie_capability_read_*() resets it back to 0. Even with this effort,
drivers still have to perform validation checks, more so if 0 is a valid
value.

Most drivers only consider the case (b) and in some cases, there is the 
expectation that on timeout *val has a fabricated value of ~0, which *may*
not always be true as explained in (a).

In any case, checks need to be done to validate the value read and maybe
confirm which error has occurred. It is better left to the drivers to do.

Check the return value of pcie_capability_read_dword() to ensure success
and avoid a bug as a result of Patch 12/12.
Remove the reset of *val to 0 when pci_read_config_*() fails.

[1] 
https://lore.kernel.org/linux-pci/20200714234625.GA428442@bjorn-Precision-5520/


Saheed O. Bolarinwa (12):
  IB/hfi1: Check if pcie_capability_read_*() reads ~0
  misc: rtsx: Check if pcie_capability_read_*() reads ~0
  ath9k: Check if pcie_capability_read_*() reads ~0
  iwlegacy: Check if pcie_capability_read_*() reads ~0
  PCI: pciehp: Set "Power On" as the default get_power_status
  PCI: pciehp: Check if pcie_capability_read_*() reads ~0
  PCI/ACPI: Check if pcie_capability_read_*() reads ~0
  PCI: Check if pcie_capability_read_*() reads ~0
  PCI/PM: Check if pcie_capability_read_*() reads ~0
  PCI/AER: Check if pcie_capability_read_*() reads ~0
  PCI/ASPM: Check if pcie_capability_read_*() reads ~0
  PCI: Remove '*val = 0' from pcie_capability_read_*()

 drivers/infiniband/hw/hfi1/aspm.c|  6 ++--
 drivers/misc/cardreader/rts5227.c|  2 +-
 drivers/misc/cardreader/rts5249.c|  2 +-
 drivers/misc/cardreader/rts5260.c|  2 +-
 drivers/misc/cardreader/rts5261.c|  2 +-
 drivers/net/wireless/ath/ath9k/pci.c |  3 +-
 drivers/net/wireless/intel/iwlegacy/common.c |  2 +-
 drivers/pci/access.c | 14 
 drivers/pci/hotplug/pciehp_hpc.c | 13 +---
 drivers/pci/pci-acpi.c   |  4 +--
 drivers/pci/pci.c| 34 ++--
 drivers/pci/pcie/aer.c   |  2 +-
 drivers/pci/pcie/aspm.c  | 10 +++---
 drivers/pci/probe.c  | 12 +++
 14 files changed, 56 insertions(+), 52 deletions(-)

-- 
2.18.4



[PATCH 2/2] powerpc/vmemmap: Don't warn if we don't find a mapping vmemmap list entry

2020-07-31 Thread Aneesh Kumar K.V
Now that we are handling vmemmap list allocation failure correctly, don't
WARN in section deactivate when we don't find a mapping vmemmap list entry.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/init_64.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 8a198504c0e1..3ffe8b84aab2 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -285,10 +285,8 @@ static unsigned long vmemmap_list_free(unsigned long start)
vmem_back_prev = vmem_back;
}
 
-   if (unlikely(!vmem_back)) {
-   WARN_ON(1);
+   if (unlikely(!vmem_back))
return 0;
-   }
 
/* remove it from vmemmap_list */
if (vmem_back == vmemmap_list) /* remove head */
-- 
2.26.2



[PATCH 1/2] powerpc/vmemmap: Fix memory leak with vmemmap list allocation failures.

2020-07-31 Thread Aneesh Kumar K.V
If we fail to allocate the vmemmap list, we don't keep track of the allocated
vmemmap block buf. Hence on section deactivate we skip the vmemmap block
buf free. This results in a memory leak.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/init_64.c | 35 ---
 1 file changed, 28 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 152aa0200cef..8a198504c0e1 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -162,16 +162,16 @@ static __meminit struct vmemmap_backing * vmemmap_list_alloc(int node)
return next++;
 }
 
-static __meminit void vmemmap_list_populate(unsigned long phys,
-   unsigned long start,
-   int node)
+static __meminit int vmemmap_list_populate(unsigned long phys,
+  unsigned long start,
+  int node)
 {
struct vmemmap_backing *vmem_back;
 
vmem_back = vmemmap_list_alloc(node);
if (unlikely(!vmem_back)) {
-   WARN_ON(1);
-   return;
+   pr_debug("vmemap list allocation failed\n");
+   return -ENOMEM;
}
 
vmem_back->phys = phys;
@@ -179,6 +179,7 @@ static __meminit void vmemmap_list_populate(unsigned long phys,
vmem_back->list = vmemmap_list;
 
vmemmap_list = vmem_back;
+   return 0;
 }
 
 static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
@@ -199,6 +200,7 @@ static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long star
 int __meminit vmemmap_populate(unsigned long start, unsigned long end, int 
node,
struct vmem_altmap *altmap)
 {
+   bool altmap_alloc;
unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
 
/* Align to the page size of the linear mapping. */
@@ -228,13 +230,32 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
p = altmap_alloc_block_buf(page_size, altmap);
if (!p)
pr_debug("altmap block allocation failed, 
falling back to system memory");
+   else
+   altmap_alloc = true;
}
-   if (!p)
+   if (!p) {
p = vmemmap_alloc_block_buf(page_size, node);
+   altmap_alloc = false;
+   }
if (!p)
return -ENOMEM;
 
-   vmemmap_list_populate(__pa(p), start, node);
+   if (vmemmap_list_populate(__pa(p), start, node)) {
+   /*
+* If we don't populate vmemmap list, we don't have
+* the ability to free the allocated vmemmap
+* pages in section_deactivate. Hence free them
+* here.
+*/
+   int nr_pfns = page_size >> PAGE_SHIFT;
+   unsigned long page_order = get_order(page_size);
+
+   if (altmap_alloc)
+   vmem_altmap_free(altmap, nr_pfns);
+   else
+   free_pages((unsigned long)p, page_order);
+   return -ENOMEM;
+   }
 
pr_debug("  * %016lx..%016lx allocated at %p\n",
 start, start + page_size, p);
-- 
2.26.2



Re: [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup

2020-07-31 Thread Michael Ellerman
Srikar Dronamraju  writes:
> * Michael Ellerman  [2020-07-31 17:49:55]:
>
>> Srikar Dronamraju  writes:
>> > Add support for grouping cores based on the device-tree classification.
>> > - The last domain in the associativity domains always refers to the
>> > core.
>> > - If primary reference domain happens to be the penultimate domain in
>> > the associativity domains device-tree property, then there are no
>> > coregroups. However if it's not a penultimate domain, then there are
>> > coregroups. There can be more than one coregroup. For now we would be
>> > interested in the last or the smallest coregroups.
>> 
>> This still doesn't tell me what a coregroup actually represents.
>> 
>> I get that it's a grouping of cores, and that the device tree specifies
>> it for us, but grouping based on what?
>
> We have just abstracted the fact that we are creating a sub-group of cores
> within a DIE. We are limiting to one sub-group per core. However this would
> allow the firmware the flexibility to vary the grouping. Once the firmware
> starts using this group, we could add more code to detect the type of
> grouping and adjust the sd domain flags accordingly.

OK. That's good info to have in the change log.

>> I think the answer is we aren't being told by firmware, it's just a
>> grouping based on some opaque performance characteristic and we just
>> have to take that as given.
>> 
>
> This is partially true. At this time, we don't have firmwares that can
> exploit this code. Once the firmwares start using this grouping, we could
> add more code to align the grouping to the scheduler topology.
>
>> But please explain that clearly in the change log and the code comments.
>> 
>
> Okay, I will do the needful.

Thanks.

cheers


Re: [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id

2020-07-31 Thread Michael Ellerman
Srikar Dronamraju  writes:
> * Michael Ellerman  [2020-07-31 18:02:21]:
>
>> Srikar Dronamraju  writes:
>> > Lookup the coregroup id from the associativity array.
>
> Thanks Michael for all your comments and inputs.
>
>> It's slightly strange that this is called in patch 9, but only properly
>> implemented here in patch 10.
>> 
>> I'm not saying you have to squash them together, but it would be good if
>> the change log for patch 9 mentioned that a subsequent commit will
>> complete the implementation and how that affects the behaviour.
>
> I probably got influenced by a few LKML community members who always add a
> stub and implement the gory details in a subsequent patch. I will surely add
> the change log in patch 9 about the subsequent patches.

That's OK, it's a valid way to do things, and can be good for keeping
the size of individual patches down to make them easier to review.

But yeah a mention in the change log of the preceding patch is helpful
for anyone looking at that commit on its own in the future.

cheers


[RFC PATCH 2/2] powerpc/powernv/cpufreq: Don't assume chip id is same as Linux node id

2020-07-31 Thread Aneesh Kumar K.V
On PowerNV platforms we always have 1:1 mapping between chip ID and
firmware group id. Use the helper to convert firmware group id to
node id instead of directly using chip ID as Linux node id.

NOTE: This doesn't have any functional change. On PowerNV platforms
we continue to have 1:1 mapping between firmware group id and
Linux node id.

Signed-off-by: Aneesh Kumar K.V 
---
 drivers/cpufreq/powernv-cpufreq.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c b/drivers/cpufreq/powernv-cpufreq.c
index 8646eb197cd9..ca82b6ac0989 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -27,6 +27,7 @@
 #include 
 #include  /* Required for cpu_sibling_mask() in UP configs */
 #include 
+#include 
 #include 
 
 #define POWERNV_MAX_PSTATES_ORDER  8
@@ -1069,8 +1070,14 @@ static int init_chip_info(void)
}
 
for (i = 0; i < nr_chips; i++) {
+   unsigned int nid;
+
chips[i].id = chip[i];
-   cpumask_copy(&chips[i].mask, cpumask_of_node(chip[i]));
+   /*
+* On PowerNV platforms firmware group id is same as chip id.
+*/
+   nid = firmware_group_id_to_nid(chip[i]);
+   cpumask_copy(&chips[i].mask, cpumask_of_node(nid));
INIT_WORK(&chips[i].throttle, powernv_cpufreq_work_fn);
for_each_cpu(cpu, &chips[i].mask)
per_cpu(chip_info, cpu) = &chips[i];
-- 
2.26.2



[RFC PATCH 1/2] powerpc/numa: Introduce logical numa id

2020-07-31 Thread Aneesh Kumar K.V
We use ibm,associativity and ibm,associativity-lookup-arrays to derive the numa
node numbers. These device tree properties are firmware indicated grouping of
resources based on their hierarchy in the platform. These numbers (group id) are
not sequential and hypervisor/firmware can follow different numbering schemes.
For ex: on powernv platforms, we group them in the below order.

 * - CCM node ID
 * - HW card ID
 * - HW module ID
 * - Chip ID
 * - Core ID

Based on ibm,associativity-reference-points we use one of the above group ids as
Linux NUMA node id. (On PowerNV platform Chip ID is used). This results
in Linux reporting non-linear NUMA node ids, which also results in Linux
reporting empty node 0 NUMA nodes.

This can be resolved by mapping the firmware provided group id to a logical Linux
NUMA id. In this patch, we do this only for pseries platforms considering the
firmware group id is a virtualized entity and users would not have drawn any
conclusion based on the Linux Numa Node id.

On PowerNV platform since we have historically mapped Chip ID as Linux NUMA node
id, we keep the existing Linux NUMA node id numbering.

Before Fix:
 # numactl -H
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 1 size: 50912 MB
node 1 free: 45248 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

After fix:
 # numactl  -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 
51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 50912 MB
node 0 free: 49724 MB
node distances:
node   0
  0:  10

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/topology.h |  1 +
 arch/powerpc/mm/numa.c  | 49 ++---
 2 files changed, 39 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f0b6300e7dd3..15b0424a27a8 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -118,5 +118,6 @@ int get_physical_package_id(int cpu);
 #endif
 #endif
 
+int firmware_group_id_to_nid(int firmware_gid);
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_TOPOLOGY_H */
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index e437a9ac4956..6c659aada55b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
}
 }
 
+static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE};
+
+int firmware_group_id_to_nid(int firmware_gid)
+{
+   static int last_nid = 0;
+
+   /*
+* For PowerNV we don't change the node id. This helps to avoid
+* confusion w.r.t the expected node ids. On pseries, node numbers
+* are virtualized. Hence do logical node id for pseries.
+*/
+   if (!firmware_has_feature(FW_FEATURE_LPAR))
+   return firmware_gid;
+
+   if (firmware_gid == -1)
+   return NUMA_NO_NODE;
+
+   if (nid_map[firmware_gid] == NUMA_NO_NODE)
+   nid_map[firmware_gid] = last_nid++;
+
+   return nid_map[firmware_gid];
+}
+
 /* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
  * info is found.
  */
 static int associativity_to_nid(const __be32 *associativity)
 {
int nid = NUMA_NO_NODE;
+   int firmware_gid = -1;
 
if (!numa_enabled)
goto out;
 
if (of_read_number(associativity, 1) >= min_common_depth)
-   nid = of_read_number(&associativity[min_common_depth], 1);
+   firmware_gid = of_read_number(&associativity[min_common_depth], 1);
 
	/* POWER4 LPAR uses 0xffff as invalid node */
-   if (nid == 0xffff || nid >= MAX_NUMNODES)
-   nid = NUMA_NO_NODE;
+   if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
+   firmware_gid = -1;
+
+   nid = firmware_group_id_to_nid(firmware_gid);
 
if (nid > 0 &&
-   of_read_number(associativity, 1) >= distance_ref_points_depth) {
+   of_read_number(associativity, 1) >= distance_ref_points_depth) {
/*
 * Skip the length field and send start of associativity array
 */
@@ -432,24 +458,25 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
 static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
 {
struct assoc_arrays aa = { .arrays = NULL };
-   int default_nid = NUMA_NO_NODE;
-   int nid = default_nid;
+   int nid = NUMA_NO_NODE, firmware_gid;
int rc, index;
 
if ((min_common_depth < 0) || !numa_enabled)
-   

Re: [PATCH] KVM: PPC: Book3S HV: fix an oops in kvmppc_uvmem_page_free()

2020-07-31 Thread Bharata B Rao
On Fri, Jul 31, 2020 at 01:37:00AM -0700, Ram Pai wrote:
> On Fri, Jul 31, 2020 at 09:59:40AM +0530, Bharata B Rao wrote:
> > On Thu, Jul 30, 2020 at 04:25:26PM -0700, Ram Pai wrote:
> > In our case, device pages that are in use are always associated with a valid
> > pvt member. See kvmppc_uvmem_get_page() which returns failure if it
> > runs out of device pfns and that will result in proper failure of
> > page-in calls.
> 
> looked at the code, and yes that code path looks correct. So my
> reasoning behind the root cause of this bug is incorrect. However the
> bug is surfacing and there must be a reason.
> 
> > 
> > For the case where we run out of device pfns, migrate_vma_finalize() will
> > restore the original PTE and will not replace the PTE with device private 
> > PTE.
> > 
> > Also kvmppc_uvmem_page_free() (=dev_pagemap_ops.page_free()) is never
> > called for non-device-private pages.
> 
> Yes. it should not be called. But as seen above in the stack trace, it is 
> called. 
> 
> What would cause the HMM to call ->page_free() on a page that is not
> associated with that device's pfn?

I believe it is being called for a device private page, you can verify
it when you hit it next time?

> 
> > 
> > This could be a use-after-free case possibly arising out of the new state
> > changes in HV. If so, this fix will only mask the bug and not address the
> > original problem.
> 
> I can verify by rerunning the tests, without the new state changes. But
> I do not see how those changes can cause this fault?
> 
> This could also be caused by a duplicate ->page_free() call due to some
> bug in the migrate_page path? Could there be a race between
> migrate_page() and a page_fault ?
> 
> 
> Regardless, kvmppc_uvmem_page_free() needs to be fixed. It should not
> access contents of pvt, without verifing pvt is valid.

We don't expect pvt to be NULL here. Checking for NULL and returning
isn't the right fix, I think.

Regards,
Bharata.


Re: [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id

2020-07-31 Thread Srikar Dronamraju
* Michael Ellerman  [2020-07-31 18:02:21]:

> Srikar Dronamraju  writes:
> > Lookup the coregroup id from the associativity array.
> 

Thanks Michael for all your comments and inputs.

> It's slightly strange that this is called in patch 9, but only properly
> implemented here in patch 10.
> 
> I'm not saying you have to squash them together, but it would be good if
> the change log for patch 9 mentioned that a subsequent commit will
> complete the implementation and how that affects the behaviour.
> 

I probably got influenced by a few LKML community members who always add a
stub and implement the gory details in a subsequent patch. I will surely add
the change log in patch 9 about the subsequent patches.

> cheers
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group

2020-07-31 Thread Srikar Dronamraju
* Michael Ellerman  [2020-07-31 17:52:15]:

> Srikar Dronamraju  writes:
> > If allocated earlier and the search fails, then the cpumask needs to be
> > freed. However cpu_l1_cache_map can be allocated after we search thread
> > group.
> 
> It's not freed anywhere AFAICS?
> 

Yes, it's never freed. In fact we are never checking if
zalloc_cpumask_var_node fails. It's not just this cpumask, but historically
all the other existing cpumasks in arch/powerpc/kernel/smp.c are never
freed/checked. I did dig into this a bit and it appears that ..
(Please do correct me if I am wrong!! )

Powerpc is using cpumask_var_t for all of the percpu variables. And it doesn't seem
to enable CONFIG_CPUMASK_OFFSTACK even from the MAXSMP config.

So from include/linux/cpumask.h

typedef struct cpumask cpumask_var_t[1];
and
zalloc_cpumask_var_node ends up being cpumask_clear

So I think historically we seem to assume we are always
!CPUMASK_OFFSTACK and hence we don't need to check the return value or
free anything.

I would look forward to your comments on how we should handle this going
forward. But I would keep this the same for this patchset.

One of the questions that I have is if we most likely are to be in
!CONFIG_CPUMASK_OFFSTACK, then should we migrate to cpumask_t for percpu
variables. 
 
The reason being we end up using a NR_CPUS cpumask for each percpu cpumask
variable instead of a cpumask_t pointer.

> And even after this change there's still an error path that doesn't free
> it, isn't there?
> 
> cheers
> 
> > Cc: linuxppc-dev 
> > Cc: LKML 
> > Cc: Michael Ellerman 
> > Cc: Nicholas Piggin 
> > Cc: Anton Blanchard 
> > Cc: Oliver O'Halloran 
> > Cc: Nathan Lynch 
> > Cc: Michael Neuling 
> > Cc: Gautham R Shenoy 
> > Cc: Ingo Molnar 
> > Cc: Peter Zijlstra 
> > Cc: Valentin Schneider 
> > Cc: Jordan Niethe 
> > Reviewed-by: Gautham R. Shenoy 
> > Signed-off-by: Srikar Dronamraju 
> > ---
> >  arch/powerpc/kernel/smp.c | 7 +++
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 698000c7f76f..dab96a1203ec 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -797,10 +797,6 @@ static int init_cpu_l1_cache_map(int cpu)
> > if (err)
> > goto out;
> >  
> > -   zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > -   GFP_KERNEL,
> > -   cpu_to_node(cpu));
> > -
> > cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
> >  
> > if (unlikely(cpu_group_start == -1)) {
> > @@ -809,6 +805,9 @@ static int init_cpu_l1_cache_map(int cpu)
> > goto out;
> > }
> >  
> > +   zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > +   GFP_KERNEL, cpu_to_node(cpu));
> > +
> > for (i = first_thread; i < first_thread + threads_per_core; i++) {
> > int i_group_start = get_cpu_thread_group_start(i, &tg);
> >  
> > -- 
> > 2.17.1

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain

2020-07-31 Thread Srikar Dronamraju
* Michael Ellerman  [2020-07-31 17:45:37]:

> Srikar Dronamraju  writes:
> > Currently "CACHE" domain happens to be the 2nd sched domain as per
> > powerpc_topology. This domain will collapse if cpumask of l2-cache is
> > same as SMT domain. However we could generalize this domain such that it
> > could either be a "CACHE" domain or a "BIGCORE" domain.
> >
> > While setting up the "CACHE" domain, check if shared_cache is already
> > set.
> 
> PeterZ asked for some overview of what you're doing and why, you
> responded to his mail, but I was expecting to see that text incorporated
> here somewhere.
> 

Okay, do you want that as part of the code or documentation dir or the
changelog?

> He also asked for some comments, which I would also like to see.
> 
> 
> I'm also not clear why we want to rename it to "bigcore", that's not a
> commonly understood term, I don't think it's clear to new readers what
> it means.
> 
> Leaving it as the shared cache domain, and having a comment mentioning
> that "bigcores" share a cache, would be clearer I think.
> 

Today, Shared cache is equal to Big Core. However in the not too distant future,
Shared cache domain and Big Core may not be the same. For example let's
assume that the L2 cache were to shrink per small core with the firmware
exposing the core as a bigcore. Then with the current design, we have SMT
== SHARED CACHE, and a DIE. We would not have any domain at the publicised 8
thread level. Keeping the Bigcore as a domain and mapping the shared
cache (I am resetting the domain name to CACHE if BIGCORE==SHARED_CACHE)
helps us in this scenario.

> cheers
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup

2020-07-31 Thread Srikar Dronamraju
* Michael Ellerman  [2020-07-31 17:49:55]:

> Srikar Dronamraju  writes:
> > Add support for grouping cores based on the device-tree classification.
> > - The last domain in the associativity domains always refers to the
> > core.
> > - If primary reference domain happens to be the penultimate domain in
> > the associativity domains device-tree property, then there are no
> > coregroups. However if it's not a penultimate domain, then there are
> > coregroups. There can be more than one coregroup. For now we would be
> > interested in the last or the smallest coregroups.
> 
> This still doesn't tell me what a coregroup actually represents.
> 
> I get that it's a grouping of cores, and that the device tree specifies
> it for us, but grouping based on what?
> 

We have just abstracted the fact that we are creating a sub-group of cores
within a DIE. We are limiting to one sub-group per core. However this would
allow the firmware the flexibility to vary the grouping. Once the firmware
starts using this group, we could add more code to detect the type of
grouping and adjust the sd domain flags accordingly.

> I think the answer is we aren't being told by firmware, it's just a
> grouping based on some opaque performance characteristic and we just
> have to take that as given.
> 

This is partially true. At this time, we don't have firmwares that can
exploit this code. Once the firmwares start using this grouping, we could
add more code to align the grouping to the scheduler topology.

> But please explain that clearly in the change log and the code comments.
> 

Okay, I will do the needful.

> cheers
> 

-- 
Thanks and Regards
Srikar Dronamraju


Re: [PATCH] KVM: PPC: Book3S HV: fix an oops in kvmppc_uvmem_page_free()

2020-07-31 Thread Ram Pai
On Fri, Jul 31, 2020 at 09:59:40AM +0530, Bharata B Rao wrote:
> On Thu, Jul 30, 2020 at 04:25:26PM -0700, Ram Pai wrote:
> > Observed the following oops while stress-testing, using multiple
> > secureVMs on a distro kernel. However this issue theoretically exists in
> > 5.5 kernels and later.
> > 
> > This issue occurs when the total number of requested device-PFNs exceeds
> > the total number of available device-PFNs.  PFN migration fails to
> > allocate a device-pfn, which causes migrate_vma_finalize() to trigger
> > kvmppc_uvmem_page_free() on a page, that is not associated with any
> > device-pfn.  kvmppc_uvmem_page_free() blindly tries to access the
> > contents of the private data which can be null, leading to the following
> > kernel fault.
> > 
> >  --
> >  Unable to handle kernel paging request for data at address 0x0011
> >  Faulting instruction address: 0xc0080e36e110
> >  Oops: Kernel access of bad area, sig: 11 [#1]
> >  LE SMP NR_CPUS=2048 NUMA PowerNV
> > 
> >  MSR:  9280b033 
> >  CR: 24424822  XER: 
> >  CFAR: c0e3d764 DAR: 0011 DSISR: 4000 IRQMASK: 0
> >  GPR00: c0080e36e0a4 c01f1d59f610 c0080e38a400 
> >  GPR04: c01fa500 fffe  c000201fffeaf300
> >  GPR08: 01f0  0f80 c0080e373608
> >  GPR12: c0e3d710 c000201fffeaf300 0001 7fef8736
> >  GPR16: 7fff97db4410 c000201c3b66a578  
> >  GPR20: 000119db9ad0 000a fffc 0001
> >  GPR24: c000201c3b66 c01f1d59f7a0 c04cffb0 0001
> >  GPR28:  c00a001ff003e000 c0080e386150 0f80
> >  NIP [c0080e36e110] kvmppc_uvmem_page_free+0xc8/0x210 [kvm_hv]
> >  LR [c0080e36e0a4] kvmppc_uvmem_page_free+0x5c/0x210 [kvm_hv]
> >  Call Trace:
> >  [c0512010] free_devmap_managed_page+0xd0/0x100
> >  [c03f71d0] put_devmap_managed_page+0xa0/0xc0
> >  [c04d24bc] migrate_vma_finalize+0x32c/0x410
> >  [c0080e36e828] kvmppc_svm_page_in.constprop.5+0xa0/0x460 [kvm_hv]
> >  [c0080e36eddc] kvmppc_uv_migrate_mem_slot.isra.2+0x1f4/0x230 [kvm_hv]
> >  [c0080e36fa98] kvmppc_h_svm_init_done+0x90/0x170 [kvm_hv]
> >  [c0080e35bb14] kvmppc_pseries_do_hcall+0x1ac/0x10a0 [kvm_hv]
> >  [c0080e35edf4] kvmppc_vcpu_run_hv+0x83c/0x1060 [kvm_hv]
> >  [c0080e95eb2c] kvmppc_vcpu_run+0x34/0x48 [kvm]
> >  [c0080e95a2dc] kvm_arch_vcpu_ioctl_run+0x374/0x830 [kvm]
> >  [c0080e9433b4] kvm_vcpu_ioctl+0x45c/0x7c0 [kvm]
> >  [c05451d0] do_vfs_ioctl+0xe0/0xaa0
> >  [c0545d64] sys_ioctl+0xc4/0x160
> >  [c000b408] system_call+0x5c/0x70
> >  Instruction dump:
> >  a12d1174 2f89 409e0158 a1271172 3929 b1271172 7c2004ac 3920
> >  913e0140 3920 e87d0010 f93d0010 <89230011> e8c3 e9030008 2f89
> >  --
> > 
> >  Fix the oops..
> > 
> > fixes: ca9f49 ("KVM: PPC: Book3S HV: Support for running secure guests")
> > Signed-off-by: Ram Pai 
> > ---
> >  arch/powerpc/kvm/book3s_hv_uvmem.c | 6 --
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > index 2806983..f4002bf 100644
> > --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> > +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> > @@ -1018,13 +1018,15 @@ static void kvmppc_uvmem_page_free(struct page *page)
> >  {
> > unsigned long pfn = page_to_pfn(page) -
> > (kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT);
> > -   struct kvmppc_uvmem_page_pvt *pvt;
> > +   struct kvmppc_uvmem_page_pvt *pvt = page->zone_device_data;
> > +
> > +   if (!pvt)
> > +   return;
> >  
> > spin_lock(&kvmppc_uvmem_bitmap_lock);
> > bitmap_clear(kvmppc_uvmem_bitmap, pfn, 1);
> > spin_unlock(&kvmppc_uvmem_bitmap_lock);
> >  
> > -   pvt = page->zone_device_data;
> > page->zone_device_data = NULL;
> > if (pvt->remove_gfn)
> > kvmppc_gfn_remove(pvt->gpa >> PAGE_SHIFT, pvt->kvm);
> 
> In our case, device pages that are in use are always associated with a valid
> pvt member. See kvmppc_uvmem_get_page() which returns failure if it
> runs out of device pfns and that will result in proper failure of
> page-in calls.

looked at the code, and yes that code path looks correct. So my
reasoning behind the root cause of this bug is incorrect. However the
bug is surfacing and there must be a reason.

> 
> For the case where we run out of device pfns, migrate_vma_finalize() will
> restore the original PTE and will not replace the PTE with device private PTE.
> 
> Also kvmppc_uvmem_page_free() (=dev_pagemap_ops.page_free()) is never
> called for non-device-private pages.

Yes. it should not be 

[PATCH v3 4/4] powerpc sstep: add testcases for vsx load/store instructions

2020-07-31 Thread Balamuruhan S
add testcases for vsx load/store vector paired instructions,
* Load VSX Vector Paired (lxvp)
* Load VSX Vector Paired Indexed (lxvpx)
* Prefixed Load VSX Vector Paired (plxvp)
* Store VSX Vector Paired (stxvp)
* Store VSX Vector Paired Indexed (stxvpx)
* Prefixed Store VSX Vector Paired (pstxvp)

Suggested-by: Ravi Bangoria 
Signed-off-by: Balamuruhan S 
---
 arch/powerpc/lib/test_emulate_step.c | 252 +++
 1 file changed, 252 insertions(+)

diff --git a/arch/powerpc/lib/test_emulate_step.c b/arch/powerpc/lib/test_emulate_step.c
index d242e9f72e0c..f16934b80511 100644
--- a/arch/powerpc/lib/test_emulate_step.c
+++ b/arch/powerpc/lib/test_emulate_step.c
@@ -612,6 +612,255 @@ static void __init test_lxvd2x_stxvd2x(void)
 }
 #endif /* CONFIG_VSX */
 
+#ifdef CONFIG_VSX
+static void __init test_lxvp_stxvp(void)
+{
+   struct pt_regs regs;
+   union {
+   vector128 a;
+   u32 b[4];
+   } c[2];
+   u32 cached_b[8];
+   int stepped = -1;
+
+   init_pt_regs(&regs);
+
+   /*** lxvp ***/
+
+   cached_b[0] = c[0].b[0] = 18233;
+   cached_b[1] = c[0].b[1] = 34863571;
+   cached_b[2] = c[0].b[2] = 834;
+   cached_b[3] = c[0].b[3] = 6138911;
+   cached_b[4] = c[1].b[0] = 1234;
+   cached_b[5] = c[1].b[1] = 5678;
+   cached_b[6] = c[1].b[2] = 91011;
+   cached_b[7] = c[1].b[3] = 121314;
+
+   regs.gpr[4] = (unsigned long)&c[0].a;
+
+   /*
+* lxvp XTp,DQ(RA)
+* XTp = 32×TX + 2×Tp
+* let TX=1 Tp=1 RA=4 DQ=0
+*/
+   stepped = emulate_step(&regs, ppc_inst(PPC_LXVP(1, 1, 4, 0)));
+
+   if (stepped == 1 && cpu_has_feature(CPU_FTR_VSX)) {
+   show_result("lxvp", "PASS");
+   } else {
+   if (!cpu_has_feature(CPU_FTR_VSX))
+   show_result("lxvp", "PASS (!CPU_FTR_VSX)");
+   else
+   show_result("lxvp", "FAIL");
+   }
+
+   /*** stxvp ***/
+
+   c[0].b[0] = 21379463;
+   c[0].b[1] = 87;
+   c[0].b[2] = 374234;
+   c[0].b[3] = 4;
+   c[1].b[0] = 90;
+   c[1].b[1] = 122;
+   c[1].b[2] = 555;
+   c[1].b[3] = 32144;
+
+   /*
+* stxvp XSp,DQ(RA)
+* XSp = 32×SX + 2×Sp
+* let SX=1 Sp=1 RA=4 DQ=0
+*/
+   stepped = emulate_step(&regs, ppc_inst(PPC_STXVP(1, 1, 4, 0)));
+
+   if (stepped == 1 && cached_b[0] == c[0].b[0] && cached_b[1] == c[0].b[1] &&
+   cached_b[2] == c[0].b[2] && cached_b[3] == c[0].b[3] &&
+   cached_b[4] == c[1].b[0] && cached_b[5] == c[1].b[1] &&
+   cached_b[6] == c[1].b[2] && cached_b[7] == c[1].b[3] &&
+   cpu_has_feature(CPU_FTR_VSX)) {
+   show_result("stxvp", "PASS");
+   } else {
+   if (!cpu_has_feature(CPU_FTR_VSX))
+   show_result("stxvp", "PASS (!CPU_FTR_VSX)");
+   else
+   show_result("stxvp", "FAIL");
+   }
+}
+#else
+static void __init test_lxvp_stxvp(void)
+{
+   show_result("lxvp", "SKIP (CONFIG_VSX is not set)");
+   show_result("stxvp", "SKIP (CONFIG_VSX is not set)");
+}
+#endif /* CONFIG_VSX */
+
+#ifdef CONFIG_VSX
+static void __init test_lxvpx_stxvpx(void)
+{
+   struct pt_regs regs;
+   union {
+   vector128 a;
+   u32 b[4];
+   } c[2];
+   u32 cached_b[8];
+   int stepped = -1;
+
+   init_pt_regs(&regs);
+
+   /*** lxvpx ***/
+
+   cached_b[0] = c[0].b[0] = 18233;
+   cached_b[1] = c[0].b[1] = 34863571;
+   cached_b[2] = c[0].b[2] = 834;
+   cached_b[3] = c[0].b[3] = 6138911;
+   cached_b[4] = c[1].b[0] = 1234;
+   cached_b[5] = c[1].b[1] = 5678;
+   cached_b[6] = c[1].b[2] = 91011;
+   cached_b[7] = c[1].b[3] = 121314;
+
+   regs.gpr[3] = (unsigned long)&c[0].a;
+   regs.gpr[4] = 0;
+
+   /*
+* lxvpx XTp,RA,RB
+* XTp = 32×TX + 2×Tp
+* let TX=1 Tp=1 RA=3 RB=4
+*/
+   stepped = emulate_step(&regs, ppc_inst(PPC_LXVPX(1, 1, 3, 4)));
+
+   if (stepped == 1 && cpu_has_feature(CPU_FTR_VSX)) {
+   show_result("lxvpx", "PASS");
+   } else {
+   if (!cpu_has_feature(CPU_FTR_VSX))
+   show_result("lxvpx", "PASS (!CPU_FTR_VSX)");
+   else
+   show_result("lxvpx", "FAIL");
+   }
+
+   /*** stxvpx ***/
+
+   c[0].b[0] = 21379463;
+   c[0].b[1] = 87;
+   c[0].b[2] = 374234;
+   c[0].b[3] = 4;
+   c[1].b[0] = 90;
+   c[1].b[1] = 122;
+   c[1].b[2] = 555;
+   c[1].b[3] = 32144;
+
+   /*
+* stxvpx XSp,RA,RB
+* XSp = 32×SX + 2×Sp
+* let SX=1 Sp=1 RA=3 RB=4
+*/
+   stepped = emulate_step(&regs, ppc_inst(PPC_STXVPX(1, 1, 3, 4)));
+
+   if (stepped == 1 && cached_b[0] == c[0].b[0] && cached_b[1] == c[0].b[1] &&
+   cached_b[2] == c[0].b[2] && cached_b[3] == c[0].b[3] &&
+   cached_b[4] == c[1].b[0] && cached_b[5] == c[1].b[1] &&
+   cached_b[6] == c[1].b[2] && cached_b[7] == c[1].b[3] &&
+   cpu_has_feature(CPU_FTR_VSX)) {
+   show_result("stxvpx", "PASS");
+   } else {
+   if (!cpu_has_feature(CPU_FTR_VSX))
+   show_result("stxvpx", "PASS (!CPU_FTR_VSX)");
+   else
+   show_result("stxvpx", "FAIL");
+   }
+}

[PATCH v3 3/4] powerpc ppc-opcode: add encoding macros for vsx vector paired instructions

2020-07-31 Thread Balamuruhan S
add instruction encoding, extended opcodes, regs and DQ immediate macro
for new vsx vector paired instructions,
* Load VSX Vector Paired (lxvp)
* Load VSX Vector Paired Indexed (lxvpx)
* Prefixed Load VSX Vector Paired (plxvp)
* Store VSX Vector Paired (stxvp)
* Store VSX Vector Paired Indexed (stxvpx)
* Prefixed Store VSX Vector Paired (pstxvp)

Signed-off-by: Balamuruhan S 
---
 arch/powerpc/include/asm/ppc-opcode.h | 17 +
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index 4c0bdafb6a7b..6ad23f47d06a 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -78,6 +78,7 @@
 
#define IMM_L(i)   ((uintptr_t)(i) & 0xffff)
 #define IMM_DS(i)  ((uintptr_t)(i) & 0xfffc)
+#define IMM_DQ(i)  (((uintptr_t)(i) & 0xfff) << 4)
 
 /*
  * 16-bit immediate helper macros: HA() is for use with sign-extending instrs
@@ -272,6 +273,8 @@
#define PPC_INST_STFD  0xd8000000
#define PPC_PREFIX_MLS 0x06000000
#define PPC_PREFIX_8LS 0x04000000
+#define PPC_PLXVP_EX_OP0xe8000000
+#define PPC_PSTXVP_EX_OP   0xf8000000
 
 /* Prefixed instructions */
 #define PPC_INST_PLD   0xe400
@@ -296,6 +299,8 @@
#define __PPC_XS(s)((((s) & 0x1f) << 21) | (((s) & 0x20) >> 5))
 #define __PPC_XT(s)__PPC_XS(s)
 #define __PPC_T_TLB(t) (((t) & 0x3) << 21)
+#define __PPC_TP(tp)   (((tp) & 0xf) << 22)
+#define __PPC_TX(tx)   (((tx) & 0x1) << 21)
 #define __PPC_WC(w)(((w) & 0x3) << 21)
 #define __PPC_WS(w)(((w) & 0x1f) << 11)
 #define __PPC_SH(s)__PPC_WS(s)
@@ -387,6 +392,18 @@
 #define PPC_RAW_STXVD2X(s, a, b)   (0x7c000798 | VSX_XX1((s), a, b))
#define PPC_RAW_LXVD2X(s, a, b)(0x7c000698 | VSX_XX1((s), a, b))
#define PPC_RAW_MFVRD(a, t)(0x7c000066 | VSX_XX1((t) + 32, a, R0))
+#define PPC_LXVP(tp, tx, a, i) \
+   (0x18000000 | __PPC_TP(tp) | __PPC_TX(tx) | ___PPC_RA(a) | IMM_DQ(i))
+#define PPC_STXVP(sp, sx, a, i) \
+   (0x18000001 | __PPC_TP(sp) | __PPC_TX(sx) | ___PPC_RA(a) | IMM_DQ(i) | 0x1)
+#define PPC_LXVPX(tp, tx, a, b) \
+   (0x7c00029a | __PPC_TP(tp) | __PPC_TX(tx) | ___PPC_RA(a) | ___PPC_RB(b))
+#define PPC_STXVPX(sp, sx, a, b) \
+   (0x7c00039a | __PPC_TP(sp) | __PPC_TX(sx) | ___PPC_RA(a) | ___PPC_RB(b))
+#define PPC_PLXVP(a, i, pr, tp, tx) \
+   ((PPC_PREFIX_8LS | __PPC_PRFX_R(pr) | IMM_H(i)) << 32 | (PPC_PLXVP_EX_OP | __PPC_TP(tp) | __PPC_TX(tx) | ___PPC_RA(a) | IMM_L(i)))
+#define PPC_PSTXVP(a, i, pr, sp, sx) \
+   ((PPC_PREFIX_8LS | __PPC_PRFX_R(pr) | IMM_H(i)) << 32 | (PPC_PSTXVP_EX_OP | __PPC_TP(sp) | __PPC_TX(sx) | ___PPC_RA(a) | IMM_L(i)))
 #define PPC_RAW_MTVRD(t, a)(0x7c000166 | VSX_XX1((t) + 32, a, R0))
#define PPC_RAW_VPMSUMW(t, a, b)   (0x10000488 | VSX_XX3((t), a, b))
#define PPC_RAW_VPMSUMD(t, a, b)   (0x100004c8 | VSX_XX3((t), a, b))
-- 
2.24.1



[PATCH v3 2/4] powerpc/sstep: support emulation for vsx vector paired storage access instructions

2020-07-31 Thread Balamuruhan S
add emulate_step() changes to support vsx vector paired storage
access instructions that provide octword operand loads/stores
between storage and the set of 64 Vector Scalar Registers (VSRs).

Suggested-by: Ravi Bangoria 
Suggested-by: Naveen N. Rao 
Signed-off-by: Balamuruhan S 
---
 arch/powerpc/lib/sstep.c | 77 +++-
 1 file changed, 60 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 22147257d74d..01e1a3adc406 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -280,6 +280,19 @@ static nokprobe_inline void do_byte_reverse(void *ptr, int nb)
up[1] = tmp;
break;
}
+   case 32: {
+   unsigned long *up = (unsigned long *)ptr;
+   unsigned long tmp;
+
+   tmp = byterev_8(up[0]);
+   up[0] = byterev_8(up[3]);
+   up[3] = tmp;
+   tmp = byterev_8(up[2]);
+   up[2] = byterev_8(up[1]);
+   up[1] = tmp;
+   break;
+   }
+
 #endif
default:
WARN_ON_ONCE(1);
@@ -710,6 +723,8 @@ void emulate_vsx_load(struct instruction_op *op, union vsx_reg *reg,
reg->d[0] = reg->d[1] = 0;
 
switch (op->element_size) {
+   case 32:
+   /* [p]lxvp[x] */
case 16:
/* whole vector; lxv[x] or lxvl[l] */
if (size == 0)
@@ -718,7 +733,7 @@ void emulate_vsx_load(struct instruction_op *op, union vsx_reg *reg,
if (IS_LE && (op->vsx_flags & VSX_LDLEFT))
rev = !rev;
if (rev)
-   do_byte_reverse(reg, 16);
+   do_byte_reverse(reg, size);
break;
case 8:
/* scalar loads, lxvd2x, lxvdsx */
@@ -794,6 +809,20 @@ void emulate_vsx_store(struct instruction_op *op, const union vsx_reg *reg,
size = GETSIZE(op->type);
 
switch (op->element_size) {
+   case 32:
+   /* [p]stxvp[x] */
+   if (size == 0)
+   break;
+   if (rev) {
+   /* reverse 32 bytes */
+   buf.d[0] = byterev_8(reg->d[3]);
+   buf.d[1] = byterev_8(reg->d[2]);
+   buf.d[2] = byterev_8(reg->d[1]);
+   buf.d[3] = byterev_8(reg->d[0]);
+   reg = &buf;
+   }
+   memcpy(mem, reg, size);
+   break;
case 16:
/* stxv, stxvx, stxvl, stxvll */
if (size == 0)
@@ -862,28 +891,35 @@ static nokprobe_inline int do_vsx_load(struct instruction_op *op,
   bool cross_endian)
 {
int reg = op->reg;
-   u8 mem[16];
-   union vsx_reg buf;
+   int i, nr_vsx_regs;
+   u8 mem[32];
+   union vsx_reg buf[2];
int size = GETSIZE(op->type);
 
if (!address_ok(regs, ea, size) || copy_mem_in(mem, ea, size, regs))
return -EFAULT;
 
-   emulate_vsx_load(op, &buf, mem, cross_endian);
+   nr_vsx_regs = size / sizeof(__vector128);
+   emulate_vsx_load(op, buf, mem, cross_endian);
preempt_disable();
if (reg < 32) {
/* FP regs + extensions */
if (regs->msr & MSR_FP) {
-   load_vsrn(reg, &buf);
+   for (i = 0; i < nr_vsx_regs; i++)
+   load_vsrn(reg + i, &buf[i].v);
} else {
-   current->thread.fp_state.fpr[reg][0] = buf.d[0];
-   current->thread.fp_state.fpr[reg][1] = buf.d[1];
+   for (i = 0; i < nr_vsx_regs; i++) {
+   current->thread.fp_state.fpr[reg + i][0] = buf[i].d[0];
+   current->thread.fp_state.fpr[reg + i][1] = buf[i].d[1];
+   }
}
} else {
if (regs->msr & MSR_VEC)
-   load_vsrn(reg, &buf);
+   for (i = 0; i < nr_vsx_regs; i++)
+   load_vsrn(reg + i, &buf[i].v);
else
-   current->thread.vr_state.vr[reg - 32] = buf.v;
+   for (i = 0; i < nr_vsx_regs; i++)
+   current->thread.vr_state.vr[reg - 32 + i] = buf[i].v;
}
preempt_enable();
return 0;
@@ -894,30 +930,37 @@ static nokprobe_inline int do_vsx_store(struct instruction_op *op,
bool cross_endian)
 {
int reg = op->reg;
-   u8 mem[16];
-   union vsx_reg buf;
+   int i, nr_vsx_regs;
+   u8 mem[32];
+   union vsx_reg buf[2];
int size = GETSIZE(op->type);
 
if (!address_ok(regs, ea, size))
return -EFAULT;
 
+   nr_vsx_regs = size / sizeof(__vector128);
 

[PATCH v3 1/4] powerpc/sstep: support new VSX vector paired storage access instructions

2020-07-31 Thread Balamuruhan S
VSX Vector Paired instructions load/store an octword (32 bytes)
between storage and two sequential VSRs. Add `analyse_instr()` support
for these new instructions,
* Load VSX Vector Paired (lxvp)
* Load VSX Vector Paired Indexed (lxvpx)
* Prefixed Load VSX Vector Paired (plxvp)
* Store VSX Vector Paired (stxvp)
* Store VSX Vector Paired Indexed (stxvpx)
* Prefixed Store VSX Vector Paired (pstxvp)

Suggested-by: Naveen N. Rao 
Signed-off-by: Balamuruhan S 
---
 arch/powerpc/lib/sstep.c | 45 
 1 file changed, 45 insertions(+)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index c58ea9e787cb..22147257d74d 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -32,6 +32,10 @@ extern char system_call_vectored_emulate[];
#define XER_OV32   0x00080000U
#define XER_CA32   0x00040000U
 
+#ifdef CONFIG_VSX
+#define VSX_REGISTER_XTP(rd)   ((((rd) & 1) << 5) | ((rd) & 0xfe))
+#endif
+
 #ifdef CONFIG_PPC_FPU
 /*
  * Functions in ldstfp.S
@@ -2386,6 +2390,14 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
op->vsx_flags = VSX_SPLAT;
break;
 
+   case 333:   /* lxvpx */
+   if (!cpu_has_feature(CPU_FTR_ARCH_31))
+   return -1;
+   op->reg = VSX_REGISTER_XTP(rd);
+   op->type = MKOP(LOAD_VSX, 0, 32);
+   op->element_size = 32;
+   break;
+
case 364:   /* lxvwsx */
op->reg = rd | ((word & 1) << 5);
op->type = MKOP(LOAD_VSX, 0, 4);
@@ -2414,6 +2426,13 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
VSX_CHECK_VEC;
break;
}
+   case 461:   /* stxvpx */
+   if (!cpu_has_feature(CPU_FTR_ARCH_31))
+   return -1;
+   op->reg = VSX_REGISTER_XTP(rd);
+   op->type = MKOP(STORE_VSX, 0, 32);
+   op->element_size = 32;
+   break;
case 524:   /* lxsspx */
op->reg = rd | ((word & 1) << 5);
op->type = MKOP(LOAD_VSX, 0, 4);
@@ -2655,6 +2674,22 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
 #endif
 
 #ifdef CONFIG_VSX
+   case 6:
+   if (!cpu_has_feature(CPU_FTR_ARCH_31))
+   return -1;
+   op->ea = dqform_ea(word, regs);
+   op->reg = VSX_REGISTER_XTP(rd);
+   op->element_size = 32;
+   switch (word & 0xf) {
+   case 0: /* lxvp */
+   op->type = MKOP(LOAD_VSX, 0, 32);
+   break;
+   case 1: /* stxvp */
+   op->type = MKOP(STORE_VSX, 0, 32);
+   break;
+   }
+   break;
+
case 61:/* stfdp, lxv, stxsd, stxssp, stxv */
switch (word & 7) {
case 0: /* stfdp with LSB of DS field = 0 */
@@ -2783,12 +2818,22 @@ int analyse_instr(struct instruction_op *op, const struct pt_regs *regs,
case 57:/* pld */
op->type = MKOP(LOAD, PREFIXED, 8);
break;
+   case 58:/* plxvp */
+   op->reg = VSX_REGISTER_XTP(rd);
+   op->type = MKOP(LOAD_VSX, PREFIXED, 32);
+   op->element_size = 32;
+   break;
case 60:/* stq */
op->type = MKOP(STORE, PREFIXED, 16);
break;
case 61:/* pstd */
op->type = MKOP(STORE, PREFIXED, 8);
break;
+   case 62:/* pstxvp */
+   op->reg = VSX_REGISTER_XTP(rd);
+   op->type = MKOP(STORE_VSX, PREFIXED, 32);
+   op->element_size = 32;
+   break;
}
break;
case 1: /* Type 01 Eight-Byte Register-to-Register */
-- 
2.24.1



[PATCH v3 0/4] VSX 32-byte vector paired load/store instructions

2020-07-31 Thread Balamuruhan S
VSX vector paired instructions operate with an octword (32-byte) operand
for loads and stores between storage and a pair of two sequential Vector-Scalar
Registers (VSRs). There are 4 word instructions and 2 prefixed instructions
that provide these 32-byte storage access operations - lxvp, lxvpx, stxvp,
stxvpx, plxvp, pstxvp.

The emulation infrastructure doesn't support these instructions: it can
neither handle 32-byte storage accesses nor operate on two VSX registers.
This patch series enables the instruction emulation support and adds test
cases for them respectively.
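
As a quick aside for readers new to the paired-register numbering: the
sketch below mirrors the arithmetic of the VSX_REGISTER_XTP() macro added
in patch 1. The standalone program around it is mine, for illustration
only:

#include <stdio.h>

/*
 * The 5-bit register field of lxvp/stxvp holds Tp (4 bits) concatenated
 * with TX (1 bit); the first target register of the pair is
 * XTp = 32*TX + 2*Tp.
 */
static int vsx_register_xtp(int rd)
{
        return ((rd & 1) << 5) | (rd & 0xfe);
}

int main(void)
{
        /* TX = 1, Tp = 1  =>  rd = 3  =>  VSR34 (the pair VSR34/VSR35) */
        printf("first VSR of the pair: %d\n", vsx_register_xtp(3));
        return 0;
}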

Changes in v3:
--------------
Worked on review comments and suggestions from Ravi and Naveen,

* Fix the do_vsx_load() to handle vsx instructions if MSR_FP/MSR_VEC
  cleared in exception conditions and it reaches to read/write to
  thread_struct member fp_state/vr_state respectively.
* Fix wrongly used `__vector128 v[2]` in struct vsx_reg as it should
  hold a single vsx register size.
* Remove unnecessary `VSX_CHECK_VEC` flag set and condition to check
  `VSX_LDLEFT` that is not applicable for these vsx instructions.
* Fix comments in emulate_vsx_load() that were misleading.
* Rebased on latest powerpc next branch.

Changes in v2:
--------------
* Fix suggestion from Sandipan, wrap ISA 3.1 instructions with
  cpu_has_feature(CPU_FTR_ARCH_31) check.
* Rebase on latest powerpc next branch.

Balamuruhan S (4):
  powerpc/sstep: support new VSX vector paired storage access
instructions
  powerpc/sstep: support emulation for vsx vector paired storage access
instructions
  powerpc ppc-opcode: add encoding macros for vsx vector paired
instructions
  powerpc sstep: add testcases for vsx load/store instructions

 arch/powerpc/include/asm/ppc-opcode.h |  17 ++
 arch/powerpc/lib/sstep.c  | 122 +++--
 arch/powerpc/lib/test_emulate_step.c  | 252 ++
 3 files changed, 374 insertions(+), 17 deletions(-)


base-commit: 71d7bca373d5fa0ec977ca4814f49140621bd7ae
-- 
2.24.1



Re: [PATCH] KVM: PPC: Book3S HV: Define H_PAGE_IN_NONSHARED for H_SVM_PAGE_IN hcall

2020-07-31 Thread Ram Pai
On Fri, Jul 31, 2020 at 10:03:34AM +0530, Bharata B Rao wrote:
> On Thu, Jul 30, 2020 at 04:21:01PM -0700, Ram Pai wrote:
> > H_SVM_PAGE_IN hcall takes a flag parameter. This parameter specifies the
> > way in which a page will be treated.  H_PAGE_IN_NONSHARED indicates
> > that the page will be shared with the Secure VM, and H_PAGE_IN_SHARED
> > indicates that the page will not be shared but its contents will
> > be copied.
> 
> Looks like you got the definitions of shared and non-shared interchanged.

Yes. Realized it after sending the patch. Will fix it.

> 
> > 
> > However H_PAGE_IN_NONSHARED is not defined in the header file, though
> > it is defined and documented in the API captured in
> > Documentation/powerpc/ultravisor.rst
> > 
> > Define H_PAGE_IN_NONSHARED in the header file.
> 
> What is the use of defining this? Is this used directly in any place?
> Or, are you planning to introduce such a usage?

I know the Hypervisor code currently has no use for that define.
It assumes H_PAGE_IN_NONSHARED to be !H_PAGE_IN_SHARED.

But H_PAGE_IN_NONSHARED is defined in the Ultravisor API, and hence has
to be captured in the header, to stay consistent.
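
For reference, a minimal sketch of what the addition could look like in
arch/powerpc/include/asm/hvcall.h. The 0x0 value is my assumption based on
the !H_PAGE_IN_SHARED reading above; the real value has to match the
Ultravisor API in Documentation/powerpc/ultravisor.rst:

#define H_PAGE_IN_SHARED       0x1     /* already present */
#define H_PAGE_IN_NONSHARED    0x0     /* assumed: shared flag not set */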

RP


Re: [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id

2020-07-31 Thread Michael Ellerman
Srikar Dronamraju  writes:
> Lookup the coregroup id from the associativity array.

It's slightly strange that this is called in patch 9, but only properly
implemented here in patch 10.

I'm not saying you have to squash them together, but it would be good if
the change log for patch 9 mentioned that a subsequent commit will
complete the implementation and how that affects the behaviour.

cheers

> If unable to detect the coregroup id, fall back on the core id.
> This way, we ensure the sched_domain degenerates and an extra sched
> domain is not created.
>
> Ideally this function should have been implemented in
> arch/powerpc/kernel/smp.c. However if it's implemented in mm/numa.c, we
> don't need to find the primary domain again.
>
> If the device-tree mentions more than one coregroup, then kernel
> implements only the last or the smallest coregroup, which currently
> corresponds to the penultimate domain in the device-tree.
>
> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Anton Blanchard 
> Cc: Oliver O'Halloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Gautham R Shenoy 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Jordan Niethe 
> Reviewed-by: Gautham R. Shenoy 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v1 -> v2:
>   Move coregroup_enabled before getting associativity (Gautham)
>
>  arch/powerpc/mm/numa.c | 20 
>  1 file changed, 20 insertions(+)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0d57779e7942..8b3b3ec7fcc4 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1218,6 +1218,26 @@ int find_and_online_cpu_nid(int cpu)
>  
>  int cpu_to_coregroup_id(int cpu)
>  {
> + __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> + int index;
> +
> + if (cpu < 0 || cpu > nr_cpu_ids)
> + return -1;
> +
> + if (!coregroup_enabled)
> + goto out;
> +
> + if (!firmware_has_feature(FW_FEATURE_VPHN))
> + goto out;
> +
> + if (vphn_get_associativity(cpu, associativity))
> + goto out;
> +
> + index = of_read_number(associativity, 1);
> + if (index > min_common_depth + 1)
> + return of_read_number(&associativity[index - 1], 1);
> +
> +out:
>   return cpu_to_core_id(cpu);
>  }
>  
> -- 
> 2.17.1


Re: [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group

2020-07-31 Thread Michael Ellerman
Srikar Dronamraju  writes:
> If allocated earlier and the search fails, then the cpumask needs to be
> freed. However cpu_l1_cache_map can be allocated after we search the
> thread group.

It's not freed anywhere AFAICS?

And even after this change there's still an error path that doesn't free
it, isn't there?
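
For illustration, the cleanup being asked about would look something like
the sketch below; "some_later_step_fails" is a hypothetical placeholder,
not code from the patch:

	if (!zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
				     GFP_KERNEL, cpu_to_node(cpu)))
		return -ENOMEM;

	if (some_later_step_fails) {
		/* release the mask on every failing exit path */
		free_cpumask_var(per_cpu(cpu_l1_cache_map, cpu));
		return -ENODEV;
	}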

cheers

> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Anton Blanchard 
> Cc: Oliver O'Halloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Gautham R Shenoy 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Jordan Niethe 
> Reviewed-by: Gautham R. Shenoy 
> Signed-off-by: Srikar Dronamraju 
> ---
>  arch/powerpc/kernel/smp.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 698000c7f76f..dab96a1203ec 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -797,10 +797,6 @@ static int init_cpu_l1_cache_map(int cpu)
>   if (err)
>   goto out;
>  
> - zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> - GFP_KERNEL,
> - cpu_to_node(cpu));
> -
>   cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
>  
>   if (unlikely(cpu_group_start == -1)) {
> @@ -809,6 +805,9 @@ static int init_cpu_l1_cache_map(int cpu)
>   goto out;
>   }
>  
> + zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> + GFP_KERNEL, cpu_to_node(cpu));
> +
>   for (i = first_thread; i < first_thread + threads_per_core; i++) {
>   int i_group_start = get_cpu_thread_group_start(i, &tg);
>  
> -- 
> 2.17.1


Re: [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup

2020-07-31 Thread Michael Ellerman
Srikar Dronamraju  writes:
> Add support for grouping cores based on the device-tree classification.
> - The last domain in the associativity domains always refers to the
> core.
> - If the primary reference domain happens to be the penultimate domain in
> the associativity domains device-tree property, then there are no
> coregroups. However if it's not the penultimate domain, then there are
> coregroups. There can be more than one coregroup. For now we are
> interested in the last or the smallest coregroup.

This still doesn't tell me what a coregroup actually represents.

I get that it's a grouping of cores, and that the device tree specifies
it for us, but grouping based on what?

I think the answer is we aren't being told by firmware, it's just a
grouping based on some opaque performance characteristic and we just
have to take that as given.

But please explain that clearly in the change log and the code comments.

cheers


> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Anton Blanchard 
> Cc: Oliver O'Halloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Gautham R Shenoy 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Jordan Niethe 
> Reviewed-by: Gautham R. Shenoy 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v1 -> v2:
>   Explained Coregroup in commit msg (Michael Ellerman)
>
>  arch/powerpc/include/asm/smp.h |  1 +
>  arch/powerpc/kernel/smp.c  |  1 +
>  arch/powerpc/mm/numa.c | 34 +-
>  3 files changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
> index 49a25e2400f2..5bdc17a7049f 100644
> --- a/arch/powerpc/include/asm/smp.h
> +++ b/arch/powerpc/include/asm/smp.h
> @@ -28,6 +28,7 @@
>  extern int boot_cpuid;
>  extern int spinning_secondaries;
>  extern u32 *cpu_to_phys_id;
> +extern bool coregroup_enabled;
>  
>  extern void cpu_die(void);
>  extern int cpu_to_chip_id(int cpu);
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 3c5ccf6d2b1c..698000c7f76f 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -74,6 +74,7 @@ static DEFINE_PER_CPU(int, cpu_state) = { 0 };
>  
>  struct task_struct *secondary_current;
>  bool has_big_cores;
> +bool coregroup_enabled;
>  
>  DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 2298899a0f0a..51cb672f113b 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -886,7 +886,9 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
>  static void __init find_possible_nodes(void)
>  {
>   struct device_node *rtas;
> - u32 numnodes, i;
> + const __be32 *domains;
> + int prop_length, max_nodes;
> + u32 i;
>  
>   if (!numa_enabled)
>   return;
> @@ -895,25 +897,31 @@ static void __init find_possible_nodes(void)
>   if (!rtas)
>   return;
>  
> - if (of_property_read_u32_index(rtas, "ibm,current-associativity-domains",
> - min_common_depth, &numnodes)) {
> - /*
> -  * ibm,current-associativity-domains is a fairly recent
> -  * property. If it doesn't exist, then fallback on
> -  * ibm,max-associativity-domains. Current denotes what the
> -  * platform can support compared to max which denotes what the
> -  * Hypervisor can support.
> -  */
> - if (of_property_read_u32_index(rtas, "ibm,max-associativity-domains",
> - min_common_depth, &numnodes))
> + /*
> +  * ibm,current-associativity-domains is a fairly recent property. If
> +  * it doesn't exist, then fallback on ibm,max-associativity-domains.
> +  * Current denotes what the platform can support compared to max
> +  * which denotes what the Hypervisor can support.
> +  */
> + domains = of_get_property(rtas, "ibm,current-associativity-domains",
> + &prop_length);
> + if (!domains) {
> + domains = of_get_property(rtas, "ibm,max-associativity-domains",
> + &prop_length);
> + if (!domains)
>   goto out;
>   }
>  
> - for (i = 0; i < numnodes; i++) {
> + max_nodes = of_read_number(&domains[min_common_depth], 1);
> + for (i = 0; i < max_nodes; i++) {
>   if (!node_possible(i))
>   node_set(i, node_possible_map);
>   }
>  
> + prop_length /= sizeof(int);
> + if (prop_length > min_common_depth + 2)
> + coregroup_enabled = 1;
> +
>  out:
>   of_node_put(rtas);
>  }
> -- 
> 2.17.1


Re: [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain

2020-07-31 Thread Michael Ellerman
Srikar Dronamraju  writes:
> Currently "CACHE" domain happens to be the 2nd sched domain as per
> powerpc_topology. This domain will collapse if cpumask of l2-cache is
> same as SMT domain. However we could generalize this domain such that it
> could mean either be a "CACHE" domain or a "BIGCORE" domain.
>
> While setting up the "CACHE" domain, check if shared_cache is already
> set.

PeterZ asked for some overview of what you're doing and why, you
responded to his mail, but I was expecting to see that text incorporated
here somewhere.

He also asked for some comments, which I would also like to see.


I'm also not clear why we want to rename it to "bigcore"; that's not a
commonly understood term, and I don't think it's clear to new readers what
it means.

Leaving it as the shared cache domain, and having a comment mentioning
that "bigcores" share a cache, would be clearer I think.

cheers


> Cc: linuxppc-dev 
> Cc: LKML 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Anton Blanchard 
> Cc: Oliver O'Halloran 
> Cc: Nathan Lynch 
> Cc: Michael Neuling 
> Cc: Gautham R Shenoy 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Valentin Schneider 
> Cc: Jordan Niethe 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v1 -> v2:
>   Moved shared_cache topology fixup to fixup_topology (Gautham)
>
>  arch/powerpc/kernel/smp.c | 48 +++
>  1 file changed, 34 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index d997c7411664..3c5ccf6d2b1c 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_core_map);
>  EXPORT_SYMBOL_GPL(has_big_cores);
>  
> +enum {
> +#ifdef CONFIG_SCHED_SMT
> + smt_idx,
> +#endif
> + bigcore_idx,
> + die_idx,
> +};
> +
>  #define MAX_THREAD_LIST_SIZE 8
>  #define THREAD_GROUP_SHARE_L1   1
>  struct thread_groups {
> @@ -851,13 +859,7 @@ static int powerpc_shared_cache_flags(void)
>   */
>  static const struct cpumask *shared_cache_mask(int cpu)
>  {
> - if (shared_caches)
> - return cpu_l2_cache_mask(cpu);
> -
> - if (has_big_cores)
> - return cpu_smallcore_mask(cpu);
> -
> - return per_cpu(cpu_sibling_map, cpu);
> + return per_cpu(cpu_l2_cache_map, cpu);
>  }
>  
>  #ifdef CONFIG_SCHED_SMT
> @@ -867,11 +869,16 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
>  }
>  #endif
>  
> +static const struct cpumask *cpu_bigcore_mask(int cpu)
> +{
> + return per_cpu(cpu_sibling_map, cpu);
> +}
> +
>  static struct sched_domain_topology_level powerpc_topology[] = {
>  #ifdef CONFIG_SCHED_SMT
>   { cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
> - { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
> + { cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
>   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
>   { NULL, },
>  };
> @@ -1311,7 +1318,6 @@ static void add_cpu_to_masks(int cpu)
>  void start_secondary(void *unused)
>  {
>   unsigned int cpu = smp_processor_id();
> - struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
>  
>   mmgrab(&init_mm);
>   current->active_mm = &init_mm;
> @@ -1337,14 +1343,20 @@ void start_secondary(void *unused)
>   /* Update topology CPU masks */
>   add_cpu_to_masks(cpu);
>  
> - if (has_big_cores)
> - sibling_mask = cpu_smallcore_mask;
>   /*
>* Check for any shared caches. Note that this must be done on a
>* per-core basis because one core in the pair might be disabled.
>*/
> - if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> - shared_caches = true;
> + if (!shared_caches) {
> + struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> + struct cpumask *mask = cpu_l2_cache_mask(cpu);
> +
> + if (has_big_cores)
> + sibling_mask = cpu_smallcore_mask;
> +
> + if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> + shared_caches = true;
> + }
>  
>   set_numa_node(numa_cpu_lookup_table[cpu]);
>   set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
> @@ -1375,9 +1387,17 @@ static void fixup_topology(void)
>  #ifdef CONFIG_SCHED_SMT
>   if (has_big_cores) {
>   pr_info("Big cores detected but using small core scheduling\n");
> - powerpc_topology[0].mask = smallcore_smt_mask;
> + powerpc_topology[smt_idx].mask = smallcore_smt_mask;
>   }
>  #endif
> + if (shared_caches) {
> + pr_info("Using shared cache scheduler topology\n");
> + powerpc_topology[bigcore_idx].mask = shared_cache_mask;
> + powerpc_topology[bigcore_idx].sd_flags = powerpc_shared_cache_flags;
> +#ifdef CONFIG_SCHED_DEBUG
> + powerpc_topology[bigcore_idx].name = "CACHE";
> +#endif
> +   

Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain

2020-07-31 Thread Gautham R Shenoy
Hi Srikar, Valentin,

On Wed, Jul 29, 2020 at 11:43:55AM +0530, Srikar Dronamraju wrote:
> * Valentin Schneider  [2020-07-28 16:03:11]:
>

[..snip..]

> At this time the current topology would be good enough i.e BIGCORE would
> always be equal to a MC. However in future we could have chips that have
> fewer/more CPUs in the llc than in a BIGCORE, or we could have
> granular or split L3 caches within a DIE. In such a case BIGCORE != MC.
> 
> Also in the current P9 itself, two neighbouring core-pairs form a quad.
> Cache latency within a quad is better than a latency to a distant core-pair.
> Cache latency within a core pair is way better than latency within a quad.
> So if we have only 4 threads running on a DIE all of them accessing the same
> cache-lines, then we could probably benefit if all the tasks were to run
> within the quad aka MC/Coregroup.
> 
> I have found some benchmarks which are latency sensitive and benefit from
> a grouping at the quad level (using kernel hacks and not backed by
> firmware changes). Gautham also found similar results in his experiments
> but he only used binding within the stock kernel.
> 
> I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as in MC
> domain need not be LLC domain for Power.

I am observing that SD_SHARE_PKG_RESOURCES at L2 provides the best
results for POWER9 in terms of cache-benefits during wakeup.  On a
POWER9 Boston machine, running a producer-consumer test case
(https://github.com/gautshen/misc/blob/master/producer_consumer/producer_consumer.c)

The test case creates two threads, one Producer and another
Consumer. Both work on a fairly large shared array of size 64M. In an
iteration the Producer performs stores to 1024 random locations and
wakes up the Consumer. In its iteration, the Consumer loads from those
exact 1024 locations. A minimal sketch of this handshake follows the
next paragraph.

We measure the number of Consumer iterations per second and the
average time for each Consumer iteration. The smaller the time, the
better it is.
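
Here is a minimal sketch of that handshake, assuming a pthread
mutex/condvar pair; the actual test case at the URL above may differ in
details (it also times each Consumer iteration and sets CPU affinity):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define ARRAY_SIZE	(64UL << 20)	/* 64M shared array */
#define NR_TOUCHES	1024
#define NR_ITERATIONS	5000

static char *shared_array;
static size_t idx[NR_TOUCHES];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int data_ready;

static void *producer_fn(void *unused)
{
	for (int iter = 0; iter < NR_ITERATIONS; iter++) {
		pthread_mutex_lock(&lock);
		while (data_ready)	/* wait for the consumer to drain */
			pthread_cond_wait(&cond, &lock);
		for (int i = 0; i < NR_TOUCHES; i++) {
			idx[i] = (size_t)rand() % ARRAY_SIZE;
			shared_array[idx[i]]++;		/* store */
		}
		data_ready = 1;
		pthread_cond_signal(&cond);	/* wake up the consumer */
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t producer;
	unsigned long sum = 0;

	shared_array = calloc(1, ARRAY_SIZE);
	if (!shared_array)
		return 1;
	pthread_create(&producer, NULL, producer_fn, NULL);

	for (int iter = 0; iter < NR_ITERATIONS; iter++) {
		pthread_mutex_lock(&lock);
		while (!data_ready)
			pthread_cond_wait(&cond, &lock);
		for (int i = 0; i < NR_TOUCHES; i++)
			sum += shared_array[idx[i]];	/* load the same lines */
		data_ready = 0;
		pthread_cond_signal(&cond);	/* let the producer refill */
		pthread_mutex_unlock(&lock);
		/* timestamp here per iteration to compute the avg ns */
	}

	pthread_join(producer, NULL);
	printf("checksum: %lu\n", sum);
	return 0;
}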

The following results are when I pinned the Producer and Consumer to
different combinations of CPUs to cover Small core , Big-core,
Neighbouring Big-core, Far off core within the same chip, and across
chips. There is also a case where they are not affined anywhere, and
we let the scheduler wake them up correctly.

We find the best results when the Producer and Consumer are within the
same L2 domain. These numbers are also close to the numbers that we
get when we let the Scheduler wake them up (where LLC is L2).

## Same Small core (4 threads: Shares L1, L2, L3, Frequency Domain)

Consumer affined to  CPU 3
Producer affined to  CPU 1
4698 iterations, avg time: 20034 ns
4951 iterations, avg time: 20012 ns
4957 iterations, avg time: 19971 ns
4968 iterations, avg time: 19985 ns
4970 iterations, avg time: 19977 ns


## Same Big Core (8 threads: Shares L2, L3, Frequency Domain)

Consumer affined to  CPU 7
Producer affined to  CPU 1
4580 iterations, avg time: 19403 ns
4851 iterations, avg time: 19373 ns
4849 iterations, avg time: 19394 ns
4856 iterations, avg time: 19394 ns
4867 iterations, avg time: 19353 ns

## Neighbouring Big-core (Faster data-snooping from L2. Shares L3, Frequency 
Domain)

Producer affined to  CPU 1
Consumer affined to  CPU 11
4270 iterations, avg time: 24158 ns
4491 iterations, avg time: 24157 ns
4500 iterations, avg time: 24148 ns
4516 iterations, avg time: 24164 ns
4518 iterations, avg time: 24165 ns


## Any other Big-core from Same Chip (Shares L3)

Producer affined to  CPU 1
Consumer affined to  CPU 87
4176 iterations, avg time: 27953 ns
4417 iterations, avg time: 27925 ns
4415 iterations, avg time: 27934 ns
4417 iterations, avg time: 27983 ns
4430 iterations, avg time: 27958 ns


## Different Chips (No cache-sharing)

Consumer affined to  CPU 175
Producer affined to  CPU 1
3277 iterations, avg time: 50786 ns
3063 iterations, avg time: 50732 ns
2831 iterations, avg time: 50737 ns
2859 iterations, avg time: 50688 ns
2849 iterations, avg time: 50722 ns

## Without affining them (Let the Scheduler wake them up appropriately)
Consumer affined to  CPU 0-175
Producer affined to  CPU 0-175
4821 iterations, avg time: 19412 ns
4863 iterations, avg time: 19435 ns
4855 iterations, avg time: 19381 ns
4811 iterations, avg time: 19458 ns
4892 iterations, avg time: 19429 ns


--
Thanks and Regards
gautham.


Re: [PATCH v4 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric

2020-07-31 Thread Aneesh Kumar K.V
Vaibhav Jain  writes:

> We add support for reporting 'fuel-gauge' NVDIMM metric via
> PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage
> life remaining of a papr-scm compatible NVDIMM. PHYP exposes this
> metric via the H_SCM_PERFORMANCE_STATS.
>
> The metric value is returned from the pdsm by extending the return
> payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new
> field 'dimm_fuel_gauge' to hold the metric value is introduced at the
> end of the payload struct and its presence is indicated by the
> extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID.
>
> The patch introduces a new function papr_pdsm_fuel_gauge() that is
> called from papr_pdsm_health(). If fetching NVDIMM performance stats
> is supported then 'papr_pdsm_fuel_gauge()' allocates an output buffer
> large enough to hold the performance stat and passes it to
> drc_pmem_query_stats() that issues the HCALL to PHYP. The return value
> of the stat is then populated in the 'struct
> nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag
> 'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct
> nd_papr_pdsm_health.extension_flags'
>

Reviewed-by: Aneesh Kumar K.V 

> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
>
> v4:
> * Moved a hunk from this patch to previous patch in the series.
>   [ Aneesh ]
>
> v3:
> * Updated papr_pdsm_fuel_guage() to use the updated
>   drc_pmem_query_stats() function.
>
> Resend:
> None
>
> v2:
> * Restructure code in papr_pdsm_fuel_gauge() to handle error case
> first [ Ira ]
> * Ignore the return value of papr_pdsm_fuel_gauge() in
> papr_psdm_health() [ Ira ]
> ---
>  arch/powerpc/include/uapi/asm/papr_pdsm.h |  9 +
>  arch/powerpc/platforms/pseries/papr_scm.c | 49 +++
>  2 files changed, 58 insertions(+)
>
> diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h
> index 9ccecc1d6840..50ef95e2f5b1 100644
> --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
> +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
> @@ -72,6 +72,11 @@
>  #define PAPR_PDSM_DIMM_CRITICAL  2
>  #define PAPR_PDSM_DIMM_FATAL 3
>  
> +/* struct nd_papr_pdsm_health.extension_flags field flags */
> +
> +/* Indicate that the 'dimm_fuel_gauge' field is valid */
> +#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
> +
>  /*
>   * Struct exchanged between kernel & ndctl for PAPR_PDSM_HEALTH
>   * Various flags indicate the health status of the dimm.
> @@ -84,6 +89,7 @@
>   * dimm_locked   : Contents of the dimm can't be modified until CEC reboot
>   * dimm_encrypted: Contents of dimm are encrypted.
>   * dimm_health   : Dimm health indicator. One of PAPR_PDSM_DIMM_
> + * dimm_fuel_gauge   : Life remaining of DIMM as a percentage from 0-100
>   */
>  struct nd_papr_pdsm_health {
>   union {
> @@ -96,6 +102,9 @@ struct nd_papr_pdsm_health {
>   __u8 dimm_locked;
>   __u8 dimm_encrypted;
>   __u16 dimm_health;
> +
> + /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
> + __u16 dimm_fuel_gauge;
>   };
>   __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
>   };
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index f37f3f70007d..f439f0dfea7d 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -518,6 +518,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned int cmd, void *buf,
>   return 0;
>  }
>  
> +static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
> + union nd_pdsm_payload *payload)
> +{
> + int rc, size;
> + u64 statval;
> + struct papr_scm_perf_stat *stat;
> + struct papr_scm_perf_stats *stats;
> +
> + /* Silently fail if fetching performance metrics isn't supported */
> + if (!p->stat_buffer_len)
> + return 0;
> +
> + /* Allocate request buffer enough to hold single performance stat */
> + size = sizeof(struct papr_scm_perf_stats) +
> + sizeof(struct papr_scm_perf_stat);
> +
> + stats = kzalloc(size, GFP_KERNEL);
> + if (!stats)
> + return -ENOMEM;
> +
> + stat = &stats->scm_statistic[0];
> + memcpy(&stat->stat_id, "MemLife ", sizeof(stat->stat_id));
> + stat->stat_val = 0;
> +
> + /* Fetch the fuel gauge and populate it in payload */
> + rc = drc_pmem_query_stats(p, stats, 1);
> + if (rc < 0) {
> + dev_dbg(&p->pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
> + goto free_stats;
> + }
> +
> + statval = be64_to_cpu(stat->stat_val);
> + dev_dbg(&p->pdev->dev,
> + "Fetched fuel-gauge %llu", statval);
> + payload->health.extension_flags |=
> + PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
> + payload->health.dimm_fuel_gauge = statval;
> +
> + rc = sizeof(struct nd_papr_pdsm_health);

Re: [PATCH v4 1/2] powerpc/papr_scm: Fetch nvdimm performance stats from PHYP

2020-07-31 Thread Aneesh Kumar K.V
Vaibhav Jain  writes:

> Update papr_scm.c to query dimm performance statistics from PHYP via
> H_SCM_PERFORMANCE_STATS hcall and export them to user-space as PAPR
> specific NVDIMM attribute 'perf_stats' in sysfs. The patch also
> provides sysfs ABI documentation for the stats being reported and
> their meanings.
>
> During NVDIMM probe time in papr_scm_nvdimm_init() a special variant
> of H_SCM_PERFORMANCE_STATS hcall is issued to check if collection of
> performance statistics is supported or not. If successful, PHYP
> returns the maximum possible buffer length needed to read all
> performance stats. This returned value is stored in a per-nvdimm
> attribute 'stat_buffer_len'.
>
> The layout of request buffer for reading NVDIMM performance stats from
> PHYP is defined in 'struct papr_scm_perf_stats' and 'struct
> papr_scm_perf_stat'. These structs are used in newly introduced
> drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall.
>
> The sysfs access function perf_stats_show() uses value
> 'stat_buffer_len' to allocate a buffer large enough to hold all
> possible NVDIMM performance stats and passes it to
> drc_pmem_query_stats() to populate. Finally statistics reported in the
> buffer are formatted into the sysfs access function output buffer.
>
> Signed-off-by: Vaibhav Jain 
> ---
> Changelog:
>
> v4:
> * Fixed a build issue with this patch by moving a hunk from second
>   patch in series to this patch. [ Aneesh ]
>
> v3:
> * Updated drc_pmem_query_stats() to not require 'buff_size' and 'out'
>   args to the function. Instead 'buff_size' is calculated from
>   'num_stats' and instead of populating 'R4' in arg 'out' the value is
>   returned from the function in case 'R4' represents
>   'max-buffer-size'.
>
> Resend:
> None
>
> v2:
> * Updated 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat'
> to use big-endian types. [ Aneesh ]
> * s/len_stat_buffer/stat_buffer_len/ [ Aneesh ]
> * s/statistics_id/stat_id/ , s/statistics_val/stat_val/ [ Aneesh ]
> * Conversion from Big endian to cpu endian happens later rather than
> just after its fetched from PHYP.
> * Changed a log statement to unambiguously report dimm performance
> stats are not available for the given nvdimm [ Ira ]
> * Restructed some code to handle error case first [ Ira ]
> ---
>  Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 
>  arch/powerpc/platforms/pseries/papr_scm.c | 150 ++
>  2 files changed, 177 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> index 5b10d036a8d4..c1a67275c43f 100644
> --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
> +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> @@ -25,3 +25,30 @@ Description:
> NVDIMM have been scrubbed.
>   * "locked"  : Indicating that NVDIMM contents cant
> be modified until next power cycle.
> +
> +What:/sys/bus/nd/devices/nmemX/papr/perf_stats
> +Date:May, 2020
> +KernelVersion:   v5.9
> +Contact: linuxppc-dev , 
> linux-nvd...@lists.01.org,
> +Description:
> + (RO) Report various performance stats related to papr-scm NVDIMM
> + device.  Each stat is reported on a new line with each line
> + composed of a stat-identifier followed by its value. Below are
> + currently known dimm performance stats which are reported:
> +
> + * "CtlResCt" : Controller Reset Count
> + * "CtlResTm" : Controller Reset Elapsed Time
> + * "PonSecs " : Power-on Seconds
> + * "MemLife " : Life Remaining
> + * "CritRscU" : Critical Resource Utilization
> + * "HostLCnt" : Host Load Count
> + * "HostSCnt" : Host Store Count
> + * "HostSDur" : Host Store Duration
> + * "HostLDur" : Host Load Duration
> + * "MedRCnt " : Media Read Count
> + * "MedWCnt " : Media Write Count
> + * "MedRDur " : Media Read Duration
> + * "MedWDur " : Media Write Duration
> + * "CchRHCnt" : Cache Read Hit Count
> + * "CchWHCnt" : Cache Write Hit Count
> + * "FastWCnt" : Fast Write Count
> \ No newline at end of file
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index 3d1235a76ba9..f37f3f70007d 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -64,6 +64,26 @@
>   PAPR_PMEM_HEALTH_FATAL |\
>   PAPR_PMEM_HEALTH_UNHEALTHY)
>  
> +#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
> +#define PAPR_SCM_PERF_STATS_VERSION 0x1
> +
> +/* Struct holding a single performance metric */
> +struct papr_scm_perf_stat {
> + u8 stat_id[8];
> + __be64 stat_val;

[PATCH v4 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric

2020-07-31 Thread Vaibhav Jain
We add support for reporting 'fuel-gauge' NVDIMM metric via
PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage
life remaining of a papr-scm compatible NVDIMM. PHYP exposes this
metric via the H_SCM_PERFORMANCE_STATS.

The metric value is returned from the pdsm by extending the return
payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new
field 'dimm_fuel_gauge' to hold the metric value is introduced at the
end of the payload struct and its presence is indicated by the
extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID.

The patch introduces a new function papr_pdsm_fuel_gauge() that is
called from papr_pdsm_health(). If fetching NVDIMM performance stats
is supported then 'papr_pdsm_fuel_gauge()' allocates an output buffer
large enough to hold the performance stat and passes it to
drc_pmem_query_stats() that issues the HCALL to PHYP. The return value
of the stat is then populated in the 'struct
nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag
'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct
nd_papr_pdsm_health.extension_flags'

Signed-off-by: Vaibhav Jain 
---
Changelog:

v4:
* Moved a hunk from this patch to previous patch in the series.
  [ Aneesh ]

v3:
* Updated papr_pdsm_fuel_guage() to use the updated
  drc_pmem_query_stats() function.

Resend:
None

v2:
* Restructure code in papr_pdsm_fuel_gauge() to handle error case
first [ Ira ]
* Ignore the return value of papr_pdsm_fuel_gauge() in
papr_psdm_health() [ Ira ]
---
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  9 +
 arch/powerpc/platforms/pseries/papr_scm.c | 49 +++
 2 files changed, 58 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 9ccecc1d6840..50ef95e2f5b1 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -72,6 +72,11 @@
 #define PAPR_PDSM_DIMM_CRITICAL  2
 #define PAPR_PDSM_DIMM_FATAL 3
 
+/* struct nd_papr_pdsm_health.extension_flags field flags */
+
+/* Indicate that the 'dimm_fuel_gauge' field is valid */
+#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
+
 /*
  * Struct exchanged between kernel & ndctl for PAPR_PDSM_HEALTH
  * Various flags indicate the health status of the dimm.
@@ -84,6 +89,7 @@
  * dimm_locked : Contents of the dimm can't be modified until CEC reboot
  * dimm_encrypted  : Contents of dimm are encrypted.
  * dimm_health : Dimm health indicator. One of PAPR_PDSM_DIMM_
+ * dimm_fuel_gauge : Life remaining of DIMM as a percentage from 0-100
  */
 struct nd_papr_pdsm_health {
union {
@@ -96,6 +102,9 @@ struct nd_papr_pdsm_health {
__u8 dimm_locked;
__u8 dimm_encrypted;
__u16 dimm_health;
+
+   /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
+   __u16 dimm_fuel_gauge;
};
__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
};
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index f37f3f70007d..f439f0dfea7d 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -518,6 +518,51 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned int cmd, void *buf,
return 0;
 }
 
+static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
+   union nd_pdsm_payload *payload)
+{
+   int rc, size;
+   u64 statval;
+   struct papr_scm_perf_stat *stat;
+   struct papr_scm_perf_stats *stats;
+
+   /* Silently fail if fetching performance metrics isn't supported */
+   if (!p->stat_buffer_len)
+   return 0;
+
+   /* Allocate request buffer enough to hold single performance stat */
+   size = sizeof(struct papr_scm_perf_stats) +
+   sizeof(struct papr_scm_perf_stat);
+
+   stats = kzalloc(size, GFP_KERNEL);
+   if (!stats)
+   return -ENOMEM;
+
+   stat = &stats->scm_statistic[0];
+   memcpy(&stat->stat_id, "MemLife ", sizeof(stat->stat_id));
+   stat->stat_val = 0;
+
+   /* Fetch the fuel gauge and populate it in payload */
+   rc = drc_pmem_query_stats(p, stats, 1);
+   if (rc < 0) {
+   dev_dbg(&p->pdev->dev, "Err(%d) fetching fuel gauge\n", rc);
+   goto free_stats;
+   }
+
+   statval = be64_to_cpu(stat->stat_val);
+   dev_dbg(&p->pdev->dev,
+   "Fetched fuel-gauge %llu", statval);
+   payload->health.extension_flags |=
+   PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
+   payload->health.dimm_fuel_gauge = statval;
+
+   rc = sizeof(struct nd_papr_pdsm_health);
+
+free_stats:
+   kfree(stats);
+   return rc;
+}
+
 /* Fetch the DIMM health info and populate it in provided package. */
 static int papr_pdsm_health(struct papr_scm_priv *p,
    union nd_pdsm_payload *payload)

[PATCH v4 1/2] powerpc/papr_scm: Fetch nvdimm performance stats from PHYP

2020-07-31 Thread Vaibhav Jain
Update papr_scm.c to query dimm performance statistics from PHYP via
H_SCM_PERFORMANCE_STATS hcall and export them to user-space as PAPR
specific NVDIMM attribute 'perf_stats' in sysfs. The patch also
provides sysfs ABI documentation for the stats being reported and
their meanings.

During NVDIMM probe time in papr_scm_nvdimm_init() a special variant
of H_SCM_PERFORMANCE_STATS hcall is issued to check if collection of
performance statistics is supported or not. If successful, PHYP
returns the maximum possible buffer length needed to read all
performance stats. This returned value is stored in a per-nvdimm
attribute 'stat_buffer_len'.

The layout of request buffer for reading NVDIMM performance stats from
PHYP is defined in 'struct papr_scm_perf_stats' and 'struct
papr_scm_perf_stat'. These structs are used in the newly introduced
drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall.

The sysfs access function perf_stats_show() uses value
'stat_buffer_len' to allocate a buffer large enough to hold all
possible NVDIMM performance stats and passes it to
drc_pmem_query_stats() to populate. Finally, the statistics reported in
the buffer are formatted into the sysfs output buffer.
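
To make that flow concrete, here is a rough sketch of how such a show
function could look (illustrative only: drc_pmem_query_stats() and
stat_buffer_len come from this patch, but the exact formatting loop, the
'num_statistics' field, and the "0 means all stats" argument are my
assumptions from the description):

static ssize_t perf_stats_show(struct device *dev,
			       struct device_attribute *attr, char *buf)
{
	int index, rc;
	struct seq_buf s;
	struct papr_scm_perf_stat *stat;
	struct papr_scm_perf_stats *stats;
	struct papr_scm_priv *p = nvdimm_provider_data(to_nvdimm(dev));

	if (!p->stat_buffer_len)
		return -ENOENT;

	/* Buffer large enough for all stats, as sized during probe */
	stats = kzalloc(p->stat_buffer_len, GFP_KERNEL);
	if (!stats)
		return -ENOMEM;

	/* Ask PHYP to fill in the full catalogue of stats */
	rc = drc_pmem_query_stats(p, stats, 0);
	if (rc)
		goto free_stats;

	/* Format "<stat-id> = <value>" lines into the sysfs buffer */
	seq_buf_init(&s, buf, PAGE_SIZE);
	for (index = 0, stat = stats->scm_statistic;
	     index < be32_to_cpu(stats->num_statistics); index++, stat++)
		seq_buf_printf(&s, "%.8s = 0x%016llX\n",
			       stat->stat_id, be64_to_cpu(stat->stat_val));

	rc = seq_buf_used(&s);
free_stats:
	kfree(stats);
	return rc;
}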

Signed-off-by: Vaibhav Jain 
---
Changelog:

v4:
* Fixed a build issue with this patch by moving a hunk from second
  patch in series to this patch. [ Aneesh ]

v3:
* Updated drc_pmem_query_stats() to not require 'buff_size' and 'out'
  args to the function. Instead 'buff_size' is calculated from
  'num_stats' and instead of populating 'R4' in arg 'out' the value is
  returned from the function in case 'R4' represents
  'max-buffer-size'.

Resend:
None

v2:
* Updated 'struct papr_scm_perf_stats' and 'struct papr_scm_perf_stat'
to use big-endian types. [ Aneesh ]
* s/len_stat_buffer/stat_buffer_len/ [ Aneesh ]
* s/statistics_id/stat_id/ , s/statistics_val/stat_val/ [ Aneesh ]
* Conversion from Big endian to cpu endian happens later rather than
just after its fetched from PHYP.
* Changed a log statement to unambiguously report dimm performance
stats are not available for the given nvdimm [ Ira ]
* Restructed some code to handle error case first [ Ira ]
---
 Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 
 arch/powerpc/platforms/pseries/papr_scm.c | 150 ++
 2 files changed, 177 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem b/Documentation/ABI/testing/sysfs-bus-papr-pmem
index 5b10d036a8d4..c1a67275c43f 100644
--- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
+++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
@@ -25,3 +25,30 @@ Description:
  NVDIMM have been scrubbed.
* "locked"  : Indicating that NVDIMM contents cant
  be modified until next power cycle.
+
+What:  /sys/bus/nd/devices/nmemX/papr/perf_stats
+Date:  May, 2020
+KernelVersion: v5.9
+Contact:   linuxppc-dev , 
linux-nvd...@lists.01.org,
+Description:
+   (RO) Report various performance stats related to papr-scm NVDIMM
+   device.  Each stat is reported on a new line with each line
+   composed of a stat-identifier followed by it value. Below are
+   currently known dimm performance stats which are reported:
+
+   * "CtlResCt" : Controller Reset Count
+   * "CtlResTm" : Controller Reset Elapsed Time
+   * "PonSecs " : Power-on Seconds
+   * "MemLife " : Life Remaining
+   * "CritRscU" : Critical Resource Utilization
+   * "HostLCnt" : Host Load Count
+   * "HostSCnt" : Host Store Count
+   * "HostSDur" : Host Store Duration
+   * "HostLDur" : Host Load Duration
+   * "MedRCnt " : Media Read Count
+   * "MedWCnt " : Media Write Count
+   * "MedRDur " : Media Read Duration
+   * "MedWDur " : Media Write Duration
+   * "CchRHCnt" : Cache Read Hit Count
+   * "CchWHCnt" : Cache Write Hit Count
+   * "FastWCnt" : Fast Write Count
\ No newline at end of file
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 3d1235a76ba9..f37f3f70007d 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -64,6 +64,26 @@
PAPR_PMEM_HEALTH_FATAL |\
PAPR_PMEM_HEALTH_UNHEALTHY)
 
+#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
+#define PAPR_SCM_PERF_STATS_VERSION 0x1
+
+/* Struct holding a single performance metric */
+struct papr_scm_perf_stat {
+   u8 stat_id[8];
+   __be64 stat_val;
+} __packed;
+
+/* Struct exchanged between kernel and PHYP for fetching drc perf stats */
+struct papr_scm_perf_stats {
+   u8 eye_catcher[8];
+   /* Should be PAPR_SCM_PERF_STATS_VERSION */

[PATCH v4 0/2] powerpc/papr_scm: add support for reporting NVDIMM 'life_used_percentage' metric

2020-07-31 Thread Vaibhav Jain
Changes since v3[1]:

* Fixed a rebase issue pointed out by Aneesh in first patch in the series.

[1] 
https://lore.kernel.org/linux-nvdimm/20200730121303.134230-1-vaib...@linux.ibm.com
---

This small patchset implements kernel-side support for reporting the
'life_used_percentage' metric in NDCTL's dimm health output for
papr-scm NVDIMMs. With the corresponding NDCTL-side changes, the output
should look like:

$ sudo ndctl list -DH
[
  {
"dev":"nmem0",
"health":{
  "health_state":"ok",
  "life_used_percentage":0,
  "shutdown_state":"clean"
}
  }
]

PHYP supports H_SCM_PERFORMANCE_STATS hcall through which an LPAR can
fetch various performance stats including 'fuel_gauge' percentage for
an NVDIMM. 'fuel_gauge' metric indicates the usable life remaining of
an NVDIMM expressed as percentage and  'life_used_percentage' can be
calculated as 'life_used_percentage = 100 - fuel_gauge'.

Structure of the patchset
=
First patch implements necessary scaffolding needed to issue the
H_SCM_PERFORMANCE_STATS hcall and fetch performance stats
catalogue. The patch also implements support for 'perf_stats' sysfs
attribute to report the full catalogue of supported performance stats
by PHYP.

The second and final patch implements support for sending this value to
libndctl by extending the PAPR_PDSM_HEALTH pdsm payload with a new
field named 'dimm_fuel_gauge'.

Vaibhav Jain (2):
  powerpc/papr_scm: Fetch nvdimm performance stats from PHYP
  powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric

 Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 +++
 arch/powerpc/include/uapi/asm/papr_pdsm.h |   9 +
 arch/powerpc/platforms/pseries/papr_scm.c | 199 ++
 3 files changed, 235 insertions(+)

-- 
2.26.2