Re: [6.1.0-rc4-next-20221108] Boot failure on powerpc

2022-11-09 Thread Jason A. Donenfeld
Should be fixed already in today's next.


Re: [6.1.0-rc3-next-20221104] Boot failure - kernel BUG at mm/memblock.c:519

2022-11-09 Thread Yajun Deng
Hey Mike,

Can you help me test the attached file? 
Please use this new patch instead of the one in memblock tree.

November 8, 2022 3:55 PM, "Mike Rapoport"  wrote:

> Hi Yajun,
> 
> On Tue, Nov 08, 2022 at 02:27:53AM +, Yajun Deng wrote:
> 
>> Hi Sachin,
>> I didn't have a powerpc architecture machine. I don't know why this happened.
>> 
>> Hi Mike,
>> Do you have any suggestions?
> 
> You can try reproducing the bug qemu or work with Sachin to debug the
> issue.
> 
>> I tested in tools/testing/memblock, and it was successful.
> 
> Memblock tests provide limited coverage still and they don't deal with all
> possible cases.
> 
> For now I'm dropping this patch from the memblock tree until the issue is
> fixed.
> 
>> November 6, 2022 8:07 PM, "Sachin Sant"  wrote:
>> 
>> While booting recent linux-next on a IBM Power10 Server LPAR
>> following crash is observed:
>> 
>> [ 0.00] numa: Partition configured for 32 NUMA nodes.
>> [ 0.00] [ cut here ]
>> [ 0.00] kernel BUG at mm/memblock.c:519!
>> [ 0.00] Oops: Exception in kernel mode, sig: 5 [#1]
>> [ 0.00] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
>> [ 0.00] Modules linked in:
>> [ 0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3-next-20221104 
>> #1
>> [ 0.00] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 
>> of:IBM,FW1030.00
>> (NH1030_026) hv:phyp pSeries
>> [ 0.00] NIP: c04ba240 LR: c04bb240 CTR: c04ba210
>> [ 0.00] REGS: c2a8b7b0 TRAP: 0700 Not tainted 
>> (6.1.0-rc3-next-20221104)
>> [ 0.00] MSR: 80021033  CR: 24042424 XER: 
>> 0001
>> [ 0.00] CFAR: c04ba290 IRQMASK: 1
>> [ 0.00] GPR00: c04bb240 c2a8ba50 c136ee00 
>> c010f3ac00a8
>> [ 0.00] GPR04:  c010f3ac0090 0010f3ac 
>> 0d00
>> [ 0.00] GPR08: 0001 0007 0001 
>> 0081
>> [ 0.00] GPR12: c04ba210 c2e1  
>> 000d
>> [ 0.00] GPR16: 0f6be620 0f6be8e8 0f6be788 
>> 0f6bed58
>> [ 0.00] GPR20: 0f6f6d58 c29a8de8 0010f3ad8800 
>> 0080
>> [ 0.00] GPR24: 0010f3ad7b00  0100 
>> 0d00
>> [ 0.00] GPR28: 0010f3ad7b00 c29a8de8 c29a8e00 
>> 0006
>> [ 0.00] NIP [c04ba240] memblock_merge_regions.isra.12+0x40/0x130
>> [ 0.00] LR [c04bb240] memblock_add_range+0x190/0x300
>> [ 0.00] Call Trace:
>> [ 0.00] [c2a8ba50] [0100] 0x100 (unreliable)
>> [ 0.00] [c2a8ba90] [c04bb240] 
>> memblock_add_range+0x190/0x300
>> [ 0.00] [c2a8bb10] [c04bb5e0] memblock_reserve+0x70/0xd0
>> [ 0.00] [c2a8bba0] [c2045234] 
>> memblock_alloc_range_nid+0x11c/0x1e8
>> [ 0.00] [c2a8bc60] [c20453a4] 
>> memblock_alloc_internal+0xa4/0x110
>> [ 0.00] [c2a8bcb0] [c20456cc] 
>> memblock_alloc_try_nid+0x94/0xcc
>> [ 0.00] [c2a8bd40] [c200b570] alloc_paca_data+0x7c/0xcc
>> [ 0.00] [c2a8bdb0] [c200b770] allocate_paca+0x8c/0x28c
>> [ 0.00] [c2a8be50] [c200a26c] setup_arch+0x1c4/0x4d8
>> [ 0.00] [c2a8bed0] [c2004378] start_kernel+0xb4/0xa84
>> [ 0.00] [c2a8bf90] [c000da90] start_here_common+0x1c/0x20
>> [ 0.00] Instruction dump:
>> [ 0.00] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7c7d1b78 7c9e2378 3be0 
>> f8010010
>> [ 0.00] f821ffc1 e923 3969 480c <0b0a> 7d3f4b78 393f0001 
>> 7fbf5840
>> [ 0.00] ---[ end trace  ]---
>> [ 0.00]
>> [ 0.00] Kernel panic - not syncing: Fatal exception
>> [ 0.00] Rebooting in 180 seconds..
>> 
>> This problem was introduced with next-20221101. Git bisect points to
>> following patch
>> 
>> commit 3f82c9c4ac377082e1230f5299e0ccce07b15e12
>> Date: Tue Oct 25 15:09:43 2022 +0800
>> memblock: don't run loop in memblock_add_range() twice
>> 
>> Reverting this patch helps boot the kernel to login prompt.
>> 
>> Have attached .config
>> 
>> - Sachin
> 
> --
> Sincerely yours,
> Mike.


0001-memblock-don-t-run-loop-in-memblock_add_range-twice-.patch
Description: Binary data


Re: [PATCH 2/2] tools/perf: Fix printing field separator in CSV metrics output

2022-11-09 Thread Athira Rajeev



> On 09-Nov-2022, at 2:27 AM, Arnaldo Carvalho de Melo  wrote:
> 
> Em Wed, Nov 02, 2022 at 02:07:06PM +0530, Athira Rajeev escreveu:
>> 
>> 
>>> On 18-Oct-2022, at 2:26 PM, Athira Rajeev  
>>> wrote:
>>> 
>>> In perf stat with CSV output option, number of fields
>>> in metrics output is not matching with number of fields
>>> in other event output lines.
>>> 
>>> Sample output below after applying patch to fix
>>> printing os->prefix.
>>> 
>>> # ./perf stat -x, --per-socket -a -C 1 ls
>>> S0,1,1.89,msec,cpu-clock,1887692,100.00,1.013,CPUs utilized
>>> S0,1,2,,context-switches,1885842,100.00,1.060,K/sec
>>> S0,1,0,,cpu-migrations,1885374,100.00,0.000,/sec
>>> S0,1,2,,page-faults,1884880,100.00,1.060,K/sec
>>> S0,1,189544,,cycles,1263158,67.00,0.100,GHz
>>> S0,1,64602,,stalled-cycles-frontend,1876146,100.00,34.08,frontend 
>>> cycles idle
>>> S0,1,128241,,stalled-cycles-backend,1875512,100.00,67.66,backend cycles 
>>> idle
>>> S0,1,95578,,instructions,1874676,100.00,0.50,insn per cycle
>>> ===>S0,1,,,1.34,stalled cycles per insn
>>> 
>>> The above command line uses field separator as ","
>>> via "-x," option and per-socket option displays
>>> socket value as first field. But here the last line
>>> for "stalled cycles per insn" has more separators.
>>> Each csv output line is expected to have 8 field
>>> separatorsi (for the 9 fields), where as last line
>>> has 10 "," in the result. Patch fixes this issue.
>>> 
>>> The counter stats are displayed by function
>>> "perf_stat__print_shadow_stats" in code
>>> "util/stat-shadow.c". While printing the stats info
>>> for "stalled cycles per insn", function "new_line_csv"
>>> is used as new_line callback.
>>> 
>>> The fields printed in each line contains:
>>> "Socket_id,aggr nr,Avg,unit,event_name,run,enable_percent,ratio,unit"
>>> 
>>> The metric output prints Socket_id, aggr nr, ratio
>>> and unit. It has to skip through remaining five fields
>>> ie, Avg,unit,event_name,run,enable_percent. The csv
>>> line callback uses "os->nfields" to know the number of
>>> fields to skip to match with other lines.
>>> Currently it is set as:
>>> os.nfields = 3 + aggr_fields[config->aggr_mode] + (counter->cgrp ? 1 : 
>>> 0);
>>> 
>>> But in case of aggregation modes, csv_sep already
>>> gets printed along with each field (Function "aggr_printout"
>>> in util/stat-display.c). So aggr_fields can be
>>> removed from nfields. And fixed number of fields to
>>> skip has to be "4". This is to skip fields for:
>>> "avg, unit, event name, run, enable_percent"
>>> Example from line for instructions:
>>> "1.89,msec,cpu-clock,1887692,100.00"
>>> 
>>> This needs 4 csv separators. Patch removes aggr_fields
>>> and uses 4 as fixed number of os->nfields to skip.
>>> 
>>> After the patch:
>>> 
>>> # ./perf stat -x, --per-socket -a -C 1 ls
>>> S0,1,1.92,msec,cpu-clock,1917648,100.00,1.010,CPUs utilized
>>> S0,1,54,,context-switches,1916762,100.00,28.176,K/sec
>>> ---
>>> S0,1,528693,,instructions,1908854,100.00,0.36,insn per cycle
>>> S0,1,,1.81,stalled cycles per insn
>>> 
>>> Fixes: 92a61f6412d3 ("perf stat: Implement CSV metrics output")
>>> Reported-by: Disha Goel 
>>> Signed-off-by: Athira Rajeev 
>> 
>> Hi All,
>> 
>> Looking for review comments for this change.
> 
> This clashed with a patch from Namhyung that I just applied:
> 
> http://lore.kernel.org/lkml/20221107213314.3239159-2-namhy...@kernel.org
> 
> Can you please check? I just applied the other patch in this series.
> 
> Thanks,
> 
> - Arnaldo

Hi Arnaldo,

Thanks for checking the patch series.
Please find the updated patch below which is created on top of perf/urgent.

>From dde8f830ad318c9111c3fea5415fd8170b4c51bd Mon Sep 17 00:00:00 2001
From: Athira Rajeev 
Date: Tue, 18 Oct 2022 14:26:05 +0530
Subject: [PATCH] tools/perf: Fix printing field separator in CSV metrics
 output

In perf stat with CSV output option, number of fields
in metrics output is not matching with number of fields
in other event output lines.

Sample output below after applying patch to fix
printing os->prefix.

# ./perf stat -x, --per-socket -a -C 1 ls
S0,1,1.89,msec,cpu-clock,1887692,100.00,1.013,CPUs utilized
S0,1,2,,context-switches,1885842,100.00,1.060,K/sec
S0,1,0,,cpu-migrations,1885374,100.00,0.000,/sec
S0,1,2,,page-faults,1884880,100.00,1.060,K/sec
S0,1,189544,,cycles,1263158,67.00,0.100,GHz
S0,1,64602,,stalled-cycles-frontend,1876146,100.00,34.08,frontend 
cycles idle
S0,1,128241,,stalled-cycles-backend,1875512,100.00,67.66,backend cycles 
idle
S0,1,95578,,instructions,1874676,100.00,0.50,insn per cycle
===>S0,1,,,1.34,stalled cycles per insn

The above command line uses field separator as ","
via "-x," option and per-socket option displays
socket value as first field. But here the last line
for "stalled cycles per insn" has more separators.
Each csv output line is expected to have 8 

Re: [6.1.0-rc3-next-20221104] Boot failure - kernel BUG at mm/memblock.c:519

2022-11-09 Thread Yajun Deng
November 9, 2022 6:03 PM, "Yajun Deng"  wrote:

> Hey Mike,
> 
Sorry, this email should be sent to Sachin but not Mike. 
Please forgive my confusion. So:

Hey Sachin,
Can you help me test the attached file? 
Please use this new patch instead of the one in memblock tree.

> Can you help me test the attached file? 
> Please use this new patch instead of the one in memblock tree.
> 
> November 8, 2022 3:55 PM, "Mike Rapoport"  wrote:
> 
>> Hi Yajun,
>> 
>> On Tue, Nov 08, 2022 at 02:27:53AM +, Yajun Deng wrote:
>> 
>>> Hi Sachin,
>>> I didn't have a powerpc architecture machine. I don't know why this 
>>> happened.
>>> 
>>> Hi Mike,
>>> Do you have any suggestions?
>> 
>> You can try reproducing the bug qemu or work with Sachin to debug the
>> issue.
>> 
>>> I tested in tools/testing/memblock, and it was successful.
>> 
>> Memblock tests provide limited coverage still and they don't deal with all
>> possible cases.
>> 
>> For now I'm dropping this patch from the memblock tree until the issue is
>> fixed.
>> 
>>> November 6, 2022 8:07 PM, "Sachin Sant"  wrote:
>>> 
>>> While booting recent linux-next on a IBM Power10 Server LPAR
>>> following crash is observed:
>>> 
>>> [ 0.00] numa: Partition configured for 32 NUMA nodes.
>>> [ 0.00] [ cut here ]
>>> [ 0.00] kernel BUG at mm/memblock.c:519!
>>> [ 0.00] Oops: Exception in kernel mode, sig: 5 [#1]
>>> [ 0.00] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
>>> [ 0.00] Modules linked in:
>>> [ 0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 6.1.0-rc3-next-20221104 
>>> #1
>>> [ 0.00] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 
>>> of:IBM,FW1030.00
>>> (NH1030_026) hv:phyp pSeries
>>> [ 0.00] NIP: c04ba240 LR: c04bb240 CTR: c04ba210
>>> [ 0.00] REGS: c2a8b7b0 TRAP: 0700 Not tainted 
>>> (6.1.0-rc3-next-20221104)
>>> [ 0.00] MSR: 80021033  CR: 24042424 XER: 
>>> 0001
>>> [ 0.00] CFAR: c04ba290 IRQMASK: 1
>>> [ 0.00] GPR00: c04bb240 c2a8ba50 c136ee00 
>>> c010f3ac00a8
>>> [ 0.00] GPR04:  c010f3ac0090 0010f3ac 
>>> 0d00
>>> [ 0.00] GPR08: 0001 0007 0001 
>>> 0081
>>> [ 0.00] GPR12: c04ba210 c2e1  
>>> 000d
>>> [ 0.00] GPR16: 0f6be620 0f6be8e8 0f6be788 
>>> 0f6bed58
>>> [ 0.00] GPR20: 0f6f6d58 c29a8de8 0010f3ad8800 
>>> 0080
>>> [ 0.00] GPR24: 0010f3ad7b00  0100 
>>> 0d00
>>> [ 0.00] GPR28: 0010f3ad7b00 c29a8de8 c29a8e00 
>>> 0006
>>> [ 0.00] NIP [c04ba240] memblock_merge_regions.isra.12+0x40/0x130
>>> [ 0.00] LR [c04bb240] memblock_add_range+0x190/0x300
>>> [ 0.00] Call Trace:
>>> [ 0.00] [c2a8ba50] [0100] 0x100 (unreliable)
>>> [ 0.00] [c2a8ba90] [c04bb240] 
>>> memblock_add_range+0x190/0x300
>>> [ 0.00] [c2a8bb10] [c04bb5e0] memblock_reserve+0x70/0xd0
>>> [ 0.00] [c2a8bba0] [c2045234] 
>>> memblock_alloc_range_nid+0x11c/0x1e8
>>> [ 0.00] [c2a8bc60] [c20453a4] 
>>> memblock_alloc_internal+0xa4/0x110
>>> [ 0.00] [c2a8bcb0] [c20456cc] 
>>> memblock_alloc_try_nid+0x94/0xcc
>>> [ 0.00] [c2a8bd40] [c200b570] alloc_paca_data+0x7c/0xcc
>>> [ 0.00] [c2a8bdb0] [c200b770] allocate_paca+0x8c/0x28c
>>> [ 0.00] [c2a8be50] [c200a26c] setup_arch+0x1c4/0x4d8
>>> [ 0.00] [c2a8bed0] [c2004378] start_kernel+0xb4/0xa84
>>> [ 0.00] [c2a8bf90] [c000da90] 
>>> start_here_common+0x1c/0x20
>>> [ 0.00] Instruction dump:
>>> [ 0.00] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 7c7d1b78 7c9e2378 3be0 
>>> f8010010
>>> [ 0.00] f821ffc1 e923 3969 480c <0b0a> 7d3f4b78 
>>> 393f0001 7fbf5840
>>> [ 0.00] ---[ end trace  ]---
>>> [ 0.00]
>>> [ 0.00] Kernel panic - not syncing: Fatal exception
>>> [ 0.00] Rebooting in 180 seconds..
>>> 
>>> This problem was introduced with next-20221101. Git bisect points to
>>> following patch
>>> 
>>> commit 3f82c9c4ac377082e1230f5299e0ccce07b15e12
>>> Date: Tue Oct 25 15:09:43 2022 +0800
>>> memblock: don't run loop in memblock_add_range() twice
>>> 
>>> Reverting this patch helps boot the kernel to login prompt.
>>> 
>>> Have attached .config
>>> 
>>> - Sachin
>> 
>> --
>> Sincerely yours,
>> Mike.


0001-memblock-don-t-run-loop-in-memblock_add_range-twice-.patch
Description: Binary data


Re: [6.1.0-rc3-next-20221104] Boot failure - kernel BUG at mm/memblock.c:519

2022-11-09 Thread Sachin Sant



> On 09-Nov-2022, at 3:55 PM, Yajun Deng  wrote:
> 
> November 9, 2022 6:03 PM, "Yajun Deng"  wrote:
> 
>> Hey Mike,
>> 
> Sorry, this email should be sent to Sachin but not Mike. 
> Please forgive my confusion. So:
> 
> Hey Sachin,
> Can you help me test the attached file? 
> Please use this new patch instead of the one in memblock tree.

Thanks for the fix. With the updated patch kernel boots correctly.

Tested-by: Sachin Sant mailto:sach...@linux.ibm.com>>

- Sachin



Re: [6.1.0-rc4-next-20221108] Boot failure on powerpc

2022-11-09 Thread Sachin Sant



> On 09-Nov-2022, at 3:25 PM, Jason A. Donenfeld  wrote:
> 
> Should be fixed already in today's next.

Yup, thanks. next-20221109 boots successfully.

- Sachin 


Re: [6.1.0-rc3-next-20221104] Boot failure - kernel BUG at mm/memblock.c:519

2022-11-09 Thread Yajun Deng
November 9, 2022 6:55 PM, "Sachin Sant"  wrote:

>> On 09-Nov-2022, at 3:55 PM, Yajun Deng  wrote:
>> 
>> November 9, 2022 6:03 PM, "Yajun Deng"  wrote:
>> 
>>> Hey Mike,
>> 
>> Sorry, this email should be sent to Sachin but not Mike.
>> Please forgive my confusion. So:
>> 
>> Hey Sachin,
>> Can you help me test the attached file?
>> Please use this new patch instead of the one in memblock tree.
> 
> Thanks for the fix. With the updated patch kernel boots correctly.
> 

Thanks for your test results.

Hi Mike,
Do you have any other suggestions for this patch? If not, I'll send a v3 patch.

> Tested-by: Sachin Sant >
> 
> - Sachin


Re: [6.1.0-rc3-next-20221104] Boot failure - kernel BUG at mm/memblock.c:519

2022-11-09 Thread Mike Rapoport
Hi Yajun,

On Wed, Nov 09, 2022 at 11:32:27AM +, Yajun Deng wrote:
> November 9, 2022 6:55 PM, "Sachin Sant"  wrote:
> 
> >> On 09-Nov-2022, at 3:55 PM, Yajun Deng  wrote:
> >> 
> >> November 9, 2022 6:03 PM, "Yajun Deng"  wrote:
> >> 
> >>> Hey Mike,
> >> 
> >> Sorry, this email should be sent to Sachin but not Mike.
> >> Please forgive my confusion. So:
> >> 
> >> Hey Sachin,
> >> Can you help me test the attached file?
> >> Please use this new patch instead of the one in memblock tree.
> > 
> > Thanks for the fix. With the updated patch kernel boots correctly.
> > 
> 
> Thanks for your test results.
> 
> Hi Mike,
> Do you have any other suggestions for this patch? If not, I'll send a v3 
> patch.

Unfortunately I don't think the new version has much value as it does not
really eliminate the second loop in case memory allocation is required.
I'd say the improvement is not worth the churn.
 
> > Tested-by: Sachin Sant >
> > 
> > - Sachin

-- 
Sincerely yours,
Mike.


Re: [6.1.0-rc3-next-20221104] Boot failure - kernel BUG at mm/memblock.c:519

2022-11-09 Thread Yajun Deng
November 9, 2022 7:42 PM, "Mike Rapoport"  wrote:

> Hi Yajun,
> 
> On Wed, Nov 09, 2022 at 11:32:27AM +, Yajun Deng wrote:
> 
>> November 9, 2022 6:55 PM, "Sachin Sant"  wrote:
>> 
>> On 09-Nov-2022, at 3:55 PM, Yajun Deng  wrote:
>> 
>> November 9, 2022 6:03 PM, "Yajun Deng"  wrote:
>> 
>> Hey Mike,
>> 
>> Sorry, this email should be sent to Sachin but not Mike.
>> Please forgive my confusion. So:
>> 
>> Hey Sachin,
>> Can you help me test the attached file?
>> Please use this new patch instead of the one in memblock tree.
>> 
>> Thanks for the fix. With the updated patch kernel boots correctly.
>> 
>> Thanks for your test results.
>> 
>> Hi Mike,
>> Do you have any other suggestions for this patch? If not, I'll send a v3 
>> patch.
> 
> Unfortunately I don't think the new version has much value as it does not
> really eliminate the second loop in case memory allocation is required.
> I'd say the improvement is not worth the churn.
> 
OK, I got it.

>> Tested-by: Sachin Sant >
>> 
>> - Sachin
> 
> --
> Sincerely yours,
> Mike.


Re: [PATCH 2/4] fs: define a firmware security filesystem named fwsecurityfs

2022-11-09 Thread Greg Kroah-Hartman
On Sun, Nov 06, 2022 at 04:07:42PM -0500, Nayna Jain wrote:
> securityfs is meant for Linux security subsystems to expose policies/logs
> or any other information. However, there are various firmware security
> features which expose their variables for user management via the kernel.
> There is currently no single place to expose these variables. Different
> platforms use sysfs/platform specific filesystem(efivarfs)/securityfs
> interface as they find it appropriate. Thus, there is a gap in kernel
> interfaces to expose variables for security features.
> 
> Define a firmware security filesystem (fwsecurityfs) to be used by
> security features enabled by the firmware. These variables are platform
> specific. This filesystem provides platforms a way to implement their
>  own underlying semantics by defining own inode and file operations.
> 
> Similar to securityfs, the firmware security filesystem is recommended
> to be exposed on a well known mount point /sys/firmware/security.
> Platforms can define their own directory or file structure under this path.
> 
> Example:
> 
> # mount -t fwsecurityfs fwsecurityfs /sys/firmware/security

Why not juset use securityfs in /sys/security/firmware/ instead?  Then
you don't have to create a new filesystem and convince userspace to
mount it in a specific location?

thanks,

greg k-h


Re: [PATCH v1 2/2] stackprotector: actually use get_random_canary()

2022-11-09 Thread Catalin Marinas
On Sun, Oct 23, 2022 at 10:32:08PM +0200, Jason A. Donenfeld wrote:
> The RNG always mixes in the Linux version extremely early in boot. It
> also always includes a cycle counter, not only during early boot, but
> each and every time it is invoked prior to being fully initialized.
> Together, this means that the use of additional xors inside of the
> various stackprotector.h files is superfluous and over-complicated.
> Instead, we can get exactly the same thing, but better, by just calling
> `get_random_canary()`.
> 
> Signed-off-by: Jason A. Donenfeld 
> ---
>  arch/arm/include/asm/stackprotector.h |  9 +
>  arch/arm64/include/asm/stackprotector.h   |  9 +

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH 2/4] fs: define a firmware security filesystem named fwsecurityfs

2022-11-09 Thread Nayna



On 11/9/22 08:46, Greg Kroah-Hartman wrote:

On Sun, Nov 06, 2022 at 04:07:42PM -0500, Nayna Jain wrote:

securityfs is meant for Linux security subsystems to expose policies/logs
or any other information. However, there are various firmware security
features which expose their variables for user management via the kernel.
There is currently no single place to expose these variables. Different
platforms use sysfs/platform specific filesystem(efivarfs)/securityfs
interface as they find it appropriate. Thus, there is a gap in kernel
interfaces to expose variables for security features.

Define a firmware security filesystem (fwsecurityfs) to be used by
security features enabled by the firmware. These variables are platform
specific. This filesystem provides platforms a way to implement their
  own underlying semantics by defining own inode and file operations.

Similar to securityfs, the firmware security filesystem is recommended
to be exposed on a well known mount point /sys/firmware/security.
Platforms can define their own directory or file structure under this path.

Example:

# mount -t fwsecurityfs fwsecurityfs /sys/firmware/security

Why not juset use securityfs in /sys/security/firmware/ instead?  Then
you don't have to create a new filesystem and convince userspace to
mount it in a specific location?


From man 5 sysfs page:

/sys/firmware: This subdirectory contains interfaces for viewing and 
manipulating firmware-specific objects and attributes.


/sys/kernel: This subdirectory contains various files and subdirectories 
that provide information about the running kernel.


The security variables which are being exposed via fwsecurityfs are 
managed by firmware, stored in firmware managed space and also often 
consumed by firmware for enabling various security features.


From git commit b67dbf9d4c1987c370fd18fdc4cf9d8aaea604c2, the purpose 
of securityfs(/sys/kernel/security) is to provide a common place for all 
kernel LSMs. The idea of
fwsecurityfs(/sys/firmware/security) is to similarly provide a common 
place for all firmware security objects.


/sys/firmware already exists. The patch now defines a new /security 
directory in it for firmware security features. Using 
/sys/kernel/security would mean scattering firmware objects in multiple 
places and confusing the purpose of /sys/kernel and /sys/firmware.


Even though fwsecurityfs code is based on securityfs, since the two 
filesystems expose different types of objects and have different 
requirements, there are distinctions:


1. fwsecurityfs lets users create files in userspace, securityfs only 
allows kernel subsystems to create files.


2. firmware and kernel objects may have different requirements. For 
example, consideration of namespacing. As per my understanding, 
namespacing is applied to kernel resources and not firmware resources. 
That's why it makes sense to add support for namespacing in securityfs, 
but we concluded that fwsecurityfs currently doesn't need it. Another 
but similar example of it is: TPM space, which is exposed from hardware. 
For containers, the TPM would be made as virtual/software TPM. Similarly 
for firmware space for containers, it would have to be something 
virtualized/software version of it.


3. firmware objects are persistent and read at boot time by interaction 
with firmware, unlike kernel objects which are not persistent.


For a more detailed explanation refer to the LSS-NA 2022 "PowerVM 
Platform Keystore - Securing Linux Credentials Locally" talk and 
slides[1]. The link to previously posted RFC version is [2].


[1] 
https://static.sched.com/hosted_files/lssna2022/25/NaynaJain_PowerVM_PlatformKeyStore_SecuringLinuxCredentialsLocally.pdf

[2] https://lore.kernel.org/linuxppc-dev/yrqqphi4+jhz1...@kroah.com/

Thanks & Regards,

 - Nayna



thanks,

greg k-h


Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs

2022-11-09 Thread Christophe Leroy
+ linuxppc-dev list as we start mentioning powerpc.

Le 09/11/2022 à 18:43, Song Liu a écrit :
> On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport  wrote:
>>
> [...]
> 

 The proposed execmem_alloc() looks to me very much tailored for x86
 to be
 used as a replacement for module_alloc(). Some architectures have
 module_alloc() that is quite different from the default or x86
 version, so
 I'd expect at least some explanation how modules etc can use execmem_
 APIs
 without breaking !x86 architectures.
>>>
>>> I think this is fair, but I think we should ask ask ourselves - how
>>> much should we do in one step?
>>
>> I think that at least we need an evidence that execmem_alloc() etc can be
>> actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work
>> for him at all, so having a core MM API for code allocation that only works
>> with BPF on x86 seems not right to me.
> 
> While using execmem_alloc() et. al. in module support is difficult, folks are
> making progress with it. For example, the prototype would be more difficult
> before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> (introduced by Christophe).

By the way, the motivation for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC 
was completely different: This was because on powerpc book3s/32, no-exec 
flaggin is per segment of size 256 Mbytes, so in order to provide 
STRICT_MODULES_RWX it was necessary to put data outside of the segment 
that holds module text in order to be able to flag RW data as no-exec.

But I'm happy if it can also serve other purposes.

> 
> We also have other users that we can onboard soon: BPF trampoline on
> x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and
> s390.
> 
>>
>>> For non-text_poke() architectures, the way you can make it work is have
>>> the API look like:
>>> execmem_alloc()  <- Does the allocation, but necessarily usable yet
>>> execmem_write()  <- Loads the mapping, doesn't work after finish()
>>> execmem_finish() <- Makes the mapping live (loaded, executable, ready)
>>>
>>> So for text_poke():
>>> execmem_alloc()  <- reserves the mapping
>>> execmem_write()  <- text_pokes() to the mapping
>>> execmem_finish() <- does nothing
>>>
>>> And non-text_poke():
>>> execmem_alloc()  <- Allocates a regular RW vmalloc allocation
>>> execmem_write()  <- Writes normally to it
>>> execmem_finish() <- does set_memory_ro()/set_memory_x() on it
>>>
>>> Non-text_poke() only gets the benefits of centralized logic, but the
>>> interface works for both. This is pretty much what the perm_alloc() RFC
>>> did to make it work with other arch's and modules. But to fit with the
>>> existing modules code (which is actually spread all over) and also
>>> handle RO sections, it also needed some additional bells and whistles.
>>
>> I'm less concerned about non-text_poke() part, but rather about
>> restrictions where code and data can live on different architectures and
>> whether these restrictions won't lead to inability to use the centralized
>> logic on, say, arm64 and powerpc.

Until recently, powerpc CPU didn't implement PC-relative data access. 
Only very recent powerpc CPUs (power10 only ?) have capability to do 
PC-relative accesses, but the kernel doesn't use it yet. So there's no 
constraint about distance between text and data. What matters is the 
distance between core kernel text and module text to avoid trampolines.

>>
>> For instance, if we use execmem_alloc() for modules, it means that data
>> sections should be allocated separately with plain vmalloc(). Will this
>> work universally? Or this will require special care with additional
>> complexity in the modules code?
>>
>>> So the question I'm trying to ask is, how much should we target for the
>>> next step? I first thought that this functionality was so intertwined,
>>> it would be too hard to do iteratively. So if we want to try
>>> iteratively, I'm ok if it doesn't solve everything.
>>
>> With execmem_alloc() as the first step I'm failing to see the large
>> picture. If we want to use it for modules, how will we allocate RO data?
>> with similar rodata_alloc() that uses yet another tree in vmalloc?
>> How the caching of large pages in vmalloc can be made useful for use cases
>> like secretmem and PKS?
> 
> If RO data causes problems with direct map fragmentation, we can use
> similar logic. I think we will need another tree in vmalloc for this case.
> Since the logic will be mostly identical, I personally don't think adding
> another tree is a big overhead.

On powerpc, kernel core RAM is not mapped by pages but is mapped by 
blocks. There are only two blocks: One ROX block which contains both 
text and rodata, and one RW block that contains everything else. Maybe 
the same can be done for modules. What matters is to be sure you never 
have WX memory. Having ROX rodata is not an issue.

Christophe

Re: [PATCH net-next v2 00/11] net: pcs: Add support for devices probed in the "usual" manner

2022-11-09 Thread Vladimir Oltean
On Thu, Nov 03, 2022 at 05:06:39PM -0400, Sean Anderson wrote:
> Several (later) patches in this series cannot be applied until a stable
> release has occured containing the dts updates.

New kernels must remain compatible with old device trees.


Re: [PATCH net-next v2 00/11] net: pcs: Add support for devices probed in the "usual" manner

2022-11-09 Thread Vladimir Oltean
On Thu, Nov 03, 2022 at 05:06:39PM -0400, Sean Anderson wrote:
> For a long time, PCSs have been tightly coupled with their MACs. For
> this reason, the MAC creates the "phy" or mdio device, and then passes
> it to the PCS to initialize. This has a few disadvantages:
> 
> - Each MAC must re-implement the same steps to look up/create a PCS
> - The PCS cannot use functions tied to device lifetime, such as devm_*.
> - Generally, the PCS does not have easy access to its device tree node

Is there a clear need to solve these disadvantages? There comes extra
runtime complexity with the PCS-as-device scheme (plus the extra
complexity needed to address the DT backwards compatibility problems
it causes; not addressed here).


Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:

> -#define queued_spin_lock queued_spin_lock
>  
> -static inline void queued_spin_unlock(struct qspinlock *lock)
> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> - if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> - smp_store_release(&lock->locked, 0);
> - else
> - __pv_queued_spin_unlock(lock);
> + if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> + return 1;
> + return 0;

optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);

[resend as utf-8, not utf-7]



Re: [PATCH 02/17] powerpc/qspinlock: add mcs queueing for contended waiters

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:

[resend as utf-8, not utf-7]
>  
> +/*
> + * Bitfields in the atomic value:
> + *
> + * 0: locked bit
> + * 16-31: tail cpu (+1)
> + */
> +#define  _Q_SET_MASK(type)   (((1U << _Q_ ## type ## _BITS) - 1)\
> +   << _Q_ ## type ## _OFFSET)
> +#define _Q_LOCKED_OFFSET 0
> +#define _Q_LOCKED_BITS   1
> +#define _Q_LOCKED_MASK   _Q_SET_MASK(LOCKED)
> +#define _Q_LOCKED_VAL(1U << _Q_LOCKED_OFFSET)
> +
> +#define _Q_TAIL_CPU_OFFSET   16
> +#define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
> +#define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)
> +

Just to state the obvious this is:

#define _Q_LOCKED_OFFSET0
#define _Q_LOCKED_BITS  1
#define _Q_LOCKED_MASK  0x0001
#define _Q_LOCKED_VAL   1

#define _Q_TAIL_CPU_OFFSET  16
#define _Q_TAIL_CPU_BITS16
#define _Q_TAIL_CPU_MASK0x

> +#if CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)
> +#error "qspinlock does not support such large CONFIG_NR_CPUS"
> +#endif
> +
>  #endif /* _ASM_POWERPC_QSPINLOCK_TYPES_H */
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 8dbce99a373c..5ebb88d95636 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -1,12 +1,172 @@
>  // SPDX-License-Identifier: GPL-2.0-or-later
> +#include 
> +#include 
> +#include 
>  #include 
> -#include 
> +#include 
> +#include 
>  #include 
>  
> -void queued_spin_lock_slowpath(struct qspinlock *lock)
> +#define MAX_NODES4
> +
> +struct qnode {
> + struct qnode*next;
> + struct qspinlock *lock;
> + u8  locked; /* 1 if lock acquired */
> +};
> +
> +struct qnodes {
> + int count;
> + struct qnode nodes[MAX_NODES];
> +};

I think it could be worth commenting why qnodes::count instead 
_Q_TAIL_IDX_OFFSET.

> +
> +static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> +
> +static inline int encode_tail_cpu(void)

I think the generic version that takes smp_processor_id() as a parameter is 
clearer - at least with this function name.

> +{
> + return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> +}
> +
> +static inline int get_tail_cpu(int val)

It seems like there should be a "decode" function to pair up with the "encode" 
function.

> +{
> + return (val >> _Q_TAIL_CPU_OFFSET) - 1;
> +}
> +
> +/* Take the lock by setting the bit, no other CPUs may concurrently lock it. 
> */

Does that comment mean it is not necessary to use an atomic_or here?

> +static __always_inline void lock_set_locked(struct qspinlock *lock)

nit: could just be called set_locked()

> +{
> + atomic_or(_Q_LOCKED_VAL, &lock->val);
> + __atomic_acquire_fence();
> +}
> +
> +/* Take lock, clearing tail, cmpxchg with val (which must not be locked) */
> +static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, 
> int val)
> +{
> + int newval = _Q_LOCKED_VAL;
> +
> + if (atomic_cmpxchg_acquire(&lock->val, val, newval) == val)
> + return 1;
> + else
> + return 0;

same optional style nit: return (atomic_cmpxchg_acquire(&lock->val, val, 
newval) == val);

> +}
> +
> +/*
> + * Publish our tail, replacing previous tail. Return previous value.
> + *
> + * This provides a release barrier for publishing node, and an acquire 
> barrier
> + * for getting the old node.
> + */
> +static __always_inline int publish_tail_cpu(struct qspinlock *lock, int tail)

Did you change from the xchg_tail() name in the generic version because of the 
release and acquire barriers this provides?
Does "publish" generally imply the old value will be returned?

>  {
> - while (!queued_spin_trylock(lock))
> + for (;;) {
> + int val = atomic_read(&lock->val);
> + int newval = (val & ~_Q_TAIL_CPU_MASK) | tail;
> + int old;
> +
> + old = atomic_cmpxchg(&lock->val, val, newval);
> + if (old == val)
> + return old;
> + }
> +}
> +
> +static struct qnode *get_tail_qnode(struct qspinlock *lock, int val)
> +{
> + int cpu = get_tail_cpu(val);
> + struct qnodes *qnodesp = per_cpu_ptr(&qnodes, cpu);
> + int idx;
> +
> + for (idx = 0; idx < MAX_NODES; idx++) {
> + struct qnode *qnode = &qnodesp->nodes[idx];
> + if (qnode->lock == lock)
> + return qnode;
> + }

In case anyone else is confused by this, Nick explained each cpu can only queue 
on a unique spinlock once regardless of "idx" level.

> +
> + BUG();
> +}
> +
> +static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> +{
> + struct qnodes *qnodesp;
> + struct qnode *next, *node;
> + int val, old, tail;
> + int idx;
> +
> + BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
> +
> + qnodesp = this_cpu_ptr(&qnodes);
> + if (unlikely(qnodesp-

Re: [PATCH 03/17] powerpc/qspinlock: use a half-word store to unlock to avoid larx/stcx.

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> The first 16 bits of the lock are only modified by the owner, and other
> modifications always use atomic operations on the entire 32 bits, so
> unlocks can use plain stores on the 16 bits. This is the same kind of
> optimisation done by core qspinlock code.
> ---
>  arch/powerpc/include/asm/qspinlock.h   |  6 +-
>  arch/powerpc/include/asm/qspinlock_types.h | 19 +--
>  2 files changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h 
> b/arch/powerpc/include/asm/qspinlock.h
> index f06117aa60e1..79a1936fb68d 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -38,11 +38,7 @@ static __always_inline void queued_spin_lock(struct 
> qspinlock *lock)
>  
>  static inline void queued_spin_unlock(struct qspinlock *lock)
>  {
> - for (;;) {
> - int val = atomic_read(&lock->val);
> - if (atomic_cmpxchg_release(&lock->val, val, val & 
> ~_Q_LOCKED_VAL) == val)
> - return;
> - }
> + smp_store_release(&lock->locked, 0);

Is it also possible for lock_set_locked() to use a non-atomic acquire
operation?

>  }
>  
>  #define arch_spin_is_locked(l)   queued_spin_is_locked(l)
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 9630e714c70d..3425dab42576 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -3,12 +3,27 @@
>  #define _ASM_POWERPC_QSPINLOCK_TYPES_H
>  
>  #include 
> +#include 
>  
>  typedef struct qspinlock {
> - atomic_t val;
> + union {
> + atomic_t val;
> +
> +#ifdef __LITTLE_ENDIAN
> + struct {
> + u16 locked;
> + u8  reserved[2];
> + };
> +#else
> + struct {
> + u8  reserved[2];
> + u16 locked;
> + };
> +#endif
> + };
>  } arch_spinlock_t;

Just to double check we have:

#define _Q_LOCKED_OFFSET0
#define _Q_LOCKED_BITS  1
#define _Q_LOCKED_MASK  0x0001
#define _Q_LOCKED_VAL   1

#define _Q_TAIL_CPU_OFFSET  16
#define _Q_TAIL_CPU_BITS16
#define _Q_TAIL_CPU_MASK0x


so the ordering here looks correct.

>  
> -#define  __ARCH_SPIN_LOCK_UNLOCKED   { .val = ATOMIC_INIT(0) }
> +#define  __ARCH_SPIN_LOCK_UNLOCKED   { { .val = ATOMIC_INIT(0) } }
>  
>  /*
>   * Bitfields in the atomic value:



Re: [PATCH 04/17] powerpc/qspinlock: convert atomic operations to assembly

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> This uses more optimal ll/sc style access patterns (rather than
> cmpxchg), and also sets the EH=1 lock hint on those operations
> which acquire ownership of the lock.
> ---
>  arch/powerpc/include/asm/qspinlock.h   | 25 +--
>  arch/powerpc/include/asm/qspinlock_types.h |  6 +-
>  arch/powerpc/lib/qspinlock.c   | 81 +++---
>  3 files changed, 79 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h 
> b/arch/powerpc/include/asm/qspinlock.h
> index 79a1936fb68d..3ab354159e5e 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -2,28 +2,43 @@
>  #ifndef _ASM_POWERPC_QSPINLOCK_H
>  #define _ASM_POWERPC_QSPINLOCK_H
>  
> -#include 
>  #include 
>  #include 
>  
>  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>  {
> - return atomic_read(&lock->val);
> + return READ_ONCE(lock->val);
>  }
>  
>  static __always_inline int queued_spin_value_unlocked(struct qspinlock lock)
>  {
> - return !atomic_read(&lock.val);
> + return !lock.val;
>  }
>  
>  static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
>  {
> - return !!(atomic_read(&lock->val) & _Q_TAIL_CPU_MASK);
> + return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
>  }
>  
>  static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> - if (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0)
> + u32 new = _Q_LOCKED_VAL;
> + u32 prev;
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1,%3  # queued_spin_trylock   \n"
> +"cmpwi   0,%0,0  \n"
> +"bne-2f  \n"
> +"stwcx.  %2,0,%1 \n"
> +"bne-1b  \n"
> +"\t" PPC_ACQUIRE_BARRIER "   \n"
> +"2:  \n"
> + : "=&r" (prev)
> + : "r" (&lock->val), "r" (new),
> +   "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)

btw IS_ENABLED() already returns 1 or 0

> + : "cr0", "memory");

This is the ISA's "test and set" atomic primitive. Do you think it would be 
worth seperating it as a helper?

> +
> + if (likely(prev == 0))
>   return 1;
>   return 0;

same optional style nit: return likely(prev == 0);

>  }
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 3425dab42576..210adf05b235 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -7,7 +7,7 @@
>  
>  typedef struct qspinlock {
>   union {
> - atomic_t val;
> + u32 val;
>  
>  #ifdef __LITTLE_ENDIAN
>   struct {
> @@ -23,10 +23,10 @@ typedef struct qspinlock {
>   };
>  } arch_spinlock_t;
>  
> -#define  __ARCH_SPIN_LOCK_UNLOCKED   { { .val = ATOMIC_INIT(0) } }
> +#define  __ARCH_SPIN_LOCK_UNLOCKED   { { .val = 0 } }
>  
>  /*
> - * Bitfields in the atomic value:
> + * Bitfields in the lock word:
>   *
>   * 0: locked bit
>   * 16-31: tail cpu (+1)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 5ebb88d95636..7c71e5e287df 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -1,5 +1,4 @@
>  // SPDX-License-Identifier: GPL-2.0-or-later
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -22,32 +21,59 @@ struct qnodes {
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static inline int encode_tail_cpu(void)
> +static inline u32 encode_tail_cpu(void)
>  {
>   return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
>  }
>  
> -static inline int get_tail_cpu(int val)
> +static inline int get_tail_cpu(u32 val)
>  {
>   return (val >> _Q_TAIL_CPU_OFFSET) - 1;
>  }
>  
>  /* Take the lock by setting the bit, no other CPUs may concurrently lock it. 
> */

I think you missed deleting the above line.

> +/* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> - atomic_or(_Q_LOCKED_VAL, &lock->val);
> - __atomic_acquire_fence();
> + u32 new = _Q_LOCKED_VAL;
> + u32 prev;
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1,%3  # lock_set_locked   \n"
> +"or  %0,%0,%2\n"
> +"stwcx.  %0,0,%1 \n"
> +"bne-1b  \n"
> +"\t" PPC_ACQUIRE_BARRIER "   \n"
> + : "=&r" (prev)
> + : "r" (&lock->val), "r" 

Re: [PATCH 05/17] powerpc/qspinlock: allow new waiters to steal the lock before queueing

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Allow new waiters a number of spins on the lock word before queueing,
> which particularly helps paravirt performance when physical CPUs are
> oversubscribed.
> ---
>  arch/powerpc/lib/qspinlock.c | 152 ---
>  1 file changed, 141 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 7c71e5e287df..1625cce714b2 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -19,8 +19,17 @@ struct qnodes {
>   struct qnode nodes[MAX_NODES];
>  };
>  
> +/* Tuning parameters */
> +static int STEAL_SPINS __read_mostly = (1<<5);
> +static bool MAYBE_STEALERS __read_mostly = true;

I can understand why, but macro case variables can be a bit confusing.

> +
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> +static __always_inline int get_steal_spins(void)
> +{
> + return STEAL_SPINS;
> +}
> +
>  static inline u32 encode_tail_cpu(void)
>  {
>   return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> @@ -76,6 +85,39 @@ static __always_inline int trylock_clear_tail_cpu(struct 
> qspinlock *lock, u32 ol
>   return 0;
>  }
>  
> +static __always_inline u32 __trylock_cmpxchg(struct qspinlock *lock, u32 
> old, u32 new)
> +{
> + u32 prev;
> +
> + BUG_ON(old & _Q_LOCKED_VAL);
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1,%4  # queued_spin_trylock_cmpxchg   \n"

s/queued_spin_trylock_cmpxchg/__trylock_cmpxchg/

btw what is the format you using for the '\n's in the inline asm?

> +"cmpw0,%0,%2 \n"
> +"bne-2f  \n"
> +"stwcx.  %3,0,%1 \n"
> +"bne-1b  \n"
> +"\t" PPC_ACQUIRE_BARRIER "   \n"
> +"2:  \n"
> + : "=&r" (prev)
> + : "r" (&lock->val), "r"(old), "r" (new),
> +   "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> + : "cr0", "memory");

This is very similar to trylock_clear_tail_cpu(). So maybe it is worth having
some form of "test and set" primitive helper.

> +
> + return prev;
> +}
> +
> +/* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
> +static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 
> val)
> +{
> + u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
> +
> + if (__trylock_cmpxchg(lock, val, newval) == val)
> + return 1;
> + else
> + return 0;

same optional style nit: return __trylock_cmpxchg(lock, val, newval) == val

> +}
> +
>  /*
>   * Publish our tail, replacing previous tail. Return previous value.
>   *
> @@ -115,6 +157,31 @@ static struct qnode *get_tail_qnode(struct qspinlock 
> *lock, u32 val)
>   BUG();
>  }
>  
> +static inline bool try_to_steal_lock(struct qspinlock *lock)
> +{
> + int iters;
> +
> + /* Attempt to steal the lock */
> + for (;;) {
> + u32 val = READ_ONCE(lock->val);
> +
> + if (unlikely(!(val & _Q_LOCKED_VAL))) {
> + if (trylock_with_tail_cpu(lock, val))
> + return true;
> + continue;
> + }

The continue would bypass iters++/cpu_relax but the next time around
  if (unlikely(!(val & _Q_LOCKED_VAL))) {
should fail so everything should be fine?

> +
> + cpu_relax();
> +
> + iters++;
> +
> + if (iters >= get_steal_spins())
> + break;
> + }
> +
> + return false;
> +}
> +
>  static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>  {
>   struct qnodes *qnodesp;
> @@ -164,20 +231,39 @@ static inline void queued_spin_lock_mcs_queue(struct 
> qspinlock *lock)
>   smp_rmb(); /* acquire barrier for the mcs lock */
>   }
>  
> - /* We're at the head of the waitqueue, wait for the lock. */
> - while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> - cpu_relax();
> + if (!MAYBE_STEALERS) {
> + /* We're at the head of the waitqueue, wait for the lock. */
> + while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> + cpu_relax();
>  
> - /* If we're the last queued, must clean up the tail. */
> - if ((val & _Q_TAIL_CPU_MASK) == tail) {
> - if (trylock_clear_tail_cpu(lock, val))
> - goto release;
> - /* Another waiter must have enqueued */
> - }
> + /* If we're the last queued, must clean up the tail. */
> + if ((val & _Q_TAIL_CPU_MASK) == tail) {
> + if (trylock_clear_tail_cpu(lock, val))
> + goto release;
> +

Re: [PATCH 06/17] powerpc/qspinlock: theft prevention to control latency

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Give the queue head the ability to stop stealers. After a number of
> spins without sucessfully acquiring the lock, the queue head employs
> this, which will assure it is the next owner.
> ---
>  arch/powerpc/include/asm/qspinlock_types.h | 10 +++-
>  arch/powerpc/lib/qspinlock.c   | 56 +-
>  2 files changed, 63 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 210adf05b235..8b20f5e22bba 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -29,7 +29,8 @@ typedef struct qspinlock {
>   * Bitfields in the lock word:
>   *
>   * 0: locked bit
> - * 16-31: tail cpu (+1)
> + *16: must queue bit
> + * 17-31: tail cpu (+1)
>   */
>  #define  _Q_SET_MASK(type)   (((1U << _Q_ ## type ## _BITS) - 1)\
> << _Q_ ## type ## _OFFSET)
> @@ -38,7 +39,12 @@ typedef struct qspinlock {
>  #define _Q_LOCKED_MASK   _Q_SET_MASK(LOCKED)
>  #define _Q_LOCKED_VAL(1U << _Q_LOCKED_OFFSET)
>  
> -#define _Q_TAIL_CPU_OFFSET   16
> +#define _Q_MUST_Q_OFFSET 16
> +#define _Q_MUST_Q_BITS   1
> +#define _Q_MUST_Q_MASK   _Q_SET_MASK(MUST_Q)
> +#define _Q_MUST_Q_VAL(1U << _Q_MUST_Q_OFFSET)
> +
> +#define _Q_TAIL_CPU_OFFSET   17
>  #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET)
>  #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU)

Not a big deal but some of these values could be calculated like in the
generic version. e.g.

#define _Q_PENDING_OFFSET   (_Q_LOCKED_OFFSET +_Q_LOCKED_BITS)

>  
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 1625cce714b2..a906cc8f15fa 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -22,6 +22,7 @@ struct qnodes {
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
>  static bool MAYBE_STEALERS __read_mostly = true;
> +static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -30,6 +31,11 @@ static __always_inline int get_steal_spins(void)
>   return STEAL_SPINS;
>  }
>  
> +static __always_inline int get_head_spins(void)
> +{
> + return HEAD_SPINS;
> +}
> +
>  static inline u32 encode_tail_cpu(void)
>  {
>   return (smp_processor_id() + 1) << _Q_TAIL_CPU_OFFSET;
> @@ -142,6 +148,23 @@ static __always_inline u32 publish_tail_cpu(struct 
> qspinlock *lock, u32 tail)
>   return prev;
>  }
>  
> +static __always_inline u32 lock_set_mustq(struct qspinlock *lock)
> +{
> + u32 new = _Q_MUST_Q_VAL;
> + u32 prev;
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1 # lock_set_mustq\n"

Is the EH bit not set because we don't hold the lock here?

> +"or  %0,%0,%2\n"
> +"stwcx.  %0,0,%1 \n"
> +"bne-1b  \n"
> + : "=&r" (prev)
> + : "r" (&lock->val), "r" (new)
> + : "cr0", "memory");

This is another usage close to the DEFINE_TESTOP() pattern.

> +
> + return prev;
> +}
> +
>  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  {
>   int cpu = get_tail_cpu(val);
> @@ -165,6 +188,9 @@ static inline bool try_to_steal_lock(struct qspinlock 
> *lock)
>   for (;;) {
>   u32 val = READ_ONCE(lock->val);
>  
> + if (val & _Q_MUST_Q_VAL)
> + break;
> +
>   if (unlikely(!(val & _Q_LOCKED_VAL))) {
>   if (trylock_with_tail_cpu(lock, val))
>   return true;
> @@ -246,11 +272,22 @@ static inline void queued_spin_lock_mcs_queue(struct 
> qspinlock *lock)
>   /* We must be the owner, just set the lock bit and acquire */
>   lock_set_locked(lock);
>   } else {
> + int iters = 0;
> + bool set_mustq = false;
> +
>  again:
>   /* We're at the head of the waitqueue, wait for the lock. */
> - while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> + while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
>   cpu_relax();
>  
> + iters++;

It seems instead of using set_mustq, (val & _Q_MUST_Q_VAL) could be checked?

> + if (!set_mustq && iters >= get_head_spins()) {
> + set_mustq = true;
> + lock_set_mustq(lock);
> + val |= _Q_MUST_Q_VAL;
> + }
> + }
> +
>   /* If we're the last queued, must clean up the tail. */
>   if ((val & _Q_TAIL_CPU

Re: [PATCH 07/17] powerpc/qspinlock: store owner CPU in lock word

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Store the owner CPU number in the lock word so it may be yielded to,
> as powerpc's paravirtualised simple spinlocks do.
> ---
>  arch/powerpc/include/asm/qspinlock.h   |  8 +++-
>  arch/powerpc/include/asm/qspinlock_types.h | 10 ++
>  arch/powerpc/lib/qspinlock.c   |  6 +++---
>  3 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h 
> b/arch/powerpc/include/asm/qspinlock.h
> index 3ab354159e5e..44601b261e08 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -20,9 +20,15 @@ static __always_inline int queued_spin_is_contended(struct 
> qspinlock *lock)
>   return !!(READ_ONCE(lock->val) & _Q_TAIL_CPU_MASK);
>  }
>  
> +static __always_inline u32 queued_spin_get_locked_val(void)

Maybe this function should have "encode" in the name to match with
encode_tail_cpu().


> +{
> + /* XXX: make this use lock value in paca like simple spinlocks? */

Is that the paca's lock_token which is 0x8000?


> + return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
> +}
> +
>  static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>  {
> - u32 new = _Q_LOCKED_VAL;
> + u32 new = queued_spin_get_locked_val();
>   u32 prev;
>  
>   asm volatile(
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 8b20f5e22bba..35f9525381e6 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -29,6 +29,8 @@ typedef struct qspinlock {
>   * Bitfields in the lock word:
>   *
>   * 0: locked bit
> + *  1-14: lock holder cpu
> + *15: unused bit
>   *16: must queue bit
>   * 17-31: tail cpu (+1)

So there is one more bit to store the tail cpu vs the lock holder cpu?

>   */
> @@ -39,6 +41,14 @@ typedef struct qspinlock {
>  #define _Q_LOCKED_MASK   _Q_SET_MASK(LOCKED)
>  #define _Q_LOCKED_VAL(1U << _Q_LOCKED_OFFSET)
>  
> +#define _Q_OWNER_CPU_OFFSET  1
> +#define _Q_OWNER_CPU_BITS14
> +#define _Q_OWNER_CPU_MASK_Q_SET_MASK(OWNER_CPU)
> +
> +#if CONFIG_NR_CPUS > (1U << _Q_OWNER_CPU_BITS)
> +#error "qspinlock does not support such large CONFIG_NR_CPUS"
> +#endif
> +
>  #define _Q_MUST_Q_OFFSET 16
>  #define _Q_MUST_Q_BITS   1
>  #define _Q_MUST_Q_MASK   _Q_SET_MASK(MUST_Q)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index a906cc8f15fa..aa26cfe21f18 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -50,7 +50,7 @@ static inline int get_tail_cpu(u32 val)
>  /* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> - u32 new = _Q_LOCKED_VAL;
> + u32 new = queued_spin_get_locked_val();
>   u32 prev;
>  
>   asm volatile(
> @@ -68,7 +68,7 @@ static __always_inline void lock_set_locked(struct 
> qspinlock *lock)
>  /* Take lock, clearing tail, cmpxchg with old (which must not be locked) */
>  static __always_inline int trylock_clear_tail_cpu(struct qspinlock *lock, 
> u32 old)
>  {
> - u32 new = _Q_LOCKED_VAL;
> + u32 new = queued_spin_get_locked_val();
>   u32 prev;
>  
>   BUG_ON(old & _Q_LOCKED_VAL);
> @@ -116,7 +116,7 @@ static __always_inline u32 __trylock_cmpxchg(struct 
> qspinlock *lock, u32 old, u3
>  /* Take lock, preserving tail, cmpxchg with val (which must not be locked) */
>  static __always_inline int trylock_with_tail_cpu(struct qspinlock *lock, u32 
> val)
>  {
> - u32 newval = _Q_LOCKED_VAL | (val & _Q_TAIL_CPU_MASK);
> + u32 newval = queued_spin_get_locked_val() | (val & _Q_TAIL_CPU_MASK);
>  
>   if (__trylock_cmpxchg(lock, val, newval) == val)
>   return 1;



Re: [PATCH 08/17] powerpc/qspinlock: paravirt yield to lock owner

2022-11-09 Thread Jordan Niethe
 On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
 [resend as utf-8, not utf-7]
> Waiters spinning on the lock word should yield to the lock owner if the
> vCPU is preempted. This improves performance when the hypervisor has
> oversubscribed physical CPUs.
> ---
>  arch/powerpc/lib/qspinlock.c | 97 ++--
>  1 file changed, 83 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index aa26cfe21f18..55286ac91da5 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define MAX_NODES4
>  
> @@ -24,14 +25,16 @@ static int STEAL_SPINS __read_mostly = (1<<5);
>  static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
> +static bool pv_yield_owner __read_mostly = true;

Not macro case for these globals? To me name does not make it super clear this
is a boolean. What about pv_yield_owner_enabled?

> +
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static __always_inline int get_steal_spins(void)
> +static __always_inline int get_steal_spins(bool paravirt)
>  {
>   return STEAL_SPINS;
>  }
>  
> -static __always_inline int get_head_spins(void)
> +static __always_inline int get_head_spins(bool paravirt)
>  {
>   return HEAD_SPINS;
>  }
> @@ -46,7 +49,11 @@ static inline int get_tail_cpu(u32 val)
>   return (val >> _Q_TAIL_CPU_OFFSET) - 1;
>  }
>  
> -/* Take the lock by setting the bit, no other CPUs may concurrently lock it. 
> */
> +static inline int get_owner_cpu(u32 val)
> +{
> + return (val & _Q_OWNER_CPU_MASK) >> _Q_OWNER_CPU_OFFSET;
> +}
> +
>  /* Take the lock by setting the lock bit, no other CPUs will touch it. */
>  static __always_inline void lock_set_locked(struct qspinlock *lock)
>  {
> @@ -180,7 +187,45 @@ static struct qnode *get_tail_qnode(struct qspinlock 
> *lock, u32 val)
>   BUG();
>  }
>  
> -static inline bool try_to_steal_lock(struct qspinlock *lock)
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)

This name doesn't seem correct for the non paravirt case.

> +{
> + int owner;
> + u32 yield_count;
> +
> + BUG_ON(!(val & _Q_LOCKED_VAL));
> +
> + if (!paravirt)
> + goto relax;
> +
> + if (!pv_yield_owner)
> + goto relax;
> +
> + owner = get_owner_cpu(val);
> + yield_count = yield_count_of(owner);
> +
> + if ((yield_count & 1) == 0)
> + goto relax; /* owner vcpu is running */

I wonder why not use vcpu_is_preempted()?

> +
> + /*
> +  * Read the lock word after sampling the yield count. On the other side
> +  * there may a wmb because the yield count update is done by the
> +  * hypervisor preemption and the value update by the OS, however this
> +  * ordering might reduce the chance of out of order accesses and
> +  * improve the heuristic.
> +  */
> + smp_rmb();
> +
> + if (READ_ONCE(lock->val) == val) {
> + yield_to_preempted(owner, yield_count);
> + /* Don't relax if we yielded. Maybe we should? */
> + return;
> + }
> +relax:
> + cpu_relax();
> +}
> +
> +
> +static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool 
> paravirt)
>  {
>   int iters;
>  
> @@ -197,18 +242,18 @@ static inline bool try_to_steal_lock(struct qspinlock 
> *lock)
>   continue;
>   }
>  
> - cpu_relax();
> + yield_to_locked_owner(lock, val, paravirt);
>  
>   iters++;
>  
> - if (iters >= get_steal_spins())
> + if (iters >= get_steal_spins(paravirt))
>   break;
>   }
>  
>   return false;
>  }
>  
> -static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> +static __always_inline void queued_spin_lock_mcs_queue(struct qspinlock 
> *lock, bool paravirt)
>  {
>   struct qnodes *qnodesp;
>   struct qnode *next, *node;
> @@ -260,7 +305,7 @@ static inline void queued_spin_lock_mcs_queue(struct 
> qspinlock *lock)
>   if (!MAYBE_STEALERS) {
>   /* We're at the head of the waitqueue, wait for the lock. */
>   while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> - cpu_relax();
> + yield_to_locked_owner(lock, val, paravirt);
>  
>   /* If we're the last queued, must clean up the tail. */
>   if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -278,10 +323,10 @@ static inline void queued_spin_lock_mcs_queue(struct 
> qspinlock *lock)
>  again:
>   /* We're at the head of the waitqueue, wait for the lock. */
>   while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> - cpu_relax();
> + yield_to_locked_owner(lock, val, paravirt);
>  
>  

Re: [PATCH 09/17] powerpc/qspinlock: implement option to yield to previous node

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Queued waiters which are not at the head of the queue don't spin on
> the lock word but their qnode lock word, waiting for the previous queued
> CPU to release them. Add an option which allows these waiters to yield
> to the previous CPU if its vCPU is preempted.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 46 +++-
>  1 file changed, 45 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 55286ac91da5..b39f8c5b329c 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> +static bool pv_yield_prev __read_mostly = true;

Similiar suggestion, maybe pv_yield_prev_enabled would read better.

Isn't this enabled by default contrary to the commit message?


>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -224,6 +225,31 @@ static __always_inline void yield_to_locked_owner(struct 
> qspinlock *lock, u32 va
>   cpu_relax();
>  }
>  
> +static __always_inline void yield_to_prev(struct qspinlock *lock, struct 
> qnode *node, int prev_cpu, bool paravirt)

yield_to_locked_owner() takes a raw val and works out the cpu to yield to.
I think for consistency have yield_to_prev() take the raw val and work it out 
too.

> +{
> + u32 yield_count;
> +
> + if (!paravirt)
> + goto relax;
> +
> + if (!pv_yield_prev)
> + goto relax;
> +
> + yield_count = yield_count_of(prev_cpu);
> + if ((yield_count & 1) == 0)
> + goto relax; /* owner vcpu is running */
> +
> + smp_rmb(); /* See yield_to_locked_owner comment */
> +
> + if (!node->locked) {
> + yield_to_preempted(prev_cpu, yield_count);
> + return;
> + }
> +
> +relax:
> + cpu_relax();
> +}
> +
>  
>  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool 
> paravirt)
>  {
> @@ -291,13 +317,14 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>*/
>   if (old & _Q_TAIL_CPU_MASK) {
>   struct qnode *prev = get_tail_qnode(lock, old);
> + int prev_cpu = get_tail_cpu(old);

This could then be removed.

>  
>   /* Link @node into the waitqueue. */
>   WRITE_ONCE(prev->next, node);
>  
>   /* Wait for mcs node lock to be released */
>   while (!node->locked)
> - cpu_relax();
> + yield_to_prev(lock, node, prev_cpu, paravirt);

And would have this as:
yield_to_prev(lock, node, old, paravirt);


>  
>   smp_rmb(); /* acquire barrier for the mcs lock */
>   }
> @@ -448,12 +475,29 @@ static int pv_yield_owner_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_owner, pv_yield_owner_get, 
> pv_yield_owner_set, "%llu\n");
>  
> +static int pv_yield_prev_set(void *data, u64 val)
> +{
> + pv_yield_prev = !!val;
> +
> + return 0;
> +}
> +
> +static int pv_yield_prev_get(void *data, u64 *val)
> +{
> + *val = pv_yield_prev;
> +
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_prev, pv_yield_prev_get, 
> pv_yield_prev_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>   debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, 
> &fops_steal_spins);
>   debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, 
> &fops_head_spins);
>   if (is_shared_processor()) {
>   debugfs_create_file("qspl_pv_yield_owner", 0600, 
> arch_debugfs_dir, NULL, &fops_pv_yield_owner);
> + debugfs_create_file("qspl_pv_yield_prev", 0600, 
> arch_debugfs_dir, NULL, &fops_pv_yield_prev);
>   }
>  
>   return 0;



Re: [PATCH 10/17] powerpc/qspinlock: allow stealing when head of queue yields

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> If the head of queue is preventing stealing but it finds the owner vCPU
> is preempted, it will yield its cycles to the owner which could cause it
> to become preempted. Add an option to re-allow stealers before yielding,
> and disallow them again after returning from the yield.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 56 ++--
>  1 file changed, 53 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index b39f8c5b329c..94f007f66942 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -26,6 +26,7 @@ static bool MAYBE_STEALERS __read_mostly = true;
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> +static bool pv_yield_allow_steal __read_mostly = false;

To me this one does read as a boolean, but if you go with those other changes
I'd make it pv_yield_steal_enable to be consistent.

>  static bool pv_yield_prev __read_mostly = true;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> @@ -173,6 +174,23 @@ static __always_inline u32 lock_set_mustq(struct 
> qspinlock *lock)
>   return prev;
>  }
>  
> +static __always_inline u32 lock_clear_mustq(struct qspinlock *lock)
> +{
> + u32 new = _Q_MUST_Q_VAL;
> + u32 prev;
> +
> + asm volatile(
> +"1:  lwarx   %0,0,%1 # lock_clear_mustq  \n"
> +"andc%0,%0,%2\n"
> +"stwcx.  %0,0,%1 \n"
> +"bne-1b  \n"
> + : "=&r" (prev)
> + : "r" (&lock->val), "r" (new)
> + : "cr0", "memory");
> +

This is pretty similar to the DEFINE_TESTOP() pattern again with the same llong 
caveat.


> + return prev;
> +}
> +
>  static struct qnode *get_tail_qnode(struct qspinlock *lock, u32 val)
>  {
>   int cpu = get_tail_cpu(val);
> @@ -188,7 +206,7 @@ static struct qnode *get_tail_qnode(struct qspinlock 
> *lock, u32 val)
>   BUG();
>  }
>  
> -static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)
> +static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt, bool clear_mustq)

 /* See yield_to_locked_owner comment */ comment needs to be updated now.


>  {
>   int owner;
>   u32 yield_count;
> @@ -217,7 +235,11 @@ static __always_inline void yield_to_locked_owner(struct 
> qspinlock *lock, u32 va
>   smp_rmb();
>  
>   if (READ_ONCE(lock->val) == val) {
> + if (clear_mustq)
> + lock_clear_mustq(lock);
>   yield_to_preempted(owner, yield_count);
> + if (clear_mustq)
> + lock_set_mustq(lock);
>   /* Don't relax if we yielded. Maybe we should? */
>   return;
>   }
> @@ -225,6 +247,16 @@ static __always_inline void yield_to_locked_owner(struct 
> qspinlock *lock, u32 va
>   cpu_relax();
>  }
>  
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)
> +{
> + __yield_to_locked_owner(lock, val, paravirt, false);
> +}
> +
> +static __always_inline void yield_head_to_locked_owner(struct qspinlock 
> *lock, u32 val, bool paravirt, bool clear_mustq)
> +{

The check for pv_yield_allow_steal seems like it could go here instead of
being done by the caller.
__yield_to_locked_owner() checks for pv_yield_owner so it seems more
  consistent.



> + __yield_to_locked_owner(lock, val, paravirt, clear_mustq);
> +}
> +
>  static __always_inline void yield_to_prev(struct qspinlock *lock, struct 
> qnode *node, int prev_cpu, bool paravirt)
>  {
>   u32 yield_count;
> @@ -332,7 +364,7 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>   if (!MAYBE_STEALERS) {
>   /* We're at the head of the waitqueue, wait for the lock. */
>   while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL)
> - yield_to_locked_owner(lock, val, paravirt);
> + yield_head_to_locked_owner(lock, val, paravirt, false);
>  
>   /* If we're the last queued, must clean up the tail. */
>   if ((val & _Q_TAIL_CPU_MASK) == tail) {
> @@ -350,7 +382,8 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>  again:
>   /* We're at the head of the waitqueue, wait for the lock. */
>   while ((val = READ_ONCE(lock->val)) & _Q_LOCKED_VAL) {
> - yield_to_locked_owner(lock, val, paravirt);
> + yield_head_to_locked_owner(lock, val, paravirt,
> + pv_yield_allow_stea

Re: [PATCH 11/17] powerpc/qspinlock: allow propagation of yield CPU down the queue

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Having all CPUs poll the lock word for the owner CPU that should be
> yielded to defeats most of the purpose of using MCS queueing for
> scalability. Yet it may be desirable for queued waiters to to yield
> to a preempted owner.
> 
> s390 addreses this problem by having queued waiters sample the lock
> word to find the owner much less frequently. In this approach, the
> waiters never sample it directly, but the queue head propagates the
> owner CPU back to the next waiter if it ever finds the owner has
> been preempted. Queued waiters then subsequently propagate the owner
> CPU back to the next waiter, and so on.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 85 +++-
>  1 file changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 94f007f66942..28c85a2d5635 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -12,6 +12,7 @@
>  struct qnode {
>   struct qnode*next;
>   struct qspinlock *lock;
> + int yield_cpu;
>   u8  locked; /* 1 if lock acquired */
>  };
>  
> @@ -28,6 +29,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
> +static bool pv_yield_propagate_owner __read_mostly = true;

This also seems to be enabled by default.

>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -257,13 +259,66 @@ static __always_inline void 
> yield_head_to_locked_owner(struct qspinlock *lock, u
>   __yield_to_locked_owner(lock, val, paravirt, clear_mustq);
>  }
>  
> +static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, 
> int *set_yield_cpu, bool paravirt)
> +{
> + struct qnode *next;
> + int owner;
> +
> + if (!paravirt)
> + return;
> + if (!pv_yield_propagate_owner)
> + return;
> +
> + owner = get_owner_cpu(val);
> + if (*set_yield_cpu == owner)
> + return;
> +
> + next = READ_ONCE(node->next);
> + if (!next)
> + return;
> +
> + if (vcpu_is_preempted(owner)) {

Is there a difference about using vcpu_is_preempted() here
vs checking bit 0 in other places?


> + next->yield_cpu = owner;
> + *set_yield_cpu = owner;
> + } else if (*set_yield_cpu != -1) {

It might be worth giving the -1 CPU a #define.

> + next->yield_cpu = owner;
> + *set_yield_cpu = owner;
> + }
> +}

Does this need to pass set_yield_cpu by reference? Couldn't it's new value be
returned? To me it makes it more clear the function is used to change
set_yield_cpu. I think this would work:

int set_yield_cpu = -1;

static __always_inline int propagate_yield_cpu(struct qnode *node, u32 val, int 
set_yield_cpu, bool paravirt)
{
struct qnode *next;
int owner;

if (!paravirt)
goto out;
if (!pv_yield_propagate_owner)
goto out;

owner = get_owner_cpu(val);
if (set_yield_cpu == owner)
goto out;

next = READ_ONCE(node->next);
if (!next)
goto out;

if (vcpu_is_preempted(owner)) {
next->yield_cpu = owner;
return owner;
} else if (set_yield_cpu != -1) {
next->yield_cpu = owner;
return owner;
}

out:
return set_yield_cpu;
}

set_yield_cpu = propagate_yield_cpu(...  set_yield_cpu ...);



> +
>  static __always_inline void yield_to_prev(struct qspinlock *lock, struct 
> qnode *node, int prev_cpu, bool paravirt)
>  {
>   u32 yield_count;
> + int yield_cpu;
>  
>   if (!paravirt)
>   goto relax;
>  
> + if (!pv_yield_propagate_owner)
> + goto yield_prev;
> +
> + yield_cpu = READ_ONCE(node->yield_cpu);
> + if (yield_cpu == -1) {
> + /* Propagate back the -1 CPU */
> + if (node->next && node->next->yield_cpu != -1)
> + node->next->yield_cpu = yield_cpu;
> + goto yield_prev;
> + }
> +
> + yield_count = yield_count_of(yield_cpu);
> + if ((yield_count & 1) == 0)
> + goto yield_prev; /* owner vcpu is running */
> +
> + smp_rmb();
> +
> + if (yield_cpu == node->yield_cpu) {
> + if (node->next && node->next->yield_cpu != yield_cpu)
> + node->next->yield_cpu = yield_cpu;
> + yield_to_preempted(yield_cpu, yield_count);
> + return;
> + }
> +
> +yield_prev:
>   if (!pv_yield_prev)
>   goto relax;
>  
> @@ -337,6 +392,7 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qs

Re: [PATCH 12/17] powerpc/qspinlock: add ability to prod new queue head CPU

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> After the head of the queue acquires the lock, it releases the
> next waiter in the queue to become the new head. Add an option
> to prod the new head if its vCPU was preempted. This may only
> have an effect if queue waiters are yielding.
> 
> Disable this option by default for now, i.e., no logical change.
> ---
>  arch/powerpc/lib/qspinlock.c | 29 -
>  1 file changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 28c85a2d5635..3b10e31bcf0a 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -12,6 +12,7 @@
>  struct qnode {
>   struct qnode*next;
>   struct qspinlock *lock;
> + int cpu;
>   int yield_cpu;
>   u8  locked; /* 1 if lock acquired */
>  };
> @@ -30,6 +31,7 @@ static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
> +static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> @@ -392,6 +394,7 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>   node = &qnodesp->nodes[idx];
>   node->next = NULL;
>   node->lock = lock;
> + node->cpu = smp_processor_id();

I suppose this could be used in some other places too.

For example change:
yield_to_prev(lock, node, prev, paravirt);

In yield_to_prev() it could then access the prev->cpu.

>   node->yield_cpu = -1;
>   node->locked = 0;
>  
> @@ -483,7 +486,14 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>* this store to locked. The corresponding barrier is the smp_rmb()
>* acquire barrier for mcs lock, above.
>*/
> - WRITE_ONCE(next->locked, 1);
> + if (paravirt && pv_prod_head) {
> + int next_cpu = next->cpu;
> + WRITE_ONCE(next->locked, 1);
> + if (vcpu_is_preempted(next_cpu))
> + prod_cpu(next_cpu);
> + } else {
> + WRITE_ONCE(next->locked, 1);
> + }
>  
>  release:
>   qnodesp->count--; /* release the node */
> @@ -622,6 +632,22 @@ static int pv_yield_propagate_owner_get(void *data, u64 
> *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_pv_yield_propagate_owner, 
> pv_yield_propagate_owner_get, pv_yield_propagate_owner_set, "%llu\n");
>  
> +static int pv_prod_head_set(void *data, u64 val)
> +{
> + pv_prod_head = !!val;
> +
> + return 0;
> +}
> +
> +static int pv_prod_head_get(void *data, u64 *val)
> +{
> + *val = pv_prod_head;
> +
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, pv_prod_head_get, 
> pv_prod_head_set, "%llu\n");
> +
>  static __init int spinlock_debugfs_init(void)
>  {
>   debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, 
> &fops_steal_spins);
> @@ -631,6 +657,7 @@ static __init int spinlock_debugfs_init(void)
>   debugfs_create_file("qspl_pv_yield_allow_steal", 0600, 
> arch_debugfs_dir, NULL, &fops_pv_yield_allow_steal);
>   debugfs_create_file("qspl_pv_yield_prev", 0600, 
> arch_debugfs_dir, NULL, &fops_pv_yield_prev);
>   debugfs_create_file("qspl_pv_yield_propagate_owner", 0600, 
> arch_debugfs_dir, NULL, &fops_pv_yield_propagate_owner);
> + debugfs_create_file("qspl_pv_prod_head", 0600, 
> arch_debugfs_dir, NULL, &fops_pv_prod_head);
>   }
>  
>   return 0;



Re: [PATCH 13/17] powerpc/qspinlock: trylock and initial lock attempt may steal

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> This gives trylock slightly more strength, and it also gives most
> of the benefit of passing 'val' back through the slowpath without
> the complexity.
> ---
>  arch/powerpc/include/asm/qspinlock.h | 39 +++-
>  arch/powerpc/lib/qspinlock.c |  9 +++
>  2 files changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock.h 
> b/arch/powerpc/include/asm/qspinlock.h
> index 44601b261e08..d3d2039237b2 100644
> --- a/arch/powerpc/include/asm/qspinlock.h
> +++ b/arch/powerpc/include/asm/qspinlock.h
> @@ -5,6 +5,8 @@
>  #include 
>  #include 
>  
> +#define _Q_SPIN_TRY_LOCK_STEAL 1

Would this be a config option?

> +
>  static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
>  {
>   return READ_ONCE(lock->val);
> @@ -26,11 +28,12 @@ static __always_inline u32 
> queued_spin_get_locked_val(void)
>   return _Q_LOCKED_VAL | (smp_processor_id() << _Q_OWNER_CPU_OFFSET);
>  }
>  
> -static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> +static __always_inline int __queued_spin_trylock_nosteal(struct qspinlock 
> *lock)
>  {
>   u32 new = queued_spin_get_locked_val();
>   u32 prev;
>  
> + /* Trylock succeeds only when unlocked and no queued nodes */
>   asm volatile(
>  "1:  lwarx   %0,0,%1,%3  # queued_spin_trylock   \n"

s/queued_spin_trylock/__queued_spin_trylock_nosteal

>  "cmpwi   0,%0,0  \n"
> @@ -49,6 +52,40 @@ static __always_inline int queued_spin_trylock(struct 
> qspinlock *lock)
>   return 0;
>  }
>  
> +static __always_inline int __queued_spin_trylock_steal(struct qspinlock 
> *lock)
> +{
> + u32 new = queued_spin_get_locked_val();
> + u32 prev, tmp;
> +
> + /* Trylock may get ahead of queued nodes if it finds unlocked */
> + asm volatile(
> +"1:  lwarx   %0,0,%2,%5  # queued_spin_trylock   \n"

s/queued_spin_trylock/__queued_spin_trylock_steal

> +"andc.   %1,%0,%4\n"
> +"bne-2f  \n"
> +"and %1,%0,%4\n"
> +"or  %1,%1,%3\n"
> +"stwcx.  %1,0,%2 \n"
> +"bne-1b  \n"
> +"\t" PPC_ACQUIRE_BARRIER "   \n"
> +"2:  \n"

Just because there's a little bit more going on here...

Q_TAIL_CPU_MASK = 0xFFFE
~Q_TAIL_CPU_MASK = 0x1


1:  lwarx   prev, 0, &lock->val, IS_ENABLED_PPC64
andc.   tmp, prev, _Q_TAIL_CPU_MASK (tmp = prev & ~_Q_TAIL_CPU_MASK)
bne-2f  (exit if locked)
and tmp, prev, _Q_TAIL_CPU_MASK (tmp = prev & _Q_TAIL_CPU_MASK)
or  tmp, tmp, new   (tmp |= new)

stwcx.  tmp, 0, &lock->val  

bne-1b  
PPC_ACQUIRE_BARRIER 
2:

... which seems correct.


> + : "=&r" (prev), "=&r" (tmp)
> + : "r" (&lock->val), "r" (new), "r" (_Q_TAIL_CPU_MASK),
> +   "i" (IS_ENABLED(CONFIG_PPC64) ? 1 : 0)
> + : "cr0", "memory");
> +
> + if (likely(!(prev & ~_Q_TAIL_CPU_MASK)))
> + return 1;
> + return 0;
> +}
> +
> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> +{
> + if (!_Q_SPIN_TRY_LOCK_STEAL)
> + return __queued_spin_trylock_nosteal(lock);
> + else
> + return __queued_spin_trylock_steal(lock);
> +}
> +
>  void queued_spin_lock_slowpath(struct qspinlock *lock);
>  
>  static __always_inline void queued_spin_lock(struct qspinlock *lock)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 3b10e31bcf0a..277aef1fab0a 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -24,7 +24,11 @@ struct qnodes {
>  
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
> +#if _Q_SPIN_TRY_LOCK_STEAL == 1
> +static const bool MAYBE_STEALERS = true;
> +#else
>  static bool MAYBE_STEALERS __read_mostly = true;
> +#endif
>  static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
> @@ -522,6 +526,10 @@ void pv_spinlocks_init(void)
>  #include 
>  static int steal_spins_set(void *data, u64 val)
>  {
> +#if _Q_SPIN_TRY_LOCK_STEAL == 1
> + /* MAYBE_STEAL remains true */
> + STEAL_SPINS = val;
> +#else
>   static DEFINE_MUTEX(lock);
>  
>   mutex_lock(&lock);
> @@ -539,6 +547,7 @

Re: [PATCH 14/17] powerpc/qspinlock: use spin_begin/end API

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Use the spin_begin/spin_cpu_relax/spin_end APIs in qspinlock, which helps
> to prevent threads issuing a lot of expensive priority nops which may not
> have much effect due to immediately executing low then medium priority.

Just a general comment regarding the spin_{begin,end} API, more complicated
than something like

spin_begin()
for(;;)
spin_cpu_relax()
spin_end()

it becomes difficult to keep track of. Unfortunately, I don't have any good
suggestions how to improve it. Hopefully with P10s wait instruction we can
maybe try and move away from this.

It might be useful to comment the functions pre and post conditions regarding
expectations about spin_begin() and spin_end().

> ---
>  arch/powerpc/lib/qspinlock.c | 35 +++
>  1 file changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 277aef1fab0a..d4594c701f7d 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -233,6 +233,8 @@ static __always_inline void 
> __yield_to_locked_owner(struct qspinlock *lock, u32
>   if ((yield_count & 1) == 0)
>   goto relax; /* owner vcpu is running */
>  
> + spin_end();
> +
>   /*
>* Read the lock word after sampling the yield count. On the other side
>* there may a wmb because the yield count update is done by the
> @@ -248,11 +250,13 @@ static __always_inline void 
> __yield_to_locked_owner(struct qspinlock *lock, u32
>   yield_to_preempted(owner, yield_count);
>   if (clear_mustq)
>   lock_set_mustq(lock);
> + spin_begin();
>   /* Don't relax if we yielded. Maybe we should? */
>   return;
>   }
> + spin_begin();
>  relax:
> - cpu_relax();
> + spin_cpu_relax();
>  }
>  
>  static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)
> @@ -315,14 +319,18 @@ static __always_inline void yield_to_prev(struct 
> qspinlock *lock, struct qnode *
>   if ((yield_count & 1) == 0)
>   goto yield_prev; /* owner vcpu is running */
>  
> + spin_end();
> +
>   smp_rmb();
>  
>   if (yield_cpu == node->yield_cpu) {
>   if (node->next && node->next->yield_cpu != yield_cpu)
>   node->next->yield_cpu = yield_cpu;
>   yield_to_preempted(yield_cpu, yield_count);
> + spin_begin();
>   return;
>   }
> + spin_begin();
>  
>  yield_prev:
>   if (!pv_yield_prev)
> @@ -332,15 +340,19 @@ static __always_inline void yield_to_prev(struct 
> qspinlock *lock, struct qnode *
>   if ((yield_count & 1) == 0)
>   goto relax; /* owner vcpu is running */
>  
> + spin_end();
> +
>   smp_rmb(); /* See yield_to_locked_owner comment */
>  
>   if (!node->locked) {
>   yield_to_preempted(prev_cpu, yield_count);
> + spin_begin();
>   return;
>   }
> + spin_begin();
>  
>  relax:
> - cpu_relax();
> + spin_cpu_relax();
>  }
>  
>  
> @@ -349,6 +361,7 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>   int iters;
>  
>   /* Attempt to steal the lock */
> + spin_begin();
>   for (;;) {
>   u32 val = READ_ONCE(lock->val);
>  
> @@ -356,8 +369,10 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>   break;
>  
>   if (unlikely(!(val & _Q_LOCKED_VAL))) {
> + spin_end();
>   if (trylock_with_tail_cpu(lock, val))
>   return true;
> + spin_begin();
>   continue;
>   }
>  
> @@ -368,6 +383,7 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>   if (iters >= get_steal_spins(paravirt))
>   break;
>   }
> + spin_end();
>  
>   return false;
>  }
> @@ -418,8 +434,10 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>   WRITE_ONCE(prev->next, node);
>  
>   /* Wait for mcs node lock to be released */
> + spin_begin();
>   while (!node->locked)
>   yield_to_prev(lock, node, prev_cpu, paravirt);
> + spin_end();
>  
>   /* Clear out stale propagated yield_cpu */
>   if (paravirt && pv_yield_propagate_owner && node->yield_cpu != 
> -1)
> @@ -432,10 +450,12 @@ static __always_inline void 
> queued_spin_lock_mcs_queue(struct qspinlock *lock, b
>   int set_yield_cpu = -1;
>  
>   /* We're at the head of the waitqueue, wait for the lock. */
> + spin_begin();
> 

Re: [PATCH 15/17] powerpc/qspinlock: reduce remote node steal spins

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Allow for a reduction in the number of times a CPU from a different
> node than the owner can attempt to steal the lock before queueing.
> This could bias the transfer behaviour of the lock across the
> machine and reduce NUMA crossings.
> ---
>  arch/powerpc/lib/qspinlock.c | 34 +++---
>  1 file changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index d4594c701f7d..24f68bd71e2b 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -4,6 +4,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -24,6 +25,7 @@ struct qnodes {
>  
>  /* Tuning parameters */
>  static int STEAL_SPINS __read_mostly = (1<<5);
> +static int REMOTE_STEAL_SPINS __read_mostly = (1<<2);
>  #if _Q_SPIN_TRY_LOCK_STEAL == 1
>  static const bool MAYBE_STEALERS = true;
>  #else
> @@ -39,9 +41,13 @@ static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
>  
> -static __always_inline int get_steal_spins(bool paravirt)
> +static __always_inline int get_steal_spins(bool paravirt, bool remote)
>  {
> - return STEAL_SPINS;
> + if (remote) {
> + return REMOTE_STEAL_SPINS;
> + } else {
> + return STEAL_SPINS;
> + }
>  }
>  
>  static __always_inline int get_head_spins(bool paravirt)
> @@ -380,8 +386,13 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>  
>   iters++;
>  
> - if (iters >= get_steal_spins(paravirt))
> + if (iters >= get_steal_spins(paravirt, false))
>   break;
> + if (iters >= get_steal_spins(paravirt, true)) {

There's no indication of what true and false mean here which is hard to read.
To me it feels like two separate functions would be more clear.


> + int cpu = get_owner_cpu(val);
> + if (numa_node_id() != cpu_to_node(cpu))

What about using node_distance() instead?


> + break;
> + }
>   }
>   spin_end();
>  
> @@ -588,6 +599,22 @@ static int steal_spins_get(void *data, u64 *val)
>  
>  DEFINE_SIMPLE_ATTRIBUTE(fops_steal_spins, steal_spins_get, steal_spins_set, 
> "%llu\n");
>  
> +static int remote_steal_spins_set(void *data, u64 val)
> +{
> + REMOTE_STEAL_SPINS = val;

REMOTE_STEAL_SPINS is int not u64.

> +
> + return 0;
> +}
> +
> +static int remote_steal_spins_get(void *data, u64 *val)
> +{
> + *val = REMOTE_STEAL_SPINS;
> +
> + return 0;
> +}
> +
> +DEFINE_SIMPLE_ATTRIBUTE(fops_remote_steal_spins, remote_steal_spins_get, 
> remote_steal_spins_set, "%llu\n");
> +
>  static int head_spins_set(void *data, u64 val)
>  {
>   HEAD_SPINS = val;
> @@ -687,6 +714,7 @@ DEFINE_SIMPLE_ATTRIBUTE(fops_pv_prod_head, 
> pv_prod_head_get, pv_prod_head_set, "
>  static __init int spinlock_debugfs_init(void)
>  {
>   debugfs_create_file("qspl_steal_spins", 0600, arch_debugfs_dir, NULL, 
> &fops_steal_spins);
> + debugfs_create_file("qspl_remote_steal_spins", 0600, arch_debugfs_dir, 
> NULL, &fops_remote_steal_spins);
>   debugfs_create_file("qspl_head_spins", 0600, arch_debugfs_dir, NULL, 
> &fops_head_spins);
>   if (is_shared_processor()) {
>   debugfs_create_file("qspl_pv_yield_owner", 0600, 
> arch_debugfs_dir, NULL, &fops_pv_yield_owner);



Re: [PATCH 16/17] powerpc/qspinlock: allow indefinite spinning on a preempted owner

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Provide an option that holds off queueing indefinitely while the lock
> owner is preempted. This could reduce queueing latencies for very
> overcommitted vcpu situations.
> 
> This is disabled by default.
> ---
>  arch/powerpc/lib/qspinlock.c | 91 +++-
>  1 file changed, 79 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 24f68bd71e2b..5cfd69931e31 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -35,6 +35,7 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
> +static bool pv_spin_on_preempted_owner __read_mostly = false;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
>  static bool pv_prod_head __read_mostly = false;
> @@ -220,13 +221,15 @@ static struct qnode *get_tail_qnode(struct qspinlock 
> *lock, u32 val)
>   BUG();
>  }
>  
> -static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt, bool clear_mustq)
> +static __always_inline void __yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt, bool clear_mustq, bool *preempted)
>  {
>   int owner;
>   u32 yield_count;
>  
>   BUG_ON(!(val & _Q_LOCKED_VAL));
>  
> + *preempted = false;
> +
>   if (!paravirt)
>   goto relax;
>  
> @@ -241,6 +244,8 @@ static __always_inline void 
> __yield_to_locked_owner(struct qspinlock *lock, u32
>  
>   spin_end();
>  
> + *preempted = true;
> +
>   /*
>* Read the lock word after sampling the yield count. On the other side
>* there may a wmb because the yield count update is done by the
> @@ -265,14 +270,14 @@ static __always_inline void 
> __yield_to_locked_owner(struct qspinlock *lock, u32
>   spin_cpu_relax();
>  }
>  
> -static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt)
> +static __always_inline void yield_to_locked_owner(struct qspinlock *lock, 
> u32 val, bool paravirt, bool *preempted)

It seems like preempted parameter could be the return value of
yield_to_locked_owner(). Then callers that don't use the value returned in
preempted don't need to create an unnecessary variable to pass in.

>  {
> - __yield_to_locked_owner(lock, val, paravirt, false);
> + __yield_to_locked_owner(lock, val, paravirt, false, preempted);
>  }
>  
> -static __always_inline void yield_head_to_locked_owner(struct qspinlock 
> *lock, u32 val, bool paravirt, bool clear_mustq)
> +static __always_inline void yield_head_to_locked_owner(struct qspinlock 
> *lock, u32 val, bool paravirt, bool clear_mustq, bool *preempted)
>  {
> - __yield_to_locked_owner(lock, val, paravirt, clear_mustq);
> + __yield_to_locked_owner(lock, val, paravirt, clear_mustq, preempted);
>  }
>  
>  static __always_inline void propagate_yield_cpu(struct qnode *node, u32 val, 
> int *set_yield_cpu, bool paravirt)
> @@ -364,12 +369,33 @@ static __always_inline void yield_to_prev(struct 
> qspinlock *lock, struct qnode *
>  
>  static __always_inline bool try_to_steal_lock(struct qspinlock *lock, bool 
> paravirt)
>  {
> - int iters;
> + int iters = 0;
> +
> + if (!STEAL_SPINS) {
> + if (paravirt && pv_spin_on_preempted_owner) {
> + spin_begin();
> + for (;;) {
> + u32 val = READ_ONCE(lock->val);
> + bool preempted;
> +
> + if (val & _Q_MUST_Q_VAL)
> + break;
> + if (!(val & _Q_LOCKED_VAL))
> + break;
> + if (!vcpu_is_preempted(get_owner_cpu(val)))
> + break;
> + yield_to_locked_owner(lock, val, paravirt, 
> &preempted);
> + }
> + spin_end();
> + }
> + return false;
> + }
>  
>   /* Attempt to steal the lock */
>   spin_begin();
>   for (;;) {
>   u32 val = READ_ONCE(lock->val);
> + bool preempted;
>  
>   if (val & _Q_MUST_Q_VAL)
>   break;
> @@ -382,9 +408,22 @@ static __always_inline bool try_to_steal_lock(struct 
> qspinlock *lock, bool parav
>   continue;
>   }
>  
> - yield_to_locked_owner(lock, val, paravirt);
> -
> - iters++;
> + yield_to_locked_owner(lock, val, paravirt, &preempted);
> +
> + if (paravirt && preempted) {
> + if (!pv_spin_on_preempted_owner)
> + iters++;
> + 

Re: [PATCH 17/17] powerpc/qspinlock: provide accounting and options for sleepy locks

2022-11-09 Thread Jordan Niethe
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
[resend as utf-8, not utf-7]
> Finding the owner or a queued waiter on a lock with a preempted vcpu
> is indicative of an oversubscribed guest causing the lock to get into
> trouble. Provide some options to detect this situation and have new
> CPUs avoid queueing for a longer time (more steal iterations) to
> minimise the problems caused by vcpu preemption on the queue.
> ---
>  arch/powerpc/include/asm/qspinlock_types.h |   7 +-
>  arch/powerpc/lib/qspinlock.c   | 240 +++--
>  2 files changed, 232 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/qspinlock_types.h 
> b/arch/powerpc/include/asm/qspinlock_types.h
> index 35f9525381e6..4fbcc8a4230b 100644
> --- a/arch/powerpc/include/asm/qspinlock_types.h
> +++ b/arch/powerpc/include/asm/qspinlock_types.h
> @@ -30,7 +30,7 @@ typedef struct qspinlock {
>   *
>   * 0: locked bit
>   *  1-14: lock holder cpu
> - *15: unused bit
> + *15: lock owner or queuer vcpus observed to be preempted bit
>   *16: must queue bit
>   * 17-31: tail cpu (+1)
>   */
> @@ -49,6 +49,11 @@ typedef struct qspinlock {
>  #error "qspinlock does not support such large CONFIG_NR_CPUS"
>  #endif
>  
> +#define _Q_SLEEPY_OFFSET 15
> +#define _Q_SLEEPY_BITS   1
> +#define _Q_SLEEPY_MASK   _Q_SET_MASK(SLEEPY_OWNER)
> +#define _Q_SLEEPY_VAL(1U << _Q_SLEEPY_OFFSET)
> +
>  #define _Q_MUST_Q_OFFSET 16
>  #define _Q_MUST_Q_BITS   1
>  #define _Q_MUST_Q_MASK   _Q_SET_MASK(MUST_Q)
> diff --git a/arch/powerpc/lib/qspinlock.c b/arch/powerpc/lib/qspinlock.c
> index 5cfd69931e31..c18133c01450 100644
> --- a/arch/powerpc/lib/qspinlock.c
> +++ b/arch/powerpc/lib/qspinlock.c
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -36,24 +37,54 @@ static int HEAD_SPINS __read_mostly = (1<<8);
>  static bool pv_yield_owner __read_mostly = true;
>  static bool pv_yield_allow_steal __read_mostly = false;
>  static bool pv_spin_on_preempted_owner __read_mostly = false;
> +static bool pv_sleepy_lock __read_mostly = true;
> +static bool pv_sleepy_lock_sticky __read_mostly = false;

The sticky part could potentially be its own patch.

> +static u64 pv_sleepy_lock_interval_ns __read_mostly = 0;
> +static int pv_sleepy_lock_factor __read_mostly = 256;
>  static bool pv_yield_prev __read_mostly = true;
>  static bool pv_yield_propagate_owner __read_mostly = true;
>  static bool pv_prod_head __read_mostly = false;
>  
>  static DEFINE_PER_CPU_ALIGNED(struct qnodes, qnodes);
> +static DEFINE_PER_CPU_ALIGNED(u64, sleepy_lock_seen_clock);
>  
> -static __always_inline int get_steal_spins(bool paravirt, bool remote)
> +static __always_inline bool recently_sleepy(void)
> +{

Other users of pv_sleepy_lock_interval_ns first check pv_sleepy_lock.

> + if (pv_sleepy_lock_interval_ns) {
> + u64 seen = this_cpu_read(sleepy_lock_seen_clock);
> +
> + if (seen) {
> + u64 delta = sched_clock() - seen;
> + if (delta < pv_sleepy_lock_interval_ns)
> + return true;
> + this_cpu_write(sleepy_lock_seen_clock, 0);
> + }
> + }
> +
> + return false;
> +}
> +
> +static __always_inline int get_steal_spins(bool paravirt, bool remote, bool 
> sleepy)

It seems like paravirt is implied by sleepy.

>  {
>   if (remote) {
> - return REMOTE_STEAL_SPINS;
> + if (paravirt && sleepy)
> + return REMOTE_STEAL_SPINS * pv_sleepy_lock_factor;
> + else
> + return REMOTE_STEAL_SPINS;
>   } else {
> - return STEAL_SPINS;
> + if (paravirt && sleepy)
> + return STEAL_SPINS * pv_sleepy_lock_factor;
> + else
> + return STEAL_SPINS;
>   }
>  }

I think that separate functions would still be nicer but this could get rid of
the nesting conditionals like


int spins;
if (remote)
spins = REMOTE_STEAL_SPINS;
else
spins = STEAL_SPINS;

if (sleepy)
return spins * pv_sleepy_lock_factor;
return spins;

>  
> -static __always_inline int get_head_spins(bool paravirt)
> +static __always_inline int get_head_spins(bool paravirt, bool sleepy)
>  {
> - return HEAD_SPINS;
> + if (paravirt && sleepy)
> + return HEAD_SPINS * pv_sleepy_lock_factor;
> + else
> + return HEAD_SPINS;
>  }
>  
>  static inline u32 encode_tail_cpu(void)
> @@ -206,6 +237,60 @@ static __always_inline u32 lock_clear_mustq(struct 
> qspinlock *lock)
>   return prev;
>  }
>  
> +static __always_inline bool lock_try_set_sleepy(struct qspinlock *lock, u32 
> old)
> +{
> + u32 prev;
> + u32 new = old | _Q_SLEEPY_VAL;
> +
> + BUG_ON(!(old &

Re: [PATCH 38/44] KVM: Disable CPU hotplug during hardware enabling

2022-11-09 Thread Huang, Kai
On Wed, 2022-11-02 at 23:19 +, Sean Christopherson wrote:
> From: Chao Gao 
> 
> Disable CPU hotplug during hardware_enable_all() to prevent the corner
> case where if the following sequence occurs:
> 
>   1. A hotplugged CPU marks itself online in cpu_online_mask
>   2. The hotplugged CPU enables interrupt before invoking KVM's ONLINE
>  callback
>   3  hardware_enable_all() is invoked on another CPU right
> 
> the hotplugged CPU will be included in on_each_cpu() and thus get sent
> through hardware_enable_nolock() before kvm_online_cpu() is called.
> 
> start_secondary { ...
> set_cpu_online(smp_processor_id(), true); <- 1
> ...
> local_irq_enable();  <- 2
> ...
> cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); <- 3
> }
> 
> KVM currently fudges around this race by keeping track of which CPUs have
> done hardware enabling (see commit 1b6c016818a5 "KVM: Keep track of which
> cpus have virtualization enabled"), but that's an inefficient, convoluted,
> and hacky solution.
> 
> Signed-off-by: Chao Gao 
> [sean: split to separate patch, write changelog]
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/kvm/x86.c  |  8 +++-
>  virt/kvm/kvm_main.c | 10 ++
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a7b1d916ecb2..a15e54ba0471 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9283,7 +9283,13 @@ static int 
> kvm_x86_check_processor_compatibility(struct kvm_x86_init_ops *ops)
>   int cpu = smp_processor_id();
>   struct cpuinfo_x86 *c = &cpu_data(cpu);
>  
> - WARN_ON(!irqs_disabled());
> + /*
> +  * Compatibility checks are done when loading KVM and when enabling
> +  * hardware, e.g. during CPU hotplug, to ensure all online CPUs are
> +  * compatible, i.e. KVM should never perform a compatibility check on
> +  * an offline CPU.
> +  */
> + WARN_ON(!irqs_disabled() && cpu_active(cpu));

Comment doesn't match with the code?

"KVM should never perform a compatibility check on on offline CPU" should be
something like:

WARN_ON(!cpu_online(cpu));

So, should the comment be something like below?

"KVM compatibility check happens before CPU is marked as active".

>  
>   if (__cr4_reserved_bits(cpu_has, c) !=
>   __cr4_reserved_bits(cpu_has, &boot_cpu_data))
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index fd9e39c85549..4e765ef9f4bd 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -5088,6 +5088,15 @@ static int hardware_enable_all(void)
>  {
>   int r = 0;
>  
> + /*
> +  * When onlining a CPU, cpu_online_mask is set before kvm_online_cpu()
> +  * is called, and so on_each_cpu() between them includes the CPU that
> +  * is being onlined.  As a result, hardware_enable_nolock() may get
> +  * invoked before kvm_online_cpu().
> +  *
> +  * Disable CPU hotplug to prevent scenarios where KVM sees
> +  */

The above sentence is broken.

I think below comment Quoted from Isaku's series should be OK?

/*
 * During onlining a CPU, cpu_online_mask is set before
kvm_online_cpu()
 * is called. on_each_cpu() between them includes the CPU. As a result,
 * hardware_enable_nolock() may get invoked before kvm_online_cpu().
 * This would enable hardware virtualization on that cpu without
 * compatibility checks, which can potentially crash system or break
 * running VMs.
 *
 * Disable CPU hotplug to prevent this case from happening.
 */

> + cpus_read_lock();
>   raw_spin_lock(&kvm_count_lock);
>  
>   kvm_usage_count++;
> @@ -5102,6 +5111,7 @@ static int hardware_enable_all(void)
>   }
>  
>   raw_spin_unlock(&kvm_count_lock);
> + cpus_read_unlock();
>  
>   return r;
>  }



[PATCH] powerpc/pseries: fix possible memory leak in ibmebus_bus_init()

2022-11-09 Thread ruanjinjie
If device_register() returns error in ibmebus_bus_init(), name of kobject
which is allocated in dev_set_name() called in device_add() is leaked.

As comment of device_add() says, it should call put_device() to drop
the reference count that was set in device_initialize() when it fails,
so the name can be freed in kobject_cleanup().

Signed-off-by: ruanjinjie 
---
 arch/powerpc/platforms/pseries/ibmebus.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/platforms/pseries/ibmebus.c 
b/arch/powerpc/platforms/pseries/ibmebus.c
index a870cada7acd..ed5fc70b7353 100644
--- a/arch/powerpc/platforms/pseries/ibmebus.c
+++ b/arch/powerpc/platforms/pseries/ibmebus.c
@@ -455,6 +455,7 @@ static int __init ibmebus_bus_init(void)
if (err) {
printk(KERN_WARNING "%s: device_register returned %i\n",
   __func__, err);
+   put_device(&ibmebus_bus_device);
bus_unregister(&ibmebus_bus_type);
 
return err;
-- 
2.25.1



Re: [PATCH 38/44] KVM: Disable CPU hotplug during hardware enabling

2022-11-09 Thread Huang, Kai
On Wed, 2022-11-02 at 23:19 +, Sean Christopherson wrote:
> From: Chao Gao 
> 
> Disable CPU hotplug during hardware_enable_all() to prevent the corner
> case where if the following sequence occurs:
> 
>   1. A hotplugged CPU marks itself online in cpu_online_mask
>   2. The hotplugged CPU enables interrupt before invoking KVM's ONLINE
>  callback
>   3  hardware_enable_all() is invoked on another CPU right
> 
> the hotplugged CPU will be included in on_each_cpu() and thus get sent
> through hardware_enable_nolock() before kvm_online_cpu() is called.
> 
>     start_secondary { ...
>     set_cpu_online(smp_processor_id(), true); <- 1
>     ...
>     local_irq_enable();  <- 2
>     ...
>     cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); <- 3
>     }
> 
> KVM currently fudges around this race by keeping track of which CPUs have
> done hardware enabling (see commit 1b6c016818a5 "KVM: Keep track of which
> cpus have virtualization enabled"), but that's an inefficient, convoluted,
> and hacky solution.
> 
> Signed-off-by: Chao Gao 
> [sean: split to separate patch, write changelog]
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/kvm/x86.c  |  8 +++-
>  virt/kvm/kvm_main.c | 10 ++
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a7b1d916ecb2..a15e54ba0471 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9283,7 +9283,13 @@ static int 
> kvm_x86_check_processor_compatibility(struct kvm_x86_init_ops *ops)
>   int cpu = smp_processor_id();
>   struct cpuinfo_x86 *c = &cpu_data(cpu);
>  
> - WARN_ON(!irqs_disabled());
> + /*
> +  * Compatibility checks are done when loading KVM and when enabling
> +  * hardware, e.g. during CPU hotplug, to ensure all online CPUs are
> +  * compatible, i.e. KVM should never perform a compatibility check on
> +  * an offline CPU.
> +  */
> + WARN_ON(!irqs_disabled() && cpu_active(cpu));
>  

Also, the logic of:

!irqs_disabled() && cpu_active(cpu)

is quite weird.

The original "WARN(!irqs_disabled())" is reasonable because in STARTING section
the IRQ is indeed disabled.

But this doesn't make sense anymore after we move to ONLINE section, in which
IRQ has already been enabled (see start_secondary()).  IIUC the WARN_ON()
doesn't get exploded is purely because there's an additional cpu_active(cpu)
check.

So, a more reasonable check should be something like:

WARN_ON(irqs_disabled() || cpu_active(cpu) || !cpu_online(cpu));

Or we can simply do:

WARN_ON(!cpu_online(cpu) || cpu_active(cpu));

(because I don't know whether it's possible IRQ can somehow get disabled in
ONLINE section).

Btw above is purely based on code analysis, but I haven't done any test.


Re: [PATCH bpf-next v2 0/5] execmem_alloc for BPF programs

2022-11-09 Thread Song Liu
On Wed, Nov 9, 2022 at 1:24 PM Christophe Leroy
 wrote:
>
> + linuxppc-dev list as we start mentioning powerpc.
>
> Le 09/11/2022 à 18:43, Song Liu a écrit :
> > On Wed, Nov 9, 2022 at 3:18 AM Mike Rapoport  wrote:
> >>
> > [...]
> >
> 
>  The proposed execmem_alloc() looks to me very much tailored for x86
>  to be
>  used as a replacement for module_alloc(). Some architectures have
>  module_alloc() that is quite different from the default or x86
>  version, so
>  I'd expect at least some explanation how modules etc can use execmem_
>  APIs
>  without breaking !x86 architectures.
> >>>
> >>> I think this is fair, but I think we should ask ask ourselves - how
> >>> much should we do in one step?
> >>
> >> I think that at least we need an evidence that execmem_alloc() etc can be
> >> actually used by modules/ftrace/kprobes. Luis said that RFC v2 didn't work
> >> for him at all, so having a core MM API for code allocation that only works
> >> with BPF on x86 seems not right to me.
> >
> > While using execmem_alloc() et. al. in module support is difficult, folks 
> > are
> > making progress with it. For example, the prototype would be more difficult
> > before CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> > (introduced by Christophe).
>
> By the way, the motivation for CONFIG_ARCH_WANTS_MODULES_DATA_IN_VMALLOC
> was completely different: This was because on powerpc book3s/32, no-exec
> flaggin is per segment of size 256 Mbytes, so in order to provide
> STRICT_MODULES_RWX it was necessary to put data outside of the segment
> that holds module text in order to be able to flag RW data as no-exec.

Yeah, I only noticed the actual motivation of this work earlier today. :)

>
> But I'm happy if it can also serve other purposes.
>
> >
> > We also have other users that we can onboard soon: BPF trampoline on
> > x86_64, BPF jit and trampoline on arm64, and maybe also on powerpc and
> > s390.
> >
> >>
> >>> For non-text_poke() architectures, the way you can make it work is have
> >>> the API look like:
> >>> execmem_alloc()  <- Does the allocation, but necessarily usable yet
> >>> execmem_write()  <- Loads the mapping, doesn't work after finish()
> >>> execmem_finish() <- Makes the mapping live (loaded, executable, ready)
> >>>
> >>> So for text_poke():
> >>> execmem_alloc()  <- reserves the mapping
> >>> execmem_write()  <- text_pokes() to the mapping
> >>> execmem_finish() <- does nothing
> >>>
> >>> And non-text_poke():
> >>> execmem_alloc()  <- Allocates a regular RW vmalloc allocation
> >>> execmem_write()  <- Writes normally to it
> >>> execmem_finish() <- does set_memory_ro()/set_memory_x() on it
> >>>
> >>> Non-text_poke() only gets the benefits of centralized logic, but the
> >>> interface works for both. This is pretty much what the perm_alloc() RFC
> >>> did to make it work with other arch's and modules. But to fit with the
> >>> existing modules code (which is actually spread all over) and also
> >>> handle RO sections, it also needed some additional bells and whistles.
> >>
> >> I'm less concerned about non-text_poke() part, but rather about
> >> restrictions where code and data can live on different architectures and
> >> whether these restrictions won't lead to inability to use the centralized
> >> logic on, say, arm64 and powerpc.
>
> Until recently, powerpc CPU didn't implement PC-relative data access.
> Only very recent powerpc CPUs (power10 only ?) have capability to do
> PC-relative accesses, but the kernel doesn't use it yet. So there's no
> constraint about distance between text and data. What matters is the
> distance between core kernel text and module text to avoid trampolines.

Ah, this is great. I guess this means powerpc can benefit from this work
with much less effort than x86_64.

>
> >>
> >> For instance, if we use execmem_alloc() for modules, it means that data
> >> sections should be allocated separately with plain vmalloc(). Will this
> >> work universally? Or this will require special care with additional
> >> complexity in the modules code?
> >>
> >>> So the question I'm trying to ask is, how much should we target for the
> >>> next step? I first thought that this functionality was so intertwined,
> >>> it would be too hard to do iteratively. So if we want to try
> >>> iteratively, I'm ok if it doesn't solve everything.
> >>
> >> With execmem_alloc() as the first step I'm failing to see the large
> >> picture. If we want to use it for modules, how will we allocate RO data?
> >> with similar rodata_alloc() that uses yet another tree in vmalloc?
> >> How the caching of large pages in vmalloc can be made useful for use cases
> >> like secretmem and PKS?
> >
> > If RO data causes problems with direct map fragmentation, we can use
> > similar logic. I think we will need another tree in vmalloc for this case.
> > Since the logic will be mostly identical, I personally don't think adding
> > another tree is a big overhead.
>
> On powerpc, kernel core RAM is 

Re: [PATCH 38/44] KVM: Disable CPU hotplug during hardware enabling

2022-11-09 Thread Huang, Kai
On Thu, 2022-11-10 at 01:33 +, Huang, Kai wrote:
> > @@ -9283,7 +9283,13 @@ static int
> > kvm_x86_check_processor_compatibility(struct kvm_x86_init_ops *ops)
> >     int cpu = smp_processor_id();
> >     struct cpuinfo_x86 *c = &cpu_data(cpu);
> >  
> > -   WARN_ON(!irqs_disabled());
> > +   /*
> > +* Compatibility checks are done when loading KVM and when enabling
> > +* hardware, e.g. during CPU hotplug, to ensure all online CPUs are
> > +* compatible, i.e. KVM should never perform a compatibility check
> > on
> > +* an offline CPU.
> > +*/
> > +   WARN_ON(!irqs_disabled() && cpu_active(cpu));
> >  
> 
> Also, the logic of:
> 
>   !irqs_disabled() && cpu_active(cpu)
> 
> is quite weird.
> 
> The original "WARN(!irqs_disabled())" is reasonable because in STARTING
> section
> the IRQ is indeed disabled.
> 
> But this doesn't make sense anymore after we move to ONLINE section, in which
> IRQ has already been enabled (see start_secondary()).  IIUC the WARN_ON()
> doesn't get exploded is purely because there's an additional cpu_active(cpu)
> check.
> 
> So, a more reasonable check should be something like:
> 
>   WARN_ON(irqs_disabled() || cpu_active(cpu) || !cpu_online(cpu));
> 
> Or we can simply do:
> 
>   WARN_ON(!cpu_online(cpu) || cpu_active(cpu));
> 
> (because I don't know whether it's possible IRQ can somehow get disabled in
> ONLINE section).
> 
> Btw above is purely based on code analysis, but I haven't done any test.

Hmm.. I wasn't thinking thoroughly.  I forgot CPU compatibility check also
happens on all online cpus when loading KVM.  For this case, IRQ is disabled and
cpu_active() is true.  For the hotplug case, IRQ is enabled but  cpu_active() is
false.

So WARN_ON(!irqs_disabled() && cpu_active(cpu)) looks reasonable.  Sorry for the
noise.  Just needed some time to connect the comment with the code.


Re: [PATCH 38/44] KVM: Disable CPU hotplug during hardware enabling

2022-11-09 Thread Huang, Kai
On Thu, 2022-11-10 at 01:08 +, Huang, Kai wrote:
> > -   WARN_ON(!irqs_disabled());
> > +   /*
> > +* Compatibility checks are done when loading KVM and when enabling
> > +* hardware, e.g. during CPU hotplug, to ensure all online CPUs are
> > +* compatible, i.e. KVM should never perform a compatibility check
> > on
> > +* an offline CPU.
> > +*/
> > +   WARN_ON(!irqs_disabled() && cpu_active(cpu));
> 
> Comment doesn't match with the code?
> 
> "KVM should never perform a compatibility check on on offline CPU" should be
> something like:
> 
>   WARN_ON(!cpu_online(cpu));
> 
> So, should the comment be something like below?
> 
> "KVM compatibility check happens before CPU is marked as active".

Also ignore this one as I only thought about hotplug case.


Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation

2022-11-09 Thread Christophe Leroy


Le 10/11/2022 à 01:35, Jordan Niethe a écrit :
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> 
>> -#define queued_spin_lock queued_spin_lock
>>   
>> -static inline void queued_spin_unlock(struct qspinlock *lock)
>> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>>   {
>> -if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
>> -smp_store_release(&lock->locked, 0);
>> -else
>> -__pv_queued_spin_unlock(lock);
>> +if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
>> +return 1;
>> +return 0;
> 
> optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);

No parenthesis.
No == 0

Should be :

return !atomic_cmpxchg_acquire(&lock->val, 0, 1);

> 
> [resend as utf-8, not utf-7]
> 

Re: [PATCH 37/44] KVM: Rename and move CPUHP_AP_KVM_STARTING to ONLINE section

2022-11-09 Thread Robert Hoo
On Wed, 2022-11-02 at 23:19 +, Sean Christopherson wrote:
> From: Chao Gao 
> 
> The CPU STARTING section doesn't allow callbacks to fail. Move KVM's
> hotplug callback to ONLINE section so that it can abort onlining a
> CPU in
> certain cases to avoid potentially breaking VMs running on existing
> CPUs.
> For example, when KVM fails to enable hardware virtualization on the
> hotplugged CPU.
> 
> Place KVM's hotplug state before CPUHP_AP_SCHED_WAIT_EMPTY as it
> ensures
> when offlining a CPU, all user tasks and non-pinned kernel tasks have
> left
> the CPU, i.e. there cannot be a vCPU task around. So, it is safe for
> KVM's
> CPU offline callback to disable hardware virtualization at that
> point.
> Likewise, KVM's online callback can enable hardware virtualization
> before
> any vCPU task gets a chance to run on hotplugged CPUs.
> 
> Rename KVM's CPU hotplug callbacks accordingly.
> 
> Suggested-by: Thomas Gleixner 
> Signed-off-by: Chao Gao 
> Reviewed-by: Sean Christopherson 
> Signed-off-by: Isaku Yamahata 
> Reviewed-by: Yuan Yao 
> Signed-off-by: Sean Christopherson 
> ---
>  include/linux/cpuhotplug.h |  2 +-
>  virt/kvm/kvm_main.c| 30 ++
>  2 files changed, 23 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 7337414e4947..de45be38dd27 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -185,7 +185,6 @@ enum cpuhp_state {
>   CPUHP_AP_CSKY_TIMER_STARTING,
>   CPUHP_AP_TI_GP_TIMER_STARTING,
>   CPUHP_AP_HYPERV_TIMER_STARTING,
> - CPUHP_AP_KVM_STARTING,
>   /* Must be the last timer callback */
>   CPUHP_AP_DUMMY_TIMER_STARTING,
>   CPUHP_AP_ARM_XEN_STARTING,
> @@ -200,6 +199,7 @@ enum cpuhp_state {
>  
>   /* Online section invoked on the hotplugged CPU from the
> hotplug thread */
>   CPUHP_AP_ONLINE_IDLE,
> + CPUHP_AP_KVM_ONLINE,
>   CPUHP_AP_SCHED_WAIT_EMPTY,
>   CPUHP_AP_SMPBOOT_THREADS,
>   CPUHP_AP_X86_VDSO_VMA_ONLINE,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index dd13af9f06d5..fd9e39c85549 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -5026,13 +5026,27 @@ static void hardware_enable_nolock(void
> *junk)
>   }
>  }
>  
> -static int kvm_starting_cpu(unsigned int cpu)
> +static int kvm_online_cpu(unsigned int cpu)
>  {
> + int ret = 0;
> +
>   raw_spin_lock(&kvm_count_lock);
> - if (kvm_usage_count)
> + /*
> +  * Abort the CPU online process if hardware virtualization
> cannot
> +  * be enabled. Otherwise running VMs would encounter
> unrecoverable
> +  * errors when scheduled to this CPU.
> +  */
> + if (kvm_usage_count) {
> + WARN_ON_ONCE(atomic_read(&hardware_enable_failed));
> +
>   hardware_enable_nolock(NULL);
> + if (atomic_read(&hardware_enable_failed)) {
> + atomic_set(&hardware_enable_failed, 0);

I see other places using this hardware_enable_failed with atomic_inc(),
should here use atomic_dec() instead of straightly set to 0?
Though here is embraced by spin_lock, hardware_enable_nolock() can be
invoked in other places in parallel?

Fortunately in the end of this patch set, global hardware_enable_failed
is get rid of.

> + ret = -EIO;
> + }
> + }
>   raw_spin_unlock(&kvm_count_lock);
> - return 0;
> + return ret;
>  }
>  
>  static void hardware_disable_nolock(void *junk)
> @@ -5045,7 +5059,7 @@ static void hardware_disable_nolock(void *junk)
>   kvm_arch_hardware_disable();
>  }
>  
> -static int kvm_dying_cpu(unsigned int cpu)
> +static int kvm_offline_cpu(unsigned int cpu)
>  {
>   raw_spin_lock(&kvm_count_lock);
>   if (kvm_usage_count)
> @@ -5822,8 +5836,8 @@ int kvm_init(unsigned vcpu_size, unsigned
> vcpu_align, struct module *module)
>   if (!zalloc_cpumask_var(&cpus_hardware_enabled, GFP_KERNEL))
>   return -ENOMEM;
>  
> - r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING,
> "kvm/cpu:starting",
> -   kvm_starting_cpu, kvm_dying_cpu);
> + r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_ONLINE,
> "kvm/cpu:online",
> +   kvm_online_cpu, kvm_offline_cpu);
>   if (r)
>   goto out_free_2;
>   register_reboot_notifier(&kvm_reboot_notifier);
> @@ -5897,7 +5911,7 @@ int kvm_init(unsigned vcpu_size, unsigned
> vcpu_align, struct module *module)
>   kmem_cache_destroy(kvm_vcpu_cache);
>  out_free_3:
>   unregister_reboot_notifier(&kvm_reboot_notifier);
> - cpuhp_remove_state_nocalls(CPUHP_AP_KVM_STARTING);
> + cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE);
>  out_free_2:
>   free_cpumask_var(cpus_hardware_enabled);
>   return r;
> @@ -5923,7 +5937,7 @@ void kvm_exit(void)
>   kvm_async_pf_deinit();
>   unregister_syscore_ops(&kvm_syscore_ops);
>   unregister_reboot_noti

Re: [PATCH 32/44] KVM: x86: Unify pr_fmt to use module name for all KVM modules

2022-11-09 Thread Robert Hoo
On Wed, 2022-11-02 at 23:18 +, Sean Christopherson wrote:
> Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that
> printks
> use consistent formatting across common x86, Intel, and AMD code.  In
> addition to providing consistent print formatting, using
> KBUILD_MODNAME,
> e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV
> and
> SGX and ...) as technologies without generating weird messages, and
> without causing naming conflicts with other kernel code, e.g. "SEV:
> ",
> "tdx: ", "sgx: " etc.. are all used by the kernel for non-KVM
> subsystems.
> 
> Opportunistically move away from printk() for prints that need to be
> modified anyways, e.g. to drop a manual "kvm: " prefix.
> 
> Opportunistically convert a few SGX WARNs that are similarly modified
> to
> WARN_ONCE; in the very unlikely event that the WARNs fire, odds are
> good
> that they would fire repeatedly and spam the kernel log without
> providing
> unique information in each print.
> 
> Note, defining pr_fmt yields undesirable results for code that uses
> KVM's
> printk wrappers, e.g. vcpu_unimpl().  But, that's a pre-existing
> problem
> as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's
> wrappers is relatively limited in KVM x86 code.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/kvm/cpuid.c|  1 +
>  arch/x86/kvm/debugfs.c  |  2 ++
>  arch/x86/kvm/emulate.c  |  1 +
>  arch/x86/kvm/hyperv.c   |  1 +
>  arch/x86/kvm/i8254.c|  4 ++--
>  arch/x86/kvm/i8259.c|  4 +++-
>  arch/x86/kvm/ioapic.c   |  1 +
>  arch/x86/kvm/irq.c  |  1 +
>  arch/x86/kvm/irq_comm.c |  7 +++---
>  arch/x86/kvm/kvm_onhyperv.c |  1 +
>  arch/x86/kvm/lapic.c|  8 +++
>  arch/x86/kvm/mmu/mmu.c  |  6 ++---
>  arch/x86/kvm/mmu/page_track.c   |  1 +
>  arch/x86/kvm/mmu/spte.c |  4 ++--
>  arch/x86/kvm/mmu/spte.h |  4 ++--
>  arch/x86/kvm/mmu/tdp_iter.c |  1 +
>  arch/x86/kvm/mmu/tdp_mmu.c  |  1 +
>  arch/x86/kvm/mtrr.c |  1 +
>  arch/x86/kvm/pmu.c  |  1 +
>  arch/x86/kvm/smm.c  |  1 +
>  arch/x86/kvm/svm/avic.c |  2 +-
>  arch/x86/kvm/svm/nested.c   |  2 +-
>  arch/x86/kvm/svm/pmu.c  |  2 ++
>  arch/x86/kvm/svm/sev.c  |  1 +
>  arch/x86/kvm/svm/svm.c  | 10 -
>  arch/x86/kvm/svm/svm_onhyperv.c |  1 +
>  arch/x86/kvm/svm/svm_onhyperv.h |  4 ++--
>  arch/x86/kvm/vmx/evmcs.c|  1 +
>  arch/x86/kvm/vmx/evmcs.h|  4 +---
>  arch/x86/kvm/vmx/nested.c   |  3 ++-
>  arch/x86/kvm/vmx/pmu_intel.c|  5 +++--
>  arch/x86/kvm/vmx/posted_intr.c  |  2 ++
>  arch/x86/kvm/vmx/sgx.c  |  5 +++--
>  arch/x86/kvm/vmx/vmcs12.c   |  1 +
>  arch/x86/kvm/vmx/vmx.c  | 40 ---
> --
>  arch/x86/kvm/vmx/vmx_ops.h  |  4 ++--
>  arch/x86/kvm/x86.c  | 28 ---
>  arch/x86/kvm/xen.c  |  1 +
>  38 files changed, 97 insertions(+), 70 deletions(-)
> 
After this patch set, still find some printk()s left in arch/x86/kvm/*,
consider clean all of them up?

arch/x86/kvm/lapic.c:1215:  printk(KERN_ERR "TODO:
unsupported delivery mode %x\n",
arch/x86/kvm/lapic.c:1506:  printk(KERN_ERR "Local APIC
read with len = %x, "
arch/x86/kvm/lapic.c:2586:  printk(KERN_ERR "malloc apic
regs error for vcpu %x\n",
arch/x86/kvm/ioapic.h:95:   printk(KERN_EMERG "assertion
failed %s: %d: %s\n",   \
arch/x86/kvm/ioapic.c:614:  printk(KERN_WARNING "ioapic:
wrong length %d\n", len);
arch/x86/kvm/ioapic.c:641:  printk(KERN_WARNING "ioapic:
Unsupported size %d\n", len);
arch/x86/kvm/mmu/mmu.c:1652:printk(KERN_ERR "%s: %p
%llx\n", __func__,
arch/x86/kvm/svm/svm.c:3450:printk(KERN_ERR "%s: unexpected
exit_int_info 0x%x "
arch/x86/kvm/vmx/posted_intr.c:322: printk(
KERN_INFO
arch/x86/kvm/vmx/posted_intr.c:343: printk(KERN_INF
O "%s: failed to update PI IRTE\n",
arch/x86/kvm/vmx/vmx.c:6507:printk(KERN_WARNING
"%s: Breaking out of NMI-blocked "
arch/x86/kvm/x86.c:13027:   printk(KERN_INFO "irq bypass
consumer (token %p) unregistration"