Re: [powerpc] memcpy warning drivers/scsi/scsi_transport_fc.c:581 (next-20220921)
On 22-Sep-2022, at 2:13 AM, Kees Cook wrote:
> On Wed, Sep 21, 2022 at 09:21:52PM +0530, Sachin Sant wrote:
>> While booting recent linux-next kernel on a Power server following
>> warning is seen:
>>
>> [7.432178] memcpy: detected field-spanning write (size 8) of single
>> field "&event->event_data" at drivers/scsi/scsi_transport_fc.c:581 (size 4)
>
> [...]
>
> Given that the only user of struct fc_nl_event is fc_host_post_fc_event()
> in drivers/scsi/scsi_transport_fc.c, it looks safe to say that changing
> the struct to use a flexible array is the thing to do in the kernel, but
> we can't actually change the size or layout because it's a UAPI header.
>
> Are you able to test this patch:

Thank you for the detailed analysis. With this patch applied I don't see
the warning.

Thanks
- Sachin
Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
On 9/21/22 12:47, Nadav Amit wrote:
> On Sep 20, 2022, at 11:53 PM, Anshuman Khandual wrote:
>> On 8/22/22 13:51, Yicong Yang wrote:
>>> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
>>> +					struct mm_struct *mm,
>>> +					unsigned long uaddr)
>>> +{
>>> +	__flush_tlb_page_nosync(mm, uaddr);
>>> +}
>>> +
>>> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>>> +{
>>> +	dsb(ish);
>>> +}
>>
>> Just wondering if arch_tlbbatch_add_mm() could also detect contiguous mapping
>> TLB invalidation requests on a given mm and try to generate a range based TLB
>> invalidation such as flush_tlb_range().
>>
>> struct arch_tlbflush_unmap_batch via task->tlb_ubc->arch can track contiguous
>> ranges while being queued up via arch_tlbbatch_add_mm(), and any range formed
>> can later be flushed in a subsequent arch_tlbbatch_flush() ?
>>
>> OR
>>
>> It might not be worth the effort and complexity, in comparison to the
>> performance improvement a TLB range flush brings in ?
>
> So here are my 2 cents, based on my experience with Intel-x86. It is likely
> different on arm64, but perhaps it can provide you some insight into what
> parameters you should measure and consider.
>
> In general there is a tradeoff between full TLB flushes and entry-specific
> ones. Flushing specific entries takes more time than flushing the entire
> TLB, but saves TLB refills.

Right.

> Dave Hansen made some calculations in the past and came up with 33 as a
> magic cutoff number, i.e., if you need to flush more than 33 entries, just
> flush the entire TLB. I am not sure that this exact number is very
> meaningful, since one might argue that it should've taken PTI into account
> (which might require twice as many TLB invalidations).

Okay.

> Anyhow, back to arch_tlbbatch_add_mm(). It may be possible to track ranges,
> but the question is whether you would actually succeed in forming contiguous
> ranges that are eventually (on x86) smaller than the full TLB flush cutoff
> (=33). Questionable (perhaps better with MGLRU?).

This proposal here for arm64 does not cause a full TLB flush ever. It creates
individual TLB flushes all the time. Hence the choice here is not between a
full TLB flush and possible range flushes. The choice is actually between
individual TLB flushes and range/full TLB flushes.

> Then, you should remember that tracking should be very efficient, since even
> a few cache misses might have greater cost than what you save by
> selective-flushing. Finally, on x86 you would need to invoke the smp/IPI
> layer multiple times to send different cores the relevant range they need to
> flush.

Agreed, these reasons make it much more difficult to gain any more performance.

> IOW: It is somewhat complicated to implement efficiently. On x86, and
> probably other IPI-based TLB shootdown systems, it does not have a clear
> performance benefit (IMHO).

Agreed, thanks for such a detailed explanation, appreciate it.
Re: [PATCH] block: move from strlcpy with unused retval to strscpy
On Thu, 18 Aug 2022 22:59:57 +0200, Wolfram Sang wrote: > Follow the advice of the below link and prefer 'strscpy' in this > subsystem. Conversion is 1:1 because the return value is not used. > Generated by a coccinelle script. > > Applied, thanks! [1/1] block: move from strlcpy with unused retval to strscpy commit: e55e1b4831563e5766f88fa821f5bfac0d6c298c Best regards, -- Jens Axboe
Re: [PATCH v2 07/44] cpuidle,psci: Push RCU-idle into driver
Reviewed-by: Guo Ren

On Mon, Sep 19, 2022 at 6:17 PM Peter Zijlstra wrote:
>
> Doing RCU-idle outside the driver, only to then temporarily enable it
> again, at least twice, before going idle is daft.
>
> Signed-off-by: Peter Zijlstra (Intel)
> ---
>  drivers/cpuidle/cpuidle-psci.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> --- a/drivers/cpuidle/cpuidle-psci.c
> +++ b/drivers/cpuidle/cpuidle-psci.c
> @@ -69,12 +69,12 @@ static int __psci_enter_domain_idle_stat
>  		return -1;
>
>  	/* Do runtime PM to manage a hierarchical CPU toplogy. */
> -	ct_irq_enter_irqson();
>  	if (s2idle)
>  		dev_pm_genpd_suspend(pd_dev);
>  	else
>  		pm_runtime_put_sync_suspend(pd_dev);
> -	ct_irq_exit_irqson();
> +
> +	ct_idle_enter();
>
>  	state = psci_get_domain_state();
>  	if (!state)
> @@ -82,12 +82,12 @@ static int __psci_enter_domain_idle_stat
>
>  	ret = psci_cpu_suspend_enter(state) ? -1 : idx;
>
> -	ct_irq_enter_irqson();
> +	ct_idle_exit();
> +
>  	if (s2idle)
>  		dev_pm_genpd_resume(pd_dev);
>  	else
>  		pm_runtime_get_sync(pd_dev);
> -	ct_irq_exit_irqson();
>
>  	cpu_pm_exit();
>
> @@ -240,6 +240,7 @@ static int psci_dt_cpu_init_topology(str
>  	 * of a shared state for the domain, assumes the domain states are all
>  	 * deeper states.
>  	 */
> +	drv->states[state_count - 1].flags |= CPUIDLE_FLAG_RCU_IDLE;
>  	drv->states[state_count - 1].enter = psci_enter_domain_idle_state;
>  	drv->states[state_count - 1].enter_s2idle =
>  		psci_enter_s2idle_domain_idle_state;
>  	psci_cpuidle_use_cpuhp = true;

--
Best Regards
 Guo Ren
Re: [PATCH v2 07/44] cpuidle,psci: Push RCU-idle into driver
On Mon, Sep 19, 2022 at 11:59:46AM +0200, Peter Zijlstra wrote: > Doing RCU-idle outside the driver, only to then temporarily enable it > again, at least twice, before going idle is daft. > > Signed-off-by: Peter Zijlstra (Intel) Tried it on Pixel 6 running psci_idle, looks good with no apparent issues. Tested-by: Kajetan Puchalski
Re: [powerpc] Kernel crash with THP tests (next-20220920)
On 09/21/22 12:00, Sachin Sant wrote: > While running transparent huge page tests [1] against 6.0.0-rc6-next-20220920 > following crash is seen on IBM Power server. Thanks Sachin, Naoya reported this, with my analysis here: https://lore.kernel.org/linux-mm/YyqCS6+OXAgoqI8T@monkey/ An updated version of the patch was posted here, https://lore.kernel.org/linux-mm/20220921202702.106069-1-mike.krav...@oracle.com/ Sorry about that, -- Mike Kravetz > > Kernel attempted to read user page (34) - exploit attempt? (uid: 0) > BUG: Kernel NULL pointer dereference on read at 0x0034 > Faulting instruction address: 0xc04d2744 > Oops: Kernel access of bad area, sig: 11 [#1] > LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries > Modules linked in: dm_mod(E) bonding(E) rfkill(E) tls(E) sunrpc(E) nd_pmem(E) > nd_btt(E) dax_pmem(E) papr_scm(E) libnvdimm(E) pseries_rng(E) vmx_crypto(E) > ext4(E) mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) crc64_rocksoft(E) crc64(E) > sg(E) ibmvscsi(E) scsi_transport_srp(E) ibmveth(E) fuse(E) > CPU: 37 PID: 2219255 Comm: sysctl Tainted: GE > 6.0.0-rc6-next-20220920 #1 > NIP: c04d2744 LR: c04d2734 CTR: > REGS: c012801bf660 TRAP: 0300 Tainted: GE > (6.0.0-rc6-next-20220920) > MSR: 80009033 CR: 24048222 XER: 2004 > CFAR: c04b0eac DAR: 0034 DSISR: 4000 IRQMASK: 0 > GPR00: c04d2734 c012801bf900 c2a92300 > GPR04: c2ac8ac0 c1209340 0005 c01286714b80 > GPR08: 0034 > GPR12: 28048242 c0167fff6b00 > GPR16: > GPR20: c012801bfae8 0001 0100 0001 > GPR24: c012801bfae8 c2ac8ac0 0002 0005 > GPR28: 0001 00346cca > NIP [c04d2744] alloc_buddy_huge_page+0xd4/0x240 > LR [c04d2734] alloc_buddy_huge_page+0xc4/0x240 > Call Trace: > [c012801bf900] [c04d2734] alloc_buddy_huge_page+0xc4/0x240 > (unreliable) > [c012801bf9b0] [c04d46a4] > alloc_fresh_huge_page.part.72+0x214/0x2a0 > [c012801bfa40] [c04d7f88] alloc_pool_huge_page+0x118/0x190 > [c012801bfa90] [c04d84dc] __nr_hugepages_store_common+0x4dc/0x610 > [c012801bfb70] [c04d88bc] > hugetlb_sysctl_handler_common+0x13c/0x180 > 
[c012801bfc10] [c06380e0] proc_sys_call_handler+0x210/0x350 > [c012801bfc90] [c0551c00] vfs_write+0x2e0/0x460 > [c012801bfd50] [c0551f5c] ksys_write+0x7c/0x140 > [c012801bfda0] [c0033f58] system_call_exception+0x188/0x3f0 > [c012801bfe10] [c000c53c] system_call_common+0xec/0x270 > --- interrupt: c00 at 0x7fffa9520c34 > NIP: 7fffa9520c34 LR: 0001024754bc CTR: > REGS: c012801bfe80 TRAP: 0c00 Tainted: GE > (6.0.0-rc6-next-20220920) > MSR: 8280f033 CR: 28002202 > XER: > IRQMASK: 0 > GPR00: 0004 7fffccd76cd0 7fffa9607300 0003 > GPR04: 000138da6970 0006 fff6 > GPR08: 000138da6970 > GPR12: 7fffa9a40940 > GPR16: > GPR20: > GPR24: 0001 0010 0006 000138da8aa0 > GPR28: 7fffa95fc2c8 000138da8aa0 0006 000138da6930 > NIP [7fffa9520c34] 0x7fffa9520c34 > LR [0001024754bc] 0x1024754bc > --- interrupt: c00 > Instruction dump: > 3b42 3ba1 3b80 7f26cb78 7fc5f378 7f64db78 7fe3fb78 4bfde5b9 > 6000 7c691b78 39030034 7c0004ac <7d404028> 7c0ae800 40c20010 7f80412d > ---[ end trace ]--- > > Kernel panic - not syncing: Fatal exception > > Bisect points to following patch: > commit f2f3c25dea3acfb17aecb7273541e7266dfc8842 > hugetlb: freeze allocated pages before creating hugetlb pages > > Reverting the patch allows the test to run successfully. > > Thanks > - Sachin > > [1] > https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/transparent_hugepages_defrag.py
Re: [powerpc] memcpy warning drivers/scsi/scsi_transport_fc.c:581 (next-20220921)
On Wed, Sep 21, 2022 at 09:21:52PM +0530, Sachin Sant wrote:
> While booting recent linux-next kernel on a Power server following
> warning is seen:
>
> [6.427054] lpfc 0022:01:00.0: 0:6468 Set host date / time: Status x10:
> [6.471457] lpfc 0022:01:00.0: 0:6448 Dual Dump is enabled
> [7.432161] ------------[ cut here ]------------
> [7.432178] memcpy: detected field-spanning write (size 8) of single field
> "&event->event_data" at drivers/scsi/scsi_transport_fc.c:581 (size 4)

Interesting!

The memcpy() is this one:

	memcpy(&event->event_data, data_buf, data_len);

The struct member, "event_data", is defined as u32:

	...
	 * Note: if Vendor Unique message, event_data will be start of
	 *	 vendor unique payload, and the length of the payload is
	 *	 per event_datalen
	...
	struct fc_nl_event {
		struct scsi_nl_hdr snlh;	/* must be 1st element ! */
		__u64 seconds;
		__u64 vendor_id;
		__u16 host_no;
		__u16 event_datalen;
		__u32 event_num;
		__u32 event_code;
		__u32 event_data;
	} __attribute__((aligned(sizeof(__u64))));

The warning says memcpy is trying to write 8 bytes into the 4 byte
member, so the compiler is seeing it "correctly", but I think this is
partially a false positive. It looks like there is also a small bug in
the allocation size calculation and therefore a small leak of kernel
heap memory contents. My notes:

	void
	fc_host_post_fc_event(struct Scsi_Host *shost, u32 event_number,
			enum fc_host_event_code event_code,
			u32 data_len, char *data_buf, u64 vendor_id)
	{
	...
		struct fc_nl_event *event;
	...
		if (!data_buf || data_len < 4)
			data_len = 0;

This wants a data_buf and a data_len >= 4, so it does look like it's
expected to be variable sized. There does appear to be an alignment
and padding expectation, though:

	/* macro to round up message lengths to 8byte boundary */
	#define FC_NL_MSGALIGN(len)	(((len) + 7) & ~7)

	...
	len = FC_NL_MSGALIGN(sizeof(*event) + data_len);

But this is immediately suspicious: sizeof(*event) _includes_ event_data,
so the alignment is going to be bumped up incorrectly. Note that
struct fc_nl_event is 8 * 5 == 40 bytes, which allows for 4 bytes in
event_data. But setting data_len to 4 (i.e. no "overflow") means we're
asking for 44 bytes, which is aligned to 48.

So, in all cases, there is uninitialized memory being sent...

	skb = nlmsg_new(len, GFP_KERNEL);
	...
	nlh = nlmsg_put(skb, 0, 0, SCSI_TRANSPORT_MSG, len, 0);
	...
	event = nlmsg_data(nlh);
	...
	event->event_datalen = data_len;	/* bytes */

Comments in the struct say this is counting from start of event_data.

	...
	if (data_len)
		memcpy(&event->event_data, data_buf, data_len);

And here is the reported memcpy().

The callers of fc_host_post_fc_event() are:

	fc_host_post_fc_event(shost, event_number, event_code,
			(u32)sizeof(u32), (char *)&event_data, 0);

Fixed-size of 4 bytes: no "overflow".

	fc_host_post_fc_event(shost, event_number, FCH_EVT_VENDOR_UNIQUE,
			data_len, data_buf, vendor_id);

	fc_host_post_fc_event(shost, fc_get_event_number(),
			FCH_EVT_LINK_FPIN, fpin_len, fpin_buf, 0);

These two appear to be of arbitrary length, but I didn't look more
deeply.

Given that the only user of struct fc_nl_event is fc_host_post_fc_event()
in drivers/scsi/scsi_transport_fc.c, it looks safe to say that changing
the struct to use a flexible array is the thing to do in the kernel, but
we can't actually change the size or layout because it's a UAPI header.
Are you able to test this patch:

diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index a2524106206d..0d798f11dc34 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -543,7 +543,7 @@ fc_host_post_fc_event(struct Scsi_Host *shost, u32 event_number,
 	struct nlmsghdr	*nlh;
 	struct fc_nl_event *event;
 	const char *name;
-	u32 len;
+	size_t len, padding;
 	int err;
 
 	if (!data_buf || data_len < 4)
@@ -554,7 +554,7 @@ fc_host_post_fc_event(struct Scsi_Host *shost, u32 event_number,
 		goto send_fail;
 	}
 
-	len = FC_NL_MSGALIGN(sizeof(*event) + data_len);
+	len = FC_NL_MSGALIGN(sizeof(*event) - sizeof(event->event_data) + data_len);
 
 	skb = nlmsg_new(len, GFP_KERNEL);
 	if (!skb) {
@@ -578,7 +578,9 @@ fc_host_post_fc_event(struct Scsi_Host *shost, u32 event_number,
 	event->event_num = event_number;
 	event->event_code = event_code;
 	if (data_len)
-		memcpy(&event->event_data, data_buf, data_len);
+		memcpy(event->event_data_flex, data_buf, data_len);
+
Re: [PATCH] Documentation: spufs: correct a duplicate word typo
Randy Dunlap writes: > Fix a typo of "or" which should be "of". > > Signed-off-by: Randy Dunlap > Cc: Jeremy Kerr > Cc: Arnd Bergmann > Cc: linuxppc-dev@lists.ozlabs.org > Cc: Jonathan Corbet > --- > Documentation/filesystems/spufs/spufs.rst |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- a/Documentation/filesystems/spufs/spufs.rst > +++ b/Documentation/filesystems/spufs/spufs.rst > @@ -227,7 +227,7 @@ Files >from the data buffer, updating the value of the specified > signal >notification register. The signal notification register > will >either be replaced with the input data or will be updated to > the > - bitwise OR or the old value and the input data, depending on > the > + bitwise OR of the old value and the input data, depending on > the >contents of the signal1_type, or signal2_type > respectively, >file. Applied, thanks. jon
Re: [PATCH 2/2] powerpc/64s: update cpu selection options
On Wed, Sep 21, 2022 at 11:41:03AM +1000, Nicholas Piggin wrote: > Update the 64s GENERIC_CPU option. POWER4 support has been dropped, so > make that clear in the option name. The POWER5_CPU option is dropped > because it's uncommon, and GENERIC_CPU covers it. > > -mtune= before power8 is dropped because the minimum gcc version > supports power8, and tuning is made consistent between big and little > endian. > > A 970 option is added for PowerPC 970 / G5 because they still have a > user base, and -mtune=power8 does not generate good code for the 970. > > This also updates the ISA versions document to add Power4/Power4+ > because I didn't realise Power4+ used 2.01. > > Signed-off-by: Nicholas Piggin Reviewed-by: Segher Boessenkool Cheers, Segher
Re: [PATCH 1/2] powerpc/64s: Fix GENERIC_CPU build flags for PPC970 / G5
On Wed, Sep 21, 2022 at 11:41:02AM +1000, Nicholas Piggin wrote: > Big-endian GENERIC_CPU supports 970, but builds with -mcpu=power5. > POWER5 is ISA v2.02 whereas 970 is v2.01 plus Altivec. 2.02 added > the popcntb instruction which a compiler might use. > > Use -mcpu=power4. > > Fixes: 471d7ff8b51b ("powerpc/64s: Remove POWER4 support") > Signed-off-by: Nicholas Piggin Reviewed-by: Segher Boessenkool Thank you! Maybe superfluous, but some more context: GCC's -mcpu=power4 means POWER4+, ISA 2.01, according to our documentation. There is no difference with ISA 2.00 (what plain POWER4 implements) for anything GCC does. Segher
[PATCH 2/2] tools/perf/tests: Fix build id test check for PE file
Perf test "build id cache operations" fails for PE executables. Logs below
are from a powerpc system; the same is observed on x86 as well.

<<>>
Adding 5a0fd882b53084224ba47b624c55a469 ./tests/shell/../pe-file.exe: Ok
build id: 5a0fd882b53084224ba47b624c55a469
link: /tmp/perf.debug.w0V/.build-id/5a/0fd882b53084224ba47b624c55a469
file: /tmp/perf.debug.w0V/.build-id/5a/../../root/athira/linux/tools/perf/tests/pe-file.exe/5a0fd882b53084224ba47b624c55a469/elf
failed: file /tmp/perf.debug.w0V/.build-id/5a/../../root/athira/linux/tools/perf/tests/pe-file.exe/5a0fd882b53084224ba47b624c55a469/elf does not exist
test child finished with -1
end
build id cache operations: FAILED!
<<>>

The test tries to do:

<<>>
mkdir /tmp/perf.debug.TeY1
perf --buildid-dir /tmp/perf.debug.TeY1 buildid-cache -v -a ./tests/shell/../pe-file.exe
<<>>

The option "--buildid-dir" sets the build id cache directory as
/tmp/perf.debug.TeY1. The option given to buildid-cache, ie
"-a ./tests/shell/../pe-file.exe", is to add the pe-file.exe to the cache.
The testcase sets buildid-dir and adds the file pe-file.exe to the build id
cache. To check that the command ran successfully, the "check" function
looks for the presence of the file in the buildid cache directory. But the
check here expects the added file to be executable. Snippet below:

<<>>
if [ ! -x $file ]; then
	echo "failed: file ${file} does not exist"
	exit 1
fi
<<>>

The buildid test is done for a sha1 binary, an md5 binary and also for a
PE file. The first two binaries are created at runtime by compiling with
the "--build-id" option, and hence the check for the sha1/md5 tests should
use [ ! -x ].

But in the case of the PE file, the permission of the input file is
rw-r--r--, and hence the file added to the build id cache has the same
permission.

Original file:
ls tests/pe-file.exe | xargs stat --printf "%n %A \n"
tests/pe-file.exe -rw-r--r--

buildid cache file:
ls /tmp/perf.debug.w0V/.build-id/5a/../../root/athira/linux/tools/perf/tests/pe-file.exe/5a0fd882b53084224ba47b624c55a469/elf | xargs stat --printf "%n %A \n"
/tmp/perf.debug.w0V/.build-id/5a/../../root/athira/linux/tools/perf/tests/pe-file.exe/5a0fd882b53084224ba47b624c55a469/elf -rw-r--r--

Fix the test to match the permission of the original file in the case of a
PE file, ie if the "tests/pe-file.exe" file does not have exec permission,
just check for the existence of the buildid file using [ ! -e ].

Signed-off-by: Athira Rajeev
---
 tools/perf/tests/shell/buildid.sh | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/tools/perf/tests/shell/buildid.sh b/tools/perf/tests/shell/buildid.sh
index 3512c4423d48..75f5117c3417 100755
--- a/tools/perf/tests/shell/buildid.sh
+++ b/tools/perf/tests/shell/buildid.sh
@@ -77,7 +77,20 @@ check()
 	file=${build_id_dir}/.build-id/$(echo ${id}|cut -c 1-2)/`readlink ${link}`/elf
 	echo "file: ${file}"
 
-	if [ ! -x $file ]; then
+	# Check for file permission of original file
+	# in case of pe-file.exe file
+	echo $1 | grep ".exe"
+	if [ $? -eq 0 ]; then
+		if [ -x $1 -a ! -x $file ]; then
+			echo "failed: file ${file} executable does not exist"
+			exit 1
+		fi
+
+		if [ ! -x $file -a ! -e $file ]; then
+			echo "failed: file ${file} does not exist"
+			exit 1
+		fi
+	elif [ ! -x $file ]; then
 		echo "failed: file ${file} does not exist"
 		exit 1
 	fi
-- 
2.17.1
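For readers less familiar with test(1): the fix above hinges on the difference between `-x` (executable by the caller) and `-e` (exists at all). A minimal illustration, independent of the perf test itself:

```shell
#!/bin/sh
# -x is true only if the file is executable by the caller; -e only checks
# existence. A cache entry copied from a rw-r--r-- pe-file.exe therefore
# fails -x even though it exists, which is the false failure described above.
t=$(mktemp)
chmod 644 "$t"		# rw-r--r--, like pe-file.exe

[ -e "$t" ] && echo "exists: yes"
[ -x "$t" ] || echo "executable: no"

chmod 755 "$t"
[ -x "$t" ] && echo "now executable: yes"
rm -f "$t"
```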
[PATCH 1/2] tools/perf/tests: Fix string substitutions in build id test
The perf test named "build id cache operations" skips with the below error
on some distros:

<<>>
78: build id cache operations : test child forked, pid 01
WARNING: wine not found. PE binaries will not be run.
test binaries: /tmp/perf.ex.SHA1.PKz /tmp/perf.ex.MD5.Gt3 ./tests/shell/../pe-file.exe
DEBUGINFOD_URLS=
Adding 4abd406f041feb4f10ecde3fc30fd0639e1a91cb /tmp/perf.ex.SHA1.PKz: Ok
build id: 4abd406f041feb4f10ecde3fc30fd0639e1a91cb
./tests/shell/buildid.sh: 69: ./tests/shell/buildid.sh: Bad substitution
test child finished with -2
build id cache operations: Skip
<<>>

The test script "tests/shell/buildid.sh" uses some string substitutions
which are supported in bash, but not in "sh" or other shells. The error
above on line number 69 that reports "Bad substitution" is:

<<>>
link=${build_id_dir}/.build-id/${id:0:2}/${id:2}
<<>>

Here the way of getting the first two characters from id, ie ${id:0:2},
and similarly expressions like ${id:2}, are not recognised in "sh". So the
line errors out and, instead of hitting a failure, the test gets skipped
as shown in the logs. So the syntax issue causes the test not to be
executed in such cases. Similarly the usage "${@: -1}" [to pick the last
argument passed to a function] in "test_record" doesn't work in all
distros.

Fix this by using an alternative way with the "cut" command to pick "n"
characters from the string. Also fix the usage of "${@: -1}" to work in
all cases.

Another usage in "test_record" is:

<<>>
${perf} record --buildid-all -o ${data} $@ &> ${log}
<<>>

This causes the perf record to start in the background, and results in the
data file not being created by the time the "check" function is invoked.
The below log shows the perf record result getting displayed after the
call to the "check" function.

<<>>
running: perf record /tmp/perf.ex.SHA1.EAU
build id: 4abd406f041feb4f10ecde3fc30fd0639e1a91cb
link: /tmp/perf.debug.mLT/.build-id/4a/bd406f041feb4f10ecde3fc30fd0639e1a91cb
failed: link /tmp/perf.debug.mLT/.build-id/4a/bd406f041feb4f10ecde3fc30fd0639e1a91cb does not exist
test child finished with -1
build id cache operations: FAILED!
root@machine:~/athira/linux/tools/perf# Couldn't synthesize bpf events.
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.010 MB /tmp/perf.data.bFF ]
<<>>

Fix this by redirecting output instead of using "&", which starts the
command in the background.

Signed-off-by: Athira Rajeev
---
 tools/perf/tests/shell/buildid.sh | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/tools/perf/tests/shell/buildid.sh b/tools/perf/tests/shell/buildid.sh
index f05670d1e39e..3512c4423d48 100755
--- a/tools/perf/tests/shell/buildid.sh
+++ b/tools/perf/tests/shell/buildid.sh
@@ -66,7 +66,7 @@ check()
 	esac
 	echo "build id: ${id}"
 
-	link=${build_id_dir}/.build-id/${id:0:2}/${id:2}
+	link=${build_id_dir}/.build-id/$(echo ${id}|cut -c 1-2)/$(echo ${id}|cut -c 3-)
 	echo "link: ${link}"
 
 	if [ ! -h $link ]; then
@@ -74,7 +74,7 @@ check()
 		exit 1
 	fi
 
-	file=${build_id_dir}/.build-id/${id:0:2}/`readlink ${link}`/elf
+	file=${build_id_dir}/.build-id/$(echo ${id}|cut -c 1-2)/`readlink ${link}`/elf
 	echo "file: ${file}"
 
 	if [ ! -x $file ]; then
@@ -117,20 +117,22 @@ test_record()
 {
 	data=$(mktemp /tmp/perf.data.XXX)
 	build_id_dir=$(mktemp -d /tmp/perf.debug.XXX)
-	log=$(mktemp /tmp/perf.log.XXX)
+	log_out=$(mktemp /tmp/perf.log.out.XXX)
+	log_err=$(mktemp /tmp/perf.log.err.XXX)
 	perf="perf --buildid-dir ${build_id_dir}"
 
+	eval last=\${$#}
 	echo "running: perf record $@"
-	${perf} record --buildid-all -o ${data} $@ &> ${log}
+	${perf} record --buildid-all -o ${data} $@ 1>${log_out} 2>${log_err}
 	if [ $? -ne 0 ]; then
 		echo "failed: record $@"
-		echo "see log: ${log}"
+		echo "see log: ${log_err}"
 		exit 1
 	fi
 
-	check ${@: -1}
+	check $last
 
-	rm -f ${log}
+	rm -f ${log_out} ${log_err}
 	rm -rf ${build_id_dir}
 	rm -rf ${data}
 }
-- 
2.17.1
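The portability issues fixed above are easy to reproduce outside the test. A small sketch of the bash-only expansions and their POSIX replacements; the `last_arg` helper here is just for illustration:

```shell
#!/bin/sh
# Bash-only expansions vs. their POSIX replacements used in the patch.
id=4abd406f041feb4f10ecde3fc30fd0639e1a91cb

# ${id:0:2} and ${id:2} are bash-isms; cut -c works in any POSIX sh.
first_two=$(echo ${id} | cut -c 1-2)
rest=$(echo ${id} | cut -c 3-)
echo "${first_two}/${rest}"

# "${@: -1}" (last positional argument) is also a bash-ism;
# eval on $# is the portable equivalent used by the fix.
last_arg() {
	eval last=\${$#}
	echo "${last}"
}
last_arg /tmp/perf.data one two		# prints "two"
```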
Re: [RFC PATCH 3/7] powerpc/64: provide a helper macro to load r2 with the kernel TOC
Le 19/09/2022 à 16:01, Nicholas Piggin a écrit : > A later change stops the kernel using r2 and loads it with a poison > value. Provide a PACATOC loading abstraction which can hide this > detail. > > XXX: 64e, KVM, ftrace not entirely done > > Signed-off-by: Nicholas Piggin > --- > arch/powerpc/include/asm/ppc_asm.h | 3 +++ > arch/powerpc/kernel/exceptions-64e.S | 4 ++-- > arch/powerpc/kernel/exceptions-64s.S | 6 +++--- > arch/powerpc/kernel/head_64.S | 4 ++-- > arch/powerpc/kernel/interrupt_64.S | 12 ++-- > arch/powerpc/kernel/optprobes_head.S | 2 +- > arch/powerpc/kernel/trace/ftrace_mprofile.S| 4 ++-- > arch/powerpc/platforms/powernv/opal-wrappers.S | 2 +- > 8 files changed, 20 insertions(+), 17 deletions(-) > > diff --git a/arch/powerpc/include/asm/ppc_asm.h > b/arch/powerpc/include/asm/ppc_asm.h > index 520c4c9caf7f..c0848303151c 100644 > --- a/arch/powerpc/include/asm/ppc_asm.h > +++ b/arch/powerpc/include/asm/ppc_asm.h > @@ -374,6 +374,9 @@ GLUE(.,name): > > #ifdef __powerpc64__ > > +#define LOAD_PACA_TOC() \ > + ld r2,PACATOC(r13) > + Wouldn't it be better as a GAS macro ?
Re: [RFC PATCH 2/7] powerpc/64: abstract asm global variable declaration and access
Le 19/09/2022 à 16:01, Nicholas Piggin a écrit : > Use asm helpers to access global variables and to define them in asm. > Stop using got addressing and use the more common @toc offsets. 32-bit > already does this so that should be unchanged. > > Signed-off-by: Nicholas Piggin > --- > diff --git a/arch/powerpc/boot/ppc_asm.h b/arch/powerpc/boot/ppc_asm.h > index 192b97523b05..ea290bf78fb2 100644 > --- a/arch/powerpc/boot/ppc_asm.h > +++ b/arch/powerpc/boot/ppc_asm.h > @@ -84,4 +84,8 @@ > #define MFTBU(dest) mfspr dest, SPRN_TBRU > #endif > > +#define LOAD_REG_ADDR(reg,name) \ > + addis reg,r2,name@toc@ha; \ > + addireg,reg,name@toc@l > + > #endif /* _PPC64_PPC_ASM_H */ Wouldn't it be better as a GAS macro ?
Re: [RFC PATCH 1/7] powerpc: use 16-bit immediate for STACK_FRAME_REGS_MARKER
Le 19/09/2022 à 16:01, Nicholas Piggin a écrit :
> Using a 16-bit constant for this marker allows it to be loaded with
> a single 'li' instruction. On 64-bit this avoids a TOC entry and a
> TOC load that depends on the r2 value that has just been loaded from
> the PACA.
>
> XXX: this probably should be 64-bit change and use 2 instruction
> sequence that 32-bit uses, to avoid false positives.

Yes would probably be safer ? It is only one instruction more, would
likely be unnoticeable.

Why value 0xba51 ? Why not 0xdead like PPC64 ?

Christophe

>
> Signed-off-by: Nicholas Piggin
> ---
>  arch/powerpc/include/asm/ptrace.h    | 6 +++---
>  arch/powerpc/kernel/entry_32.S       | 9 -
>  arch/powerpc/kernel/exceptions-64e.S | 8 +---
>  arch/powerpc/kernel/exceptions-64s.S | 2 +-
>  arch/powerpc/kernel/head_32.h        | 3 +--
>  arch/powerpc/kernel/head_64.S        | 7 ---
>  arch/powerpc/kernel/head_booke.h     | 3 +--
>  arch/powerpc/kernel/interrupt_64.S   | 6 +++---
>  8 files changed, 14 insertions(+), 30 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h
> index a03403695cd4..f47066f7878e 100644
> --- a/arch/powerpc/include/asm/ptrace.h
> +++ b/arch/powerpc/include/asm/ptrace.h
> @@ -115,10 +115,10 @@ struct pt_regs
>
>  #define STACK_FRAME_OVERHEAD	112	/* size of minimum stack frame */
>  #define STACK_FRAME_LR_SAVE	2	/* Location of LR in stack frame */
> -#define STACK_FRAME_REGS_MARKER	ASM_CONST(0x7265677368657265)
> +#define STACK_FRAME_REGS_MARKER	ASM_CONST(0xdead)
>  #define STACK_INT_FRAME_SIZE	(sizeof(struct pt_regs) + \
>  				 STACK_FRAME_OVERHEAD + KERNEL_REDZONE_SIZE)
> -#define STACK_FRAME_MARKER	12
> +#define STACK_FRAME_MARKER	1	/* Reuse CR+reserved word */
>
>  #ifdef CONFIG_PPC64_ELF_ABI_V2
>  #define STACK_FRAME_MIN_SIZE	32
> @@ -136,7 +136,7 @@ struct pt_regs
>  #define KERNEL_REDZONE_SIZE	0
>  #define STACK_FRAME_OVERHEAD	16	/* size of minimum stack frame */
>  #define STACK_FRAME_LR_SAVE	1	/* Location of LR in stack frame */
> -#define STACK_FRAME_REGS_MARKER	ASM_CONST(0x72656773)
> +#define STACK_FRAME_REGS_MARKER	ASM_CONST(0xba51)
>  #define STACK_INT_FRAME_SIZE	(sizeof(struct pt_regs) + STACK_FRAME_OVERHEAD)
>  #define STACK_FRAME_MARKER	2
>  #define STACK_FRAME_MIN_SIZE	STACK_FRAME_OVERHEAD
>
> diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
> index 1d599df6f169..c221e764cefd 100644
> --- a/arch/powerpc/kernel/entry_32.S
> +++ b/arch/powerpc/kernel/entry_32.S
> @@ -108,9 +108,8 @@ transfer_to_syscall:
>  #ifdef CONFIG_BOOKE_OR_40x
>  	rlwinm	r9,r9,0,14,12		/* clear MSR_WE (necessary?) */
>  #endif
> -	lis	r12,STACK_FRAME_REGS_MARKER@ha	/* exception frame marker */
> +	li	r12,STACK_FRAME_REGS_MARKER	/* exception frame marker */
>  	SAVE_GPR(2, r1)
> -	addi	r12,r12,STACK_FRAME_REGS_MARKER@l
>  	stw	r9,_MSR(r1)
>  	li	r2, INTERRUPT_SYSCALL
>  	stw	r12,8(r1)
> @@ -265,7 +264,7 @@ fast_exception_return:
>  	mtcr	r10
>  	lwz	r10,_LINK(r11)
>  	mtlr	r10
> -	/* Clear the exception_marker on the stack to avoid confusing stacktrace */
> +	/* Clear the STACK_FRAME_REGS_MARKER on the stack to avoid confusing stacktrace */
>  	li	r10, 0
>  	stw	r10, 8(r11)
>  	REST_GPR(10, r11)
> @@ -322,7 +321,7 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
>  	li	r0,0
>
>  	/*
> -	 * Leaving a stale exception_marker on the stack can confuse
> +	 * Leaving a stale STACK_FRAME_REGS_MARKER on the stack can confuse
>  	 * the reliable stack unwinder later on. Clear it.
>  	 */
>  	stw	r0,8(r1)
> @@ -374,7 +373,7 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
>  	mtspr	SPRN_XER,r5
>
>  	/*
> -	 * Leaving a stale exception_marker on the stack can confuse
> +	 * Leaving a stale STACK_FRAME_REGS_MARKER on the stack can confuse
>  	 * the reliable stack unwinder later on. Clear it.
>  	 */
>  	stw	r0,8(r1)
>
> diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
> index 67dc4e3179a0..08b7d6bd4da6 100644
> --- a/arch/powerpc/kernel/exceptions-64e.S
> +++ b/arch/powerpc/kernel/exceptions-64e.S
> @@ -389,7 +389,7 @@ exc_##n##_common:						\
>  	ld	r9,excf+EX_R1(r13);	/* load orig r1 back from PACA */ \
>  	lwz	r10,excf+EX_CR(r13);	/* load orig CR back from PACA */ \
>  	lbz	r11,PACAIRQSOFTMASK(r13); /* get current IRQ softe */	\
> -	ld	r12,exception_marker@toc(r2);				\
> +	li	r12,STACK_FRAME_REGS_MARKER;				\
> 	li
Re: [PATCH v2 10/10] drm/ofdrm: Support color management
Hi Thomas, On Wed, Sep 21, 2022 at 2:55 PM Thomas Zimmermann wrote: > Am 05.08.22 um 02:19 schrieb Benjamin Herrenschmidt: > > On Wed, 2022-07-20 at 16:27 +0200, Thomas Zimmermann wrote: > >> +#if !defined(CONFIG_PPC) > >> +static inline void out_8(void __iomem *addr, int val) > >> +{ } > >> +static inline void out_le32(void __iomem *addr, int val) > >> +{ } > >> +static inline unsigned int in_le32(const void __iomem *addr) > >> +{ > >> + return 0; > >> +} > >> +#endif > > > > These guys could just be replaced with readb/writel/readl respectively > > (beware of the argument swap). > > I only added them for COMPILE_TEST. There appears to be no portable > interface that implements out_le32() and in_le32()? iowrite32() and ioread32()? Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH 1/2] powerpc/64s: Fix GENERIC_CPU build flags for PPC970 / G5
Le 21/09/2022 à 03:41, Nicholas Piggin a écrit : > Big-endian GENERIC_CPU supports 970, but builds with -mcpu=power5. > POWER5 is ISA v2.02 whereas 970 is v2.01 plus Altivec. 2.02 added > the popcntb instruction which a compiler might use. > > Use -mcpu=power4. > > Fixes: 471d7ff8b51b ("powerpc/64s: Remove POWER4 support") > Signed-off-by: Nicholas Piggin > --- > arch/powerpc/Makefile | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile > index 02742facf895..140a5e6471fe 100644 > --- a/arch/powerpc/Makefile > +++ b/arch/powerpc/Makefile > @@ -152,7 +152,7 @@ CFLAGS-$(CONFIG_GENERIC_CPU) += -mcpu=power8 > CFLAGS-$(CONFIG_GENERIC_CPU) += $(call > cc-option,-mtune=power9,-mtune=power8) > else > CFLAGS-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mtune=power7,$(call > cc-option,-mtune=power5)) > -CFLAGS-$(CONFIG_GENERIC_CPU) += $(call cc-option,-mcpu=power5,-mcpu=power4) > +CFLAGS-$(CONFIG_GENERIC_CPU) += -mcpu=power4 > endif > else ifdef CONFIG_PPC_BOOK3E_64 > CFLAGS-$(CONFIG_GENERIC_CPU) += -mcpu=powerpc64 That else ifdef CONFIG_PPC_BOOK3E_64 looks odd. I might have forgotten to drop something. Since commit d6b551b8f90c ("powerpc/64e: Fix build failure with GCC 12 (unrecognized opcode: `wrteei')") it is not possible anymore to select CONFIG_GENERIC_CPU if not book3s64. Christophe
Re: [PATCH] Revert "powerpc/rtas: Implement reentrant rtas call"
"Nicholas Piggin" writes: > On Mon Sep 19, 2022 at 11:51 PM AEST, Nathan Lynch wrote: >> > I wonder - would it be worth making the panic path use a separate >> > "emergency" rtas_args buffer as well? If a CPU is actually "stuck" in >> > RTAS at panic time, then leaving rtas.args untouched might make the >> > ibm,int-off, ibm,set-xive, ibm,os-term, and any other RTAS calls we >> > incur on the panic path more likely to succeed. > > Yeah I think that's probably not a bad idea. Not sure if you've got the > bandwidth to take on doing the patch but be my guest if you do :) > Otherwise we can file it in github issues. Not sure I'll be able to work it soon. I filed https://github.com/linuxppc/issues/issues/435
[powerpc] memcpy warning drivers/scsi/scsi_transport_fc.c:581 (next-20220921)
While booting a recent linux-next kernel on a Power server, the following
warning is seen:

[6.427054] lpfc 0022:01:00.0: 0:6468 Set host date / time: Status x10:
[6.471457] lpfc 0022:01:00.0: 0:6448 Dual Dump is enabled
[7.432161] [ cut here ]
[7.432178] memcpy: detected field-spanning write (size 8) of single field "&event->event_data" at drivers/scsi/scsi_transport_fc.c:581 (size 4)
[7.432201] WARNING: CPU: 0 PID: 16 at drivers/scsi/scsi_transport_fc.c:581 fc_host_post_fc_event+0x214/0x300 [scsi_transport_fc]
[7.432228] Modules linked in: sr_mod(E) cdrom(E) sd_mod(E) sg(E) lpfc(E+) nvmet_fc(E) ibmvscsi(E) nvmet(E) scsi_transport_srp(E) ibmveth(E) nvme_fc(E) nvme(E) nvme_fabrics(E) nvme_core(E) t10_pi(E) scsi_transport_fc(E) crc64_rocksoft(E) crc64(E) tg3(E) ipmi_devintf(E) ipmi_msghandler(E) fuse(E)
[7.432263] CPU: 0 PID: 16 Comm: kworker/0:1 Tainted: G E 6.0.0-rc6-next-20220921 #38
[7.432270] Workqueue: events work_for_cpu_fn
[7.432277] NIP: c00801366a2c LR: c00801366a28 CTR: 007088ec
[7.432282] REGS: c380b6d0 TRAP: 0700 Tainted: G E (6.0.0-rc6-next-20220921)
[7.432288] MSR: 8282b033 CR: 48002824 XER: 0005
[7.432304] CFAR: c01555b4 IRQMASK: 0
GPR00: c00801366a28 c380b970 c00801388300 0084
GPR04: 7fff c380b730 c380b728 0027
GPR08: c00db7007f98 0001 0027 c2947378
GPR12: 2000 c2dc c018e3d8 c3045740
GPR16:
GPR20: 0030 010010df c380ba90
GPR24: 0001 c30ea000 c2da2a08
GPR28: 0040 c00073f52400 0008 c000940b9834
[7.432365] NIP [c00801366a2c] fc_host_post_fc_event+0x214/0x300 [scsi_transport_fc]
[7.432374] LR [c00801366a28] fc_host_post_fc_event+0x210/0x300 [scsi_transport_fc]
[7.432383] Call Trace:
[7.432385] [c380b970] [c00801366a28] fc_host_post_fc_event+0x210/0x300 [scsi_transport_fc] (unreliable)
[7.432396] [c380ba30] [c00801c23028] lpfc_post_init_setup+0xc0/0x1f0 [lpfc]
[7.432429] [c380bab0] [c00801c24e00] lpfc_pci_probe_one_s4.isra.59+0x428/0xa10 [lpfc]
[7.432455] [c380bb40] [c00801c255a4] lpfc_pci_probe_one+0x1bc/0xb70 [lpfc]
[7.432480] [c380bbe0] [c07fdc7c] local_pci_probe+0x6c/0x110
[7.432489] [c380bc60] [c017bdf8] work_for_cpu_fn+0x38/0x60
[7.432494] [c380bc90] [c01812d4] process_one_work+0x2b4/0x5b0
[7.432501] [c380bd30] [c0181820] worker_thread+0x250/0x600
[7.432508] [c380bdc0] [c018e4f4] kthread+0x124/0x130
[7.432514] [c380be10] [c000cdf4] ret_from_kernel_thread+0x5c/0x64
[7.432521] Instruction dump:
[7.432524] 2f89 409eff5c 3ca2 e8a58170 3c62 e8638178 3921 38c4
[7.432535] 7fc4f378 992a 4800414d e8410018 <0fe0> 7fa3eb78 3881 480044d1
[7.432546] ---[ end trace ]---
[7.471075] lpfc 0022:01:00.0: 0:3176 Port Name 0 Physical Link is functional
[7.471405] lpfc 0022:01:00.1: enabling device (0144 -> 0146)

The warning was added by the following patch:

commit 54d9469bc515dc5fcbc20eecbe19cea868b70d68
    fortify: Add run-time WARN for cross-field memcpy()

Should this be fixed in the driver, or is this a false warning?

Thanks
- Sachin
Re: [RFC PATCH 5/7] powerpc/64s: update generic cpu option name and compiler flags
Hi! On Wed, Sep 21, 2022 at 11:01:18AM +1000, Nicholas Piggin wrote: > On Wed Sep 21, 2022 at 8:16 AM AEST, Segher Boessenkool wrote: > > On Tue, Sep 20, 2022 at 12:01:47AM +1000, Nicholas Piggin wrote: > > > Update the 64s GENERIC_CPU option. POWER4 support has been dropped, so > > > make that clear in the option name. > > > > AFAIR the minimum now is POWER4+ (ISA 2.01), not POWER5 (ISA 2.02). > > It's POWER5 now, because of commit 471d7ff8b5 ("powerpc/64s: Remove > POWER4 support"), which is misguided about POWER4+ and also introduced > the -mcpu=power5 bug on 970 builds :\ ISA 2.01 added just a few things (LPES[0], HDEC, some PM things, but crucially also anything that sets MSR[PR] also sets MSR[EE] since then). > Not sure it's worth adding POWER4+ support back but if someone has a > POWER4+ or adds it to QEMU TCG, I will do the patch. 970 is 2.01 -- pretending it is 2.02 is a ticking time bomb: the popcntb insn will be generated for popcount and parity intrinsics, which can be generated by generic code! > > > -mtune= before power8 is dropped because the minimum gcc version > > > supports power8, and tuning is made consistent between big and little > > > endian. > > > > Tuning for p8 on e.g. 970 gives quite bad results. No idea if anyone > > cares, but this is a serious regression if so. > > It's for "generic" kernel so we set low minimum but higher tune, > assuming that people would usually have newer, so it was already > doing -mtune=power7. > > We could make a specific 970/G5 entry though, since those still > have users. If that uses -mcpu=power4 (which means ISA 2.01 btw!) all is fine already? (Or -mcpu=970, same thing really, it just allows VMX as well). Thanks for taking care of this Nick, much appreciated! Segher
[PATCH V2 2/3] tools/perf/tests: Fix branch stack sampling test to include sanity check for branch filter
commit b55878c90ab9 ("perf test: Add test for branch stack sampling")
added a test for branch stack sampling. There is a sanity check in the
beginning to skip the test if the hardware doesn't support branch stack
sampling. Snippet:

<<>>
# skip the test if the hardware doesn't support branch stack sampling
perf record -b -o- -B true > /dev/null 2>&1 || exit 2
<<>>

But the testcase also uses branch sample types: save_type, any. If any
platform doesn't support the branch filters used in the test, the
testcase will fail. In powerpc, multiple branch filters are currently
not supported, and hence this test fails on powerpc. Fix the sanity
check to look at the support for the branch filters used in this test
before proceeding with the test.

Fixes: b55878c90ab9 ("perf test: Add test for branch stack sampling")
Reported-by: Disha Goel
Signed-off-by: Athira Rajeev
Reviewed-by: Kajol Jain
---
 tools/perf/tests/shell/test_brstack.sh | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/tests/shell/test_brstack.sh b/tools/perf/tests/shell/test_brstack.sh
index c644f94a6500..ec801cffae6b 100755
--- a/tools/perf/tests/shell/test_brstack.sh
+++ b/tools/perf/tests/shell/test_brstack.sh
@@ -12,7 +12,8 @@ if ! [ -x "$(command -v cc)" ]; then
 fi

 # skip the test if the hardware doesn't support branch stack sampling
-perf record -b -o- -B true > /dev/null 2>&1 || exit 2
+# and if the architecture doesn't support filter types: any,save_type,u
+perf record -b -o- -B --branch-filter any,save_type,u true > /dev/null 2>&1 || exit 2

 TMPDIR=$(mktemp -d /tmp/__perf_test.program.X)
-- 
2.31.1
[PATCH V2 3/3] tools/testing/selftests/powerpc: Update the bhrb filter sampling test to test for multiple branch filters
For PERF_SAMPLE_BRANCH_STACK sample type, different branch_sample_type,
i.e. branch filters, are supported. The testcase "bhrb_filter_map_test"
tests the valid and invalid filter maps on different powerpc platforms.
Update this testcase to include a scenario covering multiple branch
filters at the same time. Since powerpc doesn't support multiple filters
at the same time, expect failure during perf_event_open.

Reported-by: Disha Goel
Signed-off-by: Athira Rajeev
Reviewed-by: Kajol Jain
---
Changelog:
Fixed compile error for undefined branch sample type by using
PERF_SAMPLE_BRANCH_ANY_CALL instead of PERF_SAMPLE_BRANCH_TYPE_SAVE

 .../powerpc/pmu/sampling_tests/bhrb_filter_map_test.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/tools/testing/selftests/powerpc/pmu/sampling_tests/bhrb_filter_map_test.c b/tools/testing/selftests/powerpc/pmu/sampling_tests/bhrb_filter_map_test.c
index 8182647c63c8..3f43c315c666 100644
--- a/tools/testing/selftests/powerpc/pmu/sampling_tests/bhrb_filter_map_test.c
+++ b/tools/testing/selftests/powerpc/pmu/sampling_tests/bhrb_filter_map_test.c
@@ -96,6 +96,15 @@ static int bhrb_filter_map_test(void)
 		}
 	}

+	/*
+	 * Combine filter maps which include a valid branch filter and an
+	 * invalid branch filter. Example: any (PERF_SAMPLE_BRANCH_ANY) and
+	 * any_call (PERF_SAMPLE_BRANCH_ANY_CALL).
+	 * The perf_event_open should fail in this case.
+	 */
+	event.attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY | PERF_SAMPLE_BRANCH_ANY_CALL;
+	FAIL_IF(!event_open(&event));
+
 	return 0;
 }
-- 
2.31.1
[PATCH V2 1/3] powerpc/perf: Fix branch_filter support for multiple filters in powerpc
For PERF_SAMPLE_BRANCH_STACK sample type, different branch_sample_type,
i.e. branch filters, are supported. The branch filters are requested via
the event attribute "branch_sample_type". Multiple branch filters can be
passed in the event attribute. Example:

	perf record -b -o- -B --branch-filter any,ind_call true

powerpc does not support having multiple branch filters at the same
time. In powerpc, branch filters for branch stack sampling are set via
MMCRA IFM bits [32:33]. But currently, when requesting multiple filter
types, the "perf record" command does not report any error. Example:

	perf record -b -o- -B --branch-filter any,save_type true
	perf record -b -o- -B --branch-filter any,ind_call true

The "bhrb_filter_map" function in the PMU driver code does the validity
check for supported branch filters, but this check is done for a single
filter. Hence "perf record" will proceed here without reporting any
error.

Fix power_pmu_event_init to return EOPNOTSUPP when multiple branch
filters are requested in the event attr. After the fix:

	perf record --branch-filter any,ind_call -- ls
	Error:
	cycles: PMU Hardware doesn't support sampling/overflow-interrupts.
	Try 'perf stat'

Reported-by: Disha Goel
Signed-off-by: Athira Rajeev
Reviewed-by: Madhavan Srinivasan
Tested-by: Disha Goel
Reviewed-by: Kajol Jain
---
Changelog:
Addressed review comment from Michael Ellerman to do irq restore before
returning EOPNOTSUPP

 arch/powerpc/perf/core-book3s.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 13919eb96931..a939ac3b3a84 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -2131,6 +2131,23 @@ static int power_pmu_event_init(struct perf_event *event)
 	if (has_branch_stack(event)) {
 		u64 bhrb_filter = -1;

+		/*
+		 * powerpc does not support having multiple branch filters
+		 * at the same time. Branch filters are set via MMCRA
+		 * IFM[32:33] bits for power8 and above. Return EOPNOTSUPP
+		 * when multiple branch filters are requested in the event
+		 * attr.
+		 *
+		 * When opening event via perf_event_open, branch_sample_type
+		 * gets adjusted in perf_copy_attr function. Kernel will
+		 * automatically adjust the branch_sample_type based on the
+		 * event modifier settings to include PERF_SAMPLE_BRANCH_PLM_ALL.
+		 * Hence drop the check for PERF_SAMPLE_BRANCH_PLM_ALL.
+		 */
+		if (hweight64(event->attr.branch_sample_type & ~PERF_SAMPLE_BRANCH_PLM_ALL) > 1) {
+			local_irq_restore(irq_flags);
+			return -EOPNOTSUPP;
+		}
+
 		if (ppmu->bhrb_filter_map)
 			bhrb_filter = ppmu->bhrb_filter_map(
 					event->attr.branch_sample_type);
-- 
2.31.1
Re: [PATCH v1 2/3] powerpc/prom_init: drop PROM_BUG()
On 21.09.22 15:02, Michael Ellerman wrote:
> David Hildenbrand writes:
>> Unused, let's drop it.
>>
>> Signed-off-by: David Hildenbrand
>> ---
>>  arch/powerpc/kernel/prom_init.c | 6 --
>>  1 file changed, 6 deletions(-)
>
> Thanks. I'll take this one via the powerpc tree, and the others can go
> via wherever?

Makes sense; I'll drop this patch in case I have to resend, assuming
it's in your tree. Thanks!

-- 
Thanks,

David / dhildenb
Re: [PATCH v1 2/3] powerpc/prom_init: drop PROM_BUG()
David Hildenbrand writes: > Unused, let's drop it. > > Signed-off-by: David Hildenbrand > --- > arch/powerpc/kernel/prom_init.c | 6 -- > 1 file changed, 6 deletions(-) Thanks. I'll take this one via the powerpc tree, and the others can go via wherever? cheers > diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c > index a6669c40c1db..d464ba412084 100644 > --- a/arch/powerpc/kernel/prom_init.c > +++ b/arch/powerpc/kernel/prom_init.c > @@ -96,12 +96,6 @@ static int of_workarounds __prombss; > #define OF_WA_CLAIM 1 /* do phys/virt claim separately, then map */ > #define OF_WA_LONGTRAIL 2 /* work around longtrail bugs */ > > -#define PROM_BUG() do { \ > -prom_printf("kernel BUG at %s line 0x%x!\n", \ > - __FILE__, __LINE__);\ > - __builtin_trap(); \ > -} while (0) > - > #ifdef DEBUG_PROM > #define prom_debug(x...) prom_printf(x) > #else > -- > 2.37.3
Re: [PATCH] powerpc: Save AMR/IAMR when switching tasks
Christophe Leroy writes:
> Le 19/09/2022 à 14:37, Michael Ellerman a écrit :
>> Christophe Leroy writes:
>>> Le 16/09/2022 à 07:05, Samuel Holland a écrit :
>>>> With CONFIG_PREEMPT=y (involuntary preemption enabled), it is
>>>> possible to switch away from a task inside copy_{from,to}_user.
>>>> This left the CPU with userspace access enabled until after the
>>>> next IRQ or privilege level switch, when AMR/IAMR got reset to
>>>> AMR_KU[AE]P_BLOCKED. Then, when switching back to the original
>>>> task, the userspace access would fault:
>>>
>>> This is not supposed to happen. You never switch away from a task
>>> magically. Task switch will always happen in an interrupt, that
>>> means copy_{from,to}_user() get interrupted.
>>
>> Unfortunately this isn't true when CONFIG_PREEMPT=y.
>
> Argh, yes, I wrote the above with the assumption that we properly
> follow the main principles that no complex function is to be used
> while KUAP is open ... Which is apparently not true here. x86 would
> have detected it with objtool, but we don't have it yet in powerpc.

Yes and yes :/

>> We can switch away without an interrupt via:
>>   __copy_tofrom_user()
>>   -> __copy_tofrom_user_power7()
>>      -> exit_vmx_usercopy()
>>         -> preempt_enable()
>>            -> __preempt_schedule()
>>               -> preempt_schedule()
>>                  -> preempt_schedule_common()
>>                     -> __schedule()
>
> Should we use preempt_enable_no_resched() to avoid that ?

Good point :)

...

>> Still I think it might be our best option for an easy fix.
>
> Wouldn't it be even easier and less abusive to use
> preempt_enable_no_resched() ? Or is there definitively a good reason
> to resched after a VMX copy while we don't with regular copies ?

I don't think it's important to reschedule there. I guess it means
another task that wants to preempt will have to wait a little longer,
but I doubt it's measurable.

One reason to do the KUAP lock is it also protects us from running
disable_kernel_altivec() with KUAP unlocked. That in turn calls into
msr_check_and_clear() and __msr_check_and_clear(), none of which is a
problem per-se, but we'd need to make that all notrace to be 100% safe.

At the moment we're running that all with KUAP unlocked anyway, so
using preempt_enable_no_resched() would fix the actual bug and is much
nicer than my patch, so we should probably do that. We can look at
making disable_kernel_altivec() etc. notrace as a follow-up.

cheers
Re: [PATCH v2 10/10] drm/ofdrm: Support color management
Hi

Am 05.08.22 um 02:19 schrieb Benjamin Herrenschmidt:
> On Wed, 2022-07-20 at 16:27 +0200, Thomas Zimmermann wrote:
>> +#if !defined(CONFIG_PPC)
>> +static inline void out_8(void __iomem *addr, int val)
>> +{ }
>> +static inline void out_le32(void __iomem *addr, int val)
>> +{ }
>> +static inline unsigned int in_le32(const void __iomem *addr)
>> +{
>> +	return 0;
>> +}
>> +#endif
>
> These guys could just be replaced with readb/writel/readl respectively
> (beware of the argument swap).

I only added them for COMPILE_TEST. There appears to be no portable
interface that implements out_le32() and in_le32()?

Best regards
Thomas

> Cheers,
> Ben.

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev
Re: [PATCH v2 09/10] drm/ofdrm: Add per-model device function
Hi

Am 05.08.22 um 02:22 schrieb Benjamin Herrenschmidt:
> On Tue, 2022-07-26 at 16:40 +0200, Michal Suchánek wrote:
>> Hello,
>>
>> On Tue, Jul 26, 2022 at 03:38:37PM +0200, Javier Martinez Canillas wrote:
>>> On 7/20/22 16:27, Thomas Zimmermann wrote:
>>>> Add a per-model device-function structure in preparation of adding
>>>> color-management support. Detection of the individual models has
>>>> been taken from fbdev's offb.
>>>>
>>>> Signed-off-by: Thomas Zimmermann
>>>> ---
>>>
>>> Reviewed-by: Javier Martinez Canillas
>>>
>>> [...]
>>>
>>>> +static bool is_avivo(__be32 vendor, __be32 device)
>>>> +{
>>>> +	/* This will match most R5xx */
>>>> +	return (vendor == 0x1002) &&
>>>> +	       ((device >= 0x7100 && device < 0x7800) || (device >= 0x9400));
>>>> +}
>>>
>>> Maybe add some constant macros to not have these magic numbers ?
>>
>> This is based on the existing fbdev implementation's magic numbers:
>>
>> drivers/video/fbdev/offb.c:	((*did >= 0x7100 && *did < 0x7800) ||
>>
>> Of course, it would be great if somebody knowledgeable could clarify
>> those.
>
> I don't think anybody remembers :-) Vendor 0x1002 is PCI_VENDOR_ID_ATI,

I do :)

> but the rest is basically ranges of PCI IDs for which we don't have
> symbolic constants.

Should we add them to the official list in pci_ids.h? I cannot find
0x7800. The others are R520 and R600.

Best regards
Thomas

> Cheers,
> Ben.

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev
Re: [PATCH v2 08/10] drm/ofdrm: Add CRTC state
Hi

Am 26.07.22 um 15:36 schrieb Javier Martinez Canillas:
> On 7/20/22 16:27, Thomas Zimmermann wrote:
>> Add a dedicated CRTC state to ofdrm to later store information for
>> palette updates.
>>
>> Signed-off-by: Thomas Zimmermann
>> ---
>>  drivers/gpu/drm/tiny/ofdrm.c | 62 ++--
>
> Reviewed-by: Javier Martinez Canillas
>
> [...]
>
>> +static void ofdrm_crtc_reset(struct drm_crtc *crtc)
>> +{
>> +	struct ofdrm_crtc_state *ofdrm_crtc_state;
>> +
>> +	if (crtc->state) {
>> +		ofdrm_crtc_state_destroy(to_ofdrm_crtc_state(crtc->state));
>> +		crtc->state = NULL; /* must be set to NULL here */
>> +	}
>> +
>> +	ofdrm_crtc_state = kzalloc(sizeof(*ofdrm_crtc_state), GFP_KERNEL);
>> +	if (!ofdrm_crtc_state)
>> +		return;
>> +	__drm_atomic_helper_crtc_reset(crtc, &ofdrm_crtc_state->base);
>> +}
>> +
>
> IMO this function is hard to read, I would instead write it as following:
>
> static void ofdrm_crtc_reset(struct drm_crtc *crtc)
> {
> 	struct ofdrm_crtc_state *ofdrm_crtc_state =
> 		kzalloc(sizeof(*ofdrm_crtc_state), GFP_KERNEL);
>
> 	if (!ofdrm_crtc_state)
> 		return;
>
> 	if (crtc->state) {
> 		ofdrm_crtc_state_destroy(to_ofdrm_crtc_state(crtc->state));
> 		crtc->state = NULL; /* must be set to NULL here */
> 	}
>
> 	__drm_atomic_helper_crtc_reset(crtc, &ofdrm_crtc_state->base);
> }
>
> Also with that form I think that the crtc->state = NULL could just be
> dropped ?

I once had to add this line to a driver to make the DRM helpers work.
But I cannot find any longer why. Maybe it's been resolved meanwhile.

Best regards
Thomas

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev
Re: [PATCH v2 07/10] drm/ofdrm: Add ofdrm for Open Firmware framebuffers
Hi

Am 26.07.22 um 15:17 schrieb Javier Martinez Canillas:
> Hello Thomas,
>
> On 7/20/22 16:27, Thomas Zimmermann wrote:
>> Open Firmware provides basic display output via the 'display' node.
>> DT platform code already provides a device that represents the node's
>> framebuffer. Add a DRM driver for the device. The display mode and
>> color format is pre-initialized by the system's firmware. Runtime
>> modesetting via DRM is not possible. The display is useful during
>> early boot stages or as error fallback.
>
> I'm not familiar with OF display but the driver looks good to me.
>
> Reviewed-by: Javier Martinez Canillas
>
> I just have a few questions below.
>
> [...]
>
>> +static int ofdrm_primary_plane_helper_atomic_check(struct drm_plane *plane,
>> +						   struct drm_atomic_state *new_state)
>> +{
>> +	struct drm_plane_state *new_plane_state = drm_atomic_get_new_plane_state(new_state, plane);
>> +	struct drm_crtc_state *new_crtc_state;
>> +	int ret;
>> +
>> +	if (!new_plane_state->fb)
>> +		return 0;
>> +
>> +	new_crtc_state = drm_atomic_get_new_crtc_state(new_state, new_plane_state->crtc);
>> +
>> +	ret = drm_atomic_helper_check_plane_state(new_plane_state, new_crtc_state,
>> +						  DRM_PLANE_HELPER_NO_SCALING,
>> +						  DRM_PLANE_HELPER_NO_SCALING,
>> +						  false, false);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return 0;
>> +}
>
> This seems to be exactly the same check than used in the simpledrm
> driver. Maybe could be moved to the fwfb helper library too ?

I've meanwhile dropped fwfb helpers. Color management requires specific
code here, so there's no much to share anyway.

> [...]
>
>> +static void ofdrm_crtc_helper_atomic_disable(struct drm_crtc *crtc,
>> +					     struct drm_atomic_state *old_state)
>> +{
>> +	/*
>> +	 * Always enabled; disabling clears the screen in the
>> +	 * primary plane's atomic_disable function.
>> +	 */
>> +}
>> +
>
> Same comment than for simpledrm, are these no-op helpers really needed ?

In simpledrm, I've meanwhile removed them. I'll do so here as well.

> [...]
>
>> +static const struct of_device_id ofdrm_of_match_display[] = {
>> +	{ .compatible = "display", },
>> +	{ },
>> +};
>> +MODULE_DEVICE_TABLE(of, ofdrm_of_match_display);
>> +
>
> I don't see a binding for this in Documentation/devicetree/bindings/display.
> Do we need one or it's that only required for FDT and not Open Firmware DT ?

No idea. The device is being created in drivers/of/platform.c. If offb
didn't need these bindings, ofdrm probably won't need them either.

Best regards
Thomas

-- 
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev
Re: [PATCH v3] hugetlb: simplify hugetlb handling in follow_page_mask
On 9/19/2022 10:13 AM, Mike Kravetz wrote:
> During discussions of this series [1], it was suggested that hugetlb
> handling code in follow_page_mask could be simplified. At the beginning
> of follow_page_mask, there currently is a call to follow_huge_addr which
> 'may' handle hugetlb pages. ia64 is the only architecture which provides
> a follow_huge_addr routine that does not return error.
>
> Instead, at each level of the page table a check is made for a hugetlb
> entry. If a hugetlb entry is found, a call to a routine associated with
> that entry is made.
>
> Currently, there are two checks for hugetlb entries at each page table
> level. The first check is of the form:
>
> 	if (p?d_huge())
> 		page = follow_huge_p?d();
>
> the second check is of the form:
>
> 	if (is_hugepd())
> 		page = follow_huge_pd().
>
> We can replace these checks, as well as the special handling routines
> such as follow_huge_p?d() and follow_huge_pd() with a single routine to
> handle hugetlb vmas.
>
> A new routine hugetlb_follow_page_mask is called for hugetlb vmas at the
> beginning of follow_page_mask. hugetlb_follow_page_mask will use the
> existing routine huge_pte_offset to walk page tables looking for hugetlb
> entries. huge_pte_offset can be overwritten by architectures, and
> already handles special cases such as hugepd entries.
>
> [1] https://lore.kernel.org/linux-mm/cover.1661240170.git.baolin.w...@linux.alibaba.com/
>
> Suggested-by: David Hildenbrand
> Signed-off-by: Mike Kravetz

LGTM, and works well on my machine. So feel free to add:

Reviewed-by: Baolin Wang
Tested-by: Baolin Wang
Re: [PATCH v3 00/16] objtool: Enable and implement --mcount option on powerpc
Hi Josh, On 14/09/22 05:45, Josh Poimboeuf wrote: On Tue, Sep 13, 2022 at 04:13:52PM +0200, Peter Zijlstra wrote: On Mon, Sep 12, 2022 at 01:50:04PM +0530, Sathvika Vasireddy wrote: Christophe Leroy (4): objtool: Fix SEGFAULT objtool: Use target file endianness instead of a compiled constant objtool: Use target file class size instead of a compiled constant Sathvika Vasireddy (12): objtool: Add --mnop as an option to --mcount objtool: Read special sections with alts only when specific options are selected objtool: Use macros to define arch specific reloc types objtool: Add arch specific function arch_ftrace_match() objtool/powerpc: Enable objtool to be built on ppc objtool/powerpc: Add --mcount specific implementation tools/objtool/arch/powerpc/Build | 2 + tools/objtool/arch/powerpc/decode.c | 101 ++ .../arch/powerpc/include/arch/cfi_regs.h | 11 ++ tools/objtool/arch/powerpc/include/arch/elf.h | 10 ++ .../arch/powerpc/include/arch/special.h | 21 tools/objtool/arch/powerpc/special.c | 19 tools/objtool/arch/x86/decode.c | 5 + tools/objtool/arch/x86/include/arch/elf.h | 2 + .../arch/x86/include/arch/endianness.h| 9 -- tools/objtool/builtin-check.c | 14 +++ tools/objtool/check.c | 53 - tools/objtool/elf.c | 8 +- tools/objtool/include/objtool/arch.h | 2 + tools/objtool/include/objtool/builtin.h | 1 + tools/objtool/include/objtool/elf.h | 8 ++ tools/objtool/include/objtool/endianness.h| 32 +++--- tools/objtool/orc_dump.c | 11 +- tools/objtool/orc_gen.c | 4 +- tools/objtool/special.c | 3 +- This seems to painlessly merge with the objtool changes I have in queue.git/call-depth-tracking. After that all I need is the below little patch to make it to build ppc44x_defconfig + CONFIG_DYNAMIC_FTRACE=y. So I think these patches can go through the powerpc tree if Michael wants. Josh you okay with that, or should we do something complicated? I'm all for avoiding complicated, but let me try to give it a proper review first. Did you get a chance to review this patch set? - Sathvika
Re: [PATCH v4 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
On Wed, Sep 21, 2022 at 8:45 PM Yicong Yang wrote: > > From: Anshuman Khandual > > The entire scheme of deferred TLB flush in reclaim path rests on the > fact that the cost to refill TLB entries is less than flushing out > individual entries by sending IPI to remote CPUs. But architecture > can have different ways to evaluate that. Hence apart from checking > TTU_BATCH_FLUSH in the TTU flags, rest of the decision should be > architecture specific. > > Signed-off-by: Anshuman Khandual > [https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khand...@linux.vnet.ibm.com/] > Signed-off-by: Yicong Yang > [Rebase and fix incorrect return value type] > Reviewed-by: Kefeng Wang > Reviewed-by: Anshuman Khandual > --- Reviewed-by: Barry Song > arch/x86/include/asm/tlbflush.h | 12 > mm/rmap.c | 9 + > 2 files changed, 13 insertions(+), 8 deletions(-) > > diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h > index cda3118f3b27..8a497d902c16 100644 > --- a/arch/x86/include/asm/tlbflush.h > +++ b/arch/x86/include/asm/tlbflush.h > @@ -240,6 +240,18 @@ static inline void flush_tlb_page(struct vm_area_struct > *vma, unsigned long a) > flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false); > } > > +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm) > +{ > + bool should_defer = false; > + > + /* If remote CPUs need to be flushed then defer batch the flush */ > + if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) > + should_defer = true; > + put_cpu(); > + > + return should_defer; > +} > + > static inline u64 inc_mm_tlb_gen(struct mm_struct *mm) > { > /* > diff --git a/mm/rmap.c b/mm/rmap.c > index 93d5a6f793d2..cd8cf5cb0b01 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -690,17 +690,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct > *mm, bool writable) > */ > static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags) > { > - bool should_defer = false; > - > if (!(flags & 
TTU_BATCH_FLUSH)) > return false; > > - /* If remote CPUs need to be flushed then defer batch the flush */ > - if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids) > - should_defer = true; > - put_cpu(); > - > - return should_defer; > + return arch_tlbbatch_should_defer(mm); > } > > /* > -- > 2.24.0 >
[PATCH v4 0/2] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
From: Yicong Yang Though ARM64 has the hardware to do tlb shootdown, the hardware broadcasting is not free. A simplest micro benchmark shows even on snapdragon 888 with only 8 cores, the overhead for ptep_clear_flush is huge even for paging out one page mapped by only one process: 5.36% a.out[kernel.kallsyms] [k] ptep_clear_flush While pages are mapped by multiple processes or HW has more CPUs, the cost should become even higher due to the bad scalability of tlb shootdown. The same benchmark can result in 16.99% CPU consumption on ARM64 server with around 100 cores according to Yicong's test on patch 4/4. This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by 1. only send tlbi instructions in the first stage - arch_tlbbatch_add_mm() 2. wait for the completion of tlbi by dsb while doing tlbbatch sync in arch_tlbbatch_flush() Testing on snapdragon shows the overhead of ptep_clear_flush is removed by the patchset. The micro benchmark becomes 5% faster even for one page mapped by single process on snapdragon 888. -v4: 1. Add tags from Kefeng and Anshuman, Thanks. 2. Limit the TLB batch/defer on systems with >4 CPUs, per Anshuman 3. Merge previous Patch 1,2-3 into one, per Anshuman Link: https://lore.kernel.org/linux-mm/20220822082120.8347-1-yangyic...@huawei.com/ -v3: 1. Declare arch's tlbbatch defer support by arch_tlbbatch_should_defer() instead of ARCH_HAS_MM_CPUMASK, per Barry and Kefeng 2. Add Tested-by from Xin Hao Link: https://lore.kernel.org/linux-mm/20220711034615.482895-1-21cn...@gmail.com/ -v2: 1. Collected Yicong's test result on kunpeng920 ARM64 server; 2. Removed the redundant vma parameter in arch_tlbbatch_add_mm() according to the comments of Peter Zijlstra and Dave Hansen 3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask is empty according to the comments of Nadav Amit Thanks, Peter, Dave and Nadav for your testing or reviewing , and comments. 
-v1: https://lore.kernel.org/lkml/20220707125242.425242-1-21cn...@gmail.com/

Anshuman Khandual (1):
  mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

Barry Song (1):
  arm64: support batched/deferred tlb shootdown during page reclamation

 .../features/vm/TLB/arch-support.txt |  2 +-
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/tlbbatch.h    | 12 ++
 arch/arm64/include/asm/tlbflush.h    | 37 ++-
 arch/x86/include/asm/tlbflush.h      | 15 +++-
 mm/rmap.c                            | 19 --
 6 files changed, 70 insertions(+), 16 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

-- 
2.24.0
[PATCH v4 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
From: Anshuman Khandual

The entire scheme of deferred TLB flush in reclaim path rests on the
fact that the cost to refill TLB entries is less than flushing out
individual entries by sending IPI to remote CPUs. But architectures can
have different ways to evaluate that. Hence apart from checking
TTU_BATCH_FLUSH in the TTU flags, the rest of the decision should be
architecture specific.

Signed-off-by: Anshuman Khandual
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khand...@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang
[Rebase and fix incorrect return value type]
Reviewed-by: Kefeng Wang
Reviewed-by: Anshuman Khandual
---
 arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
 mm/rmap.c                       |  9 +--------
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..8a497d902c16 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,6 +240,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	bool should_defer = false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
 	/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 93d5a6f793d2..cd8cf5cb0b01 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -690,17 +690,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
  */
 static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 {
-	bool should_defer = false;
-
 	if (!(flags & TTU_BATCH_FLUSH))
 		return false;
 
-	/* If remote CPUs need to be flushed then defer batch the flush */
-	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
-		should_defer = true;
-	put_cpu();
-
-	return should_defer;
+	return arch_tlbbatch_should_defer(mm);
}
 
 /*
-- 
2.24.0
[PATCH v4 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
From: Barry Song

on x86, batched and deferred tlb shootdown has led to 90% performance
increase on tlb shootdown. on arm64, HW can do tlb shootdown without
software IPI. But sync tlbi is still quite expensive.

Even running a simple program which requires swapout can prove this is
true,

 #include <sys/mman.h>
 #include <string.h>
 #include <stdlib.h>
 #include <unistd.h>

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
         volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);

         memset(p, 0x88, SIZE);

         for (int k = 0; k < 1; k++) {
                 /* swap in */
                 for (int i = 0; i < SIZE; i += 4096) {
                         (void)p[i];
                 }

                 /* swap out */
                 madvise(p, SIZE, MADV_PAGEOUT);
         }
 }

Perf result on snapdragon 888 with 8 cores by using zRAM as the swap
block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  ......
     21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
      8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
      6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
      6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
      5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
      3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
      3.49%  a.out    [kernel.kallsyms]  [k] memset64
      1.63%  a.out    [kernel.kallsyms]  [k] clear_page
      1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
      1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
      1.23%  a.out    [kernel.kallsyms]  [k] xas_load
      1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark swapping
in/out a page mapped by only one process.
If the page is mapped by multiple processes, typically more than 100 on
a phone, the overhead would be much higher as we have to run tlb flush
100 times for one single page. Plus, tlb flush overhead will increase
with the number of CPU cores due to the bad scalability of tlb shootdown
in HW, so those ARM64 servers should expect much higher overhead.

Further perf annotate shows 95% cpu time of ptep_clear_flush is actually
used by the final dsb() to wait for the completion of tlb flush. This
provides us a very good chance to leverage the existing batched tlb in
the kernel. The minimum modification is that we only send async tlbi in
the first stage and we send dsb while we have to sync in the second
stage.

With the above simple micro benchmark, collapsed time to finish the
program decreases around 5%.

Typical collapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed

w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation.

Tested with the benchmark in the commit on a Kunpeng920 arm64 server,
and observed an improvement of around 12.5% with command
`time ./swap_bench`.

        w/o             w/
 real   0m13.460s       0m11.771s
 user   0m0.248s        0m0.279s
 sys    0m12.039s       0m11.458s

Originally a 16.99% overhead of ptep_clear_flush() was noticed, which
has been eliminated by this patch:

[root@localhost yang]# perf record -- ./swap_bench && perf report
[...]
 16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

Cc: Jonathan Corbet
Cc: Nadav Amit
Cc: Mel Gorman
Tested-by: Yicong Yang
Tested-by: Xin Hao
Signed-off-by: Barry Song
Signed-off-by: Yicong Yang
Reviewed-by: Kefeng Wang
---
 .../features/vm/TLB/arch-support.txt |  2 +-
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/tlbbatch.h    | 12 ++
 arch/arm64/include/asm/tlbflush.h    | 37 ++-
 arch/x86/include/asm/tlbflush.h      |  3 +-
 mm/rmap.c                            | 10 +++--
 6 files changed, 57 insertions(+), 8 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

diff --git a/Documentation/features/vm/TLB/arch-support.txt b/Documentation/features/vm/TLB/arch-support.txt
index 039e4e91ada3..2caf815d7c6c 100644
--- a/Documentation/features/vm/TLB/arch-support.txt
+++ b/Documentation/features/vm/TLB/arch-support.txt
@@ -9,7 +9,7 @@
 |
Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
On Sep 20, 2022, at 11:53 PM, Anshuman Khandual wrote:

> ⚠ External Email
>
> On 8/22/22 13:51, Yicong Yang wrote:
>> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
>> +					struct mm_struct *mm,
>> +					unsigned long uaddr)
>> +{
>> +	__flush_tlb_page_nosync(mm, uaddr);
>> +}
>> +
>> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>> +{
>> +	dsb(ish);
>> +}
>
> Just wondering if arch_tlbbatch_add_mm() could also detect continuous mapping
> TLB invalidation requests on a given mm and try to generate a range based TLB
> invalidation such as flush_tlb_range().
>
> struct arch_tlbflush_unmap_batch via task->tlb_ubc->arch can track continuous
> ranges while being queued up via arch_tlbbatch_add_mm(), any range formed can
> later be flushed in subsequent arch_tlbbatch_flush() ?
>
> OR
>
> It might not be worth the effort and complexity, in comparison to performance
> improvement, TLB range flush brings in ?

So here are my 2 cents, based on my experience with Intel-x86. It is
likely different on arm64, but perhaps it can provide you some insight
into what parameters you should measure and consider.

In general there is a tradeoff between full TLB flushes and
entry-specific ones. Flushing specific entries takes more time than
flushing the entire TLB, but saves TLB refills.

Dave Hansen made some calculations in the past and came up with 33 as a
magic cutoff number, i.e., if you need to flush more than 33 entries,
just flush the entire TLB. I am not sure that this exact number is very
meaningful, since one might argue that it should've taken PTI into
account (which might require twice as many TLB invalidations).

Anyhow, back to arch_tlbbatch_add_mm(). It may be possible to track
ranges, but the question is whether you would actually succeed in
forming continuous ranges that are eventually (on x86) smaller than the
full TLB flush cutoff (=33). Questionable (perhaps better with MGLRU?).
Then, you should remember that the tracking should be very efficient,
since even a few cache misses might have a greater cost than what you
save by selective flushing.

Finally, on x86 you would need to invoke the smp/IPI layer multiple
times to send different cores the relevant range they need to flush.

IOW: It is somewhat complicated to implement efficiently, and on x86,
and probably other IPI-based TLB shootdown systems, it does not have a
clear performance benefit (IMHO).
Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
On Wed, Sep 21, 2022 at 6:53 PM Anshuman Khandual wrote:
>
> On 8/22/22 13:51, Yicong Yang wrote:
> > +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> > +					struct mm_struct *mm,
> > +					unsigned long uaddr)
> > +{
> > +	__flush_tlb_page_nosync(mm, uaddr);
> > +}
> > +
> > +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> > +{
> > +	dsb(ish);
> > +}
>
> Just wondering if arch_tlbbatch_add_mm() could also detect continuous mapping
> TLB invalidation requests on a given mm and try to generate a range based TLB
> invalidation such as flush_tlb_range().
>
> struct arch_tlbflush_unmap_batch via task->tlb_ubc->arch can track continuous
> ranges while being queued up via arch_tlbbatch_add_mm(), any range formed can
> later be flushed in subsequent arch_tlbbatch_flush() ?
>
> OR
>
> It might not be worth the effort and complexity, in comparison to performance
> improvement, TLB range flush brings in ?

Probably it is not worth the complexity, as perf annotate shows:
"Further perf annotate shows 95% cpu time of ptep_clear_flush is
actually used by the final dsb() to wait for the completion of tlb
flush." So any further optimization before dsb(ish) might bring some
improvement, but it seems minor.

Thanks
Barry
Re: [PATCH v2 4/4] powerpc/64s: Enable KFENCE on book3s64
On 21/09/2022 04:02, Nicholas Miehlbradt wrote:
> KFENCE support was added for ppc32 in commit 90cbac0e995d
> ("powerpc: Enable KFENCE for PPC32").
> Enable KFENCE on ppc64 architecture with hash and radix MMUs.
> It uses the same mechanism as debug pagealloc to
> protect/unprotect pages. All KFENCE kunit tests pass on both
> MMUs.
>
> KFENCE memory is initially allocated using memblock but is
> later marked as SLAB allocated. This necessitates the change
> to __pud_free to ensure that the KFENCE pages are freed
> appropriately.
>
> Based on previous work by Christophe Leroy and Jordan Niethe.
>
> Signed-off-by: Nicholas Miehlbradt
> ---
> v2: Refactor
> ---
>  arch/powerpc/Kconfig                         |  2 +-
>  arch/powerpc/include/asm/book3s/64/pgalloc.h |  6 --
>  arch/powerpc/include/asm/book3s/64/pgtable.h |  2 +-
>  arch/powerpc/include/asm/kfence.h            | 15 +++
>  arch/powerpc/mm/book3s64/hash_utils.c        | 10 +-
>  arch/powerpc/mm/book3s64/radix_pgtable.c     |  8 +---
>  6 files changed, 31 insertions(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kfence.h b/arch/powerpc/include/asm/kfence.h
> index a9846b68c6b9..cff60983e88d 100644
> --- a/arch/powerpc/include/asm/kfence.h
> +++ b/arch/powerpc/include/asm/kfence.h
> @@ -11,11 +11,25 @@
>  #include
>  #include
>
> +#if defined(CONFIG_PPC64) && !defined(CONFIG_PPC64_ELF_ABI_V2)

CONFIG_PPC64 && !CONFIG_PPC64_ELF_ABI_V2 is the same as
CONFIG_PPC64_ELF_ABI_V1

> +#define ARCH_FUNC_PREFIX "."
> +#endif
> +

Christophe
[PATCH v6 25/25] powerpc/64e: Clear gprs on interrupt routine entry on Book3E
Zero GPRS r14-r31 on entry into the kernel for interrupt sources to limit influence of user-space values in potential speculation gadgets. Prior to this commit, all other GPRS are reassigned during the common prologue to interrupt handlers and so need not be zeroised explicitly. This may be done safely, without loss of register state prior to the interrupt, as the common prologue saves the initial values of non-volatiles, which are unconditionally restored in interrupt_64.S. Mitigation defaults to enabled by INTERRUPT_SANITIZE_REGISTERS. Signed-off-by: Rohan McLure --- V4: New patch. V5: Depend on Kconfig option. Remove ZEROIZE_NVGPRS on bad kernel stack handler. V6: Change name of zeroize macro to be consistent with Book3S. --- arch/powerpc/kernel/exceptions-64e.S | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S index f1d714acc945..d55c0a368dcb 100644 --- a/arch/powerpc/kernel/exceptions-64e.S +++ b/arch/powerpc/kernel/exceptions-64e.S @@ -366,6 +366,11 @@ ret_from_mc_except: std r14,PACA_EXMC+EX_R14(r13); \ std r15,PACA_EXMC+EX_R15(r13) +#ifdef CONFIG_INTERRUPT_SANITIZE_REGISTERS +#define SANITIZE_ZEROIZE_NVGPRS() ZEROIZE_NVGPRS() +#else +#define SANITIZE_ZEROIZE_NVGPRS() +#endif /* Core exception code for all exceptions except TLB misses. */ #define EXCEPTION_COMMON_LVL(n, scratch, excf) \ @@ -402,7 +407,8 @@ exc_##n##_common: \ std r12,STACK_FRAME_OVERHEAD-16(r1); /* mark the frame */ \ std r3,_TRAP(r1); /* set trap number */ \ std r0,RESULT(r1); /* clear regs->result */\ - SAVE_NVGPRS(r1); + SAVE_NVGPRS(r1);\ + SANITIZE_ZEROIZE_NVGPRS(); /* minimise speculation influence */ #define EXCEPTION_COMMON(n) \ EXCEPTION_COMMON_LVL(n, SPRN_SPRG_GEN_SCRATCH, PACA_EXGEN) -- 2.34.1
[PATCH v6 24/25] powerpc/64s: Clear gprs on interrupt routine entry in Book3S
Zero GPRs r0, r2-r11, r14-r31, on entry into the kernel for all other
interrupt sources to limit influence of user-space values in potential
speculation gadgets. The remaining GPRs are overwritten by entry macros
to interrupt handlers, irrespective of whether or not a given handler
consumes these register values.

Prior to this commit, r14-r31 are restored on a per-interrupt basis at
exit, but now they are always restored. Remove explicit REST_NVGPRS
invocations, as non-volatiles must now always be restored. 32-bit
systems do not clear user registers on interrupt, and continue to
depend on the return value of interrupt_exit_user_prepare to determine
whether or not to restore non-volatiles.

The mmap_bench benchmark in selftests should rapidly invoke pagefaults.
A ~0.8% performance regression is seen with this mitigation, but this
indicates the worst-case performance due to heavier-weight interrupt
handlers. This mitigation is disabled by default, but enabled with
CONFIG_INTERRUPT_SANITIZE_REGISTERS.

Signed-off-by: Rohan McLure
---
V2: Add benchmark data
V3: Use ZEROIZE_GPR{,S} macro renames, clarify interrupt_exit_user_prepare
    changes in summary.
V5: Configurable now with INTERRUPT_SANITIZE_REGISTERS. Zero r12
    (containing MSR) from common macro on per-interrupt basis with IOPTION.
V6: Replace ifdefs with invocations of conditionally defined macros.
--- arch/powerpc/kernel/exceptions-64s.S | 47 +- arch/powerpc/kernel/interrupt_64.S | 15 2 files changed, 53 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index a3b51441b039..b3f5ef1c712f 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -21,6 +21,19 @@ #include #include +/* + * macros for handling user register sanitisation + */ +#ifdef CONFIG_INTERRUPT_SANITIZE_REGISTERS +#define SANITIZE_ZEROIZE_NVGPRS() ZEROIZE_NVGPRS() +#define SANITIZE_RESTORE_NVGPRS() REST_NVGPRS(r1) +#define HANDLER_RESTORE_NVGPRS() +#else +#define SANITIZE_ZEROIZE_NVGPRS() +#define SANITIZE_RESTORE_NVGPRS() +#define HANDLER_RESTORE_NVGPRS() REST_NVGPRS(r1) +#endif /* CONFIG_INTERRUPT_SANITIZE_REGISTERS */ + /* * Following are fixed section helper macros. * @@ -111,6 +124,7 @@ name: #define ISTACK .L_ISTACK_\name\() /* Set regular kernel stack */ #define __ISTACK(name) .L_ISTACK_ ## name #define IKUAP .L_IKUAP_\name\() /* Do KUAP lock */ +#define IMSR_R12 .L_IMSR_R12_\name\()/* Assumes MSR saved to r12 */ #define INT_DEFINE_BEGIN(n)\ .macro int_define_ ## n name @@ -176,6 +190,9 @@ do_define_int n .ifndef IKUAP IKUAP=1 .endif + .ifndef IMSR_R12 + IMSR_R12=0 + .endif .endm /* @@ -502,6 +519,7 @@ DEFINE_FIXED_SYMBOL(\name\()_common_real, text) std r10,0(r1) /* make stack chain pointer */ std r0,GPR0(r1) /* save r0 in stackframe*/ std r10,GPR1(r1)/* save r1 in stackframe*/ + ZEROIZE_GPR(0) /* Mark our [H]SRRs valid for return */ li r10,1 @@ -544,8 +562,14 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) std r9,GPR11(r1) std r10,GPR12(r1) std r11,GPR13(r1) + .if !IMSR_R12 + ZEROIZE_GPRS(9, 12) + .else + ZEROIZE_GPRS(9, 11) + .endif SAVE_NVGPRS(r1) + SANITIZE_ZEROIZE_NVGPRS() .if IDAR .if IISIDE @@ -577,8 +601,8 @@ BEGIN_FTR_SECTION END_FTR_SECTION_IFSET(CPU_FTR_CFAR) ld r10,IAREA+EX_CTR(r13) std r10,_CTR(r1) - std r2,GPR2(r1) /* save r2 in stackframe*/ - SAVE_GPRS(3, 8, r1) /* save r3 
- r8 in stackframe */ + SAVE_GPRS(2, 8, r1) /* save r2 - r8 in stackframe */ + ZEROIZE_GPRS(2, 8) mflrr9 /* Get LR, later save to stack */ ld r2,PACATOC(r13) /* get kernel TOC into r2 */ std r9,_LINK(r1) @@ -696,6 +720,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR) mtlrr9 ld r9,_CCR(r1) mtcrr9 + SANITIZE_RESTORE_NVGPRS() REST_GPRS(2, 13, r1) REST_GPR(0, r1) /* restore original r1. */ @@ -1372,7 +1397,7 @@ ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX) * do_break() may have changed the NV GPRS while handling a breakpoint. * If so, we need to restore them with their updated values. */ - REST_NVGPRS(r1) + HANDLER_RESTORE_NVGPRS() b interrupt_return_srr @@ -1598,7 +1623,7 @@ EXC_COMMON_BEGIN(alignment_common) GEN_COMMON alignment addir3,r1,STACK_FRAME_OVERHEAD bl alignment_exception - REST_NVGPRS(r1) /* instruction emulation may change GPRs */ +
[PATCH v6 23/25] powerpc/64: Add INTERRUPT_SANITIZE_REGISTERS Kconfig
Add Kconfig option for enabling clearing of registers on arrival in an interrupt handler. This reduces the speculation influence of registers on kernel internals. The option will be consumed by 64-bit systems that feature speculation and wish to implement this mitigation. This patch only introduces the Kconfig option, no actual mitigations. The primary overhead of this mitigation lies in an increased number of registers that must be saved and restored by interrupt handlers on Book3S systems. Enable by default on Book3E systems, which prior to this patch eagerly save and restore register state, meaning that the mitigation when implemented will have minimal overhead. Signed-off-by: Rohan McLure Acked-by: Nicholas Piggin --- V5: New patch --- arch/powerpc/Kconfig | 9 + 1 file changed, 9 insertions(+) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index ef6c83e79c9b..a643ebd83349 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -528,6 +528,15 @@ config HOTPLUG_CPU Say N if you are unsure. +config INTERRUPT_SANITIZE_REGISTERS + bool "Clear gprs on interrupt arrival" + depends on PPC64 && ARCH_HAS_SYSCALL_WRAPPER + default PPC_BOOK3E_64 + help + Reduce the influence of user register state on interrupt handlers and + syscalls through clearing user state from registers before handling + the exception. + config PPC_QUEUED_SPINLOCKS bool "Queued spinlocks" if EXPERT depends on SMP -- 2.34.1
[PATCH v6 22/25] powerpc/64s: Clear user GPRs in syscall interrupt entry
Clear user state in gprs (assign to zero) to reduce the influence of user registers on speculation within kernel syscall handlers. Clears occur at the very beginning of the sc and scv 0 interrupt handlers, with restores occurring following the execution of the syscall handler. Signed-off-by: Rohan McLure --- V2: Update summary V3: Remove erroneous summary paragraph on syscall_exit_prepare V4: Use ZEROIZE instead of NULLIFY. Clear r0 also. V5: Move to end of patch series. V6: Include clears which were previously in the syscall wrapper patch. Move comment on r3-r8 register save to when we alter the calling convention for system_call_exception. --- arch/powerpc/kernel/interrupt_64.S | 17 - 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/interrupt_64.S b/arch/powerpc/kernel/interrupt_64.S index a5dd78bdbe6d..40147558e1a6 100644 --- a/arch/powerpc/kernel/interrupt_64.S +++ b/arch/powerpc/kernel/interrupt_64.S @@ -106,6 +106,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) * but this is the best we can do. */ + /* +* Zero user registers to prevent influencing speculative execution +* state of kernel code. +*/ + ZEROIZE_GPR(0) + ZEROIZE_GPRS(5, 12) + ZEROIZE_NVGPRS() bl system_call_exception .Lsyscall_vectored_\name\()_exit: @@ -134,6 +141,7 @@ BEGIN_FTR_SECTION HMT_MEDIUM_LOW END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) + REST_NVGPRS(r1) cmpdi r3,0 bne .Lsyscall_vectored_\name\()_restore_regs @@ -285,6 +293,13 @@ END_BTB_FLUSH_SECTION wrteei 1 #endif + /* +* Zero user registers to prevent influencing speculative execution +* state of kernel code. +*/ + ZEROIZE_GPR(0) + ZEROIZE_GPRS(5, 12) + ZEROIZE_NVGPRS() bl system_call_exception .Lsyscall_exit: @@ -325,6 +340,7 @@ BEGIN_FTR_SECTION stdcx. 
r0,0,r1 /* to clear the reservation */ END_FTR_SECTION_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS) + REST_NVGPRS(r1) cmpdi r3,0 bne .Lsyscall_restore_regs /* Zero volatile regs that may contain sensitive kernel data */ @@ -352,7 +368,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) .Lsyscall_restore_regs: ld r3,_CTR(r1) ld r4,_XER(r1) - REST_NVGPRS(r1) mtctr r3 mtspr SPRN_XER,r4 REST_GPR(0, r1) -- 2.34.1
[PATCH v6 21/25] powerpc: Provide syscall wrapper
Implement syscall wrapper as per s390, x86, arm64. When enabled cause handlers to accept parameters from a stack frame rather than from user scratch register state. This allows for user registers to be safely cleared in order to reduce caller influence on speculation within syscall routine. The wrapper is a macro that emits syscall handler symbols that call into the target handler, obtaining its parameters from a struct pt_regs on the stack. As registers are already saved to the stack prior to calling system_call_exception, it appears that this function is executed more efficiently with the new stack-pointer convention than with parameters passed by registers, avoiding the allocation of a stack frame for this method. On a 32-bit system, we see >20% performance increases on the null_syscall microbenchmark, and on a Power 8 the performance gains amortise the cost of clearing and restoring registers which is implemented at the end of this series, seeing final result of ~5.6% performance improvement on null_syscall. Syscalls are wrapped in this fashion on all platforms except for the Cell processor as this commit does not provide SPU support. This can be quickly fixed in a successive patch, but requires spu_sys_callback to allocate a pt_regs structure to satisfy the wrapped calling convention. Co-developed-by: Andrew Donnellan Signed-off-by: Andrew Donnellan Signed-off-by: Rohan McLure --- V2: Generate prototypes for symbols produced by the wrapper. V3: Rebased to remove conflict with 1547db7d1f44 ("powerpc: Move system_call_exception() to syscall.c"). Also remove copy from gpr3 save slot on stackframe to orig_r3's slot. Fix whitespace with preprocessor defines in system_call_exception. V5: Move systbl.c syscall wrapper support to this patch. Swap calling convention for system_call_exception to be (, r0) V6: Change calling convention for system_call_exception in another patch. 
--- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/syscall.h | 4 + arch/powerpc/include/asm/syscall_wrapper.h | 84 arch/powerpc/include/asm/syscalls.h| 30 ++- arch/powerpc/kernel/syscall.c | 7 +- arch/powerpc/kernel/systbl.c | 7 ++ arch/powerpc/kernel/vdso.c | 2 + 7 files changed, 132 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 4c466acdc70d..ef6c83e79c9b 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -137,6 +137,7 @@ config PPC select ARCH_HAS_STRICT_KERNEL_RWX if (PPC_BOOK3S || PPC_8xx || 40x) && !HIBERNATION select ARCH_HAS_STRICT_KERNEL_RWX if FSL_BOOKE && !HIBERNATION && !RANDOMIZE_BASE select ARCH_HAS_STRICT_MODULE_RWX if ARCH_HAS_STRICT_KERNEL_RWX + select ARCH_HAS_SYSCALL_WRAPPER if !SPU_BASE select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST select ARCH_HAS_UACCESS_FLUSHCACHE select ARCH_HAS_UBSAN_SANITIZE_ALL diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h index d2a8dfd5de33..3dd36c5e334a 100644 --- a/arch/powerpc/include/asm/syscall.h +++ b/arch/powerpc/include/asm/syscall.h @@ -14,8 +14,12 @@ #include #include +#ifdef CONFIG_ARCH_HAS_SYSCALL_WRAPPER +typedef long (*syscall_fn)(const struct pt_regs *); +#else typedef long (*syscall_fn)(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); +#endif /* ftrace syscalls requires exporting the sys_call_table */ extern const syscall_fn sys_call_table[]; diff --git a/arch/powerpc/include/asm/syscall_wrapper.h b/arch/powerpc/include/asm/syscall_wrapper.h new file mode 100644 index ..91bcfa40f740 --- /dev/null +++ b/arch/powerpc/include/asm/syscall_wrapper.h @@ -0,0 +1,84 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * syscall_wrapper.h - powerpc specific wrappers to syscall definitions + * + * Based on arch/{x86,arm64}/include/asm/syscall_wrapper.h + */ + +#ifndef __ASM_SYSCALL_WRAPPER_H +#define __ASM_SYSCALL_WRAPPER_H + +struct pt_regs; + +#define 
SC_POWERPC_REGS_TO_ARGS(x, ...)\ + __MAP(x,__SC_ARGS \ + ,,regs->gpr[3],,regs->gpr[4],,regs->gpr[5]\ + ,,regs->gpr[6],,regs->gpr[7],,regs->gpr[8]) + +#ifdef CONFIG_COMPAT + +#define COMPAT_SYSCALL_DEFINEx(x, name, ...) \ + long __powerpc_compat_sys##name(const struct pt_regs *regs); \ + ALLOW_ERROR_INJECTION(__powerpc_compat_sys##name, ERRNO); \ + static long __se_compat_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \ + static inline long __do_compat_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)); \ + long __powerpc_compat_sys##name(const struct pt_regs *regs) \ +
[PATCH v6 20/25] powerpc: Change system_call_exception calling convention
Change system_call_exception arguments to pass a pointer to a stack
frame containing caller state, as well as the original r0, which
determines the number of the syscall. This has been observed to yield
improved performance compared to passing them by registers,
circumventing the need to allocate a stack frame.

Signed-off-by: Rohan McLure
---
V6: Split off from syscall wrapper patch.
---
 arch/powerpc/include/asm/interrupt.h |  3 +--
 arch/powerpc/kernel/entry_32.S       |  6 +++---
 arch/powerpc/kernel/interrupt_64.S   | 20 ++--
 arch/powerpc/kernel/syscall.c        | 10 +-
 4 files changed, 15 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
index 8069dbc4b8d1..48eec9cd1429 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -665,8 +665,7 @@ static inline void interrupt_cond_local_irq_enable(struct pt_regs *regs)
 	local_irq_enable();
 }
 
-long system_call_exception(long r3, long r4, long r5, long r6, long r7, long r8,
-			   unsigned long r0, struct pt_regs *regs);
+long system_call_exception(struct pt_regs *regs, unsigned long r0);
 notrace unsigned long syscall_exit_prepare(unsigned long r3, struct pt_regs *regs, long scv);
 notrace unsigned long interrupt_exit_user_prepare(struct pt_regs *regs);
 notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs);
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index e4b694cebc44..96782aa72083 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -122,9 +122,9 @@ transfer_to_syscall:
 	SAVE_NVGPRS(r1)
 	kuep_lock
 
-	/* Calling convention has r9 = orig r0, r10 = regs */
-	addi	r10,r1,STACK_FRAME_OVERHEAD
-	mr	r9,r0
+	/* Calling convention has r3 = regs, r4 = orig r0 */
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+	mr	r4,r0
 	bl	system_call_exception
 
 ret_from_syscall:
diff --git a/arch/powerpc/kernel/interrupt_64.S b/arch/powerpc/kernel/interrupt_64.S
index 7d92a7a54727..a5dd78bdbe6d 100644
---
a/arch/powerpc/kernel/interrupt_64.S +++ b/arch/powerpc/kernel/interrupt_64.S @@ -70,7 +70,7 @@ _ASM_NOKPROBE_SYMBOL(system_call_vectored_\name) ld r2,PACATOC(r13) mfcrr12 li r11,0 - /* Can we avoid saving r3-r8 in common case? */ + /* Save syscall parameters in r3-r8 */ SAVE_GPRS(3, 8, r1) /* Zero r9-r12, this should only be required when restoring all GPRs */ std r11,GPR9(r1) @@ -87,9 +87,11 @@ _ASM_NOKPROBE_SYMBOL(system_call_vectored_\name) std r11,_TRAP(r1) std r12,_CCR(r1) std r3,ORIG_GPR3(r1) - addir10,r1,STACK_FRAME_OVERHEAD + /* Calling convention has r3 = regs, r4 = orig r0 */ + addir3,r1,STACK_FRAME_OVERHEAD + mr r4,r0 ld r11,exception_marker@toc(r2) - std r11,-16(r10)/* "regshere" marker */ + std r11,-16(r3) /* "regshere" marker */ BEGIN_FTR_SECTION HMT_MEDIUM @@ -104,8 +106,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR) * but this is the best we can do. */ - /* Calling convention has r9 = orig r0, r10 = regs */ - mr r9,r0 bl system_call_exception .Lsyscall_vectored_\name\()_exit: @@ -237,7 +237,7 @@ END_BTB_FLUSH_SECTION ld r2,PACATOC(r13) mfcrr12 li r11,0 - /* Can we avoid saving r3-r8 in common case? 
*/ + /* Save syscall parameters in r3-r8 */ SAVE_GPRS(3, 8, r1) /* Zero r9-r12, this should only be required when restoring all GPRs */ std r11,GPR9(r1) @@ -260,9 +260,11 @@ END_BTB_FLUSH_SECTION std r11,_TRAP(r1) std r12,_CCR(r1) std r3,ORIG_GPR3(r1) - addir10,r1,STACK_FRAME_OVERHEAD + /* Calling convention has r3 = regs, r4 = orig r0 */ + addir3,r1,STACK_FRAME_OVERHEAD + mr r4,r0 ld r11,exception_marker@toc(r2) - std r11,-16(r10)/* "regshere" marker */ + std r11,-16(r3) /* "regshere" marker */ #ifdef CONFIG_PPC_BOOK3S li r11,1 @@ -283,8 +285,6 @@ END_BTB_FLUSH_SECTION wrteei 1 #endif - /* Calling convention has r9 = orig r0, r10 = regs */ - mr r9,r0 bl system_call_exception .Lsyscall_exit: diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c index 15af0ed019a7..0e9ba3efee94 100644 --- a/arch/powerpc/kernel/syscall.c +++ b/arch/powerpc/kernel/syscall.c @@ -13,9 +13,7 @@ /* Has to run notrace because it is entered not completely "reconciled" */ -notrace long system_call_exception(long r3, long r4, long r5, - long r6, long r7, long r8, -
[PATCH v6 19/25] powerpc: Remove high-order word clearing on compat syscall entry
Remove explicit clearing of the high-order word of user parameters when
handling compatibility syscalls in system_call_exception. The
COMPAT_SYSCALL_DEFINEx macros handle this clearing through an explicit
cast to the signature type of the target handler.

Signed-off-by: Rohan McLure
Reported-by: Nicholas Piggin
---
V6: New patch
---
 arch/powerpc/kernel/syscall.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c
index 9875486f6168..15af0ed019a7 100644
--- a/arch/powerpc/kernel/syscall.c
+++ b/arch/powerpc/kernel/syscall.c
@@ -157,14 +157,6 @@ notrace long system_call_exception(long r3, long r4, long r5,
 
 	if (unlikely(is_compat_task())) {
 		f = (void *)compat_sys_call_table[r0];
-
-		r3 &= 0xffffffffULL;
-		r4 &= 0xffffffffULL;
-		r5 &= 0xffffffffULL;
-		r6 &= 0xffffffffULL;
-		r7 &= 0xffffffffULL;
-		r8 &= 0xffffffffULL;
-
 	} else {
 		f = (void *)sys_call_table[r0];
 	}
-- 
2.34.1
[PATCH v6 18/25] powerpc: Use common syscall handler type
Cause syscall handlers to be typed as follows when called indirectly
throughout the kernel. This is to allow for better type checking.

typedef long (*syscall_fn)(unsigned long, unsigned long, unsigned long,
			   unsigned long, unsigned long, unsigned long);

Since both 32- and 64-bit ABIs allow for at least the first six
machine-word length parameters to a function to be passed by registers,
even handlers which admit fewer than six parameters may be viewed as
having the above type.

Coercing syscalls to syscall_fn requires a cast to void* to avoid
-Wcast-function-type.

Fix up comparisons in VDSO to avoid pointer-integer comparison.
Introduce explicit cast on systems with SPUs.

Signed-off-by: Rohan McLure
Reviewed-by: Nicholas Piggin
---
V2: New patch.
V3: Remove unnecessary cast from const syscall_fn to syscall_fn
V5: Update patch description.
V6: Remove syscall_fn typedef in syscall.c. Comment void* cast.
---
 arch/powerpc/include/asm/syscall.h          |  7 +--
 arch/powerpc/include/asm/syscalls.h         |  1 +
 arch/powerpc/kernel/syscall.c               |  2 --
 arch/powerpc/kernel/systbl.c                | 11 ---
 arch/powerpc/kernel/vdso.c                  |  4 ++--
 arch/powerpc/platforms/cell/spu_callbacks.c |  6 +++---
 6 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h
index 25fc8ad9a27a..d2a8dfd5de33 100644
--- a/arch/powerpc/include/asm/syscall.h
+++ b/arch/powerpc/include/asm/syscall.h
@@ -14,9 +14,12 @@
 #include
 #include

+typedef long (*syscall_fn)(unsigned long, unsigned long, unsigned long,
+			   unsigned long, unsigned long, unsigned long);
+
 /* ftrace syscalls requires exporting the sys_call_table */
-extern const unsigned long sys_call_table[];
-extern const unsigned long compat_sys_call_table[];
+extern const syscall_fn sys_call_table[];
+extern const syscall_fn compat_sys_call_table[];

 static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
 {
diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h
index 5d106acf7906..cc87168d6ecb 100644
--- a/arch/powerpc/include/asm/syscalls.h
+++ b/arch/powerpc/include/asm/syscalls.h
@@ -8,6 +8,7 @@
 #include
 #include
+#include
 #ifdef CONFIG_PPC64
 #include
 #endif
diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c
index 64102a64fd84..9875486f6168 100644
--- a/arch/powerpc/kernel/syscall.c
+++ b/arch/powerpc/kernel/syscall.c
@@ -12,8 +12,6 @@
 #include

-typedef long (*syscall_fn)(long, long, long, long, long, long);
-
 /* Has to run notrace because it is entered not completely "reconciled" */
 notrace long system_call_exception(long r3, long r4, long r5,
diff --git a/arch/powerpc/kernel/systbl.c b/arch/powerpc/kernel/systbl.c
index ce52bd2ec292..d64dfafc2e5e 100644
--- a/arch/powerpc/kernel/systbl.c
+++ b/arch/powerpc/kernel/systbl.c
@@ -16,9 +16,14 @@
 #include

 #define __SYSCALL_WITH_COMPAT(nr, entry, compat) __SYSCALL(nr, entry)
-#define __SYSCALL(nr, entry) [nr] = (unsigned long) &entry,
-const unsigned long sys_call_table[] = {
+/*
+ * Coerce syscall handlers with arbitrary parameters to common type
+ * requires cast to void* to avoid -Wcast-function-type.
+ */
+#define __SYSCALL(nr, entry) [nr] = (void *) entry,
+
+const syscall_fn sys_call_table[] = {
 #ifdef CONFIG_PPC64
 #include
 #else
@@ -29,7 +34,7 @@ const unsigned long sys_call_table[] = {
 #ifdef CONFIG_COMPAT
 #undef __SYSCALL_WITH_COMPAT
 #define __SYSCALL_WITH_COMPAT(nr, native, compat) __SYSCALL(nr, compat)
-const unsigned long compat_sys_call_table[] = {
+const syscall_fn compat_sys_call_table[] = {
 #include
 };
 #endif /* CONFIG_COMPAT */
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index bf9574ec26ce..fcca06d200d3 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -304,10 +304,10 @@ static void __init vdso_setup_syscall_map(void)
 	unsigned int i;

 	for (i = 0; i < NR_syscalls; i++) {
-		if (sys_call_table[i] != (unsigned long)&sys_ni_syscall)
+		if (sys_call_table[i] != (void *)&sys_ni_syscall)
 			vdso_data->syscall_map[i >> 5] |= 0x80000000UL >> (i & 0x1f);
 		if (IS_ENABLED(CONFIG_COMPAT) &&
-		    compat_sys_call_table[i] != (unsigned long)&sys_ni_syscall)
+		    compat_sys_call_table[i] != (void *)&sys_ni_syscall)
 			vdso_data->compat_syscall_map[i >> 5] |= 0x80000000UL >> (i & 0x1f);
 	}
 }
diff --git a/arch/powerpc/platforms/cell/spu_callbacks.c b/arch/powerpc/platforms/cell/spu_callbacks.c
index fe0d8797a00a..e780c14c5733 100644
--- a/arch/powerpc/platforms/cell/spu_callbacks.c
+++ b/arch/powerpc/platforms/cell/spu_callbacks.c
@@ -34,15 +34,15 @@
 * mbind, mq_open, ipc, ...
 */

-static
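To illustrate why handlers declaring fewer than six parameters can live in a table of six-argument `syscall_fn` pointers, here is a hedged user-space sketch. It relies on the same calling-convention property the commit message describes (extra argument registers are simply never read by the callee) and mirrors the `(void *)` hop used to silence -Wcast-function-type. Strictly speaking this is ABI-dependent behaviour, not portable C, and all `demo_*` names are illustrative.

```c
#include <assert.h>

/* The common handler type introduced by the patch. */
typedef long (*syscall_fn)(unsigned long, unsigned long, unsigned long,
			   unsigned long, unsigned long, unsigned long);

/* A two-parameter "handler"; the four extra argument slots passed
 * through syscall_fn are ignored. */
static long demo_sys_add2(unsigned long a, unsigned long b)
{
	return (long)(a + b);
}

/* Table entry coerced via (void *), as in the rewritten systbl.c. */
static const syscall_fn demo_call_table[] = {
	[0] = (void *)demo_sys_add2,
};

long demo_dispatch(unsigned long nr, unsigned long a, unsigned long b)
{
	/* Zeroes fill the unused argument registers. */
	return demo_call_table[nr](a, b, 0, 0, 0, 0);
}
```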
[PATCH v6 17/25] powerpc: Enable compile-time check for syscall handlers
The table of syscall handlers and registered compatibility syscall
handlers has in the past been produced using assembly, with function
references resolved at link time. This moves link-time errors to
compile-time, by rewriting systbl.S in C, and including the
linux/syscalls.h, linux/compat.h and asm/syscalls.h headers for
prototypes.

Reported-by: Arnd Bergmann
Signed-off-by: Rohan McLure
Reviewed-by: Nicholas Piggin
---
V2: New patch.
V5: For this patch only, represent handler function pointers as
    unsigned long. Remove reference to syscall wrappers. Use
    asm/syscalls.h which implies asm/syscall.h
---
 arch/powerpc/kernel/{systbl.S => systbl.c} | 28
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kernel/systbl.S b/arch/powerpc/kernel/systbl.c
similarity index 61%
rename from arch/powerpc/kernel/systbl.S
rename to arch/powerpc/kernel/systbl.c
index 6c1db3b6de2d..ce52bd2ec292 100644
--- a/arch/powerpc/kernel/systbl.S
+++ b/arch/powerpc/kernel/systbl.c
@@ -10,32 +10,26 @@
 * PPC64 updates by Dave Engebretsen (engeb...@us.ibm.com)
 */

-#include
+#include
+#include
+#include
+#include

-.section .rodata,"a"
+#define __SYSCALL_WITH_COMPAT(nr, entry, compat) __SYSCALL(nr, entry)
+#define __SYSCALL(nr, entry) [nr] = (unsigned long) &entry,

-#ifdef CONFIG_PPC64
-	.p2align	3
-#define __SYSCALL(nr, entry)	.8byte entry
-#else
-	.p2align	2
-#define __SYSCALL(nr, entry)	.long entry
-#endif
-
-#define __SYSCALL_WITH_COMPAT(nr, native, compat)	__SYSCALL(nr, native)
-.globl sys_call_table
-sys_call_table:
+const unsigned long sys_call_table[] = {
 #ifdef CONFIG_PPC64
 #include
 #else
 #include
 #endif
+};

 #ifdef CONFIG_COMPAT
 #undef __SYSCALL_WITH_COMPAT
 #define __SYSCALL_WITH_COMPAT(nr, native, compat)	__SYSCALL(nr, compat)
-.globl compat_sys_call_table
-compat_sys_call_table:
-#define compat_sys_sigsuspend	sys_sigsuspend
+const unsigned long compat_sys_call_table[] = {
 #include
-#endif
+};
+#endif /* CONFIG_COMPAT */
-- 
2.34.1
[PATCH v6 16/25] powerpc: Include all arch-specific syscall prototypes
Forward declare all syscall handler prototypes where a generic prototype is not provided in either linux/syscalls.h or linux/compat.h in asm/syscalls.h. This is required for compile-time type-checking for syscall handlers, which is implemented later in this series. 32-bit compatibility syscall handlers are expressed in terms of types in ppc32.h. Expose this header globally. Signed-off-by: Rohan McLure Acked-by: Nicholas Piggin --- V2: Explicitly include prototypes. V3: Remove extraneous #include and ppc_fallocate prototype. Rename header. V5: Clean. Elaborate comment on long long munging. Remove prototype hiding conditional on SYSCALL_WRAPPER. --- arch/powerpc/include/asm/syscalls.h | 97 ++ .../ppc32.h => include/asm/syscalls_32.h}| 0 arch/powerpc/kernel/signal_32.c | 2 +- arch/powerpc/perf/callchain_32.c | 2 +- 4 files changed, 77 insertions(+), 24 deletions(-) diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h index 525d2aa0c8ca..5d106acf7906 100644 --- a/arch/powerpc/include/asm/syscalls.h +++ b/arch/powerpc/include/asm/syscalls.h @@ -8,6 +8,14 @@ #include #include +#ifdef CONFIG_PPC64 +#include +#endif +#include +#include + +struct rtas_args; + /* * long long munging: * The 32 bit ABI passes long longs in an odd even register pair. 
@@ -20,44 +28,89 @@ #define merge_64(high, low) ((u64)high << 32) | low #endif -struct rtas_args; +long sys_ni_syscall(void); + +/* + * PowerPC architecture-specific syscalls + */ + +long sys_rtas(struct rtas_args __user *uargs); + +#ifdef CONFIG_PPC64 +long sys_ppc64_personality(unsigned long personality); +#ifdef CONFIG_COMPAT +long compat_sys_ppc64_personality(unsigned long personality); +#endif /* CONFIG_COMPAT */ +#endif /* CONFIG_PPC64 */ +long sys_swapcontext(struct ucontext __user *old_ctx, +struct ucontext __user *new_ctx, long ctx_size); long sys_mmap(unsigned long addr, size_t len, unsigned long prot, unsigned long flags, unsigned long fd, off_t offset); long sys_mmap2(unsigned long addr, size_t len, unsigned long prot, unsigned long flags, unsigned long fd, unsigned long pgoff); -long sys_ppc64_personality(unsigned long personality); -long sys_rtas(struct rtas_args __user *uargs); -long sys_ppc_fadvise64_64(int fd, int advice, u32 offset_high, u32 offset_low, - u32 len_high, u32 len_low); +long sys_switch_endian(void); -#ifdef CONFIG_COMPAT -unsigned long compat_sys_mmap2(unsigned long addr, size_t len, - unsigned long prot, unsigned long flags, - unsigned long fd, unsigned long pgoff); - -compat_ssize_t compat_sys_pread64(unsigned int fd, char __user *ubuf, compat_size_t count, - u32 reg6, u32 pos1, u32 pos2); +#ifdef CONFIG_PPC32 +long sys_sigreturn(void); +long sys_debug_setcontext(struct ucontext __user *ctx, int ndbg, + struct sig_dbg_op __user *dbg); +#endif -compat_ssize_t compat_sys_pwrite64(unsigned int fd, const char __user *ubuf, compat_size_t count, - u32 reg6, u32 pos1, u32 pos2); +long sys_rt_sigreturn(void); -compat_ssize_t compat_sys_readahead(int fd, u32 r4, u32 offset1, u32 offset2, u32 count); +long sys_subpage_prot(unsigned long addr, + unsigned long len, u32 __user *map); -int compat_sys_truncate64(const char __user *path, u32 reg4, - unsigned long len1, unsigned long len2); +#ifdef CONFIG_COMPAT +long compat_sys_swapcontext(struct 
ucontext32 __user *old_ctx, + struct ucontext32 __user *new_ctx, + int ctx_size); +long compat_sys_old_getrlimit(unsigned int resource, + struct compat_rlimit __user *rlim); +long compat_sys_sigreturn(void); +long compat_sys_rt_sigreturn(void); +#endif /* CONFIG_COMPAT */ -int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long len1, - unsigned long len2); +/* + * Architecture specific signatures required by long long munging: + * The 32 bit ABI passes long longs in an odd even register pair. + * The following signatures provide a machine long parameter for + * each register that will be supplied. The implementation is + * responsible for combining parameter pairs. + */ +#ifdef CONFIG_COMPAT +long compat_sys_mmap2(unsigned long addr, size_t len, + unsigned long prot, unsigned long flags, + unsigned long fd, unsigned long pgoff); +long compat_sys_ppc_pread64(unsigned int fd, + char __user *ubuf, compat_size_t count, + u32 reg6, u32 pos1, u32 pos2); +long compat_sys_ppc_pwrite64(unsigned int fd, +const char __user *ubuf, compat_size_t count, +
[PATCH v6 15/25] powerpc: Adopt SYSCALL_DEFINE for arch-specific syscall handlers
Arch-specific implementations of syscall handlers are currently used
over generic implementations for the following reasons:

1. Semantics unique to powerpc
2. Compatibility syscalls require 'argument padding' to comply with
   64-bit argument convention in ELF32 abi.
3. Parameter types or order is different in other architectures.

These syscall handlers have been defined prior to this patch series
without invoking the SYSCALL_DEFINE or COMPAT_SYSCALL_DEFINE macros with
custom input and output types. We remove every such direct definition in
favour of the aforementioned macros.

Also update syscalls.tbl in order to refer to the symbol names generated
by each of these macros.

Since ppc64_personality can be called by both 64-bit and 32-bit binaries
through compatibility, we must generate both compat_sys_ and sys_
symbols for this handler.

As an aside: a number of architectures including arm and powerpc agree
on an alternative argument order and numbering for most of these
arch-specific handlers. A future patch series may allow for asm/unistd.h
to signal through its defines that a generic implementation of these
syscall handlers with the correct calling convention be emitted, through
the __ARCH_WANT_COMPAT_SYS_... convention.

Signed-off-by: Rohan McLure
Reviewed-by: Nicholas Piggin
---
V2: All syscall handlers wrapped by this macro.
V3: Move creation of do_ppc64_personality helper to prior patch.
V4: Fix parenthesis alignment. Don't emit sys_*** symbols.
V5: Use 'aside' in the asm-generic rant in commit message.
--- arch/powerpc/include/asm/syscalls.h | 10 ++--- arch/powerpc/kernel/sys_ppc32.c | 38 +++--- arch/powerpc/kernel/syscalls.c | 17 ++-- arch/powerpc/kernel/syscalls/syscall.tbl | 22 +- .../arch/powerpc/entry/syscalls/syscall.tbl | 22 +- 5 files changed, 64 insertions(+), 45 deletions(-) diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h index 20cbd29b1228..525d2aa0c8ca 100644 --- a/arch/powerpc/include/asm/syscalls.h +++ b/arch/powerpc/include/asm/syscalls.h @@ -28,10 +28,10 @@ long sys_mmap(unsigned long addr, size_t len, long sys_mmap2(unsigned long addr, size_t len, unsigned long prot, unsigned long flags, unsigned long fd, unsigned long pgoff); -long ppc64_personality(unsigned long personality); +long sys_ppc64_personality(unsigned long personality); long sys_rtas(struct rtas_args __user *uargs); -long ppc_fadvise64_64(int fd, int advice, u32 offset_high, u32 offset_low, - u32 len_high, u32 len_low); +long sys_ppc_fadvise64_64(int fd, int advice, u32 offset_high, u32 offset_low, + u32 len_high, u32 len_low); #ifdef CONFIG_COMPAT unsigned long compat_sys_mmap2(unsigned long addr, size_t len, @@ -52,8 +52,8 @@ int compat_sys_truncate64(const char __user *path, u32 reg4, int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long len1, unsigned long len2); -long ppc32_fadvise64(int fd, u32 unused, u32 offset1, u32 offset2, -size_t len, int advice); +long compat_sys_ppc32_fadvise64(int fd, u32 unused, u32 offset1, u32 offset2, + size_t len, int advice); long compat_sys_sync_file_range2(int fd, unsigned int flags, unsigned int offset1, unsigned int offset2, diff --git a/arch/powerpc/kernel/sys_ppc32.c b/arch/powerpc/kernel/sys_ppc32.c index 776ae7565fc5..dcc3c9fd4cfd 100644 --- a/arch/powerpc/kernel/sys_ppc32.c +++ b/arch/powerpc/kernel/sys_ppc32.c @@ -47,45 +47,55 @@ #include #include -compat_ssize_t compat_sys_pread64(unsigned int fd, char __user *ubuf, compat_size_t count, -u32 reg6, u32 pos1, u32 pos2) 
+COMPAT_SYSCALL_DEFINE6(ppc_pread64, + unsigned int, fd, + char __user *, ubuf, compat_size_t, count, + u32, reg6, u32, pos1, u32, pos2) { return ksys_pread64(fd, ubuf, count, merge_64(pos1, pos2)); } -compat_ssize_t compat_sys_pwrite64(unsigned int fd, const char __user *ubuf, compat_size_t count, - u32 reg6, u32 pos1, u32 pos2) +COMPAT_SYSCALL_DEFINE6(ppc_pwrite64, + unsigned int, fd, + const char __user *, ubuf, compat_size_t, count, + u32, reg6, u32, pos1, u32, pos2) { return ksys_pwrite64(fd, ubuf, count, merge_64(pos1, pos2)); } -compat_ssize_t compat_sys_readahead(int fd, u32 r4, u32 offset1, u32 offset2, u32 count) +COMPAT_SYSCALL_DEFINE5(ppc_readahead, + int, fd, u32, r4, + u32, offset1, u32, offset2, u32, count) { return ksys_readahead(fd, merge_64(offset1, offset2), count); } -int compat_sys_truncate64(const char __user * path, u32 reg4, - unsigned long len1, unsigned
[PATCH v6 14/25] powerpc: Provide do_ppc64_personality helper
Avoid duplication in a future patch that will define the
ppc64_personality syscall handler in terms of the SYSCALL_DEFINE and
COMPAT_SYSCALL_DEFINE macros, by extracting the common body of
ppc64_personality into a helper function.

Signed-off-by: Rohan McLure
Reviewed-by: Nicholas Piggin
---
V3: New commit.
V5: Remove 'inline'.
---
 arch/powerpc/kernel/syscalls.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
index 9830957498b0..135a0b9108d5 100644
--- a/arch/powerpc/kernel/syscalls.c
+++ b/arch/powerpc/kernel/syscalls.c
@@ -75,7 +75,7 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, size_t, len,
 }

 #ifdef CONFIG_PPC64
-long ppc64_personality(unsigned long personality)
+static long do_ppc64_personality(unsigned long personality)
 {
 	long ret;

@@ -87,6 +87,10 @@ long ppc64_personality(unsigned long personality)
 	ret = (ret & ~PER_MASK) | PER_LINUX;
 	return ret;
 }
+long ppc64_personality(unsigned long personality)
+{
+	return do_ppc64_personality(personality);
+}
 #endif

 long ppc_fadvise64_64(int fd, int advice, u32 offset_high, u32 offset_low,
-- 
2.34.1
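The personality translation that this helper centralises can be sketched outside the kernel. The constants below mirror include/uapi/linux/personality.h; the `demo_*` function names are illustrative, not kernel symbols.

```c
#include <assert.h>

/* Constants mirroring include/uapi/linux/personality.h. */
#define DEMO_PER_MASK    0x00ffUL
#define DEMO_PER_LINUX   0x0000UL
#define DEMO_PER_LINUX32 0x0008UL

/* Inbound munge: a PER_LINUX32 process asking for PER_LINUX is steered
 * back to PER_LINUX32, as in the helper's first conditional. */
unsigned long demo_munge_in(unsigned long current_pers, unsigned long requested)
{
	if ((current_pers & DEMO_PER_MASK) == DEMO_PER_LINUX32 &&
	    (requested & DEMO_PER_MASK) == DEMO_PER_LINUX)
		requested = (requested & ~DEMO_PER_MASK) | DEMO_PER_LINUX32;
	return requested;
}

/* Outbound munge: PER_LINUX32 results are reported as PER_LINUX. */
unsigned long demo_munge_out(unsigned long ret)
{
	if ((ret & DEMO_PER_MASK) == DEMO_PER_LINUX32)
		ret = (ret & ~DEMO_PER_MASK) | DEMO_PER_LINUX;
	return ret;
}
```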
[PATCH v6 13/25] powerpc: Remove direct call to mmap2 syscall handlers
Syscall handlers should not be invoked internally by their symbol names,
as these symbols are defined by the architecture-defined SYSCALL_DEFINE
macro. Move the compatibility syscall definition for mmap2 to
syscalls.c, so that all mmap implementations can share a helper
function.

Remove 'inline' on static mmap helper.

Signed-off-by: Rohan McLure
Reviewed-by: Nicholas Piggin
---
V2: Move mmap2 compat implementation to asm/kernel/syscalls.c.
V4: Move to be applied before syscall wrapper introduced.
V5: Remove 'inline' in helper.
---
 arch/powerpc/kernel/sys_ppc32.c |  9 -
 arch/powerpc/kernel/syscalls.c  | 17 ++---
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/sys_ppc32.c b/arch/powerpc/kernel/sys_ppc32.c
index d961634976d8..776ae7565fc5 100644
--- a/arch/powerpc/kernel/sys_ppc32.c
+++ b/arch/powerpc/kernel/sys_ppc32.c
@@ -25,7 +25,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -48,14 +47,6 @@
 #include
 #include

-unsigned long compat_sys_mmap2(unsigned long addr, size_t len,
-			unsigned long prot, unsigned long flags,
-			unsigned long fd, unsigned long pgoff)
-{
-	/* This should remain 12 even if PAGE_SIZE changes */
-	return sys_mmap(addr, len, prot, flags, fd, pgoff << 12);
-}
-
 compat_ssize_t compat_sys_pread64(unsigned int fd, char __user *ubuf, compat_size_t count,
 				  u32 reg6, u32 pos1, u32 pos2)
 {
diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
index a04c97faa21a..9830957498b0 100644
--- a/arch/powerpc/kernel/syscalls.c
+++ b/arch/powerpc/kernel/syscalls.c
@@ -36,9 +36,9 @@
 #include
 #include

-static inline long do_mmap2(unsigned long addr, size_t len,
-			unsigned long prot, unsigned long flags,
-			unsigned long fd, unsigned long off, int shift)
+static long do_mmap2(unsigned long addr, size_t len,
+		     unsigned long prot, unsigned long flags,
+		     unsigned long fd, unsigned long off, int shift)
 {
 	if (!arch_validate_prot(prot, addr))
 		return -EINVAL;
@@ -56,6 +56,17 @@ SYSCALL_DEFINE6(mmap2,
 		unsigned long, addr, size_t, len,
 	return do_mmap2(addr, len, prot, flags, fd, pgoff, PAGE_SHIFT-12);
 }

+#ifdef CONFIG_COMPAT
+COMPAT_SYSCALL_DEFINE6(mmap2,
+		       unsigned long, addr, size_t, len,
+		       unsigned long, prot, unsigned long, flags,
+		       unsigned long, fd, unsigned long, pgoff)
+{
+	/* This should remain 12 even if PAGE_SIZE changes */
+	return do_mmap2(addr, len, prot, flags, fd, pgoff << 12, PAGE_SHIFT-12);
+}
+#endif
+
 SYSCALL_DEFINE6(mmap, unsigned long, addr, size_t, len,
 		unsigned long, prot, unsigned long, flags,
 		unsigned long, fd, off_t, offset)
-- 
2.34.1
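The unit conversion that lets both mmap entry points share one helper can be sketched as follows. This is an illustrative model of the `do_mmap2`-style shift, under the assumption of a 64 KiB page (PAGE_SHIFT = 16); `demo_pgoff` is a hypothetical name, and `-1` stands in for `-EINVAL`.

```c
#include <assert.h>

/* do_mmap2-style normalisation: "off" arrives either in bytes
 * (mmap, shift = PAGE_SHIFT) or in 4 KiB units (mmap2,
 * shift = PAGE_SHIFT - 12, "12" because mmap2 units stay 4 KiB even if
 * PAGE_SIZE changes). Either way, "off >> shift" yields a
 * PAGE_SIZE-granular page offset; misaligned offsets are rejected. */
long demo_pgoff(unsigned long off, int shift)
{
	if (off & ((1UL << shift) - 1))
		return -1;	/* stand-in for -EINVAL */
	return (long)(off >> shift);
}
```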
[PATCH v6 12/25] powerpc: Remove direct call to personality syscall handler
Syscall handlers should not be invoked internally by their symbol names,
as these symbols are defined by the architecture-defined SYSCALL_DEFINE
macro. Fortunately, in the case of ppc64_personality, its call to
sys_personality can be replaced with an invocation to the equivalent
ksys_personality inline helper in .

Signed-off-by: Rohan McLure
Reviewed-by: Nicholas Piggin
---
V2: Use inline helper to deduplicate bodies in compat/regular
    implementations.
V4: Move to be applied before syscall wrapper.
---
 arch/powerpc/kernel/syscalls.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
index 34e1ae88e15b..a04c97faa21a 100644
--- a/arch/powerpc/kernel/syscalls.c
+++ b/arch/powerpc/kernel/syscalls.c
@@ -71,7 +71,7 @@ long ppc64_personality(unsigned long personality)
 	if (personality(current->personality) == PER_LINUX32 &&
 	    personality(personality) == PER_LINUX)
 		personality = (personality & ~PER_MASK) | PER_LINUX32;
-	ret = sys_personality(personality);
+	ret = ksys_personality(personality);
 	if (personality(ret) == PER_LINUX32)
 		ret = (ret & ~PER_MASK) | PER_LINUX;
 	return ret;
-- 
2.34.1
[PATCH v6 11/25] powerpc/32: Remove powerpc select specialisation
Syscall #82 has been implemented for 32-bit platforms in a unique way on
powerpc systems. This hack will in effect guess whether the caller is
expecting new select semantics or old select semantics. It does so via a
guess, based off the first parameter. In new select, this parameter
represents the length of a user-memory array of file descriptors, and in
old select this is a pointer to an arguments structure. The heuristic
simply interprets sufficiently large values of its first parameter as
being a call to old select.

The following is a discussion on how this syscall should be handled.

Link: https://lore.kernel.org/lkml/13737de5-0eb7-e881-9af0-163b0d29a...@csgroup.eu/

As discussed in this thread, the existence of such a hack suggests that
whatever powerpc binaries may predate glibc most likely would have made
use of the old select semantics. x86 and arm64 both implement this
syscall with old select semantics.

Remove the powerpc implementation, and update syscall.tbl to emit
references to sys_old_select and compat_sys_old_select for 32-bit
binaries, in keeping with how other architectures support syscall #82.

Signed-off-by: Rohan McLure
Reviewed-by: Nicholas Piggin
---
V2: Remove arch-specific select handler
V3: Remove ppc_old_select prototype in . Move to earlier in patch series
V5: Use compat_sys_old_select on 64-bit systems.
--- arch/powerpc/include/asm/syscalls.h | 2 -- arch/powerpc/kernel/syscalls.c| 17 - arch/powerpc/kernel/syscalls/syscall.tbl | 2 +- .../arch/powerpc/entry/syscalls/syscall.tbl | 2 +- 4 files changed, 2 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h index 960b3871db72..20cbd29b1228 100644 --- a/arch/powerpc/include/asm/syscalls.h +++ b/arch/powerpc/include/asm/syscalls.h @@ -30,8 +30,6 @@ long sys_mmap2(unsigned long addr, size_t len, unsigned long fd, unsigned long pgoff); long ppc64_personality(unsigned long personality); long sys_rtas(struct rtas_args __user *uargs); -int ppc_select(int n, fd_set __user *inp, fd_set __user *outp, - fd_set __user *exp, struct __kernel_old_timeval __user *tvp); long ppc_fadvise64_64(int fd, int advice, u32 offset_high, u32 offset_low, u32 len_high, u32 len_low); diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c index abc3fbb3c490..34e1ae88e15b 100644 --- a/arch/powerpc/kernel/syscalls.c +++ b/arch/powerpc/kernel/syscalls.c @@ -63,23 +63,6 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, size_t, len, return do_mmap2(addr, len, prot, flags, fd, offset, PAGE_SHIFT); } -#ifdef CONFIG_PPC32 -/* - * Due to some executables calling the wrong select we sometimes - * get wrong args. This determines how the args are being passed - * (a single ptr to them all args passed) then calls - * sys_select() with the appropriate args. 
-- Cort - */ -int -ppc_select(int n, fd_set __user *inp, fd_set __user *outp, fd_set __user *exp, struct __kernel_old_timeval __user *tvp) -{ - if ((unsigned long)n >= 4096) - return sys_old_select((void __user *)n); - - return sys_select(n, inp, outp, exp, tvp); -} -#endif - #ifdef CONFIG_PPC64 long ppc64_personality(unsigned long personality) { diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 2600b4237292..64f27cbbdd2c 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -110,7 +110,7 @@ 79 common settimeofdaysys_settimeofday compat_sys_settimeofday 80 common getgroups sys_getgroups 81 common setgroups sys_setgroups -82 32 select ppc_select sys_ni_syscall +82 32 select sys_old_select compat_sys_old_select 82 64 select sys_ni_syscall 82 spu select sys_ni_syscall 83 common symlink sys_symlink diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl index 2600b4237292..64f27cbbdd2c 100644 --- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl +++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl @@ -110,7 +110,7 @@ 79 common settimeofdaysys_settimeofday compat_sys_settimeofday 80 common getgroups sys_getgroups 81 common setgroups sys_setgroups -82 32 select ppc_select sys_ni_syscall +82 32 select sys_old_select compat_sys_old_select 82 64 select
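The removed heuristic is small enough to model directly. This sketch restates the guess the commit message describes; `demo_is_old_select` is an illustrative name, and the reasoning assumes (as the original comment implies) that user pointers are always at least 4096 because page zero is unmapped.

```c
#include <assert.h>

/* The removed ppc_select heuristic in miniature: old select passes a
 * user pointer as the first argument, which is >= 4096 on these
 * systems, while a plausible fd-set size ("n" in new select) is far
 * smaller. */
int demo_is_old_select(unsigned long first_arg)
{
	return first_arg >= 4096;
}
```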
[PATCH v6 10/25] powerpc: Use generic fallocate compatibility syscall
The powerpc fallocate compat syscall handler is identical to the generic implementation provided by commit 59c10c52f573f ("riscv: compat: syscall: Add compat_sys_call_table implementation"), and as such can be removed in favour of the generic implementation. A future patch series will replace more architecture-defined syscall handlers with generic implementations, dependent on introducing generic implementations that are compatible with powerpc and arm's parameter reorderings. Reported-by: Arnd Bergmann Signed-off-by: Rohan McLure Reviewed-by: Arnd Bergmann --- V2: Remove arch-specific fallocate handler. V3: Remove generic fallocate prototype. Move to beginning of. V5: Remove implementation as well which I somehow failed to do. Replace local BE compat_arg_u64 with generic. --- arch/powerpc/include/asm/syscalls.h | 2 -- arch/powerpc/include/asm/unistd.h | 1 + arch/powerpc/kernel/sys_ppc32.c | 7 --- 3 files changed, 1 insertion(+), 9 deletions(-) diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h index 16b668515d15..960b3871db72 100644 --- a/arch/powerpc/include/asm/syscalls.h +++ b/arch/powerpc/include/asm/syscalls.h @@ -51,8 +51,6 @@ compat_ssize_t compat_sys_readahead(int fd, u32 r4, u32 offset1, u32 offset2, u3 int compat_sys_truncate64(const char __user *path, u32 reg4, unsigned long len1, unsigned long len2); -long compat_sys_fallocate(int fd, int mode, u32 offset1, u32 offset2, u32 len1, u32 len2); - int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long len1, unsigned long len2); diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h index b1129b4ef57d..659a996c75aa 100644 --- a/arch/powerpc/include/asm/unistd.h +++ b/arch/powerpc/include/asm/unistd.h @@ -45,6 +45,7 @@ #define __ARCH_WANT_SYS_UTIME #define __ARCH_WANT_SYS_NEWFSTATAT #define __ARCH_WANT_COMPAT_STAT +#define __ARCH_WANT_COMPAT_FALLOCATE #define __ARCH_WANT_COMPAT_SYS_SENDFILE #endif #define __ARCH_WANT_SYS_FORK diff 
--git a/arch/powerpc/kernel/sys_ppc32.c b/arch/powerpc/kernel/sys_ppc32.c index ba363328da2b..d961634976d8 100644 --- a/arch/powerpc/kernel/sys_ppc32.c +++ b/arch/powerpc/kernel/sys_ppc32.c @@ -79,13 +79,6 @@ int compat_sys_truncate64(const char __user * path, u32 reg4, return ksys_truncate(path, merge_64(len1, len2)); } -long compat_sys_fallocate(int fd, int mode, u32 offset1, u32 offset2, -u32 len1, u32 len2) -{ - return ksys_fallocate(fd, mode, merge_64(offset1, offset2), -merge_64(len1, len2)); -} - int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long len1, unsigned long len2) { -- 2.34.1
[PATCH v6 09/25] asm-generic: compat: Support BE for long long args in 32-bit ABIs
32-bit ABIs support passing 64-bit integers by registers via argument
translation. Commit 59c10c52f573 ("riscv: compat: syscall: Add
compat_sys_call_table implementation") implements the compat_arg_u64
macro for efficiently defining little endian compatibility syscalls.
Architectures supporting big endianness may benefit from reciprocal
argument translation, but are welcome also to implement their own.

Signed-off-by: Rohan McLure
Reviewed-by: Nicholas Piggin
Reviewed-by: Arnd Bergmann
---
V5: New patch.
---
 include/asm-generic/compat.h | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/compat.h b/include/asm-generic/compat.h
index d06308a2a7a8..aeb257ad3d1a 100644
--- a/include/asm-generic/compat.h
+++ b/include/asm-generic/compat.h
@@ -14,12 +14,17 @@
 #define COMPAT_OFF_T_MAX	0x7fffffff
 #endif

-#if !defined(compat_arg_u64) && !defined(CONFIG_CPU_BIG_ENDIAN)
+#ifndef compat_arg_u64
+#ifdef CONFIG_CPU_BIG_ENDIAN
 #define compat_arg_u64(name)		u32 name##_lo, u32 name##_hi
 #define compat_arg_u64_dual(name)	u32, name##_lo, u32, name##_hi
+#else
+#define compat_arg_u64(name)		u32 name##_hi, u32 name##_lo
+#define compat_arg_u64_dual(name)	u32, name##_hi, u32, name##_lo
+#endif
 #define compat_arg_u64_glue(name)	(((u64)name##_lo & 0xffffffffUL) | \
 					 ((u64)name##_hi << 32))
-#endif
+#endif /* compat_arg_u64 */

 /* These types are common across all compat ABIs */
 typedef u32 compat_size_t;
-- 
2.34.1
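The glue step these macros share can be checked in isolation. This is a user-space sketch of the `compat_arg_u64_glue` expression, not the kernel macro itself; `demo_arg_u64_glue` is an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

/* The glue expression in miniature: however the ABI ordered the two
 * 32-bit halves across the register pair, once they are bound to the
 * _lo/_hi names the same expression reassembles the 64-bit value. */
uint64_t demo_arg_u64_glue(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)lo & 0xffffffffULL) | ((uint64_t)hi << 32);
}
```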
[PATCH v6 08/25] powerpc: Fix fallocate and fadvise64_64 compat parameter combination
As reported[1] by Arnd, the arch-specific fadvise64_64 and fallocate compatibility handlers assume parameters are passed with 32-bit big-endian ABI. This affects the assignment of odd-even parameter pairs to the high or low words of a 64-bit syscall parameter. Fix fadvise64_64 fallocate compat handlers to correctly swap upper/lower 32 bits conditioned on endianness. A future patch will replace the arch-specific compat fallocate with an asm-generic implementation. This patch is intended for ease of back-port. [1]: https://lore.kernel.org/all/be29926f-226e-48dc-871a-e29a54e80...@www.fastmail.com/ Fixes: 57f48b4b74e7 ("powerpc/compat_sys: swap hi/lo parts of 64-bit syscall args in LE mode") Reported-by: Arnd Bergmann Signed-off-by: Rohan McLure Reviewed-by: Arnd Bergmann Reviewed-by: Nicholas Piggin --- V5: New patch. --- arch/powerpc/include/asm/syscalls.h | 12 arch/powerpc/kernel/sys_ppc32.c | 14 +- arch/powerpc/kernel/syscalls.c | 4 ++-- 3 files changed, 15 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h index 21c2faaa2957..16b668515d15 100644 --- a/arch/powerpc/include/asm/syscalls.h +++ b/arch/powerpc/include/asm/syscalls.h @@ -8,6 +8,18 @@ #include #include +/* + * long long munging: + * The 32 bit ABI passes long longs in an odd even register pair. + * High and low parts are swapped depending on endian mode, + * so define a macro (similar to mips linux32) to handle that. 
+ */ +#ifdef __LITTLE_ENDIAN__ +#define merge_64(low, high) ((u64)high << 32) | low +#else +#define merge_64(high, low) ((u64)high << 32) | low +#endif + struct rtas_args; long sys_mmap(unsigned long addr, size_t len, diff --git a/arch/powerpc/kernel/sys_ppc32.c b/arch/powerpc/kernel/sys_ppc32.c index f4edcc9489fb..ba363328da2b 100644 --- a/arch/powerpc/kernel/sys_ppc32.c +++ b/arch/powerpc/kernel/sys_ppc32.c @@ -56,18 +56,6 @@ unsigned long compat_sys_mmap2(unsigned long addr, size_t len, return sys_mmap(addr, len, prot, flags, fd, pgoff << 12); } -/* - * long long munging: - * The 32 bit ABI passes long longs in an odd even register pair. - * High and low parts are swapped depending on endian mode, - * so define a macro (similar to mips linux32) to handle that. - */ -#ifdef __LITTLE_ENDIAN__ -#define merge_64(low, high) ((u64)high << 32) | low -#else -#define merge_64(high, low) ((u64)high << 32) | low -#endif - compat_ssize_t compat_sys_pread64(unsigned int fd, char __user *ubuf, compat_size_t count, u32 reg6, u32 pos1, u32 pos2) { @@ -94,7 +82,7 @@ int compat_sys_truncate64(const char __user * path, u32 reg4, long compat_sys_fallocate(int fd, int mode, u32 offset1, u32 offset2, u32 len1, u32 len2) { - return ksys_fallocate(fd, mode, ((loff_t)offset1 << 32) | offset2, + return ksys_fallocate(fd, mode, merge_64(offset1, offset2), merge_64(len1, len2)); } diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c index fc999140bc27..abc3fbb3c490 100644 --- a/arch/powerpc/kernel/syscalls.c +++ b/arch/powerpc/kernel/syscalls.c @@ -98,8 +98,8 @@ long ppc64_personality(unsigned long personality) long ppc_fadvise64_64(int fd, int advice, u32 offset_high, u32 offset_low, u32 len_high, u32 len_low) { - return ksys_fadvise64_64(fd, (u64)offset_high << 32 | offset_low, -(u64)len_high << 32 | len_low, advice); + return ksys_fadvise64_64(fd, merge_64(offset_high, offset_low), +merge_64(len_high, len_low), advice); } SYSCALL_DEFINE0(switch_endian) -- 2.34.1
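The endian-conditional `merge_64` above can be exercised in user space. This sketch models both expansions of the macro as ordinary functions (illustrative `demo_*` names): the parameter order differs per endianness, but both reconstruct the same 64-bit value from a register pair delivered in ABI order.

```c
#include <assert.h>
#include <stdint.h>

/* merge_64 in miniature: the 32-bit ABI delivers a 64-bit argument in
 * an odd/even register pair, high word in the first register on
 * big-endian and low word first on little-endian. Swapping the
 * parameter names per endianness, as the macro does, makes both orders
 * yield the same value. */
uint64_t demo_merge_64_be(uint32_t high, uint32_t low)	/* BE pair order */
{
	return ((uint64_t)high << 32) | low;
}

uint64_t demo_merge_64_le(uint32_t low, uint32_t high)	/* LE pair order */
{
	return ((uint64_t)high << 32) | low;
}
```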
[PATCH v6 07/25] powerpc/64s: Fix comment on interrupt handler prologue
Interrupt handlers on 64s systems will often need to save register state
from the interrupted process to make space for loading special purpose
registers or for internal state.

Fix a comment documenting a common code path macro in the beginning of
interrupt handlers where r10 is saved to the PACA to afford space for
the value of the CFAR. The comment is currently written as if r10-r12
are saved to the PACA, but in fact only r10 is saved, with r11-r12 saved
much later. The distance in code between these saves has grown over the
many revisions of this macro. Fix this by signalling with a comment
where r11-r12 are saved to the PACA.

Signed-off-by: Rohan McLure
Reported-by: Nicholas Piggin
Reviewed-by: Nicholas Piggin
---
V2: Given its own commit
V3: Annotate r11-r12 save locations with comment.
---
 arch/powerpc/kernel/exceptions-64s.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 3d0dc133a9ae..a3b51441b039 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -281,7 +281,7 @@ BEGIN_FTR_SECTION
 	mfspr	r9,SPRN_PPR
 END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	HMT_MEDIUM
-	std	r10,IAREA+EX_R10(r13)		/* save r10 - r12 */
+	std	r10,IAREA+EX_R10(r13)		/* save r10 */
 	.if ICFAR
 BEGIN_FTR_SECTION
 	mfspr	r10,SPRN_CFAR
@@ -321,7 +321,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
 	mfctr	r10
 	std	r10,IAREA+EX_CTR(r13)
 	mfcr	r9
-	std	r11,IAREA+EX_R11(r13)
+	std	r11,IAREA+EX_R11(r13)		/* save r11 - r12 */
 	std	r12,IAREA+EX_R12(r13)
 	/*
--
2.34.1
[PATCH v6 06/25] powerpc/64e: Clarify register saves and clears with {SAVE,ZEROIZE}_GPRS
The common interrupt handler prologue macro and the bad_stack trampolines
include consecutive sequences of register saves, and some register
clears. Neaten such instances by expanding use of the SAVE_GPRS macro
and employing the ZEROIZE_GPR macro when appropriate. Also simplify an
invocation of SAVE_GPRS targeting all non-volatile registers to
SAVE_NVGPRS.

Signed-off-by: Rohan McLure
Reported-by: Nicholas Piggin
Reviewed-by: Nicholas Piggin
---
V4: New commit.
V6: Split REST_GPRS(0, 1, r1) to remove dependence on macro
implementation ordering the r0 restore before r1 restore.
---
 arch/powerpc/kernel/exceptions-64e.S | 28 +++++++++++-----------------
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
index 67dc4e3179a0..f1d714acc945 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -216,17 +216,15 @@ END_FTR_SECTION_IFSET(CPU_FTR_EMB_HV)
 	mtlr	r10
 	mtcr	r11
-	ld	r10,GPR10(r1)
-	ld	r11,GPR11(r1)
-	ld	r12,GPR12(r1)
+	REST_GPRS(10, 12, r1)
 	mtspr	\scratch,r0
 	std	r10,\paca_ex+EX_R10(r13);
 	std	r11,\paca_ex+EX_R11(r13);
 	ld	r10,_NIP(r1)
 	ld	r11,_MSR(r1)
-	ld	r0,GPR0(r1)
-	ld	r1,GPR1(r1)
+	REST_GPR(0, r1)
+	REST_GPR(1, r1)
 	mtspr	\srr0,r10
 	mtspr	\srr1,r11
 	ld	r10,\paca_ex+EX_R10(r13)
@@ -372,16 +370,15 @@ ret_from_mc_except:
 /* Core exception code for all exceptions except TLB misses.
 */
 #define EXCEPTION_COMMON_LVL(n, scratch, excf)			\
exc_##n##_common:						\
-	std	r0,GPR0(r1);	/* save r0 in stackframe */	\
-	std	r2,GPR2(r1);	/* save r2 in stackframe */	\
-	SAVE_GPRS(3, 9, r1);	/* save r3 - r9 in stackframe */\
+	SAVE_GPR(0, r1);	/* save r0 in stackframe */	\
+	SAVE_GPRS(2, 9, r1);	/* save r2 - r9 in stackframe */\
 	std	r10,_NIP(r1);	/* save SRR0 to stackframe */	\
 	std	r11,_MSR(r1);	/* save SRR1 to stackframe */	\
 	beq	2f;		/* if from kernel mode */	\
2:	ld	r3,excf+EX_R10(r13);	/* get back r10 */	\
 	ld	r4,excf+EX_R11(r13);	/* get back r11 */	\
 	mfspr	r5,scratch;	/* get back r13 */		\
-	std	r12,GPR12(r1);	/* save r12 in stackframe */	\
+	SAVE_GPR(12, r1);	/* save r12 in stackframe */	\
 	ld	r2,PACATOC(r13);	/* get kernel TOC into r2 */	\
 	mflr	r6;		/* save LR in stackframe */	\
 	mfctr	r7;		/* save CTR in stackframe */	\
@@ -390,7 +387,7 @@ exc_##n##_common:					\
 	lwz	r10,excf+EX_CR(r13);	/* load orig CR back from PACA */	\
 	lbz	r11,PACAIRQSOFTMASK(r13); /* get current IRQ softe */	\
 	ld	r12,exception_marker@toc(r2);			\
-	li	r0,0;						\
+	ZEROIZE_GPR(0);						\
 	std	r3,GPR10(r1);	/* save r10 to stackframe */	\
 	std	r4,GPR11(r1);	/* save r11 to stackframe */	\
 	std	r5,GPR13(r1);	/* save it to stackframe */	\
@@ -1056,15 +1053,14 @@ bad_stack_book3e:
 	mfspr	r11,SPRN_ESR
 	std	r10,_DEAR(r1)
 	std	r11,_ESR(r1)
-	std	r0,GPR0(r1);	/* save r0 in stackframe */	\
-	std	r2,GPR2(r1);	/* save r2 in stackframe */	\
-	SAVE_GPRS(3, 9, r1);	/* save r3 - r9 in stackframe */\
+	SAVE_GPR(0, r1);	/* save r0 in stackframe */	\
+	SAVE_GPRS(2, 9, r1);	/* save r2 - r9 in stackframe */\
 	ld	r3,PACA_EXGEN+EX_R10(r13);	/* get back r10 */	\
 	ld	r4,PACA_EXGEN+EX_R11(r13);	/* get back r11 */	\
 	mfspr	r5,SPRN_SPRG_GEN_SCRATCH;	/* get back r13 XXX can be wrong */	\
 	std	r3,GPR10(r1);	/* save r10 to stackframe */	\
 	std	r4,GPR11(r1);	/* save r11 to stackframe */	\
-	std	r12,GPR12(r1);	/* save r12 in stackframe */	\
+	SAVE_GPR(12, r1);	/* save r12 in stackframe */	\
 	std	r5,GPR13(r1);	/* save it to stackframe */	\
 	mflr	r10
 	mfctr	r11
@@ -1072,12 +1068,12 @@ bad_stack_book3e:
 	std
[PATCH v6 05/25] powerpc/32: Clarify interrupt restores with REST_GPR macro in entry_32.S
Restoring the register state of the interrupted thread involves issuing
a large number of predictable loads to the kernel stack frame. Issue the
REST_GPR{,S} macros to clearly signal when this is happening, and bunch
together restores at the end of the interrupt handler where the saved
value is not consumed earlier in the handler code.

Signed-off-by: Rohan McLure
Reported-by: Christophe Leroy
Reviewed-by: Nicholas Piggin
---
V3: New patch.
V4: Minimise restores in the unrecoverable window between restoring
SRR0/1 and return from interrupt.
---
 arch/powerpc/kernel/entry_32.S | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 44dfce9a60c5..e4b694cebc44 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -68,7 +68,7 @@ prepare_transfer_to_handler:
 	lwz	r9,_MSR(r11)		/* if sleeping, clear MSR.EE */
 	rlwinm	r9,r9,0,~MSR_EE
 	lwz	r12,_LINK(r11)		/* and return to address in LR */
-	lwz	r2, GPR2(r11)
+	REST_GPR(2, r11)
 	b	fast_exception_return
_ASM_NOKPROBE_SYMBOL(prepare_transfer_to_handler)
 #endif /* CONFIG_PPC_BOOK3S_32 || CONFIG_E500 */
@@ -144,7 +144,7 @@ ret_from_syscall:
 	lwz	r7,_NIP(r1)
 	lwz	r8,_MSR(r1)
 	cmpwi	r3,0
-	lwz	r3,GPR3(r1)
+	REST_GPR(3, r1)
 syscall_exit_finish:
 	mtspr	SPRN_SRR0,r7
 	mtspr	SPRN_SRR1,r8
@@ -152,8 +152,8 @@ syscall_exit_finish:
 	bne	3f
 	mtcr	r5

-1:	lwz	r2,GPR2(r1)
-	lwz	r1,GPR1(r1)
+1:	REST_GPR(2, r1)
+	REST_GPR(1, r1)
 	rfi
 #ifdef CONFIG_40x
 	b	.	/* Prevent prefetch past rfi */
@@ -165,10 +165,8 @@ syscall_exit_finish:
 	REST_NVGPRS(r1)
 	mtctr	r4
 	mtxer	r5
-	lwz	r0,GPR0(r1)
-	lwz	r3,GPR3(r1)
-	REST_GPRS(4, 11, r1)
-	lwz	r12,GPR12(r1)
+	REST_GPR(0, r1)
+	REST_GPRS(3, 12, r1)
 	b	1b

 #ifdef CONFIG_44x
@@ -260,9 +258,8 @@ fast_exception_return:
 	beq	3f	/* if not, we've got problems */
 #endif

-2:	REST_GPRS(3, 6, r11)
-	lwz	r10,_CCR(r11)
-	REST_GPRS(1, 2, r11)
+2:	lwz	r10,_CCR(r11)
+	REST_GPRS(1, 6, r11)
 	mtcr	r10
 	lwz	r10,_LINK(r11)
 	mtlr	r10
@@ -277,7 +274,7 @@ fast_exception_return:
 	mtspr	SPRN_SRR0,r12
 	REST_GPR(9, r11)
 	REST_GPR(12, r11)
-	lwz	r11,GPR11(r11)
+	REST_GPR(11, r11)
 	rfi
 #ifdef CONFIG_40x
 	b	.	/* Prevent prefetch past rfi */
@@ -454,9 +451,8 @@ _ASM_NOKPROBE_SYMBOL(interrupt_return)
 	lwz	r3,_MSR(r1);					\
 	andi.	r3,r3,MSR_PR;					\
 	bne	interrupt_return;				\
-	lwz	r0,GPR0(r1);					\
-	lwz	r2,GPR2(r1);					\
-	REST_GPRS(3, 8, r1);					\
+	REST_GPR(0, r1);					\
+	REST_GPRS(2, 8, r1);					\
 	lwz	r10,_XER(r1);					\
 	lwz	r11,_CTR(r1);					\
 	mtspr	SPRN_XER,r10;					\
@@ -475,11 +471,8 @@ _ASM_NOKPROBE_SYMBOL(interrupt_return)
 	lwz	r12,_MSR(r1);					\
 	mtspr	exc_lvl_srr0,r11;				\
 	mtspr	exc_lvl_srr1,r12;				\
-	lwz	r9,GPR9(r1);					\
-	lwz	r12,GPR12(r1);					\
-	lwz	r10,GPR10(r1);					\
-	lwz	r11,GPR11(r1);					\
-	lwz	r1,GPR1(r1);					\
+	REST_GPRS(9, 12, r1);					\
+	REST_GPR(1, r1);					\
 	exc_lvl_rfi;						\
 	b	.;	/* prevent prefetch past exc_lvl_rfi */
--
2.34.1
[PATCH v6 04/25] powerpc/64s: Use {ZEROIZE,SAVE,REST}_GPRS macros in sc, scv 0 handlers
Use the convenience macros for saving/clearing/restoring gprs in keeping
with syscall calling conventions. The plural variants of these macros
can store a range of registers for concision.

This works well when the user gpr value we are hoping to save is still
live. In the syscall interrupt handlers, user register state is
sometimes juggled between registers. Hold off from issuing the SAVE_GPR
macro for applicable neighbouring lines to highlight the delicate
register save logic.

Signed-off-by: Rohan McLure
Reviewed-by: Nicholas Piggin
---
V2: Update summary
V3: Update summary regarding exclusions for the SAVE_GPR macro.
Acknowledge new name for ZEROIZE_GPR{,S} macros.
V5: Move to beginning of series
---
 arch/powerpc/kernel/interrupt_64.S | 43 +++++++++----------------------------------
 1 file changed, 9 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/kernel/interrupt_64.S b/arch/powerpc/kernel/interrupt_64.S
index 71d2d9497283..7d92a7a54727 100644
--- a/arch/powerpc/kernel/interrupt_64.S
+++ b/arch/powerpc/kernel/interrupt_64.S
@@ -71,12 +71,7 @@ _ASM_NOKPROBE_SYMBOL(system_call_vectored_\name)
 	mfcr	r12
 	li	r11,0
 	/* Can we avoid saving r3-r8 in common case? */
-	std	r3,GPR3(r1)
-	std	r4,GPR4(r1)
-	std	r5,GPR5(r1)
-	std	r6,GPR6(r1)
-	std	r7,GPR7(r1)
-	std	r8,GPR8(r1)
+	SAVE_GPRS(3, 8, r1)
 	/* Zero r9-r12, this should only be required when restoring all GPRs */
 	std	r11,GPR9(r1)
 	std	r11,GPR10(r1)
@@ -149,17 +144,10 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	/* Could zero these as per ABI, but we may consider a stricter ABI
 	 * which preserves these if libc implementations can benefit, so
 	 * restore them for now until further measurement is done. */
-	ld	r0,GPR0(r1)
-	ld	r4,GPR4(r1)
-	ld	r5,GPR5(r1)
-	ld	r6,GPR6(r1)
-	ld	r7,GPR7(r1)
-	ld	r8,GPR8(r1)
+	REST_GPR(0, r1)
+	REST_GPRS(4, 8, r1)
 	/* Zero volatile regs that may contain sensitive kernel data */
-	li	r9,0
-	li	r10,0
-	li	r11,0
-	li	r12,0
+	ZEROIZE_GPRS(9, 12)
 	mtspr	SPRN_XER,r0
 	/*
@@ -182,7 +170,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	ld	r5,_XER(r1)

 	REST_NVGPRS(r1)
-	ld	r0,GPR0(r1)
+	REST_GPR(0, r1)
 	mtcr	r2
 	mtctr	r3
 	mtlr	r4
@@ -250,12 +238,7 @@ END_BTB_FLUSH_SECTION
 	mfcr	r12
 	li	r11,0
 	/* Can we avoid saving r3-r8 in common case? */
-	std	r3,GPR3(r1)
-	std	r4,GPR4(r1)
-	std	r5,GPR5(r1)
-	std	r6,GPR6(r1)
-	std	r7,GPR7(r1)
-	std	r8,GPR8(r1)
+	SAVE_GPRS(3, 8, r1)
 	/* Zero r9-r12, this should only be required when restoring all GPRs */
 	std	r11,GPR9(r1)
 	std	r11,GPR10(r1)
@@ -345,16 +328,8 @@ END_FTR_SECTION_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
 	cmpdi	r3,0
 	bne	.Lsyscall_restore_regs
 	/* Zero volatile regs that may contain sensitive kernel data */
-	li	r0,0
-	li	r4,0
-	li	r5,0
-	li	r6,0
-	li	r7,0
-	li	r8,0
-	li	r9,0
-	li	r10,0
-	li	r11,0
-	li	r12,0
+	ZEROIZE_GPR(0)
+	ZEROIZE_GPRS(4, 12)
 	mtctr	r0
 	mtspr	SPRN_XER,r0
 .Lsyscall_restore_regs_cont:
@@ -380,7 +355,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 	REST_NVGPRS(r1)
 	mtctr	r3
 	mtspr	SPRN_XER,r4
-	ld	r0,GPR0(r1)
+	REST_GPR(0, r1)
 	REST_GPRS(4, 12, r1)
 	b	.Lsyscall_restore_regs_cont
 .Lsyscall_rst_end:
--
2.34.1
[PATCH v6 03/25] powerpc: Add ZEROIZE_GPRS macros for register clears
Provide register zeroing macros, following the same convention as
existing register stack save/restore macros, to be used in a later
change to concisely zero a sequence of consecutive gprs.

The resulting macros are called ZEROIZE_GPRS and ZEROIZE_NVGPRS, in
keeping with the naming of the accompanying restore and save macros, and
usage of "zeroize" to describe this operation elsewhere in the kernel.

Signed-off-by: Rohan McLure
Reviewed-by: Nicholas Piggin
---
V2: Change 'ZERO' usage in naming to 'NULLIFY', a more obvious verb
V3: Change 'NULLIFY' usage in naming to 'ZEROIZE', which has precedent
in the kernel and explicitly specifies that we are zeroing.
V4: Update commit message to use zeroize.
V5: The reason for the patch is to add zeroize macros. Move that to
first paragraph in patch description.
---
 arch/powerpc/include/asm/ppc_asm.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
index 83c02f5a7f2a..b95689ada59c 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -33,6 +33,20 @@
 	.endr
 .endm

+/*
+ * This expands to a sequence of register clears for regs start to end
+ * inclusive, of the form:
+ *
+ *   li rN, 0
+ */
+.macro ZEROIZE_REGS start, end
+	.Lreg=\start
+	.rept (\end - \start + 1)
+	li	.Lreg, 0
+	.Lreg=.Lreg+1
+	.endr
+.endm
+
 /*
  * Macros for storing registers into and loading registers from
  * exception frames.
@@ -49,6 +63,14 @@
 #define	REST_NVGPRS(base)		REST_GPRS(13, 31, base)
 #endif

+#define	ZEROIZE_GPRS(start, end)	ZEROIZE_REGS start, end
+#ifdef __powerpc64__
+#define	ZEROIZE_NVGPRS()		ZEROIZE_GPRS(14, 31)
+#else
+#define	ZEROIZE_NVGPRS()		ZEROIZE_GPRS(13, 31)
+#endif
+#define	ZEROIZE_GPR(n)			ZEROIZE_GPRS(n, n)
+
 #define SAVE_GPR(n, base)		SAVE_GPRS(n, n, base)
 #define REST_GPR(n, base)		REST_GPRS(n, n, base)
--
2.34.1
[PATCH v6 02/25] powerpc: Save caller r3 prior to system_call_exception
This reverts commit 8875f47b7681 ("powerpc/syscall: Save r3 in
regs->orig_r3").

Save the caller's original r3 state to the kernel stackframe before
entering system_call_exception. This allows for user registers to be
cleared by the time system_call_exception is entered, reducing the
influence of user registers on speculation within the kernel.

Prior to this commit, orig_r3 was saved at the beginning of
system_call_exception. Instead, save orig_r3 while the user value is
still live in r3.

Also replicate this early save in 32-bit. A similar save was removed in
commit 6f76a01173cc ("powerpc/syscall: implement system call entry/exit
logic in C for PPC32") when 32-bit adopted system_call_exception. Revert
its removal of orig_r3 saves.

Signed-off-by: Rohan McLure
Reviewed-by: Nicholas Piggin
---
V3: New commit.
V5: New commit message, as we do more than just revert 8875f47b7681.
---
 arch/powerpc/kernel/entry_32.S     | 1 +
 arch/powerpc/kernel/interrupt_64.S | 2 ++
 arch/powerpc/kernel/syscall.c      | 1 -
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 1d599df6f169..44dfce9a60c5 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -101,6 +101,7 @@ __kuep_unlock:
 	.globl	transfer_to_syscall
 transfer_to_syscall:
+	stw	r3, ORIG_GPR3(r1)
 	stw	r11, GPR1(r1)
 	stw	r11, 0(r1)
 	mflr	r12
diff --git a/arch/powerpc/kernel/interrupt_64.S b/arch/powerpc/kernel/interrupt_64.S
index ce25b28cf418..71d2d9497283 100644
--- a/arch/powerpc/kernel/interrupt_64.S
+++ b/arch/powerpc/kernel/interrupt_64.S
@@ -91,6 +91,7 @@ _ASM_NOKPROBE_SYMBOL(system_call_vectored_\name)
 	li	r11,\trapnr
 	std	r11,_TRAP(r1)
 	std	r12,_CCR(r1)
+	std	r3,ORIG_GPR3(r1)
 	addi	r10,r1,STACK_FRAME_OVERHEAD
 	ld	r11,exception_marker@toc(r2)
 	std	r11,-16(r10)		/* "regshere" marker */
@@ -275,6 +276,7 @@ END_BTB_FLUSH_SECTION
 	std	r10,_LINK(r1)
 	std	r11,_TRAP(r1)
 	std	r12,_CCR(r1)
+	std	r3,ORIG_GPR3(r1)
 	addi	r10,r1,STACK_FRAME_OVERHEAD
 	ld	r11,exception_marker@toc(r2)
 	std	r11,-16(r10)		/* "regshere" marker */
diff --git a/arch/powerpc/kernel/syscall.c b/arch/powerpc/kernel/syscall.c
index 81ace9e8b72b..64102a64fd84 100644
--- a/arch/powerpc/kernel/syscall.c
+++ b/arch/powerpc/kernel/syscall.c
@@ -25,7 +25,6 @@ notrace long system_call_exception(long r3, long r4, long r5,
 	kuap_lock();
 	add_random_kstack_offset();
-	regs->orig_gpr3 = r3;

 	if (IS_ENABLED(CONFIG_PPC_IRQ_SOFT_MASK_DEBUG))
 		BUG_ON(irq_soft_mask_return() != IRQS_ALL_DISABLED);
--
2.34.1
Re: [PATCH v6 2/8] dt-bindings: phy: Add Lynx 10G phy binding
On Tue, 20 Sep 2022 16:23:50 -0400, Sean Anderson wrote: > This adds a binding for the SerDes module found on QorIQ processors. > Each phy is a subnode of the top-level device, possibly supporting > multiple lanes and protocols. This "thick" #phy-cells is used due to > allow for better organization of parameters. Note that the particular > parameters necessary to select a protocol-controller/lane combination > vary across different SoCs, and even within different SerDes on the same > SoC. > > The driver is designed to be able to completely reconfigure lanes at > runtime. Generally, the phy consumer can select the appropriate > protocol using set_mode. > > There are two PLLs, each of which can be used as the master clock for > each lane. Each PLL has its own reference. For the moment they are > required, because it simplifies the driver implementation. Absent > reference clocks can be modeled by a fixed-clock with a rate of 0. > > Signed-off-by: Sean Anderson > Reviewed-by: Rob Herring > --- > > Changes in v6: > - fsl,type -> phy-type > > Changes in v4: > - Use subnodes to describe lane configuration, instead of describing > PCCRs. This is the same style used by phy-cadence-sierra et al. > > Changes in v3: > - Manually expand yaml references > - Add mode configuration to device tree > > Changes in v2: > - Rename to fsl,lynx-10g.yaml > - Refer to the device in the documentation, rather than the binding > - Move compatible first > - Document phy cells in the description > - Allow a value of 1 for phy-cells. This allows for compatibility with > the similar (but according to Ioana Ciornei different enough) lynx-28g > binding. > - Remove minItems > - Use list for clock-names > - Fix example binding having too many cells in regs > - Add #clock-cells. This will allow using assigned-clocks* to configure > the PLLs. 
> - Document the structure of the compatible strings > > .../devicetree/bindings/phy/fsl,lynx-10g.yaml | 236 ++ > 1 file changed, 236 insertions(+) > create mode 100644 Documentation/devicetree/bindings/phy/fsl,lynx-10g.yaml > My bot found errors running 'make DT_CHECKER_FLAGS=-m dt_binding_check' on your patch (DT_CHECKER_FLAGS is new in v5.13): yamllint warnings/errors: dtschema/dtc warnings/errors: Error: Documentation/devicetree/bindings/phy/fsl,lynx-10g.example.dts:51.27-28 syntax error FATAL ERROR: Unable to parse input tree make[1]: *** [scripts/Makefile.lib:384: Documentation/devicetree/bindings/phy/fsl,lynx-10g.example.dtb] Error 1 make[1]: *** Waiting for unfinished jobs make: *** [Makefile:1420: dt_binding_check] Error 2 doc reference errors (make refcheckdocs): See https://patchwork.ozlabs.org/patch/ This check can fail if there are any dependencies. The base for a patch series is generally the most recent rc1. If you already ran 'make dt_binding_check' and didn't see the above error(s), then make sure 'yamllint' is installed and dt-schema is up to date: pip3 install dtschema --upgrade Please check and re-submit.
[PATCH v6 01/25] powerpc: Remove asmlinkage from syscall handler definitions
The asmlinkage macro has no special meaning in powerpc, and prior to
this patch is used sporadically on some syscall handler definitions. On
architectures that do not define asmlinkage, it resolves to extern "C"
for C++ compilers and a nop otherwise. The current invocations of
asmlinkage provide far from complete support for C++ toolchains, and so
the macro serves no purpose in powerpc.

Remove all invocations of asmlinkage in arch/powerpc. These incidentally
only occur in syscall definitions and prototypes.

Signed-off-by: Rohan McLure
Reviewed-by: Nicholas Piggin
Reviewed-by: Andrew Donnellan
---
V3: new patch
---
 arch/powerpc/include/asm/syscalls.h | 16 ++++++++--------
 arch/powerpc/kernel/sys_ppc32.c     |  8 ++++----
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h
index a2b13e55254f..21c2faaa2957 100644
--- a/arch/powerpc/include/asm/syscalls.h
+++ b/arch/powerpc/include/asm/syscalls.h
@@ -10,14 +10,14 @@

 struct rtas_args;

-asmlinkage long sys_mmap(unsigned long addr, size_t len,
-		unsigned long prot, unsigned long flags,
-		unsigned long fd, off_t offset);
-asmlinkage long sys_mmap2(unsigned long addr, size_t len,
-		unsigned long prot, unsigned long flags,
-		unsigned long fd, unsigned long pgoff);
-asmlinkage long ppc64_personality(unsigned long personality);
-asmlinkage long sys_rtas(struct rtas_args __user *uargs);
+long sys_mmap(unsigned long addr, size_t len,
+	      unsigned long prot, unsigned long flags,
+	      unsigned long fd, off_t offset);
+long sys_mmap2(unsigned long addr, size_t len,
+	       unsigned long prot, unsigned long flags,
+	       unsigned long fd, unsigned long pgoff);
+long ppc64_personality(unsigned long personality);
+long sys_rtas(struct rtas_args __user *uargs);
 int ppc_select(int n, fd_set __user *inp, fd_set __user *outp,
 	       fd_set __user *exp, struct __kernel_old_timeval __user *tvp);
 long ppc_fadvise64_64(int fd, int advice, u32 offset_high, u32 offset_low,
diff --git a/arch/powerpc/kernel/sys_ppc32.c b/arch/powerpc/kernel/sys_ppc32.c
index 16ff0399a257..f4edcc9489fb 100644
--- a/arch/powerpc/kernel/sys_ppc32.c
+++ b/arch/powerpc/kernel/sys_ppc32.c
@@ -85,20 +85,20 @@ compat_ssize_t compat_sys_readahead(int fd, u32 r4, u32 offset1, u32 offset2, u3
 	return ksys_readahead(fd, merge_64(offset1, offset2), count);
 }

-asmlinkage int compat_sys_truncate64(const char __user * path, u32 reg4,
+int compat_sys_truncate64(const char __user * path, u32 reg4,
 		unsigned long len1, unsigned long len2)
 {
 	return ksys_truncate(path, merge_64(len1, len2));
 }

-asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offset1, u32 offset2,
+long compat_sys_fallocate(int fd, int mode, u32 offset1, u32 offset2,
 		u32 len1, u32 len2)
 {
 	return ksys_fallocate(fd, mode, ((loff_t)offset1 << 32) | offset2,
 		merge_64(len1, len2));
 }

-asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long len1,
+int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long len1,
 		unsigned long len2)
 {
 	return ksys_ftruncate(fd, merge_64(len1, len2));
@@ -111,7 +111,7 @@ long ppc32_fadvise64(int fd, u32 unused, u32 offset1, u32 offset2,
 		advice);
 }

-asmlinkage long compat_sys_sync_file_range2(int fd, unsigned int flags,
+long compat_sys_sync_file_range2(int fd, unsigned int flags,
 		unsigned offset1, unsigned offset2,
 		unsigned nbytes1, unsigned nbytes2)
 {
--
2.34.1
[PATCH v6 00/25] powerpc: Syscall wrapper and register clearing
V5 available here:
Link: https://lore.kernel.org/all/20220916053300.786330-2-rmcl...@linux.ibm.com/T/

Implement a syscall wrapper, causing arguments to handlers to be passed
via a struct pt_regs on the stack. The syscall wrapper is implemented
for all platforms other than the Cell processor, from which SPUs expect
the ability to directly call syscall handler symbols with the regular
in-register calling convention.

Adopting syscall wrappers requires redefinition of architecture-specific
syscalls and compatibility syscalls to use the SYSCALL_DEFINE and
COMPAT_SYSCALL_DEFINE macros, as well as removal of direct references to
the emitted syscall-handler symbols from within the kernel. This work
led to the following modernisations of powerpc's syscall handlers:

 - Replace syscall 82 semantics with sys_old_select and remove the
   ppc_select handler, which features a direct call to both
   sys_old_select and sys_select.
 - Use a generic fallocate compatibility syscall.

Replace the asm implementation of the syscall table with a C
implementation for more compile-time checks.

Many compatibility syscalls are candidates to be removed in favour of
generically defined handlers, but exhibit different parameter orderings
and numberings due to 32-bit ABI support for 64-bit parameters. The
parameter reorderings are however consistent with arm. A future patch
series will serve to modernise syscalls by providing generic
implementations featuring these reorderings.

The design of this syscall wrapper is very similar to the s390, x86 and
arm64 implementations. See also commit 4378a7d4be30 ("arm64: implement
syscall wrappers").

The motivation for this change is that it allows for the clearing of
register state when entering the kernel through interrupt handlers on
64-bit servers. This serves to reduce the influence of values in
registers carried over from the interrupted process, e.g. syscall
parameters from user space, or user state at the site of a pagefault.

All values in registers are saved and zeroized at the entry to an
interrupt handler and restored afterward. While this may sound like a
heavy-weight mitigation, many gprs are already saved and restored on
handling of an interrupt, and the mmap_bench benchmark on a Power9
guest, repeatedly invoking the pagefault handler, suggests at most a
~0.8% regression in performance. Realistic workloads are not constantly
producing interrupts, and so this does not indicate a realistic
slowdown.

Using wrapped syscalls yields a performance improvement of ~5.6% on the
null_syscall benchmark on pseries guests, by removing the need for
system_call_exception to allocate its own stack frame. This amortises
the additional costs of saving and restoring non-volatile registers
(register clearing is cheap on superscalar platforms), and so the final
mitigation actually yields a net performance improvement of ~0.6% on the
null_syscall benchmark.

The clearing of general purpose registers on interrupts other than
syscalls is enabled by default only on Book3E 64-bit systems (where the
mitigation is inexpensive), but is available to other 64-bit systems via
the INTERRUPT_SANITIZE_REGISTERS Kconfig option. This mitigation is
optional, as the speculation influence of interrupts is likely less than
that of syscalls.

Patch Changelog:

 - Clears and restores of NVGPRS that depend on either SANITIZE or
   !SANITIZE have convenience macros.
 - Split syscall wrapper patch with change to system_call_exception
   calling convention.
 - In 64e, exchange misleading REST_GPRS(0, 1, r1) to clearly restore r0
   first and avoid clobbering the stack pointer.
 - Remove extraneous clears of the high-order words of syscall
   parameters, which is now redundant thanks to use of
   COMPAT_SYSCALL_DEFINE everywhere.
Rohan McLure (25): powerpc: Remove asmlinkage from syscall handler definitions powerpc: Save caller r3 prior to system_call_exception powerpc: Add ZEROIZE_GPRS macros for register clears powerpc/64s: Use {ZEROIZE,SAVE,REST}_GPRS macros in sc, scv 0 handlers powerpc/32: Clarify interrupt restores with REST_GPR macro in entry_32.S powerpc/64e: Clarify register saves and clears with {SAVE,ZEROIZE}_GPRS powerpc/64s: Fix comment on interrupt handler prologue powerpc: Fix fallocate and fadvise64_64 compat parameter combination asm-generic: compat: Support BE for long long args in 32-bit ABIs powerpc: Use generic fallocate compatibility syscall powerpc/32: Remove powerpc select specialisation powerpc: Remove direct call to personality syscall handler powerpc: Remove direct call to mmap2 syscall handlers powerpc: Provide do_ppc64_personality helper powerpc: Adopt SYSCALL_DEFINE for arch-specific syscall handlers powerpc: Include all arch-specific syscall prototypes powerpc: Enable compile-time check for syscall handlers powerpc: Use common syscall handler type powerpc: Remove high-order word clearing on compat syscall entry powerpc: Change system_call_exception calling convention
Re: [PATCH v3 4/4] arm64: support batched/deferred tlb shootdown during page reclamation
On 8/22/22 13:51, Yicong Yang wrote:
> +static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
> +					struct mm_struct *mm,
> +					unsigned long uaddr)
> +{
> +	__flush_tlb_page_nosync(mm, uaddr);
> +}
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> +{
> +	dsb(ish);
> +}

Just wondering if arch_tlbbatch_add_mm() could also detect contiguous
TLB invalidation requests on a given mm and try to generate a
range-based TLB invalidation such as flush_tlb_range().

struct arch_tlbflush_unmap_batch, via task->tlb_ubc->arch, could track
contiguous ranges while requests are being queued up via
arch_tlbbatch_add_mm(); any range formed could later be flushed in the
subsequent arch_tlbbatch_flush()?

Or might it not be worth the effort and complexity, compared with the
performance improvement a TLB range flush brings?
[powerpc] Kernel crash with THP tests (next-20220920)
While running transparent huge page tests [1] against 6.0.0-rc6-next-20220920 following crash is seen on IBM Power server. Kernel attempted to read user page (34) - exploit attempt? (uid: 0) BUG: Kernel NULL pointer dereference on read at 0x0034 Faulting instruction address: 0xc04d2744 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: dm_mod(E) bonding(E) rfkill(E) tls(E) sunrpc(E) nd_pmem(E) nd_btt(E) dax_pmem(E) papr_scm(E) libnvdimm(E) pseries_rng(E) vmx_crypto(E) ext4(E) mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) crc64_rocksoft(E) crc64(E) sg(E) ibmvscsi(E) scsi_transport_srp(E) ibmveth(E) fuse(E) CPU: 37 PID: 2219255 Comm: sysctl Tainted: GE 6.0.0-rc6-next-20220920 #1 NIP: c04d2744 LR: c04d2734 CTR: REGS: c012801bf660 TRAP: 0300 Tainted: GE (6.0.0-rc6-next-20220920) MSR: 80009033 CR: 24048222 XER: 2004 CFAR: c04b0eac DAR: 0034 DSISR: 4000 IRQMASK: 0 GPR00: c04d2734 c012801bf900 c2a92300 GPR04: c2ac8ac0 c1209340 0005 c01286714b80 GPR08: 0034 GPR12: 28048242 c0167fff6b00 GPR16: GPR20: c012801bfae8 0001 0100 0001 GPR24: c012801bfae8 c2ac8ac0 0002 0005 GPR28: 0001 00346cca NIP [c04d2744] alloc_buddy_huge_page+0xd4/0x240 LR [c04d2734] alloc_buddy_huge_page+0xc4/0x240 Call Trace: [c012801bf900] [c04d2734] alloc_buddy_huge_page+0xc4/0x240 (unreliable) [c012801bf9b0] [c04d46a4] alloc_fresh_huge_page.part.72+0x214/0x2a0 [c012801bfa40] [c04d7f88] alloc_pool_huge_page+0x118/0x190 [c012801bfa90] [c04d84dc] __nr_hugepages_store_common+0x4dc/0x610 [c012801bfb70] [c04d88bc] hugetlb_sysctl_handler_common+0x13c/0x180 [c012801bfc10] [c06380e0] proc_sys_call_handler+0x210/0x350 [c012801bfc90] [c0551c00] vfs_write+0x2e0/0x460 [c012801bfd50] [c0551f5c] ksys_write+0x7c/0x140 [c012801bfda0] [c0033f58] system_call_exception+0x188/0x3f0 [c012801bfe10] [c000c53c] system_call_common+0xec/0x270 --- interrupt: c00 at 0x7fffa9520c34 NIP: 7fffa9520c34 LR: 0001024754bc CTR: REGS: c012801bfe80 TRAP: 0c00 Tainted: GE 
(6.0.0-rc6-next-20220920) MSR: 8280f033 CR: 28002202 XER: IRQMASK: 0 GPR00: 0004 7fffccd76cd0 7fffa9607300 0003 GPR04: 000138da6970 0006 fff6 GPR08: 000138da6970 GPR12: 7fffa9a40940 GPR16: GPR20: GPR24: 0001 0010 0006 000138da8aa0 GPR28: 7fffa95fc2c8 000138da8aa0 0006 000138da6930 NIP [7fffa9520c34] 0x7fffa9520c34 LR [0001024754bc] 0x1024754bc --- interrupt: c00 Instruction dump: 3b42 3ba1 3b80 7f26cb78 7fc5f378 7f64db78 7fe3fb78 4bfde5b9 6000 7c691b78 39030034 7c0004ac <7d404028> 7c0ae800 40c20010 7f80412d ---[ end trace ]--- Kernel panic - not syncing: Fatal exception Bisect points to following patch: commit f2f3c25dea3acfb17aecb7273541e7266dfc8842 hugetlb: freeze allocated pages before creating hugetlb pages Reverting the patch allows the test to run successfully. Thanks - Sachin [1] https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/transparent_hugepages_defrag.py