Re: [PATCH v4 2/9] perf/core: open access for CAP_SYS_PERFMON privileged process
On Fri, 10 Jan 2020 21:35:12 -0300, Arnaldo Carvalho de Melo <arnaldo.m...@gmail.com> wrote:

> On January 10, 2020 9:23:27 PM GMT-03:00, Song Liu wrote:
> >
> >> On Jan 10, 2020, at 3:47 PM, Masami Hiramatsu wrote:
> >>
> >> On Fri, 10 Jan 2020 13:45:31 -0300
> >> Arnaldo Carvalho de Melo wrote:
> >>
> >>> Em Sat, Jan 11, 2020 at 12:52:13AM +0900, Masami Hiramatsu escreveu:
> >>>> On Fri, 10 Jan 2020 15:02:34 +0100 Peter Zijlstra wrote:
> >>>>> Again, this only allows attaching to previously created kprobes,
> >>>>> it does not allow creating kprobes, right?
> >>>>>
> >>>>> That is; I don't think CAP_SYS_PERFMON should be allowed to
> >>>>> create kprobes.
> >>>>>
> >>>>> As might be clear; I don't actually know what the user-ABI is
> >>>>> for creating kprobes.
> >>>>
> >>>> There are 2 ABIs nowadays, ftrace and ebpf. perf-probe uses the
> >>>> ftrace interface to define new kprobe events, and those events are
> >>>> treated exactly the same as tracepoint events. On the other hand,
> >>>> ebpf defines a new probe event via the perf_event interface. The
> >>>> above is that interface. IOW, it creates a new kprobe.
> >>>
> >>> Masami, any plans to make 'perf probe' use the perf_event_open()
> >>> interface for creating kprobes/uprobes?
> >>
> >> Would you mean switching perf probe to perf_event_open()?
> >> No, perf probe is for setting up the ftrace probe events. I think we
> >> can add an option to use perf_event_open(). But the current kprobe
> >> creation via perf_event_open() is separated from ftrace by design.
> >
> > I guess we can extend the event parser to understand kprobe directly.
> > Instead of
> >
> >     perf probe kernel_func
> >     perf stat/record -e probe:kernel_func ...
> >
> > we can just do
> >
> >     perf stat/record -e kprobe:kernel_func ...
>
> You took the words from my mouth, exactly, that is a perfect use case:
> an alternative to the 'perf probe' one of making a disabled event that
> then gets activated via record/stat/trace. In many cases it's better,
> as it removes the explicit probe setup step.

Ah, I got it. If the perf event parser just kicks perf's kprobe creation
interface, it will be easy. In that case, there would be the following
differences:

- perf * -e "kprobe":kernel_func will put a local (hidden) kprobe
  event, so ftrace users can not access it.
- perf * -e "kprobe":kernel_func may not support inline/function-body
  probes nor tracing local variables etc.

Hm, if we support inline functions via the -e "kprobe" interface, we
have to extend perf_event_open() to support multi-probe events.

Thanks,

--
Masami Hiramatsu
Re: [PATCH] lib: vdso: mark __cvdso_clock_getres() as static
Christophe Leroy writes:
> When __cvdso_clock_getres() became __cvdso_clock_getres_common()
> and a new __cvdso_clock_getres() was added, the static qualifier was
> forgotten.
>
> Fixes: 502a590a170b ("lib/vdso: Move fallback invocation to the callers")
> Signed-off-by: Christophe Leroy

I've already queued:

  https://lore.kernel.org/r/20191128111719.8282-1-vincenzo.frasc...@arm.com

but thanks for caring!

Thanks,

        tglx
[PATCH] lib: vdso: mark __cvdso_clock_getres() as static
When __cvdso_clock_getres() became __cvdso_clock_getres_common() and a
new __cvdso_clock_getres() was added, the static qualifier was forgotten.

Fixes: 502a590a170b ("lib/vdso: Move fallback invocation to the callers")
Signed-off-by: Christophe Leroy
---
 lib/vdso/gettimeofday.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 9ecfd3b547ba..42bd8ab955fa 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -221,6 +221,7 @@ int __cvdso_clock_getres_common(clockid_t clock, struct __kernel_timespec *res)
 	return 0;
 }
 
+static __maybe_unused
 int __cvdso_clock_getres(clockid_t clock, struct __kernel_timespec *res)
 {
 	int ret = __cvdso_clock_getres_common(clock, res);
-- 
2.13.3
Re: [PATCH v15 00/24] selftests, powerpc, x86: Memory Protection Keys
Hi Dave,

On 10/01/20 11:27 pm, Dave Hansen wrote:
>
> Could you dump these in a git tree, please? It will make it a wee bit
> easier for me to ship the resulting tree around to a couple different
> systems.

I have pushed a version of this series that uses u64 for all references
to the pkey register irrespective of architecture. This is available at:

  https://github.com/sandip4n/linux/tree/pkey-selftests

- Sandipan
Re: linux-next: build warning after merge of the bpf-next tree
On 1/10/20 7:20 PM, Palmer Dabbelt wrote:
> On Fri, 10 Jan 2020 14:28:17 PST (-0800), alexan...@ghiti.fr wrote:
>> Hi guys,
>>
>> On 10/27/19 8:02 PM, Stephen Rothwell wrote:
>>> Hi all,
>>>
>>> On Fri, 18 Oct 2019 10:56:57 +1100 Stephen Rothwell wrote:
>>>> Hi all,
>>>>
>>>> After merging the bpf-next tree, today's linux-next build (powerpc
>>>> ppc64_defconfig) produced this warning:
>>>>
>>>>   WARNING: 2 bad relocations
>>>>   c1998a48 R_PPC64_ADDR64 _binary__btf_vmlinux_bin_start
>>>>   c1998a50 R_PPC64_ADDR64 _binary__btf_vmlinux_bin_end
>>>>
>>>> Introduced by commit 8580ac9404f6 ("bpf: Process in-kernel BTF")
>>>
>>> This warning now appears in the net-next tree build.
>>
>> I bump this thread up because Zong also noticed that 2 new relocations
>> for those symbols appeared in my riscv relocatable kernel branch
>> following that commit. I also noticed 2 new R_AARCH64_ABS64
>> relocations appearing in the arm64 kernel.
>>
>> Those 2 weak undefined symbols have existed since commit 341dfcf8d78e
>> ("btf: expose BTF info through sysfs"), but it is the fact of
>> declaring those symbols in btf.c that produced those relocations. I'm
>> not sure what this all means, but this is not something I expected for
>> riscv for a kernel linked with -shared/-fpie. Maybe we should just
>> leave them at zero? I think that deserves a deeper look if someone
>> understands all this better than I do.
>
> Can you give me a pointer to your tree and how to build a relocatable
> kernel? Weak undefined symbols have the absolute value 0,

So according to you, the 2 new R_RISCV_64 relocations are normal and
should not be modified at runtime, right?

> but the kernel is linked at an address such that 0 can't be reached by
> normal means. When I added support to binutils for this I did it in a
> way that required almost no code -- essentially I just stopped
> disallowing x0 as a possible base register for PCREL relocations, which
> results in 0 always being accessible. I just wanted to get the kernel
> to build again, so I didn't worry about chasing around all the
> addressing modes.
>
> The PIC/PIE support generates different relocations and I wouldn't be
> surprised if I just missed one (or more likely all) of them. It's
> probably a simple fix, though I feel like every time I say that about
> the linker I end up spending a month in there...

You can find it here:

  https://github.com/AlexGhiti/riscv-linux/tree/int/alex/riscv_relocatable_v1

Zong fixed the bug introduced by those 2 new relocations and everything
works like a charm, so I'm not sure you have to dig into the linker :)

Alex
Re: linux-next: build warning after merge of the bpf-next tree
On 1/10/20 6:18 PM, Alexei Starovoitov wrote:
> On Fri, Jan 10, 2020 at 2:28 PM Alexandre Ghiti wrote:
>> Hi guys,
>>
>> On 10/27/19 8:02 PM, Stephen Rothwell wrote:
>>> Hi all,
>>>
>>> On Fri, 18 Oct 2019 10:56:57 +1100 Stephen Rothwell wrote:
>>>> Hi all,
>>>>
>>>> After merging the bpf-next tree, today's linux-next build (powerpc
>>>> ppc64_defconfig) produced this warning:
>>>>
>>>>   WARNING: 2 bad relocations
>>>>   c1998a48 R_PPC64_ADDR64 _binary__btf_vmlinux_bin_start
>>>>   c1998a50 R_PPC64_ADDR64 _binary__btf_vmlinux_bin_end
>>>>
>>>> Introduced by commit 8580ac9404f6 ("bpf: Process in-kernel BTF")
>>>
>>> This warning now appears in the net-next tree build.
>>
>> I bump this thread up because Zong also noticed that 2 new relocations
>> for those symbols appeared in my riscv relocatable kernel branch
>> following that commit. I also noticed 2 new R_AARCH64_ABS64
>> relocations appearing in the arm64 kernel.
>>
>> Those 2 weak undefined symbols have existed since commit 341dfcf8d78e
>> ("btf: expose BTF info through sysfs"), but it is the fact of
>> declaring those symbols in btf.c that produced those relocations. I'm
>> not sure what this all means, but this is not something I expected for
>> riscv for a kernel linked with -shared/-fpie. Maybe we should just
>> leave them at zero? I think that deserves a deeper look if someone
>> understands all this better than I do.
>
> Are you saying there is a warning for arm64 as well?

Nope.

> Can ppc folks explain the above warning? What does it mean
> "2 bad relocations"?

This is what I'd like to understand too; it is not clear from the ppc
tool that outputs this message why it is considered 'bad'.

> The code is doing:
>
>   extern char __weak _binary__btf_vmlinux_bin_start[];
>   extern char __weak _binary__btf_vmlinux_bin_end[];
>
> Since they are weak they should be zero when not defined.
> What's the issue?

There likely is no issue. I just want to make sure those relocations
are legitimate and to understand what we should do with them. At the
moment arm64 does not relocate those at runtime and simply ignores
them: is this the right thing to do?
Re: Surprising code generated for vdso_read_begin()
On Fri, Jan 10, 2020 at 07:45:44AM +0100, Christophe Leroy wrote:
> On 09/01/2020 at 21:07, Segher Boessenkool wrote:
> > It looks like the compiler did loop peeling. What GCC version is
> > this? Please try current trunk (to become GCC 10), or at least GCC 9?
>
> It is with GCC 5.5
>
> https://mirrors.edge.kernel.org/pub/tools/crosstool/ doesn't have
> anything more recent than 8.1

Arnd, can you update the tools? We are at 8.3 and 9.2 now :-) Or is
this hard and/or painful to do?

> With 8.1, the problem doesn't show up.

Good to hear. Hrm, I wonder when it was fixed...

Segher
Re: [RFC PATCH v2 07/10] lib: vdso: don't use READ_ONCE() in __c_kernel_time()
Christophe Leroy writes:
> With READ_ONCE() the 64 bits are being read:
>
> Without the READ_ONCE() only 32 bits are read. That's the most optimal.
>
> Without READ_ONCE() but with a barrier() after the read, we should get
> the same result but GCC (GCC 8.1) does less well:
>
> Assuming both parts of the 64-bit data fall into a single cacheline,
> the second read is in the noise.

They definitely are in the same cacheline.

> So agreed to drop this change.

We could be smart about this and force the compiler to issue a 32-bit
read for 32-bit builds. See below. Not sure whether it's worth it, but
OTOH it will take quite a while until the 32-bit time interfaces die
completely.

Thanks,

        tglx

8<-------------------
--- a/include/vdso/datapage.h
+++ b/include/vdso/datapage.h
@@ -21,6 +21,18 @@
 #define CS_RAW		1
 #define CS_BASES	(CS_RAW + 1)
 
+#ifdef __LITTLE_ENDIAN
+struct sec_hl {
+	u32	sec_l;
+	u32	sec_h;
+};
+#else
+struct sec_hl {
+	u32	sec_h;
+	u32	sec_l;
+};
+#endif
+
 /**
  * struct vdso_timestamp - basetime per clock_id
  * @sec:	seconds
@@ -35,7 +47,10 @@
  *	vdso_data.cs[x].shift.
  */
 struct vdso_timestamp {
-	u64	sec;
+	union {
+		u64		sec;
+		struct sec_hl	sec_hl;
+	};
 	u64	nsec;
 };
 
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -165,8 +165,13 @@ static __maybe_unused int
 static __maybe_unused __kernel_old_time_t __cvdso_time(__kernel_old_time_t *time)
 {
 	const struct vdso_data *vd = __arch_get_vdso_data();
-	__kernel_old_time_t t = READ_ONCE(vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec);
+	__kernel_old_time_t t;
 
+#if BITS_PER_LONG == 32
+	t = READ_ONCE(vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec_hl.sec_l);
+#else
+	t = READ_ONCE(vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec);
+#endif
 	if (time)
 		*time = t;
Re: [RFC PATCH v2 05/10] lib: vdso: inline do_hres()
On 01/10/2020 09:07 PM, Thomas Gleixner wrote:
> Arnd Bergmann writes:
>> On Mon, Dec 23, 2019 at 3:31 PM Christophe Leroy wrote:
>>> do_hres() is called from several places, so GCC doesn't inline it at
>>> first.
>>>
>>> do_hres() takes a struct __kernel_timespec * parameter for passing
>>> the result. In the 32 bits case, this parameter corresponds to a
>>> local var in the caller. In order to provide a pointer to this
>>> structure, the caller has to put it in its stack and do_hres() has
>>> to write the result in the stack. This is suboptimal, especially on
>>> RISC processors like powerpc.
>>>
>>> By making GCC inline the function, the struct __kernel_timespec
>>> remains a local var using registers, avoiding the need to write and
>>> read the stack.
>>>
>>> The improvement is significant on powerpc.
>>>
>>> Signed-off-by: Christophe Leroy
>>
>> Good idea, I can see how this ends up being an improvement for most
>> of the callers.
>>
>> Acked-by: Arnd Bergmann
>
> https://lore.kernel.org/r/20191112012724.250792-3-d...@arista.com
>
> is on the way to be applied.

Oh nice, I get an even better result with the way it is done by Dmitry
compared to my own first patch.

On an mpc8xx at 132 MHz (32-bit powerpc), before the patch I have:

  gettimeofday:                    vdso: 1256 nsec/call
  clock-gettime-monotonic-raw:     vdso: 1449 nsec/call
  clock-gettime-monotonic-coarse:  vdso:  768 nsec/call
  clock-gettime-monotonic:         vdso: 1390 nsec/call

With the patch I have:

  gettimeofday:                    vdso:  947 nsec/call
  clock-gettime-monotonic-raw:     vdso: 1156 nsec/call
  clock-gettime-monotonic-coarse:  vdso:  638 nsec/call
  clock-gettime-monotonic:         vdso: 1094 nsec/call

So that's a 20-25% improvement.

I modified it slightly as follows:

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 9e474d54814f..b793f211bca8 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -37,8 +37,8 @@ u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
 }
 #endif
 
-static int do_hres(const struct vdso_data *vd, clockid_t clk,
-		   struct __kernel_timespec *ts)
+static __always_inline int do_hres(const struct vdso_data *vd, clockid_t clk,
+				   struct __kernel_timespec *ts)
 {
 	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
 	u64 cycles, last, sec, ns;
@@ -67,8 +67,8 @@ static int do_hres(const struct vdso_data *vd, clockid_t clk,
 	return 0;
 }
 
-static void do_coarse(const struct vdso_data *vd, clockid_t clk,
-		      struct __kernel_timespec *ts)
+static __always_inline int do_coarse(const struct vdso_data *vd, clockid_t clk,
+				     struct __kernel_timespec *ts)
 {
 	const struct vdso_timestamp *vdso_ts = &vd->basetime[clk];
 	u32 seq;
@@ -78,6 +78,8 @@ static void do_coarse(const struct vdso_data *vd, clockid_t clk,
 		ts->tv_sec = vdso_ts->sec;
 		ts->tv_nsec = vdso_ts->nsec;
 	} while (unlikely(vdso_read_retry(vd, seq)));
+
+	return 0;
 }
 
 static __maybe_unused int
@@ -95,15 +97,16 @@ __cvdso_clock_gettime_common(const struct vdso_data *vd, clockid_t clock,
 	 * clocks are handled in the VDSO directly.
 	 */
 	msk = 1U << clock;
-	if (likely(msk & VDSO_HRES)) {
-		return do_hres(&vd[CS_HRES_COARSE], clock, ts);
-	} else if (msk & VDSO_COARSE) {
-		do_coarse(&vd[CS_HRES_COARSE], clock, ts);
-		return 0;
-	} else if (msk & VDSO_RAW) {
-		return do_hres(&vd[CS_RAW], clock, ts);
-	}
-	return -1;
+	if (likely(msk & VDSO_HRES))
+		vd += CS_HRES_COARSE;
+	else if (msk & VDSO_COARSE)
+		return do_coarse(&vd[CS_HRES_COARSE], clock, ts);
+	else if (msk & VDSO_RAW)
+		vd += CS_RAW;
+	else
+		return -1;
+
+	return do_hres(vd, clock, ts);
 }
 
 static __maybe_unused int
---

Christophe
Re: [RFC PATCH v2 07/10] lib: vdso: don't use READ_ONCE() in __c_kernel_time()
On 01/10/2020 09:12 PM, Thomas Gleixner wrote:
> Christophe Leroy writes:
>> diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
>> index 17b4cff6e5f0..5a17a9d2e6cd 100644
>> --- a/lib/vdso/gettimeofday.c
>> +++ b/lib/vdso/gettimeofday.c
>> @@ -144,7 +144,7 @@ __cvdso_gettimeofday(const struct vdso_data *vd, struct __kernel_old_timeval *tv
>>  static __maybe_unused __kernel_old_time_t
>>  __cvdso_time(const struct vdso_data *vd, __kernel_old_time_t *time)
>>  {
>> -	__kernel_old_time_t t = READ_ONCE(vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec);
>> +	__kernel_old_time_t t = vd[CS_HRES_COARSE].basetime[CLOCK_REALTIME].sec;
>>
>>  	if (time)
>>  		*time = t;
>
> Allows the compiler to load twice, i.e. the returned value might be
> different from the stored value. So no.

With READ_ONCE() the 64 bits are being read:

0ac8 <__c_kernel_time>:
 ac8:	2c 03 00 00 	cmpwi   r3,0
 acc:	81 44 00 20 	lwz     r10,32(r4)
 ad0:	81 64 00 24 	lwz     r11,36(r4)
 ad4:	41 82 00 08 	beq     adc <__c_kernel_time+0x14>
 ad8:	91 63 00 00 	stw     r11,0(r3)
 adc:	7d 63 5b 78 	mr      r3,r11
 ae0:	4e 80 00 20 	blr

Without the READ_ONCE() only 32 bits are read. That's optimal:

0ac8 <__c_kernel_time>:
 ac8:	7c 69 1b 79 	mr.     r9,r3
 acc:	80 64 00 24 	lwz     r3,36(r4)
 ad0:	4d 82 00 20 	beqlr
 ad4:	90 69 00 00 	stw     r3,0(r9)
 ad8:	4e 80 00 20 	blr

Without READ_ONCE() but with a barrier() after the read, we should get
the same result, but GCC (8.1) does less well:

0ac8 <__c_kernel_time>:
 ac8:	81 24 00 24 	lwz     r9,36(r4)
 acc:	2f 83 00 00 	cmpwi   cr7,r3,0
 ad0:	41 9e 00 08 	beq     cr7,ad8 <__c_kernel_time+0x10>
 ad4:	91 23 00 00 	stw     r9,0(r3)
 ad8:	7d 23 4b 78 	mr      r3,r9
 adc:	4e 80 00 20 	blr

Assuming both parts of the 64-bit data fall into a single cacheline,
the second read is in the noise. So agreed to drop this change.

Christophe