Re: [PATCH 08/12] arm64: dts: qcom: sm6115-pro1x: Update copyright year
On 22/07/2024 09:10, Dang Huynh wrote: > It's 2024, let's update the copyright year. > > Signed-off-by: Dang Huynh > --- > arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts > b/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts > index a32fc27bc783..6e9e4d9f8250 100644 > --- a/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts > +++ b/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts > @@ -1,6 +1,6 @@ > // SPDX-License-Identifier: (GPL-2.0+ OR BSD-3-Clause) > /* > - * Copyright (c) 2023, Dang Huynh > + * Copyright (c) 2023 - 2024, Dang Huynh Please squash it. I don't think copyright update has a merit on its own. Best regards, Krzysztof
[PATCH 08/12] arm64: dts: qcom: sm6115-pro1x: Update copyright year
It's 2024, let's update the copyright year. Signed-off-by: Dang Huynh --- arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts b/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts index a32fc27bc783..6e9e4d9f8250 100644 --- a/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts +++ b/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts @@ -1,6 +1,6 @@ // SPDX-License-Identifier: (GPL-2.0+ OR BSD-3-Clause) /* - * Copyright (c) 2023, Dang Huynh + * Copyright (c) 2023 - 2024, Dang Huynh */ /dts-v1/; -- 2.45.2
Re: [PATCH V2 0/7] vdpa live update
On 7/17/2024 8:33 PM, Jason Wang wrote: On Thu, Jul 18, 2024 at 2:29 AM Steven Sistare wrote: On 7/16/2024 1:30 AM, Jason Wang wrote: On Mon, Jul 15, 2024 at 10:29 PM Steven Sistare wrote: On 7/14/2024 10:14 PM, Jason Wang wrote: On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare wrote: Live update is a technique wherein an application saves its state, exec's to an updated version of itself, and restores its state. Clients of the application experience a brief suspension of service, on the order of 100's of milliseconds, but are otherwise unaffected. Define and implement interfaces that allow vdpa devices to be preserved across fork or exec, to support live update for applications such as QEMU. The device must be suspended during the update, but its DMA mappings are preserved, so the suspension is brief. The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory accounting from one process to another. The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that VHOST_NEW_OWNER is supported. The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland address in the new process. The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that VHOST_IOTLB_REMAP is supported and required. Some devices do not require it, because the userland address of each DMA mapping is discarded after being translated to a physical address. Here is a pseudo-code sequence for performing live update, based on suspend + reset because resume is not yet widely available. The vdpa device descriptor, fd, remains open across the exec. ioctl(fd, VHOST_VDPA_SUSPEND) ioctl(fd, VHOST_VDPA_SET_STATUS, 0) I don't understand why we need a reset after suspend, it looks to me the previous suspend became meaningless. The suspend guarantees completion of in-progress DMA. At least, that is my interpretation of why that is done for live migration in QEMU, which also does suspend + reset + re-create. I am following the live migration model. Yes, but any reason we need a reset after the suspension? Probably not. I found it cleanest to call reset and let new qemu configure the device as it always does during startup, rather than altering those code paths to skip the kernel calls. If we care about the downtime, I think avoiding a reset should be faster. Agreed. I want to try it some time, but I think what I have now using reset is decent enough for version 1. - Steve So, consider this to be just one of several possible userland algorithms. - Steve Thanks exec ioctl(fd, VHOST_NEW_OWNER) issue ioctls to re-create vrings if VHOST_BACKEND_F_IOTLB_REMAP So the idea is for a device that is using a virtual address, it doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all? Actually the reverse: if the device translates virtual to physical when the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP is not needed. Ok. foreach dma mapping write(fd, {VHOST_IOTLB_REMAP, new_addr}) ioctl(fd, VHOST_VDPA_SET_STATUS, ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK) From API level, this seems to be asymmetric as we have suspending but not resuming? Again, I am just following the path taken by live migration. I will be happy to use resume when the devices and QEMU support it. The decision to use reset vs resume should not affect the definition and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP. - Steve Thanks
Re: [PATCH 09/17] x86/numa_emu: split __apicid_to_node update to a helper function
On Tue, 16 Jul 2024 14:13:38 +0300 Mike Rapoport wrote: > From: "Mike Rapoport (Microsoft)" > > This is required to make numa emulation code architecture independent so > that it can be moved to generic code in following commits. > > Signed-off-by: Mike Rapoport (Microsoft) Not the most intuitive of function names but I can't immediately think of a better one. Reviewed-by: Jonathan Cameron
Re: [PATCH V2 0/7] vdpa live update
On Thu, Jul 18, 2024 at 2:29 AM Steven Sistare wrote: > > On 7/16/2024 1:30 AM, Jason Wang wrote: > > On Mon, Jul 15, 2024 at 10:29 PM Steven Sistare > > wrote: > >> > >> On 7/14/2024 10:14 PM, Jason Wang wrote: > >>> On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare > >>> wrote: > >>>> > >>>> Live update is a technique wherein an application saves its state, exec's > >>>> to an updated version of itself, and restores its state. Clients of the > >>>> application experience a brief suspension of service, on the order of > >>>> 100's of milliseconds, but are otherwise unaffected. > >>>> > >>>> Define and implement interfaces that allow vdpa devices to be preserved > >>>> across fork or exec, to support live update for applications such as > >>>> QEMU. > >>>> The device must be suspended during the update, but its DMA mappings are > >>>> preserved, so the suspension is brief. > >>>> > >>>> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory > >>>> accounting from one process to another. > >>>> > >>>> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that > >>>> VHOST_NEW_OWNER is supported. > >>>> > >>>> The VHOST_IOTLB_REMAP message type updates a DMA mapping with its > >>>> userland > >>>> address in the new process. > >>>> > >>>> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that > >>>> VHOST_IOTLB_REMAP is supported and required. Some devices do not > >>>> require it, because the userland address of each DMA mapping is discarded > >>>> after being translated to a physical address. > >>>> > >>>> Here is a pseudo-code sequence for performing live update, based on > >>>> suspend + reset because resume is not yet widely available. The vdpa > >>>> device > >>>> descriptor, fd, remains open across the exec. > >>>> > >>>> ioctl(fd, VHOST_VDPA_SUSPEND) > >>>> ioctl(fd, VHOST_VDPA_SET_STATUS, 0) > >>> > >>> I don't understand why we need a reset after suspend, it looks to me > >>> the previous suspend became meaningless. > >> > >> The suspend guarantees completion of in-progress DMA. At least, that is > >> my interpretation of why that is done for live migration in QEMU, which > >> also does suspend + reset + re-create. I am following the live migration > >> model. > > > > Yes, but any reason we need a reset after the suspension? > > Probably not. I found it cleanest to call reset and let new qemu configure > the > device as it always does during startup, rather than altering those code paths > to skip the kernel calls. If we care about the downtime, I think avoiding a reset should be faster. > So, consider this to be just one of several possible > userland algorithms. > > - Steve Thanks > > >>>> exec > >>>> > >>>> ioctl(fd, VHOST_NEW_OWNER) > >>>> > >>>> issue ioctls to re-create vrings > >>>> > >>>> if VHOST_BACKEND_F_IOTLB_REMAP > >>> > >>> So the idea is for a device that is using a virtual address, it > >>> doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all? > >> > >> Actually the reverse: if the device translates virtual to physical when > >> the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP > >> is not needed. > > > > Ok. > > > >> > >>>> foreach dma mapping > >>>> write(fd, {VHOST_IOTLB_REMAP, new_addr}) > >>>> > >>>> ioctl(fd, VHOST_VDPA_SET_STATUS, > >>>> ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK) > >>> > >>> From API level, this seems to be asymmetric as we have suspending but > >>> not resuming? > >> > >> Again, I am just following the path taken by live migration. > >> I will be happy to use resume when the devices and QEMU support it. > >> The decision to use reset vs resume should not affect the definition > >> and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP. > >> > >> - Steve > > > > Thanks > > >
Re: [PATCH V2 0/7] vdpa live update
On 7/16/2024 1:30 AM, Jason Wang wrote: On Mon, Jul 15, 2024 at 10:29 PM Steven Sistare wrote: On 7/14/2024 10:14 PM, Jason Wang wrote: On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare wrote: Live update is a technique wherein an application saves its state, exec's to an updated version of itself, and restores its state. Clients of the application experience a brief suspension of service, on the order of 100's of milliseconds, but are otherwise unaffected. Define and implement interfaces that allow vdpa devices to be preserved across fork or exec, to support live update for applications such as QEMU. The device must be suspended during the update, but its DMA mappings are preserved, so the suspension is brief. The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory accounting from one process to another. The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that VHOST_NEW_OWNER is supported. The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland address in the new process. The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that VHOST_IOTLB_REMAP is supported and required. Some devices do not require it, because the userland address of each DMA mapping is discarded after being translated to a physical address. Here is a pseudo-code sequence for performing live update, based on suspend + reset because resume is not yet widely available. The vdpa device descriptor, fd, remains open across the exec. ioctl(fd, VHOST_VDPA_SUSPEND) ioctl(fd, VHOST_VDPA_SET_STATUS, 0) I don't understand why we need a reset after suspend, it looks to me the previous suspend became meaningless. The suspend guarantees completion of in-progress DMA. At least, that is my interpretation of why that is done for live migration in QEMU, which also does suspend + reset + re-create. I am following the live migration model. Yes, but any reason we need a reset after the suspension? Probably not. I found it cleanest to call reset and let new qemu configure the device as it always does during startup, rather than altering those code paths to skip the kernel calls. So, consider this to be just one of several possible userland algorithms. - Steve exec ioctl(fd, VHOST_NEW_OWNER) issue ioctls to re-create vrings if VHOST_BACKEND_F_IOTLB_REMAP So the idea is for a device that is using a virtual address, it doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all? Actually the reverse: if the device translates virtual to physical when the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP is not needed. Ok. foreach dma mapping write(fd, {VHOST_IOTLB_REMAP, new_addr}) ioctl(fd, VHOST_VDPA_SET_STATUS, ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK) From API level, this seems to be asymmetric as we have suspending but not resuming? Again, I am just following the path taken by live migration. I will be happy to use resume when the devices and QEMU support it. The decision to use reset vs resume should not affect the definition and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP. - Steve Thanks
[PATCH 09/17] x86/numa_emu: split __apicid_to_node update to a helper function
From: "Mike Rapoport (Microsoft)" This is required to make numa emulation code architecture independent so that it can be moved to generic code in following commits. Signed-off-by: Mike Rapoport (Microsoft) --- arch/x86/include/asm/numa.h | 2 ++ arch/x86/mm/numa.c | 22 ++ arch/x86/mm/numa_emulation.c | 14 +- 3 files changed, 25 insertions(+), 13 deletions(-) diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h index 2dab1ada96cf..7017d540894a 100644 --- a/arch/x86/include/asm/numa.h +++ b/arch/x86/include/asm/numa.h @@ -72,6 +72,8 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable); #ifdef CONFIG_NUMA_EMU int numa_emu_cmdline(char *str); +void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys, + unsigned int nr_emu_nids); #else /* CONFIG_NUMA_EMU */ static inline int numa_emu_cmdline(char *str) { diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index ab2d4ecef786..1320d776caed 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -852,6 +852,28 @@ EXPORT_SYMBOL(cpumask_of_node); #endif /* !CONFIG_DEBUG_PER_CPU_MAPS */ +#ifdef CONFIG_NUMA_EMU +void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys, + unsigned int nr_emu_nids) +{ + int i, j; + + /* +* Transform __apicid_to_node table to use emulated nids by +* reverse-mapping phys_nid. The maps should always exist but fall +* back to zero just in case. +*/ + for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) { + if (__apicid_to_node[i] == NUMA_NO_NODE) + continue; + for (j = 0; j < nr_emu_nids; j++) + if (__apicid_to_node[i] == emu_nid_to_phys[j]) + break; + __apicid_to_node[i] = j < nr_emu_nids ? j : 0; + } +} +#endif /* CONFIG_NUMA_EMU */ + #ifdef CONFIG_NUMA_KEEP_MEMINFO static int meminfo_to_nid(struct numa_meminfo *mi, u64 start) { diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c index 439804e21962..f2746e52ab93 100644 --- a/arch/x86/mm/numa_emulation.c +++ b/arch/x86/mm/numa_emulation.c @@ -476,19 +476,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt) ei.blk[i].nid != NUMA_NO_NODE) node_set(ei.blk[i].nid, numa_nodes_parsed); - /* -* Transform __apicid_to_node table to use emulated nids by -* reverse-mapping phys_nid. The maps should always exist but fall -* back to zero just in case. -*/ - for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) { - if (__apicid_to_node[i] == NUMA_NO_NODE) - continue; - for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++) - if (__apicid_to_node[i] == emu_nid_to_phys[j]) - break; - __apicid_to_node[i] = j < ARRAY_SIZE(emu_nid_to_phys) ? j : 0; - } + numa_emu_update_cpu_to_node(emu_nid_to_phys, ARRAY_SIZE(emu_nid_to_phys)); /* make sure all emulated nodes are mapped to a physical node */ for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) -- 2.43.0
Re: [PATCH] tracing: Update MAINTAINERS file
On Mon, 15 Jul 2024 14:47:45 -0400 Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > Gone but never forgotten. > > [ Also moved Daniel's name to be consistent with the alphabetical order ] > Reviewed-by: Masami Hiramatsu (Google) Thank you, > Signed-off-by: Steven Rostedt (Google) > --- > CREDITS | 10 +++--- > MAINTAINERS | 3 --- > 2 files changed, 7 insertions(+), 6 deletions(-) > > diff --git a/CREDITS b/CREDITS > index 88c4c08cb613..d7798910e99f 100644 > --- a/CREDITS > +++ b/CREDITS > @@ -271,9 +271,6 @@ D: Driver for WaveFront soundcards (Turtle Beach Maui, > Tropez, Tropez+) > D: Various bugfixes and changes to sound drivers > S: USA > > -N: Daniel Bristot de Oliveira > -D: Scheduler contributions, notably: SCHED_DEADLINE > - > N: Carlos Henrique Bauer > E: chba...@acm.org > E: ba...@atlas.unisinos.br > @@ -534,6 +531,13 @@ S: Kopmansg 2 > S: 411 13 Goteborg > S: Sweden > > +N: Daniel Bristot de Oliveira > +D: Scheduler contributions, notably: SCHED_DEADLINE > +D: Real-time Linux Analysis > +D: Runtime Verification > +D: OS Noise and Latency Tracers > +S: Brazil and Italy > + > N: Paul Bristow > E: p...@paulbristow.net > W: https://paulbristow.net/linux/idefloppy.html > diff --git a/MAINTAINERS b/MAINTAINERS > index 2e1b8bbacb5e..0b7e4cd4ffd7 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -18895,7 +18895,6 @@ F:include/uapi/linux/rtc.h > F: tools/testing/selftests/rtc/ > > Real-time Linux Analysis (RTLA) tools > -M: Daniel Bristot de Oliveira > M: Steven Rostedt > L: linux-trace-ker...@vger.kernel.org > S: Maintained > @@ -19529,7 +19528,6 @@ S:Maintained > F: drivers/infiniband/ulp/rtrs/ > > RUNTIME VERIFICATION (RV) > -M: Daniel Bristot de Oliveira > M: Steven Rostedt > L: linux-trace-ker...@vger.kernel.org > S: Maintained > @@ -22806,7 +22804,6 @@ F:kernel/trace/trace_mmiotrace.c > > TRACING OS NOISE / LATENCY TRACERS > M: Steven Rostedt > -M: Daniel Bristot de Oliveira > S: Maintained > F: Documentation/trace/hwlat_detector.rst > F: Documentation/trace/osnoise-tracer.rst > -- > 2.43.0 > -- Masami Hiramatsu (Google)
Re: [PATCH V2 0/7] vdpa live update
On Mon, Jul 15, 2024 at 10:29 PM Steven Sistare wrote: > > On 7/14/2024 10:14 PM, Jason Wang wrote: > > On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare > > wrote: > >> > >> Live update is a technique wherein an application saves its state, exec's > >> to an updated version of itself, and restores its state. Clients of the > >> application experience a brief suspension of service, on the order of > >> 100's of milliseconds, but are otherwise unaffected. > >> > >> Define and implement interfaces that allow vdpa devices to be preserved > >> across fork or exec, to support live update for applications such as QEMU. > >> The device must be suspended during the update, but its DMA mappings are > >> preserved, so the suspension is brief. > >> > >> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory > >> accounting from one process to another. > >> > >> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that > >> VHOST_NEW_OWNER is supported. > >> > >> The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland > >> address in the new process. > >> > >> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that > >> VHOST_IOTLB_REMAP is supported and required. Some devices do not > >> require it, because the userland address of each DMA mapping is discarded > >> after being translated to a physical address. > >> > >> Here is a pseudo-code sequence for performing live update, based on > >> suspend + reset because resume is not yet widely available. The vdpa > >> device > >> descriptor, fd, remains open across the exec. > >> > >>ioctl(fd, VHOST_VDPA_SUSPEND) > >>ioctl(fd, VHOST_VDPA_SET_STATUS, 0) > > > > I don't understand why we need a reset after suspend, it looks to me > > the previous suspend became meaningless. > > The suspend guarantees completion of in-progress DMA. At least, that is > my interpretation of why that is done for live migration in QEMU, which > also does suspend + reset + re-create. I am following the live migration > model. Yes, but any reason we need a reset after the suspension? > > >>exec > >> > >>ioctl(fd, VHOST_NEW_OWNER) > >> > >>issue ioctls to re-create vrings > >> > >>if VHOST_BACKEND_F_IOTLB_REMAP > > > > So the idea is for a device that is using a virtual address, it > > doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all? > > Actually the reverse: if the device translates virtual to physical when > the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP > is not needed. Ok. > > >>foreach dma mapping > >>write(fd, {VHOST_IOTLB_REMAP, new_addr}) > >> > >>ioctl(fd, VHOST_VDPA_SET_STATUS, > >> ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK) > > > > From API level, this seems to be asymmetric as we have suspending but > > not resuming? > > Again, I am just following the path taken by live migration. > I will be happy to use resume when the devices and QEMU support it. > The decision to use reset vs resume should not affect the definition > and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP. > > - Steve Thanks
Re: [PATCH] tracing: Update MAINTAINERS file
On 2024-07-15 15:13, Steven Rostedt wrote: On Mon, 15 Jul 2024 15:10:17 -0400 Mathieu Desnoyers wrote: On 2024-07-15 14:47, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Gone but never forgotten. [ Also moved Daniel's name to be consistent with the alphabetical order ] Hi Steven, You appear to have missed this entry from SCHEDULER: R: Daniel Bristot de Oliveira (SCHED_DEADLINE) Should it be done in a separate commit or folded in this one ? That has already been done: https://lore.kernel.org/all/20240708075752.gf11...@noisy.programming.kicks-ass.net/ This patch is actually on top of that one. Sorry I missed that. Reviewed-by: Mathieu Desnoyers -- Steve -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Update MAINTAINERS file
On Mon, 15 Jul 2024 15:10:17 -0400 Mathieu Desnoyers wrote: > On 2024-07-15 14:47, Steven Rostedt wrote: > > From: "Steven Rostedt (Google)" > > > > Gone but never forgotten. > > > > [ Also moved Daniel's name to be consistent with the alphabetical order ] > > Hi Steven, > > You appear to have missed this entry from SCHEDULER: > > R: Daniel Bristot de Oliveira (SCHED_DEADLINE) > > Should it be done in a separate commit or folded in this one ? That has already been done: https://lore.kernel.org/all/20240708075752.gf11...@noisy.programming.kicks-ass.net/ This patch is actually on top of that one. -- Steve
Re: [PATCH] tracing: Update MAINTAINERS file
On 2024-07-15 14:47, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Gone but never forgotten. [ Also moved Daniel's name to be consistent with the alphabetical order ] Hi Steven, You appear to have missed this entry from SCHEDULER: R: Daniel Bristot de Oliveira (SCHED_DEADLINE) Should it be done in a separate commit or folded in this one ? Thanks, Mathieu Signed-off-by: Steven Rostedt (Google) --- CREDITS | 10 +++--- MAINTAINERS | 3 --- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/CREDITS b/CREDITS index 88c4c08cb613..d7798910e99f 100644 --- a/CREDITS +++ b/CREDITS @@ -271,9 +271,6 @@ D: Driver for WaveFront soundcards (Turtle Beach Maui, Tropez, Tropez+) D: Various bugfixes and changes to sound drivers S: USA -N: Daniel Bristot de Oliveira -D: Scheduler contributions, notably: SCHED_DEADLINE - N: Carlos Henrique Bauer E: chba...@acm.org E: ba...@atlas.unisinos.br @@ -534,6 +531,13 @@ S: Kopmansg 2 S: 411 13 Goteborg S: Sweden +N: Daniel Bristot de Oliveira +D: Scheduler contributions, notably: SCHED_DEADLINE +D: Real-time Linux Analysis +D: Runtime Verification +D: OS Noise and Latency Tracers +S: Brazil and Italy + N: Paul Bristow E: p...@paulbristow.net W: https://paulbristow.net/linux/idefloppy.html diff --git a/MAINTAINERS b/MAINTAINERS index 2e1b8bbacb5e..0b7e4cd4ffd7 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -18895,7 +18895,6 @@ F: include/uapi/linux/rtc.h F:tools/testing/selftests/rtc/ Real-time Linux Analysis (RTLA) tools -M: Daniel Bristot de Oliveira M:Steven Rostedt L:linux-trace-ker...@vger.kernel.org S:Maintained @@ -19529,7 +19528,6 @@ S: Maintained F:drivers/infiniband/ulp/rtrs/ RUNTIME VERIFICATION (RV) -M: Daniel Bristot de Oliveira M:Steven Rostedt L:linux-trace-ker...@vger.kernel.org S:Maintained @@ -22806,7 +22804,6 @@ F: kernel/trace/trace_mmiotrace.c TRACING OS NOISE / LATENCY TRACERS M:Steven Rostedt -M: Daniel Bristot de Oliveira S:Maintained F:Documentation/trace/hwlat_detector.rst F:Documentation/trace/osnoise-tracer.rst -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[PATCH] tracing: Update MAINTAINERS file
From: "Steven Rostedt (Google)" Gone but never forgotten. [ Also moved Daniel's name to be consistent with the alphabetical order ] Signed-off-by: Steven Rostedt (Google) --- CREDITS | 10 +++--- MAINTAINERS | 3 --- 2 files changed, 7 insertions(+), 6 deletions(-) diff --git a/CREDITS b/CREDITS index 88c4c08cb613..d7798910e99f 100644 --- a/CREDITS +++ b/CREDITS @@ -271,9 +271,6 @@ D: Driver for WaveFront soundcards (Turtle Beach Maui, Tropez, Tropez+) D: Various bugfixes and changes to sound drivers S: USA -N: Daniel Bristot de Oliveira -D: Scheduler contributions, notably: SCHED_DEADLINE - N: Carlos Henrique Bauer E: chba...@acm.org E: ba...@atlas.unisinos.br @@ -534,6 +531,13 @@ S: Kopmansg 2 S: 411 13 Goteborg S: Sweden +N: Daniel Bristot de Oliveira +D: Scheduler contributions, notably: SCHED_DEADLINE +D: Real-time Linux Analysis +D: Runtime Verification +D: OS Noise and Latency Tracers +S: Brazil and Italy + N: Paul Bristow E: p...@paulbristow.net W: https://paulbristow.net/linux/idefloppy.html diff --git a/MAINTAINERS b/MAINTAINERS index 2e1b8bbacb5e..0b7e4cd4ffd7 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -18895,7 +18895,6 @@ F: include/uapi/linux/rtc.h F: tools/testing/selftests/rtc/ Real-time Linux Analysis (RTLA) tools -M: Daniel Bristot de Oliveira M: Steven Rostedt L: linux-trace-ker...@vger.kernel.org S: Maintained @@ -19529,7 +19528,6 @@ S: Maintained F: drivers/infiniband/ulp/rtrs/ RUNTIME VERIFICATION (RV) -M: Daniel Bristot de Oliveira M: Steven Rostedt L: linux-trace-ker...@vger.kernel.org S: Maintained @@ -22806,7 +22804,6 @@ F: kernel/trace/trace_mmiotrace.c TRACING OS NOISE / LATENCY TRACERS M: Steven Rostedt -M: Daniel Bristot de Oliveira S: Maintained F: Documentation/trace/hwlat_detector.rst F: Documentation/trace/osnoise-tracer.rst -- 2.43.0
Re: [PATCH V2 0/7] vdpa live update
On 7/14/2024 10:14 PM, Jason Wang wrote: On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare wrote: Live update is a technique wherein an application saves its state, exec's to an updated version of itself, and restores its state. Clients of the application experience a brief suspension of service, on the order of 100's of milliseconds, but are otherwise unaffected. Define and implement interfaces that allow vdpa devices to be preserved across fork or exec, to support live update for applications such as QEMU. The device must be suspended during the update, but its DMA mappings are preserved, so the suspension is brief. The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory accounting from one process to another. The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that VHOST_NEW_OWNER is supported. The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland address in the new process. The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that VHOST_IOTLB_REMAP is supported and required. Some devices do not require it, because the userland address of each DMA mapping is discarded after being translated to a physical address. Here is a pseudo-code sequence for performing live update, based on suspend + reset because resume is not yet widely available. The vdpa device descriptor, fd, remains open across the exec. ioctl(fd, VHOST_VDPA_SUSPEND) ioctl(fd, VHOST_VDPA_SET_STATUS, 0) I don't understand why we need a reset after suspend, it looks to me the previous suspend became meaningless. The suspend guarantees completion of in-progress DMA. At least, that is my interpretation of why that is done for live migration in QEMU, which also does suspend + reset + re-create. I am following the live migration model. exec ioctl(fd, VHOST_NEW_OWNER) issue ioctls to re-create vrings if VHOST_BACKEND_F_IOTLB_REMAP So the idea is for a device that is using a virtual address, it doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all? Actually the reverse: if the device translates virtual to physical when the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP is not needed. foreach dma mapping write(fd, {VHOST_IOTLB_REMAP, new_addr}) ioctl(fd, VHOST_VDPA_SET_STATUS, ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK) From API level, this seems to be asymmetric as we have suspending but not resuming? Again, I am just following the path taken by live migration. I will be happy to use resume when the devices and QEMU support it. The decision to use reset vs resume should not affect the definition and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP. - Steve This is faster than VHOST_RESET_OWNER + VHOST_SET_OWNER + VHOST_IOTLB_UPDATE, as that would would unpin and repin physical pages, which would cost multiple seconds for large memories. This is implemented in QEMU by the patch series "Live update: vdpa" https://lore.kernel.org/qemu-devel/TBD (reference to be posted shortly) The QEMU implementation leverages the live migration code path, but after CPR exec's new QEMU: - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of VHOST_IOTLB_UPDATE Changes in V2: - clean up handling of set_map vs dma_map vs platform iommu in remap - augment and clarify commit messages and comments Steve Sistare (7): vhost-vdpa: count pinned memory vhost-vdpa: pass mm to bind vhost-vdpa: VHOST_NEW_OWNER vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER vhost-vdpa: VHOST_IOTLB_REMAP vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP vdpa/mlx5: new owner capability drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 +- drivers/vhost/vdpa.c | 125 -- drivers/vhost/vhost.c | 15 drivers/vhost/vhost.h | 1 + include/uapi/linux/vhost.h| 10 +++ include/uapi/linux/vhost_types.h | 15 +++- 6 files changed, 161 insertions(+), 8 deletions(-) -- 2.39.3 Thanks
Re: [PATCH V2 0/7] vdpa live update
On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare wrote: > > Live update is a technique wherein an application saves its state, exec's > to an updated version of itself, and restores its state. Clients of the > application experience a brief suspension of service, on the order of > 100's of milliseconds, but are otherwise unaffected. > > Define and implement interfaces that allow vdpa devices to be preserved > across fork or exec, to support live update for applications such as QEMU. > The device must be suspended during the update, but its DMA mappings are > preserved, so the suspension is brief. > > The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory > accounting from one process to another. > > The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that > VHOST_NEW_OWNER is supported. > > The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland > address in the new process. > > The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that > VHOST_IOTLB_REMAP is supported and required. Some devices do not > require it, because the userland address of each DMA mapping is discarded > after being translated to a physical address. > > Here is a pseudo-code sequence for performing live update, based on > suspend + reset because resume is not yet widely available. The vdpa device > descriptor, fd, remains open across the exec. > > ioctl(fd, VHOST_VDPA_SUSPEND) > ioctl(fd, VHOST_VDPA_SET_STATUS, 0) I don't understand why we need a reset after suspend, it looks to me the previous suspend became meaningless. > exec > > ioctl(fd, VHOST_NEW_OWNER) > > issue ioctls to re-create vrings > > if VHOST_BACKEND_F_IOTLB_REMAP So the idea is for a device that is using a virtual address, it doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all? > foreach dma mapping > write(fd, {VHOST_IOTLB_REMAP, new_addr}) > > ioctl(fd, VHOST_VDPA_SET_STATUS, > ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK) >From API level, this seems to be asymmetric as we have suspending but not resuming? > > This is faster than VHOST_RESET_OWNER + VHOST_SET_OWNER + VHOST_IOTLB_UPDATE, > as that would would unpin and repin physical pages, which would cost multiple > seconds for large memories. > > This is implemented in QEMU by the patch series "Live update: vdpa" > https://lore.kernel.org/qemu-devel/TBD (reference to be posted shortly) > > The QEMU implementation leverages the live migration code path, but after > CPR exec's new QEMU: > - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER > - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of > VHOST_IOTLB_UPDATE > > Changes in V2: > - clean up handling of set_map vs dma_map vs platform iommu in remap > - augment and clarify commit messages and comments > > Steve Sistare (7): > vhost-vdpa: count pinned memory > vhost-vdpa: pass mm to bind > vhost-vdpa: VHOST_NEW_OWNER > vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER > vhost-vdpa: VHOST_IOTLB_REMAP > vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP > vdpa/mlx5: new owner capability > > drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 +- > drivers/vhost/vdpa.c | 125 -- > drivers/vhost/vhost.c | 15 > drivers/vhost/vhost.h | 1 + > include/uapi/linux/vhost.h| 10 +++ > include/uapi/linux/vhost_types.h | 15 +++- > 6 files changed, 161 insertions(+), 8 deletions(-) > > -- > 2.39.3 > Thanks
Re: [PATCH V2 0/7] vdpa live update
On 7/12/2024 9:18 AM, Steve Sistare wrote: Live update is a technique wherein an application saves its state, exec's to an updated version of itself, and restores its state. Clients of the application experience a brief suspension of service, on the order of 100's of milliseconds, but are otherwise unaffected. Define and implement interfaces that allow vdpa devices to be preserved across fork or exec, to support live update for applications such as QEMU. The device must be suspended during the update, but its DMA mappings are preserved, so the suspension is brief. The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory accounting from one process to another. The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that VHOST_NEW_OWNER is supported. The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland address in the new process. The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that VHOST_IOTLB_REMAP is supported and required. Some devices do not require it, because the userland address of each DMA mapping is discarded after being translated to a physical address. Here is a pseudo-code sequence for performing live update, based on suspend + reset because resume is not yet widely available. The vdpa device descriptor, fd, remains open across the exec. ioctl(fd, VHOST_VDPA_SUSPEND) ioctl(fd, VHOST_VDPA_SET_STATUS, 0) exec ioctl(fd, VHOST_NEW_OWNER) issue ioctls to re-create vrings if VHOST_BACKEND_F_IOTLB_REMAP foreach dma mapping write(fd, {VHOST_IOTLB_REMAP, new_addr}) ioctl(fd, VHOST_VDPA_SET_STATUS, ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK) This is faster than VHOST_RESET_OWNER + VHOST_SET_OWNER + VHOST_IOTLB_UPDATE, as that would would unpin and repin physical pages, which would cost multiple seconds for large memories. This is implemented in QEMU by the patch series "Live update: vdpa" https://lore.kernel.org/qemu-devel/TBD (reference to be posted shortly) https://lore.kernel.org/qemu-devel/1720792931-456433-3-git-send-email-steven.sist...@oracle.com The QEMU implementation leverages the live migration code path, but after CPR exec's new QEMU: - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of VHOST_IOTLB_UPDATE Changes in V2: - clean up handling of set_map vs dma_map vs platform iommu in remap - augment and clarify commit messages and comments Steve Sistare (7): vhost-vdpa: count pinned memory vhost-vdpa: pass mm to bind vhost-vdpa: VHOST_NEW_OWNER vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER vhost-vdpa: VHOST_IOTLB_REMAP vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP vdpa/mlx5: new owner capability drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 +- drivers/vhost/vdpa.c | 125 -- drivers/vhost/vhost.c | 15 drivers/vhost/vhost.h | 1 + include/uapi/linux/vhost.h| 10 +++ include/uapi/linux/vhost_types.h | 15 +++- 6 files changed, 161 insertions(+), 8 deletions(-)
[PATCH V2 0/7] vdpa live update
Live update is a technique wherein an application saves its state, exec's to an updated version of itself, and restores its state. Clients of the application experience a brief suspension of service, on the order of 100's of milliseconds, but are otherwise unaffected. Define and implement interfaces that allow vdpa devices to be preserved across fork or exec, to support live update for applications such as QEMU. The device must be suspended during the update, but its DMA mappings are preserved, so the suspension is brief. The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory accounting from one process to another. The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that VHOST_NEW_OWNER is supported. The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland address in the new process. The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that VHOST_IOTLB_REMAP is supported and required. Some devices do not require it, because the userland address of each DMA mapping is discarded after being translated to a physical address. Here is a pseudo-code sequence for performing live update, based on suspend + reset because resume is not yet widely available. The vdpa device descriptor, fd, remains open across the exec. ioctl(fd, VHOST_VDPA_SUSPEND) ioctl(fd, VHOST_VDPA_SET_STATUS, 0) exec ioctl(fd, VHOST_NEW_OWNER) issue ioctls to re-create vrings if VHOST_BACKEND_F_IOTLB_REMAP foreach dma mapping write(fd, {VHOST_IOTLB_REMAP, new_addr}) ioctl(fd, VHOST_VDPA_SET_STATUS, ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK) This is faster than VHOST_RESET_OWNER + VHOST_SET_OWNER + VHOST_IOTLB_UPDATE, as that would would unpin and repin physical pages, which would cost multiple seconds for large memories. This is implemented in QEMU by the patch series "Live update: vdpa" https://lore.kernel.org/qemu-devel/TBD (reference to be posted shortly) The QEMU implementation leverages the live migration code path, but after CPR exec's new QEMU: - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of VHOST_IOTLB_UPDATE Changes in V2: - clean up handling of set_map vs dma_map vs platform iommu in remap - augment and clarify commit messages and comments Steve Sistare (7): vhost-vdpa: count pinned memory vhost-vdpa: pass mm to bind vhost-vdpa: VHOST_NEW_OWNER vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER vhost-vdpa: VHOST_IOTLB_REMAP vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP vdpa/mlx5: new owner capability drivers/vdpa/mlx5/net/mlx5_vnet.c | 3 +- drivers/vhost/vdpa.c | 125 -- drivers/vhost/vhost.c | 15 drivers/vhost/vhost.h | 1 + include/uapi/linux/vhost.h| 10 +++ include/uapi/linux/vhost_types.h | 15 +++- 6 files changed, 161 insertions(+), 8 deletions(-) -- 2.39.3
Re: [PATCH v2 01/12] uprobes: update outdated comment
On Wed, Jul 10, 2024 at 6:33 AM Oleg Nesterov wrote: > > On 07/03, Oleg Nesterov wrote: > > > > > /* > > > -* The NULL 'tsk' here ensures that any faults that occur here > > > -* will not be accounted to the task. 'mm' *is* current->mm, > > > -* but we treat this as a 'remote' access since it is > > > -* essentially a kernel access to the memory. > > > +* 'mm' *is* current->mm, but we treat this as a 'remote' access since > > > +* it is essentially a kernel access to the memory. > > > */ > > > result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL); > > > > OK, this makes it less confusing, so > > > > Acked-by: Oleg Nesterov > > > > - > > but it still looks confusing to me. This code used to pass tsk = NULL > > only to avoid tsk->maj/min_flt++ in faultin_page(). > > > > But today mm_account_fault() increments these counters without checking > > FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to > > just use get_user_pages() and remove this comment? > > Well, yes, it still looks confusing, imo. > > Andrii, I hope you won't mind if I redo/resend this and the next cleanup? > > The next one only updates the comment above uprobe_write_opcode(), but > it would be nice to explain mmap_write_lock() in register_for_each_vma(). > I don't mind a bit, thanks for sending the patches! > Oleg. > >
Re: [PATCH v2 01/12] uprobes: update outdated comment
On 07/03, Oleg Nesterov wrote: > > > /* > > -* The NULL 'tsk' here ensures that any faults that occur here > > -* will not be accounted to the task. 'mm' *is* current->mm, > > -* but we treat this as a 'remote' access since it is > > -* essentially a kernel access to the memory. > > +* 'mm' *is* current->mm, but we treat this as a 'remote' access since > > +* it is essentially a kernel access to the memory. > > */ > > result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL); > > OK, this makes it less confusing, so > > Acked-by: Oleg Nesterov > > - > but it still looks confusing to me. This code used to pass tsk = NULL > only to avoid tsk->maj/min_flt++ in faultin_page(). > > But today mm_account_fault() increments these counters without checking > FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to > just use get_user_pages() and remove this comment? Well, yes, it still looks confusing, imo. Andrii, I hope you won't mind if I redo/resend this and the next cleanup? The next one only updates the comment above uprobe_write_opcode(), but it would be nice to explain mmap_write_lock() in register_for_each_vma(). Oleg.
Re: [PATCH v4 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss
On 9.07.2024 8:49 AM, Naina Mehta wrote: > Rename qdss@8880 memory region as qlink_logging memory region > and add qdss_mem memory region at address of 0x8850, > qlink_logging is being added at the memory region at the address > of 0x8880 as the region is being used by modem firmware. > Since different DSM region size is required for different modem > firmware, split mpss_dsmharq_mem region into 2 separate regions. > This would provide the flexibility to remove the region which is > not required for a particular platform. Based on the modem firmware > either both the regions have to be used or only mpss_dsm_mem has > to be used. Also, reduce the size of mpssadsp_mem region. > > Signed-off-by: Naina Mehta > --- Thanks Reviewed-by: Konrad Dybcio Konrad
[PATCH v4 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss
Rename qdss@8880 memory region as qlink_logging memory region and add qdss_mem memory region at address of 0x8850, qlink_logging is being added at the memory region at the address of 0x8880 as the region is being used by modem firmware. Since different DSM region size is required for different modem firmware, split mpss_dsmharq_mem region into 2 separate regions. This would provide the flexibility to remove the region which is not required for a particular platform. Based on the modem firmware either both the regions have to be used or only mpss_dsm_mem has to be used. Also, reduce the size of mpssadsp_mem region. Signed-off-by: Naina Mehta --- arch/arm64/boot/dts/qcom/sdx75.dtsi | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/arch/arm64/boot/dts/qcom/sdx75.dtsi b/arch/arm64/boot/dts/qcom/sdx75.dtsi index 9b93f6501d55..6f0abcc87a3b 100644 --- a/arch/arm64/boot/dts/qcom/sdx75.dtsi +++ b/arch/arm64/boot/dts/qcom/sdx75.dtsi @@ -366,7 +366,12 @@ uefi_log_mem: uefi-log@87f75000 { no-map; }; - qdss_mem: qdss@8880 { + qdss_mem: qdss@8850 { + reg = <0x0 0x8850 0x0 0x30>; + no-map; + }; + + qlink_logging_mem: qlink-logging@8880 { reg = <0x0 0x8880 0x0 0x30>; no-map; }; @@ -377,8 +382,13 @@ audio_heap_mem: audio-heap@88b0 { no-map; }; - mpss_dsmharq_mem: mpss-dsmharq@88f0 { - reg = <0x0 0x88f0 0x0 0x508>; + mpss_dsm_mem_2: mpss-dsm-2@88f0 { + reg = <0x0 0x88f0 0x0 0x250>; + no-map; + }; + + mpss_dsm_mem: mpss-dsm@8b40 { + reg = <0x0 0x8b40 0x0 0x2b8>; no-map; }; @@ -388,7 +398,7 @@ q6_mpss_dtb_mem: q6-mpss-dtb@8df8 { }; mpssadsp_mem: mpssadsp@8e00 { - reg = <0x0 0x8e00 0x0 0xf40>; + reg = <0x0 0x8e00 0x0 0xf10>; no-map; }; -- 2.34.1
Re: [PATCH] mailmap: Update Luca Weiss's email address
On Fri, 28 Jun 2024 19:40:55 +0200, Luca Weiss wrote: > I'm slowly migrating my mail to a new domain, add an entry to map the > mail address. Just for clarity, my work-related @fairphone.com email > stays unchanged. > > Applied, thanks! [1/1] mailmap: Update Luca Weiss's email address commit: 2881fcfc8f32c536a4bf708066d6fea9ba762e86 Best regards, -- Bjorn Andersson
Re: [PATCH v2 01/12] uprobes: update outdated comment
On Wed, Jul 3, 2024 at 4:40 AM Oleg Nesterov wrote: > > Sorry for the late reply. I'll try to read this version/discussion > when I have time... yes, I have already promised this before, sorry :/ > No worries, there will be v3 next week (I'm going on short PTO starting tomorrow). So you still have time. I appreciate you looking at it, though! > On 07/01, Andrii Nakryiko wrote: > > > > There is no task_struct passed into get_user_pages_remote() anymore, > > drop the parts of comment mentioning NULL tsk, it's just confusing at > > this point. > > Agreed. > > > @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, > > unsigned long vaddr) > > goto out; > > > > /* > > - * The NULL 'tsk' here ensures that any faults that occur here > > - * will not be accounted to the task. 'mm' *is* current->mm, > > - * but we treat this as a 'remote' access since it is > > - * essentially a kernel access to the memory. > > + * 'mm' *is* current->mm, but we treat this as a 'remote' access since > > + * it is essentially a kernel access to the memory. > >*/ > > result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL); > > OK, this makes it less confusing, so > > Acked-by: Oleg Nesterov > > > - > but it still looks confusing to me. This code used to pass tsk = NULL > only to avoid tsk->maj/min_flt++ in faultin_page(). > > But today mm_account_fault() increments these counters without checking > FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to > just use get_user_pages() and remove this comment? > > Oleg. >
Re: [PATCH v2 01/12] uprobes: update outdated comment
On Wed, Jul 3, 2024 at 4:40 AM Oleg Nesterov wrote: > > Sorry for the late reply. I'll try to read this version/discussion > when I have time... yes, I have already promised this before, sorry :/ > > On 07/01, Andrii Nakryiko wrote: > > > > There is no task_struct passed into get_user_pages_remote() anymore, > > drop the parts of comment mentioning NULL tsk, it's just confusing at > > this point. > > Agreed. > > > @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, > > unsigned long vaddr) > > goto out; > > > > /* > > - * The NULL 'tsk' here ensures that any faults that occur here > > - * will not be accounted to the task. 'mm' *is* current->mm, > > - * but we treat this as a 'remote' access since it is > > - * essentially a kernel access to the memory. > > + * 'mm' *is* current->mm, but we treat this as a 'remote' access since > > + * it is essentially a kernel access to the memory. > >*/ > > result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL); > > OK, this makes it less confusing, so > > Acked-by: Oleg Nesterov > > > - > but it still looks confusing to me. This code used to pass tsk = NULL > only to avoid tsk->maj/min_flt++ in faultin_page(). > > But today mm_account_fault() increments these counters without checking > FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to > just use get_user_pages() and remove this comment? I don't know, it was a drive-by cleanup which I'm starting to regret already :) > > Oleg. >
Re: [PATCH v2 01/12] uprobes: update outdated comment
Sorry for the late reply. I'll try to read this version/discussion when I have time... yes, I have already promised this before, sorry :/ On 07/01, Andrii Nakryiko wrote: > > There is no task_struct passed into get_user_pages_remote() anymore, > drop the parts of comment mentioning NULL tsk, it's just confusing at > this point. Agreed. > @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, > unsigned long vaddr) > goto out; > > /* > - * The NULL 'tsk' here ensures that any faults that occur here > - * will not be accounted to the task. 'mm' *is* current->mm, > - * but we treat this as a 'remote' access since it is > - * essentially a kernel access to the memory. > + * 'mm' *is* current->mm, but we treat this as a 'remote' access since > + * it is essentially a kernel access to the memory. >*/ > result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL); OK, this makes it less confusing, so Acked-by: Oleg Nesterov - but it still looks confusing to me. This code used to pass tsk = NULL only to avoid tsk->maj/min_flt++ in faultin_page(). But today mm_account_fault() increments these counters without checking FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to just use get_user_pages() and remove this comment? Oleg.
[PATCH v12 17/19] Documentation: probes: Update fprobe on function-graph tracer
From: Masami Hiramatsu (Google) Update fprobe documentation for the new fprobe on function-graph tracer. This includes some bahvior changes and pt_regs to ftrace_regs interface change. Signed-off-by: Masami Hiramatsu (Google) --- Changes in v2: - Update @fregs parameter explanation. --- Documentation/trace/fprobe.rst | 42 ++-- 1 file changed, 27 insertions(+), 15 deletions(-) diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst index 196f52386aaa..f58bdc64504f 100644 --- a/Documentation/trace/fprobe.rst +++ b/Documentation/trace/fprobe.rst @@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe Introduction -Fprobe is a function entry/exit probe mechanism based on ftrace. -Instead of using ftrace full feature, if you only want to attach callbacks -on function entry and exit, similar to the kprobes and kretprobes, you can +Fprobe is a function entry/exit probe mechanism based on the function-graph +tracer. +Instead of tracing all functions, if you want to attach callbacks on specific +function entry and exit, similar to the kprobes and kretprobes, you can use fprobe. Compared with kprobes and kretprobes, fprobe gives faster instrumentation for multiple functions with single handler. This document describes how to use fprobe. @@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as follows: .. code-block:: c - int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data); + int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data); - void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data); + void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data); -Note that the @entry_ip is saved at function entry and passed to exit handler. -If the entry callback function returns !0, the corresponding exit callback will be cancelled. +Note that the @entry_ip is saved at function entry and passed to exit +handler. +If the entry callback function returns !0, the corresponding exit callback +will be cancelled. @fp This is the address of `fprobe` data structure related to this handler. @@ -112,12 +115,10 @@ If the entry callback function returns !0, the corresponding exit callback will This is the return address that the traced function will return to, somewhere in the caller. This can be used at both entry and exit. -@regs -This is the `pt_regs` data structure at the entry and exit. Note that -the instruction pointer of @regs may be different from the @entry_ip -in the entry_handler. If you need traced instruction pointer, you need -to use @entry_ip. On the other hand, in the exit_handler, the instruction -pointer of @regs is set to the current return address. +@fregs +This is the `ftrace_regs` data structure at the entry and exit. This +includes the function parameters, or the return values. So user can +access thos values via appropriate `ftrace_regs_*` APIs. @entry_data This is a local storage to share the data between entry and exit handlers. @@ -125,6 +126,17 @@ If the entry callback function returns !0, the corresponding exit callback will and `entry_data_size` field when registering the fprobe, the storage is allocated and passed to both `entry_handler` and `exit_handler`. +Entry data size and exit handlers on the same function +== + +Since the entry data is passed via per-task stack and it is has limited size, +the entry data size per probe is limited to `15 * sizeof(long)`. You also need +to take care that the different fprobes are probing on the same function, this +limit becomes smaller. The entry data size is aligned to `sizeof(long)` and +each fprobe which has exit handler uses a `sizeof(long)` space on the stack, +you should keep the number of fprobes on the same function as small as +possible. + Share the callbacks with kprobes @@ -165,8 +177,8 @@ This counter counts up when; - fprobe fails to take ftrace_recursion lock. This usually means that a function which is traced by other ftrace users is called from the entry_handler. - - fprobe fails to setup the function exit because of the shortage of rethook - (the shadow stack for hooking the function return.) + - fprobe fails to setup the function exit because of failing to allocate the + data buffer from the per-task shadow stack. The `fprobe::nmissed` field counts up in both cases. Therefore, the former skips both of entry and exit callback and the latter skips the exit
[PATCH v2 01/12] uprobes: update outdated comment
There is no task_struct passed into get_user_pages_remote() anymore, drop the parts of comment mentioning NULL tsk, it's just confusing at this point. Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 99be2adedbc0..081821fd529a 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr) goto out; /* -* The NULL 'tsk' here ensures that any faults that occur here -* will not be accounted to the task. 'mm' *is* current->mm, -* but we treat this as a 'remote' access since it is -* essentially a kernel access to the memory. +* 'mm' *is* current->mm, but we treat this as a 'remote' access since +* it is essentially a kernel access to the memory. */ result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL); if (result < 0) -- 2.43.0
Re: [RFC PATCH v1 1/2] virtio/vsock: rework deferred credit update logic
Hi Arseniy, On Fri, Jun 21, 2024 at 10:25:40PM GMT, Arseniy Krasnov wrote: Previous calculation of 'free_space' was wrong (but worked as expected in most cases, see below), because it didn't account number of bytes in rx queue. Let's rework 'free_space' calculation in the following way: as this value is considered free space at rx side from tx point of view, it must be equal to return value of 'virtio_transport_get_credit()' at tx side. This function uses 'tx_cnt' counter and 'peer_fwd_cnt': first is number of transmitted bytes (without wrap), second is last 'fwd_cnt' value received from rx. So let's use same approach at rx side during 'free_space' calculation: add 'rx_cnt' counter which is number of received bytes (also without wrap) and subtract 'last_fwd_cnt' from it. Now we have: 1) 'rx_cnt' == 'tx_cnt' at both sides. 2) 'last_fwd_cnt' == 'peer_fwd_cnt' - because first is last 'fwd_cnt' sent to tx, while second is last 'fwd_cnt' received from rx. Now 'free_space' is handled correctly and also we don't need 'low_rx_bytes' flag - this was more like a hack. Previous calculation of 'free_space' worked (in 99% cases), because if we take a look on behaviour of both expressions (new and previous): '(rx_cnt - last_fwd_cnt)' and '(fwd_cnt - last_fwd_cnt)' Both of them always grows up, with almost same "speed": only difference is that 'rx_cnt' is incremented earlier during packet is received, while 'fwd_cnt' in incremented when packet is read by user. So if 'rx_cnt' grows "faster", then resulting 'free_space' become smaller also, so we send credit updates a little bit more, but: * 'free_space' calculation based on 'rx_cnt' gives the same value, which tx sees as free space at rx side, so original idea of 'free_space' is now implemented as planned. * Hack with 'low_rx_bytes' now is not needed. Also here is some performance comparison between both versions of 'free_space' calculation: *--*--*--* | | 'rx_cnt' | previous | *--*--*--* |H -> G| 8.42 | 7.82 | *--*--*--* |G -> H| 11.6 | 12.1 | *--*--*--* I did some tests on an Intel(R) Xeon(R) Silver 4410Y using iperf-vsock: - kernel 6.9.0 pkt_size G->H H->G 4k4.6 6.4 64k 13.8 11.5 128k 13.4 11.7 - kernel 6.9.0 with this series applied pkt_size G->H H->G 4k 4.6 8.16 64k 12.2 8.9 128k 12.8 8.8 I see a big drop, especially on H->G with big packets. Can you try to replicate on your env? I'll try to understand more and also an i7 on the next days. Thanks, Stefano As benchmark 'vsock-iperf' with default arguments was used. There is no significant performance difference before and after this patch. Signed-off-by: Arseniy Krasnov --- include/linux/virtio_vsock.h| 1 + net/vmw_vsock/virtio_transport_common.c | 8 +++- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h index c82089dee0c8..3579491c411e 100644 --- a/include/linux/virtio_vsock.h +++ b/include/linux/virtio_vsock.h @@ -135,6 +135,7 @@ struct virtio_vsock_sock { u32 peer_buf_alloc; /* Protected by rx_lock */ + u32 rx_cnt; u32 fwd_cnt; u32 last_fwd_cnt; u32 rx_bytes; diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index 16ff976a86e3..1d4e2328e06e 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -441,6 +441,7 @@ static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs, return false; vvs->rx_bytes += len; + vvs->rx_cnt += len; return true; } @@ -558,7 +559,6 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk, size_t bytes, total = 0; struct sk_buff *skb; u32 fwd_cnt_delta; - bool low_rx_bytes; int err = -EFAULT; u32 free_space; @@ -603,9 +603,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk, } fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt; - free_space = vvs->buf_alloc - fwd_cnt_delta; - low_rx_bytes = (vvs->rx_bytes < - sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX)); + free_space = vvs->buf_alloc - (vvs->rx_cnt - vvs->last_fwd_cnt); spin_unlock_bh(>rx_lock); @@ -619,7 +617,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk, * number of bytes in rx queue is not enough to wake up reader. */ if (fwd_cnt_delta && - (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes)) + (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)) virtio_transport_send_credit_update(vsk); return total; -- 2.25.1
Re: [PATCH v3 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss
On 6/26/2024 9:12 PM, Konrad Dybcio wrote: On 24.06.2024 1:21 PM, Naina Mehta wrote: On 6/18/2024 7:08 PM, Konrad Dybcio wrote: On 6/18/24 15:13, Naina Mehta wrote: Rename qdss@8880 memory region as qlink_logging memory region and add qdss_mem memory region at address of 0x8850. Split mpss_dsmharq_mem region into 2 separate regions and reduce the size of mpssadsp_mem region. Signed-off-by: Naina Mehta --- Alright, we're getting somewhere. The commit message should however motivate why such changes are necessary. For all we know, the splitting in two is currently done for no reason, as qdss_mem and qlink_logging_mem are contiguous - does the firmware have some expectations about them being separate? Since different DSM region size is required for different modem firmware, mpss_dsmharq_mem region being split into 2 separate regions. This would provide the flexibility to remove the region which is not required for a particular platform. qlink_logging is being added at the memory region at the address of 0x8880 as the region is being used by modem firmware. Ok, now put that in the commit message :) And I suppose: "This would provide the flexibility to remove the region which is not required for a particular platform." - but you still pass both to the remoteproc in patch 4. Are these regions mutually exclusive? Yes, for IDP platform, we are using both the DSM regions. Based on the modem firmware either both the regions have to be used or only mpss_dsm_mem has to be used. Regards, Naina Konrad
[PATCH] mailmap: Update Luca Weiss's email address
I'm slowly migrating my mail to a new domain, add an entry to map the mail address. Just for clarity, my work-related @fairphone.com email stays unchanged. Signed-off-by: Luca Weiss --- Since my email address also appears in a bunch of drivers and arm(64) files, and two devicetree binding files, how are those normally handled? Just ignore them and let mailmap handle everything relevant? --- .mailmap | 1 + 1 file changed, 1 insertion(+) diff --git a/.mailmap b/.mailmap index a6c619e22efc..e169a99ce7c7 100644 --- a/.mailmap +++ b/.mailmap @@ -385,6 +385,7 @@ Li Yang Lior David Lorenzo Pieralisi Luca Ceresoli +Luca Weiss Lukasz Luba Luo Jie Maciej W. Rozycki --- base-commit: 642a16ca7994a50d7de85715996a8ce171a5bdfb change-id: 20240628-mailmap-3528f7365abb Best regards, -- Luca Weiss
[PATCH v2 4/6] riscv: ftrace: do not use stop_machine to update code
Now it is safe to remove dependency from stop_machine() for us to patch code in ftrace. Signed-off-by: Andy Chiu --- arch/riscv/kernel/ftrace.c | 53 -- 1 file changed, 4 insertions(+), 49 deletions(-) diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c index 5ebe412280ef..57a6558e212e 100644 --- a/arch/riscv/kernel/ftrace.c +++ b/arch/riscv/kernel/ftrace.c @@ -13,23 +13,13 @@ #include #ifdef CONFIG_DYNAMIC_FTRACE -void ftrace_arch_code_modify_prepare(void) __acquires(_mutex) +void arch_ftrace_update_code(int command) { mutex_lock(_mutex); - - /* -* The code sequences we use for ftrace can't be patched while the -* kernel is running, so we need to use stop_machine() to modify them -* for now. This doesn't play nice with text_mutex, we use this flag -* to elide the check. -*/ - riscv_patch_in_stop_machine = true; -} - -void ftrace_arch_code_modify_post_process(void) __releases(_mutex) -{ - riscv_patch_in_stop_machine = false; + command |= FTRACE_MAY_SLEEP; + ftrace_modify_all_code(command); mutex_unlock(_mutex); + flush_icache_all(); } static int ftrace_check_current_call(unsigned long hook_pos, @@ -155,41 +145,6 @@ int ftrace_update_ftrace_func(ftrace_func_t func) return __ftrace_modify_call_site(_call_dest, func, true); } -struct ftrace_modify_param { - int command; - atomic_t cpu_count; -}; - -static int __ftrace_modify_code(void *data) -{ - struct ftrace_modify_param *param = data; - - if (atomic_inc_return(>cpu_count) == num_online_cpus()) { - ftrace_modify_all_code(param->command); - /* -* Make sure the patching store is effective *before* we -* increment the counter which releases all waiting CPUs -* by using the release variant of atomic increment. The -* release pairs with the call to local_flush_icache_all() -* on the waiting CPU. -*/ - atomic_inc_return_release(>cpu_count); - } else { - while (atomic_read(>cpu_count) <= num_online_cpus()) - cpu_relax(); - - local_flush_icache_all(); - } - - return 0; -} - -void arch_ftrace_update_code(int command) -{ - struct ftrace_modify_param param = { command, ATOMIC_INIT(0) }; - - stop_machine(__ftrace_modify_code, , cpu_online_mask); -} #endif #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS -- 2.43.0
Re: [PATCH v3 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss
On 24.06.2024 1:21 PM, Naina Mehta wrote: > > > On 6/18/2024 7:08 PM, Konrad Dybcio wrote: >> >> >> On 6/18/24 15:13, Naina Mehta wrote: >>> Rename qdss@8880 memory region as qlink_logging memory region >>> and add qdss_mem memory region at address of 0x8850. >>> Split mpss_dsmharq_mem region into 2 separate regions and >>> reduce the size of mpssadsp_mem region. >>> >>> Signed-off-by: Naina Mehta >>> --- >> >> Alright, we're getting somewhere. The commit message should however motivate >> why such changes are necessary. For all we know, the splitting in two is >> currently done for no reason, as qdss_mem and qlink_logging_mem are >> contiguous >> - does the firmware have some expectations about them being separate? >> > > Since different DSM region size is required for different modem firmware, > mpss_dsmharq_mem region being split into 2 separate regions. > This would provide the flexibility to remove the region which is > not required for a particular platform. > qlink_logging is being added at the memory region at the address of > 0x8880 as the region is being used by modem firmware. Ok, now put that in the commit message :) And I suppose: "This would provide the flexibility to remove the region which is not required for a particular platform." - but you still pass both to the remoteproc in patch 4. Are these regions mutually exclusive? Konrad
Re: [PATCH v2 0/3] ARM: dts: qcom: msm8926-motorola-peregrine: Update device tree of Motorola Moto G 4G (2013)
On Mon, 17 Jun 2024 23:22:26 +0200, André Apitzsch wrote: > Add accelerometer, magnetometer, regulator and temperature sensor alert > interrupt and update framebuffer supplies. > > Applied, thanks! [1/3] ARM: dts: qcom: msm8926-motorola-peregrine: Add accelerometer, magnetometer, regulator commit: 65ec35baeb937e91970c5d88118c5638d8582bb3 [2/3] ARM: dts: qcom: msm8926-motorola-peregrine: Update temperature sensor commit: c9c86387ea1c4034fec34690c7cf2a96f9c21196 [3/3] ARM: dts: qcom: msm8926-motorola-peregrine: Add framebuffer supplies commit: fed1c79fc7fe10900d99a79a36e40443f3267ef3 Best regards, -- Bjorn Andersson
Re: [RFC PATCH v1 1/2] virtio/vsock: rework deferred credit update logic
On 25.06.2024 16:46, Stefano Garzarella wrote: > On Fri, Jun 21, 2024 at 10:25:40PM GMT, Arseniy Krasnov wrote: >> Previous calculation of 'free_space' was wrong (but worked as expected >> in most cases, see below), because it didn't account number of bytes in >> rx queue. Let's rework 'free_space' calculation in the following way: >> as this value is considered free space at rx side from tx point of view, >> it must be equal to return value of 'virtio_transport_get_credit()' at >> tx side. This function uses 'tx_cnt' counter and 'peer_fwd_cnt': first >> is number of transmitted bytes (without wrap), second is last 'fwd_cnt' >> value received from rx. So let's use same approach at rx side during >> 'free_space' calculation: add 'rx_cnt' counter which is number of >> received bytes (also without wrap) and subtract 'last_fwd_cnt' from it. >> Now we have: >> 1) 'rx_cnt' == 'tx_cnt' at both sides. >> 2) 'last_fwd_cnt' == 'peer_fwd_cnt' - because first is last 'fwd_cnt' >> sent to tx, while second is last 'fwd_cnt' received from rx. >> >> Now 'free_space' is handled correctly and also we don't need > > mmm, I don't know if it was wrong before, maybe we could say it was less > accurate. May be "now 'free_space' is handled in more precise way and also we ..." ? > > That said, could we have the same problem now if we have a lot of producers > and the virtqueue becomes full? > I guess if virtqueue is full, we just wait by returning skb back to tx queue... e.g. data exchange between two sockets just freezes. ? >> 'low_rx_bytes' flag - this was more like a hack. >> >> Previous calculation of 'free_space' worked (in 99% cases), because if >> we take a look on behaviour of both expressions (new and previous): >> >> '(rx_cnt - last_fwd_cnt)' and '(fwd_cnt - last_fwd_cnt)' >> >> Both of them always grows up, with almost same "speed": only difference >> is that 'rx_cnt' is incremented earlier during packet is received, >> while 'fwd_cnt' in incremented when packet is read by user. So if 'rx_cnt' >> grows "faster", then resulting 'free_space' become smaller also, so we >> send credit updates a little bit more, but: >> >> * 'free_space' calculation based on 'rx_cnt' gives the same value, >> which tx sees as free space at rx side, so original idea of > > Ditto, what happen if the virtqueue is full? > >> 'free_space' is now implemented as planned. >> * Hack with 'low_rx_bytes' now is not needed. > > Yeah, so this patch should also mitigate issue reported by Alex (added in > CC), right? > > If yes, please mention that problem and add a Reported-by giving credit to > Alex. Yes, of course! > >> >> Also here is some performance comparison between both versions of >> 'free_space' calculation: >> >> *--*--*--* >> | | 'rx_cnt' | previous | >> *--*--*--* >> |H -> G| 8.42 | 7.82 | >> *--*--*--* >> |G -> H| 11.6 | 12.1 | >> *--*--*--* > > How many seconds did you run it? How many repetitions? There's a little > discrepancy anyway, but I can't tell if it's just noise. I run 4 times, each run for ~10 seconds... I think I can also add number of credit update messages to this report. > >> >> As benchmark 'vsock-iperf' with default arguments was used. There is no >> significant performance difference before and after this patch. >> >> Signed-off-by: Arseniy Krasnov >> --- >> include/linux/virtio_vsock.h | 1 + >> net/vmw_vsock/virtio_transport_common.c | 8 +++- >> 2 files changed, 4 insertions(+), 5 deletions(-) > > Thanks for working on this, I'll do more tests but the approach LGTM. Got it, Thanks > > Thanks, > Stefano > >> >> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h >> index c82089dee0c8..3579491c411e 100644 >> --- a/include/linux/virtio_vsock.h >> +++ b/include/linux/virtio_vsock.h >> @@ -135,6 +135,7 @@ struct virtio_vsock_sock { >> u32 peer_buf_alloc; >> >> /* Protected by rx_lock */ >> + u32 rx_cnt; >> u32 fwd_cnt; >> u32 last_fwd_cnt; >> u32 rx_bytes; >> diff --git a/net/vmw_vsock/virtio_transport_common.c >> b/net/vmw_vsock/virtio_transport_common.c >> index 16ff976a86e3..1d4e2328e06e 100644 >> --- a/net/vmw_vsock/virtio_transport_common.c >> +++ b/net/vmw_vsock/virtio_transport_common.c >> @@ -441,6 +441,7 @@ static bool virtio_transport_inc_rx_pkt(struct >> virtio_vs
Re: [RFC PATCH v1 1/2] virtio/vsock: rework deferred credit update logic
On Fri, Jun 21, 2024 at 10:25:40PM GMT, Arseniy Krasnov wrote: Previous calculation of 'free_space' was wrong (but worked as expected in most cases, see below), because it didn't account number of bytes in rx queue. Let's rework 'free_space' calculation in the following way: as this value is considered free space at rx side from tx point of view, it must be equal to return value of 'virtio_transport_get_credit()' at tx side. This function uses 'tx_cnt' counter and 'peer_fwd_cnt': first is number of transmitted bytes (without wrap), second is last 'fwd_cnt' value received from rx. So let's use same approach at rx side during 'free_space' calculation: add 'rx_cnt' counter which is number of received bytes (also without wrap) and subtract 'last_fwd_cnt' from it. Now we have: 1) 'rx_cnt' == 'tx_cnt' at both sides. 2) 'last_fwd_cnt' == 'peer_fwd_cnt' - because first is last 'fwd_cnt' sent to tx, while second is last 'fwd_cnt' received from rx. Now 'free_space' is handled correctly and also we don't need mmm, I don't know if it was wrong before, maybe we could say it was less accurate. That said, could we have the same problem now if we have a lot of producers and the virtqueue becomes full? 'low_rx_bytes' flag - this was more like a hack. Previous calculation of 'free_space' worked (in 99% cases), because if we take a look on behaviour of both expressions (new and previous): '(rx_cnt - last_fwd_cnt)' and '(fwd_cnt - last_fwd_cnt)' Both of them always grows up, with almost same "speed": only difference is that 'rx_cnt' is incremented earlier during packet is received, while 'fwd_cnt' in incremented when packet is read by user. So if 'rx_cnt' grows "faster", then resulting 'free_space' become smaller also, so we send credit updates a little bit more, but: * 'free_space' calculation based on 'rx_cnt' gives the same value, which tx sees as free space at rx side, so original idea of Ditto, what happen if the virtqueue is full? 'free_space' is now implemented as planned. * Hack with 'low_rx_bytes' now is not needed. Yeah, so this patch should also mitigate issue reported by Alex (added in CC), right? If yes, please mention that problem and add a Reported-by giving credit to Alex. Also here is some performance comparison between both versions of 'free_space' calculation: *--*--*--* | | 'rx_cnt' | previous | *--*--*--* |H -> G| 8.42 | 7.82 | *--*--*--* |G -> H| 11.6 | 12.1 | *--*--*--* How many seconds did you run it? How many repetitions? There's a little discrepancy anyway, but I can't tell if it's just noise. As benchmark 'vsock-iperf' with default arguments was used. There is no significant performance difference before and after this patch. Signed-off-by: Arseniy Krasnov --- include/linux/virtio_vsock.h| 1 + net/vmw_vsock/virtio_transport_common.c | 8 +++- 2 files changed, 4 insertions(+), 5 deletions(-) Thanks for working on this, I'll do more tests but the approach LGTM. Thanks, Stefano diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h index c82089dee0c8..3579491c411e 100644 --- a/include/linux/virtio_vsock.h +++ b/include/linux/virtio_vsock.h @@ -135,6 +135,7 @@ struct virtio_vsock_sock { u32 peer_buf_alloc; /* Protected by rx_lock */ + u32 rx_cnt; u32 fwd_cnt; u32 last_fwd_cnt; u32 rx_bytes; diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index 16ff976a86e3..1d4e2328e06e 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -441,6 +441,7 @@ static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs, return false; vvs->rx_bytes += len; + vvs->rx_cnt += len; return true; } @@ -558,7 +559,6 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk, size_t bytes, total = 0; struct sk_buff *skb; u32 fwd_cnt_delta; - bool low_rx_bytes; int err = -EFAULT; u32 free_space; @@ -603,9 +603,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk, } fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt; - free_space = vvs->buf_alloc - fwd_cnt_delta; - low_rx_bytes = (vvs->rx_bytes < - sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX)); + free_space = vvs->buf_alloc - (vvs->rx_cnt - vvs->last_fwd_cnt); spin_unlock_bh(>rx_lock); @@ -619,7 +617,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk, * number of bytes in rx queue is not enough to wake up reader. */ if (fwd_cnt_delta && - (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes)) + (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)) virtio_transport_send_credit_update(vsk); return total; --
[PATCH 01/12] uprobes: update outdated comment
There is no task_struct passed into get_user_pages_remote() anymore, drop the parts of comment mentioning NULL tsk, it's just confusing at this point. Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 2816e65729ac..197fbe4663b5 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr) goto out; /* -* The NULL 'tsk' here ensures that any faults that occur here -* will not be accounted to the task. 'mm' *is* current->mm, -* but we treat this as a 'remote' access since it is -* essentially a kernel access to the memory. +* 'mm' *is* current->mm, but we treat this as a 'remote' access since +* it is essentially a kernel access to the memory. */ result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL); if (result < 0) -- 2.43.0
Re: [PATCH v3 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss
On 6/18/2024 7:08 PM, Konrad Dybcio wrote: On 6/18/24 15:13, Naina Mehta wrote: Rename qdss@8880 memory region as qlink_logging memory region and add qdss_mem memory region at address of 0x8850. Split mpss_dsmharq_mem region into 2 separate regions and reduce the size of mpssadsp_mem region. Signed-off-by: Naina Mehta --- Alright, we're getting somewhere. The commit message should however motivate why such changes are necessary. For all we know, the splitting in two is currently done for no reason, as qdss_mem and qlink_logging_mem are contiguous - does the firmware have some expectations about them being separate? Since different DSM region size is required for different modem firmware, mpss_dsmharq_mem region being split into 2 separate regions. This would provide the flexibility to remove the region which is not required for a particular platform. qlink_logging is being added at the memory region at the address of 0x8880 as the region is being used by modem firmware. Regards, Naina Konrad
[RFC PATCH v1 1/2] virtio/vsock: rework deferred credit update logic
Previous calculation of 'free_space' was wrong (but worked as expected in most cases, see below), because it didn't account number of bytes in rx queue. Let's rework 'free_space' calculation in the following way: as this value is considered free space at rx side from tx point of view, it must be equal to return value of 'virtio_transport_get_credit()' at tx side. This function uses 'tx_cnt' counter and 'peer_fwd_cnt': first is number of transmitted bytes (without wrap), second is last 'fwd_cnt' value received from rx. So let's use same approach at rx side during 'free_space' calculation: add 'rx_cnt' counter which is number of received bytes (also without wrap) and subtract 'last_fwd_cnt' from it. Now we have: 1) 'rx_cnt' == 'tx_cnt' at both sides. 2) 'last_fwd_cnt' == 'peer_fwd_cnt' - because first is last 'fwd_cnt' sent to tx, while second is last 'fwd_cnt' received from rx. Now 'free_space' is handled correctly and also we don't need 'low_rx_bytes' flag - this was more like a hack. Previous calculation of 'free_space' worked (in 99% cases), because if we take a look on behaviour of both expressions (new and previous): '(rx_cnt - last_fwd_cnt)' and '(fwd_cnt - last_fwd_cnt)' Both of them always grows up, with almost same "speed": only difference is that 'rx_cnt' is incremented earlier during packet is received, while 'fwd_cnt' in incremented when packet is read by user. So if 'rx_cnt' grows "faster", then resulting 'free_space' become smaller also, so we send credit updates a little bit more, but: * 'free_space' calculation based on 'rx_cnt' gives the same value, which tx sees as free space at rx side, so original idea of 'free_space' is now implemented as planned. * Hack with 'low_rx_bytes' now is not needed. Also here is some performance comparison between both versions of 'free_space' calculation: *--*--*--* | | 'rx_cnt' | previous | *--*--*--* |H -> G| 8.42 | 7.82 | *--*--*--* |G -> H| 11.6 | 12.1 | *--*--*--* As benchmark 'vsock-iperf' with default arguments was used. There is no significant performance difference before and after this patch. Signed-off-by: Arseniy Krasnov --- include/linux/virtio_vsock.h| 1 + net/vmw_vsock/virtio_transport_common.c | 8 +++- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h index c82089dee0c8..3579491c411e 100644 --- a/include/linux/virtio_vsock.h +++ b/include/linux/virtio_vsock.h @@ -135,6 +135,7 @@ struct virtio_vsock_sock { u32 peer_buf_alloc; /* Protected by rx_lock */ + u32 rx_cnt; u32 fwd_cnt; u32 last_fwd_cnt; u32 rx_bytes; diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index 16ff976a86e3..1d4e2328e06e 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -441,6 +441,7 @@ static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs, return false; vvs->rx_bytes += len; + vvs->rx_cnt += len; return true; } @@ -558,7 +559,6 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk, size_t bytes, total = 0; struct sk_buff *skb; u32 fwd_cnt_delta; - bool low_rx_bytes; int err = -EFAULT; u32 free_space; @@ -603,9 +603,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk, } fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt; - free_space = vvs->buf_alloc - fwd_cnt_delta; - low_rx_bytes = (vvs->rx_bytes < - sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX)); + free_space = vvs->buf_alloc - (vvs->rx_cnt - vvs->last_fwd_cnt); spin_unlock_bh(>rx_lock); @@ -619,7 +617,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk, * number of bytes in rx queue is not enough to wake up reader. */ if (fwd_cnt_delta && - (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes)) + (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE)) virtio_transport_send_credit_update(vsk); return total; -- 2.25.1
[RFC PATCH v1 2/2] vsock/test: add test for deferred credit update
This test checks, that we send exactly expected number of credit update packets during deferred credit update optimization. Test work in client/server modes: 1) Client just connects to server and send 256Kb of data. 256Kb is chosen because it is default space for vsock peer. After transmission client waits until server performs checks of this test. 2) Server waits for vsock connection and also open raw socket binded to vsock monitor interface. Then server waits until there will be 256Kb of data in its rx queue (by reading data from stream socket with MSG_PEEK flag). Then server starts reading data from stream socket - it reads entire socket, also reading packets from raw vsock socket, to check that there is OP_RW packets. After data read is done, it checks raw socket again, by waiting exact amount of CREDIT_UPDATE packets. Finally it checks that rx queue of raw socket is empty using read with timeout. Some notes about this test: * It is only for virtio transport. * It relies on sizes of virtio rx buffers for guest and host, to handle raw packets properly (4Kb for guest and 64Kb for host). * It relies on free space limit for deferred credit update - currently it is 64Kb. * It needs permissions to open raw sockets. Signed-off-by: Arseniy Krasnov --- tools/testing/vsock/.gitignore | 1 + tools/testing/vsock/Makefile| 2 + tools/testing/vsock/virtio_vsock_test.c | 369 3 files changed, 372 insertions(+) create mode 100644 tools/testing/vsock/virtio_vsock_test.c diff --git a/tools/testing/vsock/.gitignore b/tools/testing/vsock/.gitignore index d9f798713cd7..74d13a38cf43 100644 --- a/tools/testing/vsock/.gitignore +++ b/tools/testing/vsock/.gitignore @@ -4,3 +4,4 @@ vsock_test vsock_diag_test vsock_perf vsock_uring_test +virtio_vsock_test diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile index a7f56a09ca9f..e04d69903687 100644 --- a/tools/testing/vsock/Makefile +++ b/tools/testing/vsock/Makefile @@ -8,6 +8,8 @@ vsock_perf: vsock_perf.o msg_zerocopy_common.o vsock_uring_test: LDLIBS = -luring vsock_uring_test: control.o util.o vsock_uring_test.o timeout.o msg_zerocopy_common.o +virtio_vsock_test: virtio_vsock_test.o util.o timeout.o control.o + CFLAGS += -g -O2 -Werror -Wall -I. -I../../include -I../../../usr/include -Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD -U_FORTIFY_SOURCE -D_GNU_SOURCE .PHONY: all test clean clean: diff --git a/tools/testing/vsock/virtio_vsock_test.c b/tools/testing/vsock/virtio_vsock_test.c new file mode 100644 index ..4320dbc31ecc --- /dev/null +++ b/tools/testing/vsock/virtio_vsock_test.c @@ -0,0 +1,369 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * virtio_vsock_test - vsock.ko test suite for VirtIO transport. + * + * Copyright (C) 2024 SaluteDevices. + * + * Author: Arseniy Krasnov + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "control.h" +#include "util.h" + +#define RAW_SOCK_RCV_BUF (1024 * 1024) + +#define TOTAL_TX_BYTES (1024 * 256) + +#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE (1024 * 64) + +#define VIRTIO_VSOCK_GUEST_RX_BUF_SIZE (4096) +#define VIRTIO_VSOCK_HOST_RX_BUF_SIZE (1024 * 64) + +static const char *interface; + +static void test_stream_deferred_credit_update_client(const struct test_opts *opts) +{ + unsigned char buf[VIRTIO_VSOCK_MAX_PKT_BUF_SIZE]; + int fd; + int i; + + fd = vsock_stream_connect(opts->peer_cid, opts->peer_port); + if (fd < 0) { + perror("connect"); + exit(EXIT_FAILURE); + } + + control_expectln("SERVERREADY"); + memset(buf, 0, sizeof(buf)); + + for (i = 0; i < TOTAL_TX_BYTES / VIRTIO_VSOCK_MAX_PKT_BUF_SIZE; i++) { + if (write(fd, buf, sizeof(buf)) != sizeof(buf)) { + perror("write"); + exit(EXIT_FAILURE); + } + } + + control_writeln("CLIENTDONE"); + control_expectln("SERVERDONE"); + + close(fd); +} + +static int create_raw_sock(const char *ifname) +{ + struct sockaddr_ll addr_ll; + struct ifreq ifr; + size_t rcv_buf; + int fd; + + fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); + if (fd < 0) { + perror("socket"); + exit(EXIT_FAILURE); + } + + memset(, 0, sizeof(ifr)); + snprintf(ifr.ifr_name, sizeof(ifr.ifr_name), +"%s", ifname); + + rcv_buf = RAW_SOCK_RCV_BUF; + if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE, _buf, + sizeof(rcv_buf))) { + p
[RFC PATCH v1 0/2] virtio/vsock: some updates for deferred credit update
This patchset contains: 0001 - patch which reworks deferred credit update. Pls see commit message, it contains full description of this problem. 0002 - test which uses vsockmon interface, and checks that deferred credit update works as expected by parsing raw packets. Arseniy Krasnov (2): virtio/vsock: rework deferred credit update logic vsock/test: add test for deferred credit update include/linux/virtio_vsock.h| 1 + net/vmw_vsock/virtio_transport_common.c | 8 +- tools/testing/vsock/.gitignore | 1 + tools/testing/vsock/Makefile| 2 + tools/testing/vsock/virtio_vsock_test.c | 369 5 files changed, 376 insertions(+), 5 deletions(-) create mode 100644 tools/testing/vsock/virtio_vsock_test.c -- 2.25.1
Re: [PATCH v9 5/8] remoteproc: qcom: Update regmap offsets for halt register
On 21/06/2024 13:46, Gokul Sriram Palanisamy wrote: > Fixed issue in reading halt-regs parameter from device-tree. What issue? That's a terrible commit msg. Explain what is the problem, how can it be reproduced. > > Signed-off-by: Sricharan R > Signed-off-by: Gokul Sriram Palanisamy > --- > drivers/remoteproc/qcom_q6v5_wcss.c | 22 ++ > 1 file changed, 14 insertions(+), 8 deletions(-) > > diff --git a/drivers/remoteproc/qcom_q6v5_wcss.c > b/drivers/remoteproc/qcom_q6v5_wcss.c > index 06936ca1d079..87b78eb15b86 100644 > --- a/drivers/remoteproc/qcom_q6v5_wcss.c > +++ b/drivers/remoteproc/qcom_q6v5_wcss.c > @@ -86,7 +86,7 @@ > #define TCSR_WCSS_CLK_MASK 0x1F > #define TCSR_WCSS_CLK_ENABLE 0x14 > > -#define MAX_HALT_REG 3 > +#define MAX_HALT_REG 4 ? That's confusing and looks unrelated. Best regards, Krzysztof
[PATCH v9 5/8] remoteproc: qcom: Update regmap offsets for halt register
Fixed issue in reading halt-regs parameter from device-tree. Signed-off-by: Sricharan R Signed-off-by: Gokul Sriram Palanisamy --- drivers/remoteproc/qcom_q6v5_wcss.c | 22 ++ 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/drivers/remoteproc/qcom_q6v5_wcss.c b/drivers/remoteproc/qcom_q6v5_wcss.c index 06936ca1d079..87b78eb15b86 100644 --- a/drivers/remoteproc/qcom_q6v5_wcss.c +++ b/drivers/remoteproc/qcom_q6v5_wcss.c @@ -86,7 +86,7 @@ #define TCSR_WCSS_CLK_MASK 0x1F #define TCSR_WCSS_CLK_ENABLE 0x14 -#define MAX_HALT_REG 3 +#define MAX_HALT_REG 4 #define WCNSS_PAS_ID 6 @@ -154,6 +154,7 @@ struct wcss_data { u32 version; bool aon_reset_required; bool wcss_q6_reset_required; + bool bcr_reset_required; const char *ssr_name; const char *sysmon_name; int ssctl_id; @@ -875,10 +876,13 @@ static int q6v5_wcss_init_reset(struct q6v5_wcss *wcss, } } - wcss->wcss_q6_bcr_reset = devm_reset_control_get_exclusive(dev, "wcss_q6_bcr_reset"); - if (IS_ERR(wcss->wcss_q6_bcr_reset)) { - dev_err(wcss->dev, "unable to acquire wcss_q6_bcr_reset\n"); - return PTR_ERR(wcss->wcss_q6_bcr_reset); + if (desc->bcr_reset_required) { + wcss->wcss_q6_bcr_reset = devm_reset_control_get_exclusive(dev, + "wcss_q6_bcr_reset"); + if (IS_ERR(wcss->wcss_q6_bcr_reset)) { + dev_err(wcss->dev, "unable to acquire wcss_q6_bcr_reset\n"); + return PTR_ERR(wcss->wcss_q6_bcr_reset); + } } return 0; @@ -928,9 +932,9 @@ static int q6v5_wcss_init_mmio(struct q6v5_wcss *wcss, return -EINVAL; } - wcss->halt_q6 = halt_reg[0]; - wcss->halt_wcss = halt_reg[1]; - wcss->halt_nc = halt_reg[2]; + wcss->halt_q6 = halt_reg[1]; + wcss->halt_wcss = halt_reg[2]; + wcss->halt_nc = halt_reg[3]; return 0; } @@ -1170,6 +1174,7 @@ static const struct wcss_data wcss_ipq8074_res_init = { .crash_reason_smem = WCSS_CRASH_REASON, .aon_reset_required = true, .wcss_q6_reset_required = true, + .bcr_reset_required = false, .ssr_name = "q6wcss", .ops = _wcss_ipq8074_ops, .requires_force_stop = true, @@ -1184,6 +1189,7 @@ static const struct wcss_data wcss_qcs404_res_init = { .version = WCSS_QCS404, .aon_reset_required = false, .wcss_q6_reset_required = false, + .bcr_reset_required = true, .ssr_name = "mpss", .sysmon_name = "wcnss", .ssctl_id = 0x12, -- 2.34.1
Re: [PATCH v3 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss
On 6/18/24 15:13, Naina Mehta wrote: Rename qdss@8880 memory region as qlink_logging memory region and add qdss_mem memory region at address of 0x8850. Split mpss_dsmharq_mem region into 2 separate regions and reduce the size of mpssadsp_mem region. Signed-off-by: Naina Mehta --- Alright, we're getting somewhere. The commit message should however motivate why such changes are necessary. For all we know, the splitting in two is currently done for no reason, as qdss_mem and qlink_logging_mem are contiguous - does the firmware have some expectations about them being separate? Konrad
[PATCH v3 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss
Rename qdss@8880 memory region as qlink_logging memory region and add qdss_mem memory region at address of 0x8850. Split mpss_dsmharq_mem region into 2 separate regions and reduce the size of mpssadsp_mem region. Signed-off-by: Naina Mehta --- arch/arm64/boot/dts/qcom/sdx75.dtsi | 18 ++ 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/arch/arm64/boot/dts/qcom/sdx75.dtsi b/arch/arm64/boot/dts/qcom/sdx75.dtsi index 9b93f6501d55..6f0abcc87a3b 100644 --- a/arch/arm64/boot/dts/qcom/sdx75.dtsi +++ b/arch/arm64/boot/dts/qcom/sdx75.dtsi @@ -366,7 +366,12 @@ uefi_log_mem: uefi-log@87f75000 { no-map; }; - qdss_mem: qdss@8880 { + qdss_mem: qdss@8850 { + reg = <0x0 0x8850 0x0 0x30>; + no-map; + }; + + qlink_logging_mem: qlink-logging@8880 { reg = <0x0 0x8880 0x0 0x30>; no-map; }; @@ -377,8 +382,13 @@ audio_heap_mem: audio-heap@88b0 { no-map; }; - mpss_dsmharq_mem: mpss-dsmharq@88f0 { - reg = <0x0 0x88f0 0x0 0x508>; + mpss_dsm_mem_2: mpss-dsm-2@88f0 { + reg = <0x0 0x88f0 0x0 0x250>; + no-map; + }; + + mpss_dsm_mem: mpss-dsm@8b40 { + reg = <0x0 0x8b40 0x0 0x2b8>; no-map; }; @@ -388,7 +398,7 @@ q6_mpss_dtb_mem: q6-mpss-dtb@8df8 { }; mpssadsp_mem: mpssadsp@8e00 { - reg = <0x0 0x8e00 0x0 0xf40>; + reg = <0x0 0x8e00 0x0 0xf10>; no-map; }; -- 2.34.1
Re: [PATCH v2 2/3] ARM: dts: qcom: msm8926-motorola-peregrine: Update temperature sensor
On 6/17/24 23:22, André Apitzsch via B4 Relay wrote: From: André Apitzsch Add alert interrupt for the temperature sensor of Motorola Moto G 4G (2013), although not used by the driver yet. Signed-off-by: André Apitzsch --- Reviewed-by: Konrad Dybcio Konrad
[PATCH v2 0/3] ARM: dts: qcom: msm8926-motorola-peregrine: Update device tree of Motorola Moto G 4G (2013)
Add accelerometer, magnetometer, regulator and temperature sensor alert interrupt and update framebuffer supplies. Signed-off-by: André Apitzsch --- Changes in v2: - Split commit into three commits - Link to v1: https://lore.kernel.org/r/20240616-peregrine-v1-1-85d14ae1a...@apitzsch.eu --- André Apitzsch (3): ARM: dts: qcom: msm8926-motorola-peregrine: Add accelerometer, magnetometer, regulator ARM: dts: qcom: msm8926-motorola-peregrine: Update temperature sensor ARM: dts: qcom: msm8926-motorola-peregrine: Add framebuffer supplies .../dts/qcom/qcom-msm8926-motorola-peregrine.dts | 121 + 1 file changed, 121 insertions(+) --- base-commit: c71189547381bb5f176c6b22a9edc3414f1837b9 change-id: 20240616-peregrine-6ec5e26b15ec Best regards, -- André Apitzsch
[PATCH v2 2/3] ARM: dts: qcom: msm8926-motorola-peregrine: Update temperature sensor
From: André Apitzsch Add alert interrupt for the temperature sensor of Motorola Moto G 4G (2013), although not used by the driver yet. Signed-off-by: André Apitzsch --- arch/arm/boot/dts/qcom/qcom-msm8926-motorola-peregrine.dts | 12 1 file changed, 12 insertions(+) diff --git a/arch/arm/boot/dts/qcom/qcom-msm8926-motorola-peregrine.dts b/arch/arm/boot/dts/qcom/qcom-msm8926-motorola-peregrine.dts index 50ae3cfc95bb..cff9415baa46 100644 --- a/arch/arm/boot/dts/qcom/qcom-msm8926-motorola-peregrine.dts +++ b/arch/arm/boot/dts/qcom/qcom-msm8926-motorola-peregrine.dts @@ -125,6 +125,10 @@ reg_lcd_neg: outn { sensor@48 { compatible = "ti,tmp108"; reg = <0x48>; + interrupts-extended = < 13 IRQ_TYPE_LEVEL_LOW>; + pinctrl-0 = <_alert_default>; + pinctrl-names = "default"; + #thermal-sensor-cells = <0>; }; }; @@ -361,6 +365,14 @@ reg_lcd_default: reg-lcd-default-state { bias-disable; output-high; }; + + temp_alert_default: temp-alert-default-state { + pins = "gpio13"; + function = "gpio"; + drive-strength = <2>; + bias-disable; + output-disable; + }; }; { -- 2.45.2
[PATCH v11 17/18] Documentation: probes: Update fprobe on function-graph tracer
From: Masami Hiramatsu (Google) Update fprobe documentation for the new fprobe on function-graph tracer. This includes some bahvior changes and pt_regs to ftrace_regs interface change. Signed-off-by: Masami Hiramatsu (Google) --- Changes in v2: - Update @fregs parameter explanation. --- Documentation/trace/fprobe.rst | 42 ++-- 1 file changed, 27 insertions(+), 15 deletions(-) diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst index 196f52386aaa..f58bdc64504f 100644 --- a/Documentation/trace/fprobe.rst +++ b/Documentation/trace/fprobe.rst @@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe Introduction -Fprobe is a function entry/exit probe mechanism based on ftrace. -Instead of using ftrace full feature, if you only want to attach callbacks -on function entry and exit, similar to the kprobes and kretprobes, you can +Fprobe is a function entry/exit probe mechanism based on the function-graph +tracer. +Instead of tracing all functions, if you want to attach callbacks on specific +function entry and exit, similar to the kprobes and kretprobes, you can use fprobe. Compared with kprobes and kretprobes, fprobe gives faster instrumentation for multiple functions with single handler. This document describes how to use fprobe. @@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as follows: .. code-block:: c - int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data); + int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data); - void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data); + void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data); -Note that the @entry_ip is saved at function entry and passed to exit handler. -If the entry callback function returns !0, the corresponding exit callback will be cancelled. +Note that the @entry_ip is saved at function entry and passed to exit +handler. +If the entry callback function returns !0, the corresponding exit callback +will be cancelled. @fp This is the address of `fprobe` data structure related to this handler. @@ -112,12 +115,10 @@ If the entry callback function returns !0, the corresponding exit callback will This is the return address that the traced function will return to, somewhere in the caller. This can be used at both entry and exit. -@regs -This is the `pt_regs` data structure at the entry and exit. Note that -the instruction pointer of @regs may be different from the @entry_ip -in the entry_handler. If you need traced instruction pointer, you need -to use @entry_ip. On the other hand, in the exit_handler, the instruction -pointer of @regs is set to the current return address. +@fregs +This is the `ftrace_regs` data structure at the entry and exit. This +includes the function parameters, or the return values. So user can +access thos values via appropriate `ftrace_regs_*` APIs. @entry_data This is a local storage to share the data between entry and exit handlers. @@ -125,6 +126,17 @@ If the entry callback function returns !0, the corresponding exit callback will and `entry_data_size` field when registering the fprobe, the storage is allocated and passed to both `entry_handler` and `exit_handler`. +Entry data size and exit handlers on the same function +== + +Since the entry data is passed via per-task stack and it is has limited size, +the entry data size per probe is limited to `15 * sizeof(long)`. You also need +to take care that the different fprobes are probing on the same function, this +limit becomes smaller. The entry data size is aligned to `sizeof(long)` and +each fprobe which has exit handler uses a `sizeof(long)` space on the stack, +you should keep the number of fprobes on the same function as small as +possible. + Share the callbacks with kprobes @@ -165,8 +177,8 @@ This counter counts up when; - fprobe fails to take ftrace_recursion lock. This usually means that a function which is traced by other ftrace users is called from the entry_handler. - - fprobe fails to setup the function exit because of the shortage of rethook - (the shadow stack for hooking the function return.) + - fprobe fails to setup the function exit because of failing to allocate the + data buffer from the per-task shadow stack. The `fprobe::nmissed` field counts up in both cases. Therefore, the former skips both of entry and exit callback and the latter skips the exit
[PATCH 6/8] riscv: ftrace: do not use stop_machine to update code
Now it is safe to remove dependency from stop_machine() for us to patch code in ftrace. Signed-off-by: Andy Chiu --- arch/riscv/kernel/ftrace.c | 53 -- 1 file changed, 4 insertions(+), 49 deletions(-) diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c index f3b09f2d3ecc..9a421e151b1d 100644 --- a/arch/riscv/kernel/ftrace.c +++ b/arch/riscv/kernel/ftrace.c @@ -13,23 +13,13 @@ #include #ifdef CONFIG_DYNAMIC_FTRACE -void ftrace_arch_code_modify_prepare(void) __acquires(_mutex) +void arch_ftrace_update_code(int command) { mutex_lock(_mutex); - - /* -* The code sequences we use for ftrace can't be patched while the -* kernel is running, so we need to use stop_machine() to modify them -* for now. This doesn't play nice with text_mutex, we use this flag -* to elide the check. -*/ - riscv_patch_in_stop_machine = true; -} - -void ftrace_arch_code_modify_post_process(void) __releases(_mutex) -{ - riscv_patch_in_stop_machine = false; + command |= FTRACE_MAY_SLEEP; + ftrace_modify_all_code(command); mutex_unlock(_mutex); + flush_icache_all(); } static int ftrace_check_current_call(unsigned long hook_pos, @@ -158,41 +148,6 @@ int ftrace_update_ftrace_func(ftrace_func_t func) return __ftrace_modify_call_site(_call_dest, func, true); } -struct ftrace_modify_param { - int command; - atomic_t cpu_count; -}; - -static int __ftrace_modify_code(void *data) -{ - struct ftrace_modify_param *param = data; - - if (atomic_inc_return(>cpu_count) == num_online_cpus()) { - ftrace_modify_all_code(param->command); - /* -* Make sure the patching store is effective *before* we -* increment the counter which releases all waiting CPUs -* by using the release variant of atomic increment. The -* release pairs with the call to local_flush_icache_all() -* on the waiting CPU. -*/ - atomic_inc_return_release(>cpu_count); - } else { - while (atomic_read(>cpu_count) <= num_online_cpus()) - cpu_relax(); - } - - local_flush_icache_all(); - - return 0; -} - -void arch_ftrace_update_code(int command) -{ - struct ftrace_modify_param param = { command, ATOMIC_INIT(0) }; - - stop_machine(__ftrace_modify_code, , cpu_online_mask); -} #endif #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS -- 2.43.0
[PATCH v6 12/13] tracing: Update function tracing output for previous boot buffer
From: "Steven Rostedt (Google)" For a persistent ring buffer that is saved across boots, if function tracing was performed in the previous boot, it only saves the address of the functions and uses "%pS" to print their names. But the current boot, those functions may be in different locations. The persistent meta-data saves the text delta between the two boots and can be used to find the address of the saved function of where it is located in the current boot. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_output.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index d8b302d01083..b9d2c64c0648 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator *iter, int flags, } static void print_fn_trace(struct trace_seq *s, unsigned long ip, - unsigned long parent_ip, int flags) + unsigned long parent_ip, long delta, int flags) { + ip += delta; + parent_ip += delta; + seq_print_ip_sym(s, ip, flags); if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) { @@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct trace_iterator *iter, int flags, trace_assign_type(field, iter->ent); - print_fn_trace(s, field->ip, field->parent_ip, flags); + print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags); trace_seq_putc(s, '\n'); return trace_handle_return(s); @@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int flags, trace_assign_type(field, iter->ent); - print_fn_trace(s, field->ip, field->parent_ip, flags); + print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags); trace_seq_printf(s, " (repeats: %u, last_ts:", field->count); trace_print_time(s, iter, iter->ts - FUNC_REPEATS_GET_DELTA_TS(field)); -- 2.43.0
[PATCH v5 12/13] tracing: Update function tracing output for previous boot buffer
From: "Steven Rostedt (Google)" For a persistent ring buffer that is saved across boots, if function tracing was performed in the previous boot, it only saves the address of the functions and uses "%pS" to print their names. But the current boot, those functions may be in different locations. The persistent meta-data saves the text delta between the two boots and can be used to find the address of the saved function of where it is located in the current boot. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_output.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index d8b302d01083..b9d2c64c0648 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator *iter, int flags, } static void print_fn_trace(struct trace_seq *s, unsigned long ip, - unsigned long parent_ip, int flags) + unsigned long parent_ip, long delta, int flags) { + ip += delta; + parent_ip += delta; + seq_print_ip_sym(s, ip, flags); if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) { @@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct trace_iterator *iter, int flags, trace_assign_type(field, iter->ent); - print_fn_trace(s, field->ip, field->parent_ip, flags); + print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags); trace_seq_putc(s, '\n'); return trace_handle_return(s); @@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int flags, trace_assign_type(field, iter->ent); - print_fn_trace(s, field->ip, field->parent_ip, flags); + print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags); trace_seq_printf(s, " (repeats: %u, last_ts:", field->count); trace_print_time(s, iter, iter->ts - FUNC_REPEATS_GET_DELTA_TS(field)); -- 2.43.0
[PATCH v4 12/13] tracing: Update function tracing output for previous boot buffer
From: "Steven Rostedt (Google)" For a persistent ring buffer that is saved across boots, if function tracing was performed in the previous boot, it only saves the address of the functions and uses "%pS" to print their names. But the current boot, those functions may be in different locations. The persistent meta-data saves the text delta between the two boots and can be used to find the address of the saved function of where it is located in the current boot. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_output.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index d8b302d01083..b9d2c64c0648 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator *iter, int flags, } static void print_fn_trace(struct trace_seq *s, unsigned long ip, - unsigned long parent_ip, int flags) + unsigned long parent_ip, long delta, int flags) { + ip += delta; + parent_ip += delta; + seq_print_ip_sym(s, ip, flags); if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) { @@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct trace_iterator *iter, int flags, trace_assign_type(field, iter->ent); - print_fn_trace(s, field->ip, field->parent_ip, flags); + print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags); trace_seq_putc(s, '\n'); return trace_handle_return(s); @@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int flags, trace_assign_type(field, iter->ent); - print_fn_trace(s, field->ip, field->parent_ip, flags); + print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags); trace_seq_printf(s, " (repeats: %u, last_ts:", field->count); trace_print_time(s, iter, iter->ts - FUNC_REPEATS_GET_DELTA_TS(field)); -- 2.43.0
[PATCH v3 12/13] tracing: Update function tracing output for previous boot buffer
From: "Steven Rostedt (Google)" For a persistent ring buffer that is saved across boots, if function tracing was performed in the previous boot, it only saves the address of the functions and uses "%pS" to print their names. But the current boot, those functions may be in different locations. The persistent meta-data saves the text delta between the two boots and can be used to find the address of the saved function of where it is located in the current boot. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_output.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index d8b302d01083..b9d2c64c0648 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator *iter, int flags, } static void print_fn_trace(struct trace_seq *s, unsigned long ip, - unsigned long parent_ip, int flags) + unsigned long parent_ip, long delta, int flags) { + ip += delta; + parent_ip += delta; + seq_print_ip_sym(s, ip, flags); if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) { @@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct trace_iterator *iter, int flags, trace_assign_type(field, iter->ent); - print_fn_trace(s, field->ip, field->parent_ip, flags); + print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags); trace_seq_putc(s, '\n'); return trace_handle_return(s); @@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int flags, trace_assign_type(field, iter->ent); - print_fn_trace(s, field->ip, field->parent_ip, flags); + print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags); trace_seq_printf(s, " (repeats: %u, last_ts:", field->count); trace_print_time(s, iter, iter->ts - FUNC_REPEATS_GET_DELTA_TS(field)); -- 2.43.0
[PATCH 6/6] function_graph: Do not update pid func if CONFIG_DYNAMIC_FTRACE not enabled
From: "Steven Rostedt (Google)" The ftrace subops is only defined if CONFIG_DYNAMIC_FTRACE is enabled. If it is not, function tracing is extremely limited, and the subops in the ftrace_ops structure is not defined (and will fail to compile). If DYNAMIC_FTRACE is not enabled, then function graph filtering will not work (as it shouldn't). Fixes: df3ec5da6a1e7 ("function_graph: Add pid tracing back to function graph tracer") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202406051855.9viyxbtb-...@intel.com/ Signed-off-by: Steven Rostedt (Google) --- kernel/trace/fgraph.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c index 63d828054c79..c0e428c87ea5 100644 --- a/kernel/trace/fgraph.c +++ b/kernel/trace/fgraph.c @@ -1177,6 +1177,7 @@ void fgraph_update_pid_func(void) if (!(graph_ops.flags & FTRACE_OPS_FL_INITIALIZED)) return; +#ifdef CONFIG_DYNAMIC_FTRACE list_for_each_entry(op, _ops.subop_list, list) { if (op->flags & FTRACE_OPS_FL_PID) { gops = container_of(op, struct fgraph_ops, ops); @@ -1186,6 +1187,7 @@ void fgraph_update_pid_func(void) static_call_update(fgraph_func, gops->entryfunc); } } +#endif } /* Allocate a return stack for each task */ -- 2.43.0
Re: [PATCH] rv: Update rv_en(dis)able_monitor doc to match kernel-doc
On 5/20/24 07:42, Yang Li wrote: > The patch updates the function documentation comment for > rv_en(dis)able_monitor to adhere to the kernel-doc specification. > > Signed-off-by: Yang Li Acked-by: Daniel Bristot de Oliveira Thanks -- Daniel
[PATCH v2 3/4] eventfs: Update all the eventfs_inodes from the events descriptor
From: "Steven Rostedt (Google)" The change to update the permissions of the eventfs_inode had the misconception that using the tracefs_inode would find all the eventfs_inodes that have been updated and reset them on remount. The problem with this approach is that the eventfs_inodes are freed when they are no longer used (basically the reason the eventfs system exists). When they are freed, the updated eventfs_inodes are not reset on a remount because their tracefs_inodes have been freed. Instead, since the events directory eventfs_inode always has a tracefs_inode pointing to it (it is not freed when finished), and the events directory has a link to all its children, have the eventfs_remount() function only operate on the events eventfs_inode and have it descend into its children updating their uid and gids. Link: https://lore.kernel.org/all/cak7lnarxgaww3kh9jgrnh4vk6fr8ldknkf3wq8nhmwjrvwj...@mail.gmail.com/ Cc: sta...@vger.kernel.org Fixes: baa23a8d4360d ("tracefs: Reset permissions on remount if permissions are options") Reported-by: Masahiro Yamada Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 44 1 file changed, 31 insertions(+), 13 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 5dfb1ccd56ea..129d0f54ba62 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -305,27 +305,27 @@ static const struct file_operations eventfs_file_operations = { .llseek = generic_file_llseek, }; -/* - * On a remount of tracefs, if UID or GID options are set, then - * the mount point inode permissions should be used. - * Reset the saved permission flags appropriately. - */ -void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool update_gid) +static void eventfs_set_attrs(struct eventfs_inode *ei, bool update_uid, kuid_t uid, + bool update_gid, kgid_t gid, int level) { - struct eventfs_inode *ei = ti->private; + struct eventfs_inode *ei_child; - if (!ei) + /* Update events// */ + if (WARN_ON_ONCE(level > 3)) return; if (update_uid) { ei->attr.mode &= ~EVENTFS_SAVE_UID; - ei->attr.uid = ti->vfs_inode.i_uid; + ei->attr.uid = uid; } - if (update_gid) { ei->attr.mode &= ~EVENTFS_SAVE_GID; - ei->attr.gid = ti->vfs_inode.i_gid; + ei->attr.gid = gid; + } + + list_for_each_entry(ei_child, >children, list) { + eventfs_set_attrs(ei_child, update_uid, uid, update_gid, gid, level + 1); } if (!ei->entry_attrs) @@ -334,13 +334,31 @@ void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool update_gid) for (int i = 0; i < ei->nr_entries; i++) { if (update_uid) { ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID; - ei->entry_attrs[i].uid = ti->vfs_inode.i_uid; + ei->entry_attrs[i].uid = uid; } if (update_gid) { ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID; - ei->entry_attrs[i].gid = ti->vfs_inode.i_gid; + ei->entry_attrs[i].gid = gid; } } + +} + +/* + * On a remount of tracefs, if UID or GID options are set, then + * the mount point inode permissions should be used. + * Reset the saved permission flags appropriately. + */ +void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool update_gid) +{ + struct eventfs_inode *ei = ti->private; + + /* Only the events directory does the updates */ + if (!ei || !ei->is_events || ei->is_freed) + return; + + eventfs_set_attrs(ei, update_uid, ti->vfs_inode.i_uid, + update_gid, ti->vfs_inode.i_gid, 0); } /* Return the evenfs_inode of the "events" directory */ -- 2.43.0
[PATCH v2 2/4] tracefs: Update inode permissions on remount
From: "Steven Rostedt (Google)" When a remount happens, if a gid or uid is specified update the inodes to have the same gid and uid. This will allow the simplification of the permissions logic for the dynamically created files and directories. Cc: sta...@vger.kernel.org Fixes: baa23a8d4360d ("tracefs: Reset permissions on remount if permissions are options") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 17 + fs/tracefs/inode.c | 15 --- 2 files changed, 25 insertions(+), 7 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 55a40a730b10..5dfb1ccd56ea 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -317,20 +317,29 @@ void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool update_gid) if (!ei) return; - if (update_uid) + if (update_uid) { ei->attr.mode &= ~EVENTFS_SAVE_UID; + ei->attr.uid = ti->vfs_inode.i_uid; + } + - if (update_gid) + if (update_gid) { ei->attr.mode &= ~EVENTFS_SAVE_GID; + ei->attr.gid = ti->vfs_inode.i_gid; + } if (!ei->entry_attrs) return; for (int i = 0; i < ei->nr_entries; i++) { - if (update_uid) + if (update_uid) { ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID; - if (update_gid) + ei->entry_attrs[i].uid = ti->vfs_inode.i_uid; + } + if (update_gid) { ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID; + ei->entry_attrs[i].gid = ti->vfs_inode.i_gid; + } } } diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index a827f6a716c4..9252e0d78ea2 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -373,12 +373,21 @@ static int tracefs_apply_options(struct super_block *sb, bool remount) rcu_read_lock(); list_for_each_entry_rcu(ti, _inodes, list) { - if (update_uid) + if (update_uid) { ti->flags &= ~TRACEFS_UID_PERM_SET; + ti->vfs_inode.i_uid = fsi->uid; + } - if (update_gid) + if (update_gid) { ti->flags &= ~TRACEFS_GID_PERM_SET; - + ti->vfs_inode.i_gid = fsi->gid; + } + + /* +* Note, the above ti->vfs_inode updates are +* used in eventfs_remount() so they must come +* before calling it. +*/ if (ti->flags & TRACEFS_EVENT_INODE) eventfs_remount(ti, update_uid, update_gid); } -- 2.43.0
[PATCH] rv: Update rv_en(dis)able_monitor doc to match kernel-doc
The patch updates the function documentation comment for rv_en(dis)able_monitor to adhere to the kernel-doc specification. Signed-off-by: Yang Li --- kernel/trace/rv/rv.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c index 2f68e93fff0b..df0745a42a3f 100644 --- a/kernel/trace/rv/rv.c +++ b/kernel/trace/rv/rv.c @@ -245,6 +245,7 @@ static int __rv_disable_monitor(struct rv_monitor_def *mdef, bool sync) /** * rv_disable_monitor - disable a given runtime monitor + * @mdef: Pointer to the monitor definition structure. * * Returns 0 on success. */ @@ -256,6 +257,7 @@ int rv_disable_monitor(struct rv_monitor_def *mdef) /** * rv_enable_monitor - enable a given runtime monitor + * @mdef: Pointer to the monitor definition structure. * * Returns 0 on success, error otherwise. */ -- 2.20.1.7.g153144c
Re: [PATCH -next] rv: Update rv_en(dis)able_monitor doc to match kernel-doc
Hi Yang On 5/17/24 11:14, Yang Li wrote: > The patch updates the function documentation comment for > rv_en(dis)able_monitor to adhere to the kernel-doc specification. > > Signed-off-by: Yang Li > --- > kernel/trace/rv/rv.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c > index 2f68e93fff0b..df0745a42a3f 100644 > --- a/kernel/trace/rv/rv.c > +++ b/kernel/trace/rv/rv.c > @@ -245,6 +245,7 @@ static int __rv_disable_monitor(struct rv_monitor_def > *mdef, bool sync) > > /** > * rv_disable_monitor - disable a given runtime monitor > + * @mdef: Pointer to the monitor definition structure. This change is in for mainline kernel, why are you using the -next on the Subject? -- Daniel
[PATCH -next] rv: Update rv_en(dis)able_monitor doc to match kernel-doc
The patch updates the function documentation comment for rv_en(dis)able_monitor to adhere to the kernel-doc specification. Signed-off-by: Yang Li --- kernel/trace/rv/rv.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c index 2f68e93fff0b..df0745a42a3f 100644 --- a/kernel/trace/rv/rv.c +++ b/kernel/trace/rv/rv.c @@ -245,6 +245,7 @@ static int __rv_disable_monitor(struct rv_monitor_def *mdef, bool sync) /** * rv_disable_monitor - disable a given runtime monitor + * @mdef: Pointer to the monitor definition structure. * * Returns 0 on success. */ @@ -256,6 +257,7 @@ int rv_disable_monitor(struct rv_monitor_def *mdef) /** * rv_enable_monitor - enable a given runtime monitor + * @mdef: Pointer to the monitor definition structure. * * Returns 0 on success, error otherwise. */ -- 2.20.1.7.g153144c
[PATCH v2 5/6] documentation: Update on livepatch elf format
Add a section to Documentation/livepatch/module-elf-format.rst describing how klp-convert works for fixing relocations. Signed-off-by: Lukas Hruska Reviewed-by: Petr Mladek Reviewed-by: Marcos Paulo de Souza --- Documentation/livepatch/module-elf-format.rst | 67 +++ 1 file changed, 67 insertions(+) diff --git a/Documentation/livepatch/module-elf-format.rst b/Documentation/livepatch/module-elf-format.rst index a03ed02ec57e..2aa9b11cd806 100644 --- a/Documentation/livepatch/module-elf-format.rst +++ b/Documentation/livepatch/module-elf-format.rst @@ -300,3 +300,70 @@ symbol table, and relocation section indices, ELF information is preserved for livepatch modules and is made accessible by the module loader through module->klp_info, which is a :c:type:`klp_modinfo` struct. When a livepatch module loads, this struct is filled in by the module loader. + +6. klp-convert tool +=== +The livepatch relocation sections might be created using +scripts/livepatch/klp-convert. It is called automatically during +the build as part of a module post processing. + +The tool is not able to find the symbols and all the metadata +automatically. Instead, all needed information must already be +part of rela entry for the given symbol. Such a rela can +be created easily by using KLP_RELOC_SYMBOL() macro after +the symbol declaration. + +KLP_RELOC_SYMBOL causes that the relocation entries for +the given symbol will be created in the following format:: + + .klp.sym.rela.lp_object.sym_object.sym_name,sympos + ^ ^ ^ ^ ^^ ^ ^ ^ + |___| |___| || |__| | + [A] [B][C] [D][E] + +[A] + The symbol name is prefixed with the string ".klp.sym.rela." + +[B] + The name of the object (i.e. "vmlinux" or name of module) which + is livepatched. + +[C] + The name of the object (i.e. "vmlinux" or name of module) to + which the symbol belongs follows immediately after the prefix. + +[D] + The actual name of the symbol. + +[E] + The position of the symbol in the object (as according to kallsyms) + This is used to differentiate duplicate symbols within the same + object. The symbol position is expressed numerically (0, 1, 2...). + The symbol position of a unique symbol is 0. + +Example: + +**Livepatch source code:** + +:: + + extern char *saved_command_line \ + KLP_RELOC_SYMBOL(vmlinux, vmlinux, saved_command_line, 0); + +**`readelf -r -W` output of compiled module:** + +:: + + Relocation section '.rela.text' at offset 0x32e60 contains 10 entries: + Offset Info Type Symbol's Value Symbol's Name + Addend + ... + 0068 003c0002 R_X86_64_PC32 .klp.sym.rela.vmlinux.vmlinux.saved_command_line,0 - 4 + ... + +**`readelf -r -W` output of transformed module by klp-convert:** + +:: + + Relocation section '.klp.rela.vmlinux.text' at offset 0x5cb60 contains 1 entry: + Offset Info Type Symbol's Value Symbol's Name + Addend + 0068 003c0002 R_X86_64_PC32 .klp.sym.vmlinux.saved_command_line,0 - 4 -- 2.45.0
[PATCH v10 35/36] Documentation: probes: Update fprobe on function-graph tracer
From: Masami Hiramatsu (Google) Update fprobe documentation for the new fprobe on function-graph tracer. This includes some bahvior changes and pt_regs to ftrace_regs interface change. Signed-off-by: Masami Hiramatsu (Google) --- Changes in v2: - Update @fregs parameter explanation. --- Documentation/trace/fprobe.rst | 42 ++-- 1 file changed, 27 insertions(+), 15 deletions(-) diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst index 196f52386aaa..f58bdc64504f 100644 --- a/Documentation/trace/fprobe.rst +++ b/Documentation/trace/fprobe.rst @@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe Introduction -Fprobe is a function entry/exit probe mechanism based on ftrace. -Instead of using ftrace full feature, if you only want to attach callbacks -on function entry and exit, similar to the kprobes and kretprobes, you can +Fprobe is a function entry/exit probe mechanism based on the function-graph +tracer. +Instead of tracing all functions, if you want to attach callbacks on specific +function entry and exit, similar to the kprobes and kretprobes, you can use fprobe. Compared with kprobes and kretprobes, fprobe gives faster instrumentation for multiple functions with single handler. This document describes how to use fprobe. @@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as follows: .. code-block:: c - int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data); + int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data); - void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data); + void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data); -Note that the @entry_ip is saved at function entry and passed to exit handler. -If the entry callback function returns !0, the corresponding exit callback will be cancelled. +Note that the @entry_ip is saved at function entry and passed to exit +handler. +If the entry callback function returns !0, the corresponding exit callback +will be cancelled. @fp This is the address of `fprobe` data structure related to this handler. @@ -112,12 +115,10 @@ If the entry callback function returns !0, the corresponding exit callback will This is the return address that the traced function will return to, somewhere in the caller. This can be used at both entry and exit. -@regs -This is the `pt_regs` data structure at the entry and exit. Note that -the instruction pointer of @regs may be different from the @entry_ip -in the entry_handler. If you need traced instruction pointer, you need -to use @entry_ip. On the other hand, in the exit_handler, the instruction -pointer of @regs is set to the current return address. +@fregs +This is the `ftrace_regs` data structure at the entry and exit. This +includes the function parameters, or the return values. So user can +access thos values via appropriate `ftrace_regs_*` APIs. @entry_data This is a local storage to share the data between entry and exit handlers. @@ -125,6 +126,17 @@ If the entry callback function returns !0, the corresponding exit callback will and `entry_data_size` field when registering the fprobe, the storage is allocated and passed to both `entry_handler` and `exit_handler`. +Entry data size and exit handlers on the same function +== + +Since the entry data is passed via per-task stack and it is has limited size, +the entry data size per probe is limited to `15 * sizeof(long)`. You also need +to take care that the different fprobes are probing on the same function, this +limit becomes smaller. The entry data size is aligned to `sizeof(long)` and +each fprobe which has exit handler uses a `sizeof(long)` space on the stack, +you should keep the number of fprobes on the same function as small as +possible. + Share the callbacks with kprobes @@ -165,8 +177,8 @@ This counter counts up when; - fprobe fails to take ftrace_recursion lock. This usually means that a function which is traced by other ftrace users is called from the entry_handler. - - fprobe fails to setup the function exit because of the shortage of rethook - (the shadow stack for hooking the function return.) + - fprobe fails to setup the function exit because of failing to allocate the + data buffer from the per-task shadow stack. The `fprobe::nmissed` field counts up in both cases. Therefore, the former skips both of entry and exit callback and the latter skips the exit
[PATCH v9 35/36] Documentation: probes: Update fprobe on function-graph tracer
From: Masami Hiramatsu (Google) Update fprobe documentation for the new fprobe on function-graph tracer. This includes some bahvior changes and pt_regs to ftrace_regs interface change. Signed-off-by: Masami Hiramatsu (Google) --- Changes in v2: - Update @fregs parameter explanation. --- Documentation/trace/fprobe.rst | 42 ++-- 1 file changed, 27 insertions(+), 15 deletions(-) diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst index 196f52386aaa..f58bdc64504f 100644 --- a/Documentation/trace/fprobe.rst +++ b/Documentation/trace/fprobe.rst @@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe Introduction -Fprobe is a function entry/exit probe mechanism based on ftrace. -Instead of using ftrace full feature, if you only want to attach callbacks -on function entry and exit, similar to the kprobes and kretprobes, you can +Fprobe is a function entry/exit probe mechanism based on the function-graph +tracer. +Instead of tracing all functions, if you want to attach callbacks on specific +function entry and exit, similar to the kprobes and kretprobes, you can use fprobe. Compared with kprobes and kretprobes, fprobe gives faster instrumentation for multiple functions with single handler. This document describes how to use fprobe. @@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as follows: .. code-block:: c - int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data); + int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data); - void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data); + void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data); -Note that the @entry_ip is saved at function entry and passed to exit handler. -If the entry callback function returns !0, the corresponding exit callback will be cancelled. +Note that the @entry_ip is saved at function entry and passed to exit +handler. +If the entry callback function returns !0, the corresponding exit callback +will be cancelled. @fp This is the address of `fprobe` data structure related to this handler. @@ -112,12 +115,10 @@ If the entry callback function returns !0, the corresponding exit callback will This is the return address that the traced function will return to, somewhere in the caller. This can be used at both entry and exit. -@regs -This is the `pt_regs` data structure at the entry and exit. Note that -the instruction pointer of @regs may be different from the @entry_ip -in the entry_handler. If you need traced instruction pointer, you need -to use @entry_ip. On the other hand, in the exit_handler, the instruction -pointer of @regs is set to the current return address. +@fregs +This is the `ftrace_regs` data structure at the entry and exit. This +includes the function parameters, or the return values. So user can +access thos values via appropriate `ftrace_regs_*` APIs. @entry_data This is a local storage to share the data between entry and exit handlers. @@ -125,6 +126,17 @@ If the entry callback function returns !0, the corresponding exit callback will and `entry_data_size` field when registering the fprobe, the storage is allocated and passed to both `entry_handler` and `exit_handler`. +Entry data size and exit handlers on the same function +== + +Since the entry data is passed via per-task stack and it is has limited size, +the entry data size per probe is limited to `15 * sizeof(long)`. You also need +to take care that the different fprobes are probing on the same function, this +limit becomes smaller. The entry data size is aligned to `sizeof(long)` and +each fprobe which has exit handler uses a `sizeof(long)` space on the stack, +you should keep the number of fprobes on the same function as small as +possible. + Share the callbacks with kprobes @@ -165,8 +177,8 @@ This counter counts up when; - fprobe fails to take ftrace_recursion lock. This usually means that a function which is traced by other ftrace users is called from the entry_handler. - - fprobe fails to setup the function exit because of the shortage of rethook - (the shadow stack for hooking the function return.) + - fprobe fails to setup the function exit because of failing to allocate the + data buffer from the per-task shadow stack. The `fprobe::nmissed` field counts up in both cases. Therefore, the former skips both of entry and exit callback and the latter skips the exit
[PATCH v2 11/11] tracing: Update function tracing output for previous boot buffer
From: "Steven Rostedt (Google)" For a persistent ring buffer that is saved across boots, if function tracing was performed in the previous boot, it only saves the address of the functions and uses "%pS" to print their names. But the current boot, those functions may be in different locations. The persistent meta-data saves the text delta between the two boots and can be used to find the address of the saved function of where it is located in the current boot. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_output.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index d8b302d01083..b9d2c64c0648 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator *iter, int flags, } static void print_fn_trace(struct trace_seq *s, unsigned long ip, - unsigned long parent_ip, int flags) + unsigned long parent_ip, long delta, int flags) { + ip += delta; + parent_ip += delta; + seq_print_ip_sym(s, ip, flags); if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) { @@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct trace_iterator *iter, int flags, trace_assign_type(field, iter->ent); - print_fn_trace(s, field->ip, field->parent_ip, flags); + print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags); trace_seq_putc(s, '\n'); return trace_handle_return(s); @@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int flags, trace_assign_type(field, iter->ent); - print_fn_trace(s, field->ip, field->parent_ip, flags); + print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags); trace_seq_printf(s, " (repeats: %u, last_ts:", field->count); trace_print_time(s, iter, iter->ts - FUNC_REPEATS_GET_DELTA_TS(field)); -- 2.43.0
Re: [PATCH v2 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name
On 09/04/2024 20:36, Luca Weiss wrote: > Follow the gpio-hog bindings and use otg-hog as node name. > > Signed-off-by: Luca Weiss > --- > arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Krzysztof Kozlowski Best regards, Krzysztof
Re: [PATCH] ring-buffer: Only update pages_touched when a new page is touched
On Wed, 10 Apr 2024 08:44:00 +0900 Masami Hiramatsu (Google) wrote: > Looks good to me. > > Acked-by: Masami Hiramatsu (Google) Thanks. > > BTW, isn't this a real bugfix, because the page_touched can be > bigger than nr_pages without this fix? Yes, I simply forgot to add the Cc stable. -- Steve
Re: [PATCH] ring-buffer: Only update pages_touched when a new page is touched
On Tue, 9 Apr 2024 15:13:09 -0400 Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > The "buffer_percent" logic that is used by the ring buffer splice code to > only wake up the tasks when there's no data after the buffer is filled to > the percentage of the "buffer_percent" file is dependent on three > variables that determine the amount of data that is in the ring buffer: > > 1) pages_read - incremented whenever a new sub-buffer is consumed > 2) pages_lost - incremented every time a writer overwrites a sub-buffer > 3) pages_touched - incremented when a write goes to a new sub-buffer > > The percentage is the calculation of: > > (pages_touched - (pages_lost + pages_read)) / nr_pages > > Basically, the amount of data is the total number of sub-bufs that have been > touched, minus the number of sub-bufs lost and sub-bufs consumed. This is > divided by the total count to give the buffer percentage. When the > percentage is greater than the value in the "buffer_percent" file, it > wakes up splice readers waiting for that amount. > > It was observed that over time, the amount read from the splice was > constantly decreasing the longer the trace was running. That is, if one > asked for 60%, it would read over 60% when it first starts tracing, but > then it would be woken up at under 60% and would slowly decrease the > amount of data read after being woken up, where the amount becomes much > less than the buffer percent. > > This was due to an accounting of the pages_touched incrementation. This > value is incremented whenever a writer transfers to a new sub-buffer. But > the place where it was incremented was incorrect. If a writer overflowed > the current sub-buffer it would go to the next one. If it gets preempted > by an interrupt at that time, and the interrupt performs a trace, it too > will end up going to the next sub-buffer. But only one should increment > the counter. Unfortunately, that was not the case. > > Change the cmpxchg() that does the real switch of the tail-page into a > try_cmpxchg(), and on success, perform the increment of pages_touched. This > will only increment the counter once for when the writer moves to a new > sub-buffer, and not when there's a race and is incremented for when a > writer and its preempting writer both move to the same new sub-buffer. > Looks good to me. Acked-by: Masami Hiramatsu (Google) BTW, isn't this a real bugfix, because the page_touched can be bigger than nr_pages without this fix? Thank you, > Signed-off-by: Steven Rostedt (Google) > --- > kernel/trace/ring_buffer.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index 25476ead681b..6511dc3a00da 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -1393,7 +1393,6 @@ static void rb_tail_page_update(struct > ring_buffer_per_cpu *cpu_buffer, > old_write = local_add_return(RB_WRITE_INTCNT, _page->write); > old_entries = local_add_return(RB_WRITE_INTCNT, _page->entries); > > - local_inc(_buffer->pages_touched); > /* >* Just make sure we have seen our old_write and synchronize >* with any interrupts that come in. > @@ -1430,8 +1429,9 @@ static void rb_tail_page_update(struct > ring_buffer_per_cpu *cpu_buffer, > */ > local_set(_page->page->commit, 0); > > - /* Again, either we update tail_page or an interrupt does */ > - (void)cmpxchg(_buffer->tail_page, tail_page, next_page); > + /* Either we update tail_page or an interrupt does */ > + if (try_cmpxchg(_buffer->tail_page, _page, next_page)) > + local_inc(_buffer->pages_touched); > } > } > > -- > 2.43.0 > -- Masami Hiramatsu (Google)
[PATCH] ring-buffer: Only update pages_touched when a new page is touched
From: "Steven Rostedt (Google)" The "buffer_percent" logic that is used by the ring buffer splice code to only wake up the tasks when there's no data after the buffer is filled to the percentage of the "buffer_percent" file is dependent on three variables that determine the amount of data that is in the ring buffer: 1) pages_read - incremented whenever a new sub-buffer is consumed 2) pages_lost - incremented every time a writer overwrites a sub-buffer 3) pages_touched - incremented when a write goes to a new sub-buffer The percentage is the calculation of: (pages_touched - (pages_lost + pages_read)) / nr_pages Basically, the amount of data is the total number of sub-bufs that have been touched, minus the number of sub-bufs lost and sub-bufs consumed. This is divided by the total count to give the buffer percentage. When the percentage is greater than the value in the "buffer_percent" file, it wakes up splice readers waiting for that amount. It was observed that over time, the amount read from the splice was constantly decreasing the longer the trace was running. That is, if one asked for 60%, it would read over 60% when it first starts tracing, but then it would be woken up at under 60% and would slowly decrease the amount of data read after being woken up, where the amount becomes much less than the buffer percent. This was due to an accounting of the pages_touched incrementation. This value is incremented whenever a writer transfers to a new sub-buffer. But the place where it was incremented was incorrect. If a writer overflowed the current sub-buffer it would go to the next one. If it gets preempted by an interrupt at that time, and the interrupt performs a trace, it too will end up going to the next sub-buffer. But only one should increment the counter. Unfortunately, that was not the case. Change the cmpxchg() that does the real switch of the tail-page into a try_cmpxchg(), and on success, perform the increment of pages_touched. This will only increment the counter once for when the writer moves to a new sub-buffer, and not when there's a race and is incremented for when a writer and its preempting writer both move to the same new sub-buffer. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 25476ead681b..6511dc3a00da 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1393,7 +1393,6 @@ static void rb_tail_page_update(struct ring_buffer_per_cpu *cpu_buffer, old_write = local_add_return(RB_WRITE_INTCNT, _page->write); old_entries = local_add_return(RB_WRITE_INTCNT, _page->entries); - local_inc(_buffer->pages_touched); /* * Just make sure we have seen our old_write and synchronize * with any interrupts that come in. @@ -1430,8 +1429,9 @@ static void rb_tail_page_update(struct ring_buffer_per_cpu *cpu_buffer, */ local_set(_page->page->commit, 0); - /* Again, either we update tail_page or an interrupt does */ - (void)cmpxchg(_buffer->tail_page, tail_page, next_page); + /* Either we update tail_page or an interrupt does */ + if (try_cmpxchg(_buffer->tail_page, _page, next_page)) + local_inc(_buffer->pages_touched); } } -- 2.43.0
Re: [PATCH v2 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name
On Tue, 9 Apr 2024 at 21:37, Luca Weiss wrote: > > Follow the gpio-hog bindings and use otg-hog as node name. > > Signed-off-by: Luca Weiss > --- > arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > Reviewed-by: Dmitry Baryshkov -- With best wishes Dmitry
[PATCH v2 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name
Follow the gpio-hog bindings and use otg-hog as node name. Signed-off-by: Luca Weiss --- arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts b/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts index 4aaae8537a3f..06549051be50 100644 --- a/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts +++ b/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts @@ -328,7 +328,7 @@ wlan_regulator_pin: wl-reg-active-state { power-source = ; }; - otg { + otg-hog { gpio-hog; gpios = <35 GPIO_ACTIVE_HIGH>; output-high; -- 2.44.0
[PATCH 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name
Follow the gpio-hog bindings and use otg-hog as node name. Signed-off-by: Luca Weiss --- arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts b/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts index 4aaae8537a3f..06549051be50 100644 --- a/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts +++ b/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts @@ -328,7 +328,7 @@ wlan_regulator_pin: wl-reg-active-state { power-source = ; }; - otg { + otg-hog { gpio-hog; gpios = <35 GPIO_ACTIVE_HIGH>; output-high; -- 2.44.0
[PATCH v6 0/2] Update mce_record tracepoint
This patchset updates the mce_record tracepoint so that the recently added fields of struct mce are exported through it to userspace. The first patch adds PPIN (Protected Processor Inventory Number) field to the tracepoint. The second patch adds the microcode field (Microcode Revision) to the tracepoint. Changes in v2: - Export microcode field (Microcode Revision) through the tracepoiont in addition to PPIN. Changes in v3: - Change format specifier for microcode revision from %u to %x - Fix tab alignments - Add Reviewed-by: Sohil Mehta Changes in v4: - Update commit messages to reflect the reason for the fields being added to the tracepoint. - Add comment to explicitly state the type of information that should be added to the tracepoint. - Add Reviewed-by: Steven Rostedt (Google) Changes in v5: - Changed "MICROCODE REVISION" to just "MICROCODE". - Changed words which are not acronyms from ALL CAPS to no caps. - Added Reviewed-by: Tony Luck Changes in v6: - Rebased on top of Ingo's changes to the MCE tracepoint https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/include/trace/events/mce.h?id=ac5e80e94f5c67d7053f50fc3faddab931707f0f [NOTE: - Since changes in this version are very minor, have retained the below tags received for previous versions: Reviewed-by: Sohil Mehta Reviewed-by: Steven Rostedt (Google) Reviewed-by: Tony Luck ] Avadhut Naik (2): tracing: Include PPIN in mce_record tracepoint tracing: Include Microcode Revision in mce_record tracepoint include/trace/events/mce.h | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) base-commit: 65d1240b6728b38e4d2068d6738a17e4ee4351f5 -- 2.34.1
[RESEND v5 0/2] Update mce_record tracepoint
This patchset updates the mce_record tracepoint so that the recently added fields of struct mce are exported through it to userspace. The first patch adds PPIN (Protected Processor Inventory Number) field to the tracepoint. The second patch adds the microcode field (Microcode Revision) to the tracepoint. Changes in v2: - Export microcode field (Microcode Revision) through the tracepoiont in addition to PPIN. Changes in v3: - Change format specifier for microcode revision from %u to %x - Fix tab alignments - Add Reviewed-by: Sohil Mehta Changes in v4: - Update commit messages to reflect the reason for the fields being added to the tracepoint. - Add comment to explicitly state the type of information that should be added to the tracepoint. - Add Reviewed-by: Steven Rostedt (Google) Changes in v5: - Changed "MICROCODE REVISION" to just "MICROCODE". - Changed words which are not acronyms from ALL CAPS to no caps. - Added Reviewed-by: Tony Luck [NOTE: - Since only caps of words which are not acronyms have been changed in this version and the word "REVISION" has been removed i.e. changes are very minor, have retained the the below tags received for previous versions: Reviewed-by: Sohil Mehta Reviewed-by: Steven Rostedt (Google) Reviewed-by: Tony Luck - Missed adding linux-kernel@vger.kernel.org with v5. Hence, resending.] Avadhut Naik (2): tracing: Include PPIN in mce_record tracepoint tracing: Include Microcode Revision in mce_record tracepoint include/trace/events/mce.h | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) base-commit: 9699768066a9b12db247d5ee9276cf298b67528d -- 2.34.1
Re: [PATCH v5 0/2] Update mce_record tracepoint
On 3/28/2024 13:14, Sohil Mehta wrote: > On 3/28/2024 11:04 AM, Avadhut Naik wrote: > >> - Since only caps of words which are not acronyms have been changed in >>this version and the word "REVISION" has been removed i.e. changes are >>very minor, have retained the the below tags received for previous >>versions: >> Reviewed-by: Sohil Mehta > >> Avadhut Naik (2): >> tracing: Include PPIN in mce_record tracepoint >> tracing: Include Microcode Revision in mce_record tracepoint > > > The patches look good to me. Kindly ignore this set. As pointed out by Sohit, have missed adding linux-ker...@vger.kernel.org Will resend the set. -- Thanks, Avadhut Naik
Re: [PATCH v5 0/2] Update mce_record tracepoint
On 3/28/2024 11:04 AM, Avadhut Naik wrote: > - Since only caps of words which are not acronyms have been changed in >this version and the word "REVISION" has been removed i.e. changes are >very minor, have retained the the below tags received for previous >versions: > Reviewed-by: Sohil Mehta > Avadhut Naik (2): > tracing: Include PPIN in mce_record tracepoint > tracing: Include Microcode Revision in mce_record tracepoint The patches look good to me.
[PATCH v5 0/2] Update mce_record tracepoint
This patchset updates the mce_record tracepoint so that the recently added fields of struct mce are exported through it to userspace. The first patch adds PPIN (Protected Processor Inventory Number) field to the tracepoint. The second patch adds the microcode field (Microcode Revision) to the tracepoint. Changes in v2: - Export microcode field (Microcode Revision) through the tracepoiont in addition to PPIN. Changes in v3: - Change format specifier for microcode revision from %u to %x - Fix tab alignments - Add Reviewed-by: Sohil Mehta Changes in v4: - Update commit messages to reflect the reason for the fields being added to the tracepoint. - Add comment to explicitly state the type of information that should be added to the tracepoint. - Add Reviewed-by: Steven Rostedt (Google) Changes in v5: - Changed "MICROCODE REVISION" to just "MICROCODE". - Changed words which are not acronyms from ALL CAPS to no caps. - Added Reviewed-by: Tony Luck [NOTE: - Since only caps of words which are not acronyms have been changed in this version and the word "REVISION" has been removed i.e. changes are very minor, have retained the the below tags received for previous versions: Reviewed-by: Sohil Mehta Reviewed-by: Steven Rostedt (Google) Reviewed-by: Tony Luck ] Avadhut Naik (2): tracing: Include PPIN in mce_record tracepoint tracing: Include Microcode Revision in mce_record tracepoint include/trace/events/mce.h | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) base-commit: 9699768066a9b12db247d5ee9276cf298b67528d -- 2.34.1
[PATCH v4 0/2] Update mce_record tracepoint
This patchset updates the mce_record tracepoint so that the recently added fields of struct mce are exported through it to userspace. The first patch adds PPIN (Protected Processor Inventory Number) field to the tracepoint. The second patch adds the microcode field (Microcode Revision) to the tracepoint. Changes in v2: - Export microcode field (Microcode Revision) through the tracepoiont in addition to PPIN. Changes in v3: - Change format specifier for microcode revision from %u to %x - Fix tab alignments - Add Reviewed-by: Sohil Mehta Changes in v4: - Update commit messages to reflect the reason for the fields being added to the tracepoint. - Add comment to explicitly state the type of information that should be added to the tracepoint. - Add Reviewed-by: Steven Rostedt (Google) [NOTE: - Since only a comment has been added and only the commit messages have been reworked i.e. no code changes have been undertaken for this version, have the retained the below tags from v3: Reviewed-by: Sohil Mehta Reviewed-by: Steven Rostedt (Google) ] Avadhut Naik (2): tracing: Include PPIN in mce_record tracepoint tracing: Include Microcode Revision in mce_record tracepoint include/trace/events/mce.h | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) base-commit: 818ea9b4c8237f96ac99dc0b2f02dd6d3f2adb97 -- 2.34.1
Re: [PATCH v3 0/2] Update mce_record tracepoint
On 3/25/2024 15:31, Borislav Petkov wrote: > On Mon, Mar 25, 2024 at 03:12:14PM -0500, Naik, Avadhut wrote: >> Can this patchset be merged in? Or would you prefer me sending out >> another revision with Steven's "Reviewed-by:" tag? > > First of all, please do not top-post. > Apologies for that! > Then, you were on Cc on the previous thread. Please summarize from it > and put in the commit message *why* it is good to have each field added. > > And then, above the tracepoint, I'd like you to add a rule which > states what information can and should be added to the tracepoint. And > no, "just because" is not good enough. The previous thread has hints. > Thanks for the clarification! Will update accordingly. > Thx. > -- Thanks, Avadhut Naik
Re: [PATCH v3 0/2] Update mce_record tracepoint
On Mon, Mar 25, 2024 at 03:12:14PM -0500, Naik, Avadhut wrote: > Can this patchset be merged in? Or would you prefer me sending out > another revision with Steven's "Reviewed-by:" tag? First of all, please do not top-post. Then, you were on Cc on the previous thread. Please summarize from it and put in the commit message *why* it is good to have each field added. And then, above the tracepoint, I'd like you to add a rule which states what information can and should be added to the tracepoint. And no, "just because" is not good enough. The previous thread has hints. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
[PATCH v3 0/2] Update mce_record tracepoint
Hi Boris, Can this patchset be merged in? Or would you prefer me sending out another revision with Steven's "Reviewed-by:" tag? On 2/8/2024 11:10, Steven Rostedt wrote: > On Fri, 26 Jan 2024 01:57:58 -0600 > Avadhut Naik wrote: > >> This patchset updates the mce_record tracepoint so that the recently added >> fields of struct mce are exported through it to userspace. >> >> The first patch adds PPIN (Protected Processor Inventory Number) field to >> the tracepoint. >> >> The second patch adds the microcode field (Microcode Revision) to the >> tracepoint. > > From a tracing POV only: > > Reviewed-by: Steven Rostedt (Google) > > -- Steve -- Thanks, Avadhut Naik
[PATCH v8 35/35] Documentation: probes: Update fprobe on function-graph tracer
From: Masami Hiramatsu (Google) Update fprobe documentation for the new fprobe on function-graph tracer. This includes some bahvior changes and pt_regs to ftrace_regs interface change. Signed-off-by: Masami Hiramatsu (Google) --- Changes in v2: - Update @fregs parameter explanation. --- Documentation/trace/fprobe.rst | 42 ++-- 1 file changed, 27 insertions(+), 15 deletions(-) diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst index 196f52386aaa..f58bdc64504f 100644 --- a/Documentation/trace/fprobe.rst +++ b/Documentation/trace/fprobe.rst @@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe Introduction -Fprobe is a function entry/exit probe mechanism based on ftrace. -Instead of using ftrace full feature, if you only want to attach callbacks -on function entry and exit, similar to the kprobes and kretprobes, you can +Fprobe is a function entry/exit probe mechanism based on the function-graph +tracer. +Instead of tracing all functions, if you want to attach callbacks on specific +function entry and exit, similar to the kprobes and kretprobes, you can use fprobe. Compared with kprobes and kretprobes, fprobe gives faster instrumentation for multiple functions with single handler. This document describes how to use fprobe. @@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as follows: .. code-block:: c - int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data); + int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data); - void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data); + void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data); -Note that the @entry_ip is saved at function entry and passed to exit handler. -If the entry callback function returns !0, the corresponding exit callback will be cancelled. +Note that the @entry_ip is saved at function entry and passed to exit +handler. +If the entry callback function returns !0, the corresponding exit callback +will be cancelled. @fp This is the address of `fprobe` data structure related to this handler. @@ -112,12 +115,10 @@ If the entry callback function returns !0, the corresponding exit callback will This is the return address that the traced function will return to, somewhere in the caller. This can be used at both entry and exit. -@regs -This is the `pt_regs` data structure at the entry and exit. Note that -the instruction pointer of @regs may be different from the @entry_ip -in the entry_handler. If you need traced instruction pointer, you need -to use @entry_ip. On the other hand, in the exit_handler, the instruction -pointer of @regs is set to the current return address. +@fregs +This is the `ftrace_regs` data structure at the entry and exit. This +includes the function parameters, or the return values. So user can +access thos values via appropriate `ftrace_regs_*` APIs. @entry_data This is a local storage to share the data between entry and exit handlers. @@ -125,6 +126,17 @@ If the entry callback function returns !0, the corresponding exit callback will and `entry_data_size` field when registering the fprobe, the storage is allocated and passed to both `entry_handler` and `exit_handler`. +Entry data size and exit handlers on the same function +== + +Since the entry data is passed via per-task stack and it is has limited size, +the entry data size per probe is limited to `15 * sizeof(long)`. You also need +to take care that the different fprobes are probing on the same function, this +limit becomes smaller. The entry data size is aligned to `sizeof(long)` and +each fprobe which has exit handler uses a `sizeof(long)` space on the stack, +you should keep the number of fprobes on the same function as small as +possible. + Share the callbacks with kprobes @@ -165,8 +177,8 @@ This counter counts up when; - fprobe fails to take ftrace_recursion lock. This usually means that a function which is traced by other ftrace users is called from the entry_handler. - - fprobe fails to setup the function exit because of the shortage of rethook - (the shadow stack for hooking the function return.) + - fprobe fails to setup the function exit because of failing to allocate the + data buffer from the per-task shadow stack. The `fprobe::nmissed` field counts up in both cases. Therefore, the former skips both of entry and exit callback and the latter skips the exit
Re: [PATCH v2] mm: Update mark_victim tracepoints fields
On Thu, Feb 22, 2024 at 9:59 AM Carlos Galo wrote: > > On Thu, Feb 22, 2024 at 6:16 AM Michal Hocko wrote: > > > > On Wed 21-02-24 13:30:51, Carlos Galo wrote: > > > On Tue, Feb 20, 2024 at 11:55 PM Michal Hocko wrote: > > > > > > > > Hi, > > > > sorry I have missed this before. > > > > > > > > On Thu 11-01-24 21:05:30, Carlos Galo wrote: > > > > > The current implementation of the mark_victim tracepoint provides only > > > > > the process ID (pid) of the victim process. This limitation poses > > > > > challenges for userspace tools that need additional information > > > > > about the OOM victim. The association between pid and the additional > > > > > data may be lost after the kill, making it difficult for userspace to > > > > > correlate the OOM event with the specific process. > > > > > > > > You are correct that post OOM all per-process information is lost. On > > > > the other hand we do dump all this information to the kernel log. Could > > > > you explain why that is not suitable for your purpose? > > > > > > Userspace tools often need real-time visibility into OOM situations > > > for userspace intervention. Our use case involves utilizing BPF > > > programs, along with BPF ring buffers, to provide OOM notification to > > > userspace. Parsing kernel logs would be significant overhead as > > > opposed to the event based BPF approach. > > > > Please put that into the changelog. > > Will do. > > > > > > In order to mitigate this limitation, add the following fields: > > > > > > > > > > - UID > > > > >In Android each installed application has a unique UID. Including > > > > >the `uid` assists in correlating OOM events with specific apps. > > > > > > > > > > - Process Name (comm) > > > > >Enables identification of the affected process. > > > > > > > > > > - OOM Score > > > > >Allows userspace to get additional insights of the relative kill > > > > >priority of the OOM victim. > > > > > > > > What is the oom score useful for? > > > > > > > The OOM score provides us a measure of the victim's importance. On the > > > android side, it allows us to identify if top or foreground apps are > > > killed, which have user perceptible impact. > > > > But the value on its own (wihtout knowing scores of other tasks) doesn't > > really tell you anything, does it? > > Android uses the OOM adj_score ranges to categorize app state > (foreground, background, ...). I'll resend a v3 with the commit text > updated to include details about this. > > > > > Is there any reason to provide a different information from the one > > > > reported to the kernel log? > > > > __oom_kill_process: > > > > pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, > > > > file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB > > > > oom_score_adj:%hd\n", > > > > message, task_pid_nr(victim), victim->comm, > > > > K(mm->total_vm), > > > > K(get_mm_counter(mm, MM_ANONPAGES)), > > > > K(get_mm_counter(mm, MM_FILEPAGES)), > > > > K(get_mm_counter(mm, MM_SHMEMPAGES)), > > > > from_kuid(_user_ns, task_uid(victim)), > > > > mm_pgtables_bytes(mm) >> 10, > > > > victim->signal->oom_score_adj); > > > > > > > > > > We added these fields we need (UID, process name, and OOM score), but > > > we're open to adding the others if you prefer that for consistency > > > with the kernel log. > > > > yes, I think the consistency would be better here. For one it reports > > numbers which can tell quite a lot about the killed victim. It is a > > superset of what you already asking for. With a notable exception of the > > oom_score which is really dubious without a wider context. > > Sounds good, I'll resend a v3 that includes these fields as well. > > Thanks, > Carlos > I posted V3 here: https://lore.kernel.org/all/20240223173258.174828-1-carlosg...@google.com/ Thanks, Carlos > > -- > > Michal Hocko > > SUSE Labs
[PATCH v3] mm: Update mark_victim tracepoints fields
The current implementation of the mark_victim tracepoint provides only the process ID (pid) of the victim process. This limitation poses challenges for userspace tools requiring real-time OOM analysis and intervention. Although this information is available from the kernel logs, it’s not the appropriate format to provide OOM notifications. In Android, BPF programs are used with the mark_victim trace events to notify userspace of an OOM kill. For consistency, update the trace event to include the same information about the OOMed victim as the kernel logs. - UID In Android each installed application has a unique UID. Including the `uid` assists in correlating OOM events with specific apps. - Process Name (comm) Enables identification of the affected process. - OOM Score Will allow userspace to get additional insight of the relative kill priority of the OOM victim. In Android, the oom_score_adj is used to categorize app state (foreground, background, etc.), which aids in analyzing user-perceptible impacts of OOM events [1]. - Total VM, RSS Stats, and pgtables Amount of memory used by the victim that will, potentially, be freed up by killing it. [1] https://cs.android.com/android/platform/superproject/main/+/246dc8fc95b6d93afcba5c6d6c133307abb3ac2e:frameworks/base/services/core/java/com/android/server/am/ProcessList.java;l=188-283 Cc: Steven Rostedt Cc: Andrew Morton Cc: Suren Baghdasaryan Cc: Michal Hocko Reviewed-by: Steven Rostedt Signed-off-by: Carlos Galo --- v3: - Added total_vm, rss fields, and pgtables per Michal Hocko. - Added Steven Rostedt reviewed by - Updated commit description to include android usecase v2: Fixed build error. Added missing comma when printing `__entry->uid`. include/trace/events/oom.h | 36 mm/oom_kill.c | 6 +- 2 files changed, 37 insertions(+), 5 deletions(-) diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h index 26a11e4a2c36..b799f3bcba82 100644 --- a/include/trace/events/oom.h +++ b/include/trace/events/oom.h @@ -7,6 +7,8 @@ #include #include +#define PG_COUNT_TO_KB(x) ((x) << (PAGE_SHIFT - 10)) + TRACE_EVENT(oom_score_adj_update, TP_PROTO(struct task_struct *task), @@ -72,19 +74,45 @@ TRACE_EVENT(reclaim_retry_zone, ); TRACE_EVENT(mark_victim, - TP_PROTO(int pid), + TP_PROTO(struct task_struct *task, uid_t uid), - TP_ARGS(pid), + TP_ARGS(task, uid), TP_STRUCT__entry( __field(int, pid) + __string(comm, task->comm) + __field(unsigned long, total_vm) + __field(unsigned long, anon_rss) + __field(unsigned long, file_rss) + __field(unsigned long, shmem_rss) + __field(uid_t, uid) + __field(unsigned long, pgtables) + __field(short, oom_score_adj) ), TP_fast_assign( - __entry->pid = pid; + __entry->pid = task->pid; + __assign_str(comm, task->comm); + __entry->total_vm = PG_COUNT_TO_KB(task->mm->total_vm); + __entry->anon_rss = PG_COUNT_TO_KB(get_mm_counter(task->mm, MM_ANONPAGES)); + __entry->file_rss = PG_COUNT_TO_KB(get_mm_counter(task->mm, MM_FILEPAGES)); + __entry->shmem_rss = PG_COUNT_TO_KB(get_mm_counter(task->mm, MM_SHMEMPAGES)); + __entry->uid = uid; + __entry->pgtables = mm_pgtables_bytes(task->mm) >> 10; + __entry->oom_score_adj = task->signal->oom_score_adj; ), - TP_printk("pid=%d", __entry->pid) + TP_printk("pid=%d comm=%s total-vm=%lukB anon-rss=%lukB file-rss:%lukB shmem-rss:%lukB uid=%u pgtables=%lukB oom_score_adj=%hd", + __entry->pid, + __get_str(comm), + __entry->total_vm, + __entry->anon_rss, + __entry->file_rss, + __entry->shmem_rss, + __entry->uid, + __entry->pgtables, + __entry->oom_score_adj + ) ); TRACE_EVENT(wake_reaper, diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 91ccd82097c2..8d6a207c3c59 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -44,6 +44,7 @@ #include #include #include +#include #include #include "internal.h" @@ -754,6 +755,7 @@ static inline void queue_oom_reaper(struct task_struct *tsk) */ static void mark_oom_victim(struct task_struct *tsk) { + const struct cred *cred; struct mm_struct *mm = tsk->mm; WARN_ON(oom_killer_disabled); @@ -773,7 +775,9 @@ static void mark_oom_victim(struct task_struct *tsk) */ __thaw_task(tsk); atomic_inc(_victims); - trace_mark_victim(tsk->pid); + cred = get_task_cred(tsk); + trace_mark_victim(tsk, cred->uid.val); + put_cred(cred); } /** -- 2.44.0.rc0.258.g7320e95886-goog
Re: [PATCH v2] mm: Update mark_victim tracepoints fields
On Thu, Feb 22, 2024 at 6:16 AM Michal Hocko wrote: > > On Wed 21-02-24 13:30:51, Carlos Galo wrote: > > On Tue, Feb 20, 2024 at 11:55 PM Michal Hocko wrote: > > > > > > Hi, > > > sorry I have missed this before. > > > > > > On Thu 11-01-24 21:05:30, Carlos Galo wrote: > > > > The current implementation of the mark_victim tracepoint provides only > > > > the process ID (pid) of the victim process. This limitation poses > > > > challenges for userspace tools that need additional information > > > > about the OOM victim. The association between pid and the additional > > > > data may be lost after the kill, making it difficult for userspace to > > > > correlate the OOM event with the specific process. > > > > > > You are correct that post OOM all per-process information is lost. On > > > the other hand we do dump all this information to the kernel log. Could > > > you explain why that is not suitable for your purpose? > > > > Userspace tools often need real-time visibility into OOM situations > > for userspace intervention. Our use case involves utilizing BPF > > programs, along with BPF ring buffers, to provide OOM notification to > > userspace. Parsing kernel logs would be significant overhead as > > opposed to the event based BPF approach. > > Please put that into the changelog. Will do. > > > > In order to mitigate this limitation, add the following fields: > > > > > > > > - UID > > > >In Android each installed application has a unique UID. Including > > > >the `uid` assists in correlating OOM events with specific apps. > > > > > > > > - Process Name (comm) > > > >Enables identification of the affected process. > > > > > > > > - OOM Score > > > >Allows userspace to get additional insights of the relative kill > > > >priority of the OOM victim. > > > > > > What is the oom score useful for? > > > > > The OOM score provides us a measure of the victim's importance. On the > > android side, it allows us to identify if top or foreground apps are > > killed, which have user perceptible impact. > > But the value on its own (wihtout knowing scores of other tasks) doesn't > really tell you anything, does it? Android uses the OOM adj_score ranges to categorize app state (foreground, background, ...). I'll resend a v3 with the commit text updated to include details about this. > > > Is there any reason to provide a different information from the one > > > reported to the kernel log? > > > __oom_kill_process: > > > pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, > > > file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB > > > oom_score_adj:%hd\n", > > > message, task_pid_nr(victim), victim->comm, > > > K(mm->total_vm), > > > K(get_mm_counter(mm, MM_ANONPAGES)), > > > K(get_mm_counter(mm, MM_FILEPAGES)), > > > K(get_mm_counter(mm, MM_SHMEMPAGES)), > > > from_kuid(_user_ns, task_uid(victim)), > > > mm_pgtables_bytes(mm) >> 10, > > > victim->signal->oom_score_adj); > > > > > > > We added these fields we need (UID, process name, and OOM score), but > > we're open to adding the others if you prefer that for consistency > > with the kernel log. > > yes, I think the consistency would be better here. For one it reports > numbers which can tell quite a lot about the killed victim. It is a > superset of what you already asking for. With a notable exception of the > oom_score which is really dubious without a wider context. Sounds good, I'll resend a v3 that includes these fields as well. Thanks, Carlos > -- > Michal Hocko > SUSE Labs
Re: [PATCH v2] mm: Update mark_victim tracepoints fields
On Wed 21-02-24 13:30:51, Carlos Galo wrote: > On Tue, Feb 20, 2024 at 11:55 PM Michal Hocko wrote: > > > > Hi, > > sorry I have missed this before. > > > > On Thu 11-01-24 21:05:30, Carlos Galo wrote: > > > The current implementation of the mark_victim tracepoint provides only > > > the process ID (pid) of the victim process. This limitation poses > > > challenges for userspace tools that need additional information > > > about the OOM victim. The association between pid and the additional > > > data may be lost after the kill, making it difficult for userspace to > > > correlate the OOM event with the specific process. > > > > You are correct that post OOM all per-process information is lost. On > > the other hand we do dump all this information to the kernel log. Could > > you explain why that is not suitable for your purpose? > > Userspace tools often need real-time visibility into OOM situations > for userspace intervention. Our use case involves utilizing BPF > programs, along with BPF ring buffers, to provide OOM notification to > userspace. Parsing kernel logs would be significant overhead as > opposed to the event based BPF approach. Please put that into the changelog. > > > In order to mitigate this limitation, add the following fields: > > > > > > - UID > > >In Android each installed application has a unique UID. Including > > >the `uid` assists in correlating OOM events with specific apps. > > > > > > - Process Name (comm) > > >Enables identification of the affected process. > > > > > > - OOM Score > > >Allows userspace to get additional insights of the relative kill > > >priority of the OOM victim. > > > > What is the oom score useful for? > > > The OOM score provides us a measure of the victim's importance. On the > android side, it allows us to identify if top or foreground apps are > killed, which have user perceptible impact. But the value on its own (wihtout knowing scores of other tasks) doesn't really tell you anything, does it? > > Is there any reason to provide a different information from the one > > reported to the kernel log? > > __oom_kill_process: > > pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, > > file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n", > > message, task_pid_nr(victim), victim->comm, K(mm->total_vm), > > K(get_mm_counter(mm, MM_ANONPAGES)), > > K(get_mm_counter(mm, MM_FILEPAGES)), > > K(get_mm_counter(mm, MM_SHMEMPAGES)), > > from_kuid(_user_ns, task_uid(victim)), > > mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj); > > > > We added these fields we need (UID, process name, and OOM score), but > we're open to adding the others if you prefer that for consistency > with the kernel log. yes, I think the consistency would be better here. For one it reports numbers which can tell quite a lot about the killed victim. It is a superset of what you already asking for. With a notable exception of the oom_score which is really dubious without a wider context. -- Michal Hocko SUSE Labs
Re: [PATCH v2] mm: Update mark_victim tracepoints fields
On Tue, Feb 20, 2024 at 11:55 PM Michal Hocko wrote: > > Hi, > sorry I have missed this before. > > On Thu 11-01-24 21:05:30, Carlos Galo wrote: > > The current implementation of the mark_victim tracepoint provides only > > the process ID (pid) of the victim process. This limitation poses > > challenges for userspace tools that need additional information > > about the OOM victim. The association between pid and the additional > > data may be lost after the kill, making it difficult for userspace to > > correlate the OOM event with the specific process. > > You are correct that post OOM all per-process information is lost. On > the other hand we do dump all this information to the kernel log. Could > you explain why that is not suitable for your purpose? Userspace tools often need real-time visibility into OOM situations for userspace intervention. Our use case involves utilizing BPF programs, along with BPF ring buffers, to provide OOM notification to userspace. Parsing kernel logs would be significant overhead as opposed to the event based BPF approach. > > In order to mitigate this limitation, add the following fields: > > > > - UID > >In Android each installed application has a unique UID. Including > >the `uid` assists in correlating OOM events with specific apps. > > > > - Process Name (comm) > >Enables identification of the affected process. > > > > - OOM Score > >Allows userspace to get additional insights of the relative kill > >priority of the OOM victim. > > What is the oom score useful for? > The OOM score provides us a measure of the victim's importance. On the android side, it allows us to identify if top or foreground apps are killed, which have user perceptible impact. > Is there any reason to provide a different information from the one > reported to the kernel log? > __oom_kill_process: > pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, > file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n", > message, task_pid_nr(victim), victim->comm, K(mm->total_vm), > K(get_mm_counter(mm, MM_ANONPAGES)), > K(get_mm_counter(mm, MM_FILEPAGES)), > K(get_mm_counter(mm, MM_SHMEMPAGES)), > from_kuid(_user_ns, task_uid(victim)), > mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj); > We added these fields we need (UID, process name, and OOM score), but we're open to adding the others if you prefer that for consistency with the kernel log. Thanks, Carlos > -- > Michal Hocko > SUSE Labs
Re: [PATCH v2] mm: Update mark_victim tracepoints fields
Hi, sorry I have missed this before. On Thu 11-01-24 21:05:30, Carlos Galo wrote: > The current implementation of the mark_victim tracepoint provides only > the process ID (pid) of the victim process. This limitation poses > challenges for userspace tools that need additional information > about the OOM victim. The association between pid and the additional > data may be lost after the kill, making it difficult for userspace to > correlate the OOM event with the specific process. You are correct that post OOM all per-process information is lost. On the other hand we do dump all this information to the kernel log. Could you explain why that is not suitable for your purpose? > In order to mitigate this limitation, add the following fields: > > - UID >In Android each installed application has a unique UID. Including >the `uid` assists in correlating OOM events with specific apps. > > - Process Name (comm) >Enables identification of the affected process. > > - OOM Score >Allows userspace to get additional insights of the relative kill >priority of the OOM victim. What is the oom score useful for? Is there any reason to provide a different information from the one reported to the kernel log? __oom_kill_process: pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n", message, task_pid_nr(victim), victim->comm, K(mm->total_vm), K(get_mm_counter(mm, MM_ANONPAGES)), K(get_mm_counter(mm, MM_FILEPAGES)), K(get_mm_counter(mm, MM_SHMEMPAGES)), from_kuid(_user_ns, task_uid(victim)), mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj); -- Michal Hocko SUSE Labs
Re: [PATCH v3] mm: compaction: update the cc->nr_migratepages when allocating or freeing the freepages
On 2/20/24 07:16, Baolin Wang wrote: > Currently we will use 'cc->nr_freepages >= cc->nr_migratepages' comparison > to ensure that enough freepages are isolated in isolate_freepages(), however > it just decreases the cc->nr_freepages without updating cc->nr_migratepages > in compaction_alloc(), which will waste more CPU cycles and cause too many > freepages to be isolated. > > So we should also update the cc->nr_migratepages when allocating or freeing > the freepages to avoid isolating excess freepages. And I can see fewer free > pages are scanned and isolated when running thpcompact on my Arm64 server: >k6.7 k6.7_patched > Ops Compaction pages isolated 120692036.00 118160797.00 > Ops Compaction migrate scanned 131210329.00 154093268.00 > Ops Compaction free scanned 1090587971.00 1080632536.00 > Ops Compact scan efficiency 12.03 14.26 > > Moreover, I did not see an obvious latency improvements, this is likely > because > isolating freepages is not the bottleneck in the thpcompact test case. > k6.7 k6.7_patched > Amean fault-both-1 1089.76 ( 0.00%) 1080.16 * 0.88%* > Amean fault-both-3 1616.48 ( 0.00%) 1636.65 * -1.25%* > Amean fault-both-5 2266.66 ( 0.00%) 2219.20 * 2.09%* > Amean fault-both-7 2909.84 ( 0.00%) 2801.90 * 3.71%* > Amean fault-both-12 4861.26 ( 0.00%) 4733.25 * 2.63%* > Amean fault-both-18 7351.11 ( 0.00%) 6950.51 * 5.45%* > Amean fault-both-24 9059.30 ( 0.00%) 9159.99 * -1.11%* > Amean fault-both-3010685.68 ( 0.00%)11399.02 * -6.68%* > > Signed-off-by: Baolin Wang > Acked-by: Mel Gorman Reviewed-by: Vlastimil Babka Thanks. > --- > Hi Andrew, please use this patch to replace below 2 old patches. Thanks. > https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything-2024-02-20-04-19=cb30899d456d64fb83ee3e3d95cd9fbb18735d87 > https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything-2024-02-20-04-19=65713d1c4fc498d35dc1f7c1234ef815aa128429 > > Changes from v2: > - Add acked tag from Mel. > - Fix the mm_compaction_migratepages trace event. > Changes from v1: > - Rebased on the latest mm-unstable branch. > --- > include/trace/events/compaction.h | 6 +++--- > mm/compaction.c | 12 ++-- > 2 files changed, 13 insertions(+), 5 deletions(-) > > diff --git a/include/trace/events/compaction.h > b/include/trace/events/compaction.h > index 2b2a975efd20..d05759d18538 100644 > --- a/include/trace/events/compaction.h > +++ b/include/trace/events/compaction.h > @@ -78,10 +78,10 @@ DEFINE_EVENT(mm_compaction_isolate_template, > mm_compaction_fast_isolate_freepage > #ifdef CONFIG_COMPACTION > TRACE_EVENT(mm_compaction_migratepages, > > - TP_PROTO(struct compact_control *cc, > + TP_PROTO(unsigned int nr_migratepages, > unsigned int nr_succeeded), > > - TP_ARGS(cc, nr_succeeded), > + TP_ARGS(nr_migratepages, nr_succeeded), > > TP_STRUCT__entry( > __field(unsigned long, nr_migrated) > @@ -90,7 +90,7 @@ TRACE_EVENT(mm_compaction_migratepages, > > TP_fast_assign( > __entry->nr_migrated = nr_succeeded; > - __entry->nr_failed = cc->nr_migratepages - nr_succeeded; > + __entry->nr_failed = nr_migratepages - nr_succeeded; > ), > > TP_printk("nr_migrated=%lu nr_failed=%lu", > diff --git a/mm/compaction.c b/mm/compaction.c > index 4494b2914386..218089b29f13 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -1798,6 +1798,7 @@ static struct folio *compaction_alloc(struct folio > *src, unsigned long data) > dst = list_entry(cc->freepages.next, struct folio, lru); > list_del(>lru); > cc->nr_freepages--; > + cc->nr_migratepages--; > > return dst; > } > @@ -1813,6 +1814,7 @@ static void compaction_free(struct folio *dst, unsigned > long data) > > list_add(>lru, >freepages); > cc->nr_freepages++; > + cc->nr_migratepages++; > } > > /* possible outcome of isolate_migratepages */ > @@ -2435,7 +2437,7 @@ compact_zone(struct compact_control *cc, struct > capture_control *capc) > unsigned long last_migrated_pfn; > const bool sync = cc->mode != MIGRATE_ASYNC; > bool update_cached; > - unsigned int nr_succeeded = 0; > + unsigned int nr_succeeded = 0, nr_migratepages; > > /* >* Thes
[PATCH v3] mm: compaction: update the cc->nr_migratepages when allocating or freeing the freepages
Currently we will use 'cc->nr_freepages >= cc->nr_migratepages' comparison to ensure that enough freepages are isolated in isolate_freepages(), however it just decreases the cc->nr_freepages without updating cc->nr_migratepages in compaction_alloc(), which will waste more CPU cycles and cause too many freepages to be isolated. So we should also update the cc->nr_migratepages when allocating or freeing the freepages to avoid isolating excess freepages. And I can see fewer free pages are scanned and isolated when running thpcompact on my Arm64 server: k6.7 k6.7_patched Ops Compaction pages isolated 120692036.00 118160797.00 Ops Compaction migrate scanned 131210329.00 154093268.00 Ops Compaction free scanned 1090587971.00 1080632536.00 Ops Compact scan efficiency 12.03 14.26 Moreover, I did not see an obvious latency improvements, this is likely because isolating freepages is not the bottleneck in the thpcompact test case. k6.7 k6.7_patched Amean fault-both-1 1089.76 ( 0.00%) 1080.16 * 0.88%* Amean fault-both-3 1616.48 ( 0.00%) 1636.65 * -1.25%* Amean fault-both-5 2266.66 ( 0.00%) 2219.20 * 2.09%* Amean fault-both-7 2909.84 ( 0.00%) 2801.90 * 3.71%* Amean fault-both-12 4861.26 ( 0.00%) 4733.25 * 2.63%* Amean fault-both-18 7351.11 ( 0.00%) 6950.51 * 5.45%* Amean fault-both-24 9059.30 ( 0.00%) 9159.99 * -1.11%* Amean fault-both-3010685.68 ( 0.00%)11399.02 * -6.68%* Signed-off-by: Baolin Wang Acked-by: Mel Gorman --- Hi Andrew, please use this patch to replace below 2 old patches. Thanks. https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything-2024-02-20-04-19=cb30899d456d64fb83ee3e3d95cd9fbb18735d87 https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything-2024-02-20-04-19=65713d1c4fc498d35dc1f7c1234ef815aa128429 Changes from v2: - Add acked tag from Mel. - Fix the mm_compaction_migratepages trace event. Changes from v1: - Rebased on the latest mm-unstable branch. --- include/trace/events/compaction.h | 6 +++--- mm/compaction.c | 12 ++-- 2 files changed, 13 insertions(+), 5 deletions(-) diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h index 2b2a975efd20..d05759d18538 100644 --- a/include/trace/events/compaction.h +++ b/include/trace/events/compaction.h @@ -78,10 +78,10 @@ DEFINE_EVENT(mm_compaction_isolate_template, mm_compaction_fast_isolate_freepage #ifdef CONFIG_COMPACTION TRACE_EVENT(mm_compaction_migratepages, - TP_PROTO(struct compact_control *cc, + TP_PROTO(unsigned int nr_migratepages, unsigned int nr_succeeded), - TP_ARGS(cc, nr_succeeded), + TP_ARGS(nr_migratepages, nr_succeeded), TP_STRUCT__entry( __field(unsigned long, nr_migrated) @@ -90,7 +90,7 @@ TRACE_EVENT(mm_compaction_migratepages, TP_fast_assign( __entry->nr_migrated = nr_succeeded; - __entry->nr_failed = cc->nr_migratepages - nr_succeeded; + __entry->nr_failed = nr_migratepages - nr_succeeded; ), TP_printk("nr_migrated=%lu nr_failed=%lu", diff --git a/mm/compaction.c b/mm/compaction.c index 4494b2914386..218089b29f13 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1798,6 +1798,7 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data) dst = list_entry(cc->freepages.next, struct folio, lru); list_del(>lru); cc->nr_freepages--; + cc->nr_migratepages--; return dst; } @@ -1813,6 +1814,7 @@ static void compaction_free(struct folio *dst, unsigned long data) list_add(>lru, >freepages); cc->nr_freepages++; + cc->nr_migratepages++; } /* possible outcome of isolate_migratepages */ @@ -2435,7 +2437,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) unsigned long last_migrated_pfn; const bool sync = cc->mode != MIGRATE_ASYNC; bool update_cached; - unsigned int nr_succeeded = 0; + unsigned int nr_succeeded = 0, nr_migratepages; /* * These counters track activities during zone compaction. Initialize @@ -2553,11 +2555,17 @@ compact_zone(struct compact_control *cc, struct capture_control *capc) pageblock_start_pfn(cc->migrate_pfn - 1)); } + /* +* Record the number of pages to migrate since the +* compaction_alloc/free() will update cc->nr_migratepages +* properly. +*/ + nr_migratepages = cc->nr_migra
Re: [PATCH v3 0/2] Update mce_record tracepoint
On Fri, 26 Jan 2024 01:57:58 -0600 Avadhut Naik wrote: > This patchset updates the mce_record tracepoint so that the recently added > fields of struct mce are exported through it to userspace. > > The first patch adds PPIN (Protected Processor Inventory Number) field to > the tracepoint. > > The second patch adds the microcode field (Microcode Revision) to the > tracepoint. From a tracing POV only: Reviewed-by: Steven Rostedt (Google) -- Steve
Re: [PATCH v2] mm: Update mark_victim tracepoints fields
On Thu, 11 Jan 2024 21:05:30 + Carlos Galo wrote: > The current implementation of the mark_victim tracepoint provides only > the process ID (pid) of the victim process. This limitation poses > challenges for userspace tools that need additional information > about the OOM victim. The association between pid and the additional > data may be lost after the kill, making it difficult for userspace to > correlate the OOM event with the specific process. > > In order to mitigate this limitation, add the following fields: > > - UID >In Android each installed application has a unique UID. Including >the `uid` assists in correlating OOM events with specific apps. > > - Process Name (comm) >Enables identification of the affected process. > > - OOM Score >Allows userspace to get additional insights of the relative kill >priority of the OOM victim. > From a tracing POV this looks fine to me, but it's up to the mm maintainers to decide to take it. > Cc: Steven Rostedt > Cc: Andrew Morton > Cc: Suren Baghdasaryan > Signed-off-by: Carlos Galo > --- > v2: Fixed build error. Added missing comma when printing `__entry->uid`. > > include/trace/events/oom.h | 19 +++ > mm/oom_kill.c | 6 +- > 2 files changed, 20 insertions(+), 5 deletions(-) > > diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h > index 26a11e4a2c36..fb8a5d1b8a0a 100644 > --- a/include/trace/events/oom.h > +++ b/include/trace/events/oom.h > @@ -72,19 +72,30 @@ TRACE_EVENT(reclaim_retry_zone, > ); > > TRACE_EVENT(mark_victim, > - TP_PROTO(int pid), > + TP_PROTO(struct task_struct *task, uid_t uid), > > - TP_ARGS(pid), > + TP_ARGS(task, uid), > > TP_STRUCT__entry( > __field(int, pid) > + __field(uid_t, uid) > + __string(comm, task->comm) > + __field(short, oom_score_adj) > ), > > TP_fast_assign( > - __entry->pid = pid; > + __entry->pid = task->pid; > + __entry->uid = uid; > + __assign_str(comm, task->comm); > + __entry->oom_score_adj = task->signal->oom_score_adj; > ), > > - TP_printk("pid=%d", __entry->pid) > + TP_printk("pid=%d uid=%u comm=%s oom_score_adj=%hd", > + __entry->pid, > + __entry->uid, > + __get_str(comm), > + __entry->oom_score_adj > + ) > ); > > TRACE_EVENT(wake_reaper, > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 9e6071fde34a..0698c00c5da6 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -44,6 +44,7 @@ > #include > #include > #include > +#include > > #include > #include "internal.h" > @@ -753,6 +754,7 @@ static inline void queue_oom_reaper(struct task_struct > *tsk) > */ > static void mark_oom_victim(struct task_struct *tsk) > { > + const struct cred *cred; > struct mm_struct *mm = tsk->mm; > > WARN_ON(oom_killer_disabled); > @@ -772,7 +774,9 @@ static void mark_oom_victim(struct task_struct *tsk) >*/ > __thaw_task(tsk); > atomic_inc(_victims); > - trace_mark_victim(tsk->pid); > + cred = get_task_cred(tsk); get_task_cred() isn't a trivial function and is used only for tracing. But this isn't a fast path so I guess that is fine. For tracing: Reviewed-by: Steven Rostedt (Google) -- Steve > + trace_mark_victim(tsk, cred->uid.val); > + put_cred(cred); > } > > /** > > base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
[PATCH v7 36/36] Documentation: probes: Update fprobe on function-graph tracer
From: Masami Hiramatsu (Google) Update fprobe documentation for the new fprobe on function-graph tracer. This includes some bahvior changes and pt_regs to ftrace_regs interface change. Signed-off-by: Masami Hiramatsu (Google) --- Changes in v2: - Update @fregs parameter explanation. --- Documentation/trace/fprobe.rst | 42 ++-- 1 file changed, 27 insertions(+), 15 deletions(-) diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst index 196f52386aaa..f58bdc64504f 100644 --- a/Documentation/trace/fprobe.rst +++ b/Documentation/trace/fprobe.rst @@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe Introduction -Fprobe is a function entry/exit probe mechanism based on ftrace. -Instead of using ftrace full feature, if you only want to attach callbacks -on function entry and exit, similar to the kprobes and kretprobes, you can +Fprobe is a function entry/exit probe mechanism based on the function-graph +tracer. +Instead of tracing all functions, if you want to attach callbacks on specific +function entry and exit, similar to the kprobes and kretprobes, you can use fprobe. Compared with kprobes and kretprobes, fprobe gives faster instrumentation for multiple functions with single handler. This document describes how to use fprobe. @@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as follows: .. code-block:: c - int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data); + int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data); - void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data); + void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data); -Note that the @entry_ip is saved at function entry and passed to exit handler. -If the entry callback function returns !0, the corresponding exit callback will be cancelled. +Note that the @entry_ip is saved at function entry and passed to exit +handler. +If the entry callback function returns !0, the corresponding exit callback +will be cancelled. @fp This is the address of `fprobe` data structure related to this handler. @@ -112,12 +115,10 @@ If the entry callback function returns !0, the corresponding exit callback will This is the return address that the traced function will return to, somewhere in the caller. This can be used at both entry and exit. -@regs -This is the `pt_regs` data structure at the entry and exit. Note that -the instruction pointer of @regs may be different from the @entry_ip -in the entry_handler. If you need traced instruction pointer, you need -to use @entry_ip. On the other hand, in the exit_handler, the instruction -pointer of @regs is set to the current return address. +@fregs +This is the `ftrace_regs` data structure at the entry and exit. This +includes the function parameters, or the return values. So user can +access thos values via appropriate `ftrace_regs_*` APIs. @entry_data This is a local storage to share the data between entry and exit handlers. @@ -125,6 +126,17 @@ If the entry callback function returns !0, the corresponding exit callback will and `entry_data_size` field when registering the fprobe, the storage is allocated and passed to both `entry_handler` and `exit_handler`. +Entry data size and exit handlers on the same function +== + +Since the entry data is passed via per-task stack and it is has limited size, +the entry data size per probe is limited to `15 * sizeof(long)`. You also need +to take care that the different fprobes are probing on the same function, this +limit becomes smaller. The entry data size is aligned to `sizeof(long)` and +each fprobe which has exit handler uses a `sizeof(long)` space on the stack, +you should keep the number of fprobes on the same function as small as +possible. + Share the callbacks with kprobes @@ -165,8 +177,8 @@ This counter counts up when; - fprobe fails to take ftrace_recursion lock. This usually means that a function which is traced by other ftrace users is called from the entry_handler. - - fprobe fails to setup the function exit because of the shortage of rethook - (the shadow stack for hooking the function return.) + - fprobe fails to setup the function exit because of failing to allocate the + data buffer from the per-task shadow stack. The `fprobe::nmissed` field counts up in both cases. Therefore, the former skips both of entry and exit callback and the latter skips the exit
Re: [PATCH v2 0/2] Update mce_record tracepoint
On Fri, Jan 26, 2024 at 10:01:29PM +, Luck, Tony wrote: > PPIN: Nice to have. People that send stuff to me are terrible about > providing surrounding details. The record already includes > CPUID(1).EAX ... so I can at least skip the step of asking them which > CPU family/model/stepping they were using). But PPIN can be recovered > (so long as the submitter kept good records about which system > generated the record). Yes. > MICROCODE: Must have. Microcode version can be changed at run time. > Going back to the system to check later may not give the correct > answer to what was active at the time of the error. Especially for an > error reported while a microcode update is waling across the CPUs > poking the MSR on each in turn. Easy: - You've got an MCE? Was it during scheduled microcode updates? - Yes. - Come back to me when it happens again, *outside* of the microcode update schedule. Anyway, I still don't buy that. Maybe I'm wrong and maybe I need to talk to data center operators more but this sounds like microcode update failing is such a common thing to happen so that we *absolutely* *must* capture the microcode revision when an MCE happens. Maybe we should make microcode updates more resilient and add a retry mechanism which doesn't back off as easily. Or maybe people should script around it and keep retrying, dunno. In my world, microcode update just works in the vast majority of the cases and if it doesn't, then those cases need a specific look. And if I am debugging an issue and I want to see the microcode revision, I look at /proc/cpuinfo. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
RE: [PATCH v2 0/2] Update mce_record tracepoint
> But no, that's not the right question to ask. > > It is rather: which bits of information are very relevant to an error > record and which are transient enough so that they cannot be gathered > from a system by other means or only gathered in a difficult way, and > should be part of that record. > > The PPIN is not transient but you have to go map ->extcpu to the PPIN so > adding it to the tracepoint is purely a convenience thing. More or less. > > The microcode revision thing I still don't buy but it is already there > so whateva... > > So we'd need a rule hammered out and put there in a prominent place so > that it is clear what goes into struct mce and what not. My personal evaluation of the value of these two additions to the trace record: PPIN: Nice to have. People that send stuff to me are terrible about providing surrounding details. The record already includes CPUID(1).EAX ... so I can at least skip the step of asking them which CPU family/model/stepping they were using). But PPIN can be recovered (so long as the submitter kept good records about which system generated the record). MICROCODE: Must have. Microcode version can be changed at run time. Going back to the system to check later may not give the correct answer to what was active at the time of the error. Especially for an error reported while a microcode update is waling across the CPUs poking the MSR on each in turn. -Tony
Re: [PATCH v2 0/2] Update mce_record tracepoint
On Fri, Jan 26, 2024 at 08:49:03PM +, Luck, Tony wrote: > Every patch that adds new code or data structures adds to the kernel > memory footprint. Each should be considered on its merits. The basic > question being: > >"Is the new functionality worth the cost?" > > Where does it end? It would end if Linus declared: > > "Linux is now complete. Stop sending patches". > > I.e. it is never going to end. No, it's not that - it is the merit thing which determines. > 1) PPIN > Cost = 8 bytes. > Benefit: Emdeds a system identifier into the trace record so there > can be no ambiguity about which machine generated this error. > Also definitively indicates which socket on a multi-socket system. > > 2) MICROCODE > Cost = 4 bytes > Benefit: Certainty about the microcode version active on the core > at the time the error was detected. > > RAS = Reliability, Availability, Serviceability > > These changes fall into the serviceability bucket. They make it > easier to diagnose what went wrong. So does dmesg. Let's add it to the tracepoint... But no, that's not the right question to ask. It is rather: which bits of information are very relevant to an error record and which are transient enough so that they cannot be gathered from a system by other means or only gathered in a difficult way, and should be part of that record. The PPIN is not transient but you have to go map ->extcpu to the PPIN so adding it to the tracepoint is purely a convenience thing. More or less. The microcode revision thing I still don't buy but it is already there so whateva... So we'd need a rule hammered out and put there in a prominent place so that it is clear what goes into struct mce and what not. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
RE: [PATCH v2 0/2] Update mce_record tracepoint
> > Is it so very different to add this to a trace record so that rasdaemon > > can have feature parity with mcelog(8)? > > I knew you were gonna say that. When someone decides that it is > a splendid idea to add more stuff to struct mce then said someone would > want it in the tracepoint too. > > And then we're back to my original question: > > "And where does it end? Stick full dmesg in the tracepoint too?" > > Where do you draw the line in the sand and say, no more, especially > static, fields bloating the trace record should be added and from then > on, you should go collect the info from that box. Something which you're > supposed to do anyway. Every patch that adds new code or data structures adds to the kernel memory footprint. Each should be considered on its merits. The basic question being: "Is the new functionality worth the cost?" Where does it end? It would end if Linus declared: "Linux is now complete. Stop sending patches". I.e. it is never going to end. If somebody posts a patch asking to add the full dmesg to a tracepoint, I'll stand with you to say: "Not only no, but hell no". So for Naik's two patches we have: 1) PPIN Cost = 8 bytes. Benefit: Emdeds a system identifier into the trace record so there can be no ambiguity about which machine generated this error. Also definitively indicates which socket on a multi-socket system. 2) MICROCODE Cost = 4 bytes Benefit: Certainty about the microcode version active on the core at the time the error was detected. RAS = Reliability, Availability, Serviceability These changes fall into the serviceability bucket. They make it easier to diagnose what went wrong. -Tony
Re: [PATCH v2 0/2] Update mce_record tracepoint
On Fri, Jan 26, 2024 at 07:15:50PM +, Luck, Tony wrote: > If deployment of a microcode update across a fleet always went > flawlessly, life would be simpler. But things can fail. And maybe the > failure wasn't noticed. Perhaps a node was rebooting when the sysadmin > pushed the update to the fleet and so missed the deployment. Perhaps > one core was already acting weird and the microcode update didn't get > applied to that core. Yes, and you go collect data from that box. You will have to anyway to figure out why the microcode didn't update. > Swapping a hard drive, or hot plugging a NIC isn't very likely > to correlate with an error reported by the CPU in machine > check banks. Ofc it is - coherent probe timeoutting due to problematic insertion could be reported with a MCE, and so on and so on. > Is it so very different to add this to a trace record so that rasdaemon > can have feature parity with mcelog(8)? I knew you were gonna say that. When someone decides that it is a splendid idea to add more stuff to struct mce then said someone would want it in the tracepoint too. And then we're back to my original question: "And where does it end? Stick full dmesg in the tracepoint too?" Where do you draw the line in the sand and say, no more, especially static, fields bloating the trace record should be added and from then on, you should go collect the info from that box. Something which you're supposed to do anyway. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
RE: [PATCH v2 0/2] Update mce_record tracepoint
> > You've spent enough time with Ashok and Thomas tweaking the Linux > > microcode driver to know that going back to the machine the next day > > to ask about microcode version has a bunch of ways to get a wrong > > answer. > > Huh, what does that have to do with this? If deployment of a microcode update across a fleet always went flawlessly, life would be simpler. But things can fail. And maybe the failure wasn't noticed. Perhaps a node was rebooting when the sysadmin pushed the update to the fleet and so missed the deployment. Perhaps one core was already acting weird and the microcode update didn't get applied to that core. > IIUC, if someone changes something on the system, whether that is > updating microcode or swapping a harddrive or swapping memory or > whatever, that invalidates the errors reported, pretty much. > > You can't put it all in the trace record, you just can't. Swapping a hard drive, or hot plugging a NIC isn't very likely to correlate with an error reported by the CPU in machine check banks. But microcode can be (and has been) the issue in enough cases that knowing the version at the time of the error matters. You seemed to agree with this argument when the microcode field was added to "struct mce" back in 2018 fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check records") Is it so very different to add this to a trace record so that rasdaemon can have feature parity with mcelog(8)? -Tony
Re: [PATCH v2 0/2] Update mce_record tracepoint
On Fri, Jan 26, 2024 at 05:10:20PM +, Luck, Tony wrote: > 12 extra bytes divided by (say) 64GB (a very small server these days, may > laptop has that much) >= 0.0001746% > > We will need 57000 changes like this one before we get to 0.001% :-) You're forgetting that those 12 bytes repeat per MCE tracepoint logged. And there's other code which adds more 0.01% here and there, well, because we can. > But the key there is keeping the details of the source machine attached to > the error record. My first contact with machine check debugging is always > just the raw error record (from mcelog, rasdaemon, or console log). Yes, that is somewhat sensible reason to have the PPIN together with the MCE record. > Knowing which microcode version was loaded on a core *at the time of > the error* is critical. So is the rest of the debug info you're going to need from that machine. And yet we're not adding that to the tracepoint. > You've spent enough time with Ashok and Thomas tweaking the Linux > microcode driver to know that going back to the machine the next day > to ask about microcode version has a bunch of ways to get a wrong > answer. Huh, what does that have to do with this? IIUC, if someone changes something on the system, whether that is updating microcode or swapping a harddrive or swapping memory or whatever, that invalidates the errors reported, pretty much. You can't put it all in the trace record, you just can't. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
RE: [PATCH v2 0/2] Update mce_record tracepoint
> > 8 bytes for PPIN, 4 more for microcode. > > I know, nothing leads to bloat like 0.01% here, 0.001% there... 12 extra bytes divided by (say) 64GB (a very small server these days, may laptop has that much) = 0.0001746% We will need 57000 changes like this one before we get to 0.001% :-) > > Number of recoverable machine checks per system I hope the > > monthly rate should be countable on my fingers... > > That's not the point. Rather, when you look at MCE reports, you pretty > much almost always go and collect additional information from the target > machine because you want to figure out what exactly is going on. > > So what's stopping you from collecting all that static information > instead of parrotting it through the tracepoint with each error? PPIN is static. So with good tracking to keep source platform information attached to the error record as it gets passed around people trying to triage the problem, you could say it can be retrieved later (or earlier when setting up a database of attributes of each machine in the fleet. But the key there is keeping the details of the source machine attached to the error record. My first contact with machine check debugging is always just the raw error record (from mcelog, rasdaemon, or console log). > > PPIN is useful when talking to the CPU vendor about patterns of > > similar errors seen across a cluster. > > I guess that is perhaps the only thing of the two that makes some sense > at least - the identifier uniquely describes which CPU the error comes > from... > > > MICROCODE - gives a fast path to root cause problems that have already > > been fixed in a microcode update. > > But that, nah. See above. Knowing which microcode version was loaded on a core *at the time of the error* is critical. You've spent enough time with Ashok and Thomas tweaking the Linux microcode driver to know that going back to the machine the next day to ask about microcode version has a bunch of ways to get a wrong answer. -Tony