Re: [PATCH 08/12] arm64: dts: qcom: sm6115-pro1x: Update copyright year

2024-07-22 Thread Krzysztof Kozlowski
On 22/07/2024 09:10, Dang Huynh wrote:
> It's 2024, let's update the copyright year.
> 
> Signed-off-by: Dang Huynh 
> ---
>  arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts 
> b/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts
> index a32fc27bc783..6e9e4d9f8250 100644
> --- a/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts
> +++ b/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts
> @@ -1,6 +1,6 @@
>  // SPDX-License-Identifier: (GPL-2.0+ OR BSD-3-Clause)
>  /*
> - * Copyright (c) 2023, Dang Huynh 
> + * Copyright (c) 2023 - 2024, Dang Huynh 

Please squash it. I don't think copyright update has a merit on its own.

Best regards,
Krzysztof




[PATCH 08/12] arm64: dts: qcom: sm6115-pro1x: Update copyright year

2024-07-22 Thread Dang Huynh
It's 2024, let's update the copyright year.

Signed-off-by: Dang Huynh 
---
 arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts 
b/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts
index a32fc27bc783..6e9e4d9f8250 100644
--- a/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts
+++ b/arch/arm64/boot/dts/qcom/sm6115-fxtec-pro1x.dts
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: (GPL-2.0+ OR BSD-3-Clause)
 /*
- * Copyright (c) 2023, Dang Huynh 
+ * Copyright (c) 2023 - 2024, Dang Huynh 
  */
 
 /dts-v1/;

-- 
2.45.2




Re: [PATCH V2 0/7] vdpa live update

2024-07-20 Thread Steven Sistare

On 7/17/2024 8:33 PM, Jason Wang wrote:

On Thu, Jul 18, 2024 at 2:29 AM Steven Sistare
 wrote:


On 7/16/2024 1:30 AM, Jason Wang wrote:

On Mon, Jul 15, 2024 at 10:29 PM Steven Sistare
 wrote:


On 7/14/2024 10:14 PM, Jason Wang wrote:

On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare  wrote:


Live update is a technique wherein an application saves its state, exec's
to an updated version of itself, and restores its state.  Clients of the
application experience a brief suspension of service, on the order of
100's of milliseconds, but are otherwise unaffected.

Define and implement interfaces that allow vdpa devices to be preserved
across fork or exec, to support live update for applications such as QEMU.
The device must be suspended during the update, but its DMA mappings are
preserved, so the suspension is brief.

The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
accounting from one process to another.

The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
VHOST_NEW_OWNER is supported.

The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
address in the new process.

The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
VHOST_IOTLB_REMAP is supported and required.  Some devices do not
require it, because the userland address of each DMA mapping is discarded
after being translated to a physical address.

Here is a pseudo-code sequence for performing live update, based on
suspend + reset because resume is not yet widely available.  The vdpa device
descriptor, fd, remains open across the exec.

 ioctl(fd, VHOST_VDPA_SUSPEND)
 ioctl(fd, VHOST_VDPA_SET_STATUS, 0)


I don't understand why we need a reset after suspend, it looks to me
the previous suspend became meaningless.


The suspend guarantees completion of in-progress DMA.  At least, that is
my interpretation of why that is done for live migration in QEMU, which
also does suspend + reset + re-create.  I am following the live migration
model.


Yes, but any reason we need a reset after the suspension?


Probably not.  I found it cleanest to call reset and let new qemu configure the
device as it always does during startup, rather than altering those code paths
to skip the kernel calls.


If we care about the downtime, I think avoiding a reset should be faster.


Agreed.  I want to try it some time, but I think what I have now using reset is
decent enough for version 1.

- Steve


So, consider this to be just one of several possible
userland algorithms.

- Steve


Thanks




 exec

 ioctl(fd, VHOST_NEW_OWNER)

 issue ioctls to re-create vrings

 if VHOST_BACKEND_F_IOTLB_REMAP


So the idea is for a device that is using a virtual address, it
doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all?


Actually the reverse: if the device translates virtual to physical when
the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP
is not needed.


Ok.




 foreach dma mapping
 write(fd, {VHOST_IOTLB_REMAP, new_addr})

 ioctl(fd, VHOST_VDPA_SET_STATUS,
   ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)


   From API level, this seems to be asymmetric as we have suspending but
not resuming?


Again, I am just following the path taken by live migration.
I will be happy to use resume when the devices and QEMU support it.
The decision to use reset vs resume should not affect the definition
and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP.

- Steve


Thanks









Re: [PATCH 09/17] x86/numa_emu: split __apicid_to_node update to a helper function

2024-07-19 Thread Jonathan Cameron
On Tue, 16 Jul 2024 14:13:38 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (Microsoft)" 
> 
> This is required to make numa emulation code architecture independent so
> that it can be moved to generic code in following commits.
> 
> Signed-off-by: Mike Rapoport (Microsoft) 

Not the most intuitive of function names but I can't immediately
think of a better one.

Reviewed-by: Jonathan Cameron 




Re: [PATCH V2 0/7] vdpa live update

2024-07-17 Thread Jason Wang
On Thu, Jul 18, 2024 at 2:29 AM Steven Sistare
 wrote:
>
> On 7/16/2024 1:30 AM, Jason Wang wrote:
> > On Mon, Jul 15, 2024 at 10:29 PM Steven Sistare
> >  wrote:
> >>
> >> On 7/14/2024 10:14 PM, Jason Wang wrote:
> >>> On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare  
> >>> wrote:
> >>>>
> >>>> Live update is a technique wherein an application saves its state, exec's
> >>>> to an updated version of itself, and restores its state.  Clients of the
> >>>> application experience a brief suspension of service, on the order of
> >>>> 100's of milliseconds, but are otherwise unaffected.
> >>>>
> >>>> Define and implement interfaces that allow vdpa devices to be preserved
> >>>> across fork or exec, to support live update for applications such as 
> >>>> QEMU.
> >>>> The device must be suspended during the update, but its DMA mappings are
> >>>> preserved, so the suspension is brief.
> >>>>
> >>>> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> >>>> accounting from one process to another.
> >>>>
> >>>> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> >>>> VHOST_NEW_OWNER is supported.
> >>>>
> >>>> The VHOST_IOTLB_REMAP message type updates a DMA mapping with its 
> >>>> userland
> >>>> address in the new process.
> >>>>
> >>>> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> >>>> VHOST_IOTLB_REMAP is supported and required.  Some devices do not
> >>>> require it, because the userland address of each DMA mapping is discarded
> >>>> after being translated to a physical address.
> >>>>
> >>>> Here is a pseudo-code sequence for performing live update, based on
> >>>> suspend + reset because resume is not yet widely available.  The vdpa 
> >>>> device
> >>>> descriptor, fd, remains open across the exec.
> >>>>
> >>>> ioctl(fd, VHOST_VDPA_SUSPEND)
> >>>> ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
> >>>
> >>> I don't understand why we need a reset after suspend, it looks to me
> >>> the previous suspend became meaningless.
> >>
> >> The suspend guarantees completion of in-progress DMA.  At least, that is
> >> my interpretation of why that is done for live migration in QEMU, which
> >> also does suspend + reset + re-create.  I am following the live migration
> >> model.
> >
> > Yes, but any reason we need a reset after the suspension?
>
> Probably not.  I found it cleanest to call reset and let new qemu configure 
> the
> device as it always does during startup, rather than altering those code paths
> to skip the kernel calls.

If we care about the downtime, I think avoiding a reset should be faster.

> So, consider this to be just one of several possible
> userland algorithms.
>
> - Steve

Thanks

>
> >>>> exec
> >>>>
> >>>> ioctl(fd, VHOST_NEW_OWNER)
> >>>>
> >>>> issue ioctls to re-create vrings
> >>>>
> >>>> if VHOST_BACKEND_F_IOTLB_REMAP
> >>>
> >>> So the idea is for a device that is using a virtual address, it
> >>> doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all?
> >>
> >> Actually the reverse: if the device translates virtual to physical when
> >> the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP
> >> is not needed.
> >
> > Ok.
> >
> >>
> >>>> foreach dma mapping
> >>>> write(fd, {VHOST_IOTLB_REMAP, new_addr})
> >>>>
> >>>> ioctl(fd, VHOST_VDPA_SET_STATUS,
> >>>>   ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
> >>>
> >>>   From API level, this seems to be asymmetric as we have suspending but
> >>> not resuming?
> >>
> >> Again, I am just following the path taken by live migration.
> >> I will be happy to use resume when the devices and QEMU support it.
> >> The decision to use reset vs resume should not affect the definition
> >> and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP.
> >>
> >> - Steve
> >
> > Thanks
> >
>




Re: [PATCH V2 0/7] vdpa live update

2024-07-17 Thread Steven Sistare

On 7/16/2024 1:30 AM, Jason Wang wrote:

On Mon, Jul 15, 2024 at 10:29 PM Steven Sistare
 wrote:


On 7/14/2024 10:14 PM, Jason Wang wrote:

On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare  wrote:


Live update is a technique wherein an application saves its state, exec's
to an updated version of itself, and restores its state.  Clients of the
application experience a brief suspension of service, on the order of
100's of milliseconds, but are otherwise unaffected.

Define and implement interfaces that allow vdpa devices to be preserved
across fork or exec, to support live update for applications such as QEMU.
The device must be suspended during the update, but its DMA mappings are
preserved, so the suspension is brief.

The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
accounting from one process to another.

The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
VHOST_NEW_OWNER is supported.

The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
address in the new process.

The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
VHOST_IOTLB_REMAP is supported and required.  Some devices do not
require it, because the userland address of each DMA mapping is discarded
after being translated to a physical address.

Here is a pseudo-code sequence for performing live update, based on
suspend + reset because resume is not yet widely available.  The vdpa device
descriptor, fd, remains open across the exec.

ioctl(fd, VHOST_VDPA_SUSPEND)
ioctl(fd, VHOST_VDPA_SET_STATUS, 0)


I don't understand why we need a reset after suspend, it looks to me
the previous suspend became meaningless.


The suspend guarantees completion of in-progress DMA.  At least, that is
my interpretation of why that is done for live migration in QEMU, which
also does suspend + reset + re-create.  I am following the live migration
model.


Yes, but any reason we need a reset after the suspension?


Probably not.  I found it cleanest to call reset and let new qemu configure the
device as it always does during startup, rather than altering those code paths
to skip the kernel calls. So, consider this to be just one of several possible
userland algorithms.

- Steve


exec

ioctl(fd, VHOST_NEW_OWNER)

issue ioctls to re-create vrings

if VHOST_BACKEND_F_IOTLB_REMAP


So the idea is for a device that is using a virtual address, it
doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all?


Actually the reverse: if the device translates virtual to physical when
the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP
is not needed.


Ok.




foreach dma mapping
write(fd, {VHOST_IOTLB_REMAP, new_addr})

ioctl(fd, VHOST_VDPA_SET_STATUS,
  ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)


  From API level, this seems to be asymmetric as we have suspending but
not resuming?


Again, I am just following the path taken by live migration.
I will be happy to use resume when the devices and QEMU support it.
The decision to use reset vs resume should not affect the definition
and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP.

- Steve


Thanks





[PATCH 09/17] x86/numa_emu: split __apicid_to_node update to a helper function

2024-07-16 Thread Mike Rapoport
From: "Mike Rapoport (Microsoft)" 

This is required to make numa emulation code architecture independent so
that it can be moved to generic code in following commits.

Signed-off-by: Mike Rapoport (Microsoft) 
---
 arch/x86/include/asm/numa.h  |  2 ++
 arch/x86/mm/numa.c   | 22 ++
 arch/x86/mm/numa_emulation.c | 14 +-
 3 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 2dab1ada96cf..7017d540894a 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -72,6 +72,8 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable);
 
 #ifdef CONFIG_NUMA_EMU
 int numa_emu_cmdline(char *str);
+void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
+   unsigned int nr_emu_nids);
 #else /* CONFIG_NUMA_EMU */
 static inline int numa_emu_cmdline(char *str)
 {
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index ab2d4ecef786..1320d776caed 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -852,6 +852,28 @@ EXPORT_SYMBOL(cpumask_of_node);
 
 #endif /* !CONFIG_DEBUG_PER_CPU_MAPS */
 
+#ifdef CONFIG_NUMA_EMU
+void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
+   unsigned int nr_emu_nids)
+{
+   int i, j;
+
+   /*
+* Transform __apicid_to_node table to use emulated nids by
+* reverse-mapping phys_nid.  The maps should always exist but fall
+* back to zero just in case.
+*/
+   for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) {
+   if (__apicid_to_node[i] == NUMA_NO_NODE)
+   continue;
+   for (j = 0; j < nr_emu_nids; j++)
+   if (__apicid_to_node[i] == emu_nid_to_phys[j])
+   break;
+   __apicid_to_node[i] = j < nr_emu_nids ? j : 0;
+   }
+}
+#endif /* CONFIG_NUMA_EMU */
+
 #ifdef CONFIG_NUMA_KEEP_MEMINFO
 static int meminfo_to_nid(struct numa_meminfo *mi, u64 start)
 {
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index 439804e21962..f2746e52ab93 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -476,19 +476,7 @@ void __init numa_emulation(struct numa_meminfo 
*numa_meminfo, int numa_dist_cnt)
ei.blk[i].nid != NUMA_NO_NODE)
node_set(ei.blk[i].nid, numa_nodes_parsed);
 
-   /*
-* Transform __apicid_to_node table to use emulated nids by
-* reverse-mapping phys_nid.  The maps should always exist but fall
-* back to zero just in case.
-*/
-   for (i = 0; i < ARRAY_SIZE(__apicid_to_node); i++) {
-   if (__apicid_to_node[i] == NUMA_NO_NODE)
-   continue;
-   for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++)
-   if (__apicid_to_node[i] == emu_nid_to_phys[j])
-   break;
-   __apicid_to_node[i] = j < ARRAY_SIZE(emu_nid_to_phys) ? j : 0;
-   }
+   numa_emu_update_cpu_to_node(emu_nid_to_phys, 
ARRAY_SIZE(emu_nid_to_phys));
 
/* make sure all emulated nodes are mapped to a physical node */
for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
-- 
2.43.0




Re: [PATCH] tracing: Update MAINTAINERS file

2024-07-16 Thread Google
On Mon, 15 Jul 2024 14:47:45 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> Gone but never forgotten.
> 
> [ Also moved Daniel's name to be consistent with the alphabetical order ]
> 

Reviewed-by: Masami Hiramatsu (Google) 

Thank you,

> Signed-off-by: Steven Rostedt (Google) 
> ---
>  CREDITS | 10 +++---
>  MAINTAINERS |  3 ---
>  2 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/CREDITS b/CREDITS
> index 88c4c08cb613..d7798910e99f 100644
> --- a/CREDITS
> +++ b/CREDITS
> @@ -271,9 +271,6 @@ D: Driver for WaveFront soundcards (Turtle Beach Maui, 
> Tropez, Tropez+)
>  D: Various bugfixes and changes to sound drivers
>  S: USA
>  
> -N: Daniel Bristot de Oliveira
> -D: Scheduler contributions, notably: SCHED_DEADLINE
> -
>  N: Carlos Henrique Bauer
>  E: chba...@acm.org
>  E: ba...@atlas.unisinos.br
> @@ -534,6 +531,13 @@ S: Kopmansg 2
>  S: 411 13  Goteborg
>  S: Sweden
>  
> +N: Daniel Bristot de Oliveira
> +D: Scheduler contributions, notably: SCHED_DEADLINE
> +D: Real-time Linux Analysis
> +D: Runtime Verification
> +D: OS Noise and Latency Tracers
> +S: Brazil and Italy
> +
>  N: Paul Bristow
>  E: p...@paulbristow.net
>  W: https://paulbristow.net/linux/idefloppy.html
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2e1b8bbacb5e..0b7e4cd4ffd7 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -18895,7 +18895,6 @@ F:include/uapi/linux/rtc.h
>  F:   tools/testing/selftests/rtc/
>  
>  Real-time Linux Analysis (RTLA) tools
> -M:   Daniel Bristot de Oliveira 
>  M:   Steven Rostedt 
>  L:   linux-trace-ker...@vger.kernel.org
>  S:   Maintained
> @@ -19529,7 +19528,6 @@ S:Maintained
>  F:   drivers/infiniband/ulp/rtrs/
>  
>  RUNTIME VERIFICATION (RV)
> -M:   Daniel Bristot de Oliveira 
>  M:   Steven Rostedt 
>  L:   linux-trace-ker...@vger.kernel.org
>  S:   Maintained
> @@ -22806,7 +22804,6 @@ F:kernel/trace/trace_mmiotrace.c
>  
>  TRACING OS NOISE / LATENCY TRACERS
>  M:   Steven Rostedt 
> -M:   Daniel Bristot de Oliveira 
>  S:   Maintained
>  F:   Documentation/trace/hwlat_detector.rst
>  F:   Documentation/trace/osnoise-tracer.rst
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH V2 0/7] vdpa live update

2024-07-15 Thread Jason Wang
On Mon, Jul 15, 2024 at 10:29 PM Steven Sistare
 wrote:
>
> On 7/14/2024 10:14 PM, Jason Wang wrote:
> > On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare  
> > wrote:
> >>
> >> Live update is a technique wherein an application saves its state, exec's
> >> to an updated version of itself, and restores its state.  Clients of the
> >> application experience a brief suspension of service, on the order of
> >> 100's of milliseconds, but are otherwise unaffected.
> >>
> >> Define and implement interfaces that allow vdpa devices to be preserved
> >> across fork or exec, to support live update for applications such as QEMU.
> >> The device must be suspended during the update, but its DMA mappings are
> >> preserved, so the suspension is brief.
> >>
> >> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> >> accounting from one process to another.
> >>
> >> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> >> VHOST_NEW_OWNER is supported.
> >>
> >> The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
> >> address in the new process.
> >>
> >> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> >> VHOST_IOTLB_REMAP is supported and required.  Some devices do not
> >> require it, because the userland address of each DMA mapping is discarded
> >> after being translated to a physical address.
> >>
> >> Here is a pseudo-code sequence for performing live update, based on
> >> suspend + reset because resume is not yet widely available.  The vdpa 
> >> device
> >> descriptor, fd, remains open across the exec.
> >>
> >>ioctl(fd, VHOST_VDPA_SUSPEND)
> >>ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
> >
> > I don't understand why we need a reset after suspend, it looks to me
> > the previous suspend became meaningless.
>
> The suspend guarantees completion of in-progress DMA.  At least, that is
> my interpretation of why that is done for live migration in QEMU, which
> also does suspend + reset + re-create.  I am following the live migration
> model.

Yes, but any reason we need a reset after the suspension?

>
> >>exec
> >>
> >>ioctl(fd, VHOST_NEW_OWNER)
> >>
> >>issue ioctls to re-create vrings
> >>
> >>if VHOST_BACKEND_F_IOTLB_REMAP
> >
> > So the idea is for a device that is using a virtual address, it
> > doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all?
>
> Actually the reverse: if the device translates virtual to physical when
> the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP
> is not needed.

Ok.

>
> >>foreach dma mapping
> >>write(fd, {VHOST_IOTLB_REMAP, new_addr})
> >>
> >>ioctl(fd, VHOST_VDPA_SET_STATUS,
> >>  ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
> >
> >  From API level, this seems to be asymmetric as we have suspending but
> > not resuming?
>
> Again, I am just following the path taken by live migration.
> I will be happy to use resume when the devices and QEMU support it.
> The decision to use reset vs resume should not affect the definition
> and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP.
>
> - Steve

Thanks




Re: [PATCH] tracing: Update MAINTAINERS file

2024-07-15 Thread Mathieu Desnoyers

On 2024-07-15 15:13, Steven Rostedt wrote:

On Mon, 15 Jul 2024 15:10:17 -0400
Mathieu Desnoyers  wrote:


On 2024-07-15 14:47, Steven Rostedt wrote:

From: "Steven Rostedt (Google)" 

Gone but never forgotten.

[ Also moved Daniel's name to be consistent with the alphabetical order ]


Hi Steven,

You appear to have missed this entry from SCHEDULER:

R:  Daniel Bristot de Oliveira  (SCHED_DEADLINE)

Should it be done in a separate commit or folded in this one ?


That has already been done:

   
https://lore.kernel.org/all/20240708075752.gf11...@noisy.programming.kicks-ass.net/

This patch is actually on top of that one.


Sorry I missed that.

Reviewed-by: Mathieu Desnoyers 



-- Steve


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] tracing: Update MAINTAINERS file

2024-07-15 Thread Steven Rostedt
On Mon, 15 Jul 2024 15:10:17 -0400
Mathieu Desnoyers  wrote:

> On 2024-07-15 14:47, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" 
> > 
> > Gone but never forgotten.
> > 
> > [ Also moved Daniel's name to be consistent with the alphabetical order ]  
> 
> Hi Steven,
> 
> You appear to have missed this entry from SCHEDULER:
> 
> R:  Daniel Bristot de Oliveira  (SCHED_DEADLINE)
> 
> Should it be done in a separate commit or folded in this one ?

That has already been done:

  
https://lore.kernel.org/all/20240708075752.gf11...@noisy.programming.kicks-ass.net/

This patch is actually on top of that one.

-- Steve



Re: [PATCH] tracing: Update MAINTAINERS file

2024-07-15 Thread Mathieu Desnoyers

On 2024-07-15 14:47, Steven Rostedt wrote:

From: "Steven Rostedt (Google)" 

Gone but never forgotten.

[ Also moved Daniel's name to be consistent with the alphabetical order ]


Hi Steven,

You appear to have missed this entry from SCHEDULER:

R:  Daniel Bristot de Oliveira  (SCHED_DEADLINE)

Should it be done in a separate commit or folded in this one ?

Thanks,

Mathieu



Signed-off-by: Steven Rostedt (Google) 
---
  CREDITS | 10 +++---
  MAINTAINERS |  3 ---
  2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/CREDITS b/CREDITS
index 88c4c08cb613..d7798910e99f 100644
--- a/CREDITS
+++ b/CREDITS
@@ -271,9 +271,6 @@ D: Driver for WaveFront soundcards (Turtle Beach Maui, 
Tropez, Tropez+)
  D: Various bugfixes and changes to sound drivers
  S: USA
  
-N: Daniel Bristot de Oliveira

-D: Scheduler contributions, notably: SCHED_DEADLINE
-
  N: Carlos Henrique Bauer
  E: chba...@acm.org
  E: ba...@atlas.unisinos.br
@@ -534,6 +531,13 @@ S: Kopmansg 2
  S: 411 13  Goteborg
  S: Sweden
  
+N: Daniel Bristot de Oliveira

+D: Scheduler contributions, notably: SCHED_DEADLINE
+D: Real-time Linux Analysis
+D: Runtime Verification
+D: OS Noise and Latency Tracers
+S: Brazil and Italy
+
  N: Paul Bristow
  E: p...@paulbristow.net
  W: https://paulbristow.net/linux/idefloppy.html
diff --git a/MAINTAINERS b/MAINTAINERS
index 2e1b8bbacb5e..0b7e4cd4ffd7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18895,7 +18895,6 @@ F:  include/uapi/linux/rtc.h
  F:tools/testing/selftests/rtc/
  
  Real-time Linux Analysis (RTLA) tools

-M: Daniel Bristot de Oliveira 
  M:Steven Rostedt 
  L:linux-trace-ker...@vger.kernel.org
  S:Maintained
@@ -19529,7 +19528,6 @@ S:  Maintained
  F:drivers/infiniband/ulp/rtrs/
  
  RUNTIME VERIFICATION (RV)

-M: Daniel Bristot de Oliveira 
  M:Steven Rostedt 
  L:linux-trace-ker...@vger.kernel.org
  S:Maintained
@@ -22806,7 +22804,6 @@ F:  kernel/trace/trace_mmiotrace.c
  
  TRACING OS NOISE / LATENCY TRACERS

  M:Steven Rostedt 
-M: Daniel Bristot de Oliveira 
  S:Maintained
  F:Documentation/trace/hwlat_detector.rst
  F:Documentation/trace/osnoise-tracer.rst


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




[PATCH] tracing: Update MAINTAINERS file

2024-07-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Gone but never forgotten.

[ Also moved Daniel's name to be consistent with the alphabetical order ]

Signed-off-by: Steven Rostedt (Google) 
---
 CREDITS | 10 +++---
 MAINTAINERS |  3 ---
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/CREDITS b/CREDITS
index 88c4c08cb613..d7798910e99f 100644
--- a/CREDITS
+++ b/CREDITS
@@ -271,9 +271,6 @@ D: Driver for WaveFront soundcards (Turtle Beach Maui, 
Tropez, Tropez+)
 D: Various bugfixes and changes to sound drivers
 S: USA
 
-N: Daniel Bristot de Oliveira
-D: Scheduler contributions, notably: SCHED_DEADLINE
-
 N: Carlos Henrique Bauer
 E: chba...@acm.org
 E: ba...@atlas.unisinos.br
@@ -534,6 +531,13 @@ S: Kopmansg 2
 S: 411 13  Goteborg
 S: Sweden
 
+N: Daniel Bristot de Oliveira
+D: Scheduler contributions, notably: SCHED_DEADLINE
+D: Real-time Linux Analysis
+D: Runtime Verification
+D: OS Noise and Latency Tracers
+S: Brazil and Italy
+
 N: Paul Bristow
 E: p...@paulbristow.net
 W: https://paulbristow.net/linux/idefloppy.html
diff --git a/MAINTAINERS b/MAINTAINERS
index 2e1b8bbacb5e..0b7e4cd4ffd7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18895,7 +18895,6 @@ F:  include/uapi/linux/rtc.h
 F: tools/testing/selftests/rtc/
 
 Real-time Linux Analysis (RTLA) tools
-M: Daniel Bristot de Oliveira 
 M: Steven Rostedt 
 L: linux-trace-ker...@vger.kernel.org
 S: Maintained
@@ -19529,7 +19528,6 @@ S:  Maintained
 F: drivers/infiniband/ulp/rtrs/
 
 RUNTIME VERIFICATION (RV)
-M: Daniel Bristot de Oliveira 
 M: Steven Rostedt 
 L: linux-trace-ker...@vger.kernel.org
 S: Maintained
@@ -22806,7 +22804,6 @@ F:  kernel/trace/trace_mmiotrace.c
 
 TRACING OS NOISE / LATENCY TRACERS
 M: Steven Rostedt 
-M: Daniel Bristot de Oliveira 
 S: Maintained
 F: Documentation/trace/hwlat_detector.rst
 F: Documentation/trace/osnoise-tracer.rst
-- 
2.43.0




Re: [PATCH V2 0/7] vdpa live update

2024-07-15 Thread Steven Sistare

On 7/14/2024 10:14 PM, Jason Wang wrote:

On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare  wrote:


Live update is a technique wherein an application saves its state, exec's
to an updated version of itself, and restores its state.  Clients of the
application experience a brief suspension of service, on the order of
100's of milliseconds, but are otherwise unaffected.

Define and implement interfaces that allow vdpa devices to be preserved
across fork or exec, to support live update for applications such as QEMU.
The device must be suspended during the update, but its DMA mappings are
preserved, so the suspension is brief.

The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
accounting from one process to another.

The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
VHOST_NEW_OWNER is supported.

The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
address in the new process.

The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
VHOST_IOTLB_REMAP is supported and required.  Some devices do not
require it, because the userland address of each DMA mapping is discarded
after being translated to a physical address.

Here is a pseudo-code sequence for performing live update, based on
suspend + reset because resume is not yet widely available.  The vdpa device
descriptor, fd, remains open across the exec.

   ioctl(fd, VHOST_VDPA_SUSPEND)
   ioctl(fd, VHOST_VDPA_SET_STATUS, 0)


I don't understand why we need a reset after suspend, it looks to me
the previous suspend became meaningless.


The suspend guarantees completion of in-progress DMA.  At least, that is
my interpretation of why that is done for live migration in QEMU, which
also does suspend + reset + re-create.  I am following the live migration
model.


   exec

   ioctl(fd, VHOST_NEW_OWNER)

   issue ioctls to re-create vrings

   if VHOST_BACKEND_F_IOTLB_REMAP


So the idea is for a device that is using a virtual address, it
doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all?


Actually the reverse: if the device translates virtual to physical when
the mappings are created, and discards the virtual, then VHOST_IOTLB_REMAP
is not needed.


   foreach dma mapping
   write(fd, {VHOST_IOTLB_REMAP, new_addr})

   ioctl(fd, VHOST_VDPA_SET_STATUS,
 ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)


 From API level, this seems to be asymmetric as we have suspending but
not resuming?


Again, I am just following the path taken by live migration.
I will be happy to use resume when the devices and QEMU support it.
The decision to use reset vs resume should not affect the definition
and use of VHOST_NEW_OWNER and VHOST_IOTLB_REMAP.

- Steve


This is faster than VHOST_RESET_OWNER + VHOST_SET_OWNER + VHOST_IOTLB_UPDATE,
as that would would unpin and repin physical pages, which would cost multiple
seconds for large memories.

This is implemented in QEMU by the patch series "Live update: vdpa"
   https://lore.kernel.org/qemu-devel/TBD  (reference to be posted shortly)

The QEMU implementation leverages the live migration code path, but after
CPR exec's new QEMU:
   - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
   - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of
 VHOST_IOTLB_UPDATE

Changes in V2:
   - clean up handling of set_map vs dma_map vs platform iommu in remap
   - augment and clarify commit messages and comments

Steve Sistare (7):
   vhost-vdpa: count pinned memory
   vhost-vdpa: pass mm to bind
   vhost-vdpa: VHOST_NEW_OWNER
   vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
   vhost-vdpa: VHOST_IOTLB_REMAP
   vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
   vdpa/mlx5: new owner capability

  drivers/vdpa/mlx5/net/mlx5_vnet.c |   3 +-
  drivers/vhost/vdpa.c  | 125 --
  drivers/vhost/vhost.c |  15 
  drivers/vhost/vhost.h |   1 +
  include/uapi/linux/vhost.h|  10 +++
  include/uapi/linux/vhost_types.h  |  15 +++-
  6 files changed, 161 insertions(+), 8 deletions(-)

--
2.39.3



Thanks





Re: [PATCH V2 0/7] vdpa live update

2024-07-14 Thread Jason Wang
On Fri, Jul 12, 2024 at 9:19 PM Steve Sistare  wrote:
>
> Live update is a technique wherein an application saves its state, exec's
> to an updated version of itself, and restores its state.  Clients of the
> application experience a brief suspension of service, on the order of
> 100's of milliseconds, but are otherwise unaffected.
>
> Define and implement interfaces that allow vdpa devices to be preserved
> across fork or exec, to support live update for applications such as QEMU.
> The device must be suspended during the update, but its DMA mappings are
> preserved, so the suspension is brief.
>
> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> accounting from one process to another.
>
> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> VHOST_NEW_OWNER is supported.
>
> The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
> address in the new process.
>
> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> VHOST_IOTLB_REMAP is supported and required.  Some devices do not
> require it, because the userland address of each DMA mapping is discarded
> after being translated to a physical address.
>
> Here is a pseudo-code sequence for performing live update, based on
> suspend + reset because resume is not yet widely available.  The vdpa device
> descriptor, fd, remains open across the exec.
>
>   ioctl(fd, VHOST_VDPA_SUSPEND)
>   ioctl(fd, VHOST_VDPA_SET_STATUS, 0)

I don't understand why we need a reset after suspend, it looks to me
the previous suspend became meaningless.

>   exec
>
>   ioctl(fd, VHOST_NEW_OWNER)
>
>   issue ioctls to re-create vrings
>
>   if VHOST_BACKEND_F_IOTLB_REMAP

So the idea is for a device that is using a virtual address, it
doesn't need VHOST_BACKEND_F_IOTLB_REMAP at all?

>   foreach dma mapping
>   write(fd, {VHOST_IOTLB_REMAP, new_addr})
>
>   ioctl(fd, VHOST_VDPA_SET_STATUS,
> ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)

>From API level, this seems to be asymmetric as we have suspending but
not resuming?

>
> This is faster than VHOST_RESET_OWNER + VHOST_SET_OWNER + VHOST_IOTLB_UPDATE,
> as that would would unpin and repin physical pages, which would cost multiple
> seconds for large memories.
>
> This is implemented in QEMU by the patch series "Live update: vdpa"
>   https://lore.kernel.org/qemu-devel/TBD  (reference to be posted shortly)
>
> The QEMU implementation leverages the live migration code path, but after
> CPR exec's new QEMU:
>   - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
>   - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of
> VHOST_IOTLB_UPDATE
>
> Changes in V2:
>   - clean up handling of set_map vs dma_map vs platform iommu in remap
>   - augment and clarify commit messages and comments
>
> Steve Sistare (7):
>   vhost-vdpa: count pinned memory
>   vhost-vdpa: pass mm to bind
>   vhost-vdpa: VHOST_NEW_OWNER
>   vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
>   vhost-vdpa: VHOST_IOTLB_REMAP
>   vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
>   vdpa/mlx5: new owner capability
>
>  drivers/vdpa/mlx5/net/mlx5_vnet.c |   3 +-
>  drivers/vhost/vdpa.c  | 125 --
>  drivers/vhost/vhost.c |  15 
>  drivers/vhost/vhost.h |   1 +
>  include/uapi/linux/vhost.h|  10 +++
>  include/uapi/linux/vhost_types.h  |  15 +++-
>  6 files changed, 161 insertions(+), 8 deletions(-)
>
> --
> 2.39.3
>

Thanks




Re: [PATCH V2 0/7] vdpa live update

2024-07-12 Thread Steven Sistare

On 7/12/2024 9:18 AM, Steve Sistare wrote:

Live update is a technique wherein an application saves its state, exec's
to an updated version of itself, and restores its state.  Clients of the
application experience a brief suspension of service, on the order of
100's of milliseconds, but are otherwise unaffected.

Define and implement interfaces that allow vdpa devices to be preserved
across fork or exec, to support live update for applications such as QEMU.
The device must be suspended during the update, but its DMA mappings are
preserved, so the suspension is brief.

The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
accounting from one process to another.

The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
VHOST_NEW_OWNER is supported.

The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
address in the new process.

The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
VHOST_IOTLB_REMAP is supported and required.  Some devices do not
require it, because the userland address of each DMA mapping is discarded
after being translated to a physical address.

Here is a pseudo-code sequence for performing live update, based on
suspend + reset because resume is not yet widely available.  The vdpa device
descriptor, fd, remains open across the exec.

   ioctl(fd, VHOST_VDPA_SUSPEND)
   ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
   exec

   ioctl(fd, VHOST_NEW_OWNER)

   issue ioctls to re-create vrings

   if VHOST_BACKEND_F_IOTLB_REMAP
   foreach dma mapping
   write(fd, {VHOST_IOTLB_REMAP, new_addr})

   ioctl(fd, VHOST_VDPA_SET_STATUS,
 ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)

This is faster than VHOST_RESET_OWNER + VHOST_SET_OWNER + VHOST_IOTLB_UPDATE,
as that would would unpin and repin physical pages, which would cost multiple
seconds for large memories.

This is implemented in QEMU by the patch series "Live update: vdpa"
   https://lore.kernel.org/qemu-devel/TBD  (reference to be posted shortly)


https://lore.kernel.org/qemu-devel/1720792931-456433-3-git-send-email-steven.sist...@oracle.com


The QEMU implementation leverages the live migration code path, but after
CPR exec's new QEMU:
   - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
   - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of
 VHOST_IOTLB_UPDATE

Changes in V2:
   - clean up handling of set_map vs dma_map vs platform iommu in remap
   - augment and clarify commit messages and comments

Steve Sistare (7):
   vhost-vdpa: count pinned memory
   vhost-vdpa: pass mm to bind
   vhost-vdpa: VHOST_NEW_OWNER
   vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
   vhost-vdpa: VHOST_IOTLB_REMAP
   vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
   vdpa/mlx5: new owner capability

  drivers/vdpa/mlx5/net/mlx5_vnet.c |   3 +-
  drivers/vhost/vdpa.c  | 125 --
  drivers/vhost/vhost.c |  15 
  drivers/vhost/vhost.h |   1 +
  include/uapi/linux/vhost.h|  10 +++
  include/uapi/linux/vhost_types.h  |  15 +++-
  6 files changed, 161 insertions(+), 8 deletions(-)





[PATCH V2 0/7] vdpa live update

2024-07-12 Thread Steve Sistare
Live update is a technique wherein an application saves its state, exec's
to an updated version of itself, and restores its state.  Clients of the
application experience a brief suspension of service, on the order of
100's of milliseconds, but are otherwise unaffected.

Define and implement interfaces that allow vdpa devices to be preserved
across fork or exec, to support live update for applications such as QEMU.
The device must be suspended during the update, but its DMA mappings are
preserved, so the suspension is brief.

The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
accounting from one process to another.

The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
VHOST_NEW_OWNER is supported.

The VHOST_IOTLB_REMAP message type updates a DMA mapping with its userland
address in the new process.

The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
VHOST_IOTLB_REMAP is supported and required.  Some devices do not
require it, because the userland address of each DMA mapping is discarded
after being translated to a physical address.

Here is a pseudo-code sequence for performing live update, based on
suspend + reset because resume is not yet widely available.  The vdpa device
descriptor, fd, remains open across the exec.

  ioctl(fd, VHOST_VDPA_SUSPEND)
  ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
  exec

  ioctl(fd, VHOST_NEW_OWNER)

  issue ioctls to re-create vrings

  if VHOST_BACKEND_F_IOTLB_REMAP
  foreach dma mapping
  write(fd, {VHOST_IOTLB_REMAP, new_addr})

  ioctl(fd, VHOST_VDPA_SET_STATUS,
ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)

This is faster than VHOST_RESET_OWNER + VHOST_SET_OWNER + VHOST_IOTLB_UPDATE,
as that would would unpin and repin physical pages, which would cost multiple
seconds for large memories.

This is implemented in QEMU by the patch series "Live update: vdpa"
  https://lore.kernel.org/qemu-devel/TBD  (reference to be posted shortly)

The QEMU implementation leverages the live migration code path, but after
CPR exec's new QEMU:
  - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
  - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of
VHOST_IOTLB_UPDATE

Changes in V2:
  - clean up handling of set_map vs dma_map vs platform iommu in remap
  - augment and clarify commit messages and comments

Steve Sistare (7):
  vhost-vdpa: count pinned memory
  vhost-vdpa: pass mm to bind
  vhost-vdpa: VHOST_NEW_OWNER
  vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
  vhost-vdpa: VHOST_IOTLB_REMAP
  vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
  vdpa/mlx5: new owner capability

 drivers/vdpa/mlx5/net/mlx5_vnet.c |   3 +-
 drivers/vhost/vdpa.c  | 125 --
 drivers/vhost/vhost.c |  15 
 drivers/vhost/vhost.h |   1 +
 include/uapi/linux/vhost.h|  10 +++
 include/uapi/linux/vhost_types.h  |  15 +++-
 6 files changed, 161 insertions(+), 8 deletions(-)

-- 
2.39.3




Re: [PATCH v2 01/12] uprobes: update outdated comment

2024-07-10 Thread Andrii Nakryiko
On Wed, Jul 10, 2024 at 6:33 AM Oleg Nesterov  wrote:
>
> On 07/03, Oleg Nesterov wrote:
> >
> > > /*
> > > -* The NULL 'tsk' here ensures that any faults that occur here
> > > -* will not be accounted to the task.  'mm' *is* current->mm,
> > > -* but we treat this as a 'remote' access since it is
> > > -* essentially a kernel access to the memory.
> > > +* 'mm' *is* current->mm, but we treat this as a 'remote' access since
> > > +* it is essentially a kernel access to the memory.
> > >  */
> > > result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL);
> >
> > OK, this makes it less confusing, so
> >
> > Acked-by: Oleg Nesterov 
> >
> > -
> > but it still looks confusing to me. This code used to pass tsk = NULL
> > only to avoid tsk->maj/min_flt++ in faultin_page().
> >
> > But today mm_account_fault() increments these counters without checking
> > FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to
> > just use get_user_pages() and remove this comment?
>
> Well, yes, it still looks confusing, imo.
>
> Andrii, I hope you won't mind if I redo/resend this and the next cleanup?
>
> The next one only updates the comment above uprobe_write_opcode(), but
> it would be nice to explain mmap_write_lock() in register_for_each_vma().
>

I don't mind a bit, thanks for sending the patches!

> Oleg.
>
>



Re: [PATCH v2 01/12] uprobes: update outdated comment

2024-07-10 Thread Oleg Nesterov
On 07/03, Oleg Nesterov wrote:
>
> > /*
> > -* The NULL 'tsk' here ensures that any faults that occur here
> > -* will not be accounted to the task.  'mm' *is* current->mm,
> > -* but we treat this as a 'remote' access since it is
> > -* essentially a kernel access to the memory.
> > +* 'mm' *is* current->mm, but we treat this as a 'remote' access since
> > +* it is essentially a kernel access to the memory.
> >  */
> > result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL);
>
> OK, this makes it less confusing, so
>
> Acked-by: Oleg Nesterov 
>
> -
> but it still looks confusing to me. This code used to pass tsk = NULL
> only to avoid tsk->maj/min_flt++ in faultin_page().
>
> But today mm_account_fault() increments these counters without checking
> FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to
> just use get_user_pages() and remove this comment?

Well, yes, it still looks confusing, imo.

Andrii, I hope you won't mind if I redo/resend this and the next cleanup?

The next one only updates the comment above uprobe_write_opcode(), but
it would be nice to explain mmap_write_lock() in register_for_each_vma().

Oleg.




Re: [PATCH v4 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss

2024-07-09 Thread Konrad Dybcio
On 9.07.2024 8:49 AM, Naina Mehta wrote:
> Rename qdss@8880 memory region as qlink_logging memory region
> and add qdss_mem memory region at address of 0x8850,
> qlink_logging is being added at the memory region at the address
> of 0x8880 as the region is being used by modem firmware.
> Since different DSM region size is required for different modem
> firmware, split mpss_dsmharq_mem region into 2 separate regions.
> This would provide the flexibility to remove the region which is
> not required for a particular platform. Based on the modem firmware
> either both the regions have to be used or only mpss_dsm_mem has
> to be used. Also, reduce the size of mpssadsp_mem region.
> 
> Signed-off-by: Naina Mehta 
> ---

Thanks

Reviewed-by: Konrad Dybcio 

Konrad



[PATCH v4 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss

2024-07-09 Thread Naina Mehta
Rename qdss@8880 memory region as qlink_logging memory region
and add qdss_mem memory region at address of 0x8850,
qlink_logging is being added at the memory region at the address
of 0x8880 as the region is being used by modem firmware.
Since different DSM region size is required for different modem
firmware, split mpss_dsmharq_mem region into 2 separate regions.
This would provide the flexibility to remove the region which is
not required for a particular platform. Based on the modem firmware
either both the regions have to be used or only mpss_dsm_mem has
to be used. Also, reduce the size of mpssadsp_mem region.

Signed-off-by: Naina Mehta 
---
 arch/arm64/boot/dts/qcom/sdx75.dtsi | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/boot/dts/qcom/sdx75.dtsi 
b/arch/arm64/boot/dts/qcom/sdx75.dtsi
index 9b93f6501d55..6f0abcc87a3b 100644
--- a/arch/arm64/boot/dts/qcom/sdx75.dtsi
+++ b/arch/arm64/boot/dts/qcom/sdx75.dtsi
@@ -366,7 +366,12 @@ uefi_log_mem: uefi-log@87f75000 {
no-map;
};
 
-   qdss_mem: qdss@8880 {
+   qdss_mem: qdss@8850 {
+   reg = <0x0 0x8850 0x0 0x30>;
+   no-map;
+   };
+
+   qlink_logging_mem: qlink-logging@8880 {
reg = <0x0 0x8880 0x0 0x30>;
no-map;
};
@@ -377,8 +382,13 @@ audio_heap_mem: audio-heap@88b0 {
no-map;
};
 
-   mpss_dsmharq_mem: mpss-dsmharq@88f0 {
-   reg = <0x0 0x88f0 0x0 0x508>;
+   mpss_dsm_mem_2: mpss-dsm-2@88f0 {
+   reg = <0x0 0x88f0 0x0 0x250>;
+   no-map;
+   };
+
+   mpss_dsm_mem: mpss-dsm@8b40 {
+   reg = <0x0 0x8b40 0x0 0x2b8>;
no-map;
};
 
@@ -388,7 +398,7 @@ q6_mpss_dtb_mem: q6-mpss-dtb@8df8 {
};
 
mpssadsp_mem: mpssadsp@8e00 {
-   reg = <0x0 0x8e00 0x0 0xf40>;
+   reg = <0x0 0x8e00 0x0 0xf10>;
no-map;
};
 
-- 
2.34.1




Re: [PATCH] mailmap: Update Luca Weiss's email address

2024-07-08 Thread Bjorn Andersson


On Fri, 28 Jun 2024 19:40:55 +0200, Luca Weiss wrote:
> I'm slowly migrating my mail to a new domain, add an entry to map the
> mail address. Just for clarity, my work-related @fairphone.com email
> stays unchanged.
> 
> 

Applied, thanks!

[1/1] mailmap: Update Luca Weiss's email address
  commit: 2881fcfc8f32c536a4bf708066d6fea9ba762e86

Best regards,
-- 
Bjorn Andersson 



Re: [PATCH v2 01/12] uprobes: update outdated comment

2024-07-03 Thread Andrii Nakryiko
On Wed, Jul 3, 2024 at 4:40 AM Oleg Nesterov  wrote:
>
> Sorry for the late reply. I'll try to read this version/discussion
> when I have time... yes, I have already promised this before, sorry :/
>

No worries, there will be v3 next week (I'm going on short PTO
starting tomorrow). So you still have time. I appreciate you looking
at it, though!

> On 07/01, Andrii Nakryiko wrote:
> >
> > There is no task_struct passed into get_user_pages_remote() anymore,
> > drop the parts of comment mentioning NULL tsk, it's just confusing at
> > this point.
>
> Agreed.
>
> > @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, 
> > unsigned long vaddr)
> >   goto out;
> >
> >   /*
> > -  * The NULL 'tsk' here ensures that any faults that occur here
> > -  * will not be accounted to the task.  'mm' *is* current->mm,
> > -  * but we treat this as a 'remote' access since it is
> > -  * essentially a kernel access to the memory.
> > +  * 'mm' *is* current->mm, but we treat this as a 'remote' access since
> > +  * it is essentially a kernel access to the memory.
> >*/
> >   result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL);
>
> OK, this makes it less confusing, so
>
> Acked-by: Oleg Nesterov 
>
>
> -
> but it still looks confusing to me. This code used to pass tsk = NULL
> only to avoid tsk->maj/min_flt++ in faultin_page().
>
> But today mm_account_fault() increments these counters without checking
> FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to
> just use get_user_pages() and remove this comment?
>
> Oleg.
>



Re: [PATCH v2 01/12] uprobes: update outdated comment

2024-07-03 Thread Andrii Nakryiko
On Wed, Jul 3, 2024 at 4:40 AM Oleg Nesterov  wrote:
>
> Sorry for the late reply. I'll try to read this version/discussion
> when I have time... yes, I have already promised this before, sorry :/
>
> On 07/01, Andrii Nakryiko wrote:
> >
> > There is no task_struct passed into get_user_pages_remote() anymore,
> > drop the parts of comment mentioning NULL tsk, it's just confusing at
> > this point.
>
> Agreed.
>
> > @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, 
> > unsigned long vaddr)
> >   goto out;
> >
> >   /*
> > -  * The NULL 'tsk' here ensures that any faults that occur here
> > -  * will not be accounted to the task.  'mm' *is* current->mm,
> > -  * but we treat this as a 'remote' access since it is
> > -  * essentially a kernel access to the memory.
> > +  * 'mm' *is* current->mm, but we treat this as a 'remote' access since
> > +  * it is essentially a kernel access to the memory.
> >*/
> >   result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL);
>
> OK, this makes it less confusing, so
>
> Acked-by: Oleg Nesterov 
>
>
> -
> but it still looks confusing to me. This code used to pass tsk = NULL
> only to avoid tsk->maj/min_flt++ in faultin_page().
>
> But today mm_account_fault() increments these counters without checking
> FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to
> just use get_user_pages() and remove this comment?

I don't know, it was a drive-by cleanup which I'm starting to regret already :)

>
> Oleg.
>



Re: [PATCH v2 01/12] uprobes: update outdated comment

2024-07-03 Thread Oleg Nesterov
Sorry for the late reply. I'll try to read this version/discussion
when I have time... yes, I have already promised this before, sorry :/

On 07/01, Andrii Nakryiko wrote:
>
> There is no task_struct passed into get_user_pages_remote() anymore,
> drop the parts of comment mentioning NULL tsk, it's just confusing at
> this point.

Agreed.

> @@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, 
> unsigned long vaddr)
>   goto out;
>
>   /*
> -  * The NULL 'tsk' here ensures that any faults that occur here
> -  * will not be accounted to the task.  'mm' *is* current->mm,
> -  * but we treat this as a 'remote' access since it is
> -  * essentially a kernel access to the memory.
> +  * 'mm' *is* current->mm, but we treat this as a 'remote' access since
> +  * it is essentially a kernel access to the memory.
>*/
>   result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL);

OK, this makes it less confusing, so

Acked-by: Oleg Nesterov 


-
but it still looks confusing to me. This code used to pass tsk = NULL
only to avoid tsk->maj/min_flt++ in faultin_page().

But today mm_account_fault() increments these counters without checking
FAULT_FLAG_REMOTE, mm == current->mm, so it seems it would be better to
just use get_user_pages() and remove this comment?

Oleg.




[PATCH v12 17/19] Documentation: probes: Update fprobe on function-graph tracer

2024-07-03 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Update fprobe documentation for the new fprobe on function-graph
tracer. This includes some bahvior changes and pt_regs to
ftrace_regs interface change.

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Update @fregs parameter explanation.
---
 Documentation/trace/fprobe.rst |   42 ++--
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst
index 196f52386aaa..f58bdc64504f 100644
--- a/Documentation/trace/fprobe.rst
+++ b/Documentation/trace/fprobe.rst
@@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe
 Introduction
 
 
-Fprobe is a function entry/exit probe mechanism based on ftrace.
-Instead of using ftrace full feature, if you only want to attach callbacks
-on function entry and exit, similar to the kprobes and kretprobes, you can
+Fprobe is a function entry/exit probe mechanism based on the function-graph
+tracer.
+Instead of tracing all functions, if you want to attach callbacks on specific
+function entry and exit, similar to the kprobes and kretprobes, you can
 use fprobe. Compared with kprobes and kretprobes, fprobe gives faster
 instrumentation for multiple functions with single handler. This document
 describes how to use fprobe.
@@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as 
follows:
 
 .. code-block:: c
 
- int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
- void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
-Note that the @entry_ip is saved at function entry and passed to exit handler.
-If the entry callback function returns !0, the corresponding exit callback 
will be cancelled.
+Note that the @entry_ip is saved at function entry and passed to exit
+handler.
+If the entry callback function returns !0, the corresponding exit callback
+will be cancelled.
 
 @fp
 This is the address of `fprobe` data structure related to this handler.
@@ -112,12 +115,10 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 This is the return address that the traced function will return to,
 somewhere in the caller. This can be used at both entry and exit.
 
-@regs
-This is the `pt_regs` data structure at the entry and exit. Note that
-the instruction pointer of @regs may be different from the @entry_ip
-in the entry_handler. If you need traced instruction pointer, you need
-to use @entry_ip. On the other hand, in the exit_handler, the 
instruction
-pointer of @regs is set to the current return address.
+@fregs
+This is the `ftrace_regs` data structure at the entry and exit. This
+includes the function parameters, or the return values. So user can
+access thos values via appropriate `ftrace_regs_*` APIs.
 
 @entry_data
 This is a local storage to share the data between entry and exit 
handlers.
@@ -125,6 +126,17 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 and `entry_data_size` field when registering the fprobe, the storage is
 allocated and passed to both `entry_handler` and `exit_handler`.
 
+Entry data size and exit handlers on the same function
+==
+
+Since the entry data is passed via per-task stack and it is has limited size,
+the entry data size per probe is limited to `15 * sizeof(long)`. You also need
+to take care that the different fprobes are probing on the same function, this
+limit becomes smaller. The entry data size is aligned to `sizeof(long)` and
+each fprobe which has exit handler uses a `sizeof(long)` space on the stack,
+you should keep the number of fprobes on the same function as small as
+possible.
+
 Share the callbacks with kprobes
 
 
@@ -165,8 +177,8 @@ This counter counts up when;
  - fprobe fails to take ftrace_recursion lock. This usually means that a 
function
which is traced by other ftrace users is called from the entry_handler.
 
- - fprobe fails to setup the function exit because of the shortage of rethook
-   (the shadow stack for hooking the function return.)
+ - fprobe fails to setup the function exit because of failing to allocate the
+   data buffer from the per-task shadow stack.
 
 The `fprobe::nmissed` field counts up in both cases. Therefore, the former
 skips both of entry and exit callback and the latter skips the exit




[PATCH v2 01/12] uprobes: update outdated comment

2024-07-01 Thread Andrii Nakryiko
There is no task_struct passed into get_user_pages_remote() anymore,
drop the parts of comment mentioning NULL tsk, it's just confusing at
this point.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 99be2adedbc0..081821fd529a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, 
unsigned long vaddr)
goto out;
 
/*
-* The NULL 'tsk' here ensures that any faults that occur here
-* will not be accounted to the task.  'mm' *is* current->mm,
-* but we treat this as a 'remote' access since it is
-* essentially a kernel access to the memory.
+* 'mm' *is* current->mm, but we treat this as a 'remote' access since
+* it is essentially a kernel access to the memory.
 */
result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL);
if (result < 0)
-- 
2.43.0




Re: [RFC PATCH v1 1/2] virtio/vsock: rework deferred credit update logic

2024-07-01 Thread Stefano Garzarella

Hi Arseniy,

On Fri, Jun 21, 2024 at 10:25:40PM GMT, Arseniy Krasnov wrote:

Previous calculation of 'free_space' was wrong (but worked as expected
in most cases, see below), because it didn't account number of bytes in
rx queue. Let's rework 'free_space' calculation in the following way:
as this value is considered free space at rx side from tx point of 
view,

it must be equal to return value of 'virtio_transport_get_credit()' at
tx side. This function uses 'tx_cnt' counter and 'peer_fwd_cnt': first
is number of transmitted bytes (without wrap), second is last 'fwd_cnt'
value received from rx. So let's use same approach at rx side during
'free_space' calculation: add 'rx_cnt' counter which is number of
received bytes (also without wrap) and subtract 'last_fwd_cnt' from it.
Now we have:
1) 'rx_cnt' == 'tx_cnt' at both sides.
2) 'last_fwd_cnt' == 'peer_fwd_cnt' - because first is last 'fwd_cnt'
  sent to tx, while second is last 'fwd_cnt' received from rx.

Now 'free_space' is handled correctly and also we don't need
'low_rx_bytes' flag - this was more like a hack.

Previous calculation of 'free_space' worked (in 99% cases), because if
we take a look on behaviour of both expressions (new and previous):

'(rx_cnt - last_fwd_cnt)' and '(fwd_cnt - last_fwd_cnt)'

Both of them always grows up, with almost same "speed": only difference
is that 'rx_cnt' is incremented earlier during packet is received,
while 'fwd_cnt' in incremented when packet is read by user. So if 
'rx_cnt'

grows "faster", then resulting 'free_space' become smaller also, so we
send credit updates a little bit more, but:

 * 'free_space' calculation based on 'rx_cnt' gives the same value,
   which tx sees as free space at rx side, so original idea of
   'free_space' is now implemented as planned.
 * Hack with 'low_rx_bytes' now is not needed.

Also here is some performance comparison between both versions of
'free_space' calculation:

*--*--*--*
|  | 'rx_cnt' | previous |
*--*--*--*
|H -> G|   8.42   |   7.82   |
*--*--*--*
|G -> H|   11.6   |   12.1   |
*--*--*--*


I did some tests on an Intel(R) Xeon(R) Silver 4410Y using iperf-vsock:
- kernel 6.9.0
pkt_size G->H H->G
4k4.6  6.4
64k  13.8 11.5
128k 13.4 11.7

- kernel 6.9.0 with this series applied
pkt_size     G->H     H->G
4k            4.6     8.16
64k          12.2     8.9
128k         12.8     8.8

I see a big drop, especially on H->G with big packets. Can you try to 
replicate on your env?


I'll try to understand more and also an i7 on the next days.

Thanks,
Stefano



As benchmark 'vsock-iperf' with default arguments was used. There is no
significant performance difference before and after this patch.

Signed-off-by: Arseniy Krasnov 
---
include/linux/virtio_vsock.h| 1 +
net/vmw_vsock/virtio_transport_common.c | 8 +++-
2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c82089dee0c8..3579491c411e 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -135,6 +135,7 @@ struct virtio_vsock_sock {
u32 peer_buf_alloc;

/* Protected by rx_lock */
+   u32 rx_cnt;
u32 fwd_cnt;
u32 last_fwd_cnt;
u32 rx_bytes;
diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 16ff976a86e3..1d4e2328e06e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -441,6 +441,7 @@ static bool virtio_transport_inc_rx_pkt(struct 
virtio_vsock_sock *vvs,
return false;

vvs->rx_bytes += len;
+   vvs->rx_cnt += len;
return true;
}

@@ -558,7 +559,6 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
size_t bytes, total = 0;
struct sk_buff *skb;
u32 fwd_cnt_delta;
-   bool low_rx_bytes;
int err = -EFAULT;
u32 free_space;

@@ -603,9 +603,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
}

fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
-   free_space = vvs->buf_alloc - fwd_cnt_delta;
-   low_rx_bytes = (vvs->rx_bytes <
-   sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
+   free_space = vvs->buf_alloc - (vvs->rx_cnt - vvs->last_fwd_cnt);

spin_unlock_bh(>rx_lock);

@@ -619,7 +617,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 * number of bytes in rx queue is not enough to wake up reader.
 */
if (fwd_cnt_delta &&
-   (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
+   (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE))
virtio_transport_send_credit_update(vsk);

return total;
--
2.25.1







Re: [PATCH v3 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss

2024-07-01 Thread Naina Mehta




On 6/26/2024 9:12 PM, Konrad Dybcio wrote:

On 24.06.2024 1:21 PM, Naina Mehta wrote:



On 6/18/2024 7:08 PM, Konrad Dybcio wrote:



On 6/18/24 15:13, Naina Mehta wrote:

Rename qdss@8880 memory region as qlink_logging memory region
and add qdss_mem memory region at address of 0x8850.
Split mpss_dsmharq_mem region into 2 separate regions and
reduce the size of mpssadsp_mem region.

Signed-off-by: Naina Mehta 
---


Alright, we're getting somewhere. The commit message should however motivate
why such changes are necessary. For all we know, the splitting in two is
currently done for no reason, as qdss_mem and qlink_logging_mem are contiguous
- does the firmware have some expectations about them being separate?



Since different DSM region size is required for different modem firmware, 
mpss_dsmharq_mem region being split into 2 separate regions.
This would provide the flexibility to remove the region which is
not required for a particular platform.
qlink_logging is being added at the memory region at the address of
0x8880 as the region is being used by modem firmware.


Ok, now put that in the commit message :)

And I suppose:

"This would provide the flexibility to remove the region which is not
required for a particular platform." - but you still pass both to the
remoteproc in patch 4. Are these regions mutually exclusive?



Yes, for IDP platform, we are using both the DSM regions.
Based on the modem firmware either both the regions have to be used or 
only mpss_dsm_mem has to be used.


Regards,
Naina


Konrad





[PATCH] mailmap: Update Luca Weiss's email address

2024-06-28 Thread Luca Weiss
I'm slowly migrating my mail to a new domain, add an entry to map the
mail address. Just for clarity, my work-related @fairphone.com email
stays unchanged.

Signed-off-by: Luca Weiss 
---
Since my email address also appears in a bunch of drivers and arm(64)
files, and two devicetree binding files, how are those normally handled?
Just ignore them and let mailmap handle everything relevant?
---
 .mailmap | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.mailmap b/.mailmap
index a6c619e22efc..e169a99ce7c7 100644
--- a/.mailmap
+++ b/.mailmap
@@ -385,6 +385,7 @@ Li Yang  
 Lior David  
 Lorenzo Pieralisi  
 Luca Ceresoli  
+Luca Weiss  
 Lukasz Luba  
 Luo Jie  
 Maciej W. Rozycki  

---
base-commit: 642a16ca7994a50d7de85715996a8ce171a5bdfb
change-id: 20240628-mailmap-3528f7365abb

Best regards,
-- 
Luca Weiss 




[PATCH v2 4/6] riscv: ftrace: do not use stop_machine to update code

2024-06-28 Thread Andy Chiu
Now it is safe to remove dependency from stop_machine() for us to patch
code in ftrace.

Signed-off-by: Andy Chiu 
---
 arch/riscv/kernel/ftrace.c | 53 --
 1 file changed, 4 insertions(+), 49 deletions(-)

diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index 5ebe412280ef..57a6558e212e 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -13,23 +13,13 @@
 #include 
 
 #ifdef CONFIG_DYNAMIC_FTRACE
-void ftrace_arch_code_modify_prepare(void) __acquires(_mutex)
+void arch_ftrace_update_code(int command)
 {
mutex_lock(_mutex);
-
-   /*
-* The code sequences we use for ftrace can't be patched while the
-* kernel is running, so we need to use stop_machine() to modify them
-* for now.  This doesn't play nice with text_mutex, we use this flag
-* to elide the check.
-*/
-   riscv_patch_in_stop_machine = true;
-}
-
-void ftrace_arch_code_modify_post_process(void) __releases(_mutex)
-{
-   riscv_patch_in_stop_machine = false;
+   command |= FTRACE_MAY_SLEEP;
+   ftrace_modify_all_code(command);
mutex_unlock(_mutex);
+   flush_icache_all();
 }
 
 static int ftrace_check_current_call(unsigned long hook_pos,
@@ -155,41 +145,6 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
return __ftrace_modify_call_site(_call_dest, func, true);
 }
 
-struct ftrace_modify_param {
-   int command;
-   atomic_t cpu_count;
-};
-
-static int __ftrace_modify_code(void *data)
-{
-   struct ftrace_modify_param *param = data;
-
-   if (atomic_inc_return(>cpu_count) == num_online_cpus()) {
-   ftrace_modify_all_code(param->command);
-   /*
-* Make sure the patching store is effective *before* we
-* increment the counter which releases all waiting CPUs
-* by using the release variant of atomic increment. The
-* release pairs with the call to local_flush_icache_all()
-* on the waiting CPU.
-*/
-   atomic_inc_return_release(>cpu_count);
-   } else {
-   while (atomic_read(>cpu_count) <= num_online_cpus())
-   cpu_relax();
-
-   local_flush_icache_all();
-   }
-
-   return 0;
-}
-
-void arch_ftrace_update_code(int command)
-{
-   struct ftrace_modify_param param = { command, ATOMIC_INIT(0) };
-
-   stop_machine(__ftrace_modify_code, , cpu_online_mask);
-}
 #endif
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS

-- 
2.43.0




Re: [PATCH v3 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss

2024-06-26 Thread Konrad Dybcio
On 24.06.2024 1:21 PM, Naina Mehta wrote:
> 
> 
> On 6/18/2024 7:08 PM, Konrad Dybcio wrote:
>>
>>
>> On 6/18/24 15:13, Naina Mehta wrote:
>>> Rename qdss@8880 memory region as qlink_logging memory region
>>> and add qdss_mem memory region at address of 0x8850.
>>> Split mpss_dsmharq_mem region into 2 separate regions and
>>> reduce the size of mpssadsp_mem region.
>>>
>>> Signed-off-by: Naina Mehta 
>>> ---
>>
>> Alright, we're getting somewhere. The commit message should however motivate
>> why such changes are necessary. For all we know, the splitting in two is
>> currently done for no reason, as qdss_mem and qlink_logging_mem are 
>> contiguous
>> - does the firmware have some expectations about them being separate?
>>
> 
> Since different DSM region size is required for different modem firmware, 
> mpss_dsmharq_mem region being split into 2 separate regions.
> This would provide the flexibility to remove the region which is
> not required for a particular platform.
> qlink_logging is being added at the memory region at the address of
> 0x8880 as the region is being used by modem firmware.

Ok, now put that in the commit message :)

And I suppose:

"This would provide the flexibility to remove the region which is not
required for a particular platform." - but you still pass both to the
remoteproc in patch 4. Are these regions mutually exclusive?

Konrad



Re: [PATCH v2 0/3] ARM: dts: qcom: msm8926-motorola-peregrine: Update device tree of Motorola Moto G 4G (2013)

2024-06-25 Thread Bjorn Andersson


On Mon, 17 Jun 2024 23:22:26 +0200, André Apitzsch wrote:
> Add accelerometer, magnetometer, regulator and temperature sensor alert
> interrupt and update framebuffer supplies.
> 
> 

Applied, thanks!

[1/3] ARM: dts: qcom: msm8926-motorola-peregrine: Add accelerometer, 
magnetometer, regulator
  commit: 65ec35baeb937e91970c5d88118c5638d8582bb3
[2/3] ARM: dts: qcom: msm8926-motorola-peregrine: Update temperature sensor
  commit: c9c86387ea1c4034fec34690c7cf2a96f9c21196
[3/3] ARM: dts: qcom: msm8926-motorola-peregrine: Add framebuffer supplies
  commit: fed1c79fc7fe10900d99a79a36e40443f3267ef3

Best regards,
-- 
Bjorn Andersson 



Re: [RFC PATCH v1 1/2] virtio/vsock: rework deferred credit update logic

2024-06-25 Thread Arseniy Krasnov



On 25.06.2024 16:46, Stefano Garzarella wrote:
> On Fri, Jun 21, 2024 at 10:25:40PM GMT, Arseniy Krasnov wrote:
>> Previous calculation of 'free_space' was wrong (but worked as expected
>> in most cases, see below), because it didn't account number of bytes in
>> rx queue. Let's rework 'free_space' calculation in the following way:
>> as this value is considered free space at rx side from tx point of view,
>> it must be equal to return value of 'virtio_transport_get_credit()' at
>> tx side. This function uses 'tx_cnt' counter and 'peer_fwd_cnt': first
>> is number of transmitted bytes (without wrap), second is last 'fwd_cnt'
>> value received from rx. So let's use same approach at rx side during
>> 'free_space' calculation: add 'rx_cnt' counter which is number of
>> received bytes (also without wrap) and subtract 'last_fwd_cnt' from it.
>> Now we have:
>> 1) 'rx_cnt' == 'tx_cnt' at both sides.
>> 2) 'last_fwd_cnt' == 'peer_fwd_cnt' - because first is last 'fwd_cnt'
>>   sent to tx, while second is last 'fwd_cnt' received from rx.
>>
>> Now 'free_space' is handled correctly and also we don't need
> 
> mmm, I don't know if it was wrong before, maybe we could say it was less 
> accurate.

May be "now 'free_space' is handled in more precise way and also we ..." ?

> 
> That said, could we have the same problem now if we have a lot of producers 
> and the virtqueue becomes full?
> 

I guess if virtqueue is full, we just wait by returning skb back to tx queue... 
e.g.
data exchange between two sockets just freezes. ?

>> 'low_rx_bytes' flag - this was more like a hack.
>>
>> Previous calculation of 'free_space' worked (in 99% cases), because if
>> we take a look on behaviour of both expressions (new and previous):
>>
>> '(rx_cnt - last_fwd_cnt)' and '(fwd_cnt - last_fwd_cnt)'
>>
>> Both of them always grows up, with almost same "speed": only difference
>> is that 'rx_cnt' is incremented earlier during packet is received,
>> while 'fwd_cnt' in incremented when packet is read by user. So if 'rx_cnt'
>> grows "faster", then resulting 'free_space' become smaller also, so we
>> send credit updates a little bit more, but:
>>
>>  * 'free_space' calculation based on 'rx_cnt' gives the same value,
>>    which tx sees as free space at rx side, so original idea of
> 
> Ditto, what happen if the virtqueue is full?
> 
>>    'free_space' is now implemented as planned.
>>  * Hack with 'low_rx_bytes' now is not needed.
> 
> Yeah, so this patch should also mitigate issue reported by Alex (added in 
> CC), right?
> 
> If yes, please mention that problem and add a Reported-by giving credit to 
> Alex.

Yes, of course!

> 
>>
>> Also here is some performance comparison between both versions of
>> 'free_space' calculation:
>>
>> *--*--*--*
>> |  | 'rx_cnt' | previous |
>> *--*--*--*
>> |H -> G|   8.42   |   7.82   |
>> *--*--*--*
>> |G -> H|   11.6   |   12.1   |
>> *--*--*--*
> 
> How many seconds did you run it? How many repetitions? There's a little 
> discrepancy anyway, but I can't tell if it's just noise.

I run 4 times, each run for ~10 seconds... I think I can also add number of 
credit update messages to this report.

> 
>>
>> As benchmark 'vsock-iperf' with default arguments was used. There is no
>> significant performance difference before and after this patch.
>>
>> Signed-off-by: Arseniy Krasnov 
>> ---
>> include/linux/virtio_vsock.h    | 1 +
>> net/vmw_vsock/virtio_transport_common.c | 8 +++-
>> 2 files changed, 4 insertions(+), 5 deletions(-)
> 
> Thanks for working on this, I'll do more tests but the approach LGTM.

Got it, Thanks

> 
> Thanks,
> Stefano
> 
>>
>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>> index c82089dee0c8..3579491c411e 100644
>> --- a/include/linux/virtio_vsock.h
>> +++ b/include/linux/virtio_vsock.h
>> @@ -135,6 +135,7 @@ struct virtio_vsock_sock {
>> u32 peer_buf_alloc;
>>
>> /* Protected by rx_lock */
>> +    u32 rx_cnt;
>> u32 fwd_cnt;
>> u32 last_fwd_cnt;
>> u32 rx_bytes;
>> diff --git a/net/vmw_vsock/virtio_transport_common.c 
>> b/net/vmw_vsock/virtio_transport_common.c
>> index 16ff976a86e3..1d4e2328e06e 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -441,6 +441,7 @@ static bool virtio_transport_inc_rx_pkt(struct 
>> virtio_vs

Re: [RFC PATCH v1 1/2] virtio/vsock: rework deferred credit update logic

2024-06-25 Thread Stefano Garzarella

On Fri, Jun 21, 2024 at 10:25:40PM GMT, Arseniy Krasnov wrote:

Previous calculation of 'free_space' was wrong (but worked as expected
in most cases, see below), because it didn't account number of bytes in
rx queue. Let's rework 'free_space' calculation in the following way:
as this value is considered free space at rx side from tx point of view,
it must be equal to return value of 'virtio_transport_get_credit()' at
tx side. This function uses 'tx_cnt' counter and 'peer_fwd_cnt': first
is number of transmitted bytes (without wrap), second is last 'fwd_cnt'
value received from rx. So let's use same approach at rx side during
'free_space' calculation: add 'rx_cnt' counter which is number of
received bytes (also without wrap) and subtract 'last_fwd_cnt' from it.
Now we have:
1) 'rx_cnt' == 'tx_cnt' at both sides.
2) 'last_fwd_cnt' == 'peer_fwd_cnt' - because first is last 'fwd_cnt'
  sent to tx, while second is last 'fwd_cnt' received from rx.

Now 'free_space' is handled correctly and also we don't need


mmm, I don't know if it was wrong before, maybe we could say it was less 
accurate.


That said, could we have the same problem now if we have a lot of 
producers and the virtqueue becomes full?



'low_rx_bytes' flag - this was more like a hack.

Previous calculation of 'free_space' worked (in 99% cases), because if
we take a look on behaviour of both expressions (new and previous):

'(rx_cnt - last_fwd_cnt)' and '(fwd_cnt - last_fwd_cnt)'

Both of them always grows up, with almost same "speed": only difference
is that 'rx_cnt' is incremented earlier during packet is received,
while 'fwd_cnt' in incremented when packet is read by user. So if 'rx_cnt'
grows "faster", then resulting 'free_space' become smaller also, so we
send credit updates a little bit more, but:

 * 'free_space' calculation based on 'rx_cnt' gives the same value,
   which tx sees as free space at rx side, so original idea of


Ditto, what happen if the virtqueue is full?


   'free_space' is now implemented as planned.
 * Hack with 'low_rx_bytes' now is not needed.


Yeah, so this patch should also mitigate issue reported by Alex (added 
in CC), right?


If yes, please mention that problem and add a Reported-by giving credit 
to Alex.




Also here is some performance comparison between both versions of
'free_space' calculation:

*--*--*--*
|  | 'rx_cnt' | previous |
*--*--*--*
|H -> G|   8.42   |   7.82   |
*--*--*--*
|G -> H|   11.6   |   12.1   |
*--*--*--*


How many seconds did you run it? How many repetitions? There's a little 
discrepancy anyway, but I can't tell if it's just noise.




As benchmark 'vsock-iperf' with default arguments was used. There is no
significant performance difference before and after this patch.

Signed-off-by: Arseniy Krasnov 
---
include/linux/virtio_vsock.h| 1 +
net/vmw_vsock/virtio_transport_common.c | 8 +++-
2 files changed, 4 insertions(+), 5 deletions(-)


Thanks for working on this, I'll do more tests but the approach LGTM.

Thanks,
Stefano



diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c82089dee0c8..3579491c411e 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -135,6 +135,7 @@ struct virtio_vsock_sock {
u32 peer_buf_alloc;

/* Protected by rx_lock */
+   u32 rx_cnt;
u32 fwd_cnt;
u32 last_fwd_cnt;
u32 rx_bytes;
diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 16ff976a86e3..1d4e2328e06e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -441,6 +441,7 @@ static bool virtio_transport_inc_rx_pkt(struct 
virtio_vsock_sock *vvs,
return false;

vvs->rx_bytes += len;
+   vvs->rx_cnt += len;
return true;
}

@@ -558,7 +559,6 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
size_t bytes, total = 0;
struct sk_buff *skb;
u32 fwd_cnt_delta;
-   bool low_rx_bytes;
int err = -EFAULT;
u32 free_space;

@@ -603,9 +603,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
}

fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
-   free_space = vvs->buf_alloc - fwd_cnt_delta;
-   low_rx_bytes = (vvs->rx_bytes <
-   sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
+   free_space = vvs->buf_alloc - (vvs->rx_cnt - vvs->last_fwd_cnt);

spin_unlock_bh(>rx_lock);

@@ -619,7 +617,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 * number of bytes in rx queue is not enough to wake up reader.
 */
if (fwd_cnt_delta &&
-   (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
+   (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE))
virtio_transport_send_credit_update(vsk);

return total;
--

[PATCH 01/12] uprobes: update outdated comment

2024-06-24 Thread Andrii Nakryiko
There is no task_struct passed into get_user_pages_remote() anymore,
drop the parts of comment mentioning NULL tsk, it's just confusing at
this point.

Signed-off-by: Andrii Nakryiko 
---
 kernel/events/uprobes.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2816e65729ac..197fbe4663b5 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, 
unsigned long vaddr)
goto out;
 
/*
-* The NULL 'tsk' here ensures that any faults that occur here
-* will not be accounted to the task.  'mm' *is* current->mm,
-* but we treat this as a 'remote' access since it is
-* essentially a kernel access to the memory.
+* 'mm' *is* current->mm, but we treat this as a 'remote' access since
+* it is essentially a kernel access to the memory.
 */
result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, , NULL);
if (result < 0)
-- 
2.43.0




Re: [PATCH v3 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss

2024-06-24 Thread Naina Mehta




On 6/18/2024 7:08 PM, Konrad Dybcio wrote:



On 6/18/24 15:13, Naina Mehta wrote:

Rename qdss@8880 memory region as qlink_logging memory region
and add qdss_mem memory region at address of 0x8850.
Split mpss_dsmharq_mem region into 2 separate regions and
reduce the size of mpssadsp_mem region.

Signed-off-by: Naina Mehta 
---


Alright, we're getting somewhere. The commit message should however 
motivate

why such changes are necessary. For all we know, the splitting in two is
currently done for no reason, as qdss_mem and qlink_logging_mem are 
contiguous

- does the firmware have some expectations about them being separate?



Since different DSM region size is required for different modem 
firmware, mpss_dsmharq_mem region being split into 2 separate regions.

This would provide the flexibility to remove the region which is
not required for a particular platform.
qlink_logging is being added at the memory region at the address of
0x8880 as the region is being used by modem firmware.

Regards,
Naina


Konrad




[RFC PATCH v1 1/2] virtio/vsock: rework deferred credit update logic

2024-06-21 Thread Arseniy Krasnov
Previous calculation of 'free_space' was wrong (but worked as expected
in most cases, see below), because it didn't account number of bytes in
rx queue. Let's rework 'free_space' calculation in the following way:
as this value is considered free space at rx side from tx point of view,
it must be equal to return value of 'virtio_transport_get_credit()' at
tx side. This function uses 'tx_cnt' counter and 'peer_fwd_cnt': first
is number of transmitted bytes (without wrap), second is last 'fwd_cnt'
value received from rx. So let's use same approach at rx side during
'free_space' calculation: add 'rx_cnt' counter which is number of
received bytes (also without wrap) and subtract 'last_fwd_cnt' from it.
Now we have:
1) 'rx_cnt' == 'tx_cnt' at both sides.
2) 'last_fwd_cnt' == 'peer_fwd_cnt' - because first is last 'fwd_cnt'
   sent to tx, while second is last 'fwd_cnt' received from rx.

Now 'free_space' is handled correctly and also we don't need
'low_rx_bytes' flag - this was more like a hack.

Previous calculation of 'free_space' worked (in 99% cases), because if
we take a look on behaviour of both expressions (new and previous):

'(rx_cnt - last_fwd_cnt)' and '(fwd_cnt - last_fwd_cnt)'

Both of them always grows up, with almost same "speed": only difference
is that 'rx_cnt' is incremented earlier during packet is received,
while 'fwd_cnt' in incremented when packet is read by user. So if 'rx_cnt'
grows "faster", then resulting 'free_space' become smaller also, so we
send credit updates a little bit more, but:

  * 'free_space' calculation based on 'rx_cnt' gives the same value,
which tx sees as free space at rx side, so original idea of
'free_space' is now implemented as planned.
  * Hack with 'low_rx_bytes' now is not needed.

Also here is some performance comparison between both versions of
'free_space' calculation:

 *--*--*--*
 |  | 'rx_cnt' | previous |
 *--*--*--*
 |H -> G|   8.42   |   7.82   |
 *--*--*--*
 |G -> H|   11.6   |   12.1   |
 *--*--*--*

As benchmark 'vsock-iperf' with default arguments was used. There is no
significant performance difference before and after this patch.

Signed-off-by: Arseniy Krasnov 
---
 include/linux/virtio_vsock.h| 1 +
 net/vmw_vsock/virtio_transport_common.c | 8 +++-
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c82089dee0c8..3579491c411e 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -135,6 +135,7 @@ struct virtio_vsock_sock {
u32 peer_buf_alloc;
 
/* Protected by rx_lock */
+   u32 rx_cnt;
u32 fwd_cnt;
u32 last_fwd_cnt;
u32 rx_bytes;
diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 16ff976a86e3..1d4e2328e06e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -441,6 +441,7 @@ static bool virtio_transport_inc_rx_pkt(struct 
virtio_vsock_sock *vvs,
return false;
 
vvs->rx_bytes += len;
+   vvs->rx_cnt += len;
return true;
 }
 
@@ -558,7 +559,6 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
size_t bytes, total = 0;
struct sk_buff *skb;
u32 fwd_cnt_delta;
-   bool low_rx_bytes;
int err = -EFAULT;
u32 free_space;
 
@@ -603,9 +603,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
}
 
fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
-   free_space = vvs->buf_alloc - fwd_cnt_delta;
-   low_rx_bytes = (vvs->rx_bytes <
-   sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
+   free_space = vvs->buf_alloc - (vvs->rx_cnt - vvs->last_fwd_cnt);
 
spin_unlock_bh(>rx_lock);
 
@@ -619,7 +617,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 * number of bytes in rx queue is not enough to wake up reader.
 */
if (fwd_cnt_delta &&
-   (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
+   (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE))
virtio_transport_send_credit_update(vsk);
 
return total;
-- 
2.25.1




[RFC PATCH v1 2/2] vsock/test: add test for deferred credit update

2024-06-21 Thread Arseniy Krasnov
This test checks, that we send exactly expected number of credit
update packets during deferred credit update optimization. Test
work in client/server modes:
1) Client just connects to server and send 256Kb of data. 256Kb
   is chosen because it is default space for vsock peer. After
   transmission client waits until server performs checks of this
   test.

2) Server waits for vsock connection and also open raw socket
   binded to vsock monitor interface. Then server waits until
   there will be 256Kb of data in its rx queue (by reading data
   from stream socket with MSG_PEEK flag). Then server starts
   reading data from stream socket - it reads entire socket,
   also reading packets from raw vsock socket, to check that
   there is OP_RW packets. After data read is done, it checks
   raw socket again, by waiting exact amount of CREDIT_UPDATE
   packets. Finally it checks that rx queue of raw socket is
   empty using read with timeout.

Some notes about this test:
* It is only for virtio transport.
* It relies on sizes of virtio rx buffers for guest and host,
  to handle raw packets properly (4Kb for guest and 64Kb for
  host).
* It relies on free space limit for deferred credit update -
  currently it is 64Kb.
* It needs permissions to open raw sockets.

Signed-off-by: Arseniy Krasnov 
---
 tools/testing/vsock/.gitignore  |   1 +
 tools/testing/vsock/Makefile|   2 +
 tools/testing/vsock/virtio_vsock_test.c | 369 
 3 files changed, 372 insertions(+)
 create mode 100644 tools/testing/vsock/virtio_vsock_test.c

diff --git a/tools/testing/vsock/.gitignore b/tools/testing/vsock/.gitignore
index d9f798713cd7..74d13a38cf43 100644
--- a/tools/testing/vsock/.gitignore
+++ b/tools/testing/vsock/.gitignore
@@ -4,3 +4,4 @@ vsock_test
 vsock_diag_test
 vsock_perf
 vsock_uring_test
+virtio_vsock_test
diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index a7f56a09ca9f..e04d69903687 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -8,6 +8,8 @@ vsock_perf: vsock_perf.o msg_zerocopy_common.o
 vsock_uring_test: LDLIBS = -luring
 vsock_uring_test: control.o util.o vsock_uring_test.o timeout.o 
msg_zerocopy_common.o
 
+virtio_vsock_test: virtio_vsock_test.o util.o timeout.o control.o
+
 CFLAGS += -g -O2 -Werror -Wall -I. -I../../include -I../../../usr/include 
-Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD 
-U_FORTIFY_SOURCE -D_GNU_SOURCE
 .PHONY: all test clean
 clean:
diff --git a/tools/testing/vsock/virtio_vsock_test.c 
b/tools/testing/vsock/virtio_vsock_test.c
new file mode 100644
index ..4320dbc31ecc
--- /dev/null
+++ b/tools/testing/vsock/virtio_vsock_test.c
@@ -0,0 +1,369 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * virtio_vsock_test - vsock.ko test suite for VirtIO transport.
+ *
+ * Copyright (C) 2024 SaluteDevices.
+ *
+ * Author: Arseniy Krasnov 
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "control.h"
+#include "util.h"
+
+#define RAW_SOCK_RCV_BUF   (1024 * 1024)
+
+#define TOTAL_TX_BYTES (1024 * 256)
+
+#define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE  (1024 * 64)
+
+#define VIRTIO_VSOCK_GUEST_RX_BUF_SIZE (4096)
+#define VIRTIO_VSOCK_HOST_RX_BUF_SIZE  (1024 * 64)
+
+static const char *interface;
+
+static void test_stream_deferred_credit_update_client(const struct test_opts 
*opts)
+{
+   unsigned char buf[VIRTIO_VSOCK_MAX_PKT_BUF_SIZE];
+   int fd;
+   int i;
+
+   fd = vsock_stream_connect(opts->peer_cid, opts->peer_port);
+   if (fd < 0) {
+   perror("connect");
+   exit(EXIT_FAILURE);
+   }
+
+   control_expectln("SERVERREADY");
+   memset(buf, 0, sizeof(buf));
+
+   for (i = 0; i < TOTAL_TX_BYTES / VIRTIO_VSOCK_MAX_PKT_BUF_SIZE; i++) {
+   if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
+   perror("write");
+   exit(EXIT_FAILURE);
+   }
+   }
+
+   control_writeln("CLIENTDONE");
+   control_expectln("SERVERDONE");
+
+   close(fd);
+}
+
+static int create_raw_sock(const char *ifname)
+{
+   struct sockaddr_ll addr_ll;
+   struct ifreq ifr;
+   size_t rcv_buf;
+   int fd;
+
+   fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+   if (fd < 0) {
+   perror("socket");
+   exit(EXIT_FAILURE);
+   }
+
+   memset(, 0, sizeof(ifr));
+   snprintf(ifr.ifr_name, sizeof(ifr.ifr_name),
+"%s", ifname);
+
+   rcv_buf = RAW_SOCK_RCV_BUF;
+   if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE, _buf,
+   sizeof(rcv_buf))) {
+   p

[RFC PATCH v1 0/2] virtio/vsock: some updates for deferred credit update

2024-06-21 Thread Arseniy Krasnov
This patchset contains:
0001 - patch which reworks deferred credit update. Pls see commit message,
   it contains full description of this problem.

0002 - test which uses vsockmon interface, and checks that deferred
   credit update works as expected by parsing raw packets.

Arseniy Krasnov (2):
  virtio/vsock: rework deferred credit update logic
  vsock/test: add test for deferred credit update

 include/linux/virtio_vsock.h|   1 +
 net/vmw_vsock/virtio_transport_common.c |   8 +-
 tools/testing/vsock/.gitignore  |   1 +
 tools/testing/vsock/Makefile|   2 +
 tools/testing/vsock/virtio_vsock_test.c | 369 
 5 files changed, 376 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/vsock/virtio_vsock_test.c

-- 
2.25.1




Re: [PATCH v9 5/8] remoteproc: qcom: Update regmap offsets for halt register

2024-06-21 Thread Krzysztof Kozlowski
On 21/06/2024 13:46, Gokul Sriram Palanisamy wrote:
> Fixed issue in reading halt-regs parameter from device-tree.

What issue?

That's a terrible commit msg. Explain what is the problem, how can it be
reproduced.

> 
> Signed-off-by: Sricharan R 
> Signed-off-by: Gokul Sriram Palanisamy 
> ---
>  drivers/remoteproc/qcom_q6v5_wcss.c | 22 ++
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/remoteproc/qcom_q6v5_wcss.c 
> b/drivers/remoteproc/qcom_q6v5_wcss.c
> index 06936ca1d079..87b78eb15b86 100644
> --- a/drivers/remoteproc/qcom_q6v5_wcss.c
> +++ b/drivers/remoteproc/qcom_q6v5_wcss.c
> @@ -86,7 +86,7 @@
>  #define TCSR_WCSS_CLK_MASK   0x1F
>  #define TCSR_WCSS_CLK_ENABLE 0x14
>  
> -#define MAX_HALT_REG 3
> +#define MAX_HALT_REG 4

? That's confusing and looks unrelated.


Best regards,
Krzysztof




[PATCH v9 5/8] remoteproc: qcom: Update regmap offsets for halt register

2024-06-21 Thread Gokul Sriram Palanisamy
Fixed issue in reading halt-regs parameter from device-tree.

Signed-off-by: Sricharan R 
Signed-off-by: Gokul Sriram Palanisamy 
---
 drivers/remoteproc/qcom_q6v5_wcss.c | 22 ++
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/remoteproc/qcom_q6v5_wcss.c 
b/drivers/remoteproc/qcom_q6v5_wcss.c
index 06936ca1d079..87b78eb15b86 100644
--- a/drivers/remoteproc/qcom_q6v5_wcss.c
+++ b/drivers/remoteproc/qcom_q6v5_wcss.c
@@ -86,7 +86,7 @@
 #define TCSR_WCSS_CLK_MASK 0x1F
 #define TCSR_WCSS_CLK_ENABLE   0x14
 
-#define MAX_HALT_REG   3
+#define MAX_HALT_REG   4
 
 #define WCNSS_PAS_ID   6
 
@@ -154,6 +154,7 @@ struct wcss_data {
u32 version;
bool aon_reset_required;
bool wcss_q6_reset_required;
+   bool bcr_reset_required;
const char *ssr_name;
const char *sysmon_name;
int ssctl_id;
@@ -875,10 +876,13 @@ static int q6v5_wcss_init_reset(struct q6v5_wcss *wcss,
}
}
 
-   wcss->wcss_q6_bcr_reset = devm_reset_control_get_exclusive(dev, 
"wcss_q6_bcr_reset");
-   if (IS_ERR(wcss->wcss_q6_bcr_reset)) {
-   dev_err(wcss->dev, "unable to acquire wcss_q6_bcr_reset\n");
-   return PTR_ERR(wcss->wcss_q6_bcr_reset);
+   if (desc->bcr_reset_required) {
+   wcss->wcss_q6_bcr_reset = devm_reset_control_get_exclusive(dev,
+  
"wcss_q6_bcr_reset");
+   if (IS_ERR(wcss->wcss_q6_bcr_reset)) {
+   dev_err(wcss->dev, "unable to acquire 
wcss_q6_bcr_reset\n");
+   return PTR_ERR(wcss->wcss_q6_bcr_reset);
+   }
}
 
return 0;
@@ -928,9 +932,9 @@ static int q6v5_wcss_init_mmio(struct q6v5_wcss *wcss,
return -EINVAL;
}
 
-   wcss->halt_q6 = halt_reg[0];
-   wcss->halt_wcss = halt_reg[1];
-   wcss->halt_nc = halt_reg[2];
+   wcss->halt_q6 = halt_reg[1];
+   wcss->halt_wcss = halt_reg[2];
+   wcss->halt_nc = halt_reg[3];
 
return 0;
 }
@@ -1170,6 +1174,7 @@ static const struct wcss_data wcss_ipq8074_res_init = {
.crash_reason_smem = WCSS_CRASH_REASON,
.aon_reset_required = true,
.wcss_q6_reset_required = true,
+   .bcr_reset_required = false,
.ssr_name = "q6wcss",
.ops = _wcss_ipq8074_ops,
.requires_force_stop = true,
@@ -1184,6 +1189,7 @@ static const struct wcss_data wcss_qcs404_res_init = {
.version = WCSS_QCS404,
.aon_reset_required = false,
.wcss_q6_reset_required = false,
+   .bcr_reset_required = true,
.ssr_name = "mpss",
.sysmon_name = "wcnss",
.ssctl_id = 0x12,
-- 
2.34.1




Re: [PATCH v3 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss

2024-06-18 Thread Konrad Dybcio




On 6/18/24 15:13, Naina Mehta wrote:

Rename qdss@8880 memory region as qlink_logging memory region
and add qdss_mem memory region at address of 0x8850.
Split mpss_dsmharq_mem region into 2 separate regions and
reduce the size of mpssadsp_mem region.

Signed-off-by: Naina Mehta 
---


Alright, we're getting somewhere. The commit message should however motivate
why such changes are necessary. For all we know, the splitting in two is
currently done for no reason, as qdss_mem and qlink_logging_mem are contiguous
- does the firmware have some expectations about them being separate?

Konrad



[PATCH v3 3/5] arm64: dts: qcom: sdx75: update reserved memory regions for mpss

2024-06-18 Thread Naina Mehta
Rename qdss@8880 memory region as qlink_logging memory region
and add qdss_mem memory region at address of 0x8850.
Split mpss_dsmharq_mem region into 2 separate regions and
reduce the size of mpssadsp_mem region.

Signed-off-by: Naina Mehta 
---
 arch/arm64/boot/dts/qcom/sdx75.dtsi | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/boot/dts/qcom/sdx75.dtsi 
b/arch/arm64/boot/dts/qcom/sdx75.dtsi
index 9b93f6501d55..6f0abcc87a3b 100644
--- a/arch/arm64/boot/dts/qcom/sdx75.dtsi
+++ b/arch/arm64/boot/dts/qcom/sdx75.dtsi
@@ -366,7 +366,12 @@ uefi_log_mem: uefi-log@87f75000 {
no-map;
};
 
-   qdss_mem: qdss@8880 {
+   qdss_mem: qdss@8850 {
+   reg = <0x0 0x8850 0x0 0x30>;
+   no-map;
+   };
+
+   qlink_logging_mem: qlink-logging@8880 {
reg = <0x0 0x8880 0x0 0x30>;
no-map;
};
@@ -377,8 +382,13 @@ audio_heap_mem: audio-heap@88b0 {
no-map;
};
 
-   mpss_dsmharq_mem: mpss-dsmharq@88f0 {
-   reg = <0x0 0x88f0 0x0 0x508>;
+   mpss_dsm_mem_2: mpss-dsm-2@88f0 {
+   reg = <0x0 0x88f0 0x0 0x250>;
+   no-map;
+   };
+
+   mpss_dsm_mem: mpss-dsm@8b40 {
+   reg = <0x0 0x8b40 0x0 0x2b8>;
no-map;
};
 
@@ -388,7 +398,7 @@ q6_mpss_dtb_mem: q6-mpss-dtb@8df8 {
};
 
mpssadsp_mem: mpssadsp@8e00 {
-   reg = <0x0 0x8e00 0x0 0xf40>;
+   reg = <0x0 0x8e00 0x0 0xf10>;
no-map;
};
 
-- 
2.34.1




Re: [PATCH v2 2/3] ARM: dts: qcom: msm8926-motorola-peregrine: Update temperature sensor

2024-06-18 Thread Konrad Dybcio




On 6/17/24 23:22, André Apitzsch via B4 Relay wrote:

From: André Apitzsch 

Add alert interrupt for the temperature sensor of Motorola Moto G 4G
(2013), although not used by the driver yet.

Signed-off-by: André Apitzsch 
---


Reviewed-by: Konrad Dybcio 

Konrad



[PATCH v2 0/3] ARM: dts: qcom: msm8926-motorola-peregrine: Update device tree of Motorola Moto G 4G (2013)

2024-06-17 Thread André Apitzsch via B4 Relay
Add accelerometer, magnetometer, regulator and temperature sensor alert
interrupt and update framebuffer supplies.

Signed-off-by: André Apitzsch 
---
Changes in v2:
- Split commit into three commits
- Link to v1: 
https://lore.kernel.org/r/20240616-peregrine-v1-1-85d14ae1a...@apitzsch.eu

---
André Apitzsch (3):
  ARM: dts: qcom: msm8926-motorola-peregrine: Add accelerometer, 
magnetometer, regulator
  ARM: dts: qcom: msm8926-motorola-peregrine: Update temperature sensor
  ARM: dts: qcom: msm8926-motorola-peregrine: Add framebuffer supplies

 .../dts/qcom/qcom-msm8926-motorola-peregrine.dts   | 121 +
 1 file changed, 121 insertions(+)
---
base-commit: c71189547381bb5f176c6b22a9edc3414f1837b9
change-id: 20240616-peregrine-6ec5e26b15ec

Best regards,
-- 
André Apitzsch 





[PATCH v2 2/3] ARM: dts: qcom: msm8926-motorola-peregrine: Update temperature sensor

2024-06-17 Thread André Apitzsch via B4 Relay
From: André Apitzsch 

Add alert interrupt for the temperature sensor of Motorola Moto G 4G
(2013), although not used by the driver yet.

Signed-off-by: André Apitzsch 
---
 arch/arm/boot/dts/qcom/qcom-msm8926-motorola-peregrine.dts | 12 
 1 file changed, 12 insertions(+)

diff --git a/arch/arm/boot/dts/qcom/qcom-msm8926-motorola-peregrine.dts 
b/arch/arm/boot/dts/qcom/qcom-msm8926-motorola-peregrine.dts
index 50ae3cfc95bb..cff9415baa46 100644
--- a/arch/arm/boot/dts/qcom/qcom-msm8926-motorola-peregrine.dts
+++ b/arch/arm/boot/dts/qcom/qcom-msm8926-motorola-peregrine.dts
@@ -125,6 +125,10 @@ reg_lcd_neg: outn {
sensor@48 {
compatible = "ti,tmp108";
reg = <0x48>;
+   interrupts-extended = < 13 IRQ_TYPE_LEVEL_LOW>;
+   pinctrl-0 = <_alert_default>;
+   pinctrl-names = "default";
+   #thermal-sensor-cells = <0>;
};
 };
 
@@ -361,6 +365,14 @@ reg_lcd_default: reg-lcd-default-state {
bias-disable;
output-high;
};
+
+   temp_alert_default: temp-alert-default-state {
+   pins = "gpio13";
+   function = "gpio";
+   drive-strength = <2>;
+   bias-disable;
+   output-disable;
+   };
 };
 
  {

-- 
2.45.2





[PATCH v11 17/18] Documentation: probes: Update fprobe on function-graph tracer

2024-06-16 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Update fprobe documentation for the new fprobe on function-graph
tracer. This includes some bahvior changes and pt_regs to
ftrace_regs interface change.

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Update @fregs parameter explanation.
---
 Documentation/trace/fprobe.rst |   42 ++--
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst
index 196f52386aaa..f58bdc64504f 100644
--- a/Documentation/trace/fprobe.rst
+++ b/Documentation/trace/fprobe.rst
@@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe
 Introduction
 
 
-Fprobe is a function entry/exit probe mechanism based on ftrace.
-Instead of using ftrace full feature, if you only want to attach callbacks
-on function entry and exit, similar to the kprobes and kretprobes, you can
+Fprobe is a function entry/exit probe mechanism based on the function-graph
+tracer.
+Instead of tracing all functions, if you want to attach callbacks on specific
+function entry and exit, similar to the kprobes and kretprobes, you can
 use fprobe. Compared with kprobes and kretprobes, fprobe gives faster
 instrumentation for multiple functions with single handler. This document
 describes how to use fprobe.
@@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as 
follows:
 
 .. code-block:: c
 
- int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
- void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
-Note that the @entry_ip is saved at function entry and passed to exit handler.
-If the entry callback function returns !0, the corresponding exit callback 
will be cancelled.
+Note that the @entry_ip is saved at function entry and passed to exit
+handler.
+If the entry callback function returns !0, the corresponding exit callback
+will be cancelled.
 
 @fp
 This is the address of `fprobe` data structure related to this handler.
@@ -112,12 +115,10 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 This is the return address that the traced function will return to,
 somewhere in the caller. This can be used at both entry and exit.
 
-@regs
-This is the `pt_regs` data structure at the entry and exit. Note that
-the instruction pointer of @regs may be different from the @entry_ip
-in the entry_handler. If you need traced instruction pointer, you need
-to use @entry_ip. On the other hand, in the exit_handler, the 
instruction
-pointer of @regs is set to the current return address.
+@fregs
+This is the `ftrace_regs` data structure at the entry and exit. This
+includes the function parameters, or the return values. So user can
+access thos values via appropriate `ftrace_regs_*` APIs.
 
 @entry_data
 This is a local storage to share the data between entry and exit 
handlers.
@@ -125,6 +126,17 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 and `entry_data_size` field when registering the fprobe, the storage is
 allocated and passed to both `entry_handler` and `exit_handler`.
 
+Entry data size and exit handlers on the same function
+==
+
+Since the entry data is passed via per-task stack and it is has limited size,
+the entry data size per probe is limited to `15 * sizeof(long)`. You also need
+to take care that the different fprobes are probing on the same function, this
+limit becomes smaller. The entry data size is aligned to `sizeof(long)` and
+each fprobe which has exit handler uses a `sizeof(long)` space on the stack,
+you should keep the number of fprobes on the same function as small as
+possible.
+
 Share the callbacks with kprobes
 
 
@@ -165,8 +177,8 @@ This counter counts up when;
  - fprobe fails to take ftrace_recursion lock. This usually means that a 
function
which is traced by other ftrace users is called from the entry_handler.
 
- - fprobe fails to setup the function exit because of the shortage of rethook
-   (the shadow stack for hooking the function return.)
+ - fprobe fails to setup the function exit because of failing to allocate the
+   data buffer from the per-task shadow stack.
 
 The `fprobe::nmissed` field counts up in both cases. Therefore, the former
 skips both of entry and exit callback and the latter skips the exit




[PATCH 6/8] riscv: ftrace: do not use stop_machine to update code

2024-06-13 Thread Andy Chiu
Now it is safe to remove dependency from stop_machine() for us to patch
code in ftrace.

Signed-off-by: Andy Chiu 
---
 arch/riscv/kernel/ftrace.c | 53 --
 1 file changed, 4 insertions(+), 49 deletions(-)

diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index f3b09f2d3ecc..9a421e151b1d 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -13,23 +13,13 @@
 #include 
 
 #ifdef CONFIG_DYNAMIC_FTRACE
-void ftrace_arch_code_modify_prepare(void) __acquires(_mutex)
+void arch_ftrace_update_code(int command)
 {
mutex_lock(_mutex);
-
-   /*
-* The code sequences we use for ftrace can't be patched while the
-* kernel is running, so we need to use stop_machine() to modify them
-* for now.  This doesn't play nice with text_mutex, we use this flag
-* to elide the check.
-*/
-   riscv_patch_in_stop_machine = true;
-}
-
-void ftrace_arch_code_modify_post_process(void) __releases(_mutex)
-{
-   riscv_patch_in_stop_machine = false;
+   command |= FTRACE_MAY_SLEEP;
+   ftrace_modify_all_code(command);
mutex_unlock(_mutex);
+   flush_icache_all();
 }
 
 static int ftrace_check_current_call(unsigned long hook_pos,
@@ -158,41 +148,6 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
return __ftrace_modify_call_site(_call_dest, func, true);
 }
 
-struct ftrace_modify_param {
-   int command;
-   atomic_t cpu_count;
-};
-
-static int __ftrace_modify_code(void *data)
-{
-   struct ftrace_modify_param *param = data;
-
-   if (atomic_inc_return(>cpu_count) == num_online_cpus()) {
-   ftrace_modify_all_code(param->command);
-   /*
-* Make sure the patching store is effective *before* we
-* increment the counter which releases all waiting CPUs
-* by using the release variant of atomic increment. The
-* release pairs with the call to local_flush_icache_all()
-* on the waiting CPU.
-*/
-   atomic_inc_return_release(>cpu_count);
-   } else {
-   while (atomic_read(>cpu_count) <= num_online_cpus())
-   cpu_relax();
-   }
-
-   local_flush_icache_all();
-
-   return 0;
-}
-
-void arch_ftrace_update_code(int command)
-{
-   struct ftrace_modify_param param = { command, ATOMIC_INIT(0) };
-
-   stop_machine(__ftrace_modify_code, , cpu_online_mask);
-}
 #endif
 
 #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS

-- 
2.43.0




[PATCH v6 12/13] tracing: Update function tracing output for previous boot buffer

2024-06-12 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

For a persistent ring buffer that is saved across boots, if function
tracing was performed in the previous boot, it only saves the address of
the functions and uses "%pS" to print their names. But the current boot,
those functions may be in different locations. The persistent meta-data
saves the text delta between the two boots and can be used to find the
address of the saved function of where it is located in the current boot.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_output.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..b9d2c64c0648 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator 
*iter, int flags,
 }
 
 static void print_fn_trace(struct trace_seq *s, unsigned long ip,
-  unsigned long parent_ip, int flags)
+  unsigned long parent_ip, long delta, int flags)
 {
+   ip += delta;
+   parent_ip += delta;
+
seq_print_ip_sym(s, ip, flags);
 
if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) {
@@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct 
trace_iterator *iter, int flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_putc(s, '\n');
 
return trace_handle_return(s);
@@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int 
flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_printf(s, " (repeats: %u, last_ts:", field->count);
trace_print_time(s, iter,
 iter->ts - FUNC_REPEATS_GET_DELTA_TS(field));
-- 
2.43.0





[PATCH v5 12/13] tracing: Update function tracing output for previous boot buffer

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

For a persistent ring buffer that is saved across boots, if function
tracing was performed in the previous boot, it only saves the address of
the functions and uses "%pS" to print their names. But the current boot,
those functions may be in different locations. The persistent meta-data
saves the text delta between the two boots and can be used to find the
address of the saved function of where it is located in the current boot.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_output.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..b9d2c64c0648 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator 
*iter, int flags,
 }
 
 static void print_fn_trace(struct trace_seq *s, unsigned long ip,
-  unsigned long parent_ip, int flags)
+  unsigned long parent_ip, long delta, int flags)
 {
+   ip += delta;
+   parent_ip += delta;
+
seq_print_ip_sym(s, ip, flags);
 
if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) {
@@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct 
trace_iterator *iter, int flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_putc(s, '\n');
 
return trace_handle_return(s);
@@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int 
flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_printf(s, " (repeats: %u, last_ts:", field->count);
trace_print_time(s, iter,
 iter->ts - FUNC_REPEATS_GET_DELTA_TS(field));
-- 
2.43.0





[PATCH v4 12/13] tracing: Update function tracing output for previous boot buffer

2024-06-11 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

For a persistent ring buffer that is saved across boots, if function
tracing was performed in the previous boot, it only saves the address of
the functions and uses "%pS" to print their names. But the current boot,
those functions may be in different locations. The persistent meta-data
saves the text delta between the two boots and can be used to find the
address of the saved function of where it is located in the current boot.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_output.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..b9d2c64c0648 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator 
*iter, int flags,
 }
 
 static void print_fn_trace(struct trace_seq *s, unsigned long ip,
-  unsigned long parent_ip, int flags)
+  unsigned long parent_ip, long delta, int flags)
 {
+   ip += delta;
+   parent_ip += delta;
+
seq_print_ip_sym(s, ip, flags);
 
if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) {
@@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct 
trace_iterator *iter, int flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_putc(s, '\n');
 
return trace_handle_return(s);
@@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int 
flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_printf(s, " (repeats: %u, last_ts:", field->count);
trace_print_time(s, iter,
 iter->ts - FUNC_REPEATS_GET_DELTA_TS(field));
-- 
2.43.0





[PATCH v3 12/13] tracing: Update function tracing output for previous boot buffer

2024-06-06 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

For a persistent ring buffer that is saved across boots, if function
tracing was performed in the previous boot, it only saves the address of
the functions and uses "%pS" to print their names. But the current boot,
those functions may be in different locations. The persistent meta-data
saves the text delta between the two boots and can be used to find the
address of the saved function of where it is located in the current boot.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_output.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..b9d2c64c0648 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator 
*iter, int flags,
 }
 
 static void print_fn_trace(struct trace_seq *s, unsigned long ip,
-  unsigned long parent_ip, int flags)
+  unsigned long parent_ip, long delta, int flags)
 {
+   ip += delta;
+   parent_ip += delta;
+
seq_print_ip_sym(s, ip, flags);
 
if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) {
@@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct 
trace_iterator *iter, int flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_putc(s, '\n');
 
return trace_handle_return(s);
@@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int 
flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_printf(s, " (repeats: %u, last_ts:", field->count);
trace_print_time(s, iter,
 iter->ts - FUNC_REPEATS_GET_DELTA_TS(field));
-- 
2.43.0





[PATCH 6/6] function_graph: Do not update pid func if CONFIG_DYNAMIC_FTRACE not enabled

2024-06-05 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The ftrace subops is only defined if CONFIG_DYNAMIC_FTRACE is enabled. If
it is not, function tracing is extremely limited, and the subops in the
ftrace_ops structure is not defined (and will fail to compile). If
DYNAMIC_FTRACE is not enabled, then function graph filtering will not
work (as it shouldn't).

Fixes: df3ec5da6a1e7 ("function_graph: Add pid tracing back to function graph 
tracer")
Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202406051855.9viyxbtb-...@intel.com/
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/fgraph.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 63d828054c79..c0e428c87ea5 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -1177,6 +1177,7 @@ void fgraph_update_pid_func(void)
if (!(graph_ops.flags & FTRACE_OPS_FL_INITIALIZED))
return;
 
+#ifdef CONFIG_DYNAMIC_FTRACE
list_for_each_entry(op, _ops.subop_list, list) {
if (op->flags & FTRACE_OPS_FL_PID) {
gops = container_of(op, struct fgraph_ops, ops);
@@ -1186,6 +1187,7 @@ void fgraph_update_pid_func(void)
static_call_update(fgraph_func, 
gops->entryfunc);
}
}
+#endif
 }
 
 /* Allocate a return stack for each task */
-- 
2.43.0





Re: [PATCH] rv: Update rv_en(dis)able_monitor doc to match kernel-doc

2024-05-28 Thread Daniel Bristot de Oliveira
On 5/20/24 07:42, Yang Li wrote:
> The patch updates the function documentation comment for
> rv_en(dis)able_monitor to adhere to the kernel-doc specification.
> 
> Signed-off-by: Yang Li 

Acked-by: Daniel Bristot de Oliveira 

Thanks
-- Daniel



[PATCH v2 3/4] eventfs: Update all the eventfs_inodes from the events descriptor

2024-05-22 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The change to update the permissions of the eventfs_inode had the
misconception that using the tracefs_inode would find all the
eventfs_inodes that have been updated and reset them on remount.
The problem with this approach is that the eventfs_inodes are freed when
they are no longer used (basically the reason the eventfs system exists).
When they are freed, the updated eventfs_inodes are not reset on a remount
because their tracefs_inodes have been freed.

Instead, since the events directory eventfs_inode always has a
tracefs_inode pointing to it (it is not freed when finished), and the
events directory has a link to all its children, have the
eventfs_remount() function only operate on the events eventfs_inode and
have it descend into its children updating their uid and gids.

Link: 
https://lore.kernel.org/all/cak7lnarxgaww3kh9jgrnh4vk6fr8ldknkf3wq8nhmwjrvwj...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Fixes: baa23a8d4360d ("tracefs: Reset permissions on remount if permissions are 
options")
Reported-by: Masahiro Yamada 
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 44 
 1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 5dfb1ccd56ea..129d0f54ba62 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -305,27 +305,27 @@ static const struct file_operations 
eventfs_file_operations = {
.llseek = generic_file_llseek,
 };
 
-/*
- * On a remount of tracefs, if UID or GID options are set, then
- * the mount point inode permissions should be used.
- * Reset the saved permission flags appropriately.
- */
-void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool 
update_gid)
+static void eventfs_set_attrs(struct eventfs_inode *ei, bool update_uid, 
kuid_t uid,
+ bool update_gid, kgid_t gid, int level)
 {
-   struct eventfs_inode *ei = ti->private;
+   struct eventfs_inode *ei_child;
 
-   if (!ei)
+   /* Update events// */
+   if (WARN_ON_ONCE(level > 3))
return;
 
if (update_uid) {
ei->attr.mode &= ~EVENTFS_SAVE_UID;
-   ei->attr.uid = ti->vfs_inode.i_uid;
+   ei->attr.uid = uid;
}
 
-
if (update_gid) {
ei->attr.mode &= ~EVENTFS_SAVE_GID;
-   ei->attr.gid = ti->vfs_inode.i_gid;
+   ei->attr.gid = gid;
+   }
+
+   list_for_each_entry(ei_child, >children, list) {
+   eventfs_set_attrs(ei_child, update_uid, uid, update_gid, gid, 
level + 1);
}
 
if (!ei->entry_attrs)
@@ -334,13 +334,31 @@ void eventfs_remount(struct tracefs_inode *ti, bool 
update_uid, bool update_gid)
for (int i = 0; i < ei->nr_entries; i++) {
if (update_uid) {
ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID;
-   ei->entry_attrs[i].uid = ti->vfs_inode.i_uid;
+   ei->entry_attrs[i].uid = uid;
}
if (update_gid) {
ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID;
-   ei->entry_attrs[i].gid = ti->vfs_inode.i_gid;
+   ei->entry_attrs[i].gid = gid;
}
}
+
+}
+
+/*
+ * On a remount of tracefs, if UID or GID options are set, then
+ * the mount point inode permissions should be used.
+ * Reset the saved permission flags appropriately.
+ */
+void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool 
update_gid)
+{
+   struct eventfs_inode *ei = ti->private;
+
+   /* Only the events directory does the updates */
+   if (!ei || !ei->is_events || ei->is_freed)
+   return;
+
+   eventfs_set_attrs(ei, update_uid, ti->vfs_inode.i_uid,
+ update_gid, ti->vfs_inode.i_gid, 0);
 }
 
 /* Return the evenfs_inode of the "events" directory */
-- 
2.43.0





[PATCH v2 2/4] tracefs: Update inode permissions on remount

2024-05-22 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

When a remount happens, if a gid or uid is specified update the inodes to
have the same gid and uid. This will allow the simplification of the
permissions logic for the dynamically created files and directories.

Cc: sta...@vger.kernel.org
Fixes: baa23a8d4360d ("tracefs: Reset permissions on remount if permissions are 
options")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 17 +
 fs/tracefs/inode.c   | 15 ---
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 55a40a730b10..5dfb1ccd56ea 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -317,20 +317,29 @@ void eventfs_remount(struct tracefs_inode *ti, bool 
update_uid, bool update_gid)
if (!ei)
return;
 
-   if (update_uid)
+   if (update_uid) {
ei->attr.mode &= ~EVENTFS_SAVE_UID;
+   ei->attr.uid = ti->vfs_inode.i_uid;
+   }
+
 
-   if (update_gid)
+   if (update_gid) {
ei->attr.mode &= ~EVENTFS_SAVE_GID;
+   ei->attr.gid = ti->vfs_inode.i_gid;
+   }
 
if (!ei->entry_attrs)
return;
 
for (int i = 0; i < ei->nr_entries; i++) {
-   if (update_uid)
+   if (update_uid) {
ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID;
-   if (update_gid)
+   ei->entry_attrs[i].uid = ti->vfs_inode.i_uid;
+   }
+   if (update_gid) {
ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID;
+   ei->entry_attrs[i].gid = ti->vfs_inode.i_gid;
+   }
}
 }
 
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index a827f6a716c4..9252e0d78ea2 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -373,12 +373,21 @@ static int tracefs_apply_options(struct super_block *sb, 
bool remount)
 
rcu_read_lock();
list_for_each_entry_rcu(ti, _inodes, list) {
-   if (update_uid)
+   if (update_uid) {
ti->flags &= ~TRACEFS_UID_PERM_SET;
+   ti->vfs_inode.i_uid = fsi->uid;
+   }
 
-   if (update_gid)
+   if (update_gid) {
ti->flags &= ~TRACEFS_GID_PERM_SET;
-
+   ti->vfs_inode.i_gid = fsi->gid;
+   }
+
+   /*
+* Note, the above ti->vfs_inode updates are
+* used in eventfs_remount() so they must come
+* before calling it.
+*/
if (ti->flags & TRACEFS_EVENT_INODE)
eventfs_remount(ti, update_uid, update_gid);
}
-- 
2.43.0





[PATCH] rv: Update rv_en(dis)able_monitor doc to match kernel-doc

2024-05-19 Thread Yang Li
The patch updates the function documentation comment for
rv_en(dis)able_monitor to adhere to the kernel-doc specification.

Signed-off-by: Yang Li 
---
 kernel/trace/rv/rv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
index 2f68e93fff0b..df0745a42a3f 100644
--- a/kernel/trace/rv/rv.c
+++ b/kernel/trace/rv/rv.c
@@ -245,6 +245,7 @@ static int __rv_disable_monitor(struct rv_monitor_def 
*mdef, bool sync)
 
 /**
  * rv_disable_monitor - disable a given runtime monitor
+ * @mdef: Pointer to the monitor definition structure.
  *
  * Returns 0 on success.
  */
@@ -256,6 +257,7 @@ int rv_disable_monitor(struct rv_monitor_def *mdef)
 
 /**
  * rv_enable_monitor - enable a given runtime monitor
+ * @mdef: Pointer to the monitor definition structure.
  *
  * Returns 0 on success, error otherwise.
  */
-- 
2.20.1.7.g153144c




Re: [PATCH -next] rv: Update rv_en(dis)able_monitor doc to match kernel-doc

2024-05-17 Thread Daniel Bristot de Oliveira
Hi Yang

On 5/17/24 11:14, Yang Li wrote:
> The patch updates the function documentation comment for
> rv_en(dis)able_monitor to adhere to the kernel-doc specification.
> 
> Signed-off-by: Yang Li 
> ---
>  kernel/trace/rv/rv.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
> index 2f68e93fff0b..df0745a42a3f 100644
> --- a/kernel/trace/rv/rv.c
> +++ b/kernel/trace/rv/rv.c
> @@ -245,6 +245,7 @@ static int __rv_disable_monitor(struct rv_monitor_def 
> *mdef, bool sync)
>  
>  /**
>   * rv_disable_monitor - disable a given runtime monitor
> + * @mdef: Pointer to the monitor definition structure.

This change is in for mainline kernel, why are you using the -next on the 
Subject?

-- Daniel



[PATCH -next] rv: Update rv_en(dis)able_monitor doc to match kernel-doc

2024-05-17 Thread Yang Li
The patch updates the function documentation comment for
rv_en(dis)able_monitor to adhere to the kernel-doc specification.

Signed-off-by: Yang Li 
---
 kernel/trace/rv/rv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/rv/rv.c b/kernel/trace/rv/rv.c
index 2f68e93fff0b..df0745a42a3f 100644
--- a/kernel/trace/rv/rv.c
+++ b/kernel/trace/rv/rv.c
@@ -245,6 +245,7 @@ static int __rv_disable_monitor(struct rv_monitor_def 
*mdef, bool sync)
 
 /**
  * rv_disable_monitor - disable a given runtime monitor
+ * @mdef: Pointer to the monitor definition structure.
  *
  * Returns 0 on success.
  */
@@ -256,6 +257,7 @@ int rv_disable_monitor(struct rv_monitor_def *mdef)
 
 /**
  * rv_enable_monitor - enable a given runtime monitor
+ * @mdef: Pointer to the monitor definition structure.
  *
  * Returns 0 on success, error otherwise.
  */
-- 
2.20.1.7.g153144c




[PATCH v2 5/6] documentation: Update on livepatch elf format

2024-05-16 Thread Lukas Hruska
Add a section to Documentation/livepatch/module-elf-format.rst
describing how klp-convert works for fixing relocations.

Signed-off-by: Lukas Hruska 
Reviewed-by: Petr Mladek 
Reviewed-by: Marcos Paulo de Souza 
---
 Documentation/livepatch/module-elf-format.rst | 67 +++
 1 file changed, 67 insertions(+)

diff --git a/Documentation/livepatch/module-elf-format.rst 
b/Documentation/livepatch/module-elf-format.rst
index a03ed02ec57e..2aa9b11cd806 100644
--- a/Documentation/livepatch/module-elf-format.rst
+++ b/Documentation/livepatch/module-elf-format.rst
@@ -300,3 +300,70 @@ symbol table, and relocation section indices, ELF 
information is preserved for
 livepatch modules and is made accessible by the module loader through
 module->klp_info, which is a :c:type:`klp_modinfo` struct. When a livepatch 
module
 loads, this struct is filled in by the module loader.
+
+6. klp-convert tool
+===
+The livepatch relocation sections might be created using
+scripts/livepatch/klp-convert. It is called automatically during
+the build as part of a module post processing.
+
+The tool is not able to find the symbols and all the metadata
+automatically. Instead, all needed information must already be
+part of rela entry for the given symbol. Such a rela can
+be created easily by using KLP_RELOC_SYMBOL() macro after
+the symbol declaration.
+
+KLP_RELOC_SYMBOL causes that the relocation entries for
+the given symbol will be created in the following format::
+
+  .klp.sym.rela.lp_object.sym_object.sym_name,sympos
+  ^   ^ ^   ^ ^^ ^  ^   ^
+  |___| |___| || |__|   |
+   [A] [B][C]   [D][E]
+
+[A]
+  The symbol name is prefixed with the string ".klp.sym.rela."
+
+[B]
+  The name of the object (i.e. "vmlinux" or name of module) which
+  is livepatched.
+
+[C]
+  The name of the object (i.e. "vmlinux" or name of module) to
+  which the symbol belongs follows immediately after the prefix.
+
+[D]
+  The actual name of the symbol.
+
+[E]
+  The position of the symbol in the object (as according to kallsyms)
+  This is used to differentiate duplicate symbols within the same
+  object. The symbol position is expressed numerically (0, 1, 2...).
+  The symbol position of a unique symbol is 0.
+
+Example:
+
+**Livepatch source code:**
+
+::
+
+  extern char *saved_command_line \
+ KLP_RELOC_SYMBOL(vmlinux, vmlinux, saved_command_line, 0);
+
+**`readelf -r -W` output of compiled module:**
+
+::
+
+  Relocation section '.rela.text' at offset 0x32e60 contains 10 entries:
+  Offset Info Type   Symbol's Value  
Symbol's Name + Addend
+  ...
+  0068  003c0002 R_X86_64_PC32   
.klp.sym.rela.vmlinux.vmlinux.saved_command_line,0 - 4
+  ...
+
+**`readelf -r -W` output of transformed module by klp-convert:**
+
+::
+
+  Relocation section '.klp.rela.vmlinux.text' at offset 0x5cb60 contains 1 
entry:
+  Offset Info Type   Symbol's Value  
Symbol's Name + Addend
+  0068  003c0002 R_X86_64_PC32  
 .klp.sym.vmlinux.saved_command_line,0 - 4
-- 
2.45.0




[PATCH v10 35/36] Documentation: probes: Update fprobe on function-graph tracer

2024-05-07 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Update fprobe documentation for the new fprobe on function-graph
tracer. This includes some bahvior changes and pt_regs to
ftrace_regs interface change.

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Update @fregs parameter explanation.
---
 Documentation/trace/fprobe.rst |   42 ++--
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst
index 196f52386aaa..f58bdc64504f 100644
--- a/Documentation/trace/fprobe.rst
+++ b/Documentation/trace/fprobe.rst
@@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe
 Introduction
 
 
-Fprobe is a function entry/exit probe mechanism based on ftrace.
-Instead of using ftrace full feature, if you only want to attach callbacks
-on function entry and exit, similar to the kprobes and kretprobes, you can
+Fprobe is a function entry/exit probe mechanism based on the function-graph
+tracer.
+Instead of tracing all functions, if you want to attach callbacks on specific
+function entry and exit, similar to the kprobes and kretprobes, you can
 use fprobe. Compared with kprobes and kretprobes, fprobe gives faster
 instrumentation for multiple functions with single handler. This document
 describes how to use fprobe.
@@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as 
follows:
 
 .. code-block:: c
 
- int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
- void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
-Note that the @entry_ip is saved at function entry and passed to exit handler.
-If the entry callback function returns !0, the corresponding exit callback 
will be cancelled.
+Note that the @entry_ip is saved at function entry and passed to exit
+handler.
+If the entry callback function returns !0, the corresponding exit callback
+will be cancelled.
 
 @fp
 This is the address of `fprobe` data structure related to this handler.
@@ -112,12 +115,10 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 This is the return address that the traced function will return to,
 somewhere in the caller. This can be used at both entry and exit.
 
-@regs
-This is the `pt_regs` data structure at the entry and exit. Note that
-the instruction pointer of @regs may be different from the @entry_ip
-in the entry_handler. If you need traced instruction pointer, you need
-to use @entry_ip. On the other hand, in the exit_handler, the 
instruction
-pointer of @regs is set to the current return address.
+@fregs
+This is the `ftrace_regs` data structure at the entry and exit. This
+includes the function parameters, or the return values. So user can
+access thos values via appropriate `ftrace_regs_*` APIs.
 
 @entry_data
 This is a local storage to share the data between entry and exit 
handlers.
@@ -125,6 +126,17 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 and `entry_data_size` field when registering the fprobe, the storage is
 allocated and passed to both `entry_handler` and `exit_handler`.
 
+Entry data size and exit handlers on the same function
+==
+
+Since the entry data is passed via per-task stack and it is has limited size,
+the entry data size per probe is limited to `15 * sizeof(long)`. You also need
+to take care that the different fprobes are probing on the same function, this
+limit becomes smaller. The entry data size is aligned to `sizeof(long)` and
+each fprobe which has exit handler uses a `sizeof(long)` space on the stack,
+you should keep the number of fprobes on the same function as small as
+possible.
+
 Share the callbacks with kprobes
 
 
@@ -165,8 +177,8 @@ This counter counts up when;
  - fprobe fails to take ftrace_recursion lock. This usually means that a 
function
which is traced by other ftrace users is called from the entry_handler.
 
- - fprobe fails to setup the function exit because of the shortage of rethook
-   (the shadow stack for hooking the function return.)
+ - fprobe fails to setup the function exit because of failing to allocate the
+   data buffer from the per-task shadow stack.
 
 The `fprobe::nmissed` field counts up in both cases. Therefore, the former
 skips both of entry and exit callback and the latter skips the exit




[PATCH v9 35/36] Documentation: probes: Update fprobe on function-graph tracer

2024-04-15 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Update fprobe documentation for the new fprobe on function-graph
tracer. This includes some bahvior changes and pt_regs to
ftrace_regs interface change.

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Update @fregs parameter explanation.
---
 Documentation/trace/fprobe.rst |   42 ++--
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst
index 196f52386aaa..f58bdc64504f 100644
--- a/Documentation/trace/fprobe.rst
+++ b/Documentation/trace/fprobe.rst
@@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe
 Introduction
 
 
-Fprobe is a function entry/exit probe mechanism based on ftrace.
-Instead of using ftrace full feature, if you only want to attach callbacks
-on function entry and exit, similar to the kprobes and kretprobes, you can
+Fprobe is a function entry/exit probe mechanism based on the function-graph
+tracer.
+Instead of tracing all functions, if you want to attach callbacks on specific
+function entry and exit, similar to the kprobes and kretprobes, you can
 use fprobe. Compared with kprobes and kretprobes, fprobe gives faster
 instrumentation for multiple functions with single handler. This document
 describes how to use fprobe.
@@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as 
follows:
 
 .. code-block:: c
 
- int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
- void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
-Note that the @entry_ip is saved at function entry and passed to exit handler.
-If the entry callback function returns !0, the corresponding exit callback 
will be cancelled.
+Note that the @entry_ip is saved at function entry and passed to exit
+handler.
+If the entry callback function returns !0, the corresponding exit callback
+will be cancelled.
 
 @fp
 This is the address of `fprobe` data structure related to this handler.
@@ -112,12 +115,10 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 This is the return address that the traced function will return to,
 somewhere in the caller. This can be used at both entry and exit.
 
-@regs
-This is the `pt_regs` data structure at the entry and exit. Note that
-the instruction pointer of @regs may be different from the @entry_ip
-in the entry_handler. If you need traced instruction pointer, you need
-to use @entry_ip. On the other hand, in the exit_handler, the 
instruction
-pointer of @regs is set to the current return address.
+@fregs
+This is the `ftrace_regs` data structure at the entry and exit. This
+includes the function parameters, or the return values. So user can
+access thos values via appropriate `ftrace_regs_*` APIs.
 
 @entry_data
 This is a local storage to share the data between entry and exit 
handlers.
@@ -125,6 +126,17 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 and `entry_data_size` field when registering the fprobe, the storage is
 allocated and passed to both `entry_handler` and `exit_handler`.
 
+Entry data size and exit handlers on the same function
+==
+
+Since the entry data is passed via per-task stack and it is has limited size,
+the entry data size per probe is limited to `15 * sizeof(long)`. You also need
+to take care that the different fprobes are probing on the same function, this
+limit becomes smaller. The entry data size is aligned to `sizeof(long)` and
+each fprobe which has exit handler uses a `sizeof(long)` space on the stack,
+you should keep the number of fprobes on the same function as small as
+possible.
+
 Share the callbacks with kprobes
 
 
@@ -165,8 +177,8 @@ This counter counts up when;
  - fprobe fails to take ftrace_recursion lock. This usually means that a 
function
which is traced by other ftrace users is called from the entry_handler.
 
- - fprobe fails to setup the function exit because of the shortage of rethook
-   (the shadow stack for hooking the function return.)
+ - fprobe fails to setup the function exit because of failing to allocate the
+   data buffer from the per-task shadow stack.
 
 The `fprobe::nmissed` field counts up in both cases. Therefore, the former
 skips both of entry and exit callback and the latter skips the exit




[PATCH v2 11/11] tracing: Update function tracing output for previous boot buffer

2024-04-10 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

For a persistent ring buffer that is saved across boots, if function
tracing was performed in the previous boot, it only saves the address of
the functions and uses "%pS" to print their names. But the current boot,
those functions may be in different locations. The persistent meta-data
saves the text delta between the two boots and can be used to find the
address of the saved function of where it is located in the current boot.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace_output.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..b9d2c64c0648 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator 
*iter, int flags,
 }
 
 static void print_fn_trace(struct trace_seq *s, unsigned long ip,
-  unsigned long parent_ip, int flags)
+  unsigned long parent_ip, long delta, int flags)
 {
+   ip += delta;
+   parent_ip += delta;
+
seq_print_ip_sym(s, ip, flags);
 
if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) {
@@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct 
trace_iterator *iter, int flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_putc(s, '\n');
 
return trace_handle_return(s);
@@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int 
flags,
 
trace_assign_type(field, iter->ent);
 
-   print_fn_trace(s, field->ip, field->parent_ip, flags);
+   print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, 
flags);
trace_seq_printf(s, " (repeats: %u, last_ts:", field->count);
trace_print_time(s, iter,
 iter->ts - FUNC_REPEATS_GET_DELTA_TS(field));
-- 
2.43.0





Re: [PATCH v2 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name

2024-04-10 Thread Krzysztof Kozlowski
On 09/04/2024 20:36, Luca Weiss wrote:
> Follow the gpio-hog bindings and use otg-hog as node name.
> 
> Signed-off-by: Luca Weiss 
> ---
>  arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Krzysztof Kozlowski 

Best regards,
Krzysztof




Re: [PATCH] ring-buffer: Only update pages_touched when a new page is touched

2024-04-09 Thread Steven Rostedt
On Wed, 10 Apr 2024 08:44:00 +0900
Masami Hiramatsu (Google)  wrote:

> Looks good to me.
> 
> Acked-by: Masami Hiramatsu (Google) 

Thanks.

> 
> BTW, isn't this a real bugfix, because the page_touched can be
> bigger than nr_pages without this fix?

Yes, I simply forgot to add the Cc stable.

-- Steve



Re: [PATCH] ring-buffer: Only update pages_touched when a new page is touched

2024-04-09 Thread Google
On Tue, 9 Apr 2024 15:13:09 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> The "buffer_percent" logic that is used by the ring buffer splice code to
> only wake up the tasks when there's no data after the buffer is filled to
> the percentage of the "buffer_percent" file is dependent on three
> variables that determine the amount of data that is in the ring buffer:
> 
>  1) pages_read - incremented whenever a new sub-buffer is consumed
>  2) pages_lost - incremented every time a writer overwrites a sub-buffer
>  3) pages_touched - incremented when a write goes to a new sub-buffer
> 
> The percentage is the calculation of:
> 
>   (pages_touched - (pages_lost + pages_read)) / nr_pages
> 
> Basically, the amount of data is the total number of sub-bufs that have been
> touched, minus the number of sub-bufs lost and sub-bufs consumed. This is
> divided by the total count to give the buffer percentage. When the
> percentage is greater than the value in the "buffer_percent" file, it
> wakes up splice readers waiting for that amount.
> 
> It was observed that over time, the amount read from the splice was
> constantly decreasing the longer the trace was running. That is, if one
> asked for 60%, it would read over 60% when it first starts tracing, but
> then it would be woken up at under 60% and would slowly decrease the
> amount of data read after being woken up, where the amount becomes much
> less than the buffer percent.
> 
> This was due to an accounting of the pages_touched incrementation. This
> value is incremented whenever a writer transfers to a new sub-buffer. But
> the place where it was incremented was incorrect. If a writer overflowed
> the current sub-buffer it would go to the next one. If it gets preempted
> by an interrupt at that time, and the interrupt performs a trace, it too
> will end up going to the next sub-buffer. But only one should increment
> the counter. Unfortunately, that was not the case.
> 
> Change the cmpxchg() that does the real switch of the tail-page into a
> try_cmpxchg(), and on success, perform the increment of pages_touched. This
> will only increment the counter once for when the writer moves to a new
> sub-buffer, and not when there's a race and is incremented for when a
> writer and its preempting writer both move to the same new sub-buffer.
> 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) 

BTW, isn't this a real bugfix, because the page_touched can be
bigger than nr_pages without this fix?

Thank you,

> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/ring_buffer.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 25476ead681b..6511dc3a00da 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1393,7 +1393,6 @@ static void rb_tail_page_update(struct 
> ring_buffer_per_cpu *cpu_buffer,
>   old_write = local_add_return(RB_WRITE_INTCNT, _page->write);
>   old_entries = local_add_return(RB_WRITE_INTCNT, _page->entries);
>  
> - local_inc(_buffer->pages_touched);
>   /*
>* Just make sure we have seen our old_write and synchronize
>* with any interrupts that come in.
> @@ -1430,8 +1429,9 @@ static void rb_tail_page_update(struct 
> ring_buffer_per_cpu *cpu_buffer,
>    */
>   local_set(_page->page->commit, 0);
>  
> - /* Again, either we update tail_page or an interrupt does */
> - (void)cmpxchg(_buffer->tail_page, tail_page, next_page);
> + /* Either we update tail_page or an interrupt does */
> + if (try_cmpxchg(_buffer->tail_page, _page, next_page))
> + local_inc(_buffer->pages_touched);
>   }
>  }
>  
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



[PATCH] ring-buffer: Only update pages_touched when a new page is touched

2024-04-09 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The "buffer_percent" logic that is used by the ring buffer splice code to
only wake up the tasks when there's no data after the buffer is filled to
the percentage of the "buffer_percent" file is dependent on three
variables that determine the amount of data that is in the ring buffer:

 1) pages_read - incremented whenever a new sub-buffer is consumed
 2) pages_lost - incremented every time a writer overwrites a sub-buffer
 3) pages_touched - incremented when a write goes to a new sub-buffer

The percentage is the calculation of:

  (pages_touched - (pages_lost + pages_read)) / nr_pages

Basically, the amount of data is the total number of sub-bufs that have been
touched, minus the number of sub-bufs lost and sub-bufs consumed. This is
divided by the total count to give the buffer percentage. When the
percentage is greater than the value in the "buffer_percent" file, it
wakes up splice readers waiting for that amount.

It was observed that over time, the amount read from the splice was
constantly decreasing the longer the trace was running. That is, if one
asked for 60%, it would read over 60% when it first starts tracing, but
then it would be woken up at under 60% and would slowly decrease the
amount of data read after being woken up, where the amount becomes much
less than the buffer percent.

This was due to an accounting of the pages_touched incrementation. This
value is incremented whenever a writer transfers to a new sub-buffer. But
the place where it was incremented was incorrect. If a writer overflowed
the current sub-buffer it would go to the next one. If it gets preempted
by an interrupt at that time, and the interrupt performs a trace, it too
will end up going to the next sub-buffer. But only one should increment
the counter. Unfortunately, that was not the case.

Change the cmpxchg() that does the real switch of the tail-page into a
try_cmpxchg(), and on success, perform the increment of pages_touched. This
will only increment the counter once for when the writer moves to a new
sub-buffer, and not when there's a race and is incremented for when a
writer and its preempting writer both move to the same new sub-buffer.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 25476ead681b..6511dc3a00da 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1393,7 +1393,6 @@ static void rb_tail_page_update(struct 
ring_buffer_per_cpu *cpu_buffer,
old_write = local_add_return(RB_WRITE_INTCNT, _page->write);
old_entries = local_add_return(RB_WRITE_INTCNT, _page->entries);
 
-   local_inc(_buffer->pages_touched);
/*
 * Just make sure we have seen our old_write and synchronize
 * with any interrupts that come in.
@@ -1430,8 +1429,9 @@ static void rb_tail_page_update(struct 
ring_buffer_per_cpu *cpu_buffer,
 */
local_set(_page->page->commit, 0);
 
-   /* Again, either we update tail_page or an interrupt does */
-   (void)cmpxchg(_buffer->tail_page, tail_page, next_page);
+   /* Either we update tail_page or an interrupt does */
+   if (try_cmpxchg(_buffer->tail_page, _page, next_page))
+   local_inc(_buffer->pages_touched);
}
 }
 
-- 
2.43.0




Re: [PATCH v2 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name

2024-04-09 Thread Dmitry Baryshkov
On Tue, 9 Apr 2024 at 21:37, Luca Weiss  wrote:
>
> Follow the gpio-hog bindings and use otg-hog as node name.
>
> Signed-off-by: Luca Weiss 
> ---
>  arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>

Reviewed-by: Dmitry Baryshkov 

-- 
With best wishes
Dmitry



[PATCH v2 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name

2024-04-09 Thread Luca Weiss
Follow the gpio-hog bindings and use otg-hog as node name.

Signed-off-by: Luca Weiss 
---
 arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts 
b/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts
index 4aaae8537a3f..06549051be50 100644
--- a/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts
+++ b/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts
@@ -328,7 +328,7 @@ wlan_regulator_pin: wl-reg-active-state {
power-source = ;
};
 
-   otg {
+   otg-hog {
gpio-hog;
gpios = <35 GPIO_ACTIVE_HIGH>;
output-high;

-- 
2.44.0




[PATCH 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name

2024-04-08 Thread Luca Weiss
Follow the gpio-hog bindings and use otg-hog as node name.

Signed-off-by: Luca Weiss 
---
 arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts 
b/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts
index 4aaae8537a3f..06549051be50 100644
--- a/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts
+++ b/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts
@@ -328,7 +328,7 @@ wlan_regulator_pin: wl-reg-active-state {
power-source = ;
};
 
-   otg {
+   otg-hog {
gpio-hog;
gpios = <35 GPIO_ACTIVE_HIGH>;
output-high;

-- 
2.44.0




[PATCH v6 0/2] Update mce_record tracepoint

2024-04-01 Thread Avadhut Naik
This patchset updates the mce_record tracepoint so that the recently added
fields of struct mce are exported through it to userspace.

The first patch adds PPIN (Protected Processor Inventory Number) field to
the tracepoint.

The second patch adds the microcode field (Microcode Revision) to the
tracepoint.

Changes in v2:
 - Export microcode field (Microcode Revision) through the tracepoiont in
   addition to PPIN.

Changes in v3:
 - Change format specifier for microcode revision from %u to %x
 - Fix tab alignments
 - Add Reviewed-by: Sohil Mehta 

Changes in v4:
 - Update commit messages to reflect the reason for the fields being
   added to the tracepoint.
 - Add comment to explicitly state the type of information that should
   be added to the tracepoint.
 - Add Reviewed-by: Steven Rostedt (Google) 

Changes in v5:
 - Changed "MICROCODE REVISION" to just "MICROCODE".
 - Changed words which are not acronyms from ALL CAPS to no caps.
 - Added Reviewed-by: Tony Luck 

Changes in v6:
 - Rebased on top of Ingo's changes to the MCE tracepoint

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/include/trace/events/mce.h?id=ac5e80e94f5c67d7053f50fc3faddab931707f0f

[NOTE:
 - Since changes in this version are very minor, have retained the below
   tags received for previous versions:
Reviewed-by: Sohil Mehta 
Reviewed-by: Steven Rostedt (Google) 
Reviewed-by: Tony Luck ]

Avadhut Naik (2):
  tracing: Include PPIN in mce_record tracepoint
  tracing: Include Microcode Revision in mce_record tracepoint

 include/trace/events/mce.h | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)


base-commit: 65d1240b6728b38e4d2068d6738a17e4ee4351f5
-- 
2.34.1




[RESEND v5 0/2] Update mce_record tracepoint

2024-03-28 Thread Avadhut Naik
This patchset updates the mce_record tracepoint so that the recently added
fields of struct mce are exported through it to userspace.

The first patch adds PPIN (Protected Processor Inventory Number) field to
the tracepoint.

The second patch adds the microcode field (Microcode Revision) to the
tracepoint.

Changes in v2:
 - Export microcode field (Microcode Revision) through the tracepoiont in
   addition to PPIN.

Changes in v3:
 - Change format specifier for microcode revision from %u to %x
 - Fix tab alignments
 - Add Reviewed-by: Sohil Mehta 

Changes in v4:
 - Update commit messages to reflect the reason for the fields being
   added to the tracepoint.
 - Add comment to explicitly state the type of information that should
   be added to the tracepoint.
 - Add Reviewed-by: Steven Rostedt (Google) 

Changes in v5:
 - Changed "MICROCODE REVISION" to just "MICROCODE".
 - Changed words which are not acronyms from ALL CAPS to no caps.
 - Added Reviewed-by: Tony Luck 

[NOTE:

 - Since only caps of words which are not acronyms have been changed in
   this version and the word "REVISION" has been removed i.e. changes are
   very minor, have retained the the below tags received for previous
   versions:
Reviewed-by: Sohil Mehta 
Reviewed-by: Steven Rostedt (Google) 
Reviewed-by: Tony Luck 

 - Missed adding linux-kernel@vger.kernel.org with v5. Hence, resending.]

Avadhut Naik (2):
  tracing: Include PPIN in mce_record tracepoint
  tracing: Include Microcode Revision in mce_record tracepoint

 include/trace/events/mce.h | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)


base-commit: 9699768066a9b12db247d5ee9276cf298b67528d
-- 
2.34.1




Re: [PATCH v5 0/2] Update mce_record tracepoint

2024-03-28 Thread Naik, Avadhut



On 3/28/2024 13:14, Sohil Mehta wrote:
> On 3/28/2024 11:04 AM, Avadhut Naik wrote:
> 
>>  - Since only caps of words which are not acronyms have been changed in
>>this version and the word "REVISION" has been removed i.e. changes are
>>very minor, have retained the the below tags received for previous
>>versions:
>> Reviewed-by: Sohil Mehta 
> 
>> Avadhut Naik (2):
>>   tracing: Include PPIN in mce_record tracepoint
>>   tracing: Include Microcode Revision in mce_record tracepoint
> 
> 
> The patches look good to me.

Kindly ignore this set.

As pointed out by Sohit, have missed adding linux-ker...@vger.kernel.org

Will resend the set.

-- 
Thanks,
Avadhut Naik



Re: [PATCH v5 0/2] Update mce_record tracepoint

2024-03-28 Thread Sohil Mehta
On 3/28/2024 11:04 AM, Avadhut Naik wrote:

>  - Since only caps of words which are not acronyms have been changed in
>this version and the word "REVISION" has been removed i.e. changes are
>very minor, have retained the the below tags received for previous
>versions:
> Reviewed-by: Sohil Mehta 

> Avadhut Naik (2):
>   tracing: Include PPIN in mce_record tracepoint
>   tracing: Include Microcode Revision in mce_record tracepoint


The patches look good to me.



[PATCH v5 0/2] Update mce_record tracepoint

2024-03-28 Thread Avadhut Naik
This patchset updates the mce_record tracepoint so that the recently added
fields of struct mce are exported through it to userspace.

The first patch adds PPIN (Protected Processor Inventory Number) field to
the tracepoint.

The second patch adds the microcode field (Microcode Revision) to the
tracepoint.

Changes in v2:
 - Export microcode field (Microcode Revision) through the tracepoiont in
   addition to PPIN.

Changes in v3:
 - Change format specifier for microcode revision from %u to %x
 - Fix tab alignments
 - Add Reviewed-by: Sohil Mehta 

Changes in v4:
 - Update commit messages to reflect the reason for the fields being
   added to the tracepoint.
 - Add comment to explicitly state the type of information that should
   be added to the tracepoint.
 - Add Reviewed-by: Steven Rostedt (Google) 

Changes in v5:
 - Changed "MICROCODE REVISION" to just "MICROCODE".
 - Changed words which are not acronyms from ALL CAPS to no caps.
 - Added Reviewed-by: Tony Luck 

[NOTE:

 - Since only caps of words which are not acronyms have been changed in
   this version and the word "REVISION" has been removed i.e. changes are
   very minor, have retained the the below tags received for previous
   versions:
Reviewed-by: Sohil Mehta 
Reviewed-by: Steven Rostedt (Google) 
Reviewed-by: Tony Luck ]

Avadhut Naik (2):
  tracing: Include PPIN in mce_record tracepoint
  tracing: Include Microcode Revision in mce_record tracepoint

 include/trace/events/mce.h | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)


base-commit: 9699768066a9b12db247d5ee9276cf298b67528d
-- 
2.34.1




[PATCH v4 0/2] Update mce_record tracepoint

2024-03-27 Thread Avadhut Naik
This patchset updates the mce_record tracepoint so that the recently added
fields of struct mce are exported through it to userspace.

The first patch adds PPIN (Protected Processor Inventory Number) field to
the tracepoint.

The second patch adds the microcode field (Microcode Revision) to the
tracepoint.

Changes in v2:
 - Export microcode field (Microcode Revision) through the tracepoiont in
   addition to PPIN.

Changes in v3:
 - Change format specifier for microcode revision from %u to %x
 - Fix tab alignments
 - Add Reviewed-by: Sohil Mehta 

Changes in v4:
 - Update commit messages to reflect the reason for the fields being
   added to the tracepoint.
 - Add comment to explicitly state the type of information that should
   be added to the tracepoint.
 - Add Reviewed-by: Steven Rostedt (Google) 

[NOTE:

 - Since only a comment has been added and only the commit messages have
   been reworked i.e. no code changes have been undertaken for this
   version, have the retained the below tags from v3:
Reviewed-by: Sohil Mehta 
Reviewed-by: Steven Rostedt (Google) ]

Avadhut Naik (2):
  tracing: Include PPIN in mce_record tracepoint
  tracing: Include Microcode Revision in mce_record tracepoint

 include/trace/events/mce.h | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)


base-commit: 818ea9b4c8237f96ac99dc0b2f02dd6d3f2adb97
-- 
2.34.1




Re: [PATCH v3 0/2] Update mce_record tracepoint

2024-03-25 Thread Naik, Avadhut



On 3/25/2024 15:31, Borislav Petkov wrote:
> On Mon, Mar 25, 2024 at 03:12:14PM -0500, Naik, Avadhut wrote:
>> Can this patchset be merged in? Or would you prefer me sending out
>> another revision with Steven's "Reviewed-by:" tag?
> 
> First of all, please do not top-post.
>
Apologies for that!
 
> Then, you were on Cc on the previous thread. Please summarize from it
> and put in the commit message *why* it is good to have each field added.
> 
> And then, above the tracepoint, I'd like you to add a rule which
> states what information can and should be added to the tracepoint. And
> no, "just because" is not good enough. The previous thread has hints.
> 

Thanks for the clarification! Will update accordingly.

> Thx.
> 

-- 
Thanks,
Avadhut Naik



Re: [PATCH v3 0/2] Update mce_record tracepoint

2024-03-25 Thread Borislav Petkov
On Mon, Mar 25, 2024 at 03:12:14PM -0500, Naik, Avadhut wrote:
> Can this patchset be merged in? Or would you prefer me sending out
> another revision with Steven's "Reviewed-by:" tag?

First of all, please do not top-post.

Then, you were on Cc on the previous thread. Please summarize from it
and put in the commit message *why* it is good to have each field added.

And then, above the tracepoint, I'd like you to add a rule which
states what information can and should be added to the tracepoint. And
no, "just because" is not good enough. The previous thread has hints.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



[PATCH v3 0/2] Update mce_record tracepoint

2024-03-25 Thread Naik, Avadhut
Hi Boris,

Can this patchset be merged in? Or would you prefer me sending out another
revision with Steven's "Reviewed-by:" tag?

On 2/8/2024 11:10, Steven Rostedt wrote:
> On Fri, 26 Jan 2024 01:57:58 -0600
> Avadhut Naik  wrote:
> 
>> This patchset updates the mce_record tracepoint so that the recently added
>> fields of struct mce are exported through it to userspace.
>>
>> The first patch adds PPIN (Protected Processor Inventory Number) field to
>> the tracepoint.
>>
>> The second patch adds the microcode field (Microcode Revision) to the
>> tracepoint.
> 
> From a tracing POV only:
> 
> Reviewed-by: Steven Rostedt (Google) 
> 
> -- Steve

-- 
Thanks,
Avadhut Naik



[PATCH v8 35/35] Documentation: probes: Update fprobe on function-graph tracer

2024-02-25 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Update fprobe documentation for the new fprobe on function-graph
tracer. This includes some bahvior changes and pt_regs to
ftrace_regs interface change.

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Update @fregs parameter explanation.
---
 Documentation/trace/fprobe.rst |   42 ++--
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst
index 196f52386aaa..f58bdc64504f 100644
--- a/Documentation/trace/fprobe.rst
+++ b/Documentation/trace/fprobe.rst
@@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe
 Introduction
 
 
-Fprobe is a function entry/exit probe mechanism based on ftrace.
-Instead of using ftrace full feature, if you only want to attach callbacks
-on function entry and exit, similar to the kprobes and kretprobes, you can
+Fprobe is a function entry/exit probe mechanism based on the function-graph
+tracer.
+Instead of tracing all functions, if you want to attach callbacks on specific
+function entry and exit, similar to the kprobes and kretprobes, you can
 use fprobe. Compared with kprobes and kretprobes, fprobe gives faster
 instrumentation for multiple functions with single handler. This document
 describes how to use fprobe.
@@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as 
follows:
 
 .. code-block:: c
 
- int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
- void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
-Note that the @entry_ip is saved at function entry and passed to exit handler.
-If the entry callback function returns !0, the corresponding exit callback 
will be cancelled.
+Note that the @entry_ip is saved at function entry and passed to exit
+handler.
+If the entry callback function returns !0, the corresponding exit callback
+will be cancelled.
 
 @fp
 This is the address of `fprobe` data structure related to this handler.
@@ -112,12 +115,10 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 This is the return address that the traced function will return to,
 somewhere in the caller. This can be used at both entry and exit.
 
-@regs
-This is the `pt_regs` data structure at the entry and exit. Note that
-the instruction pointer of @regs may be different from the @entry_ip
-in the entry_handler. If you need traced instruction pointer, you need
-to use @entry_ip. On the other hand, in the exit_handler, the 
instruction
-pointer of @regs is set to the current return address.
+@fregs
+This is the `ftrace_regs` data structure at the entry and exit. This
+includes the function parameters, or the return values. So user can
+access thos values via appropriate `ftrace_regs_*` APIs.
 
 @entry_data
 This is a local storage to share the data between entry and exit 
handlers.
@@ -125,6 +126,17 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 and `entry_data_size` field when registering the fprobe, the storage is
 allocated and passed to both `entry_handler` and `exit_handler`.
 
+Entry data size and exit handlers on the same function
+==
+
+Since the entry data is passed via per-task stack and it is has limited size,
+the entry data size per probe is limited to `15 * sizeof(long)`. You also need
+to take care that the different fprobes are probing on the same function, this
+limit becomes smaller. The entry data size is aligned to `sizeof(long)` and
+each fprobe which has exit handler uses a `sizeof(long)` space on the stack,
+you should keep the number of fprobes on the same function as small as
+possible.
+
 Share the callbacks with kprobes
 
 
@@ -165,8 +177,8 @@ This counter counts up when;
  - fprobe fails to take ftrace_recursion lock. This usually means that a 
function
which is traced by other ftrace users is called from the entry_handler.
 
- - fprobe fails to setup the function exit because of the shortage of rethook
-   (the shadow stack for hooking the function return.)
+ - fprobe fails to setup the function exit because of failing to allocate the
+   data buffer from the per-task shadow stack.
 
 The `fprobe::nmissed` field counts up in both cases. Therefore, the former
 skips both of entry and exit callback and the latter skips the exit




Re: [PATCH v2] mm: Update mark_victim tracepoints fields

2024-02-23 Thread Carlos Galo
On Thu, Feb 22, 2024 at 9:59 AM Carlos Galo  wrote:
>
> On Thu, Feb 22, 2024 at 6:16 AM Michal Hocko  wrote:
> >
> > On Wed 21-02-24 13:30:51, Carlos Galo wrote:
> > > On Tue, Feb 20, 2024 at 11:55 PM Michal Hocko  wrote:
> > > >
> > > > Hi,
> > > > sorry I have missed this before.
> > > >
> > > > On Thu 11-01-24 21:05:30, Carlos Galo wrote:
> > > > > The current implementation of the mark_victim tracepoint provides only
> > > > > the process ID (pid) of the victim process. This limitation poses
> > > > > challenges for userspace tools that need additional information
> > > > > about the OOM victim. The association between pid and the additional
> > > > > data may be lost after the kill, making it difficult for userspace to
> > > > > correlate the OOM event with the specific process.
> > > >
> > > > You are correct that post OOM all per-process information is lost. On
> > > > the other hand we do dump all this information to the kernel log. Could
> > > > you explain why that is not suitable for your purpose?
> > >
> > > Userspace tools often need real-time visibility into OOM situations
> > > for userspace intervention. Our use case involves utilizing BPF
> > > programs, along with BPF ring buffers, to provide OOM notification to
> > > userspace. Parsing kernel logs would be significant overhead as
> > > opposed to the event based BPF approach.
> >
> > Please put that into the changelog.
>
> Will do.
>
> > > > > In order to mitigate this limitation, add the following fields:
> > > > >
> > > > > - UID
> > > > >In Android each installed application has a unique UID. Including
> > > > >the `uid` assists in correlating OOM events with specific apps.
> > > > >
> > > > > - Process Name (comm)
> > > > >Enables identification of the affected process.
> > > > >
> > > > > - OOM Score
> > > > >Allows userspace to get additional insights of the relative kill
> > > > >priority of the OOM victim.
> > > >
> > > > What is the oom score useful for?
> > > >
> > > The OOM score provides us a measure of the victim's importance. On the
> > > android side, it allows us to identify if top or foreground apps are
> > > killed, which have user perceptible impact.
> >
> > But the value on its own (wihtout knowing scores of other tasks) doesn't
> > really tell you anything, does it?
>
> Android uses the OOM adj_score ranges  to categorize app state
> (foreground, background, ...). I'll resend a v3 with the commit text
> updated to include details about this.
>
> > > > Is there any reason to provide a different information from the one
> > > > reported to the kernel log?
> > > > __oom_kill_process:
> > > > pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
> > > > file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB 
> > > > oom_score_adj:%hd\n",
> > > > message, task_pid_nr(victim), victim->comm, 
> > > > K(mm->total_vm),
> > > > K(get_mm_counter(mm, MM_ANONPAGES)),
> > > > K(get_mm_counter(mm, MM_FILEPAGES)),
> > > > K(get_mm_counter(mm, MM_SHMEMPAGES)),
> > > > from_kuid(_user_ns, task_uid(victim)),
> > > > mm_pgtables_bytes(mm) >> 10, 
> > > > victim->signal->oom_score_adj);
> > > >
> > >
> > > We added these fields we need (UID, process name, and OOM score), but
> > > we're open to adding the others if you prefer that for consistency
> > > with the kernel log.
> >
> > yes, I think the consistency would be better here. For one it reports
> > numbers which can tell quite a lot about the killed victim. It is a
> > superset of what you already asking for. With a notable exception of the
> > oom_score which is really dubious without a wider context.
>
> Sounds good, I'll resend a v3 that includes these fields as well.
>
> Thanks,
> Carlos
>

I posted V3 here:
https://lore.kernel.org/all/20240223173258.174828-1-carlosg...@google.com/

Thanks,
Carlos
> > --
> > Michal Hocko
> > SUSE Labs



[PATCH v3] mm: Update mark_victim tracepoints fields

2024-02-23 Thread Carlos Galo
The current implementation of the mark_victim tracepoint provides only
the process ID (pid) of the victim process. This limitation poses
challenges for userspace tools requiring real-time OOM analysis and
intervention. Although this information is available from the kernel
logs, it’s not the appropriate format to provide OOM notifications. In
Android, BPF programs are used with the mark_victim trace events to
notify userspace of an OOM kill. For consistency, update the trace
event to include the same information about the OOMed victim as the
kernel logs.

- UID
   In Android each installed application has a unique UID. Including
   the `uid` assists in correlating OOM events with specific apps.

- Process Name (comm)
   Enables identification of the affected process.

- OOM Score
  Will allow userspace to get additional insight of the relative kill
  priority of the OOM victim. In Android, the oom_score_adj is used to
  categorize app state (foreground, background, etc.), which aids in
  analyzing user-perceptible impacts of OOM events [1].

- Total VM, RSS Stats, and pgtables
  Amount of memory used by the victim that will, potentially, be freed up
  by killing it.

[1] 
https://cs.android.com/android/platform/superproject/main/+/246dc8fc95b6d93afcba5c6d6c133307abb3ac2e:frameworks/base/services/core/java/com/android/server/am/ProcessList.java;l=188-283

Cc: Steven Rostedt 
Cc: Andrew Morton 
Cc: Suren Baghdasaryan 
Cc: Michal Hocko 
Reviewed-by: Steven Rostedt 
Signed-off-by: Carlos Galo 
---

v3:
- Added total_vm, rss fields, and pgtables  per Michal Hocko.
- Added Steven Rostedt reviewed by
- Updated commit description to include android usecase

v2: Fixed build error. Added missing comma when printing `__entry->uid`.

 include/trace/events/oom.h | 36 
 mm/oom_kill.c  |  6 +-
 2 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h
index 26a11e4a2c36..b799f3bcba82 100644
--- a/include/trace/events/oom.h
+++ b/include/trace/events/oom.h
@@ -7,6 +7,8 @@
 #include 
 #include 
 
+#define PG_COUNT_TO_KB(x) ((x) << (PAGE_SHIFT - 10))
+
 TRACE_EVENT(oom_score_adj_update,
 
TP_PROTO(struct task_struct *task),
@@ -72,19 +74,45 @@ TRACE_EVENT(reclaim_retry_zone,
 );
 
 TRACE_EVENT(mark_victim,
-   TP_PROTO(int pid),
+   TP_PROTO(struct task_struct *task, uid_t uid),
 
-   TP_ARGS(pid),
+   TP_ARGS(task, uid),
 
TP_STRUCT__entry(
__field(int, pid)
+   __string(comm, task->comm)
+   __field(unsigned long, total_vm)
+   __field(unsigned long, anon_rss)
+   __field(unsigned long, file_rss)
+   __field(unsigned long, shmem_rss)
+   __field(uid_t, uid)
+   __field(unsigned long, pgtables)
+   __field(short, oom_score_adj)
),
 
TP_fast_assign(
-   __entry->pid = pid;
+   __entry->pid = task->pid;
+   __assign_str(comm, task->comm);
+   __entry->total_vm = PG_COUNT_TO_KB(task->mm->total_vm);
+   __entry->anon_rss = PG_COUNT_TO_KB(get_mm_counter(task->mm, 
MM_ANONPAGES));
+   __entry->file_rss = PG_COUNT_TO_KB(get_mm_counter(task->mm, 
MM_FILEPAGES));
+   __entry->shmem_rss = PG_COUNT_TO_KB(get_mm_counter(task->mm, 
MM_SHMEMPAGES));
+   __entry->uid = uid;
+   __entry->pgtables = mm_pgtables_bytes(task->mm) >> 10;
+   __entry->oom_score_adj = task->signal->oom_score_adj;
),
 
-   TP_printk("pid=%d", __entry->pid)
+   TP_printk("pid=%d comm=%s total-vm=%lukB anon-rss=%lukB file-rss:%lukB 
shmem-rss:%lukB uid=%u pgtables=%lukB oom_score_adj=%hd",
+   __entry->pid,
+   __get_str(comm),
+   __entry->total_vm,
+   __entry->anon_rss,
+   __entry->file_rss,
+   __entry->shmem_rss,
+   __entry->uid,
+   __entry->pgtables,
+   __entry->oom_score_adj
+   )
 );
 
 TRACE_EVENT(wake_reaper,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 91ccd82097c2..8d6a207c3c59 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include "internal.h"
@@ -754,6 +755,7 @@ static inline void queue_oom_reaper(struct task_struct *tsk)
  */
 static void mark_oom_victim(struct task_struct *tsk)
 {
+   const struct cred *cred;
struct mm_struct *mm = tsk->mm;
 
WARN_ON(oom_killer_disabled);
@@ -773,7 +775,9 @@ static void mark_oom_victim(struct task_struct *tsk)
 */
__thaw_task(tsk);
atomic_inc(_victims);
-   trace_mark_victim(tsk->pid);
+   cred = get_task_cred(tsk);
+   trace_mark_victim(tsk, cred->uid.val);
+   put_cred(cred);
 }
 
 /**
-- 
2.44.0.rc0.258.g7320e95886-goog




Re: [PATCH v2] mm: Update mark_victim tracepoints fields

2024-02-22 Thread Carlos Galo
On Thu, Feb 22, 2024 at 6:16 AM Michal Hocko  wrote:
>
> On Wed 21-02-24 13:30:51, Carlos Galo wrote:
> > On Tue, Feb 20, 2024 at 11:55 PM Michal Hocko  wrote:
> > >
> > > Hi,
> > > sorry I have missed this before.
> > >
> > > On Thu 11-01-24 21:05:30, Carlos Galo wrote:
> > > > The current implementation of the mark_victim tracepoint provides only
> > > > the process ID (pid) of the victim process. This limitation poses
> > > > challenges for userspace tools that need additional information
> > > > about the OOM victim. The association between pid and the additional
> > > > data may be lost after the kill, making it difficult for userspace to
> > > > correlate the OOM event with the specific process.
> > >
> > > You are correct that post OOM all per-process information is lost. On
> > > the other hand we do dump all this information to the kernel log. Could
> > > you explain why that is not suitable for your purpose?
> >
> > Userspace tools often need real-time visibility into OOM situations
> > for userspace intervention. Our use case involves utilizing BPF
> > programs, along with BPF ring buffers, to provide OOM notification to
> > userspace. Parsing kernel logs would be significant overhead as
> > opposed to the event based BPF approach.
>
> Please put that into the changelog.

Will do.

> > > > In order to mitigate this limitation, add the following fields:
> > > >
> > > > - UID
> > > >In Android each installed application has a unique UID. Including
> > > >the `uid` assists in correlating OOM events with specific apps.
> > > >
> > > > - Process Name (comm)
> > > >Enables identification of the affected process.
> > > >
> > > > - OOM Score
> > > >Allows userspace to get additional insights of the relative kill
> > > >priority of the OOM victim.
> > >
> > > What is the oom score useful for?
> > >
> > The OOM score provides us a measure of the victim's importance. On the
> > android side, it allows us to identify if top or foreground apps are
> > killed, which have user perceptible impact.
>
> But the value on its own (wihtout knowing scores of other tasks) doesn't
> really tell you anything, does it?

Android uses the OOM adj_score ranges  to categorize app state
(foreground, background, ...). I'll resend a v3 with the commit text
updated to include details about this.

> > > Is there any reason to provide a different information from the one
> > > reported to the kernel log?
> > > __oom_kill_process:
> > > pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
> > > file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB 
> > > oom_score_adj:%hd\n",
> > > message, task_pid_nr(victim), victim->comm, 
> > > K(mm->total_vm),
> > > K(get_mm_counter(mm, MM_ANONPAGES)),
> > > K(get_mm_counter(mm, MM_FILEPAGES)),
> > > K(get_mm_counter(mm, MM_SHMEMPAGES)),
> > > from_kuid(_user_ns, task_uid(victim)),
> > > mm_pgtables_bytes(mm) >> 10, 
> > > victim->signal->oom_score_adj);
> > >
> >
> > We added these fields we need (UID, process name, and OOM score), but
> > we're open to adding the others if you prefer that for consistency
> > with the kernel log.
>
> yes, I think the consistency would be better here. For one it reports
> numbers which can tell quite a lot about the killed victim. It is a
> superset of what you already asking for. With a notable exception of the
> oom_score which is really dubious without a wider context.

Sounds good, I'll resend a v3 that includes these fields as well.

Thanks,
Carlos

> --
> Michal Hocko
> SUSE Labs



Re: [PATCH v2] mm: Update mark_victim tracepoints fields

2024-02-22 Thread Michal Hocko
On Wed 21-02-24 13:30:51, Carlos Galo wrote:
> On Tue, Feb 20, 2024 at 11:55 PM Michal Hocko  wrote:
> >
> > Hi,
> > sorry I have missed this before.
> >
> > On Thu 11-01-24 21:05:30, Carlos Galo wrote:
> > > The current implementation of the mark_victim tracepoint provides only
> > > the process ID (pid) of the victim process. This limitation poses
> > > challenges for userspace tools that need additional information
> > > about the OOM victim. The association between pid and the additional
> > > data may be lost after the kill, making it difficult for userspace to
> > > correlate the OOM event with the specific process.
> >
> > You are correct that post OOM all per-process information is lost. On
> > the other hand we do dump all this information to the kernel log. Could
> > you explain why that is not suitable for your purpose?
> 
> Userspace tools often need real-time visibility into OOM situations
> for userspace intervention. Our use case involves utilizing BPF
> programs, along with BPF ring buffers, to provide OOM notification to
> userspace. Parsing kernel logs would be significant overhead as
> opposed to the event based BPF approach.

Please put that into the changelog.

> > > In order to mitigate this limitation, add the following fields:
> > >
> > > - UID
> > >In Android each installed application has a unique UID. Including
> > >the `uid` assists in correlating OOM events with specific apps.
> > >
> > > - Process Name (comm)
> > >Enables identification of the affected process.
> > >
> > > - OOM Score
> > >Allows userspace to get additional insights of the relative kill
> > >priority of the OOM victim.
> >
> > What is the oom score useful for?
> >
> The OOM score provides us a measure of the victim's importance. On the
> android side, it allows us to identify if top or foreground apps are
> killed, which have user perceptible impact.

But the value on its own (wihtout knowing scores of other tasks) doesn't
really tell you anything, does it?
 
> > Is there any reason to provide a different information from the one
> > reported to the kernel log?
> > __oom_kill_process:
> > pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
> > file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
> > message, task_pid_nr(victim), victim->comm, K(mm->total_vm),
> > K(get_mm_counter(mm, MM_ANONPAGES)),
> > K(get_mm_counter(mm, MM_FILEPAGES)),
> > K(get_mm_counter(mm, MM_SHMEMPAGES)),
> > from_kuid(_user_ns, task_uid(victim)),
> > mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);
> >
> 
> We added these fields we need (UID, process name, and OOM score), but
> we're open to adding the others if you prefer that for consistency
> with the kernel log.

yes, I think the consistency would be better here. For one it reports
numbers which can tell quite a lot about the killed victim. It is a
superset of what you already asking for. With a notable exception of the
oom_score which is really dubious without a wider context.
-- 
Michal Hocko
SUSE Labs



Re: [PATCH v2] mm: Update mark_victim tracepoints fields

2024-02-21 Thread Carlos Galo
On Tue, Feb 20, 2024 at 11:55 PM Michal Hocko  wrote:
>
> Hi,
> sorry I have missed this before.
>
> On Thu 11-01-24 21:05:30, Carlos Galo wrote:
> > The current implementation of the mark_victim tracepoint provides only
> > the process ID (pid) of the victim process. This limitation poses
> > challenges for userspace tools that need additional information
> > about the OOM victim. The association between pid and the additional
> > data may be lost after the kill, making it difficult for userspace to
> > correlate the OOM event with the specific process.
>
> You are correct that post OOM all per-process information is lost. On
> the other hand we do dump all this information to the kernel log. Could
> you explain why that is not suitable for your purpose?

Userspace tools often need real-time visibility into OOM situations
for userspace intervention. Our use case involves utilizing BPF
programs, along with BPF ring buffers, to provide OOM notification to
userspace. Parsing kernel logs would be significant overhead as
opposed to the event based BPF approach.

> > In order to mitigate this limitation, add the following fields:
> >
> > - UID
> >In Android each installed application has a unique UID. Including
> >the `uid` assists in correlating OOM events with specific apps.
> >
> > - Process Name (comm)
> >Enables identification of the affected process.
> >
> > - OOM Score
> >Allows userspace to get additional insights of the relative kill
> >priority of the OOM victim.
>
> What is the oom score useful for?
>
The OOM score provides us a measure of the victim's importance. On the
android side, it allows us to identify if top or foreground apps are
killed, which have user perceptible impact.

> Is there any reason to provide a different information from the one
> reported to the kernel log?
> __oom_kill_process:
> pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
> file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
> message, task_pid_nr(victim), victim->comm, K(mm->total_vm),
> K(get_mm_counter(mm, MM_ANONPAGES)),
> K(get_mm_counter(mm, MM_FILEPAGES)),
> K(get_mm_counter(mm, MM_SHMEMPAGES)),
> from_kuid(_user_ns, task_uid(victim)),
> mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);
>

We added these fields we need (UID, process name, and OOM score), but
we're open to adding the others if you prefer that for consistency
with the kernel log.

Thanks,
Carlos

> --
> Michal Hocko
> SUSE Labs



Re: [PATCH v2] mm: Update mark_victim tracepoints fields

2024-02-20 Thread Michal Hocko
Hi,
sorry I have missed this before.

On Thu 11-01-24 21:05:30, Carlos Galo wrote:
> The current implementation of the mark_victim tracepoint provides only
> the process ID (pid) of the victim process. This limitation poses
> challenges for userspace tools that need additional information
> about the OOM victim. The association between pid and the additional
> data may be lost after the kill, making it difficult for userspace to
> correlate the OOM event with the specific process.

You are correct that post OOM all per-process information is lost. On
the other hand we do dump all this information to the kernel log. Could
you explain why that is not suitable for your purpose?

> In order to mitigate this limitation, add the following fields:
> 
> - UID
>In Android each installed application has a unique UID. Including
>the `uid` assists in correlating OOM events with specific apps.
> 
> - Process Name (comm)
>Enables identification of the affected process.
> 
> - OOM Score
>Allows userspace to get additional insights of the relative kill
>priority of the OOM victim.

What is the oom score useful for?

Is there any reason to provide a different information from the one
reported to the kernel log?
__oom_kill_process:
pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, 
file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
message, task_pid_nr(victim), victim->comm, K(mm->total_vm),
K(get_mm_counter(mm, MM_ANONPAGES)),
K(get_mm_counter(mm, MM_FILEPAGES)),
K(get_mm_counter(mm, MM_SHMEMPAGES)),
from_kuid(_user_ns, task_uid(victim)),
mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);

-- 
Michal Hocko
SUSE Labs



Re: [PATCH v3] mm: compaction: update the cc->nr_migratepages when allocating or freeing the freepages

2024-02-19 Thread Vlastimil Babka
On 2/20/24 07:16, Baolin Wang wrote:
> Currently we will use 'cc->nr_freepages >= cc->nr_migratepages' comparison
> to ensure that enough freepages are isolated in isolate_freepages(), however
> it just decreases the cc->nr_freepages without updating cc->nr_migratepages
> in compaction_alloc(), which will waste more CPU cycles and cause too many
> freepages to be isolated.
> 
> So we should also update the cc->nr_migratepages when allocating or freeing
> the freepages to avoid isolating excess freepages. And I can see fewer free
> pages are scanned and isolated when running thpcompact on my Arm64 server:
>k6.7 k6.7_patched
> Ops Compaction pages isolated  120692036.00   118160797.00
> Ops Compaction migrate scanned 131210329.00   154093268.00
> Ops Compaction free scanned   1090587971.00  1080632536.00
> Ops Compact scan efficiency   12.03  14.26
> 
> Moreover, I did not see an obvious latency improvements, this is likely 
> because
> isolating freepages is not the bottleneck in the thpcompact test case.
>   k6.7  k6.7_patched
> Amean fault-both-1  1089.76 (   0.00%) 1080.16 *   0.88%*
> Amean fault-both-3  1616.48 (   0.00%) 1636.65 *  -1.25%*
> Amean fault-both-5  2266.66 (   0.00%) 2219.20 *   2.09%*
> Amean fault-both-7  2909.84 (   0.00%) 2801.90 *   3.71%*
> Amean fault-both-12 4861.26 (   0.00%) 4733.25 *   2.63%*
> Amean fault-both-18 7351.11 (   0.00%) 6950.51 *   5.45%*
> Amean fault-both-24 9059.30 (   0.00%) 9159.99 *  -1.11%*
> Amean fault-both-3010685.68 (   0.00%)11399.02 *  -6.68%*
> 
> Signed-off-by: Baolin Wang 
> Acked-by: Mel Gorman 

Reviewed-by: Vlastimil Babka 

Thanks.

> ---
> Hi Andrew, please use this patch to replace below 2 old patches. Thanks.
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything-2024-02-20-04-19=cb30899d456d64fb83ee3e3d95cd9fbb18735d87
> https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything-2024-02-20-04-19=65713d1c4fc498d35dc1f7c1234ef815aa128429
> 
> Changes from v2:
>  - Add acked tag from Mel.
>  - Fix the mm_compaction_migratepages trace event.
> Changes from v1:
>  - Rebased on the latest mm-unstable branch.
> ---
>  include/trace/events/compaction.h |  6 +++---
>  mm/compaction.c   | 12 ++--
>  2 files changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/include/trace/events/compaction.h 
> b/include/trace/events/compaction.h
> index 2b2a975efd20..d05759d18538 100644
> --- a/include/trace/events/compaction.h
> +++ b/include/trace/events/compaction.h
> @@ -78,10 +78,10 @@ DEFINE_EVENT(mm_compaction_isolate_template, 
> mm_compaction_fast_isolate_freepage
>  #ifdef CONFIG_COMPACTION
>  TRACE_EVENT(mm_compaction_migratepages,
>  
> - TP_PROTO(struct compact_control *cc,
> + TP_PROTO(unsigned int nr_migratepages,
>   unsigned int nr_succeeded),
>  
> - TP_ARGS(cc, nr_succeeded),
> + TP_ARGS(nr_migratepages, nr_succeeded),
>  
>   TP_STRUCT__entry(
>   __field(unsigned long, nr_migrated)
> @@ -90,7 +90,7 @@ TRACE_EVENT(mm_compaction_migratepages,
>  
>   TP_fast_assign(
>   __entry->nr_migrated = nr_succeeded;
> - __entry->nr_failed = cc->nr_migratepages - nr_succeeded;
> + __entry->nr_failed = nr_migratepages - nr_succeeded;
>   ),
>  
>   TP_printk("nr_migrated=%lu nr_failed=%lu",
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4494b2914386..218089b29f13 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1798,6 +1798,7 @@ static struct folio *compaction_alloc(struct folio 
> *src, unsigned long data)
>   dst = list_entry(cc->freepages.next, struct folio, lru);
>   list_del(>lru);
>   cc->nr_freepages--;
> + cc->nr_migratepages--;
>  
>   return dst;
>  }
> @@ -1813,6 +1814,7 @@ static void compaction_free(struct folio *dst, unsigned 
> long data)
>  
>   list_add(>lru, >freepages);
>   cc->nr_freepages++;
> + cc->nr_migratepages++;
>  }
>  
>  /* possible outcome of isolate_migratepages */
> @@ -2435,7 +2437,7 @@ compact_zone(struct compact_control *cc, struct 
> capture_control *capc)
>   unsigned long last_migrated_pfn;
>   const bool sync = cc->mode != MIGRATE_ASYNC;
>   bool update_cached;
> - unsigned int nr_succeeded = 0;
> + unsigned int nr_succeeded = 0, nr_migratepages;
>  
>   /*
>* Thes

[PATCH v3] mm: compaction: update the cc->nr_migratepages when allocating or freeing the freepages

2024-02-19 Thread Baolin Wang
Currently we will use 'cc->nr_freepages >= cc->nr_migratepages' comparison
to ensure that enough freepages are isolated in isolate_freepages(), however
it just decreases the cc->nr_freepages without updating cc->nr_migratepages
in compaction_alloc(), which will waste more CPU cycles and cause too many
freepages to be isolated.

So we should also update the cc->nr_migratepages when allocating or freeing
the freepages to avoid isolating excess freepages. And I can see fewer free
pages are scanned and isolated when running thpcompact on my Arm64 server:
   k6.7 k6.7_patched
Ops Compaction pages isolated  120692036.00   118160797.00
Ops Compaction migrate scanned 131210329.00   154093268.00
Ops Compaction free scanned   1090587971.00  1080632536.00
Ops Compact scan efficiency   12.03  14.26

Moreover, I did not see an obvious latency improvements, this is likely because
isolating freepages is not the bottleneck in the thpcompact test case.
  k6.7  k6.7_patched
Amean fault-both-1  1089.76 (   0.00%) 1080.16 *   0.88%*
Amean fault-both-3  1616.48 (   0.00%) 1636.65 *  -1.25%*
Amean fault-both-5  2266.66 (   0.00%) 2219.20 *   2.09%*
Amean fault-both-7  2909.84 (   0.00%) 2801.90 *   3.71%*
Amean fault-both-12 4861.26 (   0.00%) 4733.25 *   2.63%*
Amean fault-both-18 7351.11 (   0.00%) 6950.51 *   5.45%*
Amean fault-both-24 9059.30 (   0.00%) 9159.99 *  -1.11%*
Amean fault-both-3010685.68 (   0.00%)11399.02 *  -6.68%*

Signed-off-by: Baolin Wang 
Acked-by: Mel Gorman 
---
Hi Andrew, please use this patch to replace below 2 old patches. Thanks.
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything-2024-02-20-04-19=cb30899d456d64fb83ee3e3d95cd9fbb18735d87
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything-2024-02-20-04-19=65713d1c4fc498d35dc1f7c1234ef815aa128429

Changes from v2:
 - Add acked tag from Mel.
 - Fix the mm_compaction_migratepages trace event.
Changes from v1:
 - Rebased on the latest mm-unstable branch.
---
 include/trace/events/compaction.h |  6 +++---
 mm/compaction.c   | 12 ++--
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/compaction.h 
b/include/trace/events/compaction.h
index 2b2a975efd20..d05759d18538 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -78,10 +78,10 @@ DEFINE_EVENT(mm_compaction_isolate_template, 
mm_compaction_fast_isolate_freepage
 #ifdef CONFIG_COMPACTION
 TRACE_EVENT(mm_compaction_migratepages,
 
-   TP_PROTO(struct compact_control *cc,
+   TP_PROTO(unsigned int nr_migratepages,
unsigned int nr_succeeded),
 
-   TP_ARGS(cc, nr_succeeded),
+   TP_ARGS(nr_migratepages, nr_succeeded),
 
TP_STRUCT__entry(
__field(unsigned long, nr_migrated)
@@ -90,7 +90,7 @@ TRACE_EVENT(mm_compaction_migratepages,
 
TP_fast_assign(
__entry->nr_migrated = nr_succeeded;
-   __entry->nr_failed = cc->nr_migratepages - nr_succeeded;
+   __entry->nr_failed = nr_migratepages - nr_succeeded;
),
 
TP_printk("nr_migrated=%lu nr_failed=%lu",
diff --git a/mm/compaction.c b/mm/compaction.c
index 4494b2914386..218089b29f13 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1798,6 +1798,7 @@ static struct folio *compaction_alloc(struct folio *src, 
unsigned long data)
dst = list_entry(cc->freepages.next, struct folio, lru);
list_del(>lru);
cc->nr_freepages--;
+   cc->nr_migratepages--;
 
return dst;
 }
@@ -1813,6 +1814,7 @@ static void compaction_free(struct folio *dst, unsigned 
long data)
 
list_add(>lru, >freepages);
cc->nr_freepages++;
+   cc->nr_migratepages++;
 }
 
 /* possible outcome of isolate_migratepages */
@@ -2435,7 +2437,7 @@ compact_zone(struct compact_control *cc, struct 
capture_control *capc)
unsigned long last_migrated_pfn;
const bool sync = cc->mode != MIGRATE_ASYNC;
bool update_cached;
-   unsigned int nr_succeeded = 0;
+   unsigned int nr_succeeded = 0, nr_migratepages;
 
/*
 * These counters track activities during zone compaction.  Initialize
@@ -2553,11 +2555,17 @@ compact_zone(struct compact_control *cc, struct 
capture_control *capc)
pageblock_start_pfn(cc->migrate_pfn - 1));
}
 
+   /*
+* Record the number of pages to migrate since the
+* compaction_alloc/free() will update cc->nr_migratepages
+* properly.
+*/
+   nr_migratepages = cc->nr_migra

Re: [PATCH v3 0/2] Update mce_record tracepoint

2024-02-08 Thread Steven Rostedt
On Fri, 26 Jan 2024 01:57:58 -0600
Avadhut Naik  wrote:

> This patchset updates the mce_record tracepoint so that the recently added
> fields of struct mce are exported through it to userspace.
> 
> The first patch adds PPIN (Protected Processor Inventory Number) field to
> the tracepoint.
> 
> The second patch adds the microcode field (Microcode Revision) to the
> tracepoint.

From a tracing POV only:

Reviewed-by: Steven Rostedt (Google) 

-- Steve



Re: [PATCH v2] mm: Update mark_victim tracepoints fields

2024-02-08 Thread Steven Rostedt
On Thu, 11 Jan 2024 21:05:30 +
Carlos Galo  wrote:

> The current implementation of the mark_victim tracepoint provides only
> the process ID (pid) of the victim process. This limitation poses
> challenges for userspace tools that need additional information
> about the OOM victim. The association between pid and the additional
> data may be lost after the kill, making it difficult for userspace to
> correlate the OOM event with the specific process.
> 
> In order to mitigate this limitation, add the following fields:
> 
> - UID
>In Android each installed application has a unique UID. Including
>the `uid` assists in correlating OOM events with specific apps.
> 
> - Process Name (comm)
>Enables identification of the affected process.
> 
> - OOM Score
>Allows userspace to get additional insights of the relative kill
>priority of the OOM victim.
> 

From a tracing POV this looks fine to me, but it's up to the mm
maintainers to decide to take it.

> Cc: Steven Rostedt 
> Cc: Andrew Morton 
> Cc: Suren Baghdasaryan 
> Signed-off-by: Carlos Galo 
> ---
> v2: Fixed build error. Added missing comma when printing `__entry->uid`.
> 
>  include/trace/events/oom.h | 19 +++
>  mm/oom_kill.c  |  6 +-
>  2 files changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h
> index 26a11e4a2c36..fb8a5d1b8a0a 100644
> --- a/include/trace/events/oom.h
> +++ b/include/trace/events/oom.h
> @@ -72,19 +72,30 @@ TRACE_EVENT(reclaim_retry_zone,
>  );
>  
>  TRACE_EVENT(mark_victim,
> - TP_PROTO(int pid),
> + TP_PROTO(struct task_struct *task, uid_t uid),
>  
> - TP_ARGS(pid),
> + TP_ARGS(task, uid),
>  
>   TP_STRUCT__entry(
>   __field(int, pid)
> + __field(uid_t, uid)
> + __string(comm, task->comm)
> + __field(short, oom_score_adj)
>   ),
>  
>   TP_fast_assign(
> - __entry->pid = pid;
> + __entry->pid = task->pid;
> + __entry->uid = uid;
> + __assign_str(comm, task->comm);
> + __entry->oom_score_adj = task->signal->oom_score_adj;
>   ),
>  
> - TP_printk("pid=%d", __entry->pid)
> + TP_printk("pid=%d uid=%u comm=%s oom_score_adj=%hd",
> + __entry->pid,
> + __entry->uid,
> + __get_str(comm),
> + __entry->oom_score_adj
> + )
>  );
>  
>  TRACE_EVENT(wake_reaper,
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9e6071fde34a..0698c00c5da6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include "internal.h"
> @@ -753,6 +754,7 @@ static inline void queue_oom_reaper(struct task_struct 
> *tsk)
>   */
>  static void mark_oom_victim(struct task_struct *tsk)
>  {
> + const struct cred *cred;
>   struct mm_struct *mm = tsk->mm;
>  
>   WARN_ON(oom_killer_disabled);
> @@ -772,7 +774,9 @@ static void mark_oom_victim(struct task_struct *tsk)
>*/
>   __thaw_task(tsk);
>   atomic_inc(_victims);
> - trace_mark_victim(tsk->pid);
> + cred = get_task_cred(tsk);

get_task_cred() isn't a trivial function and is used only for tracing.
But this isn't a fast path so I guess that is fine.

For tracing:

Reviewed-by: Steven Rostedt (Google) 

-- Steve


> + trace_mark_victim(tsk, cred->uid.val);
> + put_cred(cred);
>  }
>  
>  /**
> 
> base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a




[PATCH v7 36/36] Documentation: probes: Update fprobe on function-graph tracer

2024-02-06 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Update fprobe documentation for the new fprobe on function-graph
tracer. This includes some bahvior changes and pt_regs to
ftrace_regs interface change.

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v2:
  - Update @fregs parameter explanation.
---
 Documentation/trace/fprobe.rst |   42 ++--
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst
index 196f52386aaa..f58bdc64504f 100644
--- a/Documentation/trace/fprobe.rst
+++ b/Documentation/trace/fprobe.rst
@@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe
 Introduction
 
 
-Fprobe is a function entry/exit probe mechanism based on ftrace.
-Instead of using ftrace full feature, if you only want to attach callbacks
-on function entry and exit, similar to the kprobes and kretprobes, you can
+Fprobe is a function entry/exit probe mechanism based on the function-graph
+tracer.
+Instead of tracing all functions, if you want to attach callbacks on specific
+function entry and exit, similar to the kprobes and kretprobes, you can
 use fprobe. Compared with kprobes and kretprobes, fprobe gives faster
 instrumentation for multiple functions with single handler. This document
 describes how to use fprobe.
@@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as 
follows:
 
 .. code-block:: c
 
- int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
- void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct pt_regs *regs, void *entry_data);
+ void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long 
ret_ip, struct ftrace_regs *fregs, void *entry_data);
 
-Note that the @entry_ip is saved at function entry and passed to exit handler.
-If the entry callback function returns !0, the corresponding exit callback 
will be cancelled.
+Note that the @entry_ip is saved at function entry and passed to exit
+handler.
+If the entry callback function returns !0, the corresponding exit callback
+will be cancelled.
 
 @fp
 This is the address of `fprobe` data structure related to this handler.
@@ -112,12 +115,10 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 This is the return address that the traced function will return to,
 somewhere in the caller. This can be used at both entry and exit.
 
-@regs
-This is the `pt_regs` data structure at the entry and exit. Note that
-the instruction pointer of @regs may be different from the @entry_ip
-in the entry_handler. If you need traced instruction pointer, you need
-to use @entry_ip. On the other hand, in the exit_handler, the 
instruction
-pointer of @regs is set to the current return address.
+@fregs
+This is the `ftrace_regs` data structure at the entry and exit. This
+includes the function parameters, or the return values. So user can
+access thos values via appropriate `ftrace_regs_*` APIs.
 
 @entry_data
 This is a local storage to share the data between entry and exit 
handlers.
@@ -125,6 +126,17 @@ If the entry callback function returns !0, the 
corresponding exit callback will
 and `entry_data_size` field when registering the fprobe, the storage is
 allocated and passed to both `entry_handler` and `exit_handler`.
 
+Entry data size and exit handlers on the same function
+==
+
+Since the entry data is passed via per-task stack and it is has limited size,
+the entry data size per probe is limited to `15 * sizeof(long)`. You also need
+to take care that the different fprobes are probing on the same function, this
+limit becomes smaller. The entry data size is aligned to `sizeof(long)` and
+each fprobe which has exit handler uses a `sizeof(long)` space on the stack,
+you should keep the number of fprobes on the same function as small as
+possible.
+
 Share the callbacks with kprobes
 
 
@@ -165,8 +177,8 @@ This counter counts up when;
  - fprobe fails to take ftrace_recursion lock. This usually means that a 
function
which is traced by other ftrace users is called from the entry_handler.
 
- - fprobe fails to setup the function exit because of the shortage of rethook
-   (the shadow stack for hooking the function return.)
+ - fprobe fails to setup the function exit because of failing to allocate the
+   data buffer from the per-task shadow stack.
 
 The `fprobe::nmissed` field counts up in both cases. Therefore, the former
 skips both of entry and exit callback and the latter skips the exit




Re: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-27 Thread Borislav Petkov
On Fri, Jan 26, 2024 at 10:01:29PM +, Luck, Tony wrote:
> PPIN: Nice to have. People that send stuff to me are terrible about
> providing surrounding details. The record already includes
> CPUID(1).EAX ... so I can at least skip the step of asking them which
> CPU family/model/stepping they were using). But PPIN can be recovered
> (so long as the submitter kept good records about which system
> generated the record).

Yes.

> MICROCODE: Must have. Microcode version can be changed at run time.
> Going back to the system to check later may not give the correct
> answer to what was active at the time of the error. Especially for an
> error reported while a microcode update is waling across the CPUs
> poking the MSR on each in turn.

Easy:

- You've got an MCE? Was it during scheduled microcode updates?
- Yes.
- Come back to me when it happens again, *outside* of the microcode
  update schedule.

Anyway, I still don't buy that. Maybe I'm wrong and maybe I need to talk
to data center operators more but this sounds like microcode update
failing is such a common thing to happen so that we *absolutely* *must*
capture the microcode revision when an MCE happens.

Maybe we should make microcode updates more resilient and add a retry
mechanism which doesn't back off as easily.

Or maybe people should script around it and keep retrying, dunno.

In my world, microcode update just works in the vast majority of the
cases and if it doesn't, then those cases need a specific look.

And if I am debugging an issue and I want to see the microcode revision,
I look at /proc/cpuinfo.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



RE: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Luck, Tony
> But no, that's not the right question to ask.
>
> It is rather: which bits of information are very relevant to an error
> record and which are transient enough so that they cannot be gathered
> from a system by other means or only gathered in a difficult way, and
> should be part of that record.
>
> The PPIN is not transient but you have to go map ->extcpu to the PPIN so
> adding it to the tracepoint is purely a convenience thing. More or less.
>
> The microcode revision thing I still don't buy but it is already there
> so whateva...
>
> So we'd need a rule hammered out and put there in a prominent place so
> that it is clear what goes into struct mce and what not.

My personal evaluation of the value of these two additions to the trace record:

PPIN: Nice to have. People that send stuff to me are terrible about providing 
surrounding
details. The record already includes CPUID(1).EAX ... so I can at least skip 
the step of
asking them which CPU family/model/stepping they were using). But PPIN
can be recovered (so long as the submitter kept good records about which system
generated the record).

MICROCODE: Must have. Microcode version can be changed at run time. Going
back to the system to check later may not give the correct answer to what was 
active
at the time of the error. Especially for an error reported while a microcode 
update is
waling across the CPUs poking the MSR on each in turn.

-Tony


Re: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Borislav Petkov
On Fri, Jan 26, 2024 at 08:49:03PM +, Luck, Tony wrote:
> Every patch that adds new code or data structures adds to the kernel
> memory footprint. Each should be considered on its merits. The basic
> question being:
> 
>"Is the new functionality worth the cost?"
> 
> Where does it end? It would end if Linus declared:
> 
>   "Linux is now complete. Stop sending patches".
> 
> I.e. it is never going to end.

No, it's not that - it is the merit thing which determines.

> 1) PPIN
> Cost = 8 bytes.
> Benefit: Emdeds a system identifier into the trace record so there
> can be no ambiguity about which machine generated this error.
> Also definitively indicates which socket on a multi-socket system.
> 
> 2) MICROCODE
> Cost = 4 bytes
> Benefit: Certainty about the microcode version active on the core
> at the time the error was detected.
> 
> RAS = Reliability, Availability, Serviceability
> 
> These changes fall into the serviceability bucket. They make it
> easier to diagnose what went wrong.

So does dmesg. Let's add it to the tracepoint...

But no, that's not the right question to ask.

It is rather: which bits of information are very relevant to an error
record and which are transient enough so that they cannot be gathered
from a system by other means or only gathered in a difficult way, and
should be part of that record.

The PPIN is not transient but you have to go map ->extcpu to the PPIN so
adding it to the tracepoint is purely a convenience thing. More or less.

The microcode revision thing I still don't buy but it is already there
so whateva...

So we'd need a rule hammered out and put there in a prominent place so
that it is clear what goes into struct mce and what not.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



RE: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Luck, Tony
> > Is it so very different to add this to a trace record so that rasdaemon
> > can have feature parity with mcelog(8)?
>
> I knew you were gonna say that. When someone decides that it is
> a splendid idea to add more stuff to struct mce then said someone would
> want it in the tracepoint too.
>
> And then we're back to my original question:
>
> "And where does it end? Stick full dmesg in the tracepoint too?"
>
> Where do you draw the line in the sand and say, no more, especially
> static, fields bloating the trace record should be added and from then
> on, you should go collect the info from that box. Something which you're
> supposed to do anyway.

Every patch that adds new code or data structures adds to the kernel
memory footprint. Each should be considered on its merits. The basic
question being:

   "Is the new functionality worth the cost?"

Where does it end? It would end if Linus declared:

  "Linux is now complete. Stop sending patches".

I.e. it is never going to end.

If somebody posts a patch asking to add the full dmesg to a
tracepoint, I'll stand with you to say: "Not only no, but hell no".

So for Naik's two patches we have:

1) PPIN
Cost = 8 bytes.
Benefit: Emdeds a system identifier into the trace record so there
can be no ambiguity about which machine generated this error.
Also definitively indicates which socket on a multi-socket system.

2) MICROCODE
Cost = 4 bytes
Benefit: Certainty about the microcode version active on the core
at the time the error was detected.

RAS = Reliability, Availability, Serviceability

These changes fall into the serviceability bucket. They make it
easier to diagnose what went wrong.


-Tony


Re: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Borislav Petkov
On Fri, Jan 26, 2024 at 07:15:50PM +, Luck, Tony wrote:
> If deployment of a microcode update across a fleet always went
> flawlessly, life would be simpler. But things can fail. And maybe the
> failure wasn't noticed. Perhaps a node was rebooting when the sysadmin
> pushed the update to the fleet and so missed the deployment. Perhaps
> one core was already acting weird and the microcode update didn't get
> applied to that core.

Yes, and you go collect data from that box. You will have to anyway to
figure out why the microcode didn't update.

> Swapping a hard drive, or hot plugging a NIC isn't very likely
> to correlate with an error reported by the CPU in machine
> check banks.

Ofc it is - coherent probe timeoutting due to problematic insertion
could be reported with a MCE, and so on and so on.

> Is it so very different to add this to a trace record so that rasdaemon
> can have feature parity with mcelog(8)?

I knew you were gonna say that. When someone decides that it is
a splendid idea to add more stuff to struct mce then said someone would
want it in the tracepoint too.

And then we're back to my original question: 

"And where does it end? Stick full dmesg in the tracepoint too?"

Where do you draw the line in the sand and say, no more, especially
static, fields bloating the trace record should be added and from then
on, you should go collect the info from that box. Something which you're
supposed to do anyway.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



RE: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Luck, Tony
> > You've spent enough time with Ashok and Thomas tweaking the Linux
> > microcode driver to know that going back to the machine the next day
> > to ask about microcode version has a bunch of ways to get a wrong
> > answer.
>
> Huh, what does that have to do with this?

If deployment of a microcode update across a fleet always went
flawlessly, life would be simpler. But things can fail. And maybe the
failure wasn't noticed. Perhaps a node was rebooting when the
sysadmin pushed the update to the fleet and so missed the
deployment. Perhaps one core was already acting weird and
the microcode update didn't get applied to that core.

> IIUC, if someone changes something on the system, whether that is
> updating microcode or swapping a harddrive or swapping memory or
> whatever, that invalidates the errors reported, pretty much.
>
> You can't put it all in the trace record, you just can't.

Swapping a hard drive, or hot plugging a NIC isn't very likely
to correlate with an error reported by the CPU in machine
check banks. But microcode can be (and has been) the issue
in enough cases that knowing the version at the time of the
error matters.

You seemed to agree with this argument when the microcode
field was added to "struct mce" back in 2018

fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check records")

Is it so very different to add this to a trace record so that rasdaemon
can have feature parity with mcelog(8)?

-Tony


Re: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Borislav Petkov
On Fri, Jan 26, 2024 at 05:10:20PM +, Luck, Tony wrote:
> 12 extra bytes divided by (say) 64GB (a very small server these days, may 
> laptop has that much)
>= 0.0001746%
> 
> We will need 57000 changes like this one before we get to 0.001% :-)

You're forgetting that those 12 bytes repeat per MCE tracepoint logged.
And there's other code which adds more 0.01% here and there, well,
because we can.

> But the key there is keeping the details of the source machine attached to
> the error record. My first contact with machine check debugging is always
> just the raw error record (from mcelog, rasdaemon, or console log).

Yes, that is somewhat sensible reason to have the PPIN together with the
MCE record.

> Knowing which microcode version was loaded on a core *at the time of
> the error* is critical. 

So is the rest of the debug info you're going to need from that machine.
And yet we're not adding that to the tracepoint.

> You've spent enough time with Ashok and Thomas tweaking the Linux
> microcode driver to know that going back to the machine the next day
> to ask about microcode version has a bunch of ways to get a wrong
> answer.

Huh, what does that have to do with this?

IIUC, if someone changes something on the system, whether that is
updating microcode or swapping a harddrive or swapping memory or
whatever, that invalidates the errors reported, pretty much.

You can't put it all in the trace record, you just can't. 

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



RE: [PATCH v2 0/2] Update mce_record tracepoint

2024-01-26 Thread Luck, Tony
> > 8 bytes for PPIN, 4 more for microcode.
>
> I know, nothing leads to bloat like 0.01% here, 0.001% there...

12 extra bytes divided by (say) 64GB (a very small server these days, may 
laptop has that much)
   = 0.0001746%

We will need 57000 changes like this one before we get to 0.001% :-)

> > Number of recoverable machine checks per system  I hope the
> > monthly rate should be countable on my fingers...
>
> That's not the point. Rather, when you look at MCE reports, you pretty
> much almost always go and collect additional information from the target
> machine because you want to figure out what exactly is going on.
>
> So what's stopping you from collecting all that static information
> instead of parrotting it through the tracepoint with each error?

PPIN is static. So with good tracking to keep source platform information
attached to the error record as it gets passed around people trying to triage
the problem, you could say it can be retrieved later (or earlier when setting
up a database of attributes of each machine in the fleet.

But the key there is keeping the details of the source machine attached to
the error record. My first contact with machine check debugging is always
just the raw error record (from mcelog, rasdaemon, or console log).

> > PPIN is useful when talking to the CPU vendor about patterns of
> > similar errors seen across a cluster.
>
> I guess that is perhaps the only thing of the two that makes some sense
> at least - the identifier uniquely describes which CPU the error comes
> from...
>
> > MICROCODE - gives a fast path to root cause problems that have already
> > been fixed in a microcode update.
>
> But that, nah. See above.

Knowing which microcode version was loaded on a core *at the time of the error*
is critical. You've spent enough time with Ashok and Thomas tweaking the Linux
microcode driver to know that going back to the machine the next day to ask 
about
microcode version has a bunch of ways to get a wrong answer.

-Tony


  1   2   3   4   5   6   7   8   9   10   >