date:20240318

Re: [PATCH v2 0/3] uprobes: two common case speed ups

2024-03-18 Thread Oleg Nesterov

On 03/18, Andrii Nakryiko wrote:
>
> Andrii Nakryiko (3):
>   uprobes: encapsulate preparation of uprobe args buffer
>   uprobes: prepare uprobe args buffer lazily
>   uprobes: add speculative lockless system-wide uprobe filter check
>
>  kernel/trace/trace_uprobe.c | 103 +---
>  1 file changed, 59 insertions(+), 44 deletions(-)

Reviewed-by: Oleg Nesterov

Re: [PATCH 1/3] dt-bindings: remoteproc: add Versal platform support

2024-03-18 Thread Krzysztof Kozlowski

On 19/03/2024 01:37, Tanmay Shah wrote:
> Hello,
> 
> Thanks for reviews, please find my comments below.
> 
> On 3/17/24 9:50 AM, Conor Dooley wrote:
>> On Fri, Mar 15, 2024 at 02:15:31PM -0700, Tanmay Shah wrote:
>>> AMD-Xilinx Versal platform is successor of ZynqMP platform. Real-time
>>> Processor Unit R5 cluster IP on Versal is same as of ZynqMP Platform.
>>
>>> Only difference is power-domains ID needed by power management firmware.
>>> Hence, keeping the compatible property same as of zynqmp node.
>>
>> No, don't be lazy. Add a compatible with a fallback please.
> 
> It's same IP on different platform. I am not sure how adding compatible string
> adds value. I will refactor this series based on other comments provided.

Judging by your other thread, it would add value. Also writing bindings
asks you for this.

Best regards,
Krzysztof

Re: [PATCH 2/3] dt-bindings: remoteproc: add Versal-NET platform

2024-03-18 Thread Krzysztof Kozlowski

On 19/03/2024 01:51, Tanmay Shah wrote:
> Hello Krzysztof,
> 
> Thanks for reviews. Please find my comments below.
> 
> On 3/17/24 1:53 PM, Krzysztof Kozlowski wrote:
>> On 15/03/2024 22:15, Tanmay Shah wrote:
>>> AMD-Xilinx Versal-NET platform is successor of Versal platform. It
>>> contains multiple clusters of cortex-R52 real-time processing units.
>>> Each cluster contains two cores of cortex-R52 processors. Each cluster
>>> can be configured in lockstep mode or split mode.
>>>
>>> Each R52 core is assigned 128KB of TCM memory. ATCM memory is 64KB, BTCM
>>> and CTCM memoreis are 32KB each. Each TCM memory has its own dedicated
>>> power-domain that needs to be requested before using it.
>>>
>>> Signed-off-by: Tanmay Shah 
>>> ---
>>>  .../remoteproc/xlnx,zynqmp-r5fss.yaml | 220 +++---
>>>  1 file changed, 184 insertions(+), 36 deletions(-)
>>>
>>> diff --git 
>>> a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml 
>>> b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
>>> index 711da0272250..55654ee02eef 100644
>>> --- a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
>>> +++ b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
>>> @@ -18,7 +18,9 @@ description: |
>>>  
>>>  properties:
>>>compatible:
>>> -const: xlnx,zynqmp-r5fss
>>> +enum:
>>> +  - xlnx,zynqmp-r5fss
>>> +  - xlnx,versal-net-r52fss
>>>  
>>>"#address-cells":
>>>  const: 2
>>> @@ -64,7 +66,9 @@ patternProperties:
>>>  
>>>  properties:
>>>compatible:
>>> -const: xlnx,zynqmp-r5f
>>> +enum:
>>> +  - xlnx,zynqmp-r5f
>>> +  - xlnx,versal-net-r52f
>>>  
>>>reg:
>>>  minItems: 1
>>> @@ -135,9 +139,11 @@ required:
>>>  allOf:
>>>- if:
>>>properties:
>>> -xlnx,cluster-mode:
>>> -  enum:
>>> -- 1
>>> +compatible:
>>> +  contains:
>>> +enum:
>>> +  - xlnx,versal-net-r52fss
>>
>> Why do you touch these lines?
>>
>>> +
>>>  then:
>>>patternProperties:
>>>  "^r5f@[0-9a-f]+$":
>>> @@ -149,16 +155,14 @@ allOf:
>>>items:
>>>  - description: ATCM internal memory
>>>  - description: BTCM internal memory
>>> -- description: extra ATCM memory in lockstep mode
>>> -- description: extra BTCM memory in lockstep mode
>>> +- description: CTCM internal memory
>>>  
>>>  reg-names:
>>>minItems: 1
>>>items:
>>> -- const: atcm0
>>> -- const: btcm0
>>> -- const: atcm1
>>> -- const: btcm1
>>> +- const: atcm
>>> +- const: btcm
>>> +- const: ctcm
>>>  
>>>  power-domains:
>>>minItems: 2
>>> @@ -166,33 +170,70 @@ allOf:
>>>  - description: RPU core power domain
>>>  - description: ATCM power domain
>>>  - description: BTCM power domain
>>> -- description: second ATCM power domain
>>> -- description: second BTCM power domain
>>> +- description: CTCM power domain
>>>  
>>>  else:
>>> -  patternProperties:
>>> -"^r5f@[0-9a-f]+$":
>>> -  type: object
>>> -
>>> -  properties:
>>> -reg:
>>> -  minItems: 1
>>> -  items:
>>> -- description: ATCM internal memory
>>> -- description: BTCM internal memory
>>> -
>>> -reg-names:
>>> -  minItems: 1
>>> -  items:
>>> -- const: atcm0
>>> -- const: btcm0
>>> -
>>> -power-domains:
>>> -  minItems: 2
>>> -  items:
>>> -- description: RPU core power domain
>>> -- description: ATCM power domain
>>> -- description: BTCM power domain
>>> +  allOf:
>>> +- if:
>>> +properties:
>>> +  xlnx,cluster-mode:
>>> +enum:
>>> +  - 1
>>
>> Whatever you did here, is not really readable. You have now multiple
>> if:then:if:then embedded.
> 
> For ZynqMP platform, TCM can be configured differently in lockstep mode
> and split mode.
> 
> For Versal-NET no such configuration is available, but new CTCM memory
> is added.
> 
> So, I am trying to achieve following representation of TCM for both:
> 
> if: versal-net compatible
> then:
>   ATCM - 64KB
>   BTCM - 32KB
>   CTCM - 32KB
> 
> else: (ZynqMP compatible)
>   if:
> xlnx,cluster-mode (lockstep mode)
>   then:
> ATCM0 - 64KB
> BTCM0 - 64KB
> ATCM1 - 64KB
> BTCM1 - 64KB
>   else: (split mode)
> ATCM0 - 64KB
> BTCM0 - 64KB
> 
> 
> If bindings are getting complicated, does it make sense to introduce
> new file for

Re: [PATCH v13 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

2024-03-18 Thread Krzysztof Kozlowski

On 11/03/2024 18:59, Tanmay Shah wrote:
> From: Radhey Shyam Pandey 
> 
> Introduce bindings for TCM memory address space on AMD-xilinx Zynq
> UltraScale+ platform. It will help in defining TCM in device-tree
> and make it's access platform agnostic and data-driven.
> 
> Tightly-coupled memories(TCMs) are low-latency memory that provides
> predictable instruction execution and predictable data load/store
> timing. Each Cortex-R5F processor contains two 64-bit wide 64 KB memory
> banks on the ATCM and BTCM ports, for a total of 128 KB of memory.
> 
> The TCM resources(reg, reg-names and power-domain) are documented for
> each TCM in the R5 node. The reg and reg-names are made as required
> properties as we don't want to hardcode TCM addresses for future
> platforms and for zu+ legacy implementation will ensure that the
> old dts w/o reg/reg-names works and stable ABI is maintained.
> 
> It also extends the examples for TCM split and lockstep modes.
> 
> Signed-off-by: Radhey Shyam Pandey 
> Signed-off-by: Tanmay Shah 
> ---

I responded under my reviewed-tag, but to be clear, also here:

This patch has is not ready. Please do not merge.

Best regards,
Krzysztof

Re: [PATCH v13 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

2024-03-18 Thread Krzysztof Kozlowski

On 12/03/2024 13:13, Krzysztof Kozlowski wrote:
> On 11/03/2024 18:59, Tanmay Shah wrote:
>> From: Radhey Shyam Pandey 
>>
>> Introduce bindings for TCM memory address space on AMD-xilinx Zynq
>> UltraScale+ platform. It will help in defining TCM in device-tree
>> and make it's access platform agnostic and data-driven.
>>
>> Tightly-coupled memories(TCMs) are low-latency memory that provides
>> predictable instruction execution and predictable data load/store
>> timing. Each Cortex-R5F processor contains two 64-bit wide 64 KB memory
>> banks on the ATCM and BTCM ports, for a total of 128 KB of memory.
>>
>> The TCM resources(reg, reg-names and power-domain) are documented for
>> each TCM in the R5 node. The reg and reg-names are made as required
>> properties as we don't want to hardcode TCM addresses for future
>> platforms and for zu+ legacy implementation will ensure that the
>> old dts w/o reg/reg-names works and stable ABI is maintained.
>>
>> It also extends the examples for TCM split and lockstep modes.
>>
>> Signed-off-by: Radhey Shyam Pandey 
>> Signed-off-by: Tanmay Shah 
>> ---
>>
>> Changes in v13:
>>   - Have power-domains property for lockstep case instead of
>> keeping it flexible.
>>   - Add "items:" list in power-domains property
> 
> 
> Reviewed-by: Krzysztof Kozlowski 

And unreviewed. It turns out you now mix devices and bring incompatible
programming models under one compatible. And this leads to problems in
your further patches.

NAK.

Best regards,
Krzysztof

Re: [PATCH 3/3] drivers: remoteproc: add Versal and Versal-NET support

2024-03-18 Thread Krzysztof Kozlowski

On 19/03/2024 02:06, Tanmay Shah wrote:
> 
> 
> On 3/17/24 1:55 PM, Krzysztof Kozlowski wrote:
>> On 15/03/2024 22:15, Tanmay Shah wrote:
>>> AMD-Xilinx Versal and Versal-NET are successor of ZynqMP platform. ZynqMP
>>> remoteproc driver is mostly compatible with new platforms except few
>>> platform specific differences.
>>>
>>> Versal has same IP of cortex-R5 cores hence maintained compatible string
>>> same as ZynqMP platform. However, hardcode TCM addresses are not
>>> supported for new platforms and must be provided in device-tree as per
>>> new bindings. This makes TCM representation data-driven and easy to
>>> maintain. This check is provided in the driver.
>>>
>>> For Versal-NET platform, TCM doesn't need to be configured in lockstep
>>> mode or split mode. Hence that call to PMC firmware is avoided in the
>>> driver for Versal-NET platform.
>>>
>>> Signed-off-by: Tanmay Shah 
>>> ---
>>>  drivers/remoteproc/xlnx_r5_remoteproc.c | 19 +++
>>>  1 file changed, 15 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
>>> b/drivers/remoteproc/xlnx_r5_remoteproc.c
>>> index d4a22caebaad..193bc159d1b4 100644
>>> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
>>> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
>>> @@ -323,9 +323,12 @@ static int zynqmp_r5_set_mode(struct zynqmp_r5_core 
>>> *r5_core,
>>> return ret;
>>> }
>>>  
>>> -   ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode);
>>> -   if (ret < 0)
>>> -   dev_err(r5_core->dev, "failed to configure TCM\n");
>>> +   /* TCM configuration is not needed in versal-net */
>>> +   if (device_is_compatible(r5_core->dev, "xlnx,zynqmp-r5f")) {
>>> +   ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode);
>>> +   if (ret < 0)
>>> +   dev_err(r5_core->dev, "failed to configure TCM\n");
>>> +   }
>>>  
>>> return ret;
>>>  }
>>> @@ -933,10 +936,17 @@ static int zynqmp_r5_core_init(struct 
>>> zynqmp_r5_cluster *cluster,
>>> int ret, i;
>>>  
>>> r5_core = cluster->r5_cores[0];
>>> +
>>> +   /*
>>> +* New platforms must use device tree for TCM parsing.
>>> +* Only ZynqMP uses hardcode TCM.
>>> +*/
>>> if (of_find_property(r5_core->np, "reg", NULL))
>>> ret = zynqmp_r5_get_tcm_node_from_dt(cluster);
>>> -   else
>>> +   else if (of_machine_is_compatible("xlnx,zynqmp"))
>>> ret = zynqmp_r5_get_tcm_node(cluster);
>>
>> That's poor code. Your drivers should not depend on platform. I don't
>> understand why you need to do this and how is even related to this patch.
> 
> You are correct, ideally this shouldn't be needed. However, this driver 
> contains
> hardcode TCM addresses that were used before TCM bindings were designed and 
> available in
> device-tree. This check is provided to maintain backward compatibility with 
> device-tree
> where TCM isn't expected.
> 
> For new platforms (Versal and Versal-NET) TCM must be provided in device-tree 
> and for
> ZynqMP if it's not in device-tree then to maintain backward compatibility 
> hardcode
> addresses are used.

That does not work like this. You cannot bind to some sort of different
compatible. If you disagree, please list the compatibles the driver
binds to.

Best regards,
Krzysztof

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-18 Thread Gavin Shan


On 3/19/24 02:59, Will Deacon wrote:

On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:

The issue is reported by Yihuang Yu who have 'netperf' test on
NVidia's grace-grace and grace-hopper machines. The 'netperf'
client is started in the VM hosted by grace-hopper machine,
while the 'netperf' server is running on grace-grace machine.

The VM is started with virtio-net and vhost has been enabled.
We observe a error message spew from VM and then soft-lockup
report. The error message indicates the data associated with
the descriptor (index: 135) has been released, and the queue
is marked as broken. It eventually leads to the endless effort
to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
and soft-lockup. The stale index 135 is fetched from the available
ring and published to the used ring by vhost, meaning we have
disordred write to the available ring element and available index.

   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
   -accel kvm -machine virt,gic-version=host\
  : \
   -netdev tap,id=vnet0,vhost=on\
   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \

   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!

Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
ARM64. It should work for other architectures, but performance loss is
expected.

Cc: sta...@vger.kernel.org
Reported-by: Yihuang Yu 
Signed-off-by: Gavin Shan 
---
  drivers/virtio/virtio_ring.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 49299b1f9ec7..7d852811c912 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
  
-	/* Descriptors and available array need to be set before we expose the

-* new available array entries. */
-   virtio_wmb(vq->weak_barriers);
+   /*
+* Descriptors and available array need to be set before we expose
+* the new available array entries. virtio_wmb() should be enough
+* to ensuere the order theoretically. However, a stronger barrier
+* is needed by ARM64. Otherwise, the stale data can be observed
+* by the host (vhost). A stronger barrier should work for other
+* architectures, but performance loss is expected.
+*/
+   virtio_mb(false);
vq->split.avail_idx_shadow++;
vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
vq->split.avail_idx_shadow);


Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
here, especially when ordering accesses to coherent memory.

In practice, either the larger timing different from the DSB or the fact
that you're going from a Store->Store barrier to a full barrier is what
makes things "work" for you. Have you tried, for example, a DMB SY
(e.g. via __smb_mb()).

We definitely shouldn't take changes like this without a proper
explanation of what is going on.



Thanks for your comments, Will.

Yes, DMB should work for us. However, it seems this instruction has issues on
NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
from hardware level. I agree it's not the solution to replace DMB with DSB
before we fully understand the root cause.

I tried the possible replacement like below. __smp_mb() can avoid the issue like
__mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.

static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
{
:
/* Put entry in available array (but don't update avail->idx until they
 * do sync). */
avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);

/* Descriptors and available array need to be set before we expose the
 * new available array entries. */
// Broken: virtio_wmb(vq->weak_barriers);
// Broken: __dma_mb();
// Work:   __mb();
// Work:   __smp_mb();
// Work:   __ndelay(100);
// Work:   __ndelay(10);
// Broken: __ndelay(9);

   vq->split.avail_idx_shadow++;
vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
vq->split.avail_idx_shadow);
vq->num_added++;

pr_debug("Added buffer head %i to %p\n", head, vq);
END_USE(vq);
:
}

I also tried to measure the consumed time for various barrier-relative

Re: [PATCH v2 0/3] uprobes: two common case speed ups

2024-03-18 Thread Google

On Tue, 19 Mar 2024 13:20:57 +0900
Masami Hiramatsu (Google)  wrote:

> Hi,
> 
> On Mon, 18 Mar 2024 11:17:25 -0700
> Andrii Nakryiko  wrote:
> 
> > This patch set implements two speed ups for uprobe/uretprobe runtime 
> > execution
> > path for some common scenarios: BPF-only uprobes (patches #1 and #2) and
> > system-wide (non-PID-specific) uprobes (patch #3). Please see individual
> > patches for details.
> 
> This series looks good to me. Let me pick it on probes/for-next.

Ah, and 

Acked-by: Masami Hiramatsu (Google) 

for this series.

> 
> Thanks!
> 
> > 
> > v1->v2:
> >   - rebased onto trace/core branch of tracing tree, hopefully I guessed 
> > right;
> >   - simplified user_cpu_buffer usage further (Oleg Nesterov);
> >   - simplified patch #3, just moved speculative check outside of lock 
> > (Oleg);
> >   - added Reviewed-by from Jiri Olsa.
> > 
> > Andrii Nakryiko (3):
> >   uprobes: encapsulate preparation of uprobe args buffer
> >   uprobes: prepare uprobe args buffer lazily
> >   uprobes: add speculative lockless system-wide uprobe filter check
> > 
> >  kernel/trace/trace_uprobe.c | 103 +---
> >  1 file changed, 59 insertions(+), 44 deletions(-)
> > 
> > -- 
> > 2.43.0
> > 
> 
> 
> -- 
> Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google)

Re: [PATCH v2 0/3] uprobes: two common case speed ups

2024-03-18 Thread Google

Hi,

On Mon, 18 Mar 2024 11:17:25 -0700
Andrii Nakryiko  wrote:

> This patch set implements two speed ups for uprobe/uretprobe runtime execution
> path for some common scenarios: BPF-only uprobes (patches #1 and #2) and
> system-wide (non-PID-specific) uprobes (patch #3). Please see individual
> patches for details.

This series looks good to me. Let me pick it on probes/for-next.

Thanks!

> 
> v1->v2:
>   - rebased onto trace/core branch of tracing tree, hopefully I guessed right;
>   - simplified user_cpu_buffer usage further (Oleg Nesterov);
>   - simplified patch #3, just moved speculative check outside of lock (Oleg);
>   - added Reviewed-by from Jiri Olsa.
> 
> Andrii Nakryiko (3):
>   uprobes: encapsulate preparation of uprobe args buffer
>   uprobes: prepare uprobe args buffer lazily
>   uprobes: add speculative lockless system-wide uprobe filter check
> 
>  kernel/trace/trace_uprobe.c | 103 +---
>  1 file changed, 59 insertions(+), 44 deletions(-)
> 
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google)

Re: [PATCH 0/5] Some devicetree cleanup for MSM8974 Sony Xperia Z2 Tablet

2024-03-18 Thread Bjorn Andersson



On Wed, 06 Mar 2024 00:18:03 +0100, Luca Weiss wrote:
> The sony-castor dts has been around for a while, clean up some things to
> prepare for further changes including the introduction of the
> shinano-based Sony Xperia Z3.
> 
> 

Applied, thanks!

[1/5] ARM: dts: qcom: msm8974pro-castor: Clean up formatting
  commit: c0b9c616e881a1ea6b158dcad73e2198c483a56a
[2/5] ARM: dts: qcom: msm8974pro-castor: Add mmc aliases
  commit: 0268bfec8ae06bc1091c2b9058287391872f8853
[3/5] ARM: dts: qcom: msm8974pro-castor: Remove camera button definitions
  commit: 26b950f2af40bd6cc36c3ac8ee6643ff3e905753
[4/5] ARM: dts: qcom: msm8974pro-castor: Add debounce-interval for keys
  commit: ead56f2e68b874c3549a67c6ae0c6186f780bf2f
[5/5] ARM: dts: qcom: msm8974pro-castor: Rename wifi node name
  commit: b756001175b1145a4855c8cfa77ceec78350c36a

Best regards,
-- 
Bjorn Andersson

Re: (subset) [PATCH v2 0/2] Fairphone 5 PMIC-GLINK support (USB-C, charger, fuel gauge)

2024-03-18 Thread Bjorn Andersson



On Thu, 08 Feb 2024 10:52:31 +0100, Luca Weiss wrote:
> This series adds all the necessary bits to enable USB-C role switching,
> charger and fuel gauge (all via pmic-glink) on Fairphone 5.
> 
> 

Applied, thanks!

[1/2] dt-bindings: soc: qcom: qcom,pmic-glink: document QCM6490 compatible
  commit: a5b150af2c6ad2749aee51edc36f9f9883869f5b

Best regards,
-- 
Bjorn Andersson

Re: [PATCH] arm64: dts: qcom: sdm632-fairphone-fp3: enable USB-C port handling

2024-03-18 Thread Bjorn Andersson



On Tue, 20 Feb 2024 13:01:22 +0100, Luca Weiss wrote:
> Add the definition for the USB-C connector found on this phone and hook
> up the relevant bits. This enables USB role switching.
> 
> 

Applied, thanks!

[1/1] arm64: dts: qcom: sdm632-fairphone-fp3: enable USB-C port handling
  commit: 90053b1574f8cff3a3b53accc496246ad2e0aec3

Best regards,
-- 
Bjorn Andersson

Re: [PATCH] arm64: dts: qcom: Fix type of "wdog" IRQs for remoteprocs

2024-03-18 Thread Bjorn Andersson



On Mon, 19 Feb 2024 15:33:27 +0100, Luca Weiss wrote:
> The code in qcom_q6v5_init() requests the "wdog" IRQ as
> IRQF_TRIGGER_RISING. If dt defines the interrupt type as LEVEL_HIGH then
> the driver will have issues getting the IRQ again after probe deferral
> with an error like:
> 
>   irq: type mismatch, failed to map hwirq-14 for interrupt-controller@b22!
> 
> [...]

Applied, thanks!

[1/1] arm64: dts: qcom: Fix type of "wdog" IRQs for remoteprocs
  commit: dd6943ef8edde3c6b8aee06cb142c9a41086e3b3

Best regards,
-- 
Bjorn Andersson

Re: [PATCH 3/3] drivers: remoteproc: add Versal and Versal-NET support

2024-03-18 Thread Tanmay Shah




On 3/17/24 1:55 PM, Krzysztof Kozlowski wrote:
> On 15/03/2024 22:15, Tanmay Shah wrote:
>> AMD-Xilinx Versal and Versal-NET are successor of ZynqMP platform. ZynqMP
>> remoteproc driver is mostly compatible with new platforms except few
>> platform specific differences.
>> 
>> Versal has same IP of cortex-R5 cores hence maintained compatible string
>> same as ZynqMP platform. However, hardcode TCM addresses are not
>> supported for new platforms and must be provided in device-tree as per
>> new bindings. This makes TCM representation data-driven and easy to
>> maintain. This check is provided in the driver.
>> 
>> For Versal-NET platform, TCM doesn't need to be configured in lockstep
>> mode or split mode. Hence that call to PMC firmware is avoided in the
>> driver for Versal-NET platform.
>> 
>> Signed-off-by: Tanmay Shah 
>> ---
>>  drivers/remoteproc/xlnx_r5_remoteproc.c | 19 +++
>>  1 file changed, 15 insertions(+), 4 deletions(-)
>> 
>> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
>> b/drivers/remoteproc/xlnx_r5_remoteproc.c
>> index d4a22caebaad..193bc159d1b4 100644
>> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
>> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
>> @@ -323,9 +323,12 @@ static int zynqmp_r5_set_mode(struct zynqmp_r5_core 
>> *r5_core,
>>  return ret;
>>  }
>>  
>> -ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode);
>> -if (ret < 0)
>> -dev_err(r5_core->dev, "failed to configure TCM\n");
>> +/* TCM configuration is not needed in versal-net */
>> +if (device_is_compatible(r5_core->dev, "xlnx,zynqmp-r5f")) {
>> +ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode);
>> +if (ret < 0)
>> +dev_err(r5_core->dev, "failed to configure TCM\n");
>> +}
>>  
>>  return ret;
>>  }
>> @@ -933,10 +936,17 @@ static int zynqmp_r5_core_init(struct 
>> zynqmp_r5_cluster *cluster,
>>  int ret, i;
>>  
>>  r5_core = cluster->r5_cores[0];
>> +
>> +/*
>> + * New platforms must use device tree for TCM parsing.
>> + * Only ZynqMP uses hardcode TCM.
>> + */
>>  if (of_find_property(r5_core->np, "reg", NULL))
>>  ret = zynqmp_r5_get_tcm_node_from_dt(cluster);
>> -else
>> +else if (of_machine_is_compatible("xlnx,zynqmp"))
>>  ret = zynqmp_r5_get_tcm_node(cluster);
> 
> That's poor code. Your drivers should not depend on platform. I don't
> understand why you need to do this and how is even related to this patch.

You are correct, ideally this shouldn't be needed. However, this driver contains
hardcode TCM addresses that were used before TCM bindings were designed and 
available in
device-tree. This check is provided to maintain backward compatibility with 
device-tree
where TCM isn't expected.

For new platforms (Versal and Versal-NET) TCM must be provided in device-tree 
and for
ZynqMP if it's not in device-tree then to maintain backward compatibility 
hardcode
addresses are used.

Thanks.


> 
> 
> Best regards,
> Krzysztof
>

Re: [PATCH 2/3] dt-bindings: remoteproc: add Versal-NET platform

2024-03-18 Thread Tanmay Shah

Hello Krzysztof,

Thanks for reviews. Please find my comments below.

On 3/17/24 1:53 PM, Krzysztof Kozlowski wrote:
> On 15/03/2024 22:15, Tanmay Shah wrote:
>> AMD-Xilinx Versal-NET platform is successor of Versal platform. It
>> contains multiple clusters of cortex-R52 real-time processing units.
>> Each cluster contains two cores of cortex-R52 processors. Each cluster
>> can be configured in lockstep mode or split mode.
>> 
>> Each R52 core is assigned 128KB of TCM memory. ATCM memory is 64KB, BTCM
>> and CTCM memoreis are 32KB each. Each TCM memory has its own dedicated
>> power-domain that needs to be requested before using it.
>> 
>> Signed-off-by: Tanmay Shah 
>> ---
>>  .../remoteproc/xlnx,zynqmp-r5fss.yaml | 220 +++---
>>  1 file changed, 184 insertions(+), 36 deletions(-)
>> 
>> diff --git 
>> a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml 
>> b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
>> index 711da0272250..55654ee02eef 100644
>> --- a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
>> +++ b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
>> @@ -18,7 +18,9 @@ description: |
>>  
>>  properties:
>>compatible:
>> -const: xlnx,zynqmp-r5fss
>> +enum:
>> +  - xlnx,zynqmp-r5fss
>> +  - xlnx,versal-net-r52fss
>>  
>>"#address-cells":
>>  const: 2
>> @@ -64,7 +66,9 @@ patternProperties:
>>  
>>  properties:
>>compatible:
>> -const: xlnx,zynqmp-r5f
>> +enum:
>> +  - xlnx,zynqmp-r5f
>> +  - xlnx,versal-net-r52f
>>  
>>reg:
>>  minItems: 1
>> @@ -135,9 +139,11 @@ required:
>>  allOf:
>>- if:
>>properties:
>> -xlnx,cluster-mode:
>> -  enum:
>> -- 1
>> +compatible:
>> +  contains:
>> +enum:
>> +  - xlnx,versal-net-r52fss
> 
> Why do you touch these lines?
> 
>> +
>>  then:
>>patternProperties:
>>  "^r5f@[0-9a-f]+$":
>> @@ -149,16 +155,14 @@ allOf:
>>items:
>>  - description: ATCM internal memory
>>  - description: BTCM internal memory
>> -- description: extra ATCM memory in lockstep mode
>> -- description: extra BTCM memory in lockstep mode
>> +- description: CTCM internal memory
>>  
>>  reg-names:
>>minItems: 1
>>items:
>> -- const: atcm0
>> -- const: btcm0
>> -- const: atcm1
>> -- const: btcm1
>> +- const: atcm
>> +- const: btcm
>> +- const: ctcm
>>  
>>  power-domains:
>>minItems: 2
>> @@ -166,33 +170,70 @@ allOf:
>>  - description: RPU core power domain
>>  - description: ATCM power domain
>>  - description: BTCM power domain
>> -- description: second ATCM power domain
>> -- description: second BTCM power domain
>> +- description: CTCM power domain
>>  
>>  else:
>> -  patternProperties:
>> -"^r5f@[0-9a-f]+$":
>> -  type: object
>> -
>> -  properties:
>> -reg:
>> -  minItems: 1
>> -  items:
>> -- description: ATCM internal memory
>> -- description: BTCM internal memory
>> -
>> -reg-names:
>> -  minItems: 1
>> -  items:
>> -- const: atcm0
>> -- const: btcm0
>> -
>> -power-domains:
>> -  minItems: 2
>> -  items:
>> -- description: RPU core power domain
>> -- description: ATCM power domain
>> -- description: BTCM power domain
>> +  allOf:
>> +- if:
>> +properties:
>> +  xlnx,cluster-mode:
>> +enum:
>> +  - 1
> 
> Whatever you did here, is not really readable. You have now multiple
> if:then:if:then embedded.

For ZynqMP platform, TCM can be configured differently in lockstep mode
and split mode.

For Versal-NET no such configuration is available, but new CTCM memory
is added.

So, I am trying to achieve following representation of TCM for both:

if: versal-net compatible
then:
  ATCM - 64KB
  BTCM - 32KB
  CTCM - 32KB

else: (ZynqMP compatible)
  if:
xlnx,cluster-mode (lockstep mode)
  then:
ATCM0 - 64KB
BTCM0 - 64KB
ATCM1 - 64KB
BTCM1 - 64KB
  else: (split mode)
ATCM0 - 64KB
BTCM0 - 64KB


If bindings are getting complicated, does it make sense to introduce
new file for Versal-NET bindings? Let me know how you would like me
to proceed.

Thanks,
Tanmay

> 
>> +  then:
>> +patternProperties:
>> +  "^r5f@[0-9a-f]+$":
>> +type: object
>> +
>> +

Re: [PATCH 1/3] dt-bindings: remoteproc: add Versal platform support

2024-03-18 Thread Tanmay Shah




On 3/17/24 1:50 PM, Krzysztof Kozlowski wrote:
> On 15/03/2024 22:15, Tanmay Shah wrote:
>> AMD-Xilinx Versal platform is successor of ZynqMP platform. Real-time
>> Processor Unit R5 cluster IP on Versal is same as of ZynqMP Platform.
>> Only difference is power-domains ID needed by power management firmware.
>> Hence, keeping the compatible property same as of zynqmp node.
>> 
>> Signed-off-by: Tanmay Shah 
> 
> There is no binding change, so NAK. Platform is not being added to
> examples. You changed nothing in  Versal support...

Thanks for reviews.

Okay got it. That means I don't really need to add anything related to Versal.
I will get rid of patch that says "Versal support". Looks like it's not needed
at all.

Thanks.

> 
> Best regards,
> Krzysztof
>

Re: [PATCH 1/3] dt-bindings: remoteproc: add Versal platform support

2024-03-18 Thread Tanmay Shah

Hello,

Thanks for reviews, please find my comments below.

On 3/17/24 9:50 AM, Conor Dooley wrote:
> On Fri, Mar 15, 2024 at 02:15:31PM -0700, Tanmay Shah wrote:
>> AMD-Xilinx Versal platform is successor of ZynqMP platform. Real-time
>> Processor Unit R5 cluster IP on Versal is same as of ZynqMP Platform.
> 
>> Only difference is power-domains ID needed by power management firmware.
>> Hence, keeping the compatible property same as of zynqmp node.
> 
> No, don't be lazy. Add a compatible with a fallback please.

It's same IP on different platform. I am not sure how adding compatible string
adds value. I will refactor this series based on other comments provided.

> This doesn't apply to linux-next either FYI.

Actually cover-letter contains dependent series link that is needed for this
patch. That series was acked recently but hasn't made to linux-next yet.

I may be missing something in the process. In such case there is no other way
to send patch except for waiting?

Thanks,
Tanmay

Re: [External] Re: [PATCH v2 1/1] memory tier: acpi/hmat: create CPUless memory tiers after obtaining HMAT info

2024-03-18 Thread Ho-Ren (Jack) Chuang

I'm working on V3. Thanks for Ying's feedback.

cc: sthanne...@micron.com


On Thu, Mar 14, 2024 at 12:54 AM Huang, Ying  wrote:
>
> "Ho-Ren (Jack) Chuang"  writes:
>
> > On Tue, Mar 12, 2024 at 2:21 AM Huang, Ying  wrote:
> >>
> >> "Ho-Ren (Jack) Chuang"  writes:
> >>
> >> > The current implementation treats emulated memory devices, such as
> >> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal 
> >> > memory
> >> > (E820_TYPE_RAM). However, these emulated devices have different
> >> > characteristics than traditional DRAM, making it important to
> >> > distinguish them. Thus, we modify the tiered memory initialization 
> >> > process
> >> > to introduce a delay specifically for CPUless NUMA nodes. This delay
> >> > ensures that the memory tier initialization for these nodes is deferred
> >> > until HMAT information is obtained during the boot process. Finally,
> >> > demotion tables are recalculated at the end.
> >> >
> >> > * Abstract common functions into `find_alloc_memory_type()`
> >>
> >> We should move kmem_put_memory_types() (renamed to
> >> mt_put_memory_types()?) too.  This can be put in a separate patch.
> >>
> >
> > Will do! Thanks,
> >
> >
> >>
> >> > Since different memory devices require finding or allocating a memory 
> >> > type,
> >> > these common steps are abstracted into a single function,
> >> > `find_alloc_memory_type()`, enhancing code scalability and conciseness.
> >> >
> >> > * Handle cases where there is no HMAT when creating memory tiers
> >> > There is a scenario where a CPUless node does not provide HMAT 
> >> > information.
> >> > If no HMAT is specified, it falls back to using the default DRAM tier.
> >> >
> >> > * Change adist calculation code to use another new lock, mt_perf_lock.
> >> > In the current implementation, iterating through CPUlist nodes requires
> >> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end 
> >> > up
> >> > trying to acquire the same lock, leading to a potential deadlock.
> >> > Therefore, we propose introducing a standalone `mt_perf_lock` to protect
> >> > `default_dram_perf`. This approach not only avoids deadlock but also
> >> > prevents holding a large lock simultaneously.
> >> >
> >> > Signed-off-by: Ho-Ren (Jack) Chuang 
> >> > Signed-off-by: Hao Xiang 
> >> > ---
> >> >  drivers/acpi/numa/hmat.c | 11 ++
> >> >  drivers/dax/kmem.c   | 13 +--
> >> >  include/linux/acpi.h |  6 
> >> >  include/linux/memory-tiers.h |  8 +
> >> >  mm/memory-tiers.c| 70 +---
> >> >  5 files changed, 92 insertions(+), 16 deletions(-)
> >> >
> >> > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> >> > index d6b85f0f6082..28812ec2c793 100644
> >> > --- a/drivers/acpi/numa/hmat.c
> >> > +++ b/drivers/acpi/numa/hmat.c
> >> > @@ -38,6 +38,8 @@ static LIST_HEAD(targets);
> >> >  static LIST_HEAD(initiators);
> >> >  static LIST_HEAD(localities);
> >> >
> >> > +static LIST_HEAD(hmat_memory_types);
> >> > +
> >>
> >> HMAT isn't a device driver for some memory devices.  So I don't think we
> >> should manage memory types in HMAT.
> >
> > I can put it back in memory-tier.c. How about the list? Do we still
> > need to keep a separate list for storing late_inited memory nodes?
> > And how about the list name if we need to remove the prefix "hmat_"?
>
> I don't think we need a separate list for memory-less nodes.  Just
> iterate all memory-less nodes.
>

Ok. There is no need to keep a list of memory-less nodes. I will
only keep a memory_type list to manage created memory types.


> >
> >> Instead, if the memory_type of a
> >> node isn't set by the driver, we should manage it in memory-tier.c as
> >> fallback.
> >>
> >
> > Do you mean some device drivers may init memory tiers between
> > memory_tier_init() and late_initcall(memory_tier_late_init);?
> > And this is the reason why you mention to exclude
> > "node_memory_types[nid].memtype != NULL" in memory_tier_late_init().
> > Is my understanding correct?
>
> Yes.
>

Thanks.

> >> >  static DEFINE_MUTEX(target_lock);
> >> >
> >> >  /*
> >> > @@ -149,6 +151,12 @@ int acpi_get_genport_coordinates(u32 uid,
> >> >  }
> >> >  EXPORT_SYMBOL_NS_GPL(acpi_get_genport_coordinates, CXL);
> >> >
> >> > +struct memory_dev_type *hmat_find_alloc_memory_type(int adist)
> >> > +{
> >> > + return find_alloc_memory_type(adist, _memory_types);
> >> > +}
> >> > +EXPORT_SYMBOL_GPL(hmat_find_alloc_memory_type);
> >> > +
> >> >  static __init void alloc_memory_initiator(unsigned int cpu_pxm)
> >> >  {
> >> >   struct memory_initiator *initiator;
> >> > @@ -1038,6 +1046,9 @@ static __init int hmat_init(void)
> >> >   if (!hmat_set_default_dram_perf())
> >> >   register_mt_adistance_algorithm(_adist_nb);
> >> >
> >> > + /* Post-create CPUless memory tiers after getting HMAT info */
> >> > + memory_tier_late_init();
> >> > +
> >>
> >> This should be called in memory-tier.c via
> >>
>

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-18 Thread SeongJae Park

On Mon, 18 Mar 2024 22:27:45 +0900 Honggyu Kim  wrote:

> Hi SeongJae,
> 
> On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park  wrote:
> > Hi Honggyu,
> > 
> > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  wrote:
> > 
> > > Hi SeongJae,
> > > 
> > > Thanks for the confirmation.  I have a few comments on young filter so
> > > please read the inline comments again.
> > > 
> > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  wrote:
> > > > Hi Honggyu,
> > > > 
> > > > > > -Original Message-
> > > > > > From: SeongJae Park 
> > > > > > Sent: Tuesday, March 12, 2024 3:33 AM
> > > > > > To: Honggyu Kim 
> > > > > > Cc: SeongJae Park ; kernel_team 
> > > > > > 
> > > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory 
> > > > > > management for CXL memory
> > > > > >
> > > > > > Hi Honggyu,
> > > > > >
> > > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" 
> > > > > >  wrote:
> > > > > >
> > > > > > > Hi SeongJae,
> > > > > > >
> > > > > > > I've tested it again and found that "young" filter has to be set
> > > > > > > differently as follows.
> > > > > > > - demote action: set "young" filter with "matching" true
> > > > > > > - promote action: set "young" filter with "matching" false
> 
> Thinking it again, I feel like "matching" true or false looks quite
> vague to me as a general user.
> 
> Instead, I would like to have more meaningful names for "matching" as
> follows.
> 
> - matching "true" can be either (filter) "out" or "skip".
> - matching "false" can be either (filter) "in" or "apply".

I agree the naming could be done much better.  And thank you for the nice
suggestions.  I have a few concerns, though.

Firstly, increasing the number of behavioral concepts.  DAMOS filter feature
has only single behavior: excluding some types of memory from DAMOS action
target.  The "matching" is to provide a flexible way for further specifying the
target to exclude in a bit detail.  Without it, we would need non-variant for
each filter type.  Comapred to the current terms, the new terms feel like
implying there are two types of behaviors.  I think one behavior is easier to
understand than two behaviors, and better match what internal code is doing.

Secondly, ambiguity in "in" and "apply".  To me, the terms sound like _adding_
something more than _excluding_.  I think that might confuse people in some
cases.  Actually, I have used the term "filter-out" and "filter-in" on
this  and several threads.  When saying about "filter-in" scenario, I had to
emphasize the fact that it is not adding something but excluding others.  I now
think that was not a good approach.

Finally, "apply" sounds a bit deterministic.  I think it could be a bit
confusing in some cases such as when using multiple filters in a combined way.
For example, if we have two filters for 1) "apply" a memcg and 2) skip anon
pages, the given DAMOS action will not be applied to anon pages of the memcg.
I think this might be a bit confusing.

> 
> Internally, the type of "matching" can be boolean, but it'd be better
> for general users have another ways to set it such as "out"/"in" or
> "skip"/"apply" via sysfs interface.  I prefer "skip" and "apply" looks
> more intuitive, but I don't have strong objection on "out" and "in" as
> well.

Unfortunately, DAMON sysfs interface is an ABI that we want to keep stable.  Of
course we could make some changes on it if really required.  But I'm unsure if
the problem of current naming and benefit of the sugegsted change are big
enough to outweighs the stability risk and additional efforts.

Also, DAMON sysfs interface is arguably not for _very_ general users.  DAMON
user-space tool is the one for _more_ general users.  To quote DAMON usage
document,

- *DAMON user space tool.*
  `This `_ is for privileged people such as
  system administrators who want a just-working human-friendly interface.
  [...]
- *sysfs interface.*
  :ref:`This ` is for privileged user space programmers who
  want more optimized use of DAMON. [...]

If the concept is that confused, I think we could improve the documentation, or
the user space tool.  But for DAMON sysfs interface, I think we need more
discussions for getting clear pros/cons that justifies the risk and the effort.

> 
> I also feel the filter name "young" is more for developers not for
> general users.  I think this can be changed to "accessed" filter
> instead.

In my humble opinion, "accessed" might be confusing with the term that being
used by DAMON, specifically, the concept of "nr_accesses".  I also thought
about using more specific term such as "pg-accessed" or something else, but I
felt it is still unclear or making it too verbose.

I agree "young" sounds more for developers.  But, again, DAMON sysfs is for not
_very_ general users.  And the term is used commonly on relcaimation part and
LRU, so I think this is not too bad as long as we provide nice documentation or
abstraction

[PATCH v2 3/3] uprobes: add speculative lockless system-wide uprobe filter check

2024-03-18 Thread Andrii Nakryiko

It's very common with BPF-based uprobe/uretprobe use cases to have
a system-wide (not PID specific) probes used. In this case uprobe's
trace_uprobe_filter->nr_systemwide counter is bumped at registration
time, and actual filtering is short circuited at the time when
uprobe/uretprobe is triggered.

This is a great optimization, and the only issue with it is that to even
get to checking this counter uprobe subsystem is taking
read-side trace_uprobe_filter->rwlock. This is actually noticeable in
profiles and is just another point of contention when uprobe is
triggered on multiple CPUs simultaneously.

This patch moves this nr_systemwide check outside of filter list's
rwlock scope, as rwlock is meant to protect list modification, while
nr_systemwide-based check is speculative and racy already, despite the
lock (as discussed in [0]). trace_uprobe_filter_remove() and
trace_uprobe_filter_add() already check for filter->nr_systewide
explicitly outside of __uprobe_perf_filter, so no modifications are
required there.

Confirming with BPF selftests's based benchmarks.

BEFORE (based on changes in previous patch)
===
uprobe-nop :2.732 ± 0.022M/s
uprobe-push:2.621 ± 0.016M/s
uprobe-ret :1.105 ± 0.007M/s
uretprobe-nop  :1.396 ± 0.007M/s
uretprobe-push :1.347 ± 0.008M/s
uretprobe-ret  :0.800 ± 0.006M/s

AFTER
=
uprobe-nop :2.878 ± 0.017M/s (+5.5%, total +8.3%)
uprobe-push:2.753 ± 0.013M/s (+5.3%, total +10.2%)
uprobe-ret :1.142 ± 0.010M/s (+3.8%, total +3.8%)
uretprobe-nop  :1.444 ± 0.008M/s (+3.5%, total +6.5%)
uretprobe-push :1.410 ± 0.010M/s (+4.8%, total +7.1%)
uretprobe-ret  :0.816 ± 0.002M/s (+2.0%, total +3.9%)

In the above, first percentage value is based on top of previous patch
(lazy uprobe buffer optimization), while the "total" percentage is
based on kernel without any of the changes in this patch set.

As can be seen, we get about 4% - 10% speed up, in total, with both lazy
uprobe buffer and speculative filter check optimizations.

  [0] https://lore.kernel.org/bpf/20240313131926.ga19...@redhat.com/

Reviewed-by: Jiri Olsa 
Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_uprobe.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index b5da95240a31..ac05885a6ce6 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1226,9 +1226,6 @@ __uprobe_perf_filter(struct trace_uprobe_filter *filter, 
struct mm_struct *mm)
 {
struct perf_event *event;
 
-   if (filter->nr_systemwide)
-   return true;
-
list_for_each_entry(event, >perf_events, hw.tp_list) {
if (event->hw.target->mm == mm)
return true;
@@ -1353,6 +1350,13 @@ static bool uprobe_perf_filter(struct uprobe_consumer 
*uc,
tu = container_of(uc, struct trace_uprobe, consumer);
filter = tu->tp.event->filter;
 
+   /*
+* speculative short-circuiting check to avoid unnecessarily taking
+* filter->rwlock below, if the uprobe has system-wide consumer
+*/
+   if (READ_ONCE(filter->nr_systemwide))
+   return true;
+
read_lock(>rwlock);
ret = __uprobe_perf_filter(filter, mm);
read_unlock(>rwlock);
-- 
2.43.0

[PATCH v2 2/3] uprobes: prepare uprobe args buffer lazily

2024-03-18 Thread Andrii Nakryiko

uprobe_cpu_buffer and corresponding logic to store uprobe args into it
are used for uprobes/uretprobes that are created through tracefs or
perf events.

BPF is yet another user of uprobe/uretprobe infrastructure, but doesn't
need uprobe_cpu_buffer and associated data. For BPF-only use cases this
buffer handling and preparation is a pure overhead. At the same time,
BPF-only uprobe/uretprobe usage is very common in practice. Also, for
a lot of cases applications are very senstivie to performance overheads,
as they might be tracing a very high frequency functions like
malloc()/free(), so every bit of performance improvement matters.

All that is to say that this uprobe_cpu_buffer preparation is an
unnecessary overhead that each BPF user of uprobes/uretprobe has to pay.
This patch is changing this by making uprobe_cpu_buffer preparation
optional. It will happen only if either tracefs-based or perf event-based
uprobe/uretprobe consumer is registered for given uprobe/uretprobe. For
BPF-only use cases this step will be skipped.

We used uprobe/uretprobe benchmark which is part of BPF selftests (see [0])
to estimate the improvements. We have 3 uprobe and 3 uretprobe
scenarios, which vary an instruction that is replaced by uprobe: nop
(fastest uprobe case), `push rbp` (typical case), and non-simulated
`ret` instruction (slowest case). Benchmark thread is constantly calling
user space function in a tight loop. User space function has attached
BPF uprobe or uretprobe program doing nothing but atomic counter
increments to count number of triggering calls. Benchmark emits
throughput in millions of executions per second.

BEFORE these changes

uprobe-nop :2.657 ± 0.024M/s
uprobe-push:2.499 ± 0.018M/s
uprobe-ret :1.100 ± 0.006M/s
uretprobe-nop  :1.356 ± 0.004M/s
uretprobe-push :1.317 ± 0.019M/s
uretprobe-ret  :0.785 ± 0.007M/s

AFTER these changes
===
uprobe-nop :2.732 ± 0.022M/s (+2.8%)
uprobe-push:2.621 ± 0.016M/s (+4.9%)
uprobe-ret :1.105 ± 0.007M/s (+0.5%)
uretprobe-nop  :1.396 ± 0.007M/s (+2.9%)
uretprobe-push :1.347 ± 0.008M/s (+2.3%)
uretprobe-ret  :0.800 ± 0.006M/s (+1.9)

So the improvements on this particular machine seems to be between 2% and 5%.

  [0] 
https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c

Reviewed-by: Jiri Olsa 
Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_uprobe.c | 49 +
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 9bffaab448a6..b5da95240a31 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -941,15 +941,21 @@ static struct uprobe_cpu_buffer *uprobe_buffer_get(void)
 
 static void uprobe_buffer_put(struct uprobe_cpu_buffer *ucb)
 {
+   if (!ucb)
+   return;
mutex_unlock(>mutex);
 }
 
 static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
-  struct pt_regs *regs)
+  struct pt_regs *regs,
+  struct uprobe_cpu_buffer 
**ucbp)
 {
struct uprobe_cpu_buffer *ucb;
int dsize, esize;
 
+   if (*ucbp)
+   return *ucbp;
+
esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
dsize = __get_data_size(>tp, regs);
 
@@ -958,22 +964,25 @@ static struct uprobe_cpu_buffer 
*prepare_uprobe_buffer(struct trace_uprobe *tu,
 
store_trace_args(ucb->buf, >tp, regs, esize, dsize);
 
+   *ucbp = ucb;
return ucb;
 }
 
 static void __uprobe_trace_func(struct trace_uprobe *tu,
unsigned long func, struct pt_regs *regs,
-   struct uprobe_cpu_buffer *ucb,
+   struct uprobe_cpu_buffer **ucbp,
struct trace_event_file *trace_file)
 {
struct uprobe_trace_entry_head *entry;
struct trace_event_buffer fbuffer;
+   struct uprobe_cpu_buffer *ucb;
void *data;
int size, esize;
struct trace_event_call *call = trace_probe_event_call(>tp);
 
WARN_ON(call != trace_file->event_call);
 
+   ucb = prepare_uprobe_buffer(tu, regs, ucbp);
if (WARN_ON_ONCE(ucb->dsize > PAGE_SIZE))
return;
 
@@ -1002,7 +1011,7 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,
 
 /* uprobe handler */
 static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
-struct uprobe_cpu_buffer *ucb)
+struct uprobe_cpu_buffer **ucbp)
 {
struct event_file_link *link;
 
@@ -1011,7 +1020,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,
 
rcu_read_lock();

[PATCH v2 1/3] uprobes: encapsulate preparation of uprobe args buffer

2024-03-18 Thread Andrii Nakryiko

Move the logic of fetching temporary per-CPU uprobe buffer and storing
uprobes args into it to a new helper function. Store data size as part
of this buffer, simplifying interfaces a bit, as now we only pass single
uprobe_cpu_buffer reference around, instead of pointer + dsize.

This logic was duplicated across uprobe_dispatcher and uretprobe_dispatcher,
and now will be centralized. All this is also in preparation to make
this uprobe_cpu_buffer handling logic optional in the next patch.

Reviewed-by: Jiri Olsa 
Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_uprobe.c | 78 +++--
 1 file changed, 41 insertions(+), 37 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index a84b85d8aac1..9bffaab448a6 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -854,6 +854,7 @@ static const struct file_operations uprobe_profile_ops = {
 struct uprobe_cpu_buffer {
struct mutex mutex;
void *buf;
+   int dsize;
 };
 static struct uprobe_cpu_buffer __percpu *uprobe_cpu_buffer;
 static int uprobe_buffer_refcnt;
@@ -943,9 +944,26 @@ static void uprobe_buffer_put(struct uprobe_cpu_buffer 
*ucb)
mutex_unlock(>mutex);
 }
 
+static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
+  struct pt_regs *regs)
+{
+   struct uprobe_cpu_buffer *ucb;
+   int dsize, esize;
+
+   esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
+   dsize = __get_data_size(>tp, regs);
+
+   ucb = uprobe_buffer_get();
+   ucb->dsize = tu->tp.size + dsize;
+
+   store_trace_args(ucb->buf, >tp, regs, esize, dsize);
+
+   return ucb;
+}
+
 static void __uprobe_trace_func(struct trace_uprobe *tu,
unsigned long func, struct pt_regs *regs,
-   struct uprobe_cpu_buffer *ucb, int dsize,
+   struct uprobe_cpu_buffer *ucb,
struct trace_event_file *trace_file)
 {
struct uprobe_trace_entry_head *entry;
@@ -956,14 +974,14 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,
 
WARN_ON(call != trace_file->event_call);
 
-   if (WARN_ON_ONCE(tu->tp.size + dsize > PAGE_SIZE))
+   if (WARN_ON_ONCE(ucb->dsize > PAGE_SIZE))
return;
 
if (trace_trigger_soft_disabled(trace_file))
return;
 
esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
-   size = esize + tu->tp.size + dsize;
+   size = esize + ucb->dsize;
entry = trace_event_buffer_reserve(, trace_file, size);
if (!entry)
return;
@@ -977,14 +995,14 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,
data = DATAOF_TRACE_ENTRY(entry, false);
}
 
-   memcpy(data, ucb->buf, tu->tp.size + dsize);
+   memcpy(data, ucb->buf, ucb->dsize);
 
trace_event_buffer_commit();
 }
 
 /* uprobe handler */
 static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
-struct uprobe_cpu_buffer *ucb, int dsize)
+struct uprobe_cpu_buffer *ucb)
 {
struct event_file_link *link;
 
@@ -993,7 +1011,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,
 
rcu_read_lock();
trace_probe_for_each_link_rcu(link, >tp)
-   __uprobe_trace_func(tu, 0, regs, ucb, dsize, link->file);
+   __uprobe_trace_func(tu, 0, regs, ucb, link->file);
rcu_read_unlock();
 
return 0;
@@ -1001,13 +1019,13 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,
 
 static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
 struct pt_regs *regs,
-struct uprobe_cpu_buffer *ucb, int dsize)
+struct uprobe_cpu_buffer *ucb)
 {
struct event_file_link *link;
 
rcu_read_lock();
trace_probe_for_each_link_rcu(link, >tp)
-   __uprobe_trace_func(tu, func, regs, ucb, dsize, link->file);
+   __uprobe_trace_func(tu, func, regs, ucb, link->file);
rcu_read_unlock();
 }
 
@@ -1335,7 +1353,7 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,
 
 static void __uprobe_perf_func(struct trace_uprobe *tu,
   unsigned long func, struct pt_regs *regs,
-  struct uprobe_cpu_buffer *ucb, int dsize)
+  struct uprobe_cpu_buffer *ucb)
 {
struct trace_event_call *call = trace_probe_event_call(>tp);
struct uprobe_trace_entry_head *entry;
@@ -1356,7 +1374,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 
esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
 
-   size = esize + tu->tp.size + dsize;
+   size = esize + ucb->dsize;

[PATCH v2 0/3] uprobes: two common case speed ups

2024-03-18 Thread Andrii Nakryiko

This patch set implements two speed ups for uprobe/uretprobe runtime execution
path for some common scenarios: BPF-only uprobes (patches #1 and #2) and
system-wide (non-PID-specific) uprobes (patch #3). Please see individual
patches for details.

v1->v2:
  - rebased onto trace/core branch of tracing tree, hopefully I guessed right;
  - simplified user_cpu_buffer usage further (Oleg Nesterov);
  - simplified patch #3, just moved speculative check outside of lock (Oleg);
  - added Reviewed-by from Jiri Olsa.

Andrii Nakryiko (3):
  uprobes: encapsulate preparation of uprobe args buffer
  uprobes: prepare uprobe args buffer lazily
  uprobes: add speculative lockless system-wide uprobe filter check

 kernel/trace/trace_uprobe.c | 103 +---
 1 file changed, 59 insertions(+), 44 deletions(-)

-- 
2.43.0

[GIT PULL] tracing/tools: Updates for 6.9

2024-03-18 Thread Daniel Bristot de Oliveira

Steven,

Tracing tooling updates for 6.9

Tracing:
- Update makefiles for latency-collector and RTLA,
  using tools/build/ makefiles like perf does, inheriting
  its benefits. For example, having a proper way to
  handle dependencies.

- The timerlat tracer has an interface for any tool to use.
  rtla timerlat tool uses this interface dispatching its
  own threads as workload. But, rtla timerlat could also be
  used for any other process. So, add 'rtla timerlat -U'
  option, allowing the timerlat tool to measure the latency of
  any task using the timerlat tracer interface.

Verification:
- Update makefiles for verification/rv, using tools/build/
  makefiles like perf does, inheriting its benefits.
  For example, having a proper way to handle dependencies.


Please pull the latest trace-tools-v6.9 tree, which can be found at:


  git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git
trace-tools-v6.9

Tag SHA1: 2eb09a97c56af3c27bd9dcebccb495f70d56d5c0
Head SHA1: 9c63d9f58a42b979a42bcaed534d9246996ac0d9


Daniel Bristot de Oliveira (4):
  tools/tracing: Use tools/build makefiles on latency-collector
  tools/rtla: Use tools/build makefiles to build rtla
  tools/verification: Use tools/build makefiles on rv
  tools/rtla: Add -U/--user-load option to timerlat


 .../tools/rtla/common_timerlat_options.rst |   6 +
 tools/tracing/latency/.gitignore   |   5 +-
 tools/tracing/latency/Build|   1 +
 tools/tracing/latency/Makefile | 105 --
 tools/tracing/latency/Makefile.config  |  30 +++
 tools/tracing/rtla/.gitignore  |   7 +-
 tools/tracing/rtla/Build   |   1 +
 tools/tracing/rtla/Makefile| 217 +++--
 tools/tracing/rtla/Makefile.config |  47 +
 tools/tracing/rtla/Makefile.rtla   |  80 
 tools/tracing/rtla/Makefile.standalone |  26 +++
 tools/tracing/rtla/sample/timerlat_load.py |  74 +++
 tools/tracing/rtla/src/Build   |  11 ++
 tools/tracing/rtla/src/timerlat_hist.c |  16 +-
 tools/tracing/rtla/src/timerlat_top.c  |  14 +-
 tools/verification/rv/.gitignore   |   6 +
 tools/verification/rv/Build|   1 +
 tools/verification/rv/Makefile | 207 +++-
 tools/verification/rv/Makefile.config  |  47 +
 tools/verification/rv/Makefile.rv  |  51 +
 tools/verification/rv/src/Build|   4 +
 21 files changed, 650 insertions(+), 306 deletions(-)
 create mode 100644 tools/tracing/latency/Build
 create mode 100644 tools/tracing/latency/Makefile.config
 create mode 100644 tools/tracing/rtla/Build
 create mode 100644 tools/tracing/rtla/Makefile.config
 create mode 100644 tools/tracing/rtla/Makefile.rtla
 create mode 100644 tools/tracing/rtla/Makefile.standalone
 create mode 100644 tools/tracing/rtla/sample/timerlat_load.py
 create mode 100644 tools/tracing/rtla/src/Build
 create mode 100644 tools/verification/rv/.gitignore
 create mode 100644 tools/verification/rv/Build
 create mode 100644 tools/verification/rv/Makefile.config
 create mode 100644 tools/verification/rv/Makefile.rv
 create mode 100644 tools/verification/rv/src/Build
---

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-18 Thread Will Deacon

On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> The issue is reported by Yihuang Yu who have 'netperf' test on
> NVidia's grace-grace and grace-hopper machines. The 'netperf'
> client is started in the VM hosted by grace-hopper machine,
> while the 'netperf' server is running on grace-grace machine.
> 
> The VM is started with virtio-net and vhost has been enabled.
> We observe a error message spew from VM and then soft-lockup
> report. The error message indicates the data associated with
> the descriptor (index: 135) has been released, and the queue
> is marked as broken. It eventually leads to the endless effort
> to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> and soft-lockup. The stale index 135 is fetched from the available
> ring and published to the used ring by vhost, meaning we have
> disordred write to the available ring element and available index.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
>   -accel kvm -machine virt,gic-version=host\
>  : \
>   -netdev tap,id=vnet0,vhost=on\
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> 
>   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> 
> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> ARM64. It should work for other architectures, but performance loss is
> expected.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Yihuang Yu 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/virtio/virtio_ring.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 49299b1f9ec7..7d852811c912 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
> *_vq,
>   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>  
> - /* Descriptors and available array need to be set before we expose the
> -  * new available array entries. */
> - virtio_wmb(vq->weak_barriers);
> + /*
> +  * Descriptors and available array need to be set before we expose
> +  * the new available array entries. virtio_wmb() should be enough
> +  * to ensuere the order theoretically. However, a stronger barrier
> +  * is needed by ARM64. Otherwise, the stale data can be observed
> +  * by the host (vhost). A stronger barrier should work for other
> +  * architectures, but performance loss is expected.
> +  */
> + virtio_mb(false);
>   vq->split.avail_idx_shadow++;
>   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>   vq->split.avail_idx_shadow);

Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
here, especially when ordering accesses to coherent memory.

In practice, either the larger timing different from the DSB or the fact
that you're going from a Store->Store barrier to a full barrier is what
makes things "work" for you. Have you tried, for example, a DMB SY
(e.g. via __smb_mb()).

We definitely shouldn't take changes like this without a proper
explanation of what is going on.

Will

Re: TP_printk() bug with %c, and more?

2024-03-18 Thread Steven Rostedt

On Mon, 18 Mar 2024 16:43:07 +0100
Luca Ceresoli  wrote:

> Indeed I was on an older version, apologies.
> 
> I upgraded both libtraceevent and trace-cmd to master and applied your
> patch, now the %c is formatted correctly.
> 
> However the arrows are still reversed.
> 
> Is this what you were expecting?

No, I didn't look at the arrows, just the %c issue. I'll try to get some
time to do that.

-- Steve

Re: TP_printk() bug with %c, and more?

2024-03-18 Thread Luca Ceresoli

Hello Steven,

On Fri, 15 Mar 2024 14:58:52 -0400
Steven Rostedt  wrote:

> On Fri, 15 Mar 2024 19:03:12 +0100
> Luca Ceresoli  wrote:
> 
> > > > 
> > > > I've come across an unexpected behaviour in the kernel tracing
> > > > infrastructure that looks like a bug, or maybe two.
> > > > 
> > > > Cc-ing ASoC maintainers for as it appeared using ASoC traces, but it
> > > > does not look ASoC-specific.
> > > > 
> > > > It all started when using this trace-cmd sequence on an ARM64 board
> > > > running a mainline 6.8.0-rc7 kernel:
> > > > 
> > > >   trace-cmd record -e snd_soc_dapm_path ./my-play
> > > >   trace-cmd report
> > > > 
> > > > While this produces perfectly valid traces for other asoc events,
> > > > the snd_soc_dapm_path produces:
> > > > 
> > > >   snd_soc_dapm_path:>c<* MIC1_EN <- (direct) <-
> > > > 
> > > > instead of the expected:
> > > > 
> > > >   snd_soc_dapm_path:*MIC1 <- (direct) <- MIC1_EN
> > > > 
> > > > The originating macro is:
> > > > 
> > > > TP_printk("%c%s %s %s %s %s",
> > > > (int) __entry->path_node &&
> > > > (int) __entry->path_connect ? '*' : ' ',
> > > > __get_str(wname), DAPM_ARROW(__entry->path_dir),
> > > > __get_str(pname), DAPM_ARROW(__entry->path_dir),
> > > > __get_str(pnname))
> > > > 
> > > > It appears as if the %c placeholder always produces the three ">c<"
> > > > characters, the '*' or ' ' char is printed as the first %s, all the
> > > > other strings are shifted right by one position and the last string is
> > > > never printed.
> > > > 
> > > > On my x86_64 laptop running the default Ubuntu kernel (6.5) I'm able to
> > > > trace a few events having a '%c' in their TP_printk() macros and the
> > > > result is:
> > > > 
> > > >   intel_pipe_update_start: dev :00:02.0, pipe >c<, frame=1,
> > > >   scanline=107856, min=2208, max=2154
> > > >   
> > > 
> > > What does /sys/kernel/tracing/trace show?
> > 
> > It is correct:
> > 
> >intel_pipe_update_start: dev :00:02.0, pipe B, frame=377644, 
> > scanline=1466, min=2154, max=2159
> >   
> > > If that's fine, then the bug is in libtraceevent and not the kernel.
> > > 
> > > I'm testing it out now, and I see %c not being processed properly by
> > > libtraceevent. I'll take a deeper look.
> > 
> > Thanks.
> >   
> > > > originating from:
> > > > 
> > > >   TP_printk("dev %s, pipe %c, frame=%u, scanline=%u, min=%u, max=%u",
> > > > 
> > > > Here it looks like the %c produced ">c<" again, but apparently without
> > > > any shifting.
> > > > 
> > > > Back on the ARM64 board I found a couple interesting clues.
> > > > 
> > > > First, using the /tracing/ interface instead of trace-cmd, I'm
> > > > getting correctly formatted strings:
> > > > 
> > > > trace-cmd: snd_soc_dapm_path: >c<* HPOUT_L -> (direct) ->
> > > > debugfs:   snd_soc_dapm_path: *HPOUT_L <- (direct) <- HPOUT_POP_SOUND_L
> > > > 
> > > > Notice the arrows pointing to the opposite direction though. The correct
> > > > arrow is the one in the debugfs run.
> > 
> > This other issue appears a separate bug however.  
> 
> Can you make user you have the latest libtraceevent from:
> 
>git://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git
> 
> And apply this patch.
> 
> Thanks,
> 
> -- Steve
> 
> diff --git a/src/event-parse.c b/src/event-parse.c
> index d607556..61b0966 100644
> --- a/src/event-parse.c
> +++ b/src/event-parse.c
> @@ -3732,8 +3732,19 @@ process_arg_token(struct tep_event *event, struct 
> tep_print_arg *arg,
>   arg->atom.atom = atom;
>   break;
>  
> - case TEP_EVENT_DQUOTE:
>   case TEP_EVENT_SQUOTE:
> + arg->type = TEP_PRINT_ATOM;
> + /* Make characters into numbers */
> + if (asprintf(>atom.atom, "%d", token[0]) < 0) {
> + free_token(token);
> + *tok = NULL;
> + arg->atom.atom = NULL;
> + return TEP_EVENT_ERROR;
> + }
> + free_token(token);
> + type = read_token_item(event->tep, );
> + break;
> + case TEP_EVENT_DQUOTE:
>   arg->type = TEP_PRINT_ATOM;
>   arg->atom.atom = token;
>   type = read_token_item(event->tep, );

Indeed I was on an older version, apologies.

I upgraded both libtraceevent and trace-cmd to master and applied your
patch, now the %c is formatted correctly.

However the arrows are still reversed.

Is this what you were expecting?

Luca

-- 
Luca Ceresoli, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

Re: net/sunrpc/sched.c: error: result of comparison against a string literal is unspecified (use an explicit string comparison function instead)

2024-03-18 Thread Nathan Chancellor

Hi Naresh,

On Mon, Mar 18, 2024 at 02:55:54PM +0530, Naresh Kamboju wrote:
> The following build warnings / errors noticed on x86 kselftests build with
> clang nightly  / clang-17 on Linux  next tag next-20240318.
> 
> This build config is generated from kselftest merge configs [1].
> 
> Reported-by: Linux Kernel Functional Testing 
> 
> Build log:
> ---
> In file included from net/sunrpc/sched.c:31:
> In file included from include/trace/events/sunrpc.h:2524:
> In file included from include/trace/define_trace.h:102:
> In file included from include/trace/trace_events.h:419:
> include/trace/events/sunrpc.h:707:4: error: result of comparison
> against a string literal is unspecified (use an explicit string
> comparison function instead) [-Werror,-Wstring-compare]
>   667 | __assign_str(progname,
>   | ~~
>   668 |  
> task->tk_client->cl_program->name);
>   |  
> ~~~
>   669 | __entry->version = task->tk_client->cl_vers;
>   | 
>   670 | __assign_str(procedure,
> task->tk_msg.rpc_proc->p_name);
>   |

Thanks for the report. This is caused by commit 433e1d88a3be ("tracing:
Add warning if string in __assign_str() does not match __string()") from
the tracing tree, not a change on the netdev side (although I have left
them on CC for this reply so they are aware of that).

To resolve it, [1] needs to be applied then the following diff can be
applied on top of that to fully clear it up (as __builtin_constant_p()
does not influence diagnostics in the front end, so the warning still
triggers on the else branch when it will really use strcmp(), as Steve
tested at [2]).

[1]: https://lore.kernel.org/20240312113002.00031...@gandalf.local.home/
[2]: https://lore.kernel.org/20240313161420.3b668...@gandalf.local.home/

diff --git a/include/trace/stages/stage6_event_callback.h 
b/include/trace/stages/stage6_event_callback.h
index 83da83a0c14f..dbd27adb1b83 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -35,9 +35,14 @@
do {\
char *__str__ = __get_str(dst); \
int __len__ = __get_dynamic_array_len(dst) - 1; \
+   __diag_push();  \
+   __diag_ignore(clang, 13, "-Wstring-compare",\
+ "__builtin_constant_p() ensures strcmp()" \
+ "will be used for string literals");  \
WARN_ON_ONCE(__builtin_constant_p(src) ?\
 strcmp((src), __data_offsets.dst##_ptr_) : \
 (src) != __data_offsets.dst##_ptr_);   \
+   __diag_pop();   \
memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
   EVENT_NULL_STR, __len__);\
__str__[__len__] = '\0';\

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-18 Thread Honggyu Kim

Hi SeongJae,

On Sun, 17 Mar 2024 12:13:57 -0700 SeongJae Park  wrote:
> On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park  wrote:
> 
> > Hi Honggyu,
> > 
> > On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  wrote:
> > 
> > > Hi SeongJae,
> > > 
> > > Thanks for the confirmation.  I have a few comments on young filter so
> > > please read the inline comments again.
> > > 
> > > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  wrote:
> > > > Hi Honggyu,
> [...]
> > > Thanks.  I see that it works fine, but I would like to have more
> > > discussion about "young" filter.  What I think about filter is that if I
> > > apply "young" filter "true" for demotion, then the action applies only
> > > for "young" pages, but the current implementation works opposite.
> > > 
> > > I understand the function name of internal implementation is
> > > "damos_pa_filter_out" so the basic action is filtering out, but the
> > > cgroup filter works in the opposite way for now.
> > 
> > Does memcg filter works in the opposite way?  I don't think so because
> > __damos_pa_filter_out() sets 'matches' as 'true' only if the the given 
> > folio is
> > contained in the given memcg.  'young' filter also simply sets 'matches' as
> > 'true' only if the given folio is young.
> > 
> > If it works in the opposite way, it's a bug that need to be fixed.  Please 
> > let
> > me know if I'm missing something.
> 
> I just read the DAMOS filters part of the documentation for DAMON sysfs
> interface again.  I believe it is explaining the meaning of 'matching' as I
> intended to, as below:
> 
> You can write ``Y`` or ``N`` to ``matching`` file to filter out pages 
> that does
> or does not match to the type, respectively.  Then, the scheme's action 
> will
> not be applied to the pages that specified to be filtered out.
> 
> But, I found the following example for memcg filter is wrong, as below:
> 
> For example, below restricts a DAMOS action to be applied to only 
> non-anonymous
> pages of all memory cgroups except ``/having_care_already``.::
> 
> # echo 2 > nr_filters
> # # filter out anonymous pages
> echo anon > 0/type
> echo Y > 0/matching
> # # further filter out all cgroups except one at 
> '/having_care_already'
> echo memcg > 1/type
> echo /having_care_already > 1/memcg_path
> echo N > 1/matching
> 
> Specifically, the last line of the commands should write 'Y' instead of 'N' to
> do what explained.  Without the fix, the action will be applied to only
> non-anonymous pages of 'having_care_already' memcg.  This is definitely wrong.
> I will fix this soon.  I'm unsure if this is what made you to believe memcg
> DAMOS filter is working in the opposite way, though.

I got confused not because of this.  I just think it again that this
user interface is better to be more intuitive as I mentioned in the
previous thread.

Thanks,
Honggyu

> 
> Thanks,
> SJ
> 
> [...]

Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-18 Thread Honggyu Kim

Hi SeongJae,

On Sun, 17 Mar 2024 08:31:44 -0700 SeongJae Park  wrote:
> Hi Honggyu,
> 
> On Sun, 17 Mar 2024 17:36:29 +0900 Honggyu Kim  wrote:
> 
> > Hi SeongJae,
> > 
> > Thanks for the confirmation.  I have a few comments on young filter so
> > please read the inline comments again.
> > 
> > On Wed, 12 Mar 2024 08:53:00 -0800 SeongJae Park  wrote:
> > > Hi Honggyu,
> > > 
> > > > > -Original Message-
> > > > > From: SeongJae Park 
> > > > > Sent: Tuesday, March 12, 2024 3:33 AM
> > > > > To: Honggyu Kim 
> > > > > Cc: SeongJae Park ; kernel_team 
> > > > > 
> > > > > Subject: RE: Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory 
> > > > > management for CXL memory
> > > > >
> > > > > Hi Honggyu,
> > > > >
> > > > > On Mon, 11 Mar 2024 12:51:12 + "honggyu@sk.com" 
> > > > >  wrote:
> > > > >
> > > > > > Hi SeongJae,
> > > > > >
> > > > > > I've tested it again and found that "young" filter has to be set
> > > > > > differently as follows.
> > > > > > - demote action: set "young" filter with "matching" true
> > > > > > - promote action: set "young" filter with "matching" false

Thinking it again, I feel like "matching" true or false looks quite
vague to me as a general user.

Instead, I would like to have more meaningful names for "matching" as
follows.

- matching "true" can be either (filter) "out" or "skip".
- matching "false" can be either (filter) "in" or "apply".

Internally, the type of "matching" can be boolean, but it'd be better
for general users have another ways to set it such as "out"/"in" or
"skip"/"apply" via sysfs interface.  I prefer "skip" and "apply" looks
more intuitive, but I don't have strong objection on "out" and "in" as
well.

I also feel the filter name "young" is more for developers not for
general users.  I think this can be changed to "accessed" filter
instead.

The demote and promote filters can be set as follows using these.

- demote action: set "accessed" filter with "matching" to "skip"
- promote action: set "accessed" filter with "matching" to "apply"

I also think that you can feel this is more complicated so I would like
to hear how you think about this.

> > > > >
> > > > > DAMOS filter is basically for filtering "out" memory regions that 
> > > > > matches to
> > > > > the condition.

Right.  In other tools, I see filters are more used as filtering "in"
rather than filtering "out".  I feel this makes me more confused.

> > > > > Hence in your setup, young pages are not filtered out from
> > > > > demote action target.
> > > > 
> > > > I thought young filter true means "young pages ARE filtered out" for 
> > > > demotion.
> > > 
> > > You're correct.
> > 
> > Ack.
> > 
> > > > 
> > > > > That is, you're demoting pages that "not" young.
> > > > 
> > > > Your explanation here looks opposite to the previous statement.
> > > 
> > > Again, you're correct.  My intention was "non-young pages are not ..." but
> > > maybe I was out of my mind and mistakenly removed "non-" part.  Sorry for 
> > > the
> > > confusion.
> > 
> > No problem.  I also think it's quite confusing.
> > 
> > > > 
> > > > > And vice versa, so you're applying promote to non-non-young (young) 
> > > > > pages.
> > > > 
> > > > Yes, I understand "promote non-non-young pages" means "promote young 
> > > > pages".
> > > > This might be understood as "young pages are NOT filtered out" for 
> > > > promotion
> > > > but it doesn't mean that "old pages are filtered out" instead.
> > > > And we just rely hot detection only on DAMOS logics such as access 
> > > > frequency
> > > > and age. Am I correct?
> > > 
> > > You're correct.
> > 
> > Ack.  But if it doesn't mean that "old pages are filtered out" instead,
> 
> It does mean that.  Here, filtering is exclusive.  Hence, "filter-in a type of
> pages" means "filter-out pages of other types".  At least that's the 
> intention.
> To quote the documentation
> (https://docs.kernel.org/mm/damon/design.html#filters),
> 
> Each filter specifies the type of target memory, and whether it should
> exclude the memory of the type (filter-out), or all except the memory of
> the type (filter-in).

Thanks for the correction.

> > then do we really need this filter for promotion?  If not, maybe should
> > we create another "old" filter for promotion?  As of now, the promotion
> > is mostly done inaccurately, but the accurate migration is done at
> > demotion level.
> 
> Is this based on your theory?  Or, a real behavior that you're seeing from 
> your
> setup?  If this is a real behavior, I think that should be a bug that need to
> be fixed.

I have observed this in the hot_cold example, but I also found that the
promotion is done very quickly because the age for promote action is set
to 0 to max in my json setup.  It makes most pages of the region are
young because there is not enough time for those pages being old.  That
means I was wrong.

> > To avoid this issue, I feel we should promotion only "young" pages after
> > filtering "old" pages out.
>

[PATCH v1] virtio-mem: support suspend+resume

2024-03-18 Thread David Hildenbrand

With virtio-mem, primarily hibernation is problematic: as the machine shuts
down, the virtio-mem device loses its state. Powering the machine back up
is like losing a bunch of DIMMs. While there would be ways to add limited
support, suspend+resume is more commonly used for VMs and "easier" to
support cleanly.

s2idle can be supported without any device dependencies. Similarly, one
would expect suspend-to-ram (i.e., S3) to work out of the box. However,
QEMU currently unplugs all device memory when resuming the VM, using a
cold reset on the "wakeup" path. In order to support S3, we need a feature
flag for the device to tell us if memory remains plugged when waking up. In
the future, QEMU will implement this feature.

So let's always support s2idle and support S3 with plugged memory only if
the device indicates support. Block hibernation early using the PM
notifier.

Trying to hibernate now fails early:
# echo disk > /sys/power/state
[   26.455369] PM: hibernation: hibernation entry
[   26.458271] virtio_mem virtio0: hibernation is not supported.
[   26.462498] PM: hibernation: hibernation exit
-bash: echo: write error: Operation not permitted

s2idle works even without the new feature bit:
# echo s2idle > /sys/power/mem_sleep
# echo mem > /sys/power/state
[   52.083725] PM: suspend entry (s2idle)
[   52.095950] Filesystems sync: 0.010 seconds
[   52.101493] Freezing user space processes
[   52.104213] Freezing user space processes completed (elapsed 0.001 
seconds)
[   52.106520] OOM killer disabled.
[   52.107655] Freezing remaining freezable tasks
[   52.110880] Freezing remaining freezable tasks completed (elapsed 
0.001 seconds)
[   52.113296] printk: Suspending console(s) (use no_console_suspend to 
debug)

S3 does not work without the feature bit when memory is plugged:
# echo deep > /sys/power/mem_sleep
# echo mem > /sys/power/state
[   32.788281] PM: suspend entry (deep)
[   32.816630] Filesystems sync: 0.027 seconds
[   32.820029] Freezing user space processes
[   32.823870] Freezing user space processes completed (elapsed 0.001 
seconds)
[   32.827756] OOM killer disabled.
[   32.829608] Freezing remaining freezable tasks
[   32.833842] Freezing remaining freezable tasks completed (elapsed 
0.001 seconds)
[   32.837953] printk: Suspending console(s) (use no_console_suspend to 
debug)
[   32.916172] virtio_mem virtio0: suspend+resume with plugged memory 
is not supported
[   32.916181] virtio-pci :00:02.0: PM: pci_pm_suspend(): 
virtio_pci_freeze+0x0/0x50 returns -1
[   32.916197] virtio-pci :00:02.0: PM: dpm_run_callback(): 
pci_pm_suspend+0x0/0x170 returns -1
[   32.916210] virtio-pci :00:02.0: PM: failed to suspend async: 
error -1

But S3 works with the new feature bit when memory is plugged (patched
QEMU):
# echo deep > /sys/power/mem_sleep
# echo mem > /sys/power/state
[   33.983694] PM: suspend entry (deep)
[   34.009828] Filesystems sync: 0.024 seconds
[   34.013589] Freezing user space processes
[   34.016722] Freezing user space processes completed (elapsed 0.001 
seconds)
[   34.019092] OOM killer disabled.
[   34.020291] Freezing remaining freezable tasks
[   34.023549] Freezing remaining freezable tasks completed (elapsed 
0.001 seconds)
[   34.026090] printk: Suspending console(s) (use no_console_suspend to 
debug)

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Xuan Zhuo 
Signed-off-by: David Hildenbrand 
---

I had QEMU support ready [1] but reset-related things just changed upstream
that will require a bit of a rework -- and it will end up looking cleaner.

Will come back to upstreaming the QEMU part once I can properly sync
the Linux headers to contain VIRTIO_MEM_F_PERSISTENT_SUSPEND.

[1] https://github.com/davidhildenbrand/qemu/tree/virtio-mem-suspend

---
 drivers/virtio/virtio_mem.c | 68 ++---
 include/uapi/linux/virtio_mem.h |  2 +
 2 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 8e32232944423..51088d02de32f 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -252,6 +253,9 @@ struct virtio_mem {
/* Memory notifier (online/offline events). */
struct notifier_block memory_notifier;
 
+   /* Notifier to block hibernation image storing/reloading. */
+   struct notifier_block pm_notifier;
+
 #ifdef CONFIG_PROC_VMCORE
/* vmcore callback for /proc/vmcore handling in kdump mode */
struct vmcore_cb vmcore_cb;
@@ -,6 +1115,25 @@ static int virtio_mem_memory_notifier_cb(struct 
notifier_block *nb,
return rc;

Re: [PATCH] usb: typec: ptn36502: switch to DRM_AUX_BRIDGE

2024-03-18 Thread Heikki Krogerus

On Fri, Mar 15, 2024 at 05:04:22PM +0100, Luca Weiss wrote:
> Switch to using the new DRM_AUX_BRIDGE helper to create the transparent
> DRM bridge device instead of handcoding corresponding functionality.
> 
> Signed-off-by: Luca Weiss 

Reviewed-by: Heikki Krogerus 

> ---
> Very similar to this patch:
> c5d296bad640 ("usb: typec: nb7vpq904m: switch to DRM_AUX_BRIDGE")
> ---
>  drivers/usb/typec/mux/Kconfig|  2 +-
>  drivers/usb/typec/mux/ptn36502.c | 44 
> ++--
>  2 files changed, 3 insertions(+), 43 deletions(-)
> 
> diff --git a/drivers/usb/typec/mux/Kconfig b/drivers/usb/typec/mux/Kconfig
> index 399c7b0983df..4827e86fed6d 100644
> --- a/drivers/usb/typec/mux/Kconfig
> +++ b/drivers/usb/typec/mux/Kconfig
> @@ -60,7 +60,7 @@ config TYPEC_MUX_PTN36502
>   tristate "NXP PTN36502 Type-C redriver driver"
>   depends on I2C
>   depends on DRM || DRM=n
> - select DRM_PANEL_BRIDGE if DRM
> + select DRM_AUX_BRIDGE if DRM_BRIDGE
>   select REGMAP_I2C
>   help
> Say Y or M if your system has a NXP PTN36502 Type-C redriver chip
> diff --git a/drivers/usb/typec/mux/ptn36502.c 
> b/drivers/usb/typec/mux/ptn36502.c
> index 72ae38a1b2be..0ec86ef32a87 100644
> --- a/drivers/usb/typec/mux/ptn36502.c
> +++ b/drivers/usb/typec/mux/ptn36502.c
> @@ -8,7 +8,7 @@
>   * Copyright (C) 2023 Dmitry Baryshkov 
>   */
>  
> -#include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -68,8 +68,6 @@ struct ptn36502 {
>  
>   struct typec_switch *typec_switch;
>  
> - struct drm_bridge bridge;
> -
>   struct mutex lock; /* protect non-concurrent retimer & switch */
>  
>   enum typec_orientation orientation;
> @@ -283,44 +281,6 @@ static int ptn36502_detect(struct ptn36502 *ptn)
>   return 0;
>  }
>  
> -#if IS_ENABLED(CONFIG_OF) && IS_ENABLED(CONFIG_DRM_PANEL_BRIDGE)
> -static int ptn36502_bridge_attach(struct drm_bridge *bridge,
> -   enum drm_bridge_attach_flags flags)
> -{
> - struct ptn36502 *ptn = container_of(bridge, struct ptn36502, bridge);
> - struct drm_bridge *next_bridge;
> -
> - if (!(flags & DRM_BRIDGE_ATTACH_NO_CONNECTOR))
> - return -EINVAL;
> -
> - next_bridge = devm_drm_of_get_bridge(>client->dev, 
> ptn->client->dev.of_node, 0, 0);
> - if (IS_ERR(next_bridge)) {
> - dev_err(>client->dev, "failed to acquire drm_bridge: 
> %pe\n", next_bridge);
> - return PTR_ERR(next_bridge);
> - }
> -
> - return drm_bridge_attach(bridge->encoder, next_bridge, bridge,
> -  DRM_BRIDGE_ATTACH_NO_CONNECTOR);
> -}
> -
> -static const struct drm_bridge_funcs ptn36502_bridge_funcs = {
> - .attach = ptn36502_bridge_attach,
> -};
> -
> -static int ptn36502_register_bridge(struct ptn36502 *ptn)
> -{
> - ptn->bridge.funcs = _bridge_funcs;
> - ptn->bridge.of_node = ptn->client->dev.of_node;
> -
> - return devm_drm_bridge_add(>client->dev, >bridge);
> -}
> -#else
> -static int ptn36502_register_bridge(struct ptn36502 *ptn)
> -{
> - return 0;
> -}
> -#endif
> -
>  static const struct regmap_config ptn36502_regmap = {
>   .max_register = 0x0d,
>   .reg_bits = 8,
> @@ -369,7 +329,7 @@ static int ptn36502_probe(struct i2c_client *client)
>   if (ret)
>   goto err_disable_regulator;
>  
> - ret = ptn36502_register_bridge(ptn);
> + ret = drm_aux_bridge_register(dev);
>   if (ret)
>   goto err_disable_regulator;
>  
> 
> ---
> base-commit: 9bb9b28d0568991b1d63e66fe75afa5f97ad1156
> change-id: 20240315-ptn36502-aux-15dd6f289aff
> 
> Best regards,
> -- 
> Luca Weiss 

-- 
heikki

[PATCH] tracing: Explicitly cast divisor to fix Coccinelle warning

2024-03-18 Thread Thorsten Blum

Explicitly cast the divisor to u32 to fix a Coccinelle/coccicheck warning
reported by do_div.cocci.

Signed-off-by: Thorsten Blum 
---
 kernel/trace/trace_benchmark.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_benchmark.c b/kernel/trace/trace_benchmark.c
index 54d5fa35c90a..218b40050629 100644
--- a/kernel/trace/trace_benchmark.c
+++ b/kernel/trace/trace_benchmark.c
@@ -105,7 +105,7 @@ static void trace_do_benchmark(void)
stddev = 0;
 
delta = bm_total;
-   do_div(delta, bm_cnt);
+   do_div(delta, (u32)bm_cnt);
avg = delta;
 
if (stddev > 0) {
-- 
2.44.0

Re: [PATCH 2/2] ARM: dts: qcom: msm8974: Add empty chosen node

2024-03-18 Thread Konrad Dybcio

On 18.03.2024 10:24, Luca Weiss wrote:
> Add an empty /chosen node to the dtsi like is common on most other
> Qualcomm SoC files, so that various pieces of software expecting this
> node to exist don't complain.
> 
> Signed-off-by: Luca Weiss 
> ---

Kinda weird dtc doesn't add it automatically or complain about its
absence at this point.. perhaps it could be taught about the latter

Reviewed-by: Konrad Dybcio 

Konrad

Re: [PATCH 1/2] ARM: dts: qcom: msm8974: Add @0 to memory node name

2024-03-18 Thread Konrad Dybcio

On 18.03.2024 10:24, Luca Weiss wrote:
> Add the @0 from reg to the node name, so that both dtc warning and dt
> validation failure get resolved.
> 
>   arch/arm/boot/dts/qcom/qcom-msm8974.dtsi:106.9-109.4: Warning 
> (unit_address_vs_reg): /memory: node has a reg or ranges property, but no 
> unit name
> 
>   [..]/arch/arm/boot/dts/qcom/qcom-msm8974pro-fairphone-fp2.dtb: /: memory: 
> False schema does not allow {'device_type': ['memory'], 'reg': [[0, 0]]}
>   from schema $id: http://devicetree.org/schemas/root-node.yaml#
> 
> Signed-off-by: Luca Weiss 
> ---

Looks like it's indeed the start of RAM

Reviewed-by: Konrad Dybcio 

Konrad

[PATCH 1/2] ARM: dts: qcom: msm8974: Add @0 to memory node name

2024-03-18 Thread Luca Weiss

Add the @0 from reg to the node name, so that both dtc warning and dt
validation failure get resolved.

  arch/arm/boot/dts/qcom/qcom-msm8974.dtsi:106.9-109.4: Warning 
(unit_address_vs_reg): /memory: node has a reg or ranges property, but no unit 
name

  [..]/arch/arm/boot/dts/qcom/qcom-msm8974pro-fairphone-fp2.dtb: /: memory: 
False schema does not allow {'device_type': ['memory'], 'reg': [[0, 0]]}
  from schema $id: http://devicetree.org/schemas/root-node.yaml#

Signed-off-by: Luca Weiss 
---
 arch/arm/boot/dts/qcom/qcom-msm8974.dtsi | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi 
b/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi
index 5efc38d712cc..00c6526a525d 100644
--- a/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi
+++ b/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi
@@ -103,7 +103,7 @@ scm {
};
};
 
-   memory {
+   memory@0 {
device_type = "memory";
reg = <0x0 0x0>;
};

-- 
2.44.0

[PATCH 2/2] ARM: dts: qcom: msm8974: Add empty chosen node

2024-03-18 Thread Luca Weiss

Add an empty /chosen node to the dtsi like is common on most other
Qualcomm SoC files, so that various pieces of software expecting this
node to exist don't complain.

Signed-off-by: Luca Weiss 
---
 arch/arm/boot/dts/qcom/qcom-msm8974.dtsi | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi 
b/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi
index 00c6526a525d..2ec4ec4e5d2a 100644
--- a/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi
+++ b/arch/arm/boot/dts/qcom/qcom-msm8974.dtsi
@@ -14,6 +14,8 @@ / {
#size-cells = <1>;
interrupt-parent = <>;
 
+   chosen { };
+
clocks {
xo_board: xo_board {
compatible = "fixed-clock";

-- 
2.44.0

[PATCH 0/2] Small fixes for MSM8974 SoC dtsi

2024-03-18 Thread Luca Weiss

One fix for dt schema validation, one for the /chosen node.

Signed-off-by: Luca Weiss 
---
Luca Weiss (2):
  ARM: dts: qcom: msm8974: Add @0 to memory node name
  ARM: dts: qcom: msm8974: Add empty chosen node

 arch/arm/boot/dts/qcom/qcom-msm8974.dtsi | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
---
base-commit: f6cef5f8c37f58a3bc95b3754c3ae98e086631ca
change-id: 20240318-msm8974-misc2-1fb92ae6bdf3

Best regards,
-- 
Luca Weiss

[PATCH 2/2] virtio_balloon: Treat stats requests as wakeup events

2024-03-18 Thread David Stevens

From: David Stevens 

Treat stats requests as wakeup events to ensure that the driver responds
to device requests in a timely manner.

Signed-off-by: David Stevens 
---
 drivers/virtio/virtio_balloon.c | 75 -
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7fe7ef5f1c77..402dec98e08c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -121,11 +121,14 @@ struct virtio_balloon {
struct page_reporting_dev_info pr_dev_info;
 
/* State for keeping the wakeup_source active while adjusting the 
balloon */
-   spinlock_t adjustment_lock;
-   bool adjustment_signal_pending;
-   bool adjustment_in_progress;
+   spinlock_t wakeup_lock;
+   bool processing_wakeup_event;
+   u32 wakeup_signal_mask;
 };
 
+#define ADJUSTMENT_WAKEUP_SIGNAL (1 << 0)
+#define STATS_WAKEUP_SIGNAL (1 << 1)
+
 static const struct virtio_device_id id_table[] = {
{ VIRTIO_ID_BALLOON, VIRTIO_DEV_ANY_ID },
{ 0 },
@@ -140,6 +143,36 @@ static u32 page_to_balloon_pfn(struct page *page)
return pfn * VIRTIO_BALLOON_PAGES_PER_PAGE;
 }
 
+static void start_wakeup_event(struct virtio_balloon *vb, u32 mask)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(>wakeup_lock, flags);
+   vb->wakeup_signal_mask |= mask;
+   if (!vb->processing_wakeup_event) {
+   vb->processing_wakeup_event = true;
+   pm_stay_awake(>vdev->dev);
+   }
+   spin_unlock_irqrestore(>wakeup_lock, flags);
+}
+
+static void process_wakeup_event(struct virtio_balloon *vb, u32 mask)
+{
+   spin_lock_irq(>wakeup_lock);
+   vb->wakeup_signal_mask &= ~mask;
+   spin_unlock_irq(>wakeup_lock);
+}
+
+static void finish_wakeup_event(struct virtio_balloon *vb)
+{
+   spin_lock_irq(>wakeup_lock);
+   if (!vb->wakeup_signal_mask && vb->processing_wakeup_event) {
+   vb->processing_wakeup_event = false;
+   pm_relax(>vdev->dev);
+   }
+   spin_unlock_irq(>wakeup_lock);
+}
+
 static void balloon_ack(struct virtqueue *vq)
 {
struct virtio_balloon *vb = vq->vdev->priv;
@@ -370,8 +403,10 @@ static void stats_request(struct virtqueue *vq)
struct virtio_balloon *vb = vq->vdev->priv;
 
spin_lock(>stop_update_lock);
-   if (!vb->stop_update)
+   if (!vb->stop_update) {
+   start_wakeup_event(vb, STATS_WAKEUP_SIGNAL);
queue_work(system_freezable_wq, >update_balloon_stats_work);
+   }
spin_unlock(>stop_update_lock);
 }
 
@@ -444,29 +479,10 @@ static void virtio_balloon_queue_free_page_work(struct 
virtio_balloon *vb)
 
 static void start_update_balloon_size(struct virtio_balloon *vb)
 {
-   unsigned long flags;
-
-   spin_lock_irqsave(>adjustment_lock, flags);
-   vb->adjustment_signal_pending = true;
-   if (!vb->adjustment_in_progress) {
-   vb->adjustment_in_progress = true;
-   pm_stay_awake(>vdev->dev);
-   }
-   spin_unlock_irqrestore(>adjustment_lock, flags);
-
+   start_wakeup_event(vb, ADJUSTMENT_WAKEUP_SIGNAL);
queue_work(system_freezable_wq, >update_balloon_size_work);
 }
 
-static void end_update_balloon_size(struct virtio_balloon *vb)
-{
-   spin_lock_irq(>adjustment_lock);
-   if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {
-   vb->adjustment_in_progress = false;
-   pm_relax(>vdev->dev);
-   }
-   spin_unlock_irq(>adjustment_lock);
-}
-
 static void virtballoon_changed(struct virtio_device *vdev)
 {
struct virtio_balloon *vb = vdev->priv;
@@ -495,7 +511,10 @@ static void update_balloon_stats_func(struct work_struct 
*work)
 
vb = container_of(work, struct virtio_balloon,
  update_balloon_stats_work);
+
+   process_wakeup_event(vb, STATS_WAKEUP_SIGNAL);
stats_handle_request(vb);
+   finish_wakeup_event(vb);
 }
 
 static void update_balloon_size_func(struct work_struct *work)
@@ -506,9 +525,7 @@ static void update_balloon_size_func(struct work_struct 
*work)
vb = container_of(work, struct virtio_balloon,
  update_balloon_size_work);
 
-   spin_lock_irq(>adjustment_lock);
-   vb->adjustment_signal_pending = false;
-   spin_unlock_irq(>adjustment_lock);
+   process_wakeup_event(vb, ADJUSTMENT_WAKEUP_SIGNAL);
 
diff = towards_target(vb);
 
@@ -523,7 +540,7 @@ static void update_balloon_size_func(struct work_struct 
*work)
if (diff)
queue_work(system_freezable_wq, work);
else
-   end_update_balloon_size(vb);
+   finish_wakeup_event(vb);
 }
 
 static int init_vqs(struct virtio_balloon *vb)
@@ -1027,7 +1044,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
goto out_unregister_oom;
}
 
-

[PATCH 1/2] virtio_balloon: Give the balloon its own wakeup source

2024-03-18 Thread David Stevens

From: David Stevens 

Wakeup sources don't support nesting multiple events, so sharing a
single object between multiple drivers can result in one driver
overriding the wakeup event processing period specified by another
driver. Have the virtio balloon driver use the wakeup source of the
device it is bound to rather than the wakeup source of the parent
device, to avoid conflicts with the transport layer.

Note that although the virtio balloon's virtio_device itself isn't what
actually wakes up the device, it is responsible for processing wakeup
events. In the same way that EPOLLWAKEUP uses a dedicated wakeup_source
to prevent suspend when userspace is processing wakeup events, a
dedicated wakeup_source is necessary when processing wakeup events in a
higher layer in the kernel.

Fixes: b12fbc3f787e ("virtio_balloon: stay awake while adjusting balloon")
Signed-off-by: David Stevens 
---
 drivers/virtio/virtio_balloon.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1f5b3dd31fcf..7fe7ef5f1c77 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -450,7 +450,7 @@ static void start_update_balloon_size(struct virtio_balloon 
*vb)
vb->adjustment_signal_pending = true;
if (!vb->adjustment_in_progress) {
vb->adjustment_in_progress = true;
-   pm_stay_awake(vb->vdev->dev.parent);
+   pm_stay_awake(>vdev->dev);
}
spin_unlock_irqrestore(>adjustment_lock, flags);
 
@@ -462,7 +462,7 @@ static void end_update_balloon_size(struct virtio_balloon 
*vb)
spin_lock_irq(>adjustment_lock);
if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {
vb->adjustment_in_progress = false;
-   pm_relax(vb->vdev->dev.parent);
+   pm_relax(>vdev->dev);
}
spin_unlock_irq(>adjustment_lock);
 }
@@ -1028,6 +1028,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
}
 
spin_lock_init(>adjustment_lock);
+   device_set_wakeup_capable(>vdev->dev, true);
 
virtio_device_ready(vdev);
 
-- 
2.44.0.291.gc1ea87d7ee-goog

[PATCH 0/2] Improvements to virtio_balloon pm

2024-03-18 Thread David Stevens

From: David Stevens 

The virtio_balloon driver uses wakeup sources to allow the guest to
enter system power management sleep states (e.g. s2idle) without running
the risk of becoming unresponsive to cooperative memory management
requests from the host. This series fixes an issue where wakeup sources
for inflate/deflate were improperly shared between drivers. It also
closes a race where stats requests that come in immediately before a
sleep state transition could fail to be handled in a timely manner.

David Stevens (2):
  virtio_balloon: Give the balloon its own wakeup source
  virtio_balloon: Treat stats requests as wakeup events

 drivers/virtio/virtio_balloon.c | 76 -
 1 file changed, 47 insertions(+), 29 deletions(-)


base-commit: e8f897f4afef0031fe618a8e94127a0934896aba
-- 
2.44.0.291.gc1ea87d7ee-goog

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-18 Thread Michael S. Tsirkin

On Mon, Mar 18, 2024 at 09:41:45AM +1000, Gavin Shan wrote:
> On 3/18/24 02:50, Michael S. Tsirkin wrote:
> > On Fri, Mar 15, 2024 at 09:24:36PM +1000, Gavin Shan wrote:
> > > 
> > > On 3/15/24 21:05, Michael S. Tsirkin wrote:
> > > > On Fri, Mar 15, 2024 at 08:45:10PM +1000, Gavin Shan wrote:
> > > > > > > Yes, I guess smp_wmb() ('dmb') is buggy on NVidia's grace-hopper 
> > > > > > > platform. I tried
> > > > > to reproduce it with my own driver where one thread writes to the 
> > > > > shared buffer
> > > > > and another thread reads from the buffer. I don't hit the 
> > > > > out-of-order issue so
> > > > > far.
> > > > 
> > > > Make sure the 2 areas you are accessing are in different cache lines.
> > > > 
> > > 
> > > Yes, I already put those 2 areas to separate cache lines.
> > > 
> > > > 
> > > > > My driver may be not correct somewhere and I will update if I can 
> > > > > reproduce
> > > > > the issue with my driver in the future.
> > > > 
> > > > Then maybe your change is just making virtio slower and masks the bug
> > > > that is actually elsewhere?
> > > > 
> > > > You don't really need a driver. Here's a simple test: without barriers
> > > > assertion will fail. With barriers it will not.
> > > > (Warning: didn't bother testing too much, could be buggy.
> > > > 
> > > > ---
> > > > 
> > > > #include 
> > > > #include 
> > > > #include 
> > > > #include 
> > > > 
> > > > #define FIRST values[0]
> > > > #define SECOND values[64]
> > > > 
> > > > volatile int values[100] = {};
> > > > 
> > > > void* writer_thread(void* arg) {
> > > > while (1) {
> > > > FIRST++;
> > > > // NEED smp_wmb here
> > >  __asm__ volatile("dmb ishst" : : : "memory");
> > > > SECOND++;
> > > > }
> > > > }
> > > > 
> > > > void* reader_thread(void* arg) {
> > > >   while (1) {
> > > > int first = FIRST;
> > > > // NEED smp_rmb here
> > >  __asm__ volatile("dmb ishld" : : : "memory");
> > > > int second = SECOND;
> > > > assert(first - second == 1 || first - second == 0);
> > > >   }
> > > > }
> > > > 
> > > > int main() {
> > > >   pthread_t writer, reader;
> > > > 
> > > >   pthread_create(, NULL, writer_thread, NULL);
> > > >   pthread_create(, NULL, reader_thread, NULL);
> > > > 
> > > >   pthread_join(writer, NULL);
> > > >   pthread_join(reader, NULL);
> > > > 
> > > >   return 0;
> > > > }
> > > > 
> > > 
> > > Had a quick test on NVidia's grace-hopper and Ampere's CPUs. I hit
> > > the assert on both of them. After replacing 'dmb' with 'dsb', I can
> > > hit assert on both of them too. I need to look at the code closely.
> > > 
> > > [root@virt-mtcollins-02 test]# ./a
> > > a: a.c:26: reader_thread: Assertion `first - second == 1 || first - 
> > > second == 0' failed.
> > > Aborted (core dumped)
> > > 
> > > [root@nvidia-grace-hopper-05 test]# ./a
> > > a: a.c:26: reader_thread: Assertion `first - second == 1 || first - 
> > > second == 0' failed.
> > > Aborted (core dumped)
> > > 
> > > Thanks,
> > > Gavin
> > 
> > 
> > Actually this test is broken. No need for ordering it's a simple race.
> > The following works on x86 though (x86 does not need barriers
> > though).
> > 
> > 
> > #include 
> > #include 
> > #include 
> > #include 
> > 
> > #if 0
> > #define x86_rmb()  asm volatile("lfence":::"memory")
> > #define x86_mb()  asm volatile("mfence":::"memory")
> > #define x86_smb()  asm volatile("sfence":::"memory")
> > #else
> > #define x86_rmb()  asm volatile("":::"memory")
> > #define x86_mb()  asm volatile("":::"memory")
> > #define x86_smb()  asm volatile("":::"memory")
> > #endif
> > 
> > #define FIRST values[0]
> > #define SECOND values[640]
> > #define FLAG values[1280]
> > 
> > volatile unsigned values[2000] = {};
> > 
> > void* writer_thread(void* arg) {
> > while (1) {
> > /* Now synchronize with reader */
> > while(FLAG);
> > FIRST++;
> > x86_smb();
> > SECOND++;
> > x86_smb();
> > FLAG = 1;
> > }
> > }
> > 
> > void* reader_thread(void* arg) {
> >  while (1) {
> > /* Now synchronize with writer */
> > while(!FLAG);
> > x86_rmb();
> > unsigned first = FIRST;
> > x86_rmb();
> > unsigned second = SECOND;
> > assert(first - second == 1 || first - second == 0);
> > FLAG = 0;
> > 
> > if (!(first %100))
> > printf("%d\n", first);
> > }
> > }
> > 
> > int main() {
> >  pthread_t writer, reader;
> > 
> >  pthread_create(, NULL, writer_thread, NULL);
> >  pthread_create(, NULL, reader_thread, NULL);
> > 
> >  pthread_join(writer, NULL);
> >  pthread_join(reader, NULL);
> > 
> >  return 0;
> > }
> > 
> 
> I tried it on host and VM of NVidia's grace-hopper. Without the barriers, I
> can hit assert. With the barriers, it's working fine without hitting the
> assert.
> 
> I also had some code to mimic virtio vring last weekend, and it's just
> working well. Back to our original

回复：KASAN: null-ptr-deref Write in tctx_task_work_run

2024-03-18 Thread Ubisectech Sirius

> I think you snipped the fault injection that came before this. It looks
> like an allocation failure, so we don't get tsk->io_uring setup for the
> SQPOLL thread. Not a great way to handle this, but can you try the
> below? Would be nicer if we could just prune the task rather than wake
> it and have it error.

Hi.
  The issue does not appear again when I apply the patch to the Linux kernel.


On 3/17/24 6:59 PM, Ubisectech Sirius wrote:
> Hello.
> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> Recently, our team has discovered a issue in Linux kernel 
> 6.8.0-ge5e038b7ae9d. Attached to the email were a POC file of the issue.
> 
> Stack dump:
> 
> ==
> BUG: KASAN: null-ptr-deref in instrument_atomic_read_write 
> include/linux/instrumented.h:96 [inline]
> BUG: KASAN: null-ptr-deref in llist_del_all include/linux/llist.h:266 [inline]
> BUG: KASAN: null-ptr-deref in tctx_task_work_run+0x7d/0x330 
> io_uring/io_uring.c:1267
> Write of size 8 at addr 01c0 by task iou-sqp-215603/215604
> 
> CPU: 0 PID: 215604 Comm: iou-sqp-215603 Not tainted 6.8.0-ge5e038b7ae9d #40
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 
> 04/01/2014
> Call Trace:
>  
>  __dump_stack lib/dump_stack.c:88 [inline]
>  dump_stack_lvl+0x116/0x1b0 lib/dump_stack.c:114
>  kasan_report+0xbd/0xf0 mm/kasan/report.c:601
>  check_region_inline mm/kasan/generic.c:183 [inline]
>  kasan_check_range+0xf4/0x1a0 mm/kasan/generic.c:189
>  instrument_atomic_read_write include/linux/instrumented.h:96 [inline]
>  llist_del_all include/linux/llist.h:266 [inline]
>  tctx_task_work_run+0x7d/0x330 io_uring/io_uring.c:1267
>  io_sq_tw+0x12a/0x1d0 io_uring/sqpoll.c:245
>  io_sq_thread+0x8d7/0x18a0 io_uring/sqpoll.c:308
>  ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
>  
> ==
> Kernel panic - not syncing: KASAN: panic_on_warn set ...
> CPU: 0 PID: 215604 Comm: iou-sqp-215603 Not tainted 6.8.0-ge5e038b7ae9d #40
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 
> 04/01/2014

I think you snipped the fault injection that came before this. It looks
like an allocation failure, so we don't get tsk->io_uring setup for the
SQPOLL thread. Not a great way to handle this, but can you try the
below? Would be nicer if we could just prune the task rather than wake
it and have it error.

diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index 363052b4ea76..db7b0fdfe1cb 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -274,6 +274,10 @@ static int io_sq_thread(void *data)
  char buf[TASK_COMM_LEN];
  DEFINE_WAIT(wait);
 
+ /* offload context creation failed, just exit */
+ if (!current->io_uring) {
+  goto err_out;
+
  snprintf(buf, sizeof(buf), "iou-sqp-%d", sqd->task_pid);
  set_task_comm(current, buf);
 
@@ -371,7 +375,7 @@ static int io_sq_thread(void *data)
   atomic_or(IORING_SQ_NEED_WAKEUP, >rings->sq_flags);
  io_run_task_work();
  mutex_unlock(>lock);
-
+err_out:
  complete(>exited);
  do_exit(0);
 }

-- 
Jens Axboe

43 matches

Mail list logo