date:20240409

Re: [PATCH v2 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name

2024-04-09 Thread Krzysztof Kozlowski

On 09/04/2024 20:36, Luca Weiss wrote:
> Follow the gpio-hog bindings and use otg-hog as node name.
> 
> Signed-off-by: Luca Weiss 
> ---
>  arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Krzysztof Kozlowski 

Best regards,
Krzysztof

Re: [PATCH v2 1/2] dt-bindings: pinctrl: qcom,pmic-gpio: Allow gpio-hog nodes

2024-04-09 Thread Krzysztof Kozlowski

On 09/04/2024 20:36, Luca Weiss wrote:
> Allow specifying a GPIO hog, as already used on
> qcom-msm8974-lge-nexus5-hammerhead.dts.
> 
> Signed-off-by: Luca Weiss 
> ---
>  .../devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml  | 12 
> 
>  1 file changed, 12 insertions(+)


Reviewed-by: Krzysztof Kozlowski 

Best regards,
Krzysztof

Re: [External] Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-09 Thread Ho-Ren (Jack) Chuang

On Tue, Apr 9, 2024 at 7:33 PM Huang, Ying  wrote:
>
> "Ho-Ren (Jack) Chuang"  writes:
>
> > On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
> >  wrote:
> >>
> >> On Fri,  5 Apr 2024 00:07:06 +
> >> "Ho-Ren (Jack) Chuang"  wrote:
> >>
> >> > The current implementation treats emulated memory devices, such as
> >> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal 
> >> > memory
> >> > (E820_TYPE_RAM). However, these emulated devices have different
> >> > characteristics than traditional DRAM, making it important to
> >> > distinguish them. Thus, we modify the tiered memory initialization 
> >> > process
> >> > to introduce a delay specifically for CPUless NUMA nodes. This delay
> >> > ensures that the memory tier initialization for these nodes is deferred
> >> > until HMAT information is obtained during the boot process. Finally,
> >> > demotion tables are recalculated at the end.
> >> >
> >> > * late_initcall(memory_tier_late_init);
> >> > Some device drivers may have initialized memory tiers between
> >> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> >> > online memory nodes and configuring memory tiers. They should be excluded
> >> > in the late init.
> >> >
> >> > * Handle cases where there is no HMAT when creating memory tiers
> >> > There is a scenario where a CPUless node does not provide HMAT 
> >> > information.
> >> > If no HMAT is specified, it falls back to using the default DRAM tier.
> >> >
> >> > * Introduce another new lock `default_dram_perf_lock` for adist 
> >> > calculation
> >> > In the current implementation, iterating through CPUlist nodes requires
> >> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end 
> >> > up
> >> > trying to acquire the same lock, leading to a potential deadlock.
> >> > Therefore, we propose introducing a standalone `default_dram_perf_lock` 
> >> > to
> >> > protect `default_dram_perf_*`. This approach not only avoids deadlock
> >> > but also prevents holding a large lock simultaneously.
> >> >
> >> > * Upgrade `set_node_memory_tier` to support additional cases, including
> >> >   default DRAM, late CPUless, and hot-plugged initializations.
> >> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> >> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> >> > handle cases where memtype is not initialized and where HMAT information 
> >> > is
> >> > available.
> >> >
> >> > * Introduce `default_memory_types` for those memory types that are not
> >> >   initialized by device drivers.
> >> > Because late initialized memory and default DRAM memory need to be 
> >> > managed,
> >> > a default memory type is created for storing all memory types that are
> >> > not initialized by device drivers and as a fallback.
> >> >
> >> > Signed-off-by: Ho-Ren (Jack) Chuang 
> >> > Signed-off-by: Hao Xiang 
> >> > Reviewed-by: "Huang, Ying" 
> >>
> >> Hi - one remaining question. Why can't we delay init for all nodes
> >> to either drivers or your fallback late_initcall code.
> >> It would be nice to reduce possible code paths.
> >
> > I try not to change too much of the existing code structure in
> > this patchset.
> >
> > To me, postponing/moving all memory tier registrations to
> > late_initcall() is another possible action item for the next patchset.
> >
> > After tier_mem(), hmat_init() is called, which requires registering
> > `default_dram_type` info. This is when `default_dram_type` is needed.
> > However, it is indeed possible to postpone the latter part,
> > set_node_memory_tier(), to `late_init(). So, memory_tier_init() can
> > indeed be split into two parts, and the latter part can be moved to
> > late_initcall() to be processed together.
>
> I don't think that it's good to move all memory_tier initialization in
> drivers to late_initcall().  It's natural to keep them in
> device_initcall() level.
>
> If so, we can allocate default_dram_type in memory_tier_init(), and call
> set_node_memory_tier() only in memory_tier_lateinit().  We can call
> memory_tier_lateinit() in device_initcall() level too.
>

It makes sense to me to leave only `default_dram_type ` and
hotplug_init() in memory_tier_init(), postponing all
set_node_memory_tier()s to memory_tier_late_init()

Would it be possible there is no device_initcall() calling
memory_tier_late_init()? If yes, I think putting memory_tier_late_init()
in late_init() is still necessary.

> --
> Best Regards,
> Huang, Ying
>
> > Doing this all memory-type drivers have to call late_initcall() to
> > register a memory tier. I’m not sure how many they are?
> >
> > What do you guys think?
> >
> >>
> >> Jonathan
> >>
> >>
> >> > ---
> >> >  mm/memory-tiers.c | 94 +++
> >> >  1 file changed, 70 insertions(+), 24 deletions(-)
> >> >
> >> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> >> > index 516b144fd45a..6632102bd5c9 100644
> >> > --- a/mm/memory-tiers.c
> >> > +++ b/mm/mem

Re: [PATCH v2 0/1] virtio-pci: Fix the crash that the vector was used after released

2024-04-09 Thread Cindy Lu

Sorry, send to the wrong mail list, please ignore it

On Wed, Apr 10, 2024 at 12:35 PM Cindy Lu  wrote:
>
> During the booting process of the Vyatta image, the behavior of the
> called function in qemu is as follows:
>
> 1. vhost_net_stop() was triggered by guest image . This will call the function
> virtio_pci_set_guest_notifiers() with assgin= false, and
> virtio_pci_set_guest_notifiers(） will release the irqfd for vector 0
>
> 2. virtio_reset() was called -->set configure vector to VIRTIO_NO_VECTOR
>
> 3.vhost_net_start() was called (at this time, the configure vector is
> still VIRTIO_NO_VECTOR) and call virtio_pci_set_guest_notifiers() with
> assgin= true, so the irqfd for vector 0 is still not "init" during this 
> process
>
> 4. The system continues to boot,set the vector back to 0, and 
> msix_fire_vector_notifier() was triggered
>  unmask the vector 0 and then met the crash
> [msix_fire_vector_notifier] 112 called vector 0 is_masked 1
> [msix_fire_vector_notifier] 112 called vector 0 is_masked 0
>
> To fix this, we need to call the function "kvm_virtio_pci_vector_use_one()"
> when the vector changes back from VIRTIO_NO_VECTOR.
>
> The reason that we don't need to call kvm_virtio_pci_vector_release_one while 
> the vector changes to
> VIRTIO_NO_VECTOR is this function will called in vhost_net_stop(),
> So this step will not lost during this process.
>
> Change from V1
> 1.add the check for if using irqfd
> 2.remove the check for bool recovery, irqfd's user is enough to check status
>
> Cindy Lu (1):
>   virtio-pci: Fix the crash that the vector was used after released.
>
>  hw/virtio/virtio-pci.c | 35 +++
>  1 file changed, 35 insertions(+)
>
> --
> 2.43.0
>

Re: [PATCH v2 1/1] virtio-pci: Fix the crash that the vector was used after released.

2024-04-09 Thread Cindy Lu

Sorry, send to the wrong mail list, please ignore it

On Wed, Apr 10, 2024 at 12:35 PM Cindy Lu  wrote:
>
> When the guest triggers vhost_stop and then virtio_reset, the vector will the
> IRQFD for this vector will be released and change to VIRTIO_NO_VECTOR.
> After that, the guest called vhost_net_start,  (at this time, the configure
> vector is still VIRTIO_NO_VECTOR),  vector 0 still was not "init".
> The guest system continued to boot, set the vector back to 0, and then met 
> the crash.
>
> To fix this, we need to call the function "kvm_virtio_pci_vector_use_one()"
> when the vector changes back from VIRTIO_NO_VECTOR
>
> (gdb) bt
> 0  __pthread_kill_implementation (threadid=, 
> signo=signo@entry=6, no_tid=no_tid@entry=0)
> at pthread_kill.c:44
> 1  0x7fc87148ec53 in __pthread_kill_internal (signo=6, 
> threadid=) at pthread_kill.c:78
> 2  0x7fc87143e956 in __GI_raise (sig=sig@entry=6) at 
> ../sysdeps/posix/raise.c:26
> 3  0x7fc8714287f4 in __GI_abort () at abort.c:79
> 4  0x7fc87142871b in __assert_fail_base
> (fmt=0x7fc8715bbde0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
> assertion=0x5606413efd53 "ret == 0", file=0x5606413ef87d 
> "../accel/kvm/kvm-all.c", line=1837, function=) at assert.c:92
> 5  0x7fc871437536 in __GI___assert_fail
> (assertion=0x5606413efd53 "ret == 0", file=0x5606413ef87d 
> "../accel/kvm/kvm-all.c", line=1837, function=0x5606413f06f0 
> <__PRETTY_FUNCTION__.19> "kvm_irqchip_commit_routes") at assert.c:101
> 6  0x560640f884b5 in kvm_irqchip_commit_routes (s=0x560642cae1f0) at 
> ../accel/kvm/kvm-all.c:1837
> 7  0x560640c98f8e in virtio_pci_one_vector_unmask
> (proxy=0x560643c65f00, queue_no=4294967295, vector=0, msg=..., 
> n=0x560643c6e4c8)
> at ../hw/virtio/virtio-pci.c:1005
> 8  0x560640c99201 in virtio_pci_vector_unmask (dev=0x560643c65f00, 
> vector=0, msg=...)
> at ../hw/virtio/virtio-pci.c:1070
> 9  0x560640bc402e in msix_fire_vector_notifier (dev=0x560643c65f00, 
> vector=0, is_masked=false)
> at ../hw/pci/msix.c:120
> 10 0x560640bc40f1 in msix_handle_mask_update (dev=0x560643c65f00, 
> vector=0, was_masked=true)
> at ../hw/pci/msix.c:140
> 11 0x560640bc4503 in msix_table_mmio_write (opaque=0x560643c65f00, 
> addr=12, val=0, size=4)
> at ../hw/pci/msix.c:231
> 12 0x560640f26d83 in memory_region_write_accessor
> (mr=0x560643c66540, addr=12, value=0x7fc86b7bc628, size=4, shift=0, 
> mask=4294967295, attrs=...)
> at ../system/memory.c:497
> 13 0x560640f270a6 in access_with_adjusted_size
>
>  (addr=12, value=0x7fc86b7bc628, size=4, access_size_min=1, 
> access_size_max=4, access_fn=0x560640f26c8d , 
> mr=0x560643c66540, attrs=...) at ../system/memory.c:573
> 14 0x560640f2a2b5 in memory_region_dispatch_write (mr=0x560643c66540, 
> addr=12, data=0, op=MO_32, attrs=...)
> at ../system/memory.c:1521
> 15 0x560640f37bac in flatview_write_continue
> (fv=0x7fc65805e0b0, addr=4273803276, attrs=..., ptr=0x7fc871e9c028, 
> len=4, addr1=12, l=4, mr=0x560643c66540)
> at ../system/physmem.c:2714
> 16 0x560640f37d0f in flatview_write
> (fv=0x7fc65805e0b0, addr=4273803276, attrs=..., buf=0x7fc871e9c028, 
> len=4) at ../system/physmem.c:2756
> 17 0x560640f380bf in address_space_write
> (as=0x560642161ae0 , addr=4273803276, attrs=..., 
> buf=0x7fc871e9c028, len=4)
> at ../system/physmem.c:2863
> 18 0x560640f3812c in address_space_rw
> (as=0x560642161ae0 , addr=4273803276, attrs=..., 
> buf=0x7fc871e9c028, len=4, is_write=true) at ../system/physmem.c:2873
> --Type  for more, q to quit, c to continue without paging--
> 19 0x560640f8aa55 in kvm_cpu_exec (cpu=0x560642f205e0) at 
> ../accel/kvm/kvm-all.c:2915
> 20 0x560640f8d731 in kvm_vcpu_thread_fn (arg=0x560642f205e0) at 
> ../accel/kvm/kvm-accel-ops.c:51
> 21 0x5606411949f4 in qemu_thread_start (args=0x560642f292b0) at 
> ../util/qemu-thread-posix.c:541
> 22 0x7fc87148cdcd in start_thread (arg=) at 
> pthread_create.c:442
> 23 0x7fc871512630 in clone3 () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
> (gdb)
> Signed-off-by: Cindy Lu 
> ---
>  hw/virtio/virtio-pci.c | 35 +++
>  1 file changed, 35 insertions(+)
>
> diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
> index 1a7039fb0c..344f4fb844 100644
> --- a/hw/virtio/virtio-pci.c
> +++ b/hw/virtio/virtio-pci.c
> @@ -880,6 +880,7 @@ static int kvm_virtio_pci_vector_use_one(VirtIOPCIProxy 
> *proxy, int queue_no)
>  int ret;
>  EventNotifier *n;
>  PCIDevice *dev = &proxy->pci_dev;
> +VirtIOIRQFD *irqfd;
>  VirtIODevice *vdev = virtio_bus_get_device(&proxy->bus);
>  VirtioDeviceClass *k = VIRTIO_DEVICE_GET_CLASS(vdev);
>
> @@ -890,10 +891,19 @@ static int kvm_virtio_pci_vector_use_one(VirtIOPCIProxy 
> *proxy, int queue_no)
>  if (vector >= msix_nr_vectors_allocated(dev)) {
>  return 0;
>  }
> +/*
> + * if the irqfd still in use,

[PATCH v2 1/1] virtio-pci: Fix the crash that the vector was used after released.

2024-04-09 Thread Cindy Lu

When the guest triggers vhost_stop and then virtio_reset, the vector will the
IRQFD for this vector will be released and change to VIRTIO_NO_VECTOR.
After that, the guest called vhost_net_start,  (at this time, the configure
vector is still VIRTIO_NO_VECTOR),  vector 0 still was not "init".
The guest system continued to boot, set the vector back to 0, and then met the 
crash.

To fix this, we need to call the function "kvm_virtio_pci_vector_use_one()"
when the vector changes back from VIRTIO_NO_VECTOR

(gdb) bt
0  __pthread_kill_implementation (threadid=, 
signo=signo@entry=6, no_tid=no_tid@entry=0)
at pthread_kill.c:44
1  0x7fc87148ec53 in __pthread_kill_internal (signo=6, threadid=) at pthread_kill.c:78
2  0x7fc87143e956 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/posix/raise.c:26
3  0x7fc8714287f4 in __GI_abort () at abort.c:79
4  0x7fc87142871b in __assert_fail_base
(fmt=0x7fc8715bbde0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
assertion=0x5606413efd53 "ret == 0", file=0x5606413ef87d 
"../accel/kvm/kvm-all.c", line=1837, function=) at assert.c:92
5  0x7fc871437536 in __GI___assert_fail
(assertion=0x5606413efd53 "ret == 0", file=0x5606413ef87d 
"../accel/kvm/kvm-all.c", line=1837, function=0x5606413f06f0 
<__PRETTY_FUNCTION__.19> "kvm_irqchip_commit_routes") at assert.c:101
6  0x560640f884b5 in kvm_irqchip_commit_routes (s=0x560642cae1f0) at 
../accel/kvm/kvm-all.c:1837
7  0x560640c98f8e in virtio_pci_one_vector_unmask
(proxy=0x560643c65f00, queue_no=4294967295, vector=0, msg=..., 
n=0x560643c6e4c8)
at ../hw/virtio/virtio-pci.c:1005
8  0x560640c99201 in virtio_pci_vector_unmask (dev=0x560643c65f00, 
vector=0, msg=...)
at ../hw/virtio/virtio-pci.c:1070
9  0x560640bc402e in msix_fire_vector_notifier (dev=0x560643c65f00, 
vector=0, is_masked=false)
at ../hw/pci/msix.c:120
10 0x560640bc40f1 in msix_handle_mask_update (dev=0x560643c65f00, vector=0, 
was_masked=true)
at ../hw/pci/msix.c:140
11 0x560640bc4503 in msix_table_mmio_write (opaque=0x560643c65f00, addr=12, 
val=0, size=4)
at ../hw/pci/msix.c:231
12 0x560640f26d83 in memory_region_write_accessor
(mr=0x560643c66540, addr=12, value=0x7fc86b7bc628, size=4, shift=0, 
mask=4294967295, attrs=...)
at ../system/memory.c:497
13 0x560640f270a6 in access_with_adjusted_size

 (addr=12, value=0x7fc86b7bc628, size=4, access_size_min=1, 
access_size_max=4, access_fn=0x560640f26c8d , 
mr=0x560643c66540, attrs=...) at ../system/memory.c:573
14 0x560640f2a2b5 in memory_region_dispatch_write (mr=0x560643c66540, 
addr=12, data=0, op=MO_32, attrs=...)
at ../system/memory.c:1521
15 0x560640f37bac in flatview_write_continue
(fv=0x7fc65805e0b0, addr=4273803276, attrs=..., ptr=0x7fc871e9c028, len=4, 
addr1=12, l=4, mr=0x560643c66540)
at ../system/physmem.c:2714
16 0x560640f37d0f in flatview_write
(fv=0x7fc65805e0b0, addr=4273803276, attrs=..., buf=0x7fc871e9c028, len=4) 
at ../system/physmem.c:2756
17 0x560640f380bf in address_space_write
(as=0x560642161ae0 , addr=4273803276, attrs=..., 
buf=0x7fc871e9c028, len=4)
at ../system/physmem.c:2863
18 0x560640f3812c in address_space_rw
(as=0x560642161ae0 , addr=4273803276, attrs=..., 
buf=0x7fc871e9c028, len=4, is_write=true) at ../system/physmem.c:2873
--Type  for more, q to quit, c to continue without paging--
19 0x560640f8aa55 in kvm_cpu_exec (cpu=0x560642f205e0) at 
../accel/kvm/kvm-all.c:2915
20 0x560640f8d731 in kvm_vcpu_thread_fn (arg=0x560642f205e0) at 
../accel/kvm/kvm-accel-ops.c:51
21 0x5606411949f4 in qemu_thread_start (args=0x560642f292b0) at 
../util/qemu-thread-posix.c:541
22 0x7fc87148cdcd in start_thread (arg=) at 
pthread_create.c:442
23 0x7fc871512630 in clone3 () at 
../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb)
Signed-off-by: Cindy Lu 
---
 hw/virtio/virtio-pci.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 1a7039fb0c..344f4fb844 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -880,6 +880,7 @@ static int kvm_virtio_pci_vector_use_one(VirtIOPCIProxy 
*proxy, int queue_no)
 int ret;
 EventNotifier *n;
 PCIDevice *dev = &proxy->pci_dev;
+VirtIOIRQFD *irqfd;
 VirtIODevice *vdev = virtio_bus_get_device(&proxy->bus);
 VirtioDeviceClass *k = VIRTIO_DEVICE_GET_CLASS(vdev);
 
@@ -890,10 +891,19 @@ static int kvm_virtio_pci_vector_use_one(VirtIOPCIProxy 
*proxy, int queue_no)
 if (vector >= msix_nr_vectors_allocated(dev)) {
 return 0;
 }
+/*
+ * if the irqfd still in use, means the irqfd was not
+ * release before and don't need to set up
+ */
+irqfd = &proxy->vector_irqfd[vector];
+if (irqfd->users != 0) {
+return 0;
+}
 ret = kvm_virtio_pci_vq_vector_use(proxy, vector);
 if (ret < 0) {
 goto undo;
 }
+
 /*
  * If guest suppor

[PATCH v2 0/1] virtio-pci: Fix the crash that the vector was used after released

2024-04-09 Thread Cindy Lu

During the booting process of the Vyatta image, the behavior of the
called function in qemu is as follows:

1. vhost_net_stop() was triggered by guest image . This will call the function
virtio_pci_set_guest_notifiers() with assgin= false, and
virtio_pci_set_guest_notifiers(??? will release the irqfd for vector 0

2. virtio_reset() was called -->set configure vector to VIRTIO_NO_VECTOR

3.vhost_net_start() was called (at this time, the configure vector is
still VIRTIO_NO_VECTOR) and call virtio_pci_set_guest_notifiers() with
assgin= true, so the irqfd for vector 0 is still not "init" during this process

4. The system continues to boot,set the vector back to 0, and 
msix_fire_vector_notifier() was triggered
 unmask the vector 0 and then met the crash
[msix_fire_vector_notifier] 112 called vector 0 is_masked 1
[msix_fire_vector_notifier] 112 called vector 0 is_masked 0

To fix this, we need to call the function "kvm_virtio_pci_vector_use_one()"
when the vector changes back from VIRTIO_NO_VECTOR.

The reason that we don't need to call kvm_virtio_pci_vector_release_one while 
the vector changes to
VIRTIO_NO_VECTOR is this function will called in vhost_net_stop(),
So this step will not lost during this process.

Change from V1
1.add the check for if using irqfd
2.remove the check for bool recovery, irqfd's user is enough to check status

Cindy Lu (1):
  virtio-pci: Fix the crash that the vector was used after released.

 hw/virtio/virtio-pci.c | 35 +++
 1 file changed, 35 insertions(+)

-- 
2.43.0

[PATCH v5] vp_vdpa: don't allocate unused msix vectors

2024-04-09 Thread lyx634449800

From: Yuxue Liu 

When there is a ctlq and it doesn't require interrupt
callbacks,the original method of calculating vectors
wastes hardware msi or msix resources as well as system
IRQ resources.

When conducting performance testing using testpmd in the
guest os, it was found that the performance was lower compared
to directly using vfio-pci to passthrough the device

In scenarios where the virtio device in the guest os does
not utilize interrupts, the vdpa driver still configures
the hardware's msix vector. Therefore, the hardware still
sends interrupts to the host os. Because of this unnecessary
action by the hardware, hardware performance decreases, and
it also affects the performance of the host os.

Before modification:(interrupt mode)
 32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
 33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
 34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
 35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config

After modification:(interrupt mode)
 32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
 33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
 34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config

Before modification:(virtio pmd mode for guest os)
 32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
 33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
 34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
 35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config

After modification:(virtio pmd mode for guest os)
 32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config

To verify the use of the virtio PMD mode in the guest operating
system, the following patch needs to be applied to QEMU:
https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com

Signed-off-by: Yuxue Liu 
Acked-by: Jason Wang 
Reviewed-by: Heng Qi 
---
V5: modify the description of the printout when an exception occurs
V4: update the title and assign values to uninitialized variables
V3: delete unused variables and add validation records
V2: fix when allocating IRQs, scan all queues

 drivers/vdpa/virtio_pci/vp_vdpa.c | 22 --
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
b/drivers/vdpa/virtio_pci/vp_vdpa.c
index df5f4a3bccb5..8de0224e9ec2 100644
--- a/drivers/vdpa/virtio_pci/vp_vdpa.c
+++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
@@ -160,7 +160,13 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
struct pci_dev *pdev = mdev->pci_dev;
int i, ret, irq;
int queues = vp_vdpa->queues;
-   int vectors = queues + 1;
+   int vectors = 1;
+   int msix_vec = 0;
+
+   for (i = 0; i < queues; i++) {
+   if (vp_vdpa->vring[i].cb.callback)
+   vectors++;
+   }
 
ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
if (ret != vectors) {
@@ -173,9 +179,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
vp_vdpa->vectors = vectors;
 
for (i = 0; i < queues; i++) {
+   if (!vp_vdpa->vring[i].cb.callback)
+   continue;
+
snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE,
"vp-vdpa[%s]-%d\n", pci_name(pdev), i);
-   irq = pci_irq_vector(pdev, i);
+   irq = pci_irq_vector(pdev, msix_vec);
ret = devm_request_irq(&pdev->dev, irq,
   vp_vdpa_vq_handler,
   0, vp_vdpa->vring[i].msix_name,
@@ -185,21 +194,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
"vp_vdpa: fail to request irq for vq %d\n", i);
goto err;
}
-   vp_modern_queue_vector(mdev, i, i);
+   vp_modern_queue_vector(mdev, i, msix_vec);
vp_vdpa->vring[i].irq = irq;
+   msix_vec++;
}
 
snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n",
 pci_name(pdev));
-   irq = pci_irq_vector(pdev, queues);
+   irq = pci_irq_vector(pdev, msix_vec);
ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_config_handler, 0,
   vp_vdpa->msix_name, vp_vdpa);
if (ret) {
dev_err(&pdev->dev,
-   "vp_vdpa: fail to request irq for vq %d\n", i);
+   "vp_vdpa: fail to request irq for config: %d\n", ret);
goto err;
}
-   vp_modern_config_vector(mdev, queues);
+   vp_modern_config_vector(mdev, msix_vec);
vp_vdpa->config_irq = irq;
 
return 0;
-- 
2.43.0

Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-09 Thread Huang, Ying

"Ho-Ren (Jack) Chuang"  writes:

> On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
>  wrote:
>>
>> On Fri,  5 Apr 2024 00:07:06 +
>> "Ho-Ren (Jack) Chuang"  wrote:
>>
>> > The current implementation treats emulated memory devices, such as
>> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
>> > (E820_TYPE_RAM). However, these emulated devices have different
>> > characteristics than traditional DRAM, making it important to
>> > distinguish them. Thus, we modify the tiered memory initialization process
>> > to introduce a delay specifically for CPUless NUMA nodes. This delay
>> > ensures that the memory tier initialization for these nodes is deferred
>> > until HMAT information is obtained during the boot process. Finally,
>> > demotion tables are recalculated at the end.
>> >
>> > * late_initcall(memory_tier_late_init);
>> > Some device drivers may have initialized memory tiers between
>> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
>> > online memory nodes and configuring memory tiers. They should be excluded
>> > in the late init.
>> >
>> > * Handle cases where there is no HMAT when creating memory tiers
>> > There is a scenario where a CPUless node does not provide HMAT information.
>> > If no HMAT is specified, it falls back to using the default DRAM tier.
>> >
>> > * Introduce another new lock `default_dram_perf_lock` for adist calculation
>> > In the current implementation, iterating through CPUlist nodes requires
>> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
>> > trying to acquire the same lock, leading to a potential deadlock.
>> > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
>> > protect `default_dram_perf_*`. This approach not only avoids deadlock
>> > but also prevents holding a large lock simultaneously.
>> >
>> > * Upgrade `set_node_memory_tier` to support additional cases, including
>> >   default DRAM, late CPUless, and hot-plugged initializations.
>> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
>> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
>> > handle cases where memtype is not initialized and where HMAT information is
>> > available.
>> >
>> > * Introduce `default_memory_types` for those memory types that are not
>> >   initialized by device drivers.
>> > Because late initialized memory and default DRAM memory need to be managed,
>> > a default memory type is created for storing all memory types that are
>> > not initialized by device drivers and as a fallback.
>> >
>> > Signed-off-by: Ho-Ren (Jack) Chuang 
>> > Signed-off-by: Hao Xiang 
>> > Reviewed-by: "Huang, Ying" 
>>
>> Hi - one remaining question. Why can't we delay init for all nodes
>> to either drivers or your fallback late_initcall code.
>> It would be nice to reduce possible code paths.
>
> I try not to change too much of the existing code structure in
> this patchset.
>
> To me, postponing/moving all memory tier registrations to
> late_initcall() is another possible action item for the next patchset.
>
> After tier_mem(), hmat_init() is called, which requires registering
> `default_dram_type` info. This is when `default_dram_type` is needed.
> However, it is indeed possible to postpone the latter part,
> set_node_memory_tier(), to `late_init(). So, memory_tier_init() can
> indeed be split into two parts, and the latter part can be moved to
> late_initcall() to be processed together.

I don't think that it's good to move all memory_tier initialization in
drivers to late_initcall().  It's natural to keep them in
device_initcall() level.

If so, we can allocate default_dram_type in memory_tier_init(), and call
set_node_memory_tier() only in memory_tier_lateinit().  We can call
memory_tier_lateinit() in device_initcall() level too.

--
Best Regards,
Huang, Ying

> Doing this all memory-type drivers have to call late_initcall() to
> register a memory tier. I’m not sure how many they are?
>
> What do you guys think?
>
>>
>> Jonathan
>>
>>
>> > ---
>> >  mm/memory-tiers.c | 94 +++
>> >  1 file changed, 70 insertions(+), 24 deletions(-)
>> >
>> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> > index 516b144fd45a..6632102bd5c9 100644
>> > --- a/mm/memory-tiers.c
>> > +++ b/mm/memory-tiers.c
>>
>>
>>
>> > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void)
>> >* For now we can have 4 faster memory tiers with smaller adistance
>> >* than default DRAM tier.
>> >*/
>> > - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
>> > + default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
>> > +   &default_memory_types);
>> >   if (IS_ERR(default_dram_type))
>> >   panic("%s() failed to allocate default DRAM tier\n", 
>> > __func__);
>> >
>> > @@ -865,6 +903,14 @@ static int __init memory_ti

Re: [PATCHv3 1/2] dt-bindings: usb: typec: anx7688: start a binding document

2024-04-09 Thread Ondřej Jirman

On Mon, Apr 08, 2024 at 10:12:30PM GMT, Krzysztof Kozlowski wrote:
> On 08/04/2024 17:17, Ondřej Jirman wrote:
> > 
> > Now for things to not fail during suspend/resume based on PM callbacks
> > invocation order, anx7688 driver needs to enable this regulator too, as long
> > as it needs it.
> 
> No, the I2C bus driver needs to manage it. Not one individual I2C
> device. Again, why anx7688 is specific? If you next phone has anx8867,
> using different driver, you also add there i2c-supply? And if it is
> nxp,ptn5100 as well?

Yes, that could work, if I2C core would manage this.

> > 
> > I can put bus-supply to I2C controller node, and read it from the ANX7688 
> > driver
> > I guess, by going up a DT node. Whether that's going to be acceptable, I 
> > don't
> > know. 
> > 
> > 
> > VCONN regulator I don't know where else to put either. It doesn't seem to 
> > belong
> > anywhere. It's not something directly connected to Type-C connector, so
> > not part of connector bindings, and there's nothing else I can see, other
> > than anx7688 device which needs it for core functionality.
> 
> That sounds like a GPIO, not regulator. anx7688 has GPIOs, right? On
> Pinephone they go to regulator, but on FooPhone also using anx7688 they
> go somewhere else, so why this anx7688 assumes this is a regulator?

CC1/CC2_VCONN control pins are "GPIO" of anx7688, sort of. They have fixed
purpose of switching external 5V regulator output to one of the CC pins
on type-c port. I don't care what other purpose with some other firmware
someone puts to those pins. It's irrelevant to the use case of anx7688
as a type-c controller/HDMI bridge, which we're describing here.

VCONN regulator is an actual GPIO controlled regulator on the board, and
needs to be controlled by the anx7688 driver. So that CC1/CC2_VCONN control
pins driven by the firmware actually do what they're supposed to do.

Not sure why it would be a business of anything else but anx7688 driver
enabling this regulator, because only this driver knows and cares about this.
If some other board doesn't have the need to manually enable the regulator, or
doesn't have the regulator, it can simply be optional.

There are also some other funky supplies in the bindings, that are not connected
to the chip in any way, but need to be controlled by the driver:

+  vbus-supply: true
+  vbus-in-supply: true

First one can be on the connector node instead, where the driver can fetch
it from.

The purpose of the second one is to link the Phone's PMIC's USB power input with
the type-c controller (anx7688), to make sure the PMIC has information about how
much power it can draw from external PSU.

The second one can be replaced by rewriting the anx7688 driver so that it
creates a power supply representing the USB PSU connected to the phone, and by
linking to anx7688 DT node from x-powers,axp813-usb-power-supply via
a power-supplies property, which would mean that USB input of the phone is
supplied by the external USB PSU. PMIC driver can be modified to watch
the power supply provided by anx7688 driver for information it detected
via USB-PD and update its input current limit via a standard helper function.

This is how eg. fusb302 works. Not sure if that's any better from the PoV of
DT bindings, though. Because power-supplies = <&anx7688>; will not look any
greater in DT bindings, IMO. It will just link the same nodes in the other
direction.

Anyway, the HW is that there's an external PSU (detected by type c controller)
and internal USB power input, and they are connected and one has to respect the
limits of the other. I guess I shouldn't be adding a device node for external 
PSU,
since it's not part of the phone. But that's what's trully being linked on
HW level.

Kind regards,
o.

> 
> > 
> > ANX7688 chip desing doesn't have integrated VCONN mosfet switches so it 
> > always
> > needs external supply + switches that are controlled by the chip itself. 
> > There's
> > no sensible design where someone would not want this and the driver needs
> > to get this regulator reference from somewhere. The switches are sort of an
> > extension of the chip.
> 
> 
> Best regards,
> Krzysztof
>

[PATCH v3] kprobes: Fix possible use-after-free issue on kprobe registration

2024-04-09 Thread Zheng Yejian

When unloading a module, its state is changing MODULE_STATE_LIVE ->
 MODULE_STATE_GOING -> MODULE_STATE_UNFORMED. Each change will take
a time. `is_module_text_address()` and `__module_text_address()`
works with MODULE_STATE_LIVE and MODULE_STATE_GOING.
If we use `is_module_text_address()` and `__module_text_address()`
separately, there is a chance that the first one is succeeded but the
next one is failed because module->state becomes MODULE_STATE_UNFORMED
between those operations.

In `check_kprobe_address_safe()`, if the second `__module_text_address()`
is failed, that is ignored because it expected a kernel_text address.
But it may have failed simply because module->state has been changed
to MODULE_STATE_UNFORMED. In this case, arm_kprobe() will try to modify
non-exist module text address (use-after-free).

To fix this problem, we should not use separated `is_module_text_address()`
and `__module_text_address()`, but use only `__module_text_address()`
once and do `try_module_get(module)` which is only available with
MODULE_STATE_LIVE.

Signed-off-by: Zheng Yejian 
---
 kernel/kprobes.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

v3:
 - Update commit messages as suggested by Masami.
   Link: 
https://lore.kernel.org/all/20240409224922.5f192e8ace5f7a90937bf...@kernel.org/
 - Also change to a more appropriate title.

v2:
 - Update commit messages and comments as suggested by Masami.
   Link: 
https://lore.kernel.org/all/20240408115038.b0c85767bf1f249eccc32...@kernel.org/
 - Link: 
https://lore.kernel.org/all/20240408083403.3302274-1-zhengyeji...@huawei.com/

v1:
 - Link: 
https://lore.kernel.org/all/20240407035904.2556645-1-zhengyeji...@huawei.com/

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..65adc815fc6e 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1567,10 +1567,17 @@ static int check_kprobe_address_safe(struct kprobe *p,
jump_label_lock();
preempt_disable();
 
-   /* Ensure it is not in reserved area nor out of text */
-   if (!(core_kernel_text((unsigned long) p->addr) ||
-   is_module_text_address((unsigned long) p->addr)) ||
-   in_gate_area_no_mm((unsigned long) p->addr) ||
+   /* Ensure the address is in a text area, and find a module if exists. */
+   *probed_mod = NULL;
+   if (!core_kernel_text((unsigned long) p->addr)) {
+   *probed_mod = __module_text_address((unsigned long) p->addr);
+   if (!(*probed_mod)) {
+   ret = -EINVAL;
+   goto out;
+   }
+   }
+   /* Ensure it is not in reserved area. */
+   if (in_gate_area_no_mm((unsigned long) p->addr) ||
within_kprobe_blacklist((unsigned long) p->addr) ||
jump_label_text_reserved(p->addr, p->addr) ||
static_call_text_reserved(p->addr, p->addr) ||
@@ -1580,8 +1587,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
-   /* Check if 'p' is probing a module. */
-   *probed_mod = __module_text_address((unsigned long) p->addr);
+   /* Get module refcount and reject __init functions for loaded modules. 
*/
if (*probed_mod) {
/*
 * We must hold a refcount of the probed module while updating
-- 
2.25.1

Re: [PATCH v2] kprobes: Avoid possible warn in __arm_kprobe_ftrace()

2024-04-09 Thread Zheng Yejian


On 2024/4/9 21:49, Masami Hiramatsu (Google) wrote:

On Tue, 9 Apr 2024 14:20:45 +0800
Zheng Yejian  wrote:


On 2024/4/8 20:41, Masami Hiramatsu (Google) wrote:

Hi Zheng,

On Mon, 8 Apr 2024 16:34:03 +0800
Zheng Yejian  wrote:


There is once warn in __arm_kprobe_ftrace() on:

   ret = ftrace_set_filter_ip(ops, (unsigned long)p->addr, 0, 0);
   if (WARN_ONCE(..., "Failed to arm kprobe-ftrace at %pS (error %d)\n", ...)
 return ret;

This warning is generated because 'p->addr' is detected to be not a valid
ftrace location in ftrace_set_filter_ip(). The ftrace address check is done
by check_ftrace_location() at the beginning of check_kprobe_address_safe().
At that point, ftrace_location(addr) == addr should return true if the
module is loaded. Then the module is searched twice:
1. in is_module_text_address(), we find that 'p->addr' is in a module;
2. in __module_text_address(), we find the module;

If the module has just been unloaded before the second search, then
'*probed_mod' is NULL and we would not go to get the module refcount,
then the return value of check_kprobe_address_safe() would be 0, but
actually we need to return -EINVAL.


OK, so you found a race window in check_kprobe_address_safe().

It does something like below.

check_kprobe_address_safe() {
...

/* Timing [A] */

if (!(core_kernel_text(p->addr) ||
is_module_text_address(p->addr)) ||
...(other reserved address check)) {
return -EINVAL;
}

/* Timing [B] */

*probed_mod = __module_text_address(p->addr):
if (*probe_mod) {
if (!try_module_get(*probed_mod)) {
return -ENOENT;
}
... 
}
}

So, if p->addr is in a module which is alive at the timing [A], but
unloaded at timing [B], 'p->addr' is passed the
'is_module_text_address(p->addr)' check, but *probed_mod becomes NULL.
Thus the corresponding module is not referenced and kprobe_arm(p) will
access a wrong address (use after free).
This happens either kprobe on ftrace is enabled or not.


Yes, This is the problem. And for this case, check_kprobe_address_safe()
still return 0, and then going on to arm kprobe may cause problems. So
we should make check_kprobe_address_safe() return -EINVAL when refcount
of the module is not got.


Yes,





To fix this problem, we should move the mutex_lock(kprobe_mutex) before
check_kprobe_address_safe() because kprobe_module_callback() also lock it
so it can stop module unloading.

Can you ensure this will fix your problem?


It seems not, the warning in __arm_kprobe_ftrace() still occurs. I
contrived following simple test:

  #!/bin/bash
  sysctl -w kernel.panic_on_warn=1
  while [ True ]; do
  insmod mod.ko# contain function 'foo'
  rmmod mod.ko
  done &
  while [ True ]; do
  insmod kprobe.ko  # register kprobe on function 'foo'
  rmmod kprobe.ko
  done &

I think holding kprobe_mutex cannot make sure we get the refcount of the
module.


Aah, yes, it cannot, because the kallsyms in a module will be removed
after module->state becomes MODULE_STATE_UNFORMED. Before UNFORMED,
the state is MODULE_STATE_GOING and the kprobe_module_callback() is
called at that point. Thus, the following scenario happens.

  CPU1  CPU2

mod->state = MODULE_STATE_GOING
kprobe_module_callback() {
mutex_lock(&kprobe_mutex)
loop on kprobe_table
to disable kprobe in the module.
mutex_unlock(&kprobe_mutex)
}
register_kprobe(p) {

mutex_lock(&kprobe_mutex)

check_kprobe_address_safe(p->addr) {
[A'']

is_module_text_address() return true
until 
mod->state == UNFORMED.
mod->state = MODULE_STATE_UNFORMED
[B'']

__module_text_address() returns NULL.
}
p is on the 
kprobe_table.

mutex_unlock(&kprobe_mutex)

So, as your fix, if we save the module at [A''] and use it at [B''],
the mod is NOT able to get because mod->state != MODULE_STATE_LIVE.





I think your patch is just optimizing but not fixing the fundamental
problem, which is we don't have an atomic search symbol and get module


Sorry, this patch is a little confusing, but it is not just optimizing :)

As shown below, after my patch, if p->addr is in

Re: [PATCH] ring-buffer: Only update pages_touched when a new page is touched

2024-04-09 Thread Steven Rostedt

On Wed, 10 Apr 2024 08:44:00 +0900
Masami Hiramatsu (Google)  wrote:

> Looks good to me.
> 
> Acked-by: Masami Hiramatsu (Google) 

Thanks.

> 
> BTW, isn't this a real bugfix, because the page_touched can be
> bigger than nr_pages without this fix?

Yes, I simply forgot to add the Cc stable.

-- Steve

Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL

2024-04-09 Thread Gregory Price

On Mon, Apr 08, 2024 at 10:41:04PM +0900, Honggyu Kim wrote:
> Hi Gregory,
> 
> On Fri, 5 Apr 2024 12:56:14 -0400 Gregory Price  
> wrote:
> > Do you have test results which enable only DAMOS_MIGRATE_COLD actions
> > but not DAMOS_MIGRATE_HOT actions? (and vice versa)
> > 
> > The question I have is exactly how often is MIGRATE_HOT actually being
> > utilized, and how much data is being moved. Testing MIGRATE_COLD only
> > would at least give a rough approximation of that.
> 
> To explain this, I better share more test results.  In the section of
> "Evaluation Workload", the test sequence can be summarized as follows.
> 
>   *. "Turn on DAMON."
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>  process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>  (85GB consumed: 52GB for anon and 33GB for file cache)

Aha! I see now, you are allocating memory to ensure the real workload
(redis-server) pressures the DRAM tier and causes "spillage" to the CXL
tier, and then measure the overhead in different scenarios.

I would still love to know what the result of a demote-only system would
produce, mosty because it would very clearly demonstrate the value of
the demote+promote system when the system is memory-pressured.

Given the additional results below, it shows a demote-only system would
likely trend toward CXL-only, and so this shows an affirmative support
for the promotion logic.

Just another datum that is useful and paints a more complete picture.

> I didn't want to make the evaluation too long in the cover letter, but
> I have also evaluated another senario, which lazyly enabled DAMON just
> before YCSB run at step 4.  I will call this test as "DAMON lazy".  This
> is missing part from the cover letter.
> 
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>  process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>  (85GB consumed: 52GB for anon and 33GB for file cache)
>   *. "Turn on DAMON."
> 
> In the "DAMON lazy" senario, DAMON started monitoring late so the
> initial redis-server placement is same as "default", but started to
> demote cold data and promote redis data just before YCSB run.
>

This is excellent and definitely demonstrates part of the picture I was
alluding to, thank you for the additional data!

> 
> I have included "DAMON lazy" result and also the migration size by new
> DAMOS migrate actions.  Please note that demotion size is way higher
> than promotion because promotion target is only for redis data, but
> demotion target includes huge cold memory allocated by mmap + memset.
> (there could be some ping-pong issue though.)
> 
> As you mentioned, "DAMON tiered" case gets more benefit because new
> redis allocations go to DRAM more than "default", but it also gets
> benefit from promotion when it is under higher memory pressure as shown
> in 490G and 500G cases.  It promotes 22GB and 17GB of redis data to DRAM
> from CXL.

I think a better way of saying this is that "DAMON tiered" more
effectively mitigates the effect of memory-pressure on faster tier
before spillage occurs, while "DAMON lazy" demonstrates the expected
performance of the system after memory pressure outruns the demotion
logic, so you wind up with hot data stuck in the slow tier.

There are some out there that would simply say "just demote more
aggressively", so this is useful information for the discussion.

+/- ~2% despite greater meomry migration is an excellent result

> > Can you also provide the DRAM-only results for each test?  Presumably,
> > as workload size increases from 440G to 500G, the system probably starts
> > using some amount of swap/zswap/whatever.  It would be good to know how
> > this system compares to swap small amounts of overflow.
> 
> It looks like my explanation doesn't correctly inform you.   The size
> from 440GB to 500GB is for pre allocated cold data to give memory
> pressure on the system so that redis-server cannot be fully allocated at
> fast DRAM, then partially allocated at CXL memory as well.
> 

Yes, sorry for the misunderstanding.  This makes it much clearer.

> 
> I hope my explanation is helpful for you to understand.  Please let me
> know if you have more questions.
>

Excellent work, exciting results! Thank you for the additional answers
:]

~Gregory

Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Google

On Tue, 9 Apr 2024 16:45:47 +0200
Marco Elver  wrote:

> On Tue, 9 Apr 2024 at 16:31, Steven Rostedt  wrote:
> >
> > On Mon,  8 Apr 2024 11:01:54 +0200
> > Marco Elver  wrote:
> >
> > > Add "new_exec" tracepoint, which is run right after the point of no
> > > return but before the current task assumes its new exec identity.
> > >
> > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint
> > > runs before flushing the old exec, i.e. while the task still has the
> > > original state (such as original MM), but when the new exec either
> > > succeeds or crashes (but never returns to the original exec).
> > >
> > > Being able to trace this event can be helpful in a number of use cases:
> > >
> > >   * allowing tracing eBPF programs access to the original MM on exec,
> > > before current->mm is replaced;
> > >   * counting exec in the original task (via perf event);
> > >   * profiling flush time ("new_exec" to "sched_process_exec").
> > >
> > > Example of tracing output ("new_exec" and "sched_process_exec"):
> >
> > How common is this? And can't you just do the same with adding a kprobe?
> 
> Our main use case would be to use this in BPF programs to become
> exec-aware, where using the sched_process_exec hook is too late. This
> is particularly important where the BPF program must stop inspecting
> the user space's VM when the task does exec to become a new process.

Just out of curiousity, would you like to audit that the user-program
is not malformed? (security tracepoint?) I think that is an interesting
idea. What kind of information you need?

> 
> kprobe (or BPF's fentry) is brittle here, because begin_new_exec()'s
> permission check can still return an error which returns to the
> original task without crashing. Only at the point of no return are we
> guaranteed that the exec either succeeds, or the task is terminated on
> failure.

Just a note: That is BPF limitation, kprobe and kprobe events can put
a probe in the function body, but that is not supported on BPF (I guess
because it depends on kernel debuginfo.) You can add kprobe-event using
"perf probe" tool.

Thank you,

> 
> I don't know if "common" is the right question here, because it's a
> chicken-egg problem: no tracepoint, we give up; we have the
> tracepoint, it unlocks a range of new use cases (that require robust
> solution to make BPF programs exec-aware, and a tracepoint is the only
> option IMHO).
> 
> Thanks,
> -- Marco


-- 
Masami Hiramatsu (Google)

Re: [PATCH] ring-buffer: Only update pages_touched when a new page is touched

2024-04-09 Thread Google

On Tue, 9 Apr 2024 15:13:09 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> The "buffer_percent" logic that is used by the ring buffer splice code to
> only wake up the tasks when there's no data after the buffer is filled to
> the percentage of the "buffer_percent" file is dependent on three
> variables that determine the amount of data that is in the ring buffer:
> 
>  1) pages_read - incremented whenever a new sub-buffer is consumed
>  2) pages_lost - incremented every time a writer overwrites a sub-buffer
>  3) pages_touched - incremented when a write goes to a new sub-buffer
> 
> The percentage is the calculation of:
> 
>   (pages_touched - (pages_lost + pages_read)) / nr_pages
> 
> Basically, the amount of data is the total number of sub-bufs that have been
> touched, minus the number of sub-bufs lost and sub-bufs consumed. This is
> divided by the total count to give the buffer percentage. When the
> percentage is greater than the value in the "buffer_percent" file, it
> wakes up splice readers waiting for that amount.
> 
> It was observed that over time, the amount read from the splice was
> constantly decreasing the longer the trace was running. That is, if one
> asked for 60%, it would read over 60% when it first starts tracing, but
> then it would be woken up at under 60% and would slowly decrease the
> amount of data read after being woken up, where the amount becomes much
> less than the buffer percent.
> 
> This was due to an accounting of the pages_touched incrementation. This
> value is incremented whenever a writer transfers to a new sub-buffer. But
> the place where it was incremented was incorrect. If a writer overflowed
> the current sub-buffer it would go to the next one. If it gets preempted
> by an interrupt at that time, and the interrupt performs a trace, it too
> will end up going to the next sub-buffer. But only one should increment
> the counter. Unfortunately, that was not the case.
> 
> Change the cmpxchg() that does the real switch of the tail-page into a
> try_cmpxchg(), and on success, perform the increment of pages_touched. This
> will only increment the counter once for when the writer moves to a new
> sub-buffer, and not when there's a race and is incremented for when a
> writer and its preempting writer both move to the same new sub-buffer.
> 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) 

BTW, isn't this a real bugfix, because the page_touched can be
bigger than nr_pages without this fix?

Thank you,

> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/ring_buffer.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 25476ead681b..6511dc3a00da 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1393,7 +1393,6 @@ static void rb_tail_page_update(struct 
> ring_buffer_per_cpu *cpu_buffer,
>   old_write = local_add_return(RB_WRITE_INTCNT, &next_page->write);
>   old_entries = local_add_return(RB_WRITE_INTCNT, &next_page->entries);
>  
> - local_inc(&cpu_buffer->pages_touched);
>   /*
>* Just make sure we have seen our old_write and synchronize
>* with any interrupts that come in.
> @@ -1430,8 +1429,9 @@ static void rb_tail_page_update(struct 
> ring_buffer_per_cpu *cpu_buffer,
>*/
>   local_set(&next_page->page->commit, 0);
>  
> - /* Again, either we update tail_page or an interrupt does */
> - (void)cmpxchg(&cpu_buffer->tail_page, tail_page, next_page);
> + /* Either we update tail_page or an interrupt does */
> + if (try_cmpxchg(&cpu_buffer->tail_page, &tail_page, next_page))
> + local_inc(&cpu_buffer->pages_touched);
>   }
>  }
>  
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google)

Re: Copying TLS/user register data per perf-sample?

2024-04-09 Thread Namhyung Kim

Hello,

On Thu, Apr 4, 2024 at 12:26 PM Beau Belgrave  wrote:
>
> Hello,
>
> I'm looking into the possibility of capturing user data that is pointed
> to by a user register (IE: fs/gs for TLS on x86/64) for each sample via
> perf_events.
>
> I was hoping to find a way to do this similar to PERF_SAMPLE_STACK_USER.
> I think it could even use roughly the same ABI in the perf ring buffer.
> Or it may be possible by some kprobe linked to the perf sample function.
>
> This would allow a profiler to collect TLS (or other values) on x64. In
> the Open Telemetry profiling SIG [1], we are trying to find a fast way
> to grab a tracing association quickly on a per-thread basis. The team
> at Elastic has a bespoke way to do this [2], however, I'd like to see a
> more general way to achieve this. The folks I've been talking with seem
> open to the idea of just having a TLS value for this we could capture
> upon each sample. We could then just state, Open Telemetry SDKs should
> have a TLS value for span correlation. However, we need a way to sample
> the TLS value(s) when a sampling event is generated.
>
> Is this already possible via some other means? It'd be great to be able
> to do this directly at the perf_event sample via the ABI or a probe.

I don't think the current perf ABI allows capturing %fs/%gs + offset.
IIRC kprobes/uprobes don't have that too but I could be wrong.

Thanks,
Namhyung

>
> 1. https://opentelemetry.io/blog/2024/profiling/
> 2. 
> https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation

Re: [External] Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-09 Thread Ho-Ren (Jack) Chuang

On Tue, Apr 9, 2024 at 2:50 PM Andrew Morton  wrote:
>
> On Tue, 9 Apr 2024 12:00:06 -0700 "Ho-Ren (Jack) Chuang" 
>  wrote:
>
> > Hi Jonathan,
> >
> > On Fri, Apr 5, 2024 at 6:56 AM Jonathan Cameron
> >  wrote:
> > >
> > > On Fri,  5 Apr 2024 00:07:05 +
> > > "Ho-Ren (Jack) Chuang"  wrote:
> > >
> > > > Since different memory devices require finding, allocating, and putting
> > > > memory types, these common steps are abstracted in this patch,
> > > > enhancing the scalability and conciseness of the code.
> > > >
> > > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > > Reviewed-by: "Huang, Ying" 
> > > Reviewed-by: Jonathan Cameron 
> > >
> > Thank you for reviewing and for adding your "Reviewed-by"!
> > I was wondering if I need to send a v12 and manually add
> > this to the commit description, or if this is sufficient.
>
> I had added Jonathan's r-b to the mm.git copy of this patch.

Got it~ Thank you Andrew!

-- 
Best regards,
Ho-Ren (Jack) Chuang
莊賀任

Re: [External] Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-09 Thread Andrew Morton

On Tue, 9 Apr 2024 12:00:06 -0700 "Ho-Ren (Jack) Chuang" 
 wrote:

> Hi Jonathan,
> 
> On Fri, Apr 5, 2024 at 6:56 AM Jonathan Cameron
>  wrote:
> >
> > On Fri,  5 Apr 2024 00:07:05 +
> > "Ho-Ren (Jack) Chuang"  wrote:
> >
> > > Since different memory devices require finding, allocating, and putting
> > > memory types, these common steps are abstracted in this patch,
> > > enhancing the scalability and conciseness of the code.
> > >
> > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > Reviewed-by: "Huang, Ying" 
> > Reviewed-by: Jonathan Cameron 
> >
> Thank you for reviewing and for adding your "Reviewed-by"!
> I was wondering if I need to send a v12 and manually add
> this to the commit description, or if this is sufficient.

I had added Jonathan's r-b to the mm.git copy of this patch.

Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Kees Cook

On Tue, Apr 09, 2024 at 08:25:45PM +0200, Marco Elver wrote:
> On Tue, Apr 09, 2024 at 08:46AM -0700, Kees Cook wrote:
> [...]
> > > + trace_new_exec(current, bprm);
> > > +
> > 
> > All other steps in this function have explicit comments about
> > what/why/etc. Please add some kind of comment describing why the
> > tracepoint is where it is, etc.
> 
> I beefed up the tracepoint documentation, and wrote a little paragraph
> above where it's called to reinforce what we want.
> 
> [...]
> > What about binfmt_misc, and binfmt_script? You may want bprm->interp
> > too?
> 
> Good points. I'll make the below changes for v2:
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index ab778ae1fc06..472b9f7b40e8 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1268,6 +1268,12 @@ int begin_new_exec(struct linux_binprm * bprm)
>   if (retval)
>   return retval;
>  
> + /*
> +  * This tracepoint marks the point before flushing the old exec where
> +  * the current task is still unchanged, but errors are fatal (point of
> +  * no return). The later "sched_process_exec" tracepoint is called after
> +  * the current task has successfully switched to the new exec.
> +  */
>   trace_new_exec(current, bprm);
>  
>   /*
> diff --git a/include/trace/events/task.h b/include/trace/events/task.h
> index 8853dc44783d..623d9af777c1 100644
> --- a/include/trace/events/task.h
> +++ b/include/trace/events/task.h
> @@ -61,8 +61,11 @@ TRACE_EVENT(task_rename,
>   * @task:pointer to the current task
>   * @bprm:pointer to linux_binprm used for new exec
>   *
> - * Called before flushing the old exec, but at the point of no return during
> - * switching to the new exec.
> + * Called before flushing the old exec, where @task is still unchanged, but 
> at
> + * the point of no return during switching to the new exec. At the point it 
> is
> + * called the exec will either succeed, or on failure terminate the task. 
> Also
> + * see the "sched_process_exec" tracepoint, which is called right after @task
> + * has successfully switched to the new exec.
>   */
>  TRACE_EVENT(new_exec,
>  
> @@ -71,19 +74,22 @@ TRACE_EVENT(new_exec,
>   TP_ARGS(task, bprm),
>  
>   TP_STRUCT__entry(
> + __string(   interp, bprm->interp)
>   __string(   filename,   bprm->filename  )
>   __field(pid_t,  pid )
>   __string(   comm,   task->comm  )
>   ),
>  
>   TP_fast_assign(
> + __assign_str(interp, bprm->interp);
>   __assign_str(filename, bprm->filename);
>   __entry->pid = task->pid;
>   __assign_str(comm, task->comm);
>   ),
>  
> - TP_printk("filename=%s pid=%d comm=%s",
> -   __get_str(filename), __entry->pid, __get_str(comm))
> + TP_printk("interp=%s filename=%s pid=%d comm=%s",
> +   __get_str(interp), __get_str(filename),
> +   __entry->pid, __get_str(comm))
>  );
>  
>  #endif

Looks good; I await v2, and Steven's Ack. :)

-- 
Kees Cook

Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

2024-04-09 Thread David Hildenbrand


On 09.04.24 20:31, James Houghton wrote:

Ah, I didn't see this in my inbox, sorry David!


No worries :)



On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand  wrote:


On 02.04.24 01:29, James Houghton wrote:

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f349e08a9dfe..daaa9db625d3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,10 @@ enum mmu_notifier_event {

   #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)

+#define MMU_NOTIFIER_YOUNG   (1 << 0)
+#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)


Especially this one really deserves some documentation :)


Yes, will do. Something like

 MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in
bitmap either (1) does not accurately represent the age of the pages
(in the case of test_young), or (2) was not able to be used to
completely clear the age/access bit (in the case of clear_young).


Make sense. I do wonder what the expected reaction from the caller is :)






+#define MMU_NOTIFIER_YOUNG_FAST  (1 << 2)


And that one as well.


Something like

Indicates that (1) passing a bitmap ({test,clear}_young_bitmap)
would have been supported for this address range.

The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM
is able to harvest the access bit "fast" (so for x86, locklessly, and
for arm64, with the KVM MMU read lock), "fast" enough that using a
bitmap to do look-around is probably a good idea.


Is that really the right way to communicate that ("would have been 
supported") -- wouldn't we want to sense support differently?






Likely best to briefly document all of them, and how they are
supposed to be used (return value for X).


Right. Will do.




+
   struct mmu_notifier_ops {
   /*
* Called either by mmu_notifier_unregister or when the mm is
@@ -106,21 +110,36 @@ struct mmu_notifier_ops {
* clear_young is a lightweight version of clear_flush_young. Like the
* latter, it is supposed to test-and-clear the young/accessed bitflag
* in the secondary pte, but it may omit flushing the secondary tlb.
+  *
+  * If @bitmap is given but is not supported, return
+  * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+  *
+  * If the walk is done "quickly" and there were young PTEs,
+  * MMU_NOTIFIER_YOUNG_FAST is returned.
*/
   int (*clear_young)(struct mmu_notifier *subscription,
  struct mm_struct *mm,
  unsigned long start,
-unsigned long end);
+unsigned long end,
+unsigned long *bitmap);

   /*
* test_young is called to check the young/accessed bitflag in
* the secondary pte. This is used to know if the page is
* frequently used without actually clearing the flag or tearing
* down the secondary mapping on the page.
+  *
+  * If @bitmap is given but is not supported, return
+  * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+  *
+  * If the walk is done "quickly" and there were young PTEs,
+  * MMU_NOTIFIER_YOUNG_FAST is returned.
*/
   int (*test_young)(struct mmu_notifier *subscription,
 struct mm_struct *mm,
-   unsigned long address);
+   unsigned long start,
+   unsigned long end,
+   unsigned long *bitmap);


What does "quickly" mean (why not use "fast")? What are the semantics, I
don't find any existing usage of that in this file.


"fast" means fast enough such that using a bitmap to scan adjacent
pages (e.g. with MGLRU) is likely to be beneficial. I'll write more in
this comment. Perhaps I should just rename it to
MMU_NOTIFIER_YOUNG_BITMAP_SUPPORTED and drop the whole "likely to be
beneficial" thing -- that's for MGLRU/etc. to decide really.


Yes!





Further, what is MMU_NOTIFIER_YOUNG you introduce used for?


MMU_NOTIFIER_YOUNG is the return value when the page was young, but we
(1) didn't use a bitmap, and (2) the "fast" access bit harvesting
wasn't possible. In this case we simply return 1, which is
MMU_NOTIFIER_YOUNG. I'll make kvm_mmu_notifier_test_clear_young()
properly return MMU_NOTIFIER_YOUNG instead of relying on the fact that
it will be 1.


Yes, that will clarify it!

--
Cheers,

David / dhildenb

[PATCH] ring-buffer: Only update pages_touched when a new page is touched

2024-04-09 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

The "buffer_percent" logic that is used by the ring buffer splice code to
only wake up the tasks when there's no data after the buffer is filled to
the percentage of the "buffer_percent" file is dependent on three
variables that determine the amount of data that is in the ring buffer:

 1) pages_read - incremented whenever a new sub-buffer is consumed
 2) pages_lost - incremented every time a writer overwrites a sub-buffer
 3) pages_touched - incremented when a write goes to a new sub-buffer

The percentage is the calculation of:

  (pages_touched - (pages_lost + pages_read)) / nr_pages

Basically, the amount of data is the total number of sub-bufs that have been
touched, minus the number of sub-bufs lost and sub-bufs consumed. This is
divided by the total count to give the buffer percentage. When the
percentage is greater than the value in the "buffer_percent" file, it
wakes up splice readers waiting for that amount.

It was observed that over time, the amount read from the splice was
constantly decreasing the longer the trace was running. That is, if one
asked for 60%, it would read over 60% when it first starts tracing, but
then it would be woken up at under 60% and would slowly decrease the
amount of data read after being woken up, where the amount becomes much
less than the buffer percent.

This was due to an accounting of the pages_touched incrementation. This
value is incremented whenever a writer transfers to a new sub-buffer. But
the place where it was incremented was incorrect. If a writer overflowed
the current sub-buffer it would go to the next one. If it gets preempted
by an interrupt at that time, and the interrupt performs a trace, it too
will end up going to the next sub-buffer. But only one should increment
the counter. Unfortunately, that was not the case.

Change the cmpxchg() that does the real switch of the tail-page into a
try_cmpxchg(), and on success, perform the increment of pages_touched. This
will only increment the counter once for when the writer moves to a new
sub-buffer, and not when there's a race and is incremented for when a
writer and its preempting writer both move to the same new sub-buffer.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 25476ead681b..6511dc3a00da 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1393,7 +1393,6 @@ static void rb_tail_page_update(struct 
ring_buffer_per_cpu *cpu_buffer,
old_write = local_add_return(RB_WRITE_INTCNT, &next_page->write);
old_entries = local_add_return(RB_WRITE_INTCNT, &next_page->entries);
 
-   local_inc(&cpu_buffer->pages_touched);
/*
 * Just make sure we have seen our old_write and synchronize
 * with any interrupts that come in.
@@ -1430,8 +1429,9 @@ static void rb_tail_page_update(struct 
ring_buffer_per_cpu *cpu_buffer,
 */
local_set(&next_page->page->commit, 0);
 
-   /* Again, either we update tail_page or an interrupt does */
-   (void)cmpxchg(&cpu_buffer->tail_page, tail_page, next_page);
+   /* Either we update tail_page or an interrupt does */
+   if (try_cmpxchg(&cpu_buffer->tail_page, &tail_page, next_page))
+   local_inc(&cpu_buffer->pages_touched);
}
 }
 
-- 
2.43.0

Re: [External] Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-09 Thread Ho-Ren (Jack) Chuang

Hi Jonathan,

On Tue, Apr 9, 2024 at 9:12 AM Jonathan Cameron
 wrote:
>
> On Fri, 5 Apr 2024 15:43:47 -0700
> "Ho-Ren (Jack) Chuang"  wrote:
>
> > On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
> >  wrote:
> > >
> > > On Fri,  5 Apr 2024 00:07:06 +
> > > "Ho-Ren (Jack) Chuang"  wrote:
> > >
> > > > The current implementation treats emulated memory devices, such as
> > > > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal 
> > > > memory
> > > > (E820_TYPE_RAM). However, these emulated devices have different
> > > > characteristics than traditional DRAM, making it important to
> > > > distinguish them. Thus, we modify the tiered memory initialization 
> > > > process
> > > > to introduce a delay specifically for CPUless NUMA nodes. This delay
> > > > ensures that the memory tier initialization for these nodes is deferred
> > > > until HMAT information is obtained during the boot process. Finally,
> > > > demotion tables are recalculated at the end.
> > > >
> > > > * late_initcall(memory_tier_late_init);
> > > > Some device drivers may have initialized memory tiers between
> > > > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> > > > online memory nodes and configuring memory tiers. They should be 
> > > > excluded
> > > > in the late init.
> > > >
> > > > * Handle cases where there is no HMAT when creating memory tiers
> > > > There is a scenario where a CPUless node does not provide HMAT 
> > > > information.
> > > > If no HMAT is specified, it falls back to using the default DRAM tier.
> > > >
> > > > * Introduce another new lock `default_dram_perf_lock` for adist 
> > > > calculation
> > > > In the current implementation, iterating through CPUlist nodes requires
> > > > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end 
> > > > up
> > > > trying to acquire the same lock, leading to a potential deadlock.
> > > > Therefore, we propose introducing a standalone `default_dram_perf_lock` 
> > > > to
> > > > protect `default_dram_perf_*`. This approach not only avoids deadlock
> > > > but also prevents holding a large lock simultaneously.
> > > >
> > > > * Upgrade `set_node_memory_tier` to support additional cases, including
> > > >   default DRAM, late CPUless, and hot-plugged initializations.
> > > > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> > > > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> > > > handle cases where memtype is not initialized and where HMAT 
> > > > information is
> > > > available.
> > > >
> > > > * Introduce `default_memory_types` for those memory types that are not
> > > >   initialized by device drivers.
> > > > Because late initialized memory and default DRAM memory need to be 
> > > > managed,
> > > > a default memory type is created for storing all memory types that are
> > > > not initialized by device drivers and as a fallback.
> > > >
> > > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > > Signed-off-by: Hao Xiang 
> > > > Reviewed-by: "Huang, Ying" 
> > >
> > > Hi - one remaining question. Why can't we delay init for all nodes
> > > to either drivers or your fallback late_initcall code.
> > > It would be nice to reduce possible code paths.
> >
> > I try not to change too much of the existing code structure in
> > this patchset.
> >
> > To me, postponing/moving all memory tier registrations to
> > late_initcall() is another possible action item for the next patchset.
> >
> > After tier_mem(), hmat_init() is called, which requires registering
> > `default_dram_type` info. This is when `default_dram_type` is needed.
> > However, it is indeed possible to postpone the latter part,
> > set_node_memory_tier(), to `late_init(). So, memory_tier_init() can
> > indeed be split into two parts, and the latter part can be moved to
> > late_initcall() to be processed together.
> >
> > Doing this all memory-type drivers have to call late_initcall() to
> > register a memory tier. I’m not sure how many they are?
> >
> > What do you guys think?
>
> Gut feeling - if you are going to move it for some cases, move it for
> all of them.  Then we only have to test once ;)
>
> J

Thank you for your reminder! I agree~ That's why I'm considering
changing them in the next patchset because of the amount of changes.
And also, this patchset already contains too many things.

> >
> > >
> > > Jonathan
> > >
> > >
> > > > ---
> > > >  mm/memory-tiers.c | 94 +++
> > > >  1 file changed, 70 insertions(+), 24 deletions(-)
> > > >
> > > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > > > index 516b144fd45a..6632102bd5c9 100644
> > > > --- a/mm/memory-tiers.c
> > > > +++ b/mm/memory-tiers.c
> > >
> > >
> > >
> > > > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void)
> > > >* For now we can have 4 faster memory tiers with smaller 
> > > > adistance
> > > >* than default DRAM tier.
> > > >*/
> > > > - default_dram_

Re: [External] Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-09 Thread Ho-Ren (Jack) Chuang

Hi Jonathan,

On Fri, Apr 5, 2024 at 6:56 AM Jonathan Cameron
 wrote:
>
> On Fri,  5 Apr 2024 00:07:05 +
> "Ho-Ren (Jack) Chuang"  wrote:
>
> > Since different memory devices require finding, allocating, and putting
> > memory types, these common steps are abstracted in this patch,
> > enhancing the scalability and conciseness of the code.
> >
> > Signed-off-by: Ho-Ren (Jack) Chuang 
> > Reviewed-by: "Huang, Ying" 
> Reviewed-by: Jonathan Cameron 
>
Thank you for reviewing and for adding your "Reviewed-by"!
I was wondering if I need to send a v12 and manually add
this to the commit description, or if this is sufficient.

-- 
Best regards,
Ho-Ren (Jack) Chuang
莊賀任

Re: [PATCH v2 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name

2024-04-09 Thread Dmitry Baryshkov

On Tue, 9 Apr 2024 at 21:37, Luca Weiss  wrote:
>
> Follow the gpio-hog bindings and use otg-hog as node name.
>
> Signed-off-by: Luca Weiss 
> ---
>  arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>

Reviewed-by: Dmitry Baryshkov 

-- 
With best wishes
Dmitry

[PATCH v2 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name

2024-04-09 Thread Luca Weiss

Follow the gpio-hog bindings and use otg-hog as node name.

Signed-off-by: Luca Weiss 
---
 arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts 
b/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts
index 4aaae8537a3f..06549051be50 100644
--- a/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts
+++ b/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts
@@ -328,7 +328,7 @@ wlan_regulator_pin: wl-reg-active-state {
power-source = ;
};
 
-   otg {
+   otg-hog {
gpio-hog;
gpios = <35 GPIO_ACTIVE_HIGH>;
output-high;

-- 
2.44.0

[PATCH v2 1/2] dt-bindings: pinctrl: qcom,pmic-gpio: Allow gpio-hog nodes

2024-04-09 Thread Luca Weiss

Allow specifying a GPIO hog, as already used on
qcom-msm8974-lge-nexus5-hammerhead.dts.

Signed-off-by: Luca Weiss 
---
 .../devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml  | 12 
 1 file changed, 12 insertions(+)

diff --git a/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml 
b/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml
index a786357ed1af..bd9471de0c69 100644
--- a/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml
+++ b/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml
@@ -424,6 +424,10 @@ patternProperties:
 $ref: "#/$defs/qcom-pmic-gpio-state"
 additionalProperties: false
 
+  "-hog(-[0-9]+)?$":
+required:
+  - gpio-hog
+
 $defs:
   qcom-pmic-gpio-state:
 type: object
@@ -571,6 +575,7 @@ $defs:
 
 examples:
   - |
+#include 
 #include 
 
 pm8921_gpio: gpio@150 {
@@ -594,5 +599,12 @@ examples:
   power-source = ;
 };
   };
+
+  otg-hog {
+gpio-hog;
+gpios = <35 GPIO_ACTIVE_HIGH>;
+output-high;
+line-name = "otg-gpio";
+  };
 };
 ...

-- 
2.44.0

[PATCH v2 0/2] Allow gpio-hog nodes in qcom,pmic-gpio bindings (& dt fixup)

2024-04-09 Thread Luca Weiss

Resolve the dt validation failure on Nexus 5.

Signed-off-by: Luca Weiss 
---
Changes in v2:
- Use simpler regex from tlmm bindings (Krzysztof)
- Link to v1: 
https://lore.kernel.org/r/20240408-qcom-pmic-gpio-hog-v1-0-f61fc5323...@z3ntu.xyz

---
Luca Weiss (2):
  dt-bindings: pinctrl: qcom,pmic-gpio: Allow gpio-hog nodes
  ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name

 .../devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml  | 12 
 .../arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts |  2 +-
 2 files changed, 13 insertions(+), 1 deletion(-)
---
base-commit: 8568bb2ccc278f344e6ac44af6ed010a90aa88dc
change-id: 20240408-qcom-pmic-gpio-hog-2b4c5f103126

Best regards,
-- 
Luca Weiss

Re: [PATCH 0/3] Fix up qcom,halt-regs definition in various schemas

2024-04-09 Thread Luca Weiss

On Dienstag, 9. April 2024 17:10:41 CEST Rob Herring wrote:
> On Sun, Apr 07, 2024 at 11:58:29AM +0200, Luca Weiss wrote:
> > The original motivation is that a bunch of other schemas fail to
> > validate qcom,halt-regs, for example like in the following examples:
> > 
> > arch/arm64/boot/dts/qcom/apq8016-sbc.dtb: remoteproc@408: 
> > qcom,halt-regs:0: [20] is too short
> > from schema $id: 
> > http://devicetree.org/schemas/remoteproc/qcom,msm8916-mss-pil.yaml#
> > arch/arm64/boot/dts/qcom/apq8096-ifc6640.dtb: remoteproc@208: 
> > qcom,halt-regs:0: [82] is too short
> > from schema $id: 
> > http://devicetree.org/schemas/remoteproc/qcom,msm8996-mss-pil.yaml#
> > arch/arm64/boot/dts/qcom/apq8039-t2.dtb: remoteproc@408: 
> > qcom,halt-regs:0: [32] is too short
> > from schema $id: 
> > http://devicetree.org/schemas/remoteproc/qcom,msm8916-mss-pil.yaml#
> > 
> > While I'm actually not quite sure why these patches fix this in
> > the other schemas - feels like a bug/limitation in dt-schema maybe? -
> 
> Was this with v2024.02? It should be a bit better there. Though it 
> may just have different errors. The limitation is that property 
> types and in the case of matrix's (which phandle-array actually is) 
> range for dimensions are global. So if there's not correct dimensions 
> for a property, the tools aren't going to decode it properly.

You're right, I doesn't look like I can reproduce this with the latest
dtschema installed.

Anyways these patches should be good to actually validate qcom,halt-regs for
the schemas I'm touching here.

Regards
Luca

> 
> Rob
>

Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

2024-04-09 Thread James Houghton

Ah, I didn't see this in my inbox, sorry David!

On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand  wrote:
>
> On 02.04.24 01:29, James Houghton wrote:
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index f349e08a9dfe..daaa9db625d3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -61,6 +61,10 @@ enum mmu_notifier_event {
> >
> >   #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> >
> > +#define MMU_NOTIFIER_YOUNG   (1 << 0)
> > +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)
>
> Especially this one really deserves some documentation :)

Yes, will do. Something like

MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in
bitmap either (1) does not accurately represent the age of the pages
(in the case of test_young), or (2) was not able to be used to
completely clear the age/access bit (in the case of clear_young).

>
> > +#define MMU_NOTIFIER_YOUNG_FAST  (1 << 2)
>
> And that one as well.

Something like

   Indicates that (1) passing a bitmap ({test,clear}_young_bitmap)
would have been supported for this address range.

The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM
is able to harvest the access bit "fast" (so for x86, locklessly, and
for arm64, with the KVM MMU read lock), "fast" enough that using a
bitmap to do look-around is probably a good idea.

>
> Likely best to briefly document all of them, and how they are
> supposed to be used (return value for X).

Right. Will do.

>
> > +
> >   struct mmu_notifier_ops {
> >   /*
> >* Called either by mmu_notifier_unregister or when the mm is
> > @@ -106,21 +110,36 @@ struct mmu_notifier_ops {
> >* clear_young is a lightweight version of clear_flush_young. Like the
> >* latter, it is supposed to test-and-clear the young/accessed bitflag
> >* in the secondary pte, but it may omit flushing the secondary tlb.
> > +  *
> > +  * If @bitmap is given but is not supported, return
> > +  * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> > +  *
> > +  * If the walk is done "quickly" and there were young PTEs,
> > +  * MMU_NOTIFIER_YOUNG_FAST is returned.
> >*/
> >   int (*clear_young)(struct mmu_notifier *subscription,
> >  struct mm_struct *mm,
> >  unsigned long start,
> > -unsigned long end);
> > +unsigned long end,
> > +unsigned long *bitmap);
> >
> >   /*
> >* test_young is called to check the young/accessed bitflag in
> >* the secondary pte. This is used to know if the page is
> >* frequently used without actually clearing the flag or tearing
> >* down the secondary mapping on the page.
> > +  *
> > +  * If @bitmap is given but is not supported, return
> > +  * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> > +  *
> > +  * If the walk is done "quickly" and there were young PTEs,
> > +  * MMU_NOTIFIER_YOUNG_FAST is returned.
> >*/
> >   int (*test_young)(struct mmu_notifier *subscription,
> > struct mm_struct *mm,
> > -   unsigned long address);
> > +   unsigned long start,
> > +   unsigned long end,
> > +   unsigned long *bitmap);
>
> What does "quickly" mean (why not use "fast")? What are the semantics, I
> don't find any existing usage of that in this file.

"fast" means fast enough such that using a bitmap to scan adjacent
pages (e.g. with MGLRU) is likely to be beneficial. I'll write more in
this comment. Perhaps I should just rename it to
MMU_NOTIFIER_YOUNG_BITMAP_SUPPORTED and drop the whole "likely to be
beneficial" thing -- that's for MGLRU/etc. to decide really.

>
> Further, what is MMU_NOTIFIER_YOUNG you introduce used for?

MMU_NOTIFIER_YOUNG is the return value when the page was young, but we
(1) didn't use a bitmap, and (2) the "fast" access bit harvesting
wasn't possible. In this case we simply return 1, which is
MMU_NOTIFIER_YOUNG. I'll make kvm_mmu_notifier_test_clear_young()
properly return MMU_NOTIFIER_YOUNG instead of relying on the fact that
it will be 1.

Thanks David!

Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Marco Elver

On Tue, Apr 09, 2024 at 08:46AM -0700, Kees Cook wrote:
[...]
> > +   trace_new_exec(current, bprm);
> > +
> 
> All other steps in this function have explicit comments about
> what/why/etc. Please add some kind of comment describing why the
> tracepoint is where it is, etc.

I beefed up the tracepoint documentation, and wrote a little paragraph
above where it's called to reinforce what we want.

[...]
> What about binfmt_misc, and binfmt_script? You may want bprm->interp
> too?

Good points. I'll make the below changes for v2:

diff --git a/fs/exec.c b/fs/exec.c
index ab778ae1fc06..472b9f7b40e8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1268,6 +1268,12 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
return retval;
 
+   /*
+* This tracepoint marks the point before flushing the old exec where
+* the current task is still unchanged, but errors are fatal (point of
+* no return). The later "sched_process_exec" tracepoint is called after
+* the current task has successfully switched to the new exec.
+*/
trace_new_exec(current, bprm);
 
/*
diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index 8853dc44783d..623d9af777c1 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -61,8 +61,11 @@ TRACE_EVENT(task_rename,
  * @task:  pointer to the current task
  * @bprm:  pointer to linux_binprm used for new exec
  *
- * Called before flushing the old exec, but at the point of no return during
- * switching to the new exec.
+ * Called before flushing the old exec, where @task is still unchanged, but at
+ * the point of no return during switching to the new exec. At the point it is
+ * called the exec will either succeed, or on failure terminate the task. Also
+ * see the "sched_process_exec" tracepoint, which is called right after @task
+ * has successfully switched to the new exec.
  */
 TRACE_EVENT(new_exec,
 
@@ -71,19 +74,22 @@ TRACE_EVENT(new_exec,
TP_ARGS(task, bprm),
 
TP_STRUCT__entry(
+   __string(   interp, bprm->interp)
__string(   filename,   bprm->filename  )
__field(pid_t,  pid )
__string(   comm,   task->comm  )
),
 
TP_fast_assign(
+   __assign_str(interp, bprm->interp);
__assign_str(filename, bprm->filename);
__entry->pid = task->pid;
__assign_str(comm, task->comm);
),
 
-   TP_printk("filename=%s pid=%d comm=%s",
- __get_str(filename), __entry->pid, __get_str(comm))
+   TP_printk("interp=%s filename=%s pid=%d comm=%s",
+ __get_str(interp), __get_str(filename),
+ __entry->pid, __get_str(comm))
 );
 
 #endif

Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

2024-04-09 Thread SeongJae Park

Hi Honggyu,

On Tue,  9 Apr 2024 18:54:14 +0900 Honggyu Kim  wrote:
> On Mon,  8 Apr 2024 10:52:28 -0700 SeongJae Park  wrote:
> > On Mon,  8 Apr 2024 21:06:44 +0900 Honggyu Kim  wrote:
> > > On Fri,  5 Apr 2024 12:24:30 -0700 SeongJae Park  wrote:
> > > > On Fri,  5 Apr 2024 15:08:54 +0900 Honggyu Kim  
> > > > wrote:
[...]
> > > I can remove it, but I would like to have more discussion about this
> > > issue.  The current implementation allows only a single migration
> > > target with "target_nid", but users might want to provide fall back
> > > migration target nids.
> > > 
> > > For example, if more than two CXL nodes exist in the system, users might
> > > want to migrate cold pages to any CXL nodes.  In such cases, we might
> > > have to make "target_nid" accept comma separated node IDs.  nodemask can
> > > be better but we should provide a way to change the scanning order.
> > > 
> > > I would like to hear how you think about this.
> > 
> > Good point.  I think we could later extend the sysfs file to receive the
> > comma-separated numbers, or even mask.  For simplicity, adding sysfs files
> > dedicated for the different format of inputs could also be an option (e.g.,
> > target_nids_list, target_nids_mask).  But starting from this single node as 
> > is
> > now looks ok to me.
> 
> If you think we can start from a single node, then I will keep it as is.
> But are you okay if I change the same 'target_nid' to accept
> comma-separated numbers later?  Or do you want to introduce another knob
> such as 'target_nids_list'?  What about rename 'target_nid' to
> 'target_nids' at the first place?

I have no strong concern or opinion about this at the moment.  Please feel free
to renaming it to 'taget_nids' if you think that's better.

[...]
> Please note that I will be out of office this week so won't be able to
> answer quickly.

No problem, I hope you to take and enjoy your time :)


Thanks,
SJ

[...]

Re: [PATCH v2 fs/proc/bootconfig 0/2] remove redundant comments from /proc/bootconfig

2024-04-09 Thread Paul E. McKenney

On Tue, Apr 09, 2024 at 11:32:48PM +0900, Masami Hiramatsu wrote:
> Hi Paul,
> 
> Thanks, both looks good to me.
> 
> Acked-by: Masami Hiramatsu (Google) 
> 
> Let me update bootconfig/fixes.

Thank you very much!  As soon as I see them in -next from you, I will
drop them from -rcu.

Thanx, Paul

> On Mon, 8 Apr 2024 21:42:49 -0700
> "Paul E. McKenney"  wrote:
> 
> > Hello!
> > 
> > This series removes redundant comments from /proc/bootconfig:
> > 
> > 1.  fs/proc: remove redundant comments from /proc/bootconfig,
> > courtesy of Zhenhua Huang.
> > 
> > 2.  fs/proc: Skip bootloader comment if no embedded kernel parameters,
> > courtesy of Masami Hiramatsu.
> > 
> > Thanx, Paul
> > 
> > 
> > 
> >  b/fs/proc/bootconfig.c   |   12 ++--
> >  b/include/linux/bootconfig.h |1 +
> >  b/init/main.c|5 +
> >  fs/proc/bootconfig.c |2 +-
> >  4 files changed, 13 insertions(+), 7 deletions(-)
> > 
> 
> 
> -- 
> Masami Hiramatsu (Google)

Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-09 Thread Jonathan Cameron

On Fri, 5 Apr 2024 15:43:47 -0700
"Ho-Ren (Jack) Chuang"  wrote:

> On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
>  wrote:
> >
> > On Fri,  5 Apr 2024 00:07:06 +
> > "Ho-Ren (Jack) Chuang"  wrote:
> >  
> > > The current implementation treats emulated memory devices, such as
> > > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal 
> > > memory
> > > (E820_TYPE_RAM). However, these emulated devices have different
> > > characteristics than traditional DRAM, making it important to
> > > distinguish them. Thus, we modify the tiered memory initialization process
> > > to introduce a delay specifically for CPUless NUMA nodes. This delay
> > > ensures that the memory tier initialization for these nodes is deferred
> > > until HMAT information is obtained during the boot process. Finally,
> > > demotion tables are recalculated at the end.
> > >
> > > * late_initcall(memory_tier_late_init);
> > > Some device drivers may have initialized memory tiers between
> > > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> > > online memory nodes and configuring memory tiers. They should be excluded
> > > in the late init.
> > >
> > > * Handle cases where there is no HMAT when creating memory tiers
> > > There is a scenario where a CPUless node does not provide HMAT 
> > > information.
> > > If no HMAT is specified, it falls back to using the default DRAM tier.
> > >
> > > * Introduce another new lock `default_dram_perf_lock` for adist 
> > > calculation
> > > In the current implementation, iterating through CPUlist nodes requires
> > > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> > > trying to acquire the same lock, leading to a potential deadlock.
> > > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> > > protect `default_dram_perf_*`. This approach not only avoids deadlock
> > > but also prevents holding a large lock simultaneously.
> > >
> > > * Upgrade `set_node_memory_tier` to support additional cases, including
> > >   default DRAM, late CPUless, and hot-plugged initializations.
> > > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> > > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> > > handle cases where memtype is not initialized and where HMAT information 
> > > is
> > > available.
> > >
> > > * Introduce `default_memory_types` for those memory types that are not
> > >   initialized by device drivers.
> > > Because late initialized memory and default DRAM memory need to be 
> > > managed,
> > > a default memory type is created for storing all memory types that are
> > > not initialized by device drivers and as a fallback.
> > >
> > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > Signed-off-by: Hao Xiang 
> > > Reviewed-by: "Huang, Ying"   
> >
> > Hi - one remaining question. Why can't we delay init for all nodes
> > to either drivers or your fallback late_initcall code.
> > It would be nice to reduce possible code paths.  
> 
> I try not to change too much of the existing code structure in
> this patchset.
> 
> To me, postponing/moving all memory tier registrations to
> late_initcall() is another possible action item for the next patchset.
> 
> After tier_mem(), hmat_init() is called, which requires registering
> `default_dram_type` info. This is when `default_dram_type` is needed.
> However, it is indeed possible to postpone the latter part,
> set_node_memory_tier(), to `late_init(). So, memory_tier_init() can
> indeed be split into two parts, and the latter part can be moved to
> late_initcall() to be processed together.
> 
> Doing this all memory-type drivers have to call late_initcall() to
> register a memory tier. I’m not sure how many they are?
> 
> What do you guys think?

Gut feeling - if you are going to move it for some cases, move it for
all of them.  Then we only have to test once ;)

J
> 
> >
> > Jonathan
> >
> >  
> > > ---
> > >  mm/memory-tiers.c | 94 +++
> > >  1 file changed, 70 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > > index 516b144fd45a..6632102bd5c9 100644
> > > --- a/mm/memory-tiers.c
> > > +++ b/mm/memory-tiers.c  
> >
> >
> >  
> > > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void)
> > >* For now we can have 4 faster memory tiers with smaller adistance
> > >* than default DRAM tier.
> > >*/
> > > - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> > > + default_dram_type = 
> > > mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> > > +   
> > > &default_memory_types);
> > >   if (IS_ERR(default_dram_type))
> > >   panic("%s() failed to allocate default DRAM tier\n", 
> > > __func__);
> > >
> > > @@ -865,6 +903,14 @@ static int __init memory_tier_init(void)
> > >* types assigned.
> > >*/
> > >   for_eac

Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Kees Cook

On Mon, Apr 08, 2024 at 11:01:54AM +0200, Marco Elver wrote:
> Add "new_exec" tracepoint, which is run right after the point of no
> return but before the current task assumes its new exec identity.
> 
> Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint
> runs before flushing the old exec, i.e. while the task still has the
> original state (such as original MM), but when the new exec either
> succeeds or crashes (but never returns to the original exec).
> 
> Being able to trace this event can be helpful in a number of use cases:
> 
>   * allowing tracing eBPF programs access to the original MM on exec,
> before current->mm is replaced;
>   * counting exec in the original task (via perf event);
>   * profiling flush time ("new_exec" to "sched_process_exec").
> 
> Example of tracing output ("new_exec" and "sched_process_exec"):
> 
>   $ cat /sys/kernel/debug/tracing/trace_pipe
>   <...>-379 [003] .   179.626921: new_exec: 
> filename=/usr/bin/sshd pid=379 comm=sshd
>   <...>-379 [003] .   179.629131: sched_process_exec: 
> filename=/usr/bin/sshd pid=379 old_pid=379
>   <...>-381 [002] .   180.048580: new_exec: filename=/bin/bash 
> pid=381 comm=sshd
>   <...>-381 [002] .   180.053122: sched_process_exec: 
> filename=/bin/bash pid=381 old_pid=381
>   <...>-385 [001] .   180.068277: new_exec: filename=/usr/bin/tty 
> pid=385 comm=bash
>   <...>-385 [001] .   180.069485: sched_process_exec: 
> filename=/usr/bin/tty pid=385 old_pid=385
>   <...>-389 [006] .   192.020147: new_exec: 
> filename=/usr/bin/dmesg pid=389 comm=bash
>bash-389 [006] .   192.021377: sched_process_exec: 
> filename=/usr/bin/dmesg pid=389 old_pid=389
> 
> Signed-off-by: Marco Elver 
> ---
>  fs/exec.c   |  2 ++
>  include/trace/events/task.h | 30 ++
>  2 files changed, 32 insertions(+)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 38bf71cbdf5e..ab778ae1fc06 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1268,6 +1268,8 @@ int begin_new_exec(struct linux_binprm * bprm)
>   if (retval)
>   return retval;
>  
> + trace_new_exec(current, bprm);
> +

All other steps in this function have explicit comments about
what/why/etc. Please add some kind of comment describing why the
tracepoint is where it is, etc.

For example, maybe something like:

/*
 * Before any changes to 'current', report that the exec is about to
 * happen (since we made it to the point of no return). On a successful
 * exec, the 'sched_process_exec' tracepoint will also fire. On failure,
 * ... [something else]
 */

> +TRACE_EVENT(new_exec,
> +
> + TP_PROTO(struct task_struct *task, struct linux_binprm *bprm),
> +
> + TP_ARGS(task, bprm),
> +
> + TP_STRUCT__entry(
> + __string(   filename,   bprm->filename  )
> + __field(pid_t,  pid )
> + __string(   comm,   task->comm  )
> + ),
> +
> + TP_fast_assign(
> + __assign_str(filename, bprm->filename);

What about binfmt_misc, and binfmt_script? You may want bprm->interp
too?

-Kees

> + __entry->pid = task->pid;
> + __assign_str(comm, task->comm);
> + ),
> +
> + TP_printk("filename=%s pid=%d comm=%s",
> +   __get_str(filename), __entry->pid, __get_str(comm))
> +);
> +
>  #endif
>  
>  /* This part must be outside protection */
> -- 
> 2.44.0.478.gd926399ef9-goog
> 

-- 
Kees Cook

Re: [PATCHv3 2/2] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge

2024-04-09 Thread Dmitry Baryshkov

On Tue, Apr 09, 2024 at 01:04:12PM +0200, Pavel Machek wrote:
> Hi!
> 
> > > This is driver for ANX7688 USB-C HDMI, with flashing and debugging
> > > features removed. ANX7688 is rather criticial piece on PinePhone,
> > > there's no display and no battery charging without it.
> > > 
> > > There's likely more work to be done here, but having basic support
> > > in mainline is needed to be able to work on the other stuff
> > > (networking, cameras, power management).
> > > 
> > > Signed-off-by: Ondrej Jirman 
> > > Co-developed-by: Martijn Braam 
> > > Co-developed-by: Samuel Holland 
> > > Signed-off-by: Pavel Machek 
> > 
> > Just couple of quick comments below - I did not have time to go over
> > this very thoroughly, but I think you need to make a new version in
> > any case because of comments in 1/2.
> 

[skipped]

> 
> > > +static int anx7688_connect(struct anx7688 *anx7688)
> > > +{
> > > + struct typec_partner_desc desc = {};
> > > + int ret, i;
> > > + u8 fw[2];
> > > + const u8 dp_snk_identity[16] = {
> > > + 0x00, 0x00, 0x00, 0xec, /* id header */
> > > + 0x00, 0x00, 0x00, 0x00, /* cert stat */
> > > + 0x00, 0x00, 0x00, 0x00, /* product type */
> > > + 0x39, 0x00, 0x00, 0x51  /* alt mode adapter */
> > > + };
> > > + const u8 svid[4] = {
> > > + 0x00, 0x00, 0x01, 0xff,
> > > + };
> > 
> > Why not get those from DT?
> 
> Are you sure it belongs to the DT (and that DT people will agree)?

>From Documentation/devicetree/bindings/connector/usb-connector.yaml:

altmodes {
displayport {
svid = /bits/ 16 <0xff01>;
vdo = <0x1c46>;
};
};

BTW, I don't see the VDO for the DP altmode in your code. Maybe I missed
it at a quick glance.

> 
> > > + u32 caps[8];
> > > +


-- 
With best wishes
Dmitry

Re: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-09 Thread Haitao Huang


On Tue, 09 Apr 2024 04:03:22 -0500, Michal Koutný  wrote:

On Mon, Apr 08, 2024 at 11:23:21PM -0500, Haitao Huang  
 wrote:

It's always non-NULL based on testing.

It's hard for me to say definitely by reading the code. But IIUC
cgroup_disable command-line only blocks operations in /sys/fs/cgroup so  
user
space can't set up controllers and config limits, etc., for the  
diasabled
ones. Each task->cgroups would still have a non-NULL pointer to the  
static

root object for each cgroup that is enabled by KConfig, so
get_current_misc_cg() thus  sgx_get_current_cg() should not return NULL
regardless 'cgroup_disable=misc'.

Maybe @Michal or @tj can confirm?


The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist.

(It is up to the controller implementation to do further optimization
based on the boot-time disablement (e.g. see uses of
mem_cgroup_disabled()). Not sure if this is useful for misc controller.)

As for the WARN_ON(1), taking example from memcg -- NULL is best
synonymous with root. It's a judgement call which of the values to store
and when to intepret it.

HTH,
Michal

Thanks for the info.

The way I see it, misc does not have special handling like memcg so every  
task at least belong to the root(default) group even if it's disabled by  
command line parameter. So we would not get NULL from  
get_current_misc_cg(). I think I'll keep the WARN_ON_ONCE for now as a  
reminder in case misc do have custom support for disabling in future.


Thanks
Haitao

Re: [PATCH v2 3/4] arm64: dts: qcom: msm8976: Add Adreno GPU

2024-04-09 Thread Bjorn Andersson

On Mon, Apr 01, 2024 at 07:21:52PM +0200, Adam Skladowski wrote:
> Add Adreno GPU node.
> 
> Signed-off-by: Adam Skladowski 
> ---
>  arch/arm64/boot/dts/qcom/msm8976.dtsi | 65 +++
>  1 file changed, 65 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi 
> b/arch/arm64/boot/dts/qcom/msm8976.dtsi
> index 6be310079f5b..77670fce9b8f 100644
> --- a/arch/arm64/boot/dts/qcom/msm8976.dtsi
> +++ b/arch/arm64/boot/dts/qcom/msm8976.dtsi
> @@ -1074,6 +1074,71 @@ mdss_dsi1_phy: phy@1a96a00 {
>   };
>   };
>  
> + adreno_gpu: gpu@1c0 {
> + compatible = "qcom,adreno-510.0", "qcom,adreno";
> +
> + reg = <0x01c0 0x4>;
> + reg-names = "kgsl_3d0_reg_memory";
> +
> + interrupts = ;
> + interrupt-names = "kgsl_3d0_irq";
> +
> + clocks = <&gcc GCC_GFX3D_OXILI_CLK>,
> +  <&gcc GCC_GFX3D_OXILI_AHB_CLK>,
> +  <&gcc GCC_GFX3D_OXILI_GMEM_CLK>,
> +  <&gcc GCC_GFX3D_BIMC_CLK>,
> +  <&gcc GCC_GFX3D_OXILI_TIMER_CLK>,
> +  <&gcc GCC_GFX3D_OXILI_AON_CLK>;
> + clock-names = "core",
> +   "iface",
> +   "mem",
> +   "mem_iface",
> +   "rbbmtimer",
> +   "alwayson";
> +
> + power-domains = <&gcc OXILI_GX_GDSC>;
> +
> + iommus = <&gpu_iommu 0>;
> +
> + status = "disabled";

Make status the last property of the node, please.

Regards,
Bjorn

> +
> + operating-points-v2 = <&gpu_opp_table>;
> +
> + gpu_opp_table: opp-table {
> + compatible = "operating-points-v2";
> +

Re: [PATCH v2 2/4] arm64: dts: qcom: msm8976: Add MDSS nodes

2024-04-09 Thread Bjorn Andersson

On Mon, Apr 01, 2024 at 07:21:51PM +0200, Adam Skladowski wrote:
> diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi 
> b/arch/arm64/boot/dts/qcom/msm8976.dtsi
[..]
> + mdss: display-subsystem@1a0 {
[..]
> + mdss_dsi0: dsi@1a94000 {
> + compatible = "qcom,msm8976-dsi-ctrl", 
> "qcom,mdss-dsi-ctrl";
> + reg = <0x01a94000 0x25c>;
> + reg-names = "dsi_ctrl";
> +
> + interrupt-parent = <&mdss>;
> + interrupts = <4>;
> +
> + clocks = <&gcc GCC_MDSS_MDP_CLK>,
> +  <&gcc GCC_MDSS_AHB_CLK>,
> +  <&gcc GCC_MDSS_AXI_CLK>,
> +  <&gcc GCC_MDSS_BYTE0_CLK>,
> +  <&gcc GCC_MDSS_PCLK0_CLK>,
> +  <&gcc GCC_MDSS_ESC0_CLK>;
> + clock-names = "mdp_core",
> +   "iface",
> +   "bus",
> +   "byte",
> +   "pixel",
> +   "core";
> +
> + assigned-clocks = <&gcc GCC_MDSS_BYTE0_CLK_SRC>,
> +   <&gcc GCC_MDSS_PCLK0_CLK_SRC>;
> + assigned-clock-parents = <&mdss_dsi0_phy 0>,
> +  <&mdss_dsi0_phy 1>;
> +
> + phys = <&mdss_dsi0_phy>;
> +
> + operating-points-v2 = <&dsi0_opp_table>;
> + power-domains = <&gcc MDSS_GDSC>;
> +
> + #address-cells = <1>;
> + #size-cells = <0>;
> +

Seems reasonable to keep this disabled as well. Further more &mdss_dsi0
depends on &mdss_dsi0_phy which is disabled.

> + ports {
[..]
> + mdss_dsi0_phy: phy@1a94a00 {
> + compatible = "qcom,dsi-phy-28nm-hpm-fam-b";
> + reg = <0x01a94a00 0xd4>,
> +   <0x01a94400 0x280>,
> +   <0x01a94b80 0x30>;
> + reg-names = "dsi_pll",
> + "dsi_phy",
> + "dsi_phy_regulator";
> +
> + #clock-cells = <1>;
> + #phy-cells = <0>;
> +
> + clocks = <&gcc GCC_MDSS_AHB_CLK>,
> +  <&rpmcc RPM_SMD_XO_CLK_SRC>;
> + clock-names = "iface", "ref";
> +
> + status = "disabled";
> + };

PS. Leave &mdss_mdp enabled...

Regards,
Bjorn

Re: [PATCH 0/3] Fix up qcom,halt-regs definition in various schemas

2024-04-09 Thread Rob Herring

On Sun, Apr 07, 2024 at 11:58:29AM +0200, Luca Weiss wrote:
> The original motivation is that a bunch of other schemas fail to
> validate qcom,halt-regs, for example like in the following examples:
> 
> arch/arm64/boot/dts/qcom/apq8016-sbc.dtb: remoteproc@408: 
> qcom,halt-regs:0: [20] is too short
> from schema $id: 
> http://devicetree.org/schemas/remoteproc/qcom,msm8916-mss-pil.yaml#
> arch/arm64/boot/dts/qcom/apq8096-ifc6640.dtb: remoteproc@208: 
> qcom,halt-regs:0: [82] is too short
> from schema $id: 
> http://devicetree.org/schemas/remoteproc/qcom,msm8996-mss-pil.yaml#
> arch/arm64/boot/dts/qcom/apq8039-t2.dtb: remoteproc@408: 
> qcom,halt-regs:0: [32] is too short
> from schema $id: 
> http://devicetree.org/schemas/remoteproc/qcom,msm8916-mss-pil.yaml#
> 
> While I'm actually not quite sure why these patches fix this in
> the other schemas - feels like a bug/limitation in dt-schema maybe? -

Was this with v2024.02? It should be a bit better there. Though it 
may just have different errors. The limitation is that property 
types and in the case of matrix's (which phandle-array actually is) 
range for dimensions are global. So if there's not correct dimensions 
for a property, the tools aren't going to decode it properly.

Rob

Re: [PATCH 1/3] tracing: Remove dependency of saved_cmdlines_buffer on PID_MAX_DEFAULT

2024-04-09 Thread Steven Rostedt

On Mon,  8 Apr 2024 16:58:17 +0200
Michal Koutný  wrote:

> @@ -294,7 +295,7 @@ static void __trace_find_cmdline(int pid, char comm[])
>   return;
>   }
>  
> - tpid = pid & (PID_MAX_DEFAULT - 1);
> + tpid = pid % PID_MAP_SIZE;

Does that compile to the same? This is a fast path.

-- Steve


>   map = savedcmd->map_pid_to_cmdline[tpid];
>   if (map != NO_CMDLINE_MAP) {
>   tpid = savedcmd->map_cmdline_to_pid[map];

Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Marco Elver

On Tue, 9 Apr 2024 at 16:31, Steven Rostedt  wrote:
>
> On Mon,  8 Apr 2024 11:01:54 +0200
> Marco Elver  wrote:
>
> > Add "new_exec" tracepoint, which is run right after the point of no
> > return but before the current task assumes its new exec identity.
> >
> > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint
> > runs before flushing the old exec, i.e. while the task still has the
> > original state (such as original MM), but when the new exec either
> > succeeds or crashes (but never returns to the original exec).
> >
> > Being able to trace this event can be helpful in a number of use cases:
> >
> >   * allowing tracing eBPF programs access to the original MM on exec,
> > before current->mm is replaced;
> >   * counting exec in the original task (via perf event);
> >   * profiling flush time ("new_exec" to "sched_process_exec").
> >
> > Example of tracing output ("new_exec" and "sched_process_exec"):
>
> How common is this? And can't you just do the same with adding a kprobe?

Our main use case would be to use this in BPF programs to become
exec-aware, where using the sched_process_exec hook is too late. This
is particularly important where the BPF program must stop inspecting
the user space's VM when the task does exec to become a new process.

kprobe (or BPF's fentry) is brittle here, because begin_new_exec()'s
permission check can still return an error which returns to the
original task without crashing. Only at the point of no return are we
guaranteed that the exec either succeeds, or the task is terminated on
failure.

I don't know if "common" is the right question here, because it's a
chicken-egg problem: no tracepoint, we give up; we have the
tracepoint, it unlocks a range of new use cases (that require robust
solution to make BPF programs exec-aware, and a tracepoint is the only
option IMHO).

Thanks,
-- Marco

Re: [PATCH v2 fs/proc/bootconfig 0/2] remove redundant comments from /proc/bootconfig

2024-04-09 Thread Google

Hi Paul,

Thanks, both looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Let me update bootconfig/fixes.

On Mon, 8 Apr 2024 21:42:49 -0700
"Paul E. McKenney"  wrote:

> Hello!
> 
> This series removes redundant comments from /proc/bootconfig:
> 
> 1.fs/proc: remove redundant comments from /proc/bootconfig,
>   courtesy of Zhenhua Huang.
> 
> 2.fs/proc: Skip bootloader comment if no embedded kernel parameters,
>   courtesy of Masami Hiramatsu.
> 
>   Thanx, Paul
> 
> 
> 
>  b/fs/proc/bootconfig.c   |   12 ++--
>  b/include/linux/bootconfig.h |1 +
>  b/init/main.c|5 +
>  fs/proc/bootconfig.c |2 +-
>  4 files changed, 13 insertions(+), 7 deletions(-)
> 


-- 
Masami Hiramatsu (Google)

Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Steven Rostedt

On Mon,  8 Apr 2024 11:01:54 +0200
Marco Elver  wrote:

> Add "new_exec" tracepoint, which is run right after the point of no
> return but before the current task assumes its new exec identity.
> 
> Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint
> runs before flushing the old exec, i.e. while the task still has the
> original state (such as original MM), but when the new exec either
> succeeds or crashes (but never returns to the original exec).
> 
> Being able to trace this event can be helpful in a number of use cases:
> 
>   * allowing tracing eBPF programs access to the original MM on exec,
> before current->mm is replaced;
>   * counting exec in the original task (via perf event);
>   * profiling flush time ("new_exec" to "sched_process_exec").
> 
> Example of tracing output ("new_exec" and "sched_process_exec"):

How common is this? And can't you just do the same with adding a kprobe?

-- Steve

Re: [PATCH v2] kprobes: Avoid possible warn in __arm_kprobe_ftrace()

2024-04-09 Thread Google

On Tue, 9 Apr 2024 14:20:45 +0800
Zheng Yejian  wrote:

> On 2024/4/8 20:41, Masami Hiramatsu (Google) wrote:
> > Hi Zheng,
> > 
> > On Mon, 8 Apr 2024 16:34:03 +0800
> > Zheng Yejian  wrote:
> > 
> >> There is once warn in __arm_kprobe_ftrace() on:
> >>
> >>   ret = ftrace_set_filter_ip(ops, (unsigned long)p->addr, 0, 0);
> >>   if (WARN_ONCE(..., "Failed to arm kprobe-ftrace at %pS (error %d)\n", 
> >> ...)
> >> return ret;
> >>
> >> This warning is generated because 'p->addr' is detected to be not a valid
> >> ftrace location in ftrace_set_filter_ip(). The ftrace address check is done
> >> by check_ftrace_location() at the beginning of check_kprobe_address_safe().
> >> At that point, ftrace_location(addr) == addr should return true if the
> >> module is loaded. Then the module is searched twice:
> >>1. in is_module_text_address(), we find that 'p->addr' is in a module;
> >>2. in __module_text_address(), we find the module;
> >>
> >> If the module has just been unloaded before the second search, then
> >> '*probed_mod' is NULL and we would not go to get the module refcount,
> >> then the return value of check_kprobe_address_safe() would be 0, but
> >> actually we need to return -EINVAL.
> > 
> > OK, so you found a race window in check_kprobe_address_safe().
> > 
> > It does something like below.
> > 
> > check_kprobe_address_safe() {
> > ...
> > 
> > /* Timing [A] */
> > 
> > if (!(core_kernel_text(p->addr) ||
> > is_module_text_address(p->addr)) ||
> > ...(other reserved address check)) {
> > return -EINVAL;
> > }
> > 
> > /* Timing [B] */
> > 
> > *probed_mod = __module_text_address(p->addr):
> > if (*probe_mod) {
> > if (!try_module_get(*probed_mod)) {
> > return -ENOENT;
> > }
> > ... 
> > }
> > }
> > 
> > So, if p->addr is in a module which is alive at the timing [A], but
> > unloaded at timing [B], 'p->addr' is passed the
> > 'is_module_text_address(p->addr)' check, but *probed_mod becomes NULL.
> > Thus the corresponding module is not referenced and kprobe_arm(p) will
> > access a wrong address (use after free).
> > This happens either kprobe on ftrace is enabled or not.
> 
> Yes, This is the problem. And for this case, check_kprobe_address_safe() 
> still return 0, and then going on to arm kprobe may cause problems. So
> we should make check_kprobe_address_safe() return -EINVAL when refcount
> of the module is not got.

Yes,

> 
> > 
> > To fix this problem, we should move the mutex_lock(kprobe_mutex) before
> > check_kprobe_address_safe() because kprobe_module_callback() also lock it
> > so it can stop module unloading.
> > 
> > Can you ensure this will fix your problem?
> 
> It seems not, the warning in __arm_kprobe_ftrace() still occurs. I
> contrived following simple test:
> 
>  #!/bin/bash
>  sysctl -w kernel.panic_on_warn=1
>  while [ True ]; do
>  insmod mod.ko# contain function 'foo'
>  rmmod mod.ko
>  done &
>  while [ True ]; do
>  insmod kprobe.ko  # register kprobe on function 'foo'
>  rmmod kprobe.ko
>  done &
> 
> I think holding kprobe_mutex cannot make sure we get the refcount of the
> module.

Aah, yes, it cannot, because the kallsyms in a module will be removed
after module->state becomes MODULE_STATE_UNFORMED. Before UNFORMED,
the state is MODULE_STATE_GOING and the kprobe_module_callback() is
called at that point. Thus, the following scenario happens.

 CPU1   CPU2

mod->state = MODULE_STATE_GOING
kprobe_module_callback() {
mutex_lock(&kprobe_mutex)
loop on kprobe_table 
to disable kprobe in the module.
mutex_unlock(&kprobe_mutex)
}
register_kprobe(p) {

mutex_lock(&kprobe_mutex)

check_kprobe_address_safe(p->addr) {
[A'']

is_module_text_address() return true
until 
mod->state == UNFORMED.
mod->state = MODULE_STATE_UNFORMED
[B'']

__module_text_address() returns NULL.
}
p is on the 
kprobe_table.

mutex_unlock(&kprobe_mutex)

So, as your fix, if we save the module at [A''] and use it at [B''],
the mod is NOT able to get because mod->state != MODULE_STATE_LIVE.


> 
> > I think your patch is just optimizing but not fixing the fundamental
> > problem,

Re: [PATCH v3 1/2] dt-bindings: arm: qcom: Add Motorola Moto G (2013)

2024-04-09 Thread Rob Herring



On Sun, 07 Apr 2024 11:05:10 +0200, Stanislav Jakubek wrote:
> Document the Motorola Moto G (2013), which is a smartphone based
> on the Qualcomm MSM8226 SoC.
> 
> Acked-by: Krzysztof Kozlowski 
> Signed-off-by: Stanislav Jakubek 
> ---
> Changes in V3:
>   - no changes
> 
> Changes in V2:
>   - collect Krzysztof's A-b
> 
>  Documentation/devicetree/bindings/arm/qcom.yaml | 1 +
>  1 file changed, 1 insertion(+)
> 


My bot found new DTB warnings on the .dts files added or changed in this
series.

Some warnings may be from an existing SoC .dtsi. Or perhaps the warnings
are fixed by another series. Ultimately, it is up to the platform
maintainer whether these warnings are acceptable or not. No need to reply
unless the platform maintainer has comments.

If you already ran DT checks and didn't see these error(s), then
make sure dt-schema is up to date:

  pip3 install dtschema --upgrade


New warnings running 'make CHECK_DTBS=y qcom/msm8226-motorola-falcon.dtb' for 
32c507337ab80c550fb1df08f7014d1e31eb4c32.1712480582.git.stano.jaku...@gmail.com:

arch/arm/boot/dts/qcom/msm8226-motorola-falcon.dtb: syscon@f9011000: 
compatible: 'anyOf' conditional failed, one must be fixed:
['syscon'] is too short
'syscon' is not one of ['allwinner,sun8i-a83t-system-controller', 
'allwinner,sun8i-h3-system-controller', 
'allwinner,sun8i-v3s-system-controller', 
'allwinner,sun50i-a64-system-controller', 'amd,pensando-elba-syscon', 
'brcm,cru-clkset', 'freecom,fsg-cs2-system-controller', 
'fsl,imx93-aonmix-ns-syscfg', 'fsl,imx93-wakeupmix-syscfg', 
'hisilicon,dsa-subctrl', 'hisilicon,hi6220-sramctrl', 
'hisilicon,pcie-sas-subctrl', 'hisilicon,peri-subctrl', 'hpe,gxp-sysreg', 
'intel,lgm-syscon', 'loongson,ls1b-syscon', 'loongson,ls1c-syscon', 
'marvell,armada-3700-usb2-host-misc', 'mediatek,mt8135-pctl-a-syscfg', 
'mediatek,mt8135-pctl-b-syscfg', 'mediatek,mt8365-syscfg', 
'microchip,lan966x-cpu-syscon', 'microchip,sparx5-cpu-syscon', 
'mstar,msc313-pmsleep', 'nuvoton,ma35d1-sys', 'nuvoton,wpcm450-shm', 
'rockchip,px30-qos', 'rockchip,rk3036-qos', 'rockchip,rk3066-qos', 
'rockchip,rk3128-qos', 'rockchip,rk3228-qos', 'rockchip,rk3288-qos', 
'rockchip,rk3368-qos', 'rockchip,rk3399-qos', 'rockchip,rk3568-qos', '
 rockchip,rk3588-qos', 'rockchip,rv1126-qos', 'starfive,jh7100-sysmain', 
'ti,am62-usb-phy-ctrl', 'ti,am654-dss-oldi-io-ctrl', 'ti,am654-serdes-ctrl', 
'ti,j784s4-pcie-ctrl']
from schema $id: http://devicetree.org/schemas/mfd/syscon.yaml#

Re: [PATCH v3] Documentation: Add reconnect process for VDUSE

2024-04-09 Thread Cindy Lu

On Mon, Apr 8, 2024 at 8:50 PM Michael S. Tsirkin  wrote:
>
> On Mon, Apr 08, 2024 at 08:39:21PM +0800, Cindy Lu wrote:
> > On Mon, Apr 8, 2024 at 3:40 PM Michael S. Tsirkin  wrote:
> > >
> > > On Thu, Apr 04, 2024 at 01:56:31PM +0800, Cindy Lu wrote:
> > > > Add a document explaining the reconnect process, including what the
> > > > Userspace App needs to do and how it works with the kernel.
> > > >
> > > > Signed-off-by: Cindy Lu 
> > > > ---
> > > >  Documentation/userspace-api/vduse.rst | 41 +++
> > > >  1 file changed, 41 insertions(+)
> > > >
> > > > diff --git a/Documentation/userspace-api/vduse.rst 
> > > > b/Documentation/userspace-api/vduse.rst
> > > > index bdb880e01132..7faa83462e78 100644
> > > > --- a/Documentation/userspace-api/vduse.rst
> > > > +++ b/Documentation/userspace-api/vduse.rst
> > > > @@ -231,3 +231,44 @@ able to start the dataplane processing as follows:
> > > > after the used ring is filled.
> > > >
> > > >  For more details on the uAPI, please see include/uapi/linux/vduse.h.
> > > > +
> > > > +HOW VDUSE devices reconnection works
> > > > +
> > > > +1. What is reconnection?
> > > > +
> > > > +   When the userspace application loads, it should establish a 
> > > > connection
> > > > +   to the vduse kernel device. Sometimes,the userspace application 
> > > > exists,
> > > > +   and we want to support its restart and connect to the kernel device 
> > > > again
> > > > +
> > > > +2. How can I support reconnection in a userspace application?
> > > > +
> > > > +2.1 During initialization, the userspace application should first 
> > > > verify the
> > > > +existence of the device "/dev/vduse/vduse_name".
> > > > +If it doesn't exist, it means this is the first-time for 
> > > > connection. goto step 2.2
> > > > +If it exists, it means this is a reconnection, and we should goto 
> > > > step 2.3
> > > > +
> > > > +2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> > > > +/dev/vduse/control.
> > > > +When ioctl(VDUSE_CREATE_DEV) is called, kernel allocates memory for
> > > > +the reconnect information. The total memory size is 
> > > > PAGE_SIZE*vq_mumber.
> > >
> > > Confused. Where is that allocation, in code?
> > >
> > > Thanks!
> > >
> > this should allocated in function vduse_create_dev(),
>
> I mean, it's not allocated there ATM right? This is just doc patch
> to become part of a larger patchset?
>
Got it, thanks Michael I will send the whole patchset soon
thanks
Cindy

> > I will rewrite
> > this part  to make it more clearer
> > will send a new version soon
> > Thanks
> > cindy
> >
> > > > +2.3 Check if the information is suitable for reconnect
> > > > +If this is reconnection :
> > > > +Before attempting to reconnect, The userspace application needs to 
> > > > use the
> > > > +ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, 
> > > > VDUSE_DEV_GET_FEATURES...)
> > > > +to get the information from kernel.
> > > > +Please review the information and confirm if it is suitable to 
> > > > reconnect.
> > > > +
> > > > +2.4 Userspace application needs to mmap the memory to userspace
> > > > +The userspace application requires mapping one page for every vq. 
> > > > These pages
> > > > +should be used to save vq-related information during system 
> > > > running. Additionally,
> > > > +the application must define its own structure to store information 
> > > > for reconnection.
> > > > +
> > > > +2.5 Completed the initialization and running the application.
> > > > +While the application is running, it is important to store 
> > > > relevant information
> > > > +about reconnections in mapped pages. When calling the ioctl 
> > > > VDUSE_VQ_GET_INFO to
> > > > +get vq information, it's necessary to check whether it's a 
> > > > reconnection. If it is
> > > > +a reconnection, the vq-related information must be get from the 
> > > > mapped pages.
> > > > +
> > > > +2.6 When the Userspace application exits, it is necessary to unmap all 
> > > > the
> > > > +pages for reconnection
> > > > --
> > > > 2.43.0
> > >
>

Re: [PATCH 2/2] ARM: dts: qcom: msm8974: Use proper compatible for APCS syscon

2024-04-09 Thread Krzysztof Kozlowski

On 08/04/2024 21:32, Luca Weiss wrote:
> Use the apcs-kpss-global compatible for the APCS global mailbox block
> found on this SoC.
> 
> This also resolves a dt-binding checker warning:


Reviewed-by: Krzysztof Kozlowski 

Best regards,
Krzysztof

Re: [PATCH 1/2] dt-bindings: mailbox: qcom: Add MSM8974 APCS compatible

2024-04-09 Thread AngeloGioacchino Del Regno


Il 08/04/24 21:32, Luca Weiss ha scritto:

Add compatible for the Qualcomm MSM8974 APCS block.

Signed-off-by: Luca Weiss 


Reviewed-by: AngeloGioacchino Del Regno

Re: [PATCH 2/2] ARM: dts: qcom: msm8974: Use proper compatible for APCS syscon

2024-04-09 Thread AngeloGioacchino Del Regno


Il 08/04/24 21:32, Luca Weiss ha scritto:

Use the apcs-kpss-global compatible for the APCS global mailbox block
found on this SoC.

This also resolves a dt-binding checker warning:

   arch/arm/boot/dts/qcom/qcom-msm8974pro-fairphone-fp2.dtb: syscon@f9011000: 
compatible: 'anyOf' conditional failed, one must be fixed:
   ['syscon'] is too short
   'syscon' is not one of ['allwinner,sun8i-a83t-system-controller', 
'allwinner,sun8i-h3-system-controller', 
'allwinner,sun8i-v3s-system-controller', 
'allwinner,sun50i-a64-system-controller', 'amd,pensando-elba-syscon', 
'brcm,cru-clkset', 'freecom,fsg-cs2-system-controller', 
'fsl,imx93-aonmix-ns-syscfg', 'fsl,imx93-wakeupmix-syscfg', 
'hisilicon,dsa-subctrl', 'hisilicon,hi6220-sramctrl', 
'hisilicon,pcie-sas-subctrl', 'hisilicon,peri-subctrl', 'hpe,gxp-sysreg', 
'intel,lgm-syscon', 'loongson,ls1b-syscon', 'loongson,ls1c-syscon', 
'marvell,armada-3700-usb2-host-misc', 'mediatek,mt8135-pctl-a-syscfg', 
'mediatek,mt8135-pctl-b-syscfg', 'mediatek,mt8365-syscfg', 
'microchip,lan966x-cpu-syscon', 'microchip,sparx5-cpu-syscon', 
'mstar,msc313-pmsleep', 'nuvoton,ma35d1-sys', 'nuvoton,wpcm450-shm', 
'rockchip,px30-qos', 'rockchip,rk3036-qos', 'rockchip,rk3066-qos', 
'rockchip,rk3128-qos', 'rockchip,rk3228-qos', 'rockchip,rk3288-qos', 
'rockchip,rk3368-qos', 'rockchip,rk3399-qos', 'rockchip,rk356
  8-qos', 'rockchip,rk3588-qos', 'rockchip,rv1126-qos', 
'starfive,jh7100-sysmain', 'ti,am62-usb-phy-ctrl', 'ti,am654-dss-oldi-io-ctrl', 
'ti,am654-serdes-ctrl', 'ti,j784s4-pcie-ctrl']
   from schema $id: http://devicetree.org/schemas/mfd/syscon.yaml#

Signed-off-by: Luca Weiss 


Reviewed-by: AngeloGioacchino Del Regno

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-09 Thread Jiri Olsa

On Mon, Apr 08, 2024 at 06:22:59PM +0200, Oleg Nesterov wrote:
> On 04/08, Jiri Olsa wrote:
> >
> > On Fri, Apr 05, 2024 at 01:02:30PM +0200, Oleg Nesterov wrote:
> > >
> > > And what should sys_uretprobe() do if it is not called from the 
> > > trampoline?
> > > I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.
> >
> > so the similar behaviour with int3 ends up with immediate SIGTRAP
> > and not invoking pending uretprobe consumers, like:
> >
> >   - setup uretprobe for foo
> >   - foo() {
> >   executes int 3 -> sends SIGTRAP
> > }
> >
> > because the int3 handler checks if it got executed from the uretprobe's
> > trampoline.
> 
> ... or the task has uprobe at this address
> 
> > if not it treats that int3 as regular trap
> 
> Yes this mimics the "default" behaviour without uprobes/uretprobes
> 
> > so I think we should mimic int3 behaviour and:
> >
> >   - setup uretprobe for foo
> >   - foo() {
> >  uretprobe_syscall -> check if we got executed from uretprobe's
> >  trampoline and send SIGILL if that's not the case
> 
> Agreed,
> 
> > I think it's better to have the offending process killed right away,
> > rather than having more undefined behaviour, waiting for final 'ret'
> > instruction that jumps to uretprobe trampoline and causes SIGILL
> 
> Agreed. In fact I think it should be also killed if copy_to/from_user()
> fails by the same reason.

+1 makes sense

jirka

> 
> Oleg.
>

Re: [PATCHv3 2/2] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge

2024-04-09 Thread Pavel Machek

Hi!

> > This is driver for ANX7688 USB-C HDMI, with flashing and debugging
> > features removed. ANX7688 is rather criticial piece on PinePhone,
> > there's no display and no battery charging without it.
> > 
> > There's likely more work to be done here, but having basic support
> > in mainline is needed to be able to work on the other stuff
> > (networking, cameras, power management).
> > 
> > Signed-off-by: Ondrej Jirman 
> > Co-developed-by: Martijn Braam 
> > Co-developed-by: Samuel Holland 
> > Signed-off-by: Pavel Machek 
> 
> Just couple of quick comments below - I did not have time to go over
> this very thoroughly, but I think you need to make a new version in
> any case because of comments in 1/2.

Yes, there will be new version.

There is ton of sleep primitives, and existing ones are okay. I can
change everything to fdelay or whatever primitive-of-the-day is, but
I'd rather not do pointless changes.

You can ask for major changes, or complain about extra newlines, but
doing both in the same email does not make sense.

> Btw, Co-developed-by comes before Signed-off-by I think.

I believe this order is fine, too.

> > +++ b/drivers/usb/typec/anx7688.c
> > @@ -0,0 +1,1830 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * ANX7688 USB-C HDMI bridge/PD driver
> > + *
> > + * Warning, this driver is somewhat PinePhone specific.
> 
> So why not split this into core part and PinePhone specific glue
> part?

Potentially a lot of work I can't really test and would rather not do.

> > +   struct delayed_work work;
> > +   struct timer_list work_timer;
> > +
> > +   struct mutex lock;
> 
> Undocumented lock.

This is simple driver. How do you expect me to document it? Protects
this data structure, not exactly uncommon.

> > +
> > +   /* wait for power to stabilize and release reset */
> > +   msleep(10);
> > +   gpiod_set_value(anx7688->gpio_reset, 0);
> > +   usleep_range(2, 4);
> 
> Why not just use usleep_range() in both cases.

It does not really make code any better. Can do if you insist.

> > +static int anx7688_connect(struct anx7688 *anx7688)
> > +{
> > +   struct typec_partner_desc desc = {};
> > +   int ret, i;
> > +   u8 fw[2];
> > +   const u8 dp_snk_identity[16] = {
> > +   0x00, 0x00, 0x00, 0xec, /* id header */
> > +   0x00, 0x00, 0x00, 0x00, /* cert stat */
> > +   0x00, 0x00, 0x00, 0x00, /* product type */
> > +   0x39, 0x00, 0x00, 0x51  /* alt mode adapter */
> > +   };
> > +   const u8 svid[4] = {
> > +   0x00, 0x00, 0x01, 0xff,
> > +   };
> 
> Why not get those from DT?

Are you sure it belongs to the DT (and that DT people will agree)?

> > +   u32 caps[8];
> > +
> > +   dev_dbg(anx7688->dev, "cable inserted\n");
> > +
> > +   anx7688->last_status = -1;
> > +   anx7688->last_cc_status = -1;
> > +   anx7688->last_dp_state = -1;
> > +
> > +   msleep(10);
> 
> Please make a comment here why you have to wait, and use
> usleep_range().

/* Dunno because working code does that and waiting for hardware to
settle down after cable insertion kind of looks like a good thing */

I did not write the driver, and there's no good documentation for this
chip. I can try to invent something, but...

> > +   i = 0;
> > +   while (1) {
> > +   ret = anx7688_reg_read(anx7688, 
> > ANX7688_REG_EEPROM_LOAD_STATUS0);
> > +
> > +   if (ret >= 0 && (ret & ANX7688_EEPROM_FW_LOADED) == 
> > ANX7688_EEPROM_FW_LOADED) {
> > +   dev_dbg(anx7688->dev, "eeprom0 = 0x%02x\n", ret);
> > +   dev_dbg(anx7688->dev, "fw loaded after %d ms\n", i * 
> > 10);
> > +   break;
> > +   }
> > +
> > +   if (i > 99) {
> > +   set_bit(ANX7688_F_FW_FAILED, anx7688->flags);
> > +   dev_err(anx7688->dev,
> > +   "boot firmware load failed (you may need to 
> > flash FW to anx7688 first)\n");
> > +   ret = -ETIMEDOUT;
> > +   goto err_vconoff;
> > +   }
> > +   msleep(5);
> > +   i++;
> > +   }
> 
> You need to use something like time_is_after_jiffies() here instead of
> a counter.

Well, this works as well, but yes, I agree here.

> > +   ret = i2c_smbus_read_i2c_block_data(anx7688->client,
> > +   ANX7688_REG_FW_VERSION1, 2, fw);
> > +   if (ret < 0) {
> > +   dev_err(anx7688->dev, "failed to read firmware version\n");
> > +   goto err_vconoff;
> > +   }
> > +
> > +   dev_dbg(anx7688->dev, "OCM firmware loaded (version 0x%04x)\n",
> > +fw[1] | fw[0] << 8);
> > +
> > +   /* Unmask interrupts */
> > +   ret = anx7688_reg_write(anx7688, ANX7688_REG_STATUS_INT, 0);
> > +   if (ret)
> > +   goto err_vconoff;
> > +
> > +   ret = anx7688_reg_write(anx7688, ANX7688_REG_STATUS_INT_MASK, 
> > ~ANX7688_SOFT_INT_MASK);
> > +   if (ret)
> > +   goto err_vconoff;
> > +
> > +   ret = anx7688_reg_write(anx7688, ANX7688_REG_IRQ_EXT_SOURCE2

Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL

2024-04-09 Thread Honggyu Kim

On Mon,  8 Apr 2024 22:41:04 +0900 Honggyu Kim  wrote:
[...]
> To explain this, I better share more test results.  In the section of
> "Evaluation Workload", the test sequence can be summarized as follows.
> 
>   *. "Turn on DAMON."
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>  process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>  (85GB consumed: 52GB for anon and 33GB for file cache)
>   3. Run YCSB to make zipfian distribution of memory accesses to
>  redis-server, then measure execution time.
>   4. Repeat 4 over 50 times to measure the average execution time for
>  each run.

Sorry, "Repeat 4 over 50 times" is incorrect.  This should be "Repeat 3
over 50 times".

>   5. Increase the cold memory size then repeat goes to 2.
> 
> I didn't want to make the evaluation too long in the cover letter, but
> I have also evaluated another senario, which lazyly enabled DAMON just
> before YCSB run at step 4.  I will call this test as "DAMON lazy".  This
> is missing part from the cover letter.
> 
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>  process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>  (85GB consumed: 52GB for anon and 33GB for file cache)
>   *. "Turn on DAMON."
>   4. Run YCSB to make zipfian distribution of memory accesses to
>  redis-server, then measure execution time.
>   5. Repeat 4 over 50 times to measure the average execution time for
>  each run.
>   6. Increase the cold memory size then repeat goes to 2.
> 
> In the "DAMON lazy" senario, DAMON started monitoring late so the
> initial redis-server placement is same as "default", but started to
> demote cold data and promote redis data just before YCSB run.
[...]

Thanks,
Honggyu

Re: [PATCHv3 2/2] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge

2024-04-09 Thread Heikki Krogerus

Hi,

On Mon, Apr 08, 2024 at 12:54:25PM +0200, Pavel Machek wrote:
> From: Ondrej Jirman 
> 
> This is driver for ANX7688 USB-C HDMI, with flashing and debugging
> features removed. ANX7688 is rather criticial piece on PinePhone,
> there's no display and no battery charging without it.
> 
> There's likely more work to be done here, but having basic support
> in mainline is needed to be able to work on the other stuff
> (networking, cameras, power management).
> 
> Signed-off-by: Ondrej Jirman 
> Co-developed-by: Martijn Braam 
> Co-developed-by: Samuel Holland 
> Signed-off-by: Pavel Machek 

Just couple of quick comments below - I did not have time to go over
this very thoroughly, but I think you need to make a new version in
any case because of comments in 1/2.

Btw, Co-developed-by comes before Signed-off-by I think.

> ---
> 
> v2: Fix checkpatch stuff. Some cleanups, adapt to dts format in 1/2.
> v3: Turn down debugging, as requested during review.
> 
> diff --git a/drivers/usb/typec/Kconfig b/drivers/usb/typec/Kconfig
> index 2f80c2792dbd..c9043ae61546 100644
> --- a/drivers/usb/typec/Kconfig
> +++ b/drivers/usb/typec/Kconfig
> @@ -64,6 +64,17 @@ config TYPEC_ANX7411
> If you choose to build this driver as a dynamically linked module, the
> module will be called anx7411.ko.
>  
> +config TYPEC_ANX7688
> + tristate "Analogix ANX7688 Type-C DRP Port controller and mux driver"
> + depends on I2C
> + depends on USB_ROLE_SWITCH
> + help
> +   Say Y or M here if your system has Analogix ANX7688 Type-C Bridge
> +   controller driver.
> +
> +   If you choose to build this driver as a dynamically linked module, the
> +   module will be called anx7688.ko.
> +
>  config TYPEC_RT1719
>   tristate "Richtek RT1719 Sink Only Type-C controller driver"
>   depends on USB_ROLE_SWITCH || !USB_ROLE_SWITCH
> diff --git a/drivers/usb/typec/Makefile b/drivers/usb/typec/Makefile
> index 7a368fea61bc..3f8ff94ad294 100644
> --- a/drivers/usb/typec/Makefile
> +++ b/drivers/usb/typec/Makefile
> @@ -7,6 +7,7 @@ obj-$(CONFIG_TYPEC_TCPM)  += tcpm/
>  obj-$(CONFIG_TYPEC_UCSI) += ucsi/
>  obj-$(CONFIG_TYPEC_TPS6598X) += tipd/
>  obj-$(CONFIG_TYPEC_ANX7411)  += anx7411.o
> +obj-$(CONFIG_TYPEC_ANX7688)  += anx7688.o
>  obj-$(CONFIG_TYPEC_HD3SS3220)+= hd3ss3220.o
>  obj-$(CONFIG_TYPEC_STUSB160X)+= stusb160x.o
>  obj-$(CONFIG_TYPEC_RT1719)   += rt1719.o
> diff --git a/drivers/usb/typec/anx7688.c b/drivers/usb/typec/anx7688.c
> new file mode 100644
> index ..407cbd5fedba
> --- /dev/null
> +++ b/drivers/usb/typec/anx7688.c
> @@ -0,0 +1,1830 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * ANX7688 USB-C HDMI bridge/PD driver
> + *
> + * Warning, this driver is somewhat PinePhone specific.

So why not split this into core part and PinePhone specific glue part?

> + * How this works:
> + * - this driver allows to program firmware into ANX7688 EEPROM, and
> + *   initialize it
> + * - it then communicates with the firmware running on the OCM (on-chip
> + *   microcontroller)
> + * - it detects whether there is cable plugged in or not and powers
> + *   up or down the ANX7688 based on that
> + * - when the cable is connected the firmware on the OCM will handle
> + *   the detection of the nature of the device on the other end
> + *   of the USB-C cable
> + * - this driver then communicates with the USB phy to let it swap
> + *   data roles accordingly
> + * - it also enables VBUS and VCONN regulators as appropriate
> + * - USB phy driver (Allwinner) needs to know whether to switch to
> + *   device or host mode, or whether to turn off
> + * - when the firmware detects SRC.1.5A or SRC.3.0A via CC pins
> + *   or something else via PD, it notifies this driver via software
> + *   interrupt and this driver will determine how to update the TypeC
> + *   port status and what input current limit is appropriate
> + * - input current limit determination happens 500ms after cable
> + *   insertion or hard reset (delay is necessary to determine whether
> + *   the remote end is PD capable or not)
> + * - this driver tells to the PMIC driver that the input current limit
> + *   needs to be changed
> + * - this driver also monitors PMIC status and re-sets the input current
> + *   limit if it changes for some reason (due to PMIC internal decision
> + *   making) (this is disabled for now)
> + *
> + * ANX7688 FW behavior as observed:
> + *
> + * - DO NOT SET MORE THAN 1 SINK CAPABILITY! Firmware will ignore what
> + *   you set and send hardcoded PDO_BATT 5-21V 30W message!
> + *
> + * Product brief:
> + * 
> https://www.analogix.com/en/system/files/AA-002281-PB-6-ANX7688_Product_Brief_0.pdf
> + * Development notes:
> + * https://xnux.eu/devices/feature/anx7688.html
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#i

Re: [PATCH v4] vp_vdpa: don't allocate unused msix vectors

2024-04-09 Thread Heng Qi





在 2024/4/9 下午4:58, lyx634449800 写道:

From: Yuxue Liu 

When there is a ctlq and it doesn't require interrupt
callbacks,the original method of calculating vectors
wastes hardware msi or msix resources as well as system
IRQ resources.

When conducting performance testing using testpmd in the
guest os, it was found that the performance was lower compared
to directly using vfio-pci to passthrough the device

In scenarios where the virtio device in the guest os does
not utilize interrupts, the vdpa driver still configures
the hardware's msix vector. Therefore, the hardware still
sends interrupts to the host os. Because of this unnecessary
action by the hardware, hardware performance decreases, and
it also affects the performance of the host os.

Before modification:(interrupt mode)
  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config

After modification:(interrupt mode)
  32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
  33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
  34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config

Before modification:(virtio pmd mode for guest os)
  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config

After modification:(virtio pmd mode for guest os)
  32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config

To verify the use of the virtio PMD mode in the guest operating
system, the following patch needs to be applied to QEMU:
https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com

Signed-off-by: Yuxue Liu 
Acked-by: Jason Wang 
---
V4: Update the title and assign values to uninitialized variables
V3: delete unused variables and add validation records
V2: fix when allocating IRQs, scan all queues

  drivers/vdpa/virtio_pci/vp_vdpa.c | 23 +--
  1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
b/drivers/vdpa/virtio_pci/vp_vdpa.c
index df5f4a3bccb5..74bc8adfc7e8 100644
--- a/drivers/vdpa/virtio_pci/vp_vdpa.c
+++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
@@ -160,7 +160,14 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
struct pci_dev *pdev = mdev->pci_dev;
int i, ret, irq;
int queues = vp_vdpa->queues;
-   int vectors = queues + 1;
+   int vectors = 0;
+   int msix_vec = 0;
+
+   for (i = 0; i < queues; i++) {
+   if (vp_vdpa->vring[i].cb.callback)
+   vectors++;
+   }
+   vectors++;
  
  	ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);

if (ret != vectors) {
@@ -173,9 +180,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
vp_vdpa->vectors = vectors;
  
  	for (i = 0; i < queues; i++) {

+   if (!vp_vdpa->vring[i].cb.callback)
+   continue;
+
snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE,
"vp-vdpa[%s]-%d\n", pci_name(pdev), i);
-   irq = pci_irq_vector(pdev, i);
+   irq = pci_irq_vector(pdev, msix_vec);
ret = devm_request_irq(&pdev->dev, irq,
   vp_vdpa_vq_handler,
   0, vp_vdpa->vring[i].msix_name,
@@ -185,21 +195,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
"vp_vdpa: fail to request irq for vq %d\n", i);
goto err;
}
-   vp_modern_queue_vector(mdev, i, i);
+   vp_modern_queue_vector(mdev, i, msix_vec);
vp_vdpa->vring[i].irq = irq;
+   msix_vec++;
}
  
  	snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n",

 pci_name(pdev));
-   irq = pci_irq_vector(pdev, queues);
+   irq = pci_irq_vector(pdev, msix_vec);
ret = devm_request_irq(&pdev->dev, irq,  vp_vdpa_config_handler, 0,
   vp_vdpa->msix_name, vp_vdpa);
if (ret) {
dev_err(&pdev->dev,
-   "vp_vdpa: fail to request irq for vq %d\n", i);
+   "vp_vdpa: fail to request irq for config, ret %d\n", 
ret);
goto err;
}
-   vp_modern_config_vector(mdev, queues);
+   vp_modern_config_vector(mdev, msix_vec);
vp_vdpa->config_irq = irq;
  
  	return 0;


Reviewed-by: Heng Qi

Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

2024-04-09 Thread Honggyu Kim

Hi SeongJae,

On Mon,  8 Apr 2024 10:52:28 -0700 SeongJae Park  wrote:
> On Mon,  8 Apr 2024 21:06:44 +0900 Honggyu Kim  wrote:
> 
> > On Fri,  5 Apr 2024 12:24:30 -0700 SeongJae Park  wrote:
> > > On Fri,  5 Apr 2024 15:08:54 +0900 Honggyu Kim  wrote:
> [...]
> > > > Here is one of the example usage of this 'migrate_cold' action.
> > > > 
> > > >   $ cd /sys/kernel/mm/damon/admin/kdamonds/
> > > >   $ cat contexts//schemes//action
> > > >   migrate_cold
> > > >   $ echo 2 > contexts//schemes//target_nid
> > > >   $ echo commit > state
> > > >   $ numactl -p 0 ./hot_cold 500M 600M &
> > > >   $ numastat -c -p hot_cold
> > > > 
> > > >   Per-node process memory usage (in MBs)
> > > >   PID Node 0 Node 1 Node 2 Total
> > > >   --  -- -- -- -
> > > >   701 (hot_cold) 501  0601  1101
> > > > 
> > > > Since there are some common routines with pageout, many functions have
> > > > similar logics between pageout and migrate cold.
> > > > 
> > > > damon_pa_migrate_folio_list() is a minimized version of
> > > > shrink_folio_list(), but it's minified only for demotion.
> > > 
> > > MIGRATE_COLD is not only for demotion, right?  I think the last two words 
> > > are
> > > better to be removed for reducing unnecessary confuses.
> > 
> > You mean the last two sentences?  I will remove them if you feel it's
> > confusing.
> 
> Yes.  My real intended suggestion was 's/only for demotion/only for
> migration/', but entirely removing the sentences is also ok for me.

Ack.

> > 
> > > > 
> > > > Signed-off-by: Honggyu Kim 
> > > > Signed-off-by: Hyeongtak Ji 
> > > > ---
> > > >  include/linux/damon.h|   2 +
> > > >  mm/damon/paddr.c | 146 ++-
> > > >  mm/damon/sysfs-schemes.c |   4 ++
> > > >  3 files changed, 151 insertions(+), 1 deletion(-)
> [...]
> > > > --- a/mm/damon/paddr.c
> > > > +++ b/mm/damon/paddr.c
> [...]
> > > > +{
> > > > +   unsigned int nr_succeeded;
> > > > +   nodemask_t allowed_mask = NODE_MASK_NONE;
> > > > +
> > > 
> > > I personally prefer not having empty lines in the middle of variable
> > > declarations/definitions.  Could we remove this empty line?
> > 
> > I can remove it, but I would like to have more discussion about this
> > issue.  The current implementation allows only a single migration
> > target with "target_nid", but users might want to provide fall back
> > migration target nids.
> > 
> > For example, if more than two CXL nodes exist in the system, users might
> > want to migrate cold pages to any CXL nodes.  In such cases, we might
> > have to make "target_nid" accept comma separated node IDs.  nodemask can
> > be better but we should provide a way to change the scanning order.
> > 
> > I would like to hear how you think about this.
> 
> Good point.  I think we could later extend the sysfs file to receive the
> comma-separated numbers, or even mask.  For simplicity, adding sysfs files
> dedicated for the different format of inputs could also be an option (e.g.,
> target_nids_list, target_nids_mask).  But starting from this single node as is
> now looks ok to me.

If you think we can start from a single node, then I will keep it as is.
But are you okay if I change the same 'target_nid' to accept
comma-separated numbers later?  Or do you want to introduce another knob
such as 'target_nids_list'?  What about rename 'target_nid' to
'target_nids' at the first place?

> [...]
> > > > +   /* 'folio_list' is always empty here */
> > > > +
> > > > +   /* Migrate folios selected for migration */
> > > > +   nr_migrated += migrate_folio_list(&migrate_folios, pgdat, 
> > > > target_nid);
> > > > +   /* Folios that could not be migrated are still in 
> > > > @migrate_folios */
> > > > +   if (!list_empty(&migrate_folios)) {
> > > > +   /* Folios which weren't migrated go back on @folio_list 
> > > > */
> > > > +   list_splice_init(&migrate_folios, folio_list);
> > > > +   }
> > > 
> > > Let's not use braces for single statement
> > > (https://docs.kernel.org/process/coding-style.html#placing-braces-and-spaces).
> > 
> > Hmm.. I know the convention but left it as is because of the comment.
> > If I remove the braces, it would have a weird alignment for the two
> > lines for comment and statement lines.
> 
> I don't really hate such alignment.  But if you don't like it, how about 
> moving
> the comment out of the if statement?  Having one comment for one-line if
> statement looks not bad to me.

Ack. I will manage this in the next revision.

> > 
> > > > +
> > > > +   try_to_unmap_flush();
> > > > +
> > > > +   list_splice(&ret_folios, folio_list);
> > > 
> > > Can't we move remaining folios in migrate_folios to ret_folios at once?
> > 
> > I will see if it's possible.
> 
> Thank you.  Not a strict request, though.
> 
> [...]
> > > > +   nid = folio_nid(lru_to_folio(folio_list));
> > > > +   do {
> > > > +   struct fol

Re: [PATCH v4] vp_vdpa: don't allocate unused msix vectors

2024-04-09 Thread Michael S. Tsirkin

Good and clear subject, I like it.

On Tue, Apr 09, 2024 at 04:58:18PM +0800, lyx634449800 wrote:
> From: Yuxue Liu 
> 
> When there is a ctlq and it doesn't require interrupt
> callbacks,the original method of calculating vectors
> wastes hardware msi or msix resources as well as system
> IRQ resources.
> 
> When conducting performance testing using testpmd in the
> guest os, it was found that the performance was lower compared
> to directly using vfio-pci to passthrough the device
> 
> In scenarios where the virtio device in the guest os does
> not utilize interrupts, the vdpa driver still configures
> the hardware's msix vector. Therefore, the hardware still
> sends interrupts to the host os. Because of this unnecessary
> action by the hardware, hardware performance decreases, and
> it also affects the performance of the host os.
> 
> Before modification:(interrupt mode)
>  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> 
> After modification:(interrupt mode)
>  32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
>  33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
>  34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config
> 
> Before modification:(virtio pmd mode for guest os)
>  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> 
> After modification:(virtio pmd mode for guest os)
>  32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config
> 
> To verify the use of the virtio PMD mode in the guest operating
> system, the following patch needs to be applied to QEMU:
> https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com
> 
> Signed-off-by: Yuxue Liu 
> Acked-by: Jason Wang 

Much better, thanks!
A couple of small tweaks to polish it up and it'll be ready.

> ---
> V4: Update the title and assign values to uninitialized variables
> V3: delete unused variables and add validation records
> V2: fix when allocating IRQs, scan all queues
> 
>  drivers/vdpa/virtio_pci/vp_vdpa.c | 23 +--
>  1 file changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
> b/drivers/vdpa/virtio_pci/vp_vdpa.c
> index df5f4a3bccb5..74bc8adfc7e8 100644
> --- a/drivers/vdpa/virtio_pci/vp_vdpa.c
> +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
> @@ -160,7 +160,14 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   struct pci_dev *pdev = mdev->pci_dev;
>   int i, ret, irq;
>   int queues = vp_vdpa->queues;
> - int vectors = queues + 1;
> + int vectors = 0;
> + int msix_vec = 0;
> +
> + for (i = 0; i < queues; i++) {
> + if (vp_vdpa->vring[i].cb.callback)
> + vectors++;
> + }
> + vectors++;


Actually even easier: int vectors = 1; and then we do not need
this last line.
Sorry I only noticed now.

>  
>   ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
>   if (ret != vectors) {
> @@ -173,9 +180,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   vp_vdpa->vectors = vectors;
>  
>   for (i = 0; i < queues; i++) {
> + if (!vp_vdpa->vring[i].cb.callback)
> + continue;
> +
>   snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE,
>   "vp-vdpa[%s]-%d\n", pci_name(pdev), i);
> - irq = pci_irq_vector(pdev, i);
> + irq = pci_irq_vector(pdev, msix_vec);
>   ret = devm_request_irq(&pdev->dev, irq,
>  vp_vdpa_vq_handler,
>  0, vp_vdpa->vring[i].msix_name,
> @@ -185,21 +195,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   "vp_vdpa: fail to request irq for vq %d\n", i);
>   goto err;
>   }
> - vp_modern_queue_vector(mdev, i, i);
> + vp_modern_queue_vector(mdev, i, msix_vec);
>   vp_vdpa->vring[i].irq = irq;
> + msix_vec++;
>   }
>  
>   snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n",
>pci_name(pdev));
> - irq = pci_irq_vector(pdev, queues);
> + irq = pci_irq_vector(pdev, msix_vec);
>   ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_config_handler, 0,
>  vp_vdpa->msix_name, vp_vdpa);
>   if (ret) {
>   dev_err(&pdev->dev,
> - "vp_vdpa: fail to request irq for vq %d\n", i);
> + "vp_vdpa: fail to request irq for config, ret %d\n", 
> ret);

As lo

Re: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-09 Thread Michal Koutný

On Mon, Apr 08, 2024 at 11:23:21PM -0500, Haitao Huang 
 wrote:
> It's always non-NULL based on testing.
> 
> It's hard for me to say definitely by reading the code. But IIUC
> cgroup_disable command-line only blocks operations in /sys/fs/cgroup so user
> space can't set up controllers and config limits, etc., for the diasabled
> ones. Each task->cgroups would still have a non-NULL pointer to the static
> root object for each cgroup that is enabled by KConfig, so
> get_current_misc_cg() thus  sgx_get_current_cg() should not return NULL
> regardless 'cgroup_disable=misc'.
> 
> Maybe @Michal or @tj can confirm?

The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist.

(It is up to the controller implementation to do further optimization
based on the boot-time disablement (e.g. see uses of
mem_cgroup_disabled()). Not sure if this is useful for misc controller.)

As for the WARN_ON(1), taking example from memcg -- NULL is best
synonymous with root. It's a judgement call which of the values to store
and when to intepret it.

HTH,
Michal

signature.asc
Description: PGP signature

[PATCH v4] vp_vdpa: don't allocate unused msix vectors

2024-04-09 Thread lyx634449800

From: Yuxue Liu 

When there is a ctlq and it doesn't require interrupt
callbacks,the original method of calculating vectors
wastes hardware msi or msix resources as well as system
IRQ resources.

When conducting performance testing using testpmd in the
guest os, it was found that the performance was lower compared
to directly using vfio-pci to passthrough the device

In scenarios where the virtio device in the guest os does
not utilize interrupts, the vdpa driver still configures
the hardware's msix vector. Therefore, the hardware still
sends interrupts to the host os. Because of this unnecessary
action by the hardware, hardware performance decreases, and
it also affects the performance of the host os.

Before modification:(interrupt mode)
 32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
 33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
 34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
 35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config

After modification:(interrupt mode)
 32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
 33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
 34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config

Before modification:(virtio pmd mode for guest os)
 32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
 33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
 34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
 35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config

After modification:(virtio pmd mode for guest os)
 32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config

To verify the use of the virtio PMD mode in the guest operating
system, the following patch needs to be applied to QEMU:
https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com

Signed-off-by: Yuxue Liu 
Acked-by: Jason Wang 
---
V4: Update the title and assign values to uninitialized variables
V3: delete unused variables and add validation records
V2: fix when allocating IRQs, scan all queues

 drivers/vdpa/virtio_pci/vp_vdpa.c | 23 +--
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
b/drivers/vdpa/virtio_pci/vp_vdpa.c
index df5f4a3bccb5..74bc8adfc7e8 100644
--- a/drivers/vdpa/virtio_pci/vp_vdpa.c
+++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
@@ -160,7 +160,14 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
struct pci_dev *pdev = mdev->pci_dev;
int i, ret, irq;
int queues = vp_vdpa->queues;
-   int vectors = queues + 1;
+   int vectors = 0;
+   int msix_vec = 0;
+
+   for (i = 0; i < queues; i++) {
+   if (vp_vdpa->vring[i].cb.callback)
+   vectors++;
+   }
+   vectors++;
 
ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
if (ret != vectors) {
@@ -173,9 +180,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
vp_vdpa->vectors = vectors;
 
for (i = 0; i < queues; i++) {
+   if (!vp_vdpa->vring[i].cb.callback)
+   continue;
+
snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE,
"vp-vdpa[%s]-%d\n", pci_name(pdev), i);
-   irq = pci_irq_vector(pdev, i);
+   irq = pci_irq_vector(pdev, msix_vec);
ret = devm_request_irq(&pdev->dev, irq,
   vp_vdpa_vq_handler,
   0, vp_vdpa->vring[i].msix_name,
@@ -185,21 +195,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
"vp_vdpa: fail to request irq for vq %d\n", i);
goto err;
}
-   vp_modern_queue_vector(mdev, i, i);
+   vp_modern_queue_vector(mdev, i, msix_vec);
vp_vdpa->vring[i].irq = irq;
+   msix_vec++;
}
 
snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n",
 pci_name(pdev));
-   irq = pci_irq_vector(pdev, queues);
+   irq = pci_irq_vector(pdev, msix_vec);
ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_config_handler, 0,
   vp_vdpa->msix_name, vp_vdpa);
if (ret) {
dev_err(&pdev->dev,
-   "vp_vdpa: fail to request irq for vq %d\n", i);
+   "vp_vdpa: fail to request irq for config, ret %d\n", 
ret);
goto err;
}
-   vp_modern_config_vector(mdev, queues);
+   vp_modern_config_vector(mdev, msix_vec);
vp_vdpa->config_irq = irq;
 
return 0;
-- 
2.43.0

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-09 Thread Jiri Olsa

On Tue, Apr 09, 2024 at 09:34:39AM +0900, Masami Hiramatsu wrote:

SNIP

> > > 
> > > > this can be fixed by checking the syscall is called from the trampoline
> > > > and prevent handle_trampoline call if it's not
> > > 
> > > Yes, but I still do not think this makes a lot of sense. But I won't 
> > > argue.
> > > 
> > > And what should sys_uretprobe() do if it is not called from the 
> > > trampoline?
> > > I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.
> > 
> > so the similar behaviour with int3 ends up with immediate SIGTRAP
> > and not invoking pending uretprobe consumers, like:
> > 
> >   - setup uretprobe for foo
> >   - foo() {
> >   executes int 3 -> sends SIGTRAP
> > }
> > 
> > because the int3 handler checks if it got executed from the uretprobe's
> > trampoline.. if not it treats that int3 as regular trap
> 
> Yeah, that is consistent behavior. Sounds good to me.
> 
> > 
> > while for uretprobe syscall we have at the moment following behaviour:
> > 
> >   - setup uretprobe for foo
> >   - foo() {
> >  uretprobe_syscall -> executes foo's uretprobe consumers
> > }
> >   - at some point we get to the 'ret' instruction that jump into uretprobe
> > trampoline and the uretprobe_syscall won't find pending uretprobe and
> > will send SIGILL
> > 
> > 
> > so I think we should mimic int3 behaviour and:
> > 
> >   - setup uretprobe for foo
> >   - foo() {
> >  uretprobe_syscall -> check if we got executed from uretprobe's
> >  trampoline and send SIGILL if that's not the case
> 
> OK, this looks good to me.
> 
> > 
> > I think it's better to have the offending process killed right away,
> > rather than having more undefined behaviour, waiting for final 'ret'
> > instruction that jumps to uretprobe trampoline and causes SIGILL
> > 
> > > 
> > > I agree very much with Andrii,
> > > 
> > >sigreturn()  exists only to allow the implementation of signal 
> > > handlers.  It should never be
> > >called directly.  Details of the arguments (if any) passed to 
> > > sigreturn() vary depending  on
> > >the architecture.
> > > 
> > > this is how sys_uretprobe() should be treated/documented.
> > 
> > yes, will include man page patch in new version
> 
> And please follow Documentation/process/adding-syscalls.rst in new version,
> then we can avoid repeating the same discussion :-)

yep, will do

thanks,
jirka

Re: [PATCH v14 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

2024-04-09 Thread Krzysztof Kozlowski

On 08/04/2024 22:53, Tanmay Shah wrote:
> From: Radhey Shyam Pandey 
> 
> Introduce bindings for TCM memory address space on AMD-xilinx Zynq
> UltraScale+ platform. It will help in defining TCM in device-tree
> and make it's access platform agnostic and data-driven.
> 
> Tightly-coupled memories(TCMs) are low-latency memory that provides
> predictable instruction execution and predictable data load/store
> timing. Each Cortex-R5F processor contains two 64-bit wide 64 KB memory
> banks on the ATCM and BTCM ports, for a total of 128 KB of memory.
> 
> The TCM resources(reg, reg-names and power-domain) are documented for
> each TCM in the R5 node. The reg and reg-names are made as required
> properties as we don't want to hardcode TCM addresses for future
> platforms and for zu+ legacy implementation will ensure that the
> old dts w/o reg/reg-names works and stable ABI is maintained.
> 
> It also extends the examples for TCM split and lockstep modes.
> 
> Signed-off-by: Radhey Shyam Pandey 
> Signed-off-by: Tanmay Shah 

Reviewed-by: Krzysztof Kozlowski 

Best regards,
Krzysztof

61 matches

Mail list logo