Re: [External] Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-09 Thread Ho-Ren (Jack) Chuang
On Tue, Apr 9, 2024 at 7:33 PM Huang, Ying  wrote:
>
> "Ho-Ren (Jack) Chuang"  writes:
>
> > On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
> >  wrote:
> >>
> >> On Fri,  5 Apr 2024 00:07:06 +
> >> "Ho-Ren (Jack) Chuang"  wrote:
> >>
> >> > The current implementation treats emulated memory devices, such as
> >> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal 
> >> > memory
> >> > (E820_TYPE_RAM). However, these emulated devices have different
> >> > characteristics than traditional DRAM, making it important to
> >> > distinguish them. Thus, we modify the tiered memory initialization 
> >> > process
> >> > to introduce a delay specifically for CPUless NUMA nodes. This delay
> >> > ensures that the memory tier initialization for these nodes is deferred
> >> > until HMAT information is obtained during the boot process. Finally,
> >> > demotion tables are recalculated at the end.
> >> >
> >> > * late_initcall(memory_tier_late_init);
> >> > Some device drivers may have initialized memory tiers between
> >> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> >> > online memory nodes and configuring memory tiers. They should be excluded
> >> > in the late init.
> >> >
> >> > * Handle cases where there is no HMAT when creating memory tiers
> >> > There is a scenario where a CPUless node does not provide HMAT 
> >> > information.
> >> > If no HMAT is specified, it falls back to using the default DRAM tier.
> >> >
> >> > * Introduce another new lock `default_dram_perf_lock` for adist 
> >> > calculation
> >> > In the current implementation, iterating through CPUlist nodes requires
> >> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end 
> >> > up
> >> > trying to acquire the same lock, leading to a potential deadlock.
> >> > Therefore, we propose introducing a standalone `default_dram_perf_lock` 
> >> > to
> >> > protect `default_dram_perf_*`. This approach not only avoids deadlock
> >> > but also prevents holding a large lock simultaneously.
> >> >
> >> > * Upgrade `set_node_memory_tier` to support additional cases, including
> >> >   default DRAM, late CPUless, and hot-plugged initializations.
> >> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> >> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> >> > handle cases where memtype is not initialized and where HMAT information 
> >> > is
> >> > available.
> >> >
> >> > * Introduce `default_memory_types` for those memory types that are not
> >> >   initialized by device drivers.
> >> > Because late initialized memory and default DRAM memory need to be 
> >> > managed,
> >> > a default memory type is created for storing all memory types that are
> >> > not initialized by device drivers and as a fallback.
> >> >
> >> > Signed-off-by: Ho-Ren (Jack) Chuang 
> >> > Signed-off-by: Hao Xiang 
> >> > Reviewed-by: "Huang, Ying" 
> >>
> >> Hi - one remaining question. Why can't we delay init for all nodes
> >> to either drivers or your fallback late_initcall code.
> >> It would be nice to reduce possible code paths.
> >
> > I try not to change too much of the existing code structure in
> > this patchset.
> >
> > To me, postponing/moving all memory tier registrations to
> > late_initcall() is another possible action item for the next patchset.
> >
> > After tier_mem(), hmat_init() is called, which requires registering
> > `default_dram_type` info. This is when `default_dram_type` is needed.
> > However, it is indeed possible to postpone the latter part,
> > set_node_memory_tier(), to `late_init(). So, memory_tier_init() can
> > indeed be split into two parts, and the latter part can be moved to
> > late_initcall() to be processed together.
>
> I don't think that it's good to move all memory_tier initialization in
> drivers to late_initcall().  It's natural to keep them in
> device_initcall() level.
>
> If so, we can allocate default_dram_type in memory_tier_init(), and call
> set_node_memory_tier() only in memory_tier_lateinit().  We can call
> memory_tier_lateinit() in device_initcall() level too.
>

It makes sense to me to leave only `default_dram_type ` and
hotplug_init() in memory_tier_init(), postponing all
set_node_memory_tier()s to memory_tier_late_init()

Would it be possible there is no device_initcall() calling
memory_tier_late_init()? If yes, I think putting memory_tier_late_init()
in late_init() is still necessary.

> --
> Best Regards,
> Huang, Ying
>
> > Doing this all memory-type drivers have to call late_initcall() to
> > register a memory tier. I’m not sure how many they are?
> >
> > What do you guys think?
> >
> >>
> >> Jonathan
> >>
> >>
> >> > ---
> >> >  mm/memory-tiers.c | 94 +++
> >> >  1 file changed, 70 insertions(+), 24 deletions(-)
> >> >
> >> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> >> > index 516b144fd45a..6632102bd5c9 100644
> >> > --- a/mm/memory-tiers.c
> >> > +++ 

Re: [PATCH v2 0/1] virtio-pci: Fix the crash that the vector was used after released

2024-04-09 Thread Cindy Lu
Sorry, send to the wrong mail list, please ignore it

On Wed, Apr 10, 2024 at 12:35 PM Cindy Lu  wrote:
>
> During the booting process of the Vyatta image, the behavior of the
> called function in qemu is as follows:
>
> 1. vhost_net_stop() was triggered by guest image . This will call the function
> virtio_pci_set_guest_notifiers() with assgin= false, and
> virtio_pci_set_guest_notifiers() will release the irqfd for vector 0
>
> 2. virtio_reset() was called -->set configure vector to VIRTIO_NO_VECTOR
>
> 3.vhost_net_start() was called (at this time, the configure vector is
> still VIRTIO_NO_VECTOR) and call virtio_pci_set_guest_notifiers() with
> assgin= true, so the irqfd for vector 0 is still not "init" during this 
> process
>
> 4. The system continues to boot,set the vector back to 0, and 
> msix_fire_vector_notifier() was triggered
>  unmask the vector 0 and then met the crash
> [msix_fire_vector_notifier] 112 called vector 0 is_masked 1
> [msix_fire_vector_notifier] 112 called vector 0 is_masked 0
>
> To fix this, we need to call the function "kvm_virtio_pci_vector_use_one()"
> when the vector changes back from VIRTIO_NO_VECTOR.
>
> The reason that we don't need to call kvm_virtio_pci_vector_release_one while 
> the vector changes to
> VIRTIO_NO_VECTOR is this function will called in vhost_net_stop(),
> So this step will not lost during this process.
>
> Change from V1
> 1.add the check for if using irqfd
> 2.remove the check for bool recovery, irqfd's user is enough to check status
>
> Cindy Lu (1):
>   virtio-pci: Fix the crash that the vector was used after released.
>
>  hw/virtio/virtio-pci.c | 35 +++
>  1 file changed, 35 insertions(+)
>
> --
> 2.43.0
>




Re: [PATCH v2 1/1] virtio-pci: Fix the crash that the vector was used after released.

2024-04-09 Thread Cindy Lu
Sorry, send to the wrong mail list, please ignore it

On Wed, Apr 10, 2024 at 12:35 PM Cindy Lu  wrote:
>
> When the guest triggers vhost_stop and then virtio_reset, the vector will the
> IRQFD for this vector will be released and change to VIRTIO_NO_VECTOR.
> After that, the guest called vhost_net_start,  (at this time, the configure
> vector is still VIRTIO_NO_VECTOR),  vector 0 still was not "init".
> The guest system continued to boot, set the vector back to 0, and then met 
> the crash.
>
> To fix this, we need to call the function "kvm_virtio_pci_vector_use_one()"
> when the vector changes back from VIRTIO_NO_VECTOR
>
> (gdb) bt
> 0  __pthread_kill_implementation (threadid=, 
> signo=signo@entry=6, no_tid=no_tid@entry=0)
> at pthread_kill.c:44
> 1  0x7fc87148ec53 in __pthread_kill_internal (signo=6, 
> threadid=) at pthread_kill.c:78
> 2  0x7fc87143e956 in __GI_raise (sig=sig@entry=6) at 
> ../sysdeps/posix/raise.c:26
> 3  0x7fc8714287f4 in __GI_abort () at abort.c:79
> 4  0x7fc87142871b in __assert_fail_base
> (fmt=0x7fc8715bbde0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
> assertion=0x5606413efd53 "ret == 0", file=0x5606413ef87d 
> "../accel/kvm/kvm-all.c", line=1837, function=) at assert.c:92
> 5  0x7fc871437536 in __GI___assert_fail
> (assertion=0x5606413efd53 "ret == 0", file=0x5606413ef87d 
> "../accel/kvm/kvm-all.c", line=1837, function=0x5606413f06f0 
> <__PRETTY_FUNCTION__.19> "kvm_irqchip_commit_routes") at assert.c:101
> 6  0x560640f884b5 in kvm_irqchip_commit_routes (s=0x560642cae1f0) at 
> ../accel/kvm/kvm-all.c:1837
> 7  0x560640c98f8e in virtio_pci_one_vector_unmask
> (proxy=0x560643c65f00, queue_no=4294967295, vector=0, msg=..., 
> n=0x560643c6e4c8)
> at ../hw/virtio/virtio-pci.c:1005
> 8  0x560640c99201 in virtio_pci_vector_unmask (dev=0x560643c65f00, 
> vector=0, msg=...)
> at ../hw/virtio/virtio-pci.c:1070
> 9  0x560640bc402e in msix_fire_vector_notifier (dev=0x560643c65f00, 
> vector=0, is_masked=false)
> at ../hw/pci/msix.c:120
> 10 0x560640bc40f1 in msix_handle_mask_update (dev=0x560643c65f00, 
> vector=0, was_masked=true)
> at ../hw/pci/msix.c:140
> 11 0x560640bc4503 in msix_table_mmio_write (opaque=0x560643c65f00, 
> addr=12, val=0, size=4)
> at ../hw/pci/msix.c:231
> 12 0x560640f26d83 in memory_region_write_accessor
> (mr=0x560643c66540, addr=12, value=0x7fc86b7bc628, size=4, shift=0, 
> mask=4294967295, attrs=...)
> at ../system/memory.c:497
> 13 0x560640f270a6 in access_with_adjusted_size
>
>  (addr=12, value=0x7fc86b7bc628, size=4, access_size_min=1, 
> access_size_max=4, access_fn=0x560640f26c8d , 
> mr=0x560643c66540, attrs=...) at ../system/memory.c:573
> 14 0x560640f2a2b5 in memory_region_dispatch_write (mr=0x560643c66540, 
> addr=12, data=0, op=MO_32, attrs=...)
> at ../system/memory.c:1521
> 15 0x560640f37bac in flatview_write_continue
> (fv=0x7fc65805e0b0, addr=4273803276, attrs=..., ptr=0x7fc871e9c028, 
> len=4, addr1=12, l=4, mr=0x560643c66540)
> at ../system/physmem.c:2714
> 16 0x560640f37d0f in flatview_write
> (fv=0x7fc65805e0b0, addr=4273803276, attrs=..., buf=0x7fc871e9c028, 
> len=4) at ../system/physmem.c:2756
> 17 0x560640f380bf in address_space_write
> (as=0x560642161ae0 , addr=4273803276, attrs=..., 
> buf=0x7fc871e9c028, len=4)
> at ../system/physmem.c:2863
> 18 0x560640f3812c in address_space_rw
> (as=0x560642161ae0 , addr=4273803276, attrs=..., 
> buf=0x7fc871e9c028, len=4, is_write=true) at ../system/physmem.c:2873
> --Type  for more, q to quit, c to continue without paging--
> 19 0x560640f8aa55 in kvm_cpu_exec (cpu=0x560642f205e0) at 
> ../accel/kvm/kvm-all.c:2915
> 20 0x560640f8d731 in kvm_vcpu_thread_fn (arg=0x560642f205e0) at 
> ../accel/kvm/kvm-accel-ops.c:51
> 21 0x5606411949f4 in qemu_thread_start (args=0x560642f292b0) at 
> ../util/qemu-thread-posix.c:541
> 22 0x7fc87148cdcd in start_thread (arg=) at 
> pthread_create.c:442
> 23 0x7fc871512630 in clone3 () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
> (gdb)
> Signed-off-by: Cindy Lu 
> ---
>  hw/virtio/virtio-pci.c | 35 +++
>  1 file changed, 35 insertions(+)
>
> diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
> index 1a7039fb0c..344f4fb844 100644
> --- a/hw/virtio/virtio-pci.c
> +++ b/hw/virtio/virtio-pci.c
> @@ -880,6 +880,7 @@ static int kvm_virtio_pci_vector_use_one(VirtIOPCIProxy 
> *proxy, int queue_no)
>  int ret;
>  EventNotifier *n;
>  PCIDevice *dev = >pci_dev;
> +VirtIOIRQFD *irqfd;
>  VirtIODevice *vdev = virtio_bus_get_device(>bus);
>  VirtioDeviceClass *k = VIRTIO_DEVICE_GET_CLASS(vdev);
>
> @@ -890,10 +891,19 @@ static int kvm_virtio_pci_vector_use_one(VirtIOPCIProxy 
> *proxy, int queue_no)
>  if (vector >= msix_nr_vectors_allocated(dev)) {
>  return 0;
>  }
> +/*
> + * if the irqfd still in use, means the 

[PATCH v2 1/1] virtio-pci: Fix the crash that the vector was used after released.

2024-04-09 Thread Cindy Lu
When the guest triggers vhost_stop and then virtio_reset, the vector will the
IRQFD for this vector will be released and change to VIRTIO_NO_VECTOR.
After that, the guest called vhost_net_start,  (at this time, the configure
vector is still VIRTIO_NO_VECTOR),  vector 0 still was not "init".
The guest system continued to boot, set the vector back to 0, and then met the 
crash.

To fix this, we need to call the function "kvm_virtio_pci_vector_use_one()"
when the vector changes back from VIRTIO_NO_VECTOR

(gdb) bt
0  __pthread_kill_implementation (threadid=, 
signo=signo@entry=6, no_tid=no_tid@entry=0)
at pthread_kill.c:44
1  0x7fc87148ec53 in __pthread_kill_internal (signo=6, threadid=) at pthread_kill.c:78
2  0x7fc87143e956 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/posix/raise.c:26
3  0x7fc8714287f4 in __GI_abort () at abort.c:79
4  0x7fc87142871b in __assert_fail_base
(fmt=0x7fc8715bbde0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
assertion=0x5606413efd53 "ret == 0", file=0x5606413ef87d 
"../accel/kvm/kvm-all.c", line=1837, function=) at assert.c:92
5  0x7fc871437536 in __GI___assert_fail
(assertion=0x5606413efd53 "ret == 0", file=0x5606413ef87d 
"../accel/kvm/kvm-all.c", line=1837, function=0x5606413f06f0 
<__PRETTY_FUNCTION__.19> "kvm_irqchip_commit_routes") at assert.c:101
6  0x560640f884b5 in kvm_irqchip_commit_routes (s=0x560642cae1f0) at 
../accel/kvm/kvm-all.c:1837
7  0x560640c98f8e in virtio_pci_one_vector_unmask
(proxy=0x560643c65f00, queue_no=4294967295, vector=0, msg=..., 
n=0x560643c6e4c8)
at ../hw/virtio/virtio-pci.c:1005
8  0x560640c99201 in virtio_pci_vector_unmask (dev=0x560643c65f00, 
vector=0, msg=...)
at ../hw/virtio/virtio-pci.c:1070
9  0x560640bc402e in msix_fire_vector_notifier (dev=0x560643c65f00, 
vector=0, is_masked=false)
at ../hw/pci/msix.c:120
10 0x560640bc40f1 in msix_handle_mask_update (dev=0x560643c65f00, vector=0, 
was_masked=true)
at ../hw/pci/msix.c:140
11 0x560640bc4503 in msix_table_mmio_write (opaque=0x560643c65f00, addr=12, 
val=0, size=4)
at ../hw/pci/msix.c:231
12 0x560640f26d83 in memory_region_write_accessor
(mr=0x560643c66540, addr=12, value=0x7fc86b7bc628, size=4, shift=0, 
mask=4294967295, attrs=...)
at ../system/memory.c:497
13 0x560640f270a6 in access_with_adjusted_size

 (addr=12, value=0x7fc86b7bc628, size=4, access_size_min=1, 
access_size_max=4, access_fn=0x560640f26c8d , 
mr=0x560643c66540, attrs=...) at ../system/memory.c:573
14 0x560640f2a2b5 in memory_region_dispatch_write (mr=0x560643c66540, 
addr=12, data=0, op=MO_32, attrs=...)
at ../system/memory.c:1521
15 0x560640f37bac in flatview_write_continue
(fv=0x7fc65805e0b0, addr=4273803276, attrs=..., ptr=0x7fc871e9c028, len=4, 
addr1=12, l=4, mr=0x560643c66540)
at ../system/physmem.c:2714
16 0x560640f37d0f in flatview_write
(fv=0x7fc65805e0b0, addr=4273803276, attrs=..., buf=0x7fc871e9c028, len=4) 
at ../system/physmem.c:2756
17 0x560640f380bf in address_space_write
(as=0x560642161ae0 , addr=4273803276, attrs=..., 
buf=0x7fc871e9c028, len=4)
at ../system/physmem.c:2863
18 0x560640f3812c in address_space_rw
(as=0x560642161ae0 , addr=4273803276, attrs=..., 
buf=0x7fc871e9c028, len=4, is_write=true) at ../system/physmem.c:2873
--Type  for more, q to quit, c to continue without paging--
19 0x560640f8aa55 in kvm_cpu_exec (cpu=0x560642f205e0) at 
../accel/kvm/kvm-all.c:2915
20 0x560640f8d731 in kvm_vcpu_thread_fn (arg=0x560642f205e0) at 
../accel/kvm/kvm-accel-ops.c:51
21 0x5606411949f4 in qemu_thread_start (args=0x560642f292b0) at 
../util/qemu-thread-posix.c:541
22 0x7fc87148cdcd in start_thread (arg=) at 
pthread_create.c:442
23 0x7fc871512630 in clone3 () at 
../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb)
Signed-off-by: Cindy Lu 
---
 hw/virtio/virtio-pci.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 1a7039fb0c..344f4fb844 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -880,6 +880,7 @@ static int kvm_virtio_pci_vector_use_one(VirtIOPCIProxy 
*proxy, int queue_no)
 int ret;
 EventNotifier *n;
 PCIDevice *dev = >pci_dev;
+VirtIOIRQFD *irqfd;
 VirtIODevice *vdev = virtio_bus_get_device(>bus);
 VirtioDeviceClass *k = VIRTIO_DEVICE_GET_CLASS(vdev);
 
@@ -890,10 +891,19 @@ static int kvm_virtio_pci_vector_use_one(VirtIOPCIProxy 
*proxy, int queue_no)
 if (vector >= msix_nr_vectors_allocated(dev)) {
 return 0;
 }
+/*
+ * if the irqfd still in use, means the irqfd was not
+ * release before and don't need to set up
+ */
+irqfd = >vector_irqfd[vector];
+if (irqfd->users != 0) {
+return 0;
+}
 ret = kvm_virtio_pci_vq_vector_use(proxy, vector);
 if (ret < 0) {
 goto undo;
 }
+
 /*
  * If guest supports masking, set up 

[PATCH v2 0/1] virtio-pci: Fix the crash that the vector was used after released

2024-04-09 Thread Cindy Lu
During the booting process of the Vyatta image, the behavior of the
called function in qemu is as follows:

1. vhost_net_stop() was triggered by guest image . This will call the function
virtio_pci_set_guest_notifiers() with assgin= false, and
virtio_pci_set_guest_notifiers(??? will release the irqfd for vector 0

2. virtio_reset() was called -->set configure vector to VIRTIO_NO_VECTOR

3.vhost_net_start() was called (at this time, the configure vector is
still VIRTIO_NO_VECTOR) and call virtio_pci_set_guest_notifiers() with
assgin= true, so the irqfd for vector 0 is still not "init" during this process

4. The system continues to boot,set the vector back to 0, and 
msix_fire_vector_notifier() was triggered
 unmask the vector 0 and then met the crash
[msix_fire_vector_notifier] 112 called vector 0 is_masked 1
[msix_fire_vector_notifier] 112 called vector 0 is_masked 0

To fix this, we need to call the function "kvm_virtio_pci_vector_use_one()"
when the vector changes back from VIRTIO_NO_VECTOR.

The reason that we don't need to call kvm_virtio_pci_vector_release_one while 
the vector changes to
VIRTIO_NO_VECTOR is this function will called in vhost_net_stop(),
So this step will not lost during this process.

Change from V1
1.add the check for if using irqfd
2.remove the check for bool recovery, irqfd's user is enough to check status

Cindy Lu (1):
  virtio-pci: Fix the crash that the vector was used after released.

 hw/virtio/virtio-pci.c | 35 +++
 1 file changed, 35 insertions(+)

-- 
2.43.0




[PATCH v5] vp_vdpa: don't allocate unused msix vectors

2024-04-09 Thread lyx634449800
From: Yuxue Liu 

When there is a ctlq and it doesn't require interrupt
callbacks,the original method of calculating vectors
wastes hardware msi or msix resources as well as system
IRQ resources.

When conducting performance testing using testpmd in the
guest os, it was found that the performance was lower compared
to directly using vfio-pci to passthrough the device

In scenarios where the virtio device in the guest os does
not utilize interrupts, the vdpa driver still configures
the hardware's msix vector. Therefore, the hardware still
sends interrupts to the host os. Because of this unnecessary
action by the hardware, hardware performance decreases, and
it also affects the performance of the host os.

Before modification:(interrupt mode)
 32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
 33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
 34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
 35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config

After modification:(interrupt mode)
 32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
 33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
 34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config

Before modification:(virtio pmd mode for guest os)
 32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
 33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
 34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
 35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config

After modification:(virtio pmd mode for guest os)
 32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config

To verify the use of the virtio PMD mode in the guest operating
system, the following patch needs to be applied to QEMU:
https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com

Signed-off-by: Yuxue Liu 
Acked-by: Jason Wang 
Reviewed-by: Heng Qi 
---
V5: modify the description of the printout when an exception occurs
V4: update the title and assign values to uninitialized variables
V3: delete unused variables and add validation records
V2: fix when allocating IRQs, scan all queues

 drivers/vdpa/virtio_pci/vp_vdpa.c | 22 --
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
b/drivers/vdpa/virtio_pci/vp_vdpa.c
index df5f4a3bccb5..8de0224e9ec2 100644
--- a/drivers/vdpa/virtio_pci/vp_vdpa.c
+++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
@@ -160,7 +160,13 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
struct pci_dev *pdev = mdev->pci_dev;
int i, ret, irq;
int queues = vp_vdpa->queues;
-   int vectors = queues + 1;
+   int vectors = 1;
+   int msix_vec = 0;
+
+   for (i = 0; i < queues; i++) {
+   if (vp_vdpa->vring[i].cb.callback)
+   vectors++;
+   }
 
ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
if (ret != vectors) {
@@ -173,9 +179,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
vp_vdpa->vectors = vectors;
 
for (i = 0; i < queues; i++) {
+   if (!vp_vdpa->vring[i].cb.callback)
+   continue;
+
snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE,
"vp-vdpa[%s]-%d\n", pci_name(pdev), i);
-   irq = pci_irq_vector(pdev, i);
+   irq = pci_irq_vector(pdev, msix_vec);
ret = devm_request_irq(>dev, irq,
   vp_vdpa_vq_handler,
   0, vp_vdpa->vring[i].msix_name,
@@ -185,21 +194,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
"vp_vdpa: fail to request irq for vq %d\n", i);
goto err;
}
-   vp_modern_queue_vector(mdev, i, i);
+   vp_modern_queue_vector(mdev, i, msix_vec);
vp_vdpa->vring[i].irq = irq;
+   msix_vec++;
}
 
snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n",
 pci_name(pdev));
-   irq = pci_irq_vector(pdev, queues);
+   irq = pci_irq_vector(pdev, msix_vec);
ret = devm_request_irq(>dev, irq, vp_vdpa_config_handler, 0,
   vp_vdpa->msix_name, vp_vdpa);
if (ret) {
dev_err(>dev,
-   "vp_vdpa: fail to request irq for vq %d\n", i);
+   "vp_vdpa: fail to request irq for config: %d\n", ret);
goto err;
}
-   vp_modern_config_vector(mdev, queues);
+   vp_modern_config_vector(mdev, msix_vec);
vp_vdpa->config_irq = irq;
 
return 0;
-- 
2.43.0




Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-09 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
>  wrote:
>>
>> On Fri,  5 Apr 2024 00:07:06 +
>> "Ho-Ren (Jack) Chuang"  wrote:
>>
>> > The current implementation treats emulated memory devices, such as
>> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
>> > (E820_TYPE_RAM). However, these emulated devices have different
>> > characteristics than traditional DRAM, making it important to
>> > distinguish them. Thus, we modify the tiered memory initialization process
>> > to introduce a delay specifically for CPUless NUMA nodes. This delay
>> > ensures that the memory tier initialization for these nodes is deferred
>> > until HMAT information is obtained during the boot process. Finally,
>> > demotion tables are recalculated at the end.
>> >
>> > * late_initcall(memory_tier_late_init);
>> > Some device drivers may have initialized memory tiers between
>> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
>> > online memory nodes and configuring memory tiers. They should be excluded
>> > in the late init.
>> >
>> > * Handle cases where there is no HMAT when creating memory tiers
>> > There is a scenario where a CPUless node does not provide HMAT information.
>> > If no HMAT is specified, it falls back to using the default DRAM tier.
>> >
>> > * Introduce another new lock `default_dram_perf_lock` for adist calculation
>> > In the current implementation, iterating through CPUlist nodes requires
>> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
>> > trying to acquire the same lock, leading to a potential deadlock.
>> > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
>> > protect `default_dram_perf_*`. This approach not only avoids deadlock
>> > but also prevents holding a large lock simultaneously.
>> >
>> > * Upgrade `set_node_memory_tier` to support additional cases, including
>> >   default DRAM, late CPUless, and hot-plugged initializations.
>> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
>> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
>> > handle cases where memtype is not initialized and where HMAT information is
>> > available.
>> >
>> > * Introduce `default_memory_types` for those memory types that are not
>> >   initialized by device drivers.
>> > Because late initialized memory and default DRAM memory need to be managed,
>> > a default memory type is created for storing all memory types that are
>> > not initialized by device drivers and as a fallback.
>> >
>> > Signed-off-by: Ho-Ren (Jack) Chuang 
>> > Signed-off-by: Hao Xiang 
>> > Reviewed-by: "Huang, Ying" 
>>
>> Hi - one remaining question. Why can't we delay init for all nodes
>> to either drivers or your fallback late_initcall code.
>> It would be nice to reduce possible code paths.
>
> I try not to change too much of the existing code structure in
> this patchset.
>
> To me, postponing/moving all memory tier registrations to
> late_initcall() is another possible action item for the next patchset.
>
> After tier_mem(), hmat_init() is called, which requires registering
> `default_dram_type` info. This is when `default_dram_type` is needed.
> However, it is indeed possible to postpone the latter part,
> set_node_memory_tier(), to `late_init(). So, memory_tier_init() can
> indeed be split into two parts, and the latter part can be moved to
> late_initcall() to be processed together.

I don't think that it's good to move all memory_tier initialization in
drivers to late_initcall().  It's natural to keep them in
device_initcall() level.

If so, we can allocate default_dram_type in memory_tier_init(), and call
set_node_memory_tier() only in memory_tier_lateinit().  We can call
memory_tier_lateinit() in device_initcall() level too.

--
Best Regards,
Huang, Ying

> Doing this all memory-type drivers have to call late_initcall() to
> register a memory tier. I’m not sure how many they are?
>
> What do you guys think?
>
>>
>> Jonathan
>>
>>
>> > ---
>> >  mm/memory-tiers.c | 94 +++
>> >  1 file changed, 70 insertions(+), 24 deletions(-)
>> >
>> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> > index 516b144fd45a..6632102bd5c9 100644
>> > --- a/mm/memory-tiers.c
>> > +++ b/mm/memory-tiers.c
>>
>>
>>
>> > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void)
>> >* For now we can have 4 faster memory tiers with smaller adistance
>> >* than default DRAM tier.
>> >*/
>> > - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
>> > + default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
>> > +   _memory_types);
>> >   if (IS_ERR(default_dram_type))
>> >   panic("%s() failed to allocate default DRAM tier\n", 
>> > __func__);
>> >
>> > @@ -865,6 +903,14 @@ static int __init 

Re: [PATCHv3 1/2] dt-bindings: usb: typec: anx7688: start a binding document

2024-04-09 Thread Ondřej Jirman
On Mon, Apr 08, 2024 at 10:12:30PM GMT, Krzysztof Kozlowski wrote:
> On 08/04/2024 17:17, Ondřej Jirman wrote:
> > 
> > Now for things to not fail during suspend/resume based on PM callbacks
> > invocation order, anx7688 driver needs to enable this regulator too, as long
> > as it needs it.
> 
> No, the I2C bus driver needs to manage it. Not one individual I2C
> device. Again, why anx7688 is specific? If you next phone has anx8867,
> using different driver, you also add there i2c-supply? And if it is
> nxp,ptn5100 as well?

Yes, that could work, if I2C core would manage this.

> > 
> > I can put bus-supply to I2C controller node, and read it from the ANX7688 
> > driver
> > I guess, by going up a DT node. Whether that's going to be acceptable, I 
> > don't
> > know. 
> > 
> > 
> > VCONN regulator I don't know where else to put either. It doesn't seem to 
> > belong
> > anywhere. It's not something directly connected to Type-C connector, so
> > not part of connector bindings, and there's nothing else I can see, other
> > than anx7688 device which needs it for core functionality.
> 
> That sounds like a GPIO, not regulator. anx7688 has GPIOs, right? On
> Pinephone they go to regulator, but on FooPhone also using anx7688 they
> go somewhere else, so why this anx7688 assumes this is a regulator?

CC1/CC2_VCONN control pins are "GPIO" of anx7688, sort of. They have fixed
purpose of switching external 5V regulator output to one of the CC pins
on type-c port. I don't care what other purpose with some other firmware
someone puts to those pins. It's irrelevant to the use case of anx7688
as a type-c controller/HDMI bridge, which we're describing here.

VCONN regulator is an actual GPIO controlled regulator on the board, and
needs to be controlled by the anx7688 driver. So that CC1/CC2_VCONN control
pins driven by the firmware actually do what they're supposed to do.

Not sure why it would be a business of anything else but anx7688 driver
enabling this regulator, because only this driver knows and cares about this.
If some other board doesn't have the need to manually enable the regulator, or
doesn't have the regulator, it can simply be optional.

There are also some other funky supplies in the bindings, that are not connected
to the chip in any way, but need to be controlled by the driver:

+  vbus-supply: true
+  vbus-in-supply: true

First one can be on the connector node instead, where the driver can fetch
it from.

The purpose of the second one is to link the Phone's PMIC's USB power input with
the type-c controller (anx7688), to make sure the PMIC has information about how
much power it can draw from external PSU.

The second one can be replaced by rewriting the anx7688 driver so that it
creates a power supply representing the USB PSU connected to the phone, and by
linking to anx7688 DT node from x-powers,axp813-usb-power-supply via
a power-supplies property, which would mean that USB input of the phone is
supplied by the external USB PSU. PMIC driver can be modified to watch
the power supply provided by anx7688 driver for information it detected
via USB-PD and update its input current limit via a standard helper function.

This is how eg. fusb302 works. Not sure if that's any better from the PoV of
DT bindings, though. Because power-supplies = <>; will not look any
greater in DT bindings, IMO. It will just link the same nodes in the other
direction.

Anyway, the HW is that there's an external PSU (detected by type c controller)
and internal USB power input, and they are connected and one has to respect the
limits of the other. I guess I shouldn't be adding a device node for external 
PSU,
since it's not part of the phone. But that's what's trully being linked on
HW level.

Kind regards,
o.

> 
> > 
> > ANX7688 chip desing doesn't have integrated VCONN mosfet switches so it 
> > always
> > needs external supply + switches that are controlled by the chip itself. 
> > There's
> > no sensible design where someone would not want this and the driver needs
> > to get this regulator reference from somewhere. The switches are sort of an
> > extension of the chip.
> 
> 
> Best regards,
> Krzysztof
> 



[PATCH v3] kprobes: Fix possible use-after-free issue on kprobe registration

2024-04-09 Thread Zheng Yejian
When unloading a module, its state is changing MODULE_STATE_LIVE ->
 MODULE_STATE_GOING -> MODULE_STATE_UNFORMED. Each change will take
a time. `is_module_text_address()` and `__module_text_address()`
works with MODULE_STATE_LIVE and MODULE_STATE_GOING.
If we use `is_module_text_address()` and `__module_text_address()`
separately, there is a chance that the first one is succeeded but the
next one is failed because module->state becomes MODULE_STATE_UNFORMED
between those operations.

In `check_kprobe_address_safe()`, if the second `__module_text_address()`
is failed, that is ignored because it expected a kernel_text address.
But it may have failed simply because module->state has been changed
to MODULE_STATE_UNFORMED. In this case, arm_kprobe() will try to modify
non-exist module text address (use-after-free).

To fix this problem, we should not use separated `is_module_text_address()`
and `__module_text_address()`, but use only `__module_text_address()`
once and do `try_module_get(module)` which is only available with
MODULE_STATE_LIVE.

Signed-off-by: Zheng Yejian 
---
 kernel/kprobes.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

v3:
 - Update commit messages as suggested by Masami.
   Link: 
https://lore.kernel.org/all/20240409224922.5f192e8ace5f7a90937bf...@kernel.org/
 - Also change to a more appropriate title.

v2:
 - Update commit messages and comments as suggested by Masami.
   Link: 
https://lore.kernel.org/all/20240408115038.b0c85767bf1f249eccc32...@kernel.org/
 - Link: 
https://lore.kernel.org/all/20240408083403.3302274-1-zhengyeji...@huawei.com/

v1:
 - Link: 
https://lore.kernel.org/all/20240407035904.2556645-1-zhengyeji...@huawei.com/

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..65adc815fc6e 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1567,10 +1567,17 @@ static int check_kprobe_address_safe(struct kprobe *p,
jump_label_lock();
preempt_disable();
 
-   /* Ensure it is not in reserved area nor out of text */
-   if (!(core_kernel_text((unsigned long) p->addr) ||
-   is_module_text_address((unsigned long) p->addr)) ||
-   in_gate_area_no_mm((unsigned long) p->addr) ||
+   /* Ensure the address is in a text area, and find a module if exists. */
+   *probed_mod = NULL;
+   if (!core_kernel_text((unsigned long) p->addr)) {
+   *probed_mod = __module_text_address((unsigned long) p->addr);
+   if (!(*probed_mod)) {
+   ret = -EINVAL;
+   goto out;
+   }
+   }
+   /* Ensure it is not in reserved area. */
+   if (in_gate_area_no_mm((unsigned long) p->addr) ||
within_kprobe_blacklist((unsigned long) p->addr) ||
jump_label_text_reserved(p->addr, p->addr) ||
static_call_text_reserved(p->addr, p->addr) ||
@@ -1580,8 +1587,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
-   /* Check if 'p' is probing a module. */
-   *probed_mod = __module_text_address((unsigned long) p->addr);
+   /* Get module refcount and reject __init functions for loaded modules. 
*/
if (*probed_mod) {
/*
 * We must hold a refcount of the probed module while updating
-- 
2.25.1




Re: [PATCH v2] kprobes: Avoid possible warn in __arm_kprobe_ftrace()

2024-04-09 Thread Zheng Yejian

On 2024/4/9 21:49, Masami Hiramatsu (Google) wrote:

On Tue, 9 Apr 2024 14:20:45 +0800
Zheng Yejian  wrote:


On 2024/4/8 20:41, Masami Hiramatsu (Google) wrote:

Hi Zheng,

On Mon, 8 Apr 2024 16:34:03 +0800
Zheng Yejian  wrote:


There is once warn in __arm_kprobe_ftrace() on:

   ret = ftrace_set_filter_ip(ops, (unsigned long)p->addr, 0, 0);
   if (WARN_ONCE(..., "Failed to arm kprobe-ftrace at %pS (error %d)\n", ...)
 return ret;

This warning is generated because 'p->addr' is detected to be not a valid
ftrace location in ftrace_set_filter_ip(). The ftrace address check is done
by check_ftrace_location() at the beginning of check_kprobe_address_safe().
At that point, ftrace_location(addr) == addr should return true if the
module is loaded. Then the module is searched twice:
1. in is_module_text_address(), we find that 'p->addr' is in a module;
2. in __module_text_address(), we find the module;

If the module has just been unloaded before the second search, then
'*probed_mod' is NULL and we would not go to get the module refcount,
then the return value of check_kprobe_address_safe() would be 0, but
actually we need to return -EINVAL.


OK, so you found a race window in check_kprobe_address_safe().

It does something like below.

check_kprobe_address_safe() {
...

/* Timing [A] */

if (!(core_kernel_text(p->addr) ||
is_module_text_address(p->addr)) ||
...(other reserved address check)) {
return -EINVAL;
}

/* Timing [B] */

*probed_mod = __module_text_address(p->addr):
if (*probe_mod) {
if (!try_module_get(*probed_mod)) {
return -ENOENT;
}
... 
}
}

So, if p->addr is in a module which is alive at the timing [A], but
unloaded at timing [B], 'p->addr' is passed the
'is_module_text_address(p->addr)' check, but *probed_mod becomes NULL.
Thus the corresponding module is not referenced and kprobe_arm(p) will
access a wrong address (use after free).
This happens either kprobe on ftrace is enabled or not.


Yes, This is the problem. And for this case, check_kprobe_address_safe()
still return 0, and then going on to arm kprobe may cause problems. So
we should make check_kprobe_address_safe() return -EINVAL when refcount
of the module is not got.


Yes,





To fix this problem, we should move the mutex_lock(kprobe_mutex) before
check_kprobe_address_safe() because kprobe_module_callback() also lock it
so it can stop module unloading.

Can you ensure this will fix your problem?


It seems not, the warning in __arm_kprobe_ftrace() still occurs. I
contrived following simple test:

  #!/bin/bash
  sysctl -w kernel.panic_on_warn=1
  while [ True ]; do
  insmod mod.ko# contain function 'foo'
  rmmod mod.ko
  done &
  while [ True ]; do
  insmod kprobe.ko  # register kprobe on function 'foo'
  rmmod kprobe.ko
  done &

I think holding kprobe_mutex cannot make sure we get the refcount of the
module.


Aah, yes, it cannot, because the kallsyms in a module will be removed
after module->state becomes MODULE_STATE_UNFORMED. Before UNFORMED,
the state is MODULE_STATE_GOING and the kprobe_module_callback() is
called at that point. Thus, the following scenario happens.

  CPU1  CPU2

mod->state = MODULE_STATE_GOING
kprobe_module_callback() {
mutex_lock(_mutex)
loop on kprobe_table
to disable kprobe in the module.
mutex_unlock(_mutex)
}
register_kprobe(p) {

mutex_lock(_mutex)

check_kprobe_address_safe(p->addr) {
[A'']

is_module_text_address() return true
until 
mod->state == UNFORMED.
mod->state = MODULE_STATE_UNFORMED
[B'']

__module_text_address() returns NULL.
}
p is on the 
kprobe_table.

mutex_unlock(_mutex)

So, as your fix, if we save the module at [A''] and use it at [B''],
the mod is NOT able to get because mod->state != MODULE_STATE_LIVE.





I think your patch is just optimizing but not fixing the fundamental
problem, which is we don't have an atomic search symbol and get module


Sorry, this patch is a little confusing, but it is not just optimizing :)

As shown below, after my patch, if p->addr is in a module which is alive
at 

Re: [PATCH] ring-buffer: Only update pages_touched when a new page is touched

2024-04-09 Thread Steven Rostedt
On Wed, 10 Apr 2024 08:44:00 +0900
Masami Hiramatsu (Google)  wrote:

> Looks good to me.
> 
> Acked-by: Masami Hiramatsu (Google) 

Thanks.

> 
> BTW, isn't this a real bugfix, because the page_touched can be
> bigger than nr_pages without this fix?

Yes, I simply forgot to add the Cc stable.

-- Steve



Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL

2024-04-09 Thread Gregory Price
On Mon, Apr 08, 2024 at 10:41:04PM +0900, Honggyu Kim wrote:
> Hi Gregory,
> 
> On Fri, 5 Apr 2024 12:56:14 -0400 Gregory Price  
> wrote:
> > Do you have test results which enable only DAMOS_MIGRATE_COLD actions
> > but not DAMOS_MIGRATE_HOT actions? (and vice versa)
> > 
> > The question I have is exactly how often is MIGRATE_HOT actually being
> > utilized, and how much data is being moved. Testing MIGRATE_COLD only
> > would at least give a rough approximation of that.
> 
> To explain this, I better share more test results.  In the section of
> "Evaluation Workload", the test sequence can be summarized as follows.
> 
>   *. "Turn on DAMON."
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>  process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>  (85GB consumed: 52GB for anon and 33GB for file cache)

Aha! I see now, you are allocating memory to ensure the real workload
(redis-server) pressures the DRAM tier and causes "spillage" to the CXL
tier, and then measure the overhead in different scenarios.

I would still love to know what the result of a demote-only system would
produce, mosty because it would very clearly demonstrate the value of
the demote+promote system when the system is memory-pressured.

Given the additional results below, it shows a demote-only system would
likely trend toward CXL-only, and so this shows an affirmative support
for the promotion logic.

Just another datum that is useful and paints a more complete picture.

> I didn't want to make the evaluation too long in the cover letter, but
> I have also evaluated another senario, which lazyly enabled DAMON just
> before YCSB run at step 4.  I will call this test as "DAMON lazy".  This
> is missing part from the cover letter.
> 
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>  process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>  (85GB consumed: 52GB for anon and 33GB for file cache)
>   *. "Turn on DAMON."
> 
> In the "DAMON lazy" senario, DAMON started monitoring late so the
> initial redis-server placement is same as "default", but started to
> demote cold data and promote redis data just before YCSB run.
>

This is excellent and definitely demonstrates part of the picture I was
alluding to, thank you for the additional data!

> 
> I have included "DAMON lazy" result and also the migration size by new
> DAMOS migrate actions.  Please note that demotion size is way higher
> than promotion because promotion target is only for redis data, but
> demotion target includes huge cold memory allocated by mmap + memset.
> (there could be some ping-pong issue though.)
> 
> As you mentioned, "DAMON tiered" case gets more benefit because new
> redis allocations go to DRAM more than "default", but it also gets
> benefit from promotion when it is under higher memory pressure as shown
> in 490G and 500G cases.  It promotes 22GB and 17GB of redis data to DRAM
> from CXL.

I think a better way of saying this is that "DAMON tiered" more
effectively mitigates the effect of memory-pressure on faster tier
before spillage occurs, while "DAMON lazy" demonstrates the expected
performance of the system after memory pressure outruns the demotion
logic, so you wind up with hot data stuck in the slow tier.

There are some out there that would simply say "just demote more
aggressively", so this is useful information for the discussion.

+/- ~2% despite greater meomry migration is an excellent result

> > Can you also provide the DRAM-only results for each test?  Presumably,
> > as workload size increases from 440G to 500G, the system probably starts
> > using some amount of swap/zswap/whatever.  It would be good to know how
> > this system compares to swap small amounts of overflow.
> 
> It looks like my explanation doesn't correctly inform you.   The size
> from 440GB to 500GB is for pre allocated cold data to give memory
> pressure on the system so that redis-server cannot be fully allocated at
> fast DRAM, then partially allocated at CXL memory as well.
> 

Yes, sorry for the misunderstanding.  This makes it much clearer.

> 
> I hope my explanation is helpful for you to understand.  Please let me
> know if you have more questions.
>

Excellent work, exciting results! Thank you for the additional answers
:]

~Gregory



Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Google
On Tue, 9 Apr 2024 16:45:47 +0200
Marco Elver  wrote:

> On Tue, 9 Apr 2024 at 16:31, Steven Rostedt  wrote:
> >
> > On Mon,  8 Apr 2024 11:01:54 +0200
> > Marco Elver  wrote:
> >
> > > Add "new_exec" tracepoint, which is run right after the point of no
> > > return but before the current task assumes its new exec identity.
> > >
> > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint
> > > runs before flushing the old exec, i.e. while the task still has the
> > > original state (such as original MM), but when the new exec either
> > > succeeds or crashes (but never returns to the original exec).
> > >
> > > Being able to trace this event can be helpful in a number of use cases:
> > >
> > >   * allowing tracing eBPF programs access to the original MM on exec,
> > > before current->mm is replaced;
> > >   * counting exec in the original task (via perf event);
> > >   * profiling flush time ("new_exec" to "sched_process_exec").
> > >
> > > Example of tracing output ("new_exec" and "sched_process_exec"):
> >
> > How common is this? And can't you just do the same with adding a kprobe?
> 
> Our main use case would be to use this in BPF programs to become
> exec-aware, where using the sched_process_exec hook is too late. This
> is particularly important where the BPF program must stop inspecting
> the user space's VM when the task does exec to become a new process.

Just out of curiousity, would you like to audit that the user-program
is not malformed? (security tracepoint?) I think that is an interesting
idea. What kind of information you need?

> 
> kprobe (or BPF's fentry) is brittle here, because begin_new_exec()'s
> permission check can still return an error which returns to the
> original task without crashing. Only at the point of no return are we
> guaranteed that the exec either succeeds, or the task is terminated on
> failure.

Just a note: That is BPF limitation, kprobe and kprobe events can put
a probe in the function body, but that is not supported on BPF (I guess
because it depends on kernel debuginfo.) You can add kprobe-event using
"perf probe" tool.

Thank you,

> 
> I don't know if "common" is the right question here, because it's a
> chicken-egg problem: no tracepoint, we give up; we have the
> tracepoint, it unlocks a range of new use cases (that require robust
> solution to make BPF programs exec-aware, and a tracepoint is the only
> option IMHO).
> 
> Thanks,
> -- Marco


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] ring-buffer: Only update pages_touched when a new page is touched

2024-04-09 Thread Google
On Tue, 9 Apr 2024 15:13:09 -0400
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> The "buffer_percent" logic that is used by the ring buffer splice code to
> only wake up the tasks when there's no data after the buffer is filled to
> the percentage of the "buffer_percent" file is dependent on three
> variables that determine the amount of data that is in the ring buffer:
> 
>  1) pages_read - incremented whenever a new sub-buffer is consumed
>  2) pages_lost - incremented every time a writer overwrites a sub-buffer
>  3) pages_touched - incremented when a write goes to a new sub-buffer
> 
> The percentage is the calculation of:
> 
>   (pages_touched - (pages_lost + pages_read)) / nr_pages
> 
> Basically, the amount of data is the total number of sub-bufs that have been
> touched, minus the number of sub-bufs lost and sub-bufs consumed. This is
> divided by the total count to give the buffer percentage. When the
> percentage is greater than the value in the "buffer_percent" file, it
> wakes up splice readers waiting for that amount.
> 
> It was observed that over time, the amount read from the splice was
> constantly decreasing the longer the trace was running. That is, if one
> asked for 60%, it would read over 60% when it first starts tracing, but
> then it would be woken up at under 60% and would slowly decrease the
> amount of data read after being woken up, where the amount becomes much
> less than the buffer percent.
> 
> This was due to an accounting of the pages_touched incrementation. This
> value is incremented whenever a writer transfers to a new sub-buffer. But
> the place where it was incremented was incorrect. If a writer overflowed
> the current sub-buffer it would go to the next one. If it gets preempted
> by an interrupt at that time, and the interrupt performs a trace, it too
> will end up going to the next sub-buffer. But only one should increment
> the counter. Unfortunately, that was not the case.
> 
> Change the cmpxchg() that does the real switch of the tail-page into a
> try_cmpxchg(), and on success, perform the increment of pages_touched. This
> will only increment the counter once for when the writer moves to a new
> sub-buffer, and not when there's a race and is incremented for when a
> writer and its preempting writer both move to the same new sub-buffer.
> 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) 

BTW, isn't this a real bugfix, because the page_touched can be
bigger than nr_pages without this fix?

Thank you,

> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/ring_buffer.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 25476ead681b..6511dc3a00da 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1393,7 +1393,6 @@ static void rb_tail_page_update(struct 
> ring_buffer_per_cpu *cpu_buffer,
>   old_write = local_add_return(RB_WRITE_INTCNT, _page->write);
>   old_entries = local_add_return(RB_WRITE_INTCNT, _page->entries);
>  
> - local_inc(_buffer->pages_touched);
>   /*
>* Just make sure we have seen our old_write and synchronize
>* with any interrupts that come in.
> @@ -1430,8 +1429,9 @@ static void rb_tail_page_update(struct 
> ring_buffer_per_cpu *cpu_buffer,
>*/
>   local_set(_page->page->commit, 0);
>  
> - /* Again, either we update tail_page or an interrupt does */
> - (void)cmpxchg(_buffer->tail_page, tail_page, next_page);
> + /* Either we update tail_page or an interrupt does */
> + if (try_cmpxchg(_buffer->tail_page, _page, next_page))
> + local_inc(_buffer->pages_touched);
>   }
>  }
>  
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: Copying TLS/user register data per perf-sample?

2024-04-09 Thread Namhyung Kim
Hello,

On Thu, Apr 4, 2024 at 12:26 PM Beau Belgrave  wrote:
>
> Hello,
>
> I'm looking into the possibility of capturing user data that is pointed
> to by a user register (IE: fs/gs for TLS on x86/64) for each sample via
> perf_events.
>
> I was hoping to find a way to do this similar to PERF_SAMPLE_STACK_USER.
> I think it could even use roughly the same ABI in the perf ring buffer.
> Or it may be possible by some kprobe linked to the perf sample function.
>
> This would allow a profiler to collect TLS (or other values) on x64. In
> the Open Telemetry profiling SIG [1], we are trying to find a fast way
> to grab a tracing association quickly on a per-thread basis. The team
> at Elastic has a bespoke way to do this [2], however, I'd like to see a
> more general way to achieve this. The folks I've been talking with seem
> open to the idea of just having a TLS value for this we could capture
> upon each sample. We could then just state, Open Telemetry SDKs should
> have a TLS value for span correlation. However, we need a way to sample
> the TLS value(s) when a sampling event is generated.
>
> Is this already possible via some other means? It'd be great to be able
> to do this directly at the perf_event sample via the ABI or a probe.

I don't think the current perf ABI allows capturing %fs/%gs + offset.
IIRC kprobes/uprobes don't have that too but I could be wrong.

Thanks,
Namhyung

>
> 1. https://opentelemetry.io/blog/2024/profiling/
> 2. 
> https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation



Re: [External] Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-09 Thread Ho-Ren (Jack) Chuang
On Tue, Apr 9, 2024 at 2:50 PM Andrew Morton  wrote:
>
> On Tue, 9 Apr 2024 12:00:06 -0700 "Ho-Ren (Jack) Chuang" 
>  wrote:
>
> > Hi Jonathan,
> >
> > On Fri, Apr 5, 2024 at 6:56 AM Jonathan Cameron
> >  wrote:
> > >
> > > On Fri,  5 Apr 2024 00:07:05 +
> > > "Ho-Ren (Jack) Chuang"  wrote:
> > >
> > > > Since different memory devices require finding, allocating, and putting
> > > > memory types, these common steps are abstracted in this patch,
> > > > enhancing the scalability and conciseness of the code.
> > > >
> > > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > > Reviewed-by: "Huang, Ying" 
> > > Reviewed-by: Jonathan Cameron 
> > >
> > Thank you for reviewing and for adding your "Reviewed-by"!
> > I was wondering if I need to send a v12 and manually add
> > this to the commit description, or if this is sufficient.
>
> I had added Jonathan's r-b to the mm.git copy of this patch.

Got it~ Thank you Andrew!

-- 
Best regards,
Ho-Ren (Jack) Chuang
莊賀任



Re: [PATCH v3 1/2] ftrace: make extra rcu_is_watching() validation check optional

2024-04-09 Thread Google
On Wed,  3 Apr 2024 15:03:27 -0700
Andrii Nakryiko  wrote:

> Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> control whether ftrace low-level code performs additional
> rcu_is_watching()-based validation logic in an attempt to catch noinstr
> violations.
> 
> This check is expected to never be true and is mostly useful for
> low-level validation of ftrace subsystem invariants. For most users it
> should probably be kept disabled to eliminate unnecessary runtime
> overhead.
> 
> This improves BPF multi-kretprobe (relying on ftrace and rethook
> infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]).
> 
>   [0] 
> https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> 

This looks good to me :)

Acked-by: Masami Hiramatsu (Google) 

Thank you,

> Cc: Steven Rostedt 
> Cc: Masami Hiramatsu 
> Cc: Paul E. McKenney 
> Signed-off-by: Andrii Nakryiko 
> ---
>  include/linux/trace_recursion.h |  2 +-
>  kernel/trace/Kconfig| 13 +
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
> index d48cd92d2364..24ea8ac049b4 100644
> --- a/include/linux/trace_recursion.h
> +++ b/include/linux/trace_recursion.h
> @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
> unsigned long parent_ip);
>  # define do_ftrace_record_recursion(ip, pip) do { } while (0)
>  #endif
>  
> -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
>  # define trace_warn_on_no_rcu(ip)\
>   ({  \
>   bool __ret = !rcu_is_watching();\
> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> index 61c541c36596..7aebd1b8f93e 100644
> --- a/kernel/trace/Kconfig
> +++ b/kernel/trace/Kconfig
> @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
> This file can be reset, but the limit can not change in
> size at runtime.
>  
> +config FTRACE_VALIDATE_RCU_IS_WATCHING
> + bool "Validate RCU is on during ftrace execution"
> + depends on FUNCTION_TRACER
> + depends on ARCH_WANTS_NO_INSTR
> + help
> +   All callbacks that attach to the function tracing have some sort of
> +   protection against recursion. This option is only to verify that
> +   ftrace (and other users of ftrace_test_recursion_trylock()) are not
> +   called outside of RCU, as if they are, it can cause a race. But it
> +   also has a noticeable overhead when enabled.
> +
> +   If unsure, say N
> +
>  config RING_BUFFER_RECORD_RECURSION
>   bool "Record functions that recurse in the ring buffer"
>   depends on FTRACE_RECORD_RECURSION
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v3 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-09 Thread Google
On Wed,  3 Apr 2024 15:03:28 -0700
Andrii Nakryiko  wrote:

> Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
> that RCU is watching when trying to setup rethooko on a function entry.
> 
> This further (in addition to improvements in the previous patch)
> improves BPF multi-kretprobe (which rely on rethook) runtime throughput
> by 2.3%, according to BPF benchmarks ([0]).
> 
>   [0] 
> https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> 

Hi Andrii,

Can you make this part depends on !KPROBE_EVENTS_ON_NOTRACE (with this
option, kretprobes can be used without ftrace, but with original int3) ?
This option should be set N on production system because of safety,
just for testing raw kretprobes.

Thank you,

> Signed-off-by: Andrii Nakryiko 
> ---
>  kernel/trace/rethook.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index fa03094e9e69..15b8aa4048d9 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
>   if (unlikely(!handler))
>   return NULL;
>  
> +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
>   /*
>* This expects the caller will set up a rethook on a function entry.
>* When the function returns, the rethook will eventually be reclaimed
> @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
>*/
>   if (unlikely(!rcu_is_watching()))
>   return NULL;
> +#endif
>  
>   return (struct rethook_node *)objpool_pop(>pool);
>  }
> -- 
> 2.43.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [External] Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-09 Thread Andrew Morton
On Tue, 9 Apr 2024 12:00:06 -0700 "Ho-Ren (Jack) Chuang" 
 wrote:

> Hi Jonathan,
> 
> On Fri, Apr 5, 2024 at 6:56 AM Jonathan Cameron
>  wrote:
> >
> > On Fri,  5 Apr 2024 00:07:05 +
> > "Ho-Ren (Jack) Chuang"  wrote:
> >
> > > Since different memory devices require finding, allocating, and putting
> > > memory types, these common steps are abstracted in this patch,
> > > enhancing the scalability and conciseness of the code.
> > >
> > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > Reviewed-by: "Huang, Ying" 
> > Reviewed-by: Jonathan Cameron 
> >
> Thank you for reviewing and for adding your "Reviewed-by"!
> I was wondering if I need to send a v12 and manually add
> this to the commit description, or if this is sufficient.

I had added Jonathan's r-b to the mm.git copy of this patch.



Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Kees Cook
On Tue, Apr 09, 2024 at 08:25:45PM +0200, Marco Elver wrote:
> On Tue, Apr 09, 2024 at 08:46AM -0700, Kees Cook wrote:
> [...]
> > > + trace_new_exec(current, bprm);
> > > +
> > 
> > All other steps in this function have explicit comments about
> > what/why/etc. Please add some kind of comment describing why the
> > tracepoint is where it is, etc.
> 
> I beefed up the tracepoint documentation, and wrote a little paragraph
> above where it's called to reinforce what we want.
> 
> [...]
> > What about binfmt_misc, and binfmt_script? You may want bprm->interp
> > too?
> 
> Good points. I'll make the below changes for v2:
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index ab778ae1fc06..472b9f7b40e8 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1268,6 +1268,12 @@ int begin_new_exec(struct linux_binprm * bprm)
>   if (retval)
>   return retval;
>  
> + /*
> +  * This tracepoint marks the point before flushing the old exec where
> +  * the current task is still unchanged, but errors are fatal (point of
> +  * no return). The later "sched_process_exec" tracepoint is called after
> +  * the current task has successfully switched to the new exec.
> +  */
>   trace_new_exec(current, bprm);
>  
>   /*
> diff --git a/include/trace/events/task.h b/include/trace/events/task.h
> index 8853dc44783d..623d9af777c1 100644
> --- a/include/trace/events/task.h
> +++ b/include/trace/events/task.h
> @@ -61,8 +61,11 @@ TRACE_EVENT(task_rename,
>   * @task:pointer to the current task
>   * @bprm:pointer to linux_binprm used for new exec
>   *
> - * Called before flushing the old exec, but at the point of no return during
> - * switching to the new exec.
> + * Called before flushing the old exec, where @task is still unchanged, but 
> at
> + * the point of no return during switching to the new exec. At the point it 
> is
> + * called the exec will either succeed, or on failure terminate the task. 
> Also
> + * see the "sched_process_exec" tracepoint, which is called right after @task
> + * has successfully switched to the new exec.
>   */
>  TRACE_EVENT(new_exec,
>  
> @@ -71,19 +74,22 @@ TRACE_EVENT(new_exec,
>   TP_ARGS(task, bprm),
>  
>   TP_STRUCT__entry(
> + __string(   interp, bprm->interp)
>   __string(   filename,   bprm->filename  )
>   __field(pid_t,  pid )
>   __string(   comm,   task->comm  )
>   ),
>  
>   TP_fast_assign(
> + __assign_str(interp, bprm->interp);
>   __assign_str(filename, bprm->filename);
>   __entry->pid = task->pid;
>   __assign_str(comm, task->comm);
>   ),
>  
> - TP_printk("filename=%s pid=%d comm=%s",
> -   __get_str(filename), __entry->pid, __get_str(comm))
> + TP_printk("interp=%s filename=%s pid=%d comm=%s",
> +   __get_str(interp), __get_str(filename),
> +   __entry->pid, __get_str(comm))
>  );
>  
>  #endif

Looks good; I await v2, and Steven's Ack. :)

-- 
Kees Cook



Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

2024-04-09 Thread David Hildenbrand

On 09.04.24 20:31, James Houghton wrote:

Ah, I didn't see this in my inbox, sorry David!


No worries :)



On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand  wrote:


On 02.04.24 01:29, James Houghton wrote:

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f349e08a9dfe..daaa9db625d3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,10 @@ enum mmu_notifier_event {

   #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)

+#define MMU_NOTIFIER_YOUNG   (1 << 0)
+#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)


Especially this one really deserves some documentation :)


Yes, will do. Something like

 MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in
bitmap either (1) does not accurately represent the age of the pages
(in the case of test_young), or (2) was not able to be used to
completely clear the age/access bit (in the case of clear_young).


Make sense. I do wonder what the expected reaction from the caller is :)






+#define MMU_NOTIFIER_YOUNG_FAST  (1 << 2)


And that one as well.


Something like

Indicates that (1) passing a bitmap ({test,clear}_young_bitmap)
would have been supported for this address range.

The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM
is able to harvest the access bit "fast" (so for x86, locklessly, and
for arm64, with the KVM MMU read lock), "fast" enough that using a
bitmap to do look-around is probably a good idea.


Is that really the right way to communicate that ("would have been 
supported") -- wouldn't we want to sense support differently?






Likely best to briefly document all of them, and how they are
supposed to be used (return value for X).


Right. Will do.




+
   struct mmu_notifier_ops {
   /*
* Called either by mmu_notifier_unregister or when the mm is
@@ -106,21 +110,36 @@ struct mmu_notifier_ops {
* clear_young is a lightweight version of clear_flush_young. Like the
* latter, it is supposed to test-and-clear the young/accessed bitflag
* in the secondary pte, but it may omit flushing the secondary tlb.
+  *
+  * If @bitmap is given but is not supported, return
+  * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+  *
+  * If the walk is done "quickly" and there were young PTEs,
+  * MMU_NOTIFIER_YOUNG_FAST is returned.
*/
   int (*clear_young)(struct mmu_notifier *subscription,
  struct mm_struct *mm,
  unsigned long start,
-unsigned long end);
+unsigned long end,
+unsigned long *bitmap);

   /*
* test_young is called to check the young/accessed bitflag in
* the secondary pte. This is used to know if the page is
* frequently used without actually clearing the flag or tearing
* down the secondary mapping on the page.
+  *
+  * If @bitmap is given but is not supported, return
+  * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
+  *
+  * If the walk is done "quickly" and there were young PTEs,
+  * MMU_NOTIFIER_YOUNG_FAST is returned.
*/
   int (*test_young)(struct mmu_notifier *subscription,
 struct mm_struct *mm,
-   unsigned long address);
+   unsigned long start,
+   unsigned long end,
+   unsigned long *bitmap);


What does "quickly" mean (why not use "fast")? What are the semantics, I
don't find any existing usage of that in this file.


"fast" means fast enough such that using a bitmap to scan adjacent
pages (e.g. with MGLRU) is likely to be beneficial. I'll write more in
this comment. Perhaps I should just rename it to
MMU_NOTIFIER_YOUNG_BITMAP_SUPPORTED and drop the whole "likely to be
beneficial" thing -- that's for MGLRU/etc. to decide really.


Yes!





Further, what is MMU_NOTIFIER_YOUNG you introduce used for?


MMU_NOTIFIER_YOUNG is the return value when the page was young, but we
(1) didn't use a bitmap, and (2) the "fast" access bit harvesting
wasn't possible. In this case we simply return 1, which is
MMU_NOTIFIER_YOUNG. I'll make kvm_mmu_notifier_test_clear_young()
properly return MMU_NOTIFIER_YOUNG instead of relying on the fact that
it will be 1.


Yes, that will clarify it!

--
Cheers,

David / dhildenb




[PATCH] ring-buffer: Only update pages_touched when a new page is touched

2024-04-09 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The "buffer_percent" logic that is used by the ring buffer splice code to
only wake up the tasks when there's no data after the buffer is filled to
the percentage of the "buffer_percent" file is dependent on three
variables that determine the amount of data that is in the ring buffer:

 1) pages_read - incremented whenever a new sub-buffer is consumed
 2) pages_lost - incremented every time a writer overwrites a sub-buffer
 3) pages_touched - incremented when a write goes to a new sub-buffer

The percentage is the calculation of:

  (pages_touched - (pages_lost + pages_read)) / nr_pages

Basically, the amount of data is the total number of sub-bufs that have been
touched, minus the number of sub-bufs lost and sub-bufs consumed. This is
divided by the total count to give the buffer percentage. When the
percentage is greater than the value in the "buffer_percent" file, it
wakes up splice readers waiting for that amount.

It was observed that over time, the amount read from the splice was
constantly decreasing the longer the trace was running. That is, if one
asked for 60%, it would read over 60% when it first starts tracing, but
then it would be woken up at under 60% and would slowly decrease the
amount of data read after being woken up, where the amount becomes much
less than the buffer percent.

This was due to an accounting of the pages_touched incrementation. This
value is incremented whenever a writer transfers to a new sub-buffer. But
the place where it was incremented was incorrect. If a writer overflowed
the current sub-buffer it would go to the next one. If it gets preempted
by an interrupt at that time, and the interrupt performs a trace, it too
will end up going to the next sub-buffer. But only one should increment
the counter. Unfortunately, that was not the case.

Change the cmpxchg() that does the real switch of the tail-page into a
try_cmpxchg(), and on success, perform the increment of pages_touched. This
will only increment the counter once for when the writer moves to a new
sub-buffer, and not when there's a race and is incremented for when a
writer and its preempting writer both move to the same new sub-buffer.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 25476ead681b..6511dc3a00da 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1393,7 +1393,6 @@ static void rb_tail_page_update(struct 
ring_buffer_per_cpu *cpu_buffer,
old_write = local_add_return(RB_WRITE_INTCNT, _page->write);
old_entries = local_add_return(RB_WRITE_INTCNT, _page->entries);
 
-   local_inc(_buffer->pages_touched);
/*
 * Just make sure we have seen our old_write and synchronize
 * with any interrupts that come in.
@@ -1430,8 +1429,9 @@ static void rb_tail_page_update(struct 
ring_buffer_per_cpu *cpu_buffer,
 */
local_set(_page->page->commit, 0);
 
-   /* Again, either we update tail_page or an interrupt does */
-   (void)cmpxchg(_buffer->tail_page, tail_page, next_page);
+   /* Either we update tail_page or an interrupt does */
+   if (try_cmpxchg(_buffer->tail_page, _page, next_page))
+   local_inc(_buffer->pages_touched);
}
 }
 
-- 
2.43.0




Re: [External] Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-09 Thread Ho-Ren (Jack) Chuang
Hi Jonathan,

On Tue, Apr 9, 2024 at 9:12 AM Jonathan Cameron
 wrote:
>
> On Fri, 5 Apr 2024 15:43:47 -0700
> "Ho-Ren (Jack) Chuang"  wrote:
>
> > On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
> >  wrote:
> > >
> > > On Fri,  5 Apr 2024 00:07:06 +
> > > "Ho-Ren (Jack) Chuang"  wrote:
> > >
> > > > The current implementation treats emulated memory devices, such as
> > > > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal 
> > > > memory
> > > > (E820_TYPE_RAM). However, these emulated devices have different
> > > > characteristics than traditional DRAM, making it important to
> > > > distinguish them. Thus, we modify the tiered memory initialization 
> > > > process
> > > > to introduce a delay specifically for CPUless NUMA nodes. This delay
> > > > ensures that the memory tier initialization for these nodes is deferred
> > > > until HMAT information is obtained during the boot process. Finally,
> > > > demotion tables are recalculated at the end.
> > > >
> > > > * late_initcall(memory_tier_late_init);
> > > > Some device drivers may have initialized memory tiers between
> > > > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> > > > online memory nodes and configuring memory tiers. They should be 
> > > > excluded
> > > > in the late init.
> > > >
> > > > * Handle cases where there is no HMAT when creating memory tiers
> > > > There is a scenario where a CPUless node does not provide HMAT 
> > > > information.
> > > > If no HMAT is specified, it falls back to using the default DRAM tier.
> > > >
> > > > * Introduce another new lock `default_dram_perf_lock` for adist 
> > > > calculation
> > > > In the current implementation, iterating through CPUlist nodes requires
> > > > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end 
> > > > up
> > > > trying to acquire the same lock, leading to a potential deadlock.
> > > > Therefore, we propose introducing a standalone `default_dram_perf_lock` 
> > > > to
> > > > protect `default_dram_perf_*`. This approach not only avoids deadlock
> > > > but also prevents holding a large lock simultaneously.
> > > >
> > > > * Upgrade `set_node_memory_tier` to support additional cases, including
> > > >   default DRAM, late CPUless, and hot-plugged initializations.
> > > > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> > > > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> > > > handle cases where memtype is not initialized and where HMAT 
> > > > information is
> > > > available.
> > > >
> > > > * Introduce `default_memory_types` for those memory types that are not
> > > >   initialized by device drivers.
> > > > Because late initialized memory and default DRAM memory need to be 
> > > > managed,
> > > > a default memory type is created for storing all memory types that are
> > > > not initialized by device drivers and as a fallback.
> > > >
> > > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > > Signed-off-by: Hao Xiang 
> > > > Reviewed-by: "Huang, Ying" 
> > >
> > > Hi - one remaining question. Why can't we delay init for all nodes
> > > to either drivers or your fallback late_initcall code.
> > > It would be nice to reduce possible code paths.
> >
> > I try not to change too much of the existing code structure in
> > this patchset.
> >
> > To me, postponing/moving all memory tier registrations to
> > late_initcall() is another possible action item for the next patchset.
> >
> > After tier_mem(), hmat_init() is called, which requires registering
> > `default_dram_type` info. This is when `default_dram_type` is needed.
> > However, it is indeed possible to postpone the latter part,
> > set_node_memory_tier(), to `late_init(). So, memory_tier_init() can
> > indeed be split into two parts, and the latter part can be moved to
> > late_initcall() to be processed together.
> >
> > Doing this all memory-type drivers have to call late_initcall() to
> > register a memory tier. I’m not sure how many they are?
> >
> > What do you guys think?
>
> Gut feeling - if you are going to move it for some cases, move it for
> all of them.  Then we only have to test once ;)
>
> J

Thank you for your reminder! I agree~ That's why I'm considering
changing them in the next patchset because of the amount of changes.
And also, this patchset already contains too many things.

> >
> > >
> > > Jonathan
> > >
> > >
> > > > ---
> > > >  mm/memory-tiers.c | 94 +++
> > > >  1 file changed, 70 insertions(+), 24 deletions(-)
> > > >
> > > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > > > index 516b144fd45a..6632102bd5c9 100644
> > > > --- a/mm/memory-tiers.c
> > > > +++ b/mm/memory-tiers.c
> > >
> > >
> > >
> > > > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void)
> > > >* For now we can have 4 faster memory tiers with smaller 
> > > > adistance
> > > >* than default DRAM tier.
> > > >*/
> > > > - 

Re: [External] Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-04-09 Thread Ho-Ren (Jack) Chuang
Hi Jonathan,

On Fri, Apr 5, 2024 at 6:56 AM Jonathan Cameron
 wrote:
>
> On Fri,  5 Apr 2024 00:07:05 +
> "Ho-Ren (Jack) Chuang"  wrote:
>
> > Since different memory devices require finding, allocating, and putting
> > memory types, these common steps are abstracted in this patch,
> > enhancing the scalability and conciseness of the code.
> >
> > Signed-off-by: Ho-Ren (Jack) Chuang 
> > Reviewed-by: "Huang, Ying" 
> Reviewed-by: Jonathan Cameron 
>
Thank you for reviewing and for adding your "Reviewed-by"!
I was wondering if I need to send a v12 and manually add
this to the commit description, or if this is sufficient.

-- 
Best regards,
Ho-Ren (Jack) Chuang
莊賀任



Re: [PATCH v2 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name

2024-04-09 Thread Dmitry Baryshkov
On Tue, 9 Apr 2024 at 21:37, Luca Weiss  wrote:
>
> Follow the gpio-hog bindings and use otg-hog as node name.
>
> Signed-off-by: Luca Weiss 
> ---
>  arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>

Reviewed-by: Dmitry Baryshkov 

-- 
With best wishes
Dmitry



[PATCH v2 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name

2024-04-09 Thread Luca Weiss
Follow the gpio-hog bindings and use otg-hog as node name.

Signed-off-by: Luca Weiss 
---
 arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts 
b/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts
index 4aaae8537a3f..06549051be50 100644
--- a/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts
+++ b/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts
@@ -328,7 +328,7 @@ wlan_regulator_pin: wl-reg-active-state {
power-source = ;
};
 
-   otg {
+   otg-hog {
gpio-hog;
gpios = <35 GPIO_ACTIVE_HIGH>;
output-high;

-- 
2.44.0




[PATCH v2 1/2] dt-bindings: pinctrl: qcom,pmic-gpio: Allow gpio-hog nodes

2024-04-09 Thread Luca Weiss
Allow specifying a GPIO hog, as already used on
qcom-msm8974-lge-nexus5-hammerhead.dts.

Signed-off-by: Luca Weiss 
---
 .../devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml  | 12 
 1 file changed, 12 insertions(+)

diff --git a/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml 
b/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml
index a786357ed1af..bd9471de0c69 100644
--- a/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml
+++ b/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml
@@ -424,6 +424,10 @@ patternProperties:
 $ref: "#/$defs/qcom-pmic-gpio-state"
 additionalProperties: false
 
+  "-hog(-[0-9]+)?$":
+required:
+  - gpio-hog
+
 $defs:
   qcom-pmic-gpio-state:
 type: object
@@ -571,6 +575,7 @@ $defs:
 
 examples:
   - |
+#include 
 #include 
 
 pm8921_gpio: gpio@150 {
@@ -594,5 +599,12 @@ examples:
   power-source = ;
 };
   };
+
+  otg-hog {
+gpio-hog;
+gpios = <35 GPIO_ACTIVE_HIGH>;
+output-high;
+line-name = "otg-gpio";
+  };
 };
 ...

-- 
2.44.0




[PATCH v2 0/2] Allow gpio-hog nodes in qcom,pmic-gpio bindings (& dt fixup)

2024-04-09 Thread Luca Weiss
Resolve the dt validation failure on Nexus 5.

Signed-off-by: Luca Weiss 
---
Changes in v2:
- Use simpler regex from tlmm bindings (Krzysztof)
- Link to v1: 
https://lore.kernel.org/r/20240408-qcom-pmic-gpio-hog-v1-0-f61fc5323...@z3ntu.xyz

---
Luca Weiss (2):
  dt-bindings: pinctrl: qcom,pmic-gpio: Allow gpio-hog nodes
  ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name

 .../devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml  | 12 
 .../arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts |  2 +-
 2 files changed, 13 insertions(+), 1 deletion(-)
---
base-commit: 8568bb2ccc278f344e6ac44af6ed010a90aa88dc
change-id: 20240408-qcom-pmic-gpio-hog-2b4c5f103126

Best regards,
-- 
Luca Weiss 




Re: [PATCH 0/3] Fix up qcom,halt-regs definition in various schemas

2024-04-09 Thread Luca Weiss
On Dienstag, 9. April 2024 17:10:41 CEST Rob Herring wrote:
> On Sun, Apr 07, 2024 at 11:58:29AM +0200, Luca Weiss wrote:
> > The original motivation is that a bunch of other schemas fail to
> > validate qcom,halt-regs, for example like in the following examples:
> > 
> > arch/arm64/boot/dts/qcom/apq8016-sbc.dtb: remoteproc@408: 
> > qcom,halt-regs:0: [20] is too short
> > from schema $id: 
> > http://devicetree.org/schemas/remoteproc/qcom,msm8916-mss-pil.yaml#
> > arch/arm64/boot/dts/qcom/apq8096-ifc6640.dtb: remoteproc@208: 
> > qcom,halt-regs:0: [82] is too short
> > from schema $id: 
> > http://devicetree.org/schemas/remoteproc/qcom,msm8996-mss-pil.yaml#
> > arch/arm64/boot/dts/qcom/apq8039-t2.dtb: remoteproc@408: 
> > qcom,halt-regs:0: [32] is too short
> > from schema $id: 
> > http://devicetree.org/schemas/remoteproc/qcom,msm8916-mss-pil.yaml#
> > 
> > While I'm actually not quite sure why these patches fix this in
> > the other schemas - feels like a bug/limitation in dt-schema maybe? -
> 
> Was this with v2024.02? It should be a bit better there. Though it 
> may just have different errors. The limitation is that property 
> types and in the case of matrix's (which phandle-array actually is) 
> range for dimensions are global. So if there's not correct dimensions 
> for a property, the tools aren't going to decode it properly.

You're right, I doesn't look like I can reproduce this with the latest
dtschema installed.

Anyways these patches should be good to actually validate qcom,halt-regs for
the schemas I'm touching here.

Regards
Luca

> 
> Rob
> 







Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young

2024-04-09 Thread James Houghton
Ah, I didn't see this in my inbox, sorry David!

On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand  wrote:
>
> On 02.04.24 01:29, James Houghton wrote:
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index f349e08a9dfe..daaa9db625d3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -61,6 +61,10 @@ enum mmu_notifier_event {
> >
> >   #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> >
> > +#define MMU_NOTIFIER_YOUNG   (1 << 0)
> > +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1)
>
> Especially this one really deserves some documentation :)

Yes, will do. Something like

MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in
bitmap either (1) does not accurately represent the age of the pages
(in the case of test_young), or (2) was not able to be used to
completely clear the age/access bit (in the case of clear_young).

>
> > +#define MMU_NOTIFIER_YOUNG_FAST  (1 << 2)
>
> And that one as well.

Something like

   Indicates that (1) passing a bitmap ({test,clear}_young_bitmap)
would have been supported for this address range.

The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM
is able to harvest the access bit "fast" (so for x86, locklessly, and
for arm64, with the KVM MMU read lock), "fast" enough that using a
bitmap to do look-around is probably a good idea.

>
> Likely best to briefly document all of them, and how they are
> supposed to be used (return value for X).

Right. Will do.

>
> > +
> >   struct mmu_notifier_ops {
> >   /*
> >* Called either by mmu_notifier_unregister or when the mm is
> > @@ -106,21 +110,36 @@ struct mmu_notifier_ops {
> >* clear_young is a lightweight version of clear_flush_young. Like the
> >* latter, it is supposed to test-and-clear the young/accessed bitflag
> >* in the secondary pte, but it may omit flushing the secondary tlb.
> > +  *
> > +  * If @bitmap is given but is not supported, return
> > +  * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> > +  *
> > +  * If the walk is done "quickly" and there were young PTEs,
> > +  * MMU_NOTIFIER_YOUNG_FAST is returned.
> >*/
> >   int (*clear_young)(struct mmu_notifier *subscription,
> >  struct mm_struct *mm,
> >  unsigned long start,
> > -unsigned long end);
> > +unsigned long end,
> > +unsigned long *bitmap);
> >
> >   /*
> >* test_young is called to check the young/accessed bitflag in
> >* the secondary pte. This is used to know if the page is
> >* frequently used without actually clearing the flag or tearing
> >* down the secondary mapping on the page.
> > +  *
> > +  * If @bitmap is given but is not supported, return
> > +  * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE.
> > +  *
> > +  * If the walk is done "quickly" and there were young PTEs,
> > +  * MMU_NOTIFIER_YOUNG_FAST is returned.
> >*/
> >   int (*test_young)(struct mmu_notifier *subscription,
> > struct mm_struct *mm,
> > -   unsigned long address);
> > +   unsigned long start,
> > +   unsigned long end,
> > +   unsigned long *bitmap);
>
> What does "quickly" mean (why not use "fast")? What are the semantics, I
> don't find any existing usage of that in this file.

"fast" means fast enough such that using a bitmap to scan adjacent
pages (e.g. with MGLRU) is likely to be beneficial. I'll write more in
this comment. Perhaps I should just rename it to
MMU_NOTIFIER_YOUNG_BITMAP_SUPPORTED and drop the whole "likely to be
beneficial" thing -- that's for MGLRU/etc. to decide really.

>
> Further, what is MMU_NOTIFIER_YOUNG you introduce used for?

MMU_NOTIFIER_YOUNG is the return value when the page was young, but we
(1) didn't use a bitmap, and (2) the "fast" access bit harvesting
wasn't possible. In this case we simply return 1, which is
MMU_NOTIFIER_YOUNG. I'll make kvm_mmu_notifier_test_clear_young()
properly return MMU_NOTIFIER_YOUNG instead of relying on the fact that
it will be 1.

Thanks David!



Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Marco Elver
On Tue, Apr 09, 2024 at 08:46AM -0700, Kees Cook wrote:
[...]
> > +   trace_new_exec(current, bprm);
> > +
> 
> All other steps in this function have explicit comments about
> what/why/etc. Please add some kind of comment describing why the
> tracepoint is where it is, etc.

I beefed up the tracepoint documentation, and wrote a little paragraph
above where it's called to reinforce what we want.

[...]
> What about binfmt_misc, and binfmt_script? You may want bprm->interp
> too?

Good points. I'll make the below changes for v2:

diff --git a/fs/exec.c b/fs/exec.c
index ab778ae1fc06..472b9f7b40e8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1268,6 +1268,12 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
return retval;
 
+   /*
+* This tracepoint marks the point before flushing the old exec where
+* the current task is still unchanged, but errors are fatal (point of
+* no return). The later "sched_process_exec" tracepoint is called after
+* the current task has successfully switched to the new exec.
+*/
trace_new_exec(current, bprm);
 
/*
diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index 8853dc44783d..623d9af777c1 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -61,8 +61,11 @@ TRACE_EVENT(task_rename,
  * @task:  pointer to the current task
  * @bprm:  pointer to linux_binprm used for new exec
  *
- * Called before flushing the old exec, but at the point of no return during
- * switching to the new exec.
+ * Called before flushing the old exec, where @task is still unchanged, but at
+ * the point of no return during switching to the new exec. At the point it is
+ * called the exec will either succeed, or on failure terminate the task. Also
+ * see the "sched_process_exec" tracepoint, which is called right after @task
+ * has successfully switched to the new exec.
  */
 TRACE_EVENT(new_exec,
 
@@ -71,19 +74,22 @@ TRACE_EVENT(new_exec,
TP_ARGS(task, bprm),
 
TP_STRUCT__entry(
+   __string(   interp, bprm->interp)
__string(   filename,   bprm->filename  )
__field(pid_t,  pid )
__string(   comm,   task->comm  )
),
 
TP_fast_assign(
+   __assign_str(interp, bprm->interp);
__assign_str(filename, bprm->filename);
__entry->pid = task->pid;
__assign_str(comm, task->comm);
),
 
-   TP_printk("filename=%s pid=%d comm=%s",
- __get_str(filename), __entry->pid, __get_str(comm))
+   TP_printk("interp=%s filename=%s pid=%d comm=%s",
+ __get_str(interp), __get_str(filename),
+ __entry->pid, __get_str(comm))
 );
 
 #endif



Re: [PATCH net-next v3 6/6] rstreason: make it work in trace world

2024-04-09 Thread Steven Rostedt
On Tue,  9 Apr 2024 18:09:34 +0800
Jason Xing  wrote:

>  /*
>   * tcp event with arguments sk and skb
> @@ -74,20 +75,38 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
>   TP_ARGS(sk, skb)
>  );
>  
> +#undef FN1
> +#define FN1(reason)  TRACE_DEFINE_ENUM(SK_RST_REASON_##reason);
> +#undef FN2
> +#define FN2(reason)  TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason);
> +DEFINE_RST_REASON(FN1, FN1)

Interesting. I've never seen the passing of the internal macros to the main
macro before. I see that you are using it for handling both the
SK_RST_REASON and the SK_DROP_REASON.

> +
> +#undef FN1
> +#undef FNe1
> +#define FN1(reason)  { SK_RST_REASON_##reason, #reason },
> +#define FNe1(reason) { SK_RST_REASON_##reason, #reason }
> +
> +#undef FN2
> +#undef FNe2
> +#define FN2(reason)  { SKB_DROP_REASON_##reason, #reason },
> +#define FNe2(reason) { SKB_DROP_REASON_##reason, #reason }

Anyway, from a tracing point of view, as it looks like it would work
(I haven't tested it).

Reviewed-by: Steven Rostedt (Google) 

-- Steve



Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

2024-04-09 Thread SeongJae Park
Hi Honggyu,

On Tue,  9 Apr 2024 18:54:14 +0900 Honggyu Kim  wrote:
> On Mon,  8 Apr 2024 10:52:28 -0700 SeongJae Park  wrote:
> > On Mon,  8 Apr 2024 21:06:44 +0900 Honggyu Kim  wrote:
> > > On Fri,  5 Apr 2024 12:24:30 -0700 SeongJae Park  wrote:
> > > > On Fri,  5 Apr 2024 15:08:54 +0900 Honggyu Kim  
> > > > wrote:
[...]
> > > I can remove it, but I would like to have more discussion about this
> > > issue.  The current implementation allows only a single migration
> > > target with "target_nid", but users might want to provide fall back
> > > migration target nids.
> > > 
> > > For example, if more than two CXL nodes exist in the system, users might
> > > want to migrate cold pages to any CXL nodes.  In such cases, we might
> > > have to make "target_nid" accept comma separated node IDs.  nodemask can
> > > be better but we should provide a way to change the scanning order.
> > > 
> > > I would like to hear how you think about this.
> > 
> > Good point.  I think we could later extend the sysfs file to receive the
> > comma-separated numbers, or even mask.  For simplicity, adding sysfs files
> > dedicated for the different format of inputs could also be an option (e.g.,
> > target_nids_list, target_nids_mask).  But starting from this single node as 
> > is
> > now looks ok to me.
> 
> If you think we can start from a single node, then I will keep it as is.
> But are you okay if I change the same 'target_nid' to accept
> comma-separated numbers later?  Or do you want to introduce another knob
> such as 'target_nids_list'?  What about rename 'target_nid' to
> 'target_nids' at the first place?

I have no strong concern or opinion about this at the moment.  Please feel free
to renaming it to 'taget_nids' if you think that's better.

[...]
> Please note that I will be out of office this week so won't be able to
> answer quickly.

No problem, I hope you to take and enjoy your time :)


Thanks,
SJ

[...]



Re: [PATCH v2 fs/proc/bootconfig 0/2] remove redundant comments from /proc/bootconfig

2024-04-09 Thread Paul E. McKenney
On Tue, Apr 09, 2024 at 11:32:48PM +0900, Masami Hiramatsu wrote:
> Hi Paul,
> 
> Thanks, both looks good to me.
> 
> Acked-by: Masami Hiramatsu (Google) 
> 
> Let me update bootconfig/fixes.

Thank you very much!  As soon as I see them in -next from you, I will
drop them from -rcu.

Thanx, Paul

> On Mon, 8 Apr 2024 21:42:49 -0700
> "Paul E. McKenney"  wrote:
> 
> > Hello!
> > 
> > This series removes redundant comments from /proc/bootconfig:
> > 
> > 1.  fs/proc: remove redundant comments from /proc/bootconfig,
> > courtesy of Zhenhua Huang.
> > 
> > 2.  fs/proc: Skip bootloader comment if no embedded kernel parameters,
> > courtesy of Masami Hiramatsu.
> > 
> > Thanx, Paul
> > 
> > 
> > 
> >  b/fs/proc/bootconfig.c   |   12 ++--
> >  b/include/linux/bootconfig.h |1 +
> >  b/init/main.c|5 +
> >  fs/proc/bootconfig.c |2 +-
> >  4 files changed, 13 insertions(+), 7 deletions(-)
> > 
> 
> 
> -- 
> Masami Hiramatsu (Google) 



Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-09 Thread Jonathan Cameron
On Fri, 5 Apr 2024 15:43:47 -0700
"Ho-Ren (Jack) Chuang"  wrote:

> On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
>  wrote:
> >
> > On Fri,  5 Apr 2024 00:07:06 +
> > "Ho-Ren (Jack) Chuang"  wrote:
> >  
> > > The current implementation treats emulated memory devices, such as
> > > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal 
> > > memory
> > > (E820_TYPE_RAM). However, these emulated devices have different
> > > characteristics than traditional DRAM, making it important to
> > > distinguish them. Thus, we modify the tiered memory initialization process
> > > to introduce a delay specifically for CPUless NUMA nodes. This delay
> > > ensures that the memory tier initialization for these nodes is deferred
> > > until HMAT information is obtained during the boot process. Finally,
> > > demotion tables are recalculated at the end.
> > >
> > > * late_initcall(memory_tier_late_init);
> > > Some device drivers may have initialized memory tiers between
> > > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> > > online memory nodes and configuring memory tiers. They should be excluded
> > > in the late init.
> > >
> > > * Handle cases where there is no HMAT when creating memory tiers
> > > There is a scenario where a CPUless node does not provide HMAT 
> > > information.
> > > If no HMAT is specified, it falls back to using the default DRAM tier.
> > >
> > > * Introduce another new lock `default_dram_perf_lock` for adist 
> > > calculation
> > > In the current implementation, iterating through CPUlist nodes requires
> > > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> > > trying to acquire the same lock, leading to a potential deadlock.
> > > Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> > > protect `default_dram_perf_*`. This approach not only avoids deadlock
> > > but also prevents holding a large lock simultaneously.
> > >
> > > * Upgrade `set_node_memory_tier` to support additional cases, including
> > >   default DRAM, late CPUless, and hot-plugged initializations.
> > > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> > > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> > > handle cases where memtype is not initialized and where HMAT information 
> > > is
> > > available.
> > >
> > > * Introduce `default_memory_types` for those memory types that are not
> > >   initialized by device drivers.
> > > Because late initialized memory and default DRAM memory need to be 
> > > managed,
> > > a default memory type is created for storing all memory types that are
> > > not initialized by device drivers and as a fallback.
> > >
> > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > Signed-off-by: Hao Xiang 
> > > Reviewed-by: "Huang, Ying"   
> >
> > Hi - one remaining question. Why can't we delay init for all nodes
> > to either drivers or your fallback late_initcall code.
> > It would be nice to reduce possible code paths.  
> 
> I try not to change too much of the existing code structure in
> this patchset.
> 
> To me, postponing/moving all memory tier registrations to
> late_initcall() is another possible action item for the next patchset.
> 
> After tier_mem(), hmat_init() is called, which requires registering
> `default_dram_type` info. This is when `default_dram_type` is needed.
> However, it is indeed possible to postpone the latter part,
> set_node_memory_tier(), to `late_init(). So, memory_tier_init() can
> indeed be split into two parts, and the latter part can be moved to
> late_initcall() to be processed together.
> 
> Doing this all memory-type drivers have to call late_initcall() to
> register a memory tier. I’m not sure how many they are?
> 
> What do you guys think?

Gut feeling - if you are going to move it for some cases, move it for
all of them.  Then we only have to test once ;)

J
> 
> >
> > Jonathan
> >
> >  
> > > ---
> > >  mm/memory-tiers.c | 94 +++
> > >  1 file changed, 70 insertions(+), 24 deletions(-)
> > >
> > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > > index 516b144fd45a..6632102bd5c9 100644
> > > --- a/mm/memory-tiers.c
> > > +++ b/mm/memory-tiers.c  
> >
> >
> >  
> > > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void)
> > >* For now we can have 4 faster memory tiers with smaller adistance
> > >* than default DRAM tier.
> > >*/
> > > - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
> > > + default_dram_type = 
> > > mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM,
> > > +   
> > > _memory_types);
> > >   if (IS_ERR(default_dram_type))
> > >   panic("%s() failed to allocate default DRAM tier\n", 
> > > __func__);
> > >
> > > @@ -865,6 +903,14 @@ static int __init memory_tier_init(void)
> > >* types assigned.
> > >*/
> > >   

Re: [PATCH net-next v3 6/6] rstreason: make it work in trace world

2024-04-09 Thread Jason Xing
Hi Steven,

On Tue, Apr 9, 2024 at 11:36 PM Steven Rostedt  wrote:
>
> On Tue,  9 Apr 2024 18:09:34 +0800
> Jason Xing  wrote:
>
> >  /*
> >   * tcp event with arguments sk and skb
> > @@ -74,20 +75,38 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
> >   TP_ARGS(sk, skb)
> >  );
> >
> > +#undef FN1
> > +#define FN1(reason)  TRACE_DEFINE_ENUM(SK_RST_REASON_##reason);
> > +#undef FN2
> > +#define FN2(reason)  TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason);
> > +DEFINE_RST_REASON(FN1, FN1)
>
> Interesting. I've never seen the passing of the internal macros to the main
> macro before. I see that you are using it for handling both the
> SK_RST_REASON and the SK_DROP_REASON.

Yes, I want to cover two kinds of reasons and then strip them of
prefixes which can be reported to userspace.

>
> > +
> > +#undef FN1
> > +#undef FNe1
> > +#define FN1(reason)  { SK_RST_REASON_##reason, #reason },
> > +#define FNe1(reason) { SK_RST_REASON_##reason, #reason }
> > +
> > +#undef FN2
> > +#undef FNe2
> > +#define FN2(reason)  { SKB_DROP_REASON_##reason, #reason },
> > +#define FNe2(reason) { SKB_DROP_REASON_##reason, #reason }
>
> Anyway, from a tracing point of view, as it looks like it would work
> (I haven't tested it).

Sure, it works. One simple test if you're interested:
1) Apply this patchset locally
2) add 'trace_tcp_send_reset(sk, skb, [one reason])' in the receive
path, say, somewhere in the tcp_v4_rcv()

The possible result can be seen in the cover letter. I list here:
-0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
-0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NO_SOCKET

>
> Reviewed-by: Steven Rostedt (Google) 

Thanks!

>
> -- Steve



Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Kees Cook
On Mon, Apr 08, 2024 at 11:01:54AM +0200, Marco Elver wrote:
> Add "new_exec" tracepoint, which is run right after the point of no
> return but before the current task assumes its new exec identity.
> 
> Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint
> runs before flushing the old exec, i.e. while the task still has the
> original state (such as original MM), but when the new exec either
> succeeds or crashes (but never returns to the original exec).
> 
> Being able to trace this event can be helpful in a number of use cases:
> 
>   * allowing tracing eBPF programs access to the original MM on exec,
> before current->mm is replaced;
>   * counting exec in the original task (via perf event);
>   * profiling flush time ("new_exec" to "sched_process_exec").
> 
> Example of tracing output ("new_exec" and "sched_process_exec"):
> 
>   $ cat /sys/kernel/debug/tracing/trace_pipe
>   <...>-379 [003] .   179.626921: new_exec: 
> filename=/usr/bin/sshd pid=379 comm=sshd
>   <...>-379 [003] .   179.629131: sched_process_exec: 
> filename=/usr/bin/sshd pid=379 old_pid=379
>   <...>-381 [002] .   180.048580: new_exec: filename=/bin/bash 
> pid=381 comm=sshd
>   <...>-381 [002] .   180.053122: sched_process_exec: 
> filename=/bin/bash pid=381 old_pid=381
>   <...>-385 [001] .   180.068277: new_exec: filename=/usr/bin/tty 
> pid=385 comm=bash
>   <...>-385 [001] .   180.069485: sched_process_exec: 
> filename=/usr/bin/tty pid=385 old_pid=385
>   <...>-389 [006] .   192.020147: new_exec: 
> filename=/usr/bin/dmesg pid=389 comm=bash
>bash-389 [006] .   192.021377: sched_process_exec: 
> filename=/usr/bin/dmesg pid=389 old_pid=389
> 
> Signed-off-by: Marco Elver 
> ---
>  fs/exec.c   |  2 ++
>  include/trace/events/task.h | 30 ++
>  2 files changed, 32 insertions(+)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 38bf71cbdf5e..ab778ae1fc06 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1268,6 +1268,8 @@ int begin_new_exec(struct linux_binprm * bprm)
>   if (retval)
>   return retval;
>  
> + trace_new_exec(current, bprm);
> +

All other steps in this function have explicit comments about
what/why/etc. Please add some kind of comment describing why the
tracepoint is where it is, etc.

For example, maybe something like:

/*
 * Before any changes to 'current', report that the exec is about to
 * happen (since we made it to the point of no return). On a successful
 * exec, the 'sched_process_exec' tracepoint will also fire. On failure,
 * ... [something else]
 */

> +TRACE_EVENT(new_exec,
> +
> + TP_PROTO(struct task_struct *task, struct linux_binprm *bprm),
> +
> + TP_ARGS(task, bprm),
> +
> + TP_STRUCT__entry(
> + __string(   filename,   bprm->filename  )
> + __field(pid_t,  pid )
> + __string(   comm,   task->comm  )
> + ),
> +
> + TP_fast_assign(
> + __assign_str(filename, bprm->filename);

What about binfmt_misc, and binfmt_script? You may want bprm->interp
too?

-Kees

> + __entry->pid = task->pid;
> + __assign_str(comm, task->comm);
> + ),
> +
> + TP_printk("filename=%s pid=%d comm=%s",
> +   __get_str(filename), __entry->pid, __get_str(comm))
> +);
> +
>  #endif
>  
>  /* This part must be outside protection */
> -- 
> 2.44.0.478.gd926399ef9-goog
> 

-- 
Kees Cook



Re: [PATCHv3 2/2] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge

2024-04-09 Thread Dmitry Baryshkov
On Tue, Apr 09, 2024 at 01:04:12PM +0200, Pavel Machek wrote:
> Hi!
> 
> > > This is driver for ANX7688 USB-C HDMI, with flashing and debugging
> > > features removed. ANX7688 is rather criticial piece on PinePhone,
> > > there's no display and no battery charging without it.
> > > 
> > > There's likely more work to be done here, but having basic support
> > > in mainline is needed to be able to work on the other stuff
> > > (networking, cameras, power management).
> > > 
> > > Signed-off-by: Ondrej Jirman 
> > > Co-developed-by: Martijn Braam 
> > > Co-developed-by: Samuel Holland 
> > > Signed-off-by: Pavel Machek 
> > 
> > Just couple of quick comments below - I did not have time to go over
> > this very thoroughly, but I think you need to make a new version in
> > any case because of comments in 1/2.
> 

[skipped]

> 
> > > +static int anx7688_connect(struct anx7688 *anx7688)
> > > +{
> > > + struct typec_partner_desc desc = {};
> > > + int ret, i;
> > > + u8 fw[2];
> > > + const u8 dp_snk_identity[16] = {
> > > + 0x00, 0x00, 0x00, 0xec, /* id header */
> > > + 0x00, 0x00, 0x00, 0x00, /* cert stat */
> > > + 0x00, 0x00, 0x00, 0x00, /* product type */
> > > + 0x39, 0x00, 0x00, 0x51  /* alt mode adapter */
> > > + };
> > > + const u8 svid[4] = {
> > > + 0x00, 0x00, 0x01, 0xff,
> > > + };
> > 
> > Why not get those from DT?
> 
> Are you sure it belongs to the DT (and that DT people will agree)?

>From Documentation/devicetree/bindings/connector/usb-connector.yaml:

altmodes {
displayport {
svid = /bits/ 16 <0xff01>;
vdo = <0x1c46>;
};
};

BTW, I don't see the VDO for the DP altmode in your code. Maybe I missed
it at a quick glance.

> 
> > > + u32 caps[8];
> > > +


-- 
With best wishes
Dmitry



Re: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-09 Thread Haitao Huang

On Tue, 09 Apr 2024 04:03:22 -0500, Michal Koutný  wrote:

On Mon, Apr 08, 2024 at 11:23:21PM -0500, Haitao Huang  
 wrote:

It's always non-NULL based on testing.

It's hard for me to say definitely by reading the code. But IIUC
cgroup_disable command-line only blocks operations in /sys/fs/cgroup so  
user
space can't set up controllers and config limits, etc., for the  
diasabled
ones. Each task->cgroups would still have a non-NULL pointer to the  
static

root object for each cgroup that is enabled by KConfig, so
get_current_misc_cg() thus  sgx_get_current_cg() should not return NULL
regardless 'cgroup_disable=misc'.

Maybe @Michal or @tj can confirm?


The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist.

(It is up to the controller implementation to do further optimization
based on the boot-time disablement (e.g. see uses of
mem_cgroup_disabled()). Not sure if this is useful for misc controller.)

As for the WARN_ON(1), taking example from memcg -- NULL is best
synonymous with root. It's a judgement call which of the values to store
and when to intepret it.

HTH,
Michal

Thanks for the info.

The way I see it, misc does not have special handling like memcg so every  
task at least belong to the root(default) group even if it's disabled by  
command line parameter. So we would not get NULL from  
get_current_misc_cg(). I think I'll keep the WARN_ON_ONCE for now as a  
reminder in case misc do have custom support for disabling in future.


Thanks
Haitao



Re: [PATCH v2 3/4] arm64: dts: qcom: msm8976: Add Adreno GPU

2024-04-09 Thread Bjorn Andersson
On Mon, Apr 01, 2024 at 07:21:52PM +0200, Adam Skladowski wrote:
> Add Adreno GPU node.
> 
> Signed-off-by: Adam Skladowski 
> ---
>  arch/arm64/boot/dts/qcom/msm8976.dtsi | 65 +++
>  1 file changed, 65 insertions(+)
> 
> diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi 
> b/arch/arm64/boot/dts/qcom/msm8976.dtsi
> index 6be310079f5b..77670fce9b8f 100644
> --- a/arch/arm64/boot/dts/qcom/msm8976.dtsi
> +++ b/arch/arm64/boot/dts/qcom/msm8976.dtsi
> @@ -1074,6 +1074,71 @@ mdss_dsi1_phy: phy@1a96a00 {
>   };
>   };
>  
> + adreno_gpu: gpu@1c0 {
> + compatible = "qcom,adreno-510.0", "qcom,adreno";
> +
> + reg = <0x01c0 0x4>;
> + reg-names = "kgsl_3d0_reg_memory";
> +
> + interrupts = ;
> + interrupt-names = "kgsl_3d0_irq";
> +
> + clocks = < GCC_GFX3D_OXILI_CLK>,
> +  < GCC_GFX3D_OXILI_AHB_CLK>,
> +  < GCC_GFX3D_OXILI_GMEM_CLK>,
> +  < GCC_GFX3D_BIMC_CLK>,
> +  < GCC_GFX3D_OXILI_TIMER_CLK>,
> +  < GCC_GFX3D_OXILI_AON_CLK>;
> + clock-names = "core",
> +   "iface",
> +   "mem",
> +   "mem_iface",
> +   "rbbmtimer",
> +   "alwayson";
> +
> + power-domains = < OXILI_GX_GDSC>;
> +
> + iommus = <_iommu 0>;
> +
> + status = "disabled";

Make status the last property of the node, please.

Regards,
Bjorn

> +
> + operating-points-v2 = <_opp_table>;
> +
> + gpu_opp_table: opp-table {
> + compatible = "operating-points-v2";
> +



Re: [PATCH v2 2/4] arm64: dts: qcom: msm8976: Add MDSS nodes

2024-04-09 Thread Bjorn Andersson
On Mon, Apr 01, 2024 at 07:21:51PM +0200, Adam Skladowski wrote:
> diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi 
> b/arch/arm64/boot/dts/qcom/msm8976.dtsi
[..]
> + mdss: display-subsystem@1a0 {
[..]
> + mdss_dsi0: dsi@1a94000 {
> + compatible = "qcom,msm8976-dsi-ctrl", 
> "qcom,mdss-dsi-ctrl";
> + reg = <0x01a94000 0x25c>;
> + reg-names = "dsi_ctrl";
> +
> + interrupt-parent = <>;
> + interrupts = <4>;
> +
> + clocks = < GCC_MDSS_MDP_CLK>,
> +  < GCC_MDSS_AHB_CLK>,
> +  < GCC_MDSS_AXI_CLK>,
> +  < GCC_MDSS_BYTE0_CLK>,
> +  < GCC_MDSS_PCLK0_CLK>,
> +  < GCC_MDSS_ESC0_CLK>;
> + clock-names = "mdp_core",
> +   "iface",
> +   "bus",
> +   "byte",
> +   "pixel",
> +   "core";
> +
> + assigned-clocks = < GCC_MDSS_BYTE0_CLK_SRC>,
> +   < GCC_MDSS_PCLK0_CLK_SRC>;
> + assigned-clock-parents = <_dsi0_phy 0>,
> +  <_dsi0_phy 1>;
> +
> + phys = <_dsi0_phy>;
> +
> + operating-points-v2 = <_opp_table>;
> + power-domains = < MDSS_GDSC>;
> +
> + #address-cells = <1>;
> + #size-cells = <0>;
> +

Seems reasonable to keep this disabled as well. Further more _dsi0
depends on _dsi0_phy which is disabled.

> + ports {
[..]
> + mdss_dsi0_phy: phy@1a94a00 {
> + compatible = "qcom,dsi-phy-28nm-hpm-fam-b";
> + reg = <0x01a94a00 0xd4>,
> +   <0x01a94400 0x280>,
> +   <0x01a94b80 0x30>;
> + reg-names = "dsi_pll",
> + "dsi_phy",
> + "dsi_phy_regulator";
> +
> + #clock-cells = <1>;
> + #phy-cells = <0>;
> +
> + clocks = < GCC_MDSS_AHB_CLK>,
> +  < RPM_SMD_XO_CLK_SRC>;
> + clock-names = "iface", "ref";
> +
> + status = "disabled";
> + };

PS. Leave _mdp enabled...

Regards,
Bjorn



Re: [PATCH 0/3] Fix up qcom,halt-regs definition in various schemas

2024-04-09 Thread Rob Herring
On Sun, Apr 07, 2024 at 11:58:29AM +0200, Luca Weiss wrote:
> The original motivation is that a bunch of other schemas fail to
> validate qcom,halt-regs, for example like in the following examples:
> 
> arch/arm64/boot/dts/qcom/apq8016-sbc.dtb: remoteproc@408: 
> qcom,halt-regs:0: [20] is too short
> from schema $id: 
> http://devicetree.org/schemas/remoteproc/qcom,msm8916-mss-pil.yaml#
> arch/arm64/boot/dts/qcom/apq8096-ifc6640.dtb: remoteproc@208: 
> qcom,halt-regs:0: [82] is too short
> from schema $id: 
> http://devicetree.org/schemas/remoteproc/qcom,msm8996-mss-pil.yaml#
> arch/arm64/boot/dts/qcom/apq8039-t2.dtb: remoteproc@408: 
> qcom,halt-regs:0: [32] is too short
> from schema $id: 
> http://devicetree.org/schemas/remoteproc/qcom,msm8916-mss-pil.yaml#
> 
> While I'm actually not quite sure why these patches fix this in
> the other schemas - feels like a bug/limitation in dt-schema maybe? -

Was this with v2024.02? It should be a bit better there. Though it 
may just have different errors. The limitation is that property 
types and in the case of matrix's (which phandle-array actually is) 
range for dimensions are global. So if there's not correct dimensions 
for a property, the tools aren't going to decode it properly.

Rob



Re: [PATCH 1/3] tracing: Remove dependency of saved_cmdlines_buffer on PID_MAX_DEFAULT

2024-04-09 Thread Steven Rostedt
On Mon,  8 Apr 2024 16:58:17 +0200
Michal Koutný  wrote:

> @@ -294,7 +295,7 @@ static void __trace_find_cmdline(int pid, char comm[])
>   return;
>   }
>  
> - tpid = pid & (PID_MAX_DEFAULT - 1);
> + tpid = pid % PID_MAP_SIZE;

Does that compile to the same? This is a fast path.

-- Steve


>   map = savedcmd->map_pid_to_cmdline[tpid];
>   if (map != NO_CMDLINE_MAP) {
>   tpid = savedcmd->map_cmdline_to_pid[map];



Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Marco Elver
On Tue, 9 Apr 2024 at 16:31, Steven Rostedt  wrote:
>
> On Mon,  8 Apr 2024 11:01:54 +0200
> Marco Elver  wrote:
>
> > Add "new_exec" tracepoint, which is run right after the point of no
> > return but before the current task assumes its new exec identity.
> >
> > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint
> > runs before flushing the old exec, i.e. while the task still has the
> > original state (such as original MM), but when the new exec either
> > succeeds or crashes (but never returns to the original exec).
> >
> > Being able to trace this event can be helpful in a number of use cases:
> >
> >   * allowing tracing eBPF programs access to the original MM on exec,
> > before current->mm is replaced;
> >   * counting exec in the original task (via perf event);
> >   * profiling flush time ("new_exec" to "sched_process_exec").
> >
> > Example of tracing output ("new_exec" and "sched_process_exec"):
>
> How common is this? And can't you just do the same with adding a kprobe?

Our main use case would be to use this in BPF programs to become
exec-aware, where using the sched_process_exec hook is too late. This
is particularly important where the BPF program must stop inspecting
the user space's VM when the task does exec to become a new process.

kprobe (or BPF's fentry) is brittle here, because begin_new_exec()'s
permission check can still return an error which returns to the
original task without crashing. Only at the point of no return are we
guaranteed that the exec either succeeds, or the task is terminated on
failure.

I don't know if "common" is the right question here, because it's a
chicken-egg problem: no tracepoint, we give up; we have the
tracepoint, it unlocks a range of new use cases (that require robust
solution to make BPF programs exec-aware, and a tracepoint is the only
option IMHO).

Thanks,
-- Marco



Re: [PATCH v2 fs/proc/bootconfig 0/2] remove redundant comments from /proc/bootconfig

2024-04-09 Thread Google
Hi Paul,

Thanks, both looks good to me.

Acked-by: Masami Hiramatsu (Google) 

Let me update bootconfig/fixes.

On Mon, 8 Apr 2024 21:42:49 -0700
"Paul E. McKenney"  wrote:

> Hello!
> 
> This series removes redundant comments from /proc/bootconfig:
> 
> 1.fs/proc: remove redundant comments from /proc/bootconfig,
>   courtesy of Zhenhua Huang.
> 
> 2.fs/proc: Skip bootloader comment if no embedded kernel parameters,
>   courtesy of Masami Hiramatsu.
> 
>   Thanx, Paul
> 
> 
> 
>  b/fs/proc/bootconfig.c   |   12 ++--
>  b/include/linux/bootconfig.h |1 +
>  b/init/main.c|5 +
>  fs/proc/bootconfig.c |2 +-
>  4 files changed, 13 insertions(+), 7 deletions(-)
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Steven Rostedt
On Mon,  8 Apr 2024 11:01:54 +0200
Marco Elver  wrote:

> Add "new_exec" tracepoint, which is run right after the point of no
> return but before the current task assumes its new exec identity.
> 
> Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint
> runs before flushing the old exec, i.e. while the task still has the
> original state (such as original MM), but when the new exec either
> succeeds or crashes (but never returns to the original exec).
> 
> Being able to trace this event can be helpful in a number of use cases:
> 
>   * allowing tracing eBPF programs access to the original MM on exec,
> before current->mm is replaced;
>   * counting exec in the original task (via perf event);
>   * profiling flush time ("new_exec" to "sched_process_exec").
> 
> Example of tracing output ("new_exec" and "sched_process_exec"):

How common is this? And can't you just do the same with adding a kprobe?

-- Steve



Re: [PATCH v2] kprobes: Avoid possible warn in __arm_kprobe_ftrace()

2024-04-09 Thread Google
On Tue, 9 Apr 2024 14:20:45 +0800
Zheng Yejian  wrote:

> On 2024/4/8 20:41, Masami Hiramatsu (Google) wrote:
> > Hi Zheng,
> > 
> > On Mon, 8 Apr 2024 16:34:03 +0800
> > Zheng Yejian  wrote:
> > 
> >> There is once warn in __arm_kprobe_ftrace() on:
> >>
> >>   ret = ftrace_set_filter_ip(ops, (unsigned long)p->addr, 0, 0);
> >>   if (WARN_ONCE(..., "Failed to arm kprobe-ftrace at %pS (error %d)\n", 
> >> ...)
> >> return ret;
> >>
> >> This warning is generated because 'p->addr' is detected to be not a valid
> >> ftrace location in ftrace_set_filter_ip(). The ftrace address check is done
> >> by check_ftrace_location() at the beginning of check_kprobe_address_safe().
> >> At that point, ftrace_location(addr) == addr should return true if the
> >> module is loaded. Then the module is searched twice:
> >>1. in is_module_text_address(), we find that 'p->addr' is in a module;
> >>2. in __module_text_address(), we find the module;
> >>
> >> If the module has just been unloaded before the second search, then
> >> '*probed_mod' is NULL and we would not go to get the module refcount,
> >> then the return value of check_kprobe_address_safe() would be 0, but
> >> actually we need to return -EINVAL.
> > 
> > OK, so you found a race window in check_kprobe_address_safe().
> > 
> > It does something like below.
> > 
> > check_kprobe_address_safe() {
> > ...
> > 
> > /* Timing [A] */
> > 
> > if (!(core_kernel_text(p->addr) ||
> > is_module_text_address(p->addr)) ||
> > ...(other reserved address check)) {
> > return -EINVAL;
> > }
> > 
> > /* Timing [B] */
> > 
> > *probed_mod = __module_text_address(p->addr):
> > if (*probe_mod) {
> > if (!try_module_get(*probed_mod)) {
> > return -ENOENT;
> > }
> > ... 
> > }
> > }
> > 
> > So, if p->addr is in a module which is alive at the timing [A], but
> > unloaded at timing [B], 'p->addr' is passed the
> > 'is_module_text_address(p->addr)' check, but *probed_mod becomes NULL.
> > Thus the corresponding module is not referenced and kprobe_arm(p) will
> > access a wrong address (use after free).
> > This happens either kprobe on ftrace is enabled or not.
> 
> Yes, This is the problem. And for this case, check_kprobe_address_safe() 
> still return 0, and then going on to arm kprobe may cause problems. So
> we should make check_kprobe_address_safe() return -EINVAL when refcount
> of the module is not got.

Yes,

> 
> > 
> > To fix this problem, we should move the mutex_lock(kprobe_mutex) before
> > check_kprobe_address_safe() because kprobe_module_callback() also lock it
> > so it can stop module unloading.
> > 
> > Can you ensure this will fix your problem?
> 
> It seems not, the warning in __arm_kprobe_ftrace() still occurs. I
> contrived following simple test:
> 
>  #!/bin/bash
>  sysctl -w kernel.panic_on_warn=1
>  while [ True ]; do
>  insmod mod.ko# contain function 'foo'
>  rmmod mod.ko
>  done &
>  while [ True ]; do
>  insmod kprobe.ko  # register kprobe on function 'foo'
>  rmmod kprobe.ko
>  done &
> 
> I think holding kprobe_mutex cannot make sure we get the refcount of the
> module.

Aah, yes, it cannot, because the kallsyms in a module will be removed
after module->state becomes MODULE_STATE_UNFORMED. Before UNFORMED,
the state is MODULE_STATE_GOING and the kprobe_module_callback() is
called at that point. Thus, the following scenario happens.

 CPU1   CPU2

mod->state = MODULE_STATE_GOING
kprobe_module_callback() {
mutex_lock(_mutex)
loop on kprobe_table 
to disable kprobe in the module.
mutex_unlock(_mutex)
}
register_kprobe(p) {

mutex_lock(_mutex)

check_kprobe_address_safe(p->addr) {
[A'']

is_module_text_address() return true
until 
mod->state == UNFORMED.
mod->state = MODULE_STATE_UNFORMED
[B'']

__module_text_address() returns NULL.
}
p is on the 
kprobe_table.

mutex_unlock(_mutex)

So, as your fix, if we save the module at [A''] and use it at [B''],
the mod is NOT able to get because mod->state != MODULE_STATE_LIVE.


> 
> > I think your patch is just optimizing but not fixing the fundamental
> > problem, which is we don't have an 

Re: [PATCH v3 1/2] dt-bindings: arm: qcom: Add Motorola Moto G (2013)

2024-04-09 Thread Rob Herring


On Sun, 07 Apr 2024 11:05:10 +0200, Stanislav Jakubek wrote:
> Document the Motorola Moto G (2013), which is a smartphone based
> on the Qualcomm MSM8226 SoC.
> 
> Acked-by: Krzysztof Kozlowski 
> Signed-off-by: Stanislav Jakubek 
> ---
> Changes in V3:
>   - no changes
> 
> Changes in V2:
>   - collect Krzysztof's A-b
> 
>  Documentation/devicetree/bindings/arm/qcom.yaml | 1 +
>  1 file changed, 1 insertion(+)
> 


My bot found new DTB warnings on the .dts files added or changed in this
series.

Some warnings may be from an existing SoC .dtsi. Or perhaps the warnings
are fixed by another series. Ultimately, it is up to the platform
maintainer whether these warnings are acceptable or not. No need to reply
unless the platform maintainer has comments.

If you already ran DT checks and didn't see these error(s), then
make sure dt-schema is up to date:

  pip3 install dtschema --upgrade


New warnings running 'make CHECK_DTBS=y qcom/msm8226-motorola-falcon.dtb' for 
32c507337ab80c550fb1df08f7014d1e31eb4c32.1712480582.git.stano.jaku...@gmail.com:

arch/arm/boot/dts/qcom/msm8226-motorola-falcon.dtb: syscon@f9011000: 
compatible: 'anyOf' conditional failed, one must be fixed:
['syscon'] is too short
'syscon' is not one of ['allwinner,sun8i-a83t-system-controller', 
'allwinner,sun8i-h3-system-controller', 
'allwinner,sun8i-v3s-system-controller', 
'allwinner,sun50i-a64-system-controller', 'amd,pensando-elba-syscon', 
'brcm,cru-clkset', 'freecom,fsg-cs2-system-controller', 
'fsl,imx93-aonmix-ns-syscfg', 'fsl,imx93-wakeupmix-syscfg', 
'hisilicon,dsa-subctrl', 'hisilicon,hi6220-sramctrl', 
'hisilicon,pcie-sas-subctrl', 'hisilicon,peri-subctrl', 'hpe,gxp-sysreg', 
'intel,lgm-syscon', 'loongson,ls1b-syscon', 'loongson,ls1c-syscon', 
'marvell,armada-3700-usb2-host-misc', 'mediatek,mt8135-pctl-a-syscfg', 
'mediatek,mt8135-pctl-b-syscfg', 'mediatek,mt8365-syscfg', 
'microchip,lan966x-cpu-syscon', 'microchip,sparx5-cpu-syscon', 
'mstar,msc313-pmsleep', 'nuvoton,ma35d1-sys', 'nuvoton,wpcm450-shm', 
'rockchip,px30-qos', 'rockchip,rk3036-qos', 'rockchip,rk3066-qos', 
'rockchip,rk3128-qos', 'rockchip,rk3228-qos', 'rockchip,rk3288-qos', 
'rockchip,rk3368-qos', 'rockchip,rk3399-qos', 'rockchip,rk3568-qos', '
 rockchip,rk3588-qos', 'rockchip,rv1126-qos', 'starfive,jh7100-sysmain', 
'ti,am62-usb-phy-ctrl', 'ti,am654-dss-oldi-io-ctrl', 'ti,am654-serdes-ctrl', 
'ti,j784s4-pcie-ctrl']
from schema $id: http://devicetree.org/schemas/mfd/syscon.yaml#








Re: [PATCH v3] Documentation: Add reconnect process for VDUSE

2024-04-09 Thread Cindy Lu
On Mon, Apr 8, 2024 at 8:50 PM Michael S. Tsirkin  wrote:
>
> On Mon, Apr 08, 2024 at 08:39:21PM +0800, Cindy Lu wrote:
> > On Mon, Apr 8, 2024 at 3:40 PM Michael S. Tsirkin  wrote:
> > >
> > > On Thu, Apr 04, 2024 at 01:56:31PM +0800, Cindy Lu wrote:
> > > > Add a document explaining the reconnect process, including what the
> > > > Userspace App needs to do and how it works with the kernel.
> > > >
> > > > Signed-off-by: Cindy Lu 
> > > > ---
> > > >  Documentation/userspace-api/vduse.rst | 41 +++
> > > >  1 file changed, 41 insertions(+)
> > > >
> > > > diff --git a/Documentation/userspace-api/vduse.rst 
> > > > b/Documentation/userspace-api/vduse.rst
> > > > index bdb880e01132..7faa83462e78 100644
> > > > --- a/Documentation/userspace-api/vduse.rst
> > > > +++ b/Documentation/userspace-api/vduse.rst
> > > > @@ -231,3 +231,44 @@ able to start the dataplane processing as follows:
> > > > after the used ring is filled.
> > > >
> > > >  For more details on the uAPI, please see include/uapi/linux/vduse.h.
> > > > +
> > > > +HOW VDUSE devices reconnection works
> > > > +
> > > > +1. What is reconnection?
> > > > +
> > > > +   When the userspace application loads, it should establish a 
> > > > connection
> > > > +   to the vduse kernel device. Sometimes,the userspace application 
> > > > exists,
> > > > +   and we want to support its restart and connect to the kernel device 
> > > > again
> > > > +
> > > > +2. How can I support reconnection in a userspace application?
> > > > +
> > > > +2.1 During initialization, the userspace application should first 
> > > > verify the
> > > > +existence of the device "/dev/vduse/vduse_name".
> > > > +If it doesn't exist, it means this is the first-time for 
> > > > connection. goto step 2.2
> > > > +If it exists, it means this is a reconnection, and we should goto 
> > > > step 2.3
> > > > +
> > > > +2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> > > > +/dev/vduse/control.
> > > > +When ioctl(VDUSE_CREATE_DEV) is called, kernel allocates memory for
> > > > +the reconnect information. The total memory size is 
> > > > PAGE_SIZE*vq_mumber.
> > >
> > > Confused. Where is that allocation, in code?
> > >
> > > Thanks!
> > >
> > this should allocated in function vduse_create_dev(),
>
> I mean, it's not allocated there ATM right? This is just doc patch
> to become part of a larger patchset?
>
Got it, thanks Michael I will send the whole patchset soon
thanks
Cindy

> > I will rewrite
> > this part  to make it more clearer
> > will send a new version soon
> > Thanks
> > cindy
> >
> > > > +2.3 Check if the information is suitable for reconnect
> > > > +If this is reconnection :
> > > > +Before attempting to reconnect, The userspace application needs to 
> > > > use the
> > > > +ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, 
> > > > VDUSE_DEV_GET_FEATURES...)
> > > > +to get the information from kernel.
> > > > +Please review the information and confirm if it is suitable to 
> > > > reconnect.
> > > > +
> > > > +2.4 Userspace application needs to mmap the memory to userspace
> > > > +The userspace application requires mapping one page for every vq. 
> > > > These pages
> > > > +should be used to save vq-related information during system 
> > > > running. Additionally,
> > > > +the application must define its own structure to store information 
> > > > for reconnection.
> > > > +
> > > > +2.5 Completed the initialization and running the application.
> > > > +While the application is running, it is important to store 
> > > > relevant information
> > > > +about reconnections in mapped pages. When calling the ioctl 
> > > > VDUSE_VQ_GET_INFO to
> > > > +get vq information, it's necessary to check whether it's a 
> > > > reconnection. If it is
> > > > +a reconnection, the vq-related information must be get from the 
> > > > mapped pages.
> > > > +
> > > > +2.6 When the Userspace application exits, it is necessary to unmap all 
> > > > the
> > > > +pages for reconnection
> > > > --
> > > > 2.43.0
> > >
>




Re: [PATCH 2/2] ARM: dts: qcom: msm8974: Use proper compatible for APCS syscon

2024-04-09 Thread Krzysztof Kozlowski
On 08/04/2024 21:32, Luca Weiss wrote:
> Use the apcs-kpss-global compatible for the APCS global mailbox block
> found on this SoC.
> 
> This also resolves a dt-binding checker warning:


Reviewed-by: Krzysztof Kozlowski 

Best regards,
Krzysztof




Re: [PATCH 1/2] dt-bindings: mailbox: qcom: Add MSM8974 APCS compatible

2024-04-09 Thread AngeloGioacchino Del Regno

Il 08/04/24 21:32, Luca Weiss ha scritto:

Add compatible for the Qualcomm MSM8974 APCS block.

Signed-off-by: Luca Weiss 


Reviewed-by: AngeloGioacchino Del Regno 





Re: [PATCH 2/2] ARM: dts: qcom: msm8974: Use proper compatible for APCS syscon

2024-04-09 Thread AngeloGioacchino Del Regno

Il 08/04/24 21:32, Luca Weiss ha scritto:

Use the apcs-kpss-global compatible for the APCS global mailbox block
found on this SoC.

This also resolves a dt-binding checker warning:

   arch/arm/boot/dts/qcom/qcom-msm8974pro-fairphone-fp2.dtb: syscon@f9011000: 
compatible: 'anyOf' conditional failed, one must be fixed:
   ['syscon'] is too short
   'syscon' is not one of ['allwinner,sun8i-a83t-system-controller', 
'allwinner,sun8i-h3-system-controller', 
'allwinner,sun8i-v3s-system-controller', 
'allwinner,sun50i-a64-system-controller', 'amd,pensando-elba-syscon', 
'brcm,cru-clkset', 'freecom,fsg-cs2-system-controller', 
'fsl,imx93-aonmix-ns-syscfg', 'fsl,imx93-wakeupmix-syscfg', 
'hisilicon,dsa-subctrl', 'hisilicon,hi6220-sramctrl', 
'hisilicon,pcie-sas-subctrl', 'hisilicon,peri-subctrl', 'hpe,gxp-sysreg', 
'intel,lgm-syscon', 'loongson,ls1b-syscon', 'loongson,ls1c-syscon', 
'marvell,armada-3700-usb2-host-misc', 'mediatek,mt8135-pctl-a-syscfg', 
'mediatek,mt8135-pctl-b-syscfg', 'mediatek,mt8365-syscfg', 
'microchip,lan966x-cpu-syscon', 'microchip,sparx5-cpu-syscon', 
'mstar,msc313-pmsleep', 'nuvoton,ma35d1-sys', 'nuvoton,wpcm450-shm', 
'rockchip,px30-qos', 'rockchip,rk3036-qos', 'rockchip,rk3066-qos', 
'rockchip,rk3128-qos', 'rockchip,rk3228-qos', 'rockchip,rk3288-qos', 
'rockchip,rk3368-qos', 'rockchip,rk3399-qos', 'rockchip,rk356
  8-qos', 'rockchip,rk3588-qos', 'rockchip,rv1126-qos', 
'starfive,jh7100-sysmain', 'ti,am62-usb-phy-ctrl', 'ti,am654-dss-oldi-io-ctrl', 
'ti,am654-serdes-ctrl', 'ti,j784s4-pcie-ctrl']
   from schema $id: http://devicetree.org/schemas/mfd/syscon.yaml#

Signed-off-by: Luca Weiss 


Reviewed-by: AngeloGioacchino Del Regno 





Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-09 Thread Jiri Olsa
On Mon, Apr 08, 2024 at 06:22:59PM +0200, Oleg Nesterov wrote:
> On 04/08, Jiri Olsa wrote:
> >
> > On Fri, Apr 05, 2024 at 01:02:30PM +0200, Oleg Nesterov wrote:
> > >
> > > And what should sys_uretprobe() do if it is not called from the 
> > > trampoline?
> > > I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.
> >
> > so the similar behaviour with int3 ends up with immediate SIGTRAP
> > and not invoking pending uretprobe consumers, like:
> >
> >   - setup uretprobe for foo
> >   - foo() {
> >   executes int 3 -> sends SIGTRAP
> > }
> >
> > because the int3 handler checks if it got executed from the uretprobe's
> > trampoline.
> 
> ... or the task has uprobe at this address
> 
> > if not it treats that int3 as regular trap
> 
> Yes this mimics the "default" behaviour without uprobes/uretprobes
> 
> > so I think we should mimic int3 behaviour and:
> >
> >   - setup uretprobe for foo
> >   - foo() {
> >  uretprobe_syscall -> check if we got executed from uretprobe's
> >  trampoline and send SIGILL if that's not the case
> 
> Agreed,
> 
> > I think it's better to have the offending process killed right away,
> > rather than having more undefined behaviour, waiting for final 'ret'
> > instruction that jumps to uretprobe trampoline and causes SIGILL
> 
> Agreed. In fact I think it should be also killed if copy_to/from_user()
> fails by the same reason.

+1 makes sense

jirka

> 
> Oleg.
> 



Re: [PATCHv3 2/2] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge

2024-04-09 Thread Pavel Machek
Hi!

> > This is driver for ANX7688 USB-C HDMI, with flashing and debugging
> > features removed. ANX7688 is rather criticial piece on PinePhone,
> > there's no display and no battery charging without it.
> > 
> > There's likely more work to be done here, but having basic support
> > in mainline is needed to be able to work on the other stuff
> > (networking, cameras, power management).
> > 
> > Signed-off-by: Ondrej Jirman 
> > Co-developed-by: Martijn Braam 
> > Co-developed-by: Samuel Holland 
> > Signed-off-by: Pavel Machek 
> 
> Just couple of quick comments below - I did not have time to go over
> this very thoroughly, but I think you need to make a new version in
> any case because of comments in 1/2.

Yes, there will be new version.

There is ton of sleep primitives, and existing ones are okay. I can
change everything to fdelay or whatever primitive-of-the-day is, but
I'd rather not do pointless changes.

You can ask for major changes, or complain about extra newlines, but
doing both in the same email does not make sense.

> Btw, Co-developed-by comes before Signed-off-by I think.

I believe this order is fine, too.

> > +++ b/drivers/usb/typec/anx7688.c
> > @@ -0,0 +1,1830 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * ANX7688 USB-C HDMI bridge/PD driver
> > + *
> > + * Warning, this driver is somewhat PinePhone specific.
> 
> So why not split this into core part and PinePhone specific glue
> part?

Potentially a lot of work I can't really test and would rather not do.

> > +   struct delayed_work work;
> > +   struct timer_list work_timer;
> > +
> > +   struct mutex lock;
> 
> Undocumented lock.

This is simple driver. How do you expect me to document it? Protects
this data structure, not exactly uncommon.

> > +
> > +   /* wait for power to stabilize and release reset */
> > +   msleep(10);
> > +   gpiod_set_value(anx7688->gpio_reset, 0);
> > +   usleep_range(2, 4);
> 
> Why not just use usleep_range() in both cases.

It does not really make code any better. Can do if you insist.

> > +static int anx7688_connect(struct anx7688 *anx7688)
> > +{
> > +   struct typec_partner_desc desc = {};
> > +   int ret, i;
> > +   u8 fw[2];
> > +   const u8 dp_snk_identity[16] = {
> > +   0x00, 0x00, 0x00, 0xec, /* id header */
> > +   0x00, 0x00, 0x00, 0x00, /* cert stat */
> > +   0x00, 0x00, 0x00, 0x00, /* product type */
> > +   0x39, 0x00, 0x00, 0x51  /* alt mode adapter */
> > +   };
> > +   const u8 svid[4] = {
> > +   0x00, 0x00, 0x01, 0xff,
> > +   };
> 
> Why not get those from DT?

Are you sure it belongs to the DT (and that DT people will agree)?

> > +   u32 caps[8];
> > +
> > +   dev_dbg(anx7688->dev, "cable inserted\n");
> > +
> > +   anx7688->last_status = -1;
> > +   anx7688->last_cc_status = -1;
> > +   anx7688->last_dp_state = -1;
> > +
> > +   msleep(10);
> 
> Please make a comment here why you have to wait, and use
> usleep_range().

/* Dunno because working code does that and waiting for hardware to
settle down after cable insertion kind of looks like a good thing */

I did not write the driver, and there's no good documentation for this
chip. I can try to invent something, but...

> > +   i = 0;
> > +   while (1) {
> > +   ret = anx7688_reg_read(anx7688, 
> > ANX7688_REG_EEPROM_LOAD_STATUS0);
> > +
> > +   if (ret >= 0 && (ret & ANX7688_EEPROM_FW_LOADED) == 
> > ANX7688_EEPROM_FW_LOADED) {
> > +   dev_dbg(anx7688->dev, "eeprom0 = 0x%02x\n", ret);
> > +   dev_dbg(anx7688->dev, "fw loaded after %d ms\n", i * 
> > 10);
> > +   break;
> > +   }
> > +
> > +   if (i > 99) {
> > +   set_bit(ANX7688_F_FW_FAILED, anx7688->flags);
> > +   dev_err(anx7688->dev,
> > +   "boot firmware load failed (you may need to 
> > flash FW to anx7688 first)\n");
> > +   ret = -ETIMEDOUT;
> > +   goto err_vconoff;
> > +   }
> > +   msleep(5);
> > +   i++;
> > +   }
> 
> You need to use something like time_is_after_jiffies() here instead of
> a counter.

Well, this works as well, but yes, I agree here.

> > +   ret = i2c_smbus_read_i2c_block_data(anx7688->client,
> > +   ANX7688_REG_FW_VERSION1, 2, fw);
> > +   if (ret < 0) {
> > +   dev_err(anx7688->dev, "failed to read firmware version\n");
> > +   goto err_vconoff;
> > +   }
> > +
> > +   dev_dbg(anx7688->dev, "OCM firmware loaded (version 0x%04x)\n",
> > +fw[1] | fw[0] << 8);
> > +
> > +   /* Unmask interrupts */
> > +   ret = anx7688_reg_write(anx7688, ANX7688_REG_STATUS_INT, 0);
> > +   if (ret)
> > +   goto err_vconoff;
> > +
> > +   ret = anx7688_reg_write(anx7688, ANX7688_REG_STATUS_INT_MASK, 
> > ~ANX7688_SOFT_INT_MASK);
> > +   if (ret)
> > +   goto err_vconoff;
> > +
> > +   ret = anx7688_reg_write(anx7688, 

[PATCH net-next v3 6/6] rstreason: make it work in trace world

2024-04-09 Thread Jason Xing
From: Jason Xing 

At last, we should let it work by introducing this reset reason in
trace world.

One of the possible expected outputs is:
... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
state=TCP_ESTABLISHED reason=NOT_SPECIFIED

Signed-off-by: Jason Xing 
---
 include/trace/events/tcp.h | 37 +
 net/ipv4/tcp_ipv4.c|  2 +-
 net/ipv4/tcp_output.c  |  2 +-
 net/ipv6/tcp_ipv6.c|  2 +-
 4 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 5c04a61a11c2..9bed9e63c9c5 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * tcp event with arguments sk and skb
@@ -74,20 +75,38 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
TP_ARGS(sk, skb)
 );
 
+#undef FN1
+#define FN1(reason)TRACE_DEFINE_ENUM(SK_RST_REASON_##reason);
+#undef FN2
+#define FN2(reason)TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason);
+DEFINE_RST_REASON(FN1, FN1)
+
+#undef FN1
+#undef FNe1
+#define FN1(reason){ SK_RST_REASON_##reason, #reason },
+#define FNe1(reason)   { SK_RST_REASON_##reason, #reason }
+
+#undef FN2
+#undef FNe2
+#define FN2(reason){ SKB_DROP_REASON_##reason, #reason },
+#define FNe2(reason)   { SKB_DROP_REASON_##reason, #reason }
 /*
  * skb of trace_tcp_send_reset is the skb that caused RST. In case of
  * active reset, skb should be NULL
  */
 TRACE_EVENT(tcp_send_reset,
 
-   TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
+   TP_PROTO(const struct sock *sk,
+const struct sk_buff *skb,
+const int reason),
 
-   TP_ARGS(sk, skb),
+   TP_ARGS(sk, skb, reason),
 
TP_STRUCT__entry(
__field(const void *, skbaddr)
__field(const void *, skaddr)
__field(int, state)
+   __field(int, reason)
__array(__u8, saddr, sizeof(struct sockaddr_in6))
__array(__u8, daddr, sizeof(struct sockaddr_in6))
),
@@ -113,14 +132,24 @@ TRACE_EVENT(tcp_send_reset,
 */
TP_STORE_ADDR_PORTS_SKB(skb, th, entry->daddr, 
entry->saddr);
}
+   __entry->reason = reason;
),
 
-   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s",
+   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s 
reason=%s",
  __entry->skbaddr, __entry->skaddr,
  __entry->saddr, __entry->daddr,
- __entry->state ? show_tcp_state_name(__entry->state) : 
"UNKNOWN")
+ __entry->state ? show_tcp_state_name(__entry->state) : 
"UNKNOWN",
+ __entry->reason < RST_REASON_START ?
+   __print_symbolic(__entry->reason, 
DEFINE_DROP_REASON(FN2, FNe2)) :
+   __print_symbolic(__entry->reason, 
DEFINE_RST_REASON(FN1, FNe1)))
 );
 
+#undef FN1
+#undef FNe1
+
+#undef FN2
+#undef FNe2
+
 /*
  * tcp event with arguments sk
  *
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 03c5af9decbf..4889fccbf754 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -871,7 +871,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct 
sk_buff *skb,
if (sk)
arg.bound_dev_if = sk->sk_bound_dev_if;
 
-   trace_tcp_send_reset(sk, skb);
+   trace_tcp_send_reset(sk, skb, reason);
 
BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
 offsetof(struct inet_timewait_sock, tw_bound_dev_if));
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6d807b5c1b9c..710922f7d4d6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3610,7 +3610,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t 
priority, int reason)
/* skb of trace_tcp_send_reset() keeps the skb that caused RST,
 * skb here is different to the troublesome skb, so use NULL
 */
-   trace_tcp_send_reset(sk, NULL);
+   trace_tcp_send_reset(sk, NULL, SK_RST_REASON_NOT_SPECIFIED);
 }
 
 /* Send a crossed SYN-ACK during socket establishment.
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6889ea70c760..3c995eff6e52 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1131,7 +1131,7 @@ static void tcp_v6_send_reset(const struct sock *sk, 
struct sk_buff *skb,
label = ip6_flowlabel(ipv6h);
}
 
-   trace_tcp_send_reset(sk, skb);
+   trace_tcp_send_reset(sk, skb, reason);
 
tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
 ipv6_get_dsfield(ipv6h), label, priority, txhash,
-- 
2.37.3




[PATCH net-next v3 5/6] mptcp: support rstreason for passive reset

2024-04-09 Thread Jason Xing
From: Jason Xing 

It relys on what reset options in the skb are as rfc8684 says. Reusing
this logic can save us much energy. This patch replaces most of the prior
NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing 
---
 net/mptcp/subflow.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index ba0a252c113f..4f2be72d5b02 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -308,8 +308,11 @@ static struct dst_entry *subflow_v4_route_req(const struct 
sock *sk,
return dst;
 
dst_release(dst);
-   if (!req->syncookie)
-   tcp_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   if (!req->syncookie) {
+   struct mptcp_ext *mpext = mptcp_get_ext(skb);
+
+   tcp_request_sock_ops.send_reset(sk, skb, mpext->reset_reason);
+   }
return NULL;
 }
 
@@ -375,8 +378,11 @@ static struct dst_entry *subflow_v6_route_req(const struct 
sock *sk,
return dst;
 
dst_release(dst);
-   if (!req->syncookie)
-   tcp6_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   if (!req->syncookie) {
+   struct mptcp_ext *mpext = mptcp_get_ext(skb);
+
+   tcp6_request_sock_ops.send_reset(sk, skb, mpext->reset_reason);
+   }
return NULL;
 }
 #endif
@@ -783,6 +789,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
bool fallback, fallback_is_fatal;
struct mptcp_sock *owner;
struct sock *child;
+   int reason;
 
pr_debug("listener=%p, req=%p, conn=%p", listener, req, listener->conn);
 
@@ -911,7 +918,8 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
tcp_rsk(req)->drop_req = true;
inet_csk_prepare_for_destroy_sock(child);
tcp_done(child);
-   req->rsk_ops->send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   reason = mptcp_get_ext(skb)->reset_reason;
+   req->rsk_ops->send_reset(sk, skb, convert_mptcp_reason(reason));
 
/* The last child reference will be released by the caller */
return child;
-- 
2.37.3




[PATCH net-next v3 4/6] tcp: support rstreason for passive reset

2024-04-09 Thread Jason Xing
From: Jason Xing 

Reuse the dropreason logic to show the exact reason of tcp reset,
so we don't need to implement those duplicated reset reasons.
This patch replaces all the prior NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing 
---
 net/ipv4/tcp_ipv4.c | 8 
 net/ipv6/tcp_ipv6.c | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 21fa69445f7a..03c5af9decbf 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1935,7 +1935,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   tcp_v4_send_reset(rsk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(rsk, skb, reason);
 discard:
kfree_skb_reason(skb, reason);
/* Be careful here. If this function gets more complicated and
@@ -2278,7 +2278,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
} else {
drop_reason = tcp_child_process(sk, nsk, skb);
if (drop_reason) {
-   tcp_v4_send_reset(nsk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(nsk, skb, drop_reason);
goto discard_and_relse;
}
sock_put(sk);
@@ -2356,7 +2356,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
 bad_packet:
__TCP_INC_STATS(net, TCP_MIB_INERRS);
} else {
-   tcp_v4_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(NULL, skb, drop_reason);
}
 
 discard_it:
@@ -2407,7 +2407,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
tcp_v4_timewait_ack(sk, skb);
break;
case TCP_TW_RST:
-   tcp_v4_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(sk, skb, drop_reason);
inet_twsk_deschedule_put(inet_twsk(sk));
goto discard_it;
case TCP_TW_SUCCESS:;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 7e591521b193..6889ea70c760 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1678,7 +1678,7 @@ int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(sk, skb, reason);
 discard:
if (opt_skb)
__kfree_skb(opt_skb);
@@ -1864,7 +1864,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff 
*skb)
} else {
drop_reason = tcp_child_process(sk, nsk, skb);
if (drop_reason) {
-   tcp_v6_send_reset(nsk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(nsk, skb, drop_reason);
goto discard_and_relse;
}
sock_put(sk);
@@ -1940,7 +1940,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff 
*skb)
 bad_packet:
__TCP_INC_STATS(net, TCP_MIB_INERRS);
} else {
-   tcp_v6_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(NULL, skb, drop_reason);
}
 
 discard_it:
@@ -1995,7 +1995,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff 
*skb)
tcp_v6_timewait_ack(sk, skb);
break;
case TCP_TW_RST:
-   tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(sk, skb, drop_reason);
inet_twsk_deschedule_put(inet_twsk(sk));
goto discard_it;
case TCP_TW_SUCCESS:
-- 
2.37.3




[PATCH net-next v3 3/6] rstreason: prepare for active reset

2024-04-09 Thread Jason Xing
From: Jason Xing 

Like what we did to passive reset:
only passing possible reset reason in each active reset path.

No functional changes.

Signed-off-by: Jason Xing 
---
 include/net/tcp.h |  2 +-
 net/ipv4/tcp.c| 15 ++-
 net/ipv4/tcp_output.c |  2 +-
 net/ipv4/tcp_timer.c  |  9 ++---
 net/mptcp/protocol.c  |  4 +++-
 net/mptcp/subflow.c   |  5 +++--
 6 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9ab5b37e9d53..67ab4dbf7805 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -667,7 +667,7 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 void tcp_send_probe0(struct sock *);
 int tcp_write_wakeup(struct sock *, int mib);
 void tcp_send_fin(struct sock *sk);
-void tcp_send_active_reset(struct sock *sk, gfp_t priority);
+void tcp_send_active_reset(struct sock *sk, gfp_t priority, int reason);
 int tcp_send_synack(struct sock *);
 void tcp_push_one(struct sock *, unsigned int mss_now);
 void __tcp_send_ack(struct sock *sk, u32 rcv_nxt);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 664c8ecb076b..d1610d4deb8f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -275,6 +275,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2805,7 +2806,8 @@ void __tcp_close(struct sock *sk, long timeout)
/* Unread data was tossed, zap the connection. */
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, sk->sk_allocation);
+   tcp_send_active_reset(sk, sk->sk_allocation,
+ SK_RST_REASON_NOT_SPECIFIED);
} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
/* Check zero linger _after_ checking for unread data. */
sk->sk_prot->disconnect(sk, 0);
@@ -2879,7 +2881,8 @@ void __tcp_close(struct sock *sk, long timeout)
struct tcp_sock *tp = tcp_sk(sk);
if (READ_ONCE(tp->linger2) < 0) {
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
__NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPABORTONLINGER);
} else {
@@ -2897,7 +2900,8 @@ void __tcp_close(struct sock *sk, long timeout)
if (sk->sk_state != TCP_CLOSE) {
if (tcp_check_oom(sk, 0)) {
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
__NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPABORTONMEMORY);
} else if (!check_net(sock_net(sk))) {
@@ -3001,7 +3005,7 @@ int tcp_disconnect(struct sock *sk, int flags)
/* The last check adjusts for discrepancy of Linux wrt. RFC
 * states
 */
-   tcp_send_active_reset(sk, gfp_any());
+   tcp_send_active_reset(sk, gfp_any(), 
SK_RST_REASON_NOT_SPECIFIED);
WRITE_ONCE(sk->sk_err, ECONNRESET);
} else if (old_state == TCP_SYN_SENT)
WRITE_ONCE(sk->sk_err, ECONNRESET);
@@ -4557,7 +4561,8 @@ int tcp_abort(struct sock *sk, int err)
smp_wmb();
sk_error_report(sk);
if (tcp_need_reset(sk->sk_state))
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
tcp_done(sk);
}
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9282fafc0e61..6d807b5c1b9c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3585,7 +3585,7 @@ void tcp_send_fin(struct sock *sk)
  * was unread data in the receive queue.  This behavior is recommended
  * by RFC 2525, section 2.17.  -DaveM
  */
-void tcp_send_active_reset(struct sock *sk, gfp_t priority)
+void tcp_send_active_reset(struct sock *sk, gfp_t priority, int reason)
 {
struct sk_buff *skb;
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 976db57b95d4..83fe7f62f7f1 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk)
 {
@@ -127,7 +128,8 @@ static int tcp_out_of_resources(struct sock *sk, bool 
do_reset)
(!tp->snd_wnd && !tp->packets_out))
do_reset = true;
if (do_reset)
-   

[PATCH net-next v3 2/6] rstreason: prepare for passive reset

2024-04-09 Thread Jason Xing
From: Jason Xing 

Adjust the parameter and support passing reason of reset which
is for now NOT_SPECIFIED. No functional changes.

Signed-off-by: Jason Xing 
---
 include/net/request_sock.h |  3 ++-
 net/dccp/ipv4.c| 10 ++
 net/dccp/ipv6.c| 10 ++
 net/dccp/minisocks.c   |  3 ++-
 net/ipv4/tcp_ipv4.c| 12 +++-
 net/ipv4/tcp_minisocks.c   |  3 ++-
 net/ipv6/tcp_ipv6.c| 15 +--
 net/mptcp/subflow.c|  8 +---
 8 files changed, 39 insertions(+), 25 deletions(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 004e651e6067..93f9fee7e52f 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -34,7 +34,8 @@ struct request_sock_ops {
void(*send_ack)(const struct sock *sk, struct sk_buff *skb,
struct request_sock *req);
void(*send_reset)(const struct sock *sk,
- struct sk_buff *skb);
+ struct sk_buff *skb,
+ int reason);
void(*destructor)(struct request_sock *req);
void(*syn_ack_timeout)(const struct request_sock *req);
 };
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9fc9cea4c251..11b8d14be3e2 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ackvec.h"
 #include "ccid.h"
@@ -521,7 +522,8 @@ static int dccp_v4_send_response(const struct sock *sk, 
struct request_sock *req
return err;
 }
 
-static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb)
+static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb,
+  int reason)
 {
int err;
const struct iphdr *rxiph;
@@ -706,7 +708,7 @@ int dccp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
kfree_skb(skb);
return 0;
 }
@@ -869,7 +871,7 @@ static int dccp_v4_rcv(struct sk_buff *skb)
if (nsk == sk) {
reqsk_put(req);
} else if (dccp_child_process(sk, nsk, skb)) {
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
goto discard_and_relse;
} else {
sock_put(sk);
@@ -909,7 +911,7 @@ static int dccp_v4_rcv(struct sk_buff *skb)
if (dh->dccph_type != DCCP_PKT_RESET) {
DCCP_SKB_CB(skb)->dccpd_reset_code =
DCCP_RESET_CODE_NO_CONNECTION;
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
}
 
 discard_it:
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index c8ca703dc331..232092dc3887 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "dccp.h"
 #include "ipv6.h"
@@ -256,7 +257,8 @@ static void dccp_v6_reqsk_destructor(struct request_sock 
*req)
kfree_skb(inet_rsk(req)->pktopts);
 }
 
-static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb)
+static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb,
+  int reason)
 {
const struct ipv6hdr *rxip6h;
struct sk_buff *skb;
@@ -656,7 +658,7 @@ static int dccp_v6_do_rcv(struct sock *sk, struct sk_buff 
*skb)
return 0;
 
 reset:
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
 discard:
if (opt_skb != NULL)
__kfree_skb(opt_skb);
@@ -762,7 +764,7 @@ static int dccp_v6_rcv(struct sk_buff *skb)
if (nsk == sk) {
reqsk_put(req);
} else if (dccp_child_process(sk, nsk, skb)) {
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
goto discard_and_relse;
} else {
sock_put(sk);
@@ -801,7 +803,7 @@ static int dccp_v6_rcv(struct sk_buff *skb)
if (dh->dccph_type != DCCP_PKT_RESET) {
DCCP_SKB_CB(skb)->dccpd_reset_code =
DCCP_RESET_CODE_NO_CONNECTION;
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
}
 
 discard_it:
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index 64d805b27add..251a57cf5822 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -15,6 +15,7 @@
 #include 
 

[PATCH net-next v3 1/6] net: introduce rstreason to detect why the RST is sent

2024-04-09 Thread Jason Xing
From: Jason Xing 

Add a new standalone file for the easy future extension to support
both active reset and passive reset in the TCP/DCCP/MPTCP protocols.

This patch only does the preparations for reset reason mechanism,
nothing else changes.

The reset reasons are divided into three parts:
1) reuse drop reasons for passive reset in TCP
2) reuse MP_TCPRST option for MPTCP
3) our own reasons

I will implement the basic codes of active/passive reset reason in
those three protocols, which is not complete for this moment. But
it provides a new chance to let other people add more reasons into
it:)

Signed-off-by: Jason Xing 
---
 include/net/rstreason.h | 93 +
 1 file changed, 93 insertions(+)
 create mode 100644 include/net/rstreason.h

diff --git a/include/net/rstreason.h b/include/net/rstreason.h
new file mode 100644
index ..24d098a78a60
--- /dev/null
+++ b/include/net/rstreason.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef _LINUX_RSTREASON_H
+#define _LINUX_RSTREASON_H
+#include 
+
+#define DEFINE_RST_REASON(FN, FNe) \
+   FN(MPTCP_RST_EUNSPEC)   \
+   FN(MPTCP_RST_EMPTCP)\
+   FN(MPTCP_RST_ERESOURCE) \
+   FN(MPTCP_RST_EPROHIBIT) \
+   FN(MPTCP_RST_EWQ2BIG)   \
+   FN(MPTCP_RST_EBADPERF)  \
+   FN(MPTCP_RST_EMIDDLEBOX)\
+   FN(NOT_SPECIFIED)   \
+   FNe(MAX)
+
+#define RST_REASON_START (SKB_DROP_REASON_MAX + 1)
+
+/* There are three parts in order:
+ * 1) 0 - SKB_DROP_REASON_MAX: rely on drop reasons for passive reset in TCP
+ * 2) SKB_DROP_REASON_MAX + 1 - MPTCP_RST_EMIDDLEBOX: for MPTCP use
+ * 3) MPTCP_RST_EMIDDLEBOX - SK_RST_REASON_MAX: independent reset reason
+ */
+enum sk_rst_reason {
+   /* Leave this 'blank' part (0-SKB_DROP_REASON_MAX) for the reuse
+* of skb drop reason because rst reason relies on what drop reason
+* indicates exactly why it could happen.
+*/
+
+   /* Copy from include/uapi/linux/mptcp.h.
+* These reset fields will not be changed since they adhere to
+* RFC 8684. So do not touch them. I'm going to list each definition
+* of them respectively.
+*/
+   /* Unspecified error.
+* This is the default error; it implies that the subflow is no
+* longer available. The presence of this option shows that the
+* RST was generated by an MPTCP-aware device.
+*/
+   SK_RST_REASON_MPTCP_RST_EUNSPEC = RST_REASON_START,
+   /* MPTCP-specific error.
+* An error has been detected in the processing of MPTCP options.
+* This is the usual reason code to return in the cases where a RST
+* is being sent to close a subflow because of an invalid response.
+*/
+   SK_RST_REASON_MPTCP_RST_EMPTCP,
+   /* Lack of resources.
+* This code indicates that the sending host does not have enough
+* resources to support the terminated subflow.
+*/
+   SK_RST_REASON_MPTCP_RST_ERESOURCE,
+   /* Administratively prohibited.
+* This code indicates that the requested subflow is prohibited by
+* the policies of the sending host.
+*/
+   SK_RST_REASON_MPTCP_RST_EPROHIBIT,
+   /* Too much outstanding data.
+* This code indicates that there is an excessive amount of data
+* that needs to be transmitted over the terminated subflow while
+* having already been acknowledged over one or more other subflows.
+* This may occur if a path has been unavailable for a short period
+* and it is more efficient to reset and start again than it is to
+* retransmit the queued data.
+*/
+   SK_RST_REASON_MPTCP_RST_EWQ2BIG,
+   /* Unacceptable performance.
+* This code indicates that the performance of this subflow was
+* too low compared to the other subflows of this Multipath TCP
+* connection.
+*/
+   SK_RST_REASON_MPTCP_RST_EBADPERF,
+   /* Middlebox interference.
+* Middlebox interference has been detected over this subflow,
+* making MPTCP signaling invalid. For example, this may be sent
+* if the checksum does not validate.
+*/
+   SK_RST_REASON_MPTCP_RST_EMIDDLEBOX,
+
+   /* For the real standalone socket reset reason, we start from here */
+   SK_RST_REASON_NOT_SPECIFIED,
+
+   /* Maximum of socket reset reasons.
+* It shouldn't be used as a real 'reason'.
+*/
+   SK_RST_REASON_MAX,
+};
+
+static inline int convert_mptcp_reason(int reason)
+{
+   return reason += RST_REASON_START;
+}
+#endif
-- 
2.37.3




[PATCH net-next v3 0/6] Implement reset reason mechanism to detect

2024-04-09 Thread Jason Xing
From: Jason Xing 

In production, there are so many cases about why the RST skb is sent but
we don't have a very convenient/fast method to detect the exact underlying
reasons.

RST is implemented in two kinds: passive kind (like tcp_v4_send_reset())
and active kind (like tcp_send_active_reset()). The former can be traced
carefully 1) in TCP, with the help of drop reasons, which is based on
Eric's idea[1], 2) in MPTCP, with the help of reset options defined in
RFC 8684. The latter is relatively independent, which should be
implemented on our own.

In this series, I focus on the fundamental implement mostly about how
the rstreason mechnism works and give the detailed passive part as an
example, not including the active reset part. In future, we can go
further and refine those NOT_SPECIFIED reasons.

Here are some examples when tracing:
-0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
-0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NO_SOCKET

[1]
Link: 
https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/

v3
Link:
https://lore.kernel.org/all/20240404072047.11490-1-kerneljasonx...@gmail.com/
1. rebase (mptcp part) and address what Mat suggested.

v2
Link: https://lore.kernel.org/all/20240403185033.47ebc...@kernel.org/
1. rebase against the latest net-next tree

Jason Xing (6):
  net: introduce rstreason to detect why the RST is sent
  rstreason: prepare for passive reset
  rstreason: prepare for active reset
  tcp: support rstreason for passive reset
  mptcp: support rstreason for passive reset
  rstreason: make it work in trace world

 include/net/request_sock.h |  3 +-
 include/net/rstreason.h| 93 ++
 include/net/tcp.h  |  2 +-
 include/trace/events/tcp.h | 37 +--
 net/dccp/ipv4.c| 10 ++--
 net/dccp/ipv6.c| 10 ++--
 net/dccp/minisocks.c   |  3 +-
 net/ipv4/tcp.c | 15 --
 net/ipv4/tcp_ipv4.c| 14 +++---
 net/ipv4/tcp_minisocks.c   |  3 +-
 net/ipv4/tcp_output.c  |  4 +-
 net/ipv4/tcp_timer.c   |  9 ++--
 net/ipv6/tcp_ipv6.c| 17 ---
 net/mptcp/protocol.c   |  4 +-
 net/mptcp/subflow.c| 25 +++---
 15 files changed, 202 insertions(+), 47 deletions(-)
 create mode 100644 include/net/rstreason.h

-- 
2.37.3




Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL

2024-04-09 Thread Honggyu Kim
On Mon,  8 Apr 2024 22:41:04 +0900 Honggyu Kim  wrote:
[...]
> To explain this, I better share more test results.  In the section of
> "Evaluation Workload", the test sequence can be summarized as follows.
> 
>   *. "Turn on DAMON."
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>  process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>  (85GB consumed: 52GB for anon and 33GB for file cache)
>   3. Run YCSB to make zipfian distribution of memory accesses to
>  redis-server, then measure execution time.
>   4. Repeat 4 over 50 times to measure the average execution time for
>  each run.

Sorry, "Repeat 4 over 50 times" is incorrect.  This should be "Repeat 3
over 50 times".

>   5. Increase the cold memory size then repeat goes to 2.
> 
> I didn't want to make the evaluation too long in the cover letter, but
> I have also evaluated another senario, which lazyly enabled DAMON just
> before YCSB run at step 4.  I will call this test as "DAMON lazy".  This
> is missing part from the cover letter.
> 
>   1. Allocate cold memory(mmap+memset) at DRAM node, then make the
>  process sleep.
>   2. Launch redis-server and load prebaked snapshot image, dump.rdb.
>  (85GB consumed: 52GB for anon and 33GB for file cache)
>   *. "Turn on DAMON."
>   4. Run YCSB to make zipfian distribution of memory accesses to
>  redis-server, then measure execution time.
>   5. Repeat 4 over 50 times to measure the average execution time for
>  each run.
>   6. Increase the cold memory size then repeat goes to 2.
> 
> In the "DAMON lazy" senario, DAMON started monitoring late so the
> initial redis-server placement is same as "default", but started to
> demote cold data and promote redis data just before YCSB run.
[...]

Thanks,
Honggyu



Re: [PATCHv3 2/2] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge

2024-04-09 Thread Heikki Krogerus
Hi,

On Mon, Apr 08, 2024 at 12:54:25PM +0200, Pavel Machek wrote:
> From: Ondrej Jirman 
> 
> This is driver for ANX7688 USB-C HDMI, with flashing and debugging
> features removed. ANX7688 is rather criticial piece on PinePhone,
> there's no display and no battery charging without it.
> 
> There's likely more work to be done here, but having basic support
> in mainline is needed to be able to work on the other stuff
> (networking, cameras, power management).
> 
> Signed-off-by: Ondrej Jirman 
> Co-developed-by: Martijn Braam 
> Co-developed-by: Samuel Holland 
> Signed-off-by: Pavel Machek 

Just couple of quick comments below - I did not have time to go over
this very thoroughly, but I think you need to make a new version in
any case because of comments in 1/2.

Btw, Co-developed-by comes before Signed-off-by I think.

> ---
> 
> v2: Fix checkpatch stuff. Some cleanups, adapt to dts format in 1/2.
> v3: Turn down debugging, as requested during review.
> 
> diff --git a/drivers/usb/typec/Kconfig b/drivers/usb/typec/Kconfig
> index 2f80c2792dbd..c9043ae61546 100644
> --- a/drivers/usb/typec/Kconfig
> +++ b/drivers/usb/typec/Kconfig
> @@ -64,6 +64,17 @@ config TYPEC_ANX7411
> If you choose to build this driver as a dynamically linked module, the
> module will be called anx7411.ko.
>  
> +config TYPEC_ANX7688
> + tristate "Analogix ANX7688 Type-C DRP Port controller and mux driver"
> + depends on I2C
> + depends on USB_ROLE_SWITCH
> + help
> +   Say Y or M here if your system has Analogix ANX7688 Type-C Bridge
> +   controller driver.
> +
> +   If you choose to build this driver as a dynamically linked module, the
> +   module will be called anx7688.ko.
> +
>  config TYPEC_RT1719
>   tristate "Richtek RT1719 Sink Only Type-C controller driver"
>   depends on USB_ROLE_SWITCH || !USB_ROLE_SWITCH
> diff --git a/drivers/usb/typec/Makefile b/drivers/usb/typec/Makefile
> index 7a368fea61bc..3f8ff94ad294 100644
> --- a/drivers/usb/typec/Makefile
> +++ b/drivers/usb/typec/Makefile
> @@ -7,6 +7,7 @@ obj-$(CONFIG_TYPEC_TCPM)  += tcpm/
>  obj-$(CONFIG_TYPEC_UCSI) += ucsi/
>  obj-$(CONFIG_TYPEC_TPS6598X) += tipd/
>  obj-$(CONFIG_TYPEC_ANX7411)  += anx7411.o
> +obj-$(CONFIG_TYPEC_ANX7688)  += anx7688.o
>  obj-$(CONFIG_TYPEC_HD3SS3220)+= hd3ss3220.o
>  obj-$(CONFIG_TYPEC_STUSB160X)+= stusb160x.o
>  obj-$(CONFIG_TYPEC_RT1719)   += rt1719.o
> diff --git a/drivers/usb/typec/anx7688.c b/drivers/usb/typec/anx7688.c
> new file mode 100644
> index ..407cbd5fedba
> --- /dev/null
> +++ b/drivers/usb/typec/anx7688.c
> @@ -0,0 +1,1830 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * ANX7688 USB-C HDMI bridge/PD driver
> + *
> + * Warning, this driver is somewhat PinePhone specific.

So why not split this into core part and PinePhone specific glue part?

> + * How this works:
> + * - this driver allows to program firmware into ANX7688 EEPROM, and
> + *   initialize it
> + * - it then communicates with the firmware running on the OCM (on-chip
> + *   microcontroller)
> + * - it detects whether there is cable plugged in or not and powers
> + *   up or down the ANX7688 based on that
> + * - when the cable is connected the firmware on the OCM will handle
> + *   the detection of the nature of the device on the other end
> + *   of the USB-C cable
> + * - this driver then communicates with the USB phy to let it swap
> + *   data roles accordingly
> + * - it also enables VBUS and VCONN regulators as appropriate
> + * - USB phy driver (Allwinner) needs to know whether to switch to
> + *   device or host mode, or whether to turn off
> + * - when the firmware detects SRC.1.5A or SRC.3.0A via CC pins
> + *   or something else via PD, it notifies this driver via software
> + *   interrupt and this driver will determine how to update the TypeC
> + *   port status and what input current limit is appropriate
> + * - input current limit determination happens 500ms after cable
> + *   insertion or hard reset (delay is necessary to determine whether
> + *   the remote end is PD capable or not)
> + * - this driver tells to the PMIC driver that the input current limit
> + *   needs to be changed
> + * - this driver also monitors PMIC status and re-sets the input current
> + *   limit if it changes for some reason (due to PMIC internal decision
> + *   making) (this is disabled for now)
> + *
> + * ANX7688 FW behavior as observed:
> + *
> + * - DO NOT SET MORE THAN 1 SINK CAPABILITY! Firmware will ignore what
> + *   you set and send hardcoded PDO_BATT 5-21V 30W message!
> + *
> + * Product brief:
> + * 
> https://www.analogix.com/en/system/files/AA-002281-PB-6-ANX7688_Product_Brief_0.pdf
> + * Development notes:
> + * https://xnux.eu/devices/feature/anx7688.html
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> 

Re: [PATCH v4] vp_vdpa: don't allocate unused msix vectors

2024-04-09 Thread Heng Qi




在 2024/4/9 下午4:58, lyx634449800 写道:

From: Yuxue Liu 

When there is a ctlq and it doesn't require interrupt
callbacks,the original method of calculating vectors
wastes hardware msi or msix resources as well as system
IRQ resources.

When conducting performance testing using testpmd in the
guest os, it was found that the performance was lower compared
to directly using vfio-pci to passthrough the device

In scenarios where the virtio device in the guest os does
not utilize interrupts, the vdpa driver still configures
the hardware's msix vector. Therefore, the hardware still
sends interrupts to the host os. Because of this unnecessary
action by the hardware, hardware performance decreases, and
it also affects the performance of the host os.

Before modification:(interrupt mode)
  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config

After modification:(interrupt mode)
  32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
  33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
  34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config

Before modification:(virtio pmd mode for guest os)
  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config

After modification:(virtio pmd mode for guest os)
  32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config

To verify the use of the virtio PMD mode in the guest operating
system, the following patch needs to be applied to QEMU:
https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com

Signed-off-by: Yuxue Liu 
Acked-by: Jason Wang 
---
V4: Update the title and assign values to uninitialized variables
V3: delete unused variables and add validation records
V2: fix when allocating IRQs, scan all queues

  drivers/vdpa/virtio_pci/vp_vdpa.c | 23 +--
  1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
b/drivers/vdpa/virtio_pci/vp_vdpa.c
index df5f4a3bccb5..74bc8adfc7e8 100644
--- a/drivers/vdpa/virtio_pci/vp_vdpa.c
+++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
@@ -160,7 +160,14 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
struct pci_dev *pdev = mdev->pci_dev;
int i, ret, irq;
int queues = vp_vdpa->queues;
-   int vectors = queues + 1;
+   int vectors = 0;
+   int msix_vec = 0;
+
+   for (i = 0; i < queues; i++) {
+   if (vp_vdpa->vring[i].cb.callback)
+   vectors++;
+   }
+   vectors++;
  
  	ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);

if (ret != vectors) {
@@ -173,9 +180,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
vp_vdpa->vectors = vectors;
  
  	for (i = 0; i < queues; i++) {

+   if (!vp_vdpa->vring[i].cb.callback)
+   continue;
+
snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE,
"vp-vdpa[%s]-%d\n", pci_name(pdev), i);
-   irq = pci_irq_vector(pdev, i);
+   irq = pci_irq_vector(pdev, msix_vec);
ret = devm_request_irq(>dev, irq,
   vp_vdpa_vq_handler,
   0, vp_vdpa->vring[i].msix_name,
@@ -185,21 +195,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
"vp_vdpa: fail to request irq for vq %d\n", i);
goto err;
}
-   vp_modern_queue_vector(mdev, i, i);
+   vp_modern_queue_vector(mdev, i, msix_vec);
vp_vdpa->vring[i].irq = irq;
+   msix_vec++;
}
  
  	snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n",

 pci_name(pdev));
-   irq = pci_irq_vector(pdev, queues);
+   irq = pci_irq_vector(pdev, msix_vec);
ret = devm_request_irq(>dev, irq,  vp_vdpa_config_handler, 0,
   vp_vdpa->msix_name, vp_vdpa);
if (ret) {
dev_err(>dev,
-   "vp_vdpa: fail to request irq for vq %d\n", i);
+   "vp_vdpa: fail to request irq for config, ret %d\n", 
ret);
goto err;
}
-   vp_modern_config_vector(mdev, queues);
+   vp_modern_config_vector(mdev, msix_vec);
vp_vdpa->config_irq = irq;
  
  	return 0;


Reviewed-by: Heng Qi 



Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion

2024-04-09 Thread Honggyu Kim
Hi SeongJae,

On Mon,  8 Apr 2024 10:52:28 -0700 SeongJae Park  wrote:
> On Mon,  8 Apr 2024 21:06:44 +0900 Honggyu Kim  wrote:
> 
> > On Fri,  5 Apr 2024 12:24:30 -0700 SeongJae Park  wrote:
> > > On Fri,  5 Apr 2024 15:08:54 +0900 Honggyu Kim  wrote:
> [...]
> > > > Here is one of the example usage of this 'migrate_cold' action.
> > > > 
> > > >   $ cd /sys/kernel/mm/damon/admin/kdamonds/
> > > >   $ cat contexts//schemes//action
> > > >   migrate_cold
> > > >   $ echo 2 > contexts//schemes//target_nid
> > > >   $ echo commit > state
> > > >   $ numactl -p 0 ./hot_cold 500M 600M &
> > > >   $ numastat -c -p hot_cold
> > > > 
> > > >   Per-node process memory usage (in MBs)
> > > >   PID Node 0 Node 1 Node 2 Total
> > > >   --  -- -- -- -
> > > >   701 (hot_cold) 501  0601  1101
> > > > 
> > > > Since there are some common routines with pageout, many functions have
> > > > similar logics between pageout and migrate cold.
> > > > 
> > > > damon_pa_migrate_folio_list() is a minimized version of
> > > > shrink_folio_list(), but it's minified only for demotion.
> > > 
> > > MIGRATE_COLD is not only for demotion, right?  I think the last two words 
> > > are
> > > better to be removed for reducing unnecessary confuses.
> > 
> > You mean the last two sentences?  I will remove them if you feel it's
> > confusing.
> 
> Yes.  My real intended suggestion was 's/only for demotion/only for
> migration/', but entirely removing the sentences is also ok for me.

Ack.

> > 
> > > > 
> > > > Signed-off-by: Honggyu Kim 
> > > > Signed-off-by: Hyeongtak Ji 
> > > > ---
> > > >  include/linux/damon.h|   2 +
> > > >  mm/damon/paddr.c | 146 ++-
> > > >  mm/damon/sysfs-schemes.c |   4 ++
> > > >  3 files changed, 151 insertions(+), 1 deletion(-)
> [...]
> > > > --- a/mm/damon/paddr.c
> > > > +++ b/mm/damon/paddr.c
> [...]
> > > > +{
> > > > +   unsigned int nr_succeeded;
> > > > +   nodemask_t allowed_mask = NODE_MASK_NONE;
> > > > +
> > > 
> > > I personally prefer not having empty lines in the middle of variable
> > > declarations/definitions.  Could we remove this empty line?
> > 
> > I can remove it, but I would like to have more discussion about this
> > issue.  The current implementation allows only a single migration
> > target with "target_nid", but users might want to provide fall back
> > migration target nids.
> > 
> > For example, if more than two CXL nodes exist in the system, users might
> > want to migrate cold pages to any CXL nodes.  In such cases, we might
> > have to make "target_nid" accept comma separated node IDs.  nodemask can
> > be better but we should provide a way to change the scanning order.
> > 
> > I would like to hear how you think about this.
> 
> Good point.  I think we could later extend the sysfs file to receive the
> comma-separated numbers, or even mask.  For simplicity, adding sysfs files
> dedicated for the different format of inputs could also be an option (e.g.,
> target_nids_list, target_nids_mask).  But starting from this single node as is
> now looks ok to me.

If you think we can start from a single node, then I will keep it as is.
But are you okay if I change the same 'target_nid' to accept
comma-separated numbers later?  Or do you want to introduce another knob
such as 'target_nids_list'?  What about rename 'target_nid' to
'target_nids' at the first place?

> [...]
> > > > +   /* 'folio_list' is always empty here */
> > > > +
> > > > +   /* Migrate folios selected for migration */
> > > > +   nr_migrated += migrate_folio_list(_folios, pgdat, 
> > > > target_nid);
> > > > +   /* Folios that could not be migrated are still in 
> > > > @migrate_folios */
> > > > +   if (!list_empty(_folios)) {
> > > > +   /* Folios which weren't migrated go back on @folio_list 
> > > > */
> > > > +   list_splice_init(_folios, folio_list);
> > > > +   }
> > > 
> > > Let's not use braces for single statement
> > > (https://docs.kernel.org/process/coding-style.html#placing-braces-and-spaces).
> > 
> > Hmm.. I know the convention but left it as is because of the comment.
> > If I remove the braces, it would have a weird alignment for the two
> > lines for comment and statement lines.
> 
> I don't really hate such alignment.  But if you don't like it, how about 
> moving
> the comment out of the if statement?  Having one comment for one-line if
> statement looks not bad to me.

Ack. I will manage this in the next revision.

> > 
> > > > +
> > > > +   try_to_unmap_flush();
> > > > +
> > > > +   list_splice(_folios, folio_list);
> > > 
> > > Can't we move remaining folios in migrate_folios to ret_folios at once?
> > 
> > I will see if it's possible.
> 
> Thank you.  Not a strict request, though.
> 
> [...]
> > > > +   nid = folio_nid(lru_to_folio(folio_list));
> > > > +   do {
> > > > +   struct folio *folio = 

Re: [PATCH v4] vp_vdpa: don't allocate unused msix vectors

2024-04-09 Thread Michael S. Tsirkin
Good and clear subject, I like it.

On Tue, Apr 09, 2024 at 04:58:18PM +0800, lyx634449800 wrote:
> From: Yuxue Liu 
> 
> When there is a ctlq and it doesn't require interrupt
> callbacks,the original method of calculating vectors
> wastes hardware msi or msix resources as well as system
> IRQ resources.
> 
> When conducting performance testing using testpmd in the
> guest os, it was found that the performance was lower compared
> to directly using vfio-pci to passthrough the device
> 
> In scenarios where the virtio device in the guest os does
> not utilize interrupts, the vdpa driver still configures
> the hardware's msix vector. Therefore, the hardware still
> sends interrupts to the host os. Because of this unnecessary
> action by the hardware, hardware performance decreases, and
> it also affects the performance of the host os.
> 
> Before modification:(interrupt mode)
>  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> 
> After modification:(interrupt mode)
>  32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
>  33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
>  34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config
> 
> Before modification:(virtio pmd mode for guest os)
>  32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
>  33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
>  34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
>  35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config
> 
> After modification:(virtio pmd mode for guest os)
>  32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config
> 
> To verify the use of the virtio PMD mode in the guest operating
> system, the following patch needs to be applied to QEMU:
> https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com
> 
> Signed-off-by: Yuxue Liu 
> Acked-by: Jason Wang 

Much better, thanks!
A couple of small tweaks to polish it up and it'll be ready.

> ---
> V4: Update the title and assign values to uninitialized variables
> V3: delete unused variables and add validation records
> V2: fix when allocating IRQs, scan all queues
> 
>  drivers/vdpa/virtio_pci/vp_vdpa.c | 23 +--
>  1 file changed, 17 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
> b/drivers/vdpa/virtio_pci/vp_vdpa.c
> index df5f4a3bccb5..74bc8adfc7e8 100644
> --- a/drivers/vdpa/virtio_pci/vp_vdpa.c
> +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
> @@ -160,7 +160,14 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   struct pci_dev *pdev = mdev->pci_dev;
>   int i, ret, irq;
>   int queues = vp_vdpa->queues;
> - int vectors = queues + 1;
> + int vectors = 0;
> + int msix_vec = 0;
> +
> + for (i = 0; i < queues; i++) {
> + if (vp_vdpa->vring[i].cb.callback)
> + vectors++;
> + }
> + vectors++;


Actually even easier: int vectors = 1; and then we do not need
this last line.
Sorry I only noticed now.

>  
>   ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
>   if (ret != vectors) {
> @@ -173,9 +180,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   vp_vdpa->vectors = vectors;
>  
>   for (i = 0; i < queues; i++) {
> + if (!vp_vdpa->vring[i].cb.callback)
> + continue;
> +
>   snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE,
>   "vp-vdpa[%s]-%d\n", pci_name(pdev), i);
> - irq = pci_irq_vector(pdev, i);
> + irq = pci_irq_vector(pdev, msix_vec);
>   ret = devm_request_irq(>dev, irq,
>  vp_vdpa_vq_handler,
>  0, vp_vdpa->vring[i].msix_name,
> @@ -185,21 +195,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
>   "vp_vdpa: fail to request irq for vq %d\n", i);
>   goto err;
>   }
> - vp_modern_queue_vector(mdev, i, i);
> + vp_modern_queue_vector(mdev, i, msix_vec);
>   vp_vdpa->vring[i].irq = irq;
> + msix_vec++;
>   }
>  
>   snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n",
>pci_name(pdev));
> - irq = pci_irq_vector(pdev, queues);
> + irq = pci_irq_vector(pdev, msix_vec);
>   ret = devm_request_irq(>dev, irq, vp_vdpa_config_handler, 0,
>  vp_vdpa->msix_name, vp_vdpa);
>   if (ret) {
>   dev_err(>dev,
> - "vp_vdpa: fail to request irq for vq %d\n", i);
> + "vp_vdpa: fail to request irq for config, ret %d\n", 
> ret);

As long as we are here 

Re: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-09 Thread Michal Koutný
On Mon, Apr 08, 2024 at 11:23:21PM -0500, Haitao Huang 
 wrote:
> It's always non-NULL based on testing.
> 
> It's hard for me to say definitely by reading the code. But IIUC
> cgroup_disable command-line only blocks operations in /sys/fs/cgroup so user
> space can't set up controllers and config limits, etc., for the diasabled
> ones. Each task->cgroups would still have a non-NULL pointer to the static
> root object for each cgroup that is enabled by KConfig, so
> get_current_misc_cg() thus  sgx_get_current_cg() should not return NULL
> regardless 'cgroup_disable=misc'.
> 
> Maybe @Michal or @tj can confirm?

The current implementation creates root css object (see cgroup_init(),
cgroup_ssid_enabled() check is after cgroup_init_subsys()).
I.e. it will look like all tasks are members of root cgroup wrt given
controller permanently and controller attribute files won't exist.

(It is up to the controller implementation to do further optimization
based on the boot-time disablement (e.g. see uses of
mem_cgroup_disabled()). Not sure if this is useful for misc controller.)

As for the WARN_ON(1), taking example from memcg -- NULL is best
synonymous with root. It's a judgement call which of the values to store
and when to intepret it.

HTH,
Michal


signature.asc
Description: PGP signature


[PATCH v4] vp_vdpa: don't allocate unused msix vectors

2024-04-09 Thread lyx634449800
From: Yuxue Liu 

When there is a ctlq and it doesn't require interrupt
callbacks,the original method of calculating vectors
wastes hardware msi or msix resources as well as system
IRQ resources.

When conducting performance testing using testpmd in the
guest os, it was found that the performance was lower compared
to directly using vfio-pci to passthrough the device

In scenarios where the virtio device in the guest os does
not utilize interrupts, the vdpa driver still configures
the hardware's msix vector. Therefore, the hardware still
sends interrupts to the host os. Because of this unnecessary
action by the hardware, hardware performance decreases, and
it also affects the performance of the host os.

Before modification:(interrupt mode)
 32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
 33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
 34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
 35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config

After modification:(interrupt mode)
 32:  0  0  1  7   PCI-MSI 32768-edge  vp-vdpa[:00:02.0]-0
 33: 36  0  3  0   PCI-MSI 32769-edge  vp-vdpa[:00:02.0]-1
 34:  0  0  0  0   PCI-MSI 32770-edge  vp-vdpa[:00:02.0]-config

Before modification:(virtio pmd mode for guest os)
 32:  0   0  0  0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0
 33:  0   0  0  0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1
 34:  0   0  0  0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2
 35:  0   0  0  0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config

After modification:(virtio pmd mode for guest os)
 32: 0  0  0   0   PCI-MSI 32768-edge   vp-vdpa[:00:02.0]-config

To verify the use of the virtio PMD mode in the guest operating
system, the following patch needs to be applied to QEMU:
https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com

Signed-off-by: Yuxue Liu 
Acked-by: Jason Wang 
---
V4: Update the title and assign values to uninitialized variables
V3: delete unused variables and add validation records
V2: fix when allocating IRQs, scan all queues

 drivers/vdpa/virtio_pci/vp_vdpa.c | 23 +--
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c 
b/drivers/vdpa/virtio_pci/vp_vdpa.c
index df5f4a3bccb5..74bc8adfc7e8 100644
--- a/drivers/vdpa/virtio_pci/vp_vdpa.c
+++ b/drivers/vdpa/virtio_pci/vp_vdpa.c
@@ -160,7 +160,14 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
struct pci_dev *pdev = mdev->pci_dev;
int i, ret, irq;
int queues = vp_vdpa->queues;
-   int vectors = queues + 1;
+   int vectors = 0;
+   int msix_vec = 0;
+
+   for (i = 0; i < queues; i++) {
+   if (vp_vdpa->vring[i].cb.callback)
+   vectors++;
+   }
+   vectors++;
 
ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX);
if (ret != vectors) {
@@ -173,9 +180,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
vp_vdpa->vectors = vectors;
 
for (i = 0; i < queues; i++) {
+   if (!vp_vdpa->vring[i].cb.callback)
+   continue;
+
snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE,
"vp-vdpa[%s]-%d\n", pci_name(pdev), i);
-   irq = pci_irq_vector(pdev, i);
+   irq = pci_irq_vector(pdev, msix_vec);
ret = devm_request_irq(>dev, irq,
   vp_vdpa_vq_handler,
   0, vp_vdpa->vring[i].msix_name,
@@ -185,21 +195,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa)
"vp_vdpa: fail to request irq for vq %d\n", i);
goto err;
}
-   vp_modern_queue_vector(mdev, i, i);
+   vp_modern_queue_vector(mdev, i, msix_vec);
vp_vdpa->vring[i].irq = irq;
+   msix_vec++;
}
 
snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n",
 pci_name(pdev));
-   irq = pci_irq_vector(pdev, queues);
+   irq = pci_irq_vector(pdev, msix_vec);
ret = devm_request_irq(>dev, irq, vp_vdpa_config_handler, 0,
   vp_vdpa->msix_name, vp_vdpa);
if (ret) {
dev_err(>dev,
-   "vp_vdpa: fail to request irq for vq %d\n", i);
+   "vp_vdpa: fail to request irq for config, ret %d\n", 
ret);
goto err;
}
-   vp_modern_config_vector(mdev, queues);
+   vp_modern_config_vector(mdev, msix_vec);
vp_vdpa->config_irq = irq;
 
return 0;
-- 
2.43.0




Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-09 Thread Jiri Olsa
On Tue, Apr 09, 2024 at 09:34:39AM +0900, Masami Hiramatsu wrote:

SNIP

> > > 
> > > > this can be fixed by checking the syscall is called from the trampoline
> > > > and prevent handle_trampoline call if it's not
> > > 
> > > Yes, but I still do not think this makes a lot of sense. But I won't 
> > > argue.
> > > 
> > > And what should sys_uretprobe() do if it is not called from the 
> > > trampoline?
> > > I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL.
> > 
> > so the similar behaviour with int3 ends up with immediate SIGTRAP
> > and not invoking pending uretprobe consumers, like:
> > 
> >   - setup uretprobe for foo
> >   - foo() {
> >   executes int 3 -> sends SIGTRAP
> > }
> > 
> > because the int3 handler checks if it got executed from the uretprobe's
> > trampoline.. if not it treats that int3 as regular trap
> 
> Yeah, that is consistent behavior. Sounds good to me.
> 
> > 
> > while for uretprobe syscall we have at the moment following behaviour:
> > 
> >   - setup uretprobe for foo
> >   - foo() {
> >  uretprobe_syscall -> executes foo's uretprobe consumers
> > }
> >   - at some point we get to the 'ret' instruction that jump into uretprobe
> > trampoline and the uretprobe_syscall won't find pending uretprobe and
> > will send SIGILL
> > 
> > 
> > so I think we should mimic int3 behaviour and:
> > 
> >   - setup uretprobe for foo
> >   - foo() {
> >  uretprobe_syscall -> check if we got executed from uretprobe's
> >  trampoline and send SIGILL if that's not the case
> 
> OK, this looks good to me.
> 
> > 
> > I think it's better to have the offending process killed right away,
> > rather than having more undefined behaviour, waiting for final 'ret'
> > instruction that jumps to uretprobe trampoline and causes SIGILL
> > 
> > > 
> > > I agree very much with Andrii,
> > > 
> > >sigreturn()  exists only to allow the implementation of signal 
> > > handlers.  It should never be
> > >called directly.  Details of the arguments (if any) passed to 
> > > sigreturn() vary depending  on
> > >the architecture.
> > > 
> > > this is how sys_uretprobe() should be treated/documented.
> > 
> > yes, will include man page patch in new version
> 
> And please follow Documentation/process/adding-syscalls.rst in new version,
> then we can avoid repeating the same discussion :-)

yep, will do

thanks,
jirka



Re: [PATCH v14 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

2024-04-09 Thread Krzysztof Kozlowski
On 08/04/2024 22:53, Tanmay Shah wrote:
> From: Radhey Shyam Pandey 
> 
> Introduce bindings for TCM memory address space on AMD-xilinx Zynq
> UltraScale+ platform. It will help in defining TCM in device-tree
> and make it's access platform agnostic and data-driven.
> 
> Tightly-coupled memories(TCMs) are low-latency memory that provides
> predictable instruction execution and predictable data load/store
> timing. Each Cortex-R5F processor contains two 64-bit wide 64 KB memory
> banks on the ATCM and BTCM ports, for a total of 128 KB of memory.
> 
> The TCM resources(reg, reg-names and power-domain) are documented for
> each TCM in the R5 node. The reg and reg-names are made as required
> properties as we don't want to hardcode TCM addresses for future
> platforms and for zu+ legacy implementation will ensure that the
> old dts w/o reg/reg-names works and stable ABI is maintained.
> 
> It also extends the examples for TCM split and lockstep modes.
> 
> Signed-off-by: Radhey Shyam Pandey 
> Signed-off-by: Tanmay Shah 

Reviewed-by: Krzysztof Kozlowski 

Best regards,
Krzysztof




Re: [PATCH 1/2] dt-bindings: pinctrl: qcom,pmic-gpio: Allow gpio-hog nodes

2024-04-09 Thread Krzysztof Kozlowski
On 08/04/2024 20:36, Luca Weiss wrote:
> On Montag, 8. April 2024 19:26:49 CEST Konrad Dybcio wrote:
>>
>> On 4/8/24 18:39, Luca Weiss wrote:
>>> Allow specifying a GPIO hog, as already used on
>>> qcom-msm8974-lge-nexus5-hammerhead.dts.
>>>
>>> Signed-off-by: Luca Weiss 
>>> ---
>>>   .../devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml  | 12 
>>> 
>>>   1 file changed, 12 insertions(+)
>>>
>>> diff --git a/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml 
>>> b/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml
>>> index a786357ed1af..510a05369dbb 100644
>>> --- a/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml
>>> +++ b/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml
>>> @@ -424,6 +424,10 @@ patternProperties:
>>>   $ref: "#/$defs/qcom-pmic-gpio-state"
>>>   additionalProperties: false
>>>   
>>> +  "^(hog-[0-9]+|.+-hog(-[0-9]+)?)$":
>>
>> I see a couple bindings do this, but I'm not sure if we want two
>> allow two styles for no reason.. Rob?
> 
> This regex is actually from the gpio-hog.yaml base
> https://github.com/devicetree-org/dt-schema/blob/main/dtschema/schemas/gpio/gpio-hog.yaml#L23
> 
> Why it's made this way I cannot tell you, but I didn't want to 'artifically'
> restrict the pattern for qcom,pmic-gpio.

Use the same as in tlmm:
https://lore.kernel.org/linux-devicetree/20221121081221.30745-1-krzysztof.kozlow...@linaro.org/

Best regards,
Krzysztof




Re: [PATCH 1/2] dt-bindings: mailbox: qcom: Add MSM8974 APCS compatible

2024-04-09 Thread Krzysztof Kozlowski
On 08/04/2024 21:32, Luca Weiss wrote:
> Add compatible for the Qualcomm MSM8974 APCS block.

"... The block is already used in DTS, but without any SoC specific
compatible."

Reviewed-by: Krzysztof Kozlowski 


Best regards,
Krzysztof




Re: [PATCH] dt-bindings: iio: imu: mpu6050: Improve i2c-gate disallow list

2024-04-09 Thread Krzysztof Kozlowski
On 08/04/2024 18:34, Luca Weiss wrote:
> Before all supported sensors except for MPU{9150,9250,9255} were not
> allowed to use i2c-gate in the bindings which excluded quite a few
> supported sensors where this functionality is supported.
> 
> Switch the list of sensors to ones where the Linux driver explicitly
> disallows support for the auxiliary bus ("inv_mpu_i2c_aux_bus"). Since
> the driver is also based on "default: return true" this should scale
> better into the future.
> 
> Signed-off-by: Luca Weiss 

Acked-by: Krzysztof Kozlowski 

Best regards,
Krzysztof




Re: [PATCH v2] kprobes: Avoid possible warn in __arm_kprobe_ftrace()

2024-04-09 Thread Zheng Yejian

On 2024/4/8 20:41, Masami Hiramatsu (Google) wrote:

Hi Zheng,

On Mon, 8 Apr 2024 16:34:03 +0800
Zheng Yejian  wrote:


There is once warn in __arm_kprobe_ftrace() on:

  ret = ftrace_set_filter_ip(ops, (unsigned long)p->addr, 0, 0);
  if (WARN_ONCE(..., "Failed to arm kprobe-ftrace at %pS (error %d)\n", ...)
return ret;

This warning is generated because 'p->addr' is detected to be not a valid
ftrace location in ftrace_set_filter_ip(). The ftrace address check is done
by check_ftrace_location() at the beginning of check_kprobe_address_safe().
At that point, ftrace_location(addr) == addr should return true if the
module is loaded. Then the module is searched twice:
   1. in is_module_text_address(), we find that 'p->addr' is in a module;
   2. in __module_text_address(), we find the module;

If the module has just been unloaded before the second search, then
'*probed_mod' is NULL and we would not go to get the module refcount,
then the return value of check_kprobe_address_safe() would be 0, but
actually we need to return -EINVAL.


OK, so you found a race window in check_kprobe_address_safe().

It does something like below.

check_kprobe_address_safe() {
...

/* Timing [A] */

if (!(core_kernel_text(p->addr) ||
is_module_text_address(p->addr)) ||
...(other reserved address check)) {
return -EINVAL;
}

/* Timing [B] */

*probed_mod = __module_text_address(p->addr):
if (*probe_mod) {
if (!try_module_get(*probed_mod)) {
return -ENOENT;
}
... 
}
}

So, if p->addr is in a module which is alive at the timing [A], but
unloaded at timing [B], 'p->addr' is passed the
'is_module_text_address(p->addr)' check, but *probed_mod becomes NULL.
Thus the corresponding module is not referenced and kprobe_arm(p) will
access a wrong address (use after free).
This happens either kprobe on ftrace is enabled or not.


Yes, This is the problem. And for this case, check_kprobe_address_safe() 
still return 0, and then going on to arm kprobe may cause problems. So

we should make check_kprobe_address_safe() return -EINVAL when refcount
of the module is not got.



To fix this problem, we should move the mutex_lock(kprobe_mutex) before
check_kprobe_address_safe() because kprobe_module_callback() also lock it
so it can stop module unloading.

Can you ensure this will fix your problem?


It seems not, the warning in __arm_kprobe_ftrace() still occurs. I
contrived following simple test:

#!/bin/bash
sysctl -w kernel.panic_on_warn=1
while [ True ]; do
insmod mod.ko# contain function 'foo'
rmmod mod.ko
done &
while [ True ]; do
insmod kprobe.ko  # register kprobe on function 'foo'
rmmod kprobe.ko
done &

I think holding kprobe_mutex cannot make sure we get the refcount of the
module.


I think your patch is just optimizing but not fixing the fundamental
problem, which is we don't have an atomic search symbol and get module


Sorry, this patch is a little confusing, but it is not just optimizing :)

As shown below, after my patch, if p->addr is in a module which is alive
at the timing [A'] but unloaded at timing [B'], then *probed_mod must
not be NULL. Then after timing [B'], it will go to try_module_get() and
expected to fail and return -ENOENT. So this is the different.

check_kprobe_address_safe() {
...
*probed_mod = NULL;
if (!core_kernel_text((unsigned long) p->addr)) {

/* Timing [A'] */

*probed_mod = __module_text_address((unsigned long) p->addr);
if (!(*probed_mod)) {
return -EINVAL;
}
}
...

/* Timing [B'] */

if (*probed_mod) {
if (!try_module_get(*probed_mod)) {
return -ENOENT;
}
...
}


API. In that case, we should stop a whole module unloading system until
registering a new kprobe on a module. (After registering the kprobe,
the callback can mark it gone and disarm_kprobe does not work anymore.)

diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..94eaefd1bc51 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1633,11 +1633,11 @@ int register_kprobe(struct kprobe *p)
p->nmissed = 0;
INIT_LIST_HEAD(>list);
  
+	mutex_lock(_mutex);

+
ret = check_kprobe_address_safe(p, _mod);
if (ret)
-   return ret;
-
-   mutex_lock(_mutex);
+   goto out;
  
  	if (on_func_entry)

p->flags |= KPROBE_FLAG_ON_FUNC_ENTRY;



Thank you,



To fix it, originally we can simply check 'p->addr' is out of text again,
like below. But that would check twice respectively in kernel text and
module text, so finally I reduce them to be once.

   if (!(core_kernel_text((unsigned long) p->addr) ||
   

Re: [PATCH] drivers/virtio: delayed configuration descriptor flags

2024-04-09 Thread Michael S. Tsirkin
On Tue, Apr 09, 2024 at 01:02:52AM +0800, ni.liqiang wrote:
> In our testing of the virtio hardware accelerator, we found that
> configuring the flags of the descriptor after addr and len,
> as implemented in DPDK, seems to be more friendly to the hardware.
> 
> In our Virtio hardware implementation tests, using the default
> open-source code, the hardware's bulk reads ensure performance
> but correctness is compromised. If we refer to the implementation code
> of DPDK, placing the flags configuration of the descriptor
> after addr and len, virtio backend can function properly based on
> our hardware accelerator.
> 
> I am somewhat puzzled by this. From a software process perspective,
> it seems that there should be no difference whether
> the flags configuration of the descriptor is before or after addr and len.
> However, this is not the case according to experimental test results.


You should be aware of the following, from the PCI Express spec.
Note especially the second paragraph, and the last paragraph:

2.4.2.
25
30
Update Ordering and Granularity Observed by a
Read Transaction
If a Requester using a single transaction reads a block of data from a 
Completer, and the
Completer's data buffer is concurrently being updated, the ordering of multiple 
updates and
granularity of each update reflected in the data returned by the read is 
outside the scope of this
specification. This applies both to updates performed by PCI Express write 
transactions and
updates performed by other mechanisms such as host CPUs updating host memory.
If a Requester using a single transaction reads a block of data from a 
Completer, and the
Completer's data buffer is concurrently being updated by one or more entities 
not on the PCI
Express fabric, the ordering of multiple updates and granularity of each update 
reflected in the data
returned by the read is outside the scope of this specification.




As an example of update ordering, assume that the block of data is in host 
memory, and a host CPU
writes first to location A and then to a different location B. A Requester 
reading that data block
with a single read transaction is not guaranteed to observe those updates in 
order. In other words,
the Requester may observe an updated value in location B and an old value in 
location A, regardless
of the placement of locations A and B within the data block. Unless a Completer 
makes its own
guarantees (outside this specification) with respect to update ordering, a 
Requester that relies on
update ordering must observe the update to location B via one read transaction 
before initiating a
subsequent read to location A to return its updated value.




As an example of update granularity, if a host CPU writes a QWORD to host 
memory, a Requester
reading that QWORD from host memory may observe a portion of the QWORD updated 
and
another portion of it containing the old value.
While not required by this specification, it is strongly recommended that host 
platforms guarantee
that when a host CPU writes aligned DWORDs or aligned QWORDs to host memory, 
the update
granularity observed by a PCI Express read will not be smaller than a DWORD.


IMPLEMENTATION NOTE
No Ordering Required Between Cachelines
15
A Root Complex serving as a Completer to a single Memory Read that requests 
multiple cachelines
from host memory is permitted to fetch multiple cachelines concurrently, to 
help facilitate multi-
cacheline completions, subject to Max_Payload_Size. No ordering relationship 
between these
cacheline fetches is required.





Now I suspect that what is going on is that your Root complex
reads descriptors out of order, so the second descriptor is invalid
but the 1st one is valid.




> We would like to know if such a change in the configuration order
> is reasonable and acceptable?

We need to understand the root cause and how robust the fix is
before answering this.


> Thanks.
> 
> Signed-off-by: ni.liqiang 
> Reviewed-by: jin.qi 
> Tested-by: jin.qi 
> Cc: ni.liqiang 
> ---
>  drivers/virtio/virtio_ring.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 6f7e5010a673..bea2c2fb084e 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -1472,15 +1472,16 @@ static inline int virtqueue_add_packed(struct 
> virtqueue *_vq,
>   flags = cpu_to_le16(vq->packed.avail_used_flags |
>   (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
>   (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
> - if (i == head)
> - head_flags = flags;
> - else
> - desc[i].flags = flags;
>  
>   desc[i].addr = cpu_to_le64(addr);
>   desc[i].len = cpu_to_le32(sg->length);
>   desc[i].id = cpu_to_le16(id);
>  
> +