Re: [PATCH v2 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name
On 09/04/2024 20:36, Luca Weiss wrote: > Follow the gpio-hog bindings and use otg-hog as node name. > > Signed-off-by: Luca Weiss > --- > arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Krzysztof Kozlowski Best regards, Krzysztof
Re: [PATCH v2 1/2] dt-bindings: pinctrl: qcom,pmic-gpio: Allow gpio-hog nodes
On 09/04/2024 20:36, Luca Weiss wrote: > Allow specifying a GPIO hog, as already used on > qcom-msm8974-lge-nexus5-hammerhead.dts. > > Signed-off-by: Luca Weiss > --- > .../devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml | 12 > > 1 file changed, 12 insertions(+) Reviewed-by: Krzysztof Kozlowski Best regards, Krzysztof
Re: [External] Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
On Tue, Apr 9, 2024 at 7:33 PM Huang, Ying wrote: > > "Ho-Ren (Jack) Chuang" writes: > > > On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron > > wrote: > >> > >> On Fri, 5 Apr 2024 00:07:06 + > >> "Ho-Ren (Jack) Chuang" wrote: > >> > >> > The current implementation treats emulated memory devices, such as > >> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal > >> > memory > >> > (E820_TYPE_RAM). However, these emulated devices have different > >> > characteristics than traditional DRAM, making it important to > >> > distinguish them. Thus, we modify the tiered memory initialization > >> > process > >> > to introduce a delay specifically for CPUless NUMA nodes. This delay > >> > ensures that the memory tier initialization for these nodes is deferred > >> > until HMAT information is obtained during the boot process. Finally, > >> > demotion tables are recalculated at the end. > >> > > >> > * late_initcall(memory_tier_late_init); > >> > Some device drivers may have initialized memory tiers between > >> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing > >> > online memory nodes and configuring memory tiers. They should be excluded > >> > in the late init. > >> > > >> > * Handle cases where there is no HMAT when creating memory tiers > >> > There is a scenario where a CPUless node does not provide HMAT > >> > information. > >> > If no HMAT is specified, it falls back to using the default DRAM tier. > >> > > >> > * Introduce another new lock `default_dram_perf_lock` for adist > >> > calculation > >> > In the current implementation, iterating through CPUlist nodes requires > >> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end > >> > up > >> > trying to acquire the same lock, leading to a potential deadlock. > >> > Therefore, we propose introducing a standalone `default_dram_perf_lock` > >> > to > >> > protect `default_dram_perf_*`. This approach not only avoids deadlock > >> > but also prevents holding a large lock simultaneously. > >> > > >> > * Upgrade `set_node_memory_tier` to support additional cases, including > >> > default DRAM, late CPUless, and hot-plugged initializations. > >> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and > >> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to > >> > handle cases where memtype is not initialized and where HMAT information > >> > is > >> > available. > >> > > >> > * Introduce `default_memory_types` for those memory types that are not > >> > initialized by device drivers. > >> > Because late initialized memory and default DRAM memory need to be > >> > managed, > >> > a default memory type is created for storing all memory types that are > >> > not initialized by device drivers and as a fallback. > >> > > >> > Signed-off-by: Ho-Ren (Jack) Chuang > >> > Signed-off-by: Hao Xiang > >> > Reviewed-by: "Huang, Ying" > >> > >> Hi - one remaining question. Why can't we delay init for all nodes > >> to either drivers or your fallback late_initcall code. > >> It would be nice to reduce possible code paths. > > > > I try not to change too much of the existing code structure in > > this patchset. > > > > To me, postponing/moving all memory tier registrations to > > late_initcall() is another possible action item for the next patchset. > > > > After tier_mem(), hmat_init() is called, which requires registering > > `default_dram_type` info. This is when `default_dram_type` is needed. > > However, it is indeed possible to postpone the latter part, > > set_node_memory_tier(), to `late_init(). So, memory_tier_init() can > > indeed be split into two parts, and the latter part can be moved to > > late_initcall() to be processed together. > > I don't think that it's good to move all memory_tier initialization in > drivers to late_initcall(). It's natural to keep them in > device_initcall() level. > > If so, we can allocate default_dram_type in memory_tier_init(), and call > set_node_memory_tier() only in memory_tier_lateinit(). We can call > memory_tier_lateinit() in device_initcall() level too. > It makes sense to me to leave only `default_dram_type ` and hotplug_init() in memory_tier_init(), postponing all set_node_memory_tier()s to memory_tier_late_init() Would it be possible there is no device_initcall() calling memory_tier_late_init()? If yes, I think putting memory_tier_late_init() in late_init() is still necessary. > -- > Best Regards, > Huang, Ying > > > Doing this all memory-type drivers have to call late_initcall() to > > register a memory tier. I’m not sure how many they are? > > > > What do you guys think? > > > >> > >> Jonathan > >> > >> > >> > --- > >> > mm/memory-tiers.c | 94 +++ > >> > 1 file changed, 70 insertions(+), 24 deletions(-) > >> > > >> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > >> > index 516b144fd45a..6632102bd5c9 100644 > >> > --- a/mm/memory-tiers.c > >> > +++ b/mm/mem
Re: [PATCH v2 0/1] virtio-pci: Fix the crash that the vector was used after released
Sorry, send to the wrong mail list, please ignore it On Wed, Apr 10, 2024 at 12:35 PM Cindy Lu wrote: > > During the booting process of the Vyatta image, the behavior of the > called function in qemu is as follows: > > 1. vhost_net_stop() was triggered by guest image . This will call the function > virtio_pci_set_guest_notifiers() with assgin= false, and > virtio_pci_set_guest_notifiers() will release the irqfd for vector 0 > > 2. virtio_reset() was called -->set configure vector to VIRTIO_NO_VECTOR > > 3.vhost_net_start() was called (at this time, the configure vector is > still VIRTIO_NO_VECTOR) and call virtio_pci_set_guest_notifiers() with > assgin= true, so the irqfd for vector 0 is still not "init" during this > process > > 4. The system continues to boot,set the vector back to 0, and > msix_fire_vector_notifier() was triggered > unmask the vector 0 and then met the crash > [msix_fire_vector_notifier] 112 called vector 0 is_masked 1 > [msix_fire_vector_notifier] 112 called vector 0 is_masked 0 > > To fix this, we need to call the function "kvm_virtio_pci_vector_use_one()" > when the vector changes back from VIRTIO_NO_VECTOR. > > The reason that we don't need to call kvm_virtio_pci_vector_release_one while > the vector changes to > VIRTIO_NO_VECTOR is this function will called in vhost_net_stop(), > So this step will not lost during this process. > > Change from V1 > 1.add the check for if using irqfd > 2.remove the check for bool recovery, irqfd's user is enough to check status > > Cindy Lu (1): > virtio-pci: Fix the crash that the vector was used after released. > > hw/virtio/virtio-pci.c | 35 +++ > 1 file changed, 35 insertions(+) > > -- > 2.43.0 >
Re: [PATCH v2 1/1] virtio-pci: Fix the crash that the vector was used after released.
Sorry, send to the wrong mail list, please ignore it On Wed, Apr 10, 2024 at 12:35 PM Cindy Lu wrote: > > When the guest triggers vhost_stop and then virtio_reset, the vector will the > IRQFD for this vector will be released and change to VIRTIO_NO_VECTOR. > After that, the guest called vhost_net_start, (at this time, the configure > vector is still VIRTIO_NO_VECTOR), vector 0 still was not "init". > The guest system continued to boot, set the vector back to 0, and then met > the crash. > > To fix this, we need to call the function "kvm_virtio_pci_vector_use_one()" > when the vector changes back from VIRTIO_NO_VECTOR > > (gdb) bt > 0 __pthread_kill_implementation (threadid=, > signo=signo@entry=6, no_tid=no_tid@entry=0) > at pthread_kill.c:44 > 1 0x7fc87148ec53 in __pthread_kill_internal (signo=6, > threadid=) at pthread_kill.c:78 > 2 0x7fc87143e956 in __GI_raise (sig=sig@entry=6) at > ../sysdeps/posix/raise.c:26 > 3 0x7fc8714287f4 in __GI_abort () at abort.c:79 > 4 0x7fc87142871b in __assert_fail_base > (fmt=0x7fc8715bbde0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", > assertion=0x5606413efd53 "ret == 0", file=0x5606413ef87d > "../accel/kvm/kvm-all.c", line=1837, function=) at assert.c:92 > 5 0x7fc871437536 in __GI___assert_fail > (assertion=0x5606413efd53 "ret == 0", file=0x5606413ef87d > "../accel/kvm/kvm-all.c", line=1837, function=0x5606413f06f0 > <__PRETTY_FUNCTION__.19> "kvm_irqchip_commit_routes") at assert.c:101 > 6 0x560640f884b5 in kvm_irqchip_commit_routes (s=0x560642cae1f0) at > ../accel/kvm/kvm-all.c:1837 > 7 0x560640c98f8e in virtio_pci_one_vector_unmask > (proxy=0x560643c65f00, queue_no=4294967295, vector=0, msg=..., > n=0x560643c6e4c8) > at ../hw/virtio/virtio-pci.c:1005 > 8 0x560640c99201 in virtio_pci_vector_unmask (dev=0x560643c65f00, > vector=0, msg=...) > at ../hw/virtio/virtio-pci.c:1070 > 9 0x560640bc402e in msix_fire_vector_notifier (dev=0x560643c65f00, > vector=0, is_masked=false) > at ../hw/pci/msix.c:120 > 10 0x560640bc40f1 in msix_handle_mask_update (dev=0x560643c65f00, > vector=0, was_masked=true) > at ../hw/pci/msix.c:140 > 11 0x560640bc4503 in msix_table_mmio_write (opaque=0x560643c65f00, > addr=12, val=0, size=4) > at ../hw/pci/msix.c:231 > 12 0x560640f26d83 in memory_region_write_accessor > (mr=0x560643c66540, addr=12, value=0x7fc86b7bc628, size=4, shift=0, > mask=4294967295, attrs=...) > at ../system/memory.c:497 > 13 0x560640f270a6 in access_with_adjusted_size > > (addr=12, value=0x7fc86b7bc628, size=4, access_size_min=1, > access_size_max=4, access_fn=0x560640f26c8d , > mr=0x560643c66540, attrs=...) at ../system/memory.c:573 > 14 0x560640f2a2b5 in memory_region_dispatch_write (mr=0x560643c66540, > addr=12, data=0, op=MO_32, attrs=...) > at ../system/memory.c:1521 > 15 0x560640f37bac in flatview_write_continue > (fv=0x7fc65805e0b0, addr=4273803276, attrs=..., ptr=0x7fc871e9c028, > len=4, addr1=12, l=4, mr=0x560643c66540) > at ../system/physmem.c:2714 > 16 0x560640f37d0f in flatview_write > (fv=0x7fc65805e0b0, addr=4273803276, attrs=..., buf=0x7fc871e9c028, > len=4) at ../system/physmem.c:2756 > 17 0x560640f380bf in address_space_write > (as=0x560642161ae0 , addr=4273803276, attrs=..., > buf=0x7fc871e9c028, len=4) > at ../system/physmem.c:2863 > 18 0x560640f3812c in address_space_rw > (as=0x560642161ae0 , addr=4273803276, attrs=..., > buf=0x7fc871e9c028, len=4, is_write=true) at ../system/physmem.c:2873 > --Type for more, q to quit, c to continue without paging-- > 19 0x560640f8aa55 in kvm_cpu_exec (cpu=0x560642f205e0) at > ../accel/kvm/kvm-all.c:2915 > 20 0x560640f8d731 in kvm_vcpu_thread_fn (arg=0x560642f205e0) at > ../accel/kvm/kvm-accel-ops.c:51 > 21 0x5606411949f4 in qemu_thread_start (args=0x560642f292b0) at > ../util/qemu-thread-posix.c:541 > 22 0x7fc87148cdcd in start_thread (arg=) at > pthread_create.c:442 > 23 0x7fc871512630 in clone3 () at > ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 > (gdb) > Signed-off-by: Cindy Lu > --- > hw/virtio/virtio-pci.c | 35 +++ > 1 file changed, 35 insertions(+) > > diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c > index 1a7039fb0c..344f4fb844 100644 > --- a/hw/virtio/virtio-pci.c > +++ b/hw/virtio/virtio-pci.c > @@ -880,6 +880,7 @@ static int kvm_virtio_pci_vector_use_one(VirtIOPCIProxy > *proxy, int queue_no) > int ret; > EventNotifier *n; > PCIDevice *dev = &proxy->pci_dev; > +VirtIOIRQFD *irqfd; > VirtIODevice *vdev = virtio_bus_get_device(&proxy->bus); > VirtioDeviceClass *k = VIRTIO_DEVICE_GET_CLASS(vdev); > > @@ -890,10 +891,19 @@ static int kvm_virtio_pci_vector_use_one(VirtIOPCIProxy > *proxy, int queue_no) > if (vector >= msix_nr_vectors_allocated(dev)) { > return 0; > } > +/* > + * if the irqfd still in use,
[PATCH v2 1/1] virtio-pci: Fix the crash that the vector was used after released.
When the guest triggers vhost_stop and then virtio_reset, the vector will the IRQFD for this vector will be released and change to VIRTIO_NO_VECTOR. After that, the guest called vhost_net_start, (at this time, the configure vector is still VIRTIO_NO_VECTOR), vector 0 still was not "init". The guest system continued to boot, set the vector back to 0, and then met the crash. To fix this, we need to call the function "kvm_virtio_pci_vector_use_one()" when the vector changes back from VIRTIO_NO_VECTOR (gdb) bt 0 __pthread_kill_implementation (threadid=, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44 1 0x7fc87148ec53 in __pthread_kill_internal (signo=6, threadid=) at pthread_kill.c:78 2 0x7fc87143e956 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 3 0x7fc8714287f4 in __GI_abort () at abort.c:79 4 0x7fc87142871b in __assert_fail_base (fmt=0x7fc8715bbde0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5606413efd53 "ret == 0", file=0x5606413ef87d "../accel/kvm/kvm-all.c", line=1837, function=) at assert.c:92 5 0x7fc871437536 in __GI___assert_fail (assertion=0x5606413efd53 "ret == 0", file=0x5606413ef87d "../accel/kvm/kvm-all.c", line=1837, function=0x5606413f06f0 <__PRETTY_FUNCTION__.19> "kvm_irqchip_commit_routes") at assert.c:101 6 0x560640f884b5 in kvm_irqchip_commit_routes (s=0x560642cae1f0) at ../accel/kvm/kvm-all.c:1837 7 0x560640c98f8e in virtio_pci_one_vector_unmask (proxy=0x560643c65f00, queue_no=4294967295, vector=0, msg=..., n=0x560643c6e4c8) at ../hw/virtio/virtio-pci.c:1005 8 0x560640c99201 in virtio_pci_vector_unmask (dev=0x560643c65f00, vector=0, msg=...) at ../hw/virtio/virtio-pci.c:1070 9 0x560640bc402e in msix_fire_vector_notifier (dev=0x560643c65f00, vector=0, is_masked=false) at ../hw/pci/msix.c:120 10 0x560640bc40f1 in msix_handle_mask_update (dev=0x560643c65f00, vector=0, was_masked=true) at ../hw/pci/msix.c:140 11 0x560640bc4503 in msix_table_mmio_write (opaque=0x560643c65f00, addr=12, val=0, size=4) at ../hw/pci/msix.c:231 12 0x560640f26d83 in memory_region_write_accessor (mr=0x560643c66540, addr=12, value=0x7fc86b7bc628, size=4, shift=0, mask=4294967295, attrs=...) at ../system/memory.c:497 13 0x560640f270a6 in access_with_adjusted_size (addr=12, value=0x7fc86b7bc628, size=4, access_size_min=1, access_size_max=4, access_fn=0x560640f26c8d , mr=0x560643c66540, attrs=...) at ../system/memory.c:573 14 0x560640f2a2b5 in memory_region_dispatch_write (mr=0x560643c66540, addr=12, data=0, op=MO_32, attrs=...) at ../system/memory.c:1521 15 0x560640f37bac in flatview_write_continue (fv=0x7fc65805e0b0, addr=4273803276, attrs=..., ptr=0x7fc871e9c028, len=4, addr1=12, l=4, mr=0x560643c66540) at ../system/physmem.c:2714 16 0x560640f37d0f in flatview_write (fv=0x7fc65805e0b0, addr=4273803276, attrs=..., buf=0x7fc871e9c028, len=4) at ../system/physmem.c:2756 17 0x560640f380bf in address_space_write (as=0x560642161ae0 , addr=4273803276, attrs=..., buf=0x7fc871e9c028, len=4) at ../system/physmem.c:2863 18 0x560640f3812c in address_space_rw (as=0x560642161ae0 , addr=4273803276, attrs=..., buf=0x7fc871e9c028, len=4, is_write=true) at ../system/physmem.c:2873 --Type for more, q to quit, c to continue without paging-- 19 0x560640f8aa55 in kvm_cpu_exec (cpu=0x560642f205e0) at ../accel/kvm/kvm-all.c:2915 20 0x560640f8d731 in kvm_vcpu_thread_fn (arg=0x560642f205e0) at ../accel/kvm/kvm-accel-ops.c:51 21 0x5606411949f4 in qemu_thread_start (args=0x560642f292b0) at ../util/qemu-thread-posix.c:541 22 0x7fc87148cdcd in start_thread (arg=) at pthread_create.c:442 23 0x7fc871512630 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 (gdb) Signed-off-by: Cindy Lu --- hw/virtio/virtio-pci.c | 35 +++ 1 file changed, 35 insertions(+) diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c index 1a7039fb0c..344f4fb844 100644 --- a/hw/virtio/virtio-pci.c +++ b/hw/virtio/virtio-pci.c @@ -880,6 +880,7 @@ static int kvm_virtio_pci_vector_use_one(VirtIOPCIProxy *proxy, int queue_no) int ret; EventNotifier *n; PCIDevice *dev = &proxy->pci_dev; +VirtIOIRQFD *irqfd; VirtIODevice *vdev = virtio_bus_get_device(&proxy->bus); VirtioDeviceClass *k = VIRTIO_DEVICE_GET_CLASS(vdev); @@ -890,10 +891,19 @@ static int kvm_virtio_pci_vector_use_one(VirtIOPCIProxy *proxy, int queue_no) if (vector >= msix_nr_vectors_allocated(dev)) { return 0; } +/* + * if the irqfd still in use, means the irqfd was not + * release before and don't need to set up + */ +irqfd = &proxy->vector_irqfd[vector]; +if (irqfd->users != 0) { +return 0; +} ret = kvm_virtio_pci_vq_vector_use(proxy, vector); if (ret < 0) { goto undo; } + /* * If guest suppor
[PATCH v2 0/1] virtio-pci: Fix the crash that the vector was used after released
During the booting process of the Vyatta image, the behavior of the called function in qemu is as follows: 1. vhost_net_stop() was triggered by guest image . This will call the function virtio_pci_set_guest_notifiers() with assgin= false, and virtio_pci_set_guest_notifiers(??? will release the irqfd for vector 0 2. virtio_reset() was called -->set configure vector to VIRTIO_NO_VECTOR 3.vhost_net_start() was called (at this time, the configure vector is still VIRTIO_NO_VECTOR) and call virtio_pci_set_guest_notifiers() with assgin= true, so the irqfd for vector 0 is still not "init" during this process 4. The system continues to boot,set the vector back to 0, and msix_fire_vector_notifier() was triggered unmask the vector 0 and then met the crash [msix_fire_vector_notifier] 112 called vector 0 is_masked 1 [msix_fire_vector_notifier] 112 called vector 0 is_masked 0 To fix this, we need to call the function "kvm_virtio_pci_vector_use_one()" when the vector changes back from VIRTIO_NO_VECTOR. The reason that we don't need to call kvm_virtio_pci_vector_release_one while the vector changes to VIRTIO_NO_VECTOR is this function will called in vhost_net_stop(), So this step will not lost during this process. Change from V1 1.add the check for if using irqfd 2.remove the check for bool recovery, irqfd's user is enough to check status Cindy Lu (1): virtio-pci: Fix the crash that the vector was used after released. hw/virtio/virtio-pci.c | 35 +++ 1 file changed, 35 insertions(+) -- 2.43.0
[PATCH v5] vp_vdpa: don't allocate unused msix vectors
From: Yuxue Liu When there is a ctlq and it doesn't require interrupt callbacks,the original method of calculating vectors wastes hardware msi or msix resources as well as system IRQ resources. When conducting performance testing using testpmd in the guest os, it was found that the performance was lower compared to directly using vfio-pci to passthrough the device In scenarios where the virtio device in the guest os does not utilize interrupts, the vdpa driver still configures the hardware's msix vector. Therefore, the hardware still sends interrupts to the host os. Because of this unnecessary action by the hardware, hardware performance decreases, and it also affects the performance of the host os. Before modification:(interrupt mode) 32: 0 0 0 0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0 33: 0 0 0 0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1 34: 0 0 0 0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2 35: 0 0 0 0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config After modification:(interrupt mode) 32: 0 0 1 7 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-0 33: 36 0 3 0 PCI-MSI 32769-edge vp-vdpa[:00:02.0]-1 34: 0 0 0 0 PCI-MSI 32770-edge vp-vdpa[:00:02.0]-config Before modification:(virtio pmd mode for guest os) 32: 0 0 0 0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0 33: 0 0 0 0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1 34: 0 0 0 0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2 35: 0 0 0 0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config After modification:(virtio pmd mode for guest os) 32: 0 0 0 0 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-config To verify the use of the virtio PMD mode in the guest operating system, the following patch needs to be applied to QEMU: https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com Signed-off-by: Yuxue Liu Acked-by: Jason Wang Reviewed-by: Heng Qi --- V5: modify the description of the printout when an exception occurs V4: update the title and assign values to uninitialized variables V3: delete unused variables and add validation records V2: fix when allocating IRQs, scan all queues drivers/vdpa/virtio_pci/vp_vdpa.c | 22 -- 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c b/drivers/vdpa/virtio_pci/vp_vdpa.c index df5f4a3bccb5..8de0224e9ec2 100644 --- a/drivers/vdpa/virtio_pci/vp_vdpa.c +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c @@ -160,7 +160,13 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa) struct pci_dev *pdev = mdev->pci_dev; int i, ret, irq; int queues = vp_vdpa->queues; - int vectors = queues + 1; + int vectors = 1; + int msix_vec = 0; + + for (i = 0; i < queues; i++) { + if (vp_vdpa->vring[i].cb.callback) + vectors++; + } ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX); if (ret != vectors) { @@ -173,9 +179,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa) vp_vdpa->vectors = vectors; for (i = 0; i < queues; i++) { + if (!vp_vdpa->vring[i].cb.callback) + continue; + snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-%d\n", pci_name(pdev), i); - irq = pci_irq_vector(pdev, i); + irq = pci_irq_vector(pdev, msix_vec); ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_vq_handler, 0, vp_vdpa->vring[i].msix_name, @@ -185,21 +194,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa) "vp_vdpa: fail to request irq for vq %d\n", i); goto err; } - vp_modern_queue_vector(mdev, i, i); + vp_modern_queue_vector(mdev, i, msix_vec); vp_vdpa->vring[i].irq = irq; + msix_vec++; } snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n", pci_name(pdev)); - irq = pci_irq_vector(pdev, queues); + irq = pci_irq_vector(pdev, msix_vec); ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_config_handler, 0, vp_vdpa->msix_name, vp_vdpa); if (ret) { dev_err(&pdev->dev, - "vp_vdpa: fail to request irq for vq %d\n", i); + "vp_vdpa: fail to request irq for config: %d\n", ret); goto err; } - vp_modern_config_vector(mdev, queues); + vp_modern_config_vector(mdev, msix_vec); vp_vdpa->config_irq = irq; return 0; -- 2.43.0
Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
"Ho-Ren (Jack) Chuang" writes: > On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron > wrote: >> >> On Fri, 5 Apr 2024 00:07:06 + >> "Ho-Ren (Jack) Chuang" wrote: >> >> > The current implementation treats emulated memory devices, such as >> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory >> > (E820_TYPE_RAM). However, these emulated devices have different >> > characteristics than traditional DRAM, making it important to >> > distinguish them. Thus, we modify the tiered memory initialization process >> > to introduce a delay specifically for CPUless NUMA nodes. This delay >> > ensures that the memory tier initialization for these nodes is deferred >> > until HMAT information is obtained during the boot process. Finally, >> > demotion tables are recalculated at the end. >> > >> > * late_initcall(memory_tier_late_init); >> > Some device drivers may have initialized memory tiers between >> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing >> > online memory nodes and configuring memory tiers. They should be excluded >> > in the late init. >> > >> > * Handle cases where there is no HMAT when creating memory tiers >> > There is a scenario where a CPUless node does not provide HMAT information. >> > If no HMAT is specified, it falls back to using the default DRAM tier. >> > >> > * Introduce another new lock `default_dram_perf_lock` for adist calculation >> > In the current implementation, iterating through CPUlist nodes requires >> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up >> > trying to acquire the same lock, leading to a potential deadlock. >> > Therefore, we propose introducing a standalone `default_dram_perf_lock` to >> > protect `default_dram_perf_*`. This approach not only avoids deadlock >> > but also prevents holding a large lock simultaneously. >> > >> > * Upgrade `set_node_memory_tier` to support additional cases, including >> > default DRAM, late CPUless, and hot-plugged initializations. >> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and >> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to >> > handle cases where memtype is not initialized and where HMAT information is >> > available. >> > >> > * Introduce `default_memory_types` for those memory types that are not >> > initialized by device drivers. >> > Because late initialized memory and default DRAM memory need to be managed, >> > a default memory type is created for storing all memory types that are >> > not initialized by device drivers and as a fallback. >> > >> > Signed-off-by: Ho-Ren (Jack) Chuang >> > Signed-off-by: Hao Xiang >> > Reviewed-by: "Huang, Ying" >> >> Hi - one remaining question. Why can't we delay init for all nodes >> to either drivers or your fallback late_initcall code. >> It would be nice to reduce possible code paths. > > I try not to change too much of the existing code structure in > this patchset. > > To me, postponing/moving all memory tier registrations to > late_initcall() is another possible action item for the next patchset. > > After tier_mem(), hmat_init() is called, which requires registering > `default_dram_type` info. This is when `default_dram_type` is needed. > However, it is indeed possible to postpone the latter part, > set_node_memory_tier(), to `late_init(). So, memory_tier_init() can > indeed be split into two parts, and the latter part can be moved to > late_initcall() to be processed together. I don't think that it's good to move all memory_tier initialization in drivers to late_initcall(). It's natural to keep them in device_initcall() level. If so, we can allocate default_dram_type in memory_tier_init(), and call set_node_memory_tier() only in memory_tier_lateinit(). We can call memory_tier_lateinit() in device_initcall() level too. -- Best Regards, Huang, Ying > Doing this all memory-type drivers have to call late_initcall() to > register a memory tier. I’m not sure how many they are? > > What do you guys think? > >> >> Jonathan >> >> >> > --- >> > mm/memory-tiers.c | 94 +++ >> > 1 file changed, 70 insertions(+), 24 deletions(-) >> > >> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c >> > index 516b144fd45a..6632102bd5c9 100644 >> > --- a/mm/memory-tiers.c >> > +++ b/mm/memory-tiers.c >> >> >> >> > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void) >> >* For now we can have 4 faster memory tiers with smaller adistance >> >* than default DRAM tier. >> >*/ >> > - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM); >> > + default_dram_type = mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM, >> > + &default_memory_types); >> > if (IS_ERR(default_dram_type)) >> > panic("%s() failed to allocate default DRAM tier\n", >> > __func__); >> > >> > @@ -865,6 +903,14 @@ static int __init memory_ti
Re: [PATCHv3 1/2] dt-bindings: usb: typec: anx7688: start a binding document
On Mon, Apr 08, 2024 at 10:12:30PM GMT, Krzysztof Kozlowski wrote: > On 08/04/2024 17:17, Ondřej Jirman wrote: > > > > Now for things to not fail during suspend/resume based on PM callbacks > > invocation order, anx7688 driver needs to enable this regulator too, as long > > as it needs it. > > No, the I2C bus driver needs to manage it. Not one individual I2C > device. Again, why anx7688 is specific? If you next phone has anx8867, > using different driver, you also add there i2c-supply? And if it is > nxp,ptn5100 as well? Yes, that could work, if I2C core would manage this. > > > > I can put bus-supply to I2C controller node, and read it from the ANX7688 > > driver > > I guess, by going up a DT node. Whether that's going to be acceptable, I > > don't > > know. > > > > > > VCONN regulator I don't know where else to put either. It doesn't seem to > > belong > > anywhere. It's not something directly connected to Type-C connector, so > > not part of connector bindings, and there's nothing else I can see, other > > than anx7688 device which needs it for core functionality. > > That sounds like a GPIO, not regulator. anx7688 has GPIOs, right? On > Pinephone they go to regulator, but on FooPhone also using anx7688 they > go somewhere else, so why this anx7688 assumes this is a regulator? CC1/CC2_VCONN control pins are "GPIO" of anx7688, sort of. They have fixed purpose of switching external 5V regulator output to one of the CC pins on type-c port. I don't care what other purpose with some other firmware someone puts to those pins. It's irrelevant to the use case of anx7688 as a type-c controller/HDMI bridge, which we're describing here. VCONN regulator is an actual GPIO controlled regulator on the board, and needs to be controlled by the anx7688 driver. So that CC1/CC2_VCONN control pins driven by the firmware actually do what they're supposed to do. Not sure why it would be a business of anything else but anx7688 driver enabling this regulator, because only this driver knows and cares about this. If some other board doesn't have the need to manually enable the regulator, or doesn't have the regulator, it can simply be optional. There are also some other funky supplies in the bindings, that are not connected to the chip in any way, but need to be controlled by the driver: + vbus-supply: true + vbus-in-supply: true First one can be on the connector node instead, where the driver can fetch it from. The purpose of the second one is to link the Phone's PMIC's USB power input with the type-c controller (anx7688), to make sure the PMIC has information about how much power it can draw from external PSU. The second one can be replaced by rewriting the anx7688 driver so that it creates a power supply representing the USB PSU connected to the phone, and by linking to anx7688 DT node from x-powers,axp813-usb-power-supply via a power-supplies property, which would mean that USB input of the phone is supplied by the external USB PSU. PMIC driver can be modified to watch the power supply provided by anx7688 driver for information it detected via USB-PD and update its input current limit via a standard helper function. This is how eg. fusb302 works. Not sure if that's any better from the PoV of DT bindings, though. Because power-supplies = <&anx7688>; will not look any greater in DT bindings, IMO. It will just link the same nodes in the other direction. Anyway, the HW is that there's an external PSU (detected by type c controller) and internal USB power input, and they are connected and one has to respect the limits of the other. I guess I shouldn't be adding a device node for external PSU, since it's not part of the phone. But that's what's trully being linked on HW level. Kind regards, o. > > > > > ANX7688 chip desing doesn't have integrated VCONN mosfet switches so it > > always > > needs external supply + switches that are controlled by the chip itself. > > There's > > no sensible design where someone would not want this and the driver needs > > to get this regulator reference from somewhere. The switches are sort of an > > extension of the chip. > > > Best regards, > Krzysztof >
[PATCH v3] kprobes: Fix possible use-after-free issue on kprobe registration
When unloading a module, its state is changing MODULE_STATE_LIVE -> MODULE_STATE_GOING -> MODULE_STATE_UNFORMED. Each change will take a time. `is_module_text_address()` and `__module_text_address()` works with MODULE_STATE_LIVE and MODULE_STATE_GOING. If we use `is_module_text_address()` and `__module_text_address()` separately, there is a chance that the first one is succeeded but the next one is failed because module->state becomes MODULE_STATE_UNFORMED between those operations. In `check_kprobe_address_safe()`, if the second `__module_text_address()` is failed, that is ignored because it expected a kernel_text address. But it may have failed simply because module->state has been changed to MODULE_STATE_UNFORMED. In this case, arm_kprobe() will try to modify non-exist module text address (use-after-free). To fix this problem, we should not use separated `is_module_text_address()` and `__module_text_address()`, but use only `__module_text_address()` once and do `try_module_get(module)` which is only available with MODULE_STATE_LIVE. Signed-off-by: Zheng Yejian --- kernel/kprobes.c | 18 -- 1 file changed, 12 insertions(+), 6 deletions(-) v3: - Update commit messages as suggested by Masami. Link: https://lore.kernel.org/all/20240409224922.5f192e8ace5f7a90937bf...@kernel.org/ - Also change to a more appropriate title. v2: - Update commit messages and comments as suggested by Masami. Link: https://lore.kernel.org/all/20240408115038.b0c85767bf1f249eccc32...@kernel.org/ - Link: https://lore.kernel.org/all/20240408083403.3302274-1-zhengyeji...@huawei.com/ v1: - Link: https://lore.kernel.org/all/20240407035904.2556645-1-zhengyeji...@huawei.com/ diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 9d9095e81792..65adc815fc6e 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -1567,10 +1567,17 @@ static int check_kprobe_address_safe(struct kprobe *p, jump_label_lock(); preempt_disable(); - /* Ensure it is not in reserved area nor out of text */ - if (!(core_kernel_text((unsigned long) p->addr) || - is_module_text_address((unsigned long) p->addr)) || - in_gate_area_no_mm((unsigned long) p->addr) || + /* Ensure the address is in a text area, and find a module if exists. */ + *probed_mod = NULL; + if (!core_kernel_text((unsigned long) p->addr)) { + *probed_mod = __module_text_address((unsigned long) p->addr); + if (!(*probed_mod)) { + ret = -EINVAL; + goto out; + } + } + /* Ensure it is not in reserved area. */ + if (in_gate_area_no_mm((unsigned long) p->addr) || within_kprobe_blacklist((unsigned long) p->addr) || jump_label_text_reserved(p->addr, p->addr) || static_call_text_reserved(p->addr, p->addr) || @@ -1580,8 +1587,7 @@ static int check_kprobe_address_safe(struct kprobe *p, goto out; } - /* Check if 'p' is probing a module. */ - *probed_mod = __module_text_address((unsigned long) p->addr); + /* Get module refcount and reject __init functions for loaded modules. */ if (*probed_mod) { /* * We must hold a refcount of the probed module while updating -- 2.25.1
Re: [PATCH v2] kprobes: Avoid possible warn in __arm_kprobe_ftrace()
On 2024/4/9 21:49, Masami Hiramatsu (Google) wrote: On Tue, 9 Apr 2024 14:20:45 +0800 Zheng Yejian wrote: On 2024/4/8 20:41, Masami Hiramatsu (Google) wrote: Hi Zheng, On Mon, 8 Apr 2024 16:34:03 +0800 Zheng Yejian wrote: There is once warn in __arm_kprobe_ftrace() on: ret = ftrace_set_filter_ip(ops, (unsigned long)p->addr, 0, 0); if (WARN_ONCE(..., "Failed to arm kprobe-ftrace at %pS (error %d)\n", ...) return ret; This warning is generated because 'p->addr' is detected to be not a valid ftrace location in ftrace_set_filter_ip(). The ftrace address check is done by check_ftrace_location() at the beginning of check_kprobe_address_safe(). At that point, ftrace_location(addr) == addr should return true if the module is loaded. Then the module is searched twice: 1. in is_module_text_address(), we find that 'p->addr' is in a module; 2. in __module_text_address(), we find the module; If the module has just been unloaded before the second search, then '*probed_mod' is NULL and we would not go to get the module refcount, then the return value of check_kprobe_address_safe() would be 0, but actually we need to return -EINVAL. OK, so you found a race window in check_kprobe_address_safe(). It does something like below. check_kprobe_address_safe() { ... /* Timing [A] */ if (!(core_kernel_text(p->addr) || is_module_text_address(p->addr)) || ...(other reserved address check)) { return -EINVAL; } /* Timing [B] */ *probed_mod = __module_text_address(p->addr): if (*probe_mod) { if (!try_module_get(*probed_mod)) { return -ENOENT; } ... } } So, if p->addr is in a module which is alive at the timing [A], but unloaded at timing [B], 'p->addr' is passed the 'is_module_text_address(p->addr)' check, but *probed_mod becomes NULL. Thus the corresponding module is not referenced and kprobe_arm(p) will access a wrong address (use after free). This happens either kprobe on ftrace is enabled or not. Yes, This is the problem. And for this case, check_kprobe_address_safe() still return 0, and then going on to arm kprobe may cause problems. So we should make check_kprobe_address_safe() return -EINVAL when refcount of the module is not got. Yes, To fix this problem, we should move the mutex_lock(kprobe_mutex) before check_kprobe_address_safe() because kprobe_module_callback() also lock it so it can stop module unloading. Can you ensure this will fix your problem? It seems not, the warning in __arm_kprobe_ftrace() still occurs. I contrived following simple test: #!/bin/bash sysctl -w kernel.panic_on_warn=1 while [ True ]; do insmod mod.ko# contain function 'foo' rmmod mod.ko done & while [ True ]; do insmod kprobe.ko # register kprobe on function 'foo' rmmod kprobe.ko done & I think holding kprobe_mutex cannot make sure we get the refcount of the module. Aah, yes, it cannot, because the kallsyms in a module will be removed after module->state becomes MODULE_STATE_UNFORMED. Before UNFORMED, the state is MODULE_STATE_GOING and the kprobe_module_callback() is called at that point. Thus, the following scenario happens. CPU1 CPU2 mod->state = MODULE_STATE_GOING kprobe_module_callback() { mutex_lock(&kprobe_mutex) loop on kprobe_table to disable kprobe in the module. mutex_unlock(&kprobe_mutex) } register_kprobe(p) { mutex_lock(&kprobe_mutex) check_kprobe_address_safe(p->addr) { [A''] is_module_text_address() return true until mod->state == UNFORMED. mod->state = MODULE_STATE_UNFORMED [B''] __module_text_address() returns NULL. } p is on the kprobe_table. mutex_unlock(&kprobe_mutex) So, as your fix, if we save the module at [A''] and use it at [B''], the mod is NOT able to get because mod->state != MODULE_STATE_LIVE. I think your patch is just optimizing but not fixing the fundamental problem, which is we don't have an atomic search symbol and get module Sorry, this patch is a little confusing, but it is not just optimizing :) As shown below, after my patch, if p->addr is in
Re: [PATCH] ring-buffer: Only update pages_touched when a new page is touched
On Wed, 10 Apr 2024 08:44:00 +0900 Masami Hiramatsu (Google) wrote: > Looks good to me. > > Acked-by: Masami Hiramatsu (Google) Thanks. > > BTW, isn't this a real bugfix, because the page_touched can be > bigger than nr_pages without this fix? Yes, I simply forgot to add the Cc stable. -- Steve
Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL
On Mon, Apr 08, 2024 at 10:41:04PM +0900, Honggyu Kim wrote: > Hi Gregory, > > On Fri, 5 Apr 2024 12:56:14 -0400 Gregory Price > wrote: > > Do you have test results which enable only DAMOS_MIGRATE_COLD actions > > but not DAMOS_MIGRATE_HOT actions? (and vice versa) > > > > The question I have is exactly how often is MIGRATE_HOT actually being > > utilized, and how much data is being moved. Testing MIGRATE_COLD only > > would at least give a rough approximation of that. > > To explain this, I better share more test results. In the section of > "Evaluation Workload", the test sequence can be summarized as follows. > > *. "Turn on DAMON." > 1. Allocate cold memory(mmap+memset) at DRAM node, then make the > process sleep. > 2. Launch redis-server and load prebaked snapshot image, dump.rdb. > (85GB consumed: 52GB for anon and 33GB for file cache) Aha! I see now, you are allocating memory to ensure the real workload (redis-server) pressures the DRAM tier and causes "spillage" to the CXL tier, and then measure the overhead in different scenarios. I would still love to know what the result of a demote-only system would produce, mosty because it would very clearly demonstrate the value of the demote+promote system when the system is memory-pressured. Given the additional results below, it shows a demote-only system would likely trend toward CXL-only, and so this shows an affirmative support for the promotion logic. Just another datum that is useful and paints a more complete picture. > I didn't want to make the evaluation too long in the cover letter, but > I have also evaluated another senario, which lazyly enabled DAMON just > before YCSB run at step 4. I will call this test as "DAMON lazy". This > is missing part from the cover letter. > > 1. Allocate cold memory(mmap+memset) at DRAM node, then make the > process sleep. > 2. Launch redis-server and load prebaked snapshot image, dump.rdb. > (85GB consumed: 52GB for anon and 33GB for file cache) > *. "Turn on DAMON." > > In the "DAMON lazy" senario, DAMON started monitoring late so the > initial redis-server placement is same as "default", but started to > demote cold data and promote redis data just before YCSB run. > This is excellent and definitely demonstrates part of the picture I was alluding to, thank you for the additional data! > > I have included "DAMON lazy" result and also the migration size by new > DAMOS migrate actions. Please note that demotion size is way higher > than promotion because promotion target is only for redis data, but > demotion target includes huge cold memory allocated by mmap + memset. > (there could be some ping-pong issue though.) > > As you mentioned, "DAMON tiered" case gets more benefit because new > redis allocations go to DRAM more than "default", but it also gets > benefit from promotion when it is under higher memory pressure as shown > in 490G and 500G cases. It promotes 22GB and 17GB of redis data to DRAM > from CXL. I think a better way of saying this is that "DAMON tiered" more effectively mitigates the effect of memory-pressure on faster tier before spillage occurs, while "DAMON lazy" demonstrates the expected performance of the system after memory pressure outruns the demotion logic, so you wind up with hot data stuck in the slow tier. There are some out there that would simply say "just demote more aggressively", so this is useful information for the discussion. +/- ~2% despite greater meomry migration is an excellent result > > Can you also provide the DRAM-only results for each test? Presumably, > > as workload size increases from 440G to 500G, the system probably starts > > using some amount of swap/zswap/whatever. It would be good to know how > > this system compares to swap small amounts of overflow. > > It looks like my explanation doesn't correctly inform you. The size > from 440GB to 500GB is for pre allocated cold data to give memory > pressure on the system so that redis-server cannot be fully allocated at > fast DRAM, then partially allocated at CXL memory as well. > Yes, sorry for the misunderstanding. This makes it much clearer. > > I hope my explanation is helpful for you to understand. Please let me > know if you have more questions. > Excellent work, exciting results! Thank you for the additional answers :] ~Gregory
Re: [PATCH] tracing: Add new_exec tracepoint
On Tue, 9 Apr 2024 16:45:47 +0200 Marco Elver wrote: > On Tue, 9 Apr 2024 at 16:31, Steven Rostedt wrote: > > > > On Mon, 8 Apr 2024 11:01:54 +0200 > > Marco Elver wrote: > > > > > Add "new_exec" tracepoint, which is run right after the point of no > > > return but before the current task assumes its new exec identity. > > > > > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint > > > runs before flushing the old exec, i.e. while the task still has the > > > original state (such as original MM), but when the new exec either > > > succeeds or crashes (but never returns to the original exec). > > > > > > Being able to trace this event can be helpful in a number of use cases: > > > > > > * allowing tracing eBPF programs access to the original MM on exec, > > > before current->mm is replaced; > > > * counting exec in the original task (via perf event); > > > * profiling flush time ("new_exec" to "sched_process_exec"). > > > > > > Example of tracing output ("new_exec" and "sched_process_exec"): > > > > How common is this? And can't you just do the same with adding a kprobe? > > Our main use case would be to use this in BPF programs to become > exec-aware, where using the sched_process_exec hook is too late. This > is particularly important where the BPF program must stop inspecting > the user space's VM when the task does exec to become a new process. Just out of curiousity, would you like to audit that the user-program is not malformed? (security tracepoint?) I think that is an interesting idea. What kind of information you need? > > kprobe (or BPF's fentry) is brittle here, because begin_new_exec()'s > permission check can still return an error which returns to the > original task without crashing. Only at the point of no return are we > guaranteed that the exec either succeeds, or the task is terminated on > failure. Just a note: That is BPF limitation, kprobe and kprobe events can put a probe in the function body, but that is not supported on BPF (I guess because it depends on kernel debuginfo.) You can add kprobe-event using "perf probe" tool. Thank you, > > I don't know if "common" is the right question here, because it's a > chicken-egg problem: no tracepoint, we give up; we have the > tracepoint, it unlocks a range of new use cases (that require robust > solution to make BPF programs exec-aware, and a tracepoint is the only > option IMHO). > > Thanks, > -- Marco -- Masami Hiramatsu (Google)
Re: [PATCH] ring-buffer: Only update pages_touched when a new page is touched
On Tue, 9 Apr 2024 15:13:09 -0400 Steven Rostedt wrote: > From: "Steven Rostedt (Google)" > > The "buffer_percent" logic that is used by the ring buffer splice code to > only wake up the tasks when there's no data after the buffer is filled to > the percentage of the "buffer_percent" file is dependent on three > variables that determine the amount of data that is in the ring buffer: > > 1) pages_read - incremented whenever a new sub-buffer is consumed > 2) pages_lost - incremented every time a writer overwrites a sub-buffer > 3) pages_touched - incremented when a write goes to a new sub-buffer > > The percentage is the calculation of: > > (pages_touched - (pages_lost + pages_read)) / nr_pages > > Basically, the amount of data is the total number of sub-bufs that have been > touched, minus the number of sub-bufs lost and sub-bufs consumed. This is > divided by the total count to give the buffer percentage. When the > percentage is greater than the value in the "buffer_percent" file, it > wakes up splice readers waiting for that amount. > > It was observed that over time, the amount read from the splice was > constantly decreasing the longer the trace was running. That is, if one > asked for 60%, it would read over 60% when it first starts tracing, but > then it would be woken up at under 60% and would slowly decrease the > amount of data read after being woken up, where the amount becomes much > less than the buffer percent. > > This was due to an accounting of the pages_touched incrementation. This > value is incremented whenever a writer transfers to a new sub-buffer. But > the place where it was incremented was incorrect. If a writer overflowed > the current sub-buffer it would go to the next one. If it gets preempted > by an interrupt at that time, and the interrupt performs a trace, it too > will end up going to the next sub-buffer. But only one should increment > the counter. Unfortunately, that was not the case. > > Change the cmpxchg() that does the real switch of the tail-page into a > try_cmpxchg(), and on success, perform the increment of pages_touched. This > will only increment the counter once for when the writer moves to a new > sub-buffer, and not when there's a race and is incremented for when a > writer and its preempting writer both move to the same new sub-buffer. > Looks good to me. Acked-by: Masami Hiramatsu (Google) BTW, isn't this a real bugfix, because the page_touched can be bigger than nr_pages without this fix? Thank you, > Signed-off-by: Steven Rostedt (Google) > --- > kernel/trace/ring_buffer.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index 25476ead681b..6511dc3a00da 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -1393,7 +1393,6 @@ static void rb_tail_page_update(struct > ring_buffer_per_cpu *cpu_buffer, > old_write = local_add_return(RB_WRITE_INTCNT, &next_page->write); > old_entries = local_add_return(RB_WRITE_INTCNT, &next_page->entries); > > - local_inc(&cpu_buffer->pages_touched); > /* >* Just make sure we have seen our old_write and synchronize >* with any interrupts that come in. > @@ -1430,8 +1429,9 @@ static void rb_tail_page_update(struct > ring_buffer_per_cpu *cpu_buffer, >*/ > local_set(&next_page->page->commit, 0); > > - /* Again, either we update tail_page or an interrupt does */ > - (void)cmpxchg(&cpu_buffer->tail_page, tail_page, next_page); > + /* Either we update tail_page or an interrupt does */ > + if (try_cmpxchg(&cpu_buffer->tail_page, &tail_page, next_page)) > + local_inc(&cpu_buffer->pages_touched); > } > } > > -- > 2.43.0 > -- Masami Hiramatsu (Google)
Re: Copying TLS/user register data per perf-sample?
Hello, On Thu, Apr 4, 2024 at 12:26 PM Beau Belgrave wrote: > > Hello, > > I'm looking into the possibility of capturing user data that is pointed > to by a user register (IE: fs/gs for TLS on x86/64) for each sample via > perf_events. > > I was hoping to find a way to do this similar to PERF_SAMPLE_STACK_USER. > I think it could even use roughly the same ABI in the perf ring buffer. > Or it may be possible by some kprobe linked to the perf sample function. > > This would allow a profiler to collect TLS (or other values) on x64. In > the Open Telemetry profiling SIG [1], we are trying to find a fast way > to grab a tracing association quickly on a per-thread basis. The team > at Elastic has a bespoke way to do this [2], however, I'd like to see a > more general way to achieve this. The folks I've been talking with seem > open to the idea of just having a TLS value for this we could capture > upon each sample. We could then just state, Open Telemetry SDKs should > have a TLS value for span correlation. However, we need a way to sample > the TLS value(s) when a sampling event is generated. > > Is this already possible via some other means? It'd be great to be able > to do this directly at the perf_event sample via the ABI or a probe. I don't think the current perf ABI allows capturing %fs/%gs + offset. IIRC kprobes/uprobes don't have that too but I could be wrong. Thanks, Namhyung > > 1. https://opentelemetry.io/blog/2024/profiling/ > 2. > https://www.elastic.co/blog/continuous-profiling-distributed-tracing-correlation
Re: [External] Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
On Tue, Apr 9, 2024 at 2:50 PM Andrew Morton wrote: > > On Tue, 9 Apr 2024 12:00:06 -0700 "Ho-Ren (Jack) Chuang" > wrote: > > > Hi Jonathan, > > > > On Fri, Apr 5, 2024 at 6:56 AM Jonathan Cameron > > wrote: > > > > > > On Fri, 5 Apr 2024 00:07:05 + > > > "Ho-Ren (Jack) Chuang" wrote: > > > > > > > Since different memory devices require finding, allocating, and putting > > > > memory types, these common steps are abstracted in this patch, > > > > enhancing the scalability and conciseness of the code. > > > > > > > > Signed-off-by: Ho-Ren (Jack) Chuang > > > > Reviewed-by: "Huang, Ying" > > > Reviewed-by: Jonathan Cameron > > > > > Thank you for reviewing and for adding your "Reviewed-by"! > > I was wondering if I need to send a v12 and manually add > > this to the commit description, or if this is sufficient. > > I had added Jonathan's r-b to the mm.git copy of this patch. Got it~ Thank you Andrew! -- Best regards, Ho-Ren (Jack) Chuang 莊賀任
Re: [External] Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
On Tue, 9 Apr 2024 12:00:06 -0700 "Ho-Ren (Jack) Chuang" wrote: > Hi Jonathan, > > On Fri, Apr 5, 2024 at 6:56 AM Jonathan Cameron > wrote: > > > > On Fri, 5 Apr 2024 00:07:05 + > > "Ho-Ren (Jack) Chuang" wrote: > > > > > Since different memory devices require finding, allocating, and putting > > > memory types, these common steps are abstracted in this patch, > > > enhancing the scalability and conciseness of the code. > > > > > > Signed-off-by: Ho-Ren (Jack) Chuang > > > Reviewed-by: "Huang, Ying" > > Reviewed-by: Jonathan Cameron > > > Thank you for reviewing and for adding your "Reviewed-by"! > I was wondering if I need to send a v12 and manually add > this to the commit description, or if this is sufficient. I had added Jonathan's r-b to the mm.git copy of this patch.
Re: [PATCH] tracing: Add new_exec tracepoint
On Tue, Apr 09, 2024 at 08:25:45PM +0200, Marco Elver wrote: > On Tue, Apr 09, 2024 at 08:46AM -0700, Kees Cook wrote: > [...] > > > + trace_new_exec(current, bprm); > > > + > > > > All other steps in this function have explicit comments about > > what/why/etc. Please add some kind of comment describing why the > > tracepoint is where it is, etc. > > I beefed up the tracepoint documentation, and wrote a little paragraph > above where it's called to reinforce what we want. > > [...] > > What about binfmt_misc, and binfmt_script? You may want bprm->interp > > too? > > Good points. I'll make the below changes for v2: > > diff --git a/fs/exec.c b/fs/exec.c > index ab778ae1fc06..472b9f7b40e8 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1268,6 +1268,12 @@ int begin_new_exec(struct linux_binprm * bprm) > if (retval) > return retval; > > + /* > + * This tracepoint marks the point before flushing the old exec where > + * the current task is still unchanged, but errors are fatal (point of > + * no return). The later "sched_process_exec" tracepoint is called after > + * the current task has successfully switched to the new exec. > + */ > trace_new_exec(current, bprm); > > /* > diff --git a/include/trace/events/task.h b/include/trace/events/task.h > index 8853dc44783d..623d9af777c1 100644 > --- a/include/trace/events/task.h > +++ b/include/trace/events/task.h > @@ -61,8 +61,11 @@ TRACE_EVENT(task_rename, > * @task:pointer to the current task > * @bprm:pointer to linux_binprm used for new exec > * > - * Called before flushing the old exec, but at the point of no return during > - * switching to the new exec. > + * Called before flushing the old exec, where @task is still unchanged, but > at > + * the point of no return during switching to the new exec. At the point it > is > + * called the exec will either succeed, or on failure terminate the task. > Also > + * see the "sched_process_exec" tracepoint, which is called right after @task > + * has successfully switched to the new exec. > */ > TRACE_EVENT(new_exec, > > @@ -71,19 +74,22 @@ TRACE_EVENT(new_exec, > TP_ARGS(task, bprm), > > TP_STRUCT__entry( > + __string( interp, bprm->interp) > __string( filename, bprm->filename ) > __field(pid_t, pid ) > __string( comm, task->comm ) > ), > > TP_fast_assign( > + __assign_str(interp, bprm->interp); > __assign_str(filename, bprm->filename); > __entry->pid = task->pid; > __assign_str(comm, task->comm); > ), > > - TP_printk("filename=%s pid=%d comm=%s", > - __get_str(filename), __entry->pid, __get_str(comm)) > + TP_printk("interp=%s filename=%s pid=%d comm=%s", > + __get_str(interp), __get_str(filename), > + __entry->pid, __get_str(comm)) > ); > > #endif Looks good; I await v2, and Steven's Ack. :) -- Kees Cook
Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young
On 09.04.24 20:31, James Houghton wrote: Ah, I didn't see this in my inbox, sorry David! No worries :) On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand wrote: On 02.04.24 01:29, James Houghton wrote: diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index f349e08a9dfe..daaa9db625d3 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -61,6 +61,10 @@ enum mmu_notifier_event { #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) +#define MMU_NOTIFIER_YOUNG (1 << 0) +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1) Especially this one really deserves some documentation :) Yes, will do. Something like MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in bitmap either (1) does not accurately represent the age of the pages (in the case of test_young), or (2) was not able to be used to completely clear the age/access bit (in the case of clear_young). Make sense. I do wonder what the expected reaction from the caller is :) +#define MMU_NOTIFIER_YOUNG_FAST (1 << 2) And that one as well. Something like Indicates that (1) passing a bitmap ({test,clear}_young_bitmap) would have been supported for this address range. The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM is able to harvest the access bit "fast" (so for x86, locklessly, and for arm64, with the KVM MMU read lock), "fast" enough that using a bitmap to do look-around is probably a good idea. Is that really the right way to communicate that ("would have been supported") -- wouldn't we want to sense support differently? Likely best to briefly document all of them, and how they are supposed to be used (return value for X). Right. Will do. + struct mmu_notifier_ops { /* * Called either by mmu_notifier_unregister or when the mm is @@ -106,21 +110,36 @@ struct mmu_notifier_ops { * clear_young is a lightweight version of clear_flush_young. Like the * latter, it is supposed to test-and-clear the young/accessed bitflag * in the secondary pte, but it may omit flushing the secondary tlb. + * + * If @bitmap is given but is not supported, return + * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. + * + * If the walk is done "quickly" and there were young PTEs, + * MMU_NOTIFIER_YOUNG_FAST is returned. */ int (*clear_young)(struct mmu_notifier *subscription, struct mm_struct *mm, unsigned long start, -unsigned long end); +unsigned long end, +unsigned long *bitmap); /* * test_young is called to check the young/accessed bitflag in * the secondary pte. This is used to know if the page is * frequently used without actually clearing the flag or tearing * down the secondary mapping on the page. + * + * If @bitmap is given but is not supported, return + * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. + * + * If the walk is done "quickly" and there were young PTEs, + * MMU_NOTIFIER_YOUNG_FAST is returned. */ int (*test_young)(struct mmu_notifier *subscription, struct mm_struct *mm, - unsigned long address); + unsigned long start, + unsigned long end, + unsigned long *bitmap); What does "quickly" mean (why not use "fast")? What are the semantics, I don't find any existing usage of that in this file. "fast" means fast enough such that using a bitmap to scan adjacent pages (e.g. with MGLRU) is likely to be beneficial. I'll write more in this comment. Perhaps I should just rename it to MMU_NOTIFIER_YOUNG_BITMAP_SUPPORTED and drop the whole "likely to be beneficial" thing -- that's for MGLRU/etc. to decide really. Yes! Further, what is MMU_NOTIFIER_YOUNG you introduce used for? MMU_NOTIFIER_YOUNG is the return value when the page was young, but we (1) didn't use a bitmap, and (2) the "fast" access bit harvesting wasn't possible. In this case we simply return 1, which is MMU_NOTIFIER_YOUNG. I'll make kvm_mmu_notifier_test_clear_young() properly return MMU_NOTIFIER_YOUNG instead of relying on the fact that it will be 1. Yes, that will clarify it! -- Cheers, David / dhildenb
[PATCH] ring-buffer: Only update pages_touched when a new page is touched
From: "Steven Rostedt (Google)" The "buffer_percent" logic that is used by the ring buffer splice code to only wake up the tasks when there's no data after the buffer is filled to the percentage of the "buffer_percent" file is dependent on three variables that determine the amount of data that is in the ring buffer: 1) pages_read - incremented whenever a new sub-buffer is consumed 2) pages_lost - incremented every time a writer overwrites a sub-buffer 3) pages_touched - incremented when a write goes to a new sub-buffer The percentage is the calculation of: (pages_touched - (pages_lost + pages_read)) / nr_pages Basically, the amount of data is the total number of sub-bufs that have been touched, minus the number of sub-bufs lost and sub-bufs consumed. This is divided by the total count to give the buffer percentage. When the percentage is greater than the value in the "buffer_percent" file, it wakes up splice readers waiting for that amount. It was observed that over time, the amount read from the splice was constantly decreasing the longer the trace was running. That is, if one asked for 60%, it would read over 60% when it first starts tracing, but then it would be woken up at under 60% and would slowly decrease the amount of data read after being woken up, where the amount becomes much less than the buffer percent. This was due to an accounting of the pages_touched incrementation. This value is incremented whenever a writer transfers to a new sub-buffer. But the place where it was incremented was incorrect. If a writer overflowed the current sub-buffer it would go to the next one. If it gets preempted by an interrupt at that time, and the interrupt performs a trace, it too will end up going to the next sub-buffer. But only one should increment the counter. Unfortunately, that was not the case. Change the cmpxchg() that does the real switch of the tail-page into a try_cmpxchg(), and on success, perform the increment of pages_touched. This will only increment the counter once for when the writer moves to a new sub-buffer, and not when there's a race and is incremented for when a writer and its preempting writer both move to the same new sub-buffer. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 25476ead681b..6511dc3a00da 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1393,7 +1393,6 @@ static void rb_tail_page_update(struct ring_buffer_per_cpu *cpu_buffer, old_write = local_add_return(RB_WRITE_INTCNT, &next_page->write); old_entries = local_add_return(RB_WRITE_INTCNT, &next_page->entries); - local_inc(&cpu_buffer->pages_touched); /* * Just make sure we have seen our old_write and synchronize * with any interrupts that come in. @@ -1430,8 +1429,9 @@ static void rb_tail_page_update(struct ring_buffer_per_cpu *cpu_buffer, */ local_set(&next_page->page->commit, 0); - /* Again, either we update tail_page or an interrupt does */ - (void)cmpxchg(&cpu_buffer->tail_page, tail_page, next_page); + /* Either we update tail_page or an interrupt does */ + if (try_cmpxchg(&cpu_buffer->tail_page, &tail_page, next_page)) + local_inc(&cpu_buffer->pages_touched); } } -- 2.43.0
Re: [External] Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
Hi Jonathan, On Tue, Apr 9, 2024 at 9:12 AM Jonathan Cameron wrote: > > On Fri, 5 Apr 2024 15:43:47 -0700 > "Ho-Ren (Jack) Chuang" wrote: > > > On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron > > wrote: > > > > > > On Fri, 5 Apr 2024 00:07:06 + > > > "Ho-Ren (Jack) Chuang" wrote: > > > > > > > The current implementation treats emulated memory devices, such as > > > > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal > > > > memory > > > > (E820_TYPE_RAM). However, these emulated devices have different > > > > characteristics than traditional DRAM, making it important to > > > > distinguish them. Thus, we modify the tiered memory initialization > > > > process > > > > to introduce a delay specifically for CPUless NUMA nodes. This delay > > > > ensures that the memory tier initialization for these nodes is deferred > > > > until HMAT information is obtained during the boot process. Finally, > > > > demotion tables are recalculated at the end. > > > > > > > > * late_initcall(memory_tier_late_init); > > > > Some device drivers may have initialized memory tiers between > > > > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing > > > > online memory nodes and configuring memory tiers. They should be > > > > excluded > > > > in the late init. > > > > > > > > * Handle cases where there is no HMAT when creating memory tiers > > > > There is a scenario where a CPUless node does not provide HMAT > > > > information. > > > > If no HMAT is specified, it falls back to using the default DRAM tier. > > > > > > > > * Introduce another new lock `default_dram_perf_lock` for adist > > > > calculation > > > > In the current implementation, iterating through CPUlist nodes requires > > > > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end > > > > up > > > > trying to acquire the same lock, leading to a potential deadlock. > > > > Therefore, we propose introducing a standalone `default_dram_perf_lock` > > > > to > > > > protect `default_dram_perf_*`. This approach not only avoids deadlock > > > > but also prevents holding a large lock simultaneously. > > > > > > > > * Upgrade `set_node_memory_tier` to support additional cases, including > > > > default DRAM, late CPUless, and hot-plugged initializations. > > > > To cover hot-plugged memory nodes, `mt_calc_adistance()` and > > > > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to > > > > handle cases where memtype is not initialized and where HMAT > > > > information is > > > > available. > > > > > > > > * Introduce `default_memory_types` for those memory types that are not > > > > initialized by device drivers. > > > > Because late initialized memory and default DRAM memory need to be > > > > managed, > > > > a default memory type is created for storing all memory types that are > > > > not initialized by device drivers and as a fallback. > > > > > > > > Signed-off-by: Ho-Ren (Jack) Chuang > > > > Signed-off-by: Hao Xiang > > > > Reviewed-by: "Huang, Ying" > > > > > > Hi - one remaining question. Why can't we delay init for all nodes > > > to either drivers or your fallback late_initcall code. > > > It would be nice to reduce possible code paths. > > > > I try not to change too much of the existing code structure in > > this patchset. > > > > To me, postponing/moving all memory tier registrations to > > late_initcall() is another possible action item for the next patchset. > > > > After tier_mem(), hmat_init() is called, which requires registering > > `default_dram_type` info. This is when `default_dram_type` is needed. > > However, it is indeed possible to postpone the latter part, > > set_node_memory_tier(), to `late_init(). So, memory_tier_init() can > > indeed be split into two parts, and the latter part can be moved to > > late_initcall() to be processed together. > > > > Doing this all memory-type drivers have to call late_initcall() to > > register a memory tier. I’m not sure how many they are? > > > > What do you guys think? > > Gut feeling - if you are going to move it for some cases, move it for > all of them. Then we only have to test once ;) > > J Thank you for your reminder! I agree~ That's why I'm considering changing them in the next patchset because of the amount of changes. And also, this patchset already contains too many things. > > > > > > > > Jonathan > > > > > > > > > > --- > > > > mm/memory-tiers.c | 94 +++ > > > > 1 file changed, 70 insertions(+), 24 deletions(-) > > > > > > > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > > > > index 516b144fd45a..6632102bd5c9 100644 > > > > --- a/mm/memory-tiers.c > > > > +++ b/mm/memory-tiers.c > > > > > > > > > > > > > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void) > > > >* For now we can have 4 faster memory tiers with smaller > > > > adistance > > > >* than default DRAM tier. > > > >*/ > > > > - default_dram_
Re: [External] Re: [PATCH v11 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
Hi Jonathan, On Fri, Apr 5, 2024 at 6:56 AM Jonathan Cameron wrote: > > On Fri, 5 Apr 2024 00:07:05 + > "Ho-Ren (Jack) Chuang" wrote: > > > Since different memory devices require finding, allocating, and putting > > memory types, these common steps are abstracted in this patch, > > enhancing the scalability and conciseness of the code. > > > > Signed-off-by: Ho-Ren (Jack) Chuang > > Reviewed-by: "Huang, Ying" > Reviewed-by: Jonathan Cameron > Thank you for reviewing and for adding your "Reviewed-by"! I was wondering if I need to send a v12 and manually add this to the commit description, or if this is sufficient. -- Best regards, Ho-Ren (Jack) Chuang 莊賀任
Re: [PATCH v2 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name
On Tue, 9 Apr 2024 at 21:37, Luca Weiss wrote: > > Follow the gpio-hog bindings and use otg-hog as node name. > > Signed-off-by: Luca Weiss > --- > arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > Reviewed-by: Dmitry Baryshkov -- With best wishes Dmitry
[PATCH v2 2/2] ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name
Follow the gpio-hog bindings and use otg-hog as node name. Signed-off-by: Luca Weiss --- arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts b/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts index 4aaae8537a3f..06549051be50 100644 --- a/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts +++ b/arch/arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts @@ -328,7 +328,7 @@ wlan_regulator_pin: wl-reg-active-state { power-source = ; }; - otg { + otg-hog { gpio-hog; gpios = <35 GPIO_ACTIVE_HIGH>; output-high; -- 2.44.0
[PATCH v2 1/2] dt-bindings: pinctrl: qcom,pmic-gpio: Allow gpio-hog nodes
Allow specifying a GPIO hog, as already used on qcom-msm8974-lge-nexus5-hammerhead.dts. Signed-off-by: Luca Weiss --- .../devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml | 12 1 file changed, 12 insertions(+) diff --git a/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml b/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml index a786357ed1af..bd9471de0c69 100644 --- a/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml +++ b/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml @@ -424,6 +424,10 @@ patternProperties: $ref: "#/$defs/qcom-pmic-gpio-state" additionalProperties: false + "-hog(-[0-9]+)?$": +required: + - gpio-hog + $defs: qcom-pmic-gpio-state: type: object @@ -571,6 +575,7 @@ $defs: examples: - | +#include #include pm8921_gpio: gpio@150 { @@ -594,5 +599,12 @@ examples: power-source = ; }; }; + + otg-hog { +gpio-hog; +gpios = <35 GPIO_ACTIVE_HIGH>; +output-high; +line-name = "otg-gpio"; + }; }; ... -- 2.44.0
[PATCH v2 0/2] Allow gpio-hog nodes in qcom,pmic-gpio bindings (& dt fixup)
Resolve the dt validation failure on Nexus 5. Signed-off-by: Luca Weiss --- Changes in v2: - Use simpler regex from tlmm bindings (Krzysztof) - Link to v1: https://lore.kernel.org/r/20240408-qcom-pmic-gpio-hog-v1-0-f61fc5323...@z3ntu.xyz --- Luca Weiss (2): dt-bindings: pinctrl: qcom,pmic-gpio: Allow gpio-hog nodes ARM: dts: qcom: msm8974-hammerhead: Update gpio hog node name .../devicetree/bindings/pinctrl/qcom,pmic-gpio.yaml | 12 .../arm/boot/dts/qcom/qcom-msm8974-lge-nexus5-hammerhead.dts | 2 +- 2 files changed, 13 insertions(+), 1 deletion(-) --- base-commit: 8568bb2ccc278f344e6ac44af6ed010a90aa88dc change-id: 20240408-qcom-pmic-gpio-hog-2b4c5f103126 Best regards, -- Luca Weiss
Re: [PATCH 0/3] Fix up qcom,halt-regs definition in various schemas
On Dienstag, 9. April 2024 17:10:41 CEST Rob Herring wrote: > On Sun, Apr 07, 2024 at 11:58:29AM +0200, Luca Weiss wrote: > > The original motivation is that a bunch of other schemas fail to > > validate qcom,halt-regs, for example like in the following examples: > > > > arch/arm64/boot/dts/qcom/apq8016-sbc.dtb: remoteproc@408: > > qcom,halt-regs:0: [20] is too short > > from schema $id: > > http://devicetree.org/schemas/remoteproc/qcom,msm8916-mss-pil.yaml# > > arch/arm64/boot/dts/qcom/apq8096-ifc6640.dtb: remoteproc@208: > > qcom,halt-regs:0: [82] is too short > > from schema $id: > > http://devicetree.org/schemas/remoteproc/qcom,msm8996-mss-pil.yaml# > > arch/arm64/boot/dts/qcom/apq8039-t2.dtb: remoteproc@408: > > qcom,halt-regs:0: [32] is too short > > from schema $id: > > http://devicetree.org/schemas/remoteproc/qcom,msm8916-mss-pil.yaml# > > > > While I'm actually not quite sure why these patches fix this in > > the other schemas - feels like a bug/limitation in dt-schema maybe? - > > Was this with v2024.02? It should be a bit better there. Though it > may just have different errors. The limitation is that property > types and in the case of matrix's (which phandle-array actually is) > range for dimensions are global. So if there's not correct dimensions > for a property, the tools aren't going to decode it properly. You're right, I doesn't look like I can reproduce this with the latest dtschema installed. Anyways these patches should be good to actually validate qcom,halt-regs for the schemas I'm touching here. Regards Luca > > Rob >
Re: [PATCH v3 1/7] mm: Add a bitmap into mmu_notifier_{clear,test}_young
Ah, I didn't see this in my inbox, sorry David! On Thu, Apr 4, 2024 at 11:52 AM David Hildenbrand wrote: > > On 02.04.24 01:29, James Houghton wrote: > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > > index f349e08a9dfe..daaa9db625d3 100644 > > --- a/include/linux/mmu_notifier.h > > +++ b/include/linux/mmu_notifier.h > > @@ -61,6 +61,10 @@ enum mmu_notifier_event { > > > > #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0) > > > > +#define MMU_NOTIFIER_YOUNG (1 << 0) > > +#define MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE (1 << 1) > > Especially this one really deserves some documentation :) Yes, will do. Something like MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE indicates that the passed-in bitmap either (1) does not accurately represent the age of the pages (in the case of test_young), or (2) was not able to be used to completely clear the age/access bit (in the case of clear_young). > > > +#define MMU_NOTIFIER_YOUNG_FAST (1 << 2) > > And that one as well. Something like Indicates that (1) passing a bitmap ({test,clear}_young_bitmap) would have been supported for this address range. The name MMU_NOTIFIER_YOUNG_FAST really comes from the fact that KVM is able to harvest the access bit "fast" (so for x86, locklessly, and for arm64, with the KVM MMU read lock), "fast" enough that using a bitmap to do look-around is probably a good idea. > > Likely best to briefly document all of them, and how they are > supposed to be used (return value for X). Right. Will do. > > > + > > struct mmu_notifier_ops { > > /* > >* Called either by mmu_notifier_unregister or when the mm is > > @@ -106,21 +110,36 @@ struct mmu_notifier_ops { > >* clear_young is a lightweight version of clear_flush_young. Like the > >* latter, it is supposed to test-and-clear the young/accessed bitflag > >* in the secondary pte, but it may omit flushing the secondary tlb. > > + * > > + * If @bitmap is given but is not supported, return > > + * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. > > + * > > + * If the walk is done "quickly" and there were young PTEs, > > + * MMU_NOTIFIER_YOUNG_FAST is returned. > >*/ > > int (*clear_young)(struct mmu_notifier *subscription, > > struct mm_struct *mm, > > unsigned long start, > > -unsigned long end); > > +unsigned long end, > > +unsigned long *bitmap); > > > > /* > >* test_young is called to check the young/accessed bitflag in > >* the secondary pte. This is used to know if the page is > >* frequently used without actually clearing the flag or tearing > >* down the secondary mapping on the page. > > + * > > + * If @bitmap is given but is not supported, return > > + * MMU_NOTIFIER_YOUNG_BITMAP_UNRELIABLE. > > + * > > + * If the walk is done "quickly" and there were young PTEs, > > + * MMU_NOTIFIER_YOUNG_FAST is returned. > >*/ > > int (*test_young)(struct mmu_notifier *subscription, > > struct mm_struct *mm, > > - unsigned long address); > > + unsigned long start, > > + unsigned long end, > > + unsigned long *bitmap); > > What does "quickly" mean (why not use "fast")? What are the semantics, I > don't find any existing usage of that in this file. "fast" means fast enough such that using a bitmap to scan adjacent pages (e.g. with MGLRU) is likely to be beneficial. I'll write more in this comment. Perhaps I should just rename it to MMU_NOTIFIER_YOUNG_BITMAP_SUPPORTED and drop the whole "likely to be beneficial" thing -- that's for MGLRU/etc. to decide really. > > Further, what is MMU_NOTIFIER_YOUNG you introduce used for? MMU_NOTIFIER_YOUNG is the return value when the page was young, but we (1) didn't use a bitmap, and (2) the "fast" access bit harvesting wasn't possible. In this case we simply return 1, which is MMU_NOTIFIER_YOUNG. I'll make kvm_mmu_notifier_test_clear_young() properly return MMU_NOTIFIER_YOUNG instead of relying on the fact that it will be 1. Thanks David!
Re: [PATCH] tracing: Add new_exec tracepoint
On Tue, Apr 09, 2024 at 08:46AM -0700, Kees Cook wrote: [...] > > + trace_new_exec(current, bprm); > > + > > All other steps in this function have explicit comments about > what/why/etc. Please add some kind of comment describing why the > tracepoint is where it is, etc. I beefed up the tracepoint documentation, and wrote a little paragraph above where it's called to reinforce what we want. [...] > What about binfmt_misc, and binfmt_script? You may want bprm->interp > too? Good points. I'll make the below changes for v2: diff --git a/fs/exec.c b/fs/exec.c index ab778ae1fc06..472b9f7b40e8 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1268,6 +1268,12 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) return retval; + /* +* This tracepoint marks the point before flushing the old exec where +* the current task is still unchanged, but errors are fatal (point of +* no return). The later "sched_process_exec" tracepoint is called after +* the current task has successfully switched to the new exec. +*/ trace_new_exec(current, bprm); /* diff --git a/include/trace/events/task.h b/include/trace/events/task.h index 8853dc44783d..623d9af777c1 100644 --- a/include/trace/events/task.h +++ b/include/trace/events/task.h @@ -61,8 +61,11 @@ TRACE_EVENT(task_rename, * @task: pointer to the current task * @bprm: pointer to linux_binprm used for new exec * - * Called before flushing the old exec, but at the point of no return during - * switching to the new exec. + * Called before flushing the old exec, where @task is still unchanged, but at + * the point of no return during switching to the new exec. At the point it is + * called the exec will either succeed, or on failure terminate the task. Also + * see the "sched_process_exec" tracepoint, which is called right after @task + * has successfully switched to the new exec. */ TRACE_EVENT(new_exec, @@ -71,19 +74,22 @@ TRACE_EVENT(new_exec, TP_ARGS(task, bprm), TP_STRUCT__entry( + __string( interp, bprm->interp) __string( filename, bprm->filename ) __field(pid_t, pid ) __string( comm, task->comm ) ), TP_fast_assign( + __assign_str(interp, bprm->interp); __assign_str(filename, bprm->filename); __entry->pid = task->pid; __assign_str(comm, task->comm); ), - TP_printk("filename=%s pid=%d comm=%s", - __get_str(filename), __entry->pid, __get_str(comm)) + TP_printk("interp=%s filename=%s pid=%d comm=%s", + __get_str(interp), __get_str(filename), + __entry->pid, __get_str(comm)) ); #endif
Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion
Hi Honggyu, On Tue, 9 Apr 2024 18:54:14 +0900 Honggyu Kim wrote: > On Mon, 8 Apr 2024 10:52:28 -0700 SeongJae Park wrote: > > On Mon, 8 Apr 2024 21:06:44 +0900 Honggyu Kim wrote: > > > On Fri, 5 Apr 2024 12:24:30 -0700 SeongJae Park wrote: > > > > On Fri, 5 Apr 2024 15:08:54 +0900 Honggyu Kim > > > > wrote: [...] > > > I can remove it, but I would like to have more discussion about this > > > issue. The current implementation allows only a single migration > > > target with "target_nid", but users might want to provide fall back > > > migration target nids. > > > > > > For example, if more than two CXL nodes exist in the system, users might > > > want to migrate cold pages to any CXL nodes. In such cases, we might > > > have to make "target_nid" accept comma separated node IDs. nodemask can > > > be better but we should provide a way to change the scanning order. > > > > > > I would like to hear how you think about this. > > > > Good point. I think we could later extend the sysfs file to receive the > > comma-separated numbers, or even mask. For simplicity, adding sysfs files > > dedicated for the different format of inputs could also be an option (e.g., > > target_nids_list, target_nids_mask). But starting from this single node as > > is > > now looks ok to me. > > If you think we can start from a single node, then I will keep it as is. > But are you okay if I change the same 'target_nid' to accept > comma-separated numbers later? Or do you want to introduce another knob > such as 'target_nids_list'? What about rename 'target_nid' to > 'target_nids' at the first place? I have no strong concern or opinion about this at the moment. Please feel free to renaming it to 'taget_nids' if you think that's better. [...] > Please note that I will be out of office this week so won't be able to > answer quickly. No problem, I hope you to take and enjoy your time :) Thanks, SJ [...]
Re: [PATCH v2 fs/proc/bootconfig 0/2] remove redundant comments from /proc/bootconfig
On Tue, Apr 09, 2024 at 11:32:48PM +0900, Masami Hiramatsu wrote: > Hi Paul, > > Thanks, both looks good to me. > > Acked-by: Masami Hiramatsu (Google) > > Let me update bootconfig/fixes. Thank you very much! As soon as I see them in -next from you, I will drop them from -rcu. Thanx, Paul > On Mon, 8 Apr 2024 21:42:49 -0700 > "Paul E. McKenney" wrote: > > > Hello! > > > > This series removes redundant comments from /proc/bootconfig: > > > > 1. fs/proc: remove redundant comments from /proc/bootconfig, > > courtesy of Zhenhua Huang. > > > > 2. fs/proc: Skip bootloader comment if no embedded kernel parameters, > > courtesy of Masami Hiramatsu. > > > > Thanx, Paul > > > > > > > > b/fs/proc/bootconfig.c | 12 ++-- > > b/include/linux/bootconfig.h |1 + > > b/init/main.c|5 + > > fs/proc/bootconfig.c |2 +- > > 4 files changed, 13 insertions(+), 7 deletions(-) > > > > > -- > Masami Hiramatsu (Google)
Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
On Fri, 5 Apr 2024 15:43:47 -0700 "Ho-Ren (Jack) Chuang" wrote: > On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron > wrote: > > > > On Fri, 5 Apr 2024 00:07:06 + > > "Ho-Ren (Jack) Chuang" wrote: > > > > > The current implementation treats emulated memory devices, such as > > > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal > > > memory > > > (E820_TYPE_RAM). However, these emulated devices have different > > > characteristics than traditional DRAM, making it important to > > > distinguish them. Thus, we modify the tiered memory initialization process > > > to introduce a delay specifically for CPUless NUMA nodes. This delay > > > ensures that the memory tier initialization for these nodes is deferred > > > until HMAT information is obtained during the boot process. Finally, > > > demotion tables are recalculated at the end. > > > > > > * late_initcall(memory_tier_late_init); > > > Some device drivers may have initialized memory tiers between > > > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing > > > online memory nodes and configuring memory tiers. They should be excluded > > > in the late init. > > > > > > * Handle cases where there is no HMAT when creating memory tiers > > > There is a scenario where a CPUless node does not provide HMAT > > > information. > > > If no HMAT is specified, it falls back to using the default DRAM tier. > > > > > > * Introduce another new lock `default_dram_perf_lock` for adist > > > calculation > > > In the current implementation, iterating through CPUlist nodes requires > > > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up > > > trying to acquire the same lock, leading to a potential deadlock. > > > Therefore, we propose introducing a standalone `default_dram_perf_lock` to > > > protect `default_dram_perf_*`. This approach not only avoids deadlock > > > but also prevents holding a large lock simultaneously. > > > > > > * Upgrade `set_node_memory_tier` to support additional cases, including > > > default DRAM, late CPUless, and hot-plugged initializations. > > > To cover hot-plugged memory nodes, `mt_calc_adistance()` and > > > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to > > > handle cases where memtype is not initialized and where HMAT information > > > is > > > available. > > > > > > * Introduce `default_memory_types` for those memory types that are not > > > initialized by device drivers. > > > Because late initialized memory and default DRAM memory need to be > > > managed, > > > a default memory type is created for storing all memory types that are > > > not initialized by device drivers and as a fallback. > > > > > > Signed-off-by: Ho-Ren (Jack) Chuang > > > Signed-off-by: Hao Xiang > > > Reviewed-by: "Huang, Ying" > > > > Hi - one remaining question. Why can't we delay init for all nodes > > to either drivers or your fallback late_initcall code. > > It would be nice to reduce possible code paths. > > I try not to change too much of the existing code structure in > this patchset. > > To me, postponing/moving all memory tier registrations to > late_initcall() is another possible action item for the next patchset. > > After tier_mem(), hmat_init() is called, which requires registering > `default_dram_type` info. This is when `default_dram_type` is needed. > However, it is indeed possible to postpone the latter part, > set_node_memory_tier(), to `late_init(). So, memory_tier_init() can > indeed be split into two parts, and the latter part can be moved to > late_initcall() to be processed together. > > Doing this all memory-type drivers have to call late_initcall() to > register a memory tier. I’m not sure how many they are? > > What do you guys think? Gut feeling - if you are going to move it for some cases, move it for all of them. Then we only have to test once ;) J > > > > > Jonathan > > > > > > > --- > > > mm/memory-tiers.c | 94 +++ > > > 1 file changed, 70 insertions(+), 24 deletions(-) > > > > > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > > > index 516b144fd45a..6632102bd5c9 100644 > > > --- a/mm/memory-tiers.c > > > +++ b/mm/memory-tiers.c > > > > > > > > > @@ -855,7 +892,8 @@ static int __init memory_tier_init(void) > > >* For now we can have 4 faster memory tiers with smaller adistance > > >* than default DRAM tier. > > >*/ > > > - default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM); > > > + default_dram_type = > > > mt_find_alloc_memory_type(MEMTIER_ADISTANCE_DRAM, > > > + > > > &default_memory_types); > > > if (IS_ERR(default_dram_type)) > > > panic("%s() failed to allocate default DRAM tier\n", > > > __func__); > > > > > > @@ -865,6 +903,14 @@ static int __init memory_tier_init(void) > > >* types assigned. > > >*/ > > > for_eac
Re: [PATCH] tracing: Add new_exec tracepoint
On Mon, Apr 08, 2024 at 11:01:54AM +0200, Marco Elver wrote: > Add "new_exec" tracepoint, which is run right after the point of no > return but before the current task assumes its new exec identity. > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint > runs before flushing the old exec, i.e. while the task still has the > original state (such as original MM), but when the new exec either > succeeds or crashes (but never returns to the original exec). > > Being able to trace this event can be helpful in a number of use cases: > > * allowing tracing eBPF programs access to the original MM on exec, > before current->mm is replaced; > * counting exec in the original task (via perf event); > * profiling flush time ("new_exec" to "sched_process_exec"). > > Example of tracing output ("new_exec" and "sched_process_exec"): > > $ cat /sys/kernel/debug/tracing/trace_pipe > <...>-379 [003] . 179.626921: new_exec: > filename=/usr/bin/sshd pid=379 comm=sshd > <...>-379 [003] . 179.629131: sched_process_exec: > filename=/usr/bin/sshd pid=379 old_pid=379 > <...>-381 [002] . 180.048580: new_exec: filename=/bin/bash > pid=381 comm=sshd > <...>-381 [002] . 180.053122: sched_process_exec: > filename=/bin/bash pid=381 old_pid=381 > <...>-385 [001] . 180.068277: new_exec: filename=/usr/bin/tty > pid=385 comm=bash > <...>-385 [001] . 180.069485: sched_process_exec: > filename=/usr/bin/tty pid=385 old_pid=385 > <...>-389 [006] . 192.020147: new_exec: > filename=/usr/bin/dmesg pid=389 comm=bash >bash-389 [006] . 192.021377: sched_process_exec: > filename=/usr/bin/dmesg pid=389 old_pid=389 > > Signed-off-by: Marco Elver > --- > fs/exec.c | 2 ++ > include/trace/events/task.h | 30 ++ > 2 files changed, 32 insertions(+) > > diff --git a/fs/exec.c b/fs/exec.c > index 38bf71cbdf5e..ab778ae1fc06 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1268,6 +1268,8 @@ int begin_new_exec(struct linux_binprm * bprm) > if (retval) > return retval; > > + trace_new_exec(current, bprm); > + All other steps in this function have explicit comments about what/why/etc. Please add some kind of comment describing why the tracepoint is where it is, etc. For example, maybe something like: /* * Before any changes to 'current', report that the exec is about to * happen (since we made it to the point of no return). On a successful * exec, the 'sched_process_exec' tracepoint will also fire. On failure, * ... [something else] */ > +TRACE_EVENT(new_exec, > + > + TP_PROTO(struct task_struct *task, struct linux_binprm *bprm), > + > + TP_ARGS(task, bprm), > + > + TP_STRUCT__entry( > + __string( filename, bprm->filename ) > + __field(pid_t, pid ) > + __string( comm, task->comm ) > + ), > + > + TP_fast_assign( > + __assign_str(filename, bprm->filename); What about binfmt_misc, and binfmt_script? You may want bprm->interp too? -Kees > + __entry->pid = task->pid; > + __assign_str(comm, task->comm); > + ), > + > + TP_printk("filename=%s pid=%d comm=%s", > + __get_str(filename), __entry->pid, __get_str(comm)) > +); > + > #endif > > /* This part must be outside protection */ > -- > 2.44.0.478.gd926399ef9-goog > -- Kees Cook
Re: [PATCHv3 2/2] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge
On Tue, Apr 09, 2024 at 01:04:12PM +0200, Pavel Machek wrote: > Hi! > > > > This is driver for ANX7688 USB-C HDMI, with flashing and debugging > > > features removed. ANX7688 is rather criticial piece on PinePhone, > > > there's no display and no battery charging without it. > > > > > > There's likely more work to be done here, but having basic support > > > in mainline is needed to be able to work on the other stuff > > > (networking, cameras, power management). > > > > > > Signed-off-by: Ondrej Jirman > > > Co-developed-by: Martijn Braam > > > Co-developed-by: Samuel Holland > > > Signed-off-by: Pavel Machek > > > > Just couple of quick comments below - I did not have time to go over > > this very thoroughly, but I think you need to make a new version in > > any case because of comments in 1/2. > [skipped] > > > > +static int anx7688_connect(struct anx7688 *anx7688) > > > +{ > > > + struct typec_partner_desc desc = {}; > > > + int ret, i; > > > + u8 fw[2]; > > > + const u8 dp_snk_identity[16] = { > > > + 0x00, 0x00, 0x00, 0xec, /* id header */ > > > + 0x00, 0x00, 0x00, 0x00, /* cert stat */ > > > + 0x00, 0x00, 0x00, 0x00, /* product type */ > > > + 0x39, 0x00, 0x00, 0x51 /* alt mode adapter */ > > > + }; > > > + const u8 svid[4] = { > > > + 0x00, 0x00, 0x01, 0xff, > > > + }; > > > > Why not get those from DT? > > Are you sure it belongs to the DT (and that DT people will agree)? >From Documentation/devicetree/bindings/connector/usb-connector.yaml: altmodes { displayport { svid = /bits/ 16 <0xff01>; vdo = <0x1c46>; }; }; BTW, I don't see the VDO for the DP altmode in your code. Maybe I missed it at a quick glance. > > > > + u32 caps[8]; > > > + -- With best wishes Dmitry
Re: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation
On Tue, 09 Apr 2024 04:03:22 -0500, Michal Koutný wrote: On Mon, Apr 08, 2024 at 11:23:21PM -0500, Haitao Huang wrote: It's always non-NULL based on testing. It's hard for me to say definitely by reading the code. But IIUC cgroup_disable command-line only blocks operations in /sys/fs/cgroup so user space can't set up controllers and config limits, etc., for the diasabled ones. Each task->cgroups would still have a non-NULL pointer to the static root object for each cgroup that is enabled by KConfig, so get_current_misc_cg() thus sgx_get_current_cg() should not return NULL regardless 'cgroup_disable=misc'. Maybe @Michal or @tj can confirm? The current implementation creates root css object (see cgroup_init(), cgroup_ssid_enabled() check is after cgroup_init_subsys()). I.e. it will look like all tasks are members of root cgroup wrt given controller permanently and controller attribute files won't exist. (It is up to the controller implementation to do further optimization based on the boot-time disablement (e.g. see uses of mem_cgroup_disabled()). Not sure if this is useful for misc controller.) As for the WARN_ON(1), taking example from memcg -- NULL is best synonymous with root. It's a judgement call which of the values to store and when to intepret it. HTH, Michal Thanks for the info. The way I see it, misc does not have special handling like memcg so every task at least belong to the root(default) group even if it's disabled by command line parameter. So we would not get NULL from get_current_misc_cg(). I think I'll keep the WARN_ON_ONCE for now as a reminder in case misc do have custom support for disabling in future. Thanks Haitao
Re: [PATCH v2 3/4] arm64: dts: qcom: msm8976: Add Adreno GPU
On Mon, Apr 01, 2024 at 07:21:52PM +0200, Adam Skladowski wrote: > Add Adreno GPU node. > > Signed-off-by: Adam Skladowski > --- > arch/arm64/boot/dts/qcom/msm8976.dtsi | 65 +++ > 1 file changed, 65 insertions(+) > > diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi > b/arch/arm64/boot/dts/qcom/msm8976.dtsi > index 6be310079f5b..77670fce9b8f 100644 > --- a/arch/arm64/boot/dts/qcom/msm8976.dtsi > +++ b/arch/arm64/boot/dts/qcom/msm8976.dtsi > @@ -1074,6 +1074,71 @@ mdss_dsi1_phy: phy@1a96a00 { > }; > }; > > + adreno_gpu: gpu@1c0 { > + compatible = "qcom,adreno-510.0", "qcom,adreno"; > + > + reg = <0x01c0 0x4>; > + reg-names = "kgsl_3d0_reg_memory"; > + > + interrupts = ; > + interrupt-names = "kgsl_3d0_irq"; > + > + clocks = <&gcc GCC_GFX3D_OXILI_CLK>, > + <&gcc GCC_GFX3D_OXILI_AHB_CLK>, > + <&gcc GCC_GFX3D_OXILI_GMEM_CLK>, > + <&gcc GCC_GFX3D_BIMC_CLK>, > + <&gcc GCC_GFX3D_OXILI_TIMER_CLK>, > + <&gcc GCC_GFX3D_OXILI_AON_CLK>; > + clock-names = "core", > + "iface", > + "mem", > + "mem_iface", > + "rbbmtimer", > + "alwayson"; > + > + power-domains = <&gcc OXILI_GX_GDSC>; > + > + iommus = <&gpu_iommu 0>; > + > + status = "disabled"; Make status the last property of the node, please. Regards, Bjorn > + > + operating-points-v2 = <&gpu_opp_table>; > + > + gpu_opp_table: opp-table { > + compatible = "operating-points-v2"; > +
Re: [PATCH v2 2/4] arm64: dts: qcom: msm8976: Add MDSS nodes
On Mon, Apr 01, 2024 at 07:21:51PM +0200, Adam Skladowski wrote: > diff --git a/arch/arm64/boot/dts/qcom/msm8976.dtsi > b/arch/arm64/boot/dts/qcom/msm8976.dtsi [..] > + mdss: display-subsystem@1a0 { [..] > + mdss_dsi0: dsi@1a94000 { > + compatible = "qcom,msm8976-dsi-ctrl", > "qcom,mdss-dsi-ctrl"; > + reg = <0x01a94000 0x25c>; > + reg-names = "dsi_ctrl"; > + > + interrupt-parent = <&mdss>; > + interrupts = <4>; > + > + clocks = <&gcc GCC_MDSS_MDP_CLK>, > + <&gcc GCC_MDSS_AHB_CLK>, > + <&gcc GCC_MDSS_AXI_CLK>, > + <&gcc GCC_MDSS_BYTE0_CLK>, > + <&gcc GCC_MDSS_PCLK0_CLK>, > + <&gcc GCC_MDSS_ESC0_CLK>; > + clock-names = "mdp_core", > + "iface", > + "bus", > + "byte", > + "pixel", > + "core"; > + > + assigned-clocks = <&gcc GCC_MDSS_BYTE0_CLK_SRC>, > + <&gcc GCC_MDSS_PCLK0_CLK_SRC>; > + assigned-clock-parents = <&mdss_dsi0_phy 0>, > + <&mdss_dsi0_phy 1>; > + > + phys = <&mdss_dsi0_phy>; > + > + operating-points-v2 = <&dsi0_opp_table>; > + power-domains = <&gcc MDSS_GDSC>; > + > + #address-cells = <1>; > + #size-cells = <0>; > + Seems reasonable to keep this disabled as well. Further more &mdss_dsi0 depends on &mdss_dsi0_phy which is disabled. > + ports { [..] > + mdss_dsi0_phy: phy@1a94a00 { > + compatible = "qcom,dsi-phy-28nm-hpm-fam-b"; > + reg = <0x01a94a00 0xd4>, > + <0x01a94400 0x280>, > + <0x01a94b80 0x30>; > + reg-names = "dsi_pll", > + "dsi_phy", > + "dsi_phy_regulator"; > + > + #clock-cells = <1>; > + #phy-cells = <0>; > + > + clocks = <&gcc GCC_MDSS_AHB_CLK>, > + <&rpmcc RPM_SMD_XO_CLK_SRC>; > + clock-names = "iface", "ref"; > + > + status = "disabled"; > + }; PS. Leave &mdss_mdp enabled... Regards, Bjorn
Re: [PATCH 0/3] Fix up qcom,halt-regs definition in various schemas
On Sun, Apr 07, 2024 at 11:58:29AM +0200, Luca Weiss wrote: > The original motivation is that a bunch of other schemas fail to > validate qcom,halt-regs, for example like in the following examples: > > arch/arm64/boot/dts/qcom/apq8016-sbc.dtb: remoteproc@408: > qcom,halt-regs:0: [20] is too short > from schema $id: > http://devicetree.org/schemas/remoteproc/qcom,msm8916-mss-pil.yaml# > arch/arm64/boot/dts/qcom/apq8096-ifc6640.dtb: remoteproc@208: > qcom,halt-regs:0: [82] is too short > from schema $id: > http://devicetree.org/schemas/remoteproc/qcom,msm8996-mss-pil.yaml# > arch/arm64/boot/dts/qcom/apq8039-t2.dtb: remoteproc@408: > qcom,halt-regs:0: [32] is too short > from schema $id: > http://devicetree.org/schemas/remoteproc/qcom,msm8916-mss-pil.yaml# > > While I'm actually not quite sure why these patches fix this in > the other schemas - feels like a bug/limitation in dt-schema maybe? - Was this with v2024.02? It should be a bit better there. Though it may just have different errors. The limitation is that property types and in the case of matrix's (which phandle-array actually is) range for dimensions are global. So if there's not correct dimensions for a property, the tools aren't going to decode it properly. Rob
Re: [PATCH 1/3] tracing: Remove dependency of saved_cmdlines_buffer on PID_MAX_DEFAULT
On Mon, 8 Apr 2024 16:58:17 +0200 Michal Koutný wrote: > @@ -294,7 +295,7 @@ static void __trace_find_cmdline(int pid, char comm[]) > return; > } > > - tpid = pid & (PID_MAX_DEFAULT - 1); > + tpid = pid % PID_MAP_SIZE; Does that compile to the same? This is a fast path. -- Steve > map = savedcmd->map_pid_to_cmdline[tpid]; > if (map != NO_CMDLINE_MAP) { > tpid = savedcmd->map_cmdline_to_pid[map];
Re: [PATCH] tracing: Add new_exec tracepoint
On Tue, 9 Apr 2024 at 16:31, Steven Rostedt wrote: > > On Mon, 8 Apr 2024 11:01:54 +0200 > Marco Elver wrote: > > > Add "new_exec" tracepoint, which is run right after the point of no > > return but before the current task assumes its new exec identity. > > > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint > > runs before flushing the old exec, i.e. while the task still has the > > original state (such as original MM), but when the new exec either > > succeeds or crashes (but never returns to the original exec). > > > > Being able to trace this event can be helpful in a number of use cases: > > > > * allowing tracing eBPF programs access to the original MM on exec, > > before current->mm is replaced; > > * counting exec in the original task (via perf event); > > * profiling flush time ("new_exec" to "sched_process_exec"). > > > > Example of tracing output ("new_exec" and "sched_process_exec"): > > How common is this? And can't you just do the same with adding a kprobe? Our main use case would be to use this in BPF programs to become exec-aware, where using the sched_process_exec hook is too late. This is particularly important where the BPF program must stop inspecting the user space's VM when the task does exec to become a new process. kprobe (or BPF's fentry) is brittle here, because begin_new_exec()'s permission check can still return an error which returns to the original task without crashing. Only at the point of no return are we guaranteed that the exec either succeeds, or the task is terminated on failure. I don't know if "common" is the right question here, because it's a chicken-egg problem: no tracepoint, we give up; we have the tracepoint, it unlocks a range of new use cases (that require robust solution to make BPF programs exec-aware, and a tracepoint is the only option IMHO). Thanks, -- Marco
Re: [PATCH v2 fs/proc/bootconfig 0/2] remove redundant comments from /proc/bootconfig
Hi Paul, Thanks, both looks good to me. Acked-by: Masami Hiramatsu (Google) Let me update bootconfig/fixes. On Mon, 8 Apr 2024 21:42:49 -0700 "Paul E. McKenney" wrote: > Hello! > > This series removes redundant comments from /proc/bootconfig: > > 1.fs/proc: remove redundant comments from /proc/bootconfig, > courtesy of Zhenhua Huang. > > 2.fs/proc: Skip bootloader comment if no embedded kernel parameters, > courtesy of Masami Hiramatsu. > > Thanx, Paul > > > > b/fs/proc/bootconfig.c | 12 ++-- > b/include/linux/bootconfig.h |1 + > b/init/main.c|5 + > fs/proc/bootconfig.c |2 +- > 4 files changed, 13 insertions(+), 7 deletions(-) > -- Masami Hiramatsu (Google)
Re: [PATCH] tracing: Add new_exec tracepoint
On Mon, 8 Apr 2024 11:01:54 +0200 Marco Elver wrote: > Add "new_exec" tracepoint, which is run right after the point of no > return but before the current task assumes its new exec identity. > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint > runs before flushing the old exec, i.e. while the task still has the > original state (such as original MM), but when the new exec either > succeeds or crashes (but never returns to the original exec). > > Being able to trace this event can be helpful in a number of use cases: > > * allowing tracing eBPF programs access to the original MM on exec, > before current->mm is replaced; > * counting exec in the original task (via perf event); > * profiling flush time ("new_exec" to "sched_process_exec"). > > Example of tracing output ("new_exec" and "sched_process_exec"): How common is this? And can't you just do the same with adding a kprobe? -- Steve
Re: [PATCH v2] kprobes: Avoid possible warn in __arm_kprobe_ftrace()
On Tue, 9 Apr 2024 14:20:45 +0800 Zheng Yejian wrote: > On 2024/4/8 20:41, Masami Hiramatsu (Google) wrote: > > Hi Zheng, > > > > On Mon, 8 Apr 2024 16:34:03 +0800 > > Zheng Yejian wrote: > > > >> There is once warn in __arm_kprobe_ftrace() on: > >> > >> ret = ftrace_set_filter_ip(ops, (unsigned long)p->addr, 0, 0); > >> if (WARN_ONCE(..., "Failed to arm kprobe-ftrace at %pS (error %d)\n", > >> ...) > >> return ret; > >> > >> This warning is generated because 'p->addr' is detected to be not a valid > >> ftrace location in ftrace_set_filter_ip(). The ftrace address check is done > >> by check_ftrace_location() at the beginning of check_kprobe_address_safe(). > >> At that point, ftrace_location(addr) == addr should return true if the > >> module is loaded. Then the module is searched twice: > >>1. in is_module_text_address(), we find that 'p->addr' is in a module; > >>2. in __module_text_address(), we find the module; > >> > >> If the module has just been unloaded before the second search, then > >> '*probed_mod' is NULL and we would not go to get the module refcount, > >> then the return value of check_kprobe_address_safe() would be 0, but > >> actually we need to return -EINVAL. > > > > OK, so you found a race window in check_kprobe_address_safe(). > > > > It does something like below. > > > > check_kprobe_address_safe() { > > ... > > > > /* Timing [A] */ > > > > if (!(core_kernel_text(p->addr) || > > is_module_text_address(p->addr)) || > > ...(other reserved address check)) { > > return -EINVAL; > > } > > > > /* Timing [B] */ > > > > *probed_mod = __module_text_address(p->addr): > > if (*probe_mod) { > > if (!try_module_get(*probed_mod)) { > > return -ENOENT; > > } > > ... > > } > > } > > > > So, if p->addr is in a module which is alive at the timing [A], but > > unloaded at timing [B], 'p->addr' is passed the > > 'is_module_text_address(p->addr)' check, but *probed_mod becomes NULL. > > Thus the corresponding module is not referenced and kprobe_arm(p) will > > access a wrong address (use after free). > > This happens either kprobe on ftrace is enabled or not. > > Yes, This is the problem. And for this case, check_kprobe_address_safe() > still return 0, and then going on to arm kprobe may cause problems. So > we should make check_kprobe_address_safe() return -EINVAL when refcount > of the module is not got. Yes, > > > > > To fix this problem, we should move the mutex_lock(kprobe_mutex) before > > check_kprobe_address_safe() because kprobe_module_callback() also lock it > > so it can stop module unloading. > > > > Can you ensure this will fix your problem? > > It seems not, the warning in __arm_kprobe_ftrace() still occurs. I > contrived following simple test: > > #!/bin/bash > sysctl -w kernel.panic_on_warn=1 > while [ True ]; do > insmod mod.ko# contain function 'foo' > rmmod mod.ko > done & > while [ True ]; do > insmod kprobe.ko # register kprobe on function 'foo' > rmmod kprobe.ko > done & > > I think holding kprobe_mutex cannot make sure we get the refcount of the > module. Aah, yes, it cannot, because the kallsyms in a module will be removed after module->state becomes MODULE_STATE_UNFORMED. Before UNFORMED, the state is MODULE_STATE_GOING and the kprobe_module_callback() is called at that point. Thus, the following scenario happens. CPU1 CPU2 mod->state = MODULE_STATE_GOING kprobe_module_callback() { mutex_lock(&kprobe_mutex) loop on kprobe_table to disable kprobe in the module. mutex_unlock(&kprobe_mutex) } register_kprobe(p) { mutex_lock(&kprobe_mutex) check_kprobe_address_safe(p->addr) { [A''] is_module_text_address() return true until mod->state == UNFORMED. mod->state = MODULE_STATE_UNFORMED [B''] __module_text_address() returns NULL. } p is on the kprobe_table. mutex_unlock(&kprobe_mutex) So, as your fix, if we save the module at [A''] and use it at [B''], the mod is NOT able to get because mod->state != MODULE_STATE_LIVE. > > > I think your patch is just optimizing but not fixing the fundamental > > problem,
Re: [PATCH v3 1/2] dt-bindings: arm: qcom: Add Motorola Moto G (2013)
On Sun, 07 Apr 2024 11:05:10 +0200, Stanislav Jakubek wrote: > Document the Motorola Moto G (2013), which is a smartphone based > on the Qualcomm MSM8226 SoC. > > Acked-by: Krzysztof Kozlowski > Signed-off-by: Stanislav Jakubek > --- > Changes in V3: > - no changes > > Changes in V2: > - collect Krzysztof's A-b > > Documentation/devicetree/bindings/arm/qcom.yaml | 1 + > 1 file changed, 1 insertion(+) > My bot found new DTB warnings on the .dts files added or changed in this series. Some warnings may be from an existing SoC .dtsi. Or perhaps the warnings are fixed by another series. Ultimately, it is up to the platform maintainer whether these warnings are acceptable or not. No need to reply unless the platform maintainer has comments. If you already ran DT checks and didn't see these error(s), then make sure dt-schema is up to date: pip3 install dtschema --upgrade New warnings running 'make CHECK_DTBS=y qcom/msm8226-motorola-falcon.dtb' for 32c507337ab80c550fb1df08f7014d1e31eb4c32.1712480582.git.stano.jaku...@gmail.com: arch/arm/boot/dts/qcom/msm8226-motorola-falcon.dtb: syscon@f9011000: compatible: 'anyOf' conditional failed, one must be fixed: ['syscon'] is too short 'syscon' is not one of ['allwinner,sun8i-a83t-system-controller', 'allwinner,sun8i-h3-system-controller', 'allwinner,sun8i-v3s-system-controller', 'allwinner,sun50i-a64-system-controller', 'amd,pensando-elba-syscon', 'brcm,cru-clkset', 'freecom,fsg-cs2-system-controller', 'fsl,imx93-aonmix-ns-syscfg', 'fsl,imx93-wakeupmix-syscfg', 'hisilicon,dsa-subctrl', 'hisilicon,hi6220-sramctrl', 'hisilicon,pcie-sas-subctrl', 'hisilicon,peri-subctrl', 'hpe,gxp-sysreg', 'intel,lgm-syscon', 'loongson,ls1b-syscon', 'loongson,ls1c-syscon', 'marvell,armada-3700-usb2-host-misc', 'mediatek,mt8135-pctl-a-syscfg', 'mediatek,mt8135-pctl-b-syscfg', 'mediatek,mt8365-syscfg', 'microchip,lan966x-cpu-syscon', 'microchip,sparx5-cpu-syscon', 'mstar,msc313-pmsleep', 'nuvoton,ma35d1-sys', 'nuvoton,wpcm450-shm', 'rockchip,px30-qos', 'rockchip,rk3036-qos', 'rockchip,rk3066-qos', 'rockchip,rk3128-qos', 'rockchip,rk3228-qos', 'rockchip,rk3288-qos', 'rockchip,rk3368-qos', 'rockchip,rk3399-qos', 'rockchip,rk3568-qos', ' rockchip,rk3588-qos', 'rockchip,rv1126-qos', 'starfive,jh7100-sysmain', 'ti,am62-usb-phy-ctrl', 'ti,am654-dss-oldi-io-ctrl', 'ti,am654-serdes-ctrl', 'ti,j784s4-pcie-ctrl'] from schema $id: http://devicetree.org/schemas/mfd/syscon.yaml#
Re: [PATCH v3] Documentation: Add reconnect process for VDUSE
On Mon, Apr 8, 2024 at 8:50 PM Michael S. Tsirkin wrote: > > On Mon, Apr 08, 2024 at 08:39:21PM +0800, Cindy Lu wrote: > > On Mon, Apr 8, 2024 at 3:40 PM Michael S. Tsirkin wrote: > > > > > > On Thu, Apr 04, 2024 at 01:56:31PM +0800, Cindy Lu wrote: > > > > Add a document explaining the reconnect process, including what the > > > > Userspace App needs to do and how it works with the kernel. > > > > > > > > Signed-off-by: Cindy Lu > > > > --- > > > > Documentation/userspace-api/vduse.rst | 41 +++ > > > > 1 file changed, 41 insertions(+) > > > > > > > > diff --git a/Documentation/userspace-api/vduse.rst > > > > b/Documentation/userspace-api/vduse.rst > > > > index bdb880e01132..7faa83462e78 100644 > > > > --- a/Documentation/userspace-api/vduse.rst > > > > +++ b/Documentation/userspace-api/vduse.rst > > > > @@ -231,3 +231,44 @@ able to start the dataplane processing as follows: > > > > after the used ring is filled. > > > > > > > > For more details on the uAPI, please see include/uapi/linux/vduse.h. > > > > + > > > > +HOW VDUSE devices reconnection works > > > > + > > > > +1. What is reconnection? > > > > + > > > > + When the userspace application loads, it should establish a > > > > connection > > > > + to the vduse kernel device. Sometimes,the userspace application > > > > exists, > > > > + and we want to support its restart and connect to the kernel device > > > > again > > > > + > > > > +2. How can I support reconnection in a userspace application? > > > > + > > > > +2.1 During initialization, the userspace application should first > > > > verify the > > > > +existence of the device "/dev/vduse/vduse_name". > > > > +If it doesn't exist, it means this is the first-time for > > > > connection. goto step 2.2 > > > > +If it exists, it means this is a reconnection, and we should goto > > > > step 2.3 > > > > + > > > > +2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on > > > > +/dev/vduse/control. > > > > +When ioctl(VDUSE_CREATE_DEV) is called, kernel allocates memory for > > > > +the reconnect information. The total memory size is > > > > PAGE_SIZE*vq_mumber. > > > > > > Confused. Where is that allocation, in code? > > > > > > Thanks! > > > > > this should allocated in function vduse_create_dev(), > > I mean, it's not allocated there ATM right? This is just doc patch > to become part of a larger patchset? > Got it, thanks Michael I will send the whole patchset soon thanks Cindy > > I will rewrite > > this part to make it more clearer > > will send a new version soon > > Thanks > > cindy > > > > > > +2.3 Check if the information is suitable for reconnect > > > > +If this is reconnection : > > > > +Before attempting to reconnect, The userspace application needs to > > > > use the > > > > +ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, > > > > VDUSE_DEV_GET_FEATURES...) > > > > +to get the information from kernel. > > > > +Please review the information and confirm if it is suitable to > > > > reconnect. > > > > + > > > > +2.4 Userspace application needs to mmap the memory to userspace > > > > +The userspace application requires mapping one page for every vq. > > > > These pages > > > > +should be used to save vq-related information during system > > > > running. Additionally, > > > > +the application must define its own structure to store information > > > > for reconnection. > > > > + > > > > +2.5 Completed the initialization and running the application. > > > > +While the application is running, it is important to store > > > > relevant information > > > > +about reconnections in mapped pages. When calling the ioctl > > > > VDUSE_VQ_GET_INFO to > > > > +get vq information, it's necessary to check whether it's a > > > > reconnection. If it is > > > > +a reconnection, the vq-related information must be get from the > > > > mapped pages. > > > > + > > > > +2.6 When the Userspace application exits, it is necessary to unmap all > > > > the > > > > +pages for reconnection > > > > -- > > > > 2.43.0 > > > >
Re: [PATCH 2/2] ARM: dts: qcom: msm8974: Use proper compatible for APCS syscon
On 08/04/2024 21:32, Luca Weiss wrote: > Use the apcs-kpss-global compatible for the APCS global mailbox block > found on this SoC. > > This also resolves a dt-binding checker warning: Reviewed-by: Krzysztof Kozlowski Best regards, Krzysztof
Re: [PATCH 1/2] dt-bindings: mailbox: qcom: Add MSM8974 APCS compatible
Il 08/04/24 21:32, Luca Weiss ha scritto: Add compatible for the Qualcomm MSM8974 APCS block. Signed-off-by: Luca Weiss Reviewed-by: AngeloGioacchino Del Regno
Re: [PATCH 2/2] ARM: dts: qcom: msm8974: Use proper compatible for APCS syscon
Il 08/04/24 21:32, Luca Weiss ha scritto: Use the apcs-kpss-global compatible for the APCS global mailbox block found on this SoC. This also resolves a dt-binding checker warning: arch/arm/boot/dts/qcom/qcom-msm8974pro-fairphone-fp2.dtb: syscon@f9011000: compatible: 'anyOf' conditional failed, one must be fixed: ['syscon'] is too short 'syscon' is not one of ['allwinner,sun8i-a83t-system-controller', 'allwinner,sun8i-h3-system-controller', 'allwinner,sun8i-v3s-system-controller', 'allwinner,sun50i-a64-system-controller', 'amd,pensando-elba-syscon', 'brcm,cru-clkset', 'freecom,fsg-cs2-system-controller', 'fsl,imx93-aonmix-ns-syscfg', 'fsl,imx93-wakeupmix-syscfg', 'hisilicon,dsa-subctrl', 'hisilicon,hi6220-sramctrl', 'hisilicon,pcie-sas-subctrl', 'hisilicon,peri-subctrl', 'hpe,gxp-sysreg', 'intel,lgm-syscon', 'loongson,ls1b-syscon', 'loongson,ls1c-syscon', 'marvell,armada-3700-usb2-host-misc', 'mediatek,mt8135-pctl-a-syscfg', 'mediatek,mt8135-pctl-b-syscfg', 'mediatek,mt8365-syscfg', 'microchip,lan966x-cpu-syscon', 'microchip,sparx5-cpu-syscon', 'mstar,msc313-pmsleep', 'nuvoton,ma35d1-sys', 'nuvoton,wpcm450-shm', 'rockchip,px30-qos', 'rockchip,rk3036-qos', 'rockchip,rk3066-qos', 'rockchip,rk3128-qos', 'rockchip,rk3228-qos', 'rockchip,rk3288-qos', 'rockchip,rk3368-qos', 'rockchip,rk3399-qos', 'rockchip,rk356 8-qos', 'rockchip,rk3588-qos', 'rockchip,rv1126-qos', 'starfive,jh7100-sysmain', 'ti,am62-usb-phy-ctrl', 'ti,am654-dss-oldi-io-ctrl', 'ti,am654-serdes-ctrl', 'ti,j784s4-pcie-ctrl'] from schema $id: http://devicetree.org/schemas/mfd/syscon.yaml# Signed-off-by: Luca Weiss Reviewed-by: AngeloGioacchino Del Regno
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
On Mon, Apr 08, 2024 at 06:22:59PM +0200, Oleg Nesterov wrote: > On 04/08, Jiri Olsa wrote: > > > > On Fri, Apr 05, 2024 at 01:02:30PM +0200, Oleg Nesterov wrote: > > > > > > And what should sys_uretprobe() do if it is not called from the > > > trampoline? > > > I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL. > > > > so the similar behaviour with int3 ends up with immediate SIGTRAP > > and not invoking pending uretprobe consumers, like: > > > > - setup uretprobe for foo > > - foo() { > > executes int 3 -> sends SIGTRAP > > } > > > > because the int3 handler checks if it got executed from the uretprobe's > > trampoline. > > ... or the task has uprobe at this address > > > if not it treats that int3 as regular trap > > Yes this mimics the "default" behaviour without uprobes/uretprobes > > > so I think we should mimic int3 behaviour and: > > > > - setup uretprobe for foo > > - foo() { > > uretprobe_syscall -> check if we got executed from uretprobe's > > trampoline and send SIGILL if that's not the case > > Agreed, > > > I think it's better to have the offending process killed right away, > > rather than having more undefined behaviour, waiting for final 'ret' > > instruction that jumps to uretprobe trampoline and causes SIGILL > > Agreed. In fact I think it should be also killed if copy_to/from_user() > fails by the same reason. +1 makes sense jirka > > Oleg. >
Re: [PATCHv3 2/2] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge
Hi! > > This is driver for ANX7688 USB-C HDMI, with flashing and debugging > > features removed. ANX7688 is rather criticial piece on PinePhone, > > there's no display and no battery charging without it. > > > > There's likely more work to be done here, but having basic support > > in mainline is needed to be able to work on the other stuff > > (networking, cameras, power management). > > > > Signed-off-by: Ondrej Jirman > > Co-developed-by: Martijn Braam > > Co-developed-by: Samuel Holland > > Signed-off-by: Pavel Machek > > Just couple of quick comments below - I did not have time to go over > this very thoroughly, but I think you need to make a new version in > any case because of comments in 1/2. Yes, there will be new version. There is ton of sleep primitives, and existing ones are okay. I can change everything to fdelay or whatever primitive-of-the-day is, but I'd rather not do pointless changes. You can ask for major changes, or complain about extra newlines, but doing both in the same email does not make sense. > Btw, Co-developed-by comes before Signed-off-by I think. I believe this order is fine, too. > > +++ b/drivers/usb/typec/anx7688.c > > @@ -0,0 +1,1830 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +/* > > + * ANX7688 USB-C HDMI bridge/PD driver > > + * > > + * Warning, this driver is somewhat PinePhone specific. > > So why not split this into core part and PinePhone specific glue > part? Potentially a lot of work I can't really test and would rather not do. > > + struct delayed_work work; > > + struct timer_list work_timer; > > + > > + struct mutex lock; > > Undocumented lock. This is simple driver. How do you expect me to document it? Protects this data structure, not exactly uncommon. > > + > > + /* wait for power to stabilize and release reset */ > > + msleep(10); > > + gpiod_set_value(anx7688->gpio_reset, 0); > > + usleep_range(2, 4); > > Why not just use usleep_range() in both cases. It does not really make code any better. Can do if you insist. > > +static int anx7688_connect(struct anx7688 *anx7688) > > +{ > > + struct typec_partner_desc desc = {}; > > + int ret, i; > > + u8 fw[2]; > > + const u8 dp_snk_identity[16] = { > > + 0x00, 0x00, 0x00, 0xec, /* id header */ > > + 0x00, 0x00, 0x00, 0x00, /* cert stat */ > > + 0x00, 0x00, 0x00, 0x00, /* product type */ > > + 0x39, 0x00, 0x00, 0x51 /* alt mode adapter */ > > + }; > > + const u8 svid[4] = { > > + 0x00, 0x00, 0x01, 0xff, > > + }; > > Why not get those from DT? Are you sure it belongs to the DT (and that DT people will agree)? > > + u32 caps[8]; > > + > > + dev_dbg(anx7688->dev, "cable inserted\n"); > > + > > + anx7688->last_status = -1; > > + anx7688->last_cc_status = -1; > > + anx7688->last_dp_state = -1; > > + > > + msleep(10); > > Please make a comment here why you have to wait, and use > usleep_range(). /* Dunno because working code does that and waiting for hardware to settle down after cable insertion kind of looks like a good thing */ I did not write the driver, and there's no good documentation for this chip. I can try to invent something, but... > > + i = 0; > > + while (1) { > > + ret = anx7688_reg_read(anx7688, > > ANX7688_REG_EEPROM_LOAD_STATUS0); > > + > > + if (ret >= 0 && (ret & ANX7688_EEPROM_FW_LOADED) == > > ANX7688_EEPROM_FW_LOADED) { > > + dev_dbg(anx7688->dev, "eeprom0 = 0x%02x\n", ret); > > + dev_dbg(anx7688->dev, "fw loaded after %d ms\n", i * > > 10); > > + break; > > + } > > + > > + if (i > 99) { > > + set_bit(ANX7688_F_FW_FAILED, anx7688->flags); > > + dev_err(anx7688->dev, > > + "boot firmware load failed (you may need to > > flash FW to anx7688 first)\n"); > > + ret = -ETIMEDOUT; > > + goto err_vconoff; > > + } > > + msleep(5); > > + i++; > > + } > > You need to use something like time_is_after_jiffies() here instead of > a counter. Well, this works as well, but yes, I agree here. > > + ret = i2c_smbus_read_i2c_block_data(anx7688->client, > > + ANX7688_REG_FW_VERSION1, 2, fw); > > + if (ret < 0) { > > + dev_err(anx7688->dev, "failed to read firmware version\n"); > > + goto err_vconoff; > > + } > > + > > + dev_dbg(anx7688->dev, "OCM firmware loaded (version 0x%04x)\n", > > +fw[1] | fw[0] << 8); > > + > > + /* Unmask interrupts */ > > + ret = anx7688_reg_write(anx7688, ANX7688_REG_STATUS_INT, 0); > > + if (ret) > > + goto err_vconoff; > > + > > + ret = anx7688_reg_write(anx7688, ANX7688_REG_STATUS_INT_MASK, > > ~ANX7688_SOFT_INT_MASK); > > + if (ret) > > + goto err_vconoff; > > + > > + ret = anx7688_reg_write(anx7688, ANX7688_REG_IRQ_EXT_SOURCE2
Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL
On Mon, 8 Apr 2024 22:41:04 +0900 Honggyu Kim wrote: [...] > To explain this, I better share more test results. In the section of > "Evaluation Workload", the test sequence can be summarized as follows. > > *. "Turn on DAMON." > 1. Allocate cold memory(mmap+memset) at DRAM node, then make the > process sleep. > 2. Launch redis-server and load prebaked snapshot image, dump.rdb. > (85GB consumed: 52GB for anon and 33GB for file cache) > 3. Run YCSB to make zipfian distribution of memory accesses to > redis-server, then measure execution time. > 4. Repeat 4 over 50 times to measure the average execution time for > each run. Sorry, "Repeat 4 over 50 times" is incorrect. This should be "Repeat 3 over 50 times". > 5. Increase the cold memory size then repeat goes to 2. > > I didn't want to make the evaluation too long in the cover letter, but > I have also evaluated another senario, which lazyly enabled DAMON just > before YCSB run at step 4. I will call this test as "DAMON lazy". This > is missing part from the cover letter. > > 1. Allocate cold memory(mmap+memset) at DRAM node, then make the > process sleep. > 2. Launch redis-server and load prebaked snapshot image, dump.rdb. > (85GB consumed: 52GB for anon and 33GB for file cache) > *. "Turn on DAMON." > 4. Run YCSB to make zipfian distribution of memory accesses to > redis-server, then measure execution time. > 5. Repeat 4 over 50 times to measure the average execution time for > each run. > 6. Increase the cold memory size then repeat goes to 2. > > In the "DAMON lazy" senario, DAMON started monitoring late so the > initial redis-server placement is same as "default", but started to > demote cold data and promote redis data just before YCSB run. [...] Thanks, Honggyu
Re: [PATCHv3 2/2] usb: typec: anx7688: Add driver for ANX7688 USB-C HDMI bridge
Hi, On Mon, Apr 08, 2024 at 12:54:25PM +0200, Pavel Machek wrote: > From: Ondrej Jirman > > This is driver for ANX7688 USB-C HDMI, with flashing and debugging > features removed. ANX7688 is rather criticial piece on PinePhone, > there's no display and no battery charging without it. > > There's likely more work to be done here, but having basic support > in mainline is needed to be able to work on the other stuff > (networking, cameras, power management). > > Signed-off-by: Ondrej Jirman > Co-developed-by: Martijn Braam > Co-developed-by: Samuel Holland > Signed-off-by: Pavel Machek Just couple of quick comments below - I did not have time to go over this very thoroughly, but I think you need to make a new version in any case because of comments in 1/2. Btw, Co-developed-by comes before Signed-off-by I think. > --- > > v2: Fix checkpatch stuff. Some cleanups, adapt to dts format in 1/2. > v3: Turn down debugging, as requested during review. > > diff --git a/drivers/usb/typec/Kconfig b/drivers/usb/typec/Kconfig > index 2f80c2792dbd..c9043ae61546 100644 > --- a/drivers/usb/typec/Kconfig > +++ b/drivers/usb/typec/Kconfig > @@ -64,6 +64,17 @@ config TYPEC_ANX7411 > If you choose to build this driver as a dynamically linked module, the > module will be called anx7411.ko. > > +config TYPEC_ANX7688 > + tristate "Analogix ANX7688 Type-C DRP Port controller and mux driver" > + depends on I2C > + depends on USB_ROLE_SWITCH > + help > + Say Y or M here if your system has Analogix ANX7688 Type-C Bridge > + controller driver. > + > + If you choose to build this driver as a dynamically linked module, the > + module will be called anx7688.ko. > + > config TYPEC_RT1719 > tristate "Richtek RT1719 Sink Only Type-C controller driver" > depends on USB_ROLE_SWITCH || !USB_ROLE_SWITCH > diff --git a/drivers/usb/typec/Makefile b/drivers/usb/typec/Makefile > index 7a368fea61bc..3f8ff94ad294 100644 > --- a/drivers/usb/typec/Makefile > +++ b/drivers/usb/typec/Makefile > @@ -7,6 +7,7 @@ obj-$(CONFIG_TYPEC_TCPM) += tcpm/ > obj-$(CONFIG_TYPEC_UCSI) += ucsi/ > obj-$(CONFIG_TYPEC_TPS6598X) += tipd/ > obj-$(CONFIG_TYPEC_ANX7411) += anx7411.o > +obj-$(CONFIG_TYPEC_ANX7688) += anx7688.o > obj-$(CONFIG_TYPEC_HD3SS3220)+= hd3ss3220.o > obj-$(CONFIG_TYPEC_STUSB160X)+= stusb160x.o > obj-$(CONFIG_TYPEC_RT1719) += rt1719.o > diff --git a/drivers/usb/typec/anx7688.c b/drivers/usb/typec/anx7688.c > new file mode 100644 > index ..407cbd5fedba > --- /dev/null > +++ b/drivers/usb/typec/anx7688.c > @@ -0,0 +1,1830 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * ANX7688 USB-C HDMI bridge/PD driver > + * > + * Warning, this driver is somewhat PinePhone specific. So why not split this into core part and PinePhone specific glue part? > + * How this works: > + * - this driver allows to program firmware into ANX7688 EEPROM, and > + * initialize it > + * - it then communicates with the firmware running on the OCM (on-chip > + * microcontroller) > + * - it detects whether there is cable plugged in or not and powers > + * up or down the ANX7688 based on that > + * - when the cable is connected the firmware on the OCM will handle > + * the detection of the nature of the device on the other end > + * of the USB-C cable > + * - this driver then communicates with the USB phy to let it swap > + * data roles accordingly > + * - it also enables VBUS and VCONN regulators as appropriate > + * - USB phy driver (Allwinner) needs to know whether to switch to > + * device or host mode, or whether to turn off > + * - when the firmware detects SRC.1.5A or SRC.3.0A via CC pins > + * or something else via PD, it notifies this driver via software > + * interrupt and this driver will determine how to update the TypeC > + * port status and what input current limit is appropriate > + * - input current limit determination happens 500ms after cable > + * insertion or hard reset (delay is necessary to determine whether > + * the remote end is PD capable or not) > + * - this driver tells to the PMIC driver that the input current limit > + * needs to be changed > + * - this driver also monitors PMIC status and re-sets the input current > + * limit if it changes for some reason (due to PMIC internal decision > + * making) (this is disabled for now) > + * > + * ANX7688 FW behavior as observed: > + * > + * - DO NOT SET MORE THAN 1 SINK CAPABILITY! Firmware will ignore what > + * you set and send hardcoded PDO_BATT 5-21V 30W message! > + * > + * Product brief: > + * > https://www.analogix.com/en/system/files/AA-002281-PB-6-ANX7688_Product_Brief_0.pdf > + * Development notes: > + * https://xnux.eu/devices/feature/anx7688.html > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#i
Re: [PATCH v4] vp_vdpa: don't allocate unused msix vectors
在 2024/4/9 下午4:58, lyx634449800 写道: From: Yuxue Liu When there is a ctlq and it doesn't require interrupt callbacks,the original method of calculating vectors wastes hardware msi or msix resources as well as system IRQ resources. When conducting performance testing using testpmd in the guest os, it was found that the performance was lower compared to directly using vfio-pci to passthrough the device In scenarios where the virtio device in the guest os does not utilize interrupts, the vdpa driver still configures the hardware's msix vector. Therefore, the hardware still sends interrupts to the host os. Because of this unnecessary action by the hardware, hardware performance decreases, and it also affects the performance of the host os. Before modification:(interrupt mode) 32: 0 0 0 0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0 33: 0 0 0 0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1 34: 0 0 0 0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2 35: 0 0 0 0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config After modification:(interrupt mode) 32: 0 0 1 7 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-0 33: 36 0 3 0 PCI-MSI 32769-edge vp-vdpa[:00:02.0]-1 34: 0 0 0 0 PCI-MSI 32770-edge vp-vdpa[:00:02.0]-config Before modification:(virtio pmd mode for guest os) 32: 0 0 0 0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0 33: 0 0 0 0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1 34: 0 0 0 0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2 35: 0 0 0 0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config After modification:(virtio pmd mode for guest os) 32: 0 0 0 0 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-config To verify the use of the virtio PMD mode in the guest operating system, the following patch needs to be applied to QEMU: https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com Signed-off-by: Yuxue Liu Acked-by: Jason Wang --- V4: Update the title and assign values to uninitialized variables V3: delete unused variables and add validation records V2: fix when allocating IRQs, scan all queues drivers/vdpa/virtio_pci/vp_vdpa.c | 23 +-- 1 file changed, 17 insertions(+), 6 deletions(-) diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c b/drivers/vdpa/virtio_pci/vp_vdpa.c index df5f4a3bccb5..74bc8adfc7e8 100644 --- a/drivers/vdpa/virtio_pci/vp_vdpa.c +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c @@ -160,7 +160,14 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa) struct pci_dev *pdev = mdev->pci_dev; int i, ret, irq; int queues = vp_vdpa->queues; - int vectors = queues + 1; + int vectors = 0; + int msix_vec = 0; + + for (i = 0; i < queues; i++) { + if (vp_vdpa->vring[i].cb.callback) + vectors++; + } + vectors++; ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX); if (ret != vectors) { @@ -173,9 +180,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa) vp_vdpa->vectors = vectors; for (i = 0; i < queues; i++) { + if (!vp_vdpa->vring[i].cb.callback) + continue; + snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-%d\n", pci_name(pdev), i); - irq = pci_irq_vector(pdev, i); + irq = pci_irq_vector(pdev, msix_vec); ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_vq_handler, 0, vp_vdpa->vring[i].msix_name, @@ -185,21 +195,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa) "vp_vdpa: fail to request irq for vq %d\n", i); goto err; } - vp_modern_queue_vector(mdev, i, i); + vp_modern_queue_vector(mdev, i, msix_vec); vp_vdpa->vring[i].irq = irq; + msix_vec++; } snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n", pci_name(pdev)); - irq = pci_irq_vector(pdev, queues); + irq = pci_irq_vector(pdev, msix_vec); ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_config_handler, 0, vp_vdpa->msix_name, vp_vdpa); if (ret) { dev_err(&pdev->dev, - "vp_vdpa: fail to request irq for vq %d\n", i); + "vp_vdpa: fail to request irq for config, ret %d\n", ret); goto err; } - vp_modern_config_vector(mdev, queues); + vp_modern_config_vector(mdev, msix_vec); vp_vdpa->config_irq = irq; return 0; Reviewed-by: Heng Qi
Re: [RFC PATCH v3 5/7] mm/damon/paddr: introduce DAMOS_MIGRATE_COLD action for demotion
Hi SeongJae, On Mon, 8 Apr 2024 10:52:28 -0700 SeongJae Park wrote: > On Mon, 8 Apr 2024 21:06:44 +0900 Honggyu Kim wrote: > > > On Fri, 5 Apr 2024 12:24:30 -0700 SeongJae Park wrote: > > > On Fri, 5 Apr 2024 15:08:54 +0900 Honggyu Kim wrote: > [...] > > > > Here is one of the example usage of this 'migrate_cold' action. > > > > > > > > $ cd /sys/kernel/mm/damon/admin/kdamonds/ > > > > $ cat contexts//schemes//action > > > > migrate_cold > > > > $ echo 2 > contexts//schemes//target_nid > > > > $ echo commit > state > > > > $ numactl -p 0 ./hot_cold 500M 600M & > > > > $ numastat -c -p hot_cold > > > > > > > > Per-node process memory usage (in MBs) > > > > PID Node 0 Node 1 Node 2 Total > > > > -- -- -- -- - > > > > 701 (hot_cold) 501 0601 1101 > > > > > > > > Since there are some common routines with pageout, many functions have > > > > similar logics between pageout and migrate cold. > > > > > > > > damon_pa_migrate_folio_list() is a minimized version of > > > > shrink_folio_list(), but it's minified only for demotion. > > > > > > MIGRATE_COLD is not only for demotion, right? I think the last two words > > > are > > > better to be removed for reducing unnecessary confuses. > > > > You mean the last two sentences? I will remove them if you feel it's > > confusing. > > Yes. My real intended suggestion was 's/only for demotion/only for > migration/', but entirely removing the sentences is also ok for me. Ack. > > > > > > > > > > Signed-off-by: Honggyu Kim > > > > Signed-off-by: Hyeongtak Ji > > > > --- > > > > include/linux/damon.h| 2 + > > > > mm/damon/paddr.c | 146 ++- > > > > mm/damon/sysfs-schemes.c | 4 ++ > > > > 3 files changed, 151 insertions(+), 1 deletion(-) > [...] > > > > --- a/mm/damon/paddr.c > > > > +++ b/mm/damon/paddr.c > [...] > > > > +{ > > > > + unsigned int nr_succeeded; > > > > + nodemask_t allowed_mask = NODE_MASK_NONE; > > > > + > > > > > > I personally prefer not having empty lines in the middle of variable > > > declarations/definitions. Could we remove this empty line? > > > > I can remove it, but I would like to have more discussion about this > > issue. The current implementation allows only a single migration > > target with "target_nid", but users might want to provide fall back > > migration target nids. > > > > For example, if more than two CXL nodes exist in the system, users might > > want to migrate cold pages to any CXL nodes. In such cases, we might > > have to make "target_nid" accept comma separated node IDs. nodemask can > > be better but we should provide a way to change the scanning order. > > > > I would like to hear how you think about this. > > Good point. I think we could later extend the sysfs file to receive the > comma-separated numbers, or even mask. For simplicity, adding sysfs files > dedicated for the different format of inputs could also be an option (e.g., > target_nids_list, target_nids_mask). But starting from this single node as is > now looks ok to me. If you think we can start from a single node, then I will keep it as is. But are you okay if I change the same 'target_nid' to accept comma-separated numbers later? Or do you want to introduce another knob such as 'target_nids_list'? What about rename 'target_nid' to 'target_nids' at the first place? > [...] > > > > + /* 'folio_list' is always empty here */ > > > > + > > > > + /* Migrate folios selected for migration */ > > > > + nr_migrated += migrate_folio_list(&migrate_folios, pgdat, > > > > target_nid); > > > > + /* Folios that could not be migrated are still in > > > > @migrate_folios */ > > > > + if (!list_empty(&migrate_folios)) { > > > > + /* Folios which weren't migrated go back on @folio_list > > > > */ > > > > + list_splice_init(&migrate_folios, folio_list); > > > > + } > > > > > > Let's not use braces for single statement > > > (https://docs.kernel.org/process/coding-style.html#placing-braces-and-spaces). > > > > Hmm.. I know the convention but left it as is because of the comment. > > If I remove the braces, it would have a weird alignment for the two > > lines for comment and statement lines. > > I don't really hate such alignment. But if you don't like it, how about > moving > the comment out of the if statement? Having one comment for one-line if > statement looks not bad to me. Ack. I will manage this in the next revision. > > > > > > + > > > > + try_to_unmap_flush(); > > > > + > > > > + list_splice(&ret_folios, folio_list); > > > > > > Can't we move remaining folios in migrate_folios to ret_folios at once? > > > > I will see if it's possible. > > Thank you. Not a strict request, though. > > [...] > > > > + nid = folio_nid(lru_to_folio(folio_list)); > > > > + do { > > > > + struct fol
Re: [PATCH v4] vp_vdpa: don't allocate unused msix vectors
Good and clear subject, I like it. On Tue, Apr 09, 2024 at 04:58:18PM +0800, lyx634449800 wrote: > From: Yuxue Liu > > When there is a ctlq and it doesn't require interrupt > callbacks,the original method of calculating vectors > wastes hardware msi or msix resources as well as system > IRQ resources. > > When conducting performance testing using testpmd in the > guest os, it was found that the performance was lower compared > to directly using vfio-pci to passthrough the device > > In scenarios where the virtio device in the guest os does > not utilize interrupts, the vdpa driver still configures > the hardware's msix vector. Therefore, the hardware still > sends interrupts to the host os. Because of this unnecessary > action by the hardware, hardware performance decreases, and > it also affects the performance of the host os. > > Before modification:(interrupt mode) > 32: 0 0 0 0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0 > 33: 0 0 0 0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1 > 34: 0 0 0 0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2 > 35: 0 0 0 0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config > > After modification:(interrupt mode) > 32: 0 0 1 7 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-0 > 33: 36 0 3 0 PCI-MSI 32769-edge vp-vdpa[:00:02.0]-1 > 34: 0 0 0 0 PCI-MSI 32770-edge vp-vdpa[:00:02.0]-config > > Before modification:(virtio pmd mode for guest os) > 32: 0 0 0 0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0 > 33: 0 0 0 0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1 > 34: 0 0 0 0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2 > 35: 0 0 0 0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config > > After modification:(virtio pmd mode for guest os) > 32: 0 0 0 0 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-config > > To verify the use of the virtio PMD mode in the guest operating > system, the following patch needs to be applied to QEMU: > https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com > > Signed-off-by: Yuxue Liu > Acked-by: Jason Wang Much better, thanks! A couple of small tweaks to polish it up and it'll be ready. > --- > V4: Update the title and assign values to uninitialized variables > V3: delete unused variables and add validation records > V2: fix when allocating IRQs, scan all queues > > drivers/vdpa/virtio_pci/vp_vdpa.c | 23 +-- > 1 file changed, 17 insertions(+), 6 deletions(-) > > diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c > b/drivers/vdpa/virtio_pci/vp_vdpa.c > index df5f4a3bccb5..74bc8adfc7e8 100644 > --- a/drivers/vdpa/virtio_pci/vp_vdpa.c > +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c > @@ -160,7 +160,14 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa) > struct pci_dev *pdev = mdev->pci_dev; > int i, ret, irq; > int queues = vp_vdpa->queues; > - int vectors = queues + 1; > + int vectors = 0; > + int msix_vec = 0; > + > + for (i = 0; i < queues; i++) { > + if (vp_vdpa->vring[i].cb.callback) > + vectors++; > + } > + vectors++; Actually even easier: int vectors = 1; and then we do not need this last line. Sorry I only noticed now. > > ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX); > if (ret != vectors) { > @@ -173,9 +180,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa) > vp_vdpa->vectors = vectors; > > for (i = 0; i < queues; i++) { > + if (!vp_vdpa->vring[i].cb.callback) > + continue; > + > snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE, > "vp-vdpa[%s]-%d\n", pci_name(pdev), i); > - irq = pci_irq_vector(pdev, i); > + irq = pci_irq_vector(pdev, msix_vec); > ret = devm_request_irq(&pdev->dev, irq, > vp_vdpa_vq_handler, > 0, vp_vdpa->vring[i].msix_name, > @@ -185,21 +195,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa) > "vp_vdpa: fail to request irq for vq %d\n", i); > goto err; > } > - vp_modern_queue_vector(mdev, i, i); > + vp_modern_queue_vector(mdev, i, msix_vec); > vp_vdpa->vring[i].irq = irq; > + msix_vec++; > } > > snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n", >pci_name(pdev)); > - irq = pci_irq_vector(pdev, queues); > + irq = pci_irq_vector(pdev, msix_vec); > ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_config_handler, 0, > vp_vdpa->msix_name, vp_vdpa); > if (ret) { > dev_err(&pdev->dev, > - "vp_vdpa: fail to request irq for vq %d\n", i); > + "vp_vdpa: fail to request irq for config, ret %d\n", > ret); As lo
Re: Re: [PATCH v10 12/14] x86/sgx: Turn on per-cgroup EPC reclamation
On Mon, Apr 08, 2024 at 11:23:21PM -0500, Haitao Huang wrote: > It's always non-NULL based on testing. > > It's hard for me to say definitely by reading the code. But IIUC > cgroup_disable command-line only blocks operations in /sys/fs/cgroup so user > space can't set up controllers and config limits, etc., for the diasabled > ones. Each task->cgroups would still have a non-NULL pointer to the static > root object for each cgroup that is enabled by KConfig, so > get_current_misc_cg() thus sgx_get_current_cg() should not return NULL > regardless 'cgroup_disable=misc'. > > Maybe @Michal or @tj can confirm? The current implementation creates root css object (see cgroup_init(), cgroup_ssid_enabled() check is after cgroup_init_subsys()). I.e. it will look like all tasks are members of root cgroup wrt given controller permanently and controller attribute files won't exist. (It is up to the controller implementation to do further optimization based on the boot-time disablement (e.g. see uses of mem_cgroup_disabled()). Not sure if this is useful for misc controller.) As for the WARN_ON(1), taking example from memcg -- NULL is best synonymous with root. It's a judgement call which of the values to store and when to intepret it. HTH, Michal signature.asc Description: PGP signature
[PATCH v4] vp_vdpa: don't allocate unused msix vectors
From: Yuxue Liu When there is a ctlq and it doesn't require interrupt callbacks,the original method of calculating vectors wastes hardware msi or msix resources as well as system IRQ resources. When conducting performance testing using testpmd in the guest os, it was found that the performance was lower compared to directly using vfio-pci to passthrough the device In scenarios where the virtio device in the guest os does not utilize interrupts, the vdpa driver still configures the hardware's msix vector. Therefore, the hardware still sends interrupts to the host os. Because of this unnecessary action by the hardware, hardware performance decreases, and it also affects the performance of the host os. Before modification:(interrupt mode) 32: 0 0 0 0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0 33: 0 0 0 0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1 34: 0 0 0 0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2 35: 0 0 0 0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config After modification:(interrupt mode) 32: 0 0 1 7 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-0 33: 36 0 3 0 PCI-MSI 32769-edge vp-vdpa[:00:02.0]-1 34: 0 0 0 0 PCI-MSI 32770-edge vp-vdpa[:00:02.0]-config Before modification:(virtio pmd mode for guest os) 32: 0 0 0 0 PCI-MSI 32768-edgevp-vdpa[:00:02.0]-0 33: 0 0 0 0 PCI-MSI 32769-edgevp-vdpa[:00:02.0]-1 34: 0 0 0 0 PCI-MSI 32770-edgevp-vdpa[:00:02.0]-2 35: 0 0 0 0 PCI-MSI 32771-edgevp-vdpa[:00:02.0]-config After modification:(virtio pmd mode for guest os) 32: 0 0 0 0 PCI-MSI 32768-edge vp-vdpa[:00:02.0]-config To verify the use of the virtio PMD mode in the guest operating system, the following patch needs to be applied to QEMU: https://lore.kernel.org/all/20240408073311.2049-1-yuxue@jaguarmicro.com Signed-off-by: Yuxue Liu Acked-by: Jason Wang --- V4: Update the title and assign values to uninitialized variables V3: delete unused variables and add validation records V2: fix when allocating IRQs, scan all queues drivers/vdpa/virtio_pci/vp_vdpa.c | 23 +-- 1 file changed, 17 insertions(+), 6 deletions(-) diff --git a/drivers/vdpa/virtio_pci/vp_vdpa.c b/drivers/vdpa/virtio_pci/vp_vdpa.c index df5f4a3bccb5..74bc8adfc7e8 100644 --- a/drivers/vdpa/virtio_pci/vp_vdpa.c +++ b/drivers/vdpa/virtio_pci/vp_vdpa.c @@ -160,7 +160,14 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa) struct pci_dev *pdev = mdev->pci_dev; int i, ret, irq; int queues = vp_vdpa->queues; - int vectors = queues + 1; + int vectors = 0; + int msix_vec = 0; + + for (i = 0; i < queues; i++) { + if (vp_vdpa->vring[i].cb.callback) + vectors++; + } + vectors++; ret = pci_alloc_irq_vectors(pdev, vectors, vectors, PCI_IRQ_MSIX); if (ret != vectors) { @@ -173,9 +180,12 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa) vp_vdpa->vectors = vectors; for (i = 0; i < queues; i++) { + if (!vp_vdpa->vring[i].cb.callback) + continue; + snprintf(vp_vdpa->vring[i].msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-%d\n", pci_name(pdev), i); - irq = pci_irq_vector(pdev, i); + irq = pci_irq_vector(pdev, msix_vec); ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_vq_handler, 0, vp_vdpa->vring[i].msix_name, @@ -185,21 +195,22 @@ static int vp_vdpa_request_irq(struct vp_vdpa *vp_vdpa) "vp_vdpa: fail to request irq for vq %d\n", i); goto err; } - vp_modern_queue_vector(mdev, i, i); + vp_modern_queue_vector(mdev, i, msix_vec); vp_vdpa->vring[i].irq = irq; + msix_vec++; } snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, "vp-vdpa[%s]-config\n", pci_name(pdev)); - irq = pci_irq_vector(pdev, queues); + irq = pci_irq_vector(pdev, msix_vec); ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_config_handler, 0, vp_vdpa->msix_name, vp_vdpa); if (ret) { dev_err(&pdev->dev, - "vp_vdpa: fail to request irq for vq %d\n", i); + "vp_vdpa: fail to request irq for config, ret %d\n", ret); goto err; } - vp_modern_config_vector(mdev, queues); + vp_modern_config_vector(mdev, msix_vec); vp_vdpa->config_irq = irq; return 0; -- 2.43.0
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
On Tue, Apr 09, 2024 at 09:34:39AM +0900, Masami Hiramatsu wrote: SNIP > > > > > > > this can be fixed by checking the syscall is called from the trampoline > > > > and prevent handle_trampoline call if it's not > > > > > > Yes, but I still do not think this makes a lot of sense. But I won't > > > argue. > > > > > > And what should sys_uretprobe() do if it is not called from the > > > trampoline? > > > I'd prefer force_sig(SIGILL) to punish the abuser ;) OK, OK, EINVAL. > > > > so the similar behaviour with int3 ends up with immediate SIGTRAP > > and not invoking pending uretprobe consumers, like: > > > > - setup uretprobe for foo > > - foo() { > > executes int 3 -> sends SIGTRAP > > } > > > > because the int3 handler checks if it got executed from the uretprobe's > > trampoline.. if not it treats that int3 as regular trap > > Yeah, that is consistent behavior. Sounds good to me. > > > > > while for uretprobe syscall we have at the moment following behaviour: > > > > - setup uretprobe for foo > > - foo() { > > uretprobe_syscall -> executes foo's uretprobe consumers > > } > > - at some point we get to the 'ret' instruction that jump into uretprobe > > trampoline and the uretprobe_syscall won't find pending uretprobe and > > will send SIGILL > > > > > > so I think we should mimic int3 behaviour and: > > > > - setup uretprobe for foo > > - foo() { > > uretprobe_syscall -> check if we got executed from uretprobe's > > trampoline and send SIGILL if that's not the case > > OK, this looks good to me. > > > > > I think it's better to have the offending process killed right away, > > rather than having more undefined behaviour, waiting for final 'ret' > > instruction that jumps to uretprobe trampoline and causes SIGILL > > > > > > > > I agree very much with Andrii, > > > > > >sigreturn() exists only to allow the implementation of signal > > > handlers. It should never be > > >called directly. Details of the arguments (if any) passed to > > > sigreturn() vary depending on > > >the architecture. > > > > > > this is how sys_uretprobe() should be treated/documented. > > > > yes, will include man page patch in new version > > And please follow Documentation/process/adding-syscalls.rst in new version, > then we can avoid repeating the same discussion :-) yep, will do thanks, jirka
Re: [PATCH v14 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings
On 08/04/2024 22:53, Tanmay Shah wrote: > From: Radhey Shyam Pandey > > Introduce bindings for TCM memory address space on AMD-xilinx Zynq > UltraScale+ platform. It will help in defining TCM in device-tree > and make it's access platform agnostic and data-driven. > > Tightly-coupled memories(TCMs) are low-latency memory that provides > predictable instruction execution and predictable data load/store > timing. Each Cortex-R5F processor contains two 64-bit wide 64 KB memory > banks on the ATCM and BTCM ports, for a total of 128 KB of memory. > > The TCM resources(reg, reg-names and power-domain) are documented for > each TCM in the R5 node. The reg and reg-names are made as required > properties as we don't want to hardcode TCM addresses for future > platforms and for zu+ legacy implementation will ensure that the > old dts w/o reg/reg-names works and stable ABI is maintained. > > It also extends the examples for TCM split and lockstep modes. > > Signed-off-by: Radhey Shyam Pandey > Signed-off-by: Tanmay Shah Reviewed-by: Krzysztof Kozlowski Best regards, Krzysztof