Hi all,

The Linux kernel's P2P DMA infrastructure is mature, but it currently offers
little observability. Without manually adding logs, there is no direct way to
see how many P2P transfers occur, which paths they take, or how they perform.
P2PDMA activity cannot be clearly observed from user space, which makes the
following tasks difficult:

- Diagnose the reasons why P2PDMA may not work (or perform poorly).

- Verify whether the P2PDMA mapping uses the expected type (BUS_ADDR or 
THRU_HOST_BRIDGE)

- Monitor the use of P2PDMA in production environments

- Detect potential memory leaks (unmapped allocations)

P2PDMA is a subtle feature. When a P2PDMA mapping cannot use BUS_ADDR (the
direct PCIe switch path), it silently falls back to THRU_HOST_BRIDGE, routing
traffic through the host bridge. This significantly reduces performance
(usually by 10 times or more), but it cannot be detected from user space.

Therefore, I plan to export some metrics to user space to make P2PDMA
activity easier to observe.
This series adds three layers of observability:

1. Tracepoints (5 events, optional, no overhead when disabled)

- p2p_dma_alloc: P2P memory allocation

- p2p_dma_free: P2P memory release

- p2p_dma_map: P2P DMA mapping (including client/provider, mapping type,
  PCIe distance and process information)

- p2p_dma_unmap: P2P DMA unmapping

- p2p_map_type_change: New mapping type calculated (xarray cache miss)

All tracepoints include the calling process (comm and pid), enabling
per-process tracking of P2PDMA activity.

Example:

$ cat /sys/kernel/debug/tracing/trace | grep p2p_dma_map

nvme[1234] map nvme0 -> p2p_mem type=BUS_ADDR dist=4

python[5678] map nvme1 -> p2p_mem type=THRU_HOST_BRIDGE dist=8
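
As a rough illustration of how a user-space tool might consume these events,
here is a minimal sketch that counts mapping types per process and lists the
processes that hit the THRU_HOST_BRIDGE fallback. It assumes the line format
shown in the example above; the final format may differ, and the helper names
(count_map_types, host_bridge_users) are hypothetical.

```python
import re
from collections import Counter

# Assumed event format, taken from the example output above:
#   comm[pid] map <client> -> <provider> type=<TYPE> dist=<N>
MAP_LINE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]\s+map\s+(?P<client>\S+)\s+->\s+"
    r"(?P<provider>\S+)\s+type=(?P<type>\w+)\s+dist=(?P<dist>\d+)"
)


def count_map_types(lines):
    """Count p2p_dma_map events per (comm, pid, mapping type)."""
    counts = Counter()
    for line in lines:
        m = MAP_LINE.search(line)
        if m:  # lines that do not look like map events are skipped
            counts[(m["comm"], int(m["pid"]), m["type"])] += 1
    return counts


def host_bridge_users(counts):
    """Return sorted (comm, pid) pairs that used THRU_HOST_BRIDGE."""
    return sorted({(c, p) for (c, p, t) in counts if t == "THRU_HOST_BRIDGE"})
```

Feeding it the lines of the trace file (for example, from open() on
/sys/kernel/debug/tracing/trace) would give a per-process view of which
mappings took the slow path.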

2. Debugfs (global cumulative counters, always available)

- /sys/kernel/debug/pci-p2pdma/

- 11 counters: total_mappings, bus_addr_mappings, host_bridge_mappings,
  total_allocations, error_count, etc.

- Enables calculating the "BUS_ADDR ratio" to quantify how effectively
  P2PDMA is being used.
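
The "BUS_ADDR ratio" is simply bus_addr_mappings / total_mappings. A minimal
sketch of computing it from user space follows. It assumes each counter is
exposed as a separate file holding one decimal integer, which is a layout
assumption and not something settled by this series; read_counter and
bus_addr_ratio are hypothetical helper names.

```python
from pathlib import Path

# Directory proposed in this series; the one-file-per-counter layout
# is an assumption made for this sketch.
DEBUGFS_DIR = Path("/sys/kernel/debug/pci-p2pdma")


def read_counter(directory, name):
    """Read one counter file and return its value as an int."""
    return int((Path(directory) / name).read_text().strip())


def bus_addr_ratio(directory=DEBUGFS_DIR):
    """Return bus_addr_mappings / total_mappings, or None if no mappings yet."""
    total = read_counter(directory, "total_mappings")
    if total == 0:
        return None  # avoid dividing by zero before any mapping happened
    return read_counter(directory, "bus_addr_mappings") / total
```

A ratio well below 1.0 would suggest that a large share of mappings are
falling back to THRU_HOST_BRIDGE.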

3. Sysfs (per-device statistics, safe for production environments)

- /sys/bus/pci/devices/*/p2pmem/stats/

- 4 attributes: alloc_count, free_count, mapped_bytes, peak_mapped_bytes
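
For the leak-detection use case, one possible check is alloc_count minus
free_count per device. The sketch below assumes each attribute is a file
containing a single decimal integer (an assumption, not a settled format);
the function names are hypothetical. A non-zero result only means
allocations are currently outstanding; it points to a leak only if it
persists after the workload has finished.

```python
from pathlib import Path


def read_attr(stats_dir, name):
    """Read one stats attribute and return its value as an int."""
    return int((Path(stats_dir) / name).read_text().strip())


def outstanding_allocations(stats_dir):
    """Allocations not yet freed: alloc_count - free_count."""
    return read_attr(stats_dir, "alloc_count") - read_attr(stats_dir, "free_count")


def scan_devices(sysfs_root="/sys/bus/pci/devices"):
    """Map each PCI device with a p2pmem/stats directory to its outstanding count."""
    result = {}
    for stats_dir in sorted(Path(sysfs_root).glob("*/p2pmem/stats")):
        device = stats_dir.parent.parent.name  # e.g. the PCI address directory
        result[device] = outstanding_allocations(stats_dir)
    return result
```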

Performance impact

- Tracepoints: static branch, zero overhead when disabled (the default).

- Debugfs/sysfs: atomic64_t counters, no locking, negligible overhead.

- With all observability disabled, the P2PDMA hot path remains unchanged.


I would appreciate feedback on:

1. Is the overall solution worth implementing?
2. Is the set of tracepoints appropriate? Any events I'm missing?
3. Are the tracepoint fields sufficient for debugging?
4. Is the debugfs/sysfs interface design acceptable?
5. Any concerns about the implementation approach?
