On Mon, Apr 27, 2026, Herbert Xu wrote:
> Yes I'm happy with this since it could also work for IPsec.
>
> But before you invest too much energy in it it would be helpful
> if you can get some proof-of-concept performance numbers so that
> your effort is not wasted down the track.

I ran a proof-of-concept benchmark on an XTS-AES-256 dm-crypt volume
backed by a hardware crypto accelerator, comparing per-sector
submission against multi-data-unit submission.

Setup: single-core ARM64, fio 4K sequential writes, buffered IO with
end_fsync (representative of filesystem-over-dm-crypt workloads).
Two rounds per configuration; results were consistent (< 2% variance
between rounds).

Throughput (averaged):
  per-sector:       286 MB/s, 73K IOPS
  multi-data-unit:  340 MB/s, 87K IOPS (+19%)

CPU cycles (perf, 30s sample):
  per-sector:       59.8 billion cycles
  multi-data-unit:  36.0 billion cycles (-40%)

The baseline is partially CPU-bound. The perf profile shows dm-crypt
and crypto API per-request overhead consuming roughly 25% of CPU
cycles in the per-sector case:

  4.3%  crypto dispatch
  4.1%  async completion callback
  3.5%  completion collection
  3.3%  kfree
  2.9%  per-bio context lookup
  2.8%  crypt_convert loop
  1.6%  slab allocation
  1.3%  mempool free

With multi-data-unit, these functions drop out of the top profile.
The bottleneck shifts to DMA mapping and page cache operations.
CPU0 kernel time drops from 78% to 40%, with the freed cycles
appearing as iowait.

The 19% throughput gain (vs 40% CPU reduction) reflects that the
system was partially IO-bound even in the baseline. The optimization
removes the CPU bottleneck, allowing the system to fully saturate
the IO path.

I will prepare the patch series against mainline.

Thanks,
Leonid Ravich
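
For anyone wanting to cross-check the derived figures quoted above
(+19% throughput, -40% cycles, roughly 25% per-request overhead),
here is a minimal Python sketch. The input numbers are copied from
the results in this mail; the function and variable names are
illustrative only and are not part of any benchmark tooling.

```python
# Cross-check of the percentages reported in the benchmark summary.
# All inputs are taken from the mail; names here are hypothetical.

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# Throughput (MB/s) and IOPS (thousands), averaged over two rounds.
throughput_gain = pct_change(286, 340)
iops_gain = pct_change(73, 87)

# CPU cycles (billions) over the 30s perf sample.
cycle_change = pct_change(59.8, 36.0)

# Per-request overhead entries from the per-sector perf profile (%).
overhead = {
    "crypto dispatch": 4.3,
    "async completion callback": 4.1,
    "completion collection": 3.5,
    "kfree": 3.3,
    "per-bio context lookup": 2.9,
    "crypt_convert loop": 2.8,
    "slab allocation": 1.6,
    "mempool free": 1.3,
}
overhead_total = sum(overhead.values())

print(f"throughput: {throughput_gain:+.1f}%")
print(f"IOPS:       {iops_gain:+.1f}%")
print(f"cycles:     {cycle_change:+.1f}%")
print(f"listed per-request overhead: {overhead_total:.1f}%")
```

The listed profile entries add up to just under 24%, which is in
line with the "roughly 25%" figure given above.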
