On Mon, Apr 27, 2026, Herbert Xu wrote:
> Yes I'm happy with this since it could also work for IPsec.
>
> But before you invest too much energy in it it would be helpful
> if you can get some proof-of-concept performance numbers so that
> your effort is not wasted down the track.

I ran a proof-of-concept benchmark on an XTS-AES-256 dm-crypt
volume backed by a hardware crypto accelerator, comparing
per-sector submission against multi-data-unit submission.

Setup: single-core ARM64, fio 4K sequential writes, buffered I/O
with end_fsync (representative of filesystem-over-dm-crypt
workloads). I ran two rounds per configuration; the results were
consistent (< 2% variation between rounds).

Throughput (averaged):

  per-sector:       286 MB/s, 73K IOPS
  multi-data-unit:  340 MB/s, 87K IOPS  (+19%)

CPU cycles (perf, 30s sample):

  per-sector:       59.8 billion cycles
  multi-data-unit:  36.0 billion cycles  (-40%)
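
As a sanity check on the figures above, here is a small Python
sketch that recomputes the deltas and derives a rough
cycles-per-byte cost. It rests on two assumptions that are not
stated above: that "MB/s" is fio's MiB/s (the IOPS figures match
under that reading), and that the 30s perf window ran at the
averaged steady-state throughput. The helper names are
illustrative only.

```python
# Recompute the reported deltas from the raw figures.
MIB = 1024 * 1024     # assumption: "MB/s" above means MiB/s
BLOCK = 4096          # fio block size (4K)
SAMPLE_S = 30         # perf sample length in seconds

runs = {
    "per-sector":      {"mb_s": 286, "cycles": 59.8e9},
    "multi-data-unit": {"mb_s": 340, "cycles": 36.0e9},
}

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

def iops(mb_s):
    """IOPS implied by a throughput at the 4K block size."""
    return mb_s * MIB / BLOCK

def cycles_per_byte(run):
    """Cycles per byte, assuming steady throughput during the sample."""
    return run["cycles"] / (run["mb_s"] * MIB * SAMPLE_S)

base, multi = runs["per-sector"], runs["multi-data-unit"]
throughput_delta = pct_change(base["mb_s"], multi["mb_s"])
cycles_delta = pct_change(base["cycles"], multi["cycles"])
cpb_base = cycles_per_byte(base)
cpb_multi = cycles_per_byte(multi)
cpb_delta = pct_change(cpb_base, cpb_multi)

print(f"throughput: {throughput_delta:+.1f}%")
print(f"cycles:     {cycles_delta:+.1f}%")
print(f"IOPS:       {iops(286):.0f} -> {iops(340):.0f}")
print(f"cycles/B:   {cpb_base:.2f} -> {cpb_multi:.2f} ({cpb_delta:+.1f}%)")
```

Under those assumptions the cost works out to roughly 6.6 versus
3.4 cycles per byte, i.e. about 49% fewer cycles per byte written.
That is larger than the raw 40% cycle reduction because the
multi-data-unit run also moved more data in the same window.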

The baseline is partially CPU-bound. The perf profile shows
dm-crypt and crypto API per-request overhead consuming roughly
25% of CPU cycles in the per-sector case:

  4.3%  crypto dispatch
  4.1%  async completion callback
  3.5%  completion collection
  3.3%  kfree
  2.9%  per-bio context lookup
  2.8%  crypt_convert loop
  1.6%  slab allocation
  1.3%  mempool free
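
For reference, the listed entries can be totalled with a short
Python sketch. The grouping into three buckets is my own
illustrative labelling, not something perf reports.

```python
# Sum the per-request overhead entries from the per-sector profile.
# The third field is a hypothetical bucket chosen for illustration.
profile = [
    ("crypto dispatch",           4.3, "dispatch/completion"),
    ("async completion callback", 4.1, "dispatch/completion"),
    ("completion collection",     3.5, "dispatch/completion"),
    ("kfree",                     3.3, "allocation/free"),
    ("per-bio context lookup",    2.9, "bookkeeping"),
    ("crypt_convert loop",        2.8, "bookkeeping"),
    ("slab allocation",           1.6, "allocation/free"),
    ("mempool free",              1.3, "allocation/free"),
]

# Round to one decimal to avoid floating-point noise in the sums.
total = round(sum(pct for _, pct, _ in profile), 1)

by_group = {}
for _, pct, group in profile:
    by_group[group] = round(by_group.get(group, 0.0) + pct, 1)

print(f"total: {total}%")
for group, pct in by_group.items():
    print(f"  {group}: {pct}%")
```

The eight entries total 23.8%, in line with the rough one-quarter
share quoted above. About half of that (11.9%) is request
dispatch and completion handling, which is paid once per request
rather than once per byte.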

With multi-data-unit submission, these functions drop out of the
top of the profile. The bottleneck shifts to DMA mapping and page
cache operations. CPU0 kernel time drops from 78% to 40%, with
the freed cycles appearing as iowait.

The 19% throughput gain (vs 40% CPU reduction) reflects that
the system was partially IO-bound even in the baseline. The
optimization removes the CPU bottleneck, allowing the system
to fully saturate the IO path.

I will prepare the patch series against mainline.

Thanks,
Leonid Ravich
