On Wed, Jul 01, 2026 at 12:19:19AM -0700, Eric Biggers wrote:
> No, this didn't address my feedback. It moved things around but still
> adds additional overhead for everyone to support an out-of-tree driver,
> which also hasn't been shown to be any better than just using the CPU.

Eric, thanks for the fast reply.

Overhead: for a non-user the only cost is the data_unit_size field plus
one zeroing store in set_tfm()/ON_STACK; the en/decrypt paths are
untouched. A dun() user pays one indirect dispatch into the template per
request plus a scatterwalk step and IV copy per unit -- the same per-DU
bookkeeping the consumer already open-codes today.

On the driver: I agree that pushing code optimized for an out-of-tree
driver is wrong, but I don't think that's the case here -- this helps any
async crypto engine, and there are in-tree async xts(aes) ones that
dm-crypt is eligible to use today: HiSilicon SEC2, TI DTHEv2, Atmel (I
don't have any of them to test on).

To bound the win, I used cryptd as a pure async carrier and moved the
per-DU split inside it, then ran dm-crypt + fio: batching cut CPU usage
by ~30% on 128k I/O (large batch) and had zero impact on 4k -- so the
saving is dispatch, not crypto. A real engine that submits a whole
multi-DU request in one descriptor avoids that per-DU dispatch entirely,
so it saves at least that.

So the question for me is what the bar is: does landing the API and dun()
template now (with the in-tree consolidation it already buys dm-crypt and
blk-crypto-fallback), with a throughput demonstration deferred to a real
async provider, work for you?

Thanks,
Leonid

