On Fri, Jan 23, 2026 at 02:53:59PM -0800, Matthew Brost wrote:
> > That's a 2x improvement in overall full operation? Wow!
> >
> > Did you look at how non-iommu cases perform too?
> >
>
> Like intel_iommu=off on the kernel command line? I haven't checked that but can.
iommu.passthrough=1
This is generally what we recommend for anyone who cares about
performance more than iommu protection. It leaves the iommu HW turned
on, which x86 requires for other reasons, but eliminates the
performance cost to DMA.
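
If you want to double check what mode a device actually ended up in,
something along these lines can be done from driver code (rough
sketch; iommu_get_domain_for_dev() and IOMMU_DOMAIN_IDENTITY are the
upstream names, dev_is_passthrough() is made up for illustration):

#include <linux/iommu.h>

/* Rough sketch: report whether the device got an identity
 * (passthrough) default domain, i.e. no translation on its DMA path.
 */
static bool dev_is_passthrough(struct device *dev)
{
        struct iommu_domain *domain = iommu_get_domain_for_dev(dev);

        return domain && domain->type == IOMMU_DOMAIN_IDENTITY;
}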
> > I think we can do better still for the non-cached platforms as I have
> > a way in mind to batch up lines and flush the line instead of flushing
> > for every 8 byte IOPTE written. Some ARM folks have been talking about
> > this problem too..
>
> Yes, prior to the IOMMU changes I believe the baseline was ~330us,
> so dma-map/unmap are still way slower than before. If this affects
> platforms other than Intel x86 there will be complaints everywhere
> until the entire kernel moves to the IOVA alloc model.
I have managed to get a test showing that when cache flushing is
turned on the new code is 50% slower. I'm investigating this..
map_pages
pgsz , avg new,old ns , min new,old ns , min % (+ve is better)
2^12 ,    331,249     ,    289,214     ,       -35.35
2^21 ,    335,243     ,    306,222     ,       -37.37
2^30 ,    226,238     ,    205,215     ,         4.04
# test_map_unmap_benchmark:
unmap_pages
pgsz , avg new,old ns , min new,old ns , min % (+ve is better)
2^12 ,    389,272     ,    347,237     ,       -46.46
2^21 ,    321,261     ,    297,239     ,       -24.24
2^30 ,    237,251     ,    214,228     ,         6.06
So it looks to me like this is isolated to the Intel GPU for the
moment, because it is the only device that would use the cache
flushing flow until we convert ARM.
FWIW, on my system enabling cache flushing takes these operations
from 60ns to 250ns; it has a huge, huge cost to these flows.
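
To sketch the batching idea from the quote above (illustrative only,
not what I intend to post; clflush_cache_range() is the x86 helper
and write_iopte_run() is a made up name):

#include <linux/types.h>
#include <linux/compiler.h>
#include <asm/cacheflush.h>

/* Write a run of 8-byte IOPTEs first, then flush the covering
 * cachelines once, instead of flushing after every single IOPTE
 * store. A 64 byte line holds 8 IOPTEs, so this cuts the flush
 * count by up to 8x.
 */
static void write_iopte_run(u64 *ioptes, const u64 *vals, unsigned int n)
{
        unsigned int i;

        for (i = 0; i < n; i++)
                WRITE_ONCE(ioptes[i], vals[i]);

        clflush_cache_range(ioptes, n * sizeof(*ioptes));
}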
> Also, another question: does IOVA alloc support modes similar to
> dma_map_resource between peer devices? We also do that and I haven't
> modified that code or checked it for perf regressions.
Yes, and no.. The API does, but Christoph doesn't want to let arbitrary
drivers use it. So you need to figure out some way to get there.
For reference, Leon added dma_buf_phys_vec_to_sgt(), which shows this
flow being used to create a sg_table.
There are also hmm helpers for the mapping if this is in a hmm
context.
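
The shape of that flow is roughly the below (heavily trimmed sketch:
the mmap locking and the mmu_interval_read_retry() loop are omitted,
error unwind is skipped, and fault_and_map() is a made up name; see
Documentation/mm/hmm.rst for the full pattern):

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/dma-mapping.h>

static int fault_and_map(struct device *dev,
                         struct mmu_interval_notifier *ni,
                         unsigned long start, unsigned long end,
                         unsigned long *pfns, dma_addr_t *dma)
{
        struct hmm_range range = {
                .notifier = ni,
                .start = start,
                .end = end,
                .hmm_pfns = pfns,
                .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
        };
        unsigned long i, npages = (end - start) >> PAGE_SHIFT;
        int ret;

        range.notifier_seq = mmu_interval_read_begin(ni);

        /* Caller must hold the mmap read lock across the fault */
        ret = hmm_range_fault(&range);
        if (ret)
                return ret;

        /* Map each faulted page; this is the step that would move
         * to the IOVA alloc model.
         */
        for (i = 0; i < npages; i++) {
                struct page *page = hmm_pfn_to_page(pfns[i]);

                dma[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
                                      DMA_BIDIRECTIONAL);
                if (dma_mapping_error(dev, dma[i]))
                        return -EIO;
        }
        return 0;
}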
A PCI device calling map_resource is incorrect usage of the DMA API,
but it was the only option until now.
Jason