On Wed, Jan 22, 2020 at 9:07 AM Robin Murphy <robin.mur...@arm.com> wrote:
>
> On 21/01/2020 5:21 pm, Cong Wang wrote:
> > On Tue, Jan 21, 2020 at 3:11 AM Robin Murphy <robin.mur...@arm.com> wrote:
> >>
> >> On 18/12/2019 4:39 am, Cong Wang wrote:
> >>> The IOVA cache algorithm implemented in IOMMU code does not
> >>> exactly match the original algorithm described in the paper
> >>> "Magazines and Vmem: Extending the Slab Allocator to Many
> >>> CPUs and Arbitrary Resources".
> >>>
> >>> Particularly, it doesn't need to free the loaded empty magazine
> >>> when trying to put it back to global depot. To make it work, we
> >>> have to pre-allocate magazines in the depot and only recycle them
> >>> when all of them are full.
> >>>
> >>> Before this patch, rcache->depot[] contains either full or
> >>> freed entries, after this patch, it contains either full or
> >>> empty (but allocated) entries.
> >>
> >> How much additional memory overhead does this impose (particularly on
> >> systems that may have many domains mostly used for large, long-term
> >> mappings)? I'm wary that trying to micro-optimise for the "churn network
> >> packets as fast as possible" case may penalise every other case,
> >> potentially quite badly. Lower-end embedded systems are using IOMMUs in
> >> front of their GPUs, video codecs, etc. precisely because they *don't*
> >> have much memory to spare (and thus need to scrape together large
> >> buffers out of whatever pages they can find).
> >
> > The calculation is not complicated: 32 * 6 * 129 * 8 = 198144 bytes,
> > which is roughly 192K, per domain.
>
> Theoretically. On many architectures, kmalloc(1032,...) is going to
> consume rather more than 1032 bytes. Either way, it's rather a lot of
> memory to waste in the many cases where it will never be used at all.
If this is a concern, we can make IOVA_MAG_SIZE tunable in Kconfig. I
myself want a larger IOVA_MAG_SIZE, at least for experiments. You know,
servers now have 100G+ memory; 192K is nearly nothing...

>
> >> But on the other hand, if we were to go down this route, then why is
> >> there any dynamic allocation/freeing left at all? Once both the depot
> >> and the rcaches are preallocated, then AFAICS it would make more sense
> >> to rework the overflow case in __iova_rcache_insert() to just free the
> >> IOVAs and swap the empty mag around rather than destroying and
> >> recreating it entirely.
> >
> > It's because the algorithm requires a swap(), which can't be done with
> > statically allocated magazines. I had the same thought initially but
> > gave it up quickly when I realized this.
>
> I'm not sure I follow... we're replacing a "full magazine" pointer with
> an "empty magazine" pointer regardless of where that empty magazine came
> from. It would be trivial to preallocate an 'overflow' magazine for the
> one remaining case of handling a full depot, although to be honest, at
> that point it's probably most efficient to just free the pfns directly
> from cpu_rcache->loaded while still under the percpu lock and be done
> with it.

I don't follow you either. I thought you were suggesting completely
getting rid of dynamic memory allocations, like:

@@ -31,7 +31,7 @@ struct iova_cpu_rcache;
 struct iova_rcache {
 	spinlock_t lock;
 	unsigned long depot_size;
-	struct iova_magazine *depot[MAX_GLOBAL_MAGS];
+	struct iova_magazine depot[MAX_GLOBAL_MAGS];
 	struct iova_cpu_rcache __percpu *cpu_rcaches;
 };

If so, I don't see how I can do swap() with pointers like
cpu_rcache->prev. More importantly, this doesn't save any memory for
your embedded case either, so I don't know why you bring it up.

>
> > If you are suggesting to change the algorithm, it is not a goal of this
> > patchset.
> > I do have a plan to search for a better algorithm, as IOMMU
> > performance still sucks (compared to no IOMMU) after this patchset,
> > but once again, I do not want to change it in this patchset.
>
> "Still sucks" is probably the most interesting thing here - the headline
> number for the original patch series was that it reached about 98% of
> bypass performance on Intel VT-d[1]. Sounds like it would be well worth
> digging in to what's different about your system and/or workload.

Just FYI: the latency is 10x/20x worse with the IOMMU enabled on the AMD
servers here (mlx5 driver for ethernet, if it matters). The throughput
is roughly the same. The patchset you linked only measures throughput.

>
> > (My ultimate goal is to find a spinlock-free algorithm, otherwise there is
> > no way to make it close to no-IOMMU performance.)
> >
> >>
> >> Perhaps there's a reasonable compromise wherein we don't preallocate,
> >> but still 'free' empty magazines back to the depot, such that busy
> >> domains will quickly reach a steady-state. In fact, having now dug up
> >> the paper at this point of writing this reply, that appears to be what
> >> fig. 3.1b describes anyway - I don't see any mention of preallocating
> >> the depot.
> >
> > That paper missed a lot of things; it doesn't even recommend a size
> > for the depot or the percpu cache. For an implementation, we still
> > have to think about those details, including whether to preallocate
> > memory.
>
> Heh, "missed"... To my reading, the original design actually describes a
> depot consisting of two unbounded (but garbage-collected) lists and a
> dynamically-adjusted magazine size - I'd hardly blame the authors for

I must have missed the dynamic size part; I tried to tune IOVA_MAG_SIZE
manually when I initially thought it was over-caching.

> not discussing an implementation from 15 years in the future of a
> fixed-size design *based on* their concept ;)

Are you saying a fixed-size implementation is wrong? I'd like to hear
more! :)
I am curious how to dynamically adjust the magazine size too, as I still
don't believe one IOVA_MAG_SIZE fits all, and also how to balance the
percpu cache. Can you elaborate? Thanks.

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu