Garrett D'Amore wrote:
> Interesting... the DDI dma handle mechanism has greatly improved, I
> believe, since it was first introduced. I've not done any perf.
> analysis, but I am a bit surprised that you didn't notice a nasty
> regression in perf. when using a fresh ddi dma handle allocation.
>
> That said, I wonder if a kmem_cache could be used to achieve a high
> level of efficiency here.
For x86, the handles come out of a kmem cache in S10u2, and the
allocation includes preallocated space for some cookies. There are a
couple of different things which could possibly improve performance in
this area. The current interface really only impacts NIC performance,
since you're doing a bunch of small binds in a short amount of time. To
me, it makes sense to look at a special bind-like interface for NICs,
maybe as part of GLD? I would think TX binds would be more important
than RX binds? Not sure if that matters though.

One of the interesting things to look at, if you have a
pre-allocated-handle style interface, would be to pass in the
worst-case SGL and pre-allocate the cookie space *and* the IOMMU space
(if you're using one) to improve bind performance. The driver would
have direct access to the cookie buffer, so you wouldn't have to walk
through the cookies. These would have to be small binds which are
carefully managed so you don't consume the IOMMU space. If you don't do
a pre-alloc, you would have the driver pass in the cookie space.

Other than reserving IOMMU space (and getting rid of the old dvma
reserve stuff on SPARC), I'm not sure how much this will really buy you
in performance. One of the big hits is that the DDI has to take the
virtual address passed in and find the physical address for each page,
which is a pretty expensive operation. In the storage stack years ago,
since they were already looking at the PA earlier in the stack, they
passed the PAs down in a shadow list so the DDI could skip calling into
the VM. I'm not sure if there is an opportunity to do this in the
networking stack (e.g. sockfs, etc.); i.e., if sockfs does a copy for
TX (I have no idea if it does, it's just an example), why not copy into
a buffer whose PA you already know? If you could add a shadow list to
an mblk, and have GLD take care of building the cookie list when a
shadow list is present, I think you could get a decent perf increase.
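A minimal sketch of what that might look like -- everything here is
hypothetical, mblk_shadow_t and gld_bind_from_shadow() are made-up
names and nothing like them exists in the DDI today, and it ignores the
IOMMU case, where the PAs would still have to be mapped into IOMMU
space:

/*
 * Whoever copies the TX data (sockfs in the example above) already
 * touched the buffer, so it records the physical address of each
 * backing page in a shadow list that travels with the mblk (passed in
 * explicitly here for simplicity).  A GLD-level bind could then fill
 * in the cookie list directly instead of calling into the VM to
 * translate the virtual address of every page.
 */
#include <sys/types.h>
#include <sys/param.h>
#include <sys/sysmacros.h>
#include <sys/stream.h>
#include <sys/strsun.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

typedef struct mblk_shadow {
	uint_t		msh_npages;	/* pages backing b_rptr..b_wptr */
	uint64_t	msh_pa[1];	/* PA of each of those pages */
} mblk_shadow_t;

/*
 * Build the cookie list for one mblk from its shadow list, if it has
 * one.  Returns the number of cookies filled in, or 0 to tell the
 * caller to fall back to a normal ddi_dma_addr_bind_handle().
 */
static uint_t
gld_bind_from_shadow(mblk_t *mp, mblk_shadow_t *msh,
    ddi_dma_cookie_t *cookies, uint_t max_cookies)
{
	size_t	len = MBLKL(mp);
	size_t	off = (uintptr_t)mp->b_rptr & PAGEOFFSET;
	uint_t	i;

	if (msh == NULL || msh->msh_npages > max_cookies)
		return (0);

	for (i = 0; i < msh->msh_npages && len != 0; i++) {
		size_t	chunk = MIN(len, PAGESIZE - off);

		cookies[i].dmac_laddress = msh->msh_pa[i] + off;
		cookies[i].dmac_size = chunk;
		len -= chunk;
		off = 0;	/* only the first page is offset */
	}
	return (i);
}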
MRJ

> Admittedly, from all my own experiences, locking issues (the length of
> the path while the lock is held) are one of the major bottlenecks for
> the 10Gb driver I've looked at in detail (namely nxge).
>
> But, I think it's also true that if customers see that one of the top
> locks is your driver's lock, they shouldn't be surprised -- *if* the
> lock is not *contended*. (Uncontended locks shouldn't be a problem.)
>
> -- Garrett
>
> Andrew Gallatin wrote:
>> Garrett D'Amore wrote:
>>
>>> Intelligent use of DDI compliant DMA (reusing handles, and sometimes
>> It is funny that you mention re-using handles. That's just about
>> my biggest pet peeve with the Solaris DMA DDI.
>>
>> <rant on>
>> Handles are the most annoying part of the DDI for a NIC driver trying
>> to map transmits for DMA. They either require you to do a
>> handle allocation in the critical path, or require you to set up a
>> fairly complex infrastructure to re-use them, or require you to use
>> fairly coarse-grained locking in your tx routine so you can associate
>> a handle with every tx ring entry. The locking problems inherent in
>> handles really make me wish for a pci_map_page() sort of interface
>> which is fire and forget.
>>
>> FWIW, I used to have a fairly clever handle management policy in my
>> 10GbE driver's transmit path. My transmit runs almost entirely
>> unlocked, and acquires the tx lock only when passing the descriptors
>> to the NIC. Using a handle per ring entry would put a lock around a
>> *lot* of code which can otherwise run lockless. So, at plumb time,
>> I'd allocate a pool of handles. Each transmit would briefly acquire
>> the pool lock and remove a pre-allocated handle from the pool. If the
>> pool was empty, it would be grown (after dropping the lock). The
>> transmit done path would build a list of free handles, and then
>> acquire the pool mutex and add them to the free pool in one operation.
>> The pool mutex was never held for more than one list insertion
>> or removal operation.
>>
>> So I'm not sure how I could have handled the free pool locking any
>> better. Yet I would still get complaints from customers saying "when
>> we run lockstat, one of the top locks is <address of my driver's tx
>> handle lock>, please fix the locking in your driver". I finally got
>> sick and tired of getting these reports and I now do a
>> ddi_dma_alloc_handle() each time. Performance didn't get much worse,
>> and now lockstat points to the ddi system, so the customers have
>> stopped complaining. Sigh.
>>
>> </rant>
>>
>> Drew
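For concreteness, a rough sketch of the kind of pre-allocated handle
pool Drew describes above; the structure and function names are made
up, and only the ddi_dma_alloc_handle() call is a real DDI interface:

/*
 * Handles are pre-allocated (e.g. at plumb time), each transmit takes
 * one off a free list under a briefly held mutex, and the transmit-done
 * path returns a whole batch of handles with a single lock acquisition.
 */
#include <sys/types.h>
#include <sys/kmem.h>
#include <sys/ksynch.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

typedef struct tx_handle {
	struct tx_handle	*th_next;
	ddi_dma_handle_t	th_dma;
} tx_handle_t;

typedef struct tx_handle_pool {
	kmutex_t	thp_lock;
	tx_handle_t	*thp_free;	/* singly linked free list */
} tx_handle_pool_t;

/* Take one pre-allocated handle; the lock covers one list removal. */
static tx_handle_t *
tx_handle_get(tx_handle_pool_t *pool)
{
	tx_handle_t *th;

	mutex_enter(&pool->thp_lock);
	if ((th = pool->thp_free) != NULL)
		pool->thp_free = th->th_next;
	mutex_exit(&pool->thp_lock);
	return (th);	/* NULL means the caller should grow the pool */
}

/*
 * Return a batch of handles that the transmit-done path linked together
 * without the lock; one lock acquisition splices the whole batch back.
 */
static void
tx_handle_put_batch(tx_handle_pool_t *pool, tx_handle_t *head,
    tx_handle_t *tail)
{
	mutex_enter(&pool->thp_lock);
	tail->th_next = pool->thp_free;
	pool->thp_free = head;
	mutex_exit(&pool->thp_lock);
}

/* Grow the pool by one handle, e.g. when tx_handle_get() returns NULL. */
static int
tx_handle_grow(tx_handle_pool_t *pool, dev_info_t *dip,
    ddi_dma_attr_t *attrp)
{
	tx_handle_t *th = kmem_zalloc(sizeof (*th), KM_SLEEP);

	if (ddi_dma_alloc_handle(dip, attrp, DDI_DMA_SLEEP, NULL,
	    &th->th_dma) != DDI_SUCCESS) {
		kmem_free(th, sizeof (*th));
		return (DDI_FAILURE);
	}
	tx_handle_put_batch(pool, th, th);
	return (DDI_SUCCESS);
}

The pool mutex is only ever held for a single list removal or splice,
which is the behaviour Drew describes.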
