Brian Xu - Sun Microsystems - Beijing China wrote:
Garrett D'Amore wrote:
Brian Xu - Sun Microsystems - Beijing China wrote:
Hi there,
I have a question here:
Why do all of the NIC drivers have to bcopy the MBLKs for transmit?
(some of them bcopy always, and some others bcopy under a threshold
of the packet length).
I think one of the reasons is that the overhead of setting up DMA on the
fly is greater than the overhead of bcopy for short packets. I want
to know if this is the case and if there are any other reasons.
Yes. For any reasonably sized packet (ETHERMTU or smaller), bcopy is
faster on *all* recent hardware. (This is confirmed on even an older
300MHz Via C3.) (Hmm... I've heard that for some Niagara systems
this might not be true, however. But I've not tested it myself.)
Even with bcopy, there is still a need for a pre-bound DMA resource. So the
threshold for the bcopy size depends on whether the overhead of a DMA
bind on the fly is greater than the overhead of a bcopy to a
pre-bound DMA address. The hardware itself only knows that DMA is
needed.
You pay for the pre-bound DMA setup at attach() time, so it doesn't play a
role. So you have to compare the cost of bcopy() vs. the cost of
ddi_dma_addr_setup(). There is a lot of additional complexity for tx
as well, because you have to deal with the fact that packets may cross
page boundaries and require multiple DMA cookies, and not all drivers
can deal well with multiple descriptors per packet.
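To put a number on the page-crossing point above, here is a minimal userland sketch (not driver code) that computes the worst-case cookie count for a virtually contiguous buffer. It assumes every page the buffer touches is physically discontiguous and that the page size is a power of two; the function name worst_case_cookies() is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Worst-case number of DMA cookies for a virtually contiguous buffer,
 * assuming no two pages it touches are physically adjacent.
 * pagesize must be a power of two. A zero-length buffer needs no cookie.
 */
size_t
worst_case_cookies(uintptr_t vaddr, size_t len, size_t pagesize)
{
	uintptr_t mask = ~((uintptr_t)pagesize - 1);
	uintptr_t first_page, last_page;

	if (len == 0)
		return (0);

	first_page = vaddr & mask;		/* page holding the first byte */
	last_page = (vaddr + len - 1) & mask;	/* page holding the last byte */

	return ((size_t)((last_page - first_page) / pagesize) + 1);
}
```

With 4K pages, a 1514-byte frame that starts near the end of a page needs two cookies, and a 9000-byte jumbo frame can need four, so a device limited to one descriptor per packet would have to copy in both cases.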
I still don't know if there are reasons other than the overhead
of DMA setup.
Complexity. There are various concerns involved, such as a race with
_fini() and esballoc (for the rx path).
Also you have to worry about alignment. Not all hardware can transmit
arbitrarily aligned packets. With all the work you wind up doing to
make this work correctly, you get very little performance benefit. So
it's rarely worth the pain and suffering. For regular MTU frames, it
just isn't worth it, ever. On reasonably modern hardware, anyway.
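The transmit-side trade-offs in this thread (packet length, alignment, descriptors per packet) can be folded into one decision. The following is only a userland sketch under assumptions: tx_limits_t, tx_choose() and the limit values are hypothetical, and a real driver would take its limits from the hardware and measure the copy threshold rather than guess it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device limits; real values come from the hardware. */
typedef struct tx_limits {
	size_t	copy_thresh;	/* copy packets at or below this length */
	size_t	align;		/* required buffer alignment, power of two */
	size_t	max_cookies;	/* descriptors the device accepts per packet */
} tx_limits_t;

typedef enum { TX_COPY, TX_BIND } tx_method_t;

/*
 * Decide whether to bcopy into a pre-bound buffer or bind on the fly.
 * ncookies is the number of DMA cookies the caller expects the bind
 * to produce for this buffer.
 */
tx_method_t
tx_choose(const tx_limits_t *lim, uintptr_t addr, size_t len,
    size_t ncookies)
{
	if (len <= lim->copy_thresh)
		return (TX_COPY);	/* short packet: copy is cheaper */
	if ((addr & (lim->align - 1)) != 0)
		return (TX_COPY);	/* device cannot send from here */
	if (ncookies > lim->max_cookies)
		return (TX_COPY);	/* too many fragments for the ring */
	return (TX_BIND);
}
```

Setting copy_thresh at or above the largest frame the driver handles degenerates into the "always bcopy" behavior that some drivers use.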
For rx, you can eliminate a lot of the DMA costs by recycling buffers.
But the complexity to do this "well" without introducing potential
panics is high. Almost every driver that has tried has gotten this
wrong at some point. Some of them are still wrong.
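One piece of the rx recycling problem is keeping track of buffers that are still loaned upstream, so the driver does not go away while the stack can still call its free routine. Below is a simplified userland model of that accounting, not real driver code: rx_pool_t and the rx_* functions are hypothetical names, and a real driver would need a mutex or atomics around the counters and would do the loan with esballoc or a similar call.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a driver instance that loans rx buffers upstream. */
typedef struct rx_pool {
	size_t	nfree;		/* buffers currently owned by the driver */
	size_t	nloaned;	/* buffers still held by the stack */
} rx_pool_t;

/*
 * Loan one buffer upstream. Returns false when none are free, in which
 * case the caller would fall back to copying the packet.
 */
bool
rx_loan(rx_pool_t *p)
{
	if (p->nfree == 0)
		return (false);
	p->nfree--;
	p->nloaned++;
	return (true);
}

/* Free callback: the stack is done with a loaned buffer. */
void
rx_return(rx_pool_t *p)
{
	p->nloaned--;
	p->nfree++;
}

/* Detach must be refused while any loaned buffer is outstanding. */
bool
rx_can_detach(const rx_pool_t *p)
{
	return (p->nloaned == 0);
}
```

Even with this accounting, a window can remain between the last callback returning its buffer and the callback code itself finishing, which is one way the race with _fini() can still show up.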
-- Garrett
Thanks,
Brian
I think the situation is different with jumbo frames, though.
If what I guess is the major cause, I have a proposal, and I would
like your advice on whether it makes sense.
The most time-consuming action in the DMA setup is the DMA bind,
more specifically, calling into the VM layer to get the PFN for the
vaddr (hat_getpfnum()), since it needs to search the huge page table.
The MBLKs, however, are essentially slab objects, so their PFN
has already been determined in the slab layer, and for most of their
usage we only touch the magazine layer, where the PFN is
predetermined. That is, the PFN should be considered
constructed state, but we don't leverage it for the DMA bind.
In storage, we have a field 'b_shadow' in buf(9S) to store the
pages which were recently used, through which the PFNs can be easily
obtained. So in the case that b_shadow works, ddi_dma_buf_bind_handle()
is much faster than ddi_dma_addr_bind_handle().
Another example: moving the DMA bind of the HBA driver (mpt) from the Tx
path to the kmem cache constructor gave the mpt driver a 26% throughput
increase. See CR6707308.
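The "constructed state" idea can be shown with a toy object cache in userland C. This is only a sketch under assumptions: tx_buf_t, tx_cache_t and the functions are hypothetical, the bound flag stands in for a bound DMA handle, and the free list stands in for the magazine layer. It is not the kmem cache implementation and not the mpt change.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct tx_buf {
	struct tx_buf	*next;
	int		bound;	/* stands in for a bound DMA handle */
} tx_buf_t;

typedef struct tx_cache {
	tx_buf_t	*freelist;	/* constructed objects ready for reuse */
	unsigned	nbinds;		/* how many times the costly step ran */
} tx_cache_t;

/* Allocate an object; the costly "bind" runs only for brand-new objects. */
tx_buf_t *
tx_cache_alloc(tx_cache_t *c)
{
	tx_buf_t *b = c->freelist;

	if (b != NULL) {
		c->freelist = b->next;
		return (b);		/* constructed state is preserved */
	}
	b = malloc(sizeof (*b));
	if (b == NULL)
		return (NULL);
	b->next = NULL;
	b->bound = 1;			/* constructor: "bind" once */
	c->nbinds++;
	return (b);
}

/* Return an object to the cache without tearing down its binding. */
void
tx_cache_free(tx_cache_t *c, tx_buf_t *b)
{
	b->next = c->freelist;
	c->freelist = b;
}
```

Allocating, freeing and allocating again returns the same object with nbinds still at 1, which is the saving being described: the bind cost is paid per object lifetime instead of per packet.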
If the mblk could store the PFN info and we had a
ddi_dma_mblk_bind_handle()-like interface, then I think it would
benefit the performance of the NIC drivers. I consulted the PAE
and got an answer that the bcopy is typically about 10-15% of a NIC
TX workload.
There are things that we can do to make DMA faster, better, and
simpler. In an ideal world, the GLDv3 could do most of this work,
and the mblk could just carry the ddi_dma_cookie with it.
-- Garrett
Thanks,
Brian
_______________________________________________
driver-discuss mailing list
driver-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/driver-discuss