Lines 576-610: Your method of register mapping is rather quaint. You
shouldn't need to examine the "reg" property, but instead use
ddi_regs_map_setup() and friends. (You can inquire about number of
register sets and sizing information with ddi_dev_nregs(),
ddi_dev_regsize(), etc.) I'd rather see you use the DDI convenience
routines here. (Note that config space is set up with
pci_config_setup().)
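For illustration, the convenience routines mentioned above would be used
roughly like this (a sketch with made-up names and trimmed error handling,
not the driver's actual code):

    #include <sys/ddi.h>
    #include <sys/sunddi.h>
    #include <sys/cmn_err.h>

    static int
    xx_query_regs(dev_info_t *dip)
    {
            ddi_acc_handle_t cfg_hdl;
            int nregs, i;
            off_t regsize;

            /* Config space is handled separately by the PCI nexus. */
            if (pci_config_setup(dip, &cfg_hdl) != DDI_SUCCESS)
                    return (DDI_FAILURE);

            /* How many register sets does the node export, and how big? */
            if (ddi_dev_nregs(dip, &nregs) != DDI_SUCCESS) {
                    pci_config_teardown(&cfg_hdl);
                    return (DDI_FAILURE);
            }
            for (i = 0; i < nregs; i++) {
                    if (ddi_dev_regsize(dip, i, &regsize) == DDI_SUCCESS)
                            cmn_err(CE_CONT, "?register set %d is %ld bytes\n",
                                i, (long)regsize);
            }
            pci_config_teardown(&cfg_hdl);
            return (DDI_SUCCESS);
    }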
I agree that it is quaint... but... memory space accesses don't work
for the old VT86C100A cards. (They have memory space in their BARs,
but nothing happens if you write to them via memory space.)
But you should use a different register number then... and you should
then still be able to use ddi_regs_map_setup(). Does this not work?
The issue is that the driver needs to know the access type (io space/mem
space) for a set number. It queries the reg property to make that
association. Then, ddi_regs_map_setup() is used with the 'right'
register set number.
The webpage states:
"As mentioned in "IEEE 1275 PCI Binding," IEEE 1275 PCI binding makes no
specific connection between the entries in the "reg" property and the
configuration registers in PCI configuration space. Driver developers
should check the "reg" property to ensure that the correct rnumber is
used in ddi_regs_map_setup(9F)."
So, I read this as "you can't assume that set number 1 always provides
i/o space accesses".
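In case it helps, the association the driver makes could be sketched
roughly like this (assuming pci_regspec_t and the PCI_ADDR_* masks from
<sys/pci.h>; the function and handle names are illustrative, not the
driver's own):

    #include <sys/ddi.h>
    #include <sys/sunddi.h>
    #include <sys/pci.h>

    static int
    xx_map_io_regs(dev_info_t *dip, caddr_t *basep, ddi_acc_handle_t *hdlp)
    {
            ddi_device_acc_attr_t attr;
            pci_regspec_t *regs;
            uint_t nelem, nsets, i;
            int rnumber = -1;

            if (ddi_prop_lookup_int_array(DDI_DEV_T_ANY, dip,
                DDI_PROP_DONTPASS, "reg", (int **)&regs, &nelem) !=
                DDI_PROP_SUCCESS)
                    return (DDI_FAILURE);

            /*
             * Each "reg" entry is one pci_regspec_t (5 ints); the space
             * type is encoded in pci_phys_hi. Pick the first set that
             * decodes I/O space.
             */
            nsets = nelem / (sizeof (pci_regspec_t) / sizeof (int));
            for (i = 0; i < nsets; i++) {
                    if ((regs[i].pci_phys_hi & PCI_ADDR_MASK) ==
                        PCI_ADDR_IO) {
                            rnumber = (int)i;
                            break;
                    }
            }
            ddi_prop_free(regs);
            if (rnumber == -1)
                    return (DDI_FAILURE);

            attr.devacc_attr_version = DDI_DEVICE_ATTR_V0;
            attr.devacc_attr_endian_flags = DDI_STRUCTURE_LE_ACC;
            attr.devacc_attr_dataorder = DDI_STRICTORDER_ACC;

            /* Map the whole set (offset 0, length 0) with that rnumber. */
            return (ddi_regs_map_setup(dip, rnumber, basep, 0, 0,
                &attr, hdlp));
    }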
Lines 717-760: I believe you shouldn't need to use driver.conf...
Brussels is the "new" way forward, and I believe that a driver
without legacy concerns shouldn't be using driver.conf anymore.
I added it for consistency with the current state of Solaris' drivers.
Just tell me again that it needs to go and it's gone.
Nuke it. I'm working on a different ethernet driver for a popular
commodity chip, and it will *not* have driver.conf settings. (Every
friggin' driver always had its own unique tunables, so there was no
consistency to follow. Brussels is the way forward.)
OK. I will remove it.
Line 1287: I see you're using desballoc. For a 100M driver, I think
this adds complexity that is questionable.
For modern CPUs, the benefit of "zero copy" on a 100M device does not
outweigh the advantage of an easily maintainable driver. However, I
specifically wrote this driver for use on my 600MHz VIA C3
(http://joostm.nl/solaris/onix/), and on this box, it matters. Every
cycle not spent on copying data is used for something useful.
But with normal-sized ethernet packets, you may find that the cost of
the DMA machinations approaches the cost of doing the copying,
especially when you add the locking that you're currently missing.
Even on a 600MHz VIA C3 it shouldn't matter much at 100M.
With smaller packets (e.g. typical for HTTP) the copy will always win.
The tradeoff varies from one platform to another, but any "modern" (i.e.
anything made this decade) system will probably do better copying at
least up to 1K.
Particularly, you need to
make sure that you don't have a race with your driver getting
unloaded (e.g. via modunload()) while you still have mblks that are
"loaned up". I believe your driver suffers from this race, which
could ultimately lead to a panic in certain situations. "Fixing"
this requires the driver to use a lock to guard modunload() from
completing successfully if there are outstanding mblks via
desballoc. (That also means you need a global reference counter.)
Again, for a typical 100Mbit driver, the perf. gain is questionable
anyway, so I'd not bother with this and just use bcopy. (Plus bcopy
solves a lot of extra problems with resource management...)
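For what it's worth, the guard being described would look something like
this (a sketch only; the lock, counter and structure names are made up for
illustration, not taken from the driver under review):

    #include <sys/stream.h>
    #include <sys/strsun.h>
    #include <sys/ksynch.h>

    static kmutex_t vr_loan_lock;           /* protects the counter */
    static uint32_t vr_loans_outstanding;   /* mblks loaned upstream */

    typedef struct vr_rxbuf {
            frtn_t  rb_frtn;        /* free routine handed to desballoc() */
            caddr_t rb_buf;         /* DMA buffer backing the mblk */
            size_t  rb_len;
    } vr_rxbuf_t;

    /* Called by STREAMS when the upper layer finally frees the loaned mblk. */
    static void
    vr_rxbuf_free(void *arg)
    {
            vr_rxbuf_t *rb = arg;

            /* ... put rb->rb_buf back on the rx free list here ... */

            mutex_enter(&vr_loan_lock);
            vr_loans_outstanding--;
            mutex_exit(&vr_loan_lock);
    }

    static mblk_t *
    vr_rxbuf_loan(vr_rxbuf_t *rb)
    {
            mblk_t *mp;

            rb->rb_frtn.free_func = vr_rxbuf_free;
            rb->rb_frtn.free_arg = (caddr_t)rb;

            mp = desballoc((uchar_t *)rb->rb_buf, rb->rb_len, BPRI_MED,
                &rb->rb_frtn);
            if (mp != NULL) {
                    mutex_enter(&vr_loan_lock);
                    vr_loans_outstanding++;
                    mutex_exit(&vr_loan_lock);
            }
            return (mp);
    }

    /*
     * detach(9E) then refuses to unload while anything is still loaned:
     *
     *      mutex_enter(&vr_loan_lock);
     *      if (vr_loans_outstanding != 0) {
     *              mutex_exit(&vr_loan_lock);
     *              return (DDI_FAILURE);
     *      }
     *      mutex_exit(&vr_loan_lock);
     */

Even with a counter like this, the check only works once receive is fully
quiesced, which is part of why this area is so easy to get wrong.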
This is in detach:
418 if (vrp->dmamem > 0) {
419 vr_log(vrp, CE_NOTE, "Can't detach, buffers in upper layer.");
420 return (DDI_FAILURE);
421 }
vrp->dmamem is the amount of memory allocated for DMA buffers for that
instance. Is this sufficient?
No, you need to guard against race conditions. That implies locking.
This area is really tricky to get right, and almost every driver that
has ever tried to do this has gotten it WRONG.
The extra overhead this implies is painful. I'd *really* recommend you
consider just using bcopy.
Which driver does it right, then? A correct example would be useful.
This would solve the problem described at line 1377 as well, because
you can then bcopy into a misaligned buffer, avoiding an extra data
copy in the IP stack later.
If you don't change this otherwise, you must add safeguards against
the modunload race, and you *should* conditionalize it on the length
of the packet. (Tiny packets will get dismal perf. on this path.)
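Roughly, the copy path being suggested could look like this (hypothetical
names; single-fragment receive only):

    #include <sys/stream.h>
    #include <sys/ddi.h>
    #include <sys/sunddi.h>

    /*
     * Copy a received frame out of the DMA buffer into a fresh mblk whose
     * read pointer is offset by 2, so the IP header ends up 4-byte aligned
     * and the stack does not have to copy it again.
     */
    static mblk_t *
    xx_rx_copy(uchar_t *dmabuf, size_t len)
    {
            mblk_t *mp;

            if ((mp = allocb(len + 2, BPRI_MED)) == NULL)
                    return (NULL);

            mp->b_rptr += 2;
            bcopy(dmabuf, mp->b_rptr, len);
            mp->b_wptr = mp->b_rptr + len;
            return (mp);
    }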
The TX path is changed so that only packets of 128 bytes and above are
mapped.
I suspect that 128 needs to be more like 512 or even 1024.
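Something along these lines, say (the threshold value, names and descriptor
layout are placeholders; a single-fragment message is assumed for brevity):

    #include <sys/strsun.h>
    #include <sys/ddi.h>
    #include <sys/sunddi.h>

    #define XX_TX_BCOPY_THRESH      512     /* per the discussion above */

    typedef struct xx_txdesc {
            ddi_dma_handle_t        td_dmah;        /* for on-the-fly binding */
            caddr_t                 td_copybuf;     /* prebound bounce buffer */
            ddi_dma_cookie_t        td_cookie;
            uint_t                  td_ncookies;
    } xx_txdesc_t;

    static int
    xx_tx_prepare(xx_txdesc_t *td, mblk_t *mp)
    {
            size_t len = msgsize(mp);

            if (len < XX_TX_BCOPY_THRESH) {
                    /*
                     * Small frame: copy into the prebound buffer;
                     * mcopymsg() also frees the message for us.
                     */
                    mcopymsg(mp, td->td_copybuf);
                    /* td_cookie/td_ncookies already describe td_copybuf. */
                    return (DDI_SUCCESS);
            }

            /* Large frame: bind the mblk's data buffer directly. */
            if (ddi_dma_addr_bind_handle(td->td_dmah, NULL,
                (caddr_t)mp->b_rptr, MBLKL(mp),
                DDI_DMA_WRITE | DDI_DMA_STREAMING, DDI_DMA_DONTWAIT, NULL,
                &td->td_cookie, &td->td_ncookies) != DDI_DMA_MAPPED)
                    return (DDI_FAILURE);

            return (DDI_SUCCESS);
    }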
Line 1786. Similar to the rx comments above, this approach, which is
common but IMO ugly, is better eliminated in favor of a bcopy of the
frame into prealloc'd buffers. In particular, DMA setup and teardown is
expensive, and this actually *hurts* performance worse than just using
bcopy on *many* platforms... especially if the frames are not full
MTU frames. I think I wouldn't bother with this extra effort, but go
with the simple bcopy approach. You've got to do that anyway if the
packet doesn't fit your alignment or block count requirements, so it's
simpler to just skip all the tests and do the copy unconditionally,
IMO. (For larger frames -- say more than 512 bytes -- the tradeoff may
favor DMA setup and teardown, but it's pretty close, I think. And,
for 100Mbps with reasonably recent CPUs -- anything in the last
decade or so -- you shouldn't need the special DMA tricks to get
line-speed performance.)
Note that unlike rx, you can't even get the benefit of reusing
mappings... so the tradeoff here for bcopy vs. DMA much more
strongly favors bcopy, I think.
Again, I agree. True for modern CPUs. On slower CPUs, things are
different. The driver's objective is not to get wirespeed TCP
throughput. It's more like "get decent throughput in as few cycles as
possible". TX DMA mapping supports that objective. Every cycle not spent
in the driver is used elsewhere.
See my earlier comments. If you want to do this, go ahead, but I think
you're mistaken. You should do some performance analysis to see if
you're really saving CPU cycles. I suspect you might not be saving as
much as you think. (It will vary by packet size, as well as the specific CPU and
MMU configuration, of course.)
OK. I will re-evaluate copy versus DMA mapping on the VIA C3 box,
measured using NFS. Is your objection to DMA mappings on the TX path,
the RX path, or both?
You still need to handle the DMA binding properly, using locking. The
locking will sap performance somewhat, which is one of the reasons that
I so strongly advocate just copying.
In fact, I'm probably going to do this in the Gigabit driver I'm working
on.
-- Garrett
Thanks for caring!
Joost