On 10/20/06, Bill Paul <[EMAIL PROTECTED]> wrote:
> > This is exactly the test that Andre and I were running, though only in
> > one direction (I think due to lack of hardware for a full test).
>
> Yes, but did you do it with a Smartbits though, or just with a couple of
> other FreeBSD machines? Unfortunately, a typical FreeBSD system on its
> own won't generate frames anywhere near fast enough to really torture
> test a gigE interface. At best you might hit around 200000 to 300000
> frames/sec.
>
> A given Smartbits system doesn't need special hardware to run a
> bi-directional forwarding test. If you're using SmartApps, you just have
> to click the "Bi-Directional" checkbox on the main setup window. (At
> least, that's how it is with the ones at work.) Being able to flood the
> link with the Smartbits is also handy for provoking error conditions (RX
> overruns and TX underruns, mostly), which shows you how well (or not)
> the driver's error recovery works.
>
> In the past I considered creating a kernel module that would grab hold
> of a given interface and blast traffic through it with as little
> software overhead as possible (e.g. sending the same mbuf over and over)
> in order to create my own test system that could hopefully rival the
> Smartbits, but I never got around to it. I'm not sure that it's really
> possible without custom hardware though.
Our Linux team has this. As far as I know it's only been used by our internal test types, though, and I have not seen the code, but I take this as evidence that it IS doable :)
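
For scale, the 200000 to 300000 frames/sec figure quoted above can be compared against the theoretical frame rate of a gigabit link. The sketch below is only back-of-the-envelope arithmetic under standard Ethernet framing (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap); the function name `max_frames_per_sec` is made up for illustration and does not come from any driver or test tool.

```c
#include <stdint.h>

/*
 * Bytes each frame occupies on the wire beyond the frame itself:
 * 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte
 * inter-frame gap.
 */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/*
 * Theoretical maximum frames per second for a link speed and frame size.
 * frame_bytes includes the 4-byte FCS (64 for a minimum-size frame).
 * The result is truncated to a whole number of frames.
 */
static uint64_t
max_frames_per_sec(uint64_t link_bits_per_sec, uint32_t frame_bytes)
{
	uint64_t bits_per_frame =
	    ((uint64_t)frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u;

	return link_bits_per_sec / bits_per_frame;
}
```

With 64-byte frames on a 1 Gb/s link this gives 1488095 frames/sec, so 200000 to 300000 frames/sec is roughly 13 to 20 percent of minimum-size line rate.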
> > Prior to the INTR_FAST change, the machine would live-lock. Now it
> > survives, stays responsive, and drops packets as needed.
>
> The wide range of failures people seem to be reporting might mean that
> the driver code itself is not the issue, but that there's an interaction
> with some other part of the system. This means torture testing the
> driver itself might not be enough to provoke the problems.
> Unfortunately, nobody seems to have nailed down a good test case for any
> of these failures. I strongly suspect people are leaving out details
> which seem obvious and/or trivial to them, but which are critical to
> finding the problem. ("Oh, I was using SCHED_ULE... was I not supposed
> to do that? Tee-hee." *curls finger in blonde hair*)
>
> Another thing that might be handy is improving the watchdog timeout
> message so that it dumps the state of the ICR and ICM registers (and
> maybe some other interesting driver and/or device state). The timeout
> implies no interrupts were delivered for a Long Time (tm). If the ICM
> register indicates interrupts have been masked, then that means
> em_intr_fast() was triggered by an interrupt and it scheduled work, but
> that work never executed. If that really is what happened, then I can
> understand the watchdog error occurring. If that's _not_ what happened,
> then something else is screwed up.
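
The diagnostic reasoning in the quoted paragraph above can be sketched in plain C. This is a standalone illustration, not em(4) driver code: `wd_snapshot`, `wd_classify`, and `wd_format` are hypothetical names, and the `enabled` field assumes the mask state has been read back as a bitmask in which zero means every interrupt cause is masked. How the two values are actually captured from the hardware is left out.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical snapshot of device state taken when the watchdog fires. */
struct wd_snapshot {
	uint32_t icr;      /* interrupt cause bits seen at timeout */
	uint32_t enabled;  /* assumed: enabled-interrupt bitmask, 0 = all masked */
};

enum wd_diagnosis {
	WD_WORK_NEVER_RAN,  /* masked: fast handler ran, deferred work did not */
	WD_SOMETHING_ELSE   /* not masked: the stall has some other cause */
};

/* Apply the rule from the post: masked interrupts imply stalled work. */
static enum wd_diagnosis
wd_classify(const struct wd_snapshot *s)
{
	return (s->enabled == 0) ? WD_WORK_NEVER_RAN : WD_SOMETHING_ELSE;
}

/* Build an extended watchdog message; returns snprintf's result. */
static int
wd_format(const struct wd_snapshot *s, char *buf, size_t len)
{
	const char *why = (wd_classify(s) == WD_WORK_NEVER_RAN) ?
	    "deferred work never ran" : "interrupts not masked";

	return snprintf(buf, len,
	    "watchdog timeout: icr=0x%08" PRIx32 " enabled=0x%08" PRIx32 " (%s)",
	    s->icr, s->enabled, why);
}
```

Even this much in the timeout message would let a bug report distinguish the two cases described above without asking the reporter for more detail.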
Jesse Brandeburg just did an interesting hack for the Linux driver, and I
was considering trying to code up an equivalent for us. We have evidence
that on some AMD-based systems there are descriptor writebacks that get
lost. Since the TX cleanup relies on the DD bit being set, you are hosed
when this happens.

What he did was make a cleanup routine that ONLY uses the head and tail
pointers and NOT the done bit. Then, in the watchdog routine, if there is
evidence of this problem, it will switch the cleanup function pointer to
this alternate clean code. At least one user who was having a problem has
reported that this solved it. It may be one of the issues hitting us as
well.

Jack
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
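
A minimal sketch of the cleanup switch described in the message above, modeled as a self-contained ring in plain C. This is not the Linux patch (which the message only describes) and not em(4) code: the ring layout, `tx_clean_by_dd`, `tx_clean_by_head`, and `tx_watchdog` are all hypothetical names chosen for illustration, and the "evidence" test used here (hardware head has advanced past a descriptor whose done flag is still clear) is an assumption about what such a check could look like.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u

struct tx_ring;
typedef unsigned (*tx_clean_fn)(struct tx_ring *);

/* Toy model of a TX descriptor ring; all fields are illustrative. */
struct tx_ring {
	bool        dd[RING_SIZE];  /* "done" flag written back by hardware */
	unsigned    next_to_clean;  /* oldest slot software has not reclaimed */
	unsigned    next_to_use;    /* tail: next slot software will fill */
	unsigned    hw_head;        /* head: next slot hardware will fetch */
	tx_clean_fn clean;          /* currently selected cleanup routine */
};

/* Default cleanup: reclaim only while the done flag is set. */
static unsigned
tx_clean_by_dd(struct tx_ring *r)
{
	unsigned n = 0;

	while (r->next_to_clean != r->next_to_use && r->dd[r->next_to_clean]) {
		r->dd[r->next_to_clean] = false;
		r->next_to_clean = (r->next_to_clean + 1) % RING_SIZE;
		n++;
	}
	return n;
}

/* Alternate cleanup: ignore the done flag, trust the head pointer. */
static unsigned
tx_clean_by_head(struct tx_ring *r)
{
	unsigned n = 0;

	while (r->next_to_clean != r->hw_head) {
		r->dd[r->next_to_clean] = false;
		r->next_to_clean = (r->next_to_clean + 1) % RING_SIZE;
		n++;
	}
	return n;
}

/* Watchdog: on signs of a lost writeback, swap the cleanup routine. */
static bool
tx_watchdog(struct tx_ring *r)
{
	bool hw_moved_on = (r->next_to_clean != r->hw_head);
	bool dd_missing = !r->dd[r->next_to_clean];

	if (hw_moved_on && dd_missing) {
		r->clean = tx_clean_by_head;
		return true;
	}
	return false;
}
```

In this model a single lost done flag stalls `tx_clean_by_dd` permanently at that slot, while `tx_clean_by_head` reclaims everything the head pointer has passed. Keeping the done-bit path as the default and switching only from the watchdog means unaffected systems keep their existing behavior.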