On 10/20/06, Bill Paul <[EMAIL PROTECTED]> wrote:

> This is exactly the test that Andre and I were running, though only in
> one direction (I think due to lack of hardware for a full test).

Yes, but did you do it with a Smartbits, or just with a couple of
other FreeBSD machines? Unfortunately, a typical FreeBSD system on its own
won't generate frames anywhere near fast enough to really torture test a
gigE interface. At best you might hit around 200000 to 300000 frames/sec.
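
For scale, here is a rough sketch of the line-rate arithmetic behind that
remark. It assumes standard Ethernet framing, where each frame costs an extra
8 bytes of preamble and 12 bytes of inter-frame gap on the wire. The function
name is made up for illustration.

```c
#include <stdint.h>

/*
 * Theoretical maximum frames per second on an Ethernet link.
 * frame_bytes is the frame size including the CRC (64 for minimum-size
 * frames). The 20 bytes of overhead are preamble (8) plus inter-frame
 * gap (12).
 */
static uint64_t
max_frames_per_sec(uint64_t link_bps, uint32_t frame_bytes)
{
	const uint32_t wire_overhead = 8 + 12;

	return (link_bps / ((uint64_t)(frame_bytes + wire_overhead) * 8));
}
```

With 64-byte frames on a 1 Gb/s link this works out to 1488095 frames/sec,
so 200000 to 300000 frames/sec is only a fraction of what the wire can carry
at minimum frame size.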

A given Smartbits system doesn't need special hardware to run a
bi-directional forwarding test. If you're using SmartApps, you just
have to click the "Bi-Directional" checkbox on the main setup window.
(At least, that's how it is with the ones at work.)

Being able to flood the link with the Smartbits is also handy for
provoking error conditions (RX overruns and TX underruns, mostly), which
shows you how well (or not) the driver's error recovery works.

In the past I considered creating a kernel module that would grab hold
of a given interface and blast traffic through it with as little software
overhead as possible (e.g. sending the same mbuf over and over) in order
to create my own test system that could hopefully rival the Smartbits,
but I never got around to it. I'm not sure that it's really possible
without custom hardware though.

Our Linux team has this. As far as I know it's only been used by our
internal test types, and I have not seen the code, but I take this
as evidence that it IS doable :)

> Prior to the INTR_FAST change, the machine would live-lock.  Now it
> survives, stays responsive, and drops packets as needed.

The wide range of failures people seem to be reporting might mean that
the driver code itself is not the issue, but that there's an interaction
with some other part of the system. This means torture testing the driver
itself might not be enough to provoke the problems.

Unfortunately, nobody seems to have nailed down a good test case for
any of these failures. I strongly suspect people are leaving out details
which seem obvious and/or trivial to them, but which are critical to
finding the problem. ("Oh, I was using SCHED_ULE... was I not supposed
to do that? Tee-hee." *curls finger in blonde hair*)

Another thing that might be handy is improving the watchdog timeout
message so that it dumps the state of the ICR and ICM registers (and
maybe some other interesting driver and/or device state). The timeout
implies no interrupts were delivered for a Long Time (tm). If the
ICM register indicates interrupts have been masked, then that means
em_intr_fast() was triggered by an interrupt and it scheduled work,
but that work never executed. If that really is what happened, then
I can understand the watchdog error occurring. If that's _not_ what
happened, then something else is screwed up.
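
A minimal sketch of what such a message could look like, written as plain
userland C so it can be tried outside the kernel. The struct and function
names are hypothetical, not the em(4) driver's. The two fields stand in for
whatever the driver would read from the ICR and ICM registers, and the
assumption here is that a mask value of 0 means all interrupts are masked.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical snapshot of device state captured when the watchdog fires. */
struct wd_snapshot {
	uint32_t cause;	/* pending interrupt cause bits */
	uint32_t mask;	/* enabled interrupt bits; 0 = everything masked */
};

/*
 * Returns 1 if the snapshot matches the "handler masked interrupts and
 * scheduled work, but the work never ran" case, 0 otherwise.
 */
static int
wd_task_stalled(const struct wd_snapshot *s)
{
	return (s->mask == 0);
}

/* Format the extended watchdog message into buf. */
static int
wd_format(char *buf, size_t len, const struct wd_snapshot *s)
{
	return (snprintf(buf, len,
	    "watchdog timeout: cause=0x%08x mask=0x%08x (%s)",
	    (unsigned)s->cause, (unsigned)s->mask,
	    wd_task_stalled(s) ?
	    "interrupts masked, deferred work never ran" :
	    "interrupts enabled, look elsewhere"));
}
```

Putting the verdict in the message itself means a user's bug report carries
the distinction between the two cases without anyone having to ask for it.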

Jesse Brandeburg just did an interesting hack for the Linux driver, and I
was considering trying to code an equivalent thing up for us. We
have evidence that on some AMD-based systems there are writebacks
that get lost. Since the TX cleanup relies on the DD bit being set, you
are hosed when this happens. What he did was make a cleanup
routine that ONLY uses the head and tail pointers and NOT the done
bit. Then, in the watchdog routine, if there is evidence of this problem,
it will switch the cleanup function pointer to this alternate clean code.

At least one user that was having a problem has reported this solved
it. It may be one of the issues hitting us as well.
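
To make the idea concrete, here is a rough userland sketch of the two
cleanup strategies and the watchdog switch. I have not seen the Linux code,
so everything here (struct layout, names, the test for "evidence of this
problem") is an assumption for illustration only. It also assumes that a
descriptor the hardware head has moved past is safe to reclaim.

```c
#include <stdint.h>

#define RING_SIZE 8u	/* hypothetical ring size */

struct tx_desc {
	uint8_t dd;	/* done bit, normally written back by hardware */
};

struct tx_ring {
	struct tx_desc desc[RING_SIZE];
	uint32_t next_to_clean;	/* oldest descriptor not yet reclaimed */
	uint32_t next_to_use;	/* software tail: next free descriptor */
	uint32_t hw_head;	/* snapshot of the hardware head pointer */
	uint32_t (*clean)(struct tx_ring *);	/* active cleanup routine */
};

/* Normal path: reclaim descriptors while their DD bit is set. */
static uint32_t
clean_by_dd(struct tx_ring *r)
{
	uint32_t n = 0;

	while (r->next_to_clean != r->next_to_use &&
	    r->desc[r->next_to_clean].dd) {
		r->desc[r->next_to_clean].dd = 0;
		r->next_to_clean = (r->next_to_clean + 1) % RING_SIZE;
		n++;
	}
	return (n);
}

/* Alternate path: ignore DD, reclaim everything the head has passed. */
static uint32_t
clean_by_head(struct tx_ring *r)
{
	uint32_t outstanding =
	    (r->next_to_use + RING_SIZE - r->next_to_clean) % RING_SIZE;
	uint32_t done =
	    (r->hw_head + RING_SIZE - r->next_to_clean) % RING_SIZE;
	uint32_t i;

	if (done > outstanding)	/* never reclaim more than was queued */
		done = outstanding;
	for (i = 0; i < done; i++) {
		r->desc[r->next_to_clean].dd = 0;
		r->next_to_clean = (r->next_to_clean + 1) % RING_SIZE;
	}
	return (done);
}

/*
 * Watchdog check: if the head has moved past the oldest descriptor but
 * its DD bit is still clear, assume the writeback was lost and switch.
 */
static void
watchdog_check(struct tx_ring *r)
{
	if (r->next_to_clean != r->hw_head &&
	    !r->desc[r->next_to_clean].dd)
		r->clean = clean_by_head;
}
```

The function pointer keeps the common path cheap: systems that never lose a
writeback keep using the DD-based routine, and only the ones that trip the
watchdog pay for the fallback.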

Jack
_______________________________________________
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
