[email protected] wrote:
On Fri, December 12, 2008 19:09, David Sommerseth wrote:
David Sommerseth wrote:
[email protected] wrote:
PCI-X dual port Broadcom NetXtreme BCM5704 Gigabit Ethernet (rev 03)
adapter is working fine here driven by tg3, 2.6.27-hardened-r1. The driver
doesn't seem to be borked with my card.
Did you check out the "error" field of ifconfig's output for the interface
of your card?
Regards,
Dw.
Hmmm ... No, I have not had that opportunity. The server is located
2000 km away from me, and I usually call a guy (who is not a technician)
to go in and press CTRL-ALT-DEL on a keyboard. That is the short-term
"fix". But I'm going to have a physical look at the server in a couple of
weeks, so if I get positive feedback from others as well regarding the
2.6.27 kernel, I'm willing to try that upgrade.
This interface is an on-board interface in an IBM eServer. The first time
it happened, there had been no problems for about 28 days. Now it was 13
days. So I expect it to happen again, soon enough.
I'll try to hack the shutdown scripts to dump the ifconfig info
somewhere somehow.
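One way to capture the counters before a reboot, as a rough sketch only: on Linux the per-interface counters that ifconfig prints are also exposed as plain files under /sys/class/net/<iface>/statistics, so a small script called from the shutdown scripts could append them to a log. The function names and the log path used in the note below are made up for illustration; nothing here is taken from the actual setup on the box.

```python
import os
import time


def read_iface_stats(iface, sysfs_root="/sys/class/net"):
    """Return {counter_name: int} for each file in <sysfs_root>/<iface>/statistics."""
    stats_dir = os.path.join(sysfs_root, iface, "statistics")
    stats = {}
    for name in sorted(os.listdir(stats_dir)):
        with open(os.path.join(stats_dir, name)) as fh:
            stats[name] = int(fh.read().strip())
    return stats


def dump_iface_stats(iface, out_path, sysfs_root="/sys/class/net"):
    """Append a timestamped snapshot of the counters to out_path and return them."""
    stats = read_iface_stats(iface, sysfs_root)
    with open(out_path, "a") as out:
        out.write("%s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), iface))
        for name, value in stats.items():
            out.write("  %s: %d\n" % (name, value))
    return stats
```

Calling dump_iface_stats("eth0", "/var/log/eth0-stats.log") (hypothetical path) early in the shutdown sequence, before the interface is taken down, would leave a record of rx_errors, tx_errors, collisions and the rest.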
Then it happened again ... and I have ifconfig stats for the interface:
eth0 Link encap:Ethernet HWaddr 00:14:5e:5d:3c:d0
inet6 addr: fe80::214:5eff:fe5d:3cd0/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:10551633 errors:4294967239 dropped:767 overruns:0 frame:170
TX packets:9371606 errors:4294967239 dropped:0 overruns:0 carrier:0
collisions:4294967239 txqueuelen:1000
RX bytes:28237000 (26.9 MiB) TX bytes:163377979 (155.8 MiB)
Interrupt:16
From the kernel log I see this:
Dec 12 12:19:21 fw [74355.059369] tg3: tg3_abort_hw timed out for world, TX_MODE_ENABLE will not clear MAC_TX_MODE=ffffffff
Dec 12 12:19:24 fw [74357.842979] tg3: world: No firmware running.
Dec 12 12:19:41 fw [74374.992867] tg3: world: Link is down.
I'm surprised by the error and collision numbers here, as I checked them
the other day, and all of them were 0. I also know that the TX and RX
values were above 3-4 GB, but I don't remember which was which.
Could this be an overflow bug of some kind?
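A quick bit of arithmetic on the reported value, as a sketch rather than a diagnosis: 4294967239 is exactly 57 below 2^32, which is what a small negative number looks like when printed as an unsigned 32-bit counter. The helper name below is made up for illustration, and the assumption that the counters are 32 bits wide is mine; it does not show where in the driver or hardware such a value would come from.

```python
WRAP_32 = 2 ** 32  # an unsigned 32-bit counter wraps around at this value


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value %= WRAP_32
    return value - WRAP_32 if value >= 2 ** 31 else value


reported = 4294967239  # errors and collisions value from the ifconfig output

print(hex(reported))          # 0xffffffc7
print(WRAP_32 - reported)     # 57
print(as_signed32(reported))  # -57
```

The same width assumption would also explain byte counters: a 32-bit byte counter starts again from 0 every 4 GiB, so a small "RX bytes" figure does not rule out several GB of traffic.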
I have also found out that IBM have released an updated firmware for this
network device, so I'll try to upgrade it during Christmas when I'm close
to the box again. In the meantime I have a little ping script, which
restarts the network (incl. reloading the tg3 module) when the network
dies. This restart gives me minimal downtime.
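For anyone wanting something similar, a minimal sketch of such a watchdog follows. It is not the script used on this box: the recovery commands assume a Gentoo-style /etc/init.d/net.eth0 service, and the function names, thresholds and interval are made up for illustration. It pings a host once per interval and, after a few consecutive misses, stops the network, reloads tg3 and starts the network again.

```python
import subprocess
import time

# Assumed recovery sequence (Gentoo-style init script); adjust for the real box.
RECOVERY_COMMANDS = [
    ["/etc/init.d/net.eth0", "stop"],
    ["modprobe", "-r", "tg3"],
    ["modprobe", "tg3"],
    ["/etc/init.d/net.eth0", "start"],
]


def host_reachable(host, run=subprocess.call):
    """Send one ICMP echo request; True if ping exits with status 0."""
    cmd = ["ping", "-c", "1", "-W", "2", host]
    return run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0


def check_once(host, failures, max_failures=3, run=subprocess.call):
    """Ping once and return the updated count of consecutive failures.

    After max_failures misses in a row, run the recovery commands and
    reset the count to 0.
    """
    if host_reachable(host, run):
        return 0
    failures += 1
    if failures >= max_failures:
        for cmd in RECOVERY_COMMANDS:
            run(cmd)
        return 0
    return failures


def watch(host, interval=30):
    """Loop forever, checking the host every `interval` seconds."""
    failures = 0
    while True:
        failures = check_once(host, failures)
        time.sleep(interval)
```

Requiring several consecutive misses before acting keeps a single lost ping from triggering a module reload. The target host should be something just beyond the interface, such as the ISP router.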
But I do not understand why this box was so rock solid until I upgraded
from 2.6.22-hardened-r8 to 2.6.25-hardened-r8. The new kernel driver
obviously does something it didn't do before. Unfortunately, I can't find
anything in the kernel git logs for the tg3.[ch] files which could
pinpoint the cause.
Does anyone have any experience with firmware upgrades on these
cards? The instructions seem pretty straightforward, but if you know of
anything, whatever it is, I would appreciate hearing it.
kind regards,
David Sommerseth
Rather strange. The collision and error counters show the same value...
It has been a long time since I last saw collisions.
There are several possibilities regarding this symptom. It would be
important to know whether the card is connected to a hub or a switch(ing-hub).
1.) There can be a defective device on the subnet, which is connected to
it from time-to-time, or it is present all the time, but doesn't hog the
line constantly
Pretty confident this is not the case, as this interface is the one
connected straight to the router from the ISP.
2.) The switch/hub can have a problem - try reconnecting the card to
another port
Pretty confident this is also not the case.
3.) The network card can have a problem, which can be software related and
might be solved by a firmware upgrade (unfortunately the card itself
cannot be replaced being an on-board NIC)
Firmware updated now. I found a firmware update for the Broadcom
interface I have in the IBM xSeries server and applied it. I also upgraded
the kernel to 2.6.25-hardened-r11 from 2.6.25-hardened-r8. After this, the
server has survived 55 days without any issues, which is the longest since
I upgraded from 2.6.22-hardened-r8. I strongly believe that it was the
firmware update which helped.
4.) It can even be caused by a driver bug - which we know is entirely
possible since the e1000 issue
Yeah, and this part scares me more ...
I hope the cause turns up soon. I would suspect a hardware issue, but it's
a disturbing fact that these symptoms appeared after a kernel upgrade.
Exactly!
So my thesis is that between linux-2.6.22-hardened-r8 and
2.6.25-hardened-r8 the tg3 driver must have been updated somehow, so that
it now depends on some features in the firmware which obviously did not work
properly. And if the tg3 driver did not change, I've simply been way too
lucky not to experience this for over 13 months with the 2.6.22 kernel.
The firmware I upgraded to can be found here:
http://www-947.ibm.com/systems/support/supportsite.wss/docdisplay?lndocid=MIGR-5070004&brandind=5000008
This update upgraded the network card firmware "bootcode" from 3.61 to 3.65
and the "IPMI" from 6.20 to 6.25.
kind regards,
David Sommerseth