Hi James Kelly!

On 2011.10.05 at 22:31:18 +0100, James Kelly wrote:
> I lost contact with my Scientific Linux 6.1 KVM host earlier today.
>
> The machine is headless and I don't have any IPMI stuff on the machine so I
> had to plug a monitor into it. However, there was no life from the monitor
> and I pressed the reset button.
>
> It seems to me that the networking died. The machine is booted first thing
> every morning (so the 9:00am start was missed by two minutes!) and the
> networking error seems to have occurred about 27 minutes after
> the initial boot.

It's unclear to me whether the tg3 driver errors in the second half of
the message are the cause or a symptom of this situation. If they are
the cause, you might be interested in a recent update that Red Hat has
released:

http://rhn.redhat.com/errata/RHEA-2011-1348.html

Try installing kmod-tg3 from the sl-fastbugs repo and rebooting; that
should make your system use a newer version of the network driver
mentioned in these messages. I have no idea if it will really help, but
it probably won't hurt to try.

A common cause of similar problems with network drivers is interrupt
setup. Network cards generate lots of interrupts under load and use
various advanced features to ease that a bit, and I have seen kernel
panics and warnings caused by the hardware interrupt setup or by buggy
interrupt code in a network driver under load. Just in case, you might
want to look for the eth entries in /proc/interrupts to make sure the
card uses MSI or MSI-X (shown as PCI-MSI-edge or PCI-MSI-X) and not
IO-APIC-level or something like that. However, I don't think this kind
of problem should arise on such hardware.

In the worst case, if these problems keep appearing, consider installing
an add-in Intel-based network card; in my experience (and that of some
other people) these work most reliably under Linux.
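
In case it helps, here is a rough sketch of the /proc/interrupts check
mentioned above. The function name nic_interrupt_lines and the idea of
simply looking for "MSI" in the interrupt type column are my own
assumptions for illustration; the exact column layout differs between
kernel versions, so treat the result as a hint and not as a verdict.

```python
def nic_interrupt_lines(text, name="eth"):
    """Return (irq, description, uses_msi) for rows that mention `name`.

    Hypothetical helper: `text` is the content of /proc/interrupts.
    The header row lists one CPUn column per CPU; every IRQ row then has
    "IRQ:", one counter per CPU, the interrupt type and the device names.
    """
    lines = text.splitlines()
    if not lines:
        return []
    # Number of per-CPU counter columns, taken from the header row.
    ncpu = sum(1 for tok in lines[0].split() if tok.startswith("CPU"))
    results = []
    for line in lines[1:]:
        fields = line.split()
        # Skip short summary rows (e.g. "ERR: 0") and malformed lines.
        if len(fields) < ncpu + 2 or not fields[0].endswith(":"):
            continue
        rest = fields[1 + ncpu:]  # interrupt type + device name(s)
        if not any(name in f for f in rest):
            continue
        # Heuristic (assumption): MSI and MSI-X rows carry "MSI" in the type.
        uses_msi = any("MSI" in f for f in rest)
        results.append((fields[0].rstrip(":"), " ".join(rest), uses_msi))
    return results


if __name__ == "__main__":
    try:
        with open("/proc/interrupts") as fh:
            rows = nic_interrupt_lines(fh.read())
    except OSError:
        rows = []  # not on Linux, or /proc is not readable
    for irq, desc, uses_msi in rows:
        status = "MSI/MSI-X" if uses_msi else "not MSI (worth a look)"
        print(f"IRQ {irq}: {desc} -> {status}")
```

If the interface is not named eth-something on your box, pass a
different name, e.g. nic_interrupt_lines(text, name="em").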

It's kind of sad, but Marvell, Broadcom and Nvidia products are a bit
like second-class citizens and don't always work flawlessly under load.
It might be more of a driver problem, who knows, but that's just my
experience from past years. (Also, I'd definitely stay away from NICs
based on other manufacturers' chips; apart from these four, probably
nothing else should be allowed in the server market. YMMV.)

These messages could also indicate something other than network
problems, but people with deeper kernel knowledge than me should answer
that. All I can say is that the combination of NIC, network driver and
interrupt settings *can* be a real source of problems, up to kernel
panics under some conditions; it's not rare at all to find that such
problems are caused by the network driver.

-- 
Vladimir