Public bug reported:

I'm submitting this bug report to encourage driver developers to try this particular use case, because a network-only test would not uncover the bug. I know (now, alas :) that the Realtek NIC is the subject of many prior bug reports, but someone who thinks those are fixed in any given release might not try what's going on here.

I just got an MSI 785GT-E63 motherboard (MS-7551), which has an onboard Realtek RTL8111/8168B NIC, running a 6-core AMD 64-bit CPU. I've found that heavy network traffic -only when combined with heavy PCI traffic- will cause a variety of crashes after around 10 minutes, plus or minus 10. I can supply all kinds of logs (dmesg, lshw, lspci -vvvv, whatever you want) and my use cases, if someone is interested in working on the bug.

I'm using the r8169 kernel module that comes with the release and have -not- downloaded the one from Realtek. This happens under 10.10 (Maverick) with both 32-bit and 64-bit kernels, both from LiveCDs and on real installs. (I tried 2.6.35-28-generic and also the 2.6.35-22 kernel from the LiveCD.) I have -not- tainted the kernel (certainly not in the case of the LiveCD tests).

If I use a "tar | nc" pipeline to push files from one machine to the target machine at gigabit speeds, I see about 70 MB/sec, limited mostly by I/O on the disks. The target disks are on an AOC-SAT2-MV8 PCI disk controller, so this push loads both the NIC and the PCI bus simultaneously. (The machine actually has 2 of these and 1 empty PCI slot at present, but eventually it will have 3 MV8's.)

Under these conditions, the machine simply freezes after a while. It's not totally frozen---I can toggle caps-lock on the keyboard---but it's good for little else. It invariably leaves the NIC -transmitting-, according to the lights on the NIC and on my switch---ancient thicknet Ethernet called this "jabbering". I often have to do a hard power cycle to reset the machine state; simply pushing the reset button seems to leave the USB controller, at least, in a messed-up state.

If I turn off AMD Cool & Quiet and spread-spectrum, the machine stays up long enough to write some logs into kern.log and even for me to type into terminal windows, but anything that might cause disk access (the system disk is a -USB- disk, not one of the ones on the PCI bus) will wedge it up.
The system monitor app will also continue running and the menubar clock will tick, until I wedge it by typing too much, etc. [Waving my mouse once wedged it; on that test the mouse was serial and the keyboard was USB.]

In the cases where I can get logs, I see lots of "BUG: soft lockup" entries for all 6 cores after that. Even if the machine is then left alone, they recur every 5-60 minutes at random while it sits idle. (It's possible the later ones correspond to TCP keepalives or NTP traffic or something hitting the NIC; I didn't instrument the switch.) They do NOT happen if the machine never got stressed in this way; a machine that has not locked up since boot (even if stressed in other ways) doesn't show the soft lockups.

According to the logs, the NIC is using MSI interrupts, and according to /proc/interrupts, nothing else is sitting on the IRQ for either of the MV8 disk controllers or for the NIC---they're all disjoint from each other and from everything else. (PCI-MSI-edge/IRQ 41 for the NIC, IO-APIC-fasteoi sata_mv at 20 & 21 for the MV8's.)

Why do I think this is specific to heavy PCI loads? Because if I write to a disk connected to one of the motherboard's native SATA ports, it never crashes. (I can push for 12 hours and see no signs of distress, and that push runs about 20 MB/sec faster because the onboard SATA is faster than PCI to the MV8---same exact disk, btw, just plugged in elsewhere.) I can also do "nc blahblah < /dev/zero" from the source and "nc > /dev/null" on the target and see almost 120 MB/sec through the link for hours, with no problem. I can also -copy- from a disk on a motherboard native SATA port to a disk on the PCI bus and again, no problem if the network isn't in use---I copied 2TB of ext4 filesystem that way without trouble. USB traffic doesn't affect it; I tried booting with my system disk on a native SATA port and it changed nothing.
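
For anyone repeating the /proc/interrupts check described above, here is a minimal sketch of how it could be scripted. It relies on the kernel listing several handlers on one IRQ as a comma-separated list at the end of the line. The function names and the optional file argument are assumptions made for this illustration and are not part of the original report.

```shell
#!/bin/sh
# Sketch only: helper names and the optional file argument are hypothetical.

# Print every numbered IRQ line that has more than one handler attached.
# Multiple handlers on one IRQ are separated by ", ", so a comma on a
# numbered line is taken to mean the IRQ is shared.
shared_irqs() {
    awk '/^ *[0-9]+:/ && /,/ { sub(/^ +/, ""); print }' "${1:-/proc/interrupts}"
}

# Print the interrupt lines that mention a given driver or interface name,
# for example sata_mv or eth0.
irq_lines_for() {
    grep -- "$1" "${2:-/proc/interrupts}"
}
```

Calling shared_irqs with no argument reads the live table. If it prints nothing that mentions the NIC or sata_mv, and irq_lines_for shows each of them on its own line, the devices are on disjoint IRQs as reported here.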
(My normal configuration has the system disk on USB instead; some of my tests used a USB flash drive for LiveCD images.)

This machine is actually one of a pair of clones---same exact hardware on both, bought at the same time. Both machines exhibit exactly the same behavior, which rules out a one-off hardware fault.

I have -not- flashed a newer BIOS into the machine (my rev is from October of last year, and there are two newer ones) because I consider the risks of bricking a mobo larger than the chances that this is actually a BIOS bug. These mobos have the capability of temporarily booting a different BIOS via USB without flashing it; I -may- try that if I trust it enough.

I happened to have a GA311 PCI (not PCIe) gigabit card that someone gave me, still sitting around in its shrinkwrap. I stuck it in the free PCI slot. I was able to push 2TB to the test machine (at about 50 MB/sec) that way via my tar|nc pipe, and it wrote to the PCI-based MV8 disk for 12 hours without a hiccup. So even if the PCI bus is totally saturated, no crash---as long as I'm not using the onboard Realtek NIC.

I tried the 11.04 Natty daily build as of 3/30/2011 (desktop AMD64) and -that- did not crash, BUT the test is incomplete---Natty loaded the machine -terribly-, spinning up the CPU fan to maximum and soaking almost all 6 cores! One core was entirely devoted to running kswapd at 100%; most of two others were running the tar and the nc at more than 70% each. Under 10.10, the machine sits at its lowest speed (800 MHz) and only a couple of cores are doing anything, and they're at 10-20%. And my transfer rates were abysmal---roughly 45 MB/sec---probably because of the CPU load, which went away as soon as I aborted the test when I couldn't stand the fan noise any more.
That's not enough time for the complete test---one of my crashing tests under 10.10 was to use rsync via ssh instead of tar via nc, and the additional load from all the crypto slowed things down enough (to about 50 MB/sec again) that it took 3 hours to crash instead of 10 minutes. So Natty may still have the same bug, which I could only confirm if it were otherwise usable. (WTF is the deal with Natty? WTF is kswapd soaking an entire core just because I was writing to disk? Swap was completely unused, and the load vanished as soon as I stopped the test.)

If someone cares about this, I can attach all the logs you'd ever want, but I'll leave that for now. But be advised---just testing the net -alone-, without heavy PCI activity, will -not- find this bug.

Given the huge number of confused reports about the RTL8111 (works, doesn't work, worked in the last release but not this one, and vice versa), I'm giving up. I have a free PCIe slot and will be buying two Intel EXPI9301CTBLK's and putting one in each machine, and never using the RTL8111 again. I hope that works.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/746914

Title:
  MSI 785GT with Realtek RTL8111/8168B locks up only with heavy gigabit traffic -and- heavy PCI load (10.10 at least)

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs