[Bug 746914] [NEW] MSI 785GT with Realtek RTL8111/8168B locks up only with heavy gigabit traffic -and- heavy PCI load (10.10 at least)

Grondr Thu, 31 Mar 2011 19:31:08 -0700

Public bug reported:

I'm submitting this bug report to encourage driver developers to try
this particular use case because a network-only test would not uncover
the bug.  I know (now, alas :) that the Realtek NIC is the subject of
many prior bug reports, but someone who thinks those are fixed in any
given release might not try what's going on here.


I just got an MSI 785GT-E63 motherboard (MS-7551), which has an
onboard Realtek RTL8111/8168B NIC, running a 6-core AMD 64-bit CPU.
I've found that heavy network traffic -only when combined with heavy
PCI traffic- will cause a variety of crashes after around 10 minutes,
plus or minus 10.  I can supply all kinds of logs (dmesg, lshw, lspci
-vvvv, whatever you want) and my use cases, if someone is interested
in working on the bug.  I'm using the r8169 kernel module that comes
with the release and have -not- downloaded the one from Realtek.

This happens under 10.10 (Maverick) in both 32-bit and 64-bit kernels,
on both LiveCDs and real installs.  (I tried 2.6.35-28-generic and
also the 2.6.35-22 kernel from the LiveCD.)  I have -not- tainted
the kernel (certainly not in the case of the LiveCD tests).

If I use a "tar | nc" pipeline to push files from one machine to the
target machine at gigabit speeds, I see about 70 MB/sec, limited
mostly by I/O on the disks.  The target disks are on an AOC-SAT2-MV8
PCI disk controller, so this push loads both the NIC and the PCI bus
simultaneously.  (The machine actually has 2 of these and 1 empty PCI
slot at present, but eventually it will have 3 MV8's.)

Under these conditions, the machine simply freezes after a while.
It's not totally frozen---I can toggle caps-lock on the keyboard---but
it's good for little else.  It invariably leaves the NIC -transmitting-,
according to the lights on the NIC and on my switch---ancient thicknet
Ethernet called this "jabbering".  I often have to do a hard power
cycle to reset the machine state; simply pushing the reset button
seems to leave the USB controller in a messed-up state, at least.

If I turn off AMD Cool & Quiet and spread-spectrum, the machine stays
up long enough to write some logs into kern.log and for me to even
type into terminal windows, but anything that might cause disk access
(the system disk is a -USB- disk, not one of the ones on the PCI bus)
will wedge it up.  The system monitor app will also continue running
and the menubar clock will tick, until I wedge it by typing too much,
etc.  [Waving my mouse once wedged it; mouse on that test was serial,
keyboard was USB.]

In the cases where I can get logs, I see lots of "BUG: soft lockup"
entries for all 6 cores after that.  Even if the machine is left
alone, they recur every 5-60 minutes at random when the machine is
sitting idle.  (It's possible the later ones correspond to TCP
keepalives or NTP traffic or something hitting the NIC; I didn't
instrument the switch.)  They do NOT happen if the machine never got
stressed in this way; a machine that has not locked up since boot
(even if stressed in other ways) doesn't show the soft lockups.
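
For anyone who wants to tally these entries from kern.log, here is a
minimal Python sketch.  It assumes each report has the form
"BUG: soft lockup - CPU#N stuck for Ts!", which is my understanding of
the 2.6.35 message format.  The sample lines are made up for
illustration and are not taken from the actual logs; the function name
is arbitrary.

```python
import re
from collections import Counter

# Assumed message shape (not copied from the real logs):
#   BUG: soft lockup - CPU#3 stuck for 61s! [swapper:0]
SOFT_LOCKUP = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s")


def count_soft_lockups(lines):
    """Return a Counter mapping CPU number -> number of soft-lockup reports."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Synthetic sample lines, for illustration only.
SAMPLE = [
    "kernel: [ 1201.000000] BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]",
    "kernel: [ 1201.000100] BUG: soft lockup - CPU#3 stuck for 61s! [swapper:0]",
    "kernel: [ 1300.000000] some unrelated message",
    "kernel: [ 1502.000000] BUG: soft lockup - CPU#0 stuck for 62s! [swapper:0]",
]

if __name__ == "__main__":
    for cpu, n in sorted(count_soft_lockups(SAMPLE).items()):
        print("CPU#%d: %d report(s)" % (cpu, n))
```

To run it against a real log, pass an open file instead of SAMPLE,
e.g. count_soft_lockups(open("/var/log/kern.log")).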

According to the logs, the NIC is using MSI interrupts, and according
to /proc/interrupts, nothing else is sitting on the IRQ for either of
the MV8 disk controllers or on the NIC---they're all disjoint from
each other and from everything else.  (PCI-MSI-edge/IRQ 41 for the
NIC, IO-APIC-fasteoi sata_mv at 20 & 21 for the MV8's.)
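
The shared-IRQ check can be automated roughly as follows.  This is a
sketch that assumes the 2.6.35-era /proc/interrupts layout: one count
column per CPU, then a single chip-type token such as IO-APIC-fasteoi,
then a comma-separated device list.  The sample text is synthetic: the
counts, the two CPU columns, the eth0 name, and the shared IRQ 16 line
are invented to show what sharing would look like.  Only IRQs 20, 21
and 41 and their types come from this report.

```python
def parse_interrupts(text):
    """Map IRQ number -> (chip type, [device names]) for numbered IRQ lines."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():           # skip NMI, LOC, ERR and friends
            continue
        fields = rest.split()
        if len(fields) <= ncpu:           # no chip type or devices listed
            continue
        chip = fields[ncpu]
        names = " ".join(fields[ncpu + 1:]).split(",")
        devices = [name.strip() for name in names if name.strip()]
        table[int(label)] = (chip, devices)
    return table


def shared_irqs(table):
    """Return only the IRQs that list more than one device."""
    return {irq: devs for irq, (_, devs) in table.items() if len(devs) > 1}


# Synthetic sample, for illustration only (see the assumptions above).
SAMPLE = """\
           CPU0       CPU1
 16:        120          0   IO-APIC-fasteoi   ohci_hcd:usb3, ohci_hcd:usb4
 20:      50211          0   IO-APIC-fasteoi   sata_mv
 21:      48790          0   IO-APIC-fasteoi   sata_mv
 41:     913004          0   PCI-MSI-edge      eth0
NMI:          0          0   Non-maskable interrupts
"""

if __name__ == "__main__":
    table = parse_interrupts(SAMPLE)
    print(shared_irqs(table))
```

On a real machine, feed it open("/proc/interrupts").read().  If the NIC
and the MV8 IRQs are absent from the shared_irqs() result, they are
disjoint as described above.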

Why do I think this is specific to heavy PCI loads?  Because if I
write to a disk connected to one of the motherboard's native SATA
ports, it never crashes (I can push for 12 hours and see no signs of
distress, and that push runs about 20 MB/sec faster because the
onboard SATA is faster than PCI to the MV8---same exact disk, btw,
just plugged in elsewhere).  I can also do "nc blahblah < /dev/zero"
from the source and "nc > /dev/null" on the target and see almost
120 MB/sec through the link for hours, with no problem.  I can also -copy-
from a disk on a motherboard native SATA port to a disk on the PCI bus
and again, no problem if the network isn't in use---I copied 2TB of
ext4 filesystem that way with no problem.  USB traffic doesn't affect
it; I tried booting with my system disk being on a native SATA port
and it changed nothing.  (My normal configuration is instead a USB
disk; some of my tests used a USB flash drive for LiveCD images.)

This machine is actually one of a pair of clones---same exact hardware
on both, bought at the same time.  Both machines exhibit exactly the
same behavior, which rules out any one-off hardware issue.  I have
-not- flashed a newer BIOS into the machine (my rev is from October
of last year, and there are two newer ones) because I consider the
risks of bricking a mobo larger than the chances that this is actually
a BIOS bug instead.  These mobos have the capability of temporarily
booting a different BIOS via USB without flashing it; I -may- try that
if I trust it enough.

I happened to have a GA311 PCI (not PCIe) gigabit card sitting around
still in its shrinkwrap that someone gave to me.  I stuck it in the
free PCI slot.  I was able to push 2TB to the test machine (at about
50 MB/sec) that way via my tar|nc pipe, and it wrote to the PCI-based
MV8 disk for 12 hours without a hiccup.  So even if the PCI bus is
totally saturated, no crash---as long as I'm not using the onboard
Realtek NIC.
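
As a sanity check on those figures, 2TB at about 50 MB/sec should take
roughly 11 hours, which fits the 12-hour run.  A small sketch of the
arithmetic, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6
bytes); the function name is just for illustration:

```python
def transfer_hours(size_bytes, rate_bytes_per_sec):
    """Hours needed to move size_bytes at a steady rate_bytes_per_sec."""
    return size_bytes / rate_bytes_per_sec / 3600.0


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

if __name__ == "__main__":
    print(round(transfer_hours(2 * TB, 50 * MB), 1))  # 11.1
```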

I tried the 11.04 Natty daily build as of 3/30/2011 (desktop AMD64)
and -that- did not crash, BUT the test is incomplete---Natty loaded
the machine -terribly-, spinning up the CPU fan to maximum and soaking
almost all 6 cores!  One core was entirely devoted to running kswapd
at 100%; most of two others were running the tar and the nc at more
than 70% each.  Under 10.10, the machine sits at lowest speed (800 MHz)
and only a couple cores are doing anything, and they're at 10-20%.
And my transfer rates were abysmal---roughly 45 MB/sec---probably
because of the CPU load, which went away as soon as I aborted the
test when I couldn't stand the fan noise any more.  That's not enough
time for the complete test---one of my crashing tests under 10.10 was
to use rsync via ssh instead of tar via nc, and the additional load
from all the crypto slowed things down enough (to about 50 MB/sec
again) that it took 3 hours to crash instead of 10 minutes.  So
Natty may still have the same bug, if it were usable enough to test.
(WTF is the deal with Natty?  WTF is kswapd soaking an entire core
just because I was writing to disk?  Swap was completely unused,
and the load vanished as soon as I stopped the test.)

If someone cares about this, I can attach all the logs you'd ever
want, but I'll leave that for now.  But be advised---just testing
the net -alone-, without heavy PCI activity, will -not- find this
bug.

Given the huge number of confused works/doesn't-work/worked in
last release but not this one and vice versa about the RTL8111,
I'm giving up.  I have a free PCIe slot and will be buying two
Intel EXPI9301CTBLK's and putting one in each machine, and never
using the RTL8111 again.  I hope that works.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/746914

Title:
  MSI 785GT with Realtek RTL8111/8168B locks up only with heavy gigabit
  traffic -and- heavy PCI load (10.10 at least)

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
