-------- Original Message --------
Subject: Re: sluggish mvpmc, network errors
Date: Thu, 05 Apr 2007 15:11:46 -0400
From: Tom Metro
Michael Drons wrote:
>> And for the MVP:
>> # ifconfig
>> eth0      Link encap:Ethernet  HWaddr ...
>>           inet addr:192.168.0.242  Bcast:192.168.0.255  Mask:255.255.255.0
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:7965082 errors:99961 dropped:0 overruns:99961 frame:0
>>           TX packets:2796846 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:3091505936 (2.8 GiB)  TX bytes:0 (0.0 B)
>>           Interrupt:27 Base address:0xd300 DMA chan:1
>>
>> Maybe the dailies will help with that.

The latest daily does seem to be a bit more responsive, but I still run into a problem where mvpmc gets progressively slower after use. After watching a show or two the UI interaction will slow to a crawl, forcing me to power cycle. But all it takes is a "warm" restart to clear it.

> My guess is that it is either a duplex setting or physical cable
> issue.

If the duplex is mismatched between the two endpoints, I'd expect it not to work at all. If the switch was performing "duplex translation" and failing to keep up, then I'd expect the ping flood and/or streaming video to show problems. A cable issue isn't out of the question, as I ran the wire myself and punched down the ends to the jacks, but again, I'd expect this to show flaws in the streamed video and other areas, like corrupt dongles causing failed boots, if it's the receive wires that are faulty.

I've also tried a ping from the MVP to the back-end, though it is of somewhat limited usefulness as busybox doesn't support a flood or wait option:

# ping -s 1400 192.168.0.203
PING 192.168.0.203 (192.168.0.203): 1400 data bytes
1428 bytes from 192.168.0.203: icmp_seq=0 ttl=64 time=2.7 ms
1428 bytes from 192.168.0.203: icmp_seq=1 ttl=64 time=1.2 ms
1428 bytes from 192.168.0.203: icmp_seq=2 ttl=64 time=1.1 ms
[...]
1428 bytes from 192.168.0.203: icmp_seq=58 ttl=64 time=1.1 ms

--- 192.168.0.203 ping statistics ---
59 packets transmitted, 59 packets received, 0% packet loss
round-trip min/avg/max = 1.1/1.1/2.7 ms
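The flood could be run in the other direction, though, since the back-end has the full iputils ping. An untested sketch (run as root; 192.168.0.242 is the MVP's address from above):

# ping -f -s 1400 192.168.0.242

With -f, dots that accumulate on the screen represent requests that went unanswered, so a clean run would help rule out the cable and switch.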
> Try setting the duplex manually on the mvp.
> Add the dongle config commands (below)...

Do you see anything wrong with manually running these via telnet after a cold boot?

> echo 0 > /proc/sys/dev/eth0/autoneg
> echo 1 > /proc/sys/dev/eth0/rfduplx
> echo 1 > /proc/sys/dev/eth0/swfdup
> echo 1 > /proc/sys/dev/eth0/autoneg

Let's see what they're set to first:

# cat /proc/sys/dev/eth0/autoneg
1
# cat /proc/sys/dev/eth0/rfduplx
1
# cat /proc/sys/dev/eth0/swfdup
1

And on the back-end:

# ethtool eth0
Settings for eth0:
        Supported ports: [ MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 1
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Link detected: yes

So to me it looks like both ends are already running at 100 Mbps, full duplex. But just to be sure, I'll reboot the MVP to clear the error counters:

# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0D:FE:0C:01:28
          inet addr:192.168.0.242  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:56 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:4716 (4.6 KiB)  TX bytes:0 (0.0 B)
          Interrupt:27 Base address:0xd300 DMA chan:1

And run the commands...

echo 0 > /proc/sys/dev/eth0/autoneg
echo 1 > /proc/sys/dev/eth0/rfduplx
echo 1 > /proc/sys/dev/eth0/swfdup
echo 1 > /proc/sys/dev/eth0/autoneg

and use the UI for a bit...

> The error counters on the mvp are definitely causing
> slow/sluggish response from the mvp.

I'm glad to see that there is some concrete indicator of the problem, but so far I'm not convinced that the overruns are anything more than a side-effect symptom of something going wrong in the software. Any thoughts as to a next step? I could perhaps mess with the Ethernet receive buffer size, but that's likely to be only a bandaid.

I'm going to check the load average and capture the output from top the next time the UI starts getting sluggish. Merely waiting for corrupt packets to be retransmitted - if that's the root cause - should be a blocking operation that doesn't eat up CPU.
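Rather than trying to catch it by hand, I may just leave something like this running in a telnet session - untested, but busybox sh should manage the loop - so the load and the error counters get sampled every 30 seconds while the UI is bogged down:

# while true; do uptime; ifconfig eth0 | grep errors; sleep 30; done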
Here's the uptime fresh after a cold boot:

# uptime
 19:25:55 up 4 min, load average: 0.00, 0.02, 0.00

and top:

Mem: 13332K used, 352K free, 0K shrd, 3860K buff, 4193284K cached
Load average: 0.00, 0.01, 0.00    (State: S=sleeping R=running, W=waiting)

  PID USER     STATUS   RSS  PPID %CPU %MEM COMMAND
  134 root     R        376   115  0.5  2.7 top
   50 root     S        280     1  0.3  2.0 telnetd
  114 root     S       8444   108  0.0 61.7 mvpmc
  122 root     S       8444   116  0.0 61.7 mvpmc
  120 root     S       8444   116  0.0 61.7 mvpmc
  121 root     S       8444   116  0.0 61.7 mvpmc
  118 root     S       8444   116  0.0 61.7 mvpmc
  116 root     S       8444   114  0.0 61.7 mvpmc
  117 root     S       8444   116  0.0 61.7 mvpmc
  119 root     S       8444   116  0.0 61.7 mvpmc
  125 root     S       8444   116  0.0 61.7 mvpmc
  126 root     S       8444   116  0.0 61.7 mvpmc
  124 root     S       8444   116  0.0 61.7 mvpmc
  108 root     S        652     1  0.0  4.7 mvpmc
  115 root     S        408    50  0.0  2.9 sh
    1 root     S        344     0  0.0  2.5 init
   80 root     S        316     1  0.0  2.3 udhcpc
   91 root     S        284     1  0.0  2.0 ntpclient
    8 root     SW         0     1  0.0  0.0 mtdblockd
    4 root     SW         0     1  0.0  0.0 kswapd
    2 root     SW         0     1  0.0  0.0 keventd
    3 root     SWN        0     1  0.0  0.0 ksoftirqd_CPU0
    5 root     SW         0     1  0.0  0.0 bdflush
    6 root     SW         0     1  0.0  0.0 kupdated
    7 root     Z          0     1  0.0  0.0 cifsoplockd

I waited until a "Please wait" dialog appeared and seemed to be stuck, and captured the stats again:

# uptime
 20:31:24 up 1:10, load average: 0.00, 0.02, 0.08

Mem: 13232K used, 452K free, 0K shrd, 3256K buff, 4193188K cached
Load average: 0.00, 0.03, 0.09    (State: S=sleeping R=running, W=waiting)

  PID USER     STATUS   RSS  PPID %CPU %MEM COMMAND
  164 root     R        376   115  0.5  2.7 top
   50 root     S         80     1  0.3  0.5 telnetd
  152 root     S       8948   150  0.0 65.3 mvpmc
  153 root     S       8948   150  0.0 65.3 mvpmc
  149 root     S       8948   108  0.0 65.3 mvpmc
  160 root     S       8948   150  0.0 65.3 mvpmc
  151 root     S       8948   150  0.0 65.3 mvpmc
  157 root     S       8948   150  0.0 65.3 mvpmc
  156 root     S       8948   150  0.0 65.3 mvpmc
  162 root     S       8948   150  0.0 65.3 mvpmc
  155 root     S       8948   150  0.0 65.3 mvpmc
  159 root     S       8948   150  0.0 65.3 mvpmc
  161 root     S       8948   150  0.0 65.3 mvpmc
  154 root     S       8948   150  0.0 65.3 mvpmc
  150 root     S       8948   149  0.0 65.3 mvpmc
  158 root     S       8948   150  0.0 65.3 mvpmc
  108 root     S        460     1  0.0  3.3 mvpmc
  115 root     S        240    50  0.0  1.7 sh
   91 root     S        128     1  0.0  0.9 ntpclient
    1 root     S         92     0  0.0  0.6 init
   80 root     S         56     1  0.0  0.4 udhcpc
    8 root     SW         0     1  0.0  0.0 mtdblockd
    4 root     SW         0     1  0.0  0.0 kswapd
    3 root     SWN        0     1  0.0  0.0 ksoftirqd_CPU0
    7 root     Z          0     1  0.0  0.0 cifsoplockd
    5 root     SW         0     1  0.0  0.0 bdflush
    6 root     SW         0     1  0.0  0.0 kupdated
    2 root     SW         0     1  0.0  0.0 keventd

That all looks pretty normal to me. So much for my theory.

But the overrun count went up:

# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0D:FE:0C:01:28
          inet addr:192.168.0.242  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:308233 errors:4070 dropped:0 overruns:4070 frame:0
          TX packets:155237 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:450168188 (429.3 MiB)  TX bytes:0 (0.0 B)
          Interrupt:27 Base address:0xd300 DMA chan:1

When I went back to the MVP 10+ minutes later, the "Please wait" dialog was still on the screen with the animated bar still moving.

There is a possibility that I'm seeing multiple failure modes. Often it doesn't get completely stuck, but instead just operates in slow motion, such that it takes 10 or more seconds to repaint the text on the screen. It's quite possible that the load average would show a spike under those conditions. I could have sworn I checked it once under those conditions and saw it up in the 20s.

I'll keep testing, but any other theories? Thanks for spending the time on this.

 -Tom

-------- Original Message --------
Subject: Re: sluggish mvpmc, network errors
Date: Thu, 05 Apr 2007 18:31:35 -0400
From: Tom Metro

Michael Drons wrote:
> My friend had the exact same issue. He made the
> changes in his dongle config file and all of his
> issues went away.

So your theory is that adding those echo statements to dongle.bin.config might resolve it? I'm skeptical, but it is easy enough to try.

> Can you go back to back with the mythtv server?

You mean attach the MVP directly to the Ethernet port of the back-end using a crossover cable? Yes, but only temporarily, as the back-end only has one Ethernet interface, and thus would be cut off from the net. I'd also need to set up a DHCP server on the back-end, as that currently resides on another machine on my LAN. I guess I'd need to have a bit more evidence pointing in the direction of that being useful before I'd go through the trouble.

I keep coming back to the fact that they aren't just any old Ethernet errors, but are specifically overruns, and my expectation is that you get overruns when the receiving CPU is too slow, or the IRQ handler has problems. Not surprisingly, this document:

  Linux Network Administrators Guide
  http://osdir.com/LDP/LDP/nag2/nag2.pdf

says:

  Receiver overruns usually occur when packets come in faster than the
  kernel can service the last interrupt.

And this article:

  http://www.onlamp.com/pub/a/onlamp/2005/11/17/tcp_tuning.html

says:

  To achieve maximum throughput, it is critical to use optimal TCP
  socket buffer sizes for the link you are using. If the buffers are
  too small, the TCP congestion window will never open up fully, so the
  sender will be throttled. If the buffers are too large, the sender can
  overrun the receiver, which will cause the receiver to drop packets
  and the TCP congestion window to shut down. This is more likely to
  happen if the sending host is faster than the receiving host.

Or more clearly stated in the author's tuning guide:

  http://dsd.lbl.gov/TCP-tuning/TCP-tuning.html

  If the receiver buffers are too large, TCP flow control breaks and the
  sender can overrun the receiver, which will cause the TCP window to
  shut down.

So bigger buffers aren't necessarily the solution if the receiver can't sustain adequate speed to empty them, and can in fact cause overruns due to TCP flow control not kicking in when it normally would.

The buffer settings on the MVP seem pretty close to the normal Linux defaults for 2.4, according to the article:

# sysctl -A
...
net.ipv4.tcp_rmem = 4096 43689 87378
net.ipv4.tcp_wmem = 4096 16384 65536

That's the receive buffer on the first line, send on the second; each shows the min, default, and max buffer size. The send buffers seem to be kernel stock. The receive buffers look like they've been tweaked, but not by much, and nowhere near as high as the article recommends for good sustained performance (although the article seems to be assuming a high-latency connection, like a WAN, rather than a LAN). This page:

  http://dsd.lbl.gov/TCP-tuning/linux.html

has more details on buffer tuning in Linux.

Reducing the default and max receive buffer size will, in theory, eliminate the overruns (at the expense of bandwidth), if indeed they are a result of the MVP not being able to keep up with the sustained data flow, but I'm skeptical that this is the case, otherwise I'd see stuttering during video playback. My MVP seems to handle playing back streams that peak at 6 Mbps without problems.

Seems one test I could run is to reset the error counters and run a bandwidth test in mvpmc. Or, perhaps better, if my suspicion is correct and the mvpmc client software is causing the problem, pull a large file to the MVP via tftp to /dev/null on the command line with mvpmc not running. (Though tftp tends to be really slow. Maybe nfs would be better.)
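Something along these lines should do for the tftp test - untested, and it assumes a TFTP server is running on the back-end; the file name is just a placeholder:

# tftp -g -r bigrecording.mpg -l /dev/null 192.168.0.203
# ifconfig eth0 | grep errors

If the overrun counter climbs during a plain transfer like that, with mvpmc stopped, then the client software is off the hook; if it stays flat, suspicion shifts back to mvpmc.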
But of course short-duration events keeping the CPU busy will also cause overruns if the buffers aren't big enough. In that case increasing the buffers should help, but only if those busy periods are truly momentary.

It still comes down to figuring out whether the sluggish UI is a side effect of mvpmc waiting for packets to be retransmitted, or whether something else is bogging down the MVP and the packet errors are just another symptom, like the UI. Unless there is something buggy in the protocol layer, I don't think the error count is high enough to explain the delays I am seeing - packet retransmission of a few dozen small MythTV control packets (supposedly what's going over the network while I'm interacting with the UI and no video is playing) should be imperceptible.

That first article:

  http://www.onlamp.com/pub/a/onlamp/2005/11/17/tcp_tuning.html

also had:

  A surprisingly common source of LAN trouble with 100BT networks is
  when the host is set to full duplex but the Ethernet switch is set to
  half duplex, or vice versa. Newer hardware will autonegotiate this,
  but with some older hardware, autonegotiation will sometimes fail,
  with the result being a working but very slow network (typically only
  1Mbps to 2Mbps).

"Newer hardware will autonegotiate this..." All the hardware I'm using is relatively new and supposedly supports autonegotiation.

Maybe there is some operation that is being run from the Ethernet driver's IRQ handler that takes longer than expected on H3 hardware. Or maybe there is some other operation that occurs while interrupts are masked (like in one of the other IRQ handlers) that takes longer than expected on H3 hardware. I don't yet know enough about the hardware to speculate...

 -Tom

-------- Original Message --------
Subject: Re: sluggish mvpmc, network errors
Date: Fri, 06 Apr 2007 15:54:40 -0400
From: Tom Metro

Michael Drons wrote:
> My friend had the exact same issue. He made the
> changes in his dongle config file and all of his
> issues went away.

I made the changes in dongle.bin.conf yesterday and rebooted.
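For the record, what I appended were the four lines suggested earlier (I'm assuming the end of the file is as good a spot as any):

echo 0 > /proc/sys/dev/eth0/autoneg
echo 1 > /proc/sys/dev/eth0/rfduplx
echo 1 > /proc/sys/dev/eth0/swfdup
echo 1 > /proc/sys/dev/eth0/autoneg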
Today the overrun count is still incrementing:

# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0D:FE:0C:01:28
          inet addr:192.168.0.242  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2534540 errors:32492 dropped:0 overruns:32492 frame:0
          TX packets:883792 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3691691978 (3.4 GiB)  TX bytes:0 (0.0 B)
          Interrupt:27 Base address:0xd300 DMA chan:1

It was worth a shot, but I think the cause is elsewhere.

 -Tom

_______________________________________________
Mvpmc-users mailing list
Mvpmc-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mvpmc-users
mvpmc wiki: http://mvpmc.wikispaces.com/