-------- Original Message --------
Subject: Re: sluggish mvpmc, network errors
Date: Thu, 05 Apr 2007 15:11:46 -0400
From: Tom Metro
Michael Drons wrote:
>> And for the MVP:
>> # ifconfig
>> eth0      Link encap:Ethernet  HWaddr ...
>>           inet addr:192.168.0.242  Bcast:192.168.0.255  Mask:255.255.255.0
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:7965082 errors:99961 dropped:0 overruns:99961 frame:0
>>           TX packets:2796846 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:3091505936 (2.8 GiB)  TX bytes:0 (0.0 B)
>>           Interrupt:27 Base address:0xd300 DMA chan:1
>>
>> Maybe the dailies will help with that.

The latest daily does seem to be a bit more responsive, but I still run into a problem where mvpmc gets progressively slower after use. After watching a show or two the UI interaction will slow to a crawl, forcing me to power cycle. But all it takes is a "warm" restart to clear it.

> My guess is that it is either a duplex setting or physical cable
> issue.

If the duplex is mismatched between the two endpoints, I'd expect it not to work at all. If the switch was performing "duplex translation" and failing to keep up, then I'd expect the ping flood and/or streaming video to show problems. A cable issue isn't out of the question, as I ran the wire myself and punched down the ends to the jacks, but again, I'd expect this to show flaws in the streamed video and other areas, like corrupt dongles causing failed boots, if it's the receive wires that are faulty.

I've also tried a ping from the MVP to the back-end, though it is of somewhat limited usefulness as busybox doesn't support a flood or wait option:

# ping -s 1400 192.168.0.203
PING 192.168.0.203 (192.168.0.203): 1400 data bytes
1428 bytes from 192.168.0.203: icmp_seq=0 ttl=64 time=2.7 ms
1428 bytes from 192.168.0.203: icmp_seq=1 ttl=64 time=1.2 ms
1428 bytes from 192.168.0.203: icmp_seq=2 ttl=64 time=1.1 ms
[...]
1428 bytes from 192.168.0.203: icmp_seq=58 ttl=64 time=1.1 ms

--- 192.168.0.203 ping statistics ---
59 packets transmitted, 59 packets received, 0% packet loss
round-trip min/avg/max = 1.1/1.1/2.7 ms
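The flood could be run in the other direction, though, since the back-end has the full iputils ping. An untested sketch (run as root; 192.168.0.242 is the MVP's address from above):

# ping -f -s 1400 192.168.0.242

With -f, dots that accumulate on the screen represent requests that went unanswered, so a clean run would help rule out the cable and switch.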
> Try setting the duplex manually on the mvp.
> Add the dongle config commands (below)...

Do you see anything wrong with manually running these via telnet after a cold boot?

> echo 0 > /proc/sys/dev/eth0/autoneg
> echo 1 > /proc/sys/dev/eth0/rfduplx
> echo 1 > /proc/sys/dev/eth0/swfdup
> echo 1 > /proc/sys/dev/eth0/autoneg

Let's see what they're set to first:

# cat /proc/sys/dev/eth0/autoneg
1
# cat /proc/sys/dev/eth0/rfduplx
1
# cat /proc/sys/dev/eth0/swfdup
1

And on the back-end:

# ethtool eth0
Settings for eth0:
        Supported ports: [ MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 100Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 1
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: g
        Wake-on: d
        Link detected: yes

So to me it looks like both ends are already running at 100 Mbps, full duplex. But just to be sure, I'll reboot the MVP to clear the error counters:

# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0D:FE:0C:01:28
          inet addr:192.168.0.242  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:56 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:4716 (4.6 KiB)  TX bytes:0 (0.0 B)
          Interrupt:27 Base address:0xd300 DMA chan:1

And run the commands...

echo 0 > /proc/sys/dev/eth0/autoneg
echo 1 > /proc/sys/dev/eth0/rfduplx
echo 1 > /proc/sys/dev/eth0/swfdup
echo 1 > /proc/sys/dev/eth0/autoneg

and use the UI for a bit...

> The error counters on the mvp are definitely causing
> slow/sluggish response from the mvp.

I'm glad to see that there is some concrete indicator of the problem, but so far I'm not convinced that the overruns are anything more than a side-effect symptom of something going wrong in the software. Any thoughts as to a next step? I could perhaps mess with the Ethernet receive buffer size, but that's likely to be only a bandaid.

I'm going to check the load average and capture the output from top the next time the UI starts getting sluggish. Merely waiting for corrupt packets to be retransmitted - if that's the root cause - should be a blocking operation that doesn't eat up CPU.
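Rather than trying to catch it by hand, I may just leave something like this running in a telnet session - untested, but busybox sh should manage the loop - so the load and the error counters get sampled every 30 seconds while the UI is bogged down:

# while true; do uptime; ifconfig eth0 | grep errors; sleep 30; done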
Here's the uptime fresh after a cold boot:

# uptime
 19:25:55 up 4 min, load average: 0.00, 0.02, 0.00

and top:

Mem: 13332K used, 352K free, 0K shrd, 3860K buff, 4193284K cached
Load average: 0.00, 0.01, 0.00    (State: S=sleeping R=running, W=waiting)

  PID USER     STATUS   RSS  PPID %CPU %MEM COMMAND
  134 root     R        376   115  0.5  2.7 top
   50 root     S        280     1  0.3  2.0 telnetd
  114 root     S       8444   108  0.0 61.7 mvpmc
  122 root     S       8444   116  0.0 61.7 mvpmc
  120 root     S       8444   116  0.0 61.7 mvpmc
  121 root     S       8444   116  0.0 61.7 mvpmc
  118 root     S       8444   116  0.0 61.7 mvpmc
  116 root     S       8444   114  0.0 61.7 mvpmc
  117 root     S       8444   116  0.0 61.7 mvpmc
  119 root     S       8444   116  0.0 61.7 mvpmc
  125 root     S       8444   116  0.0 61.7 mvpmc
  126 root     S       8444   116  0.0 61.7 mvpmc
  124 root     S       8444   116  0.0 61.7 mvpmc
  108 root     S        652     1  0.0  4.7 mvpmc
  115 root     S        408    50  0.0  2.9 sh
    1 root     S        344     0  0.0  2.5 init
   80 root     S        316     1  0.0  2.3 udhcpc
   91 root     S        284     1  0.0  2.0 ntpclient
    8 root     SW         0     1  0.0  0.0 mtdblockd
    4 root     SW         0     1  0.0  0.0 kswapd
    2 root     SW         0     1  0.0  0.0 keventd
    3 root     SWN        0     1  0.0  0.0 ksoftirqd_CPU0
    5 root     SW         0     1  0.0  0.0 bdflush
    6 root     SW         0     1  0.0  0.0 kupdated
    7 root     Z          0     1  0.0  0.0 cifsoplockd

I waited until a "Please wait" dialog appeared and seemed to be stuck, and captured the stats again:

# uptime
 20:31:24 up 1:10, load average: 0.00, 0.02, 0.08

Mem: 13232K used, 452K free, 0K shrd, 3256K buff, 4193188K cached
Load average: 0.00, 0.03, 0.09    (State: S=sleeping R=running, W=waiting)

  PID USER     STATUS   RSS  PPID %CPU %MEM COMMAND
  164 root     R        376   115  0.5  2.7 top
   50 root     S         80     1  0.3  0.5 telnetd
  152 root     S       8948   150  0.0 65.3 mvpmc
  153 root     S       8948   150  0.0 65.3 mvpmc
  149 root     S       8948   108  0.0 65.3 mvpmc
  160 root     S       8948   150  0.0 65.3 mvpmc
  151 root     S       8948   150  0.0 65.3 mvpmc
  157 root     S       8948   150  0.0 65.3 mvpmc
  156 root     S       8948   150  0.0 65.3 mvpmc
  162 root     S       8948   150  0.0 65.3 mvpmc
  155 root     S       8948   150  0.0 65.3 mvpmc
  159 root     S       8948   150  0.0 65.3 mvpmc
  161 root     S       8948   150  0.0 65.3 mvpmc
  154 root     S       8948   150  0.0 65.3 mvpmc
  150 root     S       8948   149  0.0 65.3 mvpmc
  158 root     S       8948   150  0.0 65.3 mvpmc
  108 root     S        460     1  0.0  3.3 mvpmc
  115 root     S        240    50  0.0  1.7 sh
   91 root     S        128     1  0.0  0.9 ntpclient
    1 root     S         92     0  0.0  0.6 init
   80 root     S         56     1  0.0  0.4 udhcpc
    8 root     SW         0     1  0.0  0.0 mtdblockd
    4 root     SW         0     1  0.0  0.0 kswapd
    3 root     SWN        0     1  0.0  0.0 ksoftirqd_CPU0
    7 root     Z          0     1  0.0  0.0 cifsoplockd
    5 root     SW         0     1  0.0  0.0 bdflush
    6 root     SW         0     1  0.0  0.0 kupdated
    2 root     SW         0     1  0.0  0.0 keventd

That all looks pretty normal to me. So much for my theory.

But the overrun count went up:

# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0D:FE:0C:01:28
          inet addr:192.168.0.242  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:308233 errors:4070 dropped:0 overruns:4070 frame:0
          TX packets:155237 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:450168188 (429.3 MiB)  TX bytes:0 (0.0 B)
          Interrupt:27 Base address:0xd300 DMA chan:1

When I went back to the MVP 10+ minutes later, the "Please wait" dialog was still on the screen with the animated bar still moving.

There is a possibility that I'm seeing multiple failure modes. Often it doesn't get completely stuck, but instead just operates in slow motion, such that it takes 10 or more seconds to repaint the text on the screen. It's quite possible that the load average would show a spike under those conditions. I could have sworn I checked it once under those conditions and saw it up in the 20s.

I'll keep testing, but any other theories? Thanks for spending the time on this.

 -Tom

-------- Original Message --------
Subject: Re: sluggish mvpmc, network errors
Date: Thu, 05 Apr 2007 18:31:35 -0400
From: Tom Metro

Michael Drons wrote:
> My friend had the exact same issue. He made the
> changes in his dongle config file and all of his
> issues went away.

So your theory is that adding those echo statements to dongle.bin.config might resolve it? I'm skeptical, but it is easy enough to try.

> Can you go back to back with the mythtv server?

You mean attach the MVP directly to the Ethernet port of the back-end using a crossover cable? Yes, but only temporarily, as the back-end only has one Ethernet interface, and thus would be cut off from the net. I'd also need to set up a DHCP server on the back-end, as that currently resides on another machine on my LAN. I guess I'd need to have a bit more evidence pointing in the direction of that being useful before I'd go through the trouble.

I keep coming back to the fact that they aren't just any old Ethernet errors, but are specifically overruns, and my expectation is that you get overruns when the receiving CPU is too slow, or the IRQ handler has problems. Not surprisingly, this document:

  Linux Network Administrators Guide
  http://osdir.com/LDP/LDP/nag2/nag2.pdf

says:

  Receiver overruns usually occur when packets come in faster than the
  kernel can service the last interrupt.

And this article:

  http://www.onlamp.com/pub/a/onlamp/2005/11/17/tcp_tuning.html

says:

  To achieve maximum throughput, it is critical to use optimal TCP
  socket buffer sizes for the link you are using. If the buffers are
  too small, the TCP congestion window will never open up fully, so the
  sender will be throttled. If the buffers are too large, the sender can
  overrun the receiver, which will cause the receiver to drop packets
  and the TCP congestion window to shut down. This is more likely to
  happen if the sending host is faster than the receiving host.

Or more clearly stated in the author's tuning guide:

  http://dsd.lbl.gov/TCP-tuning/TCP-tuning.html

  If the receiver buffers are too large, TCP flow control breaks and the
  sender can overrun the receiver, which will cause the TCP window to
  shut down.

So bigger buffers aren't necessarily the solution if the receiver can't sustain adequate speed to empty them, and can in fact cause overruns due to TCP flow control not kicking in when it normally would.

The buffer settings on the MVP seem pretty close to the normal Linux defaults for 2.4, according to the article:

# sysctl -A
...
net.ipv4.tcp_rmem = 4096 43689 87378
net.ipv4.tcp_wmem = 4096 16384 65536

That's the receive buffer on the first line, send on the second; each shows the min, default, and max buffer size. The send buffers seem to be kernel stock. The receive buffers look like they've been tweaked, but not by much, and nowhere near as high as the article recommends for good sustained performance (although the article seems to be assuming a high-latency connection, like a WAN, rather than a LAN). This page:

  http://dsd.lbl.gov/TCP-tuning/linux.html

has more details on buffer tuning in Linux.

Reducing the default and max receive buffer size will, in theory, eliminate the overruns (at the expense of bandwidth), if indeed they are a result of the MVP not being able to keep up with the sustained data flow, but I'm skeptical that this is the case, otherwise I'd see stuttering during video playback. My MVP seems to handle playing back streams that peak at 6 Mbps without problems.

Seems one test I could run is to reset the error counters and run a bandwidth test in mvpmc. Or, perhaps better, if my suspicion is correct and the mvpmc client software is causing the problem, pull a large file to the MVP via tftp to /dev/null on the command line with mvpmc not running. (Though tftp tends to be really slow. Maybe nfs would be better.)
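Something along these lines should do for the tftp test - untested, and it assumes a TFTP server is running on the back-end; the file name is just a placeholder:

# tftp -g -r bigrecording.mpg -l /dev/null 192.168.0.203
# ifconfig eth0 | grep errors

If the overrun counter climbs during a plain transfer like that, with mvpmc stopped, then the client software is off the hook; if it stays flat, suspicion shifts back to mvpmc.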
But of course short-duration events keeping the CPU busy will also cause overruns if the buffers aren't big enough. In that case increasing the buffers should help, but only if those busy periods are truly momentary.

It still comes down to figuring out whether the sluggish UI is a side effect of mvpmc waiting for packets to be retransmitted, or whether something else is bogging down the MVP and the packet errors are just another symptom, like the UI. Unless there is something buggy in the protocol layer, I don't think the error count is high enough to explain the delays I am seeing - packet retransmission of a few dozen small MythTV control packets (supposedly what's going over the network while I'm interacting with the UI and no video is playing) should be imperceptible.

That first article:

  http://www.onlamp.com/pub/a/onlamp/2005/11/17/tcp_tuning.html

also had:

  A surprisingly common source of LAN trouble with 100BT networks is
  when the host is set to full duplex but the Ethernet switch is set to
  half duplex, or vice versa. Newer hardware will autonegotiate this,
  but with some older hardware, autonegotiation will sometimes fail,
  with the result being a working but very slow network (typically only
  1Mbps to 2Mbps).

"Newer hardware will autonegotiate this..." All the hardware I'm using is relatively new and supposedly supports autonegotiation.

Maybe there is some operation that is being run from the Ethernet driver's IRQ handler that takes longer than expected on H3 hardware. Or maybe there is some other operation that occurs while interrupts are masked (like in one of the other IRQ handlers) that takes longer than expected on H3 hardware. I don't yet know enough about the hardware to speculate...

 -Tom

-------- Original Message --------
Subject: Re: sluggish mvpmc, network errors
Date: Fri, 06 Apr 2007 15:54:40 -0400
From: Tom Metro

Michael Drons wrote:
> My friend had the exact same issue. He made the
> changes in his dongle config file and all of his
> issues went away.

I made the changes in dongle.bin.conf yesterday and rebooted.
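For the record, what I appended were the four lines suggested earlier (I'm assuming the end of the file is as good a spot as any):

echo 0 > /proc/sys/dev/eth0/autoneg
echo 1 > /proc/sys/dev/eth0/rfduplx
echo 1 > /proc/sys/dev/eth0/swfdup
echo 1 > /proc/sys/dev/eth0/autoneg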
Today the overrun count is still incrementing:

# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0D:FE:0C:01:28
          inet addr:192.168.0.242  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2534540 errors:32492 dropped:0 overruns:32492 frame:0
          TX packets:883792 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3691691978 (3.4 GiB)  TX bytes:0 (0.0 B)
          Interrupt:27 Base address:0xd300 DMA chan:1

It was worth a shot, but I think the cause is elsewhere.

 -Tom

_______________________________________________
Mvpmc-users mailing list
Mvpmc-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mvpmc-users
mvpmc wiki: http://mvpmc.wikispaces.com/