On Sat, 12 Jan 2002, Jacques B. Siboni wrote:

> Dear all,
>
> i forward the following mail to ltsp and beowulf as I use these concepts and
> Mosix group seems to be in a very depressed mood.
>
> The problem I encounter occurs before Mosix even starts. There is some new
> kind of stuff with kernel 2.4.xx that does not accept some kinds of fragments.
> It is more an NFS boot problem.
>
> One (quick and dirty) solution could be to allow the kernel to load even with
> an mtu less than 1500, which I could not do.
>
> Thanks in advance
>
> Jacques

Dear Jacques,

This has the look and feel of a hardware problem with your physical
network. The fact that small packets sometimes make it through but big
ones don't is telling indeed. You don't describe your physical network,
but one of the following could easily be the problem:

a) Bad wiring. One cable with an almost-broken wire can do this. So can
poorly wired connectors at the punchblock or inside the RJ45 connectors.

b) Wiring runs that are too long. 100BT has a maximum radius of 100 m
from a switch that can retime packets. If runs are too long, a collision
condition can easily occur: one host brings up the line to send, but the
signal doesn't reach a host downstream in time to keep it from ALSO
bringing up the line to send. In a high traffic density network, lots of
packets collide and are lost, and perhaps smaller packets have a better
chance of making it through at least sometimes.

c) Hubs instead of switches, especially too many hubs. Packets sent to a
hub are echoed on all lines, and ANY system trying to send in the same
window will cause a collision. Too many hubs add latency that reduces
the effective diameter of your network and increases the probability of
collisions. Switches actually read a packet and retransmit it on ONLY
the line it is destined for, and retime the packet besides. This
isolates systems from traffic not intended for them and improves network
stability and performance. Offhand I can't remember the maximum number
of "repeaters" (hubs) permitted in a 100BT network -- something like 3
-- because I haven't used hubs for years now, ever since switches got so
cheap. Good switches will also sometimes indicate lines with a fault
condition and isolate those lines.

d) Cheap/bad NICs. It is just my opinion, but this includes all RTL8139
NICs from any manufacturer.
These NICs have exhibited behavior like that which you describe in my
own systems, all by themselves on an otherwise perfect network -- if you
flood them with a packet stream, they can easily end up dropping all but
one or two packets in a hundred. Again, small packets probably don't
improve their efficiency (they just make for a longer stream with even
smaller interpacket gaps), but they likely do improve the probability
that a packet will make it through before timing out. Unfortunately,
RTL8139s are nearly ubiquitous, since they are available in $10 NICs and
some folks cannot resist the bargain. If you have 8139s, just throw them
away and buy a decent NIC -- eepro100, tulip, 3c905 -- and your problem
may magically go away.

e) It's a long shot, but a poorly supported card/driver, or interference
with a particular chipset, motherboard, or card combination, "can" cause
things like this. Frankly, I doubt it.

I'd work a-d over pretty thoroughly before I started worrying about
problems in the base linux kernel or network drivers (RTL drivers
excluded, although it isn't really a driver problem per se) or exotic
chipset problems. This is presuming that you are running a reasonably
recent and/or non-SMP production kernel. If you are running a really old
kernel (especially a really old SMP kernel) or an exotic homemade kernel
with strange drivers or the like, after I finished asking "why" I'd
agree that doing something sort of dumb like this could also cause such
a problem.

There are some lovely online guides to the care and feeding of Ethernet
networks, many of them linked to www.phy.duke.edu/brahma or available on
the Scyld website. One or more of them will (for example) tell you the
maximum number of repeaters permitted if in fact you are using hubs and
have a very large physical network.

Hope this helps. I'd advise investing in (minimally) a network cable
tester and, if there is any chance at all your cable runs are too long,
in a reflectometer.
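
As a rough illustration of why run length and hub count matter, the
round-trip delay budget for a single 100BT collision domain can be
sketched in a few lines of Python. The bit-time figures are assumptions
recalled from the IEEE 802.3 Fast Ethernet delay tables, not
measurements of your network, and the function names are made up for
this example, so check the numbers against one of the guides before
trusting the answer.

```python
# Rough round-trip delay budget for one 100BT (100BASE-TX) collision domain.
# ASSUMPTION: all constants are recalled from the IEEE 802.3 Fast Ethernet
# delay tables; verify them against the standard or a good guide.

SLOT_TIME_BT = 512           # a collision must be seen within 512 bit times
DTE_PAIR_BT = 100            # the two end stations (NICs) together
CLASS_I_REPEATER_BT = 140    # each Class I hub
CLASS_II_REPEATER_BT = 92    # each Class II hub
CAT5_BT_PER_M = 1.112        # round-trip delay per metre of Cat 5 cable


def round_trip_bit_times(cable_m, class_i=0, class_ii=0):
    """Worst-case round-trip delay, in bit times, between two hosts.

    cable_m is the total cable length between the two hosts in metres;
    class_i and class_ii are the number of hubs of each class in the path.
    """
    return (DTE_PAIR_BT
            + class_i * CLASS_I_REPEATER_BT
            + class_ii * CLASS_II_REPEATER_BT
            + cable_m * CAT5_BT_PER_M)


def within_budget(cable_m, class_i=0, class_ii=0):
    """True if a sender can still detect a collision before it stops sending."""
    return round_trip_bit_times(cable_m, class_i, class_ii) <= SLOT_TIME_BT


if __name__ == "__main__":
    # Hypothetical layouts: (total cable metres, Class I hubs, Class II hubs)
    for layout in [(200, 1, 0), (205, 0, 2), (300, 0, 3)]:
        bt = round_trip_bit_times(*layout)
        print(layout, round(bt, 2), "ok" if within_budget(*layout) else "TOO LONG")
```

A path that fails this check can produce late collisions that the sender
never notices. A switch port ends the collision domain, so the sum starts
over on the far side of it.
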
If you are using hubs or RTL NICs, I'd STRONGLY recommend swapping them
out for switches and decent NICs as rapidly as possible, especially for
the lines connecting to your servers.

   rgb

--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525