Here's where we stand on our cluster communications errors: nothing we did worked. We tried different ports on the switch. We tried forcing 1 Gbps. We tried forcing the port down to 10 Mbps. That actually seemed to help slightly, in that we lost communications only every 63 seconds or so, instead of every 15--60 seconds. But it still lost and re-established its connection to the cluster every 63 seconds.

So I decided to try setting up and using a TAP device, just to see what would happen.

Using the dedicated Ethernet card, it made no difference. It still lost communications every 63 seconds.

When I say dedicated Ethernet card, I probably should have stated earlier that it's a USB -> Ethernet device plugged into the system. I don't know what brand or model, but I can find out, if anyone wants to know.

So I decided to try tunneling through the "real" Ethernet port used by the Linux system. After figuring out what to do for the missing tunctl command under CentOS, I was able to set up a tunnel, and I did "attach xq tap:tap0". I then booted the system and wonder of wonders, miracle of miracles, it was seven minutes into the boot (yes, it takes a long time, mounting a slew of disks that needed to be rebuilt) before it lost communications. But it re-established them immediately, and as of my typing this, it has been twenty-nine minutes since that happened. No further drops. Normally, I wouldn't think twenty-nine minutes is enough to prove anything, but when it was dropping every 15--63 seconds for two solid days, this sounds like a fix to me.

So what does it mean? One thing it suggests is that the USB Ethernet device may be buggy or bad. I mean, it seems to work OK for TCP/IP communications, etc., but it sure sounds like it may be the part responsible for the problems, especially since tunneling through the built-in Ethernet card seems to work and tunneling through the USB device did not.

These are the commands I used to set up the tap device for CentOS:

   brctl addbr br0
   ifconfig eno1 0.0.0.0          # eno1 is the host's Ethernet device
   ifconfig br0 XXX.XX.XX.XX up   # the IP address of the host system
   brctl addif br0 eno1
   brctl setfd br0 0
   #tunctl -t tap0
   ip tuntap add tap0 mode tap    # Replacement for tunctl on CentOS 7
   brctl addif br0 tap0
   ifconfig tap0 up

I then just did "attach xq tap:tap0" in the init file. I guess I should set up a special MAC address, but I haven't yet, and so far, nothing seems amiss.
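
For the MAC address, one option would be to generate a random locally administered address rather than borrowing a real vendor prefix. What follows is only a minimal sketch: the gen_mac helper is a hypothetical name, not part of SIMH or CentOS, and the dash-separated output format is an assumption about what the simulator wants.

```shell
#!/bin/sh
# Hypothetical helper (not part of SIMH): print a random, locally
# administered, unicast MAC address in dash-separated form.
gen_mac() {
    # Five random octets from /dev/urandom, as one hex string.
    tail=$(od -An -N5 -tx1 /dev/urandom | tr -d ' \n')
    # First octet 02: locally-administered bit set, multicast bit clear.
    # Insert a dash after every octet, drop the trailing dash, uppercase.
    printf '02%s\n' "$tail" | sed 's/\(..\)/\1-/g; s/-$//' | tr 'a-f' 'A-F'
}

gen_mac
```

As far as I recall from the SIMH Ethernet documentation, the result would go into the init file as "set xq mac=..." ahead of the attach line, but that is worth checking against the documentation for the version in use. The address should be generated once and then kept fixed, not regenerated on every boot.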

While I thought having a dedicated Ethernet device would be the simplest thing, I can live with tunneling it through the shared Ethernet device, especially since it works and the former does not. ;-)
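
If the shared-port arrangement ever needs to be backed out, the setup commands can be reversed in the opposite order. This is an untested sketch under a few assumptions: the interface names are the ones used above, HOST_IP is a placeholder for the host's address, and the run wrapper and DRY_RUN variable are hypothetical conveniences so the commands can be previewed before running them as root.

```shell
#!/bin/sh
# Sketch: undo the bridge/tap setup. Names match the setup commands;
# HOST_IP is a placeholder and must be set to the host's real address.
BRIDGE=${BRIDGE:-br0}
NIC=${NIC:-eno1}
TAP=${TAP:-tap0}
HOST_IP=${HOST_IP:-XXX.XX.XX.XX}

# Hypothetical wrapper: with DRY_RUN=1 (the default) each command is
# printed instead of executed; set DRY_RUN=0 to actually run them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

teardown() {
    run ifconfig "$TAP" down
    run brctl delif "$BRIDGE" "$TAP"      # detach the tap from the bridge
    run ip tuntap del dev "$TAP" mode tap # remove the tap device
    run brctl delif "$BRIDGE" "$NIC"      # detach the physical NIC
    run ifconfig "$BRIDGE" down
    run brctl delbr "$BRIDGE"             # remove the bridge itself
    # Give the address back to the NIC; netmask and default route are
    # not restored here and may need to be set again by hand.
    run ifconfig "$NIC" "$HOST_IP" up
}

teardown
```

Run as-is, this only prints the seven commands. The simulator should be shut down (or xq detached) first, since it holds the tap device open.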

Thank you for all of your input over the past couple of days, and thank you for all of your work on SIMH!

Hunter
