Hello - I was wondering if anyone is running the bonding network driver in active-backup mode with arp_validate?
I'm trying to deal with really crappy network switches from Dell, and I thought I could work around their faults in the short term by switching from link monitoring to ARP monitoring. But I just ran into a situation where it seems even ARP monitoring isn't enough.

This is my config for the system:

CentOS 5.2 base
2.6.18-128.1.10.el5 kernel (I think it's a 5.3 kernel)

Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
ARP Polling Interval (ms): 1000
ARP IP target/s (n.n.n.n form): 10.16.1.1, 10.16.1.254

Slave Interface: eth0
MII Status: up
Link Failure Count: 2
Permanent HW addr: 00:21:9b:8d:f1:0c

Slave Interface: eth1
MII Status: up
Link Failure Count: 1

Both IPs are supposed to be redundant: .1 is a pair of stacked piece-of-shit Dell gigabit switches, the other is a pair of F5 LTM load balancers.

The system had been running fine for about 9 days since I enabled this stuff, and then for some reason it could no longer talk to 10.16.1.254. Looking at tcpdump, I saw the system almost flooding the link with ARP requests for that address and getting maybe one in 10 answered. Communication with 10.16.1.1 was fine by contrast. 40 other systems on the same LAN communicate with both addresses constantly, so I know both were more or less OK; it was something with the switch itself. (I have seen behavior on multiple Dell switches where they decide to stop forwarding traffic, which is what prompted me to switch from link monitoring to ARP monitoring.)

At the time the system was running on eth0, so I brought that interface down and it immediately failed over to eth1 and things were OK again. I have since failed it back to eth0 and things are still fine.
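For what it's worth, the manual check and failover described above could be scripted as a stopgap. This is only a rough sketch under several assumptions, not something I have running: it assumes Python 3, the iputils arping (which exits 0 when a reply comes back), and that the bonding sysfs file active_slave is writable on this kernel. The function names and the idea of injecting probe_fn are made up for illustration.

```python
import subprocess

# ARP targets taken from the bond config above.
ARP_TARGETS = ["10.16.1.1", "10.16.1.254"]

def probe(target, iface, timeout_s=1):
    """Send one ARP request with arping; True if a reply came back.

    Assumes iputils arping (-c count, -w deadline, -I interface).
    """
    cmd = ["arping", "-c", "1", "-w", str(timeout_s), "-I", iface, target]
    return subprocess.call(cmd, stdout=subprocess.DEVNULL) == 0

def failed_targets(results):
    """Given {target: answered?}, return the targets that did not answer."""
    return [t for t, ok in results.items() if not ok]

def other_slave(active, slaves):
    """Pick the first slave that is not the currently active one."""
    for s in slaves:
        if s != active:
            return s
    raise ValueError("no backup slave available")

def force_failover(bond, new_active):
    """Ask the bonding driver to switch the active slave via sysfs.

    Assumes /sys/class/net/<bond>/bonding/active_slave exists and is writable.
    """
    with open("/sys/class/net/%s/bonding/active_slave" % bond, "w") as f:
        f.write(new_active)

def check_and_fail_over(bond, active, slaves, probe_fn=probe):
    """Strict policy: fail over if ANY target is unreachable via the active slave.

    Returns the list of targets that failed (empty if all answered).
    """
    results = {t: probe_fn(t, active) for t in ARP_TARGETS}
    bad = failed_targets(results)
    if bad:
        force_failover(bond, other_slave(active, slaves))
    return bad
```

Something like check_and_fail_over("bond0", "eth0", ["eth0", "eth1"]) run as root from cron or a loop would approximate the "fail if either target is down" behavior, though a single lost ARP reply would trigger a failover, so requiring a few consecutive misses is probably wise.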
What I'd like to do, if possible, is configure the bonding driver to fail the link if either of the ARP attempts fails. As far as I can see, the default is that if even one succeeds, the driver thinks the link is OK. Looking at the arp_validate option, it seems it only applies to the backup slaves, not to the active link. Is there anything I can do to make the driver fail over if even one of the two addresses is not responding?

I have noticed that in some cases the failover does work: checking several systems, they all have at least 1 link failure detected for each interface.

Longer term my goal is to replace the switches entirely; I've been pushing that for about a month now. It just goes to show you get what you pay for when you buy crap equipment (wasn't my idea), sigh.

thanks

nate