Hi All,

We are using dual-ported HCAs, with each port connected to a different IB switch, so that we can tolerate the failure of either switch. We are trying to cut down the time it takes for traffic (TCP and RDS) to resume when an IB switch fails and the hosts fail over from one port to the other.

We have the bonding driver configured in active-backup mode and set up to send 100 gratuitous ARPs at 100 ms intervals whenever there is a failover. In most cases, traffic resumes within a few seconds of a failover because these gratuitous ARPs update all the nodes with the new IP:GID mapping.
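
For anyone wanting to compare settings, here is a minimal Python sketch of how the values described above could be read back on a host. It assumes the bond is named bond0 (a placeholder, not necessarily our interface name) and that the bonding driver exposes the mode, miimon, num_grat_arp and active_slave attributes under /sys/class/net/<bond>/bonding; any attribute that is missing on a given kernel is reported as None. The helper announce_window_seconds is purely illustrative arithmetic: 100 ARPs at 100 ms apart cover roughly 10 seconds.

```python
from pathlib import Path


def read_bond_settings(bond="bond0", sysfs_root="/sys/class/net"):
    """Return selected bonding attributes as a dict of strings (or None).

    `bond` and `sysfs_root` are assumptions; adjust them for the host.
    """
    base = Path(sysfs_root) / bond / "bonding"
    settings = {}
    for name in ("mode", "miimon", "num_grat_arp", "active_slave"):
        attr = base / name
        # An attribute may be absent on some kernels: record None, do not fail.
        settings[name] = attr.read_text().strip() if attr.is_file() else None
    return settings


def announce_window_seconds(count, interval_ms):
    """Approximate time covered by `count` ARPs sent `interval_ms` apart."""
    return count * interval_ms / 1000.0


if __name__ == "__main__":
    print(read_bond_settings())
    print(announce_window_seconds(100, 100))
```

The point of the arithmetic is that the announcement window (about 10 seconds) is much shorter than the 40+ second recovery seen in the bad cases, so a node that misses the whole window falls back on its own timeouts.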

The problem we are seeing is that sometimes one or more nodes on the fabric do not receive even one of these gratuitous ARPs, and re-establishing communication with those nodes takes much longer (over 40 seconds) because it then depends on various ARP cache timeouts. Does anyone know why all of these gratuitous ARPs might be lost?
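
To check whether a given node really sees none of the gratuitous ARPs, one option is to capture ARP traffic on that node and classify it. The Python sketch below covers only the classification step and is an illustration, not something we run in production. It parses a raw ARP payload (without any link-layer header) and treats a packet as gratuitous when the sender and target IPv4 addresses are equal. It assumes the IPoIB ARP layout from RFC 4391: hardware type 32 and a 20-byte hardware address made of 1 reserved/flags byte, a 3-byte QPN and a 16-byte GID. The function names are hypothetical.

```python
import socket
import struct

ARPHRD_INFINIBAND = 32   # ARP hardware type used by IPoIB
ETH_P_IP = 0x0800        # protocol type for IPv4


def parse_arp(payload):
    """Split a raw ARP payload into its fields; raises ValueError if short."""
    if len(payload) < 8:
        raise ValueError("ARP header truncated")
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", payload[:8])
    if len(payload) < 8 + 2 * (hlen + plen):
        raise ValueError("ARP body truncated")
    off = 8
    fields = {"htype": htype, "ptype": ptype, "oper": oper}
    # Address order on the wire: sender hw, sender proto, target hw, target proto.
    for name, size in (("sha", hlen), ("spa", plen), ("tha", hlen), ("tpa", plen)):
        fields[name] = payload[off:off + size]
        off += size
    return fields


def is_gratuitous(arp):
    """True when the IPv4 sender and target addresses are identical."""
    return (arp["ptype"] == ETH_P_IP and len(arp["spa"]) == 4
            and arp["spa"] == arp["tpa"])


def sender_ip_and_gid(arp):
    """Return (sender IP, sender GID) as strings for an IPoIB ARP packet."""
    if arp["htype"] != ARPHRD_INFINIBAND or len(arp["sha"]) != 20:
        raise ValueError("not an IPoIB ARP packet")
    gid = arp["sha"][4:20]  # skip 1 reserved/flags byte and the 3-byte QPN
    return socket.inet_ntoa(arp["spa"]), socket.inet_ntop(socket.AF_INET6, gid)
```

Logging the output of sender_ip_and_gid for every gratuitous packet on the affected node during a failover would show whether the announcements never arrive or arrive and are ignored.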

Besides the gratuitous arp settings, are there any other tunables to look at to minimize the time it takes for IPoIB traffic to resume?

- Sumeet
