Brian J. Murrell wrote:
> On Sun, 2010-01-24 at 22:54 -0700, Andreas Dilger wrote:
>> If they are call traces due to the watchdog timer, then this is somewhat
>> expected for extremely high load.
>
> Andreas,
>
> Do you know, does adaptive timeouts take care of setting the timeout
> appropriately on watchdogs?
>
I don't think this is quite what you are asking, but here are some details on our setup.

We have a mixture of 1.6.7.2 clients and 1.8.1.1 clients. The 1.6.7.2 clients were not using adaptive timeouts when the problem occurred[1]. At least one of the 1.6 machines is regularly swamped with network traffic, which leads to packet loss.

It was 40 of the 1.8.1.1 clients running updatedb that caused the problem.

Chris

[1] One machine is the interface to the outside world and runs 1.6.7.2. I see packet loss to this machine at times and have observed Lustre hanging for a while. I suspect the problem is that it is occasionally overloaded with network packets; Lustre packets are then lost (probably at the router), followed by a timeout and recovery. I've now enabled adaptive timeouts on this machine and will install a 10GigE card too.