The maximum amount of time that recovery will run is controlled by "at_max".  
The default is 600s (10 mins), but on my 2-client home cluster (with a 
relatively light load) the recovery is usually finished in 10s or less. 

You can reduce the timeout based on your typical recovery time. Note that 
recovery should finish sooner even without changing at_max, as long as all of 
the clients are available (as they should be). 
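
As a rough illustration of picking a smaller value, here is a minimal Python 
sketch that derives a candidate at_max from observed recovery times. The 
function names, the 3x safety margin, and the 30s floor are all assumptions 
made for the example, not Lustre recommendations. The emitted command assumes 
at_max is settable through "lctl set_param" on your Lustre version, so check 
that against your manual before using it.

```python
import math

DEFAULT_AT_MAX = 600  # seconds; the default quoted above


def suggest_at_max(observed_recovery_secs, margin=3.0, floor=30):
    """Suggest an at_max value in seconds (hypothetical helper).

    margin and floor are illustrative assumptions, not Lustre defaults.
    """
    if not observed_recovery_secs:
        # No measurements yet: stay on the default.
        return DEFAULT_AT_MAX
    # Scale the slowest observed recovery by the safety margin.
    candidate = math.ceil(max(observed_recovery_secs) * margin)
    # Never go below the floor or above the default.
    return max(floor, min(candidate, DEFAULT_AT_MAX))


def lctl_command(at_max):
    """Build the command string; assumes at_max is exposed via lctl set_param."""
    return f"lctl set_param at_max={at_max}"


if __name__ == "__main__":
    # Example: three recoveries that each took about 10s or less.
    print(lctl_command(suggest_at_max([8, 10, 7])))
```

The sketch only prints the command and does not run it, so the value can be 
reviewed before it is applied to a live system.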

Cheers, Andreas

> On Aug 4, 2022, at 22:37, Christian Kuntz <c.ku...@opendrives.com> wrote:
> 
> 
> Hello all,
> 
> I'm wondering if there is any way to tune the maximum amount of time that 
> Lustre will use for a recovery window in the event that imperative recovery 
> fails due to the failover of an MGS. On MGS failover, we appear to hit a 
> default timeout of around 6 minutes that seems to be unavoidable. We're at a 
> scale of less than 10 total nodes, so it seems that this timeout could safely 
> be made much shorter.
> 
> I understand that I'm approaching an unsafe/risky situation and asking for it 
> to be made more unsafe, but we'd like startup after a total cluster failure 
> to be as fast as possible (within reason, of course). 
> Alternatively, any way to manually end the recovery window would be 
> appreciated.
> 
> Cheers, and thanks for your attention,
> Christian Kuntz
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org