Thank you so much, that has puzzled me for some time now :).

From: Patrick Farrell <pfarr...@ddn.com>
Date: Tuesday, 14 March 2023 at 14:36
To: lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>, Marc 
O'Brien <marc.obr...@cruk.cam.ac.uk>
Subject: Re: Question regarding user access during recovery and journal replay

Marc,



[Re-posting to the list...]



No, it’s fine to have interaction during those times. The system is designed to 
do that work online.  Depending on what you’re trying to do and what you’re 
accessing, some client operations will experience delays, but that’s it.  For 
example, during failover/recovery for a particular OST or MDT, no new IO to 
that target will complete.  But the user programs will just wait - it’s safe to 
leave them running.



So recovery, etc., will show up to users as delays in some requests, but it’s 
safe to run with users accessing the system.
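
If you want to see when a target is still in recovery (and so when users are 
likely to notice those delays), a rough sketch along these lines may help. It 
assumes you have captured the text printed on a server by 
`lctl get_param obdfilter.*.recovery_status` (or `mdt.*.recovery_status`) and 
that it consists of `key: value` lines. The sample text and the helper names 
below are made up for illustration, not taken from a real system.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def is_recovering(fields):
    """True while the target reports that recovery is still in progress."""
    return fields.get("status", "").upper() == "RECOVERING"


# Illustrative sample only; real output has more fields and other values.
SAMPLE = """\
status: RECOVERING
time_remaining: 120
connected_clients: 5/8
"""

if __name__ == "__main__":
    fields = parse_recovery_status(SAMPLE)
    if is_recovering(fields):
        print("Target still recovering; client IO to it will wait.")
        print("Clients reconnected:", fields.get("connected_clients", "unknown"))
    else:
        print("Recovery status:", fields.get("status", "unknown"))
```

This only reports state. Nothing needs to be done to protect running jobs, 
since IO to the recovering target simply waits until recovery finishes.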



Regards,

Patrick

________________________________
From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Marc O'Brien via lustre-discuss <lustre-discuss@lists.lustre.org>
Sent: Tuesday, March 14, 2023 7:24 AM
To: lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] Question regarding user access during recovery and 
journal replay


Hi,

When I was first taught some Lustre file system administration, it was stressed 
that when recovering a Lustre file system and while the journal replay was 
occurring on each host, there should be no user interaction with the file 
system. Any recovery was done with cluster access denied to HPC users, or when 
the cluster was deemed to be quiescent. This seemed to make sense as during 
journal replay the file system is in R/W state, but the distributed file system 
may not have reached a stable state. We now have multiple Lustre file systems 
(2 Ext4 based and 1 ZFS based) and evicting users or finding a quiescent time 
is problematic (luckily there are maintenance windows for the routine stuff).

I have searched online and have yet to see in print that there should be no 
user interaction with Lustre during recovery or journal replay (I may have 
missed it).

So my question is: is the "no cluster user interaction during recovery and 
journal replay" restriction actually a thing?

Thanks in advance for any enlightenment :)

Marc


_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org