A quick follow-up.  I thought an lfsck would only clean up (i.e., remove 
orphaned MDT and OST objects), but it appears it might have a good shot at 
actually repairing the file system, specifically by recreating the MDT objects 
with the --create-mdtobj option.  We have started this command:

[root@hpfs-fsl-mds1 ~]# lctl lfsck_start -M scratch-MDT0000 --dryrun on 
--create-mdtobj on

And after running for about an hour we are already seeing this from the query:

layout_repaired: 4645105
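
For reference, we have been polling status with the standard lfsck query 
interfaces, roughly the following (with our MDT name filled in); the 
layout_repaired counter above is from the query command:

  lctl lfsck_query -M scratch-MDT0000
  lctl get_param -n mdd.scratch-MDT0000.lfsck_layout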

Can anyone confirm this will work for our situation – i.e. repair the metadata 
for the OST objects that were orphaned when our metadata got reverted?

From: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" 
<darby.vicke...@nasa.gov>
Date: Tuesday, June 21, 2022 at 5:27 PM
To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Help with recovery of data

Hi everyone,

We ran into a problem with our lustre filesystem this weekend and could use a 
sanity check and/or advice on recovery.

We are running on CentOS 7.9, ZFS 2.1.4, and Lustre 2.14.  We are using ZFS 
OSTs but an ldiskfs MDT (for better MDT performance).  For various reasons, 
the ldiskfs is built on a ZFS volume (zvol).  Every night we (intend to) back 
up the metadata by snapshotting the zvol, mounting the MDT via ldiskfs, 
tarring up the contents, then unmounting and removing the ZFS snapshot.  On 
Sunday (6/19 at about 4 pm), the metadata server crashed.  It came back up 
fine, but users started reporting many missing files and directories today 
(6/21): everything since about February 9th is gone.  After quite a bit of 
investigation, it looks like the MDT got rolled back to a snapshot of the 
metadata from February.

[root@hpfs-fsl-mds1 ~]# zfs list -t snap mds1-0/meta-scratch
NAME                       USED  AVAIL     REFER  MOUNTPOINT
mds1-0/meta-scratch@snap  52.3G      -     1.34T  -
[root@hpfs-fsl-mds1 ~]# zfs get all mds1-0/meta-scratch@snap | grep creation
mds1-0/meta-scratch@snap  creation              Thu Feb 10  3:35 2022          -
[root@hpfs-fsl-mds1 ~]#
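
For context, the nightly backup flow is roughly the following.  This is a 
simplified sketch rather than our actual script; $SNAP_DEV and /mnt/mdt-snap 
stand in for the snapshot's block device and a scratch mountpoint, and details 
such as the extra tar options for preserving Lustre xattrs are omitted.

  zfs snapshot mds1-0/meta-scratch@snap
  mount -t ldiskfs -o ro $SNAP_DEV /mnt/mdt-snap
  tar -cf /internal/ldiskfs_backups/mds1-0_meta-scratch-$(date +%Y_%m_%d).tar \
      -C /mnt/mdt-snap .
  umount /mnt/mdt-snap
  zfs destroy mds1-0/meta-scratch@snap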

We discovered that our MDT backups have been stalled since February: the first 
step of the backup is to create mds1-0/meta-scratch@snap, and that snapshot 
already existed, so the script was erroring out because the previous snapshot 
was still in place.  We have rebooted this MDS several times (gracefully) 
since February with no issues but, apparently, whatever happened in the server 
crash on Sunday caused the MDT to revert to the February data.  So, in theory, 
the data on the OSTs is still there; we are just missing the metadata due to 
the ZFS glitch.

So, the first question: is anyone familiar with this failure mode of ZFS, or 
is there a way to recover from it?  I think it's unlikely there are any direct 
ZFS recovery options, but I wanted to ask.
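
One thing we can at least do on the ZFS side is grep the pool's command 
history for anything that looks like an explicit rollback or destroy around 
the time of the crash.  A rough check, assuming zpool history reaches back far 
enough to be useful:

  zpool history -il mds1-0 | grep -Ei 'meta-scratch|rollback'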

Obviously, MDT backups would be our best recovery option, but since this was 
all caused by the backup script stalling (and the subsequent rollback to the 
last snapshot), our backups are the same age as the current data on the 
filesystem.

[root@hpfs-fsl-mds1 ~]# ls -lrt /internal/ldiskfs_backups/
total 629789909
-rw-r--r-- 1 root root         1657 Apr 30  2019 process.txt
-rw-r--r-- 1 root root 445317560320 Jan 25 15:36 
mds1-0_meta-scratch-2022_01_25.tar
-rw-r--r-- 1 root root 446230016000 Jan 26 15:31 
mds1-0_meta-scratch-2022_01_26.tar
-rw-r--r-- 1 root root 448093808640 Jan 27 15:46 
mds1-0_meta-scratch-2022_01_27.tar
-rw-r--r-- 1 root root 440368783360 Jan 28 16:56 
mds1-0_meta-scratch-2022_01_28.tar
-rw-r--r-- 1 root root 442342113280 Jan 29 14:45 
mds1-0_meta-scratch-2022_01_29.tar
-rw-r--r-- 1 root root 442922567680 Jan 30 15:03 
mds1-0_meta-scratch-2022_01_30.tar
-rw-r--r-- 1 root root 443076515840 Jan 31 15:17 
mds1-0_meta-scratch-2022_01_31.tar
-rw-r--r-- 1 root root 444589025280 Feb  1 15:11 
mds1-0_meta-scratch-2022_02_01.tar
-rw-r--r-- 1 root root 443741409280 Feb  2 15:17 
mds1-0_meta-scratch-2022_02_02.tar
-rw-r--r-- 1 root root 448209367040 Feb  3 15:24 
mds1-0_meta-scratch-2022_02_03.tar
-rw-r--r-- 1 root root 453777090560 Feb  4 15:55 
mds1-0_meta-scratch-2022_02_04.tar
-rw-r--r-- 1 root root 454211307520 Feb  5 14:37 
mds1-0_meta-scratch-2022_02_05.tar
-rw-r--r-- 1 root root 454619084800 Feb  6 14:30 
mds1-0_meta-scratch-2022_02_06.tar
-rw-r--r-- 1 root root 455459276800 Feb  7 15:26 
mds1-0_meta-scratch-2022_02_07.tar
-rw-r--r-- 1 root root 457470945280 Feb  8 15:07 
mds1-0_meta-scratch-2022_02_08.tar
-rw-r--r-- 1 root root 460592517120 Feb  9 15:21 
mds1-0_meta-scratch-2022_02_09.tar
-rw-r--r-- 1 root root 332377712640 Feb 10 12:04 
mds1-0_meta-scratch-2022_02_10.tar
[root@hpfs-fsl-mds1 ~]#


Yes, I know, we will put in some monitoring for this in the future...
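
Probably something as simple as a cron job that complains when the newest tar 
in the backup directory is too old.  An illustrative sketch only; the two-day 
threshold and the mail-to-root alerting are placeholders:

  # warn if the newest MDT backup tar is missing or older than ~2 days
  latest=$(ls -t /internal/ldiskfs_backups/mds1-0_meta-scratch-*.tar 2>/dev/null | head -1)
  if [ -z "$latest" ] || [ -n "$(find "$latest" -mtime +2)" ]; then
      echo "WARNING: newest MDT backup is ${latest:-missing}" | mail -s "MDT backup check" root
  fi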

Fortunately, we also have a robinhood system syncing with this file system.  
The sync is fairly up to date: the logs say it last ran a few days ago, and 
I've used rbh-find to find some files that were created in the last few days 
(an example of that kind of spot check is shown after the command below).  So 
I think we have a shot at recovery.  We have this command running now to see 
what it will do:

rbh-diff --apply=fs --dry-run --scan=/scratch-lustre
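
The rbh-find spot checks mentioned above were along these lines.  This is a 
rough example rather than exact syntax, assuming the usual find-style 
predicates are available in our rbh-find build; the path is just an arbitrary 
recently active directory:

  rbh-find /scratch-lustre/some/active/project -type f -mtime -3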

But the rbh-diff run has already been going for a long time with no output.  
Our file system is fairly large:


[root@hpfs-fsl-lmon0 ~]# lfs df -h /scratch-lustre
UUID                       bytes        Used   Available Use% Mounted on
scratch-MDT0000_UUID     1011.8G       82.6G      826.7G  10% 
/scratch-lustre[MDT:0]
scratch-OST0000_UUID       49.6T       16.2T       33.4T  33% 
/scratch-lustre[OST:0]
scratch-OST0001_UUID       49.6T       17.4T       32.3T  35% 
/scratch-lustre[OST:1]
scratch-OST0002_UUID       49.6T       16.8T       32.8T  34% 
/scratch-lustre[OST:2]
scratch-OST0003_UUID       49.6T       17.2T       32.4T  35% 
/scratch-lustre[OST:3]
scratch-OST0004_UUID       49.6T       16.7T       32.9T  34% 
/scratch-lustre[OST:4]
scratch-OST0005_UUID       49.6T       16.9T       32.7T  35% 
/scratch-lustre[OST:5]
scratch-OST0006_UUID       49.6T       16.4T       33.2T  34% 
/scratch-lustre[OST:6]
scratch-OST0007_UUID       49.6T       15.6T       34.0T  32% 
/scratch-lustre[OST:7]
scratch-OST0008_UUID       49.6T       16.2T       33.4T  33% 
/scratch-lustre[OST:8]
scratch-OST0009_UUID       49.6T       16.4T       33.2T  34% 
/scratch-lustre[OST:9]
scratch-OST000a_UUID       49.6T       15.8T       33.8T  32% 
/scratch-lustre[OST:10]
scratch-OST000b_UUID       49.6T       17.4T       32.2T  36% 
/scratch-lustre[OST:11]
scratch-OST000c_UUID       49.6T       17.1T       32.5T  35% 
/scratch-lustre[OST:12]
scratch-OST000d_UUID       49.6T       15.8T       33.8T  32% 
/scratch-lustre[OST:13]
scratch-OST000e_UUID       49.6T       15.7T       33.9T  32% 
/scratch-lustre[OST:14]
scratch-OST000f_UUID       49.6T       16.4T       33.2T  33% 
/scratch-lustre[OST:15]
scratch-OST0010_UUID       49.6T       15.5T       34.1T  32% 
/scratch-lustre[OST:16]
scratch-OST0011_UUID       49.6T       16.6T       33.1T  34% 
/scratch-lustre[OST:17]
scratch-OST0012_UUID       49.6T       16.4T       33.2T  34% 
/scratch-lustre[OST:18]
scratch-OST0013_UUID       48.4T       16.3T       32.1T  34% 
/scratch-lustre[OST:19]
scratch-OST0014_UUID       49.6T       15.1T       34.5T  31% 
/scratch-lustre[OST:20]
scratch-OST0015_UUID       49.6T       16.0T       33.6T  33% 
/scratch-lustre[OST:21]
scratch-OST0016_UUID       49.6T       15.2T       34.4T  31% 
/scratch-lustre[OST:22]
scratch-OST0017_UUID       49.6T       16.1T       33.5T  33% 
/scratch-lustre[OST:23]

filesystem_summary:         1.2P      391.1T      798.3T  33% /scratch-lustre

[root@hpfs-fsl-lmon0 ~]#


We currently still have the robinhood process running (syncing the filesystem 
and the SQL DB), but we've unmounted the Lustre filesystem from all user-facing 
machines, so there should be no further changes to the filesystem.

Does anyone have experience recovering from this kind of situation with 
robinhood?

FWIW, the SQL DB that robinhood lives on is also on a ZFS filesystem that we 
snapshot.  We don't have much history, and it's unclear where the current RBH 
scans are relative to the data loss, but it's likely that the SQL DB in the 
oldest snapshot below would not be affected by the 6/19 reboot event.


[root@hpfs-fsl-lmon0 ~]# zfs list -t snap
NAME                                           USED  AVAIL  REFER  MOUNTPOINT
lmon0-0/mysql@zincrsend_2022-06-20-16:01:01   26.8G      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-17:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-18:01:01    557K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-19:01:01    558K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-20:01:01    558K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-21:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-22:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-23:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-00:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-01:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-02:01:02    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-03:01:01    561K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-04:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-05:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-06:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-07:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-08:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-09:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-10:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-11:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-12:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-13:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-14:01:01    560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-15:01:01    518K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-16:01:01    512K      -   105G  -
[root@hpfs-fsl-lmon0 ~]#
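
If it turns out we need to compare against an older scan, the idea would be to 
clone one of those snapshots and point a throwaway MySQL instance at the clone 
rather than touching the live dataset.  A sketch, with the clone name and 
mountpoint made up:

  zfs clone -o mountpoint=/var/lib/mysql-rbh-recovery \
      lmon0-0/mysql@zincrsend_2022-06-20-16:01:01 lmon0-0/mysql-rbh-recovery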


Is RBH our best recovery option?

Would lfsck recover from this situation?  I don’t think so...

Any advice on recovery would be appreciated.

Thanks,
Darby
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org