A quick follow-up. I thought an lfsck would only clean up (i.e. remove orphaned MDT and OST objects), but it appears this might have a good shot at repairing the file system – specifically, recreating the MDT objects with the --create-mdtobj option. We have started this command:

[root@hpfs-fsl-mds1 ~]# lctl lfsck_start -M scratch-MDT0000 --dryrun on --create-mdtobj on

And after running for about an hour we are already seeing this from the query:

layout_repaired: 4645105

Can anyone confirm this will work for our situation – i.e. repair the metadata for the OST objects that were orphaned when our metadata got reverted?

From: "Vicker, Darby J. (JSC-EG111)[Jacobs Technology, Inc.]" <darby.vicke...@nasa.gov>
Date: Tuesday, June 21, 2022 at 5:27 PM
To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Help with recovery of data

Hi everyone,

We ran into a problem with our lustre filesystem this weekend and could use a sanity check and/or advice on recovery. We are running CentOS 7.9, ZFS 2.1.4 and Lustre 2.14. We are using ZFS OSTs and an ldiskfs MDT (for better MDT performance). For various reasons, the ldiskfs is built on a ZFS zvol. Every night we (intend to) back up the metadata by ZFS-snapshotting the zvol, mounting the MDT via ldiskfs, tarring up the contents, then unmounting and removing the ZFS snapshot.

On Sunday (6/19 at about 4 pm), the metadata server crashed. It came back up fine, but users started reporting many missing files and directories today (6/21) – everything since about February 9th is gone. After quite a bit of investigation, it looks like the MDT got rolled back to a snapshot of the metadata from February.

[root@hpfs-fsl-mds1 ~]# zfs list -t snap mds1-0/meta-scratch
NAME                      USED  AVAIL  REFER  MOUNTPOINT
mds1-0/meta-scratch@snap  52.3G     -  1.34T  -
[root@hpfs-fsl-mds1 ~]# zfs get all mds1-0/meta-scratch@snap | grep creation
mds1-0/meta-scratch@snap  creation  Thu Feb 10  3:35 2022  -
[root@hpfs-fsl-mds1 ~]#

We discovered that our MDT backups have been stalled since February: the first step is to create mds1-0/meta-scratch@snap, and that dataset already existed. The script had been erroring out because the existing snapshot was still in place.
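For what it's worth, here is a hedged sketch of how a nightly loop like the one described above could avoid this silent-stall failure mode: use a dated snapshot name and a trap so the snapshot is destroyed even when a later step fails. The dataset, mountpoint, and backup paths are assumptions modeled on the names in this thread, not our actual script:

    #!/bin/sh
    set -eu
    # Assumed names, based on the output shown in this message.
    DATASET="mds1-0/meta-scratch"
    MNT="/mnt/mdt-backup"
    OUT="/internal/ldiskfs_backups/mds1-0_meta-scratch-$(date +%Y_%m_%d).tar"
    SNAP="${DATASET}@backup-$(date +%Y_%m_%d)"

    # Remove the snapshot on exit, success or failure, so a stale
    # snapshot can never block the next night's run.
    cleanup() {
        umount "$MNT" 2>/dev/null || true
        zfs destroy "$SNAP" 2>/dev/null || true
    }
    trap cleanup EXIT

    zfs snapshot "$SNAP"
    # Mounting the snapshot's block device requires snapdev=visible on
    # the zvol so /dev/zvol/<pool>/<dataset>@<snap> exists.
    mount -t ldiskfs -o ro "/dev/zvol/${SNAP}" "$MNT"
    tar -cf "$OUT" -C "$MNT" .

A `zfs destroy` that runs unconditionally on exit trades a tiny risk (losing the in-progress snapshot on failure) for never wedging the whole backup chain the way the `@snap` collision did.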
We have rebooted this MDS several times (gracefully) since February with no issues but, apparently, whatever happened in the server crash on Sunday caused the MDT to revert to the February data. So, in theory, the data on the OSTs is still there; we are just missing the metadata due to the ZFS glitch.

So the first question: is anyone familiar with this failure mode of ZFS, or is there a way to recover from it? I think it's unlikely there are any direct ZFS recovery options but wanted to ask. Obviously, MDT backups would be our best recovery option, but since this was all caused by the backup script stalling (and the subsequent rollback to the last snapshot), our backups are the same age as the current data on the filesystem.

[root@hpfs-fsl-mds1 ~]# ls -lrt /internal/ldiskfs_backups/
total 629789909
-rw-r--r-- 1 root root         1657 Apr 30  2019 process.txt
-rw-r--r-- 1 root root 445317560320 Jan 25 15:36 mds1-0_meta-scratch-2022_01_25.tar
-rw-r--r-- 1 root root 446230016000 Jan 26 15:31 mds1-0_meta-scratch-2022_01_26.tar
-rw-r--r-- 1 root root 448093808640 Jan 27 15:46 mds1-0_meta-scratch-2022_01_27.tar
-rw-r--r-- 1 root root 440368783360 Jan 28 16:56 mds1-0_meta-scratch-2022_01_28.tar
-rw-r--r-- 1 root root 442342113280 Jan 29 14:45 mds1-0_meta-scratch-2022_01_29.tar
-rw-r--r-- 1 root root 442922567680 Jan 30 15:03 mds1-0_meta-scratch-2022_01_30.tar
-rw-r--r-- 1 root root 443076515840 Jan 31 15:17 mds1-0_meta-scratch-2022_01_31.tar
-rw-r--r-- 1 root root 444589025280 Feb  1 15:11 mds1-0_meta-scratch-2022_02_01.tar
-rw-r--r-- 1 root root 443741409280 Feb  2 15:17 mds1-0_meta-scratch-2022_02_02.tar
-rw-r--r-- 1 root root 448209367040 Feb  3 15:24 mds1-0_meta-scratch-2022_02_03.tar
-rw-r--r-- 1 root root 453777090560 Feb  4 15:55 mds1-0_meta-scratch-2022_02_04.tar
-rw-r--r-- 1 root root 454211307520 Feb  5 14:37 mds1-0_meta-scratch-2022_02_05.tar
-rw-r--r-- 1 root root 454619084800 Feb  6 14:30 mds1-0_meta-scratch-2022_02_06.tar
-rw-r--r-- 1 root root 455459276800 Feb  7 15:26 mds1-0_meta-scratch-2022_02_07.tar
-rw-r--r-- 1 root root 457470945280 Feb  8 15:07 mds1-0_meta-scratch-2022_02_08.tar
-rw-r--r-- 1 root root 460592517120 Feb  9 15:21 mds1-0_meta-scratch-2022_02_09.tar
-rw-r--r-- 1 root root 332377712640 Feb 10 12:04 mds1-0_meta-scratch-2022_02_10.tar
[root@hpfs-fsl-mds1 ~]#

Yes, I know, we will put in some monitoring for this in the future...

Fortunately, we also have a robinhood system syncing with this file system. The sync is fairly up to date – the logs say a few days ago, and I've used rbh-find to find some files that were created in the last few days. So I think we have a shot at recovery. We have this command running now to see what it will do:

rbh-diff --apply=fs --dry-run --scan=/scratch-lustre

But it has already been running a long time with no output. Our file system is fairly large:

[root@hpfs-fsl-lmon0 ~]# lfs df -h /scratch-lustre
UUID                   bytes    Used  Available  Use%  Mounted on
scratch-MDT0000_UUID 1011.8G   82.6G     826.7G   10%  /scratch-lustre[MDT:0]
scratch-OST0000_UUID   49.6T   16.2T      33.4T   33%  /scratch-lustre[OST:0]
scratch-OST0001_UUID   49.6T   17.4T      32.3T   35%  /scratch-lustre[OST:1]
scratch-OST0002_UUID   49.6T   16.8T      32.8T   34%  /scratch-lustre[OST:2]
scratch-OST0003_UUID   49.6T   17.2T      32.4T   35%  /scratch-lustre[OST:3]
scratch-OST0004_UUID   49.6T   16.7T      32.9T   34%  /scratch-lustre[OST:4]
scratch-OST0005_UUID   49.6T   16.9T      32.7T   35%  /scratch-lustre[OST:5]
scratch-OST0006_UUID   49.6T   16.4T      33.2T   34%  /scratch-lustre[OST:6]
scratch-OST0007_UUID   49.6T   15.6T      34.0T   32%  /scratch-lustre[OST:7]
scratch-OST0008_UUID   49.6T   16.2T      33.4T   33%  /scratch-lustre[OST:8]
scratch-OST0009_UUID   49.6T   16.4T      33.2T   34%  /scratch-lustre[OST:9]
scratch-OST000a_UUID   49.6T   15.8T      33.8T   32%  /scratch-lustre[OST:10]
scratch-OST000b_UUID   49.6T   17.4T      32.2T   36%  /scratch-lustre[OST:11]
scratch-OST000c_UUID   49.6T   17.1T      32.5T   35%  /scratch-lustre[OST:12]
scratch-OST000d_UUID   49.6T   15.8T      33.8T   32%  /scratch-lustre[OST:13]
scratch-OST000e_UUID   49.6T   15.7T      33.9T   32%  /scratch-lustre[OST:14]
scratch-OST000f_UUID   49.6T   16.4T      33.2T   33%  /scratch-lustre[OST:15]
scratch-OST0010_UUID   49.6T   15.5T      34.1T   32%  /scratch-lustre[OST:16]
scratch-OST0011_UUID   49.6T   16.6T      33.1T   34%  /scratch-lustre[OST:17]
scratch-OST0012_UUID   49.6T   16.4T      33.2T   34%  /scratch-lustre[OST:18]
scratch-OST0013_UUID   48.4T   16.3T      32.1T   34%  /scratch-lustre[OST:19]
scratch-OST0014_UUID   49.6T   15.1T      34.5T   31%  /scratch-lustre[OST:20]
scratch-OST0015_UUID   49.6T   16.0T      33.6T   33%  /scratch-lustre[OST:21]
scratch-OST0016_UUID   49.6T   15.2T      34.4T   31%  /scratch-lustre[OST:22]
scratch-OST0017_UUID   49.6T   16.1T      33.5T   33%  /scratch-lustre[OST:23]
filesystem_summary:     1.2P  391.1T     798.3T   33%  /scratch-lustre
[root@hpfs-fsl-lmon0 ~]#

We currently still have the robinhood process running (syncing the filesystem and the SQL DB), but we've unmounted the LFS from all user-facing machines, so there should be no further changes to the filesystem. Does anyone have experience recovering from this kind of situation with robinhood?

FWIW, the SQL DB that robinhood lives on is also on a ZFS filesystem that we also snapshot. But we don't keep much history, and it's unclear where the current RBH scans are (WRT the data loss). But it's likely the SQL DB in the oldest snapshot below would not be affected by the 6/19 reboot event.
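On the "we will put in some monitoring" point above: a minimal sketch of a staleness check that would have caught the stalled backup months ago. The dataset name and 48-hour threshold are assumptions; only the `zfs get -H -p -o value creation` call (parseable epoch output) is standard ZFS:

```shell
#!/bin/sh
# Hypothetical staleness check for the nightly MDT backup snapshot.
DATASET="mds1-0/meta-scratch"

# Pure helper: seconds elapsed between two epoch timestamps.
age_secs() {
    echo $(( $1 - $2 ))
}

# Report OK / STALE / MISSING for a snapshot against a max age in seconds.
check_snapshot() {
    snap="$1" max_age="$2"
    # 'zfs get -p' prints the creation property as epoch seconds.
    created=$(zfs get -H -p -o value creation "$snap" 2>/dev/null) || {
        echo "MISSING $snap"; return 2; }
    age=$(age_secs "$(date +%s)" "$created")
    if [ "$age" -gt "$max_age" ]; then
        echo "STALE $snap (${age}s old)"; return 1
    fi
    echo "OK $snap"
}

# Example invocation (left commented; needs a live ZFS pool):
# check_snapshot "${DATASET}@snap" $((48 * 3600))
```

Wiring the STALE/MISSING exit codes into cron mail or Nagios is left out; the point is just that snapshot age, not script exit status, is the thing worth alerting on.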
[root@hpfs-fsl-lmon0 ~]# zfs list -t snap
NAME                                          USED  AVAIL  REFER  MOUNTPOINT
lmon0-0/mysql@zincrsend_2022-06-20-16:01:01  26.8G      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-17:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-18:01:01   557K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-19:01:01   558K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-20:01:01   558K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-21:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-22:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-20-23:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-00:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-01:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-02:01:02   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-03:01:01   561K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-04:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-05:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-06:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-07:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-08:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-09:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-10:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-11:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-12:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-13:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-14:01:01   560K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-15:01:01   518K      -   105G  -
lmon0-0/mysql@zincrsend_2022-06-21-16:01:01   512K      -   105G  -
[root@hpfs-fsl-lmon0 ~]#

Is RBH our best recovery option? Would lfsck recover from this situation? I don't think so... Any advice on recovery would be appreciated.

Thanks,
Darby
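[Editor's note: on the lfsck question, a hedged sketch of the usual command sequence for driving and watching a layout lfsck, using the device name from this thread. The flags shown (-t layout, --dryrun, --create-mdtobj) are standard lctl-lfsck_start options; whether they fully repair this particular rollback is exactly the question being asked, not something this sketch asserts:

    # Preview what a layout scrub would repair without modifying anything
    lctl lfsck_start -M scratch-MDT0000 -t layout --dryrun on --create-mdtobj on

    # Watch progress and the repaired/failed counters while it runs
    lctl lfsck_query -M scratch-MDT0000
    lctl get_param -n mdd.scratch-MDT0000.lfsck_layout

    # If the dry-run counters look right, re-run for real
    lctl lfsck_stop -M scratch-MDT0000
    lctl lfsck_start -M scratch-MDT0000 -t layout --create-mdtobj on

One caveat worth noting: lfsck can reattach orphaned OST objects, but filenames and directory structure newer than the reverted MDT live only in the robinhood DB, so the two approaches are complementary rather than interchangeable.]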
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org