Re: [lustre-discuss] possible to read orphan ost objects on live filesystem?
On 2015/09/10, 6:54 PM, "Chris Hunter" wrote:

> We experienced file corruption on several OSTs. We proceeded through
> recovery using the e2fsck & ll_recover_lost_found_objs tools. Following
> these steps, e2fsck came out clean.
>
> The file corruption did not impact the MDT. The files were still
> referenced by the MDT. Accessing a file on a Lustre client (e.g. ls -l)
> would report the error "Cannot allocate memory".
>
> Following the OST recovery steps, we started removing the corrupt files
> via the "unlink" command on a Lustre client (the rm command would not
> remove the files).
>
> Now a dry-run e2fsck of the OST is reporting errors:
> "deleted/unused inodes" in Pass 2 (checking directory structure),
> "unattached inodes" in Pass 4 (checking reference counts), and
> "free block count wrong" in Pass 5 (checking group summary information).
>
> Are these e2fsck errors expected when unlinking files?

No, the "unlink" command just avoids the -ENOENT error that "rm" gets by
calling stat() on the file before trying to unlink it. This shouldn't
cause any errors on the OSTs, unless there is ongoing corruption from
the back-end storage.

Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
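The unlink-vs-rm distinction Andreas describes can be seen with ordinary
POSIX tools. A minimal illustration (plain local files, not
Lustre-specific; the temp-file names are just for the demo): rm stat()s
the path before removing it, so a file whose stat() fails looks
un-removable to rm, while unlink(1) issues the unlink(2) call directly.

```shell
# Illustration only: on a healthy file both paths succeed; the difference
# matters when stat() errors out (e.g. Lustre returning "Cannot allocate
# memory" for a file with damaged OST objects).
tmp=$(mktemp -d)
touch "$tmp/victim"

stat "$tmp/victim" > /dev/null   # the pre-check rm performs before unlinking
unlink "$tmp/victim"             # direct unlink(2), no stat() pre-check

ls "$tmp" | wc -l                # prints 0: the file is gone
rmdir "$tmp"
```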
Re: [lustre-discuss] possible to read orphan ost objects on live filesystem?
Hi,

We experienced file corruption on several OSTs. We proceeded through
recovery using the e2fsck & ll_recover_lost_found_objs tools. Following
these steps, e2fsck came out clean.

The file corruption did not impact the MDT. The files were still
referenced by the MDT. Accessing a file on a Lustre client (e.g. ls -l)
would report the error "Cannot allocate memory".

Following the OST recovery steps, we started removing the corrupt files
via the "unlink" command on a Lustre client (the rm command would not
remove the files).

Now a dry-run e2fsck of the OST is reporting errors:
"deleted/unused inodes" in Pass 2 (checking directory structure),
"unattached inodes" in Pass 4 (checking reference counts), and
"free block count wrong" in Pass 5 (checking group summary information).

Are these e2fsck errors expected when unlinking files?

thanks,
chris hunter
chris.hun...@yale.edu

On 09/03/2015 12:54 PM, Martin Hecht wrote:
> Hi Chris,
>
> On 09/02/2015 07:18 AM, Chris Hunter wrote:
>> Hi Andreas
>>
>> On 09/01/2015 07:22 PM, Dilger, Andreas wrote:
>>> On 2015/09/01, 7:59 AM, "lustre-discuss on behalf of Chris Hunter" wrote:
>>>> Hi Andreas,
>>>> Thanks for your help.
>>>>
>>>> If you have a striped Lustre file with "holes" (i.e. one chunk is
>>>> gone due to hardware failure, etc.), are the remaining file chunks
>>>> considered orphan objects?
>>
>> So when a Lustre striped file has a hole (e.g. a missing chunk due to
>> hardware failure), the remaining file chunks stay indefinitely on the
>> OSTs. Is there a way to reclaim the space occupied by these pieces
>> (after recovery of any usable data, etc.)?
>
> These remaining chunks still belong to the file (i.e. you have the
> metadata entry on the MDT and you see the file when Lustre is mounted).
> By removing the file you free up the space.
>
> In general there are two types of inconsistencies which may occur:
> orphan objects are objects which are NOT assigned to an entry on the
> MDT, i.e. chunks which do not belong to any file. These can be either
> pre-allocated chunks or chunks left over after a corruption of the
> metadata on the MDT.
>
> The other type of corruption is that you have a file where chunks are
> missing in between. This can happen when an OST gets corrupted. As long
> as the MDT is OK, you should be able to remove such a file. If in
> addition the MDT is also corrupted, you should first fix the MDT, and
> you might then only be able to unlink the file (which again might leave
> some orphan objects on the OSTs). lfsck should be able to remove them,
> depending on the Lustre version you are running...
>
> Another point: when an OST got corrupted, after having it repaired with
> e2fsck, you can mount it as ldiskfs and see if there are chunks in
> lost+found, and use the tool ll_recover_lost_found_objs to restore them
> to their original place. I believe these objects which e2fsck puts in
> lost+found are another kind of thing, usually not called "orphan
> objects". As I said, they usually can be easily recovered.
>
> Martin
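Martin's check-then-recover procedure can be sketched as a short shell
sequence. This is a hedged outline, not a definitive runbook: the device
and mount-point names are hypothetical, and DRY_RUN=echo makes every
command print instead of execute (clear it to run for real, with the OST
taken out of service first).

```shell
# Sketch of the OST check/repair/recover cycle described above.
DRY_RUN=echo
OST_DEV=/dev/mapper/ost0    # hypothetical OST block device
MNT=/mnt/ost0-ldiskfs       # hypothetical scratch mount point

# 1. Dry-run check first: -n answers "no" to every fix, device untouched
$DRY_RUN e2fsck -f -n "$OST_DEV"

# 2. Actual repair; disconnected objects end up in lost+found
$DRY_RUN e2fsck -f -y "$OST_DEV"

# 3. Mount the OST as plain ldiskfs and move recovered objects back to
#    their original object locations (keep the e2fsck log for reference)
$DRY_RUN mount -t ldiskfs "$OST_DEV" "$MNT"
$DRY_RUN ll_recover_lost_found_objs -d "$MNT/lost+found"
$DRY_RUN umount "$MNT"
```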
Re: [lustre-discuss] 1.8 client on 3.13.0 kernel
Thanks a lot for the info, a little more optimistic :-).

-Lewis
Re: [lustre-discuss] 1.8 client on 3.13.0 kernel
Lewis,

I did an upgrade from Lustre 1.8.6 to 2.4.3 on our servers, and for the
most part things went pretty well. I'll chime in on a couple of Martin's
points and mention a few other things.

> On Sep 10, 2015, at 9:30 AM, Martin Hecht wrote:
>
> In any case the file systems should be clean before starting the
> upgrade, so I would recommend running e2fsck on all targets and
> repairing them before starting the upgrade. We did so, but
> unfortunately our e2fsprogs were not really up to date, and after our
> Lustre upgrade a lot of fixes for e2fsprogs were committed to
> Whamcloud's e2fsprogs git. So probably some errors on the file systems
> were still present, but unnoticed, when we did the upgrade.

This is a very important point. While I didn't run e2fsck before the
upgrade (but maybe I should have), I made sure to install the latest
e2fsprogs.

> Lustre 2 introduces the FID (which is something like an inode number;
> Lustre 1.8 used the inode number of the underlying ldiskfs, but with
> the possibility of having several MDTs in one file system a
> replacement was needed). The FID is stored in the inode, but it can
> also be arranged that the FIDs are stored in the directory entry,
> which makes lookups faster, especially when there are many files in a
> directory. However, there were bugs in the code that takes care of
> adding the FID to the directory entry when the file system is
> converted from 1.8 to 2.x. So I would recommend using a version in
> which these bugs are solved. We went to 2.4.1 at that time. By default
> this fid_in_dirent feature is not automatically enabled; however, this
> is the only point where a performance boost may be expected... so we
> took the risk of enabling this... and ran into some bugs.

Enabling fid_in_dirent prevents you from backing out of the upgrade. In
theory, if you upgraded to Lustre 2.x without enabling fid_in_dirent,
you could always revert back to Lustre 1.8. We tried this on a test
system, and the downgrade seemed to work. However, this was a
small-scale test and I have never tried it on a production file system.
But if you want to minimize possible complications, you could always
leave this disabled for a while after the upgrade, and then, if things
are going well, enable it later on.

> LU-4504 quota out of sync: turn off quota, run e2fsck, turn it on
> again - I believe that's something which must be done quite often
> anyhow, because there is no quotacheck anymore. It's run in the
> background when enabling quotas, but file systems have to be unmounted
> for this.

We didn't exactly hit this bug, but I will mention that we have had a
couple of instances where e2fsck complained about problems on an OST,
and it turned out that we had to disable and re-enable quotas on the
OST to correct the issue.

> LU-4743: We had to remove the CATALOGS file on another file system
> (otherwise the MDT wouldn't mount)

We hit this problem. Someone I know had to do a Lustre upgrade, and
they suggested that I apply a patch for LU-4708 (which I did). But if
you upgrade to Lustre 2.5.2 or later, that patch should already be
included.

My only other advice is to test as much as possible prior to the
upgrade. If you have a little test hardware, install the same Lustre
1.8 version you are currently running in production and then try
upgrading that to the new Lustre version. I think preparation is the
key. I think I spent about 2 months reading about upgrade procedures,
talking with others who have upgraded, reading JIRA bug reports, and
running tests on hardware.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
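The disable/re-enable quota cycle mentioned for LU-4504-style problems
can be sketched as follows. This is a hedged outline under assumptions:
the file system name and OST device are hypothetical, the
lctl conf_param quota syntax assumes Lustre 2.4 or later, and
DRY_RUN=echo keeps the commands from touching anything.

```shell
# Sketch: clear quota inconsistencies on an OST by cycling enforcement
# off, letting e2fsck repair the target, then re-enabling enforcement.
DRY_RUN=echo
FSNAME=lustre               # hypothetical file system name
OST_DEV=/dev/mapper/ost12   # hypothetical OST block device

# On the MGS: switch off quota enforcement for OSTs (Lustre >= 2.4 style)
$DRY_RUN lctl conf_param "$FSNAME.quota.ost=none"

# With the OST unmounted, let e2fsck fix its accounting
$DRY_RUN e2fsck -f -y "$OST_DEV"

# Re-enable user/group quota enforcement; space accounting is rebuilt in
# the background -- there is no separate quotacheck step anymore
$DRY_RUN lctl conf_param "$FSNAME.quota.ost=ug"
```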
Re: [lustre-discuss] 1.8 client on 3.13.0 kernel
Thanks very much for this. Will let you know how we come out once we
absorb this and get the courage to pull the trigger.

-lewis
Re: [lustre-discuss] 1.8 client on 3.13.0 kernel
Hi Lewis,

it's difficult to tell how much data loss was actually related to the
Lustre upgrade itself. We have upgraded 6 file systems, and we had to do
it more or less in one shot, because at that time they were using a
common MGS server. All servers of one file system must be on the same
level (at least for the major upgrade from 1.8 to 2.x; there is rolling
upgrade for minor versions in the Lustre 2 branch now, but I have no
experience with that).

In any case the file systems should be clean before starting the
upgrade, so I would recommend running e2fsck on all targets and
repairing them before starting the upgrade. We did so, but unfortunately
our e2fsprogs were not really up to date, and after our Lustre upgrade a
lot of fixes for e2fsprogs were committed to Whamcloud's e2fsprogs git.
So probably some errors on the file systems were still present, but
unnoticed, when we did the upgrade.

Lustre 2 introduces the FID (which is something like an inode number;
Lustre 1.8 used the inode number of the underlying ldiskfs, but with the
possibility of having several MDTs in one file system a replacement was
needed). The FID is stored in the inode, but it can also be arranged
that the FIDs are stored in the directory entry, which makes lookups
faster, especially when there are many files in a directory. However,
there were bugs in the code that takes care of adding the FID to the
directory entry when the file system is converted from 1.8 to 2.x. So I
would recommend using a version in which these bugs are solved. We went
to 2.4.1 at that time. By default this fid_in_dirent feature is not
automatically enabled; however, this is the only point where a
performance boost may be expected... so we took the risk of enabling
this... and ran into some bugs.

We had other file systems still on 1.8, so with the server upgrade we
didn't upgrade the clients, because Lustre 2 clients wouldn't have been
able to mount the 1.8 file systems. And we use quotas, and for this you
need the 1.8.9 client with a patch that corrects a defect of the 1.8.9
client when it talks to 2.x servers (LU-3067). However, older 1.8
clients don't support the Lustre 2 quota (which came in 2.2 or 2.4, I'm
not 100% sure). BTW, it still runs out of sync from time to time, but
the limit seems to be fine now; it's just the numbers the users see.
lfs quota prints out numbers that are too low, and users run out of
quota earlier than they expect... It's better in the latest 2.5
versions now.

Here is an unsorted(!) list of bugs we hit during the Lustre upgrade.
For most of them we weren't the first ones, but I guess you could wait
forever for the version in which all bugs are resolved :-)

- LU-3067 - already mentioned above, a patch for 1.8.9 clients
  interoperating with 2.x servers; 1.8.9 is needed to have quota
  working. Without this patch clients become unresponsive (100% CPU
  load), then just hang and devices become unavailable; reboot doesn't
  work, so a power cycle is needed, but after a while the problem
  reappeared.

- LU-4504 - e2fsck noticed quota issues similar to this bug on OSTs -
  use the latest e2fsprogs, check again, and then the ldiskfs backend
  doesn't run into this anymore.

- e2fsck noticed quota issues on the MDT: "Problem in HTREE directory
  inode 21685465: block #16 not referenced"; however, this could be
  fixed by e2fsck.

- LU-5626, MDT becomes read-only: one file system where the MDT was
  corrupted at an earlier stage and obviously not fully repaired LBUGed
  upon MDT mount, and could only be mounted with the noscrub option.

- The MDT group_upcall (which can be configured with tunefs) used to be
  /usr/sbin/l_getgroups in Lustre 1.8 and was set by default. The
  program is called l_getidentity now and is not configured by default
  anymore. You should either change it with tunefs, or put an
  appropriate link in place as a fallback. Anyhow, Lustre 2 file
  systems don't use it by default anymore; they just trust the client.
  It also means that users/groups are not needed anymore on the Lustre
  servers. (We had local passwd/group files there so that secondary
  groups work properly; alternatively you could configure LDAP, but
  without group_upcall all this is handled by the Lustre client.)

- LU-5626 and LU-2627: ".." directory entries were damaged by adding
  the FID; once all old directories were converted and all files
  somehow recovered (in several consecutive attempts), the problem was
  gone. The number of emergency maintenances is basically limited by
  the depth of your directory structure. It could be repaired by
  running e2fsck, followed by manually moving everything back (save the
  log of the e2fsck, which tells you the relation of the objects in
  lost+found to their original paths!).

- LU-4504 quota out of sync: turn off quota, run e2fsck, turn it on
  again - I believe that's something which must be done quite often
  anyhow, because there is no quotacheck anymore. It's run in the
  background when enabling quotas, but file systems have to be
  unmounted for this.

Related to quota, there is a change in the lfs setquot
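For reference, the fid_in_dirent feature discussed in this thread
corresponds, to my understanding, to the ldiskfs "dirdata" feature,
enabled on the unmounted MDT with tune2fs. This is a hedged sketch with
a hypothetical device path; DRY_RUN=echo prints the commands instead of
running them. Note that this is a one-way switch, which is why enabling
it rules out a later downgrade to 1.8.

```shell
# Sketch: enable fid_in_dirent (ldiskfs dirdata) on an unmounted MDT.
DRY_RUN=echo
MDT_DEV=/dev/mapper/mdt0    # hypothetical MDT block device

# dirdata can only be turned on, not off -- back up the MDT first
$DRY_RUN tune2fs -O dirdata "$MDT_DEV"
$DRY_RUN tune2fs -l "$MDT_DEV"   # verify: the feature list should show dirdata
```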