A few more comments in-line.

On 09/10/2015 09:11 PM, Lewis Hyatt wrote:
> Thanks a lot for the info, a little more optimistic :-).
>
> -Lewis
>
> On 9/10/15 11:17 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
>> Lewis,
>>
>> I did an upgrade from Lustre 1.8.6 to 2.4.3 on our servers, and for
>> the most part things went pretty good. I’ll chime in on a couple of
>> Martin’s points and mention a few other things.
>>
>>> On Sep 10, 2015, at 9:30 AM, Martin Hecht <[email protected]> wrote:
>>>
>>> In any case the file systems should be clean before starting the
>>> upgrade, so I would recommend to run e2fsck on all targets and repair
>>> them before starting the upgrade. We did so, but unfortunately our
>>> e2fsprogs were not really up to date and after our lustre upgrade a lot
>>> of fixes for e2fsprogs were committed to Whamcloud's e2fsprogs git. So,
>>> probably some errors on the file systems were still present, but
>>> unnoticed when we did the upgrade.
>>
>> This is a very important point. While I didn’t run e2fsck before the
>> upgrade (but maybe I should have), I made sure to install the latest
>> e2fsprogs.

Well, a version of e2fsprogs with some important fixes was released
shortly after we did the upgrade. Maybe that was only because we ran
into these bugs and the vendor escalated our tickets to Whamcloud/Intel.
>>
>>> Lustre 2 introduces the FID (which is something like an inode number,
>>> where lustre 1.8 used the inode number of the underlying ldiskfs, but
>>> with the possibility to have several MDTs in one file system a
>>> replacement was needed). The FID is stored in the inode, but it can
>>> also be activated that the FIDs are stored in the directory node,
>>> which makes lookups faster, especially when there are many files in a
>>> directory. However, there were bugs in the code that takes care about
>>> adding the FID to the directory entry when the file system is
>>> converted from 1.8 to 2.x. So, I would recommend to use a version in
>>> which these bugs are solved. We went to 2.4.1 that time. By default
>>> this fid_in_dirent feature is not automatically enabled, however, this
>>> is the only point where a performance boost may be expected... so we
>>> took the risk to enable this... and ran into some bugs.
>>
>> Enabling fid_in_dirent prevents you from backing out of the upgrade.
>> In theory, if you upgraded to Lustre 2.x without enabling
>> fid_in_dirent, you could always revert back to Lustre 1.8. We tried
>> this on a test system, and the downgrade seemed to work. However,
>> this was a small scale test and I have never tried it on a production
>> file system. But if you want to minimize possible complications, you
>> could always leave this disabled for a while after the upgrade, and
>> then if things are going well, enable it later on.

Actually, the FID is added to new contents, and you have to run the
oi_scrub once to convert the file system. That might be important to
know when you decide to use this feature. On the other hand, if you
don't enable fid_in_dirent, you can theoretically go back, but I think
the FID is still added to regular files (just not to the directory
entry), and you can't read files created with lustre 2 after the
downgrade.

However, running lustre 2 without fid_in_dirent is possible, at least in
the earlier 2.x versions; from about 2.5 onwards you would have to
double check. This is sometimes called "Compatibility Mode IGIF".

Anyhow, to avoid running into the problem with the directory entries, I
would also recommend either not enabling fid_in_dirent or making sure to
choose a version which has all the fixes for this problem. There are
different types of directories, large and small ones, which have a
different structure. The issue had already been fixed for some cases,
but we hit another case which was not handled correctly until our
upgrade exposed the bug.

>>
>> My only other advice is to test as much as possible prior to the
>> upgrade. If you have a little test hardware, install the same Lustre
>> 1.8 version you are currently running in production and then try
>> upgrading that to the new Lustre version. I think preparation is the
>> key. I think I spent about 2 months reading about upgrade
>> procedures, talking with others who have upgraded, reading JIRA bug
>> reports, and running tests on hardware.

Well, our vendor was preparing the upgrade for about a year, did
intensive testing on several file systems, and changed the targeted
lustre version several times. The problem is that some bugs are only
hit on the real production system. Take the fid_in_dirent issue: it
depends on the number of files in the directory, and you only notice
the bug when you have upgraded the file system and try to move some
files from such a directory to another place. I'm not sure if it has to
be a directory created after the upgrade; maybe the destination just
has to be a different directory. But to be honest, you wouldn't test
this scenario if you weren't aware that such a bug may exist. Or you
might test it, but not cover all cases with different numbers of files
in a directory.
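
To make the "different numbers of files in a directory" test concrete, here is a minimal sketch of such a check, not something we actually ran. It assumes you point `root` at a scratch directory on an upgraded test file system; the function name `exercise_moves` and the file counts are made up for illustration, and the thread does not say at which directory sizes the bug triggers. To cover directories that existed before the upgrade you would have to create the source directories under 1.8 and only do the move step afterwards.

```python
import os
import tempfile


def exercise_moves(root, counts):
    """For each count, create a directory holding that many files, move
    every file into a second directory, and verify the result.

    Returns {count: [problems found]}; an empty list means no problem.
    """
    results = {}
    for n in counts:
        src = os.path.join(root, f"src_{n}")
        dst = os.path.join(root, f"dst_{n}")
        os.makedirs(src)
        os.makedirs(dst)

        # Zero-padded names so a sorted listing matches creation order.
        names = [f"f{i:07d}" for i in range(n)]
        for i, name in enumerate(names):
            with open(os.path.join(src, name), "w") as fh:
                fh.write(f"payload {n} {i}\n")

        problems = []
        for i, name in enumerate(names):
            try:
                # Move within the same file system, then read the file back.
                os.rename(os.path.join(src, name), os.path.join(dst, name))
                with open(os.path.join(dst, name)) as fh:
                    if fh.read() != f"payload {n} {i}\n":
                        problems.append(name)
            except OSError:
                problems.append(name)

        # The directory listings themselves must also be consistent.
        if sorted(os.listdir(dst)) != names:
            problems.append("<destination listing mismatch>")
        if os.listdir(src):
            problems.append("<source not empty>")
        results[n] = problems
    return results


if __name__ == "__main__":
    # Hypothetical counts; replace the temporary directory with a path
    # on the file system under test.
    with tempfile.TemporaryDirectory() as root:
        for n, problems in exercise_moves(root, [1, 10, 100, 1000]).items():
            print(n, "OK" if not problems else problems)
```

A clean run on a test system only shows that these particular directory sizes behave; it does not rule out the cases you did not think of.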

In fact I had a bad feeling about enabling fid_in_dirent, because
converting the directory entries sounds like a dangerous thing. But
based on the tests carried out, the vendor was confident that this
would work fine, and it promised better metadata performance.

Anyhow, the point I want to make is that even if you do a lot of
testing, you may miss issues that only pop up in real production
environments. The good thing is that we have already hit this one, so
you can avoid it now. And Rick is totally right: talking with others
who have upgraded, reading JIRA bug reports, and running tests is
really important to be prepared and to make a good choice of the
version to which you plan to upgrade.

best regards,
Martin
_______________________________________________ lustre-discuss mailing list [email protected] http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
