Lewis,

I upgraded our servers from Lustre 1.8.6 to 2.4.3, and for the most part 
things went pretty well.  I’ll chime in on a couple of Martin’s points and 
mention a few other things.

> On Sep 10, 2015, at 9:30 AM, Martin Hecht <[email protected]> wrote:
> 
> In any case the file systems should be clean before starting the
> upgrade, so I would recommend to run e2fsck on all targets and repair
> them before starting the upgrade. We did so, but unfortunately our
> e2fsprogs were not really up to date and after our lustre upgrade a lot
> of fixes for e2fsprogs were committed to whamclouds e2fsprogs git. So,
> probably some errors on the file systems were still present, but
> unnoticed when we did the upgrade.

This is a very important point.  While I didn’t run e2fsck before the upgrade 
(but maybe I should have), I made sure to install the latest e2fsprogs.  
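
If you do want a pre-upgrade check, a read-only pass over every target is the 
low-risk way to start.  Here is a minimal sketch that only builds and prints 
the command lines, so you can review them before running anything.  The device 
paths are made up for illustration; -f (force the check) and -n (open 
read-only, answer "no" to every prompt) are standard e2fsck options.  The 
targets need to be unmounted first.

```python
import shlex


def readonly_fsck_commands(devices):
    """Build one read-only e2fsck command line per target device."""
    # -f forces a check even if the file system is marked clean;
    # -n opens it read-only and answers "no" to all repair prompts.
    return [["e2fsck", "-f", "-n", dev] for dev in devices]


# Hypothetical device paths; substitute your own MDT/OST block devices.
targets = ["/dev/mapper/mdt0", "/dev/mapper/ost0", "/dev/mapper/ost1"]

for cmd in readonly_fsck_commands(targets):
    # Print only; nothing is executed by this sketch.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Anything the read-only pass reports can then be repaired in a separate, 
deliberate run with the up-to-date e2fsprogs.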

> Lustre 2 introduces the FID (which is something like an inode number,
> where lustre 1.8 used the inode number of the underlying ldiskfs, but
> with the possibility to have several MDTs in one file system a
> replacement was needed). The FID is stored in the inode, but it can also
> be activated that the FIDs are stored in the directory node, which makes
> lookups faster, especially when there are many files in a directory.
> However, there were bugs in the code that takes care about adding the
> FID to the directory entry when the file system is converted from 1.8 to
> 2.x. So, I would recommend to use a version in which these bug are
> solved. We went to 2.4.1 that time. By default this fid_in_dirent
> feature is not automatically enabled, however, this is the only point
> where a performance boost may be expected... so we took the risk to
> enable this... and ran into some bugs.

Enabling fid_in_dirent prevents you from backing out of the upgrade.  In 
theory, if you upgrade to Lustre 2.x without enabling fid_in_dirent, you can 
still revert to Lustre 1.8.  We tried this on a test system, and the 
downgrade seemed to work.  However, this was a small-scale test, and I have 
never tried it on a production file system.  If you want to minimize 
possible complications, you could leave this disabled for a while after 
the upgrade and then, if things are going well, enable it later on.
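
Before attempting a downgrade, it is worth confirming that the feature really 
is still off.  As far as I know, fid_in_dirent corresponds to the "dirdata" 
feature flag on the ldiskfs MDT, which shows up in the "Filesystem features:" 
line of "dumpe2fs -h" output; treat that mapping as an assumption and check it 
against the manual for your version.  A minimal sketch that parses such a 
header (the sample text below is invented, not real output from our system):

```python
def has_feature(dumpe2fs_header, feature):
    """Return True if `feature` appears in the 'Filesystem features:' line."""
    for line in dumpe2fs_header.splitlines():
        if line.startswith("Filesystem features:"):
            # Everything after the first colon is a space-separated flag list.
            return feature in line.split(":", 1)[1].split()
    return False


# Invented sample header, for illustration only.
sample = """Filesystem volume name:   testfs-MDT0000
Filesystem features:      has_journal ext_attr dir_index filetype dirdata
"""

print(has_feature(sample, "dirdata"))
```

In practice you would feed it the captured output of "dumpe2fs -h" for the MDT 
device instead of the sample string.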

> LU-4504 quota out of sync: turn off quota, run e2fsck, turn it on again
> - I believe that's something which must be done anyhow quite often,
> because there is no quotacheck anymore. It's run in the background when
> enabling quotas, but file systems have to be unmounted for this.

We didn’t exactly hit this bug, but I will mention that we have had a couple of 
instances where e2fsck complained about problems on an OST, and it turned out 
that we had to disable and re-enable quotas on the OST to correct the issue.
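
The sequence Martin describes (quota off, e2fsck, quota on) could be sketched 
roughly as follows.  This assumes quota is toggled through the "quota" feature 
flag with tune2fs on an unmounted ldiskfs target, which is an assumption about 
the procedure rather than a record of the exact commands anyone ran; the 
device path is hypothetical.  The sketch only builds and prints the commands.

```python
def quota_reset_commands(device):
    """Build the off / check / on command sequence for one unmounted target."""
    return [
        ["tune2fs", "-O", "^quota", device],  # clear the quota feature flag
        ["e2fsck", "-f", device],             # forced check while quota is off
        ["tune2fs", "-O", "quota", device],   # set the quota feature flag again
    ]


# Hypothetical OST device path.
for cmd in quota_reset_commands("/dev/mapper/ost0"):
    print(" ".join(cmd))
```

Keeping the steps as a printed list makes it easy to review the order before 
running them by hand on each affected target.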

> LU-4743: We had to remove the CATALOGS file on another file system
> (otherwise the MDT wouldn't mount)

We hit this problem.

Someone I know had to do a Lustre upgrade, and they suggested that I apply a 
patch for LU-4708 (which I did).  But if you upgrade to Lustre 2.5.2 or later, 
that patch should already be included.

My only other advice is to test as much as possible prior to the upgrade.  If 
you have a little test hardware, install the same Lustre 1.8 version you are 
currently running in production and then try upgrading that to the new Lustre 
version.  Preparation is the key: I spent about 2 months reading about upgrade 
procedures, talking with others who have upgraded, reading JIRA bug reports, 
and running tests on hardware.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
