Re: [lustre-discuss] possible to read orphan ost objects on live filesystem?

2015-09-10 Thread Dilger, Andreas
On 2015/09/10, 6:54 PM, "Chris Hunter"  wrote:

>We experienced file corruption on several OSTs. We proceeded through
>recovery using e2fsck & ll_recover_lost_found_obj tools.
>Following these steps, e2fsck came out clean.
>
>The file corruption did not impact the MDT. The files were still
>referenced by the MDT. Accessing the file on a lustre client (ie. ls -l)
>would report error "Cannot allocate memory"
>
>Following OST recovery steps, we started removing the corrupt files via
>"unlink" command on lustre client (rm command would not remove file).
>
>Now dry-run e2fsck of the OST is reporting errors:
>"deleted/unused inodes" in Pass 2 (checking directory structure),
>"Unattached inodes" in Pass 4 (checking reference counts)
>"free block count wrong" in Pass 5 (checking group summary information).
>
>Are e2fsck errors expected when unlinking files?

No, the "unlink" command is just avoiding the -ENOENT error that "rm" gets
by calling "stat()" on the file before trying to unlink it.  This
shouldn't cause any errors on the OSTs, unless there is ongoing corruption
from the back-end storage.
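For what it's worth, the stat-before-unlink difference is easy to see locally; a minimal sketch using plain coreutils (nothing Lustre-specific):

```shell
# unlink(1) issues the unlink(2) syscall directly, whereas rm stats the
# target first -- which is why rm fails on a file whose OST objects are
# gone while unlink still succeeds.  Plain-filesystem illustration only:
set -e
tmp=$(mktemp -d)
touch "$tmp/victim"
unlink "$tmp/victim"          # removes the name without needing stat() to succeed
test ! -e "$tmp/victim" && echo "unlinked"
rmdir "$tmp"
```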

>thanks,
>chris hunter
>chris.hun...@yale.edu
>
>
>On 09/03/2015 12:54 PM, Martin Hecht wrote:
>> Hi Chris,
>>
>> On 09/02/2015 07:18 AM, Chris Hunter wrote:
>>> Hi Andreas
>>>
>>> On 09/01/2015 07:22 PM, Dilger, Andreas wrote:
 On 2015/09/01, 7:59 AM, "lustre-discuss on behalf of Chris Hunter"
 >>> chris.hun...@yale.edu> wrote:

> Hi Andreas,
> Thanks for your help.
>
> If you have a striped lustre file with "holes" (ie. one chunk is gone
> due to hardware failure, etc.), are the remaining file chunks considered
> orphan objects?
>>> So when a lustre striped file has a hole (eg. missing chunk due to
>>> hardware failure), the remaining file chunks stay indefinitely on the
>>> OSTs.
>>> Is there a way to reclaim the space occupied by these pieces (after
>>> recovery of any usable data, etc.)?
>> these remaining chunks still belong to the file (i.e. you have the
>> metadata entry on the MDT and you see the file when lustre is mounted).
>> By removing the file you free up the space.
>>
>> In general there are two types of inconsistencies which may occur:
>> Orphan objects are objects which are NOT assigned to an entry on the
>> MDT, i.e. chunks which do not belong to any file. These can be either
>> pre-allocated chunks or chunks left over after a corruption of the
>> metadata on the MDT.
>>
>> The other type of corruption is that you have a file where chunks are
>> missing in between. This can happen when an OST gets corrupted. As long
>> as the MDT is OK, you should be able to remove such a file. If the MDT
>> is also corrupted, you should first fix the MDT, and you might then only
>> be able to unlink the file (which again might leave some orphan objects
>> on the OSTs). lfsck should be able to remove them, depending on the
>> lustre version you are running...
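As a hedged sketch of what running lfsck can look like on newer releases (the filesystem/target name is a placeholder; available options vary by Lustre version, so check the manual for yours):

```shell
# Start LFSCK on the MDT (Lustre 2.4+; '-t all' on releases that support
# multiple check types).  'lustre-MDT0000' is a placeholder target name.
lctl lfsck_start -M lustre-MDT0000 -t all

# Poll progress of the namespace check:
lctl get_param mdd.lustre-MDT0000.lfsck_namespace
```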
>>
>> Another point: when an OST got corrupted, after having it repaired
>> with e2fsck you can mount it as ldiskfs, check whether there are chunks
>> in lost+found, and use the tool ll_recover_lost_found_objs to restore
>> them to their original place. I believe the objects which e2fsck puts in
>> lost+found are a different kind of thing, usually not called "orphan
>> objects". As I said, they can usually be easily recovered.
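The recovery steps described above might be sketched roughly like this (the device and mount-point names are placeholders, not from the original mail; verify against the manual for your Lustre version before running anything on production storage):

```shell
OST_DEV=/dev/sdX                # hypothetical OST block device
MNT=/mnt/ost_ldiskfs            # hypothetical scratch mount point

umount "$OST_DEV"               # the OST must be offline first
e2fsck -fy "$OST_DEV"           # repair; disconnected objects land in lost+found
mkdir -p "$MNT"
mount -t ldiskfs "$OST_DEV" "$MNT"   # mount the backend filesystem directly
ll_recover_lost_found_objs -d "$MNT/lost+found"  # move objects back into O/...
umount "$MNT"
```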
>>
>> Martin
>>
>>
>


Cheers, Andreas
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 1.8 client on 3.13.0 kernel

2015-09-10 Thread Lewis Hyatt

Thanks a lot for the info, a little more optimistic :-).

-Lewis

On 9/10/15 11:17 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:

Lewis,

I did an upgrade from Lustre 1.8.6 to 2.4.3 on our servers, and for the most 
part things went pretty well.  I’ll chime in on a couple of Martin’s points and 
mention a few other things.


On Sep 10, 2015, at 9:30 AM, Martin Hecht  wrote:

In any case the file systems should be clean before starting the
upgrade, so I would recommend to run e2fsck on all targets and repair
them before starting the upgrade. We did so, but unfortunately our
e2fsprogs were not really up to date, and after our lustre upgrade a lot
of fixes for e2fsprogs were committed to Whamcloud's e2fsprogs git. So,
probably some errors on the file systems were still present, but
unnoticed when we did the upgrade.


This is a very important point.  While I didn’t run e2fsck before the upgrade 
(but maybe I should have), I made sure to install the latest e2fsprogs.


Lustre 2 introduces the FID (which is something like an inode number,
where lustre 1.8 used the inode number of the underlying ldiskfs, but
with the possibility to have several MDTs in one file system a
replacement was needed). The FID is stored in the inode, but it can also
be activated that the FIDs are stored in the directory node, which makes
lookups faster, especially when there are many files in a directory.
However, there were bugs in the code that takes care of adding the
FID to the directory entry when the file system is converted from 1.8 to
2.x. So, I would recommend to use a version in which these bugs are
solved. We went to 2.4.1 at that time. By default this fid_in_dirent
feature is not automatically enabled, however, this is the only point
where a performance boost may be expected... so we took the risk to
enable this... and ran into some bugs.


Enabling fid_in_dirent prevents you from backing out of the upgrade.  In 
theory, if you upgraded to Lustre 2.x without enabling fid_in_dirent, you could 
always revert back to Lustre 1.8.  We tried this on a test system, and the 
downgrade seemed to work.  However, this was a small scale test and I have 
never tried it on a production file system.  But if you want to minimize 
possible complications, you could always leave this disabled for a while after 
the upgrade, and then if things are going well, enable it later on.
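If you later decide to enable it, the ldiskfs side of fid_in_dirent is (to my understanding) the dirdata feature; a hedged sketch, with a placeholder device name:

```shell
# Enable the dirdata ldiskfs feature on an *unmounted* MDT device so that
# FIDs can be stored in directory entries (fid_in_dirent).  Note this is
# effectively one-way: it blocks a downgrade back to Lustre 1.8.
tune2fs -O dirdata /dev/mdt_device   # /dev/mdt_device is a placeholder
```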


LU-4504 quota out of sync: turn off quota, run e2fsck, turn it on again
- I believe that's something which must be done anyhow quite often,
because there is no quotacheck anymore. It's run in the background when
enabling quotas, but file systems have to be unmounted for this.


We didn’t exactly hit this bug, but I will mention that we have had a couple of 
instances where e2fsck complained about problems on an OST, and it turned out 
that we had to disable and re-enable quotas on the OST to correct the issue.
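The disable/re-enable dance looks roughly like this on 2.4+ servers (the filesystem name is a placeholder; conf_param is run on the MGS):

```shell
# On the MGS: turn OST quota enforcement off, fsck the OST, then re-enable.
lctl conf_param testfs.quota.ost=none   # 'testfs' is a placeholder fsname
# ... unmount the affected OST and run e2fsck on its device here ...
lctl conf_param testfs.quota.ost=ug     # re-enable user/group quota enforcement
```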


LU-4743: We had to remove the CATALOGS file on another file system
(otherwise the MDT wouldn't mount)


We hit this problem.

Someone I know had to do a Lustre upgrade, and they suggested that I apply a 
patch for LU-4708 (which I did).  But if you upgrade to Lustre 2.5.2 or later, 
that patch should already be included.

My only other advice is to test as much as possible prior to the upgrade.  If 
you have a little test hardware, install the same Lustre 1.8 version you are 
currently running in production and then try upgrading that to the new Lustre 
version.  I think preparation is the key.  I think I spent about 2 months 
reading about upgrade procedures, talking with others who have upgraded, 
reading JIRA bug reports, and running tests on hardware.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




Re: [lustre-discuss] 1.8 client on 3.13.0 kernel

2015-09-10 Thread Lewis Hyatt
Thanks very much for this. Will let you know how we come out once we absorb 
this and get the courage to pull the trigger.


-lewis



Re: [lustre-discuss] 1.8 client on 3.13.0 kernel

2015-09-10 Thread Martin Hecht
Hi Lewis,

it's difficult to tell how much data loss was actually related to the
lustre upgrade itself. We have upgraded 6 file systems and we had to do
it more or less in one shot, because at that time they were using a
common MGS server. All servers of one file system must be on the same
level (at least for the major upgrade 1.8 to 2.x, there is rolling
upgrade for minor versions in the lustre 2 branch now, but I have no
experience with that).

In any case the file systems should be clean before starting the
upgrade, so I would recommend to run e2fsck on all targets and repair
them before starting the upgrade. We did so, but unfortunately our
e2fsprogs were not really up to date, and after our lustre upgrade a lot
of fixes for e2fsprogs were committed to Whamcloud's e2fsprogs git. So,
probably some errors on the file systems were still present, but
unnoticed when we did the upgrade.

Lustre 2 introduces the FID (which is something like an inode number,
where lustre 1.8 used the inode number of the underlying ldiskfs, but
with the possibility to have several MDTs in one file system a
replacement was needed). The FID is stored in the inode, but it can also
be activated that the FIDs are stored in the directory node, which makes
lookups faster, especially when there are many files in a directory.
However, there were bugs in the code that takes care of adding the
FID to the directory entry when the file system is converted from 1.8 to
2.x. So, I would recommend to use a version in which these bugs are
solved. We went to 2.4.1 at that time. By default this fid_in_dirent
feature is not automatically enabled, however, this is the only point
where a performance boost may be expected... so we took the risk to
enable this... and ran into some bugs.

We had other file systems, still on 1.8, so with the server upgrade we
didn't upgrade the clients, because lustre 2 clients wouldn't have been
able to mount the 1.8 file systems. And we use quotas, and for this you
need the 1.8.9 client with a patch that corrects a defect of the 1.8.9
client when it talks to 2.x servers (LU-3067). However, older 1.8
clients don't support the Lustre 2 quota (which came in 2.2 or 2.4, I'm
not 100% sure). BTW, quota still runs out of sync from time to time, but
the enforced limit seems to be fine now; it's just the numbers the users
see. lfs quota prints numbers that are too low, so users run out of quota
earlier than they expect... It's better in the latest 2.5 versions now.

Here an unsorted(!) list of bugs we have hit during the lustre upgrade.
For most of them we weren't the first ones, but I guess you could wait
forever for the version in which all bugs are resolved :-)

LU-3067 - already mentioned above, a patch for 1.8.9 clients
interoperating with 2.x servers; 1.8.9 is needed for having quota
working. Without this patch clients become unresponsive (100% cpu load),
then just hang and devices become unavailable; reboot doesn't work, so a
power cycle is needed, but after a while the problem reappeared.

LU-4504 - e2fsck noticed quota issues similar to this bug on OSTs. Use
the latest e2fsprogs and check again; then the ldiskfs backend doesn't
run into this anymore.

e2fsck noticed quota issues on the MDT ("Problem in HTREE directory inode
21685465: block #16 not referenced"); however, this could be fixed by e2fsck.

LU-5626 mdt becomes readonly: one file system where the MDT had been
corrupted at an earlier stage and obviously not fully repaired LBUGed
upon MDT mount; it could only be mounted with the noscrub option.

The mdt group_upcall (which can be configured with tunefs) used to be
/usr/sbin/l_getgroups in lustre 1.8 and it was set by default. The
program is called l_getidentity now and is not configured by default
anymore. You should either change it with tunefs, or put an appropriate
link in place as a fallback. Anyhow, lustre 2 file systems don't use it
by default anymore; they just trust the client. It also means that
users/groups are not needed anymore on the lustre servers. (We had local
passwd/group files there so that secondary groups work properly;
alternatively you could configure ldap, but without group_upcall, all
this is handled by the lustre client.)
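Restoring the upcall explicitly might look like this (fsname/target and the verification step are placeholders drawn from the 2.x manuals, not from this thread):

```shell
# On the MGS: point the MDT identity upcall back at l_getidentity.
lctl conf_param testfs-MDT0000.mdt.identity_upcall=/usr/sbin/l_getidentity

# On the MDS, verify the setting took effect:
lctl get_param mdt.testfs-MDT0000.identity_upcall
```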

LU-5626 and LU-2627: ".." directory entries were damaged by adding the
FID; once all old directories were converted and all files somehow
recovered (in several consecutive attempts), the problem was gone. The
number of emergency maintenances is basically limited by the depth of
your directory structure. It could be repaired by running e2fsck,
followed by manually moving everything back (save the log of the e2fsck,
which tells you the relation between the objects in lost+found and their
original paths!)

LU-4504 quota out of sync: turn off quota, run e2fsck, turn it on again
- I believe that's something which must be done anyhow quite often,
because there is no quotacheck anymore. It's run in the background when
enabling quotas, but file systems have to be unmounted for this.

Related to quota, there is a change in the lfs setquot