Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-11-05 Thread Martin Hecht
Hi,

comments inline...

On 11/04/2015 01:34 PM, Patrick Farrell wrote:
> Our observation at the time was that lfsck did not add the fid to the .. 
> dentry unless there was already space in the appropriate location.  
Ok, I might have been wrong on this point; some manual mv by the
users may have been involved.


On 11/04/2015 04:24 PM, Chris Hunter wrote:
> Yes I believe you want to (manually) recover the directories from
> lost+found back to ROOT on the MDT before lfsck/oi_scrub runs. I don't
> think lfsck on the MDT will impact orphan objects on the OSTs.
With lfsck phase 2, introduced in Lustre 2.6, the MDT-OST consistency is
checked and repaired. Chris, you wrote that you have upgraded to "lustre
2.x", so I don't know if you have lfsck II already.  And I'm not sure if
MDT entries in lost+found are ignored by lfsck. I just wanted to point
out that you might have to be careful here, but looking at the Lustre
manual it turns out that you are right. The consistency checks are run
when the lfsck type is set to "layout", which is different from the
"namespace" check used to update the FIDs.

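As a rough illustration of the two check types, here is a minimal sketch
that only builds the command line and does not run anything. It assumes the
`lctl lfsck_start` command with `-M` (target device) and `-t` (type) as
described in the Lustre manual; the device name `testfs-MDT0000` and the
helper name are placeholders, not taken from this thread.

```python
import shlex

# The two types discussed here: "namespace" updates the FIDs,
# "layout" checks MDT-OST consistency.
LFSCK_TYPES = ("namespace", "layout")


def lfsck_start_cmd(mdt_device, lfsck_type):
    """Build (but do not run) an lctl lfsck_start command line."""
    if lfsck_type not in LFSCK_TYPES:
        raise ValueError(f"unexpected lfsck type: {lfsck_type!r}")
    return ["lctl", "lfsck_start", "-M", mdt_device, "-t", lfsck_type]


if __name__ == "__main__":
    # "testfs-MDT0000" is a hypothetical device name.
    print(shlex.join(lfsck_start_cmd("testfs-MDT0000", "namespace")))
```

Keeping the type explicit makes it harder to start a layout check by
accident while entries are still sitting in lost+found.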

On 11/05/2015 01:29 AM, Dilger, Andreas wrote:
> Note that newer versions of LFSCK namespace checking (2.6 or 2.7, don't
> recall offhand) will be able to return such entries from lost+found back
> into the proper parent directory in the namespace, assuming they were
> created under 2.x.  Lustre stores an extra "link" xattr on each inode with
> the filename and parent directory FID for each link to the file (up to the
> available xattr space for each inode), so in case of directory corruption
> it would be possible to rebuild the directory structure just from the
> "link" xattrs on each file.
That's good to know. However, the files in this case were created with
1.8, so even if the current version after the upgrade has this "link"
xattr, it doesn't help to recover from LU-5626. But your script is
useful (it's pretty much the same as what I did back then, but I couldn't
find my quick hack anymore...)
 
> In the meantime, I attached a script to LU-5626 that could be used to
> re-link files from lost+found into the right directory and filename based
> on the output from e2fsck.  It is a bit rough (needs manual editing of
> pathnames), but may be useful if someone has hit this problem.

best regards,
Martin



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-11-04 Thread Dilger, Andreas
On 2015/11/04, 02:42, "lustre-discuss on behalf of Martin Hecht" wrote:

>On 11/04/2015 03:23 AM, Patrick Farrell wrote:
>> PAF: Remember, the specific conditions are pretty tight.  Created under
>>1.8, not empty (if it's empty, the .. dentry is not misplaced when
>>moved) but also non-htree, then moved with dirdata enabled, and then
>>grown to this larger size.  How many existing (small) directories do you
>>move and then add a bunch of files to?  It's a pretty rare operation.
>>We only hit it at Martin's site because of an automated tool they have
>>to re-arrange user/job directories.
>Well, not only because of the tool. Especially, because when the
>directories have been moved by the tool, no files are added anymore.
>However, our mechanism gives a reason to the users to move their data
>from time to time (that's not the intention of the mechanism, but that's
>how some users react).
>
>But I'm not quite sure anymore if moving the directories is really a
>precondition to run into LU-5626.
>We have run the background lfsck which adds the FID to the existing
>dentries. This might be an important detail, because in our case a
>second '..' entry containing the FID was presumably created by lfsck (in
>the wrong place), and not by moving the directory. To my current
>understanding the user then only has to add some files to trigger the
>LBUG.
>A subsequent e2fsck will not only find this particular directory but all
>other small directories with a '..' entry in the wrong place. When
>e2fsck tries to fix these directories, some entries are overwritten by
>the FID and these files are then moved to lost+found.

Note that newer versions of LFSCK namespace checking (2.6 or 2.7, don't
recall offhand) will be able to return such entries from lost+found back
into the proper parent directory in the namespace, assuming they were
created under 2.x.  Lustre stores an extra "link" xattr on each inode with
the filename and parent directory FID for each link to the file (up to the
available xattr space for each inode), so in case of directory corruption
it would be possible to rebuild the directory structure just from the
"link" xattrs on each file.

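To make the idea concrete, here is a small sketch of rebuilding paths from
parent links. It does not read any xattrs: the `(fid, parent_fid, name)`
tuples stand in for whatever would be decoded from the per-inode "link"
xattr, and the function name and FID strings are purely illustrative.

```python
def rebuild_paths(links, root_fid):
    """Map each FID to a full path using (fid, parent_fid, name) records.

    `links` is a stand-in for decoded "link" xattr contents; decoding the
    on-disk format is not shown here. Only one link per FID is handled.
    """
    by_fid = {fid: (parent, name) for fid, parent, name in links}
    paths = {}

    def resolve(fid, seen=()):
        if fid == root_fid:
            return ""
        if fid in paths:
            return paths[fid]
        if fid not in by_fid or fid in seen:
            return None  # parent unknown or a loop: cannot place this entry
        parent, name = by_fid[fid]
        parent_path = resolve(parent, seen + (fid,))
        if parent_path is None:
            return None
        paths[fid] = parent_path + "/" + name
        return paths[fid]

    for fid in by_fid:
        resolve(fid)
    return paths
```

Entries whose parent chain cannot be followed back to the root are simply
left out, which is roughly the set that would still need manual attention.
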
In the meantime, I attached a script to LU-5626 that could be used to
re-link files from lost+found into the right directory and filename based
on the output from e2fsck.  It is a bit rough (needs manual editing of
pathnames), but may be useful if someone has hit this problem.
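
For readers without access to that script, the general shape of such a
re-link step might look like the sketch below. It is not the script attached
to LU-5626. It assumes you have already extracted an inode-to-original-path
mapping from your own e2fsck log by hand, that e2fsck named the lost+found
entries `#<inode>`, and that the MDT is mounted as ldiskfs at the
hypothetical mount point `/mnt/mdt`. It only prints `mv` commands for review.

```python
import os
import shlex


def relink_commands(inode_to_path, mdt_mount="/mnt/mdt"):
    """Return 'mv' command strings moving lost+found/#<inode> back under ROOT."""
    cmds = []
    for inode, rel_path in sorted(inode_to_path.items()):
        src = os.path.join(mdt_mount, "lost+found", f"#{inode}")
        dst = os.path.join(mdt_mount, "ROOT", rel_path.lstrip("/"))
        # -n: never overwrite an existing destination.
        cmds.append(shlex.join(["mv", "-n", src, dst]))
    return cmds


if __name__ == "__main__":
    # Hypothetical mapping; real values come from the captured e2fsck output.
    for cmd in relink_commands({123: "user/a.dat"}):
        print(cmd)
```

Printing the commands instead of executing them leaves room to edit
pathnames before anything on the MDT is touched.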

Cheers, Andreas

>If one of these first entries happens to be a small subdirectory, I
>believe there is a chance to run into the same issue again, when you
>move everything back to the original location after the e2fsck and
>someone starts adding files in these subdirectories.
>
>However, the preconditions are still quite narrow: small directories,
>not empty, created without fid, then converted by lfsck (or
>alternatively moved to a different place which would also create the
>second '..' entry). To trigger the LBUG files need to be added to one of
>these directories and for a second occurrence of the LBUG the same
>conditions must hold for another subdirectory which must have been at
>the very beginning of the directory.
>
>Martin
>
>
>


Cheers, Andreas
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division




Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-11-04 Thread Chris Hunter



On 11/02/2015 12:30 PM, Martin Hecht wrote:
> Hi Chris and Patrick,
>
> I was sick last week, so I did not see this conversation until today,
> sorry.
>
> On 10/27/2015 05:06 PM, Patrick Farrell wrote:
>> If you read LU-5626 carefully, there's an explanation of the exact nature of
>> the damage, and having that should let you make partial recoveries by hand.
>> I'm not familiar with the ll_recover_lost_found_objs tool, but I doubt it
>> would prove helpful in this instance.
> There is no tool like ll_recover_lost_found_objs for the MDT. On OSTs
> this would be the right choice.
>
>> Note that there are two forms of this corruption.  One is if you move a
>> directory which was created before dirdata was enabled, then the '..' entry
>> ends up in the wrong place.  This does not trouble Lustre, but fsck reports
>> it as an error and will 'correct' it, which has the effect of (usually)
>> overwriting one dentry in the directory when it creates a new '..' dentry in
>> the correct location.
>>
>> I don't *think* that one causes the MDT to go read only, but I could be
>> wrong.  I *think* what causes the MDT to go read only is the other problem:
>>
>> When you have a non-htree directory (not too many items in it, all directory
>> entries in a single block) that is in the bad state described above (with
>> the '..' dentry in the wrong place after being moved) and that directory has
>> enough files added to it that it becomes an htree directory, the resulting
>> directory is corrupted more severely.  We never sorted out the precise
>> details of this - I believe we chose to simply delete any directories in
>> this state.  (I think lfsck did it for us, but can't recall for sure.)
> If I recall correctly, moving (or renaming) the corrupted directory to
> another place caused the MDT to go read-only; adding more files, as
> Patrick wrote above, is probably another trigger.
>
> In our case we captured the full output of e2fsck, which contained the
> original names and the inodes. fsck moved some of the files and
> subdirectories of the corrupted directories to lost+found. With the
> information contained in the e2fsck output we could move them back from
> lost+found to their original place on the ldiskfs level (I parsed
> the e2fsck output for a pattern matching the inode numbers and created a
> script out of it). We had to repeat this a couple of times, because
> some of the subdirectories moved to lost+found were either in bad
> shape themselves or were damaged further when the owners later added
> files to them or moved them around.
>
> So, if you have captured all your e2fsck output and you haven't yet
> cleaned up lost+found, you can still recover the data. lfsck would
> probably throw away the objects on the OSTs because it thinks they are
> orphan objects left over after deleting the files.
>
> best regards,
> Martin


Yes I believe you want to (manually) recover the directories from 
lost+found back to ROOT on the MDT before lfsck/oi_scrub runs. I don't 
think lfsck on the MDT will impact orphan objects on the OSTs.


regards,
chris hunter



Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-11-04 Thread Patrick Farrell
Martin,

Our observation at the time was that lfsck did not add the fid to the .. dentry 
unless there was already space in the appropriate location.  I don't remember 
digging in to the details, but that was our observation at the time.  (Since it 
meant lfsck namespace was behaving, in a sense, correctly, we were initially 
puzzled but decided it was all right.  I seem to remember reading a comment 
somewhere that the developers decided rearranging the dentries was too hard, so 
they'd only add fids where space was already present.)

It's possible we didn't get that quite right, though it would have to be 
partial somehow - misplaced .. dentries with fids were definitely not universal 
after running the namespace lfsck. (Drawing on experience from other sites here 
as well.)

In any case, directories with bad .. dentries can be identified with fsck 
anyway.

- Patrick

From: Martin Hecht [he...@hlrs.de]
Sent: Wednesday, November 04, 2015 3:42 AM
To: Patrick Farrell; Mohr Jr, Richard Frank (Rick Mohr)
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

On 11/04/2015 03:23 AM, Patrick Farrell wrote:
> PAF: Remember, the specific conditions are pretty tight.  Created under 1.8, 
> not empty (if it's empty, the .. dentry is not misplaced when moved) but also 
> non-htree, then moved with dirdata enabled, and then grown to this larger 
> size.  How many existing (small) directories do you move and then add a bunch 
> of files to?  It's a pretty rare operation.  We only hit it at Martin's site 
> because of an automated tool they have to re-arrange user/job directories.
Well, not only because of the tool. Especially, because when the
directories have been moved by the tool, no files are added anymore.
However, our mechanism gives a reason to the users to move their data
from time to time (that's not the intention of the mechanism, but that's
how some users react).

But I'm not quite sure anymore if moving the directories is really a
precondition to run into LU-5626.
We have run the background lfsck which adds the FID to the existing
dentries. This might be an important detail, because in our case a
second '..' entry containing the FID was presumably created by lfsck (in
the wrong place), and not by moving the directory. To my current
understanding the user then only has to add some files to trigger the LBUG.
A subsequent e2fsck will not only find this particular directory but all
other small directories with a '..' entry in the wrong place. When
e2fsck tries to fix these directories, some entries are overwritten by
the FID and these files are then moved to lost+found.
If one of these first entries happens to be a small subdirectory, I
believe there is a chance to run into the same issue again, when you
move everything back to the original location after the e2fsck and
someone starts adding files in these subdirectories.

However, the preconditions are still quite narrow: small directories,
not empty, created without fid, then converted by lfsck (or
alternatively moved to a different place which would also create the
second '..' entry). To trigger the LBUG files need to be added to one of
these directories and for a second occurrence of the LBUG the same
conditions must hold for another subdirectory which must have been at
the very beginning of the directory.

Martin




Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-11-04 Thread Martin Hecht
On 11/04/2015 03:23 AM, Patrick Farrell wrote:
> PAF: Remember, the specific conditions are pretty tight.  Created under 1.8, 
> not empty (if it's empty, the .. dentry is not misplaced when moved) but also 
> non-htree, then moved with dirdata enabled, and then grown to this larger 
> size.  How many existing (small) directories do you move and then add a bunch 
> of files to?  It's a pretty rare operation.  We only hit it at Martin's site 
> because of an automated tool they have to re-arrange user/job directories.
Well, not only because of the tool. Especially, because when the
directories have been moved by the tool, no files are added anymore.
However, our mechanism gives a reason to the users to move their data
from time to time (that's not the intention of the mechanism, but that's
how some users react).

But I'm not quite sure anymore if moving the directories is really a
precondition to run into LU-5626.
We have run the background lfsck which adds the FID to the existing
dentries. This might be an important detail, because in our case a
second '..' entry containing the FID was presumably created by lfsck (in
the wrong place), and not by moving the directory. To my current
understanding the user then only has to add some files to trigger the LBUG.
A subsequent e2fsck will not only find this particular directory but all
other small directories with a '..' entry in the wrong place. When
e2fsck tries to fix these directories, some entries are overwritten by
the FID and these files are then moved to lost+found.
If one of these first entries happens to be a small subdirectory, I
believe there is a chance to run into the same issue again, when you
move everything back to the original location after the e2fsck and
someone starts adding files in these subdirectories.

However, the preconditions are still quite narrow: small directories,
not empty, created without fid, then converted by lfsck (or
alternatively moved to a different place which would also create the
second '..' entry). To trigger the LBUG files need to be added to one of
these directories and for a second occurrence of the LBUG the same
conditions must hold for another subdirectory which must have been at
the very beginning of the directory.

Martin






Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-11-03 Thread Patrick Farrell
Mmm, unfortunately, still not quite right - Disabling dirdata will not 
save you in the conversion to HTree case either.  It will just prevent 
*more* directories from getting a misplaced ".." dentry to begin with.


As to size...  I figured it out once - But it depends on file name 
length in the directory, since the dentry includes the file name. Once 
the total size of dentries in a directory exceeds 4096 bytes (one 
block), then it will be converted to an HTree, I believe.


So, at something like 32 bytes a dentry, which is like a 10-16 or so 
character file name (exact dentry length here requires more checking 
than I've got time for, but it's close), then you've got 32=2^5, 4096 = 
2^12, so 2^12/2^5 = 2^7 or 128 dentries.


But of course, longer file names --> bigger dentries --> fewer dentries 
before conversion to HTree.
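
The back-of-the-envelope estimate above can be written out as a short
sketch. It assumes the 4096-byte threshold quoted here and an ext4-style
record length (8-byte header plus the name, rounded up to a multiple of 4).
The `extra` parameter is a hypothetical allowance for per-entry data such as
a stored FID; its real size is not given in this thread. The "." and ".."
entries are ignored.

```python
BLOCK_SIZE = 4096  # bytes, the threshold quoted above


def dentries_before_htree(dentry_size, block_size=BLOCK_SIZE):
    """Rough count of equally sized dentries that fit before conversion."""
    return block_size // dentry_size


def approx_dentry_size(name_len, extra=0):
    """ext4-style record length: 8-byte header plus name, rounded up to 4.

    `extra` is a hypothetical per-entry overhead (for example a stored FID).
    """
    return (8 + name_len + extra + 3) & ~3


if __name__ == "__main__":
    print(dentries_before_htree(32))  # the 32-byte case from the text: 128
    for name_len in (8, 16, 32, 64):
        size = approx_dentry_size(name_len)
        print(name_len, size, dentries_before_htree(size))
```

Treat the result as an order of magnitude only; real directories mix name
lengths and carry the two leading entries as well.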


As far as "easy way to scan", well, fsck set to not make changes will 
find all the directories with misplaced ".." dentries, and also any 
already damaged-by-conversion-to-HTree directories.
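
A minimal sketch of such a no-changes scan follows. It assumes `e2fsck` with
`-f` (force a check) and `-n` (open read-only, answer "no" to everything),
run against an unmounted MDT device. The exact wording of the e2fsck
messages is not quoted in this thread, so the filter just keeps any line
that mentions `'..'`; adjust it to what your log actually contains. The
helper names are illustrative.

```python
import subprocess


def readonly_fsck_cmd(device):
    # -f forces a check, -n opens read-only and answers "no" to all prompts.
    return ["e2fsck", "-f", "-n", device]


def dotdot_lines(fsck_output):
    """Keep only the lines of e2fsck output that mention a '..' entry."""
    return [line for line in fsck_output.splitlines() if "'..'" in line]


def scan(device):
    # e2fsck exits non-zero when it finds problems, so check=True is not used.
    result = subprocess.run(readonly_fsck_cmd(device),
                            capture_output=True, text=True)
    return dotdot_lines(result.stdout)
```

Saving the complete output as well is worthwhile, since the original names
and inode numbers in it are what a later recovery from lost+found relies on.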


- Patrick
On 11/03/2015 01:12 PM, Mohr Jr, Richard Frank (Rick Mohr) wrote:

Patrick,

Thanks for the clarification. I think I understand now.  Disabling dirdata 
would not help any directories which have already had their “..” entry 
relocated.  The next time fsck runs, those directories will potentially get 
corrupted.  The bigger reason to disable dirdata is to prevent more serious 
corruption if a non-HTree directory with an incorrectly placed “..” gets 
converted to a HTree directory.

How large does a directory need to be before the conversion to HTree happens?  
I don’t suppose there is an easy way to scan the file system to look for 
directories that might be subject to corruption…

—Rick



On Nov 3, 2015, at 12:30 PM, Patrick Farrell  wrote:

Hm.  That's almost, but not quite, right.  Disabling dirdata during the fsck 
run has no positive effect - fsck will still get upset about the incorrectly 
placed entry.  (And whether or not dirdata is enabled, fsck will do the same 
thing.  It doesn't know or care about the dirdata setting as such.)

Steps #1 and #2 will not cause any problems until you run fsck, but there's no 
way around the issue once you do run fsck.  The .. dentry must go back to the 
correct location to make fsck happy.  If I remember right, fsck creates the .. 
dentry and doesn't include the fid (regardless of dirdata setting).  This can 
overwrite another dentry if one has been placed in the location normally 
reserved for the .. dentry (which can happen if the dentry which was after the 
.. dentry is deleted, thereby making a space large enough for a dentry+FID).

Furthermore, if you have a non-Htree directory where the .. dentry is incorrectly 
placed (your steps 1 & 2), then you add files until it shifts to become an 
HTree directory, THAT directory becomes corrupted in a more severe manner that will 
cause your MDT to remount read only and/or LBUG.  (LU-2638 only fixes the .. dentry 
bug for HTree directories themselves.  It does not help with a corrupted directory 
that then becomes an HTree directory.)

- Patrick

On 11/03/2015 11:17 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:

On Oct 27, 2015, at 1:46 PM, Patrick Farrell  wrote:

That's something of a time bomb - If one of those directories fsck wishes it 
could correct is small and grows in number of files, you'll get the MDT going 
read only (and a few odd LBUGs if you try to put it back).

I was looking back over the incident where I thought I had hit this bug, but 
based on the lack of side effects that you mentioned, I am now starting to 
think that I was mistaken.  Nevertheless, I am trying to understand the bug a 
little better in case I am still susceptible to it.  I tried to summarize my 
understanding below, and maybe you can tell me if I am correct.

For HTree directories, the problem is described in LU-2638.  But since I am 
running Lustre >2.4, I should not be affected by this bug.

For non-HTree directories, the problem is described in LU-5626.  In order to 
trigger the bug, the following steps must happen:

1) A non-HTree directory created under Lustre 1.8 (which does not have a FID 
for its “..” entry) gets moved to a different parent directory.

2) Lustre tries to update the “..” entry in the directory, and if there is not 
enough space in the existing entry, it creates a new “..” entry and adds the 
FID.

3) Something happens to the MDT, and fsck needs to be run.  When it runs, it 
notices that “..” is no longer the second entry in the directory.

4) fsck tries to “fix” the problem by moving the “..” entry back to its 
original position.  With the FID in place, there is not enough space in the 
original position, but fsck moves it anyway which causes the “..” entry to 
overwrite part of the third entry in the directory.

If that is correct, then steps #1 and #2 can happen without causing any 
problems.  It is only at steps #3 and #4 that the corruption occurs, 

Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-11-03 Thread Mohr Jr, Richard Frank (Rick Mohr)
Patrick,

Thanks for the clarification. I think I understand now.  Disabling dirdata 
would not help any directories which have already had their “..” entry 
relocated.  The next time fsck runs, those directories will potentially get 
corrupted.  The bigger reason to disable dirdata is to prevent more serious 
corruption if a non-HTree directory with an incorrectly placed “..” gets 
converted to a HTree directory.

How large does a directory need to be before the conversion to HTree happens?  
I don’t suppose there is an easy way to scan the file system to look for 
directories that might be subject to corruption…

—Rick


> On Nov 3, 2015, at 12:30 PM, Patrick Farrell  wrote:
> 
> Hm.  That's almost, but not quite, right.  Disabling dirdata during the fsck 
> run has no positive effect - fsck will still get upset about the incorrectly 
> placed entry.  (And whether or not dirdata is enabled, fsck will do the same 
> thing.  It doesn't know or care about the dirdata setting as such.)
> 
> Steps #1 and #2 will not cause any problems until you run fsck, but there's 
> no way around the issue once you do run fsck.  The .. dentry must go back to 
> the correct location to make fsck happy.  If I remember right, fsck creates 
> the .. dentry and doesn't include the fid (regardless of dirdata setting).  
> This can overwrite another dentry if one has been placed in the location 
> normally reserved for the .. dentry (which can happen if the dentry which was 
> after the .. dentry is deleted, thereby making a space large enough for a 
> dentry+FID).
> 
> Furthermore, if you have a non-Htree directory where the .. dentry is 
> incorrectly placed (your steps 1 & 2), then you add files until it shifts to 
> become an HTree directory, THAT directory becomes corrupted in a more severe 
> manner that will cause your MDT to remount read only and/or LBUG.  (LU-2638 
> only fixes the .. dentry bug for HTree directories themselves.  It does not 
> help with a corrupted directory that then becomes an HTree directory.)
> 
> - Patrick
> 
> On 11/03/2015 11:17 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
>>> On Oct 27, 2015, at 1:46 PM, Patrick Farrell  wrote:
>>> 
>>> That's something of a time bomb - If one of those directories fsck wishes 
>>> it could correct is small and grows in number of files, you'll get the MDT 
>>> going read only (and a few odd LBUGs if you try to put it back).
>> I was looking back over the incident where I thought I had hit this bug, but 
>> based on the lack of side effects that you mentioned, I am now starting to 
>> think that I was mistaken.  Nevertheless, I am trying to understand the bug 
>> a little better in case I am still susceptible to it.  I tried to summarize 
>> my understanding below, and maybe you can tell me if I am correct.
>> 
>> For HTree directories, the problem is described in LU-2638.  But since I am 
>> running Lustre >2.4, I should not be affected by this bug.
>> 
>> For non-HTree directories, the problem is described in LU-5626.  In order to 
>> trigger the bug, the following steps must happen:
>> 
>> 1) A non-HTree directory created under Lustre 1.8 (which does not have a FID 
>> for its “..” entry) gets moved to a different parent directory.
>> 
>> 2) Lustre tries to update the “..” entry in the directory, and if there is 
>> not enough space in the existing entry, it creates a new “..” entry and adds 
>> the FID.
>> 
>> 3) Something happens to the MDT, and fsck needs to be run.  When it runs, it 
>> notices that “..” is no longer the second entry in the directory.
>> 
>> 4) fsck tries to “fix” the problem by moving the “..” entry back to its 
>> original position.  With the FID in place, there is not enough space in the 
>> original position, but fsck moves it anyway which causes the “..” entry to 
>> overwrite part of the third entry in the directory.
>> 
>> If that is correct, then steps #1 and #2 can happen without causing any 
>> problems.  It is only at steps #3 and #4 that the corruption occurs, and as 
>> long as dirdata is disabled before fsck is run, then there should not be any 
>> problems.
>> 
>> Is that explanation accurate?
>> 
>> --
>> Rick Mohr
>> Senior HPC System Administrator
>> National Institute for Computational Sciences
>> http://www.nics.tennessee.edu
>> 
> 



Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-11-03 Thread Mohr Jr, Richard Frank (Rick Mohr)

> On Oct 27, 2015, at 1:46 PM, Patrick Farrell  wrote:
> 
> That's something of a time bomb - If one of those directories fsck wishes it 
> could correct is small and grows in number of files, you'll get the MDT going 
> read only (and a few odd LBUGs if you try to put it back).

I was looking back over the incident where I thought I had hit this bug, but 
based on the lack of side effects that you mentioned, I am now starting to 
think that I was mistaken.  Nevertheless, I am trying to understand the bug a 
little better in case I am still susceptible to it.  I tried to summarize my 
understanding below, and maybe you can tell me if I am correct.

For HTree directories, the problem is described in LU-2638.  But since I am 
running Lustre >2.4, I should not be affected by this bug.

For non-HTree directories, the problem is described in LU-5626.  In order to 
trigger the bug, the following steps must happen:

1) A non-HTree directory created under Lustre 1.8 (which does not have a FID 
for its “..” entry) gets moved to a different parent directory.

2) Lustre tries to update the “..” entry in the directory, and if there is not 
enough space in the existing entry, it creates a new “..” entry and adds the 
FID.

3) Something happens to the MDT, and fsck needs to be run.  When it runs, it 
notices that “..” is no longer the second entry in the directory.

4) fsck tries to “fix” the problem by moving the “..” entry back to its 
original position.  With the FID in place, there is not enough space in the 
original position, but fsck moves it anyway which causes the “..” entry to 
overwrite part of the third entry in the directory.

If that is correct, then steps #1 and #2 can happen without causing any 
problems.  It is only at steps #3 and #4 that the corruption occurs, and as 
long as dirdata is disabled before fsck is run, then there should not be any 
problems.

Is that explanation accurate?

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-11-03 Thread Patrick Farrell
Hm.  That's almost, but not quite, right.  Disabling dirdata during the 
fsck run has no positive effect - fsck will still get upset about the 
incorrectly placed entry.  (And whether or not dirdata is enabled, fsck 
will do the same thing.  It doesn't know or care about the dirdata 
setting as such.)


Steps #1 and #2 will not cause any problems until you run fsck, but 
there's no way around the issue once you do run fsck.  The .. dentry 
must go back to the correct location to make fsck happy.  If I remember 
right, fsck creates the .. dentry and doesn't include the fid 
(regardless of dirdata setting).  This can overwrite another dentry if 
one has been placed in the location normally reserved for the .. dentry 
(which can happen if the dentry which was after the .. dentry is 
deleted, thereby making a space large enough for a dentry+FID).


Furthermore, if you have a non-Htree directory where the .. dentry is 
incorrectly placed (your steps 1 & 2), then you add files until it 
shifts to become an HTree directory, THAT directory becomes corrupted in 
a more severe manner that will cause your MDT to remount read only 
and/or LBUG.  (LU-2638 only fixes the .. dentry bug for HTree 
directories themselves.  It does not help with a corrupted directory 
that then becomes an HTree directory.)


- Patrick

On 11/03/2015 11:17 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:

On Oct 27, 2015, at 1:46 PM, Patrick Farrell  wrote:

That's something of a time bomb - If one of those directories fsck wishes it 
could correct is small and grows in number of files, you'll get the MDT going 
read only (and a few odd LBUGs if you try to put it back).

I was looking back over the incident where I thought I had hit this bug, but 
based on the lack of side effects that you mentioned, I am now starting to 
think that I was mistaken.  Nevertheless, I am trying to understand the bug a 
little better in case I am still susceptible to it.  I tried to summarize my 
understanding below, and maybe you can tell me if I am correct.

For HTree directories, the problem is described in LU-2638.  But since I am 
running Lustre >2.4, I should not be affected by this bug.

For non-HTree directories, the problem is described in LU-5626.  In order to 
trigger the bug, the following steps must happen:

1) A non-HTree directory created under Lustre 1.8 (which does not have a FID 
for its “..” entry) gets moved to a different parent directory.

2) Lustre tries to update the “..” entry in the directory, and if there is not 
enough space in the existing entry, it creates a new “..” entry and adds the 
FID.

3) Something happens to the MDT, and fsck needs to be run.  When it runs, it 
notices that “..” is no longer the second entry in the directory.

4) fsck tries to “fix” the problem by moving the “..” entry back to its 
original position.  With the FID in place, there is not enough space in the 
original position, but fsck moves it anyway which causes the “..” entry to 
overwrite part of the third entry in the directory.

If that is correct, then steps #1 and #2 can happen without causing any 
problems.  It is only at steps #3 and #4 that the corruption occurs, and as 
long as dirdata is disabled before fsck is run, then there should not be any 
problems.

Is that explanation accurate?

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu





Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-11-03 Thread Mohr Jr, Richard Frank (Rick Mohr)

> On Nov 3, 2015, at 2:20 PM, Patrick Farrell  wrote:
> 
> Mmm, unfortunately, still not quite right - Disabling dirdata will not save 
> you in the conversion to HTree case either.  It will just prevent *more* 
> directories from getting a misplaced ".." dentry to begin with.

Sorry.  Yes, that is what I meant to say.  (My sentence was supposed to read 
“…correctly placed…”.  I was thinking of a non-corrupted non-HTree directory 
that was moved and then converted. Poor wording on my part.)

> As to size...  I figured it out once - But it depends on file name length in 
> the directory, since the dentry includes the file name. Once the total size 
> of dentries in a directory exceeds 4096 bytes (one block), then it will be 
> converted to an HTree, I believe.
> 
> So, at something like 32 bytes a dentry, which is like a 10-16 or so 
> character file name (exact dentry length here requires more checking than 
> I've got time for, but it's close), then you've got 32=2^5, 4096 = 2^12, so 
> 2^12/2^5 = 2^7 or 128 dentries.
> 
> But of course, longer file names --> bigger dentries --> fewer dentries 
> before conversion to HTree.

So it doesn’t seem like it takes many entries at all.  Interesting.  We have 
many directories much larger than that and no sign of any corruption.  I’ll 
have to spend some more time looking into this.
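
As a rough check of the arithmetic quoted above, here is a minimal Python sketch. It assumes the 4096-byte threshold from the quoted message and the standard ext4 entry layout (8-byte header plus the name, rounded up to 4 bytes). The "extra" parameter is a placeholder for whatever dirdata adds per entry, and it ignores the "." and ".." entries, so treat the numbers as estimates.

```python
BLOCK_SIZE = 4096    # threshold quoted in the message above
DIRENT_HEADER = 8    # inode (4) + rec_len (2) + name_len (1) + file_type (1)


def rec_len(name_len, extra=0):
    """Entry size in bytes; 'extra' stands in for any per-entry dirdata bytes."""
    return (DIRENT_HEADER + name_len + extra + 3) & ~3


def entries_before_htree(avg_rec_len, block_size=BLOCK_SIZE):
    """Estimate how many entries fit before the directory outgrows one block."""
    return block_size // avg_rec_len


for name_len in (10, 16, 32, 64):
    size = rec_len(name_len)
    print(f"name length {name_len:3d}: {size:3d} bytes/entry, "
          f"about {entries_before_htree(size)} entries")

# The 32-byte estimate from the quoted message:
print(entries_before_htree(32))
```

With 32-byte entries this gives the 128 entries from the quoted estimate, and longer names lower the count as described.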

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-11-03 Thread Patrick Farrell
Comment inline...

From: Mohr Jr, Richard Frank (Rick Mohr) [rm...@utk.edu]
Sent: Tuesday, November 03, 2015 4:47 PM
To: Patrick Farrell
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

> On Nov 3, 2015, at 2:20 PM, Patrick Farrell <p...@cray.com> wrote:
>
> Mmm, unfortunately, still not quite right - Disabling dirdata will not save 
> you in the conversion to HTree case either.  It will just prevent *more* 
> directories from getting a misplaced ".." dentry to begin with.

Sorry.  Yes, that is what I meant to say.  (My sentence was supposed to read 
“…correctly placed…”.  I was thinking of a non-corrupted non-HTree directory 
that was moved and then converted. Poor wording on my part.)

> As to size...  I figured it out once - But it depends on file name length in 
> the directory, since the dentry includes the file name. Once the total size 
> of dentries in a directory exceeds 4096 bytes (one inode), then it will be 
> converted to an HTree, I believe.
>
> So, at something like 32 bytes a dentry, which is like a 10-16 or so 
> character file name (exact dentry length here requires more checking than 
> I've got time for, but it's close), then you've got 32=2^5, 4096 = 2^12, so 
> 2^12/2^5 = 2^7 or 128 dentries.
>
> But of course, longer file names --> bigger dentries --> fewer dentries 
> before conversion to HTree.

So it doesn’t seem like it takes many entries at all.  Interesting.  We have 
many directories much larger than that and no sign of any corruption.  I’ll 
have to spend some more time looking into this.

PAF: Remember, the specific conditions are pretty tight.  Created under 1.8, 
not empty (if it's empty, the .. dentry is not misplaced when moved) but also 
non-htree, then moved with dirdata enabled, and then grown to this larger size.
How many existing (small) directories do you move and then add a bunch of 
files to?  It's a pretty rare operation.  We only hit it at Martin's site 
because of an automated tool they have to re-arrange user/job directories.


--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-11-02 Thread Martin Hecht
Hi Chris and Patrick,

I was sick last week, so I only saw this conversation today. Sorry.

On 10/27/2015 05:06 PM, Patrick Farrell wrote:
> If you read LU-5626 carefully, there's an explanation of the exact nature of 
> the damage, and having that should let you make partial recoveries by hand.  
> I'm not familiar with the ll_recover_lost_found_objs tool, but I doubt it 
> would prove helpful in this instance.
there is no tool like ll_recover_lost_found_objs for the MDT. On OSTs
this would be the right choice.

> Note that there's two forms to this corruption.  One is if you move a 
> directory which was created before dirdata was enabled, then the '..' entry 
> ends up in the wrong place.  This does not trouble Lustre, but fsck reports 
> it as an error and will 'correct' it, which has the effect of (usually) 
> overwriting one dentry in the directory when it creates a new '..' dentry in 
> the correct location.
>
> I don't *think* that one causes the MDT to go read only, but I could be 
> wrong.  I *think* what causes the MDT to go read only is the other problem:
>
> When you have a non-htree directory (not too many items in it, all directory 
> entries in a single inode) that is in the bad state described above (with the 
> '..' dentry in the wrong place after being moved) and that directory has 
> enough files added to it that it becomes an htree directory, the resulting 
> directory is corrupted more severely.  We never sorted out the precise 
> details of this - I believe we chose to simply delete any directories in this 
> state.  (I think lfsck did it for us, but can't recall for sure.)
If I recall correctly, moving (or renaming) the corrupted directory to
another place caused the MDT to go read-only; adding more files, as
Patrick wrote above, is probably another trigger.

In our case we captured the full output of e2fsck, which contained the
original names and the inodes. fsck moved some of the files and
subdirectories of the corrupted directories to lost+found. With the
information contained in the e2fsck output we could move them back from
lost+found to their original place on the ldiskfs level (I parsed
the e2fsck output for a pattern matching the inode numbers and created a
script out of it). We had to repeat this a couple of times, because
some of the subdirectories moved to lost+found were either in bad
shape themselves or were damaged further when the owners later added
files to them or moved them around.
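
A minimal Python sketch of that approach follows; it is not the script used here. The log line format matched by ASSUMED_LINE is made up for illustration and is not real e2fsck output, so the regular expression has to be adapted to what your captured log actually contains. The mount point /mnt/mdt is also hypothetical. The only part taken from e2fsck itself is that entries reconnected to lost+found are named "#" followed by the inode number.

```python
import re
import shlex

# ASSUMED line format, for illustration only (NOT real e2fsck output):
#   inode 12345 was /ROOT/projects/foo
ASSUMED_LINE = re.compile(r"^inode (\d+) was (/.+)$")


def parse_log(lines):
    """Map inode number -> original path for every line that matches."""
    found = {}
    for line in lines:
        match = ASSUMED_LINE.match(line.strip())
        if match:
            found[int(match.group(1))] = match.group(2)
    return found


def restore_commands(mapping, mdt_mount="/mnt/mdt"):
    """Build mv commands; e2fsck names reconnected inodes '#<inode>'."""
    commands = []
    for inode, path in sorted(mapping.items()):
        src = f"{mdt_mount}/lost+found/#{inode}"
        dst = f"{mdt_mount}{path}"
        commands.append(f"mv -- {shlex.quote(src)} {shlex.quote(dst)}")
    return commands


sample = ["inode 12345 was /ROOT/projects/foo", "unrelated line"]
for command in restore_commands(parse_log(sample)):
    print(command)  # review by hand before running anything
```

The sketch only prints the commands, so the list can be checked against the lost+found contents before anything is moved on the ldiskfs-mounted MDT.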

So, if you have captured all your e2fsck output and you haven't yet
cleaned up lost+found, you can still recover the data. lfsck would
probably throw away the objects on the OSTs because it thinks they are
orphan objects left over after deleting the files.

best regards,
Martin






Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-10-27 Thread Patrick Farrell
Chris,

I had the joy of taking this one apart personally.  We mostly let lfsck do the 
repair and moved on, accepting that some of the dentries were trashed.  I 
think, for important things, our field staff did some manual recovery with the 
e2fsprogs tools, but it was not a common enough problem that we documented a 
procedure.

If you read LU-5626 carefully, there's an explanation of the exact nature of 
the damage, and having that should let you make partial recoveries by hand.  
I'm not familiar with the ll_recover_lost_found_objs tool, but I doubt it would 
prove helpful in this instance.

Note that there's two forms to this corruption.  One is if you move a directory 
which was created before dirdata was enabled, then the '..' entry ends up in 
the wrong place.  This does not trouble Lustre, but fsck reports it as an error 
and will 'correct' it, which has the effect of (usually) overwriting one dentry 
in the directory when it creates a new '..' dentry in the correct location.

I don't *think* that one causes the MDT to go read only, but I could be wrong.  
I *think* what causes the MDT to go read only is the other problem:

When you have a non-htree directory (not too many items in it, all directory 
entries in a single inode) that is in the bad state described above (with the 
'..' dentry in the wrong place after being moved) and that directory has enough 
files added to it that it becomes an htree directory, the resulting directory 
is corrupted more severely.  We never sorted out the precise details of this - 
I believe we chose to simply delete any directories in this state.  (I think 
lfsck did it for us, but can't recall for sure.)

I'd advise reading LU-5626 with care, and I'd also suggest you might turn off 
'dirdata' on your MDT until you have this under control.  That will at least 
prevent any more directories from ending up in either of these bad states if 
you use the filesystem without updating Lustre to a version with the LU-5626 
patch in it.

- Patrick

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of 
Chris Hunter [chris.hun...@yale.edu]
Sent: Tuesday, October 27, 2015 10:22 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss]  recovery MDT ".." directory entries (LU-5626)

We have a lustre 1.8 filesystem that was upgraded to lustre 2.x and
"dirdata" feature was enabled. We encountered LU-5626/LU-2638 issue with
".." directory entries. Are there established recovery steps for this
issue ?

If I run fsck, the directory entries will be moved into lost+found.
I assume the next step is to run the ll_recover_lost_found_objs tool ?

Can you share any advice/experience about recovery ?

thanks,
chris hunter
chris.hun...@yale.edu



[lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-10-27 Thread Chris Hunter
We have a lustre 1.8 filesystem that was upgraded to lustre 2.x and 
"dirdata" feature was enabled. We encountered LU-5626/LU-2638 issue with 
".." directory entries. Are there established recovery steps for this 
issue ?


If I run fsck, the directory entries will be moved into lost+found.
I assume the next step is to run the ll_recover_lost_found_objs tool ?

Can you share any advice/experience about recovery ?

thanks,
chris hunter
chris.hun...@yale.edu



Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-10-27 Thread Patrick Farrell
Excuse me, I said 'lfsck' below, but I meant 'fsck'.

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of 
Patrick Farrell [p...@cray.com]
Sent: Tuesday, October 27, 2015 11:06 AM
To: Chris Hunter; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

Chris,

I had the joy of taking this one apart personally.  We mostly let lfsck do the 
repair and moved on, accepting that some of the dentries were trashed.  I 
think, for important things, our field staff did some manual recovery with the 
e2fsprogs tools, but it was not a common enough problem that we documented a 
procedure.

If you read LU-5626 carefully, there's an explanation of the exact nature of 
the damage, and having that should let you make partial recoveries by hand.  
I'm not familiar with the ll_recover_lost_found_objs tool, but I doubt it would 
prove helpful in this instance.

Note that there's two forms to this corruption.  One is if you move a directory 
which was created before dirdata was enabled, then the '..' entry ends up in 
the wrong place.  This does not trouble Lustre, but fsck reports it as an error 
and will 'correct' it, which has the effect of (usually) overwriting one dentry 
in the directory when it creates a new '..' dentry in the correct location.

I don't *think* that one causes the MDT to go read only, but I could be wrong.  
I *think* what causes the MDT to go read only is the other problem:

When you have a non-htree directory (not too many items in it, all directory 
entries in a single inode) that is in the bad state described above (with the 
'..' dentry in the wrong place after being moved) and that directory has enough 
files added to it that it becomes an htree directory, the resulting directory 
is corrupted more severely.  We never sorted out the precise details of this - 
I believe we chose to simply delete any directories in this state.  (I think 
lfsck did it for us, but can't recall for sure.)

I'd advise reading LU-5626 with care, and I'd also suggest you might turn off 
'dirdata' on your MDT until you have this under control.  That will at least 
prevent any more directories from ending up in either of these bad states if 
you use the filesystem without updating Lustre to a version with the LU-5626 
patch in it.

- Patrick

From: lustre-discuss [lustre-discuss-boun...@lists.lustre.org] on behalf of 
Chris Hunter [chris.hun...@yale.edu]
Sent: Tuesday, October 27, 2015 10:22 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss]  recovery MDT ".." directory entries (LU-5626)

We have a lustre 1.8 filesystem that was upgraded to lustre 2.x and
"dirdata" feature was enabled. We encountered LU-5626/LU-2638 issue with
".." directory entries. Are there established recovery steps for this
issue ?

If I run fsck, the directory entries will be moved into lost+found.
I assume the next step is to run the ll_recover_lost_found_objs tool ?

Can you share any advice/experience about recovery ?

thanks,
chris hunter
chris.hun...@yale.edu



Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-10-27 Thread Patrick Farrell

Chris,

That's probably best, to be safe.  By the way, this is one where (if I 
remember right) sometimes you run fsck, let it correct things, then you 
must run it again - As it will find new things to object about in the 
modified filesystem.  So if you weren't already, running fsck repeatedly 
until it doesn't complain is best.  (That's also a best practice anyway..)
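
A minimal Python sketch of "run it until it stops complaining", assuming e2fsck's documented exit codes (0 = no errors, 1 and 2 = errors corrected, 4 and above = problems it could not fix). The function name, the pass limit and the injectable "run" argument are my own choices for illustration. Note that -y lets e2fsck apply every fix it proposes, which in this thread is exactly what overwrites dentries, so keep the full output of every pass.

```python
import subprocess


def fsck_until_clean(device, max_passes=5, flags=("-f", "-y"),
                     run=subprocess.run):
    """Run e2fsck repeatedly; return the pass number that reported no errors."""
    for attempt in range(1, max_passes + 1):
        result = run(["e2fsck", *flags, device])
        code = result.returncode
        if code == 0:
            return attempt          # clean pass, nothing left to fix
        if code & ~3:
            # Bits other than 1 and 2 mean uncorrected errors or a failure.
            raise RuntimeError(f"e2fsck exited with {code} on pass {attempt}")
        # Exit code 1 or 2: errors were corrected, so check again.
    raise RuntimeError(f"{device} still not clean after {max_passes} passes")


# Usage (the device path is hypothetical; the target must be unmounted):
#   passes = fsck_until_clean("/dev/mdt_device")
```

The "run" argument exists only so the loop can be exercised with a stand-in instead of a real e2fsck invocation.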


I can't find a -d or -D option in my copy of fsck.  Not sure what it means?

Best of luck,
- Patrick

On 10/27/2015 12:52 PM, Chris Hunter wrote:
> Hi Patrick,
> Thanks for sharing your experience, looks like you did the bulk of 
> troubleshooting in the Jira ticket.
>
> I assume I should have a clean filesystem (ie. run fsck first) before 
> disabling the dirdata feature ?
>
> After I disable dirdata, I will need to run fsck with the "-D" option ?
>
> FYI, ll_recover_lost_found_objs tool will recover files from 
> lost+found on *OST* volumes (ie. moves them back into /O/0/dXX 
> directory) based on extended file attributes. Section 37.5 of the HPDD 
> manual.
>
> thanks
> chris hunter
> chris.hun...@yale.edu








Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-10-27 Thread Chris Hunter

Hi Patrick,
Thanks for sharing your experience, looks like you did the bulk of 
troubleshooting in the Jira ticket.


I assume I should have a clean filesystem (ie. run fsck first) before 
disabling the dirdata feature ?

After I disable dirdata, I will need to run fsck with the "-D" option ?

FYI, ll_recover_lost_found_objs tool will recover files from lost+found 
on *OST* volumes (ie. moves them back into /O/0/dXX directory) based on 
extended file attributes. Section 37.5 of the HPDD manual.


thanks
chris hunter
chris.hun...@yale.edu





Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-10-27 Thread Mohr Jr, Richard Frank (Rick Mohr)

> On Oct 27, 2015, at 11:22 AM, Chris Hunter  wrote:
> 
> We have a lustre 1.8 filesystem that was upgraded to lustre 2.x and "dirdata" 
> feature was enabled. We encountered LU-5626/LU-2638 issue with ".." directory 
> entries. Are there established recovery steps for this issue ?
> 
> If I run fsck, the directory entries will be moved into lost+found.
> I assume the next step is to run the ll_recover_lost_found_objs tool ?
> 
> Can you share any advice/experience about recovery ?

I only recall seeing the bug once on my file system (about a year after we 
upgraded), so it really hasn’t been a problem.  It has been a while, so I don’t 
remember the details.  But I think I just handled it by not letting fsck make 
any “corrections”.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



Re: [lustre-discuss] recovery MDT ".." directory entries (LU-5626)

2015-10-27 Thread Patrick Farrell

Rick,

That's something of a time bomb - If one of those directories fsck 
wishes it could correct is small and grows in number of files, you'll 
get the MDT going read only (and a few odd LBUGs if you try to put it back).


- Patrick

On 10/27/2015 12:18 PM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
>> On Oct 27, 2015, at 11:22 AM, Chris Hunter  wrote:
>>
>> We have a lustre 1.8 filesystem that was upgraded to lustre 2.x and "dirdata" feature was 
>> enabled. We encountered LU-5626/LU-2638 issue with ".." directory entries. Are there 
>> established recovery steps for this issue ?
>>
>> If I run fsck, the directory entries will be moved into lost+found.
>> I assume the next step is to run the ll_recover_lost_found_objs tool ?
>>
>> Can you share any advice/experience about recovery ?
>
> I only recall seeing the bug once on my file system (about a year after we 
> upgraded), so it really hasn’t been a problem.  It has been a while, so I don’t 
> remember the details.  But I think I just handled it by not letting fsck make 
> any “corrections”.
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu


