Re: [Lustre-discuss] [HPDD-discuss] Recovering a failed OST

2014-05-28 Thread Martin Hecht
Hi Bob,

Just to make sure: have you already followed
http://wiki.lustre.org/index.php/Handling_File_System_Errors, especially
the steps for e2fsck linked there?

If you have *not yet* written anything to the damaged OST, you
might want to back up the whole OST first, using dd for instance (if the
underlying hardware still permits it).
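
A minimal sketch of such a backup, assuming GNU/Linux dd and enough free space on a different disk. The helper name backup_ost, the device and the image path are only examples, not something taken from the manual:

```shell
# Sketch only: image a block device (or a plain file) with dd.
# backup_ost is a hypothetical helper name; the paths are examples.
backup_ost() {
    src=$1   # e.g. /dev/sde, the damaged OST device
    dst=$2   # e.g. /backup/ost-sde.img, on a different, healthy disk
    # noerror: keep going after a read error instead of aborting.
    # sync:    pad every short or failed read with zero bytes up to bs,
    #          so data after an unreadable region keeps its offset.
    #          (A final partial block is padded as well.)
    # bs=65536 (64 KiB) limits how much is zero-filled per bad read.
    dd if="$src" of="$dst" bs=65536 conv=noerror,sync
}
```

It would be called as `backup_ost /dev/sde /backup/ost-sde.img` while the OST is not mounted. The image can then be examined or fsck'ed without touching the original disk again.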

If the situation described (empty O directory, lost LAST_ID entry)
occurred *after* the e2fsck, and you find lots of files in lost+found
when you mount the OST as ldiskfs, you can use
ll_recover_lost_found_objs to put them back in the correct place
(http://manpages.ubuntu.com/manpages/precise/man1/ll_recover_lost_found_objs.1.html).
It is part of the Lustre distribution. I once had to run it several
times in order to restore the directory structure.
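
The sequence could look roughly like this. It is a sketch under assumptions: the wrapper name recover_lost_found, the device and the mount point are made up for illustration, and the only options used are -v and -d from the man page linked above. With DRY_RUN=1 the commands are only printed:

```shell
# run prints the command when DRY_RUN=1, otherwise executes it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recover_lost_found is a hypothetical wrapper, not a Lustre tool.
recover_lost_found() {
    dev=$1   # e.g. /dev/sde
    mnt=$2   # e.g. /mnt/ost
    # The OST has to be mounted as ldiskfs, not as a Lustre target.
    run mount -t ldiskfs "$dev" "$mnt" || return 1
    # Move objects found in lost+found back below O/.
    run ll_recover_lost_found_objs -v -d "$mnt/lost+found"
    rc=$?
    run umount "$mnt"
    return "$rc"
}
```

Running `DRY_RUN=1 recover_lost_found /dev/sde /mnt/ost` first shows what would be executed. If files remain in lost+found afterwards, the same call can simply be repeated.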

Best regards,
Martin

On 05/19/2014 08:24 PM, Bob Ball wrote:
 Oh, better still, as I kept looking, and the low-level panic
 retreated, I found this on the MDT:

 [root@lmd02 ~]# lctl get_param osc.*.prealloc_next_id
 ...
 osc.umt3-OST0025-osc.prealloc_next_id=6778336

 So, unless someone tells me that I am way off base, I'm going to
 assume that this is a valid starting point and proceed to get my
 file system back online.

 bob

 On 5/19/2014 2:05 PM, Bob Ball wrote:
 Google first, ask later.  I found this in the manuals:


   26.3.4 Fixing a Bad LAST_ID on an OST

 The procedures there spell out pretty well what I must do, so this
 should be relatively straightforward.  But does this comment refer
 to just this OST, or to all OSTs?

   Note: The file system must be stopped on all servers before
   performing this procedure.

 So, is this the best approach to follow, allowing for the fact that
 there is nothing at all left on the OST, or is there a better
 shortcut to choosing an appropriate LAST_ID?

 Thanks again,
 bob


 On 5/19/2014 1:50 PM, Bob Ball wrote:
 I need to completely remake a failed OST.  I have done this in the
 past, but this time, the disk failed in such a way that I cannot
 fully get recovery information from the OST before I destroy and
 recreate.  In particular, I am unable to recover the LAST_ID file,
 but successfully retrieved the last_rcvd and CONFIGS/* files.

 mount -t ldiskfs /dev/sde /mnt/ost
 pushd /mnt/ost
 cd O
 cd 0
 cp -p LAST_ID /root/reformat/sde

 The O directory exists, but it is empty.  What can I do concerning
 this missing LAST_ID file?  I mean, I probably have something,
 somewhere, from some previous recovery, but that is way, way out of
 date.
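
 A sketch of a more forgiving way to save these files, so that a missing LAST_ID is reported instead of stopping the copy. The helper name save_ost_files is hypothetical; the file names are the ones mentioned in this message:

```shell
# save_ost_files is a hypothetical helper: copy last_rcvd, CONFIGS and
# O/0/LAST_ID from an ldiskfs-mounted OST and report what is missing.
save_ost_files() {
    mnt=$1    # e.g. /mnt/ost (OST mounted with -t ldiskfs)
    dest=$2   # e.g. /root/reformat/sde
    mkdir -p "$dest" || return 1
    missing=0
    for f in last_rcvd CONFIGS O/0/LAST_ID; do
        if [ -e "$mnt/$f" ]; then
            # -p keeps mode and timestamps, -R copies CONFIGS recursively
            cp -pR "$mnt/$f" "$dest/"
        else
            echo "missing: $f" >&2
            missing=$((missing + 1))
        fi
    done
    # Exit status is the number of files that could not be found.
    return "$missing"
}
```

 With the state described here, `save_ost_files /mnt/ost /root/reformat/sde` would copy last_rcvd and CONFIGS, print "missing: O/0/LAST_ID" and return 1.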

 My intent is to recreate this OST with the same index, and then put
 it back into production.  All files were moved off the OST before
 reaching this state, so nothing else needs to be recovered here.

 Thanks,
 bob

 ___
 HPDD-discuss mailing list
 hpdd-disc...@lists.01.org
 https://lists.01.org/mailman/listinfo/hpdd-discuss




___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] [HPDD-discuss] Recovering a failed OST

2014-05-22 Thread Bob Ball
Thanks for the advice.  Fortunately, the OST was completely drained of
files before all heck broke loose.  With the help of the manual, a
couple of Lustre list threads, and some long-lost memories of a similar
situation a few years back, I was able to bring the OST alive again,
albeit still read-only for the time being (2 days off for me, and now I
need to I/O-test it before I'll trust it again).


Cheers,
bob



Re: [Lustre-discuss] [HPDD-discuss] Recovering a failed OST

2014-05-19 Thread Bob Ball

Google first, ask later.  I found this in the manuals:


 26.3.4 Fixing a Bad LAST_ID on an OST

The procedures there spell out pretty well what I must do, so this
should be relatively straightforward.  But does this comment refer to
just this OST, or to all OSTs?

  Note: The file system must be stopped on all servers before
  performing this procedure.


So, is this the best approach to follow, allowing for the fact that
there is nothing at all left on the OST, or is there a better shortcut
to choosing an appropriate LAST_ID?
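
For reference, a sketch of how a chosen value could be turned into a LAST_ID file. It assumes LAST_ID is a single 8-byte little-endian integer, which is what reading it with `od -Ax -td8` on x86 suggests; the helper name write_last_id is hypothetical, and the value 6778336 is the prealloc_next_id reported elsewhere in this thread, used purely as an example. Whether that is the right number to write is for the manual procedure to decide:

```shell
# write_last_id is a hypothetical helper: store a decimal object ID
# as an 8-byte little-endian integer (assumed LAST_ID layout).
write_last_id() {
    id=$1    # decimal object ID, e.g. 6778336
    out=$2   # output file, e.g. /tmp/LAST_ID.new
    : > "$out"                      # start with an empty file
    i=0
    while [ "$i" -lt 8 ]; do
        # Take byte i, least significant byte first.
        byte=$(( (id >> (8 * i)) & 255 ))
        # Emit the byte via an octal escape, which every POSIX
        # printf understands (unlike \xHH).
        printf "\\$(printf '%03o' "$byte")" >> "$out"
        i=$((i + 1))
    done
}
```

After `write_last_id 6778336 /tmp/LAST_ID.new`, running `od -Ax -td8 /tmp/LAST_ID.new` on a little-endian machine should show the same number again. Writing to a scratch file first, rather than straight into O/0 on the OST, leaves room to check the result before copying it into place.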


Thanks again,
bob




Re: [Lustre-discuss] [HPDD-discuss] Recovering a failed OST

2014-05-19 Thread Bob Ball
Oh, better still, as I kept looking, and the low-level panic retreated,
I found this on the MDT:


[root@lmd02 ~]# lctl get_param osc.*.prealloc_next_id
...
osc.umt3-OST0025-osc.prealloc_next_id=6778336

So, unless someone tells me that I am way off base, I'm going to assume
that this is a valid starting point and proceed to get my file system
back online.
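
To pull just the one value out of that listing, something like the following should do. The helper name next_id_for is made up for illustration; it only assumes the name=value format shown above:

```shell
# next_id_for is a hypothetical helper: read "name=value" lines, as
# printed by lctl get_param, on stdin and print the value for one OST.
next_id_for() {
    ost=$1   # e.g. umt3-OST0025
    awk -F= -v key="osc.${ost}-osc.prealloc_next_id" \
        '$1 == key { print $2 }'
}
```

On the MDS it would be used as `lctl get_param 'osc.*.prealloc_next_id' | next_id_for umt3-OST0025`. Quoting the pattern keeps the shell from expanding the `*` itself. For a single known parameter, `lctl get_param -n osc.umt3-OST0025-osc.prealloc_next_id` prints the bare value as well.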


bob
