Thank you Andreas! Your information is wonderful. I did the following:
I logged into my MDS (same as MGS) and issued the commands--

shell-prompt> mount -t lustre /dev/md1 /srv/lustre/mds/crew4-MDT0000

No errors so far.

shell-prompt> lctl
dl             (found my nids of failed JBODs)
device 14
deactivate
device 16
deactivate
quit

On one of our servers, I mounted the lustre disk /crew4.  The disk will
hang a UNIX df or ls command.  However....

lfs find --ost crew4-OST0001_UUID --ost crew4-OST0003_UUID \
         --ost crew4-OST0004_UUID -print /crew4

did indeed provide a list of files.  I saved the list to a text file.
I will next see if I am able to copy a single file to a new location.

Thank you again, Andreas, for this incredibly useful information.
Do you/Sun do paid Lustre consulting by any chance?

Later,
megan

On Jun 18, 12:48 am, Andreas Dilger <[EMAIL PROTECTED]> wrote:
> On Jun 16, 2008 15:37 -0700, megan wrote:
> > I am using the Lustre 2.6.18-53.1.13.el5_lustre.1.6.4.3smp kernel on a
> > CentOS 5 x86_64 linux box.
> > We had a hardware problem that caused the underlying ext3 partition
> > table to completely blow up.  This is resulting in only three of five
> > OSTs being mountable.  The main lustre disk of this unit cannot be
> > mounted because the MDS knows that two of its parts are missing.
>
> It should be possible to mount a Lustre filesystem with OSTs that
> are not available.  However, access to files on the unavailable
> OSTs will cause the process to wait on OST recovery.
>
> > The underlying set-up is JBOD hardware that is passed to the linux OS,
> > via an LSI 8888ELP card in this case, as a simple device, i.e. sde,
> > sdf, ...  The simple devices were partitioned using parted and
> > formatted ext3, then lustre was built on top of the five ext3 units.
> > There was no striping done across units/JBODs.  Three of the five
> > units passed an e2fsck and an lfsck.  Those remaining units are
> > mounted as such:
> > /dev/sdc    13T  6.3T  5.7T  53%  /srv/lustre/OST/crew4-OST0003
> > /dev/sdd    13T  6.3T  5.7T  53%  /srv/lustre/OST/crew4-OST0004
> > /dev/sdf    13T  6.2T  5.8T  52%  /srv/lustre/OST/crew4-OST0001
> >
> > Given that it is unlikely that we shall be able to recover the
> > underlying ext3 on the other two units, is there some method by which
> > I might try to rescue the data from the last three units currently
> > mounted on the OSS?
> >
> > Any and all suggestions genuinely appreciated.
>
> The recoverability of your data depends heavily on the striping of
> the individual files (i.e. the default striping).  If your files have
> a default stripe_count = 1, then you can probably recover 3/5 of the
> files in the filesystem.  If your default stripe_count = 2, then you
> can probably only recover 1/5 of the files, and if you have a higher
> stripe_count you probably can't recover any files.
>
> What you need to do is to mount one of the clients and mark the
> corresponding OSTs inactive with:
>
>   lctl dl                     # get device numbers for OSC 0000 and OSC 0002
>   lctl --device N deactivate
>
> Then, instead of the clients waiting for the OSTs to recover, the
> clients will get an IO error when they access files on the failed OSTs.
>
> To get a list of the files that are on the good OSTs, run:
>
>   lfs find --ost crew4-OST0001_UUID --ost crew4-OST0003_UUID \
>            --ost crew4-OST0004_UUID {mountpoint}
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
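P.S.  Roughly what I plan to try for the copy step is sketched below.  The
list filename, rescue directory, error-log path, and example file path are
just placeholders I made up, not anything from Andreas's instructions; the
only real commands are lfs getstripe and plain GNU cp.

  # confirm a file from the saved list only touches the surviving OSTs
  # (the path below is an example, not a real file)
  lfs getstripe /crew4/some/dir/somefile

  # copy everything in the saved lfs find output; --parents recreates the
  # directory layout under the rescue area, and any file that still hits
  # a dead OST should return an IO error (logged here) instead of hanging,
  # since those OSCs are deactivated
  while read -r f; do
      cp --parents "$f" /data/rescue/ 2>> /root/rescue_errors.log
  done < /root/good_files.txt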