So, are you all good now? Thanks for the explanation, BTW!
On Tue, Jul 7, 2009 at 7:42 AM, Thomas Roth <t.r...@gsi.de> wrote:
> Hi,
>
> Mag Gam wrote:
>> Exactly the symptoms I had. How long were you running this for? Also,
>> how easy is it for you to reproduce this error?
>
> the MDS-going-on-strike instances happened only twice since we
> upgraded the cluster from Lustre 1.6.5.1 to 1.6.7.1 at the end of April.
> Since last week everything seems to work fine again. The difference: I
> had to move data off of one OST whose RAID announces hardware errors. To
> do that, I ran "lfs find --obd <OST> /lustre/<dir>", at first massively
> parallel, then with 6 processes, and for the last few directories only
> step-by-step. Of course I'm bewildered that such a well-defined
> operation should be able to break the MDT's operation, while the things
> our users do in their unlimited ingenuity did not.
> On the other hand, there is that issue with switching on quota. As I
> have reported earlier, "lfs quotacheck -ug" also leads to enormous loads
> on the MDT, finally stopping everything.
> Maybe it's more of a hardware issue.
>
>>
>> This should clear up your doubts. But you said you are running at
>> 1.6.7.1, which is bizarre because I was running at 1.6.7. Maybe this
>> could be a different bug?
>>
>> http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010167.html
>
> Well, that was the bug causing data corruption on the MDT. There were
> patches for 1.6.7.0 and then the patched release 1.6.7.1 to correct that.
> But now we experienced this stop of operation of the MDT. After curing
> it in the way I described earlier, there were no data corruptions or
> losses that could be attributed to this outage.
>
>
> Regards,
> Thomas
>
>
>>
>> On Fri, Jul 3, 2009 at 10:44 AM, Thomas Roth <t.r...@gsi.de> wrote:
>>>
>>> Mag Gam wrote:
>>>> http://lists.lustre.org/pipermail/lustre-discuss/2009-March/009928.html
>>>>
>>>> Look familiar?
>>>>
>>> Yes, I've read the thread - that's why I addressed you in addition to
>>> the list ;-)
>>>
>>> But I was not aware that this is supposed to be a bug in this particular
>>> Lustre version.
>>>
>>> Right now the MDT stops cooperating without any ll_mdt processes going
>>> up. Load is 0.5 or so on the MDT but no connections are possible.
>>> In the log I only noted some "still busy with 2 active RPCs" messages.
>>> I just hope I don't have to writeconf the MDT again - I learned on this
>>> list that this would be necessary if these RPCs are never finished.
>>>
>>> Regards,
>>> Thomas
>>>
>>>
>>>> On Fri, Jul 3, 2009 at 7:32 AM, Thomas Roth <t.r...@gsi.de> wrote:
>>>>> Hi,
>>>>>
>>>>> I didn't take notice of a discussion of such problems with 1.6.7.1. Do
>>>>> you know something more specific about it? We won't want to downgrade
>>>>> since our users are happier after the last upgrade (1.6.5 -> 1.6.7). And
>>>>> we don't have the 1.6.7.2 (Debian-) packages yet. But I could try to
>>>>> speed that up and force an upgrade if you told me that 1.6.7.1 wasn't
>>>>> really reliable.
>>>>>
>>>>> For the moment the problem seems to have been fixed by shutdown,
>>>>> fs-check and writeconf of all servers.
>>>>> However, I don't want to do that every other week ...
>>>>>
>>>>> Thanks a lot for your help,
>>>>> Thomas
>>>>>
>>>>> Mag Gam wrote:
>>>>>> Hi Tom:
>>>>>>
>>>>>> There was a known issue with 1.6.7.1. What I did was downgrade to
>>>>>> 1.6.6 and everything worked well. Or you can try upgrading, but there
>>>>>> is something def wrong with that version...
>>>>>>
>>>>>> If you like, I can help you offline. I should be free this weekend (I
>>>>>> have a long weekend)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 2, 2009 at 8:22 AM, Thomas Roth <t.r...@gsi.de> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> our MDT gets stuck and unresponsive with very high loads (Lustre
>>>>>>> 1.6.7.1, Kernel 2.6.22, 8 Core, 32GB RAM). The only thing calling
>>>>>>> attention is one ll_mt_?? process running with 100% cpu. Nothing unusual
>>>>>>> happening on the cluster before that.
>>>>>>> After reboot as well as after moving the service to another server, this
>>>>>>> behavior reappears. The initial stages - mounting MGS, mounting MDT,
>>>>>>> recovery - work fine, but then the load goes up and the system is
>>>>>>> rendered unusable.
>>>>>>>
>>>>>>> Atm, I don't know what to do, except shutting down all servers and
>>>>>>> possibly doing a writeconf everywhere.
>>>>>>>
>>>>>>> I see that a similar problem was reported by Mag in March this year, but
>>>>>>> no clues or solutions appeared.
>>>>>>> Any ideas?
>>>>>>>
>>>>>>> Yours,
>>>>>>> Thomas
>>>>>>>
>>>>> --
>>>>> --------------------------------------------------------------------
>>>>> Thomas Roth
>>>>> Department: Information Technology
>>>>> Location: SB3 1.262
>>>>> Phone: +49-6159-71 1453  Fax: +49-6159-71 2986
>>>>>
>>>>> GSI Helmholtzzentrum für Schwerionenforschung GmbH
>>>>> Planckstraße 1
>>>>> D-64291 Darmstadt
>>>>> www.gsi.de
>>>>>
>>>>> Limited liability company
>>>>> Registered office: Darmstadt
>>>>> Commercial register: Amtsgericht Darmstadt, HRB 1528
>>>>>
>>>>> Managing Director: Professor Dr. Horst Stöcker
>>>>>
>>>>> Chair of the Supervisory Board: Dr. Beatrix Vierkorn-Rudolph,
>>>>> Deputy: Ministerialdirigent Dr. Rolf Bernhardt
>>>>>
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss