Re: [Lustre-discuss] Inode errors at time of job failure
Hello!

On Aug 6, 2009, at 12:57 PM, Thomas Roth wrote:
> Hi,
> these ll_inode_revalidate_fini errors are unfortunately quite well known to us.
> So what would you guess if that happens again and again, on a number of
> clients - MDT softly dying away?

No, I do not think this is an MDT problem of any sort at present; it looks more like some strange client interaction. Are there any negative side effects in your case aside from log clutter? Jobs failing or anything like that?

> Because we haven't seen any mass evictions (and no reasons for that) in
> connection with these errors.
> Or could the problem with the cached open files also be present if the
> communication interruption does not show up as an eviction in the logs?

It has nothing to do with open files if there are no evictions.

I checked in bugzilla and found bug 16377, which looks like this report too, though the logs in there are somewhat confusing. It almost appears as if the failing dentry is reported as a mountpoint by the VFS, but then it is not, since the following inode_revalidate call ends up in Lustre again. Do you have "lookup on mtpt" sorts of errors coming from namei.c?

If you can reproduce the problem with ls or another tool at will, can you please execute this on a client (see comment #17 in bug 16377):

# script
Script started, file is typescript
# lctl clear
# echo -1 > /proc/sys/lnet/debug
[ reproduce problem ]
# lctl dk > /tmp/ls.debug
# exit
Script done, file is typescript

and attach the resulting ls.debug to the bug?

Also, what Lustre version are you using?

Bye,
Oleg
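As a rough illustration, the capture steps above can be collected into a small shell script. This is only a sketch, not from the original mail: it assumes it is run as root on the affected client, that lctl is in the PATH, and that an ls of the placeholder path /mnt/lustre/testdir reproduces the error.

#!/bin/bash
# Sketch of the debug-capture procedure described above (not from the original mail).
# Assumes: root on the affected client; /mnt/lustre/testdir is a placeholder for
# whatever path triggers the ll_inode_revalidate_fini error.
set -e

lctl clear                         # empty the kernel debug buffer
echo -1 > /proc/sys/lnet/debug     # enable all debug message types

ls -l /mnt/lustre/testdir || true  # replace with the command that reproduces the problem

lctl dk > /tmp/ls.debug            # dump the debug buffer to a file
echo "debug log written to /tmp/ls.debug"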
Re: [Lustre-discuss] Moving away from bugzilla
On Aug 5, 2009, at 2:53 PM, Christopher J. Morrone wrote:
> Mag Gam wrote:
>> Are there any plans to move away from Bugzilla for issue tracking? I
>> have been lurking around https://*bugzilla.lustre.org for several
>> months now and I still find it very hard to use. Do others have the
>> same feeling? Or is there a setting or a preferred filter to see all
>> the new bugs in the 1.8 series?
>
> I just want to voice my support for Bugzilla. I think it has been
> really great to use. Here at LLNL, we have probably opened hundreds of
> Lustre "issues" (bugs, trackers, future-improvement requests, etc.),
> and bugzilla has been a pleasure to use.

I'll second that. While we don't submit bugs ourselves (we receive Lustre support through a third party), we do use it in other ways, and it's been a fantastic resource.

Whenever I'm researching a Lustre problem, the very first thing I do is search bugzilla - *not* Google! Plugging the output from an LBUG into a Bugzilla search turns up a relevant bug more often than not.

Additionally, some information on what other sites are doing - especially large sites such as LLNL and ORNL - and the tools they use can be found by digging around in Bugzilla. See, for example, bz 20165, submitted by Jim Garlick @ LLNL, which has scripts for integrating heartbeat support into Lustre. While we're not using the failover bits, I did pull ldev out of Jim's patch, which is a fantastic tool that I wish I had taken the time to write myself months ago (thanks, Jim!).

However, Bugzilla's usefulness as a support tool for the Lustre community is somewhat hindered by the fact that some customers request that their support tickets be made private. They certainly have the right to do that, and I'm not knocking Sun or those customers for doing so. However, the data contained in those tickets can be rather useful to the community, and it would be helpful to have as many tickets as possible be publicly accessible. It's very frustrating to run a Bugzilla search and find a matching bug, only to be presented with a "not authorized" message when clicking on the bug's link.

This happened when searching for bugs related to the corruption introduced in Lustre 1.6.7. I believe we were the second site to report the corruption. The bug from the first site was marked private, which was a bit frustrating when we were trying to analyze the problem before requesting support, especially on a weekend when support isn't always available.

Sun has assured us that they are working on technical and procedural improvements to ensure that public versions of private bugs containing relevant technical data are made available to everyone. Until that happens, I'm putting out a call to those of you who do submit private bugs: either make them public in the first place, or strip out any private information before submitting them to Sun. If there's proprietary customer data contained in the bug you submit, that's one thing. But if you're embarrassed about pilot error, well, I'll be the first to admit that I've committed some myself!

Thanks,
j

> I have been forced to use some other issue tracking systems in the past
> that have made bugzilla seem a breath of fresh air in comparison.
>
> Chris

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
jason.rappl...@nasa.gov
Re: [Lustre-discuss] Large scale delete results in lag on clients
On Aug 06, 2009 15:08 -0400, Jim McCusker wrote:
> We have a 15 TB Lustre volume across 4 OSTs and we recently deleted over 4
> million files from it in order to free up the 80 GB MDT/MDS (going from 100%
> capacity on it to 81%). As a result, after the rm completed, there is
> significant lag on most file system operations (but fast access once it
> occurs), even after the two servers that host the targets were rebooted. It
> seems to clear up for a little while after a reboot, but comes back after
> some time.
>
> Any ideas?

The Lustre unlink processing is somewhat asynchronous, so you may still be catching up with the unlinks. You can check this by looking at the OSS service RPC stats file to see whether object destroys are still being processed by the OSTs. You could also just check the system load/IO on the OSTs to see how busy they are in a "no load" situation.

> For the curious, we host a large image archive (almost 400k images) and do
> research on processing them. We had a lot of intermediate files that we
> needed to clean up:
>
> http://krauthammerlab.med.yale.edu/imagefinder (currently laggy and
> unresponsive due to this problem)
>
> Thanks,
> Jim

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
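As a rough illustration of the check Andreas describes, something like the following could be run on each OSS. This is only a sketch: the /proc stats path and counter names are assumptions for a 1.8-era layout and may differ on other Lustre versions.

# Sketch only (not from the original mail): see whether the OSTs are still
# processing object destroys after a large unlink. The stats path below is an
# assumption for a 1.8-era OSS; adjust for your Lustre version.

grep -i destroy /proc/fs/lustre/ost/OSS/ost/stats
sleep 10
grep -i destroy /proc/fs/lustre/ost/OSS/ost/stats   # counter still climbing => unlink backlog

# Check how busy the OST disks are with no client load applied:
iostat -x 5 3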
[Lustre-discuss] Large scale delete results in lag on clients
We have a 15 TB Lustre volume across 4 OSTs, and we recently deleted over 4 million files from it in order to free up the 80 GB MDT/MDS (going from 100% capacity on it to 81%). As a result, after the rm completed, there is significant lag on most file system operations (but fast access once it occurs), even after the two servers that host the targets were rebooted. It seems to clear up for a little while after a reboot, but comes back after some time.

Any ideas?

For the curious, we host a large image archive (almost 400k images) and do research on processing them. We had a lot of intermediate files that we needed to clean up:

http://krauthammerlab.med.yale.edu/imagefinder (currently laggy and unresponsive due to this problem)

Thanks,
Jim
--
Jim McCusker
Programmer Analyst
Krauthammer Lab, Pathology Informatics
Yale School of Medicine
james.mccus...@yale.edu | (203) 785-6330
http://krauthammerlab.med.yale.edu
Re: [Lustre-discuss] building lustre on debian unstable
Hey,

> Can you please submit a bug with the above and attach the generated
> configure and config.log files? Also, posting the excerpt of the
> configure file around line 5542 here might allow someone else to
> diagnose what is going wrong.

Done... see #20383. I'll add the requested files tomorrow morning when I'm back in the office.

Greetings,
Patrick
Re: [Lustre-discuss] Inode errors at time of job failure
Hi,

these ll_inode_revalidate_fini errors are unfortunately quite well known to us. So what would you guess if that happens again and again, on a number of clients - the MDT softly dying away? Because we haven't seen any mass evictions (and no reasons for any) in connection with these errors. Or could the problem with the cached open files also be present if the communication interruption does not show up as an eviction in the logs?

Regards,
Thomas

Oleg Drokin wrote:
> Hello!
>
> On Aug 5, 2009, at 3:12 PM, Daniel Kulinski wrote:
>
>> What would cause the following error to appear?
>
> Typically this is some sort of race where you presume an inode exists
> (because you have some traces of it in memory), but it does not anymore
> (on the MDS, anyway). So when the client comes to fetch the inode
> attributes, there is nothing there anymore. Normally this should not
> happen, because Lustre uses locking to ensure caching consistency, but
> in some cases this is not true (e.g. open oftentimes returns a dentry
> without a lock). Also, if a client was evicted, cached open files
> cannot be revoked right away until they are closed.
>
>> LustreError: 10991:0:(file.c:2930:ll_inode_revalidate_fini())
>> failure -2 inode 14520180
>> This happened at the same time a job failed. Error number 2 is
>> ENOENT, which means that this inode does not exist?
>
> Right.
>
>> Is there a way to query the MDS to find out which file this inode
>> should have belonged to?
>
> Well, there is lfs find, which can search by inode number, but since
> there is no such inode anymore, there is no way to find out what name
> it was attached to (and the name likely does not exist either).
>
> Did you have a client eviction before this message, by any chance?
> What was the job doing at the time?
>
> Bye,
> Oleg
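For what it's worth, when an inode does still exist, a brute-force name lookup is possible from a client with GNU find. This is just a sketch, not from the thread: the mount point is a placeholder, and it walks the whole namespace, so it can be very slow on a large filesystem. For an ENOENT failure like the one above, the inode is already gone and there is nothing to find.

# Sketch: map an inode number back to a pathname while the inode still exists.
# /mnt/lustre is a placeholder mount point; 14520180 is the inode from the log.
find /mnt/lustre -inum 14520180 -print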
Re: [Lustre-discuss] building lustre on debian unstable
On Aug 06, 2009 14:34 +0200, Patrick Winnertz wrote:
> I've had huge problems for several days building Lustre on unstable; the
> cause seems to be something related to the auto* tools.
>
> configure is crashing with this error message:
> checking whether to build kernel modules... no (linux-gnu)
> ../../configure: line 5542: syntax error near unexpected token `else'
> ../../configure: line 5542: `else'
> make: *** [configure-stamp] Error 2
>
> I used automake 1.10 and autoconf 2.64. On an older system (e.g. lenny
> or etch) it builds without any problems. (The configure script is
> generated correctly.)

Can you please submit a bug with the above and attach the generated configure and config.log files? Also, posting the excerpt of the configure file around line 5542 here might allow someone else to diagnose what is going wrong.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
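One quick way to grab that excerpt is with sed; the line range below is just a generous window around the failing line.

# Print the generated configure script around the failing line 5542.
sed -n '5520,5560p' configure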
Re: [Lustre-discuss] building lustre on debian unstable
> Has anybody else hit this problem?

Hi all,

I ran into a similar issue building some other packages on sid. I think the problem is related to unstable using a newer version of the libtool/automake toolchain than the system the source was packaged on. The fix was to use the following runes to rebuild all the automake stuff, after which I had no build problems:

libtoolize --force --copy
aclocal-1.9
autoconf
automake-1.9 --add-missing

and then:

./configure ...

Cheers,
Guy
--
Dr. Guy Coates, Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802
--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
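For convenience, those steps could be collected into a small script. This is only a sketch, not from Guy's mail: it assumes it is run from the top of the source tree, and the fallback to the unversioned aclocal/automake is an assumption for systems without the -1.9 wrappers.

#!/bin/sh
# Sketch: regenerate the autotools files before configuring.
set -e
libtoolize --force --copy
if command -v aclocal-1.9 >/dev/null 2>&1; then
    aclocal-1.9
    autoconf
    automake-1.9 --add-missing
else
    aclocal
    autoconf
    automake --add-missing
fi
./configure    # add your usual configure options here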
[Lustre-discuss] building lustre on debian unstable
Hello,

I've had huge problems for several days building Lustre on unstable; the cause seems to be something related to the auto* tools.

configure is crashing with this error message:

checking whether to build kernel modules... no (linux-gnu)
../../configure: line 5542: syntax error near unexpected token `else'
../../configure: line 5542: `else'
make: *** [configure-stamp] Error 2

I used automake 1.10 and autoconf 2.64. On an older system (e.g. lenny or etch) it builds without any problems. (The configure script is generated correctly.)

Has anybody else hit this problem?

Greetings,
Patrick
Re: [Lustre-discuss] Problems upgrading from 1.6 to 1.8
Mag Gam wrote:
> Thanks for the response Chris.

Thank you for following up.

> On Wed, Aug 5, 2009 at 5:20 PM, Andreas Dilger wrote:
>> On Aug 05, 2009 18:45 +0100, Christopher J. Walker wrote:
>>> Aug 5 13:53:01 se02 kernel: LustreError:
>>> 2668:0:(lib-move.c:95:lnet_try_match_md()) Matching packet from
>>> 12345-10.1.4@tcp, match 1449 length 832 too big: 816 left, 816 allowed
>>
>> This looks like bug 20020, fixed in the 1.8.1 release. The 1.8.1 release
>> is GA, but I'm not sure if the packages have made it to the download site
>> yet or not.

They haven't - but I'll keep checking.

Thanks enormously, Andreas.

Chris