Re: [Lustre-discuss] Inode errors at time of job failure

2009-08-06 Thread Oleg Drokin
Hello!

On Aug 6, 2009, at 12:57 PM, Thomas Roth wrote:

> Hi,
> these ll_inode_revalidate_fini errors are unfortunately quite well
> known to us.
> So what would you guess if that happens again and again, on a number
> of clients - the MDT softly dying away?

No, I do not think this is an MDT problem of any sort at present; it looks
more like some strange client interaction.
Are there any negative side effects in your case aside from the log clutter?
Jobs failing or anything like that?

> Because we haven't seen any mass evictions (and no reasons for that) in
> connection with these errors.
> Or could the problem with the cached open files also be present if the
> communication interruption does not show up as an eviction in the logs?

It has nothing to do with open files if there are no evictions.
I checked in bugzilla and found bug 16377, which looks like this report
too, though the logs in there are somewhat confusing.
It almost appears as if the failing dentry is reported as a mountpoint
by the VFS, but then it is not, since the following inode_revalidate call
ends up in Lustre again.
Do you have "lookup on mtpt" sort of errors coming from namei.c?
If you can reproduce the problem with ls or another tool at will,
can you please execute this on a client (see comment #17 in bug 16377):
# script
Script started, file is typescript
# lctl clear
# echo -1 > /proc/sys/lnet/debug
[ reproduce problem ]
# lctl dk > /tmp/ls.debug
# exit
Script done, file is typescript

and attach the resulting ls.debug to the bug?

Also, what Lustre version are you using?

Bye,
 Oleg


Re: [Lustre-discuss] Moving away from bugzilla

2009-08-06 Thread Jason Rappleye

On Aug 5, 2009, at 2:53 PM, Christopher J. Morrone wrote:

> Mag Gam wrote:
>> Are there any plans to move away from Bugzilla for issue tracking? I
>> have been lurking around https://bugzilla.lustre.org for several
>> months now and I still find it very hard to use. Do others have the
>> same feeling? Or is there a setting or a preferred filter to see all
>> the new bugs in the 1.8 series?
>
> I just want to voice my support for Bugzilla.  I think it has been
> really great to use.  Here at LLNL, we have probably opened hundreds
> of Lustre "issues" (bugs, trackers, future-improvement requests, etc.),
> and Bugzilla has been a pleasure to use.

I'll second that. While we don't submit bugs ourselves (we receive  
Lustre support through a third party), we do use it in other ways, and  
it's been a fantastic resource.

Whenever I'm researching a Lustre problem, the very first thing I do
is search Bugzilla - *not* Google! Plugging the output from an LBUG
into a Bugzilla search turns up a relevant bug more often than not.

Additionally, information on what other sites are doing - especially
large sites such as LLNL and ORNL - and the tools they use can be found
by digging around in Bugzilla. See, for example, bz 20165, submitted by
Jim Garlick @ LLNL, which has scripts for integrating heartbeat support
into Lustre. While we're not using the failover bits, I did pull ldev
out of Jim's patch; it's a fantastic tool that I wish I had taken the
time to write myself months ago (thanks, Jim!).

However, Bugzilla's usefulness as a support tool for the Lustre
community is somewhat hindered by the fact that some customers request
that their support tickets be made private. They certainly have the
right to do that, and I'm not knocking Sun or those customers for
doing so. However, the data contained in those tickets can be rather
useful to the community, and it would be helpful to have as many
tickets as possible publicly accessible.

It's very frustrating to run a Bugzilla search and find a matching bug,
only to be presented with a "not authorized" message when clicking on
the bug's link. This happened when searching for bugs related to the
corruption introduced in Lustre 1.6.7. I believe we were the second
site to report the corruption. The bug from the first site was marked
private, which was a bit frustrating when we were trying to analyze
the problem before requesting support, especially on a weekend when
support isn't always available.

Sun has assured us that they are working on technical and procedural  
improvements to ensure that public versions of private bugs containing  
relevant technical data are made available to everyone. Until that  
happens, I'm putting out a call to those of you who do submit private  
bugs to either make them public in the first place, or strip out any  
private information before submitting them to Sun. If there's  
proprietary customer data contained in the bug you submit, that's one  
thing. But if you're embarrassed about pilot error, well, I'll be the  
first to admit that I've committed some myself!

Thanks,

j

>
>
> I have been forced to use some other issue tracking systems in the
> past that have made Bugzilla seem like a breath of fresh air in
> comparison.
>
> Chris

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
jason.rappl...@nasa.gov






Re: [Lustre-discuss] Large scale delete results in lag on clients

2009-08-06 Thread Andreas Dilger
On Aug 06, 2009  15:08 -0400, Jim McCusker wrote:
> We have a 15 TB Lustre volume across 4 OSTs and we recently deleted over 4
> million files from it in order to free up the 80 GB MDT/MDS (going from 100%
> capacity on it to 81%). As a result, after the rm completed, there is
> significant lag on most file system operations (but fast access once it
> occurs), even after the two servers that host the targets were rebooted. It
> seems to clear up for a little while after a reboot, but comes back after
> some time.
> 
> Any ideas?

The Lustre unlink processing is somewhat asynchronous, so the OSTs may still
be catching up with the unlinks.  You can check this by looking at the OSS
service RPC stats file to see if there are still object destroys being
processed by the OSTs.  You could also just check the system load/IO on the
OSTs to see how busy they are in a "no load" situation.
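
(For example - the /proc path below is the 1.8-era location of the OSS
service stats and may differ on your release - on each OSS you could run:)
# grep destroy /proc/fs/lustre/ost/OSS/ost/stats
# iostat -x 5
Re-run the grep after a few seconds; if the destroy counter keeps climbing,
the OSTs are still working through the unlink backlog.  iostat shows the
general disk load on the OST devices.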


> For the curious, we host a large image archive (almost 400k images) and do
> research on processing them. We had a lot of intermediate files that we
> needed to clean up:
> 
>  http://krauthammerlab.med.yale.edu/imagefinder (currently laggy and
> unresponsive due to this problem)
> 
> Thanks,
> Jim
> --
> Jim McCusker
> Programmer Analyst
> Krauthammer Lab, Pathology Informatics
> Yale School of Medicine
> james.mccus...@yale.edu | (203) 785-6330
> http://krauthammerlab.med.yale.edu



Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



[Lustre-discuss] Large scale delete results in lag on clients

2009-08-06 Thread Jim McCusker
We have a 15 TB Lustre volume across 4 OSTs and we recently deleted over 4
million files from it in order to free up the 80 GB MDT/MDS (going from 100%
capacity on it to 81%). As a result, after the rm completed, there is
significant lag on most file system operations (but fast access once it
occurs), even after the two servers that host the targets were rebooted. It
seems to clear up for a little while after a reboot, but comes back after
some time.

Any ideas?

For the curious, we host a large image archive (almost 400k images) and do
research on processing them. We had a lot of intermediate files that we
needed to clean up:

 http://krauthammerlab.med.yale.edu/imagefinder (currently laggy and
unresponsive due to this problem)

Thanks,
Jim
--
Jim McCusker
Programmer Analyst
Krauthammer Lab, Pathology Informatics
Yale School of Medicine
james.mccus...@yale.edu | (203) 785-6330
http://krauthammerlab.med.yale.edu


Re: [Lustre-discuss] building lustre on debian unstable

2009-08-06 Thread Patrick Winnertz
Hey
> Can you please submit a bug with the above and attach the generated
> configure and config.log files?  Also, posting the excerpt of the
> configure file around line 5542 here would possibly allow someone
> else to diagnose what is going wrong.
Done - see #20383.

I'll add the requested files tomorrow morning when I'm back in the
office.

Greetings
Patrick



Re: [Lustre-discuss] Inode errors at time of job failure

2009-08-06 Thread Thomas Roth
Hi,
these ll_inode_revalidate_fini errors are unfortunately quite well known to us.
So what would you guess if that happens again and again, on a number of
clients - the MDT softly dying away?
Because we haven't seen any mass evictions (and no reasons for that) in
connection with these errors.
Or could the problem with the cached open files also be present if the
communication interruption does not show up as an eviction in the logs?

Regards,
Thomas

Oleg Drokin wrote:
> Hello!
> 
> On Aug 5, 2009, at 3:12 PM, Daniel Kulinski wrote:
> 
>> What would cause the following error to appear?
> 
> Typically this is some sort of a race where you presume an inode exists
> (because you have some traces of it in memory), but it does not exist
> anymore (on the MDS, anyway). So when the client comes to fetch the
> inode attributes, there is nothing there anymore.
> Normally this should not happen, because Lustre uses locking to ensure
> caching consistency, but in some cases this is not true (e.g. open
> oftentimes returns a dentry without a lock). Also, if a client was
> evicted, its cached open files cannot be revoked right away; they
> linger until they are closed.
> 
>> LustreError: 10991:0:(file.c:2930:ll_inode_revalidate_fini())  
>> failure -2 inode 14520180
>> This happened at the same time a job failed.  Error number 2 is  
>> ENOENT which means that this inode does not exist?
> 
> Right.
> 
>> Is there a way to query the MDS to find out which file this inode  
>> should have belonged to?
> 
> Well, there is lfs find that can search by inode number, but since  
> there is no such inode anymore, there is no way
> to find out to what name it was attached (and the name likely does not  
> exist either).
> 
> Did you have client eviction before this message by any chance?
> What was the job doing at the time?
> 
> Bye,
>  Oleg



Re: [Lustre-discuss] building lustre on debian unstable

2009-08-06 Thread Andreas Dilger
On Aug 06, 2009  14:34 +0200, Patrick Winnertz wrote:
> I've been having huge problems for several days building Lustre on unstable;
> the cause seems to be something related to the auto* stuff.
> 
> configure is crashing with this error msg:
> checking whether to build kernel modules... no (linux-gnu)
> ../../configure: line 5542: syntax error near unexpected token
> `else' ../../configure: line 5542: `else'
> make: *** [configure-stamp] Error 2
> 
> I used automake 1.10 and autoconf 2.64. On an older system (e.g. lenny
> or etch) it builds without any problems. (The configure is generated
> correctly.)

Can you please submit a bug with the above and attach the generated
configure and config.log files?  Also, posting the excerpt of the configure
file around line 5542 here would possibly allow someone else to diagnose
what is going wrong.
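
(Something like this would capture a suitable excerpt - the range is just a
guess centered around the failing line:)
# sed -n '5520,5560p' configure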

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



Re: [Lustre-discuss] building lustre on debian unstable

2009-08-06 Thread Guy Coates

> Has anybody else hit this problem?

Hi all,

I ran  into a similar issue building some other packages on SID. I think the
problem is related to unstable using a newer version of the libtool/automake
toolchain than the system the source was packaged on.

The fix was to use the following runes to rebuild all the automake stuff, after
which I had no build problems.


libtoolize --force --copy    # copy fresh libtool support files into the tree
aclocal-1.9                  # regenerate aclocal.m4 from the m4 macros
autoconf                     # regenerate configure
automake-1.9 --add-missing   # regenerate Makefile.in and copy missing helpers

and then:

./configure ...

Cheers,

Guy

-- 
Dr. Guy Coates,  Informatics System Group
The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK
Tel: +44 (0)1223 834244 x 6925
Fax: +44 (0)1223 496802


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


[Lustre-discuss] building lustre on debian unstable

2009-08-06 Thread Patrick Winnertz
Hello,

I've been having huge problems for several days building Lustre on unstable;
the cause seems to be something related to the auto* stuff.

configure is crashing with this error msg:
checking whether to build kernel modules... no (linux-gnu)
../../configure: line 5542: syntax error near unexpected token
`else' ../../configure: line 5542: `else'
make: *** [configure-stamp] Error 2

I used automake 1.10 and autoconf 2.64. On an older system (e.g. lenny
or etch) it builds without any problems. (The configure is generated
correctly.)

Has anybody else hit this problem?

Greetings
Patrick



Re: [Lustre-discuss] Problems upgrading from 1.6 to 1.8

2009-08-06 Thread Christopher J.Walker
Mag Gam wrote:
> Thanks for the response Chris.
>

Thank you for following up.

> 
> 
> On Wed, Aug 5, 2009 at 5:20 PM, Andreas Dilger wrote:
>> On Aug 05, 2009  18:45 +0100, Christopher J.Walker wrote:
>>> Aug  5 13:53:01 se02 kernel: LustreError:
>>> 2668:0:(lib-move.c:95:lnet_try_match_md()) Matching packet from
>>> 12345-10.1.4@tcp, match 1449 length 832 too big: 816 left, 816 allowed
>> This looks like bug 20020, fixed in the 1.8.1 release.  The 1.8.1 release
>> is GA, but I'm not sure if the packages have made it to the download site
>> yet or not.
>>

They haven't - but I'll keep checking.

Thanks enormously, Andreas.

Chris