[Putting list back on cc]

On Friday, March 15, 2013 at 4:11 PM, Jim Schutt wrote:

> On 03/15/2013 04:23 PM, Greg Farnum wrote:
> > As I come back and look at these again, I'm not sure what the context
> > for these logs is. Which test did they come from, and which behavior
> > (slow or not slow, etc) did you see? :) -Greg
> 
> They come from a test where I had debug mds = 20 and debug ms = 1
> on the MDS while writing files from 198 clients. It turns out that 
> for some reason I need debug mds = 20 during writing to reproduce
> the slow stat behavior later.
> 
> strace.find.dirs.txt.bz2 contains the log of running 
> strace -tt -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls -lhd {} \;
> 
> From that output, I believe that the stat of at least these files is slow:
> zero0.rc11
> zero0.rc30
> zero0.rc46
> zero0.rc8
> zero0.tc103
> zero0.tc105
> zero0.tc106
> I believe that log shows slow stats on more files, but those are the first 
> few.
> 
> mds.cs28.slow-stat.partial.bz2 contains the MDS log from just before the
> find command started, until just after the fifth or sixth slow stat from
> the list above.
> 
> I haven't yet tried to find other ways of reproducing this, but so far
> it appears that something happens during the writing of the files that
> ends up causing the condition that results in slow stat commands.
> 
> I have the full MDS log from the writing of the files, as well, but it's
> big....
> 
> Is that what you were after?
> 
> Thanks for taking a look!
> 
> -- Jim

I was just coming back to these to see what new information was available, but
I realized we'd discussed several tests and I wasn't sure which one these came
from. That information is enough, yes.

If you've only seen this with high-level MDS debugging enabled, I believe the
cause is what I mentioned last time: the MDS is flapping a bit, so some files
get marked as "needsrecover", but they aren't being recovered asynchronously,
and the first thing that pokes them into doing a recover is the stat.
That's definitely not the behavior we want, so I'll be poking around the code
a bit and generating bugs. Given that explanation, though, it's a bit less
scary than random slow stats would be, so it's not such a high priority. :) Do
let me know if you come across it without the MDS and clients having had
connection issues!
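
For picking the slow stats out of strace.find.dirs.txt, something like the
rough Python sketch below might help. It assumes the `strace -tt` line layout
of an HH:MM:SS.microseconds timestamp followed by the syscall, and since `-T`
wasn't used it only approximates how long each call took by the gap to the
next traced line. The names slow_stats and STAT_CALLS and the 1-second
threshold are made up for illustration; they aren't part of strace or Ceph.

```python
import re
from datetime import datetime

# Assumed layout of a `strace -tt` line: "HH:MM:SS.uuuuuu syscall(args...) = ret"
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) (\w+)\((.*)$")

# Hypothetical set of stat-family syscalls to look for; adjust for your platform.
STAT_CALLS = {"stat", "lstat", "fstat", "newfstatat", "statx"}


def slow_stats(lines, threshold=1.0):
    """Return (timestamp, syscall, rest_of_line, gap_seconds) tuples for
    stat-family calls followed by a gap of at least `threshold` seconds.

    The gap to the next traced line is only an estimate of the call's
    duration; it also includes any time the traced process spent elsewhere.
    """
    parsed = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip signal/exit markers and anything unrecognized
        when = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        parsed.append((m.group(1), when, m.group(2), m.group(3).rstrip()))

    hits = []
    for cur, nxt in zip(parsed, parsed[1:]):
        stamp, when, name, rest = cur
        gap = (nxt[1] - when).total_seconds()
        if gap < 0:
            gap += 24 * 60 * 60  # -tt has no date, so allow for a midnight wrap
        if name in STAT_CALLS and gap >= threshold:
            hits.append((stamp, name, rest, gap))
    return hits
```

Passing it the open log file, e.g. slow_stats(open("strace.find.dirs.txt")),
should list the stat calls that were followed by a long pause, which can then
be matched by timestamp against the MDS log.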
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com