On 03/15/2013 05:17 PM, Greg Farnum wrote:
> [Putting list back on cc]
> 
> On Friday, March 15, 2013 at 4:11 PM, Jim Schutt wrote:
> 
>> On 03/15/2013 04:23 PM, Greg Farnum wrote:
>>> As I come back and look at these again, I'm not sure what the context
>>> for these logs is. Which test did they come from, and which behavior
>>> (slow or not slow, etc) did you see? :) -Greg
>>
>>
>>
>> They come from a test where I had debug mds = 20 and debug ms = 1
>> on the MDS while writing files from 198 clients. It turns out that 
>> for some reason I need debug mds = 20 during writing to reproduce
>> the slow stat behavior later.
>>
>> strace.find.dirs.txt.bz2 contains the log of running
>> strace -tt -o strace.find.dirs.txt find /mnt/ceph/stripe-4M -type d -exec ls -lhd {} \;
>>
>> From that output, I believe that the stat of at least these files is slow:
>> zero0.rc11
>> zero0.rc30
>> zero0.rc46
>> zero0.rc8
>> zero0.tc103
>> zero0.tc105
>> zero0.tc106
>> I believe that log shows slow stats on more files, but those are the
>> first few.
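
For anyone who wants to pull the slow calls out of a trace like this one, here is a rough sketch, not the method actually used above. The trace was taken with -tt but without -T, so per-call durations are not recorded. The sketch infers them from the gap between one line's timestamp and the next, which also includes whatever time find spent in userspace. The function name slow_stats, the 1-second threshold, and the set of stat-family syscall names are all assumptions made for illustration.

```python
import re
from datetime import datetime

# An `strace -tt` line starts with HH:MM:SS.microseconds, then the syscall.
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) (\w+)\((.*)$")

# Assumed set of stat-family syscalls; adjust for the platform being traced.
STAT_CALLS = {"stat", "lstat", "stat64", "lstat64",
              "fstatat64", "newfstatat", "statx"}


def slow_stats(lines, threshold=1.0):
    """Return (gap_seconds, syscall, rest_of_line) for each stat-family call
    whose timestamp is followed by a gap of at least `threshold` seconds."""
    results = []
    prev = None  # (timestamp, syscall name, rest of line) of the previous call
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip signal/exit lines and anything unparsable
        ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        if prev is not None:
            gap = (ts - prev[0]).total_seconds()
            if gap < 0:
                gap += 86400  # the trace crossed midnight
            if prev[1] in STAT_CALLS and gap >= threshold:
                results.append((gap, prev[1], prev[2]))
        prev = (ts, m.group(2), m.group(3).rstrip())
    return results
```

Calling slow_stats(open("strace.find.dirs.txt")) on the uncompressed trace would list the candidates. Re-running strace with -T added would give per-call durations directly and make the gap inference unnecessary.
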
>>
>> mds.cs28.slow-stat.partial.bz2 contains the MDS log from just before the
>> find command started, until just after the fifth or sixth slow stat from
>> the list above.
>>
>> I haven't yet tried to find other ways of reproducing this, but so far
>> it appears that something happens during the writing of the files that
>> ends up causing the condition that results in slow stat commands.
>>
>> I have the full MDS log from the writing of the files, as well, but it's
>> big....
>>
>> Is that what you were after?
>>
>> Thanks for taking a look!
>>
>> -- Jim
> 
> I was just coming back to these to see what new information was
> available, but I realized we'd discussed several tests and I wasn't
> sure what these ones came from. That information is enough, yes.
> 
> If in fact you believe you've only seen this with high-level MDS
> debugging, I believe the cause is as I mentioned last time: the MDS
> is flapping a bit and so some files get marked as "needsrecover", but
> they aren't getting recovered asynchronously, and the first thing
> that pokes them into doing a recover is the stat.

OK, that makes sense.

> That's definitely not the behavior we want and so I'll be poking
> around the code a bit and generating bugs, but given that explanation
> it's a bit less scary than random slow stats are so it's not such a
> high priority. :) Do let me know if you come across it without the
> MDS and clients having had connection issues!

No problem - thanks!

-- Jim


> -Greg
> 
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
> 

