Re: Crash and strange things on MDS

2013-03-15 Thread Greg Farnum
On Friday, March 8, 2013 at 3:29 PM, Kevin Decherf wrote:
 On Fri, Mar 01, 2013 at 11:12:17AM -0800, Gregory Farnum wrote:
  On Tue, Feb 26, 2013 at 4:49 PM, Kevin Decherf ke...@kdecherf.com 
  (mailto:ke...@kdecherf.com) wrote:
   You will find the archive here: snip
   The data is not anonymized. Interesting folders/files here are
   /user_309bbd38-3cff-468d-a465-dc17c260de0c/*
   
   
   
  Sorry for the delay, but I have retrieved this archive locally at
  least so if you want to remove it from your webserver you can do so.
  :) Also, I notice when I untar it that the file name includes
  filtered — what filters did you run it through?
  
  
  
 Hi Gregory,
  
 Do you have any news about it?
  

I wrote a couple tools to do log analysis and created a number of bugs to make 
the MDS more amenable to analysis as a result of this.
Having spot-checked some of your longer-running requests, they're all getattrs 
or setattrs contending on files in what look to be shared cache and php 
libraries. These cover a range from ~40 milliseconds to ~150 milliseconds. I'd 
look into what your split applications are sharing across those spaces.

On the up side for Ceph, 80% of your requests take 0 milliseconds and ~95% 
of them take less than 2 milliseconds. Hurray, it's not ridiculously slow most 
of the time. :)
-Greg
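
Not the tooling Greg used; just a minimal sketch for anyone who wants to
reproduce this kind of percentile summary. It assumes you have already
pulled the per-request latencies (in milliseconds) out of the MDS log into
a hypothetical latencies.txt, one value per line:

    # Rough latency percentiles from a one-value-per-line file.
    sort -n latencies.txt | awk '
        function pct(q,  i) { i = int(NR * q); if (i < 1) i = 1; return a[i] }
        { a[NR] = $1 }
        END { if (NR) printf "p50=%sms  p95=%sms  p99=%sms  max=%sms  (n=%d)\n",
                             pct(0.50), pct(0.95), pct(0.99), a[NR], NR }'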



Re: Crash and strange things on MDS

2013-03-15 Thread Marc-Antoine Perennou
On 15 March 2013 at 21:32, Greg Farnum g...@inktank.com wrote:

 On Friday, March 8, 2013 at 3:29 PM, Kevin Decherf wrote:
 On Fri, Mar 01, 2013 at 11:12:17AM -0800, Gregory Farnum wrote:
 On Tue, Feb 26, 2013 at 4:49 PM, Kevin Decherf ke...@kdecherf.com 
 (mailto:ke...@kdecherf.com) wrote:
 You will find the archive here: snip
 The data is not anonymized. Interesting folders/files here are
 /user_309bbd38-3cff-468d-a465-dc17c260de0c/*



 Sorry for the delay, but I have retrieved this archive locally at
 least so if you want to remove it from your webserver you can do so.
 :) Also, I notice when I untar it that the file name includes
 filtered — what filters did you run it through?



 Hi Gregory,

 Do you have any news about it?

 I wrote a couple tools to do log analysis and created a number of bugs to 
 make the MDS more amenable to analysis as a result of this.
 Having spot-checked some of your longer-running requests, they're all 
 getattrs or setattrs contending on files in what look to be shared cache and 
 php libraries. These cover a range from ~40 milliseconds to ~150 
 milliseconds. I'd look into what your split applications are sharing across 
 those spaces.

 On the up side for Ceph, 80% of your requests take 0 milliseconds and ~95% 
 of them take less than 2 milliseconds. Hurray, it's not ridiculously slow 
 most of the time. :)
 -Greg



Thank you a lot for these explanations, looking forward to these fixes!
Do you have any public bug reports regarding this that you could link us to?

Good luck, thank you for your great work and have a nice weekend

Marc-Antoine Perennou


Re: Crash and strange things on MDS

2013-03-15 Thread Greg Farnum
On Friday, March 15, 2013 at 3:40 PM, Marc-Antoine Perennou wrote:
 Thank you a lot for these explanations, looking forward for these fixes!
 Do you have some public bug reports regarding this to link us?
 
 Good luck, thank you for your great job and have a nice weekend
 
 Marc-Antoine Perennou 
Well, for now the fixes are for things like making analysis take less time and 
exporting timing information more easily. The most immediately applicable one is 
probably http://tracker.ceph.com/issues/4354, which I hope to start on next 
week and should be done by the end of the sprint.
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


Re: Crash and strange things on MDS

2013-02-26 Thread Kevin Decherf
On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote:
 On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote:
  On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote:
  Looks like you've got ~424k dentries pinned, and it's trying to keep
  400k inodes in cache. So you're still a bit oversubscribed, yes. This
  might just be the issue where your clients are keeping a bunch of
  inodes cached for the VFS (http://tracker.ceph.com/issues/3289).
 
  Thanks for the analyze. We use only one ceph-fuse client at this time
  which makes all high-load commands like rsync, tar and cp on a huge
  amount of files. Well, I will replace it by the kernel client.
 
 Oh, that bug is just an explanation of what's happening; I believe it
 exists in the kernel client as well.

After setting the mds cache size to 900k, storms are gone.
However we continue to observe high latency on some clients (always the
same clients): each IO takes between 40 and 90ms (for example with
Wordpress, it takes ~20 seconds to load all needed files...).
With a non-laggy client, IO requests take less than 1ms.

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: Crash and strange things on MDS

2013-02-26 Thread Gregory Farnum
On Tue, Feb 26, 2013 at 9:57 AM, Kevin Decherf ke...@kdecherf.com wrote:
 On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote:
 On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote:
  On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote:
  Looks like you've got ~424k dentries pinned, and it's trying to keep
  400k inodes in cache. So you're still a bit oversubscribed, yes. This
  might just be the issue where your clients are keeping a bunch of
  inodes cached for the VFS (http://tracker.ceph.com/issues/3289).
 
  Thanks for the analyze. We use only one ceph-fuse client at this time
  which makes all high-load commands like rsync, tar and cp on a huge
  amount of files. Well, I will replace it by the kernel client.

 Oh, that bug is just an explanation of what's happening; I believe it
 exists in the kernel client as well.

 After setting the mds cache size to 900k, storms are gone.
 However we continue to observe high latency on some clients (always the
 same clients): each IO takes between 40 and 90ms (for example with
 Wordpress, it takes ~20 seconds to load all needed files...).
 With a non-laggy client, IO requests take less than 1ms.

I can't be sure from that description, but it sounds like you've got
one client which is generally holding the RW caps on the files, and
then another client which comes in occasionally to read those same
files. That requires the first client to drop its caps, and involves a
couple round-trip messages and is going to take some time — this is an
unavoidable consequence if you have clients sharing files, although
there's probably still room for us to optimize.

Can you describe your client workload in a bit more detail?
-Greg
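
If you want to see that cap-handoff cost in isolation, a rough two-client
experiment along these lines should show it (the scratch file path is a
placeholder on the shared CephFS mount; this is an illustration, not a
benchmark):

    # On client A: open a scratch file read-write and keep it open for a
    # while, so this client holds the caps on it.
    ( exec 3<>/mnt/cephfs/captest; sleep 60 ) &

    # On client B: the first stat has to wait for client A's caps to be
    # given up, so it should show the extra round trips; repeat to compare.
    time stat /mnt/cephfs/captest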


Re: Crash and strange things on MDS

2013-02-26 Thread Kevin Decherf
On Tue, Feb 26, 2013 at 10:10:06AM -0800, Gregory Farnum wrote:
 On Tue, Feb 26, 2013 at 9:57 AM, Kevin Decherf ke...@kdecherf.com wrote:
  On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote:
  On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote:
   On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote:
   Looks like you've got ~424k dentries pinned, and it's trying to keep
   400k inodes in cache. So you're still a bit oversubscribed, yes. This
   might just be the issue where your clients are keeping a bunch of
   inodes cached for the VFS (http://tracker.ceph.com/issues/3289).
  
   Thanks for the analyze. We use only one ceph-fuse client at this time
   which makes all high-load commands like rsync, tar and cp on a huge
   amount of files. Well, I will replace it by the kernel client.
 
  Oh, that bug is just an explanation of what's happening; I believe it
  exists in the kernel client as well.
 
  After setting the mds cache size to 900k, storms are gone.
  However we continue to observe high latency on some clients (always the
  same clients): each IO takes between 40 and 90ms (for example with
  Wordpress, it takes ~20 seconds to load all needed files...).
  With a non-laggy client, IO requests take less than 1ms.
 
 I can't be sure from that description, but it sounds like you've got
 one client which is generally holding the RW caps on the files, and
 then another client which comes in occasionally to read those same
 files. That requires the first client to drop its caps, and involves a
 couple round-trip messages and is going to take some time — this is an
 unavoidable consequence if you have clients sharing files, although
 there's probably still room for us to optimize.
 
 Can you describe your client workload in a bit more detail?

We have one folder per application (php, java, ruby). Every application has
small (1M) files. The folder is mounted by only one client by default.

In case of overload, additional clients are spawned which mount the same
folder and access the same files.

In the following test, only one client was used to serve the
application (a website using wordpress).

I ran the test under strace to see the time of each IO request (strace -T
-e trace=file) and I noticed the same pattern:

...
[pid  4378] stat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.033409>
[pid  4378] lstat("/data/wp-includes/user.php", {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 <0.081642>
[pid  4378] open("/data/wp-includes/user.php", O_RDONLY) = 5 <0.041138>
[pid  4378] stat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.082303>
[pid  4378] lstat("/data/wp-includes/meta.php", {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 <0.004090>
[pid  4378] open("/data/wp-includes/meta.php", O_RDONLY) = 5 <0.081929>
...

~250 files were accessed for a single page request (thanks, Wordpress).
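
That is consistent with the page-load times mentioned earlier: ~250 files
times three metadata calls each, at a few tens of milliseconds per call, is
enough to account for ~20 second page loads. A quick way to rank the slowest
calls in such a trace (wp.strace is a hypothetical file holding the strace -T
output above):

    # Sort file-related syscalls by the per-call time strace -T appends in <...>.
    awk -F'[<>]' 'NF > 1 { print $2, $1 }' wp.strace | sort -rn | head -20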

The fs is mounted with these options: 
rw,noatime,name=hidden,secret=hidden,nodcache.

I have a debug (debug_mds=20) log of the active mds during this test if you 
want.
-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: Crash and strange things on MDS

2013-02-26 Thread Gregory Farnum
On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote:
 We have one folder per application (php, java, ruby). Every application has
 small (1M) files. The folder is mounted by only one client by default.

 In case of overload, another clients spawn to mount the same folder and
 access the same files.

 In the following test, only one client was used to serve the
 application (a website using wordpress).

 I made the test with strace to see the time of each IO request (strace -T
 -e trace=file) and I noticed the same pattern:

 ...
 [pid  4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
 st_size=28622, ...}) = 0 0.033409
 [pid  4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
 st_size=28622, ...}) = 0 0.081642
 [pid  4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138
 [pid  4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
 st_size=10896, ...}) = 0 0.082303
 [pid  4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
 st_size=10896, ...}) = 0 0.004090
 [pid  4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929
 ...

 ~250 files were accessed for only one request (thanks Wordpress.).

Okay, that is slower than I'd expect, even for an across-the-wire request...

 The fs is mounted with these options: 
 rw,noatime,name=hidden,secret=hidden,nodcache.

What kernel and why are you using nodcache? Did you have problems
without that mount option? That's forcing an MDS access for most
operations, rather than using local data.

 I have a debug (debug_mds=20) log of the active mds during this test if you 
 want.

Yeah, can you post it somewhere?
-Greg


Re: Crash and strange things on MDS

2013-02-26 Thread Kevin Decherf
On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote:
 On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote:
  We have one folder per application (php, java, ruby). Every application has
  small (1M) files. The folder is mounted by only one client by default.
 
  In case of overload, another clients spawn to mount the same folder and
  access the same files.
 
  In the following test, only one client was used to serve the
  application (a website using wordpress).
 
  I made the test with strace to see the time of each IO request (strace -T
  -e trace=file) and I noticed the same pattern:
 
  ...
  [pid  4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
  st_size=28622, ...}) = 0 0.033409
  [pid  4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
  st_size=28622, ...}) = 0 0.081642
  [pid  4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138
  [pid  4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
  st_size=10896, ...}) = 0 0.082303
  [pid  4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
  st_size=10896, ...}) = 0 0.004090
  [pid  4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929
  ...
 
  ~250 files were accessed for only one request (thanks Wordpress.).
 
 Okay, that is slower than I'd expect, even for an across-the-wire request...
 
  The fs is mounted with these options: 
  rw,noatime,name=hidden,secret=hidden,nodcache.
 
 What kernel and why are you using nodcache?

We use kernel 3.7.0. nodcache is enabled by default (we only specify user and
secretfile as mount options) and I didn't find it in the documentation of
mount.ceph.

 Did you have problems
 without that mount option? That's forcing an MDS access for most
 operations, rather than using local data.

Good question, I will try it (-o dcache?).
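
For reference, re-mounting with the dentry cache enabled would look roughly
like the line below (monitor address, mount point and secret file are
placeholders; dcache/nodcache are the kernel client mount options being
discussed here):

    # Placeholders: adjust monitor, mount point, name and secretfile to your setup.
    mount -t ceph mon1:6789:/ /data \
          -o name=hidden,secretfile=/etc/ceph/hidden.secret,noatime,dcache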

  I have a debug (debug_mds=20) log of the active mds during this test if you 
  want.
 
 Yeah, can you post it somewhere?

Upload in progress :-)

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: Crash and strange things on MDS

2013-02-26 Thread Gregory Farnum
On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf ke...@kdecherf.com wrote:
 On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote:
 On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote:
  We have one folder per application (php, java, ruby). Every application has
  small (1M) files. The folder is mounted by only one client by default.
 
  In case of overload, another clients spawn to mount the same folder and
  access the same files.
 
  In the following test, only one client was used to serve the
  application (a website using wordpress).
 
  I made the test with strace to see the time of each IO request (strace -T
  -e trace=file) and I noticed the same pattern:
 
  ...
  [pid  4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
  st_size=28622, ...}) = 0 0.033409
  [pid  4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
  st_size=28622, ...}) = 0 0.081642
  [pid  4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138
  [pid  4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
  st_size=10896, ...}) = 0 0.082303
  [pid  4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
  st_size=10896, ...}) = 0 0.004090
  [pid  4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929
  ...
 
  ~250 files were accessed for only one request (thanks Wordpress.).

 Okay, that is slower than I'd expect, even for an across-the-wire request...

  The fs is mounted with these options: 
  rw,noatime,name=hidden,secret=hidden,nodcache.

 What kernel and why are you using nodcache?

 We use kernel 3.7.0. nodcache is enabled by default (we only specify user and
 secretfile as mount options) and I didn't find it in the documentation of
 mount.ceph.

 Did you have problems
 without that mount option? That's forcing an MDS access for most
 operations, rather than using local data.

 Good question, I will try it (-o dcache?).

Oh right — I forgot Sage had enabled that by default; I don't recall
how necessary it is. (Sage?)


  I have a debug (debug_mds=20) log of the active mds during this test if 
  you want.

 Yeah, can you post it somewhere?

 Upload in progress :-)

Looking forward to it. ;)


Re: Crash and strange things on MDS

2013-02-26 Thread Yan, Zheng
On Wed, Feb 27, 2013 at 5:58 AM, Gregory Farnum g...@inktank.com wrote:
 On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf ke...@kdecherf.com wrote:
 On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote:
 On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote:
  We have one folder per application (php, java, ruby). Every application 
  has
  small (1M) files. The folder is mounted by only one client by default.
 
  In case of overload, another clients spawn to mount the same folder and
  access the same files.
 
  In the following test, only one client was used to serve the
  application (a website using wordpress).
 
  I made the test with strace to see the time of each IO request (strace -T
  -e trace=file) and I noticed the same pattern:
 
  ...
  [pid  4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
  st_size=28622, ...}) = 0 0.033409
  [pid  4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
  st_size=28622, ...}) = 0 0.081642
  [pid  4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138
  [pid  4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
  st_size=10896, ...}) = 0 0.082303
  [pid  4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
  st_size=10896, ...}) = 0 0.004090
  [pid  4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929
  ...
 
  ~250 files were accessed for only one request (thanks Wordpress.).

 Okay, that is slower than I'd expect, even for an across-the-wire request...

  The fs is mounted with these options: 
  rw,noatime,name=hidden,secret=hidden,nodcache.

 What kernel and why are you using nodcache?

 We use kernel 3.7.0. nodcache is enabled by default (we only specify user and
 secretfile as mount options) and I didn't find it in the documentation of
 mount.ceph.

 Did you have problems
 without that mount option? That's forcing an MDS access for most
 operations, rather than using local data.

 Good question, I will try it (-o dcache?).

 Oh right — I forgot Sage had enabled that by default; I don't recall
 how necessary it is. (Sage?)


That code is buggy; see ceph_dir_test_complete(), it always returns false.

Yan, Zheng


  I have a debug (debug_mds=20) log of the active mds during this test if 
  you want.

 Yeah, can you post it somewhere?

 Upload in progress :-)

 Looking forward to it. ;)


Re: Crash and strange things on MDS

2013-02-26 Thread Sage Weil
On Wed, 27 Feb 2013, Yan, Zheng  wrote:
 On Wed, Feb 27, 2013 at 5:58 AM, Gregory Farnum g...@inktank.com wrote:
  On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf ke...@kdecherf.com wrote:
  On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote:
  On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com 
  wrote:
   We have one folder per application (php, java, ruby). Every application 
   has
   small (1M) files. The folder is mounted by only one client by default.
  
   In case of overload, another clients spawn to mount the same folder and
   access the same files.
  
   In the following test, only one client was used to serve the
   application (a website using wordpress).
  
   I made the test with strace to see the time of each IO request (strace 
   -T
   -e trace=file) and I noticed the same pattern:
  
   ...
   [pid  4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
   st_size=28622, ...}) = 0 0.033409
   [pid  4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, 
   st_size=28622, ...}) = 0 0.081642
   [pid  4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138
   [pid  4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
   st_size=10896, ...}) = 0 0.082303
   [pid  4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, 
   st_size=10896, ...}) = 0 0.004090
   [pid  4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929
   ...
  
   ~250 files were accessed for only one request (thanks Wordpress.).
 
  Okay, that is slower than I'd expect, even for an across-the-wire 
  request...
 
   The fs is mounted with these options: 
   rw,noatime,name=hidden,secret=hidden,nodcache.
 
  What kernel and why are you using nodcache?
 
  We use kernel 3.7.0. nodcache is enabled by default (we only specify user 
  and
  secretfile as mount options) and I didn't find it in the documentation of
  mount.ceph.
 
  Did you have problems
  without that mount option? That's forcing an MDS access for most
  operations, rather than using local data.
 
  Good question, I will try it (-o dcache?).
 
  Oh right — I forgot Sage had enabled that by default; I don't recall
  how necessary it is. (Sage?)
 
 
 That code is buggy, see ceph_dir_test_complete(), it always return false.

FWIW, I think #4023 may be the root of those original bugs.  I think that 
should be up next (as far as fs/ceph goes) after this i_mutex stuff 
is sorted out.

sage


 
 Yan, Zheng
 
 
   I have a debug (debug_mds=20) log of the active mds during this test if 
   you want.
 
  Yeah, can you post it somewhere?
 
  Upload in progress :-)
 
  Looking forward to it. ;)


Re: Crash and strange things on MDS

2013-02-19 Thread Gregory Farnum
On Sat, Feb 16, 2013 at 10:24 AM, Kevin Decherf ke...@kdecherf.com wrote:
 On Sat, Feb 16, 2013 at 11:36:09AM -0600, Sam Lang wrote:
 On Fri, Feb 15, 2013 at 7:02 PM, Kevin Decherf ke...@kdecherf.com wrote:
  It seems better now, I didn't see any storm so far.
 
  But we observe high latency on some of our clients (with no load). Does
  it exist any documentation on how to read the perfcounters_dump output?
  I would like to know if the MDS still has any problem with its cache or
  if the latency comes from elsewhere.

 You can use perfcounters_schema to get the output format in json,
 using something like json_pp to make it somewhat readable for a human.
  You can post the dump output as well and we can try to help you
 parse/understand what its telling you.

 Ok, thanks.

 Here you will find the output of perfcounters_dump on the active MDS:
 http://pastebin.com/raw.php?i=BNerQ2fL

Looks like you've got ~424k dentries pinned, and it's trying to keep
400k inodes in cache. So you're still a bit oversubscribed, yes. This
might just be the issue where your clients are keeping a bunch of
inodes cached for the VFS (http://tracker.ceph.com/issues/3289).
-Greg


Re: Crash and strange things on MDS

2013-02-19 Thread Kevin Decherf
On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote:
 Looks like you've got ~424k dentries pinned, and it's trying to keep
 400k inodes in cache. So you're still a bit oversubscribed, yes. This
 might just be the issue where your clients are keeping a bunch of
 inodes cached for the VFS (http://tracker.ceph.com/issues/3289).

Thanks for the analysis. We use only one ceph-fuse client at this time,
which runs all the high-load commands like rsync, tar and cp on a huge
number of files. Well, I will replace it with the kernel client.

After some tests we increased the mds cache size to 900k.

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: Crash and strange things on MDS

2013-02-19 Thread Gregory Farnum
On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote:
 On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote:
 Looks like you've got ~424k dentries pinned, and it's trying to keep
 400k inodes in cache. So you're still a bit oversubscribed, yes. This
 might just be the issue where your clients are keeping a bunch of
 inodes cached for the VFS (http://tracker.ceph.com/issues/3289).

 Thanks for the analyze. We use only one ceph-fuse client at this time
 which makes all high-load commands like rsync, tar and cp on a huge
 amount of files. Well, I will replace it by the kernel client.

Oh, that bug is just an explanation of what's happening; I believe it
exists in the kernel client as well.
-Greg


Re: Crash and strange things on MDS

2013-02-16 Thread Sam Lang
On Fri, Feb 15, 2013 at 7:02 PM, Kevin Decherf ke...@kdecherf.com wrote:
 On Wed, Feb 13, 2013 at 10:19:36AM -0800, Gregory Farnum wrote:
 On Wed, Feb 13, 2013 at 3:47 AM, Kevin Decherf ke...@kdecherf.com wrote:
  On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
  On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com 
  wrote:
   Furthermore, I observe another strange thing more or less related to the
   storms.
  
   During a rsync command to write ~20G of data on Ceph and during (and
   after) the storm, one OSD sends a lot of data to the active MDS
   (400Mbps peak each 6 seconds). After a quick check, I found that when I
   stop osd.23, osd.14 stops its peaks.
 
  This is consistent with Sam's suggestion that MDS is thrashing its
  cache, and is grabbing a directory object off of the OSDs. How large
  are the directories you're using? If they're a significant fraction of
  your cache size, it might be worth enabling the (sadly less stable)
  directory fragmentation options, which will split them up into smaller
  fragments that can be independently read and written to disk.
 
  I set mds cache size to 40 but now I observe ~900Mbps peaks from
  osd.14 to the active mds, osd.18 and osd.2.
 
  osd.14 shares some pg with osd.18 and osd.2:
  http://pastebin.com/raw.php?i=uBAcTcu4

 The high bandwidth from OSD to MDS really isn't a concern — that's the
 MDS asking for data and getting it back quickly! We're concerned about
 client responsiveness; has that gotten better?

 It seems better now, I didn't see any storm so far.

 But we observe high latency on some of our clients (with no load). Does
 it exist any documentation on how to read the perfcounters_dump output?
 I would like to know if the MDS still has any problem with its cache or
 if the latency comes from elsewhere.

You can use perfcounters_schema to get the output format in JSON,
using something like json_pp to make it somewhat readable for a human.
You can post the dump output as well and we can try to help you
parse/understand what it's telling you.
-sam
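
Concretely, that goes through the MDS admin socket, along these lines (the
socket path is an example; point it at your active mds):

    ceph --admin-daemon /var/run/ceph/ceph-mds.b.asok perfcounters_schema | json_pp
    ceph --admin-daemon /var/run/ceph/ceph-mds.b.asok perfcounters_dump   | json_pp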


 --
 Kevin Decherf - @Kdecherf
 GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
 http://kdecherf.com


Re: Crash and strange things on MDS

2013-02-16 Thread Kevin Decherf
On Sat, Feb 16, 2013 at 11:36:09AM -0600, Sam Lang wrote:
 On Fri, Feb 15, 2013 at 7:02 PM, Kevin Decherf ke...@kdecherf.com wrote:
  It seems better now, I didn't see any storm so far.
 
  But we observe high latency on some of our clients (with no load). Does
  it exist any documentation on how to read the perfcounters_dump output?
  I would like to know if the MDS still has any problem with its cache or
  if the latency comes from elsewhere.
 
 You can use perfcounters_schema to get the output format in json,
 using something like json_pp to make it somewhat readable for a human.
  You can post the dump output as well and we can try to help you
 parse/understand what its telling you.

Ok, thanks.

Here you will find the output of perfcounters_dump on the active MDS:
http://pastebin.com/raw.php?i=BNerQ2fL

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: Crash and strange things on MDS

2013-02-15 Thread Kevin Decherf
On Wed, Feb 13, 2013 at 10:19:36AM -0800, Gregory Farnum wrote:
 On Wed, Feb 13, 2013 at 3:47 AM, Kevin Decherf ke...@kdecherf.com wrote:
  On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
  On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote:
   Furthermore, I observe another strange thing more or less related to the
   storms.
  
   During a rsync command to write ~20G of data on Ceph and during (and
   after) the storm, one OSD sends a lot of data to the active MDS
   (400Mbps peak each 6 seconds). After a quick check, I found that when I
   stop osd.23, osd.14 stops its peaks.
 
  This is consistent with Sam's suggestion that MDS is thrashing its
  cache, and is grabbing a directory object off of the OSDs. How large
  are the directories you're using? If they're a significant fraction of
  your cache size, it might be worth enabling the (sadly less stable)
  directory fragmentation options, which will split them up into smaller
  fragments that can be independently read and written to disk.
 
  I set mds cache size to 40 but now I observe ~900Mbps peaks from
  osd.14 to the active mds, osd.18 and osd.2.
 
  osd.14 shares some pg with osd.18 and osd.2:
  http://pastebin.com/raw.php?i=uBAcTcu4
 
 The high bandwidth from OSD to MDS really isn't a concern — that's the
 MDS asking for data and getting it back quickly! We're concerned about
 client responsiveness; has that gotten better?

It seems better now, I haven't seen any storm so far.

But we observe high latency on some of our clients (with no load). Is
there any documentation on how to read the perfcounters_dump output?
I would like to know whether the MDS still has a problem with its cache or
whether the latency comes from elsewhere.

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: Crash and strange things on MDS

2013-02-13 Thread Kevin Decherf
On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
 On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote:
  Furthermore, I observe another strange thing more or less related to the
  storms.
 
  During a rsync command to write ~20G of data on Ceph and during (and
  after) the storm, one OSD sends a lot of data to the active MDS
  (400Mbps peak each 6 seconds). After a quick check, I found that when I
  stop osd.23, osd.14 stops its peaks.
 
 This is consistent with Sam's suggestion that MDS is thrashing its
 cache, and is grabbing a directory object off of the OSDs. How large
 are the directories you're using? If they're a significant fraction of
 your cache size, it might be worth enabling the (sadly less stable)
 directory fragmentation options, which will split them up into smaller
 fragments that can be independently read and written to disk.

I set mds cache size to 40 but now I observe ~900Mbps peaks from
osd.14 to the active mds, osd.18 and osd.2.

osd.14 shares some pg with osd.18 and osd.2:
http://pastebin.com/raw.php?i=uBAcTcu4


-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: Crash and strange things on MDS

2013-02-13 Thread Gregory Farnum
On Wed, Feb 13, 2013 at 3:47 AM, Kevin Decherf ke...@kdecherf.com wrote:
 On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
 On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote:
  Furthermore, I observe another strange thing more or less related to the
  storms.
 
  During a rsync command to write ~20G of data on Ceph and during (and
  after) the storm, one OSD sends a lot of data to the active MDS
  (400Mbps peak each 6 seconds). After a quick check, I found that when I
  stop osd.23, osd.14 stops its peaks.

 This is consistent with Sam's suggestion that MDS is thrashing its
 cache, and is grabbing a directory object off of the OSDs. How large
 are the directories you're using? If they're a significant fraction of
 your cache size, it might be worth enabling the (sadly less stable)
 directory fragmentation options, which will split them up into smaller
 fragments that can be independently read and written to disk.

 I set mds cache size to 40 but now I observe ~900Mbps peaks from
 osd.14 to the active mds, osd.18 and osd.2.

 osd.14 shares some pg with osd.18 and osd.2:
 http://pastebin.com/raw.php?i=uBAcTcu4

The high bandwidth from OSD to MDS really isn't a concern — that's the
MDS asking for data and getting it back quickly! We're concerned about
client responsiveness; has that gotten better?
-Greg


Re: Crash and strange things on MDS

2013-02-11 Thread Sam Lang
On Mon, Feb 11, 2013 at 7:05 AM, Kevin Decherf ke...@kdecherf.com wrote:
 On Mon, Feb 04, 2013 at 07:01:54PM +0100, Kevin Decherf wrote:
 Hey everyone,

 It's my first post here to expose a potential issue I found today using
 Ceph 0.56.1.

 The cluster configuration is, briefly: 27 osd of ~900GB and 3 MON/MDS.
 All nodes are running Exherbo (source-based distribution) with Ceph
 0.56.1 and Linux 3.7.0. We are only using CephFS on this cluster which
 is mounted on ~60 clients (increasing each day). Objects are replicated
 three times and the cluster handles only 7GB of data atm for 350k
 objects.

 In certain conditions (I don't know them atm), some clients hang,
 generate CPU overloads (kworker) and are unable to make any IO on
 Ceph. The active MDS have ~20Mbps in/out during the issue (less than
 2Mbps in normal activity). I don't know if it's directly linked but we
 also observe a lot of missing files at the same time.

 The problem is similar to this one [1].

 A restart of the client or the MDS was enough before today, but we found
 a new behavior: the active MDS consumes a lot of CPU during 3 to 5 hours
 with ~25% clients hanging.

 In logs I found a segfault with this backtrace [2] and 100,000 dumped
 events during the first hang. We observed another hang which produces
 lot of these events (in debug mode):
- mds.0.server FAIL on ESTALE but attempting recovery
- mds.0.server reply_request -116 (Stale NFS file handle)
   client_request(client.10991:1031 getattr As #104bab0
   RETRY=132)

Hi Kevin, sorry for the delayed response.
This looks like the mds cache is thrashing quite a bit, and with
multiple MDSs the tree partitioning is causing those estale messages.
In your case, you should probably run with just a single active mds (I
assume all three MDSs are active, but ceph -s will tell you for sure),
and the others as standby.  I don't think you'll be able to do that
without starting over though.

Also, you might want to increase the size of the mds cache if you have
enough memory on that machine.  mds cache size defaults to 100k, you
might increase it to 300k and see if you get the same problems.
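
In ceph.conf terms that suggestion is just the following (a sketch; the
option is sized in inodes, so make sure the larger cache fits in RAM, and
restart the mds to apply it):

    [mds]
        # default is 100000 (100k) inodes
        mds cache size = 300000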


 We have no profiling tools available on these nodes, and I don't know
 what I should search in the 35 GB log file.

 Note: the segmentation fault occured only once but the problem was
 observed four times on this cluster.

 Any help may be appreciated.

 References:
 [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
 [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 1: /usr/bin/ceph-mds() [0x817e82]
 2: (()+0xf140) [0x7f9091d30140]
 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
 8: (Server::kill_session(Session*)+0x137) [0x549c67]
 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
 10: (MDS::tick()+0x338) [0x4da928]
 11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
 12: (SafeTimerThread::entry()+0xd) [0x782bad]
 13: (()+0x7ddf) [0x7f9091d28ddf]
 14: (clone()+0x6d) [0x7f90909cc24d]


 I found a possible cause/way to reproduce this issue.
 We now have ~90 clients for 18GB / 650k objects and the storm occurs
 when we execute an intensive IO command (tar of the whole pool / rsync
 in one folder) on one of our clients (the only one which uses ceph-fuse;
 I don't know if it's limited to it or not).

Do you have debug logging enabled when you see this crash?  Can you
compress that mds log and post it somewhere or email it to me?

Thanks,
-sam


 Any idea?

 Cheers,
 --
 Kevin Decherf - @Kdecherf
 GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
 http://kdecherf.com


Re: Crash and strange things on MDS

2013-02-11 Thread Kevin Decherf
On Mon, Feb 11, 2013 at 11:00:15AM -0600, Sam Lang wrote:
 Hi Kevin, sorry for the delayed response.
 This looks like the mds cache is thrashing quite a bit, and with
 multiple MDSs the tree partitioning is causing those estale messages.
 In your case, you should probably run with just a single active mds (I
 assume all three MDSs are active, but ceph -s will tell you for sure),
 and the others as standby.  I don't think you'll be able to do that
 without starting over though.

Hi Sam,

I know that MDS clustering is a bit buggy so I have only one active MDS
on this cluster.

Here is the output of ceph -s:

   ~ # ceph -s
  health HEALTH_OK
  monmap e1: 3 mons at {a=x:6789/0,b=y:6789/0,c=z:6789/0}, election epoch 
48, quorum 0,1,2 a,b,c
  osdmap e79: 27 osds: 27 up, 27 in
   pgmap v895343: 5376 pgs: 5376 active+clean; 18987 MB data, 103 GB used, 
21918 GB / 23201 GB avail
  mdsmap e73: 1/1/1 up {0=b=up:active}, 2 up:standby


 Also, you might want to increase the size of the mds cache if you have
 enough memory on that machine.  mds cache size defaults to 100k, you
 might increase it to 300k and see if you get the same problems.

I have 24GB of memory for each MDS, I will try to increase this value.
Thanks for the advice.

 Do you have debug logging enabled when you see this crash?  Can you
 compress that mds log and post it somewhere or email it to me?

Yes, I have 34GB of raw logs (for this issue) but I have no debug log
of the beginning of the storm itself. I will upload a compressed
archive.


Furthermore, I observe another strange thing more or less related to the
storms.

During a rsync command to write ~20G of data on Ceph and during (and
after) the storm, one OSD sends a lot of data to the active MDS
(400Mbps peak each 6 seconds). After a quick check, I found that when I
stop osd.23, osd.14 stops its peaks.

I will forward a copy of the debug enabled log of osd14.

The only significant difference between osd.23 and others is the list of
hb_in where osd.14 is missing (but I think it's unrelated).

   ~ # ceph pg dump
   osdstat  kbused   kbavail  kb hb in hb out
   0  4016228  851255948   901042464   
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   1  4108748  851163428   901042464   
[0,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23,24,25,26]  []
   2  4276584  850995592   901042464   
[0,1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   3  3997368  851274808   901042464   
[0,1,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] []
   4  4358212  850913964   901042464   
[0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   5  4039112  851233064   901042464   
[0,1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   6  3971568  851300608   901042464   
[0,1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   7  3942556  851329620   901042464   
[0,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   8  4275584  850996592   901042464   
[0,1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   9  4279308  850992868   901042464   
[0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]  []
   10 3728136  851544040   901042464   
[0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]   []
   11 3934096  851338080   901042464   
[0,1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]   []
   12 3991600  851280576   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,23,24,25,26]   []
   13 4211228  851060948   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,26]   []
   14 4169476  851102700   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26]   []
   15 4385584  850886592   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,16,17,18,19,20,21,22,23,24,25,26]   []
   16 3761176  851511000   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,18,19,20,21,22,23,24,25,26]   []
   17 3646096  851626080   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23,24,25,26]   []
   18 4119448  851152728   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19,20,21,22,23,24,25,26]   []
   19 4592992  850679184   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,20,21,22,23,24,25,26]   []
   20 3740840  851531336   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26]   []
   21 4363552  850908624   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,22,23,24,25,26]   []
   22 3831420  851440756   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,23,24,25,26]   []
   23 3681648  851590528   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,24,25,26]   []
   24 3946192  851325984   901042464   
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,25,26]   []
   25 

Re: Crash and strange things on MDS

2013-02-11 Thread Gregory Farnum
On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf ke...@kdecherf.com wrote:
 References:
 [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
 [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
 1: /usr/bin/ceph-mds() [0x817e82]
 2: (()+0xf140) [0x7f9091d30140]
 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
 8: (Server::kill_session(Session*)+0x137) [0x549c67]
 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
 10: (MDS::tick()+0x338) [0x4da928]
 11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
 12: (SafeTimerThread::entry()+0xd) [0x782bad]
 13: (()+0x7ddf) [0x7f9091d28ddf]
 14: (clone()+0x6d) [0x7f90909cc24d]

This in particular is quite odd. Do you have any logging from when
that happened? (Oftentimes the log can have a bunch of debugging
information from shortly before the crash.)

On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote:
 Furthermore, I observe another strange thing more or less related to the
 storms.

 During a rsync command to write ~20G of data on Ceph and during (and
 after) the storm, one OSD sends a lot of data to the active MDS
 (400Mbps peak each 6 seconds). After a quick check, I found that when I
 stop osd.23, osd.14 stops its peaks.

This is consistent with Sam's suggestion that MDS is thrashing its
cache, and is grabbing a directory object off of the OSDs. How large
are the directories you're using? If they're a significant fraction of
your cache size, it might be worth enabling the (sadly less stable)
directory fragmentation options, which will split them up into smaller
fragments that can be independently read and written to disk.
-Greg


Re: Crash and strange things on MDS

2013-02-11 Thread Kevin Decherf
On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
 On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf ke...@kdecherf.com wrote:
  References:
  [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
  [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
  1: /usr/bin/ceph-mds() [0x817e82]
  2: (()+0xf140) [0x7f9091d30140]
  3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
  4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
  5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
  6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
  7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
  8: (Server::kill_session(Session*)+0x137) [0x549c67]
  9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
  10: (MDS::tick()+0x338) [0x4da928]
  11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
  12: (SafeTimerThread::entry()+0xd) [0x782bad]
  13: (()+0x7ddf) [0x7f9091d28ddf]
  14: (clone()+0x6d) [0x7f90909cc24d]
 
 This in particular is quite odd. Do you have any logging from when
 that happened? (Oftentimes the log can have a bunch of debugging
 information from shortly before the crash.)

Yes, there is a dump of 100,000 events for this backtrace in the linked
archive (I need 7 hours to upload it).

 
 On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote:
  Furthermore, I observe another strange thing more or less related to the
  storms.
 
  During a rsync command to write ~20G of data on Ceph and during (and
  after) the storm, one OSD sends a lot of data to the active MDS
  (400Mbps peak each 6 seconds). After a quick check, I found that when I
  stop osd.23, osd.14 stops its peaks.
 
 This is consistent with Sam's suggestion that MDS is thrashing its
 cache, and is grabbing a directory object off of the OSDs. How large
 are the directories you're using? If they're a significant fraction of
 your cache size, it might be worth enabling the (sadly less stable)
 directory fragmentation options, which will split them up into smaller
 fragments that can be independently read and written to disk.

The distribution is heterogeneous: we have a folder of ~17G for 300k
objects, another of ~2G for 150k objects and a lot of smaller directories.
Are you talking about the mds bal frag and mds bal split * settings?
Do you have any advice about the value to use?

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Re: Crash and strange things on MDS

2013-02-11 Thread Gregory Farnum
On Mon, Feb 11, 2013 at 2:24 PM, Kevin Decherf ke...@kdecherf.com wrote:
 On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
 On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf ke...@kdecherf.com wrote:
  References:
  [1] http://www.spinics.net/lists/ceph-devel/msg04903.html
  [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
  1: /usr/bin/ceph-mds() [0x817e82]
  2: (()+0xf140) [0x7f9091d30140]
  3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
  4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
  5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
  6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
  7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
  8: (Server::kill_session(Session*)+0x137) [0x549c67]
  9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
  10: (MDS::tick()+0x338) [0x4da928]
  11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
  12: (SafeTimerThread::entry()+0xd) [0x782bad]
  13: (()+0x7ddf) [0x7f9091d28ddf]
  14: (clone()+0x6d) [0x7f90909cc24d]

 This in particular is quite odd. Do you have any logging from when
 that happened? (Oftentimes the log can have a bunch of debugging
 information from shortly before the crash.)

 Yes, there is a dump of 100,000 events for this backtrace in the linked
 archive (I need 7 hours to upload it).

Can you just pastebin the last couple hundred lines? I'm mostly
interested if there's anything from the function which actually caused
the assert/segfault. Also, the log should compress well and get much
smaller!

 On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote:
  Furthermore, I observe another strange thing more or less related to the
  storms.
 
  During a rsync command to write ~20G of data on Ceph and during (and
  after) the storm, one OSD sends a lot of data to the active MDS
  (400Mbps peak each 6 seconds). After a quick check, I found that when I
  stop osd.23, osd.14 stops its peaks.

 This is consistent with Sam's suggestion that MDS is thrashing its
 cache, and is grabbing a directory object off of the OSDs. How large
 are the directories you're using? If they're a significant fraction of
 your cache size, it might be worth enabling the (sadly less stable)
 directory fragmentation options, which will split them up into smaller
 fragments that can be independently read and written to disk.

 The distribution is heterogeneous: we have a folder of ~17G for 300k
 objects, another of ~2G for 150k objects and a lof of smaller directories.

Sorry, you mean 300,000 files in the single folder?
If so, that's definitely why it's behaving so badly — your folder is
larger than your maximum cache size settings, and so if you run an
ls or anything the MDS will read the whole thing off disk, then
instantly drop most of the folder from its cache. Then re-read again
for the next request to list contents, etc etc.

 Are you talking about the mds bal frag and mds bal split * settings?
 Do you have any advice about the value to use?
If you set mds bal frag = true in your config, it will split up
those very large directories into smaller fragments and behave a lot
better. This isn't quite as stable (thus the default to off), so if
you have the memory to just really up your cache size I'd start with
that and see if it makes your problems better. But if it doesn't,
directory fragmentation does work reasonably well and it's something
we'd be interested in bug reports for. :)
-Greg
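
For anyone trying this, the setting Greg names goes in the mds section of
ceph.conf, e.g. (off by default in 0.56, as noted above):

    [mds]
        # split very large directories into independently stored fragments
        mds bal frag = true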


Re: Crash and strange things on MDS

2013-02-11 Thread Kevin Decherf
On Mon, Feb 11, 2013 at 02:47:13PM -0800, Gregory Farnum wrote:
 On Mon, Feb 11, 2013 at 2:24 PM, Kevin Decherf ke...@kdecherf.com wrote:
  On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote:
  Yes, there is a dump of 100,000 events for this backtrace in the linked
  archive (I need 7 hours to upload it).
 
 Can you just pastebin the last couple hundred lines? I'm mostly
 interested if there's anything from the function which actually caused
 the assert/segfault. Also, the log should compress well and get much
 smaller!

Sent in pm.

And yes, I have a good compression rate but...

   % ls -lh 
   total 38G
   -rw-r--r-- 1 kdecherf kdecherf 3.3G Feb 11 18:36 cc-ceph-log.tar.gz
   -rw--- 1 kdecherf kdecherf  66M Feb  4 17:57 ceph.log
   -rw-r--r-- 1 kdecherf kdecherf 3.5G Feb  4 14:44 ceph-mds.b.log
   -rw-r--r-- 1 kdecherf kdecherf  31G Feb  5 15:55 ceph-mds.c.log
   -rw-r--r-- 1 kdecherf kdecherf  27M Feb 11 19:46 ceph-osd.14.log

;-)

  The distribution is heterogeneous: we have a folder of ~17G for 300k
  objects, another of ~2G for 150k objects and a lof of smaller directories.
 
 Sorry, you mean 300,000 files in the single folder?
 If so, that's definitely why it's behaving so badly — your folder is
 larger than your maximum cache size settings, and so if you run an
 ls or anything the MDS will read the whole thing off disk, then
 instantly drop most of the folder from its cache. Then re-read again
 for the next request to list contents, etc etc.

The biggest top-level folder contains 300k files, but they are split into
several subfolders (no subfolder contains more than 10,000 files at its
own level).

  Are you talking about the mds bal frag and mds bal split * settings?
  Do you have any advice about the value to use?
 If you set mds bal frag = true in your config, it will split up
 those very large directories into smaller fragments and behave a lot
 better. This isn't quite as stable (thus the default to off), so if
 you have the memory to just really up your cache size I'd start with
 that and see if it makes your problems better. But if it doesn't,
 directory fragmentation does work reasonably well and it's something
 we'd be interested in bug reports for. :)

I will try it, thanks!

-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com


Crash and strange things on MDS

2013-02-04 Thread Kevin Decherf
Hey everyone,

This is my first post here, to report a potential issue I found today using
Ceph 0.56.1.

The cluster configuration is, briefly: 27 osd of ~900GB and 3 MON/MDS.
All nodes are running Exherbo (source-based distribution) with Ceph
0.56.1 and Linux 3.7.0. We are only using CephFS on this cluster which
is mounted on ~60 clients (increasing each day). Objects are replicated
three times and the cluster handles only 7GB of data atm for 350k
objects.

Under certain conditions (I don't know which ones atm), some clients hang,
generate CPU overload (kworker) and are unable to perform any IO on
Ceph. The active MDS has ~20Mbps in/out during the issue (less than
2Mbps in normal activity). I don't know if it's directly linked, but we
also observe a lot of missing files at the same time.

The problem is similar to this one [1].

A restart of the client or the MDS was enough before today, but we found
a new behavior: the active MDS consumes a lot of CPU for 3 to 5 hours
with ~25% of clients hanging.

In the logs I found a segfault with this backtrace [2] and 100,000 dumped
events during the first hang. We observed another hang which produces a
lot of these events (in debug mode):
   - mds.0.server FAIL on ESTALE but attempting recovery
   - mds.0.server reply_request -116 (Stale NFS file handle)
  client_request(client.10991:1031 getattr As #104bab0
  RETRY=132)

We have no profiling tools available on these nodes, and I don't know
what I should search for in the 35 GB log file.
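
For reference, the debug mode mentioned above is the MDS debug level; a
sketch of raising it in ceph.conf (this is what produces logs of this size,
so use it sparingly):

    [mds]
        # very verbose; the debug_mds=20 level used for the logs in this thread
        debug mds = 20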

Note: the segmentation fault occurred only once but the problem was
observed four times on this cluster.

Any help may be appreciated.

References:
[1] http://www.spinics.net/lists/ceph-devel/msg04903.html
[2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
1: /usr/bin/ceph-mds() [0x817e82]
2: (()+0xf140) [0x7f9091d30140]
3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1]
4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9]
5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70]
6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90]
7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2]
8: (Server::kill_session(Session*)+0x137) [0x549c67]
9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6]
10: (MDS::tick()+0x338) [0x4da928]
11: (SafeTimer::timer_thread()+0x1af) [0x78151f]
12: (SafeTimerThread::entry()+0xd) [0x782bad]
13: (()+0x7ddf) [0x7f9091d28ddf]
14: (clone()+0x6d) [0x7f90909cc24d]

Cheers,
-- 
Kevin Decherf - @Kdecherf
GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F
http://kdecherf.com