Re: Crash and strange things on MDS
On Friday, March 8, 2013 at 3:29 PM, Kevin Decherf wrote: On Fri, Mar 01, 2013 at 11:12:17AM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 4:49 PM, Kevin Decherf ke...@kdecherf.com (mailto:ke...@kdecherf.com) wrote: You will find the archive here: snip The data is not anonymized. Interesting folders/files here are /user_309bbd38-3cff-468d-a465-dc17c260de0c/* Sorry for the delay, but I have retrieved this archive locally at least so if you want to remove it from your webserver you can do so. :) Also, I notice when I untar it that the file name includes filtered — what filters did you run it through? Hi Gregory, Do you have any news about it? I wrote a couple tools to do log analysis and created a number of bugs to make the MDS more amenable to analysis as a result of this. Having spot-checked some of your longer-running requests, they're all getattrs or setattrs contending on files in what look to be shared cache and php libraries. These cover a range from ~40 milliseconds to ~150 milliseconds. I'd look into what your split applications are sharing across those spaces. On the up side for Ceph, 80% of your requests take 0 milliseconds and ~95% of them take less than 2 milliseconds. Hurray, it's not ridiculously slow most of the time. :) -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
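For anyone who wants to reproduce this kind of breakdown on their own cluster, a minimal Python sketch follows. It assumes you have already extracted one latency value per request (in milliseconds) from your MDS logs or tooling; the extraction step, thresholds and output format are illustrative assumptions, not the actual analysis tools mentioned above.

import sys

def summarize(latencies_ms):
    if not latencies_ms:
        return
    lat = sorted(latencies_ms)
    n = len(lat)
    zero = sum(1 for v in lat if v == 0)
    under2 = sum(1 for v in lat if v < 2)
    p95 = lat[min(n - 1, int(0.95 * n))]
    print("requests:        %d" % n)
    print("0 ms:            %.1f%%" % (100.0 * zero / n))
    print("< 2 ms:          %.1f%%" % (100.0 * under2 / n))
    print("95th percentile: %.1f ms" % p95)
    print("slowest:         %.1f ms" % lat[-1])

if __name__ == "__main__":
    # One latency value (milliseconds) per line on stdin.
    summarize([float(line) for line in sys.stdin if line.strip()])

Feed it one number per line, e.g. python summarize_latencies.py < request_latencies.txt (both file names are made up for the example).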
Re: Crash and strange things on MDS
On 15 March 2013 at 21:32, Greg Farnum g...@inktank.com wrote: On Friday, March 8, 2013 at 3:29 PM, Kevin Decherf wrote: On Fri, Mar 01, 2013 at 11:12:17AM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 4:49 PM, Kevin Decherf ke...@kdecherf.com (mailto:ke...@kdecherf.com) wrote: You will find the archive here: snip The data is not anonymized. Interesting folders/files here are /user_309bbd38-3cff-468d-a465-dc17c260de0c/* Sorry for the delay, but I have retrieved this archive locally at least so if you want to remove it from your webserver you can do so. :) Also, I notice when I untar it that the file name includes filtered — what filters did you run it through? Hi Gregory, Do you have any news about it? I wrote a couple tools to do log analysis and created a number of bugs to make the MDS more amenable to analysis as a result of this. Having spot-checked some of your longer-running requests, they're all getattrs or setattrs contending on files in what look to be shared cache and php libraries. These cover a range from ~40 milliseconds to ~150 milliseconds. I'd look into what your split applications are sharing across those spaces. On the up side for Ceph, 80% of your requests take 0 milliseconds and ~95% of them take less than 2 milliseconds. Hurray, it's not ridiculously slow most of the time. :) -Greg Thank you a lot for these explanations, looking forward to these fixes! Do you have any public bug reports about this that you could link us to? Good luck, thank you for your great work and have a nice weekend. Marc-Antoine Perennou
Re: Crash and strange things on MDS
On Friday, March 15, 2013 at 3:40 PM, Marc-Antoine Perennou wrote: Thank you a lot for these explanations, looking forward to these fixes! Do you have any public bug reports about this that you could link us to? Good luck, thank you for your great work and have a nice weekend. Marc-Antoine Perennou Well, for now the fixes are for things like making analysis take less time and exporting timing information more easily. The most immediately applicable one is probably http://tracker.ceph.com/issues/4354, which I hope to start on next week and should be done by the end of the sprint. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: Crash and strange things on MDS
On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote: On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote: Looks like you've got ~424k dentries pinned, and it's trying to keep 400k inodes in cache. So you're still a bit oversubscribed, yes. This might just be the issue where your clients are keeping a bunch of inodes cached for the VFS (http://tracker.ceph.com/issues/3289). Thanks for the analyze. We use only one ceph-fuse client at this time which makes all high-load commands like rsync, tar and cp on a huge amount of files. Well, I will replace it by the kernel client. Oh, that bug is just an explanation of what's happening; I believe it exists in the kernel client as well. After setting the mds cache size to 900k, storms are gone. However we continue to observe high latency on some clients (always the same clients): each IO takes between 40 and 90ms (for example with Wordpress, it takes ~20 seconds to load all needed files...). With a non-laggy client, IO requests take less than 1ms. -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Tue, Feb 26, 2013 at 9:57 AM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote: On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote: Looks like you've got ~424k dentries pinned, and it's trying to keep 400k inodes in cache. So you're still a bit oversubscribed, yes. This might just be the issue where your clients are keeping a bunch of inodes cached for the VFS (http://tracker.ceph.com/issues/3289). Thanks for the analyze. We use only one ceph-fuse client at this time which makes all high-load commands like rsync, tar and cp on a huge amount of files. Well, I will replace it by the kernel client. Oh, that bug is just an explanation of what's happening; I believe it exists in the kernel client as well. After setting the mds cache size to 900k, storms are gone. However we continue to observe high latency on some clients (always the same clients): each IO takes between 40 and 90ms (for example with Wordpress, it takes ~20 seconds to load all needed files...). With a non-laggy client, IO requests take less than 1ms. I can't be sure from that description, but it sounds like you've got one client which is generally holding the RW caps on the files, and then another client which comes in occasionally to read those same files. That requires the first client to drop its caps, and involves a couple round-trip messages and is going to take some time — this is an unavoidable consequence if you have clients sharing files, although there's probably still room for us to optimize. Can you describe your client workload in a bit more detail? -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
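To make the cost concrete, here is a back-of-the-envelope model (in Python, not the actual MDS cap protocol) of the extra round trip described above: when another client holds the caps, the MDS has to revoke them and wait for the acknowledgement before it can answer. The RTT figures are placeholders; real requests also pay MDS processing and locking time on top of the raw round trips.

def stat_latency_ms(rtt_client_mds_ms, rtt_other_client_ms, caps_held_elsewhere):
    # One round trip between the requesting client and the MDS for the
    # getattr/stat itself.
    latency = rtt_client_mds_ms
    if caps_held_elsewhere:
        # The MDS must ask the other client to drop its caps and wait for
        # the acknowledgement before it can answer authoritatively.
        latency += rtt_other_client_ms
    return latency

print("uncontended stat: %.1f ms" % stat_latency_ms(0.5, 0.5, False))
print("contended stat:   %.1f ms" % stat_latency_ms(0.5, 0.5, True))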
Re: Crash and strange things on MDS
On Tue, Feb 26, 2013 at 10:10:06AM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 9:57 AM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote: On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote: Looks like you've got ~424k dentries pinned, and it's trying to keep 400k inodes in cache. So you're still a bit oversubscribed, yes. This might just be the issue where your clients are keeping a bunch of inodes cached for the VFS (http://tracker.ceph.com/issues/3289). Thanks for the analyze. We use only one ceph-fuse client at this time which makes all high-load commands like rsync, tar and cp on a huge amount of files. Well, I will replace it by the kernel client. Oh, that bug is just an explanation of what's happening; I believe it exists in the kernel client as well. After setting the mds cache size to 900k, storms are gone. However we continue to observe high latency on some clients (always the same clients): each IO takes between 40 and 90ms (for example with Wordpress, it takes ~20 seconds to load all needed files...). With a non-laggy client, IO requests take less than 1ms. I can't be sure from that description, but it sounds like you've got one client which is generally holding the RW caps on the files, and then another client which comes in occasionally to read those same files. That requires the first client to drop its caps, and involves a couple round-trip messages and is going to take some time — this is an unavoidable consequence if you have clients sharing files, although there's probably still room for us to optimize. Can you describe your client workload in a bit more detail? We have one folder per application (php, java, ruby). Every application has small (1M) files. The folder is mounted by only one client by default. In case of overload, another clients spawn to mount the same folder and access the same files. In the following test, only one client was used to serve the application (a website using wordpress). I made the test with strace to see the time of each IO request (strace -T -e trace=file) and I noticed the same pattern: ... [pid 4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.033409 [pid 4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.081642 [pid 4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138 [pid 4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.082303 [pid 4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.004090 [pid 4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929 ... ~250 files were accessed for only one request (thanks Wordpress.). The fs is mounted with these options: rw,noatime,name=hidden,secret=hidden,nodcache. I have a debug (debug_mds=20) log of the active mds during this test if you want. -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
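As a rough cross-check of those numbers: a few hundred metadata calls at tens of milliseconds each adds up to tens of seconds, which is consistent with the ~20 second Wordpress page load mentioned earlier in the thread. For totalling the times from an strace -T -e trace=file capture like the one above, a small Python sketch follows; note that strace -T normally appends the elapsed time in angle brackets (e.g. = 0 <0.033409>), which the paste above has lost, so adjust the regular expression to whatever your capture actually contains.

import re
import sys
from collections import defaultdict

# Matches a line such as:
#   [pid  4378] stat("/data/wp-includes/user.php", {...}) = 0 <0.033409>
LINE = re.compile(r'(?P<call>\w+)\("(?P<path>[^"]*)".*<(?P<secs>\d+\.\d+)>\s*$')

def main(stream):
    total = 0.0
    calls = 0
    per_path = defaultdict(float)
    for line in stream:
        m = LINE.search(line)
        if not m:
            continue
        secs = float(m.group("secs"))
        total += secs
        calls += 1
        per_path[m.group("path")] += secs
    print("%d file syscalls, %.2f s spent in them" % (calls, total))
    for path, secs in sorted(per_path.items(), key=lambda kv: -kv[1])[:10]:
        print("%8.3f s  %s" % (secs, path))

if __name__ == "__main__":
    main(sys.stdin)

Run it as python sum_strace.py < strace.out (file names are just examples).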
Re: Crash and strange things on MDS
On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote: We have one folder per application (php, java, ruby). Every application has small (1M) files. The folder is mounted by only one client by default. In case of overload, another clients spawn to mount the same folder and access the same files. In the following test, only one client was used to serve the application (a website using wordpress). I made the test with strace to see the time of each IO request (strace -T -e trace=file) and I noticed the same pattern: ... [pid 4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.033409 [pid 4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.081642 [pid 4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138 [pid 4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.082303 [pid 4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.004090 [pid 4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929 ... ~250 files were accessed for only one request (thanks Wordpress.). Okay, that is slower than I'd expect, even for an across-the-wire request... The fs is mounted with these options: rw,noatime,name=hidden,secret=hidden,nodcache. What kernel and why are you using nodcache? Did you have problems without that mount option? That's forcing an MDS access for most operations, rather than using local data. I have a debug (debug_mds=20) log of the active mds during this test if you want. Yeah, can you post it somewhere? -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote: We have one folder per application (php, java, ruby). Every application has small (1M) files. The folder is mounted by only one client by default. In case of overload, another clients spawn to mount the same folder and access the same files. In the following test, only one client was used to serve the application (a website using wordpress). I made the test with strace to see the time of each IO request (strace -T -e trace=file) and I noticed the same pattern: ... [pid 4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.033409 [pid 4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.081642 [pid 4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138 [pid 4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.082303 [pid 4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.004090 [pid 4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929 ... ~250 files were accessed for only one request (thanks Wordpress.). Okay, that is slower than I'd expect, even for an across-the-wire request... The fs is mounted with these options: rw,noatime,name=hidden,secret=hidden,nodcache. What kernel and why are you using nodcache? We use kernel 3.7.0. nodcache is enabled by default (we only specify user and secretfile as mount options) and I didn't find it in the documentation of mount.ceph. Did you have problems without that mount option? That's forcing an MDS access for most operations, rather than using local data. Good question, I will try it (-o dcache?). I have a debug (debug_mds=20) log of the active mds during this test if you want. Yeah, can you post it somewhere? Upload in progress :-) -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote: We have one folder per application (php, java, ruby). Every application has small (1M) files. The folder is mounted by only one client by default. In case of overload, another clients spawn to mount the same folder and access the same files. In the following test, only one client was used to serve the application (a website using wordpress). I made the test with strace to see the time of each IO request (strace -T -e trace=file) and I noticed the same pattern: ... [pid 4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.033409 [pid 4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.081642 [pid 4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138 [pid 4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.082303 [pid 4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.004090 [pid 4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929 ... ~250 files were accessed for only one request (thanks Wordpress.). Okay, that is slower than I'd expect, even for an across-the-wire request... The fs is mounted with these options: rw,noatime,name=hidden,secret=hidden,nodcache. What kernel and why are you using nodcache? We use kernel 3.7.0. nodcache is enabled by default (we only specify user and secretfile as mount options) and I didn't find it in the documentation of mount.ceph. Did you have problems without that mount option? That's forcing an MDS access for most operations, rather than using local data. Good question, I will try it (-o dcache?). Oh right — I forgot Sage had enabled that by default; I don't recall how necessary it is. (Sage?) I have a debug (debug_mds=20) log of the active mds during this test if you want. Yeah, can you post it somewhere? Upload in progress :-) Looking forward to it. ;) -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Wed, Feb 27, 2013 at 5:58 AM, Gregory Farnum g...@inktank.com wrote: On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote: We have one folder per application (php, java, ruby). Every application has small (1M) files. The folder is mounted by only one client by default. In case of overload, another clients spawn to mount the same folder and access the same files. In the following test, only one client was used to serve the application (a website using wordpress). I made the test with strace to see the time of each IO request (strace -T -e trace=file) and I noticed the same pattern: ... [pid 4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.033409 [pid 4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.081642 [pid 4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138 [pid 4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.082303 [pid 4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.004090 [pid 4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929 ... ~250 files were accessed for only one request (thanks Wordpress.). Okay, that is slower than I'd expect, even for an across-the-wire request... The fs is mounted with these options: rw,noatime,name=hidden,secret=hidden,nodcache. What kernel and why are you using nodcache? We use kernel 3.7.0. nodcache is enabled by default (we only specify user and secretfile as mount options) and I didn't find it in the documentation of mount.ceph. Did you have problems without that mount option? That's forcing an MDS access for most operations, rather than using local data. Good question, I will try it (-o dcache?). Oh right — I forgot Sage had enabled that by default; I don't recall how necessary it is. (Sage?) That code is buggy, see ceph_dir_test_complete(), it always return false. Yan, Zheng I have a debug (debug_mds=20) log of the active mds during this test if you want. Yeah, can you post it somewhere? Upload in progress :-) Looking forward to it. ;) -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Wed, 27 Feb 2013, Yan, Zheng wrote: On Wed, Feb 27, 2013 at 5:58 AM, Gregory Farnum g...@inktank.com wrote: On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote: We have one folder per application (php, java, ruby). Every application has small (1M) files. The folder is mounted by only one client by default. In case of overload, another clients spawn to mount the same folder and access the same files. In the following test, only one client was used to serve the application (a website using wordpress). I made the test with strace to see the time of each IO request (strace -T -e trace=file) and I noticed the same pattern: ... [pid 4378] stat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.033409 [pid 4378] lstat(/data/wp-includes/user.php, {st_mode=S_IFREG|0750, st_size=28622, ...}) = 0 0.081642 [pid 4378] open(/data/wp-includes/user.php, O_RDONLY) = 5 0.041138 [pid 4378] stat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.082303 [pid 4378] lstat(/data/wp-includes/meta.php, {st_mode=S_IFREG|0750, st_size=10896, ...}) = 0 0.004090 [pid 4378] open(/data/wp-includes/meta.php, O_RDONLY) = 5 0.081929 ... ~250 files were accessed for only one request (thanks Wordpress.). Okay, that is slower than I'd expect, even for an across-the-wire request... The fs is mounted with these options: rw,noatime,name=hidden,secret=hidden,nodcache. What kernel and why are you using nodcache? We use kernel 3.7.0. nodcache is enabled by default (we only specify user and secretfile as mount options) and I didn't find it in the documentation of mount.ceph. Did you have problems without that mount option? That's forcing an MDS access for most operations, rather than using local data. Good question, I will try it (-o dcache?). Oh right — I forgot Sage had enabled that by default; I don't recall how necessary it is. (Sage?) That code is buggy, see ceph_dir_test_complete(), it always return false. FWIW, I think #4023 may be the root of those original bugs. I think that should be up next (as far as fs/ceph goes) after this i_mutex stuff is sorted out. sage Yan, Zheng I have a debug (debug_mds=20) log of the active mds during this test if you want. Yeah, can you post it somewhere? Upload in progress :-) Looking forward to it. ;)
Re: Crash and strange things on MDS
On Sat, Feb 16, 2013 at 10:24 AM, Kevin Decherf ke...@kdecherf.com wrote: On Sat, Feb 16, 2013 at 11:36:09AM -0600, Sam Lang wrote: On Fri, Feb 15, 2013 at 7:02 PM, Kevin Decherf ke...@kdecherf.com wrote: It seems better now, I didn't see any storm so far. But we observe high latency on some of our clients (with no load). Does it exist any documentation on how to read the perfcounters_dump output? I would like to know if the MDS still has any problem with its cache or if the latency comes from elsewhere. You can use perfcounters_schema to get the output format in json, using something like json_pp to make it somewhat readable for a human. You can post the dump output as well and we can try to help you parse/understand what its telling you. Ok, thanks. Here you will find the output of perfcounters_dump on the active MDS: http://pastebin.com/raw.php?i=BNerQ2fL Looks like you've got ~424k dentries pinned, and it's trying to keep 400k inodes in cache. So you're still a bit oversubscribed, yes. This might just be the issue where your clients are keeping a bunch of inodes cached for the VFS (http://tracker.ceph.com/issues/3289). -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
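For anyone wanting to make the same check against their own perfcounters_dump output, a short Python sketch follows. Counter names differ between Ceph versions, so dump perfcounters_schema first (as suggested earlier in the thread) and adjust the keys; the "mds", "inodes" and "inodes_pinned" names used here are assumptions for illustration, as is the cache-size constant.

import json
import sys

CACHE_SIZE = 400000  # whatever "mds cache size" is set to in ceph.conf

def main(path):
    with open(path) as f:
        counters = json.load(f)
    mds = counters.get("mds", {})
    inodes = mds.get("inodes", 0)
    pinned = mds.get("inodes_pinned", 0)
    print("inodes in cache: %d (configured limit %d)" % (inodes, CACHE_SIZE))
    print("inodes pinned:   %d" % pinned)
    if pinned > CACHE_SIZE:
        print("oversubscribed: pinned entries alone exceed the cache limit")

if __name__ == "__main__":
    main(sys.argv[1])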
Re: Crash and strange things on MDS
On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote: Looks like you've got ~424k dentries pinned, and it's trying to keep 400k inodes in cache. So you're still a bit oversubscribed, yes. This might just be the issue where your clients are keeping a bunch of inodes cached for the VFS (http://tracker.ceph.com/issues/3289). Thanks for the analysis. We use only one ceph-fuse client at this time, which runs all the high-load commands like rsync, tar and cp on a huge number of files. Well, I will replace it with the kernel client. After some tests we increased the mds cache size to 900k. -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com
Re: Crash and strange things on MDS
On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote: Looks like you've got ~424k dentries pinned, and it's trying to keep 400k inodes in cache. So you're still a bit oversubscribed, yes. This might just be the issue where your clients are keeping a bunch of inodes cached for the VFS (http://tracker.ceph.com/issues/3289). Thanks for the analyze. We use only one ceph-fuse client at this time which makes all high-load commands like rsync, tar and cp on a huge amount of files. Well, I will replace it by the kernel client. Oh, that bug is just an explanation of what's happening; I believe it exists in the kernel client as well. -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Fri, Feb 15, 2013 at 7:02 PM, Kevin Decherf ke...@kdecherf.com wrote: On Wed, Feb 13, 2013 at 10:19:36AM -0800, Gregory Farnum wrote: On Wed, Feb 13, 2013 at 3:47 AM, Kevin Decherf ke...@kdecherf.com wrote: On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote: On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote: Furthermore, I observe another strange thing more or less related to the storms. During a rsync command to write ~20G of data on Ceph and during (and after) the storm, one OSD sends a lot of data to the active MDS (400Mbps peak each 6 seconds). After a quick check, I found that when I stop osd.23, osd.14 stops its peaks. This is consistent with Sam's suggestion that MDS is thrashing its cache, and is grabbing a directory object off of the OSDs. How large are the directories you're using? If they're a significant fraction of your cache size, it might be worth enabling the (sadly less stable) directory fragmentation options, which will split them up into smaller fragments that can be independently read and written to disk. I set mds cache size to 40 but now I observe ~900Mbps peaks from osd.14 to the active mds, osd.18 and osd.2. osd.14 shares some pg with osd.18 and osd.2: http://pastebin.com/raw.php?i=uBAcTcu4 The high bandwidth from OSD to MDS really isn't a concern — that's the MDS asking for data and getting it back quickly! We're concerned about client responsiveness; has that gotten better? It seems better now, I didn't see any storm so far. But we observe high latency on some of our clients (with no load). Does it exist any documentation on how to read the perfcounters_dump output? I would like to know if the MDS still has any problem with its cache or if the latency comes from elsewhere. You can use perfcounters_schema to get the output format in json, using something like json_pp to make it somewhat readable for a human. You can post the dump output as well and we can try to help you parse/understand what its telling you. -sam -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Sat, Feb 16, 2013 at 11:36:09AM -0600, Sam Lang wrote: On Fri, Feb 15, 2013 at 7:02 PM, Kevin Decherf ke...@kdecherf.com wrote: It seems better now, I didn't see any storm so far. But we observe high latency on some of our clients (with no load). Does it exist any documentation on how to read the perfcounters_dump output? I would like to know if the MDS still has any problem with its cache or if the latency comes from elsewhere. You can use perfcounters_schema to get the output format in json, using something like json_pp to make it somewhat readable for a human. You can post the dump output as well and we can try to help you parse/understand what its telling you. Ok, thanks. Here you will find the output of perfcounters_dump on the active MDS: http://pastebin.com/raw.php?i=BNerQ2fL -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Wed, Feb 13, 2013 at 10:19:36AM -0800, Gregory Farnum wrote: On Wed, Feb 13, 2013 at 3:47 AM, Kevin Decherf ke...@kdecherf.com wrote: On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote: On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote: Furthermore, I observe another strange thing more or less related to the storms. During a rsync command to write ~20G of data on Ceph and during (and after) the storm, one OSD sends a lot of data to the active MDS (400Mbps peak each 6 seconds). After a quick check, I found that when I stop osd.23, osd.14 stops its peaks. This is consistent with Sam's suggestion that MDS is thrashing its cache, and is grabbing a directory object off of the OSDs. How large are the directories you're using? If they're a significant fraction of your cache size, it might be worth enabling the (sadly less stable) directory fragmentation options, which will split them up into smaller fragments that can be independently read and written to disk. I set mds cache size to 40 but now I observe ~900Mbps peaks from osd.14 to the active mds, osd.18 and osd.2. osd.14 shares some pg with osd.18 and osd.2: http://pastebin.com/raw.php?i=uBAcTcu4 The high bandwidth from OSD to MDS really isn't a concern — that's the MDS asking for data and getting it back quickly! We're concerned about client responsiveness; has that gotten better? It seems better now, I didn't see any storm so far. But we observe high latency on some of our clients (with no load). Does it exist any documentation on how to read the perfcounters_dump output? I would like to know if the MDS still has any problem with its cache or if the latency comes from elsewhere. -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote: On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote: Furthermore, I observe another strange thing more or less related to the storms. During a rsync command to write ~20G of data on Ceph and during (and after) the storm, one OSD sends a lot of data to the active MDS (400Mbps peak each 6 seconds). After a quick check, I found that when I stop osd.23, osd.14 stops its peaks. This is consistent with Sam's suggestion that MDS is thrashing its cache, and is grabbing a directory object off of the OSDs. How large are the directories you're using? If they're a significant fraction of your cache size, it might be worth enabling the (sadly less stable) directory fragmentation options, which will split them up into smaller fragments that can be independently read and written to disk. I set mds cache size to 40 but now I observe ~900Mbps peaks from osd.14 to the active mds, osd.18 and osd.2. osd.14 shares some pg with osd.18 and osd.2: http://pastebin.com/raw.php?i=uBAcTcu4 -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Wed, Feb 13, 2013 at 3:47 AM, Kevin Decherf ke...@kdecherf.com wrote: On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote: On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote: Furthermore, I observe another strange thing more or less related to the storms. During a rsync command to write ~20G of data on Ceph and during (and after) the storm, one OSD sends a lot of data to the active MDS (400Mbps peak each 6 seconds). After a quick check, I found that when I stop osd.23, osd.14 stops its peaks. This is consistent with Sam's suggestion that MDS is thrashing its cache, and is grabbing a directory object off of the OSDs. How large are the directories you're using? If they're a significant fraction of your cache size, it might be worth enabling the (sadly less stable) directory fragmentation options, which will split them up into smaller fragments that can be independently read and written to disk. I set mds cache size to 40 but now I observe ~900Mbps peaks from osd.14 to the active mds, osd.18 and osd.2. osd.14 shares some pg with osd.18 and osd.2: http://pastebin.com/raw.php?i=uBAcTcu4 The high bandwidth from OSD to MDS really isn't a concern — that's the MDS asking for data and getting it back quickly! We're concerned about client responsiveness; has that gotten better? -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Mon, Feb 11, 2013 at 7:05 AM, Kevin Decherf ke...@kdecherf.com wrote: On Mon, Feb 04, 2013 at 07:01:54PM +0100, Kevin Decherf wrote: Hey everyone, It's my first post here to expose a potential issue I found today using Ceph 0.56.1. The cluster configuration is, briefly: 27 osd of ~900GB and 3 MON/MDS. All nodes are running Exherbo (source-based distribution) with Ceph 0.56.1 and Linux 3.7.0. We are only using CephFS on this cluster which is mounted on ~60 clients (increasing each day). Objects are replicated three times and the cluster handles only 7GB of data atm for 350k objects. In certain conditions (I don't know them atm), some clients hang, generate CPU overloads (kworker) and are unable to make any IO on Ceph. The active MDS have ~20Mbps in/out during the issue (less than 2Mbps in normal activity). I don't know if it's directly linked but we also observe a lot of missing files at the same time. The problem is similar to this one [1]. A restart of the client or the MDS was enough before today, but we found a new behavior: the active MDS consumes a lot of CPU during 3 to 5 hours with ~25% clients hanging. In logs I found a segfault with this backtrace [2] and 100,000 dumped events during the first hang. We observed another hang which produces lot of these events (in debug mode): - mds.0.server FAIL on ESTALE but attempting recovery - mds.0.server reply_request -116 (Stale NFS file handle) client_request(client.10991:1031 getattr As #104bab0 RETRY=132) Hi Kevin, sorry for the delayed response. This looks like the mds cache is thrashing quite a bit, and with multiple MDSs the tree partitioning is causing those estale messages. In your case, you should probably run with just a single active mds (I assume all three MDSs are active, but ceph -s will tell you for sure), and the others as standby. I don't think you'll be able to do that without starting over though. Also, you might want to increase the size of the mds cache if you have enough memory on that machine. mds cache size defaults to 100k, you might increase it to 300k and see if you get the same problems. We have no profiling tools available on these nodes, and I don't know what I should search in the 35 GB log file. Note: the segmentation fault occured only once but the problem was observed four times on this cluster. Any help may be appreciated. References: [1] http://www.spinics.net/lists/ceph-devel/msg04903.html [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) 1: /usr/bin/ceph-mds() [0x817e82] 2: (()+0xf140) [0x7f9091d30140] 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1] 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9] 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70] 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90] 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2] 8: (Server::kill_session(Session*)+0x137) [0x549c67] 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6] 10: (MDS::tick()+0x338) [0x4da928] 11: (SafeTimer::timer_thread()+0x1af) [0x78151f] 12: (SafeTimerThread::entry()+0xd) [0x782bad] 13: (()+0x7ddf) [0x7f9091d28ddf] 14: (clone()+0x6d) [0x7f90909cc24d] I found a possible cause/way to reproduce this issue. We have now ~90 clients for 18GB / 650k objects and the storm occurs when we execute an intensive IO command (tar of the whole pool / rsync in one folder) on one of our client (the only which uses ceph-fuse, don't know if it's limited to it or not). Do you have debug logging enabled when you see this crash? 
Can you compress that mds log and post it somewhere or email it to me? Thanks, -sam Any idea? Cheers, -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com
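A very rough way to sanity-check a larger mds cache size against available RAM is sketched below. The per-entry figure is a ballpark assumption (a few kilobytes of memory per cached inode and dentry), not a measured value, so treat the output as an order-of-magnitude estimate and watch the resident memory of ceph-mds after changing the setting.

def mds_cache_memory_gb(cache_entries, bytes_per_entry=4096):
    # bytes_per_entry is a rough guess, not a measurement.
    return cache_entries * bytes_per_entry / float(1024 ** 3)

for entries in (100000, 300000, 900000):
    print("mds cache size = %7d  ->  roughly %.1f GB" % (entries, mds_cache_memory_gb(entries)))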
Re: Crash and strange things on MDS
On Mon, Feb 11, 2013 at 11:00:15AM -0600, Sam Lang wrote: Hi Kevin, sorry for the delayed response. This looks like the mds cache is thrashing quite a bit, and with multiple MDSs the tree partitioning is causing those estale messages. In your case, you should probably run with just a single active mds (I assume all three MDSs are active, but ceph -s will tell you for sure), and the others as standby. I don't think you'll be able to do that without starting over though. Hi Sam, I know that MDS clustering is a bit buggy so I have only one active MDS on this cluster. Here is the output of ceph -s: ~ # ceph -s health HEALTH_OK monmap e1: 3 mons at {a=x:6789/0,b=y:6789/0,c=z:6789/0}, election epoch 48, quorum 0,1,2 a,b,c osdmap e79: 27 osds: 27 up, 27 in pgmap v895343: 5376 pgs: 5376 active+clean; 18987 MB data, 103 GB used, 21918 GB / 23201 GB avail mdsmap e73: 1/1/1 up {0=b=up:active}, 2 up:standby Also, you might want to increase the size of the mds cache if you have enough memory on that machine. mds cache size defaults to 100k, you might increase it to 300k and see if you get the same problems. I have 24GB of memory for each MDS, I will try to increase this value. Thanks for advice. Do you have debug logging enabled when you see this crash? Can you compress that mds log and post it somewhere or email it to me? Yes, I have 34GB of raw logs (for this issue) but I have no debug log of the beginning of the storm itself. I will upload a compressed archive. Furthermore, I observe another strange thing more or less related to the storms. During a rsync command to write ~20G of data on Ceph and during (and after) the storm, one OSD sends a lot of data to the active MDS (400Mbps peak each 6 seconds). After a quick check, I found that when I stop osd.23, osd.14 stops its peaks. I will forward a copy of the debug enabled log of osd14. The only significant difference between osd.23 and others is the list of hb_in where osd.14 is missing (but I think it's unrelated). 
~ # ceph pg dump osdstat kbused kbavail kb hb in hb out 0 4016228 851255948 901042464 [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 1 4108748 851163428 901042464 [0,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23,24,25,26] [] 2 4276584 850995592 901042464 [0,1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 3 3997368 851274808 901042464 [0,1,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 4 4358212 850913964 901042464 [0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 5 4039112 851233064 901042464 [0,1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 6 3971568 851300608 901042464 [0,1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 7 3942556 851329620 901042464 [0,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 8 4275584 850996592 901042464 [0,1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 9 4279308 850992868 901042464 [0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 10 3728136 851544040 901042464 [0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 11 3934096 851338080 901042464 [0,1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 12 3991600 851280576 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 13 4211228 851060948 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 14 4169476 851102700 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26] [] 15 4385584 850886592 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,16,17,18,19,20,21,22,23,24,25,26] [] 16 3761176 851511000 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,18,19,20,21,22,23,24,25,26] [] 17 3646096 851626080 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23,24,25,26] [] 18 4119448 851152728 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19,20,21,22,23,24,25,26] [] 19 4592992 850679184 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,20,21,22,23,24,25,26] [] 20 3740840 851531336 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26] [] 21 4363552 850908624 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,22,23,24,25,26] [] 22 3831420 851440756 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,23,24,25,26] [] 23 3681648 851590528 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,24,25,26] [] 24 3946192 851325984 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,25,26] [] 25
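One way to spot the hb_in asymmetry described above without scanning the table by eye is to feed the osdstat heartbeat lists into a small script. The sample dict below is a made-up four-OSD illustration, not the cluster from this thread; fill it from your own ceph pg dump output.

def missing_heartbeat_peers(hb_in):
    # hb_in maps an OSD id to the list of OSD ids in its "hb in" column.
    all_osds = set(hb_in)
    report = {}
    for osd, peers in hb_in.items():
        missing = all_osds - set(peers) - {osd}
        if missing:
            report[osd] = sorted(missing)
    return report

sample = {
    0: [1, 2, 3],
    1: [0, 2, 3],
    2: [0, 1, 3],
    3: [0, 1],  # osd.2 is absent from osd.3's hb in list
}
for osd, missing in sorted(missing_heartbeat_peers(sample).items()):
    print("osd.%d hb in list is missing: %s" % (osd, missing))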
Re: Crash and strange things on MDS
On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf ke...@kdecherf.com wrote: References: [1] http://www.spinics.net/lists/ceph-devel/msg04903.html [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) 1: /usr/bin/ceph-mds() [0x817e82] 2: (()+0xf140) [0x7f9091d30140] 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1] 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9] 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70] 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90] 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2] 8: (Server::kill_session(Session*)+0x137) [0x549c67] 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6] 10: (MDS::tick()+0x338) [0x4da928] 11: (SafeTimer::timer_thread()+0x1af) [0x78151f] 12: (SafeTimerThread::entry()+0xd) [0x782bad] 13: (()+0x7ddf) [0x7f9091d28ddf] 14: (clone()+0x6d) [0x7f90909cc24d] This in particular is quite odd. Do you have any logging from when that happened? (Oftentimes the log can have a bunch of debugging information from shortly before the crash.) On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote: Furthermore, I observe another strange thing more or less related to the storms. During a rsync command to write ~20G of data on Ceph and during (and after) the storm, one OSD sends a lot of data to the active MDS (400Mbps peak each 6 seconds). After a quick check, I found that when I stop osd.23, osd.14 stops its peaks. This is consistent with Sam's suggestion that MDS is thrashing its cache, and is grabbing a directory object off of the OSDs. How large are the directories you're using? If they're a significant fraction of your cache size, it might be worth enabling the (sadly less stable) directory fragmentation options, which will split them up into smaller fragments that can be independently read and written to disk. -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote: On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf ke...@kdecherf.com wrote: References: [1] http://www.spinics.net/lists/ceph-devel/msg04903.html [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) 1: /usr/bin/ceph-mds() [0x817e82] 2: (()+0xf140) [0x7f9091d30140] 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1] 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9] 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70] 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90] 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2] 8: (Server::kill_session(Session*)+0x137) [0x549c67] 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6] 10: (MDS::tick()+0x338) [0x4da928] 11: (SafeTimer::timer_thread()+0x1af) [0x78151f] 12: (SafeTimerThread::entry()+0xd) [0x782bad] 13: (()+0x7ddf) [0x7f9091d28ddf] 14: (clone()+0x6d) [0x7f90909cc24d] This in particular is quite odd. Do you have any logging from when that happened? (Oftentimes the log can have a bunch of debugging information from shortly before the crash.) Yes, there is a dump of 100,000 events for this backtrace in the linked archive (I need 7 hours to upload it). On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote: Furthermore, I observe another strange thing more or less related to the storms. During a rsync command to write ~20G of data on Ceph and during (and after) the storm, one OSD sends a lot of data to the active MDS (400Mbps peak each 6 seconds). After a quick check, I found that when I stop osd.23, osd.14 stops its peaks. This is consistent with Sam's suggestion that MDS is thrashing its cache, and is grabbing a directory object off of the OSDs. How large are the directories you're using? If they're a significant fraction of your cache size, it might be worth enabling the (sadly less stable) directory fragmentation options, which will split them up into smaller fragments that can be independently read and written to disk. The distribution is heterogeneous: we have a folder of ~17G for 300k objects, another of ~2G for 150k objects and a lof of smaller directories. Are you talking about the mds bal frag and mds bal split * settings? Do you have any advice about the value to use? -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Mon, Feb 11, 2013 at 2:24 PM, Kevin Decherf ke...@kdecherf.com wrote: On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote: On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf ke...@kdecherf.com wrote: References: [1] http://www.spinics.net/lists/ceph-devel/msg04903.html [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) 1: /usr/bin/ceph-mds() [0x817e82] 2: (()+0xf140) [0x7f9091d30140] 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1] 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9] 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70] 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90] 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2] 8: (Server::kill_session(Session*)+0x137) [0x549c67] 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6] 10: (MDS::tick()+0x338) [0x4da928] 11: (SafeTimer::timer_thread()+0x1af) [0x78151f] 12: (SafeTimerThread::entry()+0xd) [0x782bad] 13: (()+0x7ddf) [0x7f9091d28ddf] 14: (clone()+0x6d) [0x7f90909cc24d] This in particular is quite odd. Do you have any logging from when that happened? (Oftentimes the log can have a bunch of debugging information from shortly before the crash.) Yes, there is a dump of 100,000 events for this backtrace in the linked archive (I need 7 hours to upload it). Can you just pastebin the last couple hundred lines? I'm mostly interested if there's anything from the function which actually caused the assert/segfault. Also, the log should compress well and get much smaller! On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote: Furthermore, I observe another strange thing more or less related to the storms. During a rsync command to write ~20G of data on Ceph and during (and after) the storm, one OSD sends a lot of data to the active MDS (400Mbps peak each 6 seconds). After a quick check, I found that when I stop osd.23, osd.14 stops its peaks. This is consistent with Sam's suggestion that MDS is thrashing its cache, and is grabbing a directory object off of the OSDs. How large are the directories you're using? If they're a significant fraction of your cache size, it might be worth enabling the (sadly less stable) directory fragmentation options, which will split them up into smaller fragments that can be independently read and written to disk. The distribution is heterogeneous: we have a folder of ~17G for 300k objects, another of ~2G for 150k objects and a lof of smaller directories. Sorry, you mean 300,000 files in the single folder? If so, that's definitely why it's behaving so badly — your folder is larger than your maximum cache size settings, and so if you run an ls or anything the MDS will read the whole thing off disk, then instantly drop most of the folder from its cache. Then re-read again for the next request to list contents, etc etc. Are you talking about the mds bal frag and mds bal split * settings? Do you have any advice about the value to use? If you set mds bal frag = true in your config, it will split up those very large directories into smaller fragments and behave a lot better. This isn't quite as stable (thus the default to off), so if you have the memory to just really up your cache size I'd start with that and see if it makes your problems better. But if it doesn't, directory fragmentation does work reasonably well and it's something we'd be interested in bug reports for. 
:) -Greg
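To see why a directory that outgrows the cache behaves so badly, here is a toy Python illustration (not MDS code) of repeated scans through an LRU cache: when the number of entries exceeds the number of cache slots, every pass evicts what the previous pass loaded and nothing is ever served from memory, whereas a cache big enough to hold the directory serves every pass after the first.

from collections import OrderedDict

def scan_hit_rate(num_entries, cache_slots, passes=3):
    cache = OrderedDict()
    hits = lookups = 0
    for _ in range(passes):
        for dentry in range(num_entries):
            lookups += 1
            if dentry in cache:
                hits += 1
                cache.move_to_end(dentry)
            else:
                cache[dentry] = True
                if len(cache) > cache_slots:
                    cache.popitem(last=False)  # evict the least recently used entry
    return hits / float(lookups)

print("300k entries, 100k cache slots: %2.0f%% cache hits" % (100 * scan_hit_rate(300000, 100000)))
print("300k entries, 400k cache slots: %2.0f%% cache hits" % (100 * scan_hit_rate(300000, 400000)))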
Re: Crash and strange things on MDS
On Mon, Feb 11, 2013 at 02:47:13PM -0800, Gregory Farnum wrote: On Mon, Feb 11, 2013 at 2:24 PM, Kevin Decherf ke...@kdecherf.com wrote: On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote: Yes, there is a dump of 100,000 events for this backtrace in the linked archive (I need 7 hours to upload it). Can you just pastebin the last couple hundred lines? I'm mostly interested if there's anything from the function which actually caused the assert/segfault. Also, the log should compress well and get much smaller! Sent in pm. And yes, I have a good compression rate but... % ls -lh total 38G -rw-r--r-- 1 kdecherf kdecherf 3.3G Feb 11 18:36 cc-ceph-log.tar.gz -rw--- 1 kdecherf kdecherf 66M Feb 4 17:57 ceph.log -rw-r--r-- 1 kdecherf kdecherf 3.5G Feb 4 14:44 ceph-mds.b.log -rw-r--r-- 1 kdecherf kdecherf 31G Feb 5 15:55 ceph-mds.c.log -rw-r--r-- 1 kdecherf kdecherf 27M Feb 11 19:46 ceph-osd.14.log ;-) The distribution is heterogeneous: we have a folder of ~17G for 300k objects, another of ~2G for 150k objects and a lof of smaller directories. Sorry, you mean 300,000 files in the single folder? If so, that's definitely why it's behaving so badly — your folder is larger than your maximum cache size settings, and so if you run an ls or anything the MDS will read the whole thing off disk, then instantly drop most of the folder from its cache. Then re-read again for the next request to list contents, etc etc. The biggest top-level folder contains 300k files but splitted into several subfolders (a subfolder does not contain more than 10,000 files at its level). Are you talking about the mds bal frag and mds bal split * settings? Do you have any advice about the value to use? If you set mds bal frag = true in your config, it will split up those very large directories into smaller fragments and behave a lot better. This isn't quite as stable (thus the default to off), so if you have the memory to just really up your cache size I'd start with that and see if it makes your problems better. But if it doesn't, directory fragmentation does work reasonably well and it's something we'd be interested in bug reports for. :) I will try it, thanks! -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Crash and strange things on MDS
Hey everyone, It's my first post here to expose a potential issue I found today using Ceph 0.56.1. The cluster configuration is, briefly: 27 osd of ~900GB and 3 MON/MDS. All nodes are running Exherbo (source-based distribution) with Ceph 0.56.1 and Linux 3.7.0. We are only using CephFS on this cluster which is mounted on ~60 clients (increasing each day). Objects are replicated three times and the cluster handles only 7GB of data atm for 350k objects. In certain conditions (I don't know them atm), some clients hang, generate CPU overloads (kworker) and are unable to make any IO on Ceph. The active MDS have ~20Mbps in/out during the issue (less than 2Mbps in normal activity). I don't know if it's directly linked but we also observe a lot of missing files at the same time. The problem is similar to this one [1]. A restart of the client or the MDS was enough before today, but we found a new behavior: the active MDS consumes a lot of CPU during 3 to 5 hours with ~25% clients hanging. In logs I found a segfault with this backtrace [2] and 100,000 dumped events during the first hang. We observed another hang which produces lot of these events (in debug mode): - mds.0.server FAIL on ESTALE but attempting recovery - mds.0.server reply_request -116 (Stale NFS file handle) client_request(client.10991:1031 getattr As #104bab0 RETRY=132) We have no profiling tools available on these nodes, and I don't know what I should search in the 35 GB log file. Note: the segmentation fault occured only once but the problem was observed four times on this cluster. Any help may be appreciated. References: [1] http://www.spinics.net/lists/ceph-devel/msg04903.html [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) 1: /usr/bin/ceph-mds() [0x817e82] 2: (()+0xf140) [0x7f9091d30140] 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1] 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9] 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70] 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90] 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2] 8: (Server::kill_session(Session*)+0x137) [0x549c67] 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6] 10: (MDS::tick()+0x338) [0x4da928] 11: (SafeTimer::timer_thread()+0x1af) [0x78151f] 12: (SafeTimerThread::entry()+0xd) [0x782bad] 13: (()+0x7ddf) [0x7f9091d28ddf] 14: (clone()+0x6d) [0x7f90909cc24d] Cheers, -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html