Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Mon, Dec 1, 2014 at 1:39 PM, Andrei Mikhailovsky and...@arhont.com wrote:

> Ilya, I will try doing that once again tonight, as this is a production cluster and when the dds trigger that dmesg error the cluster's I/O becomes very bad and I have to reboot the server to get things back on track. Most of my VMs start having 70-90% iowait until that server is rebooted.
>
> I've actually checked what you asked the last time I ran the test. When I do 4 dds concurrently nothing appears in the dmesg output - no messages at all. The kern.log file that I sent last time is what I got about a minute after I started the 8 dds; I've pasted the full output. The 8 dds did actually complete, but it took a rather long time. I was getting about 6MB/s per dd process, compared to around 70MB/s per dd process when 4 dds were running.
>
> Do you still want me to run this, or is the information I've provided enough?

How long did it take for all dds to complete?

Can you send the entire kern.log for that boot? I want to look at how things progressed during the entire time the dds were chunking along.

Thanks,

Ilya
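For reference, a minimal sketch of how the eight direct-I/O writes could be launched concurrently while capturing the per-dd timings and the full dmesg asked for above - the mount point /tmp/cephfs and the log paths are illustrative, not taken from the thread:

  #!/bin/bash
  # launch 8 concurrent 20GB direct-I/O writes on the NFS/CephFS mount
  cd /tmp/cephfs
  for i in $(seq 0 7); do
      # each dd's timing goes to its own file so the results can be pasted later
      { time dd if=/dev/zero of=4G$i$i bs=4M count=5K oflag=direct; } 2> /tmp/dd-$i.time &
  done
  wait                                # block until all 8 writers have finished
  dmesg > /tmp/dmesg-after-dds.txt    # capture the entire kernel ring buffer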
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Mon, Dec 1, 2014 at 12:30 AM, Andrei Mikhailovsky and...@arhont.com wrote:

> Ilya, further to your email, I have switched back to the 3.18 kernel that you sent and I got similar-looking dmesg output to what I had on the 3.17 kernel. Please find it attached for your reference.
>
> As before, this is the command I ran on the client:
>
> time dd if=/dev/zero of=4G00 bs=4M count=5K oflag=direct
> time dd if=/dev/zero of=4G11 bs=4M count=5K oflag=direct
> time dd if=/dev/zero of=4G22 bs=4M count=5K oflag=direct
> time dd if=/dev/zero of=4G33 bs=4M count=5K oflag=direct
> time dd if=/dev/zero of=4G44 bs=4M count=5K oflag=direct
> time dd if=/dev/zero of=4G55 bs=4M count=5K oflag=direct
> time dd if=/dev/zero of=4G66 bs=4M count=5K oflag=direct
> time dd if=/dev/zero of=4G77 bs=4M count=5K oflag=direct

Can you run that command again - on the 3.18 kernel, to completion - and paste

- the entire dmesg
- the time results for each dd?

Compare those to your results with four dds (or any other number which doesn't trigger page allocation failures).

Thanks,

Ilya
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Ilya, I will try doing that once again tonight, as this is a production cluster and when the dds trigger that dmesg error the cluster's I/O becomes very bad and I have to reboot the server to get things back on track. Most of my VMs start having 70-90% iowait until that server is rebooted.

I've actually checked what you asked the last time I ran the test. When I do 4 dds concurrently nothing appears in the dmesg output - no messages at all. The kern.log file that I sent last time is what I got about a minute after I started the 8 dds; I've pasted the full output. The 8 dds did actually complete, but it took a rather long time. I was getting about 6MB/s per dd process, compared to around 70MB/s per dd process when 4 dds were running.

Do you still want me to run this, or is the information I've provided enough?

Cheers

Andrei
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Mon, Dec 1, 2014 at 1:39 PM, Andrei Mikhailovsky and...@arhont.com wrote:

> Ilya, I will try doing that once again tonight, as this is a production cluster and when the dds trigger that dmesg error the cluster's I/O becomes very bad and I have to reboot the server to get things back on track. Most of my VMs start having 70-90% iowait until that server is rebooted.

That's easily explained - those splats in dmesg indicate a case of severe memory pressure.

> I've actually checked what you asked the last time I ran the test. When I do 4 dds concurrently nothing appears in the dmesg output - no messages at all. The kern.log file that I sent last time is what I got about a minute after I started the 8 dds; I've pasted the full output. The 8 dds did actually complete, but it took a rather long time. I was getting about 6MB/s per dd process, compared to around 70MB/s per dd process when 4 dds were running.
>
> Do you still want me to run this, or is the information I've provided enough?

No, no need if it's a production cluster.

Thanks,

Ilya
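The splats are tied here to memory pressure and, earlier in the thread, to page allocation failures; a quick way to check whether any such failures were logged on the client, assuming the default Ubuntu kern.log location:

  # count and show the most recent page allocation failures
  grep -ci 'page allocation failure' /var/log/kern.log
  grep -i 'page allocation failure' /var/log/kern.log | tail -5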
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Ilya, I see. My server has 24GB of RAM plus 3GB of swap. While running the tests I noticed that the server had 14GB of RAM shown as cached and only 2MB used from swap. Not sure if this is helpful to your debugging.

Andrei
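To give the debugging a bit more data, memory state could be sampled while the dds run; a simple watcher using the standard procps tools (the interval, duration and log path are arbitrary):

  # sample memory usage once a second for 10 minutes while the test runs
  for i in $(seq 1 600); do
      date +%T >> /tmp/mem.log
      free -m >> /tmp/mem.log                  # totals, cached and swap in MB
      vmstat 1 2 | tail -1 >> /tmp/mem.log     # current swap-in/out, block I/O, run queue
      sleep 1
  done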
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Sun, Nov 30, 2014 at 1:15 PM, Andrei Mikhailovsky and...@arhont.com wrote:

> Greg, thanks for your comment. Could you please share what OS, kernel and nfs/cephfs settings you've used to achieve the "pretty well" stability? Also, what kind of tests have you run to check that?

We're just doing it on our testing cluster with the teuthology/ceph-qa-suite stuff in
https://github.com/ceph/ceph-qa-suite/tree/master/suites/knfs/basic

So that'll be running our ceph-client kernel, which I believe is usually a recent rc release with the new Ceph changes on top, with knfs exporting a kcephfs mount, and then running each of the tasks named in the tasks folder on top of a client to that knfs export.
-Greg
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Greg, thanks for your comment. Could you please share what OS, kernel and nfs/cephfs settings you've used to achieve the "pretty well" stability? Also, what kind of tests have you run to check that?

Thanks
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Sat, Nov 29, 2014 at 2:13 AM, Andrei Mikhailovsky and...@arhont.com wrote:

> Ilya, here is what I got shortly after starting the dd test:
>
> [  288.307993]
> [  288.308004] =========================================================
> [  288.308008] [ INFO: possible irq lock inversion dependency detected ]
> [  288.308014] 3.18.0-rc6-ceph-00024-g72ca172 #1 Tainted: G E
> [  288.308019] ---------------------------------------------------------
> [  288.308023] kswapd1/87 just changed the state of lock:
> [  288.308027]  (xfs_dir_ilock_class){-+}, at: [a0682d44] xfs_ilock+0x134/0x160 [xfs]
> [  288.308072] but this lock took another, RECLAIM_FS-unsafe lock in the past:
> [  288.308076]  (mm->mmap_sem){++}
> [  288.308076]
> [  288.308076] and interrupts could create inverse lock ordering between them.
> [  288.308076]
> [  288.308084]
> [  288.308084] other info that might help us debug this:
> [  288.308089]  Possible interrupt unsafe locking scenario:
> [  288.308089]
> [  288.308094]        CPU0                    CPU1
> [  288.308097]
> [  288.308100]   lock(mm->mmap_sem);
> [  288.308104]                                local_irq_disable();
> [  288.308109]                                lock(xfs_dir_ilock_class);
> [  288.308114]                                lock(mm->mmap_sem);
> [  288.308120]   <Interrupt>
> [  288.308122]     lock(xfs_dir_ilock_class);
> [  288.308127]
> [  288.308127]  *** DEADLOCK ***
> [  288.308127]
> [  288.308133] 3 locks held by kswapd1/87:
> [  288.308136]  #0:  (shrinker_rwsem){..}, at: [8117551f] shrink_slab+0x3f/0x140
> [  288.308151]  #1:  (type->s_umount_key#27){.+}, at: [811d8c14] grab_super_passive+0x44/0x90
> [  288.308165]  #2:  (pag->pag_ici_reclaim_lock){+.+...}, at: [a067acd4] xfs_reclaim_inodes_ag+0xb4/0x400 [xfs]
> [  288.308192]
> [  288.308192] the shortest dependencies between 2nd lock and 1st lock:
> [  288.308206]  -> (mm->mmap_sem){++} ops: 27039227 {
> [  288.308214]    HARDIRQ-ON-W at:
> [  288.308218]      [810a7209] __lock_acquire+0x629/0x1c90
> [  288.308229]      [810a8e9e] lock_acquire+0x9e/0x140
> [  288.308236]      [8173ae99] down_write+0x49/0x80
> [  288.308244]      [811dcd03] do_execve_common.isra.25+0x283/0x6e0
> [  288.308253]      [811dd178] do_execve+0x18/0x20
> [  288.308259]      [8106ff4e] call_usermodehelper+0x11e/0x170
> [  288.308269]      [8173d66c] ret_from_fork+0x7c/0xb0
> [  288.308276]    HARDIRQ-ON-R at:
> [  288.308280]      [810a6f23] __lock_acquire+0x343/0x1c90
> [  288.308287]      [810a8e9e] lock_acquire+0x9e/0x140
> [  288.308294]      [8118d833] might_fault+0x93/0xc0
> [  288.308304]      [813b7a80] __clear_user+0x20/0x70
> [  288.308314]      [813b7afe] clear_user+0x2e/0x40
> [  288.308320]      [8122a4cd] padzero+0x2d/0x40
> [  288.308329]      [8122b0bf] load_elf_binary+0x9cf/0x1880
> [  288.308336]      [811db9f0] search_binary_handler+0xa0/0x1e0
> [  288.308343]      [811dcfa2] do_execve_common.isra.25+0x522/0x6e0
> [  288.308351]      [811dd178] do_execve+0x18/0x20
> [  288.308358]      [8106ff4e] call_usermodehelper+0x11e/0x170
> [  288.308366]      [8173d66c] ret_from_fork+0x7c/0xb0
> [  288.308373]    SOFTIRQ-ON-W at:
> [  288.308376]      [810a6f54] __lock_acquire+0x374/0x1c90
> [  288.308384]      [810a8e9e] lock_acquire+0x9e/0x140
> [  288.308391]      [8173ae99] down_write+0x49/0x80
> [  288.308398]      [811dcd03] do_execve_common.isra.25+0x283/0x6e0
> [  288.308406]      [811dd178] do_execve+0x18/0x20
> [  288.308412]      [8106ff4e] call_usermodehelper+0x11e/0x170
> [  288.308420]      [8173d66c] ret_from_fork+0x7c/0xb0
> [  288.308427]    SOFTIRQ-ON-R at:
> [  288.308431]      [810a6f54] __lock_acquire+0x374/0x1c90
> [  288.308438]      [810a8e9e] lock_acquire+0x9e/0x140
> [  288.308445]      [8118d833] might_fault+0x93/0xc0
> [  288.308452]      [813b7a80] __clear_user+0x20/0x70
> [  288.308458]      [813b7afe] clear_user+0x2e/0x40
> [  288.308464]      [8122a4cd] padzero+0x2d/0x40
> [  288.308470]      [8122b0bf] load_elf_binary+0x9cf/0x1880
> [  288.308477]      [811db9f0]
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Ilya, so what is the best action plan now? Should I continue using the kernel that you've sent me? I am running production infrastructure and I'm not sure if this is the right way forward. Do you have a patch by any chance against the LTS kernel that I can use to recompile the ceph module?

Thanks
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Sat, Nov 29, 2014 at 2:33 AM, Andrei Mikhailovsky and...@arhont.com wrote:

> Ilya, not sure if the dmesg output in the previous email is related to cephfs, but from what I can see it looks good with your kernel. I would have seen hang tasks by now, but not anymore. I've run a bunch of concurrent dd tests and also the file touch tests and there are no more delays. So, it looks like you have nailed the bug!

Great, good to have another data point.

> Do you plan to backport the fix to the 3.16 or 3.17 branches?

That's the tricky part. Can you try

http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/for-andrei-1/linux-image-3.17.4-ceph-00638-g0f25ebb_3.17.4-ceph-00638-g0f25ebb-1_amd64.deb

?

Thanks,

Ilya
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Ilya, I will give it a try and get back to you shortly.

Andrei
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Ilya, I think I spoke too soon in my last message. I've now given it more load (running 8 concurrent dds with bs=4M) and about a minute or so after starting I've seen problems in the dmesg output. I am attaching the kern.log file for your reference. Please check starting with the following line: Nov 29 12:07:38 arh-ibstorage1-ib kernel: [ 3831.906510]. This is when I started the 8 concurrent dds.

The command that caused this is:

time dd if=/dev/zero of=4G00 bs=4M count=5K oflag=direct
time dd if=/dev/zero of=4G11 bs=4M count=5K oflag=direct
time dd if=/dev/zero of=4G22 bs=4M count=5K oflag=direct
time dd if=/dev/zero of=4G33 bs=4M count=5K oflag=direct
time dd if=/dev/zero of=4G44 bs=4M count=5K oflag=direct
time dd if=/dev/zero of=4G55 bs=4M count=5K oflag=direct
time dd if=/dev/zero of=4G66 bs=4M count=5K oflag=direct
time dd if=/dev/zero of=4G77 bs=4M count=5K oflag=direct

I've run the same test about 10 times but with only 4 concurrent dds and that didn't cause the issue. Should I try the 3.18 kernel again to see if 8 dds produce similar output?

Andrei
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Sat, Nov 29, 2014 at 3:10 PM, Andrei Mikhailovsky and...@arhont.com wrote:

> Ilya, the 3.17.4 kernel that you've given me is also good so far. No hang tasks as seen before. However, I do have the same message in dmesg as with the 3.18 kernel that you sent. I've not seen this message in the past while using kernel version 3.2 onwards. Not really sure if this message should be treated as alarming.

If you are referring to the xfs lockdep splat, the reason you haven't seen it in the past may be that lockdep just wasn't enabled on your kernels - most distro kernels don't enable it. I wouldn't treat it as alarming, but I'd report it to the xfs lists if it hasn't been reported there yet.

Thanks,

Ilya
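One way to confirm whether lockdep is actually compiled into a given kernel is to look at its config; a sketch assuming the distro ships /boot/config-* files (some kernels expose the config via /proc/config.gz instead):

  # check the running kernel for lockdep-related options
  grep -E 'CONFIG_(LOCKDEP|PROVE_LOCKING|DEBUG_LOCK_ALLOC)=' /boot/config-$(uname -r)

  # alternative if the kernel was built with CONFIG_IKCONFIG_PROC
  zgrep -E 'CONFIG_(LOCKDEP|PROVE_LOCKING)=' /proc/config.gz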
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Sat, Nov 29, 2014 at 3:22 PM, Andrei Mikhailovsky and...@arhont.com wrote:

> Ilya, I think I spoke too soon in my last message. I've now given it more load (running 8 concurrent dds with bs=4M) and about a minute or so after starting I've seen problems in the dmesg output. I am attaching the kern.log file for your reference.

Missing attachment.

Thanks,

Ilya
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Sat, Nov 29, 2014 at 3:49 PM, Ilya Dryomov ilya.dryo...@inktank.com wrote:
> On Sat, Nov 29, 2014 at 3:22 PM, Andrei Mikhailovsky and...@arhont.com wrote:
>> Should I try the 3.18 kernel again to see if 8 dds produce similar output?
>
> Missing attachment.

Definitely try the 3.18 testing kernel.

Thanks,

Ilya
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Ilya, do you have a ticket reference for the bug?

Andrei, we run NFS tests on CephFS in our nightlies and it does pretty well, so in the general case we expect it to work. Obviously not at the moment with whatever bug Ilya is looking at, though. ;)
-Greg
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Sun, Nov 30, 2014 at 1:19 AM, Gregory Farnum g...@gregs42.com wrote:

> Ilya, do you have a ticket reference for the bug?

Opened a ticket, assigned to myself.

http://tracker.ceph.com/issues/10208

> Andrei, we run NFS tests on CephFS in our nightlies and it does pretty well, so in the general case we expect it to work. Obviously not at the moment with whatever bug Ilya is looking at, though. ;)

This is most probably a libceph issue - both krbd and kcephfs are affected. I've been tracking it under the general "io hang" umbrella, which spreads over a couple of existing tickets. Definitely not an nfs-on-cephfs problem, Greg.

Thanks,

Ilya
Re: [ceph-users] Giant + nfs over cephfs hang tasks
I am also noticing some delays working with nfs over cephfs, especially when making an initial connection. For instance, I run the following:

# time for i in {0..10} ; do time touch /tmp/cephfs/test-$i ; done

where /tmp/cephfs is the nfs mount point running over cephfs.

I am noticing that the first file is created after about 20-30 seconds. All the following files are created with no delay. If I run the command once again, all files are created pretty quickly without any delay. However, if I wait 20-30 minutes and run the command again, the delay with the first file is back again.

Has anyone experienced similar issues?

Andrei

----- Original Message -----
From: Andrei Mikhailovsky and...@arhont.com
To: ceph-users ceph-users@lists.ceph.com
Sent: Friday, 28 November, 2014 9:08:17 AM
Subject: [ceph-users] Giant + nfs over cephfs hang tasks

Hello guys,

I've got a bunch of hang tasks of the nfsd service running over the cephfs (kernel) mounted file system. Here is an example of one of them.

[433079.991218] INFO: task nfsd:32625 blocked for more than 120 seconds.
[433080.029685] Not tainted 3.15.10-031510-generic #201408132333
[433080.068036] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[433080.144235] nfsd D 000a 0 32625 2 0x
[433080.144241]  8801a94dba78 0002 8801a94dba38 8801a94dbfd8
[433080.144244]  00014540 00014540 880673d63260 880491d264c0
[433080.144247]  8801a94dba78 88067fd14e40 880491d264c0 8115dff0
[433080.144250] Call Trace:
[433080.144260]  [8115dff0] ? __lock_page+0x70/0x70
[433080.144274]  [81778449] schedule+0x29/0x70
[433080.144279]  [8177851f] io_schedule+0x8f/0xd0
[433080.144282]  [8115dffe] sleep_on_page+0xe/0x20
[433080.144286]  [81778be2] __wait_on_bit+0x62/0x90
[433080.144288]  [8115eacb] ? find_get_pages_tag+0xcb/0x170
[433080.144291]  [8115e160] wait_on_page_bit+0x80/0x90
[433080.144296]  [810b54a0] ? wake_atomic_t_function+0x40/0x40
[433080.144299]  [8115e334] filemap_fdatawait_range+0xf4/0x180
[433080.144302]  [8116027d] filemap_write_and_wait_range+0x4d/0x80
[433080.144315]  [a06bf1b8] ceph_fsync+0x58/0x200 [ceph]
[433080.144330]  [813308f5] ? ima_file_check+0x35/0x40
[433080.144337]  [812028c8] vfs_fsync_range+0x18/0x30
[433080.144352]  [a03ee491] nfsd_commit+0xb1/0xd0 [nfsd]
[433080.144363]  [a03fb787] nfsd4_commit+0x57/0x60 [nfsd]
[433080.144370]  [a03fcf9e] nfsd4_proc_compound+0x54e/0x740 [nfsd]
[433080.144377]  [a03e8e05] nfsd_dispatch+0xe5/0x230 [nfsd]
[433080.144401]  [a03205a5] svc_process_common+0x345/0x680 [sunrpc]
[433080.144413]  [a0320c33] svc_process+0x103/0x160 [sunrpc]
[433080.144418]  [a03e895f] nfsd+0xbf/0x130 [nfsd]
[433080.144424]  [a03e88a0] ? nfsd_destroy+0x80/0x80 [nfsd]
[433080.144428]  [81091439] kthread+0xc9/0xe0
[433080.144431]  [81091370] ? flush_kthread_worker+0xb0/0xb0
[433080.144434]  [8178567c] ret_from_fork+0x7c/0xb0
[433080.144437]  [81091370] ? flush_kthread_worker+0xb0/0xb0

I am using Ubuntu 12.04 servers with the 3.15.10 kernel and ceph Giant.

Thanks

Andrei
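The cold-start delay described above can be timed in isolation; a trivial sketch, assuming /tmp/cephfs is the NFS mount point and that the 20-30 minute idle period is what matters (the file names are illustrative):

  # let the mount sit idle, then time the first and second operations
  sleep 1800
  time touch /tmp/cephfs/first-after-idle     # this is the one that takes 20-30 seconds
  time touch /tmp/cephfs/second-after-idle    # subsequent operations return immediately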
Re: [ceph-users] Giant + nfs over cephfs hang tasks
I've just tried the latest ubuntu-vivid kernel and I'm also seeing hang tasks with the dd tests:

[ 3721.026421] INFO: task nfsd:16596 blocked for more than 120 seconds.
[ 3721.065141] Not tainted 3.17.4-031704-generic #201411211317
[ 3721.103721] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[ 3721.180409] nfsd D 0009 0 16596 2 0x
[ 3721.180412]  88006f0cbc18 0046 88006f0cbbc8 8109677f
[ 3721.180414]  88006f0cbfd8 000145c0 88067089f700 000145c0
[ 3721.180417]  880673dac600 88045adfbc00 8801bdf8be40 88000b841500
[ 3721.180420] Call Trace:
[ 3721.180423]  [8109677f] ? set_groups+0x2f/0x60
[ 3721.180427]  [817a20c9] schedule+0x29/0x70
[ 3721.180440]  [817a23ee] schedule_preempt_disabled+0xe/0x10
[ 3721.180443]  [817a429d] __mutex_lock_slowpath+0xcd/0x1d0
[ 3721.180447]  [817a43c3] mutex_lock+0x23/0x37
[ 3721.180454]  [c071cadd] nfsd_setattr+0x15d/0x2a0 [nfsd]
[ 3721.180460]  [c0727d2e] nfsd4_setattr+0x14e/0x180 [nfsd]
[ 3721.180467]  [c0729eac] nfsd4_proc_compound+0x4cc/0x730 [nfsd]
[ 3721.180478]  [c0715e55] nfsd_dispatch+0xe5/0x230 [nfsd]
[ 3721.180491]  [c05b9882] ? svc_tcp_adjust_wspace+0x12/0x30 [sunrpc]
[ 3721.180500]  [c05b8694] svc_process_common+0x324/0x680 [sunrpc]
[ 3721.180510]  [c05b8d43] svc_process+0x103/0x160 [sunrpc]
[ 3721.180516]  [c07159c7] nfsd+0x117/0x190 [nfsd]
[ 3721.180526]  [c07158b0] ? nfsd_destroy+0x80/0x80 [nfsd]
[ 3721.180528]  [81093359] kthread+0xc9/0xe0
[ 3721.180533]  [81093290] ? flush_kthread_worker+0x90/0x90
[ 3721.180536]  [817a64bc] ret_from_fork+0x7c/0xb0
[ 3721.180540]  [81093290] ? flush_kthread_worker+0x90/0x90
[ 3721.180577] INFO: task kworker/2:3:28061 blocked for more than 120 seconds.
[ 3721.221450] Not tainted 3.17.4-031704-generic #201411211317
[ 3721.261440] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
[ 3721.341394] kworker/2:3 D 0002 0 28061 2 0x
[ 3721.341408] Workqueue: ceph-trunc ceph_vmtruncate_work [ceph]
[ 3721.341409]  8805a6507b08 0046 8805a6507b88 ea00040e8c80
[ 3721.341412]  8805a6507fd8 000145c0 88067089b480 000145c0
[ 3721.341414]  8801fffec600 880102535a00 88000b8415a8 88046fc94ec0
[ 3721.341417] Call Trace:
[ 3721.341421]  [817a2970] ? bit_wait+0x50/0x50
[ 3721.341424]  [817a20c9] schedule+0x29/0x70
[ 3721.341427]  [817a219f] io_schedule+0x8f/0xd0
[ 3721.341430]  [817a299b] bit_wait_io+0x2b/0x50
[ 3721.341433]  [817a2656] __wait_on_bit_lock+0x76/0xb0
[ 3721.341438]  [811756b5] ? find_get_entries+0xe5/0x160
[ 3721.341440]  [8117245e] __lock_page+0xae/0xb0
[ 3721.341446]  [810b3fd0] ? wake_atomic_t_function+0x40/0x40
[ 3721.341451]  [81183226] truncate_inode_pages_range+0x446/0x700
[ 3721.341455]  [81183565] truncate_inode_pages+0x15/0x20
[ 3721.341457]  [811835bc] truncate_pagecache+0x4c/0x70
[ 3721.341464]  [c09f815e] __ceph_do_pending_vmtruncate+0xde/0x230 [ceph]
[ 3721.341470]  [c09f8c73] ceph_vmtruncate_work+0x23/0x50 [ceph]
[ 3721.341476]  [8108cece] process_one_work+0x14e/0x460
[ 3721.341479]  [8108d84b] worker_thread+0x11b/0x3f0
[ 3721.341482]  [8108d730] ? create_worker+0x1e0/0x1e0
[ 3721.341485]  [81093359] kthread+0xc9/0xe0
[ 3721.341487]  [81093290] ? flush_kthread_worker+0x90/0x90
[ 3721.341490]  [817a64bc] ret_from_fork+0x7c/0xb0
[ 3721.341492]  [81093290] ? flush_kthread_worker+0x90/0x90

They do not happen with every dd test, but happen pretty often, especially when I am running a few dd tests concurrently.

An example test that generated the hang tasks above after just 2 runs:

# dd if=/dev/zero of=/tmp/cephfs/4G bs=1M count=4K oflag=direct
dd if=/dev/zero of=/tmp/cephfs/4G1 bs=1M count=4K oflag=direct
dd if=/dev/zero of=/tmp/cephfs/4G2 bs=1M count=4K oflag=direct
dd if=/dev/zero of=/tmp/cephfs/4G3 bs=1M count=4K oflag=direct

Cheers
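When tasks get stuck in the D state like this, their in-kernel stacks can be inspected directly without waiting for the 120-second hung-task watchdog; a sketch that assumes a kernel built with CONFIG_STACKTRACE (true for the stock Ubuntu kernels) and root privileges:

  # dump the kernel stack of every task currently in uninterruptible sleep
  for pid in $(ps -eo pid,stat,comm | awk '$2 ~ /^D/ {print $1}'); do
      echo "=== pid $pid ($(cat /proc/$pid/comm)) ==="
      cat /proc/$pid/stack
  done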
Re: [ceph-users] Giant + nfs over cephfs hang tasks
I've done some tests using ceph-fuse and it looks far more stable. I've not experienced any issues so far with a ceph-fuse mount point exported over nfs. Will do more stress testing and update.

Is anyone else experiencing issues with hang tasks using the ceph kernel module mount method?

Thanks
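For anyone wanting to reproduce the ceph-fuse-backed export being described, a minimal sketch - the monitor address, mount point and export options are placeholders, and a FUSE filesystem generally needs an explicit fsid= in /etc/exports:

  # mount CephFS with the userspace client instead of the kernel module
  ceph-fuse -m mon-host:6789 /mnt/cephfs

  # /etc/exports entry for the FUSE mount (fsid is required for non-block-device filesystems):
  #   /mnt/cephfs  *(rw,sync,no_subtree_check,fsid=100)
  exportfs -ra

  # on the NFS client:
  #   mount -t nfs server:/mnt/cephfs /tmp/cephfs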
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Fri, Nov 28, 2014 at 3:02 PM, Andrei Mikhailovsky and...@arhont.com wrote:

> I've just tried the latest ubuntu-vivid kernel and I'm also seeing hang tasks with the dd tests:
>
> [ 3721.026421] INFO: task nfsd:16596 blocked for more than 120 seconds.
>
> They do not happen with every dd test, but happen pretty often, especially when I am running a few dd tests concurrently.
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Ilya, yes I do! Like these, from different osds:

[ 4422.212204] libceph: osd13 192.168.168.201:6819 socket closed (con state OPEN)

Andrei
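A quick way to pull all of these client-side messages out of the kernel log when checking for them (paths are the usual Ubuntu defaults):

  # messages from the running kernel
  dmesg | grep -E 'libceph|ceph:'

  # or from the persisted log, to catch messages from before a reboot
  grep -E 'libceph|ceph:' /var/log/kern.log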
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Fri, Nov 28, 2014 at 8:13 PM, Andrei Mikhailovsky and...@arhont.com wrote:

> Ilya, yes I do! Like these, from different osds:
>
> [ 4422.212204] libceph: osd13 192.168.168.201:6819 socket closed (con state OPEN)

Can you by any chance try a kernel from [1]? It's based on the Ubuntu config and, unless you are doing something fancy, should boot your box. You have to install it only on the client box, of course.

This may be related to the bug I'm currently trying to nail down and I'd like to know if the latest bits make any difference.

Thanks,

Ilya
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Fri, Nov 28, 2014 at 8:19 PM, Ilya Dryomov ilya.dryo...@inktank.com wrote:

> Can you by any chance try a kernel from [1]? It's based on the Ubuntu config and, unless you are doing something fancy, should boot your box. You have to install it only on the client box, of course.
>
> This may be related to the bug I'm currently trying to nail down and I'd like to know if the latest bits make any difference.

[1] http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/testing/linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb

Thanks,

Ilya
Re: [ceph-users] Giant + nfs over cephfs hang tasks
On Fri, Nov 28, 2014 at 8:20 PM, Ilya Dryomov ilya.dryo...@inktank.com wrote:

> [1] http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/testing/linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb

It's currently rebuilding because of an unrelated patch and will be overwritten once gitbuilder is done. If it's not there by the time you try, use this link:

http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/sha1/72ca172a582d656930f413c3733401b8a5c120db/linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb

Thanks,

Ilya
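Installing one of these gitbuilder kernels on an Ubuntu client is a plain dpkg install; a sketch using the testing-branch URL above (update-grub is run explicitly in case the package's postinst does not):

  # fetch and install the test kernel on the client box only
  wget http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/testing/linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb
  sudo dpkg -i linux-image-3.18.0-rc6-ceph-00024-g72ca172_3.18.0-rc6-ceph-00024-g72ca172-1_amd64.deb
  sudo update-grub
  sudo reboot
  # after the reboot, confirm the running kernel
  uname -r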
Re: [ceph-users] Giant + nfs over cephfs hang tasks
I will give it a go and let you know.

Cheers
Re: [ceph-users] Giant + nfs over cephfs hang tasks
[ 288.310156] CPU: 8 PID: 87 Comm: kswapd1 Tainted: G E 3.18.0-rc6-ceph-00024-g72ca172 #1
[ 288.310162] Hardware name: Supermicro X9DRH-7TF/7F/iTF/iF/X9DRH-7TF/7F/iTF/iF, BIOS 3.0b 04/28/2014
[ 288.310169]  821208e0 8804676ab608 81733b38 0007
[ 288.310177]  8804676ab670 8804676ab658 810a5f68 821208e0
[ 288.310184]  81a7cbe0 8804676ab674 88046763cc50
[ 288.310192] Call Trace:
[ 288.310200]  [81733b38] dump_stack+0x4e/0x68
[ 288.310206]  [810a5f68] print_irq_inversion_bug.part.41+0x1e8/0x1f0
[ 288.310213]  [810a607b] check_usage_forwards+0x10b/0x150
[ 288.310220]  [810a6a8b] mark_lock+0x18b/0x2e0
[ 288.310226]  [810a5f70] ? print_irq_inversion_bug.part.41+0x1f0/0x1f0
[ 288.310234]  [811c9185] ? __mem_cgroup_threshold+0x5/0x1d0
[ 288.310241]  [810a6fb0] __lock_acquire+0x3d0/0x1c90
[ 288.310247]  [810a6ff1] ? __lock_acquire+0x411/0x1c90
[ 288.310266]  [a0682d44] ? xfs_ilock+0x134/0x160 [xfs]
[ 288.310272]  [810a8e9e] lock_acquire+0x9e/0x140
[ 288.310289]  [a0682d44] ? xfs_ilock+0x134/0x160 [xfs]
[ 288.310295]  [810a33ef] down_write_nested+0x4f/0x80
[ 288.310312]  [a0682d44] ? xfs_ilock+0x134/0x160 [xfs]
[ 288.310329]  [a0682d44] xfs_ilock+0x134/0x160 [xfs]
[ 288.310347]  [a067aa0c] ? xfs_reclaim_inode+0x12c/0x340 [xfs]
[ 288.310364]  [a067aa0c] xfs_reclaim_inode+0x12c/0x340 [xfs]
[ 288.310382]  [a067aea7] xfs_reclaim_inodes_ag+0x287/0x400 [xfs]
[ 288.310400]  [a067ad00] ? xfs_reclaim_inodes_ag+0xe0/0x400 [xfs]
[ 288.310418]  [a067bda3] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 288.310438]  [a068b855] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
[ 288.310445]  [811d8dd8] super_cache_scan+0x178/0x180
[ 288.310451]  [8117393e] shrink_slab_node+0x15e/0x310
[ 288.310457]  [811755e0] shrink_slab+0x100/0x140
[ 288.310463]  [81178306] kswapd_shrink_zone+0x116/0x1a0
[ 288.310469]  [8117925b] kswapd+0x4bb/0x9a0
[ 288.310475]  [81178da0] ? mem_cgroup_shrink_node_zone+0x1c0/0x1c0
[ 288.310481]  [8107a664] kthread+0xe4/0x100
[ 288.310488]  [8107a580] ? flush_kthread_worker+0xf0/0xf0
[ 288.310494]  [8173d66c] ret_from_fork+0x7c/0xb0
[ 288.310500]  [8107a580] ? flush_kthread_worker+0xf0/0xf0

I've not seen any hang tasks just yet. The server seems to continue working. I will do more testing and get back to you with more info.

Andrei
Re: [ceph-users] Giant + nfs over cephfs hang tasks
Ilya, not sure if the dmesg output in the previous email is related to cephfs, but from what I can see it looks good with your kernel. I would have seen hang tasks by now, but not anymore. I've run a bunch of concurrent dd tests and also the file touch tests and there are no more delays. So, it looks like you have nailed the bug!

Do you plan to backport the fix to the 3.16 or 3.17 branches?

Cheers

Andrei