Hi John,

On Fri, 31 Aug 2012, John C. Wright wrote:
> An update,
> While looking into how to switch over to a different network on my ceph 
> cluster (another question altogether), I discovered that during my upgrade 
> from 0.47.2 to 0.48 on the three nodes my 'git checkout 0.48' wasn't done 
> correctly on two nodes, so I wound up reinstalling 0.47.2 on those and 
> somehow had a heterogeneous cluster running 0.47.2 on the osds and 0.48 on 
> the mds (all three running monitors).
> 
> So I wiped everything out and started fresh with 0.48 and still got the 
> error, but with more info this time.
> This MPI code runs as 14 processes on two client nodes. Each process 
> writes to its own 'stdout' file, and some other data files are created by 
> process '0'. The program starts up, creates its initial files, and begins 
> to write to its stdout files. Normally this proceeds with writes to the 
> stdout files at periodic intervals during the run, but on the ceph volume 
> it freezes up after about one minute.
> 
> Symptoms: initially I can still access the ceph volume from other clients. 
> Listing the working directory of the MPI code is very slow and soon becomes 
> unresponsive, but I can still list other directories on the ceph volume. 
> After another minute or so, ceph clients can no longer access the volume at 
> all without locking up in a trace that ends in a 'fastpath' kernel call. If 
> I CTRL-C out of the mpirun call within the first minute, everything 
> recovers, but waiting longer than that requires a reboot of the mounted 
> nodes and a restart of ceph to clear things up.
> 
> Below are relevant (I hope) process traces from dmesg and ceph logs. Any help 
> on diagnosing this would be greatly appreciated. We're hoping to use ceph as 
> a parallel file system on a Beowulf cluster for scientific workloads, 
> initially with a buyer-beware policy and only for transient, reproducible 
> data, moving to more general usage as ceph stabilizes and reaches the 1.0 
> milestone.

It looks a bit like this is a problem on the MDS side of things, since you
have both a hung request and a writer waiting for caps.  Can you generate 
an MDS log that goes with this workload?  With

        debug ms = 1
        debug mds = 20

in the [mds] section of your config?  Also, once it hangs, it would be 
helpful to see what the hung request is (cat 
/sys/kernel/debug/ceph/*/mdsc) and the inode number for the hung writer 
(stat /proc/$pid/fd/NNN).  Hopefully the stat won't hang, but if it does, 
you can probably identify the file's inode number or filename some other 
way.
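
If it helps, the collection step once it hangs would look roughly like 
this; $PID and $FD below are just placeholders for whichever writer turns 
out to be stuck, not values from your run:

        # on the hung client (assumes debugfs is mounted at /sys/kernel/debug)
        cat /sys/kernel/debug/ceph/*/mdsc      # lists the in-flight MDS requests

        # map the hung writer's open fds to filenames, then stat the stuck one
        # (-L makes stat follow the fd symlink to the file itself)
        ls -l /proc/$PID/fd
        stat -L /proc/$PID/fd/$FD              # shows the file's inode number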

Thanks!
sage


> 
> Thanks very much.
> 
> -john wright
> 
> Log snips:
> 
> ceph.log on mds when code starts at 15:34
> 2012-08-30 15:34:46.915165 mon.0 10.0.0.17:6789/0 414 : [INF] pgmap v396: 384 
> pgs: 384 active+clean; 37085 MB data, 198 GB used, 21841 GB / 27939 GB avail
> 2012-08-30 15:34:47.921910 mon.0 10.0.0.17:6789/0 415 : [INF] pgmap v397: 384 
> pgs: 384 active+clean; 37085 MB data, 198 GB used, 21841 GB / 27939 GB avail
> 2012-08-30 15:34:52.586958 mon.0 10.0.0.17:6789/0 416 : [INF] pgmap v398: 384 
> pgs: 384 active+clean; 37085 MB data, 198 GB used, 21841 GB / 27939 GB avail
> 2012-08-30 15:35:01.920783 mon.0 10.0.0.17:6789/0 417 : [INF] pgmap v399: 384 
> pgs: 384 active+clean; 37088 MB data, 198 GB used, 21841 GB / 27939 GB avail
> 2012-08-30 15:35:02.926615 mon.0 10.0.0.17:6789/0 418 : [INF] pgmap v400: 384 
> pgs: 384 active+clean; 37091 MB data, 198 GB used, 21841 GB / 27939 GB avail
> 2012-08-30 15:35:06.922546 mon.0 10.0.0.17:6789/0 419 : [INF] pgmap v401: 384 
> pgs: 384 active+clean; 37102 MB data, 198 GB used, 21841 GB / 27939 GB avail
> 2012-08-30 15:35:07.929265 mon.0 10.0.0.17:6789/0 420 : [INF] pgmap v402: 384 
> pgs: 384 active+clean; 37111 MB data, 198 GB used, 21841 GB / 27939 GB avail
> 2012-08-30 15:35:17.594883 mon.0 10.0.0.17:6789/0 421 : [INF] pgmap v403: 384 
> pgs: 384 active+clean; 37111 MB data, 198 GB used, 21841 GB / 27939 GB avail
> 
> dmesg trace:
> one of the mpi processes:
> [85298.841748] hydra_pmi_proxy S ffffffff81808440     0  5441   5412 0x00000080
> [85298.841752]  ffff8805d93e1968 0000000000000082 ffff8805d93e1978 ffffffffa02f588c
> [85298.841758]  ffff88061803c590 ffff8805d93e1fd8 ffff8805d93e1fd8 ffff8805d93e1fd8
> [85298.841764]  ffff88061e7a2e60 ffff88061803c590 ffff88063ffedc08 ffff88061803c590
> [85298.841769] Call Trace:
> [85298.841779]  [<ffffffffa02f588c>] ? try_get_cap_refs+0x10c/0x320 [ceph]
> [85298.841783]  [<ffffffff815f323f>] schedule+0x3f/0x60
> [85298.841794]  [<ffffffffa02f7ee5>] ceph_get_caps+0x135/0x1f0 [ceph]
> [85298.841798]  [<ffffffff8107a7d0>] ? remove_wait_queue+0x50/0x50
> [85298.841807]  [<ffffffffa02ed5ae>] ceph_aio_read+0xee/0x880 [ceph]
> [85298.841812]  [<ffffffff81178c3e>] ? memcg_check_events+0x2e/0x1c0
> [85298.841817]  [<ffffffff81179925>] ? __mem_cgroup_commit_charge+0x145/0x300
> [85298.841821]  [<ffffffff8117b3aa>] ? mem_cgroup_charge_common+0xba/0x100
> [85298.841825]  [<ffffffff81180ba2>] do_sync_read+0xd2/0x110
> [85298.841830]  [<ffffffff81269773>] ? security_file_permission+0x93/0xb0
> [85298.841834]  [<ffffffff81181011>] ? rw_verify_area+0x61/0xf0
> [85298.841837]  [<ffffffff811814f0>] vfs_read+0xb0/0x180
> [85298.841841]  [<ffffffff81186d36>] kernel_read+0x46/0x60
> [85298.841845]  [<ffffffff811d1b6e>] load_elf_binary+0x1ae/0x1a10
> [85298.841850]  [<ffffffff811477c2>] ? get_user_pages+0x52/0x60
> [85298.841854]  [<ffffffff81186936>] ? get_arg_page+0xa6/0xe0
> [85298.841858]  [<ffffffff8118760d>] search_binary_handler+0xed/0x370
> [85298.841863]  [<ffffffff811d19c0>] ? do_mmap+0x40/0x40
> [85298.841867]  [<ffffffff81188acf>] do_execve_common+0x36f/0x410
> [85298.841871]  [<ffffffff81188baa>] do_execve+0x3a/0x40
> [85298.841875]  [<ffffffff8101d737>] sys_execve+0x47/0x70
> [85298.841879]  [<ffffffff815fc72c>] stub_execve+0x6c/0xc0
> 
> locked up ls:
> [85298.842119] ls              D 000000000000000a     0  5451   4964 0x00000084
> [85298.842124]  ffff88061733bc58 0000000000000086 0000000000000072 0000000000000287
> [85298.842129]  ffff880617299730 ffff88061733bfd8 ffff88061733bfd8 ffff88061733bfd8
> [85298.842135]  ffff880c1ac3dcc0 ffff880617299730 ffff880617247e00 7fffffffffffffff
> [85298.842140] Call Trace:
> [85298.842144]  [<ffffffff815f323f>] schedule+0x3f/0x60
> [85298.842148]  [<ffffffff815f1945>] schedule_timeout+0x2a5/0x320
> [85298.842158]  [<ffffffffa02a8d05>] ? ceph_con_send+0xf5/0x1c0 [libceph]
> [85298.842162]  [<ffffffff815f30d7>] wait_for_common+0x127/0x170
> [85298.842166]  [<ffffffff81088e30>] ? try_to_wake_up+0x2b0/0x2b0
> [85298.842170]  [<ffffffff815f315d>] wait_for_completion_killable+0x1d/0x30
> [85298.842182]  [<ffffffffa0301e3d>] ceph_mdsc_do_request+0x12d/0x2a0 [ceph]
> [85298.842191]  [<ffffffffa02ea379>] ceph_do_getattr+0xe9/0x1c0 [ceph]
> [85298.842199]  [<ffffffffa02ea474>] ceph_getattr+0x24/0x100 [ceph]
> [85298.842204]  [<ffffffff8118c873>] ? putname+0x33/0x50
> [85298.842208]  [<ffffffff81185fae>] vfs_getattr+0x4e/0x80
> [85298.842212]  [<ffffffff8118602e>] vfs_fstatat+0x4e/0x70
> [85298.842216]  [<ffffffff8118606e>] vfs_lstat+0x1e/0x20
> [85298.842220]  [<ffffffff8118620a>] sys_newlstat+0x1a/0x40
> [85298.842224]  [<ffffffff810d340c>] ? __audit_syscall_entry+0xcc/0x310
> [85298.842228]  [<ffffffff810d3a26>] ? __audit_syscall_exit+0x3d6/0x410
> [85298.842232]  [<ffffffff815fc2e9>] system_call_fastpath+0x16/0x1b
> 
> 
> 
> 
> On Aug 28, 2012, at 11:54 AM, John Wright wrote:
> 
> > Hi Sage,
> > Thanks for the suggestion.
> > I've attached two files with the dump from dmesg with the trigger set to 
> > true. They are for two nodes running 7 parallel processes each. The 
> > parallel process was launched via mpirun on 35e and you can see each of the 
> > parallel processes in its trace ('mitgcm') along with one ceph process.  
> > On the second node (37w) you see lots of ceph messages but no mitgcm 
> > processes (though there are mitgcm processes listed by 'ps' with state 
> > 'D'). This is the node on which the mounted ceph filesystem locks up and 
> > any process accessing it goes into state 'D': uninterruptible sleep. So 
> > far, it has always been the second node on which ceph locks. The lock up 
> > happens as soon as the executable is started, presumably when it tries to 
> > open several new files from the 14 processes. Of course, an identical test 
> > under NFS-mounted volumes works fine.
> > 
> > Interestingly, I did this after rebooting 37w from the last freeze-up of 
> > the ceph volume. The first few times, the application still froze with 
> > state 'D' on the second node and 'S' on the first, but would terminate when 
> > mpirun was killed. On the fourth try, though, the ceph volume fully locked 
> > up, requiring a reboot. In the previous two tries the ceph lockup had 
> > occurred on the first mpirun test of mitgcm.
> > 
> > Thanks for any feedback you can give on this,
> > 
> > -John
> > 
> > The files are at
> > https://dl.dropbox.com/u/9688196/80_35e_dmsesg_gcm.log.1.1.txt
> > https://dl.dropbox.com/u/9688196/80_37w_dmesg.gcm.log.1.txt
> > 
> > (I tried to post yesterday, but it never made it to the list; maybe the 
> > files were too big?)
> > 
> > And because I forgot to CC the list, here's my reply to Mark's note:
> >> Hi Mark,
> >> the MPI application is attempting to open multiple files, one from each 
> >> process; I believe it was at this step that it hung.  Several files are 
> >> opened by process 0 for summary output. I cannot rule out that there is 
> >> some concurrent opening or access going on, since I am not the author of 
> >> the code. That is one of my suspects, and I'm writing some small programs 
> >> to test it.  I'm not sure whether the ceph OSDs ever received a request or 
> >> whether things got locked up at the level of the ceph kernel module on the 
> >> node. 
> >> 
> >> I appreciate any suggestions. Does anyone know if ceph has been tested in 
> >> a parallel application environment, where there is often a lot of file I/O 
> >> concurrency?
> >> thanks,
> >> -john
> > 
> > 
> > On Aug 27, 2012, at 12:06 AM, Sage Weil wrote:
> > 
> >> Hi John,
> >> 
> >> Can you dump running threads?  'echo t > /proc/sysrq-trigger' and then 
> >> attach the output (from dmesg or kern.log).
> >> 
> >> Thanks!
> >> sage
> >> 
> >> 
> >> On Sun, 26 Aug 2012, John Wright wrote:
> >> 
> >>> Hi All,
> >>> We're running ceph 0.48 on a small three-node test cluster. We've had good 
> >>> stability with I/O using dd and iozone especially after upgrading to 
> >>> 0.48. However, we're running into a repeatable lockup of the linux ceph 
> >>> client (3.3.5-2.fc16.x86_64) when running an MPI program that has 
> >>> simple I/O on a ceph mount. This is an mpi program running processes on 
> >>> two nodes. It is the remote node on which the ceph client locks up. The 
> >>> client becomes immediately unresponsive, and any attempt to access the 
> >>> mounted volume produces a process with status 'D'. I can see no 
> >>> indication in the server logs that it is ever contacted. Regular serial 
> >>> processes run fine on the volume. MPI runs on the nodes work fine when 
> >>> not using the ceph volume.
> >>> 
> >>> So, any suggestions on where to look? Anyone have experience testing 
> >>> parallel programs on ceph?
> >>> 
> >>> thanks,
> >>> -john
> >>> 
> > 
> 
