This sounds credible. When I log in via Torque, I see the following:

    [binf316:fischega] $ ulimit -l
    64

but when I log in via ssh, I see:

    [binf316:fischega] $ ulimit -l
    unlimited

I'll have my administrator make the changes and give that a shot. Thanks, everyone!
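(Aside: "ulimit -l" reports the RLIMIT_MEMLOCK soft limit in kilobytes, so the Torque session above is capped at 64 KB of lockable memory while the ssh session is unlimited. Below is a minimal C sketch -- hypothetical, not part of the original thread -- that prints the limit the current process inherited; running it from an ssh shell and from inside a Torque job makes the difference visible, since pbs_mom's children inherit pbs_mom's limits.)

    /* check_memlock.c -- print the RLIMIT_MEMLOCK this process inherited.
     * Build: cc check_memlock.c -o check_memlock */
    #include <stdio.h>
    #include <sys/resource.h>

    static void show(const char *label, rlim_t v)
    {
        if (v == RLIM_INFINITY)
            printf("memlock %s limit: unlimited\n", label);
        else
            printf("memlock %s limit: %llu bytes\n", label,
                   (unsigned long long) v);
    }

    int main(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        show("soft", rl.rlim_cur);  /* ulimit -l shows this, in kilobytes */
        show("hard", rl.rlim_max);
        return 0;
    }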
_____________________________________________
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Wednesday, June 11, 2014 7:13 PM
To: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

If that could help, Greg, on the compute nodes I normally add this to /etc/security/limits.conf:

*  -  memlock  -1
*  -  stack    -1
*  -  nofile   32768

and

ulimit -n 32768
ulimit -l unlimited
ulimit -s unlimited

to either /etc/init.d/pbs_mom or /etc/sysconfig/pbs_mom (which should be sourced by the former). Other values are possible, of course. My recollection is that the boilerplate init scripts that come with Torque don't change those limits. I suppose this makes the pbs_mom child processes, including the user job script and whatever processes it starts (mpiexec, etc.), inherit those limits. Or not?

Gus Correa

On 06/11/2014 06:20 PM, Jeff Squyres (jsquyres) wrote:
> +1
>
> On Jun 11, 2014, at 6:01 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Yeah, I think we've seen that somewhere before too...
>>
>> On Jun 11, 2014, at 2:59 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>
>>> Agreed. The problem is not with UDCM. I don't think anything is wrong
>>> with the system. I think his Torque is imposing major constraints on
>>> the maximum size that can be locked into memory.
>>>
>>> Josh
>>>
>>> On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>>> Probably won't help to use RDMACM, though, as you will just see the
>>> resource failure somewhere else. UDCM is not the problem. Something is
>>> wrong with the system. Allocating a 512-entry CQ should not fail.
>>>
>>> -Nathan
>>>
>>> On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
>>>> I'm guessing it's a resource limitation issue coming from Torque.
>>>>
>>>> Hmmmm...I found something interesting on the interwebs that looks
>>>> awfully similar:
>>>>
>>>> http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html
>>>>
>>>> Greg, if the suggestion from the Torque users ("...adding the following
>>>> line 'ulimit -l unlimited' to pbs_mom and restarting pbs_mom.") doesn't
>>>> resolve your issue, try using the RDMACM CPC instead of UDCM (which is
>>>> a pretty recent addition to the openib BTL) by setting:
>>>>
>>>> -mca btl_openib_cpc_include rdmacm
>>>>
>>>> Josh
>>>>
>>>> On Wed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres)
>>>> <jsquy...@cisco.com> wrote:
>>>>
>>>> Mellanox --
>>>>
>>>> What would cause a CQ to fail to be created?
>>>>
>>>> On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A."
>>>> <fisch...@westinghouse.com> wrote:
>>>>
>>>> > Is there any other workaround that I might try? Something that
>>>> > avoids UDCM?
>>>> >
>>>> > -----Original Message-----
>>>> > From: Fischer, Greg A.
>>>> > Sent: Tuesday, June 10, 2014 2:59 PM
>>>> > To: Nathan Hjelm
>>>> > Cc: Open MPI Users; Fischer, Greg A.
>>>> > Subject: RE: [OMPI users] openib segfaults with Torque
>>>> >
>>>> > [binf316:fischega] $ ulimit -m
>>>> > unlimited
>>>> >
>>>> > Greg
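(Aside: Nathan's point is easy to test directly. A 512-entry completion queue is a tiny allocation, and creating one should fail only if the process cannot lock the memory behind it. Below is a minimal libibverbs sketch -- an illustration assuming one IB device and installed verbs headers, not Open MPI code -- that attempts the same kind of small CQ creation the UDCM setup performs.)

    /* cq_test.c -- try to create a small completion queue.
     * Expected to fail under a tight RLIMIT_MEMLOCK and to succeed
     * with "ulimit -l unlimited".
     * Build: cc cq_test.c -o cq_test -libverbs */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no IB devices found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        if (!ctx) {
            fprintf(stderr, "ibv_open_device failed\n");
            return 1;
        }

        /* 512 entries, no completion channel. */
        struct ibv_cq *cq = ibv_create_cq(ctx, 512, NULL, NULL, 0);
        if (!cq) {
            perror("ibv_create_cq");  /* often ENOMEM when memlock is low */
        } else {
            printf("created a 512-entry CQ ok\n");
            ibv_destroy_cq(cq);
        }

        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }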
>>>> > -----Original Message-----
>>>> > From: Nathan Hjelm [mailto:hje...@lanl.gov]
>>>> > Sent: Tuesday, June 10, 2014 2:58 PM
>>>> > To: Fischer, Greg A.
>>>> > Cc: Open MPI Users
>>>> > Subject: Re: [OMPI users] openib segfaults with Torque
>>>> >
>>>> > Out of curiosity, what is the mlock limit on your system? If it is too
>>>> > low, that can cause ibv_create_cq to fail. To check, run ulimit -m.
>>>> >
>>>> > -Nathan Hjelm
>>>> > Application Readiness, HPC-5, LANL
>>>> >
>>>> > On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
>>>> >> Yes, this fails on all nodes on the system, except for the head node.
>>>> >>
>>>> >> The uptime of the system isn't significant. Maybe 1 week, and it's
>>>> >> received basically no use.
>>>> >>
>>>> >> -----Original Message-----
>>>> >> From: Nathan Hjelm [mailto:hje...@lanl.gov]
>>>> >> Sent: Tuesday, June 10, 2014 2:49 PM
>>>> >> To: Fischer, Greg A.
>>>> >> Cc: Open MPI Users
>>>> >> Subject: Re: [OMPI users] openib segfaults with Torque
>>>> >>
>>>> >> Well, that's interesting. The output shows that ibv_create_cq is
>>>> >> failing. Strange, since an identical call had just succeeded (udcm
>>>> >> creates two completion queues). Some questions that might point to
>>>> >> where the failure is:
>>>> >>
>>>> >> Does this fail on any other node in your system?
>>>> >>
>>>> >> How long has the node been up?
>>>> >>
>>>> >> -Nathan Hjelm
>>>> >> Application Readiness, HPC-5, LANL
>>>> >>
>>>> >> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
>>>> >>> Jeff/Nathan,
>>>> >>>
>>>> >>> I ran the following with my debug build of OpenMPI 1.8.1 - after
>>>> >>> opening a terminal on a compute node with "qsub -l nodes=2 -I":
>>>> >>>
>>>> >>> mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &> output.txt
>>>> >>>
>>>> >>> Output and backtrace are attached. Let me know if I can provide
>>>> >>> anything else.
>>>> >>>
>>>> >>> Thanks for looking into this,
>>>> >>> Greg
>>>> >>>
>>>> >>> -----Original Message-----
>>>> >>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff
>>>> >>> Squyres (jsquyres)
>>>> >>> Sent: Tuesday, June 10, 2014 10:31 AM
>>>> >>> To: Nathan Hjelm
>>>> >>> Cc: Open MPI Users
>>>> >>> Subject: Re: [OMPI users] openib segfaults with Torque
>>>> >>>
>>>> >>> Greg:
>>>> >>>
>>>> >>> Can you run with "--mca btl_base_verbose 100" on your debug build so
>>>> >>> that we can get some additional output to see why UDCM is failing to
>>>> >>> set up properly?
>>>> >>>
>>>> >>> On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
>>>> >>>
>>>> >>>> On Tue, Jun 10, 2014 at 12:10:28AM +0000, Jeff Squyres (jsquyres) wrote:
>>>> >>>>> I seem to recall that you have an IB-based cluster, right?
>>>> >>>>>
>>>> >>>>> From a *very quick* glance at the code, it looks like this might
>>>> >>>>> be a simple incorrect-finalization issue. That is:
>>>> >>>>>
>>>> >>>>> - you run the job on a single server
>>>> >>>>> - openib disqualifies itself because you're running on a single
>>>> >>>>>   server
>>>> >>>>> - openib then goes to finalize/close itself
>>>> >>>>> - but openib didn't fully initialize itself (because it
>>>> >>>>>   disqualified itself early in the initialization process), and
>>>> >>>>>   something in the finalization process didn't take that into
>>>> >>>>>   account
>>>> >>>>>
>>>> >>>>> Nathan -- is that anywhere close to correct?
>>>> >>>>
>>>> >>>> Nope. udcm_module_finalize is being called because there was an
>>>> >>>> error setting up the udcm state. See btl_openib_connect_udcm.c:476.
>>>> >>>> The opal_list_t destructor is getting an assert failure.
>>>> >>>> Probably because the constructor wasn't called. I can rearrange the
>>>> >>>> constructors to be called first, but there appears to be a deeper
>>>> >>>> issue with the user's system: udcm_module_init should not be
>>>> >>>> failing! It creates a couple of CQs, allocates a small number of
>>>> >>>> registered buffers, and starts monitoring the fd for the completion
>>>> >>>> channel. All these things are also done in the setup of the openib
>>>> >>>> btl itself. Keep in mind that the openib btl will not disqualify
>>>> >>>> itself when running on a single server. Openib may be used to
>>>> >>>> communicate on-node and is needed for the dynamics case.
>>>> >>>>
>>>> >>>> The user might try adding -mca btl_base_verbose 100 to shed some
>>>> >>>> light on what the real issue is.
>>>> >>>>
>>>> >>>> BTW, I no longer monitor the user mailing list. If something needs
>>>> >>>> my attention, forward it to me directly.
>>>> >>>>
>>>> >>>> -Nathan
>>>> >>>
>>>> >>> --
>>>> >>> Jeff Squyres
>>>> >>> jsquy...@cisco.com
>>>> >>> For corporate legal information go to:
>>>> >>> http://www.cisco.com/web/about/doing_business/legal/cri/
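(Aside: the assertion in the backtrace below is Open MPI's object magic-number guard: a constructor stamps a magic value into each opal_object_t, and the destructor asserts on it, so destructing a member whose constructor never ran fails exactly this way. A generic C sketch of the pattern follows -- illustrative names only, not the actual opal implementation; build without -DNDEBUG so the assert is live.)

    /* Sketch of a constructor/destructor magic-id guard, in the style of
     * the assertion quoted below. Names are hypothetical. */
    #include <assert.h>
    #include <stdint.h>

    #define OBJ_MAGIC ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

    typedef struct {
        uint64_t obj_magic_id;
        /* ... real fields would follow ... */
    } toy_object_t;

    static void toy_construct(toy_object_t *o) { o->obj_magic_id = OBJ_MAGIC; }

    static void toy_destruct(toy_object_t *o)
    {
        /* Fires if the constructor never ran -- the failure mode seen in
           udcm_module_finalize. */
        assert(OBJ_MAGIC == o->obj_magic_id);
        o->obj_magic_id = 0;
    }

    int main(void)
    {
        toy_object_t a = {0};
        toy_construct(&a);
        toy_destruct(&a);   /* fine: constructed first */

        toy_object_t b = {0};
        toy_destruct(&b);   /* asserts: destruct on unconstructed object */
        return 0;
    }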
>>>> >>> Core was generated by `ring_c'.
>>>> >>> Program terminated with signal 6, Aborted.
>>>> >>> #0  0x00007f8b6ae1cb55 in raise () from /lib64/libc.so.6
>>>> >>> #1  0x00007f8b6ae1e0c5 in abort () from /lib64/libc.so.6
>>>> >>> #2  0x00007f8b6ae15a10 in __assert_fail () from /lib64/libc.so.6
>>>> >>> #3  0x00007f8b664b684b in udcm_module_finalize (btl=0x717060, cpc=0x7190c0)
>>>> >>>     at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
>>>> >>> #4  0x00007f8b664b5474 in udcm_component_query (btl=0x717060, cpc=0x718a48)
>>>> >>>     at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
>>>> >>> #5  0x00007f8b664ae316 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x717060)
>>>> >>>     at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
>>>> >>> #6  0x00007f8b66497817 in btl_openib_component_init (num_btl_modules=0x7fffe34cebe0, enable_progress_threads=false, enable_mpi_threads=false)
>>>> >>>     at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
>>>> >>> #7  0x00007f8b6b43fa5e in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false)
>>>> >>>     at ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
>>>> >>> #8  0x00007f8b666d9d42 in mca_bml_r2_component_init (priority=0x7fffe34cecb4, enable_progress_threads=false, enable_mpi_threads=false)
>>>> >>>     at ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
>>>> >>> #9  0x00007f8b6b43ed1b in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false)
>>>> >>>     at ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
>>>> >>> #10 0x00007f8b655ff739 in mca_pml_ob1_component_init (priority=0x7fffe34cedf0, enable_progress_threads=false, enable_mpi_threads=false)
>>>> >>>     at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
>>>> >>> #11 0x00007f8b6b4659b2 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false)
>>>> >>>     at ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128
>>>> >>> #12 0x00007f8b6b3d233c in ompi_mpi_init (argc=1, argv=0x7fffe34cf0e8, requested=0, provided=0x7fffe34cef98)
>>>> >>>     at ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604
>>>> >>> #13 0x00007f8b6b407386 in PMPI_Init (argc=0x7fffe34cefec, argv=0x7fffe34cefe0) at pinit.c:84
>>>> >>> #14 0x000000000040096f in main (argc=1, argv=0x7fffe34cf0e8) at ring_c.c:19
>>>> >>>
>>>> >>> [binf316:24591] mca: base: components_register: registering btl components
>>>> >>> [binf316:24591] mca: base: components_register: found loaded component openib
>>>> >>> [binf316:24592] mca: base: components_register: registering btl components
>>>> >>> [binf316:24592] mca: base: components_register: found loaded component openib
>>>> >>> [binf316:24591] mca: base: components_register: component openib register function successful
>>>> >>> [binf316:24591] mca: base: components_register: found loaded component self
>>>> >>> [binf316:24591] mca: base: components_register: component self register function successful
>>>> >>> [binf316:24591] mca: base: components_open: opening btl components
>>>> >>> [binf316:24591] mca: base: components_open: found loaded component openib
>>>> >>> [binf316:24591] mca: base: components_open: component openib open function successful
>>>> >>> [binf316:24591] mca: base: components_open: found loaded component self
>>>> >>> [binf316:24591] mca: base: components_open: component self open function successful
>>>> >>> [binf316:24592] mca: base: components_register: component openib register function successful
>>>> >>> [binf316:24592] mca: base: components_register: found loaded component self
>>>> >>> [binf316:24592] mca: base: components_register: component self register function successful
>>>> >>> [binf316:24592] mca: base: components_open: opening btl components
>>>> >>> [binf316:24592] mca: base: components_open: found loaded component openib
>>>> >>> [binf316:24592] mca: base: components_open: component openib open function successful
>>>> >>> [binf316:24592] mca: base: components_open: found loaded component self
>>>> >>> [binf316:24592] mca: base: components_open: component self open function successful
>>>> >>> [binf316:24591] select: initializing btl component openib
>>>> >>> [binf316:24592] select: initializing btl component openib
>>>> >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:364:add_rdma_addr] Adding addr 9.9.10.75 (0x4b0a0909) subnet 0x9090000 as mlx4_0:1
>>>> >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:364:add_rdma_addr] Adding addr 9.9.10.75 (0x4b0a0909) subnet 0x9090000 as mlx4_0:1
>>>> >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:686:init_one_port] looking for mlx4_0:1 GID index 0
>>>> >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:717:init_one_port] my IB subnet_id for HCA mlx4_0 port 1 is fe80000000000000
>>>> >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1294:setup_qps] pp: rd_num is 256 rd_low is 192 rd_win 128 rd_rsv 4
>>>> >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
>>>> >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
>>>> >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
>>>> >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1840:rdmacm_component_query] rdmacm_component_query
>>>> >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:132:mca_btl_openib_rdma_get_ipv4addr] Looking for mlx4_0:1 in IP address list
>>>> >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:141:mca_btl_openib_rdma_get_ipv4addr] FOUND: mlx4_0:1 is 9.9.10.75 (0x4b0a0909)
>>>> >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1750:ipaddrcheck] Found device mlx4_0:1 = IP address 9.9.10.75 (0x4b0a0909):51845
>>>> >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1776:ipaddrcheck] creating new server to listen on 9.9.10.75 (0x4b0a0909):51845
>>>> >>> [binf316:24591] openib BTL: rdmacm CPC available for use on mlx4_0:1
>>>> >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:542:udcm_module_init] created cpc module 0x719220 for btl 0x716ee0
>>>> >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:686:init_one_port] looking for mlx4_0:1 GID index 0
>>>> >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:717:init_one_port] my IB subnet_id for HCA mlx4_0 port 1 is fe80000000000000
>>>> >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:565:udcm_module_init] error creating ud send completion queue
>>>> >>> ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
>>>> >>> [binf316:24591] *** Process received signal ***
>>>> >>> [binf316:24591] Signal: Aborted (6)
>>>> >>> [binf316:24591] Signal code: (-6)
>>>> >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1294:setup_qps] pp: rd_num is 256 rd_low is 192 rd_win 128 rd_rsv 4
>>>> >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
>>>> >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
>>>> >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
>>>> >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1840:rdmacm_component_query] rdmacm_component_query
>>>> >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:132:mca_btl_openib_rdma_get_ipv4addr] Looking for mlx4_0:1 in IP address list
>>>> >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:141:mca_btl_openib_rdma_get_ipv4addr] FOUND: mlx4_0:1 is 9.9.10.75 (0x4b0a0909)
>>>> >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1750:ipaddrcheck] Found device mlx4_0:1 = IP address 9.9.10.75 (0x4b0a0909):57734
>>>> >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1776:ipaddrcheck] creating new server to listen on 9.9.10.75 (0x4b0a0909):57734
>>>> >>> [binf316:24592] openib BTL: rdmacm CPC available for use on mlx4_0:1
>>>> >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:542:udcm_module_init] created cpc module 0x7190c0 for btl 0x717060
>>>> >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:565:udcm_module_init] error creating ud send completion queue
>>>> >>> ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
>>>> >>> [binf316:24592] *** Process received signal ***
>>>> >>> [binf316:24592] Signal: Aborted (6)
>>>> >>> [binf316:24592] Signal code: (-6)
>>>> >>> [binf316:24591] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7fb35959c7c0]
>>>> >>> [binf316:24591] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fb359248b55]
>>>> >>> [binf316:24591] [ 2] /lib64/libc.so.6(abort+0x181)[0x7fb35924a131]
>>>> >>> [binf316:24591] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7fb359241a10]
>>>> >>> [binf316:24591] [ 4] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7fb3548e284b]
>>>> >>> [binf316:24591] [ 5] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7fb3548e1474]
>>>> >>> [binf316:24591] [ 6] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7fb3548da316]
>>>> >>> [binf316:24591] [ 7] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7fb3548c3817]
>>>> >>> [binf316:24591] [ 8]
>>>> >>> [binf316:24592] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f8b6b1707c0]
>>>> >>> [binf316:24592] [ 1] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7fb35986ba5e]
>>>> >>> [binf316:24591] [ 9] /lib64/libc.so.6(gsignal+0x35)[0x7f8b6ae1cb55]
>>>> >>> [binf316:24592] [ 2] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7fb354b05d42]
>>>> >>> [binf316:24591] [10] /lib64/libc.so.6(abort+0x181)[0x7f8b6ae1e131]
>>>> >>> [binf316:24592] [ 3] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7fb35986ad1b]
>>>> >>> [binf316:24591] [11] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7fb353a2b739]
>>>> >>> [binf316:24591] [12] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f8b6ae15a10]
>>>> >>> [binf316:24592] [ 4] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f8b664b684b]
>>>> >>> [binf316:24592] [ 5] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f8b664b5474]
>>>> >>> [binf316:24592] [ 6] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7fb3598919b2]
>>>> >>> [binf316:24591] [13] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f8b664ae316]
>>>> >>> [binf316:24592] [ 7] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f8b66497817]
>>>> >>> [binf316:24592] [ 8] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7fb3597fe33c]
>>>> >>> [binf316:24591] [14] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f8b6b43fa5e]
>>>> >>> [binf316:24592] [ 9] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f8b666d9d42]
>>>> >>> [binf316:24592] [10] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7fb359833386]
>>>> >>> [binf316:24591] [15] ring_c[0x40096f]
>>>> >>> [binf316:24591] [16] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f8b6b43ed1b]
>>>> >>> [binf316:24592] [11] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f8b655ff739]
>>>> >>> [binf316:24592] [12] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f8b6b4659b2]
>>>> >>> [binf316:24592] [13] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7fb359234c36]
>>>> >>> [binf316:24591] [17] ring_c[0x400889]
>>>> >>> [binf316:24591] *** End of error message ***
>>>> >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f8b6b3d233c]
>>>> >>> [binf316:24592] [14] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f8b6b407386]
>>>> >>> [binf316:24592] [15] ring_c[0x40096f]
>>>> >>> [binf316:24592] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f8b6ae08c36]
>>>> >>> [binf316:24592] [17] ring_c[0x400889]
>>>> >>> [binf316:24592] *** End of error message ***
>>>> >>> --------------------------------------------------------------------------
>>>> >>> mpirun noticed that process rank 0 with PID 24591 on node xxxx316
>>>> >>> exited on signal 6 (Aborted).
>>>> >>> --------------------------------------------------------------------------
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/