Jeff/Nathan,

I ran the following with my debug build of Open MPI 1.8.1, after opening a 
terminal on a compute node with "qsub -l nodes=2 -I":

        mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &> output.txt

Output and backtrace are attached. Let me know if I can provide anything else.
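For reference, the attached backtrace came from loading the core file from the 
aborted rank into gdb, roughly as follows (the exact core file name and location 
depend on the node's core-dump settings, so treat this as a sketch):

        gdb ./ring_c <core-file>
        (gdb) bt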

Thanks for looking into this,
Greg

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres 
(jsquyres)
Sent: Tuesday, June 10, 2014 10:31 AM
To: Nathan Hjelm
Cc: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque

Greg: 

Can you run with "--mca btl_base_verbose 100" on your debug build so that we 
can get some additional output to see why UDCM is failing to setup properly?



On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:

> On Tue, Jun 10, 2014 at 12:10:28AM +0000, Jeff Squyres (jsquyres) wrote:
>> I seem to recall that you have an IB-based cluster, right?
>> 
>> From a *very quick* glance at the code, it looks like this might be a simple 
>> incorrect-finalization issue.  That is:
>> 
>> - you run the job on a single server
>> - openib disqualifies itself because you're running on a single 
>> server
>> - openib then goes to finalize/close itself
>> - but openib didn't fully initialize itself (because it disqualified 
>> itself early in the initialization process), and something in the 
>> finalization process didn't take that into account
>> 
>> Nathan -- is that anywhere close to correct?
> 
> Nope. udcm_module_finalize is being called because there was an error 
> setting up the udcm state. See btl_openib_connect_udcm.c:476. The 
> opal_list_t destructor is getting an assert failure, probably because 
> the constructor wasn't called. I can rearrange the constructors to be 
> called first, but there appears to be a deeper issue with the user's 
> system: udcm_module_init should not be failing! It creates a couple of 
> CQs, allocates a small number of registered buffers, and starts 
> monitoring the fd for the completion channel. All of these things are 
> also done in the setup of the openib btl itself. Keep in mind that the 
> openib btl will not disqualify itself when running on a single server; 
> openib may be used for on-node communication and is needed for the 
> dynamic-process case.
> 
> The user might try adding -mca btl_base_verbose 100 to shed some light 
> on what the real issue is.
> 
> BTW, I no longer monitor the user mailing list. If something needs my 
> attention forward it to me directly.
> 
> -Nathan


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
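
A minimal sketch of the failure mode Nathan describes, using Open MPI's 
OBJ_CONSTRUCT/OBJ_DESTRUCT object macros. This is illustrative only, not the 
real btl_openib_connect_udcm.c: the struct and helper names are invented, but 
the opal_list_t member and the debug-build magic-id assertion match the 
backtrace below.

    /* Sketch only.  In a debug build, OBJ_DESTRUCT() asserts that the
     * object's obj_magic_id was set by OBJ_CONSTRUCT(); destructing a
     * never-constructed opal_list_t trips exactly the assert shown in
     * the attached backtrace. */
    #include "opal/class/opal_list.h"

    struct udcm_module_sketch {
        opal_list_t cm_recv_msg_queue;
        /* ... CQs, registered buffers, completion-channel fd ... */
    };

    /* Hypothetical helper standing in for the ibv_create_cq() calls;
     * this is the step that fails on the user's system ("error creating
     * ud send completion queue"). */
    static int sketch_create_cqs(struct udcm_module_sketch *m);

    static void sketch_finalize(struct udcm_module_sketch *m)
    {
        /* Asserts on obj_magic_id if the constructor never ran. */
        OBJ_DESTRUCT(&m->cm_recv_msg_queue);
    }

    static int sketch_init(struct udcm_module_sketch *m)
    {
        /* CQ creation runs before the list constructors... */
        if (0 != sketch_create_cqs(m)) {
            /* ...so this error path finalizes a module whose
             * cm_recv_msg_queue was never constructed. */
            sketch_finalize(m);
            return -1;
        }

        OBJ_CONSTRUCT(&m->cm_recv_msg_queue, opal_list_t);
        return 0;
    }

    /* The rearrangement Nathan mentions: run the OBJ_CONSTRUCT() calls at
     * the very top of init, before anything that can fail, so the error
     * path can safely call finalize.  That removes the assert but not the
     * underlying CQ-creation failure. */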


Core was generated by `ring_c'.
Program terminated with signal 6, Aborted.
#0  0x00007f8b6ae1cb55 in raise () from /lib64/libc.so.6
#0  0x00007f8b6ae1cb55 in raise () from /lib64/libc.so.6
#1  0x00007f8b6ae1e0c5 in abort () from /lib64/libc.so.6
#2  0x00007f8b6ae15a10 in __assert_fail () from /lib64/libc.so.6
#3  0x00007f8b664b684b in udcm_module_finalize (btl=0x717060, cpc=0x7190c0) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
#4  0x00007f8b664b5474 in udcm_component_query (btl=0x717060, cpc=0x718a48) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
#5  0x00007f8b664ae316 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x717060) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
#6  0x00007f8b66497817 in btl_openib_component_init (num_btl_modules=0x7fffe34cebe0, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
#7  0x00007f8b6b43fa5e in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
#8  0x00007f8b666d9d42 in mca_bml_r2_component_init (priority=0x7fffe34cecb4, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
#9  0x00007f8b6b43ed1b in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
#10 0x00007f8b655ff739 in mca_pml_ob1_component_init (priority=0x7fffe34cedf0, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
#11 0x00007f8b6b4659b2 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128
#12 0x00007f8b6b3d233c in ompi_mpi_init (argc=1, argv=0x7fffe34cf0e8, requested=0, provided=0x7fffe34cef98) at ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604
#13 0x00007f8b6b407386 in PMPI_Init (argc=0x7fffe34cefec, argv=0x7fffe34cefe0) at pinit.c:84
#14 0x000000000040096f in main (argc=1, argv=0x7fffe34cf0e8) at ring_c.c:19

[binf316:24591] mca: base: components_register: registering btl components
[binf316:24591] mca: base: components_register: found loaded component openib
[binf316:24592] mca: base: components_register: registering btl components
[binf316:24592] mca: base: components_register: found loaded component openib
[binf316:24591] mca: base: components_register: component openib register 
function successful
[binf316:24591] mca: base: components_register: found loaded component self
[binf316:24591] mca: base: components_register: component self register 
function successful
[binf316:24591] mca: base: components_open: opening btl components
[binf316:24591] mca: base: components_open: found loaded component openib
[binf316:24591] mca: base: components_open: component openib open function 
successful
[binf316:24591] mca: base: components_open: found loaded component self
[binf316:24591] mca: base: components_open: component self open function 
successful
[binf316:24592] mca: base: components_register: component openib register 
function successful
[binf316:24592] mca: base: components_register: found loaded component self
[binf316:24592] mca: base: components_register: component self register 
function successful
[binf316:24592] mca: base: components_open: opening btl components
[binf316:24592] mca: base: components_open: found loaded component openib
[binf316:24592] mca: base: components_open: component openib open function 
successful
[binf316:24592] mca: base: components_open: found loaded component self
[binf316:24592] mca: base: components_open: component self open function 
successful
[binf316:24591] select: initializing btl component openib
[binf316:24592] select: initializing btl component openib
[binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:364:add_rdma_addr]
 Adding addr 9.9.10.75 (0x4b0a0909) subnet 0x9090000 as mlx4_0:1
[binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:364:add_rdma_addr]
 Adding addr 9.9.10.75 (0x4b0a0909) subnet 0x9090000 as mlx4_0:1
[binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:686:init_one_port]
 looking for mlx4_0:1 GID index 0
[binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:717:init_one_port]
 my IB subnet_id for HCA mlx4_0 port 1 is fe80000000000000
[binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1294:setup_qps]
 pp: rd_num is 256 rd_low is 192 rd_win 128 rd_rsv 4
[binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps]
 srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps]
 srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps]
 srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1840:rdmacm_component_query]
 rdmacm_component_query
[binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:132:mca_btl_openib_rdma_get_ipv4addr]
 Looking for mlx4_0:1 in IP address list
[binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:141:mca_btl_openib_rdma_get_ipv4addr]
 FOUND: mlx4_0:1 is 9.9.10.75 (0x4b0a0909)
[binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1750:ipaddrcheck]
 Found device mlx4_0:1 = IP address 9.9.10.75 (0x4b0a0909):51845
[binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1776:ipaddrcheck]
 creating new server to listen on 9.9.10.75 (0x4b0a0909):51845
[binf316:24591] openib BTL: rdmacm CPC available for use on mlx4_0:1
[binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:542:udcm_module_init]
 created cpc module 0x719220 for btl 0x716ee0
[binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:686:init_one_port]
 looking for mlx4_0:1 GID index 0
[binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:717:init_one_port]
 my IB subnet_id for HCA mlx4_0 port 1 is fe80000000000000
[binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:565:udcm_module_init]
 error creating ud send completion queue
ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:24591] *** Process received signal ***
[binf316:24591] Signal: Aborted (6)
[binf316:24591] Signal code:  (-6)
[binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1294:setup_qps]
 pp: rd_num is 256 rd_low is 192 rd_win 128 rd_rsv 4
[binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps]
 srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps]
 srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps]
 srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
[binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1840:rdmacm_component_query]
 rdmacm_component_query
[binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:132:mca_btl_openib_rdma_get_ipv4addr]
 Looking for mlx4_0:1 in IP address list
[binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:141:mca_btl_openib_rdma_get_ipv4addr]
 FOUND: mlx4_0:1 is 9.9.10.75 (0x4b0a0909)
[binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1750:ipaddrcheck]
 Found device mlx4_0:1 = IP address 9.9.10.75 (0x4b0a0909):57734
[binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1776:ipaddrcheck]
 creating new server to listen on 9.9.10.75 (0x4b0a0909):57734
[binf316:24592] openib BTL: rdmacm CPC available for use on mlx4_0:1
[binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:542:udcm_module_init]
 created cpc module 0x7190c0 for btl 0x717060
[binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:565:udcm_module_init]
 error creating ud send completion queue
ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
[binf316:24592] *** Process received signal ***
[binf316:24592] Signal: Aborted (6)
[binf316:24592] Signal code:  (-6)
[binf316:24591] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7fb35959c7c0]
[binf316:24591] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fb359248b55]
[binf316:24591] [ 2] /lib64/libc.so.6(abort+0x181)[0x7fb35924a131]
[binf316:24591] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7fb359241a10]
[binf316:24591] [ 4] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7fb3548e284b]
[binf316:24591] [ 5] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7fb3548e1474]
[binf316:24591] [ 6] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7fb3548da316]
[binf316:24591] [ 7] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7fb3548c3817]
[binf316:24591] [ 8] [binf316:24592] [ 0] 
/lib64/libpthread.so.0(+0xf7c0)[0x7f8b6b1707c0]
[binf316:24592] [ 1] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7fb35986ba5e]
[binf316:24591] [ 9] /lib64/libc.so.6(gsignal+0x35)[0x7f8b6ae1cb55]
[binf316:24592] [ 2] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7fb354b05d42]
[binf316:24591] [10] /lib64/libc.so.6(abort+0x181)[0x7f8b6ae1e131]
[binf316:24592] [ 3] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7fb35986ad1b]
[binf316:24591] [11] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7fb353a2b739]
[binf316:24591] [12] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f8b6ae15a10]
[binf316:24592] [ 4] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f8b664b684b]
[binf316:24592] [ 5] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f8b664b5474]
[binf316:24592] [ 6] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7fb3598919b2]
[binf316:24591] [13] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f8b664ae316]
[binf316:24592] [ 7] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f8b66497817]
[binf316:24592] [ 8] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7fb3597fe33c]
[binf316:24591] [14] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f8b6b43fa5e]
[binf316:24592] [ 9] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f8b666d9d42]
[binf316:24592] [10] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7fb359833386]
[binf316:24591] [15] ring_c[0x40096f]
[binf316:24591] [16] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f8b6b43ed1b]
[binf316:24592] [11] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f8b655ff739]
[binf316:24592] [12] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f8b6b4659b2]
[binf316:24592] [13] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7fb359234c36]
[binf316:24591] [17] ring_c[0x400889]
[binf316:24591] *** End of error message ***
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f8b6b3d233c]
[binf316:24592] [14] 
/xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f8b6b407386]
[binf316:24592] [15] ring_c[0x40096f]
[binf316:24592] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f8b6ae08c36]
[binf316:24592] [17] ring_c[0x400889]
[binf316:24592] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 24591 on node xxxx316 exited on 
signal 6 (Aborted).
--------------------------------------------------------------------------
