This sounds credible. When I log in via Torque, I see the following:

[binf316:fischega] $ ulimit -l
64

but when I log in via ssh, I see:

[binf316:fischega] $ ulimit -l
unlimited
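
A quick way to confirm that the 64 kB comes from pbs_mom (whose limits the jobs inherit) rather than from PAM at login is to read the daemon's limits straight from /proc -- a sketch: pgrep -o just grabs the oldest pbs_mom PID, and on older kernels reading another user's limits file may require root:

[binf316:fischega] $ grep 'locked memory' /proc/$(pgrep -o pbs_mom)/limits
(expect 65536 bytes here, i.e. the same 64 kB, until the daemon's limit is raised)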

I'll have my administrator make the changes and give that a shot.  Thanks, 
everyone!

_____________________________________________
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Wednesday, June 11, 2014 7:13 PM
To: Open MPI Users
Subject: Re: [OMPI users] openib segfaults with Torque


In case it helps, Greg: on the compute nodes I normally add this to /etc/security/limits.conf:

*   -   memlock     -1
*   -   stack       -1
*   -   nofile      32768

and

ulimit -n 32768
ulimit -l unlimited
ulimit -s unlimited

to either /etc/init.d/pbs_mom or to /etc/sysconfig/pbs_mom (which
should be sourced by the former).
Other values are possible, of course.
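
For the sysconfig variant, a minimal sketch of the snippet I mean (assuming the init script really does source /etc/sysconfig/pbs_mom, as the stock one should):

# /etc/sysconfig/pbs_mom -- sourced by /etc/init.d/pbs_mom when the daemon starts
ulimit -n 32768        # max open file descriptors
ulimit -l unlimited    # max locked memory; openib locks registered buffers, so this is the critical one
ulimit -s unlimited    # max stack size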

My recollection is that the boilerplate init scripts that
come with Torque don't change those limits.

I suppose this makes the pbs_mom child processes,
including the user job script and whatever processes it starts
(mpiexec, etc.), inherit those limits.
Or not?
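
An easy way to find out after the restart: submit a trivial job that prints its own limits and compare -- a sketch, assuming bash as the job shell and stdout staged back via -o:

echo 'ulimit -l; ulimit -s; ulimit -n' | qsub -o limits_check.out
# when the job finishes, limits_check.out should read
# unlimited / unlimited / 32768 if the limits were inherited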

Gus Correa


On 06/11/2014 06:20 PM, Jeff Squyres (jsquyres) wrote:
> +1
>
> On Jun 11, 2014, at 6:01 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Yeah, I think we've seen that somewhere before too...
>>
>>
>> On Jun 11, 2014, at 2:59 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>
>>> Agreed. The problem is not with UDCM, and I don't think anything is wrong with the system itself. I think his Torque is imposing major constraints on the maximum size that can be locked into memory.
>>>
>>> Josh
>>>
>>>
>>> On Wed, Jun 11, 2014 at 5:49 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>>> Probably won't help to use RDMACM, though, as you will just see the
>>> resource failure somewhere else. UDCM is not the problem. Something is
>>> wrong with the system. Allocating a 512-entry CQ should not fail.
>>>
>>> -Nathan
>>>
>>> On Wed, Jun 11, 2014 at 05:03:31PM -0400, Joshua Ladd wrote:
>>>>     I'm guessing it's a resource limitation issue coming from Torque.
>>>>
>>>>     Hmmmm...I found something interesting on the interwebs that looks awfully similar:
>>>>     http://www.supercluster.org/pipermail/torqueusers/2008-February/006916.html
>>>>
>>>>     Greg, if the suggestion from the Torque users ("...adding the following line 'ulimit -l unlimited' to pbs_mom and restarting pbs_mom.") doesn't resolve your issue, try using the RDMACM CPC (instead of UDCM, which is a pretty recent addition to the openib BTL) by setting:
>>>>
>>>>     -mca btl_openib_cpc_include rdmacm
>>>>
>>>>     Josh
>>>>
>>>>     On Wed, Jun 11, 2014 at 4:04 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>
>>>>       Mellanox --
>>>>
>>>>       What would cause a CQ to fail to be created?
>>>>
>>>>       On Jun 11, 2014, at 3:42 PM, "Fischer, Greg A." <fisch...@westinghouse.com> wrote:
>>>>
>>>>       > Is there any other workaround that I might try? Something that avoids UDCM?
>>>>       >
>>>>       > -----Original Message-----
>>>>       > From: Fischer, Greg A.
>>>>       > Sent: Tuesday, June 10, 2014 2:59 PM
>>>>       > To: Nathan Hjelm
>>>>       > Cc: Open MPI Users; Fischer, Greg A.
>>>>       > Subject: RE: [OMPI users] openib segfaults with Torque
>>>>       >
>>>>       > [binf316:fischega] $ ulimit -m
>>>>       > unlimited
>>>>       >
>>>>       > Greg
>>>>       >
>>>>       > -----Original Message-----
>>>>       > From: Nathan Hjelm [mailto:hje...@lanl.gov]
>>>>       > Sent: Tuesday, June 10, 2014 2:58 PM
>>>>       > To: Fischer, Greg A.
>>>>       > Cc: Open MPI Users
>>>>       > Subject: Re: [OMPI users] openib segfaults with Torque
>>>>       >
>>>>       > Out of curiosity, what is the mlock limit on your system? If it is too low, that can cause ibv_create_cq to fail. To check, run ulimit -m.
>>>>       >
>>>>       > -Nathan Hjelm
>>>>       > Application Readiness, HPC-5, LANL
>>>>       >
>>>>       > On Tue, Jun 10, 2014 at 02:53:58PM -0400, Fischer, Greg A. wrote:
>>>>       >> Yes, this fails on all nodes on the system, except for the head 
>>>> node.
>>>>       >>
>>>>       >> The uptime of the system isn't significant. Maybe 1 week, and it's
>>>>       received basically no use.
>>>>       >>
>>>>       >> -----Original Message-----
>>>>       >> From: Nathan Hjelm [mailto:hje...@lanl.gov]
>>>>       >> Sent: Tuesday, June 10, 2014 2:49 PM
>>>>       >> To: Fischer, Greg A.
>>>>       >> Cc: Open MPI Users
>>>>       >> Subject: Re: [OMPI users] openib segfaults with Torque
>>>>       >>
>>>>       >>
>>>>       >> Well, that's interesting. The output shows that ibv_create_cq is failing. Strange, since an identical call had just succeeded (udcm creates two completion queues). Some questions that might indicate where the failure is:
>>>>       >>
>>>>       >> Does this fail on any other node in your system?
>>>>       >>
>>>>       >> How long has the node been up?
>>>>       >>
>>>>       >> -Nathan Hjelm
>>>>       >> Application Readiness, HPC-5, LANL
>>>>       >>
>>>>       >> On Tue, Jun 10, 2014 at 02:06:54PM -0400, Fischer, Greg A. wrote:
>>>>       >>> Jeff/Nathan,
>>>>       >>>
>>>>       >>> I ran the following with my debug build of Open MPI 1.8.1 - after opening a terminal on a compute node with "qsub -l nodes=2 -I":
>>>>       >>>
>>>>       >>>      mpirun -mca btl openib,self -mca btl_base_verbose 100 -np 2 ring_c &> output.txt
>>>>       >>>
>>>>       >>> Output and backtrace are attached. Let me know if I can provide anything else.
>>>>       >>>
>>>>       >>> Thanks for looking into this,
>>>>       >>> Greg
>>>>       >>>
>>>>       >>> -----Original Message-----
>>>>       >>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres (jsquyres)
>>>>       >>> Sent: Tuesday, June 10, 2014 10:31 AM
>>>>       >>> To: Nathan Hjelm
>>>>       >>> Cc: Open MPI Users
>>>>       >>> Subject: Re: [OMPI users] openib segfaults with Torque
>>>>       >>>
>>>>       >>> Greg:
>>>>       >>>
>>>>       >>> Can you run with "--mca btl_base_verbose 100" on your debug build so that we can get some additional output to see why UDCM is failing to set up properly?
>>>>       >>>
>>>>       >>>
>>>>       >>>
>>>>       >>> On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
>>>>       >>>
>>>>       >>>> On Tue, Jun 10, 2014 at 12:10:28AM +0000, Jeff Squyres (jsquyres) wrote:
>>>>       >>>>> I seem to recall that you have an IB-based cluster, right?
>>>>       >>>>>
>>>>       >>>>> From a *very quick* glance at the code, it looks like this might be a simple incorrect-finalization issue.  That is:
>>>>       >>>>>
>>>>       >>>>> - you run the job on a single server
>>>>       >>>>> - openib disqualifies itself because you're running on a single server
>>>>       >>>>> - openib then goes to finalize/close itself
>>>>       >>>>> - but openib didn't fully initialize itself (because it disqualified itself early in the initialization process), and something in the finalization process didn't take that into account
>>>>       >>>>>
>>>>       >>>>> Nathan -- is that anywhere close to correct?
>>>>       >>>>
>>>>       >>>> Nope. udcm_module_finalize is being called because there was an error setting up the udcm state. See btl_openib_connect_udcm.c:476.
>>>>       >>>> The opal_list_t destructor is getting an assert failure, probably because the constructor wasn't called. I can rearrange the constructors to be called first, but there appears to be a deeper issue with the user's system: udcm_module_init should not be failing!
>>>>       >>>> It creates a couple of CQs, allocates a small number of registered buffers, and starts monitoring the fd for the completion channel. All these things are also done in the setup of the openib btl itself.
>>>>       >>>> Keep in mind that the openib btl will not disqualify itself when running on a single server. Openib may be used to communicate on-node and is needed for the dynamics case.
>>>>       >>>>
>>>>       >>>> The user might try adding -mca btl_base_verbose 100 to shed some light on what the real issue is.
>>>>       >>>>
>>>>       >>>> BTW, I no longer monitor the user mailing list. If something needs my attention, forward it to me directly.
>>>>       >>>>
>>>>       >>>> -Nathan
>>>>       >>>
>>>>       >>>
>>>>       >>> --
>>>>       >>> Jeff Squyres
>>>>       >>> jsquy...@cisco.com
>>>>       >>> For corporate legal information go to:
>>>>       >>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>       >>>
>>>>       >>>
>>>>       >>
>>>>       >>> Core was generated by `ring_c'.
>>>>       >>> Program terminated with signal 6, Aborted.
>>>>       >>> #0  0x00007f8b6ae1cb55 in raise () from /lib64/libc.so.6
>>>>       >>> #1  0x00007f8b6ae1e0c5 in abort () from /lib64/libc.so.6
>>>>       >>> #2  0x00007f8b6ae15a10 in __assert_fail () from /lib64/libc.so.6
>>>>       >>> #3  0x00007f8b664b684b in udcm_module_finalize (btl=0x717060, cpc=0x7190c0) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734
>>>>       >>> #4  0x00007f8b664b5474 in udcm_component_query (btl=0x717060, cpc=0x718a48) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:476
>>>>       >>> #5  0x00007f8b664ae316 in ompi_btl_openib_connect_base_select_for_local_port (btl=0x717060) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_base.c:273
>>>>       >>> #6  0x00007f8b66497817 in btl_openib_component_init (num_btl_modules=0x7fffe34cebe0, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:2703
>>>>       >>> #7  0x00007f8b6b43fa5e in mca_btl_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/btl/base/btl_base_select.c:108
>>>>       >>> #8  0x00007f8b666d9d42 in mca_bml_r2_component_init (priority=0x7fffe34cecb4, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/bml/r2/bml_r2_component.c:88
>>>>       >>> #9  0x00007f8b6b43ed1b in mca_bml_base_init (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/bml/base/bml_base_init.c:69
>>>>       >>> #10 0x00007f8b655ff739 in mca_pml_ob1_component_init (priority=0x7fffe34cedf0, enable_progress_threads=false, enable_mpi_threads=false) at ../../../../../openmpi-1.8.1/ompi/mca/pml/ob1/pml_ob1_component.c:271
>>>>       >>> #11 0x00007f8b6b4659b2 in mca_pml_base_select (enable_progress_threads=false, enable_mpi_threads=false) at ../../../../openmpi-1.8.1/ompi/mca/pml/base/pml_base_select.c:128
>>>>       >>> #12 0x00007f8b6b3d233c in ompi_mpi_init (argc=1, argv=0x7fffe34cf0e8, requested=0, provided=0x7fffe34cef98) at ../../openmpi-1.8.1/ompi/runtime/ompi_mpi_init.c:604
>>>>       >>> #13 0x00007f8b6b407386 in PMPI_Init (argc=0x7fffe34cefec, argv=0x7fffe34cefe0) at pinit.c:84
>>>>       >>> #14 0x000000000040096f in main (argc=1, argv=0x7fffe34cf0e8) at ring_c.c:19
>>>>       >>>
>>>>       >>
>>>>       >>> [binf316:24591] mca: base: components_register: registering btl components
>>>>       >>> [binf316:24591] mca: base: components_register: found loaded component openib
>>>>       >>> [binf316:24592] mca: base: components_register: registering btl components
>>>>       >>> [binf316:24592] mca: base: components_register: found loaded component openib
>>>>       >>> [binf316:24591] mca: base: components_register: component openib register function successful
>>>>       >>> [binf316:24591] mca: base: components_register: found loaded component self
>>>>       >>> [binf316:24591] mca: base: components_register: component self register function successful
>>>>       >>> [binf316:24591] mca: base: components_open: opening btl components
>>>>       >>> [binf316:24591] mca: base: components_open: found loaded component openib
>>>>       >>> [binf316:24591] mca: base: components_open: component openib open function successful
>>>>       >>> [binf316:24591] mca: base: components_open: found loaded component self
>>>>       >>> [binf316:24591] mca: base: components_open: component self open function successful
>>>>       >>> [binf316:24592] mca: base: components_register: component openib register function successful
>>>>       >>> [binf316:24592] mca: base: components_register: found loaded component self
>>>>       >>> [binf316:24592] mca: base: components_register: component self register function successful
>>>>       >>> [binf316:24592] mca: base: components_open: opening btl components
>>>>       >>> [binf316:24592] mca: base: components_open: found loaded component openib
>>>>       >>> [binf316:24592] mca: base: components_open: component openib open function successful
>>>>       >>> [binf316:24592] mca: base: components_open: found loaded component self
>>>>       >>> [binf316:24592] mca: base: components_open: component self open function successful
>>>>       >>> [binf316:24591] select: initializing btl component openib
>>>>       >>> [binf316:24592] select: initializing btl component openib
>>>>       >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:364:add_rdma_addr] Adding addr 9.9.10.75 (0x4b0a0909) subnet 0x9090000 as mlx4_0:1
>>>>       >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:364:add_rdma_addr] Adding addr 9.9.10.75 (0x4b0a0909) subnet 0x9090000 as mlx4_0:1
>>>>       >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:686:init_one_port] looking for mlx4_0:1 GID index 0
>>>>       >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:717:init_one_port] my IB subnet_id for HCA mlx4_0 port 1 is fe80000000000000
>>>>       >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1294:setup_qps] pp: rd_num is 256 rd_low is 192 rd_win 128 rd_rsv 4
>>>>       >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
>>>>       >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
>>>>       >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
>>>>       >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1840:rdmacm_component_query] rdmacm_component_query
>>>>       >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:132:mca_btl_openib_rdma_get_ipv4addr] Looking for mlx4_0:1 in IP address list
>>>>       >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:141:mca_btl_openib_rdma_get_ipv4addr] FOUND: mlx4_0:1 is 9.9.10.75 (0x4b0a0909)
>>>>       >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1750:ipaddrcheck] Found device mlx4_0:1 = IP address 9.9.10.75 (0x4b0a0909):51845
>>>>       >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1776:ipaddrcheck] creating new server to listen on 9.9.10.75 (0x4b0a0909):51845
>>>>       >>> [binf316:24591] openib BTL: rdmacm CPC available for use on mlx4_0:1
>>>>       >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:542:udcm_module_init] created cpc module 0x719220 for btl 0x716ee0
>>>>       >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:686:init_one_port] looking for mlx4_0:1 GID index 0
>>>>       >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:717:init_one_port] my IB subnet_id for HCA mlx4_0 port 1 is fe80000000000000
>>>>       >>> [binf316][[17980,1],0][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:565:udcm_module_init] error creating ud send completion queue
>>>>       >>> ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
>>>>       >>> [binf316:24591] *** Process received signal ***
>>>>       >>> [binf316:24591] Signal: Aborted (6)
>>>>       >>> [binf316:24591] Signal code:  (-6)
>>>>       >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1294:setup_qps] pp: rd_num is 256 rd_low is 192 rd_win 128 rd_rsv 4
>>>>       >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
>>>>       >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
>>>>       >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_component.c:1339:setup_qps] srq: rd_num is 1024 rd_low is 1008 sd_max is 64 rd_max is 256 srq_limit is 48
>>>>       >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1840:rdmacm_component_query] rdmacm_component_query
>>>>       >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:132:mca_btl_openib_rdma_get_ipv4addr] Looking for mlx4_0:1 in IP address list
>>>>       >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/btl_openib_ip.c:141:mca_btl_openib_rdma_get_ipv4addr] FOUND: mlx4_0:1 is 9.9.10.75 (0x4b0a0909)
>>>>       >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1750:ipaddrcheck] Found device mlx4_0:1 = IP address 9.9.10.75 (0x4b0a0909):57734
>>>>       >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1776:ipaddrcheck] creating new server to listen on 9.9.10.75 (0x4b0a0909):57734
>>>>       >>> [binf316:24592] openib BTL: rdmacm CPC available for use on mlx4_0:1
>>>>       >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:542:udcm_module_init] created cpc module 0x7190c0 for btl 0x717060
>>>>       >>> [binf316][[17980,1],1][../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:565:udcm_module_init] error creating ud send completion queue
>>>>       >>> ring_c: ../../../../../openmpi-1.8.1/ompi/mca/btl/openib/connect/btl_openib_connect_udcm.c:734: udcm_module_finalize: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&m->cm_recv_msg_queue))->obj_magic_id' failed.
>>>>       >>> [binf316:24592] *** Process received signal ***
>>>>       >>> [binf316:24592] Signal: Aborted (6)
>>>>       >>> [binf316:24592] Signal code:  (-6)
>>>>       >>> [binf316:24591] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7fb35959c7c0]
>>>>       >>> [binf316:24591] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fb359248b55]
>>>>       >>> [binf316:24591] [ 2] /lib64/libc.so.6(abort+0x181)[0x7fb35924a131]
>>>>       >>> [binf316:24591] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x7fb359241a10]
>>>>       >>> [binf316:24591] [ 4] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7fb3548e284b]
>>>>       >>> [binf316:24591] [ 5] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7fb3548e1474]
>>>>       >>> [binf316:24591] [ 6] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7fb3548da316]
>>>>       >>> [binf316:24591] [ 7] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7fb3548c3817]
>>>>       >>> [binf316:24591] [ 8] [binf316:24592] [ 0] /lib64/libpthread.so.0(+0xf7c0)[0x7f8b6b1707c0]
>>>>       >>> [binf316:24592] [ 1] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7fb35986ba5e]
>>>>       >>> [binf316:24591] [ 9] /lib64/libc.so.6(gsignal+0x35)[0x7f8b6ae1cb55]
>>>>       >>> [binf316:24592] [ 2] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7fb354b05d42]
>>>>       >>> [binf316:24591] [10] /lib64/libc.so.6(abort+0x181)[0x7f8b6ae1e131]
>>>>       >>> [binf316:24592] [ 3] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7fb35986ad1b]
>>>>       >>> [binf316:24591] [11] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7fb353a2b739] [binf316:24591] [12] /lib64/libc.so.6(__assert_fail+0xf0)[0x7f8b6ae15a10]
>>>>       >>> [binf316:24592] [ 4] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x3784b)[0x7f8b664b684b]
>>>>       >>> [binf316:24592] [ 5] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x36474)[0x7f8b664b5474]
>>>>       >>> [binf316:24592] [ 6] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7fb3598919b2]
>>>>       >>> [binf316:24591] [13] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(ompi_btl_openib_connect_base_select_for_local_port+0x15b)[0x7f8b664ae316] [binf316:24592] [ 7] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_btl_openib.so(+0x18817)[0x7f8b66497817]
>>>>       >>> [binf316:24592] [ 8] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7fb3597fe33c]
>>>>       >>> [binf316:24591] [14] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_btl_base_select+0x1b2)[0x7f8b6b43fa5e]
>>>>       >>> [binf316:24592] [ 9] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x20)[0x7f8b666d9d42]
>>>>       >>> [binf316:24592] [10] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7fb359833386]
>>>>       >>> [binf316:24591] [15] ring_c[0x40096f] [binf316:24591] [16] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_bml_base_init+0xd6)[0x7f8b6b43ed1b]
>>>>       >>> [binf316:24592] [11] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/openmpi/mca_pml_ob1.so(+0x7739)[0x7f8b655ff739] [binf316:24592] [12] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(mca_pml_base_select+0x26e)[0x7f8b6b4659b2]
>>>>       >>> [binf316:24592] [13] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7fb359234c36]
>>>>       >>> [binf316:24591] [17] ring_c[0x400889] [binf316:24591] *** End of error message ***
>>>>       >>> /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(ompi_mpi_init+0x5f6)[0x7f8b6b3d233c]
>>>>       >>> [binf316:24592] [14] /xxxx/yyyy_ib/gcc-4.8.3/toolset/openmpi-1.8.1_debug/lib/libmpi.so.1(MPI_Init+0x17e)[0x7f8b6b407386]
>>>>       >>> [binf316:24592] [15] ring_c[0x40096f] [binf316:24592] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f8b6ae08c36]
>>>>       >>> [binf316:24592] [17] ring_c[0x400889] [binf316:24592] *** End of error message ***
>>>>       >>> --------------------------------------------------------------------------
>>>>       >>> mpirun noticed that process rank 0 with PID 24591 on node xxxx316 exited on signal 6 (Aborted).
>>>>       >>> --------------------------------------------------------------------------
>>>>       >>
>>>>       >>
>>>>       >
>>>>
>>>>       --
>>>>       Jeff Squyres
>>>>       jsquy...@cisco.com
>>>>       For corporate legal information go to:
>>>>       http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>
>>>
>>>
>>>
>>
>
>



