Seems strange that it would have something to do with IB - it seems that alloc itself is failing, and at only 512 bytes, that doesn't seem like something IB would cause.

If you write a little program that calls alloc (no MPI), does it also fail?
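Something along these lines would do it (a minimal sketch; the file name is made up, but the three calls match the ones failing in the logs quoted below - the small malloc, the realloc, and the pthread_create behind openib's "Failed to create async event thread"):

/* alloctest.c - no MPI; build with: gcc -pthread alloctest.c -o alloctest */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

static void *noop(void *arg) { return arg; }   /* trivial thread body */

int main(void)
{
    void *p = malloc(512);                /* cf. "unable to alloc 512 bytes" */
    if (p == NULL) { perror("malloc(512)"); return 1; }

    p = realloc(p, 1600);                 /* cf. "unable to realloc 1600 bytes" */
    if (p == NULL) { perror("realloc(1600)"); return 1; }

    pthread_t t;                          /* cf. "Failed to create async event thread" */
    int rc = pthread_create(&t, NULL, noop, NULL);
    if (rc != 0) {
        fprintf(stderr, "pthread_create failed: %s\n", strerror(rc));
        return 1;
    }
    pthread_join(t, NULL);
    free(p);

    puts("malloc, realloc, and pthread_create all OK");
    return 0;
}

Run it both from an ssh shell on the node and from inside a Torque job, since pbs_mom can impose different limits than an interactive login.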
On Aug 12, 2013, at 3:35 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

> Hi Ralph
>
> Sorry if this is more of an IB than an OMPI problem,
> but my view angle shows it through the OMPI jobs failing.
>
> Yes, indeed I was setting memlock to unlimited in limits.conf
> and in the pbs_mom, restarting everything, relaunching the job.
> The error message changes, but it still fails on Infiniband,
> now complaining about the IB driver, but also that it cannot
> allocate memory.
>
> Weird, because when I ssh to the node and run ibstat it
> responds (see below, please).
> I actually ran ibstat everywhere, and all IB host adapters seem OK.
>
> Thank you,
> Gus Correa
>
> *********************** the job stderr ******************************
> unable to alloc 512 bytes
> Abort: Command not found.
> unable to realloc 1600 bytes
> Abort: Command not found.
> libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
> libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> [node15:29683] *** Process received signal ***
> [node15:29683] Signal: Segmentation fault (11)
> [node15:29683] Signal code: (128)
> [node15:29683] Failing at address: (nil)
> [node15:29683] *** End of error message ***
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 0 with PID 29683 on node node15.cluster
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> [node15.cluster:29682] [[7785,0],0]-[[7785,1],2] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> ************************************************************
>
> *************** ibstat on node15 *************************
>
> [root@node15 ~]# ibstat
> CA 'mlx4_0'
>         CA type: MT26428
>         Number of ports: 1
>         Firmware version: 2.7.700
>         Hardware version: b0
>         Node GUID: 0x002590ffff16284c
>         System image GUID: 0x002590ffff16284f
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 40
>                 Base lid: 11
>                 LMC: 0
>                 SM lid: 1
>                 Capability mask: 0x02510868
>                 Port GUID: 0x002590ffff16284d
>                 Link layer: IB
>
> ************************************************************
>
> On 08/12/2013 05:29 PM, Ralph Castain wrote:
>> No, this has nothing to do with the registration limit.
>> For some reason, the system is refusing to create a thread -
>> i.e., it is pthread_create that is failing.
>> I have no idea what would be causing that to happen.
>>
>> Try setting it to unlimited and see if it allows the thread
>> to start, I guess.
>>
>> On Aug 12, 2013, at 2:20 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>>
>>> Hi Ralph, all
>>>
>>> I include more information below,
>>> after turning on btl_openib_verbose 30.
>>> As you can see, OMPI tries, and fails, to load openib.
>>>
>>> Last week I reduced the memlock limit from unlimited
>>> to ~12GB, as part of a general attempt to rein in memory
>>> use/abuse by jobs sharing a node.
>>> No parallel job ran until today, when the problem showed up.
>>> Could the memlock limit be the root of the problem?
>>>
>>> The OMPI FAQ says the memlock limit
>>> should be a "large number (or better yet, unlimited)":
>>>
>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>>
>>> The next two FAQ entries kind of indicate that
>>> it should be set to "unlimited", but don't say so clearly:
>>>
>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-user
>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more
>>>
>>> QUESTION:
>>> Is "unlimited" a must, or is there any (magic) "large number"
>>> that would be OK for openib?
>>>
>>> I thought a 12GB memlock limit would be OK, but maybe it is not.
>>> The nodes have 64GB RAM.
>>>
>>> Thank you,
>>> Gus Correa
>>>
>>> *************************************************
>>> [node15.cluster][[8097,1],0][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],1][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],4][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],3][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],2][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>
>>>   Local host:   node15.cluster
>>>   Local device: mlx4_0
>>> --------------------------------------------------------------------------
>>> [node15.cluster][[8097,1],10][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],12][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],13][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],17][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],23][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],24][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],26][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],28][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],31][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> --------------------------------------------------------------------------
>>> At least one pair of MPI processes are unable to reach each other for
>>> MPI communications. This means that no Open MPI device has indicated
>>> that it can be used to communicate between these processes. This is
>>> an error; Open MPI requires that all MPI processes be able to reach
>>> each other. This error can sometimes be the result of forgetting to
>>> specify the "self" BTL.
>>>
>>> Process 1 ([[8097,1],4]) is on host: node15.cluster
>>> Process 2 ([[8097,1],16]) is on host: node14
>>> BTLs attempted: self sm
>>>
>>> Your MPI job is now going to abort; sorry.
>>> --------------------------------------------------------------------------
>>>
>>> *************************************************
>>>
>>> On 08/12/2013 03:32 PM, Gus Correa wrote:
>>>> Thank you for the prompt help, Ralph!
>>>>
>>>> Yes, it is OMPI 1.4.3 built with openib support:
>>>>
>>>> $ ompi_info | grep openib
>>>>   MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.3)
>>>>
>>>> There are three libraries in prefix/lib/openmpi,
>>>> but no mca_btl_openib library:
>>>>
>>>> $ ls $PREFIX/lib/openmpi/
>>>> libompi_dbg_msgq.a  libompi_dbg_msgq.la  libompi_dbg_msgq.so
>>>>
>>>> However, this may be just because it is an older OMPI version in
>>>> the 1.4 series: those are exactly the libraries I have on another
>>>> cluster with IB and OMPI 1.4.3, where there isn't a problem.
>>>> The libraries' organization may have changed from
>>>> the 1.4 to the 1.6 series, right?
>>>> I only have mca_btl_openib libraries in the 1.6 series, but it
>>>> would be a hardship to migrate this program to OMPI 1.6.
>>>> (OK, I have newer OMPI versions too, but I also need the old
>>>> one for some programs.)
>>>>
>>>> Why the heck is it not detecting the Infiniband hardware?
>>>> [It used to detect it! :( ]
>>>>
>>>> Thank you,
>>>> Gus Correa
>>>>
>>>> On 08/12/2013 03:01 PM, Ralph Castain wrote:
>>>>> Check ompi_info - was it built with openib support?
>>>>>
>>>>> Then check that the mca_btl_openib library is present in the
>>>>> prefix/lib/openmpi directory.
>>>>>
>>>>> Sounds like it isn't finding the openib plugin.
>>>>>
>>>>> On Aug 12, 2013, at 11:57 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>>>>>
>>>>>> Dear Open MPI pros
>>>>>>
>>>>>> On one of the clusters here, which has Infiniband,
>>>>>> I am getting this type of error from
>>>>>> Open MPI 1.4.3 (OK, I know it is old ...):
>>>>>>
>>>>>> *********************************************************
>>>>>> Tcl_InitNotifier: unable to start notifier thread
>>>>>> Abort: Command not found.
>>>>>> Tcl_InitNotifier: unable to start notifier thread
>>>>>> Abort: Command not found.
>>>>>> --------------------------------------------------------------------------
>>>>>> At least one pair of MPI processes are unable to reach each other for
>>>>>> MPI communications. This means that no Open MPI device has indicated
>>>>>> that it can be used to communicate between these processes. This is
>>>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>>>> each other. This error can sometimes be the result of forgetting to
>>>>>> specify the "self" BTL.
>>>>>>
>>>>>> Process 1 ([[907,1],68]) is on host: node11.cluster
>>>>>> Process 2 ([[907,1],0]) is on host: node15
>>>>>> BTLs attempted: self sm
>>>>>>
>>>>>> Your MPI job is now going to abort; sorry.
>>>>>> --------------------------------------------------------------------------
>>>>>> *********************************************************
>>>>>>
>>>>>> Awkward, because I have "btl = ^tcp" in openmpi-mca-params.conf.
>>>>>> The same error also happens if I force --mca btl openib,sm,self
>>>>>> in mpiexec.
>>>>>>
>>>>>> ** Why is it attempting only the self and sm BTLs, but not openib? **
>>>>>>
>>>>>> I don't understand the initial errors either
>>>>>> ("Tcl_InitNotifier: unable to start notifier thread").
>>>>>> Are they coming from Torque perhaps?
>>>>>>
>>>>>> As I said, the cluster has Infiniband,
>>>>>> which is what we've been using forever, until
>>>>>> these errors started today.
>>>>>>
>>>>>> When I divert the traffic to TCP
>>>>>> (--mca btl tcp,sm,self), the jobs run normally.
>>>>>>
>>>>>> I am using the examples/connectivity_c.c program
>>>>>> to troubleshoot this problem.
>>>>>>
>>>>>> ***
>>>>>> I checked a few things on the IB side.
>>>>>>
>>>>>> The output of ibstat on all nodes seems OK (links up, etc.),
>>>>>> and so is the output of ibhosts and ibchecknet.
>>>>>>
>>>>>> Only two connected ports had errors, as reported by ibcheckerrors,
>>>>>> and I cleared them with ibclearerrors.
>>>>>>
>>>>>> The IB subnet manager is running on the head node.
>>>>>> I restarted the daemon, but nothing changed; the jobs continue to
>>>>>> fail with the same errors.
>>>>>>
>>>>>> **
>>>>>>
>>>>>> Any hints on what is going on, how to diagnose it, and how to fix it?
>>>>>> Any gentler way than rebooting everything and power-cycling
>>>>>> the IB switch? (And would this brute-force method work, at least?)
>>>>>>
>>>>>> Thank you,
>>>>>> Gus Correa
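The memlock settings discussed in this thread are normally applied on every compute node in /etc/security/limits.conf, along these lines (a sketch, assuming pam_limits is active for login and daemon sessions):

* soft memlock unlimited
* hard memlock unlimited

A running daemon keeps the limits it inherited at startup, so pbs_mom must be restarted after the change or jobs will still see the old values. One way to confirm what a job actually sees (illustrative commands, assuming Torque's qsub and a Bourne-compatible shell; the node name is the one from this thread):

ssh node15 'ulimit -l'                      # limit in an interactive ssh shell
echo 'ulimit -l' | qsub -l nodes=1:ppn=1    # same check inside a Torque job
                                            # (output lands in the job's .o file)

If the ssh shell reports "unlimited" but the job reports a small number, pbs_mom is still running with its old limits.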