Seems strange that it would have something to do with IB - it seems that alloc itself is failing, and at only 512 bytes, that doesn't seem like something IB would cause.

If you write a little program that calls alloc (no MPI), does it also fail?
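Something along these lines would do it (a minimal sketch; the file name is made up, but the three calls match the ones failing in the logs quoted below - the small malloc, the realloc, and the pthread_create behind openib's "Failed to create async event thread"):

/* alloctest.c - no MPI; build with: gcc -pthread alloctest.c -o alloctest */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

static void *noop(void *arg) { return arg; }   /* trivial thread body */

int main(void)
{
    void *p = malloc(512);                /* cf. "unable to alloc 512 bytes" */
    if (p == NULL) { perror("malloc(512)"); return 1; }

    p = realloc(p, 1600);                 /* cf. "unable to realloc 1600 bytes" */
    if (p == NULL) { perror("realloc(1600)"); return 1; }

    pthread_t t;                          /* cf. "Failed to create async event thread" */
    int rc = pthread_create(&t, NULL, noop, NULL);
    if (rc != 0) {
        fprintf(stderr, "pthread_create failed: %s\n", strerror(rc));
        return 1;
    }
    pthread_join(t, NULL);
    free(p);

    puts("malloc, realloc, and pthread_create all OK");
    return 0;
}

Run it both from an ssh shell on the node and from inside a Torque job, since pbs_mom can impose different limits than an interactive login.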
On Aug 12, 2013, at 3:35 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

> Hi Ralph
>
> Sorry if this is more of an IB than an OMPI problem,
> but my view angle shows it through the OMPI jobs failing.
>
> Yes, indeed I was setting memlock to unlimited in limits.conf
> and in the pbs_mom, restarting everything, relaunching the job.
> The error message changes, but it still fails on Infiniband,
> now complaining about the IB driver, but also that it cannot
> allocate memory.
>
> Weird, because when I ssh to the node and run ibstat it
> responds (see below, please).
> I actually ran ibstat everywhere, and all IB host adapters seem OK.
>
> Thank you,
> Gus Correa
>
> *********************** the job stderr ******************************
> unable to alloc 512 bytes
> Abort: Command not found.
> unable to realloc 1600 bytes
> Abort: Command not found.
> libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
> libibverbs: Warning: couldn't load driver 'mlx4': libmlx4-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: failed to map segment from shared object: Cannot allocate memory
> [node15:29683] *** Process received signal ***
> [node15:29683] Signal: Segmentation fault (11)
> [node15:29683] Signal code: (128)
> [node15:29683] Failing at address: (nil)
> [node15:29683] *** End of error message ***
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 0 with PID 29683 on node node15.cluster
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> [node15.cluster:29682] [[7785,0],0]-[[7785,1],2] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> ************************************************************
>
> *************** ibstat on node15 *************************
>
> [root@node15 ~]# ibstat
> CA 'mlx4_0'
>         CA type: MT26428
>         Number of ports: 1
>         Firmware version: 2.7.700
>         Hardware version: b0
>         Node GUID: 0x002590ffff16284c
>         System image GUID: 0x002590ffff16284f
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 40
>                 Base lid: 11
>                 LMC: 0
>                 SM lid: 1
>                 Capability mask: 0x02510868
>                 Port GUID: 0x002590ffff16284d
>                 Link layer: IB
>
> ************************************************************
>
> On 08/12/2013 05:29 PM, Ralph Castain wrote:
>> No, this has nothing to do with the registration limit.
>> For some reason, the system is refusing to create a thread -
>> i.e., it is pthread_create that is failing.
>> I have no idea what would be causing that to happen.
>>
>> Try setting it to unlimited and see if it allows the thread
>> to start, I guess.
>>
>> On Aug 12, 2013, at 2:20 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>>
>>> Hi Ralph, all
>>>
>>> I include more information below,
>>> after turning on btl_openib_verbose 30.
>>> As you can see, OMPI tries, and fails, to load openib.
>>>
>>> Last week I reduced the memlock limit from unlimited
>>> to ~12GB, as part of a general attempt to rein in memory
>>> use/abuse by jobs sharing a node.
>>> No parallel job ran until today, when the problem showed up.
>>> Could the memlock limit be the root of the problem?
>>>
>>> The OMPI FAQ says the memlock limit
>>> should be a "large number (or better yet, unlimited)":
>>>
>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>>
>>> The next two FAQ entries kind of indicate that
>>> it should be set to "unlimited", but don't say so clearly:
>>>
>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-user
>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more
>>>
>>> QUESTION:
>>> Is "unlimited" a must, or is there any (magic) "large number"
>>> that would be OK for openib?
>>>
>>> I thought a 12GB memlock limit would be OK, but maybe it is not.
>>> The nodes have 64GB RAM.
>>>
>>> Thank you,
>>> Gus Correa
>>>
>>> *************************************************
>>> [node15.cluster][[8097,1],0][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],1][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],4][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],3][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],2][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>
>>>   Local host:   node15.cluster
>>>   Local device: mlx4_0
>>> --------------------------------------------------------------------------
>>> [node15.cluster][[8097,1],10][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],12][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node15.cluster][[8097,1],13][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],17][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],23][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],24][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],26][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],28][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> [node14.cluster][[8097,1],31][../../../../../ompi/mca/btl/openib/btl_openib_component.c:562:start_async_event_thread] Failed to create async event thread
>>> --------------------------------------------------------------------------
>>> At least one pair of MPI processes are unable to reach each other for
>>> MPI communications. This means that no Open MPI device has indicated
>>> that it can be used to communicate between these processes. This is
>>> an error; Open MPI requires that all MPI processes be able to reach
>>> each other. This error can sometimes be the result of forgetting to
>>> specify the "self" BTL.
>>>
>>> Process 1 ([[8097,1],4]) is on host: node15.cluster
>>> Process 2 ([[8097,1],16]) is on host: node14
>>> BTLs attempted: self sm
>>>
>>> Your MPI job is now going to abort; sorry.
>>> --------------------------------------------------------------------------
>>>
>>> *************************************************
>>>
>>> On 08/12/2013 03:32 PM, Gus Correa wrote:
>>>> Thank you for the prompt help, Ralph!
>>>>
>>>> Yes, it is OMPI 1.4.3 built with openib support:
>>>>
>>>> $ ompi_info | grep openib
>>>>   MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.3)
>>>>
>>>> There are three libraries in prefix/lib/openmpi,
>>>> but no mca_btl_openib library:
>>>>
>>>> $ ls $PREFIX/lib/openmpi/
>>>> libompi_dbg_msgq.a  libompi_dbg_msgq.la  libompi_dbg_msgq.so
>>>>
>>>> However, this may be just because it is an older OMPI version in
>>>> the 1.4 series: those are exactly the libraries I have on another
>>>> cluster with IB and OMPI 1.4.3, where there isn't a problem.
>>>> The libraries' organization may have changed from
>>>> the 1.4 to the 1.6 series, right?
>>>> I only have mca_btl_openib libraries in the 1.6 series, but it
>>>> would be a hardship to migrate this program to OMPI 1.6.
>>>> (OK, I have newer OMPI versions too, but I also need the old
>>>> one for some programs.)
>>>>
>>>> Why the heck is it not detecting the Infiniband hardware?
>>>> [It used to detect it! :( ]
>>>>
>>>> Thank you,
>>>> Gus Correa
>>>>
>>>> On 08/12/2013 03:01 PM, Ralph Castain wrote:
>>>>> Check ompi_info - was it built with openib support?
>>>>>
>>>>> Then check that the mca_btl_openib library is present in the
>>>>> prefix/lib/openmpi directory.
>>>>>
>>>>> Sounds like it isn't finding the openib plugin.
>>>>>
>>>>> On Aug 12, 2013, at 11:57 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>>>>>
>>>>>> Dear Open MPI pros
>>>>>>
>>>>>> On one of the clusters here, which has Infiniband,
>>>>>> I am getting this type of error from
>>>>>> Open MPI 1.4.3 (OK, I know it is old ...):
>>>>>>
>>>>>> *********************************************************
>>>>>> Tcl_InitNotifier: unable to start notifier thread
>>>>>> Abort: Command not found.
>>>>>> Tcl_InitNotifier: unable to start notifier thread
>>>>>> Abort: Command not found.
>>>>>> --------------------------------------------------------------------------
>>>>>> At least one pair of MPI processes are unable to reach each other for
>>>>>> MPI communications. This means that no Open MPI device has indicated
>>>>>> that it can be used to communicate between these processes. This is
>>>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>>>> each other. This error can sometimes be the result of forgetting to
>>>>>> specify the "self" BTL.
>>>>>>
>>>>>> Process 1 ([[907,1],68]) is on host: node11.cluster
>>>>>> Process 2 ([[907,1],0]) is on host: node15
>>>>>> BTLs attempted: self sm
>>>>>>
>>>>>> Your MPI job is now going to abort; sorry.
>>>>>> --------------------------------------------------------------------------
>>>>>> *********************************************************
>>>>>>
>>>>>> Awkward, because I have "btl = ^tcp" in openmpi-mca-params.conf.
>>>>>> The same error also happens if I force --mca btl openib,sm,self
>>>>>> in mpiexec.
>>>>>>
>>>>>> ** Why is it attempting only the self and sm BTLs, but not openib? **
>>>>>>
>>>>>> I don't understand the initial errors either
>>>>>> ("Tcl_InitNotifier: unable to start notifier thread").
>>>>>> Are they coming from Torque perhaps?
>>>>>>
>>>>>> As I said, the cluster has Infiniband,
>>>>>> which is what we've been using forever, until
>>>>>> these errors started today.
>>>>>>
>>>>>> When I divert the traffic to TCP
>>>>>> (--mca btl tcp,sm,self), the jobs run normally.
>>>>>>
>>>>>> I am using the examples/connectivity_c.c program
>>>>>> to troubleshoot this problem.
>>>>>>
>>>>>> ***
>>>>>> I checked a few things on the IB side.
>>>>>>
>>>>>> The output of ibstat on all nodes seems OK (links up, etc.),
>>>>>> and so is the output of ibhosts and ibchecknet.
>>>>>>
>>>>>> Only two connected ports had errors, as reported by ibcheckerrors,
>>>>>> and I cleared them with ibclearerrors.
>>>>>>
>>>>>> The IB subnet manager is running on the head node.
>>>>>> I restarted the daemon, but nothing changed; the jobs continue to
>>>>>> fail with the same errors.
>>>>>>
>>>>>> **
>>>>>>
>>>>>> Any hints on what is going on, how to diagnose it, and how to fix it?
>>>>>> Any gentler way than rebooting everything and power-cycling
>>>>>> the IB switch? (And would this brute-force method work, at least?)
>>>>>>
>>>>>> Thank you,
>>>>>> Gus Correa
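The memlock settings discussed in this thread are normally applied on every compute node in /etc/security/limits.conf, along these lines (a sketch, assuming pam_limits is active for login and daemon sessions):

* soft memlock unlimited
* hard memlock unlimited

A running daemon keeps the limits it inherited at startup, so pbs_mom must be restarted after the change or jobs will still see the old values. One way to confirm what a job actually sees (illustrative commands, assuming Torque's qsub and a Bourne-compatible shell; the node name is the one from this thread):

ssh node15 'ulimit -l'                      # limit in an interactive ssh shell
echo 'ulimit -l' | qsub -l nodes=1:ppn=1    # same check inside a Torque job
                                            # (output lands in the job's .o file)

If the ssh shell reports "unlimited" but the job reports a small number, pbs_mom is still running with its old limits.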