Ouch - is this on current master HEAD? I'm traveling right now, but I'll be back Friday evening and can look at it this weekend. Probably something silly that needs to be fixed.
On Thu, Sep 17, 2015 at 11:30 AM, Mark Santcroos <mark.santcr...@rutgers.edu> wrote:
> Hi (Ralph),
>
> Over the last months I have been focussing on exec throughput, and not so
> much on the application payload (read: mainly using /bin/sleep ;-)
> As things are stabilising now, I have returned my attention to "real"
> applications, only to discover that launching MPI applications (built with
> the same Open MPI version) within a DVM no longer works (see error below).
>
> I've been doing a binary search, but that turned out to be not so trivial
> because of other problems in the window of the change. So far I've
> narrowed it down to:
>
> 64ec498 - Mar 5  - works on my laptop (but not on the Crays)
> b67b361 - Mar 28 - works once per DVM launch on my laptop, but consecutive
>                    orte-submits get a connect error
> b209c9e - Mar 30 - same MPI_Init issue as HEAD
>
> Going further into mid-March I run into build issues with verbs, runtime
> issues with default binding complaining about a missing libnumactl,
> runtime TCP OOB errors, etc., so I don't know whether the binary search
> will yield much more than I have been able to dig up so far.
>
> What can I do to get closer to debugging the actual issue?
>
> Thanks!
>
> Mark
>
>
> OMPI_PREFIX=/Users/mark/proj/openmpi/installed/HEAD
> OMPI_MCA_orte_hnp_uri=723386368.0;usock;tcp://192.168.0.103:56672
> OMPI_MCA_ess=tool
> [netbook:70703] Job [11038,3] has launched
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
> ompi_mpi_init: ompi_rte_init failed
> --> Returned "(null)" (-43) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> *** and potentially your MPI job)
> [netbook:70704] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able to
> guarantee that all other processes were killed!
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/18064.php
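A payload along the lines below is enough to exercise the MPI_Init path that fails in the quoted output: it only calls MPI_Init, queries the communicator, and finalizes. It is an illustrative sketch rather than the reporter's actual test program; the file name, and the suggestion to build it with the mpicc from the install under OMPI_PREFIX and submit it to the running DVM with orte-submit, are assumptions based on the workflow described in the report.

/* hello_mpi.c -- hypothetical reproduction payload (name is illustrative).
 * Assumed build: $OMPI_PREFIX/bin/mpicc hello_mpi.c -o hello_mpi, using the
 * same Open MPI install that runs the DVM, then submit the binary to the
 * running DVM with orte-submit as described in the report above. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    /* This is the call that aborts with "ompi_mpi_init: ompi_rte_init
     * failed" in the quoted output. */
    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d initialized successfully\n", rank, size);

    MPI_Finalize();
    return 0;
}

If a trivial program like this fails under the DVM while a plain mpirun of the same binary succeeds, that would further point at the DVM/orte-submit launch path rather than the application itself.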