> On 17 Sep 2015, at 20:34, Ralph Castain <r...@open-mpi.org> wrote:
>
> Ouch - this is on current master HEAD?
Yep!

> I'm on travel right now, but I'll be back Fri evening and can look at it this
> weekend. Probably something silly that needs to be fixed.

Thanks! Obviously I didn't check every single version between March and now, but it's safe to assume it didn't work in between either.

>
>
> On Thu, Sep 17, 2015 at 11:30 AM, Mark Santcroos <mark.santcr...@rutgers.edu> wrote:
> Hi (Ralph),
>
> Over the last few months I have been focusing on exec throughput, and not so much
> on the application payload (read: mainly using /bin/sleep ;-)
> As things are stabilising now, I have returned my attention to "real" applications,
> only to discover that launching MPI applications (built with the same Open MPI
> version) within a DVM no longer works (see error below).
>
> I've been doing a binary search, but that turned out to be not so trivial
> because of other problems in the window of the change.
> So far I've narrowed it down to:
>
> 64ec498 - Mar 5 - works on my laptop (but not on the Crays)
> b67b361 - Mar 28 - works once per DVM launch on my laptop, but consecutive
> orte-submits get a connect error
> b209c9e - Mar 30 - same MPI_Init issue as HEAD
>
> Going further back into mid-March I ran into build issues with verbs, runtime
> issues with default binding complaining about missing libnumactl, runtime tcp
> oob errors, etc.
> So I don't know whether the binary search will yield much more than what I've
> been able to dig up so far.
>
> What can I do to get closer to debugging the actual issue?
>
> Thanks!
>
> Mark
>
>
> OMPI_PREFIX=/Users/mark/proj/openmpi/installed/HEAD
> OMPI_MCA_orte_hnp_uri=723386368.0;usock;tcp://192.168.0.103:56672
> OMPI_MCA_ess=tool
> [netbook:70703] Job [11038,3] has launched
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "(null)" (-43) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [netbook:70704] Local abort before MPI_INIT completed completed successfully,
> but am not able to aggregate error messages, and not able to guarantee that
> all other processes were killed!
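
For context, the payload involved is nothing exotic. Below is a minimal sketch of such an MPI application, built against the same Open MPI install; the file name hello_dvm.c and the mpicc build line are illustrative and not taken from the original report:

/*
 * hello_dvm.c - minimal sketch of an MPI payload of the kind described
 * above (hypothetical example, not from the original report).
 *
 * Build against the same Open MPI installation, e.g.:
 *   $OMPI_PREFIX/bin/mpicc hello_dvm.c -o hello_dvm
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    /* The failure reported above occurs inside MPI_Init itself
     * (ompi_rte_init returns -43), so nothing past this call runs
     * when the process is submitted into the DVM. */
    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d initialized fine\n", rank, size);

    MPI_Finalize();
    return 0;
}

The code itself is not the interesting part: as the log shows ("Local abort before MPI_INIT completed"), the process never gets past MPI_Init when launched into the DVM with the environment shown in the output above.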