It might not - there has been a great deal of change over the last few months, and I confess I haven't been checking the DVM regularly. So let me take a step back and look at that code.
I'll also include the extensions you requested in the other email. I didn't forget them; I've just been somewhat overwhelmed lately.

On Thu, Sep 17, 2015 at 11:39 AM, Mark Santcroos <mark.santcr...@rutgers.edu> wrote:

>
> > On 17 Sep 2015, at 20:34, Ralph Castain <r...@open-mpi.org> wrote:
> >
> > Ouch - this is on current master HEAD?
>
> Yep!
>
> > I'm on travel right now, but I'll be back Fri evening and can look at it this weekend. Probably something silly that needs to be fixed.
>
> Thanks!
>
> Obviously I didn't check every single version between March and now, but it's safe to assume that it didn't work in between either, I guess.
>
> > On Thu, Sep 17, 2015 at 11:30 AM, Mark Santcroos <mark.santcr...@rutgers.edu> wrote:
> > Hi (Ralph),
> >
> > Over the last months I have been focussing on exec throughput, and not so much on the application payload (read: mainly using /bin/sleep ;-)
> > As things are stabilising now, I returned my attention to "real" applications.
> > To discover that launching MPI applications (built with the same Open MPI version) within a DVM doesn't work anymore (see error below).
> >
> > I've been doing a binary search, but that turned out to be not so trivial because of other problems in the window of the change.
> > So far I've narrowed it down to:
> >
> > 64ec498 - Mar 5 - works on my laptop (but not on the Crays)
> > b67b361 - Mar 28 - works once per DVM launch on my laptop, but consecutive orte-submits get a connect error
> > b209c9e - Mar 30 - same MPI_Init issue as HEAD
> >
> > Going further into mid-March I run into build issues with verb, runtime issues with default binding complaining about missing libnumactl, runtime tcp oob errors, etc.
> > So I don't know whether the binary search will yield much more than I was able to dig up now.
> >
> > What can I do to get closer to debugging the actual issue?
> >
> > Thanks!
> >
> > Mark
> >
> >
> > OMPI_PREFIX=/Users/mark/proj/openmpi/installed/HEAD
> > OMPI_MCA_orte_hnp_uri=723386368.0;usock;tcp://192.168.0.103:56672
> > OMPI_MCA_ess=tool
> > [netbook:70703] Job [11038,3] has launched
> > --------------------------------------------------------------------------
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort. There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems. This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> > ompi_mpi_init: ompi_rte_init failed
> > --> Returned "(null)" (-43) instead of "Success" (0)
> > --------------------------------------------------------------------------
> > *** An error occurred in MPI_Init
> > *** on a NULL communicator
> > *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> > *** and potentially your MPI job)
> > [netbook:70704] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: http://www.open-mpi.org/community/lists/devel/2015/09/18064.php