Hi Ralph,

Over the last months I have been focusing on exec throughput, and not so much on the application payload (read: mainly using /bin/sleep ;-). As things are stabilising now, I returned my attention to "real" applications, only to discover that launching MPI applications (built with the same Open MPI version) within a DVM doesn't work anymore (see error below).
I've been doing a binary search, but that turned out to be non-trivial because of other problems in the window of the change. So far I've narrowed it down to:

  64ec498 - Mar 5  - works on my laptop (but not on the Crays)
  b67b361 - Mar 28 - works once per DVM launch on my laptop, but consecutive orte-submits get a connect error
  b209c9e - Mar 30 - same MPI_Init issue as HEAD

Going further into mid-March I run into build issues with verbs, runtime issues with default binding complaining about a missing libnumactl, runtime TCP OOB errors, etc. So I don't know whether the binary search will yield much more than I have been able to dig up so far.

What can I do to get closer to debugging the actual issue?

Thanks!
Mark

OMPI_PREFIX=/Users/mark/proj/openmpi/installed/HEAD
OMPI_MCA_orte_hnp_uri=723386368.0;usock;tcp://192.168.0.103:56672
OMPI_MCA_ess=tool

[netbook:70703] Job [11038,3] has launched
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[netbook:70704] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
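For what it's worth, the "binary search with broken commits in the window" problem is exactly what `git bisect skip` is for: unbuildable revisions get set aside instead of mis-marked as good or bad. Below is a minimal, purely illustrative sketch of that workflow in a throwaway 6-commit repo (commit 4 introduces a pretend bug, commit 2 "does not build"); against the real tree the endpoints would be 64ec498 as good and b209c9e as bad, and each test step would be "rebuild + orte-submit a hello-world into the DVM".

```shell
# Illustrative only: throwaway repo where commit 4 introduces the bug
# (a flag file named "bug") and commit 2 is treated as unbuildable.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email mark@example.com   # hypothetical identity
git config user.name mark
for i in 1 2 3 4 5 6; do
    echo "$i" > version
    if [ "$i" -ge 4 ]; then touch bug; fi   # regression appears at commit 4
    git add -A
    git commit -qm "commit $i"
done
first=$(git rev-list --reverse HEAD | head -n 1)   # commit 1: known good
git bisect start HEAD "$first"                     # HEAD (commit 6): known bad
# Each iteration stands in for "rebuild and run a test job":
while :; do
    if [ "$(cat version)" = "2" ]; then
        out=$(git bisect skip)          # build broken here: skip, don't guess
    elif [ -f bug ]; then
        out=$(git bisect bad)
    else
        out=$(git bisect good)
    fi
    echo "$out" | grep -q "is the first bad commit" && break
done
bad_sha=$(echo "$out" | head -n 1 | cut -d' ' -f1)
git log -1 --format=%s "$bad_sha"       # prints: commit 4
```

As long as the skipped commits don't sit exactly on the good/bad boundary, bisect still pinpoints the first bad commit; if they do, it reports the small set of candidates it cannot distinguish.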