Hi (Ralph),

Over the last few months I have been focussing on exec throughput, and not so much 
on the application payload (read: mainly using /bin/sleep ;-)
As things are stabilising now, I have returned my attention to "real" applications, 
only to discover that launching MPI applications (built with the same Open MPI 
version) within a DVM doesn't work anymore (see error below).
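
For concreteness, the workflow that breaks is the usual DVM one; roughly the 
following (the test program and process count are just placeholders):

  # terminal 1: start the persistent DVM and wait for it to report ready
  orte-dvm

  # terminal 2: submit an MPI job into the running DVM
  orte-submit -np 2 ./ring_c

It is the orte-submit step where the MPI_Init failure below shows up.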

I've been doing a binary search over the commits, but that turned out to be not so 
trivial because of other, unrelated problems in the range between the good and the 
bad commit.
So far I've narrowed it down to:

64ec498 - Mar 5  - works on my laptop (but not on the Crays)
b67b361 - Mar 28 - works once per DVM launch on my laptop, but consecutive 
                   orte-submits get a connect error
b209c9e - Mar 30 - same MPI_Init issue as HEAD

Going further into mid-March I ran into build issues with verbs, runtime issues 
with the default binding complaining about a missing libnumactl, runtime TCP OOB 
errors, etc.
So I don't know whether the binary search will yield much more than what I've been 
able to dig up so far.
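
Spelled out, the binary search amounts to a plain git bisect between the two 
commits above (with skips for the broken mid-March range), roughly:

  # good = Mar 5 commit that still works here, bad = Mar 30 commit
  git bisect start
  git bisect bad  b209c9e
  git bisect good 64ec498
  # at each step: rebuild, restart the DVM, orte-submit a test job, then mark it
  git bisect good      # or: git bisect bad
  # for the mid-March commits with unrelated build/runtime problems:
  git bisect skip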

What can I do to get closer to debugging the actual issue?

Thanks!

Mark


OMPI_PREFIX=/Users/mark/proj/openmpi/installed/HEAD
OMPI_MCA_orte_hnp_uri=723386368.0;usock;tcp://192.168.0.103:56672
OMPI_MCA_ess=tool
[netbook:70703] Job [11038,3] has launched
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[netbook:70704] Local abort before MPI_INIT completed completed successfully, 
but am not able to aggregate error messages, and not able to guarantee that all 
other processes were killed!
