Hi,

I am trying to use the sm and vader BTLs between a client and a server
process, but it seems impossible to use fast transports (i.e., anything
other than TCP) between two independent groups started by two separate
mpirun invocations. Am I correct, or is there a way to communicate over
shared memory between a client and server like this? This code suggests it
may indeed be impossible:
https://github.com/open-mpi/ompi/blob/master/ompi/dpm/dpm.c#L495

The server calls MPI::COMM_WORLD.Accept() and the client calls
MPI::COMM_WORLD.Connect(). Each program is started with "mpirun -np 1
--mca btl self,sm,vader <executable>", where the executable is either the
client or the server program. When no BTL is specified, both establish a
TCP connection just fine. But when the sm and vader BTLs are specified,
both client and server exit immediately after the Connect() call with the
error message copied at the end. It seems as though intergroup
communication cannot use a fast transport and is limited to TCP.
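For reference, here is roughly what the two programs do. This is a minimal
sketch using the C API (the calls map one-to-one onto the C++ bindings I
actually use), and the service name "myservice" is just a placeholder:

    #include <mpi.h>
    #include <cstring>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        char port[MPI_MAX_PORT_NAME];
        MPI_Comm intercomm;

        if (argc > 1 && std::strcmp(argv[1], "server") == 0) {
            /* Server: open a port, make it resolvable through
               ompi-server, then block in accept. */
            MPI_Open_port(MPI_INFO_NULL, port);
            MPI_Publish_name("myservice", MPI_INFO_NULL, port);
            MPI_Comm_accept(port, MPI_INFO_NULL, 0,
                            MPI_COMM_WORLD, &intercomm);
        } else {
            /* Client: resolve the port by name and connect. */
            MPI_Lookup_name("myservice", MPI_INFO_NULL, port);
            MPI_Comm_connect(port, MPI_INFO_NULL, 0,
                             MPI_COMM_WORLD, &intercomm);
        }

        /* ... exchange data over intercomm ... */
        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }

Both sides are launched with their own mpirun, each pointed at the same
ompi-server instance so Publish_name/Lookup_name can resolve the port.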

Also, as expected, running the Accept() and Connect() calls within a single
program with "mpirun -np 2 --mca btl self,sm,vader ..." does use shared
memory as the transport.
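That single-program variant looks roughly like this (again a sketch; I
assume rank 0 accepts and rank 1 connects, each over MPI_COMM_SELF, with
the port string passed in-band so no ompi-server is needed):

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char port[MPI_MAX_PORT_NAME];
        MPI_Comm intercomm;

        if (rank == 0) {
            MPI_Open_port(MPI_INFO_NULL, port);
            /* Hand the port string to rank 1 in-band. */
            MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, 1, 0,
                     MPI_COMM_WORLD);
            MPI_Comm_accept(port, MPI_INFO_NULL, 0,
                            MPI_COMM_SELF, &intercomm);
            MPI_Close_port(port);
        } else {
            MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Comm_connect(port, MPI_INFO_NULL, 0,
                             MPI_COMM_SELF, &intercomm);
        }

        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }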

$> mpirun --ompi-server "3414491136.0;tcp://10.4.131.16:49775" -np 1 --mca
btl self,vader ./server

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[50012,1],0]) is on host: MacBook-Pro-80
  Process 2 ([[50010,1],0]) is on host: MacBook-Pro-80
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[MacBook-Pro-80.local:57315] [[50012,1],0] ORTE_ERROR_LOG: Unreachable in
file dpm_orte.c at line 523
[MacBook-Pro-80:57315] *** An error occurred in MPI_Comm_accept
[MacBook-Pro-80:57315] *** reported by process [7572553729,4294967296]
[MacBook-Pro-80:57315] *** on communicator MPI_COMM_WORLD
[MacBook-Pro-80:57315] *** MPI_ERR_INTERN: internal error
[MacBook-Pro-80:57315] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[MacBook-Pro-80:57315] ***    and potentially your MPI job)
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:

  Process name: [[50012,1],0]
  Exit code:    17
--------------------------------------------------------------------------

Thanks,
Louis
