Hi, I am attempting to use the sm and vader BTLs between a client and server process, but it seems impossible to use fast transports (i.e. not TCP) between two independent groups started with two separate mpirun invocations. Am I correct, or is there a way to communicate using shared memory between a client and server like this? It seems this might be the case: https://github.com/open-mpi/ompi/blob/master/ompi/dpm/dpm.c#L495
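For reference, the pattern on each side boils down to the following. This is a minimal sketch using the C API rather than the deprecated C++ bindings; error handling is omitted, the service name "my-service" is made up for illustration, and it assumes the port name is resolved through the name service (e.g. via ompi-server):

```c
/* Minimal sketch of the client/server pattern (C API; error checks omitted).
 * Assumes the port name is exchanged via the MPI name service, e.g. ompi-server. */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm intercomm;
    char port[MPI_MAX_PORT_NAME];

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);
        /* "my-service" is a hypothetical name for the client to look up. */
        MPI_Publish_name("my-service", MPI_INFO_NULL, port);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
    } else {
        MPI_Lookup_name("my-service", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm);
    }

    /* ... traffic over intercomm is what appears to go over TCP only ... */
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```

Each side is launched with its own mpirun, as described below.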
The server calls MPI::COMM_WORLD.Accept() and the client calls MPI::COMM_WORLD.Connect(). Each program is started with "mpirun --np 1 --mca btl self,sm,vader <executable>", where the executable is either the client or the server program. When no BTL is specified, both establish a TCP connection just fine. But when the sm and vader BTLs are specified, both client and server exit immediately after the Connect() call with the message copied at the end. It seems as though intergroup communication can't use a fast transport and only uses TCP. Also, as expected, running the Accept() and Connect() calls within a single program with "mpirun -np 2 --mca btl self,sm,vader ..." does use shared memory as the transport.

$> mpirun --ompi-server "3414491136.0;tcp://10.4.131.16:49775" -np 1 --mca btl self,vader ./server

At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[50012,1],0]) is on host: MacBook-Pro-80
  Process 2 ([[50010,1],0]) is on host: MacBook-Pro-80
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[MacBook-Pro-80.local:57315] [[50012,1],0] ORTE_ERROR_LOG: Unreachable in file dpm_orte.c at line 523
[MacBook-Pro-80:57315] *** An error occurred in MPI_Comm_accept
[MacBook-Pro-80:57315] *** reported by process [7572553729,4294967296]
[MacBook-Pro-80:57315] *** on communicator MPI_COMM_WORLD
[MacBook-Pro-80:57315] *** MPI_ERR_INTERN: internal error
[MacBook-Pro-80:57315] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[MacBook-Pro-80:57315] ***    and potentially your MPI job)
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[50012,1],0]
  Exit code:    17
--------------------------------------------------------------------------

Thanks,
Louis