Ralph, thanks for the quick reply. Is cross-job fast transport like InfiniBand supported?
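For reference, here is a minimal sketch of the client/server pattern from my original post below, using the C API rather than the C++ bindings; the file names and the argv-based port hand-off are just illustrative (in my actual runs the processes rendezvous through --ompi-server):

/* server.c - minimal sketch: open a port and accept one client. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char port_name[MPI_MAX_PORT_NAME];
    MPI_Comm client;

    MPI_Init(&argc, &argv);
    MPI_Open_port(MPI_INFO_NULL, port_name);
    printf("port: %s\n", port_name);   /* hand this string to the client */
    MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
    /* ... talk over the 'client' intercommunicator ... */
    MPI_Comm_disconnect(&client);
    MPI_Close_port(port_name);
    MPI_Finalize();
    return 0;
}

/* client.c - minimal sketch: connect to the port string in argv[1]. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm server;

    MPI_Init(&argc, &argv);
    if (argc < 2) {
        fprintf(stderr, "usage: client <port-string>\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
    /* ... talk over the 'server' intercommunicator ... */
    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}

Each binary is launched with its own "mpirun -np 1 --mca btl self,sm,vader", which is exactly the two-job case that aborts in the transcript below.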
Louis

On Tue, Jun 14, 2016 at 3:53 PM Ralph Castain <r...@open-mpi.org> wrote:
> Nope - we don’t currently support cross-job shared memory operations.
> Nathan has talked about doing so for vader, but not at this time.
>
>
> On Jun 14, 2016, at 12:38 PM, Louis Williams <louis.willi...@gatech.edu> wrote:
>
> Hi,
>
> I am attempting to use the sm and vader BTLs between a client and server
> process, but it seems impossible to use fast transports (i.e. not TCP)
> between two independent groups started with two separate mpirun
> invocations. Am I correct, or is there a way to communicate using shared
> memory between a client and server like this? It seems this might be the
> case: https://github.com/open-mpi/ompi/blob/master/ompi/dpm/dpm.c#L495
>
> The server calls MPI::COMM_WORLD.Accept() and the client calls
> MPI::COMM_WORLD.Connect(). Each program is started with "mpirun --np 1
> --mca btl self,sm,vader <executable>", where the executable is either the
> client or the server program. When no BTL is specified, both establish a
> TCP connection just fine. But when the sm and vader BTLs are specified,
> both client and server exit immediately after the Connect() call with the
> message copied at the end. It seems as though intergroup communication
> can't use fast transports and only uses TCP.
>
> Also, as expected, running the Accept() and Connect() calls within a
> single program with "mpirun -np 2 --mca btl self,sm,vader ..." uses
> shared memory as the transport.
>
> $> mpirun --ompi-server "3414491136.0;tcp://10.4.131.16:49775" -np 1
>    --mca btl self,vader ./server
>
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[50012,1],0]) is on host: MacBook-Pro-80
>   Process 2 ([[50010,1],0]) is on host: MacBook-Pro-80
>   BTLs attempted: self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> [MacBook-Pro-80.local:57315] [[50012,1],0] ORTE_ERROR_LOG: Unreachable in
> file dpm_orte.c at line 523
> [MacBook-Pro-80:57315] *** An error occurred in MPI_Comm_accept
> [MacBook-Pro-80:57315] *** reported by process [7572553729,4294967296]
> [MacBook-Pro-80:57315] *** on communicator MPI_COMM_WORLD
> [MacBook-Pro-80:57315] *** MPI_ERR_INTERN: internal error
> [MacBook-Pro-80:57315] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> [MacBook-Pro-80:57315] *** and potentially your MPI job)
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated.
> The first process to do so was:
>
>   Process name: [[50012,1],0]
>   Exit code:    17
> --------------------------------------------------------------------------
>
> Thanks,
> Louis
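P.S. For completeness, the single-invocation case from my original post (both the Accept and the Connect inside one "mpirun -np 2"), which does select shared memory, can be reproduced with a sketch like the one below. Again this uses the C API, the file name is illustrative, and the port string is simply broadcast over MPI_COMM_WORLD:

/* pair.c - minimal sketch: rank 0 accepts, rank 1 connects,
 * both inside a single "mpirun -np 2" invocation. */
#include <mpi.h>

int main(int argc, char **argv)
{
    char port_name[MPI_MAX_PORT_NAME];
    int rank;
    MPI_Comm inter;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        MPI_Open_port(MPI_INFO_NULL, port_name);

    /* Hand the port string from rank 0 to rank 1 inside the shared job. */
    MPI_Bcast(port_name, MPI_MAX_PORT_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Close_port(port_name);
    } else {
        MPI_Comm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    }

    /* Both processes belong to the same job, so traffic over 'inter'
     * is eligible for the sm/vader BTLs. */
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}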