Ralph, thanks for the quick reply. Are fast cross-job transports such as
InfiniBand supported?

Louis

On Tue, Jun 14, 2016 at 3:53 PM Ralph Castain <r...@open-mpi.org> wrote:

> Nope - we don’t currently support cross-job shared memory operations.
> Nathan has talked about doing so for vader, but not at this time.
>
>
> On Jun 14, 2016, at 12:38 PM, Louis Williams <louis.willi...@gatech.edu>
> wrote:
>
> Hi,
>
> I am attempting to use the sm and vader BTLs between a client and a
> server process, but it seems impossible to use fast transports (i.e.
> anything other than TCP) between two independent groups started with two
> separate mpirun invocations. Am I correct, or is there a way to
> communicate over shared memory between a client and server like this?
> This code suggests it isn't supported:
> https://github.com/open-mpi/ompi/blob/master/ompi/dpm/dpm.c#L495
>
> The server calls MPI::COMM_WORLD.Accept() and the client calls
> MPI::COMM_WORLD.Connect(). Each program is started with "mpirun --np 1
> --mca btl self,sm,vader <executable>", where the executable is either the
> client or the server program. When no BTL is specified, both establish a
> TCP connection just fine. But when the sm and vader BTLs are specified,
> both client and server exit immediately after the Connect() call with the
> error message copied at the end of this mail. It seems as though
> intergroup communication can't use a fast transport and only works over
> TCP.
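>
> For reference, here is a minimal sketch of the pattern I am using (written
> against the plain C API for brevity; my actual programs use the C++
> bindings, and the "dpm-test" service name is just a placeholder):
>
>   /* dpm_demo.c: accept/connect sketch. Build with mpicc and run the
>    * server and the client under two separate mpirun invocations, both
>    * pointed at the same ompi-server via --ompi-server. */
>   #include <mpi.h>
>   #include <string.h>
>
>   int main(int argc, char **argv)
>   {
>       char port[MPI_MAX_PORT_NAME];
>       MPI_Comm inter;
>
>       MPI_Init(&argc, &argv);
>
>       if (argc > 1 && strcmp(argv[1], "server") == 0) {
>           /* Server side: open a port, publish it, wait for one client. */
>           MPI_Open_port(MPI_INFO_NULL, port);
>           MPI_Publish_name("dpm-test", MPI_INFO_NULL, port);
>           MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
>           MPI_Unpublish_name("dpm-test", MPI_INFO_NULL, port);
>           MPI_Close_port(port);
>       } else {
>           /* Client side: look up the published port and connect to it. */
>           MPI_Lookup_name("dpm-test", MPI_INFO_NULL, port);
>           MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
>       }
>
>       /* ... communicate over the intercommunicator "inter" ... */
>       MPI_Comm_disconnect(&inter);
>       MPI_Finalize();
>       return 0;
>   }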
>
> Also, as expected, running the Accept() and Connect() calls under a
> single mpirun invocation with "mpirun -np 2 --mca btl self,sm,vader ..."
> does use shared memory as the transport.
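>
> (For what it's worth, which BTL actually gets selected can be confirmed
> by raising the BTL verbosity, e.g. "mpirun -np 2 --mca btl self,sm,vader
> --mca btl_base_verbose 100 ./combined", where ./combined is just a
> placeholder for the combined test program.)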
>
> $> mpirun --ompi-server "3414491136.0;tcp://10.4.131.16:49775" -np 1
> --mca btl self,vader ./server
>
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[50012,1],0]) is on host: MacBook-Pro-80
>   Process 2 ([[50010,1],0]) is on host: MacBook-Pro-80
>   BTLs attempted: self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> [MacBook-Pro-80.local:57315] [[50012,1],0] ORTE_ERROR_LOG: Unreachable in
> file dpm_orte.c at line 523
> [MacBook-Pro-80:57315] *** An error occurred in MPI_Comm_accept
> [MacBook-Pro-80:57315] *** reported by process [7572553729,4294967296]
> [MacBook-Pro-80:57315] *** on communicator MPI_COMM_WORLD
> [MacBook-Pro-80:57315] *** MPI_ERR_INTERN: internal error
> [MacBook-Pro-80:57315] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> [MacBook-Pro-80:57315] ***    and potentially your MPI job)
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
>
>   Process name: [[50012,1],0]
>   Exit code:    17
> --------------------------------------------------------------------------
>
> Thanks,
> Louis
>
