Hi Gilles,
> On 2 Nov 2018, at 11:03 am, Gilles Gouaillardet wrote:
> I noted the stack traces refer to opal_cuda_memcpy(). Is this issue specific to
> CUDA environments?
No, this is just on normal CPU-only nodes. But memcpy always goes through
opal_cuda_memcpy when CUDA support is enabled, even if there are no GPUs in use
(or indeed, even installed).
> The coll/tuned default collective module is known not to work when tasks use
> matching but different signatures.
> For example, one task sends one vector of N elements, and the other task
> receives N elements.
This is the call that triggers it:
    ierror = MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, S[0],
                            recvcounts, displs, mpitype_vec_nobs, node_comm);
(and changing the source datatype to MPI_BYTE to avoid the NULL handle doesn’t
help).
> A workaround worth trying is to
> mpirun --mca coll basic ...
Thanks — using --mca coll basic,libnbc fixes it (basic on its own fails because
it can’t work out what to use for Iallgather).
> Last but not least, could you please post a minimal example (and the number
> of MPI tasks used) that can evidence the issue ?
I’m just waiting for the user to get back to me with the okay to share the
code. Otherwise, I’ll see what I can put together myself. It works on 42 cores
(at 14 per node = 3 nodes) but fails for 43 cores (so 1 rank on the 4th node).
The communicator includes 1 rank per node, so it’s going from a three-rank
communicator to a four-rank communicator — perhaps the tuned algorithm changes
at that point?
Cheers,
Ben