Some random points:
1. Are your counts ever 0? In principle, method 1 (all nonblocking, followed
by a single MPI_Waitall) should be fine, I think. With the blocking methods, I
*think* you should be fine too, but I haven't thought hard about this. I have
a nagging feeling that there might be a possibility of deadlock in there, but
I could be wrong.
2. It's been a long, long time since I've used the STL. Is &foo[x][y]
guaranteed to give the address of a contiguous buffer that MPI can use? More
specifically, is &foo[x][y] guaranteed to be equal to (&foo[x][y + N] - N)?
(Pointer arithmetic already scales by sizeof(T).) I have a dim recollection of
needing to use .cptr() or something like that... but this is a very old memory
from many years ago.
3. Why not use MPI_Alltoallw?
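
To make points 1 and 2 a bit more concrete, here is a small sketch. The names buffer_ptr and is_contiguous are made up for illustration and do not appear in the original code. As far as the C++ standard goes, std::vector<T> stores its elements contiguously for every T except bool, so &v[0] is usable as an MPI buffer when the vector is non-empty. The zero-count case is the delicate one: METHOD2 calls MPI_Sendrecv whenever scount + rcount > 0, so one of the two vectors may be empty, and evaluating &v[0] on an empty vector is undefined behavior. I believe MPI implementations accept a null buffer together with a count of 0, but treat that as an assumption to check against your MPI.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the original code): returns a pointer
// that can be passed to MPI even when the vector is empty. operator[] on an
// empty vector is undefined behavior, so the empty case returns a null
// pointer, which is meant to be paired with a count of 0.
template <class T>
T* buffer_ptr(std::vector<T>& v) {
    return v.empty() ? nullptr : &v[0];
}

// Checks the property asked about in point 2: elements are laid out back to
// back, i.e. &v[i] == &v[0] + i for every valid index i. An empty vector is
// trivially contiguous (the loop body never runs).
template <class T>
bool is_contiguous(const std::vector<T>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        if (&v[i] != &v[0] + i) return false;
    return true;
}
```

With a helper like this, a call such as MPI_Sendrecv(buffer_ptr(sbuf[p]), scount, ...) would avoid touching element 0 of an empty vector. In C++11, v.data() serves the same purpose.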
On Sep 17, 2011, at 10:06 PM, Evghenii Gaburov wrote:
> Hi All,
>
> My MPI program's basic task consists of regularly establishing point-to-point
> communication with other procs via MPI_Alltoall, and then communicating
> data. I tested it on two HPC clusters with 32-256 MPI tasks. On one of the
> systems (HPC1) this custom collective runs flawlessly, while on the other
> (HPC2) the collective causes non-reproducible deadlocks (after a day of
> running, or after a few hours or so). So, I want to figure out whether it is
> a system (HPC2) bug that I can communicate to the HPC admins, or a subtle bug
> in my code that needs to be fixed. One possibly important point: I
> communicate a huge amount of data between tasks (up to ~2GB) in several
> all2all calls.
>
> I would like to have expert eyes look at the code to confirm or disprove
> that it is deadlock-safe. I have implemented several methods (METHOD1 -
> METHOD4) that, if I am not mistaken, should in principle be deadlock-safe.
> However, as a beginner MPI user, I can easily miss something subtle, so
> I seek your help with this! I mostly used METHOD4, which has caused periodic
> deadlocks, after having deadlocks with METHOD1 and METHOD2. On HPC1 none of
> these methods deadlock in my runs. I am currently testing METHOD3, so I
> cannot comment on it yet but will later; however, I will be happy to hear
> your comments.
>
> Both systems use openmpi-1.4.3.
>
> Your answers will be of great help! Thanks!
>
> Cheers,
> Evghenii
>
> Here is the code snippet:
>
> template<class T>
> void all2all(std::vector<T> sbuf[], std::vector<T> rbuf[],
>              const int myid,
>              const int nproc)
> {
>   static int nsend[NMAXPROC], nrecv[NMAXPROC];
>   for (int p = 0; p < nproc; p++)
>     nsend[p] = sbuf[p].size();
>   MPI_Alltoall(nsend, 1, MPI_INT, nrecv, 1, MPI_INT, MPI_COMM_WORLD);
>   // let the other tasks know how much data they will receive from this one
>
> #ifdef _METHOD1_
>
>   static MPI_Status stat[NMAXPROC ];
>   static MPI_Request req[NMAXPROC*2];
>   int nreq = 0;
>   for (int p = 0; p < nproc; p++)
>     if (p != myid)
>     {
>       const int scount = nsend[p];
>       const int rcount = nrecv[p];
>       rbuf[p].resize(rcount);
>       if (scount > 0) MPI_Isend(&sbuf[p][0], scount, datatype<T>(), p,
>                                 1, MPI_COMM_WORLD, &req[nreq++]);
>       if (rcount > 0) MPI_Irecv(&rbuf[p][0], rcount, datatype<T>(), p,
>                                 1, MPI_COMM_WORLD, &req[nreq++]);
>     }
>   rbuf[myid] = sbuf[myid];
>   MPI_Waitall(nreq, req, stat);
>
> #elif defined _METHOD2_
>
>   static MPI_Status stat;
>   for (int p = 0; p < nproc; p++)
>     if (p != myid)
>     {
>       const int scount = nsend[p]*scale;
>       const int rcount = nrecv[p]*scale;
>       rbuf[p].resize(rcount);
>       if (scount + rcount > 0)
>         MPI_Sendrecv(&sbuf[p][0], scount, datatype<T>(), p, 1,
>                      &rbuf[p][0], rcount, datatype<T>(), p, 1,
>                      MPI_COMM_WORLD, &stat);
>     }
>   rbuf[myid] = sbuf[myid];
>
> #elif defined _METHOD3_
>
>   static MPI_Status stat[NMAXPROC ];
>   static MPI_Request req[NMAXPROC*2];
>   for (int dist = 1; dist < nproc; dist++)
>   {
>     const int src = (nproc + myid - dist) % nproc;
>     const int dst = (nproc + myid + dist) % nproc;
>     const int scount = nsend[dst]*scale;
>     const int rcount = nrecv[src]*scale;
>     rbuf[src].resize(rcount);
>     int nreq = 0;
>     if (scount > 0) MPI_Isend(&sbuf[dst][0], scount, datatype<T>(),
>                               dst, 1, MPI_COMM_WORLD, &req[nreq++]);
>     if (rcount > 0) MPI_Irecv(&rbuf[src][0], rcount, datatype<T>(),
>                               src, 1, MPI_COMM_WORLD, &req[nreq++]);
>     MPI_Waitall(nreq, req, stat);
>   }
>   rbuf[myid] = sbuf[myid];
>
> #elif defined _METHOD4_
>
>   static MPI_Status stat;
>   for (int dist = 1; dist < nproc; dist++)
>   {
>     const int src = (nproc + myid - dist) % nproc;
>     const int dst = (nproc + myid + dist) % nproc;
>     const int scount = nsend[dst]*scale;
>     const int rcount = nrecv[src]*scale;
>     rbuf[src].resize(rcount);
>     if ((myid/dist) & 1)
>     {
>       if (scount > 0) MPI_Send(&sbuf[dst][0], scount,
>                                datatype<T>(), dst, 1, MPI_COMM_WORLD);
>       if (rcount > 0) MPI_Recv(&rbuf[src][0], rcount,
>                                datatype<T>(), src, 1, MPI_COMM_WORLD, &stat);
>     }
>     else
>     {
>       if (rcount > 0) MPI_Recv(&rbuf[src][0], rcount,
>                                datatype<T>(), src, 1, MPI_COMM_WORLD, &stat);
>       if (scount > 0) MPI_Send(&sbuf[dst][0], scount,
>                                datatype<T>(), dst, 1, MPI_COMM_WORLD);
>     }
>   }
>   rbuf[myid] = sbuf[myid];
> #endif
> }
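
Regarding point 3 above: since every block here uses the same datatype, MPI_Alltoallv (the simpler sibling of MPI_Alltoallw) may already be enough. It needs one contiguous send buffer plus per-peer counts and displacements. The sketch below shows only that bookkeeping, in plain C++ so it can be tried without MPI. The helper names displacements and pack are hypothetical, and the overflow check is there because counts and displacements are plain ints, which may matter for transfers approaching ~2GB. I am not claiming that overflow is the cause of the hangs.

```cpp
#include <climits>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: turns per-peer element counts into the displacement
// array that MPI_Alltoallv expects (an exclusive prefix sum, in elements).
// Throws if a count is negative or if a displacement would not fit in an int.
inline std::vector<int> displacements(const std::vector<int>& counts) {
    std::vector<int> displs(counts.size(), 0);
    long long total = 0;
    for (std::size_t p = 0; p < counts.size(); ++p) {
        if (counts[p] < 0) throw std::invalid_argument("negative count");
        if (total > INT_MAX) throw std::overflow_error("displacement exceeds INT_MAX");
        displs[p] = static_cast<int>(total);
        total += counts[p];
    }
    return displs;
}

// Hypothetical helper: concatenates the per-peer vectors into one contiguous
// buffer in rank order, matching the displacements computed above.
template <class T>
std::vector<T> pack(const std::vector<T> sbuf[], int nproc) {
    std::vector<T> flat;
    for (int p = 0; p < nproc; ++p)
        flat.insert(flat.end(), sbuf[p].begin(), sbuf[p].end());
    return flat;
}

// With nsend/nrecv exchanged as in the original code, and sdispls/rdispls
// computed from them, the transfer itself would look roughly like:
//   MPI_Alltoallv(send_ptr, nsend, &sdispls[0], datatype<T>(),
//                 recv_ptr, nrecv, &rdispls[0], datatype<T>(),
//                 MPI_COMM_WORLD);
// followed by copying each received slice back into rbuf[p].
```

The cost is an extra copy of the send and receive data, in exchange for letting the library choose the communication schedule.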
--
Jeff Squyres