Dear All,

The next piece of feedback is about "collective communications".

A collective communication may abort (abend) when it uses a buffer larger than 2GiB.
The problem occurs under the following condition:
-- communicator_size * count(scount/rcount) >= 2GiB
It can occur even on a small PC cluster.
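
As a rough illustration of why a small cluster is enough, the sketch below
computes the smallest per-process count at which the product reaches 2^31.
It assumes a 32-bit int (so the limit is INT_MAX + 1); the helper name
min_triggering_count is hypothetical and is not part of Open MPI.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper (not Open MPI code): smallest count such that
 * comm_size * count no longer fits in an int.
 * The arithmetic is done in 64 bits so the helper itself cannot overflow. */
static int64_t min_triggering_count(int comm_size)
{
    int64_t limit = (int64_t)INT_MAX + 1;        /* 2^31 when int is 32 bits */
    return (limit + comm_size - 1) / comm_size;  /* ceiling division */
}
```

Under that assumption, min_triggering_count(16) is 134217728 (2^27), so
16 processes each contributing 2^27 elements already reach the limit.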

The following is one of the suspicious parts.
(There is much similar code in ompi/coll/tuned/*.c.)

--- in ompi/coll/tuned/coll_tuned_allgather.c, line 398 (V1.4.X's trunk) ---
    tmprecv = (char*) rbuf + rank * rcount * rext;
-----------------------------------------------------------------

If this condition is met, "rank * rcount" overflows, because the product
is computed in int.
So we fixed it tentatively as follows
(casting int to size_t):
--- in ompi/coll/tuned/coll_tuned_allgather.c, line 398 ---
    tmprecv = (char*) rbuf + (size_t)rank * rcount * rext;
------------------------------------------------------------
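
The effect of the cast can be sketched in isolation as follows. This is
only an illustration, not the Open MPI code: both helper names are
hypothetical, rext is assumed to be a non-negative ptrdiff_t extent, and
size_t is assumed to be 64 bits wide. Signed int overflow is undefined
behavior in C, so the check is done in 64-bit arithmetic instead of
actually overflowing an int.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if rank * rcount would not fit in an int,
 * i.e. if the original expression would overflow. */
static int int_product_overflows(int rank, int rcount)
{
    return (int64_t)rank * (int64_t)rcount > (int64_t)INT_MAX;
}

/* Hypothetical helper: byte offset computed the way the tentative fix does.
 * Casting the first operand to size_t makes the whole product be
 * evaluated in size_t, left to right. */
static size_t recv_offset(int rank, int rcount, ptrdiff_t rext)
{
    return (size_t)rank * rcount * rext;
}
```

For example, with rank = 3, rcount = 1 << 30 and rext = 1, the int
product would overflow, while recv_offset returns 3221225472.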

Fixing this problem requires changes not only in "ompi/coll/tuned" but also
in other code.
We are trying to fix it, but the following functions still have a problem
(their arguments may overflow):
- "ompi_coll_tuned_sendrecv" may be called with "scount/rcount" set above 2GiB.
- "ompi_datatype_copy_content_same_ddt" may be called with "count" set above
  2GiB.
- "basic_linear" in Allgather: Bcast may be called with "count" set above 2GiB.

Best Regards,
Yuki Matsumoto
MPI development team,
Fujitsu
