Dear All,

Our next piece of feedback is about "collective communications".
Collective communication may terminate abnormally (abend) when a buffer larger than 2GiB is used. The problem occurs under the following condition:

  -- communicator_size * count (scount/rcount) >= 2GiB

It can occur even on a small PC cluster.

The following is one of the suspicious parts, at line 398 (there is much similar code in ompi/coll/tuned/*.c):

--- in ompi/coll/tuned/coll_tuned_allgather.c (V1.4.X's trunk) ---
tmprecv = (char*) rbuf + rank * rcount * rext;
-----------------------------------------------------------------

If this condition is met, "rank * rcount" overflows: both operands are int, so the product is computed in int before it is multiplied by rext.

So we fixed it tentatively as follows (casting int to size_t):

--- in ompi/coll/tuned/coll_tuned_allgather.c --------------
tmprecv = (char*) rbuf + (size_t)rank * rcount * rext;
------------------------------------------------------------

Fixing this problem requires changes not only in "ompi/coll/tuned" but also in other code. We tried to fix it, but the following functions still have problems (an argument may overflow):

- "ompi_coll_tuned_sendrecv" may be called with "scount/rcount" over 2GiB.
- "ompi_datatype_copy_content_same_ddt" may be called with "count" over 2GiB.
- "basic_linear" in Allgather: Bcast may be called with "count" over 2GiB.

Best Regards,
Yuki Matsumoto
MPI development team, Fujitsu