Yuki,

I pushed a fix for this issue in the trunk (r26097). However, I disagree with you on some of the topics below.

On Mar 5, 2012, at 04:02, Y.MATSUMOTO wrote:

> Dear All,
>
> Next feedback is about "collective communications".
>
> Collective communication may be abend when it use over 2GiB buffer.
> This problem occurs following condition:
> -- communicator_size * count(scount/rcount) >= 2GiB
> It occurs in even small PC cluster.
>
> The following is one of the suspicious parts.
> (Many similar code in ompi/coll/tuned/*.c)
>
> --- in ompi/coll/tuned/coll_tuned_allgather.c (V1.4.X's trunk)---
> 398 tmprecv = (char*) rbuf + rank * rcount * rext;
> -----------------------------------------------------------------
>
> if this condition is met, "rank * rcount" is overflowed.
> So, we fixed it tentatively like following:
> (cast int to size_t)
> --- in ompi/coll/tuned/coll_tuned_allgather.c --------------
> 398 tmprecv = (char*) rbuf + (size_t)rank * rcount * rext;
> ------------------------------------------------------------

Based on my understanding of the C standard this operation should be done on the most extended type, in this particular case the one of the rext (ptrdiff_t). Thus I would say the displacement should be correctly computed.

> It needs not only "ompi/coll/tuned" but also other codes to fix this problem.
> We try to fix, but following functions have problem (argument may be
> overflowed):
> -"ompi_coll_tuned_sendrecv" may be called when "scount/rcount" sets over 2GiB.
> -"ompi_datatype_copy_content_same_ddt" may be called when "count" sets over
> 2GiB.

These two should have been fixed by the previous commit (r26097)

> -"basic_linear in Allgather": Bcast may be called when "count" sets over 2GiB.

Fixed in r26098.

george.

>
> Best Regards,
> Yuki Matsumoto
> MPI development team,
> Fujitsu
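
For anyone following the thread, here is a minimal sketch of the displacement computation discussed above. It is not Open MPI code: `block_offset` and `int_product_overflows` are hypothetical helpers, and the sketch assumes a 32-bit `int` and a 64-bit `ptrdiff_t`. One point worth checking against the standard: `*` groups left to right, so in `rank * rcount * rext` the sub-product `rank * rcount` is formed in `int` before `rext` takes part. Widening the first operand, as in the proposed patch, keeps the whole chain in the wide type.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper, not part of Open MPI: byte offset of the block
 * that belongs to `rank`. Casting the first operand means every later
 * multiplication in the chain is carried out in ptrdiff_t. */
static ptrdiff_t block_offset(int rank, int rcount, ptrdiff_t rext)
{
    assert(rank >= 0 && rcount >= 0 && rext >= 0);
    return (ptrdiff_t)rank * rcount * rext;
}

/* Hypothetical check: would rank * rcount exceed INT_MAX? The product
 * is computed in long long so the check itself cannot overflow. */
static int int_product_overflows(int rank, int rcount)
{
    assert(rank >= 0 && rcount >= 0);
    return (long long)rank * rcount > INT_MAX;
}
```

With `rank = 4` and `rcount = 600000000`, the `int` product does not fit, while `block_offset(4, 600000000, 1)` still yields 2400000000 under the assumptions stated above.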
