Hi George,

I'm a member of the Fujitsu MPI development team. Thank you for picking up the issue.
We checked the changesets and unfortunately found that they are incomplete.

Our testing method is as follows:

- Using LLVM clang to compile trunk with -ftrapv (integer overflow
  detection), because GCC's -ftrapv is broken :-(
- Checking algorithms with 600MB MPI_BYTE messages on an 8-node cluster.
  'v' functions (which take int *displs) are checked with a 300MB message
  per process, i.e.
    count[]  = {300M, 300M, ..., 300M} and
    displs[] = {0, 300M, ..., 2100M}

Then, we detected five issues.

- ompi_datatype_copy_content_same_ddt does not work correctly
  (partially fixed in r26097)
  * the second argument of opal_datatype_copy_content_same_ddt()
    should be 'length'
  * pDestBuf and pSrcBuf should be advanced in the loop
- Reduce_scatter algorithms overflow not in multiplication but in
  addition, like "total_count += rcounts[i];"
- 'binomial' algorithms for Gather and Scatter still have integer
  overflow ("mycount *= rcount;" and "total_recv += mycount;")

But some collectives still do not work for the following reasons:

- PML abends when count >= 2^31 because convertor functions use
  (u)int32_t count
  (binomial and recursive halving collective algorithms are affected)
- ompi_datatype_create_indexed also has a problem when the sum of
  pBlockLength[] >= 2^31
  (used in some Allgatherv algorithms)

Changing the datatype (convertor) interfaces and internals to use
(s)size_t might be hard work (but should be done in the future?).

Do you have any good ideas?

Regards,
Tomoya Adachi
MPI development team, Fujitsu

(2012/03/06 7:25), George Bosilca wrote:
> I gave it a try (r26103). It was messy, and I hope I got it right. Let's soak
> it for few days with our nightly testing to see how it behave.
>
>   george.
>
> On Mar 5, 2012, at 16:37 , N.M. Maclaren wrote:
>
>> On Mar 5 2012, George Bosilca wrote:
>>>
>>> I was afraid about all those little intermediary steps. I asked a compiler
>>> guy and apparently reversing the order (aka starting with the ptrdiff_t
>>> variable) will not solve anything.
>>> The only portable way to solve this is
>>> to cast every single member, to prevent __any__ compiler from hurting us.
>>
>> That is true, but even that may not help, given that each version of
>> the C standard has been incompatible with its predecessors. And see
>> below.
>>
>>>> In my copy of C99, section 6.5 Expressions says "the order of evaluation
>>>> of subexpressions and the order in which side effects take place are both
>>>> unspecified." There is a footnote 71 that "specifies the precedence of
>>>> operators in the evaluation of an expressions, which is the same as the
>>>> order of the major subclauses of this subclause, highest precedence
>>>> first." It is the footnote that implies multiplication (6.5.5
>>>> Multiplicative operators) has higher precedence than addition (6.5.6
>>>> Additive operators) in the expression "(char*) rbuf + rank * rcount *
>>>> rext". But, the main text states that there is no ordering of the
>>>> subexpression "rank * rcount * rext". When the compiler chooses to
>>>> evaluate "rank * rcount" first, the overflow described by Yuki can result.
>>>> I think you are correct that the subexpression will get promoted to
>>>> (ptrdiff_t), but that is not quite the same thing.
>>
>> No, it's not as simple as that :-(
>>
>> That was the intent during the standardisation of C90, but those of
>> us who tried failed to get any explicit statement into it, and the
>> situation during C99 was that "but everybody knows that" the syntax
>> rules also define the evaluation order. We failed to get that stated
>> then, either :-( That interpretation was apparently also the one
>> assumed by C++03, too, and now is explicitly (if informally) stated in
>> C++11. So you theoretically can just cast the first operand to the
>> maximum precision and it will all work.
>>
>> What it means by the "order of evaluation of subexpressions" is that
>> the assignments in '(a = b) + (c = d) + (e = f)' can take place in
>> any order, which is a different issue.
>>
>> HOWEVER, about half of the C communities have given C99 the thumbs
>> down, I doubt that C11 will be taken much notice of, gcc is the
>> de facto standard definer, and most compilers have optimisation
>> options that say "ignore the standard when it helps to go faster".
>> So the only feasible rule is to do your damnedest to defend yourself
>> against the aberrations, ambiguities and inconsistencies of C, and
>> hope for the best. I.e. what George recommends.
>>
>> But will even that work reliably in the medium term? I wouldn't
>> bet on it :-(
>>
>>
>> Regards,
>> Nick Maclaren.
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
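
For reference, the two overflow patterns discussed in this thread (the offset product "rank * rcount * rext" and the running sum "total_count += rcounts[i]") can be sketched in C as below. This is only a minimal illustration, not code from the Open MPI tree: block_offset and total_count are hypothetical helper names, and the sketch assumes a platform where ptrdiff_t and size_t are 64 bits wide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not an Open MPI function.
 * Byte offset of the block belonging to `rank`. Every int operand is
 * cast to ptrdiff_t before it is multiplied, so no intermediate
 * product is ever computed in int, whichever way the compiler
 * groups the multiplications. */
ptrdiff_t block_offset(int rank, int rcount, ptrdiff_t rext)
{
    return (ptrdiff_t)rank * (ptrdiff_t)rcount * rext;
}

/* Hypothetical helper, not an Open MPI function.
 * Sum of the per-process counts, accumulated in size_t instead of
 * int. Assumes that every count is non-negative. */
size_t total_count(const int *rcounts, int nprocs)
{
    size_t total = 0;
    for (int i = 0; i < nprocs; i++) {
        assert(rcounts[i] >= 0);
        total += (size_t)rcounts[i];
    }
    return total;
}
```

With eight processes and 300M elements each, as in the test setup above, both results exceed INT_MAX (2^31 - 1) but are representable in the wider types. This only covers the arithmetic in the collectives; it does not help where an interface itself takes a 32-bit count.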