Re: [OMPI devel] Collective communications may be abend when it use over 2GiB buffer

Tomoya Adachi Fri, 16 Mar 2012 03:28:56 -0400

Hi George,

I'm a member of the Fujitsu MPI development team.
Thank you for picking up the issue.


We checked the changesets and unfortunately found they are incomplete.

Our testing method is as follows:
- Using LLVM clang to compile the trunk with -ftrapv (integer overflow
  detection), because GCC's -ftrapv is broken :-(
- Checking the algorithms with 600MB MPI_BYTE messages on an 8-node cluster.
  The 'v' functions (which take int *displs) are checked with a 300MB message
  per process,
  i.e. count[] = {300M, 300M, ..., 300M} and displs[] = {0, 300M, ..., 2100M}
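
For illustration, the 'v' layout above can be sketched as follows. This is
only a sketch under assumptions: "300M" is read as 300 * 10^6 bytes, int is
32 bits, and NPROCS, PER_PROC, fill_layout and total_exceeds_int are
hypothetical names, not anything in the trunk. It shows why this layout is a
useful test: every displacement still fits in an int, but the total does not.

```c
#include <assert.h>
#include <limits.h>

enum { NPROCS = 8 };          /* one process per node on the 8-node cluster (assumed) */
#define PER_PROC 300000000    /* "300M", read here as 300 * 10^6 bytes (assumed) */

/* Fill counts[] and displs[] the way the 'v' collectives receive them.
 * The running total is kept in a 64-bit variable and returned. */
long long fill_layout(int counts[NPROCS], int displs[NPROCS])
{
    long long total = 0;
    for (int i = 0; i < NPROCS; i++) {
        counts[i] = PER_PROC;
        assert(total <= INT_MAX);   /* the largest displacement is 2100M */
        displs[i] = (int)total;
        total += counts[i];
    }
    return total;
}

/* Nonzero when the total can no longer be held in an int. */
int total_exceeds_int(long long total)
{
    return total > INT_MAX;
}
```

Any code that adds up counts[] in an int goes past INT_MAX on this input,
which is the kind of overflow -ftrapv is meant to trap.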

Then, we detected five issues.

- ompi_datatype_copy_content_same_ddt does not work correctly (partially fixed
  in r26097)
  * the second argument of opal_datatype_copy_content_same_ddt() should be
    'length'
  * pDestBuf and pSrcBuf should be advanced in the loop

- Reduce_scatter algorithms overflow not in a multiplication but in an addition
  like "total_count += rcounts[i];"

- 'binomial' algorithms for Gather and Scatter still have integer overflows
  ("mycount *= rcount;" and "total_recv += mycount;")
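
To make the two points of the first issue concrete, here is a minimal sketch
of a chunked copy. It is not the Open MPI code: copy_chunk is a hypothetical
stand-in for a low-level routine whose count is an int, and copy_large and
max_chunk are names made up for this example. Each call receives the length
of the current chunk rather than the full count, and both pointers advance
after every chunk.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a low-level copy that takes an int count. */
static void copy_chunk(size_t elem_size, int count, char *dst, const char *src)
{
    assert(count >= 0);
    memcpy(dst, src, (size_t)count * elem_size);
}

/* Copy `count` elements of `elem_size` bytes, at most `max_chunk` elements
 * per call to copy_chunk(). */
void copy_large(size_t elem_size, size_t count, size_t max_chunk,
                char *dst, const char *src)
{
    assert(max_chunk > 0 && max_chunk <= (size_t)INT_MAX);
    while (count > 0) {
        size_t length = count < max_chunk ? count : max_chunk;
        /* pass the chunk length, not the full count */
        copy_chunk(elem_size, (int)length, dst, src);
        /* advance both buffers, otherwise every chunk lands on the first */
        dst += length * elem_size;
        src += length * elem_size;
        count -= length;
    }
}
```

With max_chunk set to INT_MAX, a count above 2^31 is split into pieces that
each fit the int parameter.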

Even so, some collectives still do not work, for the following reasons:

- The PML aborts when count >= 2^31 because the convertor functions use a
  (u)int32_t count
  (binomial and recursive halving collective algorithms are affected)

- ompi_datatype_create_indexed also has a problem when the sum of
  pBlockLength[] is >= 2^31
  (used in some Allgatherv algorithms)
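
The addition and multiplication overflows listed above ("total_count +=
rcounts[i];", "mycount *= rcount;", the sum of pBlockLength[]) share one
pattern, which can be sketched like this. The helpers sum_counts, mul_counts
and fits_in_int are hypothetical names, and the sketch assumes size_t is
wider than int (e.g. a 64-bit platform). It does the arithmetic in size_t
and only afterwards asks whether the result still fits an int-based
interface.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Sum int counts into a size_t accumulator instead of an int. */
size_t sum_counts(const int *counts, int n)
{
    size_t total = 0;
    for (int i = 0; i < n; i++) {
        assert(counts[i] >= 0);
        total += (size_t)counts[i];
    }
    return total;
}

/* Multiply two non-negative counts. Returns 0 on success, -1 if an
 * argument is negative or the product does not fit in a size_t. */
int mul_counts(int a, int b, size_t *out)
{
    assert(out != NULL);
    if (a < 0 || b < 0)
        return -1;
    if (a != 0 && (size_t)b > SIZE_MAX / (size_t)a)
        return -1;
    *out = (size_t)a * (size_t)b;
    return 0;
}

/* Nonzero if v can still be passed through an int count parameter. */
int fits_in_int(size_t v)
{
    return v <= (size_t)INT_MAX;
}
```

This covers the arithmetic only. A value for which fits_in_int() is false
still cannot go through a convertor that takes a (u)int32_t count, which is
the harder part of the problem.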

Changing the datatype (convertor) interfaces and internals to use (s)size_t
might be hard work (but should it be done in the future?).
Do you have any good ideas?

Regards,
Tomoya Adachi
MPI development team, Fujitsu

(2012/03/06 7:25), George Bosilca wrote:
> I gave it a try (r26103). It was messy, and I hope I got it right. Let's soak
> it for a few days with our nightly testing to see how it behaves.
> 
>    george.
> 
> On Mar 5, 2012, at 16:37 , N.M. Maclaren wrote:
> 
>> On Mar 5 2012, George Bosilca wrote:
>>>
>>> I was afraid about all those little intermediary steps. I asked a compiler 
>>> guy and apparently reversing the order (aka starting with the ptrdiff_t 
>>> variable) will not solve anything. The only portable way to solve this is 
>>> to cast every single member, to prevent __any__ compiler from hurting us.
>>
>> That is true, but even that may not help, given that each version of
>> the C standard has been incompatible with its predecessors.  And see
>> below.
>>
>>>> In my copy of C99, section 6.5 Expressions says "the order of evaluation
>>>> of subexpressions and the order in which side effects take place are both
>>>> unspecified." There is a footnote 71 that "specifies the precedence of
>>>> operators in the evaluation of an expression, which is the same as the
>>>> order of the major subclauses of this subclause, highest precedence 
>>>> first." It is the footnote that implies multiplication (6.5.5 
>>>> Multiplicative operators) has higher precedence than addition (6.5.6 
>>>> Additive operators) in the expression "(char*) rbuf + rank * rcount * 
>>>> rext". But, the main text states that there is no ordering of the 
>>>> subexpression "rank * rcount * rext". When the compiler chooses to 
>>>> evaluate "rank * rcount" first, the overflow described by Yuki can result. 
>>>> I think you are correct that the subexpression will get promoted to 
>>>> (ptrdiff_t), but that is not quite the same thing.
>>
>> No, it's not as simple as that :-(
>>
>> That was the intent during the standardisation of C90, but those of
>> us who tried failed to get any explicit statement into it, and the
>> situation during C99 was that "but everybody knows that" the syntax
>> rules also define the evaluation order.  We failed to get that stated
>> then, either :-(  That interpretation was apparently also the one
>> assumed by C++03, too, and now is explicitly (if informally) stated in
>> C++11.  So you theoretically can just cast the first operand to the
>> maximum precision and it will all work.
>>
>> What it means by the "order of evaluation of subexpressions" is that
>> the assignments in '(a = b) + (c = d) + (e = f)' can take place in
>> any order, which is a different issue.
>> HOWEVER, about half of the C communities have given C99 the thumbs
>> down, I doubt that C11 will be taken much notice of, gcc is the
>> de facto standard definer, and most compilers have optimisation
>> options that say "ignore the standard when it helps to go faster".
>> So the only feasible rule is to do your damnedest to defend yourself
>> against the aberrations, ambiguities and inconsistencies of C, and
>> hope for the best.  I.e. what George recommends.
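
As an illustration of the "cast every single member" approach for the
expression "(char*) rbuf + rank * rcount * rext" discussed above, a minimal
sketch could look like this. It assumes rank and rcount are ints and rext is
a ptrdiff_t, with ptrdiff_t being 64 bits wide; block_offset and
block_address are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of the block belonging to `rank`. Every int operand is cast
 * to ptrdiff_t before it is multiplied, so no intermediate product is
 * computed in int, whichever pair the compiler multiplies first. */
ptrdiff_t block_offset(int rank, int rcount, ptrdiff_t rext)
{
    assert(rank >= 0 && rcount >= 0);
    return (ptrdiff_t)rank * (ptrdiff_t)rcount * rext;
}

/* Address of that block inside the receive buffer. */
char *block_address(void *rbuf, int rank, int rcount, ptrdiff_t rext)
{
    return (char *)rbuf + block_offset(rank, rcount, rext);
}
```

For rank = 8, rcount = 300000000 and rext = 4, the plain int product
rank * rcount would already overflow, while the casted version yields
9600000000.
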
>>
>> But will even that work reliably in the medium term?  I wouldn't
>> bet on it :-(
>>
>>
>> Regards,
>> Nick Maclaren.
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
