On Fri, Feb 6, 2015 at 8:54 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> George,
>
> Can you point me to another project that uses 128-bit atomics?
>

http://icl.cs.utk.edu/parsec/. It uses lock-free structures heavily, and
128-bit atomics are the safest and fastest way to implement them.
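
For context, the usual reason to want a double-width CAS is to update a
pointer together with a modification counter in one atomic step, so that a
stale head is rejected even if the same pointer value reappears (the ABA
problem). The sketch below only illustrates that idea; it is not code from
PaRSEC or Open MPI. To stay portable it shrinks the pair to a 32-bit pool
index plus a 32-bit counter packed into one 64-bit word and uses C11
<stdatomic.h>; a 128-bit version would pair a full 64-bit pointer with a
64-bit counter. All names (lifo_t, lifo_push, lifo_pop, POOL_SIZE, NIL) are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define POOL_SIZE 8
#define NIL UINT32_MAX            /* hypothetical "null" index */

/* Items live in a fixed pool and link to each other by index, so a
 * racing reader never touches freed memory. */
typedef struct {
    _Atomic uint32_t next;
    int payload;
} item_t;

typedef struct {
    _Atomic uint64_t head;        /* low 32 bits: index, high 32 bits: counter */
    item_t pool[POOL_SIZE];
} lifo_t;

static uint64_t pack(uint32_t index, uint32_t counter) {
    return ((uint64_t)counter << 32) | index;
}
static uint32_t head_index(uint64_t h)   { return (uint32_t)h; }
static uint32_t head_counter(uint64_t h) { return (uint32_t)(h >> 32); }

static void lifo_init(lifo_t *l) {
    atomic_init(&l->head, pack(NIL, 0));
}

/* Push pool[idx]: link it to the current head, then swing the head to it.
 * The counter is bumped on every successful update. */
static void lifo_push(lifo_t *l, uint32_t idx) {
    assert(idx < POOL_SIZE);
    uint64_t old = atomic_load(&l->head);
    uint64_t desired;
    do {
        atomic_store(&l->pool[idx].next, head_index(old));
        desired = pack(idx, head_counter(old) + 1);
    } while (!atomic_compare_exchange_weak(&l->head, &old, desired));
}

/* Pop the top index, or return NIL if the LIFO is empty. The CAS fails if
 * either the index or the counter changed since the head was read. */
static uint32_t lifo_pop(lifo_t *l) {
    uint64_t old = atomic_load(&l->head);
    uint64_t desired;
    do {
        if (head_index(old) == NIL) {
            return NIL;
        }
        uint32_t next = atomic_load(&l->pool[head_index(old)].next);
        desired = pack(next, head_counter(old) + 1);
    } while (!atomic_compare_exchange_weak(&l->head, &old, desired));
    return head_index(old);
}
```

Pushing indices 0 and then 1 and popping twice returns 1 and then 0. Every
successful CAS bumps the counter, which is what lets a recycled index be
told apart from the original one.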


> In my tests, I noticed that the volatile keyword is one of the triggers
> of the compiler bug.
>

I usually use it to qualify the location that is changed atomically.
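
As an illustration of that convention, the hypothetical wrapper below
qualifies only the target address as volatile, while the expected and
desired values are ordinary by-value arguments. It assumes a compiler that
provides the GCC-style __atomic builtins (gcc and clang do); cas_64 is an
invented name, not an Open MPI function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wrapper: only the location being changed is volatile.
 * Returns true if *addr held `expected` and was replaced by `desired`. */
static bool cas_64(volatile int64_t *addr, int64_t expected, int64_t desired) {
    return __atomic_compare_exchange_n(addr, &expected, desired,
                                       false,            /* strong CAS */
                                       __ATOMIC_SEQ_CST, /* on success */
                                       __ATOMIC_SEQ_CST  /* on failure */);
}
```

A caller would declare the shared slot as volatile int64_t and pass its
address; a CAS with a stale expected value returns false and leaves the slot
unchanged.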


> At this stage, I could not find anything wrong in ompi, and it works
> fine with recent gcc and icc, so I concluded this is an icc bug that has
> since been fixed; all ompi can do is hide the symptom.
>

These issues are tricky to trigger: they need specific race conditions
while pointers are being manipulated. There are dozens of papers about how
to correctly implement FIFOs with CAS2, and even after peer review some of
them turned out to be incorrect. What I am saying is that we are quick to
blame these failures on the icc compiler, while we have no formal proof
that the FIFO algorithm in Open MPI is correct.

  George.



>
> Cheers,
>
> Gilles
>
>
> George Bosilca <bosi...@icl.utk.edu> wrote:
> My feeling is that the current patch hides the symptoms without addressing
> the real issue.
>
> As a side note: the compiler incriminated in this thread works perfectly
> for 128-bit atomic operations in other projects where I use atomic LIFO &
> FIFO (but not the ones from OMPI, as I already raised my concerns about them).
>
>   George.
>
> PS: Why are there totally unrelated comments about FIFO in
> opal_lifo.h (starting at line 61)?
>
> On Wed, Feb 4, 2015 at 11:30 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>>  Paul and all,
>>
>> i just pushed
>> https://github.com/open-mpi/ompi/commit/b42e3441294e9fe787fe8e9ad7403d5b8e465163
>>
>> When a buggy compiler is detected, configure now forces
>> OPAL_HAVE_CMPXCHG16B=0.
>> This is enough to make the opal_lifo test and make check pass again.
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 2015/02/04 17:26, Gilles Gouaillardet wrote:
>>
>> Paul,
>>
>> my previous email was misleading.
>>
>> what i really meant is the opal_fifo test works fine with icc 2013u5
>> (the release before 2013sp1) and
>> icc 2013sp1u2 and later
>>
>> So even if the reproducer fails with icc older than 2013sp1u2, that
>> might not impact ompi,
>> since for other reasons the bug is not hit.
>>
>> for example, with icc 2013u5, OPAL_HAVE_CMPXCHG16B=0 so ompi stays away
>> from the compiler bug.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2015/02/04 17:15, Paul Hargrove wrote:
>>
>>  Gilles,
>>
>> Who says only 2 versions are affected?
>>
>> I have access to 9 revisions of icc.
>> Using your reduced case I find 7 that fail and only 2 (the latest two) that
>> pass.
>> Discounting icc-12 (which can't compile the test), that makes 6 versions
>> affected by the bug (not 2).
>>
>> -Paul
>>
>> $ for x in 12.1.5.339 13.0.0.079 13.0.1.117 13.1.2.183 13.1.3.192
>> 14.0.0.080 14.0.1.106 14.0.2.144 15.0.1.133; do module swap intel intel/$x
>> ; echo @ Testing Intel compiler version $x; icc conftest.c && ./a.out &&
>> echo PASS ; done
>> @ Testing Intel compiler version 12.1.5.339
>> conftest.c(10): error: identifier "__int128_t" is undefined
>>       __int128_t value;
>>       ^
>>
>> compilation aborted for conftest.c (code 2)
>> @ Testing Intel compiler version 13.0.0.079
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 13.0.1.117
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 13.1.2.183
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 13.1.3.192
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 14.0.0.080
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 14.0.1.106
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 14.0.2.144
>> PASS
>> @ Testing Intel compiler version 15.0.1.133
>> PASS
>>
>> On Tue, Feb 3, 2015 at 11:45 PM, Gilles Gouaillardet 
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>>
>>   Nathan,
>>
>> imho, this is a compiler bug and only two versions are affected :
>> - intel icc 14.0.0.080 (aka 2013sp1)
>> - intel icc 14.0.1.106 (aka 2013sp1u1)
>> /* note the bug only occurs with -O1 and higher optimization levels */
>>
>> a simple reproducer is attached
>>
>> a simple workaround is to configure with ac_cv_type___int128=0
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2015/02/04 4:17, Nathan Hjelm wrote:
>>
>> That's the second report involving icc 14. I will dig into this later
>> this week.
>>
>> -Nathan
>>
>> On Mon, Feb 02, 2015 at 11:03:41PM -0800, Paul Hargrove wrote:
>>
>>     I have seen opal_fifo hang on 2 distinct systems
>>     + Linux/ppc32 with xlc-11.1
>>     + Linux/x86-64 with icc-14.0.1.106
>>    I have no explanation to offer for either hang.
>>    No "weird" configure options were passed to either.
>>    -Paul
>>    --
>>    Paul H. Hargrove                          phhargr...@lbl.gov
>>    Computer Languages & Systems Software (CLaSS) Group
>>    Computer Science Department               Tel: +1-510-495-2352
>>    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>>   _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2015/02/16911.php
>>
>
>
>
