Re: [OMPI devel] Master hangs in opal_fifo test

2015-02-09 Thread George Bosilca
On Fri, Feb 6, 2015 at 9:12 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> George,
>
> I cannot acces parsec : http error 403 :-(
>

Our webserver was down over the weekend. Please try again.

  George.



>
> I understand your point of view.
> Back to the opal_lifo test, and if i remember correctly, it hangs in the
> non multi threaded part : the very first pop loops forever since cas always
> fails in comparing values that are equal indeed.
> Though there is a possibility the problem comes from ompi, and we are just
> lucky it works with recent icc, i would not go "all in" with this ...
>
> And as you pointed, even if the problem does come from the compiler, that
> does not mean ompi algo are necessarily correct.
>
> Cheers,
>
> Gilles
>
> George Bosilca  wrote:
> On Fri, Feb 6, 2015 at 8:54 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> George,
>>
>> Can you point me to an other project that uses 128 bits atomics ?
>>
>
> http://icl.cs.utk.edu/parsec/. It heavily uses lock-free structures, and
> the 128 bits atomics are the safest and fastest way to implement them.
>
>
>> In my tests, i noticed that the volatile keyword is (one of) the trigger
>> of the compiler bug.
>>
>
> I usually use it for the location to be atomically changed.
>
>
>> At this stage, i could not see anything wrong in ompi, plus this is
>> working fine with recent gcc and icc, so i concluded this is an icc bug,
>> that is now fixed, so all ompi can do is hide the symptom.
>>
>
> These issues are pretty tricky to trigger, we need special race conditions
> while manipulating pointers. There are tens of papers about how to
> correctly implement FIFOs with CAS2, and even after peer reviews some of
> them turned out to be incorrect. What I am saying is that we are quick to
> blame these failures on the icc compiler, while we have no formal proof
> that the FIFO algorithm in Open MPI is correct.
>
>   George.
>
>
>
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> George Bosilca  wrote:
>> My feeling is that the current patch hide the symptoms without addressing
>> the real issue.
>>
>> As a side note: The compiler incriminated in this thread, works perfectly
>> for 128 bits atomic operations in other projects where I use atomic LIFO &
>> FIFO (but not the one from OMPI as I already raised my concerns about this).
>>
>>   George.
>>
>> PS: Why are there totally non-related comments about FIFO in the
>> opal_lifo.h (starting line 61)?
>>
>> On Wed, Feb 4, 2015 at 11:30 PM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>>
>>>  Paul and all,
>>>
>>> i just pushed
>>> https://github.com/open-mpi/ompi/commit/b42e3441294e9fe787fe8e9ad7403d5b8e465163
>>>
>>> when a buggy compiler is detected, configure now forces
>>> OPAL_HAVE_CMPXCHG16B=0
>>> this is enough to make opal_lifo test and make check happy again.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On 2015/02/04 17:26, Gilles Gouaillardet wrote:
>>>
>>> Paul,
>>>
>>> my previous email was misleading.
>>>
>>> what i really meant is the opal_fifo test works fine with icc 2013u5
>>> (the release before 2013sp1) and
>>> icc 2013sp1u2 and later
>>>
>>> so even if the reproducer fails with icc older that 2013sp1u2, that
>>> might not impact ompi
>>> since for other reasons, the bug is not hit
>>>
>>> for example, with icc 2013u5, OPAL_HAVE_CMPXCHG16B=0 so ompi stays away
>>> from the compiler bug.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2015/02/04 17:15, Paul Hargrove wrote:
>>>
>>>  Giles,
>>>
>>> Who says only 2 version are effected?
>>>
>>> I have access to 9 revisions of icc.
>>> Using your reduced case I find 7 that fail and only 2 (the latest two) that
>>> pass.
>>> Discounting icc-12 (which can't compile the test) that makes 6 versions
>>> effected by the bug (not 2).
>>>
>>> -Paul
>>>
>>> $ for x in 12.1.5.339 13.0.0.079 13.0.1.117 13.1.2.183 13.1.3.192
>>> 14.0.0.080 14.0.1.106 14.0.2.144 15.0.1.133; do module swap intel intel/$x
>>> ; echo @ Testing Intel compiler version $x; icc conftest.c && ./a.out &&
>>> echo PASS ; done
>>> @ Testing Intel compiler version 12.1.5.339
>>> conftest.c(10): error: identifier "__int128_t" is undefined
>>>   __int128_t value;
>>>   ^
>>>
>>> compilation aborted for conftest.c (code 2)
>>> @ Testing Intel compiler version 13.0.0.079
>>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>>> Aborted
>>> @ Testing Intel compiler version 13.0.1.117
>>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>>> Aborted
>>> @ Testing Intel compiler version 13.1.2.183
>>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>>> Aborted
>>> @ Testing Intel compiler version 13.1.3.192
>>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>>> Aborted
>>> @ Testing Intel compiler version 14.0.0.080
>>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>>> Aborted
>>> @ Testing Intel compiler 

Re: [OMPI devel] Master hangs in opal_fifo test

2015-02-06 Thread Gilles Gouaillardet
George,

I cannot access parsec: HTTP error 403 :-(

I understand your point of view.
Back to the opal_lifo test: if I remember correctly, it hangs in the
non-multithreaded part. The very first pop loops forever because the CAS
always fails, even though the values being compared are in fact equal.
Though there is a possibility that the problem comes from ompi and we are
just lucky it works with recent icc, I would not go "all in" on this ...

And as you pointed out, even if the problem does come from the compiler,
that does not mean the ompi algorithms are necessarily correct.

Cheers,

Gilles
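
To illustrate the hang described above, here is a minimal sketch of a
counted-pointer LIFO pop loop. It is not the opal_lifo code: to stay
portable it packs a 32-bit item index and a 32-bit counter into one 64-bit
word and uses C11 atomics instead of a 128-bit cmpxchg16b, and every name
in it (sketch_lifo_t, sketch_lifo_pop, ...) is hypothetical. In a
single-threaded run nothing else modifies the head, so the first
compare-and-swap has to succeed; if the compare wrongly reported a
mismatch every time, the loop would never exit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_POOL_SIZE 8u
#define SKETCH_NIL UINT32_MAX   /* marks an empty list */

/* Assumption: items are identified by an index into a small pool.
 * Head word: high 32 bits = update counter, low 32 bits = top item index. */
typedef struct {
    _Atomic uint64_t head;
    uint32_t next[SKETCH_POOL_SIZE];
} sketch_lifo_t;

static uint64_t sketch_pack(uint32_t counter, uint32_t index)
{
    return ((uint64_t)counter << 32) | index;
}

static void sketch_lifo_init(sketch_lifo_t *lifo)
{
    atomic_init(&lifo->head, sketch_pack(0, SKETCH_NIL));
}

static void sketch_lifo_push(sketch_lifo_t *lifo, uint32_t index)
{
    uint64_t old = atomic_load(&lifo->head);
    uint64_t desired;
    do {
        lifo->next[index] = (uint32_t)old;   /* link to the previous top */
        desired = sketch_pack((uint32_t)(old >> 32) + 1, index);
    } while (!atomic_compare_exchange_strong(&lifo->head, &old, desired));
}

/* Returns the popped index, or SKETCH_NIL when the list is empty.
 * *attempts is incremented once per compare-and-swap that is tried. */
static uint32_t sketch_lifo_pop(sketch_lifo_t *lifo, unsigned *attempts)
{
    uint64_t old = atomic_load(&lifo->head);
    for (;;) {
        uint32_t index = (uint32_t)old;
        if (index == SKETCH_NIL) {
            return SKETCH_NIL;
        }
        uint64_t desired =
            sketch_pack((uint32_t)(old >> 32) + 1, lifo->next[index]);
        ++*attempts;
        /* On failure, old is refreshed with the current head and we retry. */
        if (atomic_compare_exchange_strong(&lifo->head, &old, desired)) {
            return index;
        }
    }
}
```

With no concurrent thread, each successful pop should need exactly one
compare-and-swap, which is why a pop that spins in the single-threaded
part of the test points at the compare itself rather than at a race.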

George Bosilca  wrote:
>On Fri, Feb 6, 2015 at 8:54 AM, Gilles Gouaillardet 
> wrote:
>
>George,
>
>Can you point me to an other project that uses 128 bits atomics ?
>
>
>http://icl.cs.utk.edu/parsec/. It heavily uses lock-free structures, and the 
>128 bits atomics are the safest and fastest way to implement them.
>
> 
>
>In my tests, i noticed that the volatile keyword is (one of) the trigger of 
>the compiler bug.
>
>
>I usually use it for the location to be atomically changed.
>
> 
>
>At this stage, i could not see anything wrong in ompi, plus this is working 
>fine with recent gcc and icc, so i concluded this is an icc bug, that is now 
>fixed, so all ompi can do is hide the symptom.
>
>
>These issues are pretty tricky to trigger, we need special race conditions 
>while manipulating pointers. There are tens of papers about how to correctly 
>implement FIFOs with CAS2, and even after peer reviews some of them turned out 
>to be incorrect. What I am saying is that we are quick to blame these failures 
>on the icc compiler, while we have no formal proof that the FIFO algorithm in 
>Open MPI is correct.
>
>
>  George.
>
>
> 
>
>
>Cheers,
>
>Gilles
>
>
>George Bosilca  wrote:
>
>My feeling is that the current patch hide the symptoms without addressing the 
>real issue.
>
>
>As a side note: The compiler incriminated in this thread, works perfectly for 
>128 bits atomic operations in other projects where I use atomic LIFO & FIFO 
>(but not the one from OMPI as I already raised my concerns about this).
>
>
>  George.
>
>
>PS: Why are there totally non-related comments about FIFO in the opal_lifo.h 
>(starting line 61)?
>
>
>On Wed, Feb 4, 2015 at 11:30 PM, Gilles Gouaillardet 
> wrote:
>
>Paul and all,
>
>i just pushed 
>https://github.com/open-mpi/ompi/commit/b42e3441294e9fe787fe8e9ad7403d5b8e465163
>
>when a buggy compiler is detected, configure now forces OPAL_HAVE_CMPXCHG16B=0
>this is enough to make opal_lifo test and make check happy again.
>
>Cheers,
>
>Gilles
>
>
>
>On 2015/02/04 17:26, Gilles Gouaillardet wrote:
>
>Paul,
>
>my previous email was misleading.
>
>what i really meant is the opal_fifo test works fine with icc 2013u5
>(the release before 2013sp1) and
>icc 2013sp1u2 and later
>
>so even if the reproducer fails with icc older that 2013sp1u2, that
>might not impact ompi
>since for other reasons, the bug is not hit
>
>for example, with icc 2013u5, OPAL_HAVE_CMPXCHG16B=0 so ompi stays away
>from the compiler bug.
>
>Cheers,
>
>Gilles
>
>On 2015/02/04 17:15, Paul Hargrove wrote:
>
>Giles,
>
>Who says only 2 version are effected?
>
>I have access to 9 revisions of icc.
>Using your reduced case I find 7 that fail and only 2 (the latest two) that
>pass.
>Discounting icc-12 (which can't compile the test) that makes 6 versions
>effected by the bug (not 2).
>
>-Paul
>
>$ for x in 12.1.5.339 13.0.0.079 13.0.1.117 13.1.2.183 13.1.3.192
>14.0.0.080 14.0.1.106 14.0.2.144 15.0.1.133; do module swap intel intel/$x
>; echo @ Testing Intel compiler version $x; icc conftest.c && ./a.out &&
>echo PASS ; done
>@ Testing Intel compiler version 12.1.5.339
>conftest.c(10): error: identifier "__int128_t" is undefined
>  __int128_t value;
>  ^
>
>compilation aborted for conftest.c (code 2)
>@ Testing Intel compiler version 13.0.0.079
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 13.0.1.117
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 13.1.2.183
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 13.1.3.192
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 14.0.0.080
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 14.0.1.106
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 14.0.2.144
>PASS
>@ Testing Intel compiler version 15.0.1.133
>PASS
>
>On Tue, Feb 3, 2015 at 11:45 PM, Gilles Gouaillardet <
>gilles.gouaillar...@iferc.org> wrote:
>
>Nathan,
>
>imho, this is a compiler bug and only two versions are affected :
>- intel icc 14.0.0.080 (aka 2013sp1)
>- intel icc 14.0.1.106 (aka 2013sp1u1)
>/* note the bug only

Re: [OMPI devel] Master hangs in opal_fifo test

2015-02-06 Thread George Bosilca
On Fri, Feb 6, 2015 at 8:54 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> George,
>
> Can you point me to an other project that uses 128 bits atomics ?
>

http://icl.cs.utk.edu/parsec/. It uses lock-free structures heavily, and
128-bit atomics are the safest and fastest way to implement them.


> In my tests, i noticed that the volatile keyword is (one of) the trigger
> of the compiler bug.
>

I usually use it for the location to be atomically changed.


> At this stage, i could not see anything wrong in ompi, plus this is
> working fine with recent gcc and icc, so i concluded this is an icc bug,
> that is now fixed, so all ompi can do is hide the symptom.
>

These issues are pretty tricky to trigger: we need special race conditions
while manipulating pointers. There are tens of papers about how to
correctly implement FIFOs with CAS2, and even after peer review some of
them turned out to be incorrect. What I am saying is that we are quick to
blame these failures on the icc compiler, while we have no formal proof
that the FIFO algorithm in Open MPI is correct.

  George.
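
As a rough illustration of "volatile for the location to be atomically
changed", the sketch below applies a 128-bit compare-and-swap to a
volatile counted pointer. The union layout and the names (sketch_cp_t,
sketch_cas_128) are assumptions made for this example and are not taken
from Open MPI or PaRSEC. It relies on the GCC/clang `__int128_t` type and
the `__sync_bool_compare_and_swap` builtin; when the compiler does not
advertise a 16-byte CAS (for example without `-mcx16` on x86-64), it falls
back to a plain compare-and-assign that is NOT atomic and is only there so
the example builds everywhere.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: a 64-bit counter and a 64-bit item pointer, overlaid
 * with a single 128-bit value so both can be swapped in one operation. */
typedef union {
    struct {
        uint64_t counter;
        uint64_t item;
    } data;
    __int128_t value;
} sketch_cp_t;

/* Only the shared location is volatile; the expected and desired values
 * are ordinary locals passed by value. Returns 1 if the swap happened. */
static int sketch_cas_128(volatile sketch_cp_t *location,
                          sketch_cp_t expected, sketch_cp_t desired)
{
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16)
    return __sync_bool_compare_and_swap(&location->value,
                                        expected.value, desired.value);
#else
    /* Fallback for illustration only: this is not atomic. */
    if (location->value != expected.value) {
        return 0;
    }
    location->value = desired.value;
    return 1;
#endif
}
```

A caller would read the current head into `expected`, build `desired` with
the counter incremented, and retry until `sketch_cas_128` returns 1.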



>
> Cheers,
>
> Gilles
>
>
> George Bosilca  wrote:
> My feeling is that the current patch hide the symptoms without addressing
> the real issue.
>
> As a side note: The compiler incriminated in this thread, works perfectly
> for 128 bits atomic operations in other projects where I use atomic LIFO &
> FIFO (but not the one from OMPI as I already raised my concerns about this).
>
>   George.
>
> PS: Why are there totally non-related comments about FIFO in the
> opal_lifo.h (starting line 61)?
>
> On Wed, Feb 4, 2015 at 11:30 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>>  Paul and all,
>>
>> i just pushed
>> https://github.com/open-mpi/ompi/commit/b42e3441294e9fe787fe8e9ad7403d5b8e465163
>>
>> when a buggy compiler is detected, configure now forces
>> OPAL_HAVE_CMPXCHG16B=0
>> this is enough to make opal_lifo test and make check happy again.
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 2015/02/04 17:26, Gilles Gouaillardet wrote:
>>
>> Paul,
>>
>> my previous email was misleading.
>>
>> what i really meant is the opal_fifo test works fine with icc 2013u5
>> (the release before 2013sp1) and
>> icc 2013sp1u2 and later
>>
>> so even if the reproducer fails with icc older that 2013sp1u2, that
>> might not impact ompi
>> since for other reasons, the bug is not hit
>>
>> for example, with icc 2013u5, OPAL_HAVE_CMPXCHG16B=0 so ompi stays away
>> from the compiler bug.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2015/02/04 17:15, Paul Hargrove wrote:
>>
>>  Giles,
>>
>> Who says only 2 version are effected?
>>
>> I have access to 9 revisions of icc.
>> Using your reduced case I find 7 that fail and only 2 (the latest two) that
>> pass.
>> Discounting icc-12 (which can't compile the test) that makes 6 versions
>> effected by the bug (not 2).
>>
>> -Paul
>>
>> $ for x in 12.1.5.339 13.0.0.079 13.0.1.117 13.1.2.183 13.1.3.192
>> 14.0.0.080 14.0.1.106 14.0.2.144 15.0.1.133; do module swap intel intel/$x
>> ; echo @ Testing Intel compiler version $x; icc conftest.c && ./a.out &&
>> echo PASS ; done
>> @ Testing Intel compiler version 12.1.5.339
>> conftest.c(10): error: identifier "__int128_t" is undefined
>>   __int128_t value;
>>   ^
>>
>> compilation aborted for conftest.c (code 2)
>> @ Testing Intel compiler version 13.0.0.079
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 13.0.1.117
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 13.1.2.183
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 13.1.3.192
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 14.0.0.080
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 14.0.1.106
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 14.0.2.144
>> PASS
>> @ Testing Intel compiler version 15.0.1.133
>> PASS
>>
>> On Tue, Feb 3, 2015 at 11:45 PM, Gilles Gouaillardet 
>>  wrote:
>>
>>
>>   Nathan,
>>
>> imho, this is a compiler bug and only two versions are affected :
>> - intel icc 14.0.0.080 (aka 2013sp1)
>> - intel icc 14.0.1.106 (aka 2013sp1u1)
>> /* note the bug only occurs with -O1 and higher optimization levels */
>>
>> here is attached a simple reproducer
>>
>> a simple workaround is to configure with ac_cv_type___int128=0
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2015/02/04 4:17, Nathan Hjelm wrote:
>>
>> Thats the second report involving icc 14. I will dig into this later
>> this week.
>>
>> -Nathan
>>
>> On Mon, Feb 02, 2015 at 11:03:41PM -0800, Paul Hargrove wrote:
>>
>> I have seen opal_fifo hang on 2 

Re: [OMPI devel] Master hangs in opal_fifo test

2015-02-06 Thread Gilles Gouaillardet
George,

Can you point me to another project that uses 128-bit atomics?

In my tests, I noticed that the volatile keyword is (one of) the triggers
of the compiler bug.
At this stage, I could not see anything wrong in ompi, and this works
fine with recent gcc and icc, so I concluded that this is an icc bug that
is now fixed, and all ompi can do is hide the symptom.

Cheers,

Gilles


George Bosilca  wrote:
>My feeling is that the current patch hide the symptoms without addressing the 
>real issue.
>
>
>As a side note: The compiler incriminated in this thread, works perfectly for 
>128 bits atomic operations in other projects where I use atomic LIFO & FIFO 
>(but not the one from OMPI as I already raised my concerns about this).
>
>
>  George.
>
>
>PS: Why are there totally non-related comments about FIFO in the opal_lifo.h 
>(starting line 61)?
>
>
>On Wed, Feb 4, 2015 at 11:30 PM, Gilles Gouaillardet 
> wrote:
>
>Paul and all,
>
>i just pushed 
>https://github.com/open-mpi/ompi/commit/b42e3441294e9fe787fe8e9ad7403d5b8e465163
>
>when a buggy compiler is detected, configure now forces OPAL_HAVE_CMPXCHG16B=0
>this is enough to make opal_lifo test and make check happy again.
>
>Cheers,
>
>Gilles
>
>
>
>On 2015/02/04 17:26, Gilles Gouaillardet wrote:
>
>Paul,
>
>my previous email was misleading.
>
>what i really meant is the opal_fifo test works fine with icc 2013u5
>(the release before 2013sp1) and
>icc 2013sp1u2 and later
>
>so even if the reproducer fails with icc older that 2013sp1u2, that
>might not impact ompi
>since for other reasons, the bug is not hit
>
>for example, with icc 2013u5, OPAL_HAVE_CMPXCHG16B=0 so ompi stays away
>from the compiler bug.
>
>Cheers,
>
>Gilles
>
>On 2015/02/04 17:15, Paul Hargrove wrote:
>
>Giles,
>
>Who says only 2 version are effected?
>
>I have access to 9 revisions of icc.
>Using your reduced case I find 7 that fail and only 2 (the latest two) that
>pass.
>Discounting icc-12 (which can't compile the test) that makes 6 versions
>effected by the bug (not 2).
>
>-Paul
>
>$ for x in 12.1.5.339 13.0.0.079 13.0.1.117 13.1.2.183 13.1.3.192
>14.0.0.080 14.0.1.106 14.0.2.144 15.0.1.133; do module swap intel intel/$x
>; echo @ Testing Intel compiler version $x; icc conftest.c && ./a.out &&
>echo PASS ; done
>@ Testing Intel compiler version 12.1.5.339
>conftest.c(10): error: identifier "__int128_t" is undefined
>  __int128_t value;
>  ^
>
>compilation aborted for conftest.c (code 2)
>@ Testing Intel compiler version 13.0.0.079
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 13.0.1.117
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 13.1.2.183
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 13.1.3.192
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 14.0.0.080
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 14.0.1.106
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 14.0.2.144
>PASS
>@ Testing Intel compiler version 15.0.1.133
>PASS
>
>On Tue, Feb 3, 2015 at 11:45 PM, Gilles Gouaillardet <
>gilles.gouaillar...@iferc.org> wrote:
>
>Nathan,
>
>imho, this is a compiler bug and only two versions are affected :
>- intel icc 14.0.0.080 (aka 2013sp1)
>- intel icc 14.0.1.106 (aka 2013sp1u1)
>/* note the bug only occurs with -O1 and higher optimization levels */
>
>here is attached a simple reproducer
>
>a simple workaround is to configure with ac_cv_type___int128=0
>
>Cheers,
>
>Gilles
>
>On 2015/02/04 4:17, Nathan Hjelm wrote:
>
>Thats the second report involving icc 14. I will dig into this later
>this week.
>
>-Nathan
>
>On Mon, Feb 02, 2015 at 11:03:41PM -0800, Paul Hargrove wrote:
>
>I have seen opal_fifo hang on 2 distinct systems
>+ Linux/ppc32 with xlc-11.1
>+ Linux/x86-64 with icc-14.0.1.106
>I have no explanation to offer for either hang.
>No "weird" configure options were passed to either.
>-Paul
>--
>Paul H. Hargrove  phhargr...@lbl.gov
>Computer Languages & Systems Software (CLaSS) Group
>Computer Science Department   Tel: +1-510-495-2352
>Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
>___
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post:
>http://www.open-mpi.org/community/lists/devel/2015/02/16911.php
>
>___
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post:
>http://www.open-mpi.org/community/lists/devel/2015/02/16920.php
>
>___
>devel mailing list
>de...@open-mpi.org
>Subscription:

Re: [OMPI devel] Master hangs in opal_fifo test

2015-02-06 Thread George Bosilca
My feeling is that the current patch hides the symptoms without addressing
the real issue.

As a side note: the compiler incriminated in this thread works perfectly
for 128-bit atomic operations in other projects where I use atomic LIFO &
FIFO (but not the ones from OMPI, as I have already raised my concerns
about them).

  George.

PS: Why are there totally unrelated comments about FIFO in
opal_lifo.h (starting at line 61)?

On Wed, Feb 4, 2015 at 11:30 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

>  Paul and all,
>
> i just pushed
> https://github.com/open-mpi/ompi/commit/b42e3441294e9fe787fe8e9ad7403d5b8e465163
>
> when a buggy compiler is detected, configure now forces
> OPAL_HAVE_CMPXCHG16B=0
> this is enough to make opal_lifo test and make check happy again.
>
> Cheers,
>
> Gilles
>
>
> On 2015/02/04 17:26, Gilles Gouaillardet wrote:
>
> Paul,
>
> my previous email was misleading.
>
> what i really meant is the opal_fifo test works fine with icc 2013u5
> (the release before 2013sp1) and
> icc 2013sp1u2 and later
>
> so even if the reproducer fails with icc older that 2013sp1u2, that
> might not impact ompi
> since for other reasons, the bug is not hit
>
> for example, with icc 2013u5, OPAL_HAVE_CMPXCHG16B=0 so ompi stays away
> from the compiler bug.
>
> Cheers,
>
> Gilles
>
> On 2015/02/04 17:15, Paul Hargrove wrote:
>
>  Giles,
>
> Who says only 2 version are effected?
>
> I have access to 9 revisions of icc.
> Using your reduced case I find 7 that fail and only 2 (the latest two) that
> pass.
> Discounting icc-12 (which can't compile the test) that makes 6 versions
> effected by the bug (not 2).
>
> -Paul
>
> $ for x in 12.1.5.339 13.0.0.079 13.0.1.117 13.1.2.183 13.1.3.192
> 14.0.0.080 14.0.1.106 14.0.2.144 15.0.1.133; do module swap intel intel/$x
> ; echo @ Testing Intel compiler version $x; icc conftest.c && ./a.out &&
> echo PASS ; done
> @ Testing Intel compiler version 12.1.5.339
> conftest.c(10): error: identifier "__int128_t" is undefined
>   __int128_t value;
>   ^
>
> compilation aborted for conftest.c (code 2)
> @ Testing Intel compiler version 13.0.0.079
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 13.0.1.117
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 13.1.2.183
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 13.1.3.192
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 14.0.0.080
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 14.0.1.106
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 14.0.2.144
> PASS
> @ Testing Intel compiler version 15.0.1.133
> PASS
>
> On Tue, Feb 3, 2015 at 11:45 PM, Gilles Gouaillardet 
>  wrote:
>
>
>   Nathan,
>
> imho, this is a compiler bug and only two versions are affected :
> - intel icc 14.0.0.080 (aka 2013sp1)
> - intel icc 14.0.1.106 (aka 2013sp1u1)
> /* note the bug only occurs with -O1 and higher optimization levels */
>
> here is attached a simple reproducer
>
> a simple workaround is to configure with ac_cv_type___int128=0
>
> Cheers,
>
> Gilles
>
> On 2015/02/04 4:17, Nathan Hjelm wrote:
>
> Thats the second report involving icc 14. I will dig into this later
> this week.
>
> -Nathan
>
> On Mon, Feb 02, 2015 at 11:03:41PM -0800, Paul Hargrove wrote:
>
> I have seen opal_fifo hang on 2 distinct systems
> + Linux/ppc32 with xlc-11.1
> + Linux/x86-64 with icc-14.0.1.106
>I have no explanation to offer for either hang.
>No "weird" configure options were passed to either.
>-Paul
>--
>Paul H. Hargrove  phhargr...@lbl.gov
>Computer Languages & Systems Software (CLaSS) Group
>Computer Science Department   Tel: +1-510-495-2352
>Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>

Re: [OMPI devel] Master hangs in opal_fifo test

2015-02-04 Thread Gilles Gouaillardet
Paul and all,

I just pushed
https://github.com/open-mpi/ompi/commit/b42e3441294e9fe787fe8e9ad7403d5b8e465163

When a buggy compiler is detected, configure now forces
OPAL_HAVE_CMPXCHG16B=0.
This is enough to make the opal_lifo test and make check happy again.

Cheers,

Gilles
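
For readers unfamiliar with this kind of switch, the following sketch
shows how a configure-time macro such as OPAL_HAVE_CMPXCHG16B can steer
the code away from the 128-bit path at compile time. Only the macro name
comes from this thread; the default value, the helper functions and the
description strings are hypothetical and do not reflect how Open MPI
actually structures its headers.

```c
#include <assert.h>

/* Assumption for this sketch: when configure has not defined the macro,
 * behave as if a buggy compiler was detected and the value was forced
 * to 0. */
#ifndef OPAL_HAVE_CMPXCHG16B
#define OPAL_HAVE_CMPXCHG16B 0
#endif

#if OPAL_HAVE_CMPXCHG16B
/* 128-bit path: counted pointers updated with cmpxchg16b. */
#define SKETCH_LIFO_IMPL "128-bit counted pointer"
#else
/* Fallback path: no 128-bit compare-and-swap is emitted at all, so the
 * miscompiled comparison is never reached. */
#define SKETCH_LIFO_IMPL "fallback without 128-bit CAS"
#endif

/* Returns 1 when the 128-bit path was selected at compile time. */
static int sketch_uses_cmpxchg16b(void)
{
    return OPAL_HAVE_CMPXCHG16B != 0;
}

/* Returns a human-readable name of the selected implementation. */
static const char *sketch_lifo_impl(void)
{
    return SKETCH_LIFO_IMPL;
}
```

The point of the design is that the decision is taken once, at configure
time, so a compiler known to mishandle the 128-bit comparison never sees
that code.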

On 2015/02/04 17:26, Gilles Gouaillardet wrote:
> Paul,
>
> my previous email was misleading.
>
> what i really meant is the opal_fifo test works fine with icc 2013u5
> (the release before 2013sp1) and
> icc 2013sp1u2 and later
>
> so even if the reproducer fails with icc older that 2013sp1u2, that
> might not impact ompi
> since for other reasons, the bug is not hit
>
> for example, with icc 2013u5, OPAL_HAVE_CMPXCHG16B=0 so ompi stays away
> from the compiler bug.
>
> Cheers,
>
> Gilles
>
> On 2015/02/04 17:15, Paul Hargrove wrote:
>> Giles,
>>
>> Who says only 2 version are effected?
>>
>> I have access to 9 revisions of icc.
>> Using your reduced case I find 7 that fail and only 2 (the latest two) that
>> pass.
>> Discounting icc-12 (which can't compile the test) that makes 6 versions
>> effected by the bug (not 2).
>>
>> -Paul
>>
>> $ for x in 12.1.5.339 13.0.0.079 13.0.1.117 13.1.2.183 13.1.3.192
>> 14.0.0.080 14.0.1.106 14.0.2.144 15.0.1.133; do module swap intel intel/$x
>> ; echo @ Testing Intel compiler version $x; icc conftest.c && ./a.out &&
>> echo PASS ; done
>> @ Testing Intel compiler version 12.1.5.339
>> conftest.c(10): error: identifier "__int128_t" is undefined
>>   __int128_t value;
>>   ^
>>
>> compilation aborted for conftest.c (code 2)
>> @ Testing Intel compiler version 13.0.0.079
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 13.0.1.117
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 13.1.2.183
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 13.1.3.192
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 14.0.0.080
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 14.0.1.106
>> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>> Aborted
>> @ Testing Intel compiler version 14.0.2.144
>> PASS
>> @ Testing Intel compiler version 15.0.1.133
>> PASS
>>
>> On Tue, Feb 3, 2015 at 11:45 PM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>>
>>>  Nathan,
>>>
>>> imho, this is a compiler bug and only two versions are affected :
>>> - intel icc 14.0.0.080 (aka 2013sp1)
>>> - intel icc 14.0.1.106 (aka 2013sp1u1)
>>> /* note the bug only occurs with -O1 and higher optimization levels */
>>>
>>> here is attached a simple reproducer
>>>
>>> a simple workaround is to configure with ac_cv_type___int128=0
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2015/02/04 4:17, Nathan Hjelm wrote:
>>>
>>> Thats the second report involving icc 14. I will dig into this later
>>> this week.
>>>
>>> -Nathan
>>>
>>> On Mon, Feb 02, 2015 at 11:03:41PM -0800, Paul Hargrove wrote:
>>>
>>> I have seen opal_fifo hang on 2 distinct systems
>>> + Linux/ppc32 with xlc-11.1
>>> + Linux/x86-64 with icc-14.0.1.106
>>>I have no explanation to offer for either hang.
>>>No "weird" configure options were passed to either.
>>>-Paul
>>>--
>>>Paul H. Hargrove  phhargr...@lbl.gov
>>>Computer Languages & Systems Software (CLaSS) Group
>>>Computer Science Department   Tel: +1-510-495-2352
>>>Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>

Re: [OMPI devel] Master hangs in opal_fifo test

2015-02-04 Thread Gilles Gouaillardet
Paul,

My previous email was misleading.

What I really meant is that the opal_fifo test works fine with icc 2013u5
(the release before 2013sp1) and with icc 2013sp1u2 and later.

So even if the reproducer fails with icc older than 2013sp1u2, that
might not impact ompi, since for other reasons the bug is not hit.

For example, with icc 2013u5, OPAL_HAVE_CMPXCHG16B=0, so ompi stays away
from the compiler bug.

Cheers,

Gilles

On 2015/02/04 17:15, Paul Hargrove wrote:
> Giles,
>
> Who says only 2 version are effected?
>
> I have access to 9 revisions of icc.
> Using your reduced case I find 7 that fail and only 2 (the latest two) that
> pass.
> Discounting icc-12 (which can't compile the test) that makes 6 versions
> effected by the bug (not 2).
>
> -Paul
>
> $ for x in 12.1.5.339 13.0.0.079 13.0.1.117 13.1.2.183 13.1.3.192
> 14.0.0.080 14.0.1.106 14.0.2.144 15.0.1.133; do module swap intel intel/$x
> ; echo @ Testing Intel compiler version $x; icc conftest.c && ./a.out &&
> echo PASS ; done
> @ Testing Intel compiler version 12.1.5.339
> conftest.c(10): error: identifier "__int128_t" is undefined
>   __int128_t value;
>   ^
>
> compilation aborted for conftest.c (code 2)
> @ Testing Intel compiler version 13.0.0.079
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 13.0.1.117
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 13.1.2.183
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 13.1.3.192
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 14.0.0.080
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 14.0.1.106
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 14.0.2.144
> PASS
> @ Testing Intel compiler version 15.0.1.133
> PASS
>
> On Tue, Feb 3, 2015 at 11:45 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>>  Nathan,
>>
>> imho, this is a compiler bug and only two versions are affected :
>> - intel icc 14.0.0.080 (aka 2013sp1)
>> - intel icc 14.0.1.106 (aka 2013sp1u1)
>> /* note the bug only occurs with -O1 and higher optimization levels */
>>
>> here is attached a simple reproducer
>>
>> a simple workaround is to configure with ac_cv_type___int128=0
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2015/02/04 4:17, Nathan Hjelm wrote:
>>
>> That's the second report involving icc 14. I will dig into this later
>> this week.
>>
>> -Nathan
>>
>> On Mon, Feb 02, 2015 at 11:03:41PM -0800, Paul Hargrove wrote:
>>
>> I have seen opal_fifo hang on 2 distinct systems
>> + Linux/ppc32 with xlc-11.1
>> + Linux/x86-64 with icc-14.0.1.106
>>I have no explanation to offer for either hang.
>>No "weird" configure options were passed to either.
>>-Paul
>>--
>>Paul H. Hargrove  phhargr...@lbl.gov
>>Computer Languages & Systems Software (CLaSS) Group
>>Computer Science Department   Tel: +1-510-495-2352
>>Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>
>>   ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2015/02/16911.php
>
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/02/16922.php



Re: [OMPI devel] Master hangs in opal_fifo test

2015-02-04 Thread Paul Hargrove
Gilles,

Who says only 2 versions are affected?

I have access to 9 revisions of icc.
Using your reduced case I find 7 that fail and only 2 (the latest two) that
pass.
Discounting icc-12 (which can't compile the test) that makes 6 versions
affected by the bug (not 2).

-Paul

$ for x in 12.1.5.339 13.0.0.079 13.0.1.117 13.1.2.183 13.1.3.192
14.0.0.080 14.0.1.106 14.0.2.144 15.0.1.133; do module swap intel intel/$x
; echo @ Testing Intel compiler version $x; icc conftest.c && ./a.out &&
echo PASS ; done
@ Testing Intel compiler version 12.1.5.339
conftest.c(10): error: identifier "__int128_t" is undefined
  __int128_t value;
  ^

compilation aborted for conftest.c (code 2)
@ Testing Intel compiler version 13.0.0.079
a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
Aborted
@ Testing Intel compiler version 13.0.1.117
a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
Aborted
@ Testing Intel compiler version 13.1.2.183
a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
Aborted
@ Testing Intel compiler version 13.1.3.192
a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
Aborted
@ Testing Intel compiler version 14.0.0.080
a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
Aborted
@ Testing Intel compiler version 14.0.1.106
a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
Aborted
@ Testing Intel compiler version 14.0.2.144
PASS
@ Testing Intel compiler version 15.0.1.133
PASS

On Tue, Feb 3, 2015 at 11:45 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

>  Nathan,
>
> imho, this is a compiler bug and only two versions are affected :
> - intel icc 14.0.0.080 (aka 2013sp1)
> - intel icc 14.0.1.106 (aka 2013sp1u1)
> /* note the bug only occurs with -O1 and higher optimization levels */
>
> here is attached a simple reproducer
>
> a simple workaround is to configure with ac_cv_type___int128=0
>
> Cheers,
>
> Gilles
>
> On 2015/02/04 4:17, Nathan Hjelm wrote:
>
> That's the second report involving icc 14. I will dig into this later
> this week.
>
> -Nathan
>
> On Mon, Feb 02, 2015 at 11:03:41PM -0800, Paul Hargrove wrote:
>
> I have seen opal_fifo hang on 2 distinct systems
> + Linux/ppc32 with xlc-11.1
> + Linux/x86-64 with icc-14.0.1.106
>I have no explanation to offer for either hang.
>No "weird" configure options were passed to either.
>-Paul
>--
>Paul H. Hargrove  phhargr...@lbl.gov
>Computer Languages & Systems Software (CLaSS) Group
>Computer Science Department   Tel: +1-510-495-2352
>Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
>   ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/02/16911.php



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] Master hangs in opal_fifo test

2015-02-04 Thread Gilles Gouaillardet
Nathan,

imho, this is a compiler bug and only two versions are affected :
- intel icc 14.0.0.080 (aka 2013sp1)
- intel icc 14.0.1.106 (aka 2013sp1u1)
/* note the bug only occurs with -O1 and higher optimization levels */

a simple reproducer is attached

a simple workaround is to configure with ac_cv_type___int128=0

Cheers,

Gilles

On 2015/02/04 4:17, Nathan Hjelm wrote:
> That's the second report involving icc 14. I will dig into this later
> this week.
>
> -Nathan
>
> On Mon, Feb 02, 2015 at 11:03:41PM -0800, Paul Hargrove wrote:
>>I have seen opal_fifo hang on 2 distinct systems
>> + Linux/ppc32 with xlc-11.1
>> + Linux/x86-64 with icc-14.0.1.106
>>I have no explanation to offer for either hang.
>>No "weird" configure options were passed to either.
>>-Paul
>>--
>>Paul H. Hargrove  phhargr...@lbl.gov
>>Computer Languages & Systems Software (CLaSS) Group
>>Computer Science Department   Tel: +1-510-495-2352
>>Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2015/02/16911.php

/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
#include <stdint.h>
#include <assert.h>

/* 16-byte counted pointer, viewable as two 64-bit fields or one 128-bit value */
union opal_counted_pointer_t {
    struct {
        uint64_t counter;
        uint64_t item;
    } data;
    __int128_t value;
};
typedef union opal_counted_pointer_t opal_counted_pointer_t;

int main (int argc, char *argv[]) {
    volatile opal_counted_pointer_t a;
    opal_counted_pointer_t b;

    a.data.counter = 0;
    a.data.item = 0x1234567890ABCDEF;

    /* copy a into b one 64-bit field at a time */
    b.data.counter = a.data.counter;
    b.data.item = a.data.item;

    /* bozo checks */
    assert(16 == sizeof(opal_counted_pointer_t));
    assert(a.data.counter == b.data.counter);
    assert(a.data.item == b.data.item);
    /*
     * following assert fails on buggy compilers
     * so far, with icc -o conftest conftest.c
     *  - intel icc 14.0.0.080 (aka 2013sp1)
     *  - intel icc 14.0.1.106 (aka 2013sp1u1)
     * older and more recent compilers work fine
     * buggy compilers also work fine, but only with -O0
     */
    assert(a.value == b.value);
    return 0;
}
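
For comparison, here is a minimal sketch of an equality check that never performs the 128-bit comparison the reproducer trips over. It copies the volatile value one 64-bit field at a time and then compares the raw bytes. The names counted_pointer, counted_pointer_snapshot and counted_pointer_equal are hypothetical and are not part of Open MPI. It assumes a compiler that provides __int128_t and a union layout with no padding. Whether this avoids the miscompilation on the affected icc versions has not been tested here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for opal_counted_pointer_t from the reproducer. */
union counted_pointer {
    struct {
        uint64_t counter;
        uint64_t item;
    } data;
    __int128_t value;
};

/* Copy a volatile counted pointer one 64-bit field at a time. */
static union counted_pointer
counted_pointer_snapshot(const volatile union counted_pointer *src)
{
    union counted_pointer copy;

    copy.data.counter = src->data.counter;
    copy.data.item = src->data.item;
    return copy;
}

/* Return 1 when both values hold the same 16 bytes, 0 otherwise. */
static int
counted_pointer_equal(const volatile union counted_pointer *a,
                      const union counted_pointer *b)
{
    union counted_pointer snap = counted_pointer_snapshot(a);

    /* The byte comparison is only meaningful if the union has no padding. */
    assert(16 == sizeof(snap));
    return 0 == memcmp(&snap, b, sizeof(snap));
}
```

In the reproducer, counted_pointer_equal(&a, &b) could stand in for the final assert on a.value == b.value. If the field-wise check passes while the 128-bit comparison fails, that would point at the 128-bit comparison rather than at the stored data.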


[OMPI devel] Master hangs in opal_fifo test

2015-02-03 Thread Paul Hargrove
I have seen opal_fifo hang on 2 distinct systems
 + Linux/ppc32 with xlc-11.1
 + Linux/x86-64 with icc-14.0.1.106

I have no explanation to offer for either hang.
No "weird" configure options were passed to either.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900