Re: [OMPI devel] Master hangs in opal_fifo test
On Fri, Feb 6, 2015 at 9:12 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> George,
>
> I cannot access parsec: HTTP error 403 :-(

Our webserver was down over the weekend. Please try again.

  George.

> I understand your point of view.
> Back to the opal_lifo test: if I remember correctly, it hangs in the
> non-multi-threaded part. The very first pop loops forever because the CAS
> always fails when comparing values that are in fact equal.
> Though there is a possibility that the problem comes from ompi, and that we
> are just lucky it works with recent icc, I would not go "all in" on this ...
>
> And as you pointed out, even if the problem does come from the compiler,
> that does not mean the ompi algorithms are necessarily correct.
>
> Cheers,
>
> Gilles
Re: [OMPI devel] Master hangs in opal_fifo test
George,

I cannot access parsec: HTTP error 403 :-(

I understand your point of view.
Back to the opal_lifo test: if I remember correctly, it hangs in the
non-multi-threaded part. The very first pop loops forever because the CAS
always fails when comparing values that are in fact equal.
Though there is a possibility that the problem comes from ompi, and that we
are just lucky it works with recent icc, I would not go "all in" on this ...

And as you pointed out, even if the problem does come from the compiler,
that does not mean the ompi algorithms are necessarily correct.

Cheers,

Gilles

George Bosilca wrote:
> On Fri, Feb 6, 2015 at 8:54 AM, Gilles Gouaillardet wrote:
>
>> George,
>>
>> Can you point me to another project that uses 128-bit atomics?
>
> http://icl.cs.utk.edu/parsec/. It heavily uses lock-free structures, and
> the 128-bit atomics are the safest and fastest way to implement them.
>
>> In my tests, I noticed that the volatile keyword is (one of) the triggers
>> of the compiler bug.
>
> I usually use it for the location to be atomically changed.
>
>> At this stage, I could not see anything wrong in ompi, and this works
>> fine with recent gcc and icc, so I concluded this is an icc bug that is
>> now fixed, and all ompi can do is hide the symptom.
>
> These issues are pretty tricky to trigger; we need special race conditions
> while manipulating pointers. There are tens of papers about how to
> correctly implement FIFOs with CAS2, and even after peer review some of
> them turned out to be incorrect. What I am saying is that we are quick to
> blame these failures on the icc compiler, while we have no formal proof
> that the FIFO algorithm in Open MPI is correct.
>
>   George.
>
>> Cheers,
>>
>> Gilles
Re: [OMPI devel] Master hangs in opal_fifo test
On Fri, Feb 6, 2015 at 8:54 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> George,
>
> Can you point me to another project that uses 128-bit atomics?

http://icl.cs.utk.edu/parsec/. It heavily uses lock-free structures, and
the 128-bit atomics are the safest and fastest way to implement them.

> In my tests, I noticed that the volatile keyword is (one of) the triggers
> of the compiler bug.

I usually use it for the location to be atomically changed.

> At this stage, I could not see anything wrong in ompi, and this works
> fine with recent gcc and icc, so I concluded this is an icc bug that is
> now fixed, and all ompi can do is hide the symptom.

These issues are pretty tricky to trigger; we need special race conditions
while manipulating pointers. There are tens of papers about how to
correctly implement FIFOs with CAS2, and even after peer review some of
them turned out to be incorrect. What I am saying is that we are quick to
blame these failures on the icc compiler, while we have no formal proof
that the FIFO algorithm in Open MPI is correct.

  George.

> Cheers,
>
> Gilles
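A common reason lock-free LIFOs want a double-width CAS is that the head pointer can then be updated together with a modification counter, so a stale head is detected instead of being silently accepted (the ABA problem). The sketch below illustrates that idea only. It is not the opal_lifo code and not the PaRSEC code: to stay portable it swaps the 128-bit CAS on a pointer plus counter for a 64-bit C11 CAS on a 32-bit item index plus a 32-bit counter. All names (tagged_lifo_t, POOL_SIZE, NIL_INDEX and the functions) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define POOL_SIZE 8u               /* hypothetical number of items */
#define NIL_INDEX 0xFFFFFFFFu      /* hypothetical "empty" marker */

/* head packs a 32-bit update counter (high half) and the index of the
 * top item (low half), so one 64-bit CAS covers both. */
typedef struct {
    _Atomic uint64_t head;
    _Atomic uint32_t next[POOL_SIZE];   /* next[i]: item below item i */
} tagged_lifo_t;

static uint64_t pack_head(uint32_t counter, uint32_t index)
{
    return ((uint64_t)counter << 32) | index;
}

void tagged_lifo_init(tagged_lifo_t *lifo)
{
    atomic_init(&lifo->head, pack_head(0, NIL_INDEX));
}

void tagged_lifo_push(tagged_lifo_t *lifo, uint32_t item)
{
    assert(item < POOL_SIZE);
    uint64_t old_head = atomic_load(&lifo->head);
    uint64_t new_head;
    do {
        /* Link the new item above the current top. */
        atomic_store(&lifo->next[item], (uint32_t)old_head);
        /* Bump the counter on every update so a recycled index differs. */
        new_head = pack_head((uint32_t)(old_head >> 32) + 1u, item);
    } while (!atomic_compare_exchange_weak(&lifo->head, &old_head, new_head));
}

/* Returns the popped index, or NIL_INDEX when the LIFO is empty. */
uint32_t tagged_lifo_pop(tagged_lifo_t *lifo)
{
    uint64_t old_head = atomic_load(&lifo->head);
    uint64_t new_head;
    uint32_t item;
    do {
        item = (uint32_t)old_head;
        if (item == NIL_INDEX) {
            return NIL_INDEX;
        }
        new_head = pack_head((uint32_t)(old_head >> 32) + 1u,
                             atomic_load(&lifo->next[item]));
    } while (!atomic_compare_exchange_weak(&lifo->head, &old_head, new_head));
    return item;
}
```

Note that the pop loop only terminates if the CAS can succeed when the head is unchanged. A CAS that reports failure on equal values, as described earlier in the thread, would make such a loop spin forever even in single-threaded use.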
Re: [OMPI devel] Master hangs in opal_fifo test
George,

Can you point me to another project that uses 128-bit atomics?

In my tests, I noticed that the volatile keyword is (one of) the triggers
of the compiler bug.

At this stage, I could not see anything wrong in ompi, and this works
fine with recent gcc and icc, so I concluded this is an icc bug that is
now fixed, and all ompi can do is hide the symptom.

Cheers,

Gilles

George Bosilca wrote:
> My feeling is that the current patch hides the symptoms without addressing
> the real issue.
>
> As a side note: the compiler incriminated in this thread works perfectly
> for 128-bit atomic operations in other projects where I use atomic LIFO &
> FIFO (but not the ones from OMPI, as I have already raised my concerns
> about those).
>
>   George.
>
> PS: Why are there totally unrelated comments about FIFO in opal_lifo.h
> (starting at line 61)?
Re: [OMPI devel] Master hangs in opal_fifo test
My feeling is that the current patch hides the symptoms without addressing
the real issue.

As a side note: the compiler incriminated in this thread works perfectly
for 128-bit atomic operations in other projects where I use atomic LIFO &
FIFO (but not the ones from OMPI, as I have already raised my concerns
about those).

  George.

PS: Why are there totally unrelated comments about FIFO in opal_lifo.h
(starting at line 61)?

On Wed, Feb 4, 2015 at 11:30 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Paul and all,
>
> I just pushed
> https://github.com/open-mpi/ompi/commit/b42e3441294e9fe787fe8e9ad7403d5b8e465163
>
> When a buggy compiler is detected, configure now forces
> OPAL_HAVE_CMPXCHG16B=0.
> This is enough to make the opal_lifo test and make check happy again.
>
> Cheers,
>
> Gilles
Re: [OMPI devel] Master hangs in opal_fifo test
Paul and all,

I just pushed
https://github.com/open-mpi/ompi/commit/b42e3441294e9fe787fe8e9ad7403d5b8e465163

When a buggy compiler is detected, configure now forces
OPAL_HAVE_CMPXCHG16B=0.
This is enough to make the opal_lifo test and make check happy again.

Cheers,

Gilles

On 2015/02/04 17:26, Gilles Gouaillardet wrote:
> Paul,
>
> My previous email was misleading.
>
> What I really meant is that the opal_fifo test works fine with icc 2013u5
> (the release before 2013sp1) and with icc 2013sp1u2 and later.
>
> So even if the reproducer fails with icc older than 2013sp1u2, that
> might not impact ompi, since for other reasons the bug is not hit.
>
> For example, with icc 2013u5, OPAL_HAVE_CMPXCHG16B=0, so ompi stays away
> from the compiler bug.
>
> Cheers,
>
> Gilles
Re: [OMPI devel] Master hangs in opal_fifo test
Paul,

My previous email was misleading. What I really meant is that the opal_fifo test works fine with icc 2013u5 (the release before 2013sp1) and with icc 2013sp1u2 and later.

So even if the reproducer fails with icc older than 2013sp1u2, that might not impact ompi, because for other reasons the bug is not hit. For example, with icc 2013u5, OPAL_HAVE_CMPXCHG16B=0, so ompi stays away from the compiler bug.

Cheers,

Gilles

On 2015/02/04 17:15, Paul Hargrove wrote:
> Gilles,
>
> Who says only 2 versions are affected?
>
> I have access to 9 revisions of icc.
> Using your reduced case I find 7 that fail and only 2 (the latest two) that
> pass.
> Discounting icc-12 (which can't compile the test) that makes 6 versions
> affected by the bug (not 2).
>
> -Paul
>
> $ for x in 12.1.5.339 13.0.0.079 13.0.1.117 13.1.2.183 13.1.3.192
> 14.0.0.080 14.0.1.106 14.0.2.144 15.0.1.133; do module swap intel intel/$x
> ; echo @ Testing Intel compiler version $x; icc conftest.c && ./a.out &&
> echo PASS ; done
> @ Testing Intel compiler version 12.1.5.339
> conftest.c(10): error: identifier "__int128_t" is undefined
>       __int128_t value;
>       ^
>
> compilation aborted for conftest.c (code 2)
> @ Testing Intel compiler version 13.0.0.079
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 13.0.1.117
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 13.1.2.183
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 13.1.3.192
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 14.0.0.080
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 14.0.1.106
> a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
> Aborted
> @ Testing Intel compiler version 14.0.2.144
> PASS
> @ Testing Intel compiler version 15.0.1.133
> PASS
>
> On Tue, Feb 3, 2015 at 11:45 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>
>> Nathan,
>>
>> imho, this is a compiler bug and only two versions are affected :
>> - intel icc 14.0.0.080 (aka 2013sp1)
>> - intel icc 14.0.1.106 (aka 2013sp1u1)
>> /* note the bug only occurs with -O1 and higher optimization levels */
>>
>> here is attached a simple reproducer
>>
>> a simple workaround is to configure with ac_cv_type___int128=0
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2015/02/04 4:17, Nathan Hjelm wrote:
>>
>>> That's the second report involving icc 14. I will dig into this later
>>> this week.
>>>
>>> -Nathan
>>>
>>> On Mon, Feb 02, 2015 at 11:03:41PM -0800, Paul Hargrove wrote:
>>>
>>>> I have seen opal_fifo hang on 2 distinct systems
>>>>  + Linux/ppc32 with xlc-11.1
>>>>  + Linux/x86-64 with icc-14.0.1.106
>>>> I have no explanation to offer for either hang.
>>>> No "weird" configure options were passed to either.
>>>> -Paul
>>>> --
>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>> Computer Languages & Systems Software (CLaSS) Group
>>>> Computer Science Department               Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2015/02/16911.php
>>>
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2015/02/16920.php
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/02/16921.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/02/16922.php
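The OPAL_HAVE_CMPXCHG16B=0 case mentioned in this message can be pictured as a compile-time switch. What follows is a minimal sketch under assumptions, not Open MPI's actual code: only the macro name comes from the thread, while sketch_fifo_backend() and the strings it returns are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <string.h>

/* Assumption: the build system defines OPAL_HAVE_CMPXCHG16B to 1 only when a
 * usable 128-bit compare-and-swap was detected. Default to 0 otherwise. */
#ifndef OPAL_HAVE_CMPXCHG16B
#define OPAL_HAVE_CMPXCHG16B 0
#endif

/* Hypothetical helper (not an Open MPI function): reports which code path
 * this translation unit was compiled with. */
static const char *sketch_fifo_backend(void)
{
#if OPAL_HAVE_CMPXCHG16B
    return "cmpxchg16b";   /* 128-bit counted-pointer path */
#else
    return "fallback";     /* path that never touches __int128 */
#endif
}
```

When the macro is 0, the 128-bit path is not compiled at all, which is one way a build could avoid a miscompiled __int128 comparison even with an affected compiler.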
Re: [OMPI devel] Master hangs in opal_fifo test
Gilles,

Who says only 2 versions are affected?

I have access to 9 revisions of icc.
Using your reduced case I find 7 that fail and only 2 (the latest two) that pass.
Discounting icc-12 (which can't compile the test), that makes 6 versions affected by the bug (not 2).

-Paul

$ for x in 12.1.5.339 13.0.0.079 13.0.1.117 13.1.2.183 13.1.3.192 14.0.0.080 14.0.1.106 14.0.2.144 15.0.1.133; do module swap intel intel/$x ; echo @ Testing Intel compiler version $x; icc conftest.c && ./a.out && echo PASS ; done
@ Testing Intel compiler version 12.1.5.339
conftest.c(10): error: identifier "__int128_t" is undefined
      __int128_t value;
      ^

compilation aborted for conftest.c (code 2)
@ Testing Intel compiler version 13.0.0.079
a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
Aborted
@ Testing Intel compiler version 13.0.1.117
a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
Aborted
@ Testing Intel compiler version 13.1.2.183
a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
Aborted
@ Testing Intel compiler version 13.1.3.192
a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
Aborted
@ Testing Intel compiler version 14.0.0.080
a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
Aborted
@ Testing Intel compiler version 14.0.1.106
a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
Aborted
@ Testing Intel compiler version 14.0.2.144
PASS
@ Testing Intel compiler version 15.0.1.133
PASS

On Tue, Feb 3, 2015 at 11:45 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Nathan,
>
> imho, this is a compiler bug and only two versions are affected :
> - intel icc 14.0.0.080 (aka 2013sp1)
> - intel icc 14.0.1.106 (aka 2013sp1u1)
> /* note the bug only occurs with -O1 and higher optimization levels */
>
> here is attached a simple reproducer
>
> a simple workaround is to configure with ac_cv_type___int128=0
>
> Cheers,
>
> Gilles
>
> On 2015/02/04 4:17, Nathan Hjelm wrote:
>
>> That's the second report involving icc 14. I will dig into this later
>> this week.
>>
>> -Nathan
>>
>> On Mon, Feb 02, 2015 at 11:03:41PM -0800, Paul Hargrove wrote:
>>
>>> I have seen opal_fifo hang on 2 distinct systems
>>>  + Linux/ppc32 with xlc-11.1
>>>  + Linux/x86-64 with icc-14.0.1.106
>>> I have no explanation to offer for either hang.
>>> No "weird" configure options were passed to either.
>>> -Paul
>>> --
>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department               Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2015/02/16911.php
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/02/16920.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/02/16921.php

--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
Re: [OMPI devel] Master hangs in opal_fifo test
Nathan,

IMHO, this is a compiler bug and only two versions are affected:
- intel icc 14.0.0.080 (aka 2013sp1)
- intel icc 14.0.1.106 (aka 2013sp1u1)
/* note the bug only occurs with -O1 and higher optimization levels */

A simple reproducer is attached.

A simple workaround is to configure with ac_cv_type___int128=0

Cheers,

Gilles

On 2015/02/04 4:17, Nathan Hjelm wrote:
> That's the second report involving icc 14. I will dig into this later
> this week.
>
> -Nathan
>
> On Mon, Feb 02, 2015 at 11:03:41PM -0800, Paul Hargrove wrote:
>> I have seen opal_fifo hang on 2 distinct systems
>>  + Linux/ppc32 with xlc-11.1
>>  + Linux/x86-64 with icc-14.0.1.106
>> I have no explanation to offer for either hang.
>> No "weird" configure options were passed to either.
>> -Paul
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/02/16911.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/02/16920.php

/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
#include <stdint.h>
#include <assert.h>

union opal_counted_pointer_t {
    struct {
        uint64_t counter;
        uint64_t item;
    } data;
    __int128_t value;
};
typedef union opal_counted_pointer_t opal_counted_pointer_t;

int main (int argc, char *argv[]) {
    volatile opal_counted_pointer_t a;
    opal_counted_pointer_t b;

    a.data.counter = 0;
    a.data.item = 0x1234567890ABCDEF;

    b.data.counter = a.data.counter;
    b.data.item = a.data.item;

    /* bozo checks */
    assert(16 == sizeof(opal_counted_pointer_t));
    assert(a.data.counter == b.data.counter);
    assert(a.data.item == b.data.item);
    /*
     * following assert fails on buggy compilers
     * so far, with icc -o conftest conftest.c
     * - intel icc 14.0.0.080 (aka 2013sp1)
     * - intel icc 14.0.1.106 (aka 2013sp1u1)
     * older and more recent compilers work fine
     * buggy compilers also work fine, but only with -O0
     */
    assert(a.value == b.value);
    return 0;
}
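The reproducer aborts on failure. The same comparison could also be written as a probe that returns a status instead of asserting, so a build or test script can decide what to do with the result. This is a rough sketch under assumptions: sketch_int128_compare_ok() is a hypothetical name, and the __SIZEOF_INT128__ guard is the macro GCC and Clang define when __int128 is available (whether a given icc release defines it is not confirmed here).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical probe modeled on the reproducer.
 * Returns  1 if the 128-bit comparison agrees with the field-wise one,
 *          0 if it disagrees (the symptom reported in this thread),
 *         -1 if the compiler offers no 128-bit integer type. */
static int sketch_int128_compare_ok(void)
{
#ifdef __SIZEOF_INT128__
    union counted_pointer {
        struct {
            uint64_t counter;
            uint64_t item;
        } data;
        __int128_t value;
    };
    volatile union counted_pointer a;
    union counted_pointer b;

    a.data.counter = 0;
    a.data.item = 0x1234567890ABCDEF;
    b.data.counter = a.data.counter;
    b.data.item = a.data.item;

    /* Both halves are equal by construction, so the 128-bit values must
     * compare equal as well. */
    return (a.value == b.value) ? 1 : 0;
#else
    return -1;
#endif
}
```

Since the bug is reported only at -O1 and higher, a probe like this would need to be built with the same optimization flags as the real code to be meaningful.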
[OMPI devel] Master hangs in opal_fifo test
I have seen opal_fifo hang on 2 distinct systems
 + Linux/ppc32 with xlc-11.1
 + Linux/x86-64 with icc-14.0.1.106
I have no explanation to offer for either hang.
No "weird" configure options were passed to either.

-Paul

--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
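Later replies in this thread describe the icc hang as a compare-and-swap that keeps failing on values that are in fact equal. The sketch below shows why such a failure would appear as a hang rather than a crash: lock-free updates retry until the CAS succeeds. It is an illustration under assumptions, not the opal_fifo code. It uses a 64-bit stand-in (32-bit counter plus 32-bit item index) with C11 atomics instead of the 128-bit counted pointer, and sketch_pack() and sketch_swap_head() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-bit stand-in for a counted pointer: the high 32 bits hold
 * a modification counter, the low 32 bits hold an item index. */
static uint64_t sketch_pack(uint32_t counter, uint32_t item)
{
    return ((uint64_t) counter << 32) | item;
}

/* Replace the item stored in *head with new_item and bump the counter.
 * Returns the number of retries on success, or -1 after max_tries failed
 * attempts. A real lock-free loop has no such bound, which is why a CAS
 * that never succeeds shows up as a hang. */
static int sketch_swap_head(_Atomic uint64_t *head, uint32_t new_item,
                            int max_tries)
{
    for (int tries = 0; tries < max_tries; ++tries) {
        uint64_t old = atomic_load(head);
        uint64_t desired = sketch_pack((uint32_t) (old >> 32) + 1, new_item);
        /* Succeeds only if *head still equals old. */
        if (atomic_compare_exchange_strong(head, &old, desired)) {
            return tries;
        }
    }
    return -1;
}
```

With a single thread and a correct compiler the first attempt succeeds, so the function returns 0. The max_tries bound exists only to keep the sketch from spinning forever.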