George,

I cannot access parsec: HTTP error 403 :-(

I understand your point of view.
Back to the opal_lifo test: if I remember correctly, it hangs in the
non-multi-threaded part. The very first pop loops forever because the CAS
always fails, even though the values it compares are in fact equal.
There is a possibility that the problem comes from ompi and that we are just
lucky it works with recent icc, but I would not go "all in" on this ...
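
For readers following the thread, here is a minimal sketch of the kind of
pop retry loop being discussed. It is NOT the opal_lifo code. To stay
buildable with plain C11 atomics it is scaled down to a 64-bit head word
(a 32-bit node index plus a 32-bit update counter) instead of the 128-bit
pointer-plus-counter pair, and the names pool, pack, lifo_push and lifo_pop
are made up for illustration. The point is only the shape of the loop: pop
re-reads the head and retries until the compare-and-swap succeeds, so a CAS
that reports "not equal" for equal values would make it spin forever.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NIL UINT32_MAX  /* marks an empty list / end of list */

/* Hypothetical node pool: nodes are referred to by index, not by pointer. */
typedef struct { uint32_t next; int payload; } node_t;
static node_t pool[4];

/* Head word: low 32 bits = index of the top node, high 32 bits = counter.
 * The counter changes on every update so a recycled index is not mistaken
 * for an unchanged head (the usual ABA guard). */
static _Atomic uint64_t head = (uint64_t)NIL;

static uint64_t pack(uint32_t idx, uint32_t count) {
    return ((uint64_t)count << 32) | idx;
}

static void lifo_push(uint32_t idx) {
    uint64_t old = atomic_load(&head);
    uint64_t new_head;
    do {
        pool[idx].next = (uint32_t)old;  /* link to the current top */
        new_head = pack(idx, (uint32_t)(old >> 32) + 1);
        /* On failure, old is refreshed with the current head value. */
    } while (!atomic_compare_exchange_weak(&head, &old, new_head));
}

static uint32_t lifo_pop(void) {
    uint64_t old = atomic_load(&head);
    for (;;) {
        uint32_t idx = (uint32_t)old;
        if (idx == NIL) {
            return NIL;                  /* list is empty */
        }
        uint64_t new_head = pack(pool[idx].next, (uint32_t)(old >> 32) + 1);
        /* If this CAS never succeeds, the loop never terminates. */
        if (atomic_compare_exchange_weak(&head, &old, new_head)) {
            return idx;
        }
    }
}
```

In a single thread, pushing indices 0, 1 and 2 and then popping returns 2,
1 and 0, and a further pop returns NIL.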

And as you pointed out, even if the problem does come from the compiler, that
does not mean the ompi algorithms are necessarily correct.

Cheers,

Gilles

George Bosilca <bosi...@icl.utk.edu> wrote:
>On Fri, Feb 6, 2015 at 8:54 AM, Gilles Gouaillardet 
><gilles.gouaillar...@gmail.com> wrote:
>
>George,
>
>Can you point me to another project that uses 128-bit atomics?
>
>
>http://icl.cs.utk.edu/parsec/. It heavily uses lock-free structures, and 
>128-bit atomics are the safest and fastest way to implement them.
>
> 
>
>In my tests, I noticed that the volatile keyword is (one of) the triggers of 
>the compiler bug.
>
>
>I usually use it for the location to be atomically changed.
>
> 
>
>At this stage, i could not see anything wrong in ompi, plus this is working 
>fine with recent gcc and icc, so i concluded this is an icc bug, that is now 
>fixed, so all ompi can do is hide the symptom.
>
>
>These issues are pretty tricky to trigger, we need special race conditions 
>while manipulating pointers. There are tens of papers about how to correctly 
>implement FIFOs with CAS2, and even after peer reviews some of them turned out 
>to be incorrect. What I am saying is that we are quick to blame these failures 
>on the icc compiler, while we have no formal proof that the FIFO algorithm in 
>Open MPI is correct.
>
>
>  George.
>
>
> 
>
>
>Cheers,
>
>Gilles
>
>
>George Bosilca <bosi...@icl.utk.edu> wrote:
>
>My feeling is that the current patch hides the symptoms without addressing 
>the real issue.
>
>
>As a side note: the compiler incriminated in this thread works perfectly for 
>128-bit atomic operations in other projects where I use atomic LIFO & FIFO 
>(but not the ones from OMPI, as I already raised my concerns about them).
>
>
>  George.
>
>
>PS: Why are there totally unrelated comments about FIFO in opal_lifo.h 
>(starting at line 61)?
>
>
>On Wed, Feb 4, 2015 at 11:30 PM, Gilles Gouaillardet 
><gilles.gouaillar...@iferc.org> wrote:
>
>Paul and all,
>
>i just pushed 
>https://github.com/open-mpi/ompi/commit/b42e3441294e9fe787fe8e9ad7403d5b8e465163
>
>when a buggy compiler is detected, configure now forces OPAL_HAVE_CMPXCHG16B=0
>this is enough to make opal_lifo test and make check happy again.
>
>Cheers,
>
>Gilles
>
>
>
>On 2015/02/04 17:26, Gilles Gouaillardet wrote:
>
>Paul,
>
>my previous email was misleading.
>
>what i really meant is that the opal_fifo test works fine with icc 2013u5 
>(the release before 2013sp1) and with icc 2013sp1u2 and later.
>
>so even if the reproducer fails with icc older than 2013sp1u2, that might 
>not impact ompi since, for other reasons, the bug is not hit.
>for example, with icc 2013u5, OPAL_HAVE_CMPXCHG16B=0, so ompi stays away 
>from the compiler bug.
>
>Cheers,
>
>Gilles
>
>On 2015/02/04 17:15, Paul Hargrove wrote:
>
>Gilles,
>
>Who says only 2 versions are affected?
>I have access to 9 revisions of icc.
>Using your reduced case I find 7 that fail and only 2 (the latest two) that 
>pass. Discounting icc-12 (which can't compile the test), that makes 6 
>versions affected by the bug (not 2).
>
>-Paul
>
>$ for x in 12.1.5.339 13.0.0.079 13.0.1.117 13.1.2.183 13.1.3.192 \
>    14.0.0.080 14.0.1.106 14.0.2.144 15.0.1.133; do
>    module swap intel intel/$x
>    echo @ Testing Intel compiler version $x
>    icc conftest.c && ./a.out && echo PASS
>  done
>@ Testing Intel compiler version 12.1.5.339
>conftest.c(10): error: identifier "__int128_t" is undefined
>      __int128_t value;
>      ^
>
>compilation aborted for conftest.c (code 2)
>@ Testing Intel compiler version 13.0.0.079
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 13.0.1.117
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 13.1.2.183
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 13.1.3.192
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 14.0.0.080
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 14.0.1.106
>a.out: conftest.c:36: main: Assertion `a.value == b.value' failed.
>Aborted
>@ Testing Intel compiler version 14.0.2.144
>PASS
>@ Testing Intel compiler version 15.0.1.133
>PASS
>
>On Tue, Feb 3, 2015 at 11:45 PM, Gilles Gouaillardet 
><gilles.gouaillar...@iferc.org> wrote:
>
>Nathan,
>
>imho, this is a compiler bug and only two versions are affected:
>- intel icc 14.0.0.080 (aka 2013sp1)
>- intel icc 14.0.1.106 (aka 2013sp1u1)
>/* note the bug only occurs with -O1 and higher optimization levels */
>
>a simple reproducer is attached.
>
>a simple workaround is to configure with ac_cv_type___int128=0
>
>Cheers,
>
>Gilles
>
>On 2015/02/04 4:17, Nathan Hjelm wrote:
>
>That's the second report involving icc 14. I will dig into this later this 
>week.
>
>-Nathan
>
>On Mon, Feb 02, 2015 at 11:03:41PM -0800, Paul Hargrove wrote:
>
>I have seen opal_fifo hang on 2 distinct systems:
>+ Linux/ppc32 with xlc-11.1
>+ Linux/x86-64 with icc-14.0.1.106
>
>I have no explanation to offer for either hang.
>No "weird" configure options were passed to either.
>
>-Paul
>
>--
>Paul H. Hargrove                          phhargr...@lbl.gov
>Computer Languages & Systems Software (CLaSS) Group
>Computer Science Department               Tel: +1-510-495-2352
>Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2015/02/16911.php
