Nathan,

Just FYI: Both systems where I've seen this failure are VMs on a well-loaded server. So, the instruction interleaving (for reproducing races) is likely a bit different from what you would see on one's own laptop or workstation. Also, I don't see the SEGV in every run, but reproducing it inside gdb took me no more than 3 or 4 runs.

Let me know if your added memory barrier will be in tonight's master tarball. If so, I'll try to test again tonight.

-Paul

On Thu, Feb 12, 2015 at 12:53 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>
> Yes, seriously. This code is still undergoing testing, which is part of
> the reason it is on master. Once I am confident in the code I will be
> updating some of my code to use a fifo instead of an opal_list_t and a
> lock.
>
> I don't know if the barrier will make a difference, but it is the only
> place I could see for a possible inconsistency. It might not make any
> difference. If that is the case I will dig deeper.
>
> -Nathan
>
> On Thu, Feb 12, 2015 at 03:48:25PM -0500, George Bosilca wrote:
> > Seriously?
> > George.
> > On Thu, Feb 12, 2015 at 1:00 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
> >
> > I think I see the issue. Looks like there is a missing memory barrier
> > after the head consistency code. I will add one and see if that fixes
> > your problem.
> >
> > BTW, I can't reproduce the issue on any of my systems :-/.
> >
> > -Nathan
> > On Thu, Feb 12, 2015 at 02:07:08AM -0800, Paul Hargrove wrote:
> > > Just experienced the same failure as below with openmpi-dev-904-g08dceda
> > > built with "gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16)" on Scientific
> > > Linux 7.x (a RHEL 7 clone).
> > > gdb says:
> > > Program received signal SIGSEGV, Segmentation fault.
> > > [Switching to Thread 0x7ffff53b0700 (LWP 19685)]
> > > 0x0000000000401417 in opal_fifo_pop_atomic (fifo=0x7fffffffe130)
> > > at /home/phargrov/OMPI/openmpi-master-linux-x86_64-sl7x/openmpi-dev-904-g08dceda/opal/class/opal_fifo.h:127
> > > 127 next = (opal_list_item_t *) item->opal_list_next;
> > > -Paul
> > > On Fri, Feb 6, 2015 at 4:22 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> > >
> > > Yes, this time I really mean "fifo", not "lifo". ;-)
> > > With last night's master tarball (Open MPI dev-845-ga3275aa) configured
> > > with only --prefix and --enable-debug
> > > A Linux x86-64 system running Debian Wheezy and compiler = "gcc (Debian
> > > 4.7.2-5) 4.7.2"
> > > Failure from "make check":
> > > /home/phargrov/OMPI/openmpi-master-linux-x86_64-wheezy/openmpi-dev-845-ga3275aa/config/test-driver:
> > > line 95: 3697 Segmentation fault "$@" > $log_file 2>&1
> > > FAIL: opal_fifo
> > > Manual run shows:
> > > $ ./test/class/opal_fifo
> > > Single thread test. Time: 0 s 33534 us 33 nsec/poppush
> > > Atomics thread finished. Time: 0 s 82289 us 82 nsec/poppush
> > > Atomics thread finished. Time: 4 s 844299 us 4844 nsec/poppush
> > > Atomics thread finished. Time: 5 s 27642 us 5027 nsec/poppush
> > > Atomics thread finished. Time: 5 s 65829 us 5065 nsec/poppush
> > > Atomics thread finished. Time: 5 s 264239 us 5264 nsec/poppush
> > > Atomics thread finished. Time: 5 s 432407 us 5432 nsec/poppush
> > > Atomics thread finished. Time: 5 s 462913 us 5462 nsec/poppush
> > > Atomics thread finished. Time: 5 s 466208 us 5466 nsec/poppush
> > > Atomics thread finished. Time: 5 s 485575 us 5485 nsec/poppush
> > > All threads finished. Thread count: 8 Time: 5 s 485844 us 685 nsec/poppush
> > > Segmentation fault (core dumped)
> > > When run within GDB:
> > > Program received signal SIGSEGV, Segmentation fault.
> > > [Switching to Thread 0x7ffff5c64700 (LWP 3948)]
> > > 0x0000000000401568 in opal_fifo_pop_atomic (fifo=0x7fffffffe830)
> > > at /home/phargrov/OMPI/openmpi-master-linux-x86_64-wheezy/openmpi-dev-845-ga3275aa/opal/class/opal_fifo.h:127
> > > 127 next = (opal_list_item_t *) item->opal_list_next;
> > > (gdb) print item
> > > $1 = (opal_list_item_t *) 0x0
> > > (gdb) where
> > > #0 0x0000000000401568 in opal_fifo_pop_atomic (fifo=0x7fffffffe830)
> > > at /home/phargrov/OMPI/openmpi-master-linux-x86_64-wheezy/openmpi-dev-845-ga3275aa/opal/class/opal_fifo.h:127
> > > #1 0x000000000040193d in thread_test_exhaust (arg=0x7fffffffe830)
> > > at /home/phargrov/OMPI/openmpi-master-linux-x86_64-wheezy/openmpi-dev-845-ga3275aa/test/class/opal_fifo.c:79
> > > #2 0x00007ffff6ff9b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
> > > #3 0x00007ffff6d4370d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > > #4 0x0000000000000000 in ?? ()
> > > -Paul
> > >
> > > _______________________________________________
> > > devel mailing list
> > > de...@open-mpi.org
> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > Link to this post: http://www.open-mpi.org/community/lists/devel/2015/02/16975.php

--
Paul H. Hargrove phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
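
For readers following the thread, here is a minimal sketch of the kind of ordering a memory barrier is meant to enforce when one thread hands an item to another. It is not the opal_fifo code and makes no claim about where the actual bug lies. The names node_t, mailbox_t, mailbox_init, mailbox_publish and mailbox_take are hypothetical, and the sketch assumes a compiler that ships C11 <stdatomic.h> (GCC gained it in 4.9, so the 4.7.2 and 4.8.2 compilers mentioned above would need the __atomic builtins instead).

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type; NOT the layout of opal_list_item_t. */
typedef struct node {
    struct node *next;
    int value;
} node_t;

/* Hypothetical single-slot mailbox standing in for a shared head pointer. */
typedef struct {
    _Atomic(node_t *) head;
} mailbox_t;

static void mailbox_init(mailbox_t *mb)
{
    atomic_init(&mb->head, NULL);
}

/* Writer side: initialize the node first, then publish it.
 * The release store keeps the writes to n->value and n->next from
 * being reordered after the store that makes n visible. */
static void mailbox_publish(mailbox_t *mb, node_t *n, int value)
{
    n->value = value;
    n->next = NULL;
    atomic_store_explicit(&mb->head, n, memory_order_release);
}

/* Reader side: take whatever is in the slot and leave it empty.
 * The acquire ordering pairs with the release store above, so a
 * reader that sees the pointer also sees the initialized fields.
 * Returns NULL when nothing has been published; the caller must
 * check for that before dereferencing. */
static node_t *mailbox_take(mailbox_t *mb)
{
    return atomic_exchange_explicit(&mb->head, NULL, memory_order_acquire);
}
```

Without the release/acquire pairing, a reader on a weakly ordered machine (or after compiler reordering) could observe the pointer before the node's fields are written. Note also that the taker checks for NULL rather than assuming an item is present, which is the condition the gdb session above shows being violated (item is 0x0 at the dereference).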