Nathan,

Last night's master tarball is still producing a SEGV in opal_fifo on the same Scientific Linux 7.x x86-64 VM as I reported in Feb.

Reproducing the SEGV under gdb yields:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5bb1700 (LWP 16242)]
0x0000000000401167 in opal_fifo_pop_atomic (fifo=0x7fffffffe130)
    at /home/phargrov/OMPI/openmpi-master-linux-x86_64-sl7x/openmpi-dev-2014-gc8730b5/opal/class/opal_fifo.h:127
127         next = (opal_list_item_t *) item->opal_list_next;
(gdb) where
#0  0x0000000000401167 in opal_fifo_pop_atomic (fifo=0x7fffffffe130)
    at /home/phargrov/OMPI/openmpi-master-linux-x86_64-sl7x/openmpi-dev-2014-gc8730b5/opal/class/opal_fifo.h:127
#1  0x000000000040153e in thread_test_exhaust (arg=0x7fffffffe130)
    at /home/phargrov/OMPI/openmpi-master-linux-x86_64-sl7x/openmpi-dev-2014-gc8730b5/test/class/opal_fifo.c:79
#2  0x00007ffff6f7cdf3 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ffff6caa1ed in clone () from /lib64/libc.so.6

-Paul

On Thu, Feb 12, 2015 at 2:17 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

> Nathan,
>
> Just FYI: Both systems where I've seen this failure are VMs on a
> well-loaded server.
> So, the instruction interleaving (for reproducing races) is likely a bit
> different than what you would see on one's own laptop or workstation. Also,
> I don't see the SEGV in every run, but to reproduce it inside gdb took me
> no more than 3 or 4 runs.
>
> Let me know if your added memory barrier will be in tonight's master
> tarball.
> If so, I'll try to test again tonight.
>
> -Paul
>
>
> On Thu, Feb 12, 2015 at 12:53 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>
>>
>> Yes, seriously. This code is still undergoing testing which is part of
>> the reason it is on master. Once I am confident in the code I will be
>> updating some of my code to use a fifo instead of an opal_list_t and a
>> lock.
>>
>> I don't know if the barrier will make a difference but it is the only
>> place I could see for a possible inconsistency. It might not make any
>> difference. If that is the case I will dig deeper.
>>
>> -Nathan
>>
>> On Thu, Feb 12, 2015 at 03:48:25PM -0500, George Bosilca wrote:
>> > Seriously?
>> > George.
>> > On Thu, Feb 12, 2015 at 1:00 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>> >
>> >   I think I see the issue. Looks like there is a missing memory barrier
>> >   after the head consistency code. I will add one and see if that fixes
>> >   your problem.
>> >
>> >   BTW, I can't reproduce the issue on any of my systems :-/.
>> >
>> >   -Nathan
>> >   On Thu, Feb 12, 2015 at 02:07:08AM -0800, Paul Hargrove wrote:
>> > > Just experienced the same failure as below with openmpi-dev-904-g08dceda
>> > > built with "gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16)" on Scientific
>> > > Linux 7.x (a RHEL 7 clone).
>> > > gdb says:
>> > > Program received signal SIGSEGV, Segmentation fault.
>> > > [Switching to Thread 0x7ffff53b0700 (LWP 19685)]
>> > > 0x0000000000401417 in opal_fifo_pop_atomic (fifo=0x7fffffffe130)
>> > >     at /home/phargrov/OMPI/openmpi-master-linux-x86_64-sl7x/openmpi-dev-904-g08dceda/opal/class/opal_fifo.h:127
>> > > 127         next = (opal_list_item_t *) item->opal_list_next;
>> > > -Paul
>> > > On Fri, Feb 6, 2015 at 4:22 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> > >
>> > >   Yes, this time I really mean "fifo", not "lifo". ;-)
>> > >   With last night's master tarball (Open MPI dev-845-ga3275aa) configured
>> > >   with only --prefix and --enable-debug
>> > >   A Linux x86-64 system running Debian Wheezy and compiler = "gcc (Debian
>> > >   4.7.2-5) 4.7.2"
>> > >   Failure from "make check":
>> > >   /home/phargrov/OMPI/openmpi-master-linux-x86_64-wheezy/openmpi-dev-845-ga3275aa/config/test-driver:
>> > >   line 95:  3697 Segmentation fault      "$@" > $log_file 2>&1
>> > >   FAIL: opal_fifo
>> > >   Manual run shows:
>> > >   $ ./test/class/opal_fifo
>> > >   Single thread test. Time: 0 s 33534 us 33 nsec/poppush
>> > >   Atomics thread finished. Time: 0 s 82289 us 82 nsec/poppush
>> > >   Atomics thread finished. Time: 4 s 844299 us 4844 nsec/poppush
>> > >   Atomics thread finished. Time: 5 s 27642 us 5027 nsec/poppush
>> > >   Atomics thread finished. Time: 5 s 65829 us 5065 nsec/poppush
>> > >   Atomics thread finished. Time: 5 s 264239 us 5264 nsec/poppush
>> > >   Atomics thread finished. Time: 5 s 432407 us 5432 nsec/poppush
>> > >   Atomics thread finished. Time: 5 s 462913 us 5462 nsec/poppush
>> > >   Atomics thread finished. Time: 5 s 466208 us 5466 nsec/poppush
>> > >   Atomics thread finished. Time: 5 s 485575 us 5485 nsec/poppush
>> > >   All threads finished. Thread count: 8 Time: 5 s 485844 us 685 nsec/poppush
>> > >   Segmentation fault (core dumped)
>> > >   When run within GDB:
>> > >   Program received signal SIGSEGV, Segmentation fault.
>> > >   [Switching to Thread 0x7ffff5c64700 (LWP 3948)]
>> > >   0x0000000000401568 in opal_fifo_pop_atomic (fifo=0x7fffffffe830)
>> > >       at /home/phargrov/OMPI/openmpi-master-linux-x86_64-wheezy/openmpi-dev-845-ga3275aa/opal/class/opal_fifo.h:127
>> > >   127         next = (opal_list_item_t *) item->opal_list_next;
>> > >   (gdb) print item
>> > >   $1 = (opal_list_item_t *) 0x0
>> > >   (gdb) where
>> > >   #0  0x0000000000401568 in opal_fifo_pop_atomic (fifo=0x7fffffffe830)
>> > >       at /home/phargrov/OMPI/openmpi-master-linux-x86_64-wheezy/openmpi-dev-845-ga3275aa/opal/class/opal_fifo.h:127
>> > >   #1  0x000000000040193d in thread_test_exhaust (arg=0x7fffffffe830)
>> > >       at /home/phargrov/OMPI/openmpi-master-linux-x86_64-wheezy/openmpi-dev-845-ga3275aa/test/class/opal_fifo.c:79
>> > >   #2  0x00007ffff6ff9b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
>> > >   #3  0x00007ffff6d4370d in clone () from /lib/x86_64-linux-gnu/libc.so.6
>> > >   #4  0x0000000000000000 in ?? ()
>> > >   -Paul
>> > >   --
>> > >   Paul H. Hargrove                          phhargr...@lbl.gov
>> > >   Computer Languages & Systems Software (CLaSS) Group
>> > >   Computer Science Department               Tel: +1-510-495-2352
>> > >   Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>> >
>> > > _______________________________________________
>> > > devel mailing list
>> > > de...@open-mpi.org
>> > > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> > > Link to this post: http://www.open-mpi.org/community/lists/devel/2015/02/16975.php
>> >
>> >   _______________________________________________
>> >   devel mailing list
>> >   de...@open-mpi.org
>> >   Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> >   Link to this post: http://www.open-mpi.org/community/lists/devel/2015/02/16978.php
>>
>> > _______________________________________________
>> > devel mailing list
>> > de...@open-mpi.org
>> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> > Link to this post: http://www.open-mpi.org/community/lists/devel/2015/02/16979.php
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/02/16980.php
>>
>

--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
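
P.S. Since the SEGV does not show up in every run (the quoted Feb 12 message mentions 3 or 4 runs being enough under gdb), a small retry loop may be handy when checking a new tarball. What follows is only a minimal sketch and not part of Open MPI: the helper name run_until_fail and the run limit are assumptions made for illustration. The only name taken from this thread is the test binary ./test/class/opal_fifo.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): rerun a command until it
# fails or a run limit is reached.
# Usage: run_until_fail MAX_RUNS COMMAND [ARGS...]
run_until_fail() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        status=0
        # Discard the command's own output; only the exit status matters here.
        "$@" > /dev/null 2>&1 || status=$?
        if [ "$status" -ne 0 ]; then
            # A failure was reproduced: report which run and with what status.
            echo "run $i failed with exit status $status"
            return 0
        fi
        i=$((i + 1))
    done
    # The limit was reached without seeing a failure.
    echo "no failure in $max runs"
    return 1
}
```

From the build directory this could be called as "run_until_fail 10 ./test/class/opal_fifo". A process killed by SIGSEGV is typically reported by the shell as exit status 139 (128 + 11), so that value in the output would point at the same crash rather than an ordinary test failure. A run that reports no failure does not show the race is gone, only that it did not trigger within the chosen number of runs.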