Nathan,

Last night's master tarball is still producing a SEGV in opal_fifo on the
same Scientific Linux 7.x x86-64 VM as I reported in Feb.

Reproducing the SEGV under gdb yields:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff5bb1700 (LWP 16242)]
0x0000000000401167 in opal_fifo_pop_atomic (fifo=0x7fffffffe130)
    at /home/phargrov/OMPI/openmpi-master-linux-x86_64-sl7x/openmpi-dev-2014-gc8730b5/opal/class/opal_fifo.h:127
127             next = (opal_list_item_t *) item->opal_list_next;

(gdb) where
#0  0x0000000000401167 in opal_fifo_pop_atomic (fifo=0x7fffffffe130)
    at /home/phargrov/OMPI/openmpi-master-linux-x86_64-sl7x/openmpi-dev-2014-gc8730b5/opal/class/opal_fifo.h:127
#1  0x000000000040153e in thread_test_exhaust (arg=0x7fffffffe130)
    at /home/phargrov/OMPI/openmpi-master-linux-x86_64-sl7x/openmpi-dev-2014-gc8730b5/test/class/opal_fifo.c:79
#2  0x00007ffff6f7cdf3 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ffff6caa1ed in clone () from /lib64/libc.so.6
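
For reference, the faulting statement reads item->opal_list_next, and in the
Feb 6 report quoted below gdb showed item as 0x0 at that point. The following
is only a minimal single-consumer sketch of a pop that checks the head for
NULL before dereferencing it and uses acquire/release ordering on the head
pointer. The names sketch_item_t, sketch_fifo_t, and the sketch_fifo_*
functions are hypothetical stand-ins, not Open MPI's opal_fifo code, and the
sketch makes no attempt to be safe with several concurrent poppers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT Open MPI's opal_list_item_t / opal_fifo_t. */
typedef struct sketch_item {
    struct sketch_item *next;
    int value;
} sketch_item_t;

typedef struct {
    _Atomic(sketch_item_t *) head;  /* read by the consumer */
    sketch_item_t *tail;            /* touched only by this sketch's single caller */
} sketch_fifo_t;

static void sketch_fifo_init(sketch_fifo_t *fifo)
{
    atomic_init(&fifo->head, NULL);
    fifo->tail = NULL;
}

/* Append an item. Assumes one caller at a time (no concurrent pushers). */
static void sketch_fifo_push(sketch_fifo_t *fifo, sketch_item_t *item)
{
    assert(item != NULL);
    item->next = NULL;
    if (fifo->tail == NULL) {
        /* Release store: the item's fields are written before it is published. */
        atomic_store_explicit(&fifo->head, item, memory_order_release);
    } else {
        fifo->tail->next = item;
    }
    fifo->tail = item;
}

/* Remove the oldest item, or return NULL when the fifo is empty. */
static sketch_item_t *sketch_fifo_pop(sketch_fifo_t *fifo)
{
    /* Acquire load pairs with the release store in push. */
    sketch_item_t *item = atomic_load_explicit(&fifo->head, memory_order_acquire);
    if (item == NULL) {
        return NULL;  /* guard: never read item->next through a NULL head */
    }
    sketch_item_t *next = item->next;
    atomic_store_explicit(&fifo->head, next, memory_order_release);
    if (next == NULL) {
        fifo->tail = NULL;  /* the fifo is now empty */
    }
    item->next = NULL;
    return item;
}
```

Popping from an empty sketch_fifo_t returns NULL rather than faulting, and
items come back in the order they were pushed. Whether a check of this kind
is relevant to the actual race in opal_fifo_pop_atomic is an open question.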


-Paul

On Thu, Feb 12, 2015 at 2:17 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

> Nathan,
>
> Just FYI: Both systems where I've seen this failure are VMs on a
> well-loaded server.
> So, the instruction interleaving (for reproducing races) is likely a bit
> different than what you would see on one's own laptop or workstation.  Also,
> I don't see the SEGV in every run, but to reproduce it inside gdb took me
> no more than 3 or 4 runs.
>
> Let me know if your added memory barrier will be in tonight's master
> tarball.
> If so, I'll try to test again tonight.
>
> -Paul
>
>
> On Thu, Feb 12, 2015 at 12:53 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>
>>
>> Yes, seriously. This code is still undergoing testing which is part of
>> the reason it is on master. Once I am confident in the code I will be
>> updating some of my code to use a fifo instead of an opal_list_t and a
>> lock.
>>
>> I don't know if the barrier will make a difference but it is the only
>> place I could see for a possible inconsistency. It might not make any
>> difference. If that is the case I will dig deeper.
>>
>> -Nathan
>>
>> On Thu, Feb 12, 2015 at 03:48:25PM -0500, George Bosilca wrote:
>> >    Seriously?
>> >      George.
>> >    On Thu, Feb 12, 2015 at 1:00 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>> >
>> >      I think I see the issue. Looks like there is a missing memory barrier
>> >      after the head consistency code. I will add one and see if that fixes
>> >      your problem.
>> >
>> >      BTW, I can't reproduce the issue on any of my systems :-/.
>> >
>> >      -Nathan
>> >      On Thu, Feb 12, 2015 at 02:07:08AM -0800, Paul Hargrove wrote:
>> >      >    Just experienced the same failure as below with openmpi-dev-904-g08dceda
>> >      >    built with "gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16)" on Scientific
>> >      >    Linux 7.x (a RHEL 7 clone).
>> >      >    gdb says:
>> >      >    Program received signal SIGSEGV, Segmentation fault.
>> >      >    [Switching to Thread 0x7ffff53b0700 (LWP 19685)]
>> >      >    0x0000000000401417 in opal_fifo_pop_atomic (fifo=0x7fffffffe130)
>> >      >        at /home/phargrov/OMPI/openmpi-master-linux-x86_64-sl7x/openmpi-dev-904-g08dceda/opal/class/opal_fifo.h:127
>> >      >    127             next = (opal_list_item_t *) item->opal_list_next;
>> >      >    -Paul
>> >      >    On Fri, Feb 6, 2015 at 4:22 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>> >      >
>> >      >      Yes, this time I really mean "fifo", not "lifo".  ;-)
>> >      >      With last night's master tarball (Open MPI dev-845-ga3275aa) configured
>> >      >      with only --prefix and --enable-debug, on a Linux x86-64 system running
>> >      >      Debian Wheezy with compiler "gcc (Debian 4.7.2-5) 4.7.2".
>> >      >      Failure from "make check":
>> >      >      /home/phargrov/OMPI/openmpi-master-linux-x86_64-wheezy/openmpi-dev-845-ga3275aa/config/test-driver:
>> >      >      line 95:  3697 Segmentation fault      "$@" > $log_file 2>&1
>> >      >      FAIL: opal_fifo
>> >      >      Manual run shows:
>> >      >      $ ./test/class/opal_fifo
>> >      >      Single thread test. Time: 0 s 33534 us 33 nsec/poppush
>> >      >      Atomics thread finished. Time: 0 s 82289 us 82 nsec/poppush
>> >      >      Atomics thread finished. Time: 4 s 844299 us 4844 nsec/poppush
>> >      >      Atomics thread finished. Time: 5 s 27642 us 5027 nsec/poppush
>> >      >      Atomics thread finished. Time: 5 s 65829 us 5065 nsec/poppush
>> >      >      Atomics thread finished. Time: 5 s 264239 us 5264 nsec/poppush
>> >      >      Atomics thread finished. Time: 5 s 432407 us 5432 nsec/poppush
>> >      >      Atomics thread finished. Time: 5 s 462913 us 5462 nsec/poppush
>> >      >      Atomics thread finished. Time: 5 s 466208 us 5466 nsec/poppush
>> >      >      Atomics thread finished. Time: 5 s 485575 us 5485 nsec/poppush
>> >      >      All threads finished. Thread count: 8 Time: 5 s 485844 us 685 nsec/poppush
>> >      >      Segmentation fault (core dumped)
>> >      >      When run within GDB:
>> >      >      Program received signal SIGSEGV, Segmentation fault.
>> >      >      [Switching to Thread 0x7ffff5c64700 (LWP 3948)]
>> >      >      0x0000000000401568 in opal_fifo_pop_atomic (fifo=0x7fffffffe830)
>> >      >          at /home/phargrov/OMPI/openmpi-master-linux-x86_64-wheezy/openmpi-dev-845-ga3275aa/opal/class/opal_fifo.h:127
>> >      >      127             next = (opal_list_item_t *) item->opal_list_next;
>> >      >      (gdb) print item
>> >      >      $1 = (opal_list_item_t *) 0x0
>> >      >      (gdb) where
>> >      >      #0  0x0000000000401568 in opal_fifo_pop_atomic (fifo=0x7fffffffe830)
>> >      >          at /home/phargrov/OMPI/openmpi-master-linux-x86_64-wheezy/openmpi-dev-845-ga3275aa/opal/class/opal_fifo.h:127
>> >      >      #1  0x000000000040193d in thread_test_exhaust (arg=0x7fffffffe830)
>> >      >          at /home/phargrov/OMPI/openmpi-master-linux-x86_64-wheezy/openmpi-dev-845-ga3275aa/test/class/opal_fifo.c:79
>> >      >      #2  0x00007ffff6ff9b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
>> >      >      #3  0x00007ffff6d4370d in clone () from /lib/x86_64-linux-gnu/libc.so.6
>> >      >      #4  0x0000000000000000 in ?? ()
>> >      >      -Paul
>> >      >      --
>> >      >      Paul H. Hargrove                          phhargr...@lbl.gov
>> >      >      Computer Languages & Systems Software (CLaSS) Group
>> >      >      Computer Science Department               Tel: +1-510-495-2352
>> >      >      Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>> >      >
>> >      >    --
>> >      >    Paul H. Hargrove                          phhargr...@lbl.gov
>> >      >    Computer Languages & Systems Software (CLaSS) Group
>> >      >    Computer Science Department               Tel: +1-510-495-2352
>> >      >    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>> >
>> >      > _______________________________________________
>> >      > devel mailing list
>> >      > de...@open-mpi.org
>> >      > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> >      > Link to this post:
>> >      http://www.open-mpi.org/community/lists/devel/2015/02/16975.php
>> >
>> >      _______________________________________________
>> >      devel mailing list
>> >      de...@open-mpi.org
>> >      Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> >      Link to this post:
>> >      http://www.open-mpi.org/community/lists/devel/2015/02/16978.php
>>
>> > _______________________________________________
>> > devel mailing list
>> > de...@open-mpi.org
>> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> > Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/02/16979.php
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/02/16980.php
>>
>
>
>
> --
> Paul H. Hargrove                          phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>



-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
