Hi Jeff
> > I posted this to the devel list the other day, but it raised no
> > responses. Maybe people will have more to say here.
>
> Sorry Terry; many of us were at the SC conference last week, and this
> week is short because of the US holiday. Some of the inbox got
> dropped/delayed as a result...
'Tis OK. Beggars can't be choosers! ;-)
> > Because of this I can't reduce the problem to a small testcase, and so
> > have not included any code at this stage.
>
> Ugh. Heisenbugs are the worst.
>
> Have you tried with a memory checking debugger, such as valgrind, or a
> parallel debugger? Is there a chance that there's a simple errant
> posted receive (perhaps in a race condition) that is unexpectedly
> receiving into the Bug's memory location when you don't expect it?
I have zero experience with valgrind. But I downloaded it and ran my
"minimal" case (about 1000 lines + libraries!) with it. Thus I found
one uninitialised variable and need to go away and check my code
carefully now. Correcting this in the most obvious, un-thought-through
way makes my Bug go away. (But then so does changing the code in other,
unexecuted sections!)
However, what I get out of valgrind now is:
[tjf@fkpc167 Minimal]$ valgrind --leak-check=yes ./nnh
==20671== Memcheck, a memory error detector.
==20671== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==20671== Using LibVEX rev 1732, a library for dynamic binary translation.
==20671== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==20671== Using valgrind-3.2.3, a dynamic binary instrumentation framework.
==20671== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==20671== For more details, rerun with: -v
==20671==
==20671== Conditional jump or move depends on uninitialised value(s)
==20671==    at 0x40152B1: (within /lib/ld-2.5.so)
==20671==    by 0x4005278: (within /lib/ld-2.5.so)
==20671==    by 0x4007CFD: (within /lib/ld-2.5.so)
==20671==    by 0x400318A: (within /lib/ld-2.5.so)
==20671==    by 0x4013D9A: (within /lib/ld-2.5.so)
==20671==    by 0x40012C6: (within /lib/ld-2.5.so)
==20671==    by 0x4000A67: (within /lib/ld-2.5.so)
..
==20671== Conditional jump or move depends on uninitialised value(s)
==20671==    at 0x40152B1: (within /lib/ld-2.5.so)
==20671==    by 0x400A289: (within /lib/ld-2.5.so)
==20671==    by 0x6A42E4D: (within /lib/libc-2.5.so)
==20671==    by 0x59AE0E3: (within /lib/libdl-2.5.so)
==20671==    by 0x400D725: (within /lib/ld-2.5.so)
==20671==    by 0x59AE4EC: (within /lib/libdl-2.5.so)
==20671==    by 0x59AE099: dlsym (in /lib/libdl-2.5.so)
==20671==    by 0x57610FB: vm_sym (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x575E29E: lt_dlsym (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x57666EF: open_component (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x576711B: mca_base_component_find (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x5767A9F: mca_base_components_open (in /usr/local/lib/libopen-pal.so.0.0.0)
..
==20671==
==20671== ERROR SUMMARY: 102 errors from 24 contexts (suppressed: 0 from 0)
==20671== malloc/free: in use at exit: 0 bytes in 0 blocks.
==20671== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==20671== For counts of detected errors, rerun with: -v
==20671== All heap blocks were freed -- no leaks are possible.
This looks particularly broken!
I've just run valgrind on another (serial) piece of code on this machine
and got three of the uninitialised-value jumps from within ld-2.5.so,
virtually identical to the first three from this MPI code. Of the 24
contexts from the MPI code, those that seem to originate from within
OpenMPI are particularly worrying.
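Since the same ld-2.5.so reports turn up under a serial program, one way to
separate them from anything in the application itself would be a valgrind
suppression file. What follows is only a minimal sketch: the file name
ld.supp and the suppression name are made up for illustration, and it assumes
the loader reports are all "Conditional jump" (Memcheck:Cond) errors whose
innermost frame lies in /lib/ld-2.5.so, as in the two traces shown above.

```
{
   ld_so_uninitialised_cond
   Memcheck:Cond
   obj:/lib/ld-2.5.so
}
```

Saved as ld.supp, it would be passed as
valgrind --suppressions=ld.supp --leak-check=yes ./nnh. Note that the
dlsym trace reached through libopen-pal also has its innermost frame in
ld-2.5.so, so this entry would hide that one as well; whatever is still
reported afterwards is what deserves a closer look. Running with
--gen-suppressions=yes makes valgrind offer exact entries for each error
instead of relying on a hand-written one.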
Am I panicking for no reason, have I likely got a bad build or is
OpenMPI broken beyond repair?!!
> > If I run the code with mpirun -np 1 the problem goes away. So one
> > could presumably simply say "always run it with mpirun." But if this
> > is required, why does OpenMPI not detect it?
>
> I'm not sure what you're asking -- Open MPI does not *require* you to
> run with mpirun...
That's exactly what I was asking. Cheers!
Ciao
Terry
--
Dr Terry Frankcombe
Physical Chemistry, Department of Chemistry
Göteborgs Universitet
SE-412 96 Göteborg Sweden
Ph: +46 76 224 0887 Skype: terry.frankcombe