Hi Jeff

> > I posted this to the devel list the other day, but it raised no
> > responses.  Maybe people will have more to say here.
> 
> Sorry Terry; many of us were at the SC conference last week, and this  
> week is short because of the US holiday.  Some of the inbox got  
> dropped/delayed as a result...

'Tis OK.  Beggars can't be choosers!  ;-)

<snip>

> > Because of this I can't reduce the problem to a small testcase, and so
> > have not included any code at this stage.
> 
> Ugh.  Heisenbugs are the worst.
> 
> Have you tried with a memory checking debugger, such as valgrind, or a  
> parallel debugger?  Is there a chance that there's a simple errant  
> posted receive (perhaps in a race condition) that is unexpectedly  
> receiving into the Bug's memory location when you don't expect it?

I have zero experience with valgrind.  But I downloaded it and ran my
"minimal" case (about 1000 lines + libraries!) with it.  Thus I found
one uninitialised variable and need to go away and check my code
carefully now.  Correcting this in the most obvious, un-thought-through
way makes my Bug go away.  (But then so does changing the code in other,
unexecuted sections!)
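
For anyone who hasn't used memcheck: the class of bug it caught for me
is a read-before-write.  A minimal C illustration (not my actual code,
just a sketch) that valgrind flags with "Conditional jump or move
depends on uninitialised value(s)" is:

#include <stdio.h>

int main(void)
{
    int flag;                 /* never assigned */
    if (flag)                 /* memcheck reports the branch here */
        printf("flag was set\n");
    return 0;
}

Compile with debug info and no optimisation (e.g. gcc -g -O0) and
valgrind points straight at the offending line.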

However, what I get out of valgrind now is:

[tjf@fkpc167 Minimal]$ valgrind --leak-check=yes ./nnh
==20671== Memcheck, a memory error detector.
==20671== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==20671== Using LibVEX rev 1732, a library for dynamic binary translation.
==20671== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==20671== Using valgrind-3.2.3, a dynamic binary instrumentation framework.
==20671== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==20671== For more details, rerun with: -v
==20671== 
==20671== Conditional jump or move depends on uninitialised value(s)
==20671==    at 0x40152B1: (within /lib/ld-2.5.so)
==20671==    by 0x4005278: (within /lib/ld-2.5.so)
==20671==    by 0x4007CFD: (within /lib/ld-2.5.so)
==20671==    by 0x400318A: (within /lib/ld-2.5.so)
==20671==    by 0x4013D9A: (within /lib/ld-2.5.so)
==20671==    by 0x40012C6: (within /lib/ld-2.5.so)
==20671==    by 0x4000A67: (within /lib/ld-2.5.so)

...<snip>...

==20671== Conditional jump or move depends on uninitialised value(s)
==20671==    at 0x40152B1: (within /lib/ld-2.5.so)
==20671==    by 0x400A289: (within /lib/ld-2.5.so)
==20671==    by 0x6A42E4D: (within /lib/libc-2.5.so)
==20671==    by 0x59AE0E3: (within /lib/libdl-2.5.so)
==20671==    by 0x400D725: (within /lib/ld-2.5.so)
==20671==    by 0x59AE4EC: (within /lib/libdl-2.5.so)
==20671==    by 0x59AE099: dlsym (in /lib/libdl-2.5.so)
==20671==    by 0x57610FB: vm_sym (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x575E29E: lt_dlsym (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x57666EF: open_component (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x576711B: mca_base_component_find (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x5767A9F: mca_base_components_open (in /usr/local/lib/libopen-pal.so.0.0.0)

...<snip>...

<my code output, no valgrind errors within it>

==20671== 
==20671== ERROR SUMMARY: 102 errors from 24 contexts (suppressed: 0 from 0)
==20671== malloc/free: in use at exit: 0 bytes in 0 blocks.
==20671== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==20671== For counts of detected errors, rerun with: -v
==20671== All heap blocks were freed -- no leaks are possible.


This looks particularly broken!

I've just run valgrind on another (serial) piece of code on this machine
and got three of the uninitialised jumps from within ld-2.5.so, virtually
identical to the first three from this MPI code.  Of the 24 from the MPI
code, those seeming to originate from within OpenMPI are particularly
worrying.

Am I panicking for no reason, have I likely got a bad build, or is
OpenMPI broken beyond repair?!!
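
In the meantime I can at least hide the loader noise with a valgrind
suppressions file.  Something like the following (an entry guessed
from the traces above, so the obj: path may need adjusting) silences
the ld-2.5.so reports:

# hide "Conditional jump..." reports whose innermost frame is in the loader
{
   ld-2.5.so-uninitialised-cond
   Memcheck:Cond
   obj:/lib/ld-2.5.so
}

and run as, e.g., valgrind --leak-check=yes --suppressions=ld.supp ./nnh
(ld.supp being whatever name the file above is saved under).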


> > If I run the code with mpirun -np 1 the problem goes away.  So one  
> > could
> > presumably simply say "always run it with mpirun."  But if this is
> > required, why does OpenMPI not detect it?
> 
> I'm not sure what you're asking -- Open MPI does not *require* you to  
> run with mpirun...

That's exactly what I was asking.  Cheers!
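
For anyone else wondering about this, a trivial MPI program makes an
easy check of the singleton path: run it once as plain ./hello and
once as mpirun -np 1 ./hello (hello.c here is just an illustration,
not my code):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);     /* singleton init, works without mpirun */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Both invocations should report "rank 0 of 1" if singleton startup is
healthy.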

Ciao
Terry

-- 
Dr Terry Frankcombe
Physical Chemistry, Department of Chemistry
Göteborgs Universitet
SE-412 96 Göteborg Sweden
Ph: +46 76 224 0887   Skype: terry.frankcombe
<te...@chem.gu.se>
