Hi Jeff

> > I posted this to the devel list the other day, but it raised no
> > responses.  Maybe people will have more to say here.
>
> Sorry Terry; many of us were at the SC conference last week, and this
> week is short because of the US holiday.  Some of the inbox got
> dropped/delayed as a result...

'Tis OK.  Beggars can't be choosers!  ;-)

<snip>

> > Because of this I can't reduce the problem to a small testcase, and so
> > have not included any code at this stage.
>
> Ugh.  Heisenbugs are the worst.
>
> Have you tried with a memory checking debugger, such as valgrind, or a
> parallel debugger?  Is there a chance that there's a simple errant
> posted receive (perhaps in a race condition) that is unexpectedly
> receiving into the Bug's memory location when you don't expect it?

I have zero experience with valgrind.  But I downloaded it and ran my
"minimal" case (about 1000 lines + libraries!) with it.  Thus I found
one uninitialised variable and need to go away and check my code
carefully now.  Correcting this in the most obvious, un-thought-through
way makes my Bug go away.  (But then so does changing the code in other,
unexecuted sections!)

However, what I get out of valgrind now is:

[tjf@fkpc167 Minimal]$ valgrind --leak-check=yes ./nnh
==20671== Memcheck, a memory error detector.
==20671== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==20671== Using LibVEX rev 1732, a library for dynamic binary translation.
==20671== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==20671== Using valgrind-3.2.3, a dynamic binary instrumentation framework.
==20671== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==20671== For more details, rerun with: -v
==20671==
==20671== Conditional jump or move depends on uninitialised value(s)
==20671==    at 0x40152B1: (within /lib/ld-2.5.so)
==20671==    by 0x4005278: (within /lib/ld-2.5.so)
==20671==    by 0x4007CFD: (within /lib/ld-2.5.so)
==20671==    by 0x400318A: (within /lib/ld-2.5.so)
==20671==    by 0x4013D9A: (within /lib/ld-2.5.so)
==20671==    by 0x40012C6: (within /lib/ld-2.5.so)
==20671==    by 0x4000A67: (within /lib/ld-2.5.so)
...<snip>...
==20671== Conditional jump or move depends on uninitialised value(s)
==20671==    at 0x40152B1: (within /lib/ld-2.5.so)
==20671==    by 0x400A289: (within /lib/ld-2.5.so)
==20671==    by 0x6A42E4D: (within /lib/libc-2.5.so)
==20671==    by 0x59AE0E3: (within /lib/libdl-2.5.so)
==20671==    by 0x400D725: (within /lib/ld-2.5.so)
==20671==    by 0x59AE4EC: (within /lib/libdl-2.5.so)
==20671==    by 0x59AE099: dlsym (in /lib/libdl-2.5.so)
==20671==    by 0x57610FB: vm_sym (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x575E29E: lt_dlsym (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x57666EF: open_component (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x576711B: mca_base_component_find (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x5767A9F: mca_base_components_open (in /usr/local/lib/libopen-pal.so.0.0.0)
...<snip>...
<my code output, no valgrind errors within it>
==20671==
==20671== ERROR SUMMARY: 102 errors from 24 contexts (suppressed: 0 from 0)
==20671== malloc/free: in use at exit: 0 bytes in 0 blocks.
==20671== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==20671== For counts of detected errors, rerun with: -v
==20671== All heap blocks were freed -- no leaks are possible.

This looks particularly broken!  I've just run valgrind on another
(serial) piece of code on this machine and got three of the
uninitialised jumps from within ld-2.5.so, virtually identical to the
first three from this MPI code.  Of the 24 contexts from the MPI code,
those seeming to originate from within OpenMPI are particularly
worrying.

Am I panicking for no reason, have I likely got a bad build, or is
OpenMPI broken beyond repair?!!

> > If I run the code with mpirun -np 1 the problem goes away.  So one
> > could presumably simply say "always run it with mpirun."  But if this
> > is required, why does OpenMPI not detect it?
>
> I'm not sure what you're asking -- Open MPI does not *require* you to
> run with mpirun...

That's exactly what I was asking.  Cheers!

Ciao
Terry

-- 
Dr Terry Frankcombe
Physical Chemistry, Department of Chemistry
Göteborgs Universitet
SE-412 96 Göteborg Sweden
Ph: +46 76 224 0887   Skype: terry.frankcombe
<te...@chem.gu.se>