[OMPI devel] Fwd: [all-osl-users] OSL system outage

2007-11-27 Thread Jeff Squyres
FYI -- all of open-mpi.org (www, svn) will be down for a short period  
next Monday.



Begin forwarded message:


From: DongInn Kim
Date: November 27, 2007 2:15:43 PM CST
Subject: [all-osl-users] OSL system outage

Hi,

The OSL systems need to be rebooted for regular maintenance on Dec  
3rd (Mon) 2007.


A short outage of the OSL systems is expected on Dec 3, 2007:
- 5:00am-6:00am Pacific US time
- 6:00am-7:00am Mountain US time
- 7:00am-8:00am Central US time
- 8:00am-9:00am Eastern US time
- 1:00pm-2:00pm GMT

The following services will be unavailable during the reboot:
- Web services to any OSL-hosted domain
 (e.g., www.osl.iu.edu, www.lam-mpi.org, www.open-mpi.org,  
*.boost.org, ...)
 Note that this includes mailman services and the mail archive  
service (hypermail)
 Note that this includes bug tracking services
 Note that this includes webmail
- Incoming e-mail to any OSL-hosted domain
 (e.g., osl.iu.edu, lam-mpi.org, open-mpi.org, *.boost.org, ...)
 Note that this includes *ALL* mailing lists!
- IMAP/SSL services
- SMTP/auth services (e-mail relaying through milliways)
- Subversion services
- Trac services
- NFS services

Please let me know if you have any questions or concerns about this  
outage.


Regards,

--
- DongInn



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [OMPI users] Memory manager

2007-11-27 Thread Terry Frankcombe
Hi Jeff

> > I posted this to the devel list the other day, but it raised no
> > responses.  Maybe people will have more to say here.
> 
> Sorry Terry; many of us were at the SC conference last week, and this  
> week is short because of the US holiday.  Some of the inbox got  
> dropped/delayed as a result...

'Tis OK.  Beggars can't be choosers!  ;-)



> > Because of this I can't reduce the problem to a small testcase, and so
> > have not included any code at this stage.
> 
> Ugh.  Heisenbugs are the worst.
> 
> Have you tried with a memory checking debugger, such as valgrind, or a  
> parallel debugger?  Is there a chance that there's a simple errant  
> posted receive (perhaps in a race condition) that is unexpectedly  
> receiving into the Bug's memory location when you don't expect it?

I have zero experience with valgrind.  But I downloaded it and ran my
"minimal" case (about 1000 lines + libraries!) with it.  Thus I found
one uninitialised variable and need to go away and check my code
carefully now.  Correcting this in the most obvious, un-thought-through
way makes my Bug go away.  (But then so does changing the code in other,
unexecuted sections!)

However, what I get out of valgrind now is:

[tjf@fkpc167 Minimal]$ valgrind --leak-check=yes ./nnh
==20671== Memcheck, a memory error detector.
==20671== Copyright (C) 2002-2007, and GNU GPL'd, by Julian Seward et al.
==20671== Using LibVEX rev 1732, a library for dynamic binary translation.
==20671== Copyright (C) 2004-2007, and GNU GPL'd, by OpenWorks LLP.
==20671== Using valgrind-3.2.3, a dynamic binary instrumentation framework.
==20671== Copyright (C) 2000-2007, and GNU GPL'd, by Julian Seward et al.
==20671== For more details, rerun with: -v
==20671== 
==20671== Conditional jump or move depends on uninitialised value(s)
==20671==    at 0x40152B1: (within /lib/ld-2.5.so)
==20671==    by 0x4005278: (within /lib/ld-2.5.so)
==20671==    by 0x4007CFD: (within /lib/ld-2.5.so)
==20671==    by 0x400318A: (within /lib/ld-2.5.so)
==20671==    by 0x4013D9A: (within /lib/ld-2.5.so)
==20671==    by 0x40012C6: (within /lib/ld-2.5.so)
==20671==    by 0x4000A67: (within /lib/ld-2.5.so)

..

==20671== Conditional jump or move depends on uninitialised value(s)
==20671==    at 0x40152B1: (within /lib/ld-2.5.so)
==20671==    by 0x400A289: (within /lib/ld-2.5.so)
==20671==    by 0x6A42E4D: (within /lib/libc-2.5.so)
==20671==    by 0x59AE0E3: (within /lib/libdl-2.5.so)
==20671==    by 0x400D725: (within /lib/ld-2.5.so)
==20671==    by 0x59AE4EC: (within /lib/libdl-2.5.so)
==20671==    by 0x59AE099: dlsym (in /lib/libdl-2.5.so)
==20671==    by 0x57610FB: vm_sym (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x575E29E: lt_dlsym (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x57666EF: open_component (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x576711B: mca_base_component_find (in /usr/local/lib/libopen-pal.so.0.0.0)
==20671==    by 0x5767A9F: mca_base_components_open (in /usr/local/lib/libopen-pal.so.0.0.0)

..



==20671== 
==20671== ERROR SUMMARY: 102 errors from 24 contexts (suppressed: 0 from 0)
==20671== malloc/free: in use at exit: 0 bytes in 0 blocks.
==20671== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==20671== For counts of detected errors, rerun with: -v
==20671== All heap blocks were freed -- no leaks are possible.


This looks particularly broken!

I've just run valgrind on another (serial) piece of code on this machine
and got three of the uninitialised jumps from within ld-2.5.so, virtually
identical to the first three from this MPI code.  Of the 24 from the MPI
code, those seeming to originate from within OpenMPI are particularly
worrying.
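
One way to see where each report is actually raised is to group the
contexts by the object named in their innermost ("at") frame rather than
by whatever appears further up the stack. The following is only a rough
sketch in Python: the function name innermost_objects and the sample text
are made up for illustration, and it assumes the log has one frame per
line (i.e. not re-wrapped by a mail client). In both traces excerpted
above the "at" frame is in /lib/ld-2.5.so; libopen-pal only appears higher
up, on the way into dlsym. That does not by itself show the reports are
harmless.

```python
import re
from collections import Counter

# Matches valgrind stack-frame lines such as
#   ==20671==    at 0x40152B1: (within /lib/ld-2.5.so)
#   ==20671==    by 0x59AE099: dlsym (in /lib/libdl-2.5.so)
# Group 1 is "at" or "by"; group 2 is the object file path.
FRAME = re.compile(
    r"^==\d+==\s*(at|by) 0x[0-9A-Fa-f]+: .*\((?:with)?in ([^)]+)\)\s*$"
)


def innermost_objects(log_text):
    """Count error contexts by the object file of their innermost frame."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FRAME.match(line)
        # Only the "at" line marks the frame where the error was raised.
        if match and match.group(1) == "at":
            counts[match.group(2)] += 1
    return counts


# Hypothetical sample, shortened from the two traces quoted in this message.
sample = """\
==20671== Conditional jump or move depends on uninitialised value(s)
==20671==    at 0x40152B1: (within /lib/ld-2.5.so)
==20671==    by 0x4005278: (within /lib/ld-2.5.so)
==20671== Conditional jump or move depends on uninitialised value(s)
==20671==    at 0x40152B1: (within /lib/ld-2.5.so)
==20671==    by 0x59AE099: dlsym (in /lib/libdl-2.5.so)
==20671==    by 0x57610FB: vm_sym (in /usr/local/lib/libopen-pal.so.0.0.0)
"""

counts = innermost_objects(sample)
for obj, n in counts.most_common():
    print(n, obj)
```

Run against a full log saved to a file, this would give a per-object
tally of the 24 contexts, which makes it easier to tell reports raised
inside the dynamic loader apart from any raised inside Open MPI itself.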

Am I panicking for no reason, have I likely got a bad build or is
OpenMPI broken beyond repair?!!


> > If I run the code with mpirun -np 1 the problem goes away.  So one  
> > could
> > presumably simply say "always run it with mpirun."  But if this is
> > required, why does OpenMPI not detect it?
> 
> I'm not sure what you're asking -- Open MPI does not *require* you to  
> run with mpirun...

That's exactly what I was asking.  Cheers!

Ciao
Terry

-- 
Dr Terry Frankcombe
Physical Chemistry, Department of Chemistry
Göteborgs Universitet
SE-412 96 Göteborg Sweden
Ph: +46 76 224 0887   Skype: terry.frankcombe