Ashley, Tom
Thank you for your reply.
I added --mca btl tcp,self as an mpirun argument, following Julian's suggestion on
Tom's posting. Indeed this managed to suppress the Open MPI errors to the extent
that the valgrind log file shrank from 20 MB to only a few kB.
However, I found that discussion through a Google search, so I'd appreciate it if
you could point me to the title of the FAQ so that I can follow the complete
discussion.
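For reference, the full invocation I am using looks roughly like the sketch below. The process count, log-file names, and the executable name (./solver) are placeholders, not the actual values from my runs; %p is valgrind's expansion for the process PID, which gives one log per MPI rank.

```shell
# Restrict Open MPI to the TCP and self BTLs (avoids the shared-memory
# transport, whose buffers generate much of the valgrind noise) and run
# every rank under valgrind with a per-rank log file.
mpirun -np 4 --mca btl tcp,self \
    valgrind --log-file=vg.%p.log --error-limit=no ./solver
```

The --error-limit=no flag keeps valgrind from going silent after 10,000,000 errors, which matters when the suppressed transport noise is voluminous.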
As for my problem: the program did crash, presumably with a segfault. I say
'presumably' because valgrind simply hung the whole parallel computation after
printing the last error message.
Valgrind never printed any error until the program crashed with the messages I
quoted in my original posting. Since this happens in the middle of an iterative
process, the line where the program crashes had already been executed in previous
iterations without problem. As far as I can see, the program is calling
mpi_allreduce when it crashes. Looking at valgrind's behaviour, I have the
impression that the problem lies in Open MPI 1.2.6 rather than in my program.
(Valgrind did not report anything when I checked the same program in a
sequential environment, nor under MPICH with a GNU compilation.) However, I
was expecting valgrind to report the problem earlier, i.e. when mpi_allreduce
is called from the same program line in the very first iteration.
What I am wondering at the moment is whether the problem is hidden within the
OMPI-suppressed errors.
Also, have I been naive in expecting this sort of error to be reported cleanly?
My problem is that I have more than one call to mpi_allreduce, and the failure
seems to happen at random, both in which call fails and in the iteration number.
So my valgrind error log changes from run to run; what I quoted in my posting
is the "typical" error I have seen.
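One way I could check whether the real error is being hidden is to have valgrind print a suppression stanza for every error it sees, rather than silencing the noise up front. A hedged sketch, with placeholder names as before:

```shell
# For each error, --gen-suppressions=all appends a ready-made suppression
# stanza to the log, so you can read exactly what would be hidden and
# collect the harmless Open MPI ones into a .supp file for later runs.
mpirun -np 4 --mca btl tcp,self \
    valgrind --gen-suppressions=all --log-file=vg.%p.log ./solver

# Later runs then load only the stanzas you chose to keep:
#   valgrind --suppressions=openmpi.supp ./solver
```

This keeps the transport noise out of the way without blinding valgrind to a genuine invalid read or write in the same call chain.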
By the way, thank you for the suggestion about upgrading Open MPI.
Regards
Indi
> Subject: Re: [Valgrind-users] memcheck behaviour in random failure of an open
> mpi based code.
> From: [email protected]
> To: [email protected]
> CC: [email protected]
> Date: Mon, 30 Nov 2009 18:50:00 +0000
>
> On Mon, 2009-11-30 at 10:20 -0700, tom fogal wrote:
> > Ashley Pittman <[email protected]> writes:
> > > On Sun, 2009-11-29 at 19:34 -0700, tom fogal wrote:
> > > > Indi Tristanto <[email protected]> writes:
> > > > > I am trying to debug a large iterative solver that has been compiled
> > > > > using intel fortran 10 and open mpi 1.2.6.
> > > > >
> > > > This looks (very) familiar to an issue I brought up with the OpenMPI
> > > > folks earlier this year. See ticket 1942:
> > > >
> > > > https://svn.open-mpi.org/trac/ompi/ticket/1942
> > >
> > > The error in that ticket is about uninitialised reads which do happen
> > > and are semi-expected with socket programming.
> >
> > No, it is not. It is about valgrinding OpenMPI programs. It links to
> > a thread which originally started with uninitialized reads, but if you
> > follow the thread you'll note that the discussion became much wider
> > than the original posting.
> >
> > > The error in this email is about a crash (segfault) in the open mpi
> > > library, I doubt the two are related.
> >
> > At no point in Indi's email did he mention the application segfault or
> > crashed.
>
> I was thinking about the "Adress 0x10 is not stack'd, malloc'd or
> (recently) free'd" message. Actually there's a spelling mistake in that
> error message, I assume this is from transmission somewhere rather than
> in the actual valgrind output.
>
> Given that OpenMPI has it's own malloc implementation it's likely that
> allocations aren't being intercepted and buffer over-runs aren't being
> intercepted and quite possible the error is being caused by an invalid
> write that valgrind isn't catching.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
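P.S. On Ashley's point about Open MPI's private malloc: if it helps anyone following the thread, my understanding is that the 1.2-era build system could be configured without the internal ptmalloc2-based memory manager, so that valgrind's own malloc replacement intercepts allocations again. A sketch under that assumption (the flag name should be verified against ./configure --help for the version in hand):

```shell
# Rebuild Open MPI with its internal memory manager disabled, installing
# to a scratch prefix so the system copy is untouched.
./configure --prefix=$HOME/openmpi-novg --without-memory-manager
make -j4 && make install
```

With such a build, heap over-runs in or around MPI calls should show up as ordinary invalid reads/writes instead of a bare segfault.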
_______________________________________________
Valgrind-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/valgrind-users