Ashley, Tom
Thank you for your reply.
I added --mca btl tcp,self as an mpirun argument, following Julian's suggestion on
Tom's posting. Indeed this managed to suppress the Open MPI errors to the extent
that the valgrind log file shrank from 20 MB to only a few kB.
However, I found that discussion through a Google search, so I'd appreciate it if
you could point me to the title of the FAQ so that I can follow the complete
discussion.
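For reference, the full invocation I am using looks roughly like the sketch below. The process count, log-file names, and the executable name (./solver) are placeholders, not the actual values from my runs; %p is valgrind's expansion for the process PID, which gives one log per MPI rank.

```shell
# Restrict Open MPI to the TCP and self BTLs (avoids the shared-memory
# transport, whose buffers generate much of the valgrind noise) and run
# every rank under valgrind with a per-rank log file.
mpirun -np 4 --mca btl tcp,self \
    valgrind --log-file=vg.%p.log --error-limit=no ./solver
```

The --error-limit=no flag keeps valgrind from going silent after 10,000,000 errors, which matters when the suppressed transport noise is voluminous.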
As for my problem: the program did crash, presumably with a segfault. I say
'presumably' because valgrind simply hung the whole parallel computation after
printing the last error message.
Valgrind never printed any error until the program crashed with the messages I
quoted in my original posting. Since this happens in the middle of an iterative
process, the line where the program crashes had already been executed in previous
iterations without problem. As far as I can see, the program is calling
mpi_allreduce when it crashes. Looking at valgrind's behaviour, I have the
impression that the problem lies in Open MPI 1.2.6 rather than in my program.
(Valgrind did not report anything when I checked the same program in a
sequential environment, nor under MPICH with a GNU compilation.) However, I
was expecting valgrind to report the problem earlier, i.e. when mpi_allreduce
is called from the same program line in the very first iteration.
What I am wondering at the moment is whether the problem is hidden within the
OMPI-suppressed errors.
Also, have I been naive in expecting this sort of error to be reported cleanly?
My problem is that I have more than one call to mpi_allreduce, and the failure
seems to happen at random, both in which call fails and in the iteration number.
So my valgrind error log changes from run to run; what I quoted in my posting
is the "typical" error I have seen.
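One way I could check whether the real error is being hidden is to have valgrind print a suppression stanza for every error it sees, rather than silencing the noise up front. A hedged sketch, with placeholder names as before:

```shell
# For each error, --gen-suppressions=all appends a ready-made suppression
# stanza to the log, so you can read exactly what would be hidden and
# collect the harmless Open MPI ones into a .supp file for later runs.
mpirun -np 4 --mca btl tcp,self \
    valgrind --gen-suppressions=all --log-file=vg.%p.log ./solver

# Later runs then load only the stanzas you chose to keep:
#   valgrind --suppressions=openmpi.supp ./solver
```

This keeps the transport noise out of the way without blinding valgrind to a genuine invalid read or write in the same call chain.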
By the way, thank you for the suggestion about upgrading Open MPI.
Regards
Indi
> Subject: Re: [Valgrind-users] memcheck behaviour in random failure of an open
> mpi based code.
> From: [email protected]
> To: [email protected]
> CC: [email protected]
> Date: Mon, 30 Nov 2009 18:50:00 +0000
>
> On Mon, 2009-11-30 at 10:20 -0700, tom fogal wrote:
> > Ashley Pittman <[email protected]> writes:
> > > On Sun, 2009-11-29 at 19:34 -0700, tom fogal wrote:
> > > > Indi Tristanto <[email protected]> writes:
> > > > > I am trying to debug a large iterative solver that has been compiled
> > > > > using intel fortran 10 and open mpi 1.2.6.
> > > > >
> > > > This looks (very) familiar to an issue I brought up with the OpenMPI
> > > > folks earlier this year. See ticket 1942:
> > > >
> > > > https://svn.open-mpi.org/trac/ompi/ticket/1942
> > >
> > > The error in that ticket is about uninitialised reads which do happen
> > > and are semi-expected with socket programming.
> >
> > No, it is not. It is about valgrinding OpenMPI programs. It links to
> > a thread which originally started with uninitialized reads, but if you
> > follow the thread you'll note that the discussion became much wider
> > than the original posting.
> >
> > > The error in this email is about a crash (segfault) in the open mpi
> > > library, I doubt the two are related.
> >
> > At no point in Indi's email did he mention the application segfault or
> > crashed.
>
> I was thinking about the "Adress 0x10 is not stack'd, malloc'd or
> (recently) free'd" message. Actually there's a spelling mistake in that
> error message, I assume this is from transmission somewhere rather than
> in the actual valgrind output.
>
> Given that OpenMPI has it's own malloc implementation it's likely that
> allocations aren't being intercepted and buffer over-runs aren't being
> intercepted and quite possible the error is being caused by an invalid
> write that valgrind isn't catching.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
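P.S. On Ashley's point about Open MPI's private malloc: if it helps anyone following the thread, my understanding is that the 1.2-era build system could be configured without the internal ptmalloc2-based memory manager, so that valgrind's own malloc replacement intercepts allocations again. A sketch under that assumption (the flag name should be verified against ./configure --help for the version in hand):

```shell
# Rebuild Open MPI with its internal memory manager disabled, installing
# to a scratch prefix so the system copy is untouched.
./configure --prefix=$HOME/openmpi-novg --without-memory-manager
make -j4 && make install
```

With such a build, heap over-runs in or around MPI calls should show up as ordinary invalid reads/writes instead of a bare segfault.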
_______________________________________________
Valgrind-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/valgrind-users