I opened ticket for the bug:
https://svn.open-mpi.org/trac/ompi/ticket/1389

Ralph Castain wrote:
It looks like a new issue to me, Pasha. Possibly a side consequence of the
IOF change made by Jeff and I the other day. From what I can see, it looks
like you app was a simple "hello" - correct?

If you look at the error, the problem occurs when mpirun is trying to route
a message. Since the app is clearly running at this time, the problem is
probably in the IOF. The error message shows that mpirun is attempting to
route a message to a jobid that doesn't exist. We have a test in the RML
that forces an "abort" if that occurs.

I would guess that there is either a race condition or memory corruption
occurring somewhere, but I have no idea where.

This may be the "new hole in the dyke" I cautioned about in earlier notes
regarding the IOF... :-)

Still, given that this hits rarely, it probably is a more acceptable bug to
leave in the code than the one we just fixed (duplicated stdin)...

Ralph



On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)" <pa...@dev.mellanox.co.il> wrote:

Please see http://www.open-mpi.org/mtt/index.php?do_redir=764

The error is not consistent. It takes a lot of iteration to reproduce it.
In my MTT testing I seen it few times.

Is it know issue ?

Regards,
Pasha
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


Reply via email to