All these errors are in MPI_Finalize, so they should not be that hard
to find. I'll take a look later this afternoon.
george.
On Jan 14, 2009, at 06:41 , Tim Mattox wrote:
Unfortunately, although this fixed some problems when enabling
hierarch coll, there is still a segfault in two of IU's tests that
only shows up when we set
-mca coll_hierarch_priority 100
See this MTT summary to see how the failures improved on the trunk,
but that there are still two that segfault even at 1.4a1r20267:
http://www.open-mpi.org/mtt/index.php?do_redir=923
This link just has the remaining failures:
http://www.open-mpi.org/mtt/index.php?do_redir=922
So, I'll vote for applying the CMR for 1.3 since it clearly improved
things, but there is still more to be done to get coll_hierarch ready
for regular use.
On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca <bosi...@eecs.utk.edu> wrote:
Here we go by the book :)
https://svn.open-mpi.org/trac/ompi/ticket/1749
george.
On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:
Let's debate tomorrow when people are around, but first you have to
file a CMR... :-)
On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
Unfortunately, this pinpoints the fact that we didn't test the
collective module mixing enough. I went over the tuned collective
functions and changed all instances to use the correct module
information. It is now on the trunk, revision 20267. At the same time,
I checked that all other collective components do the right thing ...
and I have to admit tuned was the only faulty one.
This is clearly a bug in tuned, and correcting it will allow people to
use hierarch. In its current incarnation, 1.3 will mostly/always
segfault when hierarch is active. I would prefer not to put a broken
toy out there. How about pushing r20267 into 1.3?
george.
On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
Thanks for digging into this. Can you file a bug? Let's mark it for
v1.3.1.
I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and
since hierarch isn't currently selected by default (you must
specifically elevate hierarch's priority to get it to run), there's no
danger that users will run into this problem in default runs.
But clearly the problem needs to be fixed, and therefore we need a bug
to track it.
On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
I just debugged the Reduce_scatter bug mentioned previously. The bug
is unfortunately not in hierarch, but in tuned.
Here is the code snippet causing the problems:

    int reduce_scatter (...., mca_coll_base_module_t *module)
    {
        ...
        err = comm->c_coll.coll_reduce (...., module);
        ...
    }

but it should be

    {
        ...
        err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
        ...
    }

The problem as it stands is that when using hierarch, only a subset of
the functions is set, e.g. reduce, allreduce, bcast and barrier. Thus,
reduce_scatter comes from tuned in most scenarios and calls the
subsequent functions with the wrong module. Hierarch of course does
not like that :-)
Anyway, a quick glance through the tuned code reveals a significant
number of instances where this appears (reduce_scatter, allreduce,
allgather, allgatherv). Basic, hierarch and inter seem to do that
mostly correctly.
Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab http://pstl.cs.uh.edu
Department of Computer Science University of Houston
Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Jeff Squyres
Cisco Systems
--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/