On Wed, 11 Mar 2009, Richard Graham wrote:
> Brian,
> Going back over the e-mail trail it seems like you have raised two
> concerns:
>   - BTL performance after the change, which I would take to be
>     - btl latency
>     - btl bandwidth
>   - Code maintainability
>     - repeated code changes that impact a large number of files
>     - A demonstration that the changes actually achieve their goal.
>       As we discussed after you got off the call, there are two
>       separate goals here
>       - being able to use the btl's outside the context of mpi, but
>         within the ompi code base
>       - ability to use the btl's in the context of a run-time other
>         than orte
> Another concern I have heard raised by others is
>   - mpi startup time
>
> Has anything else been missed here ? I would like to make sure that
> we address all the issues raised in the next version of the RFC.

I think the umbrella concerns for the final success of the change are
btl performance (in particular, latency and message rates for
cache-unfriendly applications/benchmarks) and code maintainability. In
addition, there are some intermediate change issues I have, in that
this project is working differently than other large changes. In
particular, there is/was the appearance of being asked to accept
changes which only make sense if the btl move is going to move forward,
without any way to judge the performance or code impact because
critical technical issues still remain.

The latency/message rate issues are fairly straightforward from an
end-measure point of view. My concerns on latency/message rate come
not from the movement of the BTL to another library (for most operating
systems / shared library systems that should be negligible), but from
the code changes which surround moving the BTLs. The BTLs are tightly
intertwined with a number of pieces of the OMPI layer, in particular
the BML and MPool frameworks and the ompi proc structure. I had a
productive conversation with Rainer this morning explaining why I'm so
concerned about the bml and ompi proc structures. The ompi proc
structure currently acts not only as the identifier for a remote
endpoint, but also stores endpoint-specific data for both the PML and
BML. The BML structure actually contains each BTL's per-process
endpoint information, in the form of the base_endpoint_t* structures
returned from add_procs(). Moving these structures around must be done
with care, as some of the proposals Jeff, Rainer, and I came up with
this morning either induced spaghetti code or greatly increased the
spread of information needed for the critical send path through the
memory space (thereby likely increasing cache misses on send for real
applications).
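
To make the cache-miss concern concrete, here is a minimal sketch of
the two kinds of layout being weighed. All names below (endpoint_t,
proc_inline_t, proc_split_t, endpoint_table_t and the lookup functions)
are hypothetical stand-ins for illustration only; they are not the real
ompi_proc_t, BML, or base_endpoint_t definitions, and the sketch
assumes a simple index-based side table as one possible way of moving
endpoint data out of the proc structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for per-peer endpoint data. */
typedef struct endpoint {
    int peer_rank;
} endpoint_t;

/* Layout A: the proc structure carries the endpoint pointer inline,
 * so the send path touches the proc, then the endpoint. */
typedef struct proc_inline {
    int name;
    endpoint_t *bml_endpoint;
} proc_inline_t;

/* Layout B: the proc structure is only an identifier; endpoint data
 * lives in a separate table owned by another layer. */
typedef struct proc_split {
    int name;
    size_t index;            /* slot in the side table */
} proc_split_t;

typedef struct endpoint_table {
    endpoint_t **entries;
    size_t count;
} endpoint_table_t;

/* Layout A lookup: one dependent load after reading the proc. */
endpoint_t *lookup_inline(const proc_inline_t *p)
{
    assert(p != NULL);
    return p->bml_endpoint;
}

/* Layout B lookup: reads the proc, the table header, and the entries
 * array before reaching the endpoint, so more distinct memory
 * regions sit on the send path. Returns NULL for an unknown slot. */
endpoint_t *lookup_split(const proc_split_t *p, const endpoint_table_t *t)
{
    assert(p != NULL && t != NULL);
    if (p->index >= t->count) {
        return NULL;
    }
    return t->entries[p->index];
}
```

Both lookups return the same endpoint; the difference is only in how
many separate objects must be resident in cache to get there, which is
the cost that shows up in cache-unfriendly applications rather than in
a tight ping-pong benchmark.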

The code maintainability concern comes from three separate and
independent issues. First, there is the issue of how the pieces of the
OMPI layer will interact after the move. The BML/BTL/MPool/Rcache
dance is already complicated, and care should be taken to minimize that
change. Start-up is also already quite complex, and moving the BTLs to
make them independent of starting other pieces of Open MPI can be done
well or can be done poorly. We need to ensure it's done well,
obviously. Second, there is the issue of wire-up. My impression from
conversations with everyone at ORNL was that this move of BTLs would
include changes to allow BTLs to wire-up without the RML. I understand
that Rich said this was not the case during the part of the admin
meeting I missed yesterday, so that may no longer be a concern.
Finally, there has been some discussion, mainly second hand in my case,
about the mechanisms by which the trunk would be modified to allow for
using OMPI without ORTE. I have concerns that we'd add complexity to
the BTLs to achieve that, and again that can be done poorly if we're
not careful. Talking with Jeff and Rainer this morning helped reduce
my concern in this area, but I think it also added to the technical
issues which must be solved to consider this project ready for movement
to the trunk.

There are a couple of technical issues which I believe prevent a
reasonable discussion of the performance and maintainability issues
based on the current branch. I talked about some of them in the
previous two paragraphs, but so that we have a short bullet list, they
are:
- How will the ompi_proc_t be handled? In particular,
where will PML/BML data be stored, and how will we
avoid adding new cache misses?
- How will the BML and MPool be handled? The BML holds
the BTL endpoint data, so changes have to be made if
it continues to live in OMPI.
- How will the modex and the intricate dance with adding
new procs from dynamic processes be handled?
- How will we handle the progress mechanisms in cases where
the MTLs are used and the BTLs aren't needed by the RTE?
- If there are users outside of OMPI, but who want to also use
OMPI, how will the library versioning / conflict problem be
solved?

> As was mentioned before, our time frame for this is measured in
> weeks, and not in months. I believe the date of May 1st was mentioned
> to coincide with the next feature release.

While I understand your deadline, we have in the past been very
conservative with such large changes. The C/R work was delayed for
over a year because people were concerned with the impact to
performance and maintainability. ORTE work is consistently delayed in
the name of code stability. I believe that changing our desire for
high quality code in the trunk because of an organization's deadline
(particularly when other organizations are successfully using branches
to meet their deadlines) sets a poor precedent and goes against
previous precedents.

Similarly, my concern with the intermediate changes which have been
proposed or have already occurred comes from the slippery-slope
argument. Changes which are really only necessary for the btl move
(even general code cleanups) should only occur once we're all sure the
btl move will work. Otherwise, we're impacting other developers (many
of whom are working on temp branches attempting to get a feature to
completion, as our normal process dictates) in order to reach an end
point which may not be achievable. In talking to Rainer this morning
with Jeff, I think we came up with a number of ideas on how to mitigate
this impact and find a better balance which allows ORNL to answer the
critical technical questions (which are not just mine, but are shared
by others and are critical to the "make it work" part of the process)
and allows the rest of the community some belief that we can avoid any
permanent harm if the move doesn't work out.

> One thing that should help when the naming changes are applied is
> that this is scripted, and the script can be made available for
> others that are working on temp branches, which includes us, also.

That unfortunately doesn't help other developers, if they're trying to
strictly follow the version control changes to the trunk. The problem
is that we're going to get all those moves (hopefully the script now
svn moves instead of svn rm / svn add) through the version control
system. The script would then cause all the changes to occur a second
time, and that could be very problematic. The problem with the version
control changes filtering down is that it is not all-encompassing. For
example, svn will have problems if the btl directory moves but I have
my own private special BTL. Yes, I might be able to use your scripts
to handle that, but if they aren't written with that scenario in mind,
they won't help. It also won't help if I've added a particular file to
an existing BTL and the BTL then moves.

I think these cases are worth the pain to non-ORNL developers *IF* all
the other issues are addressed. Otherwise, we're unfairly asking them
to deal with a radically changing code base for an incomplete project,
a situation we've worked to avoid in the past.

Hopefully this explains my thoughts on the btl move. I'm not opposed
to the move itself (although I reserve the right to become opposed,
based on performance and maintainability issues). I have a problem
with the change in process from previous large, invasive changes.

Hope this helps,
Brian