I guess then I missed the point of this RFC, if it is not to move code. It talks about bringing this code into the trunk for the 1.5 time frame. If it's just getting general comments and there will be an RFC for all the changes (including the onet split proposed below) when the issues have been solved, that's great. I'll comment on the proposal as a whole once my four-month-old questions are answered. Until then, I don't think we should be using the RFC process to get permission to move portions of a project with critical questions unanswered (which is exactly what this RFC reads as doing).

Brian


On Mon, 9 Mar 2009, Rainer Keller wrote:

Hi Jeff,
thanks for the mail!
I completely agree with your points.

To stress the fact: the timeout date does not mean that we intend to just
commit to trunk by that date.
Rather, it was meant to collect comments from all interested parties by that
date (this is what I remembered from previous RFCs, but I could be
wrong...).
All the work that has been committed is meant to clean up the code. Anything
that went beyond a cleanup deserved an RFC and input from many people (such as
the bitmap_t change...).

We still intend, as in the Louisville meeting, to have as much input from the
community as possible (that's why this is a TRACS-visible svn-tmp-branch).

Thanks,
Rainer



On Monday 09 March 2009 04:52:28 pm Jeff Squyres wrote:
Random points in no particular order (Rainer please correct me if I'm
making bad assumptions):

- I believe that ORNL is proposing to do this work on a separate
branch (this is what we have discussed for some time now, and we
discussed this deeply in Louisville).  The RFC text doesn't
specifically say, but I would be very surprised if this stuff is
planned to come back to the trunk in the near future -- as we have all
agreed, it's not done yet.

- I believe that the timeout field in RFCs is a limit for non-
responsiveness -- it is mainly intended to prevent people from
ignoring / not responding to RFCs.  I do not believe that Rainer was
using that date as a "that's when I'm bringing it all back to the
trunk."  Indeed, he specifically called out the 1.5 series as a target
for this work.

- I also believe that Rainer is using this RFC as a means to get
preliminary review of the work that has been done on the branch so
far.  He has provided a script that shows what they plan to do, how
the code will be laid out, etc.  There are still some important core
issues to be solved -- and, like Brian, I want to see how they'll get
solved before being happy (we have strong precedent for this
requirement) -- but I think all that Rainer was saying in his RFC was
"here's where we are so far; can people review and see if they hate it?"

- It was made abundantly clear in the Louisville meeting that ORTE has
no short-term plans for using the ONET layer (probably no long-term
plans, either, but hey -- never say "never" :-) ).  The design of ONET
is such that other RTE's *could* use ONET if they want (e.g., STCI
will), but it is not a requirement for the underlying RTE to use
ONET.  We agreed in Louisville that ORTE will provide sufficient stubs
and hooks (all probably effectively no-ops) so that ONET can compile
against it in the default OMPI configuration; other RTEs that want to
do more meaningful stuff will need to provide more meaningful
implementations of the stubs and hooks.

- Hopefully the teleconference time tomorrow works out for Rich (his
communications were unclear on this point).  Otherwise, postponing the
admin discussion until April seems problematic.

On Mar 9, 2009, at 4:01 PM, Brian W. Barrett wrote:
I, not surprisingly, have serious concerns about this RFC.  It assumes that
the ompi_proc issues and bootstrapping issues (the entire point of the
move, as I understand it) can both be solved, but offers no proof to
support that claim.  Without those two issues solved, we would be left
with an onet layer that is dependent on ORTE and OMPI, and which OMPI
depends upon.  This is not a good place to be.  These issues should be
resolved before an onet layer is created in the trunk.

This is not an unusual requirement.  The fault tolerance work took a very
long time because of similar requirements.  Not only was a full
implementation required to prove performance would not be negatively
impacted (when FT wasn't active), but we had discussions about its impact
on code maintainability.  We had a full implementation of all the pieces
that impacted the code *before* any of it was allowed into the trunk.

We should live by the rules the community has set up.  They have served us
well in the past.  Further, these are not new objections on my part.
Since the initial RFCs related to this move started, I have continually
brought up the exact same questions and never gotten a satisfactory
answer.  This RFC even acknowledges the issues, but without presenting any
solution, and still asks to do the most disruptive work.  I simply can't
see how that fits with Open MPI's long-standing development procedures.

If all the issues I've asked about previously (which are essentially the
ones you've identified in the RFC) can be solved, the impact to code base
maintainability is reasonable, and the impact to performance is
negligible, I'll gladly remove my objection to this RFC.

Further, before any work on this branch is brought into the trunk, the
admin-level discussion regarding this issue should be resolved.  At this
time, that discussion is blocking on ORNL and they've given April as the
earliest such a discussion can occur.  So at the very least, the RFC
timeout should be pushed into April or ORNL should revise their
availability for the admin discussion.


Brian

On Mon, 9 Mar 2009, Rainer Keller wrote:
What:     Move BTLs into a separate layer

Why:      Several projects have expressed interest in using the BTLs. Use cases
          such as the RTE using the BTLs for modex, or tools collecting and
          distributing data in the fastest possible way, may become possible.

Where:    This would affect several components that the BTLs depend on
          (namely allocator, mpool, rcache and the common part of the BTLs).
          Additionally, some changes to classes were/are necessary.

When:     Preferably 1.5 (in case we use the Feature/Stable Release cycle ;-)

Timeout:  23.03.2009

------------------------------------------------------------------------

There has been much speculation about this project.
This RFC should shed some light; if more information is required,
please feel free to ask/comment. Of course, suggestions are welcome!

The BTLs offer access to a fast communication framework. Several projects have
expressed interest in using them separately from the other layers of Open MPI.
Additionally (with further changes), BTLs may be used within ORTE itself.

COURSE OF WORK:
The extraction is not easy (as was the extraction of ORTE and OMPI in the
early stages of Open MPI?).
In order to get as much input and be as visible as possible (e.g. in TRACS),
the tmp-branch for this work has been set up on:
  https://svn.open-mpi.org/svn/ompi/tmp/koenig-btl

We propose to have a separate ONET library living in onet, based on orte (see
attached fig).

In order to keep the diff between the trunk and the branch to a minimum,
several cleanup patches have already been applied to the trunk (e.g.
unnecessary #include of ompi and orte header files, integration of
ompi_bitmap_t into opal_bitmap_t, #include "*_config.h").


Additionally, a script (attached below) has been kept up-to-date
(contrib/move-btl-into-onet) that will perform this separation on a fresh
checkout of trunk:
 svn list https://svn.open-mpi.org/svn/ompi/tmp/koenig-btl/contrib/move-btl-into-onet

This script requires several patches (see attached TAR-ball).
Please update the variable PATCH_DIR to match the location of patches.

 ./move-btl-into-onet ompi-clean/
 # Lots of output deleted.
 cd ompi-clean/
 rm -fr ompi/mca/common/  # No two mcas called common, too bad...
 ./autogen.sh


OTHER RTEs:
A preliminary header file is provided in onet/include/rte.h to accommodate the
requirements of other RTEs (such as stci); it replaces selected
functionality, as proposed by Jeff and Ralph in the Louisville meeting.

Additionally, this header file is included before orte-header files (within
onet)...
By default, this does not change anything in the standard case (ORTE);
otherwise, with -DHAVE_STCI, the orte functionality that components
require within onet is redefined.
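
As a rough illustration of the mechanism described above, the following is a minimal sketch of a compile-time RTE indirection. It is not the contents of onet/include/rte.h: every identifier in it (the ONET_RTE_* macros, the demo_* functions, onet_publish_endpoint) is hypothetical, and only the -DHAVE_STCI switch is taken from the text. The default branch stands in for the ORTE case with effectively no-op stubs.

```c
#include <stddef.h>
#include <string.h>

#ifdef HAVE_STCI
/* Hypothetical alternative RTE, selected at compile time with -DHAVE_STCI. */
const char *demo_stci_name(void) { return "stci"; }

int demo_stci_modex_send(const void *buf, size_t len)
{
    /* A real RTE would publish the data; this sketch only validates it. */
    return (buf != NULL && len > 0) ? 0 : -1;
}

#define ONET_RTE_NAME()            demo_stci_name()
#define ONET_RTE_MODEX_SEND(b, l)  demo_stci_modex_send((b), (l))
#else
/* Default case: hypothetical stand-ins for ORTE calls, effectively no-ops. */
const char *demo_orte_name(void) { return "orte"; }

int demo_orte_modex_send(const void *buf, size_t len)
{
    (void)buf;   /* unused in the no-op stub */
    (void)len;
    return 0;    /* always reports success */
}

#define ONET_RTE_NAME()            demo_orte_name()
#define ONET_RTE_MODEX_SEND(b, l)  demo_orte_modex_send((b), (l))
#endif

/* Code inside the (hypothetical) onet layer uses only the ONET_RTE_* names,
 * so it compiles unchanged against either RTE. */
int onet_publish_endpoint(const char *endpoint)
{
    return ONET_RTE_MODEX_SEND(endpoint, strlen(endpoint) + 1);
}
```

Built without -DHAVE_STCI, onet_publish_endpoint() resolves to the no-op stub, which corresponds to the "does not change anything in the standard case" behaviour; building with -DHAVE_STCI swaps in the other implementation without touching the caller.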


TESTS:
First tests have been done locally on Linux/x86_64.
The branch compiles without warnings.
The wrappers have been updated.

The Intel Testsuite runs without failures:
 ./run-tests.pl  all_tests_no_perf


PERFORMANCE:
!!!Before any merge, do extensive performance tests on real machines!!!

Initial tests on the cluster smoky show no difference in comparison to
ompi-trunk.
Please see the enclosed output of NetPipe-3.7.1 run on a single node (--mca
btl sm,self) on smoky.


TODOS:
There are still some todos to finalize this:
- Dependencies of the onet-layer on the ompi-layer (ompi_proc_t,
ompi_converter)
 We are working on these, and have briefly talked about the latter with
George.
- Better abstraction from orte / cleanups, such as modex

If these involve code changes (and not just "safe" and non-intrusive renames),
such as an opal_keyval change, we will continue to write RFCs.

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

