Random points in no particular order (Rainer, please correct me if I'm making bad assumptions):

- I believe that ORNL is proposing to do this work on a separate branch (this is what we have discussed for some time now, and we discussed this deeply in Louisville). The RFC text doesn't specifically say, but I would be very surprised if this stuff is planned to come back to the trunk in the near future -- as we have all agreed, it's not done yet.

- I believe that the timeout field in RFCs is a limit for non-responsiveness -- it is mainly intended to prevent people from ignoring / not responding to RFCs. I do not believe that Rainer was using that date as "that's when I'm bringing it all back to the trunk." Indeed, he specifically called out the 1.5 series as a target for this work.

- I also believe that Rainer is using this RFC as a means to get preliminary review of the work that has been done on the branch so far. He has provided a script that shows what they plan to do, how the code will be laid out, etc. There are still some important core issues to be solved -- and, like Brian, I want to see how they'll get solved before being happy (we have strong precedent for this requirement) -- but I think all that Rainer was saying in his RFC was "here's where we are so far; can people review and see if they hate it?"

- It was made abundantly clear in the Louisville meeting that ORTE has no short-term plans for using the ONET layer (probably no long-term plans, either, but hey -- never say "never" :-) ). The design of ONET is such that other RTEs *could* use ONET if they want (e.g., STCI will), but it is not a requirement for the underlying RTE to use ONET. We agreed in Louisville that ORTE will provide sufficient stubs and hooks (all probably effectively no-ops) so that ONET can compile against it in the default OMPI configuration; other RTEs that want to do more meaningful stuff will need to provide more meaningful implementations of those stubs and hooks (see the sketch after this list).

- Hopefully the teleconference time tomorrow works out for Rich (his communications were unclear on this point). Otherwise, postponing the admin discussion until April seems problematic.
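To make the stubs-and-hooks point concrete, here is a rough sketch of what the ORTE-side no-ops could look like. All names here (onet_rte_hooks_t, the stub functions, etc.) are hypothetical illustrations, not code from the koenig-btl branch:

  /* Hypothetical hook table an RTE hands to ONET; names invented
   * for illustration only, not taken from the branch. */
  #include <stddef.h>

  typedef struct onet_rte_hooks_t {
      /* Called before ONET opens its frameworks. */
      int (*pre_init)(void);
      /* Modex-like endpoint-info exchange; ORTE may no-op this. */
      int (*exchange_info)(const void *data, size_t len);
  } onet_rte_hooks_t;

  /* Default ORTE stubs: effectively no-ops, just enough for ONET
   * to compile and link in the default OMPI configuration. */
  static int stub_pre_init(void) { return 0; }
  static int stub_exchange_info(const void *data, size_t len) {
      (void) data; (void) len;
      return 0;
  }

  onet_rte_hooks_t onet_default_hooks = {
      .pre_init      = stub_pre_init,
      .exchange_info = stub_exchange_info,
  };

An RTE like STCI would fill in the same table with real implementations instead of the stubs.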



On Mar 9, 2009, at 4:01 PM, Brian W. Barrett wrote:

I, not surprisingly, have serious concerns about this RFC. It assumes that
the ompi_proc issues and bootstrapping issues (the entire point of the
move, as I understand it) can both be solved, but offers no proof to
support that claim.  Without those two issues solved, we would be left
with an onet layer that is dependent on ORTE and OMPI, and which OMPI
depends upon.  This is not a good place to be.  These issues should be
resolved before an onet layer is created in the trunk.

This is not an unusual requirement. The fault tolerance work took a very
long time because of similar requirements.  Not only was a full
implementation required to prove performance would not be negatively
impacted (when FT wasn't active), but we had discussions about its impact on code maintainability. We had a full implementation of all the pieces
that impacted the code *before* any of it was allowed into the trunk.

We should live by the rules the community has set up. They have served us
well in the past.  Further, these are not new objections on my part.
Since the initial RFCs related to this move started, I have continually
brought up the exact same questions and never gotten a satisfactory
answer. This RFC even acknowledges the issues, yet presents no solutions and still asks to do the most disruptive work. I simply can't see how that fits with Open MPI's long-standing development procedures.

If all the issues I've asked about previously (which are essentially the ones you've identified in the RFC) can be solved, the impact to code base
maintainability is reasonable, and the impact to performance is
negligible, I'll gladly remove my objection to this RFC.

Further, before any work on this branch is brought into the trunk, the
admin-level discussion regarding this issue should be resolved. At this time, that discussion is blocking on ORNL and they've given April as the
earliest such a discussion can occur.  So at the very least, the RFC
timeout should be pushed into April or ORNL should revise their
availability for the admin discussion.


Brian


On Mon, 9 Mar 2009, Rainer Keller wrote:

>
> What:     Move BTLs into separate layer
>
> Why: Several projects have expressed interest in using the BTLs. Use cases
> such as the RTE using the BTLs for the modex, or tools collecting/distributing
> data in the fastest possible way, may be possible.
>
> Where: This would affect several components that the BTLs depend on
> (namely allocator, mpool, rcache, and the common part of the BTLs).
> Additionally, some changes to classes were/are necessary.
>
> When: Preferably 1.5 (in case we use the Feature/Stable Release cycle ;-)
>
> Timeout:  23.03.2009
> ------------------------------------------------------------------------
>
> There has been much speculation about this project.
> This RFC should shed some light; if more information is required,
> please feel free to ask/comment. Of course, suggestions are welcome!
>
> The BTLs offer access to a fast communication framework. Several projects
> have expressed interest in using them separately from other layers of Open MPI.
> Additionally (with further changes) the BTLs may be used within ORTE itself.
>
> COURSE OF WORK:
> The extraction is not easy (as was the extraction of ORTE and OMPI in the
> early stages of Open MPI).
> In order to get as much input and visibility as possible (e.g. in TRACS),
> the tmp-branch for this work has been set up on:
>   https://svn.open-mpi.org/svn/ompi/tmp/koenig-btl
>
> We propose to have a separate ONET library living in onet, based on orte (see
> attached fig).
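
[For those reading without the attached figure: my understanding of the
proposed layering, bottom to top, is roughly

  opal  (portability / utility layer)
  orte  (run-time environment)
  onet  (new: BTLs plus allocator, mpool, rcache, and common/btl)
  ompi  (MPI layer)

with each layer intended to depend only on the layers below it.]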
>
> In order to keep the diff between the trunk and the branch to a minimum,
> several cleanup patches have already been applied to the trunk (e.g.
> unnecessary #includes of ompi and orte header files, integration of
> ompi_bitmap_t into opal_bitmap_t, #include "*_config.h").
>
>
> Additionally, a script (attached below) has been kept up to date
> (contrib/move-btl-into-onet) that will perform this separation on a fresh
> checkout of the trunk:
>  svn list https://svn.open-mpi.org/svn/ompi/tmp/koenig-btl/contrib/move-btl-into-onet
>
> This script requires several patches (see attached TAR-ball).
> Please update the variable PATCH_DIR to match the location of patches.
>
>  ./move-btl-into-onet ompi-clean/
>  # Lots of output deleted.
>  cd ompi-clean/
>  rm -fr ompi/mca/common/  # No two mcas called common, too bad...
>  ./autogen.sh
>
>
> OTHER RTEs:
> A preliminary header file is provided in onet/include/rte.h to accommodate
> the requirements of other RTEs (such as STCI); it replaces selected
> functionality, as proposed by Jeff and Ralph in the Louisville meeting.
> Additionally, this header file is included before the orte header files
> (within onet).
> By default, this does not change anything in the standard case (ORTE);
> otherwise (with -DHAVE_STCI), redefinitions of the orte functionality
> required within onet are done.
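
[A minimal sketch of the mechanism being described -- hypothetical
contents, not the actual onet/include/rte.h from the branch, and the
redirected symbol names below are invented for illustration:

  /* onet/include/rte.h (sketch).  Included before any orte header
   * inside onet. */
  #ifndef ONET_RTE_H
  #define ONET_RTE_H

  #ifdef HAVE_STCI
  /* Alternate RTE: redirect selected orte functionality to the
   * replacement implementation before the orte headers are seen. */
  #define orte_modex_send(buf)  stci_modex_send(buf)
  #else
  /* Default (ORTE) case: define nothing; orte symbols are used
   * unchanged. */
  #endif

  #endif /* ONET_RTE_H */
]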
>
>
> TESTS:
> First tests have been done locally on Linux/x86_64.
> The branch compiles without warnings.
> The wrappers have been updated.
>
> The Intel Testsuite runs without failures:
>  ./run-tests.pl  all_tests_no_perf
>
>
> PERFORMANCE:
> !!! Before any merge, do extensive performance tests on real machines !!!
> Initial tests on the cluster smoky show no difference in comparison to the
> ompi trunk.
> Please see the enclosed output of NetPipe-3.7.1 run on a single node (--mca
> btl sm,self) on smoky.
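
[For anyone wanting to reproduce this: a single-node NetPIPE run along
these lines should do it; NPmpi is the usual NetPIPE MPI driver binary,
but the exact name/path may differ in your build:

  mpirun -np 2 --mca btl sm,self NPmpi
]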
>
>
> TODOS:
> There are still some todos to finalize this:
> - Dependencies of the onet layer on the ompi layer (ompi_proc_t,
>   ompi_converter).
>   We are working on these, and have briefly talked about the latter with
>   George.
> - Better abstraction from orte / cleanups, such as the modex.
>
> If these involve code changes (and not just "safe", non-intrusive renames),
> such as an opal_keyval change, we will continue to write RFCs.
>



--
Jeff Squyres
Cisco Systems
