WHAT: a) Clarify the actual max MPI payload size for eager messages
(i.e., the exact meaning of btl_XXX_eager_limit), and b) allow
network administrators to shape network traffic by publishing
actual BTL max wire fragment sizes (i.e., MPI max payload size +
max PML header size + max BTL header size).
WHY: Currently, BTL eager_limit values have the PML header subtracted
     from them, meaning that the eager_limit is not actually the
     largest MPI message payload size.  Terry and Jeff,
at least, find this misleading. :-) Additionally, BTLs may add
their own (variable-sized) headers beyond the eager_limit size,
so it's not possible for a network administrator to shape network
     traffic, because they don't (and can't) know what a BTL's max
     wire fragment size is.
WHERE: ompi/pml/{ob1,csum,dr}, and likely all BTLs
TIMEOUT: COB, Friday, 31 July 2009
DESCRIPTION:
In trying to fix the checks for eager_limit in the OB1 PML (per
discussion on the OMPI teleconf this past Tuesday), I've come across a
couple of gaps.  This RFC is to get others' (mainly Brian Barrett's
and George Bosilca's) opinions on exactly what should be done for
issue #1 and the OK to implement issue #2.
1. The btl_XXX_eager_limit values are the upper limit on each eager
   fragment's payload, but that payload must include the PML header.
   Hence, the max MPI data payload size is (btl_XXX_eager_limit - PML
   header size) -- and even that depends on which flavor of PML send
   you are using.
Terry and Jeff find this misleading. Specifically, if a user sets
   the eager_limit to 1024 bytes and expects their 256 MPI_INTs to
fit in an eager message, they're wrong. Additionally, network
administrators who try to adjust the eager_limit to fit the max MTU
size of their networks are unpleasantly surprised because the BTL
may actually send (btl_XXX_eager_limit + btl_XXX_header_size) bytes
at a time. Even worse, the value of btl_XXX_header_size is not
published anywhere, so a network administrator cannot know if
they're actually going over the MTU size or not.
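   To make the arithmetic concrete, here is a tiny standalone sketch
   of today's semantics.  The PML header size below is a made-up
   placeholder (not the size of any real OMPI header struct), chosen
   only to show why the 256-MPI_INT example above fails:

       #include <stdio.h>

       int main(void)
       {
           /* Illustrative placeholders -- not real OMPI constants */
           const size_t eager_limit  = 1024;   /* btl_XXX_eager_limit */
           const size_t pml_hdr_size = 64;     /* hypothetical PML header */
           const size_t user_payload = 256 * sizeof(int); /* 256 MPI_INTs */

           /* Current semantics: the PML header is carved out of the
              eager_limit, so the max MPI payload is smaller than the
              MCA parameter value suggests. */
           const size_t max_mpi_payload = eager_limit - pml_hdr_size;

           printf("max MPI payload: %zu bytes\n", max_mpi_payload);
           printf("256 MPI_INTs (%zu bytes) sent eagerly: %s\n",
                  user_payload,
                  user_payload <= max_mpi_payload ? "yes" : "no");
           return 0;
       }

   With sizeof(int) == 4, the payload is 1024 bytes but only 960 are
   available, so the message is not sent eagerly.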
--> Note that we only looked at eager_limit -- similar issues
likely also exist with btl_XXX_max_send_size, and possibly
btl_XXX_rdma_pipeline_send_length...?
btl_XXX_rdma_pipeline_frag_size (i.e., the RDMA size) should be
ok -- I *think* it's an absolute payload size already. If you
don't remember what these names mean, look at the pretty
picture here:
http://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.3
There are two solutions I can think of. Which should we do?
   a. Pass the (max?) PML header size down into the BTL during
      initialization such that the btl_XXX_eager_limit can represent
      the max MPI data payload size (i.e., the BTL can size its
      buffers to accommodate its desired max eager payload size, its
      own header size, and the PML header size).  Thus, the
      eager_limit truly becomes the max MPI data payload size -- and
      is easy to explain to users.
   b. Stay with the current btl_XXX_eager_limit implementation (which
      OMPI has had for a long, long time) and add code to check for
      btl_eager_limit being less than the PML header size (per this
      past Tuesday's discussion).  This is the minimal-change option.
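   For 1b, the check itself would be small.  Here is a minimal,
   self-contained sketch of the idea; the function name and parameters
   are hypothetical stand-ins, not the actual code that would land in
   ob1/csum/dr (which would presumably use opal_show_help() to print
   something friendly):

       #include <stddef.h>

       /* Hypothetical sketch of the 1b sanity check: refuse to use a
          BTL whose eager_limit cannot even hold the PML header.
          All names here are placeholders, not real OMPI symbols. */
       static int check_btl_eager_limit(size_t btl_eager_limit,
                                        size_t pml_hdr_max_size)
       {
           if (btl_eager_limit < pml_hdr_max_size) {
               return -1;  /* error: this BTL cannot carry any payload */
           }
           return 0;
       }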
2. OMPI currently does not publish enough information for a user to
   set the eager_limit for BTL traffic shaping.  That is, one really
   needs to know the (max) BTL header length and the (max) PML header
   length to be able to calculate the eager_limit that forces a
   specific (max) BTL wire fragment size.  Our proposed solution is
   to have ompi_info print out the (max) PML and BTL header sizes.
   Regardless of whether 1a) or 1b) is chosen, with these two pieces
   of information a determined network administrator could calculate
   the max wire fragment size used by OMPI, and therefore do at least
   some traffic shaping.
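   As a sketch of what an administrator could do once those values
   are published, assume ompi_info reports (hypothetical) max PML and
   BTL header sizes.  Under the current/1b semantics -- where the PML
   header is already inside the eager_limit -- the wire-fragment
   arithmetic is just:

       #include <stdio.h>

       int main(void)
       {
           /* Illustrative placeholders for values ompi_info would
              publish under this proposal -- not real OMPI numbers */
           const size_t eager_limit = 1024;  /* btl_XXX_eager_limit */
           const size_t btl_hdr_max = 48;    /* hypothetical max BTL header */
           const size_t mtu         = 1500;  /* network MTU to shape for */

           /* 1b semantics: the PML header is already counted inside
              the eager_limit, so only the BTL header is added on the
              wire.  (Under 1a it would be payload + PML header + BTL
              header.) */
           const size_t max_wire_frag = eager_limit + btl_hdr_max;

           printf("max wire fragment: %zu bytes (%s the %zu-byte MTU)\n",
                  max_wire_frag,
                  max_wire_frag <= mtu ? "fits within" : "exceeds",
                  mtu);
           return 0;
       }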
--
Jeff Squyres
[email protected]