> Erik Nordmark wrote:
> > 
> >> > Perhaps I didn't see it, but which (if either) MTU corresponds
> >> > to what may be publically seen?  Or does it matter (if IP
> >> > fragmenting hides the distinction)?
> > 
> > IP fragmentation hides the distinction.
> > 
> > It is the unicast MTU that is reported using the various interfaces
> > that report MTU (dladm, ifconfig, routing sockets).
> 
> If you're sending multicast, I'd think you'd want to avoid the
> complications involved with either fragmentation or (yikes!) MTU
> discovery.
> 
> If the application writer doesn't just force the issue off onto the
> user (by requiring tuning) or always using 576 or 1280 minima, how can
> his IP multicast-using application discover the correct MTU?
> 
> I didn't think it was too uncommon to rely on SIOCGIFMTU to determine
> how large a multicast packet one could send to directly-attached peers.
> E.g.:
> http://fxr.googlebit.com/source/usr.sbin/route6d/route6d.c?v=NETBSD-CURRENT#L788
> 
> -- 
> James Carlson         42.703N 71.076W
>         <carls...@workingcode.com>

I agree, and that's what I was hinting at with my initial question:
if the publicly visible value corresponded to the greater MTU, then
multicast software that tried to avoid fragmentation simply by staying
below the public value would be misled.  Conversely, if the lesser value
were publicly visible, MTU-aware unicast datagram software would get
less than the full benefit of the actual unicast MTU.  Either way, if
only one value is publicly visible, someone's expectations get hurt.

I think the question then becomes:
* In the general case (given the possibility mentioned of extending the
mechanism to other transports), can one identify clients for which this
would cause problems, and is there an existing example of something
similar, along with a solution for such problems?
* Even if it is a problem in the general case, is there any expectation
that such clients would be running over InfiniBand specifically (i.e.,
is it an immediate problem)?  Does it help at all that such
fragmentation would be handled by the local IP stack rather than by
some intermediate node (would that perhaps at least make it easier to
provide some indication that otherwise unexpected fragmentation was
occurring)?

Or maybe the "proper" solution is for broadcast/multicast software to
implement proper PMTUD at user level (sending messages with the DF flag
set, examining ipm_nextmtu on receipt of an ICMP_UNREACH_NEEDFRAG
message, etc.)?  In which case, the question ends up being "is the
impact of that change on application software acceptable?"
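The kernel-assisted variant of that, at least on Linux (a sketch under
that assumption; Solaris would differ), is to force DF on the socket and
read back the stack's current path-MTU estimate, re-reading after any
EMSGSIZE from a send (the probe/retry loop is elided here):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns the kernel's cached path-MTU estimate toward dst:port,
 * or -1 on error.  With IP_PMTUDISC_DO set, outgoing datagrams carry
 * DF, so routers must return ICMP need-fragmentation errors rather
 * than fragmenting; the kernel folds those into the value IP_MTU
 * reports for a connected socket. */
int probe_path_mtu(const char *dst, unsigned short port)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0)
        return -1;

    int val = IP_PMTUDISC_DO;   /* set DF on outgoing datagrams */
    if (setsockopt(s, IPPROTO_IP, IP_MTU_DISCOVER,
                   &val, sizeof(val)) < 0) {
        close(s);
        return -1;
    }

    struct sockaddr_in sin;
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(port);
    if (inet_pton(AF_INET, dst, &sin.sin_addr) != 1 ||
        connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        close(s);
        return -1;
    }

    /* IP_MTU is only meaningful on a connected socket. */
    int mtu = -1;
    socklen_t len = sizeof(mtu);
    if (getsockopt(s, IPPROTO_IP, IP_MTU, &mtu, &len) < 0)
        mtu = -1;

    close(s);
    return mtu;
}
```

Of course this only works per-destination, which is exactly what makes
it awkward for multicast senders.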

In summary, if someone is paying for InfiniBand, they probably will
expect serious performance.  Failing to take advantage of the
higher MTU possible with unicast IPoIB-CM would shortchange them;
but misleading multicast/broadcast clients would also be sub-optimal.

It sure looks to me like a shortfall of RFC 4755 that it makes no attempt
to suggest what information might be made available to an application
regarding per-connection MTU, let alone unicast vs multicast/broadcast
(or more accurately, UD vs RC/UC) MTU.  Moreover,

=== quote ===
7.2.  IPoIB-CM Per-Destination MTU

   As described above, interfaces on the same subnet may support
   different link MTUs based on the negotiated value or due to the link
   type (UD or connected mode).  Therefore, an implementation might
   choose to define a large IP MTU, which is reduced based on the MTU to
   the destination.  The relevant MTU may be stored in a suitable per-
   destination object, such as a route cache or a neighbor cache.  The
   per-destination MTU is known to the IPoIB-CM interface as described
   in Section 5.

   Implementations might choose not to support differing MTU values and
   always support an MTU equal to the IPoIB-UD MTU determined from the
   broadcast GID.
=== end quote ===

leaves me thinking that one would still have to be careful to
interoperate correctly with other implementations that
"choose not to support differing MTU values".

Might it not be better to publicly report the _lower_ value as the MTU,
and just silently _use_ the higher value (actually, the value negotiated
for that IB connection) for unicast traffic?  That would introduce no
new user-visible mechanisms and no unexpected fragmentation.  All it
would do is cause MTU-aware (mostly datagram) applications to
accommodate the lower value even when that wasn't necessary; it wouldn't
bother non-MTU-aware applications at all.  (And I gather that it's
probably the lower value that's being reported now.)  It would also be
more nearly meaningful when communicating with a remote end that was
only UD-capable or that chose to use one MTU for everything.
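That "report the lower, use the higher" policy boils down to a small
selection rule; here it is as a sketch (all names and numbers are
hypothetical, with 2044 and 65520 chosen only as plausible IPoIB-UD and
IPoIB-CM figures):

```c
#include <stdbool.h>

/* Hypothetical per-interface MTU state: the UD MTU determined from the
 * broadcast GID, and a per-destination connected-mode MTU (0 when no
 * RC/UC path to the peer exists). */
struct ipoib_mtus {
    int ud_mtu;     /* e.g. 2044 for a 2048-byte IB MTU */
    int cm_mtu;     /* e.g. 65520 when connected mode is up */
};

/* What SIOCGIFMTU, dladm, ifconfig, etc. would report: always the
 * lesser (UD) value, so MTU-aware multicast senders are never misled. */
int reported_mtu(const struct ipoib_mtus *m)
{
    return m->ud_mtu;
}

/* What the stack would silently use on the wire: the negotiated
 * connected-mode MTU for unicast when one exists, otherwise the UD
 * MTU (multicast/broadcast, or a UD-only peer). */
int effective_mtu(const struct ipoib_mtus *m, bool is_unicast)
{
    if (is_unicast && m->cm_mtu > 0)
        return m->cm_mtu;
    return m->ud_mtu;
}
```

The point of the split is that `reported_mtu()` is conservative (nothing
an application stays under can end up fragmented), while
`effective_mtu()` still lets stream traffic reap the connected-mode
benefit.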

Disclaimer: I'm no network guru, and this is the first time I've so much
as looked at RFCs 4391 and 4755.  I have no idea what other
implementations supporting IPoIB-CM do (it might be worth finding
out...).
-- 
This message posted from opensolaris.org
_______________________________________________
opensolaris-arc mailing list
opensolaris-arc@opensolaris.org
