Jim Klimov writes:
> >> So, as far as I can tell from the respected gurus' responses, this behavior
> >> is expected, works as designed, and won't be fixed. Correct?
> 
> > No, I think that goes a bit too far. It's just software. It can be
> > changed. ;-}
> 
> Well, I exaggerated a bit. I hope sometime in the realistic future these
> more common scenarios are ironed out, and that yelling "it is indeed a
> real-life problem" helps speed up the process ;)

I don't think anyone has doubted that it's a real-life problem.  The
main complication (besides the usual sorts of technical arguing about
the details) is that the system's internal IP stack design uses
something called a "cache IRE."

Cache IREs were designed to improve performance and to handle ARP
caching within the stack, but they also cause a lot of mischief --
including an inability to deal with ECMP and difficulties in source
address selection.

That's being fixed now.

> >  A "connected network" is a network on which we have an interface.
> > Yes, for interface types that require a next hop address (i.e.,
> > non-point-to-point), the _destination_ address for the next hop must
> > be on the same subnet as the configured interface, but that says
> > nothing about the source address.
> 
> Now I'm not quite understanding. Which meaning of terms such as "network", 
> "subnet", "address" and "configured interface" do you imply? L2 or L3? Or a 
> mixture of them all? ;)

L3 only.  L2 doesn't matter for any of this discussion.

> My understanding is that in order for L2 ethernet frames with payload to go
> between two hosts, their L3 IP "addresses" must be in the same L3 "subnet" as
> defined by network address bits and the mask size.

That doesn't sound quite right.

In order for L3 (IP) packets to go between hosts connected via
Ethernet, the system must have the L2 (MAC) addresses necessary to
transmit and receive the frames that carry them.

In general, the way it works is that the system does a forwarding
lookup on the destination address.  There are four possible results:

  - The destination address is configured as a local address on some
    interface.  The packet is looped back internally.

  - The destination address matches an IP subnet configured on some
    interface but is not a local address.  The destination address
    must be resolved via some L2-specific method (e.g., ARP on
    Ethernet) to finish the transmission process.

  - The destination address doesn't match any of the above, but does
    match some IP "gateway" route.  The output of that route is an
    interface and, if the interface is broadcast or NBMA, a next hop
    address.  The next hop address is used for L2 resolution (ARP),
    and the packet -- with L3 addresses unaltered -- is sent to that
    destination.

  - The destination address doesn't match anything.  Drop the packet.
    (Obviously, for this discussion, "default routes" are just gateway
    routes with destination "0.0.0.0/0" -- that is, they always match,
    and this case isn't hit if you have such a route configured.)

Note that the source address has nothing to do with it.
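
To make the four cases concrete, here's a minimal Python sketch of the
lookup.  It is only an illustration, not how the Solaris stack is
actually coded: the tables, the interface name "e1000g0", and the
function name forward_lookup() are all made up for this example, and a
real stack does one longest-prefix match over a single table rather
than walking separate lists.

```python
import ipaddress

# Hypothetical tables for illustration only; not real Solaris structures.
LOCAL_ADDRS = {ipaddress.ip_address("192.168.1.10")}
CONNECTED = [(ipaddress.ip_network("192.168.1.0/24"), "e1000g0")]
GATEWAY_ROUTES = [
    # (destination prefix, output interface, next hop address)
    (ipaddress.ip_network("10.0.0.0/8"), "e1000g0",
     ipaddress.ip_address("192.168.1.254")),
    (ipaddress.ip_network("0.0.0.0/0"), "e1000g0",
     ipaddress.ip_address("192.168.1.1")),
]


def forward_lookup(dst, local_addrs=LOCAL_ADDRS, connected=CONNECTED,
                   routes=GATEWAY_ROUTES):
    """Return (action, interface, address to resolve at L2).

    Only the destination is passed in; the source address is not an input.
    """
    dst = ipaddress.ip_address(dst)

    # Case 1: destination is one of our own addresses.
    if dst in local_addrs:
        return ("loopback", None, None)

    # Case 2: destination is on a directly connected subnet; resolve it.
    for net, ifname in connected:
        if dst in net:
            return ("on-link", ifname, dst)

    # Case 3: a gateway route matches; resolve the next hop instead.
    matches = [r for r in routes if dst in r[0]]
    if matches:
        _, ifname, nexthop = max(matches, key=lambda r: r[0].prefixlen)
        return ("gateway", ifname, nexthop)

    # Case 4: nothing matches.
    return ("drop", None, None)
```

With these made-up tables, forward_lookup("10.1.2.3") resolves the next
hop 192.168.1.254 at L2, while the IP header would still carry 10.1.2.3
as its destination.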

> Moreover, these L3 IP addresses should be configured on the (L1/L2 =
> physical/MAC) interfaces which are connected to the same ethernet collision
> domain (VLAN in our case).

You can have as many L3 subnets configured on a single Ethernet
subnetwork as you want ... but I'm not sure if that's what you're
talking about.

> When this is the case of two servers talking to each other within a subnet, 
> there is no need for a router.

True.  That's the second case out of the address lookup algorithm
described above.

> My expectation was that the same rules should automagically apply for a server
> talking to a router in order to pass a packet to a remote destination (in
> another, unconnected, subnet). This is where my knowledge didn't match
> reality ;)

The same rules _do_ apply.  The problem is with the source address,
which plays no part in the rules.

> All in all, there are more variables in this equation to me, than the current
> implementation considers:
> 1) destination IP address - it's currently the only criterion for picking a
> gateway IP address from the routing table (after considering metrics or
> balancing for equivalent routes);
> 2) gateway address - it is produced after a lookup in the routing table by the
> destination address, completely disregarding the source address value;
> 3) source address - I thought it should have been in the same IP subnet and/or
> ethernet segment as the gateway (reverse actually: gateway should have been
> in the same subnet as the source address). But it's not considered at all.

Correct.

In cases where the source address is not yet known (e.g., an unbound
local socket where the application issues connect()), the system uses
the destination-address based look-up to determine an output
interface, and *then* from the output interface it will choose an
appropriate source address to use.  That algorithm is a whole separate
topic.

Note that if routing changes, and we start going through a different
interface because of it, the source address is *not* chosen again
(unless the socket remains unbound).
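
If you want to see which source address the system picks for a given
destination, one common trick is to connect() an unbound UDP socket and
then ask for its local name.  The sketch below assumes an ordinary
sockets implementation; the helper name chosen_source() and the port
number are arbitrary choices for the example.

```python
import socket


def chosen_source(dst_ip, port=9):
    """Return the source address picked for an unbound UDP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # connect() on a UDP socket sends no packet; it only makes the
        # stack do the destination lookup and select a local address.
        s.connect((dst_ip, port))
        return s.getsockname()[0]
```

Calling chosen_source() with a remote destination should return an
address from whatever output interface the lookup selected.  The
address stays attached to that socket once connect() has returned,
which is the "not chosen again" behavior described above.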

> After a gateway is picked, its MAC address is evaluated and used to craft an
> L2 ethernet frame going out of the physical interface (hopefully matching the
> same ethernet segment as the selected gateway). This process pays no
> attention to whatever src/dst IP addresses are in the packet inside the
> ethernet frame.

Correct.

> > If this weren't so, then routing itself wouldn't work. Once you get
> > one hop away, you're no longer transmitting a packet whose source IP
> > address matches the outbound interface -- and that's exactly what's
> > expected.
> 
> Yes - but only for an L3 router device ;) 

The differences between an L3 host and an L3 router are vanishingly
small in real life.  Plus, Solaris *is* a fully functional L3 routing
system.

> Anyway, for now the problem is solved. My understanding of Solaris is updated.
> I don't consider the current implementation perfect (it's even
> counter-intuitive!) but the workaround seems acceptable.

Nobody at Sun thinks it's perfect, either.

> I hope "just changing the software" in this algorithm would improve the
> situation. I.e. consider the source address to boost priority of a certain
> gateway (of several equivalent ones). And use these "boosted" gateways unless
> there's indeed a link failure detected. Something like that... ;-}

That's exactly what's referenced as the preferred solution in that
blog entry.
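
Roughly, the idea could be sketched like this in Python.  This is my
own toy illustration of "prefer the gateway that shares a subnet with
the source address," not the design from the blog entry and not
anything in the Solaris code; pick_gateway() and its arguments are
hypothetical.

```python
import ipaddress


def pick_gateway(src, candidates, down=frozenset()):
    """Choose among equal-cost gateways, preferring one on src's subnet.

    candidates: list of (gateway address, connected subnet) strings.
    down: set of gateway addresses currently considered unreachable.
    """
    src = ipaddress.ip_address(src)
    alive = [(gw, ipaddress.ip_network(net))
             for gw, net in candidates if gw not in down]

    # Boost: a live gateway whose subnet contains the source address wins.
    for gw, net in alive:
        if src in net:
            return gw

    # Otherwise fall back to any live gateway, or None if all are down.
    return alive[0][0] if alive else None
```

With gateways 192.168.1.1 and 192.168.2.1 as equal-cost candidates, a
packet sourced from 192.168.2.10 would go via 192.168.2.1, and would
fall back to 192.168.1.1 only if 192.168.2.1 were marked down.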

(Other folks have suggested the Strong ES model from RFC 1122, but I'm
opposed to that.  It just creates _new_ problems.)

-- 
James Carlson         42.703N 71.076W         <[email protected]>
_______________________________________________
networking-discuss mailing list
[email protected]