Nicolas Droux writes:
> On Dec 18, 2008, at 9:25 AM, James Carlson wrote:
> > Providing etherstubs themselves with arbitrary MAC addresses and
> > allowing them to be plumbed normally (just as though they were
> > regular NICs) would resolve the issue, but might be a step too far.
>
> That would be one option, I'm sure we can work something out to make
> etherstub bridging-friendly.
OK. Let me know when that happens. For now, I'm planning to detect the
case of etherstubs and handle them as a special case: no STP (and thus
no DLPI needed) and forwarding enabled when configured.

> >> An important benefit is to have the flexibility to build virtual
> >> networks in a box which map directly to physical topologies.
> >
> > Please expand on that.
>
> Our architecture allows one to build a virtual network in a box
> consisting of virtual switches, virtual NICs, etc. That virtual
> network in a box can be used to simulate physical network topologies,
> which can be used by testers, developers, deployers, etc. The more
> virtual network elements we have, the richer these virtual networks
> can be, and the more realistically they can be used to simulate real
> networks.

Except that the virtual environments constructed this way do not
emulate the real ones faithfully. For instance (and it's just one
instance), there's no 802 parameter negotiation, so if the desired
testing involves anything related to that -- such as duplex or speed,
both of which affect aggregations and bridging -- then VNICs and
etherstubs are right out.

In short, no, that doesn't solve the problem well. It does work for
some cases -- tests that involve only higher-level protocols, and then
likely a subset of those that don't need to see hardware-like
behavior -- but doesn't for many others. That's why I proposed
point-to-point emulations in addition to etherstubs. They'd provide a
better way to construct intentional test environments.

> > None of that appears to be as feasible with etherstubs.
>
> What you are describing should be possible using VNICs connected
> through etherstubs.

I'd like to see it done. In the meantime, I'll use other means to
test.

> But this is different than what I was referring to, which is a bridge
> between etherstubs.

Yes ... but I was trying to fish out the reason for such a bridge. It
seems hard to understand.

> > Don't forget that NICs used in bridges _must_ be in promiscuous
> > mode, and that this destroys a good bit of performance. That alone
> > makes them less interesting for ad-hoc system reconfiguration types
> > of activities.
>
> This is different, we've talked about the VNICs migration during the
> Crossbow design review. The design is there, but the feature was not
> implemented as part of the first Crossbow putback.

OK. Someone mentioned the possible usage of bridges for migration in a
private discussion, and I brought it up here to find out what sorts of
concrete situations you might have been referring to in this way:

    An important benefit is to have the flexibility to build virtual
    networks in a box which map directly to physical topologies.

I don't know what that means, so I'm searching for concrete answers.
If it means trying to use VNICs on etherstubs as a way to run tests,
then, as I've pointed out, that only works for some situations, and
doesn't for others. Rather pointedly, I can't test bridging that way,
because I need Ethernet NICs.

> > Assuming that to be correct, it would mean that potentially _all_
> > received packets would have to be shuffled off to a separate kernel
> > thread, so that the source MAC address could be safely inspected
> > and used to update the list of learned MAC destinations.
>
> That wouldn't necessarily be required for all packets. Flow additions
> and removals currently require the perimeter to be held, but not
> lookups of course.

It gets worse. We have to do a look-up first on source address, and
then on destination.
That source look-up isn't something that Crossbow does today, so
that's a new addition. If the source look-up *either* finds no entry
*or* finds an entry that points to a different output, then we have to
make changes. We must delete the old entry (if present) and create a
new one. We will be forced (by Crossbow architecture) to put that on a
separate thread for processing, which means that new packets can
arrive while we're trying to do that work. Those new packets must also
be queued until the flow is ready.

Once we get the new flow inserted, our troubles do not end. The 802
standards require that we provide in-order delivery to
"conversations." Since we don't have per-MAC-pair storage, we don't
know what "conversations" exist, which means that we need to preserve
order among things that we're forced to queue, at least for a given
input port. In other words, as long as there are still packets queued
in this new mechanism for a given source address, ones that have not
yet been drained following the flow insertion, we have to enqueue any
subsequent packets behind them. Only when the queue empties can we
return to normal Crossbow processing and avoid reordering; the sketch
below shows where in the receive path this detour sits.

How do we do this? It can't be a reference count on the flow, because
it was the non-existence of a usable flow that would have caused us to
go through this task-based detour in the first place. It must be some
sort of new structure, but I'm not sure what it looks like. It'll take
at least a bit of research to figure out how to put this together.

> > Worse still, it depends on a per-MAC lock for a per-MAC flow table,
> > which makes no sense for bridging. Bridge forwarding entries are a
> > common database for a given bridge instance -- packets flowing in
> > or out of any port on the bridge are matched against the common
> > pool, not against a set of per-port entries. We thus need a
> > resource common to multiple MACs, which is not something that
> > appears to be part of Crossbow, or we need to multiply the storage
> > required by replicating each entry by N (and the locking effort
> > goes up by N as well), so we can remain with Crossbow's per-port
> > mi_flow_tab.
>
> Yes, some of these entries will be duplicated in multiple mac_impl_t
> flow tables.

No, not "some." "All." It's a requirement of bridging that forwarding
is the same no matter which input port is used. The implication is NxM
storage for forwarding entries (N links, M known MAC destinations),
each represented as a separate flow, and N times as much effort in
inserting or deleting entries, which can easily happen en masse when
topology changes occur. A normal occurrence is that a port goes "up"
somewhere in the network, and you start learning the full set of MAC
addresses through a different port. Each one of those will require a
separate flow delete and insert operation on each one of the ports,
and queuing of all the traffic as this happens. (With behavior much
like the old outer STREAMS perimeter in IP, I suspect.)

But -- wait -- it gets worse. Those flows are distinct units. We only
match one. But the real usefulness in Crossbow is in being able to
manage flows administratively, and I still don't see how that can work
when bridging creates and destroys flows on its own. Should the user
just be prohibited for all time from controlling flows through the
bridge?
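To pin down the sequence being described, here's a minimal sketch of
the receive-side steps: learn on source, then forward on destination.
Every bridge_* name and type in it is hypothetical, invented purely
for illustration; only mblk_t, freemsg(), and struct ether_header are
real kernel interfaces, and this is not the project's actual code.

    /*
     * Hypothetical receive path: look up the source address first
     * (learning), then the destination (forwarding).  The painful
     * case is a source that is unknown or has moved: the table update
     * must be handed to a helper thread, and this packet -- and every
     * later packet from the same source -- queued behind it to
     * preserve ordering.
     */
    void
    bridge_recv(bridge_t *bp, bridge_port_t *inport, mblk_t *mp)
    {
        struct ether_header *eh = (struct ether_header *)mp->b_rptr;
        bridge_fwd_t *src, *dst;

        src = bridge_fwd_find(bp, eh->ether_shost.ether_addr_octet);
        if (src == NULL || src->bf_port != inport) {
            /*
             * New or moved station: defer the flow delete and
             * insert to a helper thread, and queue mp (plus any
             * subsequent packets from this source) until the new
             * entry is in place.
             */
            bridge_defer_learn(bp, inport, mp);
            return;
        }

        dst = bridge_fwd_find(bp, eh->ether_dhost.ether_addr_octet);
        if (dst == NULL)
            bridge_flood(bp, inport, mp);       /* unknown dest */
        else if (dst->bf_port != inport)
            bridge_send(dst->bf_port, mp);
        else
            freemsg(mp);    /* already on the right segment */
    }

The bridge_defer_learn() branch is exactly where the helper-thread
hand-off lands, and everything behind it needs the queuing and
draining machinery described above.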
> On the other end the data-path can be kept simpler and only one
> classification is needed.
>
> An alternative would be to have a per bridge flow table, like the per
> MAC client flow table we use for user-specified flows, and do a
> lookup in that table when forwarding is enabled; however, this
> requires an additional lookup.

Yes. And it also requires some definition of what occurs when flows
are matched in more than one place. That's actually a deeper problem
anyway with this whole scheme, as you'd be matching a flow on input
and then (assuming administrative controls) another one on output, and
it's not clear how they should work together. What if they specify
conflicting resource restrictions?

With IP, the usage is much more obvious. When a packet matches a flow
and is delivered to IP, the "flowness" of the packet stops there. If
IP decides to forward the packet out a different interface, the MAC
layer will do a brand new ex nihilo look-up for a new flow to match in
that context. The original horse it rode in on doesn't count. Does the
same apply for bridging? It's really unclear.

> > - I need to be able to add and delete entries in the table from
> >   within the datapath; this means manipulating the flow table on
> >   an upcall from the driver layer. This means having a locking
> >   scheme that supports table modification in an upward direction.
>
> The flow lookup can be done from the data-path without holding the
> perimeter of course, but you are correct that the addition or removal
> of a flow would have to be handed off to a helper thread. Is there a
> particular reason why the flow update via helper thread would be
> problematic?

Yes. Because that forces data packets in a very common case (ordinary
bridge learning) into a slow and highly complex queuing mechanism. The
design and properties of such a mechanism are well outside the scope
of this project, and are *purely* needed to cope with Crossbow design
peculiarities -- it has nothing to do with bridging, and everything to
do with getting around rule "R2."

> > - I need to control the forwarding function on a per-port basis in
> >   order to implement Spanning Tree. When disabled (the default),
> >   arriving packets are never matched against bridge forwarding
> >   entries, and are just delivered to local destinations (i.e.,
> >   VNICs on that port). When forwarding is enabled by STP, arriving
> >   packets are matched against forwarding entries and delivered as
> >   directed. This means we need to match against subsets of flows
> >   (local to port or all flows) depending on port state.
>
> The separate flow table would probably be ideal here.

Yes. However, I think it'd likely involve at least a minor
restructuring of Crossbow to get there. Something would have to know
when to invoke searches on that new table.

> An alternative would be to have the entries still in place, but not
> do the forwarding on a match if it is disabled.

That's part of it. The rest is that the port still needs to function
as a normal port when forwarding isn't active, which means that
receive to local destinations still works and transmit out that one
port (at least for STP) still works.

> > - I need to do a special match on "unknown destinations" --
> >   anything not in the forwarding table must be copied to every
> >   port that's enabled by the control described above.
>
> An unknown destination can be easily handled, since it results in a
> NULL flow during a matching operation, and is not on the performance
> sensitive case.

It involves modifying Crossbow to do something special with that NULL
case. (The sketch below shows what the per-port gate and the flooding
path amount to.)
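For concreteness, here is an outline of the per-port gate and the
unknown-destination path just discussed. As before, the bridge_* names
and BP_FORWARDING are hypothetical placeholders; only list_head(),
list_next(), dupmsg(), and freemsg() are existing kernel facilities.

    /*
     * Hypothetical forwarding step.  The per-port STP state gates
     * whether forwarding happens at all; an unknown destination is
     * flooded to every other port left in forwarding state.
     */
    void
    bridge_forward(bridge_t *bp, bridge_port_t *inport, mblk_t *mp,
        const uint8_t *daddr)
    {
        bridge_port_t *pp;
        bridge_fwd_t *dst;
        mblk_t *nmp;

        if (inport->bp_state != BP_FORWARDING) {
            /* STP says no: local delivery (VNICs, clients) only */
            bridge_local_rx(inport, mp);
            return;
        }

        dst = bridge_fwd_find(bp, daddr);
        if (dst != NULL) {
            bridge_send(dst->bf_port, mp);
            return;
        }

        /* Unknown destination: copy out every forwarding port. */
        for (pp = list_head(&bp->b_ports); pp != NULL;
            pp = list_next(&bp->b_ports, pp)) {
            if (pp == inport || pp->bp_state != BP_FORWARDING)
                continue;
            if ((nmp = dupmsg(mp)) != NULL)
                bridge_send(pp, nmp);
        }
        freemsg(mp);
    }

Note that the flood loop is per-port work driven by a failed look-up,
which is why a "NULL flow" result would need special handling inside
Crossbow rather than the ordinary match-and-deliver path.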
> > - Crossbow flows are associated with resources (CPUs and the
> >   like), and these are currently assigned when flows are created
> >   by administrative action, but it's unclear what choices
> >   data-driven flow creation should use for those parameters.
> >   There's probably a separate infrastructure needed here --
> >   perhaps a per-VLAN set of flow 'templates' set up by an
> >   administrator that get used to create the actual flows -- but
> >   it's unclear how that should work.
>
> The resources are not used by the flows themselves directly but
> rather by the data scheduling entities which rely on the flows, e.g.
> the SRS. If you use the flows directly you can decide how you will
> use these resource parameters. You can decide to not do any resource
> management, or have per bridge resources, etc.

Right. And as I'm saying, I don't quite know what to do with this. If
I do nothing, and bridge flows have no resource controls, then I don't
see that using Crossbow gives the user anything noteworthy that
implementing my own forwarding doesn't. There are no extra features
that become usable, but there's a whole lot more complexity in the
implementation, and much more interesting failure modes that can
result.

The forwarding look-up process itself is trivial. I'm using the
existing kernel AVL trees to do the work for me. Trying to reuse
Crossbow's flow matching mechanism as though it were a bridge
forwarding process involves adding substantial complexity, and I'm
just not seeing any obvious benefit. (A sketch of the AVL-based table
follows below.)
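Since the kernel AVL trees are what the message says the project
actually uses, here is roughly what that trivial look-up machinery can
look like. avl_create() and avl_find() are the real kernel AVL
interfaces from sys/avl.h; the bridge_fwd_t layout and the
bridge_fwd_* wrappers are illustrative guesses, not the project's
actual code.

    #include <sys/avl.h>
    #include <sys/ethernet.h>
    #include <sys/sysmacros.h>
    #include <sys/systm.h>

    typedef struct bridge_port bridge_port_t;   /* hypothetical */

    /* Illustrative forwarding entry; the real layout may differ. */
    typedef struct bridge_fwd {
        uint8_t       bf_dest[ETHERADDRL]; /* MAC address: sort key */
        bridge_port_t *bf_port;            /* learned output port */
        avl_node_t    bf_node;
    } bridge_fwd_t;

    static int
    bridge_fwd_cmp(const void *a, const void *b)
    {
        int diff = memcmp(((const bridge_fwd_t *)a)->bf_dest,
            ((const bridge_fwd_t *)b)->bf_dest, ETHERADDRL);

        /* AVL comparators must return exactly -1, 0, or 1. */
        return (diff < 0 ? -1 : diff > 0 ? 1 : 0);
    }

    void
    bridge_fwd_init(avl_tree_t *tree)
    {
        avl_create(tree, bridge_fwd_cmp, sizeof (bridge_fwd_t),
            offsetof(bridge_fwd_t, bf_node));
    }

    bridge_fwd_t *
    bridge_fwd_find(avl_tree_t *tree, const uint8_t *dest)
    {
        bridge_fwd_t key;

        bcopy(dest, key.bf_dest, ETHERADDRL);
        return (avl_find(tree, &key, NULL));
    }

Insertions and removals would presumably go through avl_add() and
avl_remove() under the bridge's own lock, with no Crossbow machinery
involved.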
> > - What happens when L2 forwarding occurs? Do we use the same input
> >   side flow for the output side, or are two separate look-ups
> >   done? Things are clearer for IP and other network layer features
> >   using Crossbow, as the xb responsibility effectively ends at the
> >   network layer, so the user naturally expects a new flow look-up
> >   if the packet is transmitted elsewhere by IP. It's unclear if
> >   that makes sense for forwarding done entirely inside Crossbow.
>
> I'm not sure what problem you are referring to. Are you referring to
> user flows managed through flowadm(1M)?

Yes, in part. For IP, the situation is clear: input and output are
independent. For bridging, it's less clear. Output is an artifact or
outcome of having done the input-side flow identification. Does it
make sense to do output-side flow look-ups with bridging? They
wouldn't be needed to do anything that a bridge needs to do, but the
administrative model of Crossbow would become very odd if output flow
controls worked only "sometimes."

> > - Is "unknown destination" (transmit on all ports) a single flow
> >   or N separate output-side look-ups?
>
> If you don't find your destination we can then treat this as a
> special case which does the appropriate transmission(s). We already
> have such a special case when multiple MAC clients are present which
> causes the packet to go out on the wire.

That question was actually about the output-side flows.

> > - What happens when the user configures MAC based flows *and*
> >   bridging is in use? We then have two administrators manipulating
> >   the same table -- the human admin is creating static entries,
> >   and the system is creating and deleting dynamic ones. What
> >   happens when they overlap?
>
> flowadm(1M) entries are L3 and up, so they wouldn't conflict with
> bridging. The only L2 entries we currently have are for MAC clients,
> and bridging already knows how to handle these.

A main point in using Crossbow would be to enable the administrative
mechanisms. If we leave that behind, what is there? A MAC-based search
function?

> > At a high enough level, the two functions do appear to be similar.
> > They're not the same, though, and have significant differences
> > when you look at the details. For instance, the things I'm
> > matching against are *not* per-port entries.
>
> They are not today in your design, but that's not a requirement; the
> MAC addresses could be added to the per MAC instance flow table, or
> you could even have a separate flow table per bridge.

Quite simply put, this project is not rewriting Crossbow to provide
features that would allow the sort of design you've described. If
someone else wants to do that, then that's great. If someone wants to
provide Crossbow-based interfaces that work well for bridging, then
let us know, and we'll see how to schedule a follow-on project to use
those. That's not this project, and I refuse to allow this project to
be made dependent on Crossbow features that do not exist and that
nobody is working on.

> > Again, I have no plan or resources available to launch the
> > necessary research project into the redesign of Crossbow for the
> > use of bridging. It sounds like an interesting thing, to be sure,
> > but that's just not going to happen within the context of this
> > project.
>
> We're not talking about a research project, and we're not talking
> about redesigning Crossbow. Crossbow introduced a new framework in
> Solaris to do flow classification which enables steering of packets
> based on layer-2 addresses, which corresponds to one of the needs of
> bridging.

As described many times over now, it does not fit the needs of
bridging. In my opinion -- which I think counts, as delivery of this
project is my responsibility -- it can't be made to fit the needs of
bridging without substantial and risky rework of internal elements of
Crossbow. Just because you've got a hammer, that doesn't make every
problem look like a nail. I'm not using that flow classification
because it doesn't fit.

> > If you feel a TCR or outright denial of this project is necessary
> > to preserve Crossbow, then speak with the other ARC members and
> > build a consensus for that.
>
> I don't think we're talking about "preserving Crossbow". This
> discussion is about using a common framework to implement a new
> feature. In the long term it will make the framework itself more
> complete, allow bridging to take advantage of future improvements
> and hardware classification, and avoid duplication.

It doesn't appear to do much of any of that. We've already ruled out
flow administration at this level as an unclear concept; the hardware
features can't be used, due to the use of both promiscuous mode and
searching based on source MAC address (the hard part of all this,
which the hardware doesn't do); and the "duplication" (if any) is in
the most trivial of places -- the table look-up function on
destination address, for which I use existing kernel facilities, and
thus don't actually duplicate anything.

> PSARC members should now have the information needed to make an
> informed decision. I'll be happy to provide more information if
> needed.

I agree. PSARC members: please vote to deny if you believe that
bridging must be built using Crossbow's classifier. That's not this
project, and it's not going to be this project. It's someone else's
project.
> > If they're expecting to filter on the actual physical interface
> > that bridging uses, then the scheme you're suggesting won't work
> > right -- the action of bridge forwarding can (and does) redirect
> > output packets from the originally intended link over to the
> > actual one where the destination resides -- or to all ports if the
> > destination is not known. If an administrator says "never send
> > packets to 00:01:02:03:04:05 on hme0", then he'll be quite
> > surprised to find that when he sends packets to that address on
> > ce0, the packet comes stumbling out hme0 despite his filter,
> > because the L2 filtering "on top" of the bridge saw only ce0 as
> > the output.
>
> The primary goal is the filtering on a per MAC client basis. I.e.,
> when the user specifies "hme0", this corresponds to the traffic
> going through the primary MAC client of hme0, for example IP on top
> of that data-link. It is different from all traffic going through
> the physical MAC instance which is shared by multiple MAC
> clients/VNICs/etc.

OK. As long as the designers of L2 filtering can describe things so
that the user understands that it's on an internal MAC client basis,
rather than related to the physical port, that sounds fine to me. I
would think it's confusing, but there are clear usage scenarios in
either direction, so I don't actually care which one they implement.

-- 
James Carlson, Solaris Networking         <james.d.carlson at sun.com>
Sun Microsystems / 35 Network Drive   71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757  42.496N Fax +1 781 442 1677
