Nicolas Droux writes:
> > You *should* in principle be able to create a bridge between two
> > etherstub instances. I've attempted to do this, and I've found that
> > there appear to be numerous bugs related to etherstubs in ON today --
> > for instance, dladm_linkid2legacyname() thinks they're invalid and
> > dlpi_bind() won't allow me to bind to SAP zero so that I can send and
> > receive STP into the bit-bucket.
>
> I don't think the dladm_linkid2legacyname() behavior you are seeing
> is a bug. As its name implies, it is used for legacy data-link names,
> and etherstubs don't fall in that category.
I see ... dladm_datalink_id2info() seems like the right way out of this
one.
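For instance, a minimal sketch of that approach -- assuming the
handle-based libdladm interface (signatures have varied between
builds), with link_is_etherstub() being a hypothetical helper:

    #include <sys/types.h>
    #include <sys/param.h>
    #include <libdladm.h>

    /*
     * Sketch only: classify a link by its datalink class instead of
     * relying on legacy naming.
     */
    static boolean_t
    link_is_etherstub(dladm_handle_t handle, datalink_id_t linkid)
    {
            datalink_class_t class;
            char name[MAXLINKNAMELEN];

            if (dladm_datalink_id2info(handle, linkid, NULL, &class,
                NULL, name, sizeof (name)) != DLADM_STATUS_OK)
                    return (B_FALSE);
            return (class == DATALINK_CLASS_ETHERSTUB);
    }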
> We also have limitations built in to prevent an etherstub from being
> plumbed, which may be causing the other issue you are hitting.
That seems likely.
The underlying issue seems to be in the design of VNICs and etherstubs.
On a regular NIC, there's a native instance that has its own MAC
address and thus can work normally when plumbed; no VNIC is required.
On an etherstub, though, the native instance has no address, and VNICs
appear to be required in order to provide MAC addresses needed for
clients.
Thus, to a client, the two are different.
Providing etherstubs themselves with arbitrary MAC addresses and
allowing them to be plumbed normally (just as though they were regular
NICs) would resolve the issue, but might be a step too far.
> So things seem to be currently working as expected. If there are new
> requirements for etherstubs in order to make them work with bridging,
> we'll be happy to work with you on that.
Bridging works with 802 MAC entities and the things that emulate them,
such as aggregations. As long as an "etherstub" can be made to look
like a real Ethernet (with transmit just going to the bit-bucket and
nothing ever received from "the wire"), it should work fine.
Currently, it doesn't seem to look much like an Ethernet by itself.
> > I'm not sure, though, that it's an interesting case. You'll get
> > better performance if you just put all of the VNICs that must talk
> > with each other together on a single etherstub if you're planning to
> > bridge etherstubs together. If you're planning to bridge an etherstub
> > with a regular NIC, then just move the VNICs over to the regular NIC.
>
> An important benefit is to have the flexibility to build virtual
> networks in a box which map directly to physical topologies.
Please expand on that.
If we're talking about using these for general testing purposes (e.g.,
creating a four-port bridge inside a box and using that for the IPMP
test suite; no external hardware required), then I think etherstubs
are just the wrong mechanism, because they are so different from
actual Ethernets.
What I believe we really need to generalize that case is an entity
that emulates a point-to-point link, where the two endpoints are
separate MACs in the same box. Such a pseudo-device would give us:
- Distinct transmit and receive paths across the emulated link,
which could be subject to any desired impairment (e.g., MTU
restrictions, loss rates, delay, link up/down).
- Emulation of negotiated L2 parameters, including all of the
standard Ethernet capabilities.
- An obvious means to test wireless configurations, by having one
endpoint behave as an 802.11 "client" and the other behave as an
AP, emulating the association/disassociation/authentication
mechanisms.
None of that appears to be as feasible with etherstubs.
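To make the idea concrete, here's a purely hypothetical sketch of what
such a device's tunables might look like; none of these names exist in
ON today:

    #include <sys/types.h>

    /*
     * Hypothetical: two endpoints, each registered as its own MAC,
     * with an independently impairable transmit path per direction.
     */
    typedef struct emlink_impair {
            uint32_t        ei_mtu;         /* MTU restriction */
            uint32_t        ei_loss_num;    /* drop num out of every ... */
            uint32_t        ei_loss_den;    /* ... den packets */
            clock_t         ei_delay;       /* added delay, in ticks */
            boolean_t       ei_link_up;     /* forced link state */
    } emlink_impair_t;

    typedef struct emlink {
            void            *el_mac[2];     /* the two registered MACs */
            emlink_impair_t el_impair[2];   /* one per direction */
    } emlink_t;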
If we're talking about other cases, then I'll need more information.
One case suggested by private email was a DomU migrating from one NIC
to another, using an etherstub bridged to the actual NIC in order to
make the transition "easy." But if that's the goal, I would suggest
that either a mechanism that allows migration of VNICs among NICs
without teardown, or something that inserts an indirection layer
(perhaps even an etherstub with the optional capability of linking to
an underlying NIC), would be better.
Don't forget that NICs used in bridges _must_ be in promiscuous mode,
and that this destroys a good bit of performance. That alone makes
them less interesting for ad-hoc system reconfiguration types of
activities.
> > This appears to be a misunderstanding. I'm not modifying the existing
> > link up/down handling that Crossbow VNICs have in any way.
>
> I think it's the following sentence in your document which is
> confusing to me: "This means that when all external links are showing
> link-down status, the upper-level clients using the MAC layers will
> see link-down events as well."
Will fix ... the "upper-level" here means everything that thinks it's
looking directly at the drivers. MAC client and VNIC behavior doesn't
change.
> > With Crossbow, the classification is tied to the administrative bits,
> > which rely on explicit configuration of the VNICs and flows involved
> > using a user-space component. With bridging, forwarding entries are
> > created and updated on the fly based on source MAC addresses seen in
> > the data path, and then aged away over time; there's no administrative
> > involvement normally expected for these entries.
>
> This doesn't have to be the case. The Crossbow flow implementation
> provides a kernel API which allows flows to be created and added to
> flow tables. That API today is used for VNICs, but also via MAC client
> creation in general (e.g. through LDOMs), for user-specified flows,
> and for multicast addresses. It could be used by the bridge code as
> well.
The mac_flow_add() function asserts that the ft_mip perimeter is
held. My understanding of how the perimeters are used in Crossbow
(please correct me if I've gotten this wrong) is that they're *NEVER*
taken in the upward direction, as that would lead to deadlock. This
is given as general rule "R2" in the mac.c block comments.
Assuming that to be correct, it would mean that potentially _all_
received packets would have to be shuffled off to a separate kernel
thread, so that the source MAC address could be safely inspected and
used to update the list of learned MAC destinations.
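To illustrate the concern with a generic two-lock sketch (not
Crossbow's actual locks; names invented, initialization omitted):

    #include <sys/mutex.h>

    kmutex_t perimeter, driver_rx;      /* assume mutex_init() elsewhere */

    void
    admin_thread(void)                  /* downward direction, per R2 */
    {
            mutex_enter(&perimeter);    /* perimeter first ... */
            mutex_enter(&driver_rx);    /* ... then driver locks */
            /* reprogram hardware, update flow tables, etc. */
            mutex_exit(&driver_rx);
            mutex_exit(&perimeter);
    }

    void
    rx_thread(void)                     /* upward direction */
    {
            mutex_enter(&driver_rx);    /* held while receiving */
            mutex_enter(&perimeter);    /* upward acquisition: deadlocks
                                           against admin_thread above */
            /* learn the source MAC address */
            mutex_exit(&perimeter);
            mutex_exit(&driver_rx);
    }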
Worse still, it depends on a per-MAC lock for a per-MAC flow table,
which makes no sense for bridging. Bridge forwarding entries are a
common database for a given bridge instance -- packets flowing in or
out of any port on the bridge are matched against the common pool, not
against a set of per-port entries. We thus need a resource common to
multiple MACs, which does not appear to be part of Crossbow; otherwise,
to remain with Crossbow's per-port mi_flow_tab, we would have to
replicate each entry N times, multiplying both the storage and the
locking effort by N.
The scheme I'm currently using is not so drastic. I'm not holding the
rw lock I'm using across any external calls (other than the avl
functions), so I'm able to update the table in the datapath when
necessary by taking a writer lock to insert a new entry. And my
database is an avl tree attached to the bridge instance, so it's
common among the attached ports.
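A condensed sketch of that scheme (names are hypothetical and
abbreviated from the actual code):

    #include <sys/types.h>
    #include <sys/systm.h>
    #include <sys/kmem.h>
    #include <sys/avl.h>
    #include <sys/rwlock.h>

    /* One learned forwarding entry; the tree is per-bridge, not per-port. */
    typedef struct bfwd_entry {
            avl_node_t      bfe_node;       /* linkage in the bridge's tree */
            uint8_t         bfe_dest[6];    /* learned MAC address */
            void            *bfe_port;      /* port it was learned on */
    } bfwd_entry_t;

    typedef struct bridge_inst {
            krwlock_t       bi_rwlock;      /* protects bi_fwd only */
            avl_tree_t      bi_fwd;         /* common to all attached ports */
    } bridge_inst_t;

    /* Learning path: called on an upcall with a received packet's source. */
    static void
    bridge_learn(bridge_inst_t *bip, const uint8_t *src, void *port)
    {
            bfwd_entry_t templ, *bfep, *oldp;
            avl_index_t where;

            bcopy(src, templ.bfe_dest, 6);

            /* Fast path: already learned on this port. */
            rw_enter(&bip->bi_rwlock, RW_READER);
            bfep = avl_find(&bip->bi_fwd, &templ, NULL);
            if (bfep != NULL && bfep->bfe_port == port) {
                    rw_exit(&bip->bi_rwlock);
                    return;
            }
            rw_exit(&bip->bi_rwlock);

            /* Allocate outside the lock; learning is best-effort. */
            if ((bfep = kmem_zalloc(sizeof (*bfep), KM_NOSLEEP)) == NULL)
                    return;
            bcopy(src, bfep->bfe_dest, 6);
            bfep->bfe_port = port;

            rw_enter(&bip->bi_rwlock, RW_WRITER);
            if ((oldp = avl_find(&bip->bi_fwd, &templ, &where)) != NULL) {
                    oldp->bfe_port = port;  /* station moved ports */
                    rw_exit(&bip->bi_rwlock);
                    kmem_free(bfep, sizeof (*bfep));
            } else {
                    avl_insert(&bip->bi_fwd, bfep, where);
                    rw_exit(&bip->bi_rwlock);
            }
    }

The tree's comparator (registered with avl_create()) keys on bfe_dest,
and aging is left to a separate thread, as described below.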
The forwarding entries I have are more like ire_t entries than they
are like conn_ts. The flows you have look more like conn_ts to me. I
think they're different objects.
> > The two are different in many respects. In theory, though, it might
> > be possible to modify Crossbow so that it can create and destroy
> > classification entries on the fly (this does not look trivial in the
> > least; the locking scheme makes this an unobvious approach), and it
> > may be possible to make use of some aspects of flow administration
> > when tied to more easily identifiable objects, such as VLANs, though
> > it's unclear how this should work with the existing Crossbow resource
> > management structure.
>
> Crossbow can already create and destroy flow entries on the fly. The
> locking requirements are also very straightforward.
They're straightforward but apparently immiscible with the data-driven
learning required for bridging.
> > I regard all of that as a research project. It may well be an
> > interesting one, but it's not this project by any stretch. I have no
> > plans or engineering resources available to redesign the internals of
> > Crossbow to handle things it wasn't originally designed to do, and I
> > think that insisting on such an extension of the project I've proposed
> > is not reasonable. I will not be doing that.
>
> I don't think you need to "redesign the internals of Crossbow".
>
> We have kernel APIs which I believe can achieve most of what you need
> here. There might be some small gaps, but I don't see why you would
> need to introduce a new classification table at layer 2 since we
> already have most of what you need at the same layer in mac.
I've stated my needs previously, and I don't see how Crossbow's
features would fill them or be modified to fill them. To state them
again in detail:
- I need to be able to add and delete entries in the table from
within the datapath; this means manipulating the flow table on an
upcall from the driver layer, and thus having a locking scheme that
supports table modification in the upward direction.
- I need to control the forwarding function on a per-port basis in
order to implement Spanning Tree. When disabled (the default),
arriving packets are never matched against bridge forwarding
entries, and are just delivered to local destinations (i.e., VNICs
on that port). When forwarding is enabled by STP, arriving
packets are matched against forwarding entries and delivered as
directed. This means we need to match against subsets of flows
(local to port or all flows) depending on port state.
- I need to do a special match on "unknown destinations" -- anything
not in the forwarding table must be copied to every port that's
enabled by the control described above.
- The forwarding table must be implemented in a per-bridge manner,
rather than per-port/link, as is currently done with Crossbow.
Forwarding is a global function for a bridge; otherwise, you can't
actually forward between ports.
- If I'm to support IVL and SVL, I need to vary my lookup so that it
sometimes matches on MAC address alone (SVL) and other times on
MAC+VID (IVL); see the sketch after this list. (I could ditch the
feature and hard-code one or the other, but that would be less good.)
- I need to age away entries (potentially based on bridge
parameters). (This part at least looks "easy" in that I would use
a separate kernel thread that wakes up when needed.)
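On the IVL/SVL point, the variation is small but real. A hypothetical
comparison routine (names invented) might look like:

    #include <sys/types.h>
    #include <sys/systm.h>

    /*
     * Sketch: forwarding-table key comparison, keyed on MAC alone (SVL)
     * or MAC+VID (IVL).  The mode would be fixed per-bridge when the
     * tree is created, since avl_create() takes a two-argument
     * comparator.
     */
    typedef enum { BRIDGE_SVL, BRIDGE_IVL } bridge_vlan_mode_t;

    typedef struct fwd_key {
            uint8_t         fk_dest[6];     /* destination MAC address */
            uint16_t        fk_vid;         /* VLAN ID; ignored under SVL */
    } fwd_key_t;

    static int
    fwd_key_compare(bridge_vlan_mode_t mode, const fwd_key_t *a,
        const fwd_key_t *b)
    {
            int cmp = memcmp(a->fk_dest, b->fk_dest, 6);

            if (cmp != 0)
                    return (cmp < 0 ? -1 : 1);
            if (mode == BRIDGE_SVL)
                    return (0);             /* SVL: MAC address alone */
            if (a->fk_vid != b->fk_vid)     /* IVL: MAC + VID */
                    return (a->fk_vid < b->fk_vid ? -1 : 1);
            return (0);
    }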
There are many other issues that I don't know how to resolve. For a
few of these:
- Crossbow flows are associated with resources (CPUs and the like),
and these are currently assigned when flows are created by
administrative action, but it's unclear what choices data-driven
flow creation should use for those parameters. There's probably a
separate infrastructure needed here -- perhaps a per-VLAN set of
flow 'templates' set up by an administrator that get used to
create the actual flows -- but it's unclear how that should work.
- What happens when L2 forwarding occurs? Do we use the same input
side flow for the output side, or are two separate look-ups done?
Things are clearer for IP and other network-layer features using
Crossbow, as Crossbow's responsibility effectively ends at the network
layer, so the user naturally expects a new flow look-up if the packet
is transmitted elsewhere by IP. It's unclear whether that makes sense
for forwarding done entirely inside Crossbow.
- Is "unknown destination" (transmit on all ports) a single flow or
N separate output-side look-ups?
- What happens when the user configures MAC-based flows *and*
bridging is in use? We then have two administrators manipulating
the same table -- the human admin is creating static entries, and
the system is creating and deleting dynamic ones. What happens
when they overlap?
> > For what it's worth, it may also be possible to modify Crossbow so
> > that it eliminates the Fireengine classifier entirely. After all, the
> > two are much more aligned than are Crossbow and bridging: both involve
> > identifying specific receiving client(s) on input and handling output
> > from multiple clients, and both involve classification structures that
> > are created strictly on the action of user space components. It seems
> > like a performance loss to have Crossbow inspect and classify the
> > packet once -- potentially looking high up the stack for flow
> > information -- only to have Fireengine do the same thing again.
>
> Of course it might be possible to use flows from other layers of the
> stack, but this is not as obvious as bridging. See below...
I have to differ on that. Flow identification is exactly what
Fireengine aims to provide. It's also what Crossbow provides.
Matching them together means that you don't have to do the same
look-up twice, and means that you can extend resource controls through
to individual sockets, applications, and users. I think it
potentially also means that Crossbow can provide squeue-like
functionality on behalf of IP, so that a whole layer of locks and
complexity can be removed, straight from NIC to transport.
The use of Crossbow inside bridging is much less obvious to me. I
certainly agree that it's possible. After having looked at bridging
for some time now, the application of Crossbow structures looks to me
like a research project and involves non-trivial redesign of Crossbow.
> > I can see that this path wasn't taken, so I can't help but wonder how
> > reuse of Crossbow's classifier could be considered a requirement for
> > bridging.
>
> It is very relevant to bridging since the bridge forwarding happens at
> the same place on the data-path as the classification that Crossbow
> introduced in the MAC layer. For example, on transmit, the
> classification on the destination MAC address results in sending the
> packet to another MAC client (e.g. a VNIC), sending copies of the
> packet to members of a multicast group, or sending the packet on the
> wire. A new outcome would be to pass the packet to a bridge.
The flow identification in flow_ip_v4_match() does the same sort of
look-up that's done by IPv4 forwarding entries, IP Filter, and IPsec
policy matching. That doesn't make Crossbow a replacement for any of
those other cases, though.
At a high enough level, the two functions do appear to be similar.
They're not the same, though, and have significant differences when
you look at the details. For instance, the things I'm matching
against are *not* per-port entries.
> With Crossbow the old mac txinfo implementation is completely gone,
> and all packets sent from a client will go through mac_tx(). Since
> mac_tx() is where classification takes place, and where you need to do
> your own checks, it seems natural to combine the operations in a
> single classification operation.
Yes, that makes it "possible." The issues discussed above make it
much less natural.
> Similarly on receive, the old mac rx_add() entry points are gone, and
> demultiplexing to the interested parties is now done by the mac layer
> through the same classification table. So having the entry for an
> address the bridge is interested in would allow the classification to
> be leveraged for the receive side as well.
>
> So reusing the classifier for components of the same layer of the
> stack seems like the natural thing to do. Once you use flows you can
> also take advantage of hardware classification on the receive side.
>
> The Crossbow team will be happy to answer questions you may have about
> the new datapath, and to discuss specific requirements you may have
> that are not addressed by the current implementation.
Again, I have no plan or resources available to launch the necessary
research project into the redesign of Crossbow for the use of
bridging. It sounds like an interesting thing, to be sure, but that's
just not going to happen within the context of this project.
If you feel a TCR or outright denial of this project is necessary to
preserve Crossbow, then speak with the other ARC members and build a
consensus for that.
If someone else has the necessary resources to do the required
research (or even wants to launch a counter-project), then be my
guest.
> > One important issue did come up here: we need to define the relative
> > ordering between L2 filtering and bridging, and I believe it makes
> > sense to put L2 filtering closer to the physical I/O. In other words,
> > L2 filter should do its work underneath the bridge.
>
> There's filtering which needs to occur between multiple MAC clients
> (VNICs are MAC clients) defined on top of the same data-link. For
> example to be consistent with the way things work in the physical
> world, one might want to prevent a VM from being able to send
> specific packets on the wire, which in this case would include a
> bridge. On the
> transmit side these checks would have to be done before the packet is
> potentially sent through a bridge, i.e. the L2 filtering would have to
> be done "on top" of the bridge.
The issue that I'm pointing out is largely an administrative one.
When someone creates an L2 filter, what exactly are they expecting to
filter on?
If they're expecting to filter on the actual physical interface that
bridging uses, then the scheme you're suggesting won't work right --
the action of bridge forwarding can (and does) redirect output packets
from the originally intended link over to the actual one where the
destination resides -- or to all ports if the destination is not
known. If an administrator says "never send packets to
00:01:02:03:04:05 on hme0", then he'll be quite surprised to find that
when he sends packets to that address on ce0, the packet comes
stumbling out hme0 despite his filter, because the L2 filtering "on
top" of the bridge saw only ce0 as the output.
Perhaps a hybrid approach is needed. Place the hooks in two places:
below the bridge for physical I/O and inside the MAC layer for
client-to-client (VNIC-to-VNIC) communication. The former will allow
the user to filter packets with reference to the actual physical
hardware in use (as opposed to just what the network layer "thinks" is
in use), and allow him to filter traffic between MAC clients as well.
The latter would not involve bridging, and thus would use the link as
known to the network layer.
--
James Carlson, Solaris Networking <james.d.carlson at sun.com>
Sun Microsystems / 35 Network Drive 71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757 42.496N Fax +1 781 442 1677