On Dec 18, 2008, at 9:25 AM, James Carlson wrote:

> Nicolas Droux writes:
>>
>> We also have limitations built in to prevent an etherstub from being
>> plumbed, which may be causing the other issue you are hitting.
>
> That seems likely.
>
> The underlying issue seems to be in the design of VNICs and etherstub.
> On a regular NIC, there's a native instance that has its own MAC
> address and thus can work normally when plumbed; no VNIC is required.
> On an etherstub, though, the native instance has no address, and VNICs
> appear to be required in order to provide MAC addresses needed for
> clients.
>
> Thus, to a client, the two are different.

That's correct.

> Providing etherstubs themselves with arbitrary MAC addresses and
> allowing them to be plumbed normally (just as though they were regular
> NICs) would resolve the issue, but might be a step too far.

That would be one option. I'm sure we can work something out to make
etherstubs bridging-friendly.

>>> I'm not sure, though, that it's an interesting case.  You'll get
>>> better performance if you just put all of the VNICs that must talk
>>> with each other together on a single etherstub if you're planning to
>>> bridge etherstubs together.  If you're planning to bridge an
>>> etherstub with a regular NIC, then just move the VNICs over to the
>>> regular NIC.
>>
>> An important benefit is to have the flexibility to build virtual
>> networks in a box which map directly to physical topologies.
>
> Please expand on that.

Our architecture allows one to build a virtual network in a box
consisting of virtual switches, virtual NICs, etc. That virtual network
can be used to simulate physical network topologies, which is useful to
testers, developers, deployers, etc. The more virtual network elements
we have, the richer these virtual networks can be, and the more
realistically they can simulate real networks.
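
For example, a small topology of this kind can be built today with
dladm(1M); the link names here are arbitrary:

   # create a virtual switch and attach two virtual NICs to it
   dladm create-etherstub stub0
   dladm create-vnic -l stub0 vnic0
   dladm create-vnic -l stub0 vnic1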

> If we're talking about using these for general testing purposes (e.g.,
> creating a four-port bridge inside a box and using that for the IPMP
> test suite; no external hardware required), then I think etherstubs
> are just the wrong mechanism, because they are so different from
> actual Ethernets.
>
> What I believe we really need to generalize that case is an entity
> that emulates a point-to-point link, where the two endpoints are
> separate MACs in the same box.  Such a pseudo-device would give us:
>
>  - Distinct transmit and receive paths across the emulated link,
>    which could be subject to any desired impairment (e.g., MTU
>    restrictions, loss rates, delay, link up/down).
>
>  - Emulation of negotiated L2 parameters, including all of the
>    standard Ethernet capabilities.
>
>  - An obvious means to test wireless configurations, by having one
>    endpoint behave as an 802.11 "client" and the other behave as an
>    AP, emulating the association/disassociation/authentication
>    mechanisms.
>
> None of that appears to be as feasible with etherstubs.

What you are describing should be possible using VNICs connected
through etherstubs. But this is different from what I was referring to,
which is a bridge between etherstubs.

> If we're talking about other cases, then I'll need more information.
> One case suggested by private email was a DomU migrating from one NIC
> to another, and using an etherstub bridged to the actual NIC in order
> to make the transition "easy."  But if that's the goal, I would
> suggest that having either a mechanism that allows migration of VNICs
> among NICs without teardown or something that inserts an indirection
> mechanism (perhaps even an etherstub with the optional capability of
> linking to an underlying NIC) would be better.

> Don't forget that NICs used in bridges _must_ be in promiscuous mode,
> and that this destroys a good bit of performance.  That alone makes
> them less interesting for ad-hoc system reconfiguration types of
> activities.

This is different; we discussed VNIC migration during the Crossbow
design review. The design is there, but the feature was not implemented
as part of the first Crossbow putback.

>>> With Crossbow, the classification is tied to the administrative bits,
>>> which rely on explicit configuration of the VNICs and flows involved
>>> using a user-space component.  With bridging, forwarding entries are
>>> created and updated on the fly based on source MAC addresses seen in
>>> the data path, and then aged away over time; there's no administrative
>>> involvement normally expected for these entries.
>>
>> This doesn't have to be the case. The Crossbow flow implementation
>> provides a kernel API which allows flows to be created and added to
>> flow tables. That API is used today for VNICs, but also for MAC client
>> creation in general (e.g. through LDOMs), for user-specified flows,
>> and for multicast addresses. It could be used by the bridge code as
>> well.
>
> The mac_flow_add() function asserts that the ft_mip perimeter is
> held.  My understanding of how the perimeters are used in Crossbow
> (please correct me if I've gotten this wrong) is that they're *NEVER*
> taken in the upward direction, as that would lead to deadlock.  This
> is given as general rule "R2" in the mac.c block comments.
>
> Assuming that to be correct, it would mean that potentially _all_
> received packets would have to be shuffled off to a separate kernel
> thread, so that the source MAC address could be safely inspected and
> used to update the list of learned MAC destinations.

That wouldn't necessarily be required for all packets. Flow additions
and removals currently require the perimeter to be held, but lookups of
course do not.

> Worse still, it depends on a per-MAC lock for a per-MAC flow table,
> which makes no sense for bridging.  Bridge forwarding entries are a
> common database for a given bridge instance -- packets flowing in or
> out of any port on the bridge are matched against the common pool, not
> against a set of per-port entries.  We thus need a resource common to
> multiple MACs, which is not something that appears to be part of
> Crossbow, or we need to multiply the storage required by replicating
> each entry by N (and the locking effort goes up by N as well), so we
> can remain with Crossbow's per-port mi_flow_tab.

Yes, some of these entries will be duplicated in multiple mac_impl_t
flow tables. On the other hand, the data path can be kept simpler, and
only one classification is needed.

An alternative would be to have a per-bridge flow table, like the
per-MAC-client flow table we use for user-specified flows, and do a
lookup in that table when forwarding is enabled; however, this requires
an additional lookup.
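
As a rough sketch of the two layouts (the bridge_inst structure is
hypothetical; flow_tab_t is the existing Crossbow flow table type):

   /*
    * Option (a): each learned address is duplicated into the flow
    * table of every mac_impl_t that is a member of the bridge.
    * Option (b): one table is shared per bridge instance, at the
    * cost of an additional lookup when forwarding is enabled.
    */
   typedef struct bridge_inst {
           list_t          bi_ports;       /* member data-links */
           flow_tab_t      *bi_fwd_tab;    /* option (b): shared table */
           taskq_t         *bi_taskq;      /* deferred learning/aging */
   } bridge_inst_t;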

>>> I regard all of that as a research project.  It may well be an
>>> interesting one, but it's not this project by any stretch.  I have no
>>> plans or engineering resources available to redesign the internals of
>>> Crossbow to handle things it wasn't originally designed to do, and I
>>> think that insisting on such an extension of the project I've proposed
>>> is not reasonable.  I will not be doing that.
>>
>> I don't think you need to "redesign the internals of Crossbow".
>>
>> We have kernel APIs which I believe can achieve most of what you need
>> here. There might be some small gaps, but I don't see why you would
>> need to introduce a new classification table at layer 2 since we
>> already have most of what you need at the same layer in mac.
>
> I've stated my needs previously, and I don't see how Crossbow's
> features would fill them or be modified to fill them.  To state them
> again in detail:
>
>  - I need to be able to add and delete entries in the table from
>    within the datapath; this means manipulating the flow table on an
>    upcall from the driver layer.  This means having a locking scheme
>    that supports table modification in an upward direction.

The flow lookup can of course be done from the data path without
holding the perimeter, but you are correct that the addition or removal
of a flow would have to be handed off to a helper thread. Is there a
particular reason why updating flows via a helper thread would be
problematic?
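
To make the hand-off concrete, here is a rough sketch; the bridge_*
names and structures are hypothetical, while taskq_dispatch(9F),
kmem_alloc(9F) and bcopy(9F) are the existing kernel interfaces:

   /*
    * Sketch only: learn a new source MAC from the receive path.
    * The packet itself is delivered normally; only the flow-table
    * update is deferred to taskq context, where the perimeter can
    * be entered safely (bridge_learn_task() would do the actual
    * mac_flow_add() under the perimeter).
    */
   typedef struct bridge_learn_arg {
           bridge_inst_t   *la_bridge;
           uint8_t         la_mac[ETHERADDRL];
   } bridge_learn_arg_t;

   static void bridge_learn_task(void *);

   static void
   bridge_rx_learn(bridge_inst_t *bip, const uint8_t *src_mac)
   {
           bridge_learn_arg_t *la;

           la = kmem_alloc(sizeof (*la), KM_NOSLEEP);
           if (la == NULL)
                   return;  /* drop the learning event, not the packet */
           la->la_bridge = bip;
           bcopy(src_mac, la->la_mac, ETHERADDRL);
           if (taskq_dispatch(bip->bi_taskq, bridge_learn_task, la,
               TQ_NOSLEEP) == 0)
                   kmem_free(la, sizeof (*la));
   }
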

>  - I need to control the forwarding function on a per-port basis in
>    order to implement Spanning Tree.  When disabled (the default),
>    arriving packets are never matched against bridge forwarding
>    entries, and are just delivered to local destinations (i.e., VNICs
>    on that port).  When forwarding is enabled by STP, arriving
>    packets are matched against forwarding entries and delivered as
>    directed.  This means we need to match against subsets of flows
>    (local to port or all flows) depending on port state.

A separate flow table would probably be ideal here. An alternative
would be to keep the entries in place, but not do the forwarding on a
match when forwarding is disabled.
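
For instance (sketch, hypothetical names):

   /*
    * Sketch: consult the forwarding entries only when STP has put
    * the receiving port into the forwarding state; otherwise the
    * packet is delivered to local destinations only.
    */
   if (port->bp_state == BRIDGE_PORT_FORWARDING)
           flent = bridge_fwd_lookup(bip, dst_mac, vid);
   else
           flent = NULL;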

>
>  - I need to do a special match on "unknown destinations" -- anything
>    not in the forwarding table must be copied to every port that's
>    enabled by the control described above.

An unknown destination can be handled easily, since it results in a
NULL flow during a matching operation, and it is not the
performance-sensitive case.
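
Sketched out with hypothetical names (dupmsg(9F), freemsg(9F) and
list(9F) are the existing kernel interfaces):

   /*
    * Sketch: a NULL lookup result means "unknown destination", so
    * copy the packet to every forwarding port except the one it
    * arrived on.
    */
   if ((flent = bridge_fwd_lookup(bip, dst_mac, vid)) == NULL) {
           bridge_port_t *port;

           for (port = list_head(&bip->bi_ports); port != NULL;
               port = list_next(&bip->bi_ports, port)) {
                   if (port != src_port &&
                       port->bp_state == BRIDGE_PORT_FORWARDING)
                           bridge_port_tx(port, dupmsg(mp));
           }
           freemsg(mp);
   }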

>
>  - The forwarding table must be implemented in a per-bridge manner,
>    rather than per-port/link, as is currently done with Crossbow.
>    Forwarding is a global function for a bridge; otherwise, you can't
>    actually forward between ports.

As discussed above, we could replicate the flows in the table of each
link that is part of a bridge, or we could have a per-bridge flow
table, separate from the per-mac_impl_t flow tables.

>  - If I'm to support IVL and SVL, I need to vary my lookup so that it
>    sometimes matches on MAC address alone (SVL) and other times based
>    on MAC+VID (IVL).  (I could ditch the feature and hard code for
>    one or the other, but that'd be less good.)

We currently match on MAC + VID, but previous implementations had code
which could toggle between MAC-only and MAC + VID based classification.
So we can certainly consider a flexible approach which satisfies this
requirement.
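
A sketch of how a single key could serve both modes (names
hypothetical):

   /*
    * Sketch: inside bridge_fwd_lookup(), build the hash key from
    * the MAC address alone (SVL) or from MAC address + VLAN ID
    * (IVL); a VID of 0 means "ignored".
    */
   bcopy(dst_mac, key.bk_mac, ETHERADDRL);
   key.bk_vid = (bip->bi_vlan_mode == BRIDGE_IVL) ? vid : 0;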

>  - I need to age away entries (potentially based on bridge
>    parameters).  (This part at least looks "easy" in that I would use
>    a separate kernel thread that wakes up when needed.)
>
>
> There are many other issues that I don't know how to resolve.  For a
> few of these:
>
>  - Crossbow flows are associated with resources (CPUs and the like),
>    and these are currently assigned when flows are created by
>    administrative action, but it's unclear what choices data-driven
>    flow creation should use for those parameters.  There's probably a
>    separate infrastructure needed here -- perhaps per-VLAN set of
>    flow 'templates' set up by an administrator that get used to
>    create the actual flows -- but it's unclear how that should work.

The resources are not used by the flows themselves directly, but
rather by the data scheduling entities which rely on the flows, e.g.
the SRS. If you use the flows directly, you can decide how to use these
resource parameters: you can choose not to do any resource management,
have per-bridge resources, etc.

>  - What happens when L2 forwarding occurs?  Do we use the same input
>    side flow for the output side, or are two separate look-ups done?
>    Things are clearer for IP and other network layer features using
>    Crossbow, as the xb responsibility effectively ends at the network
>    layer, so the user naturally expects a new flow look-up if the
>    packet is transmitted elsewhere by IP.  It's unclear if that makes
>    sense for forwarding done entirely inside Crossbow.

I'm not sure what problem you are referring to. Are you referring to  
user flows managed through flowadm(1M)?

>
>  - Is "unknown destination" (transmit on all ports) a single flow or
>    N separate output-side look-ups?

If the destination is not found, we can treat this as a special case
which does the appropriate transmission(s). We already have such a
special case when multiple MAC clients are present, which causes the
packet to also go out on the wire.

>  - What happens when the user configures MAC based flows *and*
>    bridging is in use?  We then have two administrators manipulating
>    the same table -- the human admin is creating static entries, and
>    the system is creating and deleting dynamic ones.  What happens
>    when they overlap?

flowadm(1M) entries are L3 and up, so they wouldn't conflict with
bridging. The only L2 entries we currently have are for MAC clients,
and bridging already knows how to handle these.
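
For example, user-specified flows are defined on L3/L4 attributes
(the link and flow names here are arbitrary):

   flowadm add-flow -l vnic0 -a transport=tcp,local_port=80 http-flow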

>>> I can see that this path wasn't taken, so I can't help but wonder how
>>> reuse of Crossbow's classifier could be considered a requirement for
>>> bridging.
>>
>> It is very relevant to bridging since the bridge forwarding happens at
>> the same place on the data path as the classification that Crossbow
>> introduced in the MAC layer. For example, on transmit, the
>> classification on the destination MAC address results in sending the
>> packet to another MAC client (e.g. a VNIC), sending copies of the
>> packet to members of a multicast group, or sending the packet on the
>> wire. A new outcome would be to pass the packet to a bridge.
>
> The flow identification in flow_ip_v4_match() does the same sort of
> look-up that's done by IPv4 forwarding entries, IP Filter, and IPsec
> policy matching.  That doesn't make Crossbow a replacement for any of
> those other cases, though.
>
> At a high enough level, the two functions do appear to be similar.
> They're not the same, though, and have significant differences when
> you look at the details.  For instance, the things I'm matching
> against are *not* per-port entries.

They are not today in your design, but that's not a requirement; the
MAC addresses could be added to the per-MAC-instance flow tables, or
you could even have a separate flow table per bridge.

>> Similarly on receive, the old mac rx_add() entry points are gone, and
>> demultiplexing to the interested parties is now done by the mac layer
>> through the same classification table. So having the entry for an
>> address the bridge is interested in would allow the classification to
>> be leveraged for the receive side as well.
>>
>> So reusing the classifier for components of the same layer of the
>> stack seems the natural thing to do. Once you use flows you can also
>> take advantage of hardware classification on the receive side.
>>
>> The Crossbow team will be happy to answer questions you may have about
>> the new datapath, and to discuss specific requirements you may have
>> that are not addressed by the current implementation.
>
> Again, I have no plan or resources available to launch the necessary
> research project into the redesign of Crossbow for the use of
> bridging.  It sounds like an interesting thing, to be sure, but that's
> just not going to happen within the context of this project.

We're not talking about a research project, and we're not talking
about redesigning Crossbow. Crossbow introduced a new framework in
Solaris to do flow classification, which enables steering of packets
based on layer-2 addresses; this corresponds to one of the needs of
bridging.

> If you feel a TCR or outright denial of this project is necessary to
> preserve Crossbow, then speak with the other ARC members and build a
> consensus for that.

I don't think we are talking about "preserving Crossbow". This
discussion is about using a common framework to implement a new
feature. In the long term it will make the framework itself more
complete, allow bridging to take advantage of future improvements such
as hardware classification, and avoid duplication.

PSARC members should now have the information needed to make an  
informed decision. I'll be happy to provide more information if needed.

> If someone else has the necessary resources to do the required
> research (or even wants to launch a counter-project), then be my
> guest.
>
>>> One important issue did come up here: we need to define the relative
>>> ordering between L2 filtering and bridging, and I believe it makes
>>> sense to put L2 filtering closer to the physical I/O.  In other words,
>>> L2 filtering should do its work underneath the bridge.
>>
>> There's filtering which needs to occur between multiple MAC clients
>> (VNICs are MAC clients) defined on top of the same data-link. For
>> example, to be consistent with the way things work in the physical
>> world, one might want to prevent a VM from being able to send packets
>> directly on the wire, which in this case would include a bridge. On
>> the transmit side these checks would have to be done before the packet
>> is potentially sent through a bridge, i.e. the L2 filtering would have
>> to be done "on top" of the bridge.
>
> The issue that I'm pointing out is largely an administrative one.
> When someone creates an L2 filter, what exactly are they expecting to
> filter on?
>
> If they're expecting to filter on the actual physical interface that
> bridging uses, then the scheme you're suggesting won't work right --
> the action of bridge forwarding can (and does) redirect output packets
> from the originally intended link over to the actual one where the
> destination resides -- or to all ports if the destination is not
> known.  If an administrator says "never send packets to
> 00:01:02:03:04:05 on hme0", then he'll be quite surprised to find that
> when he sends packets to that address on ce0, the packet comes
> stumbling out hme0 despite his filter, because the L2 filtering "on
> top" of the bridge saw only ce0 as the output.

The primary goal is filtering on a per-MAC-client basis. I.e., when
the user specifies "hme0", this corresponds to the traffic going
through the primary MAC client of hme0, for example IP on top of that
data-link. It is different from all the traffic going through the
physical MAC instance, which is shared by multiple MAC clients, VNICs,
etc.

> Perhaps a hybrid approach is needed.  Place the hooks in two places:
> below the bridge for physical I/O and inside the MAC layer for
> client-to-client (VNIC-to-VNIC) communication.  The former will allow
> the user to filter packets with reference to the actual physical
> hardware in use (as opposed to just what the network layer "thinks" is
> in use), and allow him to filter traffic between MAC clients as well.
> The latter would not involve bridging, and thus would use the link as
> known to the network layer.

L2 filtering based on "physical" interfaces is not currently planned,
but could be added in the future if needed.

Nicolas.

>
>
> -- 
> James Carlson, Solaris Networking           <james.d.carlson at sun.com>
> Sun Microsystems / 35 Network Drive         71.232W   Vox +1 781 442 2084
> MS UBUR02-212 / Burlington MA 01803-2757    42.496N   Fax +1 781 442 1677

-- 
Nicolas Droux - Solaris Kernel Networking - Sun Microsystems, Inc.
nicolas.droux at sun.com - http://blogs.sun.com/droux

