Nicolas Droux writes:
> On Dec 18, 2008, at 9:25 AM, James Carlson wrote:
> > Providing etherstubs themselves with arbitrary MAC addresses and
> > allowing them to be plumbed normally (just as though they were
> > regular NICs) would resolve the issue, but might be a step too far.
>
> That would be one option, I'm sure we can work something out to make
> etherstub bridging-friendly.
OK. Let me know when that happens. For now, I'm planning to detect the
case of etherstubs and handle them as a special case: no STP (and thus
no DLPI needed) and forwarding enabled when configured.

> >> An important benefit is to have the flexibility to build virtual
> >> networks in a box which map directly to physical topologies.
> >
> > Please expand on that.
>
> Our architecture allows one to build a virtual network in a box
> consisting of virtual switches, virtual NICs, etc. That virtual
> network in a box can be used to simulate physical network topologies,
> which can be used by testers, developers, deployers, etc. The more
> virtual network elements we have, the richer these virtual networks
> can be, and the more realistically they can be used to simulate real
> networks.

Except that the virtual environments constructed this way do not
emulate the real ones faithfully. For instance (and it's just one
instance), there's no 802 parameter negotiation, so if the desired
testing involves anything related to that -- such as duplex or speed,
both of which affect aggregations and bridging -- then VNICs and
etherstubs are right out.

In short, no, that doesn't solve the problem well. It does work for
some cases -- tests that involve only higher-level protocols, and then
likely a subset of those that don't need to see hardware-like
behavior -- but doesn't for many others. That's why I proposed
point-to-point emulations in addition to etherstubs. They'd provide a
better way to construct intentional test environments.

> > None of that appears to be as feasible with etherstubs.
>
> What you are describing should be possible using VNICs connected
> through etherstubs.

I'd like to see it done. In the meantime, I'll use other means to
test.

> But this is different than what I was referring to, which is a bridge
> between etherstubs.

Yes ... but I was trying to fish out the reason for such a bridge. It
seems hard to understand.

> > Don't forget that NICs used in bridges _must_ be in promiscuous
> > mode, and that this destroys a good bit of performance. That alone
> > makes them less interesting for ad-hoc system reconfiguration types
> > of activities.
>
> This is different, we've talked about the VNICs migration during the
> Crossbow design review. The design is there, but the feature was not
> implemented as part of the first Crossbow putback.

OK. Someone mentioned the possible usage of bridges for migration in a
private discussion, and I brought it up here to find out what sorts of
concrete situations you might have been referring to in this way:

    An important benefit is to have the flexibility to build virtual
    networks in a box which map directly to physical topologies.

I don't know what that means, so I'm searching for concrete answers.
If it means trying to use VNICs on etherstubs as a way to run tests,
then, as I've pointed out, that only works for some situations, and
doesn't for others. Rather pointedly, I can't test bridging that way,
because I need Ethernet NICs.

> > Assuming that to be correct, it would mean that potentially _all_
> > received packets would have to be shuffled off to a separate kernel
> > thread, so that the source MAC address could be safely inspected
> > and used to update the list of learned MAC destinations.
>
> That wouldn't necessarily be required for all packets. Flow additions
> and removals currently require the perimeter to be held, but not
> lookups of course.

It gets worse. We have to do a look-up first on source address, and
then on destination.
That source look-up isn't something that Crossbow does today, so
that's a new addition. If the source look-up *either* finds no entry
*or* finds an entry that points to a different output, then we have to
make changes. We must delete the old entry (if present) and create a
new one. We will be forced (by Crossbow architecture) to put that on a
separate thread for processing, which means that new packets can
arrive while we're trying to do that work. Those new packets must also
be queued until the flow is ready.

Once we get the new flow inserted, our troubles do not end. The 802
standards require that we provide in-order delivery to
"conversations." Since we don't have per-MAC-pair storage, we don't
know what "conversations" exist, which means that we need to preserve
order among things that we're forced to queue, at least for a given
input port. In other words, as long as there are still packets queued
in this new mechanism for a given source address, ones that have not
yet been drained following the flow insertion, we have to enqueue any
subsequent packets behind them. Only when the queue empties can we
return to normal Crossbow processing and avoid reordering; the sketch
below shows where in the receive path this detour sits.

How do we do this? It can't be a reference count on the flow, because
it was the non-existence of a usable flow that would have caused us to
go through this task-based detour in the first place. It must be some
sort of new structure, but I'm not sure what it looks like. It'll take
at least a bit of research to figure out how to put this together.

> > Worse still, it depends on a per-MAC lock for a per-MAC flow table,
> > which makes no sense for bridging. Bridge forwarding entries are a
> > common database for a given bridge instance -- packets flowing in
> > or out of any port on the bridge are matched against the common
> > pool, not against a set of per-port entries. We thus need a
> > resource common to multiple MACs, which is not something that
> > appears to be part of Crossbow, or we need to multiply the storage
> > required by replicating each entry by N (and the locking effort
> > goes up by N as well), so we can remain with Crossbow's per-port
> > mi_flow_tab.
>
> Yes, some of these entries will be duplicated in multiple mac_impl_t
> flow tables.

No, not "some." "All." It's a requirement of bridging that forwarding
is the same no matter which input port is used. The implication is NxM
storage for forwarding entries (N links, M known MAC destinations),
each represented as a separate flow, and N times as much effort in
inserting or deleting entries, which can easily happen en masse when
topology changes occur. A normal occurrence is that a port goes "up"
somewhere in the network, and you start learning the full set of MAC
addresses through a different port. Each one of those will require a
separate flow delete and insert operation on each one of the ports,
and queuing of all the traffic as this happens. (With behavior much
like the old outer STREAMS perimeter in IP, I suspect.)

But -- wait -- it gets worse. Those flows are distinct units. We only
match one. But the real usefulness in Crossbow is in being able to
manage flows administratively, and I still don't see how that can work
when bridging creates and destroys flows on its own. Should the user
just be prohibited for all time from controlling flows through the
bridge?
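To pin down the sequence being described, here's a minimal sketch of
the receive-side steps: learn on source, then forward on destination.
Every bridge_* name and type in it is hypothetical, invented purely
for illustration; only mblk_t, freemsg(), and struct ether_header are
real kernel interfaces, and this is not the project's actual code.

    /*
     * Hypothetical receive path: look up the source address first
     * (learning), then the destination (forwarding).  The painful
     * case is a source that is unknown or has moved: the table update
     * must be handed to a helper thread, and this packet -- and every
     * later packet from the same source -- queued behind it to
     * preserve ordering.
     */
    void
    bridge_recv(bridge_t *bp, bridge_port_t *inport, mblk_t *mp)
    {
        struct ether_header *eh = (struct ether_header *)mp->b_rptr;
        bridge_fwd_t *src, *dst;

        src = bridge_fwd_find(bp, eh->ether_shost.ether_addr_octet);
        if (src == NULL || src->bf_port != inport) {
            /*
             * New or moved station: defer the flow delete and
             * insert to a helper thread, and queue mp (plus any
             * subsequent packets from this source) until the new
             * entry is in place.
             */
            bridge_defer_learn(bp, inport, mp);
            return;
        }

        dst = bridge_fwd_find(bp, eh->ether_dhost.ether_addr_octet);
        if (dst == NULL)
            bridge_flood(bp, inport, mp);       /* unknown dest */
        else if (dst->bf_port != inport)
            bridge_send(dst->bf_port, mp);
        else
            freemsg(mp);    /* already on the right segment */
    }

The bridge_defer_learn() branch is exactly where the helper-thread
hand-off lands, and everything behind it needs the queuing and
draining machinery described above.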
> On the other end the data-path can be kept simpler and only one
> classification is needed.
>
> An alternative would be to have a per bridge flow table, like the per
> MAC client flow table we use for user-specified flows, and do a
> lookup in that table when forwarding is enabled; however, this
> requires an additional lookup.

Yes. And it also requires some definition of what occurs when flows
are matched in more than one place. That's actually a deeper problem
anyway with this whole scheme, as you'd be matching a flow on input
and then (assuming administrative controls) another one on output, and
it's not clear how they should work together. What if they specify
conflicting resource restrictions?

With IP, the usage is much more obvious. When a packet matches a flow
and is delivered to IP, the "flowness" of the packet stops there. If
IP decides to forward the packet out a different interface, the MAC
layer will do a brand new ex nihilo look-up for a new flow to match in
that context. The original horse it rode in on doesn't count. Does the
same apply for bridging? It's really unclear.

> > - I need to be able to add and delete entries in the table from
> >   within the datapath; this means manipulating the flow table on
> >   an upcall from the driver layer. This means having a locking
> >   scheme that supports table modification in an upward direction.
>
> The flow lookup can be done from the data-path without holding the
> perimeter of course, but you are correct that the addition or removal
> of a flow would have to be handed off to a helper thread. Is there a
> particular reason why the flow update via helper thread would be
> problematic?

Yes. Because that forces data packets in a very common case (ordinary
bridge learning) into a slow and highly complex queuing mechanism. The
design and properties of such a mechanism are well outside the scope
of this project, and are *purely* needed to cope with Crossbow design
peculiarities -- it has nothing to do with bridging, and everything to
do with getting around rule "R2."

> > - I need to control the forwarding function on a per-port basis in
> >   order to implement Spanning Tree. When disabled (the default),
> >   arriving packets are never matched against bridge forwarding
> >   entries, and are just delivered to local destinations (i.e.,
> >   VNICs on that port). When forwarding is enabled by STP, arriving
> >   packets are matched against forwarding entries and delivered as
> >   directed. This means we need to match against subsets of flows
> >   (local to port or all flows) depending on port state.
>
> The separate flow table would probably be ideal here.

Yes. However, I think it'd likely involve at least a minor
restructuring of Crossbow to get there. Something would have to know
when to invoke searches on that new table.

> An alternative would be to have the entries still in place, but not
> do the forwarding on a match if it is disabled.

That's part of it. The rest is that the port still needs to function
as a normal port when forwarding isn't active, which means that
receive to local destinations still works and transmit out that one
port (at least for STP) still works.

> > - I need to do a special match on "unknown destinations" --
> >   anything not in the forwarding table must be copied to every
> >   port that's enabled by the control described above.
>
> An unknown destination can be easily handled, since it results in a
> NULL flow during a matching operation, and is not on the performance
> sensitive case.

It involves modifying Crossbow to do something special with that NULL
case. (The sketch below shows what the per-port gate and the flooding
path amount to.)
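For concreteness, here is an outline of the per-port gate and the
unknown-destination path just discussed. As before, the bridge_* names
and BP_FORWARDING are hypothetical placeholders; only list_head(),
list_next(), dupmsg(), and freemsg() are existing kernel facilities.

    /*
     * Hypothetical forwarding step.  The per-port STP state gates
     * whether forwarding happens at all; an unknown destination is
     * flooded to every other port left in forwarding state.
     */
    void
    bridge_forward(bridge_t *bp, bridge_port_t *inport, mblk_t *mp,
        const uint8_t *daddr)
    {
        bridge_port_t *pp;
        bridge_fwd_t *dst;
        mblk_t *nmp;

        if (inport->bp_state != BP_FORWARDING) {
            /* STP says no: local delivery (VNICs, clients) only */
            bridge_local_rx(inport, mp);
            return;
        }

        dst = bridge_fwd_find(bp, daddr);
        if (dst != NULL) {
            bridge_send(dst->bf_port, mp);
            return;
        }

        /* Unknown destination: copy out every forwarding port. */
        for (pp = list_head(&bp->b_ports); pp != NULL;
            pp = list_next(&bp->b_ports, pp)) {
            if (pp == inport || pp->bp_state != BP_FORWARDING)
                continue;
            if ((nmp = dupmsg(mp)) != NULL)
                bridge_send(pp, nmp);
        }
        freemsg(mp);
    }

Note that the flood loop is per-port work driven by a failed look-up,
which is why a "NULL flow" result would need special handling inside
Crossbow rather than the ordinary match-and-deliver path.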
> > - Crossbow flows are associated with resources (CPUs and the
> >   like), and these are currently assigned when flows are created
> >   by administrative action, but it's unclear what choices
> >   data-driven flow creation should use for those parameters.
> >   There's probably a separate infrastructure needed here --
> >   perhaps a per-VLAN set of flow 'templates' set up by an
> >   administrator that get used to create the actual flows -- but
> >   it's unclear how that should work.
>
> The resources are not used by the flows themselves directly but
> rather by the data scheduling entities which rely on the flows, e.g.
> the SRS. If you use the flows directly you can decide how you will
> use these resource parameters. You can decide to not do any resource
> management, or have per bridge resources, etc.

Right. And as I'm saying, I don't quite know what to do with this. If
I do nothing, and bridge flows have no resource controls, then I don't
see that using Crossbow gives the user anything noteworthy that
implementing my own forwarding doesn't. There are no extra features
that become usable, but there's a whole lot more complexity in the
implementation, and much more interesting failure modes that can
result.

The forwarding look-up process itself is trivial. I'm using the
existing kernel AVL trees to do the work for me. Trying to reuse
Crossbow's flow matching mechanism as though it were a bridge
forwarding process involves adding substantial complexity, and I'm
just not seeing any obvious benefit. (A sketch of the AVL-based table
follows below.)
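Since the kernel AVL trees are what the message says the project
actually uses, here is roughly what that trivial look-up machinery can
look like. avl_create() and avl_find() are the real kernel AVL
interfaces from sys/avl.h; the bridge_fwd_t layout and the
bridge_fwd_* wrappers are illustrative guesses, not the project's
actual code.

    #include <sys/avl.h>
    #include <sys/ethernet.h>
    #include <sys/sysmacros.h>
    #include <sys/systm.h>

    typedef struct bridge_port bridge_port_t;   /* hypothetical */

    /* Illustrative forwarding entry; the real layout may differ. */
    typedef struct bridge_fwd {
        uint8_t       bf_dest[ETHERADDRL]; /* MAC address: sort key */
        bridge_port_t *bf_port;            /* learned output port */
        avl_node_t    bf_node;
    } bridge_fwd_t;

    static int
    bridge_fwd_cmp(const void *a, const void *b)
    {
        int diff = memcmp(((const bridge_fwd_t *)a)->bf_dest,
            ((const bridge_fwd_t *)b)->bf_dest, ETHERADDRL);

        /* AVL comparators must return exactly -1, 0, or 1. */
        return (diff < 0 ? -1 : diff > 0 ? 1 : 0);
    }

    void
    bridge_fwd_init(avl_tree_t *tree)
    {
        avl_create(tree, bridge_fwd_cmp, sizeof (bridge_fwd_t),
            offsetof(bridge_fwd_t, bf_node));
    }

    bridge_fwd_t *
    bridge_fwd_find(avl_tree_t *tree, const uint8_t *dest)
    {
        bridge_fwd_t key;

        bcopy(dest, key.bf_dest, ETHERADDRL);
        return (avl_find(tree, &key, NULL));
    }

Insertions and removals would presumably go through avl_add() and
avl_remove() under the bridge's own lock, with no Crossbow machinery
involved.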
> > - What happens when L2 forwarding occurs? Do we use the same input
> >   side flow for the output side, or are two separate look-ups
> >   done? Things are clearer for IP and other network layer features
> >   using Crossbow, as the xb responsibility effectively ends at the
> >   network layer, so the user naturally expects a new flow look-up
> >   if the packet is transmitted elsewhere by IP. It's unclear if
> >   that makes sense for forwarding done entirely inside Crossbow.
>
> I'm not sure what problem you are referring to. Are you referring to
> user flows managed through flowadm(1M)?

Yes, in part. For IP, the situation is clear: input and output are
independent. For bridging, it's less clear. Output is an artifact or
outcome of having done the input-side flow identification. Does it
make sense to do output-side flow look-ups with bridging? They
wouldn't be needed to do anything that a bridge needs to do, but the
administrative model of Crossbow would become very odd if output flow
controls worked only "sometimes."

> > - Is "unknown destination" (transmit on all ports) a single flow
> >   or N separate output-side look-ups?
>
> If you don't find your destination we can then treat this as a
> special case which does the appropriate transmission(s). We already
> have such a special case when multiple MAC clients are present which
> causes the packet to go out on the wire.

That question was actually about the output-side flows.

> > - What happens when the user configures MAC based flows *and*
> >   bridging is in use? We then have two administrators manipulating
> >   the same table -- the human admin is creating static entries,
> >   and the system is creating and deleting dynamic ones. What
> >   happens when they overlap?
>
> flowadm(1M) entries are L3 and up, so they wouldn't conflict with
> bridging. The only L2 entries we currently have are for MAC clients,
> and bridging already knows how to handle these.

A main point in using Crossbow would be to enable the administrative
mechanisms. If we leave that behind, what is there? A MAC-based search
function?

> > At a high enough level, the two functions do appear to be similar.
> > They're not the same, though, and have significant differences
> > when you look at the details. For instance, the things I'm
> > matching against are *not* per-port entries.
>
> They are not today in your design, but that's not a requirement; the
> MAC addresses could be added to the per MAC instance flow table, or
> you could even have a separate flow table per bridge.

Quite simply put, this project is not rewriting Crossbow to provide
features that would allow the sort of design you've described. If
someone else wants to do that, then that's great. If someone wants to
provide Crossbow-based interfaces that work well for bridging, then
let us know, and we'll see how to schedule a follow-on project to use
those. That's not this project, and I refuse to allow this project to
be made dependent on Crossbow features that do not exist and that
nobody is working on.

> > Again, I have no plan or resources available to launch the
> > necessary research project into the redesign of Crossbow for the
> > use of bridging. It sounds like an interesting thing, to be sure,
> > but that's just not going to happen within the context of this
> > project.
>
> We're not talking about a research project, and we're not talking
> about redesigning Crossbow. Crossbow introduced a new framework in
> Solaris to do flow classification which enables steering of packets
> based on layer-2 addresses, which corresponds to one of the needs of
> bridging.

As described many times over now, it does not fit the needs of
bridging. In my opinion -- which I think counts, as delivery of this
project is my responsibility -- it can't be made to fit the needs of
bridging without substantial and risky rework of internal elements of
Crossbow. Just because you've got a hammer, that doesn't make every
problem look like a nail. I'm not using that flow classification
because it doesn't fit.

> > If you feel a TCR or outright denial of this project is necessary
> > to preserve Crossbow, then speak with the other ARC members and
> > build a consensus for that.
>
> I don't think we're talking about "preserving Crossbow". This
> discussion is about using a common framework to implement a new
> feature. In the long term it will make the framework itself more
> complete, allow bridging to take advantage of future improvements
> and hardware classification, and avoid duplication.

It doesn't appear to do much of any of that. We've already ruled out
flow administration at this level as an unclear concept; the hardware
features can't be used, due to the use of both promiscuous mode and
searching based on source MAC address (the hard part of all this,
which the hardware doesn't do); and the "duplication" (if any) is in
the most trivial of places -- the table look-up function on
destination address, for which I use existing kernel facilities, and
thus don't actually duplicate anything.

> PSARC members should now have the information needed to make an
> informed decision. I'll be happy to provide more information if
> needed.

I agree. PSARC members: please vote to deny if you believe that
bridging must be built using Crossbow's classifier. That's not this
project, and it's not going to be this project. It's someone else's
project.
> > If they're expecting to filter on the actual physical interface
> > that bridging uses, then the scheme you're suggesting won't work
> > right -- the action of bridge forwarding can (and does) redirect
> > output packets from the originally intended link over to the
> > actual one where the destination resides -- or to all ports if the
> > destination is not known. If an administrator says "never send
> > packets to 00:01:02:03:04:05 on hme0", then he'll be quite
> > surprised to find that when he sends packets to that address on
> > ce0, the packet comes stumbling out hme0 despite his filter,
> > because the L2 filtering "on top" of the bridge saw only ce0 as
> > the output.
>
> The primary goal is the filtering on a per MAC client basis. I.e.,
> when the user specifies "hme0", this corresponds to the traffic
> going through the primary MAC client of hme0, for example IP on top
> of that data-link. It is different from all traffic going through
> the physical MAC instance which is shared by multiple MAC
> clients/VNICs/etc.

OK. As long as the designers of L2 filtering can describe things so
that the user understands that it's on an internal MAC client basis,
rather than related to the physical port, that sounds fine to me. I
would think it's confusing, but there are clear usage scenarios in
either direction, so I don't actually care which one they implement.

-- 
James Carlson, Solaris Networking         <james.d.carlson at sun.com>
Sun Microsystems / 35 Network Drive   71.232W Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757  42.496N Fax +1 781 442 1677
