On Mon, 2005-10-31 at 00:42 +0100, Bernhard Kauer wrote:
> On Sun, Oct 30, 2005 at 09:40:18AM -0500, Jonathan S. Shapiro wrote:
> I shorten your long mail to some core sentences. Please correct me, if I
> quote you wrong.

Bernhard: thank you for taking the time to answer so thoughtfully. It is clear to me that we are operating from very different assumptions. I would like to see if we can at least clarify the differences a bit better.

> > You assume that the difference in cost between (IPC w/map) vs the cost
> > of 2(IPC w/map) can be ignored. This is incorrect.
>
> No, I assume that the usage of a capability (IPC w/map) in a session-less
> protocol has a significant overhead over a session-based (IPC) protocol and
> that the usage happens more often then a copy via 2(IPC w/map).

This is the assumption that I thought you were making. Let us unpack this assumption and examine it.

In order to implement a sessionless RPC protocol, every CALL (the first IPC) must transmit an endpoint capability. This endpoint capability will later be used by the server to perform a RETURN (the 2nd IPC). If evidence from EROS is a good predictor (and I think it is -- at least for this), we may safely conclude that these CALL/RETURN patterns describe the overwhelming majority of IPCs -- all of the other patterns taken together accounted in EROS for less than 1% of all dynamic IPCs.

Further (again, if EROS is a good predictor) we observed very few IPCs statistically that passed any *other* capability as an argument. The percentage is low enough that we may ignore it for the present discussion.

Therefore, we may assume that ~50% of IPCs in a sessionless protocol will need to transfer exactly one capability. It will not, of course, be exactly 50%, but it will be very close to this.

First question: is this set of statements consistent with your assumptions about how sessionless RPC would work?

Now concerning expense: Yes, I agree that the MAP operation would be a noticeable cost in the protocol described above.
It does not require any manipulation of hardware mapping tables, but there is still the need to traverse the C-space on both the sender and the receiver side of the exchange, and this will take time. Once the sender "slot" (the location in sender memory containing the capability to be transmitted) and the receiver "slot" are identified, we must then deal with the cost of the capability copy itself -- possibly including updates to the mapping database. Depending on the representations of the mapping database and the capabilities, this copy may or may not have a significant cost.

[Please note: I am not sure that these phases separate clearly in the L4 design, so please do not hesitate to correct or clarify.]

In EROS, the slot location phase was very simple, because we did not have C-spaces. All that was necessary was to offset into a capability register vector. The main cost was in the capability copy, and the main source of this cost was the update of the linked lists that linked the source and destination. This update incurred three unnecessary D-cache misses in practice. In Coyotos, we are not using a linked representation, and these extra D-cache misses are eliminated.

As we considered the session/sessionless issue in Coyotos, I became convinced that using the protected payload as a session ID would not scale well, and also presents certain challenges for bootstrapping and isolation. I therefore decided to focus the architecture on supporting sessionless interactions. Sessions are still possible and fast; it simply changes our orientation in looking at the implementation goals a bit.

In examining support for sessionless protocols, we concluded that the big cost in Coyotos was going to be the C-space traversal in sender and receiver to locate the slots. Independent of session/sessionless, this traversal cost is also an issue for locating the send and receive endpoint caps in any IPC.
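
The difference between the two slot-location strategies can be sketched in C. This is only an illustration under stated assumptions, not EROS, Coyotos or L4 code: the names cap_t, proc_t, cnode_t, reg_slot and cspace_slot, the register count, the fan-out of 16 and the four-word capability layout are all hypothetical. The point is that a register-vector lookup is a single indexed access from the process structure, while a C-space lookup performs one dependent load per tree level, each of which may miss in the cache.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_CAP_REGS  32  /* hypothetical number of capability registers */
#define CSPACE_FANOUT 16  /* hypothetical fan-out: 4 address bits per level */

/* Hypothetical capability representation: four machine words. */
typedef struct { uintptr_t w[4]; } cap_t;

/* Hypothetical process structure holding a capability register vector. */
typedef struct { cap_t cap_regs[NUM_CAP_REGS]; } proc_t;

/* Hypothetical C-space node: interior nodes point to children,
   leaf nodes hold capability slots. */
typedef struct cnode {
    int is_leaf;
    union {
        struct cnode *child[CSPACE_FANOUT];
        cap_t         slot[CSPACE_FANOUT];
    } u;
} cnode_t;

/* Register-vector lookup: one bounds check and one indexed access. */
cap_t *reg_slot(proc_t *p, unsigned idx)
{
    return idx < NUM_CAP_REGS ? &p->cap_regs[idx] : NULL;
}

/* C-space lookup: walk at most `depth` levels (depth <= 8 for a 32-bit
   address), consuming 4 address bits per level. `*levels` reports how
   many nodes were touched, i.e. how many dependent loads were needed. */
cap_t *cspace_slot(cnode_t *root, uint32_t addr, unsigned depth,
                   unsigned *levels)
{
    cnode_t *n = root;
    *levels = 0;
    for (unsigned level = 0; level < depth && n != NULL; level++) {
        unsigned shift = 4 * (depth - 1 - level);
        unsigned i = (addr >> shift) & (CSPACE_FANOUT - 1);
        (*levels)++;
        if (n->is_leaf)
            return &n->u.slot[i];
        n = n->u.child[i];
    }
    return NULL; /* ran out of levels or hit an empty child pointer */
}
```

In this sketch a sender and a receiver that both keep their endpoints in registers pay for two calls to reg_slot, whereas locating the same slots through a two-level C-space touches four nodes before the copy can even begin.
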
We have therefore concluded that Coyotos needs to support capability registers in addition to C-spaces, and that in the normal case the send and receive endpoints will live in these registers.

With all of this exposed, I believe it may now be clearer why I do not think the capability copy is expensive. It involves a cache-aligned four-word move from one process to the other, with a scaled-index addressing mode that uses the respective process structure pointers as base pointers. The only important cost in this is the D-cache line load in the target space.

I am *guessing* that this is significantly lower than the cost that you anticipate for the MAP operation. It is one (of several) reasons that I argued at Dresden that MAP was not the right starting point for designing a capability transfer mechanism. Various people at Dresden argued that optimizations could address this, and in some parts they were correct, but it still appears to me that the need to update the mapping database is inherently more complicated than a simple word copy.

So: I agree that there is some overhead in transferring the capability, but I think it is much lower than you appear to believe. Given my explanation, I have two further questions:

Second question: do you agree that this difference in designs would largely explain our differing views about the costs of a sessionless protocol?

Third question: If so, do you agree that in the design I have outlined, the marginal cost of the capability copy is not likely to be significant in real-world use (it will of course be measurable in microbenchmarks, but who cares)?

> > However, based on those measurements, I think that the cost of IPC+COPY
> > for one capability in 50% of transfers (namely: the reply endpoint,
> > which is only passed in the CALL phase of the round-trip RPC) is not
> > measurably different from the cost of IPC without any transfer.
>
> L4-IPC paths are today so heavily optimized that the influence of TLB and
> even Cache misses are significant. Doing a copy involves at least 2 cache
> misses by using capability registers. Capability-spaces add at least 1
> additional cache and TLB miss to this number. I do not think that this is
> not measurably.

I agree that these numbers appear and are significant in the microbenchmarks, but the microbenchmarks are misleading here. It is not a question of *whether* you will take extra misses, but *where*.

If the kernel does not support the sessionless protocol, then the server must execute the lookups necessary to map from session identifiers to the appropriate endpoint capabilities. The cache and TLB misses associated with this activity must be accounted as part of the IPC costs in order to make an accurate comparison.

Further, I think that the microbenchmark numbers for the capability registers design are not as different as you believe. There are zero marginal TLB misses and I don't remember any marginal D-cache misses. The absence of additional TLB misses is easily explained because of the layout of our process structure. I cannot really explain the absence of a major difference in D-cache behavior, unless it occurs because the flow of our IPC path is slightly different from the L4 path, and the complexities appear in different places (we had cap transfer, L4 had map items).

My high level comment is that merely shifting an unavoidable burden from the kernel to the application is not an optimization!

> > But further, you are saying, in effect, that the high-performance case
> > (the session-based case) requires complex storage management code in
> > each server combined with a multi-party trust relationship to ensure
> > cleanup.
>
> The storage management code is needed in a session-based server anyway.
> And there will be session-based servers, like the network or a window server,
> in a complex system.

Yes. But this does not answer my point.
I agree that there are a very small number of servers that must accept and manage this complexity. My objection is that this should not extend to *all* servers. And in fact, if you look at the EROS window system, you will discover that it is sessionless. So is our current ethernet driver (though this one needs to change).

> > Whether the "extensible badge" approach will work well in practice is
> > something that only time and usage will permit us to learn.
>
> The current L4.sec specification does not have an "extensible" badge.
> Up to now we do not found a usage scenario that needs one.

Interesting! Thank you for letting me know.

> Jonathan:
>
> A point which is not decidable without the protocols and the user-level
> applications is: How often is a capability transfer compared to the
> usage-case?
> Perhaps you can say something to this ratio in EROS.

This is heavily dependent on the sessionless vs. sessionful design. If the protocol is sessionless, then the answer is that ~50% of transfers involve exactly one capability. If the protocol is session-based, then the answer is that almost no IPCs will transfer capabilities.

If we ignore the reply capabilities, the next most common reason for cap transfer is the wrapper pattern, and the most important examples of this are the space bank and the memory fault handlers. In this pattern, the capability is synthesized inside the kernel and delivered to a capability "register", so there is no slot search and no cold cache issue on the sender side, and in practice none on the receiver side. I am not sure that this pattern will provide any useful prediction for the purposes of estimating L4.sec usage patterns, but I think it does not matter, because the dynamic frequency of this pattern is very very low.

shap

_______________________________________________
L4-hurd mailing list
http://lists.gnu.org/mailman/listinfo/l4-hurd
