Re: A comment about changing kernels

Jonathan S. Shapiro Mon, 31 Oct 2005 07:24:13 -0800

On Mon, 2005-10-31 at 00:42 +0100, Bernhard Kauer wrote:
> On Sun, Oct 30, 2005 at 09:40:18AM -0500, Jonathan S. Shapiro wrote:


> I shorten your long mail to some core sentences. Please correct me, if I
> quote you wrong.

Bernhard: thank you for taking the time to answer so thoughtfully. It is
clear to me that we are operating from very different assumptions. I
would like to see if we can at least clarify the differences a bit
better.

> > You assume that the difference in cost between (IPC w/map) vs the cost
> > of 2(IPC w/map) can be ignored. This is incorrect.
> 
> No, I assume that the usage of a capability (IPC w/map) in a session-less
> protocol has a significant overhead over a session-based (IPC) protocol and
> that the usage happens more often than a copy via 2(IPC w/map).

This is the assumption that I thought you were making. Let us unpack
this assumption and examine it.

In order to implement a sessionless RPC protocol, every CALL (the first
IPC) must transmit an endpoint capability. This endpoint capability will
later be used by the server to perform a RETURN (the 2nd IPC). If
evidence from EROS is a good predictor (and I think it is -- at least
for this), we may safely conclude that these CALL/RETURN patterns
describe the overwhelming majority of IPCs -- all of the other patterns
taken together accounted in EROS for less than 1% of all dynamic IPCs.

Further (again, if EROS is a good predictor), very few IPCs passed any
*other* capability as an argument: a low enough percentage that we may
ignore it for the present discussion.

Therefore, we may assume that ~50% of IPCs in a sessionless protocol
will need to transfer exactly one capability. It will not, of course, be
exactly 50%, but it will be very close to this.
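
To make the arithmetic concrete, here is a minimal sketch of the estimate. It is a hypothetical model, not EROS measurement code: it assumes each sessionless RPC is one CALL carrying exactly one capability (the reply endpoint) plus one RETURN carrying none, and that all remaining IPCs carry no capability. The names `ipc_mix` and `cap_transfer_fraction` are invented for illustration.

```c
#include <assert.h>

/* Hypothetical model; names and structure are illustrative assumptions.
 * Each sessionless RPC is one CALL (carries the reply endpoint cap)
 * followed by one RETURN (carries no cap). */
struct ipc_mix {
    unsigned long rpcs;       /* number of CALL/RETURN round trips */
    unsigned long other_ipcs; /* all other IPC patterns, assumed cap-free */
};

/* Fraction of all IPCs that transfer exactly one capability. */
double cap_transfer_fraction(struct ipc_mix m)
{
    unsigned long total = 2 * m.rpcs + m.other_ipcs;

    assert(total > 0);
    return (double)m.rpcs / (double)total;
}
```

With 990 round trips and 20 other IPCs (1% of 2000 in total), the function returns 0.495, which is the "very close to 50%" figure under these assumptions.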

First question: is this set of statements consistent with your
assumptions about how sessionless RPC would work?


Now concerning expense:

Yes, I agree that the MAP operation would be a noticeable cost in the
protocol described above. It does not require any manipulation of
hardware mapping tables, but there is still the need to traverse the
C-space on both the sender and the receiver side of the exchange, and
this will take time.

Once the sender "slot" (the location in sender memory containing the
capability to be transmitted) and the receiver "slot" are identified, we
must then deal with the cost of the capability copy itself -- possibly
including updates to the mapping database. Depending on the
representations of the mapping database and the capabilities, this copy
may or may not have a significant cost.

[Please note: I am not sure that these phases separate clearly in the L4
design, so please do not hesitate to correct or clarify.]

In EROS, the slot location phase was very simple, because we did not
have C-spaces. All that was necessary was to offset into a capability
register vector. The main cost was in the capability copy, and the main
source of this cost was the update of the linked lists that linked the
source and destination. This update incurred three unnecessary D-cache
misses in practice. In Coyotos, we are not using a linked
representation, and these extra D-cache misses are eliminated.
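
Why a linked representation touches extra cache lines can be illustrated with a small sketch. This is not EROS source. It assumes, purely for illustration, that every capability naming the same object sits on one circular doubly linked ring, so that overwriting a destination slot means unlinking it from its old ring and splicing it into the source's ring. All names (`linked_cap`, `linked_cap_copy`) are hypothetical.

```c
#include <assert.h>

/* Hypothetical linked capability: every capability naming the same
 * object sits on one circular doubly linked ring. */
struct link {
    struct link *prev, *next;
};

struct linked_cap {
    struct link chain;
    unsigned long payload;
};

void linked_cap_init(struct linked_cap *c, unsigned long payload)
{
    c->chain.prev = c->chain.next = &c->chain;
    c->payload = payload;
}

/* Overwrite dst with a copy of src. Returns how many stores went to
 * ring nodes other than src and dst themselves. */
int linked_cap_copy(struct linked_cap *dst, struct linked_cap *src)
{
    struct link *d = &dst->chain, *s = &src->chain;
    struct link *p, *n, *sn;
    int neighbor_stores = 0;

    assert(dst != src);

    /* 1. Unlink dst from the ring of the capability it used to hold. */
    p = d->prev;
    n = d->next;
    neighbor_stores += (p != d && p != s);
    neighbor_stores += (n != d && n != s);
    p->next = n;
    n->prev = p;

    /* 2. Splice dst into src's ring, directly after src. */
    sn = s->next;
    neighbor_stores += (sn != s);
    d->prev = s;
    d->next = sn;
    sn->prev = d;
    s->next = d;

    dst->payload = src->payload;
    return neighbor_stores;
}
```

In this model, a destination with two distinct ring neighbors and a source with at least one neighbor produce three stores to nodes that the copy would otherwise never touch. Whether this is exactly where the three misses observed in EROS came from is an assumption of the sketch, not something the text above confirms.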

As we considered the session/sessionless issue in Coyotos, I became
convinced that using the protected payload as a session ID would not
scale well, and also presents certain challenges for bootstrapping and
isolation. I therefore decided to focus the architecture on supporting
sessionless interactions. Sessions are still possible and fast; it
simply changes our orientation in looking at the implementation goals a
bit.

In examining support for sessionless protocols, we concluded that the
big cost in Coyotos was going to be the C-space traversal in sender and
receiver to locate the slots. Independent of session/sessionless, this
traversal cost is also an issue for locating the send and receive
endpoint caps in any IPC.

We have therefore concluded that Coyotos needs to support capability
registers in addition to C-spaces, and that in the normal case the send
and receive endpoints will live in these registers.

With all of this exposed, I believe it may now be clearer why I do not
think the capability copy is expensive. It involves a cache-aligned
four-word move from one process to the other, using a scaled-index
addressing mode with the respective process structure pointers as base
pointers. The only important cost in this is the D-cache line load in
the target space.
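
A minimal sketch of such a copy follows. It is not Coyotos source: the struct layout, the register count `NUM_CAP_REGS`, and the function name `cap_copy` are assumptions made for illustration. The only properties taken from the description above are that a capability is four words, that it is aligned so it never straddles a cache line, and that the registers are indexed off the process structure.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CAP_REGS 32 /* assumed register count, for illustration only */

/* Four machine words, aligned to their own size so that one capability
 * never straddles a cache line (assuming lines of at least that size). */
struct capability {
    _Alignas(4 * sizeof(uintptr_t)) uintptr_t w[4];
};

/* Hypothetical process structure; other per-process state is omitted. */
struct process {
    struct capability cap_regs[NUM_CAP_REGS];
};

/* Copy one capability register from the sender to the receiver.
 * The array indexing compiles to base-plus-scaled-index addressing. */
void cap_copy(struct process *to, unsigned to_reg,
              const struct process *from, unsigned from_reg)
{
    assert(to_reg < NUM_CAP_REGS && from_reg < NUM_CAP_REGS);
    to->cap_regs[to_reg] = from->cap_regs[from_reg];
}
```

There is no list maintenance and no mapping database update in this path; the struct assignment is the whole operation.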

I am *guessing* that this is significantly lower than the cost that you
anticipate for the MAP operation. It is one (of several) reasons that I
argued at Dresden that MAP was not the right starting point for
designing a capability transfer mechanism. Various people at Dresden
argued that optimizations could address this, and in some parts they
were correct, but it still appears to me that the need to update the
mapping database is inherently more complicated than simple word copy.

So: I agree that there is some overhead in transferring the capability,
but I think it is much lower than you appear to believe. Given my
explanation, I have two further questions:

Second question: do you agree that this difference in designs would
largely explain our differing views about the costs of a sessionless
protocol?

Third question: If so, do you agree that in the design I have outlined,
the marginal cost of the capability copy is not likely to be significant
in real-world use? (It will of course be measurable in microbenchmarks,
but who cares.)

> > However, based on those measurements, I think that the cost of IPC+COPY
> > for one capability in 50% of transfers (namely: the reply endpoint,
> > which is only passed in the CALL phase of the round-trip RPC) is not
> > measurably different from the cost of IPC without any transfer.
> 
> L4-IPC paths are today so heavily optimized that the influence of TLB and
> even Cache misses are significant. Doing a copy involves at least 2 cache
> misses by using capability registers. Capability-spaces add at least 1
> additional cache and TLB miss to this number. I do not think that this is
> not measurable.

I agree that these numbers appear and are significant in the
microbenchmarks, but the microbenchmarks are misleading here. It is not
a question of *whether* you will take extra misses, but *where*. If the
kernel does not support the sessionless protocol, then the server must
execute the lookups necessary to map from session identifiers to the
appropriate endpoint capabilities. The cache and TLB misses associated
with this activity must be accounted as part of the IPC costs in order
to make an accurate comparison.
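
The server-side lookup in question might look like the sketch below. It is a hypothetical illustration, not code from any L4 or EROS server: the table size, the `cap_slot_t` type (standing in for wherever the server keeps the client's reply endpoint), and the function names are all assumptions. The point is only that each probe is a memory access in the server's address space, with its own potential cache and TLB misses.

```c
#include <assert.h>
#include <stdint.h>

#define SESSION_SLOTS 64 /* assumed table size; must be a power of two */

/* Hypothetical: index of a reply endpoint capability held by the server. */
typedef uint32_t cap_slot_t;

struct session_entry {
    uint32_t   session_id;
    cap_slot_t reply_ep;
    int        in_use;
};

struct session_table {
    struct session_entry e[SESSION_SLOTS];
};

/* Insert or update a session using linear probing. Returns 0 on
 * success, -1 if the table is full. */
int session_insert(struct session_table *t, uint32_t id, cap_slot_t ep)
{
    assert(t);
    for (unsigned i = 0; i < SESSION_SLOTS; i++) {
        struct session_entry *s = &t->e[(id + i) & (SESSION_SLOTS - 1)];
        if (!s->in_use || s->session_id == id) {
            s->session_id = id;
            s->reply_ep = ep;
            s->in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* Work the server must do on every request when the kernel does not
 * deliver the reply endpoint: map session ID to reply endpoint. */
int session_lookup(const struct session_table *t, uint32_t id,
                   cap_slot_t *ep_out)
{
    assert(t && ep_out);
    for (unsigned i = 0; i < SESSION_SLOTS; i++) {
        const struct session_entry *s = &t->e[(id + i) & (SESSION_SLOTS - 1)];
        if (!s->in_use)
            return -1; /* empty slot ends the probe sequence */
        if (s->session_id == id) {
            *ep_out = s->reply_ep;
            return 0;
        }
    }
    return -1;
}
```

The sketch omits session teardown entirely; a real server would also need the storage management and cleanup code discussed further down.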

Further, I think that the microbenchmark numbers for the capability
registers design are not as different as you believe. There are zero
marginal TLB misses and I don't remember any marginal D-cache misses.
The absence of additional TLB misses is easily explained because of the
layout of our process structure. I cannot really explain the absence of
a major difference in D-cache behavior, unless it occurs because the
flow of our IPC path is slightly different from the L4 path, and the
complexities appear in different places (we had cap transfer, L4 had map
items).

My high level comment is that merely shifting an unavoidable burden from
the kernel to the application is not an optimization!

> > But further, you are saying, in effect, that the high-performance case
> > (the session-based case) requires complex storage management code in
> > each server combined with a multi-party trust relationship to ensure
> > cleanup.
> 
> The storage management code is needed in a session-based server anyway.
> And there will be session-based servers, like the network or a window server,
> in a complex system.

Yes. But this does not answer my point. I agree that there are a very
small number of servers that must accept and manage this complexity. My
objection is that this should not extend to *all* servers.

And in fact, if you look at the EROS window system, you will discover
that it is sessionless. So is our current ethernet driver (though this
one needs to change).

> > Whether the "extensible badge" approach will work well in practice is
> > something that only time and usage will permit us to learn.
> 
> The current L4.sec specification does not have an "extensible" badge.
> Up to now we have not found a usage scenario that needs one.

Interesting! Thank you for letting me know.

> Jonathan:
> 
> A point which is not decidable without the protocols and the user-level
> applications is: How often is a capability transfer compared to the 
> usage-case?
> Perhaps you can say something to this ratio in EROS.

This is heavily dependent on the sessionless vs. sessionful design. If
the protocol is sessionless, then the answer is that ~50% of transfers
involve exactly one capability. If the protocol is session-based, then
the answer is that almost no IPCs will transfer capabilities.

If we ignore the reply capabilities, the next most common reason for cap
transfer is the wrapper pattern, and the most important examples of this
are the space bank and the memory fault handlers. In this pattern, the
capability is synthesized inside the kernel and delivered to a
capability "register", so there is no slot search and no cold cache
issue on the sender side, and in practice none on the receiver side.

I am not sure that this pattern will provide any useful prediction for
the purposes of estimating L4.sec usage patterns, but I think it does
not matter, because the dynamic frequency of this pattern is very very
low.

shap



_______________________________________________
L4-hurd mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/l4-hurd