Re: Reliability of RPC services

Marcus Brinkmann Fri, 21 Apr 2006 18:55:02 -0700

At Fri, 21 Apr 2006 20:16:40 -0400,
"Jonathan S. Shapiro" <[EMAIL PROTECTED]> wrote:
> 
> Marcus raises a good point, but he is missing some history.


Thanks for filling that in!

> In KeyKOS, the "resume capability" served the role of a reply
> capability. When a node containing a resume capability was destroyed (by
> the space bank), a well-defined, distinguished message was sent to the
> recipient for all resume capabilities contained in that node.
> 
> In EROS, this behavior was dropped, and in my opinion that was a
> mistake.
> 
> In Coyotos, there is no way to distinguish resume capabilities from
> entry capabilities (or at least, not at the moment) so it is difficult
> to duplicate the KeyKOS behavior at the moment, but see below.
> 
> In any persistent system, "notify on last capability drop" is
> impractical. It requires disk garbage collection, so the delay is too
> long to be helpful.

Yes.
 
> It would not be difficult to add a bit to an FCRB capability to support
> this. We could call it the "invoke on delete" bit.
> 
> Here is the meaning of this bit:
> 
>   On destruction of any object, the capability slots are examined.
>   For any slot that contains an "invoke on delete" FCRB sender
>   capability, a non-blocking message will be sent indicating that
>   the capability was held within a deleted object at the time of
>   deletion.
> 
>   If sending this message would block, it will not be delivered.
> 
>   If the FCRB sender capability is invalid, no message will be sent.
> 
> HOWEVER:
> 
> This message does NOT mean that all capabilities to the FCRB are gone.
> It means that *some* object containing the capability has been
> destroyed. If there are multiple copies of the capability in different
> objects, and one of these objects is destroyed, the message will be
> sent. Programs can take deliberate steps to suppress this behavior, but
> this would be the normal outcome.
> 
> This is not quite the semantics that Marcus is after, but in practice it
> was good enough in KeyKOS.
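
As a rough illustration of the semantics quoted above, a toy model in
Python might look like the following.  It is only a sketch: the names
FCRB, SenderCap, KernelObject, destroy and DELETED_NOTICE are
hypothetical, the receive buffer is assumed to hold at most one pending
message, and nothing here is taken from the Coyotos specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional

DELETED_NOTICE = "holder-deleted"  # hypothetical payload of the notice


@dataclass
class FCRB:
    """Toy receive buffer; assumed to hold at most one pending message."""
    pending: Optional[str] = None

    def try_send(self, msg: str) -> bool:
        # Non-blocking send: if the buffer is occupied, give up at once.
        if self.pending is not None:
            return False
        self.pending = msg
        return True


@dataclass
class SenderCap:
    target: FCRB
    invoke_on_delete: bool = False
    valid: bool = True


@dataclass
class KernelObject:
    slots: List[Optional[SenderCap]] = field(default_factory=list)


def destroy(obj: KernelObject) -> int:
    """Examine all slots of a dying object; return the notices delivered."""
    delivered = 0
    for cap in obj.slots:
        if cap is None or not cap.invoke_on_delete:
            continue
        if not cap.valid:
            continue  # invalid sender capability: no message is sent
        if cap.target.try_send(DELETED_NOTICE):
            delivered += 1  # otherwise the send would block and is dropped
    obj.slots.clear()
    return delivered
```

Destroying one of two objects that each hold a copy already delivers
the notice, although the other copy is still alive; this is the caveat
under "HOWEVER" above.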

Interesting that you propose this.  I thought about this earlier
today, and called it a "send at least once" capability when describing
it to Neal.  Neal pointed out that this may be inconvenient in actual
practice: If a server S wants to propagate (forward) a request to
another server T, then S must be careful not to destroy its own copy
of the reply capability before T has had a chance to reply.  In such a
scenario, the "send-once" semantics, where the capability is moved
rather than copied, seem to be easier to handle (and more efficient).
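
The forwarding scenario can be sketched as a small Python model.  This
is an illustration under loud assumptions: ReplyBuffer, ReplyCap and
forward are hypothetical names, the buffer simply queues every message
(a real FCRB would drop a notice that would block), and a capability
is treated as consumed once it has been used to reply.

```python
class ReplyBuffer:
    """Toy stand-in for the client's receive buffer."""

    def __init__(self):
        self.messages = []


class ReplyCap:
    def __init__(self, buffer):
        self.buffer = buffer
        self.valid = True

    def copy(self):
        # Copy semantics: source and copy both remain valid.
        return ReplyCap(self.buffer)

    def move(self):
        # Send-once semantics: handing the capability on invalidates the source.
        if not self.valid:
            raise ValueError("capability already consumed")
        self.valid = False
        return ReplyCap(self.buffer)

    def reply(self, msg):
        if not self.valid:
            raise ValueError("capability already consumed")
        self.buffer.messages.append(msg)
        self.valid = False

    def destroy(self):
        # Destroying a still-valid reply capability notifies the client.
        if self.valid:
            self.buffer.messages.append("holder-deleted")
        self.valid = False


def forward(transfer):
    """S hands its reply capability to T, then cleans up before T replies."""
    client = ReplyBuffer()
    s_cap = ReplyCap(client)
    t_cap = transfer(s_cap)
    s_cap.destroy()
    t_cap.reply("result from T")
    return client.messages
```

With forward(ReplyCap.copy) the client first sees a "holder-deleted"
notice although T is still working on the request; with
forward(ReplyCap.move) it only sees T's reply.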

Are there scenarios where the semantics you describe are favorable?  I
think I may be able to construct abstract ones, but I am not sure if
there are actually interesting use cases.

> If this is sufficiently helpful to justify revising the Coyotos spec,
> please send a note to coyotos-dev confirming that this update should be
> made.

It seems to me that both semantics do the job.  The semantics you
describe have the advantage that they minimize the differences in
behaviour of the two capability types.  Neal's reservation, however,
IMO justifies taking a closer look at server design use cases.

> > * Whatever user program destroys the failed server process D, also
> >   takes care of the users of the process D.  This solution requires
> >   significant structural overhead, and creates undesirable strong
> >   dependency structures in the system (for example, global managers).
> 
> This solution is impossible. The storage containing those capabilities
> is gone, and the party who destroys that storage usually does not have
> access to the content of the storage. In particular, the authority to
> destroy a space bank specifically does NOT include the authority to
> inspect storage that has been allocated by that bank. This is absolutely
> essential for confinement and *any* security policy.

I was trying to also enumerate some potential options that would apply
to different system designs.

> > * The program S could use timeouts in the call to D.  This solution
> >   requires significant structural changes to the system design,
> >   because now time becomes an important parameter in evaluating
> >   services.  It can be tried to argue that this is desirable anyway.
> 
> This solution leads directly to systems that fail under load. There is
> general agreement in both the L4 and EROS/Coyotos communities that
> generalized timeouts were a mistake, and that "forever" and "don't wait"
> are the only options that should be implemented by the IPC layer.

It is clear that such a naive approach would not work.  However, in
the context of specific systems (for example real-time systems) there
seem to be solutions to this problem, e.g. using scheduler donations or
priority inversion avoidance protocols.  Again, I was trying to look
beyond the specific type of system we are considering for ourselves.

> > * Following Mach, special "send-once" capabilities are introduced that
> >   implement the send-once semantics.  Here are the semantics expressed
> >   in terms of Coyotos: When copied, the source capability is
> >   invalidated (so the number of send-once capabilities to a given
> >   object is a system invariant under capability copy operations). 
> 
> The semantics of send-once rights is an abomination. The cost of them is
> considerable, and the overhead of manipulating them correctly from the
> application perspective is a serious problem. Coyotos will not under any
> circumstances implement "send-once" or "grant-only" capabilities.

Well, let me try to understand the cost factor.  As far as I see it,
the only cost involved in the copy operation is setting a data field
in the source capability to mark it as invalid.
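
To make the cost argument concrete, here is a minimal sketch of such a
copy operation in Python.  The Slot layout and the helpers copy_cap
and count_send_once are hypothetical and only model the idea; they say
nothing about how a real kernel lays out capabilities.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    target: Optional[str] = None  # name of the designated object (toy model)
    send_once: bool = False
    valid: bool = False


def copy_cap(slots: List[Slot], src: int, dst: int) -> None:
    """Copy slots[src] into slots[dst]; a send-once source is invalidated."""
    source = slots[src]
    if not source.valid:
        raise ValueError("source slot holds no valid capability")
    slots[dst] = Slot(source.target, source.send_once, True)
    if source.send_once:
        source.valid = False  # the one extra store discussed above


def count_send_once(slots: List[Slot], target: str) -> int:
    """Number of valid send-once capabilities designating target."""
    return sum(1 for s in slots
               if s.valid and s.send_once and s.target == target)
```

In this model the number of valid send-once capabilities to an object
is the same before and after copy_cap, while ordinary capabilities are
duplicated as usual.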

Also, the semantics you propose require the kernel to make an attempt
to send a message for every destroyed reply capability, which entails
accessing the FCRB (cache pressure?).  And it has to do so even though,
in the common case (>99.999% of the cases), the FCRB will already
contain the reply message and thus not be available.

For the application perspective, see above.  This capability type is
meant to be used for reply capabilities, whose manipulation requires
special care either way.

> >   This has the disadvantage that it makes task destruction somewhat
> >   more expensive...
> 
> Worse, it has the disadvantage that every capability copy must be
> preceded by a capability type check, so that the sender knows whether it
> is losing the capability as a side effect. This violates encapsulation
> in a fairly fundamental way.

Good point.  However, I think that it would suffice to apply such
checks on incoming capabilities rather than outgoing, for those places
where it is actually relevant.  In many situations, the check can
probably be omitted (for example if the capability will be dropped
anyway).

So, to summarize: I don't yet understand why you consider send-once
capabilities to be expensive, and I am not sure that the
attempt-to-send-on-every-destroy approach is actually easier to handle at the
application level (for the case where you actually copy the capability
at the server side).  What we consider to be the "common" use case
(propagation) seems to prefer send-once semantics.

Either way it seems that we are so close that the discussion is
probably more appropriate for coyotos-dev rather than here, if the
above details need to be straightened out.  OTOH, if you can quickly
resolve the above doubts, I might just as well post only the final
result.  Your call.  I don't mind setting up a discussion thread
starting with a (much) more condensed version of my first mail
incorporating the current state.

Thanks,
Marcus



_______________________________________________
L4-hurd mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/l4-hurd
