Hi, here is a new concern that seems to have been introduced by recent microkernel developments, namely Coyotos as well as the secure L4 variants. The problem did not exist in Mach, Minix, or L4.X2. I am not sure whether it existed in EROS (probably not).
The matter concerns the reliability of an RPC mechanism built on top of the IPC primitives, assuming a rather simple client-server model. In recent systems, reliability seems to have decreased, because a failure in the server can lead to indefinite resource allocation in the client if no additional provision is made.

Motivation: One argument for composing operating systems from multiple servers is to increase the robustness of the system. Jorrit Herder (Minix3) presented at the poster session of EuroSys 2006 a mechanism to restart crashed device drivers and other system services (potentially transparently to the user). To achieve this level of robustness, the damage that a crashed server can do needs to be contained to a manageable amount.

Here is an example failure case that I want to see addressed: A client C makes a call to a server S. The server S needs to call a device driver D to implement the service. While S is in the reply phase of the invocation to D, the device driver crashes and is removed from the system. Eventually, the client C gives up and exits (for example on user intervention). Now, what happens to the server S?

In Mach, S moved a "send-once" capability for the reply port to D. At destruction of D's port name space, the kernel would generate a failure message and send it to the reply port. S would thus be notified of the removal of D.

In L4.X2, the calling thread in S would be blocked on a closed wait on the thread ID of the server thread in D. At the destruction of the server thread, the list of waiters queued on that thread is traversed, and the pending IPC system calls are aborted with an error (this is called "unwinding" in the source code).

In the upcoming L4 versions, and in Coyotos, destruction of the receiver of a reply capability does not trigger any action: pending RPCs are not aborted. This is because there is an extra level of indirection between the reply capability and the thread (a first-class receive buffer). In fact, the underlying mechanisms are sufficiently expressive to allow behaviours for which the above semantics are no longer meaningful: for example, there could be copies of the reply capability in different processes (the kernel does not keep reference counts), or the caller could create a new reply capability for the reply endpoint and use it in any imaginable fashion. Still, this lack of kernel support poses an appreciable challenge: if nothing replaces this functionality, the server S in the above scenario will just hang indefinitely in an RPC operation that can never complete. A resource has been leaked permanently.

Here are a couple of ideas for what could replace this functionality:

* Whatever user program destroys the failed server process D also takes care of the users of D. This solution requires significant structural overhead, and creates undesirable strong dependency structures in the system (for example, global managers).

* The program S could use timeouts in the call to D. This solution requires significant structural changes to the system design, because time now becomes an important parameter in evaluating services. One could try to argue that this is desirable anyway.

* Following Mach, special "send-once" capabilities are introduced that implement the send-once semantics (the Mach pattern this would mimic is sketched right after this list).
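For concreteness, here is a minimal sketch of the Mach pattern the third idea refers to, written against the usual <mach/mach.h> and <mach/notify.h> interfaces. The driver_port argument, the message layout, the request id and the error mapping are purely illustrative (this is not a real driver protocol); the part that matters is the send-once notification arriving on the reply port with msgh_id == MACH_NOTIFY_SEND_ONCE:

#include <mach/mach.h>
#include <mach/notify.h>

/* S calls D and blocks for the reply on a freshly allocated reply port,
   handing D a send-once right to that port.  If D's port name space is
   destroyed before it replies, the kernel destroys the orphaned
   send-once right and delivers a send-once notification to the reply
   port instead, so the receive below still terminates.  */
kern_return_t
call_driver (mach_port_t driver_port)   /* illustrative capability to D */
{
  mach_port_t reply_port;
  kern_return_t kr;

  kr = mach_port_allocate (mach_task_self (), MACH_PORT_RIGHT_RECEIVE,
                           &reply_port);
  if (kr != KERN_SUCCESS)
    return kr;

  union
  {
    mach_msg_header_t head;
    char space[128];       /* room for D's reply or the notification */
  } msg;

  msg.head.msgh_bits
    = MACH_MSGH_BITS (MACH_MSG_TYPE_COPY_SEND,        /* request to D */
                      MACH_MSG_TYPE_MAKE_SEND_ONCE);  /* send-once reply right */
  msg.head.msgh_size = sizeof msg.head;
  msg.head.msgh_remote_port = driver_port;
  msg.head.msgh_local_port = reply_port;
  msg.head.msgh_id = 1000;                            /* illustrative request id */

  /* Combined send and receive: block until D replies, or until the
     kernel turns the dead send-once right into a notification.  */
  kr = mach_msg (&msg.head, MACH_SEND_MSG | MACH_RCV_MSG,
                 sizeof msg.head, sizeof msg, reply_port,
                 MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);

  if (kr == KERN_SUCCESS && msg.head.msgh_id == MACH_NOTIFY_SEND_ONCE)
    /* D died before replying; unblock and report an error.  */
    kr = MACH_SEND_INVALID_DEST;        /* illustrative error mapping */

  mach_port_destroy (mach_task_self (), reply_port);
  return kr;
}

The useful property is that the kernel guarantees exactly one message per send-once right: either D's reply, or, if the right dies together with D's port name space, the send-once notification. Either way, the receive in S terminates.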
Here are the semantics expressed in terms of Coyotos: When copied, the source capability is invalidated (so the number of send-once capabilities to a given object is a system invariant under capability copy operations). If a send-once capability is dropped, the kernel generates a message to any enqueued first-class receive buffer. At task destruction, the space bank can scan the capability pages of the destroyed task and drop all (send-once) capabilities. This has the disadvantage of making task destruction somewhat more expensive, but the cost of the cleanup is at least bounded by the number of capabilities the process can allocate, and the destruction of all capabilities does not need to be atomic.

I sort of have my eyes on the last solution. Jonathan, I remember that you did not like the send-once semantics, because (IIRC) it restricts the possible server designs. For example, a server cannot keep several reply capabilities to the same caller in different worker processes. So if the server wants to reply to a message, it needs to make sure that the "send-once" reply capability ends up in the right worker process. However, in the use cases I can think of, there will be some negotiation among the worker processes about who responds to the message anyway, so I cannot really convince myself that this is a serious restriction. Maybe this is not the only reason you were against it.

So, here are a couple of questions:

1) Is RPC robustness desirable/required, or is an alternative model feasible in which machine-local RPC is as unreliable as IP/UDP network communication?

2) If it is indeed desirable, are there more possible solutions than the three approaches described above?

3) Are the costs of destroying send-once rights (and thus sending messages) acceptable? Given a positive answer to 1 and a negative answer to 2, are these costs in fact unavoidable?

4) If we consider persistence, can the same mechanism described above to cope with malicious or buggy software not also be used to deal with the planned and desired removal of device driver servers from the system at reboot of the persistent machine? IOW: As far as I understand, EROS had logic to restart pending RPCs that were sent across the boundary between the persistent and the non-persistent world. The above solution may provide a convenient and consistent approach to recover not only from the accidental loss of a single driver, but also from a planned mass exodus such as a reboot.

Thanks,
Marcus
