On Mon, Nov 30, 2009 at 06:01:17PM +0100, Michael Hanselmann wrote:
> Signed-off-by: Michael Hanselmann <[email protected]>
> ---
>  doc/design-2.2.rst |  123 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 123 insertions(+), 0 deletions(-)
> 
> diff --git a/doc/design-2.2.rst b/doc/design-2.2.rst
> index cc78b51..f43146c 100644
> --- a/doc/design-2.2.rst
> +++ b/doc/design-2.2.rst
> @@ -33,6 +33,129 @@ As for 2.1 we divide the 2.2 design into three areas:
>  Core changes
>  ------------
>  
> +Remote procedure call timeouts
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Current state and shortcomings
> +++++++++++++++++++++++++++++++
> +
> +The current RPC protocol used by Ganeti is based on HTTP. Every request
> +consists of an HTTP PUT request (e.g. ``PUT /hooks_runner HTTP/1.0``)
> +and doesn't return until the function called has returned. Parameters
> +and return values are encoded using JSON.
> +
> +On the server side, ``ganeti-noded`` handles every incoming connection
> +in a separate process by forking just after accepting the connection.
> +This process exits after sending the response.
> +
> +There is one major problem with this design: timeouts cannot be used on
> +a per-request basis. Neither client nor server knows how long a request
> +will take. Even if we might be able to group requests into different
> +categories (e.g. fast and slow), this is not reliable.
> +
> +If a node has an issue or the network connection fails while a request
> +is being handled, the master daemon can wait for a long time for the
> +connection to time out (due to the operating system's underlying TCP
> +keep-alive packets or timeouts). While the settings for keep-alive
> +packets can be changed using Linux-specific socket options, we don't
> +consider them reliable and responsive enough for our case.

This is not really fair or correct: we already use and rely on system-wide
socket timeouts for the 'node down' case, and it works.
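
For reference, the Linux-specific socket options mentioned in the quoted
paragraph can also be set per socket rather than system-wide. A minimal
Python sketch follows; the helper name ``enable_keepalive`` and the values
(30s idle, 5s interval, 3 probes) are my own assumptions for illustration,
not anything Ganeti does today. With these values a dead peer would be
noticed after roughly 30 + 3 * 5 = 45 seconds.

```python
import socket


def enable_keepalive(sock, idle=30, interval=5, probes=3):
    """Turn on TCP keep-alive for one socket (hypothetical helper).

    The TCP_KEEP* constants are platform-specific, so each one is only
    applied if the socket module exposes it on this system.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {}
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied[name] = value
    # Report which tunables were actually set on this platform.
    return applied
```

Note that this only detects a dead peer or network path; it says nothing
about a function that is merely slow, which is the case the proposal is
really about.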

> +
> +Proposed changes
> +++++++++++++++++
> +
> +Protocol
> +^^^^^^^^
> +
> +Initially we chose HTTP as our RPC protocol because there were existing
> +libraries, which, unfortunately, turned out to miss important features
> +(such as SSL certificate authentication) and we had to write our own.
> +
> +This proposal can easily be implemented using HTTP, though it would
> +likely be more efficient and less complicated to use the LUXI protocol
> +already used to communicate between client tools and the Ganeti master
> +daemon.

I'm not sure I understand here: what is the actual proposal, to switch to
LUXI or to stay with HTTP?

> +
> +The LUXI protocol currently contains two functions, ``WaitForJobChange``
> +and ``AutoArchiveJobs``, which can take a longer time. They both support
> +a parameter to specify the timeout. This timeout is usually chosen as
> +roughly half of the socket timeout, guaranteeing a response before the
> +socket times out. After the specified amount of time,
> +``AutoArchiveJobs`` returns and reports the number of archived jobs.
> +``WaitForJobChange`` returns and reports a timeout. In both cases, the
> +functions can be called again.
> +
> +A similar model can be used for the inter-node RPC protocol. In some
> +sense, the node daemon will implement a light variant of *"node daemon
> +jobs"*. When the function call is sent, it specifies an initial timeout.
> +If the function didn't finish within this timeout, a response is sent
> +with a unique identifier. The client can then choose to wait for the
> +function to finish again with a timeout. Inter-node RPC calls would no
> +longer be blocking indefinitely and there would be an implicit
> +ping-mechanism.

ACK.
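
To check my understanding of the client side, here is a rough Python sketch
of the call-then-poll loop. All names (``call_with_polling`` and the two
callables passed in) are made up for illustration, and I am assuming that
both calls return a (call ID, finished, result) triple; none of this is
existing Ganeti code.

```python
import itertools


def call_with_polling(start_function, wait_for_function, fn_name, fn_args,
                      timeout=10.0, max_rounds=None):
    """Run one RPC call using a start/wait model (illustrative only).

    start_function(fn_name, fn_args, timeout) and
    wait_for_function(call_id, timeout) are hypothetical callables that
    return a (call_id, finished, result) triple.
    """
    call_id, finished, result = start_function(fn_name, fn_args, timeout)
    if max_rounds is None:
        rounds = itertools.count()      # poll until the function finishes
    else:
        rounds = range(max_rounds)      # give up after this many waits
    for _ in rounds:
        if finished:
            return result
        # Every reply, even "not finished yet", shows the node is alive,
        # which is the implicit ping mechanism.
        call_id, finished, result = wait_for_function(call_id, timeout)
    if finished:
        return result
    raise TimeoutError("call %s did not finish after %d waits"
                       % (call_id, max_rounds))
```

The ``max_rounds`` cap is my addition; the quoted text leaves it to the
client to decide whether to keep waiting.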

> +
> +Request handling
> +^^^^^^^^^^^^^^^^
> +
> +To support the protocol changes described above, the way the node daemon
> +handles requests will have to change. Instead of forking and handling
> +every connection in a separate process, there should be one fork per
> +RPC function call and the master process will handle the communication
> +with clients and the function handling processes.

OK.

> +
> +Function processes communicate with the parent process via stdio and
> +possibly their exit status. Every function process has a unique
> +identifier, though it shouldn't be the process ID (PIDs can be recycled
> +and are prone to race conditions for this use case).

(I wonder whether a PID combined with another ID would not be unique enough.)

> +
> +The following operations will be supported:
> +
> +``StartFunction(fn_name, fn_args, timeout)``
> +  Starts a function specified by ``fn_name`` with arguments in
> +  ``fn_args`` and waits up to ``timeout`` seconds for the function
> +  to finish. Fire-and-forget calls can be made by specifying a timeout
> +  of 0 seconds (e.g. for powercycling the node). Returns three values:
> +  function call ID (if not finished), whether function finished (or
> +  timeout) and the function's return value.
> +``WaitForFunction(fnc_id, timeout)``
> +  Waits up to ``timeout`` seconds for the function call to finish.
> +  Returns the same values as ``StartFunction``.
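
To make the two operations concrete, here is a small Python sketch of how
I read them: one fork per call, the result passed back over a pipe as JSON,
and a random ID instead of the PID. The ``FUNCTIONS`` registry, the
``_calls`` table and the use of ``uuid``/``select`` are my assumptions, not
part of the patch. The sketch handles one wait at a time and does not
attempt to serve other clients concurrently.

```python
import json
import os
import select
import time
import uuid


def _add(a, b):
    return a + b


def _slow_echo(delay, value):
    time.sleep(delay)
    return value


# Hypothetical registry of node functions (illustration only).
FUNCTIONS = {"add": _add, "slow_echo": _slow_echo}

# Call ID -> (child PID, read end of the result pipe).
_calls = {}


def start_function(fn_name, fn_args, timeout):
    """Fork a function process, then wait up to ``timeout`` seconds."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: run the function, write the JSON result, exit at once.
        os.close(read_fd)
        try:
            result = FUNCTIONS[fn_name](*fn_args)
            os.write(write_fd, json.dumps(result).encode("utf-8"))
        finally:
            os._exit(0)
    os.close(write_fd)
    call_id = uuid.uuid4().hex  # random ID, so PID reuse does not matter
    _calls[call_id] = (pid, read_fd)
    return wait_for_function(call_id, timeout)


def wait_for_function(call_id, timeout):
    """Return (call_id, finished, result) as in the quoted description."""
    pid, read_fd = _calls[call_id]
    readable, _, _ = select.select([read_fd], [], [], timeout)
    if not readable:
        return (call_id, False, None)  # timed out, caller may wait again
    chunks = []
    while True:  # the child writes once and exits, so EOF follows quickly
        data = os.read(read_fd, 4096)
        if not data:
            break
        chunks.append(data)
    os.close(read_fd)
    os.waitpid(pid, 0)  # reap the child
    del _calls[call_id]
    payload = b"".join(chunks)
    # An empty payload means the child exited without writing a result.
    return (None, True, json.loads(payload) if payload else None)
```

A timeout of 0 makes ``select`` return immediately, which would give the
fire-and-forget behaviour, although this sketch then never reaps the child
unless ``wait_for_function`` is called later.
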
> +
> +In the future, ``StartFunction`` could support an additional parameter
> +to specify after how long the function process should be aborted.
                                          ^ started, or process started?

> +
> +Simplified timing diagram::
> +
> +  Master daemon        Node daemon                      Function process
> +   |
> +  Call function
> +  (timeout 10s) -----> Parse request and fork for ----> Start function
> +                       calling actual function, then     |
> +                       wait up to 10s for function to    |
> +                       finish                            |
> +                        |                                |
> +                       ...                              ...
> +                        |                                |
> +  Examine return <----  |                                |
> +  value and wait                                         |
> +  again -------------> Wait another 10s for function     |
> +                        |                                |
> +                       ...                              ...
> +                        |                                |
> +  Examine return <----  |                                |
> +  value and wait                                         |
> +  again -------------> Wait another 10s for function     |
> +                        |                                |
> +                       ...                              ...
> +                        |                                |
> +                        |                               Function ends,
> +                       Get return value and forward <-- process exits
> +  Process return <---- it to caller
> +  value and continue
> +   |
> +
> +.. TODO: Convert diagram above to graphviz/dot graphic

Questions: in which process is the "wait up to 10s" done? If it is the parent
ganeti-noded, would that not stall all other requests?

What happens if the noded process is restarted?

iustin
