I'm sponsoring this fasttrack for Haik Aftandilian.  This is an Open case
seeking Patch binding (for backport to S10).  Timeout on 11/05/2009.

A copy of the proposed contract is in the case directory.

Dan

Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
This information is Copyright 2009 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         LDOM-SunCluster suspend callbacks
    1.2. Name of Document Author/Supplier:
         Author:  Haik Aftandilian
    1.3  Date of This Document:
        29 October, 2009
4. Technical Description

Introduction
------------

    Solaris Cluster (SC) runs in LDoms guest domains and
provides infrastructure to make applications deployed in the
guest domains HA. SC cluster nodes monitor each other via
so-called heartbeats, as well as associated hardware. SC manages
storage devices in a way that is specific to the server it
is running on, such as performing SCSI reservations, which
are meaningful only from a specific physical HBA.

    LDoms guest domains can be migrated from one server to
another with the LDoms "Warm Migration" feature. During the
migration, the domain being migrated is suspended. While a
domain is suspended, which can be several minutes, the domain
is totally inactive and not responsive to any requests. Thus,
if a domain running SC is migrated from one server to
another, other cluster nodes need to be made aware of this
fact so that they can suspend monitoring of the cluster node
under migration. Additionally, cluster nodes need to act
co-operatively to make sure the SCSI reservations on storage
devices are also migrated correctly.

    The proposed callbacks would allow SC to
perform these tasks, thereby enabling a seamless migration
of the LDoms guest domain from an end-user perspective.

References
----------

1. Suspend Domain Service
   http://sac.eng/Archives/CaseLog/arc/FWARC/2009/559/

2. Current list of Sun Cluster/ON contract cases in use in Solaris 10.
   /ws/osc-gate/usr/src/uts/sparc/cl/imported_symbols.private.Sol_10

3. Example of an existing Sun Cluster/ON contract case.
   http://sac.eng/Archives/CaseLog/arc/PSARC/2005/602/

Overview
--------

    In the Solaris kernel, hooks/callback functions will be
run before and after the domain is suspended. A single callback
will be made to SC before the suspension and a single
callback will be made after the resumption. Note that the use of
"suspend" in this contract applies only to suspend operations
initiated by LDoms infrastructure using the suspend domain
service on sun4v guest domains. Today, such suspend
operations are performed only to facilitate LDoms domain
migration. These suspend operations are entirely distinct from
CPR suspend operations.

Commitment level for all the interfaces:
    Contracted Project Private

Interface Details
-----------------

When SC is loaded and wishes to receive suspend
notifications, it will set the callback function pointers to
point to SC functions that handle the notifications. When
setting these callbacks, the cl_suspend_error_decode callback
should be set first, then the cl_suspend_post_callback, and
then the cl_suspend_pre_callback.
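
The registration order described above can be sketched as follows. This is
only an illustration under stated assumptions: the three pointers are defined
locally so the fragment is self-contained (in practice they would be owned by
Solaris), and sc_pre, sc_post, sc_decode, and sc_register_suspend_callbacks
are hypothetical names that do not appear in this case.

```c
#include <stddef.h>

/*
 * Local stand-ins for the callback pointers named in this case.
 * Assumption: in a real system these are defined by Solaris, not by SC.
 */
int (*cl_suspend_pre_callback)(void) = NULL;
int (*cl_suspend_post_callback)(void) = NULL;
const char *(*cl_suspend_error_decode)(int) = NULL;

/* Hypothetical SC handlers; trivial bodies keep the sketch short. */
static int sc_pre(void) { return 0; }
static int sc_post(void) { return 0; }
static const char *sc_decode(int err) { (void)err; return NULL; }

/*
 * Set the pointers in the order the contract asks for: decode first,
 * then post, then pre. With pre set last, a pre call cannot be observed
 * before the post and decode handlers are in place. A kernel
 * implementation would likely also need store ordering between these
 * assignments; that is omitted here.
 */
void sc_register_suspend_callbacks(void)
{
    cl_suspend_error_decode = sc_decode;
    cl_suspend_post_callback = sc_post;
    cl_suspend_pre_callback = sc_pre;
}
```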

The cl_suspend_pre_callback and cl_suspend_post_callback
will never be invoked concurrently, and Solaris will wait
indefinitely for the callbacks to return.

Pre-suspend callback:

    int (*cl_suspend_pre_callback)(void);

Called before the domain is suspended. This serves to
notify SC that this domain is in the process of being
suspended. SC should return 0 if it successfully suspended
monitoring of this domain. If a failure occurred which
should prevent the guest domain from being suspended and
possibly migrated, or if SC cannot support a migration
at this time, SC should return a non-zero error code.
If the cl_suspend_pre_callback returns an error code,
the suspension will be aborted. The intent is for this error
to be sent back to the domain manager and then used to
build an error message informing the user why the
migration could not be completed.
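
A minimal sketch of an SC-side pre-suspend handler follows, assuming a
simple in-memory state model. The error values, the state flags, and the
name sc_suspend_pre are hypothetical; this case does not define specific
error codes or SC internals.

```c
#include <stdbool.h>

/* Hypothetical SC error codes; only "0 is success" comes from the case. */
enum { SC_SUSPEND_OK = 0, SC_SUSPEND_EBUSY = 1 };

/* Hypothetical SC state, standing in for real cluster bookkeeping. */
static bool sc_reconfig_in_progress = false;  /* migration unsafe while true */
static bool sc_monitoring_suspended = false;  /* peers told to stop monitoring */

/* Candidate target for cl_suspend_pre_callback. */
int sc_suspend_pre(void)
{
    /* SC cannot support a migration right now: ask Solaris to abort. */
    if (sc_reconfig_in_progress)
        return SC_SUSPEND_EBUSY;

    /* Stand-in for quiescing heartbeat monitoring of this node. */
    sc_monitoring_suspended = true;
    return SC_SUSPEND_OK;
}
```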

Post-suspend callback:

    int (*cl_suspend_post_callback)(void);

Called after the guest domain has been resumed following a
successful suspension. It is also called after a failed
suspension attempt and after a canceled suspension
attempt. That is, this function may be called when the
guest domain was suspended and then resumed without being
migrated (as a result of a failure or cancellation). It can
also be called when the guest domain was never suspended
(due to a failure before the suspension) and therefore never
migrated.

If the callbacks are set while a suspend operation is
already in progress, then because the pre callback is set
after the post callback, this function may also be called
after a suspension for which cl_suspend_pre_callback was
never called. Therefore, SC should not consider it an error
if cl_suspend_post_callback is called without a preceding
call to cl_suspend_pre_callback.

SC should return 0 if it successfully resumed monitoring of
this domain. If a failure occurred which prevents the guest
domain from resuming normal activity in the cluster, a
non-zero error value should be returned. The error will
be sent back to the domain manager, which will display an
error message informing the user that an error occurred
after the migration and that manual inspection and
recovery may be required for the domain to resume normal
operation. The domain will have been resumed, and Solaris
and applications will be running.
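
One way SC might tolerate a post call that has no preceding pre call is
sketched below. The state flags, the error value, and the function names are
hypothetical and chosen only for illustration; the real SC logic is not
described in this case.

```c
#include <stdbool.h>

/* Hypothetical SC error codes; only "0 is success" comes from the case. */
enum { SC_RESUME_OK = 0, SC_RESUME_EFAIL = 2 };

/* Hypothetical SC state. */
static bool sc_pre_seen = false;            /* a pre call is outstanding */
static bool sc_monitoring_active = true;    /* peers are monitoring this node */
static bool sc_resume_should_fail = false;  /* simulates a resume failure */

/* Candidate target for cl_suspend_pre_callback. */
int sc_suspend_pre(void)
{
    sc_pre_seen = true;
    sc_monitoring_active = false;
    return 0;
}

/* Candidate target for cl_suspend_post_callback. */
int sc_suspend_post(void)
{
    /* No matching pre call: nothing was quiesced, and this is not an error. */
    if (!sc_pre_seen)
        return SC_RESUME_OK;

    sc_pre_seen = false;
    if (sc_resume_should_fail)
        return SC_RESUME_EFAIL;  /* reported to the domain manager */

    sc_monitoring_active = true;
    return SC_RESUME_OK;
}
```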

Error code decode callback: 

    const char *(*cl_suspend_error_decode)(int);

Called at any time to convert an error code returned from
the cl_suspend_pre_callback or cl_suspend_post_callback
into a descriptive error string suitable for use in
an error message presented to the user. Returns a
null-terminated, statically allocated string of at most
256 characters, including the null terminator. The caller
will consider this string immutable and will not modify it
or deallocate it. This function may return NULL. When the
cl_suspend_pre_callback or cl_suspend_post_callback returns
an error, cl_suspend_error_decode will be called with that
error code as its argument to obtain the corresponding
error message string.
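
A decode function meeting the constraints above (static storage, at most
256 characters including the terminator, NULL permitted) might look like the
following sketch. The error values, message texts, and the name
sc_suspend_error_decode are hypothetical; the compile-time size check assumes
a C11 compiler.

```c
#include <stddef.h>

/* Limit from the contract, counting the null terminator. */
#define SC_SUSPEND_MSG_MAX 256

/* Hypothetical SC error codes; the case does not define specific values. */
enum { SC_SUSPEND_EBUSY = 1, SC_RESUME_EFAIL = 2 };

/* Statically allocated messages; the caller never modifies or frees them. */
static const char sc_msg_busy[] =
    "cluster reconfiguration in progress; migration not supported now";
static const char sc_msg_resume[] =
    "cluster node failed to resume monitoring after suspend";

/* sizeof includes the terminator, matching how the contract counts. */
_Static_assert(sizeof (sc_msg_busy) <= SC_SUSPEND_MSG_MAX, "message too long");
_Static_assert(sizeof (sc_msg_resume) <= SC_SUSPEND_MSG_MAX, "message too long");

/* Candidate target for cl_suspend_error_decode. */
const char *sc_suspend_error_decode(int err)
{
    switch (err) {
    case SC_SUSPEND_EBUSY:
        return sc_msg_busy;
    case SC_RESUME_EFAIL:
        return sc_msg_resume;
    default:
        return NULL;  /* unknown code: the contract allows NULL */
    }
}
```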

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
                ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open

Reply via email to