On 2008-01-16T18:48:06, Keisuke MORI <[EMAIL PROTECTED]> wrote:

> Hello all,
> 
> We have developed a new feature that detects a process failure directly
> to reduce the failover time.
> 
> If you're interested in, please try this and give me your comments.
> 
> See attached README for details about how to use this.
> The patch is made for heartbeat-2.1.3.

Hi Keisuke, instead of introducing a secondary process monitoring layer
which has to be configured additionally, why not instead enhance the
existing RAs to use a faster process checking technique? 

If you are going this way, I think the RAs starting the individual
[services should sign in and tell procd (if it's running) what they want
monitored, and what rsc id it relates to - and, of course, notify procd
before stopping the process.

Instead of scanning /proc, which is very very Linux-specific, why not
use the async waitpid call instead? Or, you might decide to
poll()/select()/inotify() on the relevant proc dirs at least.

You can further use the async failure notification feature of the LRM to
directly tell it when a monitored process has failed; no need to do so
via the monitor call, which would then only need to be a backup function
to make sure procd is still running.

This would further reduce the error detection latency.

procd also probably should be started by a RA, not by a respawn line.

> +if [ -f ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs ]; then
> +     FUNCTION_FILE="${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs"
> +elif [ -f /usr/lib64/heartbeat/ocf-shellfuncs ]; then
> +     FUNCTION_FILE="/usr/lib64/heartbeat/ocf-shellfuncs"
> +elif [ -f /usr/lib/heartbeat/ocf-shellfuncs ]; then
> +     FUNCTION_FILE="/usr/lib/heartbeat/ocf-shellfuncs"
> +else
> +     echo "${OCF_RESOURCE_INSTANCE} ocf-shellfuncs file doesn't exist." >&2
> +     exit 1
> +fi

This seems unneeded, the other RAs have the proper code too - to just
source a single file.

> +PROCD="/usr/lib/heartbeat/procd"

No hardcoded paths please.

> +<action name="monitor"  timeout="60" depth="0" interval="10" 
> start-delay="60" />

Start-delay shouldn't be needed.

> +#if 1
> +    crm_log_init(crm_system_name, LOG_INFO, TRUE, FALSE, argc, argv);
> +#else
> +    /* for before heartbeat 2.1.2 */
> +    crm_log_init(crm_system_name, TRUE);
> +#endif

Please take out the compatibility code. It's not going to be merged
before 2.1.4, so we don't need to be compatible to anything before that
in the merged code.

> +        /*
> +         * connect with cib, and get objects's information cib.xml
> +         */

This is not really a good idea. The code doesn't take the dynamic
properties of the CIB into account. Why don't you use the values
provided by the instance attributes?

> +    /* get the whole of cib.xml */
> +    do {
> +        /*
> +         * At SBY server, sometimes cib.xml doesn't exist.
> +         * so, try to get cib.xml <max_failures> times.
> +         */

What's this supposed to mean? The CIB is clearly available everywhere
where the CRM is running!

> +        sprintf(cmdline_path, "/proc/%s/cmdline", dp->d_name);

sprintf is discouraged, please use snprintf.


Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/

Reply via email to