On 2008-01-16T18:48:06, Keisuke MORI <[EMAIL PROTECTED]> wrote: > Hello all, > > We have developed a new feature that detects a process failure directly > to reduce the failover time. > > If you're interested in, please try this and give me your comments. > > See attached README for details about how to use this. > The patch is made for heartbeat-2.1.3.
Hi Keisuke, instead of introducing a secondary process monitoring layer which has to be configured additionally, why not instead enhance the existing RAs to use a faster process checking technique? If you are going this way, I think the RAs starting the individual [services should sign in and tell procd (if it's running) what they want monitored, and what rsc id it relates to - and, of course, notify procd before stopping the process. Instead of scanning /proc, which is very very Linux-specific, why not use the async waitpid call instead? Or, you might decide to poll()/select()/inotify() on the relevant proc dirs at least. You can further use the async failure notification feature of the LRM to directly tell it when a monitored process has failed; no need to do so via the monitor call, which would then only need to be a backup function to make sure procd is still running. This would further reduce the error detection latency. procd also probably should be started by a RA, not by a respawn line. > +if [ -f ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs ]; then > + FUNCTION_FILE="${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs" > +elif [ -f /usr/lib64/heartbeat/ocf-shellfuncs ]; then > + FUNCTION_FILE="/usr/lib64/heartbeat/ocf-shellfuncs" > +elif [ -f /usr/lib/heartbeat/ocf-shellfuncs ]; then > + FUNCTION_FILE="/usr/lib/heartbeat/ocf-shellfuncs" > +else > + echo "${OCF_RESOURCE_INSTANCE} ocf-shellfuncs file doesn't exist." >&2 > + exit 1 > +fi This seems unneeded, the other RAs have the proper code too - to just source a single file. > +PROCD="/usr/lib/heartbeat/procd" No hardcoded paths please. > +<action name="monitor" timeout="60" depth="0" interval="10" > start-delay="60" /> Start-delay shouldn't be needed. > +#if 1 > + crm_log_init(crm_system_name, LOG_INFO, TRUE, FALSE, argc, argv); > +#else > + /* for before heartbeat 2.1.2 */ > + crm_log_init(crm_system_name, TRUE); > +#endif Please take out the compatibility code. It's not going to be merged before 2.1.4, so we don't need to be compatible to anything before that in the merged code. > + /* > + * connect with cib, and get objects's information cib.xml > + */ This is not really a good idea. The code doesn't take the dynamic properties of the CIB into account. Why don't you use the values provided by the instance attributes? > + /* get the whole of cib.xml */ > + do { > + /* > + * At SBY server, sometimes cib.xml doesn't exist. > + * so, try to get cib.xml <max_failures> times. > + */ What's this supposed to mean? The CIB is clearly available everywhere where the CRM is running! > + sprintf(cmdline_path, "/proc/%s/cmdline", dp->d_name); sprintf is discouraged, please use snprintf. Regards, Lars -- Teamlead Kernel, SuSE Labs, Research and Development SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg) "Experience is the name everyone gives to their mistakes." -- Oscar Wilde _______________________________________________________ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/