DAIKI MATSUDA wrote:
Hi, All
I have added a new function to heartbeat-2.0.8 and attached its patch file.
The function applies new values for the timeout parameters ( keepalive,
deadtime, deadping, warntime ) without stopping the heartbeat services.
Currently the heartbeat boot script provides 'reload' and 'forcereload'
actions, but they are effectively the same: both stop the services, and
the HA services are moved to the standby node, because the process kills
the forked heartbeat processes and clients ( crmd etc. ).
So we want to apply the changed parameters to running nodes without
suspending the services. The current feature works as follows.
1. Change the ha.cf file for the 4 parameters.
2. Send the running parent heartbeat process the signal SIGRTMAX, e.g.
kill -s SIGRTMAX `cat /var/run/heartbeat.pid`. (Why did I choose
SIGRTMAX? I could not find another good unused signal.)
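For illustration, the handler side could be as simple as setting a flag
in the signal handler and re-reading the parameters from the main loop.
This is only a minimal sketch, not the actual patch; reread_timeouts()
is just a placeholder for the real ha.cf re-parsing code.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t reload_requested = 0;

    static void on_sigrtmax(int signo)
    {
        (void)signo;
        reload_requested = 1;   /* do the real work outside the handler */
    }

    /* placeholder: re-parse keepalive, deadtime, deadping, warntime */
    static void reread_timeouts(void)
    {
        fprintf(stderr, "re-reading timeout parameters from ha.cf\n");
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigrtmax;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGRTMAX, &sa, NULL);

        for (;;) {              /* stands in for heartbeat's main loop */
            pause();            /* wake up when a signal arrives */
            if (reload_requested) {
                reload_requested = 0;
                reread_timeouts();
            }
        }
    }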
As far as we have investigated heartbeat, this appears to be safe. I
would like to hear your comments on the patch and its functionality.
Sorry to be coming in so late on this, but I have been working on the
release for many weeks now. I really like the idea of dynamically modifying the
heartbeat configuration - but if you're going to go to the trouble to do
it, I'd like to see it done more generally.
In other words, I'd like to be able to change nearly any parameter in
ha.cf at run time without restarting heartbeat.
This would require reworking (and improving) the way heartbeat starts
up. This would be probably about twice or three times as much work as
what you've done, but it would be much more useful, and much more general.
In the end, if done right, it could be the groundwork for letting us
eventually receive config updates from the CIB. [I know there's
a bootstrapping issue, but we can deal with that when we get to deciding
to do that work].
I have thought about this and have some specific ideas on what kinds of
things need to be done to make this happen.
Hi, Alan.
I understand what you are saying, and I think it is a very good idea to
treat all parameters in ha.cf. I thought of my implementation as
something for testing, and it is better that you, the ha-dev team, make
this feature.
I don't know quite what you meant by "it is better that you, the ha-dev
team, make this feature".
I am sorry for my poor English. I meant that the feature you have in
mind is better than what I made.
If possible, could you show me the schedule?
Not a problem. This will all work out.
I don't have a particular schedule in mind. I'm also not sure how long
it will take, and this kind of thing depends a lot on how well the
person doing the change knows the code.
Here is a suggested approach. At each stage, please test the patch
a bit, submit the patch for review, then test it extensively, and
submit it for re-review if you find more bugs. I suggest this order
to keep you from spending too much time testing a patch that we then
ask you to redo. In fact, for the first stage, maybe have your data
structures reviewed first, because they will determine the code in the end.
Step 1 - Further categorize and modularize the configuration.
There are at least 4 kinds of statements in the configuration
and there may be more:
1. media statements - like ucast, bcast, etc. Things
which load plugins and start read/write processes
2. global statements - which affect some or all of the
media statements - things like port number, serial
baud rate, etc. Knowing which global statements
affect which media statements, may eventually be
important.
3. Respawn statements - things which start child processes;
this includes the implied respawn statements in things
like 'crm on'.
4. Other statements.
For each of these categories, figure out which class of
processes is affected by each change.
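Just to illustrate the kind of data structure review I mean, something
along these lines might work; all the names below are made up for the
example, not the existing heartbeat structures.

    /* Illustrative only - not the existing heartbeat data structures. */
    #include <stdio.h>
    #include <sys/types.h>

    enum stmt_class {
        STMT_MEDIA,    /* ucast, bcast, serial, ... : load a plugin,
                          start read/write processes */
        STMT_GLOBAL,   /* udpport, baud, ... : affect media statements */
        STMT_RESPAWN,  /* respawn lines, plus implied ones like 'crm on' */
        STMT_OTHER     /* everything else */
    };

    struct cf_stmt {
        enum stmt_class  cls;
        const char      *keyword;   /* e.g. "bcast" */
        const char      *args;      /* e.g. "eth0"  */
        int            (*handler)(struct cf_stmt *);  /* one fn per stmt */
        pid_t            child;     /* child it started, or -1 */
    };

    static int handle_bcast(struct cf_stmt *s)
    {
        printf("would set up a broadcast medium on %s\n", s->args);
        return 0;
    }

    int main(void)
    {
        struct cf_stmt s = { STMT_MEDIA, "bcast", "eth0", handle_bcast, -1 };
        return s.handler(&s);
    }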
Make it so that each media statement is processed by a single
function call. Right now, the processing for any given media
statement is embedded in a loop. This is just restructuring.
If you store all the ha.cf statements in an array, then you can
make a minor improvement even in this stage. Make a pass
through the array looking for global statements and execute them
first. This will fix some known annoying behaviors where these
need to occur before they're used.
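To make the "globals first" pass concrete, it could look roughly like
the sketch below. The statement table and the handler are invented for
the example; only the two-pass idea matters here.

    /* Illustrative two-pass processing of parsed ha.cf statements:
     * global statements first, then everything else in file order. */
    #include <stdio.h>

    enum stmt_class { STMT_MEDIA, STMT_GLOBAL, STMT_RESPAWN, STMT_OTHER };

    struct cf_stmt {
        enum stmt_class cls;
        const char *keyword;
        const char *args;
    };

    static void run_stmt(const struct cf_stmt *s)
    {
        /* real code would dispatch to one function per statement */
        printf("processing %-8s %s\n", s->keyword, s->args);
    }

    int main(void)
    {
        struct cf_stmt cf[] = {
            { STMT_MEDIA,   "bcast",   "eth0" },
            { STMT_GLOBAL,  "udpport", "694"  },  /* must precede bcast */
            { STMT_RESPAWN, "respawn",
              "hacluster /usr/lib/heartbeat/ipfail" },
        };
        const int n = sizeof(cf) / sizeof(cf[0]);
        int i;

        for (i = 0; i < n; i++)            /* pass 1: globals only */
            if (cf[i].cls == STMT_GLOBAL)
                run_stmt(&cf[i]);

        for (i = 0; i < n; i++)            /* pass 2: the rest */
            if (cf[i].cls != STMT_GLOBAL)
                run_stmt(&cf[i]);

        return 0;
    }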
For media and respawn statements, you need to add an association
between the statements and the child processes they created.
That way, when we finally get around to processing changes, we
can kill them when they go away or change. We already have
a special way to track processes. Use that code, but create
new associations.
Note that this doesn't implement the feature we are talking
about, it just lays the groundwork for it. At this point
the code won't be able to do anything new. That happens
in step 2. Test this code in CTS, and test it manually.
Have it reviewed, and repeat until people are happy.
Then I'll commit it for you.
Step 2 - add the code to deal with changes in the configuration, and
figure out when to kill things and when to start new ones.
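Roughly speaking, the reconfiguration step amounts to comparing the old
and new statement lists. The sketch below only illustrates that
comparison; the matching rule and the helpers are assumptions, not the
real code.

    /* Illustrative only: reconcile old and new parsed statement lists.
     * Statements that disappeared or changed get their child killed;
     * statements that are new or changed get a child started. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    struct cf_stmt {
        const char *keyword;
        const char *args;
        pid_t       child;    /* child tied to this statement, or -1 */
    };

    static int same_stmt(const struct cf_stmt *a, const struct cf_stmt *b)
    {
        return strcmp(a->keyword, b->keyword) == 0
            && strcmp(a->args,    b->args)    == 0;
    }

    static int find_stmt(const struct cf_stmt *list, int n,
                         const struct cf_stmt *s)
    {
        int i;
        for (i = 0; i < n; i++)
            if (same_stmt(&list[i], s))
                return 1;
        return 0;
    }

    static void start_child(const struct cf_stmt *s)  /* hypothetical */
    {
        printf("start child for: %s %s\n", s->keyword, s->args);
    }

    static void reconcile(const struct cf_stmt *oldcf, int nold,
                          const struct cf_stmt *newcf, int nnew)
    {
        int i;

        for (i = 0; i < nold; i++)     /* gone or changed: stop child */
            if (!find_stmt(newcf, nnew, &oldcf[i])) {
                printf("statement gone/changed: %s %s\n",
                       oldcf[i].keyword, oldcf[i].args);
                if (oldcf[i].child > 0)
                    kill(oldcf[i].child, SIGTERM);
            }

        for (i = 0; i < nnew; i++)     /* added or changed: start child */
            if (!find_stmt(oldcf, nold, &newcf[i]))
                start_child(&newcf[i]);
    }

    int main(void)
    {
        struct cf_stmt oldcf[] = { { "bcast", "eth0", -1 },
                                   { "ucast", "eth1 10.0.0.2", -1 } };
        struct cf_stmt newcf[] = { { "bcast", "eth0", -1 },
                                   { "ucast", "eth1 10.0.0.3", -1 } };

        reconcile(oldcf, 2, newcf, 2);
        return 0;
    }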
Step 3 - Create CTS tests which change the configuration, then change it
back, watching for the correct behavior in each case. Run 1000
instances of this test alone in a CTS run. After you have had the code
reviewed, and have run these tests, and everyone is happy, then we'll
commit this stage of the changes.
Suggested Enhancement - after doing this:
Since you now know how to restart anything in heartbeat, you should also
be able to restart a pair of read and write children if either should
die. So we should then be able to recover from them dying. Add the
code to do this, and fix up the CTS test which is supposed to kill
random processes, to know how to kill any process in the system. Turn
the test back on, and run 1000 instances of this test in CTS. Similarly
for this stage, submit it for review, and when everyone is happy, we'll
commit it.
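The recovery piece could then be little more than reaping the dead
child and re-running the statement that owned it. Again, only a sketch;
the lookup and restart helpers here are invented.

    /* Illustrative only: reap a dead read/write child and restart the
     * statement that owned it. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    struct cf_stmt {
        const char *keyword;
        const char *args;
        pid_t       child;
    };

    static void restart_stmt(struct cf_stmt *s)   /* hypothetical */
    {
        printf("restarting children for: %s %s\n", s->keyword, s->args);
        /* re-run the statement's handler (fork/exec), update s->child */
    }

    /* called from the SIGCHLD path of the main loop */
    static void reap_and_recover(struct cf_stmt *stmts, int n)
    {
        pid_t pid;
        int   status, i;

        while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
            for (i = 0; i < n; i++)
                if (stmts[i].child == pid) {
                    stmts[i].child = -1;
                    restart_stmt(&stmts[i]);
                    break;
                }
        }
    }

    int main(void)
    {
        struct cf_stmt stmts[] = { { "bcast", "eth0", -1 } };

        stmts[0].child = fork();
        if (stmts[0].child == 0)
            _exit(0);                 /* pretend the write child died */

        sleep(1);                     /* give it time to exit */
        reap_and_recover(stmts, 1);
        return 0;
    }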
And, in the end this will be a great improvement, and the system will
also be more robust (better able to recover from errors) than it has
ever been.
How does that sound for an outline of a plan?
--
Alan Robertson <[EMAIL PROTECTED]>