On Fri, Mar 25, 2011 at 06:18:07PM +0100, Christoph Bartoschek wrote:
> Hi,
> 
> we experiment with DRBD and pacemaker and see several times that the 
> DRBD part is degraded (One node is outdated or diskless or something 
> similar) but crm_mon just reports that the DRBD resource runs as master 
> and slave on the nodes.
> 
> There is no indication that the resource is not in its optimal mode of 
> operation.
> 
> For me it seems as if pacemaker knows only the states: running, stopped, 
> failed.
> 
> I am missing the state: running degraded or suboptimal.

Yep, "degraded" is not a state available for pacemaker.
Pacemaker cannot do much about "suboptimal".

Pacemaker can stop, start, and promote/demote resources.
No more, no less.

If your resources are running "suboptimal" (but working),
stopping/restarting things, in the hope that would make them
run better, likely won't add to your availability.

Pacemaker is not a substitute for proper monitoring (nagios, whatever).

Monitoring can page your engineer on duty (or yourself)
for things that require immediate admin intervention.
Monitoring can provide you with nice graphs, so you can detect early
which things may require strategic admin intervention.

It is not pacemaker's job to do either.
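
To make "use proper monitoring" a bit more concrete, here is a minimal sketch of a Nagios-style check, assuming the DRBD 8.x `/proc/drbd` status line format (`cs:`, `ro:`, `ds:` fields). The function name `check_drbd` and the mapping of states to WARNING or CRITICAL are my own assumptions for illustration; only the exit codes 0/1/2/3 follow the usual Nagios plugin convention.

```python
import re

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Matches a DRBD 8.x status line such as:
#  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
STATUS_LINE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+ro:(?P<ro>\S+)\s+ds:(?P<ds>\S+)"
)


def check_drbd(proc_drbd_text):
    """Return (exit_code, message) for the given /proc/drbd contents."""
    problems = []
    seen = 0
    for line in proc_drbd_text.splitlines():
        m = STATUS_LINE.match(line)
        if not m:
            continue  # version banner, sync progress lines, etc.
        seen += 1
        minor, cs, ds = m.group("minor"), m.group("cs"), m.group("ds")
        local_disk = ds.split("/")[0]
        if local_disk != "UpToDate":
            # Assumption: a bad local disk is worth paging someone.
            problems.append((CRITICAL, "drbd%s local disk is %s" % (minor, local_disk)))
        elif cs != "Connected" or ds != "UpToDate/UpToDate":
            # Assumption: working, but degraded (peer or link trouble).
            problems.append((WARNING, "drbd%s degraded: cs:%s ds:%s" % (minor, cs, ds)))
    if seen == 0:
        return UNKNOWN, "no configured DRBD devices found"
    if not problems:
        return OK, "%d DRBD device(s) healthy" % seen
    worst = max(code for code, _ in problems)
    return worst, "; ".join(msg for _, msg in problems)
```

A wrapper script would read `/proc/drbd`, print the message, and exit with the returned code, so that the monitoring system, not the cluster manager, does the paging.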

> Is it already there and I have made a configuration error? Or what is 
> the recommended way to check the sanity of the resources controlled by 
> pacemaker?

Do you expect the cluster manager to sound the alarm beep as well,
if a disk falls out of the RAID, or the battery of the BBWC on the
controller is depleted?
Or if the response time of your home page goes bad (but the status
page still comes back within the timeout)?

What is Pacemaker expected to do?  Stop everything?

If you are Primary on DRBD, and the lower level disk has some IO error,
DRBD detaches from the local disk. The RA will notice this on the next
monitoring interval, and adjust the master score accordingly.
Depending on overall configuration, pacemaker may then decide to migrate
the resource over to the other node, or not.
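
As a rough illustration of that mechanism, here is a toy model in Python. It is not the DRBD RA and not Pacemaker's placement algorithm: the score values, the function names `master_score` and `preferred_master`, and the simplified `stickiness` comparison are all assumptions made up for this sketch.

```python
# Hypothetical scores, chosen only for illustration; the real DRBD
# resource agent uses its own values and knows more states.
SCORE_UP_TO_DATE = 10000
SCORE_DISKLESS = 1


def master_score(local_disk):
    """Map the local disk state to a promotion preference (None = none)."""
    if local_disk == "UpToDate":
        return SCORE_UP_TO_DATE
    if local_disk == "Diskless":
        # Still usable through the peer, but a poor choice for Primary.
        return SCORE_DISKLESS
    return None  # e.g. Inconsistent or Outdated: do not promote here


def preferred_master(current, disk_states, stickiness=0):
    """Toy placement decision: move away from `current` only if another
    node beats it by more than `stickiness`."""
    scores = {node: master_score(disk) for node, disk in disk_states.items()}
    candidates = {node: s for node, s in scores.items() if s is not None}
    if not candidates:
        return None  # nobody is fit to be Primary
    best = max(candidates, key=candidates.get)
    current_score = candidates.get(current)
    if current_score is not None and candidates[best] - current_score <= stickiness:
        return current
    return best
```

With `{"alice": "Diskless", "bob": "UpToDate"}` and `alice` currently Primary, this model moves the master role to `bob` unless the stickiness is set high enough to outweigh the score difference, which is the "or not" part.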

But for many other resource-internal problems,
such as replication link damage,
pacemaker has no way to magically heal things.


But ok, for strictly "informational purposes", conceivably,
we could add a monitoring result code to the RA spec saying
"working [slave/master], but degraded".

That could then be presented in some obvious way in crm_mon, or even
trigger certain action scripts (which again could then page you).

Currently, a similar effect could be achieved
by adding some sort of "supervisor resource",
which would need to be made dependent on the supervised resource,
and would "fail" if the supervised resource is not running "optimally".
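
A minimal sketch of what the monitor action of such a supervisor could look like, written in Python for brevity. The function `supervisor_monitor` and its parameters are hypothetical, and the definition of "optimal" (connected, both disks UpToDate) is my assumption; the numeric return codes are the usual OCF ones.

```python
# OCF resource agent return codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def supervisor_monitor(started, connection_state, disk_states):
    """Monitor action of a hypothetical supervisor resource.

    started          -- whether the supervisor itself has been started
                        (a real agent would track this, e.g. in a state file)
    connection_state -- DRBD connection state, e.g. "Connected"
    disk_states      -- DRBD disk states as "local/peer", e.g. "UpToDate/UpToDate"
    """
    if not started:
        # Probes on nodes where the supervisor is not active must say so.
        return OCF_NOT_RUNNING
    optimal = (connection_state == "Connected"
               and disk_states == "UpToDate/UpToDate")
    # A "failed" supervisor is how the degraded state becomes visible
    # in crm_mon; the supervised resource itself is left alone.
    return OCF_SUCCESS if optimal else OCF_ERR_GENERIC
```

The supervisor does nothing on start and stop beyond remembering its own state; its only purpose is to show up as failed while the supervised resource is degraded.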

My feeling is, don't try to do everything with the same tool.
Use the best tool for the job.
Use a monitoring tool for system monitoring.
Use a cluster manager for cluster management.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com