Sheng Yang created CLOUDSTACK-1653:
--------------------------------------
Summary: Redundant router: check_heartbeat.sh malfunction caused
by delayed cron job
Key: CLOUDSTACK-1653
URL: https://issues.apache.org/jira/browse/CLOUDSTACK-1653
Project: CloudStack
Issue Type: Bug
Security Level: Public (Anyone can view this level - this is the default.)
Affects Versions: 4.1.0
Reporter: Sheng Yang
Assignee: Sheng Yang
Fix For: 4.1.0
According to https://bugzilla.redhat.com/show_bug.cgi?id=159441,
cron only guarantees a minimum interval between job executions, so two
consecutive runs of check_heartbeat.sh may be more than 1 minute apart.
keepalived should update keepalived.ts every 10 seconds, so if the timestamp
has advanced by less than 60 seconds between two executions, the check should
fail.
The current logic in check_heartbeat.sh is wrong: it only guarantees that cron
was not delayed, not that keepalived is alive.
This passed the original test because that was an NFS disconnection test, in
which case the disk is corrupted, so cron got delayed, which meant the network
was down.
Changing the condition to "less than 60" should work too (30 is probably safer,
because cron sometimes seems to have a bug where it does not meet the minimum
interval requirement). With that condition, the script should find that
keepalived is dead the second time it is executed after NFS has recovered.
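
The proposed condition could be sketched roughly as follows. This is an
illustration only, not the actual check_heartbeat.sh: the function name
heartbeat_ok, its arguments, and the default threshold of 30 are assumptions
made for the example. It treats the heartbeat as failed when the keepalived.ts
value has advanced by less than the threshold since the previous run.

```shell
#!/bin/sh
# Hypothetical sketch, not the real check_heartbeat.sh.
# Usage: heartbeat_ok LAST_TS THIS_TS [MIN_ADVANCE]
#   LAST_TS      timestamp seen by the previous cron run (epoch seconds)
#   THIS_TS      timestamp currently in keepalived.ts (epoch seconds)
#   MIN_ADVANCE  minimum expected advance in seconds (assumed default: 30)
# Returns 0 if keepalived looks alive, 1 if the timestamp did not advance enough.
heartbeat_ok() {
    last_ts=$1
    this_ts=$2
    min_advance=${3:-30}

    # A live keepalived refreshes the timestamp every 10 seconds, and cron
    # runs are at least about 60 seconds apart, so a small difference means
    # keepalived stopped updating the file.
    diff=$((this_ts - last_ts))
    if [ "$diff" -lt "$min_advance" ]; then
        return 1
    fi
    return 0
}
```

In a real script the two values would presumably come from keepalived.ts and
from a copy saved by the previous run. A delayed cron job only makes the
difference larger, so it no longer causes a false failure, while a dead
keepalived leaves the difference near zero and is detected.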