Hi,
We built a three-node cluster and repeatedly started and stopped one of the
nodes. The test script looks like this:
#!/bin/sh

# Number of start/stop cycles to run.
LOOP_COUNT=1000

while [ "$LOOP_COUNT" -gt 0 ]
do
    # POSIX arithmetic instead of the bash-only "let".
    LOOP_COUNT=$((LOOP_COUNT - 1))
    echo "test No. $((1000 - LOOP_COUNT))"
    rcopenais start
    sleep 30
    rcopenais stop
    sleep 10
done
The error log looks like:
Apr 3 11:35:56 hex-3 ocfs2_controld[3623]: Unable to open checkpoint "ocfs2:controld": Object does not exist
Apr 3 11:35:56 hex-3 ocfs2_controld[3623]: Unable to open checkpoint "ocfs2:controld": Object does not exist
After the error first appears, it recurs several times and eventually leads to
this node being fenced.
After some analysis, we think there is a race condition between corosync and
the openais CKPT service, so we wrote a patch that effectively avoids the
problem.
The patch is included below. Any review is highly appreciated.
Thanks
Index: openais-1.1.4/services/ckpt.c
===================================================================
--- openais-1.1.4.orig/services/ckpt.c
+++ openais-1.1.4/services/ckpt.c
@@ -776,14 +776,17 @@ static void ckpt_confchg_fn (
 	unsigned int i, j;
 	unsigned int lowest_nodeid;
+	if (!memcmp (&my_saved_ring_id, ring_id,sizeof (struct memb_ring_id))) {
+		if (my_sync_state != SYNC_STATE_NOT_STARTED) {
+			return;
+		}
+	}
+	if (configuration_type != TOTEM_CONFIGURATION_REGULAR) {
+		return;
+	}
+
 	memcpy (&my_saved_ring_id, ring_id,
 		sizeof (struct memb_ring_id));
-	if (configuration_type != TOTEM_CONFIGURATION_REGULAR) {
-		return;
-	}
-	if (my_sync_state != SYNC_STATE_NOT_STARTED) {
-		return;
-	}
 	my_sync_state = SYNC_STATE_STARTED;
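
To make the effect of the reordering easier to review, here is a small
standalone model of the two guard sequences. It is only a sketch: the types,
enum values and function names (ring_id, CONFIG_REGULAR, confchg_original,
confchg_patched, new_ring_restarts_sync) are simplified stand-ins invented for
illustration, not the real corosync/openais definitions, and it models only the
early-return logic visible in the diff, not the rest of ckpt_confchg_fn. The
scenario assumed here is that a sync for ring A is still in progress when a
transitional and then a regular configuration for a new ring B are delivered.

```c
#include <string.h>

/* Hypothetical stand-ins for the types used by ckpt_confchg_fn. */
enum config_type { CONFIG_REGULAR, CONFIG_TRANSITIONAL };
enum sync_state { SYNC_NOT_STARTED, SYNC_STARTED };

/* Two equally sized fields, so memcmp never compares padding bytes. */
struct ring_id {
    unsigned long long rep;
    unsigned long long seq;
};

static struct ring_id saved_ring_id;
static enum sync_state my_state = SYNC_NOT_STARTED;

/* Guard order before the patch. Returns 1 if a sync would be started. */
static int confchg_original(enum config_type type, const struct ring_id *ring)
{
    memcpy(&saved_ring_id, ring, sizeof(*ring));
    if (type != CONFIG_REGULAR) {
        return 0;
    }
    if (my_state != SYNC_NOT_STARTED) {
        return 0;
    }
    my_state = SYNC_STARTED;
    return 1;
}

/* Guard order after the patch. Returns 1 if a sync would be started. */
static int confchg_patched(enum config_type type, const struct ring_id *ring)
{
    /* Skip only when the same ring is reported while a sync is running. */
    if (!memcmp(&saved_ring_id, ring, sizeof(*ring))) {
        if (my_state != SYNC_NOT_STARTED) {
            return 0;
        }
    }
    /* Transitional configurations no longer overwrite the saved ring id. */
    if (type != CONFIG_REGULAR) {
        return 0;
    }
    memcpy(&saved_ring_id, ring, sizeof(*ring));
    my_state = SYNC_STARTED;
    return 1;
}

typedef int (*confchg_fn)(enum config_type, const struct ring_id *);

/*
 * Assumed scenario: a sync for ring A is in progress, then a transitional
 * and a regular configuration for ring B arrive. Returns whether the
 * regular configuration for ring B starts a sync.
 */
int new_ring_restarts_sync(int use_patched)
{
    const struct ring_id ring_a = { 1, 4 };
    const struct ring_id ring_b = { 1, 8 };
    confchg_fn fn = use_patched ? confchg_patched : confchg_original;

    saved_ring_id = ring_a;
    my_state = SYNC_STARTED;

    fn(CONFIG_TRANSITIONAL, &ring_b);
    return fn(CONFIG_REGULAR, &ring_b);
}
```

In this model, new_ring_restarts_sync(0) returns 0 because the old guard order
returns early while the previous sync is still marked as started, and
new_ring_restarts_sync(1) returns 1 because the patched order only skips the
callback when the ring id is unchanged. Whether this matches the actual failure
sequence on the cluster is an assumption that still needs review.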