OK....
Since there was no ssh-as-root between the cluster nodes, I didn't send
all the logs along from every node in the cluster - and it didn't occur
to me to look at all of them.
However, the problem has gotten curiouser and curiouser - because ALL the
nodes in the cluster reported the same problem at the same time...
That makes it a lot less likely to be a race condition with the disk
writing infrastructure...
I've attached the relevant lines from the various machines - slightly
processed (date stamp format changed and a few other minor things).
Let me know if you want me to send all the system logs along...
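To make the cross-node comparison easier to see, here is a small sketch that groups the attached validate_cib_digest lines by their expected/calculated digest pair. The log lines are copied from the excerpts in this mail and rejoined onto one line each; the regular expression and the function name group_by_digest_pair are my own illustration, not anything from Pacemaker.

```python
import re
from collections import defaultdict

# validate_cib_digest lines from the attached excerpts, one per node.
LOG_LINES = [
    "2010/03/31_19:02:52 vhost0384 [13294]: ERROR: validate_cib_digest: Digest comparision failed: "
    "expected 316049fa7ee8d2e107573ce7cded07cf (/var/lib/heartbeat/crm/cib.GUdD9T), "
    "calculated 0bac3440f5c42f0f37d22ea7dfe433e8",
    "2010/03/31_19:02:53 vhost0150 [15083]: ERROR: validate_cib_digest: Digest comparision failed: "
    "expected 316049fa7ee8d2e107573ce7cded07cf (/var/lib/heartbeat/crm/cib.UJSSzR), "
    "calculated 0bac3440f5c42f0f37d22ea7dfe433e8",
    "2010/03/31_19:02:53 vhost0330 [31680]: ERROR: validate_cib_digest: Digest comparision failed: "
    "expected 316049fa7ee8d2e107573ce7cded07cf (/var/lib/heartbeat/crm/cib.FT60xq), "
    "calculated 0bac3440f5c42f0f37d22ea7dfe433e8",
    "2010/03/31_19:02:53 vhost0336 [32233]: ERROR: validate_cib_digest: Digest comparision failed: "
    "expected 316049fa7ee8d2e107573ce7cded07cf (/var/lib/heartbeat/crm/cib.igNQMc), "
    "calculated 0bac3440f5c42f0f37d22ea7dfe433e8",
    "2010/03/31_19:02:55 vhost0362 [23200]: ERROR: validate_cib_digest: Digest comparision failed: "
    "expected bb5b09392c502ee22faaf0184f060754 (/var/lib/heartbeat/crm/cib.JfcbF1), "
    "calculated 54b0e7f533f8667f1bd76e96268d2970",
]

# Illustrative pattern: timestamp, host, [pid], then the two 32-hex-digit digests.
PATTERN = re.compile(
    r"^\S+ (?P<host>\S+) \[\d+\]: ERROR: validate_cib_digest: .*"
    r"expected (?P<expected>[0-9a-f]{32}) \(\S+\), "
    r"calculated (?P<calculated>[0-9a-f]{32})"
)


def group_by_digest_pair(lines):
    """Map each (expected, calculated) digest pair to the hosts that logged it."""
    groups = defaultdict(list)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            pair = (match["expected"], match["calculated"])
            groups[pair].append(match["host"])
    return dict(groups)


if __name__ == "__main__":
    for (expected, calculated), hosts in group_by_digest_pair(LOG_LINES).items():
        print(expected, calculated, ", ".join(hosts))
```

Run against the excerpts, this puts vhost0384, vhost0150, vhost0330 and vhost0336 in one group with the same digest pair, and vhost0362 in a group of its own with a different pair two seconds later.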
Alan Robertson wrote:
Hi,
I've run into what looks at first blush to be a CIB bug in writing to disk.
The key messages from this incident are these:
Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: validate_cib_digest:
Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf
(/var/lib/heartbeat/crm/cib.GUdD9T), calculated
0bac3440f5c42f0f37d22ea7dfe433e8
Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: retrieveCib: Checksum of
/var/lib/heartbeat/crm/cib.uHFtAW failed! Configuration contents ignored!
Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: retrieveCib: Usually this
is caused by manual changes, please refer to
http://clusterlabs.org/wiki/FAQ#cib_changes_detected
Mar 31 19:02:52 vhost0384 cib: [13294]: WARN: retrieveCib: Continuing
but /var/lib/heartbeat/crm/cib.uHFtAW will NOT used.
I did not make manual changes on a running CIB. I was using the cluster
shell at the time. The CIB it is complaining about appears to be an
intact, valid CIB with contents approximately like they should have been
at the time. By the way, I have a report from another IBMer that they
have seen systems that stop writing to their local CIBs. I'll contact him.
Here are some relevant facts:
These machines are virtual guests in a cloud somewhere - operations
have somewhat unpredictable latency. But, nothing too egregious
was happening at the time or Heartbeat would have bitched.
I was doing some testing at the time: adding and removing
constraints using the cluster shell
migrate and unmigrate operations.
Given that the file looks intact, and I know how the CIB is written to
disk (since I originally wrote that code), I wonder if it isn't a
versioning issue / race condition. That is, the code for writing to
disk does NOT guarantee when it gets done (assuming you're still using
it). It would be easy to do a checksum on the wrong version compared to
the version you thought it should be (or before it completed).
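To illustrate the kind of mismatch I have in mind, here is a minimal sketch. It is not the CIB code: the file names, the ".sig" suffix, the XML snippets and the use of MD5 over raw file bytes are all assumptions made for the example. It shows a digest taken from one version being compared against contents from another, and then a writer that derives both the file and its digest from the same snapshot and flushes to disk before renaming into place.

```python
import hashlib
import os
import tempfile


def md5_hex(data: bytes) -> str:
    """Hex MD5 of a byte string (assumed digest method, for illustration)."""
    return hashlib.md5(data).hexdigest()


def write_snapshot(directory: str, name: str, data: bytes) -> None:
    """Write data and its digest, both derived from one immutable snapshot.

    Each file is written to a temporary name, fsync'ed, then renamed, so a
    reader never sees a half-written file. The two renames are not atomic
    as a pair, so a reader could still catch the window between them.
    """
    digest = md5_hex(data).encode("ascii")
    for suffix, payload in (("", data), (".sig", digest)):
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, os.path.join(directory, name + suffix))


def verify(directory: str, name: str) -> bool:
    """Recompute the digest of the file and compare it with the stored one."""
    path = os.path.join(directory, name)
    with open(path, "rb") as handle:
        data = handle.read()
    with open(path + ".sig", "rb") as handle:
        expected = handle.read().decode("ascii")
    return md5_hex(data) == expected


if __name__ == "__main__":
    v1 = b'<cib epoch="1"/>'  # hypothetical older version
    v2 = b'<cib epoch="2"/>'  # hypothetical newer version

    # Failure mode: digest recorded for v1, contents read back are v2.
    print("stale digest matches:", md5_hex(v1) == md5_hex(v2))

    with tempfile.TemporaryDirectory() as workdir:
        write_snapshot(workdir, "config.xml", v2)
        print("snapshot verifies:", verify(workdir, "config.xml"))
```

The point of the second half is only that the digest and the contents come from the same bytes; whether the real write path does that is exactly the question.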
Andrew: You should have already received all the relevant logs
in a separate email.
Also, for my reference - what method are you using to compute the digest
of the file? That is, what command should I execute to get the same
results?
--
Alan Robertson <al...@unix.sh>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
2010/03/31_19:02:52 vhost0384 [13294]: ERROR: crm_abort:
write_cib_contents: Triggered fatal assert at io.c:624 : retrieveCib(tmp1,
tmp2, FALSE) != NULL
2010/03/31_19:02:52 vhost0384 [13294]: ERROR: retrieveCib: Checksum
of /var/lib/heartbeat/crm/cib.uHFtAW failed! Configuration contents ignored!
2010/03/31_19:02:52 vhost0384 [13294]: ERROR: retrieveCib: Usually
this is caused by manual changes, please refer to
http://clusterlabs.org/wiki/FAQ#cib_changes_detected
2010/03/31_19:02:52 vhost0384 [13294]: ERROR: validate_cib_digest:
Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf
(/var/lib/heartbeat/crm/cib.GUdD9T), calculated 0bac3440f5c42f0f37d22ea7dfe433e8
2010/03/31_19:02:52 vhost0384 [6297]: ERROR: cib_diskwrite_complete:
Disabling disk writes after write failure
2010/03/31_19:02:52 vhost0384 [6297]: ERROR: cib_diskwrite_complete:
Disk write failed: status=134, signo=6, exitcode=0
2010/03/31_19:02:52 vhost0384 [6297]: ERROR: Managed
write_cib_contents process 13294 dumped core
2010/03/31_19:02:53 vhost0150 [15083]: ERROR: crm_abort:
write_cib_contents: Triggered fatal assert at io.c:624 : retrieveCib(tmp1,
tmp2, FALSE) != NULL
2010/03/31_19:02:53 vhost0150 [15083]: ERROR: retrieveCib: Checksum
of /var/lib/heartbeat/crm/cib.n66oB0 failed! Configuration contents ignored!
2010/03/31_19:02:53 vhost0150 [15083]: ERROR: retrieveCib: Usually
this is caused by manual changes, please refer to
http://clusterlabs.org/wiki/FAQ#cib_changes_detected
2010/03/31_19:02:53 vhost0150 [15083]: ERROR: validate_cib_digest:
Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf
(/var/lib/heartbeat/crm/cib.UJSSzR), calculated 0bac3440f5c42f0f37d22ea7dfe433e8
2010/03/31_19:02:53 vhost0150 [2564]: ERROR: cib_diskwrite_complete:
Disabling disk writes after write failure
2010/03/31_19:02:53 vhost0150 [2564]: ERROR: cib_diskwrite_complete:
Disk write failed: status=134, signo=6, exitcode=0
2010/03/31_19:02:53 vhost0150 [2564]: ERROR: Managed
write_cib_contents process 15083 dumped core
2010/03/31_19:02:53 vhost0330 [23191]: ERROR: cib_diskwrite_complete:
Disabling disk writes after write failure
2010/03/31_19:02:53 vhost0330 [23191]: ERROR: cib_diskwrite_complete:
Disk write failed: status=134, signo=6, exitcode=0
2010/03/31_19:02:53 vhost0330 [23191]: ERROR: Managed
write_cib_contents process 31680 dumped core
2010/03/31_19:02:53 vhost0330 [31680]: ERROR: crm_abort:
write_cib_contents: Triggered fatal assert at io.c:624 : retrieveCib(tmp1,
tmp2, FALSE) != NULL
2010/03/31_19:02:53 vhost0330 [31680]: ERROR: retrieveCib: Checksum
of /var/lib/heartbeat/crm/cib.Zyw20Q failed! Configuration contents ignored!
2010/03/31_19:02:53 vhost0330 [31680]: ERROR: retrieveCib: Usually
this is caused by manual changes, please refer to
http://clusterlabs.org/wiki/FAQ#cib_changes_detected
2010/03/31_19:02:53 vhost0330 [31680]: ERROR: validate_cib_digest:
Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf
(/var/lib/heartbeat/crm/cib.FT60xq), calculated 0bac3440f5c42f0f37d22ea7dfe433e8
2010/03/31_19:02:53 vhost0336 [18632]: ERROR: cib_diskwrite_complete:
Disabling disk writes after write failure
2010/03/31_19:02:53 vhost0336 [18632]: ERROR: cib_diskwrite_complete:
Disk write failed: status=134, signo=6, exitcode=0
2010/03/31_19:02:53 vhost0336 [18632]: ERROR: Managed
write_cib_contents process 32233 dumped core
2010/03/31_19:02:53 vhost0336 [32233]: ERROR: crm_abort:
write_cib_contents: Triggered fatal assert at io.c:624 : retrieveCib(tmp1,
tmp2, FALSE) != NULL
2010/03/31_19:02:53 vhost0336 [32233]: ERROR: retrieveCib: Checksum
of /var/lib/heartbeat/crm/cib.aCrcBc failed! Configuration contents ignored!
2010/03/31_19:02:53 vhost0336 [32233]: ERROR: retrieveCib: Usually
this is caused by manual changes, please refer to
http://clusterlabs.org/wiki/FAQ#cib_changes_detected
2010/03/31_19:02:53 vhost0336 [32233]: ERROR: validate_cib_digest:
Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf
(/var/lib/heartbeat/crm/cib.igNQMc), calculated 0bac3440f5c42f0f37d22ea7dfe433e8
2010/03/31_19:02:55 vhost0362 [19654]: ERROR: cib_diskwrite_complete:
Disabling disk writes after write failure
2010/03/31_19:02:55 vhost0362 [19654]: ERROR: cib_diskwrite_complete:
Disk write failed: status=134, signo=6, exitcode=0
2010/03/31_19:02:55 vhost0362 [19654]: ERROR: Managed
write_cib_contents process 23200 dumped core
2010/03/31_19:02:55 vhost0362 [23200]: ERROR: crm_abort:
write_cib_contents: Triggered fatal assert at io.c:624 : retrieveCib(tmp1,
tmp2, FALSE) != NULL
2010/03/31_19:02:55 vhost0362 [23200]: ERROR: retrieveCib: Checksum
of /var/lib/heartbeat/crm/cib.5CVr9T failed! Configuration contents ignored!
2010/03/31_19:02:55 vhost0362 [23200]: ERROR: retrieveCib: Usually
this is caused by manual changes, please refer to
http://clusterlabs.org/wiki/FAQ#cib_changes_detected
2010/03/31_19:02:55 vhost0362 [23200]: ERROR: validate_cib_digest:
Digest comparision failed: expected bb5b09392c502ee22faaf0184f060754
(/var/lib/heartbeat/crm/cib.JfcbF1), calculated 54b0e7f533f8667f1bd76e96268d2970
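For reference, the "status=134, signo=6" values in the cib_diskwrite_complete lines can be decoded with the standard wait-status helpers. This sketch assumes the logged status is a raw wait() status on Linux, which fits the accompanying "dumped core" and "fatal assert" messages but is not something the logs state outright.

```python
import os
import signal

# Value taken from the "Disk write failed: status=134" log lines.
STATUS = 134


def describe_wait_status(status: int) -> dict:
    """Break a raw wait() status into its signal and core-dump parts."""
    signaled = os.WIFSIGNALED(status)
    return {
        "signaled": signaled,
        "signo": os.WTERMSIG(status) if signaled else None,
        "signame": signal.Signals(os.WTERMSIG(status)).name if signaled else None,
        "core_dumped": os.WCOREDUMP(status) if signaled else False,
    }


if __name__ == "__main__":
    print(describe_wait_status(STATUS))
```

On Linux, 134 is 128 + 6: the core-dump flag plus signal 6 (SIGABRT), which is what abort() raises after a failed assertion.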