OK....

Since there was no ssh-as-root between the cluster nodes, I didn't send all the logs along from every node in the cluster - and it didn't occur to me to look at all of them.

However, the problem has gotten curiouser and curiouser - because ALL the nodes in the cluster reported the same problem at the same time...

That makes it a lot less likely to be a race condition with the disk writing infrastructure...

I've attached the relevant lines from the various machines - slightly processed (date stamp format changed and a few other minor things).

Let me know if you want me to send all the system logs along...


Alan Robertson wrote:
Hi,

I've run into what looks at first blush to be a CIB bug in writing to disk.

The key messages from this incident are these:


Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: validate_cib_digest: Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf (/var/lib/heartbeat/crm/cib.GUdD9T), calculated 0bac3440f5c42f0f37d22ea7dfe433e8
Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.uHFtAW failed! Configuration contents ignored!
Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: retrieveCib: Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected
Mar 31 19:02:52 vhost0384 cib: [13294]: WARN: retrieveCib: Continuing but /var/lib/heartbeat/crm/cib.uHFtAW will NOT used.


I did not make manual changes on a running CIB. I was using the cluster shell at the time. The CIB it is complaining about appears to be an intact, valid CIB with contents approximately like they should have been at the time. By the way, I have a report from another IBMer that they have seen systems that stop writing to their local CIBs. I'll contact him.

Here are some relevant facts:
  These machines are virtual guests in a cloud somewhere - operations
    have somewhat unpredictable latency.  But, nothing too egregious
    was happening at the time or Heartbeat would have bitched.
  I was doing some testing at the time.  I was putting on and
    taking off constraints using the cluster shell
    migrate and unmigrate operations.

Given that the file looks intact, and I know how the CIB is written to disk (since I originally wrote that code), I wonder if it isn't a versioning issue / race condition. That is, the code for writing to disk does NOT guarantee when it gets done (assuming you're still using it). It would be easy to do a checksum on the wrong version compared to the version you thought it should be (or before it completed).
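To make the suspected versioning problem concrete, here is a minimal sketch of how a digest taken from one version of the configuration can end up paired with a file holding a later version. Everything in it is an assumption for illustration: the file names, the use of MD5 over the raw file bytes, and the helper names are hypothetical and are not meant to describe how the CIB code actually computes or stores its digest.

```python
import hashlib
import os
import tempfile


def md5_hex(data: bytes) -> str:
    """Hex digest of the given bytes (MD5 chosen only for illustration)."""
    return hashlib.md5(data).hexdigest()


def verify(path: str, digest_path: str) -> bool:
    """Compare the digest stored in digest_path with one computed from path."""
    with open(path, "rb") as f:
        calculated = md5_hex(f.read())
    with open(digest_path) as f:
        expected = f.read().strip()
    return expected == calculated


def stale_digest_demo() -> bool:
    """Pair a digest of version 1 with file contents of version 2."""
    workdir = tempfile.mkdtemp()
    cib_path = os.path.join(workdir, "cib.xml")        # hypothetical name
    sig_path = os.path.join(workdir, "cib.xml.sig")    # hypothetical name

    version1 = b'<cib epoch="1"/>'
    version2 = b'<cib epoch="2"/>'

    # The digest is taken while version 1 is current...
    with open(sig_path, "w") as f:
        f.write(md5_hex(version1))

    # ...but an update lands before the contents reach the disk,
    # so the file that gets written is version 2.
    with open(cib_path, "wb") as f:
        f.write(version2)

    # Both files are intact and well formed, yet verification fails.
    return verify(cib_path, sig_path)


if __name__ == "__main__":
    print("digest matches:", stale_digest_demo())
```

In this sketch the file on disk is perfectly valid, which would be consistent with an intact-looking CIB that still fails the comparison. It only illustrates the hypothesis and does not show that this is what happened here.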

Andrew: You should already have received all the relevant logs in a separate email.

Also, for my reference - what method are you using to compute the digest of the file? That is, what command should I execute to get the same results?



--
    Alan Robertson <al...@unix.sh>

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce
2010/03/31_19:02:52     vhost0384       [13294]: ERROR: crm_abort: write_cib_contents: Triggered fatal assert at io.c:624 : retrieveCib(tmp1, tmp2, FALSE) != NULL
2010/03/31_19:02:52     vhost0384       [13294]: ERROR: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.uHFtAW failed!  Configuration contents ignored!
2010/03/31_19:02:52     vhost0384       [13294]: ERROR: retrieveCib: Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected
2010/03/31_19:02:52     vhost0384       [13294]: ERROR: validate_cib_digest: Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf (/var/lib/heartbeat/crm/cib.GUdD9T), calculated 0bac3440f5c42f0f37d22ea7dfe433e8
2010/03/31_19:02:52     vhost0384       [6297]: ERROR: cib_diskwrite_complete: Disabling disk writes after write failure
2010/03/31_19:02:52     vhost0384       [6297]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
2010/03/31_19:02:52     vhost0384       [6297]: ERROR: Managed write_cib_contents process 13294 dumped core
2010/03/31_19:02:53     vhost0150       [15083]: ERROR: crm_abort: write_cib_contents: Triggered fatal assert at io.c:624 : retrieveCib(tmp1, tmp2, FALSE) != NULL
2010/03/31_19:02:53     vhost0150       [15083]: ERROR: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.n66oB0 failed!  Configuration contents ignored!
2010/03/31_19:02:53     vhost0150       [15083]: ERROR: retrieveCib: Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected
2010/03/31_19:02:53     vhost0150       [15083]: ERROR: validate_cib_digest: Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf (/var/lib/heartbeat/crm/cib.UJSSzR), calculated 0bac3440f5c42f0f37d22ea7dfe433e8
2010/03/31_19:02:53     vhost0150       [2564]: ERROR: cib_diskwrite_complete: Disabling disk writes after write failure
2010/03/31_19:02:53     vhost0150       [2564]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
2010/03/31_19:02:53     vhost0150       [2564]: ERROR: Managed write_cib_contents process 15083 dumped core
2010/03/31_19:02:53     vhost0330       [23191]: ERROR: cib_diskwrite_complete: Disabling disk writes after write failure
2010/03/31_19:02:53     vhost0330       [23191]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
2010/03/31_19:02:53     vhost0330       [23191]: ERROR: Managed write_cib_contents process 31680 dumped core
2010/03/31_19:02:53     vhost0330       [31680]: ERROR: crm_abort: write_cib_contents: Triggered fatal assert at io.c:624 : retrieveCib(tmp1, tmp2, FALSE) != NULL
2010/03/31_19:02:53     vhost0330       [31680]: ERROR: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.Zyw20Q failed!  Configuration contents ignored!
2010/03/31_19:02:53     vhost0330       [31680]: ERROR: retrieveCib: Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected
2010/03/31_19:02:53     vhost0330       [31680]: ERROR: validate_cib_digest: Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf (/var/lib/heartbeat/crm/cib.FT60xq), calculated 0bac3440f5c42f0f37d22ea7dfe433e8
2010/03/31_19:02:53     vhost0336       [18632]: ERROR: cib_diskwrite_complete: Disabling disk writes after write failure
2010/03/31_19:02:53     vhost0336       [18632]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
2010/03/31_19:02:53     vhost0336       [18632]: ERROR: Managed write_cib_contents process 32233 dumped core
2010/03/31_19:02:53     vhost0336       [32233]: ERROR: crm_abort: write_cib_contents: Triggered fatal assert at io.c:624 : retrieveCib(tmp1, tmp2, FALSE) != NULL
2010/03/31_19:02:53     vhost0336       [32233]: ERROR: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.aCrcBc failed!  Configuration contents ignored!
2010/03/31_19:02:53     vhost0336       [32233]: ERROR: retrieveCib: Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected
2010/03/31_19:02:53     vhost0336       [32233]: ERROR: validate_cib_digest: Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf (/var/lib/heartbeat/crm/cib.igNQMc), calculated 0bac3440f5c42f0f37d22ea7dfe433e8
2010/03/31_19:02:55     vhost0362       [19654]: ERROR: cib_diskwrite_complete: Disabling disk writes after write failure
2010/03/31_19:02:55     vhost0362       [19654]: ERROR: cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
2010/03/31_19:02:55     vhost0362       [19654]: ERROR: Managed write_cib_contents process 23200 dumped core
2010/03/31_19:02:55     vhost0362       [23200]: ERROR: crm_abort: write_cib_contents: Triggered fatal assert at io.c:624 : retrieveCib(tmp1, tmp2, FALSE) != NULL
2010/03/31_19:02:55     vhost0362       [23200]: ERROR: retrieveCib: Checksum of /var/lib/heartbeat/crm/cib.5CVr9T failed!  Configuration contents ignored!
2010/03/31_19:02:55     vhost0362       [23200]: ERROR: retrieveCib: Usually this is caused by manual changes, please refer to http://clusterlabs.org/wiki/FAQ#cib_changes_detected
2010/03/31_19:02:55     vhost0362       [23200]: ERROR: validate_cib_digest: Digest comparision failed: expected bb5b09392c502ee22faaf0184f060754 (/var/lib/heartbeat/crm/cib.JfcbF1), calculated 54b0e7f533f8667f1bd76e96268d2970
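
Every node in the log above reports "status=134, signo=6, exitcode=0" for the failed disk write. Assuming that status is the raw value returned by wait() with the conventional Linux encoding (low 7 bits for the terminating signal, bit 0x80 for the core-dump flag, next byte for the exit code), the short sketch below reproduces the other two numbers from it. The function name is hypothetical, and stopped/continued statuses are deliberately ignored.

```python
def decode_wait_status(status: int) -> dict:
    """Split a raw wait() status into signal, core-dump flag and exit code.

    Assumes the conventional Linux bit layout; stopped or continued
    children are not handled in this sketch.
    """
    signo = status & 0x7F                 # terminating signal, 0 if none
    core_dumped = bool(status & 0x80)     # set when a core file was produced
    # The exit code is only meaningful when no signal ended the process.
    exitcode = (status >> 8) & 0xFF if signo == 0 else 0
    return {"signo": signo, "core_dumped": core_dumped, "exitcode": exitcode}


if __name__ == "__main__":
    # 134 = 0x80 + 6: killed by signal 6 with a core dump.
    print(decode_wait_status(134))
```

Under those assumptions, 134 decodes to signal 6 (SIGABRT on Linux) with a core dump, which agrees with the "dumped core" lines and with a child process aborting on the fatal assert.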
_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
