Lars Ellenberg wrote:
On Thu, Apr 01, 2010 at 08:27:02AM -0600, Alan Robertson wrote:
Lars Ellenberg wrote:
On Thu, Apr 01, 2010 at 12:12:47AM -0600, Alan Robertson wrote:
OK....

Since there was no ssh-as-root between the cluster nodes, I didn't
send all the logs along from every node in the cluster - and it
didn't occur to me to look at all of them.

However, the problem has gotten curiouser and curiouser - because ALL
the nodes in the cluster reported the same problem at the same
time...

That makes it a lot less likely to be a race condition with the disk
writing infrastructure...

I've attached the relevant lines from the various machines -
slightly processed (date stamp format changed and a few other minor
things).

Let me know if you want me to send all the system logs along...
There should be core files.
You should be able to get some interesting information out of them,
especially "the_cib" and "digest" at the point of abort().

Also, for my reference - what method are you using to compute the
digest of the file?  That is, what command should I execute to get
the same results?
It's an md5sum over the XML tree -- not over the formatted ASCII
buffer, though, so "md5sum cib.xml" won't do.
I think it is the same as
echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' cib.whatever)" | md5sum
But there is "cibadmin --md5-sum -x cib.xml",
to use the exact same code path.
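A quick sanity check that the two agree, assuming the usual on-disk
location (the path is an assumption; adjust for your install):

  # digest via pacemaker's own code path
  cibadmin --md5-sum -x /var/lib/heartbeat/crm/cib.xml

  # approximation with the perl pipeline above; if it holds,
  # both commands print the same md5
  echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' /var/lib/heartbeat/crm/cib.xml)" | md5sum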
This is a change from how it used to be (the last time I looked - at
least according to my not-always-reliable memory).  Thanks for the
update.


2010/03/31_19:02:52     vhost0384       [13294]: ERROR: crm_abort:
write_cib_contents: Triggered fatal assert at io.c:624 :
retrieveCib(tmp1, tmp2, FALSE) != NULL
So it did not verify right after it was written.
Can you reproduce?
I have no idea.  I didn't do anything much.  Hopefully the test
suite does a lot more strenuous things...

The core files may actually contain some hints,
so have a look there.
None of them verified.  All the nodes in the cluster failed the test
at the same time - and now I have no official CIBs on disk - on any
cluster nodes...  I sent Andrew all the CIBs, all the core files, and
basically everything under /var/lib/heartbeat/ from one machine.
They're from the latest official release - so the binaries that match
them are readily available.

Well, Andrew is on vacation right now... you will have noticed.

The strange thing is that your "corrupt" cib.uHFtAW
contains a <status/> section.  It should not;
no other cib*.raw or cib.xml does.
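A quick way to check which on-disk copies carry one (the directory is
an assumption; file names as in this thread):

  # count lines containing a status element in the suspect file
  grep -c '<status' /var/lib/heartbeat/crm/cib.uHFtAW

  # list any other on-disk CIB copies that contain one
  grep -l '<status' /var/lib/heartbeat/crm/cib*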

Because <status/> is explicitly filtered out in write_cib_contents:
 free_xml_from_parent(the_cib, cib_status_root);
before
 write_xml_file(the_cib, tmp1, FALSE),
so that should never have made it in there.

Something is very wrong somewhere...

Did you manage to get two status sections in there somehow?
Did you try anything funky with the CIB as the last action before this failed?

Not that I recall...

Do it again with a higher log level.  Sorry, no time right now to rebuild
your exact setup with your exact gcc and toolchain to look at your core file.

You can just download the RPM and extract the objects.  That's what I used.
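For reference, a sketch of extracting the objects without installing
the package (the package file name here is hypothetical):

  # unpack the RPM payload into the current directory
  rpm2cpio pacemaker-1.0.x.rpm | cpio -idmv

  # the cib daemon and friends end up under ./usr/lib*/heartbeat/
  ls usr/lib*/heartbeat/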

--
    Alan Robertson <al...@unix.sh>

"Openness is the foundation and preservative of friendship... Let me claim from you at all times your undisguised opinions." - William Wilberforce

_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
