Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie

Toni Tschampke Thu, 03 Nov 2016 09:49:58 -0700

> I'm guessing this change should be instantly written into the xml file?
> If this is the case something is wrong, grepping for validate gives the
> old string back.

We found some strange behavior when setting "validate-with" via cibadmin: corosync.log shows the successful transaction and cibadmin --query returns the correct value, but the change is NOT written to cib.xml.


We restarted pacemaker and the value was reset to pacemaker-1.1.

If the signatures for cib.xml are generated by pacemaker/cib, which algorithm is used? It looks like MD5 to me.

Would it be possible to edit cib.xml manually and generate a valid cib.xml.sig, to get one step further in the debugging process?
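
For reference, the .sig files in the directory listing quoted below are 32 bytes each, which matches the length of an MD5 digest written as hexadecimal. A minimal Python sketch of that length check follows. The helper name md5_hex and the sample input are made up for illustration, and this is an assumption-laden sketch: Pacemaker may compute its digest over its own serialization of the XML rather than over the raw file bytes, so hashing cib.xml this way is not guaranteed to reproduce cib.xml.sig.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical stand-in for CIB content; not taken from a real cluster.
sample = b'<cib validate-with="pacemaker-1.2"/>'

# An MD5 hex digest is 32 characters long regardless of the input size.
print(len(md5_hex(sample)))
```

This only supports the guess about the algorithm; it does not make a hand-edited cib.xml acceptable to the cib daemon.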


Regards, Toni

--
Kind regards

Toni Tschampke | t...@halle.it
bcs kommunikationslösungen
Inh. Dipl. Ing. Carsten Burkhardt
Harz 51 | 06108 Halle (Saale) | Germany
tel +49 345 29849-0 | fax +49 345 29849-22
www.b-c-s.de | www.halle.it | www.wivewa.de


EASILY MANAGE ADDRESSES, PHONE CALLS AND DOCUMENTS - WITH WIVEWA -
YOUR KNOWLEDGE MANAGER FOR YOUR BUSINESS!

For more information, visit www.wivewa.de

On 03.11.2016 at 16:39, Toni Tschampke wrote:

 > I'm going to guess you were using the experimental 1.1 schema as the
 > "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
 > changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
 > you get better results. Don't edit the file directly though; use the
 > cibadmin command so it signs the end result properly.
 >
 > After changing the validate-with, run:
 >
 >    crm_verify -x /var/lib/pacemaker/cib/cib.xml
 >
 > and fix any errors that show up.

Strange: the location of our cib.xml differs from your path. Our CIB is
located in /var/lib/heartbeat/crm/.

Running cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.2"/>'

gave no output, but the change was logged to corosync.log:

cib:     info: cib_perform_op:    -- <cib num_updates="0"
validate-with="pacemaker-1.1"/>
cib:     info: cib_perform_op:    ++ <cib admin_epoch="0" epoch="8462"
num_updates="1" validate-with="pacemaker-1.2" crm_feature_set="3.0.6"
  have-quorum="1" cib-last-written="Thu Nov  3 10:05:52 2016"
update-origin="nebel1" update-client="cibadmin" update-user="root"/>

I'm guessing this change should be instantly written into the xml file?
If this is the case something is wrong, grepping for validate gives the
old string back.

<cib admin_epoch="0" epoch="8462" num_updates="0"
validate-with="pacemaker-1.1" crm_feature_set="3.0.6" have-quorum="1"
cib-last-written="Thu Nov  3 16:19:51 2016" update-origin="nebel1"
update-client="cibadmin" update-user="root">
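
The comparison above (logged change versus on-disk header) can be sketched in Python. This is only an illustration: the function name cib_header is hypothetical, and the two sample strings are shortened copies of the headers shown in this mail, not complete CIB documents. It assumes the version of a CIB can be compared as the tuple (admin_epoch, epoch, num_updates).

```python
import xml.etree.ElementTree as ET


def cib_header(xml_text: str) -> dict:
    """Extract validate-with and the version tuple from a <cib> root element."""
    root = ET.fromstring(xml_text)
    version = tuple(
        int(root.get(key, "0"))
        for key in ("admin_epoch", "epoch", "num_updates")
    )
    return {"validate-with": root.get("validate-with"), "version": version}


# Shortened headers: what is on disk versus what the log reports as applied.
on_disk = ('<cib admin_epoch="0" epoch="8462" num_updates="0" '
           'validate-with="pacemaker-1.1"/>')
live = ('<cib admin_epoch="0" epoch="8462" num_updates="1" '
        'validate-with="pacemaker-1.2"/>')

print(cib_header(on_disk))
print(cib_header(live))
```

In this sample the in-memory copy is one num_updates ahead of the file and carries the new schema name, which is the discrepancy described above.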

pacemakerd --features
Pacemaker 1.1.15 (Build: e174ec8)
Supporting v3.0.10:

Should the crm_feature_set be updated this way too? I'm guessing this is
done when "cibadmin --upgrade" succeeds?

We just get a timeout error when trying to upgrade it with cibadmin:
Call cib_upgrade failed (-62): Timer expired

Have permissions changed from 1.1.7 to 1.1.15? Looking at our
quite big /var/lib/heartbeat/crm/ folder, some permissions changed:

-rw------- 1 hacluster root      80K Nov  1 16:56 cib-31.raw
-rw-r--r-- 1 hacluster root       32 Nov  1 16:56 cib-31.raw.sig
-rw------- 1 hacluster haclient  80K Nov  1 18:53 cib-32.raw
-rw------- 1 hacluster haclient   32 Nov  1 18:53 cib-32.raw.sig

cib-31 was written before upgrading, cib-32 after starting the upgraded pacemaker.
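
To compare ownership and modes across the whole directory rather than by eye, a small Python sketch like the following could be used. The function names describe and list_dir are hypothetical, and the script assumes a Unix system where the pwd and grp modules are available.

```python
import grp
import os
import pwd
import stat


def describe(path):
    """Return (mode string, owner name, group name) for one file."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid without a passwd entry
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)  # gid without a group entry
    return stat.filemode(st.st_mode), owner, group


def list_dir(directory):
    """Print mode, owner and group for every entry, sorted by name."""
    for name in sorted(os.listdir(directory)):
        mode, owner, group = describe(os.path.join(directory, name))
        print(f"{mode} {owner:<10} {group:<10} {name}")
```

Calling list_dir("/var/lib/heartbeat/crm") as a user allowed to read that directory would print one line per file, which makes a change of group or mode between cib-31 and cib-32 easy to spot.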



On 03.11.2016 at 15:39, Ken Gaillot wrote:

On 11/03/2016 05:51 AM, Toni Tschampke wrote:

Hi,

we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie
(pacemaker 1.1.15, corosync 2.3.6).
During the upgrade pacemaker was removed (rc) and reinstalled afterwards
from jessie-backports; same for crmsh.

Now we are encountering multiple problems:

First I checked the configuration on a single node running pacemaker &
corosync, which produced a strange error followed by multiple lines
stating that the syntax is wrong. crm configure show then showed a mixed
view of XML and crmsh single-line syntax.

ERROR: Cannot read schema file
'/usr/share/pacemaker/pacemaker-1.1.rng': [Errno 2] No such file or
directory: '/usr/share/pacemaker/pacemaker-1.1.rng'


pacemaker-1.1.rng was renamed to pacemaker-next.rng in Pacemaker 1.1.12,
as it was used to hold experimental new features rather than as the
actual next version of the schema. So, the schema skipped to 1.2.
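
A quick way to see which schema names resolve to a file on a given node is sketched below in Python. It assumes, based on the error message above, that a validate-with value NAME maps to NAME.rng under /usr/share/pacemaker; the function names schema_file and schema_available are made up for this example.

```python
import os


def schema_file(validate_with, schema_dir="/usr/share/pacemaker"):
    """Build the path of the RNG file a validate-with value would need."""
    return os.path.join(schema_dir, validate_with + ".rng")


def schema_available(validate_with, schema_dir="/usr/share/pacemaker"):
    """Return True if the schema file for validate_with exists."""
    return os.path.isfile(schema_file(validate_with, schema_dir))


# Candidate names taken from this thread.
for name in ("pacemaker-1.1", "pacemaker-1.2", "pacemaker-next"):
    print(name, schema_available(name))
```

On a node with Pacemaker 1.1.12 or later, pacemaker-1.1 would be expected to report False, in line with the rename described above.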

I'm going to guess you were using the experimental 1.1 schema as the
"validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
you get better results. Don't edit the file directly though; use the
cibadmin command so it signs the end result properly.

After changing the validate-with, run:

   crm_verify -x /var/lib/pacemaker/cib/cib.xml

and fix any errors that show up.

When we looked into that folder there were pacemaker-1.0.rng, 1.2 and so
on. As a quick try we symlinked the 1.2 to 1.1 and the syntax errors
were gone. Running crm resource show listed all resources, but the
output of crm_mon -1fA was unexpected: it showed all nodes offline,
with no DC elected:

Stack: corosync
Current DC: NONE
Last updated: Thu Nov  3 11:11:16 2016
Last change: Thu Nov  3 09:54:52 2016 by root via cibadmin on nebel1

              *** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

3 nodes and 73 resources configured:
5 resources DISABLED and 0 BLOCKED from being started due to failures

OFFLINE: [ nebel1 nebel2 nebel3 ]


We tried to manually change dc-version.

When issuing a simple cleanup command I got the following error:

crm resource cleanup DrbdBackuppcMs
Error signing on to the CRMd service
Error performing operation: Transport endpoint is not connected


which looks like crmsh is not able to communicate with crmd; nothing
is logged in corosync.log in this case.

We experimented with multiple config changes:
corosync.conf: pacemaker ver 0 > 1
cib-bootstrap-options: cluster-infrastructure from openais to corosync

Package versions:
cman 3.1.8-1.2+b1
corosync 2.3.6-3~bpo8+1
crmsh 2.2.0-1~bpo8+1
csync2 1.34-2.3+b1
dlm-pcmk 3.0.12-3.2+deb7u2
libcman3 3.1.8-1.2+b1
libcorosync-common4:amd64 2.3.6-3~bpo8+1
munin-libvirt-plugins 0.0.6-1
pacemaker 1.1.15-2~bpo8+1
pacemaker-cli-utils 1.1.15-2~bpo8+1
pacemaker-common 1.1.15-2~bpo8+1
pacemaker-resource-agents 1.1.15-2~bpo8+1

Kernel: #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux


I attached our cib before upgrade and after, as well as the one with the
mixed syntax and our corosync.conf.

When we tried to connect a second node to the cluster, pacemaker starts
its daemons, starts corosync and dies after 15 tries with the following
in the corosync log:

crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
crmd: info: do_cib_control: Could not connect to the CIB service:
Transport endpoint is not connected
crmd:  warning: do_cib_control:
Couldn't complete CIB registration 15 times... pause and retry
attrd: error: attrd_cib_connect: Signon to CIB failed:
Transport endpoint is not connected (-107)
attrd: info: main: Shutting down attribute manager
attrd: info: qb_ipcs_us_withdraw: withdrawing server sockets
attrd: info: crm_xml_cleanup: Cleaning up memory from libxml2
crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
pacemakerd:  warning: pcmk_child_exit:
The attrd process (12761) can no longer be respawned,
shutting the cluster down.
pacemakerd: notice: pcmk_shutdown_worker: Shutting down Pacemaker
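
To get an overview of which daemon reports what, log lines in the trimmed form shown above can be tallied with a short Python sketch. It assumes each message starts with "daemon: severity: function:", as in the excerpt; real corosync.log lines usually carry a timestamp and host prefix that would have to be stripped first. The function name count_by_severity is hypothetical.

```python
import re
from collections import Counter

# daemon, severity and function name, separated by colons and spaces.
LINE_RE = re.compile(r"^(\w+):\s+(\w+):\s+(\w+):")


def count_by_severity(lines):
    """Count messages per (daemon, severity); continuation lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            daemon, severity, _function = match.groups()
            counts[(daemon, severity)] += 1
    return counts


excerpt = [
    "crmd: info: do_cib_control: Could not connect to the CIB service:",
    "Transport endpoint is not connected",
    "attrd: error: attrd_cib_connect: Signon to CIB failed:",
    "Transport endpoint is not connected (-107)",
    "pacemakerd:  warning: pcmk_child_exit:",
]
for key, value in sorted(count_by_severity(excerpt).items()):
    print(key, value)
```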


A third node joins without the above error, but crm_mon still shows all
nodes as offline.

Thanks for any advice on how to solve this; I'm out of ideas now.

Regards, Toni


_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

