Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie

2016-11-10 Thread Klaus Wenninger
On 11/10/2016 09:47 AM, Toni Tschampke wrote:
>> Did your upgrade documentation describe how to update the corosync
>> configuration, and did that go well? crmd may be unable to function due
>> to lack of quorum information.
>
> Thanks for this tip, corosync quorum configuration was the cause.
>
> As we changed validate-with as well as the feature set manually in the
> cib, is there a need to issue the cibadmin --upgrade --force
> command, or is this command just for changing the schema?
>

I'd guess not, as this would just do automatically (to the latest
schema version) what you've done manually already.

> -- 
> Kind regards
>
> Toni Tschampke | t...@halle.it
> bcs kommunikationslösungen
> Inh. Dipl. Ing. Carsten Burkhardt
> Harz 51 | 06108 Halle (Saale) | Germany
> tel +49 345 29849-0 | fax +49 345 29849-22
> www.b-c-s.de | www.halle.it | www.wivewa.de
>
>
> SIMPLY MANAGE ADDRESSES, PHONE CALLS AND DOCUMENTS - WITH WIVEWA -
> YOUR KNOWLEDGE MANAGER FOR YOUR BUSINESS!
>
> Further information is available at www.wivewa.de
>
> On 08.11.2016 at 22:51, Ken Gaillot wrote:
>> On 11/07/2016 09:08 AM, Toni Tschampke wrote:
>>> We managed to change the validate-with option via a workaround (cibadmin
>>> export & replace), as setting the value with cibadmin --modify doesn't
>>> write the changes to disk.
>>>
>>> After experimenting with various schemes (xml is correctly interpreted
>>> by crmsh) we are still not able to communicate with local crmd.
>>>
>>> Can someone please help determine why the local crmd is not
>>> responding (we disabled our other nodes to eliminate possible
>>> corosync-related issues) and runs into errors/timeouts when issuing
>>> crmsh or cibadmin commands?
>>
>> It occurs to me that wheezy used corosync 1. There were major changes
>> from corosync 1 to 2 ... 1 relied on a "plugin" to provide quorum for
>> pacemaker, whereas 2 has quorum built-in.
>>
>> Did your upgrade documentation describe how to update the corosync
>> configuration, and did that go well? crmd may be unable to function due
>> to lack of quorum information.
>>
>>> examples of non-working local commands:
>>>
>>> timeout when running cibadmin: (strace attachment)
 cibadmin --upgrade --force
 Call cib_upgrade failed (-62): Timer expired
>>>
>>> error when running a crm resource cleanup
 crm resource cleanup $vm
 Error signing on to the CRMd service
 Error performing operation: Transport endpoint is not connected
>>>
>>> I attached the strace log from running cib_upgrade, does this help to
>>> find the cause of the timeout issue?
>>>
>>> Here is the corosync dump when locally starting pacemaker:
>>>
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [MAIN  ] main.c:1256
 Corosync Cluster Engine ('2.3.6'): started and ready to provide
 service.
 Nov 07 16:01:59 [24339] nebel1 corosync info[MAIN  ] main.c:1257
 Corosync built-in features: dbus rdma monitoring watchdog augeas
 systemd upstart xmlconf qdevices snmp pie relro bindnow
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
 totemnet.c:248 Initializing transport (UDP/IP Multicast).
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
 totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
 none hash: none
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
 totemnet.c:248 Initializing transport (UDP/IP Multicast).
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
 totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
 none hash: none
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
 totemudp.c:671 The network interface [10.112.0.1] is now up.
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
 Service engine loaded: corosync configuration map access [0]
 Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
 ipc_setup.c:536 server name: cmap
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
 Service engine loaded: corosync configuration service [1]
 Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
 ipc_setup.c:536 server name: cfg
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
 Service engine loaded: corosync cluster closed process group service
 v1.01 [2]
 Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
 ipc_setup.c:536 server name: cpg
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
 Service engine loaded: corosync profile loading service [4]
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
 Service engine loaded: corosync resource monitoring service [6]
 Nov 07 16:01:59 [24339] nebel1 corosync info[WD] wd.c:669
 Watchdog /dev/watchdog is now been tickled by corosync.
 Nov 07 16:01:59 [24339] nebel1 corosync warning [WD] wd.c:625
 Could not change the Watchdog 

Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie

2016-11-10 Thread Toni Tschampke

Did your upgrade documentation describe how to update the corosync
configuration, and did that go well? crmd may be unable to function due
to lack of quorum information.


Thanks for this tip, corosync quorum configuration was the cause.

As we changed validate-with as well as the feature set manually in the 
cib, is there a need to issue the cibadmin --upgrade --force command, 
or is this command just for changing the schema?


--
Kind regards

Toni Tschampke | t...@halle.it
bcs kommunikationslösungen
Inh. Dipl. Ing. Carsten Burkhardt
Harz 51 | 06108 Halle (Saale) | Germany
tel +49 345 29849-0 | fax +49 345 29849-22
www.b-c-s.de | www.halle.it | www.wivewa.de


SIMPLY MANAGE ADDRESSES, PHONE CALLS AND DOCUMENTS - WITH WIVEWA -
YOUR KNOWLEDGE MANAGER FOR YOUR BUSINESS!

Further information is available at www.wivewa.de

On 08.11.2016 at 22:51, Ken Gaillot wrote:

On 11/07/2016 09:08 AM, Toni Tschampke wrote:

We managed to change the validate-with option via a workaround (cibadmin
export & replace), as setting the value with cibadmin --modify doesn't
write the changes to disk.

After experimenting with various schemes (xml is correctly interpreted
by crmsh) we are still not able to communicate with local crmd.

Can someone please help determine why the local crmd is not
responding (we disabled our other nodes to eliminate possible
corosync-related issues) and runs into errors/timeouts when issuing
crmsh or cibadmin commands?


It occurs to me that wheezy used corosync 1. There were major changes
from corosync 1 to 2 ... 1 relied on a "plugin" to provide quorum for
pacemaker, whereas 2 has quorum built-in.

Did your upgrade documentation describe how to update the corosync
configuration, and did that go well? crmd may be unable to function due
to lack of quorum information.


examples of non-working local commands:

timeout when running cibadmin: (strace attachment)

cibadmin --upgrade --force
Call cib_upgrade failed (-62): Timer expired


error when running a crm resource cleanup

crm resource cleanup $vm
Error signing on to the CRMd service
Error performing operation: Transport endpoint is not connected


I attached the strace log from running cib_upgrade, does this help to
find the cause of the timeout issue?

Here is the corosync dump when locally starting pacemaker:


Nov 07 16:01:59 [24339] nebel1 corosync notice  [MAIN  ] main.c:1256
Corosync Cluster Engine ('2.3.6'): started and ready to provide service.
Nov 07 16:01:59 [24339] nebel1 corosync info[MAIN  ] main.c:1257
Corosync built-in features: dbus rdma monitoring watchdog augeas
systemd upstart xmlconf qdevices snmp pie relro bindnow
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemnet.c:248 Initializing transport (UDP/IP Multicast).
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
none hash: none
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemnet.c:248 Initializing transport (UDP/IP Multicast).
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
none hash: none
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemudp.c:671 The network interface [10.112.0.1] is now up.
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync configuration map access [0]
Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
ipc_setup.c:536 server name: cmap
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync configuration service [1]
Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
ipc_setup.c:536 server name: cfg
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync cluster closed process group service
v1.01 [2]
Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
ipc_setup.c:536 server name: cpg
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync profile loading service [4]
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync resource monitoring service [6]
Nov 07 16:01:59 [24339] nebel1 corosync info[WD] wd.c:669
Watchdog /dev/watchdog is now been tickled by corosync.
Nov 07 16:01:59 [24339] nebel1 corosync warning [WD] wd.c:625
Could not change the Watchdog timeout from 10 to 6 seconds
Nov 07 16:01:59 [24339] nebel1 corosync warning [WD] wd.c:464
resource load_15min missing a recovery key.
Nov 07 16:01:59 [24339] nebel1 corosync warning [WD] wd.c:464
resource memory_used missing a recovery key.
Nov 07 16:01:59 [24339] nebel1 corosync info[WD] wd.c:581 no
resources configured.
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync watchdog service [7]
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  

Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie

2016-11-08 Thread Ken Gaillot
On 11/07/2016 09:08 AM, Toni Tschampke wrote:
> We managed to change the validate-with option via a workaround (cibadmin
> export & replace), as setting the value with cibadmin --modify doesn't
> write the changes to disk.
>
> After experimenting with various schemes (xml is correctly interpreted
> by crmsh) we are still not able to communicate with local crmd.
> 
> Can someone please help determine why the local crmd is not
> responding (we disabled our other nodes to eliminate possible
> corosync-related issues) and runs into errors/timeouts when issuing
> crmsh or cibadmin commands?

It occurs to me that wheezy used corosync 1. There were major changes
from corosync 1 to 2 ... 1 relied on a "plugin" to provide quorum for
pacemaker, whereas 2 has quorum built-in.

Did your upgrade documentation describe how to update the corosync
configuration, and did that go well? crmd may be unable to function due
to lack of quorum information.
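
For reference, the corosync 2 way is to drop the old pacemaker
plugin/service stanza from corosync.conf entirely and let corosync
provide quorum itself. Roughly something like this (a sketch only,
adjust to your own interfaces and node count):

   # corosync 1.x only: the pacemaker plugin stanza (ver 0 or 1);
   # remove this under corosync 2
   service {
       name: pacemaker
       ver: 1
   }

   # corosync 2.x needs a quorum section instead
   quorum {
       provider: corosync_votequorum
   }

With that in place, corosync-quorumtool -s should at least show the
expected membership and vote information before pacemaker is started.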

> examples of non-working local commands:
> 
> timeout when running cibadmin: (strace attachment)
>> cibadmin --upgrade --force
>> Call cib_upgrade failed (-62): Timer expired
> 
> error when running a crm resource cleanup
>> crm resource cleanup $vm
>> Error signing on to the CRMd service
>> Error performing operation: Transport endpoint is not connected
> 
> I attached the strace log from running cib_upgrade, does this help to
> find the cause of the timeout issue?
> 
> Here is the corosync dump when locally starting pacemaker:
> 
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [MAIN  ] main.c:1256
>> Corosync Cluster Engine ('2.3.6'): started and ready to provide service.
>> Nov 07 16:01:59 [24339] nebel1 corosync info[MAIN  ] main.c:1257
>> Corosync built-in features: dbus rdma monitoring watchdog augeas
>> systemd upstart xmlconf qdevices snmp pie relro bindnow
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
>> totemnet.c:248 Initializing transport (UDP/IP Multicast).
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
>> totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
>> none hash: none
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
>> totemnet.c:248 Initializing transport (UDP/IP Multicast).
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
>> totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
>> none hash: none
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
>> totemudp.c:671 The network interface [10.112.0.1] is now up.
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
>> Service engine loaded: corosync configuration map access [0]
>> Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
>> ipc_setup.c:536 server name: cmap
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
>> Service engine loaded: corosync configuration service [1]
>> Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
>> ipc_setup.c:536 server name: cfg
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
>> Service engine loaded: corosync cluster closed process group service
>> v1.01 [2]
>> Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
>> ipc_setup.c:536 server name: cpg
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
>> Service engine loaded: corosync profile loading service [4]
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
>> Service engine loaded: corosync resource monitoring service [6]
>> Nov 07 16:01:59 [24339] nebel1 corosync info[WD] wd.c:669
>> Watchdog /dev/watchdog is now been tickled by corosync.
>> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD] wd.c:625
>> Could not change the Watchdog timeout from 10 to 6 seconds
>> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD] wd.c:464
>> resource load_15min missing a recovery key.
>> Nov 07 16:01:59 [24339] nebel1 corosync warning [WD] wd.c:464
>> resource memory_used missing a recovery key.
>> Nov 07 16:01:59 [24339] nebel1 corosync info[WD] wd.c:581 no
>> resources configured.
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
>> Service engine loaded: corosync watchdog service [7]
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
>> Service engine loaded: corosync cluster quorum service v0.1 [3]
>> Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
>> ipc_setup.c:536 server name: quorum
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
>> totemudp.c:671 The network interface [10.110.1.1] is now up.
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
>> totemsrp.c:2095 A new membership (10.112.0.1:348) was formed. Members
>> joined: 1
>> Nov 07 16:01:59 [24339] nebel1 corosync notice  [MAIN  ] main.c:310
>> Completed service synchronization, ready to provide service.
>> Nov 07 16:01:59 [24341] nebel1 pacemakerd:   notice: main: 
>> Starting Pacemaker 1.1.15 | build=e174ec8 features: 

Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie

2016-11-03 Thread Toni Tschampke

> I'm guessing this change should be instantly written into the xml file?
> If this is the case, something is wrong: grepping for validate gives the
> old string back.

We found some strange behavior when setting "validate-with" via 
cibadmin: corosync.log shows the successful transaction and issuing 
cibadmin --query gives the correct value, but it is NOT written into 
cib.xml.


We restarted pacemaker and the value was reset to pacemaker-1.1.
If the signatures for cib.xml are generated by pacemaker/cib, which 
algorithm is used? It looks like md5 to me.


Would it be possible to manually edit the cib.xml and generate a valid 
cib.xml.sig to get one step further in the debugging process?
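
As a quick test of that guess (purely an assumption on my side that the
signature is a plain md5 of the file contents, which may well be wrong):

   md5sum /var/lib/heartbeat/crm/cib.xml
   cat /var/lib/heartbeat/crm/cib.xml.sig

If the two digests match, a hand-edited file plus a freshly generated
sum might be enough for debugging; if they differ, the digest is
clearly computed some other way.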


Regards, Toni


On 03.11.2016 at 16:39, Toni Tschampke wrote:

 > I'm going to guess you were using the experimental 1.1 schema as the
 > "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
 > changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
 > you get better results. Don't edit the file directly though; use the
 > cibadmin command so it signs the end result properly.
 >
 > After changing the validate-with, run:
 >
 >crm_verify -x /var/lib/pacemaker/cib/cib.xml
 >
 > and fix any errors that show up.

strange, the location of our cib.xml differs from your path; our cib is
located in /var/lib/heartbeat/crm/

running cibadmin --modify --xml-text ''

gave no output but was logged to corosync:

cib: info: cib_perform_op:-- 
cib: info: cib_perform_op:++ 

I'm guessing this change should be instantly written into the xml file?
If this is the case, something is wrong: grepping for validate gives the
old string back.



pacemakerd --features
Pacemaker 1.1.15 (Build: e174ec8)
Supporting v3.0.10:

Should the crm_feature_set be updated this way too? I'm guessing this is
done when "cibadmin --upgrade" succeeds?

We just get a timeout error when trying to upgrade it with cibadmin:
Call cib_upgrade failed (-62): Timer expired

Have permissions changed from 1.1.7 to 1.1.15? When looking at our
quite big /var/lib/heartbeat/crm/ folder, some permissions changed:

-rw--- 1 hacluster root  80K Nov  1 16:56 cib-31.raw
-rw-r--r-- 1 hacluster root   32 Nov  1 16:56 cib-31.raw.sig
-rw--- 1 hacluster haclient  80K Nov  1 18:53 cib-32.raw
-rw--- 1 hacluster haclient   32 Nov  1 18:53 cib-32.raw.sig

cib-31 was before upgrading, cib-32 after starting upgraded pacemaker



On 03.11.2016 at 15:39, Ken Gaillot wrote:

On 11/03/2016 05:51 AM, Toni Tschampke wrote:

Hi,

we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie
(pacemaker 1.1.15, corosync 2.3.6).
During the upgrade pacemaker was removed (rc) and afterwards reinstalled
from jessie-backports, same for crmsh.

Now we are encountering multiple problems:

First I checked the configuration on a single node running pacemaker &
corosync, which dropped a strange error followed by multiple lines
stating the syntax is wrong. crm configure show then showed a mixed view
of XML and crmsh single-line syntax.


ERROR: Cannot read schema file

'/usr/share/pacemaker/pacemaker-1.1.rng': [Errno 2] No such file or
directory: '/usr/share/pacemaker/pacemaker-1.1.rng'


pacemaker-1.1.rng was renamed to pacemaker-next.rng in Pacemaker 1.1.12,
as it was used to hold experimental new features rather than as the
actual next version of the schema. So, the schema skipped to 1.2.

I'm going to guess you were using the experimental 1.1 schema as the
"validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
you get better results. Don't edit the file directly though; use the
cibadmin command so it signs the end result properly.

After changing the validate-with, run:

   crm_verify -x /var/lib/pacemaker/cib/cib.xml

and fix any errors that show up.
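
For the cib element itself that usually boils down to something along
these lines (from memory, so treat it as a sketch and pick whichever
target schema you settle on):

   cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.2"/>'

cibadmin merges that fragment into the live CIB, so only the named
attribute should change.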


When we looked into that folder there were pacemaker-1.0.rng, 1.2 and so
on. As a quick try we symlinked 1.2 to 1.1 and the syntax errors
were gone. When running crm resource show, all resources showed up; when
running crm_mon -1fA the output was unexpected, as it showed all nodes
offline, with no DC 

Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie

2016-11-03 Thread Toni Tschampke

> I'm going to guess you were using the experimental 1.1 schema as the
> "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
> changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
> you get better results. Don't edit the file directly though; use the
> cibadmin command so it signs the end result properly.
>
> After changing the validate-with, run:
>
>crm_verify -x /var/lib/pacemaker/cib/cib.xml
>
> and fix any errors that show up.

strange, the location of our cib.xml differs from your path; our cib is 
located in /var/lib/heartbeat/crm/


running cibadmin --modify --xml-text ''

gave no output but was logged to corosync:

cib: info: cib_perform_op:-- validate-with="pacemaker-1.1"/>
cib: info: cib_perform_op:++ num_updates="1" validate-with="pacemaker-1.2" crm_feature_set="3.0.6"
 have-quorum="1" cib-last-written="Thu Nov  3 10:05:52 2016" 
update-origin="nebel1" update-client="cibadmin" update-user="root"/>


I'm guessing this change should be instantly written into the xml file?
If this is the case, something is wrong: grepping for validate gives the 
old string back.


validate-with="pacemaker-1.1" crm_feature_set="3.0.6" have-quorum="1" 
cib-last-written="Thu Nov  3 16:19:51 2016" update-origin="nebel1" 
update-client="cibadmin" update-user="root">


pacemakerd --features
Pacemaker 1.1.15 (Build: e174ec8)
Supporting v3.0.10:

Should the crm_feature_set be updated this way too? I'm guessing this is 
done when "cibadmin --upgrade" succeeds?


We just get a timeout error when trying to upgrade it with cibadmin:
Call cib_upgrade failed (-62): Timer expired

Have permissions changed from 1.1.7 to 1.1.15? When looking at our 
quite big /var/lib/heartbeat/crm/ folder, some permissions changed:


-rw--- 1 hacluster root  80K Nov  1 16:56 cib-31.raw
-rw-r--r-- 1 hacluster root   32 Nov  1 16:56 cib-31.raw.sig
-rw--- 1 hacluster haclient  80K Nov  1 18:53 cib-32.raw
-rw--- 1 hacluster haclient   32 Nov  1 18:53 cib-32.raw.sig

cib-31 was before upgrading, cib-32 after starting upgraded pacemaker
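
If the mixed ownership turns out to matter, my assumption (not
verified) is that the new packages expect everything there to be
hacluster:haclient with mode 0600, so something like this would
normalize it:

   chown -R hacluster:haclient /var/lib/heartbeat/crm
   chmod 600 /var/lib/heartbeat/crm/cib*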



On 03.11.2016 at 15:39, Ken Gaillot wrote:

On 11/03/2016 05:51 AM, Toni Tschampke wrote:

Hi,

we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie
(pacemaker 1.1.15, corosync 2.3.6).
During the upgrade pacemaker was removed (rc) and afterwards reinstalled
from jessie-backports, same for crmsh.

Now we are encountering multiple problems:

First I checked the configuration on a single node running pacemaker &
corosync, which dropped a strange error followed by multiple lines
stating the syntax is wrong. crm configure show then showed a mixed view
of XML and crmsh single-line syntax.


ERROR: Cannot read schema file

'/usr/share/pacemaker/pacemaker-1.1.rng': [Errno 2] No such file or
directory: '/usr/share/pacemaker/pacemaker-1.1.rng'


pacemaker-1.1.rng was renamed to pacemaker-next.rng in Pacemaker 1.1.12,
as it was used to hold experimental new features rather than as the
actual next version of the schema. So, the schema skipped to 1.2.

I'm going to guess you were using the experimental 1.1 schema as the
"validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
changing the validate-with to pacemaker-next or pacemaker-1.2 and see if
you get better results. Don't edit the file directly though; use the
cibadmin command so it signs the end result properly.

After changing the validate-with, run:

   crm_verify -x /var/lib/pacemaker/cib/cib.xml

and fix any errors that show up.


When we looked into that folder there were pacemaker-1.0.rng, 1.2 and so
on. As a quick try we symlinked 1.2 to 1.1 and the syntax errors
were gone. When running crm resource show, all resources showed up; when
running crm_mon -1fA the output was unexpected, as it showed all nodes
offline, with no DC elected:


Stack: corosync
Current DC: NONE
Last updated: Thu Nov  3 11:11:16 2016
Last change: Thu Nov  3 09:54:52 2016 by root via cibadmin on nebel1

  *** Resource management is DISABLED ***
  The cluster will not attempt to start, stop or recover services

3 nodes and 73 resources configured:
5 resources DISABLED and 0 BLOCKED from being started due to failures

OFFLINE: [ nebel1 nebel2 nebel3 ]


we tried to manually change dc-version

when issuing a simple cleanup command I got the following error:


crm resource cleanup DrbdBackuppcMs
Error signing on to the CRMd service
Error performing operation: Transport endpoint is not connected


which looks like crmsh is not able to communicate with crmd and nothing
is logged in 

Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie

2016-11-03 Thread Toni Tschampke
> You'll want to switch your validate-with schema to a newer schema, and most
> likely there will be one or two things that don't validate
> anymore. There is the "crm configure upgrade" command, but if crmsh is
> having problems you can call cibadmin directly:
>
>  cibadmin --upgrade --force

when trying to run this command, I just get a timeout:

> Call cib_upgrade failed (-62): Timer expired

corosync.log shows the attempt

> cib: info: cib_process_request:   Forwarding cib_upgrade
> operation for section 'all' to all (origin=local/cibadmin/2)

I tried to get the current value; either the command is wrong or there 
is no value set for validate-with:


> crm_attribute --type crm_config --query --name validate-with
> scope=crm_config  name=validate-with value=(null)
> Error performing operation: No such device or address

I would think increasing the timeout won't fix this; how do I find out 
which timeout is involved and why it's triggered?
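
(It may also simply be that crm_attribute is the wrong tool here, since
validate-with lives on the cib root element rather than in crm_config;
as a cross-check I can read it straight from a query:

   cibadmin --query | grep -o 'validate-with="[^"]*"'

which at least shows what the running cib claims to use.)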


I attached the strace dump, hope this helps to figure out where the 
problem sits.


Is there another way to set the correct validate-with option if both 
options do not work?


Regards, Toni


On 03.11.2016 at 12:42, Kristoffer Grönlund wrote:

Toni Tschampke  writes:


Hi,

we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie
(pacemaker 1.1.15, corosync 2.3.6).
During the upgrade pacemaker was removed (rc) and afterwards reinstalled
from jessie-backports, same for crmsh.


You'll want to switch your validate-with schema to a newer schema, and most
likely there will be one or two things that don't validate
anymore. There is the "crm configure upgrade" command, but if crmsh is
having problems you can call cibadmin directly:

 cibadmin --upgrade --force

Going from 1.1.7 to 1.1.15 is quite a big jump, so there is a lot that
could go wrong..

Your configuration looks fine at first glance; you're getting XML mixed
in because of the missing schema: crmsh can't be sure that it translated
the XML to line syntax correctly, so it falls back to showing the XML.
That should all fix itself by changing the validate-with attribute on
the cib root tag to a newer version.
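
A quick way to see which schema versions the installed pacemaker
actually ships, and therefore what validate-with can sensibly be set
to, is simply:

   ls /usr/share/pacemaker/*.rng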

I'm guessing that the errors you are getting when connecting the second
node are due to the missing schema; it's hard to tell based on the log
snippet attached, though.

execve("/usr/sbin/cibadmin", ["cibadmin", "--upgrade", "--force"], [/* 42 vars 
*/]) = 0
brk(0)  = 0x7fee60f7a000
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7fee60845000
access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=60973, ...}) = 0
mmap(NULL, 60973, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fee60836000
close(3)= 0
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/libcrmcommon.so.3", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0p\1\1\0\0\0\0\0"..., 
832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=367264, ...}) = 0
mmap(NULL, 2467216, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7fee601c4000
mprotect(0x7fee6021a000, 2093056, PROT_NONE) = 0
mmap(0x7fee60419000, 20480, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x55000) = 0x7fee60419000
mmap(0x7fee6041e000, 1424, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fee6041e000
close(3)= 0
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/libcib.so.4", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0Ps\0\0\0\0\0\0"..., 832) 
= 832
fstat(3, {st_mode=S_IFREG|0644, st_size=128320, ...}) = 0
mmap(NULL, 2224936, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0x7fee5ffa4000
mprotect(0x7fee5ffc1000, 2097152, PROT_NONE) = 0
mmap(0x7fee601c1000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1d000) = 0x7fee601c1000
mmap(0x7fee601c3000, 808, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fee601c3000
close(3)= 0
access("/etc/ld.so.nohwcap", F_OK)  = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/libqb.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, 

[ClusterLabs] pacemaker after upgrade from wheezy to jessie

2016-11-03 Thread Toni Tschampke

Hi,

we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie 
(pacemaker 1.1.15, corosync 2.3.6).
During the upgrade pacemaker was removed (rc) and afterwards reinstalled 
from jessie-backports, same for crmsh.


Now we are encountering multiple problems:

First I checked the configuration on a single node running pacemaker & 
corosync, which dropped a strange error followed by multiple lines 
stating the syntax is wrong. crm configure show then showed a mixed view 
of XML and crmsh single-line syntax.


> ERROR: Cannot read schema file
> '/usr/share/pacemaker/pacemaker-1.1.rng': [Errno 2] No such file or
> directory: '/usr/share/pacemaker/pacemaker-1.1.rng'


When we looked into that folder there were pacemaker-1.0.rng, 1.2 and so 
on. As a quick try we symlinked 1.2 to 1.1 and the syntax errors 
were gone. When running crm resource show, all resources showed up; when 
running crm_mon -1fA the output was unexpected, as it showed all nodes 
offline, with no DC elected:


> Stack: corosync
> Current DC: NONE
> Last updated: Thu Nov  3 11:11:16 2016
> Last change: Thu Nov  3 09:54:52 2016 by root via cibadmin on nebel1
>
>  *** Resource management is DISABLED ***
>  The cluster will not attempt to start, stop or recover services
>
> 3 nodes and 73 resources configured:
> 5 resources DISABLED and 0 BLOCKED from being started due to failures
>
> OFFLINE: [ nebel1 nebel2 nebel3 ]

we tried to manually change dc-version

when issuing a simple cleanup command I got the following error:

> crm resource cleanup DrbdBackuppcMs
> Error signing on to the CRMd service
> Error performing operation: Transport endpoint is not connected

which looks like crmsh is not able to communicate with crmd; nothing 
is logged in corosync.log in this case.


we experimented with multiple config changes (corosync.conf: pacemaker 
ver 0 > 1) and changed cib-bootstrap-options: cluster-infrastructure 
from openais to corosync.

> Package versions:
> cman 3.1.8-1.2+b1
> corosync 2.3.6-3~bpo8+1
> crmsh 2.2.0-1~bpo8+1
> csync2 1.34-2.3+b1
> dlm-pcmk 3.0.12-3.2+deb7u2
> libcman3 3.1.8-1.2+b1
> libcorosync-common4:amd64 2.3.6-3~bpo8+1
> munin-libvirt-plugins 0.0.6-1
> pacemaker 1.1.15-2~bpo8+1
> pacemaker-cli-utils 1.1.15-2~bpo8+1
> pacemaker-common 1.1.15-2~bpo8+1
> pacemaker-resource-agents 1.1.15-2~bpo8+1

> Kernel: #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux

I attached our cib before upgrade and after, as well as the one with the 
mixed syntax and our corosync.conf.


When we tried to connect a second node to the cluster, pacemaker starts 
its daemons, starts corosync and dies after 15 tries with the following 
in the corosync log:


> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
> crmd: info: do_cib_control: Could not connect to the CIB service:
> Transport endpoint is not connected
> crmd:  warning: do_cib_control:
> Couldn't complete CIB registration 15 times... pause and retry
> attrd: error: attrd_cib_connect: Signon to CIB failed:
> Transport endpoint is not connected (-107)
> attrd: info: main: Shutting down attribute manager
> attrd: info: qb_ipcs_us_withdraw: withdrawing server sockets
> attrd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped (2000ms)
> pacemakerd:  warning: pcmk_child_exit:
> The attrd process (12761) can no longer be respawned,
> shutting the cluster down.
> pacemakerd: notice: pcmk_shutdown_worker: Shutting down Pacemaker

A third node joins without the above error, but crm_mon still shows all 
nodes as offline.


Thanks for any advice on how to solve this; I'm out of ideas now.

Regards, Toni

--
Kind regards

Toni Tschampke | t...@halle.it
bcs kommunikationslösungen
Inh. Dipl. Ing. Carsten Burkhardt
Harz 51 | 06108 Halle (Saale) | Germany
tel +49 345 29849-0 | fax +49 345 29849-22
www.b-c-s.de | www.halle.it | www.wivewa.de

node nebel1 \
utilization memory=61440 \
attributes standby=off
node nebel2 \
utilization memory=61440 \
attributes standby=on
node nebel3 \
utilization memory=6144 \
attributes standby=on
primitive ClusterEmail MailTo \
params email="clus...@bcs.bcs" subject="[cluster]" \
meta allow-migrate=true target-role=Stopped
primitive ClusterIp IPaddr2 \
params ip=10.110.2.1 cidr_netmask=16 \
op monitor interval=30
primitive ClusterMon ClusterMon \
params extra_options="-r -f -A -o" 
htmlfile="/var/www/cluster-status.html" \
operations $id=ClusterMon-operations \
op monitor interval=60 start-delay=0 timeout=30 \
meta target-role=started
primitive DhcpDaemon lsb:isc-dhcp-server \
op start interval=0 timeout=30 \
op stop interval=0 timeout=30 \
op monitor interval=60 \
meta target-role=Started
primitive DrbdAptcacher ocf:linbit:drbd \
params drbd_resource=apt-cacher \
operations $id=DrbdAptcacher-operations \
op