Did your upgrade documentation describe how to update the corosync
configuration, and did that go well? crmd may be unable to function due
to lack of quorum information.

Thanks for this tip, corosync quorum configuration was the cause.
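
For anyone hitting the same thing: the relevant piece is the quorum section in corosync.conf. A minimal sketch of what corosync 2 expects (assuming a plain votequorum setup; expected_votes is just illustrative for our three nodes):

quorum {
    provider: corosync_votequorum
    # optional, adjust to the cluster size:
    # expected_votes: 3
}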

As we changed validate-with as well as the feature set manually in the cib, is there still a need to issue the cibadmin --upgrade --force command, or is that command only for changing the schema?
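
Either way, I'll re-run crm_verify against the on-disk cib as suggested earlier, and (assuming crmd stays reachable now) against the live configuration as well:

crm_verify -x /var/lib/heartbeat/crm/cib.xml
crm_verify --live-check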

--
Kind regards

Toni Tschampke | t...@halle.it
bcs kommunikationslösungen
Inh. Dipl. Ing. Carsten Burkhardt
Harz 51 | 06108 Halle (Saale) | Germany
tel +49 345 29849-0 | fax +49 345 29849-22
www.b-c-s.de | www.halle.it | www.wivewa.de


EASILY MANAGE ADDRESSES, PHONE CALLS AND DOCUMENTS - WITH WIVEWA -
YOUR KNOWLEDGE MANAGER FOR YOUR BUSINESS!

Further information is available at www.wivewa.de

On 08.11.2016 at 22:51, Ken Gaillot wrote:
On 11/07/2016 09:08 AM, Toni Tschampke wrote:
We managed to change the validate-with option via a workaround (cibadmin
export & replace), as setting the value with cibadmin --modify doesn't
write the changes to disk.
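
(Roughly, the workaround was along these lines; the temporary file name is just an example:)

cibadmin --query > /tmp/cib.xml
# edit validate-with in /tmp/cib.xml
cibadmin --replace --xml-file /tmp/cib.xml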

After experimenting with various schemas (the xml is correctly interpreted
by crmsh) we are still not able to communicate with the local crmd.

Can someone please help to determine why the local crmd is not
responding (we disabled our other nodes to rule out possible
corosync-related issues) and runs into errors/timeouts when we issue
crmsh or cibadmin commands?

It occurs to me that wheezy used corosync 1. There were major changes
from corosync 1 to 2 ... 1 relied on a "plugin" to provide quorum for
pacemaker, whereas 2 has quorum built-in.

Did your upgrade documentation describe how to update the corosync
configuration, and did that go well? crmd may be unable to function due
to lack of quorum information.

Examples of local commands that do not work:

Timeout when running cibadmin (strace attached):
cibadmin --upgrade --force
Call cib_upgrade failed (-62): Timer expired

Error when running a crm resource cleanup:
crm resource cleanup $vm
Error signing on to the CRMd service
Error performing operation: Transport endpoint is not connected

I attached the strace log from running cib_upgrade; does this help to
find the cause of the timeout issue?

Here is the corosync dump when locally starting pacemaker:

Nov 07 16:01:59 [24339] nebel1 corosync notice  [MAIN  ] main.c:1256
Corosync Cluster Engine ('2.3.6'): started and ready to provide service.
Nov 07 16:01:59 [24339] nebel1 corosync info    [MAIN  ] main.c:1257
Corosync built-in features: dbus rdma monitoring watchdog augeas
systemd upstart xmlconf qdevices snmp pie relro bindnow
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemnet.c:248 Initializing transport (UDP/IP Multicast).
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
none hash: none
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemnet.c:248 Initializing transport (UDP/IP Multicast).
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
none hash: none
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemudp.c:671 The network interface [10.112.0.1] is now up.
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync configuration map access [0]
Nov 07 16:01:59 [24339] nebel1 corosync info    [QB    ]
ipc_setup.c:536 server name: cmap
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync configuration service [1]
Nov 07 16:01:59 [24339] nebel1 corosync info    [QB    ]
ipc_setup.c:536 server name: cfg
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync cluster closed process group service
v1.01 [2]
Nov 07 16:01:59 [24339] nebel1 corosync info    [QB    ]
ipc_setup.c:536 server name: cpg
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync profile loading service [4]
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync resource monitoring service [6]
Nov 07 16:01:59 [24339] nebel1 corosync info    [WD    ] wd.c:669
Watchdog /dev/watchdog is now been tickled by corosync.
Nov 07 16:01:59 [24339] nebel1 corosync warning [WD    ] wd.c:625
Could not change the Watchdog timeout from 10 to 6 seconds
Nov 07 16:01:59 [24339] nebel1 corosync warning [WD    ] wd.c:464
resource load_15min missing a recovery key.
Nov 07 16:01:59 [24339] nebel1 corosync warning [WD    ] wd.c:464
resource memory_used missing a recovery key.
Nov 07 16:01:59 [24339] nebel1 corosync info    [WD    ] wd.c:581 no
resources configured.
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync watchdog service [7]
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync cluster quorum service v0.1 [3]
Nov 07 16:01:59 [24339] nebel1 corosync info    [QB    ]
ipc_setup.c:536 server name: quorum
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemudp.c:671 The network interface [10.110.1.1] is now up.
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemsrp.c:2095 A new membership (10.112.0.1:348) was formed. Members
joined: 1
Nov 07 16:01:59 [24339] nebel1 corosync notice  [MAIN  ] main.c:310
Completed service synchronization, ready to provide service.
Nov 07 16:01:59 [24341] nebel1 pacemakerd:   notice: main:
Starting Pacemaker 1.1.15 | build=e174ec8 features: generated-manpages
agent-manpages ascii-docs publican-docs ncurses libqb-logging
libqb-ipc lha-fencing upstart systemd nagios  corosync-native
atomic-attrd snmp libesmtp acls
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info: main:
Maximum core file size is: 18446744073709551615
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
qb_ipcs_us_publish:        server name: pacemakerd
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
corosync_node_name:        Unable to get node name for nodeid 1
Nov 07 16:01:59 [24341] nebel1 pacemakerd:   notice:
get_node_name:     Could not obtain a node name for corosync nodeid 1
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
crm_get_peer:      Created entry
283a5061-34c2-4b81-bff9-738533f22277/0x7f8a151931a0 for node (null)/1
(1 total)
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
crm_get_peer:      Node 1 has uuid 1
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
crm_update_peer_proc:      cluster_connect_cpg: Node (null)[1] -
corosync-cpg is now online
Nov 07 16:01:59 [24341] nebel1 pacemakerd:    error:
cluster_connect_quorum:    Corosync quorum is not configured
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
corosync_node_name:        Unable to get node name for nodeid 1
Nov 07 16:01:59 [24341] nebel1 pacemakerd:   notice:
get_node_name:     Defaulting to uname -n for the local corosync node
name
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
crm_get_peer:      Node 1 is now known as nebel1
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
start_child:       Using uid=108 and group=114 for process cib
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
start_child:       Forked child 24342 for process cib
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
start_child:       Forked child 24343 for process stonith-ng
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
start_child:       Forked child 24344 for process lrmd
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
start_child:       Using uid=108 and group=114 for process attrd
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
start_child:       Forked child 24345 for process attrd
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
start_child:       Using uid=108 and group=114 for process pengine
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
start_child:       Forked child 24346 for process pengine
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
start_child:       Using uid=108 and group=114 for process crmd
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
start_child:       Forked child 24347 for process crmd
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info: main:
Starting mainloop
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
pcmk_cpg_membership:       Node 1 joined group pacemakerd (counter=0.0)
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
pcmk_cpg_membership:       Node 1 still member of group pacemakerd
(peer=nebel1, counter=0.0)
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
mcp_cpg_deliver:   Ignoring process list sent by peer for local node
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
mcp_cpg_deliver:   Ignoring process list sent by peer for local node
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
mcp_cpg_deliver:   Ignoring process list sent by peer for local node
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
mcp_cpg_deliver:   Ignoring process list sent by peer for local node
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
mcp_cpg_deliver:   Ignoring process list sent by peer for local node
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
mcp_cpg_deliver:   Ignoring process list sent by peer for local node
Nov 07 16:01:59 [24341] nebel1 pacemakerd:     info:
mcp_cpg_deliver:   Ignoring process list sent by peer for local node
Nov 07 16:01:59 [24342] nebel1        cib:     info:
crm_log_init:      Changed active directory to /var/lib/pacemaker/cores
Nov 07 16:01:59 [24342] nebel1        cib:   notice: main:      Using
legacy config location: /var/lib/heartbeat/crm
Nov 07 16:01:59 [24342] nebel1        cib:     info:
get_cluster_type:  Verifying cluster type: 'corosync'
Nov 07 16:01:59 [24342] nebel1        cib:     info:
get_cluster_type:  Assuming an active 'corosync' cluster
Nov 07 16:01:59 [24342] nebel1        cib:     info:
retrieveCib:       Reading cluster configuration file
/var/lib/heartbeat/crm/cib.xml (digest:
/var/lib/heartbeat/crm/cib.xml.sig)
Nov 07 16:01:59 [24344] nebel1       lrmd:     info:
crm_log_init:      Changed active directory to /var/lib/pacemaker/cores
Nov 07 16:01:59 [24344] nebel1       lrmd:     info:
qb_ipcs_us_publish:        server name: lrmd
Nov 07 16:01:59 [24344] nebel1       lrmd:     info: main:      Starting
Nov 07 16:01:59 [24346] nebel1    pengine:     info:
crm_log_init:      Changed active directory to /var/lib/pacemaker/cores
Nov 07 16:01:59 [24346] nebel1    pengine:     info:
qb_ipcs_us_publish:        server name: pengine
Nov 07 16:01:59 [24346] nebel1    pengine:     info: main:
Starting pengine
Nov 07 16:01:59 [24345] nebel1      attrd:     info:
crm_log_init:      Changed active directory to /var/lib/pacemaker/cores
Nov 07 16:01:59 [24345] nebel1      attrd:     info: main:
Starting up
Nov 07 16:01:59 [24345] nebel1      attrd:     info:
get_cluster_type:  Verifying cluster type: 'corosync'
Nov 07 16:01:59 [24345] nebel1      attrd:     info:
get_cluster_type:  Assuming an active 'corosync' cluster
Nov 07 16:01:59 [24345] nebel1      attrd:   notice:
crm_cluster_connect:       Connecting to cluster infrastructure: corosync
Nov 07 16:01:59 [24347] nebel1       crmd:     info:
crm_log_init:      Changed active directory to /var/lib/pacemaker/cores
Nov 07 16:01:59 [24347] nebel1       crmd:     info: main:      CRM
Git Version: 1.1.15 (e174ec8)
Nov 07 16:01:59 [24343] nebel1 stonith-ng:     info:
crm_log_init:      Changed active directory to /var/lib/pacemaker/cores
Nov 07 16:01:59 [24343] nebel1 stonith-ng:     info:
get_cluster_type:  Verifying cluster type: 'corosync'
Nov 07 16:01:59 [24343] nebel1 stonith-ng:     info:
get_cluster_type:  Assuming an active 'corosync' cluster
Nov 07 16:01:59 [24343] nebel1 stonith-ng:   notice:
crm_cluster_connect:       Connecting to cluster infrastructure: corosync
Nov 07 16:01:59 [24347] nebel1       crmd:     info: do_log:    Input
I_STARTUP received in state S_STARTING from crmd_init
Nov 07 16:01:59 [24347] nebel1       crmd:     info:
get_cluster_type:  Verifying cluster type: 'corosync'
Nov 07 16:02:00 [24342] nebel1        cib:     info:
corosync_node_name:        Unable to get node name for nodeid 1
Nov 07 16:02:00 [24343] nebel1 stonith-ng:     info:
corosync_node_name:        Unable to get node name for nodeid 1
Nov 07 16:02:00 [24342] nebel1        cib:   notice:
get_node_name:     Could not obtain a node name for corosync nodeid 1
Nov 07 16:02:00 [24343] nebel1 stonith-ng:   notice:
get_node_name:     Defaulting to uname -n for the local corosync node
name
Nov 07 16:02:00 [24343] nebel1 stonith-ng:     info:
crm_get_peer:      Node 1 is now known as nebel1
Nov 07 16:02:00 [24342] nebel1        cib:     info:
crm_get_peer:      Created entry
f5df58e3-3848-440c-8f6b-d572f8fa9b9c/0x7f0ce1744570 for node (null)/1
(1 total)
Nov 07 16:02:00 [24342] nebel1        cib:     info:
crm_get_peer:      Node 1 has uuid 1
Nov 07 16:02:00 [24342] nebel1        cib:     info:
crm_update_peer_proc:      cluster_connect_cpg: Node (null)[1] -
corosync-cpg is now online
Nov 07 16:02:00 [24342] nebel1        cib:   notice:
crm_update_peer_state_iter:        Node (null) state is now member |
nodeid=1 previous=unknown source=crm_update_peer_proc
Nov 07 16:02:00 [24342] nebel1        cib:     info:
init_cs_connection_once:   Connection to 'corosync': established
Nov 07 16:02:00 [24345] nebel1      attrd:     info: main:
Cluster connection active
Nov 07 16:02:00 [24345] nebel1      attrd:     info:
qb_ipcs_us_publish:        server name: attrd
Nov 07 16:02:00 [24345] nebel1      attrd:     info: main:
Accepting attribute updates
Nov 07 16:02:00 [24342] nebel1        cib:     info:
corosync_node_name:        Unable to get node name for nodeid 1
Nov 07 16:02:00 [24342] nebel1        cib:   notice:
get_node_name:     Defaulting to uname -n for the local corosync node
name
Nov 07 16:02:00 [24342] nebel1        cib:     info:
crm_get_peer:      Node 1 is now known as nebel1
Nov 07 16:02:00 [24342] nebel1        cib:     info:
qb_ipcs_us_publish:        server name: cib_ro
Nov 07 16:02:00 [24342] nebel1        cib:     info:
qb_ipcs_us_publish:        server name: cib_rw
Nov 07 16:02:00 [24342] nebel1        cib:     info:
qb_ipcs_us_publish:        server name: cib_shm
Nov 07 16:02:00 [24342] nebel1        cib:     info: cib_init:
Starting cib mainloop
Nov 07 16:02:00 [24342] nebel1        cib:     info:
pcmk_cpg_membership:       Node 1 joined group cib (counter=0.0)
Nov 07 16:02:00 [24342] nebel1        cib:     info:
pcmk_cpg_membership:       Node 1 still member of group cib
(peer=nebel1, counter=0.0)
Nov 07 16:02:00 [24342] nebel1        cib:     info:
cib_file_backup:   Archived previous version as
/var/lib/heartbeat/crm/cib-72.raw
Nov 07 16:02:00 [24342] nebel1        cib:     info:
cib_file_write_with_digest:        Wrote version 0.8464.0 of the CIB
to disk (digest: 5201c56641a95e5117df4184587c3e93)
Nov 07 16:02:00 [24342] nebel1        cib:     info:
cib_file_write_with_digest:        Reading cluster configuration file
/var/lib/heartbeat/crm/cib.naRhNz (digest:
/var/lib/heartbeat/crm/cib.hLaVCH)
Nov 07 16:02:00 [24347] nebel1       crmd:     info:
do_cib_control:    CIB connection established
Nov 07 16:02:00 [24347] nebel1       crmd:   notice:
crm_cluster_connect:       Connecting to cluster infrastructure: corosync
Nov 07 16:02:00 [24347] nebel1       crmd:     info:
corosync_node_name:        Unable to get node name for nodeid 1
Nov 07 16:02:00 [24347] nebel1       crmd:   notice:
get_node_name:     Could not obtain a node name for corosync nodeid 1
Nov 07 16:02:00 [24347] nebel1       crmd:     info:
crm_get_peer:      Created entry
43a3b98f-d81d-4cc7-b46e-4512f24db371/0x7f798ff40040 for node (null)/1
(1 total)
Nov 07 16:02:00 [24347] nebel1       crmd:     info:
crm_get_peer:      Node 1 has uuid 1
Nov 07 16:02:00 [24347] nebel1       crmd:     info:
crm_update_peer_proc:      cluster_connect_cpg: Node (null)[1] -
corosync-cpg is now online
Nov 07 16:02:00 [24347] nebel1       crmd:     info:
init_cs_connection_once:   Connection to 'corosync': established
Nov 07 16:02:00 [24347] nebel1       crmd:     info:
corosync_node_name:        Unable to get node name for nodeid 1
Nov 07 16:02:00 [24347] nebel1       crmd:   notice:
get_node_name:     Defaulting to uname -n for the local corosync node
name
Nov 07 16:02:00 [24347] nebel1       crmd:     info:
crm_get_peer:      Node 1 is now known as nebel1
Nov 07 16:02:00 [24347] nebel1       crmd:     info:
peer_update_callback:      nebel1 is now in unknown state
Nov 07 16:02:00 [24347] nebel1       crmd:    error:
cluster_connect_quorum:    Corosync quorum is not configured
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
corosync_node_name:        Unable to get node name for nodeid 1
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
corosync_node_name:        Unable to get node name for nodeid 2
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
corosync_node_name:        Unable to get node name for nodeid 2
Nov 07 16:02:01 [24347] nebel1       crmd:   notice:
get_node_name:     Could not obtain a node name for corosync nodeid 2
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
crm_get_peer:      Created entry
c790c642-6666-4022-bba9-f700e4773b03/0x7f79901428e0 for node (null)/2
(2 total)
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
crm_get_peer:      Node 2 has uuid 2
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
corosync_node_name:        Unable to get node name for nodeid 3
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
corosync_node_name:        Unable to get node name for nodeid 3
Nov 07 16:02:01 [24347] nebel1       crmd:   notice:
get_node_name:     Could not obtain a node name for corosync nodeid 3
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
crm_get_peer:      Created entry
928f8124-4d29-4285-99de-50038d3c3b7e/0x7f7990142a20 for node (null)/3
(3 total)
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
crm_get_peer:      Node 3 has uuid 3
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
do_ha_control:     Connected to the cluster
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
lrmd_ipc_connect:  Connecting to lrmd
Nov 07 16:02:01 [24342] nebel1        cib:     info:
cib_process_request:       Forwarding cib_modify operation for section
nodes to all (origin=local/crmd/3)
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
do_lrm_control:    LRM connection established
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
do_started:        Delaying start, no membership data (0000000000100000)
Nov 07 16:02:01 [24342] nebel1        cib:     info:
corosync_node_name:        Unable to get node name for nodeid 1
Nov 07 16:02:01 [24342] nebel1        cib:   notice:
get_node_name:     Defaulting to uname -n for the local corosync node
name
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
parse_notifications:       No optional alerts section in cib
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
do_started:        Delaying start, no membership data (0000000000100000)
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
pcmk_cpg_membership:       Node 1 joined group crmd (counter=0.0)
Nov 07 16:02:01 [24347] nebel1       crmd:     info:
pcmk_cpg_membership:       Node 1 still member of group crmd
(peer=nebel1, counter=0.0)
Nov 07 16:02:01 [24342] nebel1        cib:     info:
cib_process_request:       Completed cib_modify operation for section
nodes: OK (rc=0, origin=nebel1/crmd/3, version=0.8464.0)
Nov 07 16:02:01 [24345] nebel1      attrd:     info:
attrd_cib_connect: Connected to the CIB after 2 attempts
Nov 07 16:02:01 [24345] nebel1      attrd:     info: main:      CIB
connection active
Nov 07 16:02:01 [24345] nebel1      attrd:     info:
pcmk_cpg_membership:       Node 1 joined group attrd (counter=0.0)
Nov 07 16:02:01 [24345] nebel1      attrd:     info:
pcmk_cpg_membership:       Node 1 still member of group attrd
(peer=nebel1, counter=0.0)
Nov 07 16:02:01 [24343] nebel1 stonith-ng:     info: setup_cib:
Watching for stonith topology changes
Nov 07 16:02:01 [24343] nebel1 stonith-ng:     info:
qb_ipcs_us_publish:        server name: stonith-ng
Nov 07 16:02:01 [24343] nebel1 stonith-ng:     info: main:
Starting stonith-ng mainloop
Nov 07 16:02:01 [24343] nebel1 stonith-ng:     info:
pcmk_cpg_membership:       Node 1 joined group stonith-ng (counter=0.0)
Nov 07 16:02:01 [24343] nebel1 stonith-ng:     info:
pcmk_cpg_membership:       Node 1 still member of group stonith-ng
(peer=nebel1, counter=0.0)
Nov 07 16:02:01 [24343] nebel1 stonith-ng:     info:
init_cib_cache_cb: Updating device list from the cib: init
Nov 07 16:02:01 [24343] nebel1 stonith-ng:     info:
cib_devices_update:        Updating devices to version 0.8464.0
Nov 07 16:02:01 [24343] nebel1 stonith-ng:   notice:
unpack_config:     On loss of CCM Quorum: Ignore
Nov 07 16:02:02 [24343] nebel1 stonith-ng:   notice:
stonith_device_register:   Added 'stonith1Nebel2' to the device list
(1 active devices)
Nov 07 16:02:02 [24343] nebel1 stonith-ng:     info:
cib_device_update: Device stonith1Nebel1 has been disabled on nebel1:
score=-INFINITY

Current cib settings:
cibadmin -Q | grep validate
<cib admin_epoch="0" epoch="8464" num_updates="0"
validate-with="pacemaker-2.4" crm_feature_set="3.0.10" have-quorum="1"
cib-last-written="Fri Nov  4 12:15:30 2016" update-origin="nebel3"
update-client="crm_attribute" update-user="root">

Any help is appreciated, thanks in advance

Regards, Toni


On 03.11.2016 at 17:42, Toni Tschampke wrote:
  > I'm guessing this change should be instantly written into the xml
file?
  > If this is the case something is wrong: grepping for validate gives the
  > old string back.

We found some strange behavior when setting "validate-with" via
cibadmin: corosync.log shows the successful transaction and issuing
cibadmin --query gives the correct value, but it is NOT written into
cib.xml.

We restarted pacemaker and the value was reset to pacemaker-1.1.
If the signatures for cib.xml are generated by pacemaker/cib, which
algorithm is used? It looks like md5 to me.

Would it be possible to manually edit cib.xml and generate a valid
cib.xml.sig to get one step further in the debugging process?
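
If it really is a plain md5 over the file contents (an assumption on my part), comparing the two should tell:

md5sum /var/lib/heartbeat/crm/cib.xml
cat /var/lib/heartbeat/crm/cib.xml.sig

If those match, re-signing a hand-edited cib.xml would just mean writing the new md5 digest into cib.xml.sig (the 32-byte .sig files further down suggest a bare hex digest without a trailing newline).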

Regards, Toni


On 03.11.2016 at 16:39, Toni Tschampke wrote:
  > I'm going to guess you were using the experimental 1.1 schema as the
  > "validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
  > changing the validate-with to pacemaker-next or pacemaker-1.2 and
see if
  > you get better results. Don't edit the file directly though; use the
  > cibadmin command so it signs the end result properly.
  >
  > After changing the validate-with, run:
  >
  >    crm_verify -x /var/lib/pacemaker/cib/cib.xml
  >
  > and fix any errors that show up.

Strange, the location of our cib.xml differs from your path; our cib is
located in /var/lib/heartbeat/crm/

Running cibadmin --modify --xml-text '<cib
validate-with="pacemaker-1.2"/>'

gave no output, but it was logged to corosync.log:

cib:     info: cib_perform_op:    -- <cib num_updates="0"
validate-with="pacemaker-1.1"/>
cib:     info: cib_perform_op:    ++ <cib admin_epoch="0" epoch="8462"
num_updates="1" validate-with="pacemaker-1.2" crm_feature_set="3.0.6"
   have-quorum="1" cib-last-written="Thu Nov  3 10:05:52 2016"
update-origin="nebel1" update-client="cibadmin" update-user="root"/>

I'm guessing this change should be instantly written into the xml file?
If this is the case something is wrong: grepping for validate gives the
old string back.

<cib admin_epoch="0" epoch="8462" num_updates="0"
validate-with="pacemaker-1.1" crm_feature_set="3.0.6" have-quorum="1"
cib-last-written="Thu Nov  3 16:19:51 2016" update-origin="nebel1"
update-client="cibadmin" update-user="root">

pacemakerd --features
Pacemaker 1.1.15 (Build: e174ec8)
Supporting v3.0.10:

Should the crm_feature_set be updated this way too? I'm guessing this is
done when "cibadmin --upgrade" succeeds?

We just get a timeout error when trying to upgrade it with cibadmin:
Call cib_upgrade failed (-62): Timer expired

Have permissions changed from 1.1.7 to 1.1.15? When looking at our
quite big /var/lib/heartbeat/crm/ folder, some permissions changed:

-rw------- 1 hacluster root      80K Nov  1 16:56 cib-31.raw
-rw-r--r-- 1 hacluster root       32 Nov  1 16:56 cib-31.raw.sig
-rw------- 1 hacluster haclient  80K Nov  1 18:53 cib-32.raw
-rw------- 1 hacluster haclient   32 Nov  1 18:53 cib-32.raw.sig

cib-31 was written before upgrading, cib-32 after starting the upgraded pacemaker.
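
For what it's worth, I assume the older files could be aligned with the ownership the new version writes (hacluster:haclient, judging from cib-32 above) with something like:

chown -R hacluster:haclient /var/lib/heartbeat/crm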



On 03.11.2016 at 15:39, Ken Gaillot wrote:
On 11/03/2016 05:51 AM, Toni Tschampke wrote:
Hi,

we just upgraded our nodes from wheezy 7.11 (pacemaker 1.1.7) to jessie
(pacemaker 1.1.15, corosync 2.3.6).
During the upgrade pacemaker was removed (rc) and afterwards reinstalled
from jessie-backports, same for crmsh.

Now we are encountering multiple problems:

First I checked the configuration on a single node running pacemaker &
corosync, which produced a strange error followed by multiple lines
stating the syntax is wrong. crm configure show then displayed a mixed
view of xml and crmsh single-line syntax.

ERROR: Cannot read schema file
'/usr/share/pacemaker/pacemaker-1.1.rng': [Errno 2] No such file or
directory: '/usr/share/pacemaker/pacemaker-1.1.rng'

pacemaker-1.1.rng was renamed to pacemaker-next.rng in Pacemaker
1.1.12,
as it was used to hold experimental new features rather than as the
actual next version of the schema. So, the schema skipped to 1.2.

I'm going to guess you were using the experimental 1.1 schema as the
"validate-with" at the top of /var/lib/pacemaker/cib/cib.xml. Try
changing the validate-with to pacemaker-next or pacemaker-1.2 and
see if
you get better results. Don't edit the file directly though; use the
cibadmin command so it signs the end result properly.
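
For example, something along these lines (substitute whichever schema you settle on):

    cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.2"/>'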

After changing the validate-with, run:

    crm_verify -x /var/lib/pacemaker/cib/cib.xml

and fix any errors that show up.

When we looked into that folder there were pacemaker-1.0.rng, 1.2 and so
on. As a quick try we symlinked the 1.2 schema to 1.1 and the syntax
errors were gone. When running crm resource show, all resources showed
up; when running crm_mon -1fA the output was unexpected, as it showed
all nodes offline, with no DC elected:

Stack: corosync
Current DC: NONE
Last updated: Thu Nov  3 11:11:16 2016
Last change: Thu Nov  3 09:54:52 2016 by root via cibadmin on nebel1

               *** Resource management is DISABLED ***
   The cluster will not attempt to start, stop or recover services

3 nodes and 73 resources configured:
5 resources DISABLED and 0 BLOCKED from being started due to failures

OFFLINE: [ nebel1 nebel2 nebel3 ]

We also tried to manually change dc-version.

When issuing a simple cleanup command I got the following error:

crm resource cleanup DrbdBackuppcMs
Error signing on to the CRMd service
Error performing operation: Transport endpoint is not connected

which looks like crmsh is not able to communicate with crmd; nothing
is logged in corosync.log in this case.

We experimented with multiple config changes (corosync.conf: pacemaker
service ver 0 -> 1; cib-bootstrap-options: cluster-infrastructure from
openais to corosync).

Package versions:
cman 3.1.8-1.2+b1
corosync 2.3.6-3~bpo8+1
crmsh 2.2.0-1~bpo8+1
csync2 1.34-2.3+b1
dlm-pcmk 3.0.12-3.2+deb7u2
libcman3 3.1.8-1.2+b1
libcorosync-common4:amd64 2.3.6-3~bpo8+1
munin-libvirt-plugins 0.0.6-1
pacemaker 1.1.15-2~bpo8+1
pacemaker-cli-utils 1.1.15-2~bpo8+1
pacemaker-common 1.1.15-2~bpo8+1
pacemaker-resource-agents 1.1.15-2~bpo8+1

Kernel: #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux

I attached our cib before upgrade and after, as well as the one with
the
mixed syntax and our corosync.conf.

When we tried to connect a second node to the cluster, pacemaker starts
its daemons, starts corosync, and dies after 15 tries with the following
in the corosync log:

crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped
(2000ms)
crmd: info: do_cib_control: Could not connect to the CIB service:
Transport endpoint is not connected
crmd:  warning: do_cib_control:
Couldn't complete CIB registration 15 times... pause and retry
attrd: error: attrd_cib_connect: Signon to CIB failed:
Transport endpoint is not connected (-107)
attrd: info: main: Shutting down attribute manager
attrd: info: qb_ipcs_us_withdraw: withdrawing server sockets
attrd: info: crm_xml_cleanup: Cleaning up memory from libxml2
crmd: info: crm_timer_popped: Wait Timer (I_NULL) just popped
(2000ms)
pacemakerd:  warning: pcmk_child_exit:
The attrd process (12761) can no longer be respawned,
shutting the cluster down.
pacemakerd: notice: pcmk_shutdown_worker: Shutting down Pacemaker

A third node joins without the above error, but crm_mon still shows all
nodes as offline.

Thanks for any advice on how to solve this; I'm out of ideas now.

Regards, Toni

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

