Re: [ClusterLabs] node name issues (Could not obtain a node name for corosync nodeid 739512332)

2019-08-22 Thread Ken Gaillot
On Thu, 2019-08-22 at 09:07 +0200, Ulrich Windl wrote:
> Hi!
> 
> When starting pacemaker (1.1.19+20181105.ccd6b5b10-3.10.1) on a node
> that had been down for a while, I noticed some unexpected messages
> about the node name:
> 
> pacemakerd:   notice: get_node_name:   Could not obtain a node name
> for corosync nodeid 739512332
> pacemakerd: info: crm_get_peer:Created entry a21bf687-045b-
> 4fd7-9340-0562ef595883/0x18752f0 for node (null)/739512332 (1 total)
> pacemakerd: info: crm_get_peer:Node 739512332 has uuid
> 739512332
> 
> Seems UUID and node ID are mixed up in the message, at least...

"UUID" is a misnomer, for historical reasons. It was an actual UUID for
heartbeat (originally the only supported cluster layer), but for
corosync it's the node ID and for Pacemaker Remote nodes it's the node
name.

Ironically the string after "Created entry" is an actual UUID but
that's not the "node UUID", just an internal hash table id.

We should definitely update all those messages to reflect the current
reality.

> pacemakerd: info: crm_update_peer_proc: cluster_connect_cpg: Node
> (null)[739512332] - corosync-cpg is now online
> pacemakerd:   notice: cluster_connect_quorum: Quorum acquired
> pacemakerd: info: corosync_node_name: Unable to get node name for
> nodeid 739512332
> pacemakerd:   notice: get_node_name:   Defaulting to uname -n for the
> local corosync node name
> pacemakerd: info: crm_get_peer:Node 739512332 is now known as
> h12
> ...
> pacemakerd: info: main:Starting mainloop
> pacemakerd: info: pcmk_quorum_notification:Quorum
> retained | membership=172 members=2
> pacemakerd: info: corosync_node_name:  Unable to get node
> name for nodeid 739512331
> pacemakerd:   notice: get_node_name:   Could not obtain a node name
> for corosync nodeid 739512331
> pacemakerd: info: crm_get_peer:Created entry f4ef35e4-1b49-
> 4e48-916b-bb0fab7c52c9/0x1876820 for node (null)/739512331 (2 total)
> pacemakerd: info: crm_get_peer:Node 739512331 has uuid
> 739512331
> ...
> pacemakerd: info: corosync_node_name:  Unable to get node
> name for nodeid 739512331
> ...
> pacemakerd:   notice: get_node_name:   Could not obtain a node name
> for corosync nodeid 739512331
> pacemakerd:   notice: crm_update_peer_state_iter:  Node (null)
> state is now member | nodeid=739512331 previous=unknown
> source=pcmk_quorum_notification
> pacemakerd:   notice: crm_update_peer_state_iter:  Node 12 state
> is now member | nodeid=739512332 previous=unknown
> source=pcmk_quorum_notification
> pacemakerd: info: pcmk_cpg_membership: Node 739512332 joined
> group pacemakerd (counter=0.0, pid=32766, unchecked for rivals)
> stonith-ng: info: corosync_node_name:  Unable to get node
> name for nodeid 739512332
> stonith-ng:   notice: get_node_name:   Could not obtain a node name
> for corosync nodeid 739512332
> 
> What's that? The ID had been resolved before!

stonith-ng is a completely different process; each daemon has to figure
out the node information itself from what corosync gives it. You'll see
a lot of such messages repeated for each daemon that uses corosync.

> 
> stonith-ng: info: crm_get_peer:Created entry 155a30a0-ddd3-
> 4b31-9f76-46313ffa9824/0x1bff130 for node (null)/739512332 (1 total)
> stonith-ng: info: crm_get_peer:Node 739512332 has uuid
> 739512332
> ...
> stonith-ng:   notice: crm_update_peer_state_iter:  Node (null)
> state is now member | nodeid=739512332 previous=unknown
> source=crm_update_peer_proc
> ...
> attrd:   notice: get_node_name:   Could not obtain a node name for
> corosync nodeid 739512332
> attrd: info: crm_get_peer:Created entry 961e718f-ad71-479a-
> ae04-c2ec5ba29858/0x256ca40 for node (null)/739512332 (1 total)
> attrd: info: crm_get_peer:Node 739512332 has uuid 739512332
> attrd: info: crm_update_peer_proc:cluster_connect_cpg: Node
> (null)[739512332] - corosync-cpg is now online
> attrd:   notice: crm_update_peer_state_iter:  Node (null) state
> is now member | nodeid=739512332 previous=unknown
> source=crm_update_peer_proc
> ...
> pacemakerd:   notice: get_node_name:   Could not obtain a node name
> for corosync nodeid 739512331
> pacemakerd: info: pcmk_cpg_membership: Node 739512331 still
> member of group pacemakerd (peer=(null):7275, counter=0.0, at least
> once)
> stonith-ng:   notice: get_node_name:   Defaulting to uname -n for the
> local corosync node name
> ...
> pacemakerd: info: crm_get_peer:Node 739512331 is now known as
> h11
> ...
> attrd: info: corosync_node_name:  Unable to get node name for
> nodeid 739512332
> attrd:   notice: get_node_name:   Defaulting to uname -n for the
> local corosync node name
> attrd: info: crm_get_peer:Node 739512332 is now known as h12
> stonith-ng: info: corosync_node_name:  Unable to get node
> name for nodeid 739512332
> stonith-ng:   notice: get_node_name:   D

Re: [ClusterLabs] Q: "pengine[7280]: error: Characters left over after parsing '10#012': '#012'"

2019-08-22 Thread Ken Gaillot
On Thu, 2019-08-22 at 12:09 +0200, Jan Pokorný wrote:
> On 22/08/19 08:07 +0200, Ulrich Windl wrote:
> > When a second node joined a two-node cluster, I noticed the
> > following error message that leaves me kind of clueless:
> >  pengine[7280]:error: Characters left over after parsing
> > '10#012': '#012'
> > 
> > Where should I look for these characters?

The message is coming from pacemaker's function that scans an integer
from a string (usually user-provided). I'd check the CIB (especially
cluster properties) and /etc/sysconfig/pacemaker (or OS equivalent).

Octal 012 would be a newline/line feed character, so one possibility is
that whatever software was used to edit one of those files added an
encoding of it.
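
To illustrate the failure mode only (pacemaker is written in C and this is
not its actual parsing code, just a sketch of a strict integer scanner):

```python
def parse_int_strict(s):
    """Parse a leading integer from s and report leftover characters,
    roughly mimicking a strict scanner that refuses trailing garbage."""
    i = 0
    if i < len(s) and s[i] in "+-":   # optional sign
        i += 1
    start = i
    while i < len(s) and s[i].isdigit():
        i += 1
    if i == start:
        raise ValueError("no digits found in %r" % s)
    value = int(s[:i])
    leftover = s[i:]                  # anything after the digits
    return value, leftover

# A value that picked up a trailing newline (octal 012) parses as 10,
# but the scanner reports the "\n" as leftover.
value, leftover = parse_int_strict("10\n")
print(value, repr(leftover))
```

Syslog escapes non-printable characters octally, so a leftover "\n" in the
error message shows up as the literal "#012" seen above.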

> Given it's pengine related, one of the ideas is it's related to:
> 
> 
https://github.com/ClusterLabs/pacemaker/commit/9cf01f5f987b5cbe387c4e040ff5bfd6872eb0ad

I don't think so, or it would have the action name in it. Also, that
won't take effect until a cluster is entirely upgraded to a version
that supports it.

> Therefore it'd be nothing to try to tackle in the user-facing
> configuration, but some kind of internal confusion, perhaps stemming
> from mixing pacemaker versions within the cluster?
> 
> By any chance, do you have an interval of 12 seconds configured
> at any operation for any resource?
> 
> (The only other and unlikely possibility I can immediately see is
> having one of pe-*-series-max cluster options misconfigured.)
> 
> > The message was written after an announced resource move to the new
> > node.
-- 
Ken Gaillot 

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

[ClusterLabs] Antw: Re: Thoughts on crm shell

2019-08-22 Thread Ulrich Windl
>>> Andrei Borzenkov wrote on 2019-08-22 at 12:47 in
message <64e562db-3ece-3b4d-a793-896fcf0b3...@gmail.com>:
> On 22.08.2019 12:49, Ulrich Windl wrote:
>> Hi!
>> 
>> It's been a while since I used crm shell, and now after having moved from
>> SLES11 to SLES12 (having to use it again), I realized a few things:
>> 
>> 1) As the ptest command is crm_simulate now, shouldn't crm shell's ptest (in
>> configure) be accompanied by a "simulate" command as well (declaring "ptest"
>> as obsolete)?
>> 
>> 2) Some commands in the "resource" group actually manipulate the CIB
>> "configuration" section (and not the "status"). So why aren't those in
>> "configure", but in "resource"?
> 
> My educated guess is that crmsh "resource" maps directly to
> "crm_resource" ...

+1, but there is no crm_configure ;-)

> 
>> Examples: "utilization", "param", etc.
>> 
> 
> ... and these are simply wrappers around crm_resource functionality.
> 
>> The really bad thing is that crm shell insists on "commit" if you do an "up"
>> from "configure".
>> 
>> Also "configure" collects changes until commit, while "resource" commits
>> immediately. Maybe that could be a criterion for which commands should be in
>> "configure", and which should be in "resource".
>> 
> 
> 




Re: [ClusterLabs] Thoughts on crm shell

2019-08-22 Thread Andrei Borzenkov
On 22.08.2019 12:49, Ulrich Windl wrote:
> Hi!
> 
> It's been a while since I used crm shell, and now after having moved from 
> SLES11 to SLES12 (having to use it again), I realized a few things:
> 
> 1) As the ptest command is crm_simulate now, shouldn't crm shell's ptest (in 
> configure) be accompanied by a "simulate" command as well (declaring "ptest" 
> as obsolete)?
> 
> 2) Some commands in the "resource" group actually manipulate the CIB 
> "configuration" section (and not the "status"). So why aren't those in 
> "configure", but in "resource"?

My educated guess is that crmsh "resource" maps directly to
"crm_resource" ...

> Examples: "utilization", "param", etc.
> 

... and these are simply wrappers around crm_resource functionality.

> The really bad thing is that crm shell insists on "commit" if you do an "up" 
> from "configure".
> 
> Also "configure" collects changes until commit, while "resource" commits 
> immediately. Maybe that could be a criterion for which commands should be in 
> "configure", and which should be in "resource".
> 



Re: [ClusterLabs] node name issues (Could not obtain a node name for corosync nodeid 739512332)

2019-08-22 Thread Andrei Borzenkov
On 22.08.2019 10:07, Ulrich Windl wrote:
> Hi!
> 
> When starting pacemaker (1.1.19+20181105.ccd6b5b10-3.10.1) on a node that had 
> been down for a while, I noticed some unexpected messages about the node name:
> 
> pacemakerd:   notice: get_node_name:   Could not obtain a node name for 
> corosync nodeid 739512332

As far as I understand, this comes straight from corosync.conf, so if you
want to suppress them, set node names in your nodelist {} directive.
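
For illustration only, such a nodelist might look like the sketch below. The
node IDs and the names h11/h12 come from the logs in this thread; the ring
addresses are placeholders, so adjust everything to your own cluster:

```
nodelist {
    node {
        ring0_addr: 192.168.1.11   # placeholder address
        name: h11                  # explicit name, so daemons need not guess
        nodeid: 739512331
    }
    node {
        ring0_addr: 192.168.1.12   # placeholder address
        name: h12
        nodeid: 739512332
    }
}
```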

Re: [ClusterLabs] Q: "pengine[7280]: error: Characters left over after parsing '10#012': '#012'"

2019-08-22 Thread Jan Pokorný
On 22/08/19 08:07 +0200, Ulrich Windl wrote:
> When a second node joined a two-node cluster, I noticed the
> following error message that leaves me kind of clueless:
>  pengine[7280]:error: Characters left over after parsing '10#012': '#012'
> 
> Where should I look for these characters?

Given it's pengine related, one of the ideas is it's related to:

https://github.com/ClusterLabs/pacemaker/commit/9cf01f5f987b5cbe387c4e040ff5bfd6872eb0ad

Therefore it'd be nothing to try to tackle in the user-facing
configuration, but some kind of internal confusion, perhaps stemming
from mixing pacemaker versions within the cluster?

By any chance, do you have an interval of 12 seconds configured
at any operation for any resource?

(The only other and unlikely possibility I can immediately see is
having one of pe-*-series-max cluster options misconfigured.)

> The message was written after an announced resource move to the new
> node.

-- 
Poki



[ClusterLabs] Thoughts on crm shell

2019-08-22 Thread Ulrich Windl
Hi!

It's been a while since I used crm shell, and now after having moved from 
SLES11 to SLES12 (having to use it again), I realized a few things:

1) As the ptest command is crm_simulate now, shouldn't crm shell's ptest (in 
configure) be accompanied by a "simulate" command as well (declaring "ptest" as 
obsolete)?

2) Some commands in the "resource" group actually manipulate the CIB 
"configuration" section (and not the "status"). So why aren't those in 
"configure", but in "resource"?
Examples: "utilization", "param", etc.

The really bad thing is that crm shell insists on "commit" if you do an "up" 
from "configure".

Also "configure" collects changes until commit, while "resource" commits 
immediately. Maybe that could be a criterion for which commands should be in 
"configure", and which should be in "resource".

Regards,
Ulrich





[ClusterLabs] node name issues (Could not obtain a node name for corosync nodeid 739512332)

2019-08-22 Thread Ulrich Windl
Hi!

When starting pacemaker (1.1.19+20181105.ccd6b5b10-3.10.1) on a node that had 
been down for a while, I noticed some unexpected messages about the node name:

pacemakerd:   notice: get_node_name:   Could not obtain a node name for 
corosync nodeid 739512332
pacemakerd: info: crm_get_peer:Created entry 
a21bf687-045b-4fd7-9340-0562ef595883/0x18752f0 for node (null)/739512332 (1 
total)
pacemakerd: info: crm_get_peer:Node 739512332 has uuid 739512332

Seems UUID and node ID are mixed up in the message, at least...

pacemakerd: info: crm_update_peer_proc: cluster_connect_cpg: Node 
(null)[739512332] - corosync-cpg is now online
pacemakerd:   notice: cluster_connect_quorum: Quorum acquired
pacemakerd: info: corosync_node_name: Unable to get node name for nodeid 
739512332
pacemakerd:   notice: get_node_name:   Defaulting to uname -n for the local 
corosync node name
pacemakerd: info: crm_get_peer:Node 739512332 is now known as h12
...
pacemakerd: info: main:Starting mainloop
pacemakerd: info: pcmk_quorum_notification:Quorum retained | 
membership=172 members=2
pacemakerd: info: corosync_node_name:  Unable to get node name for 
nodeid 739512331
pacemakerd:   notice: get_node_name:   Could not obtain a node name for 
corosync nodeid 739512331
pacemakerd: info: crm_get_peer:Created entry 
f4ef35e4-1b49-4e48-916b-bb0fab7c52c9/0x1876820 for node (null)/739512331 (2 
total)
pacemakerd: info: crm_get_peer:Node 739512331 has uuid 739512331
...
pacemakerd: info: corosync_node_name:  Unable to get node name for 
nodeid 739512331
...
pacemakerd:   notice: get_node_name:   Could not obtain a node name for 
corosync nodeid 739512331
pacemakerd:   notice: crm_update_peer_state_iter:  Node (null) state is now 
member | nodeid=739512331 previous=unknown source=pcmk_quorum_notification
pacemakerd:   notice: crm_update_peer_state_iter:  Node 12 state is now 
member | nodeid=739512332 previous=unknown source=pcmk_quorum_notification
pacemakerd: info: pcmk_cpg_membership: Node 739512332 joined group 
pacemakerd (counter=0.0, pid=32766, unchecked for rivals)
stonith-ng: info: corosync_node_name:  Unable to get node name for 
nodeid 739512332
stonith-ng:   notice: get_node_name:   Could not obtain a node name for 
corosync nodeid 739512332

What's that? The ID had been resolved before!

stonith-ng: info: crm_get_peer:Created entry 
155a30a0-ddd3-4b31-9f76-46313ffa9824/0x1bff130 for node (null)/739512332 (1 
total)
stonith-ng: info: crm_get_peer:Node 739512332 has uuid 739512332
...
stonith-ng:   notice: crm_update_peer_state_iter:  Node (null) state is now 
member | nodeid=739512332 previous=unknown source=crm_update_peer_proc
...
attrd:   notice: get_node_name:   Could not obtain a node name for corosync 
nodeid 739512332
attrd: info: crm_get_peer:Created entry 
961e718f-ad71-479a-ae04-c2ec5ba29858/0x256ca40 for node (null)/739512332 (1 
total)
attrd: info: crm_get_peer:Node 739512332 has uuid 739512332
attrd: info: crm_update_peer_proc:cluster_connect_cpg: Node 
(null)[739512332] - corosync-cpg is now online
attrd:   notice: crm_update_peer_state_iter:  Node (null) state is now 
member | nodeid=739512332 previous=unknown source=crm_update_peer_proc
...
pacemakerd:   notice: get_node_name:   Could not obtain a node name for 
corosync nodeid 739512331
pacemakerd: info: pcmk_cpg_membership: Node 739512331 still member of 
group pacemakerd (peer=(null):7275, counter=0.0, at least once)
stonith-ng:   notice: get_node_name:   Defaulting to uname -n for the local 
corosync node name
...
pacemakerd: info: crm_get_peer:Node 739512331 is now known as h11
...
attrd: info: corosync_node_name:  Unable to get node name for nodeid 
739512332
attrd:   notice: get_node_name:   Defaulting to uname -n for the local corosync 
node name
attrd: info: crm_get_peer:Node 739512332 is now known as h12
stonith-ng: info: corosync_node_name:  Unable to get node name for 
nodeid 739512332
stonith-ng:   notice: get_node_name:   Defaulting to uname -n for the local 
corosync node name
stonith-ng: info: crm_get_peer:Node 739512332 is now known as h12
cib: info: corosync_node_name:  Unable to get node name for nodeid 
739512332
cib:   notice: get_node_name:   Could not obtain a node name for corosync 
nodeid 739512332
cib: info: crm_get_peer:Created entry 
287bf9d9-b9f7-44d5-997f-89fd3ee038de/0x24d2740 for node (null)/739512332 (1 
total)
cib: info: crm_get_peer:Node 739512332 has uuid 739512332
cib: info: crm_update_peer_proc:cluster_connect_cpg: Node 
(null)[739512332] - corosync-cpg is now online
cib:   notice: crm_update_peer_state_iter:  Node (null) state is now member 
| nodeid=739512332 previous=unknown source=crm_update_peer_proc
...

This doesn't look right to me.

cib: info: cib_init:Starting cib m