Re: [ClusterLabs] Pengine always trying to start the resource on the standby node.

2018-06-05 Thread Andrei Borzenkov
06.06.2018 04:27, Albert Weng wrote:
>  Hi All,
> 
> I have created an active/passive Pacemaker cluster on RHEL 7.
> 
> Here is my environment:
> clustera : 192.168.11.1 (passive)
> clusterb : 192.168.11.2 (master)
> clustera-ilo4 : 192.168.11.10
> clusterb-ilo4 : 192.168.11.11
> 
> cluster resource status :
>  cluster_fs       started on clusterb
>  cluster_vip      started on clusterb
>  cluster_sid      started on clusterb
>  cluster_listnr   started on clusterb
> 
> Both cluster nodes are in online status.
> 
> I found my corosync.log contains many records like the ones below:
> 
> clustera pengine: info: determine_online_status_fencing:
> Node clusterb is active
> clustera pengine: info: determine_online_status:    Node
> clusterb is online
> clustera pengine: info: determine_online_status_fencing:
> Node clustera is active
> clustera pengine: info: determine_online_status:    Node
> clustera is online
> 
> *clustera pengine:  warning: unpack_rsc_op_failure:  Processing
> failed op start for cluster_sid on clustera: unknown error (1)*
> *=> Question: Why is pengine always trying to start cluster_sid on the
> passive node? How do I fix it?*
> 

Pacemaker does not have a concept of a "passive" or "master" node - it is
up to you to decide placement when you configure resources. By default
Pacemaker will attempt to spread resources across all eligible nodes.
You can influence node selection by using constraints. See
https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_deciding_which_nodes_a_resource_can_run_on.html
for details.
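
For example, a location preference could be expressed like this (a hedged
sketch only; the score of 50 and the choice of clusterb are illustrative
values, not something your configuration requires):

#pcs constraint location cluster_sid prefers clusterb=50 --> mild preference, resource can still run on clustera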

But in any case - all your resources MUST be capable of running on both
nodes, otherwise the cluster makes no sense. If a resource A depends on
something that another resource B provides and can only be started
together with resource B (and after B is ready), you must tell Pacemaker
this by using colocation and ordering constraints. See the same document
for details.
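
A sketch of what that could look like with pcs (resource names taken from
your status output; since these resources are already in the group
"cluster", the group itself already implies this colocation and ordering,
so the explicit constraints below are only needed without the group):

#pcs constraint colocation add cluster_sid with cluster_fs INFINITY --> keep the database with its filesystem
#pcs constraint order cluster_fs then cluster_sid --> mount the filesystem before starting Oracle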

> clustera pengine: info: native_print:   ipmi-fence-clustera
> (stonith:fence_ipmilan): Started clustera
> clustera pengine: info: native_print:   ipmi-fence-clusterb
> (stonith:fence_ipmilan): Started clustera
> clustera pengine: info: group_print: Resource Group: cluster
> clustera pengine: info: native_print: cluster_fs
> (ocf::heartbeat:Filesystem): Started clusterb
> clustera pengine: info: native_print: cluster_vip
> (ocf::heartbeat:IPaddr2):   Started clusterb
> clustera pengine: info: native_print: cluster_sid
> (ocf::heartbeat:oracle): Started clusterb
> clustera pengine: info: native_print:
> cluster_listnr   (ocf::heartbeat:oralsnr):   Started clusterb
> clustera pengine: info: get_failcount_full: cluster_sid has
> failed INFINITY times on clustera
> 
> 
> *clustera pengine:  warning: common_apply_stickiness: Forcing
> cluster_sid away from clustera after 100 failures (max=100)*
> *=> Question: did too many failed attempts result in forbidding the resource
> from starting on clustera?*
> 

Yes.

> A couple of days ago, clusterb was fenced (STONITH) for an unknown reason, but
> only "cluster_fs" and "cluster_vip" moved to clustera successfully, while
> "cluster_sid" and "cluster_listnr" went to "Stopped" status.
> As in the messages below - is it related to "op start for cluster_sid on
> clustera..."?
> 

Yes. Node clustera is now marked as incapable of running the resource,
so if node clusterb fails, the resource cannot be started anywhere.
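
Once the underlying start problem on clustera is fixed, the recorded
failures also have to be cleared before that node is considered again;
a minimal sketch with pcs (resource name taken from your configuration):

#pcs resource failcount show cluster_sid --> confirm the INFINITY fail count on clustera
#pcs resource cleanup cluster_sid --> clear the failure history so clustera becomes eligible again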

> clustera pengine:  warning: unpack_rsc_op_failure:  Processing failed op
> start for cluster_sid on clustera: unknown error (1)
> clustera pengine: info: native_print:   ipmi-fence-clustera
> (stonith:fence_ipmilan): Started clustera
> clustera pengine: info: native_print:   ipmi-fence-clusterb
> (stonith:fence_ipmilan): Started clustera
> clustera pengine: info: group_print: Resource Group: cluster
> clustera pengine: info: native_print: cluster_fs
> (ocf::heartbeat:Filesystem): Started clusterb (UNCLEAN)
> clustera pengine: info: native_print: cluster_vip
> (ocf::heartbeat:IPaddr2):   Started clusterb (UNCLEAN)
> clustera pengine: info: native_print: cluster_sid
> (ocf::heartbeat:oracle): Started clusterb (UNCLEAN)
> clustera pengine: info: native_print: cluster_listnr
> (ocf::heartbeat:oralsnr):   Started clusterb (UNCLEAN)
> clustera pengine: info: get_failcount_full: cluster_sid has
> failed INFINITY times on clustera
> clustera pengine:  warning: common_apply_stickiness: Forcing
> cluster_sid away from clustera after 100 failures (max=100)
> clustera pengine: info: rsc_merge_weights:  cluster_fs: Rolling
> back scores from cluster_sid
> clustera pengine: info: rsc_merge_weights:  cluster_vip:

Re: [ClusterLabs] Resource-stickiness is not working

2018-06-05 Thread Ken Gaillot
On Wed, 2018-06-06 at 07:47 +0800, Confidential Company wrote:
> On Sat, 2018-06-02 at 22:14 +0800, Confidential Company wrote:
> > On Fri, 2018-06-01 at 22:58 +0800, Confidential Company wrote:
> > > Hi,
> > >
> > > I have a two-node active/passive setup. My goal is to fail over a
> > > resource once a node goes down, with as little downtime as possible.
> > > Based on my testing, when Node1 goes down it fails over to Node2. If
> > > Node1 comes back up after link reconnection (reconnecting the physical
> > > cable), the resource fails back to Node1 even though I configured
> > > resource-stickiness. Is there something wrong with the configuration
> > > below?
> > >
> > > #service firewalld stop
> > > #vi /etc/hosts --> 192.168.10.121 (Node1) / 192.168.10.122 (Node2)
> > >     ----- Private Network (Direct connect)
> > > #systemctl start pcsd.service
> > > #systemctl enable pcsd.service
> > > #passwd hacluster --> define pw
> > > #pcs cluster auth Node1 Node2
> > > #pcs cluster setup --name Cluster Node1 Node2
> > > #pcs cluster start --all
> > > #pcs property set stonith-enabled=false
> > > #pcs resource create ClusterIP ocf:heartbeat:IPaddr2
> > >     ip=192.168.10.123 cidr_netmask=32 op monitor interval=30s
> > > #pcs resource defaults resource-stickiness=100
> > >
> > > Regards,
> > > imnotarobot
> > 
> > Your configuration is correct, but keep in mind scores of all kinds
> > will be added together to determine where the final placement is.
> > 
> > In this case, I'd check that you don't have any constraints with a
> > higher score preferring the other node. For example, if you
> > previously
> > did a "move" or "ban" from the command line, that adds a constraint
> > that has to be removed manually if you no longer want it.
> > --
> > Ken Gaillot 
> > 
> > 
> > >>
> > I'm confused. A constraint, as I understand it, means there is a
> > preferred node. But is it possible for my resources not to have a
> > preferred node?
> > 
> > Regards,
> > imnotarobot
> 
> Yes, that's one type of constraint -- but you may not have realized
> you added one if you ran something like "pcs resource move", which is
> a way of saying there's a preferred node.
> 
> There are a variety of other constraints. For example, as you add
> more resources, you might say that resource A can't run on the same
> node as resource B, and if that constraint's score is higher than the
> stickiness, A might move if B starts on its node.
> 
> To see your existing constraints using pcs, run "pcs constraint
> show". If there are any you don't want, you can remove them with
> various pcs commands.
> -- 
> Ken Gaillot 
> 
> 
> >>
> Correct me if I'm wrong. So the resource-stickiness policy cannot be
> used alone. A constraint configuration should be set up in order to
> make it work, but it will also depend on the relative scores set up
> between the two. Can you suggest what type of constraint
> configuration I should set to achieve the simple goal above?

Not quite -- stickiness can be used alone. However, scores from all
sources are combined and compared when placing resources, so anything
else in the configuration that generates a score (like constraints)
will have an effect, if present.
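
If you want to see how those scores add up on a running cluster, one way
(assuming the crm_simulate utility shipped with Pacemaker is available) is:

#crm_simulate -sL --> show current allocation scores from the live CIB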

Looking at your test scenario again, I see the problem is the lack of
stonith, and has nothing to do with stickiness.

When you pull the cable, neither node can see the other. The isolated
node is still running the IP address, even though it can't do anything
with it. The failover node thinks it is the only node remaining, and
brings up the IP address there as well. This is a split-brain
situation.

When you reconnect the cable, the nodes can see each other again, and
*both are already running the IP*. The cluster detects this, and stops
the IP on both nodes, and brings it up again on one node. Since the IP
is not running at that point, stickiness doesn't come into play.

If stonith were configured, one of the two nodes would kill the other,
so only one would be running the IP at any time. If the dead node came
back up and rejoined, it would not be running the IP, and stickiness
would keep the IP where it was.

Which node kills the other is a bit tricky in a two-node situation. If
you're interested mainly in IP availability, you can use
fence_heuristic_ping to keep a node with a nonfunctioning network from
killing the other. Another possibility is to use qdevice on a third
node as a tie-breaker.
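
A hedged sketch of the qdevice option, assuming corosync-qnetd runs on a
third host (the hostname below is a placeholder) and corosync-qdevice is
installed on both cluster nodes:

#pcs quorum device add model net host=qnetd-host algorithm=ffsplit --> third-party tie-breaker for the two nodes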

In any case, stonith is how to avoid a split-brain situation.
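
For completeness, a minimal sketch of enabling stonith with fence_ipmilan
(IP addresses and credentials below are placeholders, not values from your
setup):

#pcs stonith create fence-node1 fence_ipmilan ipaddr=192.168.10.131 login=admin passwd=secret pcmk_host_list=Node1
#pcs stonith create fence-node2 fence_ipmilan ipaddr=192.168.10.132 login=admin passwd=secret pcmk_host_list=Node2
#pcs property set stonith-enabled=true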

> 
> Regards,
> imnotarobot
> 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pengine always trying to start the resource on the standby node.

2018-06-05 Thread Albert Weng
 Hi All,

I have created an active/passive Pacemaker cluster on RHEL 7.

Here is my environment:
clustera : 192.168.11.1 (passive)
clusterb : 192.168.11.2 (master)
clustera-ilo4 : 192.168.11.10
clusterb-ilo4 : 192.168.11.11

cluster resource status :
 cluster_fs       started on clusterb
 cluster_vip      started on clusterb
 cluster_sid      started on clusterb
 cluster_listnr   started on clusterb

Both cluster nodes are in online status.

I found my corosync.log contains many records like the ones below:

clustera pengine: info: determine_online_status_fencing:
Node clusterb is active
clustera pengine: info: determine_online_status:    Node
clusterb is online
clustera pengine: info: determine_online_status_fencing:
Node clustera is active
clustera pengine: info: determine_online_status:    Node
clustera is online

*clustera pengine:  warning: unpack_rsc_op_failure:  Processing
failed op start for cluster_sid on clustera: unknown error (1)*
*=> Question: Why is pengine always trying to start cluster_sid on the
passive node? How do I fix it?*

clustera pengine: info: native_print:   ipmi-fence-clustera
(stonith:fence_ipmilan): Started clustera
clustera pengine: info: native_print:   ipmi-fence-clusterb
(stonith:fence_ipmilan): Started clustera
clustera pengine: info: group_print: Resource Group: cluster
clustera pengine: info: native_print: cluster_fs
(ocf::heartbeat:Filesystem): Started clusterb
clustera pengine: info: native_print: cluster_vip
(ocf::heartbeat:IPaddr2):   Started clusterb
clustera pengine: info: native_print: cluster_sid
(ocf::heartbeat:oracle): Started clusterb
clustera pengine: info: native_print:
cluster_listnr   (ocf::heartbeat:oralsnr):   Started clusterb
clustera pengine: info: get_failcount_full: cluster_sid has
failed INFINITY times on clustera


*clustera pengine:  warning: common_apply_stickiness: Forcing
cluster_sid away from clustera after 100 failures (max=100)*
*=> Question: did too many failed attempts result in forbidding the resource
from starting on clustera?*

A couple of days ago, clusterb was fenced (STONITH) for an unknown reason, but
only "cluster_fs" and "cluster_vip" moved to clustera successfully, while
"cluster_sid" and "cluster_listnr" went to "Stopped" status.
As in the messages below - is it related to "op start for cluster_sid on
clustera..."?

clustera pengine:  warning: unpack_rsc_op_failure:  Processing failed op
start for cluster_sid on clustera: unknown error (1)
clustera pengine: info: native_print:   ipmi-fence-clustera
(stonith:fence_ipmilan): Started clustera
clustera pengine: info: native_print:   ipmi-fence-clusterb
(stonith:fence_ipmilan): Started clustera
clustera pengine: info: group_print: Resource Group: cluster
clustera pengine: info: native_print: cluster_fs
(ocf::heartbeat:Filesystem): Started clusterb (UNCLEAN)
clustera pengine: info: native_print: cluster_vip
(ocf::heartbeat:IPaddr2):   Started clusterb (UNCLEAN)
clustera pengine: info: native_print: cluster_sid
(ocf::heartbeat:oracle): Started clusterb (UNCLEAN)
clustera pengine: info: native_print: cluster_listnr
(ocf::heartbeat:oralsnr):   Started clusterb (UNCLEAN)
clustera pengine: info: get_failcount_full: cluster_sid has
failed INFINITY times on clustera
clustera pengine:  warning: common_apply_stickiness: Forcing
cluster_sid away from clustera after 100 failures (max=100)
clustera pengine: info: rsc_merge_weights:  cluster_fs: Rolling
back scores from cluster_sid
clustera pengine: info: rsc_merge_weights:  cluster_vip: Rolling
back scores from cluster_sid
clustera pengine: info: rsc_merge_weights:  cluster_sid: Rolling
back scores from cluster_listnr
clustera pengine: info: native_color:   Resource cluster_sid cannot
run anywhere
clustera pengine: info: native_color:   Resource cluster_listnr
cannot run anywhere
clustera pengine:  warning: custom_action:  Action cluster_fs_stop_0 on
clusterb is unrunnable (offline)
clustera pengine: info: RecurringOp: Start recurring monitor
(20s) for cluster_fs on clustera
clustera pengine:  warning: custom_action:  Action cluster_vip_stop_0 on
clusterb is unrunnable (offline)
clustera pengine: info: RecurringOp: Start recurring monitor
(10s) for cluster_vip on clustera
clustera pengine:  warning: custom_action:  Action cluster_sid_stop_0 on
clusterb is unrunnable (offline)
clustera pengine:  warning: custom_action:  Action cluster_sid_stop_0 on
clusterb is unrunnable (offline)
clustera pengine:  warning: custom_action:  Action cluster_listnr_stop_0
on clusterb is unrunnable (offline)
clustera pengine:  warning: custom_action:  Ac

Re: [ClusterLabs] Resource-stickiness is not working

2018-06-05 Thread Confidential Company
On Sat, 2018-06-02 at 22:14 +0800, Confidential Company wrote:
> On Fri, 2018-06-01 at 22:58 +0800, Confidential Company wrote:
> > Hi,
> >
> > I have a two-node active/passive setup. My goal is to fail over a
> > resource once a node goes down, with as little downtime as possible.
> > Based on my testing, when Node1 goes down it fails over to Node2. If
> > Node1 comes back up after link reconnection (reconnecting the physical
> > cable), the resource fails back to Node1 even though I configured
> > resource-stickiness. Is there something wrong with the configuration
> > below?
> >
> > #service firewalld stop
> > #vi /etc/hosts --> 192.168.10.121 (Node1) / 192.168.10.122 (Node2)
> >     ----- Private Network (Direct connect)
> > #systemctl start pcsd.service
> > #systemctl enable pcsd.service
> > #passwd hacluster --> define pw
> > #pcs cluster auth Node1 Node2
> > #pcs cluster setup --name Cluster Node1 Node2
> > #pcs cluster start --all
> > #pcs property set stonith-enabled=false
> > #pcs resource create ClusterIP ocf:heartbeat:IPaddr2
> >     ip=192.168.10.123 cidr_netmask=32 op monitor interval=30s
> > #pcs resource defaults resource-stickiness=100
> >
> > Regards,
> > imnotarobot
>
> Your configuration is correct, but keep in mind scores of all kinds
> will be added together to determine where the final placement is.
>
> In this case, I'd check that you don't have any constraints with a
> higher score preferring the other node. For example, if you
> previously
> did a "move" or "ban" from the command line, that adds a constraint
> that has to be removed manually if you no longer want it.
> --
> Ken Gaillot 
>
>
> >>
> I'm confused. A constraint, as I understand it, means there is a
> preferred node. But is it possible for my resources not to have a
> preferred node?
>
> Regards,
> imnotarobot

Yes, that's one type of constraint -- but you may not have realized you
added one if you ran something like "pcs resource move", which is a way
of saying there's a preferred node.

There are a variety of other constraints. For example, as you add more
resources, you might say that resource A can't run on the same node as
resource B, and if that constraint's score is higher than the
stickiness, A might move if B starts on its node.

To see your existing constraints using pcs, run "pcs constraint show".
If there are any you don't want, you can remove them with various pcs
commands.
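
For example (constraint IDs vary per cluster, so the ID below is a
placeholder; the "clear" form applies only to constraints created by
move/ban):

#pcs constraint show --full --> list constraints together with their IDs
#pcs constraint remove <constraint-id> --> remove an unwanted constraint by ID
#pcs resource clear ClusterIP --> drop any move/ban constraints for a resource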
-- 
Ken Gaillot 


>>
Correct me if I'm wrong. So the resource-stickiness policy cannot be used
alone. A constraint configuration should be set up in order to make it work,
but it will also depend on the relative scores set up between the two. Can you
suggest what type of constraint configuration I should set to achieve the
simple goal above?

Regards,
imnotarobot
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [questionnaire] Do you manage your pacemaker configuration by hand and (if so) what reusability features do you use?

2018-06-05 Thread Jan Pokorný
On 31/05/18 14:48 +0200, Jan Pokorný wrote:
> I am soliciting feedback on these CIB-feature-related questions,
> please reply (preferably on-list so we have the shared collective
> knowledge) if at least one of the questions is answered positively
> in your case (just tick the respective "[ ]" boxes as "[x]").

I am not sure how to interpret no feedback so far -- does it mean
that those features are indeed used only very sparsely, or is the
questionnaire not as welcoming as it could be?  This is definitely
not the last time the userbase's feedback will be of help, so the more
pleasant we can make such enquiries, the better the turnaround, I guess.

> Any other commentary also welcome -- thank you in advance.
> 
> 1.  [ ] Do you edit CIB by hand (as opposed to relying on crm/pcs or
> their UI counterparts)?

Putting seriousness aside for a bit, there's a relevant anecdotal
reference directly from pacemaker's own codebase:
https://github.com/ClusterLabs/pacemaker/blob/Pacemaker-2.0.0-rc5/Makefile.common#L54
:-)

As Ken noted, crm shell may support all of 2. + 3. + 4.; it was
just my suggestion that those could be especially handy with direct
XML-level control, as shifting towards more abstract thinking
about the configuration may actually conflict with the goal of
straightforward conceptual comprehension, at least in the case of 3.

> 2.  [ ] Do you use "template" based syntactic simplification[1] in CIB?
> 
> 3.  [ ] Do you use "id-ref" based syntactic simplification[2] in CIB?
> 
> 3.1 [ ] When positive about 3., would you mind much if "id-refs" got
> unfold/exploded during the "cibadmin --upgrade --force"
> equivalent as a reliability/safety precaution?

This was a premature worst-case conclusion on my end (generally,
I think it's better to start pessimistically only to be pleased
later on, rather than vice versa).  In fact, there's nothing that
would prevent the reversibility of a temporary, limited-scope
unfolding of "id-refs", in an unfold-upgrade-refold manner; sorry for the noise
(https://github.com/ClusterLabs/pacemaker/pull/1500).

However, you can also take this as a discussion-worthy probe into how
mere _syntactic_ changes that do not affect the behaviour (i.e. the
semantics encoded by either syntactic expression) at all would be
perceived.  In this now merely theoretical case, parts of the
information that only have a bearing on the user's comprehension would
be lost (multiple duplicate entities as opposed to a shared single
point of control), and the question hence is:

How much frustration could arise from such semantics-preserving
interventions inflicted by schema upgrades or elsewhere?
Is this something we should avoid at all costs so as not to
alienate even a single user, or is there some degree of
tolerance as long as you can hardly tell the difference in
higher-level tools?

> 4.  [ ] Do you use "tag" based syntactic grouping[3] in CIB?

The original questions are still valid, feel free to respond
to them or to the new bunch at your convenience.  It will help
to shape future directions for pacemaker.

-- 
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org