Re: [ClusterLabs] DRBD demote/promote not called - Why? How to fix?

2016-11-10 Thread Ken Gaillot
On 11/09/2016 12:27 PM, CART Andreas wrote:
> Hi again
> 
>  
> 
> Sorry for missing the omission of the master role within the colocation
> constraint.
> 
> I  added it  - but unfortunately still no success.
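For reference, a colocation that ties DRBD_global_clst to the DRBD master
role would look roughly like this in pcs syntax (a sketch; the INFINITY
score is an assumption):

    pcs constraint colocation add DRBD_global_clst with master DRBDClone INFINITY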
> 
>  
> 
> (In the meantime I added 2 additional filesystem resources on top of the
> NFSServer, but that should not change anything regarding the root
> problem that the demote of DRBDClone is missing.)
> 
>  
> 
> I again started with all resources located at ventsi-clst1 and issued a
> 'pcs resource move DRBD_global_clst' (the resource colocated next
> to the DRBDClone).
> 
>  
> 
> With that I end up with all primitive resources stopped and the
> DRBDClone resource still being master at ventsi-clst1.
> 
> Here is what pacemaker claims has to be done:
> 
> ==
> 
> [root@ventsi-clst2 ~]# crm_simulate -Ls
> 
>  
> 
> Current cluster status:
> 
> Online: [ ventsi-clst1-sync ventsi-clst2-sync ]
> 
>  
> 
> ipmi-fence-clst1   (stonith:fence_ipmilan):Started
> ventsi-clst2-sync
> 
> ipmi-fence-clst2   (stonith:fence_ipmilan):Started
> ventsi-clst1-sync
> 
> IPaddrNFS  (ocf::heartbeat:IPaddr2):   Stopped
> 
> NFSServer  (ocf::heartbeat:nfsserver): Stopped
> 
> Master/Slave Set: DRBDClone [DRBD]
> 
>  Masters: [ ventsi-clst1-sync ]   <=== still not demoted
> 
>  Slaves: [ ventsi-clst2-sync ]
> 
> DRBD_global_clst   (ocf::heartbeat:Filesystem):Stopped
> 
> NFS_global_clst(ocf::heartbeat:Filesystem):Stopped
> 
> BIND_global_clst   (ocf::heartbeat:Filesystem):Stopped
> 
>  
> 
> Allocation scores:
> 
> native_color: ipmi-fence-clst1 allocation score on ventsi-clst1-sync:
> -INFINITY
> 
> native_color: ipmi-fence-clst1 allocation score on ventsi-clst2-sync:
> INFINITY
> 
> native_color: ipmi-fence-clst2 allocation score on ventsi-clst1-sync:
> INFINITY
> 
> native_color: ipmi-fence-clst2 allocation score on ventsi-clst2-sync:
> -INFINITY
> 
> clone_color: DRBDClone allocation score on ventsi-clst1-sync: 0
> 
> clone_color: DRBDClone allocation score on ventsi-clst2-sync: 0
> 
> clone_color: DRBD:0 allocation score on ventsi-clst1-sync: INFINITY
> 
> clone_color: DRBD:0 allocation score on ventsi-clst2-sync: 0
> 
> clone_color: DRBD:1 allocation score on ventsi-clst1-sync: 0
> 
> clone_color: DRBD:1 allocation score on ventsi-clst2-sync: INFINITY
> 
> native_color: DRBD:0 allocation score on ventsi-clst1-sync: INFINITY
> 
> native_color: DRBD:0 allocation score on ventsi-clst2-sync: 0
> 
> native_color: DRBD:1 allocation score on ventsi-clst1-sync: -INFINITY
> 
> native_color: DRBD:1 allocation score on ventsi-clst2-sync: INFINITY
> 
> DRBD:1 promotion score on ventsi-clst2-sync: 1
> 
> DRBD:0 promotion score on ventsi-clst1-sync: 1
> 
> native_color: DRBD_global_clst allocation score on ventsi-clst1-sync:
> -INFINITY
> 
> native_color: DRBD_global_clst allocation score on ventsi-clst2-sync:
> INFINITY
> 
> native_color: IPaddrNFS allocation score on ventsi-clst1-sync: -INFINITY
> 
> native_color: IPaddrNFS allocation score on ventsi-clst2-sync: 0
> 
> native_color: NFSServer allocation score on ventsi-clst1-sync: -INFINITY
> 
> native_color: NFSServer allocation score on ventsi-clst2-sync: 0
> 
> native_color: NFS_global_clst allocation score on ventsi-clst1-sync: 0
> 
> native_color: NFS_global_clst allocation score on ventsi-clst2-sync:
> -INFINITY
> 
> native_color: BIND_global_clst allocation score on ventsi-clst1-sync:
> -INFINITY
> 
> native_color: BIND_global_clst allocation score on ventsi-clst2-sync: 0
> 
>  
> 
> Transition Summary:
> 
> * Start   IPaddrNFS(ventsi-clst2-sync)
> 
> * Start   NFSServer(ventsi-clst2-sync)
> 
> * Demote  DRBD:0   (Master -> Slave ventsi-clst1-sync)   <=== this demote never happens
> 
> * Promote DRBD:1   (Slave -> Master ventsi-clst2-sync)
> 
> * Start   DRBD_global_clst (ventsi-clst2-sync)
> 
> * Start   NFS_global_clst  (ventsi-clst1-sync)
> 
> * Start   BIND_global_clst (ventsi-clst2-sync)

Strangely, this sequence appears to be ignoring the constraint "start
DRBD_global_clst then start IPaddrNFS".

Can you open a bug report at http://bugs.clusterlabs.org/ and attach the
CIB (or pe-input file) in use at this time?

For testing purposes, you may want to try replacing the "start
DRBD_global_clst then start IPaddrNFS" constraint with "promote
DRBDClone then start IPaddrNFS" to see whether that makes a difference.
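
A rough sketch of that change with pcs (the constraint id below is only an
example - list the real one with "pcs constraint --full" first):

    pcs constraint --full                      # find the id of the old ordering constraint
    pcs constraint remove order-DRBD_global_clst-IPaddrNFS-mandatory
    pcs constraint order promote DRBDClone then start IPaddrNFS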

> And this is the executed transition:
> 
> ==
> 
> [root@ventsi-clst2 ~]# crm_simulate --xml-file
> /var/lib/pacemaker/pengine/pe-input-1157.bz2 --save-graph problem5.graph
> --save-dotfile problem5.dot -V --simulate
> 
> Using the original execution date of: 2016-11-09 17:54:10Z
> 
>  
> 
> Current cluster status:
> 
> Online: [ ventsi-clst1-sync ventsi-clst2-sync ]
> 
>  
> 
> ipmi-fence-clst1   

Re: [ClusterLabs] Antw: Re: How Pacemaker reacts to fast changes of the same parameter in configuration

2016-11-10 Thread Klaus Wenninger
On 11/10/2016 11:34 AM, Kostiantyn Ponomarenko wrote:
> Ulrich Windl,
>
> >> You want your resources to move to their preferred location after
> some problem.
> It is not about that. It is about - I want to control when fail-back
> happens. And I want to be sure that I have full control over it all
> the time.
>
> Klaus Wenninger,
>
> You are right. That is exactly what I want and what I am concerned
> about. Another example with "move" operation is 100% correct.
>
> I've been thinking about another possible approach here since
> yesterday and I've got an idea which actually seems to satisfy my needs.
> At least till a proper solution is available.
> My set-up is a two node cluster.
> I will modify my script to:
>
> 1. issue a command to lower "resource-stickiness" on the local
> node;
> 2. on the other node to trigger a script which waits for cluster
> to finish all transactions (crm_resource --wait) and set
> "resource-stickiness" back to its original value;
> 3. on this node wait for cluster to finish all transactions
> (crm_resource --wait) and set "resource-stickiness" back to its
> original value;
>
> This way I can be sure to have back the original value of
> "resource-stickiness" immediately after fail-back.
> Though, I am still thinking about the best way of how a local script
> can trigger the script on the other node and passing an argument to it.
> If any thoughts, I would like to hear =)
>
>
> I also was thinking about more general approach to it.
> Maybe it is time for higher level cluster configuration tools to
> evolve to provide this robustness?
> So that they can take a sequence of commands and guarantee that they
> will be executed in a predicted order even if a node on which this
> sequence was initiated goes down.

Yep, either that or - especially for things where the success of your
CIB modification is very special to your cluster - you script it.
But in either case the high-level tool or your script can fail, the node
it is running on can be fenced, or whatever else you can think of ...
So I wanted to think about simple, not very invasive things that could
be done within the core of pacemaker to enable a predictable fallback
in such cases.

>
> Or maybe pacemaker can expand its functionality to handle a command
> sequence?
>
> Or this special tagging which you mentioned. Could you please
> elaborate on this one as I am curious how it should work?

That is what the high-level tools are doing at the moment. You can
recognize the constraints they have created by their names (prefix).
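
For example, the location constraints that "move"/"ban" leave behind carry
a "cli-" prefix and can be listed and cleared like this (a sketch, assuming
pcs; the resource name is made up):

    pcs constraint --full              # move/ban constraints show up as cli-prefer-... / cli-ban-...
    pcs resource clear my_resource     # drops them again once the move is done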

>
> >> some mechanism that makes the constraints somehow magically
> disappear or disabled when they have achieved what they were intended to.
> You mean something like time based constraints, but instead of
> duration they are event based?

Something in that direction, yes ...

>
> Thank you,
> Kostia
>
> On Thu, Nov 10, 2016 at 11:17 AM, Klaus Wenninger wrote:
>
> On 11/10/2016 08:27 AM, Ulrich Windl wrote:
> >>> Klaus Wenninger wrote on 09.11.2016 at 17:42 in message <80c65564-b299-e504-4c6c-afd0ff86e...@redhat.com>:
> >> On 11/09/2016 05:30 PM, Kostiantyn Ponomarenko wrote:
> >>> When one problem seems to be solved, another one appears.
> >>> Now my script looks this way:
> >>>
> >>> crm --wait configure rsc_defaults resource-stickiness=50
> >>> crm configure rsc_defaults resource-stickiness=150
> >>>
> >>> While now I am sure that transactions caused by the first command
> >>> won't be aborted, I see another possible problem here.
> >>> With a minimum load in the cluster it took 22 sec for this
> script to
> >>> finish.
> >>> I see here a weakness.
> >>> If a node on which this script is called goes down for any
> reasons,
> >>> then "resource-stickiness" is not set back to its original value,
> >>> which is very bad.
> > I don't quite understand: You want your resources to move to
> their preferred location after some problem. When the node goes
> down with the lower stickiness, there is no problem while the
> other node is down; when it comes up, resources might be moved,
> but isn't that what you wanted?
>
> I guess this is about the general problem with features like e.g.
> 'move'
> as well
> that are so much against how pacemaker is working.
> They are implemented inside the high-level-tooling.
> They are temporarily modifying the CIB and if something happens
> that makes
> this controlling high-level-tool go away it stays as is - or the CIB
> even stays
> modified and the user has to know that he has to do a manual cleanup.
> So we could actually derive a general discussion from that how to
> handle
> these issues in a way that it is less likely to 

Re: [ClusterLabs] Antw: Re: How Pacemaker reacts to fast changes of the same parameter in configuration

2016-11-10 Thread Kostiantyn Ponomarenko
Ulrich Windl,

>> You want your resources to move to their preferred location after some
problem.
It is not about that. It is about - I want to control when fail-back
happens. And I want to be sure that I have full control over it all the
time.

Klaus Wenninger,

You are right. That is exactly what I want and what I am concerned about.
Your other example with the "move" operation is 100% correct.

I've been thinking about another possible approach here since yesterday and
I've got an idea which actually seems to satisfy my needs.
At least till a proper solution is available.
My set-up is a two node cluster.
I will modify my script to:

1. issue a command to lower "resource-stickiness" on the local node;
2. on the other node to trigger a script which waits for cluster to
finish all transactions (crm_resource --wait) and set "resource-stickiness"
back to its original value;
3. on this node wait for cluster to finish all transactions (crm_resource
--wait) and set "resource-stickiness" back to its original value;

This way I can be sure to have the original value of "resource-stickiness"
back immediately after fail-back.
Though, I am still thinking about the best way for a local script to
trigger the script on the other node and pass an argument to it.
If you have any thoughts, I would like to hear them =)
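
A minimal sketch of that sequence, assuming crmsh, passwordless SSH, and a
peer reachable as "node2" (the hostname is an assumption; the values 50/150
are the ones used elsewhere in this thread):

    #!/bin/sh
    # 1. lower stickiness locally so the fail-back can happen
    crm configure rsc_defaults resource-stickiness=50
    # 2. arm the restore on the peer in case this node goes down half-way
    ssh node2 'crm_resource --wait && crm configure rsc_defaults resource-stickiness=150' &
    # 3. wait locally until all transitions are finished, then restore
    crm_resource --wait
    crm configure rsc_defaults resource-stickiness=150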


I was also thinking about a more general approach to it.
Maybe it is time for higher-level cluster configuration tools to evolve to
provide this robustness?
So that they can take a sequence of commands and guarantee that they will
be executed in a predictable order even if the node on which this sequence
was initiated goes down.

Or maybe pacemaker can expand its functionality to handle a command
sequence?

Or this special tagging which you mentioned. Could you please elaborate on
this one as I am curious how it should work?

>> some mechanism that makes the constraints somehow magically disappear or
disabled when they have achieved what they were intended to.
You mean something like time based constraints, but instead of duration
they are event based?

Thank you,
Kostia

On Thu, Nov 10, 2016 at 11:17 AM, Klaus Wenninger wrote:

> On 11/10/2016 08:27 AM, Ulrich Windl wrote:
> >>> Klaus Wenninger wrote on 09.11.2016 at 17:42 in message <80c65564-b299-e504-4c6c-afd0ff86e...@redhat.com>:
> >> On 11/09/2016 05:30 PM, Kostiantyn Ponomarenko wrote:
> >>> When one problem seems to be solved, another one appears.
> >>> Now my script looks this way:
> >>>
> >>> crm --wait configure rsc_defaults resource-stickiness=50
> >>> crm configure rsc_defaults resource-stickiness=150
> >>>
> >>> While now I am sure that transactions caused by the first command
> >>> won't be aborted, I see another possible problem here.
> >>> With a minimum load in the cluster it took 22 sec for this script to
> >>> finish.
> >>> I see here a weakness.
> >>> If a node on which this script is called goes down for any reasons,
> >>> then "resource-stickiness" is not set back to its original value,
> >>> which is very bad.
> > I don't quite understand: You want your resources to move to their
> preferred location after some problem. When the node goes down with the
> lower stickiness, there is no problem while the other node is down; when it
> comes up, resources might be moved, but isn't that what you wanted?
>
> I guess this is about the general problem with features like e.g. 'move'
> as well
> that are so much against how pacemaker is working.
> They are implemented inside the high-level-tooling.
> They are temporarily modifying the CIB and if something happens that makes
> this controlling high-level-tool go away it stays as is - or the CIB
> even stays
> modified and the user has to know that he has to do a manual cleanup.
> So we could actually derive a general discussion from that how to handle
> these issues in a way that it is less likely to have artefacts persist
> after
> some administrative action.
> At the moment e.g. special tagging for the constraints that are
> automatically
> created to trigger a move  is one approach.
> But when would you issue an automatized cleanup? Is there anything
> implemented in high-level-tooling? pcsd I guess would be a candidate, for
> crmsh I don't know of a persistent instance that could take care of that
> ...
>
> If we say we won't implement these features in the core of pacemaker
> I definitely agree. But is there anything we could do to make it easier
> for high-level-tools?
> I'm thinking of some mechanism that makes the constraints somehow
> magically disappear or disabled when they have achieved what they
> were intended to, if the connection to some administrative-shell is
> lost, or ...
> I could imagine dependency on some token given to a shell, something
> like a suicide-timeout, ...
> Maybe the usual habit when configuring a switch/router can trigger
> some ideas: issue a reboot in x minutes; do a non persistent config-change;

Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie

2016-11-10 Thread Klaus Wenninger
On 11/10/2016 09:47 AM, Toni Tschampke wrote:
>> Did your upgrade documentation describe how to update the corosync
>> configuration, and did that go well? crmd may be unable to function due
>> to lack of quorum information.
>
> Thanks for this tip, corosync quorum configuration was the cause.
>
> As we changed validate-with as well as the feature set manually in the
> cib, is there a need for issuing the cibadmin --upgrade --force
> command or is this command just for changing the schemas?
>

I would guess no, as this would just do automatically (to the latest
version) what you've already done manually.
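
For reference, regarding the quorum configuration that turned out to be the
cause: the quorum section corosync 2 expects looks roughly like this for a
two-node cluster (a sketch - adjust to your setup):

    quorum {
        provider: corosync_votequorum
        two_node: 1
    }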

> -- 
> Kind regards
>
> Toni Tschampke | t...@halle.it
> bcs kommunikationslösungen
> Inh. Dipl. Ing. Carsten Burkhardt
> Harz 51 | 06108 Halle (Saale) | Germany
> tel +49 345 29849-0 | fax +49 345 29849-22
> www.b-c-s.de | www.halle.it | www.wivewa.de
>
>
> SIMPLY MANAGE ADDRESSES, PHONE CALLS AND DOCUMENTS - WITH WIVEWA -
> YOUR KNOWLEDGE MANAGER FOR YOUR BUSINESS!
>
> Further information is available at www.wivewa.de
>
> Am 08.11.2016 um 22:51 schrieb Ken Gaillot:
>> On 11/07/2016 09:08 AM, Toni Tschampke wrote:
>>> We managed to change the validate-with option via workaround (cibadmin
>>> export & replace) as setting the value with cibadmin --modify doesn't
>>> write the changes to disk.
>>>
>>> After experimenting with various schemes (xml is correctly interpreted
>>> by crmsh) we are still not able to communicate with local crmd.
>>>
>>> Can someone please help to determine why the local crmd is not
>>> responding (we disabled our other nodes to eliminate possible corosync
>>> related issues) and runs into errors/timeouts when issuing crmsh or
>>> cibadmin related commands.
>>
>> It occurs to me that wheezy used corosync 1. There were major changes
>> from corosync 1 to 2 ... 1 relied on a "plugin" to provide quorum for
>> pacemaker, whereas 2 has quorum built-in.
>>
>> Did your upgrade documentation describe how to update the corosync
>> configuration, and did that go well? crmd may be unable to function due
>> to lack of quorum information.
>>
>>> examples for not working local commands
>>>
>>> timeout when running cibadmin: (strace attachment)
 cibadmin --upgrade --force
 Call cib_upgrade failed (-62): Timer expired
>>>
>>> error when running a crm resource cleanup
 crm resource cleanup $vm
 Error signing on to the CRMd service
 Error performing operation: Transport endpoint is not connected
>>>
>>> I attached the strace log from running cib_upgrade, does this help to
>>> find the cause of the timeout issue?
>>>
>>> Here is the corosync dump when locally starting pacemaker:
>>>
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [MAIN  ] main.c:1256
 Corosync Cluster Engine ('2.3.6'): started and ready to provide
 service.
 Nov 07 16:01:59 [24339] nebel1 corosync info[MAIN  ] main.c:1257
 Corosync built-in features: dbus rdma monitoring watchdog augeas
 systemd upstart xmlconf qdevices snmp pie relro bindnow
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
 totemnet.c:248 Initializing transport (UDP/IP Multicast).
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
 totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
 none hash: none
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
 totemnet.c:248 Initializing transport (UDP/IP Multicast).
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
 totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
 none hash: none
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
 totemudp.c:671 The network interface [10.112.0.1] is now up.
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
 Service engine loaded: corosync configuration map access [0]
 Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
 ipc_setup.c:536 server name: cmap
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
 Service engine loaded: corosync configuration service [1]
 Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
 ipc_setup.c:536 server name: cfg
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
 Service engine loaded: corosync cluster closed process group service
 v1.01 [2]
 Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
 ipc_setup.c:536 server name: cpg
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
 Service engine loaded: corosync profile loading service [4]
 Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
 Service engine loaded: corosync resource monitoring service [6]
 Nov 07 16:01:59 [24339] nebel1 corosync info[WD] wd.c:669
 Watchdog /dev/watchdog is now been tickled by corosync.
 Nov 07 16:01:59 [24339] nebel1 corosync warning [WD] wd.c:625
 Could not change the Watchdog 

Re: [ClusterLabs] Antw: Re: How Pacemaker reacts to fast changes of the same parameter in configuration

2016-11-10 Thread Klaus Wenninger
On 11/10/2016 08:27 AM, Ulrich Windl wrote:
>>> Klaus Wenninger wrote on 09.11.2016 at 17:42 in message <80c65564-b299-e504-4c6c-afd0ff86e...@redhat.com>:
>> On 11/09/2016 05:30 PM, Kostiantyn Ponomarenko wrote:
>>> When one problem seems to be solved, another one appears.
>>> Now my script looks this way:
>>>
>>> crm --wait configure rsc_defaults resource-stickiness=50
>>> crm configure rsc_defaults resource-stickiness=150
>>>
>>> While now I am sure that transactions caused by the first command
>>> won't be aborted, I see another possible problem here.
>>> With a minimum load in the cluster it took 22 sec for this script to
>>> finish. 
>>> I see here a weakness. 
>>> If a node on which this script is called goes down for any reasons,
>>> then "resource-stickiness" is not set back to its original value,
>>> which is very bad.
> I don't quite understand: You want your resources to move to their preferred 
> location after some problem. When the node goes down with the lower 
> stickiness, there is no problem while the other node is down; when it comes 
> up, resources might be moved, but isn't that what you wanted?

I guess this is about the general problem with features like e.g. 'move',
which are so much against how pacemaker is working.
They are implemented inside the high-level tooling.
They temporarily modify the CIB, and if something happens that makes this
controlling high-level tool go away, the change stays as is - the CIB
stays modified and the user has to know that a manual cleanup is needed.
So we could actually derive a general discussion from that: how to handle
these issues in a way that makes it less likely to have artefacts persist
after some administrative action.
At the moment, special tagging of the constraints that are automatically
created to trigger a move is one approach.
But when would you issue an automated cleanup? Is there anything
implemented in the high-level tooling? pcsd I guess would be a candidate;
for crmsh I don't know of a persistent instance that could take care of
that ...

If we say we won't implement these features in the core of pacemaker
I definitely agree. But is there anything we could do to make it easier
for high-level-tools?
I'm thinking of some mechanism that makes the constraints somehow
magically disappear or become disabled when they have achieved what they
were intended to, or if the connection to some administrative shell is
lost, or ...
I could imagine dependency on some token given to a shell, something
like a suicide-timeout, ...
Maybe the usual habit when configuring a switch/router can trigger
some ideas: issue a reboot in x minutes; do a non-persistent config change;
check if everything is fine afterwards; make it persistent; disable
the timed reboot.
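
A sketch of that habit applied to the stickiness case (assuming crmsh and
at(1); the 10-minute window is arbitrary, and the at job dies with the node
just like the script would, so this only guards against the tool failing,
not against the node being fenced):

    # arm the revert first, then make the temporary change
    echo 'crm configure rsc_defaults resource-stickiness=150' | at now + 10 minutes
    crm configure rsc_defaults resource-stickiness=50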
 
>
>>> So, now I am thinking of how to solve this problem. I would appreciate
>>> any thoughts about this.
>>>
>>> Is there a way to ask Pacemaker to do these commands sequentially so
>>> there is no need to wait in the script?
>>> If it is possible, than I think that my concern from above goes away.
>>>
>>> Another thing which comes to my mind - is to use time based rules.
>>> This ways when I need to do a manual fail-back, I simply set (or
>>> update) a time-based rule from the script.
>>> And the rule will basically say - set "resource-stickiness" to 50
>>> right now and expire in 10 min.
>>> This looks good at first glance, but there is no reliable way to
>>> put a minimum sufficient time on it; at least none I am aware of.
>>> And the thing is - it is important to me that "resource-stickiness" is
>>> set back to its original value as soon as possible.
>>>
>>> Those are my thoughts. As I said, I appreciate any ideas here.
>> Have never tried --wait with crmsh but I would guess that the delay you
>> are observing
>> is really the time your resources are taking to stop and start somewhere
>> else.
>>
>> Actually you would need the reduced stickiness just during the stop
>> phase - right?
>>
>> So as there is no command like "wait till all stops are done", you could
>> still do the 'crm_simulate -Ls' and check that it doesn't want to stop
>> anything anymore. So you can save the time the starts would take.
>> Unfortunately you have to repeat that and thus put additional load on
>> pacemaker, possibly slowing things down if your poll cycle is too short.
>>
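A rough sketch of such a poll loop (assuming that pending stops show up as
"* Stop ..." lines in the crm_simulate transition summary):

    # wait until the policy engine no longer wants to stop anything
    while crm_simulate -Ls 2>/dev/null | grep -q '^ *\* Stop'; do
        sleep 5   # poll interval - too short a cycle just adds load on pacemaker
    done
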
>>>
>>> Thank you,
>>> Kostia
>>>
>>> On Tue, Nov 8, 2016 at 10:19 PM, Dejan Muhamedagic wrote:
>>>
>>> On Tue, Nov 08, 2016 at 12:54:10PM +0100, Klaus Wenninger wrote:
>>> > On 11/08/2016 11:40 AM, Kostiantyn Ponomarenko wrote:
>>> > > Hi,
>>> > >
>>> > > I need a way to do a manual fail-back on demand.
>>> > > To be clear, I don't want it to be ON/OFF; I want it to be more like
>>> > > "one shot".
>>> > > So far I found that the most reasonable way to do it - is to set
>>> > > "resource 

Re: [ClusterLabs] pacemaker after upgrade from wheezy to jessie

2016-11-10 Thread Toni Tschampke

Did your upgrade documentation describe how to update the corosync
configuration, and did that go well? crmd may be unable to function due
to lack of quorum information.


Thanks for this tip, corosync quorum configuration was the cause.

As we changed validate-with as well as the feature set manually in the 
cib, is there a need for issuing the cibadmin --upgrade --force command 
or is this command just for changing the schemas?
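
For what it's worth, a quick way to check what the live CIB currently
validates against (a sketch):

    cibadmin --query | head -n 1    # the <cib ...> element carries the validate-with attribute
    cibadmin --upgrade --force      # bumps validate-with to the newest schema the installed pacemaker supports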


--
Kind regards

Toni Tschampke | t...@halle.it
bcs kommunikationslösungen
Inh. Dipl. Ing. Carsten Burkhardt
Harz 51 | 06108 Halle (Saale) | Germany
tel +49 345 29849-0 | fax +49 345 29849-22
www.b-c-s.de | www.halle.it | www.wivewa.de


SIMPLY MANAGE ADDRESSES, PHONE CALLS AND DOCUMENTS - WITH WIVEWA -
YOUR KNOWLEDGE MANAGER FOR YOUR BUSINESS!

Further information is available at www.wivewa.de

Am 08.11.2016 um 22:51 schrieb Ken Gaillot:

On 11/07/2016 09:08 AM, Toni Tschampke wrote:

We managed to change the validate-with option via workaround (cibadmin
export & replace) as setting the value with cibadmin --modify doesn't
write the changes to disk.

After experimenting with various schemes (xml is correctly interpreted
by crmsh) we are still not able to communicate with local crmd.

Can someone please help to determine why the local crmd is not
responding (we disabled our other nodes to eliminate possible corosync
related issues) and runs into errors/timeouts when issuing crmsh or
cibadmin related commands.


It occurs to me that wheezy used corosync 1. There were major changes
from corosync 1 to 2 ... 1 relied on a "plugin" to provide quorum for
pacemaker, whereas 2 has quorum built-in.

Did your upgrade documentation describe how to update the corosync
configuration, and did that go well? crmd may be unable to function due
to lack of quorum information.


examples for not working local commands

timeout when running cibadmin: (strace attachment)

cibadmin --upgrade --force
Call cib_upgrade failed (-62): Timer expired


error when running a crm resource cleanup

crm resource cleanup $vm
Error signing on to the CRMd service
Error performing operation: Transport endpoint is not connected


I attached the strace log from running cib_upgrade, does this help to
find the cause of the timeout issue?

Here is the corosync dump when locally starting pacemaker:


Nov 07 16:01:59 [24339] nebel1 corosync notice  [MAIN  ] main.c:1256
Corosync Cluster Engine ('2.3.6'): started and ready to provide service.
Nov 07 16:01:59 [24339] nebel1 corosync info[MAIN  ] main.c:1257
Corosync built-in features: dbus rdma monitoring watchdog augeas
systemd upstart xmlconf qdevices snmp pie relro bindnow
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemnet.c:248 Initializing transport (UDP/IP Multicast).
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
none hash: none
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemnet.c:248 Initializing transport (UDP/IP Multicast).
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemcrypto.c:579 Initializing transmit/receive security (NSS) crypto:
none hash: none
Nov 07 16:01:59 [24339] nebel1 corosync notice  [TOTEM ]
totemudp.c:671 The network interface [10.112.0.1] is now up.
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync configuration map access [0]
Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
ipc_setup.c:536 server name: cmap
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync configuration service [1]
Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
ipc_setup.c:536 server name: cfg
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync cluster closed process group service
v1.01 [2]
Nov 07 16:01:59 [24339] nebel1 corosync info[QB]
ipc_setup.c:536 server name: cpg
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync profile loading service [4]
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync resource monitoring service [6]
Nov 07 16:01:59 [24339] nebel1 corosync info[WD] wd.c:669
Watchdog /dev/watchdog is now been tickled by corosync.
Nov 07 16:01:59 [24339] nebel1 corosync warning [WD] wd.c:625
Could not change the Watchdog timeout from 10 to 6 seconds
Nov 07 16:01:59 [24339] nebel1 corosync warning [WD] wd.c:464
resource load_15min missing a recovery key.
Nov 07 16:01:59 [24339] nebel1 corosync warning [WD] wd.c:464
resource memory_used missing a recovery key.
Nov 07 16:01:59 [24339] nebel1 corosync info[WD] wd.c:581 no
resources configured.
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  ] service.c:174
Service engine loaded: corosync watchdog service [7]
Nov 07 16:01:59 [24339] nebel1 corosync notice  [SERV  

[ClusterLabs] Antw: What do Parenthesis Mean in a Colocation or Order?

2016-11-10 Thread Ulrich Windl
>>> Eric Robinson wrote on 10.11.2016 at 09:18:

> I can't believe I'm still unclear on this, but the behavior seems to be
> different with different versions of Pacemaker.
> 
> What do the parenthesis accomplish in a statement like this?
> 
> colocation c_clust15 inf: ( p_mysql_029  p_mysql_484 p_mysql_734 ) 
> p_vip_clust15 p_fs_clust15 p_lv_drbd0 ms_drbd0:Master 
> order o_clust15 inf: ms_drbd0:promote p_lv_drbd0 p_fs_clust15 p_vip_clust15 
> ( p_mysql_029  p_mysql_484 p_mysql_734 )
> 
> I've noticed that on this one cluster, if any of the mysql instances fails,
> the whole cluster fails (vip, fs, lvm, etc.). It does not do that on other
> clusters.

We also use similar things, and they seem to work ;-)
colocation col_Xen_CFS inf: ( prm_xen_v01 prm_xen_v02 prm_xen_v03  ) 
cln_CFS_VMs_fs

meaning any of the prm_xen_v* can only run where the VM filesystem is running

order ord_CFS_VMs_Xen inf: cln_CFS_VMs_fs ( prm_xen_v01 prm_xen_v02 prm_xen_v03 
)

meaning any of the xen VMs must be started after the VM filesystem.

See "Resource sets" in the manual page of crm:

   Three different types of resource sets are provided by crmsh, and each
   one implies different values for the two resource set attributes,
   sequential and require-all.

   sequential
   If false, the resources in the set do not depend on each other
   internally. Setting sequential to true implies a strict order of
   dependency within the set.

   require-all
   If false, only one resource in the set is required to fulfil the
   requirements of the set. The set of A, B and C with require-all set
   to false is read as "A OR B OR C" when its dependencies are
   resolved.

   The three types of resource sets modify the attributes in the following
   way:

1. Implicit sets (no brackets).  sequential=true, require-all=true

2. Parenthesis set (( ...  )).  sequential=false, require-all=true

3. Bracket set ([ ...  ]).  sequential=false, require-all=false
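
So, as an illustration of the third type (resource names here are made up),
an order like

   order o_example inf: [ rsc_A rsc_B ] rsc_C

means rsc_C only needs at least one of rsc_A or rsc_B to be started first.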

> 
> --Eric
> 





___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org