[ClusterLabs] redis and pgsql RAs under pacemaker_remote does not work

2018-01-11 Thread Виталий Процко
Hi!

I'm running redis and PostgreSQL in LXC containers with pacemaker_remote,
on hosts running the full cluster stack.

In the pacemaker_remote installation the crm_attribute utility is absent, so
the RAs named in the subject do not work.

Maybe crm_resource could be used for the same purpose, storing the state data
in resource attributes? Or is there any other solution?
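
For illustration, this is roughly the kind of call the RAs make today through
crm_attribute, next to a hypothetical crm_resource-based alternative; the
attribute and resource names are only examples and this is untested:

 # roughly what the RAs do now (promotion score via crm_attribute):
 crm_attribute --node "$(hostname)" --lifetime reboot \
     --name master-redis --update 100

 # hypothetical alternative: keep the state in a resource parameter instead
 # (parameter name is made up; untested on a remote node)
 crm_resource --resource redis --set-parameter last_state \
     --parameter-value "$(hostname)"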

-- 
/aTan

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Re: Re: pacemaker reports monitor timeout while CPU is high

2018-01-11 Thread 范国腾
Thank you very much, Ken. I will set the high timeout and try.

-----Original Message-----
From: Ken Gaillot [mailto:kgail...@redhat.com]
Sent: January 11, 2018 23:48
To: Cluster Labs - All topics related to open-source clustering welcomed
Cc: 王亮
Subject: Re: [ClusterLabs] Re: pacemaker reports monitor timeout while CPU is high

On Thu, 2018-01-11 at 03:50 +, 范国腾 wrote:
> Thank you, Ken.
> 
> We have set the timeout to 10 seconds, but it reports a timeout only
> after 2 seconds. So it seems not to work even if I set higher timeouts.
> Our application, which is managed by Pacemaker, will start more than
> 500 processes during the performance test. Does that affect the
> result? Which log could help us analyze this?
> 
> > monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-
> > interval-16s)

It's not timing out after 2 seconds. The message:

  sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor

indicates that the monitor's process ID is 5240, but the message:

  sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out

indicates that the monitor that timed out had process ID 5606. That means that 
there were two separate monitors in progress. I'm not sure why; I wouldn't 
expect the second one to be started until after the first one had timed out. 
But it's possible with the high load that the log messages were simply written 
to the log out of order, since they were written by different processes.

I would just raise the timeout higher than 10s during the test.
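
For example, with crmsh that could look something like this (values taken
from this thread; adjust to whatever tooling you use):

 crm configure edit pgsqld
 # then change the op line, e.g. from
 #   op monitor interval=16s role=Slave timeout=10s
 # to
 #   op monitor interval=16s role=Slave timeout=120s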

> 
> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: January 11, 2018 0:54
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] pacemaker reports monitor timeout while CPU is high
> 
> On Wed, 2018-01-10 at 09:40 +, 范国腾 wrote:
> > Hello,
> >  
> > This issue only appears when we run the performance test and the CPU is
> > high. The cluster and log are as below. Pacemaker will restart the
> > slave-side pgsql-ha resource about every two minutes.
> > 
> > Take the following scenario for example (when the pgsqlms RA is
> > called, we print the log “execute the command start (command)”; when
> > the command returns, we print the log “execute the command stop
> > (command) (result)”):
> > 1. We could see that Pacemaker calls “pgsqlms monitor” about every
> > 15 seconds, and it returns $OCF_SUCCESS.
> > 2. It calls the monitor command again at 13:56:16, and then it reports
> > a timeout error at 13:56:18. It is only 2 seconds, but it reports
> > “timeout=1ms”.
> > 3. In other logs, sometimes after 15 minutes, there is no
> > “execute the command start monitor” printed and it reports a timeout
> > error directly.
> >  
> > Could you please tell how to debug or resolve such issue?
> >  
> > The log:
> >  
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the 
> > command start monitor Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: 
> > INFO:
> > _confirm_role start Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]:
> > INFO:
> > _confirm_role stop
> > 0
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the 
> > command stop monitor 0 Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: 
> > INFO:
> > execute the command start monitor Jan 10 13:55:52 sds2
> > pgsqlms(pgsqld)[5477]: INFO: _confirm_role start Jan 10 13:55:52
> > sds2
> > pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop
> > 0
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the 
> > command stop monitor 0 Jan 10 13:56:02 sds2 crmd[26096]:  notice: 
> > High CPU load detected:
> > 426.77
> > Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the 
> > command start monitor Jan 10 13:56:18 sds2 lrmd[26093]: warning:
> > pgsqld_monitor_16000 process (PID 5606) timed out
> 
> There's something more going on than in this log snippet. Notice the 
> process that timed out (5606) is not one of the processes that logged 
> above (5240 and 5477).
> 
> Generally, once load gets that high, it's very difficult to maintain 
> responsiveness, and the expectation is that another node will fence 
> it.
> But it can often be worked around with high timeouts, and/or you can 
> use rules to set higher timeouts or maintenance mode during times when 
> high load is expected.
> 
> > Jan 10 13:56:18 sds2 lrmd[26093]: warning:
> > pgsqld_monitor_16000:5606
> > - timed out after 1ms
> > Jan 10 13:56:18 sds2 crmd[26096]:   error: Result of monitor 
> > operation for pgsqld on db2: Timed Out | call=102
> > key=pgsqld_monitor_16000 timeout=1ms Jan 10 13:56:18 sds2
> > crmd[26096]:  notice: db2-
> > pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ] Jan
> > 10 13:56:18 sds2 crmd[26096]:  notice: State transition S_IDLE -> 
> > S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL 
> > origin=abort_transition_graph Jan 10 13:56:19 sds2 pengine[26095]:
> > warning: Processing failed 

[ClusterLabs] pengine bug? Recovery after monitor failure: Restart of DRBD does not restart Filesystem -- unless explicit order start before promote on DRBD

2018-01-11 Thread Lars Ellenberg

To understand some weird behavior we observed,
I dumbed down a production config to three dummy resources,
while keeping some descriptive resource ids (ip, drbd, fs).

For some reason, the constraints are:
stuff, more stuff, IP -> DRBD -> FS -> other stuff.
(In the actual real-world config, it makes somewhat more sense,
but it reproduces with just these three resources)

All is running just fine.

Online: [ ava emma ]
 virtual_ip (ocf::pacemaker:Dummy): Started ava
 Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
 Masters: [ ava ]
 p_fs_drbd1 (ocf::pacemaker:Dummy): Started ava

If I simulate a monitor failure on IP:
# crm_simulate -L -i virtual_ip_monitor_3@ava=1

Transition Summary:
 * Recover virtual_ip   (Started ava)
 * Restart p_drbd_r0:0  (Master ava)

Which in real life will obviously fail,
because we cannot "restart" (demote) a DRBD
while it is still in use (mounted, in this case).

Only if I add a stupid intra-resource order constraint that explicitly
states to first start, then promote, on the DRBD itself
do I get the result I would have expected:

Transition Summary:
 * Recover virtual_ip   (Started ava)
 * Restart p_drbd_r0:0  (Master ava)
 * Restart p_fs_drbd1   (Started ava)

Interestingly enough, if I simulate a monitor failure on "DRBD" directly,
I get the expected result in both cases:

Transition Summary:
 * Recover p_drbd_r0:0  (Master ava)
 * Restart p_fs_drbd1   (Started ava)


What am I missing?

Do we have to "annotate" somewhere that you must not demote something
if it is still "in use" by something else?

Did I just screw up the constraints somehow?
What would the constraints need to look like to get the expected result,
without explicitly adding the first-start-then-promote constraint?

Is (was?) this a pengine bug?



How to reproduce:
=

crm shell style dummy config:
--
node 1: ava
node 2: emma
primitive p_drbd_r0 ocf:pacemaker:Stateful \
op monitor interval=29s role=Master \
op monitor interval=31s role=Slave
primitive p_fs_drbd1 ocf:pacemaker:Dummy \
op monitor interval=20 timeout=40
primitive virtual_ip ocf:pacemaker:Dummy \
op monitor interval=30s
ms ms_drbd_r0 p_drbd_r0 \
meta master-max=1 master-node-max=1 clone-max=1 clone-node-max=1
colocation c1 inf: ms_drbd_r0 virtual_ip
colocation c2 inf: p_fs_drbd1:Started ms_drbd_r0:Master
order o1 inf: virtual_ip:start ms_drbd_r0:start
order o2 inf: ms_drbd_r0:promote p_fs_drbd1:start
--

crm_simulate -x bad.xml -i virtual_ip_monitor_3@ava=1

 trying to demote DRBD before umount :-((

adding stupid constraint:

order first-start-then-promote inf: ms_drbd_r0:start ms_drbd_r0:promote

crm_simulate -x good.xml -i virtual_ip_monitor_3@ava=1

  yay, first umount, then demote...

(tested with 1.1.15 and 1.1.16, not yet with more recent code base)


Full good.xml and bad.xml are both attached.

Manipulating the constraint in the live cib using cibadmin only:
add: cibadmin -C -o constraints -X '<rsc_order id="first-start-then-promote" first="ms_drbd_r0" first-action="start" then="ms_drbd_r0" then-action="promote"/>'
del: cibadmin -D -X '<rsc_order id="first-start-then-promote"/>'

Thanks,

Lars



bad.xml.bz2
Description: Binary data


good.xml.bz2
Description: Binary data


Re: [ClusterLabs] Does anyone use clone instance constraints from pacemaker-next schema?

2018-01-11 Thread Jehan-Guillaume de Rorthais
On Thu, 11 Jan 2018 12:00:25 -0600
Ken Gaillot  wrote:

> On Thu, 2018-01-11 at 20:11 +0300, Andrei Borzenkov wrote:
> > On 11.01.2018 19:21, Ken Gaillot wrote:
> > > On Thu, 2018-01-11 at 01:16 +0100, Jehan-Guillaume de Rorthais
> > > wrote:  
> > > > On Wed, 10 Jan 2018 12:23:59 -0600
> > > > Ken Gaillot  wrote:
> > > > ...  
> > > > > My question is: has anyone used or tested this, or is anyone
> > > > > interested
> > > > > in this? We won't promote it to the default schema unless it is
> > > > > tested.
> > > > > 
> > > > > My feeling is that it is more likely to be confusing than
> > > > > helpful,
> > > > > and
> > > > > there are probably ways to achieve any reasonable use case with
> > > > > existing syntax.  
> > > > 
> > > > For what it worth, I tried to implement such solution to dispatch
> > > > mulitple
> > > > IP addresses to slaves in a 1 master 2 slaves cluster. This is
> > > > quite
> > > > time
> > > > consuming to wrap its head around sides effects with colocation,
> > > > scores and
> > > > stickiness. My various tests shows everything sounds to behave
> > > > correctly now,
> > > > but I don't feel really 100% confident about my setup.
> > > > 
> > > > I agree that there are ways to achieve such a use case with
> > > > existing
> > > > syntax.
> > > > But this is quite confusing as well. As instance, I experienced a
> > > > master
> > > > relocation when messing with a slave to make sure its IP would
> > > > move
> > > > to the
> > > > other slave node...I don't remember exactly what was my error,
> > > > but I
> > > > could
> > > > easily dig for it if needed.
> > > > 
> > > > I feel like it fits in the same area that the usability of
> > > > Pacemaker.
> > > > Making it
> > > > easier to understand. See the recent discussion around the
> > > > gocardless
> > > > war story.
> > > > 
> > > > My tests was mostly for labs, demo and tutorial purpose. I don't
> > > > have
> > > > a
> > > > specific field use case. But if at some point this feature is
> > > > promoted
> > > > officially as preview, I'll give it some testing and report here
> > > > (barring the
> > > > fact I'm actually aware some feedback are requested ;)).  
> > > 
> > > It's ready to be tested now -- just do this:
> > > 
> > >  cibadmin --upgrade
> > >  cibadmin --modify --xml-text '<cib validate-with="pacemaker-next"/>'
> > > 
> > > Then use constraints like:
> > > 
> > >  <rsc_colocation id="coloc-1" score="INFINITY" rsc="rsc1"
> > >    with-rsc="clone1" with-rsc-instance="1" />
> > > 
> > >  <rsc_colocation id="coloc-2" score="INFINITY" rsc="rsc2"
> > >    with-rsc="clone1" with-rsc-instance="2" />
> > > 
> > > to colocate rsc1 and rsc2 with separate instances of clone1. There
> > > is
> > > no way to know *which* instance of clone1 will be 1, 2, etc.; this
> > > just
> > > allows you to ensure the colocations are separate.
> > >   
> > 
> > Is it possible to designate master/slave as well?  
> 
> If you mean constrain one resource to the master, and a bunch of other
> resources to the slaves, then no, this new syntax doesn't support that.
> But it should be possible with existing syntax, by constraining with
> role=master or role=slave, then anticolocating the resources with each
> other.
> 

Oh, wait, this is a deal breaker then... This was exactly my use case:

 * giving a specific IP address to the master
 * providing various IP addresses to the slaves

I suppose I'm stuck with the existing syntax then.



Re: [ClusterLabs] Does anyone use clone instance constraints from pacemaker-next schema?

2018-01-11 Thread Ken Gaillot
On Thu, 2018-01-11 at 20:11 +0300, Andrei Borzenkov wrote:
> On 11.01.2018 19:21, Ken Gaillot wrote:
> > On Thu, 2018-01-11 at 01:16 +0100, Jehan-Guillaume de Rorthais
> > wrote:
> > > On Wed, 10 Jan 2018 12:23:59 -0600
> > > Ken Gaillot  wrote:
> > > ...
> > > > My question is: has anyone used or tested this, or is anyone
> > > > interested
> > > > in this? We won't promote it to the default schema unless it is
> > > > tested.
> > > > 
> > > > My feeling is that it is more likely to be confusing than
> > > > helpful,
> > > > and
> > > > there are probably ways to achieve any reasonable use case with
> > > > existing syntax.
> > > 
> > > For what it worth, I tried to implement such solution to dispatch
> > > mulitple
> > > IP addresses to slaves in a 1 master 2 slaves cluster. This is
> > > quite
> > > time
> > > consuming to wrap its head around sides effects with colocation,
> > > scores and
> > > stickiness. My various tests shows everything sounds to behave
> > > correctly now,
> > > but I don't feel really 100% confident about my setup.
> > > 
> > > I agree that there are ways to achieve such a use case with
> > > existing
> > > syntax.
> > > But this is quite confusing as well. As instance, I experienced a
> > > master
> > > relocation when messing with a slave to make sure its IP would
> > > move
> > > to the
> > > other slave node...I don't remember exactly what was my error,
> > > but I
> > > could
> > > easily dig for it if needed.
> > > 
> > > I feel like it fits in the same area that the usability of
> > > Pacemaker.
> > > Making it
> > > easier to understand. See the recent discussion around the
> > > gocardless
> > > war story.
> > > 
> > > My tests was mostly for labs, demo and tutorial purpose. I don't
> > > have
> > > a
> > > specific field use case. But if at some point this feature is
> > > promoted
> > > officially as preview, I'll give it some testing and report here
> > > (barring the
> > > fact I'm actually aware some feedback are requested ;)).
> > 
> > It's ready to be tested now -- just do this:
> > 
> >  cibadmin --upgrade
> >  cibadmin --modify --xml-text '<cib validate-with="pacemaker-next"/>'
> > 
> > Then use constraints like:
> > 
> >  <rsc_colocation id="coloc-1" score="INFINITY" rsc="rsc1"
> >    with-rsc="clone1" with-rsc-instance="1" />
> > 
> >  <rsc_colocation id="coloc-2" score="INFINITY" rsc="rsc2"
> >    with-rsc="clone1" with-rsc-instance="2" />
> > 
> > to colocate rsc1 and rsc2 with separate instances of clone1. There
> > is
> > no way to know *which* instance of clone1 will be 1, 2, etc.; this
> > just
> > allows you to ensure the colocations are separate.
> > 
> 
> Is it possible to designate master/slave as well?

If you mean constrain one resource to the master, and a bunch of other
resources to the slaves, then no, this new syntax doesn't support that.
But it should be possible with existing syntax, by constraining with
role=master or role=slave, then anticolocating the resources with each
other.
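
For example, something along these lines in crm shell syntax (resource and
constraint names are only illustrative):

 colocation ip-with-master inf: ip_master ms_clone:Master
 colocation ip1-with-slave inf: ip_slave1 ms_clone:Slave
 colocation ip2-with-slave inf: ip_slave2 ms_clone:Slave
 colocation slave-ips-apart -inf: ip_slave1 ip_slave2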

> 
> > Similarly you can use rsc="clone1" rsc-instance="1" to colocate a
> > clone
> > instance relative to another resource instead.
> > 
> > For ordering, the corresponding syntax is "first-instance" or
> > "then-
> > instance" as desired.
> > 
> > I believe crm shell has higher-level support for this feature.
> > 
> > Personally, I think standard colocations of rsc1 and rsc2 with
> > clone1,
> > and then an anticolocation between rsc1 and rsc2, would be more
> > intuitive. You're right that the interactions with stickiness etc.
> > can
> > be tricky, but that would apply to the alternate syntax as well.
> > 
> 
> 
-- 
Ken Gaillot 



Re: [ClusterLabs] Does anyone use clone instance constraints from pacemaker-next schema?

2018-01-11 Thread Andrei Borzenkov
On 11.01.2018 19:21, Ken Gaillot wrote:
> On Thu, 2018-01-11 at 01:16 +0100, Jehan-Guillaume de Rorthais wrote:
>> On Wed, 10 Jan 2018 12:23:59 -0600
>> Ken Gaillot  wrote:
>> ...
>>> My question is: has anyone used or tested this, or is anyone
>>> interested
>>> in this? We won't promote it to the default schema unless it is
>>> tested.
>>>
>>> My feeling is that it is more likely to be confusing than helpful,
>>> and
>>> there are probably ways to achieve any reasonable use case with
>>> existing syntax.
>>
>> For what it worth, I tried to implement such solution to dispatch
>> mulitple
>> IP addresses to slaves in a 1 master 2 slaves cluster. This is quite
>> time
>> consuming to wrap its head around sides effects with colocation,
>> scores and
>> stickiness. My various tests shows everything sounds to behave
>> correctly now,
>> but I don't feel really 100% confident about my setup.
>>
>> I agree that there are ways to achieve such a use case with existing
>> syntax.
>> But this is quite confusing as well. As instance, I experienced a
>> master
>> relocation when messing with a slave to make sure its IP would move
>> to the
>> other slave node...I don't remember exactly what was my error, but I
>> could
>> easily dig for it if needed.
>>
>> I feel like it fits in the same area that the usability of Pacemaker.
>> Making it
>> easier to understand. See the recent discussion around the gocardless
>> war story.
>>
>> My tests was mostly for labs, demo and tutorial purpose. I don't have
>> a
>> specific field use case. But if at some point this feature is
>> promoted
>> officially as preview, I'll give it some testing and report here
>> (barring the
>> fact I'm actually aware some feedback are requested ;)).
> 
> It's ready to be tested now -- just do this:
> 
>  cibadmin --upgrade
>  cibadmin --modify --xml-text '<cib validate-with="pacemaker-next"/>'
> 
> Then use constraints like:
> 
>  <rsc_colocation id="coloc-1" score="INFINITY" rsc="rsc1"
>    with-rsc="clone1" with-rsc-instance="1" />
> 
>  <rsc_colocation id="coloc-2" score="INFINITY" rsc="rsc2"
>    with-rsc="clone1" with-rsc-instance="2" />
> 
> to colocate rsc1 and rsc2 with separate instances of clone1. There is
> no way to know *which* instance of clone1 will be 1, 2, etc.; this just
> allows you to ensure the colocations are separate.
> 

Is it possible to designate master/slave as well?

> Similarly you can use rsc="clone1" rsc-instance="1" to colocate a clone
> instance relative to another resource instead.
> 
> For ordering, the corresponding syntax is "first-instance" or "then-
> instance" as desired.
> 
> I believe crm shell has higher-level support for this feature.
> 
> Personally, I think standard colocations of rsc1 and rsc2 with clone1,
> and then an anticolocation between rsc1 and rsc2, would be more
> intuitive. You're right that the interactions with stickiness etc. can
> be tricky, but that would apply to the alternate syntax as well.
> 




Re: [ClusterLabs] Antw: Changes coming in Pacemaker 2.0.0

2018-01-11 Thread Ken Gaillot
On Thu, 2018-01-11 at 08:54 +0100, Ulrich Windl wrote:
> On "--crm_xml -> --xml-text": Why not simply "--xml" (XML IS text)?

Most Pacemaker tools that accept XML can get it from standard input (
--xml-pipe), a file (--xml-file), or a literal string (--xml-text).

Although, looking at it now, it might be nice to reduce it to one
option:

 --xml -   standard input
 --xml '' anything starting with '<' is literal
 --xml fileanything else is a filename
-- 
Ken Gaillot 



Re: [ClusterLabs] Changes coming in Pacemaker 2.0.0

2018-01-11 Thread Ken Gaillot
On Thu, 2018-01-11 at 01:21 +0100, Jehan-Guillaume de Rorthais wrote:
> On Wed, 10 Jan 2018 16:10:50 -0600
> Ken Gaillot  wrote:
> 
> > Pacemaker 2.0 will be a major update whose main goal is to remove
> > support for deprecated, legacy syntax, in order to make the code
> > base
> > more maintainable into the future. There will also be some changes
> > to
> > default configuration behavior, and the command-line tools.
> > 
> > I'm hoping to release the first release candidate in the next
> > couple of
> > weeks.
> 
> Great news! Congrats.
> 
> > We'll have a longer than usual rc phase to allow for plenty of
> > testing.
> > 
> > A thoroughly detailed list of changes will be maintained on the
> > ClusterLabs wiki:
> > 
> >   https://wiki.clusterlabs.org/wiki/Pacemaker_2.0_Changes
> > 
> > These changes are not final, and we can restore functionality if
> > there
> > is a strong need for it. Most user-visible changes are complete (in
> > the
> > 2.0 branch on github); major changes are still expected, but
> > primarily
> > to the C API.
> > 
> > Some highlights:
> > 
> > * Only Corosync version 2 will be supported as the underlying
> > cluster
> > layer. Support for Heartbeat and Corosync 1 is removed. (Support
> > for
> > the new kronosnet layer will be added in a future version.)
> 
> I thought (according to some conference slides from sept 2017) knet
> was mostly
> related to corosync directly? Is there some visible impact on
> Pacemaker too?

You're right -- it's more accurate to say that corosync 3 will support
knet, and I'm not yet aware whether the corosync 3 API will require any
changes in Pacemaker.
-- 
Ken Gaillot 



Re: [ClusterLabs] Does anyone use clone instance constraints from pacemaker-next schema?

2018-01-11 Thread Ken Gaillot
On Thu, 2018-01-11 at 01:16 +0100, Jehan-Guillaume de Rorthais wrote:
> On Wed, 10 Jan 2018 12:23:59 -0600
> Ken Gaillot  wrote:
> ...
> > My question is: has anyone used or tested this, or is anyone
> > interested
> > in this? We won't promote it to the default schema unless it is
> > tested.
> > 
> > My feeling is that it is more likely to be confusing than helpful,
> > and
> > there are probably ways to achieve any reasonable use case with
> > existing syntax.
> 
> For what it's worth, I tried to implement such a solution to dispatch
> multiple IP addresses to slaves in a 1-master, 2-slave cluster. It is quite
> time consuming to wrap one's head around the side effects of colocation,
> scores and stickiness. My various tests show that everything seems to behave
> correctly now, but I don't feel really 100% confident about my setup.
> 
> I agree that there are ways to achieve such a use case with existing
> syntax. But this is quite confusing as well. For instance, I experienced a
> master relocation when messing with a slave to make sure its IP would move
> to the other slave node... I don't remember exactly what my error was, but I
> could easily dig it up if needed.
> 
> I feel like it fits in the same area as the usability of Pacemaker:
> making it easier to understand. See the recent discussion around the
> gocardless war story.
> 
> My tests were mostly for lab, demo and tutorial purposes. I don't have a
> specific field use case. But if at some point this feature is promoted
> officially as a preview, I'll give it some testing and report here
> (barring the fact that I'm actually aware some feedback is requested ;)).
It's ready to be tested now -- just do this:

 cibadmin --upgrade
 cibadmin --modify --xml-text '<cib validate-with="pacemaker-next"/>'

Then use constraints like:

 <rsc_colocation id="coloc-1" score="INFINITY" rsc="rsc1"
   with-rsc="clone1" with-rsc-instance="1" />

 <rsc_colocation id="coloc-2" score="INFINITY" rsc="rsc2"
   with-rsc="clone1" with-rsc-instance="2" />

to colocate rsc1 and rsc2 with separate instances of clone1. There is
no way to know *which* instance of clone1 will be 1, 2, etc.; this just
allows you to ensure the colocations are separate.

Similarly you can use rsc="clone1" rsc-instance="1" to colocate a clone
instance relative to another resource instead.

For ordering, the corresponding syntax is "first-instance" or "then-
instance" as desired.

I believe crm shell has higher-level support for this feature.

Personally, I think standard colocations of rsc1 and rsc2 with clone1,
and then an anticolocation between rsc1 and rsc2, would be more
intuitive. You're right that the interactions with stickiness etc. can
be tricky, but that would apply to the alternate syntax as well.
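
For example, in crm shell syntax (constraint ids are only illustrative):

 colocation rsc1-with-clone inf: rsc1 clone1
 colocation rsc2-with-clone inf: rsc2 clone1
 colocation rsc1-apart-from-rsc2 -inf: rsc1 rsc2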
-- 
Ken Gaillot 



Re: [ClusterLabs] Antw: Re: Antw: Changes coming in Pacemaker 2.0.0

2018-01-11 Thread Kristoffer Grönlund
Jehan-Guillaume de Rorthais  writes:

>
> For what is worth, while using crmsh, I always have to explain to
> people or customers that:
>
> * we should issue an "unmigrate" to remove the constraint as soon as the
>   resource can get back to the original node or get off the current node if
>   needed (depending on the -inf or +inf constraint location issued)
> * this will not migrate back the resource if it's sticky enough on the current
>   node. 
>
> See:
> http://clusterlabs.github.io/PAF/Debian-8-admin-cookbook.html#swapping-master-and-slave-roles-between-nodes
>
> This is counter-intuitive, indeed. I prefer the pcs interface using
> the move/clear actions.

No need! You can use crm rsc move / crm rsc clear. In fact, "unmove" is
just a backwards-compatibility alias for clear in crmsh.
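
For example (resource and node names are only illustrative):

 crm resource move myres node2    # adds the location constraint
 crm resource clear myres         # removes it, without forcing a move back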

Cheers,
Kristoffer

>
>

-- 
// Kristoffer Grönlund
// kgronl...@suse.com



Re: [ClusterLabs] Antw: Re: Antw: Changes coming in Pacemaker 2.0.0

2018-01-11 Thread Jehan-Guillaume de Rorthais
On Thu, 11 Jan 2018 18:32:35 +0300
Andrei Borzenkov  wrote:

> On Thu, Jan 11, 2018 at 2:52 PM, Ulrich Windl
>  wrote:
> >
> >  
> > >>> Andrei Borzenkov wrote on 11.01.2018 at 12:41 in message:
> >> On Thu, Jan 11, 2018 at 10:54 AM, Ulrich Windl
> >>  wrote:  
> >>> Hi!
> >>>
> >>> On the tool changes, I'd prefer --move and --un-move as pair over --move
> >>> and --clear  
> >> ("clear" is less expressive IMHO).
> >>
> >> --un-move is really wrong semantically. You do not "unmove" resource -
> >> you "clear" constraints that were created. Whether this actually
> >> results in any "movement" is unpredictable (easily).  
> >
> > You undo what "move" does: "un-move". With your argument, "move" is just as
> > bad: Why not "--forbid-host" and "--allow-host" then? 
> 
> That would be less confusing as it sounds more declarative and matches
> what actually happens - setting configuration parameter instead of
> initiating some action.

For what it's worth, while using crmsh, I always have to explain to
people or customers that:

* we should issue an "unmigrate" to remove the constraint as soon as the
  resource can move back to the original node, or off the current node if
  needed (depending on whether a -inf or +inf location constraint was issued)
* this will not migrate the resource back if it is sticky enough on the
  current node.

See:
http://clusterlabs.github.io/PAF/Debian-8-admin-cookbook.html#swapping-master-and-slave-roles-between-nodes

This is counter-intuitive, indeed. I prefer the pcs interface using
the move/clear actions.



Re: [ClusterLabs] Re: pacemaker reports monitor timeout while CPU is high

2018-01-11 Thread Ken Gaillot
On Thu, 2018-01-11 at 03:50 +, 范国腾 wrote:
> Thank you, Ken.
> 
> We have set the timeout to 10 seconds, but it reports a timeout only
> after 2 seconds. So it seems not to work even if I set higher timeouts.
> Our application, which is managed by Pacemaker, will start more than
> 500 processes during the performance test. Does that affect the
> result? Which log could help us analyze this?
> 
> > monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-
> > interval-16s)

It's not timing out after 2 seconds. The message:

  sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor

indicates that the monitor's process ID is 5240, but the message:

  sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606)
timed out

indicates that the monitor that timed out had process ID 5606. That
means that there were two separate monitors in progress. I'm not sure
why; I wouldn't expect the second one to be started until after the
first one had timed out. But it's possible with the high load that the
log messages were simply written to the log out of order, since they
were written by different processes.

I would just raise the timeout higher than 10s during the test.

> 
> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: January 11, 2018 0:54
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] pacemaker reports monitor timeout while CPU is high
> 
> On Wed, 2018-01-10 at 09:40 +, 范国腾 wrote:
> > Hello,
> >  
> > This issue only appears when we run the performance test and the CPU is
> > high. The cluster and log are as below. Pacemaker will restart the
> > slave-side pgsql-ha resource about every two minutes.
> > 
> > Take the following scenario for example (when the pgsqlms RA is
> > called, we print the log “execute the command start (command)”; when
> > the command returns, we print the log “execute the command stop
> > (command) (result)”):
> > 1. We could see that Pacemaker calls “pgsqlms monitor” about every
> > 15 seconds, and it returns $OCF_SUCCESS.
> > 2. It calls the monitor command again at 13:56:16, and then it reports
> > a timeout error at 13:56:18. It is only 2 seconds, but it reports
> > “timeout=1ms”.
> > 3. In other logs, sometimes after 15 minutes, there is no “execute
> > the command start monitor” printed and it reports a timeout error
> > directly.
> >  
> > Could you please tell how to debug or resolve such issue?
> >  
> > The log:
> >  
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the
> > command 
> > start monitor Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: 
> > _confirm_role start Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]:
> > INFO: 
> > _confirm_role stop
> > 0
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the
> > command 
> > stop monitor 0 Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: 
> > execute the command start monitor Jan 10 13:55:52 sds2 
> > pgsqlms(pgsqld)[5477]: INFO: _confirm_role start Jan 10 13:55:52
> > sds2 
> > pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop
> > 0
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the
> > command 
> > stop monitor 0 Jan 10 13:56:02 sds2 crmd[26096]:  notice: High CPU 
> > load detected:
> > 426.77
> > Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the
> > command 
> > start monitor Jan 10 13:56:18 sds2 lrmd[26093]: warning: 
> > pgsqld_monitor_16000 process (PID 5606) timed out
> 
> There's something more going on than in this log snippet. Notice the
> process that timed out (5606) is not one of the processes that logged
> above (5240 and 5477).
> 
> Generally, once load gets that high, it's very difficult to maintain
> responsiveness, and the expectation is that another node will fence
> it.
> But it can often be worked around with high timeouts, and/or you can
> use rules to set higher timeouts or maintenance mode during times
> when high load is expected.
> 
> > Jan 10 13:56:18 sds2 lrmd[26093]: warning:
> > pgsqld_monitor_16000:5606
> > - timed out after 1ms
> > Jan 10 13:56:18 sds2 crmd[26096]:   error: Result of monitor
> > operation 
> > for pgsqld on db2: Timed Out | call=102
> > key=pgsqld_monitor_16000 timeout=1ms Jan 10 13:56:18 sds2 
> > crmd[26096]:  notice: db2-
> > pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
> > Jan 
> > 10 13:56:18 sds2 crmd[26096]:  notice: State transition S_IDLE -> 
> > S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL 
> > origin=abort_transition_graph Jan 10 13:56:19 sds2 pengine[26095]: 
> > warning: Processing failed op monitor for pgsqld:0 on db2: unknown 
> > error (1) Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing 
> > failed op start for pgsqld:1 on db1: unknown error (1) Jan 10
> > 13:56:19 
> > sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 
> > 100 failures (max=100) Jan 10 13:56:19 sds2
> > 

Re: [ClusterLabs] Antw: Re: Coming in Pacemaker 2.0.0: /var/log/pacemaker/pacemaker.log

2018-01-11 Thread Ken Gaillot
On Thu, 2018-01-11 at 09:10 +0100, Ulrich Windl wrote:
> Maybe the question to ask right now would be: What are the modules,
> and what are their logfile locations? An opportunity to clean up the
> mess!

It would be nice, but time is always constrained, and coordinating
multiple projects takes more time. Still a good idea though.

> >>> Adam Spiers wrote on 11.01.2018 at 00:59 in message
> <20180110235939.fvwkormbruoqhwfb@pacific.linksys.moosehall>:
> > Ken Gaillot  wrote: 
> > > The initial proposal, after discussion at last year's summit, was
> > > to 
> > > use /var/log/cluster/pacemaker.log instead. That turned out to be
> > > slightly problematic: it broke some regression tests in a way that
> > > wasn't easily fixable, and more significantly, it raises the question
> > > of what package should own /var/log/cluster (which different
> > > distributions might want to answer differently).
> > 
> > I thought one option aired at the summit to address this was 
> > /var/log/clusterlabs, but it's entirely possible my memory's
> > playing 
> > tricks on me again. 

I don't remember that, but it sounds like a good choice. However we'd
still have the same issue of needing a single package to own it. We
could create a really shallow clusterlabs project/package for the
purpose; I can't think of anything else to put in it that would be
universal to all ClusterLabs projects.
-- 
Ken Gaillot 



Re: [ClusterLabs] Antw: Re: Antw: Changes coming in Pacemaker 2.0.0

2018-01-11 Thread Andrei Borzenkov
On Thu, Jan 11, 2018 at 2:52 PM, Ulrich Windl
 wrote:
>
>
> >>> Andrei Borzenkov wrote on 11.01.2018 at 12:41 in message:
>> On Thu, Jan 11, 2018 at 10:54 AM, Ulrich Windl
>>  wrote:
>>> Hi!
>>>
>>> On the tool changes, I'd prefer --move and --un-move as pair over --move 
>>> and --clear
>> ("clear" is less expressive IMHO).
>>
>> --un-move is really wrong semantically. You do not "unmove" resource -
>> you "clear" constraints that were created. Whether this actually
>> results in any "movement" is unpredictable (easily).
>
> You undo what "move" does: "un-move". With your argument, "move" is just as 
> bad: Why not "--forbid-host" and "--allow-host" then?
>

That would be less confusing as it sounds more declarative and matches
what actually happens - setting configuration parameter instead of
initiating some action.



Re: [ClusterLabs] Antw: Coming in Pacemaker 2.0.0: Reliable exit codes

2018-01-11 Thread Ken Gaillot
On Thu, 2018-01-11 at 08:59 +0100, Ulrich Windl wrote:
> Hi!
> 
> Will those exit codes be compatible with sysexits.h, i.e. will it be
> a superset or a subset of it? If not, it would be the right time.

Yes! It will be a superset. From the new source code:

/*
 * Exit statuses
 *
 * We want well-specified (i.e. OS-invariant) exit status codes for our
 * daemons and applications so they can be relied on by callers. (Function
 * return codes and errno's do not make good exit statuses.)
 *
 * The only hard rule is that exit statuses must be between 0 and 255; all
 * else is convention. Universally, 0 is success, and 1 is generic error
 * (excluding OSes we don't support -- for example, OpenVMS considers 1
 * success!).
 *
 * For init scripts, the LSB gives meaning to 0-7, and sets aside 150-199
 * for application use. OCF adds 8-9 and 189-199.
 *
 * sysexits.h was an attempt to give additional meanings, but never really
 * caught on. It uses 0 and 64-78.
 *
 * Bash reserves 2 ("incorrect builtin usage") and 126-255 (126 is "command
 * found but not executable", 127 is "command not found", 128 + n is
 * "interrupted by signal n").
 *
 * tldp.org recommends 64-113 for application use.
 *
 * We try to overlap with the above conventions when practical.
 */

We are using 0-1 as success and generic error, 2-7 overlapping with
LSB+OCF, 64-78 overlapping with sysexits.h, 100-109 (possibly more
later) for custom errors, and 124 overlapping with timeout(1).
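
For example, a script could map a failure back to its description like this
(resource name is only illustrative):

 crm_resource --resource myres --cleanup
 rc=$?
 if [ $rc -ne 0 ]; then
     echo "cleanup failed: $(crm_error --exit $rc)" >&2
 fi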

> 
> Regards,
> Ulrich
> 
> 
> >>> Ken Gaillot wrote on 10.01.2018 at 23:22 in message
> <1515622941.4815.21.ca...@redhat.com>:
> > Every time you run a command on the command line or in a script, it
> > returns an exit status. These are most useful in scripts to check
> > for
> > errors.
> > 
> > Currently, Pacemaker daemons and command-line tools return an
> > unreliable mishmash of exit status codes, sometimes including
> > negative
> > numbers (which get bitwise-remapped to the 0-255 range) and/or C
> > library errno codes (which can vary across OSes).
> > 
> > The only thing scripts could rely on was 0 means success and
> > nonzero
> > means error.
> > 
> > Beginning with Pacemaker 2.0.0, everything will return a well-
> > defined
> > set of reliable exit status codes. These codes can be viewed using
> > the
> > existing crm_error tool using the --exit parameter. For example:
> > 
> > crm_error --exit --list
> > 
> > will list all possible exit statuses, and
> > 
> > crm_error --exit 124
> > 
> > will show a textual description of what exit status 124 means.
> > 
> > This will mainly be of interest to users who script Pacemaker
> > commands
> > and check the return value. If your scripts rely on the current
> > exit
> > codes, you may need to update your scripts for 2.0.0.
> > -- 
> > Ken Gaillot 
> > 
> 
> 
-- 
Ken Gaillot 



Re: [ClusterLabs] Antw: Re: Antw: Changes coming in Pacemaker 2.0.0

2018-01-11 Thread Ken Gaillot
On Thu, 2018-01-11 at 12:52 +0100, Ulrich Windl wrote:
> >>> Andrei Borzenkov wrote on 11.01.2018 at 12:41 in message:
> > On Thu, Jan 11, 2018 at 10:54 AM, Ulrich Windl
> >  wrote:
> > > Hi!
> > > 
> > > On the tool changes, I'd prefer --move and --un-move as pair over
> > > --move and --clear 
> > 
> > ("clear" is less expressive IMHO).
> > 
> > --un-move is really wrong semantically. You do not "unmove"
> > resource -
> > you "clear" constraints that were created. Whether this actually
> > results in any "movement" is unpredictable (easily).
> 
> You undo what "move" does: "un-move". With your argument, "move" is
> just as bad: Why not "--forbid-host" and "--allow-host" then?

That's a good point actually. There's a tension between Pacemaker's
model (defining a desired state, and letting Pacemaker decide how to
get there) vs most people's intuition (defining actions to be taken).
Also, Pacemaker's XML syntax tends to be very flexible such that one
expression can convey multiple logical intents. So we see that
discrepancy sometimes in command naming vs implementation.

> 
> > 
> > Personally I find lack of any means to change resource state
> > non-persistently one of major usability issue with pacemaker
> > comparing
> > with other cluster stacks. Just a small example:
> > 
> > I wanted to show customer how "maintenance-mode" works. After
> > setting
> > maintenance-mode=yes for the cluster we found that database was
> > mysteriously restarted after being stopped manually. It took quite
> > some time to find out that couple of weeks ago "crm resource
> > manager"
> > followed by "crm resource unmanage" was run for this resource -
> > which
> > left explicit "managed=yes" on resource which took precedence over
> > "maintenance-mode".
> 
> Oops: Didn't know that!
> 
> > 
> > Not only is this asymmetrical and non-intuitive. There is no way to
> > distinguish temporary change from permanent one. Moving resources
> > is
> > special-cased but for any change that involves setting resource
> > (meta-)attributes this approach is not possible. Attribute is
> > there,
> > and we do not know why it was set.
> 
> Yes, the "lifetime" in a rule should not restrict what the rule does,
> but how long the rule exists. As garbage collection of expired rules
> (which does not exist yet) would have less accuracy than the lifetime
> (maybe specified in seconds), a combination could be used.

Expired rules are still useful -- e.g. for troubleshooting an event
that occurred while the rule was in effect, or for simulating events
that occur inside and outside the effective window.

It would be helpful though to have a new command to remove all expired
rules from the configuration, so an admin can conveniently clean up
periodically.
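
For reference, a time-limited constraint of the kind being discussed can
already be created like this (resource name is only illustrative):

 crm_resource --move --resource myres --lifetime PT1H   # expires after one hour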

> Regards,
> Ulrich
-- 
Ken Gaillot 



Re: [ClusterLabs] Antw: Re: Does anyone use clone instance constraints from pacemaker-next schema?

2018-01-11 Thread Ken Gaillot
On Thu, 2018-01-11 at 09:12 +0100, Ulrich Windl wrote:
> BTW: Could we fix that "Master/slave resources need different
> monitoring intervals for master and slave" at this time?

Unfortunately that would be a major project, as the interval is used to
identify the operation throughout the code base.

> 
> 
> >>> Jehan-Guillaume de Rorthais wrote on 11.01.2018 at 01:16 in message
> <20180111011616.496a383b@firost>:
> > On Wed, 10 Jan 2018 12:23:59 -0600
> > Ken Gaillot  wrote:
> > ...
> > > My question is: has anyone used or tested this, or is anyone
> > > interested
> > > in this? We won't promote it to the default schema unless it is
> > > tested.
> > > 
> > > My feeling is that it is more likely to be confusing than
> > > helpful, and
> > > there are probably ways to achieve any reasonable use case with
> > > existing syntax.
> > 
> > For what it worth, I tried to implement such solution to dispatch
> > mulitple
> > IP addresses to slaves in a 1 master 2 slaves cluster. This is
> > quite time
> > consuming to wrap its head around sides effects with colocation,
> > scores and
> > stickiness. My various tests shows everything sounds to behave
> > correctly 
> > now,
> > but I don't feel really 100% confident about my setup.
> > 
> > I agree that there are ways to achieve such a use case with
> > existing syntax.
> > But this is quite confusing as well. As instance, I experienced a
> > master
> > relocation when messing with a slave to make sure its IP would move
> > to the
> > other slave node...I don't remember exactly what was my error, but
> > I could
> > easily dig for it if needed.
> > 
> > I feel like it fits in the same area that the usability of
> > Pacemaker. Making 
> > it
> > easier to understand. See the recent discussion around the
> > gocardless war 
> > story.
> > 
> > My tests was mostly for labs, demo and tutorial purpose. I don't
> > have a
> > specific field use case. But if at some point this feature is
> > promoted
> > officially as preview, I'll give it some testing and report here
> > (barring 
> > the
> > fact I'm actually aware some feedback are requested ;)).
> > 
> 
> 
-- 
Ken Gaillot 



Re: [ClusterLabs] Recommendations for securing DLM traffic?

2018-01-11 Thread Bob Peterson
- Original Message -
| What are the general recommendations for securing traffic for DLM on port
| 21064?
| 
| It appears that this traffic is not signed or encrypted in any way so whilst
| there might not be any privacy issues with information disclosure it's not
| clear that the messages could not be replayed or otherwise spoofed.
| 
| Thanks,
| 
| Mark.

Hi Mark,

Perhaps you should send your question to the public cluster development
mailing list: cluster-de...@redhat.com

The dlm kernel developers are more likely to see it there.

For more info:
https://www.redhat.com/mailman/listinfo/cluster-devel

Regards,

Bob Peterson
Red Hat File Systems



[ClusterLabs] Antw: Re: Antw: Changes coming in Pacemaker 2.0.0

2018-01-11 Thread Ulrich Windl


>>> Andrei Borzenkov wrote on 11.01.2018 at 12:41 in message:
> On Thu, Jan 11, 2018 at 10:54 AM, Ulrich Windl
>  wrote:
>> Hi!
>>
>> On the tool changes, I'd prefer --move and --un-move as pair over --move and 
>> --clear 
> ("clear" is less expressive IMHO).
> 
> --un-move is really wrong semantically. You do not "unmove" resource -
> you "clear" constraints that were created. Whether this actually
> results in any "movement" is unpredictable (easily).

You undo what "move" does: "un-move". With your argument, "move" is just as 
bad: Why not "--forbid-host" and "--allow-host" then?

> 
> Personally I find lack of any means to change resource state
> non-persistently one of major usability issue with pacemaker comparing
> with other cluster stacks. Just a small example:
> 
> I wanted to show customer how "maintenance-mode" works. After setting
> maintenance-mode=yes for the cluster we found that database was
> mysteriously restarted after being stopped manually. It took quite
> some time to find out that couple of weeks ago "crm resource manager"
> followed by "crm resource unmanage" was run for this resource - which
> left explicit "managed=yes" on resource which took precedence over
> "maintenance-mode".

Oops: Didn't know that!

> 
> Not only is this asymmetrical and non-intuitive. There is no way to
> distinguish temporary change from permanent one. Moving resources is
> special-cased but for any change that involves setting resource
> (meta-)attributes this approach is not possible. Attribute is there,
> and we do not know why it was set.

Yes, the "lifetime" in a rule should not restrict what the rule does, but how 
long the rule exists. As garbage collection of expired rules (which does not 
exist yet) would have less accuracy than the lifetime (maybe specified in 
seconds), a combination could be used.

Regards,
Ulrich

> 




Re: [ClusterLabs] Antw: Changes coming in Pacemaker 2.0.0

2018-01-11 Thread Andrei Borzenkov
On Thu, Jan 11, 2018 at 10:54 AM, Ulrich Windl
 wrote:
> Hi!
>
> On the tool changes, I'd prefer --move and --un-move as pair over --move and 
> --clear ("clear" is less expressive IMHO).

--un-move is really wrong semantically. You do not "unmove" resource -
you "clear" constraints that were created. Whether this actually
results in any "movement" is unpredictable (easily).

Personally, I find the lack of any means to change resource state
non-persistently one of the major usability issues with Pacemaker compared
with other cluster stacks. Just a small example:

I wanted to show a customer how "maintenance-mode" works. After setting
maintenance-mode=yes for the cluster, we found that the database was
mysteriously restarted after being stopped manually. It took quite
some time to find out that a couple of weeks ago "crm resource manage"
followed by "crm resource unmanage" had been run for this resource, which
left an explicit "managed=yes" on the resource that took precedence over
"maintenance-mode".

Not only is this asymmetrical and non-intuitive; there is also no way to
distinguish a temporary change from a permanent one. Moving resources is
special-cased, but for any change that involves setting resource
(meta-)attributes this approach is not possible. The attribute is there,
and we do not know why it was set.
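
For example, one at least has to know to go looking for the leftover
attribute (resource name is illustrative; in current Pacemaker the meta
attribute is "is-managed"):

 crm_resource --resource db --meta --get-parameter is-managed
 crm_resource --resource db --meta --delete-parameter is-managed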



[ClusterLabs] Recommendations for securing DLM traffic?

2018-01-11 Thread Mark Syms
What are the general recommendations for securing traffic for DLM on port 21064?

It appears that this traffic is not signed or encrypted in any way so whilst 
there might not be any privacy issues with information disclosure it's not 
clear that the messages could not be replayed or otherwise spoofed.

Thanks,

Mark.


[ClusterLabs] Antw: Re: Antw: pacemaker reports monitor timeout while CPU is high

2018-01-11 Thread Ulrich Windl
Hi!

A few years ago I was playing with cgroups, getting quite interesting (useful)
results, but applying the cgroups to existing and newly started processes was
quite hard to integrate into the OS, so I did not proceed that way. I think
cgroups is even more powerful today, but I haven't followed how easy it is to
use in systems based on systemd (which uses cgroups heavily AFAIK).

In short: You may be unable to control the client processes, but you could
control the server processes the clients start.

Regards,
Ulrich


>>> 范国腾 wrote on 11.01.2018 at 05:01 in message
<492a1ace20c04e85bc4979307af2a...@ex01.highgo.com>:
> Ulrich,
> 
> Thank you very much for the help. When we do the performance test, our
> application (pgsql-ha) will start more than 500 processes to handle the
> client requests. Could this cause the issue?
> 
> Is there any workaround or method to keep Pacemaker from restarting the
> resource in such a situation? Right now the system cannot work when the
> client sends a high call load, and we cannot control the client's behavior.
> 
> Thanks
> 
> 
> -----Original Message-----
> From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
> Sent: January 10, 2018 18:20
> To: users@clusterlabs.org
> Subject: [ClusterLabs] Antw: pacemaker reports monitor timeout while CPU is high
> 
> Hi!
> 
> I can only talk for myself: in former times with HP-UX, we had severe
> performance problems when the load was in the range of 8 to 14 (I/O waits
> not included, average for all logical CPUs), while in Linux we are getting
> problems with a load above 40 (or so) (I/O included, sum of all logical
> CPUs (which are 24)). Also, I/O waits cause cluster timeouts before CPU
> load actually matters (for us).
> So with a load above 400 (not knowing your number of CPUs) it should not be
> that unusual. What is the number of threads in your system at that time?
> It might be worth the effort to bind the cluster processes to specific CPUs
> and keep other tasks away from those, but I don't have experience with that.
> I guess the "High CPU load detected" message triggers some internal suspend
> in the cluster engine (assuming the cluster engine caused the high load). Of
> course, for "external" load that measure won't help...
> 
> Regards,
> Ulrich
> 
> 
> >>> 范国腾 wrote on 10.01.2018 at 10:40 in message
> <4dc98a5d9be144a78fb9a18721743...@ex01.highgo.com>:
>> Hello,
>> 
>> This issue only appears when we run the performance test and the CPU is
>> high. The cluster and log are as below. Pacemaker will restart the
>> slave-side pgsql-ha resource about every two minutes.
>> 
>> Take the following scenario for example (when the pgsqlms RA is
>> called, we print the log “execute the command start (command)”; when
>> the command returns, we print the log “execute the command stop
>> (command) (result)”):
>> 
>> 1. We could see that Pacemaker calls “pgsqlms monitor” about every
>> 15 seconds, and it returns $OCF_SUCCESS.
>> 
>> 2. It calls the monitor command again at 13:56:16, and then it reports
>> a timeout error at 13:56:18. It is only 2 seconds, but it reports
>> “timeout=1ms”.
>> 
>> 3. In other logs, sometimes after 15 minutes, there is no “execute the
>> command start monitor” printed and it reports a timeout error directly.
>> 
>> Could you please tell how to debug or resolve such issue?
>> 
>> The log:
>> 
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
>> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
>> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
>> Jan 10 13:56:02 sds2 crmd[26096]:  notice: High CPU load detected: 426.77
>> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
>> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
>> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 1ms
>> Jan 10 13:56:18 sds2 crmd[26096]:   error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=1ms
>> Jan 10 13:56:18 sds2 crmd[26096]:  notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
>> Jan 10 13:56:18 sds2 crmd[26096]:  notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
>> Jan 10 13:56:19 sds2 pengine[26095]:

[ClusterLabs] Antw: Re: Does anyone use clone instance constraints from pacemaker-next schema?

2018-01-11 Thread Ulrich Windl
BTW: could we also fix the limitation that "Master/slave resources need different
monitoring intervals for master and slave" while we are at it?
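
For anyone tripping over this with the current syntax, a minimal sketch of the
usual workaround (assuming pcs as the shell and a pgsqld primitive inside the
master/slave resource) is simply to give the two roles distinct intervals:

  pcs resource op add pgsqld monitor interval=15s role=Master timeout=10s
  pcs resource op add pgsqld monitor interval=16s role=Slave timeout=10s

As far as I understand, recurring operations are identified by action plus
interval, which is why two monitors with the same interval but different roles
are rejected; the question above is whether that restriction itself can be
lifted.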


>>> Jehan-Guillaume de Rorthais  wrote on 11.01.2018 at 01:16 in message
<20180111011616.496a383b@firost>:
> On Wed, 10 Jan 2018 12:23:59 -0600
> Ken Gaillot  wrote:
> ...
>> My question is: has anyone used or tested this, or is anyone interested
>> in this? We won't promote it to the default schema unless it is tested.
>> 
>> My feeling is that it is more likely to be confusing than helpful, and
>> there are probably ways to achieve any reasonable use case with
>> existing syntax.
> 
> For what it's worth, I tried to implement such a solution to dispatch multiple
> IP addresses to slaves in a 1-master/2-slave cluster. It is quite time-consuming
> to wrap one's head around the side effects of colocation, scores and stickiness.
> My various tests show that everything seems to behave correctly now, but I
> don't feel 100% confident about my setup.
> 
> I agree that there are ways to achieve such a use case with the existing syntax.
> But that is quite confusing as well. For instance, I experienced a master
> relocation when messing with a slave to make sure its IP would move to the
> other slave node... I don't remember exactly what my error was, but I could
> easily dig it up if needed.
> 
> I feel like this fits in the same area as the usability of Pacemaker: making it
> easier to understand. See the recent discussion around the gocardless war story.
> 
> My tests were mostly for lab, demo and tutorial purposes. I don't have a
> specific field use case. But if at some point this feature is officially
> promoted as a preview, I'll give it some testing and report here (now that
> I'm actually aware that some feedback is requested ;)).
> 


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Re: Coming in Pacemaker 2.0.0: /var/log/pacemaker/pacemaker.log

2018-01-11 Thread Ulrich Windl
Maybe the question to ask right now would be: What are the modules, and what 
are their logfile locations? An opportunity to clean up the mess!


>>> Adam Spiers  wrote on 11.01.2018 at 00:59 in message
<20180110235939.fvwkormbruoqhwfb@pacific.linksys.moosehall>:
> Ken Gaillot  wrote: 
>> The initial proposal, after discussion at last year's summit, was to
>> use /var/log/cluster/pacemaker.log instead. That turned out to be slightly
>> problematic: it broke some regression tests in a way that wasn't easily
>> fixable, and more significantly, it raises the question of what package
>> should own /var/log/cluster (which different distributions might want to
>> answer differently).
> 
> I thought one option aired at the summit to address this was 
> /var/log/clusterlabs, but it's entirely possible my memory's playing 
> tricks on me again. 
> 


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Coming in Pacemaker 2.0.0: /var/log/pacemaker/pacemaker.log

2018-01-11 Thread Ulrich Windl
Hi!

More than the location of the log file, I'm interested in its contents: the
log should have a formal syntax for automated parsing, and it should be as
compact as possible.
Considering lines like:
Jan 07 10:07:41 [10691] h01       pengine:     info: determine_online_status_fencing:  Node h01 is active
there are too many repeated blanks in them. Also, the field order may not be
optimal: why not put the priority ("info:") right after the host name? The
function name "determine_online_status_fencing" wouldn't lose much information
if it were just called "fencing_status" either (IMHO). That would move the most
important information, which currently comes last ("Node h01 is active"), back
toward a common line limit (which is not 160, BTW ;-)).
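
Until something like that exists, here is a sketch of what "automated parsing"
has to look like against the current layout (assuming the field order of the
example line above: timestamp, [pid], host, daemon, priority, function,
message), collapsing the padding and pulling the priority forward:

  sed -nE 's/^[A-Z][a-z]{2} +[0-9 ]{1,2} ([0-9:]{8}) \[[0-9]+\] +([^ ]+) +([^ ]+): +([^ ]+): +([^ ]+): +(.*)$/\1 \2 \4 \3 \5: \6/p' \
      /var/log/pacemaker.log
  # prints e.g.: 10:07:41 h01 info pengine determine_online_status_fencing: Node h01 is active

A formal, documented syntax would of course make such guesswork unnecessary.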

Regards,
Ulrich


>>> Ken Gaillot  wrote on 10.01.2018 at 23:34 in message
<1515623653.4815.23.ca...@redhat.com>:
> Starting with Pacemaker 2.0.0, the Pacemaker detail log will be kept by
> default in /var/log/pacemaker/pacemaker.log (rather than
> /var/log/pacemaker.log). This will keep /var/log cleaner.
> 
> Pacemaker will still prefer any log file specified in corosync.conf.
> 
> The initial proposal, after discussion at last year's summit, was to
> use /var/log/cluster/pacemaker.log instead. That turned out to be slightly 
> problematic: it broke some regression tests in a way that wasn't easily 
> fixable, and more significantly, it raises the question of what package 
> should own /var/log/cluster (which different distributions might want to 
> answer differently).
> 
> So instead, the default log locations can be overridden when building
> pacemaker. The ./configure script now has these two options:
> 
> --with-logdir
> Where to keep pacemaker.log (default /var/log/pacemaker)
> 
> --with-bundledir
> Where to keep bundle logs (default /var/log/pacemaker/bundles, which
> hasn't changed)
> 
> Thus, if a packager wants to preserve the 1.1 locations, they can use:
> 
> ./configure --with-logdir=/var/log
> 
> And if a packager wants to use /var/log/cluster as originally planned,
> they can use:
> 
> ./configure --with-logdir=/var/log/cluster --with-
> bundledir=/var/log/cluster/bundles
> 
> and ensure that pacemaker depends on whatever package owns
> /var/log/cluster.
> -- 
> Ken Gaillot 
> 


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Coming in Pacemaker 2.0.0: Reliable exit codes

2018-01-11 Thread Ulrich Windl
Hi!

Will those exit codes be compatible with , i.e. will they be a
superset or a subset of it? If not, now would be the right time.
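
For scripts, the affected pattern is a check like the following minimal sketch
(assuming a resource named pgsqld; crm_error --exit is the 2.0 form from the
quoted announcement below):

  #!/bin/sh
  # clean up a resource and report a human-readable reason on failure
  crm_resource --resource pgsqld --cleanup
  rc=$?
  if [ "$rc" -ne 0 ]; then
      echo "cleanup failed ($rc): $(crm_error --exit "$rc")" >&2
      exit "$rc"
  fi

Whether such numeric checks keep working depends exactly on the superset/subset
question above.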

Regards,
Ulrich


>>> Ken Gaillot  wrote on 10.01.2018 at 23:22 in message
<1515622941.4815.21.ca...@redhat.com>:
> Every time you run a command on the command line or in a script, it
> returns an exit status. These are most useful in scripts to check for
> errors.
> 
> Currently, Pacemaker daemons and command-line tools return an
> unreliable mishmash of exit status codes, sometimes including negative
> numbers (which get bitwise-remapped to the 0-255 range) and/or C
> library errno codes (which can vary across OSes).
> 
> The only thing scripts could rely on was 0 means success and nonzero
> means error.
> 
> Beginning with Pacemaker 2.0.0, everything will return a well-defined
> set of reliable exit status codes. These codes can be viewed using the
> existing crm_error tool using the --exit parameter. For example:
> 
> crm_error --exit --list
> 
> will list all possible exit statuses, and
> 
> crm_error --exit 124
> 
> will show a textual description of what exit status 124 means.
> 
> This will mainly be of interest to users who script Pacemaker commands
> and check the return value. If your scripts rely on the current exit
> codes, you may need to update your scripts for 2.0.0.
> -- 
> Ken Gaillot 
> 


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org