[ClusterLabs] redis and pgsql RAs under pacemaker_remote do not work
Hi! I'm running redis and PostgreSQL in LXC containers with pacemaker_remote, on hosts running the full cluster stack. The pacemaker_remote installation does not ship the crm_attribute utility, so the RAs in the subject do not work. Maybe crm_resource can be used for the same purpose, storing the state data in resource attributes? Or is there any other solution?

-- /aTan

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
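For context, the attribute call these RAs depend on, and the crm_resource-based workaround floated above, would look roughly like this. The attribute name and value are illustrative, not taken from the actual agents, and whether the agents' promotion logic survives such a switch is exactly the open question:

```shell
# What the redis/pgsql agents rely on today; fails on a pacemaker_remote
# node where crm_attribute is not installed:
crm_attribute --node "$(hostname)" --lifetime reboot \
              --name master-pgsqld --update 1001

# The idea from the question: persist the same state in a resource
# meta attribute via crm_resource instead (attribute name hypothetical):
crm_resource --resource pgsqld --set-parameter state-score \
             --meta --parameter-value 1001
```

Both commands require a running cluster (or remote) node with the pacemaker CLI tools, so treat this as a sketch of the idea rather than a verified fix.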
[ClusterLabs] Re: Re: pacemaker reports monitor timeout while CPU is high
Thank you very much, Ken. I will set the higher timeout and try.

-----Original Message-----
From: Ken Gaillot [mailto:kgail...@redhat.com]
Sent: 2018-01-11 23:48
To: Cluster Labs - All topics related to open-source clustering welcomed
Cc: 王亮
Subject: Re: [ClusterLabs] Re: pacemaker reports monitor timeout while CPU is high

On Thu, 2018-01-11 at 03:50 +, 范国腾 wrote:
> Thank you, Ken.
>
> We have set the timeout to be 10 seconds, but it reports a timeout
> after only 2 seconds, so setting higher timeouts does not seem to work.
> Our application, which is managed by Pacemaker, starts more than
> 500 processes when running the performance test. Does that affect the
> result? Which log could help us analyze this?
>
> > monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)

It's not timing out after 2 seconds. The message:

sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor

indicates that the monitor's process ID is 5240, but the message:

sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out

indicates that the monitor that timed out had process ID 5606. That means there were two separate monitors in progress. I'm not sure why; I wouldn't expect the second one to be started until after the first one had timed out. But it's possible that with the high load the log messages were simply written to the log out of order, since they were written by different processes. I would just raise the timeout higher than 10s during the test.

> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: 2018-01-11 0:54
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] pacemaker reports monitor timeout while CPU is high
>
> On Wed, 2018-01-10 at 09:40 +, 范国腾 wrote:
> > Hello,
> >
> > This issue only appears when we run the performance test and the CPU
> > load is high. The cluster and log are as below. Pacemaker restarts
> > the slave-side pgsql-ha resource about every two minutes.
> >
> > Take the following scenario for example (when the pgsqlms RA is
> > called, we print the log "execute the command start (command)"; when
> > the command returns, we print the log "execute the command stop
> > (command) (result)"):
> > 1. We can see that Pacemaker calls "pgsqlms monitor" about every
> > 15 seconds, and it returns $OCF_SUCCESS.
> > 2. It calls the monitor command again at 13:56:16, and then reports a
> > timeout error at 13:56:18. That is only 2 seconds, but it reports
> > "timeout=1ms".
> > 3. In other logs, sometimes after 15 minutes there is no
> > "execute the command start monitor" printed and it reports the
> > timeout error directly.
> >
> > Could you please tell us how to debug or resolve this issue?
> >
> > The log:
> >
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
> > Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected: 426.77
> > Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
> > Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
>
> There's something more going on than in this log snippet. Notice that
> the process that timed out (5606) is not one of the processes that
> logged above (5240 and 5477).
>
> Generally, once load gets that high, it's very difficult to maintain
> responsiveness, and the expectation is that another node will fence it.
> But it can often be worked around with high timeouts, and/or you can
> use rules to set higher timeouts or maintenance mode during times when
> high load is expected.
>
> > Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 1ms
> > Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=1ms
> > Jan 10 13:56:18 sds2 crmd[26096]: notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
> > Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
> > Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed
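Ken's suggested workaround, raising the timeout and/or using maintenance mode during expected high-load windows, can be sketched with pcs (matching the pcs-style operation id shown in the thread). The 60s value is an arbitrary assumption, not a recommendation from the thread:

```shell
# Raise the slave monitor timeout well above the load-induced latency;
# resource and operation mirror the thread, 60s is an assumed value:
pcs resource update pgsqld op monitor interval=16s role=Slave timeout=60s

# Or put the cluster in maintenance mode for the duration of the
# performance test, so failures during the test are ignored:
pcs property set maintenance-mode=true
# ... run the load test ...
pcs property set maintenance-mode=false
```

These commands act on a live cluster, so they are a command sketch rather than something testable in isolation.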
[ClusterLabs] pengine bug? Recovery after monitor failure: Restart of DRBD does not restart Filesystem -- unless explicit order start before promote on DRBD
To understand some weird behavior we observed, I dumbed down a production config to three dummy resources, while keeping some descriptive resource ids (ip, drbd, fs). For some reason, the constraints are: stuff, more stuff, IP -> DRBD -> FS -> other stuff. (In the actual real-world config it makes somewhat more sense, but it reproduces with just these three resources.)

All is running just fine:

Online: [ ava emma ]
 virtual_ip  (ocf::pacemaker:Dummy): Started ava
 Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
     Masters: [ ava ]
 p_fs_drbd1  (ocf::pacemaker:Dummy): Started ava

If I simulate a monitor failure on IP:

# crm_simulate -L -i virtual_ip_monitor_3@ava=1
Transition Summary:
 * Recover virtual_ip  (Started ava)
 * Restart p_drbd_r0:0 (Master ava)

which in real life will obviously fail, because we cannot "restart" (demote) a DRBD while it is still in use (mounted, in this case). Only if I add a stupid intra-resource order constraint that explicitly states to first start, then promote the DRBD itself, do I get the result I would have expected:

Transition Summary:
 * Recover virtual_ip  (Started ava)
 * Restart p_drbd_r0:0 (Master ava)
 * Restart p_fs_drbd1  (Started ava)

Interestingly enough, if I simulate a monitor failure on "DRBD" directly, the result is in both cases the expected:

Transition Summary:
 * Recover p_drbd_r0:0 (Master ava)
 * Restart p_fs_drbd1  (Started ava)

What am I missing? Do we have to "annotate" somewhere that you must not demote something if it is still "in use" by something else? Did I just screw up the constraints somehow? What would the constraints need to look like to get the expected result, without explicitly adding the first-start-then-promote constraint? Is (was?) this a pengine bug?
How to reproduce:
=
crm shell style dummy config:
--
node 1: ava
node 2: emma
primitive p_drbd_r0 ocf:pacemaker:Stateful \
    op monitor interval=29s role=Master \
    op monitor interval=31s role=Slave
primitive p_fs_drbd1 ocf:pacemaker:Dummy \
    op monitor interval=20 timeout=40
primitive virtual_ip ocf:pacemaker:Dummy \
    op monitor interval=30s
ms ms_drbd_r0 p_drbd_r0 \
    meta master-max=1 master-node-max=1 clone-max=1 clone-node-max=1
colocation c1 inf: ms_drbd_r0 virtual_ip
colocation c2 inf: p_fs_drbd1:Started ms_drbd_r0:Master
order o1 inf: virtual_ip:start ms_drbd_r0:start
order o2 inf: ms_drbd_r0:promote p_fs_drbd1:start
--

crm_simulate -x bad.xml -i virtual_ip_monitor_3@ava=1
  trying to demote DRBD before umount :-((

Adding the stupid constraint:
order first-start-then-promote inf: ms_drbd_r0:start ms_drbd_r0:promote

crm_simulate -x good.xml -i virtual_ip_monitor_3@ava=1
  yay, first umount, then demote...

(Tested with 1.1.15 and 1.1.16, not yet with a more recent code base.)

Full good.xml and bad.xml are both attached.

Manipulating the constraint in the live cib using cibadmin only:
add: cibadmin -C -o constraints -X ''
del: cibadmin -D -X ''

Thanks, Lars
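The XML arguments to the cibadmin calls above were stripped by the list archive. Based on the crm shell form of the constraint ("order first-start-then-promote inf: ms_drbd_r0:start ms_drbd_r0:promote"), they were presumably along these lines; this is a reconstruction, not the original text:

```shell
# Add the intra-resource order constraint (XML reconstructed from the
# crm shell syntax shown above):
cibadmin -C -o constraints \
  -X '<rsc_order id="first-start-then-promote" score="INFINITY" first="ms_drbd_r0" first-action="start" then="ms_drbd_r0" then-action="promote"/>'

# Delete it again (cibadmin -D matches on the given XML):
cibadmin -D \
  -X '<rsc_order id="first-start-then-promote"/>'
```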
Re: [ClusterLabs] Does anyone use clone instance constraints from pacemaker-next schema?
On Thu, 11 Jan 2018 12:00:25 -0600 Ken Gaillot wrote:
> On Thu, 2018-01-11 at 20:11 +0300, Andrei Borzenkov wrote:
> > On 11.01.2018 19:21, Ken Gaillot wrote:
> > > On Thu, 2018-01-11 at 01:16 +0100, Jehan-Guillaume de Rorthais wrote:
> > > > On Wed, 10 Jan 2018 12:23:59 -0600 Ken Gaillot wrote:
> > > > ...
> > > > > My question is: has anyone used or tested this, or is anyone
> > > > > interested in this? We won't promote it to the default schema
> > > > > unless it is tested.
> > > > >
> > > > > My feeling is that it is more likely to be confusing than
> > > > > helpful, and there are probably ways to achieve any reasonable
> > > > > use case with existing syntax.
> > > >
> > > > For what it's worth, I tried to implement such a solution to
> > > > dispatch multiple IP addresses to slaves in a 1-master, 2-slave
> > > > cluster. It is quite time-consuming to wrap one's head around the
> > > > side effects of colocation, scores and stickiness. My various
> > > > tests show everything seems to behave correctly now, but I don't
> > > > feel 100% confident about my setup.
> > > >
> > > > I agree that there are ways to achieve such a use case with
> > > > existing syntax. But this is quite confusing as well. For
> > > > instance, I experienced a master relocation when messing with a
> > > > slave to make sure its IP would move to the other slave node...
> > > > I don't remember exactly what my error was, but I could easily
> > > > dig for it if needed.
> > > >
> > > > I feel like it fits in the same area as the usability of
> > > > Pacemaker: making it easier to understand. See the recent
> > > > discussion around the gocardless war story.
> > > >
> > > > My tests were mostly for labs, demo and tutorial purposes. I
> > > > don't have a specific field use case.
> > > > But if at some point this feature is promoted officially as a
> > > > preview, I'll give it some testing and report here (barring the
> > > > fact that I'm actually aware some feedback is requested ;)).
> > >
> > > It's ready to be tested now -- just do this:
> > >
> > > cibadmin --upgrade
> > > cibadmin --modify --xml-text ''
> > >
> > > Then use constraints like:
> > >
> > > rsc="rsc1" with-rsc="clone1" with-rsc-instance="1" />
> > >
> > > rsc="rsc2" with-rsc="clone1" with-rsc-instance="2" />
> > >
> > > to colocate rsc1 and rsc2 with separate instances of clone1. There
> > > is no way to know *which* instance of clone1 will be 1, 2, etc.;
> > > this just allows you to ensure the colocations are separate.
> >
> > Is it possible to designate master/slave as well?
>
> If you mean constrain one resource to the master, and a bunch of other
> resources to the slaves, then no, this new syntax doesn't support that.
> But it should be possible with existing syntax, by constraining with
> role=master or role=slave, then anticolocating the resources with each
> other.

Oh, wait, this is a deal breaker then... This was exactly my use case:

* giving a specific IP address to the master
* providing various IP addresses to the slaves

I suppose I'm stuck with the existing syntax then.
Re: [ClusterLabs] Does anyone use clone instance constraints from pacemaker-next schema?
On Thu, 2018-01-11 at 20:11 +0300, Andrei Borzenkov wrote:
> On 11.01.2018 19:21, Ken Gaillot wrote:
> > On Thu, 2018-01-11 at 01:16 +0100, Jehan-Guillaume de Rorthais wrote:
> > > On Wed, 10 Jan 2018 12:23:59 -0600 Ken Gaillot wrote:
> > > ...
> > > > My question is: has anyone used or tested this, or is anyone
> > > > interested in this? We won't promote it to the default schema
> > > > unless it is tested.
> > > >
> > > > My feeling is that it is more likely to be confusing than
> > > > helpful, and there are probably ways to achieve any reasonable
> > > > use case with existing syntax.
> > >
> > > For what it's worth, I tried to implement such a solution to
> > > dispatch multiple IP addresses to slaves in a 1-master, 2-slave
> > > cluster. It is quite time-consuming to wrap one's head around the
> > > side effects of colocation, scores and stickiness. My various tests
> > > show everything seems to behave correctly now, but I don't feel
> > > 100% confident about my setup.
> > >
> > > I agree that there are ways to achieve such a use case with
> > > existing syntax. But this is quite confusing as well. For instance,
> > > I experienced a master relocation when messing with a slave to make
> > > sure its IP would move to the other slave node... I don't remember
> > > exactly what my error was, but I could easily dig for it if needed.
> > >
> > > I feel like it fits in the same area as the usability of Pacemaker:
> > > making it easier to understand. See the recent discussion around
> > > the gocardless war story.
> > >
> > > My tests were mostly for labs, demo and tutorial purposes. I don't
> > > have a specific field use case. But if at some point this feature
> > > is promoted officially as a preview, I'll give it some testing and
> > > report here (barring the fact that I'm actually aware some feedback
> > > is requested ;)).
> >
> > It's ready to be tested now -- just do this:
> >
> > cibadmin --upgrade
> > cibadmin --modify --xml-text ''
> >
> > Then use constraints like:
> >
> > rsc="rsc1" with-rsc="clone1" with-rsc-instance="1" />
> >
> > rsc="rsc2" with-rsc="clone1" with-rsc-instance="2" />
> >
> > to colocate rsc1 and rsc2 with separate instances of clone1. There is
> > no way to know *which* instance of clone1 will be 1, 2, etc.; this
> > just allows you to ensure the colocations are separate.
>
> Is it possible to designate master/slave as well?

If you mean constrain one resource to the master, and a bunch of other resources to the slaves, then no, this new syntax doesn't support that. But it should be possible with existing syntax, by constraining with role=master or role=slave, then anticolocating the resources with each other.

> > Similarly you can use rsc="clone1" rsc-instance="1" to colocate a
> > clone instance relative to another resource instead.
> >
> > For ordering, the corresponding syntax is "first-instance" or
> > "then-instance" as desired.
> >
> > I believe crm shell has higher-level support for this feature.
> >
> > Personally, I think standard colocations of rsc1 and rsc2 with
> > clone1, and then an anticolocation between rsc1 and rsc2, would be
> > more intuitive. You're right that the interactions with stickiness
> > etc. can be tricky, but that would apply to the alternate syntax as
> > well.

-- Ken Gaillot
Re: [ClusterLabs] Does anyone use clone instance constraints from pacemaker-next schema?
On 11.01.2018 19:21, Ken Gaillot wrote:
> On Thu, 2018-01-11 at 01:16 +0100, Jehan-Guillaume de Rorthais wrote:
>> On Wed, 10 Jan 2018 12:23:59 -0600 Ken Gaillot wrote:
>> ...
>>> My question is: has anyone used or tested this, or is anyone
>>> interested in this? We won't promote it to the default schema unless
>>> it is tested.
>>>
>>> My feeling is that it is more likely to be confusing than helpful,
>>> and there are probably ways to achieve any reasonable use case with
>>> existing syntax.
>>
>> For what it's worth, I tried to implement such a solution to dispatch
>> multiple IP addresses to slaves in a 1-master, 2-slave cluster. It is
>> quite time-consuming to wrap one's head around the side effects of
>> colocation, scores and stickiness. My various tests show everything
>> seems to behave correctly now, but I don't feel 100% confident about
>> my setup.
>>
>> I agree that there are ways to achieve such a use case with existing
>> syntax. But this is quite confusing as well. For instance, I
>> experienced a master relocation when messing with a slave to make sure
>> its IP would move to the other slave node... I don't remember exactly
>> what my error was, but I could easily dig for it if needed.
>>
>> I feel like it fits in the same area as the usability of Pacemaker:
>> making it easier to understand. See the recent discussion around the
>> gocardless war story.
>>
>> My tests were mostly for labs, demo and tutorial purposes. I don't
>> have a specific field use case. But if at some point this feature is
>> promoted officially as a preview, I'll give it some testing and report
>> here (barring the fact that I'm actually aware some feedback is
>> requested ;)).
>
> It's ready to be tested now -- just do this:
>
> cibadmin --upgrade
> cibadmin --modify --xml-text ''
>
> Then use constraints like:
>
> rsc="rsc1" with-rsc="clone1" with-rsc-instance="1" />
>
> rsc="rsc2" with-rsc="clone1" with-rsc-instance="2" />
>
> to colocate rsc1 and rsc2 with separate instances of clone1. There is
> no way to know *which* instance of clone1 will be 1, 2, etc.; this just
> allows you to ensure the colocations are separate.

Is it possible to designate master/slave as well?

> Similarly you can use rsc="clone1" rsc-instance="1" to colocate a clone
> instance relative to another resource instead.
>
> For ordering, the corresponding syntax is "first-instance" or
> "then-instance" as desired.
>
> I believe crm shell has higher-level support for this feature.
>
> Personally, I think standard colocations of rsc1 and rsc2 with clone1,
> and then an anticolocation between rsc1 and rsc2, would be more
> intuitive. You're right that the interactions with stickiness etc. can
> be tricky, but that would apply to the alternate syntax as well.
Re: [ClusterLabs] Antw: Changes coming in Pacemaker 2.0.0
On Thu, 2018-01-11 at 08:54 +0100, Ulrich Windl wrote:
> On "--crm_xml -> --xml-text": Why not simply "--xml" (XML IS text)?

Most Pacemaker tools that accept XML can get it from standard input (--xml-pipe), a file (--xml-file), or a literal string (--xml-text). Although, looking at it now, it might be nice to reduce it to one option:

--xml -         standard input
--xml '<...>'   anything starting with '<' is a literal string
--xml file      anything else is a filename

-- Ken Gaillot
Re: [ClusterLabs] Changes coming in Pacemaker 2.0.0
On Thu, 2018-01-11 at 01:21 +0100, Jehan-Guillaume de Rorthais wrote:
> On Wed, 10 Jan 2018 16:10:50 -0600 Ken Gaillot wrote:
>
> > Pacemaker 2.0 will be a major update whose main goal is to remove
> > support for deprecated, legacy syntax, in order to make the code base
> > more maintainable into the future. There will also be some changes to
> > default configuration behavior, and the command-line tools.
> >
> > I'm hoping to release the first release candidate in the next couple
> > of weeks.
>
> Great news! Congrats.
>
> > We'll have a longer than usual rc phase to allow for plenty of
> > testing.
> >
> > A thoroughly detailed list of changes will be maintained on the
> > ClusterLabs wiki:
> >
> > https://wiki.clusterlabs.org/wiki/Pacemaker_2.0_Changes
> >
> > These changes are not final, and we can restore functionality if
> > there is a strong need for it. Most user-visible changes are complete
> > (in the 2.0 branch on github); major changes are still expected, but
> > primarily to the C API.
> >
> > Some highlights:
> >
> > * Only Corosync version 2 will be supported as the underlying cluster
> > layer. Support for Heartbeat and Corosync 1 is removed. (Support for
> > the new kronosnet layer will be added in a future version.)
>
> I thought (according to some conference slides from September 2017)
> knet was mostly related to corosync directly? Is there some visible
> impact on Pacemaker too?

You're right -- it's more accurate to say that corosync 3 will support knet, and I'm not yet aware whether the corosync 3 API will require any changes in Pacemaker.

-- Ken Gaillot
Re: [ClusterLabs] Does anyone use clone instance constraints from pacemaker-next schema?
On Thu, 2018-01-11 at 01:16 +0100, Jehan-Guillaume de Rorthais wrote:
> On Wed, 10 Jan 2018 12:23:59 -0600 Ken Gaillot wrote:
> ...
> > My question is: has anyone used or tested this, or is anyone
> > interested in this? We won't promote it to the default schema unless
> > it is tested.
> >
> > My feeling is that it is more likely to be confusing than helpful,
> > and there are probably ways to achieve any reasonable use case with
> > existing syntax.
>
> For what it's worth, I tried to implement such a solution to dispatch
> multiple IP addresses to slaves in a 1-master, 2-slave cluster. It is
> quite time-consuming to wrap one's head around the side effects of
> colocation, scores and stickiness. My various tests show everything
> seems to behave correctly now, but I don't feel 100% confident about my
> setup.
>
> I agree that there are ways to achieve such a use case with existing
> syntax. But this is quite confusing as well. For instance, I
> experienced a master relocation when messing with a slave to make sure
> its IP would move to the other slave node... I don't remember exactly
> what my error was, but I could easily dig for it if needed.
>
> I feel like it fits in the same area as the usability of Pacemaker:
> making it easier to understand. See the recent discussion around the
> gocardless war story.
>
> My tests were mostly for labs, demo and tutorial purposes. I don't have
> a specific field use case. But if at some point this feature is
> promoted officially as a preview, I'll give it some testing and report
> here (barring the fact that I'm actually aware some feedback is
> requested ;)).

It's ready to be tested now -- just do this:

cibadmin --upgrade
cibadmin --modify --xml-text ''

Then use constraints like:

to colocate rsc1 and rsc2 with separate instances of clone1. There is no way to know *which* instance of clone1 will be 1, 2, etc.; this just allows you to ensure the colocations are separate.
Similarly you can use rsc="clone1" rsc-instance="1" to colocate a clone instance relative to another resource instead.

For ordering, the corresponding syntax is "first-instance" or "then-instance" as desired.

I believe crm shell has higher-level support for this feature.

Personally, I think standard colocations of rsc1 and rsc2 with clone1, and then an anticolocation between rsc1 and rsc2, would be more intuitive. You're right that the interactions with stickiness etc. can be tricky, but that would apply to the alternate syntax as well.

-- Ken Gaillot
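The constraint XML in Ken's message was eaten by the archive. From the surviving attribute fragments (rsc, with-rsc, with-rsc-instance) elsewhere in the thread, the examples were presumably shaped like this; the ids, the score, and the exact --xml-text payload are guesses:

```shell
# Switch the CIB to the pacemaker-next schema; the stripped --xml-text
# payload was presumably something along these lines:
cibadmin --upgrade
cibadmin --modify --xml-text '<cib validate-with="pacemaker-next"/>'

# Colocate rsc1 and rsc2 with separate instances of clone1
# (ids "colo-rsc1-clone1"/"colo-rsc2-clone1" are invented):
cibadmin -C -o constraints -X '<rsc_colocation id="colo-rsc1-clone1" score="INFINITY" rsc="rsc1" with-rsc="clone1" with-rsc-instance="1"/>'
cibadmin -C -o constraints -X '<rsc_colocation id="colo-rsc2-clone1" score="INFINITY" rsc="rsc2" with-rsc="clone1" with-rsc-instance="2"/>'
```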
Re: [ClusterLabs] Antw: Re: Antw: Changes coming in Pacemaker 2.0.0
Jehan-Guillaume de Rorthais writes:
>
> For what it's worth, while using crmsh, I always have to explain to
> people or customers that:
>
> * we should issue an "unmigrate" to remove the constraint as soon as
> the resource can get back to the original node, or get off the current
> node if needed (depending on the -inf or +inf location constraint
> issued)
> * this will not migrate the resource back if it's sticky enough on the
> current node.
>
> See:
> http://clusterlabs.github.io/PAF/Debian-8-admin-cookbook.html#swapping-master-and-slave-roles-between-nodes
>
> This is counter-intuitive, indeed. I prefer the pcs interface using
> the move/clear actions.

No need! You can use crm rsc move / crm rsc clear. In fact, "unmove" is just a backwards-compatibility alias for clear in crmsh.

Cheers, Kristoffer

-- // Kristoffer Grönlund // kgronl...@suse.com
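The crmsh workflow Kristoffer describes, sketched end to end; the resource name dummy and the node name node2 are placeholders:

```shell
# "move" creates a location constraint pinning the resource to node2:
crm resource move dummy node2

# Once the resource has moved (or the pin is no longer wanted), drop the
# constraint so stickiness and scores govern placement again; "unmove"
# is the backwards-compatibility alias mentioned above:
crm resource clear dummy
```

Note that clearing the constraint does not move the resource back if its stickiness on the current node outweighs the remaining scores, which is exactly the behavior the thread calls counter-intuitive.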
Re: [ClusterLabs] Antw: Re: Antw: Changes coming in Pacemaker 2.0.0
On Thu, 11 Jan 2018 18:32:35 +0300 Andrei Borzenkov wrote:
> On Thu, Jan 11, 2018 at 2:52 PM, Ulrich Windl wrote:
> > > Andrei Borzenkov wrote on 11.01.2018 at 12:41 in message:
> >> On Thu, Jan 11, 2018 at 10:54 AM, Ulrich Windl wrote:
> >>> Hi!
> >>>
> >>> On the tool changes, I'd prefer --move and --un-move as a pair over
> >>> --move and --clear ("clear" is less expressive IMHO).
> >>
> >> --un-move is really wrong semantically. You do not "unmove" a
> >> resource -- you "clear" the constraints that were created. Whether
> >> this actually results in any "movement" is not easily predictable.
> >
> > You undo what "move" does: "un-move". With your argument, "move" is
> > just as bad: why not "--forbid-host" and "--allow-host" then?
>
> That would be less confusing, as it sounds more declarative and matches
> what actually happens -- setting a configuration parameter instead of
> initiating some action.

For what it's worth, while using crmsh, I always have to explain to people or customers that:

* we should issue an "unmigrate" to remove the constraint as soon as the resource can get back to the original node, or get off the current node if needed (depending on the -inf or +inf location constraint issued)
* this will not migrate the resource back if it's sticky enough on the current node.

See: http://clusterlabs.github.io/PAF/Debian-8-admin-cookbook.html#swapping-master-and-slave-roles-between-nodes

This is counter-intuitive, indeed. I prefer the pcs interface using the move/clear actions.
Re: [ClusterLabs] Re: pacemaker reports monitor timeout while CPU is high
On Thu, 2018-01-11 at 03:50 +, 范国腾 wrote:
> Thank you, Ken.
>
> We have set the timeout to be 10 seconds, but it reports a timeout
> after only 2 seconds, so setting higher timeouts does not seem to work.
> Our application, which is managed by Pacemaker, starts more than
> 500 processes when running the performance test. Does that affect the
> result? Which log could help us analyze this?
>
> > monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)

It's not timing out after 2 seconds. The message:

sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor

indicates that the monitor's process ID is 5240, but the message:

sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out

indicates that the monitor that timed out had process ID 5606. That means there were two separate monitors in progress. I'm not sure why; I wouldn't expect the second one to be started until after the first one had timed out. But it's possible that with the high load the log messages were simply written to the log out of order, since they were written by different processes. I would just raise the timeout higher than 10s during the test.

> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: 2018-01-11 0:54
> To: Cluster Labs - All topics related to open-source clustering welcomed
> Subject: Re: [ClusterLabs] pacemaker reports monitor timeout while CPU is high
>
> On Wed, 2018-01-10 at 09:40 +, 范国腾 wrote:
> > Hello,
> >
> > This issue only appears when we run the performance test and the CPU
> > load is high. The cluster and log are as below. Pacemaker restarts
> > the slave-side pgsql-ha resource about every two minutes.
> >
> > Take the following scenario for example (when the pgsqlms RA is
> > called, we print the log "execute the command start (command)"; when
> > the command returns, we print the log "execute the command stop
> > (command) (result)"):
> > 1.
> > We can see that Pacemaker calls "pgsqlms monitor" about every
> > 15 seconds, and it returns $OCF_SUCCESS.
> > 2. It calls the monitor command again at 13:56:16, and then reports a
> > timeout error at 13:56:18. That is only 2 seconds, but it reports
> > "timeout=1ms".
> > 3. In other logs, sometimes after 15 minutes there is no
> > "execute the command start monitor" printed and it reports the
> > timeout error directly.
> >
> > Could you please tell us how to debug or resolve this issue?
> >
> > The log:
> >
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
> > Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
> > Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
> > Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected: 426.77
> > Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
> > Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
>
> There's something more going on than in this log snippet. Notice that
> the process that timed out (5606) is not one of the processes that
> logged above (5240 and 5477).
>
> Generally, once load gets that high, it's very difficult to maintain
> responsiveness, and the expectation is that another node will fence it.
> But it can often be worked around with high timeouts, and/or you can
> use rules to set higher timeouts or maintenance mode during times when
> high load is expected.
> > > Jan 10 13:56:18 sds2 lrmd[26093]: warning: > > pgsqld_monitor_16000:5606 > > - timed out after 1ms > > Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor > > operation > > for pgsqld on db2: Timed Out | call=102 > > key=pgsqld_monitor_16000 timeout=1ms Jan 10 13:56:18 sds2 > > crmd[26096]: notice: db2- > > pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ] > > Jan > > 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE -> > > S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL > > origin=abort_transition_graph Jan 10 13:56:19 sds2 pengine[26095]: > > warning: Processing failed op monitor for pgsqld:0 on db2: unknown > > error (1) Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing > > failed op start for pgsqld:1 on db1: unknown error (1) Jan 10 > > 13:56:19 > > sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after > > 100 failures (max=100) Jan 10 13:56:19 sds2 > >
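Ken's suggestion to raise the timeout during the test amounts to editing the existing monitor operation in the CIB. A minimal sketch, reusing the op id and 16s interval quoted earlier in this thread; the 60s value is illustrative, not a recommendation:

```xml
<!-- Sketch: the Slave monitor op from this thread with its timeout
     raised from 10s for the duration of the load test (60s is an
     illustrative value, not a tested recommendation). -->
<op id="pgsqld-monitor-interval-16s" name="monitor" interval="16s"
    role="Slave" timeout="60s"/>
```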
Re: [ClusterLabs] Antw: Re: Coming in Pacemaker 2.0.0: /var/log/pacemaker/pacemaker.log
On Thu, 2018-01-11 at 09:10 +0100, Ulrich Windl wrote: > Maybe the question to ask right now would be: What are the modules, > and what are their logfile locations? An opportunity to clean up the > mess! It would be nice, but time is always constrained, and coordinating multiple projects takes more time. Still a good idea though. > > > > Adam Spiers wrote on 11.01.2018 at 00:59 > > > > in message > > <20180110235939.fvwkormbruoqhwfb@pacific.linksys.moosehall>: > > Ken Gaillot wrote: > > > The initial proposal, after discussion at last year's summit, was > > > to > > > use /var/log/cluster/pacemaker.log instead. That turned out to be > > > slightly > > > > problematic: it broke some regression tests in a way that wasn't > > easily > > fixable, and more significantly, it raises the question of what > > package > > should own /var/log/cluster (which different distributions might > > want to > > answer differently). > > > > I thought one option aired at the summit to address this was > > /var/log/clusterlabs, but it's entirely possible my memory's > > playing > > tricks on me again. I don't remember that, but it sounds like a good choice. However we'd still have the same issue of needing a single package to own it. We could create a really shallow clusterlabs project/package for the purpose; I can't think of anything else to put in it that would be universal to all ClusterLabs projects. -- Ken Gaillot ___ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] Antw: Re: Antw: Changes coming in Pacemaker 2.0.0
On Thu, Jan 11, 2018 at 2:52 PM, Ulrich Windl wrote: > > Andrei Borzenkov wrote on 11.01.2018 at 12:41 in > message > : >> On Thu, Jan 11, 2018 at 10:54 AM, Ulrich Windl >> wrote: >>> Hi! >>> >>> On the tool changes, I'd prefer --move and --un-move as pair over --move >>> and --clear >> ("clear" is less expressive IMHO). >> >> --un-move is really wrong semantically. You do not "unmove" a resource - >> you "clear" constraints that were created. Whether this actually >> results in any "movement" is unpredictable (easily). > > You undo what "move" does: "un-move". With your argument, "move" is just as > bad: Why not "--forbid-host" and "--allow-host" then? > That would be less confusing as it sounds more declarative and matches what actually happens - setting a configuration parameter instead of initiating some action.
Re: [ClusterLabs] Antw: Coming in Pacemaker 2.0.0: Reliable exit codes
On Thu, 2018-01-11 at 08:59 +0100, Ulrich Windl wrote: > Hi! > > Will those exit codes be compatible with , i.e. will it be > a superset or a subset of it? If not, it would be the right time. Yes! It will be a superset. From the new source code: /* * Exit statuses * * We want well-specified (i.e. OS-invariant) exit status codes for our daemons * and applications so they can be relied on by callers. (Function return codes * and errno's do not make good exit statuses.) * * The only hard rule is that exit statuses must be between 0 and 255; all else * is convention. Universally, 0 is success, and 1 is generic error (excluding * OSes we don't support -- for example, OpenVMS considers 1 success!). * * For init scripts, the LSB gives meaning to 0-7, and sets aside 150-199 for * application use. OCF adds 8-9 and 189-199. * * sysexits.h was an attempt to give additional meanings, but never really * caught on. It uses 0 and 64-78. * * Bash reserves 2 ("incorrect builtin usage") and 126-255 (126 is "command * found but not executable", 127 is "command not found", 128 + n is * "interrupted by signal n"). * * tldp.org recommends 64-113 for application use. * * We try to overlap with the above conventions when practical. */ We are using 0-1 as success and generic error, 2-7 overlapping with LSB+OCF, 64-78 overlapping with sysexits.h, 100-109 (possibly more later) for custom errors, and 124 overlapping with timeout(1). > > Regards, > Ulrich > > > > > > Ken Gaillot wrote on 10.01.2018 at > > > > 23:22 in message > > <1515622941.4815.21.ca...@redhat.com>: > > Every time you run a command on the command line or in a script, it > > returns an exit status. These are most useful in scripts to check > > for > > errors. 
> > > > Currently, Pacemaker daemons and command-line tools return an > > unreliable mishmash of exit status codes, sometimes including > > negative > > numbers (which get bitwise-remapped to the 0-255 range) and/or C > > library errno codes (which can vary across OSes). > > > > The only thing scripts could rely on was 0 means success and > > nonzero > > means error. > > > > Beginning with Pacemaker 2.0.0, everything will return a well- > > defined > > set of reliable exit status codes. These codes can be viewed using > > the > > existing crm_error tool using the --exit parameter. For example: > > > > crm_error --exit --list > > > > will list all possible exit statuses, and > > > > crm_error --exit 124 > > > > will show a textual description of what exit status 124 means. > > > > This will mainly be of interest to users who script Pacemaker > > commands > > and check the return value. If your scripts rely on the current > > exit > > codes, you may need to update your scripts for 2.0.0. > > -- > > Ken Gaillot -- Ken Gaillot
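The exit-status checking Ken describes can be sketched as a small shell wrapper. `crm_error --exit` is the tool documented above (Pacemaker >= 2.0.0); `run_checked` is a hypothetical helper name, not part of Pacemaker:

```shell
#!/bin/sh
# Hypothetical helper: run a command and, on failure, report its exit
# status; if crm_error (Pacemaker >= 2.0.0) is installed, also print
# the textual description of that status.
run_checked() {
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "'$*' failed with exit status $rc" >&2
        if command -v crm_error >/dev/null 2>&1; then
            crm_error --exit "$rc" >&2
        fi
    fi
    return "$rc"
}

run_checked true            # exit status 0: prints nothing
run_checked false || true   # exit status 1: reports the failure
```

The `command -v` guard keeps the wrapper usable on 1.1 systems where `crm_error --exit` is not yet available; the script then still gets the numeric status, just without the description.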
Re: [ClusterLabs] Antw: Re: Antw: Changes coming in Pacemaker 2.0.0
On Thu, 2018-01-11 at 12:52 +0100, Ulrich Windl wrote: > > > > Andrei Borzenkov wrote on 11.01.2018 at > > > > 12:41 in > > message > : > > On Thu, Jan 11, 2018 at 10:54 AM, Ulrich Windl > > wrote: > > > Hi! > > > > > > On the tool changes, I'd prefer --move and --un-move as pair over > > > --move and --clear > > > > ("clear" is less expressive IMHO). > > > > --un-move is really wrong semantically. You do not "unmove" a > > resource - > > you "clear" constraints that were created. Whether this actually > > results in any "movement" is unpredictable (easily). > > You undo what "move" does: "un-move". With your argument, "move" is > just as bad: Why not "--forbid-host" and "--allow-host" then? That's a good point actually. There's a tension between Pacemaker's model (defining a desired state, and letting Pacemaker decide how to get there) vs most people's intuition (defining actions to be taken). Also, Pacemaker's XML syntax tends to be very flexible such that one expression can convey multiple logical intents. So we see that discrepancy sometimes in command naming vs implementation. > > > > > Personally I find lack of any means to change resource state > > non-persistently one of major usability issue with pacemaker > > comparing > > with other cluster stacks. Just a small example: > > > > I wanted to show customer how "maintenance-mode" works. After > > setting > > maintenance-mode=yes for the cluster we found that database was > > mysteriously restarted after being stopped manually. It took quite > > some time to find out that couple of weeks ago "crm resource > > manage" > > followed by "crm resource unmanage" was run for this resource - > > which > > left explicit "managed=yes" on resource which took precedence over > > "maintenance-mode". > > Oops: Didn't know that! > > > > > Not only is this asymmetrical and non-intuitive. There is no way to > > distinguish temporary change from permanent one. 
Moving resources > > is > > special-cased but for any change that involves setting resource > > (meta-)attributes this approach is not possible. Attribute is > > there, > > and we do not know why it was set. > > Yes, the "lifetime" in a rule should not restrict what the rule does, > but how long the rule exists. As garbage collection of expired rules > (which does not exist yet) would have less accuracy than the lifetime > (maybe specified in seconds), a combination could be used. Expired rules are still useful -- e.g. for troubleshooting an event that occurred while the rule was in effect, or for simulating events that occur inside and outside the effective window. It would be helpful though to have a new command to remove all expired rules from the configuration, so an admin can conveniently clean up periodically. > Regards, > Ulrich -- Ken Gaillot
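Ken's point about rules with effective windows can be illustrated with a CIB fragment; the ids, node name, and dates below are invented for the example, and the fragment is a sketch of Pacemaker's date_expression rule syntax rather than a tested configuration:

```xml
<!-- Hypothetical: ban a resource from node db1 only during a planned
     maintenance window. After "end" the rule expires but remains in
     the CIB until an admin removes it (hence Ken's suggestion of a
     cleanup command). -->
<rsc_location id="ban-pgsql-ha-maint" rsc="pgsql-ha">
  <rule id="ban-pgsql-ha-maint-rule" score="-INFINITY">
    <expression id="ban-pgsql-ha-maint-node" attribute="#uname"
                operation="eq" value="db1"/>
    <date_expression id="ban-pgsql-ha-maint-window" operation="in_range"
                     start="2018-01-15 00:00:00" end="2018-01-15 06:00:00"/>
  </rule>
</rsc_location>
```

Both expressions inside the rule are combined with the default boolean-op of "and", so the ban applies only on db1 and only inside the window.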
Re: [ClusterLabs] Antw: Re: Does anyone use clone instance constraints from pacemaker-next schema?
On Thu, 2018-01-11 at 09:12 +0100, Ulrich Windl wrote: > BTW: Could we fix that "Master/slave resources need different > monitoring intervals for master and slave" at this time? Unfortunately that would be a major project, as the interval is used to identify the operation throughout the code base. > > > > > > Jehan-Guillaume de Rorthais wrote on > > > > 11.01.2018 at 01:16 in > > message <20180111011616.496a383b@firost>: > > On Wed, 10 Jan 2018 12:23:59 -0600 > > Ken Gaillot wrote: > > ... > > > My question is: has anyone used or tested this, or is anyone > > > interested > > > in this? We won't promote it to the default schema unless it is > > > tested. > > > > > > My feeling is that it is more likely to be confusing than > > > helpful, and > > > there are probably ways to achieve any reasonable use case with > > > existing syntax. > > > > For what it's worth, I tried to implement such a solution to > > dispatch > > multiple > > IP addresses to slaves in a 1-master, 2-slave cluster. It is > > quite time > > consuming to wrap one's head around side effects with colocation, > > scores and > > stickiness. My various tests show everything sounds to behave > > correctly > > now, > > but I don't feel really 100% confident about my setup. > > > > I agree that there are ways to achieve such a use case with > > existing syntax. > > But this is quite confusing as well. For instance, I experienced a > > master > > relocation when messing with a slave to make sure its IP would move > > to the > > other slave node... I don't remember exactly what was my error, but > > I could > > easily dig for it if needed. > > > > I feel like it fits in the same area as the usability of > > Pacemaker. Making > > it > > easier to understand. See the recent discussion around the > > gocardless war > > story. > > > > My tests were mostly for labs, demo and tutorial purposes. I don't > > have a > > specific field use case. 
But if at some point this feature is > > promoted > > officially as preview, I'll give it some testing and report here > > (barring > > the > > fact I'm actually aware some feedback are requested ;)). -- Ken Gaillot
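Ulrich's request above refers to the current requirement that the Master and Slave monitor operations be distinguished by interval, because (as Ken explains) the interval identifies the operation internally. A hedged CIB sketch of that requirement; the ids are invented for the example, and the 15s/16s intervals echo the pgsql thread elsewhere in this digest:

```xml
<!-- Sketch of the current requirement: two monitor ops on the same
     resource must use different intervals, so the role-specific
     monitors get 15s and 16s rather than one shared interval. -->
<master id="pgsql-ha">
  <primitive id="pgsqld" class="ocf" provider="heartbeat" type="pgsqlms">
    <operations>
      <op id="pgsqld-monitor-15s" name="monitor" interval="15s"
          role="Master" timeout="10s"/>
      <op id="pgsqld-monitor-16s" name="monitor" interval="16s"
          role="Slave" timeout="10s"/>
    </operations>
  </primitive>
</master>
```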
Re: [ClusterLabs] Recommendations for securing DLM traffic?
- Original Message - | What are the general recommendations for securing traffic for DLM on port | 21064? | | It appears that this traffic is not signed or encrypted in any way so whilst | there might not be any privacy issues with information disclosure it's not | clear that the messages could not be replayed or otherwise spoofed. | | Thanks, | | Mark. Hi Mark, Perhaps you should send your question to the public cluster development mailing list: cluster-de...@redhat.com The dlm kernel developers are more likely to see it there. For more info: https://www.redhat.com/mailman/listinfo/cluster-devel Regards, Bob Peterson Red Hat File Systems
[ClusterLabs] Antw: Re: Antw: Changes coming in Pacemaker 2.0.0
>>> Andrei Borzenkov wrote on 11.01.2018 at 12:41 in message : > On Thu, Jan 11, 2018 at 10:54 AM, Ulrich Windl > wrote: >> Hi! >> >> On the tool changes, I'd prefer --move and --un-move as pair over --move and >> --clear > ("clear" is less expressive IMHO). > > --un-move is really wrong semantically. You do not "unmove" a resource - > you "clear" constraints that were created. Whether this actually > results in any "movement" is unpredictable (easily). You undo what "move" does: "un-move". With your argument, "move" is just as bad: Why not "--forbid-host" and "--allow-host" then? > > Personally I find lack of any means to change resource state > non-persistently one of major usability issue with pacemaker comparing > with other cluster stacks. Just a small example: > > I wanted to show customer how "maintenance-mode" works. After setting > maintenance-mode=yes for the cluster we found that database was > mysteriously restarted after being stopped manually. It took quite > some time to find out that couple of weeks ago "crm resource manage" > followed by "crm resource unmanage" was run for this resource - which > left explicit "managed=yes" on resource which took precedence over > "maintenance-mode". Oops: Didn't know that! > > Not only is this asymmetrical and non-intuitive. There is no way to > distinguish temporary change from permanent one. Yes, the "lifetime" in a rule should not restrict what the rule does, but how long the rule exists. As garbage collection of expired rules (which does not exist yet) would have less accuracy than the lifetime (maybe specified in seconds), a combination could be used. 
Regards, Ulrich
Re: [ClusterLabs] Antw: Changes coming in Pacemaker 2.0.0
On Thu, Jan 11, 2018 at 10:54 AM, Ulrich Windl wrote: > Hi! > > On the tool changes, I'd prefer --move and --un-move as pair over --move and > --clear ("clear" is less expressive IMHO). --un-move is really wrong semantically. You do not "unmove" a resource - you "clear" constraints that were created. Whether this actually results in any "movement" is unpredictable (easily). Personally I find lack of any means to change resource state non-persistently one of major usability issue with pacemaker comparing with other cluster stacks. Just a small example: I wanted to show customer how "maintenance-mode" works. After setting maintenance-mode=yes for the cluster we found that database was mysteriously restarted after being stopped manually. It took quite some time to find out that couple of weeks ago "crm resource manage" followed by "crm resource unmanage" was run for this resource - which left explicit "managed=yes" on resource which took precedence over "maintenance-mode". Not only is this asymmetrical and non-intuitive. There is no way to distinguish temporary change from permanent one. Moving resources is special-cased but for any change that involves setting resource (meta-)attributes this approach is not possible. Attribute is there, and we do not know why it was set.
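The surprise Andrei describes comes down to a leftover per-resource meta attribute, which takes precedence over the cluster-wide maintenance-mode property. A sketch of what that looks like in the CIB; the resource id and nvpair ids are invented for the example:

```xml
<!-- Hypothetical leftover from a "crm resource manage" run weeks
     earlier: this per-resource is-managed setting overrides the
     cluster-wide maintenance-mode property, so the resource is
     still actively managed. -->
<primitive id="db" class="ocf" provider="heartbeat" type="pgsqlms">
  <meta_attributes id="db-meta_attributes">
    <nvpair id="db-meta_attributes-is-managed"
            name="is-managed" value="true"/>
  </meta_attributes>
</primitive>
```

Deleting the stray nvpair (rather than setting it to another value) restores the expected behavior, since the cluster property then applies again.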
[ClusterLabs] Recommendations for securing DLM traffic?
What are the general recommendations for securing traffic for DLM on port 21064? It appears that this traffic is not signed or encrypted in any way so whilst there might not be any privacy issues with information disclosure it's not clear that the messages could not be replayed or otherwise spoofed. Thanks, Mark.
[ClusterLabs] Antw: 答复: Antw: pacemaker reports monitor timeout while CPU is high
Hi! A few years ago I was playing with cgroups, getting quite interesting (useful) results, but applying the cgroups to existing and newly started processes was quite hard to integrate into the OS, so I did not proceed on that way. I think cgroups is even more powerful today, but I haven't followed the ease of using it in systems based on systemd (which uses cgroups heavily AFAIK). In short: You may be unable to control the client processes, but you could control the server processes the clients start. Regards, Ulrich >>> ??? wrote on 11.01.2018 at 05:01 in message <492a1ace20c04e85bc4979307af2a...@ex01.highgo.com>: > Ulrich, > > Thank you very much for the help. When we do the performance test, our > application (pgsql-ha) will start more than 500 processes to handle the client > requests. Is it possible that this causes the issue? > > Is there any workaround or method to make pacemaker not restart the resource in > such a situation? Now the system could not work when the client sends a high call > load, but we could not control the client's behavior. > > Thanks > > > -Original Message- > From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de] > Sent: 2018-01-10 18:20 > To: users@clusterlabs.org > Subject: [ClusterLabs] Antw: pacemaker reports monitor timeout while CPU is high > > Hi! > > I can only talk for myself: In former times with HP-UX, we had severe > performance problems when the load was in the range of 8 to 14 (I/O waits not > included, average for all logical CPUs), while in Linux we are getting > problems with a load above 40 (or so) (I/O included, sum of all logical CPUs > (which are 24)). Also I/O waits cause cluster timeouts before CPU load > actually matters (for us). > So with a load above 400 (not knowing your number of CPUs) it should not be > that unusual. What is the number of threads in your system at that time? > It might be worth the effort binding the cluster processes to specific CPUs > and keeping other tasks away from those, but I don't have experience with that. 
> I guess the "High CPU load detected" message triggers some internal suspend > in the cluster engine (assuming the cluster engine caused the high load). Of > course for "external" load the measure won't help... > > Regards, > Ulrich > > ??? wrote on 10.01.2018 at 10:40 in message > <4dc98a5d9be144a78fb9a18721743...@ex01.highgo.com>: >> Hello, >> >> This issue only appears when we run the performance test and the CPU is high. >> The cluster and log are as below. Pacemaker will restart the Slave >> Side pgsql-ha resource about every two minutes. >> >> Take the following scenario for example: (when the pgsqlms RA is >> called, we print the log “execute the command start (command)”. When >> the command is > >> returned, we print the log “execute the command stop (Command) > (result)”) >> >> 1. We could see that pacemaker calls “pgsqlms monitor” about every 15 > >> seconds. And it returns $OCF_SUCCESS >> >> 2. It calls the monitor command again at 13:56:16, and then it reports a >> timeout error at 13:56:18. It is only 2 seconds but it reports >> “timeout=1ms” >> >> 3. In other logs, sometimes after 15 minutes, there is no “execute the > >> command start monitor” printed and it reports a timeout error directly. >> >> Could you please tell us how to debug or resolve such an issue? 
>> >> The log: >> >> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command >> start > >> monitor >> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start >> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0 >> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command >> stop monitor 0 Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: >> execute the command start > >> monitor >> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start >> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0 >> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command >> stop monitor 0 Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU >> load detected: >> 426.77 >> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command >> start > >> monitor >> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 >> process (PID > >> 5606) timed out >> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - >> timed > >> out after 1ms >> Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor operation for >> pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 > timeout=1ms >> Jan 10 13:56:18 sds2 crmd[26096]: notice: >> db2-pgsqld_monitor_16000:102 [ >> /tmp:5432 - accepting connections\n ] >> Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE -> >> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL >> origin=abort_transition_graph Jan 10 13:56:19 sds2 pengine[26095]: >>
[ClusterLabs] Antw: Re: Does anyone use clone instance constraints from pacemaker-next schema?
BTW: Could we fix that "Master/slave resources need different monitoring intervals for master and slave" at this time? >>> Jehan-Guillaume de Rorthais wrote on 11.01.2018 at >>> 01:16 in message <20180111011616.496a383b@firost>: > On Wed, 10 Jan 2018 12:23:59 -0600 > Ken Gaillot wrote: > ... >> My question is: has anyone used or tested this, or is anyone interested >> in this? We won't promote it to the default schema unless it is tested. >> >> My feeling is that it is more likely to be confusing than helpful, and >> there are probably ways to achieve any reasonable use case with >> existing syntax. > > For what it's worth, I tried to implement such a solution to dispatch multiple > IP addresses to slaves in a 1-master, 2-slave cluster. It is quite time > consuming to wrap one's head around side effects with colocation, scores and > stickiness. My various tests show everything sounds to behave correctly > now, > but I don't feel really 100% confident about my setup. > > I agree that there are ways to achieve such a use case with existing syntax. > But this is quite confusing as well. For instance, I experienced a master > relocation when messing with a slave to make sure its IP would move to the > other slave node... I don't remember exactly what was my error, but I could > easily dig for it if needed. > > I feel like it fits in the same area as the usability of Pacemaker. Making > it > easier to understand. See the recent discussion around the gocardless war > story. > > My tests were mostly for labs, demo and tutorial purposes. I don't have a > specific field use case. But if at some point this feature is promoted > officially as a preview, I'll give it some testing and report here (barring > the > fact I'm actually aware some feedback is requested ;)). 
[ClusterLabs] Antw: Re: Coming in Pacemaker 2.0.0: /var/log/pacemaker/pacemaker.log
Maybe the question to ask right now would be: What are the modules, and what are their logfile locations? An opportunity to clean up the mess! >>> Adam Spiers wrote on 11.01.2018 at 00:59 in message <20180110235939.fvwkormbruoqhwfb@pacific.linksys.moosehall>: > Ken Gaillot wrote: >>The initial proposal, after discussion at last year's summit, was to >>use /var/log/cluster/pacemaker.log instead. That turned out to be slightly > problematic: it broke some regression tests in a way that wasn't easily > fixable, and more significantly, it raises the question of what package > should own /var/log/cluster (which different distributions might want to > answer differently). > > I thought one option aired at the summit to address this was > /var/log/clusterlabs, but it's entirely possible my memory's playing > tricks on me again.
[ClusterLabs] Antw: Coming in Pacemaker 2.0.0: /var/log/pacemaker/pacemaker.log
Hi! More than the location of the log file, I'm interested in the contents of the log file: The log file should have a formal syntax for automated parsing, and it should be as compact as possible. Considering lines like: Jan 07 10:07:41 [10691] h01 pengine: info: determine_online_status_fencing: Node h01 is active there are too many repeating blanks in that. Also the order may not be optimal: Why not have the priority ("Info:") first after the host name? The function name "determine_online_status_fencing" wouldn't lose much information if just being called "fencing_status", too (IMHO). That would move the most important information that comes last ("Node h01 is active") toward a common line limit (which is not 160, BTW ;-)). Regards, Ulrich >>> Ken Gaillot wrote on 10.01.2018 at 23:34 in >>> message <1515623653.4815.23.ca...@redhat.com>: > Starting with Pacemaker 2.0.0, the Pacemaker detail log will be kept by > default in /var/log/pacemaker/pacemaker.log (rather than > /var/log/pacemaker.log). This will keep /var/log cleaner. > > Pacemaker will still prefer any log file specified in corosync.conf. > > The initial proposal, after discussion at last year's summit, was to > use /var/log/cluster/pacemaker.log instead. That turned out to be slightly > problematic: it broke some regression tests in a way that wasn't easily > fixable, and more significantly, it raises the question of what package > should own /var/log/cluster (which different distributions might want to > answer differently). > > So instead, the default log locations can be overridden when building > pacemaker. 
The ./configure script now has these two options: > > --with-logdir > Where to keep pacemaker.log (default /var/log/pacemaker) > > --with-bundledir > Where to keep bundle logs (default /var/log/pacemaker/bundles, which > hasn't changed) > > Thus, if a packager wants to preserve the 1.1 locations, they can use: > > ./configure --with-logdir=/var/log > > And if a packager wants to use /var/log/cluster as originally planned, > they can use: > > ./configure --with-logdir=/var/log/cluster --with- > bundledir=/var/log/cluster/bundles > > and ensure that pacemaker depends on whatever package owns > /var/log/cluster. > -- > Ken Gaillot
[ClusterLabs] Antw: Coming in Pacemaker 2.0.0: Reliable exit codes
Hi! Will those exit codes be compatible with , i.e. will it be a superset or a subset of it? If not, it would be the right time. Regards, Ulrich >>> Ken Gaillot wrote on 10.01.2018 at 23:22 in >>> message <1515622941.4815.21.ca...@redhat.com>: > Every time you run a command on the command line or in a script, it > returns an exit status. These are most useful in scripts to check for > errors. > > Currently, Pacemaker daemons and command-line tools return an > unreliable mishmash of exit status codes, sometimes including negative > numbers (which get bitwise-remapped to the 0-255 range) and/or C > library errno codes (which can vary across OSes). > > The only thing scripts could rely on was 0 means success and nonzero > means error. > > Beginning with Pacemaker 2.0.0, everything will return a well-defined > set of reliable exit status codes. These codes can be viewed using the > existing crm_error tool using the --exit parameter. For example: > > crm_error --exit --list > > will list all possible exit statuses, and > > crm_error --exit 124 > > will show a textual description of what exit status 124 means. > > This will mainly be of interest to users who script Pacemaker commands > and check the return value. If your scripts rely on the current exit > codes, you may need to update your scripts for 2.0.0. > -- > Ken Gaillot