[ClusterLabs] Antw: Changes coming in Pacemaker 2.0.0
Hi!

On the tool changes, I'd prefer --move and --un-move as a pair over --move and --clear ("clear" is less expressive IMHO).

On "--reprobe -> --refresh": Why not simply "--probe"?

On "--crm_xml -> --xml-text": Why not simply "--xml" (XML IS text)?

Regards,
Ulrich

>>> Ken Gaillot wrote on 10.01.2018 at 23:10 in message
>>> <1515622250.4815.19.ca...@redhat.com>:
> Pacemaker 2.0 will be a major update whose main goal is to remove
> support for deprecated, legacy syntax, in order to make the code base
> more maintainable into the future. There will also be some changes to
> default configuration behavior, and the command-line tools.
>
> I'm hoping to release the first release candidate in the next couple of
> weeks. We'll have a longer than usual rc phase to allow for plenty of
> testing.
>
> A thoroughly detailed list of changes will be maintained on the
> ClusterLabs wiki:
>
> https://wiki.clusterlabs.org/wiki/Pacemaker_2.0_Changes
>
> These changes are not final, and we can restore functionality if there
> is a strong need for it. Most user-visible changes are complete (in the
> 2.0 branch on github); major changes are still expected, but primarily
> to the C API.
>
> Some highlights:
>
> * Only Corosync version 2 will be supported as the underlying cluster
> layer. Support for Heartbeat and Corosync 1 is removed. (Support for
> the new kronosnet layer will be added in a future version.)
>
> * The record-pending cluster property now defaults to true, which
> allows status tools such as crm_mon to show operations that are in
> progress.
>
> * So far, the code base has been reduced by about 17,000 lines of code.
> --
> Ken Gaillot

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Re: Antw: pacemaker reports monitor timeout while CPU is high
Ulrich,

Thank you very much for the help. When we run the performance test, our application (pgsql-ha) starts more than 500 processes to handle the client requests. Could that cause this issue? Is there any workaround, or a way to make Pacemaker not restart the resource in this situation? Right now the system cannot work when the client sends a high call load, and we cannot control the client's behavior.

Thanks

-----Original Message-----
From: Ulrich Windl [mailto:ulrich.wi...@rz.uni-regensburg.de]
Sent: 10 January 2018 18:20
To: users@clusterlabs.org
Subject: [ClusterLabs] Antw: pacemaker reports monitor timeout while CPU is high

Hi!

I can only speak for myself: In former times with HP-UX, we had severe performance problems when the load was in the range of 8 to 14 (I/O waits not included, average for all logical CPUs), while on Linux we get problems with a load above 40 (or so) (I/O included, sum over all logical CPUs (which are 24)). Also, I/O waits cause cluster timeouts before CPU load actually matters (for us). So with a load above 400 (not knowing your number of CPUs) it should not be that unusual. What is the number of threads in your system at that time?

It might be worth the effort to bind the cluster processes to specific CPUs and keep other tasks away from those, but I don't have experience with that. I guess the "High CPU load detected" message triggers some internal suspend in the cluster engine (assuming the cluster engine caused the high load). Of course, for "external" load that measure won't help...

Regards,
Ulrich

>>> ??? wrote on 10.01.2018 at 10:40 in message
>>> <4dc98a5d9be144a78fb9a18721743...@ex01.highgo.com>:
> Hello,
>
> This issue only appears when we run performance test and the CPU is high.
> The cluster and log is as below. The Pacemaker will restart the Slave
> Side pgsql-ha resource about every two minutes.
>
> Take the following scenario for example: (when the pgsqlms RA is
> called, we print the log "execute the command start (command)".
> When the command
> returns, we print the log "execute the command stop (Command) (result)")
>
> 1. We can see that Pacemaker calls "pgsqlms monitor" about every 15
> seconds, and it returns $OCF_SUCCESS.
>
> 2. It calls the monitor command again at 13:56:16, and then reports a
> timeout error at 13:56:18. That is only 2 seconds, but it reports
> "timeout=1ms".
>
> 3. In other logs, sometimes after 15 minutes there is no "execute the
> command start monitor" printed and it reports a timeout error directly.
>
> Could you please tell us how to debug or resolve such an issue?
>
> The log:
>
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
> Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected: 426.77
> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 1ms
> Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor operation for
> pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=1ms
> Jan 10 13:56:18 sds2 crmd[26096]: notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
> Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE ->
> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=abort_transition_graph
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 100 failures (max=100)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 100 failures (max=100)
> Jan 10 13:56:19 sds2 pengine[26095]: notice: Recover pgsqld:0#011(Slave db2)
> Jan 10 13:56:19 sds2 pengine[26095]: notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
>
> The Cluster Configuration:
> 2 nodes and 13 resources configured
>
> Online: [ db1 db2 ]
>
> Full list of resources:
>
> Clone Set: dlm-clone
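Ulrich's suggestion in this thread about binding the cluster processes to specific CPUs can be sketched in shell. This is only an illustration, not a tested recommendation: the daemon names match this Pacemaker 1.1 stack, and the reserved CPU list 0-1 is an arbitrary assumption.

```shell
#!/bin/sh
# Hedged sketch: pin cluster daemons to dedicated CPUs so that heavy
# application load (e.g. 500+ postgres backends) competes less with them.
# Assumptions: daemon names and the CPU list (0,1) are illustrative.
for daemon in corosync pacemakerd crmd lrmd pengine; do
    for pid in $(pgrep -x "$daemon" 2>/dev/null); do
        taskset -pc 0,1 "$pid" || true   # requires util-linux taskset
    done
done
```

The application processes would then need to be kept off CPUs 0-1 (e.g. via cgroups or their own taskset affinity) for the separation to matter.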
[ClusterLabs] Re: pacemaker reports monitor timeout while CPU is high
Thank you, Ken. We have set the timeout to 10 seconds, but it reports a timeout after only 2 seconds, so setting higher timeouts does not seem to help.

Our application, which is managed by Pacemaker, starts more than 500 processes when running the performance test. Does that affect the result? Which log could help us analyze this?

> monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)

-----Original Message-----
From: Ken Gaillot [mailto:kgail...@redhat.com]
Sent: 11 January 2018 0:54
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] pacemaker reports monitor timeout while CPU is high

On Wed, 2018-01-10 at 09:40 +, 范国腾 wrote:
> Hello,
>
> This issue only appears when we run performance test and the CPU is
> high. The cluster and log is as below. The Pacemaker will restart the
> Slave Side pgsql-ha resource about every two minutes.
>
> Take the following scenario for example: (when the pgsqlms RA is
> called, we print the log "execute the command start (command)". When
> the command is returned, we print the log "execute the command stop
> (Command) (result)")
> 1. We could see that pacemaker call "pgsqlms monitor" about every
> 15 seconds. And it return $OCF_SUCCESS
> 2. In calls monitor command again at 13:56:16, and then it reports
> timeout error 13:56:18. It is only 2 seconds but it reports
> "timeout=1ms"
> 3. In other logs, sometimes after 15 minutes, there is no "execute
> the command start monitor" printed and it reports timeout error
> directly.
>
> Could you please tell how to debug or resolve such issue?
> The log:
>
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command
> start monitor
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
> Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command
> stop monitor 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command
> start monitor
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
> Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command
> stop monitor 0
> Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected: 426.77
> Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command
> start monitor
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000
> process (PID 5606) timed out

There's something more going on than in this log snippet. Notice the process that timed out (5606) is not one of the processes that logged above (5240 and 5477).

Generally, once load gets that high, it's very difficult to maintain responsiveness, and the expectation is that another node will fence it. But it can often be worked around with high timeouts, and/or you can use rules to set higher timeouts or maintenance mode during times when high load is expected.
> Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606
> - timed out after 1ms
> Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor operation
> for pgsqld on db2: Timed Out | call=102
> key=pgsqld_monitor_16000 timeout=1ms
> Jan 10 13:56:18 sds2 crmd[26096]: notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
> Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE ->
> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=abort_transition_graph
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 100 failures (max=100)
> Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 100 failures (max=100)
> Jan 10 13:56:19 sds2 pengine[26095]: notice: Recover pgsqld:0#011(Slave db2)
> Jan 10 13:56:19 sds2 pengine[26095]: notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2
>
> The Cluster Configuration:
> 2 nodes and 13 resources configured
>
> Online: [ db1 db2 ]
>
> Full list of resources:
>
> Clone Set: dlm-clone [dlm]
>     Started: [ db1 db2 ]
> Clone Set: clvmd-clone [clvmd]
>     Started: [ db1 db2 ]
> ipmi_node1 (stonith:fence_ipmilan): Started db2
> ipmi_node2 (stonith:fence_ipmilan): Started db1
> Clone Set: clusterfs-clone [clusterfs]
>     Started: [ db1 db2 ]
> Master/Slave Set: pgsql-ha [pgsqld]
>     Masters: [ db1 ]
>     Slaves: [ db2 ]
> Resource Group: mastergroup
>     db1-vip (ocf::heartbeat:IPaddr2): Started
>     rep-vip (ocf::heartbeat:IPaddr2): Started
> Resource Group:
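Ken's suggestion of maintenance mode during expected load spikes can be wrapped in a small script. A hedged sketch, not a tested procedure: the wrapper assumes crm_attribute is the available CLI (the guard makes it a no-op where the tool is missing), and the wrapped command name is illustrative.

```shell
#!/bin/sh
# Hedged sketch: put the cluster into maintenance-mode around a known
# high-load window (e.g. a scheduled performance test), so Pacemaker
# does not react to slow monitors while the test runs.
with_maintenance_mode() {
    if command -v crm_attribute >/dev/null 2>&1; then
        crm_attribute --type crm_config --name maintenance-mode --update true
    fi
    "$@"                       # the high-load job itself
    rc=$?
    if command -v crm_attribute >/dev/null 2>&1; then
        crm_attribute --type crm_config --name maintenance-mode --delete
    fi
    return "$rc"
}

# Example (command name is illustrative):
# with_maintenance_mode ./run_performance_test.sh
```

Note that while maintenance-mode is set, the cluster will not recover genuinely failed resources either, so this trades protection for stability during the window.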
Re: [ClusterLabs] Changes coming in Pacemaker 2.0.0
On Wed, 10 Jan 2018 16:10:50 -0600 Ken Gaillot wrote:
> Pacemaker 2.0 will be a major update whose main goal is to remove
> support for deprecated, legacy syntax, in order to make the code base
> more maintainable into the future. There will also be some changes to
> default configuration behavior, and the command-line tools.
>
> I'm hoping to release the first release candidate in the next couple of
> weeks.

Great news! Congrats.

> We'll have a longer than usual rc phase to allow for plenty of
> testing.
>
> A thoroughly detailed list of changes will be maintained on the
> ClusterLabs wiki:
>
> https://wiki.clusterlabs.org/wiki/Pacemaker_2.0_Changes
>
> These changes are not final, and we can restore functionality if there
> is a strong need for it. Most user-visible changes are complete (in the
> 2.0 branch on github); major changes are still expected, but primarily
> to the C API.
>
> Some highlights:
>
> * Only Corosync version 2 will be supported as the underlying cluster
> layer. Support for Heartbeat and Corosync 1 is removed. (Support for
> the new kronosnet layer will be added in a future version.)

I thought (according to some conference slides from Sept 2017) that knet was mostly related to Corosync directly? Is there some visible impact on Pacemaker too?
Re: [ClusterLabs] Does anyone use clone instance constraints from pacemaker-next schema?
On Wed, 10 Jan 2018 12:23:59 -0600 Ken Gaillot wrote:
...
> My question is: has anyone used or tested this, or is anyone interested
> in this? We won't promote it to the default schema unless it is tested.
>
> My feeling is that it is more likely to be confusing than helpful, and
> there are probably ways to achieve any reasonable use case with
> existing syntax.

For what it's worth, I tried to implement such a solution to dispatch multiple IP addresses to slaves in a 1-master, 2-slave cluster. It is quite time-consuming to wrap one's head around the side effects of colocation, scores, and stickiness. My various tests show that everything seems to behave correctly now, but I don't feel really 100% confident about my setup.

I agree that there are ways to achieve such a use case with existing syntax. But those are quite confusing as well. For instance, I experienced a master relocation when messing with a slave to make sure its IP would move to the other slave node... I don't remember exactly what my error was, but I could easily dig for it if needed.

I feel like this fits in the same area as the usability of Pacemaker: making it easier to understand. See the recent discussion around the gocardless war story.

My tests were mostly for labs, demos, and tutorial purposes. I don't have a specific field use case. But if at some point this feature is officially promoted as a preview, I'll give it some testing and report here (barring the fact that I'm actually aware some feedback is requested ;)).
Re: [ClusterLabs] Coming in Pacemaker 2.0.0: /var/log/pacemaker/pacemaker.log
Ken Gaillot wrote:
> The initial proposal, after discussion at last year's summit, was to use
> /var/log/cluster/pacemaker.log instead. That turned out to be slightly
> problematic: it broke some regression tests in a way that wasn't easily
> fixable, and more significantly, it raises the question of what package
> should own /var/log/cluster (which different distributions might want to
> answer differently).

I thought one option aired at the summit to address this was /var/log/clusterlabs, but it's entirely possible my memory's playing tricks on me again.
[ClusterLabs] Choosing between Pacemaker 1.1 and Pacemaker 2.0
Distribution packagers and users who build Pacemaker themselves will need to choose between staying on the 1.1 line or moving to 2.0. A new wiki page lists factors to consider:

https://wiki.clusterlabs.org/wiki/Choosing_Between_Pacemaker_1.1_and_2.0

--
Ken Gaillot
[ClusterLabs] Coming in Pacemaker 2.0.0: /var/log/pacemaker/pacemaker.log
Starting with Pacemaker 2.0.0, the Pacemaker detail log will be kept by default in /var/log/pacemaker/pacemaker.log (rather than /var/log/pacemaker.log). This will keep /var/log cleaner. Pacemaker will still prefer any log file specified in corosync.conf.

The initial proposal, after discussion at last year's summit, was to use /var/log/cluster/pacemaker.log instead. That turned out to be slightly problematic: it broke some regression tests in a way that wasn't easily fixable, and more significantly, it raises the question of what package should own /var/log/cluster (which different distributions might want to answer differently).

So instead, the default log locations can be overridden when building pacemaker. The ./configure script now has these two options:

--with-logdir     Where to keep pacemaker.log (default /var/log/pacemaker)
--with-bundledir  Where to keep bundle logs (default /var/log/pacemaker/bundles, which hasn't changed)

Thus, if a packager wants to preserve the 1.1 locations, they can use:

./configure --with-logdir=/var/log

And if a packager wants to use /var/log/cluster as originally planned, they can use:

./configure --with-logdir=/var/log/cluster --with-bundledir=/var/log/cluster/bundles

and ensure that pacemaker depends on whatever package owns /var/log/cluster.

--
Ken Gaillot
[ClusterLabs] Coming in Pacemaker 2.0.0: Reliable exit codes
Every time you run a command on the command line or in a script, it returns an exit status. These are most useful in scripts to check for errors.

Currently, Pacemaker daemons and command-line tools return an unreliable mishmash of exit status codes, sometimes including negative numbers (which get bitwise-remapped to the 0-255 range) and/or C library errno codes (which can vary across OSes). The only thing scripts could rely on was 0 means success and nonzero means error.

Beginning with Pacemaker 2.0.0, everything will return a well-defined set of reliable exit status codes. These codes can be viewed using the existing crm_error tool with the --exit parameter. For example:

crm_error --exit --list

will list all possible exit statuses, and

crm_error --exit 124

will show a textual description of what exit status 124 means.

This will mainly be of interest to users who script Pacemaker commands and check the return value. If your scripts rely on the current exit codes, you may need to update your scripts for 2.0.0.

--
Ken Gaillot
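The scripting pattern described above can be sketched as a small shell wrapper. This is a hedged illustration: "true"/"false" stand in for real Pacemaker commands, and the wrapper only calls crm_error when it is installed.

```shell
#!/bin/sh
# Hedged sketch: check a Pacemaker command's exit status in a script and,
# on 2.0.0 or later, translate the number into text with crm_error.
run_pcmk() {
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ] && command -v crm_error >/dev/null 2>&1; then
        # --exit interprets the number as an exit status (not an errno)
        echo "command failed: $(crm_error --exit "$rc")" >&2
    fi
    return "$rc"
}

run_pcmk true    # a successful command passes status 0 through unchanged
```

In a real script the wrapped command would be something like `run_pcmk crm_resource --cleanup ...` (the exact command is an assumption here); the point is that scripts branch on the numeric status and use crm_error only for human-readable reporting.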
[ClusterLabs] Changes coming in Pacemaker 2.0.0
Pacemaker 2.0 will be a major update whose main goal is to remove support for deprecated, legacy syntax, in order to make the code base more maintainable into the future. There will also be some changes to default configuration behavior, and the command-line tools.

I'm hoping to release the first release candidate in the next couple of weeks. We'll have a longer than usual rc phase to allow for plenty of testing.

A thoroughly detailed list of changes will be maintained on the ClusterLabs wiki:

https://wiki.clusterlabs.org/wiki/Pacemaker_2.0_Changes

These changes are not final, and we can restore functionality if there is a strong need for it. Most user-visible changes are complete (in the 2.0 branch on github); major changes are still expected, but primarily to the C API.

Some highlights:

* Only Corosync version 2 will be supported as the underlying cluster layer. Support for Heartbeat and Corosync 1 is removed. (Support for the new kronosnet layer will be added in a future version.)

* The record-pending cluster property now defaults to true, which allows status tools such as crm_mon to show operations that are in progress.

* So far, the code base has been reduced by about 17,000 lines of code.

--
Ken Gaillot
[ClusterLabs] Does anyone use clone instance constraints from pacemaker-next schema?
The pacemaker-next schema contains experimental features for testing before potential release. To use these features, someone must explicitly set validate-with in their configuration to pacemaker-next (or its legacy alias, pacemaker-1.1).

There is a feature that has been hanging around in there for a long time: the ability to reference particular instances of a clone in constraints, using "rsc-instance"/"with-rsc-instance" (colocation) or "first-instance"/"then-instance" (ordering). The originally proposed use case (back in 2009) was having separate IP addresses, each associated with one copy of the clone:

https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2169

My question is: has anyone used or tested this, or is anyone interested in this? We won't promote it to the default schema unless it is tested.

My feeling is that it is more likely to be confusing than helpful, and there are probably ways to achieve any reasonable use case with existing syntax.

--
Ken Gaillot
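For readers unfamiliar with the syntax under discussion, a hedged sketch of what such a constraint might look like in the CIB. The resource names and instance number are invented for illustration, this requires validate-with="pacemaker-next", and per the message above the feature is untested:

```
<!-- Illustrative only: colocate a hypothetical IP "vip0" with instance 0
     of clone "app-clone" via the experimental with-rsc-instance attribute -->
<rsc_colocation id="vip0-with-app-0" score="INFINITY"
                rsc="vip0"
                with-rsc="app-clone" with-rsc-instance="0"/>
```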
Re: [ClusterLabs] Antw: Resource Demote Time Out Question
On Wed, 2018-01-10 at 16:48 +0100, Ulrich Windl wrote:
> Hi!
>
> Common pitfall: The default parameters in the RA's metadata are not
> the defaults being configured when you don't specify a value; instead
> they are suggestions for you when configuring (don't ask me why!).
> Instead there is a global default timeout being used when you don't
> specify one.
> I hope I put that correctly. You could verify by manually adding the
> default values from the metadata to "demote".
>
> Regards,
> Ulrich

Yep. That would be in the section of the configuration with "op start interval=0 timeout=120" ... you want "op demote interval=0 timeout=" with the desired value.

> > > Marc Smith wrote on 10.01.2018 at 16:26 in message
Re: [ClusterLabs] corosync taking almost 30 secs to detect node failure in case of kernel panic
On Wed, 2018-01-10 at 12:43 +0530, ashutosh tiwari wrote:
> Hi,
>
> We have two node cluster running in active/standby mode and having
> IPMI fencing configured.

Be aware that using on-board IPMI as the only fencing method is problematic -- if the host loses power, the IPMI will not respond, and the cluster will be unable to recover.

> In case of kernel panic at Active node, standby node is detecting
> node failure in around 30 secs which leads to delay in standby node
> taking the active role.
>
> we have totem token timeout as 1 msecs.
> Please let us know in case there is any more configuration
> controlling membership detection.

The logs should show what's taking up the time. Corosync should recognize the node is lost around the token timeout, then pacemaker has to contact the IPMI and wait for a successful response before recovering. It could be that the IPMI takes that long to respond, or there may be something else causing issues.

> s/w versions.
>
> centos 6.7
> corosync-1.4.7-5.el6.x86_64
> pacemaker-1.1.14-8.el6.x86_64
>
> Thanks and Regards,
> Ashutosh Tiwari

--
Ken Gaillot
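For reference, the token timeout Ken mentions lives in the totem section of corosync.conf. A minimal sketch; the values are illustrative, not recommendations for this cluster:

```
# /etc/corosync/corosync.conf (fragment)
totem {
    version: 2
    # Milliseconds without a token before token loss is declared;
    # a dead node is detected roughly this long after it stops responding.
    token: 10000
    # Token retransmits attempted before declaring the token lost.
    token_retransmits_before_loss_const: 10
}
```

Note that the detection time this controls is separate from the time fencing (here, the IPMI) takes afterwards, which is what Ken suggests checking in the logs.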
[ClusterLabs] Antw: Resource Demote Time Out Question
Hi!

Common pitfall: The default parameters in the RA's metadata are not the defaults being configured when you don't specify a value; instead they are suggestions for you when configuring (don't ask me why!). Instead there is a global default timeout being used when you don't specify one.

I hope I put that correctly. You could verify by manually adding the default values from the metadata to "demote".

Regards,
Ulrich

>>> Marc Smith wrote on 10.01.2018 at 16:26 in message
[ClusterLabs] Resource Demote Time Out Question
Hi,

I'm experiencing a time out on a demote operation and I'm not sure which parameter / attribute needs to be updated to extend the time out window. I'm using Pacemaker 1.1.16 and Corosync 2.4.2.

Here are the log lines that show the issue (shutdown initiated, then demote time out after 20 seconds):

--snip--
Jan 10 09:08:13 tgtnode2 pacemakerd[1096]: notice: Caught 'Terminated' signal
Jan 10 09:08:13 tgtnode2 crmd[1104]: notice: Caught 'Terminated' signal
Jan 10 09:08:13 tgtnode2 crmd[1104]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Jan 10 09:08:13 tgtnode2 pengine[1103]: notice: Scheduling Node tgtnode2.parodyne.com for shutdown
Jan 10 09:08:13 tgtnode2 pengine[1103]: notice: Promote p_scst_zfs_vols:0^I(Slave -> Master tgtnode1.parodyne.com)
Jan 10 09:08:13 tgtnode2 pengine[1103]: notice: Demote p_scst_zfs_vols:1^I(Master -> Stopped tgtnode2.parodyne.com)
Jan 10 09:08:13 tgtnode2 pengine[1103]: notice: Stop p_dlm:1^I(tgtnode2.parodyne.com)
Jan 10 09:08:13 tgtnode2 pengine[1103]: notice: Migrate p_dummy_g_zfs^I(Started tgtnode2.parodyne.com -> tgtnode1.parodyne.com)
Jan 10 09:08:13 tgtnode2 pengine[1103]: notice: Move p_zfs_pool_one^I(Started tgtnode2.parodyne.com -> tgtnode1.parodyne.com)
Jan 10 09:08:13 tgtnode2 pengine[1103]: notice: Calculated transition 3, saving inputs in /var/lib/pacemaker/pengine/pe-input-1441.bz2
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17449]: DEBUG: scst_notify() -> Received a 'pre' / 'demote' notification.
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17449]: DEBUG: p_scst_zfs_vols notify returned: 0
Jan 10 09:08:13 tgtnode2 crmd[1104]: notice: Result of notify operation for p_scst_zfs_vols on tgtnode2.parodyne.com: 0 (ok)
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_monitor() -> SCST version: 3.3.0-rc
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_monitor() -> Resource is running.
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_monitor() -> SCST local target group state: active
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_demote() -> Resource is currently running as Master.
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: INFO: Blocking all 'zfs_vols' devices...
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: Waiting for devices to finish blocking...
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_demote() -> Setting target group 'zfs_vols_local' ALUA state to 'transitioning'...
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: INFO: Collecting current configuration: done. -> Making requested changes. -> Setting Target Group attribute 'state' to value 'transitioning' for target group 'zfs_vols/zfs_vols_local': done. -> Done, 1 change(s) made. All done.
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_demote() -> Setting target group 'zfs_vols_local' ALUA state to 'unavailable'...
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: INFO: Collecting current configuration: done. -> Making requested changes. -> Setting Target Group attribute 'state' to value 'unavailable' for target group 'zfs_vols/zfs_vols_local': done. -> Done, 1 change(s) made. All done.
Jan 10 09:08:13 tgtnode2 scst(p_scst_zfs_vols)[17473]: DEBUG: scst_demote() -> Changing the group's devices to inactive...
Jan 10 09:08:33 tgtnode2 lrmd[1101]: warning: p_scst_zfs_vols_demote_0 process (PID 17473) timed out
Jan 10 09:08:33 tgtnode2 crmd[1104]: notice: Transition aborted by operation p_scst_zfs_vols_demote_0 'modify' on tgtnode2.parodyne.com: Event failed
Jan 10 09:08:33 tgtnode2 crmd[1104]: notice: Transition aborted by status-2-fail-count-p_scst_zfs_vols doing create fail-count-p_scst_zfs_vols=1: Transient attribute change
--snip--

So I'm getting a "time out" after 20 seconds of waiting in the demote operation with this line:

Jan 10 09:08:33 tgtnode2 lrmd[1101]: warning: p_scst_zfs_vols_demote_0 process (PID 17473) timed out

The 20 second time out is consistent when testing this, so I'm sure it's just a configuration thing, but it's not obvious to me which parameter/attribute/setting needs to be modified.

The relevant metadata section from the RA referenced above:

--snip--
--snip--

And the primitive and clone (multi-state) actual cluster configuration for the referenced resource:

--snip--
primitive p_scst_zfs_vols ocf:esos:scst \
    params alua=true device_group=zfs_vols local_tgt_grp=zfs_vols_local remote_tgt_grp=zfs_vols_remote m_alua_state=active s_alua_state=unavailable use_trans_state=true set_dev_active=true \
    op monitor interval=10 role=Master \
    op monitor interval=20 role=Slave \
    op start interval=0 timeout=120 \
    op stop interval=0 timeout=90
ms ms_scst_zfs_vols p_scst_zfs_vols \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
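Per the replies elsewhere in this digest, the primitive above defines timeouts for start, stop, and monitor, but none for demote, so the global default action timeout (here, 20 seconds) applies to demote. A sketch of the likely fix in crm shell syntax; the 120 value is an assumption, size it to what a real demote actually takes:

```
# Sketch: add an explicit demote timeout to the existing primitive
# (120 is illustrative -- use a value longer than a real demote needs)
primitive p_scst_zfs_vols ocf:esos:scst \
    params alua=true device_group=zfs_vols local_tgt_grp=zfs_vols_local \
        remote_tgt_grp=zfs_vols_remote m_alua_state=active \
        s_alua_state=unavailable use_trans_state=true set_dev_active=true \
    op monitor interval=10 role=Master \
    op monitor interval=20 role=Slave \
    op start interval=0 timeout=120 \
    op stop interval=0 timeout=90 \
    op demote interval=0 timeout=120
```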
[ClusterLabs] Antw: Antw: corosync taking almost 30 secs to detect node failure in case of kernel panic
> _peer_state_iter: cman_event_callback: Node tigana[2] - state is now lost (was member)
> Jan 10 11:06:33 [19261] orana crmd: notice: crm_update_peer_state_iter: cman_event_callback: Node tigana[2] - state is now lost (was member)
> Jan 10 11:06:33 [19261] orana crmd: info: peer_update_callback: tigana is now lost (was member)
> Jan 10 11:06:33 [19261] orana crmd: warning: match_down_event: No match for shutdown action on tigana
> Jan 10 11:06:33 [19261] orana crmd: notice: peer_update_callback: Stonith/shutdown of tigana not matched
> Jan 10 11:06:33 [19261] orana crmd: info: crm_update_peer_join: peer_update_callback: Node tigana[2] - join-2 phase 4 -> 0
> Jan 10 11:06:33 [19261] orana crmd: info: abort_transition_graph: Transition aborted: Node failure (source=peer_update_callback:240, 1)
> Jan 10 11:06:33 corosync [CPG ] chosen downlist: sender r(0) ip(7.7.7.1) ; members(old:2 left:1)

These are the logs from the standby node (the new active). The kernel panic was triggered at 11:06:00 on the other node, and the totem membership change is reported here at 11:06:31.

30 secs is the cluster recheck timer.

Regards,
Ashutosh

On Wed, Jan 10, 2018 at 3:12 PM, <users-requ...@clusterlabs.org> wrote:
>
> Message: 1
> Date: Wed, 10 Jan 2018 12:43:46 +0530
> From: ashutosh tiwari <ashutosh.k...@gmail.com>
> To: users@clusterlabs.org
> Subject: [ClusterLabs] corosync taking almost 30 secs to detect node
>          failure in case of kernel panic
>
> Hi,
>
> We have a two-node cluster running in active/standby mode, with IPMI
> fencing configured.
>
> In case of a kernel panic on the active node, the standby node detects
> the node failure in around 30 seconds, which delays the standby node
> taking over the active role.
>
> We have the totem token timeout as 1 msecs.
> Please let us know whether there is any other configuration controlling
> membership detection.
>
> Software versions:
>
> CentOS 6.7
> corosync-1.4.7-5.el6.x86_64
> pacemaker-1.1.14-8.el6.x86_64
>
> Thanks and Regards,
> Ashutosh Tiwari
>
> --
>
> Message: 2
> Date: Wed, 10 Jan 2018 08:32:16 +0100
> From: "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de>
> To: <users@clusterlabs.org>
> Subject: [ClusterLabs] Antw: corosync taking almost 30 secs to detect
>          node failure in case of kernel panic
>
> Hi!
>
> Maybe define "detecting node failure". Could it be that your 30 seconds
> are between detection and reaction? Logs would help here, too.
>
> Regards,
> Ulrich
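[Editor's note: for reference, the membership-detection window in this corosync generation is governed by the totem section of /etc/corosync/corosync.conf. A minimal sketch, with example values only (the thread's actual token value is garbled above); `consensus` defaults to 1.2 × token when unset:]

```
totem {
    version: 2
    # milliseconds to wait for the token before declaring it lost
    token: 1000
    # milliseconds to wait for consensus before forming a new membership
    consensus: 1200
}
```

A new membership (and hence Pacemaker's reaction to the dead peer) can only form after token loss plus the consensus round, so both values bound the detection time.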
[ClusterLabs] Antw: pacemaker reports monitor timeout while CPU is high
Hi!

I can only speak for myself: in former times with HP-UX, we had severe performance problems when the load was in the range of 8 to 14 (I/O waits not included, averaged over all logical CPUs), while on Linux we only get problems with a load above 40 or so (I/O included, summed over all logical CPUs, of which we have 24). Also, I/O waits cause cluster timeouts before CPU load actually matters (for us). So with a load above 400 (not knowing your number of CPUs), it should not be that unusual. What is the number of threads in your system at that time?

It might be worth the effort to bind the cluster processes to specific CPUs and keep other tasks away from those, but I don't have experience with that.

I guess the "High CPU load detected" message triggers some internal suspend in the cluster engine (assuming the cluster engine caused the high load). Of course, for "external" load that measure won't help...

Regards,
Ulrich

>>> ??? wrote on 10.01.2018 at 10:40 in message <4dc98a5d9be144a78fb9a18721743...@ex01.highgo.com>:
> Hello,
>
> This issue only appears when we run a performance test and the CPU load is
> high. The cluster state and log are below. Pacemaker restarts the Slave-side
> pgsql-ha resource about every two minutes.
>
> Take the following scenario for example (when the pgsqlms RA is called, we
> print the log "execute the command start (command)"; when the command
> returns, we print the log "execute the command stop (command) (result)"):
>
> 1. We can see that Pacemaker calls "pgsqlms monitor" about every 15
>    seconds, and it returns $OCF_SUCCESS.
>
> 2. It calls the monitor command again at 13:56:16, and then reports a
>    timeout error at 13:56:18. That is only 2 seconds, but it reports
>    "timeout=1ms".
>
> 3. In other logs, sometimes after 15 minutes there is no "execute the
>    command start monitor" printed, and the timeout error is reported
>    directly.
>
> Could you please tell us how to debug or resolve such an issue?
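[Editor's note: the CPU binding suggested above can be done on EL6 (no systemd) with libcgroup. A hedged sketch, not from the thread; the group name, CPU numbers, and process names are examples only:]

```
# /etc/cgconfig.conf -- reserve CPUs 0-1 for cluster daemons (example values)
group cluster {
    cpuset {
        cpuset.cpus = "0-1";
        cpuset.mems = "0";
    }
}

# /etc/cgrules.conf -- place corosync/pacemaker processes into that group
*:corosync      cpuset  cluster
*:pacemakerd    cpuset  cluster
```

Keeping other workloads out of those CPUs (the harder half of the suggestion) would need a second, larger group for everything else; whether this helps at all under heavy I/O wait is untested.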
[ClusterLabs] pacemaker reports monitor timeout while CPU is high
Hello,

This issue only appears when we run a performance test and the CPU load is high. The cluster state and log are below. Pacemaker restarts the Slave-side pgsql-ha resource about every two minutes.

Take the following scenario for example (when the pgsqlms RA is called, we print the log "execute the command start (command)"; when the command returns, we print the log "execute the command stop (command) (result)"):

1. We can see that Pacemaker calls "pgsqlms monitor" about every 15 seconds, and it returns $OCF_SUCCESS.

2. It calls the monitor command again at 13:56:16, and then reports a timeout error at 13:56:18. That is only 2 seconds, but it reports "timeout=1ms".

3. In other logs, sometimes after 15 minutes there is no "execute the command start monitor" printed, and the timeout error is reported directly.

Could you please tell us how to debug or resolve such an issue?

The log:

Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command start monitor
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role start
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: _confirm_role stop 0
Jan 10 13:55:35 sds2 pgsqlms(pgsqld)[5240]: INFO: execute the command stop monitor 0
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command start monitor
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role start
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: _confirm_role stop 0
Jan 10 13:55:52 sds2 pgsqlms(pgsqld)[5477]: INFO: execute the command stop monitor 0
Jan 10 13:56:02 sds2 crmd[26096]: notice: High CPU load detected: 426.77
Jan 10 13:56:16 sds2 pgsqlms(pgsqld)[5606]: INFO: execute the command start monitor
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000 process (PID 5606) timed out
Jan 10 13:56:18 sds2 lrmd[26093]: warning: pgsqld_monitor_16000:5606 - timed out after 1ms
Jan 10 13:56:18 sds2 crmd[26096]: error: Result of monitor operation for pgsqld on db2: Timed Out | call=102 key=pgsqld_monitor_16000 timeout=1ms
Jan 10 13:56:18 sds2 crmd[26096]: notice: db2-pgsqld_monitor_16000:102 [ /tmp:5432 - accepting connections\n ]
Jan 10 13:56:18 sds2 crmd[26096]: notice: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op monitor for pgsqld:0 on db2: unknown error (1)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Processing failed op start for pgsqld:1 on db1: unknown error (1)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 100 failures (max=100)
Jan 10 13:56:19 sds2 pengine[26095]: warning: Forcing pgsql-ha away from db1 after 100 failures (max=100)
Jan 10 13:56:19 sds2 pengine[26095]: notice: Recover pgsqld:0#011(Slave db2)
Jan 10 13:56:19 sds2 pengine[26095]: notice: Calculated transition 37, saving inputs in /var/lib/pacemaker/pengine/pe-input-1251.bz2

The cluster configuration:

2 nodes and 13 resources configured

Online: [ db1 db2 ]

Full list of resources:

 Clone Set: dlm-clone [dlm]
     Started: [ db1 db2 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ db1 db2 ]
 ipmi_node1 (stonith:fence_ipmilan): Started db2
 ipmi_node2 (stonith:fence_ipmilan): Started db1
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ db1 db2 ]
 Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ db1 ]
     Slaves: [ db2 ]
 Resource Group: mastergroup
     db1-vip (ocf::heartbeat:IPaddr2): Started
     rep-vip (ocf::heartbeat:IPaddr2): Started
 Resource Group: slavegroup
     db2-vip (ocf::heartbeat:IPaddr2): Started

pcs resource show pgsql-ha
 Master: pgsql-ha
  Meta Attrs: interleave=true notify=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/local/pgsql/bin pgdata=/home/postgres/data
   Operations: start interval=0s timeout=160s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
               promote interval=0s timeout=130s (pgsqld-promote-interval-0s)
               demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
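[Editor's note: the Slave-role monitor above is configured with timeout=10s, which is tight on a host reporting a load of 426.77. A hedged configuration sketch of raising it with pcs; the 60s value is an example, not a recommendation from the thread, and raising a timeout only masks the load problem rather than fixing it:]

```
# Raise the Slave-role monitor timeout on the pgsqld resource
# (example value; must still fit inside the 16s monitor interval's intent)
pcs resource update pgsqld op monitor interval=16s role=Slave timeout=60s
```

Verify afterwards with `pcs resource show pgsql-ha` that only the intended operation changed.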