On 14 Nov 2013, at 6:13 am, Andrey Groshev <gre...@yandex.ru> wrote:

> 
> 
> 13.11.2013, 03:22, "Andrew Beekhof" <and...@beekhof.net>:
>> On 12 Nov 2013, at 4:42 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>> 
>>>  11.11.2013, 03:44, "Andrew Beekhof" <and...@beekhof.net>:
>>>>  On 8 Nov 2013, at 7:49 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>   Hi, all!
>>>>>   I need help. I do not understand why this has stopped working.
>>>>>   The same configuration works on another cluster, but that one runs corosync 1.
>>>>> 
>>>>>   So... a PostgreSQL cluster with master/slave.
>>>>>   Classic config, as in the wiki.
>>>>>   I build the cluster, start it, and it works.
>>>>>   Then I kill postgres on the master with signal 6, simulating a "no disk space left" failure:
>>>>> 
>>>>>   # pkill -6 postgres
>>>>>   # ps axuww|grep postgres
>>>>>   root      9032  0.0  0.1 103236   860 pts/0    S+   00:37   0:00 grep 
>>>>> postgres
>>>>> 
>>>>>   PostgreSQL dies, but crm_mon shows that the master is still running.
>>>>> 
>>>>>   Last updated: Fri Nov  8 00:42:08 2013
>>>>>   Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on 
>>>>> dev-cluster2-node4
>>>>>   Stack: corosync
>>>>>   Current DC: dev-cluster2-node4 (172793107) - partition with quorum
>>>>>   Version: 1.1.10-1.el6-368c726
>>>>>   3 Nodes configured
>>>>>   7 Resources configured
>>>>> 
>>>>>   Node dev-cluster2-node2 (172793105): online
>>>>>          pingCheck       (ocf::pacemaker:ping):  Started
>>>>>          pgsql   (ocf::heartbeat:pgsql): Started
>>>>>   Node dev-cluster2-node3 (172793106): online
>>>>>          pingCheck       (ocf::pacemaker:ping):  Started
>>>>>          pgsql   (ocf::heartbeat:pgsql): Started
>>>>>   Node dev-cluster2-node4 (172793107): online
>>>>>          pgsql   (ocf::heartbeat:pgsql): Master
>>>>>          pingCheck       (ocf::pacemaker:ping):  Started
>>>>>          VirtualIP       (ocf::heartbeat:IPaddr2):       Started
>>>>> 
>>>>>   Node Attributes:
>>>>>   * Node dev-cluster2-node2:
>>>>>      + default_ping_set                  : 100
>>>>>      + master-pgsql                      : -INFINITY
>>>>>      + pgsql-data-status                 : STREAMING|ASYNC
>>>>>      + pgsql-status                      : HS:async
>>>>>   * Node dev-cluster2-node3:
>>>>>      + default_ping_set                  : 100
>>>>>      + master-pgsql                      : -INFINITY
>>>>>      + pgsql-data-status                 : STREAMING|ASYNC
>>>>>      + pgsql-status                      : HS:async
>>>>>   * Node dev-cluster2-node4:
>>>>>      + default_ping_set                  : 100
>>>>>      + master-pgsql                      : 1000
>>>>>      + pgsql-data-status                 : LATEST
>>>>>      + pgsql-master-baseline             : 0000000002000078
>>>>>      + pgsql-status                      : PRI
>>>>> 
>>>>>   Migration summary:
>>>>>   * Node dev-cluster2-node4:
>>>>>   * Node dev-cluster2-node2:
>>>>>   * Node dev-cluster2-node3:
>>>>> 
>>>>>   Tickets:
>>>>> 
>>>>>   CONFIG:
>>>>>   node $id="172793105" dev-cluster2-node2. \
>>>>>          attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>>>   node $id="172793106" dev-cluster2-node3. \
>>>>>          attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>>>   node $id="172793107" dev-cluster2-node4. \
>>>>>          attributes pgsql-data-status="LATEST"
>>>>>   primitive VirtualIP ocf:heartbeat:IPaddr2 \
>>>>>          params ip="10.76.157.194" \
>>>>>          op start interval="0" timeout="60s" on-fail="stop" \
>>>>>          op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>>>          op stop interval="0" timeout="60s" on-fail="block"
>>>>>   primitive pgsql ocf:heartbeat:pgsql \
>>>>>          params pgctl="/usr/pgsql-9.1/bin/pg_ctl" 
>>>>> psql="/usr/pgsql-9.1/bin/psql" pgdata="/var/lib/pgsql/9.1/data" 
>>>>> tmpdir="/tmp/pg" start_opt="-p 5432" 
>>>>> logfile="/var/lib/pgsql/9.1//pgstartup.log" rep_mode="async" node_list=" 
>>>>> dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. " 
>>>>> restore_command="gzip -cd 
>>>>> /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz > %p" 
>>>>> primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 
>>>>> keepalives_count=5" master_ip="10.76.157.194" \
>>>>>          op start interval="0" timeout="60s" on-fail="restart" \
>>>>>          op monitor interval="5s" timeout="61s" on-fail="restart" \
>>>>>          op monitor interval="1s" role="Master" timeout="62s" 
>>>>> on-fail="restart" \
>>>>>          op promote interval="0" timeout="63s" on-fail="restart" \
>>>>>          op demote interval="0" timeout="64s" on-fail="stop" \
>>>>>          op stop interval="0" timeout="65s" on-fail="block" \
>>>>>          op notify interval="0" timeout="66s"
>>>>>   primitive pingCheck ocf:pacemaker:ping \
>>>>>          params name="default_ping_set" host_list="10.76.156.1" 
>>>>> multiplier="100" \
>>>>>          op start interval="0" timeout="60s" on-fail="restart" \
>>>>>          op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>>>          op stop interval="0" timeout="60s" on-fail="ignore"
>>>>>   ms msPostgresql pgsql \
>>>>>          meta master-max="1" master-node-max="1" clone-node-max="1" 
>>>>> notify="true" target-role="Master" clone-max="3"
>>>>>   clone clnPingCheck pingCheck \
>>>>>          meta clone-max="3"
>>>>>   location l0_DontRunPgIfNotPingGW msPostgresql \
>>>>>          rule $id="l0_DontRunPgIfNotPingGW-rule" -inf: not_defined 
>>>>> default_ping_set or default_ping_set lt 100
>>>>>   colocation r0_StartPgIfPingGW inf: msPostgresql clnPingCheck
>>>>>   colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master
>>>>>   order rsc_order-1 0: clnPingCheck msPostgresql
>>>>>   order rsc_order-2 0: msPostgresql:promote VirtualIP:start 
>>>>> symmetrical=false
>>>>>   order rsc_order-3 0: msPostgresql:demote VirtualIP:stop 
>>>>> symmetrical=false
>>>>>   property $id="cib-bootstrap-options" \
>>>>>          dc-version="1.1.10-1.el6-368c726" \
>>>>>          cluster-infrastructure="corosync" \
>>>>>          stonith-enabled="false" \
>>>>>          no-quorum-policy="stop"
>>>>>   rsc_defaults $id="rsc-options" \
>>>>>          resource-stickiness="INFINITY" \
>>>>>          migration-threshold="1"
>>>>> 
>>>>>   Tell me where to look - why does pacemaker not work?
>>>>  You might want to follow some of the steps at:
>>>> 
>>>>     http://blog.clusterlabs.org/blog/2013/debugging-pacemaker/
>>>> 
>>>>  under the heading "Resource-level failures".
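
For what it's worth, the quickest way to confirm a resource-level failure is to run the agent's monitor action by hand on the suspect node and look at its exit code. A rough sketch, assuming the standard OCF layout on CentOS and reusing the parameter values from your config (the agent may want more of its parameters exported):

   # export OCF_ROOT=/usr/lib/ocf
   # export OCF_RESKEY_pgctl=/usr/pgsql-9.1/bin/pg_ctl
   # export OCF_RESKEY_psql=/usr/pgsql-9.1/bin/psql
   # export OCF_RESKEY_pgdata=/var/lib/pgsql/9.1/data
   # /usr/lib/ocf/resource.d/heartbeat/pgsql monitor; echo $?

An exit code of 0 means running, 7 not running, 8 running as master. If the agent reports "not running" here while crm_mon still shows the resource as started, the agent itself is fine and the recurring monitor is simply not being executed on that node.
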
>>>  Yes. Thank you.
>>>  I've seen this article and am now studying it in more detail.
>>>  There is a lot of information in the logs, so it is hard to tell which 
>>> entries are the error itself and which are only its consequences.
>>>  Now I'm trying to figure it out.
>>> 
>>>  BUT...
>>>  So far I can say with certainty that the monitor action of the MS (pgsql) 
>>> resource agent is called ONLY on the node where pacemaker was started last.
>> 
>> It looks like you're hitting 
>> https://github.com/beekhof/pacemaker/commit/58962338
>> Since you appear to be on rhel6 (or a clone of rhel6), can I suggest you use 
>> the 1.1.10 packages that come with 6.4?
>> They include the above patch.
> 
> I am already using it (built from source two weeks ago):

Upstream 1.1.10 does not include the above patch.
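
So a tree built from the upstream tag will not help here; the fix is carried as a patch in the distribution packages. Something along these lines (an untested sketch; package names as shipped in CentOS 6), after clearing out the source-built copies:

   # yum install pacemaker pacemaker-cli resource-agents
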

> * pacemaker 1.1.10
> * resource-agents 3.9.5
> * corosync 2.3.2
> * libqb 0.16
> on CentOS 6.4
> 
> The same config works on pacemaker 1.1.9/corosync 1.4.5.
> Not ideal, but there is no such problem there.
> 
> My first idea was to move target-role=Master from the MS resource to the 
> pgsql primitive.
> That even works.
> But after a crash that kills the main PostgreSQL process, the same thing happens.
> Today's experiments showed that this behavior starts after I add 
> "notify=true" to the MS resource.
> But the pgsql primitive does not work properly without "notify" messages.
> For now I am frustrated :(
> 
> 
>> Also, just to be sure. Are you expecting monitor operations to detect when 
>> you started a resource manually?
>> If so, you'll need a monitor operation with role=Stopped. We don't do that 
>> by default.
> 
> I expect the resources to be monitored all the time; otherwise, how can I 
> keep track of them?

When a node joins the cluster, we check to see if it had any resources running.
If no-one has the resource running, we pick a node and start it there.

If a malicious admin then starts the resource somewhere else, manually, we 
would not normally detect this.
It is assumed that someone trusted with root privileges would not do this on 
purpose.

However, if you do not trust your admins, as explained above, you can configure 
pacemaker to periodically re-check the node to detect and recover from this 
situation.
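
Concretely, that re-check is just an extra recurring monitor operation with role="Stopped" on the pgsql primitive, along these lines (the 30s interval and 60s timeout are arbitrary values I picked; the interval only needs to differ from the resource's other monitor intervals):

   primitive pgsql ocf:heartbeat:pgsql \
          params ... \
          op monitor interval="5s" timeout="61s" on-fail="restart" \
          op monitor interval="1s" role="Master" timeout="62s" on-fail="restart" \
          op monitor interval="30s" role="Stopped" timeout="60s" \
          ...

Pacemaker will then run that monitor on every node where it believes the resource is not running, and flag a failure if it finds it running there anyway.
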

> Or I do not quite understand the question.
> 
> 
>>>>  'crm_mon -o' might be a good source of information too.
>>>  In its output I see that my resources are supposedly functioning normally.
>>> 
>>>  # crm_mon -o1
>>>  Last updated: Tue Nov 12 09:27:16 2013
>>>  Last change: Tue Nov 12 00:08:35 2013 via crm_attribute on 
>>> dev-cluster2-node2
>>>  Stack: corosync
>>>  Current DC: dev-cluster2-node2 (172793105) - partition with quorum
>>>  Version: 1.1.10-1.el6-368c726
>>>  3 Nodes configured
>>>  337 Resources configured
>>> 
>>>  Online: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
>>> 
>>>  Clone Set: clonePing [pingCheck]
>>>      Started: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
>>>  Master/Slave Set: msPgsql [pgsql]
>>>      Masters: [ dev-cluster2-node2 ]
>>>      Slaves: [ dev-cluster2-node3 dev-cluster2-node4 ]
>>>  VirtualIP      (ocf::heartbeat:IPaddr2):       Started dev-cluster2-node2
>>> 
>>>  Operations:
>>>  * Node dev-cluster2-node2:
>>>    pingCheck: migration-threshold=1
>>>     + (20) start: rc=0 (ok)
>>>     + (23) monitor: interval=10000ms rc=0 (ok)
>>>    pgsql: migration-threshold=1
>>>     + (41) promote: rc=0 (ok)
>>>     + (87) monitor: interval=1000ms rc=8 (master)
>>>    VirtualIP: migration-threshold=1
>>>     + (49) start: rc=0 (ok)
>>>     + (52) monitor: interval=10000ms rc=0 (ok)
>>>  * Node dev-cluster2-node3:
>>>    pingCheck: migration-threshold=1
>>>     + (20) start: rc=0 (ok)
>>>     + (23) monitor: interval=10000ms rc=0 (ok)
>>>    pgsql: migration-threshold=1
>>>     + (26) start: rc=0 (ok)
>>>     + (32) monitor: interval=10000ms rc=0 (ok)
>>>  * Node dev-cluster2-node4:
>>>    pingCheck: migration-threshold=1
>>>     + (20) start: rc=0 (ok)
>>>     + (23) monitor: interval=10000ms rc=0 (ok)
>>>    pgsql: migration-threshold=1
>>>     + (26) start: rc=0 (ok)
>>>     + (32) monitor: interval=10000ms rc=0 (ok)
>>> 
>>>  In reality, the PG master and the penultimate PG slave have by now been 
>>> killed (with signal 4|6).
>>>  IMHO, even if I have something configured incorrectly, the inability to 
>>> monitor a resource should cause a fatal error.
>>>  Or is there a reason not to do so?
>>> 
