15.11.2013, 03:19, "Andrew Beekhof" <and...@beekhof.net>:
> On 14 Nov 2013, at 5:06 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>
>> 14.11.2013, 02:22, "Andrew Beekhof" <and...@beekhof.net>:
>>> On 14 Nov 2013, at 6:13 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>> 13.11.2013, 03:22, "Andrew Beekhof" <and...@beekhof.net>:
>>>>> On 12 Nov 2013, at 4:42 pm, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>> 11.11.2013, 03:44, "Andrew Beekhof" <and...@beekhof.net>:
>>>>>>> On 8 Nov 2013, at 7:49 am, Andrey Groshev <gre...@yandex.ru> wrote:
>>>>>>>> Hi, PPL!
>>>>>>>> I need help. I do not understand why it has stopped working.
>>>>>>>> This configuration works on another cluster, but that one is on corosync 1.
>>>>>>>>
>>>>>>>> So... a PostgreSQL master/slave cluster.
>>>>>>>> Classic config, as in the wiki.
>>>>>>>> I build the cluster and start it; it works.
>>>>>>>> Next I kill postgres on the Master with signal 6, as if "no disk space left":
>>>>>>>>
>>>>>>>> # pkill -6 postgres
>>>>>>>> # ps axuww | grep postgres
>>>>>>>> root      9032  0.0  0.1 103236   860 pts/0    S+   00:37   0:00 grep postgres
>>>>>>>>
>>>>>>>> PostgreSQL dies, but crm_mon shows that the master is still running.
>>>>>>>>
>>>>>>>> Last updated: Fri Nov  8 00:42:08 2013
>>>>>>>> Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on dev-cluster2-node4
>>>>>>>> Stack: corosync
>>>>>>>> Current DC: dev-cluster2-node4 (172793107) - partition with quorum
>>>>>>>> Version: 1.1.10-1.el6-368c726
>>>>>>>> 3 Nodes configured
>>>>>>>> 7 Resources configured
>>>>>>>>
>>>>>>>> Node dev-cluster2-node2 (172793105): online
>>>>>>>>         pingCheck   (ocf::pacemaker:ping):    Started
>>>>>>>>         pgsql       (ocf::heartbeat:pgsql):   Started
>>>>>>>> Node dev-cluster2-node3 (172793106): online
>>>>>>>>         pingCheck   (ocf::pacemaker:ping):    Started
>>>>>>>>         pgsql       (ocf::heartbeat:pgsql):   Started
>>>>>>>> Node dev-cluster2-node4 (172793107): online
>>>>>>>>         pgsql       (ocf::heartbeat:pgsql):   Master
>>>>>>>>         pingCheck   (ocf::pacemaker:ping):    Started
>>>>>>>>         VirtualIP   (ocf::heartbeat:IPaddr2): Started
>>>>>>>>
>>>>>>>> Node Attributes:
>>>>>>>> * Node dev-cluster2-node2:
>>>>>>>>     + default_ping_set      : 100
>>>>>>>>     + master-pgsql          : -INFINITY
>>>>>>>>     + pgsql-data-status     : STREAMING|ASYNC
>>>>>>>>     + pgsql-status          : HS:async
>>>>>>>> * Node dev-cluster2-node3:
>>>>>>>>     + default_ping_set      : 100
>>>>>>>>     + master-pgsql          : -INFINITY
>>>>>>>>     + pgsql-data-status     : STREAMING|ASYNC
>>>>>>>>     + pgsql-status          : HS:async
>>>>>>>> * Node dev-cluster2-node4:
>>>>>>>>     + default_ping_set      : 100
>>>>>>>>     + master-pgsql          : 1000
>>>>>>>>     + pgsql-data-status     : LATEST
>>>>>>>>     + pgsql-master-baseline : 0000000002000078
>>>>>>>>     + pgsql-status          : PRI
>>>>>>>>
>>>>>>>> Migration summary:
>>>>>>>> * Node dev-cluster2-node4:
>>>>>>>> * Node dev-cluster2-node2:
>>>>>>>> * Node dev-cluster2-node3:
>>>>>>>>
>>>>>>>> Tickets:
>>>>>>>>
>>>>>>>> CONFIG:
>>>>>>>> node $id="172793105" dev-cluster2-node2. \
>>>>>>>>         attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>>>>>> node $id="172793106" dev-cluster2-node3. \
>>>>>>>>         attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>>>>>> node $id="172793107" dev-cluster2-node4. \
>>>>>>>>         attributes pgsql-data-status="LATEST"
>>>>>>>> primitive VirtualIP ocf:heartbeat:IPaddr2 \
>>>>>>>>         params ip="10.76.157.194" \
>>>>>>>>         op start interval="0" timeout="60s" on-fail="stop" \
>>>>>>>>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>>>>>>         op stop interval="0" timeout="60s" on-fail="block"
>>>>>>>> primitive pgsql ocf:heartbeat:pgsql \
>>>>>>>>         params pgctl="/usr/pgsql-9.1/bin/pg_ctl" psql="/usr/pgsql-9.1/bin/psql" \
>>>>>>>>                pgdata="/var/lib/pgsql/9.1/data" tmpdir="/tmp/pg" start_opt="-p 5432" \
>>>>>>>>                logfile="/var/lib/pgsql/9.1//pgstartup.log" rep_mode="async" \
>>>>>>>>                node_list=" dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. " \
>>>>>>>>                restore_command="gzip -cd /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz > %p" \
>>>>>>>>                primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
>>>>>>>>                master_ip="10.76.157.194" \
>>>>>>>>         op start interval="0" timeout="60s" on-fail="restart" \
>>>>>>>>         op monitor interval="5s" timeout="61s" on-fail="restart" \
>>>>>>>>         op monitor interval="1s" role="Master" timeout="62s" on-fail="restart" \
>>>>>>>>         op promote interval="0" timeout="63s" on-fail="restart" \
>>>>>>>>         op demote interval="0" timeout="64s" on-fail="stop" \
>>>>>>>>         op stop interval="0" timeout="65s" on-fail="block" \
>>>>>>>>         op notify interval="0" timeout="66s"
>>>>>>>> primitive pingCheck ocf:pacemaker:ping \
>>>>>>>>         params name="default_ping_set" host_list="10.76.156.1" multiplier="100" \
>>>>>>>>         op start interval="0" timeout="60s" on-fail="restart" \
>>>>>>>>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>>>>>>         op stop interval="0" timeout="60s" on-fail="ignore"
>>>>>>>> ms msPostgresql pgsql \
>>>>>>>>         meta master-max="1" master-node-max="1" clone-node-max="1" notify="true" target-role="Master" clone-max="3"
>>>>>>>> clone clnPingCheck pingCheck \
>>>>>>>>         meta clone-max="3"
>>>>>>>> location l0_DontRunPgIfNotPingGW msPostgresql \
>>>>>>>>         rule $id="l0_DontRunPgIfNotPingGW-rule" -inf: not_defined default_ping_set or default_ping_set lt 100
>>>>>>>> colocation r0_StartPgIfPingGW inf: msPostgresql clnPingCheck
>>>>>>>> colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master
>>>>>>>> order rsc_order-1 0: clnPingCheck msPostgresql
>>>>>>>> order rsc_order-2 0: msPostgresql:promote VirtualIP:start symmetrical=false
>>>>>>>> order rsc_order-3 0: msPostgresql:demote VirtualIP:stop symmetrical=false
>>>>>>>> property $id="cib-bootstrap-options" \
>>>>>>>>         dc-version="1.1.10-1.el6-368c726" \
>>>>>>>>         cluster-infrastructure="corosync" \
>>>>>>>>         stonith-enabled="false" \
>>>>>>>>         no-quorum-policy="stop"
>>>>>>>> rsc_defaults $id="rsc-options" \
>>>>>>>>         resource-stickiness="INFINITY" \
>>>>>>>>         migration-threshold="1"
>>>>>>>>
>>>>>>>> Tell me where to look - why does pacemaker not react?
>>>>>>>
>>>>>>> You might want to follow some of the steps at:
>>>>>>>
>>>>>>> http://blog.clusterlabs.org/blog/2013/debugging-pacemaker/
>>>>>>>
>>>>>>> under the heading "Resource-level failures".
>>>>>>
>>>>>> Yes, thank you.
>>>>>> I have seen this article and am now studying it in more detail.
>>>>>> There is a lot of information in the logs, so it is difficult to tell what is the error and what is only a consequence of it.
>>>>>> Now I'm trying to figure it out.
>>>>>>
>>>>>> BUT...
>>>>>> For now I can say with certainty that the RA's monitor action for the MS resource (pgsql) is called ONLY on the node on which Pacemaker was started last.
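
[Aside: the "Resource-level failures" section of that post essentially comes down to
running the agent by hand and looking at its exit code. A rough sketch for this
particular setup - the OCF_RESKEY_* values are copied from the config above, and the
agent path assumes a stock resource-agents install:

    # as root, on the node that is supposed to be running the resource
    export OCF_ROOT=/usr/lib/ocf
    export OCF_RESKEY_pgctl=/usr/pgsql-9.1/bin/pg_ctl
    export OCF_RESKEY_psql=/usr/pgsql-9.1/bin/psql
    export OCF_RESKEY_pgdata=/var/lib/pgsql/9.1/data
    /usr/lib/ocf/resource.d/heartbeat/pgsql monitor ; echo "rc=$?"
    # rc=0: running (slave), rc=8: running as master, rc=7: not running

If the agent returns 7 after "pkill -6 postgres" while crm_mon still shows Master, the
agent itself is behaving and the problem is that the recurring monitor is not being
scheduled on that node.]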
>>>>> It looks like you're hitting
>>>>> https://github.com/beekhof/pacemaker/commit/58962338
>>>>> Since you appear to be on rhel6 (or a clone of rhel6), can I suggest you use the 1.1.10 packages that come with 6.4?
>>>>> They include the above patch.
>>>> I already use that version (built from source two weeks ago):
>>> Upstream 1.1.10 does not include the above patch.
>> Strange - in my source code those lines exist.
>> Maybe I am not building it correctly.
>> I have the source from "master", and the built RPM says 1.1.10.
>>>> * pacemaker 1.1.10
>>>> * resource-agents 3.9.5
>>>> * corosync 2.3.2
>>>> * libqb 0.16
>>>> * CentOS 6.4
>>>>
>>>> The same config works on pacemaker 1.1.9/corosync 1.4.5.
>>>> Not ideal, but it has no such problem.
>>>>
>>>> My first idea was to move target-role=Master from the MS to the pgsql primitive.
>>>> That even works - until I crash the main PostgreSQL process by killing it, and then the same thing happens again.
>>>> Today's experiments showed that this behaviour starts after I add "notify=true" to the MS.
>>>> But the pgsql primitive does not work properly without "notify" messages.
>>>> For now I am frustrated :(
>>>>> Also, just to be sure: are you expecting monitor operations to detect when you started a resource manually?
>>>>> If so, you'll need a monitor operation with role=Stopped. We don't do that by default.
>>>> I expect the resources to be monitored all the time - otherwise how can I control them?
>>> When a node joins the cluster, we check to see if it had any resources running.
>>> If no-one has the resource running, we pick a node and start it there.
>>>
>>> If a malicious admin then starts the resource somewhere else, manually, we would not normally detect this.
>>> It is assumed that someone trusted with root privileges would not do this on purpose.
>>>
>>> However, if you do not trust your admins, as explained above, you can configure pacemaker to periodically re-check the node to detect and recover from this situation.
>> These are individual servers allocated for development.
>
> Then you likely have a random version from Git.
> That's probably not a great idea.
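
[Aside: the monitor operation with role="Stopped" mentioned above would be one more op
line on the pgsql primitive. A rough sketch in crm shell syntax - the 30s interval is an
arbitrary choice, and "params ..." stands for the parameter and op lines already shown in
the config quoted earlier:

    primitive pgsql ocf:heartbeat:pgsql \
            params ... \
            op monitor interval="5s" timeout="61s" on-fail="restart" \
            op monitor interval="1s" role="Master" timeout="62s" on-fail="restart" \
            op monitor interval="30s" role="Stopped" timeout="60s" on-fail="restart" \
            ...

With that in place the cluster also runs the monitor on nodes where it believes the
resource is stopped, so an instance started behind the cluster's back gets detected and
handled by the normal recovery policies.]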
Ok, built 1.1.11 - works as expected!

>
>> There is only one malicious admin here - it's me.
>> But I still trust myself. :)
>>
>> And even if I do start something "by hand", each time my cluster deployment utility:
>> * removes all the cluster packages that were installed before (libqb, corosync 1/2, cman, pacemaker, resource-agents, pcs, crmsh, cluster-glue);
>> * deletes all files that might remain from a previous installation (both configs and temp files);
>> * installs my RPMs;
>> * stops and disables all services that may interfere with the operation of the cluster;
>> * clears the old configuration and generates new corosync and pacemaker configs;
>> * writes them out and brings the cluster up node by node, synchronizing each node in turn.
>>
>> With this procedure it is hard for anything forgotten to survive.
>> I would reinstall the OS as well if I had access one level above :)
>>>> Or perhaps I do not quite understand the question.
>>>>>>> 'crm_mon -o' might be a good source of information too.
>>>>>> That is how I see that my resources are supposedly functioning normally:
>>>>>>
>>>>>> # crm_mon -o1
>>>>>> Last updated: Tue Nov 12 09:27:16 2013
>>>>>> Last change: Tue Nov 12 00:08:35 2013 via crm_attribute on dev-cluster2-node2
>>>>>> Stack: corosync
>>>>>> Current DC: dev-cluster2-node2 (172793105) - partition with quorum
>>>>>> Version: 1.1.10-1.el6-368c726
>>>>>> 3 Nodes configured
>>>>>> 337 Resources configured
>>>>>>
>>>>>> Online: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
>>>>>>
>>>>>> Clone Set: clonePing [pingCheck]
>>>>>>      Started: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
>>>>>> Master/Slave Set: msPgsql [pgsql]
>>>>>>      Masters: [ dev-cluster2-node2 ]
>>>>>>      Slaves: [ dev-cluster2-node3 dev-cluster2-node4 ]
>>>>>> VirtualIP        (ocf::heartbeat:IPaddr2):       Started dev-cluster2-node2
>>>>>>
>>>>>> Operations:
>>>>>> * Node dev-cluster2-node2:
>>>>>>    pingCheck: migration-threshold=1
>>>>>>     + (20) start: rc=0 (ok)
>>>>>>     + (23) monitor: interval=10000ms rc=0 (ok)
>>>>>>    pgsql: migration-threshold=1
>>>>>>     + (41) promote: rc=0 (ok)
>>>>>>     + (87) monitor: interval=1000ms rc=8 (master)
>>>>>>    VirtualIP: migration-threshold=1
>>>>>>     + (49) start: rc=0 (ok)
>>>>>>     + (52) monitor: interval=10000ms rc=0 (ok)
>>>>>> * Node dev-cluster2-node3:
>>>>>>    pingCheck: migration-threshold=1
>>>>>>     + (20) start: rc=0 (ok)
>>>>>>     + (23) monitor: interval=10000ms rc=0 (ok)
>>>>>>    pgsql: migration-threshold=1
>>>>>>     + (26) start: rc=0 (ok)
>>>>>>     + (32) monitor: interval=10000ms rc=0 (ok)
>>>>>> * Node dev-cluster2-node4:
>>>>>>    pingCheck: migration-threshold=1
>>>>>>     + (20) start: rc=0 (ok)
>>>>>>     + (23) monitor: interval=10000ms rc=0 (ok)
>>>>>>    pgsql: migration-threshold=1
>>>>>>     + (26) start: rc=0 (ok)
>>>>>>     + (32) monitor: interval=10000ms rc=0 (ok)
>>>>>>
>>>>>> In reality, by now the PG master and the second-to-last PG slave have been killed (signal 4|6).
>>>>>> IMHO, even if I have configured something incorrectly, the inability to monitor a resource should cause a fatal error.
>>>>>> Or is there a reason not to do so?
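
[Aside: a quick way to check whether the cluster actually registered a failure like the
one described above, using standard Pacemaker 1.1 tools; the resource name matches the
config quoted earlier:

    # pkill -6 postgres          # repeat the failure on the current master
    # crm_mon -1 -o -f           # one-shot status with operations and fail counts; a
                                 #   detected failure should show up as a failed monitor
                                 #   op and a non-zero failcount for pgsql
    # crm_resource --reprobe     # ask the cluster to re-probe resource state on all nodes

If no failed monitor ever appears, the recurring monitor is not running at all, which
matches the behaviour described in this thread.]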

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org