[Pacemaker] pgsql troubles.

2014-12-04 Thread steve

Good Afternoon,


I am having loads of trouble with pacemaker/corosync/postgres. Defining 
the symptoms is rather difficult. The primary one is that postgres 
starts as a slave on both nodes. I have tested the pgsql RA 
start/stop/status/monitor actions and they work from the command line 
after I set up the environment. I have not been able to get 
promote/demote to work; there are issues with NODENAME not being defined.
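
For reference, a sketch of exercising the pgsql RA by hand with the OCF
environment the cluster would normally provide (the parameter values below
mirror the primitive definition further down in this post; exact paths and
values are assumptions to adapt):

```shell
# Minimal OCF environment for manually invoking the pgsql agent.
# Values mirror the primitive config shown below; adjust as needed.
export OCF_ROOT=/usr/lib/ocf
export OCF_RESOURCE_INSTANCE=pgsql
export OCF_RESKEY_pgctl=/usr/bin/pg_ctl
export OCF_RESKEY_pgdata=/database/9.3
export OCF_RESKEY_rep_mode=sync
export OCF_RESKEY_node_list="tstdb03 tstdb04"
export OCF_RESKEY_master_ip=10.132.101.95
export OCF_RESKEY_CRM_meta_notify=true

# The RA typically derives NODENAME from the local node name
# (crm_node -n / uname -n, lower-cased), so it should match an
# entry in node_list exactly.
/usr/lib/ocf/resource.d/heartbeat/pgsql monitor; echo "monitor rc=$?"
```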


I am able to run postgres in master/slave mode outside of pacemaker.
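
For reference, one quick way to confirm the replication roles outside
Pacemaker is `pg_is_in_recovery()`, which returns false on the master and
true on a hot standby (host names are from this setup; user and
authentication are assumptions):

```shell
# Check which node is actually master/standby, independent of Pacemaker.
# 'f' = master, 't' = hot standby.
psql -h tstdb03 -U postgres -Atc "SELECT pg_is_in_recovery();"
psql -h tstdb04 -U postgres -Atc "SELECT pg_is_in_recovery();"
```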

I can provide additional logs but here is a start.

Distributor ID: Ubuntu
Description:Ubuntu 12.04.3 LTS
Release:12.04
Codename:   precise

latest version of the pgsql RA (as of yesterday)
pacemaker        1.1.6-2ubuntu3.1    HA cluster resource manager
corosync         1.4.2-2             Standards-based cluster framework (daemon and modules)
resource-agents  1:3.9.2-5ubuntu4.1  Cluster Resource Agents

I have upgraded the pgsql RA to the latest from git.



Last updated: Wed Nov 26 13:55:59 2014
Last change: Wed Nov 26 13:55:58 2014 via crm_attribute on tstdb04
Stack: openais
Current DC: tstdb04 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
4 Resources configured.


Online: [ tstdb03 tstdb04 ]

Full list of resources:

 Resource Group: master-group
 vip-master (ocf::heartbeat:IPaddr2):   Stopped
 vip-rep(ocf::heartbeat:IPaddr2):   Stopped
 Master/Slave Set: msPostgresql [pgsql]
 Slaves: [ tstdb04 ]
 Stopped: [ pgsql:0 ]

Node Attributes:
* Node tstdb03:
+ master-pgsql:0: -INFINITY
+ pgsql-data-status : DISCONNECT
* Node tstdb04:
+ master-pgsql:1: -INFINITY
+ pgsql-data-status : DISCONNECT

Migration summary:
* Node tstdb04:
* Node tstdb03:
   pgsql:0: migration-threshold=1 fail-count=100

Failed actions:
pgsql:0_start_0 (node=tstdb03, call=5, rc=1, status=complete): unknown error
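
Note that with migration-threshold=1 and a recorded fail-count on tstdb03,
the cluster will not retry the start there until the failure is cleared.
A sketch using crmsh (assumed, since the configuration below is in crmsh
syntax):

```shell
# After fixing the underlying start failure, clear the recorded
# fail-count so Pacemaker is allowed to start pgsql on tstdb03 again.
crm resource cleanup pgsql tstdb03
```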



config:
property \
 no-quorum-policy="ignore" \
 stonith-enabled="false" \
 crmd-transition-delay="0"

rsc_defaults \
 resource-stickiness="INFINITY" \
 migration-threshold="1"

group master-group \
   vip-master \
   vip-rep

primitive vip-master ocf:heartbeat:IPaddr2 \
 params \
 ip="10.132.101.95" \
 nic="eth0" \
 cidr_netmask="24" \
 op start   timeout="60s" interval="0"  on-fail="restart" \
 op monitor timeout="60s" interval="10s" on-fail="restart" \
 op stop    timeout="60s" interval="0"  on-fail="block"

primitive vip-rep ocf:heartbeat:IPaddr2 \
 params \
 ip="10.132.101.96" \
 nic="eth0" \
 cidr_netmask="24" \
 meta \
 migration-threshold="0" \
 op start   timeout="60s" interval="0"  on-fail="stop" \
 op monitor timeout="60s" interval="10s" on-fail="restart" \
 op stop    timeout="60s" interval="0"  on-fail="ignore"

master msPostgresql pgsql \
 meta \
 master-max="1" \
 master-node-max="1" \
 clone-max="2" \
 clone-node-max="1" \
 notify="true"

primitive pgsql ocf:heartbeat:pgsql \
 params \
 pgctl="/usr/bin/pg_ctl" \
 psql="/usr/bin/psql" \
 pgdata="/database/9.3" \
 config="/etc/postgresql/9.3/main/postgresql.conf" \
 socketdir="/var/run/postgresql" \
 rep_mode="sync" \
 node_list="tstdb03 tstdb04" \
 restore_command="cp /database/archive/%f %p" \
 primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
 master_ip="10.132.101.95" \
 restart_on_promote="true" \
 logfile="/var/log/postgresql/postgresql-9.3-main.log" \
 op start   timeout="60s" interval="0"  on-fail="restart" \
 op monitor timeout="60s" interval="4s" on-fail="restart" \
 op monitor timeout="60s" interval="3s"  on-fail="restart" role="Master" \
 op promote timeout="60s" interval="0"  on-fail="restart" \
 op demote  timeout="60s" interval="0"  on-fail="stop" \
 op stop    timeout="60s" interval="0"  on-fail="block" \
 op notify  timeout="60s" interval="0"

#colocation rsc_colocation-1 inf: vip-master msPostgresql:Master
#order rsc_order-1 0: msPostgresql:promote  vip-master:start  symmetrical=false
#order rsc_order-2 0: msPostgresql:demote   vip-rep:stop   symmetrical=false


colocation rsc_colocation-1 inf: master-group msPostgresql:Master
order rsc_order-1 0: msPostgresql:promote  master-group:start  symmetrical=false
order rsc_order-2 0: msPostgresql:demote   master-group:stop   symmetrical=false


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Failed-over incomplete

2014-12-04 Thread Teerapatr Kittiratanachai
Sorry, that was a typo; it's res.vBKN6.

--teenigma

On Thu, Dec 4, 2014 at 4:23 PM, Andrei Borzenkov  wrote:
> On Thu, Dec 4, 2014 at 9:52 AM, Teerapatr Kittiratanachai
>  wrote:
>> Dear Andrei,
>>
>> Since the fail-over did not complete, none of the resources failed
>> over to the other node.
>>
>> I think this happened because res.vBKN went into an unmanaged state.
>>
>
> There is no resource res.vBKN in your logs or configuration snippet
> you have shown.
>
>> But why? No configuration was changed.
>>
>> --teenigma
>>
>> On Thu, Dec 4, 2014 at 1:41 PM, Andrei Borzenkov  wrote:
>>> On Thu, Dec 4, 2014 at 4:56 AM, Teerapatr Kittiratanachai
>>>  wrote:
 Dear List,

 We are using Pacemaker and Corosync with CMAN as our HA software;
 versions are below.

 OS:             CentOS release 6.5 (Final) 64-bit
 Pacemaker:      pacemaker.x86_64        1.1.10-14.el6_5.3
 Corosync:       corosync.x86_64         1.4.1-17.el6_5.1
 CMAN:           cman.x86_64             3.0.12.1-59.el6_5.2
 Resource-Agent: resource-agents.x86_64  3.9.5-3.12

 Topology: 2 nodes in an Active/Standby model (MySQL is
 Active/Active via clone)

 All packages are installed from the official CentOS repository; only
 resource-agents is installed from the openSUSE repository
 (http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-6/).

 The system worked normally for a few months until yesterday morning,
 around 03:35 UTC+0700, when we found that one resource had gone into
 the UNMANAGED state without any configuration change. After another
 resource failed, Pacemaker tried to fail the resources over to the
 other node, but the fail-over stalled when it reached this resource.

 The configuration of the relevant resources is below, and the log
 from the event is in the attached file.

>>>
>>> The log just covers the resource monitor failure and the stopping of
>>> resources. It does not contain any events related to starting resources
>>> on the other node.
>>>
>>> You would need to collect a crm_report with a start time from before the
>>> resource failed and a stop time from after resources were started on the
>>> other node.
>>>
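
[The crm_report collection suggested above can be run along these lines;
the timestamps and output path are placeholders for the actual failure
window:]

```shell
# Collect a cluster-wide report spanning the failure window
# (times below are placeholders; use the real event times).
crm_report -f "2014-12-03 03:00:00" -t "2014-12-03 05:00:00" /tmp/outage-report
# Produces a tarball with logs, the CIB and PE inputs from all
# reachable nodes.
```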
 primitive res.vBKN6 IPv6addr \
 params ipv6addr="2001:db8:0:f::61a" cidr_netmask=64 nic=eth0 \
 op monitor interval=10s

 primitive res.vDMZ6 IPv6addr \
 params ipv6addr="2001:db8:0:9::61a" cidr_netmask=64 nic=eth1 \
 op monitor interval=10s

 group gr.mainService res.vDMZ4 res.vDMZ6 res.vBKN4 res.vBKN6 res.http res.ftp

 rsc_defaults rsc_defaults-options: \
 migration-threshold=1

 Please help me to solve this problem.

 --teenigma



Re: [Pacemaker] Failed-over incomplete

2014-12-04 Thread Andrei Borzenkov
On Thu, Dec 4, 2014 at 9:52 AM, Teerapatr Kittiratanachai
 wrote:
> Dear Andrei,
>
> Since the fail-over did not complete, none of the resources failed
> over to the other node.
>
> I think this happened because res.vBKN went into an unmanaged state.
>

There is no resource res.vBKN in your logs or configuration snippet
you have shown.

> But why? No configuration was changed.
>
> --teenigma
>
> On Thu, Dec 4, 2014 at 1:41 PM, Andrei Borzenkov  wrote:
>> On Thu, Dec 4, 2014 at 4:56 AM, Teerapatr Kittiratanachai
>>  wrote:
>>> Dear List,
>>>
>>> We are using Pacemaker and Corosync with CMAN as our HA software;
>>> versions are below.
>>>
>>> OS:             CentOS release 6.5 (Final) 64-bit
>>> Pacemaker:      pacemaker.x86_64        1.1.10-14.el6_5.3
>>> Corosync:       corosync.x86_64         1.4.1-17.el6_5.1
>>> CMAN:           cman.x86_64             3.0.12.1-59.el6_5.2
>>> Resource-Agent: resource-agents.x86_64  3.9.5-3.12
>>>
>>> Topology: 2 nodes in an Active/Standby model (MySQL is
>>> Active/Active via clone)
>>>
>>> All packages are installed from the official CentOS repository; only
>>> resource-agents is installed from the openSUSE repository
>>> (http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/CentOS_CentOS-6/).
>>>
>>> The system worked normally for a few months until yesterday morning,
>>> around 03:35 UTC+0700, when we found that one resource had gone into
>>> the UNMANAGED state without any configuration change. After another
>>> resource failed, Pacemaker tried to fail the resources over to the
>>> other node, but the fail-over stalled when it reached this resource.
>>>
>>> The configuration of the relevant resources is below, and the log
>>> from the event is in the attached file.
>>>
>>
>> The log just covers the resource monitor failure and the stopping of
>> resources. It does not contain any events related to starting resources
>> on the other node.
>>
>> You would need to collect a crm_report with a start time from before the
>> resource failed and a stop time from after resources were started on the
>> other node.
>>
>>> primitive res.vBKN6 IPv6addr \
>>> params ipv6addr="2001:db8:0:f::61a" cidr_netmask=64 nic=eth0 \
>>> op monitor interval=10s
>>>
>>> primitive res.vDMZ6 IPv6addr \
>>> params ipv6addr="2001:db8:0:9::61a" cidr_netmask=64 nic=eth1 \
>>> op monitor interval=10s
>>>
>>> group gr.mainService res.vDMZ4 res.vDMZ6 res.vBKN4 res.vBKN6 res.http res.ftp
>>>
>>> rsc_defaults rsc_defaults-options: \
>>> migration-threshold=1
>>>
>>> Please help me to solve this problem.
>>>
>>> --teenigma
>>>


Re: [Pacemaker] Avoid monitoring of resources on nodes

2014-12-04 Thread Daniel Dehennin
Andrew Beekhof  writes:

> What version of pacemaker is this?
> Some very old versions wanted the agent to be installed on all nodes.

It's 1.1.10+git20130802-1ubuntu2.1 on Trusty Tahr.

Regards.
-- 
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF

