Re: [Pgpool-general] diagnosing BackendError from pcp_recovery_node

2011-10-12 Thread Jeff Frost

On 10/11/11 17:48, Tatsuo Ishii wrote:

No. It's just a matter of when 'shutdown host pgpool(5432)' message
was issued. Actual shutdown (set down status in shared memory area)
was done befoere follow command was executed.

I think real problem here is, when the private backend status (copy of
shared memory area, which pcp process looks into) gets
updated. Currently at the very end of failover process, pgpool main
sends SIGUSR1 to pcp process requesting a restart. So it might be
possible for pcp process to look up outdated status if the follow
command starts too early.

Attached is a patch to solve the problem. With the patch pcp process
refesh the copy of backend status whenever it's idle, rather than
after receiving SIGUSR1. Please try.

So far in testing, this patch appears to resolve the problem. Thanks much!

Any chance the follow_master_command could get an escape code indicating what 
the new primary node has become?


--
Jeff Frost
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/

___
Pgpool-general mailing list
Pgpool-general@pgfoundry.org
http://pgfoundry.org/mailman/listinfo/pgpool-general


Re: [Pgpool-general] difference between master and primary node in streaming mode

2011-10-05 Thread Jeff Frost
On 10/05/11 16:09, Tatsuo Ishii wrote:
>> What's the difference between the master and primary node in streaming
>> replication mode?
>>
>>
> The master node is the node which is alive and has the least node
> id. Not important in streaming replication mode(important in pgpool-II
> native replication mode, though).
> --

Ah! Thank you!  So, Is it safe to assume that in streaming replication mode,
the %H special value for follow_master_command will actually be the hostname
of the new primary node and not the hostname of the new master node?



-- 
Jeff Frost 
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/ 

___
Pgpool-general mailing list
Pgpool-general@pgfoundry.org
http://pgfoundry.org/mailman/listinfo/pgpool-general


[Pgpool-general] diagnosing BackendError from pcp_recovery_node

2011-10-05 Thread Jeff Frost
I have a follow_master_command set which calls the pcp_recovery_node script
after editing a config file on the replica which has been degenerated. 
Occasionally this fails with BackendError.  It looks like the node doesn't get
degenerated fast enough and so it's marked as still alive, thus cancelling the
recovery that I just requested:

BTW, my backend hostnames are:

backend_hostname0: pgpool
backend_hostname1: node1
backend_hostname2: node2

2011-10-05 14:30:02 LOG:   pid 31743: execute command:
/var/lib/postgresql/scripts/failover 0 0 node1
Warning: Permanently added 'node1,216.121.61.228' (RSA) to the list of known
hosts.
Warning: Permanently added 'node1,216.121.61.228' (RSA) to the list of known
hosts.
2011-10-05 14:30:14 LOG:   pid 31743: find_primary_node_repeatedly: waiting
for finding a primary node
2011-10-05 14:30:14 LOG:   pid 31743: find_primary_node: primary node id is 1
2011-10-05 14:30:14 LOG:   pid 31743: starting follow degeneration. shutdown
host pgpool(5432)
2011-10-05 14:30:14 LOG:   pid 31743: starting follow degeneration. shutdown
host node2(5432)
2011-10-05 14:30:14 LOG:   pid 31743: failover: 2 follow backends have been
degenerated
2011-10-05 14:30:14 LOG:   pid 31743: failover: set new primary node: 1
2011-10-05 14:30:14 LOG:   pid 31743: failover: set new master node: 1
2011-10-05 14:30:14 LOG:   pid 2941: start triggering follow command.
2011-10-05 14:30:14 LOG:   pid 2941: execute command:
/var/lib/postgresql/scripts/pgpool_remaster_replica pgpool node1 0
2011-10-05 14:30:14 LOG:   pid 31743: failover done. shutdown host pgpool(5432)
DEBUG: send: tos="R", len=44
DEBUG: recv: tos="r", len=21, data=AuthenticationOK
2011-10-05 14:30:15 LOG:   pid 2536: starting recovering node 0
2011-10-05 14:30:15 ERROR: pid 2536: start_recovery: backend node 0 is alive
DEBUG: send: tos="D", len=6
DEBUG: recv: tos="e", len=20, data=recovery failed
DEBUG: command failed. reason=recovery failed
DEBUG: send: tos="X", len=4
BackendError

Notice that the 'shutdown host pgpool(5432)' message was logged AFTER my
follow command was executed, so does this: ERROR: pid 2536: start_recovery:
backend node 0 is alive

mean that the node is not detached or that postgresql is still running? or
something else I'm not thinking of?

Looks like this is the bit of code in question:

pool_log("starting recovering node %d", recovery_node);

if (VALID_BACKEND(recovery_node))
{
pool_error("start_recovery: backend node %d is alive",
recovery_node);
return 1;
}

So, how does it define a VALID_BACKEND and how can I modify my
follow_master_command to allow the pcp_recovery_node to finish successfully?

If I run the pcp_recovery_node again by hand afterward, it always works fine.
I thought perhaps adding a pcp_detach_node before the pcp_recovery_node would
do the trick, but it does not.

/var/lib/postgresql/scripts/pgpool_remaster_replica:
#!/bin/bash

CONF_FILE=/var/lib/postgresql/scripts/pitr-replication.conf
. ${CONF_FILE}
SCRIPT_DIR=$(dirname ${CONF_FILE})
echo "$0 $* run on $(hostname)" >> ${REPLICATION_LOG}

REPLICA=$1
MASTER=$2
NODEID=$3

# Update the MASTER value in the replication config file
${SSH} ${SSH_OPT} ${REPLICA} /bin/sh -c "'sed -e
\"s/MASTER=.*$/MASTER=${MASTER}/\" ${CONF_FILE} > ${CONF_FILE}.swap && mv
${CONF_FILE}.swap ${CONF_FILE}'"

${SSH} ${SSH_OPT} -T ${REPLICA} $PGCTL ${PGCTLOPTIONS} -m fast -w -D ${PGDATA}
stop 2>/dev/null 1>/dev/null < /dev/null &
#/usr/local/bin/pcp_detach_node --debug ${PCPTIMEOUT} ${PGPOOLHOST} ${PCPPORT}
${PGPOOLUSER} ${PGPOOLPASS} ${NODEID}
/usr/local/bin/pcp_recovery_node --debug ${PCPTIMEOUT} ${PGPOOLHOST}
${PCPPORT} ${PGPOOLUSER} ${PGPOOLPASS} ${NODEID}


-- 
Jeff Frost 
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/ 

___
Pgpool-general mailing list
Pgpool-general@pgfoundry.org
http://pgfoundry.org/mailman/listinfo/pgpool-general


[Pgpool-general] difference between master and primary node in streaming mode

2011-10-05 Thread Jeff Frost
What's the difference between the master and primary node in streaming
replication mode?

Occasionally, in my testing, I see this:

Oct  5 11:36:08 31214-1-807393 pgpool[28871]: find_primary_node_repeatedly:
waiting for finding a primary node
Oct  5 11:36:08 31214-1-807393 pgpool[28871]: last message repeated 10 times
Oct  5 11:36:08 31214-1-807393 pgpool[28871]: find_primary_node: 0 node is 
standby
Oct  5 11:36:08 31214-1-807393 pgpool[28871]: last message repeated 10 times
Oct  5 11:36:08 31214-1-807393 pgpool[28871]: find_primary_node: primary node
id is 1
Oct  5 11:36:08 31214-1-807393 pgpool[28871]: failover: set new primary node: 1
Oct  5 11:36:08 31214-1-807393 pgpool[28871]: failover: set new master node: 0

Note that the primary and master are different.  I understand that the primary
node is considered whichever node is not in recovery mode, but what is the 
master?

-- 
Jeff Frost 
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/ 

___
Pgpool-general mailing list
Pgpool-general@pgfoundry.org
http://pgfoundry.org/mailman/listinfo/pgpool-general


Re: [Pgpool-general] how to find node role and order of scripts being run

2011-10-04 Thread Jeff Frost

On Oct 4, 2011, at 5:42 PM, Tatsuo Ishii wrote:

>>> In Streaming replication mode, how do I find which node is currently which
>>> role?  That is, I'd like to know which node is considered the primary and
>>> which are replicas. 
> 
> You can use standard PostgreSQL function pg_is_in_recover() to get the
> information. It's not necessarily identical to what pgpool-II thinks,
> but since pgpool-II also uses the function to judge which is which,
> you could assume that what pg_is_in_recover() returns is what
> pgpool-II thinks in real world.
> 
> It would be nice if pcp command returns that info directly though.

Ah, ok, so no way to find out what pgpool thinks internally. Drat!

I ask because I'm trying to troubleshoot some scripts and with 3 active nodes, 
detaching node 0 causes the system to failover properly to node 1, but if I 
recover node 0, then detach node 1, the log indicates it can't find a primary 
node and the failover script eventually gets a -1 as the 'new master node'.  
I'll try comparing the recovery state to see if that disagrees with what I 
think is actually happening.

How does pgpool use the pg_is_in_recovery() function?



---
Jeff Frost 
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/ 






___
Pgpool-general mailing list
Pgpool-general@pgfoundry.org
http://pgfoundry.org/mailman/listinfo/pgpool-general


Re: [Pgpool-general] how to find node role and order of scripts being run

2011-10-04 Thread Jeff Frost

On Oct 4, 2011, at 4:29 PM, Jeff Frost wrote:

> In Streaming replication mode, how do I find which node is currently which
> role?  That is, I'd like to know which node is considered the primary and
> which are replicas. 
> 
> Also, what is the failback command expected to accomplish?


Should have mentioned: I'm specifically asking about version 3.1.0.


---
Jeff Frost 
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/ 






___
Pgpool-general mailing list
Pgpool-general@pgfoundry.org
http://pgfoundry.org/mailman/listinfo/pgpool-general


[Pgpool-general] how to find node role and order of scripts being run

2011-10-04 Thread Jeff Frost
In Streaming replication mode, how do I find which node is currently which
role?  That is, I'd like to know which node is considered the primary and
which are replicas. 

Also, what is the failback command expected to accomplish?



-- 
Jeff Frost 
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/ 

___
Pgpool-general mailing list
Pgpool-general@pgfoundry.org
http://pgfoundry.org/mailman/listinfo/pgpool-general