Slony folks:

I'm being blocked by an interesting failure of "prepare/finish clone".
There's a bit of a setup on this one, but the complexity of the cluster
may be related to the failure, so I want to give you everything.

Versions:
PostgreSQL 9.2.14
Slony 2.1.4

4 replication sets

5 nodes:
        4: origin of sets 1, 2 and 3
        5: failover for 4, subscribes to 1,2,3
        6: origin of set 4, subscribes to 2
        7: failover for 6, subscribes to 2, 4
        10: on AWS, mirror of 6, subscribes to 2 and 4 from their origins (see the slonik sketch below)
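
For concreteness, node 10's subscriptions in slonik terms would be roughly
this (a sketch only; the forward flags are my assumption, not taken from
the actual config):

    subscribe set (id = 2, provider = 4, receiver = 10, forward = no);
    subscribe set (id = 4, provider = 6, receiver = 10, forward = no);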

The owner is creating new nodes on AWS that are copies of node 6, both to
expand capacity and for testing.  The fastest way for us to spin up new
nodes on AWS works like this (a slonik sketch of steps 2 and 5 follows
the list):

1. create a new EC2 instance
2. run prepare clone with node 10 as the provider
3. take an AWS snapshot copy of node 10
4. bring up PostgreSQL on the new node
5. run finish clone for the new node
6. start slon on the new node
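
Steps 2 and 5 there are slonik's CLONE PREPARE / CLONE FINISH; a minimal
sketch, with the new node id (11 here) and the conninfo hosts as
placeholders:

    cluster name = replication;
    node 10 admin conninfo = 'dbname=prod host=<node10-host> port=5432 user=slony';
    node 11 admin conninfo = 'dbname=prod host=<node11-host> port=5432 user=slony';

    # step 2: register the clone-to-be on the provider
    clone prepare (id = 11, provider = 10, comment = 'new AWS node');
    # let the CLONE_NODE event propagate before taking the snapshot
    wait for event (origin = 10, confirmed = all, wait on = 10);

and then, once PostgreSQL is up on the new instance:

    # step 5: fix up the copied catalog so it identifies as node 11
    clone finish (id = 11, provider = 10);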

We followed this procedure to bring up nodes in this order:

- original clone on AWS was node 8.
- created node 9 via prepare clone method
- dropped node 8 (and shut down instance)
- created node 10 via prepare clone method
- dropped node 9 (but did not shut down the instance)

So, the prepare clone method above worked perfectly twice.  But then we
tried to bring up another new node as a prepared clone (node 11) and
things went to hell.

At step 6, when we started slon on the new node, we began to see this in
its logs:

2015-12-04 14:40:21 PST ERROR slon_connectdb: PQconnectdb("dbname=prod host=192.168.80.32 port=5432 user=slony") failed - FATAL: no pg_hba.conf entry for host "172.16.81.31", user "slony", database "prod", SSL off
2015-12-04 14:40:21 PST WARN remoteListenThread_6: DB connection failed - sleep 10 seconds
2015-12-04 14:40:21 PST CONFIG version for "dbname=prod" is 90214
2015-12-04 14:40:21 PST CONFIG version for "dbname=prod host=dw3.prod.com port=5432 user=slony" is 90214
2015-12-04 14:40:21 PST ERROR slon_connectdb: PQconnectdb("dbname=prod host=192.168.80.33 port=5432 user=slony") failed - FATAL: no pg_hba.conf entry for host "172.16.81.31", user "slony", database "prod", SSL off
2015-12-04 14:40:21 PST WARN remoteListenThread_7: DB connection failed - sleep 10 seconds
2015-12-04 14:40:21 PST CONFIG remoteWorkerThread_10: update provider configuration
2015-12-04 14:40:21 PST CONFIG remoteWorkerThread_7: update provider configuration
2015-12-04 14:40:21 PST ERROR remoteListenThread_10: db_getLocalNodeId() returned 12 - wrong database?
2015-12-04 14:40:21 PST CONFIG version for "dbname=prod host=192.168.80.43 port=5432 user=slony" is 90214
2015-12-04 14:40:21 PST CONFIG version for "dbname=prod host=192.168.80.43 port=5432 user=slony" is 90214
2015-12-04 14:40:21 PST INFO remoteWorkerThread_4: syncing set 2 with 118 table(s) from provider 5
2015-12-04 14:40:21 PST INFO remoteWorkerThread_4: SYNC 5009308944 done in 0.098 seconds
2015-12-04 14:40:22 PST CONFIG version for "dbname=prod host=192.168.80.42 port=5432 user=slony" is 90214
2015-12-04 14:40:22 PST ERROR remoteWorkerThread_5: "lock table "_replication".sl_config_lock;select "_replication".storePath_int(13, 5, 'dbname=prod host=172.16.81.31 port=5432 user=slony', 10); insert into "_oltp_replication".sl_event (ev_origin, ev_seqno, ev_timestamp, ev_snapshot, ev_type , ev_data1, ev_data2, ev_data3, ev_data4 ) values ('5', '5001941723', '2015-12-04 14:37:48.196237-08', '70559671:70559671:', 'STORE_PATH', '13', '5', 'dbname=prod host=172.16.81.31 port=5432 user=slony', '10'); insert into "_replication".sl_confirm (con_origin, con_received, con_seqno, con_timestamp) values (5, 13, '5001941723', now()); commit transaction;" PGRES_FATAL_ERROR ERROR: duplicate key value violates unique constraint "sl_event-pkey"
DETAIL: Key (ev_origin, ev_seqno)=(5, 5001941723) already exists.
2015-12-04 14:40:22 PST CONFIG slon: child terminated signal: 9; pid: 4539, current worker pid: 4539
2015-12-04 14:40:22 PST CONFIG slon: restart of worker in 10 seconds
2015-12-04 14:40:25 PST CONFIG slon: child terminated status: 9; pid: -1, current worker pid: 4511 errno: 10
2015-12-04 14:40:25 PST CONFIG slon: child terminated status: 9; pid: -1, current worker pid: 4539 errno: 10
2015-12-04 14:40:25 PST FATAL slon: wait returned an error pid:-1 errno:10
2015-12-04 14:40:25 PST FATAL slon: wait returned an error pid:-1 errno:1
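
The pg_hba.conf failures are presumably just missing entries for the new
instance's address (172.16.81.31), but the other two errors look like
clone-state problems: remoteListenThread_10 finding node id 12 at node
10's conninfo, and the duplicate (ev_origin, ev_seqno) in sl_event.  Two
sanity checks we could run against the clone (and against whatever is
answering on node 10's conninfo), sketched here assuming the cluster
schema is "_replication":

    -- which node does this database think it is?
    select last_value from "_replication".sl_local_node_id;

    -- highest event already stored per origin; the duplicate was (5, 5001941723)
    select ev_origin, max(ev_seqno) from "_replication".sl_event group by ev_origin;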

This now happens *every time* we try the prepare clone sequence (3 out
of 3 tries).  Any idea what's going on here?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com