Slony folks: I'm being blocked by an interesting failure of "prepare/finish clone". There's a bit of a setup on this one, but the complexity of the cluster may be related to the failure, so I want to give you everything.
Versions:

  PostgreSQL 9.2.14
  Slony 2.1.4
  4 replication sets

5 nodes:

  4:  origin of sets 1, 2 and 3
  5:  failover for 4, subscribes to 1, 2, 3
  6:  origin of set 4, subscribes to 2
  7:  failover for 6, subscribes to 2, 4
  10: on AWS, mirror of 6, subscribes to 2, 4 from origin

The owner is creating new nodes on AWS as copies of node 6, both to expand capacity and for testing. The fastest way for us to spin up new nodes on AWS works like this:

1. create a new EC2 instance
2. prepare clone of 10
3. make an AWS snapshot copy of 10
4. bring up PostgreSQL on the new node
5. finish clone for the new node
6. start slony on the new node

We followed this procedure to bring up nodes in this order:

- the original clone on AWS was node 8
- created node 9 via the prepare clone method
- dropped node 8 (and shut down the instance)
- created node 10 via the prepare clone method
- dropped node 9 (but did not shut down the instance)

So the prepare clone method above worked perfectly twice. But then we tried to bring up a new node as a prepared clone from node 11, and things went to hell.
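For reference, steps 2 and 5 above correspond roughly to this slonik sketch. This is my reconstruction, not our exact script: the new node id (11), comment, and conninfo strings are illustrative.

```
# hypothetical slonik sketch of the prepare/finish clone steps;
# node ids and conninfo strings are illustrative
cluster name = replication;
node 10 admin conninfo = 'dbname=prod host=192.168.80.43 user=slony';

# step 2: register the new node as a pending clone of node 10
clone prepare (id = 11, provider = 10, comment = 'new AWS node');
wait for event (origin = 10, confirmed = all, wait on = 10);

# ... steps 3-4: take the AWS snapshot, bring up PostgreSQL on it ...

# step 5: finish the clone once the copy is up
node 11 admin conninfo = 'dbname=prod host=new-node.example user=slony';
clone finish (id = 11, provider = 10);
```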
At step 6, when we brought up slony, we started to see this in the logs:

2015-12-04 14:40:21 PST ERROR slon_connectdb: PQconnectdb("dbname=prod host=192.168.80.32 port=5432 user=slony") failed - FATAL: no pg_hba.conf entry for host "172.16.81.31", user "slony", database "prod", SSL off
2015-12-04 14:40:21 PST WARN remoteListenThread_6: DB connection failed - sleep 10 seconds
2015-12-04 14:40:21 PST CONFIG version for "dbname=prod" is 90214
2015-12-04 14:40:21 PST CONFIG version for "dbname=prod host=dw3.prod.com port=5432 user=slony" is 90214
2015-12-04 14:40:21 PST ERROR slon_connectdb: PQconnectdb("dbname=prod host=192.168.80.33 port=5432 user=slony") failed - FATAL: no pg_hba.conf entry for host "172.16.81.31", user "slony", database "prod", SSL off
2015-12-04 14:40:21 PST WARN remoteListenThread_7: DB connection failed - sleep 10 seconds
2015-12-04 14:40:21 PST CONFIG remoteWorkerThread_10: update provider configuration
2015-12-04 14:40:21 PST CONFIG remoteWorkerThread_7: update provider configuration
2015-12-04 14:40:21 PST ERROR remoteListenThread_10: db_getLocalNodeId() returned 12 - wrong database?
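Side note on the pg_hba.conf failures: the origins at 192.168.80.32/.33 are rejecting the new clone's 172.16.81.31 address, so they presumably lack an entry covering the AWS subnet. Hypothetically, something like the following on those hosts would cover it (the subnet mask and auth method here are guesses, not our actual config):

```
# pg_hba.conf on the origin-side nodes: allow the slony user
# from the AWS clones' subnet (CIDR and auth method illustrative)
host    prod    slony    172.16.81.0/24    md5
```

That said, the earlier clones came up without this, so the connection errors may be a secondary symptom rather than the root cause.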
2015-12-04 14:40:21 PST CONFIG version for "dbname=prod host=192.168.80.43 port=5432 user=slony" is 90214
2015-12-04 14:40:21 PST CONFIG version for "dbname=prod host=192.168.80.43 port=5432 user=slony" is 90214
2015-12-04 14:40:21 PST INFO remoteWorkerThread_4: syncing set 2 with 118 table(s) from provider 5
2015-12-04 14:40:21 PST INFO remoteWorkerThread_4: SYNC 5009308944 done in 0.098 seconds
2015-12-04 14:40:22 PST CONFIG version for "dbname=prod host=192.168.80.42 port=5432 user=slony" is 90214
2015-12-04 14:40:22 PST ERROR remoteWorkerThread_5: "lock table "_replication".sl_config_lock; select "_replication".storePath_int(13, 5, 'dbname=prod host=172.16.81.31 port=5432 user=slony', 10); insert into "_oltp_replication".sl_event (ev_origin, ev_seqno, ev_timestamp, ev_snapshot, ev_type, ev_data1, ev_data2, ev_data3, ev_data4) values ('5', '5001941723', '2015-12-04 14:37:48.196237-08', '70559671:70559671:', 'STORE_PATH', '13', '5', 'dbname=prod host=172.16.81.31 port=5432 user=slony', '10'); insert into "_replication".sl_confirm (con_origin, con_received, con_seqno, con_timestamp) values (5, 13, '5001941723', now()); commit transaction;" PGRES_FATAL_ERROR ERROR: duplicate key value violates unique constraint "sl_event-pkey" DETAIL: Key (ev_origin, ev_seqno)=(5, 5001941723) already exists.
2015-12-04 14:40:22 PST CONFIG slon: child terminated signal: 9; pid: 4539, current worker pid: 4539
2015-12-04 14:40:22 PST CONFIG slon: restart of worker in 10 seconds
2015-12-04 14:40:25 PST CONFIG slon: child terminated status: 9; pid: -1, current worker pid: 4511 errno: 10
2015-12-04 14:40:25 PST CONFIG slon: child terminated status: 9; pid: -1, current worker pid: 4539 errno: 10
2015-12-04 14:40:25 PST FATAL slon: wait returned an error pid:-1 errno:10
2015-12-04 14:40:25 PST FATAL slon: wait returned an error pid:-1 errno:1

This now happens *every time* we try the prepare clone sequence (3 out of 3 tries). Any idea what's going on here?
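The duplicate-key failure on sl_event suggests the snapshot already contained that STORE_PATH event row when the worker tried to replay it. If it helps, here's what I'd run on the new clone to check (schema name "_replication" taken from the log; I haven't dug into why "_oltp_replication" also appears there):

```sql
-- on the new clone: is the conflicting event already present?
SELECT ev_origin, ev_seqno, ev_timestamp, ev_type
  FROM "_replication".sl_event
 WHERE ev_origin = 5
   AND ev_seqno  = 5001941723;

-- and which node does the clone think it is?
-- (the "db_getLocalNodeId() returned 12" error hints at a stale id)
SELECT "_replication".getlocalnodeid('_replication');
```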
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
_______________________________________________
Slony1-general mailing list
Slony1-general@lists.slony.info
http://lists.slony.info/mailman/listinfo/slony1-general