On Oct 23, 2009, at 10:57 AM, Jeff wrote:
Just ran into this problem - the origin is 8.2, replica is 8.4.1

2009-10-23 10:47:42 EDT DEBUG1 copy_set 3
2009-10-23 10:47:42 EDT DEBUG1 remoteWorkerThread_1: connected to provider DB
2009-10-23 10:47:42 EDT WARN remoteWorkerThread_1: transactions earlier than XID 3820504992 are still in progress
2009-10-23 10:47:42 EDT WARN remoteWorkerThread_1: data copy for set 3 failed - sleep 60 seconds
NOTICE: there is no transaction in progress
2009-10-23 10:48:42 EDT DEBUG1 copy_set 3
2009-10-23 10:48:42 EDT DEBUG1 remoteWorkerThread_1: connected to provider DB
2009-10-23 10:48:42 EDT ERROR remoteWorkerThread_1: Could not lock table "public"."companyinfo" on subscriber
2009-10-23 10:48:42 EDT WARN remoteWorkerThread_1: data copy for set 3 failed - sleep 60 seconds
NOTICE: there is no transaction in progress

In the PG log

LOG: checkpoint starting: time
ERROR: LOCK TABLE can only be used in transaction blocks
STATEMENT: lock table "public"."companyinfo";
So I've dug into this and attached a patch to solve it. In a nutshell, in the event loop we start a transaction, then, if the event is not an accept set event, we lock the config lock table. We then zero out query1 (this is in remote_worker.c).
The ENABLE_SUBSCRIPTION event runs in a while(true) loop. First it executes query1 (which, thanks to the above, is empty), then it tries to copy_set. If copy_set fails for whatever reason, we ROLLBACK our local conn (query2) and then loop.
The problem with this is that when we come back around in the next loop we're outside of a transaction, and one won't be started because query1 has been reset. This causes LOCK TABLE to barf on PG 8.4. You are forever stuck until you restart slon. This also explains another problem I've seen a couple of times.
We subscribe to a set with, say, 3 tables. The initial subscription fails due to an earlier txn wait. On the retry we copy the first table of the set successfully. Then the second table fails to copy due to some DDL issue (perhaps for some reason a PK or column is missing). We issue a rollback, but since we are not in a txn, nothing happens. The event does not succeed, so we try again. What happens next is that, since our previous work wasn't rolled back, slony sees we've already got the deny trigger & friends on the first table and barfs. Cue infinite loop, fixed only by shutting down slon and playing with the sl_ tables.
This patch keeps a count of how many retries we've had on this copy_set. If we are on retry > 0 then we re-issue a start transaction, set isolation, and lock the config table. My testing has shown that this works.
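To illustrate the retry logic described above, here is a rough standalone sketch in C. It is not the attached patch and does not use the real slon code: mock_conn, mock_exec, mock_copy_set and enable_subscription are all hypothetical names made up for this example, and the "connection" only tracks whether a transaction is open. The first provider_busy attempts fail the way the "transactions earlier than XID" wait does; later attempts try to lock a table on the subscriber.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the subscriber connection. */
typedef struct {
    bool in_txn;       /* is a transaction currently open? */
    int  lock_errors;  /* LOCK TABLE attempts made outside a transaction */
} mock_conn;

/* Pretend to run a statement; returns false if the server would reject it. */
static bool mock_exec(mock_conn *c, const char *sql)
{
    if (strncmp(sql, "start transaction", 17) == 0) {
        c->in_txn = true;
    } else if (strncmp(sql, "rollback", 8) == 0 ||
               strncmp(sql, "commit", 6) == 0) {
        c->in_txn = false;   /* harmless no-op when no txn is open */
    } else if (strncmp(sql, "lock table", 10) == 0 && !c->in_txn) {
        c->lock_errors++;    /* "LOCK TABLE can only be used in transaction blocks" */
        return false;
    }
    return true;
}

/* Simulated copy_set: early attempts fail on the provider-side wait,
 * later attempts need to lock a table on the subscriber. */
static bool mock_copy_set(mock_conn *c, int attempt, int provider_busy)
{
    if (attempt < provider_busy)
        return false;
    return mock_exec(c, "lock table \"public\".\"companyinfo\";");
}

/* Returns the number of attempts used, or -1 if max_attempts ran out.
 * reissue_begin = false models the old behavior, true models the fix. */
int enable_subscription(mock_conn *c, bool reissue_begin,
                        int provider_busy, int max_attempts)
{
    assert(max_attempts > 0);

    /* The event loop has already started a transaction (and emptied query1). */
    mock_exec(c, "start transaction;");

    for (int retry = 0; retry < max_attempts; retry++) {
        if (reissue_begin && retry > 0) {
            /* The fix: open a fresh transaction on every retry. The real
             * patch also sets the isolation level and locks the config
             * table here; that is left out of this sketch. */
            mock_exec(c, "start transaction;");
        }
        if (mock_copy_set(c, retry, provider_busy)) {
            mock_exec(c, "commit;");
            return retry + 1;
        }
        mock_exec(c, "rollback;");
        /* slon would sleep 60 seconds here before looping */
    }
    return -1;
}
```

Calling enable_subscription with reissue_begin set to false and one provider-side failure never succeeds, because every retry after the first rollback runs LOCK TABLE outside a transaction. With reissue_begin set to true, the second attempt goes through.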
copy_set_retry.patch
Description: Binary data
-- Jeff Trout <[email protected]> http://www.stuarthamm.net/ http://www.dellsmartexitin.com/
_______________________________________________ Slony1-general mailing list [email protected] http://lists.slony.info/mailman/listinfo/slony1-general
