This looks very much like a prime example of the log selection query problem we fixed in 2.1. The catch-up gets slower and slower over time until the log switch succeeds.
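Before anything else it is worth checking how much backlog is actually sitting in the log tables, and whether any old transaction is still holding the log switch open. Something along these lines should do (just a sketch; replace _mycluster with your real cluster schema name, and note the pg_stat_activity column names below are the 8.4 ones):

  -- rows waiting in the two log tables
  -- (_mycluster is a placeholder for your actual Slony cluster schema)
  SELECT 'sl_log_1' AS log_table, count(*) FROM _mycluster.sl_log_1
  UNION ALL
  SELECT 'sl_log_2', count(*) FROM _mycluster.sl_log_2;

  -- transactions old enough to keep logswitch_finish() from truncating
  -- the old log table
  SELECT procpid, xact_start, current_query
    FROM pg_stat_activity
   WHERE xact_start < now() - interval '1 hour'
   ORDER BY xact_start;

If the second query comes back empty, a later cleanup pass that calls logswitch_finish() should finally be able to truncate the old log table, and the "still in progress" message should stop.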
Jan

On 12/27/2011 8:25 PM, Steve Singer wrote:
> On Tue, 27 Dec 2011, Mike Wilson wrote:
>
>> Thanks for the reply. Per your questions here is additional information.
>> For the purposes of this discussion db1 (node: 101) is our Slave and db2
>> (node: 102) is our Slony Master.
>>
>>> On Tue, 27 Dec 2011, Mike Wilson wrote:
>>>
>>> 2. Now that you have killed the old vacuums you need to determine what,
>>> if anything, slony is doing.
>>
>> During the period where the replication stopped I found several vacuums
>> that had been running for upwards of 11 days. I killed the processes
>> (using pg_cancel_backend) and shortly after that started to get pings from
>> Nagios, which monitors the sl_status table, that replication was falling
>> behind.
>>
>>> a) Are new SYNC events being generated on the master?
>>
>> I'm not sure how I determine this.
>
> select max(ev_seqno) FROM sl_event where ev_origin=102;
>
> Every few seconds that number should be increasing.
>
>> Here is the sl_status table from the master over the same few minutes:
>>
>> =# select * from sl_status;
>>  st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |      st_lag_time
>> -----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+------------------------
>>        102 |         101 |    5004119000 | 2011-12-27 10:06:44.327499 |       5003985169 | 2011-12-27 10:04:16.351454 | 2011-12-24 07:13:35.509854 |            133831 | 3 days 06:40:41.911154
>>
>> =# select * from sl_status;
>>  st_origin | st_received | st_last_event |      st_last_event_ts      | st_last_received |    st_last_received_ts     | st_last_received_event_ts  | st_lag_num_events |      st_lag_time
>> -----------+-------------+---------------+----------------------------+------------------+----------------------------+----------------------------+-------------------+------------------------
>>        102 |         101 |    5004119000 | 2011-12-27 10:06:44.327499 |       5003985169 | 2011-12-27 10:04:16.351454 | 2011-12-24 07:13:35.509854 |            133831 | 3 days 06:42:49.222796
>>
>> As you can see the slave's st_last_event value appears to be increasing,
>> whereas the master's appears to be static.
>
> This tells me that (according to the master) the slave has so far processed
> event (102,5003985169) and the last event generated on the master is
> (102,5004119000). This means that your slave is 133,831 events behind. You
> didn't tell me what time you ran the above query at.
>
> If the above query was run at about 2011-12-27 10:06:xx then it probably
> means that replication is now working normally but your slave is behind
> 133,000 events.
>
> The line from your slave.slon.log below
>
> 2011-12-27 14:06:32 PST INFO remoteWorkerThread_102: SYNC 5003985339 done in
> 80.973 seconds
>
> tells me that at 14:06:32 the slave processed a SYNC from the master that
> took 80 seconds to process. Every 80 seconds or so you should be seeing
> another similar line. The difference between the SYNC # in the previous
> log line and this one tells you how many SYNCs were processed in that
> 80.973 seconds.
>
> Now that it is 6-8 or so hours later, is your cluster closer to being
> caught up or has it fallen further behind?
>
> If you're getting caught up then you should be able to estimate when you
> will be caught up.
>
> If you're falling even farther behind then your options are:
>
> 1) Unsubscribe + resubscribe the slave.
> As I mentioned this morning, if your database is small enough and your
> network is fast enough this might be faster than letting it get caught up.
>
> 2) If you can't resubscribe, then tune the sync_interval (make it larger)
> on the master and tune the sync_max_groupsize (make it larger) on the slave,
> as described at
> http://www.slony.info/documentation/2.0/slon-config-interval.html, to make
> the master generate fewer syncs and the slave process more of those syncs
> together.
>
> 3) Consider upgrading your cluster to 2.1.0; you can upgrade from 2.0.7 to
> 2.1.0 on a cluster that is not caught up.
>
> If my assumptions are correct then this is an instance of slony bug #167
> (fixed in 2.1.0) where sync processing in a cluster that is behind takes a
> lot longer than it should. Your long running transactions made sl_log get
> very big, and now slony takes longer to process a SYNC than it does to
> generate a new one. 2.1.0 should process SYNC events in this case much
> faster.
>
>> Thanks so much for your help thus far. Please advise with any attempted
>> fixes you can think of or if you can possibly describe what is currently
>> going on. Cheers.
>>
>>> Steve
>>>
>>>> Under incredible load last week during the Christmas season our primary
>>>> PG (8.4.7; Slony 2.0.7) stopped replicating. Now that we are past the
>>>> Christmas season I need to figure out how to clear the backlog of
>>>> replication rows in sl_log_1/2. This is all running on our commercial
>>>> website and if possible I would prefer not to have to restart the PG
>>>> instance, as it would require a scheduled maintenance window on a week
>>>> where everyone is out of the office. In an attempt to fix the issue
>>>> without rebooting the PG instance I've already restarted the Slony
>>>> services on the primary PG node as a first attempt at a fix. This did not
>>>> get replication working again and I'm still getting the same error from
>>>> Slony in the logs: log switch to sl_log_1 still in progress - sl_log_2 not
>>>> truncated
>>>>
>>>> From my research I can see that this error message is emitted when the
>>>> function logswitch_finish() is called. I did have some hung vacuums
>>>> during this period of high load on the server but I have killed them with
>>>> pg_cancel_backend. From other lock analysis I can see that nothing is
>>>> currently running or locked in the db (nothing more than a few
>>>> milliseconds old at least). I'm certain whatever transaction was in
>>>> progress that prevented the switch from occurring is long since past.
>>>>
>>>> Any ideas on the best way to get replication working again? I'm averse
>>>> to rebooting the PG instance but I am willing to do it if someone more
>>>> knowledgeable out there thinks it would fix this issue. We currently are
>>>> operating without a backup of all of our XMas sales data and I *really*
>>>> want to get this data replicated. Any help would be appreciated.
>>>>
>>>> Mike Wilson
>>>> Predicate Logic
>>>> Cell: (310) 600-8777
>>>> SkypeID: lycovian
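To put a number on Steve's catch-up estimate, you can run this on the master a few minutes apart and compare (a rough check only; node ids as in the sl_status output above, 102 = master, 101 = slave):

  SELECT st_lag_num_events, st_lag_time
    FROM sl_status
   WHERE st_origin = 102 AND st_received = 101;

If st_lag_num_events shrinks between runs you are catching up, and the remaining events divided by the rate it drops gives a rough ETA; if it keeps growing, pick one of the options above.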
-- 
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin

_______________________________________________
Slony1-general mailing list
[email protected]
http://lists.slony.info/mailman/listinfo/slony1-general
