Thanks Steve.  I've restarted the master/slave slon processes with the relevant 
parameters set to roughly 5 times their previous values to see if it helps.  
I'll let it run overnight.  If we are still falling behind in the morning I 
will take your advice and compile and install 2.1.0.  Thanks for your help, 
you rock!

Mike Wilson
Predicate Logic
Cell: (310) 600-8777
SkypeID: lycovian




On Dec 27, 2011, at 5:25 PM, Steve Singer wrote:

> On Tue, 27 Dec 2011, Mike Wilson wrote:
> 
>> Thanks for the reply.  Per your questions here is additional information.  
>> For the purposes of this discussion db1 (node: 101) is our Slave and db2 
>> (node:102) is our Slony Master.
>> 
>>> On Tue, 27 Dec 2011, Mike Wilson wrote:
> 
>>> 
>>> 2. Now that you have killed the old vacuums you need to determine what if 
>>> anything slony is doing.
>> During the period where replication stopped I found several vacuums that 
>> had been running for upwards of 11 days.  I killed those processes (using 
>> pg_cancel_backend) and shortly after that started to get pings from Nagios, 
>> which monitors the sl_status table, that replication was falling behind.
>>> 
>>> a) Are new SYNC events being generated on the master?
>> I'm not sure how I determine this.
> 
> select max(ev_seqno) FROM sl_event where ev_origin=102;
> Every few seconds that number should be increasing.
> 
>> 
>> Here is the sl_status table from the master over the same few minutes:
>> =# select * from sl_status;
>> -[ RECORD 1 ]-------------+---------------------------
>> st_origin                 | 102
>> st_received               | 101
>> st_last_event             | 5004119000
>> st_last_event_ts          | 2011-12-27 10:06:44.327499
>> st_last_received          | 5003985169
>> st_last_received_ts       | 2011-12-27 10:04:16.351454
>> st_last_received_event_ts | 2011-12-24 07:13:35.509854
>> st_lag_num_events         | 133831
>> st_lag_time               | 3 days 06:40:41.911154
>> 
>> =# select * from sl_status;
>> -[ RECORD 1 ]-------------+---------------------------
>> st_origin                 | 102
>> st_received               | 101
>> st_last_event             | 5004119000
>> st_last_event_ts          | 2011-12-27 10:06:44.327499
>> st_last_received          | 5003985169
>> st_last_received_ts       | 2011-12-27 10:04:16.351454
>> st_last_received_event_ts | 2011-12-24 07:13:35.509854
>> st_lag_num_events         | 133831
>> st_lag_time               | 3 days 06:42:49.222796
>> 
>> As you can see, the slave's st_last_event value appears to be increasing, 
>> whereas the master's appears to be static.
> 
> This tells me that (according to the master) the slave has so far processed 
> event (102,5003985169) and the last event generated on the master is 
> (102,5004119000).  This means that your slave is 133,831 events behind.  You 
> didn't tell me what time you ran the above query.
> 
> If the above query was run at about 2011-12-27 10:06:xx then it probably 
> means that replication is now working normally but your slave is behind by 
> 133,000 events.
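For reference, the "events behind" figure is just arithmetic over two sl_status columns; a quick sanity check using the values from the output above:

```python
# Events behind = last SYNC generated on the master minus the last
# SYNC the slave has confirmed (values copied from sl_status above).
st_last_event = 5004119000      # latest event generated on the master
st_last_received = 5003985169   # latest event the slave has processed

events_behind = st_last_event - st_last_received
print(events_behind)  # → 133831, matching st_lag_num_events
```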
> 
> The line from your slave.slon.log below
> 2011-12-27 14:06:32 PSTINFO   remoteWorkerThread_102: SYNC 5003985339 done in 
> 80.973 seconds
> 
> tells me that at 14:06:32 the slave processed a SYNC from the master that 
> took 80 seconds to process.  Every 80 seconds or so you should be seeing 
> another similar line.  The difference between the SYNC # in the previous log 
> line and this one tells you how many SYNCs were processed in that 80.973 
> seconds.
> 
> Now that it is 6-8 or so hours later, is your cluster close to being caught 
> up, or has it fallen further behind?
> 
> If you're getting caught up then you should be able to estimate when you 
> will be caught up.
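A back-of-the-envelope way to make that estimate is to compare the slave's SYNC-processing rate against the master's SYNC-generation rate.  In this sketch the SYNCs-per-group count and the master's sync_interval are hypothetical placeholders, not values taken from your logs; only group_seconds comes from the log line above:

```python
# Trend check: is the slave applying SYNCs faster than the master
# generates them?  Positive net_rate means the slave is catching up.
events_behind = 133831      # st_lag_num_events from sl_status
group_seconds = 80.973      # one SYNC group took this long (slave log)
syncs_per_group = 40        # hypothetical: SYNC # delta between log lines
sync_interval_s = 2.0       # hypothetical: master emits one SYNC per 2 s

slave_rate = syncs_per_group / group_seconds   # SYNCs applied per second
master_rate = 1.0 / sync_interval_s            # SYNCs generated per second
net_rate = slave_rate - master_rate            # SYNCs closed per second

if net_rate > 0:
    hours = events_behind / net_rate / 3600.0
    print(f"roughly {hours:.1f} hours to catch up")
else:
    print("falling further behind")  # → this branch, with these numbers
```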
> 
> If you're falling even farther behind then your options are:
> 
> 1) Unsubscribe + resubscribe the slave.  As I mentioned this morning, if 
> your database is small enough and your network is fast enough this might 
> be faster than letting it get caught up.
> 2) If you can't resubscribe, then tune sync_interval (make it larger) 
> on the master and tune sync_group_maxsize (make it larger) on the slave, 
> as described at 
> http://www.slony.info/documentation/2.0/slon-config-interval.html, to make 
> the master generate fewer SYNCs and the slave process more of those SYNCs 
> together.
> 3) Consider upgrading your cluster to 2.1.0; you can upgrade from 2.0.7 to 
> 2.1.0 on a cluster that is not caught up.
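A sketch of what the tuning in option 2 might look like in the two slon config files (the option is spelled sync_group_maxsize in the slon config); the numbers are illustrative assumptions only, so check the documentation linked above for valid ranges and defaults:

```
# slon config on the master (node 102) -- illustrative value only:
# generate a new SYNC at most every 5 seconds
sync_interval=5000

# slon config on the slave (node 101) -- illustrative value only:
# allow up to 100 SYNCs to be grouped into one apply transaction
sync_group_maxsize=100
```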
> 
> If my assumptions are correct then this is an instance of Slony bug #167 
> (fixed in 2.1.0) where SYNC processing in a cluster that is behind takes a 
> lot longer than it should.  Your long-running transactions made sl_log get 
> very big, and now Slony takes longer to process a SYNC than it does to 
> generate a new one.  2.1.0 should process SYNC events much faster in this 
> case.
> 
> 
> 
> 
>> Thanks so much for your help thus far.  Please advise with any attempted 
>> fixes you can think of or if you can possibly describe what is currently 
>> going on.  Cheers.
>> 
>>> 
>>> Steve
>>> 
>>>> Under incredible load last week during the Christmas season our primary PG 
>>>> (8.4.7; Slony 2.0.7) stopped replicating.  Now that we are past the 
>>>> Christmas season I need to figure out how to clear the backlog of 
>>>> replication rows in sl_log_1/2.  This is all running on our commercial 
>>>> website, and if possible I would prefer not to restart the PG 
>>>> instance, as that would require a scheduled maintenance window in a week 
>>>> when everyone is out of the office.  In an attempt to fix the issue 
>>>> without rebooting the PG instance I've already restarted the Slony 
>>>> services on the primary PG node.  This did not get replication working 
>>>> again and I'm still getting the same error from Slony in the logs: 
>>>> "log switch to sl_log_1 still in progress - sl_log_2 not truncated"
>>>> 
>>>> From my research I can see that this error message is raised when the 
>>>> function logswitch_finish() is called.  I did have some hung vacuums 
>>>> during this period of high load on the server, but I have killed them with 
>>>> pg_cancel_backend.  From other lock analysis I can see that nothing is 
>>>> currently running or locked in the db (nothing more than a few 
>>>> milliseconds old, at least).  I'm certain whatever transaction was in 
>>>> progress that prevented the switch from occurring is long since past.
>>>> 
>>>> Any ideas on the best way to get replication working again?  I'm averse 
>>>> to rebooting the PG instance, but I am willing to do it if someone more 
>>>> knowledgeable out there thinks it would fix this issue.  We are currently 
>>>> operating without a backup of all of our Xmas sales data and I *really* 
>>>> want to get this data replicated.  Any help would be appreciated.
>>>> 
>>>> Mike Wilson
>>>> Predicate Logic
>>>> Cell: (310) 600-8777
>>>> SkypeID: lycovian
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> Slony1-general mailing list
>>>> [email protected]
>>>> http://lists.slony.info/mailman/listinfo/slony1-general
>>>> 
>>> 
>> 
>> 
> 
