This looks very much like a prime example of the log selection query 
problem we fixed in 2.1. The catch-up gets slower and slower over 
time until the log switch succeeds.
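Once the long-running transactions are gone, the pending switch can also be
nudged along by hand instead of waiting for the next cleanup cycle. A sketch
from psql, with _yourcluster as a placeholder for the actual Slony cluster
schema name:

```sql
-- Ask Slony to try completing the pending log switch now.
-- _yourcluster is a placeholder; substitute your cluster schema.
SELECT _yourcluster.logswitch_finish();
```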


Jan


On 12/27/2011 8:25 PM, Steve Singer wrote:
> On Tue, 27 Dec 2011, Mike Wilson wrote:
>
>>  Thanks for the reply.  Per your questions, here is additional information.  
>> For the purposes of this discussion db1 (node 101) is our Slave and db2 
>> (node 102) is our Slony Master.
>>
>>>  On Tue, 27 Dec 2011, Mike Wilson wrote:
>
>>>
>>>  2. Now that you have killed the old vacuums you need to determine what if 
>>> anything slony is doing.
>>  During the period when replication stopped I found several vacuums 
>> that had been running for upwards of 11 days.   I killed those processes 
>> (using pg_cancel_backend) and shortly after that started getting pings from 
>> Nagios, which monitors the sl_status table, warning that replication was 
>> falling behind.
>>>
>>>  a) Are new SYNC events being generated on the master?
>>  I'm not sure how I determine this.
>
> select max(ev_seqno) FROM sl_event where ev_origin=102;
> Every few seconds that number should be increasing.
>
>>
>>  Here is the sl_status table from the master over the same few minutes:
>>  =# select * from sl_status;
>>  -[ RECORD 1 ]-------------+---------------------------
>>  st_origin                 | 102
>>  st_received               | 101
>>  st_last_event             | 5004119000
>>  st_last_event_ts          | 2011-12-27 10:06:44.327499
>>  st_last_received          | 5003985169
>>  st_last_received_ts       | 2011-12-27 10:04:16.351454
>>  st_last_received_event_ts | 2011-12-24 07:13:35.509854
>>  st_lag_num_events         | 133831
>>  st_lag_time               | 3 days 06:40:41.911154
>>
>>  =# select * from sl_status;
>>  -[ RECORD 1 ]-------------+---------------------------
>>  st_origin                 | 102
>>  st_received               | 101
>>  st_last_event             | 5004119000
>>  st_last_event_ts          | 2011-12-27 10:06:44.327499
>>  st_last_received          | 5003985169
>>  st_last_received_ts       | 2011-12-27 10:04:16.351454
>>  st_last_received_event_ts | 2011-12-24 07:13:35.509854
>>  st_lag_num_events         | 133831
>>  st_lag_time               | 3 days 06:42:49.222796
>>
>>  As you can see, the slave's st_last_event value appears to be increasing, 
>> whereas the master's appears to be static.
>
> This tells me that (according to the master) the slave has so far processed
> event (102,5003985169) and the last event generated on the master is
> (102,5004119000).  This means that your slave is 133,831 events behind.  You
> didn't tell me what time you ran the above query.
>
> If the above query was run at about 2011-12-27 10:06:xx then it probably
> means that replication is now working normally but your slave is
> 133,000 events behind.
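> That arithmetic can be pulled straight out of sl_status.  A sketch, assuming
> your Slony cluster schema is _yourcluster (a placeholder; substitute your
> actual cluster name):

```sql
-- Events still to be processed by the slave, per the master's view.
-- _yourcluster is a placeholder for your Slony cluster schema.
SELECT st_last_event - st_last_received AS events_behind,
       st_lag_time
FROM _yourcluster.sl_status
WHERE st_origin = 102
  AND st_received = 101;
```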
>
> The line from your slave.slon.log below
> 2011-12-27 14:06:32 PST INFO   remoteWorkerThread_102: SYNC 5003985339 done 
> in 80.973 seconds
>
> tells me that at 14:06:32 the slave processed a SYNC from the master that
> took 80 seconds to process.  Every 80 seconds or so you should be seeing
> another similar line.  The difference between the SYNC # in the previous
> log line and this one tells you how many SYNCs were processed in those
> 80.973 seconds.
>
> Now that it is 6-8 or so hours later, is your cluster closer to being
> caught up or has it fallen further behind?
>
> If you're catching up then you should be able to estimate when you will
> be caught up.
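> That estimate only needs two consecutive "SYNC ... done" log lines.  A
> sketch (the SYNC numbers and batch duration below are hypothetical
> placeholders; substitute the values from your own slon log):

```sql
-- ETA = events_behind / (SYNCs grouped per batch) * batch duration.
-- Placeholder figures: 133831 events behind; consecutive log lines
-- 170 SYNCs apart; each batch taking ~80.973 seconds.
SELECT interval '80.973 seconds'
       * (133831.0 / (5003985339 - 5003985169)) AS estimated_catch_up;
-- With these placeholder numbers, roughly 17.7 hours.
```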
>
> If you're falling even further behind then your options are:
>
> 1) Unsubscribe + resubscribe the slave.  As I mentioned this morning, if
> your database is small enough and your network is fast enough this
> might be faster than letting it catch up.
> 2) If you can't resubscribe, then tune sync_interval (make it larger)
> on the master and tune sync_max_groupsize (make it larger) on the slave,
> as described at
> http://www.slony.info/documentation/2.0/slon-config-interval.html, to make
> the master generate fewer SYNCs and the slave process more of those SYNCs
> together.
> 3) Consider upgrading your cluster to 2.1.0; you can upgrade from 2.0.7 to
> 2.1.0 on a cluster that is not caught up.
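> Option 2 above amounts to a couple of lines in the slon config files.  A
> sketch, using the parameter spellings from the slon-config documentation
> page linked above (check the exact names and limits against your Slony
> version):

```
# slon.conf on the master (origin) node:
# generate a SYNC at most every 10 seconds
sync_interval=10000

# slon.conf on the slave (subscriber) node:
# allow up to 100 SYNCs to be grouped and applied together
sync_group_maxsize=100
```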
>
> If my assumptions are correct then this is an instance of Slony bug #167
> (fixed in 2.1.0) where SYNC processing in a cluster that is behind takes a
> lot longer than it should.  Your long-running transactions made sl_log get
> very big, and now Slony takes longer to process a SYNC than it does to
> generate a new one.  2.1.0 should process SYNC events much faster in this
> case.
>
>
>
>
>>  Thanks so much for your help thus far.  Please advise on any fixes you
>>  can think of, or if possible describe what is currently going on.
>>  Cheers.
>>
>>>
>>>  Steve
>>>
>>>>  Under incredible load last week during the Christmas season our primary 
>>>> PG (8.4.7; Slony 2.0.7) stopped replicating.  Now that we are past the 
>>>> Christmas season I need to figure out how to clear the backlog of 
>>>> replication rows in sl_log_1/2.  This is all running on our commercial 
>>>> website, and if possible I would prefer not to have to restart the PG 
>>>> instance as it would require a scheduled maintenance window in a week 
>>>> where everyone is out of the office.  In an attempt to fix the issue 
>>>> without rebooting the PG instance I've already restarted the Slony 
>>>> services on the primary PG node as a first attempt at a fix.  This did not 
>>>> get replication working again and I'm still getting the same error from 
>>>> Slony in the logs: log switch to sl_log_1 still in progress - sl_log_2 not 
>>>> truncated
>>>>
>>>>>   From my research I can see that this error message is emitted when the 
>>>>> function logswitch_finish() is called.  I did have some hung vacuums 
>>>>> during this period of high load on the server but I have killed them with 
>>>>> pg_cancel_backend.  From other lock analysis I can see that nothing is 
>>>>> currently running or locked in the db (nothing more than a few 
>>>>> milliseconds old at least).  I'm certain whatever transaction was in 
>>>>> progress that prevented the switch from occurring is long since past.
>>>>
>>>>  Any ideas on the best way to get replication working again?  I'm averse 
>>>> to rebooting the PG instance but I am willing to do it if someone more 
>>>> knowledgeable out there thinks it would fix this issue.  We are currently 
>>>> operating without a backup of all of our Xmas sales data and I *really* 
>>>> want to get this data replicated.  Any help would be appreciated.
>>>>
>>>>  Mike Wilson
>>>>  Predicate Logic
>>>>  Cell: (310) 600-8777
>>>>  SkypeID: lycovian
>>>>
>>>>
>>>>
>>>>
>>>>  _______________________________________________
>>>>  Slony1-general mailing list
>>>>  [email protected]
>>>>  http://lists.slony.info/mailman/listinfo/slony1-general
>>>>
>>>
>>
>>
>


-- 
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin
