On Apr 2, 2007, at 4:44 PM, Andrew Hammond wrote:

> On 3/30/07, Richard Yen <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> As a follow-up to my previous post about sl_confirm getting aged, I
>> *did* do a move_set from node 4 to node 1 about 6 days ago.  Any
>> reason why the slon cleanup cycle didn't pick up these confirmations
>> and delete them?  Perhaps it is a bug of some sort?
>
> Or, perhaps the confirmation set wasn't complete for all nodes, and
> the slons were behaving correctly?
Not sure what you mean here.  How do I check for confirmation set  
completeness?
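
The closest I can come up with on my own is grouping sl_confirm by
origin/receiver pair and eyeballing whether every pair is present and
current -- something like this untested sketch (it assumes a cluster
schema named "_mycluster"; substitute your own cluster name):

    -- latest confirmation seen for each origin/receiver pair
    SELECT con_origin, con_received,
           max(con_seqno)     AS last_confirmed_seqno,
           max(con_timestamp) AS last_confirmed_at
      FROM _mycluster.sl_confirm
     GROUP BY con_origin, con_received
     ORDER BY con_origin, con_received;

Is that the right idea, or is there a more direct way to do it?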

>> In any case, I deleted the rows in sl_confirm, so the
>
> Clever. Did it occur to you that perhaps they're there for a reason
> and that simply deleting them is not going to fix your problem, but
> may in fact make it worse? You have probably broken your replication
> cluster, unless you kept some copies of the deleted rows around.
>
> Alternatively you can just assume that the syncs mentioned in
> sl_confirm were applied and then (optionally) try to figure out which
> ones they were in sl_log and purge them out of there too. However,
> this strikes me as a pretty sloppy way to treat your data and cluster.
Lesson learned.  I'm a rookie DBA, and that's why I'm looking for  
help from people like you guys.
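
In hindsight I should at least have kept a copy of the rows before
touching anything -- even something as simple as this sketch (again
assuming the hypothetical "_mycluster" schema name):

    -- snapshot sl_confirm before deleting anything from it
    CREATE TABLE sl_confirm_backup AS
        SELECT * FROM _mycluster.sl_confirm;

I'll do that next time.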

I *did*, however, check both sl_log_1 and sl_log_2 for entries
corresponding to the rows in sl_confirm.  I also checked sl_event, but
found nothing (perhaps I should've checked elsewhere, but I couldn't
find any direction in the documentation).  I concluded that these rows
in sl_confirm were, in effect, orphaned, and deleted them.  True, it's
sloppy--I'll not do it again.
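
For reference, those checks were roughly along these lines (an
approximate reconstruction, still assuming the "_mycluster" schema
name; the seqnos are the ones flagged in the sl_confirm aging report
quoted below):

    -- do the events the stale confirms point at still exist?
    SELECT ev_origin, ev_seqno, ev_type, ev_timestamp
      FROM _mycluster.sl_event
     WHERE ev_origin = 4
       AND ev_seqno IN (6030307, 6030310);

    -- is there any log data left from that origin?
    SELECT count(*) FROM _mycluster.sl_log_1 WHERE log_origin = 4;
    SELECT count(*) FROM _mycluster.sl_log_2 WHERE log_origin = 4;

Everything came back empty or zero.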

>> test_slony_state-dbi.pl script doesn't list these anomalies anymore.
>
> Of course not. By treating the symptom, you've managed to further
> obscure your actual problem.
>
>> Has anyone else encountered this, or does anyone have an
>> explanation for it?
>
> Slightly messed up listen paths? Slons which needed a restart? Who
> knows? I doubt we can help you figure it out now that you've deleted
> the evidence.
>
test_slony_state-dbi.pl said "No problems found with sl_listen", and I
*did* try restarting all of the slon daemons.  Neither did anything to
the rows in sl_confirm.

--Richard




>> --Richard
>>
>>
>>
>>
>> On Mar 30, 2007, at 12:17 PM, Richard Yen wrote:
>>
>> > Hi all,
>> >
>> > I've recently been experiencing climbing lags, followed by a sudden
>> > drop, at random times during the day.  I understand that for some
>> > people a ~40 event lag isn't much, but it's quite unusual for my
>> > cluster.
>> >
>> > I run a 4-node cluster (1 provider, 3 subscribers), and it appears
>> > that at random times, the event lag climbs up to ~40, and then
>> > suddenly drops to 0.  Load on all nodes is < 1.0 during these
>> > times, so I don't suspect that it's hardware or configuration.
>> > That leaves me with no explanation of what's happening that causes
>> > these "lag spikes."
>> >
>> > Tried running test_slony_state-dbi.pl, and found the following
>> > output:
>> >
>> > ===BEGIN LOG===
>> > Tests for node 1 - DSN = dbname=tii host=tii-db1.oaktown.iparadigms.com user=slony password=3l3phant
>> > ========================================
>> > pg_listener info:
>> > Pages: 9
>> > Tuples: 1
>> >
>> > Size Tests
>> > ================================================
>> >         sl_log_1      1918 26082.000000
>> >         sl_log_2         0  0.000000
>> >        sl_seqlog        20 1543.000000
>> >
>> > Listen Path Analysis
>> > ===================================================
>> > No problems found with sl_listen
>> >
>> > ------------------------------------------------------------------------
>> > Summary of event info
>> > Origin  Min SYNC  Max SYNC Min SYNC Age Max SYNC Age
>> > ========================================================================
>> >        2   2277006   2277401     00:00:00     00:19:00    0
>> >        1   2999671   3001970     00:00:00     00:19:00    0
>> >        5    516048    516088     00:00:00     00:20:00    0
>> >        4    173746    174140     00:00:00     00:19:00    0
>> >
>> >
>> > ------------------------------------------------------------------------
>> > Summary of sl_confirm aging
>> >     Origin   Receiver   Min SYNC   Max SYNC  Age of latest SYNC  Age of eldest SYNC
>> > ========================================================================
>> >          1          2    2999672    3001969      00:00:00      00:19:00    0
>> >          1          4    2999678    3001969      00:00:00      00:19:00    0
>> >          1          5    2999671    3001962      00:00:00      00:19:00    0
>> >          2          1    2277006    2277401      00:00:00      00:19:00    0
>> >          2          4    2277006    2277401      00:00:00      00:19:00    0
>> >          2          5    2277006    2277400      00:00:00      00:19:00    0
>> >          4          1     173746     174140      00:00:00      00:19:00    0
>> >          4          2    6030310    6030310  6 days 01:52:00  6 days 01:52:00    1
>> >          4          5    6030307    6030307  6 days 01:52:00  6 days 01:52:00    1
>> >          5          1     516048     516088      00:00:00      00:20:00    0
>> >          5          2     516048     516088      00:00:00      00:20:00    0
>> >          5          4     516048     516088      00:00:00      00:20:00    0
>> >
>> >
>> > ------------------------------------------------------------------------
>> >
>> > Listing of old open connections
>> >         Database             PID            User    Query Age                Query
>> > ========================================================================
>> > ===END OF LOG===
>> >
>> > If you notice, the lines for Origin->Receiver on 4->2 and 4->5 have
>> > some old SYNCs.  These nodes (2 and 5) are the ones I experience the
>> > "lag spikes" on.  The other subscriber, node 4, doesn't experience
>> > lag spikes at all.  This report is similar for every node in the
>> > test_slony_state-dbi.pl script, so I'm kind of perplexed.
>> >
>> > Wondering if anyone would be able to interpret this for me and
>> > provide any help/advice.
>> >
>> > Thanks a lot!
>> > --Richard

_______________________________________________
Slony1-general mailing list
[email protected]
http://gborg.postgresql.org/mailman/listinfo/slony1-general
