On Tuesday 06 February 2007 7:47 pm, [EMAIL PROTECTED] wrote:
> > On Tue, Feb 06, 2007 at 05:34:34PM -0600, Dan Falconer wrote:
> >>    I'm going to leave this overnight, and find out what happens, but I'm
> >> not very hopeful.  Tomorrow, if it hasn't caught up significantly, I'm
> >> going to have to do something drastic... and if 1.2.6 continues to drop
> >> behind so badly, I may have to (attempt to) revert back to 1.1.0.  I may
> >> try just
> >
> > Naw: if that's what's happening, something is wrong in a way we need
> > to fix right away, and will do in collaboration with you.  We'll need
> > more data, though, about what's going on under the hood.
>
> Further, I don't think there's anything about 1.1 that would be expected
> to be *better* than 1.2, in terms of performance.
>
> The one thing that would be expected to affect performance is the
> switching between log tables (sl_log_1 and sl_log_2).  And the fact that
> this has the ability to *empty* the tables should be an improvement.
>
> I can suggest one thing to take a particular look at, namely what indices
> are on sl_log_1 and sl_log_2.
>
> There should, over time, become a set of partial indexes on these tables
> based on the node numbers that are the origins of replication sets.  That
> should be a help (e.g. - better performance than in 1.1, which didn't do
> this).
>
> If there aren't good indices on these tables, which would be unexpected,
> that could cause problems.
>

        First off, I'd like to say that you guys have done a great job on
Slony--for the most part ;)--and I really appreciate your help.

        After running the vacuum analyze + restart of the slon daemons, it
ran through at least one "burst" of activity (where it suddenly cleared 1M
records out of sl_log_2 on the master).  It then seemed to start falling
behind again, so I kept an eye on that table, and on the slave's latency,
with a query recommended by Chris Browne to our former DBA (our replication
cluster is "pl_replication"):

select con_origin, con_received, max(con_seqno), max(con_timestamp),
       now() - max(con_timestamp) as age
  from _pl_replication.sl_confirm
 group by con_origin, con_received
 order by age;
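
        For anyone wanting to do the same, the vacuum analyze side of that is
nothing fancy; something along these lines should do it (just a sketch--the
schema name matches our "pl_replication" cluster, and a plain database-wide
VACUUM ANALYZE works too):

-- rough sketch: vacuum/analyze the Slony log tables for cluster "pl_replication"
vacuum analyze _pl_replication.sl_log_1;
vacuum analyze _pl_replication.sl_log_2;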

        Anyway, this morning it appears to have pounded through and finally
caught up.  I think the problem really was resolved by the vacuum analyze +
restart of the slons... my co-worker was talking to me about it, and
mentioned that it may be something along the lines of what happens in Perl
with prepared statements: you get a good plan right away, but if the table
grows too fast after that, the plan goes "stale" and the queries get more
expensive.  Might be something to think about, though I have little knowledge
of the deeper inner workings of Slony (I'm okay with using the good ol' "it's
magic" explanation).
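
        If it happens again, one way to sanity-check that "stale plan / stale
statistics" theory would be to compare the planner's idea of the log table
sizes with the real row counts (a sketch, assuming the same _pl_replication
schema as above):

-- the planner's row estimates; these go stale if analyze hasn't run recently
select n.nspname, c.relname, c.reltuples::bigint as planner_estimate
  from pg_class c
  join pg_namespace n on n.oid = c.relnamespace
 where n.nspname = '_pl_replication'
   and c.relname in ('sl_log_1', 'sl_log_2');

-- the actual count, for comparison
select count(*) from _pl_replication.sl_log_2;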

        About the partial indexes: the slave appears to have a partial index
on both sl_log_1 and sl_log_2, while the master has a partial index on
sl_log_1 ONLY (important because sl_log_2 still has about 23,000 records in
it).  The slave is now only ~7sec behind... I would expect that whatever
magic causes the master to start using sl_log_1 again will also cause that
partial index to get thrown onto sl_log_2.
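
        For anyone wanting to check the same thing on their own nodes, a query
along these lines against pg_indexes should list whatever indexes (partial or
otherwise) are on the log tables--again, swap in your own cluster's schema
name:

-- show the index definitions on the Slony log tables
select tablename, indexname, indexdef
  from pg_indexes
 where schemaname = '_pl_replication'
   and tablename in ('sl_log_1', 'sl_log_2')
 order by tablename, indexname;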

        Final thought: if my Slony setup hasn't figured itself out already,
then I would venture to guess that this problem will recur next week after
inventory processing.  If so, I'll put a message on the list again, and maybe
we can figure out what causes this little beasty to rear its ugly
not-so-little head.

-- 
Best Regards,


Dan Falconer
"Head Geek",
AvSupport, Inc. (http://www.partslogistics.com)
_______________________________________________
Slony1-general mailing list
[email protected]
http://gborg.postgresql.org/mailman/listinfo/slony1-general
