On Tuesday 06 February 2007 7:47 pm, [EMAIL PROTECTED] wrote:
> > On Tue, Feb 06, 2007 at 05:34:34PM -0600, Dan Falconer wrote:
> >> I'm going to leave this overnight, and find out what happens, but I'm
> >> not very hopeful. Tomorrow, if it hasn't caught up significantly, I'm
> >> going to have to do something drastic... and if 1.2.6 continues to drop
> >> behind so badly, I may have to (attempt to) revert back to 1.1.0. I may
> >> try just
> >
> > Naw: if that's what's happening, something is wrong in a way we need
> > to fix right away, and will do in collaboration with you. We'll need
> > more data, though, about what's going on under the hood.
>
> Further, I don't think there's anything about 1.1 that would be expected
> to be *better* than 1.2, in terms of performance.
>
> The one thing that would be expected to affect performance is the
> switching between log tables (sl_log_1 and sl_log_2). And the fact that
> this has the ability to *empty* the tables should be an improvement.
>
> I can suggest one thing to take a particular look at, namely what indices
> are on sl_log_1 and sl_log_2.
>
> Over time, a set of partial indexes should appear on these tables, based
> on the node numbers that are the origins of replication sets. That should
> help (e.g. - better performance than in 1.1, which didn't do this).
>
> If there aren't good indices on these tables, which would be unexpected,
> that could cause problems.
>
First off, I'd like to say that you guys have done a great job on
Slony--for the most part ;) -- and I really appreciate your help.
After running the vacuum analyze + restart of the slon daemons, it ran
through at least one "burst" of activity (where it suddenly cleared 1M
records out of sl_log_2 on the master). It then seemed to start falling
behind again, so I kept checking that table and the slave's latency with a
query Chris Browne had recommended to our former DBA (our replication
cluster is "pl_replication"):

    select con_origin, con_received, max(con_seqno), max(con_timestamp),
           now() - max(con_timestamp) as age
      from _pl_replication.sl_confirm
     group by con_origin, con_received
     order by age;
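For anyone following along, the maintenance step was essentially this (a
sketch only; the "_pl_replication" schema name is taken from the query
above, adjust for your cluster):

    VACUUM ANALYZE _pl_replication.sl_log_1;
    VACUUM ANALYZE _pl_replication.sl_log_2;
    -- then restart the slon daemons on both the master and the slave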
Anyway, this morning it appears to have pounded through and finally caught
up. I think the problem really was resolved by the vacuum analyze + restart
of the slons... my co-worker was talking to me about it, and suggested it
may be something along the lines of what happens in Perl with prepared
statements: the statement gets a good plan right away, but if the table
grows too fast after that, the plan becomes "stale" and more expensive.
Might be something to think about, though I have little knowledge of the
deeper inner workings of Slony (I'm okay with using the good ol' "it's
magic" explanation).
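To illustrate the "stale plan" idea (a minimal sketch against a made-up
table, not Slony's actual internals):

    CREATE TABLE log_demo (id serial PRIMARY KEY, origin int, payload text);
    PREPARE fetch_origin (int) AS
        SELECT * FROM log_demo WHERE origin = $1;
    -- the plan is chosen here, while the table is still tiny
    -- ... bulk-load a few million rows ...
    EXECUTE fetch_origin(1);  -- may still run with the "tiny table" plan
    ANALYZE log_demo;         -- refresh the planner's statistics
    DEALLOCATE fetch_origin;  -- the next PREPARE gets a fresh plan

That would fit the fix we saw: restarting the slons throws away their
cached plans, and the analyze gives the planner accurate statistics again.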
About the partial indexes: the slave appears to have a partial index on
both sl_log_1 and sl_log_2, while the master has a partial index on
sl_log_1 ONLY (important because sl_log_2 still has about 23,000 records in
it). The slave is now only ~7sec behind... I would expect that whatever
magic causes the master to start using sl_log_1 again will also cause that
partial index to get thrown onto sl_log_2.
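One way to check this on your own nodes (a sketch using the standard
pg_indexes view; the schema name again comes from the cluster name):

    SELECT tablename, indexname, indexdef
      FROM pg_indexes
     WHERE schemaname = '_pl_replication'
       AND tablename IN ('sl_log_1', 'sl_log_2')
     ORDER BY tablename, indexname;

Partial indexes show up with a WHERE clause at the end of indexdef.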
Final thought: if my Slony setup hasn't figured itself out already, then I
would venture to guess that this problem will recur next week after
inventory processing. If so, I'll put a message on the list again, and
maybe we can figure out what causes this little beasty to rear its ugly
not-so-little head.
--
Best Regards,
Dan Falconer
"Head Geek",
AvSupport, Inc. (http://www.partslogistics.com)