Christopher, I appreciate your efforts as well as those of everyone else
on the list. I'm glad to see you folks haven't given up on me yet. :)
Christopher Browne wrote:
Geoffrey writes:
Andrew Sullivan wrote:
I am by no means willing to dismiss the suggestion that there are bugs in
Slony; but this still looks to me very much like there's something we don't
know about what happened, that explains the errors you're seeing.
I would so love to figure out this issue. I appreciate your efforts.
I simply don't understand how one table in particular could get so far
out of sync. We're talking 300 records.
I can't imagine that slony is that fragile. There's got to be
something going on that we don't see.
I agree. From what I have heard, it doesn't sound like you have
experienced anything that should be scratching any of the edge points
of Slony-I.
300 records don't just disappear.
When I put this all together, I'm increasingly suspicious that you may
have experienced hardware problems or some such thing that might cause
data loss that Slony-I would have no way to address.
Understand, I'm not saying that I'm losing data, just that there are
inconsistencies between the replication server and the primary. I don't
believe we are losing data on the primary at all. What I see is that the
record counts in some tables don't match, so the replication process is
not working as expected. The weird thing is, not every table is
affected, just a handful. We're talking 88 tables and 84 sequences, but
only 4 tables have problems. Here's a comparison of record counts:
< count for adest 54055
---
> count for adest 54056
65c65
< count for mcarr 22560
---
> count for mcarr 22572
67c67
< count for mcust 63757
---
> count for mcust 63774
94c94
< count for tract 75380
---
> count for tract 75420
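
For anyone who wants to run the same kind of check, here is a minimal
Python sketch of the comparison. It is not the script that produced the
output above. It assumes each node's counts were saved as lines of the
form "count for <table> <n>", and the names parse_counts and
count_mismatches are made up for illustration.

```python
def parse_counts(text):
    """Parse lines like 'count for adest 54055' into {table: count}."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        # Expect exactly four fields: 'count', 'for', <table>, <number>.
        if len(parts) == 4 and parts[:2] == ["count", "for"]:
            counts[parts[2]] = int(parts[3])
    return counts


def count_mismatches(left, right):
    """Return {table: (left, right, right - left)} for tables whose counts differ."""
    diffs = {}
    for table in sorted(set(left) & set(right)):
        if left[table] != right[table]:
            diffs[table] = (left[table], right[table], right[table] - left[table])
    return diffs


# Counts copied from the '<' and '>' sides of the diff output above.
LEFT = """count for adest 54055
count for mcarr 22560
count for mcust 63757
count for tract 75380"""

RIGHT = """count for adest 54056
count for mcarr 22572
count for mcust 63774
count for tract 75420"""

if __name__ == "__main__":
    mismatches = count_mismatches(parse_counts(LEFT), parse_counts(RIGHT))
    for table, (l, r, delta) in mismatches.items():
        print(f"{table}: {l} vs {r} (off by {delta})")
```

With the four tables above, the gaps come to 1, 12, 17 and 40 rows.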
This hardware has been rock solid since it was installed. If we were
losing data on the primary, we would definitely hear about it. One
thing I didn't mention is the actual configuration. Two boxes connected
to a single data silo. It's a hot/hot configuration. Separate
postmaster for each database. Half the postmasters run on one server,
the other half on the other. If/when one fails, the other picks up the
postmaster processes. Each database has its own IP, so I reference the
host by multiple host names. Connect to database mwr via host mwr. In
the event of failure, mwr IP is moved to the other machine.
<snip>
You've grown suspicious about *every* component, which, on the one
hand, is unsurprising, but on the other, not very useful. I haven't
heard you mention anything that would cause me to expect Slony-I to
have eaten data, or to have even "started to look hungrily at the
data."
The only reason I keep looking at slony is because the system is rock
solid. We don't lose data and these boxes are up 24/7. Folks are
hitting them constantly. Slony is the only new part of the equation.
The notices you have mentioned are all benign things. The one
question that comes to mind: Any interesting ERROR messages in the
PostgreSQL logs? I'm getting more and more suspicious that something
about the entire DB cluster has gotten unstable, and if that's the
case, Slony-I wouldn't do any better than the DB it is running on...
There are no postgresql errors to speak of on the primary.
I do see the following in the postgresql log on the slave:
2008-02-19 19:30:59 [3216] NOTICE: type "_mwr_cluster.xxid" is not yet defined
DETAIL: Creating a shell type definition.
2008-02-19 19:30:59 [3216] NOTICE: argument type _mwr_cluster.xxid is only a shell
2008-02-19 19:30:59 [3216] NOTICE: type "_mwr_cluster.xxid_snapshot" is not yet defined
DETAIL: Creating a shell type definition.
2008-02-19 19:30:59 [3216] NOTICE: argument type _mwr_cluster.xxid_snapshot is only a shell
Since these are NOTICEs, I assume this is normal.
During the initial replication, I do see a number of:
2008-02-19 19:32:28 [2463] LOG: checkpoints are occurring too frequently (6 seconds apart)
But our problem doesn't seem to start until after the initial replication.
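
For anyone who wants to scan their own logs for the ERROR messages
Christopher asked about, here is a rough Python sketch that tallies
messages by severity. It assumes every message starts with
"<date> <time> [<pid>] <SEVERITY>:", as in the excerpts above; a
different log_line_prefix would need a different pattern. The names
LINE_RE and tally_severities are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as: 2008-02-19 19:30:59 [3216] NOTICE: ...
# Assumption: timestamp, then the pid in brackets, then the severity.
LINE_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[\d+\] ([A-Z]+):")


def tally_severities(lines):
    """Count log messages per severity; unprefixed lines (e.g. DETAIL) are skipped."""
    tally = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            tally[match.group(1)] += 1
    return tally


# Sample lines taken from the slave's log excerpts in this message.
SAMPLE = [
    '2008-02-19 19:30:59 [3216] NOTICE: type "_mwr_cluster.xxid" is not yet defined',
    "DETAIL: Creating a shell type definition.",
    "2008-02-19 19:30:59 [3216] NOTICE: argument type _mwr_cluster.xxid is only a shell",
    '2008-02-19 19:30:59 [3216] NOTICE: type "_mwr_cluster.xxid_snapshot" is not yet defined',
    "DETAIL: Creating a shell type definition.",
    "2008-02-19 19:30:59 [3216] NOTICE: argument type _mwr_cluster.xxid_snapshot is only a shell",
    "2008-02-19 19:32:28 [2463] LOG: checkpoints are occurring too frequently (6 seconds apart)",
]

if __name__ == "__main__":
    for severity, count in sorted(tally_severities(SAMPLE).items()):
        print(f"{severity}: {count}")
```

To check a whole file, pass an open file object instead of SAMPLE and
look for any ERROR, FATAL or PANIC entries in the result.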
--
Until later, Geoffrey
Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
- Benjamin Franklin