Christopher, I appreciate your efforts as well as those of everyone else on the list. I'm glad to see you folks haven't given up on me yet. :)

Christopher Browne wrote:
Geoffrey <[EMAIL PROTECTED]> writes:
Andrew Sullivan wrote:
I am by no means willing to dismiss the suggestion that there are bugs in
Slony; but this still looks to me very much like there's something we don't
know about what happened, that explains the errors you're seeing.
I would so love to figure out this issue.  I appreciate your efforts.

I simply don't understand how one table in particular could get so far
out of sync.  We're talking 300 records.

I can't imagine that slony is that fragile.  There's got to be
something going on that we don't see.

I agree.  From what I have heard, it doesn't sound like you have
experienced anything that should be scratching any of the edge points
of Slony-I.

300 records don't just disappear.

When I put this all together, I'm increasingly suspicious that you may
have experienced hardware problems or some such thing that might cause
data loss that Slony-I would have no way to address.

Understand, I'm not saying that I'm losing data, just that there are inconsistencies between the replication server and the primary. I don't believe we are losing data on the primary at all. What I see is that the record counts in some tables don't match, so the replication process is not working as expected. The weird thing is that not every table is affected, just a handful. We're talking 88 tables and 84 sequences, but only 4 tables have problems. Here's a comparison of record counts:

< count for adest 54055
---
> count for adest 54056
65c65
< count for mcarr 22560
---
> count for mcarr 22572
67c67
< count for mcust 63757
---
> count for mcust 63774
94c94
< count for tract 75380
---
> count for tract 75420
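
For anyone who wants to run the same kind of check, here is a minimal sketch of the comparison in Python. It assumes each side's counts have been dumped as lines of the form "count for <table> <n>", as in the diff above; the function names are hypothetical, and "left" and "right" simply mean the two sides of the diff.

```python
import re

# Matches lines such as "count for adest 54055" (format taken from the diff above).
COUNT_LINE = re.compile(r"^count for (\S+) (\d+)$")


def parse_counts(text):
    """Return {table_name: row_count} for every 'count for ...' line in text."""
    counts = {}
    for line in text.splitlines():
        match = COUNT_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def count_mismatches(left, right):
    """Return {table: (left_count, right_count)} for tables whose counts differ.

    A table present on only one side is reported with None for the other side.
    """
    tables = sorted(set(left) | set(right))
    return {
        table: (left.get(table), right.get(table))
        for table in tables
        if left.get(table) != right.get(table)
    }


if __name__ == "__main__":
    left = parse_counts("count for adest 54055\ncount for mcarr 22560\n")
    right = parse_counts("count for adest 54056\ncount for mcarr 22572\n")
    for table, (a, b) in count_mismatches(left, right).items():
        print(f"{table}: {a} vs {b}")
```

Row counts only show that the two sides differ. Finding which rows differ would need a comparison of keys, which this sketch does not attempt.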

This hardware has been rock solid since it was installed. If we were losing data on the primary, we would definitely hear about it. One thing I didn't mention is the actual configuration. Two boxes connected to a single data silo. It's a hot/hot configuration. Separate postmaster for each database. Half the postmasters run on one server, the other half on the other. If/when one fails, the other picks up the postmaster processes. Each database has its own IP, so I reference the host by multiple host names. Connect to database mwr via host mwr. In the event of failure, mwr IP is moved to the other machine.

<snip>

You've grown suspicious about *every* component, which, on the one
hand, is unsurprising, but on the other, not very useful.  I haven't
heard you mention anything that would cause me to expect Slony-I to
have eaten data, or to have even "started to look hungrily at the
data."

The only reason I keep looking at slony is because the system is rock solid. We don't lose data and these boxes are up 24/7. Folks are hitting them constantly. Slony is the only new part of the equation.

The notices you have mentioned are all benign things.  The one
question that comes to mind: Any interesting ERROR messages in the
PostgreSQL logs?  I'm getting more and more suspicious that something
about the entire DB cluster has gotten unstable, and if that's the
case, Slony-I wouldn't do any better than the DB it is running on...

There are no postgresql errors to speak of on the primary.

I do see the following in the postgresql log on the slave:

2008-02-19 19:30:59 [3216] NOTICE: type "_mwr_cluster.xxid" is not yet defined
DETAIL:  Creating a shell type definition.
2008-02-19 19:30:59 [3216] NOTICE: argument type _mwr_cluster.xxid is only a shell
2008-02-19 19:30:59 [3216] NOTICE: type "_mwr_cluster.xxid_snapshot" is not yet defined
DETAIL:  Creating a shell type definition.
2008-02-19 19:30:59 [3216] NOTICE: argument type _mwr_cluster.xxid_snapshot is only a shell

Since these are NOTICEs, I assume this is normal.

During the initial replication, I do see a number of:

2008-02-19 19:32:28 [2463] LOG: checkpoints are occurring too frequently (6 seconds apart)

But our problem doesn't seem to start until after the initial replication.
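
To check the logs for anything beyond NOTICE and LOG entries, a rough sketch like the following could tally lines by severity. It assumes the prefix format seen in the lines quoted above (timestamp, PID in brackets, severity); other log_line_prefix settings would need a different pattern. The function names are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   2008-02-19 19:32:28 [2463] LOG: checkpoints are occurring too frequently ...
# Continuation lines such as "DETAIL:  ..." carry no timestamp and are skipped.
LOG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[\d+\] ([A-Z]+):"
)

# Severities that usually deserve a closer look.
SERIOUS = ("WARNING", "ERROR", "FATAL", "PANIC")


def severity_counts(lines):
    """Count log entries per severity keyword."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def serious_lines(lines, wanted=SERIOUS):
    """Return the log entries whose severity is in `wanted`."""
    found = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(1) in wanted:
            found.append(line)
    return found
```

Running severity_counts over each server's log gives a quick view of whether anything other than NOTICE and LOG appears, and serious_lines pulls out the entries worth reading in full.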

--
Until later, Geoffrey

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
 - Benjamin Franklin