On Tuesday 13 February 2007 8:44 am, Dan Falconer wrote:
> On Monday 12 February 2007 5:17 pm, Dan Falconer wrote:
> > Once again, my slave server has fallen behind, even after doing the
> > vacuum analyze + restart slons trick. And once again, it was after
> > processing inventories... my slave has 1,079,592 records in sl_log_1,
> > while the master is sitting at a whopping 2,655,927. Also, the slave is
> > reporting (okay, a query I ran on the slave's db is reporting) that it is
> > now 14 hours behind, while the master appears to be saying that it's not
> > behind at all... unless I'm reading something wrong. Here's what I'm
> > running, & the output (FYI: node1 is the slave, node2 is the master):::
> >
> >
> > [------------------------- snip -------------------------]
> > MASTER=# select con_origin, con_received, max(con_seqno), max(con_timestamp),
> >              now() - max(con_timestamp) as age
> >          from _pl_replication.sl_confirm
> >          group by con_origin, con_received
> >          order by age;
> >  con_origin | con_received |  max   |            max             |       age
> > ------------+--------------+--------+----------------------------+-----------------
> >           1 |            2 | 120564 | 2007-02-12 17:11:15.42497  | 00:00:00.948413
> >           2 |            1 | 895115 | 2007-02-12 17:10:03.907914 | 00:01:12.465469
> > (2 rows)
> >
> >
> > SLAVE=# select con_origin, con_received, max(con_seqno), max(con_timestamp),
> >             now() - max(con_timestamp) as age
> >         from _pl_replication.sl_confirm
> >         group by con_origin, con_received
> >         order by age;
> >  con_origin | con_received |  max   |            max             |       age
> > ------------+--------------+--------+----------------------------+-----------------
> >           2 |            1 | 895115 | 2007-02-12 17:10:03.907914 | 00:01:35.189915
> >           1 |            2 | 115554 | 2007-02-12 02:50:16.085218 | 14:21:23.012611
> > (2 rows)
> > [------------------------- /snip -------------------------]
> >
> >
> > I find the output slightly disturbing: the master (node2) thinks the
> > slave (node1) is lagging just a little, but the slave thinks it's lagging
> > A LOT. Am I reading something wrong?
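> >
> > (A hedged aside: if the sl_status view is available in this Slony version,
> > and I believe 1.2 creates one in the replication schema, it should report
> > the same lag in one place from the origin's point of view; the column names
> > below are from memory, so treat this as a sketch:)
> >
> > MASTER=# select st_origin, st_received, st_lag_num_events, st_lag_time
> >          from _pl_replication.sl_status;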
> >
> > Also, in an effort to fix the problem, I manually ran a vacuum analyze
> > verbose on ALL of the Slony tables. Nothing of any consequence turned up
> > there.
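> >
> > (For reference, the vacuums were along these lines; the list of tables is
> > a sketch rather than an exact transcript of what I ran:)
> >
> > VACUUM ANALYZE VERBOSE _pl_replication.sl_log_1;
> > VACUUM ANALYZE VERBOSE _pl_replication.sl_seqlog;
> > VACUUM ANALYZE VERBOSE _pl_replication.sl_event;
> > VACUUM ANALYZE VERBOSE _pl_replication.sl_confirm;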
> >
> > One final bit of information: when the servers were recovered a few
> > weeks ago from a disastrous crash of the SAN, we found that our backups
> > were missing a copy of the postgresql.conf file. We've been tweaking the
> > one copied from our development server. Does anybody have any insight into
> > tweaks that might make a difference? (A sketch of the sort of settings I
> > mean follows the hardware details below.) Pertinent information (copied
> > from my original post):::
> >
> > Master & Slave (identical setup):
> > HARDWARE::: dual opteron 846 procs, 8G ram, RAID5 array (SAN)
> > running 6 fibre 15k drives. Internal OS runs on mirrored 15k SCSI array
> > (~32G), with a mirrored 15k SCSI array (~32G) for the WAL directory.
> > OS::: SLES 8.1
> > SOFTWARE::: PostgreSQL 8.0.4, Slony 1.2.6
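> >
> > (To be concrete about what I mean by "tweaks": the settings I would expect
> > to matter on an 8G box are roughly the ones below. The values are only an
> > illustrative starting point, written in 8.0's unit-less integer form, and
> > are not necessarily what we are actually running:)
> >
> >     shared_buffers = 50000            # 8kB pages, roughly 400MB
> >     work_mem = 16384                  # kB, 16MB per sort/hash
> >     maintenance_work_mem = 262144     # kB, 256MB for vacuums/index builds
> >     checkpoint_segments = 32
> >     effective_cache_size = 786432     # 8kB pages, roughly 6GB
> >     wal_buffers = 64                  # 8kB pages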
>
> Sadly, the master is now showing 4.2M rows in sl_log_1, while the slave is
> still at 1.08M... I'm just not sure what to do here, or where to look. I
> suppose today I'll be going through the drop + add node procedure. :((
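>
> (For anyone following along, the drop + re-add I have in mind is roughly the
> slonik sequence below. The cluster name matches ours, but the conninfo
> strings and the set id are placeholders, and the syntax is from memory, so
> check it against the 1.2 slonik reference before running anything:)
>
> cluster name = pl_replication;
> node 1 admin conninfo = 'host=slave-host dbname=ourdb user=slony';
> node 2 admin conninfo = 'host=master-host dbname=ourdb user=slony';
>
> drop node (id = 1, event node = 2);
>
> (...then, once node 1 has been cleaned out, re-create it and re-subscribe:)
>
> store node (id = 1, comment = 'slave node', event node = 2);
> store path (server = 2, client = 1, conninfo = 'host=master-host dbname=ourdb user=slony');
> store path (server = 1, client = 2, conninfo = 'host=slave-host dbname=ourdb user=slony');
> subscribe set (id = 1, provider = 2, receiver = 1, forward = no);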
This may have already been posted to the list or be common knowledge, but I
found this bit of info (printed directly to the terminal instead of to a
log) rather disturbing, and quite possibly the culprit:::
NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=29721
CONTEXT: SQL statement "SELECT "_pl_replication".cleanupNodelock()"
PL/pgSQL function "cleanupevent" line 77 at perform
NOTICE: Slony-I: cleanup stale sl_nodelock entry for pid=29723
CONTEXT: SQL statement "SELECT "_pl_replication".cleanupNodelock()"
PL/pgSQL function "cleanupevent" line 77 at perform
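
(For anyone who wants to check the same thing: the stale entries show up in
sl_nodelock itself, and can be compared against pg_stat_activity to see
whether the listed backends are still alive. Column names are from memory,
so treat this as a sketch:)

SLAVE=# select nl_nodeid, nl_conncnt, nl_backendpid
        from _pl_replication.sl_nodelock;

SLAVE=# select procpid, current_query from pg_stat_activity
        where procpid in (select nl_backendpid from _pl_replication.sl_nodelock);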
--
Best Regards,
Dan Falconer
"Head Geek",
AvSupport, Inc. (http://www.partslogistics.com)
_______________________________________________
Slony1-general mailing list
[email protected]
http://gborg.postgresql.org/mailman/listinfo/slony1-general