Hello,
This is a re-hash of an earlier mail that went unanswered. Reposting
probably isn't good decorum, but I'm at the end of my tether with
this problem. I'd appreciate any help you can give in fixing it,
because it's a real irritation: my production system is affected.
At this point, I'd welcome even wild speculation.
Replication ostensibly works fine. We replicate from a Windows master
(node 1), using Hiroshi Saito's Slony-I 2.0.2 binaries, to two openSUSE
slaves (nodes 2 and 3). It's all fairly standard.
When I restart a slave database (in the following example node 2),
replication continues to work (at least as far as can be immediately
observed), but sl_status shows:
 st_origin | st_received | st_last_event | st_last_event_ts          | st_last_received | st_last_received_ts          | st_last_received_event_ts | st_lag_num_events | st_lag_time
-----------+-------------+---------------+---------------------------+------------------+------------------------------+---------------------------+-------------------+----------------
         1 |           3 |         38689 | "2009-07-30 12:11:51.796" |            38688 | "2009-07-30 12:12:02.428316" | "2009-07-30 12:11:41.859" |                 1 | "00:00:14.015"
         1 |           2 |         38689 | "2009-07-30 12:11:51.796" |            38605 | "2009-07-30 11:52:35.119048" | "2009-07-30 11:58:05.734" |                84 | "00:13:50.14"
Node 2's st_lag_num_events grows and grows until the Slony-I service
(all slon daemons) is restarted on the master, at which point it
returns to zero. This is very annoying, because sl_status is how my
application monitors the state of the replication cluster, and when
it's broken it confuses users. Restarting the slon daemons does reset
the event lag to zero, but that's not an acceptable fix in a
production system.
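To make the monitoring side concrete, here is a minimal Python sketch of a lag check of this kind. It is an illustration only, not my application's code. The rows are copied from the sl_status output above; the names lagging_receivers, lag_matches_event_gap and MAX_LAG_EVENTS, and the threshold of 10, are made up for the example. It also checks that, for these two rows, st_lag_num_events equals st_last_event minus st_last_received.

```python
# Rows copied from the sl_status output above (only the columns used here).
SL_STATUS_ROWS = [
    {"st_origin": 1, "st_received": 3, "st_last_event": 38689,
     "st_last_received": 38688, "st_lag_num_events": 1},
    {"st_origin": 1, "st_received": 2, "st_last_event": 38689,
     "st_last_received": 38605, "st_lag_num_events": 84},
]

MAX_LAG_EVENTS = 10  # hypothetical alert threshold, chosen for this example


def lagging_receivers(rows, max_lag=MAX_LAG_EVENTS):
    """Return the receiver node ids whose reported event lag exceeds max_lag."""
    return [r["st_received"] for r in rows if r["st_lag_num_events"] > max_lag]


def lag_matches_event_gap(row):
    """True if the reported lag equals last event minus last received event."""
    gap = row["st_last_event"] - row["st_last_received"]
    return row["st_lag_num_events"] == gap


if __name__ == "__main__":
    print(lagging_receivers(SL_STATUS_ROWS))
    print(all(lag_matches_event_gap(r) for r in SL_STATUS_ROWS))
```

With the rows above, only node 2 is reported as lagging, even though data is still reaching it.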
Bear in mind, replication isn't broken at any point - only sl_status is.
When I run test_slony_state-dbi.pl on the master while the event lag
continues to grow, it outputs the following:
pe...@peter-development-machine:~/slony1-2.0.2/tools> ./test_slony_state-dbi.pl \
    --host=10.0.0.80 --database=lustre --cluster=lustre_cluster \
    --user=postgres --password=my_password
DSN: dbi:Pg:dbname=lustre;host=10.0.0.80;user=postgres;password=my_password;
===========================
Rummage for DSNs
=============================
Query:
select p.pa_server, p.pa_conninfo
from "_lustre_cluster".sl_path p
-- where exists (select * from "_lustre_cluster".sl_subscribe s where
-- (s.sub_provider = p.pa_server or s.sub_receiver = p.pa_server) and
-- sub_active = 't')
group by pa_server, pa_conninfo;
Tests for node 1 - DSN = dbi:Pg:dbname=lustre host=10.0.0.80 user=postgres password=my_password
========================================
pg_listener info:
Pages: 0
Tuples: 0
Size Tests
================================================
sl_log_1 0 0.000000
sl_log_2 0 0.000000
sl_seqlog 0 0.000000
Listen Path Analysis
===================================================
No problems found with sl_listen
--------------------------------------------------------------------------------
Summary of event info
Origin  Min SYNC  Max SYNC  Min SYNC Age  Max SYNC Age
================================================================================
     1     38605     38699      00:00:00      00:15:00  0
     2        20        20      01:08:00      01:08:00  1
     3        30        30      01:02:00      01:02:00  1
---------------------------------------------------------------------------------
Summary of sl_confirm aging
Origin  Receiver  Min SYNC  Max SYNC  Age of latest SYNC  Age of eldest SYNC
=================================================================================
     1         2     38605     38605            00:20:00            00:20:00  0
     1         3     38627     38698            00:00:00            00:11:00  0
     2         1        20        20            01:03:00            01:03:00  1
     2         3        20        20            01:02:00            01:02:00  1
     3         1        30        30            01:02:00            01:02:00  1
     3         2        30        30            01:08:00            01:08:00  1
------------------------------------------------------------------------------
Listing of old open connections on node 1
Database  PID  User  Query Age  Query
================================================================================
Tests for node 3 - DSN = dbi:Pg:dbname=lustre_slave host=10.0.0.82 user=postgres password=my_password
========================================
pg_listener info:
Pages: 0
Tuples: 0
Size Tests
================================================
sl_log_1 0 0.000000
sl_log_2 0 0.000000
sl_seqlog 0 0.000000
Listen Path Analysis
===================================================
No problems found with sl_listen
--------------------------------------------------------------------------------
Summary of event info
Origin  Min SYNC  Max SYNC  Min SYNC Age  Max SYNC Age
================================================================================
     1     38605     38699      00:00:00      00:15:00  0
     2        20        20      01:08:00      01:08:00  1
     3        30        30      01:02:00      01:02:00  1
---------------------------------------------------------------------------------
Summary of sl_confirm aging
Origin  Receiver  Min SYNC  Max SYNC  Age of latest SYNC  Age of eldest SYNC
=================================================================================
     1         2     38605     38605            00:21:00            00:21:00  0
     1         3     38629     38699            00:00:00            00:11:00  0
     2         1        20        20            01:03:00            01:03:00  1
     2         3        20        20            01:03:00            01:03:00  1
     3         1        30        30            01:03:00            01:03:00  1
     3         2        30        30            01:08:00            01:08:00  1
------------------------------------------------------------------------------
Listing of old open connections on node 3
Database  PID  User  Query Age  Query
================================================================================
Tests for node 2 - DSN = dbi:Pg:dbname=lustre_slave host=10.0.0.81 user=postgres password=my_password
========================================
pg_listener info:
Pages: 0
Tuples: 0
Size Tests
================================================
sl_log_1 0 0.000000
sl_log_2 0 0.000000
sl_seqlog 0 0.000000
Listen Path Analysis
===================================================
No problems found with sl_listen
--------------------------------------------------------------------------------
Summary of event info
Origin  Min SYNC  Max SYNC  Min SYNC Age  Max SYNC Age
================================================================================
     1     38573     38699     -00:05:00      00:15:00  0
     2        20        21      00:15:00      01:03:00  0
     3        30        30      00:57:00      00:57:00  1
---------------------------------------------------------------------------------
Summary of sl_confirm aging
Origin  Receiver  Min SYNC  Max SYNC  Age of latest SYNC  Age of eldest SYNC
=================================================================================
     1         2     38607     38699            00:00:00            00:15:00  0
     1         3     38573     38698           -00:05:00            00:15:00  0
     2         1        20        20            00:57:00            00:57:00  1
     2         3        20        20            00:57:00            00:57:00  1
     3         1        30        30            00:57:00            00:57:00  1
     3         2        30        30            01:02:00            01:02:00  1
------------------------------------------------------------------------------
Listing of old open connections on node 2
Database  PID  User  Query Age  Query
================================================================================
pe...@peter-development-machine:~/slony1-2.0.2/tools>
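One discrepancy visible in the output above is in the sl_confirm summaries: nodes 1 and 3 both report 38605 as the latest SYNC from origin 1 confirmed by node 2, while node 2 itself reports 38699. A minimal Python sketch that picks this out mechanically follows. The values are copied from the three "Summary of sl_confirm aging" tables; the names CONFIRM_MAX_SYNC and disagreements, and the tolerance of 5 events (to allow for the tables being read at slightly different moments), are hypothetical and chosen only for this illustration.

```python
# Max SYNC per (origin, receiver) as reported in each node's sl_confirm
# summary above. Outer key: the node the test script was querying.
CONFIRM_MAX_SYNC = {
    1: {(1, 2): 38605, (1, 3): 38698, (2, 1): 20, (2, 3): 20, (3, 1): 30, (3, 2): 30},
    3: {(1, 2): 38605, (1, 3): 38699, (2, 1): 20, (2, 3): 20, (3, 1): 30, (3, 2): 30},
    2: {(1, 2): 38699, (1, 3): 38698, (2, 1): 20, (2, 3): 20, (3, 1): 30, (3, 2): 30},
}


def disagreements(views, tolerance=5):
    """Return {(origin, receiver): {node: max_sync}} for pairs where the
    nodes' reported Max SYNC values differ by more than `tolerance` events."""
    result = {}
    pairs = set().union(*(view.keys() for view in views.values()))
    for pair in sorted(pairs):
        seen = {node: view[pair] for node, view in views.items() if pair in view}
        if max(seen.values()) - min(seen.values()) > tolerance:
            result[pair] = seen
    return result


if __name__ == "__main__":
    for pair, seen in disagreements(CONFIRM_MAX_SYNC).items():
        print(pair, seen)
```

On the figures above, the only pair flagged is origin 1, receiver 2, which is the same pair whose lag keeps growing in sl_status.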
Why is this happening?
Regards,
Peter Geoghegan