Victoria Parsons wrote:
>
> Hi,
>
> I am running a 3-node system: one master, and two slaves talking
> directly to the master. There are in fact many separate clusters in my
> organisation, some running with 1 slave, some with up to 10 slaves,
> and I have found the same problem on them all.
>
> The sl_event table is growing forever on my slave nodes. It is cleaned
> out after a slon daemon restart for that node, on the first run of
> cleanupEvent() roughly ten minutes after starting. Thereafter
> cleanupEvent runs every ten minutes and reports no errors, but never
> clears out old events. I traced the problem back to the max(seq_no)
> not getting updated in sl_confirm, i.e. no new confirms arriving for
> certain (node origin, node received) pairs.
>
> I thought I had found the solution yesterday. On creation I only set
> up paths from master to each slave and slave back to master, i.e. no
> cross-slave paths. I created the missing slave-to-slave paths
> yesterday, and during the first cleanupEvent after that most old
> events were purged. However, since then the event table keeps growing.
>
> If I look in pg_listener on each node, the nodes with the oldest
> running slon daemons have the most entries, with fewer for newer slon
> daemons. I know the pg_listener entries are created when a slon daemon
> starts, so I guess the older-running ones are missing some listen
> entries and that is why I am missing confirm notifies. I am a bit
> stuck now, though:
>
> 1. My theory about missing pg_listener entries must be wrong, as there
> is no way you can start every node after every other one.
>
That is definitely not the root of the problem.
>
> 2. Restarting a slon daemon updates the confirm table with newer
> confirms, so the first cleanup works. What is special about what
> happens on start-up to fill this table that doesn't happen during
> normal running?
>
> Maybe I just have something misconfigured somewhere. All events are
> replicating fine to all nodes. It is only the missing confirms, and
> hence the growing event tables, that are causing me problems.
>
Are you on Slony-I 1.1, or 1.2?  The way that listen paths are computed
changed fundamentally in 1.2, so if you're observing the problem on
1.1, then I'd guess that is what is biting you.

The problem, loosely, is that event confirmations aren't making it back
to the subscriber nodes, so they decide not to clean things out.
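
You can see which confirmations have stalled by comparing sl_event with
sl_confirm on an affected node.  A rough sketch of the sort of query I
mean (substitute your actual cluster schema for "_mycluster"):

    select e.ev_origin, c.con_received,
           max(e.ev_seqno)  as latest_event,
           max(c.con_seqno) as latest_confirm
      from "_mycluster".sl_event e
      left join "_mycluster".sl_confirm c
             on c.con_origin = e.ev_origin
     group by e.ev_origin, c.con_received
     order by e.ev_origin, c.con_received;

Any (origin, received) pair where latest_confirm sits far behind
latest_event, or is missing altogether, is what is holding cleanupEvent
back on that node.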

Assuming you're on 1.1, I think you'd see it fixed if you "completed"
the network by adding STORE PATH requests indicating paths between the
subscriber nodes.  This shouldn't add much overhead, as there aren't
*that* many nodes involved, and there are relatively few events
generated by the subscribers.
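
For your 3-node clusters that would look roughly like the slonik
fragment below; the cluster name, node ids and conninfo strings are
placeholders, and clusters with more slaves need a pair of paths for
each pair of subscribers:

    cluster name = mycluster;
    node 1 admin conninfo = 'dbname=mydb host=master user=slony';
    node 2 admin conninfo = 'dbname=mydb host=slave1 user=slony';
    node 3 admin conninfo = 'dbname=mydb host=slave2 user=slony';

    # paths between the two subscribers, one in each direction
    store path (server = 2, client = 3,
                conninfo = 'dbname=mydb host=slave1 user=slony');
    store path (server = 3, client = 2,
                conninfo = 'dbname=mydb host=slave2 user=slony');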

And that should add some robustness to the overall cluster network;
consider that if your origin node falls over and you have to move the
"master" role to one of those subscribers, you would have to submit
those two STORE PATH requests anyway in order for replication to
continue.
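
If that day ever comes, having the paths already in place means the
switchover itself is just the usual slonik incantation; something along
these lines, with illustrative set and node ids:

    lock set (id = 1, origin = 1);
    move set (id = 1, old origin = 1, new origin = 2);
    wait for event (origin = 1, confirmed = 2, wait on = 1);

or a FAILOVER command instead of the LOCK SET / MOVE SET pair if node 1
is already dead.
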
_______________________________________________
Slony1-general mailing list
[email protected]
http://gborg.postgresql.org/mailman/listinfo/slony1-general