Follow up:

I executed this on the master:
mydatabase=# select * from _slony.sl_event where ev_origin not in (select no_id from _slony.sl_node);
 ev_origin |  ev_seqno  |         ev_timestamp          |    ev_snapshot     | ev_type | ev_data1 | ev_data2 | ev_data3 | ev_data4 | ev_data5 | ev_data6 | ev_data7 | ev_data8
-----------+------------+-------------------------------+--------------------+---------+----------+----------+----------+----------+----------+----------+----------+----------
         3 | 5000290161 | 2012-09-27 09:48:03.749424-04 | 40580084:40580084: | SYNC    |          |          |          |          |          |          |          |
(1 row)

There is a row in sl_event that shouldn't be there, because it references a
node that no longer exists. I need to add this node back to replication, but I
don't want to run into the same issue as before. I ran cleanupEvent('10 minute')
and it did nothing (I even tried it with 0 minutes).
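For reference, the call I ran looked roughly like this (the _slony schema name
comes from the query above; the exact signature may differ between Slony
versions):

-- Ask Slony to trim already-confirmed rows out of sl_event / sl_confirm.
-- _slony is our cluster schema; '10 minutes' is the interval I tried.
SELECT _slony.cleanupEvent('10 minutes'::interval);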

Will this row eventually go away? Will it cause issues if we attempt to add a
new node to replication with node ID = 3? How can I safely clean this up?

thanks,
- Brian F

On 09/27/2012 01:28 PM, Brian Fehrle wrote:
On 09/27/2012 01:26 PM, Jan Wieck wrote:
On 9/27/2012 2:34 PM, Brian Fehrle wrote:
Hi all,

PostgreSQL v 9.1.5 - 9.1.6
Slony version 2.1.0

I'm having an issue that has occurred twice now. I have a 4-node Slony
cluster, and one of our operations is to drop a node from replication,
do maintenance on it, then add it back to replication.

Node 1 = master
Node 2 = slave
Node 3 = slave  ->  dropped then re-added
Node 4 = slave
First, why is the node actually dropped and re-added so fast, instead
of just doing the maintenance while it falls behind and then letting it
catch up?

We have several cases where it makes sense, such as re-installing the OS
or, as in today's case, replacing the physical machine with a new one.

You apparently have a full-blown path network from everyone to
everyone. This is not good under normal circumstances, since the
automatic listen generation will cause every node to listen on every
other node for events from non-origins. That is way too many useless
database connections.
   From my understanding, without this setup all events must instead be
passed through the master node, which relays them. So with master node = 1
and slaves = 2 and 3, node 3 must communicate with node 2, and without
direct access it will relay through the master. Is this understanding wrong?
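For concreteness, here is roughly what I picture an explicit (non-mesh)
configuration looking like in slonik; the cluster name, host names, conninfo
strings, and the specific listen rows are just my guesses, not our actual
setup:

# Sketch only, not our real config. Usual preamble:
CLUSTER NAME = mycluster;
NODE 1 ADMIN CONNINFO = 'host=node1 dbname=mydatabase user=slony';
NODE 2 ADMIN CONNINFO = 'host=node2 dbname=mydatabase user=slony';

# Paths only where a connection is actually needed.
STORE PATH (SERVER = 1, CLIENT = 2, CONNINFO = 'host=node1 dbname=mydatabase user=slony');
STORE PATH (SERVER = 2, CLIENT = 1, CONNINFO = 'host=node2 dbname=mydatabase user=slony');

# Explicit listen: node 2 gets node 1's events directly from node 1.
STORE LISTEN (ORIGIN = 1, PROVIDER = 1, RECEIVER = 2);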

What seems to happen here is a race condition. The node is dropped and,
when it is added back, some third node still hasn't processed the DROP
NODE; when node 4 then looks for events from node 3, it finds old ones
somewhere else (like on node 1 or 2). When node 3 later comes around to
reusing those event IDs, you get the dupkey error.
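A quick sanity check before re-adding the node is to look at sl_node on
every remaining node; node 3 has to be gone from all of them (sketch only,
schema name taken from your query):

-- Run on each remaining node; a surviving row for node 3 means the
-- DROP NODE has not been fully processed there yet.
SELECT no_id, no_active, no_comment FROM _slony.sl_node ORDER BY no_id;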

What you could do, if you really need to drop/re-add it, is use an explicit
WAIT FOR EVENT after the DROP NODE to make sure all traces of that node
are gone from the whole cluster.
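In slonik terms that would be roughly the following (node IDs from your
layout; the usual cluster name / admin conninfo preamble is omitted):

# Drop node 3 via node 1, then block until every remaining node has
# confirmed the event before doing anything else with node 3.
DROP NODE (ID = 3, EVENT NODE = 1);
WAIT FOR EVENT (ORIGIN = 1, CONFIRMED = ALL, WAIT ON = 1);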

Ok, I'll look into implementing that. Another thought was to issue a
cleanupEvent() on each of the nodes still attached to replication after
I do the dump.

Thanks
- Brian F
Jan

_______________________________________________
Slony1-general mailing list
[email protected]
http://lists.slony.info/mailman/listinfo/slony1-general
