We ran into a problem where the postmaster restarted all database
connections, which broke replication.  A segfault in one of the
backends running a custom C aggregate caused the postmaster to restart
all of the other backend processes.  This broke all of the existing
connections to the database.  The node1 slon died.  For some reason the
slon_watchdog.pl script also died, but that is a different problem.
The slave slon daemons both reconnected the remoteListenThread
connection, but the remoteWorkerThread connection for transferring
data was unused at the time and was left broken.

We didn't notice the node1 slon was down until a few hours later.  I
started the node1 slon daemon, which began inserting SYNC events.  The
slave slon daemons then started processing the SYNC events and trying
to transfer data.  Since the data connection was still broken, the
slave slon daemons started failing.  I noticed the errors and restarted
all the slon daemons, which fixed the problem.

Shouldn't the slon daemons reconnect if the remoteWorkerThread
connection goes down?  Even dying and being restarted would be better
than continuously failing in a loop.  We are using 1.1.0 with most of
the 1.1.1 patches.  Has this problem been fixed in 1.1.5?
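
To make the question concrete, here is a minimal sketch in Python of the
behavior being asked for.  It is not Slony's code and says nothing about
how slon is actually implemented: FakeConnection, Worker, sync and the
reconnects counter are all hypothetical names invented for illustration.
The only point is that a failed attempt discards the dead connection
handle instead of reusing it on every retry.

```python
class FakeConnection:
    """Hypothetical stand-in for a database connection (not a Slony or libpq type)."""

    def __init__(self):
        self.alive = True

    def execute(self, query):
        # A dead connection fails the same way on every call.
        if not self.alive:
            raise ConnectionError("server closed the connection unexpectedly")
        return "ok: " + query


class Worker:
    """Hypothetical worker thread that owns one data connection."""

    def __init__(self, connect):
        self._connect = connect      # factory that returns a fresh connection
        self.conn = connect()
        self.reconnects = 0

    def sync(self, query):
        """Run one attempt; on a connection error, reconnect and retry once."""
        try:
            return self.conn.execute(query)
        except ConnectionError:
            # Drop the broken handle rather than keeping it for the next attempt.
            self.conn = self._connect()
            self.reconnects += 1
            return self.conn.execute(query)


worker = Worker(FakeConnection)
worker.conn.alive = False            # simulate the postmaster restart
result = worker.sync("start transaction")
```

Without the except branch, every later call to sync would raise the same
error, which is the loop shown in the logs below.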

Appended are the relevant portions of the logs from one of the slave nodes.

  - Ian


2006-04-18 14:39:36 PDT ERROR  remoteListenThread_1: "select
ev_origin, ev_seqno, ev_timestamp,        ev_minxid, ev_maxxid,
ev_xip,        ev_type,        ev_data1, ev_data2,        ev_data3,
ev_data4,        ev_data5, ev_data6,        ev_data7, ev_data8 from
"_vodslony".sl_event e where (e.ev_origin = '3' and e.ev_seqno >
'89248') or (e.ev_origin = '1' and e.ev_seqno > '384494') order by
e.ev_origin, e.ev_seqno" - server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2006-04-18 14:39:46 PDT ERROR  slon_connectdb:
PQconnectdb("host=voddb1 dbname=vodlive user=slony port=5432") failed
- FATAL:  the database system is starting up
2006-04-18 14:39:46 PDT WARN   remoteListenThread_1: DB connection
failed - sleep 10 seconds
2006-04-18 14:39:56 PDT DEBUG1 remoteListenThread_1: connected to
'host=voddb1 dbname=vodlive user=slony port=5432'

2006-04-18 23:20:00 PDT DEBUG2 remoteWorkerThread_1: SYNC 384524 processing
2006-04-18 23:20:00 PDT DEBUG2 remoteWorkerThread_1: syncing set 99999
with 1 table(s) from provider 1
2006-04-18 23:20:00 PDT DEBUG2 remoteWorkerThread_1: syncing set 1
with 129 table(s) from provider 1
2006-04-18 23:20:00 PDT ERROR  remoteWorkerThread_1: "start
transaction; set enable_seqscan = off; set enable_indexscan
= on; " PGRES_FATAL_ERROR server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2006-04-18 23:20:00 PDT ERROR  remoteWorkerThread_1: "close LOG; "
PGRES_FATAL_ERROR
2006-04-18 23:20:00 PDT ERROR  remoteWorkerThread_1: "rollback
transaction; set enable_seqscan = default; set enable_indexscan =
default; " PGRES_FATAL_ERROR
2006-04-18 23:20:00 PDT DEBUG2 remoteHelperThread_1_1: 31251.591
seconds until close cursor
2006-04-18 23:20:00 PDT ERROR  remoteWorkerThread_1: helper 1 finished
with error
2006-04-18 23:20:00 PDT ERROR  remoteWorkerThread_1: SYNC aborted

2006-04-18 23:20:10 PDT DEBUG2 remoteWorkerThread_1: SYNC 384524 processing
2006-04-18 23:20:10 PDT DEBUG2 remoteWorkerThread_1: syncing set 99999
with 1 table(s) from provider 1
2006-04-18 23:20:10 PDT DEBUG2 remoteWorkerThread_1: syncing set 1
with 129 table(s) from provider 1
2006-04-18 23:20:10 PDT ERROR  remoteWorkerThread_1: "start
transaction; set enable_seqscan = off; set enable_indexscan
= on; " PGRES_FATAL_ERROR
2006-04-18 23:20:10 PDT ERROR  remoteWorkerThread_1: "close LOG; "
PGRES_FATAL_ERROR
2006-04-18 23:20:10 PDT ERROR  remoteWorkerThread_1: "rollback
transaction; set enable_seqscan = default; set enable_indexscan =
default; " PGRES_FATAL_ERROR
2006-04-18 23:20:10 PDT DEBUG2 remoteHelperThread_1_1: 31261.620
seconds until close cursor
2006-04-18 23:20:10 PDT ERROR  remoteWorkerThread_1: helper 1 finished
with error
2006-04-18 23:20:10 PDT ERROR  remoteWorkerThread_1: SYNC aborted
_______________________________________________
Slony1-general mailing list
[email protected]
http://gborg.postgresql.org/mailman/listinfo/slony1-general
