I'm having some trouble getting a slon node caught up on events. It's a
larger database, 350 or so gigs, and I added a node to a replication set;
while it was doing the initial sync, the server that the slon daemons
were running on died. It wasn't until about 5 hours later that we got
the daemons running on a different node, and it restarted (I assume it
restarted) the initial sync.
From what I can tell, it finished the initial sync; however, it's now
unable to catch up due to the following error line (reduced in size here --
I don't know how many elements there actually were, but the single line had
about 18 million characters):
2012-01-22 04:43:07 EST ERROR remoteWorkerThread_1: "declare LOG cursor
for select log_origin, log_txid, log_tableid, log_actionseq,
log_cmdtype, octet_length(log_cmddata), case when
octet_length(log_cmddata) <= 1024 then log_cmddata else null end from
"_myslonycluster".sl_log_1 where log_origin = 1 and log_tableid in
(2,3,4,5,6,7,1,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122)
and log_txid >= '34299501' and log_txid < '34311624' and
"pg_catalog".txid_visible_in_snapshot(log_txid, '34311624:34311624:')
and ( log_actionseq <> '2474682' and log_actionseq <> '2403310' and
log_actionseq <> '2427861' and
<SNIP, repeated many thousands of times with different numbers>
' and log_actionseq <> '2520797' and log_actionseq <> '2519348'
and log_actionseq <> '2485828' and log_actionseq <> '2523367' and
log_actionseq <> '2469096' and log_actionseq <> '2520589' and
log_actionseq <> '2414071' and log_actionseq <> '2391417' ) order by
log_actionseq" PGRES_FATAL_ERROR ERROR: stack depth limit exceeded
I found someone with a similar(ish) issue from a while back, where a
function called compress_actionseq was mentioned. I turned debugging up
to level 4 and can see that it is indeed compressing the actionseq, and
from reading the code it looks like the output above IS the compressed
sequence.
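For anyone wanting to reproduce the failure mode outside of slon: the
"stack depth limit exceeded" error comes from PostgreSQL's max_stack_depth
setting, which can be inspected and (cautiously) raised per session. This
is just a sketch of what I looked at; the 6MB value is an arbitrary example,
not a recommendation:

```sql
-- Show the current limit (PostgreSQL default is 2MB)
SHOW max_stack_depth;

-- Raise it for the current session only; the value must stay safely
-- below the OS stack limit (ulimit -s), or the server will refuse it
SET max_stack_depth = '6MB';
```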
Now, max_stack_depth seems to be a tricky setting to tweak on Postgres,
so I'd rather not touch it unless I have to. My thought instead was to
force Slony to do smaller syncs at a time. I tried reducing (and, for
the heck of it, increasing) the group size, desired_sync_time,
sync_max_rowsize, and sync_max_largemem; however, nothing changed the
size of the query being executed on the database.
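For reference, this is roughly what I was experimenting with in slon.conf
(the values shown are just the ones I tried, not recommendations -- the
option names are the slon run-time config equivalents of the -g, -t, -r,
and -l command-line switches):

```
# slon.conf fragment -- values are illustrative
sync_group_maxsize=1        # how many SYNC events to group per transaction
desired_sync_time=60000     # target sync duration in ms
sync_max_rowsize=1024       # rows larger than this are fetched separately
sync_max_largemem=5242880   # memory allowed for large-row processing
```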
Any thoughts or suggestions? The initial sync takes about 14 hours, so
I'd rather not drop the node and re-attach it. In fact, I have two nodes
with the same issue, stuck at the same event, so I'd rather just get
them both synced up without doing another initial sync.
I also toyed with the idea of forcing the slon daemon to sync only up to
a specific event, in hopes of doing blocks of, say, 500 events; however,
the quit_sync_finalsync parameter is not accepted correctly by Slony
2.1.0. (I've submitted an email to this list about that too.)
Thanks in advance,
- Brian F
_______________________________________________
Slony1-general mailing list
[email protected]
http://lists.slony.info/mailman/listinfo/slony1-general