On 3/26/2011 8:02 PM, Tim Lloyd wrote:
> It wasn't mission critical changes lost. Postgres log was full of messages
> saying it couldn't switch the log because it was already in progress. Uptime
> was showing load averages of 60. Checking sl_log_1 it only had 4 entries.
> Nuking it and re-initing the log switch reduced the load average to between 4
> and 12.
The Slony cleanup thread, up to 1.2, does delete the no-longer-needed sl_log_X entries. So the fact that you found something in there means you deleted data that probably had not yet replicated to all nodes. You personally may not care about a few lost updates, but for most of us, what you suggested doing is by itself a good reason to start rebuilding all replicas.

The actual reason a large backlog in sl_log_X causes problems is that the query plan for selecting from that log scans it from the beginning, no matter how far catch-up has already progressed. The startup cost of the log selection therefore keeps increasing until the entire sl_log_X has been processed, and all that time the log switch cannot and should not finish. We have a fix for that in the current 2.1 development tree and are considering backpatching that logic into 2.0.

Jan

--
Anyone who trades liberty for security deserves neither liberty nor security.
    -- Benjamin Franklin

_______________________________________________
Slony1-general mailing list
[email protected]
http://lists.slony.info/mailman/listinfo/slony1-general
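The cost blow-up described above can be shown with a toy model (this is an illustrative sketch, not Slony's actual code: the function names and batch sizes are made up). If each sync's log selection scans sl_log_X from the beginning, every already-replicated row is re-read on every pass, so total work grows quadratically with the backlog; a scan that resumes where the previous sync stopped reads each row once.

```python
# Toy model of draining a replication-log backlog in fixed-size batches.
# "Scan from start" mimics a plan whose startup cost grows with catch-up
# progress; "seek" mimics a plan that skips already-processed entries.

def rows_scanned_from_start(backlog: int, batch: int) -> int:
    """Total rows touched when every batch re-scans the log from entry 0."""
    total = 0
    done = 0
    while done < backlog:
        done += min(batch, backlog - done)
        total += done  # each pass reads everything up to its stopping point
    return total

def rows_scanned_with_seek(backlog: int, batch: int) -> int:
    """Total rows touched when each batch resumes where the last stopped."""
    return backlog  # every log row is read exactly once

print(rows_scanned_from_start(100_000, 1_000))  # 5,050,000 rows touched
print(rows_scanned_with_seek(100_000, 1_000))   # 100,000 rows touched
```

With a 100,000-row backlog drained in 1,000-row batches, the scan-from-start strategy touches 5,050,000 rows versus 100,000 for the seeking one — which is why a node that falls far behind gets slower and slower at catching up.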
