On 07.10.2019 at 19:17, Christopher Browne wrote:
> On Mon, 7 Oct 2019 at 11:50, Klaus Darilion <klaus.mailingli...@pernau.at>
> wrote:
> 
>> Hello!
>>
>> We use Slony 2.1.4 and will be stuck on this version for some more time.
>>
>> Today I debugged an issue where the log switching did not finish.
>> Although it would be safe (in my opinion) to truncate the old log
>> table, logswitch_finish() fails with:
>>
>>   could not lock sl_log_2 - sl_log_2 not truncated
>>
>> The function tries to lock the sl_log table with:
>>
>>  begin;
>>  lock table "_regdnscluster".sl_log_2 in access exclusive mode nowait;
>>
>> The problem seems to be that the table is so hot for reads (55 slaves)
>> that the lock hardly ever succeeds.
>>
>> If I call logswitch_finish() manually (because the cleanup thread tries
>> only every 10 minutes - hard coded), I need to call it approximately
>> 100 times before I get the lock.
>>
>> Is there a reason to use "nowait"? As far as I understand, it should be
>> safe to wait some time before giving up, i.e.:
>>
>> SET lock_timeout TO '10s';
>> begin;
>> lock table "_regdnscluster".sl_log_2 in access exclusive mode;
>>
>>
>> This way, log switching can happen more often.
>>
> 
> set lock_timeout was introduced in PostgreSQL 9.3, so it isn't available in
> "all versions."
> 
> When it was introduced, we wouldn't have been keen on directly adopting
> it due to that factor, especially given that one of the major use cases
> for Slony is as a way of upgrading from elderly versions of PostgreSQL.
> 
> It surely seems like a reasonable idea to attempt to use it now, for the
> reasons you suggest.
> 
> The way I'd be inclined to adopt it would be as follows...
> 
> -  slon should check to see if lock_timeout is available (looking in
> pg_catalog.pg_settings where name='lock_timeout')
>    - this is probably mostly localized to the function logswitch_finish()
> -  if not, then stay with the current NOWAIT behaviour
> -  if available, then set lock_timeout and attempt without NOWAIT
>    - open question: what's the apropos lock_timeout time?  Probably
> configurable, perhaps that points to a new config parameter to add to
> confoptions.h and confoptions.c

This makes sense. Another option would be to use the cleanup_interval as
the lock timeout, but then we would need to measure how long the cleanup
took and reduce the sleep() in the cleanup_thread by the time spent
waiting for the lock.
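
For illustration, a minimal plpgsql sketch of the detection-plus-timeout
logic described above (just a sketch, not actual Slony source; the '10s'
value stands in for the proposed config parameter):

  DO $$
  DECLARE
      have_lock_timeout boolean;
  BEGIN
      -- lock_timeout exists as a GUC from PostgreSQL 9.3 onwards.
      SELECT EXISTS (
          SELECT 1
            FROM pg_catalog.pg_settings
           WHERE name = 'lock_timeout'
      ) INTO have_lock_timeout;

      IF have_lock_timeout THEN
          -- Wait up to 10s for the lock instead of failing immediately.
          -- On expiry this raises SQLSTATE 55P03, which slon could treat
          -- the same way as the current NOWAIT failure.
          SET LOCAL lock_timeout TO '10s';
          LOCK TABLE "_regdnscluster".sl_log_2 IN ACCESS EXCLUSIVE MODE;
      ELSE
          -- Pre-9.3: keep the current fail-fast behaviour.
          LOCK TABLE "_regdnscluster".sl_log_2
              IN ACCESS EXCLUSIVE MODE NOWAIT;
      END IF;
      -- The truncate of sl_log_2 would happen here, while the lock is
      -- still held (it is released when the transaction ends).
  END
  $$;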

> It doesn't seem like either an enormous change or an awfully risky one.
> 
> I'd love to see a patch to review for this ;-)

Honestly, I do not want to put too much effort into Slony anymore. Slony
did a great job for us, but now, after 7 years of using Slony in
production, the DB is getting bigger and bigger and we continuously add
slaves. We now quite often have problems when a slave is offline (or has
slow connectivity). If this happens over a weekend (like this one), the
events queue up; we had around 12 million rows in the sl_log table. Due
to the above problem (the lock not being acquired), the logswitch never
finished and we ended up with a load of 90 on our DB, slowing everything
down.

I think the logswitch can be improved by upgrading to Slony 2.2 (as the
cleanup_thread loop is no longer hard coded to 10 minutes), but I am
afraid to upgrade due to the increased number of DB connections [1].

Hence, our plan is to move to logical replication, which we hope puts
less load on the DB than Slony does.
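
For anyone considering the same move, the built-in logical replication
setup (PostgreSQL 10+) is roughly the following; the publication and
subscription names and the connection string are placeholders, not our
actual configuration:

  -- On the provider (requires wal_level = logical):
  CREATE PUBLICATION regdns_pub FOR ALL TABLES;

  -- On each subscriber:
  CREATE SUBSCRIPTION regdns_sub
      CONNECTION 'host=master.example.com dbname=regdns user=repl'
      PUBLICATION regdns_pub;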

regards
Klaus

[1] http://lists.slony.info/pipermail/slony1-general/2019-March/013458.html
