Re: [Slony1-general] Slony-I: log switch to sl_log_2 still in progress - sl_log_1 not truncated

Jan Wieck Tue, 02 Feb 2016 06:26:46 -0800

On 02/02/2016 08:06 AM, Tignor, Tom wrote:
> 
>       I did drop one of my replicas several weeks ago, though I didn’t 
> recreate
> the node. I do have automation to do exactly that, however, if a replica
> becomes defective somehow. Seems I’ll need to consider the point of the
> two log switches. Is that important even if the dropped node isn’t a
> provider for anybody?


Yes, it is important because even a non-forwarding leaf node is still
producing events that propagate to all other nodes.

You might want to check sl_even on all nodes if there are remnants of
dropped and re-created nodes there. This would be sl_event rows with an
ev_seqno in the future of what that node is currently producing.


Regards, Jan

> 
>       Tom    :-)
> 
> 
> On 2/2/16, 12:06 AM, "Jan Wieck" <j...@wi3ck.info> wrote:
> 
>>On 02/01/2016 01:24 PM, Tignor, Tom wrote:
>>>
>>>     Quick update: a couple hours after deleting entries from both sl_log
>>> tables with txids > 630M, it appears the cleanup thread has taken care
>>>of
>>> business. sl_log_1 is down from 54GB to 24KB.
>>
>>I still wonder how that happened in the first place.
>>
>>Was there a sequence of dropping and re-creating a node recently? I
>>think I have seen cases like this where if a node is re-created with the
>>same node ID before the cleanup of data, belonging to the dropped node,
>>had happened everywhere. The "cleanup" I am talking about is basically 2
>>complete log switches on all nodes after the DROP NODE had replicated
>>everywhere. That takes at least 20-30 minutes and can in some cases take
>>hours.
>>
>>
>>Regards, Jan
>>
>>
>>>
>>>     Tom    :-)
>>>
>>>
>>> On 2/1/16, 9:06 AM, "Tignor, Tom" <ttig...@akamai.com> wrote:
>>>
>>>>
>>>>    Jan,
>>>>    Thanks much for all the help. I’ve been looking over logswitch finish
>>>>and
>>>>the event and changelog tables. Selecting logswitch_finish() on one of
>>>>my
>>>>replicas simply returned ""Slony-I: log switch to sl_log_2 still in
>>>>progress - sl_log_1 not truncated” All three seem to be in that state
>>>>with
>>>>either sl_log_1 or sl_log_2. Looking closer at sl_event and sl_log_2 I’m
>>>>seeing some strangeness.
>>>>
>>>>ams=# select min(pg_catalog.txid_snapshot_xmin(ev_snapshot)),
>>>>max(pg_catalog.txid_snapshot_xmin(ev_snapshot)) from
>>>>_ams_cluster.sl_event
>>>>where ev_origin = 1;
>>>>    min    |    max
>>>>-----------+-----------
>>>> 139136948 | 139204299
>>>>(1 row)
>>>>
>>>>
>>>>ams=# select min(log_txid), max(log_txid) from _ams_cluster.sl_log_2
>>>>where
>>>>log_origin = 1;
>>>>    min    |    max
>>>>-----------+-----------
>>>> 631532717 | 631661386
>>>>(1 row)
>>>>
>>>>
>>>>ams=#
>>>>
>>>>
>>>>    So I understand all the txids referenced in sl_event are in the 139M
>>>>range while all those in sl_log_2 are in the 631M range. Normally, the
>>>>sl_log txids should be older, shouldn’t they? Do you think I’ve hit a
>>>>txid-wraparound problem?
>>>>
>>>>    Tom    :-)
>>>>
>>>>
>>>>
>>>>On 1/28/16, 1:09 PM, "Jan Wieck" <j...@wi3ck.info> wrote:
>>>>
>>>>>On 01/28/2016 11:11 AM, Tignor, Tom wrote:
>>>>>>
>>>>>>  Output below. They seem to be replicating normally, except for the
>>>>>>sl_log
>>>>>> growth.
>>>>>
>>>>>Indeed. Is there anything in the slon logs for those nodes that says
>>>>>why
>>>>>it doesn't finish the log switch?
>>>>>
>>>>>Connect to the database as a the slony user.
>>>>>
>>>>>To check if a log switch is indeed in progress, do
>>>>>
>>>>>     SELECT last_value FROM _ams_cluster.sl_log_status;
>>>>>
>>>>>It should be either 2 or 3. If it is 0 or 1, no log switch is in
>>>>>progress and you can start one with
>>>>>
>>>>>     SELECCT _ams_cluster.logswitch_start();
>>>>>
>>>>>If it is 2 or 3, then you can do
>>>>>
>>>>>     SELECT _ams_cluster.logswitch_finish();
>>>>>
>>>>>All these operations are harmless and will only do what is safely
>>>>>possible. Look at the code of logswitch_finish() to find out how it
>>>>>determines if the current log switch can be finished. In short, the
>>>>>cleanup thread is removing events from sl_event that have been
>>>>>confirmed
>>>>>by all nodes in the cluster. The function logswitch_finish() looks if
>>>>>there is anything left in sl_event, that belonged to that old log. If
>>>>>so
>>>>>it will not finish. Running those queries manually you can find out
>>>>>what
>>>>>that event is that is preventing the switch to finish.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> a...@ams-repl2.ams.netmgmt:~$ /a/third-party/postgresql/bin/psql -U
>>>>>> ams_slony -d ams -c 'select * from _ams_cluster.sl_status'
>>>>>>   st_origin | st_received | st_last_event |       st_last_event_ts
>>>>>>  |
>>>>>> st_last_received |      st_last_received_ts      |
>>>>>> st_last_received_event_ts   | st_lag_num_events |   st_lag_time
>>>>>>
>>>>>>-----------+-------------+---------------+----------------------------
>>>>>>--
>>>>>>-
>>>>>>+-
>>>>>>
>>>>>>-----------------+-------------------------------+--------------------
>>>>>>--
>>>>>>-
>>>>>>--
>>>>>> ------+-------------------+-----------------
>>>>>>           2 |           1 |    5000611610 | 2016-01-28
>>>>>>16:06:37.343826+00 |
>>>>>>        5000611610 | 2016-01-28 16:06:38.843562+00 | 2016-01-28
>>>>>> 16:06:37.343826+00 |                 0 | 00:00:09.201996
>>>>>>           2 |           3 |    5000611610 | 2016-01-28
>>>>>>16:06:37.343826+00 |
>>>>>>        5000611609 | 2016-01-28 16:06:29.851545+00 | 2016-01-28
>>>>>> 16:06:27.341894+00 |                 1 | 00:00:19.203928
>>>>>>           2 |           4 |    5000611610 | 2016-01-28
>>>>>>16:06:37.343826+00 |
>>>>>>        5000611610 | 2016-01-28 16:06:38.710974+00 | 2016-01-28
>>>>>> 16:06:37.343826+00 |                 0 | 00:00:09.201996
>>>>>> (3 rows)
>>>>>>
>>>>>>
>>>>>> a...@ams-repl3.lga.netmgmt:~$  /a/third-party/postgresql/bin/psql -U
>>>>>> ams_slony -d ams -c 'select * from _ams_cluster.sl_status'
>>>>>>   st_origin | st_received | st_last_event |       st_last_event_ts
>>>>>>  |
>>>>>> st_last_received |      st_last_received_ts      |
>>>>>> st_last_received_event_ts   | st_lag_num_events |   st_lag_time
>>>>>>
>>>>>>-----------+-------------+---------------+----------------------------
>>>>>>--
>>>>>>-
>>>>>>+-
>>>>>>
>>>>>>-----------------+-------------------------------+--------------------
>>>>>>--
>>>>>>-
>>>>>>--
>>>>>> ------+-------------------+-----------------
>>>>>>           3 |           4 |    5000654642 | 2016-01-28
>>>>>>16:07:05.493455+00 |
>>>>>>        5000654642 | 2016-01-28 16:07:06.486539+00 | 2016-01-28
>>>>>> 16:07:05.493455+00 |                 0 | 00:00:08.522529
>>>>>>           3 |           1 |    5000654642 | 2016-01-28
>>>>>>16:07:05.493455+00 |
>>>>>>        5000654642 | 2016-01-28 16:07:08.040292+00 | 2016-01-28
>>>>>> 16:07:05.493455+00 |                 0 | 00:00:08.522529
>>>>>>           3 |           2 |    5000654642 | 2016-01-28
>>>>>>16:07:05.493455+00 |
>>>>>>        5000654642 | 2016-01-28 16:07:08.472049+00 | 2016-01-28
>>>>>> 16:07:05.493455+00 |                 0 | 00:00:08.522529
>>>>>> (3 rows)
>>>>>>
>>>>>>
>>>>>> a...@ams-repl4.blr.netmgmt:~$  /a/third-party/postgresql/bin/psql -U
>>>>>> ams_slony -d ams -c 'select * from _ams_cluster.sl_status'
>>>>>>   st_origin | st_received | st_last_event |       st_last_event_ts
>>>>>>  |
>>>>>> st_last_received |      st_last_received_ts      |
>>>>>> st_last_received_event_ts   | st_lag_num_events |   st_lag_time
>>>>>>
>>>>>>-----------+-------------+---------------+----------------------------
>>>>>>--
>>>>>>-
>>>>>>+-
>>>>>>
>>>>>>-----------------+-------------------------------+--------------------
>>>>>>--
>>>>>>-
>>>>>>--
>>>>>> ------+-------------------+-----------------
>>>>>>           4 |           3 |    5000637483 | 2016-01-28
>>>>>>16:07:32.698809+00 |
>>>>>>        5000637482 | 2016-01-28 16:07:28.731404+00 | 2016-01-28
>>>>>> 16:07:22.695826+00 |                 1 | 00:00:19.077657
>>>>>>           4 |           1 |    5000637483 | 2016-01-28
>>>>>>16:07:32.698809+00 |
>>>>>>        5000637482 | 2016-01-28 16:07:24.839978+00 | 2016-01-28
>>>>>> 16:07:22.695826+00 |                 1 | 00:00:19.077657
>>>>>>           4 |           2 |    5000637483 | 2016-01-28
>>>>>>16:07:32.698809+00 |
>>>>>>        5000637482 | 2016-01-28 16:07:22.926411+00 | 2016-01-28
>>>>>> 16:07:22.695826+00 |                 1 | 00:00:19.077657
>>>>>> (3 rows)
>>>>>>
>>>>>>
>>>>>>
>>>>>>  Tom    :-)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 1/28/16, 10:38 AM, "Jan Wieck" <j...@wi3ck.info> wrote:
>>>>>>
>>>>>>>On 01/28/2016 08:30 AM, Tignor, Tom wrote:
>>>>>>>>
>>>>>>>> Hello slony folks,
>>>>>>>>  From my reading I¹m guessing (hoping) this isn¹t a new problem. I
>>>>>>>>have
>>>>>>>> a simple cluster with one provider replicating to three
>>>>>>>>subscribers.
>>>>>>>>The
>>>>>>>> provider¹s changelog tables (sl_log_[1|2]) are fine, but the
>>>>>>>>subscribers
>>>>>>>> (with forwarding enabled) are all showing runaway growth. Looked
>>>>>>>>through
>>>>>>>> the FAQ and I don¹t see the node I dropped or any idle transactions
>>>>>>>>as
>>>>>>>> viable culprits. Are there other thoughts on the cause? Can I
>>>>>>>>safely
>>>>>>>> manually delete/truncate some/all of the changelog tables? These
>>>>>>>> replicas are all leaf nodes. I only have forwarding turned on to
>>>>>>>>allow
>>>>>>>> for failover, and my replication rate is the 2 sec default.
>>>>>>>> Thanks in advance for any insights.
>>>>>>>
>>>>>>>What is the output of the sl_status view "on those leaf nodes"?
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> ams=# select
>>>>>>>> pg_size_pretty(pg_total_relation_size('_ams_cluster.sl_log_1'));
>>>>>>>>
>>>>>>>>   pg_size_pretty
>>>>>>>>
>>>>>>>> ----------------
>>>>>>>>
>>>>>>>>   75 MB
>>>>>>>>
>>>>>>>> (1 row)
>>>>>>>>
>>>>>>>>
>>>>>>>> ams=# select
>>>>>>>> pg_size_pretty(pg_total_relation_size('_ams_cluster.sl_log_2'));
>>>>>>>>
>>>>>>>>   pg_size_pretty
>>>>>>>>
>>>>>>>> ----------------
>>>>>>>>
>>>>>>>>   34 GB
>>>>>>>>
>>>>>>>> (1 row)
>>>>>>>>
>>>>>>>>
>>>>>>>> ams=# select * from _ams_cluster.sl_confirm where con_origin not in
>>>>>>>> (select no_id from _ams_cluster.sl_node) or con_received not in
>>>>>>>>(select
>>>>>>>> no_id from _ams_cluster.sl_node);
>>>>>>>>
>>>>>>>>   con_origin | con_received | con_seqno | con_timestamp
>>>>>>>>
>>>>>>>> ------------+--------------+-----------+---------------
>>>>>>>>
>>>>>>>> (0 rows)
>>>>>>>>
>>>>>>>>
>>>>>>>> ams=# select * from pg_stat_activity where current_query like
>>>>>>>>'%IDLE%';
>>>>>>>>
>>>>>>>>   datid | datname | procpid | usesysid |  usename   |
>>>>>>>> application_name      |  client_addr   | client_hostname |
>>>>>>>>client_port |
>>>>>>>>          backend_start         |          xact_start           |
>>>>>>>>    query_start          | waiting |
>>>>>>>>
>>>>>>>>                          current_query
>>>>>>>>
>>>>>>>>
>>>>>>>>-------+---------+---------+----------+------------+----------------
>>>>>>>>--
>>>>>>>>-
>>>>>>>>--
>>>>>>>>------+----------------+-----------------+-------------+------------
>>>>>>>>--
>>>>>>>>-
>>>>>>>>--
>>>>>>>>--------------+-------------------------------+---------------------
>>>>>>>>--
>>>>>>>>-
>>>>>>>>--
>>>>>>>>-----+---------+---
>>>>>>>>
>>>>>>>> ----------------------------------------------------------------
>>>>>>>>
>>>>>>>>   16393 | ams     |    2611 |   212995 | ams_viewer |
>>>>>>>>          | 88.221.209.10  |                 |       43328 |
>>>>>>>>2016-01-28
>>>>>>>> 12:24:49.706389+00 |                               | 2016-01-28
>>>>>>>> 13:18:02.427848+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |   12894 |   212995 | ams_viewer |
>>>>>>>>          | 88.221.209.10  |                 |       60112 |
>>>>>>>>2016-01-28
>>>>>>>> 12:47:26.230681+00 |                               | 2016-01-28
>>>>>>>> 13:15:27.744242+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |   12884 |   212995 | ams_viewer |
>>>>>>>>          | 88.221.209.10  |                 |       44302 |
>>>>>>>>2016-01-28
>>>>>>>> 12:47:25.100006+00 |                               | 2016-01-28
>>>>>>>> 13:15:27.936059+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |   23466 |   213867 | ams_slony  | psql
>>>>>>>>          |                |                 |          -1 |
>>>>>>>>2016-01-28
>>>>>>>> 13:11:32.030343+00 | 2016-01-28 13:18:37.283992+00 | 2016-01-28
>>>>>>>> 13:18:37.283992+00 | f       | se
>>>>>>>>
>>>>>>>> lect * from pg_stat_activity where current_query like '%IDLE%';
>>>>>>>>
>>>>>>>>   16393 | ams     |    6719 |   213867 | ams_slony  |
>>>>>>>> slon.origin_2_provider_2  | 60.254.150.133 |                 |
>>>>>>>> 61806 | 2016-01-22 01:59:14.800129+00 |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:18:25.935111+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    6718 |   213867 | ams_slony  |
>>>>>>>> slon.origin_3_provider_2  | 60.254.150.133 |                 |
>>>>>>>> 61805 | 2016-01-22 01:59:14.797655+00 |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:18:34.304475+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    5505 |   213867 | ams_slony  |
>>>>>>>> slon.origin_4_provider_2  | 80.67.75.105   |                 |
>>>>>>>> 36477 | 2016-01-22 01:56:25.637046+00 |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:18:36.1348+00   | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    5504 |   213867 | ams_slony  |
>>>>>>>> slon.origin_3_provider_2  | 72.246.50.22   |                 |
>>>>>>>> 51813 | 2016-01-22 01:56:25.240798+00 |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:18:28.961629+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    5487 |   213867 | ams_slony  |
>>>>>>>> slon.origin_4_provider_2  | 72.246.50.22   |                 |
>>>>>>>> 51803 | 2016-01-22 01:56:22.896388+00 |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:18:35.858913+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    5047 |   213867 | ams_slony  |
>>>>>>>> slon.origin_2_provider_2  | 72.246.50.22   |                 |
>>>>>>>> 51564 | 2016-01-22 01:55:23.600296+00 |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:18:34.487192+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    5041 |   213867 | ams_slony  |
>>>>>>>> slon.origin_2_provider_2  | 80.67.75.105   |                 |
>>>>>>>> 36402 | 2016-01-22 01:55:22.964462+00 |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:18:34.519066+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    6694 |   213867 | ams_slony  |
>>>>>>>> slon.node_2_listen        | 60.254.150.133 |                 |
>>>>>>>> 61795 | 2016-01-22 01:59:12.095052+00 |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:18:27.928384+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    4456 |   213867 | ams_slony  |
>>>>>>>> slon.node_2_listen        | 72.246.50.22   |                 |
>>>>>>>> 51238 | 2016-01-22 01:54:21.481355+00 |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:18:36.766973+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    4457 |   213867 | ams_slony  |
>>>>>>>> slon.node_2_listen        | 80.67.75.105   |                 |
>>>>>>>> 36333 | 2016-01-22 01:54:21.500456+00 |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:18:36.204482+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    4428 |   213867 | ams_slony  |
>>>>>>>> slon.local_monitor        |                |                 |
>>>>>>>> -1 | 2016-01-22 01:54:18.977015+00 |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:18:36.652567+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    4427 |   213867 | ams_slony  |
>>>>>>>>slon.local_sync
>>>>>>>>          |                |                 |          -1 |
>>>>>>>>2016-01-22
>>>>>>>> 01:54:18.976932+00 |                               | 2016-01-28
>>>>>>>> 13:18:36.151998+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    4426 |   213867 | ams_slony  |
>>>>>>>> slon.local_cleanup        |                |                 |
>>>>>>>> -1 | 2016-01-22 01:54:18.976842+00 |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:12:12.582921+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    4425 |   213867 | ams_slony  |
>>>>>>>> slon.remoteWorkerThread_4 |                |                 |
>>>>>>>> -1 | 2016-01-22 01:54:18.976783+00 |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:18:33.99715+00  | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    4420 |   213867 | ams_slony  |
>>>>>>>> slon.remoteWorkerThread_1 |                |                 |
>>>>>>>> -1 | 2016-01-22 01:54:18.976548+00 |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:18:33.561531+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    4419 |   213867 | ams_slony  |
>>>>>>>> slon.remoteWorkerThread_3 |                |                 |
>>>>>>>> -1 | 2016-01-22 01:54:18.97647+00  |
>>>>>>>>|
>>>>>>>> 2016-01-28 13:18:34.808907+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>>   16393 | ams     |    4413 |   213867 | ams_slony  |
>>>>>>>>slon.local_listen
>>>>>>>>          |                |                 |          -1 |
>>>>>>>>2016-01-22
>>>>>>>> 01:54:18.965568+00 |                               | 2016-01-28
>>>>>>>> 13:18:37.096159+00 | f       | <I
>>>>>>>>
>>>>>>>> DLE>
>>>>>>>>
>>>>>>>> (21 rows)
>>>>>>>>
>>>>>>>>
>>>>>>>> ams=#
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Tom    :-)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Slony1-general mailing list
>>>>>>>> Slony1-general@lists.slony.info
>>>>>>>> http://lists.slony.info/mailman/listinfo/slony1-general
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>--
>>>>>>>Jan Wieck
>>>>>>>Senior Software Engineer
>>>>>>>http://slony.info
>>>>>>
>>>>>
>>>>>
>>>>>--
>>>>>Jan Wieck
>>>>>Senior Software Engineer
>>>>>http://slony.info
>>>>
>>>
>>
>>
>>--
>>Jan Wieck
>>Senior Software Engineer
>>http://slony.info
> 


-- 
Jan Wieck
Senior Software Engineer
http://slony.info
_______________________________________________
Slony1-general mailing list
Slony1-general@lists.slony.info
http://lists.slony.info/mailman/listinfo/slony1-general

Re: [Slony1-general] Slony-I: log switch to sl_log_2 still in progress - sl_log_1 not truncated

Reply via email to