On 02/02/2016 08:06 AM, Tignor, Tom wrote:
>
>     I did drop one of my replicas several weeks ago, though I didn’t
> recreate the node. I do have automation to do exactly that, however, if a
> replica becomes defective somehow. Seems I’ll need to consider the point
> of the two log switches. Is that important even if the dropped node isn’t
> a provider for anybody?

Yes, it is important, because even a non-forwarding leaf node is still
producing events that propagate to all other nodes. You might want to check
sl_event on all nodes to see if there are remnants of dropped and re-created
nodes there. These would be sl_event rows with an ev_seqno in the future of
what that node is currently producing.
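A query along these lines makes such remnants easy to spot (a rough sketch,
assuming the _ams_cluster schema used throughout this thread; run it on each
node):

    -- Per origin, the range of event sequence numbers still sitting in
    -- sl_event on this node.  A block of ev_seqno values far ahead of what
    -- an origin is currently generating would be leftovers from a node that
    -- was dropped and re-created under the same ID.
    SELECT ev_origin,
           min(ev_seqno) AS oldest_seqno,
           max(ev_seqno) AS newest_seqno,
           count(*)      AS events
      FROM _ams_cluster.sl_event
     GROUP BY ev_origin
     ORDER BY ev_origin;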
Regards, Jan

>     Tom :-)
>
> On 2/2/16, 12:06 AM, "Jan Wieck" <j...@wi3ck.info> wrote:
>
>> On 02/01/2016 01:24 PM, Tignor, Tom wrote:
>>>
>>>     Quick update: a couple hours after deleting entries from both sl_log
>>> tables with txids > 630M, it appears the cleanup thread has taken care
>>> of business. sl_log_1 is down from 54GB to 24KB.
>>
>> I still wonder how that happened in the first place.
>>
>> Was there a sequence of dropping and re-creating a node recently? I think
>> I have seen cases like this, where a node is re-created with the same
>> node ID before the cleanup of the data belonging to the dropped node had
>> happened everywhere. The "cleanup" I am talking about is basically 2
>> complete log switches on all nodes after the DROP NODE had replicated
>> everywhere. That takes at least 20-30 minutes and can in some cases take
>> hours.
>>
>> Regards, Jan
>>
>>>
>>>     Tom :-)
>>>
>>> On 2/1/16, 9:06 AM, "Tignor, Tom" <ttig...@akamai.com> wrote:
>>>
>>>>
>>>>     Jan,
>>>>     Thanks much for all the help. I’ve been looking over logswitch_finish
>>>> and the event and changelog tables. Selecting logswitch_finish() on one
>>>> of my replicas simply returned "Slony-I: log switch to sl_log_2 still
>>>> in progress - sl_log_1 not truncated". All three seem to be in that
>>>> state with either sl_log_1 or sl_log_2. Looking closer at sl_event and
>>>> sl_log_2 I’m seeing some strangeness.
>>>>
>>>> ams=# select min(pg_catalog.txid_snapshot_xmin(ev_snapshot)),
>>>> max(pg_catalog.txid_snapshot_xmin(ev_snapshot)) from
>>>> _ams_cluster.sl_event where ev_origin = 1;
>>>>     min    |    max
>>>>  -----------+-----------
>>>>   139136948 | 139204299
>>>> (1 row)
>>>>
>>>> ams=# select min(log_txid), max(log_txid) from _ams_cluster.sl_log_2
>>>> where log_origin = 1;
>>>>     min    |    max
>>>>  -----------+-----------
>>>>   631532717 | 631661386
>>>> (1 row)
>>>>
>>>> ams=#
>>>>
>>>>     So I understand all the txids referenced in sl_event are in the 139M
>>>> range while all those in sl_log_2 are in the 631M range. Normally, the
>>>> sl_log txids should be older, shouldn’t they? Do you think I’ve hit a
>>>> txid-wraparound problem?
>>>>
>>>>     Tom :-)
>>>>
>>>> On 1/28/16, 1:09 PM, "Jan Wieck" <j...@wi3ck.info> wrote:
>>>>
>>>>> On 01/28/2016 11:11 AM, Tignor, Tom wrote:
>>>>>>
>>>>>>     Output below. They seem to be replicating normally, except for
>>>>>> the sl_log growth.
>>>>>
>>>>> Indeed. Is there anything in the slon logs for those nodes that says
>>>>> why it doesn't finish the log switch?
>>>>>
>>>>> Connect to the database as the slony user.
>>>>>
>>>>> To check if a log switch is indeed in progress, do
>>>>>
>>>>>     SELECT last_value FROM _ams_cluster.sl_log_status;
>>>>>
>>>>> It should be either 2 or 3. If it is 0 or 1, no log switch is in
>>>>> progress and you can start one with
>>>>>
>>>>>     SELECT _ams_cluster.logswitch_start();
>>>>>
>>>>> If it is 2 or 3, then you can do
>>>>>
>>>>>     SELECT _ams_cluster.logswitch_finish();
>>>>>
>>>>> All these operations are harmless and will only do what is safely
>>>>> possible. Look at the code of logswitch_finish() to find out how it
>>>>> determines whether the current log switch can be finished. In short,
>>>>> the cleanup thread removes events from sl_event that have been
>>>>> confirmed by all nodes in the cluster, and logswitch_finish() checks
>>>>> whether anything that belonged to the old log is still left in
>>>>> sl_event. If so, it will not finish the switch. Running those queries
>>>>> manually, you can find out which event is preventing the switch from
>>>>> finishing.
>>>>>
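For reference, a check along these lines shows how far confirmations have
progressed (a rough sketch against the _ams_cluster schema used in this
thread, not the exact test logswitch_finish() performs):

    -- For each origin, the highest event sequence number that every
    -- receiving node has confirmed.  Events at or below this floor have
    -- been seen everywhere and can eventually be cleaned out of sl_event;
    -- a receiver whose confirmations stop advancing keeps the floor low,
    -- which holds back the cleanup and therefore the log switch.
    SELECT con_origin,
           min(max_seqno) AS confirmed_by_all_receivers
      FROM (SELECT con_origin, con_received, max(con_seqno) AS max_seqno
              FROM _ams_cluster.sl_confirm
             GROUP BY con_origin, con_received) AS per_receiver
     GROUP BY con_origin
     ORDER BY con_origin;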
>>>>>>
>>>>>> a...@ams-repl2.ams.netmgmt:~$ /a/third-party/postgresql/bin/psql -U ams_slony -d ams -c 'select * from _ams_cluster.sl_status'
>>>>>>  st_origin | st_received | st_last_event | st_last_event_ts              | st_last_received | st_last_received_ts           | st_last_received_event_ts     | st_lag_num_events | st_lag_time
>>>>>> -----------+-------------+---------------+-------------------------------+------------------+-------------------------------+-------------------------------+-------------------+-----------------
>>>>>>          2 |           1 |    5000611610 | 2016-01-28 16:06:37.343826+00 |       5000611610 | 2016-01-28 16:06:38.843562+00 | 2016-01-28 16:06:37.343826+00 |                 0 | 00:00:09.201996
>>>>>>          2 |           3 |    5000611610 | 2016-01-28 16:06:37.343826+00 |       5000611609 | 2016-01-28 16:06:29.851545+00 | 2016-01-28 16:06:27.341894+00 |                 1 | 00:00:19.203928
>>>>>>          2 |           4 |    5000611610 | 2016-01-28 16:06:37.343826+00 |       5000611610 | 2016-01-28 16:06:38.710974+00 | 2016-01-28 16:06:37.343826+00 |                 0 | 00:00:09.201996
>>>>>> (3 rows)
>>>>>>
>>>>>> a...@ams-repl3.lga.netmgmt:~$ /a/third-party/postgresql/bin/psql -U ams_slony -d ams -c 'select * from _ams_cluster.sl_status'
>>>>>>  st_origin | st_received | st_last_event | st_last_event_ts              | st_last_received | st_last_received_ts           | st_last_received_event_ts     | st_lag_num_events | st_lag_time
>>>>>> -----------+-------------+---------------+-------------------------------+------------------+-------------------------------+-------------------------------+-------------------+-----------------
>>>>>>          3 |           4 |    5000654642 | 2016-01-28 16:07:05.493455+00 |       5000654642 | 2016-01-28 16:07:06.486539+00 | 2016-01-28 16:07:05.493455+00 |                 0 | 00:00:08.522529
>>>>>>          3 |           1 |    5000654642 | 2016-01-28 16:07:05.493455+00 |       5000654642 | 2016-01-28 16:07:08.040292+00 | 2016-01-28 16:07:05.493455+00 |                 0 | 00:00:08.522529
>>>>>>          3 |           2 |    5000654642 | 2016-01-28 16:07:05.493455+00 |       5000654642 | 2016-01-28 16:07:08.472049+00 | 2016-01-28 16:07:05.493455+00 |                 0 | 00:00:08.522529
>>>>>> (3 rows)
>>>>>>
>>>>>> a...@ams-repl4.blr.netmgmt:~$ /a/third-party/postgresql/bin/psql -U ams_slony -d ams -c 'select * from _ams_cluster.sl_status'
>>>>>>  st_origin | st_received | st_last_event | st_last_event_ts              | st_last_received | st_last_received_ts           | st_last_received_event_ts     | st_lag_num_events | st_lag_time
>>>>>> -----------+-------------+---------------+-------------------------------+------------------+-------------------------------+-------------------------------+-------------------+-----------------
>>>>>>          4 |           3 |    5000637483 | 2016-01-28 16:07:32.698809+00 |       5000637482 | 2016-01-28 16:07:28.731404+00 | 2016-01-28 16:07:22.695826+00 |                 1 | 00:00:19.077657
>>>>>>          4 |           1 |    5000637483 | 2016-01-28 16:07:32.698809+00 |       5000637482 | 2016-01-28 16:07:24.839978+00 | 2016-01-28 16:07:22.695826+00 |                 1 | 00:00:19.077657
>>>>>>          4 |           2 |    5000637483 | 2016-01-28 16:07:32.698809+00 |       5000637482 | 2016-01-28 16:07:22.926411+00 | 2016-01-28 16:07:22.695826+00 |                 1 | 00:00:19.077657
>>>>>> (3 rows)
>>>>>>
>>>>>>     Tom :-)
>>>>>>
>>>>>> On 1/28/16, 10:38 AM, "Jan Wieck" <j...@wi3ck.info> wrote:
>>>>>>
>>>>>>> On 01/28/2016 08:30 AM, Tignor, Tom wrote:
>>>>>>>>
>>>>>>>>     Hello slony folks,
>>>>>>>>     From my reading I'm guessing (hoping) this isn't a new problem. I
>>>>>>>> have a simple cluster with one provider replicating to three
>>>>>>>> subscribers. The provider's changelog tables (sl_log_[1|2]) are fine,
>>>>>>>> but the subscribers (with forwarding enabled) are all showing runaway
>>>>>>>> growth. Looked through the FAQ and I don't see the node I dropped or
>>>>>>>> any idle transactions as viable culprits. Are there other thoughts on
>>>>>>>> the cause? Can I safely manually delete/truncate some/all of the
>>>>>>>> changelog tables? These replicas are all leaf nodes. I only have
>>>>>>>> forwarding turned on to allow for failover, and my replication rate
>>>>>>>> is the 2 sec default.
>>>>>>>>     Thanks in advance for any insights.
>>>>>>>
>>>>>>> What is the output of the sl_status view "on those leaf nodes"?
>>>>>>>
>>>>>>>>
>>>>>>>> ams=# select
>>>>>>>> pg_size_pretty(pg_total_relation_size('_ams_cluster.sl_log_1'));
>>>>>>>>  pg_size_pretty
>>>>>>>> ----------------
>>>>>>>>  75 MB
>>>>>>>> (1 row)
>>>>>>>>
>>>>>>>> ams=# select
>>>>>>>> pg_size_pretty(pg_total_relation_size('_ams_cluster.sl_log_2'));
>>>>>>>>  pg_size_pretty
>>>>>>>> ----------------
>>>>>>>>  34 GB
>>>>>>>> (1 row)
>>>>>>>>
>>>>>>>> ams=# select * from _ams_cluster.sl_confirm where con_origin not in
>>>>>>>> (select no_id from _ams_cluster.sl_node) or con_received not in
>>>>>>>> (select no_id from _ams_cluster.sl_node);
>>>>>>>>  con_origin | con_received | con_seqno | con_timestamp
>>>>>>>> ------------+--------------+-----------+---------------
>>>>>>>> (0 rows)
>>>>>>>>
>>>>>>>> ams=# select * from pg_stat_activity where current_query like '%IDLE%';
>>>>>>>>  datid | datname | procpid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | waiting | current_query
>>>>>>>> -------+---------+---------+----------+---------+------------------+-------------+-----------------+-------------+---------------+------------+-------------+---------+---------------
>>>>>>>>  16393 | ams | 2611 | 212995 | ams_viewer | | 88.221.209.10 | | 43328 | 2016-01-28 12:24:49.706389+00 | | 2016-01-28 13:18:02.427848+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 12894 | 212995 | ams_viewer | | 88.221.209.10 | | 60112 | 2016-01-28 12:47:26.230681+00 | | 2016-01-28 13:15:27.744242+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 12884 | 212995 | ams_viewer | | 88.221.209.10 | | 44302 | 2016-01-28 12:47:25.100006+00 | | 2016-01-28 13:15:27.936059+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 23466 | 213867 | ams_slony | psql | | | -1 | 2016-01-28 13:11:32.030343+00 | 2016-01-28 13:18:37.283992+00 | 2016-01-28 13:18:37.283992+00 | f | select * from pg_stat_activity where current_query like '%IDLE%';
>>>>>>>>  16393 | ams | 6719 | 213867 | ams_slony | slon.origin_2_provider_2 | 60.254.150.133 | | 61806 | 2016-01-22 01:59:14.800129+00 | | 2016-01-28 13:18:25.935111+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 6718 | 213867 | ams_slony | slon.origin_3_provider_2 | 60.254.150.133 | | 61805 | 2016-01-22 01:59:14.797655+00 | | 2016-01-28 13:18:34.304475+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 5505 | 213867 | ams_slony | slon.origin_4_provider_2 | 80.67.75.105 | | 36477 | 2016-01-22 01:56:25.637046+00 | | 2016-01-28 13:18:36.1348+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 5504 | 213867 | ams_slony | slon.origin_3_provider_2 | 72.246.50.22 | | 51813 | 2016-01-22 01:56:25.240798+00 | | 2016-01-28 13:18:28.961629+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 5487 | 213867 | ams_slony | slon.origin_4_provider_2 | 72.246.50.22 | | 51803 | 2016-01-22 01:56:22.896388+00 | | 2016-01-28 13:18:35.858913+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 5047 | 213867 | ams_slony | slon.origin_2_provider_2 | 72.246.50.22 | | 51564 | 2016-01-22 01:55:23.600296+00 | | 2016-01-28 13:18:34.487192+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 5041 | 213867 | ams_slony | slon.origin_2_provider_2 | 80.67.75.105 | | 36402 | 2016-01-22 01:55:22.964462+00 | | 2016-01-28 13:18:34.519066+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 6694 | 213867 | ams_slony | slon.node_2_listen | 60.254.150.133 | | 61795 | 2016-01-22 01:59:12.095052+00 | | 2016-01-28 13:18:27.928384+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 4456 | 213867 | ams_slony | slon.node_2_listen | 72.246.50.22 | | 51238 | 2016-01-22 01:54:21.481355+00 | | 2016-01-28 13:18:36.766973+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 4457 | 213867 | ams_slony | slon.node_2_listen | 80.67.75.105 | | 36333 | 2016-01-22 01:54:21.500456+00 | | 2016-01-28 13:18:36.204482+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 4428 | 213867 | ams_slony | slon.local_monitor | | | -1 | 2016-01-22 01:54:18.977015+00 | | 2016-01-28 13:18:36.652567+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 4427 | 213867 | ams_slony | slon.local_sync | | | -1 | 2016-01-22 01:54:18.976932+00 | | 2016-01-28 13:18:36.151998+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 4426 | 213867 | ams_slony | slon.local_cleanup | | | -1 | 2016-01-22 01:54:18.976842+00 | | 2016-01-28 13:12:12.582921+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 4425 | 213867 | ams_slony | slon.remoteWorkerThread_4 | | | -1 | 2016-01-22 01:54:18.976783+00 | | 2016-01-28 13:18:33.99715+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 4420 | 213867 | ams_slony | slon.remoteWorkerThread_1 | | | -1 | 2016-01-22 01:54:18.976548+00 | | 2016-01-28 13:18:33.561531+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 4419 | 213867 | ams_slony | slon.remoteWorkerThread_3 | | | -1 | 2016-01-22 01:54:18.97647+00 | | 2016-01-28 13:18:34.808907+00 | f | <IDLE>
>>>>>>>>  16393 | ams | 4413 | 213867 | ams_slony | slon.local_listen | | | -1 | 2016-01-22 01:54:18.965568+00 | | 2016-01-28 13:18:37.096159+00 | f | <IDLE>
>>>>>>>> (21 rows)
>>>>>>>>
>>>>>>>> ams=#
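As an aside, on this PostgreSQL release the '%IDLE%' filter above also matches
plain idle backends; a narrower check for long-open transactions, which are
the ones that can pin old sl_log rows, might look like this rough sketch
(using the pre-9.2 pg_stat_activity column names shown above):

    -- Sessions sitting "idle in transaction": their open snapshots can
    -- prevent Slony's cleanup from truncating the old sl_log table.
    SELECT procpid, usename, client_addr, xact_start, current_query
      FROM pg_stat_activity
     WHERE current_query = '<IDLE> in transaction'
     ORDER BY xact_start;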
>>>>>>>>
>>>>>>>>     Tom :-)
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Slony1-general mailing list
>>>>>>>> Slony1-general@lists.slony.info
>>>>>>>> http://lists.slony.info/mailman/listinfo/slony1-general
>>>>>>>
>>>>>>> --
>>>>>>> Jan Wieck
>>>>>>> Senior Software Engineer
>>>>>>> http://slony.info
>>>>>
>>>>> --
>>>>> Jan Wieck
>>>>> Senior Software Engineer
>>>>> http://slony.info
>>
>> --
>> Jan Wieck
>> Senior Software Engineer
>> http://slony.info

-- 
Jan Wieck
Senior Software Engineer
http://slony.info

_______________________________________________
Slony1-general mailing list
Slony1-general@lists.slony.info
http://lists.slony.info/mailman/listinfo/slony1-general