It replicated for 2 months with the firewall and anti-virus on. Just in case, I turned the firewall and anti-virus off, and it is still not replicating. I used Wireshark to examine the packets and did not see anything suspicious. Using pgAdmin, I am able to connect to the main server from the replicated server and vice versa, so the connection seems to be accepted.
When a connection cannot be established or is rejected, the slon log usually gives an error. In my case, it is just stuck on "INFO remoteWorkerThread_1: syncing set 1 with 59 table(s) from provider 1" with no errors. Actually, it gets stuck for several minutes and then does some cleanup operations. Below are the last few lines of the slon log. The machine's Windows locale is French, so the log mixed English and French; the French parts are translated here:

2016-01-31 19:44:24 Amér. du Sud occid. INFO   remoteWorkerThread_1: syncing set 1 with 59 table(s) from provider 1
NOTICE:  Slony-I: cleanup stale sl_nodelock entry for pid=5388
CONTEXT:  SQL statement « SELECT "_slony_Securithor2".cleanupNodelock() »
        PL/pgSQL function "_slony_Securithor2".cleanupevent(interval), line 82 at PERFORM
NOTICE:  Slony-I: cleanup stale sl_nodelock entry for pid=1176
CONTEXT:  SQL statement « SELECT "_slony_Securithor2".cleanupNodelock() »
        PL/pgSQL function "_slony_Securithor2".cleanupevent(interval), line 82 at PERFORM
NOTICE:  Slony-I: log switch to sl_log_1 complete - truncate sl_log_2
CONTEXT:  PL/pgSQL function "_slony_Securithor2".cleanupevent(interval), line 95 at assignment
2016-01-31 19:54:24 Amér. du Sud occid. INFO   cleanupThread: 0.062 seconds for cleanupEvent()
NOTICE:  Slony-I: Logswitch to sl_log_2 initiated
CONTEXT:  SQL statement « SELECT "_slony_Securithor2".logswitch_start() »
        PL/pgSQL function "_slony_Securithor2".cleanupevent(interval), line 97 at PERFORM
2016-01-31 20:04:25 Amér. du Sud occid. INFO   cleanupThread: 0.000 seconds for cleanupEvent()

What would cause no replication yet no errors in the logs? Thanks.

On Fri, Jan 29, 2016 at 9:50 AM, Jan Wieck <j...@wi3ck.info> wrote:

> On 01/28/2016 10:57 PM, Sung Hsin Lei wrote:
>
>> Hello guys,
>>
>> So I have this setup that has already stopped on me 3 times in the last
>> 6 months. Each time it would replicate properly for 2-3 months and then
>> it would just stop. It has been stopped since January 11, 2016.
>> The only way I can get replication back is to set everything up from
>> scratch. I'm wondering if anyone has an idea on the issue causing the
>> stoppage. I'm running 64-bit Slony 2.2.4.
>>
>> Currently, when I run slon on the replicated machine, I get the following:
>>
>> C:\Program Files\PostgreSQL\9.3\bin>slon slony_Securithor2 "dbname = Securithor2
>> user = slonyuser password = securiTHOR971 port = 6234"
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: slon version 2.2.4 starting up
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Integer option vac_frequency = 3
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Integer option log_level = 0
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Integer option sync_interval = 2000
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Integer option sync_interval_timeout = 10000
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Integer option sync_group_maxsize = 20
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Integer option quit_sync_provider = 0
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Integer option remote_listen_timeout = 300
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Integer option monitor_interval = 500
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Integer option explain_interval = 0
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Integer option tcp_keepalive_idle = 0
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Integer option tcp_keepalive_interval = 0
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Integer option tcp_keepalive_count = 0
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Integer option apply_cache_size = 100
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Boolean option log_pid = 0
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Boolean option log_timestamp = 1
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Boolean option tcp_keepalive = 1
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Boolean option monitor_threads = 1
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: Real option real_placeholder = 0.000000
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: String option cluster_name = slony_Securithor2
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: String option conn_info = dbname = Securithor2 user = slonyuser password = securiTHOR971 port = 6234
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: String option pid_file = [NULL]
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: String option log_timestamp_format = %Y-%m-%d %H:%M:%S %Z
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: String option archive_dir = [NULL]
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: String option sql_on_connection = [NULL]
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: String option lag_interval = [NULL]
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: String option command_on_logarchive = [NULL]
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: String option cleanup_interval = 10 minutes
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: local node id = 2
>> 2016-01-28 17:41:00 Amér. du Sud occid. INFO   main: main process started
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: launching sched_start_mainloop
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: loading current cluster configuration
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG storeNode: no_id=1 no_comment='Master Node'
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG storePath: pa_server=1 pa_client=2 pa_conninfo="dbname=Securithor2 host=192.168.1.50 user=slonyuser password = securiTHOR971 port = 6234" pa_connretry=10
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG storeSet: set_id=1 set_origin=1 set_comment='All tables and sequences'
>> 2016-01-28 17:41:00 Amér. du Sud occid. WARN   remoteWorker_wakeup: node 1 - no worker thread
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG storeSubscribe: sub_set=1 sub_provider=1 sub_forward='f'
>> 2016-01-28 17:41:00 Amér. du Sud occid. WARN   remoteWorker_wakeup: node 1 - no worker thread
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG enableSubscription: sub_set=1
>> 2016-01-28 17:41:00 Amér. du Sud occid. WARN   remoteWorker_wakeup: node 1 - no worker thread
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: last local event sequence = 5000462590
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG main: configuration complete - starting threads
>> 2016-01-28 17:41:00 Amér. du Sud occid. INFO   localListenThread: thread starts
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG version for "dbname = Securithor2 user = slonyuser password = securiTHOR971 port = 6234" is 90310
>> NOTICE:  Slony-I: cleanup stale sl_nodelock entry for pid=5188
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG enableNode: no_id=1
>> 2016-01-28 17:41:00 Amér. du Sud occid. INFO   remoteWorkerThread_1: thread starts
>> 2016-01-28 17:41:00 Amér. du Sud occid. INFO   remoteListenThread_1: thread starts
>> 2016-01-28 17:41:00 Amér. du Sud occid. INFO   main: running scheduler mainloop
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG cleanupThread: thread starts
>> 2016-01-28 17:41:00 Amér. du Sud occid. INFO   syncThread: thread starts
>> 2016-01-28 17:41:00 Amér. du Sud occid. INFO   monitorThread: thread starts
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG version for "dbname = Securithor2 user = slonyuser password = securiTHOR971 port = 6234" is 90310
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG remoteWorkerThread_1: update provider configuration
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG remoteWorkerThread_1: added active set 1 to provider 1
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG version for "dbname=Securithor2 host=192.168.1.50 user=slonyuser password = securiTHOR971 port = 6234" is 90306
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG version for "dbname = Securithor2 user = slonyuser password = securiTHOR971 port = 6234" is 90310
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG cleanupThread: bias = 60
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG version for "dbname = Securithor2 user = slonyuser password = securiTHOR971 port = 6234" is 90310
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG version for "dbname = Securithor2 user = slonyuser password = securiTHOR971 port = 6234" is 90310
>> 2016-01-28 17:41:00 Amér. du Sud occid. CONFIG version for "dbname=Securithor2 host=192.168.1.50 user=slonyuser password = securiTHOR971 port = 6234" is 90306
>> 2016-01-28 17:41:00 Amér. du Sud occid. INFO   remoteWorkerThread_1: syncing set 1 with 59 table(s) from provider 1
>>
>> It gets stuck at "syncing set 1 with 59 table(s) from provider 1" (the
>> last line) forever, with occasional messages that say something about
>> cleanup (cleanupThread, I think).
>>
>> Checking the postgres logs, I see lots of:
>>
>> 2016-01-28 17:33:07 AST LOG: n'a pas pu recevoir les données du client : unrecognized winsock error 10061
>>
>> which translates to:
>>
>> 2016-01-28 17:33:07 AST LOG: could not receive data from the client: unrecognized winsock error 10061
>>
>> I'm able to connect to the main db from the replicated machine with no
>> problem. I have no idea what causes this error 10061.
>>

> Winsock error 10061 is WSAECONNREFUSED:
>
>     Connection refused.
>     No connection could be made because the target computer actively
>     refused it. This usually results from trying to connect to a
>     service that is inactive on the foreign host—that is, one with no
>     server application running.
>
> This might be a firewall issue. Can you use some network sniffer to find
> out what is happening on the TCP/IP level between the two machines?
>
> Regards, Jan
>
> --
> Jan Wieck
> Senior Software Engineer
> http://slony.info
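[Editor's note: before reaching for a full packet capture, a quick TCP probe from the subscriber toward the provider can distinguish an active "connection refused" (winsock 10061 / WSAECONNREFUSED, an RST from the peer) from a silent firewall drop (timeout). A minimal sketch in Python; the host and port in the comment are the ones from the pa_conninfo in the logs above, not a prescription.]

```python
import socket

def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt a plain TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"        # three-way handshake completed
    except ConnectionRefusedError:
        return "refused"         # peer sent RST: this is what winsock reports as 10061
    except socket.timeout:
        return "timeout"         # no answer at all: typical of a silently dropping firewall
    except OSError as exc:
        return f"error: {exc}"   # unreachable network, name resolution failure, etc.

# Example (values taken from the slon logs above):
#   probe("192.168.1.50", 6234)
```

A "refused" result matches the 10061 entries in the PostgreSQL log and suggests nothing is listening on that port (or something is actively rejecting the connection), whereas "timeout" would point at dropped packets.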
_______________________________________________
Slony1-general mailing list
Slony1-general@lists.slony.info
http://lists.slony.info/mailman/listinfo/slony1-general