On Sun, Feb 16, 2014 at 9:46 AM, Jeff Frost <[email protected]> wrote:
> On Feb 15, 2014, at 11:25 PM, Tory M Blue <[email protected]> wrote:
>
> On Sat, Feb 15, 2014 at 10:48 PM, Jeff Frost <[email protected]> wrote:
>
>> It's probably a firewall timing out your PostgreSQL connection while the
>> indexes are being built on the replica.
>>
>> Look into tcp keep alive settings.
>>
> Yes this is what I thought it was when I first started with this, but
> didn't make any progress. Keepalives by default is set to 7200 seconds, so
> 2 hours, this is failing in an hour, so I'll have to look at the firewalls
> between us but since I'm connected to these boxes the entire time, from the
> same network that is originating the slon configuration, I'm doubting the
> firewalls are reaping the connections.
>
> Looking at the TCP keepalive settings, I don't think there is any tuning
> there that can help
>
> net.ipv4.tcp_keepalive_time = 7200
> net.ipv4.tcp_keepalive_probes = 9
> net.ipv4.tcp_keepalive_intvl = 75
>
> Well, maybe I can "reduce this" just to make some interesting traffic
> happen within that hour+ that the indexes are being created.
>
>
> Yah, so 2 hrs means that if your firewall times out in 10 minutes, it's
> going to kill that idle postgresql connection on you.
>
> This is common in AWS and here are the settings I use in slony 2.2 to fix
> this:
>
> # TCP keep alive configurations
> # Enable sending of TCP keep alive between slon and the PostgreSQL backends
> tcp_keepalive = true
>
> # The number of seconds after which a TCP keep alive is sent across an idle
> # connection. tcp_keepalive must be enabled for this to take effect.
> # Default value of 0 means use operating system default
> tcp_keepalive_idle = 5
>
> # The number of keep alive requests to the server that can be lost before
> # the connection is declared dead. tcp_keepalive must be on. Default value
> # of 0 means use operating system default
> tcp_keepalive_count = 10
>
> # The number of seconds in between TCP keep alive requests. tcp_keepalive
> # must be enabled. Default value of 0 means use operating system default
> tcp_keepalive_interval = 30
>
> That's probably more aggressive than you need, but it should do the trick.
>

Okay, so I mucked with the settings. Maybe I'm not understanding them quite
right, but I still get the same result. This time I at least caught the
disconnect in my source PostgreSQL logs.

2014-02-16 14:41:03 PST CONFIG remoteWorkerThread_1: Begin COPY of table "tracking"."spotlightimp"
NOTICE: truncate of "tracking"."spotlightimp" succeeded
2014-02-16 15:54:40 PST CONFIG remoteWorkerThread_1: 5618691807 bytes copied for table "tracking"."spotlightimp"

------------ORIGIN----------
2014-02-16 16:14:40 PST cls postgres 172.19.228.100(35508) 13796 2014-02-16 16:14:40.430 PSTLOG: could not receive data from client: Connection reset by peer
2014-02-16 16:14:40 PST cls postgres 172.19.228.100(35508) 13796 2014-02-16 16:14:40.430 PSTLOG: unexpected EOF on client connection with an open transaction
------------ORIGIN----------

As can be seen, the connection is reaped and slon/postgres continue on their
way. It's not until the next data copy is required that slon finds its
connection is no longer there. Why it can't recreate a connection, as it
would if I stopped and started slon, is kind of beyond me. I'm just not 100%
sure where the connection is being killed.

2014-02-16 16:40:46 PST CONFIG remoteWorkerThread_1: 7183.069 seconds to copy table "tracking"."spotlightimp"
2014-02-16 16:40:46 PST CONFIG remoteWorkerThread_1: copy table "tracking"."adimp"
2014-02-16 16:40:46 PST CONFIG remoteWorkerThread_1: Begin COPY of table "tracking"."adimp"
2014-02-16 16:40:46 PST ERROR remoteWorkerThread_1: "select "_cls".copyFields(19);"
2014-02-16 16:40:46 PST WARN remoteWorkerThread_1: data copy for set 2 failed 1 times - sleep 15 seconds
NOTICE: Slony-I: Logswitch to sl_log_2 initiated
CONTEXT: SQL statement "SELECT "_cls".logswitch_start()"
PL/pgSQL function _cls.cleanupevent(interval) line 96 at PERFORM
2014-02-16 16:40:49 PST INFO cleanupThread: 6541.365 seconds for cleanupEvent()

Am I doing this wrong? I figured that since I've seen connections with 15
minutes of processing complete fine, 30 minutes would be more than enough.
So: send the first "hey, are you still there?" at 15 minutes, then continue
with them every 5 minutes, for a count of 30. But the connection above seems
to have been reaped around the 20-minute mark.

net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_probes = 30
net.ipv4.tcp_keepalive_intvl = 300

Thanks for the assistance,
Tory
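
To sanity-check the numbers, here is a rough Python sketch of what the three sysctl values above would imply. It assumes the usual Linux keepalive behaviour: the first probe goes out after tcp_keepalive_time seconds of idleness, and unanswered probes repeat every tcp_keepalive_intvl seconds until tcp_keepalive_probes of them have been lost. The helper names are made up for illustration. The sketch also treats the "bytes copied" log line as the start of the idle period, which is a guess. Keep in mind that these kernel settings only apply to sockets that have SO_KEEPALIVE turned on.

```python
from datetime import datetime


def keepalive_schedule(idle_s, interval_s, probes):
    """Seconds of idleness at which each probe is sent if none is answered."""
    return [idle_s + i * interval_s for i in range(probes)]


def dead_peer_detection_s(idle_s, interval_s, probes):
    """Approximate seconds of silence before an unresponsive peer is dropped."""
    return idle_s + interval_s * probes


# Values taken from the sysctl settings in the message above.
IDLE_S, INTERVAL_S, PROBES = 600, 300, 30

schedule = keepalive_schedule(IDLE_S, INTERVAL_S, PROBES)
first_probe_min = schedule[0] / 60
detection_min = dead_peer_detection_s(IDLE_S, INTERVAL_S, PROBES) / 60

# Timestamps copied from the slon and origin logs in the message above.
# Assumption: the idle period begins when slon reports the bytes copied.
fmt = "%Y-%m-%d %H:%M:%S"
copy_done = datetime.strptime("2014-02-16 15:54:40", fmt)
reset_seen = datetime.strptime("2014-02-16 16:14:40", fmt)
gap_s = (reset_seen - copy_done).total_seconds()

print(f"first probe after {first_probe_min:.0f} min of idleness")
print(f"dead peer declared after roughly {detection_min:.0f} min")
print(f"gap between end of COPY and reset: {gap_s:.0f} s")
print(f"gap falls on a probe time: {gap_s in schedule}")
```

Under those assumptions, 600 seconds puts the first probe at 10 minutes rather than 15. The reset in the origin log also falls exactly 1200 seconds after the copy finished, which lines up with a probe time. That may be a coincidence, but it could mean something along the path answered a probe with a reset rather than silently dropping the connection.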
_______________________________________________
Slony1-general mailing list
[email protected]
http://lists.slony.info/mailman/listinfo/slony1-general
