On Sun, Feb 16, 2014 at 9:46 AM, Jeff Frost <[email protected]> wrote:

> On Feb 15, 2014, at 11:25 PM, Tory M Blue <[email protected]> wrote:
>
>
>
>
> On Sat, Feb 15, 2014 at 10:48 PM, Jeff Frost <[email protected]> wrote:
>
>> It's probably a firewall timing out your PostgreSQL connection while the
>> indexes are being built on the replica.
>>
>> Look into tcp keep alive settings.
>>
>>
> Yes, this is what I thought it was when I first started with this, but I
> didn't make any progress. Keepalive by default is set to 7200 seconds (2
> hours), and this is failing in an hour, so I'll have to look at the
> firewalls between us. But since I'm connected to these boxes the entire
> time, from the same network that is originating the slon configuration, I
> doubt the firewalls are reaping the connections.
>
> Looking at the TCP keepalive settings, I don't think there is any tuning
> there that can help:
>
> net.ipv4.tcp_keepalive_time = 7200
> net.ipv4.tcp_keepalive_probes = 9
> net.ipv4.tcp_keepalive_intvl = 75
>
> Well, maybe I can "reduce this" just to make some interesting traffic
> happen within that hour-plus while the indexes are being created.
>
>
> Yah, so 2 hrs means that if your firewall times out in 10 minutes, it's
> going to kill that idle postgresql connection on you.
>
> This is common in AWS and here are the settings I use in slony 2.2 to fix
> this:
>
> # TCP keep alive configurations
> # Enable sending of TCP keep alive between slon and the PostgreSQL backends
> tcp_keepalive = true
>
> # The number of seconds after which a TCP keep alive is sent across an idle
> # connection. tcp_keepalive must be enabled for this to take effect. Default
> # value of 0 means use operating system default
> tcp_keepalive_idle = 5
>
> # The number of keep alive requests to the server that can be lost before
> # the connection is declared dead. tcp_keepalive must be on. Default value
> # of 0 means use operating system default
> tcp_keepalive_count = 10
>
> # The number of seconds in between TCP keep alive requests. tcp_keepalive
> # must be enabled. Default value of 0 means use operating system default
> tcp_keepalive_interval = 30
>
> That's probably more aggressive than you need, but it should do the trick.
>
>
Okay, so I mucked with the settings. Maybe I'm not understanding them quite
right, but I still got the same result; this time I at least caught the
disconnect in my source PostgreSQL logs.

2014-02-16 14:41:03 PST CONFIG remoteWorkerThread_1: Begin COPY of table
"tracking"."spotlightimp"
NOTICE:  truncate of "tracking"."spotlightimp" succeeded
2014-02-16 15:54:40 PST CONFIG remoteWorkerThread_1: 5618691807 bytes
copied for table "tracking"."spotlightimp"

------------ ORIGIN----------
2014-02-16 16:14:40 PST cls postgres 172.19.228.100(35508) 13796 2014-02-16
16:14:40.430 PST LOG:  could not receive data from client: Connection reset
by peer
2014-02-16 16:14:40 PST cls postgres 172.19.228.100(35508) 13796 2014-02-16
16:14:40.430 PST LOG:  unexpected EOF on client connection with an open
transaction
------------ORIGIN----------

As can be seen, the connection is reaped, and slon/postgres continue on
their way; it's not until the next data copy is required that slon finds its
connection is no longer there. Why it can't recreate a connection, as it
would if slon were stopped and started, is kind of beyond me. I'm just not
100% sure where it's being killed.


2014-02-16 16:40:46 PST CONFIG remoteWorkerThread_1: 7183.069 seconds to
copy table "tracking"."spotlightimp"
2014-02-16 16:40:46 PST CONFIG remoteWorkerThread_1: copy table
"tracking"."adimp"
2014-02-16 16:40:46 PST CONFIG remoteWorkerThread_1: Begin COPY of table
"tracking"."adimp"
2014-02-16 16:40:46 PST ERROR  remoteWorkerThread_1: "select
"_cls".copyFields(19);"
2014-02-16 16:40:46 PST WARN   remoteWorkerThread_1: data copy for set 2
failed 1 times - sleep 15 seconds
NOTICE:  Slony-I: Logswitch to sl_log_2 initiated
CONTEXT:  SQL statement "SELECT "_cls".logswitch_start()"
PL/pgSQL function _cls.cleanupevent(interval) line 96 at PERFORM
2014-02-16 16:40:49 PST INFO   cleanupThread: 6541.365 seconds for
cleanupEvent()


Am I doing this wrong? Since I've seen connections with 15 minutes of
processing complete fine, I figured that 30 minutes is more than enough. So
send the first "are you still there?" probe at 15 minutes, then continue
with them every 5 minutes, for a count of 30.
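
One thing worth noting about the sysctls: the kernel's
net.ipv4.tcp_keepalive_* values only apply to sockets that have SO_KEEPALIVE
turned on, which is what slon's tcp_keepalive = true arranges via libpq, and
per-socket options override the sysctls. A minimal Python sketch of that
per-socket mechanism (Linux-specific constants; the values are illustrative,
not recommendations):

```python
import socket

# Enable keepalive on one socket and override the system-wide sysctls
# for that socket only (Linux: TCP_KEEPIDLE/KEEPINTVL/KEEPCNT).
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)   # idle secs before first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)   # secs between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 10)     # lost probes before dead

# Read the options back to confirm they took effect.
keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
idle = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
print(keepalive_on, idle)  # nonzero means enabled; idle echoes back 300
s.close()
```

The practical upshot is that tuning the sysctls does nothing for a
connection unless the application (here, slon/libpq) actually enables
keepalive on it.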

But the above seems to have been reaped in the 20-minute area...

net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_probes = 30
net.ipv4.tcp_keepalive_intvl = 300
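
As a sanity check on those numbers, the worst-case timing math (assuming
the standard Linux keepalive semantics) works out like this:

```python
def keepalive_timings(idle, intvl, probes):
    """Seconds until the first keepalive probe, and until the peer
    is declared dead (idle period plus all probes going unanswered)."""
    return idle, idle + probes * intvl

# The sysctls above: time=600, intvl=300, probes=30
first_probe, declared_dead = keepalive_timings(600, 300, 30)
print(first_probe, declared_dead)  # 600 9600
```

So with these settings the first probe goes out at 10 minutes and a dead
peer isn't declared until 160 minutes; a probe every 5 minutes after that
should be plenty to keep a stateful firewall's entry alive, assuming the
probes actually make it through and the socket has keepalive enabled.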

thanks for the assistance
Tory
_______________________________________________
Slony1-general mailing list
[email protected]
http://lists.slony.info/mailman/listinfo/slony1-general
