> On Sun, Sep 11, 2016 at 11:41 PM, Tory M Blue <tmb...@gmail.com> wrote:
>> Jan has helped me before, giving me ideas to help with wide area
>> replication where it seems that the connection drops during a large copy
>> set and/or an index creation, when no bits are crossing the wire and
>> the connections are dropped by the FW or something else. Slony finishes up a
>> table or index creation and attempts to grab the next table, but the
>> connection is no longer there, so Slony reports a failure and tries again.
>> I think I'm running into this between my colo and Amazon, using their VPN
>> gateway.
>> Here is the snippet of logs. There is no index here; we dropped it on the
>> new node so that it would not fail. What's odd is that it copies
>> all the data and 35 minutes later it reports the time, which tells me it's
>> doing something, but I'm not sure what if there is no index on that table.
>> (There is a primary key, which maintains integrity, and we didn't think we
>> should drop that.) There are no other indexes, so the 35 minutes or
>> so is a mystery.
>> 2016-09-11 21:32:24 PDT CONFIG remoteWorkerThread_1: Begin COPY of table
>> "torque"."adimpressions"
>> 2016-09-11 *22:39:39 *PDT CONFIG remoteWorkerThread_1: 76955497834 bytes
>> copied for table "torque"."adimpressions"
>> 916499:2016-09-11 *23:14:25 *PDT CONFIG remoteWorkerThread_1: 6121.393
>> seconds to copy table "torque"."impressions"
>> 916608:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: copy table
>> "torque".impressions_archive"
>> 916705:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: Begin COPY of
>> table "torque"."impressions_archive"
>> 916811:2016-09-11 23:14:25 PDT ERROR  remoteWorkerThread_1: "select
>> "_cls".copyFields(237);"
>> 916907:2016-09-11 23:14:25 PDT WARN   remoteWorkerThread_1: data copy for
>> set 2 failed 1 times - sleep 15 seconds
>> 917014:2016-09-11 23:14:25 PDT INFO   cleanupThread: 7606.655 seconds for
>> cleanupEvent()
>> This run, I added keepalives by the following method (the timing
>> and results are the same without them; set 2 fails with error 237).
>> Adding the following to both slon commands on the origin and the new node:
>> tcp_keepalive_idle 300 tcp_keepalive_count 5 tcp_keepalive_interval 300
>> Now I'm not entirely sure how this is supposed to work, or whether I tuned it
>> right. It obviously fails at the 30 minute mark, and this is 25 minutes;
>> however, the servers never lose connection (I have a ping running (not quite
>> the same), and it has zero packet loss over the 2+ hours that these attempts
>> to get things replicated take). So maybe someone smarter than me can advise
>> how I should tune the keepalives, if that's what is happening.
>> I thought it would only use the keepalives if it felt the partner was no
>> longer there, but since I know pings show there are no connectivity issues,
>> I'm at a loss. AGAIN :)
>> Thanks for the assist
>> Tory
Okay, keepalives didn't work, but maybe I configured slon.conf wrong;
there do not appear to be any real examples.

I used:

tcp_keepalive_time = 5

tcp_keepalive_probes = 24

tcp_keepalive_intvl = 5

My kernel is set as follows; maybe I need to adjust the kernel as well?

net.ipv4.tcp_keepalive_time = 7200

net.ipv4.tcp_keepalive_probes = 9

net.ipv4.tcp_keepalive_intvl = 75
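
A rough sketch of what these two sets of values would imply, assuming the usual Linux TCP keepalive semantics: the first probe is sent after `time` seconds of idle, and the peer is declared dead after `probes` unanswered probes spaced `intvl` seconds apart. The helper name `keepalive_timeline` is made up for illustration, and whether slon actually applies the slon.conf values to its sockets is not confirmed here.

```python
def keepalive_timeline(idle_s, probes, intvl_s):
    """Return (idle seconds before the first probe is sent,
    idle seconds before a silent peer is declared dead)."""
    first_probe = idle_s
    declared_dead = idle_s + probes * intvl_s
    return first_probe, declared_dead

# Values I put in slon.conf (assumed to map onto time/probes/intvl).
slon_conf = keepalive_timeline(5, 24, 5)
# Kernel defaults from sysctl.
kernel = keepalive_timeline(7200, 9, 75)

print("slon.conf:", slon_conf)
print("kernel:   ", kernel)
```

If only the kernel defaults are in effect, no probe would go out until the connection has been idle for two hours, which is far longer than the roughly 30-minute quiet period in the logs.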

2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: 1869.486 seconds to
copy table "torque"."impressions_daily"

2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: copy table

2016-09-12 09:27:48 PDT CONFIG remoteWorkerThread_1: Begin COPY of table

NOTICE:  truncate of "torque"."impressions" succeeded

2016-09-12 10:31:09 PDT CONFIG remoteWorkerThread_1: 77048102322 bytes
copied for table "torque"."adimpressions"

2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: 5708.515 seconds to
copy table "torque"."impressions"

2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: copy table

2016-09-12 11:02:56 PDT CONFIG remoteWorkerThread_1: Begin COPY of table

2016-09-12 11:02:56 PDT ERROR  remoteWorkerThread_1: "select
2016-09-12 11:02:56 PDT WARN   remoteWorkerThread_1: data copy for set 2
failed 1 times - sleep 15 seconds
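
As a sanity check on the timing, here is a small sketch that computes the quiet period between the "bytes copied" line and the "seconds to copy" line in both runs, using only the timestamps from the logs. The helper name `gap_seconds` is just for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def gap_seconds(start, end):
    """Whole seconds elapsed between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())

# Quiet period: "bytes copied" line -> "seconds to copy" line.
first_run = gap_seconds("2016-09-11 22:39:39", "2016-09-11 23:14:25")
second_run = gap_seconds("2016-09-12 10:31:09", "2016-09-12 11:02:56")

# Total elapsed since "Begin COPY" in the first run, to compare with
# the 6121.393 seconds that slon reports.
first_total = gap_seconds("2016-09-11 21:32:24", "2016-09-11 23:14:25")

print(divmod(first_run, 60))   # (minutes, seconds)
print(divmod(second_run, 60))
print(first_total)
```

The first run's total matches the reported copy time to the second, so the reported "seconds to copy" includes the quiet period after the bytes were transferred.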

There are no indexes, so I don't know what Slon is doing for the 31 minutes
between when the data finishes copying and when it attempts to start the next
table.

More suggestions? I know I'm being needy, but I'm spinning my wheels, it seems.

Slony1-general mailing list
