Sorry, I'm a bit sleep deprived; this is almost the exact thing I asked for help on in 2014. Jan and Jeff both came in and gave me suggestions for keep-alives that are much more aggressive than what I have set.
So I'm going to test with the more aggressive settings from that 2014 thread: https://www.mail-archive.com/slony1-general@lists.slony.info/msg06967.html

How lame, I spaced. I knew Jan had been helpful, but totally spaced on this thread.. UUGH! Sorry. And yes, double bad, top posting!!

Tory

On Sun, Sep 11, 2016 at 11:41 PM, Tory M Blue <tmb...@gmail.com> wrote:
> Jan has helped me before, giving me ideas to help with wide-area
> replication, where it seems that the connection drops during a large copy
> set and/or an index creation: while no bits are crossing the wire, the
> connection gets dropped by the FW or something else, so when Slony finishes
> a table or index creation and attempts to grab the next table, the
> connection is no longer there, and Slony reports a failure and tries again.
>
> I think I'm running into this between my colo and Amazon, using their VPN
> gateway.
>
> Here is the snippet of logs. There is no index here; we dropped it on the
> new node so that it would not fail. But what's odd is that it copies all
> the data, and 35 minutes later it reports the time, which tells me it's
> doing something, but I'm not sure what, if there is no index on that table.
> (There is a primary key, which maintains integrity, and we didn't think we
> should drop that.) But there are no other indexes, so the 35 minutes or
> whatever is a mystery..
>
>
> 2016-09-11 21:32:24 PDT CONFIG remoteWorkerThread_1: Begin COPY of table "torque"."adimpressions"
> 2016-09-11 *22:39:39* PDT CONFIG remoteWorkerThread_1: 76955497834 bytes copied for table "torque"."adimpressions"
> 916499:2016-09-11 *23:14:25* PDT CONFIG remoteWorkerThread_1: 6121.393 seconds to copy table "torque"."impressions"
> 916608:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: copy table "torque".impressions_archive"
> 916705:2016-09-11 23:14:25 PDT CONFIG remoteWorkerThread_1: Begin COPY of table "torque"."impressions_archive"
> 916811:2016-09-11 23:14:25 PDT ERROR remoteWorkerThread_1: "select "_cls".copyFields(237);"
> 916907:2016-09-11 23:14:25 PDT WARN remoteWorkerThread_1: data copy for set 2 failed 1 times - sleep 15 seconds
> 917014:2016-09-11 23:14:25 PDT INFO cleanupThread: 7606.655 seconds for cleanupEvent()
>
> This run, I added keep-alives by the following method (and the timing and
> results are the same without them; set 2 fails with error 237), adding the
> following to both slon commands, on the origin and the new node:
>
> tcp_keepalive_idle 300
> tcp_keepalive_count 5
> tcp_keepalive_interval 300
>
> Now I'm not entirely sure how this is supposed to work, or whether I tuned
> it right. It obviously fails at the 30 minute mark (this is 25 minutes),
> but the servers never lose the connection (I have a ping running, which is
> not quite the same, but it shows zero packet loss over the 2+ hours these
> replication attempts take). So maybe someone smarter than me can advise
> how I should tune the keep-alives, if that's what is happening.
>
> I thought it would only use the keep-alives if it felt the partner was no
> longer there, but since I know the pings show there are no connectivity
> issues, I'm at a loss. AGAIN :)
>
> Thanks for the assist
>
> Tory
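P.S. Writing out how I understand those three knobs to interact, mostly to check my own math. A minimal sketch of a more aggressive keep-alive setup for slon; the option names are the same ones I'm already passing, but the values below are just illustrative guesses, not the ones from the 2014 thread:

    # keep-alive options for slon (illustrative values, not the 2014 thread's)
    tcp_keepalive_idle     60   # first probe after 60s with no traffic (I currently use 300)
    tcp_keepalive_interval 10   # then one probe every 10s              (currently 300)
    tcp_keepalive_count    5    # give up after 5 unanswered probes     (currently 5)

If I have the mechanics right, a silently dropped connection isn't noticed until roughly idle + count * interval; with my current 300/5/300 that's 300 + 5*300 = 1800s, i.e. about 30 minutes, which lines up with where this blows up. The values above would cut that to 60 + 5*10 = 110s. The probes should also generate just enough traffic to keep a stateful FW/VPN entry from expiring while the subscriber is busy and nothing else is crossing the wire, but only if the first probe goes out before the firewall's idle timeout, i.e. tcp_keepalive_idle needs to be smaller than whatever the FW/VPN idle limit is.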