[HACKERS] Wide area replication postgres 9.1.6 slon 2.1.2 large table failure.

Tory M Blue Fri, 11 Jan 2013 21:49:49 -0800

So I started this thread on the slon forum, and they mentioned that I/we
should ask here.


Postgres 9.1.4 slon 2.1.1
-and-
Postgres 9.1.6 slon 2.1.2

Scenario:

Node 1 is on a gig circuit and is the master (West Coast)

Node 2 is also on a gig circuit and is the slave (Georgia)

Symptoms: slon immediately dies after transferring the biggest table in the
set (this happens with 2 of 3 sets; the set that actually completes has no
large tables).

Set 1 has a table that takes just under 6000 seconds to copy, and set 2 has
a table that takes double that; in both cases the copy of the large table
itself completes.

1224459-2013-01-11 14:21:10 PST CONFIG remoteWorkerThread_1: 5760.913 seconds to copy table "cls"."listings"
1224560-2013-01-11 14:21:10 PST CONFIG remoteWorkerThread_1: copy table "cls"."customers"
1224642-2013-01-11 14:21:10 PST CONFIG remoteWorkerThread_1: Begin COPY of table "cls"."customers"
1224733-2013-01-11 14:21:10 PST ERROR  remoteWorkerThread_1: "select "_admissioncls".copyFields(8);"  <--- this has the proper data
1224827:2013-01-11 14:21:10 PST WARN   remoteWorkerThread_1: data copy for set 1 failed 1 times - sleep 15 seconds
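
For anyone who wants to compare the per-table copy times across sets, here
is a minimal sketch that pulls them out of a slon log. It assumes the log
lines look like the "seconds to copy table" line above, unwrapped and
without the grep offsets, and that table names contain no spaces. The
function name copy_times is just something I made up for illustration.

```shell
# Print "<seconds> <table>" for every completed table copy in a slon log.
# Reads the files given as arguments, or stdin if none are given.
copy_times() {
  awk '/seconds to copy table/ {
    # The duration is the field right before the word "seconds";
    # the quoted table name is the last field on the line.
    for (i = 2; i <= NF; i++) {
      if ($i == "seconds") {
        print $(i - 1), $NF
        break
      }
    }
  }' "$@"
}
```

It can be run as copy_times < slon.log, where slon.log stands in for
wherever the slon log actually lives on the node.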

Now in terms of postgres, if I do a plain copy of the large table from
node 1 to node 2 (<2 hours), it completes without issue.

From Node 2:
-bash-4.1$ psql -h idb02 -d admissionclsdb -c "copy cls.listings to stdout" | wc
     4199441 600742784 6621887401

This worked fine.

I get no errors in the postgres logs, there is no network disconnect, and
since I can do a copy over the wire that completes, I'm at a loss.  I don't
know what to look at, what to look for, or what to do.  Obviously this is
the wrong place for slon issues.

One of the slon developers stated:
"I wonder if there's something here that should get bounced over to
pgsql-hackers or such; we're poking at a scenario here where the use
of COPY to stream data between systems is proving troublesome, and
perhaps there may be meaningful opinions over there on that."

If the same table that sits at the end of a failed slon attempt completes
fine with a plain copy, I'm just not sure what is going on.

Any suggestions are welcome, and please ask for more data.  I can do
anything to the slave node; it's a bit tougher on the source, but I can
arrange to make changes to it if need be.


I just upgraded to 9.1.6 and slon 2.1.2, but prior tests were on 9.1.4 with
slon 2.1.1, and on a mix of postgres 9.1.4 slon 2.1.1 and postgres 9.1.6
slon 2.1.1 (node 2).

The other difference is that node 1 is running Fedora 12 and node 2 is
running CentOS 6.2.

Thanks in advance
Tory
