Karl Denninger wrote: > I upgraded somewhat-recently from 2.0.2 to 2.0.4, and now I've got a > serious problem. > > The reason for the "gotta do it now" was that somehow one of the tables > got out of sync, and a delete was failing to propagate - hanging the > process. > > OK, ok, so 2.0.2 with Postgres 8.4.4 is a bit old and mismatched. So I > upgraded to 2.0.4 on all the nodes, and told the subscriber to reload - > ditched the client config and re-subscribed the sets. > > All went well until a very large table came up - it failed. > > There's no error in the logs indicating why, other than the following: > > Jul 31 22:52:53 dbms TICKER[70295]: [153-1] CONFIG remoteWorkerThread_3: > copy table "public"."images" > Jul 31 22:52:53 dbms TICKER[70295]: [154-1] CONFIG remoteWorkerThread_3: > Begin COPY of table "public"."images" > Jul 31 22:54:24 dbms TICKER[70295]: [155-1] ERROR remoteWorkerThread_3: > PGgetCopyData() server closed the connection unexpectedly > Jul 31 22:54:24 dbms TICKER[70295]: [155-2] This probably means the > server terminated abnormally > Jul 31 22:54:24 dbms TICKER[70295]: [155-3] before or while > processing the request. > Jul 31 22:54:24 dbms TICKER[70295]: [156-1] WARN remoteWorkerThread_3: > data copy for set 1 failed 1 times - sleep 15 seconds > > And in 15 seconds, the entire process of trying to re-init the node > starts over - from the beginning! > > Near as I can tell, it's failing pretty early on. > > The source host is fine. This particular table contains a BYTEA field, > and it's BIG. ~20ish gigs big. But I've re-initialized in the past > without problems. I tried going back to 2.0.2, and that still fails. > Both servers are running with encoding set to SQL_ASCII, if it matters. > > When it fails the SERVER's COPY is still running - so the client is > definitely wrong on the reported error. I have NOTHING in the server's > SLON log and there are no comms problems between the two hosts. > > I'm going to run a dump of the table and see if I can manually bring it > over to the other host and load it. There's nothing going on with the > master that implicates the data being damaged...... > > Ideas? > > -- Karl > Oh boy, I think I know what's going on....
This looks like a problem in the SSL code (!!!!!)
I shut off SSL and I'm now ~4x further than where the copy has
previously failed. For obvious reasons this is decidedly un-good - a
perusal of the server's log (not slony's, but the server's) disclosed
several SSL errors in the log, which would account for the problem and
what the client thought were disconnects - but weren't.
Specifically, I got a bunch of these...
Jul 31 23:00:00 tickerforum postgres[27093]: [9593-2] STATEMENT: copy
"public"."images"
("post_ordinal","ordinal","caption","image","login","file_type","thumb","thumb_width","thumb_height","hidden")
to stdout;
Jul 31 23:00:00 tickerforum postgres[27093]: [9594-1] LOG: SSL error:
internal error
Jul 31 23:00:00 tickerforum postgres[27093]: [9594-2] STATEMENT: copy
"public"."images"
("post_ordinal","ordinal","caption","image","login","file_type","thumb","thumb_width","thumb_height","hidden")
to stdout;
Jul 31 23:00:01 tickerforum postgres[27093]: [9595-1] LOG: SSL error:
internal error
This isn't SLONY's issue, but it's definitely a problem. I'll report it
over on the Postgres list in the morning...
In the meantime, I think I can get the copy to go with SSL off, and then
turn it back on once the copy is complete.
-- Karl
<<attachment: karl.vcf>>
_______________________________________________ Slony1-general mailing list [email protected] http://lists.slony.info/mailman/listinfo/slony1-general
