On Mon, 18 Jul 2016, Tignor, Tom wrote:
One thing that I stress is that it is a good idea (and maybe very important)
that all nodes be caught up (at least with respect to configuration events)
before you issue the drop node command. The automatic waitfor support in
slonik is supposed to ensure this before submitting the drop node, but maybe
it isn't working as I expect, or you might be doing something to subvert it.
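For illustration, the ordering I mean is roughly the slonik below (the node
IDs are hypothetical: 2 is the node being dropped, 1 is the event node):

# Hypothetical IDs: make sure every node has confirmed the latest
# configuration events from both nodes before the drop is submitted.
sync (id = 2);
wait for event (origin = 2, confirmed = all, wait on = 2);
sync (id = 1);
wait for event (origin = 1, confirmed = all, wait on = 1);
drop node (id = 2, event node = 1);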
Your error message looked like the remoteWorker_$droppedNode was trying to
connect to the dropped node and pull data. I THINK the remoteWorker would
only be trying to pull data from the dropped node if the local node thought
that the dropped node was a set origin. This makes me think that the
failover wasn't confirmed by all nodes before the drop node was issued. If
your drop node is in the same slonik script as the failover then slonik
'should' be doing this, but maybe there is a bug, or maybe you're doing
something that makes the drop node get submitted too early.
Without more details it is hard to say.
Hi Steve,
Thanks for looking into this. The context in which this occurs is an
effort to move a node. The specifics are somewhat involved, but the key
points are that a node is failed over, then dropped, and a new node with
the same node ID is created elsewhere (another DB on another host). The
recent work makes a best effort to automatically reverse the steps if a
failure is encountered: fail over and drop the new node, then recreate the
original. There are certainly 'wait for event' calls along the way, but it
seems likely that not all nodes are fully caught up before an error. If
you're trying to reproduce, repeatedly dropping and recreating a node with
the same node ID could help; a rough sketch of the forward sequence follows.
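Roughly, the forward sequence looks like the slonik below (all IDs,
hostnames and the set number are hypothetical stand-ins, not our actual
script):

# Hypothetical: node 3 is being moved, node 1 is the backup/event node.
failover (id = 3, backup node = 1);
wait for event (origin = 1, confirmed = all, wait on = 1);
drop node (id = 3, event node = 1);
wait for event (origin = 1, confirmed = all, wait on = 1);
# Recreate node 3 on the new host and hook it back up.
store node (id = 3, comment = 'moved node', event node = 1);
# Conninfo tells the client node how to reach the server node.
store path (server = 1, client = 3, conninfo = 'dbname=ams host=oldhost');
store path (server = 3, client = 1, conninfo = 'dbname=ams host=newhost');
subscribe set (id = 1, provider = 1, receiver = 3, forward = yes);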
For the specific problem of the race, though, it might be simpler to
just 'kill -STOP' the slon daemons on node A and then wait for node B to
drop its schema (then resume with 'kill -CONT' on node A).
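In shell, that workaround would look something like this (the pgrep
pattern and the psql check are assumptions, adjust for your setup):

# On node A's host: pause every slon process.
kill -STOP $(pgrep -x slon)
# Poll node B until its Slony schema is really gone.
while psql -h nodeB -d ams -tAc \
    "select 1 from pg_namespace where nspname = '_ams_cluster'" | grep -q 1
do
    sleep 1
done
# Resume node A's slons.
kill -CONT $(pgrep -x slon)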
Tom ☺
On 7/17/16, 2:16 PM, "Steve Singer" <st...@ssinger.info> wrote:
On 07/12/2016 08:23 AM, Steve Singer wrote:
On 07/08/2016 03:27 PM, Tignor, Tom wrote:
Hello slony group,
I’m testing now with slony1-2.2.4. I have just recently
produced an error which effectively stops slon processing on some node A
due to some node B being dropped. The event reproduces only
infrequently. As some will know, a slon daemon for a given node which
becomes aware its node has been dropped will respond by dropping its
cluster schema. There appears to be a race condition between the node B
schema drop and the (surviving) node A receipt of the disableNode (drop
node) event. If the former occurs before the latter, all the remote
worker threads on node A enter an error state. See the log samples
below. I resolved this the first time by deleting all the recent
non-SYNC events from the sl_event tables, and more recently with a
simple node A slon restart.
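Illustratively, the first fix amounted to something like this on node A
(the one-hour window is a hypothetical stand-in for "recent"; the exact
predicate may need adjusting):

-- Drop recent non-SYNC events so the workers stop replaying them.
delete from "_ams_cluster".sl_event
 where ev_type <> 'SYNC'
   and ev_timestamp > now() - interval '1 hour';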
Please advise if there is any ticket I should provide
this info to, or if I should create a new one. Thanks.
The Slony bug tracker is at
http://bugs.slony.info/bugzilla/
I assume you're saying that when the slon restarts it keeps hitting this
error and keeps restarting.
Any more hints on how you reproduce this would be helpful.
I've been trying to reproduce this with no luck.
At the time you issue the drop node, are all other nodes caught up to
the dropped node (and the event node) with respect to configuration events?
I.e., if you do
sync(id=$NODE_ABOUT_TO_DROP);
wait for event(wait on=$NODE_ABOUT_TO_DROP, origin=$NODE_ABOUT_TO_DROP, confirmed=all);
sync(id=$EVENT_NODE_FOR_DROP);
wait for event(wait on=$EVENT_NODE_FOR_DROP, origin=$EVENT_NODE_FOR_DROP, confirmed=all);
(notice that I am NOT passing any timeout to the wait for).
---- node 1 log ----
2016-07-08 18:06:31 UTC [30382] INFO remoteWorkerThread_999999: SYNC
5000000008 done in 0.002 seconds
2016-07-08 18:06:33 UTC [30382] INFO remoteWorkerThread_999999: SYNC
5000000009 done in 0.002 seconds
2016-07-08 18:06:33 UTC [30382] INFO remoteWorkerThread_2: SYNC
5000017869 done in 0.002 seconds
2016-07-08 18:06:33 UTC [30382] INFO remoteWorkerThread_3: SYNC
5000018148 done in 0.004 seconds
2016-07-08 18:06:45 UTC [30382] CONFIG remoteWorkerThread_2: update
provider configuration
2016-07-08 18:06:45 UTC [30382] ERROR remoteWorkerThread_3: "select
last_value from "_ams_cluster".sl_log_status" PGRES_FATAL_ERROR ERROR:
schema "_ams_cluster" does not exist
LINE 1: select last_value from "_ams_cluster".sl_log_status
                               ^
2016-07-08 18:06:45 UTC [30382] ERROR remoteWorkerThread_3: SYNC aborted
2016-07-08 18:06:45 UTC [30382] CONFIG version for "dbname=ams
host=198.18.102.45
user=ams_slony
sslmode=verify-ca
sslcert=/usr/local/akamai/.ams_certs/complete-ams_slony.crt
sslkey=/usr/local/akamai/.ams_certs/ams_slony.private_key
sslrootcert=/usr/local/akamai/etc/ssl_ca/canonical_ca_roots.pem"
is 90119
2016-07-08 18:06:45 UTC [30382] ERROR remoteWorkerThread_2: "select
last_value from "_ams_cluster".sl_log_status" PGRES_FATAL_ERROR ERROR:
schema "_ams_cluster" does not exist
LINE 1: select last_value from "_ams_cluster".sl_log_status
                               ^
2016-07-08 18:06:45 UTC [30382] ERROR remoteWorkerThread_2: SYNC aborted
2016-07-08 18:06:45 UTC [30382] ERROR remoteListenThread_999999:
"select ev_origin, ev_seqno, ev_timestamp, ev_snapshot,
"pg_catalog".txid_snapshot_xmin(ev_snapshot),
"pg_catalog".txid_snapshot_xmax(ev_snapshot), ev_type,
ev_data1, ev_data2, ev_data3, ev_data4, ev_data5, ev_data6,
ev_data7, ev_data8 from "_ams_cluster".sl_event e
where (e.ev_origin = '999999' and e.ev_seqno > '5000000009')
or (e.ev_origin = '2' and e.ev_seqno > '5000017870')
or (e.ev_origin = '3' and e.ev_seqno > '5000018151')
order by e.ev_origin, e.ev_seqno limit 40" - ERROR:
schema "_ams_cluster" does not exist
LINE 1: ...v_data5, ev_data6, ev_data7, ev_data8 from "_ams_clus...
                                                       ^
2016-07-08 18:06:55 UTC [30382] ERROR remoteWorkerThread_3: "start
transaction; set enable_seqscan = off; set enable_indexscan = on; "
PGRES_FATAL_ERROR ERROR: current transaction is aborted, commands
ignored until end of transaction block
2016-07-08 18:06:55 UTC [30382] ERROR remoteWorkerThread_3: SYNC aborted
2016-07-08 18:06:55 UTC [30382] ERROR remoteWorkerThread_2: "start
transaction; set enable_seqscan = off; set enable_indexscan = on; "
PGRES_FATAL_ERROR ERROR: current transaction is aborted, commands
ignored until end of transaction block
2016-07-08 18:06:55 UTC [30382] ERROR remoteWorkerThread_2: SYNC aborted
----
---- node 999999 log ----
2016-07-08 18:06:44 UTC [558] INFO remoteWorkerThread_1: SYNC
5000081216 done in 0.004 seconds
2016-07-08 18:06:44 UTC [558] INFO remoteWorkerThread_2: SYNC
5000017870 done in 0.004 seconds
2016-07-08 18:06:44 UTC [558] INFO remoteWorkerThread_3: SYNC
5000018150 done in 0.004 seconds
2016-07-08 18:06:44 UTC [558] INFO remoteWorkerThread_1: SYNC
5000081217 done in 0.003 seconds
2016-07-08 18:06:44 UTC [558] WARN remoteWorkerThread_3: got DROP NODE
for local node ID
NOTICE: Slony-I: Please drop schema "_ams_cluster"
NOTICE: drop cascades to 171 other objects
DETAIL: drop cascades to table _ams_cluster.sl_node
drop cascades to table _ams_cluster.sl_nodelock
drop cascades to table _ams_cluster.sl_set
drop cascades to table _ams_cluster.sl_setsync
drop cascades to table _ams_cluster.sl_table
----
Tom J
_______________________________________________
Slony1-general mailing list
Slony1-general@lists.slony.info
http://lists.slony.info/mailman/listinfo/slony1-general