On Mon, 18 Jul 2016, Tignor, Tom wrote:

One thing I stress is that it is a good idea (and maybe very important) that all nodes be caught up (at least with respect to configuration events) before you issue the drop node command. The automatic waitfor support in slonik is supposed to ensure this before submitting the drop node, but maybe it isn't working as I expect, or you might be doing something to subvert it.

Your error message looked like the remoteWorker_$droppedNode was trying to connect to the dropped node and pull data. I THINK the remoteWorker would only be trying to pull data from the dropped node if the local node still thought the dropped node was a set origin. This makes me think that the failover wasn't confirmed by all nodes before the drop node was issued. If your drop node is in the same slonik script as the failover then slonik 'should' be doing this, but maybe there is a bug, or maybe you're doing something that makes the drop node get submitted too early.
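Roughly what I have in mind is something like the following (a sketch only; the node IDs are placeholders, with the failed origin as node 3 and the backup/event node as node 1 just for illustration):

# fail node 3 over to node 1
failover (id = 3, backup node = 1);

# make sure every surviving node has confirmed the failover and the
# related configuration events before the drop node is submitted
sync (id = 1);
wait for event (origin = 1, confirmed = all, wait on = 1);

drop node (id = 3, event node = 1);

If the DROP NODE can reach a node that still believes the dropped node is a set origin, that ordering is the first thing I would look at.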

Without more details it is hard to say.



        Hi Steve,
Thanks for looking into this. The context in which this occurs is an effort to move a node. The specifics are somewhat involved, but the key points are that a node is failed over, then dropped, and a new node with the same node ID is created elsewhere (another DB on another host). The recent work makes a best effort to automatically reverse the steps if a failure is encountered: fail over and drop the new node, then recreate the original. There are certainly ‘wait for event’ commands along the way, but it seems likely that not all nodes are fully caught up when an error occurs. If you’re trying to reproduce it, repeatedly dropping and recreating a node with the same node ID could help. For the specific race, though, it might be simpler to just ‘kill -STOP’ the slon daemons on node A, wait for node B to drop its schema, and then resume node A with ‘kill -CONT’.
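In rough outline the move looks like the slonik sketch below (placeholder node IDs, hostnames and conninfos rather than our actual script, and omitting the intermediate ‘wait for event’ steps):

# fail node 3 over to node 1, drop it, then recreate node 3 elsewhere
failover (id = 3, backup node = 1);
drop node (id = 3, event node = 1);

# recreate a node with the same ID on another DB/host
store node (id = 3, comment = 'replacement node 3', event node = 1);
store path (server = 1, client = 3, conninfo = 'dbname=ams host=node1.example');
store path (server = 3, client = 1, conninfo = 'dbname=ams host=newhost.example');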

        Tom    ☺


On 7/17/16, 2:16 PM, "Steve Singer" <st...@ssinger.info> wrote:

On 07/12/2016 08:23 AM, Steve Singer wrote:
On 07/08/2016 03:27 PM, Tignor, Tom wrote:
                  Hello slony group,

                  I’m testing now with slony1-2.2.4. I have just recently
produced an error which effectively stops slon processing on some node A
due to some node B being dropped. The event reproduces only
infrequently. As some will know, a slon daemon for a given node which
becomes aware its node has been dropped will respond by dropping its
cluster schema. There appears to be a race condition between the node B
schema drop and the (surviving) node A receipt of the disableNode (drop
node) event. If the former occurs before the latter, all the remote
worker threads on node A enter an error state. See the log samples
below. I resolved this the first time by deleting all the recent
non-SYNC events from the sl_event tables, and more recently with a
simple node A slon restart.

                  Please advise if there is any ticket I should provide
this info to, or if I should create a new one. Thanks.


The Slony bug tracker is at
http://bugs.slony.info/bugzilla/


I assume you're saying that when the slon restarts it keeps hitting this
error and keeps restarting.


Any more hints on how you reproduce this would be helpful.
I've been trying to reproduce this with no luck.

At the time you issue the drop node, are all other nodes caught up to
the drop'd node (and the event node) with respect to configuration events?

I.e., if you do:
sync(id=$NODE_ABOUT_TO_DROP);
wait for event(wait on=$NODE_ABOUT_TO_DROP, origin=$NODE_ABOUT_TO_DROP, confirmed=all);

sync(id=$EVENT_NODE_FOR_DROP);
wait for event(wait on=$EVENT_NODE_FOR_DROP, origin=$EVENT_NODE_FOR_DROP, confirmed=all);


(notice that I am NOT passing any timeout to the wait for).







---- node 1 log ----

2016-07-08 18:06:31 UTC [30382] INFO   remoteWorkerThread_999999: SYNC 5000000008 done in 0.002 seconds
2016-07-08 18:06:33 UTC [30382] INFO   remoteWorkerThread_999999: SYNC 5000000009 done in 0.002 seconds
2016-07-08 18:06:33 UTC [30382] INFO   remoteWorkerThread_2: SYNC 5000017869 done in 0.002 seconds
2016-07-08 18:06:33 UTC [30382] INFO   remoteWorkerThread_3: SYNC 5000018148 done in 0.004 seconds
2016-07-08 18:06:45 UTC [30382] CONFIG remoteWorkerThread_2: update provider configuration
2016-07-08 18:06:45 UTC [30382] ERROR  remoteWorkerThread_3: "select last_value from "_ams_cluster".sl_log_status" PGRES_FATAL_ERROR ERROR:  schema "_ams_cluster" does not exist
LINE 1: select last_value from "_ams_cluster".sl_log_status
                               ^
2016-07-08 18:06:45 UTC [30382] ERROR  remoteWorkerThread_3: SYNC aborted
2016-07-08 18:06:45 UTC [30382] CONFIG version for "dbname=ams
        host=198.18.102.45
        user=ams_slony
        sslmode=verify-ca
        sslcert=/usr/local/akamai/.ams_certs/complete-ams_slony.crt
        sslkey=/usr/local/akamai/.ams_certs/ams_slony.private_key
        sslrootcert=/usr/local/akamai/etc/ssl_ca/canonical_ca_roots.pem" is 90119
2016-07-08 18:06:45 UTC [30382] ERROR  remoteWorkerThread_2: "select last_value from "_ams_cluster".sl_log_status" PGRES_FATAL_ERROR ERROR:  schema "_ams_cluster" does not exist
LINE 1: select last_value from "_ams_cluster".sl_log_status
                               ^
2016-07-08 18:06:45 UTC [30382] ERROR  remoteWorkerThread_2: SYNC aborted
2016-07-08 18:06:45 UTC [30382] ERROR  remoteListenThread_999999: "select ev_origin, ev_seqno, ev_timestamp, ev_snapshot, "pg_catalog".txid_snapshot_xmin(ev_snapshot), "pg_catalog".txid_snapshot_xmax(ev_snapshot), ev_type, ev_data1, ev_data2, ev_data3, ev_data4, ev_data5, ev_data6, ev_data7, ev_data8 from "_ams_cluster".sl_event e where (e.ev_origin = '999999' and e.ev_seqno > '5000000009') or (e.ev_origin = '2' and e.ev_seqno > '5000017870') or (e.ev_origin = '3' and e.ev_seqno > '5000018151') order by e.ev_origin, e.ev_seqno limit 40" - ERROR:  schema "_ams_cluster" does not exist
LINE 1: ...v_data5, ev_data6,        ev_data7, ev_data8 from "_ams_clus...
                                                             ^
2016-07-08 18:06:55 UTC [30382] ERROR  remoteWorkerThread_3: "start transaction; set enable_seqscan = off; set enable_indexscan = on; " PGRES_FATAL_ERROR ERROR:  current transaction is aborted, commands ignored until end of transaction block
2016-07-08 18:06:55 UTC [30382] ERROR  remoteWorkerThread_3: SYNC aborted
2016-07-08 18:06:55 UTC [30382] ERROR  remoteWorkerThread_2: "start transaction; set enable_seqscan = off; set enable_indexscan = on; " PGRES_FATAL_ERROR ERROR:  current transaction is aborted, commands ignored until end of transaction block
2016-07-08 18:06:55 UTC [30382] ERROR  remoteWorkerThread_2: SYNC aborted

----

---- node 999999 log ----

2016-07-08 18:06:44 UTC [558] INFO   remoteWorkerThread_1: SYNC 5000081216 done in 0.004 seconds
2016-07-08 18:06:44 UTC [558] INFO   remoteWorkerThread_2: SYNC 5000017870 done in 0.004 seconds
2016-07-08 18:06:44 UTC [558] INFO   remoteWorkerThread_3: SYNC 5000018150 done in 0.004 seconds
2016-07-08 18:06:44 UTC [558] INFO   remoteWorkerThread_1: SYNC 5000081217 done in 0.003 seconds
2016-07-08 18:06:44 UTC [558] WARN   remoteWorkerThread_3: got DROP NODE for local node ID
NOTICE:  Slony-I: Please drop schema "_ams_cluster"
NOTICE:  drop cascades to 171 other objects
DETAIL:  drop cascades to table _ams_cluster.sl_node
drop cascades to table _ams_cluster.sl_nodelock
drop cascades to table _ams_cluster.sl_set
drop cascades to table _ams_cluster.sl_setsync
drop cascades to table _ams_cluster.sl_table

----

              Tom J



_______________________________________________
Slony1-general mailing list
Slony1-general@lists.slony.info
http://lists.slony.info/mailman/listinfo/slony1-general

