I have the failover in one script and then a ‘drop node, wait for event’ 
sequence (in that order) in a subsequent script, so there actually isn’t an 
explicit wait between the failover and the node drop. In all my testing, the 
failover has proven to be the most reliable Slony op, but given what you’ve 
said, it looks like adding a wait is in order.

        Tom    ☺


On 7/18/16, 9:13 AM, "Steve Singer" <st...@ssinger.info> wrote:

>On Mon, 18 Jul 2016, Tignor, Tom wrote:
>
>One thing I stress (and it may be very important) is that all nodes be 
>caught up, at least with respect to configuration events, before you issue 
>the drop node command.  The automatic waitfor support in slonik is supposed 
>to ensure this before submitting the drop node, but maybe it isn't working 
>as I expect, or you might be doing something to subvert it.
>
>Your error message looked like the remoteWorker_$droppedNode was trying to 
>connect to the dropped node and pull data.  I THINK the remoteWorker would 
>only be trying to pull data from the $dropped node if the local node thought 
>that the dropped node was a set origin.  This makes me think that the 
>failover wasn't confirmed by all nodes before issuing the drop node.  If 
>your drop node is in the same slonik script as the failover then slonik 
>'should' be doing this, but maybe there is a bug, or maybe you're doing 
>something to make the drop node get submitted too early.
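>For example (a rough sketch with placeholder node IDs), the single-script 
>pattern the automatic waitfor is built around would be:
>
>failover (id = 3, backup node = 1);
># slonik should wait here for all nodes to confirm the failover
># before it submits the drop node:
>drop node (id = 3, event node = 1);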
>
>Without more details it is hard to say.
>
>
>>
>>      Hi Steve,
>>      Thanks for looking into this. The context in which this occurs is an 
>> effort to move a node. The specifics are somewhat involved, but the key 
>> points are that a node is failed over, then dropped, and a new node with 
>> the same node ID is created elsewhere (another DB on another host). The 
>> recent work makes a best effort to automatically reverse the steps if a 
>> failure is encountered: fail over and drop the new node, then recreate 
>> the original. There are certainly ‘wait for event’ commands along the way, 
>> but it seems likely that not all nodes are fully caught up when the error 
>> hits. If you’re trying to reproduce, repeatedly dropping and recreating a 
>> node with the same node ID could help.
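>>      Roughly, the move looks like this (a sketch only; node IDs and 
>> conninfo strings are placeholders rather than our real config):
>>
>> failover (id = 3, backup node = 1);
>> drop node (id = 3, event node = 1);
>> # ... rebuild the database on the new host, then:
>> store node (id = 3, comment = 'replacement node', event node = 1);
>> store path (server = 1, client = 3, conninfo = 'dbname=... host=...');
>> store path (server = 3, client = 1, conninfo = 'dbname=... host=...');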
>>      For the specific problem of the race, though, it might be simpler to 
>> just ‘kill -STOP’ the slon daemons on node A, wait for node B to drop its 
>> schema, and then resume node A with ‘kill -CONT’.
>>
>>      Tom    ☺
>>
>>
>> On 7/17/16, 2:16 PM, "Steve Singer" <st...@ssinger.info> wrote:
>>
>>> On 07/12/2016 08:23 AM, Steve Singer wrote:
>>>> On 07/08/2016 03:27 PM, Tignor, Tom wrote:
>>>>>                   Hello slony group,
>>>>>
>>>>>                   I’m testing now with slony1-2.2.4. I have just recently
>>>>> produced an error which effectively stops slon processing on some node A
>>>>> due to some node B being dropped. The event reproduces only
>>>>> infrequently. As some will know, a slon daemon for a given node which
>>>>> becomes aware its node has been dropped will respond by dropping its
>>>>> cluster schema. There appears to be a race condition between the node B
>>>>> schema drop and the (surviving) node A receipt of the disableNode (drop
>>>>> node) event. If the former occurs before the latter, all the remote
>>>>> worker threads on node A enter an error state. See the log samples
>>>>> below. I resolved this the first time by deleting all the recent
>>>>> non-SYNC events from the sl_event tables, and more recently with a
>>>>> simple node A slon restart.
>>>>>
>>>>>                   Please advise if there is any ticket I should provide
>>>>> this info to, or if I should create a new one. Thanks.
>>>>>
>>>>
>>>> The Slony bug tracker is at
>>>> http://bugs.slony.info/bugzilla/
>>>>
>>>>
>>>> I assume you're saying that when the slon restarts it keeps hitting this
>>>> error and keeps restarting.
>>>>
>>>
>>> Any more hints on how you reproduce this would be helpful.
>>> I've been trying to reproduce this with no luck.
>>>
>>> At the time you issue the drop node, are all other nodes caught up to
>>> the dropped node (and the event node) with respect to configuration events?
>>>
>>> I.e., if you do
>>> sync(id=$NODE_ABOUT_TO_DROP);
>>> wait for event(wait on=$NODE_ABOUT_TO_DROP, origin=$NODE_ABOUT_TO_DROP,
>>> confirmed=all);
>>>
>>> sync(id=$EVENT_NODE_FOR_DROP);
>>> wait for event(wait on=$EVENT_NODE_FOR_DROP,
>>> origin=$EVENT_NODE_FOR_DROP, confirmed=all);
>>>
>>>
>>> (notice that I am NOT passing any timeout to the wait for).
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>>> ---- node 1 log ----
>>>>>
>>>>> 2016-07-08 18:06:31 UTC [30382] INFO   remoteWorkerThread_999999: SYNC
>>>>> 5000000008 done in 0.002 seconds
>>>>>
>>>>> 2016-07-08 18:06:33 UTC [30382] INFO   remoteWorkerThread_999999: SYNC
>>>>> 5000000009 done in 0.002 seconds
>>>>>
>>>>> 2016-07-08 18:06:33 UTC [30382] INFO   remoteWorkerThread_2: SYNC
>>>>> 5000017869 done in 0.002 seconds
>>>>>
>>>>> 2016-07-08 18:06:33 UTC [30382] INFO   remoteWorkerThread_3: SYNC
>>>>> 5000018148 done in 0.004 seconds
>>>>>
>>>>> 2016-07-08 18:06:45 UTC [30382] CONFIG remoteWorkerThread_2: update
>>>>> provider configuration
>>>>>
>>>>> 2016-07-08 18:06:45 UTC [30382] ERROR  remoteWorkerThread_3: "select
>>>>> last_value from "_ams_cluster".sl_log_status" PGRES_FATAL_ERROR ERROR:
>>>>> schema "_ams_clu\
>>>>>
>>>>> ster" does not exist
>>>>>
>>>>> LINE 1: select last_value from "_ams_cluster".sl_log_status
>>>>>
>>>>>                                  ^
>>>>>
>>>>> 2016-07-08 18:06:45 UTC [30382] ERROR  remoteWorkerThread_3: SYNC aborted
>>>>>
>>>>> 2016-07-08 18:06:45 UTC [30382] CONFIG version for "dbname=ams
>>>>>
>>>>>         host=198.18.102.45
>>>>>
>>>>>         user=ams_slony
>>>>>
>>>>>         sslmode=verify-ca
>>>>>
>>>>>         sslcert=/usr/local/akamai/.ams_certs/complete-ams_slony.crt
>>>>>
>>>>>         sslkey=/usr/local/akamai/.ams_certs/ams_slony.private_key
>>>>>
>>>>>         sslrootcert=/usr/local/akamai/etc/ssl_ca/canonical_ca_roots.pem"
>>>>> is 90119
>>>>>
>>>>> 2016-07-08 18:06:45 UTC [30382] ERROR  remoteWorkerThread_2: "select
>>>>> last_value from "_ams_cluster".sl_log_status" PGRES_FATAL_ERROR ERROR:
>>>>> schema "_ams_clu\
>>>>>
>>>>> ster" does not exist
>>>>>
>>>>> LINE 1: select last_value from "_ams_cluster".sl_log_status
>>>>>
>>>>>                                  ^
>>>>>
>>>>> 2016-07-08 18:06:45 UTC [30382] ERROR  remoteWorkerThread_2: SYNC aborted
>>>>>
>>>>> 2016-07-08 18:06:45 UTC [30382] ERROR  remoteListenThread_999999: "select
>>>>> ev_origin, ev_seqno, ev_timestamp, ev_snapshot,
>>>>> "pg_catalog".txid_snapshot_xmin(ev_snapshot),
>>>>> "pg_catalog".txid_snapshot_xmax(ev_snapshot), ev_type, ev_data1,
>>>>> ev_data2, ev_data3, ev_data4, ev_data5, ev_data6, ev_data7, ev_data8
>>>>> from "_ams_cluster".sl_event e where (e.ev_origin = '999999' and
>>>>> e.ev_seqno > '5000000009') or (e.ev_origin = '2' and e.ev_seqno >
>>>>> '5000017870') or (e.ev_origin = '3' and e.ev_seqno > '5000018151')
>>>>> order by e.ev_origin, e.ev_seqno limit 40" - ERROR: schema
>>>>> "_ams_cluster" does not exist
>>>>>
>>>>> LINE 1: ...v_data5, ev_data6,        ev_data7, ev_data8 from "_ams_clus...
>>>>>
>>>>>                                                                ^
>>>>>
>>>>> 2016-07-08 18:06:55 UTC [30382] ERROR  remoteWorkerThread_3: "start
>>>>> transaction; set enable_seqscan = off; set enable_indexscan = on; "
>>>>> PGRES_FATAL_ERROR ERROR:  current transaction is aborted, commands
>>>>> ignored until end of transaction block
>>>>>
>>>>> 2016-07-08 18:06:55 UTC [30382] ERROR  remoteWorkerThread_3: SYNC aborted
>>>>>
>>>>> 2016-07-08 18:06:55 UTC [30382] ERROR  remoteWorkerThread_2: "start
>>>>> transaction; set enable_seqscan = off; set enable_indexscan = on; "
>>>>> PGRES_FATAL_ERROR ERROR:  current transaction is aborted, commands
>>>>> ignored until end of transaction block
>>>>>
>>>>> 2016-07-08 18:06:55 UTC [30382] ERROR  remoteWorkerThread_2: SYNC aborted
>>>>>
>>>>> ----
>>>>>
>>>>> ---- node 999999 log ----
>>>>>
>>>>> 2016-07-08 18:06:44 UTC [558] INFO   remoteWorkerThread_1: SYNC
>>>>> 5000081216 done in 0.004 seconds
>>>>>
>>>>> 2016-07-08 18:06:44 UTC [558] INFO   remoteWorkerThread_2: SYNC
>>>>> 5000017870 done in 0.004 seconds
>>>>>
>>>>> 2016-07-08 18:06:44 UTC [558] INFO   remoteWorkerThread_3: SYNC
>>>>> 5000018150 done in 0.004 seconds
>>>>>
>>>>> 2016-07-08 18:06:44 UTC [558] INFO   remoteWorkerThread_1: SYNC
>>>>> 5000081217 done in 0.003 seconds
>>>>>
>>>>> 2016-07-08 18:06:44 UTC [558] WARN   remoteWorkerThread_3: got DROP NODE
>>>>> for local node ID
>>>>>
>>>>> NOTICE:  Slony-I: Please drop schema "_ams_cluster"
>>>>>
>>>>> NOTICE:  drop cascades to 171 other objects
>>>>>
>>>>> DETAIL:  drop cascades to table _ams_cluster.sl_node
>>>>>
>>>>> drop cascades to table _ams_cluster.sl_nodelock
>>>>>
>>>>> drop cascades to table _ams_cluster.sl_set
>>>>>
>>>>> drop cascades to table _ams_cluster.sl_setsync
>>>>>
>>>>> drop cascades to table _ams_cluster.sl_table
>>>>>
>>>>> ----
>>>>>
>>>>>               Tom J
>>>>>
>>>>>
>>>>>
>>>
>>
>>

_______________________________________________
Slony1-general mailing list
Slony1-general@lists.slony.info
http://lists.slony.info/mailman/listinfo/slony1-general
