Interesting. Of course, the behavior evident on inspection indicated that 
something like this must be happening. 
        It seems the docs could be improved on the subject of required paths. I 
recall some sections indicate it is not harmful to have a path from each node 
to every other node. What seems not to be spelled out is that for the service 
to be highly available, i.e. to have the ability to fail over, each node is 
*required* to have a path to every other node. 
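
        To make that concrete (a hypothetical three-node sketch; the hostnames 
and conninfo values are made up), a full mesh means a STORE PATH in both 
directions for every pair of nodes, N*(N-1) paths in all:

    # full mesh for a 3-node cluster; all conninfo values are placeholders
    store path (server = 1, client = 2, conninfo = 'dbname=ams host=node1');
    store path (server = 2, client = 1, conninfo = 'dbname=ams host=node2');
    store path (server = 1, client = 3, conninfo = 'dbname=ams host=node1');
    store path (server = 3, client = 1, conninfo = 'dbname=ams host=node3');
    store path (server = 2, client = 3, conninfo = 'dbname=ams host=node2');
    store path (server = 3, client = 2, conninfo = 'dbname=ams host=node3');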
        On a related point, it would be a lot more convenient if we could give 
each node a default path, instead of re-specifying the same IP for each new 
subscriber and adding a new line of conninfo to every slonik script. 
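
        In the meantime, slonik's include/define preprocessor can at least cut 
down the repetition: keep the conninfo strings in one shared file and include 
it from every script (a sketch; the file name and conninfo values are made 
up):

    # preamble.slonik -- shared by every script
    cluster name = ams_cluster;
    define node1_ci 'dbname=ams host=node1.example.com';
    define node2_ci 'dbname=ams host=node2.example.com';
    node 1 admin conninfo = @node1_ci;
    node 2 admin conninfo = @node2_ci;

    # some_script.slonik
    include <preamble.slonik>;
    store path (server = 1, client = 2, conninfo = @node1_ci);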
        Would either of these items be worth writing up in the bug tracker, 
and/or worth my providing a fix? If so, could I get the link?

        Tom ☺


On 7/2/17, 9:30 PM, "Steve Singer" <st...@ssinger.info> wrote:

    On Wed, 28 Jun 2017, Tignor, Tom wrote:
    
    >
    >   Hi Steve,
    >   Thanks for the info. I was able to repro this problem in testing, and 
    > I saw that as soon as I added the missing path back, the still-in-progress 
    > failover op continued on and completed successfully.
    >   We do issue DROP NODEs in the event we need to restore a replica from 
    > scratch, which did occur. However, the restore workflow should also issue 
    > STORE PATHs to/from the new replica node and every other node. Still 
    > investigating this.
    >   What still confuses me is the recurring “remoteWorkerThread_X: SYNC” 
    > output despite the absence of a configured path. If the path is missing, 
    > how does slon continue to get SYNC events?
    
    A slon can receive events, including SYNC events, from nodes other than 
    the event origin, as long as it has a path to that other node; events get 
    forwarded along whatever paths exist. However, a slon can only replicate 
    the actual data from a node it has a direct path to.
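
    For example, which nodes a given slon can pull data from is visible in 
    sl_path (an illustrative query; the schema name comes from the cluster 
    name, here "_ams_cluster"):

        -- direct paths usable by node 4's slon
        SELECT pa_server, pa_client, pa_conninfo
          FROM _ams_cluster.sl_path
         WHERE pa_client = 4;

    A node absent from pa_server in that result is one node 4 cannot 
    replicate data from, even if it still sees that node's events.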
    
    
    Steve
    
    
    
    >
    >   Tom ☺
    >
    >
    > On 6/27/17, 5:04 PM, "Steve Singer" <st...@ssinger.info> wrote:
    >
    >    On 06/27/2017 11:59 AM, Tignor, Tom wrote:
    >
    >
    >    The disableNode() in the log makes it look like someone did a DROP NODE.
    >
    >    If the only issue is that you're missing active paths in sl_path, you 
    >    can add/update the paths with slonik.
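    >
    >    For example (a sketch; the conninfo values are placeholders for
    >    whatever conninfo node 2 actually uses):
    >
    >        store path (server = 2, client = 4, conninfo = 'dbname=ams host=node2');
    >        store path (server = 2, client = 5, conninfo = 'dbname=ams host=node2');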
    >
    >
    >
    >
    >    >
    >    > Hello Slony-I community,
    >    >
    >    >              Hoping someone can advise on a strange and serious problem.
    >    > We performed a slony service failover yesterday. For the first time
    >    > ever, our slony service FAILOVER op errored out. We recently expanded
    >    > our cluster to 7 consumers from a single provider. There are no load
    >    > issues during normal operations. As the error output below shows,
    >    > though, our node 4 and node 5 consumers never got the events they
    >    > needed. Here’s where it gets weird: closer inspection has shown that
    >    > node 2->4 and node 2->5 path data went missing out of the service at
    >    > some point. It seems clear that’s the main issue, but in spite of that,
    >    > both node 4 and node 5 continued to find and process node 2 SYNC events
    >    > for a full week! The logs show this happened in spite of multiple restarts.
    >    >
    >    > How can this happen? If missing path data stymies the failover, wouldn’t
    >    > it also prevent normal SYNC processing?
    >    >
    >    > In the case where a failover is begun with inadequate path data, what’s
    >    > the best resolution? Can path data be quickly applied to allow failover
    >    > to succeed?
    >    >
    >    >              Thanks in advance for any insights.
    >    >
    >    > ---- failover error ----
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: NOTICE: calling restart node 1
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:55: 2017-06-26 18:33:02
    >    >
    >    > executing preFailover(1,1) on 2
    >    >
    >    > executing preFailover(1,1) on 3
    >    >
    >    > executing preFailover(1,1) on 4
    >    >
    >    > executing preFailover(1,1) on 5
    >    >
    >    > executing preFailover(1,1) on 6
    >    >
    >    > executing preFailover(1,1) on 7
    >    >
    >    > executing preFailover(1,1) on 8
    >    >
    >    > NOTICE: executing "_ams_cluster".failedNode2 on node 2
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 8 only on event 5000061654, node 4 only on event 5000061654, node 5 only on event 5000061655, node 3 only on event 5000061662, node 6 only on event 5000061654, node 7 only on event 5000061656
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061657, node 5 only on event 5000061663, node 3 only on event 5000061663, node 6 only on event 5000061663
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663, node 6 only on event 5000061663
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
    >    >
    >    > /tmp/ams-tool/ams-slony1-fastfailover-1-FR_80.67.75.105.slk:56: waiting for event (2,5000061664).  node 4 only on event 5000061663, node 5 only on event 5000061663
    >    >
    >    > ---- node 4 log archive ----
    >    >
    >    > bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath: pa_server=2 pa_client=4|restart notification' prod4/node4-pathconfig.out
    >    >
    >    > 2017-06-15 15:14:00 UTC [5688] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-15 15:14:10 UTC [8431] CONFIG storePath: pa_server=2 pa_client=4
    >    > pa_conninfo="dbname=ams
    >    >
    >    > 2017-06-15 15:53:00 UTC [8431] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-15 15:53:10 UTC [23701] CONFIG storePath: pa_server=2
    >    > pa_client=4 pa_conninfo="dbname=ams
    >    >
    >    > 2017-06-16 17:29:13 UTC [10253] CONFIG storePath: pa_server=2
    >    > pa_client=4 pa_conninfo="dbname=ams
    >    >
    >    > 2017-06-16 20:43:42 UTC [2707] CONFIG storePath: pa_server=2 pa_client=4
    >    > pa_conninfo="dbname=ams
    >    >
    >    > 2017-06-19 15:11:45 UTC [2707] CONFIG disableNode: no_id=2
    >    >
    >    > 2017-06-19 15:11:45 UTC [2707] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-20 18:40:15 UTC [31224] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-21 14:31:42 UTC [6253] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-21 14:35:26 UTC [32367] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-26 18:21:25 UTC [9278] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-26 18:33:04 UTC [28839] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-26 18:33:30 UTC [1785] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > bos-mpt5c:odin-9353 ttignor$
    >    >
    >    > ---- node 5 log archive ----
    >    >
    >    > bos-mpt5c:odin-9353 ttignor$ egrep 'disableNode: no_id=2|storePath: pa_server=2 pa_client=5|restart notification' prod5/node5-pathconfig.out
    >    >
    >    > 2017-06-15 15:13:56 UTC [20700] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-15 15:14:06 UTC [20374] CONFIG storePath: pa_server=2
    >    > pa_client=5 pa_conninfo="dbname=ams
    >    >
    >    > 2017-06-15 15:53:01 UTC [20374] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-15 15:53:11 UTC [2859] CONFIG storePath: pa_server=2 pa_client=5
    >    > pa_conninfo="dbname=ams
    >    >
    >    > 2017-06-16 17:28:19 UTC [2859] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-16 17:28:29 UTC [10753] CONFIG storePath: pa_server=2
    >    > pa_client=5 pa_conninfo="dbname=ams
    >    >
    >    > 2017-06-19 15:11:40 UTC [10753] CONFIG disableNode: no_id=2
    >    >
    >    > 2017-06-19 15:11:40 UTC [10753] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-20 18:40:11 UTC [450] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-21 14:31:41 UTC [22300] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-21 14:35:28 UTC [26777] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-26 18:21:27 UTC [28366] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-26 18:33:04 UTC [29345] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > 2017-06-26 18:33:27 UTC [1299] INFO   localListenThread: got restart
    >    > notification
    >    >
    >    > bos-mpt5c:odin-9353 ttignor$
    >    >
    >    >              Tom ☺
    >    >
    >    >
    >    >
    >    >
    >
    >
    >
    >
    

_______________________________________________
Slony1-general mailing list
Slony1-general@lists.slony.info
http://lists.slony.info/mailman/listinfo/slony1-general
