Hi again,
        I’ve made some progress here on my own. Checking the DBs of the various 
nodes that weren’t hearing my node 4, I found they had sl_event and sl_confirm 
entries for sequence# 5000071346, dated five days ago now. The node 4 DB itself 
had its sl_event_seq sequence at 5000040947. It seems clear the bad state in the 
other nodes was left over from before my last node 4 restore op. My solution was 
to advance the node 4 sequence to 5000071347. As soon as I did, I saw new node 4 
SYNC events accumulating in the other nodes’ sl_event tables. After that, store 
path ops worked fine.
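        For the record, the check and the fix on node 4 looked roughly like the 
following (a sketch; the exact statements I ran may have differed slightly):

ams=# select last_value from _ams_cluster.sl_event_seq;
-- showed 5000040947, well behind the 5000071346 the other nodes still remembered
ams=# select setval('_ams_cluster.sl_event_seq', 5000071347);
-- advance the sequence past the stale seqno so new node 4 events register as new
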
        Seems like this could be useful to others. Is there a bug fix or doc 
update to derive from this? Let me know if I should write something up more 
formally or open a ticket.


ams=# select * from _ams_cluster.sl_event where ev_origin = 4; 
 ev_origin |  ev_seqno  |         ev_timestamp          |    ev_snapshot     | ev_type | ev_data1 | ev_data2 | ev_data3 | ev_data4 | ev_data5 | ev_data6 | ev_data7 | ev_data8
-----------+------------+-------------------------------+--------------------+---------+----------+----------+----------+----------+----------+----------+----------+----------
         4 | 5000071346 | 2017-07-19 20:27:26.418196+00 | 15346449:15346449: | SYNC    |          |          |          |          |          |          |          | 
(1 row)

ams=# 

ams=# select * from _ams_cluster.sl_confirm where con_origin = 4; 
 con_origin | con_received | con_seqno  |         con_timestamp         
------------+--------------+------------+-------------------------------
          4 |            6 | 5000071346 | 2017-07-19 20:35:33.504667+00
          4 |            3 | 5000071346 | 2017-07-19 20:29:09.763466+00
          4 |            9 | 5000071346 | 2017-07-19 20:29:22.496843+00
          4 |            8 | 5000071346 | 2017-07-19 20:27:27.9303+00
          4 |            1 | 5000071346 | 2017-07-19 20:27:26.705526+00
          4 |            7 | 5000071346 | 2017-07-20 18:04:01.978874+00
(6 rows)

ams=# 


        Tom    ☺


On 7/22/17, 10:39 AM, "Tignor, Tom" <ttig...@akamai.com> wrote:

    
        Hi Steve,
        Thanks for the store path description. That’s generally what I had 
surmised. I should note: when problems arise with subscribers, we have a utility 
to drop and re-store the node, and then re-store paths to all other nodes.
        To answer your questions: node 4 has all expected state, 7*6=42 
connections, i.e.
    
        sl_path server = 1, client = 3
        sl_path server = 1, client = 4
        sl_path server = 1, client = 6
        sl_path server = 1, client = 7
        sl_path server = 1, client = 8
        sl_path server = 1, client = 9
        sl_path server = 3, client = 1
        sl_path server = 3, client = 4
        sl_path server = 3, client = 6
        sl_path server = 3, client = 7
        sl_path server = 3, client = 8
        sl_path server = 3, client = 9
        …
    
    
        All the other nodes have 37 connections. The following are missing in 
each DB:
    
        sl_path server = 3, client = 4
        sl_path server = 6, client = 4
        sl_path server = 7, client = 4
        sl_path server = 8, client = 4
        sl_path server = 9, client = 4
    
        Moreover, the sl_path server = 1, client = 4 path shows the conninfo as 
<event pending>.
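        For reference, the per-DB check behind those counts is just a listing of 
the node-4-client rows, something along these lines:
    
        select pa_server, pa_client, pa_conninfo, pa_connretry
          from _ams_cluster.sl_path
         where pa_client = 4
         order by pa_server;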
        Just a guess: is there possibly some sl_event table entry which, if 
deleted, will allow the node-4-client store path ops to get processed?
    
        Tom    ☺
    
    
    On 7/21/17, 9:53 PM, "Steve Singer" <st...@ssinger.info> wrote:
    
        On Fri, 21 Jul 2017, Tignor, Tom wrote:
        
        > 
        >  
        > 
        >                 Hello again, Slony-I community,
        > 
        >                 After our last missing path issue, we’ve taken a new 
        > interest in keeping all our path/conninfo data up to date. We have a 
        > cluster running with 7 nodes. Each has conninfo to all the others, so 
        > we expect N=7; N*(N-1) = 42 paths. We’re having persistent problems 
        > with our paths for node 4. Node 4 itself has fully accurate path data. 
        > However, all the other nodes have missing or inaccurate data for 
        > node-4-client conninfo. Specifically:
        > node 1 shows:
        > 
        >  
        > 
        >                          1 |         4 | <event pending>            |           10
        > 
        >  
        > 
        >                 For the other five nodes, the node-4-client conninfo 
        > is just missing. In other words, there are no pa_server=X, pa_client=4 
        > rows in sl_path for these nodes. Again, the node 4 DB itself shows all 
        > the paths we expect.
        > 
        >                 Does anyone have thoughts on how this is caused and 
        > how it could be fixed? Repeated “store path” operations all complete 
        > without errors but do not change state. Service restarts haven’t 
        > worked either.
        
        When you issue a store path command with client=4 server=X
        
        slonik connects to db4 and
        A) updates sl_path
        B) creates an event in sl_event of ev_type=STORE_PATH with ev_origin=4
        
        This event then needs to propagate to the other nodes in the network.
        
        When this event propagates to the other nodes, the remoteWorkerThread_4 
        in each of the other nodes will process this STORE_PATH entry, and you 
        should see a
        CONFIG storePath: pa_server=X pa_client=4
        
        message in each of the other slons.
        
        If this happens you should see the actual path in sl_path. Since you're 
        not, I assume that this isn't happening.
        
        Where on the chain of events are things breaking down?
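        One way to narrow that down, roughly (substitute your cluster schema for 
        _ams_cluster), is to compare what node 4 generated against what each 
        node has confirmed from origin 4:

        -- on node 4: was a STORE_PATH event generated for the path at all?
        -- (ev_data1/ev_data2 should carry the path's server and client)
        select ev_seqno, ev_timestamp, ev_data1, ev_data2
          from _ams_cluster.sl_event
         where ev_origin = 4 and ev_type = 'STORE_PATH';

        -- anywhere: the latest event from origin 4 that each node
        -- (con_received) has confirmed
        select con_received, max(con_seqno) as last_confirmed
          from _ams_cluster.sl_confirm
         where con_origin = 4
         group by con_received
         order by con_received;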
        
        Do you have other paths from other nodes with client=[X,Y,Z] server=4?
        
        
        Steve
        
        
        
        > 
        >                 Thanks in advance,
        > 
        >  
        > 
        >                 Tom    ☺
        > 
        >  
        > 
        >  
        > 
        > 
        >
        
    
    

_______________________________________________
Slony1-general mailing list
Slony1-general@lists.slony.info
http://lists.slony.info/mailman/listinfo/slony1-general
