Hi All,

We're currently running Slony 2.2.3 with four PostgreSQL 9.0 nodes.  Occasionally, since we 
upgraded from 2.1, I've been seeing "humps" where subscribers lag and then take an 
extended period of time to recover.

I can't ever reproduce it and I've come to a dead end.  I'm going to waffle a 
bit below, but I'm hoping someone can see something I'm missing.

These humps don't appear to correlate with increased activity on the 
origin, and I've been struggling to put my finger on anything aggravating the 
issue.  Today, however, I've seen the same symptoms again, and the start time of the 
lag lines up with an exclusive lock on a subscriber's replicated table whilst a 
vacuum full was run.
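
(In case it's useful to anyone checking for the same thing, a rough query along 
these lines will show ungranted locks on a subscriber -- just a sketch, and note 
the pg_stat_activity column names here are the 9.0 ones, procpid/current_query:)

  -- rough check for lock waits on the subscriber (PostgreSQL 9.0 catalog columns)
  SELECT l.pid                AS waiting_pid,
         a.current_query      AS waiting_query,
         l.relation::regclass AS locked_relation,
         l.mode
  FROM   pg_locks l
  JOIN   pg_stat_activity a ON a.procpid = l.pid
  WHERE  NOT l.granted;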

Whilst I'd expect the vacuum full to cause some lag and a bit of a backlog, it 
took only 2 minutes, yet the lag keeps building gradually afterwards.  
Replication does eventually catch up after a long time, but the recovery is out 
of all proportion to our transaction rate, and restarting the slon on the 
subscriber makes it catch up very swiftly.  I've attached a graph of sl_status 
from the origin showing the time and event lag building up, with a pretty sharp 
drop at the end of each hump where I restart the slons.
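
(For context, the graph is just built from the lag columns in sl_status on the 
origin -- roughly this sort of query, sketched from memory:)

  -- per-subscriber lag as seen from the origin
  SELECT st_origin,
         st_received,
         st_lag_num_events,
         st_lag_time
  FROM   "_main_replication".sl_status
  ORDER  BY st_received;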

The graph (attached) shows nodes 4 & 5 starting to lag first, as they were the 
first to have the vacuum full run; node 7 then starts to lag when the same 
vacuum full is run there (at that point the lag on the other two nodes hadn't 
been noticed).  This excerpt from one of the subscribers' logs shows the COPY 
being blocked:

2015-01-29 10:09:54 GMT [13246]: [39-1] 
app=slon.local_cleanup,user=slony,db=X,host=somehost NOTICE:  Slony-I: 
Logswitch to sl_log_1 initiated
2015-01-29 10:09:54 GMT [13246]: [40-1] 
app=slon.local_cleanup,user=slony,db=X,host=somehost CONTEXT:  SQL statement 
"SELECT "_main_replication".logswitch_start()"
2015-01-29 10:12:04 GMT [13243]: [9-1] 
app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost LOG:  duration: 
5089.684 ms  statement: COPY "_main_replication"."sl_log_1" ( log_origin, 
log_txid,log_tableid,log_actionseq,log_tablenspname, log_tablerelname, 
log_cmdtype, log_cmdupdncols,log_cmdargs) FROM STDIN
2015-01-29 10:19:05 GMT [13243]: [10-1] 
app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost LOG:  process 13243 
still waiting for RowExclusiveLock on relation 279233 of database 274556 after 
1000.038 ms at character 13
2015-01-29 10:19:05 GMT [13243]: [11-1] 
app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost QUERY:  UPDATE ONLY 
"myschema"."table_being_full_vacuumed" SET "text" = $1 WHERE "address" = $2;
2015-01-29 10:19:05 GMT [13243]: [12-1] 
app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost CONTEXT:  COPY 
sl_log_1, line 37: "8    1108084090    2    1219750937    myschema    
table_being_full_vacuumed    U    1    {text,"",address,some_address_data}"
2015-01-29 10:19:05 GMT [13243]: [13-1] 
app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost STATEMENT:  COPY 
"_main_replication"."sl_log_1" ( log_origin, 
log_txid,log_tableid,log_actionseq,log_tablenspname, log_tablerelname, 
log_cmdtype, log_cmdupdncols,log_cmdargs) FROM STDIN
2015-01-29 10:20:43 GMT [13243]: [14-1] 
app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost LOG:  process 13243 
acquired RowExclusiveLock on relation 279233 of database 274556 after 98754.902 
ms at character 13
2015-01-29 10:20:43 GMT [13243]: [15-1] 
app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost QUERY:  UPDATE ONLY 
"myschema"."table_being_full_vacuumed" SET "text" = $1 WHERE "address" = $2;
2015-01-29 10:20:43 GMT [13243]: [16-1] 
app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost CONTEXT:  COPY 
sl_log_1, line 37: "8    1108084090    2    1219750937    myschema    
table_being_full_vacuumed    U    1    {text,"",address,some_address_data}"
2015-01-29 10:20:43 GMT [13243]: [17-1] 
app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost STATEMENT:  COPY 
"_main_replication"."sl_log_1" ( log_origin, 
log_txid,log_tableid,log_actionseq,log_tablenspname, log_tablerelname, 
log_cmdtype, log_cmdupdncols,log_cmdargs) FROM STDIN
2015-01-29 10:20:43 GMT [13243]: [18-1] 
app=slon.remoteWorkerThread_8,user=slony,db=X,host=somehost LOG:  duration: 
98915.154 ms  statement: COPY "_main_replication"."sl_log_1" ( log_origin, 
log_txid,log_tableid,log_actionseq,log_tablenspname, log_tablerelname, 
log_cmdtype, log_cmdupdncols,log_cmdargs) FROM STDIN
2015-01-29 10:22:00 GMT [13246]: [41-1] 
app=slon.local_cleanup,user=slony,db=X,host=somehost NOTICE:  Slony-I: log 
switch to sl_log_1 complete - truncate sl_log_2
2015-01-29 10:22:00 GMT [13246]: [42-1] 
app=slon.local_cleanup,user=slony,db=X,host=somehost CONTEXT:  PL/pgSQL 
function "cleanupevent" line 94 at assignment
2015-01-29 10:34:01 GMT [13246]: [43-1] 
app=slon.local_cleanup,user=slony,db=X,host=somehost NOTICE:  Slony-I: 
Logswitch to sl_log_2 initiated
2015-01-29 10:34:01 GMT [13246]: [44-1] 
app=slon.local_cleanup,user=slony,db=X,host=somehost CONTEXT:  SQL statement 
"SELECT "_main_replication".logswitch_start()"
2015-01-29 10:46:08 GMT [13246]: [45-1] 
app=slon.local_cleanup,user=slony,db=X,host=somehost NOTICE:  Slony-I: could 
not lock sl_log_1 - sl_log_1 not truncated
2015-01-29 10:46:08 GMT [13246]: [46-1] 
app=slon.local_cleanup,user=slony,db=X,host=somehost CONTEXT:  PL/pgSQL 
function "cleanupevent" line 94 at assignment

After this the COPYs go through cycles of increasing and decreasing duration, 
which I'm guessing is normal (perhaps SYNCs being grouped?).  I'm also seeing 
"could not lock sl_log_1 - sl_log_1 not truncated" a couple of times before the 
log switch completes; again I'm guessing that's just the truncate being blocked 
by the inserts capturing changes, and is normal?  Autovacuum hasn't touched the 
sl_log tables at all during this period.
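
(For reference, the (auto)vacuum times for the log tables can be checked on the 
subscriber with something like:)

  -- last (auto)vacuum times and dead tuple counts for the Slony log tables
  SELECT relname,
         n_live_tup,
         n_dead_tup,
         last_vacuum,
         last_autovacuum
  FROM   pg_stat_all_tables
  WHERE  schemaname = '_main_replication'
    AND  relname IN ('sl_log_1', 'sl_log_2');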

Does anyone have any ideas?  I have debug logs from the slons and the postgres 
logs, which I can send off-list if they'd help.

Thanks
Glyn