More information:
I can confirm that over the course of roughly 3 hours, our sl_log_2 table
grew from 1,000 rows to over 9 million rows. I believe this is due to us
doing huge table-wide updates.
So now sl_log_2 is huge, and 'fetch 100 from LOG' is a query hitting the
cursor LOG, which is defined (in the Slony source) as:
"declare LOG cursor
for select "
" log_origin,
log_xid, log_tableid, "
" log_actionseq,
log_cmdtype, "
"
octet_length(log_cmddata), "
" case when
octet_length(log_cmddata) <= %d "
" then
log_cmddata "
" else null end "
"from %s.sl_log_2 %s
order by log_actionseq; ",
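For reference, the %d and %s placeholders are filled in by the slon daemon
at runtime (a size threshold, the cluster schema, and, I believe, a
per-sync filter). Expanded, the declaration would look roughly like the
following; the schema name _mycluster and the 8192-byte threshold are made
up for illustration, and the generated filter clause is omitted:

    declare LOG cursor for select
        log_origin, log_xid, log_tableid,
        log_actionseq, log_cmdtype,
        octet_length(log_cmddata),
        case when octet_length(log_cmddata) <= 8192
             then log_cmddata
             else null end
    from _mycluster.sl_log_2
    order by log_actionseq;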
Looking at pg_stat_activity, it looks like 'fetch 100 from LOG' has the
same transaction start time as its query start time, which suggests that
each time it fetches 100 from the log it is starting a new cursor. Is this
normal / expected?
select now() - xact_start as transaction_age,
       now() - query_start as query_age,
       current_query::varchar(80),
       waiting
from pg_stat_activity
where usename = 'slony' and current_query like '%LOG%';
transaction_age | query_age | current_query | waiting
-----------------+-----------------+----------------------+---------
00:27:01.652087 | 00:27:01.650933 | fetch 100 from LOG; | f
(1 row)
So this has to be my issue. The query that hits sl_log_2 is doing a
sequential scan; does this mean that every time we retrieve 100 rows from
the cursor, we do another sequential scan on the table to get the next 100
rows?
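To check, I'm planning to EXPLAIN the cursor's underlying SELECT directly
and see whether it reports a Seq Scan or an index scan on log_actionseq.
Just a sketch; _mycluster is a placeholder for our cluster schema and the
column list is taken from the cursor definition above:

    explain
    select log_origin, log_xid, log_tableid, log_actionseq, log_cmdtype,
           octet_length(log_cmddata), log_cmddata
    from _mycluster.sl_log_2
    order by log_actionseq;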
- Brian F
On 09/06/2012 03:09 PM, Brian Fehrle wrote:
Hi all,
postgresql 8.4
slony 1.2
centos system
I have a single-master to single-slave Slony cluster where Slony is very
far behind. There are no errors in the logs or anything, but what looks
to be happening is that the queries Slony executes against the Slony log
table on the master are taking a long time to complete.
The query is "fetch 100 from LOG;", which can take a long time to
complete, over 15 minutes at times. Each time it completes we process 1
event; normally this query takes milliseconds.
At this point, the query on the master 'fetch 100 from LOG' takes about
20 minutes to complete, and after it completes the slave processes 1
more event, and then 'fetch 100 from LOG' kicks off again and takes yet
another 20 or so minutes. So the slave is processing roughly one event
every 20 minutes.
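To put a number on the backlog, I've been watching the Slony status view;
a sketch only, with _mycluster standing in for our actual cluster schema:

    select st_origin, st_received, st_lag_num_events, st_lag_time
    from _mycluster.sl_status;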
As for a cause, I believe it's the follow-up work after adding a column
to a table in replication. After adding the column, the table was updated
to set values in the newly added column. That could generate millions of
new rows for Slony to process, which may have made the log tables so
large that queries against them are resulting in sequential scans or
something.
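If that's the case, most of the rows in sl_log_2 should belong to that one
table, which should be easy to confirm with something like this (again,
_mycluster is a placeholder for our cluster schema):

    select log_tableid, count(*)
    from _mycluster.sl_log_2
    group by log_tableid
    order by count(*) desc;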
I'm trying to dig in and see what exactly 'fetch 100 from LOG' is doing
on the master, and whether I can speed it up. Is this querying the
sl_log_<1/2> tables?
The pg table pg_notify does not have outstanding dead rows; it's at 0.
Also, of all the Slony tables in the Slony schema, the one with the most
dead rows has about 2K dead rows.
sl_log_1 has 0 rows, and sl_log_2 has about 9,326,260 rows (and zero dead
rows).
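Those row and dead-row figures come from the stats collector; roughly what
I ran (schema name assumed):

    select relname, n_live_tup, n_dead_tup
    from pg_stat_user_tables
    where schemaname = '_mycluster'
    order by n_dead_tup desc;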
I'm going to see if we can reduce the sync group size, to see whether that
reduces the query result set enough to get an index scan instead of a
sequential scan (if that's even the issue).
Any help is greatly appreciated.
- Brian F
_______________________________________________
Slony1-general mailing list
[email protected]
http://lists.slony.info/mailman/listinfo/slony1-general