The sl_log_1/sl_log_2 counts fluctuate continuously. Here are three runs of
the same query, spread over about an hour:
select (select count(*) from sl_log_1) sl_log_1, (select count(*) from sl_log_2) sl_log_2;
 sl_log_1 | sl_log_2
----------+----------
    80179 |    31451
(1 row)
Time: 36.624 ms
(slony@db2:5432) [c0]
=# select (select count(*) from sl_log_1) sl_log_1, (select count(*) from sl_log_2) sl_log_2;
 sl_log_1 | sl_log_2
----------+----------
    82948 |    37270
(1 row)
Time: 42.952 ms
(slony@db2:5432) [c0]
=# select (select count(*) from sl_log_1) sl_log_1, (select count(*) from sl_log_2) sl_log_2;
 sl_log_1 | sl_log_2
----------+----------
        0 |    54862
I believe this indicates that the logs are being filled, processed, and
ultimately cleared as part of Slony's regular operation. I know there was a
bug where these logs never got cleared, which produced the log message I was
concerned about, but my servers don't appear to be experiencing that issue.
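For what it's worth, here's an additional sanity check I've started running.
It's only my guess at a useful view of this, and it uses the standard
statistics collector rather than anything Slony-specific: if the cumulative
insert counters keep climbing while the live row counts keep dropping back
down, rows are going in and the cleanup/log switch is clearing them out,
which is what the counts above suggest.

select schemaname, relname, n_tup_ins, n_live_tup, n_dead_tup
  from pg_stat_user_tables
 where relname in ('sl_log_1', 'sl_log_2');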
I think there was some mention of long-running queries. I do have a few
sessions listed as "<IDLE>", plus some long-running "vacuum analyze" commands,
in pg_stat_activity, though:
select usename, current_query, waiting, xact_start, query_start from pg_stat_activity order by query_start desc;
...
 barfoo  | <IDLE>                                                | f |                               | 2011-11-14 10:16:23.744028-08
 barfoo  | <IDLE>                                                | f |                               | 2011-11-14 10:16:19.477108-08
 barfoo  | <IDLE>                                                | f |                               | 2011-11-14 10:16:08.787367-08
 books   | <IDLE>                                                | f |                               | 2011-11-14 10:15:28.842382-08
 foobar  | <IDLE>                                                | f |                               | 2011-11-14 10:14:39.039847-08
 foobar  | <IDLE>                                                | f |                               | 2011-11-14 10:13:34.771573-08
 slony   | <IDLE>                                                | f |                               | 2011-11-14 10:12:51.091648-08
 barfoo  | <IDLE>                                                | f |                               | 2011-11-14 10:08:06.93106-08
 barfoo  | <IDLE>                                                | f |                               | 2011-11-14 10:08:06.803036-08
 foobar  | vacuum analyze;                                       | t | 2011-11-13 22:00:00.759804-08 | 2011-11-13 22:00:00.759804-08
 foobar  | <IDLE>                                                | f |                               | 2011-11-08 22:08:32.807418-08
 foobar  | vacuum analyze;                                       | t | 2011-11-06 22:00:00.848159-08 | 2011-11-06 22:00:00.848159-08
 foobar  | vacuum analyze;                                       | t | 2011-10-30 22:00:01.009049-07 | 2011-10-30 22:00:01.009049-07
 foobar  | vacuum analyze;                                       | t | 2011-10-23 22:00:00.372459-07 | 2011-10-23 22:00:00.372459-07
 foobar  | vacuum analyze;                                       | t | 2011-10-16 22:00:00.525067-07 | 2011-10-16 22:00:00.525067-07
 foobar  | vacuum analyze;                                       | t | 2011-10-09 22:00:00.091914-07 | 2011-10-09 22:00:00.091914-07
 foobar  | vacuum analyze;                                       | t | 2011-10-02 22:00:00.85743-07  | 2011-10-02 22:00:00.85743-07
 foobar  | <IDLE>                                                | f |                               | 2011-10-01 17:23:05.776958-07
 foobar  | <IDLE>                                                | f |                               | 2011-10-01 17:23:05.755754-07
 foobar  | <IDLE>                                                | f |                               | 2011-10-01 17:23:05.743447-07
 foobar  | autovacuum: VACUUM c0.assets (to prevent wraparound)  | f | 2011-09-29 22:12:30.92651-07  | 2011-09-29 22:12:30.92651-07
(766 rows)
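Since Steve pointed out that it's an old open transaction that prevents the
log switch, here's a narrower check I plan to keep around. It's plain
pg_stat_activity, nothing Slony-specific (procpid/current_query are the
pre-9.2 column names, matching the query above):

select procpid, usename, waiting, now() - xact_start as xact_age, current_query
  from pg_stat_activity
 where xact_start is not null
 order by xact_start
 limit 10;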
I hadn't noticed the hung vacuums before. They are part of a weekly vacuum
process that appears to be getting stuck every week. They don't appear to be
affecting the database's performance, but they concern me. I think I should
probably kill their backend processes to clear them out.
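Assuming nobody sees a problem with that, my plan is roughly the following.
This is only a sketch: I'd verify each procpid by hand in pg_stat_activity
first, and I'd leave the wraparound autovacuum alone.

-- ask the stuck weekly "vacuum analyze" backends to cancel (sends SIGINT)
select procpid, pg_cancel_backend(procpid)
  from pg_stat_activity
 where current_query = 'vacuum analyze;'
   and waiting;

-- if one of them ignores the cancel, pg_terminate_backend() (8.4 and later)
-- is the harder stop, e.g.: select pg_terminate_backend(<procpid>);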
Mike Wilson
Predicate Logic
Cell: (310) 600-8777
SkypeID: lycovian
On Nov 14, 2011, at 7:10 AM, Steve Singer wrote:
> On 11-11-11 03:46 PM, Mike Wilson wrote:
>> General lag on the slave node (as recorded in sl_status) is less than 30
>> seconds. This is a heavily transacted system running on very nice hardware,
>> so perhaps any problems are being masked by that.
>>
>> I've read up on the issue and we don't appear to be experiencing any of the
>> bugs related to it that I can find in the newsgroups. No long-running
>> transactions, no old nodes in the sl_ tables. In general, the system
>> appears to be healthy: idle proc time ~95%, good buffer cache hit
>> ratios, etc.
>>
>> Thanks for the replies though. I'll look into implementing 2.1, although we
>> just did the upgrade to 2.0.7 and I'm not sure management will go for
>> another downtime during the holiday season. Just doing my due diligence, as
>> our load will rise steadily through the holiday season to a very large load
>> on these servers, and I wanted to make sure they looked solid before we
>> throw 30X the current load at them.
>
> Now that a few days have passed, is your sl_log_1 count still growing or has
> it dropped? If your sl_log
> tables keep growing and aren't being truncated by the cleanup thread then you
> have a problem that will eventually get worse. If the 119,239 rows in
> sl_log_1 was a temporary thing due to your application doing lots of updates
> then that might be normal for your system.
>
>>
>> Mike Wilson
>> Predicate Logic
>> Cell: (310) 600-8777
>> SkypeID: lycovian
>>
>>
>>
>>
>> On Nov 11, 2011, at 11:09 AM, Steve Singer wrote:
>>
>>> On 11-11-11 02:04 PM, Mike Wilson wrote:
>>>>
>>>> Mike Wilson
>>>> Predicate Logic
>>>> Cell: (310) 600-8777
>>>> SkypeID: lycovian
>>>>
>>>>
>>>> From my postgresql.log:
>>>> 2011-11-11 11:03:15.237 PST db1.lax.jib(55096):LOG: duration: 133.011 ms statement: fetch 500 from LOG;
>>>> 2011-11-11 11:03:17.241 PST db1.lax.jib(55096):LOG: duration: 134.842 ms statement: fetch 500 from LOG;
>>>> 2011-11-11 11:03:19.239 PST db1.lax.jib(55096):LOG: duration: 133.919 ms statement: fetch 500 from LOG;
>>>> 2011-11-11 11:03:21.240 PST db1.lax.jib(55096):LOG: duration: 133.194 ms statement: fetch 500 from LOG;
>>>> 2011-11-11 11:03:23.241 PST db1.lax.jib(55096):LOG: duration: 134.288 ms statement: fetch 500 from LOG;
>>>> 2011-11-11 11:03:25.241 PST db1.lax.jib(55096):LOG: duration: 133.226 ms statement: fetch 500 from LOG;
>>>>
>>>> I'm only logging statements that take longer than 100ms to run.
>>>>
>>>> Here is my output from sl_log1/2:
>>>> select (select count(*) from sl_log_1) sl_log_1, (select count(*) from sl_log_2) sl_log_2;
>>>>  sl_log_1 | sl_log_2
>>>> ----------+----------
>>>>    119239 |    43685
>>>
>>> The fetch is taking a long time because sl_log_1 is big. (The reason it
>>> takes so long is actually a bug that was fixed in 2.1.) sl_log_1 being that
>>> big probably means that log switching isn't happening.
>>>
>>> Do you have any nodes that are behind? (query sl_status on all your nodes)
>>> Do you have any old nodes that are still listed in sl_node that you aren't
>>> using anymore?
>>> Do (did) you have a long running transaction in your system that is
>>> preventing the log switch from taking place?
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>> On Nov 11, 2011, at 5:07 AM, Steve Singer wrote:
>>>>
>>>>> On 11-11-09 01:19 PM, Mike Wilson wrote:
>>>>>> Seeing "fetch 500 from LOG" almost continuously in my PG logs for a new
>>>>>> Slony 2.0.7 install. The previous version (2.0.3?) didn't show these
>>>>>> messages in the PG log. Researching the issue, historically, this
>>>>>> message was usually accompanied by a performance issue. This isn't the
>>>>>> case with my databases though, they appear to be running just as well as
>>>>>> ever and the lag between replicated nodes appears to be about the same
>>>>>> as the previous version.
>>>>>>
>>>>>> I guess my question is what does this message mean in this version of
>>>>>> Slony? Is it an indication of sub-optimal slon parameters?
>>>>>> slon -g 20 $SLON_CLUSTER "host=$HOSTNAME port=$PORT dbname=$DB
>>>>>> user=$USER"
>>>>>>
>>>>>> And how can I get rid of it if it's not an issue?
>>>>>>
>>>>>> Mike
>>>>>
>>>>> What is causing the 'fetch 500' statements to show up in the server log?
>>>>> Are you only logging SQL that takes longer than x milliseconds? If so how
>>>>> long are your fetch 500 statements taking? How many rows are in your
>>>>> sl_log_1 and sl_log_2?
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
_______________________________________________
Slony1-general mailing list
[email protected]
http://lists.slony.info/mailman/listinfo/slony1-general