Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-09-17 Thread Дмитрий Дегтярёв
Hello.

We have not been able to reproduce this problem on our test servers, and we
do not dare to apply this patch to the production servers.

While studying the problem, we found that many queries run several times
slower on the slave. On the master, heap_hot_search_buffer executes about
100 loop iterations for a given query; on the slave, the same query with
the same plan executes about 2000 iterations in heap_hot_search_buffer.
We were also able to reproduce the problem on the master, and there too
the slow queries showed up in s_lock.

We have solved this problem. A large number of queries used 4 frequently
changing indexes in which 99% of the tuples were dead. Neither autovacuum
nor even VACUUM FULL could remove these tuples because of
autovacuum_freeze_max_age.
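
Index bloat of this kind can be measured with the pgstattuple contrib
module; a minimal sketch, assuming the extension is available and
idx_name stands in for one of the 4 indexes:

  CREATE EXTENSION IF NOT EXISTS pgstattuple;
  -- Percentage of dead tuples in the index.
  SELECT dead_tuple_percent FROM pgstattuple('idx_name');
  -- B-tree-specific statistics, e.g. average leaf density.
  SELECT avg_leaf_density FROM pgstatindex('idx_name');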

We've added a cron job that, 2-3 times a day, performs CREATE INDEX
CONCURRENTLY idx_name_new; DROP INDEX CONCURRENTLY idx_name; ALTER INDEX
idx_name_new RENAME TO idx_name; for these 4 indexes (a sketch of one
rebuild cycle follows).
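
A minimal sketch of one rebuild cycle; my_table and my_col are
placeholders for the real index definition:

  -- Build a fresh copy of the bloated index without blocking writes.
  CREATE INDEX CONCURRENTLY idx_name_new ON my_table (my_col);
  -- Drop the bloated original; DROP INDEX CONCURRENTLY requires 9.2+.
  DROP INDEX CONCURRENTLY idx_name;
  -- Restore the original name so queries and tooling are unaffected.
  ALTER INDEX idx_name_new RENAME TO idx_name;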

As a result, s_lock no longer appears in perf top.


2013/8/29 Merlin Moncure 

>
> so -- are you in a position where you might be able to test this patch?
>
> merlin


[PERFORM] Cpu usage 100% on slave. s_lock problem.

2013-08-27 Thread Дмитрий Дегтярёв
Hello.

There are 2 identical servers: DELL PowerEdge™ R720, dual Intel® Xeon®
E5-2620 hexa-core CPUs, 256 GB RAM, RAID-10 of 8 x 600 GB SAS 6 Gb/s
15000 rpm disks.

$ cat /etc/fedora-release
Fedora release 19

$ postgres --version
postgres (PostgreSQL) 9.2.4

Data ~220 GB and indexes ~140 GB

iowait ~0.2-0.5. Disk activity is writes only, ~0-2 MB/sec.

PgBouncer is installed on each server, with pool size 24.

On the master at peak load: ~1200 requests/sec, ~30 ms/request avg, 24 CPUs
at ~95% - this is not a problem
$ perf top
 21,14%  [kernel] [k] isolate_freepages_block
 12,27%  [unknown][.] 0x7fc1bb303be0
  5,93%  postgres [.] hash_search_with_hash_value
  4,85%  libbz2.so.1.0.6  [.] 0xa6e0
  2,70%  postgres [.] PinBuffer
  2,34%  postgres [.] slot_deform_tuple
  1,92%  libbz2.so.1.0.6  [.] BZ2_compressBlock
  1,85%  postgres [.] LWLockAcquire
  1,69%  postgres [.] heap_page_prune_opt
  1,48%  postgres [.] _bt_checkkeys
  1,40%  [kernel] [k] page_fault
  1,36%  postgres [.] _bt_compare
  1,23%  postgres [.] heap_hot_search_buffer
  1,19%  [kernel] [k] get_pageblock_flags_group
  1,18%  postgres [.] AllocSetAlloc

On the slave: at most ~400-500 requests/sec, ~200 ms/request avg and up,
24 CPUs at ~95% - this is the problem
$ perf top
 30,10%  postgres   [.] s_lock
 22,90%  [unknown]  [.] 0x00729cfe
  4,98%  postgres   [.] RecoveryInProgress.part.9
  4,89%  postgres   [.] LWLockAcquire
  4,57%  postgres   [.] hash_search_with_hash_value
  3,50%  postgres   [.] PinBuffer
  2,31%  postgres   [.] heap_page_prune_opt
  2,27%  postgres   [.] LWLockRelease
  1,18%  postgres   [.] heap_hot_search_buffer
  1,03%  postgres   [.] AllocSetAlloc
...

At a much lower load than the master, the slave hangs in the function
s_lock. What can be done about it?

On the slave: ~300 requests/sec, ~5-8 ms/request avg, CPU usage ~20% -
normal operation under light load
$ perf top
 10,74%  postgres   [.] hash_search_with_hash_value
  4,94%  postgres   [.] PinBuffer
  4,61%  postgres   [.] AllocSetAlloc
  3,57%  postgres   [.] heap_page_prune_opt
  3,24%  postgres   [.] LWLockAcquire
  2,47%  postgres   [.] heap_hot_search_buffer
  2,11%  postgres   [.] SearchCatCache
  1,90%  postgres   [.] LWLockRelease
  1,87%  postgres   [.] _bt_compare
  1,68%  postgres   [.] FunctionCall2Coll
  1,46%  postgres   [.] _bt_checkkeys
  1,24%  postgres   [.] copyObject
  1,15%  postgres   [.] RecoveryInProgress.part.9
  1,05%  postgres   [.] slot_deform_tuple
...


Configuration Master postgres.conf
listen_addresses = '*'
max_connections = 100
shared_buffers = 200GB
work_mem = 20MB
maintenance_work_mem = 2GB
effective_io_concurrency = 4
wal_level = hot_standby
fsync = on
synchronous_commit = off
full_page_writes = on
wal_writer_delay = 200ms
checkpoint_segments = 100
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'pbzip2 -f -c %p > /opt/pg/wals/wals/%f.bz2'
max_wal_senders = 3
random_page_cost = 0.5
cpu_tuple_cost = 0.02
cpu_index_tuple_cost = 0.01
cpu_operator_cost = 0.005
effective_cache_size = 40GB
default_statistics_target = 300
logging_collector = on
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_min_duration_statement = 1000
log_checkpoints = on
log_line_prefix = '%t %p %c-%l %x %q(%u, %d, %r, %a) '
log_lock_waits = on
track_io_timing = on
track_activity_query_size = 4096
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_freeze_max_age = 15
datestyle = 'iso, dmy'
timezone = 'Europe/Moscow'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'ru_RU.UTF-8'
lc_numeric = 'ru_RU.UTF-8'
lc_time = 'ru_RU.UTF-8'
default_text_search_config = 'pg_catalog.russian'
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 1000
pg_stat_statements.track = all
max_locks_per_transaction = 264
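
Since pg_stat_statements is preloaded above, the statements behind the
load can be ranked with a query along these lines (a sketch using the 9.2
column names, where total_time is in milliseconds):

  -- Top 10 statements by cumulative execution time.
  SELECT query, calls, total_time, total_time / calls AS avg_ms
  FROM pg_stat_statements
  ORDER BY total_time DESC
  LIMIT 10;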

Configuration Slave postgres.conf
listen_addresses = '*'
max_connections = 100
shared_buffers = 200GB
work_mem = 20MB
maintenance_work_mem = 2GB
effective_io_concurrency = 4
wal_level = hot_standby
fsync = on
synchronous_commit = off
full_page_writes = on
wal_writer_delay = 200ms
commit_delay = 1000
commit_siblings = 2
checkpoint_segments = 100
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
archive_mode = on
archive_command = 'pbzip2 -f -c %p > /opt/pg/wals/wals/%f.bz2'
max_wal_senders = 4
hot_standby = on
max_standby_archive_delay = 30s
max_standby_streaming_delay = 30s
hot_standby_feedback = on
random_page_cost = 0.5
cpu_tuple_cost = 0.02
cpu_index_tuple_co