On 17/07/14 11:58, Mark Kirkwood wrote:
Trying out with numa_balancing=0 seemed to get essentially the same performance, and similarly for wrapping postgres startup with --interleave. All this made me want to try with numa *really* disabled, so I rebooted the box with "numa=off" appended to the kernel cmdline. Somewhat surprisingly (to me anyway), the numbers were essentially identical. The profile, however, is quite different:
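(For reference, the NUMA experiments above boil down to commands roughly like the following - the data directory path is just an example:)

    # disable automatic NUMA balancing at runtime
    sysctl -w kernel.numa_balancing=0

    # start postgres with its memory interleaved across all NUMA nodes
    numactl --interleave=all pg_ctl -D /data/pgdata start

    # or disable NUMA entirely: append "numa=off" to the kernel command line and reboot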
A little more tweaking got some further improvement:

rwlocks patch as before
wal_buffers = 256MB
checkpoint_segments = 1920
wal_sync_method = open_datasync
LSI RAID adaptor: disable read ahead and write cache for SSD fast path mode
numa_balancing = 0

Pgbench scale 2000 again:

 clients | tps (prev) | tps (tweaked config)
---------+------------+----------------------
       6 |       8175 |                 8281
      12 |      14409 |                15896
      24 |      17191 |                19522
      48 |      23122 |                29776
      96 |      22308 |                32352
     192 |      23109 |                28804

Now recall we were seeing no actual tps changes with numa_balancing=0 or 1 (so the improvement above is from the other changes), but we figured it might be informative to try to track down what the non-numa bottlenecks looked like. We tried profiling the entire 10 minute run, which showed the stats collector as a possible source of contention:
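(Each data point above is a 10 minute pgbench run along the lines of the following - the database name and thread count are illustrative:)

    pgbench -i -s 2000 pgbench             # initialise at scale 2000
    pgbench -c 48 -j 48 -T 600 pgbench     # clients varied over 6..192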
3.86% postgres [kernel.kallsyms] [k] _raw_spin_lock_bh
|
--- _raw_spin_lock_bh
|
|--95.78%-- lock_sock_nested
| udpv6_sendmsg
| inet_sendmsg
| sock_sendmsg
| SYSC_sendto
| sys_sendto
| tracesys
| __libc_send
| |
| |--99.17%-- pgstat_report_stat
| | PostgresMain
| | ServerLoop
| | PostmasterMain
| | main
| | __libc_start_main
| |
| |--0.77%-- pgstat_send_bgwriter
| | BackgroundWriterMain
| | AuxiliaryProcessMain
| | 0x7f08efe8d453
| | reaper
| | __restore_rt
| | PostmasterMain
| | main
| | __libc_start_main
| --0.07%-- [...]
|
|--2.54%-- __lock_sock
| |
| |--91.95%-- lock_sock_nested
| | udpv6_sendmsg
| | inet_sendmsg
| | sock_sendmsg
| | SYSC_sendto
| | sys_sendto
| | tracesys
| | __libc_send
| | |
| | |--99.73%-- pgstat_report_stat
| | | PostgresMain
| | | ServerLoop
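(The profiles in this mail are perf call graphs, captured with something like:)

    perf record -a -g -- sleep 600    # system wide, for the duration of the run
    perf report --stdio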
Disabling track_counts and rerunning pgbench:
clients | tps (no counts)
---------+------------
6 | 9806
12 | 18000
24 | 29281
48 | 43703
96 | 54539
192 | 36114
While these numbers look great in the middle range (12-96 clients), the
benefit looks to be tailing off as client numbers increase. Also running
with no stats (and hence no auto vacuum or analyze) is way too scary!
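(For completeness, turning the counts off was just a config change, along these lines:)

    echo "track_counts = off" >> $PGDATA/postgresql.conf   # autovacuum/analyze depend on these counts
    pg_ctl -D $PGDATA reload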
Trying out less write heavy workloads shows that the stats overhead does not appear to be significant for *read* heavy cases, so this result above is perhaps more of a curiosity than anything (given that read heavy is more typical...and our real workload is more similar to read heavy).
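(pgbench's select-only mode is one easy way to get a read heavy comparison, e.g.:)

    pgbench -S -c 48 -j 48 -T 600 pgbench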
The profile for counts off looks like:
 4.79%  swapper  [kernel.kallsyms]  [k] read_hpet
        |
        --- read_hpet
           |
           |--97.10%-- ktime_get
           |          |
           |          |--35.24%-- clockevents_program_event
           |          |          tick_program_event
           |          |          |
           |          |          |--56.59%-- __hrtimer_start_range_ns
           |          |          |          |
           |          |          |          |--78.12%-- hrtimer_start_range_ns
           |          |          |          |          tick_nohz_restart
           |          |          |          |          tick_nohz_idle_exit
           |          |          |          |          cpu_startup_entry
           |          |          |          |          |
           |          |          |          |          |--98.84%-- start_secondary
           |          |          |          |          |
           |          |          |          |           --1.16%-- rest_init
           |          |          |          |                     start_kernel
           |          |          |          |                     x86_64_start_reservations
           |          |          |          |                     x86_64_start_kernel
           |          |          |          |
           |          |          |           --21.88%-- hrtimer_start
           |          |          |                      tick_nohz_stop_sched_tick
           |          |          |                      __tick_nohz_idle_enter
           |          |          |                      |
           |          |          |                      |--99.89%-- tick_nohz_idle_enter
           |          |          |                      |          cpu_startup_entry
           |          |          |                      |          |
           |          |          |                      |          |--98.30%-- start_secondary
           |          |          |                      |          |
           |          |          |                      |           --1.70%-- rest_init
           |          |          |                      |                     start_kernel
           |          |          |                      |                     x86_64_start_reservations
           |          |          |                      |                     x86_64_start_kernel
           |          |          |                      |
           |          |          |                       --0.11%-- [...]
           |          |          |
           |          |          |--40.25%-- hrtimer_force_reprogram
           |          |          |          __remove_hrtimer
           |          |          |          |
           |          |          |          |--89.68%-- __hrtimer_start_range_ns
           |          |          |          |          hrtimer_start
           |          |          |          |          tick_nohz_stop_sched_tick
           |          |          |          |          __tick_nohz_idle_enter
           |          |          |          |          |
           |          |          |          |          |--99.90%-- tick_nohz_idle_enter
           |          |          |          |          |          cpu_startup_entry
           |          |          |          |          |          |
           |          |          |          |          |          |--99.04%-- start_secondary
           |          |          |          |          |          |
           |          |          |          |          |           --0.96%-- rest_init
           |          |          |          |          |                     start_kernel
           |          |          |          |          |                     x86_64_start_reservations
           |          |          |          |          |                     x86_64_start_kernel
           |          |          |          |          |
           |          |          |          |           --0.10%-- [...]
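(Incidentally, all that read_hpet time suggests the box is on the hpet clocksource; the current and available clocksources can be checked, and switched to tsc if available, with:)

    cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    cat /sys/devices/system/clocksource/clocksource0/available_clocksource
    echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource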
Any thoughts on how to proceed further appreciated!
Cheers,
Mark