Re: [PERFORM] 60 core performance with 9.3
Mark, Is the 60-core machine using some of the Intel chips which have 20 hyperthreaded virtual cores? If so, I've been seeing some performance issues on these processors. I'm currently doing a side-by-side hyperthreading on/off test.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] 60 core performance with 9.3
On 15/08/14 06:18, Josh Berkus wrote:
> Mark, Is the 60-core machine using some of the Intel chips which have 20
> hyperthreaded virtual cores? If so, I've been seeing some performance
> issues on these processors. I'm currently doing a side-by-side
> hyperthreading on/off test.

Hi Josh,

The board has 4 sockets with E7-4890 v2 CPUs. They have 15 cores/30 threads each. We're running with hyperthreading off (we noticed the usual steep/sudden scaling dropoff with it on).

What model are your 20-core CPUs?

Cheers

Mark
Re: [PERFORM] 60 core performance with 9.3
On 01/08/14 09:38, Alvaro Herrera wrote:
> Matt Clarkson wrote:
>> The LWLOCK_STATS below suggest that ProcArrayLock might be the main
>> source of locking that's causing throughput to take a dive as the client
>> count increases beyond the core count. Any thoughts or comments on these
>> results are welcome!
> Do these results change if you use Heikki's patch for CSN-based
> snapshots? See
> http://www.postgresql.org/message-id/539ad153.9000...@vmware.com for the
> patch (but note that you need to apply it on top of 89cf2d52030 in the
> master branch -- maybe it applies to the 9.4 branch too, but I didn't
> try).

Hi Alvaro,

Applying the CSN patch on top of the rwlock + numa patches in 9.4 (bit of a patch-fest we have here now) shows modest improvement at the highest client number (but appears to hurt performance in the mid range):

 clients | tps
---------+-------
       6 |  8445
      12 | 14548
      24 | 20043
      48 | 27451
      96 | 27718
     192 | 23614
     384 | 24737

Initial runs were quite disappointing, until we moved the csnlog directory onto the same filesystem that the xlogs are on (PCIe SSD). We could potentially look at locating them on their own separate volume if that makes sense.
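The relocation described above can be done with a symlink while the server is stopped. A minimal sketch only: the data directory path is an assumption, and the `pg_csnlog` directory name is an assumption about Heikki's patch, not something confirmed in this thread.

```shell
# Sketch: move the csnlog onto the fast (PCIe SSD) filesystem that already
# holds pg_xlog, then symlink it back into the data directory.
# Both the data directory path and the "pg_csnlog" name are assumptions.
pg_ctl -D /var/lib/postgresql/9.4/main stop
mv /var/lib/postgresql/9.4/main/pg_csnlog /mnt/pcie-ssd/pg_csnlog
ln -s /mnt/pcie-ssd/pg_csnlog /var/lib/postgresql/9.4/main/pg_csnlog
pg_ctl -D /var/lib/postgresql/9.4/main start
```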
Adding in LWLOCK stats again shows quite a different picture from the previous:

48 clients

Lock                 |      Blk | SpinDelay |     Blk % | SpinDelay %
---------------------+----------+-----------+-----------+------------
WALWriteLock         | 25426001 |      1239 | 62.227442 |   14.373550
CLogControlLock      |  1793739 |      1376 |  4.389986 |   15.962877
ProcArrayLock        |  1007765 |      1305 |  2.466398 |   15.139211
CSNLogControlLock    |   609556 |      1722 |  1.491824 |   19.976798
WALInsertLocks 4     |   994170 |       247 |  2.433126 |    2.865429
WALInsertLocks 7     |   983497 |       243 |  2.407005 |    2.819026
WALInsertLocks 5     |   993068 |       239 |  2.430429 |    2.772622
WALInsertLocks 3     |   991446 |       229 |  2.426459 |    2.656613
WALInsertLocks 0     |   964185 |       235 |  2.359741 |    2.726218
WALInsertLocks 1     |   995237 |       221 |  2.435737 |    2.563805
WALInsertLocks 2     |   997593 |       213 |  2.441503 |    2.470998
WALInsertLocks 6     |   978178 |       201 |  2.393987 |    2.331787
BufFreelistLock      |   887194 |       206 |  2.171313 |    2.389791
XidGenLock           |   327385 |       366 |  0.801240 |    4.245940
CheckpointerCommLock |   104754 |       151 |  0.256374 |    1.751740
WALBufMappingLock    |   274226 |         7 |  0.671139 |    0.081206

96 clients

Lock                 |      Blk | SpinDelay |     Blk % | SpinDelay %
---------------------+----------+-----------+-----------+------------
WALWriteLock         | 30097625 |      9616 | 48.550747 |   19.068393
CLogControlLock      |  3193429 |     13490 |  5.151349 |   26.750481
ProcArrayLock        |  2007103 |     11754 |  3.237676 |   23.308017
CSNLogControlLock    |  1303172 |      5022 |  2.102158 |    9.958556
BufFreelistLock      |  1921625 |      1977 |  3.099790 |    3.920363
WALInsertLocks 0     |  2011855 |       681 |  3.245341 |    1.350413
WALInsertLocks 5     |  1829266 |       627 |  2.950805 |    1.243332
WALInsertLocks 7     |  1806966 |       632 |  2.914833 |    1.253247
WALInsertLocks 4     |  1847372 |       591 |  2.980012 |    1.171945
WALInsertLocks 1     |  1948553 |       557 |  3.143228 |    1.104523
WALInsertLocks 6     |  1818717 |       582 |  2.933789 |    1.154098
WALInsertLocks 3     |  1873964 |       552 |  3.022908 |    1.094608
WALInsertLocks 2     |  1912007 |       523 |  3.084276 |    1.037102
XidGenLock           |   512521 |       699 |  0.826752 |    1.386107
CheckpointerCommLock |   386853 |       711 |  0.624036 |    1.409903
WALBufMappingLock    |   546462 |        65 |  0.881503 |    0.128894

384 clients

Lock                 |      Blk | SpinDelay |     Blk % | SpinDelay %
---------------------+----------+-----------+-----------+------------
WALWriteLock         | 20703796 |     87265 | 27.749961 |   15.360068
CLogControlLock      |  3273136 |    122616 |  4.387089 |   21.582422
ProcArrayLock        |  3969918 |    100730 |  5.321008 |   17.730128
CSNLogControlLock    |  3191989 |    115068 |  4.278325 |   20.253851
BufFreelistLock      |  2014218 |     27952 |  2.699721 |    4.920009
WALInsertLocks 0     |  2750082 |      5438 |  3.686023 |    0.957177
WALInsertLocks 1     |  2584155 |      5312 |  3.463626 |    0.934999
WALInsertLocks 2     |  2477782 |      5497 |  3.321051 |    0.967562
WALInsertLocks 4     |  2375977 |      5441 |  3.184598 |    0.957705
WALInsertLocks 5     |  2349769 |      5458 |  3.149471 |    0.960697
WALInsertLocks 6     |  2329982 |      5367 |  3.122950 |    0.944680
WALInsertLocks 3     |  2415965 |      4771 |  3.238195 |    0.839774
WALInsertLocks 7     |  2316144 |      4930 |  3.104402 |    0.867761
CheckpointerCommLock |   584419 |     10794 |  0.783316 |    1.899921
XidGenLock           |   391212 |
Re: [PERFORM] 60 core performance with 9.3
Matt Clarkson wrote:
> The LWLOCK_STATS below suggest that ProcArrayLock might be the main
> source of locking that's causing throughput to take a dive as the client
> count increases beyond the core count. Any thoughts or comments on these
> results are welcome!

Do these results change if you use Heikki's patch for CSN-based snapshots? See http://www.postgresql.org/message-id/539ad153.9000...@vmware.com for the patch (but note that you need to apply it on top of 89cf2d52030 in the master branch -- maybe it applies to the 9.4 branch too, but I didn't try).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PERFORM] 60 core performance with 9.3
On 30 Červenec 2014, 3:44, Mark Kirkwood wrote:
> While these numbers look great in the middle range (12-96 clients), the
> benefit looks to be tailing off as client numbers increase. Also running
> with no stats (and hence no auto vacuum or analyze) is way too scary!

I assume you've disabled the statistics collector, which has nothing to do with vacuum or analyze. There are two kinds of statistics in PostgreSQL - data distribution statistics (which are collected by ANALYZE and stored in actual tables within the database) and runtime statistics (which are collected by the stats collector and stored in a file somewhere on the disk).

By disabling the statistics collector you lose runtime counters - number of sequential/index scans on a table, tuples read from a relation etc. But it does not influence VACUUM or planning at all. Also, it's mostly async (send over UDP and you're done) and shouldn't make much difference unless you have a large number of objects. There are ways to improve this (e.g. by placing the stat files into a tmpfs).

Tomas
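Placing the stat files on a tmpfs, as suggested above, can be sketched roughly as follows. The mount point, size, and ownership are illustrative assumptions, not values from the thread; `stats_temp_directory` is the relevant GUC in 9.3/9.4.

```shell
# Sketch: back the stats collector's temp files with RAM via tmpfs.
# Mount point, size, and ownership are illustrative assumptions.
sudo mkdir -p /var/run/pg_stats_tmp
sudo mount -t tmpfs -o size=64M,uid=postgres,gid=postgres tmpfs /var/run/pg_stats_tmp

# postgresql.conf:
#   stats_temp_directory = '/var/run/pg_stats_tmp'
# then reload, e.g.: pg_ctl -D /path/to/data reload
```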
Re: [PERFORM] 60 core performance with 9.3
Tomas Vondra t...@fuzzy.cz writes:
> On 30 Červenec 2014, 3:44, Mark Kirkwood wrote:
>> While these numbers look great in the middle range (12-96 clients), the
>> benefit looks to be tailing off as client numbers increase. Also running
>> with no stats (and hence no auto vacuum or analyze) is way too scary!
> By disabling the statistics collector you lose runtime counters - number
> of sequential/index scans on a table, tuples read from a relation etc.
> But it does not influence VACUUM or planning at all.

It does break autovacuum.

regards, tom lane
Re: [PERFORM] 60 core performance with 9.3
On 30 Červenec 2014, 14:39, Tom Lane wrote:
> Tomas Vondra t...@fuzzy.cz writes:
>> On 30 Červenec 2014, 3:44, Mark Kirkwood wrote:
>>> While these numbers look great in the middle range (12-96 clients), the
>>> benefit looks to be tailing off as client numbers increase. Also
>>> running with no stats (and hence no auto vacuum or analyze) is way too
>>> scary!
>> By disabling the statistics collector you lose runtime counters - number
>> of sequential/index scans on a table, tuples read from a relation etc.
>> But it does not influence VACUUM or planning at all.
> It does break autovacuum.

Of course, you're right. It throws away info about how much data was modified and when the table was last (auto)vacuumed. This is clear proof that I really need to drink at least one cup of coffee before doing anything in the morning.

Tomas
Re: [PERFORM] 60 core performance with 9.3
Hi Tomas,

Unfortunately I think you are mistaken - disabling the stats collector (i.e. track_counts = off) means that autovacuum has no idea about when/if it needs to start a worker (as it uses those counts to decide), and hence you lose all automatic vacuum and analyze as a result.

With respect to comments like "it shouldn't make much difference" etc, well the profile suggests otherwise, and the change in tps numbers supports the observation.

regards

Mark

On 30/07/14 20:42, Tomas Vondra wrote:
> On 30 Červenec 2014, 3:44, Mark Kirkwood wrote:
>> While these numbers look great in the middle range (12-96 clients), the
>> benefit looks to be tailing off as client numbers increase. Also running
>> with no stats (and hence no auto vacuum or analyze) is way too scary!
> I assume you've disabled the statistics collector, which has nothing to
> do with vacuum or analyze. There are two kinds of statistics in
> PostgreSQL - data distribution statistics (which are collected by ANALYZE
> and stored in actual tables within the database) and runtime statistics
> (which are collected by the stats collector and stored in a file
> somewhere on the disk).
> By disabling the statistics collector you lose runtime counters - number
> of sequential/index scans on a table, tuples read from a relation etc.
> But it does not influence VACUUM or planning at all. Also, it's mostly
> async (send over UDP and you're done) and shouldn't make much difference
> unless you have a large number of objects. There are ways to improve this
> (e.g. by placing the stat files into a tmpfs).
> Tomas
Re: [PERFORM] 60 core performance with 9.3
On 31/07/14 00:47, Tomas Vondra wrote:
> On 30 Červenec 2014, 14:39, Tom Lane wrote:
>> Tomas Vondra t...@fuzzy.cz writes:
>>> On 30 Červenec 2014, 3:44, Mark Kirkwood wrote:
>>>> While these numbers look great in the middle range (12-96 clients),
>>>> the benefit looks to be tailing off as client numbers increase. Also
>>>> running with no stats (and hence no auto vacuum or analyze) is way too
>>>> scary!
>>> By disabling the statistics collector you lose runtime counters -
>>> number of sequential/index scans on a table, tuples read from a
>>> relation etc. But it does not influence VACUUM or planning at all.
>> It does break autovacuum.
> Of course, you're right. It throws away info about how much data was
> modified and when the table was last (auto)vacuumed. This is clear proof
> that I really need to drink at least one cup of coffee before doing
> anything in the morning.

Lol - thanks for taking a look anyway. Yes, coffee is often an important part of the exercise.

Regards

Mark
Re: [PERFORM] 60 core performance with 9.3
I've been assisting Mark with the benchmarking of these new servers. The drop off in both throughput and CPU utilisation that we've been observing as the client count increases has led me to investigate which lwlocks are dominant at different client counts.

I've recompiled postgres with Andres' LWLock improvements, Kevin's libnuma patch and with LWLOCK_STATS enabled. The LWLOCK_STATS below suggest that ProcArrayLock might be the main source of locking that's causing throughput to take a dive as the client count increases beyond the core count.

wal_buffers = 256MB
checkpoint_segments = 1920
wal_sync_method = open_datasync

pgbench -s 2000 -T 600

Results:

 clients | tps
---------+-------
       6 |  9490
      12 | 17558
      24 | 25681
      48 | 41175
      96 | 48954
     192 | 31887
     384 | 15564

LWLOCK_STATS at 48 clients

Lock              |    Blk | SpinDelay | Blk % | SpinDelay %
------------------+--------+-----------+-------+------------
BufFreelistLock   |  31144 |        11 |  1.64 |        1.62
ShmemIndexLock    |    192 |         1 |  0.01 |        0.15
OidGenLock        |  32648 |        14 |  1.72 |        2.06
XidGenLock        |  35731 |        18 |  1.88 |        2.64
ProcArrayLock     | 291121 |       215 | 15.36 |       31.57
SInvalReadLock    |  32136 |        13 |  1.70 |        1.91
SInvalWriteLock   |  32141 |        12 |  1.70 |        1.76
WALBufMappingLock |  31662 |        15 |  1.67 |        2.20
WALWriteLock      | 825380 |        45 | 36.31 |        6.61
CLogControlLock   | 583458 |       337 | 26.93 |       49.49

LWLOCK_STATS at 96 clients

Lock              |     Blk | SpinDelay | Blk % | SpinDelay %
------------------+---------+-----------+-------+------------
BufFreelistLock   |   62954 |        12 |  1.54 |        0.27
ShmemIndexLock    |   62635 |         4 |  1.54 |        0.09
OidGenLock        |   92232 |        22 |  2.26 |        0.50
XidGenLock        |   98326 |        18 |  2.41 |        0.41
ProcArrayLock     |  928871 |      3188 | 22.78 |       72.57
SInvalReadLock    |   58392 |        13 |  1.43 |        0.30
SInvalWriteLock   |   57429 |        14 |  1.41 |        0.32
WALBufMappingLock |  138375 |        14 |  3.39 |        0.32
WALWriteLock      | 1480707 |        42 | 36.31 |        0.96
CLogControlLock   | 1098239 |      1066 | 26.93 |       27.27

LWLOCK_STATS at 384 clients

Lock              |     Blk | SpinDelay | Blk % | SpinDelay %
------------------+---------+-----------+-------+------------
BufFreelistLock   |  184298 |       158 |  1.93 |        0.03
ShmemIndexLock    |  183573 |       164 |  1.92 |        0.03
OidGenLock        |  184558 |       173 |  1.93 |        0.03
XidGenLock        |  200239 |       213 |  2.09 |        0.04
ProcArrayLock     | 4035527 |    579666 | 42.22 |       98.62
SInvalReadLock    |  182204 |       152 |  1.91 |        0.03
SInvalWriteLock   |  182898 |       137 |  1.91 |        0.02
WALBufMappingLock |  219936 |       215 |  2.30 |        0.04
WALWriteLock      | 3172725 |       457 | 24.67 |        0.08
CLogControlLock   | 1012458 |      6423 | 10.59 |        1.09

The same test done with a readonly workload shows virtually no SpinDelay at all.

Any thoughts or comments on these results are welcome!

Regards,
Matt.
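As a sanity check on the percentage columns, the SpinDelay shares can be recomputed from the raw counters. A sketch using four of the 384-client SpinDelay figures quoted above (only this subset of locks, so the shares are relative to these four, not the full lock table):

```shell
# Recompute SpinDelay shares for the 384-client run from the raw counters
# quoted above. Only four locks are included, so shares are relative to
# this subset rather than the full lock table.
printf '%s\n' "ProcArrayLock 579666" "CLogControlLock 6423" \
              "WALWriteLock 457" "XidGenLock 213" |
awk '{ name[NR] = $1; n[NR] = $2; total += $2 }
     END { for (i = 1; i <= NR; i++)
               printf "%-16s %6.2f%%\n", name[i], 100 * n[i] / total }'
```

Even against this subset, ProcArrayLock accounts for almost all of the spin delay at 384 clients, which matches the 98.62% figure in the table.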
Re: [PERFORM] 60 core performance with 9.3
On 17/07/14 11:58, Mark Kirkwood wrote:
> Trying out with numa_balancing=0 seemed to get essentially the same
> performance. Similarly wrapping postgres startup with --interleave. All
> this made me want to try with numa *really* disabled. So rebooted the box
> with numa=off appended to the kernel cmdline. Somewhat surprisingly (to
> me anyway), the numbers were essentially identical. The profile, however,
> is quite different:

A little more tweaking got some further improvement:

rwlocks patch as before
wal_buffers = 256MB
checkpoint_segments = 1920
wal_sync_method = open_datasync
LSI RAID adaptor: disable read ahead and write cache for SSD fast path mode
numa_balancing = 0

Pgbench scale 2000 again:

 clients | tps (prev) | tps (tweaked config)
---------+------------+----------------------
       6 |       8175 |  8281
      12 |      14409 | 15896
      24 |      17191 | 19522
      48 |      23122 | 29776
      96 |      22308 | 32352
     192 |      23109 | 28804

Now recall we were seeing no actual tps changes with numa_balancing=0 or 1 (so the improvement above is from the other changes), but figured it might be informative to try to track down what the non-numa bottlenecks looked like. We tried profiling the entire 10 minute run, which showed the stats collector as a possible source of contention:

 3.86%  postgres  [kernel.kallsyms]  [k] _raw_spin_lock_bh
        |
        --- _raw_spin_lock_bh
            |--95.78%-- lock_sock_nested
            |           udpv6_sendmsg
            |           inet_sendmsg
            |           sock_sendmsg
            |           SYSC_sendto
            |           sys_sendto
            |           tracesys
            |           __libc_send
            |           |
            |           |--99.17%-- pgstat_report_stat
            |           |           PostgresMain
            |           |           ServerLoop
            |           |           PostmasterMain
            |           |           main
            |           |           __libc_start_main
            |           |
            |           |--0.77%-- pgstat_send_bgwriter
            |           |           BackgroundWriterMain
            |           |           AuxiliaryProcessMain
            |           |           0x7f08efe8d453
            |           |           reaper
            |           |           __restore_rt
            |           |           PostmasterMain
            |           |           main
            |           |           __libc_start_main
            |            --0.07%-- [...]
            |
            |--2.54%-- __lock_sock
            |          |
            |          |--91.95%-- lock_sock_nested
            |          |           udpv6_sendmsg
            |          |           inet_sendmsg
            |          |           sock_sendmsg
            |          |           SYSC_sendto
            |          |           sys_sendto
            |          |           tracesys
            |          |           __libc_send
            |          |           |
            |          |           |--99.73%-- pgstat_report_stat
            |          |           |           PostgresMain
            |          |           |           ServerLoop

Disabling track_counts and rerunning pgbench:

 clients | tps (no counts)
---------+----------------
       6 |  9806
      12 | 18000
      24 | 29281
      48 | 43703
      96 | 54539
     192 | 36114

While these numbers look great in the middle range (12-96 clients), the benefit looks to be tailing off as client numbers increase. Also running with no stats (and hence no auto vacuum or analyze) is way too scary!

Trying out less write heavy workloads shows that the stats overhead does not appear to be significant for *read* heavy cases, so this result above is perhaps more of a curiosity than anything (given that read heavy is more typical... and our real workload is more similar to read heavy).

The profile for counts off looks like:

 4.79%  swapper  [kernel.kallsyms]  [k] read_hpet
        |
        --- read_hpet
            |--97.10%-- ktime_get
            |          |
            |          |--35.24%-- clockevents_program_event
            |          |           tick_program_event
            |          |           |
            |          |           |--56.59%-- __hrtimer_start_range_ns
            |          |           |           |
            |          |           |           |--78.12%--
Re: [PERFORM] 60 core performance with 9.3
Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:
> On 12/07/14 01:19, Kevin Grittner wrote:
>> It might be worth a test using a cpuset to interleave OS cache and the
>> NUMA patch I submitted to the current CF to see whether this is getting
>> into territory where the patch makes a bigger difference. I would expect
>> it to do much better than using numactl --interleave because work_mem
>> and other process-local memory would be allocated in near memory for
>> each process.
>> http://www.postgresql.org/message-id/1402267501.4.yahoomail...@web122304.mail.ne1.yahoo.com
> Thanks Kevin - I did try this out - seemed slightly better than using
> --interleave, but almost identical to the results posted previously.
> However looking at my postgres binary with ldd, I'm not seeing any link
> to libnuma (despite it demanding the library whilst building), so I
> wonder if my package build has somehow vanilla-ified the result :-(

That is odd; not sure what to make of that!

> Also I am guessing that with 60 cores I do:
> $ sudo /bin/bash -c "echo 0-59 > /dev/cpuset/postgres/cpus"
> i.e. cpus are cores, not packages...?

Right; basically, as a guide, you can use the output from:

$ numactl --hardware

Use the union of all the cpu numbers from the "node n cpus" lines. The above command is also a good way to see how unbalanced memory usage has become while running a test.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
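A minimal sketch of the cpuset setup being discussed, assuming the legacy cpuset filesystem is mounted at /dev/cpuset (the mount point, the 4-node memory range, and the pidfile path are assumptions; on cgroup-based systems the paths differ):

```shell
# Sketch: create a "postgres" cpuset spanning all 60 cores and all memory
# nodes, then move the postmaster into it. Paths and node count assumed.
sudo mount -t cpuset cpuset /dev/cpuset 2>/dev/null || true
sudo mkdir -p /dev/cpuset/postgres
sudo /bin/bash -c "echo 0-59 > /dev/cpuset/postgres/cpus"
sudo /bin/bash -c "echo 0-3 > /dev/cpuset/postgres/mems"    # 4 NUMA nodes assumed
sudo /bin/bash -c "echo $(head -1 /path/to/data/postmaster.pid) > /dev/cpuset/postgres/tasks"
```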
Re: [PERFORM] 60 core performance with 9.3
On 11/07/14 20:22, Andres Freund wrote:
> On 2014-07-11 12:40:15 +1200, Mark Kirkwood wrote:
>> Full report http://paste.ubuntu.com/886/
>>
>>  8.82%  postgres  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>         |
>>         --- _raw_spin_lock_irqsave
>>             |--75.69%-- pagevec_lru_move_fn
>>             |           __lru_cache_add
>>             |           lru_cache_add
>>             |           putback_lru_page
>>             |           migrate_pages
>>             |           migrate_misplaced_page
>>             |           do_numa_page
>>             |           handle_mm_fault
>>             |           __do_page_fault
>>             |           do_page_fault
>>             |           page_fault
> So, the majority of the time is spent in numa page migration. Can you
> disable numa_balancing? I'm not sure if your kernel version does that at
> runtime or whether you need to reboot. The kernel.numa_balancing sysctl
> might work. Otherwise you probably need to boot with numa_balancing=0.
> It'd also be worthwhile to test this with numactl --interleave.

Trying out with numa_balancing=0 seemed to get essentially the same performance. Similarly wrapping postgres startup with --interleave.

All this made me want to try with numa *really* disabled. So rebooted the box with numa=off appended to the kernel cmdline. Somewhat surprisingly (to me anyway), the numbers were essentially identical. The profile, however, is quite different:

Full report at http://paste.ubuntu.com/7806285/

 4.56%  postgres  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
        |
        --- _raw_spin_lock_irqsave
            |--41.89%-- try_to_wake_up
            |          |
            |          |--96.12%-- default_wake_function
            |          |          |
            |          |          |--99.96%-- pollwake
            |          |          |           __wake_up_common
            |          |          |           __wake_up_sync_key
            |          |          |           sock_def_readable
            |          |          |           |
            |          |          |           |--99.94%-- unix_stream_sendmsg
            |          |          |           |           sock_sendmsg
            |          |          |           |           SYSC_sendto
            |          |          |           |           sys_sendto
            |          |          |           |           tracesys
            |          |          |           |           __libc_send
            |          |          |           |           pq_flush
            |          |          |           |           ReadyForQuery
            |          |          |           |           PostgresMain
            |          |          |           |           ServerLoop
            |          |          |           |           PostmasterMain
            |          |          |           |           main
            |          |          |           |           __libc_start_main
            |          |          |            --0.06%-- [...]
            |          |           --0.04%-- [...]
            |          |
            |          |--2.87%-- wake_up_process
            |          |          |
            |          |          |--95.71%-- wake_up_sem_queue_do
            |          |          |           SYSC_semtimedop
            |          |          |           sys_semop
            |          |          |           tracesys
            |          |          |           __GI___semop
            |          |          |           |
            |          |          |           |--99.75%-- LWLockRelease
            |          |          |           |           |
            |          |          |           |           |--25.09%-- RecordTransactionCommit
            |          |          |           |           |           CommitTransaction
            |          |          |           |           |           CommitTransactionCommand
            |          |          |           |           |           finish_xact_command.part.4
            |          |          |           |           |           PostgresMain
            |          |          |           |           |           ServerLoop
            |          |          |           |           |           PostmasterMain
            |          |          |           |           |           main
            |          |          |           |           |           __libc_start_main

regards

Mark

--
Re: [PERFORM] 60 core performance with 9.3
On 12/07/14 01:19, Kevin Grittner wrote:
> It might be worth a test using a cpuset to interleave OS cache and the
> NUMA patch I submitted to the current CF to see whether this is getting
> into territory where the patch makes a bigger difference. I would expect
> it to do much better than using numactl --interleave because work_mem and
> other process-local memory would be allocated in near memory for each
> process.
> http://www.postgresql.org/message-id/1402267501.4.yahoomail...@web122304.mail.ne1.yahoo.com

Thanks Kevin - I did try this out - seemed slightly better than using --interleave, but almost identical to the results posted previously.

However looking at my postgres binary with ldd, I'm not seeing any link to libnuma (despite it demanding the library whilst building), so I wonder if my package build has somehow vanilla-ified the result :-(

Also I am guessing that with 60 cores I do:

$ sudo /bin/bash -c "echo 0-59 > /dev/cpuset/postgres/cpus"

i.e. cpus are cores, not packages...? If I've stuffed it up I'll redo!

Cheers

Mark
Re: [PERFORM] 60 core performance with 9.3
On 2014-07-11 12:40:15 +1200, Mark Kirkwood wrote:
> On 01/07/14 22:13, Andres Freund wrote:
>> On 2014-07-01 21:48:35 +1200, Mark Kirkwood wrote:
>>> - cherry picking the last 5 commits into the 9.4 branch and building a
>>> package from that and retesting:
>>>
>>>  Clients | 9.4 tps 60 cores (rwlock)
>>> ---------+---------------------------
>>>        6 |  70189
>>>       12 | 128894
>>>       24 | 233542
>>>       48 | 422754
>>>       96 | 590796
>>>      192 | 630672
>>>
>>> Wow - that is more like it! Andres that is some nice work, we
>>> definitely owe you some beers for that :-) I am aware that I need to
>>> retest with an unpatched 9.4 src - as it is not clear from this data
>>> how much is due to Andres's patches and how much to the steady stream
>>> of 9.4 development. I'll post an update on that later, but figured this
>>> was interesting enough to note for now.
>> Cool. That's what I like (and expect) to see :). I don't think unpatched
>> 9.4 will show significantly different results than 9.3, but it'd be
>> good to validate that. If you do so, could you post the results in the
>> -hackers thread I just CCed you on? That'll help the work to get into
>> 9.5.
> So we seem to have nailed read only performance. Going back and
> revisiting read write performance finds:
>
> Postgres 9.4 beta
> rwlock patch
> pgbench scale = 2000
>
> max_connections = 200;
> shared_buffers = 10GB;
> maintenance_work_mem = 1GB;
> effective_io_concurrency = 10;
> wal_buffers = 32MB;
> checkpoint_segments = 192;
> checkpoint_completion_target = 0.8;
>
>  clients | tps (32 cores) | tps
> ---------+----------------+-------
>        6 |           8313 |  8175
>       12 |          11012 | 14409
>       24 |          16151 | 17191
>       48 |          21153 | 23122
>       96 |          21977 | 22308
>      192 |          22917 | 23109

On that scale - that's bigger than shared_buffers IIRC - I'd not expect the patch to make much of a difference.

> kernel.sched_autogroup_enabled=0
> kernel.sched_migration_cost_ns=500
> net.core.somaxconn=1024
> /sys/kernel/mm/transparent_hugepage/enabled [never]
>
> Full report http://paste.ubuntu.com/886/
>
>  8.82%  postgres  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>         |
>         --- _raw_spin_lock_irqsave
>             |--75.69%-- pagevec_lru_move_fn
>             |           __lru_cache_add
>             |           lru_cache_add
>             |           putback_lru_page
>             |           migrate_pages
>             |           migrate_misplaced_page
>             |           do_numa_page
>             |           handle_mm_fault
>             |           __do_page_fault
>             |           do_page_fault
>             |           page_fault

So, the majority of the time is spent in numa page migration. Can you disable numa_balancing? I'm not sure if your kernel version does that at runtime or whether you need to reboot. The kernel.numa_balancing sysctl might work. Otherwise you probably need to boot with numa_balancing=0.

It'd also be worthwhile to test this with numactl --interleave.

Greetings,

Andres Freund

--
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PERFORM] 60 core performance with 9.3
On 11/07/14 20:22, Andres Freund wrote:
> On 2014-07-11 12:40:15 +1200, Mark Kirkwood wrote:
>> Postgres 9.4 beta
>> rwlock patch
>> pgbench scale = 2000
> On that scale - that's bigger than shared_buffers IIRC - I'd not expect
> the patch to make much of a difference.

Right - we did test with it bigger (can't recall exactly how big), but will retry again after setting the numa parameters below.

>>  8.82%  postgres  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>>         |
>>         --- _raw_spin_lock_irqsave
>>             |--75.69%-- pagevec_lru_move_fn
>>             |           __lru_cache_add
>>             |           lru_cache_add
>>             |           putback_lru_page
>>             |           migrate_pages
>>             |           migrate_misplaced_page
>>             |           do_numa_page
>>             |           handle_mm_fault
>>             |           __do_page_fault
>>             |           do_page_fault
>>             |           page_fault
> So, the majority of the time is spent in numa page migration. Can you
> disable numa_balancing? I'm not sure if your kernel version does that at
> runtime or whether you need to reboot. The kernel.numa_balancing sysctl
> might work. Otherwise you probably need to boot with numa_balancing=0.
> It'd also be worthwhile to test this with numactl --interleave.

That was my feeling too - but I had no idea what the magic switch was to tame it (appears to be in 3.13 kernels), will experiment and report back.

Thanks again!

Mark
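The knobs under discussion can be exercised roughly as follows (a sketch only; the data directory path is an assumption, and whether the sysctl works at runtime depends on the kernel, as Andres notes):

```shell
# Disable automatic NUMA balancing at runtime (3.13-era kernels).
sudo sysctl -w kernel.numa_balancing=0

# If the sysctl is unavailable, boot with "numa_balancing=0" on the
# kernel command line instead.

# Alternatively, interleave the postmaster's allocations across nodes:
numactl --interleave=all pg_ctl -D /var/lib/postgresql/9.4/main start
```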
Re: [PERFORM] 60 core performance with 9.3
Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote: On 11/07/14 20:22, Andres Freund wrote: So, the majority of the time is spent in numa page migration. Can you disable numa_balancing? I'm not sure if your kernel version does that at runtime or whether you need to reboot. The kernel.numa_balancing sysctl might work. Otherwise you probably need to boot with numa_balancing=0. It'd also be worthwhile to test this with numactl --interleave. That was my feeling too - but I had no idea what the magic switch was to tame it (appears to be in 3.13 kernels), will experiment and report back. Thanks again! It might be worth a test using a cpuset to interleave OS cache and the NUMA patch I submitted to the current CF to see whether this is getting into territory where the patch makes a bigger difference. I would expect it to do much better than using numactl --interleave because work_mem and other process-local memory would be allocated in near memory for each process. http://www.postgresql.org/message-id/1402267501.4.yahoomail...@web122304.mail.ne1.yahoo.com

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [PERFORM] 60 core performance with 9.3
On 01/07/14 22:13, Andres Freund wrote: On 2014-07-01 21:48:35 +1200, Mark Kirkwood wrote: - cherry picking the last 5 commits into 9.4 branch and building a package from that and retesting: Clients | 9.4 tps 60 cores (rwlock) +-- 6 | 70189 12 | 128894 24 | 233542 48 | 422754 96 | 590796 192 | 630672 Wow - that is more like it! Andres that is some nice work, we definitely owe you some beers for that :-) I am aware that I need to retest with an unpatched 9.4 src - as it is not clear from this data how much is due to Andres's patches and how much to the steady stream of 9.4 development. I'll post an update on that later, but figured this was interesting enough to note for now. Cool. That's what I like (and expect) to see :). I don't think unpatched 9.4 will show significantly different results than 9.3, but it'd be good to validate that. If you do so, could you post the results in the -hackers thread I just CCed you on? That'll help the work to get into 9.5. So we seem to have nailed read only performance. Going back and revisiting read write performance finds: Postgres 9.4 beta rwlock patch pgbench scale = 2000 max_connections = 200; shared_buffers = 10GB; maintenance_work_mem = 1GB; effective_io_concurrency = 10; wal_buffers = 32MB; checkpoint_segments = 192; checkpoint_completion_target = 0.8; clients | tps (32 cores) | tps -++- 6| 8313 | 8175 12 | 11012 | 14409 24 | 16151 | 17191 48 | 21153 | 23122 96 | 21977 | 22308 192 | 22917 | 23109 So we are back to not doing significantly better than 32 cores. Hmmm. 
Doing quite a few more tweaks gets some better numbers:

kernel.sched_autogroup_enabled=0
kernel.sched_migration_cost_ns=500
net.core.somaxconn=1024
/sys/kernel/mm/transparent_hugepage/enabled [never]
+checkpoint_segments = 1920
+wal_buffers = 256MB;

clients | tps
--------+-------
      6 |  8366
     12 | 15988
     24 | 19828
     48 | 30315
     96 | 31649
    192 | 29497

One more:

+wal_sync_method = open_datasync

clients | tps
--------+-------
      6 |  9566
     12 | 17129
     24 | 22962
     48 | 34564
     96 | 32584
    192 | 28367

So this looks better - however I suspect 32 core performance would improve with these as well! The problem does *not* look to be connected with IO (I will include some iostat below). So time to get the profiler out (192 clients for 1 minute):

Full report: http://paste.ubuntu.com/886/

# captured on: Fri Jul 11 03:09:06 2014
# hostname : ncel-prod-db3
# os release : 3.13.0-24-generic
# perf version : 3.13.9
# arch : x86_64
# nrcpus online : 60
# nrcpus avail : 60
# cpudesc : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
# cpuid : GenuineIntel,6,62,7
# total memory : 1056692116 kB
# cmdline : /usr/lib/linux-tools-3.13.0-24/perf record -ag
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 359906321606
#
# Overhead  Command   Shared Object      Symbol
# ........  .......   .................  ..........................
    8.82%   postgres  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
            |
            --- _raw_spin_lock_irqsave
                |
                |--75.69%-- pagevec_lru_move_fn
                |           __lru_cache_add
                |           lru_cache_add
                |           putback_lru_page
                |           migrate_pages
                |           migrate_misplaced_page
                |           do_numa_page
                |           handle_mm_fault
                |           __do_page_fault
                |           do_page_fault
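The hot stack above runs through do_numa_page -> migrate_misplaced_page -> migrate_pages, i.e. the kernel's automatic NUMA balancing, so most of the spinlock time is page-migration traffic rather than PostgreSQL lock contention. One experiment worth trying - an assumption on my part, not something reported in this thread - is disabling that balancing alongside the scheduler tweaks already quoted:

```shell
# The OS-level tweaks quoted earlier in this message (requires root),
# plus kernel.numa_balancing=0, which switches off the automatic NUMA
# page migration visible in the profile. The numa_balancing change is a
# suggested experiment, not something the thread reports testing.
sysctl -w kernel.sched_autogroup_enabled=0
sysctl -w kernel.sched_migration_cost_ns=500
sysctl -w net.core.somaxconn=1024
echo never > /sys/kernel/mm/transparent_hugepage/enabled
sysctl -w kernel.numa_balancing=0
```

These are privileged, host-wide settings, so this is a config sketch to adapt rather than run blindly.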
Re: [PERFORM] 60 core performance with 9.3
On 27/06/14 21:19, Andres Freund wrote:

On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote:

My feeling is spinlock or similar, 'perf top' shows

kernel  find_busiest_group
kernel  _raw_spin_lock

as the top time users.

Those don't tell that much by themselves, could you do a hierarchical profile? I.e. perf record -ga? That'll at least give the callers for kernel level stuff. For more information compile postgres with -fno-omit-frame-pointer.

Unfortunately this did not help - I had lots of unknown symbols from postgres in the profile. I'm guessing the Ubuntu postgresql-9.3 package needs either the -dev package or to be rebuilt with the enable-profile option (debug and no-omit-frame-pointer seem to be there already).

However, further investigation did uncover *very* interesting things. Firstly, I had previously said that read-only performance looked ok... this was wrong, purely based on comparison to Robert's blog post. Rebooting the 60-core box with 32 cores enabled showed that we got *better* scaling performance in the read-only case, and illustrated that we were hitting a serious regression with more cores. At this point data is needed:

Test: pgbench
Options: scale 500, read only
Os: Ubuntu 14.04
Pg: 9.3.4
Pg Options:
max_connections = 200
shared_buffers = 10GB
maintenance_work_mem = 1GB
effective_io_concurrency = 10
wal_buffers = 32MB
checkpoint_segments = 192
checkpoint_completion_target = 0.8

Results

Clients | 9.3 tps 32 cores | 9.3 tps 60 cores
--------+------------------+------------------
      6 |  70400           |  71028
     12 |  98918           | 129140
     24 | 230345           | 240631
     48 | 324042           | 409510
     96 | 346929           | 120464
    192 | 312621           |  92663

So we have anti-scaling with 60 cores as we increase the client connections. Ouch! A level of urgency led to trying out Andres's 'rwlock' 9.4 branch [1] - cherry picking the last 5 commits into the 9.4 branch, building a package from that and retesting:

Clients | 9.4 tps 60 cores (rwlock)
--------+--------------------------
      6 |  70189
     12 | 128894
     24 | 233542
     48 | 422754
     96 | 590796
    192 | 630672

Wow - that is more like it!
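The anti-scaling is easier to see as throughput per client. A quick sketch, using the 9.3 60-core read-only figures quoted in this message:

```shell
# Per-client throughput for the 9.3 read-only runs on 60 cores,
# computed from the tps figures quoted above.
awk 'BEGIN {
  n = split("6 12 24 48 96 192", c);
  split("71028 129140 240631 409510 120464 92663", t);
  for (i = 1; i <= n; i++)
    printf "%3d clients: %6d tps (%5.0f tps/client)\n", c[i], t[i], t[i] / c[i];
}'
```

At 48 clients each connection still delivers roughly 8500 tps; by 192 clients it has collapsed to under 500.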
Andres, that is some nice work - we definitely owe you some beers for that :-) I am aware that I need to retest with an unpatched 9.4 src, as it is not clear from this data how much is due to Andres's patches and how much to the steady stream of 9.4 development. I'll post an update on that later, but figured this was interesting enough to note for now.

Regards

Mark

[1] from git://git.postgresql.org/git/users/andresfreund/postgres.git, commits:

4b82477dcaf81ad7b0c102f4b66e479a5eb9504a
10d72b97f108b6002210ea97a414076a62302d4e
67ffebe50111743975d54782a3a94b15ac4e755f
fe686ed18fe132021ee5e557c67cc4d7c50a1ada
f2378dc2fa5b73c688f696704976980bab90c611

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] 60 core performance with 9.3
On 01/07/14 21:48, Mark Kirkwood wrote:

[1] from git://git.postgresql.org/git/users/andresfreund/postgres.git, commits:

4b82477dcaf81ad7b0c102f4b66e479a5eb9504a
10d72b97f108b6002210ea97a414076a62302d4e
67ffebe50111743975d54782a3a94b15ac4e755f
fe686ed18fe132021ee5e557c67cc4d7c50a1ada
f2378dc2fa5b73c688f696704976980bab90c611

Hmmm - that should read "last 5 commits in 'rwlock-contention'", and I had pasted the commit nos from my tree, not Andres's. Sorry; here are the right ones:

472c87400377a7dc418d8b77e47ba08f5c89b1bb
e1e549a8e42b753cc7ac60e914a3939584cb1c56
65c2174469d2e0e7c2894202dc63b8fa6f8d2a7f
959aa6e0084d1264e5b228e5a055d66e5173db7d
a5c3ddaef0ee679cf5e8e10d59e0a1fe9f0f1893
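For anyone reproducing the patched build, applying those five commits boils down to something like the following sketch (the local branch and remote names are placeholders of mine; the remote URL and commit ids are the ones given above):

```shell
# Cherry-pick the five 'rwlock-contention' commits (corrected ids above)
# from Andres's tree onto a local 9.4 checkout. Remote/branch names are
# placeholders, not from the thread.
git remote add andresfreund git://git.postgresql.org/git/users/andresfreund/postgres.git
git fetch andresfreund
git checkout -b rwlock-9.4
git cherry-pick 472c87400377a7dc418d8b77e47ba08f5c89b1bb \
                e1e549a8e42b753cc7ac60e914a3939584cb1c56 \
                65c2174469d2e0e7c2894202dc63b8fa6f8d2a7f \
                959aa6e0084d1264e5b228e5a055d66e5173db7d \
                a5c3ddaef0ee679cf5e8e10d59e0a1fe9f0f1893
```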
Re: [PERFORM] 60 core performance with 9.3
On 2014-07-01 21:48:35 +1200, Mark Kirkwood wrote:

On 27/06/14 21:19, Andres Freund wrote:

On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote:

My feeling is spinlock or similar, 'perf top' shows

kernel  find_busiest_group
kernel  _raw_spin_lock

as the top time users.

Those don't tell that much by themselves, could you do a hierarchical profile? I.e. perf record -ga? That'll at least give the callers for kernel level stuff. For more information compile postgres with -fno-omit-frame-pointer.

Unfortunately this did not help - I had lots of unknown symbols from postgres in the profile. I'm guessing the Ubuntu postgresql-9.3 package needs either the -dev package or to be rebuilt with the enable-profile option (debug and no-omit-frame-pointer seem to be there already).

You need to install the -dbg package. My bet is you'll see s_lock high in the profile, called mainly from the procarray and buffer mapping lwlocks.

Test: pgbench
Options: scale 500, read only
Os: Ubuntu 14.04
Pg: 9.3.4
Pg Options:
max_connections = 200

Just as an experiment I'd suggest increasing max_connections by one and two and quickly retesting - there are some cacheline alignment issues that aren't fixed yet that happen to vanish with some max_connections settings.

shared_buffers = 10GB
maintenance_work_mem = 1GB
effective_io_concurrency = 10
wal_buffers = 32MB
checkpoint_segments = 192
checkpoint_completion_target = 0.8

Results

Clients | 9.3 tps 32 cores | 9.3 tps 60 cores
--------+------------------+------------------
      6 |  70400           |  71028
     12 |  98918           | 129140
     24 | 230345           | 240631
     48 | 324042           | 409510
     96 | 346929           | 120464
    192 | 312621           |  92663

So we have anti-scaling with 60 cores as we increase the client connections. Ouch! A level of urgency led to trying out Andres's 'rwlock' 9.4 branch [1] - cherry picking the last 5 commits into the 9.4 branch, building a package from that and retesting:

Clients | 9.4 tps 60 cores (rwlock)
--------+--------------------------
      6 |  70189
     12 | 128894
     24 | 233542
     48 | 422754
     96 | 590796
    192 | 630672

Wow - that is more like it! Andres, that is some nice work - we definitely owe you some beers for that :-) I am aware that I need to retest with an unpatched 9.4 src, as it is not clear from this data how much is due to Andres's patches and how much to the steady stream of 9.4 development. I'll post an update on that later, but figured this was interesting enough to note for now.

Cool. That's what I like (and expect) to see :). I don't think unpatched 9.4 will show significantly different results than 9.3, but it'd be good to validate that. If you do so, could you post the results in the -hackers thread I just CCed you on? That'll help the work to get into 9.5.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [PERFORM] 60 core performance with 9.3
On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote:

My feeling is spinlock or similar, 'perf top' shows

kernel  find_busiest_group
kernel  _raw_spin_lock

as the top time users.

Those don't tell that much by themselves, could you do a hierarchical profile? I.e. perf record -ga? That'll at least give the callers for kernel level stuff. For more information compile postgres with -fno-omit-frame-pointer.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
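Concretely, the hierarchical profile being asked for looks something like this (the capture window is an assumption of mine; system-wide recording generally needs root):

```shell
# System-wide sample with call graphs while the benchmark runs, then
# browse the result. For resolvable userspace stacks, postgres should be
# built with CFLAGS including -fno-omit-frame-pointer.
perf record -a -g -- sleep 60
perf report
```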
Re: [PERFORM] 60 core performance with 9.3
On 27/06/14 21:19, Andres Freund wrote:

On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote:

My feeling is spinlock or similar, 'perf top' shows

kernel  find_busiest_group
kernel  _raw_spin_lock

as the top time users.

Those don't tell that much by themselves, could you do a hierarchical profile? I.e. perf record -ga? That'll at least give the callers for kernel level stuff. For more information compile postgres with -fno-omit-frame-pointer.

Excellent suggestion, will do next week!

regards

Mark
[PERFORM] 60 core performance with 9.3
I have a nice toy to play with: Dell R920 with 60 cores and 1TB ram [1].

The context is that the current machine in use by the customer is a 32 core one, and due to growth we are looking at something larger (hence 60 cores).

Some initial tests show similar pgbench read only performance to what Robert found here: http://rhaas.blogspot.co.nz/2012/04/did-i-say-32-cores-how-about-64.html (actually a bit quicker around 40 tps).

However, doing a mixed read-write workload is getting results the same as or only marginally quicker than the 32 core machine - particularly at higher numbers of clients (e.g. 200 - 500). I have yet to break out the perf toolset, but I'm wondering if any folk have compared 32 and 60 (or 64) core read-write pgbench performance?

regards

Mark

[1] Details:
4x E7-4890, 15 cores each
1 TB ram
16x Toshiba PX02SS SATA SSD
4x Samsung NVMe XS1715 PCIe SSD
Ubuntu 14.04 (Linux 3.13)
Re: [PERFORM] 60 core performance with 9.3
On Thu, Jun 26, 2014 at 5:49 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:

I have a nice toy to play with: Dell R920 with 60 cores and 1TB ram [1]. The context is the current machine in use by the customer is a 32 core one, and due to growth we are looking at something larger (hence 60 cores). Some initial tests show similar pgbench read only performance to what Robert found here http://rhaas.blogspot.co.nz/2012/04/did-i-say-32-cores-how-about-64.html (actually a bit quicker around 40 tps). However doing a mixed read-write workload is getting results the same or only marginally quicker than the 32 core machine - particularly at higher number of clients (e.g 200 - 500). I have yet to break out the perf toolset, but I'm wondering if any folk has compared 32 and 60 (or 64) core read write pgbench performance?

My guess is that the read only test is CPU / memory bandwidth limited, but the mixed test is IO bound. What's your iostat / vmstat / iotop etc look like when you're doing both read only and read/write mixed?
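For reference, the kind of snapshot being asked for here can be captured with the standard tools while the benchmark runs (sysstat's iostat and procps' vmstat; the one-second interval is just a common choice, not something specified in the thread):

```shell
# Watch per-device IO and overall system pressure during a benchmark run.
iostat -x 1    # extended per-device stats: r/s, w/s, MB/s, await, %util
vmstat 1       # run queue, context switches, cpu user/sys/idle/wait
```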
Re: [PERFORM] 60 core performance with 9.3
On 27/06/14 14:01, Scott Marlowe wrote:

On Thu, Jun 26, 2014 at 5:49 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:

I have a nice toy to play with: Dell R920 with 60 cores and 1TB ram [1]. The context is the current machine in use by the customer is a 32 core one, and due to growth we are looking at something larger (hence 60 cores). Some initial tests show similar pgbench read only performance to what Robert found here http://rhaas.blogspot.co.nz/2012/04/did-i-say-32-cores-how-about-64.html (actually a bit quicker around 40 tps). However doing a mixed read-write workload is getting results the same or only marginally quicker than the 32 core machine - particularly at higher number of clients (e.g 200 - 500). I have yet to break out the perf toolset, but I'm wondering if any folk has compared 32 and 60 (or 64) core read write pgbench performance?

My guess is that the read only test is CPU / memory bandwidth limited, but the mixed test is IO bound. What's your iostat / vmstat / iotop etc look like when you're doing both read only and read/write mixed?

That was what I would have thought too, but it does not appear to be the case; here is a typical iostat:

Device:  rrqm/s wrqm/s  r/s      w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        0.00   0.00 0.00     0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
nvme0n1    0.00   0.00 0.00  4448.00   0.00  41.47    19.10     0.14   0.03    0.00    0.03   0.03  14.40
nvme1n1    0.00   0.00 0.00  4448.00   0.00  41.47    19.10     0.15   0.03    0.00    0.03   0.03  15.20
nvme2n1    0.00   0.00 0.00  4549.00   0.00  42.20    19.00     0.15   0.03    0.00    0.03   0.03  15.20
nvme3n1    0.00   0.00 0.00  4548.00   0.00  42.19    19.00     0.16   0.04    0.00    0.04   0.04  16.00
dm-0       0.00   0.00 0.00     0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
md0        0.00   0.00 0.00 17961.00   0.00  83.67     9.54     0.00   0.00    0.00    0.00   0.00   0.00
dm-1       0.00   0.00 0.00     0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-2       0.00   0.00 0.00     0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-3       0.00   0.00 0.00     0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-4       0.00   0.00 0.00     0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00

My feeling is spinlock or similar; 'perf top' shows

kernel  find_busiest_group
kernel  _raw_spin_lock

as the top time users.