Re: [PERFORM] 60 core performance with 9.3

2014-08-14 Thread Josh Berkus
Mark,

Is the 60-core machine using some of the Intel chips which have 20
hyperthreaded virtual cores?

If so, I've been seeing some performance issues on these processors.
I'm currently doing a side-by-side hyperthreading on/off test.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [PERFORM] 60 core performance with 9.3

2014-08-14 Thread Mark Kirkwood

On 15/08/14 06:18, Josh Berkus wrote:

Mark,

Is the 60-core machine using some of the Intel chips which have 20
hyperthreaded virtual cores?

If so, I've been seeing some performance issues on these processors.
I'm currently doing a side-by-side hyperthreading on/off test.



Hi Josh,

The board has 4 sockets with E7-4890 v2 CPUs. They have 15 cores/30 
threads each. We're running with hyperthreading off (we noticed the usual 
steep/sudden scaling dropoff with it on).


What model are your 20-core CPUs?

Cheers

Mark








Re: [PERFORM] 60 core performance with 9.3

2014-08-11 Thread Mark Kirkwood

On 01/08/14 09:38, Alvaro Herrera wrote:

Matt Clarkson wrote:


The LWLOCK_STATS below suggest that ProcArrayLock might be the main
source of locking that's causing throughput to take a dive as the client
count increases beyond the core count.



Any thoughts or comments on these results are welcome!


Do these results change if you use Heikki's patch for CSN-based
snapshots?  See
http://www.postgresql.org/message-id/539ad153.9000...@vmware.com for the
patch (but note that you need to apply on top of 89cf2d52030 in the
master branch -- maybe it applies to HEAD of the 9.4 branch but I didn't
try).



Hi Alvaro,

Applying the CSN patch on top of the rwlock + numa patches in 9.4 (bit of a 
patch-fest we have here now) shows a modest improvement at the highest client 
counts (but appears to hurt performance in the mid range):


 clients |   tps
---------+-------
       6 |  8445
      12 | 14548
      24 | 20043
      48 | 27451
      96 | 27718
     192 | 23614
     384 | 24737


Initial runs were quite disappointing, until we moved the csnlog 
directory onto the same filesystem that the xlogs are on (PCIe SSD). We 
could potentially look at locating them on their own separate volume if 
that makes sense.
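
For reference, the relocation itself is just a stop/move/symlink dance, 
roughly as below. The directory name pg_csnlog is an assumption about what 
the patch creates under the data directory, and the paths are placeholders:

  pg_ctl -D /srv/pgdata stop -m fast
  # move the csnlog onto the PCIe SSD filesystem and link it back into PGDATA
  mv /srv/pgdata/pg_csnlog /ssd/pg_csnlog        # assumed directory name
  ln -s /ssd/pg_csnlog /srv/pgdata/pg_csnlog
  pg_ctl -D /srv/pgdata start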


Adding in LWLOCK stats again shows quite a different picture from the 
previous:


48 clients

  Lock  |Blk   | SpinDelay | Blk % | SpinDelay %
+--+---+---+-
WALWriteLock| 25426001 | 1239  | 62.227442 | 14.373550
CLogControlLock |  1793739 | 1376  |  4.389986 | 15.962877
ProcArrayLock   |  1007765 | 1305  |  2.466398 | 15.139211
CSNLogControlLock   |  609556  | 1722  |  1.491824 | 19.976798
WALInsertLocks 4|  994170  |  247  |  2.433126 |  2.865429
WALInsertLocks 7|  983497  |  243  |  2.407005 |  2.819026
WALInsertLocks 5|  993068  |  239  |  2.430429 |  2.772622
WALInsertLocks 3|  991446  |  229  |  2.426459 |  2.656613
WALInsertLocks 0|  964185  |  235  |  2.359741 |  2.726218
WALInsertLocks 1|  995237  |  221  |  2.435737 |  2.563805
WALInsertLocks 2|  997593  |  213  |  2.441503 |  2.470998
WALInsertLocks 6|  978178  |  201  |  2.393987 |  2.331787
BufFreelistLock |  887194  |  206  |  2.171313 |  2.389791
XidGenLock  |  327385  |  366  |  0.801240 |  4.245940
CheckpointerCommLock|  104754  |  151  |  0.256374 |  1.751740
WALBufMappingLock   |  274226  |7  |  0.671139 |  0.081206


96 clients

  Lock  |Blk   | SpinDelay | Blk % | SpinDelay %
+--+---+---+-
WALWriteLock| 30097625 |  9616 | 48.550747 | 19.068393
CLogControlLock |  3193429 | 13490 | 5.151349  | 26.750481
ProcArrayLock   |  2007103 | 11754 | 3.237676  | 23.308017
CSNLogControlLock   |  1303172 |  5022 | 2.102158  |  9.958556
BufFreelistLock |  1921625 |  1977 | 3.099790  |  3.920363
WALInsertLocks 0|  2011855 |   681 | 3.245341  |  1.350413
WALInsertLocks 5|  1829266 |   627 | 2.950805  |  1.243332
WALInsertLocks 7|  1806966 |   632 | 2.914833  |  1.253247
WALInsertLocks 4|  1847372 |   591 | 2.980012  |  1.171945
WALInsertLocks 1|  1948553 |   557 | 3.143228  |  1.104523
WALInsertLocks 6|  1818717 |   582 | 2.933789  |  1.154098
WALInsertLocks 3|  1873964 |   552 | 3.022908  |  1.094608
WALInsertLocks 2|  1912007 |   523 | 3.084276  |  1.037102
XidGenLock  |   512521 |   699 | 0.826752  |  1.386107
CheckpointerCommLock|   386853 |   711 | 0.624036  |  1.409903
WALBufMappingLock   |   546462 |65 | 0.881503  |  0.128894


384 clients

  Lock  |Blk   | SpinDelay | Blk % | SpinDelay %
+--+---+---+-
WALWriteLock| 20703796 |  87265| 27.749961 | 15.360068
CLogControlLock |  3273136 | 122616|  4.387089 | 21.582422
ProcArrayLock   |  3969918 | 100730|  5.321008 | 17.730128
CSNLogControlLock   |  3191989 | 115068|  4.278325 | 20.253851
BufFreelistLock |  2014218 |  27952|  2.699721 |  4.920009
WALInsertLocks 0|  2750082 |   5438|  3.686023 |  0.957177
WALInsertLocks 1|  2584155 |   5312|  3.463626 |  0.934999
WALInsertLocks 2|  2477782 |   5497|  3.321051 |  0.967562
WALInsertLocks 4|  2375977 |   5441|  3.184598 |  0.957705
WALInsertLocks 5|  2349769 |   5458|  3.149471 |  0.960697
WALInsertLocks 6|  2329982 |   5367|  3.122950 |  0.944680
WALInsertLocks 3|  2415965 |   4771|  3.238195 |  0.839774
WALInsertLocks 7|  2316144 |   4930|  3.104402 |  0.867761
CheckpointerCommLock|   584419 |  10794|  0.783316 |  1.899921
XidGenLock  |   391212 |  

Re: [PERFORM] 60 core performance with 9.3

2014-07-31 Thread Alvaro Herrera
Matt Clarkson wrote:

 The LWLOCK_STATS below suggest that ProcArrayLock might be the main
 source of locking that's causing throughput to take a dive as the client
 count increases beyond the core count.

 Any thoughts or comments on these results are welcome!

Do these results change if you use Heikki's patch for CSN-based
snapshots?  See
http://www.postgresql.org/message-id/539ad153.9000...@vmware.com for the
patch (but note that you need to apply on top of 89cf2d52030 in the
master branch -- maybe it applies to HEAD of the 9.4 branch but I didn't
try).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] 60 core performance with 9.3

2014-07-30 Thread Tomas Vondra
On 30 July 2014, 3:44, Mark Kirkwood wrote:

 While these numbers look great in the middle range (12-96 clients), then
 benefit looks to be tailing off as client numbers increase. Also running
 with no stats (and hence no auto vacuum or analyze) is way too scary!

I assume you've disabled the statistics collector, which has nothing to do
with vacuum or analyze.

There are two kinds of statistics in PostgreSQL - data distribution
statistics (which are collected by ANALYZE and stored in actual tables
within the database) and runtime statistics (which are collected by the
stats collector and stored in a file somewhere on disk).

By disabling the statistics collector you lose runtime counters - the number
of sequential/index scans on a table, tuples read from a relation, etc. But
it does not influence VACUUM or planning at all.

Also, it's mostly async (send over UDP and you're done) and shouldn't make
much difference unless you have a large number of objects. There are ways to
improve this (e.g. by placing the stat files into a tmpfs).
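
A minimal sketch of the tmpfs variant, assuming a dedicated mount point
(size and path are illustrative):

  # mount a small tmpfs and point the stats collector at it
  sudo mkdir -p /run/pg_stats_tmp
  sudo mount -t tmpfs -o size=64M tmpfs /run/pg_stats_tmp
  sudo chown postgres: /run/pg_stats_tmp

  # postgresql.conf:
  #   stats_temp_directory = '/run/pg_stats_tmp'
  pg_ctl -D "$PGDATA" reload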

Tomas





Re: [PERFORM] 60 core performance with 9.3

2014-07-30 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 On 30 July 2014, 3:44, Mark Kirkwood wrote:
 While these numbers look great in the middle range (12-96 clients), then
 benefit looks to be tailing off as client numbers increase. Also running
 with no stats (and hence no auto vacuum or analyze) is way too scary!

 By disabling statistics collector you loose runtime counters - number of
 sequential/index scans on a table, tuples read from a relation aetc. But
 it does not influence VACUUM or planning at all.

It does break autovacuum.

regards, tom lane




Re: [PERFORM] 60 core performance with 9.3

2014-07-30 Thread Tomas Vondra
On 30 July 2014, 14:39, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 On 30 July 2014, 3:44, Mark Kirkwood wrote:
 While these numbers look great in the middle range (12-96 clients),
 then
 benefit looks to be tailing off as client numbers increase. Also
 running
 with no stats (and hence no auto vacuum or analyze) is way too scary!

 By disabling statistics collector you loose runtime counters - number of
 sequential/index scans on a table, tuples read from a relation aetc. But
 it does not influence VACUUM or planning at all.

 It does break autovacuum.

Of course, you're right. It throws away info about how much data was
modified and when the table was last (auto)vacuumed.

This is clear proof that I really need to drink at least one cup of
coffee before doing anything in the morning.

Tomas





Re: [PERFORM] 60 core performance with 9.3

2014-07-30 Thread Mark Kirkwood

Hi Tomas,

Unfortunately I think you are mistaken - disabling the stats collector 
(i.e. track_counts = off) means that autovacuum has no idea about 
when/if it needs to start a worker (as it uses those counts to decide), 
and hence you lose all automatic vacuum and analyze as a result.
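
For anyone wanting to see this directly, the counters that autovacuum relies 
on are visible in the standard stats views (nothing special needed beyond 
psql; the LIMIT is just for readability):

  psql -c "SHOW track_counts"
  psql -c "SELECT relname, n_dead_tup, last_autovacuum
           FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 5"
  # with track_counts = off, n_dead_tup stops advancing, so the autovacuum
  # launcher never sees a table crossing its vacuum threshold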


With respect to comments like "it shouldn't make much difference" etc., 
well, the profile suggests otherwise, and the change in tps numbers 
supports the observation.


regards

Mark

On 30/07/14 20:42, Tomas Vondra wrote:

On 30 July 2014, 3:44, Mark Kirkwood wrote:


While these numbers look great in the middle range (12-96 clients), then
benefit looks to be tailing off as client numbers increase. Also running
with no stats (and hence no auto vacuum or analyze) is way too scary!


I assume you've disabled statistics collector, which has nothing to do
with vacuum or analyze.

There are two kinds of statistics in PostgreSQL - data distribution
statistics (which is collected by ANALYZE and stored in actual tables
within the database) and runtime statistics (which is collected by the
stats collector and stored in a file somewhere on the dist).

By disabling statistics collector you loose runtime counters - number of
sequential/index scans on a table, tuples read from a relation aetc. But
it does not influence VACUUM or planning at all.

Also, it's mostly async (send over UDP and you're done) and shouldn't make
much difference unless you have large number of objects. There are ways to
improve this (e.g. by placing the stat files into a tmpfs).

Tomas







Re: [PERFORM] 60 core performance with 9.3

2014-07-30 Thread Mark Kirkwood

On 31/07/14 00:47, Tomas Vondra wrote:

On 30 July 2014, 14:39, Tom Lane wrote:

Tomas Vondra t...@fuzzy.cz writes:

On 30 July 2014, 3:44, Mark Kirkwood wrote:

While these numbers look great in the middle range (12-96 clients),
then
benefit looks to be tailing off as client numbers increase. Also
running
with no stats (and hence no auto vacuum or analyze) is way too scary!



By disabling statistics collector you loose runtime counters - number of
sequential/index scans on a table, tuples read from a relation aetc. But
it does not influence VACUUM or planning at all.


It does break autovacuum.


Of course, you're right. It throws away info about how much data was
modified and when the table was last (auto)vacuumed.

This is a clear proof that I really need to drink at least one cup of
coffee in the morning before doing anything in the morning.



Lol - thanks for taking a look anyway. Yes, coffee is often an important 
part of the exercise.


Regards

Mark





Re: [PERFORM] 60 core performance with 9.3

2014-07-30 Thread Matt Clarkson
I've been assisting Mark with the benchmarking of these new servers. 

The drop off in both throughput and CPU utilisation that we've been
observing as the client count increases has led me to investigate which
lwlocks are dominant at different client counts.

I've recompiled postgres with Andres's LWLock improvements, Kevin's
libnuma patch, and with LWLOCK_STATS enabled.
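
For anyone wanting to reproduce the build, a rough sketch (source path and 
patch application steps are omitted; LWLOCK_STATS is a compile-time macro, 
so it has to go into the preprocessor flags or pg_config_manual.h):

  cd postgres-src          # tree with the lwlock and libnuma patches applied
  ./configure --prefix=/usr/local/pgsql \
      CFLAGS="-O2 -fno-omit-frame-pointer" \
      CPPFLAGS="-DLWLOCK_STATS"
  make -j 60 && sudo make install
  # each backend writes its lock statistics to the server log (stderr) on exit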

The LWLOCK_STATS below suggest that ProcArrayLock might be the main
source of locking that's causing throughput to take a dive as the client
count increases beyond the core count.


wal_buffers = 256MB
checkpoint_segments = 1920
wal_sync_method = open_datasync

pgbench -s 2000 -T 600


Results:

 clients |   tps
---------+-------
       6 |  9490
      12 | 17558
      24 | 25681
      48 | 41175
      96 | 48954
     192 | 31887
     384 | 15564

LWLOCK_STATS at 48 clients

  Lock  |Blk   | SpinDelay | Blk % | SpinDelay % 
+--+---+---+-
 BufFreelistLock|  31144   |  11   |  1.64 |   1.62
 ShmemIndexLock |192   |   1   |  0.01 |   0.15
 OidGenLock |  32648   |  14   |  1.72 |   2.06
 XidGenLock |  35731   |  18   |  1.88 |   2.64
 ProcArrayLock  | 291121   | 215   | 15.36 |  31.57
 SInvalReadLock |  32136   |  13   |  1.70 |   1.91
 SInvalWriteLock|  32141   |  12   |  1.70 |   1.76
 WALBufMappingLock  |  31662   |  15   |  1.67 |   2.20
 WALWriteLock   | 825380   |  45   | 36.31 |   6.61
 CLogControlLock| 583458   | 337   | 26.93 |  49.49
 
 
   
LWLOCK_STATS at 96 clients

  Lock  |Blk   | SpinDelay | Blk % | SpinDelay % 
+--+---+---+-
 BufFreelistLock|   62954  |  12   |  1.54 |   0.27
 ShmemIndexLock |   62635  |   4   |  1.54 |   0.09
 OidGenLock |   92232  |  22   |  2.26 |   0.50
 XidGenLock |   98326  |  18   |  2.41 |   0.41
 ProcArrayLock  |  928871  |3188   | 22.78 |  72.57
 SInvalReadLock |   58392  |  13   |  1.43 |   0.30
 SInvalWriteLock|   57429  |  14   |  1.41 |   0.32
 WALBufMappingLock  |  138375  |  14   |  3.39 |   0.32
 WALWriteLock   | 1480707  |  42   | 36.31 |   0.96
 CLogControlLock| 1098239  |1066   | 26.93 |  27.27
 
 
 
LWLOCK_STATS at 384 clients

  Lock  |Blk   | SpinDelay | Blk % | SpinDelay % 
+--+---+---+-
 BufFreelistLock|  184298  | 158   |  1.93 |   0.03
 ShmemIndexLock |  183573  | 164   |  1.92 |   0.03
 OidGenLock |  184558  | 173   |  1.93 |   0.03
 XidGenLock |  200239  | 213   |  2.09 |   0.04
 ProcArrayLock  | 4035527  |  579666   | 42.22 |  98.62
 SInvalReadLock |  182204  | 152   |  1.91 |   0.03
 SInvalWriteLock|  182898  | 137   |  1.91 |   0.02
 WALBufMappingLock  |  219936  | 215   |  2.30 |   0.04
 WALWriteLock   | 3172725  | 457   | 24.67 |   0.08
 CLogControlLock| 1012458  |6423   | 10.59 |   1.09
 

The same test done with a readonly workload shows virtually no SpinDelay
at all.


Any thoughts or comments on these results are welcome!


Regards,
Matt.







Re: [PERFORM] 60 core performance with 9.3

2014-07-29 Thread Mark Kirkwood

On 17/07/14 11:58, Mark Kirkwood wrote:



Trying out with numa_balancing=0 seemed to get essentially the same
performance. Similarly wrapping postgres startup with --interleave.

All this made me want to try with numa *really* disabled. So rebooted
the box with numa=off appended to the kernel cmdline. Somewhat
surprisingly (to me anyway), the numbers were essentially identical. The
profile, however is quite different:



A little more tweaking got some further improvement:

rwlocks patch as before

wal_buffers = 256MB
checkpoint_segments = 1920
wal_sync_method = open_datasync

LSI RAID adaptor: read ahead and write cache disabled ("fast path" mode for SSDs)
numa_balancing = 0


Pgbench scale 2000 again:

 clients | tps (prev) | tps (tweaked config)
---------+------------+----------------------
       6 |       8175 |                 8281
      12 |      14409 |                15896
      24 |      17191 |                19522
      48 |      23122 |                29776
      96 |      22308 |                32352
     192 |      23109 |                28804


Now recall that we were seeing no actual tps changes with numa_balancing=0 or 
1 (so the improvement above is from the other changes), but figured it 
might be informative to try to track down what the non-numa bottlenecks 
looked like. We tried profiling the entire 10 minute run, which showed 
the stats collector as a possible source of contention:



  3.86%  postgres  [kernel.kallsyms]            [k] _raw_spin_lock_bh
  |
  --- _raw_spin_lock_bh
 |
 |--95.78%-- lock_sock_nested
 |  udpv6_sendmsg
 |  inet_sendmsg
 |  sock_sendmsg
 |  SYSC_sendto
 |  sys_sendto
 |  tracesys
 |  __libc_send
 |  |
 |  |--99.17%-- pgstat_report_stat
 |  |  PostgresMain
 |  |  ServerLoop
 |  |  PostmasterMain
 |  |  main
 |  |  __libc_start_main
 |  |
 |  |--0.77%-- pgstat_send_bgwriter
 |  |  BackgroundWriterMain
 |  |  AuxiliaryProcessMain
 |  |  0x7f08efe8d453
 |  |  reaper
 |  |  __restore_rt
 |  |  PostmasterMain
 |  |  main
 |  |  __libc_start_main
 |   --0.07%-- [...]
 |
 |--2.54%-- __lock_sock
 |  |
 |  |--91.95%-- lock_sock_nested
 |  |  udpv6_sendmsg
 |  |  inet_sendmsg
 |  |  sock_sendmsg
 |  |  SYSC_sendto
 |  |  sys_sendto
 |  |  tracesys
 |  |  __libc_send
 |  |  |
 |  |  |--99.73%-- pgstat_report_stat
 |  |  |  PostgresMain
 |  |  |  ServerLoop



Disabling track_counts and rerunning pgbench:

clients  | tps (no counts)
-+
6|9806
12   |   18000
24   |   29281
48   |   43703
96   |   54539
192  |   36114


While these numbers look great in the middle range (12-96 clients), the 
benefit looks to be tailing off as client numbers increase. Also, running 
with no stats (and hence no autovacuum or analyze) is way too scary!


Trying out less write-heavy workloads shows that the stats overhead does 
not appear to be significant for *read*-heavy cases, so this result 
above is perhaps more of a curiosity than anything (given that read 
heavy is more typical...and our real workload is more similar to read 
heavy).


The profile for counts off looks like:

 4.79% swapper  [kernel.kallsyms][k] read_hpet
   |
   --- read_hpet
  |
  |--97.10%-- ktime_get
  |  |
  |  |--35.24%-- clockevents_program_event
  |  |  tick_program_event
  |  |  |
  |  |  |--56.59%-- 
__hrtimer_start_range_ns

  |  |  |  |
  |  |  |  |--78.12%-- 

Re: [PERFORM] 60 core performance with 9.3

2014-07-21 Thread Kevin Grittner
Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:
 On 12/07/14 01:19, Kevin Grittner wrote:

 It might be worth a test using a cpuset to interleave OS cache and
 the NUMA patch I submitted to the current CF to see whether this is
 getting into territory where the patch makes a bigger difference.
 I would expect it to do much better than using numactl --interleave
 because work_mem and other process-local memory would be allocated
 in near memory for each process.

 http://www.postgresql.org/message-id/1402267501.4.yahoomail...@web122304.mail.ne1.yahoo.com

 Thanks Kevin - I did try this out - seemed slightly better than using
 --interleave, but almost identical to the results posted previously.

 However looking at my postgres binary with ldd, I'm not seeing any link
 to libnuma (despite it demanding the library whilst building), so I
 wonder if my package build has somehow vanilla-ified the result :-(

That is odd; not sure what to make of that!

 Also I am guessing that with 60 cores I do:

 $ sudo /bin/bash -c "echo 0-59 >/dev/cpuset/postgres/cpus"

 i.e cpus are cores not packages...?

Right; basically, as a guide, you can use the output from:

$ numactl --hardware

Use the union of all the cpu numbers from the "node <n> cpus" lines.  The
above command is also a good way to see how unbalanced memory usage has
become while running a test.
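
Putting that together, a rough sketch of the cpuset setup (the mount point, 
legacy cpuset filesystem mount, and the 0-3 memory node list are assumptions 
for this 4-socket box; take the real node list from numactl --hardware):

  sudo mount -t cpuset cpuset /dev/cpuset 2>/dev/null || true
  sudo mkdir -p /dev/cpuset/postgres
  sudo /bin/bash -c "echo 0-59 > /dev/cpuset/postgres/cpus"
  sudo /bin/bash -c "echo 0-3  > /dev/cpuset/postgres/mems"   # assumed 4 NUMA nodes
  sudo /bin/bash -c "echo 1    > /dev/cpuset/postgres/memory_spread_page"  # interleave OS cache
  # put the current shell into the cpuset, then start postgres so it inherits it
  sudo /bin/bash -c "echo $$ > /dev/cpuset/postgres/tasks"
  pg_ctl -D "$PGDATA" start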

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PERFORM] 60 core performance with 9.3

2014-07-16 Thread Mark Kirkwood

On 11/07/14 20:22, Andres Freund wrote:

On 2014-07-11 12:40:15 +1200, Mark Kirkwood wrote:

Full report http://paste.ubuntu.com/886/



#
  8.82%  postgres  [kernel.kallsyms]            [k] _raw_spin_lock_irqsave
   |
   --- _raw_spin_lock_irqsave
  |
  |--75.69%-- pagevec_lru_move_fn
  |  __lru_cache_add
  |  lru_cache_add
  |  putback_lru_page
  |  migrate_pages
  |  migrate_misplaced_page
  |  do_numa_page
  |  handle_mm_fault
  |  __do_page_fault
  |  do_page_fault
  |  page_fault


So, the majority of the time is spent in numa page migration. Can you
disable numa_balancing? I'm not sure if your kernel version does that at
runtime or whether you need to reboot.
The kernel.numa_balancing sysctl might work. Otherwise you probably need
to boot with numa_balancing=0.

It'd also be worthwhile to test this with numactl --interleave.



Trying out with numa_balancing=0 seemed to get essentially the same 
performance. Similarly wrapping postgres startup with --interleave.


All this made me want to try with numa *really* disabled. So rebooted 
the box with numa=off appended to the kernel cmdline. Somewhat 
surprisingly (to me anyway), the numbers were essentially identical. The 
profile, however is quite different:


Full report at http://paste.ubuntu.com/7806285/


 4.56% postgres  [kernel.kallsyms] [k] 
_raw_spin_lock_irqsave 



   |
   --- _raw_spin_lock_irqsave
  |
  |--41.89%-- try_to_wake_up
  |  |
  |  |--96.12%-- default_wake_function
  |  |  |
  |  |  |--99.96%-- pollwake
  |  |  |  __wake_up_common
  |  |  |  __wake_up_sync_key
  |  |  |  sock_def_readable
  |  |  |  |
  |  |  |  |--99.94%-- 
unix_stream_sendmsg
  |  |  |  | 
sock_sendmsg
  |  |  |  | 
SYSC_sendto
  |  |  |  | 
sys_sendto

  |  |  |  |  tracesys
  |  |  |  | 
__libc_send

  |  |  |  |  pq_flush
  |  |  |  | 
ReadyForQuery
  |  |  |  | 
PostgresMain
  |  |  |  | 
ServerLoop
  |  |  |  | 
PostmasterMain

  |  |  |  |  main
  |  |  |  | 
__libc_start_main

  |  |  |   --0.06%-- [...]
  |  |   --0.04%-- [...]
  |  |
  |  |--2.87%-- wake_up_process
  |  |  |
  |  |  |--95.71%-- 
wake_up_sem_queue_do

  |  |  |  SYSC_semtimedop
  |  |  |  sys_semop
  |  |  |  tracesys
  |  |  |  __GI___semop
  |  |  |  |
  |  |  |  |--99.75%-- 
LWLockRelease
  |  |  |  |  | 

  |  |  |  | 
|--25.09%-- RecordTransactionCommit
  |  |  |  |  | 
  CommitTransaction
  |  |  |  |  | 
  CommitTransactionCommand
  |  |  |  |  | 
  finish_xact_command.part.4
  |  |  |  |  | 
  PostgresMain
  |  |  |  |  | 
  ServerLoop
  |  |  |  |  | 
  PostmasterMain
  |  |  |  |  | 
  main
  |  |  |  |  | 
  __libc_start_main




regards

Mark




Re: [PERFORM] 60 core performance with 9.3

2014-07-16 Thread Mark Kirkwood

On 12/07/14 01:19, Kevin Grittner wrote:


It might be worth a test using a cpuset to interleave OS cache and
the NUMA patch I submitted to the current CF to see whether this is
getting into territory where the patch makes a bigger difference.
I would expect it to do much better than using numactl --interleave
because work_mem and other process-local memory would be allocated
in near memory for each process.

http://www.postgresql.org/message-id/1402267501.4.yahoomail...@web122304.mail.ne1.yahoo.com



Thanks Kevin - I did try this out - seemed slightly better than using 
--interleave, but almost identical to the results posted previously.


However looking at my postgres binary with ldd, I'm not seeing any link 
to libnuma (despite it demanding the library whilst building), so I 
wonder if my package build has somehow vanilla-ified the result :-(


Also I am guessing that with 60 cores I do:

$ sudo /bin/bash -c "echo 0-59 >/dev/cpuset/postgres/cpus"

i.e. cpus are cores, not packages...? If I've stuffed it up I'll redo!


Cheers

Mark




Re: [PERFORM] 60 core performance with 9.3

2014-07-11 Thread Andres Freund
On 2014-07-11 12:40:15 +1200, Mark Kirkwood wrote:
 On 01/07/14 22:13, Andres Freund wrote:
 On 2014-07-01 21:48:35 +1200, Mark Kirkwood wrote:
 - cherry picking the last 5 commits into 9.4 branch and building a package
 from that and retesting:
 
 Clients | 9.4 tps 60 cores (rwlock)
 +--
 6   |  70189
 12  | 128894
 24  | 233542
 48  | 422754
 96  | 590796
 192 | 630672
 
 Wow - that is more like it! Andres that is some nice work, we definitely owe
 you some beers for that :-) I am aware that I need to retest with an
 unpatched 9.4 src - as it is not clear from this data how much is due to
 Andres's patches and how much to the steady stream of 9.4 development. I'll
 post an update on that later, but figured this was interesting enough to
 note for now.
 
 Cool. That's what I like (and expect) to see :). I don't think unpatched
 9.4 will show significantly different results than 9.3, but it'd be good
 to validate that. If you do so, could you post the results in the
 -hackers thread I just CCed you on? That'll help the work to get into
 9.5.
 
 So we seem to have nailed read only performance. Going back and revisiting
 read write performance finds:
 
 Postgres 9.4 beta
 rwlock patch
 pgbench scale = 2000
 
 max_connections = 200;
 shared_buffers = 10GB;
 maintenance_work_mem = 1GB;
 effective_io_concurrency = 10;
 wal_buffers = 32MB;
 checkpoint_segments = 192;
 checkpoint_completion_target = 0.8;
 
 clients  | tps (32 cores) | tps
 -++-
 6|   8313 |   8175
 12   |  11012 |  14409
 24   |  16151 |  17191
 48   |  21153 |  23122
 96   |  21977 |  22308
 192  |  22917 |  23109

On that scale - that's bigger than shared_buffers IIRC - I'd not expect
the patch to make much of a difference.

 kernel.sched_autogroup_enabled=0
 kernel.sched_migration_cost_ns=500
 net.core.somaxconn=1024
 /sys/kernel/mm/transparent_hugepage/enabled [never]
 
 Full report http://paste.ubuntu.com/886/

 #
  8.82%  postgres  [kernel.kallsyms]            [k] _raw_spin_lock_irqsave
   |
   --- _raw_spin_lock_irqsave
  |
  |--75.69%-- pagevec_lru_move_fn
  |  __lru_cache_add
  |  lru_cache_add
  |  putback_lru_page
  |  migrate_pages
  |  migrate_misplaced_page
  |  do_numa_page
  |  handle_mm_fault
  |  __do_page_fault
  |  do_page_fault
  |  page_fault

So, the majority of the time is spent in numa page migration. Can you
disable numa_balancing? I'm not sure if your kernel version does that at
runtime or whether you need to reboot.
The kernel.numa_balancing sysctl might work. Otherwise you probably need
to boot with numa_balancing=0.

It'd also be worthwhile to test this with numactl --interleave.
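
Concretely, the two experiments look something like this (whether the sysctl
exists depends on the kernel build; the data directory path is a placeholder):

  # runtime switch, if available on this kernel
  sudo sysctl -w kernel.numa_balancing=0
  # otherwise: add numa_balancing=0 to the kernel command line and reboot

  # separate test: interleave postgres memory across nodes at startup
  pg_ctl -D /srv/pgdata stop -m fast
  numactl --interleave=all pg_ctl -D /srv/pgdata start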

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] 60 core performance with 9.3

2014-07-11 Thread Mark Kirkwood

On 11/07/14 20:22, Andres Freund wrote:

On 2014-07-11 12:40:15 +1200, Mark Kirkwood wrote:



Postgres 9.4 beta
rwlock patch
pgbench scale = 2000


On that scale - that's bigger than shared_buffers IIRC - I'd not expect
the patch to make much of a difference.



Right - we did test with it bigger (can't recall exactly how big), but 
will retry again after setting the numa parameters below.



#
  8.82%  postgres  [kernel.kallsyms]            [k] _raw_spin_lock_irqsave
   |
   --- _raw_spin_lock_irqsave
  |
  |--75.69%-- pagevec_lru_move_fn
  |  __lru_cache_add
  |  lru_cache_add
  |  putback_lru_page
  |  migrate_pages
  |  migrate_misplaced_page
  |  do_numa_page
  |  handle_mm_fault
  |  __do_page_fault
  |  do_page_fault
  |  page_fault


So, the majority of the time is spent in numa page migration. Can you
disable numa_balancing? I'm not sure if your kernel version does that at
runtime or whether you need to reboot.
The kernel.numa_balancing sysctl might work. Otherwise you probably need
to boot with numa_balancing=0.

It'd also be worthwhile to test this with numactl --interleave.



That was my feeling too - but I had no idea what the magic switch was to 
tame it (appears to be in 3.13 kernels), will experiment and report 
back. Thanks again!


Mark





Re: [PERFORM] 60 core performance with 9.3

2014-07-11 Thread Kevin Grittner
Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:
 On 11/07/14 20:22, Andres Freund wrote:

 So, the majority of the time is spent in numa page migration.
 Can you disable numa_balancing? I'm not sure if your kernel
 version does that at runtime or whether you need to reboot.
 The kernel.numa_balancing sysctl might work. Otherwise you
 probably need to boot with numa_balancing=0.

 It'd also be worthwhile to test this with numactl --interleave.

 That was my feeling too - but I had no idea what the magic switch
 was to tame it (appears to be in 3.13 kernels), will experiment
 and report back. Thanks again!

It might be worth a test using a cpuset to interleave OS cache and
the NUMA patch I submitted to the current CF to see whether this is
getting into territory where the patch makes a bigger difference. 
I would expect it to do much better than using numactl --interleave
because work_mem and other process-local memory would be allocated
in near memory for each process.

http://www.postgresql.org/message-id/1402267501.4.yahoomail...@web122304.mail.ne1.yahoo.com

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PERFORM] 60 core performance with 9.3

2014-07-10 Thread Mark Kirkwood

On 01/07/14 22:13, Andres Freund wrote:

On 2014-07-01 21:48:35 +1200, Mark Kirkwood wrote:

- cherry picking the last 5 commits into 9.4 branch and building a package
from that and retesting:

Clients | 9.4 tps 60 cores (rwlock)
+--
6   |  70189
12  | 128894
24  | 233542
48  | 422754
96  | 590796
192 | 630672

Wow - that is more like it! Andres that is some nice work, we definitely owe
you some beers for that :-) I am aware that I need to retest with an
unpatched 9.4 src - as it is not clear from this data how much is due to
Andres's patches and how much to the steady stream of 9.4 development. I'll
post an update on that later, but figured this was interesting enough to
note for now.


Cool. That's what I like (and expect) to see :). I don't think unpatched
9.4 will show significantly different results than 9.3, but it'd be good
to validate that. If you do so, could you post the results in the
-hackers thread I just CCed you on? That'll help the work to get into
9.5.


So we seem to have nailed read only performance. Going back and 
revisiting read write performance finds:


Postgres 9.4 beta
rwlock patch
pgbench scale = 2000

max_connections = 200;
shared_buffers = 10GB;
maintenance_work_mem = 1GB;
effective_io_concurrency = 10;
wal_buffers = 32MB;
checkpoint_segments = 192;
checkpoint_completion_target = 0.8;

clients  | tps (32 cores) | tps
-++-
6|   8313 |   8175
12   |  11012 |  14409
24   |  16151 |  17191
48   |  21153 |  23122
96   |  21977 |  22308
192  |  22917 |  23109


So we are back to not doing significantly better than 32 cores. Hmmm. 
Doing quite a few more tweaks gets some better numbers:


kernel.sched_autogroup_enabled=0
kernel.sched_migration_cost_ns=500
net.core.somaxconn=1024
/sys/kernel/mm/transparent_hugepage/enabled [never]

+checkpoint_segments = 1920
+wal_buffers = 256MB;
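
Applied from a root shell, the OS-level part of that looks roughly like the 
following (values exactly as listed above; the THP path can differ between 
distributions):

  sysctl -w kernel.sched_autogroup_enabled=0
  sysctl -w kernel.sched_migration_cost_ns=500
  sysctl -w net.core.somaxconn=1024
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  # checkpoint_segments and wal_buffers go in postgresql.conf and need a restart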


clients  | tps
-+-
6|   8366
12   |  15988
24   |  19828
48   |  30315
96   |  31649
192  |  29497

One more:

+wal_sync_method = open_datasync

clients  | tps
-+-
6|  9566
12   | 17129
24   | 22962
48   | 34564
96   | 32584
192  | 28367

So this looks better - however I suspect 32 core performance would 
improve with these as well!


The problem does *not* look to be connected with IO (I will include some 
iostat below). So time to get the profiler out (192 clients for 1 minute):


Full report http://paste.ubuntu.com/886/

# 
# captured on: Fri Jul 11 03:09:06 2014
# hostname : ncel-prod-db3
# os release : 3.13.0-24-generic
# perf version : 3.13.9
# arch : x86_64
# nrcpus online : 60
# nrcpus avail : 60
# cpudesc : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
# cpuid : GenuineIntel,6,62,7
# total memory : 1056692116 kB
# cmdline : /usr/lib/linux-tools-3.13.0-24/perf record -ag
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 
= 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, 
precise_ip = 0, attr_mmap2 = 0, attr_mmap  = 1, attr_mmap_data = 0

# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: cpu = 4, uncore_cbox_10 = 17, uncore_cbox_11 = 18, 
uncore_cbox_12 = 19, uncore_cbox_13 = 20, uncore_cbox_14 = 21, software 
= 1, uncore_irp = 33, uncore_pcu = 22, tracepoint = 2, uncore_imc_0 = 
25, uncore_imc_1 = 26, uncore_imc_2 = 27, uncore_imc_3 = 28, 
uncore_imc_4 = 29, uncore_imc_5 = 30, uncore_imc_6 = 31, uncore_imc_7 = 
32, uncore_qpi_0 = 34, uncore_qpi_1 = 35, uncore_qpi_2 = 36, 
uncore_cbox_0 = 7, uncore_cbox_1 = 8, uncore_cbox_2 = 9, uncore_cbox_3 = 
10, uncore_cbox_4 = 11, uncore_cbox_5 = 12, uncore_cbox_6 = 13, 
uncore_cbox_7 = 14, uncore_cbox_8 = 15, uncore_cbox_9 = 16, 
uncore_r2pcie = 37, uncore_r3qpi_0 = 38, uncore_r3qpi_1 = 39, breakpoint 
= 5, uncore_ha_0 = 23, uncore_ha_1 = 24, uncore_ubox = 6

# 
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 359906321606
#
# Overhead  Command   Shared Object        Symbol
# ........  ........  ...................  ..............................

#
 8.82%  postgres  [kernel.kallsyms]            [k] _raw_spin_lock_irqsave

  |
  --- _raw_spin_lock_irqsave
 |
 |--75.69%-- pagevec_lru_move_fn
 |  __lru_cache_add
 |  lru_cache_add
 |  putback_lru_page
 |  migrate_pages
 |  migrate_misplaced_page
 |  do_numa_page
 |  handle_mm_fault
 |  __do_page_fault
 |  do_page_fault
  

Re: [PERFORM] 60 core performance with 9.3

2014-07-01 Thread Mark Kirkwood

On 27/06/14 21:19, Andres Freund wrote:

On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote:

My feeling is spinlock or similar, 'perf top' shows

kernel find_busiest_group
kernel _raw_spin_lock

as the top time users.


Those don't tell that much by themselves, could you do a hierarchical
profile? I.e. perf record -ga? That'll at least give the callers for
kernel level stuff. For more information compile postgres with
-fno-omit-frame-pointer.



Unfortunately this did not help - had lots of unknown symbols from 
postgres in the profile - I'm guessing the Ubuntu postgresql-9.3 package 
needs either the -dev package or to be rebuilt with the enable profile 
option (debug and no-omit-frame-pointer seem to be there already).


However further investigation did uncover *very* interesting things. 
Firstly I had previously said that read only performance looked 
ok...this was wrong, purely based on comparison to Robert's blog post. 
Rebooting the 60 core box with 32 cores enabled showed that we got 
*better* scaling performance in the read only case and illustrated we 
were hitting a serious regression with more cores. At this point data is 
needed:


Test: pgbench
Options: scale 500
 read only
Os: Ubuntu 14.04
Pg: 9.3.4
Pg Options:
max_connections = 200
shared_buffers = 10GB
maintenance_work_mem = 1GB
effective_io_concurrency = 10
wal_buffers = 32MB
checkpoint_segments = 192
checkpoint_completion_target = 0.8
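
For reference, the runs behind the results below would have been driven by 
something like the following (the client list and per-run duration are 
assumptions; only "scale 500, read only" is stated above):

  pgbench -i -s 500 pgbench                  # one-off initialisation
  for c in 6 12 24 48 96 192; do
      pgbench -S -c $c -j $c -T 300 pgbench  # -S = SELECT-only (read only)
  done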


Results

Clients | 9.3 tps 32 cores | 9.3 tps 60 cores
+--+-
6   |  70400   |  71028
12  |  98918   | 129140
24  | 230345   | 240631
48  | 324042   | 409510
96  | 346929   | 120464
192 | 312621   |  92663

So we have anti-scaling with 60 cores as we increase the client 
connections. Ouch! A level of urgency led to trying out Andres's 'rwlock' 
9.4 branch [1] - cherry-picking the last 5 commits into the 9.4 branch and 
building a package from that and retesting:


Clients | 9.4 tps 60 cores (rwlock)
+--
6   |  70189
12  | 128894
24  | 233542
48  | 422754
96  | 590796
192 | 630672

Wow - that is more like it! Andres that is some nice work, we definitely 
owe you some beers for that :-) I am aware that I need to retest with an 
unpatched 9.4 src - as it is not clear from this data how much is due to 
Andres's patches and how much to the steady stream of 9.4 development. 
I'll post an update on that later, but figured this was interesting 
enough to note for now.



Regards

Mark

[1] from git://git.postgresql.org/git/users/andresfreund/postgres.git, 
commits:

4b82477dcaf81ad7b0c102f4b66e479a5eb9504a
10d72b97f108b6002210ea97a414076a62302d4e
67ffebe50111743975d54782a3a94b15ac4e755f
fe686ed18fe132021ee5e557c67cc4d7c50a1ada
f2378dc2fa5b73c688f696704976980bab90c611





Re: [PERFORM] 60 core performance with 9.3

2014-07-01 Thread Mark Kirkwood

On 01/07/14 21:48, Mark Kirkwood wrote:


[1] from git://git.postgresql.org/git/users/andresfreund/postgres.git,
commits:
4b82477dcaf81ad7b0c102f4b66e479a5eb9504a
10d72b97f108b6002210ea97a414076a62302d4e
67ffebe50111743975d54782a3a94b15ac4e755f
fe686ed18fe132021ee5e557c67cc4d7c50a1ada
f2378dc2fa5b73c688f696704976980bab90c611




Hmmm, that should read "last 5 commits in 'rwlock-contention'", and I had 
pasted the commit numbers from my tree, not Andres's - sorry, here are the 
right ones:

472c87400377a7dc418d8b77e47ba08f5c89b1bb
e1e549a8e42b753cc7ac60e914a3939584cb1c56
65c2174469d2e0e7c2894202dc63b8fa6f8d2a7f
959aa6e0084d1264e5b228e5a055d66e5173db7d
a5c3ddaef0ee679cf5e8e10d59e0a1fe9f0f1893
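
For completeness, applying them looks roughly like this (remote URL from the 
footnote in the previous mail; the pick order below simply follows the list 
above and may need adjusting):

  git remote add andres git://git.postgresql.org/git/users/andresfreund/postgres.git
  git fetch andres
  # in a checkout of the 9.4 sources:
  git cherry-pick 472c874 e1e549a 65c2174 959aa6e a5c3dda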






Re: [PERFORM] 60 core performance with 9.3

2014-07-01 Thread Andres Freund
On 2014-07-01 21:48:35 +1200, Mark Kirkwood wrote:
 On 27/06/14 21:19, Andres Freund wrote:
 On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote:
 My feeling is spinlock or similar, 'perf top' shows
 
 kernel find_busiest_group
 kernel _raw_spin_lock
 
 as the top time users.
 
 Those don't tell that much by themselves, could you do a hierarchical
 profile? I.e. perf record -ga? That'll at least give the callers for
 kernel level stuff. For more information compile postgres with
 -fno-omit-frame-pointer.
 
 
 Unfortunately this did not help - had lots of unknown symbols from postgres
 in the profile - I'm guessing the Ubuntu postgresql-9.3 package needs either
 the -dev package or to be rebuilt with the enable profile option (debug and
 no-omit-frame-pointer seem to be there already).

You need to install the -dbg package. My bet is you'll see s_lock high
in the profile, called mainly from the procarray and buffer mapping
lwlocks.
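
On Ubuntu that amounts to roughly the following (package name assumed from
the distro's usual debug-package convention):

  sudo apt-get install postgresql-9.3-dbg
  # then re-run perf; the postgres frames should now resolve to symbols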

 Test: pgbench
 Options: scale 500
  read only
 Os: Ubuntu 14.04
 Pg: 9.3.4
 Pg Options:
 max_connections = 200

Just as an experiment I'd suggest increasing max_connections by one and
two and quickly retesting - there are some cacheline alignment issues that
aren't fixed yet that happen to vanish with some max_connections
settings.

 shared_buffers = 10GB
 maintenance_work_mem = 1GB
 effective_io_concurrency = 10
 wal_buffers = 32MB
 checkpoint_segments = 192
 checkpoint_completion_target = 0.8
 
 
 Results
 
 Clients | 9.3 tps 32 cores | 9.3 tps 60 cores
 +--+-
 6   |  70400   |  71028
 12  |  98918   | 129140
 24  | 230345   | 240631
 48  | 324042   | 409510
 96  | 346929   | 120464
 192 | 312621   |  92663
 
 So we have anti scaling with 60 cores as we increase the client connections.
 Ouch! A level of urgency led to trying out Andres's 'rwlock' 9.4 branch [1]
 - cherry picking the last 5 commits into 9.4 branch and building a package
 from that and retesting:
 
 Clients | 9.4 tps 60 cores (rwlock)
 +--
 6   |  70189
 12  | 128894
 24  | 233542
 48  | 422754
 96  | 590796
 192 | 630672
 
 Wow - that is more like it! Andres that is some nice work, we definitely owe
 you some beers for that :-) I am aware that I need to retest with an
 unpatched 9.4 src - as it is not clear from this data how much is due to
 Andres's patches and how much to the steady stream of 9.4 development. I'll
 post an update on that later, but figured this was interesting enough to
 note for now.

Cool. That's what I like (and expect) to see :). I don't think unpatched
9.4 will show significantly different results than 9.3, but it'd be good
to validate that. If you do so, could you post the results in the
-hackers thread I just CCed you on? That'll help the work to get into
9.5.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] 60 core performance with 9.3

2014-06-27 Thread Andres Freund
On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote:
 My feeling is spinlock or similar, 'perf top' shows
 
 kernel find_busiest_group
 kernel _raw_spin_lock
 
 as the top time users.

Those don't tell that much by themselves, could you do a hierarchical
profile? I.e. perf record -ga? That'll at least give the callers for
kernel level stuff. For more information compile postgres with
-fno-omit-frame-pointer.
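
In practice that is something like the following, run while the benchmark is
active (the 60-second sampling window is arbitrary):

  sudo perf record -a -g -- sleep 60   # system-wide, with call graphs
  sudo perf report -g                  # hierarchical view including kernel callers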

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [PERFORM] 60 core performance with 9.3

2014-06-27 Thread Mark Kirkwood

On 27/06/14 21:19, Andres Freund wrote:

On 2014-06-27 14:28:20 +1200, Mark Kirkwood wrote:

My feeling is spinlock or similar, 'perf top' shows

kernel find_busiest_group
kernel _raw_spin_lock

as the top time users.


Those don't tell that much by themselves, could you do a hierarchical
profile? I.e. perf record -ga? That'll at least give the callers for
kernel level stuff. For more information compile postgres with
-fno-omit-frame-pointer.



Excellent suggestion, will do next week!

regards

Mark





[PERFORM] 60 core performance with 9.3

2014-06-26 Thread Mark Kirkwood

I have a nice toy to play with: Dell R920 with 60 cores and 1TB ram [1].

The context is that the current machine in use by the customer is a 32 core 
one, and due to growth we are looking at something larger (hence 60 cores).


Some initial tests show similar pgbench read only performance to what 
Robert found here 
http://rhaas.blogspot.co.nz/2012/04/did-i-say-32-cores-how-about-64.html 
(actually a bit quicker around 40 tps).


However doing a mixed read-write workload is getting results the same or 
only marginally quicker than the 32 core machine - particularly at higher 
numbers of clients (e.g. 200 - 500). I have yet to break out the 
perf toolset, but I'm wondering if any folks have compared 32 and 60 (or 
64) core read-write pgbench performance?


regards

Mark

[1] Details:

4x E7-4890 15 cores each.
1 TB ram
16x Toshiba PX02SS SATA SSD
4x Samsung NVMe XS1715 PCIe SSD

Ubuntu 14.04  (Linux 3.13)





Re: [PERFORM] 60 core performance with 9.3

2014-06-26 Thread Scott Marlowe
On Thu, Jun 26, 2014 at 5:49 PM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz wrote:
 I have a nice toy to play with: Dell R920 with 60 cores and 1TB ram [1].

 The context is the current machine in use by the customer is a 32 core one,
 and due to growth we are looking at something larger (hence 60 cores).

 Some initial tests show similar pgbench read only performance to what Robert
 found here
 http://rhaas.blogspot.co.nz/2012/04/did-i-say-32-cores-how-about-64.html
 (actually a bit quicker around 40 tps).

 However doing a mixed read-write workload is getting results the same or
 only marginally quicker than the 32 core machine - particularly at higher
 number of clients (e.g 200 - 500). I have yet to break out the perf toolset,
 but I'm wondering if any folk has compared 32 and 60 (or 64) core read write
 pgbench performance?

My guess is that the read only test is CPU / memory bandwidth limited,
but the mixed test is IO bound.

What's your iostat / vmstat / iotop etc look like when you're doing
both read only and read/write mixed?




Re: [PERFORM] 60 core performance with 9.3

2014-06-26 Thread Mark Kirkwood

On 27/06/14 14:01, Scott Marlowe wrote:

On Thu, Jun 26, 2014 at 5:49 PM, Mark Kirkwood
mark.kirkw...@catalyst.net.nz wrote:

I have a nice toy to play with: Dell R920 with 60 cores and 1TB ram [1].

The context is the current machine in use by the customer is a 32 core one,
and due to growth we are looking at something larger (hence 60 cores).

Some initial tests show similar pgbench read only performance to what Robert
found here
http://rhaas.blogspot.co.nz/2012/04/did-i-say-32-cores-how-about-64.html
(actually a bit quicker around 40 tps).

However doing a mixed read-write workload is getting results the same or
only marginally quicker than the 32 core machine - particularly at higher
number of clients (e.g 200 - 500). I have yet to break out the perf toolset,
but I'm wondering if any folk has compared 32 and 60 (or 64) core read write
pgbench performance?


My guess is that the read only test is CPU / memory bandwidth limited,
but the mixed test is IO bound.

What's your iostat / vmstat / iotop etc look like when you're doing
both read only and read/write mixed?




That was what I would have thought too, but it does not appear to be the 
case; here is a typical iostat:


Device:   rrqm/s  wrqm/s   r/s       w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda         0.00    0.00  0.00      0.00    0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
nvme0n1     0.00    0.00  0.00   4448.00    0.00  41.47     19.10      0.14   0.03     0.00     0.03   0.03  14.40
nvme1n1     0.00    0.00  0.00   4448.00    0.00  41.47     19.10      0.15   0.03     0.00     0.03   0.03  15.20
nvme2n1     0.00    0.00  0.00   4549.00    0.00  42.20     19.00      0.15   0.03     0.00     0.03   0.03  15.20
nvme3n1     0.00    0.00  0.00   4548.00    0.00  42.19     19.00      0.16   0.04     0.00     0.04   0.04  16.00
dm-0        0.00    0.00  0.00      0.00    0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
md0         0.00    0.00  0.00  17961.00    0.00  83.67      9.54      0.00   0.00     0.00     0.00   0.00   0.00
dm-1        0.00    0.00  0.00      0.00    0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
dm-2        0.00    0.00  0.00      0.00    0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
dm-3        0.00    0.00  0.00      0.00    0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
dm-4        0.00    0.00  0.00      0.00    0.00   0.00      0.00      0.00   0.00     0.00     0.00   0.00   0.00
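
For reference, output in that shape typically comes from the extended MB/s
form of iostat (the exact flags weren't stated in the thread, so this is an
assumption):

  iostat -xm 2    # extended per-device stats in MB/s, 2-second intervals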



My feeling is spinlock or similar, 'perf top' shows

kernel find_busiest_group
kernel _raw_spin_lock

as the top time users.

