Re: [PERFORM] Parallel Select query performance and shared buffers

2013-12-05 Thread Metin Doslu
> From what I've seen so far the bigger problem than contention in the
> lwlocks itself, is the spinlock protecting the lwlocks...

Postgres 9.3.1 also reports spindelay; it seems that there is no contention
on the spinlocks.

PID 21121 lwlock 0: shacq 0 exacq 33 blk 1 spindelay 0
PID 21121 lwlock 33: shacq 7602 exacq 14688 blk 4381 spindelay 0
PID 21121 lwlock 34: shacq 7826 exacq 15113 blk 3786 spindelay 0
PID 21121 lwlock 35: shacq 7792 exacq 15110 blk 3356 spindelay 0
PID 21121 lwlock 36: shacq 7803 exacq 15125 blk 3075 spindelay 0
PID 21121 lwlock 37: shacq 7822 exacq 15177 blk 2756 spindelay 0
PID 21121 lwlock 38: shacq 7694 exacq 14863 blk 2513 spindelay 0
PID 21121 lwlock 39: shacq 7914 exacq 15320 blk 2400 spindelay 0
PID 21121 lwlock 40: shacq 7855 exacq 15203 blk 2220 spindelay 0
PID 21121 lwlock 41: shacq 7942 exacq 15363 blk 1996 spindelay 0
PID 21121 lwlock 42: shacq 7828 exacq 15115 blk 1872 spindelay 0
PID 21121 lwlock 43: shacq 7820 exacq 15159 blk 1833 spindelay 0
PID 21121 lwlock 44: shacq 7709 exacq 14916 blk 1590 spindelay 0
PID 21121 lwlock 45: shacq 7831 exacq 15134 blk 1619 spindelay 0
PID 21121 lwlock 46: shacq 7744 exacq 14989 blk 1559 spindelay 0
PID 21121 lwlock 47: shacq 7808 exacq 15111 blk 1473 spindelay 0
PID 21121 lwlock 48: shacq 7729 exacq 14929 blk 1381 spindelay 0


Re: [PERFORM] [HACKERS] Parallel Select query performance and shared buffers

2013-12-05 Thread Metin Doslu
> You tested the correct branch, right? Which commit does "git rev-parse
> HEAD" show?

I applied the last two patches manually on PostgreSQL 9.2 Stable.


Re: [PERFORM] [HACKERS] Parallel Select query performance and shared buffers

2013-12-05 Thread Metin Doslu
> You could try my lwlock-scalability improvement patches - for some
> workloads here, the improvements have been rather noticeable. Which
> version are you testing?

I tried your patches from the link below. As you suspected, I didn't see any
improvements. I tested on PostgreSQL 9.2 Stable.

http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/REL9_2_STABLE-rwlock-contention


On Wed, Dec 4, 2013 at 8:26 PM, Andres Freund wrote:

> On 2013-12-04 20:19:55 +0200, Metin Doslu wrote:
> > - When we increased NUM_BUFFER_PARTITIONS to 1024, the problem
> > disappeared on 8-core machines but came back on 16-core machines on
> > Amazon EC2. Could it be related to the PostgreSQL locking mechanism?
>
> You could try my lwlock-scalability improvement patches - for some
> workloads here, the improvements have been rather noticeable. Which
> version are you testing?
>
> Greetings,
>
> Andres Freund
>
> --
>  Andres Freund http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [PERFORM] Parallel Select query performance and shared buffers

2013-12-05 Thread Metin Doslu
> Is your workload bigger than RAM?

RAM is larger than the workload (by more than a factor of two).

> I think a good bit of the contention
> you're seeing in that listing is populating shared_buffers - and might
> actually vanish once you're halfway cached.
> From what I've seen so far the bigger problem than contention in the
> lwlocks itself, is the spinlock protecting the lwlocks...

Could you clarify a bit what you mean by "halfway cached" and "spinlock
protecting the lwlocks"?


Re: [PERFORM] Parallel Select query performance and shared buffers

2013-12-05 Thread Metin Doslu
> - When we increased NUM_BUFFER_PARTITIONS to 1024, the problem
> disappeared on 8-core machines but came back on 16-core machines on
> Amazon EC2. Could it be related to the PostgreSQL locking mechanism?

If we build with -DLWLOCK_STATS to print locking stats from PostgreSQL, we
see tons of contention with the default value of NUM_BUFFER_PARTITIONS, which
is 16:

$ tail -f /tmp/logfile | grep lwlock | egrep -v "blk 0"
...
PID 15965 lwlock 0: shacq 0 exacq 33 blk 2
PID 15965 lwlock 34: shacq 14010 exacq 27134 blk 6192
PID 15965 lwlock 35: shacq 14159 exacq 27397 blk 5426
PID 15965 lwlock 36: shacq 14111 exacq 27322 blk 4959
PID 15965 lwlock 37: shacq 14211 exacq 27507 blk 4370
PID 15965 lwlock 38: shacq 14110 exacq 27294 blk 3980
PID 15965 lwlock 39: shacq 13962 exacq 27027 blk 3719
PID 15965 lwlock 40: shacq 14023 exacq 27156 blk 3273
PID 15965 lwlock 41: shacq 14107 exacq 27309 blk 3201
PID 15965 lwlock 42: shacq 14120 exacq 27304 blk 2904
PID 15965 lwlock 43: shacq 14007 exacq 27129 blk 2740
PID 15965 lwlock 44: shacq 13948 exacq 27027 blk 2616
PID 15965 lwlock 45: shacq 14041 exacq 27198 blk 2431
PID 15965 lwlock 46: shacq 14067 exacq 27277 blk 2345
PID 15965 lwlock 47: shacq 14050 exacq 27203 blk 2106
PID 15965 lwlock 48: shacq 13910 exacq 26910 blk 2155
PID 15965 lwlock 49: shacq 14170 exacq 27360 blk 1989

After we increased NUM_BUFFER_PARTITIONS to 1024, lock contention decreased:
...
PID 25220 lwlock 1000: shacq 247 exacq 494 blk 1
PID 25220 lwlock 1001: shacq 198 exacq 394 blk 1
PID 25220 lwlock 1002: shacq 203 exacq 404 blk 1
PID 25220 lwlock 1003: shacq 226 exacq 452 blk 1
PID 25220 lwlock 1004: shacq 235 exacq 470 blk 1
PID 25220 lwlock 1006: shacq 226 exacq 452 blk 2
PID 25220 lwlock 1007: shacq 214 exacq 428 blk 1
PID 25220 lwlock 1008: shacq 225 exacq 448 blk 1
PID 25220 lwlock 1010: shacq 209 exacq 418 blk 1
PID 25220 lwlock 1015: shacq 199 exacq 398 blk 1
PID 25220 lwlock 1016: shacq 214 exacq 426 blk 1
PID 25220 lwlock 1018: shacq 230 exacq 456 blk 1
PID 25220 lwlock 1019: shacq 222 exacq 444 blk 3
PID 25220 lwlock 1023: shacq 262 exacq 524 blk 1
PID 25220 lwlock 1027: shacq 213 exacq 426 blk 1
PID 25220 lwlock 1028: shacq 246 exacq 491 blk 1
PID 25220 lwlock 1029: shacq 226 exacq 452 blk 1
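
For anyone wanting to reproduce these numbers, here is a rough sketch of the
build steps; the prefix, -j value, and file edit are illustrative, not the
exact commands we ran.

# Sketch only: before building, edit src/include/storage/lwlock.h and raise
# the "#define NUM_BUFFER_PARTITIONS" value from 16 to 1024.
./configure --prefix=$HOME/pgsql CPPFLAGS="-DLWLOCK_STATS"
make -j8 && make install
# With LWLOCK_STATS, each backend writes its lock counters to the server's
# stderr when it exits, which is why the stats can be picked out of the
# logfile with the grep shown above.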


[PERFORM] Parallel Select query performance and shared buffers

2013-12-04 Thread Metin Doslu
We have several independent tables on a multi-core machine serving Select
queries. These tables fit into memory, and each Select query goes over
one table's pages sequentially. In this experiment, there are no indexes or
table joins.

When we send concurrent Select queries to these tables, query performance
doesn't scale out with the number of CPU cores. We find that complex Select
queries scale out better than simpler ones. We also find that increasing
the block size from 8 KB to 32 KB, or increasing shared_buffers to include
the working set mitigates the problem to some extent.

For our experiments, we chose an 8-core machine with 68 GB of memory from
Amazon's EC2 service. We installed PostgreSQL 9.3.1 on the instance, and
set shared_buffers to 4 GB.

We then generated 1, 2, 4, and 8 separate tables using the data generator
from the industry standard TPC-H benchmark. Each table we generated, called
lineitem-1, lineitem-2, etc., had about 750 MB of data. Next, we sent 1, 2,
4, and 8 concurrent Select queries to these tables to observe the scale out
behavior. Our expectation was that since this machine had 8 cores, our run
times would stay constant throughout. Also, we would have expected the
machine's CPU utilization to go up to 100% at 8 concurrent queries. Neither
of those assumptions held true.
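
(For concreteness, the concurrency test can be driven with something as
simple as the sketch below; the table names, database name, and psql
invocation are illustrative rather than our exact harness.)

#!/bin/bash
# Run N identical count(*) queries against N independent tables in parallel
# and time the whole batch; assumes tables lineitem_1 .. lineitem_N exist.
N=8
time (
  for i in $(seq 1 "$N"); do
    psql -d postgres -c "SELECT count(*) FROM lineitem_${i};" >/dev/null &
  done
  wait
)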

We found that query run times degraded as we increased the number of
concurrent Select queries. Also, CPU utilization flattened out at less than
50% for the simpler queries. Full results with block size of 8KB are below:

                      select count(*)   TPC-H Simple (#6)[2]   TPC-H Complex (#1)[1]
1 Table  / 1 query    1.5 s             2.5 s                  8.4 s
2 Tables / 2 queries  1.5 s             2.5 s                  8.4 s
4 Tables / 4 queries  2.0 s             2.9 s                  8.8 s
8 Tables / 8 queries  3.3 s             4.0 s                  9.6 s

We then increased the block size (BLCKSZ) from 8 KB to 32 KB and recompiled
PostgreSQL. This change had a positive impact on query completion times.
Here are the new results with block size of 32 KB:

                      select count(*)   TPC-H Simple (#6)[2]   TPC-H Complex (#1)[1]
1 Table  / 1 query    1.5 s             2.3 s                  8.0 s
2 Tables / 2 queries  1.5 s             2.3 s                  8.0 s
4 Tables / 4 queries  1.6 s             2.4 s                  8.1 s
8 Tables / 8 queries  1.8 s             2.7 s                  8.3 s
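
(For reference, the 32 KB build can be produced with the standard
--with-blocksize configure switch; this is only a sketch of the rebuild, not
our exact commands.)

# BLCKSZ is fixed at compile time; --with-blocksize takes the size in kB.
./configure --with-blocksize=32
make clean && make -j8 && make install
# Note: a data directory initialized with 8 KB blocks cannot be reused by a
# 32 KB build, so a fresh initdb is needed.
initdb -D /path/to/new/data/dir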

As a quick aside, we also repeated the same experiment on an EC2 instance
with 16 CPU cores, and found that the scale out behavior became worse
there. (We also tried increasing shared_buffers to 30 GB. This change
completely solved the scaling out problem on that instance type, but hurt
our performance on the hi1.4xlarge instances.)

Unfortunately, increasing the block size from 8 to 32 KB has other
implications for some of our customers. Could you help us out with the
problem here?

What can we do to identify the problem's root cause? Can we work around it?

Thank you,
Metin

[1] http://examples.citusdata.com/tpch_queries.html#query-1
[2] http://examples.citusdata.com/tpch_queries.html#query-6


Re: [PERFORM] [HACKERS] Parallel Select query performance and shared buffers

2013-12-04 Thread Metin Doslu
> You could try my lwlock-scalability improvement patches - for some
> workloads here, the improvements have been rather noticeable. Which
> version are you testing?

I'm testing with PostgreSQL 9.3.1.


Re: [PERFORM] Parallel Select query performance and shared buffers

2013-12-04 Thread Metin Doslu
Here is some extra information:

- When we increased NUM_BUFFER_PARTITIONS to 1024, the problem disappeared
on 8-core machines but came back on 16-core machines on Amazon EC2. Could it
be related to the PostgreSQL locking mechanism?

- I tried this test with 4-core machines, including my personal computer and
some other instances on Amazon EC2, and I didn't see this problem with 4-core
machines. I started to see this problem in PostgreSQL when the core count is
8 or more.

- Here are the results of "vmstat 1" while running 8 parallel select
count(*). Normally I would expect zero idle time.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff    cache    si   so    bi    bo    in     cs us sy  id wa st
 0  0      0 29838640  94000 38954740    0    0     0     0    22     21  0  0 100  0  0
 7  2      0 29788416  94000 38954740    0    0     0     0 53922 108490 14 24  60  1  1
 5  0      0 29747248  94000 38954740    0    0     0     0 68008 164571 22 48  27  2  1
 8  0      0 29725796  94000 38954740    0    0     0     0 43587 150574 28 54  16  1  1
 0  0      0 29838328  94000 38954740    0    0     0     0 15584 100459 26 55  18  1  0
 0  0      0 29838328  94000 38954740    0    0     0     0    42     15  0  0 100  0  0

- When I run 8 parallel wc commands or other scripts, they scale out as
expected and utilize all CPUs. This leads me to think that the problem is
related to PostgreSQL rather than the OS.


Re: [PERFORM] [HACKERS] Parallel Select query performance and shared buffers

2013-12-04 Thread Metin Doslu
> Didn't follow the thread from the start. So, this is EC2? Have you
> checked, with a recent enough version of top or whatever, how much time
> is reported as "stolen"?

Yes, this is EC2. "stolen" is occasionally reported as 1, but mostly as 0.


Re: [PERFORM] [HACKERS] Parallel Select query performance and shared buffers

2013-12-04 Thread Metin Doslu
> You could try HVM. I've noticed it fares better under heavy CPU load,
> and it's not fully-HVM (it still uses paravirtualized network and
> I/O).

I already tried with HVM (a cc2.8xlarge instance on Amazon EC2) and observed
the same problem.


Re: [PERFORM] [HACKERS] Parallel Select query performance and shared buffers

2013-12-04 Thread Metin Doslu
> I'd strongly suggest doing a "perf record -g -a ;
> perf report" run to check what's eating up the time.

Here is one example:

+  38.87%   swapper  [kernel.kallsyms]   [k] hypercall_page
+   9.32%  postgres  [kernel.kallsyms]   [k] hypercall_page
+   6.80%  postgres  [kernel.kallsyms]   [k] xen_set_pte_at
+   5.83%  postgres  [kernel.kallsyms]   [k] copy_user_generic_string
+   2.06%  postgres  [kernel.kallsyms]   [k] file_read_actor
+   1.89%  postgres  postgres[.] heapgettup_pagemode
+   1.83%  postgres  postgres[.] hash_search_with_hash_value
+   1.33%  postgres  [kernel.kallsyms]   [k] get_phys_to_machine
+   1.25%  postgres  [kernel.kallsyms]   [k] find_get_page
+   1.00%  postgres  postgres[.] heapgetpage
+   0.99%  postgres  [kernel.kallsyms]   [k] radix_tree_lookup_element
+   0.98%  postgres  postgres[.] advance_aggregates
+   0.96%  postgres  postgres[.] ExecProject
+   0.94%  postgres  postgres[.] advance_transition_function
+   0.88%  postgres  postgres[.] ExecScan
+   0.87%  postgres  postgres[.] HeapTupleSatisfiesMVCC
+   0.86%  postgres  postgres[.] LWLockAcquire
+   0.82%  postgres  [kernel.kallsyms]   [k] put_page
+   0.82%  postgres  postgres[.] MemoryContextReset
+   0.80%  postgres  postgres[.] SeqNext
+   0.78%  postgres  [kernel.kallsyms]   [k] pte_mfn_to_pfn
+   0.69%  postgres  postgres[.] ExecClearTuple
+   0.57%  postgres  postgres[.] ExecProcNode
+   0.54%  postgres  postgres[.] heap_getnext
+   0.53%  postgres  postgres[.] LWLockRelease
+   0.53%  postgres  postgres[.] ExecStoreTuple
+   0.51%  postgres  libc-2.12.so[.] __GI___libc_read
+   0.42%  postgres  [kernel.kallsyms]   [k] xen_spin_lock
+   0.40%  postgres  postgres[.] ReadBuffer_common
+   0.38%  postgres  [kernel.kallsyms]   [k] __do_fault
+   0.37%  postgres  [kernel.kallsyms]   [k] shmem_fault
+   0.37%  postgres  [kernel.kallsyms]   [k] unmap_single_vma
+   0.35%  postgres  [kernel.kallsyms]   [k] __wake_up_bit
+   0.33%  postgres  postgres[.] StrategyGetBuffer
+   0.33%  postgres  [kernel.kallsyms]   [k] set_page_dirty
+   0.33%  postgres  [kernel.kallsyms]   [k] handle_pte_fault
+   0.33%  postgres  postgres[.] ExecAgg
+   0.31%  postgres  postgres[.] XidInMVCCSnapshot
+   0.31%  postgres  [kernel.kallsyms]   [k] __audit_syscall_entry
+   0.31%  postgres  postgres[.] CheckForSerializableConflictOut
+   0.29%  postgres  [kernel.kallsyms]   [k] handle_mm_fault
+   0.25%  postgres  [kernel.kallsyms]   [k] shmem_getpage_gfp



On Wed, Dec 4, 2013 at 6:33 PM, Andres Freund wrote:

> On 2013-12-04 14:27:10 -0200, Claudio Freire wrote:
> > On Wed, Dec 4, 2013 at 9:19 AM, Metin Doslu  wrote:
> > >
> > > Here are the results of "vmstat 1" while running 8 parallel TPC-H Simple
> > > (#6) queries: Although there is no need for I/O, "wa" fluctuates between
> > > 0 and 1.
> > >
> > > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> > >  r  b   swpd   free   buff    cache    si   so    bi    bo    in     cs us sy  id wa st
> > >  0  0      0 30093568  84892 38723896    0    0     0     0    22     14  0  0 100  0  0
> > >  8  1      0 30043056  84892 38723896    0    0     0     0 27080  52708 16 14  70  0  0
> > >  8  1      0 30006600  84892 38723896    0    0     0     0 44952 118286 43 44  12  1  0
> > >  8  0      0 29986264  84900 38723896    0    0     0    20 28043  95934 49 42   8  1  0
> > >  7  0      0 29991976  84900 38723896    0    0     0     0  8308  73641 52 42   6  0  0
> > >  0  0      0 30091828  84900 38723896    0    0     0     0  3996  30978 23 24  53  0  0
> > >  0  0      0 30091968  84900 38723896    0    0     0     0    17     23  0  0 100  0  0
> >
> >
> > Notice the huge %sy
>
> My bet is on transparent hugepage defragmentation. Alternatively it's
> scheduler overhead, due to superfluous context switches around the buffer
> mapping locks.
>
> I'd strongly suggest doing a "perf record -g -a ;
> perf report" run to check what's eating up the time.
>
> Greetings,
>
> Andres Freund
>
> --
>  Andres Freund http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [PERFORM] [HACKERS] Parallel Select query performance and shared buffers

2013-12-04 Thread Metin Doslu
> Notice the huge %sy
> What kind of VM are you using? HVM or paravirtual?

This instance is paravirtual.


Re: [PERFORM] Parallel Select query performance and shared buffers

2013-12-04 Thread Metin Doslu
> Maybe you could help test this patch:
> http://www.postgresql.org/message-id/20131115194725.gg5...@awork2.anarazel.de

To which repository should I apply these patches? I tried the main
repository, the 9.3 stable branch, and the source code of 9.3.1, and in my
trials at least one of the patches failed. What patch command should I use?


Re: [PERFORM] [HACKERS] Parallel Select query performance and shared buffers

2013-12-04 Thread Metin Doslu
> I think all of this data cannot fit in shared_buffers; you might want to
> increase shared_buffers to a larger size (not 30 GB but close to your data
> size) to see how it behaves.

When I use a shared_buffers setting larger than my data size, such as 10 GB,
results scale nearly as expected, at least for this instance type.
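
(A minimal sketch of that retest, with illustrative values; on 9.3 changing
shared_buffers still requires a server restart.)

# Append an override and restart; 10GB is simply a value that covers the
# working set in this test.
echo "shared_buffers = 10GB" >> "$PGDATA/postgresql.conf"
pg_ctl -D "$PGDATA" restart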

> Your queries have aggregation, ORDER/GROUP BY, so there is a chance
> that I/O can happen for those operations
> if PG doesn't have sufficient memory (work_mem) to perform such operations.

I set work_mem to 32 MB, which should be enough for these queries. I also
tested with higher values of work_mem and didn't observe any difference.

> Can you simplify your queries (simple scan, or in other words no
> aggregation or other things) to see how they behave in your env.? Once you
> are able to see simple queries scaling as per your expectation, you can
> try with complex ones.

Actually, we observe the problem when the queries get simpler, such as
select count(*). Here is the results table in a more compact format:

                      select count(*)   TPC-H Simple (#6)   TPC-H Complex (#1)
1 Table  / 1 query    1.5 s             2.5 s               8.4 s
2 Tables / 2 queries  1.5 s             2.5 s               8.4 s
4 Tables / 4 queries  2.0 s             2.9 s               8.8 s
8 Tables / 8 queries  3.3 s             4.0 s               9.6 s

> Can we have the explain analyze of those queries, postgres
> configuration, perhaps vmstat output during execution?

postgres=# explain analyze SELECT count(*) from lineitem_1;
  QUERY PLAN

--
 Aggregate  (cost=199645.01..199645.02 rows=1 width=0) (actual
time=11317.391..11317.393 rows=1 loops=1)
   ->  Seq Scan on lineitem_1  (cost=0.00..184641.81 rows=6001281 width=0)
(actual time=0.011..5805.255 rows=6001215 loops=1)
 Total runtime: 11317.440 ms
(3 rows)

postgres=# explain analyze SELECT
postgres-# sum(l_extendedprice * l_discount) as revenue
postgres-# FROM
postgres-# lineitem_1
postgres-# WHERE
postgres-# l_shipdate >= date '1994-01-01'
postgres-# AND l_shipdate < date '1994-01-01' + interval '1' year
postgres-# AND l_discount between 0.06 - 0.01 AND 0.06 + 0.01
postgres-# AND l_quantity < 24;
 QUERY PLAN

--
 Aggregate  (cost=260215.36..260215.37 rows=1 width=16) (actual
time=1751.775..1751.776 rows=1 loops=1)
   ->  Seq Scan on lineitem_1  (cost=0.00..259657.82 rows=111508 width=16)
(actual time=0.031..1630.449 rows=114160 loops=1)
 Filter: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate <
'1995-01-01 00:00:00'::timestamp without time zone) AND (l_discount >=
0.05::double precision) AND (l_discount <= 0.07::double precision) AND
 (l_quantity < 24::double precision))
 Rows Removed by Filter: 5887055
 Total runtime: 1751.830 ms
(5 rows)

postgres=# explain analyze SELECT
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
FROM
lineitem_1
WHERE
l_shipdate <= date '1998-12-01' - interval '90' day
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus;
 QUERY PLAN

-
 Sort  (cost=436342.68..436342.69 rows=6 width=36) (actual
time=18720.932..18720.936 rows=4 loops=1)
   Sort Key: l_returnflag, l_linestatus
   Sort Method: quicksort  Memory: 25kB
   ->  HashAggregate  (cost=436342.49..436342.60 rows=6 width=36) (actual
time=18720.887..18720.892 rows=4 loops=1)
 ->  Seq Scan on lineitem_1  (cost=0.00..199645.01 rows=5917437
width=36) (actual time=0.011..6754.619 rows=5916591 loops=1)
   Filter: (l_shipdate <= '1998-09-02 00:00:00'::timestamp
without time zone)
   Rows Removed by Filter: 84624
 Total runtime: 18721.021 ms
(8 rows)


Here are the results of "vmstat 1" while running 8 parallel TPC-H Simple
(#6) queries:  Although there is no need for I/O, "wa" fluctuates between 0
and 1.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff    cache    si   so    bi    bo    in     cs us sy  id wa st
 0  0      0 30093568  84892 38723896    0    0     0     0    22     14  0  0 100  0  0
 8  1      0 30043056  84892 38723896    0    0     0     0 27080  52708 16 14  70  0  0

Re: [PERFORM] Parallel Select query performance and shared buffers

2013-12-03 Thread Metin Doslu
Looking into syncscan.c, the comments say:

"When multiple backends run a sequential scan on the same table, we try to
keep them synchronized to reduce the overall I/O needed."

But in my workload, every process was running on a different table.




On Tue, Dec 3, 2013 at 5:56 PM, Claudio Freire wrote:

> On Tue, Dec 3, 2013 at 10:49 AM, Metin Doslu  wrote:
> > We have several independent tables on a multi-core machine serving Select
> > queries. These tables fit into memory, and each Select query goes over one
> > table's pages sequentially. In this experiment, there are no indexes or
> > table joins.
> >
> > When we send concurrent Select queries to these tables, query performance
> > doesn't scale out with the number of CPU cores. We find that complex Select
> > queries scale out better than simpler ones. We also find that increasing the
> > block size from 8 KB to 32 KB, or increasing shared_buffers to include the
> > working set mitigates the problem to some extent.
> >
> > For our experiments, we chose an 8-core machine with 68 GB of memory from
> > Amazon's EC2 service. We installed PostgreSQL 9.3.1 on the instance, and set
> > shared_buffers to 4 GB.
>
>
> If you are certain your tables fit in RAM, you may want to disable
> synchronized sequential scans, as they will create contention between
> the threads.
>
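
(A minimal way to try that suggestion, assuming the synchronize_seqscans
GUC; the database name and query are illustrative.)

# synchronize_seqscans is a user-settable GUC, so it can be turned off for a
# single test session without touching postgresql.conf.
PGOPTIONS="-c synchronize_seqscans=off" \
  psql -d postgres -c "SELECT count(*) FROM lineitem_1;"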


[PERFORM] Parallel Select query performance and shared buffers

2013-12-03 Thread Metin Doslu
We have several independent tables on a multi-core machine serving Select
queries. These tables fit into memory, and each Select query goes over
one table's pages sequentially. In this experiment, there are no indexes or
table joins.

When we send concurrent Select queries to these tables, query performance
doesn't scale out with the number of CPU cores. We find that complex Select
queries scale out better than simpler ones. We also find that increasing
the block size from 8 KB to 32 KB, or increasing shared_buffers to include
the working set mitigates the problem to some extent.

For our experiments, we chose an 8-core machine with 68 GB of memory from
Amazon's EC2 service. We installed PostgreSQL 9.3.1 on the instance, and
set shared_buffers to 4 GB.

We then generated 1, 2, 4, and 8 separate tables using the data generator
from the industry standard TPC-H benchmark. Each table we generated, called
lineitem-1, lineitem-2, etc., had about 750 MB of data. Next, we sent 1, 2,
4, and 8 concurrent Select queries to these tables to observe the scale out
behavior. Our expectation was that since this machine had 8 cores, our run
times would stay constant throughout. Also, we would have expected the
machine's CPU utilization to go up to 100% at 8 concurrent queries. Neither
of those assumptions held true.

We found that query run times degraded as we increased the number of
concurrent Select queries. Also, CPU utilization flattened out at less than
50% for the simpler queries. Full results with block size of 8KB are below:

                      select count(*)   TPC-H Simple (#6)[2]   TPC-H Complex (#1)[1]
1 Table  / 1 query    1.5 s             2.5 s                  8.4 s
2 Tables / 2 queries  1.5 s             2.5 s                  8.4 s
4 Tables / 4 queries  2.0 s             2.9 s                  8.8 s
8 Tables / 8 queries  3.3 s             4.0 s                  9.6 s

We then increased the block size (BLCKSZ) from 8 KB to 32 KB and recompiled
PostgreSQL. This change had a positive impact on query completion times.
Here are the new results with block size of 32 KB:

                      select count(*)   TPC-H Simple (#6)[2]   TPC-H Complex (#1)[1]
1 Table  / 1 query    1.5 s             2.3 s                  8.0 s
2 Tables / 2 queries  1.5 s             2.3 s                  8.0 s
4 Tables / 4 queries  1.6 s             2.4 s                  8.1 s
8 Tables / 8 queries  1.8 s             2.7 s                  8.3 s

As a quick aside, we also repeated the same experiment on an EC2 instance
with 16 CPU cores, and found that the scale out behavior became worse
there. (We also tried increasing shared_buffers to 30 GB. This change
completely solved the scaling out problem on that instance type, but hurt
our performance on the hi1.4xlarge instances.)

Unfortunately, increasing the block size from 8 to 32 KB has other
implications for some of our customers. Could you help us out with the
problem here?

What can we do to identify the problem's root cause? Can we work around it?

Thank you,
Metin

[1] http://examples.citusdata.com/tpch_queries.html#query-1
[2] http://examples.citusdata.com/tpch_queries.html#query-6