[HACKERS] Parallel Index-only scan

Rafia Sabih Sun, 11 Dec 2016 22:41:05 -0800

Hello all,
This is to propose a patch for enabling parallel index-only scans. With the
patch of parallel index scan around [1], adding the mechanism for
parallelising index-only scans makes sense. Without this mechanism for the
queries preferring index-only scans, it is likely that at higher scale
parallel index or parallel seq scan serve as cheaper alternative to
index-only scans and then query performance might suffer because of the
added processing of index or seq scans.


Performance
-----------------
Consider the performance of a simple query on TPC-H schema,

explain analyse select count(*) from lineitem where l_shipdate < date
'1995-01-03';

Without parallel index-only scan, parallel seq scan got picked and it took
around 23 seconds for the query to execute,

> Finalize Aggregate  (cost=651586.63..651586.64 rows=1 width=8) (actual
> time=22853.872..22853.873 rows=1 loops=1)
>    ->  Gather  (cost=651586.21..651586.62 rows=4 width=8) (actual
> time=22853.684..22853.864 rows=5 loops=1)
>          Workers Planned: 4
>          Workers Launched: 4
>          ->  Partial Aggregate  (cost=650586.21..650586.22 rows=1 width=8)
> (actual time=22850.489..22850.489 rows=1 loops=5)
>                ->  Parallel Seq Scan on lineitem  (cost=0.00..618021.73
> rows=13025795 width=0) (actual time=0.035..20553.495 rows=10342437 loops=5)
>                      Filter: (l_shipdate < '1995-01-03'::date)
>                      Rows Removed by Filter: 13656485
>  Planning time: 0.225 ms
>  Execution time: 22855.196 ms
>
However, with parallel index-only scan, it took only 8.5 seconds,

Finalize Aggregate  (cost=568883.69..568883.70 rows=1 width=8) (actual
> time=8548.993..8548.993 rows=1 loops=1)
>    ->  Gather  (cost=568883.27..568883.68 rows=4 width=8) (actual
> time=8548.789..8548.976 rows=5 loops=1)
>          Workers Planned: 4
>          Workers Launched: 4
>          ->  Partial Aggregate  (cost=567883.27..567883.28 rows=1 width=8)
> (actual time=8541.929..8541.929 rows=1 loops=5)
>                ->  Parallel Index Only Scan using idx_l_shipdate on
> lineitem  (cost=0.57..535318.78 rows=13025795 width=0) (actual
> time=0.113..5866.729 rows=10342437 loops=5)
>                      Index Cond: (l_shipdate < '1995-01-03'::date)
>                      Heap Fetches: 0
>  Planning time: 0.266 ms
>  Execution time: 8569.735 ms


The effect of parallel index-only scan can be seen more in some more
complex queries where parallelism is enabled till top of the tree, e.g,
following query takes 118 ms on head,

explain analyse select sum(l_extendedprice * l_discount) as revenue,
avg(o_orderkey) from orders, lineitem where o_orderkey < 10000 and
o_orderkey = l_orderkey group by l_shipmode order by revenue

 Sort  (cost=24396.44..24396.45 rows=7 width=75) (actual
> time=118.823..118.825 rows=7 loops=1)
>    Sort Key: (sum((lineitem.l_extendedprice * lineitem.l_discount)))
>    Sort Method: quicksort  Memory: 25kB
>    ->  HashAggregate  (cost=24396.23..24396.34 rows=7 width=75) (actual
> time=118.749..118.786 rows=7 loops=1)
>          Group Key: lineitem.l_shipmode
>          ->  Nested Loop  (cost=1.13..24293.11 rows=10312 width=27)
> (actual time=0.096..73.198 rows=9965 loops=1)
>                ->  Index Only Scan using orders_pkey on orders
>  (cost=0.56..46.48 rows=2578 width=4) (actual time=0.072..1.663 rows=2503
> loops=1)
>                      Index Cond: (o_orderkey < 10000)
>                      Heap Fetches: 0
>                ->  Index Scan using idx_lineitem_orderkey on lineitem
>  (cost=0.57..6.45 rows=296 width=31) (actual time=0.018..0.023 rows=4
> loops=2503)
>                      Index Cond: (l_orderkey = orders.o_orderkey)
>  Planning time: 1.062 ms
>  Execution time: 118.977 ms


With parallel index-only scan, the performance improves to 40 ms,

Sort  (cost=7191.33..7191.35 rows=7 width=75) (actual time=40.475..40.476
> rows=7 loops=1)
>    Sort Key: (sum((lineitem.l_extendedprice * lineitem.l_discount)))
>    Sort Method: quicksort  Memory: 25kB
>    ->  Finalize GroupAggregate  (cost=7190.78..7191.23 rows=7 width=75)
> (actual time=40.168..40.451 rows=7 loops=1)
>          Group Key: lineitem.l_shipmode
>          ->  Sort  (cost=7190.78..7190.85 rows=28 width=75) (actual
> time=40.105..40.127 rows=35 loops=1)
>                Sort Key: lineitem.l_shipmode
>                Sort Method: quicksort  Memory: 29kB
>                ->  Gather  (cost=7187.22..7190.11 rows=28 width=75)
> (actual time=39.344..39.983 rows=35 loops=1)
>                      Workers Planned: 4
>                      Workers Launched: 4
>                      ->  Partial HashAggregate  (cost=6187.22..6187.31
> rows=7 width=75) (actual time=25.981..26.011 rows=7 loops=5)
>                            Group Key: lineitem.l_shipmode
>                            ->  Nested Loop  (cost=1.13..6084.10 rows=10312
> width=27) (actual time=0.139..16.352 rows=1993 loops=5)
>                                  ->  Parallel Index Only Scan using
> orders_pkey on orders  (cost=0.56..27.14 rows=644 width=4) (actual
> time=0.082..0.366 rows=501 loops=5)
>                                        Index Cond: (o_orderkey < 10000)
>                                        Heap Fetches: 0
>                                  ->  Index Scan using
> idx_lineitem_orderkey on lineitem  (cost=0.57..6.45 rows=296 width=31)
> (actual time=0.020..0.025 rows=4 loops=2503)
>                                        Index Cond: (l_orderkey =
> orders.o_orderkey)
>  Planning time: 1.170 ms
>  Execution time: 40.898 ms


Test configuration and machine detail:
--------------------------------------------------
TPC-H: scale facto                         : 20
work_mem                                      : 64MB
shared buffer                                   : 1GB
max_parallel_workers_per_gather  : 4
Machine                                           : IBM POWER, 4 socket
machine with 512 GB RAM.
random_page_cost = seq_page_cost = 0.1
We kept random and seq page cost same since warm cache environment was
ensured, thus, keeping a value of random_page_cost that is higher than
seq_page_cost doesn't makes much sense in this setup.
Also, some additional indexes were build, for lineitem table l_shipdate,
l_shipmode, and l_returnflag, on orders table index is added for o_comment
and o_orderdate individually.

The point to note here is that with parallel index-only scan, not only the
heap fetches are saved but also the aggregates/sort can be pushed down to
workers and thus the total time of query can be improved. Clearly, the
performance of queries improved significantly with this new operator and
considering the changes required after parallel index scan patches is less
if evaluated against the improvement in performance it offers.

Attached file:
------------------
1. parallel_index_only_v1.patch

This patch is to be applied over parallel index scan [1].

* Thanks to my colleagues Dilip Kumar and Amit Kapila for their support and
feedback.

[1]
https://www.postgresql.org/message-id/CAOGQiiOneen9WEppO6V_myKpQ97CrjBQJ0Pv7ri0rxmMYvLcTg%40mail.gmail.com
-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

parallel_index_only_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Parallel Index-only scan

Reply via email to