On 27.11.2019 19:05, Tomas Vondra wrote:
On Wed, Nov 27, 2019 at 06:38:45PM +0300, Konstantin Knizhnik wrote:


On 25.11.2019 18:24, Merlin Moncure wrote:
On Mon, Nov 25, 2019 at 9:09 AM Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:
Why was JIT not able to significantly (i.e., several times) increase the speed of the Q1 query?
The experiment with VOPS shows that the aggregation algorithm itself is not
the bottleneck.
The profile also gives no answer to this question.
Any ideas?
Well, in the VOPS variant around 2/3 of the time is spent in routines
that are obviously aggregation.  In the JIT version, it's around 20%.
So this suggests that the replacement execution engine is more
invasive.  I would also guess (!) that the VOPS engine optimizes fewer
classes of query plan.  ExecScan, for example, looks to be completely
optimized out in VOPS but is still utilized in the JIT engine.

The difference in the fraction of time spent in aggregate calculation is not so large (2 times vs. 10 times). I suspected that a lot of time is spent in relation traversal code, tuple unpacking and visibility checks. To check this hypothesis I implemented an in-memory table access method which stores tuples in unpacked form and
doesn't perform any visibility checks at all.
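The idea can be sketched as follows (a hedged illustration, not the actual patch; the struct and function names are assumptions). Rows live in backend-private memory as plain C values, so a scan is just a pointer walk with no tuple deforming and no MVCC visibility tests, and only the columns Q1 touches are stored:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative unpacked row: only the lineitem columns Q1 needs. */
typedef struct {
    float l_quantity;
    float l_extendedprice;
    float l_discount;
    float l_tax;
} inmem_row;

/* A scan over unpacked rows: no heap_getattr, no tuple header,
 * no visibility check per row -- just array traversal. */
static double scan_sum_quantity(const inmem_row *rows, size_t nrows)
{
    double sum = 0.0;
    for (size_t i = 0; i < nrows; i++)
        sum += rows[i].l_quantity;
    return sum;
}
```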
The results were not as exciting as I expected. I had to disable parallel execution (because it is not possible for tuples stored in backend-private memory).
Results are the following:

lineitem:        13736 msec
inmem_lineitem:  10044 msec
vops_lineitem:    1945 msec

The profile of inmem_lineitem is the following:

  16.79%  postgres  postgres             [.] float4_accum
  12.86%  postgres  postgres             [.] float8_accum
   5.83%  postgres  postgres             [.] TupleHashTableHash.isra.8
   4.44%  postgres  postgres             [.] lookup_hash_entries
   3.37%  postgres  postgres             [.] check_float8_array
   3.11%  postgres  postgres             [.] tuplehash_insert
   2.91%  postgres  postgres             [.] hash_uint32
   2.83%  postgres  postgres             [.] ExecScan
   2.56%  postgres  postgres             [.] inmem_getnextslot
   2.22%  postgres  postgres             [.] FunctionCall1Coll
   2.14%  postgres  postgres             [.] LookupTupleHashEntry
   1.95%  postgres  postgres             [.] TupleHashTableMatch.isra.9
   1.76%  postgres  postgres             [.] pg_detoast_datum
   1.58%  postgres  postgres             [.] AggCheckCallContext
   1.57%  postgres  postgres             [.] tts_minimal_clear
   1.35%  postgres  perf-3054.map        [.] 0x00007f558db60010
   1.23%  postgres  postgres             [.] fetch_input_tuple
   1.15%  postgres  postgres             [.] SeqNext
   1.06%  postgres  postgres             [.] ExecAgg
   1.00%  postgres  postgres             [.] tts_minimal_store_tuple

So now the fraction of time spent in aggregation has increased to 30% (vs. 20% for lineitem and 42% for vops_lineitem). It looks like the main bottleneck now is hashagg. VOPS accesses the hash table about 10 times less often (because it accumulates values for the whole tile), and that explains the still-large difference between vops_lineitem and inmem_lineitem.
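The arithmetic behind that reduction can be modeled in a few lines (a toy model, not VOPS code; the numbers are assumptions). With TILE_SIZE = 64, the best case is one hash probe per tile instead of one per row; a tile whose rows span several groups needs one probe per group, which is why the observed reduction is closer to 10x than 64x:

```c
#include <assert.h>

#define TILE_SIZE 64

/* Row-at-a-time hash aggregation probes the hash table once per row. */
static long probes_row_at_a_time(long nrows)
{
    return nrows;
}

/* Tile-at-a-time aggregation probes once per (tile, group) pair:
 * at best one probe per 64 rows when a tile stays within one group. */
static long probes_tile_at_a_time(long nrows, long groups_per_tile)
{
    long ntiles = (nrows + TILE_SIZE - 1) / TILE_SIZE;
    return ntiles * groups_per_tile;
}
```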

If we remove aggregation and rewrite Q1 query as:
select
    avg(l_quantity) as sum_qty,
    avg(l_extendedprice) as sum_base_price,
    avg(l_extendedprice*(1-l_discount)) as sum_disc_price,
    avg(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
from
    inmem_lineitem
where
    l_shipdate <= '1998-12-01';

then results are the following:
lineitem:        9805 msec
inmem_lineitem:  6257 msec
vops_lineitem:   1865 msec

and now profile of inmem_lineitem is:

  25.27%  postgres  postgres           [.] float4_accum
  21.86%  postgres  postgres           [.] float8_accum
   5.49%  postgres  postgres           [.] check_float8_array
   4.57%  postgres  postgres           [.] ExecScan
   2.61%  postgres  postgres           [.] AggCheckCallContext
   2.30%  postgres  postgres           [.] pg_detoast_datum
   2.10%  postgres  postgres           [.] inmem_getnextslot
   1.81%  postgres  postgres           [.] SeqNext
   1.73%  postgres  postgres           [.] fetch_input_tuple
   1.61%  postgres  postgres           [.] ExecAgg
   1.23%  postgres  postgres           [.] MemoryContextReset

But still more than 3 times difference with VOPS!
Something is wrong here...


I have no idea what VOPS does, but IIRC one of the bottlenecks compared
to various column stores is our iterative execution model, which makes
it difficult/impossible to vectorize operations. That's likely why the
accum functions are so high in the CPU profile.

regards


VOPS does a very simple thing: it replaces scalar types with vector types (tiles) and defines all standard operations for them.
It also provides Postgres aggregates for these types.
So while for normal Postgres table, the query

select sum(x) from T;

calls float4_accum for each row of T, the same query in VOPS will call vops_float4_avg_accumulate for each tile, which contains 64 elements. So vops_float4_avg_accumulate is called 64 times less often than float4_accum. And inside it is a straightforward loop:

            for (i = 0; i < TILE_SIZE; i++) {
                sum += opd->payload[i];
            }

which can be optimized by the compiler (loop unrolling, use of SIMD instructions, ...).
So it is no wonder that VOPS is faster than the Postgres executor.
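For reference, here is a self-contained version of that inner loop (modeled on the snippet above; the struct and field names are assumptions, not the actual VOPS definitions). The point is that the loop body contains no function calls, no Datum boxing and no transition-state checks, so the compiler is free to unroll and vectorize it:

```c
#include <assert.h>
#include <stddef.h>

#define TILE_SIZE 64

/* Illustrative tile layout, modeled on the VOPS snippet above. */
typedef struct {
    float payload[TILE_SIZE];
} float4_tile;

/* One call per 64 values; the plain loop is trivially
 * unrollable and vectorizable at -O2 and above. */
static double tile_accumulate(double sum, const float4_tile *opd)
{
    for (size_t i = 0; i < TILE_SIZE; i++)
        sum += opd->payload[i];
    return sum;
}
```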
But Postgres now contains JIT, and it is used in this case,
so the interpretation overhead of the executor should be mostly eliminated by JIT.
In theory, perfect JIT code should process rows of a horizontal data model at the same speed as a vectorized executor processing columns of a vertical data model. The vertical model provides a significant advantage when a query touches only a small fraction of the columns. But in the case of Q1 we are calculating 8 aggregates over just 4 columns, and inmem_lineitem is actually a projection of the original lineitem table containing only the columns needed for this query. So the amount of fetched data is almost the same for the horizontal and vertical data models.
Effects of CPU caches should not play a significant role in this case either.
That is why it is not quite clear to me why there is still a big difference (3 times) between VOPS and the in-memory table, but not such a large difference between the normal and in-memory tables.

Concerning the large percentage of time spent in the accumulate functions - I do not agree with you. What this query is actually doing is just calculating aggregates, so the smaller the interpretation overhead, the larger the percentage of time we should spend in the aggregate functions. Maybe the whole infrastructure of Postgres aggregates adds too much overhead (check_float8_array, function calls, ...), and in the case of VOPS this overhead is divided by 64.
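To make the amortization concrete (a toy model, not the real float8_accum): the fixed per-call bookkeeping - fmgr dispatch, check_float8_array on the transition array - is paid once per input row in the row-at-a-time case, but only once per 64 rows in the tile-at-a-time case:

```c
#include <assert.h>
#include <stddef.h>

#define TILE_SIZE 64

/* Counts transition-function invocations; each invocation stands in
 * for the fixed fmgr-call + transition-state-check cost. */
static long transition_calls;

/* Row-at-a-time: one transition call per input row. */
static double accum_row(double state, float x)
{
    transition_calls++;
    return state + x;
}

/* Tile-at-a-time: the same fixed cost, amortized over 64 rows. */
static double accum_tile(double state, const float *payload)
{
    transition_calls++;
    for (size_t i = 0; i < TILE_SIZE; i++)
        state += payload[i];
    return state;
}
```

Both paths compute the same sum; only the number of times the fixed overhead is paid differs.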


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


