Hi,

On 28.11.2019 10:36, guangnan he wrote:
> Hi,
> You mean, if we don't add new compiler options, the compiler will do the
> loop unrolling using SIMD automatically?

Yes, most modern compilers do it.
GCC requires the -O3 option (-O2 is not enough), but clang applies such
optimizations even with -O2.
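
For illustration, here is a minimal reduction loop of the kind discussed
below (the example and flags are mine, not taken from VOPS):

    /* sum.c -- minimal auto-vectorization example (illustrative only).
     *
     * gcc -O3 enables the auto-vectorizer; to vectorize this
     * floating-point reduction it additionally needs -ffast-math,
     * because strict FP semantics forbid reordering the additions.
     * clang behaves similarly starting from -O2.
     * Inspect the result with: gcc -O3 -ffast-math -fopt-info-vec sum.c
     */
    #include <stddef.h>

    float sum_floats(const float *v, size_t n)
    {
        float sum = 0.0f;

        for (size_t i = 0; i < n; i++)
            sum += v[i];        /* candidate for unrolling/SIMD */
        return sum;
    }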

But Postgres uses the more sophisticated Youngs-Cramer algorithm for calculating SUM/AVG aggregates, and here SIMD instructions do not help much. My original assumption was that the huge difference in speed between VOPS/ISPRAS JIT and vanilla JIT could be explained by this difference in accumulation algorithms.

This is why I implemented the calculation of AVG in VOPS using the Youngs-Cramer algorithm. And it certainly affects performance: Q1 with SUM aggregates is executed by VOPS almost three times faster than with AVG aggregates (700 msec vs. 2000 msec). But even with the Youngs-Cramer algorithm, VOPS is 6 times faster than standard Postgres with JIT and 5 times faster than my in-memory storage.
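
For reference, the per-value Youngs-Cramer update performed by Postgres'
float8_accum (src/backend/utils/adt/float.c) looks roughly like this
(a simplified sketch; infinity and overflow checks are omitted):

    /* Transition state: N, sum of values (Sx) and sum of squared
     * deviations from the running mean (Sxx). The loop-carried
     * dependency on N and Sx is what makes this update hard for the
     * compiler to vectorize, unlike a plain sum. */
    typedef struct
    {
        double  N;
        double  Sx;
        double  Sxx;
    } YoungsCramerState;

    static void
    youngs_cramer_accum(YoungsCramerState *state, double newval)
    {
        state->N += 1.0;
        state->Sx += newval;
        if (state->N > 1.0)
        {
            double  tmp = newval * state->N - state->Sx;

            state->Sxx += tmp * tmp / (state->N * (state->N - 1.0));
        }
    }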

> Besides the function calls, cache misses, etc., for VOPS I think the call
> stack is squeezed too, but the JIT-optimized code still processes rows
> one by one.
If we do not take into account the overhead of heap traversal and tuple unpacking, then the amount of computation doesn't depend on the data model, whether it is vertical or horizontal. By implementing an in-memory storage which just keeps unpacked tuples in an L2 list in the backend's private memory, and so doesn't spend time on unpacking or visibility checks,
I wanted to exclude this overhead and reach almost the same speed as VOPS.
But it doesn't happen.
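
For context, such a storage plugs into Postgres through the table access
method API. A minimal sketch of the handler (the inmem_* callback names
are assumptions mirroring the inmem_getnextslot symbol in the profiles
below, not the actual code):

    #include "postgres.h"

    #include "access/tableam.h"
    #include "fmgr.h"

    PG_MODULE_MAGIC;

    PG_FUNCTION_INFO_V1(inmem_tableam_handler);

    /* Hypothetical handler returning a TableAmRoutine whose scan
     * callbacks read already-unpacked tuples from a list in the
     * backend's private memory and skip MVCC visibility checks. */
    Datum
    inmem_tableam_handler(PG_FUNCTION_ARGS)
    {
        static const TableAmRoutine inmem_methods = {
            .type = T_TableAmRoutine,
            /* .scan_begin = inmem_beginscan, */
            /* .scan_getnextslot = inmem_getnextslot, */
            /* ... remaining callbacks elided ... */
        };

        PG_RETURN_POINTER(&inmem_methods);
    }

A table then selects this access method with CREATE TABLE ... USING inmem
(access method name hypothetical).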




Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote on Thu, Nov 28, 2019 at 3:08 PM:



    On 27.11.2019 19:05, Tomas Vondra wrote:
    > On Wed, Nov 27, 2019 at 06:38:45PM +0300, Konstantin Knizhnik wrote:
    >>
    >>
    >> On 25.11.2019 18:24, Merlin Moncure wrote:
    >>> On Mon, Nov 25, 2019 at 9:09 AM Konstantin Knizhnik
    >>> <k.knizh...@postgrespro.ru> wrote:
    >>>> JIT was not able to significantly (times) increase speed on the
    >>>> Q1 query?
    >>>> The experiment with VOPS shows that the aggregation algorithm
    >>>> itself is not a bottleneck.
    >>>> The profile also gives no answer to this question.
    >>>> Any ideas?
    >>> Well, in the VOPS variant around 2/3 of the time is spent in
    >>> routines that are obviously aggregation.  In the JIT version, it's
    >>> around 20%.  So this suggests that the replacement execution
    >>> engine is more invasive.  I would also guess (!) that the VOPS
    >>> engine optimizes fewer classes of query plan.  ExecScan, for
    >>> example, looks to be completely optimized out in VOPS but is still
    >>> utilized in the JIT engine.
    >>
    >> The difference in the fraction of time spent in aggregate
    >> calculation is not so large (2 times vs. 10 times).
    >> I suspected that a lot of time is spent in relation traversal code,
    >> tuple unpacking and visibility checks.
    >> To check this hypothesis I have implemented an in-memory table
    >> access method which stores tuples in unpacked form and doesn't
    >> perform any visibility checks at all.
    >> Results were not so exciting. I had to disable parallel execution
    >> (because it is not possible for tuples stored in backend private
    >> memory).
    >> Results are the following:
    >>
    >> lineitem:        13736 msec
    >> inmem_lineitem:  10044 msec
    >> vops_lineitem:    1945 msec
    >>
    >> The profile of inmem_lineitem is the following:
    >>
    >>   16.79%  postgres  postgres             [.] float4_accum
    >>   12.86%  postgres  postgres             [.] float8_accum
    >>    5.83%  postgres  postgres             [.] TupleHashTableHash.isra.8
    >>    4.44%  postgres  postgres             [.] lookup_hash_entries
    >>    3.37%  postgres  postgres             [.] check_float8_array
    >>    3.11%  postgres  postgres             [.] tuplehash_insert
    >>    2.91%  postgres  postgres             [.] hash_uint32
    >>    2.83%  postgres  postgres             [.] ExecScan
    >>    2.56%  postgres  postgres             [.] inmem_getnextslot
    >>    2.22%  postgres  postgres             [.] FunctionCall1Coll
    >>    2.14%  postgres  postgres             [.] LookupTupleHashEntry
    >>    1.95%  postgres  postgres             [.] TupleHashTableMatch.isra.9
    >>    1.76%  postgres  postgres             [.] pg_detoast_datum
    >>    1.58%  postgres  postgres             [.] AggCheckCallContext
    >>    1.57%  postgres  postgres             [.] tts_minimal_clear
    >>    1.35%  postgres  perf-3054.map        [.] 0x00007f558db60010
    >>    1.23%  postgres  postgres             [.] fetch_input_tuple
    >>    1.15%  postgres  postgres             [.] SeqNext
    >>    1.06%  postgres  postgres             [.] ExecAgg
    >>    1.00%  postgres  postgres             [.] tts_minimal_store_tuple
    >>
    >> So now the fraction of time spent in aggregation has increased to
    >> 30% (vs. 20% for lineitem and 42% for vops_lineitem).
    >> It looks like the main bottleneck now is hashagg. VOPS accesses the
    >> hash about 10 times less often (because it accumulates values for
    >> the whole tile).
    >> And this explains the still large difference between vops_lineitem
    >> and inmem_lineitem.
    >>
    >> If we remove grouping and rewrite the Q1 query as:
    >> select
    >>     avg(l_quantity) as sum_qty,
    >>     avg(l_extendedprice) as sum_base_price,
    >>     avg(l_extendedprice*(1-l_discount)) as sum_disc_price,
    >>     avg(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
    >>     avg(l_quantity) as avg_qty,
    >>     avg(l_extendedprice) as avg_price,
    >>     avg(l_discount) as avg_disc,
    >>     count(*) as count_order
    >> from
    >>     inmem_lineitem
    >> where
    >>     l_shipdate <= '1998-12-01';
    >>
    >> then results are the following:
    >> lineitem:        9805 msec
    >> inmem_lineitem:  6257 msec
    >> vops_lineitem:   1865 msec
    >>
    >> and now the profile of inmem_lineitem is:
    >>
    >>   25.27%  postgres  postgres           [.] float4_accum
    >>   21.86%  postgres  postgres           [.] float8_accum
    >>    5.49%  postgres  postgres           [.] check_float8_array
    >>    4.57%  postgres  postgres           [.] ExecScan
    >>    2.61%  postgres  postgres           [.] AggCheckCallContext
    >>    2.30%  postgres  postgres           [.] pg_detoast_datum
    >>    2.10%  postgres  postgres           [.] inmem_getnextslot
    >>    1.81%  postgres  postgres           [.] SeqNext
    >>    1.73%  postgres  postgres           [.] fetch_input_tuple
    >>    1.61%  postgres  postgres           [.] ExecAgg
    >>    1.23%  postgres  postgres           [.] MemoryContextReset
    >>
    >> But there is still a more than 3x difference with VOPS!
    >> Something is wrong here...
    >>
    >
    > I have no idea what VOPS does, but IIRC one of the bottlenecks
    > compared to various column stores is our iterative execution model,
    > which makes it difficult/impossible to vectorize operations. That's
    > likely why the accum functions are so high in the CPU profile.
    >
    > regards
    >

    VOPS is doing a very simple thing: it replaces scalar types with
    vector types (tiles) and defines all standard operations for them.
    It also provides Postgres aggregates for these types.
    So while for a normal Postgres table the query

    select sum(x) from T;

    calls float4_accum for each row of T, the same query in VOPS calls
    vops_float4_avg_accumulate for each tile, which contains 64 elements.
    So vops_float4_avg_accumulate is called 64 times less often than
    float4_accum.
    And inside, it contains a straightforward loop:

                 for (i = 0; i < TILE_SIZE; i++) {
                     sum += opd->payload[i];
                 }

    which can be optimized by the compiler (loop unrolling, use of SIMD
    instructions, ...).
    So it is no wonder that VOPS is faster than the Postgres executor.
    But Postgres now contains JIT, and it is used in this case.
    So the interpretation overhead of the executor should be mostly
    eliminated by JIT.
    In theory, perfect JIT code should process rows of the horizontal
    data model at the same speed as a vector executor processing columns
    of the vertical data model.
    The vertical model provides significant advantages when a query
    affects only a small fraction of columns.
    But in the case of Q1 we are calculating 8 aggregates over just 4
    columns.
    And inmem_lineitem is actually a projection of the original lineitem
    table, containing only the columns needed for this query.
    So the amount of fetched data in this case is almost the same for the
    horizontal and vertical data models.
    Effects of CPU caches should also not play a significant role in
    this case.
    That is why it is not quite clear to me why there is still a big
    difference (3 times) between VOPS and the in-memory table, but not
    such a large difference between the normal and in-memory tables.

    Concerning the large percentage of time spent in the accumulate
    functions: I do not agree with you. What this query is actually doing
    is just calculating aggregates.
    The smaller the interpretation overhead, the larger the percentage of
    time we should spend in the aggregate functions.
    Maybe the whole infrastructure of Postgres aggregates adds too much
    overhead (check_float8_array, function calls, ...), and in the case
    of VOPS this overhead is divided by 64.
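
    To make this contrast concrete, here is a hedged sketch of a
    tile-based transition function (the struct layout and name are my
    assumptions based on the loop above, not the actual VOPS source):

        #define TILE_SIZE 64

        typedef struct
        {
            float payload[TILE_SIZE];   /* hypothetical tile layout */
        } vops_float4_tile;

        /* One call per 64 values: per-call overhead (FunctionCall1Coll,
         * check_float8_array, ...) is paid once per tile instead of once
         * per row, i.e. divided by TILE_SIZE. */
        static double
        tile_accumulate(double sum, const vops_float4_tile *opd)
        {
            int     i;

            for (i = 0; i < TILE_SIZE; i++)
                sum += opd->payload[i];
            return sum;
        }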


    --
    Konstantin Knizhnik
    Postgres Professional: http://www.postgrespro.com
    The Russian Postgres Company





--
Guang-Nan He


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
