Hi hackers,

I think I might have stumbled upon a planner bug in
src/backend/utils/adt/selfuncs.c, while working on something else.

I've traced it down to bd3e3e9 (Ensure sanity of hash-join costing when
there are no MCV statistics.), which added a fallback computation of
mcv_freq. That change surfaced a pre-existing bug in a branch that was
previously never taken when mcv_freq was zero.

        if (avgfreq > 0.0 && *mcv_freq > avgfreq)
                estfract *= *mcv_freq / avgfreq;

Since this expression is a ratio, it seems important that the numerator
and denominator are frequencies computed over the same base.

mcv_freq is [freq of most common values in filtered relation]

However, avgfreq is currently [avg freq of all distinct data values in
raw relation]:

        /* Compute avg freq of all distinct data values in raw relation */
        avgfreq = (1.0 - stanullfrac) / ndistinct;

After avgfreq has been assigned, ndistinct is adjusted to account for
restriction clauses:

        if (vardata.rel && vardata.rel->tuples > 0)
        {
                ndistinct *= vardata.rel->rows / vardata.rel->tuples;
                ndistinct = clamp_row_est(ndistinct);
        }

The expression further down

    estfract *= *mcv_freq / avgfreq;

therefore divides two frequencies of different bases, when
I think they should be of the same base.

If we simply move the computation of avgfreq to after the adjustment
of ndistinct, avgfreq will be of the same base as mcv_freq.
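Concretely, the reordering would look roughly like this (a sketch to
illustrate the move, not the actual patch):

```c
	if (vardata.rel && vardata.rel->tuples > 0)
	{
		ndistinct *= vardata.rel->rows / vardata.rel->tuples;
		ndistinct = clamp_row_est(ndistinct);
	}

	/* moved here: compute avg freq over the filtered relation */
	avgfreq = (1.0 - stanullfrac) / ndistinct;
```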

This bug can sometimes cause the wrong relation to be chosen in a Hash
Join: the larger table ends up on the hash side, while the smaller one
is used for probing.

Here is a demo:

CREATE TABLE orders (
    order_id    bigint      NOT NULL,
    tracking_code bigint,
    region      int         NOT NULL,
    status      text        NOT NULL,
    order_data  text
);

CREATE TABLE shipments (
    shipment_id   bigint  NOT NULL,
    tracking_code bigint  NOT NULL,
    carrier       text    NOT NULL,
    ship_data     text
);

INSERT INTO orders (order_id, tracking_code, region, status, order_data)
SELECT
    g,
    CASE WHEN g <= 150000 THEN g ELSE NULL END,
    (g % 1000) + 1,
    CASE WHEN g <= 150000 THEN 'shipped' ELSE 'pending' END,
    'order-' || g
FROM generate_series(1, 200000) g;

INSERT INTO shipments (shipment_id, tracking_code, carrier, ship_data)
SELECT
    g,
    g,
    CASE (g % 3) WHEN 0 THEN 'FedEx' WHEN 1 THEN 'UPS' ELSE 'DHL' END,
    repeat('x', 200) || g::text
FROM generate_series(1, 150000) g;

CREATE INDEX ON orders (region);

ANALYZE orders;
ANALYZE shipments;

EXPLAIN ANALYZE
SELECT o.order_id, o.tracking_code, s.carrier, s.ship_data
FROM orders o
JOIN shipments s ON s.tracking_code = o.tracking_code
WHERE o.region = 42;

master (65707ed):
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=9138.25..14237.55 rows=149 width=229) (actual time=70.984..78.977 rows=150.00 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Buffers: shared hit=6557, temp read=3987 written=4068
   ->  Parallel Hash Join  (cost=8138.25..13222.65 rows=88 width=229) (actual time=68.237..75.806 rows=75.00 loops=2)
         Hash Cond: (o.tracking_code = s.tracking_code)
         Buffers: shared hit=6557, temp read=3987 written=4068
         ->  Parallel Seq Scan on orders o  (cost=0.00..3188.59 rows=117 width=16) (actual time=0.076..12.053 rows=100.00 loops=2)
               Filter: (region = 42)
               Rows Removed by Filter: 99900
               Buffers: shared hit=1718
         ->  Parallel Hash  (cost=5464.00..5464.00 rows=62500 width=221) (actual time=53.216..53.217 rows=75000.00 loops=2)
               Buckets: 32768  Batches: 8  Memory Usage: 5120kB
               Buffers: shared hit=4839, temp written=4004
               ->  Parallel Seq Scan on shipments s  (cost=0.00..5464.00 rows=62500 width=221) (actual time=0.009..16.114 rows=75000.00 loops=2)
                     Buffers: shared hit=4839
 Planning:
   Buffers: shared hit=86 read=1
 Planning Time: 0.350 ms
 Execution Time: 79.011 ms

0001-Fix-estimate_hash_bucket_stats-to-use-filtered-ndist.patch:

                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1578.75..7292.64 rows=149 width=229) (actual time=0.521..15.592 rows=150.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=5447 read=3
   ->  Hash Join  (cost=578.75..6277.74 rows=62 width=229) (actual time=0.667..12.342 rows=50.00 loops=3)
         Hash Cond: (s.tracking_code = o.tracking_code)
         Buffers: shared hit=5447 read=3
         ->  Parallel Seq Scan on shipments s  (cost=0.00..5464.00 rows=62500 width=221) (actual time=0.006..5.581 rows=50000.00 loops=3)
               Buffers: shared hit=4839
         ->  Hash  (cost=576.26..576.26 rows=199 width=16) (actual time=0.429..0.429 rows=150.00 loops=3)
               Buckets: 1024  Batches: 1  Memory Usage: 16kB
               Buffers: shared hit=608 read=3
               ->  Bitmap Heap Scan on orders o  (cost=5.84..576.26 rows=199 width=16) (actual time=0.083..0.396 rows=200.00 loops=3)
                     Recheck Cond: (region = 42)
                     Heap Blocks: exact=200
                     Buffers: shared hit=608 read=3
                     ->  Bitmap Index Scan on orders_region_idx  (cost=0.00..5.79 rows=199 width=0) (actual time=0.039..0.039 rows=200.00 loops=3)
                           Index Cond: (region = 42)
                           Index Searches: 3
                           Buffers: shared hit=8 read=3
 Planning:
   Buffers: shared hit=86 read=1
 Planning Time: 0.371 ms
 Execution Time: 15.634 ms

The fix causes quite a lot of plans in
src/test/regress/expected/partition_join.out to change, which makes me a
bit worried that I might have misunderstood something here. I haven't
verified whether all the new plans are improvements; I just copied the
result file to the expected dir.

/Joel

Attachment: 0001-Fix-estimate_hash_bucket_stats-to-use-filtered-ndist.patch