Re: [HACKERS] Correlation in cost_index()

Sean Chittenden Sat, 09 Aug 2003 00:49:00 -0700

> > Hrm, after an hour of searching and reading, I think one of the
> > better papers on the subject can be found here:
> > http://www.cs.ust.hk/faculty/dimitris/PAPERS/TKDE-NNmodels.pdf
> 
> Interesting paper, but I don't see the connection to index order
> correlation?


Nothing that I found was nearly that specific, as close as I could
find was the paper above on calculating the cost of fetching data from
a disk, which I thought was the bigger problem at hand, but I
digress...

In one paper about large dimension index searches, they did suggest
that cost was cumulative for the number of disk reads or nodes in the
tree that weren't held in cache, which was the biggest hint that I had
found on this specific topic.  With that as a guiding light (or
something faintly resembling it), it'd seem as though an avg depth of
nodes in index * tuples_fetched * (random_io_cost * indexCorrelation)
would be closer than where we are now... but now also think I/we're
barking up the right tree with this thread.

It's very possible that cost_index() is wrong, but it seems as though
after some testing as if PostgreSQL _overly_ _favors_ the use of
indexes:

# SET enable_seqscan = true; SET enable_indexscan = true;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > 
'2002-10-01'::TIMESTAMP WITH TIME ZONE;
INFO:  cost_seqscan: run_cost: 21472.687500
        startup_cost: 0.000000

INFO:  cost_index: run_cost: 21154.308116
        startup_cost: 0.000000
        indexCorrelation: 0.999729
                                                                                    
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using report_user_cat_count_utc_date_id_idx on report_user_cat_count rucc  
(cost=0.00..21154.31 rows=705954 width=64) (actual time=91.36..6625.79 rows=704840 
loops=1)
   Index Cond: (utc_date > '2002-10-01 00:00:00-07'::timestamp with time zone)
 Total runtime: 11292.68 msec
(3 rows)

# SET enable_seqscan = true; SET enable_indexscan = false;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > 
'2002-10-01'::TIMESTAMP WITH TIME ZONE;
INFO:  cost_seqscan: run_cost: 21472.687500
        startup_cost: 0.000000

INFO:  cost_index: run_cost: 21154.308116
        startup_cost: 100000000.000000
        indexCorrelation: 0.999729
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on report_user_cat_count rucc  (cost=0.00..21472.69 rows=705954 width=64) 
(actual time=1091.45..7441.19 rows=704840 loops=1)
   Filter: (utc_date > '2002-10-01 00:00:00-07'::timestamp with time zone)
 Total runtime: 10506.44 msec
(3 rows)


Which I find surprising and humorous given the popular belief is, mine
included, contrary to those results.  I can say with pretty high
confidence that the patch to use a geometric mean isn't correct after
having done real world testing as its break even point is vastly
incorrect and only uses an index when there are less than 9,000 rows
to fetch, a far cry from the 490K break even I found while testing.
What I did find interesting, however, was that it does work better at
determining the use of multi-column indexes, but I think that's
because the geometric mean pessimizes the value of indexCorrelation,
which gets pretty skewed when using a multi-column index.

# CREATE INDEX report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count 
(user_id,utc_date);
# CLUSTER report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count;
# ANALYZE report_user_cat_count;
# SET enable_seqscan = true; SET enable_indexscan = true;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND 
utc_date > '2002-01-01'::TIMESTAMP WITH TIME ZONE;
INFO:  cost_seqscan: run_cost: 23685.025000
        startup_cost: 0.000000

INFO:  cost_index: run_cost: 366295.018684
        startup_cost: 0.000000
        indexCorrelation: 0.500000
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on report_user_cat_count rucc  (cost=0.00..23685.03 rows=133918 width=64) 
(actual time=0.28..6100.85 rows=129941 loops=1)
   Filter: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp with 
time zone))
 Total runtime: 6649.21 msec
(3 rows)

# SET enable_seqscan = false; SET enable_indexscan = true;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND 
utc_date > '2002-01-01'::TIMESTAMP WITH TIME ZONE;
INFO:  cost_seqscan: run_cost: 23685.025000
        startup_cost: 100000000.000000

INFO:  cost_index: run_cost: 366295.018684
        startup_cost: 0.000000
        indexCorrelation: 0.500000
                                                                                       
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using report_user_cat_count_utc_date_user_id_idx on report_user_cat_count 
rucc  (cost=0.00..366295.02 rows=133918 width=64) (actual time=53.91..3110.42 
rows=129941 loops=1)
   Index Cond: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp 
with time zone))
 Total runtime: 3667.47 msec
(3 rows)


If I manually set the indexCorrelation to 1.0, however, the planner
chooses the right plan on its own, which is in effect what setting a
higher random_page_cost had been compensating for, a poorly determined
indexCorrelation.

# SET enable_seqscan = true; SET enable_indexscan = true;
SET
SET
# EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 AND 
utc_date > '2002-01-01'::TIMESTAMP WITH TIME ZONE;
INFO:  cost_seqscan: run_cost: 23685.025000
        startup_cost: 0.000000

INFO:  cost_index: run_cost: 4161.684248
        startup_cost: 0.000000
        indexCorrelation: 1.000000
                                                                                      
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using report_user_cat_count_utc_date_user_id_idx on report_user_cat_count 
rucc  (cost=0.00..4161.68 rows=133918 width=64) (actual time=0.67..1176.63 rows=129941 
loops=1)
   Index Cond: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp 
with time zone))
 Total runtime: 1705.40 msec
(3 rows)


Which suggests to me that line 3964 in
./src/backend/utils/adt/selfuncs.c isn't right for multi-column
indexes, esp for indexes that are clustered.  I don't know how to
address this though...  Tom, any hints?

FWIW, this is an old data/schema from a defunct project that I can
give people access to if they'd like. -sc

-- 
Sean Chittenden

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

               http://archives.postgresql.org

Re: [HACKERS] Correlation in cost_index()

Reply via email to