Setting statistics to 1000 on id and source_id didn't solve my problem:
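For reference, the statements I ran were along these lines (a sketch, not the exact commands; column names as in the plans below):

```sql
-- Raise the statistics target on both columns, then re-analyze
-- so the planner picks up the new histograms:
ALTER TABLE records ALTER COLUMN id SET STATISTICS 1000;
ALTER TABLE records ALTER COLUMN source_id SET STATISTICS 1000;
ANALYZE records;
```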

a9-dev=> explain analyze select * from records where source_id 
='http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' order by id limit 200;   
 
                                                                   QUERY PLAN   
                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..757.51 rows=200 width=1126) (actual time=43.648..564.798 
rows=200 loops=1)
  ->  Index Scan using source_id_id_idx on records  (cost=0.00..1590267.66 
rows=419868 width=1126) (actual time=43.631..564.700 rows=200 loops=1)
        Index Cond: ((source_id)::text = 
'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml'::text)
Total runtime: 564.895 ms
(4 rows)

a9-dev=> explain analyze select * from records where source_id 
='http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id limit 
200;   
                                                                    QUERY PLAN  
                                                                   
----------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..489.57 rows=200 width=1126) (actual 
time=99701.773..99703.858 rows=200 loops=1)
  ->  Index Scan using records_pkey on records  (cost=0.00..2441698.81 
rows=997489 width=1126) (actual time=99684.878..99686.936 rows=200 loops=1)
        Filter: ((source_id)::text = 
'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
Total runtime: 99705.916 ms
(4 rows)


As you can see, the query looking for records with the most common source_id is 
still slower by several orders of magnitude.

> In other words, if the value is 27% of all values, but is evenly
> spread across - I think planner will go for seq scan regardless.

A seq scan is not used. The problem is that the planner chooses the records_pkey 
index and checks every record's source_id until it finds 200 matches. I think 
that if the source_id_id_idx index were used, the query would execute as fast as 
it does for every other value of source_id.
I even ran an experiment: I created table records2 as a copy of records and 
added an extra column id2 with the same values as id. I created the same 
indexes on records2 as on records; the only difference is that there is no 
standalone index on id2.
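The setup looked roughly like this (a sketch; the index name matches the plans below, the rest is reconstructed from memory):

```sql
-- records2: a copy of records with an extra id2 column mirroring id
CREATE TABLE records2 AS SELECT r.*, r.id AS id2 FROM records r;

-- same composite index as source_id_id_idx, but on (source_id, id2);
-- note there is deliberately no standalone index on id2
CREATE INDEX source_id2_id2_id2x ON records2 (source_id, id2);
ANALYZE records2;
```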
Here are the results of the problematic queries:
a9-dev=> explain analyze select * from records2 where source_id 
='http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' order by id2 limit 200;  
                                                                    QUERY PLAN  
                                                                   
----------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..770.15 rows=200 width=1124) (actual time=0.071..0.220 
rows=200 loops=1)
  ->  Index Scan using source_id2_id2_id2x on records2  (cost=0.00..1585807.44 
rows=411820 width=1124) (actual time=0.070..0.196 rows=200 loops=1)
        Index Cond: ((source_id)::text = 
'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml'::text)
Total runtime: 0.255 ms
(4 rows)

a9-dev=> explain analyze select * from records2 where source_id 
='http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' order by id2 limit 
200;  
                                                                    QUERY PLAN  
                                                                   
----------------------------------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.00..770.01 rows=200 width=1124) (actual time=0.076..0.205 
rows=200 loops=1)
  ->  Index Scan using source_id2_id2_id2x on records2  (cost=0.00..3735751.15 
rows=970308 width=1124) (actual time=0.074..0.180 rows=200 loops=1)
        Index Cond: ((source_id)::text = 
'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'::text)
Total runtime: 0.235 ms
(4 rows)

As you can see, even for the most common (~27%) value of source_id, the planner 
chooses the source_id2_id2_id2x index (I replaced id with id2 when creating the 
indexes on records2 :]) and the query executes as fast as it does for other 
values.

So, the question is: why does the planner choose records_pkey over 
source_id_id_idx for the most common value of source_id?
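In the meantime, one workaround that may help (untested here) is to make the ORDER BY match the composite index exactly, so the planner cannot prefer records_pkey for the ordering:

```sql
-- Because source_id is pinned by the WHERE clause, ordering by
-- (source_id, id) is equivalent to ordering by id alone, but it
-- matches source_id_id_idx exactly, so the planner can walk that
-- index and stop after 200 rows.
SELECT *
FROM records
WHERE source_id = 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml'
ORDER BY source_id, id
LIMIT 200;
```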



On 3 Oct 2011, at 13:20, Gregg Jaskiewicz wrote:

> 2011/10/3 Nowak Michał <michal.no...@me.com>:
>>> How many rows do you have in that table?
>> 
>> a9-dev=> select count(*) from records;
>> count
>> ---------
>> 3620311
>> (1 row)
> 
> 
>> 
>> a9-dev=> select source_id, count(*) from records where source_id = 
>> 'http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml' or source_id = 
>> 'http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml' group by source_id;
>>                     source_id                        | count
>> --------------------------------------------------------+--------
>> http://ebuw.uw.edu.pl/dlibra/oai-pmh-repository.xml    | 427254
>> http://www.wbc.poznan.pl/dlibra/oai-pmh-repository.xml | 989184
>> (2 rows)
> 
> So the second one is roughly 27% of the table.  I don't know the
> actual conditions under which the planner switches to a seqscan, but
> that value seems quite common.
> The other thing planner is going to look at is the correlation, most
> common values, most common frequencies.
> In other words, if the value is 27% of all values, but is evenly
> spread across - I think planner will go for seq scan regardless.
> 
> At the end of the day (afaik), an index scan really only picks pages
> for a narrowed-down seqscan. So imagine if your index scan returned all
> the pages: you would still have to do a seqscan on all of them. The
> planner is trying to avoid that by weighing the costs of both operations.
> If the current queries are too slow, you could try normalizing the
> table by splitting source_id out into a separate table and referencing
> it by an id. Very often what you'd find is that doing so lowers the I/O
> required, hence saves a lot of time in queries. The downside is that it
> is a bit harder to add/update the tables. But that's where triggers and
> procedures come in handy.
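For anyone following along, the normalization suggested above would look something like this sketch (all table and column names invented for illustration):

```sql
-- One row per distinct source URL (hypothetical schema)
CREATE TABLE sources (
    id  serial PRIMARY KEY,
    url text NOT NULL UNIQUE
);

-- records then stores a small integer instead of repeating the long URL,
-- which shrinks the table and its indexes considerably
ALTER TABLE records ADD COLUMN source_ref integer REFERENCES sources (id);
CREATE INDEX records_source_ref_id_idx ON records (source_ref, id);
```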
> 
> 
>> 
>>> ALTER TABLE records ALTER id SET source_id 1000; vacuum analyze verbose 
>>> records;
>> Did you mean ALTER TABLE records ALTER id SET STATISTICS 1000;?
> 
> Yup, that's what I meant. Sorry.
> 
> 
> -- 
> GJ
> 
> -- 
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


