Re: [GENERAL] Performance problem with low correlation data

Scara Maccai Tue, 07 Jul 2009 00:18:15 -0700

> But that would be a different query -- there's no
> restrictions on the
> t values in this one.


There is a restriction on the t values:

select * from idtable left outer join testinsert on id=ne_id where groupname='a 
group name' and time between $a_date and $another_date


> Have you tried something using IN or EXISTS instead of a
> join? 

I still get nested loop join on the ne_id column...

> The
> algorithm you describe doesn't work for the join because it
> has to
> produce a record which includes the matching group columns.

Yeah, I thought about that. 
Basically I guess the "perfect" algorithm would be something like:

Hash Join <---- this is needed to join values from both relations
  -> Bitmap Heap Scan
     for each id found in idtable where groupname='a group name'
       BitmapOr
         BitmapIndexScan using ne_id and time between $a_date and $another_date
  -> select id from idtable where groupname='a group name'

> Actually I wonder if doing a sequential scan with a hash
> join against
> the group list wouldn't be a better option. 

The table is pretty big (60M rows), sequential scans are the reason why my 
queries are so slow: since the correlation on the ne_id col is so bad, the 
planner chooses seq scans when dealing with most of the "t" values, even if the 
number of "ne_id" values is low.

For the moment I've found this solution:

whenever too many "t" are selected, which would lead the planner towards a seq 
scan (or a very poor bitmap index scan in case I disable seq scans) I create a 
temporary table:

create temporary table alldata as 
select * FROM generate_series(mydatestart, mydateend, '15 minutes'::interval) 
as t
cross join idtable where groupname='a group name'
order by t,id;

analyze alldata;

select * from alldata left outer join testinsert using (ne_id,t);

basically I'm doing what I'd like PG to do:

since the correlation on the "t" col is good, and correlation on the "id" col 
is bad, query the index using the right order: "t" first, "id" then (given by 
the "order by t,id" on the creation of the temp table).

I would like PG to do that for me. Since it knows an index scan looping on 
ne_id would be wrong, I'd like it to create a "materialized" table where data 
is ordered by "t" first instead of going for the seq scan.

This would lead to a x10 - x100 improvement on query time.








-- 
Sent via pgsql-general mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: [GENERAL] Performance problem with low correlation data

Reply via email to