Added to TODO:

* Consider using a hash for joining to a large IN (VALUES ...) list

  http://archives.postgresql.org/pgsql-hackers/2007-05/msg00450.php


---------------------------------------------------------------------------

Atul Deopujari wrote:
> Hi,
> 
> Tom Lane wrote:
> > Neil Conway <[EMAIL PROTECTED]> writes:
> >> When planning queries with a large IN expression in the WHERE clause,
> >> the planner transforms the IN list into a scalar array expression. In
> >> clause_selectivity(), we estimate the selectivity of the ScalarArrayExpr
> >> by calling scalararraysel(), which in turn estimates the selectivity of
> >> *each* array element in order to determine the selectivity of the array
> >> expression as a whole.
> > 
> >> This is quite inefficient when the IN list is large.
> > 
> > That's the least of the problems.  We really ought to convert such cases
> > into an IN (VALUES(...)) type of query, since often repeated indexscans
> > aren't the best implementation.
> > 
> I thought of giving this a shot and while I was working on it, it 
> occurred to me that we need to decide on a threshold value of the IN 
> list size above which such transformation should take place. For small 
> sizes of the IN list, scalararraysel() of IN list wins over the hash 
> join involved in IN (VALUES(...)). But for larger sizes of the IN list, 
> IN (VALUES(...)) comes out to be a clear winner.
> I would like to know what does the community think should be a heuristic 
> value of the IN list size beyond which this transformation should take 
> place.
> I was thinking of a GUC variable (or a hard coded value) which defaults 
> to say 30. This is based on numbers from the following test:
> 
> postgres=# create table w (w text);
> CREATE TABLE
> 
> postgres=# \copy w from '/usr/share/dict/words'
> 
> And run the following query with different IN list sizes
> explain analyze select * from w where w in ('one', 'two', ...);
> 
> I got the following runtimes:
> ------------------------------------
> IN list  IN (VALUES(...))     IN
> size
> ------------------------------------
> 150     ~2000 ms         ~5500 ms
> 100     ~1500 ms         ~4000 ms
> 80      ~1400 ms         ~3000 ms
> 50      ~1400 ms         ~2500 ms
> 30      ~1500 ms         ~1500 ms
> 20      ~1400 ms         ~1200 ms
> 10      ~1400 ms         ~1200 ms
> ------------------------------------
> 
> The IN (VALUES(...)) gives an almost steady state behavior, while the IN 
>   runtimes deteriorate with growing list size.
> 
> There would obviously be different conditions on which to base this 
> value. I seek community opinion on this.
> 
> -- 
> Atul
> 
> EnterpriseDB
> www.enterprisedb.com
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

-- 
  Bruce Momjian  <[EMAIL PROTECTED]>        http://momjian.us
  EnterpriseDB                             http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to