Re: [HACKERS] Planning large IN lists

Atul Deopujari Thu, 17 May 2007 02:18:15 -0700

Hi,

Tom Lane wrote:

> Neil Conway <[EMAIL PROTECTED]> writes:
>> When planning queries with a large IN expression in the WHERE clause,
>> the planner transforms the IN list into a scalar array expression. In
>> clause_selectivity(), we estimate the selectivity of the ScalarArrayExpr
>> by calling scalararraysel(), which in turn estimates the selectivity of
>> *each* array element in order to determine the selectivity of the array
>> expression as a whole.
>>
>> This is quite inefficient when the IN list is large.
>
> That's the least of the problems.  We really ought to convert such cases
> into an IN (VALUES(...)) type of query, since often repeated indexscans
> aren't the best implementation.

I thought of giving this a shot, and while I was working on it, it
occurred to me that we need to decide on a threshold value of the IN
list size above which such a transformation should take place. For small
IN lists, the scalar array form of the IN list (the scalararraysel()
path) wins over the hash join involved in IN (VALUES(...)). But for
larger IN lists, IN (VALUES(...)) comes out a clear winner.

I would like to know what the community thinks would be a good heuristic
value for the IN list size beyond which this transformation should take
place.

I was thinking of a GUC variable (or a hard-coded value) which defaults
to, say, 30. This is based on numbers from the following test:


postgres=# create table w (w text);
CREATE TABLE

postgres=# \copy w from '/usr/share/dict/words'

Then I ran the following query with different IN list sizes:
explain analyze select * from w where w in ('one', 'two', ...);
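
For anyone wanting to repeat the comparison, the two query shapes can be
generated for any list size with a small script. This is only a sketch and
not the harness behind the numbers below: the function names are
illustrative, the table and column names default to the test table above,
and quoting is done naively by doubling single quotes.

```python
def quote_literal(word):
    """Naive SQL string literal quoting: double any embedded single quotes."""
    return "'" + word.replace("'", "''") + "'"


def in_list_query(words, table="w", column="w"):
    """Build the plain IN-list form: ... where w in ('one', 'two', ...)."""
    items = ", ".join(quote_literal(x) for x in words)
    return f"explain analyze select * from {table} where {column} in ({items});"


def in_values_query(words, table="w", column="w"):
    """Build the IN (VALUES(...)) form: ... where w in (values ('one'), ('two'), ...)."""
    rows = ", ".join(f"({quote_literal(x)})" for x in words)
    return f"explain analyze select * from {table} where {column} in (values {rows});"


if __name__ == "__main__":
    sample = ["one", "two", "three"]
    print(in_list_query(sample))
    print(in_values_query(sample))
```

Feeding both functions the same list of N words gives a matched pair of
queries for each list size.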

I got the following runtimes:
----------------------------------------------
IN list size   IN (VALUES(...))   IN
----------------------------------------------
150            ~2000 ms           ~5500 ms
100            ~1500 ms           ~4000 ms
80             ~1400 ms           ~3000 ms
50             ~1400 ms           ~2500 ms
30             ~1500 ms           ~1500 ms
20             ~1400 ms           ~1200 ms
10             ~1400 ms           ~1200 ms
----------------------------------------------

The IN (VALUES(...)) form gives almost steady-state behavior, while the
IN runtimes deteriorate with growing list size.
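
The crossover point can be read off the table mechanically. The sketch
below uses only the approximate timings listed above; the names TIMINGS
and crossover are illustrative, and "crossover" is taken here to mean the
smallest measured list size at which IN (VALUES(...)) is no slower than
plain IN.

```python
# (IN list size, IN (VALUES(...)) runtime in ms, plain IN runtime in ms),
# approximate values copied from the table above.
TIMINGS = [
    (10, 1400, 1200),
    (20, 1400, 1200),
    (30, 1500, 1500),
    (50, 1400, 2500),
    (80, 1400, 3000),
    (100, 1500, 4000),
    (150, 2000, 5500),
]


def crossover(timings):
    """Return the smallest measured list size at which IN (VALUES(...))
    is no slower than plain IN, or None if that never happens."""
    for size, values_ms, in_ms in sorted(timings):
        if values_ms <= in_ms:
            return size
    return None


if __name__ == "__main__":
    print(crossover(TIMINGS))  # prints 30
```

On these numbers the two forms break even at a list size of 30, which is
where the suggested default comes from.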

There would obviously be different conditions on which to base this
value. I seek community opinion on this.


--
Atul

EnterpriseDB
www.enterprisedb.com
