Hi,

Tom Lane wrote:
Neil Conway <[EMAIL PROTECTED]> writes:
When planning queries with a large IN expression in the WHERE clause,
the planner transforms the IN list into a scalar array expression. In
clause_selectivity(), we estimate the selectivity of the ScalarArrayExpr
by calling scalararraysel(), which in turn estimates the selectivity of
*each* array element in order to determine the selectivity of the array
expression as a whole.

This is quite inefficient when the IN list is large.

That's the least of the problems.  We really ought to convert such cases
into an IN (VALUES(...)) type of query, since often repeated indexscans
aren't the best implementation.

I thought of giving this a shot and while I was working on it, it occurred to me that we need to decide on a threshold value of the IN list size above which such transformation should take place. For small sizes of the IN list, scalararraysel() of IN list wins over the hash join involved in IN (VALUES(...)). But for larger sizes of the IN list, IN (VALUES(...)) comes out to be a clear winner. I would like to know what does the community think should be a heuristic value of the IN list size beyond which this transformation should take place. I was thinking of a GUC variable (or a hard coded value) which defaults to say 30. This is based on numbers from the following test:

postgres=# create table w (w text);
CREATE TABLE

postgres=# \copy w from '/usr/share/dict/words'

And run the following query with different IN list sizes
explain analyze select * from w where w in ('one', 'two', ...);

I got the following runtimes:
------------------------------------
IN list  IN (VALUES(...))       IN
size
------------------------------------
150     ~2000 ms           ~5500 ms
100     ~1500 ms           ~4000 ms
80      ~1400 ms           ~3000 ms
50      ~1400 ms           ~2500 ms
30      ~1500 ms           ~1500 ms
20      ~1400 ms           ~1200 ms
10      ~1400 ms           ~1200 ms
------------------------------------

The IN (VALUES(...)) gives an almost steady state behavior, while the IN runtimes deteriorate with growing list size.

There would obviously be different conditions on which to base this value. I seek community opinion on this.

--
Atul

EnterpriseDB
www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Reply via email to