Added to TODO: * Consider using a hash for joining to a large IN (VALUES ...) list
http://archives.postgresql.org/pgsql-hackers/2007-05/msg00450.php --------------------------------------------------------------------------- Atul Deopujari wrote: > Hi, > > Tom Lane wrote: > > Neil Conway <[EMAIL PROTECTED]> writes: > >> When planning queries with a large IN expression in the WHERE clause, > >> the planner transforms the IN list into a scalar array expression. In > >> clause_selectivity(), we estimate the selectivity of the ScalarArrayExpr > >> by calling scalararraysel(), which in turn estimates the selectivity of > >> *each* array element in order to determine the selectivity of the array > >> expression as a whole. > > > >> This is quite inefficient when the IN list is large. > > > > That's the least of the problems. We really ought to convert such cases > > into an IN (VALUES(...)) type of query, since often repeated indexscans > > aren't the best implementation. > > > I thought of giving this a shot and while I was working on it, it > occurred to me that we need to decide on a threshold value of the IN > list size above which such transformation should take place. For small > sizes of the IN list, scalararraysel() of IN list wins over the hash > join involved in IN (VALUES(...)). But for larger sizes of the IN list, > IN (VALUES(...)) comes out to be a clear winner. > I would like to know what does the community think should be a heuristic > value of the IN list size beyond which this transformation should take > place. > I was thinking of a GUC variable (or a hard coded value) which defaults > to say 30. This is based on numbers from the following test: > > postgres=# create table w (w text); > CREATE TABLE > > postgres=# \copy w from '/usr/share/dict/words' > > And run the following query with different IN list sizes > explain analyze select * from w where w in ('one', 'two', ...); > > I got the following runtimes: > ------------------------------------ > IN list IN (VALUES(...)) IN > size > ------------------------------------ > 150 ~2000 ms ~5500 ms > 100 ~1500 ms ~4000 ms > 80 ~1400 ms ~3000 ms > 50 ~1400 ms ~2500 ms > 30 ~1500 ms ~1500 ms > 20 ~1400 ms ~1200 ms > 10 ~1400 ms ~1200 ms > ------------------------------------ > > The IN (VALUES(...)) gives an almost steady state behavior, while the IN > runtimes deteriorate with growing list size. > > There would obviously be different conditions on which to base this > value. I seek community opinion on this. > > -- > Atul > > EnterpriseDB > www.enterprisedb.com > > ---------------------------(end of broadcast)--------------------------- > TIP 6: explain analyze is your friend -- Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers