I was trying to test my patch that uses posix_fadvise to speed up bitmap heap scans (with disappointing results so far) and ran into a bit of a gotcha. I'm not sure where this should be documented, but it probably should be somewhere.
In order to test bitmap heap scans I had to build an array and use the = ANY(a) form. (The natural approach of building a table of values to search for produces a hash join or nested loop -- it may make sense to add a new kind of join which uses an outer bitmap heap scan.) So I defined an aggregate to gather the random values into an array in the usual fashion:

  CREATE AGGREGATE arrayize(anyelement) (
    SFUNC = array_append,
    STYPE = anyarray,
    INITCOND = '{}'
  );

and ran queries like:

  select count(*)
    from huge
   where h = any ((select arrayize((1+random()*300000000)::integer)
                     from generate_series(1,1000))::integer[]);

To test the behaviour for larger and larger samples I bumped the "1000" up further and further and noticed a longer and longer pause in disk access. When I tried 40000 the query took over 47 minutes, nearly all of which had one cpu pegged at 100%.

What's going on is that arrayize() is actually an O(n^2) algorithm, since each transition call creates a copy of the entire array built so far. Building an n-element array that way copies n(n+1)/2 elements in total, which for n = 40000 is on the order of 800 million element copies.

The solution would be analogous to what we did for count(): int8inc notices when it's being called as an aggregate transition function and increments its transition value in place rather than pallocing a new one. We would need to add a field to ArrayMetaState, which is stored in fn_extra, to remember the last array returned. Then if array_push notices it has been called in an aggregate context it can store its result in there. The next time it would extend that array in place (which is code that doesn't currently exist), possibly repallocing it, and return the same pointer.

It's a bit of a hack, but I think this is going to be a pretty common use case and I don't see any more general solution.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!
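Incidentally, a rough way to see the quadratic transition cost in isolation, without any big table involved, is to time the arrayize() aggregate defined above over generate_series at a few sizes (in psql, with \timing turned on; array_upper is only there to avoid printing the whole array). If the transition really is O(n) per call, each doubling of the row count should roughly quadruple the runtime:

  \timing
  select array_upper(arrayize(g), 1) from generate_series(1, 10000) g;
  select array_upper(arrayize(g), 1) from generate_series(1, 20000) g;
  select array_upper(arrayize(g), 1) from generate_series(1, 40000) g;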