On Fri, Jun 21, 2013 at 2:40 PM, Erik Rijkers <e...@xs4all.nl> wrote:

> On Fri, June 21, 2013 05:25, Tom Lane wrote:
> > "Erik Rijkers" <e...@xs4all.nl> writes:
> >> In a 112 MB test table (containing random generated text) with a trgm
> index (gin_trgm_ops), I consistently get these
> >> timings:
> >> select txt from azjunk6 where txt ~ '^abcd';
> >>    130 ms
> >> select txt from azjunk6
> >> where txt ~ 'abcd' and substr(txt,1,4) = 'abcd';
> >>    3 ms
> >
> > Hm, could you provide a self-contained test case?
> >
>
> yes, sorry.   I tested on a 1M row table:
>
> #!/bin/sh
>
> # create table:
> for power in 6;
> do
>   table=azjunk${power}
>   index=${table}_trgm_re_idx
>   perl -E'
>   sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
>   say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z")) for 1 ..
> 1e'"${power};" \
>   | psql -aqXc "
>     drop table if exists $table;
>     create table $table(txt text);
>     copy $table from stdin;";
>   echo "set session maintenance_work_mem='1GB';
>     create index $index on $table using gin (txt gin_trgm_ops);
>     analyze $table;" | psql -qtAX;
> done
>
> # test:
> echo "
> \\timing on
> explain analyze select txt from azjunk6 where txt ~ '^abcd';  -- slow (140
> ms)
> explain analyze select txt from azjunk6 where txt ~ 'abcd' and
> substr(txt,1,4) = 'abcd'; -- fast (5 ms)
> " | psql -Xqa


Regex '^abcd' will be expanded into trigrams '__a', '_ab', 'abc' and 'bcd'.
However trigrams '__a' is much more frequent than '_ab' which in turn is
much more frequent than 'abc' and 'bcd'. Ommiting of ^ leads to ommiting of
'__a' and '_ab' and that gives so significant speedup.
The problem is that trigram regex engine doesn't know that '__a' and '_ab'
is more frequent than another trigrams because we don't know their
distribution. However, simple assumption that blank character (in pg_trgm
meaning) is much more frequent than others seems to be true for most cases.
Attached patch implementing this assumption. It introduces BLANK_COLOR_SIZE
macro which is used in selectColorTrigrams like count of characters in
COLOR_BLANK. Another change in this patch is split of MAX_TRGM_COUNT into
WISH_TRGM_SIZE and MAX_TRGM_SIZE. The idea is to keep trying to remove
color trigrams from graph even when it fits into MAX_TRGM_SIZE, because we
are intending to scan less posting lists/trees.
Comments is not fixed yet, coming soon.

------
With best regards,
Alexander Korotkov.

Attachment: trgm_regex_optimize.1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to