On Thu, January 19, 2012 21:30, Heikki Linnakangas wrote:
> On 22.11.2011 21:38, Alexander Korotkov wrote:
>> WIP patch with index support for regexp search for pg_trgm contrib is
>> attached.
>> In spite of techniques which extracts continuous text parts from regexp,
>> this patch presents technique of automatum transformation. That allows more
>> comprehensive trigrams extraction.
>
> Nice!
>

Yes, wonderful stuff; I tested quite a bit with this patch; FWIW, here's what I 
found.

The patch yields spectacular speedups with small, simple-enough regexen.  But 
it does not do a
good enough job when guessing where to use the index and where fall back to Seq 
Scan.  This can
lead to (also spectacular) slow-downs, compared to Seq Scan.

I used the following to generate 3 test tables with lines of 80 random chars 
(just in case it's
handy for others to use):

$ cat create_data.sh
#!/bin/sh

for power in 4 5 6
do
  table=azjunk${power}
  index=${table}_trgmrgx_txt_01_idx
  echo "-- generating table $table with index $index";
  time perl -E'
  sub ss{ join"",@_[ map{rand @_} 1 .. shift ] };
  say(ss(80,"a".."g"," ","h".."m"," ","n".."s"," ","t".."z"))
     for 1 .. 1e'"${power};" \
  | psql -aqXc "drop table if exists $table;
  create table $table(txt text); copy $table from stdin;
  -- set session maintenance_work_mem='20GB';
  create index $index on $table using gin(txt gin_trgm_ops);
  analyze $table;";
done


\dt+ public.azjunk*
                      List of relations
 Schema |  Name   | Type  |   Owner   |  Size   | Description
--------+---------+-------+-----------+---------+-------------
 public | azjunk4 | table | breinbaas | 1152 kB |
 public | azjunk5 | table | breinbaas | 11 MB   |
 public | azjunk6 | table | breinbaas | 112 MB  |
(3 rows)



I guessed that MAX_COLOR_CHARS limits the character class size (to 4, in your 
patch), is that
true?   I can understand you want that value to be low to limit the above risk, 
but now it reduces
the usability of the feature a bit: one has to split up larger char-classes 
into several smaller
ones to make a statement use the index: i.e.:
             txt ~ 'f[aeio]n' OR txt ~ 'f[uy]n'
instead of
             txt ~ 'f[aeiouy]n'



I made compiled instances with larger values for MAX_COLOR_CHARS (6 and 9), and 
sure enough they
used the index for larger classes such as the above, but of course also got 
into problems easier
when quantifiers are added (*, +, {n,m}).

A better calculation to decide index-use seems necessary, and ideally one that 
allows for a larger
MAX_COLOR_CHARS than 4.  Especially quantifiers could perhaps be inspected wrt 
that decision.
IMHO, the functionality would still be very useful when only very simple 
regexen were considered.

Btw, it seems impossible to Ctrl-C out of a search once it is submitted; I 
suppose this is
normally necessary for perfomance reasons, but it would be useful te be able to 
compile a test
version that allows it.  I don't know how hard that would be.




There is also a minor bug, I think, when running with  'set enable_seqscan=off' 
 in combination
with a too-large regex:

$ cat fail.sql:

set enable_bitmapscan=on;
set enable_seqscan=on;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n';   -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n';  -- OK

set enable_bitmapscan=on;
set enable_seqscan=off;
explain analyze select txt from azjunk4 where txt ~ 'f[aeio]n';   -- OK
explain analyze select txt from azjunk4 where txt ~ 'f[aeiou]n';  -- crashes


 $ psql -f fail.sql
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on azjunk4  (cost=52.01..56.02 rows=1 width=81) (actual 
time=1.011..5.291
rows=131 loops=1)
   Recheck Cond: (txt ~ 'f[aeio]n'::text)
   ->  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx  (cost=0.00..52.01 
rows=1 width=0) (actual
time=0.880..0.880 rows=131 loops=1)
         Index Cond: (txt ~ 'f[aeio]n'::text)
 Total runtime: 5.700 ms
(5 rows)

                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on azjunk4  (cost=0.00..268.00 rows=1 width=81) (actual 
time=1.491..36.049 rows=164
loops=1)
   Filter: (txt ~ 'f[aeiou]n'::text)
   Rows Removed by Filter: 9836
 Total runtime: 36.112 ms
(4 rows)

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on azjunk4  (cost=52.01..56.02 rows=1 width=81) (actual 
time=0.346..0.927
rows=131 loops=1)
   Recheck Cond: (txt ~ 'f[aeio]n'::text)
   ->  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx  (cost=0.00..52.01 
rows=1 width=0) (actual
time=0.316..0.316 rows=131 loops=1)
         Index Cond: (txt ~ 'f[aeio]n'::text)
 Total runtime: 0.996 ms
(5 rows)

psql:fail.sql:24: connection to server was lost


Sorry: I found the code, esp. the regex engine hard to understand sofar, so all 
the above are
somewhat inconclusive remarks; I hope nonetheless they are useful.


Thanks,

Erik Rijkers





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to