Hackers, Attached patch contains opclass which demonstrates advantages of GIN additional information storing itself without other GIN improvements. It implements inversed task of regex indexing. It works so: you create index on regexes and search for regexes matched query string. It introduce two additional operators |~ and *~ for case-sensetive and case-insensetive regex to string matching, and gin_regexp_trgm_ops opclass.
Let's consider some example. At first, generate some regexes. CREATE OR REPLACE FUNCTION generate_string(int, int) RETURNS text AS $$ SELECT array_to_string(ARRAY(SELECT chr((97 + random() * 10) :: integer) FROM generate_series(1,($1 + random()*$2)::int)), ''); $$ LANGUAGE sql; CREATE TABLE test AS select '(' || generate_string(1,4) || '|' || generate_string(1,4) || '|' || generate_string(1,4) || ')' || generate_string(1,2) || '(' || generate_string(1,4) || '|' || generate_string(1,4) || '|' || generate_string(1,4) || ')' AS s FROM generate_series(1,1000000); I use only 10 characters on alphabet in order to have better chance of matching. It generate some regexes like so: postgres=# SELECT * FROM test LIMIT 10; s ------------------------------------ (g|cij|ah)jg(iei|hfc|eef) (gbfdb|ehbg|akf)ge(bc|jgee|jidd) (jedc|kgc|c)bc(ii|bji|iebc) (aa|eie|bgdb)f(fc|he|f) (b|ijc|ae)ae(eccb|ie|kjf) (bib|igf|kdibf)fij(gcbh|efi|fidj) (bkejf|jfdhg|gbfe)bhb(bedj|hh|ggg) (kfb|egccd|iefce)jf(kj|jbef|kbc) (bhh|c|cd)cb(h|ed|jg) (id|j|geg)gc(djif|ai|cjjjc) (10 rows) Without index search takes about 10 seconds. postgres=# explain analyze select * from test where s |~ 'abcdefghijkl'; QUERY PLAN -------------------------------------------------------------------------------------------------------------- Seq Scan on test (cost=0.00..19929.00 rows=5000 width=28) (actual time=172.990..97357.312 rows=438 loops=1) Filter: (s |~ 'abcdefghijkl'::text) Rows Removed by Filter: 999562 Total runtime: 97357.490 ms (4 rows) And with index it takes only 110 milliseconds. postgres=# explain analyze select * from test where s |~ 'abcdefghijkl'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on test (cost=182.75..7245.94 rows=5000 width=28) (actual time=68.143..110.663 rows=438 loops=1) Recheck Cond: (s |~ 'abcdefghijkl'::text) -> Bitmap Index Scan on test_idx (cost=0.00..181.50 rows=5000 width=0) (actual time=67.929..67.929 rows=438 loops=1) Index Cond: (s |~ 'abcdefghijkl'::text) Total runtime: 110.870 ms (5 rows) ------ With best regards, Alexander Korotkov.
index_on_regexes.1.patch.gz
Description: GNU Zip compressed data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers