On 13.06.2013 23:19, Alexander Korotkov wrote:
Hackers,

Attached patch contains opclass which demonstrates advantages of GIN
additional information storing itself without other GIN improvements. It
implements inversed task of regex indexing. It works so: you create index
on regexes and search for regexes matched query string. It introduce two
additional operators |~ and *~ for case-sensetive and case-insensetive
regex to string matching, and gin_regexp_trgm_ops opclass.

I'm going to step back from this patch and trigrams for a while, and ponder in general, how one would do indexing of regexps.

The best approach surely depends a lot on the kinds of regexps and query strings involved. For example, how much common structure do the regexps share, how easily they can be decomposed into common and differing parts. Also, how long are the query strings being used, and how many regexps does it need to work efficiently with.

The first thought that comes to my mind is to build a GiST index, where the leafs items are the regexps, in a precompiled form. The upper nodes are also pre-compiled regexps, which match everything that the child regexps contain. For example, if you have the regexps ".*foobar.*" and ".*foofoo.*" on a leaf page, the upper node might be ".*foo.*". In other words, the union function forms a regular expression that matches all the strings that any of the member regexps, and may match more. Implementing the union-function is probably the most difficult part of this approach.

In literature, there is the Aho-Corasick algorithm that can be used to efficiently search for several substrings in a string in one go. I believe it can be extended to handle regexps. That does not fit nicely into existing GIN or GiST indexes, however, so that would need to be a whole new indexam. Also, I'm not sure how it scales as the number of regexps searched increses, and incremental updates might be tricky.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to