[ANN] Released trigram accelerated regex queries for Elasticsearch version 0.0.1

Nikolas Everett Mon, 06 Oct 2014 09:24:48 -0700

I just finished releasing the wikimedia extra
<https://github.com/wikimedia/search-extra> Elasticsearch plugin which
contains support for trigram accelerated regular expressions similar
to PostgreSQL's
implementation
<http://wiki.postgresql.org/images/6/6c/Index_support_for_regular_expression_search.pdf>.
Its far from perfect and going to be less efficient the normal full text
search but if you absolutely *need* to search against arbitrary stuff it
gets the job done reasonably fast.

You can try it on our beta site
<http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3ASearch&profile=default&search=insource%3A%2Fattend%28ee%29%3F%2F&fulltext=Search>.
Leading repetitions cause trouble for the highlighter at this point but
that is a problem with another plugin
<https://github.com/wikimedia/search-highlighter/issues/7>
<https://github.com/wikimedia/search-highlighter/issues/7>.

Define reasonably fast:
Running a regular expression across all ~200,000 articles, templates, talk
pages, etc in simple wikipedia in our (not very fast beta) environment
takes around 15 seconds. Using this plugin it takes around 1 second.

Full disclosure: I'm not very good at this kind of thing so I'm sure the
algorithms aren't as efficient as they could be but it gets the job done.

How it works:
1. Compiles the regex to an automaton using Lucene's RegExp class.
2. Uses the process described in the pdf linked above as "PostgreSQL's
Implementation" to convert the regex into a ngram automaton.
3. Breaks cycles in the automaton which would break step 4.
4. Converts the automaton to an expression language.
5. Simplifies the expression.
6. Converts the expression into a Lucene filter (usually a boolean filter
containing term filters).
7. Uses the filter as a first pass.
8. Rechecks the regex.

More work to do:
Right now the plugin is very aggressive about checking all the terms it can
extract from the regex. It'd likely be faster to ignore terms that aren't
very selective and/or execute them in order of must to least selective
stopping early if the number of candidates dips below a certain point.
Maybe.

This targets the 1.3 branch of Elasticsearch.

Nik

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAPmjWd2D1Xc-G6Q3_g2RPP_SxZLSK9XZn1ZiZDrLF2VuOxxFWg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

[ANN] Released trigram accelerated regex queries for Elasticsearch version 0.0.1

Reply via email to