I just finished releasing the wikimedia extra
<https://github.com/wikimedia/search-extra> Elasticsearch plugin which
contains support for trigram accelerated regular expressions similar
to PostgreSQL's
implementation
<http://wiki.postgresql.org/images/6/6c/Index_support_for_regular_expression_search.pdf>.
Its far from perfect and going to be less efficient the normal full text
search but if you absolutely *need* to search against arbitrary stuff it
gets the job done reasonably fast.

You can try it on our beta site
<http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3ASearch&profile=default&search=insource%3A%2Fattend%28ee%29%3F%2F&fulltext=Search>.
Leading repetitions cause trouble for the highlighter at this point but
that is a problem with another plugin
<https://github.com/wikimedia/search-highlighter/issues/7>
<https://github.com/wikimedia/search-highlighter/issues/7>.

Define reasonably fast:
Running a regular expression across all ~200,000 articles, templates, talk
pages, etc in simple wikipedia in our (not very fast beta) environment
takes around 15 seconds.  Using this plugin it takes around 1 second.

Full disclosure: I'm not very good at this kind of thing so I'm sure the
algorithms aren't as efficient as they could be but it gets the job done.

How it works:
1.  Compiles the regex to an automaton using Lucene's RegExp class.
2.  Uses the process described in the pdf linked above as "PostgreSQL's
Implementation" to convert the regex into a ngram automaton.
3.  Breaks cycles in the automaton which would break step 4.
4.  Converts the automaton to an expression language.
5.  Simplifies the expression.
6.  Converts the expression into a Lucene filter (usually a boolean filter
containing term filters).
7.  Uses the filter as a first pass.
8.  Rechecks the regex.

More work to do:
Right now the plugin is very aggressive about checking all the terms it can
extract from the regex.  It'd likely be faster to ignore terms that aren't
very selective and/or execute them in order of must to least selective
stopping early if the number of candidates dips below a certain point.
Maybe.

This targets the 1.3 branch of Elasticsearch.

Nik

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAPmjWd2D1Xc-G6Q3_g2RPP_SxZLSK9XZn1ZiZDrLF2VuOxxFWg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to