I just finished releasing the wikimedia extra <https://github.com/wikimedia/search-extra> Elasticsearch plugin which contains support for trigram accelerated regular expressions similar to PostgreSQL's implementation <http://wiki.postgresql.org/images/6/6c/Index_support_for_regular_expression_search.pdf>. Its far from perfect and going to be less efficient the normal full text search but if you absolutely *need* to search against arbitrary stuff it gets the job done reasonably fast.
You can try it on our beta site <http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special%3ASearch&profile=default&search=insource%3A%2Fattend%28ee%29%3F%2F&fulltext=Search>. Leading repetitions cause trouble for the highlighter at this point but that is a problem with another plugin <https://github.com/wikimedia/search-highlighter/issues/7> <https://github.com/wikimedia/search-highlighter/issues/7>. Define reasonably fast: Running a regular expression across all ~200,000 articles, templates, talk pages, etc in simple wikipedia in our (not very fast beta) environment takes around 15 seconds. Using this plugin it takes around 1 second. Full disclosure: I'm not very good at this kind of thing so I'm sure the algorithms aren't as efficient as they could be but it gets the job done. How it works: 1. Compiles the regex to an automaton using Lucene's RegExp class. 2. Uses the process described in the pdf linked above as "PostgreSQL's Implementation" to convert the regex into a ngram automaton. 3. Breaks cycles in the automaton which would break step 4. 4. Converts the automaton to an expression language. 5. Simplifies the expression. 6. Converts the expression into a Lucene filter (usually a boolean filter containing term filters). 7. Uses the filter as a first pass. 8. Rechecks the regex. More work to do: Right now the plugin is very aggressive about checking all the terms it can extract from the regex. It'd likely be faster to ignore terms that aren't very selective and/or execute them in order of must to least selective stopping early if the number of candidates dips below a certain point. Maybe. This targets the 1.3 branch of Elasticsearch. Nik -- You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPmjWd2D1Xc-G6Q3_g2RPP_SxZLSK9XZn1ZiZDrLF2VuOxxFWg%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.