Re: Trigram-accelerated regex searches
On Thu, May 22, 2014 at 4:31 PM, Erik Rose grinche...@gmail.com wrote: Alright, try this on for size. :-) Since the built-in regex-ish filters want to be all clever and index-based, why not use the JS script plugin, which is happy to run as a post-processing phase? curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{ query: { filtered: { query: { match_all: {} }, filter: { and: [ { query: { match_phrase: { content_trg: Children } } }, { query: { match_phrase: { content_trg: Next } } }, { script: { lang: js, script: (new RegExp(pattern)).test(doc[\content\].value), params: { pattern: Children.*Next } } } ] } } } }' That gets me through the whole 16M-doc corpus in 117ms. (Without the match_phrase queries, it takes forever, at 12s, so you can see the trigrams acceleration working.) I am ecstatic. Some of you might note that the pattern doesn't begin or end with a wildcard; that's because RegExp.test() serves as a search rather than a match, so wildcards are effectively assumed. Cheers! Sorry to revive a super old thread but its pretty relevant. I (sometimes) need the super duper exact matching that you can get out of running regular expressions against the source similarly to the proposal above. I currently have it implemented as a post filter as well. The problem is that its too easy to put together a query that has to visit millions of documents and even if you always write fast queries you can get in line behind users with slow queries. So the feature is pretty broken as is. So I took a stab at writing a trigram accelerated regex plugin similar to pg_trgm but, well, for Elasticsearch. Its certainly not the most efficient thing in the world, but its way way more efficient (in the worst case) then post filtering. Right now its just a proposal that lives here: https://gerrit.wikimedia.org/r/#/c/163268 . My computer science is pretty rusty so I'm sure I've made some mistakes with the automaton and logic work but it does work - even for pretty complex regexes - but that is certainly an area for future improvement. Another area is in only executing some of the filters - right now all the logic from the regex is extracted into a tree of bool filters and then they are all executed. You could save time by executing the term filters iteratively until there are only some number of documents to recheck and then just recheck them. Anyway, this is a start. Its not meant to replace real language analysis but if you _need_ _exact_ matches like those from regular expressions with a reasonably degree of efficiency. Please have a look at the code (https://gerrit.wikimedia.org/r/#/c/163268) and comment on it. At some point I'll release it as an Elasticsearch plugin. Thanks for reading, Nik -- You received this message because you are subscribed to the Google Groups elasticsearch group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPmjWd125WSuh%2BXicrjRUU9xhCYaj%2BhHoZKQPWpECXJo5QXPog%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
Re: Trigram-accelerated regex searches
Martijn took a swing at it just now. He eliminated any scoring-based slowdown, like so (constant_score_filter)… curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{ query: { filtered: { query: { match_all: {} }, filter: { and: [ { query: { match_phrase: { content_trg: Children } } }, { query: { match_phrase: { content_trg: Next } } }, { query: { wildcard: { content: { wildcard: *Children*Next*, rewrite: constant_score_filter } } } } ] } } } }' …but it didn't make any difference. Somehow, the `and` pipeline isn't behaving as we expect. Since ES can't provide any more detailed timing ouput, I guess the next step is to go look at the source code for the `and` filter and the wildcard query and see what's what. I think we'd both be fascinated to know what's going on, if anyone has anything to add. -- You received this message because you are subscribed to the Google Groups elasticsearch group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3114f40c-0b15-4dd4-8a6b-fc8c13d43f23%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: Trigram-accelerated regex searches
On Wed, May 21, 2014 at 6:01 PM, Erik Rose grinche...@gmail.com wrote: I'm trying to move Mozilla's source code search engine (dxr.mozilla.org) from a custom-written SQLite trigram index to ES. In the current production incarnation, we support fast regex (and, by extension, wildcard) searches by extracting trigrams from the search pattern and paring down the documents to those containing said trigrams. This is definitely a great approach for a database, but it won't work exactly the same way for an inverted index because the datastructure is totally different. In the inverted index queries like wildcards are slow: they must iterate and match all terms in the document collection, then intersect those postings with the rest of your query. So because its inverted, it works backwards from what you expect and thats why adding additional intersections like 'AND' don't speed anything up, they haven't happened yet. N-grams can speed up partial matching in general, but the methods to accomplish this are different: usually the best way to go about it is to try to think about Analyzing the data in such a way that the queries to accomplish what you need are as basic as possible. The first question is if you really need partial matching at all: I don't have much knowledge about your use case, but just going from your example, i would look at wildcards like *Children*Next* and ask if instead i'd want to ensure my analyzer split on case-changes, and try to see if i could get what i need with a sloppy phrase query. -- You received this message because you are subscribed to the Google Groups elasticsearch group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAMUKNZUS40rsAjmzrL_YK6yjgjZRumeQKFVPhVu9bUcW4nN_KA%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
Re: Trigram-accelerated regex searches
Leading wildcards are really expensive. Maybe you can try creating a copy of your content field that reverses the tokens using reverse token filter [1]. By doing this you turn those expensive leading wildcards into trailing wildcards which should give you better performance. I think your query would look something like this: { query: { constant_score: { query: { bool: { should: [ {wildcard: {content: Children*Next*}}, {wildcard: {content_rev: txeN*nerdlihC*}} ] } } } } } Note that you will need to reverse your query string as the wildcard query is not analyzed. [1] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-reverse-tokenfilter.html#analysis-reverse-tokenfilter Thanks, Matt Weber On Thu, May 22, 2014 at 11:09 AM, Erik Rose grinche...@gmail.com wrote: Martijn took a swing at it just now. He eliminated any scoring-based slowdown, like so (constant_score_filter)… curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{ query: { filtered: { query: { match_all: {} }, filter: { and: [ { query: { match_phrase: { content_trg: Children } } }, { query: { match_phrase: { content_trg: Next } } }, { query: { wildcard: { content: { wildcard: *Children*Next*, rewrite: constant_score_filter } } } } ] } } } }' …but it didn't make any difference. Somehow, the `and` pipeline isn't behaving as we expect. Since ES can't provide any more detailed timing ouput, I guess the next step is to go look at the source code for the `and` filter and the wildcard query and see what's what. I think we'd both be fascinated to know what's going on, if anyone has anything to add. -- You received this message because you are subscribed to the Google Groups elasticsearch group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3114f40c-0b15-4dd4-8a6b-fc8c13d43f23%40googlegroups.comhttps://groups.google.com/d/msgid/elasticsearch/3114f40c-0b15-4dd4-8a6b-fc8c13d43f23%40googlegroups.com?utm_medium=emailutm_source=footer . For more options, visit https://groups.google.com/d/optout. -- Thanks, Matt Weber -- You received this message because you are subscribed to the Google Groups elasticsearch group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAJ3KEoA1fQjbkygEBhxZdMcb%3D22JGDph65qNn1cvkE66NLRn3A%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
Re: Trigram-accelerated regex searches
Leading wildcards are really expensive. Maybe you can try creating a copy of your content field that reverses the tokens using reverse token filter [1]. Good advice, typically, but notice I have wildcards on either side. Reversing just makes the trailing wildcard expensive. :-) -- You received this message because you are subscribed to the Google Groups elasticsearch group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6b982d49-5f3b-472a-84ea-cfd4f2c662d6%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: Trigram-accelerated regex searches
Aye, and then you can use edit distance on single words (fuzzy query) to cope with fast typers On May 22, 2014 8:22 PM, Robert Muir robert.m...@elasticsearch.com wrote: On Wed, May 21, 2014 at 6:01 PM, Erik Rose grinche...@gmail.com wrote: I'm trying to move Mozilla's source code search engine (dxr.mozilla.org) from a custom-written SQLite trigram index to ES. In the current production incarnation, we support fast regex (and, by extension, wildcard) searches by extracting trigrams from the search pattern and paring down the documents to those containing said trigrams. This is definitely a great approach for a database, but it won't work exactly the same way for an inverted index because the datastructure is totally different. In the inverted index queries like wildcards are slow: they must iterate and match all terms in the document collection, then intersect those postings with the rest of your query. So because its inverted, it works backwards from what you expect and thats why adding additional intersections like 'AND' don't speed anything up, they haven't happened yet. N-grams can speed up partial matching in general, but the methods to accomplish this are different: usually the best way to go about it is to try to think about Analyzing the data in such a way that the queries to accomplish what you need are as basic as possible. The first question is if you really need partial matching at all: I don't have much knowledge about your use case, but just going from your example, i would look at wildcards like *Children*Next* and ask if instead i'd want to ensure my analyzer split on case-changes, and try to see if i could get what i need with a sloppy phrase query. -- You received this message because you are subscribed to the Google Groups elasticsearch group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAMUKNZUS40rsAjmzrL_YK6yjgjZRumeQKFVPhVu9bUcW4nN_KA%40mail.gmail.com . For more options, visit https://groups.google.com/d/optout. -- You received this message because you are subscribed to the Google Groups elasticsearch group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAHTr4ZugwATMVFH%3DFziTPkX-dT6%3DRGfwhCud2S_aBcSDYmxZEA%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
Re: Trigram-accelerated regex searches
This is definitely a great approach for a database, but it won't work exactly the same way for an inverted index because the datastructure is totally different. Ah, I was afraid of that. I hoped, due to the field being unanalyzed (and the documentation's noted restriction that wildcard queries work only on unanalyzed fields) that the wildcard query would apply to the fetched field contents rather than crawling the whole index. But your theory certainly fits the timings I've been getting. Thanks! -- You received this message because you are subscribed to the Google Groups elasticsearch group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3d523ff3-5292-4ac5-9ec8-7df8520de58b%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Re: Trigram-accelerated regex searches
Alright, try this on for size. :-) Since the built-in regex-ish filters want to be all clever and index-based, why not use the JS script plugin, which is happy to run as a post-processing phase? curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{ query: { filtered: { query: { match_all: {} }, filter: { and: [ { query: { match_phrase: { content_trg: Children } } }, { query: { match_phrase: { content_trg: Next } } }, { script: { lang: js, script: (new RegExp(pattern)).test(doc[\content\].value), params: { pattern: Children.*Next } } } ] } } } }' That gets me through the whole 16M-doc corpus in 117ms. (Without the match_phrase queries, it takes forever, at 12s, so you can see the trigrams acceleration working.) I am ecstatic. Some of you might note that the pattern doesn't begin or end with a wildcard; that's because RegExp.test() serves as a search rather than a match, so wildcards are effectively assumed. Cheers! -- You received this message because you are subscribed to the Google Groups elasticsearch group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/6a7674d8-bf51-4be6-860e-589db3939ea8%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
Trigram-accelerated regex searches
I'm trying to move Mozilla's source code search engine (dxr.mozilla.org) from a custom-written SQLite trigram index to ES. In the current production incarnation, we support fast regex (and, by extension, wildcard) searches by extracting trigrams from the search pattern and paring down the documents to those containing said trigrams. That gives us a manageable number of docs to then run the actual regex against. It performs very well. For example, to match the regex .*[cr]attle, we filter the corpus down to docs containing ((rat OR cat) AND att AND ttl AND tle) and then run the full regex against those candidates. It seems like ES should be able to do this handily, but adding the trigram filters doesn't make it any faster. First, here's the naive approach, using an unaccelerated wildcard query. This should have to scan the world: # Brute-force wildcard search curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{ query: { constant_score: { query: { wildcard: {content: *Children*Next*} } } } }' Then, we add trigram filters in an attempt to accelerate matters. The filters alone take only 80ms to run and return 100 docs, each about 100 chars long, so I would expect running a wildcard query over those to be scarcely noticeable. However, this query still takes 500ms, the same as the above: curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{ query: { filtered: { query: { constant_score: { query: { wildcard: {content: *Children*Next*} } } }, filter: { and: [ { query: { match_phrase: { content_trg: Children } } }, { query: { match_phrase: { content_trg: Next } } } ] } } } }' (By using match_phrase, I'm counting on the query analyzer to break up Children and Next into trigrams and bang them against the trigram index, using the stored term positions to further pare down the found docs. Judging by speed, it appears to work very well.) Then I thought, Perhaps ES needs a hint to run the wildcard query last. But no, using post_filter is just as slow: curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{ query: { filtered: { query: { match_all: {} }, filter: { and: [ { query: { match_phrase: { content_trg: Children } } }, { query: { match_phrase: { content_trg: Next } } } ] } } }, post_filter: { query: { wildcard: {content: *Children*Next*} } } }' Is it possible to coax ES into doing trigram-accelerated wildcard or regex searching? What am I missing? Here are the relevant part of my mapping: settings: { analysis: { analyzer: { # A lowercase trigram analyzer. This is probably good # enough for accelerating regexes; we probably dont # need to keep a separate case-senitive index. trigramalyzer: { filter: [lowercase], tokenizer: trigram_tokenizer } }, tokenizer: { trigram_tokenizer: { type: nGram, min_gram: 3, max_gram: 3 } } } }, mappings: { 'line': { # One line of source code _all: { enabled: False }, properties: { content: { type: string, index: not_analyzed }, content_trg: { type: string, analyzer: trigramalyzer } } } } Many thanks for any