Re: Trigram-accelerated regex searches

2014-09-29 Thread Nikolas Everett
On Thu, May 22, 2014 at 4:31 PM, Erik Rose grinche...@gmail.com wrote:

 Alright, try this on for size. :-)

 Since the built-in regex-ish filters want to be all clever and
 index-based, why not use the JS script plugin, which is happy to run as a
 post-processing phase?

 curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d
 '{
 query: {
 filtered: {
 query: {
 match_all: {}
 },
 filter: {
 and: [
 {
 query: {
 match_phrase: {
 content_trg: Children
 }
 }
 },
 {
 query: {
 match_phrase: {
 content_trg: Next
 }
 }
 },
 {
 script: {
 lang: js,
 script: (new
 RegExp(pattern)).test(doc[\content\].value),
 params: {
 pattern: Children.*Next
 }
 }
 }
 ]
 }
 }
 }
 }'

 That gets me through the whole 16M-doc corpus in 117ms. (Without the
 match_phrase queries, it takes forever, at 12s, so you can see the trigrams
 acceleration working.) I am ecstatic.

 Some of you might note that the pattern doesn't begin or end with a
 wildcard; that's because RegExp.test() serves as a search rather than a
 match, so wildcards are effectively assumed.

 Cheers!


Sorry to revive a super old thread but its pretty relevant.  I (sometimes)
need the super duper exact matching that you can get out of running regular
expressions against the source similarly to the proposal above.  I
currently have it implemented as a post filter as well.  The problem is
that its too easy to put together a query that has to visit millions of
documents and even if you always write fast queries you can get in line
behind users with slow queries.  So the feature is pretty broken as is.  So
I took a stab at writing a trigram accelerated regex plugin similar to
pg_trgm but, well, for Elasticsearch.  Its certainly not the most efficient
thing in the world, but its way way more efficient (in the worst case) then
post filtering.

Right now its just a proposal that lives here:
https://gerrit.wikimedia.org/r/#/c/163268 .  My computer science is pretty
rusty so I'm sure I've made some mistakes with the automaton and logic work
but it does work - even for pretty complex regexes - but that is certainly
an area for future improvement.  Another area is in only executing some of
the filters - right now all the logic from the regex is extracted into a
tree of bool filters and then they are all executed.  You could save time
by executing the term filters iteratively until there are only some number
of documents to recheck and then just recheck them.

Anyway, this is a start.  Its not meant to replace real language analysis
but if you _need_ _exact_ matches like those from regular expressions with
a reasonably degree of efficiency.

Please have a look at the code (https://gerrit.wikimedia.org/r/#/c/163268)
and comment on it.  At some point I'll release it as an Elasticsearch
plugin.

Thanks for reading,

Nik

-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAPmjWd125WSuh%2BXicrjRUU9xhCYaj%2BhHoZKQPWpECXJo5QXPog%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Trigram-accelerated regex searches

2014-05-22 Thread Erik Rose
Martijn took a swing at it just now. He eliminated any scoring-based 
slowdown, like so (constant_score_filter)…

curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{
query: {
filtered: {
query: {
match_all: {}
},
filter: {
and: [
{
query: {
match_phrase: {
content_trg: Children
}
}
},
{
query: {
match_phrase: {
content_trg: Next
}
}
},
{
query: {
wildcard: {
content: {
wildcard: *Children*Next*,
rewrite: constant_score_filter
}
}
}
}
]
}
}
}
}'

…but it didn't make any difference. Somehow, the `and` pipeline isn't 
behaving as we expect. Since ES can't provide any more detailed timing 
ouput, I guess the next step is to go look at the source code for the `and` 
filter and the wildcard query and see what's what.

I think we'd both be fascinated to know what's going on, if anyone has 
anything to add.

-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/3114f40c-0b15-4dd4-8a6b-fc8c13d43f23%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Trigram-accelerated regex searches

2014-05-22 Thread Robert Muir
On Wed, May 21, 2014 at 6:01 PM, Erik Rose grinche...@gmail.com wrote:
 I'm trying to move Mozilla's source code search engine (dxr.mozilla.org)
 from a custom-written SQLite trigram index to ES. In the current production
 incarnation, we support fast regex (and, by extension, wildcard) searches by
 extracting trigrams from the search pattern and paring down the documents to
 those containing said trigrams.

This is definitely a great approach for a database, but it won't work
exactly the same way for an inverted index because the datastructure
is totally different.

In the inverted index queries like wildcards are slow: they must
iterate and match all terms in the document collection, then intersect
those postings with the rest of your query. So because its inverted,
it works backwards from what you expect and thats why adding
additional intersections like 'AND' don't speed anything up, they
haven't happened yet.

N-grams can speed up partial matching in general, but the methods to
accomplish this are different: usually the best way to go about it is
to try to think about Analyzing the data in such a way that the
queries to accomplish what you need are as basic as possible.

The first question is if you really need partial matching at all: I
don't have much knowledge about your use case, but just going from
your example, i would look at wildcards like *Children*Next* and ask
if instead i'd want to ensure my analyzer split on case-changes, and
try to see if i could get what i need with a sloppy phrase query.

-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAMUKNZUS40rsAjmzrL_YK6yjgjZRumeQKFVPhVu9bUcW4nN_KA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Trigram-accelerated regex searches

2014-05-22 Thread Matt Weber
Leading wildcards are really expensive.  Maybe you can try creating a copy
of your content field that reverses the tokens using reverse token filter
[1].  By doing this you turn those expensive leading wildcards into
trailing wildcards which should give you better performance.  I think your
query would look something like this:

{
  query: {
constant_score: {
  query: {
bool: {
  should: [
{wildcard: {content: Children*Next*}},
{wildcard: {content_rev: txeN*nerdlihC*}}
  ]
}
  }
}
  }
}

Note that you will need to reverse your query string as the wildcard query
is not analyzed.

[1]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-reverse-tokenfilter.html#analysis-reverse-tokenfilter

Thanks,
Matt Weber


On Thu, May 22, 2014 at 11:09 AM, Erik Rose grinche...@gmail.com wrote:

 Martijn took a swing at it just now. He eliminated any scoring-based
 slowdown, like so (constant_score_filter)…

 curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d
 '{
 query: {
 filtered: {
 query: {
 match_all: {}
 },
 filter: {
 and: [
 {
 query: {
 match_phrase: {
 content_trg: Children
 }
 }
 },
 {
 query: {
 match_phrase: {
 content_trg: Next
 }
 }
 },
 {
 query: {
 wildcard: {
 content: {
 wildcard: *Children*Next*,
 rewrite: constant_score_filter
 }
 }
 }
 }
 ]
 }
 }
 }
 }'

 …but it didn't make any difference. Somehow, the `and` pipeline isn't
 behaving as we expect. Since ES can't provide any more detailed timing
 ouput, I guess the next step is to go look at the source code for the `and`
 filter and the wildcard query and see what's what.

 I think we'd both be fascinated to know what's going on, if anyone has
 anything to add.

 --
 You received this message because you are subscribed to the Google Groups
 elasticsearch group.
 To unsubscribe from this group and stop receiving emails from it, send an
 email to elasticsearch+unsubscr...@googlegroups.com.
 To view this discussion on the web visit
 https://groups.google.com/d/msgid/elasticsearch/3114f40c-0b15-4dd4-8a6b-fc8c13d43f23%40googlegroups.comhttps://groups.google.com/d/msgid/elasticsearch/3114f40c-0b15-4dd4-8a6b-fc8c13d43f23%40googlegroups.com?utm_medium=emailutm_source=footer
 .
 For more options, visit https://groups.google.com/d/optout.




-- 
Thanks,
Matt Weber

-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAJ3KEoA1fQjbkygEBhxZdMcb%3D22JGDph65qNn1cvkE66NLRn3A%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Trigram-accelerated regex searches

2014-05-22 Thread Erik Rose
Leading wildcards are really expensive.  Maybe you can try creating a copy 
of your content field that reverses the tokens using reverse token filter 
[1].

Good advice, typically, but notice I have wildcards on either side. 
Reversing just makes the trailing wildcard expensive. :-)

-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/6b982d49-5f3b-472a-84ea-cfd4f2c662d6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Trigram-accelerated regex searches

2014-05-22 Thread Itamar Syn-Hershko
Aye, and then you can use edit distance on single words (fuzzy query) to
cope with fast typers
On May 22, 2014 8:22 PM, Robert Muir robert.m...@elasticsearch.com
wrote:

 On Wed, May 21, 2014 at 6:01 PM, Erik Rose grinche...@gmail.com wrote:
  I'm trying to move Mozilla's source code search engine (dxr.mozilla.org)
  from a custom-written SQLite trigram index to ES. In the current
 production
  incarnation, we support fast regex (and, by extension, wildcard)
 searches by
  extracting trigrams from the search pattern and paring down the
 documents to
  those containing said trigrams.

 This is definitely a great approach for a database, but it won't work
 exactly the same way for an inverted index because the datastructure
 is totally different.

 In the inverted index queries like wildcards are slow: they must
 iterate and match all terms in the document collection, then intersect
 those postings with the rest of your query. So because its inverted,
 it works backwards from what you expect and thats why adding
 additional intersections like 'AND' don't speed anything up, they
 haven't happened yet.

 N-grams can speed up partial matching in general, but the methods to
 accomplish this are different: usually the best way to go about it is
 to try to think about Analyzing the data in such a way that the
 queries to accomplish what you need are as basic as possible.

 The first question is if you really need partial matching at all: I
 don't have much knowledge about your use case, but just going from
 your example, i would look at wildcards like *Children*Next* and ask
 if instead i'd want to ensure my analyzer split on case-changes, and
 try to see if i could get what i need with a sloppy phrase query.

 --
 You received this message because you are subscribed to the Google Groups
 elasticsearch group.
 To unsubscribe from this group and stop receiving emails from it, send an
 email to elasticsearch+unsubscr...@googlegroups.com.
 To view this discussion on the web visit
 https://groups.google.com/d/msgid/elasticsearch/CAMUKNZUS40rsAjmzrL_YK6yjgjZRumeQKFVPhVu9bUcW4nN_KA%40mail.gmail.com
 .
 For more options, visit https://groups.google.com/d/optout.


-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/CAHTr4ZugwATMVFH%3DFziTPkX-dT6%3DRGfwhCud2S_aBcSDYmxZEA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: Trigram-accelerated regex searches

2014-05-22 Thread Erik Rose
 This is definitely a great approach for a database, but it won't work
 exactly the same way for an inverted index because the datastructure
 is totally different.

Ah, I was afraid of that. I hoped, due to the field being unanalyzed (and 
the documentation's noted restriction that wildcard queries work only on 
unanalyzed fields) that the wildcard query would apply to the fetched field 
contents rather than crawling the whole index. But your theory certainly 
fits the timings I've been getting. Thanks!

-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/3d523ff3-5292-4ac5-9ec8-7df8520de58b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Trigram-accelerated regex searches

2014-05-22 Thread Erik Rose
Alright, try this on for size. :-)

Since the built-in regex-ish filters want to be all clever and index-based, 
why not use the JS script plugin, which is happy to run as a 
post-processing phase?

curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{
query: {
filtered: {
query: {
match_all: {}
},
filter: {
and: [
{
query: {
match_phrase: {
content_trg: Children
}
}
},
{
query: {
match_phrase: {
content_trg: Next
}
}
},
{
script: {
lang: js,
script: (new 
RegExp(pattern)).test(doc[\content\].value),
params: {
pattern: Children.*Next
}
}
}
]
}
}
}
}'

That gets me through the whole 16M-doc corpus in 117ms. (Without the 
match_phrase queries, it takes forever, at 12s, so you can see the trigrams 
acceleration working.) I am ecstatic.

Some of you might note that the pattern doesn't begin or end with a 
wildcard; that's because RegExp.test() serves as a search rather than a 
match, so wildcards are effectively assumed.

Cheers!

-- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/6a7674d8-bf51-4be6-860e-589db3939ea8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Trigram-accelerated regex searches

2014-05-21 Thread Erik Rose
I'm trying to move Mozilla's source code search engine (dxr.mozilla.org) 
from a custom-written SQLite trigram index to ES. In the current production 
incarnation, we support fast regex (and, by extension, wildcard) searches 
by extracting trigrams from the search pattern and paring down the 
documents to those containing said trigrams. That gives us a manageable 
number of docs to then run the actual regex against. It performs very well. 
For example, to match the regex .*[cr]attle, we filter the corpus down to 
docs containing ((rat OR cat) AND att AND ttl AND tle) and then 
run the full regex against those candidates.

It seems like ES should be able to do this handily, but adding the trigram 
filters doesn't make it any faster. First, here's the naive approach, using 
an unaccelerated wildcard query. This should have to scan the world:

# Brute-force wildcard search
curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{
query: {
constant_score: {
query: {
wildcard: {content: *Children*Next*}
}
}
}
}'

Then, we add trigram filters in an attempt to accelerate matters. The 
filters alone take only 80ms to run and return 100 docs, each about 100 
chars long, so I would expect running a wildcard query over those to be 
scarcely noticeable. However, this query still takes 500ms, the same as the 
above:

curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{
query: {
filtered: {
query: {
constant_score: {
query: {
wildcard: {content: *Children*Next*}
}
}
},
filter: {
and: [
{
query: {
match_phrase: {
content_trg: Children
}
}
},
{
query: {
match_phrase: {
content_trg: Next
}
}
}
]
}
}
}
}'

(By using match_phrase, I'm counting on the query analyzer to break up 
Children and Next into trigrams and bang them against the trigram 
index, using the stored term positions to further pare down the found docs. 
Judging by speed, it appears to work very well.)

Then I thought, Perhaps ES needs a hint to run the wildcard query last. 
But no, using post_filter is just as slow:

curl -s -XGET 'http://127.0.0.1:9200/dxr_test/line/_search?pretty' -d '{
query: {
filtered: {
query: {
match_all: {}
},
filter: {
and: [
{
query: {
match_phrase: {
content_trg: Children
}
}
},
{
query: {
match_phrase: {
content_trg: Next
}
}
}
]
}
}
},
post_filter: {
query: {
wildcard: {content: *Children*Next*}
}
}
}'

Is it possible to coax ES into doing trigram-accelerated wildcard or regex 
searching? What am I missing?

Here are the relevant part of my mapping:

settings: {
analysis: {
analyzer: {
# A lowercase trigram analyzer. This is probably good
# enough for accelerating regexes; we probably dont
# need to keep a separate case-senitive index.
trigramalyzer: {
filter: [lowercase],
tokenizer: trigram_tokenizer
}
},
tokenizer: {
trigram_tokenizer: {
type: nGram,
min_gram: 3,
max_gram: 3
}
}
}
},
mappings: {
'line': {  # One line of source code
_all: {
enabled: False
},
properties: {
content: {
type: string,
index: not_analyzed
},
content_trg: {
type: string,
analyzer: trigramalyzer
}
}
}
}

Many thanks for any