analyzing wildcard queries ...

mkamm78 Wed, 19 Nov 2014 00:57:13 -0800

hi

i have text/email addresses indexed with the standard analyzer.

e.g.

"marco.k...@brain.net" that results in two tokens being in the index:

[marco.kamm] and [brain.net]

i want to search using query_string query and wildcards like:

{
  "fields": ["contact_email"],
  "query": {
    "query_string": {
      "query": "(contact_email:(marco.*@brain.net))",
      "default_operator": "and",
      "analyze_wildcard": true
    }
  }
}

from my past working experience with lucene i know that wildcard queries
are kind of problematic because they're not analyzed by default.
(to work around this behaviour i wrote a custom parser that prepares the
query string depending on the specific field analyzer before
passing it to the lucene query parser)

at first when i noticed the analyze_wildcard parameter/option i thought
great/cool! i no longer need my "custom magic parser ,-)", elasticsearch
provides built-in support for my problems ...

when testing the "analyze_wildcard" behaviour with "pure" prefix queries
like "marco.kamm@brain.*" it worked like a charm, i.e. it did the same thing
i tried to achieve with my custom "pre-parser". the query was "transformed"
to sth. like
"contact_email:marco.kamm OR contact_email:brain*" that perfectly matches
what's in the index ...
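
to make the prefix case concrete, here is a minimal python sketch of the transformation described above. it is not the elasticsearch code: simple_analyze is a crude, hypothetical stand-in for the standard analyzer (it only keeps dots inside a token and breaks on everything else), and analyzed_prefix_query is a made-up name. the OR default mirrors the operator seen in the transformed query.

```python
import re

def simple_analyze(text):
    # Hypothetical stand-in for the standard analyzer: lowercases and keeps
    # runs of letters/digits joined by inner dots, breaking on anything else.
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())

def analyzed_prefix_query(field, text, operator="OR"):
    # Analyze the text in front of the trailing "*"; only the last token
    # keeps the wildcard, the others become plain term clauses.
    tokens = simple_analyze(text.rstrip("*"))
    if not tokens:
        return ""
    clauses = [f"{field}:{t}" for t in tokens[:-1]]
    clauses.append(f"{field}:{tokens[-1]}*")
    return f" {operator} ".join(clauses)

print(analyzed_prefix_query("contact_email", "marco.kamm@brain.*"))
# contact_email:marco.kamm OR contact_email:brain*
```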

but unfortunately testing with "real" wildcard queries like the above
"marco.*@brain.net" gives me a query that won't find anything in my
situation because it is turned into "contact_email:marco*brain.net", and
there is no single token in my index that will match (although it gets
analyzed). to find some results the query would rather have
to be turned into sth. like "contact_email:marco* AND
contact_email:brain.net" or "contact_email:marco* AND
contact_email:*brain.net" (if the user searches for "marco.*.net") ...
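
the mismatch can be reproduced outside of lucene with a small python sketch. it assumes the index holds exactly the two tokens listed above and uses fnmatchcase from the standard library as a rough substitute for lucene's */? matching (the patterns here contain no brackets, so the semantics line up); matching_tokens is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# The two tokens the standard analyzer put into the index for the address.
index_tokens = ["marco.kamm", "brain.net"]

def matching_tokens(pattern, tokens):
    # A wildcard term is compared against each indexed token on its own,
    # never against the original, unanalyzed text.
    return [t for t in tokens if fnmatchcase(t, pattern)]

single = matching_tokens("marco*brain.net", index_tokens)
split = [matching_tokens(p, index_tokens) for p in ("marco*", "brain.net")]

print(single)  # []
print(split)   # [['marco.kamm'], ['brain.net']]
```

the single pattern spans two tokens and so matches neither, while each of the two separate clauses finds its token.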

by looking at the source code of
org.apache.lucene.queryparser.classic.MapperQueryParser.java (i actually
started to dive into the source code by chasing down the "rather small"
already mentioned issue
with the hardcoded boolean clause OR operator here:
https://github.com/elasticsearch/elasticsearch/issues/2183) i realized that
there are two different methods for analyzing pure wildcard and prefix
queries
(getPossiblyAnalyzedPrefixQuery and getPossiblyAnalyzedWildcardQuery; i
first expected these cases to be handled by the same code), and that's why
i'm getting perfect results for prefix queries and sadly non-working
ones for
pure wildcard queries ...

i started to experiment/fiddle with the getPossiblyAnalyzedWildcardQuery
method by rewriting it to work more like the
getPossiblyAnalyzedPrefixQuery method, i.e.
instead of generating only a single wildcardquery object with the
analyzed string, it builds a boolean query including several wildcardquery
objects (splitting on */?) ...

my first tests showed that this would work quite well! ...
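
a rough python sketch of one possible reading of this approach (not the actual java patch): the query text is split on runs of */?, every literal piece is analyzed on its own, and the wildcards a piece touched stay attached to its first/last token. simple_analyze is again a crude, hypothetical stand-in for the standard analyzer and split_wildcard_query is a made-up name; the operator parameter is an assumption that stands for the configured default_operator.

```python
import re

def simple_analyze(text):
    # Hypothetical stand-in for the standard analyzer: lowercases and keeps
    # runs of letters/digits joined by inner dots, breaking on anything else.
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())

def split_wildcard_query(field, text, operator="AND"):
    # re.split with a capturing group keeps the wildcard runs:
    # even indexes are literal pieces, odd indexes are wildcards.
    parts = re.split(r"([*?]+)", text)
    clauses = []
    for i in range(0, len(parts), 2):
        tokens = simple_analyze(parts[i])
        if not tokens:
            continue  # piece was empty or analyzed away completely
        lead = parts[i - 1] if i > 0 else ""
        trail = parts[i + 1] if i + 1 < len(parts) else ""
        tokens[0] = lead + tokens[0]      # wildcard in front of the piece
        tokens[-1] = tokens[-1] + trail   # wildcard behind the piece
        clauses.extend(f"{field}:{t}" for t in tokens)
    return f" {operator} ".join(clauses)

print(split_wildcard_query("contact_email", "marco.*@brain.net"))
# contact_email:marco* AND contact_email:*brain.net
print(split_wildcard_query("contact_email", "marco.kamm@brain.*"))
# contact_email:marco.kamm AND contact_email:brain*
```

note that the clauses no longer say anything about the order or adjacency of the matched tokens, so this trades precision for recall.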

now my questions:

what do you think about this "approach"?

do you see any serious drawbacks, besides performance?
i know that using even more wildcards will drastically reduce the search
performance,
but isn't it better to finally serve some results after quite a long time than
to find nothing at all?

(i also know that lucene is not built/optimized for wildcard queries and
some cases could be resolved using different analyzers (ngram, reverse),
multiple fields etc.
but users are used to them, and there could be use cases where such wildcard
queries make sense,
or where it's not practicable to use keyword analyzers that won't suffer
from such problems, e.g. for longer text etc.)!

do you plan to further enhance the getPossiblyAnalyzedWildcardQuery method
(although it is stated in the docs that this method works on a best-effort
basis)?

(btw. do you also plan to fix the OR operator issue? it could be rather simple:
just use the specified parameter)

if my approach is legit, and given that i don't like having to modify the
elasticsearch "core" code and rebuild/adapt it with every new release,
how/where else
could i implement such an extension? do i have to write a custom
queryparser (maybe one that extends MapperQueryParser) and build my own
plugin / rest endpoint ...

(i recently found out that there's also a lucene class called
AnalyzingQueryParser. maybe i should have used this one instead of writing
my own magic-parser. is/could this be used somehow in elasticsearch?)

is there a possibility to / should i write a feature request for even more
best effort on analyzing wildcard queries? PS i know the wildcard handling
issue could be a pain in the a**, and maybe it can only be solved on a
best-effort basis. but i'm somehow forced to mess around with this because i
have to (want to!) port my old lucene stuff to elasticsearch (apart from this
issue i think elasticsearch is a great product and i like to work with it.
this problem lies in the nature of inverted indices, wildcards and
analyzers)

sorry for the long, maybe confusing mail, but i need your expert
thoughts/advice about this wildcard issue

thank you
regards marco
