hi

i have text/email addresses indexed with the standard analyzer. 

e.g. "marco.k...@brain.net", which results in two tokens in the index:

[marco.kamm] and [brain.net]

i want to search using a query_string query with wildcards, like:

{
  "fields" : ["contact_email"],
  "query" : {
    "query_string" : {
      "query" : "(contact_email:(marco.*@brain.net))",
      "default_operator" : "and",
      "analyze_wildcard" : true
    }
  }
}

from my past work with lucene i know that wildcard queries are kind of
problematic because they're not analyzed by default.
(to work around this behaviour i wrote a custom parser that prepares the
query string according to the specific field analyzer before passing it
to the lucene query parser)

when i first noticed the analyze_wildcard parameter/option i thought:
great/cool! i no longer need my "custom magic parser" ;-), elasticsearch
provides built-in support for my problem ...

when testing the "analyze_wildcard" behaviour with "pure" prefix queries
like "marco.kamm@brain.*" it worked like a charm, i.e. it did the same
thing i tried to achieve with my custom "pre-parser". the query was
"transformed" to sth. like
"contact_email:marco.kamm OR contact_email:brain*", which perfectly matches
what's in the index ...

but unfortunately testing with "real" wildcard queries like the above
"marco.*@brain.net" gives me a query that won't find anything in my
situation, because it is turned into "contact_email:marco*brain.net" and
there's not a single token in my index that will match (although it gets
analyzed). to find some results the query would rather have to be turned
into sth. like "contact_email:marco* AND contact_email:brain.net" or
"contact_email:marco* AND contact_email:*brain.net" (if the user searches
for "marco.*.net") ...

by looking at the source code of
org.apache.lucene.queryparser.classic.MapperQueryParser.java (i actually
started to dive into the source code by chasing down the "rather small",
already mentioned issue with the hardcoded boolean clause OR operator here:
https://github.com/elasticsearch/elasticsearch/issues/2183) i realized that
there are two different methods for analyzing wildcard and prefix queries
(getPossiblyAnalyzedPrefixQuery and getPossiblyAnalyzedWildcardQuery; i
first expected these cases to be handled by the same code), and that's why
i'm getting perfect results for prefix queries and sadly non-working ones
for pure wildcard queries ...

i started to experiment/fiddle with the getPossiblyAnalyzedWildcardQuery
method by rewriting it to work more like the
getPossiblyAnalyzedPrefixQuery method, i.e. instead of generating a single
wildcard query object from the analyzed string, it builds a boolean query
containing several wildcard query objects (splitting on */?) ...

my first tests showed that this would work quite well! ...
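
the shape of such a boolean query can be sketched in python as an
elasticsearch query DSL body, built by a hypothetical helper
bool_of_wildcards. this only illustrates "several clauses instead of one
wildcard"; it assumes the pattern has already been split into per-token
pieces, and it is not the modified getPossiblyAnalyzedWildcardQuery code.

```python
import json


def bool_of_wildcards(field, pieces, operator="and"):
    """Build a bool query with one clause per already-split piece.

    Pieces containing * or ? become wildcard clauses; all others become
    term clauses, which are not analyzed and therefore have to match the
    indexed token exactly.
    """
    occur = "must" if operator.lower() == "and" else "should"
    clauses = []
    for piece in pieces:
        kind = "wildcard" if ("*" in piece or "?" in piece) else "term"
        clauses.append({kind: {field: piece}})
    return {"query": {"bool": {occur: clauses}}}


body = bool_of_wildcards("contact_email", ["marco*", "brain.net"])
print(json.dumps(body, indent=2))
```

passing operator="or" puts the same clauses under "should" instead of
"must".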

 

now my questions:

what do you think about this "approach"?

do you see any serious drawbacks besides performance?
i know that using even more wildcards will drastically reduce search
performance, but isn't it better to finally serve some results after
quite a long time than to find nothing at all?

(i also know that lucene is not built/optimized for wildcard queries and
that some cases could be resolved using different analyzers (ngram,
reverse), multiple fields etc.
but users are used to wildcards, and there could be use cases where such
wildcard queries make sense, or where it's not practicable to use keyword
analyzers, which don't suffer from such problems, e.g. for longer text
etc.)

do you plan to further enhance the getPossiblyAnalyzedWildcardQuery method
(although the docs state that this method works on a best-effort basis)?

(btw. do you also plan to fix the OR operator issue? it could be rather
simple: just use the specified parameter)

if my approach is legit, and given that i don't like having to modify the
elasticsearch "core" code and rebuild/adapt it with every new release:
how/where else could i implement such an extension? do i have to write a
custom query parser (maybe one that extends MapperQueryParser) and build my
own plugin / rest endpoint?

(i recently found out that there's also a lucene class called
AnalyzingQueryParser. maybe i should have used this one instead of writing
my own magic parser. is/could this be used somehow in elasticsearch?)

is there a possibility to / should i write a feature request for even more
best effort on analyzing wildcard queries?

PS: i know the wildcard handling issue can be a pain in the a**, and maybe
it can only be solved on a best-effort basis. but i'm somehow forced to
mess around with this because i have to (want to!) port my old lucene
stuff to elasticsearch. (except for this issue i think elasticsearch is a
great product and i like working with it. this problem lies in the nature
of inverted indices, wildcards and analyzers.)


sorry for the long, maybe confusing mail, but i need your expert
thoughts/advice on this wildcard issue

thank you 
regards marco
 
