Hi,

I have email addresses indexed with the standard analyzer. For example, "marco.k...@brain.net" results in two tokens in the index: [marco.kamm] and [brain.net].

I want to search using a query_string query with wildcards, like this:

    {
      "fields": ["contact_email"],
      "query": {
        "query_string": {
          "query": "(contact_email:(marco.*@brain.net))",
          "default_operator": "and",
          "analyze_wildcard": true
        }
      }
    }

From my past experience with Lucene I know that wildcard queries are problematic because they are not analyzed by default. (To work around this I wrote a custom parser that prepares the query string according to the specific field analyzer before passing it to the Lucene query parser.)

When I first noticed the analyze_wildcard option I thought: great, I no longer need my "custom magic parser" ;-), Elasticsearch has built-in support for my problem.

Testing analyze_wildcard with "pure" prefix queries like "marco.kamm@brain.*" worked like a charm, i.e. it did the same thing I tried to achieve with my custom "pre-parser". The query was transformed into something like "contact_email:marco.kamm OR contact_email:brain*", which perfectly matches what's in the index.

Unfortunately, testing with "real" wildcard queries like the above "marco.*@brain.net" gives me a query that finds nothing in my situation, because it is turned into "contact_email:marco*brain.net", and there is no single token in my index that matches it (although it does get analyzed). To find any results, the query would instead have to be turned into something like "contact_email:marco* AND contact_email:brain.net", or "contact_email:marco* AND contact_email:*brain.net" (if the user searches for "marco.*.net").
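To make the mismatch concrete, here is a minimal Python sketch. It is not Elasticsearch or Lucene code: `fake_analyze`, `split_wildcard_query`, `matches_any_token` and `matches_all` are hypothetical helpers, the "analyzer" is only a crude stand-in for the standard analyzer (it splits on "@" and whitespace and trims punctuation), and `fnmatchcase` stands in for wildcard matching against a single indexed term. It only illustrates why one wildcard term spanning two tokens finds nothing, while per-token clauses combined with AND do.

```python
import re
from fnmatch import fnmatchcase

# Assumption: the two tokens produced for the example address in this mail.
INDEX_TOKENS = ["marco.kamm", "brain.net"]


def fake_analyze(text):
    """Crude stand-in for the standard analyzer (an assumption, not the real thing):
    lowercase, split on '@' and whitespace, trim leading/trailing punctuation."""
    tokens = []
    for raw in re.split(r"[@\s]+", text.lower()):
        token = re.sub(r"^\W+|\W+$", "", raw)
        if token:
            tokens.append(token)
    return tokens


def split_wildcard_query(pattern):
    """Split a wildcard pattern into one clause per analyzed token.

    The pattern is cut at every run of '*' / '?'. Each literal piece is
    analyzed, and a '*' is re-attached on the side where a wildcard stood.
    '?' is widened to '*' because the original character positions are lost
    once the pattern is cut into pieces.
    """
    literals = re.split(r"[*?]+", pattern)
    clauses = []
    for i, literal in enumerate(literals):
        tokens = fake_analyze(literal)
        if not tokens:
            continue
        if i > 0:                      # a wildcard preceded this literal
            tokens[0] = "*" + tokens[0]
        if i < len(literals) - 1:      # a wildcard followed this literal
            tokens[-1] = tokens[-1] + "*"
        clauses.extend(tokens)
    return clauses


def matches_any_token(clause, index_tokens=INDEX_TOKENS):
    """True if at least one indexed token matches the wildcard clause."""
    return any(fnmatchcase(token, clause) for token in index_tokens)


def matches_all(clauses, index_tokens=INDEX_TOKENS):
    """AND semantics: every clause must match some indexed token."""
    return bool(clauses) and all(matches_any_token(c, index_tokens) for c in clauses)


# One wildcard term spanning both tokens: no single token matches it.
print(matches_any_token("marco*brain.net"))        # False

# Per-token clauses combined with AND: the document is found.
clauses = split_wildcard_query("marco.*@brain.net")
print(clauses)                                     # ['marco*', '*brain.net']
print(matches_all(clauses))                        # True
```

One caveat of this sketch: the AND of independent clauses no longer checks token order or adjacency, so it can match more than the original pattern intended.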
Looking at the source code of org.apache.lucene.queryparser.classic.MapperQueryParser.java (I started digging into the source while chasing down the rather small, already mentioned issue with the hardcoded boolean clause OR operator: https://github.com/elasticsearch/elasticsearch/issues/2183), I realized that there are two different methods for analyzing wildcard and prefix queries: getPossiblyAnalyzedPrefixQuery and getPossiblyAnalyzedWildcardQuery. I had expected both cases to be handled by the same code. That is why I get perfect results for prefix queries and, sadly, non-working ones for pure wildcard queries.

I started to experiment with getPossiblyAnalyzedWildcardQuery by rewriting it to work more like getPossiblyAnalyzedPrefixQuery: instead of generating a single WildcardQuery object from the analyzed string, it builds a boolean query containing several WildcardQuery objects (splitting on * and ?). My first tests showed that this works quite well.

Now my questions:

What do you think about this approach? Do you see any serious drawbacks besides performance? I know that using even more wildcards will drastically reduce search performance, but isn't it better to serve some results after a fairly long time than to find nothing at all? (I also know that Lucene is not built or optimized for wildcard queries, and that some cases could be solved with different analyzers (ngram, reverse), multiple fields, etc. But users are used to wildcards, and there are use cases where such queries make sense, or where it is not practical to use a keyword analyzer that would not suffer from these problems, e.g. for longer text.)

Do you plan to further enhance the getPossiblyAnalyzedWildcardQuery method (although the docs state that this method is best effort)? (By the way, do you also plan to fix the OR operator issue? It could be rather simple: just use the specified parameter.)

If my approach is legitimate, and given that I don't like having to modify the Elasticsearch core code and rebuild/adapt it with every new release: how or where else could I implement such an extension? Do I have to write a custom query parser (maybe extending MapperQueryParser) and build my own plugin / REST endpoint? (I recently found out that there is also a Lucene class called AnalyzingQueryParser. Maybe I should have used that instead of writing my own magic parser. Can it be used somehow in Elasticsearch?)

Is there a possibility to, or should I, write a feature request for even more best effort on analyzing wildcard queries?

PS: I know the wildcard handling issue could be a pain in the a**, and maybe it can only be solved on a best-effort basis. But I'm somewhat forced to mess around with it because I have to (and want to!) port my old Lucene stuff to Elasticsearch. (Apart from this issue I think Elasticsearch is a great product and I like working with it; this problem lies in the nature of inverted indices, wildcards and analyzers.)

Sorry for the long and maybe confusing mail, but I need your expert thoughts and advice on this wildcard issue.

Thank you,
regards
marco