Wildcard handling and analysis certainly need improvement. Support for wildcards in phrases is also weak. Elasticsearch does not support ComplexPhraseQueryParser right now:
http://lucene.apache.org/core/4_10_0/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html

but Solr does:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser

There are many parsers in Lucene, and more parsers with improvements are on the way, like this one:
https://issues.apache.org/jira/browse/LUCENE-5205
https://github.com/tballison/lucene-addons/tree/master/lucene-5205

You can fork the Elasticsearch source and add your improvements to a branch, so that we can look at the code.

Do not forget simple_query_string. I have offered a similar patch that adds prefix analysis to wildcard queries for simple_query_string (porting the best-effort approach from query_string):
https://github.com/elasticsearch/elasticsearch/pull/8422

Cheers,

Jörg

On Wed, Nov 19, 2014 at 9:56 AM, <mkam...@gmail.com> wrote:

> hi
>
> i have text/email addresses indexed with the standard analyzer.
>
> e.g.
>
> "marco.k...@brain.net" that results in two tokens being in the index:
>
> [marco.kamm] and [brain.net]
>
> i want to search using query_string query and wildcards like:
>
> {
>   "fields": ["contact_email"],
>   "query": {
>     "query_string": {
>       "query": "(contact_email:(marco.*@brain.net))",
>       "default_operator": "and",
>       "analyze_wildcard": true
>     }
>   }
> }
>
> from my past working experience with lucene i know that wildcard queries
> are kind of problematic because they're not analyzed by default.
> (to work around this behaviour i wrote a custom parser that prepares the
> query string according to the specific field analyzer before
> passing it to the lucene query parser)
>
> at first when i noticed the analyze_wildcard parameter/option i thought
> great/cool! i no longer need my "custom magic parser ,-)", elasticsearch
> provides built-in support for my problems ...
>
> when testing the "analyze_wildcard" behaviour with "pure" prefix queries
> like "marco.kamm@brain.*" it worked like a charm! resp.
> did the same thing i tried to achieve with my custom "pre-parser".
> the query was "transformed" to sth. like
> "contact_email:marco.kamm OR contact_email:brain*" that perfectly matches
> what's in the index ...
>
> but unfortunately testing with "real" wildcard queries like the above
> "marco.*@brain.net" is giving me a query that won't find anything in my
> situation because it will be turned into "contact_email:marco*brain.net"
> and there's not a single token in my index that will match (although it
> gets analyzed). to find some results the query would rather have to be
> turned into sth. like "contact_email:marco* AND contact_email:brain.net"
> or "contact_email:marco* AND contact_email:*brain.net" (if the user
> searches for "marco.*.net") ...
>
> by looking at the source code of
> org.apache.lucene.queryparser.classic.MapperQueryParser.java (i actually
> started to dive into the source code by chasing down the "rather small"
> already mentioned issue with the hardcoded boolean.clause OR operator here:
> https://github.com/elasticsearch/elasticsearch/issues/2183) i realized
> that there are two different methods for analyzing pure wildcard and prefix
> queries (getPossiblyAnalyzedPrefixQuery resp. getPossiblyAnalyzedWildcardQuery;
> i first expected these cases to be handled by the same code) and that's why
> i'm getting perfect results for prefix queries and sadly non-working
> ones for pure wildcard queries ...
>
> i started to experiment/fiddle with the getPossiblyAnalyzedWildcardQuery
> method by rewriting it to work more like the
> getPossiblyAnalyzedPrefixQuery method, resp.
> instead of generating only a single wildcardquery object with the
> analyzed string, it builds a boolean query including several wildcardquery
> objects (splitting on */?) ...
>
> my first tests showed that this would work quite well! ...
>
> now my questions:
>
> what do you think about this "approach"?
>
> do you see any serious drawbacks, besides performance?
> i know that using even more wildcards will drastically reduce the search
> performance,
> but isn't it better to finally serve some results after quite a long time
> than to find nothing at all?
>
> (i also know that lucene is not built/optimized for wildcard queries and
> some cases could be resolved using different analyzers (ngram, reverse),
> multiple fields etc.
> but users are used to them, and there could be use cases where such wildcard
> queries make sense,
> resp. where it's not practicable to use keyword analyzers that won't suffer
> from such problems, e.g. for longer text etc.)
>
> do you plan to further enhance the getPossiblyAnalyzedWildcardQuery method
> (although it is stated in the docs that this method does best effort)?
>
> (btw. do you also plan to fix the OR operator issue? it could be rather
> simple: just use the specified parameter)
>
> if my approach is legit, and given that i don't like having to modify the
> elasticsearch "core" code and rebuild/adapt it with every new release,
> how/where else could i implement such an extension? do i have to write a
> custom queryparser (maybe extending MapperQueryParser) and build my own
> plugin / rest endpoint ...
>
> (i recently found out that there's also a lucene class called
> AnalyzingQueryParser. maybe i should have used this one instead of writing
> my own magic-parser. is/could this be used somehow in elasticsearch?)
>
> is there a possibility to / should i write a feature request for even more
> best effort on analyzing wildcard queries?
>
> PS i know the wildcard handling
> issue could be a pain in the a**, and maybe could only be solved on a best
> effort basis. but i'm somehow forced to mess around with this because i have
> to (want to!) port my old lucene stuff to elasticsearch (except for this issue
> i think elasticsearch is a great product and i like to work with it. this
> problem lies in the nature of inverted indices and wildcards resp.
> analyzers)
>
> sorry for the long, maybe confusing mail, but i need your expert
> thoughts/advice about this wildcard issue
>
> thank you
> regards marco
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/62b204b3-fef6-4328-abaa-6b1eae99d1e0%40googlegroups.com
> For more options, visit https://groups.google.com/d/optout.
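
For readers following the thread, here is a minimal sketch of the rewrite described in the quoted message: split the wildcard term on * and ?, analyze each fragment, and combine one wildcard clause per token into a boolean query. It is plain Python that only builds a query string; it is not the MapperQueryParser code and does not call Lucene or Elasticsearch. The names toy_analyze and rewrite_wildcard are hypothetical, the toy analyzer only imitates the two tokens quoted above ([marco.kamm] and [brain.net]), and widening ? to * at fragment edges is an assumption of this sketch that favours recall over precision.

```python
import re


def toy_analyze(text):
    """Hypothetical stand-in for a field analyzer (NOT the real standard
    analyzer): lowercase, split on whitespace and '@', and trim
    punctuation from the edges of each token."""
    tokens = []
    for raw in re.split(r"[\s@]+", text.lower()):
        token = raw.strip(".,;:!-_")
        if token:
            tokens.append(token)
    return tokens


def rewrite_wildcard(field, pattern, analyze=toy_analyze, operator="AND"):
    """Best-effort rewrite of one wildcard term into several clauses.

    The pattern is split on '*' and '?'. Each text fragment is analyzed
    on its own; the token that touched a wildcard gets '*' attached on
    that side ('?' is widened to '*', an assumption of this sketch).
    """
    # The capture group keeps the wildcard characters, so fragments sit
    # at even indices and wildcards at odd indices.
    parts = re.split(r"([*?])", pattern)
    clauses = []
    for i in range(0, len(parts), 2):
        tokens = analyze(parts[i])
        if not tokens:
            continue
        if i > 0:
            # A wildcard preceded this fragment.
            tokens[0] = "*" + tokens[0]
        if i + 1 < len(parts):
            # A wildcard followed this fragment.
            tokens[-1] = tokens[-1] + "*"
        clauses.extend(f"{field}:{token}" for token in tokens)
    return f" {operator} ".join(clauses)


print(rewrite_wildcard("contact_email", "marco.*@brain.net"))
# contact_email:marco* AND contact_email:*brain.net
print(rewrite_wildcard("contact_email", "marco.kamm@brain.*"))
# contact_email:marco.kamm AND contact_email:brain*
```

The first result corresponds to the second form suggested in the quoted message. Leading wildcards such as *brain.net are expensive in Lucene, so a real implementation would probably want to make them optional.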