Re: analyzing wildcard queries ...

joergpra...@gmail.com Wed, 19 Nov 2014 02:15:14 -0800

Wildcard handling and analysis surely need improvement.

There is also weak support for wildcards in phrases. Elasticsearch does not
support ComplexPhraseQueryParser right now:


http://lucene.apache.org/core/4_10_0/queryparser/org/apache/lucene/queryparser/complexPhrase/ComplexPhraseQueryParser.html

but Solr does

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser

There are many parsers in Lucene, and more parsers with improvements are on
the way, like this one

https://issues.apache.org/jira/browse/LUCENE-5205

https://github.com/tballison/lucene-addons/tree/master/lucene-5205

You can fork the Elasticsearch source and add your improvements to a
branch, so that we can look at the code.

Do not forget simple_query_string. I have offered a similar patch that adds
prefix analysis to wildcard queries for simple_query_string (porting the
best-effort approach from query_string):

https://github.com/elasticsearch/elasticsearch/pull/8422

Cheers,

Jörg

On Wed, Nov 19, 2014 at 9:56 AM, <mkam...@gmail.com> wrote:

> hi
>
> i have text/email addresses indexed with the standard analyzer.
>
> e.g.
>
> "marco.k...@brain.net" that results in two tokens being in the index:
>
> [marco.kamm] and [brain.net]
>
> i want to search using query_string query and wildcards like:
>
> {
>   "fields" : ["contact_email"],
>   "query" : {
>     "query_string" : {
>       "query" : "(contact_email:(marco.*@brain.net))",
>       "default_operator" : "and",
>       "analyze_wildcard" : true
>     }
>   }
> }
>
> from my past working experience with lucene i know that wildcard queries
> are kind of problematic because they're not analyzed by default.
> (to work around this behaviour i wrote a custom parser that prepares the
> query string depending on the specific field analyzer before passing it
> to the lucene query parser)
>
> at first when i noticed the analyze_wildcard parameter/option i thought
> great/cool! i no longer need my "custom magic parser ,-)", elasticsearch
> provides built-in support for my problems ...
>
> when testing the "analyze_wildcard" behaviour with "pure" prefix queries
> like "marco.kamm@brain.*" it worked like a charm! it did the same thing
> i tried to achieve with my custom "pre-parser": the query was
> "transformed" to sth. like
> "contact_email:marco.kamm OR contact_email:brain*" that perfectly matches
> what's in the index ...
>
> but unfortunately testing with "real" wildcard queries like the above
> "marco.*@brain.net" gives me a query that won't find anything in my
> situation because it is turned into "contact_email:marco*brain.net", and
> there's no single token in my index that matches it (although it gets
> analyzed). to find some results the query would rather have to be turned
> into sth. like "contact_email:marco* AND contact_email:brain.net" or
> "contact_email:marco* AND contact_email:*brain.net" (if the user searches
> for "marco.*.net") ...
>
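
To make the mismatch concrete, here is a minimal sketch in Python. It
assumes the index holds exactly the two tokens quoted above and uses the
standard library's fnmatchcase as a stand-in for Lucene's wildcard matching
(an approximation: fnmatch also treats [...] as special, which Lucene
wildcards do not). The helper name matching_tokens is made up for
illustration.

```python
from fnmatch import fnmatchcase

# Assumption: the only tokens indexed for the field, as quoted in the mail.
index_tokens = ["marco.kamm", "brain.net"]


def matching_tokens(pattern):
    """Return the index tokens that a single wildcard pattern matches."""
    return [token for token in index_tokens if fnmatchcase(token, pattern)]


# One wildcard term spanning both tokens: it has to match a single token.
single = matching_tokens("marco*brain.net")

# One pattern per token: each clause is checked against the tokens separately.
split = [matching_tokens(p) for p in ("marco*", "brain.net")]

print(single)
print(split)
```

The single rewritten pattern matches no token, while each of the split
patterns matches one of them.
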
> by looking at the source code of
> org.apache.lucene.queryparser.classic.MapperQueryParser.java (i actually
> started to dive into the source code by chasing down the "rather small"
> already mentioned issue with the hardcoded boolean.clause OR operator
> here: https://github.com/elasticsearch/elasticsearch/issues/2183) i
> realized that there are two different methods for analyzing pure wildcard
> and prefix queries (getPossiblyAnalyzedPrefixQuery resp.
> getPossiblyAnalyzedWildcardQuery, i first expected these cases to be
> handled by the same code) and that's why i'm getting perfect results for
> prefix queries and sadly non-working ones for pure wildcard queries ...
>
> i started to experiment/fiddle with the getPossiblyAnalyzedWildcardQuery
> method by rewriting it to work more like the
> getPossiblyAnalyzedPrefixQuery method: instead of generating only a single
> wildcardquery object with the analyzed string, it builds a boolean query
> including several wildcardquery objects (splitting on */?) ...
>
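
The splitting idea can be sketched roughly as follows, in Python rather
than as a patch to MapperQueryParser. This is not the actual Elasticsearch
code: toy_analyze is a hypothetical stand-in for the field analyzer (it only
lowercases, splits on "@" and whitespace, and trims punctuation), and
split_wildcard_query is an invented helper that returns a query string
instead of Lucene query objects. Each wildcard is attached to the tokens on
both sides of it, which gives the looser "marco* AND *brain.net" form.

```python
import re


def toy_analyze(text):
    """Hypothetical stand-in for the field analyzer, not the real one."""
    tokens = []
    for raw in re.split(r"[@\s]+", text.lower()):
        token = raw.strip(".,;:")  # trim punctuation at the token edges
        if token:
            tokens.append(token)
    return tokens


def split_wildcard_query(field, pattern, operator="AND"):
    """Split a wildcard pattern on * and ? and build one clause per token."""
    # The capturing group keeps the wildcards: literals sit at even
    # indices of `parts`, wildcard characters at odd indices.
    parts = re.split(r"([*?])", pattern)
    clauses = []
    for i in range(0, len(parts), 2):
        tokens = toy_analyze(parts[i])
        if not tokens:
            continue
        lead = parts[i - 1] if i > 0 else ""
        trail = parts[i + 1] if i + 1 < len(parts) else ""
        # Re-attach the neighbouring wildcards to the outermost tokens.
        tokens[0] = lead + tokens[0]
        tokens[-1] = tokens[-1] + trail
        clauses.extend(f"{field}:{token}" for token in tokens)
    return f" {operator} ".join(clauses)


print(split_wildcard_query("contact_email", "marco.*@brain.net"))
print(split_wildcard_query("contact_email", "marco.kamm@brain.*", "OR"))
```

For the query from this thread the sketch produces
contact_email:marco* AND contact_email:*brain.net. Leading wildcards such
as *brain.net are expensive in Lucene, so this form trades speed for recall.
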
> my first tests showed that this would work quite well! ...
>
>
>
> now my questions:
>
> what do you think about this "approach"?
>
> do you see any serious drawbacks besides performance?
> i know that using even more wildcards will drastically reduce the search
> performance,
> but isn't it better to finally serve some results after quite a long time
> than to find nothing at all?
>
> (i also know that lucene is not built/optimized for wildcard queries and
> that some cases could be resolved using different analyzers (ngram,
> reverse), multiple fields etc.
> but users are used to them, and there could be use cases where such
> wildcard queries make sense, or where it's not practicable to use keyword
> analyzers that won't suffer from such problems, e.g. for longer text etc.)
>
> do you plan to further enhance the getPossiblyAnalyzedWildcardQuery method
> (although it is stated in the docs that this method makes a best effort)?
>
> (btw. do you also plan to fix the OR operator issue? it could be rather
> simple: just use the specified parameter)
>
> if my approach is legit, and given that i don't like having to modify the
> elasticsearch "core" code and rebuild/adapt it with every new release,
> how/where else could i implement such an extension? do i have to write a
> custom queryparser (maybe one that extends MapperQueryParser) and build my
> own plugin / rest endpoint?
>
> (i recently found out that there's also a lucene class called
> AnalyzingQueryParser. maybe i should have used this one instead of writing
> my own magic-parser. is/could this be used somehow in elasticsearch?)
>
> is there a possibility to / should i write a feature request for even more
> best effort on analyzing wildcard queries? PS: i know the wildcard handling
> issue could be a pain in the a**, and maybe can only be solved on a
> best-effort basis. but i'm somehow forced to mess around with this because
> i have to (want to!) port my old lucene stuff to elasticsearch (except for
> this issue i think elasticsearch is a great product and i like to work
> with it. this problem lies in the nature of inverted indices and wildcards
> resp. analyzers)
>
>
> sorry for the long, maybe confusing mail, but i need your expert
> thoughts/advice about this wildcard issue
>
> thank you
> regards marco
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to elasticsearch+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/62b204b3-fef6-4328-abaa-6b1eae99d1e0%40googlegroups.com
> .
> For more options, visit https://groups.google.com/d/optout.
>

