Re: analyzing wildcard queries ...

mkamm78 Thu, 20 Nov 2014 06:49:04 -0800

hi jörg

just wanted to tell you that i won't fork/commit my "improvement" to the 
wildcard analysis, because i'm no longer 100% convinced
that it really is an improvement or that it can be used in general...
 
after rethinking i must admit that i was probably too focused on my 
concrete issues with email addresses and the standard analyzer, 
e.g. "marco.kamm@brain.net" analyzed into the tokens [marco.kamm] [brain.net].
the original idea behind using the standard analyzer was that users would 
find sth. when searching for "brain.net" or "marco.kamm" without having to 
use any wildcards!
(the old lucene standard analyzer also split on '.' characters, so even 
"marco" or "brain" could be found)
 
somehow i thought it would also make sense to search for e.g. 
"marco.*@brain.net" or "marco.kamm@*.net"


my first improvement approach was based on the existing code, but instead of 
concatenating all the analyzed sub-string parts into a single wildcard query 
i tried to build a boolean query containing the individual analyzed parts 
as either prefix or wildcard queries ... 
e.g.
"marco.*@brain.net" --> "marco*" AND "*brain.net"
"marco.kamm@*.net" --> "marco.kamm*" AND "*net"
the first part can be a plain prefix query (when it is not preceded by a 
wildcard char) and the last one can be a "postfix" query;
everything in between is surrounded by '*'...'*'
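
a minimal sketch of this first approach, in python instead of the java of the real parser. toy_analyze is a hypothetical stand-in for the standard analyzer (it is not lucene's tokenizer), all function names are made up for illustration, and concatenated_wildcard only approximates what this thread says the existing code produces. '?' is treated like '*' to keep it short.

```python
import re

def toy_analyze(text):
    # hypothetical stand-in for the standard analyzer: lowercase, split on
    # '@' and other punctuation, keep dots that sit between letters/digits
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())

def concatenated_wildcard(pattern):
    # roughly the existing behaviour as described in this thread: analyze the
    # parts between wildcards and glue everything back into ONE wildcard term
    parts = re.split(r"([*?])", pattern)
    return "".join(p if p in ("*", "?") else "".join(toy_analyze(p)) for p in parts)

def split_wildcard_clauses(pattern):
    # first approach: one clause per analyzed token, to be AND-ed together.
    # only meant for patterns that actually contain a wildcard.
    tokens = [t for chunk in re.split(r"[*?]", pattern) for t in toy_analyze(chunk)]
    starts_wild = pattern[:1] in ("*", "?")
    ends_wild = pattern[-1:] in ("*", "?")
    clauses = []
    for i, tok in enumerate(tokens):
        # first token: prefix query, last token: "postfix" query,
        # everything in between: wildcard on both sides
        lead = "*" if i > 0 or starts_wild else ""
        trail = "*" if i < len(tokens) - 1 or ends_wild else ""
        clauses.append(lead + tok + trail)
    return clauses

print(concatenated_wildcard("marco.*@brain.net"))
print(" AND ".join(split_wildcard_clauses("marco.*@brain.net")))
print(" AND ".join(split_wildcard_clauses("marco.kamm@*.net")))
```

with the toy analyzer the single concatenated term matches neither indexed token, while each of the split clauses can match one token on its own.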

another (optimized) approach is based on the following technique:
generate a random letter sequence that is not present in the search term, 
replace the wildcards with this sequence and feed the result to the analyzer.
this way, if the analyzer produces more than one token from a single 
wildcard input, you can be sure that the original inputs (as indexed) were 
also split into several terms, so you need more than one query object ...
 
after analyzing, process the resulting tokens one by one and combine them 
into a boolean AND query. for each token, undo the wildcard replacement 
and check where wildcard characters occur: if a token contains no 
wildcards at all use a term query, if the token only contains a wildcard 
char at the end use a prefix query, 
otherwise use a wildcard query ...
 
e.g.
"marco.*@brain.net" --> "marco.{randomLetterSequence}@brain.net" --> 
[marco.{randomLetterSequence}] [brain.net] --> "marco.*" AND "brain.net"
"marco.kamm@*.net" --> "marco.kamm@{randomLetterSequence}.net" --> 
[marco.kamm] [{randomLetterSequence}.net] --> "marco.kamm" AND "*.net"
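
the placeholder technique can be sketched like this, again under assumptions: toy_analyze is a hypothetical stand-in for the real field analyzer, the function names are invented, only '*' is handled, and the result is a list of (query type, text) pairs instead of real lucene query objects.

```python
import random
import re
import string

def toy_analyze(text):
    # hypothetical stand-in for the standard analyzer: lowercase, split on
    # '@' and other punctuation, keep dots that sit between letters/digits
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())

def make_placeholder(term, length=12):
    # random lowercase letter sequence that does not occur in the search term
    lowered = term.lower()
    while True:
        candidate = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
        if candidate not in lowered:
            return candidate

def classify(token):
    # no wildcard -> term query, single trailing wildcard -> prefix query,
    # anything else -> wildcard query
    if "*" not in token:
        return ("term", token)
    if token.index("*") == len(token) - 1:
        return ("prefix", token)
    return ("wildcard", token)

def analyze_wildcard(pattern):
    # hide the wildcards from the analyzer, analyze, then undo the
    # replacement per token; the clauses are meant to be AND-ed
    placeholder = make_placeholder(pattern)
    tokens = toy_analyze(pattern.replace("*", placeholder))
    return [classify(tok.replace(placeholder, "*")) for tok in tokens]

print(analyze_wildcard("marco.*@brain.net"))
print(analyze_wildcard("marco.kamm@*.net"))
```

whether an analyzer leaves such a letter sequence intact depends on the analyzer; the toy one does, which is exactly the assumption that may not hold in general.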
 
these approaches could work for my cases (at least they produce some 
results where the original code didn't find anything, although the results 
may be inaccurate; that lies in the nature of AND combinations, e.g. 
"marco.*@brain.net" transformed into "marco.*" AND "brain.net" could also 
find brain....@marco.org etc.) 
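
a small illustration of that inaccuracy, with assumptions spelled out: toy_analyze is a hypothetical stand-in for the standard analyzer, python's fnmatch stands in for wildcard matching against index terms, and the addresses other than marco.kamm@brain.net are invented for the example.

```python
import re
from fnmatch import fnmatchcase

def toy_analyze(text):
    # hypothetical stand-in for the standard analyzer: lowercase, split on
    # '@' and other punctuation, keep dots that sit between letters/digits
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())

def matches_all(clauses, address):
    # AND semantics: every clause must match at least one token of the
    # address, but token order and position are not checked at all
    tokens = toy_analyze(address)
    return all(any(fnmatchcase(tok, clause) for tok in tokens) for clause in clauses)

clauses = ["marco.*", "brain.net"]
for address in ("marco.kamm@brain.net", "brain.net@marco.org", "marco.kamm@example.org"):
    print(address, matches_all(clauses, address))
```

the second (invented) address matches although the user clearly did not ask for it, because the AND query no longer knows which side of the '@' a token came from.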

but i think for most cases (where the queried field uses an analyzer 
that doesn't split terms into several tokens, e.g. the keyword analyzer) 
the existing code already makes the best effort that is possible in a 
generic way (without knowing what the analyzer does with certain 
characters)

maybe you can use sth. from my 2nd approach: test the analyzer's 
behaviour by replacing the wildcards with sth. that doesn't get eaten, to 
see whether the input is split or not
(i think a sequence of plain ascii letters could be a way, but i'm not sure 
whether this could serve as a general solution, e.g. for japanese analyzers; 
to me a sequence of ascii letters seems like a kind of lowest common 
denominator).
 
for the moment we're trying to live with the current best-effort approach, 
maybe analyzing some fields twice, once with the standard analyzer and 
additionally with a keyword analyzer, and directing pure wildcard queries to 
the keyword field. or maybe we're going to split email addresses into 
separate username and domain fields etc. 
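
both workarounds could look roughly like this. the mapping uses the multi-field syntax of the elasticsearch 1.x releases current at the time of this thread (an assumption about the version in use); the sub-field name "raw" and the field names contact_username / contact_domain are hypothetical.

```python
import json

# contact_email is analyzed with the standard analyzer, and a hypothetical
# sub-field "raw" keeps the whole address as one unanalyzed term
mapping = {
    "properties": {
        "contact_email": {
            "type": "string",
            "analyzer": "standard",
            "fields": {
                "raw": {"type": "string", "index": "not_analyzed"}
            },
        }
    }
}

def wildcard_on_raw(pattern):
    # send pure wildcard input to the untokenized sub-field
    return {"query": {"wildcard": {"contact_email.raw": pattern}}}

def split_email(address):
    # alternative: index username and domain as separate fields
    username, _, domain = address.partition("@")
    return {"contact_username": username, "contact_domain": domain}

print(json.dumps(mapping, indent=2))
print(json.dumps(wildcard_on_raw("marco.*@brain.net"), indent=2))
print(split_email("marco.kamm@brain.net"))
```

note that a not_analyzed field is not lowercased, so wildcard queries against it are case sensitive.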

thank you anyway for your time

cheers marco
 
 

On Wednesday, 19 November 2014 09:56:43 UTC+1, mka...@gmail.com wrote:

>  hi
>
> i have text/email addresses indexed with the standard analyzer. 
>
> e.g.
>
> "marco.k...@brain.net" that results in two tokens being in the index:
>
> [marco.kamm] and [brain.net]
>
> i want to search using query_string query and wildcards like:
>
> {
>   "fields" : ["contact_email"],
>   "query" : {
>     "query_string" : {
>       "query" : "(contact_email:(marco.*@brain.net))",
>       "default_operator" : "and",
>       "analyze_wildcard" : true
>     }
>   }
> }
>
> from my past working experience with lucene i know that wildcard queries 
> are kind of problematic because they're not analyzed by default.
> (to work around this behaviour i wrote a custom parser that prepares the 
> query string depending on the specific field analyzer before 
> passing it to the lucene query parser)
>
> at first when i noticed the analyze_wildcard parameter/option i thought 
> great/cool! i no longer need my "custom magic parser ,-)", elasticsearch 
> provides built-in support for my problems ... 
>
> when testing the "analyze_wildcard" behaviour with "pure" prefix queries 
> like "marco.kamm@brain.*" it worked like a charm, i.e. it did the same 
> thing i tried to achieve with my 
> custom "pre-parser". the query was "transformed" to sth. like 
> "contact_email:marco.kamm OR contact_email:brain*" that perfectly matches 
> what's in the index ...
>
> but unfortunately testing with "real" wildcard queries like the above 
> "marco.*@brain.net" gives me a query that won't find anything in my 
> situation because it will be 
> turned into: "contact_email:marco*brain.net" and there's no single token 
> in my index that will match (although it gets analyzed). to find some 
> results the query would rather have 
> to be turned into sth. like: 
> "contact_email:marco* AND contact_email:brain.net" or 
> "contact_email:marco* AND contact_email:*brain.net" (if the 
> user searches for "marco.*.net") ...
>
> by looking at the source code of 
> org.apache.lucene.queryparser.classic.MapperQueryParser.java (i actually 
> started to dive into the source code while chasing down the "rather small" 
> already mentioned issue
> with the hardcoded boolean clause OR operator here: 
> https://github.com/elasticsearch/elasticsearch/issues/2183) i realized 
> that there are two different methods for analyzing pure wildcard and prefix 
> queries
> (getPossiblyAnalyzedPrefixQuery and getPossiblyAnalyzedWildcardQuery; i 
> first expected these cases to be handled by the same code) and that's why 
> i'm getting perfect results for prefix queries and sadly non-working 
> ones for
> pure wildcard queries ...
>
> i started to experiment/fiddle with the getPossiblyAnalyzedWildcardQuery 
> method by rewriting it to work more like the 
> getPossiblyAnalyzedPrefixQuery method, i.e. 
> instead of generating only a single wildcard query object from the 
> analyzed string, it builds a boolean query containing several wildcard query 
> objects (splitting on */?)...
>
> my first tests showed that this would work quite well! ...
>
>  
>
> now my questions:
>
> what do you think about this "approach"? 
>
> do you see any serious drawbacks besides performance?
> i know that using even more wildcards will drastically reduce the search 
> performance,  
> but isn't it better to finally serve some results after quite a long time than 
> to find nothing at all?
>
> (i also know that lucene is not built/optimized for wildcard queries and 
> some cases could be resolved using different analyzers (ngram, reverse), 
> multiple fields etc., 
> but users are used to them, and there could be use cases where such wildcard 
> queries make sense 
> or where it's not practicable to use keyword analyzers, which don't suffer 
> from such problems, e.g. for longer text etc.)! 
>
> do you plan to further enhance the getPossiblyAnalyzedWildcardQuery method 
> (although it is stated in the docs that this method makes a best effort)?
>
> (btw. do you also plan to fix the OR operator issue? it could be rather 
> simple: just use the specified parameter)
>
> if my approach is legit, and given that i don't like having to modify the 
> elasticsearch "core" code and rebuild/adapt it with every new release, 
> how/where else
> could i implement such an extension? do i have to write a custom 
> query parser (maybe extending MapperQueryParser) and build my own plugin / 
> rest endpoint ...
>
> (i recently found out that there's also a lucene class called 
> AnalyzingQueryParser; maybe i should have used this one instead of writing 
> my own magic-parser. is/could this be used somehow in elasticsearch?)
>
> is there a possibility to / should i write a feature request for even more 
> best effort on analyzing wildcard queries? PS i know the wildcard handling 
> issue can be a pain in the a**, and maybe it can only be solved on a 
> best-effort basis. but i'm somehow forced to mess around with this because i 
> have to (want to!) port my old lucene stuff to elasticsearch (apart from this 
> issue i think elasticsearch is a great product and i like to work with it. this 
> problem lies in the nature of inverted indices, wildcards and 
> analyzers) 
>
>
> sorry for the long, maybe confusing mail, but i need your expert 
> thoughts/advice on this wildcard issue
>
> thank you 
> regards marco
>  
>
