[jira] Issue Comment Edited: (SOLR-1321) Support for efficient leading wildcards search

Robert Muir (JIRA) Fri, 31 Jul 2009 10:13:39 -0700

    [ https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737601#action_12737601 ]


Robert Muir edited comment on SOLR-1321 at 7/31/09 10:11 AM:
-------------------------------------------------------------

Andrzej, I see what you are saying. I think it's a great feature the way it is!

{noformat}
In the future I will look at finding a way to do both, so that complex 
cases like *abcde?f get reversed by this feature into \u0001f?edcba*, 
but are then implemented with an automaton so that it doesn't have to 
enumerate all tokens that start with \u0001f. 
{noformat}
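
As a rough illustration of the reversal step only (not the automaton part), 
here is a minimal sketch. It assumes the marker is \u0001 as in the issue 
description; the class and method names are made up for this example and are 
not taken from the patch.

```java
// Hypothetical sketch: names are illustrative, not from wildcards.patch.
final class ReversedPatternSketch {
    // Marker character as given in the issue description (assumption:
    // the real filter may make this configurable).
    static final char MARKER = '\u0001';

    /**
     * Reverses the whole wildcard pattern and prepends the marker, so a
     * leading wildcard ends up at the end of the rewritten pattern.
     */
    static String reversePattern(String pattern) {
        return MARKER + new StringBuilder(pattern).reverse().toString();
    }
}
```

With this sketch, reversePattern("*abcde?f") gives the marker followed by 
f?edcba*, so the fixed prefix for term enumeration is the marker plus "f".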

This is a bad example, but I hope you see what I mean. The biggest challenge 
would be preventing suboptimal cases, like reversing g?abcde* into 
\u0001*edcba?g (at least I think).
The first form is actually more efficient, I think, regardless of the wildcard 
implementation.

I wonder if your patch could include an additional check: if a wildcard is in 
the 1st position but the last character is also a wildcard, do not reverse 
the term?
In the example above, even with the default Lucene wildcard query, it would at 
least only enumerate the tokens starting with g, so it's better not to reverse 
it.

If the wildcard is in the 0th position, it doesn't matter whether you reverse 
it or not, but I think that one case can be optimized.
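
A rough sketch of that extra check, assuming the reversal rule from the issue 
description (reverse only when the first wildcard is at position 0 or 1). The 
class and method names are hypothetical and this is not code from the patch.

```java
// Hypothetical sketch of the reversal decision; not taken from the patch.
final class ReversalHeuristicSketch {
    static boolean isWildcard(char c) {
        return c == '*' || c == '?';
    }

    /** Returns the index of the first wildcard character, or -1 if none. */
    static int firstWildcard(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (isWildcard(term.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    static boolean shouldReverse(String term) {
        int first = firstWildcard(term);
        // Rule from the issue description: only consider reversing when
        // the first wildcard sits at position 0 or 1.
        if (first < 0 || first > 1) {
            return false;
        }
        boolean trailingWildcard = isWildcard(term.charAt(term.length() - 1));
        // Suggested extra check: with a wildcard at position 1 and another
        // at the end, the unreversed term still has a one-character prefix
        // (e.g. "g"), while the reversed term would start with a wildcard.
        if (first == 1 && trailingWildcard) {
            return false;
        }
        return true;
    }
}
```

Under these assumptions g?abcde* is left alone, while *abcde?f and g?abcde 
are still reversed.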

Thanks,
Robert

  
> Support for efficient leading wildcards search
> ----------------------------------------------
>
>                 Key: SOLR-1321
>                 URL: https://issues.apache.org/jira/browse/SOLR-1321
>             Project: Solr
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 1.4
>            Reporter: Andrzej Bialecki 
>             Fix For: 1.4
>
>         Attachments: wildcards.patch
>
>
> This patch is an implementation of the "reversed tokens" strategy for 
> efficient leading wildcards queries.
> ReversedWildcardsTokenFilter reverses tokens and returns both the original 
> token (optional) and the reversed token (with positionIncrement == 0). 
> Reversed tokens are prepended with a marker character to avoid collisions 
> between legitimate tokens and the reversed tokens - e.g. "DNA" would become 
> "and", thus colliding with the regular term "and", but with the marker 
> character it becomes "\u0001and".
> This TokenFilter can be added to the analyzer chain that is used during 
> indexing.
> SolrQueryParser has been modified to detect the presence of such fields in 
> the current schema, and treat them in a special way. First, SolrQueryParser 
> examines the schema and collects a map of fields where these reversed tokens 
> are indexed. If there is at least one such field, it also sets 
> QueryParser.setAllowLeadingWildcard(true). When building a wildcard query 
> (in getWildcardQuery) the term text may be optionally reversed to put 
> wildcards further along the term text. This happens when the field uses the 
> reversing filter during indexing (as detected above), AND if the wildcard 
> characters are either at 0-th or 1-st position in the term. Otherwise the 
> term text is processed as before, i.e. turned into a regular wildcard query.
> Unit tests are provided to test the TokenFilter and the query parsing.
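
The index-side behavior described above can be sketched in plain Java, without 
the Lucene TokenFilter API. This is only an illustration under assumptions: 
the class and method names are hypothetical, position increments are ignored, 
and the input is assumed to be lowercased already.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "reversed tokens" idea; not the real filter.
final class ReversedTokenSketch {
    // Marker character from the issue description.
    static final char MARKER = '\u0001';

    /**
     * Returns the terms that would be indexed for one input token:
     * optionally the original, plus the marked, reversed form.
     */
    static List<String> expand(String token, boolean keepOriginal) {
        List<String> out = new ArrayList<>();
        if (keepOriginal) {
            out.add(token);
        }
        out.add(MARKER + new StringBuilder(token).reverse().toString());
        return out;
    }
}
```

Here expand("dna", true) yields "dna" and the marker followed by "and", which 
stays distinct from the ordinary term "and" because of the marker.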

