On 5/11/07, Walt Stoneburner <[EMAIL PROTECTED]> wrote:
> In this tutorial he stresses not once, not twice, but three times that
> the same Analyzer that is used to build an index -must- also be used
> when performing a Query. There is great detail explaining why this is
> so.
> However, in order to get our magic to work, we need to violate this
> rule in a very clever way.
Yeah, "compatible" analyzer would be a better way to put it. Using
the same analyzer at query time is normally wrong for anything that
produces multiple tokens at the same position.
Solr allows specification of a "query" analyzer and an "index"
analyzer for these cases.
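For illustration, the Solr side of that looks something like this in
schema.xml (the field type name and the tokenizer/filter choices here
are mine, just a sketch):

  <fieldType name="text_case" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- an index-side filter emitting both case variants goes here -->
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>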
> STEP ONE: Building an index that has both case-sensitive and
> case-insensitive tokens in it.
Yep, your approach sounds fine, and will work in phrase queries (which
the two-field solution currently can't handle). The greater
difficulty lies in making it generic (working for many analyzers,
etc.).
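Here's a minimal sketch of that index-side step against the Lucene 2.x
TokenStream API (the class name is mine). For every input token it
emits a lowercased copy plus a "$"-prefixed exact-case copy stacked at
the same position, which is also where the "twice as many tokens" cost
mentioned below comes from:

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  /** Emits each token lowercased, plus a "$"-prefixed exact-case copy
   *  at the same position (positionIncrement == 0). */
  public class CasePreservingFilter extends TokenFilter {
    private Token pending;  // exact-case copy waiting to be emitted

    public CasePreservingFilter(TokenStream input) {
      super(input);
    }

    public final Token next() throws IOException {
      if (pending != null) {          // second half of the previous pair
        Token t = pending;
        pending = null;
        return t;
      }
      Token t = input.next();
      if (t == null) return null;

      // case-insensitive copy, carrying the original position increment
      Token lowered = new Token(t.termText().toLowerCase(),
                                t.startOffset(), t.endOffset(), t.type());
      lowered.setPositionIncrement(t.getPositionIncrement());

      // exact-case copy, "$"-marked, stacked at the same position
      pending = new Token("$" + t.termText(),
                          t.startOffset(), t.endOffset(), t.type());
      pending.setPositionIncrement(0);
      return lowered;
    }
  }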
> This step is where things get complicated. It turns out that
> StandardAnalyzer, which uses the StandardTokenizer, throws away dollar
> signs. So, it doesn't matter how many you type in your query, they
> all vanish, never giving you the opportunity to do anything with them
> downstream.
This points out the difficulty of doing this in a *generic* way.
Better than a "$" would be a flag on the Token, IMO. That's not
currently supported by Lucene, but you could perhaps subclass Token.
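Something like this, entirely hypothetical since nothing downstream in
Lucene would consume the flag today:

  import org.apache.lucene.analysis.Token;

  /** Hypothetical Token subclass: carries the case-sensitivity marker
   *  out-of-band instead of as an in-band "$" prefix. */
  public class CaseFlaggedToken extends Token {
    private final boolean exactCase;

    public CaseFlaggedToken(String text, int start, int end, boolean exactCase) {
      super(text, start, end);
      this.exactCase = exactCase;
    }

    public boolean isExactCase() {
      return exactCase;
    }
  }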
> Bringing it all together, it's now possible to use the query-time
> version of your token analyzer with the QueryParser. Calling .parse()
> with dollar-sign-prefixed strings will search for exact-case matches,
> while omitting the prefix works like the regular old Lucene we all
> know and love.
> The downside? The index has twice as many tokens.
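Query-time usage would then look roughly like this ("$" is not query
syntax, so QueryParser hands it through to the analyzer untouched;
CaseMarkerAnalyzer is a stand-in for the query-side analyzer described
above):

  import org.apache.lucene.queryParser.ParseException;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Query;

  public class CaseQueryDemo {
    public static void main(String[] args) throws ParseException {
      // CaseMarkerAnalyzer is hypothetical: the query-side analyzer
      // that lowercases unmarked terms and keeps "$"-marked ones as-is.
      QueryParser parser = new QueryParser("body", new CaseMarkerAnalyzer());
      Query exact = parser.parse("$Apple");  // exact-case matches only
      Query loose = parser.parse("apple");   // the regular Lucene behavior
      System.out.println(exact + " | " + loose);
    }
  }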
I've also considered case-insensitive support at the TermEnum level.
It would make lookups slower, but the index wouldn't be much bigger
(only slightly, since everything would be indexed without lowercasing).
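A rough sketch of what such a lookup could do in the meantime, from
user code: since terms sort case-sensitively, the case variants of a
word aren't adjacent, so the whole field has to be scanned (hence the
slower lookups):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermEnum;

  public class CaseInsensitiveLookup {
    /** Collects every indexed term in 'field' equal to 'word',
     *  ignoring case. */
    public static List findVariants(IndexReader reader, String field, String word)
        throws IOException {
      List variants = new ArrayList();
      TermEnum terms = reader.terms(new Term(field, ""));
      try {
        do {
          Term t = terms.term();
          if (t == null || !t.field().equals(field)) break;  // past this field
          if (t.text().equalsIgnoreCase(word)) variants.add(t);
        } while (terms.next());
      } finally {
        terms.close();
      }
      return variants;
    }
  }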
> I'd love to see a formal syntax like this officially enter the Lucene
> standard query language someday.
> If someone can point me at how to do this without twiddling
> Lucene's code directly, I'd be happy to contribute the modification.
If you picked a token prefix/postfix that would pass through
the QueryParser w/o a syntax error, the necessary manipulation could
all be done in the Analyzer/TokenFilter. Much easier, but perhaps not
as nice a syntax.
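That query-side filter could be as small as this (assuming a tokenizer
that preserves the marker, e.g. WhitespaceTokenizer rather than
StandardTokenizer):

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;

  /** Query-side filter: "$"-prefixed terms pass through untouched and
   *  match the exact-case copies in the index; everything else is
   *  lowercased to match the case-insensitive copies. */
  public class CaseMarkerQueryFilter extends TokenFilter {
    public CaseMarkerQueryFilter(TokenStream input) {
      super(input);
    }

    public final Token next() throws IOException {
      Token t = input.next();
      if (t == null) return null;
      String text = t.termText();
      if (text.startsWith("$")) {
        return t;  // exact-case lookup: keep marker and casing
      }
      Token lowered = new Token(text.toLowerCase(),
                                t.startOffset(), t.endOffset(), t.type());
      lowered.setPositionIncrement(t.getPositionIncrement());
      return lowered;
    }
  }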
-Yonik