Yonik Seeley adds some wonderful observations:
Yeah, "compatible" analyzer would be a better way to put it. Using the same analyzer for anything that produces multiple tokens at the same position is normally wrong.
I came to the same conclusion the moment I realized that my query string was being transformed into tokens. What I wasn't 100% sure of, and what you've just confirmed, was whether using two different analyzers would break anything. It seems that not only does it not break anything, it's actually necessary.
Solr allows specification of a "query" analyzer and an "index" analyzer for these cases.
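For anyone following along in raw Lucene (Solr aside), the same split falls out naturally: hand one analyzer to the IndexWriter and a different, compatible one to the QueryParser. A minimal sketch against the Lucene API of that era; StandardAnalyzer stands in for the custom case-preserving analyzer (a sketch of the filter it would wrap appears further down), and the path and field name are made up:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class TwoAnalyzerExample {
        public static void main(String[] args) throws Exception {
            // Index-time analyzer: the one that emits the extra
            // case-preserved tokens (StandardAnalyzer is a stand-in).
            Analyzer indexAnalyzer = new StandardAnalyzer();
            IndexWriter writer = new IndexWriter("/tmp/demo-index", indexAnalyzer, true);
            // ... writer.addDocument(...) calls go here ...
            writer.close();

            // Query-time analyzer: compatible but different -- it maps
            // each query word to exactly one (lowercased) token.
            QueryParser parser = new QueryParser("contents", new SimpleAnalyzer());
            Query q = parser.parse("some query");
        }
    }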
I've had only a passing glance at Solr. It looked like an awesome replacement for anyone considering a Google appliance, and its scalability and additional features really appealed to me, but it felt more like a complete solution sealed with a pretty bow than an API library I could use from my own code. Is it worth a second look? And, if so, is it possible to easily capitalize on the Lucene modifications there without bringing in the other extras?
Yep, your [two token] approach sounds fine, and will work in phrase queries (which the two-field solution currently can't handle). The greater difficulty lies in making it generic (working for many analyzers, etc).
Since the index writer is still writing the same tokens the other analyzers do, the standard query mechanisms still work; the extra dollar-sign-prefixed, case-preserved tokens simply get ignored. It would be nice, as you point out, to be able to use them with other analyzers, though even with 20/20 hindsight I'm not sure how that could be accomplished without basic architectural trade-offs that would inadvertently hurt performance for those not interested in the feature.
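For the curious, that index-time trick fits in a small TokenFilter under the old, Token-returning TokenStream API (pre-2.9); the class name and the "$" convention are from my setup, nothing standard:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Emits each token lowercased, plus a "$"-prefixed, case-preserved
    // twin stacked at the same position (positionIncrement == 0), so
    // phrase queries and ordinary lowercase queries keep working.
    public class CasePreservingFilter extends TokenFilter {
        private Token pending; // exact-case twin awaiting emission

        public CasePreservingFilter(TokenStream in) { super(in); }

        public Token next() throws IOException {
            if (pending != null) {
                Token t = pending;
                pending = null;
                return t;
            }
            Token t = input.next();
            if (t == null) return null;
            String text = t.termText();
            String lower = text.toLowerCase();
            if (!lower.equals(text)) {
                // Stack the exact-case variant at the same position.
                pending = new Token("$" + text, t.startOffset(), t.endOffset());
                pending.setPositionIncrement(0);
            }
            Token lowered = new Token(lower, t.startOffset(), t.endOffset());
            lowered.setPositionIncrement(t.getPositionIncrement());
            return lowered;
        }
    }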
This points out the difficulty of doing this in a *generic* way. Better than a "$" would be a flag on the Token IMO. Not currently really supported by lucene, but you could perhaps subclass Token.
In fact, this is how I originally wanted to approach the problem. I was surprised there wasn't a flag field available to tinker with. I'll have to look into the subclassing solution, though I wonder whether it would always be my subclass that gets called to create a token. I haven't gone through the Lucene engine as a whole yet, only the parts immediately applicable to my short-term problem and solution. Still, this is a good idea; I'll look into it.

My biggest concern is that I can't just supply my own class or subclass for the parsing. At the moment I'm stuck replicating the Lucene .jj grammar, modifying it, and maintaining my own personal copy. Should the base copy of Lucene get updated, say when a new version is released, I have to hope the implementation didn't change, manually copy the code over, and patch it again, as opposed to subclassing some method or augmenting some data structure. The way I make myself feel better about this is to keep telling myself that Lucene is an API, that these classes are mere examples, and that I should be writing my own parser. It may not be true, but it helps me sleep at night.
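For what it's worth, the subclassing idea might look like the sketch below, assuming Token isn't final in the version at hand (this has varied across releases, hence Yonik's "perhaps"); the class and flag names are mine:

    import org.apache.lucene.analysis.Token;

    // Hypothetical Token subclass: carries the case-sensitivity bit as
    // a real flag instead of smuggling it through a "$" prefix. Any
    // consumer would need an instanceof check, which is exactly the
    // generic-support problem discussed above.
    public class FlaggedToken extends Token {
        private final boolean exactCase;

        public FlaggedToken(String text, int start, int end, boolean exactCase) {
            super(text, start, end);
            this.exactCase = exactCase;
        }

        public boolean isExactCase() { return exactCase; }
    }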
I've also considered case-insensitive support at the Term-Enum level. It would make lookups slower, but the index wouldn't be much bigger (it would be slightly bigger because one would index everything w/o lowercasing).
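A naive sketch of what that Term-Enum-level lookup might look like: walk the terms of a field and compare case-insensitively. This is exactly the "slower lookups" trade-off Yonik mentions; a real implementation would want to seek rather than scan. Method and variable names are mine:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class CaseInsensitiveLookup {
        // Collect every term in the field whose text matches,
        // ignoring case, by scanning the TermEnum linearly.
        public static List findTerms(IndexReader reader, String field, String text)
                throws IOException {
            List matches = new ArrayList();
            TermEnum terms = reader.terms(new Term(field, ""));
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !t.field().equals(field)) break;
                    if (t.text().equalsIgnoreCase(text)) matches.add(t);
                } while (terms.next());
            } finally {
                terms.close();
            }
            return matches;
        }
    }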
I'm fairly certain I'm not the only person who's tried to mix case-sensitive searching into a query. What I'd love to have seen is a LuceneConfig.java file that said:

    // Not for everyone: this adds size and performance overhead that's
    // wasted if you aren't using the feature. Set to true to enable
    // mixed-case searching in a single field.
    boolean ENABLE_MIXEDCASE_SEARCHING = false;

Better yet, now that I'm thinking of it, this should actually be an attribute of the Field itself. That makes far more sense.
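Lucene does already offer part of that per-field knob, at the analysis level rather than on Field itself, via PerFieldAnalyzerWrapper. A sketch, with the field name assumed and StandardAnalyzer again standing in for the case-preserving analyzer:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class PerFieldSetup {
        public static Analyzer build() {
            // Everything defaults to plain lowercased analysis...
            PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new SimpleAnalyzer());
            // ...but fields opting in to mixed-case searching get the
            // analyzer that stacks the "$"-prefixed exact-case tokens.
            wrapper.addAnalyzer("author", new StandardAnalyzer());
            return wrapper;
        }
    }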
If you picked a token prefix/postfix that would pass through the QueryParser w/o a syntax error, the necessary manipulation could all be done in the Analyzer/TokenFilter. Much easier, but perhaps not as nice a syntax.
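To make that concrete: suppose the marker were a trailing "=" (purely an assumption; any character that both the QueryParser grammar and the tokenizer leave intact would do). A query-time filter could then rewrite the marked token into the internal "$"-prefixed, exact-case form, again under the old Token-returning API:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Query-time counterpart to the index-time filter above: "Walt="
    // becomes "$Walt" (an exact-case lookup against the stacked
    // tokens); everything else is lowercased as usual.
    public class CaseMarkerQueryFilter extends TokenFilter {
        public CaseMarkerQueryFilter(TokenStream in) { super(in); }

        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) return null;
            String text = t.termText();
            if (text.endsWith("=")) {
                String exact = "$" + text.substring(0, text.length() - 1);
                return new Token(exact, t.startOffset(), t.endOffset());
            }
            return new Token(text.toLowerCase(), t.startOffset(), t.endOffset());
        }
    }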
The problem I ran into with that suggestion was that the lexical analyzer seemed to strip any symbolic characters that were not part of legitimate Lucene query syntax, so my dollar-sign solution wouldn't have worked without the modifications to the parser anyhow. I also wasn't too hot on further overloading the meaning of a symbol. Thanks for the feedback. I hope that at least knowing it IS possible to do helps some poor soul.

-Walt Stoneburner, http://www.wwco.com/~wls/blog/