Yonik Seeley adds some wonderful observations:
Yeah, "compatible" analyzer would be a better way to put it. Using the same analyzer for anything that produces multiple tokens at the same position is normally wrong.
I came to the same conclusion the moment I realized that my query string was being transformed into tokens. What I wasn't 100% sure of, and what you've just confirmed, was whether using two different analyzers would break anything. It seems that not only does it not break anything, it's actually necessary.
Solr allows specification of a "query" analyzer and an "index" analyzer for these cases.
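For anyone following along in raw Lucene (Solr aside), the same split falls out naturally: hand one analyzer to the IndexWriter and a different, compatible one to the QueryParser. A minimal sketch against the Lucene API of that era; StandardAnalyzer stands in for the custom case-preserving analyzer (a sketch of the filter it would wrap appears further down), and the path and field name are made up:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class TwoAnalyzerExample {
        public static void main(String[] args) throws Exception {
            // Index-time analyzer: the one that emits the extra
            // case-preserved tokens (StandardAnalyzer is a stand-in).
            Analyzer indexAnalyzer = new StandardAnalyzer();
            IndexWriter writer = new IndexWriter("/tmp/demo-index", indexAnalyzer, true);
            // ... writer.addDocument(...) calls go here ...
            writer.close();

            // Query-time analyzer: compatible but different -- it maps
            // each query word to exactly one (lowercased) token.
            QueryParser parser = new QueryParser("contents", new SimpleAnalyzer());
            Query q = parser.parse("some query");
        }
    }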
I've had only a passing glance at Solr. It looked like an awesome replacement for anyone considering a Google appliance, and its scalability and additional features really appealed to me, but it felt more like a complete solution sealed with a pretty bow than an API library I could use from my own code. Is it worth a second look? And, if so, is it possible to easily capitalize on the Lucene modifications there without bringing in the other extras?
Yep, your [two token] approach sounds fine, and will work in phrase queries (which the two-field solution currently can't handle). The greater difficulty lies in making it generic (working for many analyzers, etc).
Since the index writer is still writing the same tokens the other analyzers do, the standard query mechanisms still work; the extra dollar-sign-prefixed, case-preserved tokens simply get ignored. It would be nice, as you point out, to be able to use them with other analyzers, though even with 20/20 hindsight I'm not sure how that could be accomplished without basic architectural trade-offs that would inadvertently hurt performance for those not interested in the feature.
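For the curious, that index-time trick fits in a small TokenFilter under the old, Token-returning TokenStream API (pre-2.9); the class name and the "$" convention are from my setup, nothing standard:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Emits each token lowercased, plus a "$"-prefixed, case-preserved
    // twin stacked at the same position (positionIncrement == 0), so
    // phrase queries and ordinary lowercase queries keep working.
    public class CasePreservingFilter extends TokenFilter {
        private Token pending; // exact-case twin awaiting emission

        public CasePreservingFilter(TokenStream in) { super(in); }

        public Token next() throws IOException {
            if (pending != null) {
                Token t = pending;
                pending = null;
                return t;
            }
            Token t = input.next();
            if (t == null) return null;
            String text = t.termText();
            String lower = text.toLowerCase();
            if (!lower.equals(text)) {
                // Stack the exact-case variant at the same position.
                pending = new Token("$" + text, t.startOffset(), t.endOffset());
                pending.setPositionIncrement(0);
            }
            Token lowered = new Token(lower, t.startOffset(), t.endOffset());
            lowered.setPositionIncrement(t.getPositionIncrement());
            return lowered;
        }
    }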
This points out the difficulty of doing this in a *generic* way. Better than a "$" would be a flag on the Token IMO. Not currently really supported by lucene, but you could perhaps subclass Token.
In fact, this is how I originally wanted to approach the problem. I was surprised there wasn't a flag field available to tinker with. I'll have to look into the subclassing solution, though I wonder whether it would always be my subclass that gets called to create a token. I haven't gone through the Lucene engine as a whole yet, only the parts immediately applicable to my short-term problem and solution. Still, this is a good idea; I'll look into it.

My biggest concern is that I can't just supply my own class or subclass for the parsing. At the moment I'm stuck replicating the Lucene .jj grammar, modifying it, and maintaining my own personal copy. Should the base copy of Lucene get updated, say when a new version is released, I have to hope the implementation didn't change, manually copy the code over, and patch it again, as opposed to subclassing some method or augmenting some data structure. The way I make myself feel better about this is to keep telling myself that Lucene is an API, that these classes are mere examples, and that I should be writing my own parser. It may not be true, but it helps me sleep at night.
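For what it's worth, the subclassing idea might look like the sketch below, assuming Token isn't final in the version at hand (this has varied across releases, hence Yonik's "perhaps"); the class and flag names are mine:

    import org.apache.lucene.analysis.Token;

    // Hypothetical Token subclass: carries the case-sensitivity bit as
    // a real flag instead of smuggling it through a "$" prefix. Any
    // consumer would need an instanceof check, which is exactly the
    // generic-support problem discussed above.
    public class FlaggedToken extends Token {
        private final boolean exactCase;

        public FlaggedToken(String text, int start, int end, boolean exactCase) {
            super(text, start, end);
            this.exactCase = exactCase;
        }

        public boolean isExactCase() { return exactCase; }
    }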
I've also considered case-insensitive support at the Term-Enum level. It would make lookups slower, but the index wouldn't be much bigger (it would be slightly bigger because one would index everything w/o lowercasing).
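A naive sketch of what that Term-Enum-level lookup might look like: walk the terms of a field and compare case-insensitively. This is exactly the "slower lookups" trade-off Yonik mentions; a real implementation would want to seek rather than scan. Method and variable names are mine:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class CaseInsensitiveLookup {
        // Collect every term in the field whose text matches,
        // ignoring case, by scanning the TermEnum linearly.
        public static List findTerms(IndexReader reader, String field, String text)
                throws IOException {
            List matches = new ArrayList();
            TermEnum terms = reader.terms(new Term(field, ""));
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !t.field().equals(field)) break;
                    if (t.text().equalsIgnoreCase(text)) matches.add(t);
                } while (terms.next());
            } finally {
                terms.close();
            }
            return matches;
        }
    }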
I'm fairly certain I'm not the only person who's tried to mix case-sensitive searching into a query. What I'd love to have seen is a LuceneConfig.java file that said:

    // Not for everyone: this adds size and performance overhead that's
    // wasted if you aren't using the feature. Set to true to enable
    // mixed-case searching in a single field.
    boolean ENABLE_MIXEDCASE_SEARCHING = false;

Better yet, now that I'm thinking of it, this should actually be an attribute of the Field itself. That makes far more sense.
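Lucene does already offer part of that per-field knob, at the analysis level rather than on Field itself, via PerFieldAnalyzerWrapper. A sketch, with the field name assumed and StandardAnalyzer again standing in for the case-preserving analyzer:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class PerFieldSetup {
        public static Analyzer build() {
            // Everything defaults to plain lowercased analysis...
            PerFieldAnalyzerWrapper wrapper =
                new PerFieldAnalyzerWrapper(new SimpleAnalyzer());
            // ...but fields opting in to mixed-case searching get the
            // analyzer that stacks the "$"-prefixed exact-case tokens.
            wrapper.addAnalyzer("author", new StandardAnalyzer());
            return wrapper;
        }
    }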
If you picked a token prefix/postfix that would pass through the QueryParser w/o a syntax error, the necessary manipulation could all be done in the Analyzer/TokenFilter. Much easier, but perhaps not as nice a syntax.
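To make that concrete: suppose the marker were a trailing "=" (purely an assumption; any character that both the QueryParser grammar and the tokenizer leave intact would do). A query-time filter could then rewrite the marked token into the internal "$"-prefixed, exact-case form, again under the old Token-returning API:

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Query-time counterpart to the index-time filter above: "Walt="
    // becomes "$Walt" (an exact-case lookup against the stacked
    // tokens); everything else is lowercased as usual.
    public class CaseMarkerQueryFilter extends TokenFilter {
        public CaseMarkerQueryFilter(TokenStream in) { super(in); }

        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) return null;
            String text = t.termText();
            if (text.endsWith("=")) {
                String exact = "$" + text.substring(0, text.length() - 1);
                return new Token(exact, t.startOffset(), t.endOffset());
            }
            return new Token(text.toLowerCase(), t.startOffset(), t.endOffset());
        }
    }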
The problem I ran into with that suggestion was that the lexical analyzer seemed to strip any symbolic characters that were not part of legitimate Lucene query syntax, so my dollar-sign solution wouldn't have worked without the modifications to the parser anyhow. I also wasn't too hot on further overloading the meaning of a symbol. Thanks for the feedback. I hope that at least knowing it IS possible to do helps some poor soul.

-Walt Stoneburner, http://www.wwco.com/~wls/blog/