Almost by definition, you have to write your own analyzer. This may be as simple as chaining another filter into one of the regular analyzers or as complex as defining your own grammar.
As far as I know, there's no "keep word" list, but it would be an interesting addition: an analyzer that takes not only a list of stop words but also a list of "keep words", that is, words that should NOT be massaged at all. I can imagine this would get pretty tricky for, say, StandardAnalyzer, but something like it in a chain such as WhitespaceTokenizer >> LowerCaseFilter >> KeepwordFilter might be useful...

All this is right off the top of my head without much thought, but....

Best
Erick

On Tue, Mar 4, 2008 at 2:22 PM, Donna L Gresh <[EMAIL PROTECTED]> wrote:

> I saw some discussion in the archives some time ago about the fact that
> C++ is tokenized as C in the StandardAnalyzer; this seems to still be the
> case; I was wondering if there is a simple way for me to get the behavior
> I want for C++ (that it is tokenized as C++) in particular, and perhaps
> for other more idiosyncratic terms I may have in my own application--
> Thanks
> Donna
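
To make the keep-word idea a bit more concrete, here is a rough sketch in plain Java. It deliberately does not use any Lucene classes: KeepWordSketch and analyze are hypothetical names invented for illustration, and the "massaging" step (lowercasing and stripping everything that is not a letter or digit) is only an assumed stand-in for whatever a real analyzer would do. The point is just the order of operations: split on whitespace, check the keep list first, and normalize only the tokens that are not on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is NOT a Lucene class or Lucene API.
final class KeepWordSketch {

    // Splits on whitespace, passes keep words through untouched,
    // and normalizes every other token.
    static List<String> analyze(String text, Set<String> keepWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // happens only when the input is blank
            }
            if (keepWords.contains(raw)) {
                tokens.add(raw); // keep word: no massaging at all
                continue;
            }
            // Assumed normalization: lowercase, then drop anything
            // that is not a letter or a digit.
            String normalized = raw.toLowerCase(Locale.ROOT)
                                   .replaceAll("[^\\p{L}\\p{N}]", "");
            if (!normalized.isEmpty()) {
                tokens.add(normalized);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // With "C++" on the keep list it survives intact.
        System.out.println(analyze("I write C++ and Java.", Set.of("C++")));
        // prints [i, write, C++, and, java]

        // Without the keep list it collapses to "c".
        System.out.println(analyze("I write C++ and Java.", Set.of()));
        // prints [i, write, c, and, java]
    }
}
```

Note that the keep-list check here is an exact match on the whitespace-delimited token, so "C++," with a trailing comma would not be recognized. In a real analyzer chain the same logic would presumably live in a custom filter, and handling cases like that is where it starts to get tricky.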