Hi Paul,

On 10/18/2011 at 4:57 AM, Paul Taylor wrote:
> On 18/10/2011 06:19, Steven A Rowe wrote:
> > Another option is to create a char filter that substitutes
> > PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods,
> > etc.,
>
> Yes that is how I first did it
No, I don't think you did. When I say "char filter" I'm referring to CharFilter <http://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/analysis/CharFilter.html> - this is a different kind of thing from the token filter approach you described taking previously.

Lucene Analyzers may be composed of three different kinds of components:

* CharFilter: character-level filter; precedes the tokenizer; allows for
  character stream modifications while enabling original character offsets
  to be maintained (to enable e.g. highlighting). Input: character stream;
  output: character stream. An analyzer may contain zero or more of these.
* Tokenizer: identifies character sequences that will serve as (the basis
  of) indexable tokens. Input: character stream; output: token stream. An
  analyzer must contain exactly one of these.
* TokenFilter: token-level filter; follows the Tokenizer; transforms, adds
  and/or removes tokens to/from the token stream. Input: token stream;
  output: token stream. An analyzer may contain zero or more of these.

> > but only when the entire input consists exclusively of whitespace and
> > punctuation.
>
> but I couldn't work out how to only do it when exclusively whitespace and
> punctuation, any ideas to solve that?

If you go with a CharFilter, you can give it access to the entire input at once, and use a regular expression (or something like it) to assess the input and then behave accordingly.

Steve
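
P.S. Here is a rough sketch of the "read the entire input, then decide" idea, in plain Java with no Lucene dependency. The class name PunctuationOnlySubstituter and its methods wrap() and rewrite() are made up for illustration; they are not Lucene APIs. A real CharFilter would also have to record offset corrections for the inserted text, which this sketch leaves out entirely. It also assumes "punctuation" means ASCII punctuation (\p{Punct}), which may not be what you need.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Lucene: it only shows the decision logic
// that a CharFilter subclass could delegate to. Offset correction is omitted.
class PunctuationOnlySubstituter {

    // Matches input consisting solely of whitespace and ASCII punctuation
    // (assumption: ASCII punctuation is enough; the empty string also matches).
    private static final Pattern PUNCT_OR_SPACE_ONLY =
            Pattern.compile("[\\s\\p{Punct}]*");

    // Reads the whole character stream into memory, then returns a Reader
    // over the (possibly rewritten) text for the tokenizer to consume.
    static Reader wrap(Reader in) throws IOException {
        StringBuilder buffer = new StringBuilder();
        char[] chunk = new char[1024];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.append(chunk, 0, n);
        }
        return new StringReader(rewrite(buffer.toString()));
    }

    // Substitutes placeholder words only when the entire input is
    // whitespace and punctuation; any other input is returned unchanged.
    static String rewrite(String text) {
        if (!PUNCT_OR_SPACE_ONLY.matcher(text).matches()) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                // Padded with spaces so adjacent placeholders stay separate.
                case '!': out.append(" PUNCT-EXCLAMATION "); break;
                case '.': out.append(" PUNCT-PERIOD "); break;
                // Unmapped punctuation and whitespace pass through as-is.
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, an input of "!" is rewritten to " PUNCT-EXCLAMATION ", while "Hello!" passes through unchanged because it contains characters other than whitespace and punctuation.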