RE: How do you see if a tokenstream has tokens without consuming the tokens ?

Steven A Rowe Tue, 18 Oct 2011 07:26:18 -0700

Hi Paul,

On 10/18/2011 at 4:57 AM, Paul Taylor wrote:
> On 18/10/2011 06:19, Steven A Rowe wrote:
> > Another option is to create a char filter that substitutes
> > PUNCT-EXCLAMATION for exclamation points, PUNCT-PERIOD for periods,
> > etc.,
> 
> Yes that is how I first did it


No, I don't think you did.  When I say "char filter" I'm referring to
CharFilter
<http://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/analysis/CharFilter.html>,
which is a different kind of thing from the token filter approach you
described taking previously.

Lucene Analyzers may be composed of three different kinds of components: 

* CharFilter: character-level filter; precedes the tokenizer; allows for 
character stream modifications while enabling original character offsets to be 
maintained (to enable e.g. highlighting).  Input: character stream; output: 
character stream.  An analyzer may contain zero or more of these.

* Tokenizer: identifies character sequences that will serve as (the basis of) 
indexable tokens.  Input: character stream; output: token stream. An analyzer 
must contain exactly one of these.

* TokenFilter: token-level filter; follows the Tokenizer; transforms, adds 
and/or removes tokens to/from the token stream.  Input: token stream; output: 
token stream.  An analyzer may contain zero or more of these.
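
To make the ordering of the three stages concrete, here is a toy model in
plain Java.  It does not use any Lucene classes: ToyAnalyzer, its constructor
and example() are hypothetical names invented for illustration, strings stand
in for character streams, and lists of strings stand in for token streams.  It
is a sketch of the shape of the pipeline only, not of how Lucene implements it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical toy model of the pipeline; not part of Lucene.
public class ToyAnalyzer {
    // Zero or more character-level stages: characters in, characters out.
    private final List<Function<String, String>> charFilters;
    // Exactly one tokenizer: characters in, tokens out.
    private final Function<String, List<String>> tokenizer;
    // Zero or more token-level stages: tokens in, tokens out.
    private final List<Function<List<String>, List<String>>> tokenFilters;

    public ToyAnalyzer(List<Function<String, String>> charFilters,
                       Function<String, List<String>> tokenizer,
                       List<Function<List<String>, List<String>>> tokenFilters) {
        this.charFilters = charFilters;
        this.tokenizer = tokenizer;
        this.tokenFilters = tokenFilters;
    }

    public List<String> analyze(String input) {
        String chars = input;
        for (Function<String, String> f : charFilters) {
            chars = f.apply(chars);          // runs before tokenization
        }
        List<String> tokens = tokenizer.apply(chars);
        for (Function<List<String>, List<String>> f : tokenFilters) {
            tokens = f.apply(tokens);        // runs after tokenization
        }
        return tokens;
    }

    // One char filter, a whitespace tokenizer, and one lowercasing token filter.
    public static ToyAnalyzer example() {
        Function<String, String> ampersand = s -> s.replace("&", " and ");
        Function<String, List<String>> whitespace = s -> {
            List<String> out = new ArrayList<>();
            for (String piece : s.trim().split("\\s+")) {
                if (!piece.isEmpty()) {
                    out.add(piece);
                }
            }
            return out;
        };
        Function<List<String>, List<String>> lowercase = tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
            return out;
        };
        return new ToyAnalyzer(Collections.singletonList(ampersand),
                               whitespace,
                               Collections.singletonList(lowercase));
    }
}
```

The point of the toy is only that a char filter sees raw characters before
any token boundaries exist, while a token filter sees tokens the tokenizer has
already produced.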

> > but only when the entire input consists exclusively of whitespace and
> > punctuation.
> 
> but I couldn't work out how to only do it when exclusively whitespace and
> punctuation, any ideas to solve that?

If you go with a CharFilter, you can give it access to the entire input at 
once, and use a regular expression (or something like it) to assess the input 
and then behave accordingly.
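
As a rough sketch of that idea, using only java.io and java.util.regex: the
class below reads the whole input, checks it against a regular expression, and
substitutes only when the input is nothing but whitespace and punctuation.
PunctuationOnlyRewriter, rewrite() and wrap() are hypothetical names, the
choice to pad each replacement with spaces is my assumption (so that a
tokenizer would see separate tokens), and only the two substitutions named
above are handled.  It does not extend Lucene's CharFilter and does no offset
correction, which a real CharFilter would have to take care of.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Pattern;

// Hypothetical helper; a sketch of the approach, not a Lucene CharFilter.
public class PunctuationOnlyRewriter {
    // Matches input made up exclusively of ASCII punctuation and whitespace.
    private static final Pattern PUNCT_AND_SPACE_ONLY =
            Pattern.compile("[\\p{Punct}\\s]+");

    // Returns the input unchanged unless it is entirely punctuation/whitespace.
    public static String rewrite(String input) {
        if (!PUNCT_AND_SPACE_ONLY.matcher(input).matches()) {
            return input;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '!') {
                out.append(" PUNCT-EXCLAMATION ");   // padding is an assumption
            } else if (c == '.') {
                out.append(" PUNCT-PERIOD ");
            } else {
                out.append(c);                       // other characters pass through
            }
        }
        return out.toString();
    }

    // Reads the entire input up front, then exposes the rewritten text as a Reader.
    public static Reader wrap(Reader in) throws IOException {
        StringBuilder all = new StringBuilder();
        char[] buffer = new char[1024];
        int n;
        while ((n = in.read(buffer)) != -1) {
            all.append(buffer, 0, n);
        }
        return new StringReader(rewrite(all.toString()));
    }
}
```

With this, wrap(new StringReader("Hello!")) hands back the text untouched,
while an input that is only punctuation and whitespace comes back with the
placeholder words in place of the exclamation points and periods.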

Steve
