--- On Sat, 9/12/09, Paul Taylor <[email protected]> wrote:

> From: Paul Taylor <[email protected]>
> Subject: Filter before tokenize ?
> To: [email protected]
> Date: Saturday, September 12, 2009, 9:39 PM
>
> Is it possible to filter before
> tokenize, or is that not a good idea.
> I want to convert '&' to 'and' , so they are dealt with
> the same way, but the StandardTokenizer I am using removes
> the &, I could change the tokenizer but because
> I'm not too clear on jflex syntax it would seem easier to
> just apply a CharFilter before tokenizing, but is that
> possible

Maybe you can use WhitespaceTokenizer, which won't remove the &? Why are ampersands (&) important for you? Do you need to search for them? Could replacing the &'s before indexing (by preprocessing the text) be an option?

Filtering before the tokenizer can be simulated by using:

1) KeywordTokenizer (emits the entire input as a single token)
2) Your CharFilter
3) A token filter that tokenizes the input token's text using StandardTokenizer

But I think this is not a good idea.

Hope this helps.
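
For the preprocessing option, here is a minimal sketch in plain Java of what I mean. It does not use any Lucene API. The class name AmpersandPreprocessor and the method normalize are made up for illustration, and I am assuming you can get at the field text as a String before it is handed to the analyzer.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Lucene): rewrites '&' to "and" before analysis. */
final class AmpersandPreprocessor {
    // Matches an ampersand together with any whitespace around it.
    private static final Pattern AMPERSAND = Pattern.compile("\\s*&\\s*");

    private AmpersandPreprocessor() {}

    /** Returns the text with every '&' replaced by the word "and", set off by single spaces. */
    static String normalize(String text) {
        return AMPERSAND.matcher(text).replaceAll(" and ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Tom & Jerry")); // Tom and Jerry
        System.out.println(normalize("R&B"));         // R and B
    }
}
```

You would have to run the same normalization on the query text as well, otherwise a search for "R&B" still loses the & at query time. Also note that a plain string replacement like this changes character offsets relative to the original text, which may matter if you highlight matches.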
