List,
I have written my own CustomAnalyzer, as follows:
public TokenStream tokenStream(String fieldName, Reader reader) {
    // TODO: add calls to RemovePunctuation and SplitIdentifiers here

    // First, tokenize on non-letters and convert to lower case
    TokenStream out = new LowerCaseTokenizer(reader);

    if (this.doStopping) {
        out = new StopFilter(true, out, customStopSet);
    }

    if (this.doStemming) {
        out = new PorterStemFilter(out);
    }

    return out;
}
What I need to do is write two custom filters that do the following:
- RemovePunctuation() replaces every character other than [a-zA-Z0-9] with
a space, preserving case. E.g.,
"foo=bar*45;" ==> "foo bar 45"
"fooBar" ==> "fooBar"
"\"sthomas@cs.queensu.ca\"" ==> "sthomas cs queensu ca"
- SplitIdentifiers() breaks up words based on camelCase notation:
"fooBar" ==> "foo Bar"
"ABCCompany" ==> "ABC Company"
(I have the regex for this.)
Note this step must be performed before LowerCaseTokenizer, because we
need case information to do the splitting.
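
To make the intended behavior concrete, the two transformations can be sketched on plain strings as below. This is only a rough illustration, not Lucene filter code: the class and method names are hypothetical, the camelCase regex is an assumed one (not necessarily the one mentioned above), and digits are assumed to be kept, as in the "foo=bar*45;" example.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: it shows only the string
// transformations, not how to wire them into a TokenStream.
final class IdentifierText {
    // Assumption: digits are kept, as in the "foo=bar*45;" example.
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-zA-Z0-9]+");

    // Assumed camelCase boundary: a lower-case letter or digit followed by
    // an upper-case letter, or the end of an acronym followed by a
    // capitalized word (the position between "ABC" and "Company").
    private static final Pattern CAMEL_BOUNDARY =
        Pattern.compile("(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    // Replace each run of other characters with one space; case is preserved.
    static String removePunctuation(String s) {
        return NON_ALNUM.matcher(s).replaceAll(" ").trim();
    }

    // Insert a space at every camelCase boundary; case is preserved.
    static String splitIdentifiers(String s) {
        return CAMEL_BOUNDARY.matcher(s).replaceAll(" ");
    }

    // Lower-casing comes last, because splitting needs the case information.
    static String normalize(String s) {
        return splitIdentifiers(removePunctuation(s)).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(removePunctuation("foo=bar*45;"));  // foo bar 45
        System.out.println(splitIdentifiers("ABCCompany"));    // ABC Company
        System.out.println(normalize("fooBar=ABCCompany;"));   // foo bar abc company
    }
}
```

The order inside normalize() mirrors the constraint in the note above: punctuation removal and identifier splitting run while the original case is still available, and lower-casing happens only afterwards.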
How can I write custom filters, and how do I call them before
LowerCaseTokenizer()?
Thanks in advance,
Steve