Custom Filter for Splitting CamelCase?

Stephen Thomas Tue, 29 Nov 2011 08:20:30 -0800

List,

I have written my own CustomAnalyzer, as follows:


public TokenStream tokenStream(String fieldName, Reader reader) {

    // TODO: add calls to RemovePunctuation and SplitIdentifiers here

    // First, tokenize on non-letters and convert to lower case
    TokenStream out = new LowerCaseTokenizer(reader);

    if (this.doStopping) {
        out = new StopFilter(true, out, customStopSet);
    }

    if (this.doStemming) {
        out = new PorterStemFilter(out);
    }

    return out;
}



What I need to do is write two custom filters that do the following:

- RemovePunctuation() replaces all characters except [a-zA-Z0-9] with a
space, preserving case. E.g.,

"foo=bar*45;" ==> "foo bar 45"
"fooBar" ==> "fooBar"
"\"[email protected]\"" ==> "sthomas cs queensu ca"


- SplitIdentifiers() breaks up words based on camelCase notation:

"fooBar" ==> "foo Bar"
"ABCCompany" ==> "ABC Company"

(I have the regex for this.)

Note that this step must be performed before LowerCaseTokenizer, because we
need case information to do the splitting.
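
To make the intended behaviour concrete, here is a minimal plain-Java sketch of the two string transformations and their ordering, outside of any Lucene TokenStream. The class and method names (IdentifierText, removePunctuation, splitIdentifiers, normalize) are hypothetical, the camelCase regex is only an assumption and not necessarily the one mentioned above, and digits are kept because the "foo=bar*45;" example keeps them.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper class; it illustrates the string-level rules only
// and is not a Lucene TokenFilter.
final class IdentifierText {

    // Assumption: anything other than an ASCII letter or digit is punctuation.
    private static final Pattern NON_ALNUM = Pattern.compile("[^A-Za-z0-9]+");

    // Zero-width camelCase boundaries:
    //   lower followed by upper ("foo|Bar"), and
    //   the end of an acronym ("ABC|Company").
    private static final Pattern CAMEL_BOUNDARY =
            Pattern.compile("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    private IdentifierText() {
    }

    // Replaces each run of punctuation with a single space; case is preserved.
    static String removePunctuation(String text) {
        return NON_ALNUM.matcher(text).replaceAll(" ").trim();
    }

    // Inserts a space at every camelCase boundary; case is preserved.
    static String splitIdentifiers(String text) {
        return CAMEL_BOUNDARY.matcher(text).replaceAll(" ");
    }

    // Lower-casing comes last, because splitting needs the case information.
    static String normalize(String text) {
        return splitIdentifiers(removePunctuation(text)).toLowerCase(Locale.ROOT);
    }
}
```

With this sketch, IdentifierText.splitIdentifiers("ABCCompany") gives "ABC Company" and IdentifierText.normalize("fooBar") gives "foo bar". Lower-casing first would erase the boundaries that the regex relies on.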


How can I write custom filters, and how do I call them before
LowerCaseTokenizer()?


Thanks in advance,
Steve
