Hi,

Solr has a WordDelimiterFilter that does this kind of splitting; it was also
ported to the Lucene analysis module in Lucene trunk (4.0). In 3.x you can
still add solr.jar to your classpath and use WordDelimiterFilterFactory to
create one (WordDelimiterFilter itself is package-private).
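
To make the target behavior concrete, here is a rough plain-Java sketch of
the two transformations from the question, independent of Lucene. The class
and method names (IdentifierSplitter, removePunctuation, splitIdentifiers)
are made up for illustration, the regex is only one possible choice, and it
assumes digits should be kept, as in the "foo=bar*45;" example. It is not how
WordDelimiterFilter is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or Solr.
public class IdentifierSplitter {

    // Zero-width split points:
    //  - between a lowercase and an uppercase letter ("foo|Bar")
    //  - between an uppercase run and a capitalized word ("ABC|Company")
    private static final String CAMEL_BOUNDARY =
            "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    /** Replaces each run of non-alphanumeric characters with one space, preserving case. */
    public static String removePunctuation(String text) {
        // Assumption: digits are kept, matching the "foo bar 45" example.
        return text.replaceAll("[^A-Za-z0-9]+", " ").trim();
    }

    /** Strips punctuation, then splits each remaining word at camelCase boundaries. */
    public static List<String> splitIdentifiers(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : removePunctuation(text).split(" ")) {
            if (word.isEmpty()) {
                continue; // input contained no letters or digits
            }
            for (String part : word.split(CAMEL_BOUNDARY)) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(removePunctuation("foo=bar*45;")); // foo bar 45
        System.out.println(splitIdentifiers("fooBar"));       // [foo, Bar]
        System.out.println(splitIdentifiers("ABCCompany"));   // [ABC, Company]
    }
}
```

Inside an analyzer, the same logic would have to run before any lowercasing
step, since the split points depend on case information.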

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]


> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Stephen Thomas
> Sent: Tuesday, November 29, 2011 5:20 PM
> To: [email protected]
> Subject: Custom Filter for Splitting CamelCase?
> 
> List,
> 
> I have written my own CustomAnalyzer, as follows:
> 
> public TokenStream tokenStream(String fieldName, Reader reader) {
> 
>               // TODO: add calls to RemovePunctuation and SplitIdentifiers here
> 
>               // First, convert to lower case
>               TokenStream out = new  LowerCaseTokenizer(reader);
> 
>               if (this.doStopping){
>                       out = new StopFilter(true, out, customStopSet);
>               }
> 
>               if (this.doStemming){
>                       out = new PorterStemFilter(out);
>               }
> 
>               return out;
>         }
> 
> 
> 
> What I need to do is write two custom filters that do the following:
> 
> - RemovePunctuation() removes all characters except [a-zA-Z], preserving
> case.
> E.g.,
> 
> "foo=bar*45;" ==> "foo bar 45"
> "fooBar" ==> "fooBar"
> "\"sthomas@cs.queensu.ca\"" ==> "sthomas cs queensu ca"
> 
> 
> - SplitIdentifiers() breaks up words based on camelCase notation:
> 
> "fooBar" ==> "foo Bar"
> "ABCCompany" ==> "ABC Company"
> 
> (I have the regex for this.)
> 
> Note this step must be performed before LowerCaseTokenizer, because we
> need case information to do the splitting.
> 
> 
> How can I write custom filters, and how do I call them before
> LowerCaseTokenizer()?
> 
> 
> Thanks in advance,
> Steve
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]

