Re: Custom Filter for Splitting CamelCase?

Stephen Thomas Tue, 29 Nov 2011 10:39:41 -0800

How do you use the WordDelimiterFilterFactory()? I tried the following code:



TokenStream out = new LowerCaseTokenizer(reader);
WordDelimiterFilterFactory wdf = new WordDelimiterFilterFactory();
out = wdf.create(out);
...

But I am getting a runtime error:

Exception in thread "main" java.lang.AbstractMethodError: org.apache.lucene.analysis.TokenStream.incrementToken()Z
        at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:141)
        at org.apache.lucene.analysis.PorterStemFilter.incrementToken(PorterStemFilter.java:54)
        ...

I can't instantiate WordDelimiterFilter directly, because the class is
not public (it is package-private).

Any ideas?

Thanks,
Steve




On Tue, Nov 29, 2011 at 12:44 PM, Uwe Schindler <[email protected]> wrote:
> Hi,
>
> There is a WordDelimiterFilter in Solr that was also ported to the Lucene
> analysis module in Lucene trunk (4.0). In 3.x you can still add solr.jar to
> your classpath and use WordDelimiterFilterFactory to produce one
> (WordDelimiterFilter itself is package-private).
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Stephen Thomas
>> Sent: Tuesday, November 29, 2011 5:20 PM
>> To: [email protected]
>> Subject: Custom Filter for Splitting CamelCase?
>>
>> List,
>>
>> I have written my own CustomAnalyzer, as follows:
>>
>> public TokenStream tokenStream(String fieldName, Reader reader) {
>>
>>               // TODO: add calls to RemovePunctuation and SplitIdentifiers here
>>
>>               // First, convert to lower case
>>               TokenStream out = new  LowerCaseTokenizer(reader);
>>
>>               if (this.doStopping){
>>                       out = new StopFilter(true, out, customStopSet);
>>               }
>>
>>               if (this.doStemming){
>>                       out = new PorterStemFilter(out);
>>               }
>>
>>               return out;
>>         }
>>
>>
>>
>> What I need to do is write two custom filters that do the following:
>>
>> - RemovePunctuation() replaces every character outside [a-zA-Z0-9] with a
>> space, preserving case. E.g.,
>>
>> "foo=bar*45;" ==> "foo bar 45"
>> "fooBar" ==> "fooBar"
>> "\"sthomas@cs.queensu.ca\"" ==> "sthomas cs queensu ca"
>>
>>
>> - SplitIdentifiers() breaks up words based on camelCase notation:
>>
>> "fooBar" ==> "foo Bar"
>> "ABCCompany" ==> "ABC Company"
>>
>> (I have the regex for this.)
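
The two transformations listed above can be sketched at the string level with plain java.util.regex, independent of Lucene. This is only an illustration: the class and method names are hypothetical, the camelCase pattern is one possible choice and not necessarily the regex the poster has, and digits are assumed to be kept because the "foo bar 45" example keeps them.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Lucene or Solr.
final class IdentifierText {
    // Any run of characters other than ASCII letters and digits (assumption:
    // digits are kept, as in the "foo bar 45" example).
    private static final Pattern NON_ALNUM = Pattern.compile("[^A-Za-z0-9]+");

    // Zero-width boundaries: a lowercase letter or digit followed by an
    // uppercase letter ("foo|Bar"), or an uppercase letter followed by an
    // uppercase+lowercase pair ("ABC|Company").
    private static final Pattern CAMEL_BOUNDARY = Pattern.compile(
            "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    // Replaces each run of punctuation with a single space; case is untouched.
    static String removePunctuation(String s) {
        return NON_ALNUM.matcher(s).replaceAll(" ").trim();
    }

    // Inserts a space at every camelCase boundary; case is untouched.
    static String splitIdentifiers(String s) {
        return CAMEL_BOUNDARY.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(removePunctuation("foo=bar*45;"));
        System.out.println(splitIdentifiers("fooBar"));
        System.out.println(splitIdentifiers("ABCCompany"));
    }
}
```

Inside an analyzer the same logic would have to operate on tokens rather than whole strings, so treat this purely as a statement of the intended input/output behaviour.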
>>
>> Note this step must be performed before LowerCaseTokenizer, because we
>> need case information to do the splitting.
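
The ordering constraint can be seen in a small stand-alone sketch. The class, the method names and the boundary regex are assumptions made for illustration; the point is only that lowercasing first removes the information the split relies on.

```java
import java.util.Locale;

// Hypothetical demo class; not part of Lucene or Solr.
final class SplitOrderDemo {
    // Inserts a space at camelCase boundaries (assumed pattern).
    static String splitCamelCase(String s) {
        return s.replaceAll("(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ");
    }

    // Intended order: split while case is still available, then lowercase.
    static String splitThenLower(String s) {
        return splitCamelCase(s).toLowerCase(Locale.ROOT);
    }

    // Wrong order: after lowercasing there are no boundaries left to find.
    static String lowerThenSplit(String s) {
        return splitCamelCase(s.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(splitThenLower("fooBar")); // foo bar
        System.out.println(lowerThenSplit("fooBar")); // foobar
    }
}
```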
>>
>>
>> How can I write custom filters, and how do I call them before
>> LowerCaseTokenizer()?
>>
>>
>> Thanks in advance,
>> Steve
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>
>
