RE: How to disable LowerCaseFilter when using SnowballAnalyzer in Lucene 3.0.2

Uwe Schindler Mon, 10 Nov 2014 06:45:00 -0800

Hi,

> Uwe
> 
> Thanks for the reply. Given that SnowballAnalyzer is made up of a series of
> filters, I was thinking about something like this where I 'pipe' output from
> one filter to the next:
> 
> standardTokenizer = new StandardTokenizer(...);
> standardFilter = new StandardFilter(standardTokenizer, ...);
> stopFilter = new StopFilter(standardFilter, ...);
> snowballFilter = new SnowballFilter(stopFilter, ...);
> 
> But ignore LowerCaseFilter. Does this make sense?


Exactly. Create a clone of SnowballAnalyzer (from the Lucene source package) in
your own package and remove LowerCaseFilter. But be aware that the Snowball
stemmer may need lowercased terms to stem correctly. I don't know this filter
well; I just want to make you aware of the risk.

The same applies to the stop filter, but that one lets you handle it: make the
StopFilter case-insensitive (there is a boolean constructor argument for this):
StopFilter(boolean enablePositionIncrements, TokenStream input,
           Set<?> stopWords, boolean ignoreCase)
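
To illustrate what case-insensitive stop-word matching means for your use case,
here is a minimal plain-Java sketch. It is not Lucene code: the class
CaseKeepingStopDemo and the method removeStopWords are made-up names for this
example, and it assumes the stop words themselves are stored in lower case.
Tokens are compared in lower case, but the surviving tokens are passed on with
their original case intact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; this is NOT Lucene's StopFilter.
class CaseKeepingStopDemo {

    // Drops every token whose lower-cased form is in stopWords
    // (stopWords is assumed to contain lower-case entries).
    // Surviving tokens are returned unchanged, so their case is preserved.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<String>(Arrays.asList("the", "and"));
        List<String> tokens = Arrays.asList("The", "Quick", "and", "THE", "Fox");
        System.out.println(removeStopWords(tokens, stopWords));
    }
}
```

Running main prints [Quick, Fox]: "The" and "THE" are removed even though only
"the" is in the stop set, while "Quick" and "Fox" keep their capitals.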

Uwe

> Martin O'Shea.
> -----Original Message-----
> From: Uwe Schindler [mailto:[email protected]]
> Sent: 10 Nov 2014 14 06
> To: [email protected]
> Subject: RE: How to disable LowerCaseFilter when using SnowballAnalyzer in
> Lucene 3.0.2
> 
> Hi,
> 
> In general, you cannot change Analyzers; they are "examples" and can be
> seen as "best practice". If you want to modify one, write your own Analyzer
> subclass that uses the Tokenizers and TokenFilters you want. You can, for
> example, clone the source code of the original and remove LowerCaseFilter.
> Analyzers are very simple. There is no logic in them, just some
> "configuration" (which Tokenizer and which TokenFilters). In later Lucene 3
> releases and in Lucene 4, this is very simple: you just need to override
> createComponents in the Analyzer class and add your "configuration" there.
> 
> If you use Apache Solr or Elasticsearch you can create your analyzers by XML
> or JSON configuration.
> 
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
> 
> 
> > -----Original Message-----
> > From: Martin O'Shea [mailto:[email protected]]
> > Sent: Monday, November 10, 2014 2:54 PM
> > To: [email protected]
> > Subject: How to disable LowerCaseFilter when using SnowballAnalyzer in
> > Lucene 3.0.2
> >
> > I realise that 3.0.2 is an old version of Lucene, but if I have Java
> > code as follows:
> >
> > int nGramLength = 3;
> > Set<String> stopWords = new HashSet<String>();
> > stopWords.add("the");
> > stopWords.add("and");
> > ...
> > SnowballAnalyzer snowballAnalyzer =
> >     new SnowballAnalyzer(Version.LUCENE_30, "English", stopWords);
> > ShingleAnalyzerWrapper shingleAnalyzer =
> >     new ShingleAnalyzerWrapper(snowballAnalyzer, nGramLength);
> >
> > Which will generate the frequency of ngrams from a particular string
> > of text without stop words, how can I disable the LowerCaseFilter
> > which forms part of the SnowballAnalyzer? I want to preserve the case
> > of the ngrams generated so that I can perform various counts according
> > to the presence / absence of upper case characters in the ngrams.
> >
> >
> >
> > I am something of a Lucene newbie. And I should add that upgrading the
> > version of Lucene is not an option here.
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
