RE: Analysers for newspaper pages...

Steven A Rowe Mon, 28 Nov 2011 11:44:02 -0800

Hi Dawn,

I assume that when you refer to "the impact of stop words," you're concerned 
about query-time performance?  You should consider the possibility that 
performance without removing stop words is good enough that you won't have to 
take any steps to address the issue.


That said, there are two filters in Solr 3.X[1] that would do the equivalent of 
what you have outlined: CommonGramsFilter 
<http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGramsFilter.html>
 and CommonGramsQueryFilter 
<http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGramsQueryFilter.html>.

You can use these filters with a Lucene 3.X application by including the 
(same-versioned) solr-core jar as a dependency.
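
To make the index-time idea concrete, here is a rough Python sketch of what
"common grams" means. It is only an illustration, not the Lucene/Solr
implementation: the function name common_grams, the "_" separator, and the
word list are assumptions made for the example, and it ignores the token
types and position increments that the real filters manage.

```python
def common_grams(tokens, common_words, separator="_"):
    """Emit every token, plus a bigram for each adjacent pair in which
    at least one of the two words is in the common-word set.

    Simplified illustration only: no positions, offsets, or token types.
    """
    common = {w.lower() for w in common_words}
    output = []
    for i, token in enumerate(tokens):
        # The original token is always kept.
        output.append(token)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            # Pair the token with its successor when either one is common.
            if token.lower() in common or nxt.lower() in common:
                output.append(token + separator + nxt)
    return output


if __name__ == "__main__":
    # Hypothetical common-word list chosen for the example.
    stop_like = {"to", "be", "or", "not", "the"}
    print(common_grams("to be or not to be".split(), stop_like))
    print(common_grams("the quick brown fox".split(), stop_like))
```

The point of the extra bigram terms is that a phrase query such as "to be"
can be answered from the much rarer combined term rather than from the two
very frequent single-word terms, while the single words stay searchable.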

Steve

[1] In Lucene/Solr trunk, which will be released as 4.0, these filters have 
been moved to a shared Lucene/Solr module.

> -----Original Message-----
> From: Dawn Zoë Raison [mailto:[email protected]]
> Sent: Monday, November 28, 2011 2:10 PM
> To: [email protected]
> Subject: Analysers for newspaper pages...
> 
> Hi folks,
> 
> I'm researching the best options to use for analysing/storing newspaper
> pages in our online archive, and wondered if anyone has any good hints
> or tips on good practice for this type of media?
> 
> I'm currently thinking along the lines of using a customised
> StandardAnalyzer (no stop words + extra date token detection) wrapped
> with a Shingle filter and finally a Stopword filter - the thinking being
> this should reduce the impact of stop words but still allow "to be or
> not to be" searches...
> 
> A future aim is to add a synonym filter at search time.
> 
> We currently have ~2.5 million pages - some of the older broadsheet pages
> can have a serious number of tokens.
> We currently index using the SimpleAnalyzer - a hangover from the
> previous developers I hope to remedy :-).
> 
> --
> 
> Rgds.
> *Dawn Raison*
> Technical Director, Digitorial Ltd.
>
