Hi Dawn, I assume that when you refer to "the impact of stop words," you're concerned about query-time performance? Before taking any steps, consider that performance with stop words left in the index may already be good enough that you won't need to address the issue at all.

That said, there are two filters in Solr 3.X[1] that would do the equivalent of what you have outlined: CommonGramsFilter <http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGramsFilter.html> and CommonGramsQueryFilter <http://lucene.apache.org/solr/api/org/apache/solr/analysis/CommonGramsQueryFilter.html>. You can use these filters with a Lucene 3.X application by including the (same-versioned) solr-core jar as a dependency.

Steve

[1] In Lucene/Solr trunk, which will be released as 4.0, these filters have been moved to a shared Lucene/Solr module.

> -----Original Message-----
> From: Dawn Zoë Raison [mailto:d...@digitorial.co.uk]
> Sent: Monday, November 28, 2011 2:10 PM
> To: java-user@lucene.apache.org
> Subject: Analysers for newspaper pages...
>
> Hi folks,
>
> I'm researching the best options to use for analysing/storing newspaper
> pages in our online archive, and wondered if anyone has any good hints
> or tips on good practice for this type of media?
>
> I'm currently thinking along the lines of using a customised
> StandardAnalyser (no stop words + extra date token detection) wrapped
> with a Shingle filter and finally a Stopword filter - the thinking being
> this should reduce the impact of stop words but still allow "to be or
> not to be" searches...
>
> A future aim is to add a synonym filter at search time.
>
> We currently have ~2.5million pages - some of the older broadsheet pages
> can have a serious number of tokens.
> We currently index using the SimpleAnalyser - a hangover from the
> previous developers I hope to remedy :-).
>
> --
>
> Rgds.
> *Dawn Raison*
> Technical Director, Digitorial Ltd.
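
To make the common-grams idea behind the two filters named in the reply concrete, here is a plain-Java sketch of the token streams they are meant to produce. It does not use the Solr classes at all: the class name CommonGramsSketch, its two methods, and the "_" separator are assumptions chosen for illustration, and details of the real filters such as position increments are not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration class; not part of Lucene or Solr. */
public class CommonGramsSketch {

    // True if tokens[i] and tokens[i + 1] both exist and at least one of
    // them is a common word, so the pair should be joined into a bigram.
    private static boolean formsGram(List<String> tokens, int i, Set<String> common) {
        return i >= 0 && i + 1 < tokens.size()
                && (common.contains(tokens.get(i)) || common.contains(tokens.get(i + 1)));
    }

    // Index side: keep every word, and add a bigram wherever a common
    // word is adjacent to another word.
    public static List<String> indexTokens(List<String> tokens, Set<String> common) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (formsGram(tokens, i, common)) {
                out.add(tokens.get(i) + "_" + tokens.get(i + 1));
            }
        }
        return out;
    }

    // Query side: emit the bigrams, and keep a single word only when it
    // is not already covered by a bigram on either side.
    public static List<String> queryTokens(List<String> tokens, Set<String> common) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            if (formsGram(tokens, i, common)) {
                out.add(tokens.get(i) + "_" + tokens.get(i + 1));
            } else if (!formsGram(tokens, i - 1, common)) {
                out.add(tokens.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed common-word list, for illustration only.
        Set<String> common = new HashSet<String>(Arrays.asList("to", "be", "or", "not"));
        List<String> phrase = Arrays.asList("to", "be", "or", "not", "to", "be");
        System.out.println("index: " + indexTokens(phrase, common));
        System.out.println("query: " + queryTokens(phrase, common));
    }
}
```

In this sketch a query for "to be or not to be" becomes five bigram terms (to_be, be_or, or_not, not_to, to_be) instead of six very frequent single-word terms, which is why the approach can keep such phrase searches working while avoiding the cost of matching on the stop words themselves.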