Hi folks,
I'm researching the best options to use for analysing/storing newspaper
pages in our online archive, and wondered if anyone has any good hints
or tips on good practice for this type of media?
I'm currently thinking along the lines of using a customised
StandardAnalyzer (no stop words + extra date token detection) wrapped
with a Shingle filter and finally a Stopword filter - the thinking being
that this should reduce the impact of stop words but still allow "to be or
not to be" searches...
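
To make the intended chain concrete, here is a rough plain-Java sketch of the idea (tokenise, shingle, then drop standalone stop words). It deliberately does not use the Lucene API: the class name ShingleSketch, its methods, the stop list and the maximum shingle size of 2 are all made up for illustration, and the date token detection is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustration only: a hypothetical stand-in for the analyser chain
// described above, not Lucene code.
public class ShingleSketch {

    // Assumed stop list, purely for the example.
    static final Set<String> STOP_WORDS =
            Set.of("to", "be", "or", "not", "the", "a", "of", "and");

    // Step 1: lower-case and split on anything that is not a letter or digit.
    // No stop words are removed at this stage.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 2: for each position emit the single token plus every word n-gram
    // (shingle) of up to maxSize tokens that starts at that position.
    static List<String> shingles(List<String> tokens, int maxSize) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            StringBuilder sb = new StringBuilder();
            for (int n = 0; n < maxSize && i + n < tokens.size(); n++) {
                if (n > 0) {
                    sb.append(' ');
                }
                sb.append(tokens.get(i + n));
                out.add(sb.toString());
            }
        }
        return out;
    }

    // Step 3: drop terms that are exactly a stop word. Shingles that merely
    // contain stop words are kept.
    static List<String> dropStopWords(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (!STOP_WORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    static List<String> analyse(String text, int maxSize) {
        return dropStopWords(shingles(tokenize(text), maxSize));
    }

    public static void main(String[] args) {
        System.out.println(analyse("To be or not to be", 2));
        System.out.println(analyse("The King of Spain", 2));
    }
}
```

With this stop list the first line printed is [to be, be or, or not, not to, to be], so the phrase stays searchable through its shingles even though every single word in it is a stop word. The second line is [the king, king, king of, of spain, spain]: the standalone "the" and "of" are gone but still appear inside shingles.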
A future aim is to add a synonym filter at search time.
We currently have ~2.5 million pages - some of the older broadsheet pages
can have a serious number of tokens.
We currently index using the SimpleAnalyzer - a hangover from the
previous developers I hope to remedy :-).
--
Rgds.
*Dawn Raison*
Technical Director, Digitorial Ltd.