Hi folks,

I'm researching the best options for analysing/storing newspaper pages in our online archive, and wondered if anyone has any hints or tips on good practice for this type of media?

I'm currently thinking along the lines of using a customised StandardAnalyzer (no stop words + extra date token detection) wrapped with a Shingle filter and finally a Stopword filter - the thinking being that this should reduce the impact of stop words but still allow "to be or not to be" searches...
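
To make the idea concrete, here is a rough sketch in plain Python of the terms I'd expect that chain to emit. It is not Lucene code: the function names, the tiny stop word list and the fixed shingle size of 2 are all made up for illustration, and the date token detection is left out.

```python
import re

# Hypothetical, deliberately tiny stop word list (illustration only).
STOP_WORDS = frozenset({"a", "and", "be", "not", "of", "or", "the", "to"})


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens.

    No stop words are removed at this stage.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def shingle(tokens, size=2):
    """Emit every single token plus every run of `size` adjacent tokens."""
    terms = []
    for i, token in enumerate(tokens):
        terms.append(token)
        if i + size <= len(tokens):
            terms.append(" ".join(tokens[i:i + size]))
    return terms


def drop_stop_words(terms, stop_words=STOP_WORDS):
    """Remove terms that are exactly a stop word.

    Shingles such as "to be" survive because the whole term is compared.
    """
    return [term for term in terms if term not in stop_words]


def analyse(text):
    """Tokenise, shingle, then filter stop words, in that order."""
    return drop_stop_words(shingle(tokenize(text)))


if __name__ == "__main__":
    print(analyse("To be or not to be"))
    print(analyse("The Times of London"))
```

With this ordering the single-word stop words never reach the index, but a phrase made up entirely of stop words is still findable through its shingles.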

A future aim is to add a synonym filter at search time.

We currently have ~2.5 million pages - some of the older broadsheet pages can have a serious number of tokens. We currently index using the SimpleAnalyzer - a hangover from the previous developers that I hope to remedy :-).

--

Rgds.
*Dawn Raison*
Technical Director, Digitorial Ltd.

