On Apr 21, 2006, at 11:56 AM, Malcolm Clark wrote:

Has anyone attempted to index/search the Reuters collection, which consists of SGML? Mine seems to run through the process okay, but alas I'm left with nothing in the index when I check with Luke or my own search engine.
Anyone got any hints (apart from don't do it)?

The problem is clearly in whatever parser you're using. I used Reuters 21578 as a corpus for a benchmarking suite; since all I cared about was title and body, I wrote a quick-n-dirty regex-based parser in Perl that extracted the docs from the SGML out onto the file system. You'll find it at <http://www.rectangular.com/svn/kinosearch/trunk/t/benchmarks/extract_reuters.plx>.
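
For anyone who wants the general idea without reading the Perl, here is a rough Python sketch of a regex-based extractor. It is not a port of the script above. It assumes each article is wrapped in <REUTERS ...>...</REUTERS> and carries <TITLE> and <BODY> elements, as in the standard Reuters-21578 distribution; the function names and the output file naming are made up for illustration. Articles missing either field are skipped.

```python
import re
from pathlib import Path

# Assumed layout: one <REUTERS ...> ... </REUTERS> element per article,
# with <TITLE> and <BODY> somewhere inside it.
DOC_RE = re.compile(r"<REUTERS\b[^>]*>(.*?)</REUTERS>", re.DOTALL)
TITLE_RE = re.compile(r"<TITLE>(.*?)</TITLE>", re.DOTALL)
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def extract_docs(sgml_text):
    """Yield (title, body) pairs, skipping articles that lack either field."""
    for doc in DOC_RE.finditer(sgml_text):
        title = TITLE_RE.search(doc.group(1))
        body = BODY_RE.search(doc.group(1))
        if title and body:
            yield title.group(1).strip(), body.group(1).strip()


def write_docs(sgml_text, out_dir):
    """Write each article to its own text file; return how many were written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for count, (title, body) in enumerate(extract_docs(sgml_text), start=1):
        # Hypothetical naming scheme: article00001.txt, article00002.txt, ...
        path = out / f"article{count:05d}.txt"
        path.write_text(f"{title}\n\n{body}\n", encoding="utf-8")
    return count
```

If write_docs returns 0 on a file from the collection, the patterns are not matching the markup, which would also explain an empty index. Character entities in the text are left untouched here.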

Best,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

