On Apr 21, 2006, at 11:56 AM, Malcolm Clark wrote:
> has anyone attempted to index/search the Reuters collection which
> consists of SGML?
> Mine seems to run through the process okay but alas I'm left with
> nothing in the index when I check with Luke or my own Search Engine.
> Anyone got any hints (apart from don't do it)?

The problem is clearly in whatever parser you're using. I used
Reuters 21578 as a corpus for a benchmarking suite; since all I cared
about was title and body, I wrote a quick-n-dirty regex-based parser
in Perl that extracted the docs from the SGML out onto the file
system. You'll find it at
<http://www.rectangular.com/svn/kinosearch/trunk/t/benchmarks/extract_reuters.plx>.
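
For anyone who wants the general idea without reading the Perl, here is a
minimal sketch of the same regex-based approach in Python. It is not a port
of extract_reuters.plx; the function names (extract_docs, _field) are made up
for illustration, and the patterns assume the usual Reuters-21578 layout of
one <REUTERS ...> element per article with optional <TITLE> and <BODY>
elements inside it.

```python
import html
import re

# Assumed layout: each article is wrapped in <REUTERS ...> ... </REUTERS>,
# and may contain one <TITLE> and one <BODY> element.
DOC_RE = re.compile(r"<REUTERS\b[^>]*>(.*?)</REUTERS>", re.DOTALL)
TITLE_RE = re.compile(r"<TITLE>(.*?)</TITLE>", re.DOTALL)
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def _field(pattern, text):
    """Return the first match of pattern, unescaped and stripped, or ''."""
    match = pattern.search(text)
    return html.unescape(match.group(1)).strip() if match else ""


def extract_docs(sgml):
    """Yield a (title, body) pair for each article found in an SGML string."""
    for doc in DOC_RE.finditer(sgml):
        content = doc.group(1)
        yield _field(TITLE_RE, content), _field(BODY_RE, content)
```

Counting what extract_docs yields for one file, before anything is handed to
the indexer, is a quick way to tell whether an empty index comes from the
parsing step or from the indexing step. Articles with no <TITLE> or <BODY>
come back as empty strings, so they are worth filtering out or at least
counting separately.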
Best,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/