Re: Reuters

Marvin Humphrey Fri, 21 Apr 2006 12:16:56 -0700


On Apr 21, 2006, at 11:56 AM, Malcolm Clark wrote:

Has anyone attempted to index/search the Reuters collection, which consists of SGML? Mine seems to run through the process okay, but alas I'm left with nothing in the index when I check with Luke or my own Search Engine.
Anyone got any hints (apart from don't do it)?

The problem is clearly in whatever parser you're using. I used Reuters 21578 as a corpus for a benchmarking suite; since all I cared about was title and body, I wrote a quick-n-dirty regex-based parser in Perl that extracted the docs from the SGML out onto the filesystem. You'll find it at <http://www.rectangular.com/svn/kinosearch/trunk/t/benchmarks/extract_reuters.plx>.
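
For illustration, here is a minimal sketch of the regex-based approach in Python. It is not the Perl script linked above. It assumes each document is wrapped in <REUTERS ...>...</REUTERS> with <TITLE> and <BODY> elements inside, and the names extract_docs and sample are hypothetical.

```python
import re
from html import unescape

# Assumed layout: one <REUTERS ...> element per document, with
# <TITLE> and <BODY> elements somewhere inside it.
DOC_RE = re.compile(r"<REUTERS\b[^>]*>(.*?)</REUTERS>", re.DOTALL)
TITLE_RE = re.compile(r"<TITLE>(.*?)</TITLE>", re.DOTALL)
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def extract_docs(sgml_text):
    """Return a list of {'title', 'body'} dicts found in one SGML file."""
    docs = []
    for match in DOC_RE.finditer(sgml_text):
        chunk = match.group(1)
        title = TITLE_RE.search(chunk)
        body = BODY_RE.search(chunk)
        if body is None:
            # Skip entries that have no body text at all.
            continue
        docs.append({
            # unescape() turns entities such as &lt; back into characters.
            "title": unescape(title.group(1)).strip() if title else "",
            "body": unescape(body.group(1)).strip(),
        })
    return docs


# Hypothetical two-entry sample; the second entry has no <BODY>.
sample = """<REUTERS NEWID="1">
<TEXT>
<TITLE>COCOA REVIEW</TITLE>
<BODY>Showers continued in Bahia &lt;cocoa&gt; zone.</BODY>
</TEXT>
</REUTERS>
<REUTERS NEWID="2">
<TEXT TYPE="BRIEF">
<TITLE>NO BODY HERE</TITLE>
</TEXT>
</REUTERS>"""

docs = extract_docs(sample)
print(len(docs), "document(s) extracted")
```

Printing the count before handing anything to the indexer is a quick check for the symptom described: if the list comes back empty, the documents are being lost in parsing rather than in indexing.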


Best,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

