Re: Test corpus

2006-04-02 Thread Andrzej Bialecki

Marvin Humphrey wrote:

> Greets,
>
> I'm looking for a test corpus to use for some benchmarking and parsing
> tests.  I can whip one up myself, but it would be nice to use
> something standardized.  I'd like something that doesn't require a
> license/fee, so that other people can run the same tests.  At least
> 1000 docs, a few hundred words each.  Any suggestions?


20 Newsgroups and the old Reuters corpus are freely available, and
both contain a sufficient number of documents.


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Test corpus

2006-04-01 Thread Igor Bolotin
Take a look at Project Gutenberg: http://www.gutenberg.org/
Igor

On 4/1/06, Pasha Bizhan <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> > From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
>
> > I'm looking for a test corpus to use for some benchmarking
> > and parsing tests.  I can whip one up myself, but it would be
> > nice to use something standardized.  I'd like something that
> > doesn't require a license/fee, so that other people can run
> > the same tests.  At least 1000 docs, a few hundred words
> > each.  Any suggestions?
>
> See the Corpora section at http://wiki.apache.org/jakarta-lucene/Resources
>
> Pasha Bizhan


RE: Test corpus

2006-04-01 Thread Pasha Bizhan
Hi, 

> From: Marvin Humphrey [mailto:[EMAIL PROTECTED] 
 
> I'm looking for a test corpus to use for some benchmarking 
> and parsing tests.  I can whip one up myself, but it would be 
> nice to use something standardized.  I'd like something that 
> doesn't require a license/fee, so that other people can run 
> the same tests.  At least 1000 docs, a few hundred words 
> each.  Any suggestions?

See the Corpora section at http://wiki.apache.org/jakarta-lucene/Resources

Pasha Bizhan


