Sure, how about a bunch of Apache project websites? The project name is the
"category", e.g. Lucene, Tomcat, or Hadoop.


On Feb 9, 2010, at 9:11 AM, Robin Anil wrote:

> I feel a need to check in a set of text documents to Mahout, maybe 3-4
> categories of 10 documents each. They can be used in clustering,
> classification, vectorizer and collocation testing, and even frequent
> pattern generation.
> 
> And instead of doing artificial tests, each of them can use this set to test
> against a reference implementation written in the test class, like what
> kmeans does.
> 
> Plus we will have a baseline against which we can see improvements in these
> algorithms. Any idea of some good (legally sound :)) dataset which we can
> use?
> 
> The same idea can be extended to CF as well.
> 
> 
> Robin

