Need a set of documents checked in to mahout trunk

2010-02-09 Thread Robin Anil
I feel a need to check in a set of text documents to Mahout, maybe 3-4 categories of 10 documents each. They can be used in clustering, classification, vectorizer and collocation testing, and even frequent pattern generation. And instead of doing artificial tests, each of them can use this to test against a referen

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Sean
I don't, but can offer alternatives -- Just have the user download the data set. I don't think this is a big burden. Download the data set automatically. These are free of legal and tarball-size problems. On Tue, Feb 9, 2010 at 2:11 PM, Robin Anil wrote: > I feel a need to check in a set of tex

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Robin Anil
Make the Maven test phase download this dataset once for all tests? Is that possible? On Tue, Feb 9, 2010 at 7:43 PM, Sean wrote: > I don't, but can offer alternatives -- > > Just have the user download the data set. I don't think this is a big > burden. > Download the data set automatically.

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Grant Ingersoll
Sure, how about a bunch of Apache project websites? The project name is the "category", i.e. Lucene, Tomcat, Hadoop, etc. On Feb 9, 2010, at 9:11 AM, Robin Anil wrote: > I feel a need to check in a set of text documents to mahout. maybe 3-4 > categories of documents 10 each. > can be used in c

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Robin Anil
Yeah, that sounds OK. Do we have the pure content without HTML? Robin On Tue, Feb 9, 2010 at 8:24 PM, Grant Ingersoll wrote: > Sure, how about a bunch of Apache project websites? The project name is > the "category", i.e. Lucene, Tomcat, Hadoop, etc. > > > On Feb 9, 2010, at 9:11 AM, Robin Ani

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Grant Ingersoll
On Feb 9, 2010, at 9:56 AM, Robin Anil wrote: > Yeah that sounds ok. Do we have the pure content without html ? No, but I was just thinking yesterday that a really nice enhancement to the Doc. Vectorizer would be to hook in Tika, such that one could M/R binary files into Mahout vectors. Thoug

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Robin Anil
Yeah! Tika looks great! I bet Drew's patch to create a structured document format via Avro should essentially go into Tika. Then we could really use the Tika library to the full. I should really spend time to explore Apache projects. I think we could reuse a whole lot. Robin On Tue, Feb 9, 20

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Grant Ingersoll
On Feb 9, 2010, at 10:24 AM, Robin Anil wrote: > Yeah!. Tika looks great!. I bet Drew's patch to create a structured document > format via Avro should essentially go into Tika. Then we could really use > the Tika library to the full. Solr has code here that would be pretty simple to grab, but it