Re: Need a set of documents checked in to mahout trunk
On Feb 9, 2010, at 10:24 AM, Robin Anil wrote:

> Yeah! Tika looks great! I bet Drew's patch to create a structured document
> format via Avro should essentially go into Tika. Then we could really use
> the Tika library to the full.

Solr has code here that would be pretty simple to grab, but it's also really straightforward to do standalone. The key is making sure that people can provide their own DocumentHandler if they want, while still providing good default options.

> I should really spend time to explore Apache projects. I think we could
> reuse a whole lot.

+1. Cross-fertilization is a good thing. Many people in the Lucene communities are working on these types of things. We're getting to the point where UIMA integration makes sense, too, I think, but I'm not a UIMA expert, so...

-Grant
Re: Need a set of documents checked in to mahout trunk
Yeah! Tika looks great! I bet Drew's patch to create a structured document format via Avro should essentially go into Tika. Then we could really use the Tika library to the full.

I should really spend time to explore Apache projects. I think we could reuse a whole lot.

Robin

On Tue, Feb 9, 2010 at 8:30 PM, Grant Ingersoll wrote:

> On Feb 9, 2010, at 9:56 AM, Robin Anil wrote:
>
> > Yeah that sounds ok. Do we have the pure content without html?
>
> No, but I was just thinking yesterday that a really nice enhancement to the
> Doc. Vectorizer would be to hook in Tika, such that one could M/R binary
> files into Mahout vectors. Thoughts? Tika integration should be pretty
> trivial. I can likely help later in the week.
>
> -Grant
Re: Need a set of documents checked in to mahout trunk
On Feb 9, 2010, at 9:56 AM, Robin Anil wrote:

> Yeah that sounds ok. Do we have the pure content without html?

No, but I was just thinking yesterday that a really nice enhancement to the Doc. Vectorizer would be to hook in Tika, such that one could M/R binary files into Mahout vectors. Thoughts? Tika integration should be pretty trivial. I can likely help later in the week.

-Grant
Re: Need a set of documents checked in to mahout trunk
Yeah that sounds ok. Do we have the pure content without html?

Robin

On Tue, Feb 9, 2010 at 8:24 PM, Grant Ingersoll wrote:

> Sure, how about a bunch of Apache project websites? The project name is
> the "category", i.e. Lucene, Tomcat, Hadoop, etc.
>
> On Feb 9, 2010, at 9:11 AM, Robin Anil wrote:
>
> > I feel a need to check in a set of text documents to Mahout, maybe 3-4
> > categories of 10 documents each. They can be used in clustering,
> > classification, vectorizer collocation testing and even frequent
> > pattern generation.
> >
> > And instead of doing artificial tests, each of them can use this to test
> > against a reference implementation written in the test class, like what
> > kmeans does.
> >
> > Plus we will have a baseline with which we can see improvements in these
> > algorithms. Any idea of some good (legally sound :)) dataset which we can
> > use?
> >
> > Same idea can be extended to CF also.
> >
> > Robin
Re: Need a set of documents checked in to mahout trunk
Sure, how about a bunch of Apache project websites? The project name is the "category", i.e. Lucene, Tomcat, Hadoop, etc.

On Feb 9, 2010, at 9:11 AM, Robin Anil wrote:

> I feel a need to check in a set of text documents to Mahout, maybe 3-4
> categories of 10 documents each. They can be used in clustering,
> classification, vectorizer collocation testing and even frequent
> pattern generation.
>
> And instead of doing artificial tests, each of them can use this to test
> against a reference implementation written in the test class, like what
> kmeans does.
>
> Plus we will have a baseline with which we can see improvements in these
> algorithms. Any idea of some good (legally sound :)) dataset which we can
> use?
>
> Same idea can be extended to CF also.
>
> Robin
Re: Need a set of documents checked in to mahout trunk
Make the Maven test phase download this dataset once for all tests? Is that possible?

On Tue, Feb 9, 2010 at 7:43 PM, Sean wrote:

> I don't, but can offer alternatives --
>
> Just have the user download the data set. I don't think this is a big
> burden.
> Download the data set automatically.
>
> These are free of legal and tarball-size problems.
>
> On Tue, Feb 9, 2010 at 2:11 PM, Robin Anil wrote:
> > I feel a need to check in a set of text documents to Mahout, maybe 3-4
> > categories of 10 documents each. They can be used in clustering,
> > classification, vectorizer collocation testing and even frequent
> > pattern generation.
> >
> > And instead of doing artificial tests, each of them can use this to test
> > against a reference implementation written in the test class, like what
> > kmeans does.
> >
> > Plus we will have a baseline with which we can see improvements in these
> > algorithms. Any idea of some good (legally sound :)) dataset which we can
> > use?
> >
> > Same idea can be extended to CF also.
> >
> > Robin
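
On the question of downloading the dataset once for all tests: the thread does not settle how this would be wired into Maven, so the following is only a minimal sketch of the "fetch once, then reuse" idea in plain Java. The class name DatasetCache, the method fetchOnce, and the cache layout are all hypothetical and are not part of Mahout or Maven. The opener argument stands in for whatever actually opens the remote data set.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.Callable;

/** Hypothetical helper (not a Mahout or Maven API): fetches a data set at most once. */
final class DatasetCache {
  private DatasetCache() {}

  /**
   * Returns the cached copy of fileName inside cacheDir. The opener is only
   * called when the file is not there yet, so later tests reuse the first download.
   */
  static synchronized Path fetchOnce(Path cacheDir, String fileName,
                                     Callable<InputStream> opener) throws Exception {
    Path target = cacheDir.resolve(fileName);
    if (Files.exists(target)) {
      return target; // an earlier test already fetched it
    }
    Files.createDirectories(cacheDir);
    // Download into a temporary file first, so an interrupted transfer
    // never leaves a half-written file under the final name.
    Path tmp = Files.createTempFile(cacheDir, fileName, ".part");
    try {
      try (InputStream in = opener.call()) {
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
      }
      Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    } finally {
      Files.deleteIfExists(tmp); // no-op after a successful move
    }
    return target;
  }
}
```

A test base class could call fetchOnce from its setup method with an opener that opens a URL stream. Whether the cache directory should live under the build directory or somewhere that survives a clean build is a design choice left open here.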
Re: Need a set of documents checked in to mahout trunk
I don't, but can offer alternatives --

Just have the user download the data set. I don't think this is a big burden.
Download the data set automatically.

These are free of legal and tarball-size problems.

On Tue, Feb 9, 2010 at 2:11 PM, Robin Anil wrote:

> I feel a need to check in a set of text documents to Mahout, maybe 3-4
> categories of 10 documents each. They can be used in clustering,
> classification, vectorizer collocation testing and even frequent
> pattern generation.
>
> And instead of doing artificial tests, each of them can use this to test
> against a reference implementation written in the test class, like what
> kmeans does.
>
> Plus we will have a baseline with which we can see improvements in these
> algorithms. Any idea of some good (legally sound :)) dataset which we can
> use?
>
> Same idea can be extended to CF also.
>
> Robin