Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Grant Ingersoll

On Feb 9, 2010, at 10:24 AM, Robin Anil wrote:

> Yeah! Tika looks great! I bet Drew's patch to create a structured document
> format via Avro should essentially go into Tika. Then we could really use
> the Tika library to the full.

Solr has code here that would be pretty simple to grab, but it's also really 
straightforward to do standalone.  The key is making sure that people can 
provide their own DocumentHandler if they want, while still providing good 
default options.

> 
> I should really spend time to explore Apache projects. I think we could
> reuse a whole lot.

+1.  Cross fertilization is a good thing.  Many people in the Lucene 
communities are working on these types of things.

We're getting to the point where UIMA integration makes sense, too, I think, 
but I'm not a UIMA expert, so...

-Grant

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Robin Anil
Yeah! Tika looks great! I bet Drew's patch to create a structured document
format via Avro should essentially go into Tika. Then we could really use
the Tika library to the full.

I should really spend time to explore Apache projects. I think we could
reuse a whole lot.
Robin



On Tue, Feb 9, 2010 at 8:30 PM, Grant Ingersoll wrote:

>
> On Feb 9, 2010, at 9:56 AM, Robin Anil wrote:
>
> > Yeah, that sounds OK. Do we have the pure content without HTML?
>
> No, but I was just thinking yesterday that a really nice enhancement to the
> Doc. Vectorizer would be to hook in Tika, such that one could M/R binary
> files into Mahout vectors.  Thoughts?  Tika integration should be pretty
> trivial.  I can likely help later in the week.
>
> -Grant


Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Grant Ingersoll

On Feb 9, 2010, at 9:56 AM, Robin Anil wrote:

> Yeah, that sounds OK. Do we have the pure content without HTML?

No, but I was just thinking yesterday that a really nice enhancement to the 
Doc. Vectorizer would be to hook in Tika, such that one could M/R binary files 
into Mahout vectors.  Thoughts?  Tika integration should be pretty trivial.  I 
can likely help later in the week.

-Grant

Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Robin Anil
Yeah, that sounds OK. Do we have the pure content without HTML?

Robin

On Tue, Feb 9, 2010 at 8:24 PM, Grant Ingersoll wrote:

> Sure, how about a bunch of Apache project websites?  The project name is
> the "category", e.g. Lucene, Tomcat, Hadoop, etc.
>
>
> On Feb 9, 2010, at 9:11 AM, Robin Anil wrote:
>
> > I feel a need to check in a set of text documents to Mahout, maybe 3-4
> > categories of documents, 10 each. They can be used in clustering,
> > classification, vectorizer and collocation testing, and even frequent
> > pattern generation.
> >
> > Instead of doing artificial tests, each of these can use the documents to
> > test against a reference implementation written in the test class, like
> > what kmeans does.
> >
> > Plus we will have a baseline against which we can see improvements in
> > these algorithms. Any idea of a good (legally sound :)) dataset which we
> > can use?
> >
> > The same idea can be extended to CF also.
> >
> >
> > Robin
>
>
>


Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Grant Ingersoll
Sure, how about a bunch of Apache project websites?  The project name is the 
"category", e.g. Lucene, Tomcat, Hadoop, etc.


On Feb 9, 2010, at 9:11 AM, Robin Anil wrote:

> I feel a need to check in a set of text documents to Mahout, maybe 3-4
> categories of documents, 10 each. They can be used in clustering,
> classification, vectorizer and collocation testing, and even frequent
> pattern generation.
> 
> Instead of doing artificial tests, each of these can use the documents to
> test against a reference implementation written in the test class, like
> what kmeans does.
> 
> Plus we will have a baseline against which we can see improvements in these
> algorithms. Any idea of a good (legally sound :)) dataset which we can use?
> 
> The same idea can be extended to CF also.
> 
> 
> Robin
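
Grant's suggestion above (one category per Apache project, with the project name as the label) could be laid out on disk as one subdirectory per project. The following is only a rough sketch under that assumption; the class name `LabeledCorpus`, the method `load`, and the directory layout are hypothetical and not part of Mahout.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical loader: each subdirectory of root is a category named after a project. */
public final class LabeledCorpus {
  private LabeledCorpus() {}

  /** Maps each category name (the subdirectory name) to the document files inside it. */
  public static Map<String, List<Path>> load(Path root) throws IOException {
    Map<String, List<Path>> byCategory = new TreeMap<>();
    try (DirectoryStream<Path> categories = Files.newDirectoryStream(root)) {
      for (Path categoryDir : categories) {
        if (!Files.isDirectory(categoryDir)) {
          continue; // skip stray files at the top level
        }
        List<Path> docs = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(categoryDir)) {
          for (Path doc : files) {
            if (Files.isRegularFile(doc)) {
              docs.add(doc);
            }
          }
        }
        // The directory name (e.g. "Lucene") serves as the category label.
        byCategory.put(categoryDir.getFileName().toString(), docs);
      }
    }
    return byCategory;
  }
}
```

With a layout such as `docs/Lucene/*.txt` and `docs/Hadoop/*.txt`, `LabeledCorpus.load(Paths.get("docs"))` would return one entry per project, which a clustering or classification test could then use as ground-truth labels.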




Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Robin Anil
Could the Maven test phase download this dataset once for all tests? Is that
possible?



On Tue, Feb 9, 2010 at 7:43 PM, Sean wrote:

> I don't, but can offer alternatives --
>
> Just have the user download the data set. I don't think this is a big
> burden.
> Download the data set automatically.
>
> These are free of legal and tarball-size problems.
>
> On Tue, Feb 9, 2010 at 2:11 PM, Robin Anil wrote:
> > I feel a need to check in a set of text documents to Mahout, maybe 3-4
> > categories of documents, 10 each. They can be used in clustering,
> > classification, vectorizer and collocation testing, and even frequent
> > pattern generation.
> >
> > Instead of doing artificial tests, each of these can use the documents to
> > test against a reference implementation written in the test class, like
> > what kmeans does.
> >
> > Plus we will have a baseline against which we can see improvements in
> > these algorithms. Any idea of a good (legally sound :)) dataset which we
> > can use?
> >
> > The same idea can be extended to CF also.
> >
> >
> > Robin
> >
>
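
The download-once idea discussed in this message might look roughly like the sketch below: the first test that needs the dataset fetches it into a cache directory, and later tests reuse the cached copy. This is only an illustration under stated assumptions; `DatasetCache`, `fetchOnce`, and the cache location are hypothetical names, and how such a helper would be wired into the Maven test phase is not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: fetches a test dataset once and reuses the cached copy. */
public final class DatasetCache {
  private DatasetCache() {}

  /**
   * Returns the cached file, downloading it from source only if it is not
   * already present in cacheDir. The lock only covers tests in the same JVM.
   */
  public static synchronized Path fetchOnce(URL source, Path cacheDir, String fileName)
      throws IOException {
    Files.createDirectories(cacheDir);
    Path target = cacheDir.resolve(fileName);
    if (Files.exists(target)) {
      return target; // already downloaded by an earlier test
    }
    // Download to a temporary file first so an interrupted download never
    // leaves a partial file under the final name.
    Path tmp = Files.createTempFile(cacheDir, "download", ".part");
    try (InputStream in = source.openStream()) {
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
      Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    } finally {
      Files.deleteIfExists(tmp); // no-op after a successful move
    }
    return target;
  }
}
```

A test base class could call `DatasetCache.fetchOnce(...)` in its setup method; because the cached file is checked first, only the first caller pays the download cost, and nothing large needs to live in the source tarball.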


Re: Need a set of documents checked in to mahout trunk

2010-02-09 Thread Sean
I don't, but can offer alternatives --

Just have the user download the data set. I don't think this is a big burden.
Download the data set automatically.

These are free of legal and tarball-size problems.

On Tue, Feb 9, 2010 at 2:11 PM, Robin Anil wrote:
> I feel a need to check in a set of text documents to Mahout, maybe 3-4
> categories of documents, 10 each. They can be used in clustering,
> classification, vectorizer and collocation testing, and even frequent
> pattern generation.
>
> Instead of doing artificial tests, each of these can use the documents to
> test against a reference implementation written in the test class, like
> what kmeans does.
>
> Plus we will have a baseline against which we can see improvements in these
> algorithms. Any idea of a good (legally sound :)) dataset which we can use?
>
> The same idea can be extended to CF also.
>
>
> Robin
>