Re: Failure to run Clustering example

Grant Ingersoll Fri, 01 May 2009 08:06:53 -0700

That sounds reasonable. You might also look at the (Complementary) Naive Bayes stuff, as it has some support for calculating the TF-IDF stuff, but it does it from flat files. It's in the examples part of Mahout.


On May 1, 2009, at 5:09 AM, Shashikant Kore wrote:

Here is my plan to create the document vectors.

1. Create Lucene index for all the text files.
2. Iterate on the terms in the index and assign an ID to each term.
3. For each text file
  3a. Get terms of the file.
  3b. Get the TF-IDF score of each term from the Lucene index. In the
document vector, store this score along with the term ID. The document
vector will be a sparse vector.

Can this now be given as input to the clustering code?
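
A minimal sketch of steps 2 and 3 of the plan above, for illustration only. It is plain Python working on already-tokenized, in-memory text rather than a Lucene index, and it does not produce a Mahout Vector. The function names (assign_term_ids, document_frequencies, tfidf_vectors), the sample file names, and the textbook weighting tf * ln(N / df) are all assumptions, not what Lucene or Mahout compute internally.

```python
import math
from collections import Counter


def assign_term_ids(docs):
    """Step 2: give every distinct term an integer ID (alphabetical order)."""
    vocabulary = sorted({term for tokens in docs.values() for term in tokens})
    return {term: idx for idx, term in enumerate(vocabulary)}


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    return df


def tfidf_vectors(docs):
    """Step 3: build one sparse vector (dict of term ID -> weight) per document.

    Assumed weighting: raw term count times ln(N / df). This is a textbook
    variant chosen for the sketch, not Lucene's scoring formula.
    """
    term_ids = assign_term_ids(docs)
    df = document_frequencies(docs)
    n_docs = len(docs)
    vectors = {}
    for name, tokens in docs.items():
        vector = {}
        for term, count in Counter(tokens).items():
            weight = count * math.log(n_docs / df[term])
            if weight != 0.0:  # keep the vector sparse: skip zero weights
                vector[term_ids[term]] = weight
        vectors[name] = vector
    return term_ids, vectors


if __name__ == "__main__":
    # Hypothetical, already-tokenized documents standing in for the text files.
    docs = {
        "a.txt": "apache mahout clustering".split(),
        "b.txt": "apache lucene index index".split(),
    }
    term_ids, vectors = tfidf_vectors(docs)
    for name, vector in vectors.items():
        print(name, sorted(vector.items()))
```

Each document ends up as a dict of term ID to weight, which is the sparse form described in step 3b. Terms that occur in every document get weight zero under this weighting and are left out of the vector.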

Thanks,
--shashi
On Fri, May 1, 2009 at 5:02 AM, Grant Ingersoll<[email protected]> wrote:
On Apr 29, 2009, at 10:27 AM, Shashikant Kore wrote:
Hi Jeff,
The JDK problem occurs while running the Synthetic Control Data example from
http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html


The other query was related to how to convert text files to Mahout
Vectors. Let's say I have text files of Wikipedia pages and now I want
to create clusters out of them. How do I get the Mahout vector from the
Lucene index? Can you point me to some theory behind it, from which I
can convert it to code?

I don't think we have any demo code for this yet. I have a personal task
that I'm trying to get to that will demonstrate how to cluster text
starting from a plain text file, but nothing in code yet, especially not
anything that takes it from Lucene. All of these would be great additions
to have. I think Richard Tomsett said he had some code to do it, but
hasn't donated it yet. He's also put up a patch for doing a cosine
distance metric, but it is not committed yet.

Cheers,
Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene:
http://www.lucidimagination.com/search


