Here is a quick update. I wrote a simple program to create a Lucene index from the text files and then generate document vectors for the indexed documents. I ran K-means after creating canopies on 100 documents, and it completed fine.
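For reference, the vector generation is roughly along these lines. This is only a simplified sketch of the plan quoted at the bottom of this mail, not the code I will submit: the "content" field name, the index path, and the tf-idf weighting are placeholders, and it assumes the field was indexed with term vectors and that Mahout's SparseVector is used for the document vectors.

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermFreqVector;
import org.apache.mahout.matrix.SparseVector;
import org.apache.mahout.matrix.Vector;

public class LuceneVectorSketch {

  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("/path/to/index");

    // Assign a dense integer ID to every term in the index.
    Map<String, Integer> termIds = new HashMap<String, Integer>();
    TermEnum terms = reader.terms();
    int nextId = 0;
    while (terms.next()) {
      termIds.put(terms.term().text(), nextId++);
    }

    int numDocs = reader.numDocs();
    int cardinality = termIds.size();

    // One sparse vector per document, weighted by tf-idf.
    for (int doc = 0; doc < reader.maxDoc(); doc++) {
      if (reader.isDeleted(doc)) {
        continue;
      }
      TermFreqVector tfv = reader.getTermFreqVector(doc, "content");
      if (tfv == null) {
        continue;  // field was not indexed with term vectors
      }
      Vector vector = new SparseVector(cardinality);
      String[] termTexts = tfv.getTerms();
      int[] freqs = tfv.getTermFrequencies();
      for (int i = 0; i < termTexts.length; i++) {
        int df = reader.docFreq(new Term("content", termTexts[i]));
        double idf = Math.log((double) numDocs / (df + 1));
        vector.set(termIds.get(termTexts[i]), freqs[i] * idf);
      }
      // vector.asFormatString() is then written out as input for the
      // canopy/K-means jobs.
    }
    reader.close();
  }
}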
Here are some of the problems.

1. As Jeff pointed out, I need to maintain an external mapping from
document IDs to vectors, which requires some glue code outside the
clustering. The Mahout-65 issue meant to handle this looks complex.
Instead, can I just add a label to a vector and change the
decodeVector() and asFormatString() methods to handle the label? (A
rough sketch of what I mean is at the end of this mail.)

2. Creating canopies for 1,000 documents took almost 75 minutes.
Although the index has 50,000 unique terms in total, each vector has
fewer than 100 unique terms (i.e., each document vector is a sparse
vector of cardinality 50,000 with about 100 elements). The hardware is
admittedly low-end: 1 GB of RAM and a 1.6 GHz dual-core processor,
with a single-node Hadoop setup. T1 and T2 were 80 and 55
respectively, as given in the sample program. I believe I am missing
something obvious that would make this code run much faster; the
current performance is not acceptable. I looked at the SparseVector
code: the map of values uses Integer keys and Double values.
Auto-boxing may slow things down, but the performance I am seeing
suggests something else. (BTW, I have tried Trove's primitive
collections and found substantial performance gains; I will run some
tests for that, along the lines of the second sketch at the end of
this mail.)

3. I will submit the index generation code after internal approvals.
The code was written quickly and still needs some work to bring it to
an acceptable level of quality.

Thanks,

--shashi

On Fri, May 1, 2009 at 8:36 PM, Grant Ingersoll <[email protected]> wrote:
> That sounds reasonable. You might also look at the (Complementary)
> Naive Bayes stuff, as it has some support for calculating the TF-IDF
> stuff, but it does it from flat files. It's in the examples part of
> Mahout.
>
>
> On May 1, 2009, at 5:09 AM, Shashikant Kore wrote:
>
>> Here is my plan to create the document vectors.
>>
>> 1. Create a Lucene index for all the text files.
>> 2. Iterate over the terms in the index and assign an ID to each term.
>> 3. For each text file:
>>   3a. Get the terms of the file.
>>   3b. Get the TF-IDF score of each term from the Lucene index and
>> store this score along with the term ID in the document vector. The
>> document vector will be a sparse vector.
>>
>> Can this now be given as input to the clustering code?
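For point 1, this is roughly what I have in mind. It is only a sketch
to show the intent: the real change would go into SparseVector (or
AbstractVector) itself rather than a subclass, decodeVector() would
need a matching change to strip the label back off, and the '|'
separator is just a placeholder.

public class LabeledVector extends org.apache.mahout.matrix.SparseVector {

  private final String label;  // e.g. the Lucene document ID or file name

  public LabeledVector(String label, int cardinality) {
    super(cardinality);
    this.label = label;
  }

  public String getLabel() {
    return label;
  }

  // Prepend the label so it survives the map/reduce round trips;
  // decodeVector() would have to recognize and remove it again.
  public String asFormatString() {
    return label + '|' + super.asFormatString();
  }
}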
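On point 2, the Trove comparison I plan to run looks roughly like the
following. It is a throwaway sketch: it assumes Trove 2.x's
gnu.trove.TIntDoubleHashMap, the sizes and loop counts are arbitrary,
and it only isolates the auto-boxing cost on get/put, not the full
canopy distance computation.

import java.util.HashMap;
import java.util.Map;

import gnu.trove.TIntDoubleHashMap;

public class BoxingComparison {

  private static final int TERMS = 100;       // non-zero terms per document vector
  private static final int ROUNDS = 1000000;  // simulate many distance computations

  public static void main(String[] args) {
    Map<Integer, Double> boxed = new HashMap<Integer, Double>();
    TIntDoubleHashMap primitive = new TIntDoubleHashMap();
    for (int i = 0; i < TERMS; i++) {
      boxed.put(i, (double) i);
      primitive.put(i, i);
    }

    double sum = 0;
    long start = System.currentTimeMillis();
    for (int r = 0; r < ROUNDS; r++) {
      for (int i = 0; i < TERMS; i++) {
        sum += boxed.get(i);      // unboxes a Double on every read
      }
    }
    System.out.println("HashMap<Integer, Double>: "
        + (System.currentTimeMillis() - start) + " ms");

    start = System.currentTimeMillis();
    for (int r = 0; r < ROUNDS; r++) {
      for (int i = 0; i < TERMS; i++) {
        sum += primitive.get(i);  // primitive int/double, no boxing
      }
    }
    System.out.println("TIntDoubleHashMap: "
        + (System.currentTimeMillis() - start) + " ms");

    System.out.println("(checksum: " + sum + ")");
  }
}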
