Prepare document vectors from the text
--------------------------------------

                 Key: MAHOUT-126
                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
             Project: Mahout
          Issue Type: New Feature
            Reporter: Shashikant Kore


Clustering algorithms currently take document vectors as input.  Generating 
these document vectors from the text can be broken into two tasks. 

1. Create a Lucene index of the input plain-text documents.
2. From the index, generate sparse document vectors whose weights are the 
TF-IDF values of the terms. With a Lucene index, these values can be calculated 
easily. 
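
To illustrate the two tasks, here is a minimal sketch in plain Python. It is 
not the proposed utilities and does not use Lucene: the "index" is just 
in-memory term counts, tokenization is whitespace splitting, and the weight is 
the common tf * log(N / df) variant, which is an assumption and not necessarily 
the formula the utilities use. The function names build_index and 
tfidf_vectors are hypothetical.

```python
import math
from collections import Counter


def build_index(docs):
    """Task 1 (stand-in for a Lucene index): per-document term counts
    plus the number of documents each term occurs in."""
    term_counts = [Counter(text.lower().split()) for text in docs]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    return term_counts, doc_freq


def tfidf_vectors(docs):
    """Task 2: build one sparse vector (dict of term id -> weight) per document."""
    term_counts, doc_freq = build_index(docs)
    # Assign each term an integer id; sorting keeps the ids reproducible.
    term_ids = {term: i for i, term in enumerate(sorted(doc_freq))}
    n_docs = len(docs)
    vectors = []
    for counts in term_counts:
        vec = {}
        for term, tf in counts.items():
            # Assumed weighting: raw term frequency times log(N / df).
            weight = tf * math.log(n_docs / doc_freq[term])
            if weight != 0.0:
                # Keep the vector sparse: zero weights are not stored.
                vec[term_ids[term]] = weight
        vectors.append(vec)
    return term_ids, vectors


if __name__ == "__main__":
    ids, vecs = tfidf_vectors(["apple banana apple", "banana cherry"])
    print(ids)
    print(vecs)
```

A term that appears in every document gets weight zero under this formula and 
is dropped from the sparse vector, which is why "banana" does not appear in 
either vector of the small example.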

So far, I have created two separate utilities, which could be invoked from 
another class. 



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
