[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Grant Ingersoll (JIRA) Thu, 18 Jun 2009 05:09:33 -0700

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721215#action_12721215 ]


Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

Hey David,

I'm not sure what's going on here, because that value being null means the term
is not in the index, yet it is in the Term Vector for that doc.  Are you sure
you're loading the same field?  Can you share the indexing code?

The fix works, but I'd like to understand at a deeper level what's going on.

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: mahout-126-benson.patch, 
> MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, 
> MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, 
> MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take document vectors as input.
> Generating these document vectors from the text can be broken into two tasks:
> 1. Create a Lucene index of the input plain-text documents.
> 2. From the index, generate the (sparse) document vectors, with the TF-IDF
> value of each term as its weight. With a Lucene index, this value can be
> calculated very easily.
> Presently, I have created two separate utilities, which could possibly be
> invoked from another class.
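
As a rough illustration of step 2 in the description above, the sketch below
builds sparse TF-IDF vectors in plain Java. It is not the code from any of the
attached patches and it does not use the Lucene or Mahout APIs: the class name
TfIdfSketch, its methods, and the in-memory token lists standing in for an
index are all hypothetical. The weight used, tf * ln(N / df), is one common
TF-IDF variant and is an assumption, not necessarily the formula the patch or
Lucene applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: pre-tokenized documents stand in for a Lucene index.
class TfIdfSketch {

    /** Assigns each distinct term an integer index, in order of first appearance. */
    static Map<String, Integer> buildDictionary(List<List<String>> docs) {
        Map<String, Integer> dict = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                dict.putIfAbsent(term, dict.size());
            }
        }
        return dict;
    }

    /** Counts, for each term, the number of documents that contain it. */
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Returns one sparse vector (term index -> weight) per document. */
    static List<Map<Integer, Double>> toSparseVectors(List<List<String>> docs) {
        Map<String, Integer> dict = buildDictionary(docs);
        Map<String, Integer> df = documentFrequencies(docs);
        int numDocs = docs.size();
        List<Map<Integer, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term frequency within this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            Map<Integer, Double> vector = new TreeMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                // Assumed weighting: tf * ln(N / df).
                double idf = Math.log((double) numDocs / df.get(e.getKey()));
                double weight = e.getValue() * idf;
                if (weight != 0.0) {
                    // Keep the vector sparse: zero weights are not stored.
                    vector.put(dict.get(e.getKey()), weight);
                }
            }
            vectors.add(vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("apple", "banana", "apple"),
                Arrays.asList("banana", "cherry"));
        for (Map<Integer, Double> vector : toSparseVectors(docs)) {
            System.out.println(vector);
        }
    }
}
```

With this weighting, a term that occurs in every document gets a weight of
zero and is dropped from the vector. An implementation reading from a real
index would take the term and document frequencies from the index instead of
recounting them.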

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
