[jira] [Commented] (MAHOUT-973) SparseVectorsFromSequenceFiles will not create a proper TFIDF (bug in TFIDFPartialVectorReducer)

Hudson (Commented) (JIRA) Fri, 06 Apr 2012 07:03:46 -0700

    [ https://issues.apache.org/jira/browse/MAHOUT-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248382#comment-13248382 ]


Hudson commented on MAHOUT-973:
-------------------------------

Integrated in Mahout-Quality #1426 (See 
[https://builds.apache.org/job/Mahout-Quality/1426/])
    MAHOUT-973 fix treatment of value as percentage (Revision 1310302)

     Result = FAILURE
srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1310302
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/tfidf/TFIDFPartialVectorReducer.java

                
> SparseVectorsFromSequenceFiles will not create a proper TFIDF (bug in 
> TFIDFPartialVectorReducer)
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-973
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-973
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.6
>            Reporter: Viktor Gal
>            Assignee: Sean Owen
>             Fix For: 0.7
>
>         Attachments: fix-TFIDFPartialVectorReducer.patch
>
>
> Although I use TFIDFConverter in a slightly different way, the same problem 
> occurs with SparseVectorsFromSequenceFiles when somebody wants to create 
> TFIDF vectors for their documents.
> Basically, if maxDFSigma is not set, then because of 
> SparseVectorsFromSequenceFiles.java:281
> long maxDF = maxDFPercent;
> maxDF will be 99. This value is then passed as an argument to 
> TFIDFConverter.processTfIdf, where it is interpreted as "The max percentage 
> of vectors for the DF." Partial vectors are created by 
> TFIDFPartialVectorReducer, and because of the check if (df > maxDF) at 
> TFIDFPartialVectorReducer.java:81, with maxDF = 99, any term whose df 
> exceeds 99 will be ignored.
> The problem here is that two different quantities are compared. The df value 
> is the number of documents that contain the given term; it is not normalized 
> by the number of documents, i.e. it is not a percentage (see 
> TermDocumentCountReducer.java for details), while maxDF is interpreted as a 
> percentage, as described above. Thus, as soon as the df count gets higher 
> than 99, or in the best case 100, meaning the given term occurs in more than 
> 99 or 100 different documents, the term is ignored, which is not the 
> intended behavior.
> I.e. there's a bug in TFIDFPartialVectorReducer.java at line 81.
> I've attached a possible fix for this problem.
> The bug was introduced in commit a61e5ff8 (git), or rev 1210994 in svn:
> @@ -78,7 +78,7 @@ public class TFIDFPartialVectorReducer extends
>          continue;
>        }
>        long df = dictionary.get(e.index());
> -      if (df * 100.0 / vectorCount > maxDfPercent) {
> +      if (maxDf > -1 && df > maxDf) {
>          continue;
>        }
>        if (df < minDf) {
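
The unit mismatch described in the report can be illustrated with a minimal, standalone sketch. It is not the Mahout code itself: the class name DfThresholdSketch, the method names, and the sample numbers (10,000 documents, a term found in 150 of them) are assumptions made up for illustration. Only the two comparison expressions are taken from the diff quoted above.

```java
public class DfThresholdSketch {

    // Comparison from the "+" line of the quoted diff: df is a raw document
    // count, so passing a percentage as maxDf mixes two different units.
    static boolean skippedByRawComparison(long df, long maxDf) {
        return maxDf > -1 && df > maxDf;
    }

    // Comparison from the "-" line of the quoted diff: df is first normalized
    // by the number of vectors, so both sides are percentages.
    static boolean skippedByPercentComparison(long df, long vectorCount, long maxDfPercent) {
        return df * 100.0 / vectorCount > maxDfPercent;
    }

    public static void main(String[] args) {
        long vectorCount = 10_000; // hypothetical corpus size
        long df = 150;             // hypothetical: term occurs in 150 documents (1.5%)
        long maxDfPercent = 99;    // value the report says maxDF takes when maxDFSigma is unset

        // Raw count vs. percentage: 150 > 99, so the term is dropped.
        System.out.println(skippedByRawComparison(df, maxDfPercent));

        // Percentage vs. percentage: 1.5 > 99 is false, so the term is kept.
        System.out.println(skippedByPercentComparison(df, vectorCount, maxDfPercent));
    }
}
```

With these sample numbers, a term that appears in only 1.5% of the documents is discarded by the raw comparison but kept by the normalized one, which is the behavior the report complains about.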

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
