[ 
https://issues.apache.org/jira/browse/MAHOUT-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated MAHOUT-973:
-----------------------------

    Affects Version/s:     (was: 0.7)
        Fix Version/s: 0.7
             Assignee: Grant Ingersoll

Grant, that was your revision. He's right that the check looks incorrect after 
the change, but I don't know whether the intent was to keep maxDf a percentage 
or to make it an absolute value, since another part of the change removes 
"percentage" from the key under which this value is transferred.
                
> SparseVectorsFromSequenceFiles will not create a proper TFIDF (bug in 
> TFIDFPartialVectorReducer)
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-973
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-973
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.6
>            Reporter: Viktor Gal
>            Assignee: Grant Ingersoll
>             Fix For: 0.7
>
>         Attachments: fix-TFIDFPartialVectorReducer.patch
>
>
> Although I'm using TFIDFConverter in a slightly different way, the same 
> problem will occur with SparseVectorsFromSequenceFiles when somebody wants 
> to create TFIDF vectors for their documents.
> Basically, if maxDFSigma is not set, then because of 
> SparseVectorsFromSequenceFiles.java:281
> long maxDF = maxDFPercent;
> maxDF will be 99. It is then passed to the TFIDFConverter.processTfIdf 
> function as an argument, where it is interpreted as "The max percentage of 
> vectors for the DF." Partial vectors are then created by 
> TFIDFPartialVectorReducer.class, and because of the check at 
> TFIDFPartialVectorReducer.java:81, if (df > maxDF) with maxDF = 99, the term 
> will be ignored.
> The problem here is that two different quantities are compared. The df value 
> is the number of documents that contain the given term, and it is not 
> normalized by the number of documents, i.e. it is not a percentage! See 
> TermDocumentCountReducer.java for details. maxDF, on the other hand, is 
> interpreted as a percentage, see above. Thus, as soon as the df count gets 
> higher than 99, or in the best case 100, meaning the given term occurs in 
> more than 99 or 100 different documents, it will be ignored... and this is 
> not what we would like it to do.
> I.e. there's a bug in TFIDFPartialVectorReducer.java at line 81.
> I've attached a possible fix for this problem.
> The bug was introduced in commit a61e5ff8 (git), or rev 1210994 in svn:
> @@ -78,7 +78,7 @@ public class TFIDFPartialVectorReducer extends
>          continue;
>        }
>        long df = dictionary.get(e.index());
> -      if (df * 100.0 / vectorCount > maxDfPercent) {
> +      if (maxDf > -1 && df > maxDf) {
>          continue;
>        }
>        if (df < minDf) {
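
To make the unit mismatch described above concrete, here is a minimal, 
standalone sketch of the two checks from the diff. The class and method names 
(MaxDfSketch, prunedByRawCompare, prunedByPercent) and the sample numbers are 
made up for illustration and are not part of Mahout; only the two conditions 
are taken from the "-" and "+" lines of the quoted hunk.

```java
// Illustrative sketch only: MaxDfSketch and its methods are hypothetical
// names, not Mahout classes. The two conditions mirror the quoted diff.
final class MaxDfSketch {

  private MaxDfSketch() {}

  /** "+" line of the diff: df is a raw document count compared directly with maxDf. */
  static boolean prunedByRawCompare(long df, long maxDf) {
    return maxDf > -1 && df > maxDf;
  }

  /** "-" line of the diff: df is first normalized to a percentage of all vectors. */
  static boolean prunedByPercent(long df, long vectorCount, long maxDfPercent) {
    return df * 100.0 / vectorCount > maxDfPercent;
  }

  public static void main(String[] args) {
    long vectorCount = 10_000; // assumed corpus size
    long df = 150;             // term occurs in 150 documents, i.e. 1.5%
    long maxDfPercent = 99;    // the value that ends up in maxDF per the report

    // Raw count vs. percentage: 150 > 99, so the term is dropped.
    System.out.println(prunedByRawCompare(df, maxDfPercent));           // true
    // Percentage vs. percentage: 1.5 > 99 is false, so the term is kept.
    System.out.println(prunedByPercent(df, vectorCount, maxDfPercent)); // false
  }
}
```

With these assumed numbers, a term that appears in only 1.5% of the documents 
is discarded by the new check but kept by the old one, which matches the 
behaviour reported in the description. Whether the right fix is to restore the 
normalization or to pass an absolute count depends on the intended meaning of 
maxDf, which is the open question in the comment above.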
