[ https://issues.apache.org/jira/browse/MAHOUT-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viktor Gal updated MAHOUT-973:
------------------------------

    Affects Version/s: 0.6

> SparseVectorsFromSequenceFiles will not create a proper TFIDF (bug in TFIDFPartialVectorReducer)
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-973
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-973
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.6, 0.7
>            Reporter: Viktor Gal
>         Attachments: fix-TFIDFPartialVectorReducer.patch
>
>
> Although I'm using TFIDFConverter in a slightly different way, the same
> problem occurs with SparseVectorsFromSequenceFiles when somebody wants to
> create TFIDF vectors for their documents.
> Basically, if maxDFSigma is not set, then because of
> SparseVectorsFromSequenceFiles.java:281
>     long maxDF = maxDFPercent;
> maxDF will be 99, which is then passed to the TFIDFConverter.processTfIdf
> function as an argument, where it is interpreted as "The max percentage of
> vectors for the DF." Partial vectors are then created with
> TFIDFPartialVectorReducer.class, and because of
> TFIDFPartialVectorReducer.java:81, with maxDF = 99,
>     if (df > maxDF)
> the term will be ignored.
> The problem here is that two different quantities are compared. The df value
> is the number of documents that contain the given term, and it is not
> normalized by the number of documents, i.e. it is not a percentage! See
> TermDocumentCountReducer.java for details. maxDF, on the other hand, is
> interpreted as a percentage, see above. Thus, as soon as the df count gets
> higher than 99, or in the best case 100, meaning the given term occurs in
> more than 99 or 100 different documents, the term is ignored, and this is
> not what we would like it to do.
> I.e. there's a bug in TFIDFPartialVectorReducer.java at line 81.
> I've attached a possible fix for this problem.
> The bug was introduced in commit a61e5ff8 (git), or rev 1210994 in svn:
> @@ -78,7 +78,7 @@ public class TFIDFPartialVectorReducer extends
>            continue;
>          }
>          long df = dictionary.get(e.index());
> -        if (df * 100.0 / vectorCount > maxDfPercent) {
> +        if (maxDf > -1 && df > maxDf) {
>            continue;
>          }
>          if (df < minDf) {
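
To make the mismatch described in the report concrete, here is a minimal standalone sketch in Java. It is not Mahout code: the class name MaxDfCheck, both method names, and the corpus size of 10,000 documents are hypothetical, chosen only for illustration. The two conditions are copied from the diff quoted in the report, and the sketch assumes maxDF arrives as 99, as the report describes.

```java
// Standalone illustration; MaxDfCheck and its methods are hypothetical names,
// not part of Mahout. Only the two conditions come from the quoted diff.
final class MaxDfCheck {

    // Condition from the "+" line of the diff: a raw document count (df)
    // is compared against a value the caller meant as a percentage.
    static boolean skippedByRawCount(long df, long maxDf) {
        return maxDf > -1 && df > maxDf;
    }

    // Condition from the "-" line of the diff: df is first normalized by
    // the number of vectors, so both sides are percentages.
    static boolean skippedByPercentage(long df, long vectorCount, long maxDfPercent) {
        return df * 100.0 / vectorCount > maxDfPercent;
    }

    public static void main(String[] args) {
        long vectorCount = 10_000; // assumed corpus size, for illustration
        long df = 150;             // term appears in 150 documents (1.5%)
        long maxDf = 99;           // value passed down, per the report

        // Raw-count check drops the term although it is in only 1.5% of documents.
        System.out.println("raw count check skips term:  " + skippedByRawCount(df, maxDf));
        // Percentage check keeps it, since 1.5 is not greater than 99.
        System.out.println("percentage check skips term: " + skippedByPercentage(df, vectorCount, maxDf));
    }
}
```

With these assumed numbers, a term that appears in only 1.5% of the documents is dropped by the raw-count comparison and kept by the normalized one, which is the behaviour the report complains about.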