[ https://issues.apache.org/jira/browse/MAHOUT-973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Viktor Gal updated MAHOUT-973:
------------------------------
Affects Version/s: 0.6
> SparseVectorsFromSequenceFiles will not create a proper TFIDF (bug in
> TFIDFPartialVectorReducer)
> ------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-973
> URL: https://issues.apache.org/jira/browse/MAHOUT-973
> Project: Mahout
> Issue Type: Bug
> Affects Versions: 0.6, 0.7
> Reporter: Viktor Gal
> Attachments: fix-TFIDFPartialVectorReducer.patch
>
>
> Although I'm using TFIDFConverter in a slightly different way, the same
> problem occurs with SparseVectorsFromSequenceFiles whenever somebody wants
> to create TFIDF vectors for their documents.
> Basically, if maxDFSigma is not set, then because of
> SparseVectorsFromSequenceFiles.java:281
> long maxDF = maxDFPercent;
> maxDF will be 99, which is then passed as an argument to
> TFIDFConverter.processTfIdf, where it is interpreted as "The max percentage
> of vectors for the DF." Partial vectors are created with
> TFIDFPartialVectorReducer.class, and because of
> TFIDFPartialVectorReducer.java:81, with maxDF = 99, the term is ignored
> whenever df > maxDF.
> The problem here is that two different quantities are compared. The df value
> is the number of documents that contain the given term, and it is not
> normalized by the number of documents, i.e. it is not a percentage (see
> TermDocumentCountReducer.java for details), while maxDF is interpreted as a
> percentage, as described above. Thus, as soon as the df count exceeds 99
> (or, in the best case, 100), meaning the given term occurs in more than 99
> or 100 different documents, the term is ignored, which is not the intended
> behavior.
> I.e. there's a bug in TFIDFPartialVectorReducer.java at line 81.
> I've attached a possible fix for this problem.
> The bug was introduced in commit a61e5ff8 (git), or rev 1210994 in svn:
> @@ -78,7 +78,7 @@ public class TFIDFPartialVectorReducer extends
> continue;
> }
> long df = dictionary.get(e.index());
> - if (df * 100.0 / vectorCount > maxDfPercent) {
> + if (maxDf > -1 && df > maxDf) {
> continue;
> }
> if (df < minDf) {
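
To illustrate the mismatch described in the report, here is a minimal,
self-contained sketch that contrasts the two comparisons shown in the diff.
It is not Mahout code: the class name DfFilterSketch, both method names, and
the corpus size of 10,000 documents are hypothetical and chosen only for
illustration. Only the two conditions themselves are taken from the diff.

```java
public class DfFilterSketch {

    // Condition from the "+" line of the diff: df is a raw document count,
    // but the threshold passed in may actually hold a percentage such as 99.
    public static boolean droppedByRawCompare(long df, long maxDf) {
        return maxDf > -1 && df > maxDf;
    }

    // Condition from the "-" line of the diff: df is first normalized by the
    // number of documents, so both sides of the comparison are percentages.
    public static boolean droppedByPercentCompare(long df, long vectorCount,
                                                  long maxDfPercent) {
        return df * 100.0 / vectorCount > maxDfPercent;
    }

    public static void main(String[] args) {
        long vectorCount = 10_000; // hypothetical corpus size (assumption)
        long df = 150;             // term occurs in 150 documents, i.e. 1.5%
        long maxDfPercent = 99;    // value of maxDF reported in the issue

        // Raw count vs. percentage: 150 > 99, so the term is dropped.
        System.out.println(droppedByRawCompare(df, maxDfPercent));     // true

        // Percentage vs. percentage: 1.5 > 99 is false, so the term is kept.
        System.out.println(droppedByPercentCompare(df, vectorCount,
                                                   maxDfPercent));     // false
    }
}
```

With these assumed numbers, a term that occurs in only 1.5% of the documents
is discarded by the raw comparison and kept by the normalized one, which
matches the behavior described in the report.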