[ https://issues.apache.org/jira/browse/MAHOUT-957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll reassigned MAHOUT-957:
--------------------------------------
Assignee: Grant Ingersoll
> term vectors not created in SparseVectorsFromSequenceFiles using tf weighting
> and maxDFSigma filtering
> ------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-957
> URL: https://issues.apache.org/jira/browse/MAHOUT-957
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.6
> Reporter: John Conwell
> Assignee: Grant Ingersoll
> Fix For: 0.6
>
>
> SparseVectorsFromSequenceFiles throws an exception when you request term
> frequency (tf) vectors as output together with the maxDFSigma filtering option.
> The if / else if section shown below skips the call to
> DictionaryVectorizer.createTermFrequencyVectors for that combination.
> The condition creates vectors when you want tf vectors without maxDFSigma
> filtering, or tfidf vectors (with or without maxDFSigma filtering). If you
> want tf vectors with maxDFSigma filtering, neither branch runs, so
> createTermFrequencyVectors is never called, and a later step throws an
> exception because the vector input path doesn't exist.
> For example, the following command line reproduces this situation:
> bin/mahout seq2sparse -i /Users/me/Documents/workspace/mahoutStuff/seq -o
> /Users/me/Documents/workspace/mahoutStuff/termvecs -wt tf --minSupport 2
> --minDF 2 --maxDFSigma 3 -seq
> // the suspect code, at line ~267 of SparseVectorsFromSequenceFiles, around
> // the calls to DictionaryVectorizer.createTermFrequencyVectors
> if (!processIdf && !shouldPrune) {
>   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
>       outputDir, tfDirName, conf, minSupport, maxNGramSize,
>       minLLRValue, norm, logNormalize, reduceTasks, chunkSize,
>       sequentialAccessOutput, namedVectors);
> } else if (processIdf) {
>   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
>       outputDir, tfDirName, conf, minSupport, maxNGramSize,
>       minLLRValue, -1.0f, false, reduceTasks, chunkSize,
>       sequentialAccessOutput, namedVectors);
> }
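
To make the gap in the quoted condition concrete, here is a minimal sketch that enumerates all four combinations of the two flags. It assumes only processIdf and shouldPrune decide which branch runs, as in the snippet above. The class and method names are hypothetical and not part of Mahout. The second method is one possible restructuring for illustration, not the patch that was actually applied; in particular, it leaves open which normalization arguments the tf-with-pruning case should pass.

```java
public class BranchCoverageSketch {

    /**
     * Mirrors the reported if / else-if structure. Returns true when some
     * branch would call createTermFrequencyVectors, false when none does.
     */
    static boolean createsTfVectorsAsReported(boolean processIdf, boolean shouldPrune) {
        if (!processIdf && !shouldPrune) {
            return true;   // tf weighting, no maxDFSigma pruning
        } else if (processIdf) {
            return true;   // tfidf weighting, with or without pruning
        }
        return false;      // tf weighting with pruning: neither branch runs
    }

    /**
     * Hypothetical restructuring (an assumption, not the real fix): a plain
     * else guarantees that every combination reaches one of the two calls.
     */
    static boolean createsTfVectorsWithPlainElse(boolean processIdf, boolean shouldPrune) {
        if (!processIdf && !shouldPrune) {
            return true;   // same first branch as the reported code
        } else {
            return true;   // covers tfidf and the previously skipped tf + pruning case
        }
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        for (boolean processIdf : flags) {
            for (boolean shouldPrune : flags) {
                System.out.printf(
                        "processIdf=%b shouldPrune=%b reported=%b plainElse=%b%n",
                        processIdf, shouldPrune,
                        createsTfVectorsAsReported(processIdf, shouldPrune),
                        createsTfVectorsWithPlainElse(processIdf, shouldPrune));
            }
        }
    }
}
```

The only combination for which the reported structure returns false is processIdf=false with shouldPrune=true, which corresponds to -wt tf together with --maxDFSigma in the command line above.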