term vectors not created in SparseVectorsFromSequenceFiles using tf weighting
and maxDFSigma filtering
------------------------------------------------------------------------------------------------------
Key: MAHOUT-957
URL: https://issues.apache.org/jira/browse/MAHOUT-957
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.6
Reporter: John Conwell
Fix For: 0.6
SparseVectorsFromSequenceFiles throws an exception when you ask for term
frequency (tf) vectors as output together with the maxDFSigma filtering option.
The if / else if section shown below skips the call to
DictionaryVectorizer.createTermFrequencyVectors for that combination.
The condition creates vectors when you want tf vectors without maxDFSigma
filtering, or tfidf vectors with maxDFSigma filtering. But if you want tf
vectors with maxDFSigma filtering, neither branch matches, so
createTermFrequencyVectors is never called, and a later step throws an
exception because the vector input path doesn't exist.
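To make the gap easier to see, here is a minimal sketch that enumerates the four flag combinations against the same if / else if shape. It is not Mahout code: the class TfBranchCheck and the method createsTfVectors are hypothetical names for illustration. It assumes processIdf is true for tfidf weighting and shouldPrune is true when maxDFSigma is supplied, which is inferred from this report and not checked against the source.

```java
public class TfBranchCheck {

    // Mirrors only the shape of the reported if / else if. Returns true when
    // one of the two branches runs, i.e. when createTermFrequencyVectors
    // would be called. Hypothetical helper, not part of Mahout.
    static boolean createsTfVectors(boolean processIdf, boolean shouldPrune) {
        if (!processIdf && !shouldPrune) {
            return true;   // tf weighting, no pruning
        } else if (processIdf) {
            return true;   // tfidf weighting
        }
        return false;      // tf weighting with pruning: neither branch runs
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean processIdf : values) {
            for (boolean shouldPrune : values) {
                System.out.println("processIdf=" + processIdf
                        + " shouldPrune=" + shouldPrune
                        + " -> createsTfVectors="
                        + createsTfVectors(processIdf, shouldPrune));
            }
        }
    }
}
```

Only the combination processIdf=false, shouldPrune=true falls through both branches, which matches the tf plus maxDFSigma case described above.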
For example, the following command line reproduces the problem:
    bin/mahout seq2sparse -i /Users/me/Documents/workspace/mahoutStuff/seq \
        -o /Users/me/Documents/workspace/mahoutStuff/termvecs -wt tf \
        --minSupport 2 --minDF 2 --maxDFSigma 3 -seq
// The suspect code, at around line 267 of SparseVectorsFromSequenceFiles,
// guarding the calls to DictionaryVectorizer.createTermFrequencyVectors:
if (!processIdf && !shouldPrune) {
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
      outputDir, tfDirName, conf, minSupport, maxNGramSize,
      minLLRValue, norm, logNormalize, reduceTasks, chunkSize,
      sequentialAccessOutput, namedVectors);
} else if (processIdf) {
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
      outputDir, tfDirName, conf, minSupport, maxNGramSize,
      minLLRValue, -1.0f, false, reduceTasks, chunkSize,
      sequentialAccessOutput, namedVectors);
}
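
One possible way to close the gap, offered as a sketch and not as the committed patch, is to replace the if / else if with an exhaustive if / else so that every flag combination reaches a createTermFrequencyVectors call. The class TfBranchSketch, the holder TfArgs, and the method chooseTfArgs are hypothetical names for illustration. The sketch also assumes that the pruned tf case should defer normalization (-1.0f, false) the way the tfidf branch does. This report does not say which normalization arguments that case should receive.

```java
public class TfBranchSketch {

    // Hypothetical holder for the two arguments that differ between the
    // two createTermFrequencyVectors calls quoted in the report.
    static final class TfArgs {
        final float norm;
        final boolean logNormalize;

        TfArgs(float norm, boolean logNormalize) {
            this.norm = norm;
            this.logNormalize = logNormalize;
        }
    }

    // Exhaustive if / else: every combination of the two flags yields a
    // set of arguments, so no combination can skip the tf step.
    static TfArgs chooseTfArgs(boolean processIdf, boolean shouldPrune,
                               float norm, boolean logNormalize) {
        if (processIdf || shouldPrune) {
            // Assumption: a later stage (idf weighting or pruning) still
            // rewrites the vectors, so normalization is deferred here.
            return new TfArgs(-1.0f, false);
        }
        // Plain tf output with no pruning: normalize now, as requested.
        return new TfArgs(norm, logNormalize);
    }

    public static void main(String[] args) {
        // The failing case from the report: tf weighting with pruning.
        TfArgs chosen = chooseTfArgs(false, true, 2.0f, true);
        System.out.println("norm=" + chosen.norm
                + " logNormalize=" + chosen.logNormalize);
    }
}
```

With this shape, the tf plus maxDFSigma combination always produces tf vectors. Whether the pruned output still needs a separate normalization pass afterwards depends on the rest of the pipeline, which is outside what this report shows.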