I pulled the latest from trunk and built it, and when running
SparseVectorsFromSequenceFiles I noticed what I think is a bug.
SparseVectorsFromSequenceFiles throws an exception when you ask for term
frequency (tf) vectors as output together with the maxDFSigma filtering option.

Basically, the if / else if section shown below skips the call to
DictionaryVectorizer.createTermFrequencyVectors when you have that
combination. The condition creates vectors when you want tf vectors
without maxDFSigma filtering, or tfidf vectors (with or without maxDFSigma
filtering), but if you want tf vectors with maxDFSigma filtering, neither
branch matches. The call to createTermFrequencyVectors never happens, and
later on an exception is thrown because the vector input path doesn't exist.

Is this a known issue?  I'm assuming that's not the way it's supposed to work,
correct?  If that combination is intentionally unsupported, I think some sort
of validation should stop the user before any processing starts.

// at line ~267 in trunk

if (!processIdf && !shouldPrune) {
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf,
      minSupport, maxNGramSize, minLLRValue, norm, logNormalize, reduceTasks, chunkSize,
      sequentialAccessOutput, namedVectors);
} else if (processIdf) {
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf,
      minSupport, maxNGramSize, minLLRValue, -1.0f, false, reduceTasks, chunkSize,
      sequentialAccessOutput, namedVectors);
}
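
To make the gap concrete, here is a minimal standalone sketch (not Mahout code) that copies only the boolean structure of the condition above and walks through all four flag combinations. I'm assuming shouldPrune is the flag that ends up true when maxDFSigma is given. The class TfBranchCheck and the methods createsTfVectors and validateOptions are hypothetical names made up for this illustration; validateOptions is just one possible shape for the up-front check I mentioned.

```java
public class TfBranchCheck {

    // Mirrors the if / else if structure from the snippet above.
    // Returns true when one of the two branches would be taken, i.e. when
    // createTermFrequencyVectors would actually be called.
    public static boolean createsTfVectors(boolean processIdf, boolean shouldPrune) {
        if (!processIdf && !shouldPrune) {
            return true;   // tf output, no pruning
        } else if (processIdf) {
            return true;   // tfidf output, with or without pruning
        }
        return false;      // tf output with pruning: no branch matches
    }

    // Hypothetical early validation: fail before any processing starts
    // instead of failing later on a missing vector input path.
    public static void validateOptions(boolean processIdf, boolean shouldPrune) {
        if (!createsTfVectors(processIdf, shouldPrune)) {
            throw new IllegalArgumentException(
                "tf vectors with pruning are not handled by the current condition");
        }
    }

    public static void main(String[] args) {
        boolean[] values = {false, true};
        for (boolean processIdf : values) {
            for (boolean shouldPrune : values) {
                System.out.println("processIdf=" + processIdf
                    + " shouldPrune=" + shouldPrune
                    + " createsTfVectors=" + createsTfVectors(processIdf, shouldPrune));
            }
        }
    }
}
```

Only processIdf=false with shouldPrune=true falls through both branches, which is the combination I hit. Whether the right fix is an early check like this or an extra branch that builds the tf vectors for the pruning case is for someone who knows the intended behavior to decide.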

-- 

Thanks,
John C
