Can you open a JIRA issue, if you haven't already, and mark it for 0.6?

On Jan 23, 2012, at 10:49 AM, John Conwell wrote:

> Any time you ask for term frequency rather than tfidf weighting (-wt tf)
> and combine it with maxDFSigma rather than maxDFPercent (--maxDFSigma 3),
> the term vectors are not created (as shown in the code below).
>
> For example, the following command line will reproduce this situation:
>
> bin/mahout seq2sparse -i /Users/me/Documents/workspace/mahoutStuff/seq -o
> /Users/me/Documents/workspace/mahoutStuff/termvecs -wt tf --minSupport 2
> --minDF 2 --maxDFSigma 3 -seq
>
> Thanks,
> John
>
> On Sun, Jan 22, 2012 at 3:00 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>
>> What were the command/options you were passing in?
>>
>> On Jan 18, 2012, at 4:26 PM, John Conwell wrote:
>>
>>> I got latest from Trunk and built it, and when
>>> running SparseVectorsFromSequenceFiles I noticed what I think is a bug.
>>> The SparseVectorsFromSequenceFiles throws an exception when you want term
>>> frequency vectors output, with the maxDFSigma filtering option.
>>>
>>> Basically the if / else if section shown below will skip
>>> calling DictionaryVectorizer.createTermFrequencyVectors when you have that
>>> combination. The condition will create vectors when you want tf vectors
>>> without maxDFSigma filtering, or tfidf vectors with maxDFSigma filtering,
>>> but if you want tf vectors with maxDFSigma filtering, it totally skips over
>>> the call to createTermFrequencyVectors, and later on throws an exception
>>> because the vector input path doesn't exist.
>>>
>>> Is this a known issue? I'm assuming that's not the way it's supposed to
>>> work, correct?
>>> If so, I think some sort of validation should break the user out
>>> before they start processing anything
>>>
>>> //at line ~267 in trunk
>>>
>>> if (!processIdf && !shouldPrune) {
>>>
>>>   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
>>>       outputDir, tfDirName, conf, minSupport, maxNGramSize,
>>>       minLLRValue, norm, logNormalize, reduceTasks, chunkSize,
>>>       sequentialAccessOutput, namedVectors);
>>>
>>> } else if (processIdf) {
>>>
>>>   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
>>>       outputDir, tfDirName, conf, minSupport, maxNGramSize,
>>>       minLLRValue, -1.0f, false, reduceTasks, chunkSize,
>>>       sequentialAccessOutput, namedVectors);
>>>
>>> }
>>>
>>> --
>>>
>>> Thanks,
>>> John C
>>>
>>> --
>>>
>>> -- John C
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>>
>
> --
>
> Thanks,
> John C

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
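
To make the gap in the quoted if / else if easier to see, here is a minimal standalone sketch that models only its two boolean flags. It does not call Mahout. The class name BranchCoverageSketch and the methods createsTfVectors and validate are hypothetical names introduced for illustration, and the sketch assumes, as the thread implies, that processIdf is true for tfidf weighting and shouldPrune is true when --maxDFSigma is given. The validate method is one possible shape for the up-front check suggested in the thread, not the project's actual fix.

```java
public class BranchCoverageSketch {

    // Mirrors only the boolean structure of the quoted if / else if.
    // Returns true when some branch would call createTermFrequencyVectors.
    static boolean createsTfVectors(boolean processIdf, boolean shouldPrune) {
        if (!processIdf && !shouldPrune) {
            return true;   // tf weighting, no pruning
        } else if (processIdf) {
            return true;   // tfidf weighting, with or without pruning
        }
        return false;      // tf weighting with pruning: no branch runs
    }

    // Hypothetical early check: fail before any processing starts
    // if the flag combination would skip term vector creation.
    static void validate(boolean processIdf, boolean shouldPrune) {
        if (!createsTfVectors(processIdf, shouldPrune)) {
            throw new IllegalArgumentException(
                "tf weighting combined with pruning would skip term vector creation");
        }
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        // Walk all four flag combinations and report which ones create vectors.
        for (boolean processIdf : flags) {
            for (boolean shouldPrune : flags) {
                System.out.printf(
                    "processIdf=%b shouldPrune=%b -> vectors created: %b%n",
                    processIdf, shouldPrune,
                    createsTfVectors(processIdf, shouldPrune));
            }
        }
    }
}
```

Of the four combinations, only processIdf=false with shouldPrune=true falls through both branches, which matches the -wt tf --maxDFSigma 3 command line reported above.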