Re: term vectors not created in SparseVectorsFromSequenceFiles using tf weighting and maxDFSigma filtering

Grant Ingersoll Tue, 24 Jan 2012 15:46:43 -0800

Can you open a JIRA issue, if you haven't already, and mark it for 0.6?

On Jan 23, 2012, at 10:49 AM, John Conwell wrote:


> Any time you ask for term frequency rather than tfidf weighting (-wt tf)
> and also use maxDFSigma rather than maxDFPercent (--maxDFSigma 3), the
> term vectors are not created (as shown in the code below).
> 
> For example, the following cmd line will reproduce this situation:
> 
> bin/mahout seq2sparse -i /Users/me/Documents/workspace/mahoutStuff/seq -o
> /Users/me/Documents/workspace/mahoutStuff/termvecs -wt tf --minSupport 2
> --minDF 2 --maxDFSigma 3 -seq
> 
> Thanks,
> John
> 
> On Sun, Jan 22, 2012 at 3:00 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> 
>> What were the command/options you were passing in?
>> 
>> 
>> On Jan 18, 2012, at 4:26 PM, John Conwell wrote:
>> 
>>> I got the latest from trunk and built it, and when running
>>> SparseVectorsFromSequenceFiles I noticed what I think is a bug.
>>> SparseVectorsFromSequenceFiles throws an exception when you want term
>>> frequency vectors output with the maxDFSigma filtering option.
>>> 
>>> Basically, the if / else if section shown below skips the call to
>>> DictionaryVectorizer.createTermFrequencyVectors when you have that
>>> combination.  The condition creates vectors when you want tf vectors
>>> without maxDFSigma filtering, or tfidf vectors with maxDFSigma filtering,
>>> but if you want tf vectors with maxDFSigma filtering, it skips the call
>>> to createTermFrequencyVectors entirely, and later on throws an exception
>>> because the vector input path doesn't exist.
>>> 
>>> Is this a known issue?  I'm assuming that's not the way it's supposed to
>>> work, correct?  If so, I think some sort of validation should stop the
>>> user before they start processing anything.
>>> 
>>> //at line ~267 in trunk
>>> 
>>> if (!processIdf && !shouldPrune) {
>>>   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
>>>     outputDir, tfDirName, conf, minSupport, maxNGramSize,
>>>     minLLRValue, norm, logNormalize, reduceTasks, chunkSize,
>>>     sequentialAccessOutput, namedVectors);
>>> } else if (processIdf) {
>>>   DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
>>>     outputDir, tfDirName, conf, minSupport, maxNGramSize,
>>>     minLLRValue, -1.0f, false, reduceTasks, chunkSize,
>>>     sequentialAccessOutput, namedVectors);
>>> }
>>> 
>>> --
>>> 
>>> Thanks,
>>> John C
>> 
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> 
>> 
>> 
>> 
> 
> 
> -- 
> 
> Thanks,
> John C

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
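
For readers following the thread, the branch John quotes can be modelled on its own to see which option combination falls through. This is only a sketch: the class and method names are hypothetical, nothing here calls Mahout, and it assumes (as the thread implies) that -wt tf leaves processIdf false and that passing --maxDFSigma makes shouldPrune true. The validate method illustrates the fail-fast check John suggests; it is not taken from the Mahout source.

```java
// Hypothetical stand-alone model of the quoted branch; it does not use Mahout.
final class TfBranchSketch {

  // Mirrors the quoted if / else if: returns true when one of the two
  // branches would call createTermFrequencyVectors.
  static boolean quotedBranchCreatesTfVectors(boolean processIdf, boolean shouldPrune) {
    if (!processIdf && !shouldPrune) {
      return true;   // tf weighting, no pruning
    } else if (processIdf) {
      return true;   // tfidf weighting, with or without pruning
    }
    return false;    // tf weighting with pruning: neither branch runs
  }

  // Illustrative fail-fast check: reject the unsupported combination
  // before any processing starts.
  static void validate(boolean processIdf, boolean shouldPrune) {
    if (!quotedBranchCreatesTfVectors(processIdf, shouldPrune)) {
      throw new IllegalArgumentException(
          "tf weighting combined with pruning would not create term frequency vectors");
    }
  }

  public static void main(String[] args) {
    boolean[] flags = {false, true};
    for (boolean processIdf : flags) {
      for (boolean shouldPrune : flags) {
        System.out.printf("processIdf=%b shouldPrune=%b -> tf vectors created: %b%n",
            processIdf, shouldPrune,
            quotedBranchCreatesTfVectors(processIdf, shouldPrune));
      }
    }
  }
}
```

Of the four combinations, only processIdf=false with shouldPrune=true reaches neither branch, which matches the failure described above. An actual fix would presumably add a branch for that case rather than reject it; the check here only shows where an early error could be raised.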
