[ 
https://issues.apache.org/jira/browse/MAHOUT-962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195156#comment-13195156
 ] 

John Conwell commented on MAHOUT-962:
-------------------------------------

Well, if the user wants to use sigma (stddev*sigma) to filter terms with high 
doc frequencies, yes, that should work.  But what if they want to explicitly 
filter by DF percent to get rid of high doc frequency terms?  Or, more 
importantly, what if they want to use minDF to filter low doc frequency terms?  
The sigma flag won't take care of those.
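
To make the request concrete, here is a minimal sketch of the behaviour being 
asked for: the output stays plain term-frequency vectors, but terms whose 
document frequency falls below minDF or above maxDFPercent are dropped.  This 
is an illustration only, not Mahout's implementation.  It uses plain Java maps 
instead of Mahout's vector classes, the class and method names are 
hypothetical, and treating maxDFPercent as an inclusive bound is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of Mahout.
public class TfDfFilterSketch {

    /** Counts, for each term, the number of documents that contain it. */
    public static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> tfVectors) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : tfVectors) {
            for (String term : doc.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /**
     * Returns copies of the tf vectors that keep only terms with
     * df >= minDf and df <= maxDfPercent percent of all documents.
     * The term frequencies themselves are left unchanged.
     */
    public static List<Map<String, Integer>> filterByDf(
            List<Map<String, Integer>> tfVectors, int minDf, int maxDfPercent) {
        Map<String, Integer> df = documentFrequencies(tfVectors);
        long numDocs = tfVectors.size();
        List<Map<String, Integer>> filtered = new ArrayList<>();
        for (Map<String, Integer> doc : tfVectors) {
            Map<String, Integer> kept = new HashMap<>();
            for (Map.Entry<String, Integer> e : doc.entrySet()) {
                int termDf = df.get(e.getKey());
                boolean frequentEnough = termDf >= minDf;
                // Integer comparison avoids floating-point rounding on the percentage.
                boolean notTooCommon = termDf * 100L <= (long) maxDfPercent * numDocs;
                if (frequentEnough && notTooCommon) {
                    kept.put(e.getKey(), e.getValue());
                }
            }
            filtered.add(kept);
        }
        return filtered;
    }
}
```

With four documents, minDf=2 and maxDfPercent=75, a term that appears in all 
four documents is removed as too common and a term that appears in only one 
is removed as too rare, while the surviving terms keep their original tf 
counts.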

I think I'm sounding picky, but as I'm going through using LDA (and CVB LDA) 
I'm playing with different tweaks of the input args in order to get "better" 
quality topic models.
                
> minDF and maxDFPercent filtering doesn't get applied when output weight is tf 
> in SparseVectorsFromSequenceFiles
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-962
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-962
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.6
>            Reporter: John Conwell
>             Fix For: 0.6
>
>
> This follows the same reasoning as the fix for MAHOUT-957.  The desired 
> output is term frequency vectors, but I want terms filtered by their min 
> and max DF values. This might be valid in LDA, where tf vectors are 
> desired for input, but filtering by maxDFPercent is also useful.
> Currently minDF and maxDFPercent are only used when calculating tfidf, and 
> the original tf vectors are not updated to reflect the term filtering.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
