[ https://issues.apache.org/jira/browse/MAHOUT-941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181916#comment-13181916 ]

Grant Ingersoll commented on MAHOUT-941:
----------------------------------------

Lance, can you separate out the stats piece into a different issue? I'll fold the quoted stuff in with MAHOUT-939, and then we can deal with the stats in other places.

> Strip quoted text from emails and add statistics to ConfusionMatrix
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-941
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-941
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Lance Norskog
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 0.6
>
>         Attachments: Bayes.zip, MAHOUT-941.patch, SGD.zip
>
>
> This patch does 2 things:
> # Adds a feature to org.apache.mahout.text.SequenceFilesFromMailArchives that
> removes quoted text from email bodies. This is important because it avoids
> spamming the term dictionary with repeated text, especially in long email
> threads.
> ** The feature defaults to true. Add "--quoted" to the command line to keep
> the quoted lines.
> # Adds some dubious overall measurements to the ConfusionMatrix.
> ** Kappa - a standard measurement.
> *** How different is this confusion matrix from random numbers? 0.0 is the
> same, 1.0 is completely different.
> *** I think this is an "unweighted" kappa.
> ** "Success" - a homegrown formula attempting to represent the correctness of
> each box. Probably bogus.
> *** The standard deviation shows the distance between the success of each
> producer->consumer box.
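
For reference, the "unweighted" kappa mentioned in the description can be sketched as below. This is a minimal illustration of the standard Cohen's kappa formula, (observed agreement - chance agreement) / (1 - chance agreement), computed from a square matrix of counts. It is not the code in MAHOUT-941.patch: the class name KappaSketch, the method name kappa, and the plain int[][] input are assumptions made for the example and are not part of Mahout's ConfusionMatrix API.

```java
public final class KappaSketch {

    private KappaSketch() {}

    /**
     * Unweighted Cohen's kappa for a square matrix of counts.
     * Rows are taken as actual labels and columns as predicted labels
     * (an assumption of this sketch; the formula is symmetric in the two).
     */
    public static double kappa(int[][] counts) {
        int n = counts.length;
        long total = 0;
        long diagonal = 0;
        long[] rowSums = new long[n];
        long[] colSums = new long[n];

        for (int i = 0; i < n; i++) {
            if (counts[i].length != n) {
                throw new IllegalArgumentException("matrix must be square");
            }
            for (int j = 0; j < n; j++) {
                total += counts[i][j];
                rowSums[i] += counts[i][j];
                colSums[j] += counts[i][j];
            }
            diagonal += counts[i][i];
        }
        if (total == 0) {
            throw new IllegalArgumentException("matrix has no samples");
        }

        // Observed agreement: fraction of samples on the diagonal.
        double observed = (double) diagonal / total;

        // Chance agreement: what the marginals alone would produce.
        double expected = 0.0;
        for (int i = 0; i < n; i++) {
            expected += ((double) rowSums[i] / total) * ((double) colSums[i] / total);
        }

        // Kappa is undefined when chance agreement is already total.
        if (expected == 1.0) {
            return Double.NaN;
        }
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        // Hypothetical two-class matrix, used only to exercise the method.
        int[][] counts = {{20, 5}, {10, 15}};
        System.out.println(kappa(counts));
    }
}
```

With the hypothetical matrix in main, observed agreement is 0.7 and chance agreement is 0.5, so kappa comes out at roughly 0.4. A matrix with all counts on the diagonal and more than one class gives 1.0, and a matrix that matches chance gives 0.0.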