I have a separate solution: strip the quoted text. Quoted text in the emails spams the term vectors, and plain TF-IDF alone is not enough to combat this. Lucene has a lot of tools besides TF-IDF.
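To make the idea concrete, here is a minimal sketch of quote stripping in Python. It is not the code from my patch, only an illustration. The function name `strip_quoted_text` and both patterns are assumptions: it treats any line starting with ">" as quoted, and any single line of the form "On ... wrote:" as an attribution line. Real mail clients often wrap the attribution across several lines, which this sketch does not handle.

```python
import re

# Assumed patterns (hypothetical, not taken from the patch):
# a quoted line starts with optional whitespace followed by ">".
QUOTED = re.compile(r"^\s*>")
# An attribution line looks like "On <date>, <sender> wrote:" on one line.
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def strip_quoted_text(body: str) -> str:
    """Return the message body without quoted lines and attribution lines."""
    kept = []
    for line in body.splitlines():
        if QUOTED.match(line) or ATTRIBUTION.match(line):
            continue  # drop text that belongs to an earlier message
        kept.append(line)
    # Trim blank lines left behind at the start or end of the message.
    return "\n".join(kept).strip()


if __name__ == "__main__":
    sample = (
        "I have a patch.\n"
        "On Wed, Jan 4, 2012 at 6:57 AM, Someone wrote:\n"
        "> Here's a start on this.\n"
        ">> Key: MAHOUT-939\n"
    )
    print(strip_quoted_text(sample))
```

Run on each message body before vectorizing, this keeps only the text the sender actually wrote, so terms from the quoted thread do not get counted again in every reply.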
I have a patch; I still need to open the JIRA for it. I also added more measurements to the confusion matrix. I want a good measurement of the performance on each producer and consumer, not just a global ratio. 'testnb' gives 80%, but one of the false boxes has a 1. This is bogus. (I'm using your complete corpus of commons vs. cocoon, classifying dev vs. user.)

On Wed, Jan 4, 2012 at 6:57 AM, Grant Ingersoll (Updated) (JIRA) <[email protected]> wrote:
>
> [
> https://issues.apache.org/jira/browse/MAHOUT-939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Grant Ingersoll updated MAHOUT-939:
> -----------------------------------
>
> Attachment: MAHOUT-939.patch
>
> Here's a start on this. Added some more construction options to the
> AdaptiveLogisticRegression class. Still testing what values to use in
> TrainASFEmail, but thought I would put this up for now.
>
>> ASF Email SGD Examples don't produce good results
>> -------------------------------------------------
>>
>> Key: MAHOUT-939
>> URL: https://issues.apache.org/jira/browse/MAHOUT-939
>> Project: Mahout
>> Issue Type: Bug
>> Affects Versions: 0.6
>> Reporter: Grant Ingersoll
>> Assignee: Grant Ingersoll
>> Labels: MAHOUT_INTRO_CONTRIBUTE
>> Fix For: 0.7
>>
>> Attachments: MAHOUT-939.patch
>>
>>
>> The SGD examples for the ASF email don't work all that well currently in
>> terms of quality. Also, need to determine how much memory is required for
>> vectors of cardinality size 100K.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>

--
Lance Norskog
[email protected]
