[ https://issues.apache.org/jira/browse/MAHOUT-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182494#comment-13182494 ]
Grant Ingersoll commented on MAHOUT-939:
----------------------------------------

bq. Are these results with held-out data? Or are they what is reported by the cross fold learners?

Here are the steps we do currently:
# Convert mail archives to sequence files
# Encode the text using seq2encoded:
{code}$MAHOUT seq2encoded --input $MAIL_OUT --output $SEQ2SP --analyzerName org.apache.mahout.text.MailArchivesClusteringAnalyzer --cardinality 100000{code}
# Do some minor reworking of the vectors to get decent labels
# Split the vectors into a train and test set, randomizing as we go thanks to MAHOUT-904
{code}$MAHOUT split --input $SEQ2SPLABEL --mapRedOutputDir $MAPREDOUT --randomSelectionPct 20 --overwrite --sequenceFiles --method mapreduce{code}
# Run the training
# Run the test

All of this is in examples/bin/asf-email-examples.sh

The code for TrainASFModel more or less mirrors what is in the 20NewsGroup by using ALR. I even refactored the two a bit to share some common code. Hopefully I didn't mess things up there despite the tests passing and the results looking reasonable.

> ASF Email Classification Examples don't always produce good results
> -------------------------------------------------------------------
>
>                 Key: MAHOUT-939
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-939
>             Project: Mahout
>          Issue Type: Bug
>    Affects Versions: 0.6
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>              Labels: MAHOUT_INTRO_CONTRIBUTE
>             Fix For: 0.7
>
>         Attachments: 939.patch, MAHOUT-939.patch, MAHOUT-939.patch, MAHOUT-939.patch, strip_reject.patch
>
>
> The classification examples for the ASF email don't work all that well currently in terms of quality when it comes to more than a few labels. Also, need to determine how much memory is required for vectors of cardinality size 100K.

--
This message is automatically generated by JIRA.
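For readers unfamiliar with the held-out split in the steps above, here is a minimal Python sketch of the general idea behind a 20% randomized train/test split. It is not Mahout's implementation. The function name `split_holdout`, the fixed seed, and the placeholder `msg-N` items are assumptions made for illustration only.

```python
import random


def split_holdout(items, test_pct=20, seed=42):
    """Randomly send roughly test_pct percent of items to a held-out test set.

    Hypothetical helper for illustration; Mahout's `split` job is not shown here.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) so the split is repeatable
    train, test = [], []
    for item in items:
        # Each item lands in the test set independently with probability test_pct / 100.
        if rng.random() * 100 < test_pct:
            test.append(item)
        else:
            train.append(item)
    return train, test


# Placeholder data standing in for the labeled vectors.
vectors = [f"msg-{i}" for i in range(1000)]
train, test = split_holdout(vectors)
print(len(train), len(test))
```

Because each item is assigned independently, the test set holds approximately, not exactly, 20% of the items. The two sets never overlap, which is what makes the test results "held-out" in the sense the question asks about.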