Please lay out a plan before coding.

The key questions will be

a) can you serialize a model efficiently?

b) can you deal with the random forest and SGD models?

c) what are the real changes to the API needed?


On Thu, May 16, 2013 at 10:51 AM, Angel Martinez Gonzalez (JIRA) <j...@apache.org> wrote:

>
> [ https://issues.apache.org/jira/browse/MAHOUT-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659764#comment-13659764 ]
>
> Angel Martinez Gonzalez commented on MAHOUT-1179:
> -------------------------------------------------
>
> Hi again,
> With the goal of modifying all classifiers to use the formats proposed
> above, I've started to work with Naive Bayes. In particular, I've moved the
> code related to evaluation (summary statistics, confusion matrix) that was
> executed at the end of TestNaiveBayesDriver to a separate
> ClassifierEvaluationJob. The benefit of this is that
> ClassifierEvaluationJob should be able in the future to take input from any
> classifier tester.
> The current state of the work may be reviewed here:
> https://github.com/amartgon/mahout/commit/519ae529e9932d1e1d0803d0731a7396daaa603b
>
> There are still modifications to be made on Naive Bayes, such as:
> -Modifying document id format from Text to IntWritable.
> -Moving the "label index" out of TrainNaiveBayesJob.
> Should I create a JIRA issue and submit this part? Or go on with the work
> at least till everything related to Naive Bayes is complete? I'd like to
> have some feedback before going on, to have an idea of whether there is
> agreement/interest in this before investing a lot of time into possibly
> useless work.
>
>
> > GSOC 2013: Refactor and improve the classification APIs
> > -------------------------------------------------------
> >
> > Key: MAHOUT-1179
> > URL: https://issues.apache.org/jira/browse/MAHOUT-1179
> > Project: Mahout
> > Issue Type: New Feature
> > Reporter: Dan Filimon
> > Labels: gsoc2013, mentor
> >
> > [via Andy Twigg]
> > Improve and unify the Mahout classification API.
> > Also related to the refactoring of the clustering APIs MAHOUT-1177.
> > The two APIs should be roughly the same, at least in
> > terms of input/output so that pipelining etc is easier. (cf
> > scikit-learn clustering/classifier/regression API)
> > Currently Mahout supports:
> > - logistic regression
> > - Naive Bayes
> > - Random Forests
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>