Yes. Both the classification and clustering APIs are in need of homogenization.

On Sun, Jul 7, 2013 at 9:57 AM, Ángel Martínez González <amart...@gmail.com> wrote:

> Hi all,
>
> I did not receive any feedback about this. I understand that now is a busy
> time with the work on version 0.8. Is there still interest in refactoring
> the classification APIs once 0.8 is released? Or should I just move on and
> look for some other way to contribute? I think the changes proposed in the
> document may not be very exciting, but some homogenization of Mahout's
> algorithms is necessary.
> If more detailed planning is needed, I could break the changes down into a
> list of tasks that have adequate granularity to be registered as JIRA
> issues. Would that help?
>
> Regards,
> Angel
>
> On 26/05/2013 22:21, Ángel Martínez González wrote:
>
>> Hi Ted and all,
>>
>> I've prepared a short document describing the current state of the
>> classification APIs and the proposed changes.
>>
>> https://docs.google.com/document/d/1Rqn-8aMgK6g9UZuKyD2fpMOY3FJ1xGzE7HUNPCaSd7I/edit?usp=sharing
>>
>> I'm eager to hear any feedback about it!
>>
>> The document does not include anything about task order planning. In fact
>> I have a couple of questions about that: As we are talking about
>> refactoring, it would be quite natural to do the changes in a lot of small
>> commits. But, would that be possible or will the work have to be packed in
>> a few big commits? Also, will some committer be able to periodically review
>> the work? And, could the changes interfere with the next version release?
>>
>> Thanks!
>> Angel
>>
>> On 20/05/2013 10:17, Ángel Martínez González wrote:
>>
>>> Hi,
>>> I'm preparing a short text describing the current state of each
>>> algorithm and the needed changes (also including the data preprocessing and
>>> result evaluation modules).
>>> That will answer question c)
>>> I'll try to answer the other two here:
>>>
>>> On 17/05/2013 9:37, Ted Dunning wrote:
>>>
>>>> Please lay out a plan before coding. The key questions will be
>>>>
>>>> a) can you serialize a model efficiently?
>>>>
>>> That should not be a problem. The scope of these proposed changes is
>>> only input and output data formats, not including the classification
>>> models, so that would work just as before. Regarding input and output data,
>>> the formats are similar to the ones used for clustering and also feature
>>> hashing will be supported.
>>>
>>>> b) can you deal with the random forest and SGD models?
>>>>
>>> I've been looking into possible incompatibilities between classifiers and
>>> I've found the following difficulties related to input format:
>>>
>>> - The proposed input for trainers is
>>>   SequenceFile<IntWritable, VectorWritable>, where the key would be an
>>>   instance id and the target variable (class label) would be inside the
>>>   vector. But, if feature hashing is used, conflicts may happen with the
>>>   target variable that make it impossible to recover.
>>> - SGD and Naive Bayes need binarized categorical features, while Random
>>>   Forests use categorical features encoded as integer levels. In Random
>>>   Forests, any categorical feature can be used as target variable. In SGD
>>>   and Naive Bayes, the target variable is provided to the classifier
>>>   outside the vector. Binarized features are not suitable as target
>>>   variables.
>>>
>>> Maybe a possible solution for the two interrelated problems could be:
>>> considering binarized categorical features as numerical, while categorical
>>> variables will always be encoded as integer levels and in SGD and Naive
>>> Bayes will only be used as target variables (or ignored). The feature
>>> hashing framework would have to be modified so that categorical variables
>>> have their positions in the vector reserved and no conflicts involving them
>>> are possible.
>>> I think this is quite similar to the case with "a few
>>> special fields (categories and such) and then a bunch of encoded data" you
>>> commented in a previous mail.
>>>
>>> How does it sound?
>>>
>>>> c) what are the real changes to the API needed?
>>>>
>>>> On Thu, May 16, 2013 at 10:51 AM, Angel Martinez Gonzalez (JIRA) <j...@apache.org> wrote:
>>>>
>>>>> [https://issues.apache.org/jira/browse/MAHOUT-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659764#comment-13659764]
>>>>>
>>>>> Angel Martinez Gonzalez commented on MAHOUT-1179:
>>>>> -------------------------------------------------
>>>>>
>>>>> Hi again,
>>>>> With the goal of modifying all classifiers to use the formats proposed
>>>>> above, I've started to work with Naive Bayes. In particular, I've moved
>>>>> the code related to evaluation (summary statistics, confusion matrix)
>>>>> that was executed at the end of TestNaiveBayesDriver to a separate
>>>>> ClassifierEvaluationJob. The benefit of this is that
>>>>> ClassifierEvaluationJob should be able in the future to take input from
>>>>> any classifier tester.
>>>>> The current state of the work may be reviewed here:
>>>>> https://github.com/amartgon/mahout/commit/519ae529e9932d1e1d0803d0731a7396daaa603b
>>>>>
>>>>> There are still modifications to be made on Naive Bayes, such as:
>>>>> - Modifying document id format from Text to IntWritable.
>>>>> - Moving the "label index" out of TrainNaiveBayesJob.
>>>>> Should I create a JIRA issue and submit this part? Or go on with the work
>>>>> at least till everything related to Naive Bayes is complete?
>>>>> I'd like to have some feedback before going on, to have an idea of
>>>>> whether there is agreement/interest in this before investing a lot of
>>>>> time into possibly useless work.
>>>>>
>>>>>> GSOC 2013: Refactor and improve the classification APIs
>>>>>> -------------------------------------------------------
>>>>>>
>>>>>> Key: MAHOUT-1179
>>>>>> URL: https://issues.apache.org/jira/browse/MAHOUT-1179
>>>>>> Project: Mahout
>>>>>> Issue Type: New Feature
>>>>>> Reporter: Dan Filimon
>>>>>> Labels: gsoc2013, mentor
>>>>>>
>>>>>> [via Andy Twigg]
>>>>>> Improve and unify the Mahout classification API. Also related to the
>>>>>> refactoring of the clustering APIs MAHOUT-1177.
>>>>>> The two APIs should be roughly the same, at least in
>>>>>> terms of input/output so that pipelining etc is easier. (cf
>>>>>> scikit-learn clustering/classifier/regression API)
>>>>>> Currently Mahout supports:
>>>>>> - logistic regression
>>>>>> - Naive Bayes
>>>>>> - Random Forests
>>>>>>
>>>>> --
>>>>> This message is automatically generated by JIRA.
>>>>> If you think it was sent incorrectly, please contact your JIRA
>>>>> administrators
>>>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
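
One way to picture the reserved-position idea from the 20/05/2013 message quoted above (categorical variables keep fixed slots as integer levels, and hashed features can only land in the remaining positions) is the Java sketch below. It is a hypothetical illustration and does not use Mahout's encoder classes: `ReservedSlotEncoder`, `hashedIndex` and `encode` are names made up here, a plain `double[]` stands in for the vector, and the hash function is an arbitrary choice.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch, not Mahout code: feature hashing with reserved leading slots. */
final class ReservedSlotEncoder {
    private final int reserved; // leading positions kept for categorical variables
    private final int size;     // total vector length

    ReservedSlotEncoder(int reserved, int size) {
        if (reserved < 0 || size <= reserved) {
            throw new IllegalArgumentException("need 0 <= reserved < size");
        }
        this.reserved = reserved;
        this.size = size;
    }

    /** Maps a feature name to a position in [reserved, size), never a reserved slot. */
    int hashedIndex(String name) {
        int h = Arrays.hashCode(name.getBytes(StandardCharsets.UTF_8));
        return reserved + Math.floorMod(h, size - reserved);
    }

    /**
     * Writes categorical levels (e.g. the class label) into the reserved slots
     * and adds hashed feature weights into the rest of the vector.
     */
    double[] encode(int[] categoricalLevels, Map<String, Double> hashedFeatures) {
        if (categoricalLevels.length != reserved) {
            throw new IllegalArgumentException("expected " + reserved + " categorical levels");
        }
        double[] v = new double[size];
        for (int i = 0; i < reserved; i++) {
            v[i] = categoricalLevels[i]; // stored as integer levels, untouched by hashing
        }
        for (Map.Entry<String, Double> e : hashedFeatures.entrySet()) {
            // Collisions can still occur, but only among hashed features.
            v[hashedIndex(e.getKey())] += e.getValue();
        }
        return v;
    }

    public static void main(String[] args) {
        ReservedSlotEncoder encoder = new ReservedSlotEncoder(1, 8);
        Map<String, Double> features = new LinkedHashMap<>();
        features.put("word:mahout", 2.0);
        features.put("word:hadoop", 1.0);
        // Slot 0 carries the target variable as integer level 3.
        System.out.println(Arrays.toString(encoder.encode(new int[] {3}, features)));
    }
}
```

Under these assumptions the target variable can always be read back from its reserved slot, which is the property the proposal asks of the feature hashing framework; how the real encoders would reserve positions is left open in the thread.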