Re: [jira] [Commented] (MAHOUT-1179) GSOC 2013: Refactor and improve the classification APIs

Ángel Martínez González Sun, 07 Jul 2013 09:59:07 -0700

Hi all,

I did not receive any feedback about this. I understand that now is a busy time with the work on version 0.8. Is there still interest in refactoring the classification APIs once 0.8 is released? Or should I just move on and look for some other way to contribute? I think the changes proposed in the document may not be very exciting, but some homogenization of Mahout's algorithms is necessary.

If more detailed planning is needed, I could break the changes down into a list of tasks that have adequate granularity to be registered as JIRA issues. Would that help?


Regards,
Angel



On 26/05/2013 22:21, Ángel Martínez González wrote:

Hi Ted and all,
I've prepared a short document describing the current state of the classification APIs and the proposed changes.
https://docs.google.com/document/d/1Rqn-8aMgK6g9UZuKyD2fpMOY3FJ1xGzE7HUNPCaSd7I/edit?usp=sharing
I'm eager to hear any feedback about it!
The document does not include anything about task order planning. In fact, I have a couple of questions about that: as we are talking about refactoring, it would be quite natural to make the changes in a lot of small commits. But would that be possible, or will the work have to be packed into a few big commits? Also, will some committer be able to periodically review the work? And could the changes interfere with the next version release?
Thanks!
Angel


On 20/05/2013 10:17, Ángel Martínez González wrote:
Hi,
I'm preparing a short text describing the current state of each algorithm and the needed changes (also including the data preprocessing and result evaluation modules). That will answer question c).
I'll try to answer the other two here:

On 17/05/2013 9:37, Ted Dunning wrote:
Please lay out a plan before coding. The key questions will be

a) can you serialize a model efficiently?
That should not be a problem. The scope of these proposed changes is only input and output data formats, not including the classification models, so that would work just as before. Regarding input and output data, the formats are similar to the ones used for clustering, and feature hashing will also be supported.
b) can you deal with the random forest and SGD models?
I've been looking into possible incompatibilities between classifiers and I've found the following difficulties related to input format:
- The proposed input for trainers is SequenceFile<IntWritable, VectorWritable>, where the key would be an instance id and the target variable (class label) would be inside the vector. But, if feature hashing is used, conflicts may happen with the target variable that make it impossible to recover.
- SGD and Naive Bayes need binarized categorical features, while Random Forests use categorical features encoded as integer levels. In Random Forests, any categorical feature can be used as target variable. In SGD and Naive Bayes, the target variable is provided to the classifier outside the vector. Binarized features are not suitable as target variables.
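To make the difference between the two encodings concrete, here is a minimal plain-Java sketch. It is not Mahout code: the class and method names (CategoricalEncodingSketch, integerLevel, binarize) are hypothetical and only illustrate the idea of an integer level versus a binarized (one-hot) representation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Mahout.
final class CategoricalEncodingSketch {

    // Integer-level encoding: the category is stored as a single index.
    // A value encoded this way can be read back directly as a class label.
    static int integerLevel(List<String> levels, String value) {
        int idx = levels.indexOf(value);
        if (idx < 0) {
            throw new IllegalArgumentException("unknown level: " + value);
        }
        return idx;
    }

    // Binarized encoding: one 0/1 slot per level. The category is spread
    // over several positions, so it is awkward to use as a target variable.
    static double[] binarize(List<String> levels, String value) {
        double[] out = new double[levels.size()];
        out[integerLevel(levels, value)] = 1.0;
        return out;
    }

    public static void main(String[] args) {
        List<String> colors = Arrays.asList("red", "green", "blue");
        System.out.println(integerLevel(colors, "green"));           // 1
        System.out.println(Arrays.toString(binarize(colors, "green"))); // [0.0, 1.0, 0.0]
    }
}
```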
Maybe a possible solution for the two interrelated problems could be: treating binarized categorical features as numerical, while categorical variables will always be encoded as integer levels and, in SGD and Naive Bayes, will only be used as target variables (or ignored). The feature hashing framework would have to be modified so that categorical variables have their positions in the vector reserved and no conflicts involving them are possible. I think this is quite similar to the case with "a few special fields (categories and such) and then a bunch of encoded data" you commented on in a previous mail.
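The reserved-position idea could look roughly like the following plain-Java sketch. It is only an illustration under stated assumptions, not the existing Mahout feature hashing API: the class name, the encode method, the layout (categorical levels first, hashed features after them) and the hash function are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; this is not the Mahout hashing framework.
final class ReservedSlotHashingSketch {

    // Assumed layout: slots [0, reserved) hold categorical variables as
    // integer levels; slots [reserved, size) receive the hashed features.
    static double[] encode(int[] categoricalLevels, String[] hashedFeatures, int size) {
        int reserved = categoricalLevels.length;
        if (size <= reserved) {
            throw new IllegalArgumentException("vector too small for reserved slots");
        }
        double[] v = new double[size];

        // Reserved slots are written directly and never hashed into, so the
        // target variable can always be recovered from its position.
        for (int i = 0; i < reserved; i++) {
            v[i] = categoricalLevels[i];
        }

        // Hashed features may still collide with each other, but only
        // within the non-reserved range.
        int hashedRange = size - reserved;
        for (String feature : hashedFeatures) {
            int h = Arrays.hashCode(feature.getBytes(StandardCharsets.UTF_8));
            int slot = reserved + Math.floorMod(h, hashedRange);
            v[slot] += 1.0;
        }
        return v;
    }

    public static void main(String[] args) {
        double[] v = encode(new int[] {2},
                new String[] {"word=mahout", "word=hadoop", "word=mahout"}, 8);
        System.out.println(Arrays.toString(v));
    }
}
```

Here slot 0 always carries the class label as an integer level, regardless of how the hashed features collide among themselves.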
How does it sound?
c) what are the real changes to the API needed?




On Thu, May 16, 2013 at 10:51 AM, Angel Martinez Gonzalez (JIRA) <
j...@apache.org> wrote:
     [
https://issues.apache.org/jira/browse/MAHOUT-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659764#comment-13659764]
Angel Martinez Gonzalez commented on MAHOUT-1179:
-------------------------------------------------

Hi again,
With the goal of modifying all classifiers to use the formats proposed
above, I've started to work with Naive Bayes. In particular, I've moved the
code related to evaluation (summary statistics, confusion matrix) that was
executed at the end of TestNaiveBayesDriver to a separate
ClassifierEvaluationJob. The benefit of this is that
ClassifierEvaluationJob should be able in the future to take input from any
classifier tester.
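As a rough illustration of why evaluation can be separated from the classifier, the following plain-Java sketch computes a confusion matrix and accuracy from nothing more than (actual, predicted) label indices. It is not the code in ClassifierEvaluationJob; the class and method names are hypothetical, and the assumption is simply that every tester can emit such pairs.

```java
// Hypothetical illustration only; not the actual ClassifierEvaluationJob.
final class EvaluationSketch {

    // Rows are actual labels, columns are predicted labels.
    static int[][] confusionMatrix(int[] actual, int[] predicted, int numLabels) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("actual and predicted differ in length");
        }
        int[][] matrix = new int[numLabels][numLabels];
        for (int i = 0; i < actual.length; i++) {
            matrix[actual[i]][predicted[i]]++;
        }
        return matrix;
    }

    // Accuracy is the diagonal mass divided by the total count.
    static double accuracy(int[][] matrix) {
        long correct = 0;
        long total = 0;
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
                if (i == j) {
                    correct += matrix[i][j];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        int[] actual = {0, 0, 1, 1, 2};
        int[] predicted = {0, 1, 1, 1, 2};
        int[][] m = confusionMatrix(actual, predicted, 3);
        System.out.println(java.util.Arrays.deepToString(m));
        System.out.println(accuracy(m));
    }
}
```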
The current state of the work may be reviewed here:
https://github.com/amartgon/mahout/commit/519ae529e9932d1e1d0803d0731a7396daaa603b
There are still modifications to be made on Naive Bayes, such as:
- Modifying document id format from Text to IntWritable.
- Moving the "label index" out of TrainNaiveBayesJob.
Should I create a JIRA issue and submit this part? Or go on with the work
at least till everything related to Naive Bayes is complete? I'd like to
have some feedback before going on, to have an idea of whether there is
agreement/interest in this before investing a lot of time into possibly
useless work.
GSOC 2013: Refactor and improve the classification APIs
-------------------------------------------------------

                 Key: MAHOUT-1179
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1179
             Project: Mahout
          Issue Type: New Feature
            Reporter: Dan Filimon
              Labels: gsoc2013, mentor

[via Andy Twigg]
Improve and unify the Mahout classification API. Also related to the
refactoring of the clustering APIs MAHOUT-1177.
The two APIs should be roughly the same, at least in
terms of input/output so that pipelining etc is easier. (cf
scikit-learn clustering/classifier/regression API)
Currently Mahout supports:
- logistic regression
- Naive Bayes
- Random Forests
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA
administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
