Re: [jira] [Commented] (MAHOUT-1179) GSOC 2013: Refactor and improve the classification APIs

Ted Dunning Sun, 07 Jul 2013 10:36:10 -0700

Yes.  Both the classification and clustering APIs are in need of
homogenization.



On Sun, Jul 7, 2013 at 9:57 AM, Ángel Martínez González
<amart...@gmail.com> wrote:

> Hi all,
>
> I did not receive any feedback about this. I understand that now is a busy
> time with the work on version 0.8. Is there still interest in refactoring
> the classification APIs once 0.8 is released? Or should I just move on and
> look for some other way to contribute? I think the changes proposed in the
> document may not be very exciting, but some homogenization of Mahout's
> algorithms is necessary.
> If more detailed planning is needed, I could break the changes down into a
> list of tasks that have adequate granularity to be registered as JIRA
> issues. Would that help?
>
> Regards,
> Angel
>
>
>
> On 26/05/2013 22:21, Ángel Martínez González wrote:
>
>> Hi Ted and all,
>>
>> I've prepared a short document describing the current state of the
>> classification APIs and the proposed changes.
>>
>> https://docs.google.com/document/d/1Rqn-8aMgK6g9UZuKyD2fpMOY3FJ1xGzE7HUNPCaSd7I/edit?usp=sharing
>>
>> I'm eager to hear any feedback about it!
>>
>> The document does not include anything about task order planning. In fact,
>> I have a couple of questions about that. As we are talking about
>> refactoring, it would be quite natural to make the changes in many small
>> commits. Would that be possible, or will the work have to be packed into
>> a few big commits? Also, will a committer be able to review the work
>> periodically? And could the changes interfere with the next release?
>>
>> Thanks!
>> Angel
>>
>>
>> On 20/05/2013 10:17, Ángel Martínez González wrote:
>>
>>>
>>> Hi,
>>> I'm preparing a short text describing the current state of each
>>> algorithm and the changes needed (also covering the data preprocessing
>>> and result evaluation modules). That will answer question c).
>>> I'll try to answer the other two here:
>>>
>>> On 17/05/2013 9:37, Ted Dunning wrote:
>>>
>>>> Please lay out a plan before coding. The key questions will be
>>>>
>>>> a) can you serialize a model efficiently?
>>>>
>>> That should not be a problem. The scope of the proposed changes is
>>> limited to input and output data formats and does not include the
>>> classification models, so model serialization would work just as before.
>>> Regarding input and output data, the formats are similar to the ones used
>>> for clustering, and feature hashing will also be supported.
>>>
>>>> b) can you deal with the random forest and SGD models?
>>>>
>>> I've been looking into possible incompatibilities between classifiers and
>>> I've found the following difficulties related to input format:
>>>
>>> - The proposed input for trainers is SequenceFile<IntWritable,
>>> VectorWritable>, where the key would be an instance id and the target
>>> variable (class label) would be inside the vector. But if feature hashing
>>> is used, other features may collide with the target variable's position,
>>> making the target impossible to recover.
>>> - SGD and Naive Bayes need binarized categorical features, while Random
>>> Forests use categorical features encoded as integer levels. In Random
>>> Forests, any categorical feature can be used as target variable. In SGD and
>>> Naive Bayes, the target variable is provided to the classifier outside the
>>> vector. Binarized features are not suitable as target variables.
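The two encodings contrasted in the list above can be illustrated with a small sketch. It is plain Python rather than Mahout code, and the helper names integer_levels and binarize are hypothetical, chosen only for this illustration.

```python
def integer_levels(values):
    """Map each distinct category to an integer level, in first-seen order."""
    levels = {}
    for v in values:
        levels.setdefault(v, len(levels))
    return [levels[v] for v in values], levels


def binarize(values):
    """One-hot encode: one 0/1 column per distinct category."""
    codes, levels = integer_levels(values)
    width = len(levels)
    return [[1.0 if j == c else 0.0 for j in range(width)] for c in codes]


colours = ["red", "green", "red", "blue"]
codes, levels = integer_levels(colours)  # one integer column (Random Forest style)
onehot = binarize(colours)               # three 0/1 columns (SGD / Naive Bayes style)
```

In the integer-level form the class label is a single value that can be read back directly, while in the binarized form it is spread over several columns, which is one way to see why binarized features are awkward as target variables.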
>>>
>>> A possible solution to these two interrelated problems could be to treat
>>> binarized categorical features as numerical, while categorical variables
>>> are always encoded as integer levels and, in SGD and Naive Bayes, are
>>> used only as target variables (or ignored). The feature hashing framework
>>> would have to be modified so that categorical variables have their
>>> positions in the vector reserved and no conflicts involving them are
>>> possible. I think this is quite similar to the case with "a few special
>>> fields (categories and such) and then a bunch of encoded data" that you
>>> mentioned in a previous mail.
>>>
>>> How does it sound?
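The reserved-position idea described above might look roughly like the following sketch. It is plain Python under stated assumptions, not Mahout's feature hashing framework: the function name encode, the choice of slot 0 for the target, and the use of zlib.crc32 as the hash are all hypothetical.

```python
import zlib


def encode(features, target_level, size=16, reserved=1):
    """Hash named numeric features into slots [reserved, size).

    Slot 0 is reserved for the target's integer level, so no hashed
    feature can ever land on it. (Assumed layout, for illustration only.)
    """
    if not 0 < reserved < size:
        raise ValueError("reserved must be between 1 and size - 1")
    vec = [0.0] * size
    vec[0] = float(target_level)
    span = size - reserved
    for name, value in features.items():
        # Hashed features may still collide with each other, never with slot 0.
        slot = reserved + zlib.crc32(name.encode("utf-8")) % span
        vec[slot] += value
    return vec


row = encode({"age": 31.0, "colour=red": 1.0, "colour=blue": 0.0}, target_level=2)
```

Collisions among the hashed features are still possible in this sketch; only the target is protected, which is the property the proposal asks for.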
>>>
>>>> c) what are the real changes to the API needed?
>>>>
>>>> On Thu, May 16, 2013 at 10:51 AM, Angel Martinez Gonzalez (JIRA) <
>>>> j...@apache.org> wrote:
>>>>
>>>>> [ https://issues.apache.org/jira/browse/MAHOUT-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13659764#comment-13659764 ]
>>>>>
>>>>>
>>>>> Angel Martinez Gonzalez commented on MAHOUT-1179:
>>>>> -------------------------------------------------
>>>>>
>>>>> Hi again,
>>>>> With the goal of modifying all classifiers to use the formats proposed
>>>>> above, I've started to work with Naive Bayes. In particular, I've moved
>>>>> the code related to evaluation (summary statistics, confusion matrix)
>>>>> that was executed at the end of TestNaiveBayesDriver to a separate
>>>>> ClassifierEvaluationJob. The benefit of this is that
>>>>> ClassifierEvaluationJob should in the future be able to take input from
>>>>> any classifier tester.
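As a rough illustration of what a classifier-independent evaluation step computes, here is a sketch in plain Python. It is not the code from the commit and not Mahout's ClassifierEvaluationJob; the function names and the (actual, predicted) pair format are assumptions made for this example.

```python
from collections import Counter


def confusion_matrix(pairs, labels):
    """Rows are actual labels, columns are predicted labels."""
    counts = Counter(pairs)
    return [[counts[(actual, predicted)] for predicted in labels]
            for actual in labels]


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# (actual, predicted) pairs, as any classifier tester could emit them
pairs = [("spam", "spam"), ("spam", "ham"), ("ham", "ham"), ("ham", "ham")]
labels = ["ham", "spam"]
matrix = confusion_matrix(pairs, labels)
```

Because the input is only a list of label pairs, nothing in this step depends on which classifier produced the predictions.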
>>>>> The current state of the work may be reviewed here:
>>>>> https://github.com/amartgon/mahout/commit/519ae529e9932d1e1d0803d0731a7396daaa603b
>>>>>
>>>>> There are still modifications to be made to Naive Bayes, such as:
>>>>> - Modifying the document id format from Text to IntWritable.
>>>>> - Moving the "label index" out of TrainNaiveBayesJob.
>>>>> Should I create a JIRA issue and submit this part? Or go on with the work
>>>>> at least until everything related to Naive Bayes is complete? I'd like to
>>>>> have some feedback before going on, to get an idea of whether there is
>>>>> agreement/interest in this before investing a lot of time in possibly
>>>>> useless work.
>>>>>
>>>>>
>>>>>> GSOC 2013: Refactor and improve the classification APIs
>>>>>> -------------------------------------------------------
>>>>>>
>>>>>>                  Key: MAHOUT-1179
>>>>>>                  URL: https://issues.apache.org/jira/browse/MAHOUT-1179
>>>>>>              Project: Mahout
>>>>>>           Issue Type: New Feature
>>>>>>             Reporter: Dan Filimon
>>>>>>               Labels: gsoc2013, mentor
>>>>>>
>>>>>> [via Andy Twigg]
>>>>>> Improve and unify the Mahout classification API. Also related to the
>>>>>> refactoring of the clustering APIs (MAHOUT-1177).
>>>>>> The two APIs should be roughly the same, at least in
>>>>>> terms of input/output, so that pipelining etc. is easier (cf. the
>>>>>> scikit-learn clustering/classifier/regression API).
>>>>>> Currently Mahout supports:
>>>>>> - logistic regression
>>>>>> - Naive Bayes
>>>>>> - Random Forests
>>>>>>
>>>>> --
>>>>> This message is automatically generated by JIRA.
>>>>> If you think it was sent incorrectly, please contact your JIRA
>>>>> administrators.
>>>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>>>
>>>>>
>>>
>>
>
