[ 
https://issues.apache.org/jira/browse/MAHOUT-1179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13654472#comment-13654472
 ] 

Angel Martinez Gonzalez commented on MAHOUT-1179:
-------------------------------------------------

I have been going through the code related to the three considered algorithms: 
Logistic Regression, Naive Bayes, Random Forests. I will propose some 
preliminary ideas here, with the hope of triggering some discussion.

I have read in several messages in the mail archive that the main need is for 
input and output data formats that are consistent across classifiers and 
also with clustering.  

Those formats could be something like the following: 
- Input format for trainers: SequenceFile<IntWritable, VectorWritable>  (just 
as in clustering) plus an optional metadata text file (to mark numerical, 
categorical and ignored columns, like the one used in Random Forests). 
- Output format for trainers: this would be more specific to each classifier. 
- Input format for testers: test data (in the same format as train data) and 
the model. 
- Output format for testers:  SequenceFile<IntWritable, VectorWritable>, where 
IntWritable is the instance key and the vector contains the label weights for 
the instance.  
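
To make the tester side of the list above more concrete, here is a minimal 
sketch in plain Java. It is only an illustration under stated assumptions: 
Map<Integer, double[]> stands in for SequenceFile<IntWritable, VectorWritable>, 
and the names UnifiedFormatSketch, Tester, LinearModel, LINEAR_TESTER and 
bestLabel are hypothetical; none of them exist in Mahout. The toy linear model 
is just a placeholder for whatever classifier-specific model a trainer would 
output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Mahout API: plain maps stand in for
// SequenceFile<IntWritable, VectorWritable> (instance key -> vector).
final class UnifiedFormatSketch {

  /** Assumed tester contract: (test data, model) -> label weights per instance key. */
  interface Tester<M> {
    Map<Integer, double[]> test(Map<Integer, double[]> instances, M model);
  }

  /** Toy model with one weight row per label; purely illustrative. */
  static final class LinearModel {
    final double[][] weights;

    LinearModel(double[][] weights) {
      this.weights = weights;
    }
  }

  /** Scores each instance as a dot product with every label's weight row. */
  static final Tester<LinearModel> LINEAR_TESTER = (instances, model) -> {
    Map<Integer, double[]> out = new LinkedHashMap<>();
    for (Map.Entry<Integer, double[]> e : instances.entrySet()) {
      double[] x = e.getValue();
      double[] scores = new double[model.weights.length];
      for (int label = 0; label < scores.length; label++) {
        for (int j = 0; j < x.length; j++) {
          scores[label] += model.weights[label][j] * x[j];
        }
      }
      // Same key as the input instance, vector of label weights as the value.
      out.put(e.getKey(), scores);
    }
    return out;
  };

  /** Evaluation-side helper: index of the largest label weight. */
  static int bestLabel(double[] labelWeights) {
    int best = 0;
    for (int i = 1; i < labelWeights.length; i++) {
      if (labelWeights[i] > labelWeights[best]) {
        best = i;
      }
    }
    return best;
  }
}
```

The point of the sketch is that every tester, whatever the classifier, would 
return the same shape (instance key to label weights), so an evaluation tool 
only needs something like bestLabel plus a dictionary to report results.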

A dictionary file mapping ints to terms would only be used by the preprocessing 
and evaluation tools. 

I could start modifying the drivers of one of the three classifiers to 
implement this and see what difficulties I find.

Does it all make sense? Please comment!

                
> GSOC 2013: Refactor and improve the classification APIs
> -------------------------------------------------------
>
>                 Key: MAHOUT-1179
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1179
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Dan Filimon
>              Labels: gsoc2013, mentor
>
> [via Andy Twigg]
> Improve and unify the Mahout classification API. Also related to the 
> refactoring of the clustering APIs MAHOUT-1177.
> The two APIs should be roughly the same, at least in
> terms of input/output so that pipelining etc. is easier. (cf.
> scikit-learn clustering/classifier/regression API)
> Currently Mahout supports:
> - logistic regression
> - Naive Bayes
> - Random Forests

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira