[ 
https://issues.apache.org/jira/browse/MAHOUT-785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088105#comment-13088105
 ] 

XiaoboGu commented on MAHOUT-785:
---------------------------------

The current implementations of the classifier algorithms in Mahout may require
different Java or Hadoop file formats, but from a command-line user's point of
view the requirements of these algorithms are the same: records with predictor
and target variables. Predictor variables may be of type numeric, word, or
text; the target variable may be binary or categorical with more than two
values. I think there are two approaches:
1. Don't touch the current implementations, but build tools to convert
universal input such as CSV into each algorithm's specific file format.
2. Revise the current implementations to consume the universal input directly.
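
To make the idea of a "universal input" more concrete, here is a minimal
sketch of the reading side that either approach would need. It is only an
illustration and not part of Mahout: the function name read_records, the type
constants, and the sample column names are all hypothetical. It assumes a CSV
file with a header row, a mapping from predictor column to one of the three
types mentioned above, and the name of the target column.

```python
import csv
import io

# Hypothetical type names, taken from the comment: numeric, word, text.
NUMERIC, WORD, TEXT = "numeric", "word", "text"


def read_records(source, predictors, target):
    """Yield (features, label) pairs from a CSV source with a header row.

    predictors maps a column name to its type; target names the label column.
    """
    reader = csv.DictReader(source)
    for row in reader:
        features = {}
        for name, kind in predictors.items():
            raw = row[name]
            if kind == NUMERIC:
                features[name] = float(raw)           # continuous value
            elif kind == WORD:
                features[name] = raw.strip()          # single categorical token
            elif kind == TEXT:
                features[name] = raw.lower().split()  # naive whitespace tokens
            else:
                raise ValueError(f"unknown predictor type: {kind}")
        yield features, row[target].strip()


# Made-up sample data, for illustration only.
sample = io.StringIO(
    "age,colour,comment,label\n"
    "34,red,Fast and cheap,yes\n"
    "51,blue,Too slow,no\n"
)
records = list(read_records(
    sample,
    predictors={"age": NUMERIC, "colour": WORD, "comment": TEXT},
    target="label",
))
```

Under approach 1, a per-algorithm converter would take these (features, label)
pairs and write them in whatever format that algorithm expects; under approach
2, each algorithm would consume them directly.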

> Universal input file format for classifier algorithms in Mahout
> ---------------------------------------------------------------
>
>                 Key: MAHOUT-785
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-785
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: XiaoboGu
>
> I think a universal input file format would be much more convenient for users,
> especially command-line users, and we should even consider using universal
> command-line options for the classification algorithms, such as options for
> target/predictor variables and their types. Then users can prepare their data
> once and build different models to find the best one. Currently we should
> consider the following:
> 1. SGD LogisticRegression
> 2. NaiveBayes
> 3. Bayes
> 4. Random Forest

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
