[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051194#comment-14051194 ]
Sean Owen commented on SPARK-2341:
----------------------------------

[~mengxr] For regression, rather than further overloading "multiclass" to mean "regression", how about modifying the argument to take one of three values (as an enum, string, etc.) to distinguish the three modes? The current method would stay, but be deprecated.

multiclass=false is for binary classification. libsvm uses "0" and "1" (or any ints) for binary classification. But this parses the label as a real number and rounds it to 0/1. (Is that what libsvm does?) Maybe it's a convenient semantic overload when you want to transform a continuous value to a 0/1 indicator, but is that implied by the libsvm format, or is it just a transformation the caller should make?

multiclass=true treats libsvm integer labels as doubles, but not as continuous values. It seems like inviting more confusion to have this mode also double as the mode for parsing continuous labels as continuous values.

libsvm is widely used but it's old; I don't think its file format from long ago should necessarily inform API design now. There are other serializations besides libsvm (plain CSV, for instance) and other algorithms (random decision forests).

You can make utilities to convert classes to numbers for the benefit of the implementation on the front, and I'll have to in order to use this. Maybe we can start there -- at least if a utility is in the project, people aren't all reinventing this in order to use an SVM with actual labels. The caller then carries around a dictionary to do the reverse mapping. The model seems like the place to hold that info, if in fact it internally converts classes to some other representation. Maybe the need would be clearer once the utility is created.

As you say, I'm concerned that the API is already locked down early and that some of these changes are going to be viewed as infeasible just for that reason.
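To make the two suggestions above concrete, here is a rough sketch in Python rather than the project's Scala. Everything in it is hypothetical: LabelMode, parse_label and LabelIndexer are illustrative names, not MLlib API. The binary rule (positive values become 1.0, everything else 0.0) and the rejection of non-integer multiclass labels are assumptions made for the example, not a statement of what loadLibSVMFile or libsvm does.

```python
from enum import Enum


class LabelMode(Enum):
    """Hypothetical three-valued replacement for the boolean `multiclass` flag."""
    BINARY = "binary"
    MULTICLASS = "multiclass"
    REGRESSION = "regression"


def parse_label(raw: str, mode: LabelMode) -> float:
    """Parse one libsvm label string according to the requested mode."""
    value = float(raw)
    if mode is LabelMode.BINARY:
        # Assumed rule: any positive value becomes 1.0, everything else 0.0.
        return 1.0 if value > 0 else 0.0
    if mode is LabelMode.MULTICLASS:
        # Class ids are stored as doubles but must be whole numbers
        # (a design choice of this sketch, to keep the modes distinct).
        if not value.is_integer():
            raise ValueError(f"non-integer class label: {raw!r}")
        return value
    # REGRESSION: keep the continuous value unchanged.
    return value


class LabelIndexer:
    """Hypothetical utility mapping arbitrary class labels to numeric ids and back."""

    def __init__(self):
        self._to_index = {}   # label -> numeric id
        self._to_label = []   # numeric id (as list position) -> label

    def index(self, label) -> float:
        # Assign the next free id the first time a label is seen.
        if label not in self._to_index:
            self._to_index[label] = float(len(self._to_label))
            self._to_label.append(label)
        return self._to_index[label]

    def label(self, index: float):
        # Reverse mapping, e.g. to turn a model's prediction back into a label.
        return self._to_label[int(index)]
```

With something like this, a caller would index string labels before training and call label() on predictions afterwards. If the model held the indexer itself, the caller would not need to carry the dictionary around.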
> loadLibSVMFile doesn't handle regression datasets
> -------------------------------------------------
>
> Key: SPARK-2341
> URL: https://issues.apache.org/jira/browse/SPARK-2341
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.0.0
> Reporter: Eustache
> Priority: Minor
> Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a
> BinaryLabelParser. What happens then is that the file is loaded but in
> multiclass mode : each target value is interpreted as a class name !
> The fix would be to write a RegressionLabelParser which converts target
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html

--
This message was sent by Atlassian JIRA
(v6.2#6252)