[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

Sean Owen (JIRA) Thu, 03 Jul 2014 01:44:27 -0700

    [ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051194#comment-14051194 ]


Sean Owen commented on SPARK-2341:
----------------------------------

[~mengxr] For regression, rather than further overloading "multiclass" to mean
"regression", how about changing the argument to take one of three values (as
an enum, string, etc.) to distinguish the three modes? The current method would
stay, but be deprecated.
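
A rough sketch of what a three-valued argument could look like, written in
plain Python for brevity. `LabelMode`, `parse_label`, and `parse_line` are
hypothetical names for illustration, not existing MLlib API. The rule that
binary mode maps positive labels to 1.0 and everything else to 0.0, and that
both classification modes reject non-integer labels, is an assumption of this
sketch rather than a description of the current parsers.

```python
from enum import Enum


class LabelMode(Enum):
    """Hypothetical replacement for the boolean `multiclass` flag."""
    BINARY = "binary"
    MULTICLASS = "multiclass"
    REGRESSION = "regression"


def parse_label(token, mode):
    """Parse one libsvm label token according to the requested mode."""
    value = float(token)
    if mode is LabelMode.REGRESSION:
        # Continuous target: keep the value exactly as written.
        return value
    # Assumption: classification labels must be integers, so a continuous
    # value is reported instead of being silently rounded.
    if not value.is_integer():
        raise ValueError("non-integer class label: %r" % token)
    if mode is LabelMode.BINARY:
        # Assumption: positive labels map to 1.0, all others to 0.0.
        return 1.0 if value > 0 else 0.0
    return value


def parse_line(line, mode):
    """Parse 'label index:value index:value ...' into (label, features)."""
    parts = line.split()
    label = parse_label(parts[0], mode)
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features
```

With this shape, a caller who wants to turn a continuous value into a 0/1
indicator applies that transformation explicitly instead of relying on the
parsing mode to do it.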

multiclass=false is for binary classification. libsvm uses "0" and "1" (or any
ints) for binary classification, but this parses the label as a real number and
rounds it to 0/1. (Is that what libsvm does?) Maybe it's a convenient semantic
overload when you want to transform a continuous value into a 0/1 indicator,
but is that implied by the libsvm format, or is it just a transformation the
caller should make? multiclass=true treats libsvm integer labels as doubles,
but not as continuous values. It seems like inviting more confusion to have
this mode also double as the mode for parsing continuous labels as continuous
values.

libsvm is widely used but it's old; I don't think its file format from long
ago should necessarily inform API design now. There are other serializations
besides libsvm (plain CSV, for instance) and other algorithms (random decision
forests).

You can make utilities that convert classes to numbers up front for the benefit
of the implementation, and I'll have to in order to use this. Maybe we can
start there -- at least if a utility is in the project, people aren't all
reinventing this in order to use an SVM with actual labels. The caller then
carries around a dictionary to do the reverse mapping. The model seems like the
place to hold that info, if in fact it internally converts classes to some
other representation. Maybe the need would be clearer once the utility is
created.
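
For illustration, such a utility might look roughly like the following, again
in plain Python. `LabelIndexer` and its methods are hypothetical names, not
part of the project; the sketch assumes class labels are sortable values such
as strings, and that indices are assigned in sorted order.

```python
class LabelIndexer:
    """Hypothetical helper mapping class labels to numeric indices and back."""

    def __init__(self, labels):
        # Assumption: indices follow the sorted order of the distinct labels.
        self._classes = sorted(set(labels))
        self._index = {c: float(i) for i, c in enumerate(self._classes)}

    def encode(self, label):
        """Return the numeric index the implementation would train on."""
        return self._index[label]

    def decode(self, index):
        """Reverse mapping: numeric prediction back to the original label."""
        return self._classes[int(index)]


indexer = LabelIndexer(["spam", "ham", "spam"])
encoded = [indexer.encode(x) for x in ["spam", "ham"]]
decoded = indexer.decode(encoded[0])
```

The indexer here plays the role of the dictionary the caller has to carry
around; holding the same mapping inside the model would remove that burden.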

As you say, I'm concerned that the API is already locked down this early, and
that some of these changes are going to be viewed as infeasible just for that
reason.

> loadLibSVMFile doesn't handle regression datasets
> -------------------------------------------------
>
>                 Key: SPARK-2341
>                 URL: https://issues.apache.org/jira/browse/SPARK-2341
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.0
>            Reporter: Eustache
>            Priority: Minor
>              Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently 
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a 
> BinaryLabelParser. What happens then is that the file is loaded, but in 
> multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser which converts target 
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
