[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049765#comment-14049765
 ] 

Xiangrui Meng commented on SPARK-2341:
--------------------------------------

It would be a little awkward to have both `regression` and `multiclass` as 
input arguments. I agree that a more accurate name would be 
`multiclassOrRegression`, but that is certainly too long. We tried to make this 
clear in the doc:

{code}
multiclass: whether the input labels contain more than two classes. If false, 
any label with value greater than 0.5 will be mapped to 1.0, or 0.0 otherwise. 
So it works for both +1/-1 and 1/0 cases. If true, the double value parsed 
directly from the label string will be used as the label value.
{code}
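
To make the documented behaviour concrete, here is a minimal standalone sketch 
of the label mapping described above. It is not the MLlib implementation: the 
`parse_label` helper and its signature are hypothetical and only mirror the doc 
text. It assumes the label has already been split off from the rest of the 
LibSVM line.

```python
def parse_label(label: str, multiclass: bool = False) -> float:
    """Hypothetical helper that mirrors the doc text; not MLlib code."""
    value = float(label)
    if multiclass:
        # multiclass=True: use the parsed double directly as the label.
        # This is also what a regression target needs.
        return value
    # multiclass=False: values above 0.5 map to 1.0, everything else to 0.0,
    # so both the +1/-1 and the 1/0 encodings end up as 1.0/0.0.
    return 1.0 if value > 0.5 else 0.0


binary_labels = [parse_label(s) for s in ["+1", "-1", "1", "0"]]
regression_kept = parse_label("3.7", multiclass=True)
regression_collapsed = parse_label("3.7")  # binary mode loses the target value
```

Under these assumptions, a regression target survives only when `multiclass` 
is true. In binary mode it collapses to 0.0 or 1.0.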

It would be good to improve the documentation to make this clearer, but I 
don't think the API itself needs to change.


> loadLibSVMFile doesn't handle regression datasets
> -------------------------------------------------
>
>                 Key: SPARK-2341
>                 URL: https://issues.apache.org/jira/browse/SPARK-2341
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.0
>            Reporter: Eustache
>            Priority: Minor
>              Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently 
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a 
> BinaryLabelParser. What happens then is that the file is loaded, but in 
> multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser which converts target 
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



