[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

Sean Owen (JIRA) Wed, 02 Jul 2014 07:19:08 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049942#comment-14049942
 ]


Sean Owen commented on SPARK-2341:
----------------------------------

I've been a bit uncomfortable with how the MLlib API conflates categorical 
values and numbers, since categorical values aren't numbers in general. 
Treating them as numbers is a convenience in some cases, and common in papers, 
but it feels like suboptimal software design: should a user have to convert 
categoricals to some numeric representation? To me it invites confusion, and 
this issue is one symptom. So I am not sure "multiclass" should mean "parse 
target as double" to begin with.

OK, that's not the issue here. But this is an experimental API that is subject 
to change, and this issue is an example of something related that could be 
improved along the way; it's my #1 wish for MLlib at the moment. I'd really 
like to work on a change that accommodates classes as, say, strings at least, 
and doesn't presume doubles. But I am trying to figure out if anyone agrees 
with that.
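
To make the idea concrete, here is a minimal sketch in plain Python rather than 
Spark code. It is not an MLlib API: `LabelIndex` and its methods are 
hypothetical names, used only to show how class labels could stay strings for 
the user while a numeric encoding is kept internal.

```python
from typing import Dict, Iterable, List


class LabelIndex:
    """Hypothetical helper: maps string class labels to dense integer ids."""

    def __init__(self, labels: Iterable[str]) -> None:
        # Sort the distinct labels so the encoding is deterministic.
        self.classes: List[str] = sorted(set(labels))
        self._ids: Dict[str, int] = {c: i for i, c in enumerate(self.classes)}

    def encode(self, label: str) -> int:
        # Raises KeyError for a label that was never seen.
        return self._ids[label]

    def decode(self, class_id: int) -> str:
        return self.classes[class_id]


# The user only ever handles strings; ids are an internal detail.
index = LabelIndex(["spam", "ham", "spam", "unknown"])
encoded = [index.encode(s) for s in ["ham", "unknown", "spam"]]
```

With something like this, a class is never mistaken for a quantity: the ids 
carry no ordering or magnitude that the user has to reason about.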

> loadLibSVMFile doesn't handle regression datasets
> -------------------------------------------------
>
>                 Key: SPARK-2341
>                 URL: https://issues.apache.org/jira/browse/SPARK-2341
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.0
>            Reporter: Eustache
>            Priority: Minor
>              Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1], but currently 
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a 
> BinaryLabelParser. As a result, the file is loaded, but in 
> multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser that converts target 
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 
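
The proposed fix amounts to making the label conversion pluggable. A minimal 
sketch in plain Python, assuming the usual LibSVM line layout 
(`label index:value ...` with one-based indices), could look like the block 
below. This is not the Spark implementation: `parse_libsvm_line`, 
`binary_label`, and `regression_label` are hypothetical names, and the 
"positive means 1.0" rule for binary labels is an assumption made for the 
example.

```python
from typing import Callable, Dict, Tuple

# A label parser turns the first token of a line into a numeric target.
LabelParser = Callable[[str], float]


def binary_label(token: str) -> float:
    # Assumption for this sketch: any positive value is the positive class.
    return 1.0 if float(token) > 0 else 0.0


def regression_label(token: str) -> float:
    # Regression targets are kept as real values, unchanged.
    return float(token)


def parse_libsvm_line(
    line: str, label_parser: LabelParser
) -> Tuple[float, Dict[int, float]]:
    """Parse one 'label index:value ...' line into (label, sparse features)."""
    tokens = line.strip().split()
    label = label_parser(tokens[0])
    features: Dict[int, float] = {}
    for item in tokens[1:]:
        index, value = item.split(":", 1)
        # LibSVM indices are one-based; store them zero-based here.
        features[int(index) - 1] = float(value)
    return label, features


line = "-3.75 1:0.5 3:2.0"
as_regression = parse_libsvm_line(line, regression_label)
as_binary = parse_libsvm_line(line, binary_label)
```

The same line yields a target of -3.75 under the regression parser but 0.0 
under the binary one, which is why the choice of parser matters for 
regression datasets.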



--
This message was sent by Atlassian JIRA
(v6.2#6252)
