[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063419#comment-14063419
 ] 

Sean Owen commented on SPARK-2341:
----------------------------------

OK, is it worth a pull request to change the boolean multiclass argument to a 
string? I wanted to ask whether that was your intent before I do that.

libsvm format support is certainly important. The format happens to require 
encoding non-numeric input as numbers. It need not be that way throughout 
MLlib, since other input formats don't work that way. (In this API method it's 
pretty minor, since libsvm by definition uses this encoding.) So yes, it would 
be great if data sets or API objects didn't assume that categorical data was 
numeric, and instead encoded the type in the data set or even in the object 
model itself. I think it's mostly a design and type-safety argument -- the same 
reason we have String instead of just byte[] everywhere.

Sure, I will have to build this conversion at some point anyway and can share 
the result then.

> loadLibSVMFile doesn't handle regression datasets
> -------------------------------------------------
>
>                 Key: SPARK-2341
>                 URL: https://issues.apache.org/jira/browse/SPARK-2341
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.0
>            Reporter: Eustache
>            Priority: Minor
>              Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently 
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a 
> BinaryLabelParser. What happens then is that the file is loaded, but in 
> multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser which converts target 
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 
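
To make the proposal in the issue description concrete, here is a minimal sketch in plain Python (not Spark or MLlib code) of selecting a label parser by a string mode instead of a boolean flag. The function names, the "mode" parameter, and the rule that multiclass labels must be integer-valued are all assumptions made for illustration; they are not taken from MLlib's actual parsers.

```python
from typing import Callable, Dict, Tuple


def binary_label(raw: str) -> float:
    """Hypothetical binary parser: positive labels become 1.0, others 0.0."""
    return 1.0 if float(raw) > 0 else 0.0


def multiclass_label(raw: str) -> float:
    """Hypothetical multiclass parser: assumes labels are integer class ids."""
    value = float(raw)
    if not value.is_integer():
        raise ValueError(f"not a class index: {raw!r}")
    return value


def regression_label(raw: str) -> float:
    """Hypothetical regression parser: keeps the real-valued target as-is."""
    return float(raw)


# String mode -> parser. A string leaves room for more modes than a boolean.
PARSERS: Dict[str, Callable[[str], float]] = {
    "binary": binary_label,
    "multiclass": multiclass_label,
    "regression": regression_label,
}


def parse_libsvm_line(
    line: str, mode: str = "regression"
) -> Tuple[float, Dict[int, float]]:
    """Parse one 'label index:value ...' line into (label, sparse features)."""
    if mode not in PARSERS:
        raise ValueError(f"unknown label mode: {mode!r}")
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    label = PARSERS[mode](tokens[0])
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        # Indices are kept exactly as written in the file (libsvm is 1-based).
        features[int(index)] = float(value)
    return label, features
```

With this sketch, parse_libsvm_line("24.5 1:0.5 3:2.0", mode="regression") keeps 24.5 as the target, while the same line in "multiclass" mode is rejected because 24.5 is not a class index.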



--
This message was sent by Atlassian JIRA
(v6.2#6252)
