[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049942#comment-14049942 ]
Sean Owen commented on SPARK-2341:
----------------------------------

I've been a bit uncomfortable with how the MLlib API conflates categorical values and numbers, since they aren't numbers in general. Treating them as numbers is a convenience in some cases, and common in papers, but feels like suboptimal software design -- should a user have to convert categoricals to some numeric representation? To me it invites confusion, and this is one symptom. So I am not sure "multiclass" should mean "parse target as double" to begin with?

OK, it's not the issue here. But we're on the subject of an experimental API subject to change with an example of something related that could be improved along the way, and it's my #1 wish for MLlib at the moment. I'd really like to work on a change to try to accommodate classes as, say, strings at least, and not presume doubles. But I am trying to figure out if anyone agrees with that.

> loadLibSVMFile doesn't handle regression datasets
> -------------------------------------------------
>
>                 Key: SPARK-2341
>                 URL: https://issues.apache.org/jira/browse/SPARK-2341
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.0
>            Reporter: Eustache
>            Priority: Minor
>              Labels: easyfix
>
> Many datasets exist in LibSVM format for regression tasks [1] but currently
> the loadLibSVMFile primitive doesn't handle regression datasets.
> More precisely, the LabelParser is either a MulticlassLabelParser or a
> BinaryLabelParser. What happens then is that the file is loaded but in
> multiclass mode: each target value is interpreted as a class name!
> The fix would be to write a RegressionLabelParser which converts target
> values to Double and plug it into the loadLibSVMFile routine.
> [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html

--
This message was sent by Atlassian JIRA
(v6.2#6252)
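
For illustration only, here is a rough sketch of the idea discussed above: a LibSVM line reader that takes the label parser as a parameter, so that regression targets are read as doubles while class labels can stay as strings. It is plain Python rather than MLlib's Scala code, and every name in it (`parse_libsvm_line`, `regression_label`, `class_label`) is hypothetical; none of them are part of the Spark API.

```python
from typing import Callable, Dict, Tuple, TypeVar

L = TypeVar("L")


def regression_label(token: str) -> float:
    """Hypothetical regression parser: read the target as a real number."""
    return float(token)


def class_label(token: str) -> str:
    """Hypothetical categorical parser: keep the class name as a string."""
    return token


def parse_libsvm_line(
    line: str, label_parser: Callable[[str], L]
) -> Tuple[L, Dict[int, float]]:
    """Split one LibSVM-format line into (label, sparse features).

    The first token is the label; the rest are `index:value` pairs.
    LibSVM indices are one-based, so they are shifted to zero-based here.
    """
    tokens = line.strip().split()
    label = label_parser(tokens[0])
    features: Dict[int, float] = {}
    for item in tokens[1:]:
        index, value = item.split(":", 1)
        features[int(index) - 1] = float(value)
    return label, features


# Same line format, two interpretations of the first token.
target, feats = parse_libsvm_line("24.5 1:0.5 3:2.0", regression_label)
name, _ = parse_libsvm_line("+1 1:0.5 3:2.0", class_label)
```

Passing the parser in as an argument keeps the choice between "number" and "category" explicit at the call site, instead of assuming that every label is a double.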