[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079704#comment-14079704 ]

Apache Spark commented on SPARK-2341:
-------------------------------------

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1663

loadLibSVMFile doesn't handle regression datasets
-------------------------------------------------

Key: SPARK-2341
URL: https://issues.apache.org/jira/browse/SPARK-2341
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.0.0
Reporter: Eustache
Assignee: Sean Owen
Priority: Minor
Labels: easyfix

Many datasets exist in LibSVM format for regression tasks [1] but currently the loadLibSVMFile primitive doesn't handle regression datasets. More precisely, the LabelParser is either a MulticlassLabelParser or a BinaryLabelParser. What happens then is that the file is loaded but in multiclass mode: each target value is interpreted as a class name!

The fix would be to write a RegressionLabelParser which converts target values to Double and plug it into the loadLibSVMFile routine.

[1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078031#comment-14078031 ]

Xiangrui Meng commented on SPARK-2341:
--------------------------------------

[~srowen] For the doc in your version:

{code}
If multiclass, the numeric value parsed directly from the label string will be used as the label value.
If continuous, the double value parsed directly from the string will be used as the label.
{code}

Would users feel confused, since the two lines are essentially the same? Another possible solution is to parse the labels into doubles and remove the `multiclass` argument. Users can perform a map to transform the labels into binary 0/1 if needed.
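For readers following the thread, the "parse as doubles, then map" idea can be sketched roughly as follows. This is a plain-Python illustration, not MLlib code: the function name `to_binary`, the list-of-tuples stand-in for an RDD, and the 0.5 threshold (taken from the existing doc text quoted elsewhere in this thread) are all assumptions made for the example.

```python
# Illustrative sketch only: a plain list of (label, features) tuples stands in
# for the RDD of labeled points that a loader would return.

def to_binary(points, threshold=0.5):
    """Map raw numeric labels to 0.0/1.0 (hypothetical helper).

    Labels above `threshold` become 1.0, everything else 0.0, so both the
    +1/-1 and the 1/0 encodings end up as 1.0/0.0.
    """
    return [(1.0 if label > threshold else 0.0, features)
            for label, features in points]


# Labels as they would come out of a loader that only parses doubles.
raw = [(1.0, {1: 0.5}), (-1.0, {2: 1.0}), (0.0, {})]
binary = to_binary(raw)
```

With this split, the loader never changes a label; the caller opts into the 0/1 conversion explicitly, which is the point made in the comment above.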
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078040#comment-14078040 ]

Sean Owen commented on SPARK-2341:
----------------------------------

To me, it's less confusing than writing multiclass for a regression problem.

Yes, I also think it could be simpler to remove multiclass; the idea, I suppose, is that binary is merely a special case of that, and the caller can write the required transformation to 0/1 if needed. At least the caller is then aware of the transformation, and I think that's good. That way you just let numbers be numbers and let downstream code figure out whether a number is a continuous value or a category.
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078945#comment-14078945 ]

Xiangrui Meng commented on SPARK-2341:
--------------------------------------

That sounds good. Do you mind creating a PR? We can deprecate the existing methods that take `multiclass` and add a warning in the doc about the +1/-1 case.
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066283#comment-14066283 ]

Sean Owen commented on SPARK-2341:
----------------------------------

[~mengxr] Here is an example of changing the argument:
https://github.com/srowen/spark/commit/4a584ff9c0ada3d035d4668ecf22ec0e65ed16b6

I won't open a PR yet. I think this is a better API at this point, but the question is more whether the weight of the deprecated methods is worth it or not. Another data point to keep in mind regarding how APIs can evolve.
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063419#comment-14063419 ]

Sean Owen commented on SPARK-2341:
----------------------------------

OK, is it worth a pull request for changing the boolean multiclass argument to a string? I wanted to ask if that was your intent before I do that.

libsvm format support is certainly important. It happens to have to encode non-numeric input as numbers. It need not be that way throughout MLlib, since it isn't that way in other input formats. (In this API method, it's pretty minor, since libsvm does by definition use this encoding.)

So yes, it would be great if data sets or API objects didn't assume that categorical data is numeric, but instead encoded the type in the data set or even in the object model itself. I think it's mostly a design and type-safety argument -- the same reason we have String instead of just byte[] everywhere.

Sure, I will have to build this conversion at some point anyway and can share the result then.
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063155#comment-14063155 ]

Xiangrui Meng commented on SPARK-2341:
--------------------------------------

[~srowen] Using an enum or a string sounds good. As you already know, using a string may be better because of Python.

Rounding is used because people use either +1/-1 or 1/0 for binary classification in LIBSVM and we require 1/0 in MLlib. Actually, +1/-1 is the only corner case I wanted to cover when multiclass=false.

We added LIBSVM support because there are many commonly used datasets that can be downloaded from the LIBSVM/LIBLINEAR website and other places. That makes it easier for people to test MLlib's algorithms.

It would be nice if you have free cycles to implement a method that converts classes to numbers. For the long term, I'm thinking that for each dataset we can attach metadata that contains feature names, feature types, and the number of non-zeros, and for every categorical feature a map from its values to {0, 1, ...}.
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051151#comment-14051151 ]

Xiangrui Meng commented on SPARK-2341:
--------------------------------------

[~srowen] Instead of taking string labels directly, we can provide tools to convert them to integer labels (still Double typed). LIBLINEAR/LIBSVM do not support string labels either, but they are still among the top choices for logistic regression and SVM.

[~eustache] Unfortunately, the argument name in Scala is part of the API and loadLibSVMFile is not marked as experimental. So we cannot change the argument name to `multiclassOrRegression`, which is too long anyway. Could you update the doc and change the first sentence from

    multiclass: whether the input labels contain more than two classes

to

    multiclass: whether the input labels are continuous-valued (for regression) or contain more than two classes

?
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051194#comment-14051194 ]

Sean Owen commented on SPARK-2341:
----------------------------------

[~mengxr] For regression, rather than further overloading multiclass to mean regression, how about modifying the argument to take on three values (as an enum, string, etc.) to distinguish the three modes? The current method would stay, but be deprecated.

multiclass=false is for binary classification. libsvm uses 0 and 1 (or any ints) for binary classification. But this parses the label as a real number and rounds it to 0/1. (Is that what libsvm does?) Maybe it's a convenient semantic overload when you want to transform a continuous value to a 0/1 indicator, but is that implied by the libsvm format, or is it just a transformation the caller should make?

multiclass=true treats libsvm integer labels as doubles, but not as continuous values. It seems like inviting more confusion to have this mode also double as the mode for parsing labels that are continuous values as continuous values.

libsvm is widely used, but it's old; I don't think its file format from long ago should necessarily inform API design now. There are other serializations besides libsvm (plain CSV, for instance) and other algorithms (random decision forests).

You can make utilities to convert classes to numbers for the benefit of the implementation up front, and I'll have to in order to use this. Maybe we can start there -- at least if a utility is in the project, people aren't all reinventing this in order to use an SVM with actual labels. The caller then carries around a dictionary to do the reverse mapping. The model seems like the place to hold that info, if in fact it internally converts classes to some other representation. Maybe the need would be clearer once the utility is created.

As you say, I'm concerned that the API is already locked down early and that some of these changes are going to be viewed as infeasible just for that reason.
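To make the "utility plus reverse-mapping dictionary" idea above concrete, here is a minimal plain-Python sketch. Nothing in it is an MLlib API: the class name `LabelIndexer` and its `encode`/`decode` methods are hypothetical, and assigning indices in first-seen order is just one possible choice.

```python
class LabelIndexer:
    """Hypothetical helper: maps class names to 0.0, 1.0, ... and back.

    Indices are handed out in first-seen order; a real utility might sort
    the classes or order them by frequency instead.
    """

    def __init__(self):
        self._to_index = {}   # class name -> numeric label
        self._to_name = []    # position i holds the name for label float(i)

    def encode(self, name):
        """Return the numeric label for `name`, registering it if new."""
        if name not in self._to_index:
            self._to_index[name] = float(len(self._to_name))
            self._to_name.append(name)
        return self._to_index[name]

    def decode(self, index):
        """Reverse mapping: numeric label back to the original class name."""
        return self._to_name[int(index)]


indexer = LabelIndexer()
encoded = [indexer.encode(name) for name in ["cat", "dog", "cat"]]
```

Whoever holds the `indexer` object (the caller, or the model as suggested above) can turn predictions back into class names with `decode`.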
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049732#comment-14049732 ]

Xiangrui Meng commented on SPARK-2341:
--------------------------------------

Just set `multiclass = true` to load double values.
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049755#comment-14049755 ]

Eustache commented on SPARK-2341:
---------------------------------

I see that LabelParser with multiclass=true works for the regression setting.

What I fail to understand is how it is related to multiclass. Is the naming appropriate?

In any case, shouldn't we provide a name that explicitly mentions regression?
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049765#comment-14049765 ]

Xiangrui Meng commented on SPARK-2341:
--------------------------------------

It is a little awkward to have both `regression` and `multiclass` as input arguments. I agree that a correct name would be `multiclassOrRegression`, but it is certainly too long. We tried to make this clear in the doc:

{code}
multiclass: whether the input labels contain more than two classes. If false, any label with value greater than 0.5 will be mapped to 1.0, or 0.0 otherwise. So it works for both +1/-1 and 1/0 cases. If true, the double value parsed directly from the label string will be used as the label value.
{code}

It would be good if we could improve the documentation to make it clearer. But as for the API, I don't feel that it is necessary to change it.
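The behaviour described in that doc text can be sketched in a few lines. This is a plain-Python approximation written for this thread, not the MLlib Scala implementation; the function names are made up for the example, and the record layout (`label index:value ...`) is the usual LIBSVM text format.

```python
def parse_binary_label(token):
    """multiclass = false: values above 0.5 become 1.0, everything else 0.0."""
    return 1.0 if float(token) > 0.5 else 0.0


def parse_multiclass_label(token):
    """multiclass = true: the parsed double is used as the label unchanged."""
    return float(token)


def parse_libsvm_line(line, multiclass):
    """Split one 'label index:value ...' record into (label, {index: value})."""
    tokens = line.strip().split()
    parse = parse_multiclass_label if multiclass else parse_binary_label
    label = parse(tokens[0])
    features = {}
    for item in tokens[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


# A regression-style record: the target survives only when multiclass is true.
kept = parse_libsvm_line("23.7 1:0.5 3:2.0", multiclass=True)
collapsed = parse_libsvm_line("23.7 1:0.5 3:2.0", multiclass=False)
```

Here `kept` retains 23.7 as the label while `collapsed` gets 1.0, which is why regression data has to be loaded with `multiclass = true` even though the name suggests classification.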
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049942#comment-14049942 ]

Sean Owen commented on SPARK-2341:
----------------------------------

I've been a bit uncomfortable with how the MLlib API conflates categorical values and numbers, since they aren't numbers in general. Treating them as numbers is a convenience in some cases, and common in papers, but feels like suboptimal software design -- should a user have to convert categoricals to some numeric representation? To me it invites confusion, and this is one symptom. So I am not sure multiclass should mean "parse target as double" to begin with.

OK, it's not the issue here. But we're on the subject of an experimental API subject to change, with an example of something related that could be improved along the way, and it's my #1 wish for MLlib at the moment. I'd really like to work on a change to try to accommodate classes as, say, strings at least, and not presume doubles. But I am trying to figure out if anyone agrees with that.