[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079704#comment-14079704
 ] 

Apache Spark commented on SPARK-2341:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1663

 loadLibSVMFile doesn't handle regression datasets
 -

 Key: SPARK-2341
 URL: https://issues.apache.org/jira/browse/SPARK-2341
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Eustache
Assignee: Sean Owen
Priority: Minor
  Labels: easyfix

 Many datasets exist in LibSVM format for regression tasks [1] but currently 
 the loadLibSVMFile primitive doesn't handle regression datasets.
 More precisely, the LabelParser is either a MulticlassLabelParser or a 
 BinaryLabelParser. What happens then is that the file is loaded but in 
 multiclass mode: each target value is interpreted as a class name!
 The fix would be to write a RegressionLabelParser which converts target 
 values to Double and plug it into the loadLibSVMFile routine.
 [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-29 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078031#comment-14078031
 ] 

Xiangrui Meng commented on SPARK-2341:
--

[~srowen] For the doc in your version:

{code}
If multiclass, the numeric value parsed directly from the label string will 
be used as the label value.
If continuous, the double value parsed directly from the string will be used 
as the label.
{code}

Would users be confused, since the two lines are essentially the same?

Another possible solution is that we parse the labels into doubles and remove 
the `multiclass` argument. Users can perform a map to transform the labels into 
binary 0/1 if needed.
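
A minimal sketch of that alternative, in plain Python rather than the MLlib API: every label is parsed as a double, and the caller applies the 0/1 mapping afterwards. The function name `parse_libsvm_line` and the sample lines are made up for illustration, and storing feature indices zero-based is a choice of this sketch.

```python
from typing import Dict, List, Tuple


def parse_libsvm_line(line: str) -> Tuple[float, Dict[int, float]]:
    """Parse 'label index:value ...' and keep the label as a plain float."""
    tokens = line.split()
    label = float(tokens[0])
    features: Dict[int, float] = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        # LibSVM indices are one-based; this sketch stores them zero-based.
        features[int(index) - 1] = float(value)
    return label, features


# Hypothetical input: two +1/-1 classification rows and one regression row.
lines: List[str] = ["+1 1:0.5 3:2.0", "-1 2:1.5", "4.2 1:1.0"]
parsed = [parse_libsvm_line(line) for line in lines]

# The caller, not the loader, decides to map +1/-1 labels to 1/0.
binary = [(1.0 if label > 0 else 0.0, feats) for label, feats in parsed[:2]]
```

With this shape the loader never has to know whether a number is a class or a continuous target; the regression row simply keeps its label of 4.2.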



[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078040#comment-14078040
 ] 

Sean Owen commented on SPARK-2341:
--

To me, it's less confusing than writing multiclass for a regression problem. 
Yes, I also think it could be simpler to remove multiclass; the idea, I suppose, 
is that binary is merely a special case of that, and the caller can write the 
required transformation to 0/1 if needed. At least the caller is aware of the 
transformation, and I think that's good. That way you just let numbers be 
numbers and let downstream code figure out whether the number is a continuous 
value or a category.



[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-29 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078945#comment-14078945
 ] 

Xiangrui Meng commented on SPARK-2341:
--

That sounds good. Do you mind creating a PR? We can deprecate the existing ones 
with `multiclass` and add a warning in the doc about the +1/-1 case.



[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066283#comment-14066283
 ] 

Sean Owen commented on SPARK-2341:
--

[~mengxr] Here is an example of changing the argument:
https://github.com/srowen/spark/commit/4a584ff9c0ada3d035d4668ecf22ec0e65ed16b6

I won't open a PR yet. I think this is a better API at this point, but the 
question is more whether the weight of deprecated methods is worth it or not. 
Another data point to keep in mind regarding how APIs can evolve.



[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063419#comment-14063419
 ] 

Sean Owen commented on SPARK-2341:
--

OK, is it worth a pull request for changing the boolean multiclass argument to a 
string? I wanted to ask if that was your intent before I do that.

libsvm format support is certainly important. It happens to have to encode 
non-numeric input as numbers. It need not be that way throughout MLlib, since 
it isn't that way in other input formats. (In this API method, it's pretty 
minor, since libsvm does by definition use this encoding.) So yes that would be 
great if data sets or API objects didn't assume that categorical data was 
numeric, but encoded type in the data set or even in the object model itself. I 
think it's mostly a design and type-safety argument -- same reason we have 
String instead of just byte[] everywhere.

Sure, I will have to build this conversion at some point anyway and can share 
the result then.



[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-15 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063155#comment-14063155
 ] 

Xiangrui Meng commented on SPARK-2341:
--

[~srowen] Using an enum or a string sounds good. As you already know, using a 
string may be better because of Python.

Rounding is used because people use either +1/-1 or 1/0 for binary 
classification in LIBSVM and we require 1/0 in MLlib. Actually, the +1/-1 case is 
the only corner case I wanted to cover when multiclass=false. We added LIBSVM 
support because there are many commonly used datasets we can download from 
the LIBSVM/LIBLINEAR website and other places. That makes it easier for people 
to test MLlib's algorithms.

It would be nice if you have free cycles to implement a method that converts 
classes to numbers. For the long term, I'm thinking that for each dataset we 
can attach metadata that contains feature names, feature types, number of 
non-zeros, and, for every categorical feature, a value -> {0, 1, ...} 
map.



[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-03 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051151#comment-14051151
 ] 

Xiangrui Meng commented on SPARK-2341:
--

[~srowen] Instead of taking string labels directly, we can provide tools to 
convert them to integer labels (still Double typed). LIBLINEAR/LIBSVM do not 
support string labels either, but they are still among the top choices for 
logistic regression and SVM.

[~eustache] Unfortunately, the argument name in Scala is part of the API and 
loadLibSVMFile is not marked as experimental. So we cannot update the argument 
name to `multiclassOrRegression`, which is too long anyway. Could you update 
the doc and change the first sentence from "multiclass: whether the input 
labels contain more than two classes" to "multiclass: whether the input labels 
are continuous-valued (for regression) or contain more than two classes"? 



[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051194#comment-14051194
 ] 

Sean Owen commented on SPARK-2341:
--

[~mengxr] For regression, rather than further overloading multiclass to mean 
regression, how about modifying the argument to take on three values (as an 
enum, string, etc.) to distinguish the three modes? The current method would 
stay, but be deprecated.
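
One way to picture a three-valued argument is the sketch below, written in plain Python rather than against the MLlib API. The name `label_type`, its three string values, and the integer check in multiclass mode are all assumptions made for illustration; nothing here reflects what a real patch would contain.

```python
def parse_label(label_text: str, label_type: str = "binary") -> float:
    """Hypothetical label parser that keeps the three modes distinct."""
    value = float(label_text)
    if label_type == "binary":
        # Same rounding as the existing multiclass=false behaviour.
        return 1.0 if value > 0.5 else 0.0
    if label_type == "multiclass":
        # Assumption of this sketch: class identifiers are whole numbers.
        if not value.is_integer():
            raise ValueError(f"not a class identifier: {label_text!r}")
        return value
    if label_type == "regression":
        # Continuous targets pass through unchanged.
        return value
    raise ValueError(f"unknown label_type: {label_type!r}")


examples = [
    parse_label("-1", "binary"),
    parse_label("3", "multiclass"),
    parse_label("3.7", "regression"),
]
```

A string keeps the argument easy to expose from Python, and an explicit "regression" mode lets a continuous label be rejected when the caller asked for classes.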

multiclass=false is for binary classification. libsvm uses 0 and 1 (or any 
ints) for binary classification. But this parses it as a real number, and 
rounds to 0/1. (Is that what libsvm does?) Maybe it's a convenient semantic 
overload when you want to transform a continuous value to a 0/1 indicator, but 
is that implied by libsvm format or just a transformation the caller should 
make? multiclass=true treats libsvm integer labels as doubles, but not 
continuous values. It seems like inviting more confusion to have this mode also 
double as the mode for parsing labels that are continuous values as continuous 
values.

libsvm is widely used but it's old; I don't think its file format from long 
ago should necessarily inform API design now. There are other serializations 
besides libsvm (plain CSV for instance) and other algorithms (random decision 
forests).

You can make utilities to convert classes to numbers for benefit of the 
implementation on the front, and I'll have to in order to use this. Maybe we 
can start there -- at least if a utility is in the project people aren't all 
reinventing this in order to use an SVM with actual labels. The caller carries 
around a dictionary then to do the reverse mapping. The model seems like the 
place to hold that info, if in fact internally it converts classes to some 
other representation. Maybe the need would be clearer once the utility is 
created.
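
As a rough illustration of such a utility, the plain-Python sketch below assigns each distinct class name a double-valued index in order of first appearance and returns the dictionary needed for the reverse mapping. The function name `index_labels` and the sample labels are hypothetical; this is not an existing MLlib helper.

```python
from typing import Dict, Iterable, List, Tuple


def index_labels(labels: Iterable[str]) -> Tuple[List[float], Dict[float, str]]:
    """Encode class names as 0.0, 1.0, ... and return the reverse mapping."""
    to_index: Dict[str, float] = {}
    encoded: List[float] = []
    for label in labels:
        if label not in to_index:
            # The next unused index, kept as a float to match Double labels.
            to_index[label] = float(len(to_index))
        encoded.append(to_index[label])
    reverse = {index: label for label, index in to_index.items()}
    return encoded, reverse


encoded, reverse = index_labels(["cat", "dog", "cat", "bird"])
# The caller carries `reverse` around to turn predictions back into names.
decoded = [reverse[index] for index in encoded]
```

Keeping `reverse` next to the trained model, rather than in caller code, is the design question raised above.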

As you say, I'm concerned that the API is already locked down early and some of 
these changes are going to be viewed as infeasible just for that reason.



[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049732#comment-14049732
 ] 

Xiangrui Meng commented on SPARK-2341:
--

Just set `multiclass = true` to load double values.



[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Eustache (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049755#comment-14049755
 ] 

Eustache commented on SPARK-2341:
-

I see that LabelParser with multiclass=true works for the regression
setting.

What I fail to understand is how it is related to multiclass. Is the
naming appropriate?

In any case, shouldn't we provide a name that explicitly mentions
regression?








[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049765#comment-14049765
 ] 

Xiangrui Meng commented on SPARK-2341:
--

It is a little awkward to have both `regression` and `multiclass` as input 
arguments. I agree that a correct name should be `multiclassOrRegression`. But 
it is certainly too long. We tried to make this clear in the doc:

{code}
multiclass: whether the input labels contain more than two classes. If false, 
any label with value greater than 0.5 will be mapped to 1.0, or 0.0 otherwise. 
So it works for both +1/-1 and 1/0 cases. If true, the double value parsed 
directly from the label string will be used as the label value.
{code}

It would be good if we can improve the documentation to make it clearer. But 
for the API, I don't feel that it is necessary to change.
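
The documented behaviour can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the MLlib source; the function name `parse_label` is made up here.

```python
def parse_label(label_text: str, multiclass: bool) -> float:
    """Illustrative stand-in for the two label-parsing modes in the doc."""
    value = float(label_text)
    if multiclass:
        # The parsed double is used as-is, which is why this mode also
        # happens to work for continuous regression targets.
        return value
    # Binary mode: anything above 0.5 becomes 1.0, everything else 0.0,
    # so both the +1/-1 and the 1/0 conventions end up as 1/0.
    return 1.0 if value > 0.5 else 0.0


binary = [parse_label(s, multiclass=False) for s in ["+1", "-1", "1", "0"]]
regression = [parse_label(s, multiclass=True) for s in ["3.7", "-12.25"]]
```

Here `binary` holds only 1.0 and 0.0 values, while `regression` keeps the targets untouched. The doc has to spell out that second case because the argument name does not suggest it.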




[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049942#comment-14049942
 ] 

Sean Owen commented on SPARK-2341:
--

I've been a bit uncomfortable with how the MLlib API conflates categorical 
values and numbers, since they aren't numbers in general. Treating them as 
numbers is a convenience in some cases, and common in papers, but feels like 
suboptimal software design -- should a user have to convert categoricals to 
some numeric representation? To me it invites confusion, and this is one 
symptom. So I am not sure multiclass should mean "parse target as double" to 
begin with.

OK, it's not the issue here. But we're on the subject of an experimental API 
subject to change with an example of something related that could be improved 
along the way, and it's my #1 wish for MLlib at the moment. I'd really like to 
work on a change to try to accommodate classes as, say, strings at least, and 
not presume doubles. But I am trying to figure out if anyone agrees with that. 
