[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079704#comment-14079704
 ] 

Apache Spark commented on SPARK-2341:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1663

 loadLibSVMFile doesn't handle regression datasets
 -

              Key: SPARK-2341
              URL: https://issues.apache.org/jira/browse/SPARK-2341
          Project: Spark
       Issue Type: Bug
       Components: MLlib
 Affects Versions: 1.0.0
         Reporter: Eustache
         Assignee: Sean Owen
         Priority: Minor
           Labels: easyfix

 Many datasets exist in LibSVM format for regression tasks [1], but currently 
 the loadLibSVMFile primitive doesn't handle regression datasets.
 More precisely, the LabelParser is either a MulticlassLabelParser or a 
 BinaryLabelParser. What happens then is that the file is loaded, but in 
 multiclass mode: each target value is interpreted as a class name!
 The fix would be to write a RegressionLabelParser that converts target 
 values to Double and plug it into the loadLibSVMFile routine.
 [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-29 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078031#comment-14078031
 ] 

Xiangrui Meng commented on SPARK-2341:
--

[~srowen] For the doc in your version:

{code}
If multiclass, the numeric value parsed directly from the label string will 
be used as the label value.
If continuous, the double value parsed directly from the string will be used 
as the label.
{code}

Would users be confused, since the two lines are essentially the same?

Another possible solution is that we parse the labels into doubles and remove 
the `multiclass` argument. Users can perform a map to transform the labels into 
binary 0/1 if needed.
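
A minimal Python sketch of this proposal, assuming the usual LibSVM line layout `label index:value ...`. The names `parse_libsvm_line` and `to_binary` are hypothetical and are not MLlib APIs; the point is only that the loader keeps every label as a double and the caller applies the 0/1 mapping explicitly.

```python
from typing import List, Tuple


def parse_libsvm_line(line: str) -> Tuple[float, List[int], List[float]]:
    """Parse one 'label index:value ...' line, keeping the label as a float."""
    tokens = line.strip().split()
    label = float(tokens[0])
    indices: List[int] = []
    values: List[float] = []
    for token in tokens[1:]:
        index, value = token.split(":")
        indices.append(int(index) - 1)  # LibSVM indices are one-based
        values.append(float(value))
    return label, indices, values


def to_binary(label: float) -> float:
    """Caller-side transform (hypothetical): map +1/-1 or 1/0 labels to 1.0/0.0."""
    return 1.0 if label > 0 else 0.0


lines = ["+1 1:0.5 3:2.0", "-1 2:1.5", "3.75 1:1.0"]
parsed = [parse_libsvm_line(line) for line in lines]

# Only the caller knows the first two rows are binary-classification labels.
binary_labels = [to_binary(label) for label, _, _ in parsed[:2]]
```

With this split, a regression target such as `3.75` passes through untouched, and the 0/1 conversion is visible in the caller's code rather than hidden in the loader.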


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-29 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078040#comment-14078040
 ] 

Sean Owen commented on SPARK-2341:
--

To me, it's less confusing than writing multiclass for a regression problem. 
Yes, I also think it could be simpler to remove multiclass; the idea, I suppose, 
is that binary is merely a special case of that, and the caller can write the 
required transformation to 0/1 if needed. At least the caller is then aware of 
the transformation, and I think that's good. That way you just let numbers be 
numbers and let downstream code figure out whether a number is a continuous 
value or a category.


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-29 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14078945#comment-14078945
 ] 

Xiangrui Meng commented on SPARK-2341:
--

That sounds good. Do you mind creating a PR? We can deprecate the existing ones 
with `multiclass` and add a warning in the doc about the +1/-1 case.
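
One way such a deprecation could look, sketched in Python rather than Scala. The function `load_labels` and its behaviour are assumptions made for illustration, not the change that was eventually merged: the old `multiclass` flag is still accepted, but it only triggers a warning, and labels are always parsed as doubles.

```python
import warnings
from typing import List, Optional, Sequence


def load_labels(label_strings: Sequence[str],
                multiclass: Optional[bool] = None) -> List[float]:
    """Hypothetical loader: labels are always parsed as floats.

    `multiclass` is kept only so that existing callers keep working.
    """
    if multiclass is not None:
        warnings.warn(
            "`multiclass` is deprecated; labels are parsed as doubles. "
            "Map +1/-1 labels to 1/0 yourself if an algorithm requires it.",
            DeprecationWarning,
            stacklevel=2,
        )
    return [float(s) for s in label_strings]
```

Under this sketch a caller who passed `multiclass=False` and relied on +1/-1 being mapped to 1/0 would now receive `-1.0`, which is the case the documentation warning would need to spell out.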


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-18 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066283#comment-14066283
 ] 

Sean Owen commented on SPARK-2341:
--

[~mengxr] Here is an example of changing the argument:
https://github.com/srowen/spark/commit/4a584ff9c0ada3d035d4668ecf22ec0e65ed16b6

I won't open a PR yet. I think this is a better API at this point, but the 
question is more whether the weight of deprecated methods is worth it or not. 
Another data point to keep in mind regarding how APIs can evolve.


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-16 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063419#comment-14063419
]

Sean Owen commented on SPARK-2341:
--

OK, is it worth a pull request to change the boolean multiclass argument to a
string? I wanted to ask if that was your intent before I do that.

libsvm format support is certainly important. It happens to have to encode
non-numeric input as numbers. It need not be that way throughout MLlib, since
it isn't that way in other input formats. (In this API method, it's pretty
minor, since libsvm does by definition use this encoding.) So yes, it would be
great if data sets or API objects didn't assume that categorical data was
numeric, but instead encoded the type in the data set or even in the object
model itself. I think it's mostly a design and type-safety argument -- the same
reason we have String instead of just byte[] everywhere.

Sure, I will have to build this conversion at some point anyway and can share
the result then.


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-15 Thread Xiangrui Meng (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14063155#comment-14063155
]

Xiangrui Meng commented on SPARK-2341:
--

[~srowen] Using an enum or a string sounds good. As you already know, using a
string may be better because of Python.

Rounding is used because people use either +1/-1 or 1/0 for binary
classification in LIBSVM and we require 1/0 in MLlib. Actually, +1/-1 is the
only corner case I wanted to cover when multiclass=false. We added LIBSVM
support because there are many commonly used datasets that can be downloaded
from the LIBSVM/LIBLINEAR website and other places, which makes it easier for
people to test MLlib's algorithms.

It would be nice if you have free cycles to implement a method that converts
classes to numbers. For the long term, I'm thinking that for each dataset we
can attach metadata that contains feature names, feature types, and the number
of non-zeros, and that for every categorical feature we keep a
value -> {0, 1, ...} map.
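
A rough Python sketch of what such per-dataset metadata might hold. The class `DatasetMetadata` and all of its field names are hypothetical; nothing like it is claimed to exist in MLlib.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetMetadata:
    """Hypothetical container for the metadata described above."""
    feature_names: List[str]
    feature_types: List[str]  # e.g. "continuous" or "categorical"
    num_nonzeros: int
    # One value -> {0, 1, ...} map per categorical feature, keyed by feature name.
    categorical_maps: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def encode(self, feature: str, value: str) -> float:
        """Return the numeric code (still Double typed) for a categorical value."""
        return float(self.categorical_maps[feature][value])


meta = DatasetMetadata(
    feature_names=["age", "color"],
    feature_types=["continuous", "categorical"],
    num_nonzeros=2,
    categorical_maps={"color": {"red": 0, "green": 1, "blue": 2}},
)
```

For example, `meta.encode("color", "blue")` looks up the category in the stored map instead of requiring the input file to contain the number already.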


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-03 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14051151#comment-14051151
 ] 

Xiangrui Meng commented on SPARK-2341:
--

[~srowen] Instead of taking string labels directly, we can provide tools to 
convert them to integer labels (still Double typed). LIBLINEAR/LIBSVM do not 
support string labels either, but they are still among the top choices for 
logistic regression and SVM.

[~eustache] Unfortunately, the argument name in Scala is part of the API and 
loadLibSVMFile is not marked as experimental. So we cannot update the argument 
name to `multiclassOrRegression`, which is too long anyway. Could you update 
the doc and change the first sentence from "multiclass: whether the input 
labels contain more than two classes" to "multiclass: whether the input labels 
are continuous-valued (for regression) or contain more than two classes"? 


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-03 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14051194#comment-14051194
]

Sean Owen commented on SPARK-2341:
--

[~mengxr] For regression, rather than further overloading multiclass to mean
regression, how about modifying the argument to take on three values (as an
enum, string, etc.) to distinguish the three modes? The current method would
stay, but be deprecated.
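
A minimal Python sketch of a three-valued mode argument. The enum `LabelMode`, the function `parse_label`, and the integer check in multiclass mode are assumptions made for illustration, not existing MLlib behaviour.

```python
from enum import Enum


class LabelMode(Enum):
    """Hypothetical replacement for the boolean `multiclass` flag."""
    BINARY = "binary"
    MULTICLASS = "multiclass"
    REGRESSION = "regression"


def parse_label(text: str, mode: str) -> float:
    """Parse a label string according to one of the three modes."""
    label_mode = LabelMode(mode)  # raises ValueError for an unknown mode string
    value = float(text)
    if label_mode is LabelMode.BINARY:
        # Same thresholding as the current multiclass=false behaviour.
        return 1.0 if value > 0.5 else 0.0
    if label_mode is LabelMode.MULTICLASS:
        # Assumption: class labels are integers stored as doubles.
        if not value.is_integer():
            raise ValueError(f"not a class index: {text!r}")
        return value
    return value  # REGRESSION: keep the continuous value as-is
```

Passing the mode as a string keeps the call usable from Python as well as Scala, while the enum rejects misspelled modes early.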

multiclass=false is for binary classification. libsvm uses 0 and 1 (or any
ints) for binary classification. But this parses it as a real number, and
rounds to 0/1. (Is that what libsvm does?) Maybe it's a convenient semantic
overload when you want to transform a continuous value to a 0/1 indicator, but
is that implied by the libsvm format, or is it just a transformation the caller
should make? multiclass=true treats libsvm integer labels as doubles, but not
as continuous values. It seems like inviting more confusion to have this mode
also double as the mode for parsing continuous-valued labels as continuous
values.

libsvm is widely used but it's old; I don't think its file format from long
ago should necessarily inform API design now. There are other serializations
besides libsvm (plain CSV, for instance) and other algorithms (random decision
forests).

You can make utilities to convert classes to numbers for benefit of the
implementation on the front, and I'll have to in order to use this. Maybe we
can start there -- at least if a utility is in the project people aren't all
reinventing this in order to use an SVM with actual labels. The caller carries
around a dictionary then to do the reverse mapping. The model seems like the
place to hold that info, if in fact internally it converts classes to some
other representation. Maybe the need would be clearer once the utility is
created.
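
A small Python sketch of such a utility, under the assumption that classes arrive as strings. `LabelIndexer` is a hypothetical name. It assigns each distinct class a number and keeps the dictionary needed for the reverse mapping.

```python
from typing import Dict, Iterable, List


class LabelIndexer:
    """Hypothetical helper: maps class names to 0, 1, ... and back."""

    def __init__(self, labels: Iterable[str]) -> None:
        # Sort so that the assignment is deterministic across runs.
        self.classes: List[str] = sorted(set(labels))
        self.index: Dict[str, int] = {c: i for i, c in enumerate(self.classes)}

    def encode(self, label: str) -> float:
        """Class name -> numeric label (a float, as the algorithms expect)."""
        return float(self.index[label])

    def decode(self, value: float) -> str:
        """Numeric prediction -> original class name."""
        return self.classes[int(value)]


indexer = LabelIndexer(["spam", "ham", "spam"])
encoded = [indexer.encode(label) for label in ["spam", "ham", "spam"]]
```

Here the caller has to keep `indexer` around to interpret predictions; storing it with the model instead would remove that burden, which is the design question raised above.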

As you say, I'm concerned that the API is already locked down early and some of
these changes are going to be viewed as infeasible just for that reason.


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049732#comment-14049732
 ] 

Xiangrui Meng commented on SPARK-2341:
--

Just set `multiclass = true` to load double values.


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Eustache (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049755#comment-14049755
 ] 

Eustache commented on SPARK-2341:
-

I see that LabelParser with multiclass=true works for the regression
setting.

What I fail to understand is how this is related to multiclass. Is the
naming appropriate?

In any case, shouldn't we provide a name that explicitly mentions
regression?







[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Xiangrui Meng (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049765#comment-14049765
]

Xiangrui Meng commented on SPARK-2341:
--

It is a little awkward to have both `regression` and `multiclass` as input
arguments. I agree that a correct name should be `multiclassOrRegression`. But
it is certainly too long. We tried to make this clear in the doc:

{code}
multiclass: whether the input labels contain more than two classes. If false,
any label with value greater than 0.5 will be mapped to 1.0, or 0.0 otherwise.
So it works for both +1/-1 and 1/0 cases. If true, the double value parsed
directly from the label string will be used as the label value.
{code}

It would be good if we can improve the documentation to make it clearer. But
for the API, I don't feel that a change is necessary.
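
The two behaviours described in the doc string can be sketched in Python as follows. The function names are illustrative only and do not correspond to the Scala parser objects.

```python
def binary_label(text: str) -> float:
    """multiclass=false: values greater than 0.5 become 1.0, everything else 0.0."""
    return 1.0 if float(text) > 0.5 else 0.0


def multiclass_label(text: str) -> float:
    """multiclass=true: the parsed double is used directly as the label."""
    return float(text)


# Both binary conventions end up as 1/0.
binary_from_pm1 = [binary_label(s) for s in ["+1", "-1"]]
binary_from_10 = [binary_label(s) for s in ["1", "0"]]

# A regression target survives only in the multiclass=true branch.
regression_kept = multiclass_label("12.5")
regression_lost = binary_label("12.5")
```

This is why `multiclass = true` happens to work for regression data: that branch applies no transformation at all, whereas the binary branch would collapse a target such as `12.5` to `1.0`.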


[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2014-07-02 Thread Sean Owen (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049942#comment-14049942
]

Sean Owen commented on SPARK-2341:
--

I've been a bit uncomfortable with how the MLlib API conflates categorical
values and numbers, since they aren't numbers in general. Treating them as
numbers is a convenience in some cases, and common in papers, but feels like
suboptimal software design -- should a user have to convert categoricals to
some numeric representation? To me it invites confusion, and this is one
symptom. So I am not sure multiclass should mean "parse target as double" to
begin with.

OK, it's not the issue here. But we're on the subject of an experimental API
subject to change with an example of something related that could be improved
along the way, and it's my #1 wish for MLlib at the moment. I'd really like to
work on a change to try to accommodate classes as, say, strings at least, and
not presume doubles. But I am trying to figure out if anyone agrees with that.

