[jira] [Commented] (SPARK-7126) For spark.ml Classifiers, automatically index labels if they are not yet indexed

2015-07-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625708#comment-14625708
 ] 

Joseph K. Bradley commented on SPARK-7126:
--

I agree we should emulate scikit-learn.  I've spoken with [~mengxr], who 
strongly supports having transform() maintain the current semantics of using 
0-based label indices.

This means that, to resolve this JIRA, we will need to add a new method 
analogous to fit() that returns a PipelineModel rather than a specific model 
(like LogisticRegressionModel).  That PipelineModel can include indexing and 
de-indexing labels, and perhaps other transformations as well.  This addition 
to the API will require significant design work, which we hope to do before 
long, but maybe not for 1.5.  I'll remove that target version.
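
To make the idea concrete, here is a minimal plain-Python sketch of a fit()-like 
method that returns a composite model wrapping indexing, prediction on 0-based 
indices, and de-indexing. It is not Spark code: LabelIndexer, NearestMeanModel, 
IndexedPipelineModel, and fit_with_label_indexing are hypothetical names invented 
for illustration, and the toy classifier merely stands in for something like 
LogisticRegressionModel.

```python
from collections import Counter


class LabelIndexer:
    """Hypothetical stand-in for a label indexer: label <-> 0-based index."""

    def __init__(self, labels):
        counts = Counter(labels)
        # Most frequent label gets index 0; ties broken by label (an assumption).
        self.labels = sorted(counts, key=lambda label: (-counts[label], label))
        self.index_of = {label: i for i, label in enumerate(self.labels)}

    def index(self, label):
        return self.index_of[label]

    def deindex(self, i):
        return self.labels[i]


class NearestMeanModel:
    """Toy 1-D classifier that only ever sees 0-based label indices."""

    def __init__(self, features, indices):
        sums, counts = Counter(), Counter()
        for x, i in zip(features, indices):
            sums[i] += x
            counts[i] += 1
        self.means = {i: sums[i] / counts[i] for i in counts}

    def predict_index(self, x):
        # Return the index whose class mean is closest to x.
        return min(self.means, key=lambda i: abs(x - self.means[i]))


class IndexedPipelineModel:
    """Composite model: the inner model predicts an index, then we de-index."""

    def __init__(self, indexer, model):
        self.indexer = indexer
        self.model = model

    def predict(self, x):
        return self.indexer.deindex(self.model.predict_index(x))


def fit_with_label_indexing(features, labels):
    """Hypothetical analogue of fit() that returns the composite model."""
    indexer = LabelIndexer(labels)
    indices = [indexer.index(label) for label in labels]
    return IndexedPipelineModel(indexer, NearestMeanModel(features, indices))


model = fit_with_label_indexing(
    [1.0, 1.2, 0.9, 5.0, 5.3], ["no", "no", "no", "yes", "yes"]
)
print(model.predict(1.1), model.predict(4.8))
```

The inner model keeps the 0-based index semantics, while callers of the 
composite model see only the original labels; the index mapping stays 
available through the composite's indexer attribute.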

 For spark.ml Classifiers, automatically index labels if they are not yet 
 indexed
 

              Key: SPARK-7126
              URL: https://issues.apache.org/jira/browse/SPARK-7126
          Project: Spark
       Issue Type: Improvement
       Components: ML
 Affects Versions: 1.4.0
         Reporter: Joseph K. Bradley

 Now that we have StringIndexer, we could have 
 spark.ml.classification.Classifier (the abstraction) automatically handle 
 label indexing if the labels are not yet indexed.
 This would require a bit of design:
 * Should predict() output the original labels or the indices?
 * How should we notify users that the labels are being automatically indexed?
 * How should we provide that index to the users?
 * If multiple parts of a Pipeline automatically index labels, what do we need 
 to do to make sure they are consistent?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7126) For spark.ml Classifiers, automatically index labels if they are not yet indexed

2015-07-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625289#comment-14625289
 ] 

Manoj Kumar commented on SPARK-7126:


[~josephkb]

1. In scikit-learn, predict outputs the same labels as the inputs. Internally, 
we use sklearn.preprocessing.LabelEncoder to encode the input labels into [0, 
1, ..., n_labels - 1], in contrast to StringIndexer, which gives the most 
frequent label the smallest index.

2. I'm not sure it is necessary to show users what is being done internally. 
Should it not be sufficient to just give them the predicted output in terms of 
the input labels? (I'm highly biased by my previous experience with 
sklearn ;) )
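
The difference between the two encodings in point 1 can be sketched in plain 
Python. This is only an illustration of the two orderings, not the code of 
either library: sorted_index and frequency_index are hypothetical helper 
names, and the tie-breaking rule in the frequency-based version is an 
assumption made to keep the result deterministic.

```python
from collections import Counter


def sorted_index(labels):
    """Map each distinct label to its rank in sorted order (LabelEncoder-style)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def frequency_index(labels):
    """Map labels to indices by descending frequency (StringIndexer-style).

    Assumption: ties are broken by label order so the result is deterministic.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


labels = ["spam", "ham", "spam", "eggs", "spam", "ham"]
by_order = sorted_index(labels)
by_frequency = frequency_index(labels)
print(by_order)
print(by_frequency)
```

With the sorted encoding, "eggs" gets index 0 because it sorts first; with the 
frequency-based encoding, "spam" gets index 0 because it occurs most often.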

Should we split the JIRA for different classifiers? (I haven't read the code 
yet, so I'm not quite sure if there is a generic way of doing this across all 
classifiers)


