[jira] [Comment Edited] (SPARK-7126) For spark.ml Classifiers, automatically index labels if they are not yet indexed

Manoj Kumar (JIRA) Mon, 13 Jul 2015 13:37:56 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625289#comment-14625289
 ]


Manoj Kumar edited comment on SPARK-7126 at 7/13/15 8:36 PM:
-------------------------------------------------------------

[~josephkb]

1. In scikit-learn, predict outputs the same labels as the inputs. Internally, 
we use sklearn.preprocessing.LabelEncoder to encode the input labels into 
[0, 1, ..., n_labels - 1], where the numerically smallest label gets zero. This 
is in contrast to StringIndexer, which gives the most frequent label the 
smallest index.

2. I'm not sure it is necessary to show users what is being done 
internally. Shouldn't it be sufficient to just give them the predicted output 
in terms of the input labels? (I'm highly biased by my previous experience 
with sklearn ;) )
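
To make the two points above concrete, here is a rough pure-Python sketch. It 
is not the scikit-learn or Spark code; it only mimics the two index-assignment 
rules described in point 1, and then maps indices back to the input labels as 
suggested in point 2. The helper names (sorted_index, frequency_index, decode) 
are made up for illustration, and the tie-breaking rule in frequency_index is 
an assumption added to keep the output deterministic.

```python
from collections import Counter


def sorted_index(labels):
    """Assign indices by sorted label value (smallest label -> 0).

    This mimics the ordering rule described for LabelEncoder.
    """
    classes = sorted(set(labels))
    return {label: i for i, label in enumerate(classes)}


def frequency_index(labels):
    """Assign indices by descending frequency (most frequent label -> 0).

    This mimics the ordering rule described for StringIndexer.
    Ties are broken by label value; that rule is an assumption of this sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def decode(indices, index):
    """Map indices (e.g. predictions) back to the original labels."""
    inverse = {i: label for label, i in index.items()}
    return [inverse[i] for i in indices]


labels = ["b", "c", "c", "a", "c", "b"]

by_value = sorted_index(labels)     # 'a' is smallest, so it gets 0
by_freq = frequency_index(labels)   # 'c' is most frequent, so it gets 0

# Encode with one scheme, then decode so the user only sees input labels.
encoded = [by_freq[label] for label in labels]
decoded = decode(encoded, by_freq)
```

The same labels end up with different indices under the two rules, which is 
why returning predictions in terms of the input labels hides the difference 
from the user.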

Should we split the JIRA for different classifiers? (I haven't read the code 
yet, so I'm not quite sure if there is a generic way of doing this across all 
classifiers)



was (Author: mechcoder):
[~josephkb]

1. In scikit-learn predict outputs the same labels as the inputs. (Internally 
we use sklearn.preprocessing.LabelEncoder) to encode the input labels into [0, 
1, .. n_labels - 1] in contrast to StringIndexer which gives the most frequent 
label the smallest.

2. I'm not sure it is necessary to show the users, what is being done 
internally. Should it not be sufficient to just give them the predicted output 
in terms of the input labels (I'm highly biased based on my previous experience 
in sklearn ;) )

Should we split the JIRA for different classifiers? (I haven't read the code 
yet, so I'm not quite sure if there is a generic way of doing this across all 
classifiers)


> For spark.ml Classifiers, automatically index labels if they are not yet 
> indexed
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-7126
>                 URL: https://issues.apache.org/jira/browse/SPARK-7126
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.4.0
>            Reporter: Joseph K. Bradley
>
> Now that we have StringIndexer, we could have 
> spark.ml.classification.Classifier (the abstraction) automatically handle 
> label indexing if the labels are not yet indexed.
> This would require a bit of design:
> * Should predict() output the original labels or the indices?
> * How should we notify users that the labels are being automatically indexed?
> * How should we provide that index to the users?
> * If multiple parts of a Pipeline automatically index labels, what do we need 
> to do to make sure they are consistent?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

