[ https://issues.apache.org/jira/browse/SPARK-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625289#comment-14625289 ]
Manoj Kumar edited comment on SPARK-7126 at 7/13/15 8:36 PM:
-------------------------------------------------------------

[~josephkb]

1. In scikit-learn, predict outputs the same labels as the inputs. Internally we use sklearn.preprocessing.LabelEncoder to encode the input labels into [0, 1, ..., n_labels - 1] (the numerically smallest label gets zero), in contrast to StringIndexer, which gives the most frequent label the smallest index.

2. I'm not sure it is necessary to show users what is being done internally. Should it not be sufficient to just give them the predicted output in terms of the input labels? (I'm highly biased by my previous experience with sklearn ;) )

Should we split the JIRA for different classifiers? (I haven't read the code yet, so I'm not sure whether there is a generic way of doing this across all classifiers.)


was (Author: mechcoder):
[~josephkb]

1. In scikit-learn, predict outputs the same labels as the inputs. Internally we use sklearn.preprocessing.LabelEncoder to encode the input labels into [0, 1, ..., n_labels - 1], in contrast to StringIndexer, which gives the most frequent label the smallest index.

2. I'm not sure it is necessary to show users what is being done internally. Should it not be sufficient to just give them the predicted output in terms of the input labels? (I'm highly biased by my previous experience with sklearn ;) )

Should we split the JIRA for different classifiers? (I haven't read the code yet, so I'm not sure whether there is a generic way of doing this across all classifiers.)


> For spark.ml Classifiers, automatically index labels if they are not yet indexed
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-7126
>                 URL: https://issues.apache.org/jira/browse/SPARK-7126
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.4.0
>            Reporter: Joseph K. Bradley
>
> Now that we have StringIndexer, we could have spark.ml.classification.Classifier (the abstraction) automatically handle label indexing if the labels are not yet indexed.
> This would require a bit of design:
> * Should predict() output the original labels or the indices?
> * How should we notify users that the labels are being automatically indexed?
> * How should we provide that index to the users?
> * If multiple parts of a Pipeline automatically index labels, what do we need to do to make sure they are consistent?
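
For illustration, the two index orderings contrasted in the comment can be sketched in plain Python. This is not the scikit-learn or Spark implementation: the helper names (sorted_order_index, frequency_order_index, decode) are hypothetical, and the tie-break used for equally frequent labels is an assumption of this sketch. It also shows how predictions could be mapped back to the original labels, which is the first design question in the issue.

```python
from collections import Counter


def sorted_order_index(labels):
    """Rank distinct labels in sorted order, so the smallest label gets 0
    (the ordering the comment attributes to LabelEncoder)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def frequency_order_index(labels):
    """Rank distinct labels by descending frequency, so the most frequent
    label gets 0 (the ordering the comment attributes to StringIndexer).
    Ties fall back to sorted label order; that is an assumption here."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def decode(indices, index):
    """Map predicted indices back to the original labels."""
    inverse = {i: label for label, i in index.items()}
    return [inverse[i] for i in indices]


labels = [3, 7, 7, 5, 7, 5]

by_value = sorted_order_index(labels)    # {3: 0, 5: 1, 7: 2}
by_freq = frequency_order_index(labels)  # {7: 0, 5: 1, 3: 2}

# Whichever ordering is used internally, decoding with the same index
# returns predictions in terms of the input labels.
encoded = [by_freq[label] for label in labels]
restored = decode(encoded, by_freq)
```

The same label (7) ends up with index 2 under one scheme and index 0 under the other, so an index produced by one part of a Pipeline cannot be decoded with an index built by another unless both use the same mapping.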