Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3975#issuecomment-71338040

@Lewuathe I believe @mengxr meant that users should prepare transformers whenever they use algorithms, as follows:

* User loads data.
* User maps labels to 0,1,2,... to create newData.
* User passes newData to the learning algorithm.
* The learning algorithm validates newData, throwing an exception if the labels are not in 0,1,2,...
* User calls model.predict to get predicted labels in 0,1,2,...
* User can transform those predicted labels back to the original labels if needed.

I'm fine with this. We'll just have to make this mapping as easy as possible to do.

The main complication I see in the above workflow is in the pipelines API: a user would want to add a Transformer to the Pipeline to map labels to 0,1,2,..., and would need another Transformer to map the predictions back to the original set of labels. But those 2 Transformers should be linked somehow, so that fitting the first (to compute the label index/dictionary) also fits the second.

@mengxr Any thoughts on this? It sounds doable, but it adds a new sort of pipeline feature (linked PipelineStages).
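For concreteness, the workflow above can be sketched in plain Python (the class and function names here are illustrative only, not Spark's actual API). The single fitted label dictionary is what "links" the forward and inverse transforms:

```python
class LabelIndexer:
    """Maps arbitrary labels to contiguous indices 0,1,2,... and back.

    The fitted dictionary is shared by transform() and inverse_transform(),
    which is the "linked transformers" idea: fitting the forward mapping
    also determines the inverse mapping.
    """

    def fit(self, labels):
        # Build the label -> index dictionary from the observed labels.
        self.labels_ = sorted(set(labels))
        self.index_ = {lab: i for i, lab in enumerate(self.labels_)}
        return self

    def transform(self, labels):
        # Original labels -> indices in 0,1,2,...
        return [self.index_[lab] for lab in labels]

    def inverse_transform(self, indices):
        # Indices in 0,1,2,... -> original labels.
        return [self.labels_[i] for i in indices]


def validate_labels(indexed_labels, num_classes):
    # What the learning algorithm would do: raise if any label
    # is not an integer in 0..num_classes-1.
    for y in indexed_labels:
        if not (isinstance(y, int) and 0 <= y < num_classes):
            raise ValueError(
                "label %r is not in 0,1,...,%d" % (y, num_classes - 1))


# Walk through the steps from the comment:
raw_labels = ["cat", "dog", "cat", "fish"]          # user loads data
indexer = LabelIndexer().fit(raw_labels)            # fit the label dictionary
new_data = indexer.transform(raw_labels)            # -> [0, 1, 0, 2]
validate_labels(new_data, num_classes=3)            # algorithm validates
predictions = [2, 1, 0]                             # stand-in for model.predict output
original = indexer.inverse_transform(predictions)   # -> ["fish", "dog", "cat"]
```

In a real pipeline, the inverse transform would be a second PipelineStage that shares the fitted dictionary with the first, rather than a method on the same object.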