Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/3975#issuecomment-71338040
  
    @Lewuathe  I believe @mengxr meant that users should prepare transformers 
whenever they use algorithms, as follows:
    * User loads data
    * User maps labels to 0,1,2,... to create newData
    * User passes newData to learning algorithm.
    * Learning algorithm validates newData, throwing an exception if the labels 
are not in 0,1,2,...
    * User calls model.predict to get predicted labels in 0,1,2,...
    * User can transform those predicted labels back to the original labels if 
needed.
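
    The steps above can be sketched in plain Python (illustrative only, not
    the Spark API; the function names here are hypothetical):

```python
def fit_index(labels):
    """Map each distinct label to an index 0, 1, 2, ... (first-seen order)."""
    index = {}
    for label in labels:
        if label not in index:
            index[label] = len(index)
    return index

def validate(indexed, num_labels):
    """Raise, as the learning algorithm would, if a label is outside 0..num_labels-1."""
    for i in indexed:
        if not 0 <= i < num_labels:
            raise ValueError(f"label {i} is not in 0..{num_labels - 1}")

def invert(index):
    """Inverse mapping, used to translate predictions back to the original labels."""
    return {i: label for label, i in index.items()}
```

    So `fit_index` corresponds to the user-side mapping step, `validate` to the
    check inside the learning algorithm, and `invert` to mapping predictions
    back afterwards.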
    
    I'm fine with this.  We'll just have to make the mapping as easy as 
possible.  The main complication I see in the above workflow is in the 
pipelines API: a user would want to add a Transformer to the Pipeline to map 
labels to 0,1,2,..., and another Transformer to map the predictions back to 
the original set of labels.  But those two Transformers should be linked 
somehow, so that fitting the first (to compute the label index/dictionary) 
also fits the second.  @mengxr  Any thoughts on this?  It sounds doable, but 
it adds a new sort of pipeline feature (linked PipelineStages).
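
    A minimal sketch of what "linked" stages could mean, in plain Python: a
    single fit computes the label dictionary and yields both transformers,
    which share it.  All names here are hypothetical, not the Pipeline API:

```python
class LabelIndexer:
    """Forward stage: original label -> 0, 1, 2, ..."""
    def __init__(self, index):
        self.index = index

    def transform(self, labels):
        return [self.index[label] for label in labels]

class IndexToLabel:
    """Inverse stage: 0, 1, 2, ... -> original label."""
    def __init__(self, inverse):
        self.inverse = inverse

    def transform(self, indices):
        return [self.inverse[i] for i in indices]

def fit_linked(labels):
    """Fit once; return the linked pair built from the same dictionary."""
    index = {}
    for label in labels:
        if label not in index:
            index[label] = len(index)
    inverse = {i: label for label, i in index.items()}
    return LabelIndexer(index), IndexToLabel(inverse)
```

    The point is that the inverse stage is never fitted on its own; it falls
    out of fitting the forward stage, which is the linkage the Pipeline would
    need to express.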

