[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140260#comment-14140260 ]
Egor Pakhomov commented on SPARK-3530:
--------------------------------------

Nice doc. Passing parameters as part of grid search and pipeline creation is a great and important feature, but it's only one of the features. For me it's more important to see the Estimator abstraction in the Spark code base early: maybe not earlier than introducing the dataset abstraction, but definitely earlier than any work on grid search. When we were thinking about creating such a pipeline framework, we came to the conclusion that the transformations in a pipeline are like steps in an Oozie workflow: they should be easily retrievable, persisted, and queued. A transformation can take hours, so rerunning the whole pipeline when a single step fails is expensive. A pipeline can also include grid search over parameters, which means many parallel executions that need careful scheduling. So I think the pipeline should be executed on some cluster-wide scheduler with some persistence. I'm not saying this is absolutely necessary now, but it would be great to have the architecture open to that possibility.

> Pipeline and Parameters
> -----------------------
>
>         Key: SPARK-3530
>         URL: https://issues.apache.org/jira/browse/SPARK-3530
>     Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>    Reporter: Xiangrui Meng
>    Assignee: Xiangrui Meng
>    Priority: Critical
>
> This part of the design doc is for pipelines and parameters. I put the design doc at
> https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
> I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at:
> https://github.com/mengxr/spark-ml/
> Please help review the design and post your comments here. Thanks!
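To make the persisted-pipeline idea in the comment above concrete, here is a minimal, self-contained sketch in plain Scala (no Spark dependency). The names Transformer, Estimator, Pipeline, StageStore, and MeanCenterer are illustrative only, not the actual interfaces proposed in the design doc; the point is just that each stage's output is checkpointed, so a failed run can resume instead of recomputing hours of work.

    import java.io.{File, FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

    object PipelineSketch {

      // Stand-in for a Spark dataset (a SchemaRDD / DataFrame in the design doc).
      type Dataset = Seq[Map[String, Double]]

      // A Transformer maps one dataset to another (feature extraction, a fitted model, ...).
      trait Transformer extends Serializable {
        def transform(data: Dataset): Dataset
      }

      // An Estimator is fit on a dataset and produces a Transformer (the model).
      trait Estimator extends Serializable {
        def fit(data: Dataset): Transformer
      }

      // Example estimator: learns per-column means and returns a centering transformer.
      object MeanCenterer extends Estimator {
        def fit(data: Dataset): Transformer = {
          val keys  = data.headOption.map(_.keys.toSeq).getOrElse(Seq.empty)
          val means = keys.map(k => k -> data.map(_(k)).sum / data.size).toMap
          new Transformer {
            def transform(d: Dataset): Dataset =
              d.map(row => row.map { case (k, v) => k -> (v - means(k)) })
          }
        }
      }

      // Hypothetical persistence layer: each stage's output is written to disk,
      // so a rerun skips stages that already completed.
      class StageStore(dir: File) {
        dir.mkdirs()
        private def file(stage: String) = new File(dir, stage + ".bin")

        def load(stage: String): Option[Dataset] = {
          val f = file(stage)
          if (!f.exists()) None
          else {
            val in = new ObjectInputStream(new FileInputStream(f))
            try Some(in.readObject().asInstanceOf[Dataset]) finally in.close()
          }
        }

        def save(stage: String, data: Dataset): Unit = {
          val out = new ObjectOutputStream(new FileOutputStream(file(stage)))
          try out.writeObject(data) finally out.close()
        }
      }

      // A pipeline of named transformer stages. Each stage is checkpointed, so a
      // failure in stage N does not force recomputing stages 1..N-1 on the next run.
      class Pipeline(stages: Seq[(String, Transformer)], store: StageStore) {
        def run(input: Dataset): Dataset =
          stages.foldLeft(input) { case (data, (name, stage)) =>
            store.load(name).getOrElse {
              val out = stage.transform(data) // potentially hours of work
              store.save(name, out)
              out
            }
          }
      }
    }

Grid search would then launch many such runs in parallel, one per parameter combination, each with its own store; deciding where and in what order those runs execute is the cluster-wide scheduling question raised in the comment.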