[ 
https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14140260#comment-14140260
 ] 

Egor Pakhomov commented on SPARK-3530:
--------------------------------------

Nice doc. 
Parameter passing as part of grid search and pipeline creation is a great and
important feature, but it's only one of the features. For me it's more
important to see the Estimator abstraction in the Spark code base early, maybe
not earlier than the introduction of the dataset abstraction, but definitely
earlier than any work on grid search.

When we were thinking about creating such a pipeline framework, we came to the
conclusion that transformations in this pipeline are like steps in an Oozie
workflow: they should be easily retriable, be persisted, and have some queue.
That's because a transformation can take hours, and rerunning the whole
pipeline when a single step fails is expensive. A pipeline can also contain a
grid search over parameters, which means there are a lot of parallel
executions that need wise scheduling. So I think a pipeline should be executed
on some cluster-wide scheduler with some persistence. I'm not saying that it's
absolutely necessary now, but it would be great to have an architecture open to
such a possibility.
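
To illustrate the persistence point, here is a minimal Python sketch. It is
not Spark or Oozie code, and every name in it (`run_pipeline`,
`checkpoint_dir`, the step names) is hypothetical. It shows a runner that
persists each step's output, so a rerun after a failure resumes from the last
completed step instead of recomputing everything.

```python
import os
import pickle


def run_pipeline(steps, data, checkpoint_dir):
    """Run (name, function) steps in order, persisting each step's output.

    Hypothetical sketch: a step whose checkpoint file already exists is
    skipped, and its saved output is loaded instead of being recomputed.
    Returns the final output and the names of the steps computed in this run.
    """
    os.makedirs(checkpoint_dir, exist_ok=True)
    executed = []
    for index, (name, func) in enumerate(steps):
        path = os.path.join(checkpoint_dir, f"{index:03d}_{name}.pkl")
        if os.path.exists(path):
            # Step finished in an earlier run: reuse its persisted output.
            with open(path, "rb") as fh:
                data = pickle.load(fh)
            continue
        data = func(data)
        executed.append(name)
        # Write to a temporary file first, then rename, so a crash cannot
        # leave a half-written checkpoint that looks complete.
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as fh:
            pickle.dump(data, fh)
        os.replace(tmp_path, path)
    return data, executed
```

If the second of two steps fails, calling `run_pipeline` again with the same
`checkpoint_dir` loads the first step's output from disk and only reruns the
second. A real implementation would presumably keep checkpoints in durable
shared storage and hand the steps to a cluster-wide scheduler; this sketch
covers neither.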

> Pipeline and Parameters
> -----------------------
>
>                 Key: SPARK-3530
>                 URL: https://issues.apache.org/jira/browse/SPARK-3530
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>            Priority: Critical
>
> This part of the design doc is for pipelines and parameters. I put the design 
> doc at
> https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
> I will copy the proposed interfaces to this JIRA later. Some sample code can 
> be viewed at: https://github.com/mengxr/spark-ml/
> Please help review the design and post your comments here. Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
