[jira] [Commented] (SPARK-3714) Spark workflow scheduler

2014-09-29 Thread Egor Pakhomov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151387#comment-14151387 ]

Egor Pakhomov commented on SPARK-3714:
--

Yes, I tried; please see the [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing] for an explanation of why Oozie is not good enough.

 Spark workflow scheduler
 ------------------------

                 Key: SPARK-3714
                 URL: https://issues.apache.org/jira/browse/SPARK-3714
             Project: Spark
          Issue Type: New Feature
          Components: Project Infra
            Reporter: Egor Pakhomov
            Priority: Minor

 [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing]
 The Spark stack is currently hard to use in production processes due to the lack 
 of the following features:
 * Scheduling Spark jobs
 * Retrying a failed Spark job in a big pipeline
 * Sharing a context among jobs in a pipeline
 * Queueing jobs
 A typical use case for such a platform would be: wait for new data, process the 
 new data, train ML models on the new data, compare the new model with the 
 previous one, and in case of success replace the current production model in its 
 HDFS directory with the new one.
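
 For illustration only, here is a minimal Scala sketch of what a pipeline covering 
 these points might look like. The Pipeline and Step types, the retry loop, and 
 the HDFS paths are hypothetical - a sketch of the idea, not an existing API in 
 Spark or in any scheduler:

 {code:scala}
 import org.apache.spark.{SparkConf, SparkContext}

 // Purely illustrative: Step and Pipeline only mirror the feature list above
 // (shared context, per-step retries, steps executed in queue order).
 case class Step(name: String, maxRetries: Int, body: SparkContext => Unit)

 class Pipeline(name: String, steps: Seq[Step]) {
   // One shared SparkContext for the whole pipeline, so steps can reuse cached data.
   private val sc = new SparkContext(new SparkConf().setAppName(name))

   def run(): Unit = {
     steps.foreach { step =>                      // steps run in queue order
       var attempt = 0
       var done = false
       while (!done) {
         try { step.body(sc); done = true }
         catch {
           case e: Exception if attempt < step.maxRetries =>
             attempt += 1                         // retry just the failed step
           case e: Exception =>
             sc.stop(); throw e                   // retries exhausted: fail the pipeline
         }
       }
     }
     sc.stop()
   }
 }

 // Hypothetical wiring of the use case above; paths and model logic are placeholders.
 new Pipeline("daily-model-refresh", Seq(
   Step("process-new-data", 3, sc => { sc.textFile("hdfs:///data/incoming/*").count(); () }),
   Step("train-compare-promote", 1, sc => { /* train model, compare, overwrite HDFS dir */ })
 )).run()
 {code}

 The "wait for new data" part (scheduling) would sit outside such a sketch, e.g. as 
 a trigger that decides when to call run().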






[jira] [Commented] (SPARK-3714) Spark workflow scheduler

2014-09-29 Thread Sean Owen (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151422#comment-14151422 ]

Sean Owen commented on SPARK-3714:
--

Another meta-question for everyone: at what point should a project like this 
simply be a separate add-on project? For example, Oozie is a stand-alone 
project. Not everything needs to happen directly under the Spark umbrella, 
which is already broad. One upside to including it in Spark is that it perhaps 
gets more attention. Spark is then forced to maintain it and keep it compatible, 
which is also a downside, I suppose. There is also the effect of creating an 
official workflow engine and discouraging others.

I am more asking the question than suggesting an answer, but my reaction was 
that this could live outside Spark just fine.







[jira] [Commented] (SPARK-3714) Spark workflow scheduler

2014-09-29 Thread Egor Pakhomov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151458#comment-14151458 ]

Egor Pakhomov commented on SPARK-3714:
--

I agree with your concerns - it should be a project separate from the Spark 
codebase. But it can't be a general-purpose workflow engine like Oozie: there are 
too many essential Spark-specific things in the new engine, like supporting a 
shared Spark context and providing the capability to write simple jobs in Scala 
from HUE.

I think the best approach for such an engine is something like the Ooyala job 
server - a separate project, but Spark oriented.
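
As a rough illustration only (this is not the actual Ooyala spark-jobserver API), a 
Spark-oriented engine could hand each pipeline step a long-lived, shared 
SparkContext through a job contract along these lines; the trait and method names 
below are assumptions:

{code:scala}
import org.apache.spark.SparkContext

// Hypothetical job contract for a shared-context engine (illustrative only).
trait SharedContextJob {
  // Check arguments before the scheduler commits cluster resources to this step.
  def validate(sc: SparkContext, args: Map[String, String]): Either[String, Unit]
  // Run against the shared context; the result can be handed to the next step.
  def run(sc: SparkContext, args: Map[String, String]): Any
}

// Example step: count records under an input path supplied by the scheduler.
class CountNewData extends SharedContextJob {
  def validate(sc: SparkContext, args: Map[String, String]): Either[String, Unit] =
    if (args.contains("input")) Right(()) else Left("missing 'input' argument")
  def run(sc: SparkContext, args: Map[String, String]): Any =
    sc.textFile(args("input")).count()
}
{code}

Because consecutive steps see the same context, cached RDDs survive between them 
and per-step JVM startup cost disappears - which a process-per-action launcher such 
as Oozie cannot easily provide.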







[jira] [Commented] (SPARK-3714) Spark workflow scheduler

2014-09-29 Thread Mridul Muralidharan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151486#comment-14151486 ]

Mridul Muralidharan commented on SPARK-3714:



Most of the drawbacks mentioned are not severe, IMO - at best, they reflect 
unfamiliarity with the Oozie platform (points 2, 3, 4, 5).
Point 1 (sharing a Spark context) is interesting - though from a fault-tolerance 
point of view it makes support challenging; of course Oozie was probably not 
designed with something like Spark in mind, so there might be changes to Oozie 
that would benefit Spark; we could engage with the Oozie developers on that.

But discarding it to reinvent something, when Oozie already does everything 
mentioned in the requirements section, seems counterintuitive.


I have seen multiple attempts to 'simplify' workflow management, and at 
production scale almost everything ends up being similar ...
Note that most production jobs have to depend on a variety of jobs - not just 
Spark or MR - so you will end up converging on a variant of Oozie anyway :-)

Having said that, if you want to take a crack at solving this with Spark-specific 
idioms in mind, it would be interesting to see the result - I don't want to 
dissuade you from doing so!
We might end up with something quite interesting.







[jira] [Commented] (SPARK-3714) Spark workflow scheduler

2014-09-28 Thread Mridul Muralidharan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151211#comment-14151211 ]

Mridul Muralidharan commented on SPARK-3714:


Have you tried using Oozie for this?
IIRC Tom already got this working quite a while back.
/CC [~tgraves]



