[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151387#comment-14151387 ]

Egor Pakhomov commented on SPARK-3714:

Yes, I tried. Please see the [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing] for an explanation of why Oozie is not good enough.

Spark workflow scheduler

Key: SPARK-3714
URL: https://issues.apache.org/jira/browse/SPARK-3714
Project: Spark
Issue Type: New Feature
Components: Project Infra
Reporter: Egor Pakhomov
Priority: Minor

[Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing]

The Spark stack is currently hard to use in production processes because it lacks the following features:
* Scheduling Spark jobs
* Retrying a failed Spark job in a big pipeline
* Sharing a context among jobs in a pipeline
* Queueing jobs

A typical use case for such a platform would be: wait for new data, process the new data, train ML models on it, compare the new model with the previous one and, on success, replace the current production model in its HDFS directory with the new one.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
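As a rough illustration of the use case in the issue description above, here is a minimal Python sketch of a pipeline runner that shares one context among its steps and retries only the step that failed. It is not taken from the design doc and is not an existing Spark or Oozie API: `run_pipeline`, the step functions, and the `context` dictionary (a stand-in for a shared Spark context) are all hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List, Tuple

# A step is a (name, function) pair; the function mutates the shared context.
Step = Tuple[str, Callable[[Dict], None]]


def run_pipeline(steps: List[Step], context: Dict, max_retries: int = 2) -> List[str]:
    """Run steps in order on one shared context, retrying only a failing step."""
    log: List[str] = []
    for name, step in steps:
        for attempt in range(1, max_retries + 2):
            try:
                step(context)
                log.append(f"{name}: ok (attempt {attempt})")
                break
            except Exception as exc:
                log.append(f"{name}: failed (attempt {attempt}): {exc}")
                if attempt == max_retries + 1:
                    # Retries exhausted: stop the whole pipeline.
                    raise RuntimeError(f"step {name!r} failed after {attempt} attempts") from exc
    return log


# Hypothetical steps mirroring the described use case.
def process_new_data(ctx: Dict) -> None:
    ctx["rows"] = [1, 2, 3]  # placeholder for the processed data


def train_model(ctx: Dict) -> None:
    ctx["train_calls"] = ctx.get("train_calls", 0) + 1
    if ctx["train_calls"] == 1:
        raise RuntimeError("transient failure")  # simulated first-attempt failure
    ctx["new_score"] = 0.9  # made-up score of the new model


def promote_if_better(ctx: Dict) -> None:
    # Replace the production model only if the new one scores higher.
    if ctx["new_score"] > ctx["production_score"]:
        ctx["production_score"] = ctx["new_score"]


if __name__ == "__main__":
    shared = {"production_score": 0.8}
    for line in run_pipeline(
        [("process", process_new_data), ("train", train_model), ("promote", promote_if_better)],
        shared,
    ):
        print(line)
```

Scheduling and queueing are left out of the sketch; it only shows the retry and shared-context requirements, under the assumption that a retried step can safely be re-run against the same context.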
[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151422#comment-14151422 ]

Sean Owen commented on SPARK-3714:

Another meta-question for everyone: at what point should a project like this simply be a separate add-on project? For example, Oozie is a stand-alone project. Not everything needs to happen directly under the Spark umbrella, which is already broad. One upside to including it is that it perhaps gets more attention. Spark is forced to maintain it and keep it compatible, which is also a downside, I suppose. There is also the effect that you create an official workflow engine and discourage others. I am asking the question more than suggesting an answer, but my reaction was that this could live outside Spark just fine.
[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151458#comment-14151458 ]

Egor Pakhomov commented on SPARK-3714:

I agree with your concerns: it should be a project separate from the Spark codebase. But it can't be a general-purpose workflow engine like Oozie. The new engine has too many essential Spark-specific things, such as supporting a Spark context and providing the ability to write simple jobs in Scala from HUE. I think the best approach for such an engine is that of the Ooyala job server: a separate project, but Spark-oriented.
[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151486#comment-14151486 ]

Mridul Muralidharan commented on SPARK-3714:

Most of the drawbacks mentioned are not severe imo; at best, they reflect unfamiliarity with the Oozie platform (points 2, 3, 4, 5). Point 1 (sharing a Spark context) is interesting, though from a fault-tolerance point of view it is challenging to support. Of course, Oozie was probably not designed with something like Spark in mind, so there might be changes to Oozie that would benefit Spark; we could engage with the Oozie devs for that. But discarding it to reinvent something when Oozie already does everything mentioned in the requirements section seems counterintuitive.

I have seen multiple attempts to 'simplify' workflow management, and at production scale almost everything ends up being similar ... Note that most production jobs have to depend on a variety of jobs, not just Spark or MR, so you will end up converging on a variant of Oozie anyway :-)

Having said that, if you want to take a crack at solving this with Spark-specific idioms in mind, it would be interesting to see the result. I don't want to dissuade you from doing so! We might end up with something quite interesting.
[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151211#comment-14151211 ]

Mridul Muralidharan commented on SPARK-3714:

Have you tried using Oozie for this? IIRC Tom has already gotten this working here quite a while back. /CC [~tgraves]