Re: Workflow Scheduler for Spark
I created a JIRA issue, https://issues.apache.org/jira/browse/SPARK-3714, and a design doc, https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing, on this matter.

2014-09-17 22:28 GMT+04:00 Reynold Xin r...@databricks.com:

There might've been some misunderstanding. I was referring to the MLlib pipeline design doc when I said the design doc was posted, in response to the first paragraph of your original email.

On Wed, Sep 17, 2014 at 2:47 AM, Egor Pahomov pahomov.e...@gmail.com wrote:

It's a doc about MLlib pipeline functionality. What about an Oozie-like workflow?

2014-09-17 13:08 GMT+04:00 Mark Hamstra m...@clearstorydata.com:

See https://issues.apache.org/jira/browse/SPARK-3530 and this doc, referenced in that JIRA: https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

On Wed, Sep 17, 2014 at 2:00 AM, Egor Pahomov pahomov.e...@gmail.com wrote:

I have problems using Oozie. For example, it doesn't sustain a Spark context the way the Ooyala job server does. Apart from GUI interfaces like Hue, it's hard to work with: Scoozie stopped development a year ago (I spoke with its creator), and Oozie XML is very hard to write. Oozie still has all its documentation and code in the MR model rather than in the YARN model. And based on its current speed of development, I can't expect radical changes in the near future. There is no Databricks for Oozie, which would have people on salary to develop this kind of radical change. It's a dinosaur.

Reynold, can you help me find this doc? Do you mean just pipelining Spark code, or additional logic for task persistence, a job server, task retry, data availability, and so on?

2014-09-17 11:21 GMT+04:00 Reynold Xin r...@databricks.com:

Hi Egor, I think the design doc for the pipeline feature has been posted. For the workflow, I believe Oozie actually works fine with Spark if you want some external workflow system. Do you have any trouble using that?

On Tue, Sep 16, 2014 at 11:45 PM, Egor Pahomov pahomov.e...@gmail.com wrote:

There are two things we (Yandex) miss in Spark: good MLlib abstractions and a good workflow job scheduler. From the threads "Adding abstraction in MlLib" and "[mllib] State of Multi-Model training" I got the idea that Databricks is working on it and that we should wait for the first posted doc, which would guide us. What about a workflow scheduler? Is anyone already working on it? Does anyone have a plan for doing it?

P.S. We thought that MLlib abstractions for running multiple algorithms on the same data would need such a scheduler, which would rerun an algorithm in case of failure. I understand that Spark provides fault tolerance out of the box, but we found an Oozie-like scheduler more reliable for such long-living workflows.

--
Sincerely yours,
Egor Pakhomov
Scala Developer, Yandex
Re: Workflow Scheduler for Spark
See https://issues.apache.org/jira/browse/SPARK-3530 and this doc, referenced in that JIRA: https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
Re: Workflow Scheduler for Spark
It's a doc about MLlib pipeline functionality. What about an Oozie-like workflow?
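
Editor's note: the P.S. in the original message asks for a scheduler that reruns an algorithm when it fails. The following is only a minimal sketch of that idea in plain Python, not the design proposed in SPARK-3714 and not how Oozie or Spark implement retries. The names `run_with_retry` and `run_workflow` are hypothetical, and each step is assumed to be an ordinary callable that raises an exception on failure.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_with_retry(step: Callable[[], object],
                   max_attempts: int = 3,
                   delay_s: float = 0.0) -> object:
    """Run one workflow step, rerunning it if it raises (hypothetical helper)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except Exception as err:  # any failure triggers a rerun
            last_error = err
            if attempt < max_attempts:
                time.sleep(delay_s)  # optional pause before the next attempt
    raise RuntimeError(
        f"step failed after {max_attempts} attempts") from last_error


def run_workflow(steps: List[Tuple[str, Callable[[], object]]],
                 max_attempts: int = 3) -> Dict[str, object]:
    """Run named steps in order; each step is retried independently."""
    results: Dict[str, object] = {}
    for name, step in steps:
        results[name] = run_with_retry(step, max_attempts)
    return results
```

In this sketch a step that keeps failing aborts the whole workflow once its attempts are exhausted. A real scheduler would also need the persistence, job server, and data-availability logic raised earlier in the thread.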