Re: Workflow Scheduler for Spark

2014-09-28 Thread Egor Pahomov
I created a JIRA issue (https://issues.apache.org/jira/browse/SPARK-3714)
and a design doc on this matter:
https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing

2014-09-17 22:28 GMT+04:00 Reynold Xin r...@databricks.com:

 There might've been some misunderstanding. I was referring to the MLlib
 pipeline design doc when I said the design doc was posted, in response to
 the first paragraph of your original email.


 On Wed, Sep 17, 2014 at 2:47 AM, Egor Pahomov pahomov.e...@gmail.com
 wrote:

  It's a doc about MLlib pipeline functionality. What about an Oozie-like
  workflow?
 
  2014-09-17 13:08 GMT+04:00 Mark Hamstra m...@clearstorydata.com:
 
   See https://issues.apache.org/jira/browse/SPARK-3530 and this doc,
   referenced in that JIRA:

   https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
  
   On Wed, Sep 17, 2014 at 2:00 AM, Egor Pahomov pahomov.e...@gmail.com
   wrote:
  
   I have problems using Oozie. For example, it doesn't sustain a Spark
   context like the Ooyala job server does. Other than through GUI
   interfaces like HUE, it's hard to work with: Scoozie stopped development
   a year ago (I spoke with its creator), and Oozie XML is very hard to
   write. Oozie still has all of its documentation and code in the MR model
   rather than the YARN model. And based on its current speed of
   development, I can't expect radical changes in the near future. There is
   no Databricks for Oozie, which would have people on salary to develop
   this kind of radical change. It's a dinosaur.

   Reynold, can you help me find this doc? Do you mean just pipelining Spark
   code, or additional logic for task persistence, a job server, task retry,
   data availability, etc.?
  
  
   2014-09-17 11:21 GMT+04:00 Reynold Xin r...@databricks.com:
  
    Hi Egor,

    I think the design doc for the pipeline feature has been posted.

    For the workflow, I believe Oozie actually works fine with Spark if you
    want some external workflow system. Do you have any trouble using that?


    On Tue, Sep 16, 2014 at 11:45 PM, Egor Pahomov pahomov.e...@gmail.com
    wrote:
   
     There are two things we (Yandex) miss in Spark: good MLlib abstractions
     and a good workflow job scheduler. From the threads "Adding abstraction
     in MLlib" and "[mllib] State of Multi-Model training" I got the idea
     that Databricks is working on the former and that we should wait for
     the first posted doc, which would guide us.
     What about a workflow scheduler? Is anyone already working on it? Does
     anyone have a plan to do it?

     P.S. We thought that MLlib abstractions for running multiple algorithms
     on the same data would need such a scheduler, which would rerun an
     algorithm in case of failure. I understand that Spark provides fault
     tolerance out of the box, but we found an Oozie-like scheduler more
     reliable for such long-lived workflows.
   

-- 



Sincerely yours,
Egor Pakhomov
Scala Developer, Yandex


Re: Workflow Scheduler for Spark

2014-09-17 Thread Mark Hamstra
See https://issues.apache.org/jira/browse/SPARK-3530 and this doc,
referenced in that JIRA:

https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing


Re: Workflow Scheduler for Spark

2014-09-17 Thread Egor Pahomov
It's a doc about MLlib pipeline functionality. What about an Oozie-like
workflow?


-- 



Sincerely yours,
Egor Pakhomov
Scala Developer, Yandex