I created a Jira issue <https://issues.apache.org/jira/browse/SPARK-3714> and a design doc <https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing> on this matter.

2014-09-17 22:28 GMT+04:00 Reynold Xin <r...@databricks.com>:

> There might've been some misunderstanding. I was referring to the MLlib
> pipeline design doc when I said the design doc was posted, in response to
> the first paragraph of your original email.
>
> On Wed, Sep 17, 2014 at 2:47 AM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
>
> > It's a doc about MLlib pipeline functionality. What about an Oozie-like
> > workflow?
> >
> > 2014-09-17 13:08 GMT+04:00 Mark Hamstra <m...@clearstorydata.com>:
> >
> > > See https://issues.apache.org/jira/browse/SPARK-3530 and this doc,
> > > referenced in that JIRA:
> > >
> > > https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
> > >
> > > On Wed, Sep 17, 2014 at 2:00 AM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
> > >
> > > > I have problems using Oozie. For example, it doesn't sustain a Spark
> > > > context like the Ooyala job server does. Other than through GUI
> > > > interfaces like HUE, it's hard to work with: Scoozie development
> > > > stopped a year ago (I spoke with its creator), and Oozie XML is very
> > > > hard to write.
> > > > Oozie still has all its documentation and code in the MR model rather
> > > > than the YARN model. And based on its current speed of development, I
> > > > can't expect radical changes in the near future. There is no
> > > > "Databricks" for Oozie that would have people on salary to develop
> > > > this kind of radical change. It's a dinosaur.
> > > >
> > > > Reynold, can you help me find this doc? Do you mean just pipelining
> > > > Spark code, or additional logic for persistence tasks, job server,
> > > > task retry, data availability, and so on?
> > > >
> > > > 2014-09-17 11:21 GMT+04:00 Reynold Xin <r...@databricks.com>:
> > > >
> > > > > Hi Egor,
> > > > >
> > > > > I think the design doc for the pipeline feature has been posted.
> > > > >
> > > > > For the workflow, I believe Oozie actually works fine with Spark if
> > > > > you want some external workflow system. Do you have any trouble
> > > > > using that?
> > > > >
> > > > > On Tue, Sep 16, 2014 at 11:45 PM, Egor Pahomov <pahomov.e...@gmail.com> wrote:
> > > > >
> > > > > > There are two things we (Yandex) miss in Spark: good MLlib
> > > > > > abstractions and a good workflow job scheduler. From the threads
> > > > > > "Adding abstraction in MlLib" and "[mllib] State of Multi-Model
> > > > > > training" I got the idea that Databricks is working on it and
> > > > > > that we should wait for the first posted doc, which would guide
> > > > > > us. What about a workflow scheduler? Is anyone already working on
> > > > > > it? Does anyone have a plan for doing it?
> > > > > >
> > > > > > P.S. We thought that MLlib abstractions for running multiple
> > > > > > algorithms on the same data would need such a scheduler, which
> > > > > > would rerun an algorithm in case of failure. I understand that
> > > > > > Spark provides fault tolerance out of the box, but we found an
> > > > > > "Oozie-like" scheduler more reliable for such long-living
> > > > > > workflows.

--
*Sincerely yours*
*Egor Pakhomov*
*Scala Developer, Yandex*