That doc is about the MLlib pipeline functionality. What about an Oozie-like workflow scheduler?

2014-09-17 13:08 GMT+04:00 Mark Hamstra <m...@clearstorydata.com>:

> See https://issues.apache.org/jira/browse/SPARK-3530 and this doc,
> referenced in that JIRA:
>
> https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
>
> On Wed, Sep 17, 2014 at 2:00 AM, Egor Pahomov <pahomov.e...@gmail.com>
> wrote:
>
>> I have problems using Oozie. For example it doesn't sustain spark
>> context like ooyola job server does. Other than GUI interfaces like
>> HUE it's hard to work with - scoozie stopped in development year
>> ago(I spoke with creator) and oozie xml very hard to write.
>> Oozie still have all documentation and code in MR model rather than
>> in yarn model. And based on it's current speed of development I can't
>> expect radical changes in nearest future. There is no "Databricks"
>> for oozie, which would have people on salary to develop this kind of
>> radical changes. It's dinosaur.
>>
>> Reunold, can you help finding this doc? Do you mean just pipelining
>> spark code or additional logic of persistence tasks, job server, task
>> retry, data availability and extra?
>>
>>
>> 2014-09-17 11:21 GMT+04:00 Reynold Xin <r...@databricks.com>:
>>
>> > Hi Egor,
>> >
>> > I think the design doc for the pipeline feature has been posted.
>> >
>> > For the workflow, I believe Oozie actually works fine with Spark if
>> > you want some external workflow system. Do you have any trouble
>> > using that?
>> >
>> >
>> > On Tue, Sep 16, 2014 at 11:45 PM, Egor Pahomov
>> > <pahomov.e...@gmail.com> wrote:
>> >
>> >> There are two things we(Yandex) miss in Spark: MLlib good
>> >> abstractions and good workflow job scheduler. From threads "Adding
>> >> abstraction in MlLib" and "[mllib] State of Multi-Model training"
>> >> I got the idea, that databricks working on it and we should wait
>> >> until first post doc, which would lead us.
>> >> What about workflow scheduler? Is there anyone already working on
>> >> it? Does anyone have a plan on doing it?
>> >>
>> >> P.S. We thought that MLlib abstractions about multiple algorithms
>> >> run with same data would need such scheduler, which would rerun
>> >> algorithm in case of failure. I understand, that spark provide
>> >> fault tolerance out of the box, but we found some "Ooozie-like"
>> >> scheduler more reliable for such long living workflows.
>> >>
>> >> --
>> >>
>> >> *Sincerely yours
>> >> Egor Pakhomov
>> >> Scala Developer, Yandex*
>> >>
>> >
>> >
>>
>>
>> --
>>
>> *Sincerely yours
>> Egor Pakhomov
>> Scala Developer, Yandex*
>
>

--
*Sincerely yours
Egor Pakhomov
Scala Developer, Yandex*