That doc is about the MLlib pipeline functionality. What about an
Oozie-like workflow?

2014-09-17 13:08 GMT+04:00 Mark Hamstra <m...@clearstorydata.com>:

> See https://issues.apache.org/jira/browse/SPARK-3530 and this doc,
> referenced in that JIRA:
>
>
> https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
>
> On Wed, Sep 17, 2014 at 2:00 AM, Egor Pahomov <pahomov.e...@gmail.com>
> wrote:
>
>> I have problems using Oozie. For example, it doesn't sustain a Spark
>> context like the Ooyala job server does. Without a GUI like Hue it's hard
>> to work with: Scoozie stopped development a year ago (I spoke with its
>> creator), and Oozie XML is very hard to write.
>> Oozie still has all its documentation and code in the MR model rather
>> than in the YARN model. And based on its current speed of development, I
>> can't expect radical changes in the near future. There is no "Databricks"
>> for Oozie that would have people on salary to develop this kind of
>> radical change. It's a dinosaur.
>>
>> Reynold, can you help me find this doc? Do you mean just pipelining Spark
>> code, or additional logic for task persistence, a job server, task retry,
>> data availability and so on?
>>
>>
>> 2014-09-17 11:21 GMT+04:00 Reynold Xin <r...@databricks.com>:
>>
>> > Hi Egor,
>> >
>> > I think the design doc for the pipeline feature has been posted.
>> >
>> > For the workflow, I believe Oozie actually works fine with Spark if you
>> > want some external workflow system. Do you have any trouble using that?
>> >
>> >
>> > On Tue, Sep 16, 2014 at 11:45 PM, Egor Pahomov <pahomov.e...@gmail.com>
>> > wrote:
>> >
>> >> There are two things we (Yandex) miss in Spark: good MLlib abstractions
>> >> and a good workflow job scheduler. From the threads "Adding abstraction
>> >> in MlLib" and "[mllib] State of Multi-Model training" I got the idea
>> >> that Databricks is working on it and that we should wait until the
>> >> first doc is posted, which would guide us.
>> >> What about a workflow scheduler? Is anyone already working on it? Does
>> >> anyone have a plan for doing it?
>> >>
>> >> P.S. We thought that MLlib abstractions for running multiple algorithms
>> >> on the same data would need such a scheduler, which would rerun an
>> >> algorithm in case of failure. I understand that Spark provides fault
>> >> tolerance out of the box, but we found an "Oozie-like" scheduler more
>> >> reliable for such long-living workflows.
>> >>
>> >> --
>> >>
>> >>
>> >>
>> >> *Sincerely yours*
>> >> *Egor Pakhomov*
>> >> *Scala Developer, Yandex*
>> >>
>> >
>> >
>>
>>
>
>


-- 



*Sincerely yours*
*Egor Pakhomov*
*Scala Developer, Yandex*
