Hi there,

I am not a machine learning expert :) But recently I have been seeing more
and more adoption of, and momentum behind, TensorFlow [1], which is backed
by Google and other big vendors.

If Flink were compatible with TensorFlow and could run TensorFlow pipelines
(even with some modifications), I think adoption would be faster.
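
To make the idea a bit more concrete, here is a rough, untested sketch of the
kind of integration I have in mind: loading a pre-trained TensorFlow
SavedModel inside a Flink streaming map function and scoring events with it
through the TensorFlow Java API. The model path, the tensor names and the
input shape are all made up for illustration; the experts here will know
better what a proper API around this should look like.

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.tensorflow.{SavedModelBundle, Tensor}

// Scores each feature vector with a model trained and exported in TensorFlow.
class TensorFlowScorer(modelPath: String)
    extends RichMapFunction[Array[Float], Float] {

  @transient private var model: SavedModelBundle = _

  override def open(parameters: Configuration): Unit = {
    // Load the exported SavedModel once per parallel task instance.
    model = SavedModelBundle.load(modelPath, "serve")
  }

  override def map(features: Array[Float]): Float = {
    val input = Tensor.create(Array(features))   // shape [1, numFeatures]
    val output = model.session().runner()
      .feed("input", input)                      // tensor names are hypothetical
      .fetch("prediction")
      .run().get(0)
    val result = Array.ofDim[Float](1, 1)        // assumes a single scalar output
    output.copyTo(result)
    result(0)(0)
  }

  override def close(): Unit = if (model != null) model.close()
}

object TensorFlowOnFlinkSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // In a real job the features would come from Kafka or similar.
    val features: DataStream[Array[Float]] =
      env.fromElements(Array(0.1f, 0.3f, 0.7f))
    features.map(new TensorFlowScorer("/tmp/ctr-model")).print()
    env.execute("tensorflow-scoring-sketch")
  }
}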

Thanks,
Chen

[1] https://github.com/tensorflow/tensorflow

On Fri, Mar 17, 2017 at 7:44 AM, Theodore Vasiloudis <
theodoros.vasilou...@gmail.com> wrote:

> >
> > What should be the way of work here? We could have sketches for the
> > separate projects in Gdocs, then the shepherds could make a proposal out
> of
> > it. Would that be feasible?
>
>
> That's what I was thinking as well. It's the responsibility of the shepherd
> to engage the people motivated to work
> on a project, starting with a rough Gdocs document and gradually
> transitioning it into a proper design doc.
>
> As an example use-case (for both online and "fast-batch") I would recommend
> an ad click scenario: Predicting CTR.
>
> There are multiple reasons I like this application:
>
>    - it's a very popular application
>    - it's directly tied to revenue so even small improvements are relevant,
>    - it can often be a very large-scale problem in data and model size,
>    - there are good systems out there already to benchmark against, like
>    Vowpal Wabbit.
>    - At least one large-scale dataset exists [1],
>    - We could even place a pre-processing pipeline to emulate a real
>    application, and show the full benefits of using Flink as your
>    one-stop-shop for an integrated prediction pipeline (up until model
>    serving for now); see the rough sketch below.
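>
> To make that last point a bit more concrete, here is a very rough sketch
> (all names, the feature layout and the placeholder source are made up) of
> what a minimal CTR scoring step could look like on the streaming API:
> parse raw impressions, hash the categorical features into a fixed index
> space (the Vowpal-Wabbit-style hashing trick), and score them with a
> logistic model whose weights would eventually be learned and updated by
> the training part of the pipeline.
>
> import org.apache.flink.streaming.api.scala._
>
> // One raw impression: a click label plus categorical feature strings,
> // roughly one line of the Criteo data after splitting.
> case class Impression(label: Double, categorical: Seq[String])
>
> object CtrScoringSketch {
>   val NumBuckets = 1 << 18  // size of the hashed feature space
>
>   // Feature hashing: map each categorical value into the fixed index space.
>   def hashFeatures(imp: Impression): Array[Int] =
>     imp.categorical.map(f => (f.hashCode & Int.MaxValue) % NumBuckets).toArray
>
>   // Score with a logistic regression weight vector (static here for brevity).
>   def predict(weights: Array[Double], indices: Array[Int]): Double = {
>     val margin = indices.map(weights(_)).sum
>     1.0 / (1.0 + math.exp(-margin))
>   }
>
>   def main(args: Array[String]): Unit = {
>     val env = StreamExecutionEnvironment.getExecutionEnvironment
>     val weights = Array.fill(NumBuckets)(0.0)  // stands in for a trained model
>
>     val impressions: DataStream[Impression] = env
>       .socketTextStream("localhost", 9999)     // placeholder source
>       .map { line =>
>         val fields = line.split(",")
>         Impression(fields.head.toDouble, fields.tail.toSeq)
>       }
>
>     impressions
>       .map(imp => (imp.label, predict(weights, hashFeatures(imp))))
>       .print()
>
>     env.execute("ctr-scoring-sketch")
>   }
> }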
>
> We are still missing someone to take the lead on the model serving project;
> if somebody is interested in coordinating that, please let us know.
>
> Regards,
> Theodore
>
> [1] Criteo click-through data (1TB):
> http://www.criteo.com/news/press-releases/2015/06/criteo-releases-industrys-largest-ever-dataset/
>
> On Thu, Mar 16, 2017 at 11:50 PM, Gábor Hermann <m...@gaborhermann.com>
> wrote:
>
> > @Theodore: thanks for bringing the discussion together.
> > I think it's reasonable to go on all the three directions, just as you
> > suggested. I agree we should concentrate our efforts, but we can do a
> > low-effort evaluation of all the three.
> >
> > I would like to volunteer for shepherding *Offline learning on
> Streaming*.
> > I am already working on related issues, and I believe I have a fairly good
> > overview of the streaming API and its limitations. However, we need to find
> > a good use-case to aim for, and I don't have one in mind yet, so please
> > help with that if you can. I absolutely agree with Theodore that setting
> > the scope is the most important thing here.
> >
> > We should find a simple use-case for incremental learning. As Flink is
> > really strong in low-latency data processing, the best would be a use-case
> > where rapidly adapting the model to new data provides value. We should
> > also consider low-latency serving for such a use-case, as there is not
> > much use in fast model updates if we cannot serve the predictions equally
> > fast. Of course, it's okay to simply implement offline algorithms, but
> > showcasing would be easier if we could add prediction serving for the
> > model in the same system.
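> >
> > To sketch what I mean by serving in the same system (purely illustrative:
> > the types, the update rule, and the lack of model partitioning and
> > checkpointing are all simplifications), one could connect a stream of
> > labelled training events with a stream of prediction requests, take an
> > incremental update step on the first input, and answer requests from the
> > current model on the second:
> >
> > import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
> > import org.apache.flink.streaming.api.scala._
> > import org.apache.flink.util.Collector
> >
> > case class LabeledPoint(label: Double, features: Array[Double])
> >
> > // flatMap1 takes one SGD step per labelled event; flatMap2 answers
> > // prediction requests with whatever model this task currently holds.
> > class TrainAndServe(dim: Int, learningRate: Double = 0.01)
> >     extends CoFlatMapFunction[LabeledPoint, Array[Double], Double] {
> >
> >   private val weights: Array[Double] = Array.fill(dim)(0.0)
> >
> >   private def dot(x: Array[Double]): Double =
> >     weights.zip(x).map { case (w, v) => w * v }.sum
> >
> >   override def flatMap1(p: LabeledPoint, out: Collector[Double]): Unit = {
> >     val error = p.label - dot(p.features)
> >     var i = 0
> >     while (i < dim) {
> >       weights(i) += learningRate * error * p.features(i)
> >       i += 1
> >     }
> >   }
> >
> >   override def flatMap2(request: Array[Double], out: Collector[Double]): Unit =
> >     out.collect(dot(request))
> > }
> >
> > // Usage, with placeholder sources:
> > //   val predictions: DataStream[Double] =
> > //     trainingStream.connect(requestStream).flatMap(new TrainAndServe(dim = 100))
> >
> > With proper side-input support this could be done more cleanly, but even
> > such a naive version shows why doing both in one API is appealing.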
> >
> > What should be the way of work here? We could have sketches for the
> > separate projects in Gdocs, then the shepherds could make a proposal out
> of
> > it. Would that be feasible?
> >
> > @Stephan:
> > Thanks for all your insights. I also like the approach of aiming for new
> > and somewhat unexplored areas. I guess we can do that with both the
> > serving/evaluation and incremental training (that should be in scope of
> the
> > offline ML on streaming).
> >
> > I agree GPU acceleration is an important issue; however, it might be
> > out-of-scope for the prototypes of these new ML directions. What do you
> > think?
> >
> > Regarding your comments on the other thread, I'm really glad the PMC is
> > working towards growing the community. This is crucial for getting anything
> > merged into Flink while keeping up the code quality. However, for the
> > prototypes, I'd prefer Theodore's suggestion, to do it in a separate
> > repository, to make initial development faster. After the prototypes have
> > proven their usability we could merge them, and continue working on them
> > inside the Flink repository. But we can decide that later.
> >
> > Cheers,
> > Gabor
> >
> >
> >
> > On 2017-03-14 21:04, Stephan Ewen wrote:
> >
> >> Thanks Theo. Just wrote some comments on the other thread, but it looks
> >> like you got it covered already.
> >>
> >> Let me re-post what I think may help as input:
> >>
> >> *Concerning Model Evaluation / Serving *
> >>
> >>     - My personal take is that the "model evaluation" over streams will be
> >>       happening in any case - there is genuine interest in that and various
> >>       users have built it themselves already. It would be a cool way to do
> >>       something that has a very high chance of being productionized by
> >>       users soon.
> >>
> >>     - The model evaluation as one step of a streaming pipeline
> >> (classifying
> >> events), followed by CEP (pattern detection)
> >>       or anomaly detection is a valuable use case on top of what pure
> >> model
> >> serving systems usually do.
> >>
> >>     - A question I do not yet have a good intuition on is whether the
> >>       "model evaluation" and the training part are so different (once a
> >>       good abstraction for model evaluation has been built) that there is
> >>       little cross coordination needed, or whether there is potential in
> >>       integrating them.
> >>
> >>
> >> *Thoughts on the ML training library (DataSet API or DataStream API)*
> >>
> >>    - I honestly don't quite understand what the big difference will be
> in
> >> targeting the batch or streaming API. You can use the
> >>      DataSet API in a quite low-level fashion (missing async
> iterations).
> >>
> >>    - There seems especially now to be a big trend towards deep learning
> >> (is
> >> it just temporary or will this be the future?) and in
> >>       that space, little works without GPU acceleration.
> >>
> >>    - It is always easier to do something new than to be the n-th version
> >>      of something existing (sorry for the generic truism). The latter
> >>      admittedly gives the "all in one integrated framework" advantage
> >>      (which can be a very strong argument indeed), but the former attracts
> >>      completely new communities and can often make more impact with less
> >>      effort.
> >>
> >>    - The "new" is not required to be "online learning", where Theo has
> >> described some concerns well.
> >>      It can also be traditional ML re-imagined for "continuous
> >> applications", as "continuous / incremental re-training" or so.
> >>      Even on the "model evaluation side", there is a lot of interesting
> >> stuff as mentioned already, like ensembles, multi-armed bandits, ...
> >>
> >>    - It may be well worth tapping into the work of an existing library
> >>      (like TensorFlow) for an easy fix to some hard problems (pre-existing
> >>      hardware integration, pre-existing optimized linear algebra solvers,
> >>      etc.) and thinking about how such use cases would look in the context
> >>      of typical Flink applications.
> >>
> >>
> >> *A bit of engine background information that may help in the planning:*
> >>
> >>    - The DataStream API will in the future also support bounded data
> >> computations explicitly (I say this not as a fact, but as
> >>      a strong believer that this is the right direction).
> >>
> >>    - Batch runtime execution has seen less attention recently, but seems
> >>      to be getting a bit more community focus, because some organizations
> >>      that contribute a lot want to use the batch side as well. For example,
> >>      the effort on fine-grained recovery will already strengthen batch a
> >>      lot.
> >>
> >>
> >> Stephan
> >>
> >>
> >>
> >> On Tue, Mar 14, 2017 at 1:38 PM, Theodore Vasiloudis <
> >> theodoros.vasilou...@gmail.com> wrote:
> >>
> >> Hello all,
> >>>
> >>> ## Executive summary:
> >>>
> >>>     - Offline-on-streaming most popular, then online and model serving.
> >>>     - Need shepherds to lead development/coordination of each task.
> >>>     - I can shepherd online learning, need shepherds for the other two.
> >>>
> >>>
> >>> So, from the people sharing their opinion, it seems most people would like
> >>> to
> >>> try out offline learning with the streaming API.
> >>> I also think this is an interesting option, but probably the most risky
> >>> of
> >>> the bunch.
> >>>
> >>> After that online learning and model serving seem to have around the
> same
> >>> amount of interest.
> >>>
> >>> Given that, and the discussions we had in the Gdoc, here's what I
> >>> recommend
> >>> as next actions:
> >>>
> >>>     -
> >>> *Offline on streaming: *Start by creating a design document, with an
> MVP
> >>>     specification about what we
> >>>     imagine such a library to look like and what we think should be
> >>> possible
> >>>     to do.
> >>>     It should state clear goals and limitations; scoping the amount of
> >>> work
> >>>     is
> >>>     more important at this point than specific engineering choices.
> >>>     -
> >>> *Online learning: *If someone would like instead to work on online
> >>> learning
> >>>     I can help out there,
> >>>     I have one student working on such a library right now, and I'm
> sure
> >>>     people
> >>>     at TU Berlin (Felix?) have similar efforts. Ideally we would like
> to
> >>>     communicate with
> >>>     them. Since this is a much more explored space, we could jump
> >>> straight
> >>>     into a technical
> >>>     design document, (with scoping included of course) discussing
> >>>     abstractions, and comparing
> >>>     with existing frameworks.
> >>>     -
> >>> *Model serving: *There will be a presentation at Flink Forward SF on
> >>> such a
> >>>     framework (Flink Tensorflow)
> >>>     by Eron Wright [1]. My recommendation would be to communicate with
> >>> the
> >>>     author and see
> >>>     if he would be interested in working together to generalize and
> >>> extend
> >>>     the framework.
> >>>     For more research and resources on the topic see [2] or this
> >>>     presentation [3], particularly the Clipper system.
> >>>
> >>> In order to have some activity on each project I recommend we set a
> >>> minimum
> >>> of 2 people willing to
> >>> contribute to each project.
> >>>
> >>> If we "assign" people by top choice, that should be possible to do,
> >>> although my original plan was
> >>> to only work on two of the above, to avoid fragmentation. But given
> that
> >>> online learning will have work
> >>> being done by students as well, it should be possible to keep it
> running.
> >>>
> >>> Next *I would like us to assign a "shepherd" for each of these tasks.*
> If
> >>> you are willing to coordinate the development
> >>> on one of these options, let us know here and you can take up the task
> of
> >>> coordinating with the rest
> >>> of the people working on the task.
> >>>
> >>> I would like to volunteer to coordinate the *Online learning *effort,
> >>> since
> >>> I'm already supervising a student
> >>> working on this, and I'm currently developing such algorithms. I plan
> to
> >>> contribute to the offline on streaming
> >>> task as well, but not coordinate it.
> >>>
> >>> So if someone would like to take the lead on Offline on streaming or
> >>> Model
> >>> serving, let us know and
> >>> we can take it from there.
> >>>
> >>> Regards,
> >>> Theodore
> >>>
> >>> [1] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
> >>>
> >>> [2] https://ucbrise.github.io/cs294-rise-fa16/prediction_serving.html
> >>>
> >>> [3]
> >>> https://ucbrise.github.io/cs294-rise-fa16/assets/slides/prediction-serving-systems-cs294-RISE_seminar.pdf
> >>>
> >>> On Fri, Mar 10, 2017 at 6:55 PM, Stavros Kontopoulos <
> >>> st.kontopou...@gmail.com> wrote:
> >>>
> >>> Thanks Theodore,
> >>>>
> >>>> I'd vote for
> >>>>
> >>>> - Offline learning with Streaming API
> >>>>
> >>>> - Low-latency prediction serving
> >>>>
> >>>> Some comments...
> >>>>
> >>>> Online learning
> >>>>
> >>>> Good to have but my feeling is that it is not a strong requirement
> (if a
> >>>> requirement at all) across the industry right now. May become hot in
> the
> >>>> future.
> >>>>
> >>>> Offline learning with Streaming API:
> >>>>
> >>>> Although it requires engine changes or extensions (feasibility is an
> >>>> issue here), my understanding is that it reflects common industry
> >>>> practice (train every few minutes at most), and it would be great if that
> >>>> were supported out of the box, providing a friendly API for the
> >>>> developer.
> >>>>
> >>>> Offline learning with the batch API:
> >>>>
> >>>> I would love to have a limited set of algorithms so someone does not
> >>>>
> >>> leave
> >>>
> >>>> Flink to work with another tool
> >>>> for some initial dataset if they want to. In other words, let's reach a
> >>>> mature state with some basic algos merged.
> >>>> There is a lot of work pending; let's not waste it.
> >>>>
> >>>> Low-latency prediction serving
> >>>>
> >>>> Model serving is a long-standing problem; we could definitely help with
> >>>> that.
> >>>>
> >>>> Regards,
> >>>> Stavros
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Mar 10, 2017 at 4:08 PM, Till Rohrmann <trohrm...@apache.org>
> >>>> wrote:
> >>>>
> >>>> Thanks Theo for steering Flink's ML effort here :-)
> >>>>>
> >>>>> I'd vote to concentrate on
> >>>>>
> >>>>> - Online learning
> >>>>> - Low-latency prediction serving
> >>>>>
> >>>>> because of the following reasons:
> >>>>>
> >>>>> Online learning:
> >>>>>
> >>>>> I agree that this topic is highly researchy and it's not even clear
> >>>>> whether it will ever be of any interest outside of academia. However, it
> >>>>> was the same for other things as well. Adoption in industry is usually
> >>>>> slow, and sometimes one has to dare to explore something new.
> >>>>>
> >>>>> Low-latency prediction serving:
> >>>>>
> >>>>> Flink with its streaming engine seems to be the natural fit for such a
> >>>>> task, and it is rather low-hanging fruit. Furthermore, I think that
> >>>>> users would directly benefit from such a feature.
> >>>>>
> >>>>> Offline learning with Streaming API:
> >>>>>
> >>>>> I'm not fully convinced yet that the streaming API is powerful enough
> >>>>> (mainly due to the lack of proper iteration support and spilling
> >>>>> capabilities) to support a wide range of offline ML algorithms. And even
> >>>>> if it is, it will only support rather small problem sizes because
> >>>>> streaming cannot gracefully spill the data to disk. There are still too
> >>>>> many open issues with the streaming API for it to be applicable to this
> >>>>> use case imo.
> >>>>>
> >>>>> Offline learning with the batch API:
> >>>>>
> >>>>> For offline learning the batch API is imo still better suited than the
> >>>>> streaming API. I think it will only make sense to port the algorithms to
> >>>>> the streaming API once batch and streaming are properly unified. Just
> >>>>> the highly efficient implementations for joining and sorting data, which
> >>>>> can spill to disk, are important for supporting large ML problems. In
> >>>>> general, I think it might make sense to offer a basic set of ML
> >>>>> primitives. However, already offering this basic set is a considerable
> >>>>> amount of work.
> >>>>>
> >>>>> Concerning the independent organization for the development: I think it
> >>>>> would be great if the development could still happen under the umbrella
> >>>>> of Flink's ML library because otherwise we might risk some kind of
> >>>>> fragmentation. In order for people to collaborate, one can also open PRs
> >>>>> against a branch of a forked repo.
> >>>>>
> >>>>> I'm currently working on wrapping the project re-organization
> >>>>> discussion up. The general position was that it would be best to have an
> >>>>> incremental build and keep everything in the same repo. If this is not
> >>>>> possible then we want to look into creating a sub repository for the
> >>>>> libraries (maybe other components will follow later). I hope to make
> >>>>> some progress on this front in the next couple of days/weeks. I'll keep
> >>>>> you updated.
> >>>>>
> >>>>> As a general remark for the discussions on the Google doc: I think it
> >>>>> would be great if we could at least mirror the discussions happening in
> >>>>> the Google doc back on the mailing list, or ideally conduct the
> >>>>> discussions directly on the mailing list. That's at least what the ASF
> >>>>> encourages.
> >>>>>
> >>>>> Cheers,
> >>>>> Till
> >>>>>
> >>>>> On Fri, Mar 10, 2017 at 10:52 AM, Gábor Hermann <
> m...@gaborhermann.com
> >>>>> wrote:
> >>>>>
> >>>>> Hey all,
> >>>>>>
> >>>>>> Sorry for the somewhat late response.
> >>>>>>
> >>>>>> I'd like to work on
> >>>>>> - Offline learning with Streaming API
> >>>>>> - Low-latency prediction serving
> >>>>>>
> >>>>>> I would drop the batch API ML because of past experience with lack of
> >>>>>> support, and online learning because of the lack of use-cases.
> >>>>>>
> >>>>>> I completely agree with Kate that offline learning should be supported,
> >>>>>> but given Flink's resources I prefer using the streaming API, as
> >>>>>> Roberto suggested. Also, the full model lifecycle (or end-to-end ML)
> >>>>>> could be more easily supported in one system (one API). Connecting
> >>>>>> Flink Batch with Flink Streaming is currently cumbersome (although side
> >>>>>> inputs [1] might help). In my opinion, a crucial part of end-to-end ML
> >>>>>> is low-latency predictions.
> >>>
> >>>>>> As another direction, we could integrate the Flink Streaming API with
> >>>>>> other projects (such as PredictionIO). However, I believe it's better
> >>>>>> to first evaluate the capabilities and drawbacks of the streaming API
> >>>>>> with some prototype of using Flink Streaming for some ML task.
> >>>>>> Otherwise we could run into critical issues, just as the SystemML
> >>>>>> integration did with e.g. caching. These issues make the integration of
> >>>>>> the Batch API with other ML projects practically infeasible.
> >>>>>>
> >>>>>> I've already been experimenting with offline learning with the
> >>>>>> Streaming API. Hopefully, I can share some initial performance results
> >>>>>> on matrix factorization next week. Naturally, I've run into issues.
> >>>>>> E.g. I could only mark the end of the input with some hacks, because
> >>>>>> this is not needed for a streaming job consuming input forever. AFAIK,
> >>>>>> this would be resolved by side inputs [1].
> >>>>>>
> >>>>>> @Theodore:
> >>>>>> +1 for doing the prototype project(s) separately from the main Flink
> >>>>>> repository, although I would strongly suggest following the Flink
> >>>>>> development guidelines as closely as possible. As another note, there
> >>>>>> is already a GitHub organization for Flink-related projects [2], but it
> >>>>>> seems like it has not been used much.
> >>>>>>
> >>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
> >>>>>> [2] https://github.com/project-flink
> >>>>>>
> >>>>>>
> >>>>>> On 2017-03-04 08:44, Roberto Bentivoglio wrote:
> >>>>>>
> >>>>>> Hi All,
> >>>>>>
> >>>>>>> I'd like to start working on:
> >>>>>>>    - Offline learning with Streaming API
> >>>>>>>    - Online learning
> >>>>>>>
> >>>>>>> I also think that using a new organisation on GitHub, as Theodore
> >>>>>>> proposed, to keep some initial independence and speed up the
> >>>>>>> prototyping and development phases, is really interesting.
> >>>>>>>
> >>>>>>> I totally agree with Katherin that we need offline learning, but my
> >>>>>>> opinion is that it will be more straightforward to fix the streaming
> >>>>>>> issues than the batch issues, because we will have more support for
> >>>>>>> that from the Flink community.
> >>>>
> >>>>> Thanks and have a nice weekend,
> >>>>>>> Roberto
> >>>>>>>
> >>>>>>> On 3 March 2017 at 20:20, amir bahmanyari
> >>>>>>>
> >>>>>> <amirto...@yahoo.com.invalid
> >>>
> >>>> wrote:
> >>>>>>>
> >>>>>>>> Great points to start:
> >>>>>>>>     - Online learning
> >>>>>>>
> >>>>>>>>     - Offline learning with the streaming API
> >>>>>>>>
> >>>>>>>> Thanks + have a great weekend.
> >>>>>>>>
> >>>>>>>>         From: Katherin Eri <katherinm...@gmail.com>
> >>>>>>>>    To: dev@flink.apache.org
> >>>>>>>>    Sent: Friday, March 3, 2017 7:41 AM
> >>>>>>>>    Subject: Re: Machine Learning on Flink - Next steps
> >>>>>>>>
> >>>>>>>> Thank you, Theodore.
> >>>>>>>>
> >>>>>>>> Shortly speaking I vote for:
> >>>>>>>> 1) Online learning
> >>>>>>>> 2) Low-latency prediction serving -> Offline learning with the
> >>>>>>>>
> >>>>>>> batch
> >>>
> >>>> API
> >>>>>
> >>>>>> In details:
> >>>>>>>> 1) If streaming is the strong side of Flink, let's use it, and try to
> >>>>>>>> support some online learning or lightweight in-memory learning
> >>>>>>>> algorithms. Try to build a pipeline for them.
> >>>>>>>>
> >>>>>>>> 2) I think that Flink should be part of the production ecosystem, and
> >>>>>>>> if production systems now require ML support, deployment of multiple
> >>>>>>>> models and so on, we should serve that. But in my opinion we
> >>>>>>>> shouldn't compete with projects like PredictionIO, but rather serve
> >>>>>>>> them and be an execution core. But that means a lot:
> >>>>>>>>
> >>>>>>>> a. Offline training should be supported, because typically most ML
> >>>>>>>> algorithms are for offline training.
> >>>>>>>> b. The model lifecycle should be supported:
> >>>>>>>> ETL + transformation + training + scoring + quality monitoring in
> >>>>>>>> production.
> >>>>>>>> I understand that the batch world is full of competitors, but for me
> >>>>>>>> that doesn't mean that batch should be ignored. I think that separate
> >>>>>>>> streaming/batch applications cause additional deployment and
> >>>>>>>> operational overhead, which people typically try to avoid. That means
> >>>>>>>> that we should attract the community to this problem, in my opinion.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Mar 3, 2017 at 15:34, Theodore Vasiloudis <
> >>>>>>>> theodoros.vasilou...@gmail.com>:
> >>>>>>>>
> >>>>>>>> Hello all,
> >>>>>>>>
> >>>>>>>>   From our previous discussion started by Stavros, we decided to
> >>>>>>>>
> >>>>>>> start a
> >>>>
> >>>>> planning document [1]
> >>>>>>>> to figure out possible next steps for ML on Flink.
> >>>>>>>>
> >>>>>>>> Our concerns were mainly ensuring active development while satisfying
> >>>>>>>> the needs of the community.
> >>>>>>>>
> >>>>>>>> We have listed a number of proposals for future work in the
> >>>>>>>>
> >>>>>>> document.
> >>>
> >>>> In
> >>>>>
> >>>>>> short they are:
> >>>>>>>>
> >>>>>>>>     - Offline learning with the batch API
> >>>>>>>>     - Online learning
> >>>>>>>>     - Offline learning with the streaming API
> >>>>>>>>     - Low-latency prediction serving
> >>>>>>>>
> >>>>>>>> I saw there are a number of people willing to work on ML for Flink,
> >>>>>>>> but the truth is that we cannot cover all of these suggestions
> >>>>>>>> without fragmenting the development too much.
> >>>>>>>>
> >>>>>>>> So my recommendation is to pick out 2 of these options, create
> >>>>>>>>
> >>>>>>> design
> >>>
> >>>> documents and build prototypes for each library.
> >>>>>>>> We can then assess their viability and together with the community
> >>>>>>>>
> >>>>>>> decide
> >>>>>
> >>>>>> if we should try
> >>>>>>>> to include one (or both) of them in the main Flink distribution.
> >>>>>>>>
> >>>>>>>> So I invite people to express their opinion about which task they
> >>>>>>>> would be willing to contribute to, and hopefully we can settle on two
> >>>>>>>> of these options.
> >>>>>>>>
> >>>>>>>> Once that is done we can decide how we do the actual work. Since
> >>>>>>>>
> >>>>>>> this
> >>>
> >>>> is
> >>>>>
> >>>>>> highly experimental
> >>>>>>>> I would suggest we work on repositories where we have complete
> >>>>>>>>
> >>>>>>> control.
> >>>>
> >>>>> For that purpose I have created an organization [2] on Github which
> >>>>>>>>
> >>>>>>> we
> >>>>
> >>>>> can
> >>>>>>>> use to create repositories and teams that work on them in an
> >>>>>>>>
> >>>>>>> organized
> >>>>
> >>>>> manner.
> >>>>>>>> Once enough work has accumulated we can start discussing
> >>>>>>>>
> >>>>>>> contributing
> >>>
> >>>> the
> >>>>>
> >>>>>> code
> >>>>>>>> to the main distribution.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Theodore
> >>>>>>>>
> >>>>>>>> [1]
> >>>>>>>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/
> >>>>>>>> [2] https://github.com/flinkml
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>>
> >>>>>>>> *Yours faithfully, *
> >>>>>>>>
> >>>>>>>> *Kate Eri.*
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >
>
