@Theodore: thanks for bringing the discussion together.
I think it's reasonable to pursue all three directions, just as you suggested. I agree we should concentrate our efforts, but we can do a low-effort evaluation of all three.

I would like to volunteer to shepherd *Offline learning on Streaming*. I am already working on related issues, and I believe I have a fairly good overview of the streaming API and its limitations. However, we need to find a good use-case to aim for, and I don't have one in mind yet, so please help with that if you can. I absolutely agree with Theodore that setting the scope is the most important thing here.

We should find a simple use-case for incremental learning. As Flink is really strong in low-latency data processing, the best would be a use-case where rapidly adapting the model to new data provides value. We should also consider low-latency serving for such a use-case, as there is not much use in fast model updates if we cannot serve predictions just as fast. Of course, it's okay to simply implement offline algorithms, but showcasing would be easier if we could add prediction serving for the model in the same system.
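
To make this a bit more concrete, here is a minimal sketch of what updating and
serving a model within the same streaming job could look like with the current
DataStream API. All names, the example data, and the linear scoring are made up
for illustration; this is just the co-stream pattern I have in mind, not a
proposal for the actual API:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

public class ModelUpdateAndServingSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Feature vectors to score (stand-in for a real event source).
        DataStream<double[]> events = env.fromElements(
                new double[]{1.0, 2.0}, new double[]{0.5, 0.5});

        // Model versions produced by the (incremental) training part of the job.
        DataStream<double[]> modelUpdates = env.fromElements(
                new double[]{0.1, 0.9});

        // Broadcast every new model version to all serving tasks and keep only the
        // latest one in memory (no fault-tolerant state, to keep the sketch small).
        DataStream<Double> predictions = events
                .connect(modelUpdates.broadcast())
                .flatMap(new CoFlatMapFunction<double[], double[], Double>() {
                    private double[] model;

                    @Override
                    public void flatMap1(double[] features, Collector<Double> out) {
                        if (model == null) {
                            return; // no model received yet: drop (or buffer) the event
                        }
                        double score = 0.0;
                        for (int i = 0; i < features.length; i++) {
                            score += model[i] * features[i];
                        }
                        out.collect(score); // serve the prediction with low latency
                    }

                    @Override
                    public void flatMap2(double[] newModel, Collector<Double> out) {
                        model = newModel; // swap in the freshly trained model
                    }
                });

        predictions.print();
        env.execute("model update + serving sketch");
    }
}

The nice thing is that the training side could produce the modelUpdates stream
in the very same job, so fast model updates and fast serving live in one system.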

How should we organize the work here? We could have sketches for the separate projects in Gdocs, then the shepherds could turn them into proposals. Would that be feasible?

@Stephan:
Thanks for all your insights. I also like the approach of aiming for new and somewhat unexplored areas. I guess we can do that with both serving/evaluation and incremental training (which should be in scope for offline ML on streaming).

I agree that GPU acceleration is an important issue; however, it might be out of scope for the prototypes of these new ML directions. What do you think?

Regarding your comments on the other thread, I'm really glad the PMC is working towards growing the community. That is crucial for getting anything merged into Flink while keeping up the code quality. However, for the prototypes I'd prefer Theodore's suggestion to develop them in a separate repository, to make initial development faster. After the prototypes have proven their usefulness we could merge them and continue working on them inside the Flink repository. But we can decide that later.

Cheers,
Gabor


On 2017-03-14 21:04, Stephan Ewen wrote:
Thanks Theo. Just wrote some comments on the other thread, but it looks
like you got it covered already.

Let me re-post what I think may help as input:

*Concerning Model Evaluation / Serving *

    - My personal take is that the "model evaluation" over streams will be
      happening in any case - there is genuine interest in that and various
      users have built it themselves already. It would be a cool way to do
      something that has a very high chance of being productionized by users
      soon.

    - The model evaluation as one step of a streaming pipeline (classifying
events), followed by CEP (pattern detection)
      or anomaly detection is a valuable use case on top of what pure model
serving systems usually do.

    - A question I don't yet have a good intuition on is whether the "model
      evaluation" and the training part are so different (once a good
      abstraction for model evaluation has been built) that there is little
      cross-coordination needed, or whether there is potential in integrating
      them.


*Thoughts on the ML training library (DataSet API or DataStream API)*

   - I honestly don't quite understand what the big difference will be in
     targeting the batch or streaming API. You can use the DataSet API in
     quite a low-level fashion (missing async iterations).

   - There seems, especially now, to be a big trend towards deep learning (is
     it just temporary or will this be the future?), and in that space little
     works without GPU acceleration.

   - It is always easier to do something new than to be the n-th version of
     something existing (sorry for the generic truism). The latter admittedly
     gives the "all in one integrated framework" advantage (which can be a
     very strong argument indeed), but the former attracts completely new
     communities and can often make more impact with less effort.

   - The "new" is not required to be "online learning", where Theo has
described some concerns well.
     It can also be traditional ML re-imagined for "continuous
applications", as "continuous / incremental re-training" or so.
     Even on the "model evaluation side", there is a lot of interesting
stuff as mentioned already, like ensembles, multi-armed bandits, ...

   - It may well be worth tapping into the work of an existing library (like
     TensorFlow) for an easy fix to some hard problems (pre-existing hardware
     integration, pre-existing optimized linear algebra solvers, etc.) and
     thinking about how such use cases would look in the context of typical
     Flink applications.


*A bit of engine background information that may help in the planning:*

   - The DataStream API will in the future also support bounded data
computations explicitly (I say this not as a fact, but as
     a strong believer that this is the right direction).

   - Batch runtime execution has seen less attention recently, but seems to be
     getting a bit more community focus, because some organizations that
     contribute a lot want to use the batch side as well. For example, the
     effort on fine-grained recovery will already strengthen batch a lot.


Stephan



On Tue, Mar 14, 2017 at 1:38 PM, Theodore Vasiloudis <
theodoros.vasilou...@gmail.com> wrote:

Hello all,

## Executive summary:

    - Offline-on-streaming most popular, then online and model serving.
    - Need shepherds to lead development/coordination of each task.
    - I can shepherd online learning, need shepherds for the other two.


So, from the people sharing their opinion, it seems most would like to try out
offline learning with the streaming API.
I also think this is an interesting option, but probably the riskiest of the
bunch.

After that online learning and model serving seem to have around the same
amount of interest.

Given that, and the discussions we had in the Gdoc, here's what I recommend
as next actions:

    - *Offline on streaming:* Start by creating a design document, with an MVP
      specification about what we imagine such a library to look like and what
      we think should be possible to do. It should state clear goals and
      limitations; scoping the amount of work is more important at this point
      than specific engineering choices.
    - *Online learning:* If someone would like instead to work on online
      learning, I can help out there. I have one student working on such a
      library right now, and I'm sure people at TU Berlin (Felix?) have
      similar efforts. Ideally we would like to communicate with them. Since
      this is a much more explored space, we could jump straight into a
      technical design document (with scoping included, of course) discussing
      abstractions and comparing with existing frameworks.
    - *Model serving:* There will be a presentation at Flink Forward SF on
      such a framework (Flink Tensorflow) by Eron Wright [1]. My
      recommendation would be to communicate with the author and see if he
      would be interested in working together to generalize and extend the
      framework. For more research and resources on the topic see [2] or this
      presentation [3], particularly the Clipper system.

In order to have some activity on each project, I recommend we set a minimum
of 2 people willing to contribute to it.

If we "assign" people by top choice, that should be possible to do,
although my original plan was
to only work on two of the above, to avoid fragmentation. But given that
online learning will have work
being done by students as well, it should be possible to keep it running.

Next *I would like us to assign a "shepherd" for each of these tasks.* If
you are willing to coordinate the development
on one of these options, let us know here and you can take up the task of
coordinating with the rest of the people working on the task.

I would like to volunteer to coordinate the *Online learning *effort, since
I'm already supervising a student
working on this, and I'm currently developing such algorithms. I plan to
contribute to the offline on streaming
task as well, but not coordinate it.

So if someone would like to take the lead on Offline on streaming or Model
serving, let us know and
we can take it from there.

Regards,
Theodore

[1] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/

[2] https://ucbrise.github.io/cs294-rise-fa16/prediction_serving.html

[3] https://ucbrise.github.io/cs294-rise-fa16/assets/slides/prediction-serving-systems-cs294-RISE_seminar.pdf

On Fri, Mar 10, 2017 at 6:55 PM, Stavros Kontopoulos <
st.kontopou...@gmail.com> wrote:

Thanks Theodore,

I'd vote for

- Offline learning with Streaming API

- Low-latency prediction serving

Some comments...

Online learning

Good to have, but my feeling is that it is not a strong requirement (if a
requirement at all) across the industry right now. It may become hot in the
future.

Offline learning with Streaming API:

Although it requires engine changes or extensions (feasibility is an issue
here), my understanding is that it reflects common industry practice (train
every few minutes at most), and it would be great if that were supported out
of the box, with a friendly API for the developer.

Offline learning with the batch API:

I would love to have a limited set of algorithms so that someone does not
have to leave Flink for another tool to work on some initial dataset if they
don't want to. In other words, let's reach a mature state with some basic
algos merged. There is a lot of work pending; let's not waste it.

Low-latency prediction serving

Model serving is a long-standing problem; we could definitely help with
that.

Regards,
Stavros



On Fri, Mar 10, 2017 at 4:08 PM, Till Rohrmann <trohrm...@apache.org>
wrote:

Thanks Theo for steering Flink's ML effort here :-)

I'd vote to concentrate on

- Online learning
- Low-latency prediction serving

because of the following reasons:

Online learning:

I agree that this topic is highly researchy and it's not even clear
whether
it will ever be of any interest outside of academia. However, it was
the
same for other things as well. Adoption in industry is usually slow and
sometimes one has to dare to explore something new.

Low-latency prediction serving:

Flink with its streaming engine seems to be the natural fit for such a
task
and it is a rather low hanging fruit. Furthermore, I think that users
would
directly benefit from such a feature.

Offline learning with Streaming API:

I'm not fully convinced yet that the streaming API is powerful enough (mainly
due to the lack of proper iteration support and spilling capabilities) to
support a wide range of offline ML algorithms. And even if it is, it will only
support rather small problem sizes, because streaming cannot gracefully spill
data to disk. There are still too many open issues with the streaming API for
it to be applicable to this use case, imo.
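
For context, this is roughly what streaming iterations offer today: a minimal
sketch of the iterate()/closeWith() pattern, with made-up names and a toy
feedback rule. There is no termination detection or superstep synchronization,
only the idle timeout passed to iterate():

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamingIterationSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Long> input = env.fromElements(5L, 10L, 3L);

        // Open an iteration; records sent to closeWith() are fed back to its head.
        IterativeStream<Long> iteration = input.iterate(5000L); // feedback idle timeout (ms)

        DataStream<Long> minusOne = iteration.map(new MapFunction<Long, Long>() {
            @Override
            public Long map(Long value) {
                return value - 1; // toy "update" step
            }
        });

        // Records that still need more work are fed back into the loop...
        DataStream<Long> feedback = minusOne.filter(new FilterFunction<Long>() {
            @Override
            public boolean filter(Long value) {
                return value > 0;
            }
        });
        iteration.closeWith(feedback);

        // ...while finished records leave it.
        DataStream<Long> done = minusOne.filter(new FilterFunction<Long>() {
            @Override
            public boolean filter(Long value) {
                return value <= 0;
            }
        });

        done.print();
        env.execute("streaming iteration sketch");
    }
}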

Offline learning with the batch API:

For offline learning the batch API is imo still better suited than the
streaming API. I think it will only make sense to port the algorithms to the
streaming API once batch and streaming are properly unified. The highly
efficient implementations for joining and sorting data that can go out of
memory (to disk) are, on their own, important for supporting big ML problems.
In general, I think it might make sense to offer a basic set of ML primitives.
However, already offering this basic set is a considerable amount of work.

Concerning the independent organization for the development: I think it would
be great if the development could still happen under the umbrella of Flink's
ML library, because otherwise we might risk some kind of fragmentation. In
order for people to collaborate, one can also open PRs against a branch of a
forked repo.

I'm currently working on wrapping the project re-organization
discussion
up. The general position was that it would be best to have an
incremental
build and keep everything in the same repo. If this is not possible
then
we
want to look into creating a sub repository for the libraries (maybe
other
components will follow later). I hope to make some progress on this
front
in the next couple of days/week. I'll keep you updated.

As a general remark on the discussions in the Google doc: I think it would be
great if we could at least mirror the discussions happening in the Google doc
back to the mailing list, or ideally conduct the discussions directly on the
mailing list. That's at least what the ASF encourages.

Cheers,
Till

On Fri, Mar 10, 2017 at 10:52 AM, Gábor Hermann <m...@gaborhermann.com>
wrote:

Hey all,

Sorry for the somewhat late response.

I'd like to work on
- Offline learning with Streaming API
- Low-latency prediction serving

I would drop the batch API ML because of past experience with lack of
support, and online learning because of the lack of use-cases.

I completely agree with Kate that offline learning should be supported, but
given Flink's resources I prefer using the streaming API, as Roberto
suggested. Also, the full model lifecycle (or end-to-end ML) could be more
easily supported in one system (one API). Connecting Flink Batch with Flink
Streaming is currently cumbersome (although side inputs [1] might help). In
my opinion, a crucial part of end-to-end ML is low-latency predictions.
As another direction, we could integrate the Flink Streaming API with other
projects (such as PredictionIO). However, I believe it's better to first
evaluate the capabilities and drawbacks of the streaming API with some
prototype of using Flink Streaming for an ML task. Otherwise we could run
into critical issues, just as the SystemML integration did with e.g. caching.
Such issues make the integration of the Batch API with other ML projects
practically infeasible.

I've already been experimenting with offline learning on the Streaming API.
Hopefully, I can share some initial performance results on matrix
factorization next week. Naturally, I've run into issues. E.g. I could only
mark the end of the input with some hacks, because this is not needed for a
streaming job that consumes input forever. AFAIK, this would be resolved by
side inputs [1].
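
To illustrate the kind of workaround I mean, here is a minimal sketch of
marking the end of a bounded input with a sentinel record (all names and the
toy "training" update are made up for illustration, and this is not
necessarily the exact hack I used):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.util.Collector;

public class EndOfInputSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); // keep the sketch simple: one task, in-memory state only

        // A bounded "training set", followed by a sentinel record marking the end of input.
        DataStream<Tuple2<Long, Double>> ratings = env.addSource(
                new SourceFunction<Tuple2<Long, Double>>() {
                    @Override
                    public void run(SourceContext<Tuple2<Long, Double>> ctx) {
                        for (long i = 0; i < 1000; i++) {
                            ctx.collect(Tuple2.of(i, Math.random())); // the actual data
                        }
                        ctx.collect(Tuple2.of(-1L, Double.NaN)); // sentinel: end of input
                    }

                    @Override
                    public void cancel() {}
                });

        // Accumulate a (toy) model and only emit it once the sentinel arrives.
        DataStream<double[]> model = ratings.flatMap(
                new FlatMapFunction<Tuple2<Long, Double>, double[]>() {
                    private final double[] acc = new double[10];

                    @Override
                    public void flatMap(Tuple2<Long, Double> value, Collector<double[]> out) {
                        if (value.f0 == -1L) {
                            out.collect(acc); // end of input reached: emit the trained model
                        } else {
                            acc[(int) (value.f0 % acc.length)] += value.f1; // toy update step
                        }
                    }
                });

        model.print();
        env.execute("end-of-input sketch");
    }
}

With side inputs (or first-class support for bounded streams), the sentinel
and the manual bookkeeping should no longer be necessary.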

@Theodore:
+1 for doing the prototype project(s) separately from the main Flink
repository, although I would strongly suggest following the Flink development
guidelines as closely as possible. As another note, there is already a GitHub
organization for Flink-related projects [2], but it seems like it has not
been used much.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
[2] https://github.com/project-flink


On 2017-03-04 08:44, Roberto Bentivoglio wrote:

Hi All,
I'd like to start working on:
   - Offline learning with Streaming API
   - Online learning

I also think that using a new organisation on GitHub, as Theodore proposed,
to keep some initial independence and speed up the prototyping and
development phases is really interesting.

I totally agree with Katherin that we need offline learning, but my opinion
is that it will be more straightforward to fix the streaming issues than the
batch issues, because we will have more support on that from the Flink
community.

Thanks and have a nice weekend,
Roberto

On 3 March 2017 at 20:20, amir bahmanyari
<amirto...@yahoo.com.invalid> wrote:

Great points to start:
    - Online learning
    - Offline learning with the streaming API

Thanks + have a great weekend.

        From: Katherin Eri <katherinm...@gmail.com>
   To: dev@flink.apache.org
   Sent: Friday, March 3, 2017 7:41 AM
   Subject: Re: Machine Learning on Flink - Next steps

Thank you, Theodore.

In short, I vote for:
1) Online learning
2) Low-latency prediction serving -> Offline learning with the batch API

In detail:
1) If streaming is Flink's strong side, let's use it and try to support some
online learning or lightweight in-memory learning algorithms, and try to
build a pipeline for them.

2) I think that Flink should be part of the production ecosystem, and if
production systems now require ML support, deployment of multiple models and
so on, we should serve that. But in my opinion we shouldn't compete with
projects like PredictionIO; rather, we should serve them and be an execution
core. But that means a lot:

a. Offline training should be supported, because most ML algorithms are
typically for offline training.
b. The model lifecycle should be supported: ETL + transformation + training +
scoring + quality monitoring in production.

I understand that the batch world is full of competitors, but for me that
doesn't mean that batch should be ignored. I think that separate
streaming/batch applications cause additional deployment and operational
overhead, which people typically try to avoid. That means that we should
attract the community to this problem, in my opinion.


On Fri, 3 Mar 2017 at 15:34, Theodore Vasiloudis <
theodoros.vasilou...@gmail.com>:

Hello all,

From our previous discussion started by Stavros, we decided to start a
planning document [1] to figure out possible next steps for ML on Flink.

Our concerns were mainly ensuring active development while satisfying the
needs of the community.

We have listed a number of proposals for future work in the document. In
short they are:

    - Offline learning with the batch API
    - Online learning
    - Offline learning with the streaming API
    - Low-latency prediction serving

I saw there are a number of people willing to work on ML for Flink, but the
truth is that we cannot cover all of these suggestions without fragmenting
the development too much.

So my recommendation is to pick out 2 of these options, create design
documents, and build prototypes for each library. We can then assess their
viability and, together with the community, decide if we should try to
include one (or both) of them in the main Flink distribution.

So I invite people to express their opinion about which task they would be
willing to contribute to, and hopefully we can settle on two of these
options.

Once that is done we can decide how we do the actual work. Since this is
highly experimental, I would suggest we work on repositories where we have
complete control. For that purpose I have created an organization [2] on
GitHub which we can use to create repositories and teams that work on them
in an organized manner. Once enough work has accumulated we can start
discussing contributing the code to the main distribution.

Regards,
Theodore

[1]
https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/
[2] https://github.com/flinkml

--

*Yours faithfully, *

*Kate Eri.*





