Hi Philipp,

It's great to hear you are interested in Flink ML!

Based on your description, your prototype seems like an interesting approach to combining online and offline learning. If you're interested, we might find a way to integrate your work, or at least your ideas, into Flink ML if we decide on a direction that fits your approach. I think your work could be relevant for almost all the directions listed there (if I understand correctly, you'd even like to serve predictions on unlabeled data).

Feel free to join the discussion in the docs you've mentioned :)

Cheers,
Gabor

On 2017-02-27 18:39, Philipp Zehnder wrote:

Hello all,

I’m new to this mailing list and wanted to introduce myself. My name is 
Philipp Zehnder and I’m a master’s student in Computer Science at the 
Karlsruhe Institute of Technology in Germany, currently writing my master’s 
thesis, whose main goal is to integrate reusable machine learning components 
into a stream processing network. One part of my thesis is to create an API 
for distributed online machine learning.

I saw that there are some recent discussions about how to continue the 
development of Flink ML [1], and I want to share some of my experiences and 
maybe get some feedback from the community on my ideas.

As I am new to open source projects, I hope this is the right place for this.

In the beginning, I had a look at different existing frameworks, for example 
Apache SAMOA, which is great and has a lot of useful resources. However, as 
Flink is currently focusing on streaming, from my point of view it makes 
sense to also have a streaming machine learning API as part of the Flink 
ecosystem.

I’m currently working on building a prototype for a distributed streaming 
machine learning library based on Flink that can be used for online and 
“classical” offline learning.

The machine learning algorithm takes both labeled and unlabeled data. For a 
labeled data point, a prediction is performed first and the label is then 
used to train the model. For an unlabeled data point, only a prediction is 
performed. The main difference between the online and offline algorithms is 
that in the offline case the labeled data must be handed to the model before 
the unlabeled data, while in the online case it is still possible to process 
labeled data at a later point to update the model. The advantage of this 
approach is that batch algorithms can be applied to streaming data, and 
online algorithms can be supported as well.
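
To make the predict-then-train pattern concrete, here is a minimal sketch of 
how it might look with Flink’s DataStream API. The Model trait, the data 
types, and keeping the model as plain local state per parallel instance are 
simplifying assumptions of mine, not details of the prototype:

    import org.apache.flink.api.common.functions.RichFlatMapFunction
    import org.apache.flink.util.Collector

    case class Instance(features: Array[Double], label: Option[Double])
    case class Prediction(features: Array[Double], value: Double)

    // Hypothetical model interface; any online learner could sit behind it.
    trait Model extends Serializable {
      def predict(features: Array[Double]): Double
      def update(features: Array[Double], label: Double): Unit
    }

    // Every instance is scored first; labeled instances then update the
    // model. Fault tolerance and sharing the model across parallel
    // instances are ignored for the sake of the sketch.
    class PredictAndTrain(model: Model)
        extends RichFlatMapFunction[Instance, Prediction] {
      override def flatMap(in: Instance, out: Collector[Prediction]): Unit = {
        out.collect(Prediction(in.features, model.predict(in.features)))
        in.label.foreach(l => model.update(in.features, l))
      }
    }

An offline variant would buffer labeled instances and train once instead of 
updating immediately, while the surrounding dataflow stays the same.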

One difference from batch learning lies in the transformers that are used to 
preprocess the data. For example, a simple mean subtraction must be 
implemented with a rolling mean, because we can’t calculate the mean over all 
the data; the Flink streaming API is perfect for that. It would be useful for 
users to have an extensible toolbox of such transformers.
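
For illustration, a rolling mean subtraction could look roughly like the 
sketch below. State is kept per parallel instance and is not checkpointed, 
so this only shows the shape of such a transformer:

    import org.apache.flink.api.common.functions.RichMapFunction

    // Subtracts the running mean of all values this instance has seen so far.
    class RollingMeanSubtraction extends RichMapFunction[Double, Double] {
      private var count: Long = 0L
      private var sum: Double = 0.0

      override def map(x: Double): Double = {
        count += 1
        sum += x
        x - sum / count
      }
    }

    // Usage (the input stream name is made up):
    // val centered = values.map(new RollingMeanSubtraction)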

Another difference is the evaluation of the models. In streaming scenarios 
we don’t have a single value that determines the model quality; instead, this 
value evolves over time as the model sees more labeled data.
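
As a sketch of such an evolving metric, a running mean squared error over 
(prediction, label) pairs could look like this (that the pairs are matched 
up somewhere upstream is my assumption):

    import org.apache.flink.api.common.functions.RichMapFunction

    // Emits the mean squared error over all (prediction, label) pairs seen
    // so far, so the model quality is itself a stream evolving over time.
    class RunningMSE extends RichMapFunction[(Double, Double), Double] {
      private var n: Long = 0L
      private var sse: Double = 0.0

      override def map(p: (Double, Double)): Double = {
        val diff = p._1 - p._2
        n += 1
        sse += diff * diff
        sse / n
      }
    }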

However, transformation and evaluation again work similarly in both online 
and offline learning.

I also liked the discussion in [2]. I think competition in the batch 
learning field is hard and there are already a lot of great projects. It is 
probably true that most real-world problems do not require updating the model 
immediately, but there are a lot of use cases for machine learning on 
streams, and for those it would be nice to have a native approach.

A streaming machine learning API would fit Flink very well, and I would also 
be willing to contribute to the future development of the Flink ML library.



Best regards,

Philipp

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-td16040.html
[2] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2


On 2017-02-23 15:48, Gábor Hermann <m...@gaborhermann.com> wrote:

Okay, I've created a skeleton of the design doc for choosing a direction:
https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit?usp=sharing

Many of the pros/cons have already been discussed here, so I'll try to 
collect in the doc all the arguments mentioned in this thread. Feel free to 
add more :)

@Stavros: I agree we should take action fast. What about collecting our 
thoughts in the doc by around Tuesday next week (February 28)? Then deciding 
on the direction and designing a roadmap by around Friday (March 3)? Is that 
feasible, or should it take more time?

I think it will be necessary to have a shepherd, or even better a committer, 
involved in at least reviewing and accepting the roadmap. It would be best if 
a committer coordinated all this.
@Theodore: Would you like to do the coordination?

Regarding the use-cases: I've seen some abstracts of talks at SF Flink Forward 
[1] that seem promising. There are companies already using Flink for ML 
[2,3,4,5].

[1] http://sf.flink-forward.org/program/sessions/
[2] 
http://sf.flink-forward.org/kb_sessions/experiences-with-streaming-vs-micro-batch-for-online-learning/
[3] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
[4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-learning-on-flink/
[5] 
http://sf.flink-forward.org/kb_sessions/streaming-deep-learning-scenarios-with-flink/

Cheers,
Gabor


On 2017-02-23 15:19, Katherin Eri wrote:
I have already asked some teams for useful cases, but all of them need time 
to think; something will eventually arise during analysis.
Maybe we can ask Flink's partners for cases? Data Artisans has the results of 
a customer survey [1]: better ML support is wanted, so we could ask what 
exactly is necessary.

[1] http://data-artisans.com/flink-user-survey-2016-part-2/

On Feb 23, 2017, 4:32 PM, "Stavros Kontopoulos" <st.kontopou...@gmail.com> 
wrote:

+100 for a design doc.

Could we also set a roadmap after some time-boxed investigation captured in
that document? We need action.

Looking forward to working on this (whatever that might be) ;) Also, is 
there any data supporting one direction or the other from a customer 
perspective? It would help us make more informed decisions.

On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <katherinm...@gmail.com>
wrote:

Yes, ok.
Let's start a design document and write down the ideas already mentioned 
there: the parameter server, Clipper, and others. It would be nice if we also 
mapped these approaches to use cases.
We can work on it collaboratively, topic by topic; maybe we will finally form 
a picture that the committers can agree with.
@Gabor, could you please start such a shared doc, as you have already 
proposed several ideas?

On Thu, Feb 23, 2017 at 15:06, Gábor Hermann <m...@gaborhermann.com> wrote:

I agree that it's better to go in one direction first, but I think online 
and offline with the streaming API can go somewhat in parallel later. We 
could set a short-term goal, concentrate initially on one direction, and 
showcase that direction (e.g. in a blog post). But first, we should at 
minimum list the pros/cons in a design doc, and then decide which direction 
to go. Would that be feasible?

On 2017-02-23 12:34, Katherin Eri wrote:

I'm not sure that this is feasible; doing everything at the same time could 
mean doing nothing((((
I'm just afraid that the words "we will work on streaming, not on batching; 
we have no committer time for this" mean that yes, we start work on 
FLINK-1730, but nobody will commit this work in the end, as has already 
happened with this ticket.

On Feb 23, 2017, 14:26, "Gábor Hermann" <m...@gaborhermann.com> wrote:

@Theodore: Great to hear you think the "batch on streaming" approach is 
possible! Of course, we need to pay attention to all the pitfalls there if we 
go that way.

+1 for a design doc!

I would add that it's possible to make efforts in all three directions (i.e. 
batch, online, batch on streaming) at the same time, although it might be 
worth concentrating on one. E.g. it would not be so useful to have the same 
batch algorithms in both the batch API and the streaming API. We can decide 
later.

The design doc could be partitioned into these 3 directions, and we can 
collect the pros/cons there too. What do you think?

Cheers,
Gabor


On 2017-02-23 12:13, Theodore Vasiloudis wrote:

Hello all,


@Gabor, we have discussed the idea of using the streaming API to write all 
of our ML algorithms with a couple of people offline, and I think it might be 
possible and is generally worth a shot. The approach we would take would be 
close to Vowpal Wabbit: not exactly "online", but rather "fast-batch".

There will be problems popping up again, even for very simple algorithms 
like online linear regression with SGD [1], but hopefully fixing those will 
be more aligned with the priorities of the community.

@Katherin, my understanding is that, given the limited resources, there is 
no development effort focused on batch processing right now.

So to summarize, it seems like there are people willing to work on ML on 
Flink, but nobody is sure how to do it.
There are many directions we could take (batch, online, batch on streaming), 
each with its own merits and downsides.

If you want, we can start a design doc, move the conversation there, come up 
with a roadmap, and start implementing.

Regards,
Theodore

[1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html

On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <m...@gaborhermann.com> 
wrote:

It's great to see so much activity in this discussion :)
I'll try to add my thoughts.

I think building a developer community (Till's 2. point) can be slightly 
separated from what features we should aim for (1. point) and from showcasing 
(3. point). Thanks Till for bringing up the ideas for restructuring; I'm sure 
we'll find a way to make the development process more dynamic. I'll try to 
address the rest here.

It's hard to choose a direction between streaming and batch ML. As Theo has 
indicated, not much online ML is used in production, but Flink concentrates 
on streaming, so online ML would be a better fit for Flink. However, as most 
of you argued, there's a definite need for batch ML. But batch ML seems hard 
to achieve, because there are blocking issues with persisting, iteration 
paths, etc. So it's no good either way.

I propose a seemingly crazy solution: what if we developed batch algorithms 
with the streaming API as well? The batch API would clearly seem more 
suitable for ML algorithms, but there are a lot of benefits to this approach 
too, so it's clearly worth considering. Flink also has the high-level vision 
of "streaming for everything", and this case would clearly fit that vision. 
What do you all think about this? Do you think this solution would be 
feasible? I would be happy to make a more elaborate proposal, but let me push 
my main ideas here:

1) Simplifying by using one system
It could simplify the work of both users and developers. One could execute 
training once, or periodically, e.g. by using windows. Low-latency serving 
and training could be done in the same system. We could implement incremental 
algorithms, without any side inputs, for combining online learning (or 
predictions) with batch learning. Of course, all the logic describing this 
must still be implemented somehow (e.g. synchronizing predictions with 
training), but it should be easier to do so in one system than by combining 
e.g. the batch and streaming APIs.
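
As a rough illustration of the windowed retraining idea, here is a sketch 
(every name below is invented, and trainModel stands in for an arbitrary 
batch algorithm):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow
    import org.apache.flink.util.Collector

    case class Labeled(features: Array[Double], label: Double)
    case class Model(weights: Array[Double])

    // Placeholder for a real batch training step.
    def trainModel(batch: Iterable[Labeled]): Model = Model(Array(0.0))

    // Retrain once per hour on the labeled stream; the resulting stream of
    // models could feed a low-latency serving operator in the same job.
    def periodicTraining(labeled: DataStream[Labeled]): DataStream[Model] =
      labeled
        .timeWindowAll(Time.hours(1))
        .apply { (w: TimeWindow, batch: Iterable[Labeled], out: Collector[Model]) =>
          out.collect(trainModel(batch))
        }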

2) Batch ML with the streaming API is not harder
Despite these benefits, it might seem harder to implement batch ML with the 
streaming API, but in my opinion it's not. There are more flexible, 
lower-level optimization potentials with the streaming API. Most distributed 
ML algorithms use a lower-level model than the batch API anyway, so it 
sometimes feels like forcing the algorithm logic into the batch API and 
tweaking it. Although we could not use batch primitives like join, we would 
have that lower-level flexibility. E.g., in my experience with implementing a 
distributed matrix factorization algorithm [1], I couldn't do a simple 
optimization because of the limitations of the iteration API [2]. Even if we 
pushed all the development effort into making the batch API more suitable for 
ML, there would be things we couldn't do. E.g. there are approaches for 
updating a model iteratively without locks [3,4] (i.e. somewhat 
asynchronously), and I don't see a clear way to implement such algorithms 
with the batch API.

3) Streaming community (users and devs) benefit
The Flink streaming community in general would also benefit from this 
direction. There are many features needed in the streaming API for ML to 
work, but this is also true for the batch API. One really important one is 
the loops API (a.k.a. iterative DataStreams) [5]. There has been a lot of 
effort (mostly from Paris) to make it mature enough [6]. Kate mentioned using 
GPUs, and I'm sure they have uses in streaming generally [7]. Thus, by 
improving the streaming API to allow ML algorithms, the streaming API 
benefits too (which is important, as it has a lot more production users than 
the batch API).
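
For reference, the iterative DataStream API today roughly looks like the toy 
example below (decrementing positive numbers until they reach zero); whether 
it is mature enough for ML loops is exactly what [5,6] are about:

    import org.apache.flink.streaming.api.scala._

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val source: DataStream[Long] = env.fromElements(3L, 5L, 10L)

    // The step function returns (feedback, output): values that are still
    // positive are fed back into the loop, the rest leave the iteration.
    val done: DataStream[Long] = source.iterate { input: DataStream[Long] =>
      val decremented = input.map(_ - 1)
      (decremented.filter(_ > 0), decremented.filter(_ <= 0))
    }

    done.print()
    env.execute("Iterative DataStream sketch")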

4) Performance can be at least as good
I believe the same performance could be achieved with the streaming API as 
with the batch API. The streaming API is much closer to the runtime than the 
batch API. For corner cases covered by runtime-layer optimizations of the 
batch API, we could find a way to do the same (or a similar) optimization for 
the streaming API (see my previous point). One such case is using managed 
memory (and spilling to disk). There are also benefits by default, e.g. we 
would have finer-grained fault tolerance with the streaming API.

5) We could keep the batch ML API
For the shorter term, we should not throw away all the algorithms implemented 
with the batch API. By pushing forward the development of side inputs, we 
could make them usable with the streaming API. Then, if the library gains 
some popularity, we could replace the batch API algorithms with streaming 
ones, to avoid the performance costs of e.g. not being able to persist.

6) General tools for implementing ML algorithms
Besides implementing algorithms one by one, we could provide more general 
tools that make it easier to implement algorithms, e.g. a parameter server 
[8,9]. Theo also mentioned in another thread that TensorFlow has a similar 
model to Flink streaming; we could look into that too. I think deploying a 
production ML system often requires much more configuration and tweaking than 
e.g. Spark MLlib allows. Why not allow that?
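
Just to make the parameter server idea concrete, the core interface is tiny. 
The sketch below uses names I made up; it is not an existing Flink API:

    // Workers pull the current parameters for a key and push updates back;
    // the server partitions the model by key and applies the updates,
    // possibly asynchronously.
    trait ParameterServer[K, P] {
      def pull(key: K): P
      def push(key: K, delta: P): Unit
    }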

7) Showcasing
Showcasing this could be easier. We could say that we're doing batch ML with 
a streaming API, which is interesting in its own right. IMHO this integration 
is also a more approachable way towards end-to-end ML.


Thanks for reading so far :)

[1] https://github.com/apache/flink/pull/2819
[2] https://issues.apache.org/jira/browse/FLINK-2396
[3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
[4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
[5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
[6] https://github.com/apache/flink/pull/1668
[7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
[8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
[9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html

Cheers,
Gabor


--
Yours faithfully,
Kate Eri.


