It's great to see so much activity in this discussion :)
I'll try to add my thoughts.
I think building a developer community (Till's point 2) can be somewhat
separated from what features we should aim for (point 1) and from
showcasing (point 3). Thanks Till for bringing up the ideas for
restructuring; I'm sure we'll find a way to make the development process
more dynamic. I'll try to address the rest here.
It's hard to choose a direction between streaming and batch ML. As Theo
has indicated, not much online ML is used in production, but Flink
concentrates on streaming, so online ML would be the better fit for
Flink. However, as most of you argued, there's a definite need for batch
ML. But batch ML seems hard to achieve, because there are blocking
issues with persisting, iteration paths etc. So neither direction is
clearly preferable.
I propose a seemingly crazy solution: what if we developed the batch
algorithms with the streaming API too? The batch API might seem the more
natural fit for ML algorithms, but this approach has a lot of benefits
as well, so it's worth considering. It would also fit Flink's high-level
vision of "streaming for everything". What do you all think? Do you
think this solution would be feasible? I would be happy to write up a
more elaborate proposal, but let me outline my main ideas here:
1) Simplifying by using one system
It could simplify the work of both users and developers. One could
execute training once, or periodically, e.g. by using windows.
Low-latency serving and training could be done in the same system. We
could implement incremental algorithms that combine online learning (or
prediction) with batch learning, without needing any side inputs. Of
course, all the logic describing this must still be implemented somehow
(e.g. synchronizing predictions with training), but it should be easier
to do in one system than by combining e.g. the batch and streaming APIs.
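To make this concrete, here is a minimal sketch (in Scala, against the
DataStream API) of "train periodically, serve continuously" in a single
job. LabeledPoint, Model and fit() are placeholders I made up, not
existing Flink classes, and the actual training logic is elided:

import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

case class LabeledPoint(features: Array[Double], label: Double)
case class Model(weights: Array[Double]) {
  def predict(x: Array[Double]): Double =
    x.zip(weights).map { case (xi, wi) => xi * wi }.sum
}

// Serving operator: keeps the latest model, scores incoming queries.
class Serve extends CoFlatMapFunction[Array[Double], Model, Double] {
  private var latest: Model = _ // real code would keep this in Flink state
  override def flatMap1(q: Array[Double], out: Collector[Double]): Unit =
    if (latest != null) out.collect(latest.predict(q))
  override def flatMap2(m: Model, out: Collector[Double]): Unit =
    latest = m
}

object TrainAndServe {
  // Stand-in "batch" training over one window's worth of data.
  def fit(points: Iterable[LabeledPoint]): Model =
    Model(Array(0.0, 0.0)) // the actual optimization would go here

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Stand-in sources; in practice these would be e.g. Kafka topics.
    val training = env.fromElements(LabeledPoint(Array(1.0, 2.0), 1.0))
    val queries  = env.fromElements(Array(0.5, 1.5))

    // Periodic training: fit a fresh model on every 10-minute window.
    val models: DataStream[Model] = training
      .timeWindowAll(Time.minutes(10))
      .apply { (_: TimeWindow, pts: Iterable[LabeledPoint], out: Collector[Model]) =>
        out.collect(fit(pts))
      }

    // Low-latency serving in the same job: broadcast every new model
    // to all serving tasks.
    queries.connect(models.broadcast).flatMap(new Serve).print()

    env.execute("train-and-serve-sketch")
  }
}

This is deliberately oversimplified (no checkpointed state, no event
time), but the shape is the point: training and serving are just two
operators of one streaming job.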
2) Batch ML with the streaming API is not harder
Despite these benefits, it might seem harder to implement batch ML with
the streaming API, but in my opinion it's not. The streaming API has
more flexible, lower-level optimization potential. Most distributed ML
algorithms use a lower-level model than the batch API anyway, so
implementing them sometimes feels like forcing the algorithm logic into
the batch API and tweaking it. Although we could not use batch
primitives like join, we would have that lower-level control instead.
E.g. in my experience with implementing a distributed matrix
factorization algorithm [1], I couldn't do a simple optimization because
of the limitations of the iteration API [2]. Even if we pushed all the
development effort into making the batch API more suitable for ML, there
would still be things we couldn't do. E.g. there are approaches for
updating a model iteratively without locks [3,4] (i.e. somewhat
asynchronously), and I don't see a clear way to implement such
algorithms with the batch API.
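To illustrate, a hedged sketch of the "no synchronization barrier" idea:
each gradient is folded into the owning parameter partition the moment
it arrives, instead of waiting for an iteration superstep. Gradient is a
made-up type and the constant step size is arbitrary; this is not
Hogwild! itself, just the streaming shape such an update could take:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

case class Gradient(block: Int, delta: Array[Double])

// Keyed by parameter block: each parallel task owns some blocks and
// applies updates in arrival order, with no barrier across blocks.
class AsyncApply extends RichFlatMapFunction[Gradient, (Int, Array[Double])] {
  private var weights: ValueState[Array[Double]] = _

  override def open(conf: Configuration): Unit =
    weights = getRuntimeContext.getState(
      new ValueStateDescriptor("weights", classOf[Array[Double]]))

  override def flatMap(g: Gradient, out: Collector[(Int, Array[Double])]): Unit = {
    val w = Option(weights.value()).getOrElse(Array.fill(g.delta.length)(0.0))
    for (i <- w.indices) w(i) -= 0.1 * g.delta(i) // apply immediately
    weights.update(w)
    out.collect((g.block, w)) // publish the updated block downstream
  }
}

// Wiring: gradients.keyBy(_.block).flatMap(new AsyncApply)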
3) Streaming community (users and devs) benefit
The Flink streaming community in general would also benefit from this
direction. There are many features the streaming API needs for ML to
work, but the same is true of the batch API. One really important one is
the loops API (a.k.a. iterative DataStreams) [5]. There has been a lot
of effort (mostly from Paris) to make it mature enough [6]. Kate
mentioned using GPUs, and I'm sure they have uses in streaming generally
[7]. Thus, by improving the streaming API to support ML algorithms, the
streaming users and developers benefit too (which is important, as the
streaming API has a lot more production users than the batch API).
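For reference, this is roughly what the current loops API looks like;
any ML iteration we build would sit on top of this primitive. A toy
example of my own, repeatedly halving values until they drop below a
threshold:

import org.apache.flink.streaming.api.scala._

object LoopsSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val input: DataStream[Double] = env.fromElements(64.0, 100.0, 9.0)

    // iterate: the first stream of the returned pair is fed back into
    // the loop, the second one exits it.
    val done: DataStream[Double] = input.iterate(
      (loop: DataStream[Double]) => {
        val step = loop.map(_ / 2.0)          // one step of the computation
        val feedback = step.filter(_ > 1.0)   // not converged: go around again
        val output = step.filter(_ <= 1.0)    // converged: leave the loop
        (feedback, output)
      })

    done.print()
    env.execute("loops-sketch")
  }
}

Notably, clean termination of such loops is exactly one of the things
FLIP-15 [5] is about.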
4) Performance can be at least as good
I believe the same performance could be achieved with the streaming API
as with the batch API. The streaming API is much closer to the runtime
than the batch API is. For corner cases where the batch API has
runtime-layer optimizations, we could find a way to do the same (or a
similar) optimization for the streaming API (see my previous point). One
such case is using managed memory (and spilling to disk). There are also
benefits we would get by default, e.g. finer-grained fault tolerance
with the streaming API.
5) We could keep batch ML API
In the shorter term, we should not throw away the algorithms already
implemented with the batch API. By pushing forward the development of
side inputs, we could make them usable from the streaming API. Then, if
the library gains some popularity, we could replace the batch algorithms
with streaming ones, to avoid the performance costs of e.g. not being
able to persist.
6) General tools for implementing ML algorithms
Besides implementing algorithms one by one, we could provide more
general tools that make implementing algorithms easier, e.g. a parameter
server [8,9]. Theo also mentioned in another thread that TensorFlow has
a similar model to Flink streaming; we could look into that too. I think
that deploying a production ML system often needs much more
configuration and tweaking than e.g. Spark MLlib allows. Why not allow
it?
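As a very rough sketch of what a parameter-server-style building block
could look like on the streaming API: pushes (gradient updates) and
pulls (read requests) for the same parameter block are routed to the
same stateful task. All the types here (Push, Pull, BlockValue) are
hypothetical, and a real implementation would keep the shard in
checkpointed Flink state rather than a plain map:

import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.util.Collector

case class Push(block: Int, delta: Array[Double])
case class Pull(block: Int, workerId: Int)
case class BlockValue(block: Int, workerId: Int, weights: Array[Double])

// One "server shard": owns the parameter blocks routed to this task.
class ServerShard extends CoFlatMapFunction[Push, Pull, BlockValue] {
  private val shard = scala.collection.mutable.Map[Int, Array[Double]]()

  override def flatMap1(p: Push, out: Collector[BlockValue]): Unit = {
    val w = shard.getOrElseUpdate(p.block, Array.fill(p.delta.length)(0.0))
    for (i <- w.indices) w(i) -= 0.1 * p.delta(i) // made-up step size
  }

  override def flatMap2(q: Pull, out: Collector[BlockValue]): Unit =
    shard.get(q.block).foreach(w => out.collect(BlockValue(q.block, q.workerId, w)))
}

// Wiring: key both streams by block so a given block always lands on
// the same shard:
//   pushes.connect(pulls).keyBy(_.block, _.block).flatMap(new ServerShard)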
7) Showcasing
Showcasing this could be easier, too. We could say that we're doing
batch ML with a streaming API, which is interesting in its own right.
IMHO this integration is also a more approachable way towards end-to-end
ML.
Thanks for reading so far :)
[1] https://github.com/apache/flink/pull/2819
[2] https://issues.apache.org/jira/browse/FLINK-2396
[3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
[4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
[5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
[6] https://github.com/apache/flink/pull/1668
[7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
[8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
[9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
Cheers,
Gabor