It's great to see so much activity in this discussion :)
I'll try to add my thoughts.
I think building a developer community (Till's point 2) can be somewhat
separated from what features we should aim for (point 1) and from
showcasing (point 3). Thanks Till for bringing up the ideas for
restructuring; I'm sure we'll find a way to make the development process
more dynamic. I'll try to address the rest here.
It's hard to choose a direction between streaming and batch ML. As Theo
has indicated, not much online ML is used in production, but Flink
concentrates on streaming, so online ML would be the better fit for
Flink. However, as most of you argued, there's a definite need for batch
ML. But batch ML seems hard to achieve, because there are blocking
issues with persisting, iteration paths etc. So neither direction is
clearly preferable.
I propose a seemingly crazy solution: what if we developed the batch
algorithms with the streaming API too? The batch API might seem the more
natural fit for ML algorithms, but this approach has a lot of benefits
as well, so it's worth considering. It would also fit Flink's high-level
vision of "streaming for everything". What do you all think? Do you
think this solution would be feasible? I would be happy to write up a
more elaborate proposal, but let me outline my main ideas here:
1) Simplifying by using one system
It could simplify the work of both users and developers. One could
execute training once, or periodically, e.g. by using windows.
Low-latency serving and training could be done in the same system. We
could implement incremental algorithms that combine online learning (or
prediction) with batch learning, without needing any side inputs. Of
course, all the logic describing this must still be implemented somehow
(e.g. synchronizing predictions with training), but it should be easier
to do in one system than by combining e.g. the batch and streaming APIs.
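To make this concrete, here is a minimal sketch (in Scala, against the
DataStream API) of "train periodically, serve continuously" in a single
job. LabeledPoint, Model and fit() are placeholders I made up, not
existing Flink classes, and the actual training logic is elided:

import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

case class LabeledPoint(features: Array[Double], label: Double)
case class Model(weights: Array[Double]) {
  def predict(x: Array[Double]): Double =
    x.zip(weights).map { case (xi, wi) => xi * wi }.sum
}

// Serving operator: keeps the latest model, scores incoming queries.
class Serve extends CoFlatMapFunction[Array[Double], Model, Double] {
  private var latest: Model = _ // real code would keep this in Flink state
  override def flatMap1(q: Array[Double], out: Collector[Double]): Unit =
    if (latest != null) out.collect(latest.predict(q))
  override def flatMap2(m: Model, out: Collector[Double]): Unit =
    latest = m
}

object TrainAndServe {
  // Stand-in "batch" training over one window's worth of data.
  def fit(points: Iterable[LabeledPoint]): Model =
    Model(Array(0.0, 0.0)) // the actual optimization would go here

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Stand-in sources; in practice these would be e.g. Kafka topics.
    val training = env.fromElements(LabeledPoint(Array(1.0, 2.0), 1.0))
    val queries  = env.fromElements(Array(0.5, 1.5))

    // Periodic training: fit a fresh model on every 10-minute window.
    val models: DataStream[Model] = training
      .timeWindowAll(Time.minutes(10))
      .apply { (_: TimeWindow, pts: Iterable[LabeledPoint], out: Collector[Model]) =>
        out.collect(fit(pts))
      }

    // Low-latency serving in the same job: broadcast every new model
    // to all serving tasks.
    queries.connect(models.broadcast).flatMap(new Serve).print()

    env.execute("train-and-serve-sketch")
  }
}

This is deliberately oversimplified (no checkpointed state, no event
time), but the shape is the point: training and serving are just two
operators of one streaming job.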
2) Batch ML with the streaming API is not harder
Despite these benefits, it might seem harder to implement batch ML with
the streaming API, but in my opinion it's not. The streaming API has
more flexible, lower-level optimization potential. Most distributed ML
algorithms use a lower-level model than the batch API anyway, so
implementing them sometimes feels like forcing the algorithm logic into
the batch API and tweaking it. Although we could not use batch
primitives like join, we would have that lower-level control instead.
E.g. in my experience with implementing a distributed matrix
factorization algorithm [1], I couldn't do a simple optimization because
of the limitations of the iteration API [2]. Even if we pushed all the
development effort into making the batch API more suitable for ML, there
would still be things we couldn't do. E.g. there are approaches for
updating a model iteratively without locks [3,4] (i.e. somewhat
asynchronously), and I don't see a clear way to implement such
algorithms with the batch API.
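To illustrate, a hedged sketch of the "no synchronization barrier" idea:
each gradient is folded into the owning parameter partition the moment
it arrives, instead of waiting for an iteration superstep. Gradient is a
made-up type and the constant step size is arbitrary; this is not
Hogwild! itself, just the streaming shape such an update could take:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

case class Gradient(block: Int, delta: Array[Double])

// Keyed by parameter block: each parallel task owns some blocks and
// applies updates in arrival order, with no barrier across blocks.
class AsyncApply extends RichFlatMapFunction[Gradient, (Int, Array[Double])] {
  private var weights: ValueState[Array[Double]] = _

  override def open(conf: Configuration): Unit =
    weights = getRuntimeContext.getState(
      new ValueStateDescriptor("weights", classOf[Array[Double]]))

  override def flatMap(g: Gradient, out: Collector[(Int, Array[Double])]): Unit = {
    val w = Option(weights.value()).getOrElse(Array.fill(g.delta.length)(0.0))
    for (i <- w.indices) w(i) -= 0.1 * g.delta(i) // apply immediately
    weights.update(w)
    out.collect((g.block, w)) // publish the updated block downstream
  }
}

// Wiring: gradients.keyBy(_.block).flatMap(new AsyncApply)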
3) Streaming community (users and devs) benefit
The Flink streaming community in general would also benefit from this
direction. There are many features the streaming API needs for ML to
work, but the same is true of the batch API. One really important one is
the loops API (a.k.a. iterative DataStreams) [5]. There has been a lot
of effort (mostly from Paris) to make it mature enough [6]. Kate
mentioned using GPUs, and I'm sure they have uses in streaming generally
[7]. Thus, by improving the streaming API to support ML algorithms, the
streaming users and developers benefit too (which is important, as the
streaming API has a lot more production users than the batch API).
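For reference, this is roughly what the current loops API looks like;
any ML iteration we build would sit on top of this primitive. A toy
example of my own, repeatedly halving values until they drop below a
threshold:

import org.apache.flink.streaming.api.scala._

object LoopsSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val input: DataStream[Double] = env.fromElements(64.0, 100.0, 9.0)

    // iterate: the first stream of the returned pair is fed back into
    // the loop, the second one exits it.
    val done: DataStream[Double] = input.iterate(
      (loop: DataStream[Double]) => {
        val step = loop.map(_ / 2.0)          // one step of the computation
        val feedback = step.filter(_ > 1.0)   // not converged: go around again
        val output = step.filter(_ <= 1.0)    // converged: leave the loop
        (feedback, output)
      })

    done.print()
    env.execute("loops-sketch")
  }
}

Notably, clean termination of such loops is exactly one of the things
FLIP-15 [5] is about.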
4) Performance can be at least as good
I believe the same performance could be achieved with the streaming API
as with the batch API. The streaming API is much closer to the runtime
than the batch API is. For corner cases where the batch API has
runtime-layer optimizations, we could find a way to do the same (or a
similar) optimization for the streaming API (see my previous point). One
such case is using managed memory (and spilling to disk). There are also
benefits we would get by default, e.g. finer-grained fault tolerance
with the streaming API.
5) We could keep batch ML API
In the shorter term, we should not throw away the algorithms already
implemented with the batch API. By pushing forward the development of
side inputs, we could make them usable from the streaming API. Then, if
the library gains some popularity, we could replace the batch algorithms
with streaming ones, to avoid the performance costs of e.g. not being
able to persist.
6) General tools for implementing ML algorithms
Besides implementing algorithms one by one, we could provide more
general tools that make implementing algorithms easier, e.g. a parameter
server [8,9]. Theo also mentioned in another thread that TensorFlow has
a similar model to Flink streaming; we could look into that too. I think
that deploying a production ML system often needs much more
configuration and tweaking than e.g. Spark MLlib allows. Why not allow
it?
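As a very rough sketch of what a parameter-server-style building block
could look like on the streaming API: pushes (gradient updates) and
pulls (read requests) for the same parameter block are routed to the
same stateful task. All the types here (Push, Pull, BlockValue) are
hypothetical, and a real implementation would keep the shard in
checkpointed Flink state rather than a plain map:

import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.util.Collector

case class Push(block: Int, delta: Array[Double])
case class Pull(block: Int, workerId: Int)
case class BlockValue(block: Int, workerId: Int, weights: Array[Double])

// One "server shard": owns the parameter blocks routed to this task.
class ServerShard extends CoFlatMapFunction[Push, Pull, BlockValue] {
  private val shard = scala.collection.mutable.Map[Int, Array[Double]]()

  override def flatMap1(p: Push, out: Collector[BlockValue]): Unit = {
    val w = shard.getOrElseUpdate(p.block, Array.fill(p.delta.length)(0.0))
    for (i <- w.indices) w(i) -= 0.1 * p.delta(i) // made-up step size
  }

  override def flatMap2(q: Pull, out: Collector[BlockValue]): Unit =
    shard.get(q.block).foreach(w => out.collect(BlockValue(q.block, q.workerId, w)))
}

// Wiring: key both streams by block so a given block always lands on
// the same shard:
//   pushes.connect(pulls).keyBy(_.block, _.block).flatMap(new ServerShard)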
7) Showcasing
Showcasing this could be easier, too. We could say that we're doing
batch ML with a streaming API, which is interesting in its own right.
IMHO this integration is also a more approachable way towards end-to-end
ML.
Thanks for reading so far :)
[1] https://github.com/apache/flink/pull/2819
[2] https://issues.apache.org/jira/browse/FLINK-2396
[3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
[4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
[5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
[6] https://github.com/apache/flink/pull/1668
[7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
[8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
[9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
Cheers,
Gabor