@Gabor 3rd March is ok for me. But maybe giving it a bit more time, say a week, would suit more people. What do you think, all? I will contribute to the doc.
+100 for having a coordinator + committer. Thank you all for joining the
discussion.

Cheers,
Stavros

On Thu, Feb 23, 2017 at 4:48 PM, Gábor Hermann <m...@gaborhermann.com> wrote:

> Okay, I've created a skeleton of the design doc for choosing a direction:
> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit?usp=sharing
>
> Much of the pros/cons have already been discussed here, so I'll try to
> put all the arguments mentioned in this thread there. Feel free to add
> more :)
>
> @Stavros: I agree we should take action fast. What about collecting our
> thoughts in the doc by around Tuesday next week (28 February)? Then
> deciding on the direction and designing a roadmap by around Friday
> (3 March)? Is that feasible, or should it take more time?
>
> I think it will be necessary to have a shepherd, or even better a
> committer, involved in at least reviewing and accepting the roadmap. It
> would be best if a committer coordinated all this.
> @Theodore: Would you like to do the coordination?
>
> Regarding the use-cases: I've seen some abstracts of talks at SF Flink
> Forward [1] that seem promising. There are companies already using Flink
> for ML [2,3,4,5].
>
> [1] http://sf.flink-forward.org/program/sessions/
> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-streaming-vs-micro-batch-for-online-learning/
> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-learning-on-flink/
> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learning-scenarios-with-flink/
>
> Cheers,
> Gabor
>
>
> On 2017-02-23 15:19, Katherin Eri wrote:
>
>> I have already asked some teams for useful cases, but all of them need
>> time to think. During the analysis something will eventually arise.
>> Maybe we can also ask Flink's partners for cases? Data Artisans got the
>> results of a customer survey [1]; better ML support is wanted, so we
>> could ask what exactly is necessary.
>>
>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
>>
>> On 23 Feb 2017 at 4:32 PM, "Stavros Kontopoulos"
>> <st.kontopou...@gmail.com> wrote:
>>
>>> +100 for a design doc.
>>>
>>> Could we also set a roadmap after some time-boxed investigation,
>>> captured in that document? We need action.
>>>
>>> Looking forward to working on this (whatever that might be) ;) Also,
>>> is there any data supporting one direction or the other from a
>>> customer perspective? It would help us make more informed decisions.
>>>
>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri
>>> <katherinm...@gmail.com> wrote:
>>>
>>>> Yes, ok.
>>>> Let's start a design document and write down the already mentioned
>>>> ideas there: the parameter server, Clipper, and others. It would be
>>>> nice if we also mapped these approaches to use-cases.
>>>> We will work on each topic collaboratively; maybe we will finally
>>>> form a picture that can be agreed with the committers.
>>>> @Gabor, could you please start such a shared doc, as you have
>>>> already proposed several ideas?
>>>>
>>>> On Thu, 23 Feb 2017 at 15:06, Gábor Hermann <m...@gaborhermann.com>
>>>> wrote:
>>>>
>>>>> I agree that it's better to go in one direction first, but I think
>>>>> online and offline ML with the streaming API can go somewhat in
>>>>> parallel later. We could set a short-term goal, concentrate
>>>>> initially on one direction, and showcase that direction (e.g. in a
>>>>> blogpost).
>>>>> But first, we should list the pros/cons in a design doc as a
>>>>> minimum, then make a decision on which direction to go. Would that
>>>>> be feasible?
>>>>>
>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
>>>>>
>>>>>> I'm not sure that this is feasible; doing everything at the same
>>>>>> time could mean doing nothing((((
>>>>>> I'm just afraid that the words "we will work on streaming, not on
>>>>>> batching, we have no committer's time for this" mean that yes, we
>>>>>> started work on FLINK-1730, but nobody will commit this work in
>>>>>> the end, as has already happened with this ticket.
>>>>>>
>>>>>> On 23 Feb 2017 at 14:26, "Gábor Hermann" <m...@gaborhermann.com>
>>>>>> wrote:
>>>>>>
>>>>>>> @Theodore: Great to hear you think the "batch on streaming"
>>>>>>> approach is possible! Of course, we need to pay attention to all
>>>>>>> the pitfalls there if we go that way.
>>>>>>>
>>>>>>> +1 for a design doc!
>>>>>>>
>>>>>>> I would add that it's possible to make efforts in all three
>>>>>>> directions (i.e. batch, online, batch on streaming) at the same
>>>>>>> time, although it might be worth concentrating on one. E.g. it
>>>>>>> would not be so useful to have the same batch algorithms with
>>>>>>> both the batch API and the streaming API. We can decide later.
>>>>>>>
>>>>>>> The design doc could be partitioned into these 3 directions, and
>>>>>>> we can collect the pros/cons there too. What do you think?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Gabor
>>>>>>>
>>>>>>>
>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>>>>>>
>>>>>>>> Hello all,
>>>>>>>>
>>>>>>>> @Gabor, we have discussed the idea of using the streaming API
>>>>>>>> to write all of our ML algorithms with a couple of people
>>>>>>>> offline, and I think it might be possible and is generally
>>>>>>>> worth a shot. The approach we would take would be close to
>>>>>>>> Vowpal Wabbit: not exactly "online", but rather "fast-batch".
>>>>>>>>
>>>>>>>> There will be problems popping up again, even for very simple
>>>>>>>> algos like online linear regression with SGD [1], but hopefully
>>>>>>>> fixing those will be more aligned with the priorities of the
>>>>>>>> community.
>>>>>>>>
>>>>>>>> @Katherin, my understanding is that, given the limited
>>>>>>>> resources, there is no development effort focused on batch
>>>>>>>> processing right now.
>>>>>>>>
>>>>>>>> So to summarize, it seems like there are people willing to work
>>>>>>>> on ML on Flink, but nobody is sure how to do it.
>>>>>>>> There are many directions we could take (batch, online, batch
>>>>>>>> on streaming), each with its own merits and downsides.
>>>>>>>>
>>>>>>>> If you want, we can start a design doc and move the
>>>>>>>> conversation there, come up with a roadmap, and start
>>>>>>>> implementing.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Theodore
>>>>>>>>
>>>>>>>> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
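
For a concrete picture of the pattern discussed in [1], here is a minimal
sketch of online linear regression with SGD over two connected streams,
one carrying training examples and one carrying queries. It assumes
Flink's 1.2-era Scala DataStream API; the types, data, and learning rate
are made up for illustration.

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
    import org.apache.flink.util.Collector

    // Hypothetical types, for illustration only.
    case class Example(features: Array[Double], label: Double)
    case class Query(features: Array[Double])

    // One operator holds the model: flatMap1 applies an SGD step per
    // training example, flatMap2 serves predictions from the current model.
    class OnlineSgd(learningRate: Double)
        extends CoFlatMapFunction[Example, Query, Double] {
      private var weights: Array[Double] = _

      override def flatMap1(ex: Example, out: Collector[Double]): Unit = {
        if (weights == null) weights = new Array[Double](ex.features.length)
        val err = dot(weights, ex.features) - ex.label
        for (i <- weights.indices)
          weights(i) -= learningRate * err * ex.features(i)
      }

      override def flatMap2(q: Query, out: Collector[Double]): Unit =
        if (weights != null) out.collect(dot(weights, q.features))

      private def dot(w: Array[Double], x: Array[Double]): Double =
        (w, x).zipped.map(_ * _).sum
    }

    object OnlineSgdSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // In practice these would be unbounded sources (e.g. Kafka).
        val examples = env.fromElements(
          Example(Array(1.0, 0.5), 1.0), Example(Array(0.2, 0.1), 0.0))
        val queries = env.fromElements(Query(Array(0.8, 0.4)))

        // Parallelism 1 keeps a single model instance; a real job would
        // partition the model and keep it in checkpointed state.
        examples.connect(queries)
          .flatMap(new OnlineSgd(0.01)).setParallelism(1)
          .print()

        env.execute("online-sgd-sketch")
      }
    }

The issue raised in [1] is visible even in this sketch: without
timestamps there is no ordering guarantee between the two streams, so a
query may be scored by a model that has not yet seen a concurrent
training example.
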
>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann
>>>>>>>> <m...@gaborhermann.com> wrote:
>>>>>>>>
>>>>>>>>> It's great to see so much activity in this discussion :)
>>>>>>>>> I'll try to add my thoughts.
>>>>>>>>>
>>>>>>>>> I think building a developer community (Till's 2nd point) can
>>>>>>>>> be slightly separated from what features we should aim for
>>>>>>>>> (the 1st point) and from showcasing (the 3rd point). Thanks
>>>>>>>>> Till for bringing up the ideas for restructuring; I'm sure
>>>>>>>>> we'll find a way to make the development process more dynamic.
>>>>>>>>> I'll try to address the rest here.
>>>>>>>>>
>>>>>>>>> It's hard to choose a direction between streaming and batch
>>>>>>>>> ML. As Theo has indicated, not much online ML is used in
>>>>>>>>> production, but Flink concentrates on streaming, so online ML
>>>>>>>>> would be a better fit for Flink. However, as most of you
>>>>>>>>> argued, there is a definite need for batch ML. But batch ML
>>>>>>>>> seems hard to achieve, because there are blocking issues with
>>>>>>>>> persisting, iteration paths, etc. So it's no good either way.
>>>>>>>>>
>>>>>>>>> I propose a seemingly crazy solution: what if we developed
>>>>>>>>> batch algorithms with the streaming API too? The batch API
>>>>>>>>> might seem clearly more suitable for ML algorithms, but there
>>>>>>>>> are a lot of benefits to this approach as well, so it's worth
>>>>>>>>> considering. Flink also has the high-level vision of
>>>>>>>>> "streaming for everything" that would clearly fit this case.
>>>>>>>>> What do you all think about this? Do you think this solution
>>>>>>>>> would be feasible? I would be happy to make a more elaborate
>>>>>>>>> proposal, but I'll push my main ideas here:
>>>>>>>>>
>>>>>>>>> 1) Simplifying by using one system
>>>>>>>>> It could simplify the work of both the users and the
>>>>>>>>> developers. One could execute training once, or execute it
>>>>>>>>> periodically, e.g. by using windows. Low-latency serving and
>>>>>>>>> training could be done in the same system. We could implement
>>>>>>>>> incremental algorithms without needing side inputs to combine
>>>>>>>>> online learning (or predictions) with batch learning. Of
>>>>>>>>> course, all the logic describing these must somehow be
>>>>>>>>> implemented (e.g. synchronizing predictions with training),
>>>>>>>>> but it should be easier to do so in one system than by
>>>>>>>>> combining e.g. the batch and streaming APIs.
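
A sketch of what point 1 above could look like in code: periodic
retraining with windows and low-latency serving in the same streaming
job. It assumes the Scala DataStream API; the Model type, the in-window
SGD sweep, and the window size are hypothetical placeholders for any
batch trainer.

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow
    import org.apache.flink.util.Collector

    case class Example(features: Array[Double], label: Double)
    case class Model(weights: Array[Double])

    // Serving side: scores incoming feature vectors with the latest model.
    class Scorer extends CoFlatMapFunction[Array[Double], Model, Double] {
      private var model: Model = _
      override def flatMap1(x: Array[Double], out: Collector[Double]): Unit =
        if (model != null)
          out.collect((model.weights, x).zipped.map(_ * _).sum)
      override def flatMap2(m: Model, out: Collector[Double]): Unit =
        model = m // swap in the freshly trained model
    }

    object PeriodicTrainingSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Unbounded sources (e.g. Kafka) in practice.
        val examples = env.fromElements(
          Example(Array(1.0, 0.5), 1.0), Example(Array(0.2, 0.1), 0.0))
        val events = env.fromElements(Array(0.8, 0.4))

        // Retrain on each 10-minute batch of examples; the SGD sweep
        // below stands in for an arbitrary batch trainer.
        val models = examples
          .timeWindowAll(Time.minutes(10))
          .apply { (_: TimeWindow, batch: Iterable[Example],
                    out: Collector[Model]) =>
            var w = new Array[Double](batch.head.features.length)
            for (ex <- batch) {
              val err = (w, ex.features).zipped.map(_ * _).sum - ex.label
              w = (w, ex.features).zipped.map((wi, xi) => wi - 0.1 * err * xi)
            }
            out.collect(Model(w))
          }

        // Ship each refitted model to every serving subtask: training
        // and serving live in one job, no second system involved.
        events.connect(models.broadcast).flatMap(new Scorer).print()

        env.execute("periodic-training-sketch")
      }
    }
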
>>>>>>>>> 2) Batch ML with the streaming API is not harder
>>>>>>>>> Despite these benefits, it might seem harder to implement
>>>>>>>>> batch ML with the streaming API, but in my opinion it's not.
>>>>>>>>> There is more flexible, lower-level optimization potential
>>>>>>>>> with the streaming API. Most distributed ML algorithms use a
>>>>>>>>> lower-level model than the batch API anyway, so sometimes it
>>>>>>>>> feels like forcing the algorithm logic into the batch API and
>>>>>>>>> tweaking it. Although we could not use batch primitives like
>>>>>>>>> join, we would gain that lower-level control in exchange.
>>>>>>>>> E.g. in my experience with implementing a distributed matrix
>>>>>>>>> factorization algorithm [1], I couldn't do a simple
>>>>>>>>> optimization because of the limitations of the iteration API
>>>>>>>>> [2]. Even if we pushed all the development effort into making
>>>>>>>>> the batch API more suitable for ML, there would be things we
>>>>>>>>> couldn't do. E.g. there are approaches for updating a model
>>>>>>>>> iteratively without locks [3,4] (i.e. somewhat
>>>>>>>>> asynchronously), and I don't see a clear way to implement
>>>>>>>>> such algorithms with the batch API.
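
The asynchrony in point 2 comes from the iterative DataStream API [5]:
feedback elements circulate without a superstep barrier, which is the
property that lock-free update schemes like [3,4] rely on. A toy sketch
of the loop shape, assuming the Scala DataStream API (it mirrors the
standard iterate example; it is not an actual Hogwild implementation):

    import org.apache.flink.streaming.api.scala._

    object AsyncIterationSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Toy stand-in for per-partition model updates.
        val initial: DataStream[Double] = env.fromElements(8.0, 4.0, 2.0)

        val converged = initial.iterate((updates: DataStream[Double]) => {
          // One refinement step; elements loop back immediately instead
          // of waiting for a global barrier, so partitions progress
          // asynchronously relative to each other.
          val refined = updates.map(_ * 0.5)
          val feedback = refined.filter(_ > 0.1)  // keep iterating
          val output = refined.filter(_ <= 0.1)   // emit once small enough
          (feedback, output)
        })

        converged.print()
        env.execute("async-iteration-sketch")
      }
    }
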
>>>>>>>>> 3) Streaming community (users and devs) benefit
>>>>>>>>> The Flink streaming community in general would also benefit
>>>>>>>>> from this direction. There are many features needed in the
>>>>>>>>> streaming API for ML to work, but this is also true for the
>>>>>>>>> batch API. One really important one is the loops API (a.k.a.
>>>>>>>>> iterative DataStreams) [5]. There has been a lot of effort
>>>>>>>>> (mostly from Paris) to make it mature enough [6]. Kate
>>>>>>>>> mentioned using GPUs, and I'm sure they have uses in streaming
>>>>>>>>> generally [7]. Thus, by improving the streaming API to allow
>>>>>>>>> ML algorithms, the streaming users benefit too (which is
>>>>>>>>> important, as there are a lot more production users of the
>>>>>>>>> streaming API than of the batch API).
>>>>>>>>>
>>>>>>>>> 4) Performance can be at least as good
>>>>>>>>> I believe the same performance could be achieved with the
>>>>>>>>> streaming API as with the batch API. The streaming API is much
>>>>>>>>> closer to the runtime than the batch API. For corner-cases
>>>>>>>>> with runtime-layer optimizations in the batch API, we could
>>>>>>>>> find a way to do the same (or a similar) optimization for the
>>>>>>>>> streaming API (see my previous point). Such a case could be
>>>>>>>>> using managed memory (and spilling to disk). There are also
>>>>>>>>> benefits by default, e.g. we would have finer-grained fault
>>>>>>>>> tolerance with the streaming API.
>>>>>>>>>
>>>>>>>>> 5) We could keep the batch ML API
>>>>>>>>> For the shorter term, we should not throw away all the
>>>>>>>>> algorithms implemented with the batch API. By pushing forward
>>>>>>>>> the development of side inputs, we could make them usable with
>>>>>>>>> the streaming API. Then, if the library gains some popularity,
>>>>>>>>> we could replace the batch API algorithms with streaming ones,
>>>>>>>>> to avoid the performance costs of e.g. not being able to
>>>>>>>>> persist.
>>>>>>>>>
>>>>>>>>> 6) General tools for implementing ML algorithms
>>>>>>>>> Besides implementing algorithms one by one, we could provide
>>>>>>>>> more general tools to make implementing algorithms easier,
>>>>>>>>> e.g. a parameter server [8,9]. Theo also mentioned in another
>>>>>>>>> thread that TensorFlow has a similar model to Flink streaming;
>>>>>>>>> we could look into that too. I think that when deploying a
>>>>>>>>> production ML system, much more configuration and tweaking is
>>>>>>>>> often needed than e.g. Spark MLlib allows. Why not allow that?
>>>>>>>>>
>>>>>>>>> 7) Showcasing
>>>>>>>>> Showcasing this could be easier. We could say that we're doing
>>>>>>>>> batch ML with a streaming API. That's interesting in its own
>>>>>>>>> right. IMHO this integration is also a more approachable way
>>>>>>>>> towards end-to-end ML.
>>>>>>>>>
>>>>>>>>> Thanks for reading so far :)
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>>>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
>>>>>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Gabor
>>>>
>>>> --
>>>> *Yours faithfully,*
>>>> *Kate Eri.*