Interesting, thanks @Roberto. I see that only the TAP Analytics Toolkit supports streaming. I am not aware of its market share; anyone?
Best,
Stavros

On Fri, Mar 3, 2017 at 11:50 AM, Theodore Vasiloudis <theodoros.vasilou...@gmail.com> wrote:

Thank you for the links Roberto, I did not know that Beam was working on an ML abstraction as well. I'm sure we can learn from that.

I'll start another thread today where we can discuss next steps and action points, now that we have a few different paths to follow listed in the shared doc, since our deadline was today. We welcome further discussions of course.

Regards,
Theodore

On Thu, Mar 2, 2017 at 10:52 AM, Roberto Bentivoglio <roberto.bentivog...@radicalbit.io> wrote:

Hi All,

First of all I'd like to introduce myself: my name is Roberto Bentivoglio and I'm currently working for Radicalbit, like Andrea Spina (who already wrote on this thread).
I haven't had the chance to contribute directly to Flink so far, but some colleagues of mine have been doing so for at least a year (they have also contributed to the machine learning library).

I hope I'm not jumping into the discussion too late; it's really interesting, and the analysis document depicts the currently available scenarios really well. Many thanks for your effort!

If I can add my two cents to the discussion, I'd like to add the following:
- It's clear that the Flink community is currently focused more deeply on streaming features than on batch features. For this reason I think that implementing "Offline learning with the Streaming API" is a really great idea.
- I think that the "Online learning" option is a really good fit for Flink, but maybe at the beginning we could give a higher priority to the "Offline learning with the Streaming API" option. I think online learning will be the main goal for the mid/long term, however.
- We implemented a library called "flink-jpmml", based on jpmml-evaluator [1] and Flink. Using this library you can train models on external systems and, after exporting them in the standard PMML format, use them to run evaluations on top of the DataStream API. We haven't open sourced this library yet, but we're planning to do so in the next few weeks; we'd like to complete the documentation and the final code reviews before sharing it. I hope it will help the community enhance the ML support in Flink.
- I'd also like to mention that the Apache Beam community is thinking about an ML DSL. There is a design document and a couple of Jira tasks for that [2][3].

At Radicalbit we're really keen to focus our effort on improving the ML support in Flink, and our team will certainly contribute to this effort on a regular basis.

Looking forward to working with you!

Many thanks,
Roberto

[1] https://github.com/jpmml/jpmml-evaluator
[2] https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA
[3] https://issues.apache.org/jira/browse/BEAM-303

--
Roberto Bentivoglio
CTO
e. roberto.bentivog...@radicalbit.io
Radicalbit S.r.l.
radicalbit.io
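Since flink-jpmml isn't public yet, here is a rough sketch of the pattern Roberto describes: scoring an externally trained, PMML-exported model inside a RichMapFunction, using jpmml-evaluator [1] directly. This is only a sketch (Scala DataStream API; jpmml-evaluator 1.3.x-era calls assumed), and the actual flink-jpmml API will certainly look different:

  import java.io.FileInputStream
  import scala.collection.JavaConverters._
  import org.apache.flink.api.common.functions.RichMapFunction
  import org.apache.flink.configuration.Configuration
  import org.apache.flink.streaming.api.scala._
  import org.jpmml.evaluator.{Evaluator, EvaluatorUtil, ModelEvaluatorFactory}
  import org.jpmml.model.PMMLUtil

  // Scores records against a model that was trained elsewhere and exported as PMML.
  class PmmlScorer(modelPath: String) extends RichMapFunction[Map[String, AnyRef], AnyRef] {
    @transient private var evaluator: Evaluator = _

    override def open(parameters: Configuration): Unit = {
      // Parse the model and build the evaluator once per task instance, not per record.
      val pmml = PMMLUtil.unmarshal(new FileInputStream(modelPath))
      evaluator = ModelEvaluatorFactory.newInstance().newModelEvaluator(pmml)
    }

    override def map(raw: Map[String, AnyRef]): AnyRef = {
      // Let jpmml validate and convert every raw field against the model schema.
      val args = evaluator.getInputFields.asScala.map { f =>
        f.getName -> f.prepare(raw(f.getName.getValue))
      }.toMap
      val results = evaluator.evaluate(args.asJava)
      // Decode the model's first target field into a plain value.
      EvaluatorUtil.decode(results.get(evaluator.getTargetFields.get(0).getName))
    }
  }

  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val features: DataStream[Map[String, AnyRef]] = ??? // e.g. parsed events from Kafka
  val predictions = features.map(new PmmlScorer("/path/to/model.pmml"))

The point is that no batch API is involved: training stays in the external system, and the streaming job only evaluates.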
On 28 February 2017 at 19:35, Gábor Hermann <m...@gaborhermann.com> wrote:

Hi Philipp,

It's great to hear you are interested in Flink ML!

Based on your description, your prototype seems like an interesting approach for combining online and offline learning. If you're interested, we might find a way to integrate your work, or at least your ideas, into Flink ML if we decide on a direction that fits your approach. I think your work could be relevant for almost all the directions listed there (if I understand correctly, you'd even like to serve predictions on unlabeled data).

Feel free to join the discussion in the docs you've mentioned :)

Cheers,
Gabor

On 2017-02-27 18:39, Philipp Zehnder wrote:

Hello all,

I'm new to this mailing list and I wanted to introduce myself. My name is Philipp Zehnder and I'm a Master's student in Computer Science at the Karlsruhe Institute of Technology in Germany, currently writing my master's thesis, whose main goal is to integrate reusable machine learning components into a stream processing network. One part of my thesis is to create an API for distributed online machine learning.

I saw that there are some recent discussions on how to continue the development of Flink ML [1], and I want to share some of my experiences and maybe get some feedback from the community on my ideas.

As I am new to open source projects, I hope this is the right place for this.

In the beginning, I had a look at different existing frameworks such as Apache SAMOA, which is great and has a lot of useful resources. However, as Flink is currently focusing on streaming, from my point of view it makes sense to also have a streaming machine learning API as part of the Flink ecosystem.

I'm currently working on a prototype for a distributed streaming machine learning library based on Flink that can be used for online and "classical" offline learning.

The machine learning algorithm takes labeled and unlabeled data. On a labeled data point, first a prediction is performed, and then the label is used to train the model. On an unlabeled data point, just a prediction is performed. The main difference between the online and offline algorithms is that in the offline case the labeled data must be handed to the model before the unlabeled data. In the online case, it is still possible to process labeled data at a later point to update the model. The advantage of this approach is that batch algorithms can be applied to streaming data, and online algorithms can be supported as well.

One difference from batch learning is the transformers that are used to preprocess the data. For example, a simple mean subtraction must be implemented with a rolling mean, because we can't calculate the mean over all the data, but the Flink Streaming API is perfect for that. It would be useful for users to have an extensible toolbox of transformers.

Another difference is the evaluation of the models. We don't have a single value that determines the model quality; in streaming scenarios this value evolves over time as the model sees more labeled data.

However, transformation and evaluation again work similarly in both online learning and offline learning.
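A minimal sketch of the rolling-mean transformer described above, assuming the Scala DataStream API. It is deliberately a single non-parallel instance with plain (unmanaged, non-checkpointed) fields; partitioning the state and making it fault tolerant is where the real work would start:

  import org.apache.flink.api.common.functions.MapFunction
  import org.apache.flink.streaming.api.scala._

  // Streaming "mean subtraction": the mean over all data is unknown up front,
  // so maintain an incremental per-feature mean and emit centered vectors.
  class RollingMeanCenter(dim: Int) extends MapFunction[Array[Double], Array[Double]] {
    private var n = 0L
    private val mean = Array.fill(dim)(0.0)

    override def map(x: Array[Double]): Array[Double] = {
      n += 1
      for (i <- 0 until dim) mean(i) += (x(i) - mean(i)) / n // incremental mean update
      Array.tabulate(dim)(i => x(i) - mean(i))
    }
  }

  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val raw = env.fromElements(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 0.0))
  raw.map(new RollingMeanCenter(dim = 2)).setParallelism(1).print()
  env.execute("rolling-mean centering")

A keyed variant would keep a per-key mean in Flink's managed state instead of plain fields.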
I also liked the discussion in [2], and I think that the competition in the batch learning field is hard and there are already a lot of great projects. It is true that in most real-world problems it is not necessary to update the model immediately, but there are a lot of use cases for machine learning on streams, and for them it would be nice to have a native approach.

A streaming machine learning API would fit Flink very well, and I would also be willing to contribute to the future development of the Flink ML library.

Best regards,

Philipp

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-td16040.html
[2] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2

On 23.02.2017 at 15:48, Gábor Hermann <m...@gaborhermann.com> wrote:

Okay, I've created a skeleton of the design doc for choosing a direction:
https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit?usp=sharing

Much of the pros/cons have already been discussed here, so I'll try to put all the arguments mentioned in this thread there. Feel free to add more :)

@Stavros: I agree we should take action fast. What about collecting our thoughts in the doc by around Tuesday next week (28 February)? Then deciding on the direction and designing a roadmap by around Friday (3 March)? Is that feasible, or should it take more time?

I think it will be necessary to have a shepherd, or even better a committer, involved in at least reviewing and accepting the roadmap. It would be best if a committer coordinated all this.
@Theodore: Would you like to do the coordination?

Regarding the use cases: I've seen some abstracts of talks at SF Flink Forward [1] that seem promising. There are companies already using Flink for ML [2,3,4,5].

[1] http://sf.flink-forward.org/program/sessions/
[2] http://sf.flink-forward.org/kb_sessions/experiences-with-streaming-vs-micro-batch-for-online-learning/
[3] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
[4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-learning-on-flink/
[5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learning-scenarios-with-flink/

Cheers,
Gabor

On 2017-02-23 15:19, Katherin Eri wrote:

I have already asked some teams for useful cases, but all of them need time to think; during the analysis something will eventually arise. Maybe we can ask Flink's partners for cases? Data Artisans has the results of a user survey [1]: better ML support is wanted, so we could ask what exactly is necessary.

[1] http://data-artisans.com/flink-user-survey-2016-part-2/

On 23 Feb 2017 at 4:32 PM, Stavros Kontopoulos <st.kontopou...@gmail.com> wrote:

+100 for a design doc.
Could we also set a roadmap after some time-boxed investigation captured in that document? We need action.

Looking forward to working on this (whatever that might be) ;) Also, is there any data supporting one direction or the other from a customer perspective? It would help us make more informed decisions.

On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <katherinm...@gmail.com> wrote:

Yes, ok. Let's start a design document and write down the ideas already mentioned there: the parameter server, Clipper, and others. It would be nice if we also mapped these approaches to cases. We will work on each topic collaboratively; maybe we will finally form some picture that can be agreed with the committers.
@Gabor, could you please start such a shared doc, as you have already proposed several ideas?

Yours faithfully,
Kate Eri.

On Thu, 23 Feb 2017 at 15:06, Gábor Hermann <m...@gaborhermann.com> wrote:

I agree that it's better to go in one direction first, but I think online and offline with the streaming API can go somewhat in parallel later. We could set a short-term goal, concentrate initially on one direction, and showcase that direction (e.g. in a blog post). But first, we should list the pros/cons in a design doc as a minimum, then make a decision on which direction to go. Would that be feasible?

On 2017-02-23 12:34, Katherin Eri wrote:

I'm not sure that this is feasible; doing everything at the same time could mean doing nothing(((( I'm just afraid that the words "we will work on streaming, not on batching; we have no committer's time for this" mean that yes, we started work on FLINK-1730, but nobody will commit this work in the end, as already happened with this ticket.

On 23 Feb 2017 at 14:26, Gábor Hermann <m...@gaborhermann.com> wrote:

@Theodore: Great to hear you think the "batch on streaming" approach is possible! Of course, we need to pay attention to all the pitfalls there if we go that way.

+1 for a design doc!

I would add that it's possible to make efforts in all three directions (i.e. batch, online, batch on streaming) at the same time, although it might be worth concentrating on one. E.g. it would not be so useful to have the same batch algorithms with both the batch API and the streaming API. We can decide later.
The design doc could be partitioned into these 3 directions, and we can collect the pros/cons there too. What do you think?

Cheers,
Gabor

On 2017-02-23 12:13, Theodore Vasiloudis wrote:

Hello all,

@Gabor, we have discussed the idea of using the streaming API to write all of our ML algorithms with a couple of people offline, and I think it might be possible and is generally worth a shot. The approach we would take would be close to Vowpal Wabbit: not exactly "online", but rather "fast-batch".

There will be problems popping up again, even for very simple algos like online linear regression with SGD [1], but hopefully fixing those will be more aligned with the priorities of the community.

@Katherin, my understanding is that, given the limited resources, there is no development effort focused on batch processing right now.

So to summarize, it seems like there are people willing to work on ML on Flink, but nobody is sure how to do it. There are many directions we could take (batch, online, batch on streaming), each with its own merits and downsides.

If you want, we can start a design doc, move the conversation there, come up with a roadmap, and start implementing.

Regards,
Theodore

[1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
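For concreteness, a minimal sketch of such an online linear regression with SGD, written as a CoFlatMapFunction over connected training and query streams (Scala DataStream API assumed; a single model instance with parallelism 1 and no fault tolerance, so a sketch rather than a proposal). The problem from [1] is visible immediately: nothing orders the two inputs, so a query may be served before the training events one would expect to precede it:

  import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.util.Collector

  // Online linear regression via SGD. Input 1 carries labeled examples
  // (features, label) used for training; input 2 carries unlabeled feature
  // vectors that are answered with the current model.
  class OnlineSgd(lr: Double, dim: Int)
      extends CoFlatMapFunction[(Array[Double], Double), Array[Double], Double] {

    private val w = Array.fill(dim)(0.0)

    private def predict(x: Array[Double]): Double =
      w.indices.map(i => w(i) * x(i)).sum

    // Labeled example: take one gradient step on the squared error.
    override def flatMap1(example: (Array[Double], Double), out: Collector[Double]): Unit = {
      val (x, y) = example
      val err = predict(x) - y
      for (i <- w.indices) w(i) -= lr * err * x(i)
    }

    // Unlabeled point: serve a prediction with whatever the model is right now.
    override def flatMap2(x: Array[Double], out: Collector[Double]): Unit =
      out.collect(predict(x))
  }

  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val training = env.fromElements((Array(1.0, 0.0), 1.0), (Array(0.0, 1.0), 0.0))
  val queries  = env.fromElements(Array(1.0, 1.0), Array(2.0, 0.0))

  training.connect(queries)
    .flatMap(new OnlineSgd(lr = 0.1, dim = 2))
    .setParallelism(1) // one model instance; distributing the model is the hard part
    .print()
  env.execute("online SGD sketch")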
On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <m...@gaborhermann.com> wrote:

It's great to see so much activity in this discussion :)
I'll try to add my thoughts.

I think building a developer community (Till's point 2) can be slightly separated from what features we should aim for (point 1) and showcasing (point 3). Thanks Till for bringing up the ideas for restructuring; I'm sure we'll find a way to make the development process more dynamic. I'll try to address the rest here.

It's hard to choose a direction between streaming and batch ML. As Theo has indicated, not much online ML is used in production, but Flink concentrates on streaming, so online ML would be a better fit for Flink. However, as most of you argued, there's a definite need for batch ML. But batch ML seems hard to achieve because there are blocking issues with persisting, iteration paths, etc. So it's no good either way.

I propose a seemingly crazy solution: what if we developed batch algorithms with the streaming API as well? The batch API would clearly seem more suitable for ML algorithms, but there are a lot of benefits to this approach too, so it's clearly worth considering. Flink also has the high-level vision of "streaming for everything", which this would clearly fit. What do you all think about this? Do you think this solution would be feasible? I would be happy to make a more elaborate proposal, but I'll push my main ideas here:

1) Simplifying by using one system
It could simplify the work of both users and developers. One could execute training once, or execute it periodically, e.g. by using windows (sketched below). Low-latency serving and training could be done in the same system. We could implement incremental algorithms, without any side inputs, for combining online learning (or predictions) with batch learning. Of course, all the logic describing these must be implemented somehow (e.g. synchronizing predictions with training), but it should be easier to do in one system than by combining e.g. the batch and streaming APIs.
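A minimal sketch of that periodic, window-based training, again assuming the Scala DataStream API: collect labeled (x, y) pairs into tumbling processing-time windows, fit a least-squares line per window, and emit the coefficients as a stream of models that a serving operator downstream could pick up:

  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.streaming.api.windowing.time.Time

  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val labeled: DataStream[(Double, Double)] = ??? // (x, y) training pairs

  // Every 10 minutes, re-fit y = slope * x + intercept on that window's data
  // and emit (slope, intercept) as the new model.
  val models = labeled
    .timeWindowAll(Time.minutes(10))
    .apply[(Double, Double)] { (_, examples, out) =>
      val n = examples.size
      val mx = examples.map(_._1).sum / n
      val my = examples.map(_._2).sum / n
      val cov  = examples.map { case (x, y) => (x - mx) * (y - my) }.sum
      val varX = examples.map { case (x, _) => (x - mx) * (x - mx) }.sum
      if (varX > 0) out.collect((cov / varX, my - (cov / varX) * mx))
    }

A serving operator could then consume this model stream on one input (like the CoFlatMap sketch above) and answer queries with the most recent model.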
2) Batch ML with the streaming API is not harder
Despite these benefits, it could seem harder to implement batch ML with the streaming API, but in my opinion it's not. There are more flexible, lower-level optimization opportunities with the streaming API. Most distributed ML algorithms use a lower-level model than the batch API anyway, so sometimes it feels like forcing the algorithm logic into the training API and tweaking it. Although we could not use batch primitives like join, we would have that lower-level flexibility. E.g. in my experience with implementing a distributed matrix factorization algorithm [1], I couldn't do a simple optimization because of the limitations of the iteration API [2]. Even if we pushed all the development effort into making the batch API more suitable for ML, there would be things we couldn't do. E.g. there are approaches for updating a model iteratively without locks [3,4] (i.e. somewhat asynchronously), and I don't see a clear way to implement such algorithms with the batch API.

3) Streaming community (users and devs) benefit
The Flink streaming community in general would also benefit from this direction. There are many features needed in the streaming API for ML to work, but this is also true for the batch API. One really important one is the loops API (a.k.a. iterative DataStreams) [5]. There has been a lot of effort (mostly from Paris) to make it mature enough [6]. Kate mentioned using GPUs, and I'm sure they have uses in streaming generally [7]. Thus, by improving the streaming API to allow ML algorithms, the streaming API benefits too (which is important, as it has a lot more production users than the batch API).

4) Performance can be at least as good
I believe the same performance could be achieved with the streaming API as with the batch API. The streaming API is much closer to the runtime than the batch API. For corner cases covered by runtime-layer optimizations of the batch API, we could find a way to do the same (or a similar) optimization for the streaming API (see my previous point). Such a case could be using managed memory (and spilling to disk). There are also benefits by default, e.g. we would have finer-grained fault tolerance with the streaming API.

5) We could keep the batch ML API
For the shorter term, we should not throw away all the algorithms implemented with the batch API.
By pushing forward the development of side inputs, we could make them usable with the streaming API. Then, if the library gains some popularity, we could replace the algorithms in the batch API with streaming ones, to avoid the performance costs of e.g. not being able to persist.

6) General tools for implementing ML algorithms
Besides implementing algorithms one by one, we could provide more general tools that make it easier to implement algorithms, e.g. a parameter server [8,9] (see the sketch after this message). Theo also mentioned in another thread that TensorFlow has a similar model to Flink streaming; we could look into that too. I think that when deploying a production ML system, much more configuration and tweaking often needs to be done than e.g. Spark MLlib allows. Why not allow that?

7) Showcasing
Showcasing this could be easier. We could say that we're doing batch ML with a streaming API; that's interesting in its own right. IMHO this integration is also a more approachable way towards end-to-end ML.

Thanks for reading so far :)

[1] https://github.com/apache/flink/pull/2819
[2] https://issues.apache.org/jira/browse/FLINK-2396
[3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
[4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
[5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
[6] https://github.com/apache/flink/pull/1668
[7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
[8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
[9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html

Cheers,
Gabor
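As a rough illustration of point 6, a sketch of the queryable-state flavor of a parameter server discussed in [9], assuming Flink 1.2-era APIs (all names here are made up for the example): model shards live in keyed state, incoming gradient updates are applied per shard, and external readers can fetch shards through queryable state:

  import org.apache.flink.api.common.functions.RichFlatMapFunction
  import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
  import org.apache.flink.configuration.Configuration
  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.util.Collector

  // A gradient (or delta) addressed to one shard of the model parameters.
  case class ShardUpdate(shardId: Int, delta: Array[Double])

  class ParameterShard extends RichFlatMapFunction[ShardUpdate, (Int, Array[Double])] {
    @transient private var params: ValueState[Array[Double]] = _

    override def open(conf: Configuration): Unit = {
      val desc = new ValueStateDescriptor("params", classOf[Array[Double]])
      desc.setQueryable("model-params") // external clients can read shards by key
      params = getRuntimeContext.getState(desc)
    }

    override def flatMap(u: ShardUpdate, out: Collector[(Int, Array[Double])]): Unit = {
      // Apply the update to this shard's current value (a zero vector initially).
      val cur = Option(params.value()).getOrElse(Array.fill(u.delta.length)(0.0))
      for (i <- cur.indices) cur(i) += u.delta(i)
      params.update(cur)
      out.collect((u.shardId, cur)) // e.g. feed fresh parameters back to workers
    }
  }

  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val updates: DataStream[ShardUpdate] = ??? // gradients produced by training operators
  val freshParams = updates.keyBy(_.shardId).flatMap(new ParameterShard)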