@Gabor 3rd March is ok for me. But maybe giving it a bit more time, say a week, would suit more people. What do you think, all? I will contribute to the doc.
+100 for having a coordinator + committer. Thank you all for joining the
discussion.

Cheers,
Stavros

On Thu, Feb 23, 2017 at 4:48 PM, Gábor Hermann <m...@gaborhermann.com> wrote:

> Okay, I've created a skeleton of the design doc for choosing a direction:
> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit?usp=sharing
>
> Much of the pros/cons have already been discussed here, so I'll try to
> put all the arguments mentioned in this thread there. Feel free to add
> more :)
>
> @Stavros: I agree we should take action fast. What about collecting our
> thoughts in the doc by around Tuesday next week (28 February)? Then
> deciding on the direction and designing a roadmap by around Friday
> (3 March)? Is that feasible, or should it take more time?
>
> I think it will be necessary to have a shepherd, or even better a
> committer, involved in at least reviewing and accepting the roadmap. It
> would be best if a committer coordinated all this.
> @Theodore: Would you like to do the coordination?
>
> Regarding the use-cases: I've seen some abstracts of talks at SF Flink
> Forward [1] that seem promising. There are companies already using Flink
> for ML [2,3,4,5].
>
> [1] http://sf.flink-forward.org/program/sessions/
> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-streaming-vs-micro-batch-for-online-learning/
> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-learning-on-flink/
> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learning-scenarios-with-flink/
>
> Cheers,
> Gabor
>
>
> On 2017-02-23 15:19, Katherin Eri wrote:
>
>> I have already asked some teams for useful cases, but all of them need
>> time to think. During the analysis something will eventually arise.
>> Maybe we can also ask Flink's partners for cases? Data Artisans got the
>> results of a customer survey [1]; better ML support is wanted, so we
>> could ask what exactly is necessary.
>>
>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
>>
>> On 23 Feb 2017 at 4:32 PM, "Stavros Kontopoulos"
>> <st.kontopou...@gmail.com> wrote:
>>
>>> +100 for a design doc.
>>>
>>> Could we also set a roadmap after some time-boxed investigation,
>>> captured in that document? We need action.
>>>
>>> Looking forward to working on this (whatever that might be) ;) Also,
>>> is there any data supporting one direction or the other from a
>>> customer perspective? It would help us make more informed decisions.
>>>
>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri
>>> <katherinm...@gmail.com> wrote:
>>>
>>>> Yes, ok.
>>>> Let's start a design document and write down the already mentioned
>>>> ideas there: the parameter server, Clipper, and others. It would be
>>>> nice if we also mapped these approaches to use-cases.
>>>> We will work on each topic collaboratively; maybe we will finally
>>>> form a picture that can be agreed with the committers.
>>>> @Gabor, could you please start such a shared doc, as you have
>>>> already proposed several ideas?
>>>>
>>>> On Thu, 23 Feb 2017 at 15:06, Gábor Hermann <m...@gaborhermann.com>
>>>> wrote:
>>>>
>>>>> I agree that it's better to go in one direction first, but I think
>>>>> online and offline ML with the streaming API can go somewhat in
>>>>> parallel later. We could set a short-term goal, concentrate
>>>>> initially on one direction, and showcase that direction (e.g. in a
>>>>> blogpost).
>>>>> But first, we should list the pros/cons in a design doc as a
>>>>> minimum, then make a decision on which direction to go. Would that
>>>>> be feasible?
>>>>>
>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
>>>>>
>>>>>> I'm not sure that this is feasible; doing everything at the same
>>>>>> time could mean doing nothing((((
>>>>>> I'm just afraid that the words "we will work on streaming, not on
>>>>>> batching, we have no committer's time for this" mean that yes, we
>>>>>> started work on FLINK-1730, but nobody will commit this work in
>>>>>> the end, as has already happened with this ticket.
>>>>>>
>>>>>> On 23 Feb 2017 at 14:26, "Gábor Hermann" <m...@gaborhermann.com>
>>>>>> wrote:
>>>>>>
>>>>>>> @Theodore: Great to hear you think the "batch on streaming"
>>>>>>> approach is possible! Of course, we need to pay attention to all
>>>>>>> the pitfalls there if we go that way.
>>>>>>>
>>>>>>> +1 for a design doc!
>>>>>>>
>>>>>>> I would add that it's possible to make efforts in all three
>>>>>>> directions (i.e. batch, online, batch on streaming) at the same
>>>>>>> time, although it might be worth concentrating on one. E.g. it
>>>>>>> would not be so useful to have the same batch algorithms with
>>>>>>> both the batch API and the streaming API. We can decide later.
>>>>>>>
>>>>>>> The design doc could be partitioned into these 3 directions, and
>>>>>>> we can collect the pros/cons there too. What do you think?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Gabor
>>>>>>>
>>>>>>>
>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>>>>>>>
>>>>>>>> Hello all,
>>>>>>>>
>>>>>>>> @Gabor, we have discussed the idea of using the streaming API
>>>>>>>> to write all of our ML algorithms with a couple of people
>>>>>>>> offline, and I think it might be possible and is generally
>>>>>>>> worth a shot. The approach we would take would be close to
>>>>>>>> Vowpal Wabbit: not exactly "online", but rather "fast-batch".
>>>>>>>>
>>>>>>>> There will be problems popping up again, even for very simple
>>>>>>>> algos like online linear regression with SGD [1], but hopefully
>>>>>>>> fixing those will be more aligned with the priorities of the
>>>>>>>> community.
>>>>>>>>
>>>>>>>> @Katherin, my understanding is that, given the limited
>>>>>>>> resources, there is no development effort focused on batch
>>>>>>>> processing right now.
>>>>>>>>
>>>>>>>> So to summarize, it seems like there are people willing to work
>>>>>>>> on ML on Flink, but nobody is sure how to do it.
>>>>>>>> There are many directions we could take (batch, online, batch
>>>>>>>> on streaming), each with its own merits and downsides.
>>>>>>>>
>>>>>>>> If you want, we can start a design doc and move the
>>>>>>>> conversation there, come up with a roadmap, and start
>>>>>>>> implementing.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Theodore
>>>>>>>>
>>>>>>>> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
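
For a concrete picture of the pattern discussed in [1], here is a minimal
sketch of online linear regression with SGD over two connected streams,
one carrying training examples and one carrying queries. It assumes
Flink's 1.2-era Scala DataStream API; the types, data, and learning rate
are made up for illustration.

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
    import org.apache.flink.util.Collector

    // Hypothetical types, for illustration only.
    case class Example(features: Array[Double], label: Double)
    case class Query(features: Array[Double])

    // One operator holds the model: flatMap1 applies an SGD step per
    // training example, flatMap2 serves predictions from the current model.
    class OnlineSgd(learningRate: Double)
        extends CoFlatMapFunction[Example, Query, Double] {
      private var weights: Array[Double] = _

      override def flatMap1(ex: Example, out: Collector[Double]): Unit = {
        if (weights == null) weights = new Array[Double](ex.features.length)
        val err = dot(weights, ex.features) - ex.label
        for (i <- weights.indices)
          weights(i) -= learningRate * err * ex.features(i)
      }

      override def flatMap2(q: Query, out: Collector[Double]): Unit =
        if (weights != null) out.collect(dot(weights, q.features))

      private def dot(w: Array[Double], x: Array[Double]): Double =
        (w, x).zipped.map(_ * _).sum
    }

    object OnlineSgdSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // In practice these would be unbounded sources (e.g. Kafka).
        val examples = env.fromElements(
          Example(Array(1.0, 0.5), 1.0), Example(Array(0.2, 0.1), 0.0))
        val queries = env.fromElements(Query(Array(0.8, 0.4)))

        // Parallelism 1 keeps a single model instance; a real job would
        // partition the model and keep it in checkpointed state.
        examples.connect(queries)
          .flatMap(new OnlineSgd(0.01)).setParallelism(1)
          .print()

        env.execute("online-sgd-sketch")
      }
    }

The issue raised in [1] is visible even in this sketch: without
timestamps there is no ordering guarantee between the two streams, so a
query may be scored by a model that has not yet seen a concurrent
training example.
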
>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann
>>>>>>>> <m...@gaborhermann.com> wrote:
>>>>>>>>
>>>>>>>>> It's great to see so much activity in this discussion :)
>>>>>>>>> I'll try to add my thoughts.
>>>>>>>>>
>>>>>>>>> I think building a developer community (Till's 2nd point) can
>>>>>>>>> be slightly separated from what features we should aim for
>>>>>>>>> (the 1st point) and from showcasing (the 3rd point). Thanks
>>>>>>>>> Till for bringing up the ideas for restructuring; I'm sure
>>>>>>>>> we'll find a way to make the development process more dynamic.
>>>>>>>>> I'll try to address the rest here.
>>>>>>>>>
>>>>>>>>> It's hard to choose a direction between streaming and batch
>>>>>>>>> ML. As Theo has indicated, not much online ML is used in
>>>>>>>>> production, but Flink concentrates on streaming, so online ML
>>>>>>>>> would be a better fit for Flink. However, as most of you
>>>>>>>>> argued, there is a definite need for batch ML. But batch ML
>>>>>>>>> seems hard to achieve, because there are blocking issues with
>>>>>>>>> persisting, iteration paths, etc. So it's no good either way.
>>>>>>>>>
>>>>>>>>> I propose a seemingly crazy solution: what if we developed
>>>>>>>>> batch algorithms with the streaming API too? The batch API
>>>>>>>>> might seem clearly more suitable for ML algorithms, but there
>>>>>>>>> are a lot of benefits to this approach as well, so it's worth
>>>>>>>>> considering. Flink also has the high-level vision of
>>>>>>>>> "streaming for everything" that would clearly fit this case.
>>>>>>>>> What do you all think about this? Do you think this solution
>>>>>>>>> would be feasible? I would be happy to make a more elaborate
>>>>>>>>> proposal, but I'll push my main ideas here:
>>>>>>>>>
>>>>>>>>> 1) Simplifying by using one system
>>>>>>>>> It could simplify the work of both the users and the
>>>>>>>>> developers. One could execute training once, or execute it
>>>>>>>>> periodically, e.g. by using windows. Low-latency serving and
>>>>>>>>> training could be done in the same system. We could implement
>>>>>>>>> incremental algorithms without needing side inputs to combine
>>>>>>>>> online learning (or predictions) with batch learning. Of
>>>>>>>>> course, all the logic describing these must somehow be
>>>>>>>>> implemented (e.g. synchronizing predictions with training),
>>>>>>>>> but it should be easier to do so in one system than by
>>>>>>>>> combining e.g. the batch and streaming APIs.
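
A sketch of what point 1 above could look like in code: periodic
retraining with windows and low-latency serving in the same streaming
job. It assumes the Scala DataStream API; the Model type, the in-window
SGD sweep, and the window size are hypothetical placeholders for any
batch trainer.

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow
    import org.apache.flink.util.Collector

    case class Example(features: Array[Double], label: Double)
    case class Model(weights: Array[Double])

    // Serving side: scores incoming feature vectors with the latest model.
    class Scorer extends CoFlatMapFunction[Array[Double], Model, Double] {
      private var model: Model = _
      override def flatMap1(x: Array[Double], out: Collector[Double]): Unit =
        if (model != null)
          out.collect((model.weights, x).zipped.map(_ * _).sum)
      override def flatMap2(m: Model, out: Collector[Double]): Unit =
        model = m // swap in the freshly trained model
    }

    object PeriodicTrainingSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Unbounded sources (e.g. Kafka) in practice.
        val examples = env.fromElements(
          Example(Array(1.0, 0.5), 1.0), Example(Array(0.2, 0.1), 0.0))
        val events = env.fromElements(Array(0.8, 0.4))

        // Retrain on each 10-minute batch of examples; the SGD sweep
        // below stands in for an arbitrary batch trainer.
        val models = examples
          .timeWindowAll(Time.minutes(10))
          .apply { (_: TimeWindow, batch: Iterable[Example],
                    out: Collector[Model]) =>
            var w = new Array[Double](batch.head.features.length)
            for (ex <- batch) {
              val err = (w, ex.features).zipped.map(_ * _).sum - ex.label
              w = (w, ex.features).zipped.map((wi, xi) => wi - 0.1 * err * xi)
            }
            out.collect(Model(w))
          }

        // Ship each refitted model to every serving subtask: training
        // and serving live in one job, no second system involved.
        events.connect(models.broadcast).flatMap(new Scorer).print()

        env.execute("periodic-training-sketch")
      }
    }
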
>>>>>>>>> 2) Batch ML with the streaming API is not harder
>>>>>>>>> Despite these benefits, it might seem harder to implement
>>>>>>>>> batch ML with the streaming API, but in my opinion it's not.
>>>>>>>>> There is more flexible, lower-level optimization potential
>>>>>>>>> with the streaming API. Most distributed ML algorithms use a
>>>>>>>>> lower-level model than the batch API anyway, so sometimes it
>>>>>>>>> feels like forcing the algorithm logic into the batch API and
>>>>>>>>> tweaking it. Although we could not use batch primitives like
>>>>>>>>> join, we would gain that lower-level control in exchange.
>>>>>>>>> E.g. in my experience with implementing a distributed matrix
>>>>>>>>> factorization algorithm [1], I couldn't do a simple
>>>>>>>>> optimization because of the limitations of the iteration API
>>>>>>>>> [2]. Even if we pushed all the development effort into making
>>>>>>>>> the batch API more suitable for ML, there would be things we
>>>>>>>>> couldn't do. E.g. there are approaches for updating a model
>>>>>>>>> iteratively without locks [3,4] (i.e. somewhat
>>>>>>>>> asynchronously), and I don't see a clear way to implement
>>>>>>>>> such algorithms with the batch API.
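
The asynchrony in point 2 comes from the iterative DataStream API [5]:
feedback elements circulate without a superstep barrier, which is the
property that lock-free update schemes like [3,4] rely on. A toy sketch
of the loop shape, assuming the Scala DataStream API (it mirrors the
standard iterate example; it is not an actual Hogwild implementation):

    import org.apache.flink.streaming.api.scala._

    object AsyncIterationSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Toy stand-in for per-partition model updates.
        val initial: DataStream[Double] = env.fromElements(8.0, 4.0, 2.0)

        val converged = initial.iterate((updates: DataStream[Double]) => {
          // One refinement step; elements loop back immediately instead
          // of waiting for a global barrier, so partitions progress
          // asynchronously relative to each other.
          val refined = updates.map(_ * 0.5)
          val feedback = refined.filter(_ > 0.1)  // keep iterating
          val output = refined.filter(_ <= 0.1)   // emit once small enough
          (feedback, output)
        })

        converged.print()
        env.execute("async-iteration-sketch")
      }
    }
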
>>>>>>>>> 3) Streaming community (users and devs) benefit
>>>>>>>>> The Flink streaming community in general would also benefit
>>>>>>>>> from this direction. There are many features needed in the
>>>>>>>>> streaming API for ML to work, but this is also true for the
>>>>>>>>> batch API. One really important one is the loops API (a.k.a.
>>>>>>>>> iterative DataStreams) [5]. There has been a lot of effort
>>>>>>>>> (mostly from Paris) to make it mature enough [6]. Kate
>>>>>>>>> mentioned using GPUs, and I'm sure they have uses in streaming
>>>>>>>>> generally [7]. Thus, by improving the streaming API to allow
>>>>>>>>> ML algorithms, the streaming users benefit too (which is
>>>>>>>>> important, as there are a lot more production users of the
>>>>>>>>> streaming API than of the batch API).
>>>>>>>>>
>>>>>>>>> 4) Performance can be at least as good
>>>>>>>>> I believe the same performance could be achieved with the
>>>>>>>>> streaming API as with the batch API. The streaming API is much
>>>>>>>>> closer to the runtime than the batch API. For corner-cases
>>>>>>>>> with runtime-layer optimizations in the batch API, we could
>>>>>>>>> find a way to do the same (or a similar) optimization for the
>>>>>>>>> streaming API (see my previous point). Such a case could be
>>>>>>>>> using managed memory (and spilling to disk). There are also
>>>>>>>>> benefits by default, e.g. we would have finer-grained fault
>>>>>>>>> tolerance with the streaming API.
>>>>>>>>>
>>>>>>>>> 5) We could keep the batch ML API
>>>>>>>>> For the shorter term, we should not throw away all the
>>>>>>>>> algorithms implemented with the batch API. By pushing forward
>>>>>>>>> the development of side inputs, we could make them usable with
>>>>>>>>> the streaming API. Then, if the library gains some popularity,
>>>>>>>>> we could replace the batch API algorithms with streaming ones,
>>>>>>>>> to avoid the performance costs of e.g. not being able to
>>>>>>>>> persist.
>>>>>>>>>
>>>>>>>>> 6) General tools for implementing ML algorithms
>>>>>>>>> Besides implementing algorithms one by one, we could provide
>>>>>>>>> more general tools to make implementing algorithms easier,
>>>>>>>>> e.g. a parameter server [8,9]. Theo also mentioned in another
>>>>>>>>> thread that TensorFlow has a similar model to Flink streaming;
>>>>>>>>> we could look into that too. I think that when deploying a
>>>>>>>>> production ML system, much more configuration and tweaking is
>>>>>>>>> often needed than e.g. Spark MLlib allows. Why not allow that?
>>>>>>>>>
>>>>>>>>> 7) Showcasing
>>>>>>>>> Showcasing this could be easier. We could say that we're doing
>>>>>>>>> batch ML with a streaming API. That's interesting in its own
>>>>>>>>> right. IMHO this integration is also a more approachable way
>>>>>>>>> towards end-to-end ML.
>>>>>>>>>
>>>>>>>>> Thanks for reading so far :)
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>>>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>>>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
>>>>>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>>>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Gabor
>>>>
>>>> --
>>>> *Yours faithfully,*
>>>> *Kate Eri.*