Re: a typical ML algorithm flow

2016-03-31 Thread Till Rohrmann
I agree that Flink’s concept of the closed loop iteration does not translate so easily to a more general distributed linear algebra DSL such as Samsara. There one usually writes loops using the for and while primitives. Unfortunately, it is not so trivial to automatically translate a for loop into

Re: a typical ML algorithm flow

2016-03-29 Thread Dmitriy Lyubimov
BTW, thank you for educating me on this. I think it's actually a wonderful capability. Along with the capability of broadcasting distributed sets to map operators, it means (I hope) that the fine-grained, centralized scheduling and centralized broadcasting we find in analogous Spark algorithms could be

Re: a typical ML algorithm flow

2016-03-29 Thread Dmitriy Lyubimov
Thanks. Regardless of the rationale, I wanted to confirm whether the iteration is a lazily-evaluated-only construct. It sounds like eager evaluation (and collection) inside it is not possible, and algorithms that need it will just have to work around this. I think this answers my question -- thanks! -d On

Re: a typical ML algorithm flow

2016-03-29 Thread Theodore Vasiloudis
@Shannon What you are talking about is available for the DataSet API through the iterateWithTermination function. See the API docs and Iterations page
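For readers following the thread, here is a minimal sketch of the termination semantics being discussed, assuming the documented behaviour that a bulk iteration with a termination criterion stops once the termination data set is empty or the maximum number of iterations is reached. It is a plain-Python model, not Flink or PyFlink code; `iterate_with_termination`, `halve_until_small`, and the 0.1 tolerance are hypothetical names and values chosen for illustration.

```python
from typing import Callable, List, Tuple


def iterate_with_termination(
    initial: List[float],
    max_iterations: int,
    step: Callable[[List[float]], Tuple[List[float], List[float]]],
) -> Tuple[List[float], int]:
    """Plain-Python model of a bulk iteration with a termination criterion.

    `step` returns (next_data, termination_set). The loop stops once the
    termination set is empty or after `max_iterations` steps.
    """
    data = initial
    for iteration in range(1, max_iterations + 1):
        data, termination = step(data)
        if not termination:
            # An empty termination set signals convergence.
            return data, iteration
    return data, max_iterations


def halve_until_small(values: List[float]) -> Tuple[List[float], List[float]]:
    # Hypothetical step function: halve every value, then report which
    # values are still above the (assumed) tolerance of 0.1.
    halved = [v / 2.0 for v in values]
    still_large = [v for v in halved if abs(v) > 0.1]
    return halved, still_large


result, iterations = iterate_with_termination([1.0, 0.5], 100, halve_until_small)
```

A custom convergence check such as a squared-loss threshold would take the place of the `still_large` filter: emit a non-empty set while the loss is above the threshold and an empty one once it falls below.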

Re: a typical ML algorithm flow

2016-03-29 Thread Shannon Quinn
Apologies for hijacking, but this thread hits right at my last message to this list (looking to implement native iterations in the PyFlink API). I'm particularly interested in custom convergence criteria, often centered around measuring some sort of squared loss and checking if it falls below

Re: a typical ML algorithm flow

2016-03-29 Thread Till Rohrmann
Hi, Chiwan’s example is perfectly fine and it should also work with general EM algorithms. Moreover, it is the recommended way to implement iterations with Flink. The iterateWithTermination API call generates a lazily evaluated data flow with an iteration operator. This plan will only be execu

Re: a typical ML algorithm flow

2016-03-28 Thread Dmitriy Lyubimov
Thanks Chiwan. I think this example still creates a lazily evaluated plan. What if I need to collect statistics to the front end (and use them in the subsequent iteration's evaluation), as my example of computing column-wise averages suggests? The problem, generally, is: what if I need to eagerly evaluate the statis

Re: a typical ML algorithm flow

2016-03-27 Thread Chiwan Park
Hi Dmitriy, I think you can implement it with the iterative API and a custom convergence criterion. You can express the convergence criterion in two ways. One is using a convergence criterion data set [1][2] and the other is registering an aggregator with a custom implementation of `ConvergenceCrit

Re: a typical ML algorithm flow

2016-03-25 Thread Dmitriy Lyubimov
Thank you, all :) Yes, that's my question. How do we construct such a loop with a concrete example? Let's take something nonsensical yet specific. Say, in Samsara terms, we do something like this: var avg = Double.PositiveInfinity var drmA = ... (construct elsewhere) do { avg = drmA.colMe

Re: a typical ML algorithm flow

2016-03-23 Thread Theodore Vasiloudis
Just realized what I wrote is wrong and probably doesn't apply here. The problem I described relates to modifying a *secondary* dataset as you iterate over a primary one. Taking SGD as an example, you would iterate over a weights dataset, modifying it using the native Flink iterations that Till

Re: a typical ML algorithm flow

2016-03-23 Thread Till Rohrmann
Hi Dmitriy, I’m not sure whether I’ve understood your question correctly, so please correct me if I’m wrong. So you’re asking whether it is a problem that stat1 = A.map.reduce and A = A.update.map(stat1) are executed on the same input data set A and whether we have to cache A for that, right? I ass

Re: a typical ML algorithm flow

2016-03-23 Thread Theodore Vasiloudis
Hello Dmitriy, If I understood correctly, what you are basically talking about is modifying a DataSet as you iterate over it. AFAIK this is currently not possible in Flink, and indeed it's a real bottleneck for ML algorithms. This is the reason our current SGD implementation does a pass over the whol

a typical ML algorithm flow

2016-03-22 Thread Dmitriy Lyubimov
Hi, probably more of a question for Till: Imagine a common ML algorithm flow that runs until convergence. A typical distributed flow would be something like this (e.g. GMM EM would be exactly like that): A: input do { stat1 = A.map.reduce A = A.update-map(stat1) conv = A.map.reduce } u
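
The loop shape in the original question can be sketched as an eagerly evaluated, driver-side loop. This is only an illustration in plain Python with no Flink or Samsara involved; the statistic (column means, borrowed from the later example in the thread), the update rule (subtract half the column mean), and the names `column_means` and `run_until_converged` are assumptions made up for the sketch.

```python
from typing import List, Tuple


def column_means(rows: List[List[float]]) -> List[float]:
    # Stand-in for a distributed map + reduce over the rows of A.
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def run_until_converged(
    rows: List[List[float]], tol: float = 0.1, max_iterations: int = 100
) -> Tuple[List[List[float]], int]:
    a = rows
    iterations = 0
    while True:
        # stat1 = A.map.reduce (collected eagerly to the driver)
        stat1 = column_means(a)
        # A = A.update-map(stat1): hypothetical update, subtract half the mean
        a = [[x - 0.5 * m for x, m in zip(row, stat1)] for row in a]
        # conv = A.map.reduce: largest absolute column mean after the update
        conv = max(abs(m) for m in column_means(a))
        iterations += 1
        if conv < tol or iterations >= max_iterations:
            return a, iterations


final_rows, n_iterations = run_until_converged([[1.0, 2.0], [3.0, 6.0]])
```

Each pass needs `stat1` on the driver before the update can run, which is exactly the eager-evaluation dependency the rest of the thread discusses.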