Thank you the answers, folks. Can anyone provide me a link for any implementation of an ML algorithm on Flink?
On Thu, Dec 25, 2014 at 8:07 PM, Gyula Fóra <[email protected]> wrote: > Hey, > > 1-2. As for failure recovery, there is a difference how the Flink batch > and streaming programs handle failures. The failed parts of the batch jobs > currently restart upon failures but there is an ongoing effort on fine > grained fault tolerance which is somewhat similar to sparks lineage > tracking. (so technically this is exactly once semantic but that is > somewhat meaningless for batch jobs) > > For streaming programs we are currently working on fault tolerance, we are > hoping to support at least once processing guarantees in the 0.9 release. > After that we will focus our research efforts on an high performance > implementation of exactly once processing semantics, which is still a hard > topic in streaming systems. Storm's trident's exaclty once semantics can > only provide very low throughput while we are trying hard to avoid this > issue, as our streaming system is capable of much higher throughput than > storm in general as you can see on some perf measurements. > > 3. There are already many ml algorithms implemented for Flink but they are > scattered all around. We are planning to collect them in a machine learning > library soon. We are also implementing an adapter for Samoa which will > provide some streaming machine learning algorithms as well. Samoa > integration should be ready in January. > > 4. Flink carefully manages its memory use to avoid heap errors, and > utilizing memory space as effectively as it can. The optimizer for batch > programs also takes care of a lot of optimization steps that the user would > manually have to do in other system, like optimizing the order of > transformations etc. There are of course parts of the program that still > needs to modified for maximal performance, for example parallelism settings > for some operators in some cases. > > 5. As for the status of the Python API I personally cannot say very much, > maybe someone can jump in and help me with that question :) > > Regards, > Gyula > > On Thu, Dec 25, 2014 at 11:58 AM, Samarth Mailinglist < > [email protected]> wrote: > >> Thank you for your answer. I have a couple of follow up questions: >> 1. Does it support 'exactly once semantics' that Spark and Storm support? >> 2. (Related to 1) What happens when an error occurs during processing? >> 3. Is there a plan for adding Machine Learning support on top of Flink? >> Say Alternative Least Squares, Basic Naive Bayes? >> 4. When you say Flink manages itself, does it mean I don't have to fiddle >> with number of partitions (Spark), number of reduces / happers (Hadoop?) to >> optimize performance? (In some cases this might be needed) >> 5. How far along is the Python API? I don't see the specs in the Website. >> >> On Thu, Dec 25, 2014 at 4:31 AM, Márton Balassi <[email protected]> >> wrote: >> >>> Dear Samarth, >>> >>> Besides the discussions you have mentioned [1] I can recommend one of >>> our recent presentations [2], especially the distinguishing Flink section >>> (from slide 16). >>> >>> It is generally a difficult question as both the systems are rapidly >>> evolving, so the answer can become outdated quite fast. However there are >>> fundamental design features that are highly unlikely to change, for example >>> Spark uses "true" batch processing, meaning that intermediate results are >>> materialized (mostly in memory) as RDDs. Flink's engine is internally more >>> like streaming, forwarding the results to the next operator asap. The >>> latter can yield performance benefits for more complex jobs. Flink also >>> gives you a query optimizer, spills gracefully to disk when the system runs >>> out of memory and has some cool features around serialization. For >>> performance numbers and some more insight please check out the presentation >>> [2] and do not hesitate to post a follow-up mail here if you come across >>> something unclear or extraordinary. >>> >>> [1] >>> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark >>> [2] http://www.slideshare.net/GyulaFra/flink-apachecon >>> >>> Best, >>> >>> Marton >>> >>> On Tue, Dec 23, 2014 at 6:19 PM, Samarth Mailinglist < >>> [email protected]> wrote: >>> >>>> Hey folks, I have a noob question. >>>> >>>> I already looked up the archives and saw a couple of discussions >>>> <http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark> >>>> about Spark and Flink. >>>> >>>> I am familiar with spark (the python API, esp MLLib), and I see many >>>> similarities between Flink and Spark. >>>> >>>> How does Flink distinguish itself from Spark? >>>> >>> >>> >> >
