Hey,

1-2. As for failure recovery, Flink batch and streaming programs handle failures differently. The failed parts of a batch job currently restart upon failure, but there is an ongoing effort on fine-grained fault tolerance, which is somewhat similar to Spark's lineage tracking. (So technically this is exactly-once semantics, but that is somewhat meaningless for batch jobs.)
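To illustrate why restart-based recovery gives exactly-once results for batch work, here is a minimal Python sketch. It is not Flink code: `run_with_restart` and its `fail_on_attempts` parameter are hypothetical names made up for this example, and the failure is simulated. The idea is only that the input is bounded and replayable, partial results are thrown away on failure, and output is published only by a complete run.

```python
def run_with_restart(records, fail_on_attempts=()):
    """Sum a bounded input, restarting from scratch after a simulated failure.

    Hypothetical illustration only; this is not how Flink is implemented.
    Returns (result, number_of_attempts).
    """
    attempt = 0
    while True:
        attempt += 1
        partial = 0  # partial results from earlier attempts are discarded
        try:
            for i, value in enumerate(records):
                # Simulate a task failure halfway through the chosen attempts.
                if attempt in fail_on_attempts and i == len(records) // 2:
                    raise RuntimeError("simulated task failure")
                partial += value
            # Only a run that saw the whole input publishes its output.
            return partial, attempt
        except RuntimeError:
            continue  # restart: re-read the same bounded input


result, attempts = run_with_restart([1, 2, 3, 4], fail_on_attempts={1, 2})
print(result, attempts)
```

Each record contributes to the final result exactly once no matter how many attempts were needed. That is why the guarantee is not very interesting for batch jobs, and why it is much harder for streams, where the input is unbounded and results are emitted continuously.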
For streaming programs we are currently working on fault tolerance; we are hoping to support at-least-once processing guarantees in the 0.9 release. After that we will focus our research efforts on a high-performance implementation of exactly-once processing semantics, which is still a hard topic in streaming systems. Storm Trident's exactly-once semantics can only provide very low throughput. We are trying hard to avoid this issue, as our streaming system is capable of much higher throughput than Storm in general, as you can see in some performance measurements.

3. There are already many ML algorithms implemented for Flink, but they are scattered all around. We are planning to collect them in a machine learning library soon. We are also implementing an adapter for Samoa, which will provide some streaming machine learning algorithms as well. The Samoa integration should be ready in January.

4. Flink carefully manages its memory use to avoid heap errors and to utilize memory space as effectively as it can. The optimizer for batch programs also takes care of a lot of optimization steps that the user would have to do manually in other systems, like optimizing the order of transformations. There are of course parts of the program that still need to be modified for maximal performance, for example the parallelism settings for some operators in some cases.

5. As for the status of the Python API I personally cannot say very much, maybe someone can jump in and help me with that question :)

Regards,
Gyula

On Thu, Dec 25, 2014 at 11:58 AM, Samarth Mailinglist <[email protected]> wrote:

> Thank you for your answer. I have a couple of follow-up questions:
> 1. Does it support the 'exactly once semantics' that Spark and Storm support?
> 2. (Related to 1) What happens when an error occurs during processing?
> 3. Is there a plan for adding Machine Learning support on top of Flink?
> Say Alternating Least Squares, Basic Naive Bayes?
> 4. When you say Flink manages itself, does it mean I don't have to fiddle
> with number of partitions (Spark), number of reducers / mappers (Hadoop?) to
> optimize performance? (In some cases this might be needed)
> 5. How far along is the Python API? I don't see the specs on the website.
>
> On Thu, Dec 25, 2014 at 4:31 AM, Márton Balassi <[email protected]>
> wrote:
>
>> Dear Samarth,
>>
>> Besides the discussions you have mentioned [1] I can recommend one of our
>> recent presentations [2], especially the distinguishing Flink section (from
>> slide 16).
>>
>> It is generally a difficult question as both the systems are rapidly
>> evolving, so the answer can become outdated quite fast. However there are
>> fundamental design features that are highly unlikely to change, for example
>> Spark uses "true" batch processing, meaning that intermediate results are
>> materialized (mostly in memory) as RDDs. Flink's engine is internally more
>> like streaming, forwarding the results to the next operator asap. The
>> latter can yield performance benefits for more complex jobs. Flink also
>> gives you a query optimizer, spills gracefully to disk when the system runs
>> out of memory and has some cool features around serialization. For
>> performance numbers and some more insight please check out the presentation
>> [2] and do not hesitate to post a follow-up mail here if you come across
>> something unclear or extraordinary.
>>
>> [1]
>> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark
>> [2] http://www.slideshare.net/GyulaFra/flink-apachecon
>>
>> Best,
>>
>> Marton
>>
>> On Tue, Dec 23, 2014 at 6:19 PM, Samarth Mailinglist <
>> [email protected]> wrote:
>>
>>> Hey folks, I have a noob question.
>>>
>>> I already looked up the archives and saw a couple of discussions
>>> <http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark>
>>> about Spark and Flink.
>>>
>>> I am familiar with spark (the python API, esp MLLib), and I see many
>>> similarities between Flink and Spark.
>>>
>>> How does Flink distinguish itself from Spark?
>>>
>>
>>
>
