Re: Evaluating Apache Flink

Márton Balassi Fri, 08 Jul 2016 06:14:35 -0700

Hi Kevin,

Thanks for being willing to contribute such an effort. I think it is a
completely valid discussion to have in your organization, and please feel
free to ask us questions during your evaluation. Putting statements on the
Flink website highlighting the differences would be very tricky though, and
I would advise against it. Let me elaborate.

The "How does it compare to Spark?" is definitely one of the most
frequently asked questions that we get and we can generally give three
types of answers:

*1. General architecture decisions*

   - Streaming (pipelined) execution engine (or long running operator
   model).
   - Native iteration operator.
   - ...

The issue with this approach is that in itself it gives a decision maker
almost no useful information. For that you need benchmarks or fancy
features, so let us evaluate them.

*2. Benchmarks*
You can find plenty of third-party benchmarks and soft evaluations [1,2,3]
of the two systems out there. The problem with these is that they depend
heavily on the versions of the systems used, on tuning, and on how well the
general architecture is understood. E.g. [1] favors Storm, but if you re-do
the whole benchmark from a Flink point of view you get [4]. After a couple
of versions the benchmark results can be very different.

*3. Fancy Features*

   - Exactly once spillable streaming state stored locally
   - Savepoints
   - ...

Similarly to the previous point, these might be an edge at some point in
time, but the whole streaming space is moving very quickly and, as it is
open source, projects tend to copy each other to a certain extent.

This of course does not mean that doing evaluations at any point in time is
meaningless, but you need to update them frequently (check [5] and [6]) and
they can do more harm than good if not treated with care.

I hope I was not too discouraging and could help you with your endeavor. It
is also very important to take your specific use cases into account.

Best,

Marton

[1]
https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
[2] https://tech.zalando.de/blog/apache-showdown-flink-vs.-spark/
[3] http://data-artisans.com/how-we-selected-apache-flink-at-otto-group/
[4] http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
[5]
http://www.slideshare.net/GyulaFra/largescale-stream-processing-in-the-hadoop-ecosystem
[6]
http://www.slideshare.net/GyulaFra/largescale-stream-processing-in-the-hadoop-ecosystem-hadoop-summit-2016-60887821

On Fri, Jul 8, 2016 at 2:23 PM, Kevin Jacobs <kevin.jac...@cern.ch> wrote:

> Hi,
>
> I am currently working for an organization which is using Apache
> Spark as its main data processing framework. Now the organization is wondering
> whether Apache Flink is better at processing their data than Apache Spark.
> Therefore, I am evaluating Apache Flink and I am comparing it to Apache
> Spark.
>
> When I looked at Apache Flink for the first time, I could not find any
> comparison to Apache Spark on Flink's website. Would it be an idea to give
> some information about the differences between the two frameworks on the
> website? I would like to contribute to that if you think that would be helpful.
>
> Regards,
> Kevin
>
