Re: Apache Flink

Michael Malak Sun, 17 Apr 2016 14:55:02 -0700

There have been commercial CEP solutions for decades, including from my 
employer.

From: Mich Talebzadeh <mich.talebza...@gmail.com>
To: Mark Hamstra <m...@clearstorydata.com>
Cc: Corey Nolet <cjno...@gmail.com>; "user @spark" <user@spark.apache.org>
Sent: Sunday, April 17, 2016 3:48 PM
Subject: Re: Apache Flink

The problem is that the strength and wider acceptance of a typical open source 
project come from its sizeable user and development community. When the 
community is small, as with Flink, it is not a viable solution to adopt.
I am rather disappointed that no big data project can be used for Complex Event 
Processing, as it has wide use in algorithmic trading, among other areas.

Dr Mich Talebzadeh LinkedIn  
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 http://talebzadehmich.wordpress.com 
On 17 April 2016 at 22:30, Mark Hamstra <m...@clearstorydata.com> wrote:

To be fair, the Stratosphere project from which Flink springs was started as a 
collaborative university research project in Germany about the same time that 
Spark was first released as Open Source, so they are near contemporaries rather 
than Flink having been started only well after Spark was an established and 
widely-used Apache project.
On Sun, Apr 17, 2016 at 2:25 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
wrote:

Also, it always amazes me that there are so many tangential projects in the Big 
Data space. Would it not be easier if efforts were spent on adding functionality 
to Spark rather than creating a new product like Flink?
On 17 April 2016 at 21:08, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Thanks Corey for the useful info.
I have used Sybase Aleri and StreamBase as commercial CEP engines. However, 
there does not seem to be anything close to these products in the Hadoop 
ecosystem. So I guess there is nothing there?
Regards.

On 17 April 2016 at 20:43, Corey Nolet <cjno...@gmail.com> wrote:

I have not been intrigued at all by the microbatching concept in Spark. I am 
used to CEP in real stream processing environments like InfoSphere Streams & 
Storm, where the granularity of processing is at the level of each individual 
tuple and processing units (workers) can react immediately to events being 
received and processed. The closest Spark Streaming comes to this concept is 
the notion of "state" that can be updated via the "updateStateByKey()" 
functions, which can only be run in a microbatch. Looking at the 
expected design changes to Spark Streaming in Spark 2.0.0, it also does not 
look like tuple-at-a-time processing is on the radar for Spark, though I have 
seen articles stating that more effort is going to go into the Spark SQL layer 
in Spark Streaming, which may make it more reminiscent of Esper.
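To make the micro-batch constraint on state concrete, here is a minimal sketch 
in plain Python. It does not use Spark at all; `run_micro_batches` and 
`update_count` are hypothetical names, and the only thing borrowed from Spark 
Streaming is the shape of the update function (new values for a key plus the 
previous state, returning the new state). The point it tries to show is that 
state only changes once per batch, never per tuple.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical update function in the style of updateStateByKey:
# it receives every new value seen for a key in the current batch,
# plus the previous state (None if the key is new).
def update_count(new_values: List[int], state: Optional[int]) -> Optional[int]:
    return (state or 0) + sum(new_values)


def run_micro_batches(
    batches: List[List[Tuple[str, int]]],
    update_func: Callable[[List[int], Optional[int]], Optional[int]],
) -> List[Dict[str, int]]:
    """Emulate per-key state that is updated once per micro-batch.

    Returns a snapshot of the state after each batch. This is an
    illustration of the idea only, not how Spark is implemented.
    """
    state: Dict[str, int] = {}
    snapshots: List[Dict[str, int]] = []
    for batch in batches:
        # Group the batch's tuples by key before touching any state.
        grouped: Dict[str, List[int]] = defaultdict(list)
        for key, value in batch:
            grouped[key].append(value)
        # Keys with existing state are also visited, with no new values.
        for key in sorted(set(state) | set(grouped)):
            new_state = update_func(grouped.get(key, []), state.get(key))
            if new_state is None:
                state.pop(key, None)  # returning None drops the key
            else:
                state[key] = new_state
        snapshots.append(dict(state))
    return snapshots
```

Calling `run_micro_batches([[("a", 1), ("b", 2), ("a", 3)], [("b", 5)]], 
update_count)` yields one snapshot per batch; a rule that depends on the state 
cannot observe the first `("a", 1)` before the whole batch has been grouped.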
For these reasons, I have not even tried to implement CEP in Spark. I feel it's 
a waste of time without immediate tuple-at-a-time processing. By not offering 
it, Spark avoids the whole problem of "back pressure" (though keep in mind, it 
is still very possible to overload the Spark Streaming layer with stages that 
will continue to pile up and never get worked off), but it loses the granular 
control that you get in CEP environments, where the rules & processors react to 
the receipt of each tuple, right away. 
A while back, I did attempt to implement an InfoSphere Streams-like API [1] on 
top of Apache Storm as an example of what such a design may look like. It looks 
like Storm is going to be replaced in the not so distant future by Twitter's 
new design called Heron. IIRC, Heron does not have an open source 
implementation as of yet. 
[1] https://github.com/calrissian/flowmix
On Sun, Apr 17, 2016 at 3:11 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
wrote:

Hi Corey,
Can you please point me to docs on using Spark for CEP? Do we have a set of CEP 
libraries somewhere? I am keen on getting hold of adaptor libraries for Spark, 
something like below

Thanks

On 17 April 2016 at 16:07, Corey Nolet <cjno...@gmail.com> wrote:

One thing I've noticed about Flink in my following of the project has been that 
it has established, in a few cases, some novel ideas and improvements over 
Spark. The problem with it, however, is that both the development team and the 
community around it are very small and many of those novel improvements have 
been rolled directly into Spark in subsequent versions. I was considering 
changing over my architecture to Flink at one point to get better, more 
real-time CEP streaming support, but in the end I decided to stick with Spark 
and just watch Flink continue to pressure it into improvement.
On Sun, Apr 17, 2016 at 11:03 AM, Koert Kuipers <ko...@tresata.com> wrote:

I never found much info that Flink was actually designed to be fault tolerant. 
If fault tolerance is more bolt-on/add-on/afterthought then that doesn't bode 
well for large scale data processing. Spark was designed with fault tolerance 
in mind from the beginning.

On Sun, Apr 17, 2016 at 9:52 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
wrote:

Hi,
I read the benchmark published by Yahoo. Obviously they already use Storm and 
are inevitably very familiar with that tool. To start with, although these 
benchmarks were somewhat interesting IMO, they lend themselves to an assurance 
that the tool chosen for their platform is still the best choice. So inevitably 
the benchmarks and the tests were done primarily to support their approach.
In general, anything which is not done through the TPC Council or a similar body 
is questionable. Their argument is that because Spark handles data streaming in 
micro batches, it inevitably introduces in-built latency by design. In 
contrast, neither Storm nor Flink has this issue (at face value).
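As a rough illustration of that in-built latency, the sketch below computes how 
long an event waits before its micro-batch closes. It is plain Python under 
simplifying assumptions: batches close at fixed multiples of the interval, and 
scheduling and processing time are ignored, so real end-to-end latency in any 
engine would be higher. `micro_batch_wait` and `mean_wait` are hypothetical 
helper names, not part of Spark, Storm, or Flink.

```python
import math
from typing import List


def micro_batch_wait(arrival_s: float, batch_interval_s: float) -> float:
    """Seconds an event waits until the batch containing it closes.

    An event arriving exactly on a boundary is assigned to the next
    batch in this sketch, so it waits one full interval.
    """
    batch_end = (math.floor(arrival_s / batch_interval_s) + 1) * batch_interval_s
    return batch_end - arrival_s


def mean_wait(arrivals_s: List[float], batch_interval_s: float) -> float:
    """Average batching delay over a list of arrival times."""
    waits = [micro_batch_wait(a, batch_interval_s) for a in arrivals_s]
    return sum(waits) / len(waits)
```

For arrivals spread evenly across a batch, the average added wait comes out at 
about half the batch interval, which is the delay a tuple-at-a-time engine does 
not impose, although such an engine still has processing latency of its own.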
In addition, as we already know, Spark has far more capabilities compared to 
Flink (I know nothing about Storm). So really it boils down to the business SLA 
when choosing which tool to deploy for your use case. IMO the Spark micro 
batching approach is probably OK for 99% of use cases. If we had built-in 
libraries for CEP for Spark (I am searching for them), I would not bother with 
Flink.
HTH

On 17 April 2016 at 12:47, Ovidiu-Cristian MARCU 
<ovidiu-cristian.ma...@inria.fr> wrote:

You probably read this benchmark at Yahoo, any comments from Spark?
https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

On 17 Apr 2016, at 12:41, andy petrella <andy.petre...@gmail.com> wrote:
Just adding one thing to the mix: `that the latency for streaming data is 
eliminated` is insane :-D
On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh <mich.talebza...@gmail.com> 
wrote:

It seems that Flink argues that the latency for streaming data is eliminated, 
whereas with Spark RDD there is this latency.
I noticed that Flink does not support an interactive shell like the Spark shell, 
where you can add jars to do Kafka testing. The advice was to add the 
streaming Kafka jar file to CLASSPATH, but that does not work.
Most Flink documentation is also rather sparse, with the usual example of word 
count, which is not exactly what you want.
Anyway, I will have a look at it further. I have a Spark Scala streaming Kafka 
program that works fine in Spark and I want to recode it using Scala for Flink 
with Kafka, but I have difficulty importing and testing libraries.
Cheers
On 17 April 2016 at 02:41, Ascot Moss <ascot.m...@gmail.com> wrote:

I compared both last month; it seems to me that Flink's ML library is not yet 
ready.
On Sun, Apr 17, 2016 at 12:23 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
wrote:

Thanks Ted. I was wondering if someone is using both :)
On 16 April 2016 at 17:08, Ted Yu <yuzhih...@gmail.com> wrote:

Looks like this question is more relevant on flink mailing list :-)
On Sat, Apr 16, 2016 at 8:52 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
wrote:

Hi,
Has anyone used Apache Flink instead of Spark by any chance?
I am interested in its set of libraries for Complex Event Processing.
Frankly I don't know if it offers far more than Spark offers.
Thanks

-- 
andy
