I've found the same with Spark. Spark performs much of the generic data 
processing itself, such as map, reduce, and filter. If we want to make it work 
for our use case, we need to plug in our own implementations, which could 
potentially be a problem. A rough Spark sketch of what I mean follows.
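
For illustration, a minimal sketch against Spark's Java API (the class name 
and sample data are made up): the only extension point is the function we 
pass into the generic operators, which constrains what a "task" can look like.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkCustomMapSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("sketch").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Input has to be modeled as an RDD of elements
            // ("job-a"/"job-b" are made-up task names).
            JavaRDD<String> tasks = sc.parallelize(Arrays.asList("job-a", "job-b"));

            // The "custom" logic is whatever function we plug into the
            // generic operators (map/filter/reduce) -- nothing more.
            JavaRDD<String> results = tasks.map(task -> "executed:" + task);
            results.foreach(r -> System.out.println(r));

            sc.stop();
        }
    }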



-----Original Message-----
From: Shenoy, Gourav Ganesh <[email protected]>
To: dev <[email protected]>
Sent: Mon, Jun 12, 2017 4:16 pm
Subject: Re: Apache Flink Execution



Hi Dev,
 
After doing some more reading and playing around with Storm and Flink code 
examples, I am now of the opinion that, although Flink provides us with 
certain benefits over Storm (see previous email), integrating Flink to suit 
the Airavata use case might not work. The reasons are as follows:
 
1.      Implementing custom functions/task-executors in Flink is not as 
straightforward as in Storm (bolts). Flink uses the concept of datasets and 
transformations: we define the data (bounded/unbounded) and apply 
transformations on it, which means defining operators that turn input data 
into output data. The problem here is that the transformations Flink accepts 
are limited to generic data processing, such as MAP, REDUCE, JOIN, GROUP-BY, 
KEY-BY, AGGREGATE, etc. The only flexibility is that we can define our own 
implementations of these generic transformation APIs.

In contrast, for Airavata we need much more complicated task-executor 
implementations. These generic transformations are of no use to Airavata, as 
they only target stream-processing use cases. For example, if you have a 
dataset of calls made between two people and the duration of each call, we 
can override the MAP and GROUP functions to produce a transformed dataset of 
<call, totalDuration>; the word-count example works similarly (see the first 
sketch after this list).
 
2.      Although Flink claims to support bounded datasets (as opposed to 
Storm, which expects unbounded data and can be tweaked to handle bounded 
data, but has no native support for it), the datasets need to be a 
Collection/Tuple in most cases; the first sketch after this list shows this 
constraint as well.
 
3.      The thing that troubles me the most is that there is NO way to define 
custom executors and invoke them in the manner we anticipate. For example, we 
would ideally want to deploy/enable task executors (Job-Submission, 
Data-Staging, Monitoring, etc.) on workers, and then create a DAG to invoke 
them. This capability is available in Storm via a Topology (DAG), Spouts 
(dataset), and Bolts (executors), as the second sketch after this list shows. 
In Flink, by contrast, it is always a matter of applying some kind of 
transformation to the incoming dataset to generate a new dataset, whether 
that is aggregating records, breaking sentences into words and grouping 
identical words to count them, etc.
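
To make point 1 concrete (along with the Collection/Tuple constraint from 
point 2), here is a minimal sketch of the call/duration example against 
Flink's DataSet API. The class name and input data are made up for 
illustration; note that the only "custom" logic we can express is what we 
plug into the generic groupBy/sum operators.

    import java.util.Arrays;

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class CallDurationSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env =
                    ExecutionEnvironment.getExecutionEnvironment();

            // The bounded input has to be modeled as a collection of
            // tuples (point 2), here Tuple2<callPair, minutes>.
            DataSet<Tuple2<String, Integer>> calls =
                    env.fromCollection(Arrays.asList(
                            Tuple2.of("alice->bob", 5),
                            Tuple2.of("alice->bob", 7),
                            Tuple2.of("carol->dan", 3)));

            // The "custom" part is confined to the generic transformations:
            // group by the call pair (field 0), sum durations (field 1),
            // yielding <call, totalDuration>.
            calls.groupBy(0)
                 .sum(1)
                 .print();
        }
    }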
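
By comparison, a rough sketch of point 3 in Storm's API. TaskRequestSpout and 
the executor names are hypothetical, not existing Airavata classes; the point 
is that a bolt's execute() may contain arbitrary task logic and that bolts 
can be wired into a DAG.

    import java.util.Map;

    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class AiravataTopologySketch {

        // Hypothetical spout; a real one would pull task requests off a queue.
        public static class TaskRequestSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;

            @Override
            public void open(Map conf, TopologyContext context,
                             SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                collector.emit(new Values("experiment-1"));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("taskId"));
            }
        }

        // Placeholder executor bolt: Storm places no constraint on what
        // execute() does, so job submission, data staging, monitoring, etc.
        // can each run arbitrary code.
        public static class TaskExecutorBolt extends BaseBasicBolt {
            private final String taskType;

            public TaskExecutorBolt(String taskType) {
                this.taskType = taskType;
            }

            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                // A real implementation would do the actual work here.
                System.out.println(taskType + " handling " + input.getString(0));
                collector.emit(new Values(input.getString(0))); // pass it on
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("taskId"));
            }
        }

        public static void main(String[] args) {
            // Wire the executors into a DAG:
            // task-requests -> data-staging -> job-submission -> monitoring.
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("task-requests", new TaskRequestSpout());
            builder.setBolt("data-staging", new TaskExecutorBolt("data-staging"))
                   .shuffleGrouping("task-requests");
            builder.setBolt("job-submission", new TaskExecutorBolt("job-submission"))
                   .shuffleGrouping("data-staging");
            builder.setBolt("monitoring", new TaskExecutorBolt("monitoring"))
                   .shuffleGrouping("job-submission");
            // builder.createTopology() would then be submitted to a cluster.
        }
    }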
 
The only positive I observed was the ability to run a Storm topology in 
Flink, but this is more of a backward-compatibility feature, intended for 
user applications written in Storm that need to be migrated to Flink. I am 
not an expert in Flink, so what I have pointed out above is my understanding 
after reading the literature and running the code examples. Anyone who has 
worked with Flink, please feel free to provide your input.
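
For reference, here is roughly what that compatibility layer looks like, 
assuming the flink-storm module as documented for Flink 1.x; the exact class 
names and signatures have varied across releases, so treat this as an 
assumption rather than a tested recipe. It reuses the hypothetical spout/bolt 
classes from the Storm sketch above.

    import org.apache.storm.Config;
    import org.apache.storm.topology.TopologyBuilder;

    // flink-storm compatibility module (API assumed from the 1.x docs).
    import org.apache.flink.storm.api.FlinkLocalCluster;
    import org.apache.flink.storm.api.FlinkTopology;

    public class StormOnFlinkSketch {
        public static void main(String[] args) throws Exception {
            // Build the same spout/bolt DAG as in the plain Storm sketch.
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("task-requests",
                    new AiravataTopologySketch.TaskRequestSpout());
            builder.setBolt("data-staging",
                    new AiravataTopologySketch.TaskExecutorBolt("data-staging"))
                   .shuffleGrouping("task-requests");

            // Wrap the Storm topology so Flink executes it unchanged.
            FlinkLocalCluster cluster = FlinkLocalCluster.getLocalCluster();
            cluster.submitTopology("airavata-on-flink", new Config(),
                    FlinkTopology.createTopology(builder));
        }
    }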
 
Thanks and Regards,
Gourav Shenoy
 

From: "Shenoy, Gourav Ganesh" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, June 7, 2017 at 11:12 AM
To: "[email protected]" <[email protected]>
Subject: Re: Apache Flink Execution

 

Hi dev,
 
I did some literature reading about Storm vs. Flink, with an emphasis on our 
use case of distributed task execution. My initial impressions are as follows 
(I will also update the Google doc accordingly):
 
1.     Although both the Storm and Flink engines appear similar in supporting 
pipeline processing, Storm can only handle data streams, whereas Flink 
supports both stream and batch processing. This also allows Flink to transfer 
data between parallel tasks; we do not have such support as of today, but we 
can definitely think about parallel task execution.
2.     Storm supports at-least-once and at-most-once data processing, whereas 
Flink guarantees exactly-once processing; Storm also supports exactly-once 
via its Trident API. From what I read, Flink claims to be more efficient in 
terms of processing semantics, as it uses a lighter-weight algorithm for 
checkpointing data transfers.
3.     Flink provides high-level APIs that simplify the data-collection 
process, which is a little tedious in Storm: there one needs to manually 
implement readers and collectors, whereas Flink provides functions such as 
Map, GroupBy, Window, and Join.
4.     A major positive of Flink is the ability to maintain custom state 
information in operators/executors. This custom state can also be included in 
checkpoints for fault tolerance (see the sketch after this list).
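
To make points 2 and 4 concrete, here is a minimal sketch against Flink's 
DataStream API (the class and job names are made up for the example): a keyed 
operator keeps a per-key count in custom ValueState, and enabling 
checkpointing makes that state part of Flink's exactly-once snapshots.

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.util.Collector;

    public class StatefulCountSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            // Point 2: checkpointing backs the exactly-once guarantee;
            // here a snapshot is taken every 5 seconds.
            env.enableCheckpointing(5000);

            env.fromElements("a", "b", "a")
               .keyBy(value -> value)
               .flatMap(new CountingFunction())
               .print();

            env.execute("stateful-count-sketch");
        }

        // Point 4: an operator may hold custom keyed state, which Flink
        // includes in its checkpoints automatically.
        public static class CountingFunction
                extends RichFlatMapFunction<String, Tuple2<String, Long>> {
            private transient ValueState<Long> count;

            @Override
            public void open(Configuration parameters) {
                count = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("count", Long.class));
            }

            @Override
            public void flatMap(String value, Collector<Tuple2<String, Long>> out)
                    throws Exception {
                Long current = count.value();
                long next = (current == null ? 0L : current) + 1L;
                count.update(next);
                out.collect(Tuple2.of(value, next));
            }
        }
    }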
 
I think Flink is an improvement over Storm, but this is just my understanding 
from initial readings; I haven't yet tried coding any examples in Flink. 
Again, most of the features/differences mentioned above, offered by both 
Storm and Flink, target stream processing with a focus on executing a large 
number of small tasks (in parallel?) over continuously streaming data, so the 
competition is about offering low-latency processing. These points might not 
necessarily be that important for the Airavata use case, where tasks may take 
time to complete.
 
Thanks and Regards,
Gourav Shenoy
 

From: "Pierce, Marlon" <[email protected]>
Reply-To: <[email protected]>
Date: Wednesday, May 24, 2017 at 11:36 AM
To: "[email protected]" <[email protected]>
Subject: Re: Apache Flink Execution

 

Thanks, Apoorv.  Note for everyone else: request access if you’d like to leave 
a comment or make a suggestion.
 
Marlon
 

From: Apoorv Palkar <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, May 24, 2017 at 11:32 AM
To: "[email protected]" <[email protected]>
Subject: Apache Flink Execution

 

https://docs.google.com/document/d/1GDh8kEbAXVY9Gv1mmFvq__zLN_JP6m2_KbfN-9C0uO0/edit?usp=sharing

 

Link to the Flink use/fundamentals doc.

