WG: dynamic streams and patterns

Claudia Wegmann Mon, 25 Jul 2016 01:09:54 -0700

Hey everyone,

the last few days I looked into the approach the King-Team took. And another 
question to my original point 3 arose:


To 3) Would an approach similar to King/RBEA even be possible combined with 
Flink CEP? As I understand, Patterns have to be defined in Java code and 
therefore have to be recompiled? Do I overlook something important?

Thx for some input.

Best,
Claudia

Von: Claudia Wegmann
Gesendet: Freitag, 15. Juli 2016 12:58
An: 'user@flink.apache.org' <user@flink.apache.org>
Betreff: AW: dynamic streams and patterns

Hey Robert,

thanks for the valuable input.

I have some follow-up questions:

To 2) I had a separate Docker container for the JobManager and the TaskManagers 
for playing around with the Flink Cluster. In production that’s not an option. 
As I mentioned before the Flink job has to interact in a microservice 
environment communicating via async messages. For message handling the tool 
vert.x is used. Therefore the JobManager and TaskManager also need a connection 
to the vert.x instance. If they were deployed in a separate Docker container 
each container would need to have to include a vert.x instance, which is quite 
heavy weighed. The prefer to have a verticle, which starts the JobManager and 
the needed TaskManagers and starts the Flink job. Is there a common way to 
start JobManagers/TaskManagers from the Java programm?
I would even prefer to have JobManager and TaskManager build as separate JARs 
decorated with a verticle. Is that possible? Where would I start separating the 
JobManager and the TaskManager?

To 3) So there is no way to recalculate the JobGraph/ExecutionGraph of a 
running job? Would it be possible to alter the code to make this possible 
(without making Flink to stand upside down ;)?

To 4) What do you think would an upper limit for the number of streams in one 
job be?

Thx again and best wishes,
Claudia

Von: Robert Metzger [mailto:rmetz...@apache.org]
Gesendet: Donnerstag, 14. Juli 2016 16:26
An: user@flink.apache.org<mailto:user@flink.apache.org>
Betreff: Re: dynamic streams and patterns

Hi Claudia,

1) What do you mean by dynamically adding? In standalone mode (which you would 
probably use with Docker images), you can just start additional TaskManagers, 
which will connect to a JobManager.
So you could implement some monitoring to start new TaskManagers as soon as 
they are needed.
In general, we recommend to start one JobManager per job, but running multiple 
jobs per JM is also possible. I don't have much experience with many concurrent 
jobs on a JM, but in theory there are no limits.
In practice you'll probably run into stability issues at some point, because 
the JM needs to coordinate too many jobs / taskmanagers.

2) Yes, that would be an option. The most important aspects here are: data 
throughput per admin group / state size / analysis complexity.
If the each administrative group is low traffic (~100.000 elements / second), 
you could maybe process the data not using a Flink cluster at all. The 
Standalone mode of Flink starts a JobManager and TaskManager within the same 
JVM. You could prepare a docker image with a standalone flink + the job and 
start that per administrative group. I think a reasonably sized machine (8 
cores, 32 gb of main memory) should handle that.

3) Yes, you can not modify a running job. You can follow the King.com / RBEA 
approach.

4) That depends on the the definition of great.

I think above answers greatly depend on the expected amount of data and the 
available hardware. Since Flink is quite easy to deploy, and a simple testing 
job is implemented in a few hours, I would suggest to do some experiments to 
see how Flink behaves in the given environment.

Regards,
Robert




On Mon, Jul 11, 2016 at 9:39 AM, Claudia Wegmann 
<c.wegm...@kasasi.de<mailto:c.wegm...@kasasi.de>> wrote:
Hey everyone,

I’m quite new to Apache Flink. I’m trying to build a system with Flink and 
wanted to hear your opinion and whether the proposed architecture is even 
possible with Flink. The environment for the system will be a microservice 
architecture handling messaging via async events.

I want to give you a brief description of the system:

-          there are a lot of sensors, which each produces a stream of data

-          on each stream of sensor data I want to match one or more patterns 
via Flink’s CEP library

-          each of these sensors belongs to one or more administrative entities

-          each pattern belongs to one administrative entity and needs to be 
evaluated on one or more sensors of this entity

-          the user can change the connection of a sensor to an administrative 
entity as well as the sensors on which a pattern needs to be evaluated


I hope this description is enough to give you an overview of the system.

This is what I am thinking of doing:

-          I will have an Apache Kafka cluster and a Flink cluster running 
inside docker containers

-          I create a topic in Kafka for each administrative entity

-          for each entity I create a Flink job which consumes the 
corresponding topic

-          the Flink job creates a stream of the sensor data

-          it splits the stream to a stream for each sensor

-          for each pattern that hast to be evaluated on one stream I create a 
pattern stream


This results in the following:

-          there will be a lot of Kafka topics

-          for each topic there will be one Flink job (-> a lot of jobs, too)

-          in each job there will be quite a lot of streams and patterns and 
therefore even more pattern streams


The main questions that arose while thinking of this implementation:

1.)    From other questions here, I know that there is currently no way to 
dynamically add taskmanagers to the Flink cluster. The proposed way to handle 
that, is to start up much more taskmanagers than first needed. Is it even 
possible to have a great number of jobs on one cluster?

2.)    Would a viable alternative be to just dynamically start up a new cluster 
for each administrative entity?

3.)    I also came to know, that Flink isn’t able to handle dynamically created 
streams and patterns. I guess that is due to the fixed calculation of the 
execution graph at the jobs beginning. Is there a way to make Flink recalculate 
the graph of a running job? I also just found out about this [1] example, where 
they use scripts to hot deploy queries. I will look into that, too. Maybe that 
provides an acceptable solution for me, too.

4.)    Is it even possible to have a great number of streams and patterns in 
one Flink job?


Any comments and feedback are greatly appreciated.
Thanks a lot in advance ☺
Best, Claudia

[1]: https://techblog.king.com/rbea-scalable-real-time-analytics-king/

WG: dynamic streams and patterns

Reply via email to