Hey everyone,

I'm quite new to Apache Flink. I'm trying to build a system with Flink and 
wanted to hear your opinion and whether the proposed architecture is even 
possible with Flink. The environment for the system will be a microservice 
architecture handling messaging via async events.

I want to give you a brief description of the system:

-          there are a lot of sensors, which each produces a stream of data

-          on each stream of sensor data I want to match one or more patterns 
via Flink's CEP library

-          each of these sensors belongs to one or more administrative entities

-          each pattern belongs to one administrative entity and needs to be 
evaluated on one or more sensors of this entity

-          the user can change the connection of a sensor to an administrative 
entity as well as the sensors on which a pattern needs to be evaluated


I hope this description is enough to give you an overview of the system.

This is what I am thinking of doing:

-          I will have an Apache Kafka cluster and a Flink cluster running 
inside docker containers

-          I create a topic in Kafka for each administrative entity

-          for each entity I create a Flink job which consumes the 
corresponding topic

-          the Flink job creates a stream of the sensor data

-          it splits the stream to a stream for each sensor

-          for each pattern that hast to be evaluated on one stream I create a 
pattern stream


This results in the following:

-          there will be a lot of Kafka topics

-          for each topic there will be one Flink job (-> a lot of jobs, too)

-          in each job there will be quite a lot of streams and patterns and 
therefore even more pattern streams


The main questions that arose while thinking of this implementation:

1.)    From other questions here, I know that there is currently no way to 
dynamically add taskmanagers to the Flink cluster. The proposed way to handle 
that, is to start up much more taskmanagers than first needed. Is it even 
possible to have a great number of jobs on one cluster?

2.)    Would a viable alternative be to just dynamically start up a new cluster 
for each administrative entity?

3.)    I also came to know, that Flink isn't able to handle dynamically created 
streams and patterns. I guess that is due to the fixed calculation of the 
execution graph at the jobs beginning. Is there a way to make Flink recalculate 
the graph of a running job? I also just found out about this [1] example, where 
they use scripts to hot deploy queries. I will look into that, too. Maybe that 
provides an acceptable solution for me, too.

4.)    Is it even possible to have a great number of streams and patterns in 
one Flink job?


Any comments and feedback are greatly appreciated.
Thanks a lot in advance :)
Best, Claudia

[1]: https://techblog.king.com/rbea-scalable-real-time-analytics-king/

Reply via email to