Hi all,

I would like to consult with you regarding deployment strategies.

We have more than 250 Kafka topics, and we want users of the platform to be
able to submit SQL queries against them that run indefinitely. We have a
query parser that extracts topic names from user queries; the application
then creates the corresponding Kafka tables locally and executes the query.
The results can be written to multiple sinks such as databases, files, and
cloud services.
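
For context, the topic-extraction step looks roughly like the sketch below.
It is a simplified stand-in, not our actual parser: the regex only handles
plain FROM/JOIN clauses, and the name `extract_topics` is made up for
illustration.

```python
import re
from typing import List

# Matches the identifier that follows FROM or JOIN. This is deliberately
# naive: it ignores quoted identifiers, CTEs, and other SQL corner cases.
_TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.\-]*)", re.IGNORECASE)


def extract_topics(query: str) -> List[str]:
    """Return the distinct table (topic) names a query references, in order."""
    seen: List[str] = []
    for name in _TABLE_REF.findall(query):
        if name not in seen:
            seen.append(name)
    return seen
```

Each extracted name is then mapped to a Kafka-backed table before the query
is executed.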

We want the best possible isolation between queries, so that a failure in
one job does not affect the others. We have a huge YARN cluster that handles
Kafka traffic at the scale of 1 PB a day. I believe a cluster-per-job type
of deployment makes sense for the sake of isolation. However, that creates
scalability problems: several SQL queries may run on the same Kafka topic,
and we do not want to read that topic again for each query in a different
session. Ideally, we would read the topic once and execute multiple queries
on that data. That gives up the goal of a fully isolated system, but it
reduces the load on the network and on Kafka, and it still provides
isolation at the topic level, because each topic is read once and all of its
SQL queries run on that single stream.
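
To make the topic-level grouping concrete, here is a rough sketch of the
planning step we have in mind. It is only an illustration under our own
assumptions: `group_queries_by_topic` and the query IDs are hypothetical
names, and the topics for each query are assumed to come from the query
parser.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_queries_by_topic(
    queries: Iterable[Tuple[str, List[str]]],
) -> Dict[str, List[str]]:
    """Map each topic to the IDs of the queries that read from it.

    `queries` yields (query_id, topics) pairs, where `topics` is the list
    of topic names extracted from that query.
    """
    by_topic: Dict[str, List[str]] = defaultdict(list)
    for query_id, topics in queries:
        for topic in topics:
            by_topic[topic].append(query_id)
    return dict(by_topic)
```

For example, passing `[("q1", ["clicks"]), ("q2", ["clicks"]), ("q3",
["orders"])]` puts q1 and q2 in the same group for `clicks`, so that topic
would be read once for both. A query that joins two topics shows up in both
groups, and we are not sure how best to handle that case.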

We are quite new to Flink, but we have experience with Spark. In Spark, we
can submit an application that, on the master, listens to a query queue and
submits jobs to the cluster dynamically from different threads. In Flink,
however, it looks like main() has to produce the job graph in advance.

We run on an EMR cluster. What would you recommend for our use case?

Thank you.


