I am also cc'ing Timo to see if he has anything more to add on this. Cheers, Kostas
On Thu, Nov 19, 2020 at 9:41 PM Kostas Kloudas <kklou...@gmail.com> wrote: > > Hi, > > Thanks for reaching out! > > First of all, I would like to point out that an interesting > alternative to the per-job cluster could be running your jobs in > application mode [1]. > > Given that you want to run arbitrary SQL queries, I do not think you > can "share" across queries the part of the job graph that reads a > topic. In general, Flink (not only in SQL) creates the graph of a job > before the job is executed. And especially in SQL you do not even have > control over the graph, as the translation logic from query to > physical operators is opaque and not exposed to the user. > > That said, you may want to have a look at [2]. It is pretty old but it > describes a potentially similar usecase. Unfortunately, it does not > support SQL. > > Cheers, > Kostas > > [1] https://flink.apache.org/news/2020/07/14/application-mode.html > [2] https://www.ververica.com/blog/rbea-scalable-real-time-analytics-at-king > > On Sun, Nov 15, 2020 at 10:11 AM lalala <lal...@activist.com> wrote: > > > > Hi all, > > > > I would like to consult with you regarding deployment strategies. > > > > We have +250 Kafka topics that we want users of the platform to submit SQL > > queries that will run indefinitely. We have a query parsers to extract topic > > names from user queries, and the application locally creates Kafka tables > > and execute the query. The result can be collected to multiple sinks such as > > databases, files, cloud services. > > > > We want to have the best isolation between queries, so in case of failures, > > the other jobs will not get affected. We have a huge YARN cluster to handle > > 1PB a day scale from Kafka. I believe cluster per job type deployment makes > > sense for the sake of isolation. However, that creates some scalability > > problems. There might be SQL queries running on the same Kafka topic that we > > do not want to read them again for each query in different sessions. The > > ideal case is that we read the topic once and executes multiple queries on > > this data to avoid rereading the same topic. That breaks the desire of a > > fully isolated system, but it improves network and Kafka performance and > > still provides isolation on the topic level as we just read the topic once > > and execute multiple SQL queries on it. > > > > We are quite new to Flink, but we have experience with Spark. In Spark, we > > can submit an application, and in master, that can listen a query queue and > > submit jobs to the cluster dynamically from different threads. However, In > > Flink, it looks like the main() has to produce the job the graph in advance. > > > > We do use an EMR cluster; what would you recommend for my use case? > > > > Thank you. > > > > > > > > -- > > Sent from: > > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/