Re: Dynamic ad hoc query deployment strategy

Kostas Kloudas Fri, 20 Nov 2020 00:51:14 -0800

I am also cc'ing Timo to see if he has anything more to add on this.

Cheers,
Kostas


On Thu, Nov 19, 2020 at 9:41 PM Kostas Kloudas <kklou...@gmail.com> wrote:
>
> Hi,
>
> Thanks for reaching out!
>
> First of all, I would like to point out that an interesting
> alternative to the per-job cluster could be running your jobs in
> application mode [1].
>
> Given that you want to run arbitrary SQL queries, I do not think you
> can "share" across queries the part of the job graph that reads a
> topic. In general, Flink (not only in SQL) creates the graph of a job
> before the job is executed. And especially in SQL you do not even have
> control over the graph, as the translation logic from query to
> physical operators is opaque and not exposed to the user.
>
> That said, you may want to have a look at [2]. It is pretty old but it
> describes a potentially similar use case. Unfortunately, it does not
> support SQL.
>
> Cheers,
> Kostas
>
> [1] https://flink.apache.org/news/2020/07/14/application-mode.html
> [2] https://www.ververica.com/blog/rbea-scalable-real-time-analytics-at-king
>
> On Sun, Nov 15, 2020 at 10:11 AM lalala <lal...@activist.com> wrote:
> >
> > Hi all,
> >
> > I would like to consult with you regarding deployment strategies.
> >
> > We have 250+ Kafka topics, and we want users of the platform to submit SQL
> > queries against them that will run indefinitely. We have a query parser to
> > extract topic names from user queries, and the application locally creates
> > Kafka tables and executes the query. The results can be written to multiple
> > sinks such as databases, files, and cloud services.
> >
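A minimal sketch of the parsing step described above, assuming table names in
the query map one-to-one to Kafka topic names. The class and method names
(TopicTables, extractTopics, kafkaTableDdl), the broker address, the group id,
and the column list are all hypothetical. The regex is deliberately naive and
ignores subqueries, CTEs, and quoted identifiers; a real parser would be more
robust. The WITH options follow the Flink Kafka SQL connector.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopicTables {

    // Naive pattern: a plain identifier directly after FROM or JOIN.
    // Subqueries, CTEs, and quoted identifiers are not handled.
    private static final Pattern TABLE_REF = Pattern.compile(
            "\\b(?:FROM|JOIN)\\s+([A-Za-z_][A-Za-z0-9_]*)",
            Pattern.CASE_INSENSITIVE);

    // Returns the distinct table (topic) names in order of first appearance.
    public static Set<String> extractTopics(String sql) {
        Set<String> topics = new LinkedHashSet<>();
        Matcher m = TABLE_REF.matcher(sql);
        while (m.find()) {
            topics.add(m.group(1));
        }
        return topics;
    }

    // Builds a CREATE TABLE statement for one topic. The schema is passed in
    // as a string because it has to come from somewhere else (for example a
    // schema registry); that lookup is outside this sketch.
    public static String kafkaTableDdl(String topic, String columns,
                                       String bootstrapServers, String groupId) {
        return "CREATE TABLE " + topic + " (" + columns + ") WITH (\n"
                + "  'connector' = 'kafka',\n"
                + "  'topic' = '" + topic + "',\n"
                + "  'properties.bootstrap.servers' = '" + bootstrapServers + "',\n"
                + "  'properties.group.id' = '" + groupId + "',\n"
                + "  'scan.startup.mode' = 'latest-offset',\n"
                + "  'format' = 'json'\n"
                + ")";
    }

    public static void main(String[] args) {
        String query = "SELECT o.id, p.status FROM orders o "
                + "JOIN payments p ON o.id = p.order_id";
        for (String topic : extractTopics(query)) {
            System.out.println(
                    kafkaTableDdl(topic, "id BIGINT, status STRING",
                            "broker:9092", "adhoc-queries"));
        }
    }
}
```

Each generated statement would then be passed to the table environment before
the user query itself is submitted. Topic names containing dots or dashes
would need quoting, which this sketch does not attempt.
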
> > We want the best possible isolation between queries so that, in case of
> > failures, the other jobs are not affected. We have a huge YARN cluster to
> > handle 1 PB a day from Kafka. I believe a per-job cluster deployment makes
> > sense for the sake of isolation. However, that creates some scalability
> > problems. There might be several SQL queries running on the same Kafka
> > topic, and we do not want to read the topic again for each query in a
> > different session. The ideal case is that we read the topic once and
> > execute multiple queries on this data. That breaks the goal of a fully
> > isolated system, but it improves network and Kafka performance and still
> > provides isolation at the topic level.
> >
> > We are quite new to Flink, but we have experience with Spark. In Spark, we
> > can submit an application that, on the master, listens to a query queue and
> > submits jobs to the cluster dynamically from different threads. However, in
> > Flink, it looks like main() has to produce the job graph in advance.
> >
> > We use an EMR cluster; what would you recommend for our use case?
> >
> > Thank you.
> >
