Thanks Shammon FY for starting this discussion.

I'm not sure whether we have to expand the focus of Apache Flink in a way
you're describing it with OLAP being a topic next to Batch and Stream
Processing. You've rightfully pointed out that there are already solutions
to cover OLAP. In this sense, I like the Unix philosophy: One tool for one
job [1]. My concern is that extending the scope of Flink would guide us
into a direction where we lose focus: I see the risk in Flink's codebase
becoming even more complex with the proposed change of the roadmap. That
was one reason why a project like Flink Table Store ended up becoming an
independent project Apache Paimon.

That said, I still see value in some of the ideas you've collected in
FLINK-25318 [2]. Can't we just see OLAP queries as a specific form of Batch
Processing? The idea of the session cluster was to have the ability to run
multiple short-lived jobs, anyway. And there are definitely ways to improve
the execution of those queries as you pointed out.

Therefore, I'm not rejecting the ideas you're suggesting in FLINK-25318
[2]. I'm just skeptical about expanding the scope of Apache Flink by
mentioning OLAP explicitly in Flink's roadmap.

Independent of the roadmap discussion: If the community decides to put more
focus on short-lived queries as you suggested, I would like to have that
reflected in the performance tests (like it was suggested by Piotr in his
FLINK-25318 comment [3] and that's already covered by FLINK-25356 [4]). We
should provide a performance test set before we proceed with implementing
improvements to cover those scenarios. This would enable us to measure the
changes that were collected in FLINK-25318 [2].

I'm looking forward to other views on that topic.

Best,
Matthias

[1] http://www.catb.org/~esr/writings/taoup/html/ch01s06.html
[2] https://issues.apache.org/jira/browse/FLINK-25318
[3]
https://issues.apache.org/jira/browse/FLINK-25318?focusedCommentId=17460615&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17460615
[4] https://issues.apache.org/jira/browse/FLINK-25356

On Mon, Aug 7, 2023 at 11:53 AM Shammon FY <zjur...@gmail.com> wrote:

> Hi everyone,
>
> According to the discussion between @Matthias and me in FLINK-32667 [1], I
> would like to initiate a discussion about Flink OLAP. Let me introduce
> ourselves first, we are the Streaming Compute team in ByteDance, and about
> half a year ago @Songxintong and we had an offline discussion on Flink OLAP
> and created FLINK-25318 [2] to track the improvements to Flink OLAP. We
> also have some simple online discussion with @Piotr on the issue. Here I
> would like to share my thoughts on Flink OLAP and our practice in the past,
> and initiate a broader discussion to get more input and thoughts about
> Flink OLAP in this thread. I hope if possible that OLAP could be the main
> capability besides Streaming and Batch for Flink and the community could
> consider adding it to Flink 2.0 Roadmap, which can attract more developers
> and users to promote and improve Flink OLAP in the future.
>
> 1. Why we want to support OLAP in Flink
>
> Currently, there are many excellent open-source OLAP engines, so why do we
> want to support OLAP in Flink? There are three main types in the current
> data processing progress: streaming, batch and OLAP. As an excellent
> streaming and batch processing engine, flink processes data and writes
> results to storage such as lakehouse, key-value stores, databases and ect.
> After that, users need an OLAP engine to analyze their data.
>
> 1.1) As we all know, OLAP is a very important scenario and users need an
> OLAP engine indeed.
>
> 1.2) Flink OLAP can reduce team costs. We have encountered many small or
> medium-sized tech teams who are using Flink for streaming and batch
> processing. However, they have to introduce an additional OLAP engine for
> their users to analyze data, which increases the technical costs and
> operation costs for multiple engines and clusters.
>
> 1.3) Flink OLAP can reduce user usage costs. Many of our users are
> familiar with Flink and use Flink streaming and batch SQL to process data.
> After that, they are more focused on business. They are not very interested
> in introducing an additional OLAP engine which requires a new familiarity
> with the new engine and increases usage costs.
>
> 2. Why Flink can be an OLAP engine.
>
> So, technically speaking, can Flink support OLAP well? My answer is: OF
> COURSE!
>
> 2.1) From Architecture, Flink supports jdbc driver, Sql-Gateway and
> session mode. Flink session cluster is a typical MPP architecture and each
> query does not need to require new resources. Users can easily submit
> SELECT statements by the jdbc driver and fetch results on the second or
> even sub-second level.
>
> 2.2) Powerful batch processing power. Flink OLAP can take many batch
> operators and optimizations. At the same time, there are also large queries
> in OLAP and Flink can support them based on the ability of Flink batch
> without the need to introduce an external batch engine like other OLAP
> engines.
>
> 2.3) Flink supports standard SQL syntax such as QUERY/INSERT/UPDATE which
> meets the interaction requirements of OLAP users.
>
> 2.4) Powerful connector ecosystem. Flink has defined comprehensive
> interfaces for input and output and implemented many embedding connectors
> such as database, lakehouse. Users can easily implement customized
> connectors based on these interfaces too.
>
> 3. The main issues for Flink OLAP.
>
> We found there are two main types of issues that need to be addressed for
> Flink OLAP in our practice: Latency and QPS. Unlike streaming and batch,
> OLAP has the following two characteristics:
>
> a) OLAP jobs are usually short-lived and return results in seconds or
> sub-seconds which will have higher requirements for data processing latency.
>
> b) Users have QPS requirements for OLAP such as users often want to
> perform hundreds of simple queries or tens of complex queries in a Flink
> OLAP cluster concurrently in our practice.
>
> We need to make some improvements in Flink to support OLAP better. Based
> on our previously created FLINK-25318 [2] and our practical experience, I
> classify the improvements as follows:
>
> a) Improve the interaction between Flink and external components such as
> zk and filesystem. For example, Flink stores relevant information in zk to
> support failover and this is not needed by OLAP but rather causes delays
> due to the network jitter.
>
> b) Improvements in Flink internal progress, such as job submission
> progress, data fetch optimization, excessive number of interactive events
> among components(JM/TM/RM) etc. These will cause Flink to increase job
> latency from a few milliseconds to tens of seconds when running several
> small queries concurrently.
>
> c) Support OLAP specific features. Due to the OLAP characteristics, Flink
> may need to add some features only for OLAP. For example, we need to
> isolate resources between jobs, and ensure that large queries do not affect
> small queries.
>
> 4. Out practice about Flink OLAP.
>
> In addition to using Flink to support massive streaming businesses in
> ByteDance, we have also conducted many big improvements on Flink to support
> OLAP well. This can not only be seen as a POC, but we also provide Flink
> OLAP clusters in production for our users.
>
> After our improvements, we built a 500 core Flink OLAP cluster to test
> Flink OLAP e2e latency and QPS. There will be nearly a thousand QPS for
> simple queries of 128 subtasks with tens of milliseconds latency; for
> complex jobs with two JOIN operators and 700 subtasks, there will be about
> sixty QPS with hundreds of milliseconds latency.
>
> Our Flink OLAP in production supports more than ten internal businesses
> and there are more than 6000 cores in all clusters and 500 thousand queries
> will be performed each day.
>
> 5. My Proposal
>
> Above all, I personally think it will provide great value for users and is
> technically feasible if Flink supports OLAP and this is also what Flink
> engine can achieve through some improvements although this is not a simple
> matter. I hope the community could consider supporting OLAP as one of Flink
> technical directions and add it to the roadmap. As I mentioned at the
> beginning, this can attract more developers and users to participate and
> promote the improvement of Flink OLAP and build Flink as a unified engine
> for streaming, batch and OLAP.
>
> Looking forward to your feedback, thanks!
>
>
> [1]  https://issues.apache.org/jira/browse/FLINK-32667
> [2]  https://issues.apache.org/jira/browse/FLINK-25318
>
> Best,
> Shammon FY
>
>

Reply via email to