Thanks Shammon FY for starting this discussion. I'm not sure whether we have to expand the focus of Apache Flink in a way you're describing it with OLAP being a topic next to Batch and Stream Processing. You've rightfully pointed out that there are already solutions to cover OLAP. In this sense, I like the Unix philosophy: One tool for one job [1]. My concern is that extending the scope of Flink would guide us into a direction where we lose focus: I see the risk in Flink's codebase becoming even more complex with the proposed change of the roadmap. That was one reason why a project like Flink Table Store ended up becoming an independent project Apache Paimon.
That said, I still see value in some of the ideas you've collected in FLINK-25318 [2]. Can't we just see OLAP queries as a specific form of Batch Processing? The idea of the session cluster was to have the ability to run multiple short-lived jobs, anyway. And there are definitely ways to improve the execution of those queries as you pointed out. Therefore, I'm not rejecting the ideas you're suggesting in FLINK-25318 [2]. I'm just skeptical about expanding the scope of Apache Flink by mentioning OLAP explicitly in Flink's roadmap. Independent of the roadmap discussion: If the community decides to put more focus on short-lived queries as you suggested, I would like to have that reflected in the performance tests (like it was suggested by Piotr in his FLINK-25318 comment [3] and that's already covered by FLINK-25356 [4]). We should provide a performance test set before we proceed with implementing improvements to cover those scenarios. This would enable us to measure the changes that were collected in FLINK-25318 [2]. I'm looking forward to other views on that topic. Best, Matthias [1] http://www.catb.org/~esr/writings/taoup/html/ch01s06.html [2] https://issues.apache.org/jira/browse/FLINK-25318 [3] https://issues.apache.org/jira/browse/FLINK-25318?focusedCommentId=17460615&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17460615 [4] https://issues.apache.org/jira/browse/FLINK-25356 On Mon, Aug 7, 2023 at 11:53 AM Shammon FY <zjur...@gmail.com> wrote: > Hi everyone, > > According to the discussion between @Matthias and me in FLINK-32667 [1], I > would like to initiate a discussion about Flink OLAP. Let me introduce > ourselves first, we are the Streaming Compute team in ByteDance, and about > half a year ago @Songxintong and we had an offline discussion on Flink OLAP > and created FLINK-25318 [2] to track the improvements to Flink OLAP. We > also have some simple online discussion with @Piotr on the issue. Here I > would like to share my thoughts on Flink OLAP and our practice in the past, > and initiate a broader discussion to get more input and thoughts about > Flink OLAP in this thread. I hope if possible that OLAP could be the main > capability besides Streaming and Batch for Flink and the community could > consider adding it to Flink 2.0 Roadmap, which can attract more developers > and users to promote and improve Flink OLAP in the future. > > 1. Why we want to support OLAP in Flink > > Currently, there are many excellent open-source OLAP engines, so why do we > want to support OLAP in Flink? There are three main types in the current > data processing progress: streaming, batch and OLAP. As an excellent > streaming and batch processing engine, flink processes data and writes > results to storage such as lakehouse, key-value stores, databases and ect. > After that, users need an OLAP engine to analyze their data. > > 1.1) As we all know, OLAP is a very important scenario and users need an > OLAP engine indeed. > > 1.2) Flink OLAP can reduce team costs. We have encountered many small or > medium-sized tech teams who are using Flink for streaming and batch > processing. However, they have to introduce an additional OLAP engine for > their users to analyze data, which increases the technical costs and > operation costs for multiple engines and clusters. > > 1.3) Flink OLAP can reduce user usage costs. Many of our users are > familiar with Flink and use Flink streaming and batch SQL to process data. > After that, they are more focused on business. They are not very interested > in introducing an additional OLAP engine which requires a new familiarity > with the new engine and increases usage costs. > > 2. Why Flink can be an OLAP engine. > > So, technically speaking, can Flink support OLAP well? My answer is: OF > COURSE! > > 2.1) From Architecture, Flink supports jdbc driver, Sql-Gateway and > session mode. Flink session cluster is a typical MPP architecture and each > query does not need to require new resources. Users can easily submit > SELECT statements by the jdbc driver and fetch results on the second or > even sub-second level. > > 2.2) Powerful batch processing power. Flink OLAP can take many batch > operators and optimizations. At the same time, there are also large queries > in OLAP and Flink can support them based on the ability of Flink batch > without the need to introduce an external batch engine like other OLAP > engines. > > 2.3) Flink supports standard SQL syntax such as QUERY/INSERT/UPDATE which > meets the interaction requirements of OLAP users. > > 2.4) Powerful connector ecosystem. Flink has defined comprehensive > interfaces for input and output and implemented many embedding connectors > such as database, lakehouse. Users can easily implement customized > connectors based on these interfaces too. > > 3. The main issues for Flink OLAP. > > We found there are two main types of issues that need to be addressed for > Flink OLAP in our practice: Latency and QPS. Unlike streaming and batch, > OLAP has the following two characteristics: > > a) OLAP jobs are usually short-lived and return results in seconds or > sub-seconds which will have higher requirements for data processing latency. > > b) Users have QPS requirements for OLAP such as users often want to > perform hundreds of simple queries or tens of complex queries in a Flink > OLAP cluster concurrently in our practice. > > We need to make some improvements in Flink to support OLAP better. Based > on our previously created FLINK-25318 [2] and our practical experience, I > classify the improvements as follows: > > a) Improve the interaction between Flink and external components such as > zk and filesystem. For example, Flink stores relevant information in zk to > support failover and this is not needed by OLAP but rather causes delays > due to the network jitter. > > b) Improvements in Flink internal progress, such as job submission > progress, data fetch optimization, excessive number of interactive events > among components(JM/TM/RM) etc. These will cause Flink to increase job > latency from a few milliseconds to tens of seconds when running several > small queries concurrently. > > c) Support OLAP specific features. Due to the OLAP characteristics, Flink > may need to add some features only for OLAP. For example, we need to > isolate resources between jobs, and ensure that large queries do not affect > small queries. > > 4. Out practice about Flink OLAP. > > In addition to using Flink to support massive streaming businesses in > ByteDance, we have also conducted many big improvements on Flink to support > OLAP well. This can not only be seen as a POC, but we also provide Flink > OLAP clusters in production for our users. > > After our improvements, we built a 500 core Flink OLAP cluster to test > Flink OLAP e2e latency and QPS. There will be nearly a thousand QPS for > simple queries of 128 subtasks with tens of milliseconds latency; for > complex jobs with two JOIN operators and 700 subtasks, there will be about > sixty QPS with hundreds of milliseconds latency. > > Our Flink OLAP in production supports more than ten internal businesses > and there are more than 6000 cores in all clusters and 500 thousand queries > will be performed each day. > > 5. My Proposal > > Above all, I personally think it will provide great value for users and is > technically feasible if Flink supports OLAP and this is also what Flink > engine can achieve through some improvements although this is not a simple > matter. I hope the community could consider supporting OLAP as one of Flink > technical directions and add it to the roadmap. As I mentioned at the > beginning, this can attract more developers and users to participate and > promote the improvement of Flink OLAP and build Flink as a unified engine > for streaming, batch and OLAP. > > Looking forward to your feedback, thanks! > > > [1] https://issues.apache.org/jira/browse/FLINK-32667 > [2] https://issues.apache.org/jira/browse/FLINK-25318 > > Best, > Shammon FY > >