Hi everyone,

According to the discussion between @Matthias and me in FLINK-32667 [1], I
would like to initiate a discussion about Flink OLAP. Let me introduce
ourselves first, we are the Streaming Compute team in ByteDance, and about
half a year ago @Songxintong and we had an offline discussion on Flink OLAP
and created FLINK-25318 [2] to track the improvements to Flink OLAP. We
also have some simple online discussion with @Piotr on the issue. Here I
would like to share my thoughts on Flink OLAP and our practice in the past,
and initiate a broader discussion to get more input and thoughts about
Flink OLAP in this thread. I hope if possible that OLAP could be the main
capability besides Streaming and Batch for Flink and the community could
consider adding it to Flink 2.0 Roadmap, which can attract more developers
and users to promote and improve Flink OLAP in the future.

1. Why we want to support OLAP in Flink

Currently, there are many excellent open-source OLAP engines, so why do we
want to support OLAP in Flink? There are three main types in the current
data processing progress: streaming, batch and OLAP. As an excellent
streaming and batch processing engine, flink processes data and writes
results to storage such as lakehouse, key-value stores, databases and ect.
After that, users need an OLAP engine to analyze their data.

1.1) As we all know, OLAP is a very important scenario and users need an
OLAP engine indeed.

1.2) Flink OLAP can reduce team costs. We have encountered many small or
medium-sized tech teams who are using Flink for streaming and batch
processing. However, they have to introduce an additional OLAP engine for
their users to analyze data, which increases the technical costs and
operation costs for multiple engines and clusters.

1.3) Flink OLAP can reduce user usage costs. Many of our users are familiar
with Flink and use Flink streaming and batch SQL to process data. After
that, they are more focused on business. They are not very interested in
introducing an additional OLAP engine which requires a new familiarity with
the new engine and increases usage costs.

2. Why Flink can be an OLAP engine.

So, technically speaking, can Flink support OLAP well? My answer is: OF
COURSE!

2.1) From Architecture, Flink supports jdbc driver, Sql-Gateway and session
mode. Flink session cluster is a typical MPP architecture and each query
does not need to require new resources. Users can easily submit SELECT
statements by the jdbc driver and fetch results on the second or even
sub-second level.

2.2) Powerful batch processing power. Flink OLAP can take many batch
operators and optimizations. At the same time, there are also large queries
in OLAP and Flink can support them based on the ability of Flink batch
without the need to introduce an external batch engine like other OLAP
engines.

2.3) Flink supports standard SQL syntax such as QUERY/INSERT/UPDATE which
meets the interaction requirements of OLAP users.

2.4) Powerful connector ecosystem. Flink has defined comprehensive
interfaces for input and output and implemented many embedding connectors
such as database, lakehouse. Users can easily implement customized
connectors based on these interfaces too.

3. The main issues for Flink OLAP.

We found there are two main types of issues that need to be addressed for
Flink OLAP in our practice: Latency and QPS. Unlike streaming and batch,
OLAP has the following two characteristics:

a) OLAP jobs are usually short-lived and return results in seconds or
sub-seconds which will have higher requirements for data processing latency.

b) Users have QPS requirements for OLAP such as users often want to perform
hundreds of simple queries or tens of complex queries in a Flink OLAP
cluster concurrently in our practice.

We need to make some improvements in Flink to support OLAP better. Based on
our previously created FLINK-25318 [2] and our practical experience, I
classify the improvements as follows:

a) Improve the interaction between Flink and external components such as zk
and filesystem. For example, Flink stores relevant information in zk to
support failover and this is not needed by OLAP but rather causes delays
due to the network jitter.

b) Improvements in Flink internal progress, such as job submission
progress, data fetch optimization, excessive number of interactive events
among components(JM/TM/RM) etc. These will cause Flink to increase job
latency from a few milliseconds to tens of seconds when running several
small queries concurrently.

c) Support OLAP specific features. Due to the OLAP characteristics, Flink
may need to add some features only for OLAP. For example, we need to
isolate resources between jobs, and ensure that large queries do not affect
small queries.

4. Out practice about Flink OLAP.

In addition to using Flink to support massive streaming businesses in
ByteDance, we have also conducted many big improvements on Flink to support
OLAP well. This can not only be seen as a POC, but we also provide Flink
OLAP clusters in production for our users.

After our improvements, we built a 500 core Flink OLAP cluster to test
Flink OLAP e2e latency and QPS. There will be nearly a thousand QPS for
simple queries of 128 subtasks with tens of milliseconds latency; for
complex jobs with two JOIN operators and 700 subtasks, there will be about
sixty QPS with hundreds of milliseconds latency.

Our Flink OLAP in production supports more than ten internal businesses and
there are more than 6000 cores in all clusters and 500 thousand queries
will be performed each day.

5. My Proposal

Above all, I personally think it will provide great value for users and is
technically feasible if Flink supports OLAP and this is also what Flink
engine can achieve through some improvements although this is not a simple
matter. I hope the community could consider supporting OLAP as one of Flink
technical directions and add it to the roadmap. As I mentioned at the
beginning, this can attract more developers and users to participate and
promote the improvement of Flink OLAP and build Flink as a unified engine
for streaming, batch and OLAP.

Looking forward to your feedback, thanks!


[1]  https://issues.apache.org/jira/browse/FLINK-32667
[2]  https://issues.apache.org/jira/browse/FLINK-25318

Best,
Shammon FY

Reply via email to