Re: [DISCUSS] How about adding OLAP to Flink Roadmap?

2023-08-14 Thread Shammon FY
Hi,

Thanks for all the feedback. I'm glad to see that we have reached some
consensus. Let me try to summarize it below; please correct me if I'm
wrong or have misunderstood anything.

1) Batch is a special case of streaming, and OLAP is a special case of
batch, so Flink will not lose focus by supporting short-lived OLAP jobs.

2) As a streaming and batch processing engine, Flink would bring great
value to users by supporting OLAP jobs, which makes it worth our effort to
promote and achieve.

3) In line with Flink's evolution towards a unified streaming-batch-OLAP
processing engine, we could add a subsection for Flink OLAP to the roadmap
and continue to evolve it.

4) Supporting short-lived queries in Flink requires very careful design in
terms of architecture and implementation. These designs must not affect
Flink's streaming and batch capabilities while supporting OLAP.

5) To better guide and measure the optimization of Flink OLAP, we need to
add relevant OLAP benchmarks to a Flink project repository such as
flink-benchmarks.


As I replied in the roadmap thread [1], @Jark Wu and @Xintong Song, could
you please help add a Flink OLAP subsection to the doc [2]?
Thanks very much!


[1] https://lists.apache.org/thread/szdr4ngrfcmo7zko4917393zbqhgw0v5
[2] https://docs.google.com/document/d/12BDiVKEsY-f7HI3suO_IxwzCmR04QcVqLarXgyJAb7c/edit

Best,
Shammon FY


Re: [DISCUSS] How about adding OLAP to Flink Roadmap?

2023-08-14 Thread Yun Tang
Thanks to the folks from ByteDance for driving this topic, which could be
another big step in extending Flink's capabilities.

In general, I think this is a great idea. However, before we move forward, I
think we should first answer the question: what is the target for Flink OLAP?

We run Presto/Trino and SparkSQL in our production environment for OLAP SQL
analysis. Since Presto runs faster than SparkSQL in many cases, especially for
ad-hoc queries on medium-sized data, we run queries on Presto first and
switch to SparkSQL for large-scale queries if necessary.
Presto runs as a service and emphasizes query performance without node fault
tolerance. Moreover, it uses a pipeline-like data exchange mode instead of
the classic stage-by-stage blocking exchange mode, which is a bit like Flink's
pipelined mode vs. blocking mode.

Can we say that we hope Flink OLAP will target the Presto/Trino use case of
medium-sized data queries, with users switching to Flink batch SQL for
large-scale analysis queries?
If so, I also think the name "Flink OLAP" looks a bit strange, as Flink batch
SQL would also serve large-scale OLAP analysis.

Best
Yun Tang


Re: [DISCUSS] How about adding OLAP to Flink Roadmap?

2023-08-09 Thread Jing Ge
Hi Shammon, Hi Xiangyu,

Thanks for bringing this to our attention. I can see this is a great
proposal born from real business scenarios. +1 for it.

People have been keen to use one platform to cover all their data
production and consumption requirements. Flink has done a great job on the
production side, i.e. streaming and batch processing with an excellent
ecosystem. This is a big advantage for Flink in going one step further to
cover the consumption part. It will turn Flink into a unified compute
platform, like what the Ray project (the platform behind ChatGPT, in case
someone is not aware of it) [1] is doing, and secure Flink's place as one of
the most interesting open source platforms for the next decade.

Frankly speaking, it will be a big change. As far as I can see, the
following should be considered (these are just first thoughts; there must
be more).

Architecture upgrade - since we will have three capabilities (I wanted to
use "engines", but it might be too early to use the big word), i.e.
streaming, batch and OLAP, it might make sense to upgrade the architecture
while we are building OLAP into Flink. A unified foundation or
abstraction for distributed computation should be designed and implemented
underneath those capabilities. In the future, new capabilities can leverage
this foundation and could be developed at a very fast pace.

MPP architecture - a Flink session cluster is not an MPP architecture.
Generally speaking, SNA (shared-nothing architecture) is the key to
implementing MPP. Flink has everything needed to offer SNA. That is why we
can consider building OLAP into or on top of Flink. Speaking of MPP, there
will be a lot of things to do, e.g. the Retrieval Architecture [2],
multi-level task splitting, dynamic retry or even splitting, etc. I will
not expand on all those topics at this early stage.

OLAP query syntax - at least some common syntax and statements need to be
implemented, e.g. CUBE, GROUPING SETS, OVER (PARTITION BY ...), you name it.
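
To make the CUBE semantics mentioned above concrete, here is a minimal
sketch in plain Python. It is not Flink code and says nothing about how
Flink does or would implement it; the `cube` helper and the sample rows are
hypothetical. It only shows that CUBE over n columns aggregates over all
2**n grouping sets, with rolled-up columns shown as None (like SQL NULL).

```python
from collections import defaultdict
from itertools import combinations


def cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims`, as GROUP BY CUBE would."""
    results = {}
    # CUBE over n columns expands to 2**n grouping sets,
    # from all columns down to the empty set (the grand total).
    for size in range(len(dims), -1, -1):
        for grouping_set in combinations(dims, size):
            totals = defaultdict(int)
            for row in rows:
                # Columns outside the grouping set are rolled up to None.
                key = tuple(row[d] if d in grouping_set else None for d in dims)
                totals[key] += row[measure]
            results.update(totals)
    return results


# Hypothetical sample data (assumed to contain no real None values).
sales = [
    {"region": "EU", "product": "A", "amount": 10},
    {"region": "EU", "product": "B", "amount": 5},
    {"region": "US", "product": "A", "amount": 7},
]

for key, total in cube(sales, ("region", "product"), "amount").items():
    print(key, total)
```

GROUPING SETS is the same idea with an explicit list of grouping sets
instead of all subsets.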

Last but not least, it will take a big effort to upgrade the runtime
features to support OLAP with respect to performance and latency.

Best regards,
Jing


[1] https://www.ray.io/
[2] https://www.tutorialsbook.com/teradata/teradata-architecture


Re: [DISCUSS] How about adding OLAP to Flink Roadmap?

2023-08-09 Thread Dan Zou
Thanks for bringing up this discussion, Shammon. I would like to share some of 
my observations and experiences.

Flink has almost become the de facto standard for stream computing, and
Flink batch has been successfully applied in some companies. If Flink can
support OLAP scenarios well, a unified engine supporting streaming, batch, and
OLAP will become a reality, which is very exciting.

Based on the status quo, Flink can be used as a primary OLAP engine, although 
there is still a lot of room for optimization. This means that we do not need 
to carry out large-scale renovation at the beginning, but only gradually and 
continuously enhance it without affecting streaming.

Flink OLAP can largely reuse the capabilities of Flink Batch SQL, and
optimizations in OLAP can also benefit Flink Batch. If, on this basis, we
reduce job startup overhead and increase cross-job resource reuse (plan reuse,
generated class reuse, connection reuse, etc.), Flink will become a good OLAP
engine.
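
As an illustration of what cross-job plan reuse could look like in
principle, below is a minimal sketch in plain Python. It is not Flink's
implementation; `PlanCache`, `fake_planner` and the whitespace/case
normalization are assumptions made for the example. The idea is simply that
repeated short-lived queries with the same text skip the expensive planning
step.

```python
from collections import OrderedDict


class PlanCache:
    """LRU cache mapping normalized query text to an already-built plan."""

    def __init__(self, capacity, planner):
        self.capacity = capacity
        self.planner = planner  # expensive function: query text -> plan
        self.plans = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def normalize(sql):
        # Collapse whitespace and case so trivially different texts share a
        # plan. Naive on purpose: this would also lowercase string literals.
        return " ".join(sql.split()).lower()

    def get_plan(self, sql):
        key = self.normalize(sql)
        if key in self.plans:
            self.hits += 1
            self.plans.move_to_end(key)  # mark as most recently used
            return self.plans[key]
        self.misses += 1
        plan = self.planner(key)
        self.plans[key] = plan
        if len(self.plans) > self.capacity:
            self.plans.popitem(last=False)  # evict the least recently used
        return plan


# Hypothetical stand-in for a real (expensive) query planner.
def fake_planner(sql):
    return ("PLAN", sql)


cache = PlanCache(capacity=2, planner=fake_planner)
cache.get_plan("SELECT  a FROM t")
cache.get_plan("select a from t")  # same plan after normalization
print(cache.hits, cache.misses)
```

The same pattern would apply to generated classes or connections: key the
reusable artifact by something stable across jobs and bound the cache size.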

So I am a big +1 for adding OLAP to the Flink Roadmap, and I am willing to
contribute to it.



Re: [DISCUSS] How about adding OLAP to Flink Roadmap?

2023-08-09 Thread xiangyu feng
Thank you Shammon for initiating this discussion. As one of the Flink OLAP
developers in ByteDance, I would also like to share a real case of our
users.

About two years ago we found our first internal OLAP user by integrating
Flink OLAP with ByteHTAP. Users are willing to use Flink as an OLAP engine
mainly because they hope to use Flink's rich cross-datasource join
capability. In the beginning, we only supported simple query patterns with
QPS below 2 and fewer than 5 joins. As the business grew and our system
capabilities evolved, users moved more scenarios to Flink OLAP, and the
query patterns became more and more complicated. By early this year, the
user's query pattern had changed to a peak QPS above 20, more than 30
joined tables and a scan volume exceeding 1 billion rows. Even with the
evolution of our engine over the past two years, computing at this scale is
still very challenging. It is difficult to satisfy the requirements for
computation scale, system stability and query latency at the same time.

By talking to our user, we were able to easily build some intermediate
views using Flink's streaming and batch engines. In a way similar to
materialized views, we reduced the user's query pattern to fewer than 10
joins per query and a scan volume in the tens of millions of rows, while QPS
remained unchanged. In this way, our OLAP service not only met the business
requirements perfectly, but the migration process was also very smooth,
thanks to Flink's powerful streaming and batch computing ecosystem. In the
end, we were highly recognized by our users.

*There are two points I want to make with this case:*

1, There are many OLAP engines out there, and Flink may not always
provide the best performance. But thanks to Flink's strong ecosystem, we
are confident that we can build an OLAP engine that provides a great user
experience. This is very important for many small and medium-sized
companies;

2, From another perspective, I personally believe that building OLAP will
'bring Flink closer to our end-users' and present a wider variety of
computational challenges to Flink. The case mentioned above is a very
common one in data analytics, where Flink is used to precompute the data
and feed it to OLAP services. These precalculations are often designed to
compensate for the limitations of other OLAP engines: some engines may not
have strong join capabilities, some may not have complete SQL support, and
some may have weak plan optimization support. In general, Flink does not
face these users directly, and thus cannot build a comprehensive end-to-end
solution to solve this problem once and for all. *By building Flink OLAP,
we can finally fill in the last missing piece of the puzzle!*

Of course, as we built Flink OLAP internally, we encountered many
challenging issues, which is why we are putting this discussion out there
and hoping to involve more contributors. Meanwhile, we also hope to
contribute our optimizations back to the community through FLINK-25318.

So for me, it is a big +1 to add OLAP to Flink Roadmap. ^-^

Best,
Xiangyu


Re: [DISCUSS] How about adding OLAP to Flink Roadmap?

2023-08-08 Thread Xintong Song
Thanks for bringing this up, Shammon.

In general, I'd be +1 to improve Flink's ability to serve as an OLAP engine.

I see great value in Flink becoming a unified large-scale data processing
/ analysis tool. In my interactions with users (Alibaba internal
users, external users on Alibaba Cloud, developers from other companies via
conferences / meetups), a common complaint is how complicated and costly
it is to build a data processing platform out of a bunch of different
tools. That usually means higher learning / development / operation and
maintenance costs. In addition, I also see a trend of many projects and
products going in the same direction.

I personally would not be concerned about losing focus. Unlike Apache
Paimon (Flink Table Store), which tries to solve a completely different
problem from data processing, OLAP querying is just a special case of
batch SQL data processing, where you typically have massive numbers of
concurrent short-lived queries. As Shammon mentioned, Flink already has most
of the essential building blocks: batch SQL processing, session mode,
sql-gateway, etc. IMHO, the missing piece is mostly about improving
performance in the specific OLAP scenarios. That sounds like a reasonable
extension to me.

I'd consider improving the OLAP capability a nice-to-have improvement.
That is to say, it must not come at the price of sacrificing the
experience in other streaming / batch scenarios, nor significantly
complicate the system. I think one of the reasons that FLINK-25318 became
stale was that some of the proposed solutions are too specific to the
OLAP scenarios and require extra effort to carefully re-design so that they
do not affect other scenarios. I'd be glad to see such efforts being revived.

Regarding whether to include this in the roadmap, I'd suggest including it
but maybe not as a top-level section. I think this is a good initiative.
But for it to be considered one of the major directions, I think we need
clearer, more concrete and commonly agreed plans for it. Alternatively,
we may consider putting it under the "Towards Streaming Warehouses" or
"Performance" sections, and using wording like "the community is trying to
explore the possibility of ..." to indicate its exploratory stage.


Best,

Xintong




Re: [DISCUSS] How about adding OLAP to Flink Roadmap?

2023-08-07 Thread Yangze Guo
Thanks for driving this, Shammon! And thanks for the valuable comments
from Matthias.

As one of the contributors to FLINK-25318, I would also like to share
my thoughts.

1. Regarding using Flink as an OLAP engine:

I understand the value of having a unified engine for streaming,
batch, and OLAP, as described. From the user's perspective, using the
same engine throughout the different stages of their daily workflow can
definitely save a lot of learning costs. It also gives us the
opportunity to leverage fragmented resources in a batch standalone
cluster, thereby improving resource utilization.
@Matthias I agree with the Unix philosophy you mentioned, but I would
consider streaming/batch ETL and OLAP queries as one complete story.
Just as Flink's unified stream-batch engine frees users from
maintaining two sets of business code, expanding Flink's use as an
OLAP query engine can further reduce costs for users aiming to build
an end-to-end streaming warehouse. AFAIK, users within ByteDance
are already benefiting from this.


2. Regarding adding OLAP support to Flink 2.0 roadmap:

Given the increasing user expectations for Flink's performance and
functionality in OLAP scenarios, it seems reasonable to add OLAP
support to the roadmap. However, I agree that introducing too many
objectives can lead to scattered attention. Therefore, I'm partially
in favor (+0.5) of it.

BTW, regarding concerns about the growing complexity of Flink's
codebase, we can mitigate this issue through careful design choices
and thorough code reviews. Another principle we need to follow is that
none of these improvements should negatively impact stream processing.
AFAICS, these enhancements will also benefit streaming users.

Seconding @Matthias's opinion, at least as a first step, we can consider
OLAP queries as short-lived batch processing and focus on driving
optimizations specifically for these tasks.

3. Regarding the performance tests:

I'm a big +1 for adding specific performance tests for OLAP or
short-lived queries. In addition to adding some micro benchmarks in [1],
we can also develop an end-to-end benchmark similar to Nexmark [2]. I'm
enthusiastic about participating in this project.
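
To make the end-to-end idea a bit more concrete, here is a minimal sketch
of what such a benchmark could measure for short-lived queries: per-query
latency percentiles and achieved QPS. It is only an illustration, not a
proposal for the layout of flink-benchmarks. The names run_query, measure
and percentile are hypothetical, and run_query stands in for whatever
submits a query to the cluster and waits for the result.

```python
import time


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an already sorted, non-empty list.

    pct is an integer between 1 and 100.
    """
    if not sorted_samples:
        raise ValueError("no samples")
    # Integer ceiling of len * pct / 100, avoiding float rounding.
    rank = max(1, (len(sorted_samples) * pct + 99) // 100)
    return sorted_samples[rank - 1]


def measure(run_query, runs):
    """Run the query `runs` times sequentially and summarize the timings."""
    latencies = []
    start = time.perf_counter()
    for _ in range(runs):
        t0 = time.perf_counter()
        run_query()  # hypothetical: submit one query and wait for the result
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    return {
        "p50_s": percentile(latencies, 50),
        "p99_s": percentile(latencies, 99),
        "qps": runs / elapsed if elapsed > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Placeholder workload; a real benchmark would call the cluster here.
    print(measure(lambda: sum(range(1000)), runs=100))
```

A real benchmark would also need concurrent clients to exercise QPS; the
sequential loop above only covers the latency side.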

[1] https://github.com/apache/flink-benchmarks
[2] https://github.com/nexmark/nexmark


Best,
Yangze Guo


Re: [DISCUSS] How about adding OLAP to Flink Roadmap?

2023-08-07 Thread Matthias Pohl
Thanks Shammon FY for starting this discussion.

I'm not sure whether we have to expand the focus of Apache Flink in the way
you're describing, with OLAP being a topic next to Batch and Stream
Processing. You've rightfully pointed out that there are already solutions
to cover OLAP. In this sense, I like the Unix philosophy: One tool for one
job [1]. My concern is that extending the scope of Flink would guide us
into a direction where we lose focus: I see the risk in Flink's codebase
becoming even more complex with the proposed change of the roadmap. That
was one reason why a project like Flink Table Store ended up becoming an
independent project, Apache Paimon.

That said, I still see value in some of the ideas you've collected in
FLINK-25318 [2]. Can't we just see OLAP queries as a specific form of Batch
Processing? The idea of the session cluster was to have the ability to run
multiple short-lived jobs, anyway. And there are definitely ways to improve
the execution of those queries as you pointed out.

Therefore, I'm not rejecting the ideas you're suggesting in FLINK-25318
[2]. I'm just skeptical about expanding the scope of Apache Flink by
mentioning OLAP explicitly in Flink's roadmap.

Independent of the roadmap discussion: If the community decides to put more
focus on short-lived queries as you suggested, I would like to have that
reflected in the performance tests (like it was suggested by Piotr in his
FLINK-25318 comment [3] and that's already covered by FLINK-25356 [4]). We
should provide a performance test set before we proceed with implementing
improvements to cover those scenarios. This would enable us to measure the
changes that were collected in FLINK-25318 [2].

I'm looking forward to other views on that topic.

Best,
Matthias

[1] http://www.catb.org/~esr/writings/taoup/html/ch01s06.html
[2] https://issues.apache.org/jira/browse/FLINK-25318
[3]
https://issues.apache.org/jira/browse/FLINK-25318?focusedCommentId=17460615&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17460615
[4] https://issues.apache.org/jira/browse/FLINK-25356


[DISCUSS] How about adding OLAP to Flink Roadmap?

2023-08-07 Thread Shammon FY
Hi everyone,

Following the discussion between @Matthias and me in FLINK-32667 [1], I
would like to initiate a discussion about Flink OLAP. Let me introduce
ourselves first: we are the Streaming Compute team at ByteDance. About
half a year ago we had an offline discussion with @Songxintong on Flink
OLAP and created FLINK-25318 [2] to track the improvements to Flink OLAP.
We also had a brief online discussion with @Piotr on the issue. Here I
would like to share my thoughts on Flink OLAP and our past practice, and
start a broader discussion to gather more input about Flink OLAP in this
thread. I hope that, if possible, OLAP could become a main capability of
Flink alongside Streaming and Batch, and that the community could consider
adding it to the Flink 2.0 Roadmap, which could attract more developers
and users to promote and improve Flink OLAP in the future.

1. Why we want to support OLAP in Flink

Currently, there are many excellent open-source OLAP engines, so why do we
want to support OLAP in Flink? There are three main types of workload in
data processing today: streaming, batch and OLAP. As an excellent
streaming and batch processing engine, Flink processes data and writes
results to storage such as lakehouses, key-value stores and databases.
After that, users need an OLAP engine to analyze their data.

1.1) OLAP is a very important scenario, and users do need an OLAP
engine.

1.2) Flink OLAP can reduce team costs. We have encountered many small or
medium-sized tech teams who are using Flink for streaming and batch
processing. However, they have to introduce an additional OLAP engine for
their users to analyze data, which increases the technical and
operational costs of running multiple engines and clusters.

1.3) Flink OLAP can reduce usage costs for users. Many of our users are
familiar with Flink and use Flink streaming and batch SQL to process data.
Beyond that, they are more focused on their business. They are not very
interested in introducing an additional OLAP engine, which requires
learning a new engine and increases usage costs.

2. Why Flink can be an OLAP engine.

So, technically speaking, can Flink support OLAP well? My answer is: OF
COURSE!

2.1) Architecturally, Flink provides a JDBC driver, the SQL Gateway and
session mode. A Flink session cluster is a typical MPP architecture, and a
query does not need to request new resources. Users can easily submit
SELECT statements through the JDBC driver and fetch results within seconds
or even in less than a second.

2.2) Powerful batch processing. Flink OLAP can reuse many batch operators
and optimizations. At the same time, OLAP workloads also include large
queries, and Flink can support them with its batch capabilities, without
the need to introduce an external batch engine as other OLAP engines do.

2.3) Flink supports standard SQL syntax such as QUERY/INSERT/UPDATE which
meets the interaction requirements of OLAP users.

2.4) Powerful connector ecosystem. Flink has defined comprehensive
interfaces for input and output and implemented many built-in connectors,
such as database and lakehouse connectors. Users can also easily
implement customized connectors based on these interfaces.

3. The main issues for Flink OLAP.

We found there are two main types of issues that need to be addressed for
Flink OLAP in our practice: Latency and QPS. Unlike streaming and batch,
OLAP has the following two characteristics:

a) OLAP jobs are usually short-lived and return results within seconds or
less, which places higher requirements on data processing latency.

b) Users have QPS requirements for OLAP. In our practice, users often want
to run hundreds of simple queries or tens of complex queries concurrently
in a Flink OLAP cluster.
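
As a rough illustration of how these two characteristics interact (a
back-of-the-envelope sketch, not taken from our measurements): by Little's
law, the average number of queries in flight equals the arrival rate (QPS)
multiplied by the average latency. The function names and all numbers
below are made up for illustration.

```python
def queries_in_flight(qps: float, avg_latency_s: float) -> float:
    """Little's law: average concurrency = arrival rate * average latency."""
    if qps < 0 or avg_latency_s < 0:
        raise ValueError("qps and latency must be non-negative")
    return qps * avg_latency_s


def sustainable_qps(max_concurrent: int, avg_latency_s: float) -> float:
    """Highest steady-state QPS possible under a fixed concurrency cap."""
    if avg_latency_s <= 0:
        raise ValueError("latency must be positive")
    return max_concurrent / avg_latency_s


if __name__ == "__main__":
    # Hypothetical: 200 simple queries per second, 0.5 s each.
    print(queries_in_flight(200, 0.5))
    # Hypothetical cap of 100 concurrent queries, fast vs. slow queries.
    print(sustainable_qps(100, 0.5))
    print(sustainable_qps(100, 10.0))
```

With a fixed concurrency cap, any latency added per query directly lowers
the QPS the cluster can sustain, which is why the two issues are addressed
together.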

We need to make some improvements in Flink to support OLAP better. Based on
our previously created FLINK-25318 [2] and our practical experience, I
classify the improvements as follows:

a) Improve the interaction between Flink and external components such as
ZooKeeper (zk) and the filesystem. For example, Flink stores relevant
information in zk to support failover; OLAP does not need this, and it
adds delays due to network jitter.

b) Improve Flink's internal processes, such as the job submission
process, data fetching, and the excessive number of interaction events
among components (JM/TM/RM). When several small queries run concurrently,
these issues can increase job latency from a few milliseconds to tens of
seconds.

c) Support OLAP-specific features. Due to the OLAP characteristics, Flink
may need to add some features only for OLAP. For example, we need to
isolate resources between jobs, and ensure that large queries do not affect
small queries.
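
One possible shape for the isolation mentioned in c), sketched under
simplifying assumptions: small and large queries draw slots from separate
quotas, so a burst of large queries cannot consume the slots reserved for
small ones. This is a toy model, not how Flink schedules slots today; the
class SlotQuotas, its fields and the threshold value are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SlotQuotas:
    """Toy admission control with separate slot pools per query class."""
    small_slots: int
    large_slots: int
    large_threshold: int = 8  # hypothetical: >= 8 slots counts as "large"
    _used: dict = field(default_factory=lambda: {"small": 0, "large": 0})

    def classify(self, required_slots: int) -> str:
        return "large" if required_slots >= self.large_threshold else "small"

    def try_admit(self, required_slots: int) -> bool:
        """Reserve slots from the query's own pool, or refuse if it is full."""
        kind = self.classify(required_slots)
        capacity = self.large_slots if kind == "large" else self.small_slots
        if self._used[kind] + required_slots > capacity:
            return False  # a caller would queue or reject the query here
        self._used[kind] += required_slots
        return True

    def release(self, required_slots: int) -> None:
        """Return slots to the pool they were taken from."""
        kind = self.classify(required_slots)
        self._used[kind] = max(0, self._used[kind] - required_slots)


if __name__ == "__main__":
    quotas = SlotQuotas(small_slots=16, large_slots=32)
    print(quotas.try_admit(32))  # large query fills the large pool
    print(quotas.try_admit(8))   # second large query is refused
    print(quotas.try_admit(2))   # small query is unaffected
```

Static quotas are the simplest option and can waste capacity when one
class is idle; they are used here only to show the isolation property.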

4. Our practice with Flink OLAP.

In addition to using Flink to support massive streaming businesses in
ByteDance, we have also made many significant improvements to Flink to
support OLAP well. This can not only be seen as a POC, but