Hi Ron,

Yes, naming is hard. But it will have a large impact on trainings, presentations, and the mental model of users. Maybe the easiest approach is to collect a ranking from everyone, with a short justification:


My ranking (from good to not so good):

1. Refresh Table -> states what it does
2. Materialized Table -> similar to SQL materialized view but a table
3. Live Table -> nice buzzword, but maybe still too close to dynamic tables?
4. Materialized View -> a bit broader than the standard but still very similar
5. Derived Table -> taken by the standard

Regards,
Timo



On 07.04.24 11:34, Ron liu wrote:
Hi, Dev

This is a summary email. After several rounds of discussion, there is strong consensus about the FLIP proposal and the issues it aims to address. The current point of disagreement is the naming of the new concept. I have summarized the candidates as follows:

1. Derived Table (Inspired by Google Looker [1])
     - Pros: Google Looker has introduced this concept, which is designed for building Looker's automated modeling, aligning with our purpose for the stream-batch automatic pipeline.

     - Cons: The SQL standard uses the term "derived table" extensively; vendors adopt it to simply refer to a table within a subclause.

2. Materialized Table: It means materializing the query result into a table, similar to Db2 MQT (Materialized Query Tables) [2]. In addition, Snowflake Dynamic Table's predecessor was also called Materialized Table [3]. (See the sketch after this list.)

3. Updating Table (From Timo)

4. Updating Materialized View (From Timo)

5. Refresh/Live Table (From Martijn)
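
To make the candidates concrete, here is how candidate 2 could read as DDL (hypothetical syntax; the FRESHNESS clause and all names are only for illustration):

    CREATE MATERIALIZED TABLE dwd_orders
      FRESHNESS = INTERVAL '3' MINUTE
    AS SELECT o.order_id, o.amount, u.region
    FROM ods_orders AS o JOIN dim_users AS u ON o.user_id = u.user_id;

The other candidates would only swap the leading keyword (REFRESH TABLE, LIVE TABLE, ...).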

As Martijn said, naming is a headache; looking forward to more valuable input from everyone.

[1] https://cloud.google.com/looker/docs/derived-tables#persistent_derived_tables
[2] https://www.ibm.com/docs/en/db2/11.5?topic=tables-materialized-query
[3] https://community.denodo.com/docs/html/browse/6.0/vdp/vql/materialized_tables/creating_materialized_tables/creating_materialized_tables

Best,
Ron

Ron liu <ron9....@gmail.com> wrote on Sunday, April 7, 2024 at 15:55:

Hi, Lorenzo

Thank you for your insightful input.

I think the two above twisted the materialized view concept into more than just an optimization for accessing pre-computed aggregates/filters. I think that concept (at least in my mind) now adheres more to the semantics of the words themselves ("materialized" and "view") than to its implementations in DBMSs: just a view on raw data that, hopefully, is constantly updated with fresh results. That's why I understand Timo's et al. objections.

Your understanding of Materialized Views is correct. However, in our scenario, an important feature is support for Update & Delete operations, which current Materialized Views cannot fulfill. As we discussed with Timo before, if a Materialized View were to support data modifications, it would require extending the syntax with new keywords, such as CREATE xxx (UPDATING) MATERIALIZED VIEW.

Still, I don't understand why we need another type of special table. Could you dive deep into the reasons for not simply adding the FRESHNESS parameter to standard tables?

Firstly, I need to emphasize that we cannot achieve the design goal of the FLIP through the CREATE TABLE syntax combined with a FRESHNESS parameter. The proposal of this FLIP is to use Dynamic Table + Continuous Query, combined with FRESHNESS, to realize stream-batch unification. However, CREATE TABLE is merely a metadata operation and cannot automatically start a background refresh job. To achieve the design goal of the FLIP with standard tables, we would need to extend the CTAS [1] syntax to introduce the FRESHNESS keyword. We considered this design initially, but it has the following problems:

1. Using the FRESHNESS keyword to distinguish whether a table created through CTAS is a standard table or a "special" standard table with an ongoing background refresh job is very obscure for users.
2. It intrudes on the semantics of the CTAS syntax. Currently, tables created using CTAS only add table metadata to the Catalog and do not record attributes such as the query. There are also no ongoing background refresh jobs, and the data writing operation happens only once, at table creation.
3. At the framework level, when performing certain ALTER TABLE operations, it is unclear how the behavior of a table created with FRESHNESS should be distinguished from that of a table created without it, which also causes confusion.

In terms of the design goal of combining Dynamic Table + Continuous Query, the FLIP proposal cannot be realized by merely extending current standard tables, so a new kind of dynamic table needs to be introduced as a first-level concept.
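
For illustration, a rough sketch of the two shapes (hypothetical syntax; the FRESHNESS clause and table names are only examples):

    -- rejected: CTAS plus a FRESHNESS keyword; whether a background refresh
    -- job exists is hidden behind one optional keyword
    CREATE TABLE dwd_orders FRESHNESS = INTERVAL '3' MINUTE
    AS SELECT * FROM ods_orders;

    -- proposed: a first-level concept whose DDL always implies a query
    -- definition plus an automatically managed refresh job
    CREATE DYNAMIC TABLE dwd_orders FRESHNESS = INTERVAL '3' MINUTE
    AS SELECT * FROM ods_orders;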

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/create/#as-select_statement

Best,
Ron

<lorenzo.affe...@ververica.com.invalid> wrote on Wednesday, April 3, 2024 at 22:25:

Hello everybody!
Thanks for the FLIP, as it looks amazing (and I think the proof is the deep discussion it is provoking :))

I have a couple of comments to add to this:

Even though I get the reason why you rejected MATERIALIZED VIEW, I still like it a lot, and I would like to provide pointers on how the materialized view concept has twisted in recent years:

• Materialize DB (https://materialize.com/)
• The famous talk by Martin Kleppmann, "Turning the database inside out" (https://www.youtube.com/watch?v=fU9hR3kiOK0)

I think the two above twisted the materialized view concept into more than just an optimization for accessing pre-computed aggregates/filters. I think that concept (at least in my mind) now adheres more to the semantics of the words themselves ("materialized" and "view") than to its implementations in DBMSs: just a view on raw data that, hopefully, is constantly updated with fresh results. That's why I understand Timo's et al. objections.
Still I understand there is no need to add confusion :)

Still, I don't understand why we need another type of special table. Could you dive deep into the reasons for not simply adding the FRESHNESS parameter to standard tables?

I would see that as a very seamless implementation with the goal of unifying batch and streaming. If we stick to a unified world, I think that Flink should provide just one type of table that is inherently dynamic. Now, depending on the FRESHNESS objectives / connectors used in WITH, that table can be backed by a stream or batch job, as you explained in your FLIP. A rough sketch of what I mean follows below.
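
Something like this (hypothetical syntax, just to illustrate a single inherently dynamic table type):

    CREATE TABLE enriched_orders
      WITH ('connector' = 'paimon')        -- assumption: some lake storage connector
      FRESHNESS = INTERVAL '1' MINUTE      -- engine picks streaming vs. batch backing
    AS SELECT ...;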

Maybe I am totally missing the point :)

Thank you in advance,
Lorenzo
On Apr 3, 2024 at 15:25 +0200, Martijn Visser <martijnvis...@apache.org> wrote:
Hi all,

Thanks for the proposal. While the FLIP talks extensively about how Snowflake has Dynamic Tables and Databricks has Delta Live Tables, my understanding is that Databricks has CREATE STREAMING TABLE [1], which relates to this proposal.

I do have concerns about using CREATE DYNAMIC TABLE, specifically about confusing users who are familiar with Snowflake's approach, where you can't change the content via DML statements, while that is something that would work in this proposal. Naming is hard of course, but I would probably prefer something like CREATE CONTINUOUS TABLE, CREATE REFRESH TABLE or CREATE LIVE TABLE.

Best regards,

Martijn

[1] https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-streaming-table.html

On Wed, Apr 3, 2024 at 5:19 AM Ron liu <ron9....@gmail.com> wrote:

Hi, dev

After offline discussion with Becket Qin, Lincoln Lee and Jark Wu, we have improved some parts of the FLIP:

1. Added a Full Refresh Mode section to clarify the semantics of full refresh mode.
2. Added a Future Improvement section explaining why the query statement does not support references to temporary views, and possible solutions.
3. The Future Improvement section explains a possible future solution for dynamic tables to support modification of query statements, to meet the common field-level schema evolution requirements of the lakehouse.
4. The Refresh section emphasizes that the Refresh command and the background refresh job can be executed in parallel, with no restrictions at the framework level.
5. Converted RefreshHandler into a plug-in interface to support various workflow schedulers.

Best,
Ron

Ron liu <ron9....@gmail.com> wrote on Tuesday, April 2, 2024 at 10:28:

Hi, Venkata krishnan

Thank you for your involvement and suggestions; we hope the design goals of this FLIP will be helpful for your business.

1. In the proposed FLIP, given the example for the dynamic table, do the data sources always come from a single lake storage such as Paimon, or does the same proposal solve for 2 disparate storage systems like Kafka and Iceberg, where Kafka events are ETLed to Iceberg similar to Paimon? Basically the lambda architecture that is mentioned in the FLIP as well. I'm wondering if it is possible to switch b/w sources based on the execution mode, e.g.: if it is a backfill operation, switch to a data lake storage system like Iceberg, otherwise an event streaming system like Kafka.

Dynamic table is a design abstraction at the framework level and is not tied to the physical implementation of the connector. If a connector supports a combination of Kafka and lake storage, this works fine.

2. What happens in the context of a bootstrap (batch) + nearline update (streaming) case for stateful applications? What I mean by that is, will the state from the batch application be transferred to the nearline application after the bootstrap execution is complete?

I think this is an orthogonal concern, something that FLIP-327 [1] tries to address; it is not directly related to Dynamic Table.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-327%3A+Support+switching+from+batch+to+stream+mode+to+improve+throughput+when+processing+backlog+data

Best,
Ron

Venkatakrishnan Sowrirajan <vsowr...@asu.edu> wrote on Saturday, March 30, 2024 at 07:06:

Ron and Lincoln,

Great proposal and interesting discussion on adding support for dynamic tables within Flink.

At LinkedIn, we are also trying to solve compute/storage convergence for similar problems discussed as part of this FLIP, specifically periodic backfill and bootstrap + nearline update use cases, using a single implementation of the business logic (single script).

Few clarifying questions:

1. In the proposed FLIP, given the example for the dynamic table, do the data sources always come from a single lake storage such as Paimon, or does the same proposal solve for 2 disparate storage systems like Kafka and Iceberg, where Kafka events are ETLed to Iceberg similar to Paimon? Basically the lambda architecture that is mentioned in the FLIP as well. I'm wondering if it is possible to switch b/w sources based on the execution mode, e.g.: if it is a backfill operation, switch to a data lake storage system like Iceberg, otherwise an event streaming system like Kafka.
2. What happens in the context of a bootstrap (batch) + nearline update (streaming) case for stateful applications? What I mean by that is, will the state from the batch application be transferred to the nearline application after the bootstrap execution is complete?

Regards
Venkata krishnan


On Mon, Mar 25, 2024 at 8:03 PM Ron liu <ron9....@gmail.com> wrote:

Hi, Timo

Thanks for your quick response and your suggestions.

Yes, this discussion has turned into confirming whether it's a special table or a special MV.

1. The key problem with MVs is that they don't support modification, so I prefer it to be a special table. Although the periodic refresh behavior is more characteristic of an MV, since we are already a special table, supporting periodic refresh behavior is quite natural, similar to Snowflake dynamic tables.

2. Regarding the keyword UPDATING: since the current Regular Table is a Dynamic Table, which implies support for updating through Continuous Query, I think it is redundant to add the keyword UPDATING. In addition, UPDATING cannot reflect the Continuous Query part and cannot express the purpose we want: simplifying the data pipeline through Dynamic Table + Continuous Query.

3. From the perspective of the SQL standard definition, I can understand your concerns about Derived Table, but is it possible to make a slight adjustment to meet our needs? Additionally, as Lincoln mentioned, the Google Looker platform has introduced Persistent Derived Table, and there are precedents in the industry; could Derived Table be a candidate?

Of course, I look forward to your better suggestions.

Best,
Ron



Timo Walther <twal...@apache.org> wrote on Monday, March 25, 2024 at 18:49:

After thinking about this more, this discussion boils down to whether this is a special table or a special materialized view. In both cases, we would need to add a special keyword:

Either

CREATE UPDATING TABLE

or

CREATE UPDATING MATERIALIZED VIEW

I still feel that the periodic refreshing behavior is closer to an MV. If we add a special keyword to MV, the optimizer would know that the data cannot be used for query optimizations.

I will ask more people for their opinion.

Regards,
Timo


On 25.03.24 10:45, Timo Walther wrote:
Hi Ron and Lincoln,

thanks for the quick response and the very insightful discussion.

we might limit future opportunities to optimize queries through automatic materialization rewriting by allowing data modifications, thus losing the potential for such optimizations.

This argument makes a lot of sense to me. Due to the updates, the system is not in full control of the persisted data. However, the system is still in full control of the job that powers the refresh. So if the system manages all updating pipelines, it could still leverage automatic materialization rewriting, but without leveraging the data at rest (only the data in flight).

we are considering another candidate, Derived Table, the term 'derive' suggests a query, and 'table' retains modifiability. This approach would not disrupt our current concept of a dynamic table

I did some research on this term. The SQL standard uses the term "derived table" extensively (defined in section 4.17.3). Thus, a lot of vendors adopt this for simply referring to a table within a subclause:

https://dev.mysql.com/doc/refman/8.0/en/derived-tables.html
https://infocenter.sybase.com/help/topic/com.sybase.infocenter.dc32300.1600/doc/html/san1390612291252.html
https://www.c-sharpcorner.com/article/derived-tables-vs-common-table-expressions/
https://stackoverflow.com/questions/26529804/what-are-the-derived-tables-in-my-explain-statement
https://www.sqlservercentral.com/articles/sql-derived-tables

Esp. the latter example is interesting: SQL Server allows things like this on derived tables:

UPDATE T SET Name='Timo' FROM (SELECT * FROM Product) AS T

SELECT * FROM Product;

Btw, Snowflake's dynamic table docs also state:

Because the content of a dynamic table is fully determined by the given query, the content cannot be changed by using DML. You don't insert, update, or delete the rows in a dynamic table.

So a new term makes a lot of sense.

How about using `UPDATING`?

CREATE UPDATING TABLE

This reflects that modifications can be made, and from an English-language perspective you can PAUSE or RESUME the UPDATING. Thus, a user can define an UPDATING interval and mode?
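
A minimal sketch of how that could look (hypothetical syntax; the interval clause and PAUSE/RESUME commands are assumptions):

    CREATE UPDATING TABLE enriched_orders
      FRESHNESS = INTERVAL '5' MINUTE  -- hypothetical: the UPDATING interval
    AS SELECT ...;

    ALTER TABLE enriched_orders PAUSE UPDATING;   -- hypothetical
    ALTER TABLE enriched_orders RESUME UPDATING;  -- hypothetical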

Looking forward to your thoughts.

Regards,
Timo


On 25.03.24 07:09, Ron liu wrote:
Hi, Ahmed

Thanks for your feedback.

Regarding your question:

I want to iterate on Timo's comments regarding the confusion between "Dynamic Table" and the current Flink "Table". Should the refactoring of the system happen in 2.0? Should we rename it in this FLIP (as per the suggestions in the thread) and address the holistic changes in a separate FLIP for 2.0?

Lincoln proposed a new concept in reply to Timo: Derived Table, which is a combination of Dynamic Table + Continuous Query, and the use of Derived Table will not conflict with existing concepts. What do you think?

I feel confused about how it integrates further with other components; the examples provided feel like a standalone ETL job. Could you provide in the FLIP an example where the table is further used in subsequent queries (especially in batch mode)?

Thanks for your suggestion. I have added how to use Dynamic Table in the FLIP user story section; a Dynamic Table can be referenced by downstream Dynamic Tables and can also support OLAP queries.

Best,
Ron

Ron liu <ron9....@gmail.com> wrote on Saturday, March 23, 2024 at 10:35:

Hi, Feng

Thanks for your feedback.

Although currently we restrict users from modifying the query, I wonder if we can provide a better way to help users rebuild it without affecting downstream OLAP queries.

Considering the problem of data consistency, in the first step we are strictly limited in semantics and do not support modifying the query. This is really a good problem; one of my ideas is to introduce a syntax similar to SWAP [1], which supports exchanging two Dynamic Tables, as sketched below.
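
A rough sketch of the idea (hypothetical syntax, modeled after Snowflake's ALTER TABLE ... SWAP WITH):

    -- build the new definition side by side, then atomically exchange the two
    ALTER DYNAMIC TABLE dwd_orders_v2 SWAP WITH dwd_orders;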

From the documentation, the definition SQL and job information are stored in the Catalog. Does this mean that if a system needs to adapt to Dynamic Tables, it also needs to store Flink's job information in the corresponding system? For example, does MySQL's Catalog need to store Flink job information as well?

Yes, currently we need to rely on the Catalog to store refresh job information.

Users still need to consider how much memory is being used, how large the concurrency is, which type of state backend is being used, and may need to set TTL expiration.

Similar to the current practice, job parameters can be set via the Flink conf or SET commands.
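
For example, with the SQL client's SET command (these are existing Flink options; which keys a user tunes will vary by job and Flink version):

    SET 'parallelism.default' = '4';
    SET 'table.exec.state.ttl' = '1 d';
    SET 'state.backend' = 'rocksdb';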

When we submit a refresh command, can we help users detect if there are any running jobs and automatically stop them before executing the refresh command? Then wait for it to complete before restarting the background streaming job?

Purely from a technical implementation point of view, your proposal is doable, but it would be more costly. Also, I think data consistency itself is the responsibility of the user, similar to how a Regular Table is now also the responsibility of the user, so it's consistent with its behavior and no additional guarantees are made at the engine level.

Best,
Ron


Ahmed Hamdy <hamdy10...@gmail.com> wrote on Friday, March 22, 2024 at 23:50:

Hi Ron,
Sorry for joining the discussion late; thanks for the effort.

I think the base idea is great; however, I have a couple of comments:
- I want to iterate on Timo's comments regarding the confusion between "Dynamic Table" and the current Flink "Table". Should the refactoring of the system happen in 2.0? Should we rename it in this FLIP (as per the suggestions in the thread) and address the holistic changes in a separate FLIP for 2.0?
- I feel confused about how it integrates further with other components; the examples provided feel like a standalone ETL job. Could you provide in the FLIP an example where the table is further used in subsequent queries (especially in batch mode)?
- I really like the principle of keeping the unified batch and streaming approach.
Best Regards
Ahmed Hamdy


On Fri, 22 Mar 2024 at 12:07, Lincoln Lee <lincoln.8...@gmail.com> wrote:

Hi Timo,

Thanks for your thoughtful inputs!

Yes, expanding the MATERIALIZED VIEW (MV) could achieve the same function, but our primary concern is that by using a view, we might limit future opportunities to optimize queries through automatic materialization rewriting [1], leveraging the support for MV by physical storage. This is because we would be breaking the intuitive semantics of a materialized view (a materialized view represents the result of a query) by allowing data modifications, thus losing the potential for such optimizations.

With these considerations in mind, we were inspired by Google Looker's Persistent Derived Table [2]. PDT is designed for building Looker's automated modeling, aligning with our purpose for the stream-batch automatic pipeline. Therefore, we are considering another candidate, Derived Table: the term 'derive' suggests a query, and 'table' retains modifiability. This approach would not disrupt our current concept of a dynamic table, preserving the future utility of MVs.

Conceptually, a Derived Table is a Dynamic Table + Continuous Query. By introducing the new concept Derived Table for this FLIP, all concepts play together nicely.

What do you think about this?

[1] https://calcite.apache.org/docs/materialized_views.html
[2] https://cloud.google.com/looker/docs/derived-tables#persistent_derived_tables


Best,
Lincoln Lee


Timo Walther <twal...@apache.org> wrote on Friday, March 22, 2024 at 17:54:

Hi Ron,

thanks for the detailed answer. Sorry for my late reply, we had a conference that kept me busy.

In the current concept[1], it actually includes: Dynamic Tables & Continuous Query. Dynamic Table is just an abstract logical concept

This explanation makes sense to me. But the docs also say "A continuous query is evaluated on the dynamic table yielding a new dynamic table.". So even our regular CREATE TABLEs are considered dynamic tables. This can also be seen in the diagram "Dynamic Table -> Continuous Query -> Dynamic Table". Currently, Flink queries can only be executed on Dynamic Tables.

In essence, a materialized view represents the result of a query.

Isn't that what your proposal does as well?

the object of the suspend operation is the refresh task of the dynamic table

I understand that Snowflake uses the term [1] to merge their concepts of STREAM, TASK, and TABLE into one concept. But Flink has no concept of a "refresh task". Also, they already introduced MATERIALIZED VIEW. Flink is in the convenient position that the concept of materialized views is not taken (reserved maybe for exactly this use case?). And the SQL standard concept could be "slightly adapted" to our needs. Looking at other vendors like Postgres[2], they also use `REFRESH` commands, so why not add additional commands such as DELETE or UPDATE? Oracle supports an "ON PREBUILT TABLE clause [that] tells the database to use an existing table segment"[3], which comes closer to what we want as well.
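
For reference, the Postgres command looks like this:

    REFRESH MATERIALIZED VIEW enriched_orders;

and the extension I have in mind would simply allow DML on top (hypothetical, not valid in Postgres or Flink today):

    DELETE FROM enriched_orders WHERE user_id = 42;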

it is not intended to support data modification

This is an argument that I understand. But we as Flink could allow data modifications. This way we are only extending the standard and don't introduce new concepts.

If we can't agree on using the MATERIALIZED VIEW concept, we should fix our syntax in a Flink 2.0 effort: making regular tables bounded and dynamic tables unbounded. We would be closer to the SQL standard with this and pave the way for the future. I would actually support this if all concepts play together nicely.

In the future, we can consider extending the statement set syntax to support the creation of multiple dynamic tables.

It's good that we called the concept STATEMENT SET. This allows us to define CREATE TABLE within, even if it might look a bit confusing.

Regards,
Timo

[1] https://docs.snowflake.com/en/user-guide/dynamic-tables-about
[2] https://www.postgresql.org/docs/current/sql-creatematerializedview.html
[3] https://oracle-base.com/articles/misc/materialized-views

On 21.03.24 04:14, Feng Jin wrote:
Hi Ron and Lincoln

Thanks for driving this discussion. I believe it will greatly improve the convenience of managing users' real-time pipelines.

I have some questions.

*Regarding Limitations of Dynamic Table:*

Does not support modifying the select statement after the dynamic table is created.

Although currently we restrict users from modifying the query, I wonder if we can provide a better way to help users rebuild it without affecting downstream OLAP queries.

*Regarding the management of background jobs:*

1. From the documentation, the definition SQL and job information are stored in the Catalog. Does this mean that if a system needs to adapt to Dynamic Tables, it also needs to store Flink's job information in the corresponding system? For example, does MySQL's Catalog need to store Flink job information as well?

2. Users still need to consider how much memory is being used, how large the concurrency is, which type of state backend is being used, and may need to set TTL expiration.

*Regarding the Refresh Part:*

If the refresh mode is continuous and a background job is running, caution should be taken with the refresh command as it can lead to inconsistent data.

When we submit a refresh command, can we help users detect if there are any running jobs and automatically stop them before executing the refresh command? Then wait for it to complete before restarting the background streaming job?

Best,
Feng

On Tue, Mar 19, 2024 at 9:40 PM Lincoln Lee <lincoln.8...@gmail.com> wrote:

Hi Yun,

Thank you very much for your valuable input!

Incremental mode is indeed an attractive idea; we have also discussed this, but in the current design we first provide two refresh modes: CONTINUOUS and FULL. Incremental mode can be introduced once the execution layer has the capability.
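
For illustration, how freshness could steer the refresh mode (a sketch based on this FLIP's direction; the exact thresholds and syntax are assumptions):

    -- short freshness => CONTINUOUS mode: a streaming refresh job
    CREATE DYNAMIC TABLE dwd_orders FRESHNESS = INTERVAL '1' MINUTE AS SELECT ...;

    -- long freshness => FULL mode: a periodically scheduled batch refresh
    CREATE DYNAMIC TABLE dws_report FRESHNESS = INTERVAL '1' HOUR AS SELECT ...;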

My answers for the two questions:

1. Yes, cascading is a good question. The current proposal provides a freshness that defines a dynamic table relative to the base table's lag. If users need to consider the end-to-end freshness of multiple cascaded dynamic tables, they can manually split them for now. Of course, how to let multiple cascaded or dependent dynamic tables complete the freshness definition in a simpler way can, I think, be extended in the future.

2. Cascading refresh is also a part we focused on discussing. In this FLIP, we hope to focus as much as possible on the core features (as it already involves a lot of things), so we did not directly introduce related syntax. However, based on the current design, combined with the catalog and lineage, users could theoretically also finish the cascading refresh.


Best,
Lincoln Lee


Yun Tang <myas...@live.com> wrote on Tuesday, March 19, 2024 at 13:45:

Hi Lincoln,

Thanks for driving this discussion; I am so excited to see this topic being discussed in the Flink community!

From my point of view, beyond the work of unifying streaming and batch in the DataStream API [1], this FLIP could actually make users benefit from one engine to rule batch & streaming.

If we treat this FLIP as an open-source implementation of Snowflake's dynamic tables [2], we still lack an incremental refresh mode to make the ETL near real-time with a much cheaper computation cost. However, I think this could be done under the current design by introducing another refresh mode in the future, although the extra work of incremental view maintenance would be much larger.

For the FLIP itself, I have several questions below:

1. It seems this FLIP does not consider the lag of refreshes across ETL layers from ODS ---> DWD ---> APP [3]. We currently only consider the scheduler interval, which means we cannot use lag to automatically schedule the upfront micro-batch jobs to do the work.
2. To support the automagical refreshes, we should consider the lineage in the catalog or somewhere else.


[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-134%3A+Batch+execution+for+the+DataStream+API
[2] https://docs.snowflake.com/en/user-guide/dynamic-tables-about
[3] https://docs.snowflake.com/en/user-guide/dynamic-tables-refresh

Best
Yun Tang



________________________________
From: Lincoln Lee <lincoln.8...@gmail.com>
Sent: Thursday, March 14, 2024 14:35
To: dev@flink.apache.org <dev@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines

Hi Jing,

Thanks for your attention to this FLIP! I'll try to answer the following questions.

1. How to define the query of a dynamic table? Use Flink SQL or introduce new syntax? If Flink SQL, how to handle the difference in SQL between streaming and batch processing? For example, a query including a window aggregate based on processing time, or a query including a global order by?

Similar to `CREATE TABLE AS query`, here the `query` also uses Flink SQL and doesn't introduce a totally new syntax. We will not change the status with respect to the difference in functionality of Flink SQL itself on streaming and batch; for example, the proctime window agg on streaming and the global sort on batch that you mentioned do not, in fact, work properly in the other mode, so when the user modifies the refresh mode of a dynamic table to one that is not supported, we will throw an exception.

2. Is modifying the query of a dynamic table allowed? Or could we only refresh a dynamic table based on the initial query?

Yes, in the current design, the query definition of the dynamic table is not allowed to be modified, and you can only refresh the data based on the initial definition.

3. How to use a dynamic table? The dynamic table seems to be similar to a materialized view. Will we do something like materialized view rewriting during optimization?

It's true that dynamic table and materialized view are similar in some ways, but as Ron explains there are differences. In terms of optimization, automated materialization discovery similar to that supported by Calcite is also a potential possibility, perhaps with the addition of automated rewriting in the future.



Best,
Lincoln Lee


Ron liu <ron9....@gmail.com> wrote on Thursday, March 14, 2024 at 14:01:

Hi, Timo

Sorry for the late response; thanks for your feedback. Regarding your questions:

Flink has introduced the concept of Dynamic Tables many years ago. How does the term "Dynamic Table" fit into Flink's regular tables and also how does it relate to Table API?

I fear that adding the DYNAMIC TABLE keyword could cause confusion for users, because a term for regular CREATE TABLE (that can be "kind of dynamic" as well and is backed by a changelog) is then missing. Also given that we call our connectors for those tables DynamicTableSource and DynamicTableSink.

In general, I find it contradicting that a TABLE can be "paused" or "resumed". From an English language perspective, this does sound incorrect. In my opinion (without much research yet), a continuous updating trigger should rather be modelled as a CREATE MATERIALIZED VIEW (which users are familiar with?) or a new concept such as a CREATE TASK (that can be paused and resumed?).

1. In the current concept [1], it actually includes: Dynamic Tables & Continuous Query. Dynamic Table is just an abstract logical concept, which in its physical form represents either a table or a changelog stream. It requires the combination with Continuous Query to achieve dynamic updates of the target table, similar to a database's Materialized View. We hope to upgrade the Dynamic Table to a real entity that users can operate, which combines the logical concepts of Dynamic Tables + Continuous Query. By integrating the definition of tables and queries, it can achieve functions similar to Materialized Views, simplifying users' data processing pipelines. So, the object of the suspend operation is the refresh task of the dynamic table. The command `ALTER DYNAMIC TABLE table_name SUSPEND` is actually a shorthand for `ALTER DYNAMIC TABLE table_name SUSPEND REFRESH` (if written in full for clarity, we can also modify it).

2. Initially, we also considered Materialized Views, but ultimately decided against them. Materialized views are designed to enhance query performance for workloads that consist of common, repetitive query patterns. In essence, a materialized view represents the result of a query. However, it is not intended to support data modification. For Lakehouse scenarios, where the ability to delete or update data is crucial (such as compliance with GDPR, FLIP-2), materialized views fall short.

3. Compared to CREATE (regular) TABLE, CREATE DYNAMIC TABLE not only defines metadata in the catalog but also automatically initiates a data refresh task based on the query specified during table creation, and it dynamically executes data updates. Users can focus on data dependencies and data generation logic.

4. The new dynamic table does not conflict with the existing DynamicTableSource and DynamicTableSink interfaces. For the developer, all that needs to be implemented is the new CatalogDynamicTable, without changing the implementation of source and sink.

5. For now, the FLIP does not consider supporting Table API operations on Dynamic Table. However, once the SQL syntax is finalized, we can discuss this in a separate FLIP. Currently, I have a rough idea: the Table API should also introduce DynamicTable operation interfaces corresponding to the existing Table interfaces, and the TableEnvironment will provide relevant methods to support various dynamic table operations. The goal for the new Dynamic Table is to offer users an experience similar to using a database, which is why we prioritize SQL-based approaches initially.

How do you envision re-adding the functionality of a statement set that fans out to multiple tables? This is a very important use case for data pipelines.

Multi-tables is indeed a very important user scenario. In the future, we can consider extending the statement set syntax to support the creation of multiple dynamic tables.
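
Something like the following could work (purely hypothetical syntax, extending today's EXECUTE STATEMENT SET):

    EXECUTE STATEMENT SET
    BEGIN
      CREATE DYNAMIC TABLE dwd_orders FRESHNESS = INTERVAL '1' MINUTE AS SELECT ...;
      CREATE DYNAMIC TABLE dwd_users  FRESHNESS = INTERVAL '1' MINUTE AS SELECT ...;
    END;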


Since the early days of Flink SQL, we were discussing `SELECT STREAM * FROM T EMIT 5 MINUTES`. Your proposal seems to rephrase STREAM and EMIT into other keywords, DYNAMIC TABLE and FRESHNESS. But the core functionality is still there. I'm wondering if we should widen the scope (maybe not part of this FLIP but a new FLIP) to follow the standard more closely: making `SELECT * FROM t` bounded by default and using new syntax for the dynamic behavior. Flink 2.0 would be the perfect time for this; however, it would require careful discussions. What do you think?

The query part indeed requires a separate FLIP for discussion, as it involves changes to the default behavior.

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/concepts/dynamic_tables


Best,

Ron


Jing Zhang <beyond1...@gmail.com> wrote on Wednesday, March 13, 2024 at 15:19:

Hi, Lincoln & Ron,

Thanks for the proposal.

I agree with the question raised by Timo.

Besides, I have some other questions.
1. How to define the query of a dynamic table? Use Flink SQL or introduce new syntax? If Flink SQL, how to handle the difference in SQL between streaming and batch processing? For example, a query including a window aggregate based on processing time, or a query including a global order by?

2. Is modifying the query of a dynamic table allowed? Or could we only refresh a dynamic table based on the initial query?

3. How to use a dynamic table? The dynamic table seems to be similar to a materialized view. Will we do something like materialized view rewriting during optimization?

Best,
Jing Zhang


Timo Walther <twal...@apache.org> wrote on Wednesday, March 13, 2024 at 01:24:

Hi Lincoln & Ron,

thanks for proposing this FLIP. I think a design similar to what you propose has been in the heads of many people; however, I'm wondering how this will fit into the bigger picture.

I haven't deeply reviewed the FLIP yet, but would like to ask some initial questions:

Flink has introduced the concept of Dynamic Tables many years ago. How does the term "Dynamic Table" fit into Flink's regular tables and also how does it relate to Table API?

I fear that adding the DYNAMIC TABLE keyword could cause confusion for users, because a term for regular CREATE TABLE (that can be "kind of dynamic" as well and is backed by a changelog) is then missing. Also given that we call our connectors for those tables DynamicTableSource and DynamicTableSink.

In general, I find it contradicting that a TABLE can be "paused" or "resumed". From an English language perspective, this does sound incorrect. In my opinion (without much research yet), a continuous updating trigger should rather be modelled as a CREATE MATERIALIZED VIEW (which users are familiar with?) or a new concept such as a CREATE TASK (that can be paused and resumed?).

How do you envision re-adding the functionality of a statement set that fans out to multiple tables? This is a very important use case for data pipelines.

Since the early days of Flink SQL, we were discussing `SELECT STREAM * FROM T EMIT 5 MINUTES`. Your proposal seems to rephrase STREAM and EMIT into other keywords, DYNAMIC TABLE and FRESHNESS. But the core functionality is still there. I'm wondering if we should widen the scope (maybe not part of this FLIP but a new FLIP) to follow the standard more closely: making `SELECT * FROM t` bounded by default and using new syntax for the dynamic behavior. Flink 2.0 would be the perfect time for this; however, it would require careful discussions. What do you think?

Regards,
Timo


On 11.03.24 08:23, Ron liu wrote:
Hi, Dev


Lincoln Lee and I would like to start a discussion about FLIP-435: Introduce a New Dynamic Table for Simplifying Data Pipelines.

This FLIP is designed to simplify the development of data processing pipelines. With Dynamic Tables with uniform SQL statements and freshness, users can define batch and streaming transformations to data in the same way, accelerate ETL pipeline development, and manage task scheduling automatically.

For more details, see FLIP-435 [1]. Looking forward to your feedback.

[1]

Best,

Lincoln & Ron
























